# Final Project Data Visualization
## Introduction
### Used Data
The data for this project comes from Kaggle (https://www.kaggle.com/datasets/bhavikjikadara/student-study-performance).
This dataset consists of the marks secured by students in various subjects.

- **Gender**: Sex of students (Male/Female)
- **Race/Ethnicity**: Ethnicity of students (Group A, B, C, D, E)
- **Parental Level of Education**: Parents' final education (Bachelor's degree, some college, master's degree, associate's degree, high school)
- **Lunch**: Having lunch before the test (standard or free/reduced)
- **Test Preparation Course**: Complete or not complete before the test
- **Math Score**
- **Reading Score**
- **Writing Score**

Here is a preview of the data:

In [1]:
# Import our data processing library
import pandas as pd
data = pd.read_csv("study_performance.csv")
data.head()

| Unnamed: 0 | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


## Goals and Tasks
The goal of this analysis is to explore the **Student Study Performance** dataset and uncover patterns and relationships within the test scores and gender of the students. By doing so, we aim to gain insights into how different abilities (math, writing, and reading) are related and whether gender plays a significant role in the results.

### Means
We will conduct the following tasks:
1. **Data Exploration**: Investigate the dataset's structure, variables, and distributions.
2. **Correlation Analysis**: Examine correlations between test scores (math, writing, and reading).
3. **Gender Comparison**: Compare test scores between male and female students.
4. **Visualization**: Create visual representations (scatterplots, histograms) to highlight patterns and trends.
5. **Statistical Measures**: Calculate summary statistics (e.g., mean, median) to quantify differences.
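
As a rough illustration of the statistical-measures task, the sketch below computes the mean and median of each score per gender with pandas. It is a minimal sketch, not the analysis itself: the `preview` frame is an assumption that holds only the five preview rows from the introduction so that the block runs on its own, and `score_columns` and `summary` are illustrative names. In the notebook, `data` would take the place of `preview`.

```python
import pandas as pd

# Assumption: a stand-in for `data` built from the five preview rows only.
preview = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

score_columns = ["math_score", "reading_score", "writing_score"]

# Mean and median of every score column, computed separately per gender.
summary = preview.groupby("gender")[score_columns].agg(["mean", "median"])
print(summary)
```

The result has one row per gender and a two-level column index (score, statistic), so a single value can be read with, for example, `summary.loc["female", ("math_score", "mean")]`.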

### Characteristics
We seek to learn:
- How test scores (math, writing, reading) are distributed.
- Whether there are significant differences in scores based on gender.
- Whether students tend to excel in multiple subjects or specialize in one area.

### Target Data
The dataset contains information on students' test scores, gender, and other relevant attributes.

### Workflow
1. **Data Loading**: Load the dataset.
2. **Exploratory Data Analysis (EDA)**:
    - Explore the distribution of test scores.
    - Investigate gender distribution.
3. **Gender Comparison**:
    - Compare test scores by gender.
4. **Visualization**:
    - Create scatterplots to visualize relationships.
    - Overlay histograms to compare score distributions.
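
The exploratory step of this workflow could look roughly like the following sketch. It assumes a small stand-in frame, `preview`, that contains only the five preview rows from the introduction (the real notebook would use `data`); the variable names `score_stats` and `gender_counts` are illustrative.

```python
import pandas as pd

# Assumption: a stand-in for `data` built from the five preview rows only.
preview = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# Structure: number of rows and columns, and the type of each column.
print(preview.shape)
print(preview.dtypes)

# Distribution of the test scores (count, mean, std, quartiles, min, max).
score_stats = preview[["math_score", "reading_score", "writing_score"]].describe()
print(score_stats)

# Gender distribution: how many students fall into each group.
gender_counts = preview["gender"].value_counts()
print(gender_counts)
```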


### Roles
I will carry out the analysis myself, as a student interested in understanding student performance and identifying potential gender-related patterns.

## Low-fidelity prototype
To explore this, I initially created the following low-fidelity prototype: two scatterplots of Math Score, Writing Score, and Reading Score. In the first, the Writing and Reading scores are encoded by position along the X and Y axes, respectively, and the Math score by a color gradient. In the second, Writing Score and Math Score are encoded by position, and the gender of the participants by distinct colors (classification). This provided a first impression of the data and of aspects worth a closer look. I also made both scatterplots interactive, which lets me analyze the data manually in more detail.

In [3]:
import altair as alt
c1 = alt.Chart(data).mark_circle().encode(x="writing_score", y="reading_score", color="math_score").interactive()
c2 = alt.Chart(data).mark_circle().encode(x="writing_score", y="math_score", color="gender").interactive()
c1|c2

These two plots suggest the following:
- Students are usually proficient in more than one subject: those who are good at writing and reading also tend to have higher math scores.
- Neither girls nor boys are fundamentally worse or better in mathematics, reading, or writing.
- However, within a given performance range there are clear differences by gender: among students with similarly high (or similarly low) scores in all areas, girls performed better in reading and writing, while boys performed better in mathematics.
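
The first observation could also be checked numerically with pairwise Pearson correlations between the three scores. The sketch below is only an illustration under stated assumptions: `preview` holds just the five preview rows from the introduction, so the resulting coefficients say nothing reliable about the full dataset, where `data` would be used instead.

```python
import pandas as pd

# Assumption: a stand-in for `data` built from the five preview rows only.
preview = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = preview.corr()
print(corr.round(2))
```

Values close to 1 between two scores would support the impression that students who do well in one subject tend to do well in the other.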

To check this with statistical measures, I created the following histograms of the respective scores. In each plot, I overlaid two histograms, one per gender, and differentiated them by color. I chose colors that are as neutral as possible, and therefore unbiased, yet still clearly distinguishable. I also plot the mean of each group to show how large the differences are.

In [6]:
data_male = data[data['gender'] == 'male']
data_female = data[data['gender'] == 'female']

female_color = '#4C72B0'  # Blue
male_color = '#DD8452'  # Orange
female_rule_color = '#3182bd'  # Darker blue
male_rule_color = '#e6550d'  # Darker orange

# Opacities
bar_opacity = 0.7
rule_opacity = 0.9

# Math Chart
bar_male_math = alt.Chart(data_male).mark_bar().encode(
    x=alt.X('math_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(male_color),
    opacity=alt.value(bar_opacity)
)

rule_male_math = alt.Chart(data_male).mark_rule().encode(
    x='mean(math_score):Q',
    size=alt.value(5),
    color=alt.value(male_rule_color),
    opacity=alt.value(rule_opacity)
)

male_c_math = bar_male_math + rule_male_math

bar_female_math = alt.Chart(data_female).mark_bar().encode(
    x=alt.X('math_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(female_color),
    opacity=alt.value(bar_opacity)
)

rule_female_math = alt.Chart(data_female).mark_rule().encode(
    x='mean(math_score):Q',
    size=alt.value(5),
    color=alt.value(female_rule_color),
    opacity=alt.value(rule_opacity)
)

female_c_math = bar_female_math + rule_female_math

math_c = (male_c_math + female_c_math).properties(height=250, width=250,title='Math Scores')

# Writing Chart
bar_male_writing = alt.Chart(data_male).mark_bar().encode(
    x=alt.X('writing_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(male_color),
    opacity=alt.value(bar_opacity)
)

rule_male_writing = alt.Chart(data_male).mark_rule().encode(
    x='mean(writing_score):Q',
    size=alt.value(5),
    color=alt.value(male_rule_color),
    opacity=alt.value(rule_opacity)
)

male_c_writing = bar_male_writing + rule_male_writing

bar_female_writing = alt.Chart(data_female).mark_bar().encode(
    x=alt.X('writing_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(female_color),
    opacity=alt.value(bar_opacity)
)

rule_female_writing = alt.Chart(data_female).mark_rule().encode(
    x='mean(writing_score):Q',
    size=alt.value(5),
    color=alt.value(female_rule_color),
    opacity=alt.value(rule_opacity)
)

female_c_writing = bar_female_writing + rule_female_writing

writing_c = (male_c_writing + female_c_writing).properties(height=250, width=250,title='Writing Scores')

# Reading Chart
bar_male_reading = alt.Chart(data_male).mark_bar().encode(
    x=alt.X('reading_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(male_color),
    opacity=alt.value(bar_opacity)
)

rule_male_reading = alt.Chart(data_male).mark_rule().encode(
    x='mean(reading_score):Q',
    size=alt.value(5),
    color=alt.value(male_rule_color),
    opacity=alt.value(rule_opacity)
)

male_c_reading = bar_male_reading + rule_male_reading

bar_female_reading = alt.Chart(data_female).mark_bar().encode(
    x=alt.X('reading_score:Q', bin=True, title='Score'),
    y=alt.Y('count()', title='Records'),
    color=alt.value(female_color),
    opacity=alt.value(bar_opacity)
)

rule_female_reading = alt.Chart(data_female).mark_rule().encode(
    x='mean(reading_score):Q',
    size=alt.value(5),
    color=alt.value(female_rule_color),
    opacity=alt.value(rule_opacity)
)

female_c_reading = bar_female_reading + rule_female_reading

reading_c = (male_c_reading + female_c_reading).properties(height=250, width=250,title='Reading Scores')

# Display all three charts side by side
final_chart = alt.hconcat(math_c, writing_c, reading_c).resolve_scale(color='independent')


# Circle Legend
legend_male = alt.Chart({'values': [{'gender': 'male'}, {'gender': 'female'}]}).mark_circle(color=male_color, size=100).encode(
    y=alt.value(0),
    x=alt.value(50)
)

text_male = alt.Chart({'values': [{'gender': 'male'}, {'gender': 'female'}]}).mark_text(dy=-10, dx=15, color='black').encode(
    y=alt.value(10),
    x=alt.value(100),
    text=alt.value('Male')
)

legend_female = alt.Chart({'values': [{'gender': 'male'}, {'gender': 'female'}]}).mark_circle(color=female_color, size=100).encode(
    y=alt.value(0),
    x=alt.value(200)
)

text_female = alt.Chart({'values': [{'gender': 'male'}, {'gender': 'female'}]}).mark_text(dy=-10, dx=15, color='black').encode(
    y=alt.value(10),
    x=alt.value(250),
    text=alt.value('Female')
)

legend = (legend_male + text_male + legend_female + text_female)

final_chart_with_legend = alt.vconcat(legend, final_chart)

final_chart_with_legend


The histograms indicate that women performed better in reading and writing, and men in mathematics. However, the differences in the respective means are rather small and could be due to other influences.
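
One way to put a number on "rather small" is to compute the gap between the group means and relate it to the spread of the scores. The sketch below is an illustration only, not the method used in this project: `mean_gap` and `standardized_gap` are hypothetical helper names, dividing by the overall standard deviation is just one possible choice of scale, and `preview` contains only the five preview rows from the introduction. Those five rows are far too few to reflect the real gaps; they merely make the sketch runnable.

```python
import pandas as pd

# Assumption: a stand-in for `data` built from the five preview rows only.
preview = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})


def mean_gap(df, column):
    """Female mean minus male mean for one score column, in points."""
    means = df.groupby("gender")[column].mean()
    return means["female"] - means["male"]


def standardized_gap(df, column):
    """Mean gap divided by the overall sample standard deviation of the column."""
    return mean_gap(df, column) / df[column].std()


for column in ["math_score", "reading_score", "writing_score"]:
    gap = mean_gap(preview, column)
    scaled = standardized_gap(preview, column)
    print(f"{column}: gap = {gap:.2f} points, scaled gap = {scaled:.2f}")
```

Applied to the full `data` frame, a scaled gap well below 1 would mean the difference between the group means is small compared with the variation among individual students.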

Nevertheless, I decided that the histograms will be part of the final product that I present to my test subjects. To better investigate whether gender has more influence within a certain performance range (generally stronger or weaker students), I chose to provide a scatterplot matrix (SPLOM) alongside the histograms. In this diagram, the mathematics, reading, and writing results are encoded by position, while gender is encoded by color. This lets the user quickly get an overview of the relationships between the scores and supports the exploration goal as effectively as possible. To let users view the plots in more detail, I made them interactive and show the respective scores on mouseover.

In [7]:
selection = alt.selection(type='multi', fields=['gender'], bind='legend')

alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color="gender",
    tooltip=["gender","writing_score","reading_score","math_score"],
    opacity=alt.condition(selection,alt.value(1),alt.value(.2))
).properties(
    width=125,
    height=125
).repeat(
    row=["writing_score", "reading_score","math_score"],
    column=["writing_score", "reading_score","math_score"]
).interactive().add_selection(selection)

## Evaluation plan
### Target Question
The core question I want to answer through the visualization is: **How do different abilities (math, writing, reading) relate to each other, and does gender play a significant role in the results?**

### Participants
To answer this question, I will recruit the following individuals:

1. **My Mother (58 years old, Chef)**:
    - Her perspective will provide insights from a non-technical background.
    - She can evaluate the visualization's clarity and ease of understanding.

2. **My Father (59 years old, Service Technician)**:
    - He has a practical mindset and is not familiar with data visualization.
    - His feedback will help assess the visualization's accessibility to non-experts.
    - He can provide insights into whether the visualization effectively conveys information.

3. **A Friend (28 years old, Biology Student)**:
    - As a student, he is familiar with academic performance and statistical concepts.
    - His feedback will focus on the depth of insights and accuracy.
    - He can evaluate whether the visualization aligns with his domain knowledge.


### Evaluation Measures
I use the following measures to assess the effectiveness of the visualization:

1. **Insight Depth**:
    - How well the visualization reveals patterns, trends, and relationships.
    - I want participants to gain meaningful insights into student performance.

2. **Use Cases**:
    - Whether participants can identify specific scenarios where the visualization would be useful.
    - I aim for versatility, allowing users to explore different aspects of the data.

3. **Accuracy**:
    - How accurately the visualization represents the underlying data.
    - I want to avoid misleading or incorrect interpretations.

### Evaluation Approach
I will conduct a **formal experiment** with the participants. Here is how I instantiate the methods:

1. **Introduction**:
    - Brief participants about the dataset, its variables, and the visualization's purpose.
    - Explain that we want their feedback to improve the visualization.

2. **Task Scenarios**:
    - Participants will perform specific tasks using the visualization:
        - Compare math scores between genders.
        - Explore correlations between scores.
        - Identify interesting patterns.
    - They will think aloud while interacting with the visualization.

3. **Questionnaire**:
    - After each task, participants will answer questions:
        - Did you find the visualization easy to understand?
        - Were you able to identify trends or differences?
        - Did the visualization align with your expectations?

4. **Feedback Session**:
    - Participants will discuss their observations, challenges, and suggestions.
    - I record their comments and note any usability issues.

### Success Criteria
I consider the visualization successful if:
- Participants uncover meaningful insights related to student performance.
- Use cases emerge beyond the initial target question.
- Feedback indicates high clarity, accuracy, and usability.


## Evaluation 
For the evaluation of the Student Study Performance visualization, I followed the plan above. Because recruiting domain experts proved difficult, I worked with three participants from diverse backgrounds:

1. **My Mother (58 years old, Chef)**:
    - Feedback: Found the visualization easy to understand.
    - Insight Depth: Identified correlations between writing and reading scores.
    - Use Cases: Suggested using the visualization for educational planning.
    - Accuracy: No concerns.
    - Approach: Formal experiment with task scenarios and questionnaire.

2. **My Father (59 years old, Service Technician)**:
    - Feedback: Appreciated the simplicity and clarity.
    - Insight Depth: Noticed gender differences in math scores.
    - Use Cases: Proposed using it for school performance discussions.
    - Accuracy: No concerns.
    - Approach: Formal experiment with task scenarios and questionnaire.

3. **My Friend (28 years old, Biology Student)**:
    - Feedback: Liked the color coding and scatterplot design.
    - Insight Depth: Explored outliers in writing scores.
    - Use Cases: Suggested using it for research on student success factors.
    - Accuracy: No concerns.
    - Approach: Formal experiment with task scenarios and questionnaire.

### Synthesis of Findings
#### Elements That Worked Well:
1. **Clarity and Simplicity**:
    - Participants appreciated the straightforward design.
    - The visualization effectively conveyed information without overwhelming users.

2. **Color Coding**:
    - The use of color to represent gender was intuitive.
    - Participants quickly grasped the gender-related patterns.

3. **Scatterplots**:
    - Scatterplots allowed for easy exploration of relationships.
    - Participants could visually identify trends and outliers.

#### Elements to Refine in Future Iterations:
1. **Interaction features**:
    - Participants expressed interest in interactive features.
    - Future iterations should enhance interactivity for deeper exploration.

2. **Additional Context**:
    - Some participants wanted context (e.g., average scores, standard deviation).
    - Providing context would enhance the usefulness of the visualization.

3. **Domain-Specific Insights**:
    - While non-experts found the visualization valuable, experts might uncover deeper insights.
    - Future iterations could involve domain-specific experts for further evaluation.

### Conclusion
Overall, my evaluation confirmed that the visualization effectively communicates student performance insights.