# Exploratory Data Analysis (EDA): Student Performance Dataset

## Introduction

In this notebook, we will perform an Exploratory Data Analysis (EDA) on a dataset that captures various factors influencing student performance. The dataset includes attributes such as:

* **Gender**
* **Race/Ethnicity**
* **Parental Level of Education**
* **Lunch Type**
* **Test Preparation Course**
* **Scores in Math, Reading, and Writing**

The goal of this analysis is to examine how these demographic and socioeconomic factors relate to student performance in math, reading, and writing. By identifying patterns and potential correlations, this EDA aims to provide insights that could inform strategies to support diverse student groups and improve academic outcomes.

The dataset used in this analysis is publicly available on Kaggle and can be accessed [here](https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics/data).

---
## Step 1: Importing Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization settings
sns.set(style="whitegrid")
plt.style.use('seaborn-v0_8-muted')

# Display settings for DataFrame outputs
pd.set_option('display.max_columns', None)


---
## Step 2: Loading the Dataset

In [None]:
# Load the dataset from a CSV file
df = pd.read_csv("exams.csv")  # Make sure the file path is correct

# Set the index to start from 1
df.index = df.index + 1

# Display the first few records
df.head()
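
Before relying on specific column names, it can help to confirm that the loaded file actually contains them. The sketch below is one possible check, not part of the original notebook: the helper `missing_columns` and the toy frame are hypothetical, and the expected column names are taken from the code cells later in this notebook.

```python
import pandas as pd

# Column names as they are used later in this notebook
EXPECTED_COLUMNS = [
    "gender", "race/ethnicity", "parental level of education",
    "lunch", "test preparation course",
    "math score", "reading score", "writing score",
]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame`."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Toy frame standing in for the real CSV (hypothetical values)
sample = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group A", "group B"],
    "parental level of education": ["high school", "some college"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "completed"],
    "math score": [70, 65],
    "reading score": [75, 60],
})

# The toy frame deliberately lacks "writing score"
print(missing_columns(sample))
print(sample.dtypes)
```

With the real data, calling `missing_columns(df)` right after `pd.read_csv` should return an empty list if the file matches what the rest of the notebook assumes.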

---
## Step 3: Checking for Missing Values

Before diving into analysis, it's important to check if the dataset contains any missing values. Missing data can skew results and must be handled appropriately.

In this dataset, we'll verify whether any null or missing values are present.


In [None]:
# Check for missing values in each column
df.isnull().sum()

**Result**: There are no missing values in the dataset, so no imputation or removal of incomplete rows is needed.
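
Missing values are only one kind of data-quality problem. As a hedged sketch of what a slightly broader check might look like, the hypothetical helper `integrity_report` below also counts duplicate rows and scores outside an assumed 0 to 100 range. That range is an assumption about the grading scale, not something stated by the dataset, and the toy frame is made up for illustration.

```python
import pandas as pd

SCORE_COLUMNS = ["math score", "reading score", "writing score"]

def integrity_report(frame, low=0, high=100):
    """Count missing cells, duplicate rows, and out-of-range scores."""
    scores = frame[SCORE_COLUMNS]
    # Comparisons with NaN evaluate to False, so missing scores are not counted here
    out_of_range = ((scores < low) | (scores > high)).sum().sum()
    return {
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "out_of_range_scores": int(out_of_range),
    }

# Toy data with one missing cell, one duplicated row, and one score above 100
toy = pd.DataFrame({
    "math score": [70, 105, 70],
    "reading score": [80, 90, 80],
    "writing score": [75, None, 75],
})

report = integrity_report(toy)
print(report)
```

On the real data, `integrity_report(df)` would be expected to report zero missing cells, matching the result above. The other two counts are not verified here.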


---
## Step 4: Analyzing the Distribution of Scores

Now that we have confirmed the dataset has no missing values, we can begin our exploratory data analysis.

### Step 4.1: Distribution of Student Scores by Subject
In this step, we'll visualize the distribution of student performance across the three academic subjects: **Math**, **Reading**, and **Writing**. 

We'll use a **violin plot** to show the distribution of scores for each subject. This type of plot combines a box plot with a kernel density estimate (KDE), making it easier to understand the data's shape, central tendency, and variability.


In [None]:
# Melt the data to long format for easier plotting
score_data = df[["math score", "reading score", "writing score"]].melt(
    var_name="Subject", value_name="Score"
)

# Rename subject values for better display
score_data["Subject"] = score_data["Subject"].map({
    "math score": "Math",
    "reading score": "Reading",
    "writing score": "Writing"
})

# Create the violin plot with palette and hue
plt.figure(figsize=(6, 4))
sns.violinplot(data=score_data, x="Subject", y="Score", hue="Subject", palette="muted", legend=False)

# Customize the plot
plt.title("Distribution of Student Scores by Subject", fontsize=14)
plt.xlabel("Subject")
plt.ylabel("Score")
plt.ylim(0, 110)
plt.yticks(range(0, 111, 10))  # Set y-axis ticks in increments of 10
plt.tight_layout()
plt.show()
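
A numeric summary can complement the violin plot when comparing the three subjects. The following is a minimal sketch on made-up scores (the `toy_scores` frame is hypothetical). With the real data, the same calls could be applied to `df[["math score", "reading score", "writing score"]]`.

```python
import pandas as pd

# Hypothetical scores for four students
toy_scores = pd.DataFrame({
    "math score": [55, 70, 85, 62],
    "reading score": [60, 72, 88, 70],
    "writing score": [58, 75, 90, 66],
})

# Central tendency and spread per subject (one row per subject after transposing)
summary = toy_scores.agg(["mean", "median", "std", "min", "max"]).T
print(summary.round(2))

# Pairwise Pearson correlations between the three subjects
correlations = toy_scores.corr()
print(correlations.round(2))
```

If the three subjects turn out to be strongly correlated on the real data, that would lend some support to summarizing them with a single average, though that is a judgment call rather than a formal test.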

### Step 4.2: Grades Distribution Across Male and Female Students

From the previous section, we observed that the score distributions in **Math**, **Reading**, and **Writing** are broadly similar in shape and range. To simplify further analysis, we will combine the three scores into a single **average grade** for each student. This average will be used as a unified performance metric in subsequent visualizations.

In this section, we aim to examine how student performance, represented by the average grade, differs between **male and female** students.

We'll begin by visualizing the **gender distribution** in the dataset, followed by a **box plot** that compares the average grades between male and female students.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Capitalize gender values
df["gender"] = df["gender"].str.capitalize()

# Create average score column
df["average score"] = df[["math score", "reading score", "writing score"]].mean(axis=1)

# Prepare gender counts
gender_counts = df["gender"].value_counts()

# Create a figure with 3 subplots side by side
fig, axs = plt.subplots(1, 3, figsize=(10, 4))

# 1. Pie chart
axs[0].pie(
    gender_counts,
    labels=gender_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=sns.color_palette("pastel")
)
axs[0].set_title("Gender Distribution (Pie Chart)")
axs[0].axis('equal')

# 2. Bar chart (no palette to avoid warning)
sns.barplot(x=gender_counts.index, y=gender_counts.values, ax=axs[1])
axs[1].set_title("Gender Distribution (Bar Chart)")
axs[1].set_xlabel("Gender")
axs[1].set_ylabel("Count")

# 3. Box plot with hue assigned, palette, and no legend
sns.boxplot(
    data=df, x="gender", y="average score",
    hue="gender", palette="Set2", legend=False,
    ax=axs[2]
)
axs[2].set_title("Average Grade Distribution by Gender")
axs[2].set_xlabel("Gender")
axs[2].set_ylabel("Average Score")
axs[2].set_ylim(0, 110)
axs[2].set_yticks(range(0, 111, 10))

plt.tight_layout()
plt.show()

# Calculate average scores by gender
avg_female = df.loc[df["gender"] == "Female", "average score"].mean()
avg_male = df.loc[df["gender"] == "Male", "average score"].mean()

# Print formatted averages with tabs for alignment
print(f"Female Students Average :\t{avg_female:.2f}")
print(f"Male Students Average   :\t{avg_male:.2f}")

From the above visualizations, we observe that the student population is almost evenly split, with **males** representing about **50.8%** and **females** **49.2%**, differing by only **12 students**. Despite this near parity, **female** students outperform males slightly, with an average score approximately **3 points** higher (**70.56** vs. **67.70**). This suggests that while the gender distribution is balanced, **female** students tend to achieve marginally better academic results in this dataset.
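
To put the size of a gap like this in context, one option is a standardized effect size such as Cohen's d, which divides the difference in means by the pooled standard deviation. This is an addition for illustration and not part of the original analysis. The helper `cohens_d` and the toy values below are hypothetical.

```python
import numpy as np
import pandas as pd

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical average scores for six students
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "average score": [72.0, 68.0, 76.0, 66.0, 70.0, 62.0],
})

# Group size, mean, and standard deviation per gender
by_gender = toy.groupby("gender")["average score"].agg(["count", "mean", "std"])
print(by_gender)

d = cohens_d(
    toy.loc[toy["gender"] == "Female", "average score"],
    toy.loc[toy["gender"] == "Male", "average score"],
)
print(f"Cohen's d (Female - Male): {d:.2f}")
```

The toy numbers are chosen only to make the arithmetic easy to follow. The value obtained on the real data would differ and should be read alongside the box plot rather than in isolation.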

### Step 4.3: Grades Distribution Based on Student Demographic Profile

In this section, we explore how student performance varies across different **demographic categories**, focusing on:

- **Race/Ethnicity**
- **Parental Level of Education**

We begin by visualizing the distribution of students across these two demographic factors using pie charts. This gives us insight into the overall makeup of the dataset. Then, we analyze how average grades vary within these groups.

- For **race/ethnicity**, we use a **violin plot** to visualize the distribution and spread of grades within each group.
- For **parental level of education**, we use a **bar plot** to compare the **mean average score**, with **error bars representing the standard deviation (SD)**.


In [None]:
# Calculate average scores
df["average score"] = df[["math score", "reading score", "writing score"]].mean(axis=1)

# Category ordering
edu_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree"
]

# Normalize category labels: title-case race/ethnicity, lowercase education to match edu_order
df["race/ethnicity"] = df["race/ethnicity"].str.title()
df["parental level of education"] = df["parental level of education"].str.lower()

# Get value counts
race_counts = df["race/ethnicity"].value_counts()
edu_counts = df["parental level of education"].value_counts().reindex(edu_order)

# Side-by-side pie charts
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Pie chart for race/ethnicity
axs[0].pie(
    race_counts,
    labels=race_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=sns.color_palette("pastel")
)
axs[0].set_title("Distribution by Race/Ethnicity")
axs[0].axis('equal')

# Pie chart for parental education
axs[1].pie(
    edu_counts,
    labels=edu_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=sns.color_palette("pastel")
)
axs[1].set_title("Distribution by Parental Level of Education")
axs[1].axis('equal')

plt.tight_layout()
plt.show()

In [None]:
# Define the order
race_order = ["Group A", "Group B", "Group C", "Group D", "Group E"]

plt.figure(figsize=(10, 6))
sns.violinplot(
    data=df,
    x="race/ethnicity",
    y="average score",
    hue="race/ethnicity",
    order=race_order,          # Force alphabetical order
    palette="muted",
    legend=False
)

plt.title("Grade Distribution by Race/Ethnicity", fontsize=14)
plt.xlabel("Race/Ethnicity")
plt.ylabel("Average Score")
plt.ylim(0, 110)
plt.yticks(range(0, 111, 10))
plt.tight_layout()
plt.show()

In [None]:
# Bar plot with Seaborn's built-in error bars (standard deviation)
plt.figure(figsize=(12, 6))

sns.barplot(
    data=df,
    x="parental level of education",
    y="average score",
    hue="parental level of education",  # same variable as x, so the palette applies per bar
    order=edu_order,
    hue_order=edu_order,                # keep the color gradient in attainment order
    errorbar="sd",
    palette="Blues_d",
    legend=False
)


plt.title("Average Grade by Parental Education Level (Mean ± SD)", fontsize=14)
plt.xlabel("Parental Level of Education")
plt.ylabel("Average Score")
plt.xticks(rotation=15)
plt.ylim(0, 110)
plt.yticks(range(0, 111, 10))
plt.tight_layout()
plt.show()

From the visualizations above, we observe that **Group C** represents the largest racial/ethnic category in the dataset, comprising **32.3%** of the students. In terms of parental education, approximately **39.2%** of students have parents whose education ended at high school or earlier (a high school diploma or less).

Among the race/ethnicity groups, **Group E** recorded the highest average performance. Additionally, students whose parents have at least **some college** education tend to score higher on average than those whose parents do not. This trend suggests a positive association between parental educational attainment and student academic performance.
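
The observations about parental education can also be checked numerically. The sketch below, on hypothetical toy data, treats education as an ordered category, summarizes scores per level, computes the share of students at "high school" or below, and estimates a Spearman rank correlation as the Pearson correlation of ranks. The column name `education` and all values are assumptions for illustration.

```python
import pandas as pd

edu_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]

# Hypothetical data for eight students
toy = pd.DataFrame({
    "parental level of education": [
        "some high school", "high school", "some college",
        "associate's degree", "bachelor's degree", "master's degree",
        "high school", "some college",
    ],
    "average score": [60.0, 64.0, 68.0, 70.0, 74.0, 78.0, 62.0, 66.0],
})

# Ordered categorical so groups sort by attainment rather than alphabetically
edu = pd.Categorical(
    toy["parental level of education"], categories=edu_order, ordered=True
)
toy["education"] = edu

# Group size, mean, and standard deviation per education level
by_edu = toy.groupby("education", observed=True)["average score"].agg(
    ["count", "mean", "std"]
)
print(by_edu.round(2))

# Share of students whose parents stopped at high school or earlier
share_hs_or_less = (toy["education"] <= "high school").mean()
print(f"High school or less: {share_hs_or_less:.1%}")

# Spearman rank correlation, computed as the Pearson correlation of the ranks
edu_rank = pd.Series(edu.codes, index=toy.index)
rho = edu_rank.rank().corr(toy["average score"].rank())
print(f"Spearman rho: {rho:.2f}")
```

A rank correlation is used here because the education levels are ordered but not evenly spaced. Other choices, such as comparing group means directly, would be equally reasonable for an exploratory pass.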

### Step 4.4: Grades Distribution Based on Learning Condition

In this section, we investigate how students’ academic performance varies with **learning condition factors**, specifically:

- **Lunch type**, which serves as a socioeconomic indicator
- **Test preparation course**, indicating whether students received formal prep before exams

We begin by visualizing the **distribution** of students across these two categories using pie charts. Then, we analyze their **average scores** using:

- A **swarm plot** for lunch type
- A **boxen plot** for test preparation status

Both variables are binary and help us understand how access to resources and preparation relates to student achievement.


In [None]:
# Clean and capitalize
df["lunch"] = df["lunch"].str.capitalize()
df["test preparation course"] = df["test preparation course"].str.capitalize()

# Value counts
lunch_counts = df["lunch"].value_counts()
prep_counts = df["test preparation course"].value_counts()

# Side-by-side pie charts
fig, axs = plt.subplots(1, 2, figsize=(10, 6))

# Pie chart for lunch type
axs[0].pie(
    lunch_counts,
    labels=lunch_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=sns.color_palette("pastel")
)
axs[0].set_title("Distribution by Lunch Type")
axs[0].axis('equal')

# Pie chart for test preparation course
axs[1].pie(
    prep_counts,
    labels=prep_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=sns.color_palette("pastel")
)
axs[1].set_title("Distribution by Test Preparation Course")
axs[1].axis('equal')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.swarmplot(
    data=df,
    x="lunch",
    y="average score",
    hue="lunch",             # To safely use palette
    palette="Set2",
    legend=False
)

plt.title("Grade Distribution by Lunch Type", fontsize=14)
plt.xlabel("Lunch Type")
plt.ylabel("Average Score")
plt.ylim(0, 110)
plt.yticks(range(0, 111, 10))
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxenplot(
    data=df,
    x="test preparation course",
    y="average score",
    hue="test preparation course",
    palette="Set3",
    legend=False
)

plt.title("Grade Distribution by Test Preparation Status", fontsize=14)
plt.xlabel("Test Preparation Course")
plt.ylabel("Average Score")
plt.ylim(0, 110)
plt.yticks(range(0, 111, 10))
plt.tight_layout()
plt.show()

From the analysis above, we observe that students who **completed a test preparation course** tend to score **noticeably higher** than those who did not. Students who receive **standard lunch** also tend to achieve **higher average scores** than those receiving free or reduced lunch, an association that is consistent with lunch type serving as a socioeconomic indicator.

In terms of distribution, around **two-thirds of the students** receive **standard lunch**, and a similar share have **not completed** a test preparation course. This suggests that while formal preparation and lunch program status may both be positively associated with academic outcomes, a significant portion of students do not benefit from both simultaneously.
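
The pie charts show each variable on its own, so they do not reveal how many students fall into each combination of lunch type and test preparation status. A cross-tabulation is one way to look at the joint distribution. The sketch below uses hypothetical toy data with the same capitalized labels produced by the cleaning step above.

```python
import pandas as pd

# Hypothetical data for six students
toy = pd.DataFrame({
    "lunch": [
        "Standard", "Standard", "Standard",
        "Free/reduced", "Free/reduced", "Standard",
    ],
    "test preparation course": [
        "None", "Completed", "None", "None", "Completed", "None",
    ],
    "average score": [70.0, 78.0, 66.0, 58.0, 64.0, 68.0],
})

# Marginal shares, which is what each pie chart displays
print(toy["lunch"].value_counts(normalize=True))
print(toy["test preparation course"].value_counts(normalize=True))

# Joint shares: fraction of all students in each lunch x preparation combination
joint = pd.crosstab(
    toy["lunch"], toy["test preparation course"], normalize="all"
)
print(joint.round(2))

# Mean average score for each combination
means = toy.pivot_table(
    index="lunch",
    columns="test preparation course",
    values="average score",
    aggfunc="mean",
)
print(means.round(1))
```

Applied to `df`, the `joint` table would show directly what fraction of students have both standard lunch and a completed preparation course, and the `means` table would show whether the two associations hold within each subgroup.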

## Conclusion

Through this exploratory analysis, we gained a clearer understanding of how various demographic and socioeconomic factors relate to student academic performance. Key observations include:

* **Gender differences** are subtle but present, with female students slightly outperforming male students on average.
* **Race/ethnicity** and **parental level of education** appear to influence academic outcomes, with students from Group E and those whose parents attained higher education levels generally performing better.
* **Test preparation courses** and **standard lunch status** are associated with higher average scores, suggesting that both academic support and socioeconomic conditions may impact performance.

These findings highlight areas worth further investigation, such as the potential causal links between parental education, economic background, and student achievement. Future steps could include building predictive models or conducting more targeted analysis to guide educational interventions.

## References
*Student performance prediction*. Kaggle. Available at: https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics/data