<a href="https://colab.research.google.com/github/RidhimaJain/StudentPerformance-EDA/blob/main/EDA_Student_Performance_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Title: Exploratory Data Analysis on Student Performance Dataset**

# **1. Introduction**

This dataset contains student demographics, test preparation status, and exam scores. The purpose of this analysis is to explore the factors influencing student academic performance and generate insights to support educational improvements.

We hypothesize that:

- Students who complete the test preparation course will have higher exam scores than those who do not.
- Completing the test preparation course has a greater impact on math scores than on reading or writing scores.
- There is no significant difference in test preparation completion rates among different gender or ethnicity groups.
- Math scores are positively correlated with reading and writing scores.
- Students whose parents have higher levels of education will perform better in math, reading, and writing exams.
- Female students will score higher than male students in reading and writing exams.
- Students who receive the standard lunch (versus free/reduced lunch) will have higher academic scores.
- There are significant differences in exam scores among different race/ethnicity groups.

This analysis will be useful for educators, school administrators, and policymakers looking to enhance student success.




# **2. Dataset Description**
This dataset contains information on students’ demographic background, test preparation, and academic performance. The main variables include:

- **Gender:** Male or Female  
- **Race/Ethnicity:** Student’s ethnic group  
- **Parents' Level of Education:** Highest education level achieved by parents (e.g., high school, bachelor's degree)  
- **Lunch:** Type of lunch the student receives (standard or free/reduced)  
- **Test Preparation Course:** Whether the student completed a test preparation course (completed or none)  
- **Math Score:** Score obtained in the math exam (0-100)  
- **Reading Score:** Score obtained in the reading exam (0-100)  
- **Writing Score:** Score obtained in the writing exam (0-100)

These variables help us analyze how different factors relate to student academic performance.

# **3. Import Required Libraries**

We import the Python libraries needed for data manipulation and visualization.

In [12]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent theme for all plots
sns.set(style = 'whitegrid')

# **4. Load The Dataset**

The dataset is loaded using the pandas library. The dataset has been uploaded to a GitHub repository. This approach allows the CSV file to be accessed directly via its raw URL, making the code cleaner and removing the need for manual authorization or drive mounting each time the notebook is run.

In [13]:
# Load the dataset from GitHub
url = "https://raw.githubusercontent.com/RidhimaJain/StudentPerformance-EDA/refs/heads/main/StudentsPerformance.csv"

df = pd.read_csv(url)
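As a minimal sketch of a slightly more defensive load, the helper below reads the CSV and fails early if an expected column is missing. The function name `load_student_data` and the in-memory sample are hypothetical and used only for illustration; the column names are taken from the dataset itself. `pd.read_csv` accepts a URL, a local path, or a file-like object, so the same helper would work with the raw GitHub URL above.

```python
import io

import pandas as pd

# Column names as they appear in the raw CSV.
EXPECTED_COLUMNS = [
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
]


def load_student_data(source):
    """Read the CSV from a URL, path, or file-like object (hypothetical helper).

    Raises ValueError if any expected column is absent.
    """
    data = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return data


# Small in-memory sample so the sketch runs without network access.
sample_csv = io.StringIO(
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,none,72,72,74\n"
    "male,group A,associate's degree,free/reduced,none,47,57,44\n"
)
sample_df = load_student_data(sample_csv)
print(sample_df.shape)
```
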

# **5. Initial Data Inspection**

In this step, we perform an initial examination of the dataset to understand its structure and quality. This includes previewing sample records, checking the dataset’s size, identifying data types, and detecting missing or duplicate values. These insights help inform subsequent data cleaning and analysis steps.

## **5.1. Preview First Few Records**

Display the first 5 rows to get an initial idea of the dataset.

In [14]:
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


## **5.2. Check The Dataset Shape**

Check the number of rows and columns in the dataset.

In [15]:
df.shape

(1000, 8)

## **5.3. Dataset Summary Overview**

Check for missing values and data types of each column.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


**Interpretation:** All 8 columns have 1000 non-null values, matching the 1000 rows in the dataset, so there is no missing data to handle. The data types also look correct: the three score columns are integers (`int64`) and the five categorical columns are stored as `object`.

## **5.4. Statistical Summary of Numeric Columns**

Generate descriptive statistics such as mean, standard deviation, min, max, and quartiles for numerical columns.

In [17]:
df.describe()

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |
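By default, `describe()` summarizes only numeric columns. As a hedged sketch, the categorical columns could be summarized the same way by excluding numeric dtypes. The small `toy` frame below is made-up illustrative data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, 47, 76],
})

# Excluding numeric dtypes leaves the text columns; the summary then reports
# count, number of unique values, the most frequent value (top), and how
# often it occurs (freq).
categorical_summary = toy.describe(exclude=[np.number])
print(categorical_summary)
```

On the real data this would be `df.describe(exclude=[np.number])`.
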


# **6. Data Cleaning/Preprocessing**

Before analyzing or visualizing the data, we clean and prepare it to ensure consistency and accuracy. This includes handling missing values, fixing inconsistent categories, and checking data types.

## **6.1. Handling Missing Values**

We start by identifying and handling any missing values. This ensures the analysis isn't skewed by incomplete data.

In [18]:
# Check for missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| gender | 0 |
| race/ethnicity | 0 |
| parental level of education | 0 |
| lunch | 0 |
| test preparation course | 0 |
| math score | 0 |
| reading score | 0 |
| writing score | 0 |


**Interpretation:** All columns have complete data, so there are no missing values to handle.
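This dataset needs no imputation. Purely as an illustrative sketch, this is what one handling policy could look like if gaps did exist. The `toy` frame and the chosen policy (median fill for scores, dropping rows that lack a category) are assumptions for demonstration, not steps this notebook performs.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberate gaps, for illustration only.
toy = pd.DataFrame({
    "lunch": ["standard", None, "free/reduced", "standard"],
    "math score": [72.0, 69.0, np.nan, 76.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Assumed policy: fill numeric gaps with the column median, then drop rows
# that still lack a categorical value.
cleaned = toy.copy()
cleaned["math score"] = cleaned["math score"].fillna(cleaned["math score"].median())
cleaned = cleaned.dropna(subset=["lunch"])
print(cleaned)
```
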

## **6.2. Handle Duplicate Records**

Duplicate records can bias the analysis. We check for and remove any duplicates if found.

In [19]:
# Check for duplicate records
df.duplicated().sum()

np.int64(0)

**Interpretation:** There are no duplicate records in the dataset. This means that we don't need to remove any records before analyzing the dataset.
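If duplicates had been found, the removal step might look like the sketch below. The `toy` frame is made-up data containing one repeated row; on the real data the equivalent call would be `df.drop_duplicates()`.

```python
import pandas as pd

# Made-up rows for illustration; the third row repeats the first.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, 47, 72],
})

# duplicated() flags rows identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```
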

## **6.3. Standardize Column Names**

To make column names easier to work with, we convert them to lowercase and replace spaces with underscores.


In [20]:
# Check column names to see if they are consistent
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

**Interpretation:** The column names are already lowercase, but they contain spaces that need to be replaced with underscores to standardize them.

In [21]:
# Replacing ' ' in column names with '_'
df.columns = df.columns.str.replace(' ','_')

# Display Standardized Column Names
df.columns

Index(['gender', 'race/ethnicity', 'parental_level_of_education', 'lunch',
       'test_preparation_course', 'math_score', 'reading_score',
       'writing_score'],
      dtype='object')
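The column names here were already lowercase, so replacing spaces was enough. For datasets where that is not the case, a more general sketch could chain the string methods as shown below. The helper name `standardize_columns` and the sample names are hypothetical, and stripping surrounding whitespace is an extra assumption beyond the steps described above.

```python
import pandas as pd


def standardize_columns(columns):
    """Trim whitespace, lowercase, and replace spaces with underscores."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )


# Made-up column names for illustration only.
raw_columns = pd.Index(["Gender", " Math Score ", "race/ethnicity"])
clean_columns = standardize_columns(raw_columns)
print(list(clean_columns))
```

Applied to the notebook's DataFrame, this would be `df.columns = standardize_columns(df.columns)`. The slash in `race/ethnicity` is left untouched, as in the output above.
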

## **6.4. Data Type Validation**

We validate and adjust data types to ensure each column is represented accurately. For instance, scores should be numeric, while categorical features like gender or parental education level should be of object or category type. This step is crucial for applying correct preprocessing techniques later on.

The data types were already inspected in step 5.3 (Dataset Summary Overview): the three score columns are `int64` and the five categorical columns are `object`, so they already match these expectations.
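A programmatic version of this check could look like the following sketch. The `toy` frame is made-up data, and converting the categorical columns to the `category` dtype is an optional assumption, not something the notebook does.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math_score": [72, 47, 76],
})

score_columns = ["math_score"]
categorical_columns = ["gender", "lunch"]

# Fail early if any score column is not numeric.
non_numeric = [col for col in score_columns if not is_numeric_dtype(toy[col])]
if non_numeric:
    raise TypeError(f"Expected numeric scores, got non-numeric: {non_numeric}")

# Optionally store the categorical columns with the memory-efficient
# 'category' dtype instead of generic strings.
typed = toy.astype({col: "category" for col in categorical_columns})
print(typed.dtypes)
```
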

## **6.5. Explore Unique Values in Categorical Columns**

We examine categorical columns to understand their structure and fix any inconsistencies in naming (e.g., title case vs lowercase).
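One way to carry out this inspection is sketched below on a made-up `toy` frame with deliberately inconsistent labels. The normalization step (trim whitespace and lowercase) is an assumption about how such inconsistencies might be fixed, not a record of what this dataset required.

```python
import pandas as pd

# Made-up rows with inconsistent labels, for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "Male", "male "],
    "lunch": ["standard", "free/reduced", "standard"],
    "math_score": [72, 47, 76],
})

# Treat every non-numeric column as categorical.
categorical_columns = toy.select_dtypes(exclude="number").columns

# List the distinct labels in each categorical column.
for column in categorical_columns:
    print(column, sorted(toy[column].unique()))

# Normalize labels so that 'Male' and 'male ' collapse into 'male'.
normalized = toy.copy()
for column in categorical_columns:
    normalized[column] = normalized[column].str.strip().str.lower()

print(sorted(normalized["gender"].unique()))
```
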