1. Problem statement
This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.
2. Data Collection
Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
The data consists of 8 columns and 1000 rows.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
# raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Projectfiles\MLproject\datasource\StudentsPerformance.csv')

In [None]:
df.head()

In [None]:
df.shape

3. Data Checks to perform
Check Missing values
Check Duplicates
Check data type
Check the number of unique values of each column
Check statistics of data set
Check various categories present in the different categorical column
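
The checks listed above can also be gathered into one summary table. The following is a minimal sketch on a made-up toy frame, not the project's data; `toy` and `quick_checks` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "lunch": ["standard", "free/reduced", "free/reduced", "standard"],
    "math_score": [72, 47, 47, 90],
})


def quick_checks(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })


summary = quick_checks(toy)
# duplicated() marks every fully identical row after its first occurrence
n_duplicates = int(toy.duplicated().sum())

print(summary)
print(f"Duplicate rows: {n_duplicates}")
```

On the real data the same helper could be called as `quick_checks(df)`, which puts the results of `isna()`, `nunique()` and the dtypes side by side.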

In [None]:
df.isna().sum()

There are no missing values in the data set

In [None]:
df.duplicated().sum()

There are no duplicate values in the data set

In [None]:
df.info()

In [None]:
df.nunique()

In [None]:
df.describe()

Insights:
* From the description of the numerical data above, the means of the scores are close to each other, between 66 and 68.05.
* The standard deviations are also similar, between 14.6 and 15.19.
* The minimum scores for reading (17) and writing (10) are much higher than the minimum score for Math (0).
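
The comparison of means, standard deviations and minimums can be read off a smaller table than the full `describe()` output. This is a sketch on made-up numbers (the `toy` frame is an assumption, not the project's data), using `DataFrame.agg` to keep only the three statistics discussed above.

```python
import pandas as pd

# Hypothetical scores, chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "math_score": [0, 60, 80, 100],
    "reading_score": [17, 70, 85, 100],
    "writing_score": [10, 65, 90, 100],
})

# One row per statistic, one column per subject
stats = toy.agg(["mean", "std", "min"]).round(2)

# How far apart the subject means are
spread_of_means = stats.loc["mean"].max() - stats.loc["mean"].min()

print(stats)
print(f"Spread between highest and lowest mean: {spread_of_means}")
```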

4. Categorical Data

In [None]:
df.head()

In [None]:
print(f"Categories in 'gender' variable: {df['gender'].unique()}")

print(f"Categories in 'race_ethnicity' variable: {df['race_ethnicity'].unique()}")

print(f"Categories in 'parental_level_of_education' variable: {df['parental_level_of_education'].unique()}")

print(f"Categories in 'lunch' variable: {df['lunch'].unique()}")

print(f"Categories in 'test_preparation_course' variable: {df['test_preparation_course'].unique()}")

5. Exploratory Data Analysis

In [None]:
#add total score and average columns
df['total score'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['average'] = df['total score']/3
df.head()

In [65]:
# define numerical & categorical columns
# (run after the derived columns are added, so that they are counted as numerical features)
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print(f"We have {len(numeric_features)} numerical features : {numeric_features}")
print(f"\nWe have {len(categorical_features)} categorical features : {categorical_features}")

We have 5 numerical features : ['math_score', 'reading_score', 'writing_score', 'total score', 'average']

We have 5 categorical features : ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']

In [None]:
reading_full_marks = df[df['reading_score'] == 100]['average'].count()
writing_full_marks = df[df['writing_score'] == 100]['average'].count()
math_full_marks = df[df['math_score'] == 100]['average'].count()

print(f'Number of students with full marks in Maths: {math_full_marks}')
print(f'Number of students with full marks in Writing: {writing_full_marks}')
print(f'Number of students with full marks in Reading: {reading_full_marks}')

In [None]:
reading_less_20 = df[df['reading_score'] <= 20]['average'].count()
writing_less_20 = df[df['writing_score'] <= 20]['average'].count()
math_less_20 = df[df['math_score'] <= 20]['average'].count()

# the filters above use <= 20, so the labels say "20 marks or fewer"
print(f'Number of students with 20 marks or fewer in Maths: {math_less_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_less_20}')

Insights:
* The counts above suggest that students performed worst in Math.
* Likewise, the best performance can be observed in the reading section.
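
The per-subject counts can also be computed for all three subjects at once by summing boolean masks. A minimal sketch, assuming a made-up `toy` frame (not the project's data) with the same column names:

```python
import pandas as pd

# Hypothetical scores used only to show the counting pattern.
toy = pd.DataFrame({
    "math_score": [100, 18, 55, 20],
    "reading_score": [100, 45, 100, 61],
    "writing_score": [97, 20, 100, 58],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# A comparison yields True/False per cell; summing counts the True values per column
full_marks = (toy[score_cols] == 100).sum()
at_most_20 = (toy[score_cols] <= 20).sum()

counts = pd.DataFrame({"full_marks": full_marks, "at_most_20": at_most_20})
print(counts)
```

Placing both counts in one table makes it easier to compare subjects than reading six separate print lines.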

6. Data Visualization
Histogram
Kernel Density Estimate (KDE)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='average',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average',kde=True,hue='gender')
plt.show()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='total score',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='total score',kde=True,hue='gender')
plt.show()

Insights
* Female students tend to perform better than male students.

In [None]:
plt.subplots(1,3,figsize=(25,6))
# subplot indices must match the 1x3 grid created above
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')
plt.show()

Insights:
* Students with a standard lunch, both male and female, tend to score higher in exams than students with a free/reduced lunch.
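
The visual impression from the histograms can be cross-checked numerically by comparing group means. The sketch below uses a made-up `toy` frame (an assumption for illustration, not the project's data) and reports the gap between lunch types for each gender.

```python
import pandas as pd

# Hypothetical averages, chosen only to show the grouping pattern.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced", "standard", "standard"],
    "average": [80.0, 60.0, 70.0, 50.0, 90.0, 74.0],
})

# Mean average score per (gender, lunch) pair, with lunch types as columns
lunch_means = toy.groupby(["gender", "lunch"])["average"].mean().unstack("lunch")

# Positive values mean the standard-lunch group scored higher
gap = lunch_means["standard"] - lunch_means["free/reduced"]

print(lunch_means)
print(gap)
```

A difference in group means describes an association only; it does not by itself show that the lunch type causes the difference.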

In [None]:
plt.subplots(1,3,figsize=(25,6))
# subplot indices must match the 1x3 grid created above
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='parental_level_of_education')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental_level_of_education')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental_level_of_education')
plt.show()


Insights:
* In general, parents' education does not appear to have an impact on students' exam performance.
* The 2nd plot shows that male students whose parents hold an associate's or master's degree tend to perform well in the exams.
* The 3rd plot shows no visible effect of parents' education on female students.

In [None]:
plt.subplots(1,3,figsize=(25,6))
# subplot indices must match the 1x3 grid created above
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='race_ethnicity')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race_ethnicity')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race_ethnicity')
plt.show()

Insights:
* Students of group A and group B tend to perform poorly in the exams, irrespective of gender.

7. Multivariate analysis using pie plots

In [None]:
plt.rcParams['figure.figsize'] = (30, 12)

plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']


plt.pie(size, colors = color, labels = labels,autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')



plt.subplot(1, 5, 2)
size = df['race_ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Race_Ethnicity', fontsize = 20)
plt.axis('off')



plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Lunch', fontsize = 20)
plt.axis('off')


plt.subplot(1, 5, 4)
size = df['test_preparation_course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Test Course', fontsize = 20)
plt.axis('off')


plt.subplot(1, 5, 5)
size = df['parental_level_of_education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan','orange','grey']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 20)
plt.axis('off')


plt.tight_layout()
plt.grid()

plt.show()

Insights
* The numbers of male and female students are almost equal.
* Group C has the most students.
* More students have a standard lunch than a free/reduced lunch.
* More students have not enrolled in any test preparation course than have completed one.
* "Some College" is the most common parental education level, followed closely by "Associate's Degree".
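
The shares shown in the pie charts can also be obtained as numbers, and the slice labels can be taken from the counts themselves so that they cannot drift out of order. A minimal sketch on a made-up `toy` series (an assumption for illustration, not the project's data):

```python
import pandas as pd

# Hypothetical category values used only to show the pattern.
toy = pd.Series(["female", "male", "female", "female", "male"], name="gender")

# value_counts() sorts by frequency, so the label order follows the data
counts = toy.value_counts()
labels = [str(label).title() for label in counts.index]

# normalize=True returns fractions; scale to percentages
shares = (toy.value_counts(normalize=True) * 100).round(1)

print(labels)
print(shares)
```

Passing `counts` together with `labels` derived this way to `plt.pie` avoids a hard-coded label tuple that silently mismatches the slices if the category frequencies change.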

8. UNIVARIATE ANALYSIS - Gender


In [None]:
f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
    
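# value_counts() sorts by frequency, so take the labels from its index instead of hard-coding them
gender_counts = df['gender'].value_counts()
ax[1].pie(x=gender_counts,labels=gender_counts.index.str.title(),explode=[0,0.1],autopct='%1.1f%%',shadow=True,colors=['#ff4d4d','#ff8000'])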
plt.show()

Insights
* Gender is fairly balanced: 518 female students (52%) and 482 male students (48%).

9. BIVARIATE ANALYSIS (Does gender have any impact on student performance?)

In [None]:
# numeric_only=True skips the categorical columns, which cannot be averaged
gender_group = df.groupby('gender').mean(numeric_only=True)
gender_group

In [None]:
plt.figure(figsize=(10, 8))

X = ['Total Average','Math Average']


# select rows by label rather than by position
female_scores = [gender_group.loc['female', 'average'], gender_group.loc['female', 'math_score']]
male_scores = [gender_group.loc['male', 'average'], gender_group.loc['male', 'math_score']]

X_axis = np.arange(len(X))
  
plt.bar(X_axis - 0.2, male_scores, 0.4, label = 'Male')
plt.bar(X_axis + 0.2, female_scores, 0.4, label = 'Female')
  
plt.xticks(X_axis, X)
plt.ylabel("Marks")
plt.title("Total average v/s Math average marks of both the genders", fontweight='bold')
plt.legend()
plt.show()

Insights:
* On average, female students have a better overall score than male students.
* Male students score higher in Math.

10. Outliers

In [None]:
plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math_score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading_score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing_score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average'],color='lightgreen')
plt.show()
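
With default settings, box plots draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied directly to count them. This is a sketch under that assumption; `iqr_outliers` and `toy_scores` are hypothetical names and the values are made up.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than whis * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical scores with one very low value
toy_scores = pd.Series([0, 55, 60, 62, 65, 68, 70, 72, 75, 80], name="math_score")
outliers = iqr_outliers(toy_scores)

print(f"Outliers found: {outliers.tolist()}")
```

On the real data, calling the helper once per score column would give the number of points each box plot marks beyond its whiskers.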

11. MULTIVARIATE ANALYSIS USING PAIRPLOT

In [None]:
sns.pairplot(df,hue = 'gender')
plt.show()

Insights
* The plot above shows that all the scores are positively and roughly linearly related to one another.
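
The linear relationship seen in the pair plot can be quantified with a Pearson correlation matrix. A minimal sketch on made-up scores (the `toy` frame is an assumption, not the project's data):

```python
import pandas as pd

# Hypothetical scores that rise together, used only to show the calculation.
toy = pd.DataFrame({
    "math_score": [40, 55, 70, 85, 100],
    "reading_score": [45, 60, 72, 88, 98],
    "writing_score": [42, 58, 75, 86, 99],
})

# Pairwise Pearson correlation between the score columns
corr = toy.corr(method="pearson")

print(corr.round(3))
```

Values close to 1 indicate a strong positive linear relationship between two subjects; values close to 0 indicate little linear relationship.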

12. Conclusions
* Student performance is related to lunch, race/ethnicity, and parental level of education.
* Female students lead in pass percentage and are also the top scorers.
* Student performance is not strongly related to the test preparation course; however, completing the course is still beneficial.