# Student Performance Dataset
This dataset contains information about the performance of students in a particular school. For each student it records gender, race/ethnicity, parental level of education, lunch type, and whether a test preparation course was completed, together with the math, reading, and writing scores. The packages I used for this analysis are pandas for reading and manipulating the data, NumPy, Plotly for visualization, and SciPy for statistical analysis and hypothesis testing.

In [1]:
import pandas as pd

In [2]:
sp = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
sp.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [3]:
sp.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [4]:
sp.duplicated().sum()

0

In [5]:
sp.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [6]:
# Determine the type of each variable: text columns and columns with
# fewer than 10 distinct values are treated as categorical
categorical_cols = []
continuous_cols = []

for col in sp.columns:
    if sp[col].dtype == 'object' or sp[col].nunique() < 10:
        categorical_cols.append(col)
    else:
        continuous_cols.append(col)

print('Categorical columns:', categorical_cols)
print('Continuous columns:', continuous_cols)

Categorical columns: ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
Continuous columns: ['math score', 'reading score', 'writing score']


In [7]:
sp.shape

(1000, 8)

1. There are no null values or duplicated values in the dataset.
2. The dataset contains 5 categorical columns: gender, race/ethnicity, parental level of education, lunch, and test preparation course.
3. The dataset contains 3 continuous columns: math score, reading score, and writing score.
4. The shape of the dataset is (1000, 8), i.e. 1000 rows and 8 columns.
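
The checks above can be bundled into one reusable helper. This is only a minimal sketch: the function name `summarize_frame`, the `max_unique` parameter, and the toy data are assumptions made for illustration and are not part of the original notebook. It uses `pd.api.types.is_numeric_dtype` instead of comparing the dtype to `'object'`, so text columns are detected regardless of how pandas stores strings.

```python
import pandas as pd

def summarize_frame(df, max_unique=10):
    """Collect basic data-quality facts about a DataFrame (hypothetical helper)."""
    categorical, continuous = [], []
    for col in df.columns:
        # Non-numeric columns, or columns with few distinct values, count as categorical
        if not pd.api.types.is_numeric_dtype(df[col]) or df[col].nunique() < max_unique:
            categorical.append(col)
        else:
            continuous.append(col)
    return {
        'shape': df.shape,
        'null_values': int(df.isnull().sum().sum()),
        'duplicated_rows': int(df.duplicated().sum()),
        'categorical': categorical,
        'continuous': continuous,
    }

# Made-up toy data, not the real dataset
toy = pd.DataFrame({
    'gender': ['female', 'male'] * 6,
    'math score': [72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58, 65],
})

summary = summarize_frame(toy)
print(summary)
```

On the real data the call would be `summarize_frame(sp)`.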

# Exploratory analysis

In [8]:
stats = sp[['math score', 'writing score', 'reading score']].describe()
print(stats)

       math score  writing score  reading score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      68.054000      69.169000
std      15.16308      15.195657      14.600192
min       0.00000      10.000000      17.000000
25%      57.00000      57.750000      59.000000
50%      66.00000      69.000000      70.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000


In [9]:
import plotly.graph_objs as go
from plotly.offline import iplot
gender_counts = sp['gender'].value_counts()  # distribution of gender
print(gender_counts)
# Donut chart built from the counts computed above
gender_pie = go.Pie(labels=gender_counts.index, values=gender_counts.values, hole=.4)
data = [gender_pie]
layout = go.Layout(title='Distribution of gender')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

female    518
male      482
Name: gender, dtype: int64


In [10]:
race_counts = sp['race/ethnicity'].value_counts()  # race/ethnicity frequency
print(race_counts)
race_bar = go.Bar(x=race_counts.index, y=race_counts.values, marker=dict(color='brown'))
data = [race_bar]
layout = go.Layout(title='Race distribution')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

group C    319
group D    262
group B    190
group E    140
group A     89
Name: race/ethnicity, dtype: int64


In [11]:

PLE_counts = sp['parental level of education'].value_counts()  # parental level of education (PLE) frequency
print(PLE_counts)
PLE_bar = go.Bar(x=PLE_counts.index, y=PLE_counts.values)
data = [PLE_bar]
layout = go.Layout(title='PLE distribution')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64


In [12]:
lunch_counts = sp['lunch'].value_counts()  # distribution of lunch type
print(lunch_counts)
lunch_pie = go.Pie(labels=lunch_counts.index, values=lunch_counts.values, hole=.4)
data = [lunch_pie]
layout = go.Layout(title='Distribution of lunch type')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

standard        645
free/reduced    355
Name: lunch, dtype: int64


In [13]:
TPC_counts = sp['test preparation course'].value_counts()  # test preparation course (TPC) distribution
print(TPC_counts)
TPC_pie = go.Pie(labels=TPC_counts.index, values=TPC_counts.values)
data = [TPC_pie]
layout = go.Layout(title='TPC Distribution')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

none         642
completed    358
Name: test preparation course, dtype: int64


In [14]:

grouped = sp.pivot_table(values=['math score', 'reading score', 'writing score'], index=['parental level of education'], aggfunc='mean')
print(grouped)

graph = go.Bar(x=grouped.index, y=grouped['math score'], name='Math Score')
graph2 = go.Bar(x=grouped.index, y=grouped['reading score'], name='Reading Score')
graph3 = go.Bar(x=grouped.index, y=grouped['writing score'], name='Writing Score')

data = [graph, graph2, graph3]
layout = go.Layout(title='PLE vs Scores', xaxis=dict(title='Parental level of education'), yaxis=dict(title='Scores'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

                             math score  reading score  writing score
parental level of education                                          
associate's degree            67.882883      70.927928      69.896396
bachelor's degree             69.389831      73.000000      73.381356
high school                   62.137755      64.704082      62.448980
master's degree               69.745763      75.372881      75.677966
some college                  67.128319      69.460177      68.840708
some high school              63.497207      66.938547      64.888268


In [15]:

grouped = sp.pivot_table(values=['math score', 'reading score', 'writing score'], index=['race/ethnicity'], aggfunc='mean')
print(grouped)

graph = go.Bar(x=grouped.index, y=grouped['math score'], name='Math Score',marker=dict(color='black'))
graph2 = go.Bar(x=grouped.index, y=grouped['reading score'], name='Reading Score',marker=dict(color='#6e9de3'))
graph3 = go.Bar(x=grouped.index, y=grouped['writing score'], name='Writing Score',marker=dict(color='brown'))

data = [graph, graph2, graph3]
layout = go.Layout(title='Race vs Scores', xaxis=dict(title='Race'), yaxis=dict(title='Scores'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

                math score  reading score  writing score
race/ethnicity                                          
group A          61.629213      64.674157      62.674157
group B          63.452632      67.352632      65.600000
group C          64.463950      69.103448      67.827586
group D          67.362595      70.030534      70.145038
group E          73.821429      73.028571      71.407143


In [16]:
grouped = sp.pivot_table(values=['math score', 'reading score', 'writing score'], index=['gender'], aggfunc='mean')
print(grouped)
graph = go.Bar(x=grouped.index, y=grouped['math score'], name='Math Score')
graph2 = go.Bar(x=grouped.index, y=grouped['reading score'], name='Reading Score')
graph3 = go.Bar(x=grouped.index, y=grouped['writing score'], name='Writing Score')

data = [graph, graph2, graph3]
layout = go.Layout(title='Gender vs Scores', xaxis=dict(title='Gender'), yaxis=dict(title='Scores'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

        math score  reading score  writing score
gender                                          
female   63.633205      72.608108      72.467181
male     68.728216      65.473029      63.311203


In [17]:
grouped = sp.pivot_table(values=['math score', 'reading score', 'writing score'], index=['lunch'], aggfunc='mean')
print(grouped)
graph = go.Bar(x=grouped.index, y=grouped['math score'], name='Math Score')
graph2 = go.Bar(x=grouped.index, y=grouped['reading score'], name='Reading Score')
graph3 = go.Bar(x=grouped.index, y=grouped['writing score'], name='Writing Score')

data = [graph, graph2, graph3]
layout = go.Layout(title='Lunch vs Scores', xaxis=dict(title='Lunch'), yaxis=dict(title='Scores'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

              math score  reading score  writing score
lunch                                                 
free/reduced   58.921127      64.653521      63.022535
standard       70.034109      71.654264      70.823256


In [18]:
grouped = sp.pivot_table(values=['math score', 'reading score', 'writing score'], index=['test preparation course'], aggfunc='mean')
print(grouped)

graph = go.Bar(x=grouped.index, y=grouped['math score'], name='Math Score', marker=dict(color='black'))
graph2 = go.Bar(x=grouped.index, y=grouped['reading score'], name='Reading Score', marker=dict(color='#6e9de3'))
graph3 = go.Bar(x=grouped.index, y=grouped['writing score'], name='Writing Score', marker=dict(color='brown'))

data = [graph, graph2, graph3]
layout = go.Layout(title='TPC vs Scores', xaxis=dict(title='TPC'), yaxis=dict(title='Scores'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

                         math score  reading score  writing score
test preparation course                                          
completed                 69.695531      73.893855      74.418994
none                      64.077882      66.534268      64.504673
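
The five cells above repeat the same pivot for each categorical column. A loop over the column names produces the same mean tables in one pass. This is a sketch under assumptions: the helper name `mean_scores_by` and the toy rows are made up for illustration, and `groupby(...).mean()` is used here in place of `pivot_table` (for a single index column with `aggfunc='mean'` the two give the same group means).

```python
import pandas as pd

score_cols = ['math score', 'reading score', 'writing score']

def mean_scores_by(df, group_col, scores=score_cols):
    """Mean of each score column within each level of group_col (hypothetical helper)."""
    return df.groupby(group_col)[scores].mean()

# Made-up toy data, not the real dataset
toy = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard', 'free/reduced'],
    'test preparation course': ['none', 'none', 'completed', 'completed'],
    'math score': [70, 50, 80, 60],
    'reading score': [72, 58, 84, 66],
    'writing score': [74, 54, 86, 70],
})

# One table of group means per categorical column
tables = {col: mean_scores_by(toy, col) for col in ['lunch', 'test preparation course']}
for col, table in tables.items():
    print(table, end='\n\n')
```

With the real data, the dictionary comprehension would iterate over `categorical_cols` and use `sp` instead of `toy`.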


# Statistical Analysis

## Relationship between the scores

In [19]:
from scipy.stats import shapiro
import plotly.express as px

# Shapiro-Wilk normality test. Passing a list of columns makes shapiro flatten
# them, so the three score columns are tested as one pooled sample.
stat, p = shapiro([sp['math score'], sp['reading score'], sp['writing score']])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Sample is normally distributed (fail to reject H0)')
else:
    print('Sample is not norm dist (reject H0)')

# Normality is rejected, so use the rank-based Spearman correlation
corr = sp[['math score', 'reading score', 'writing score']].corr(method='spearman')
print(corr)

# Heatmap of the correlation matrix
fig = px.imshow(corr)
fig.show()

Statistics=0.993, p=0.000
Sample is not norm dist (reject H0)
               math score  reading score  writing score
math score       1.000000       0.804064       0.778339
reading score    0.804064       1.000000       0.948953
writing score    0.778339       0.948953       1.000000
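
The normality check above pools the three score columns into one sample. A per-column check, together with a p-value for one of the Spearman coefficients, could look like the sketch below. The synthetic scores are an assumption made so the example runs on its own; they are not the real data, so no results from it should be read as findings about the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro, spearmanr

# Synthetic, correlated scores (made up for illustration only)
rng = np.random.default_rng(0)
base = rng.normal(66, 15, size=200)
toy = pd.DataFrame({
    'math score': base + rng.normal(0, 8, size=200),
    'reading score': base + rng.normal(0, 5, size=200),
    'writing score': base + rng.normal(0, 5, size=200),
})

# Shapiro-Wilk test applied to each score column separately
normality = {}
for col in toy.columns:
    stat, p = shapiro(toy[col])
    normality[col] = (stat, p)
    print(f'{col}: W={stat:.3f}, p={p:.3f}')

# spearmanr returns the coefficient together with its p-value
rho, p_rho = spearmanr(toy['reading score'], toy['writing score'])
print(f'Spearman rho={rho:.3f}, p={p_rho:.3g}')
```

`DataFrame.corr(method='spearman')` gives the same coefficients but no p-values, which is why `spearmanr` is used for the pairwise test here.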


# Hypotheses Testing

## Parental level of education impacts the scores (writing, math and reading)
* Null hypothesis: Parental level of education does not impact the scores
* Alternative hypothesis: Parental level of education impacts the scores

The planned test is one-way ANOVA, which compares the mean of a continuous variable (a score) across the groups of a categorical variable (parental level of education).

## Normality test
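ANOVA assumes that the scores within each group are approximately normally distributed, so normality is checked first with the Shapiro-Wilk test. If normality is rejected, the non-parametric Kruskal-Wallis test is used instead.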

In [20]:
from scipy.stats import shapiro, kruskal

# factorize replaces the education labels in sp with integer codes (0, 1, 2, ...)
sp['parental level of education'], _ = pd.factorize(sp['parental level of education'])

# shapiro flattens the list, so the codes and the three score columns
# are tested together as one pooled sample
stat, p = shapiro([sp['parental level of education'], sp['math score'], sp['reading score'], sp['writing score']])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Sample is normally distributed (fail to reject H0)')
else:
    print('Sample is not norm dist (reject H0)')

# kruskal treats each argument as a separate group, so this call compares the
# integer codes and the three score columns with one another; it does not
# split the scores by education level
stat, p = kruskal(sp['parental level of education'], sp['math score'], sp['reading score'], sp['writing score'])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('fail to reject H0')
else:
    print('reject H0')

fig = px.box(sp, y=['parental level of education', 'math score', 'reading score', 'writing score'])
fig.show()

Statistics=0.860, p=0.000
Sample is not norm dist (reject H0)
Statistics=2260.418, p=0.000
reject H0
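
To test whether a score differs between education levels, the Kruskal-Wallis test needs one sample of scores per level, rather than the factor codes and the score columns as separate samples. The sketch below shows that grouping step. The helper name `kruskal_by_group` and the toy rows are assumptions for illustration, and the statistic it prints describes the toy data only, not the real dataset.

```python
import pandas as pd
from scipy.stats import kruskal

def kruskal_by_group(df, group_col, score_col):
    """Kruskal-Wallis H test of score_col across the levels of group_col (hypothetical helper)."""
    # One array of scores per category level
    samples = [values.to_numpy() for _, values in df.groupby(group_col)[score_col]]
    return kruskal(*samples)

# Made-up toy data, not the real dataset
toy = pd.DataFrame({
    'parental level of education': (['high school'] * 5
                                    + ['some college'] * 5
                                    + ["master's degree"] * 5),
    'math score': [40, 45, 50, 55, 60, 58, 62, 66, 70, 74, 80, 85, 88, 92, 96],
})

stat, p = kruskal_by_group(toy, 'parental level of education', 'math score')
print(f'H={stat:.3f}, p={p:.4f}')
```

On the real data this would be called once per score column, on the original text labels (before `pd.factorize` overwrites them), for example `kruskal_by_group(sp, 'parental level of education', 'math score')`.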


## Test preparation completions also impact the scores
* Null hypothesis: Test preparation completions do not impact the scores
* Alternative hypothesis: Test preparation completions impact the scores

In [21]:
# factorize replaces the course labels in sp with integer codes
sp['test preparation course'], _ = pd.factorize(sp['test preparation course'])

# Pooled normality check over the codes and the three score columns
stat, p = shapiro([sp['test preparation course'], sp['math score'], sp['reading score'], sp['writing score']])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Sample is normally distributed (fail to reject H0)')
else:
    print('Sample is not norm dist (reject H0)')

# Each argument is treated as a separate group: the codes and the three score columns
stat, p = kruskal(sp['test preparation course'], sp['math score'], sp['reading score'], sp['writing score'])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('fail to reject H0')
else:
    print('reject H0')

fig = px.box(sp, y=['test preparation course', 'math score', 'reading score', 'writing score'])
fig.show()

Statistics=0.845, p=0.000
Sample is not norm dist (reject H0)
Statistics=2270.899, p=0.000
reject H0
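
The test preparation course has only two levels, so a two-sample rank test is a natural alternative to Kruskal-Wallis. The sketch below uses the Mann-Whitney U test from SciPy on made-up scores. It is a swapped-in technique, not the method used in this notebook, and the toy values are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Made-up toy data, not the real dataset
toy = pd.DataFrame({
    'test preparation course': ['completed'] * 6 + ['none'] * 6,
    'writing score': [78, 82, 85, 88, 90, 94, 55, 60, 64, 70, 73, 80],
})

# Split one score column into the two groups being compared
is_completed = toy['test preparation course'] == 'completed'
completed = toy.loc[is_completed, 'writing score']
not_completed = toy.loc[~is_completed, 'writing score']

# Two-sided test: do the two groups tend to have different scores?
stat, p = mannwhitneyu(completed, not_completed, alternative='two-sided')
print(f'U={stat:.1f}, p={p:.4f}')
```

The same split works for lunch type, which also has two levels.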


## Lunch type also impacts the scores
* Null hypothesis: Lunch type does not impact the scores
* Alternative hypothesis: Lunch type impacts the scores

In [22]:
# factorize replaces the lunch labels in sp with integer codes
sp['lunch'], _ = pd.factorize(sp['lunch'])

# Pooled normality check over the codes and the three score columns
stat, p = shapiro([sp['lunch'], sp['math score'], sp['reading score'], sp['writing score']])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Sample is normally distributed (fail to reject H0)')
else:
    print('Sample is not norm dist (reject H0)')

# Each argument is treated as a separate group: the codes and the three score columns
stat, p = kruskal(sp['lunch'], sp['math score'], sp['reading score'], sp['writing score'])

print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('fail to reject H0')
else:
    print('reject H0')

fig = px.box(sp, y=['lunch', 'math score', 'reading score', 'writing score'])
fig.show()

Statistics=0.845, p=0.000
Sample is not norm dist (reject H0)
Statistics=2270.995, p=0.000
reject H0


# Conclusion
The first analysis examined the relationship between the three scores (continuous variables). Spearman correlation was used because the Shapiro-Wilk test rejected normality. All three scores are strongly positively correlated, most strongly reading and writing (0.95).

Next, it was tested whether parental level of education impacts the scores (math, reading and writing). The null hypothesis was that parental level of education does not impact the scores, and the alternative hypothesis was that it does. The null hypothesis was rejected (p < 0.001).

Test preparation course completion was then tested for its impact on the scores. The null hypothesis was that completion does not impact the scores, and the alternative hypothesis was that it does. The null hypothesis was rejected (p < 0.001), meaning that completing or not completing the course had an impact.

Finally, lunch type was tested for its impact on the scores. The null hypothesis was that lunch type does not impact the scores, and the alternative hypothesis was that it does. The null hypothesis was again rejected (p < 0.001).

One-way ANOVA was the planned test for all three hypotheses, but the non-parametric Kruskal-Wallis test was used in its place because the normality assumption was violated.

Note: I would be grateful for any corrections or suggestions for improvement.