<a href="https://www.kaggle.com/code/mostafahafez25/students-performance-in-exams-notebook?scriptVersionId=119648575" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# __Student Performance in Exams Dataset Analysis__
***

![image](https://www.usnews.com/cmsmedia/23/02/a1a88058409d8bc02755e6ae8bfa/200825-studyingbook-stock.jpg)

<a id="top"></a>
## Table of Contents

* [1. Import Libraries](#1)
* [2. Read Dataset](#2)
* [3. Data Wrangling](#3)
    - [3.1 Check Missing Data](#3.1)
    - [3.2 Check Duplicates](#3.2)
    - [3.3 Create Total Score Column](#3.3)
* [4. Exploratory Data Analysis](#4)
    - [4.1 Univariate Analysis](#4.1)
    - [4.2 Bivariate Analysis](#4.2)
* [5. Regression Prediction](#5)
    - [5.1 SGDRegressor](#5.1)
    - [5.2 Ridge Regressor](#5.2)
    - [5.3 Linear Regressor](#5.3)

<a id='1'></a>
## __1. Import Libraries__

In [1]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.figure_factory as ff
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

%matplotlib inline


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/students-performance-in-exams/exams.csv


<a id='2'></a>
## __2. Read Dataset__

In [2]:
df = pd.read_csv('../input/students-performance-in-exams/exams.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,male,group A,high school,standard,completed,67,67,63
1,female,group D,some high school,free/reduced,none,40,59,55
2,male,group E,some college,free/reduced,none,59,60,50
3,male,group B,high school,standard,none,77,78,68
4,male,group E,associate's degree,standard,completed,78,73,68


<a id='3'></a>
## __3. Data Wrangling__

<a id='3.1'></a>
### __3.1 Check Missing Data__

In [3]:
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

There are <mark>__no missing values__</mark> in any column.

<a id='3.2'></a>
### __3.2 Check Duplicates__

In [4]:
df.duplicated().sum()

1

Only <mark>__1 duplicate__</mark> found.

In [5]:
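# Note: drop_duplicates() returns a new DataFrame; df itself is unchanged unless the result is assigned back.
df.drop_duplicates()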

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,male,group A,high school,standard,completed,67,67,63
1,female,group D,some high school,free/reduced,none,40,59,55
2,male,group E,some college,free/reduced,none,59,60,50
3,male,group B,high school,standard,none,77,78,68
4,male,group E,associate's degree,standard,completed,78,73,68
...,...,...,...,...,...,...,...,...
995,male,group C,high school,standard,none,73,70,65
996,male,group D,associate's degree,free/reduced,completed,85,91,92
997,female,group C,some high school,free/reduced,none,32,35,41
998,female,group C,some college,standard,none,73,74,82
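
Because `DataFrame.drop_duplicates()` returns a new frame, the call above only displays the de-duplicated rows and leaves `df` as it was. The following is a minimal sketch of the difference on a toy frame; the names `toy` and `deduped` and the values are illustrative and not taken from `exams.csv`.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values, not from exams.csv).
toy = pd.DataFrame({"math score": [67, 40, 67], "reading score": [67, 59, 67]})

# drop_duplicates() returns a new DataFrame; the original is left untouched.
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))

# Assign the result back (and optionally reset the index) to make the removal stick.
toy = toy.drop_duplicates().reset_index(drop=True)
print(len(toy))
```

Applying the same reassignment to `df` would change the row count, so any statistics computed afterwards would need to be regenerated.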


<a id='3.3'></a>
### __3.3 Create Total Score Column__

In [6]:
df['total_score'] = df['math score'] + df['reading score'] + df['writing score']
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score
0,male,group A,high school,standard,completed,67,67,63,197
1,female,group D,some high school,free/reduced,none,40,59,55,154
2,male,group E,some college,free/reduced,none,59,60,50,169
3,male,group B,high school,standard,none,77,78,68,223
4,male,group E,associate's degree,standard,completed,78,73,68,219


<a id='4'></a>
## __4. Exploratory Data Analysis__

<a id='4.1'></a>
### __4.1 Univariate Analysis__

In [7]:
df.describe(include='all')

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score
count,1000,1000,1000,1000,1000,1000.0,1000.0,1000.0,1000.0
unique,2,5,6,2,2,,,,
top,male,group C,some college,standard,none,,,,
freq,517,323,222,652,665,,,,
mean,,,,,,66.396,69.002,67.738,203.136
std,,,,,,15.402871,14.737272,15.600985,43.542732
min,,,,,,13.0,27.0,23.0,65.0
25%,,,,,,56.0,60.0,58.0,175.75
50%,,,,,,66.5,70.0,68.0,202.0
75%,,,,,,77.0,79.0,79.0,235.0


In [8]:
def univariant_analysis(df, features):
    """Draw one pie chart per categorical feature on a 2x3 grid of subplots."""
    rows, cols = 2, 3
    fig = make_subplots(rows=rows, cols=cols, specs=[[{"type": "pie"}] * cols for _ in range(rows)])
    # Fill the grid left to right, top to bottom; unused cells stay empty.
    for counter, feature in enumerate(features[:rows * cols]):
        vc = df[feature].value_counts()
        fig.add_trace(go.Pie(labels=vc.index, values=vc.values, textinfo='label+percent', insidetextorientation='radial'),
                      row=counter // cols + 1, col=counter % cols + 1)
    fig.update_layout(title='Univariate Analysis', title_x=0.5, title_y=1, showlegend=False)
    fig.show()

In [9]:
features= ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
univariant_analysis(df, features)

* The data is slightly skewed toward <mark>__males, at 51.7%__</mark>.
* There are <mark>__5 ethnicity groups__</mark> in the data, with <mark>__group C__</mark> having the largest share at <mark>__32.3%__</mark>.
* The students' parents are divided into <mark>__6 segments__</mark> based on their education level; the largest, <mark>__22.2%, attended some college__</mark>.
* Most students, <mark>__66.5%, have not taken the test preparation course__</mark>.
* <mark>__65.2%__</mark> of the students had a <mark>__standard lunch__</mark>.
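
The percentages in the pie charts can also be read off numerically. As a minimal sketch, the gender column is rebuilt here from the summary table (517 of 1000 students are male, with two unique values); `gender` and `shares` are illustrative names, and on the real data the same call would be made on `df['gender']`.

```python
import pandas as pd

# Rebuild the gender column from the describe() summary: 517 males out of 1000.
gender = pd.Series(["male"] * 517 + ["female"] * 483, name="gender")

# normalize=True returns fractions instead of counts; scale them to percent.
shares = gender.value_counts(normalize=True).mul(100).round(1)
print(shares)
```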

<a id='4.2'></a>
### __4.2 Bivariate Analysis__

In [10]:
df_score = pd.melt(df, id_vars=['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course'],
                   value_vars=['math score','reading score','writing score'])

<a id='4.2.1'></a>
#### __4.2.1 Test Scores by Gender__

In [11]:
fig = px.box(df_score, x='variable', y= 'value', color= 'gender', labels={'variable':'Skill', 'value':'Score','gender':'Gender'}, title='Test Scores by Gender' )

fig.show()

* <mark>__Male students__</mark> performed better on the <mark>__math exam__</mark>.
* <mark>__Female students__</mark> performed better on both language exams, <mark>__reading and writing__</mark>.

<a id='4.2.2'></a>
#### __4.2.2 Test Scores based upon Test Preparation Course__

In [12]:
fig = px.box(df_score, x='variable', y= 'value', color= 'test preparation course', labels={'variable':'Skill', 'value':'Score','test preparation course':'Test Preparation Course'}
             , title='Test Scores based upon Test Preparation Course')

fig.show()

* Students who <mark>__completed the test preparation course performed better on average__</mark> than those who didn't.
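
A group-wise mean gives a numeric counterpart to the box plot. The sketch below uses only the five preview rows shown in section 2 as stand-in data, so its numbers are not the full-dataset averages; `sample` and `means` are illustrative names, and on the real data `sample` would be replaced by `df`.

```python
import pandas as pd

# The five preview rows from df.head(), restricted to the columns needed here.
sample = pd.DataFrame({
    "test preparation course": ["completed", "none", "none", "none", "completed"],
    "math score": [67, 40, 59, 77, 78],
    "reading score": [67, 59, 60, 78, 73],
    "writing score": [63, 55, 50, 68, 68],
})

score_cols = ["math score", "reading score", "writing score"]

# Mean score per exam for each test-preparation group.
means = sample.groupby("test preparation course")[score_cols].mean()
print(means)
```

The same pattern works for `gender`, `lunch`, or `parental level of education` by changing the grouping column.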

<a id='4.2.3'></a>
#### __4.2.3 Test Scores based upon Lunch__

In [13]:
fig = px.box(df_score, x='variable', y= 'value', color= 'lunch', labels={'variable':'Skill', 'value':'Score','lunch':'Lunch'}
             , title='Test Scores based upon Lunch')

fig.show()

* Students who <mark>__had a standard lunch performed better on average__</mark> than those on a free/reduced lunch.

<a id='4.2.4'></a>
#### __4.2.4 Test Scores based upon Parental Level of Education__

In [14]:
fig = px.box(df_score, x='variable', y= 'value', color= 'parental level of education', labels={'variable':'Skill', 'value':'Score','parental level of education':'Parental Level of Education'}
             , title='Test Scores based upon Parental Level of Education')

fig.show()

* Students whose <mark>__parents had a bachelor's or a master's degree performed slightly better__</mark> than other students.
* Students whose <mark>__parents attended only some high school performed worst__</mark> compared to other students.

<a id='4.2.5'></a>
#### __4.2.5 Test Scores based upon Ethnicity__

In [15]:
fig = px.box(df, x='total_score', y= 'race/ethnicity', color='race/ethnicity', labels={'race/ethnicity':'Ethnicity', 'total_score':'Total Score'}
             , title='Test Scores based upon Ethnicity')

fig.show()

* <mark>__Group E & Group D had the best performance__</mark> with Group E performing slightly better than Group D.

<a id='4.2.6'></a>
#### __4.2.6 Math Score vs Writing Score__

In [16]:
fig = px.scatter(df, x='math score', y= 'writing score', labels={'math score':'Math Score', 'writing score':'Writing Score'}, title='Math Score vs Writing Score')

fig.show()

<a id='4.2.7'></a>
#### __4.2.7 Math Score vs Reading Score__

In [17]:
fig = px.scatter(df, x='math score', y= 'reading score', labels={'math score':'Math Score', 'reading score':'Reading Score'}, title='Math Score vs Reading Score')

fig.show()

<a id='4.2.8'></a>
#### __4.2.8 Writing Score vs Reading Score__

In [18]:
fig = px.scatter(df, x='writing score', y= 'reading score', labels={'writing score':'Writing Score', 'reading score':'Reading Score'}, title='Writing Score vs Reading Score')

fig.show()

* The three scatter plots show that students who performed well on one exam generally performed well on the other two.
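
The relationship visible in the scatter plots can be quantified with a correlation matrix. This is a minimal sketch on the five preview rows from section 2 only, so the coefficients it prints are illustrative rather than the dataset's actual values; `scores` and `corr` are illustrative names, and on the real data the same call would be made on the three score columns of `df`.

```python
import pandas as pd

# Scores from the five preview rows shown by df.head().
scores = pd.DataFrame({
    "math score": [67, 40, 59, 77, 78],
    "reading score": [67, 59, 60, 78, 73],
    "writing score": [63, 55, 50, 68, 68],
})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = scores.corr()
print(corr.round(2))
```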

<a id='4.2.9'></a>
#### __4.2.9 Total Score Distribution based upon Test Preparation Course__

In [19]:
dfc = df[df['test preparation course'] == 'completed']
dfn = df[df['test preparation course'] == 'none']

hist = [dfc['total_score'], dfn['total_score']]
grp_labels = ['completed', 'none']

fig = ff.create_distplot(hist, grp_labels, show_rug=False)

fig.update_layout(title='Total Score Distribution based upon Test Preparation Course', xaxis_title='Total Score', yaxis_title='Density')
fig.show()

<a id='5'></a>
## __5. Regression Prediction__

In [20]:
features1=df[['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']]
X = pd.get_dummies(features1, prefix=['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course'],
                             columns=['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course'])

Y = df['total_score']
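
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` produces on a toy frame with two of the categorical columns; `toy` and `encoded` are illustrative names and the rows are made up. When `columns` is given, the column names are used as prefixes by default, so the explicit `prefix` list in the cell above is optional.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative rows).
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "lunch": ["standard", "free/reduced", "free/reduced"],
})

# One indicator column is created per category, named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["gender", "lunch"])
print(list(encoded.columns))
print(encoded)
```

Each row has exactly one active indicator per original column, which is why the indicators for a column always sum to 1.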

In [21]:
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
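
Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores reported below will vary slightly between runs. A minimal sketch of a reproducible split on toy data follows; the seed value 42 and the names `X_toy`, `y_toy`, `split_a`, and `split_b` are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy inputs and targets (illustrative only).
X_toy = list(range(10))
y_toy = [v * 2 for v in X_toy]

# Passing the same random_state yields the same partition every time.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_split = split_a[1] == split_b[1]  # compare the two test sets
print(same_split, len(split_a[1]))
```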

<a id='5.1'></a>
### __5.1 SGDRegressor__

In [22]:
sgdr = SGDRegressor()
sgdr.fit(train_x, train_y)
yhat = sgdr.predict(test_x)
print(sgdr.score(test_x, test_y))

0.3380623338257993


<a id='5.2'></a>
### __5.2 Ridge Regressor__

In [23]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(train_x, train_y)
y_pred=ridge.predict(test_x)
print(ridge.score(test_x,test_y))

0.3381494825049659


<a id='5.3'></a>
### __5.3 Linear Regressor__

In [24]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_x, train_y)
y_pred=lr.predict(test_x)
print(lr.score(test_x,test_y))

0.34236610297529557
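
For scikit-learn regressors, `.score()` returns the coefficient of determination R², so values around 0.34 mean that roughly a third of the variance in total score is explained by the categorical features on this particular split. The sketch below computes R² by hand to show what the number measures; the helper name `r2` is hypothetical, and the target values are the five total scores from the preview rows.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Total scores from the five preview rows.
y_true = [197, 154, 169, 223, 219]

# Perfect predictions reach the maximum; always predicting the mean scores zero.
perfect = r2(y_true, y_true)
baseline = r2(y_true, [sum(y_true) / len(y_true)] * len(y_true))
print(perfect, baseline)
```

A model that does worse than predicting the mean gets a negative R², which is why the metric has no lower bound.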
