In this notebook, we try to understand which factors may affect students' performance. To discover correlations between these factors and the results, we will classify the scores into several distinct groups.

Note that the dataset contains five independent variables:

gender: sex of the student (male, female)

race/ethnicity: racial/ethnic group of the student (groups A-E)

parental level of education: highest level of education completed by the parents

lunch: type of lunch the student receives (standard, free/reduced)

test preparation course: whether the student completed the preparation course before the test (none, completed)

In [10]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd 
import numpy as np  

Let's have a look at the dataset.

In [11]:
score_df = pd.read_csv('StudentsPerformance.csv')
score_df.sample(7)

(index),gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
844,female,group D,some high school,free/reduced,completed,40,65,64
909,male,group E,bachelor's degree,standard,completed,70,64,70
887,male,group C,high school,free/reduced,none,54,72,59
582,female,group D,bachelor's degree,free/reduced,none,63,73,78
660,male,group C,some college,free/reduced,none,74,77,73
354,female,group C,some college,standard,none,59,71,70
826,female,group C,associate's degree,free/reduced,completed,56,68,70
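
Because `sample` draws rows at random, the preview changes on every run. If a repeatable preview is wanted, one option is to pass `random_state`. The sketch below uses a small made-up frame (`demo_df` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Made-up scores, only to illustrate the call; not the real data.
demo_df = pd.DataFrame({
    'math score': [72, 47, 90, 65, 58],
    'reading score': [70, 57, 95, 61, 64],
})

# With a fixed random_state, the same rows are drawn on every run.
first = demo_df.sample(3, random_state=42)
second = demo_df.sample(3, random_state=42)
print(first.equals(second))
```

The seed value itself is arbitrary; any fixed integer gives a stable preview.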


We can see that some column names are quite long. Let's shorten them.

In [12]:
score_df.rename(
    columns={
        'race/ethnicity': 'race',
        'parental level of education': 'parent_education',
        'test preparation course': 'prep_course',
        'math score': 'math',
        'reading score': 'reading',
        'writing score': 'writing'
    },
    inplace=True
)

score_df.sample(7)    

(index),gender,race,parent_education,lunch,prep_course,math,reading,writing
817,male,group D,bachelor's degree,free/reduced,completed,61,70,76
319,female,group D,associate's degree,free/reduced,none,56,65,63
9,female,group B,high school,free/reduced,none,38,60,50
793,male,group E,some high school,standard,completed,89,84,77
684,male,group B,some college,standard,completed,62,66,68
939,male,group D,some high school,standard,completed,77,68,69
938,male,group D,some college,standard,completed,85,81,85


This looks much better. Although we shortened the column names, the meaning of each one is still easy to recognize. Next, let's check the data types.

In [13]:
score_df.dtypes

gender              object
race                object
parent_education    object
lunch               object
prep_course         object
math                 int64
reading              int64
writing              int64
dtype: object
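
To see which columns hold text and which labels they contain, the columns can be split by dtype. This is a minimal sketch on a small made-up frame (`demo_df`, `text_cols`, `num_cols` and `labels` are hypothetical names, and the values are invented):

```python
import pandas as pd

# Small made-up frame that mimics a few of the renamed columns.
demo_df = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'lunch': ['standard', 'free/reduced', 'standard'],
    'math': [72, 47, 90],
})

# Split the columns into text-like and numeric ones.
text_cols = demo_df.select_dtypes(exclude='number').columns.tolist()
num_cols = demo_df.select_dtypes(include='number').columns.tolist()

# Distinct labels per text column, sorted for a stable display.
labels = {col: sorted(demo_df[col].unique()) for col in text_cols}
print(text_cols, num_cols)
print(labels)
```

Applied to `score_df`, the same calls would list the five categorical columns and their labels, which helps when deciding how to encode each one.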

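We can see that most of the columns have the type 'object', which here means they hold text labels. This is inconvenient for our analysis, because numerical representations of these features are easier to work with. Let's encode three of them: parent_education, lunch and prep_course. The gender and race columns are left as text in this step.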

In [14]:
# LabelEncoder assigns integer codes in sorted order of the labels.
# The same encoder is refit for each column, so afterwards
# labelencoder.classes_ only holds the labels of the last column.
labelencoder = LabelEncoder()
train_df = score_df.copy()  # score_df keeps the original text labels
for col in ['parent_education', 'lunch', 'prep_course']:
    train_df[col] = labelencoder.fit_transform(train_df[col])

train_df.sample(7)

(index),gender,race,parent_education,lunch,prep_course,math,reading,writing
408,female,group D,2,0,0,52,57,56
827,female,group C,5,1,1,65,69,76
949,female,group E,2,0,0,57,75,73
343,male,group D,0,1,0,67,72,67
541,male,group D,0,0,0,79,82,80
649,female,group D,4,1,0,69,79,81
634,male,group D,5,1,1,84,84,80
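
The integer codes are only readable if we know which label each one stands for. One way to keep that information is to fit a separate encoder per column and read its `classes_` attribute. The sketch below assumes a small made-up frame; `demo_df`, `encoders` and `mapping` are hypothetical names, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, not rows from the real dataset.
demo_df = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],
    'prep_course': ['none', 'completed', 'none'],
})

encoders = {}  # one fitted encoder per column
encoded_df = demo_df.copy()
for col in demo_df.columns:
    enc = LabelEncoder()
    encoded_df[col] = enc.fit_transform(demo_df[col])
    encoders[col] = enc

# classes_ lists the labels in code order: position 0 is code 0, and so on.
mapping = {
    col: {label: code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# inverse_transform turns the codes back into the original labels.
restored = encoders['lunch'].inverse_transform(encoded_df['lunch'])
print(list(restored))
```

Because the codes follow the sorted order of the labels, 'completed' maps to 0 and 'none' to 1 in prep_course. For parent_education the order is alphabetical rather than by level of education, so those codes are better read as identifiers than as a ranking.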
