<a href="https://colab.research.google.com/github/PawelG-WWA/learning-features/blob/model-training/PROW_pd4484.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

The project's goal is to analyze a dataset of factors impacting student performance. The dataset comes from [kaggle.com](https://www.kaggle.com/datasets/lainguyn123/student-performance-factors) and provides a comprehensive overview of various factors affecting students' final exam scores. After the analysis, I will introduce some Machine Learning models to classify and cluster students and predict their potential, so that teachers know which students they should focus on more.


# Loading basic libraries

Let's load some basic libraries for the analysis part

In [643]:
import pandas as pd
import numpy as np
from google.colab import userdata

# 1. Data research

Now we need to load and investigate the data to answer some questions:
- What does the data look like?
- Are there any missing values?
  - If some values are missing, how should we fill them? Should we drop observations with missing values?
- What are the types of the values?
- Can we change the types of values to something more meaningful/reasonable?
- What does the dataset represent?
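
Most of these questions can be answered with a handful of pandas calls. The sketch below shows the pattern on a tiny made-up frame (`toy` and its values are hypothetical, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
toy = pd.DataFrame({
    'Hours_Studied': [23, 19, 24, 29],
    'Teacher_Quality': ['Medium', np.nan, 'High', 'Medium'],
})

missing_per_column = toy.isna().sum()          # missing values per column
missing_share = toy.isna().any(axis=1).mean()  # fraction of rows with at least one gap
column_types = toy.dtypes                      # inferred type of every column

print(missing_per_column)
print(f'rows with missing values: {missing_share:.1%}')
print(column_types)
```

The share of incomplete rows is the number that later decides whether dropping them is acceptable.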

In [644]:
# load data into data frame
filepath = userdata.get('studentPerformanceFilePath')

df = pd.read_csv(filepath)
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [645]:
# show shape of the data frame: format (rows, columns)
df.shape

(6607, 20)

In [646]:
# show count of non null values in each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

## 1.1 Data engineering

In this section my focus is to examine the data thoroughly. I will endeavor to understand the meaning and possible implications of variables and how it all relates to the real world context. This will provide me with a comprehensive understanding of the problem domain.

Let's start with investigating possible categorical and boolean variables.

In [647]:
# summary of value counts in potentially categorical/boolean variables
(df['Parental_Involvement'].value_counts(),
 df['Access_to_Resources'].value_counts(),
 df['Extracurricular_Activities'].value_counts(),
 df['Motivation_Level'].value_counts(),
 df['Internet_Access'].value_counts(),
 df['Family_Income'].value_counts(),
 df['Teacher_Quality'].value_counts(),
 df['School_Type'].value_counts(),
 df['Peer_Influence'].value_counts(),
 df['Learning_Disabilities'].value_counts(),
 df['Parental_Education_Level'].value_counts(),
 df['Distance_from_Home'].value_counts(),
 df['Gender'].value_counts())

(Parental_Involvement
 Medium    3362
 High      1908
 Low       1337
 Name: count, dtype: int64,
 Access_to_Resources
 Medium    3319
 High      1975
 Low       1313
 Name: count, dtype: int64,
 Extracurricular_Activities
 Yes    3938
 No     2669
 Name: count, dtype: int64,
 Motivation_Level
 Medium    3351
 Low       1937
 High      1319
 Name: count, dtype: int64,
 Internet_Access
 Yes    6108
 No      499
 Name: count, dtype: int64,
 Family_Income
 Low       2672
 Medium    2666
 High      1269
 Name: count, dtype: int64,
 Teacher_Quality
 Medium    3925
 High      1947
 Low        657
 Name: count, dtype: int64,
 School_Type
 Public     4598
 Private    2009
 Name: count, dtype: int64,
 Peer_Influence
 Positive    2638
 Neutral     2592
 Negative    1377
 Name: count, dtype: int64,
 Learning_Disabilities
 No     5912
 Yes     695
 Name: count, dtype: int64,
 Parental_Education_Level
 High School     3223
 College         1989
 Postgraduate    1305
 Name: count, dtype: int64,
 Dis

In [648]:
# change all properties with 3 values into category

low_high_gradation = ['Low', 'Medium', 'High']
school_gradation = ['High School', 'College', 'Postgraduate']
distance_gradation = ['Near', 'Moderate', 'Far']
peer_influence_gradation = ['Negative', 'Neutral', 'Positive']

def change_type_to_category(column, gradation_type):
  df[column] = pd.Categorical(df[column], categories=gradation_type, ordered=True)

for data_column in df[['Parental_Involvement', 'Access_to_Resources', 'Motivation_Level', 'Family_Income', 'Teacher_Quality']]:
  change_type_to_category(data_column, low_high_gradation)

change_type_to_category('Parental_Education_Level', school_gradation)
change_type_to_category('Distance_from_Home', distance_gradation)
change_type_to_category('Peer_Influence', peer_influence_gradation)
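
Declaring the categories as ordered is what makes the later comparisons and ordinal codes meaningful. A minimal sketch, assuming a made-up `motivation` series rather than the real column:

```python
import pandas as pd

levels = ['Low', 'Medium', 'High']
# Hypothetical sample values, not taken from the dataset
motivation = pd.Series(pd.Categorical(['Medium', 'Low', 'High', 'Low'],
                                      categories=levels, ordered=True))

at_least_medium = motivation >= 'Medium'   # comparisons respect the declared order
codes = motivation.cat.codes               # Low=0, Medium=1, High=2

print(at_least_medium.tolist())
print(codes.tolist())
print(motivation.min(), motivation.max())
```

Any value that is not listed in `categories` becomes NaN during the conversion, so a typo in the gradation lists would silently create missing values.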


In [649]:
# Add boolean properties for those columns which have only 2 values
df['Extracurricular_Activities_boolean'] = df['Extracurricular_Activities'].map({'Yes': True, 'No': False})
df['Internet_Access_boolean'] = df['Internet_Access'].map({'Yes': True, 'No': False})
df['School_Type_IsPublic'] = df['School_Type'].map({'Public': True, 'Private': False})
df['Learning_Disabilities_boolean'] = df['Learning_Disabilities'].map({'Yes': True, 'No': False})
df['IsFemale'] = df['Gender'].map({'Female': True, 'Male': False})
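
One thing worth checking with dictionary-based `map` is that every value was covered, because anything missing from the dictionary turns into NaN. A small sketch with a made-up column (`raw` and the misspelled 'yes' are hypothetical):

```python
import pandas as pd

# Hypothetical column with an unexpected spelling ('yes')
raw = pd.Series(['Yes', 'No', 'yes'])
mapped = raw.map({'Yes': True, 'No': False})

unmapped = raw[mapped.isna()]   # values the dictionary did not cover

print(mapped.tolist())
print(unmapped.tolist())
```

An empty `unmapped` result means the mapping was complete.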

In [650]:
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Extracurricular_Activities_boolean,Internet_Access_boolean,School_Type_IsPublic,Learning_Disabilities_boolean,IsFemale
0,23,84,Low,High,No,7,73,Low,Yes,0,...,No,High School,Near,Male,67,False,True,True,False,False
1,19,64,Low,Medium,No,8,59,Low,Yes,2,...,No,College,Moderate,Female,61,False,True,True,False,True
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,No,Postgraduate,Near,Male,74,True,True,True,False,False
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,...,No,High School,Moderate,Male,71,True,True,True,False,False
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,...,No,College,Near,Female,70,True,True,True,False,True


In [651]:
# Remove data with null values
#
# There is no sense in filling missing teacher quality, parental education level or distance from home.
# Filling with anything would mean using a value that doesn't represent reality at all; it would be a guess.
#
# I decided to remove all rows with missing values as they constitute only ~3.5% of the whole dataset.
df.dropna(subset=['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home'], inplace=True)

In [652]:
# let's see if scores are between 0 and 100. If a score is higher than the maximum value
# or lower than the minimum value, that's probably an error and the score should be clipped to the maximum/minimum
scores_minmax = pd.DataFrame({
    'Previous_Score_min': min(df['Previous_Scores']),
    'Previous_Score_max': max(df['Previous_Scores']),
    'Exam_Score_min': min(df['Exam_Score']),
    'Exam_Score_max': max(df['Exam_Score'])
}, index=[0])

scores_minmax

Unnamed: 0,Previous_Score_min,Previous_Score_max,Exam_Score_min,Exam_Score_max
0,50,100,55,101


In [653]:
# we can see that Exam_Score_max = 101. We need to update all 101 to 100
df.loc[df['Exam_Score'] == 101, 'Exam_Score'] = 100
df['Exam_Score'].describe()

Unnamed: 0,Exam_Score
count,6378.0
mean,67.25196
std,3.912884
min,55.0
25%,65.0
50%,67.0
75%,69.0
max,100.0
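
The fix above targets the single value 101. If more out-of-range values were possible, clipping the whole column would handle both ends at once. A sketch on made-up scores (`scores` is hypothetical):

```python
import pandas as pd

scores = pd.Series([55, 67, 101, 100, -3])   # made-up values, two of them out of range
clipped = scores.clip(lower=0, upper=100)    # anything outside [0, 100] is pulled to the bound

print(clipped.tolist())
```

Clipping assumes that out-of-range scores are data-entry errors rather than, for example, bonus points.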


In [654]:
# this is the final dataset we will work with:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6378 entries, 0 to 6606
Data columns (total 25 columns):
 #   Column                              Non-Null Count  Dtype   
---  ------                              --------------  -----   
 0   Hours_Studied                       6378 non-null   int64   
 1   Attendance                          6378 non-null   int64   
 2   Parental_Involvement                6378 non-null   category
 3   Access_to_Resources                 6378 non-null   category
 4   Extracurricular_Activities          6378 non-null   object  
 5   Sleep_Hours                         6378 non-null   int64   
 6   Previous_Scores                     6378 non-null   int64   
 7   Motivation_Level                    6378 non-null   category
 8   Internet_Access                     6378 non-null   object  
 9   Tutoring_Sessions                   6378 non-null   int64   
 10  Family_Income                       6378 non-null   category
 11  Teacher_Quality                    

## 1.2 Knowing data better

In this section we will focus more on data investigation. We will try to find dependencies between data and proportions between properties.

In general, our goal is to be able to improve exam results. To do that, we need to find out what impacts the exam score the most, what characterizes students who get the lowest or the highest exam score.

Let's find out some basic information and depict some dependencies with plots.

In [655]:
# Import plotly library for creating charts
import plotly.express as px
import plotly.graph_objects as go

In [656]:
# Let's remind ourselves what the dataset looks like
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Extracurricular_Activities_boolean,Internet_Access_boolean,School_Type_IsPublic,Learning_Disabilities_boolean,IsFemale
0,23,84,Low,High,No,7,73,Low,Yes,0,...,No,High School,Near,Male,67,False,True,True,False,False
1,19,64,Low,Medium,No,8,59,Low,Yes,2,...,No,College,Moderate,Female,61,False,True,True,False,True
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,No,Postgraduate,Near,Male,74,True,True,True,False,False
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,...,No,High School,Moderate,Male,71,True,True,True,False,False
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,...,No,College,Near,Female,70,True,True,True,False,True


In [657]:
# let's compare scores from previous and current exam.
#
# We want to take all the data except count, as we know what the count is already and count will look bad on a chart
# because of its huge value
scores_comparison = pd.DataFrame({
      'Previous_Scores': df['Previous_Scores'].describe(),
      'Exam_Score': df['Exam_Score'].describe()
    }).tail(-1)

scores_comparison

Unnamed: 0,Previous_Scores,Exam_Score
mean,75.066165,67.25196
std,14.400389,3.912884
min,50.0,55.0
25%,63.0,65.0
50%,75.0,67.0
75%,88.0,69.0
max,100.0,100.0


In [658]:
fig = go.Figure(
    data = [
        go.Bar(name='Previous_Scores', x=scores_comparison.index, y=scores_comparison['Previous_Scores']),
        go.Bar(name='Exam_Scores', x=scores_comparison.index, y=scores_comparison['Exam_Score'])
    ]
)

fig.update_layout(barmode='group', title='Comparison between Previous_Scores and current Exam_Scores', xaxis_title='property', yaxis_title='score')
fig.show()

# As we can see, the current Exam_Score values in comparison to Previous_Scores have:
# - lower mean and standard deviation
# - higher minimum score and first quartile
# - lower second and third quartiles

## 1.3 Finding score catalysts

Now we have some general idea about the dataset. We compared scores from the previous and current exams. We don't know anything about the exams though: the results are different but we don't know why. It might be, for example, that the current exam was harder than the previous one or that the material was less understandable.

Nevertheless, we can look for factors among students that impact the final result, and that's what we will focus on in this sub-section.

In [659]:
# First, let's map score ranges to letter grades:
# A+: 93-100, A: 87-92, A-: 83-86
# B+: 78-82,  B: 74-77, B-: 70-73
# C+: 67-69,  C: 63-66, C-: 60-62
# D+: 55-59,  D: 50-54, F: 0-49
#
# This setup will help us in grouping, and grouping will simplify the process of finding factors which impact the score
def apply_grade(score):
  if score >= 93:
    return 'A+'
  elif score >= 87:
    return 'A'
  elif score >= 83:
    return 'A-'
  if score >= 78:
    return 'B+'
  elif score >= 74:
    return 'B'
  elif score >= 70:
    return 'B-'
  if score >= 67:
    return 'C+'
  elif score >= 63:
    return 'C'
  elif score >= 60:
    return 'C-'
  if score >= 55:
    return 'D+'
  elif score >= 50:
    return 'D'
  else:
    return 'F'
grade_categories = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'F']
df['Exam_grade'] = [apply_grade(score) for score in df['Exam_Score']]
df['Exam_grade'] = pd.Categorical(df['Exam_grade'], categories=grade_categories, ordered=True)
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Extracurricular_Activities_boolean,Internet_Access_boolean,School_Type_IsPublic,Learning_Disabilities_boolean,IsFemale,Exam_grade
0,23,84,Low,High,No,7,73,Low,Yes,0,...,High School,Near,Male,67,False,True,True,False,False,C+
1,19,64,Low,Medium,No,8,59,Low,Yes,2,...,College,Moderate,Female,61,False,True,True,False,True,C-
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,Postgraduate,Near,Male,74,True,True,True,False,False,B
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,...,High School,Moderate,Male,71,True,True,True,False,False,B-
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,...,College,Near,Female,70,True,True,True,False,True,B-
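
The same thresholds can also be expressed as bin edges and handed to `pd.cut`, which avoids the long `if` chain. This is an alternative sketch, not the approach used in the notebook; `edges` and `labels` are names introduced here, and the sample scores are the Exam_Score values from the rows shown above.

```python
import numpy as np
import pandas as pd

# Lower bound of each grade, copied from apply_grade
edges = [-np.inf, 50, 55, 60, 63, 67, 70, 74, 78, 83, 87, 93, np.inf]
labels = ['F', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

scores = pd.Series([67, 61, 74, 71, 70])
# right=False makes every bin closed on the left: [50, 55), [55, 60), ...
grades = pd.cut(scores, bins=edges, labels=labels, right=False)

print(grades.tolist())
```

Note that `pd.cut` returns an ordered categorical running from F up to A+, which is the reverse of the `grade_categories` order used in the notebook.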


In [660]:
# Now, let's see what fraction of students got each grade
counts = df['Exam_grade'].value_counts()
pd.DataFrame({
    'counts': counts,
    'percentage': round(counts/len(df), 5)
})

Exam_grade,counts,percentage
C,2200,0.34494
C+,2029,0.31812
B-,1355,0.21245
C-,502,0.07871
B,171,0.02681
D+,66,0.01035
A+,19,0.00298
B+,16,0.00251
A,11,0.00172
A-,9,0.00141


In [661]:
grade_counts = df.groupby('Exam_grade', observed=False).count().iloc[:,0].rename('grade_counts')
grade_counts

Exam_grade,grade_counts
A+,19
A,11
A-,9
B+,16
B,171
B-,1355
C+,2029
C,2200
C-,502
D+,66


### 1.3.1 Motivation level impact

As we can see below, the Medium motivation level characterizes many students across all grades. Low motivation is more common among students with lower grades, while High motivation is associated with students getting better grades.

This suggests transforming these values into weighted representations instead of simply assigning 1, 2 and 3 to them.

In [662]:
motivation_factor = df.groupby(['Exam_grade', 'Motivation_Level'], observed=False)['Exam_Score'].count()
motivation_factor = motivation_factor.reset_index(name='count')
motivation_factor = motivation_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
motivation_factor['%'] = round(motivation_factor['count']/motivation_factor['grade_counts'] * 100, 2)

fig = px.bar(
    motivation_factor,
    x='Exam_grade',
    y='%',
    color='Motivation_Level',
    barmode='group',
    title='Exam Grade and Motivation Level Distribution'
)

fig.show()
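
The group, merge and divide steps used for this chart compute the share of each level within a grade. As a more compact alternative, `pd.crosstab` with `normalize='index'` produces the same kind of percentages in one call. The helper name `share_within_grade` and the toy frame are assumptions made for illustration:

```python
import pandas as pd

def share_within_grade(frame, factor, grade_col='Exam_grade'):
    """Percentage of each `factor` level within every grade (rows sum to 100)."""
    table = pd.crosstab(frame[grade_col], frame[factor], normalize='index') * 100
    return table.round(2)

# Hypothetical miniature frame, only to show the shape of the result
toy = pd.DataFrame({
    'Exam_grade': ['B', 'B', 'C', 'C', 'C', 'C'],
    'Motivation_Level': ['High', 'Medium', 'Low', 'Low', 'Medium', 'High'],
})
shares = share_within_grade(toy, 'Motivation_Level')

print(shares)
```

The result is a wide table (one row per grade, one column per level), so it would need reshaping to long form before being passed to `px.bar` with a `color` argument.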

### 1.3.2 Parental involvement

Similarly to motivation level, parental involvement is higher among students with better grades, but it grows only up to a point (grade B) and then stays roughly the same.

We can clearly see, though, that low involvement becomes more common among students with worse results.

In [663]:
parental_involvement_factor = df.groupby(['Exam_grade', 'Parental_Involvement'], observed=False)['Exam_Score'].count()
parental_involvement_factor = parental_involvement_factor.reset_index(name='count')
parental_involvement_factor = parental_involvement_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
parental_involvement_factor['%'] = round(parental_involvement_factor['count']/parental_involvement_factor['grade_counts'] * 100, 2)

fig = px.bar(
    parental_involvement_factor,
    x='Exam_grade',
    y='%',
    color='Parental_Involvement',
    barmode='group',
    title='Exam Grade and Parental Involvement Distribution'
)
fig.show()

### 1.3.3 Access to resources

Low access to educational resources characterizes students with worse results, while high access is typical for students getting better results.

Medium is fairly neutral, so it will be down-weighted while the two other values will carry more weight.


In [664]:
access_to_resources_factor = df.groupby(['Exam_grade', 'Access_to_Resources'], observed=False)['Exam_Score'].count()
access_to_resources_factor = access_to_resources_factor.reset_index(name='count')
access_to_resources_factor = access_to_resources_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
access_to_resources_factor['%'] = round(access_to_resources_factor['count']/access_to_resources_factor['grade_counts'] * 100, 2)

fig = px.bar(
    access_to_resources_factor,
    x='Exam_grade',
    y='%',
    color='Access_to_Resources',
    barmode='group',
    title='Exam Grade and Access to Resources Distribution'
)
fig.show()

### 1.3.4 Distance from home

It looks like distance from home doesn't differ much across grades. It will be removed before creating the model, as it doesn't seem to have an impact on the grades received.

In [665]:
distance_from_home_factor = df.groupby(['Exam_grade', 'Distance_from_Home'], observed=False)['Exam_Score'].count()
distance_from_home_factor = distance_from_home_factor.reset_index(name='count')
distance_from_home_factor = distance_from_home_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
distance_from_home_factor['%'] = round(distance_from_home_factor['count']/distance_from_home_factor['grade_counts'] * 100, 2)

fig = px.bar(
    distance_from_home_factor,
    x='Exam_grade',
    y='%',
    color='Distance_from_Home',
    barmode='group',
    title='Exam Grade and Distance from Home'
)
fig.show()

### 1.3.5 Family income

From the chart it appears that higher family income is associated with better results.

In [666]:
family_income_factor = df.groupby(['Exam_grade', 'Family_Income'], observed=False)['Exam_Score'].count()
family_income_factor = family_income_factor.reset_index(name='count')
family_income_factor = family_income_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
family_income_factor['%'] = round(family_income_factor['count']/family_income_factor['grade_counts'] * 100, 2)
fig = px.bar(
    family_income_factor,
    x='Exam_grade',
    y='%',
    color='Family_Income',
    barmode='group',
    title='Exam Grade and Family Income Distribution'
)
fig.show()

### 1.3.6 Teacher quality

It looks like only high teacher quality may have an impact on the exam result. Medium- and low-quality teachers are represented almost equally.

In [667]:
teacher_quality_factor = df.groupby(['Exam_grade', 'Teacher_Quality'], observed=False)['Exam_Score'].count()
teacher_quality_factor = teacher_quality_factor.reset_index(name='count')
teacher_quality_factor = teacher_quality_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
teacher_quality_factor['%'] = round(teacher_quality_factor['count']/teacher_quality_factor['grade_counts'] * 100, 2)

fig = px.bar(
    teacher_quality_factor,
    x='Exam_grade',
    y='%',
    color='Teacher_Quality',
    barmode='group',
    title='Exam Grade and Teacher Quality Distribution'
)
fig.show()


### 1.3.7 Peer influence

Again, the middle value (Neutral) is not trending in any direction, while we can clearly see that negative influence is associated with worse grades and positive influence with better grades.

In [668]:
peer_influence_factor = df.groupby(['Exam_grade', 'Peer_Influence'], observed=False)['Exam_Score'].count()
peer_influence_factor = peer_influence_factor.reset_index(name='count')
peer_influence_factor = peer_influence_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
peer_influence_factor['%'] = round(peer_influence_factor['count']/peer_influence_factor['grade_counts'] * 100, 2)

fig = px.bar(
    peer_influence_factor,
    x='Exam_grade',
    y='%',
    color='Peer_Influence',
    barmode='group',
    title='Exam Grade and Peer Influence Distribution'
)
fig.show()

### 1.3.8 Parental education level

Students whose parents finished only high school are more likely to get a worse score than students whose parents graduated from college.

This is just raw data and there may be other factors involved, but it is information we can use in further analysis, especially because we can see trends in both directions (lower parental education -> worse grades, higher parental education -> better grades).

In [669]:
parental_education_level_factor = df.groupby(['Exam_grade', 'Parental_Education_Level'], observed=False)['Exam_Score'].count()
parental_education_level_factor = parental_education_level_factor.reset_index(name='count')
parental_education_level_factor = parental_education_level_factor.merge(grade_counts, left_on='Exam_grade', right_index=True)
parental_education_level_factor['%'] = round(parental_education_level_factor['count']/parental_education_level_factor['grade_counts'] * 100, 2)

fig = px.bar(
    parental_education_level_factor,
    x='Exam_grade',
    y='%',
    color='Parental_Education_Level',
    barmode='group',
    title='Exam Grade and Parental Education Level Distribution'
)
fig.show()





# 2. Finding more correlations


In [670]:
from plotly.subplots import make_subplots

## 2.1 Find continuous variables with normal distribution

Before using Pearson's coefficient to look for linear correlation between variables, we need to find out whether these continuous variables are normally distributed.

First, let's remind ourselves what the data looks like.

In [671]:
pd.set_option('display.max_columns', None)
df.head()


Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Extracurricular_Activities_boolean,Internet_Access_boolean,School_Type_IsPublic,Learning_Disabilities_boolean,IsFemale,Exam_grade
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67,False,True,True,False,False,C+
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61,False,True,True,False,True,C-
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74,True,True,True,False,False,B
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71,True,True,True,False,False,B-
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70,True,True,True,False,True,B-


In [672]:
pd.reset_option('display.max_columns')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6378 entries, 0 to 6606
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype   
---  ------                              --------------  -----   
 0   Hours_Studied                       6378 non-null   int64   
 1   Attendance                          6378 non-null   int64   
 2   Parental_Involvement                6378 non-null   category
 3   Access_to_Resources                 6378 non-null   category
 4   Extracurricular_Activities          6378 non-null   object  
 5   Sleep_Hours                         6378 non-null   int64   
 6   Previous_Scores                     6378 non-null   int64   
 7   Motivation_Level                    6378 non-null   category
 8   Internet_Access                     6378 non-null   object  
 9   Tutoring_Sessions                   6378 non-null   int64   
 10  Family_Income                       6378 non-null   category
 11  Teacher_Quality                    

In [673]:
# now, let's create some histograms to see whether variables are normally distributed
fig = make_subplots(rows=2, cols=3)
fig.add_trace(go.Histogram(x=df['Hours_Studied']), row=1, col=1)
fig.add_trace(go.Histogram(x=df['Attendance']), row=1, col=2)
fig.add_trace(go.Histogram(x=df['Sleep_Hours']), row=1, col=3)

fig.add_trace(go.Histogram(x=df['Tutoring_Sessions']), row=2, col=1)
fig.add_trace(go.Histogram(x=df['Physical_Activity']), row=2, col=2)

fig.update_xaxes(title_text='Hours Studied', row=1, col=1)
fig.update_xaxes(title_text='Attendance', row=1, col=2)
fig.update_xaxes(title_text='Hours slept', row=1, col=3)
fig.update_xaxes(title_text='Tutoring_Sessions', row=2, col=1)
fig.update_xaxes(title_text='Physical_Activity', row=2, col=2)

fig.update_layout(title='Distribution of continuous variables', height=700)

fig.show()

We can see that three properties are perfect candidates:
- Hours Studied
- Hours Slept
- Physical Activity

To be sure, we can also run the Shapiro-Wilk test for normality and calculate skewness.

In [674]:
df['Hours_Studied'].skew()

0.016224630345704642

In [675]:
df['Sleep_Hours'].skew()

-0.02681432855871459

In [676]:
df['Physical_Activity'].skew()

-0.03705482879464206

In [677]:
from scipy.stats import shapiro
result = (shapiro(df['Hours_Studied']), shapiro(df['Sleep_Hours']), shapiro(df['Physical_Activity']))
# print the p-value of each test (a plain loop avoids building a throwaway list of None)
for x in result:
  print(x.pvalue)

6.241178609807634e-09
3.508919774013594e-41
1.3859880195408411e-49



scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 6378.

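Although the charts for the candidate properties are bell-shaped and their skewness is almost equal to 0, the Shapiro-Wilk test rejects normality for all three (each p-value is far below 0.05). Note that scipy warns the p-values may not be accurate for N > 5000, so they should be treated as indicative rather than exact.

It means that a better correlation measure would be Spearman's, which works on ranks and does not assume normality.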
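
To make the difference concrete: Spearman's coefficient is Pearson's coefficient computed on ranks, so it captures any monotonic relationship and not only a linear one. A minimal sketch on made-up data (`x` and `y` are hypothetical, not columns from the dataset):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3   # monotonic but clearly non-linear

pearson = x.corr(y)                  # Pearson on the raw values
spearman = x.rank().corr(y.rank())   # Spearman = Pearson on the ranks

print(round(pearson, 3), round(spearman, 3))
```

Because `y` always increases with `x`, the rank-based coefficient is 1 while Pearson's stays below it.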

In [678]:
df[['Hours_Studied', 'Sleep_Hours', 'Physical_Activity', 'Tutoring_Sessions', 'Attendance', 'Exam_Score']].corr(method='spearman')

Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Exam_Score
Hours_Studied,1.0,0.01326,-0.003413,-0.009805,-0.005054,0.482343
Sleep_Hours,0.01326,1.0,-0.001957,-0.004422,-0.016212,-0.006526
Physical_Activity,-0.003413,-0.001957,1.0,0.006986,-0.024908,0.026179
Tutoring_Sessions,-0.009805,-0.004422,0.006986,1.0,0.012652,0.164294
Attendance,-0.005054,-0.016212,-0.024908,0.012652,1.0,0.674109
Exam_Score,0.482343,-0.006526,0.026179,0.164294,0.674109,1.0


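As we can see above, the numerical predictors are practically uncorrelated with each other. Only Attendance (0.67) and Hours_Studied (0.48) show a notable correlation with Exam_Score, and Tutoring_Sessions (0.16) a weak one. We need to look for other correlations. But first, we need to convert all categoricals and booleans to numeric values. Let's create a new data frame for that purpose, so that the original data frame won't be cluttered with extra columns.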

In [679]:
# First, let's copy all numerical values but without Previous_Scores
# We haven't analysed these values and we don't know if they will be helpful for our model.
# .copy() gives an independent frame, so adding columns later doesn't raise SettingWithCopyWarning
df_numerical = df[['Hours_Studied', 'Sleep_Hours', 'Physical_Activity', 'Tutoring_Sessions', 'Attendance']].copy()

## 2.2 Working with non-numerical properties

Now, let's turn all categorical properties into numerical codes. From the subsections of 1.3 in the previous chapter we know that some values, like Medium for motivation, are distributed almost equally among all Exam_Score values. We need to do something about it to emphasize the fact that the extremes impact Exam_Score more than the category in the middle. Let's create an ordinal encoding for these categories first, and then add one-hot encoding.

In [680]:
df_numerical['Motivation_Level_ordinal'] = df['Motivation_Level'].cat.codes
df_numerical['Parental_Involvement_ordinal'] = df['Parental_Involvement'].cat.codes
df_numerical['Access_to_Resources_ordinal'] = df['Access_to_Resources'].cat.codes
df_numerical['Family_Income_ordinal'] = df['Family_Income'].cat.codes
df_numerical['Teacher_Quality_ordinal'] = df['Teacher_Quality'].cat.codes
df_numerical['Peer_Influence_ordinal'] = df['Peer_Influence'].cat.codes
df_numerical['Parental_Education_Level_ordinal'] = df['Parental_Education_Level'].cat.codes
df_numerical['Distance_from_Home_ordinal'] = df['Distance_from_Home'].cat.codes

# and we will add Exam_grade, as this is what we will try to predict with our model. It seems to be easier than
# trying to precisely guess the score
#
# First, we need to prepare a map to invert the ordering of grades - the categorical codes start at A+ = 0 and grow towards F.
# That would show up as a negative correlation; for easier interpretation we want A+ to be the highest and F the lowest grade

grade_map = {
    'F': 1,
    'D': 2,
    'D+': 3,
    'C-': 4,
    'C': 5,
    'C+': 6,
    'B-': 7,
    'B': 8,
    'B+': 9,
    'A-': 10,
    'A': 11,
    'A+': 12
}

df_numerical['Exam_grade_ordinal'] = df['Exam_grade'].map(grade_map)
df_numerical['Exam_grade_ordinal'] = df_numerical['Exam_grade_ordinal'].astype('int8')
df_numerical




Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4
2,24,7,4,2,98,1,1,1,1,1,1,2,0,8
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6
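
The `grade_map` dictionary above encodes the grades by hand. As an alternative sketch (not what this notebook does), an ordered pandas `Categorical` can derive the same numbers from a list of grades sorted from lowest to highest. The `sample` series is invented for illustration.

```python
import pandas as pd

# Grades listed from lowest to highest, in the same order as grade_map
grades_low_to_high = ['F', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

# Hypothetical sample of letter grades
sample = pd.Series(['C+', 'C-', 'B', 'B-'])

# An ordered categorical stores each value as a zero-based position in the category list
ordered = pd.Categorical(sample, categories=grades_low_to_high, ordered=True)

# Shift by one so that F maps to 1 and A+ to 12, as in grade_map
ordinal = pd.Series(ordered.codes, index=sample.index) + 1

print(ordinal.tolist())
```

One difference worth noting: a grade missing from the category list gets code -1 here, whereas `.map(grade_map)` would produce NaN.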


In [681]:
# Let's visualize the grades with a box plot to identify outliers
fig = px.box(df_numerical, y="Exam_grade_ordinal")
fig.show()

# It looks like grades with Exam_grade_ordinal >= 8 are outlier candidates.
# If we recall the grade counts from the previous notebook sections, some grades simply have
# too few observations, so predictions may be poor for grades of B or higher (ordinal number = 8, more than 73 points)

In [682]:
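# let's drop the outliers: keep only grades C- to B- (ordinal 4 to 7);
# .copy() makes the result an independent frame, which avoids SettingWithCopyWarning when columns are added later
df_numerical = df_numerical[(df_numerical['Exam_grade_ordinal'] < 8) & (df_numerical['Exam_grade_ordinal'] > 3)].copy()

# and let's draw the box plot again:
fig = px.box(df_numerical, y=['Exam_grade_ordinal'])
fig.show()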
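
Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically with NumPy. The `values` array is made up, and plotting libraries may compute quartiles slightly differently, so the fences need not match a given plot exactly.

```python
import numpy as np

# Hypothetical ordinal grades, not the notebook's data
values = np.array([4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9, 11])

# Quartiles and interquartile range (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are outlier candidates
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers.tolist()}")
```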

In [683]:
correlation_spearman = df_numerical.corr(method='spearman')
correlation_spearman['Exam_grade_ordinal'].sort_values(ascending=False)

Unnamed: 0,Exam_grade_ordinal
Exam_grade_ordinal,1.0
Attendance,0.632546
Hours_Studied,0.42839
Access_to_Resources_ordinal,0.165027
Parental_Involvement_ordinal,0.146223
Tutoring_Sessions,0.139562
Peer_Influence_ordinal,0.103715
Parental_Education_Level_ordinal,0.102712
Family_Income_ordinal,0.081004
Motivation_Level_ordinal,0.070541
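
Spearman correlation is the Pearson correlation computed on ranks, so it measures monotonic rather than strictly linear relationships, which suits ordinal-coded columns. Here is a small sketch of that equivalence on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

# Spearman correlation as computed in the notebook
spearman = toy.corr(method="spearman").loc["x", "y"]

# The same value obtained as Pearson correlation of the ranks
pearson_on_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

# Plain Pearson correlation on the raw values, for comparison
pearson = toy.corr(method="pearson").loc["x", "y"]

print(spearman, pearson_on_ranks, pearson)
```

Because the relationship is perfectly monotonic, the Spearman value is 1, while the plain Pearson value stays below 1.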


### Encoding boolean properties

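Now let's encode the binary properties (Gender, Learning_Disabilities, School_Type, Internet_Access, Extracurricular_Activities) with one-hot encoding. Because each of them has only two categories, `drop_first=True` leaves a single 0/1 indicator column per property. The indicator marks membership in a category, so the model doesn't read it as a ranking in which one category is more important than the other.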
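
Here is a minimal sketch of what `pd.get_dummies` with `drop_first=True` does to a two-category column. The three-row `toy` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one binary column
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# Categories are sorted (Female, Male); drop_first removes the first one,
# leaving a single indicator column named Gender_Male.
encoded = pd.get_dummies(toy, columns=["Gender"], prefix="Gender", drop_first=True)

# Cast the indicator to a compact integer type, as the notebook does
encoded["IsMale"] = encoded["Gender_Male"].astype("int8")

print(encoded)
```

Dropping the first category avoids a redundant `Gender_Female` column, which would always be the exact complement of `Gender_Male`.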

In [684]:
df = pd.get_dummies(df, columns=['Gender'], prefix='Gender', drop_first=True)

In [685]:
df_numerical['IsMale'] = df['Gender_Male'].astype('int8')
df_numerical






Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal,IsMale
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6,1
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4,0
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7,1
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7,0
5,19,8,3,3,88,1,1,1,1,1,2,2,0,7,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6,0
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6,0
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6,0
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6,0


In [686]:
df = pd.get_dummies(df, columns=['Learning_Disabilities'], prefix='Learning_Disabilities', drop_first=True)

In [687]:
df_numerical['Learning_Disabilities'] = df['Learning_Disabilities_Yes'].astype('int8')
df_numerical






Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal,IsMale,Learning_Disabilities
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6,1,0
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4,0,0
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7,1,0
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7,0,0
5,19,8,3,3,88,1,1,1,1,1,2,2,0,7,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6,0,0
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6,0,0
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6,0,0
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6,0,0


In [688]:
df = pd.get_dummies(df, columns=['School_Type'], prefix='School_type', drop_first=True)

In [689]:
df_numerical['IsPublicSchool'] = df['School_type_Public'].astype('int8')
df_numerical






Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal,IsMale,Learning_Disabilities,IsPublicSchool
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6,1,0,1
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4,0,0,1
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7,1,0,1
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7,0,0,1
5,19,8,3,3,88,1,1,1,1,1,2,2,0,7,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6,0,0,1
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6,0,0,1
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6,0,0,1
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6,0,0,0


In [690]:
df = pd.get_dummies(df, columns=['Internet_Access'], prefix='Internet_Access', drop_first=True)
df


Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Tutoring_Sessions,Family_Income,...,Extracurricular_Activities_boolean,Internet_Access_boolean,School_Type_IsPublic,Learning_Disabilities_boolean,IsFemale,Exam_grade,Gender_Male,Learning_Disabilities_Yes,School_type_Public,Internet_Access_Yes
0,23,84,Low,High,No,7,73,Low,0,Low,...,False,True,True,False,False,C+,True,False,True,True
1,19,64,Low,Medium,No,8,59,Low,2,Medium,...,False,True,True,False,True,C-,False,False,True,True
2,24,98,Medium,Medium,Yes,7,91,Medium,2,Medium,...,True,True,True,False,False,B,True,False,True,True
3,29,89,Low,Medium,Yes,8,98,Medium,1,Medium,...,True,True,True,False,False,B-,True,False,True,True
4,19,92,Medium,Medium,Yes,6,65,Medium,3,Medium,...,True,True,True,False,True,B-,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,High,Medium,No,7,76,Medium,1,High,...,False,True,True,False,True,C+,False,False,True,True
6603,23,76,High,Medium,No,8,81,Medium,3,Low,...,False,True,True,False,True,C+,False,False,True,True
6604,20,90,Medium,Low,Yes,6,65,Low,3,Low,...,True,True,True,False,True,C+,False,False,True,True
6605,10,86,High,High,Yes,6,91,High,2,Low,...,True,True,False,False,True,C+,False,False,False,True


In [691]:
df_numerical['HasInternetAccess'] = df['Internet_Access_Yes'].astype('int8')
df_numerical






Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal,IsMale,Learning_Disabilities,IsPublicSchool,HasInternetAccess
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6,1,0,1,1
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4,0,0,1,1
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7,1,0,1,1
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7,0,0,1,1
5,19,8,3,3,88,1,1,1,1,1,2,2,0,7,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6,0,0,1,1
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6,0,0,1,1
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6,0,0,1,1
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6,0,0,0,1


In [692]:
df = pd.get_dummies(df, columns=['Extracurricular_Activities'], prefix='Extracurricular_Activities', drop_first=True)

In [693]:
df_numerical['HasExtracurricularActivities'] = df['Extracurricular_Activities_Yes'].astype('int8')
df_numerical






Unnamed: 0,Hours_Studied,Sleep_Hours,Physical_Activity,Tutoring_Sessions,Attendance,Motivation_Level_ordinal,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Family_Income_ordinal,Teacher_Quality_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Distance_from_Home_ordinal,Exam_grade_ordinal,IsMale,Learning_Disabilities,IsPublicSchool,HasInternetAccess,HasExtracurricularActivities
0,23,7,3,0,84,0,0,2,0,1,2,0,0,6,1,0,1,1,0
1,19,8,4,2,64,0,0,1,1,1,0,1,1,4,0,0,1,1,0
3,29,8,4,1,89,1,0,1,1,1,0,0,1,7,1,0,1,1,1
4,19,6,4,3,92,1,1,1,1,2,1,1,0,7,0,0,1,1,1
5,19,8,3,3,88,1,1,1,1,1,2,2,0,7,1,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,7,2,1,69,1,2,1,2,1,2,0,0,6,0,0,1,1,0
6603,23,8,2,3,76,1,2,1,0,2,2,0,0,6,0,0,1,1,0
6604,20,6,2,3,90,0,1,0,0,1,0,2,0,6,0,0,1,1,1
6605,10,6,3,2,86,2,2,2,0,1,2,0,2,6,0,0,0,1,1


In [694]:
correlation = df_numerical.corr()
correlation['Exam_grade_ordinal'].sort_values(ascending=False)

Unnamed: 0,Exam_grade_ordinal
Exam_grade_ordinal,1.0
Attendance,0.630791
Hours_Studied,0.438413
Access_to_Resources_ordinal,0.164111
Parental_Involvement_ordinal,0.147351
Tutoring_Sessions,0.144822
Parental_Education_Level_ordinal,0.108092
Peer_Influence_ordinal,0.106645
Family_Income_ordinal,0.081307
Motivation_Level_ordinal,0.071866


In [695]:
data = df_numerical[correlation[correlation['Exam_grade_ordinal'] > 0.10]['Exam_grade_ordinal'].index]
data.sample(5)

Unnamed: 0,Hours_Studied,Tutoring_Sessions,Attendance,Parental_Involvement_ordinal,Access_to_Resources_ordinal,Peer_Influence_ordinal,Parental_Education_Level_ordinal,Exam_grade_ordinal
3929,21,2,99,1,2,1,2,7
4615,18,0,88,1,1,2,1,5
4076,18,1,69,2,2,1,0,5
5497,16,2,61,2,1,0,0,5
74,4,0,100,2,2,2,1,6
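
The cell above keeps the columns whose correlation with the target exceeds 0.10. A possible variation, sketched below on made-up data, compares the absolute value instead so that strongly negative correlations are kept as well. The `toy` frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: one positively related, one negatively related, one unrelated column
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6],
    "pos":    [2, 4, 5, 8, 10, 12],
    "neg":    [12, 10, 8, 5, 4, 2],
    "noise":  [3, 1, 2, 2, 1, 3],
})

# Correlation of every column with the target
corr_with_target = toy.corr()["target"]

# Keep columns whose correlation magnitude exceeds the threshold
threshold = 0.10
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(selected)
```

The target itself is always selected (its self-correlation is 1), so it has to be dropped before the selection is used as a feature matrix.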


# 3. Training the model

In [712]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, root_mean_squared_error
from sklearn.model_selection import train_test_split

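# features: every column except the target; target: the ordinal exam grade
X = df_numerical.drop('Exam_grade_ordinal', axis=1)
y = df_numerical['Exam_grade_ordinal']

# hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lr = LinearRegression()

lr.fit(X_train, y_train)

# the regression output is continuous, so round it to the nearest ordinal grade before scoring
lr_pred = lr.predict(X_test)
rounded_predicts = [round(num) for num in lr_pred]
lr_r2 = r2_score(y_test, rounded_predicts)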

mse = mean_squared_error(y_test, rounded_predicts)
rmse = root_mean_squared_error(y_test, rounded_predicts)

print(f"R2 Score = {lr_r2}")
print(f"MSE = {mse}")
print(f"RMSE = {rmse}")

R2 Score = 0.7779823985784138
MSE = 0.18346111719605696
RMSE = 0.4283236126996234
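
The three numbers above are related: RMSE is the square root of MSE, and R2 compares the squared error against the variance of the true values. Below is a small sketch that computes them by hand with NumPy on rounded predictions. The `y_true` and `y_pred_raw` arrays are made up and are not the notebook's results.

```python
import numpy as np

# Hypothetical true ordinal grades and continuous regression outputs
y_true = np.array([4, 5, 6, 7, 6, 5])
y_pred_raw = np.array([4.4, 5.2, 5.6, 6.7, 6.6, 4.8])

# Round to the nearest integer grade, mirroring the notebook's rounding step
y_pred = np.rint(y_pred_raw).astype(int)

# Mean squared error and its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R2 = {r2}, MSE = {mse}, RMSE = {rmse}")
```

Since the target is an integer grade, an RMSE below 0.5 means the rounded predictions are off by less than half a grade step on average in the root-mean-square sense.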


In [714]:
scatter = go.Scatter(
    x=y_test,
    y=rounded_predicts,
    mode='markers',
    marker=dict(color='blue', opacity=0.5),
    name='Predicted vs Actual'
)

line = go.Scatter(x=[min(y_test), max(y_test)], y=[min(y_test), max(y_test)], mode='lines', name='Perfect Prediction')

fig = go.Figure(data=[scatter, line])
fig.update_layout(title='Predicted vs Actual Exam Grades (ordinal)', xaxis_title='Actual Exam Grade (ordinal)', yaxis_title='Predicted Exam Grade (ordinal)')
fig.show()