# Predicting Students' Maths Performance

## Table Of Contents
- [Introduction](#Introduction)
    - [Dataset Source](#DatasetSource)
    - [Dataset Details](#DatasetDetails)
    - [Dataset Variables](#DatasetVariables)
    - [Response Variable](#ResponseVariable)
    
- [Goals & Objectives](#Goals&Objectives)

- [Data Cleaning & Preprocessing](#DataCleaning&Preprocesseing)

- [Data Exploration & Visualisation](#DataExploration&Visualisation)
    - [Literature Review](#LiteratureReview)
    
- [Summary & Conclusions](#Summary&Conclusions)

- [References](#References)

## Introduction <a id="Introduction"></a>

### Dataset Source <a id="DatasetSource"></a>

The dataset used in this study was obtained from Kaggle. This dataset includes the Maths grades of secondary school students in two Portuguese schools.

### Dataset Details <a id="DatasetDetails"></a>

This dataset records the final Maths scores of secondary school students. It includes features such as the students' school, gender, age, home address, parents' cohabitation status, weekly study time, number of previously failed classes, and the final Maths score achieved, among others. These features should be adequate for a linear regression problem on the students' final scores.

The dataset has a total of 33 features (columns before dropping anything) and 395 observations (rows).

We load the dataset from a local CSV file (Maths.csv), which sits in the same folder as this Jupyter notebook.

In [2]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import requests
import io 
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)


In [3]:
# reading the csv file and naming it 'Maths_Scores'
Maths_Scores = pd.read_csv('Maths.csv')


In [4]:
Maths_Scores.sample(10, random_state=643)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
355,MS,F,18,U,GT3,T,3,3,services,services,course,father,1,2,0,no,yes,no,no,yes,yes,no,yes,5,3,4,1,1,5,0,10,9,9
50,GP,F,16,U,LE3,T,2,2,services,services,course,mother,3,2,0,no,yes,yes,no,yes,yes,yes,no,4,3,3,2,3,4,2,12,13,13
265,GP,M,18,R,LE3,A,3,4,other,other,reputation,mother,2,2,0,no,yes,yes,yes,yes,yes,yes,no,4,2,5,3,4,1,13,17,17,17
263,GP,F,17,U,GT3,T,3,3,other,other,home,mother,1,3,0,no,no,no,yes,no,yes,no,no,3,2,3,1,1,4,4,10,9,9
375,MS,F,18,R,GT3,T,1,1,other,other,home,mother,4,3,0,no,no,no,no,yes,yes,yes,no,4,3,2,1,2,4,2,8,8,10
106,GP,F,15,U,GT3,T,2,2,other,other,course,mother,1,4,0,yes,yes,yes,no,yes,yes,yes,no,5,1,2,1,1,3,8,7,8,8
207,GP,F,16,U,GT3,T,4,3,teacher,other,other,mother,1,2,0,no,no,yes,yes,yes,yes,yes,yes,1,3,2,1,1,1,10,11,12,13
349,MS,M,18,R,GT3,T,3,2,other,other,course,mother,2,1,1,no,yes,no,no,no,yes,yes,no,2,5,5,5,5,5,10,11,13,13
295,GP,M,17,U,GT3,T,3,3,health,other,home,mother,1,1,0,no,yes,yes,no,yes,yes,yes,no,4,4,3,1,3,5,4,14,12,11
140,GP,M,15,U,GT3,T,4,3,teacher,services,course,father,2,4,0,yes,yes,no,no,yes,yes,yes,no,2,2,2,1,1,3,0,7,9,0
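Before going further, it can be worth confirming that the loaded frame matches the documented size. The following is a minimal sketch and not part of the original notebook: `check_shape` is a hypothetical helper, and the inline CSV is a three-row stand-in (values copied from the sample above) so that the snippet runs without Maths.csv.

```python
import io

import pandas as pd


def check_shape(df, expected_rows, expected_cols):
    """Return True when df has exactly the expected number of rows and columns."""
    return df.shape == (expected_rows, expected_cols)


# Stand-in for Maths.csv: three rows and four columns copied from the sample above
csv_text = "school,sex,age,G3\nMS,F,18,9\nGP,F,16,13\nGP,M,18,17\n"
demo_df = pd.read_csv(io.StringIO(csv_text))

shape_ok = check_shape(demo_df, expected_rows=3, expected_cols=4)
print(demo_df.shape, shape_ok)
```

On the full data the equivalent call would be `check_shape(Maths_Scores, 395, 33)`. If it returned False, an index column saved into the CSV would be one possible cause worth checking.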


### Dataset Variables <a id="DatasetVariables"></a>

The variables or features we are going to use in our report are shown in the table below.

In [19]:
from tabulate import tabulate

Variable_table = [['Name','Data Type','Units','Description'],
                 ['school', 'Nominal categorical', 'NA', 'The school that the student attends (GP or MS)'],
                 ['sex', 'Nominal categorical', 'NA', 'The sex of the student (M/Male or F/Female)'],
                 ['age', 'Numerical', 'Years', 'Age of the student'],
                 ['studytime', 'Ordinal categorical', 'Hours', 'Number of hours the student studies per week:\n 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours,\n or 4 - >10 hours'],
                 ['Medu', 'Ordinal categorical', 'NA', 'Highest level of mother\'s education: \n0 - none, 1 - primary education (4th grade), \n2 - 5th to 9th grade, \n3 - secondary education or 4 - higher education'],
                 ['Fedu', 'Ordinal categorical', 'NA', 'Highest level of father\'s education: \n0 - none, 1 - primary education (4th grade), \n2 - 5th to 9th grade, \n3 - secondary education or 4 - higher education'],
                 ['failures', 'Numerical', 'NA', 'Number of courses the student has failed previously: \nn if 0<=n<=3, else 4'],
                 ['absences', 'Numerical', 'NA', 'Number of times the student was absent'],
                 ['famrel', 'Ordinal categorical', 'NA', 'Quality of family relationships (from 1 - very bad to 5 - excellent)'],
                 ['Pstatus', 'Nominal categorical', 'NA', 'Parents\' cohabitation status (\'T\' - living together or \'A\' - apart)'],
                 ['G3', 'Numerical', 'NA', 'Final Maths grade (0 - 20)']]


print(tabulate(Variable_table, headers='firstrow', tablefmt='fancy_grid'))

╒═══════════╤═════════════════════╤═════════╤══════════════════════════════════════════════════════════════════════╕
│ Name      │ Data Type           │ Units   │ Description                                                          │
╞═══════════╪═════════════════════╪═════════╪══════════════════════════════════════════════════════════════════════╡
│ school    │ Nominal categorical │ NA      │ The school that the student attends (GP or MS)                       │
├───────────┼─────────────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│ sex       │ Nominal categorical │ NA      │ The sex of the student (M/Male or F/Female)                          │
├───────────┼─────────────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│ age       │ Numerical           │ Years   │ Age of the student                                                   │
├───────────┼─────────────────────┼─────────┼───────────────────

### Response Variable <a id="ResponseVariable"></a>

The target variable for this report is 'G3', the students' final Maths grade. We will examine how this grade changes with the different explanatory variables.

## Goals & Objectives <a id="Goals&Objectives"></a>

Many factors can affect students' performance in school. We have decided to analyse a dataset that describes the conditions the students are under and how these may affect their grades. This lets us see whether certain conditions are associated with higher or lower grades. The conditions assessed include study time, travel time, extra paid classes, extracurricular activities and more. By looking at the correlation between these conditions and the grades, we can see how students might improve their grades.
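As an illustration of the correlation step, the sketch below computes the Pearson correlation of a few numerical columns with G3 and ranks them by absolute strength. It uses only the ten rows sampled earlier in this notebook, so the numbers are illustrative rather than findings; the variable names are assumptions, and a correlation on its own does not show that changing a condition would change a grade.

```python
import pandas as pd

# studytime, failures, absences and G3 from the ten rows sampled earlier
sample_rows = pd.DataFrame({
    "studytime": [2, 2, 2, 3, 3, 4, 2, 1, 1, 4],
    "failures": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "absences": [0, 2, 13, 4, 2, 8, 10, 10, 4, 0],
    "G3": [9, 13, 17, 9, 10, 8, 13, 13, 11, 0],
})

# Pearson correlation of every explanatory column with the response G3
corr_with_g3 = sample_rows.corr()["G3"].drop("G3")

# Rank by absolute strength, strongest first, keeping the sign of each value
order = corr_with_g3.abs().sort_values(ascending=False).index
ranked = corr_with_g3.reindex(order)
print(ranked)
```

The same two lines would apply to the full numerical columns of the dataset once it is loaded.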

Our main objectives are to identify which conditions affect the students' grades the most and which conditions would need to change for the grades to improve. A further objective is to predict the grades students will get from the conditions they were in.

We would expect the amount of study time to be related to a student's grade. However, this may not always be the case, which is why we need more information to predict the students' grades accurately.
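To make the prediction objective concrete, here is a minimal sketch of an ordinary least squares fit with several explanatory variables. It is an assumption about how such a model could be set up, not the notebook's own model: it uses plain NumPy, only three predictors, and only the ten rows sampled earlier, so the fitted coefficients carry no real meaning.

```python
import numpy as np

# Ten rows sampled earlier in this notebook: studytime, failures, absences -> G3
features = np.array([
    [2, 0, 0],
    [2, 0, 2],
    [2, 0, 13],
    [3, 0, 4],
    [3, 0, 2],
    [4, 0, 8],
    [2, 0, 10],
    [1, 1, 10],
    [1, 0, 4],
    [4, 0, 0],
], dtype=float)
g3 = np.array([9, 13, 17, 9, 10, 8, 13, 13, 11, 0], dtype=float)

# Prepend a column of ones so the model has an intercept
design = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: minimise ||design @ coef - g3||^2
coef, *_ = np.linalg.lstsq(design, g3, rcond=None)

predicted = design @ coef
rmse = float(np.sqrt(np.mean((g3 - predicted) ** 2)))
print("coefficients (intercept first):", coef)
print("in-sample RMSE:", rmse)
```

The error here is measured on the same rows used for fitting; an honest assessment on the full dataset would hold out part of the 395 observations for testing.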

## Data Cleaning & Preprocessing <a id="DataCleaning&Preprocesseing"></a>

In [26]:
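# Take the first element of every row in Variable_table (the variable names);
# the header entry 'Name' is not a dataset column, so intersection() ignores it
column_names = list(zip(*Variable_table))[0]
# Keep only the selected columns; .copy() gives an independent DataFrame
scores_filtered_df = Maths_Scores[Maths_Scores.columns.intersection(column_names)].copy()
scores_filtered_df.head()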

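|   | school | sex | age | Pstatus | Medu | Fedu | studytime | failures | famrel | absences | G3 |
|---|--------|-----|-----|---------|------|------|-----------|----------|--------|----------|----|
| 0 | GP | F | 18 | A | 4 | 4 | 2 | 0 | 4 | 6 | 6 |
| 1 | GP | F | 17 | T | 1 | 1 | 2 | 0 | 5 | 4 | 6 |
| 2 | GP | F | 15 | T | 1 | 1 | 2 | 3 | 4 | 10 | 10 |
| 3 | GP | F | 15 | T | 4 | 2 | 3 | 0 | 3 | 2 | 15 |
| 4 | GP | F | 16 | T | 3 | 3 | 2 | 0 | 4 | 4 | 10 |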
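Because the stated aim is linear regression, the nominal columns (school, sex, Pstatus) would need a numeric encoding at some point. One common option, shown here as an assumption rather than as the notebook's chosen method, is one-hot encoding with `pd.get_dummies`. The rows are five of those sampled earlier in this notebook.

```python
import pandas as pd

# Five of the rows sampled earlier in this notebook (nominal columns plus G3)
rows = pd.DataFrame({
    "school": ["MS", "GP", "GP", "GP", "MS"],
    "sex": ["F", "F", "M", "F", "F"],
    "Pstatus": ["T", "T", "A", "T", "T"],
    "G3": [9, 13, 17, 9, 10],
})

nominal_columns = ["school", "sex", "Pstatus"]

# drop_first=True removes one level per variable so the indicator columns
# are not perfectly collinear with the regression intercept
encoded = pd.get_dummies(rows, columns=nominal_columns, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Each remaining indicator column is 1 when the row takes that level and 0 otherwise, with the dropped level acting as the baseline.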


In [27]:
scores_filtered_df.describe()

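|       | age | Medu | Fedu | studytime | failures | famrel | absences | G3 |
|-------|-----|------|------|-----------|----------|--------|----------|----|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 2.035443 | 0.334177 | 3.944304 | 5.708861 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.83924 | 0.743651 | 0.896659 | 8.003096 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 0.0 | 4.0 | 0.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 2.0 | 0.0 | 4.0 | 4.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 0.0 | 5.0 | 8.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 75.0 | 20.0 |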


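Every numerical column has a count of 395, so none has missing values, and all minimum and maximum values fall within the valid range of their variable, so the data contains no impossible entries. This range check does not by itself rule out outliers: absences, for instance, has a maximum of 75 against a 75th percentile of 8.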
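To follow up on unusually large values, one common convention (an assumption here, not a method used in the notebook) is Tukey's rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to the quartiles and maximum of absences reported by describe() above; `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25%, 75% and max of 'absences' as reported by describe() above
absences_q1, absences_q3, absences_max = 0.0, 8.0, 75.0

lower, upper = iqr_fences(absences_q1, absences_q3)
max_is_flagged = absences_max > upper
print(lower, upper, max_is_flagged)
```

With these quartiles the upper fence is 20, so the maximum of 75 is flagged. Whether such rows should be kept, capped, or removed is a modelling decision that the rule itself does not settle.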

## Data Exploration & Visualisation <a id="DataExploration&Visualisation"></a>

### Literature Review <a id="LiteratureReview"></a>

## Summary & Conclusions <a id="Summary&Conclusions"></a>

## References <a id="References"></a>

- kaggle.com. (n.d.). Student performance in Maths. [online] Available at: https://www.kaggle.com/code/parvinderkaur21/student-performance-in-maths/data [Accessed 29 Sep. 2022].
