# Student Grade Predictions

Author: Jade Aidoghie  
Date: 6/10/2024

## 1. Background
This project predicts students' final grades (G3) in secondary education from a variety of demographic, social, and academic features using linear regression. The dataset I'll be using contains information on student performance in mathematics collected from Portuguese schools.
  
> Data: [Student Performance - UC Irvine](https://archive.ics.uci.edu/dataset/320/student+performance)

## 2. Preliminary Exploration
In this section I'll explore the data to understand its structure and decide which features can help predict students' final grades.

In [13]:
import pandas as pd
import numpy as np
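import sklearn
import sklearn.model_selection # Imported explicitly so sklearn.model_selection.train_test_split resolves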
import plotly.express as px
import plotly.graph_objects as go
from sklearn import linear_model
from sklearn.utils import shuffle

In [47]:
# Loading in the dataset
data = pd.read_csv('student-mat.csv', sep=';')

data.head() # First 5 rows

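| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |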


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [16]:
data.describe()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |
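
One way to judge which columns might help predict G3 is to look at how strongly each one correlates with it. The sketch below is a plain-Python Pearson correlation, run only on the five rows shown by `data.head()` above, so the printed numbers are illustrative and say nothing reliable about the full 395-row dataset. The helper name `pearson` and the `head_rows` dictionary are made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values copied from the five rows printed by data.head() (illustrative only)
head_rows = {
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "absences": [6, 4, 10, 2, 4],
    "G3": [6, 6, 10, 15, 10],
}

for name in ("G1", "G2", "absences"):
    r = pearson(head_rows[name], head_rows["G3"])
    print(f"{name} vs G3: {r:.2f}")
```

On the real DataFrame, something like `data.corr(numeric_only=True)["G3"]` should give the same statistic for every numeric column at once, assuming a pandas version that supports the `numeric_only` argument.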


## 3. Preprocessing 
The features I'll be investigating:  
  
**G1** - First period grade  
**G2** - Second period grade  
**studytime** - The weekly study time  
**failures** - The number of past class failures  
**absences** - The number of absences  
**goout** - Going out with friends

In [48]:
# Trimming the data so it contains only the chosen features and the target (G3)
data = data[["G1", "G2", "G3", "studytime", "failures", "absences", "goout"]]

In [49]:
# Separating the data
predict = "G3"
X = np.array(data.drop(columns=[predict])) # Features
y = np.array(data[predict]) # Labels

In [50]:
# Splitting the data into training and testing sets:
# 90% of the rows for training and 10% for testing.
# No random_state is passed, so the split (and every score below) changes between runs.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
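
Because the split is random, rerunning the notebook gives a different test set each time. The sketch below shows, in plain Python, the idea behind a seeded 90/10 shuffle split. It is not scikit-learn's implementation; `seeded_split` and its parameters are hypothetical names, and rounding the test size up is a choice made for this sketch.

```python
import math
import random

def seeded_split(n_rows, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) for a shuffled split that repeats for a given seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_test = math.ceil(n_rows * test_fraction)  # this sketch rounds the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = seeded_split(395, test_fraction=0.1, seed=0)
print(len(train_idx), len(test_idx))  # 355 40
```

With scikit-learn itself, passing a fixed `random_state` to `train_test_split` serves the same purpose of making the split repeatable.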

## 4. Linear Regression

In [51]:
linearReg = linear_model.LinearRegression()

In [52]:
linearReg.fit(x_train, y_train) # Fitting the model on the training data
accuracy = linearReg.score(x_test, y_test) # score() returns R^2 (coefficient of determination), not classification accuracy
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.876
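
For a regressor, `score()` reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot, rather than a fraction of correct answers. A minimal plain-Python sketch of that formula follows. The function name `r_squared` and the sample grades are made up for illustration and do not come from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot

# Made-up grades, purely to exercise the function
actual = [10, 12, 8, 15]
print(r_squared(actual, actual))        # perfect predictions
print(r_squared(actual, [11.25] * 4))   # always predicting the mean of `actual`
```

A value of 1 means the predictions match exactly, 0 means the model does no better than always predicting the mean grade, and negative values are possible for models that do worse than that.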


In [53]:
print('Coefficient: \n', linearReg.coef_) # One slope per feature, in column order: G1, G2, studytime, failures, absences, goout
print('Intercept: \n', linearReg.intercept_) # Intercept

Coefficient: 
 [ 0.16117026  0.97921066 -0.19663751 -0.17814961  0.034455    0.07609285]
Intercept: 
 -1.8236140926466415
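
The fitted model is just the intercept plus a weighted sum of the features. As a sanity check, the sketch below recomputes one prediction by hand, using the coefficients and intercept printed above and the feature row of the first prediction printed in the next cell. The function name `predict_g3` is hypothetical.

```python
# Coefficients and intercept copied from the output above,
# in the column order G1, G2, studytime, failures, absences, goout
coef = [0.16117026, 0.97921066, -0.19663751, -0.17814961, 0.034455, 0.07609285]
intercept = -1.8236140926466415

def predict_g3(features):
    """Linear model: intercept + sum(coefficient * feature value)."""
    return intercept + sum(c * x for c, x in zip(coef, features))

# Feature row of the first prediction printed in the next cell
student = [15, 13, 3, 2, 14, 2]
print(f"{predict_g3(student):.2f}")  # 13.01
```

The G2 coefficient (about 0.98) dominates, so in this fit the predicted final grade is close to the second period grade with small adjustments from the other features.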


In [54]:
# Getting the predictions
predictions = linearReg.predict(x_test) 

for pred, actual, features in zip(predictions, y_test, x_test):
    print(f'Predictions: {pred:.2f}, Actual: {actual}, Features: {features}')

Predictions: 13.01, Actual: 13, Features: [15 13  3  2 14  2]
Predictions: 13.32, Actual: 14, Features: [13 13  2  0 14  3]
Predictions: 9.44, Actual: 11, Features: [11 10  3  0  4  2]
Predictions: 3.82, Actual: 0, Features: [6 5 1 1 0 2]
Predictions: -0.87, Actual: 0, Features: [6 0 2 0 0 5]
Predictions: 17.72, Actual: 18, Features: [16 17  1  0  4  5]
Predictions: 9.49, Actual: 10, Features: [10 10  2  0  0  4]
Predictions: 8.33, Actual: 10, Features: [8 9 2 0 4 4]
Predictions: 7.91, Actual: 8, Features: [7 9 3 0 0 5]
Predictions: 5.77, Actual: 0, Features: [8 7 2 3 0 5]
Predictions: 16.20, Actual: 16, Features: [16 16  4  0 12  2]
Predictions: 10.16, Actual: 11, Features: [11 11  4  0  0  3]
Predictions: 12.56, Actual: 13, Features: [13 13  3  0  0  2]
Predictions: 10.68, Actual: 11, Features: [10 11  2  0  6  4]
Predictions: 14.32, Actual: 13, Features: [14 14  1  0  2  4]
Predictions: 9.74, Actual: 10, Features: [11 10  2  1 12  2]
Predictions: 11.26, Actual: 12, Features: [12 12 
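
One of the predictions above is negative (-0.87), although G3 only ranges from 0 to 20 in `data.describe()`. A linear model has no notion of that range. One possible post-processing step, not part of the original notebook, is to clamp predictions to the valid range. The names `GRADE_MIN`, `GRADE_MAX`, and `clip_grade` below are hypothetical.

```python
GRADE_MIN, GRADE_MAX = 0.0, 20.0  # G3 range reported by data.describe()

def clip_grade(prediction):
    """Clamp a raw model output to the valid grade range."""
    return max(GRADE_MIN, min(GRADE_MAX, prediction))

raw = [13.01, -0.87, 17.72]  # three raw predictions from the output above
print([clip_grade(p) for p in raw])  # [13.01, 0.0, 17.72]
```

With NumPy arrays, `np.clip(predictions, 0, 20)` would do the same in one call. Note that clamping changes the predictions being scored, so any metric should be recomputed afterwards.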

## 5. Visualizations

In [56]:
# Creating a DataFrame for plotting
results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})

# Scatter plot of predicted vs. actual values with a best-fit line
# (trendline='ols' requires the statsmodels package to be installed)
fig1 = px.scatter(results, x='Actual', y='Predicted', trendline='ols', title='Actual vs Predicted Grades')
fig1.update_traces(marker=dict(color='black'))
fig1.data[1].line.color = 'red'
fig1.update_layout(xaxis_title='Actual Grades', yaxis_title='Predicted Grades')
fig1.show()

# Creating a DataFrame for residuals
residuals = y_test - predictions
residuals_df = pd.DataFrame({'Predicted': predictions, 'Residuals': residuals})

# Residual Plot
fig2 = px.scatter(residuals_df, x='Predicted', y='Residuals', title='Residual Plot')
fig2.update_traces(marker=dict(color='black'))
fig2.add_hline(y=0, line_dash="dash", line_color="red")
fig2.update_layout(xaxis_title='Predicted Grades', yaxis_title='Residuals')
fig2.show()
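
The residual plot shows the errors visually. Two common numbers that summarize them are the mean absolute error (MAE) and the root mean squared error (RMSE), both in grade points. The sketch below computes them in plain Python on the first five actual/predicted pairs printed in section 4, so the result describes only those five rows. The function names `mae` and `rmse` are made up for this example.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# First five actual/predicted pairs from the printed output in section 4
actual = [13, 14, 11, 0, 0]
predicted = [13.01, 13.32, 9.44, 3.82, -0.87]

print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

In the notebook, the same functions could be applied to `y_test` and `predictions` directly to cover the whole test set.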