
## Introduction

This week we take a closer look at the possibilities of **linear regression** and at methods to improve the performance of the model. We want to predict the math grade (G1) that a student at a Portuguese school will achieve, based on the available observations.

The original data set contains 33 different features. We will use only a subset of them for today's analysis.

- **school** (GP - Gabriel Pereira, MS - Mousinho da Silveira)

- **sex** (M (Male) or F (Female))

- **age** (From 15 to 22)

- **Medu** (Education of Mother, 0 (none), 1 (4th grade), 2 (5th to 9th grade), 3 (secondary education), 4 (higher education))

- **Fedu** (Education of Father, 0 (none), 1 (4th grade), 2 (5th to 9th grade), 3 (secondary education), 4 (higher education))

- **Mjob** (Job of Mother, 5 different values: teacher, health, services, at_home, other)

- **Fjob** (Job of Father, 5 different values: teacher, health, services, at_home, other)

- **reason** (reason for choosing this school: home, reputation, course or other)

- **studytime** (weekly study time, 1 (less than 2 hours), 2 (2 to 5 hours), 3 (5 to 10 hours), 4 (more than 10 hours))

- **failures** (number of past class failures, 0 to 4)

- **goout** (going out with friends, 1 (very low) to 5 (very high))

- **G1** (first period grade, from 0 to 20)

<p><a href="http://www3.dsi.uminho.pt/pcortez"><strong>Source: Paulo Cortez, University of Minho, Guimaraes, Portugal</strong></a></p>

<https://www.kaggle.com/dipam7/student-grade-prediction?select=student-mat.csv>

## Data Preparation

### Numerical and categorical data

In [2]:
# Imports
import numpy as np
import random

from sklearn.metrics import root_mean_squared_error
from statsmodels.formula.api import ols

# Seed
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

In the cells below, we load the data and select a subset of it. Additionally, we select the numerical columns of the DataFrame.

`iloc[:, 0:11]` selects data by position. The part before the comma selects the rows and the part after it selects the columns. Here we want all rows, so a bare colon is enough. For the columns, `0:11` selects the first 11 columns, starting at position 0 and stopping after position 10 (the end of the slice is exclusive).

`loc[:, ["column_name1", "column_name2"]]` works similarly to `iloc` but selects rows and columns by label instead of by position.
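The difference between the two indexers can be sketched on a small frame. This is only an illustration: the frame `toy`, its values, and its row labels are made up and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy frame (not the course data) with string row labels
toy = pd.DataFrame(
    {"age": [18, 17, 15], "studytime": [2, 2, 3], "G1": [5, 5, 7]},
    index=["a", "b", "c"],
)

# Position-based: all rows, columns at positions 0 and 1 (the stop value 2 is excluded)
by_position = toy.iloc[:, 0:2]

# Label-based: all rows, columns chosen by name
by_label = toy.loc[:, ["age", "studytime"]]

# Label slices include both endpoints, unlike position slices
label_slice = toy.loc["a":"b", "age":"G1"]

print(by_position.equals(by_label))
print(label_slice.shape)
```

Both selections return the same two columns, while the label slice keeps its end points `"b"` and `"G1"`.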

Run the code below.

In [3]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/mathgrades.csv")
data.head()

Unnamed: 0,age,Medu,Fedu,school,sex,failures,studytime,goout,Mjob,Fjob,G1
0,18,4,4,GP,F,0,2,4,at_home,teacher,5
1,17,1,1,GP,F,0,2,3,at_home,other,5
2,15,1,1,GP,F,3,2,2,at_home,other,7
3,15,4,2,GP,F,0,3,2,health,services,15
4,16,3,3,GP,F,0,2,2,other,other,6


In [4]:
# Creating subset of data
data = data.iloc[:, 0:11]
print(data.head())

# Selecting data
numerical_data = data.loc[:, ["age", "Medu", "Fedu", "studytime", "failures", "goout", "G1"]]


   age  Medu  Fedu school sex  failures  studytime  goout     Mjob      Fjob  \
0   18     4     4     GP   F         0          2      4  at_home   teacher   
1   17     1     1     GP   F         0          2      3  at_home     other   
2   15     1     1     GP   F         3          2      2  at_home     other   
3   15     4     2     GP   F         0          3      2   health  services   
4   16     3     3     GP   F         0          2      2    other     other   

   G1  
0   5  
1   5  
2   7  
3  15  
4   6  


### Create training and test sets

Create the usual train-test split (80:20).

Run the code below.

In [5]:
from sklearn.model_selection import train_test_split
data_training, data_test = train_test_split(data, test_size=0.2, random_state=42)

### Baseline Model

Let us create a simple baseline for predicting the math grades by using just the variable studytime. Use the package `statsmodels.formula.api` and the function `summary()` to report the results. Validate the performance on the test set by using the RMSE.
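As a reminder, the RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch with NumPy, assuming made-up grades and predictions; the helper `rmse_manual` is a hypothetical name introduced only for this illustration:

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up grades and predictions, for illustration only
y_true = [10, 12, 8, 15]
y_pred = [11, 12, 10, 12]

print(rmse_manual(y_true, y_pred))
```

Because the errors are squared before averaging, large misses weigh more heavily, and the result is expressed in the same units as `G1`.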

Fill in the code below.

In [8]:
from sklearn.metrics import root_mean_squared_error
from statsmodels.formula.api import ols

# Fit the baseline: G1 explained by studytime only
baseline_model = ols("G1 ~ studytime", data=data_training).fit()
print(baseline_model.summary())

# Predict on the test set and evaluate with the RMSE
predictions = baseline_model.predict(data_test)
data_test["predictions"] = predictions
rmse = root_mean_squared_error(data_test["G1"], data_test["predictions"])
print("RMSE:", rmse)

                            OLS Regression Results                            
Dep. Variable:                     G1   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     13.49
Date:                Mon, 18 Nov 2024   Prob (F-statistic):           0.000282
Time:                        16:46:27   Log-Likelihood:                -810.45
No. Observations:                 316   AIC:                             1625.
Df Residuals:                     314   BIC:                             1632.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.3349      0.470     19.861      0.0

### Simple Linear Regression Model

In contrast to the baseline model, create a simple linear regression with the variable age for predicting the math grades. Use the package `statsmodels.formula.api` and the function `summary()` to report the results. Validate the performance on the test set by using the RMSE.

Fill in the code below.

In [10]:
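from statsmodels.formula.api import ols

simple_model = ols("G1 ~ age", data=data_training).fit()
print(simple_model.summary())

# Validate on the test set with the RMSE, as for the baseline model
rmse = root_mean_squared_error(data_test["G1"], simple_model.predict(data_test))
print("RMSE:", rmse)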

                            OLS Regression Results                            
Dep. Variable:                     G1   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.2511
Date:                Mon, 18 Nov 2024   Prob (F-statistic):              0.617
Time:                        16:48:43   Log-Likelihood:                -816.97
No. Observations:                 316   AIC:                             1638.
Df Residuals:                     314   BIC:                             1645.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     12.1322      2.399      5.057      0.0

## Non-linear transformations

Non-linear transformations are a powerful way to improve our model's performance and its fit to the data, since it is unrealistic to expect strictly linear relationships in most real-world problems. Such transformations include so-called interaction terms as well as quadratic, logarithmic, or exponential transformations of the data.
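The effect of such a transformation can be sketched with plain NumPy. The example assumes synthetic data that follow a logarithmic curve; the variable names and the helper `fit_rmse` are hypothetical and serve only this illustration. Note that the model stays linear in its coefficients: only the feature is transformed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a logarithmic relationship (not the course data)
x = np.linspace(1, 100, 200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.3, size=x.size)

def fit_rmse(design, target):
    """Fit ordinary least squares on the design matrix and return the in-sample RMSE."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

ones = np.ones_like(x)
linear_design = np.column_stack([ones, x])       # y ~ 1 + x
log_design = np.column_stack([ones, np.log(x)])  # y ~ 1 + log(x)

rmse_linear = fit_rmse(linear_design, y)
rmse_log = fit_rmse(log_design, y)
print("RMSE with x:     ", rmse_linear)
print("RMSE with log(x):", rmse_log)
```

On these data the log-transformed feature fits clearly better, because the transformation matches the shape of the underlying relationship.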

Before we continue with the modeling, let's take a closer look at the correlations between the features. What do you notice?

In [None]:
import seaborn as sns
correlations = data[["age", "Medu", "Fedu", "studytime", "failures", "goout", "G1"]].corr(method="pearson")
sns.heatmap(correlations, cmap="vlag", vmin=-1, vmax=1, annot=True)

The correlation plot might help you to solve the next exercises.
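Besides reading the heatmap, it can help to rank the features by their correlation with the target. A minimal sketch, assuming a made-up frame `toy` rather than the course data:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration (not the course data)
toy = pd.DataFrame(
    {
        "studytime": [1, 2, 2, 3, 4],
        "goout": [5, 4, 3, 2, 1],
        "G1": [8, 10, 9, 13, 15],
    }
)

correlations = toy.corr(method="pearson")

# Correlation of every feature with the target, strongest absolute value first
with_target = correlations["G1"].drop("G1")
order = with_target.abs().sort_values(ascending=False).index
ranked = with_target.reindex(order)
print(ranked)

# Cross-check one entry against NumPy's implementation
check = np.corrcoef(toy["studytime"], toy["G1"])[0, 1]
```

The same pattern should work on the real data by replacing `toy` with the numerical columns selected earlier.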

### Extending the model I

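Extend the baseline model with the variable age and an interaction term between age and studytime. Note that the formula below, `G1 ~ age + age:studytime`, contains no separate main effect for studytime; writing `age * studytime` instead would include both main effects and the interaction. Report the regression results. Validate the performance on the test set by using the RMSE.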

Run the code below.

In [20]:
from sklearn.metrics import root_mean_squared_error

# Fit G1 on age and the age:studytime interaction
extended_model = ols("G1 ~ age + age:studytime", data=data_training)
extended_model = extended_model.fit()
print(extended_model.summary(slim=True))

# Predict on the test set and evaluate with the RMSE
predictions = extended_model.predict(data_test)
data_test["predictions"] = predictions
rmse = root_mean_squared_error(data_test["G1"], data_test["predictions"])
print("RMSE:", rmse)

                            OLS Regression Results                            
Dep. Variable:                     G1   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.041
No. Observations:                 316   F-statistic:                     7.754
Covariance Type:            nonrobust   Prob (F-statistic):           0.000517
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        12.4005      2.347      5.283      0.000       7.782      17.019
age              -0.1894      0.143     -1.325      0.186      -0.471       0.092
age:studytime     0.0497      0.013      3.905      0.000       0.025       0.075

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
RMSE: 3.7271621062456477
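The interaction syntax used above can be explored on synthetic data. In this sketch the frame `toy` and the columns `a`, `b`, `ab`, and `y` are placeholders, not course variables. It compares `a:b` (product term only) with `a * b` (both main effects plus the product) and checks that the interaction column is simply the element-wise product.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)

# Synthetic data (not the course data); a, b and y are placeholder names
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
noise = rng.normal(scale=0.1, size=100)
toy["y"] = 1 + 2 * toy["a"] - toy["b"] + 0.5 * toy["a"] * toy["b"] + noise

# "a:b" adds only the product term; "a * b" expands to a + b + a:b
only_interaction = ols("y ~ a + a:b", data=toy).fit()
full_interaction = ols("y ~ a * b", data=toy).fit()
print(list(only_interaction.params.index))
print(list(full_interaction.params.index))

# The a:b column is the element-wise product of the two variables
toy["ab"] = toy["a"] * toy["b"]
manual = ols("y ~ a + b + ab", data=toy).fit()
print(full_interaction.params["a:b"], manual.params["ab"])
```

Whether to keep the main effect of the second variable is a modeling decision; leaving it out forces the effect of that variable to be zero whenever the other variable is zero.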


## Non-linear transformations II

### Extending the model II

Extend the baseline model with the variable age, a quadratic transformation of age, the variable goout, and an interaction effect between studytime and goout. Report the regression results. Validate the performance on the test set by using the RMSE.
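One point of formula syntax matters for quadratic terms: inside a statsmodels formula (at least with the patsy formula engine), `**` is a formula operator rather than arithmetic exponentiation, so a squared feature is usually wrapped in `I(...)`. A minimal sketch on synthetic data; the frame `toy` and the columns `x` and `y` are placeholders, not course variables:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)

# Synthetic data (not the course data) with a genuinely quadratic relationship
toy = pd.DataFrame({"x": rng.uniform(-3, 3, size=200)})
toy["y"] = 1 + 0.5 * toy["x"] ** 2 + rng.normal(scale=0.3, size=200)

# As a formula operator, "x**2" collapses to just "x": no squared term is fitted
formula_power = ols("y ~ x + x**2", data=toy).fit()

# I(...) protects the expression so it is evaluated as ordinary arithmetic
true_quadratic = ols("y ~ x + I(x**2)", data=toy).fit()

print(list(formula_power.params.index))
print(list(true_quadratic.params.index))
print(formula_power.rsquared, true_quadratic.rsquared)
```

Counting the rows of the coefficient table is a quick way to check whether a transformed term actually entered the model.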

In [22]:
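from sklearn.metrics import root_mean_squared_error

# Caution: inside a formula, ** is a formula operator, so age**2 reduces to age
# and no squared term is fitted (the coefficient table has no such row).
# Writing I(age**2) instead would add a true quadratic term.
extended_model2 = ols("G1 ~ age + age**2 + goout + studytime:goout", data=data_training)
extended_model2 = extended_model2.fit()
print(extended_model2.summary(slim=True))

# Predict on the test set and evaluate with the RMSE
predictions = extended_model2.predict(data_test)
data_test["predictions"] = predictions
rmse = root_mean_squared_error(data_test["G1"], data_test["predictions"])
print("RMSE:", rmse)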

                            OLS Regression Results                            
Dep. Variable:                     G1   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.041
No. Observations:                 316   F-statistic:                     5.543
Covariance Type:            nonrobust   Prob (F-statistic):            0.00102
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          13.0089      2.365      5.502      0.000       8.356      17.661
age                -0.0642      0.140     -0.457      0.648      -0.341       0.212
goout              -0.7488      0.194     -3.854      0.000      -1.131      -0.366
studytime:goout     0.2144      0.067      3.219      0.001       0.083       0.346

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
R

## Summary

You can use this summary section to try out other combinations of variables and data. The split data can be found in the variables `data_training` and `data_test`.

In [17]:
predictions = extended_model.predict(data_test)
data_test["predictions"] = predictions
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(data_test["G1"], data_test["predictions"])