
## Introduction

This week we will take a closer look at the possibilities of **linear regression** and at methods to improve the performance of the model. We want to predict the math grade (G1) that a student at a Portuguese school will achieve, based on the available observations.

The original data set contains 33 different features. We will only use a subset of them for our analysis today.

- **school** (GP - Gabriel Pereira, MS - Mousinho da Silveira)

- **sex** (M (Male) or F (Female))

- **age** (From 15 to 22)

- **Medu** (Education of Mother, 0 (none), 1 (4th grade), 2 (5th to 9th grade), 3 (secondary education), 4 (higher education))

- **Fedu** (Education of Father, 0 (none), 1 (4th grade), 2 (5th to 9th grade), 3 (secondary education), 4 (higher education))

- **Mjob** (Job of Mother, 5 different values: teacher, health, services, at_home, other)

- **Fjob** (Job of Father, 5 different values: teacher, health, services, at_home, other)

- **reason** (reason for choosing this school: home, reputation, course, or other)

- **studytime** (weekly study time, coded 1 (less than 2 hours), 2 (2 to 5 hours), 3 (5 to 10 hours), 4 (more than 10 hours))

- **failures** (number of past class failures, 0 to 4)

- **goout** (going out with friends, 1 (very low) to 5 (very high))

- **G1** (first period grade, from 0 to 20)

<p><a href="http://www3.dsi.uminho.pt/pcortez"><strong>Source: Paulo Cortez, University of Minho, Guimaraes, Portugal</strong></a></p>

<https://www.kaggle.com/dipam7/student-grade-prediction?select=student-mat.csv>

## Data Preparation

### Numerical and categorical data

In [None]:
# Imports
import numpy as np
import random

from sklearn.metrics import root_mean_squared_error
from statsmodels.formula.api import ols

# Seed
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

In the cells below, we load the data and select a subset of it. Additionally, we select the numerical columns of the dataframe.

`iloc[:, 0:11]` selects data by position. The part before the comma selects the rows and the part after it selects the columns. We want all rows, so a bare colon is enough. For the columns, `0:11` selects the first 11 columns: positions 0 through 10, because the stop index is excluded.

`loc[:, ["column_name1", "column_name2"]]` works similarly to `iloc` but selects rows and columns by label instead of by position.

Run the code below.

In [None]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/mathgrades.csv")
data.head()

In [None]:
# Creating subset of data
data = data.iloc[:, 0:11]
print(data.head())

# Selecting data
numerical_data = data.loc[:, ["age", "Medu", "Fedu", "studytime", "failures", "goout", "G1"]]
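
As a small side illustration of the two selectors described above, the following sketch uses a made-up dataframe. The name `toy` and all of its values are assumptions for illustration and are not part of the course data.

```python
import pandas as pd

# Toy dataframe; the values are invented for illustration only
toy = pd.DataFrame(
    {
        "age": [15, 16, 17],
        "studytime": [2, 1, 3],
        "goout": [4, 3, 2],
        "G1": [10, 8, 14],
    }
)

# Position-based: all rows, columns at positions 0 and 1 (stop index 2 is excluded)
by_position = toy.iloc[:, 0:2]

# Label-based: all rows, columns picked by name
by_label = toy.loc[:, ["age", "G1"]]

# Label-based slice: unlike iloc, the end label is included
by_label_slice = toy.loc[:, "age":"goout"]

print(list(by_position.columns))     # ['age', 'studytime']
print(list(by_label.columns))        # ['age', 'G1']
print(list(by_label_slice.columns))  # ['age', 'studytime', 'goout']
```

Note the difference in the last two selections: a positional slice excludes its stop index, while a label slice includes its end label.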


### Create training and test sets

Create the usual train-test split (80:20).

Run the code below.

In [None]:
from sklearn.model_selection import train_test_split
data_training, data_test = train_test_split(data, test_size=0.2, random_state=42)

### Baseline Model

Let us create a simple baseline for predicting the math grades by using just the variable studytime. Use the package `statsmodels.formula.api` and the function `summary()` to report the results. Validate the performance on the test set by using the RMSE.
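
If the general workflow is unfamiliar, the following sketch shows the pattern of fitting with a formula, printing the summary, and computing the test RMSE. It runs on synthetic data, so the dataframe `toy` and the columns `x` and `y` are placeholders and not columns of the grades data. The RMSE is computed by hand with NumPy here; `root_mean_squared_error` from `sklearn.metrics` computes the same quantity.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data; x and y are placeholder names, not dataset columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
toy["y"] = 3.0 + 0.5 * toy["x"] + rng.normal(0, 1, size=200)

# Simple 80:20 split by position (the synthetic rows are already in random order)
toy_train, toy_test = toy.iloc[:160], toy.iloc[160:]

# Fit on the training rows only, using the formula "target ~ predictor"
model = ols("y ~ x", data=toy_train).fit()
print(model.summary())

# Predict on the held-out rows and compute the RMSE by hand
predictions = model.predict(toy_test)
rmse = np.sqrt(np.mean((toy_test["y"] - predictions) ** 2))
print(f"Test RMSE: {rmse:.3f}")
```

The RMSE is expressed in the units of the target, so for the grades data it can be read as a typical prediction error in grade points.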

Fill in the code below.

In [None]:
# Enter your code here

### Simple Linear Regression Model

In contrast to the baseline model, create a simple linear regression with the variable age for predicting the math grades. Use the package `statsmodels.formula.api` and the function `summary()` to report the results. Validate the performance on the test set by using the RMSE.

Fill in the code below.

In [None]:
# Enter your code here

## Non-linear transformations

Non-linear transformations are a powerful way to improve our model's performance and fit to the data, since it is unrealistic to expect linear relationships in most real-world problems. Such transformations include interaction terms as well as quadratic, logarithmic, or exponential transformations of the data.
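
In the formula syntax of `statsmodels.formula.api`, such terms can be written directly into the formula string. The sketch below uses synthetic data with a known interaction and a known quadratic effect. The dataframe `toy` and the columns `a`, `b`, and `y` are assumptions for illustration, not columns of the grades data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with a known interaction and a known quadratic effect
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {
        "a": rng.uniform(0, 5, size=300),
        "b": rng.uniform(0, 5, size=300),
    }
)
toy["y"] = (
    1.0
    + 2.0 * toy["a"]
    - 0.3 * toy["a"] ** 2
    + 0.8 * toy["a"] * toy["b"]
    + rng.normal(0, 0.5, size=300)
)

# a:b adds only the product term; a*b would expand to a + b + a:b
# I(...) protects arithmetic so that ** is evaluated as a power
model = ols("y ~ a + b + a:b + I(a**2)", data=toy).fit()
print(model.params)
```

The fitted coefficients should land close to the values used to generate `y`. A logarithmic transformation can be written in the same way, for example `np.log(a)` inside the formula, provided `np` is imported and the column is strictly positive.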

Before we continue with the modeling, let's take a closer look at the correlations between the features. What do you notice?

In [None]:
import seaborn as sns
correlations = data[["age", "Medu", "Fedu", "studytime", "failures", "goout", "G1"]].corr(method="pearson")
sns.heatmap(correlations, cmap="vlag", vmin=-1, vmax=1, annot=True)

The correlation plot might help you to solve the next exercises.

### Extending the model I

Extend the baseline model with the variable age and an interaction term between age and studytime. Report the regression results. Validate the performance on the test set by using the RMSE.

Fill in the code below.

In [None]:
# Enter your code here!

## Non-Linear Transformations II

### Extending the model II

Extend the baseline model with the variable age, a quadratic transformation of age, the variable goout, and an interaction effect between studytime and goout. Report the regression results. Validate the performance on the test set by using the RMSE.

In [None]:
# Enter your code here!

## Summary

You can use the summary section to try out other combinations of variables and transformations. The split data can be found in the variables `data_training` and `data_test`.

In [None]:
# Enter your code here!