# MACHINE LEARNING                                                  

# Midterm exam 2025

<h1 style="color:red;">Instructions: Read Carefully!</h1>


- **<u>Use this Jupyter notebook</u>** to complete the required tasks and submit it to Moodle. Keep the notebook's section structure and insert the code cells you need in the corresponding sections.

- The notebook must contain the code for your **analysis**, and **it must be reproducible**: set the random seeds to ensure this.

- The **most important part of your work is the comments and interpretation** of the analysis results obtained. **Do not include uncommented figures**. Remember to include a **conclusion section** at the end.                          

- **<u>Use OBS to record your screen</u>**. Upload the video file (max. 500 MB) to Moodle. Alternatively, make sure to copy it to one of the USB drives that will be provided.

- The exam has **two notebooks:** one for the Classification problem (30% of the grade) and this one for the **Regression Problem**. You must submit both of them to Moodle.
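
As a reminder of what the reproducibility requirement involves in practice, here is a minimal sketch. It assumes a hypothetical constant `SEED` (the value is arbitrary) and shows the usual pattern: fix one seed and pass it explicitly to every object that shuffles or samples.

```python
import numpy as np
from sklearn.model_selection import KFold

SEED = 2025  # hypothetical value; any fixed integer works

# Seed NumPy's global generator (affects legacy np.random.* calls).
np.random.seed(SEED)

# Pass the seed explicitly to scikit-learn objects that shuffle or sample.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

# Two generators built from the same seed produce identical draws.
draw_a = np.random.default_rng(SEED).normal(size=3)
draw_b = np.random.default_rng(SEED).normal(size=3)
print(np.allclose(draw_a, draw_b))
```

Estimators with a `random_state` parameter should receive the same kind of explicit seed, so that rerunning the notebook from top to bottom gives the same scores.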


## Statement of the Regression Problem

### Dataset 

+ Look for your student code in the `student_codes.txt` file. Use the corresponding zip file containing the data files for your analysis.  
  **IMPORTANT:** Using the wrong dataset means failing the exam.

+ Load the training set **dfTR_reg_XX.csv** and the test set **dfTS_reg_XX.csv** corresponding to your student code.

+ The dataset contains 7 input variables called X1 to X7. The output numeric variable is called Y.

### External Code and Imports

+ The first code cell below contains the standard imports that we have used in the sessions. They are sufficient for every task in the exam. You do not need to use all of them, and you may add extra imports if you need them.

+ We have also included a Python script `auxiliary_code.py` with two functions called `ResidualPlots` and `explore_outliers` that will be available when you run the second cell in this notebook. 

<h2 style="color:red;">The regression problem has two parts, using the same dataset.</h2>

We **strongly recommend** that you organize your work using the results from the course sessions corresponding to the model you are fitting.

In [2]:
# %matplotlib inline
%config InlineBackend.figure_format = 'png' # 'png', 'retina', 'jpeg', 'svg', 'pdf'

# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Data management libraries
import numpy as np 
import pandas as pd
import scipy.stats as stats

# Connecting statsmodels with sklearn via sklearn2pmml and patsy 
import statsmodels.api as sm
from statsmodels.api import OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor 
from sklearn2pmml.statsmodels import StatsModelsRegressor
import patsy as ps

from patsy_utils.patsy_formulaic_transformer import FormulaTransformer

# Scikit transformers and pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, PowerTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# Scikit metrics and model selection
from sklearn.metrics import root_mean_squared_error, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold

# Scikit-learn regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.ensemble import HistGradientBoostingRegressor



In [None]:
%run -i "auxiliary_code.py"

<h1 style="color:blue;">Regression Part I</h1>


<span style="color:blue;">**IMPORTANT:** </span>
+ <span style="color:blue;">We have provided separate training and test sets for reproducibility, but **the training set may still need preprocessing!** Do not assume that the training data has been thoroughly cleaned. The test set, on the other hand, can be used as is.
</span>
+ <span style="color:blue;">Use RMSE as the evaluation metric for this part of the exam.</span>
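
For reference, RMSE is the square root of the mean squared residual. The sketch below uses made-up toy vectors (not exam data) and computes it both by hand and from `mean_squared_error`; the `root_mean_squared_error` function imported above (scikit-learn 1.4 or later) returns the same quantity directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE by definition: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

RMSE is expressed in the same units as Y, which makes it easier to interpret than MSE.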

### 1.1 Exploratory analysis of the training data



In [None]:
# Use your student code to select the data files.
dfTR = pd.read_csv("./dataRegression/dfTR_reg_XX.csv")
dfTS = pd.read_csv("./dataRegression/dfTS_reg_XX.csv")
dfTR.head()
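
Since the training set may need cleaning, a first pass typically checks missing values, duplicated rows and column types. The sketch below runs on a small synthetic frame called `demo` (a hypothetical stand-in, not the exam data) with one missing value and one duplicated row injected on purpose. The same calls apply to `dfTR`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 3)), columns=["X1", "X2", "Y"])
demo.loc[2, "X1"] = np.nan                                   # inject a missing value
demo = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)  # inject a duplicated row

n_missing = demo.isna().sum()           # missing values per column
n_duplicates = demo.duplicated().sum()  # fully duplicated rows
dtypes = demo.dtypes                    # spot numeric columns read as object

print(n_missing, n_duplicates, dtypes, sep="\n")
print(demo.describe())                  # ranges help to spot implausible values
```

Whatever these checks reveal should be commented on and handled before fitting any model.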

### 1.2. Fit a linear regression model to the training set

+ **Fit an initial linear regression model to the training set to predict the output variable Y using (possibly a subset of) X1, ..., X7 as input variables.**
+ **This is a first and exploratory linear model. Keep it simple.**

#### 1.2.1. Analyze the significance of the model coefficients and the residual plots

#### 1.2.2. Scores for this model

+ **Obtain the training, test and validation scores for this model.**
+ **Store them in a model dictionary like we have done in the course sessions.**
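
The three scores can be gathered as sketched below. This is a sketch under stated assumptions: the data are synthetic, the validation score is taken to be the mean cross-validated RMSE on the training set, and the dictionary name `model_dict` and its keys are hypothetical (the layout used in the course sessions may differ).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=300)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)

# Validation score: cross-validated RMSE on the training set only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = -cross_val_score(LinearRegression(), X_tr, y_tr, cv=cv,
                           scoring="neg_root_mean_squared_error")

model_dict = {}  # hypothetical layout: one entry per fitted model
model_dict["lm_1"] = {
    "model": model,
    "train_rmse": np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))),
    "validation_rmse": cv_rmse.mean(),
    "test_rmse": np.sqrt(mean_squared_error(y_ts, model.predict(X_ts))),
}
print({k: v for k, v in model_dict["lm_1"].items() if k != "model"})
```

Keeping every model under the same keys makes the final comparison a one-line table.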

### 1.3 Second Linear Model

+ **Using the findings from 1.2, fit a second linear model and check whether it improves on the first one.**
+ **For this second model, again analyze the significance of the coefficients and the residual plots, and store the training, test and validation scores in the model dictionary.**

### 1.4 Lasso (Optional) 

+ **Attempt this only if you have time, after completing Part II below.**
+ **Fit a Lasso model to the data. Use grid search to find the best alpha parameter.**
+ **Which variables are selected by this model?**
+ **Analyze the residual plots and store the train, test and validation scores in the model dictionary.**
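
A minimal sketch of the Lasso steps above, on synthetic data where only `X1` and `X2` matter. The alpha grid, the step names `scaler` and `lasso`, and the data are assumptions chosen for illustration. Scaling inside the pipeline matters because the Lasso penalty depends on the scale of each input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only X1 and X2 influence the output.
rng = np.random.default_rng(0)
cols = ["X1", "X2", "X3", "X4", "X5"]
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=cols)
y = 3.0 * X["X1"] - 2.0 * X["X2"] + rng.normal(scale=0.5, size=300)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("lasso", Lasso(max_iter=10_000))])
param_grid = {"lasso__alpha": np.logspace(-3, 1, 9)}  # hypothetical grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=cv,
                    scoring="neg_root_mean_squared_error")
grid.fit(X, y)

best_alpha = grid.best_params_["lasso__alpha"]
coefs = pd.Series(grid.best_estimator_.named_steps["lasso"].coef_, index=cols)
selected = coefs[coefs != 0].index.tolist()  # inputs kept by the Lasso
validation_rmse = -grid.best_score_          # mean CV RMSE at the best alpha

print(best_alpha, selected, validation_rmse)
```

Inputs whose coefficient is exactly zero at the best alpha are the ones the model discards.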

<h1 style="color:blue;">Regression Part II</h1>


<span style="color:blue;">**IMPORTANT:** </span>
+ <span style="color:blue;">We **strongly recommend** that you organize your work using the results from the corresponding session of the course.</span>
+ <span style="color:blue;">In particular we recommend that you start with a fresh version of the data sets, by reloading them. </span>
+ <span style="color:blue;">To speed up your work, keep in mind that this is the same dataset, so you should already have completed the EDA. Do not repeat it; reuse your findings!</span>
+ <span style="color:blue;">But we also advise you to **keep the model dictionary** of the first part and add the models in this part to it for easy model comparison.</span>

In [None]:
# Use your student code to select the data files.
dfTR = pd.read_csv("./dataRegression/dfTR_reg_XX.csv")
dfTS = pd.read_csv("./dataRegression/dfTS_reg_XX.csv")
dfTR.head()

### 1.5. Histogram Gradient Boosting regression

#### 1.5.1. Fit a histogram gradient boosting regression model for the training dataset. 

+ **Use grid search to select the hyperparameters of the model.**

#### 1.5.2. Scores for this model

+ **Obtain the training, test and validation scores for this model.**
+ **Store them in a model dictionary like we have done in the course sessions.**

### 1.6 Model Comparison


#### 1.6.1 Compare all the models in the model dictionary using the training, test and validation scores. 

+ **Remember to use RMSE scores for this comparison.**
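
Once every model is stored under the same keys, the comparison reduces to a small table. In the sketch below the dictionary layout is hypothetical and **the numbers are placeholders invented for illustration, not real results**.

```python
import pandas as pd

# Placeholder scores (NOT real results); keys follow a hypothetical layout.
model_dict = {
    "lm_1": {"train_rmse": 1.20, "validation_rmse": 1.25, "test_rmse": 1.27},
    "lm_2": {"train_rmse": 0.95, "validation_rmse": 1.00, "test_rmse": 1.02},
    "hgb":  {"train_rmse": 0.60, "validation_rmse": 0.90, "test_rmse": 0.93},
}

comparison = pd.DataFrame.from_dict(model_dict, orient="index")

# A large gap between validation and training RMSE hints at overfitting.
comparison["gap"] = comparison["validation_rmse"] - comparison["train_rmse"]

# Rank models by validation RMSE (lower is better).
comparison = comparison.sort_values("validation_rmse")
best_name = comparison.index[0]

print(comparison)
print(best_name)
```

Ranking by the validation score and reporting the test score only for the final choice keeps the test set as an honest estimate of performance on new data.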

#### 1.6.2 Which model would you choose? Why?

+ **State your conclusions for the regression part.**

<h3 style="color:red;">Ok</h3>