<a href="https://colab.research.google.com/github/Sergei-N-Fedorov/Data_Analysis/blob/main/Sergei_Fedorov_DAKD_2025_exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="alert alert-block" style="color: green">
    <h1><center> DAKD 2025 EXERCISE 2: SUPERVISED LEARNING  </center></h1>

### Fill in your name, student id number and email address
#### name:  Sergei Fedorov
#### student id:  2511405
#### email:  sergei.s.fedorov@utu.fi


<i>Machine learning</i> is a subfield of artificial intelligence that provides automatic, objective and data-driven techniques for modeling the data. Its two main branches are <i>supervised learning</i> and <i>unsupervised learning</i>, and in this exercise, we are going to use the former, <font color = green>supervised learning</font>, for classification and regression tasks.

For classification, we will use the Cardio data that was mostly cleaned up. Some data pre-processing steps are still required to ensure that it's in an appropriate format so that models can learn from it. Even though we are not conducting any major data exploration nor data preparation this time, <i>you should never forget it in your future data analyses</i>.

-----

#### General Guidance for Exercises

- <b>Complete all tasks:</b> Make sure to answer all questions, even if you cannot get your script to fully work.

- <b>Code clarity:</b> Write clear and readable code. Include comments to explain what your code does.

- <b>Effective visualizations:</b> Ensure all plots have labeled axes, legends, and captions. Your visualizations should clearly represent the underlying data.

- <b>Notebook organization:</b> You can add more code or markdown cells to improve the structure of your notebook as long as it maintains a logical flow.

- <b>Submission:</b> Submit both the .ipynb and .html or .pdf versions of your notebook. Before finalizing your notebook, use the "Restart & Run All" feature to ensure it runs correctly.

- <b>Quiz:</b> After completing the notebook, you should complete the second exercise quiz in Moodle. Please do not attempt it before doing the notebook, as the questions directly relate to the results you will obtain from it.

- <b>Grading criteria:</b>
    
    - The grading scale is *Fail*/*Pass*/*Pass with honors* (+1).
    
    - To pass, you must complete the required parts and score at least 80% on the quiz. Please note that notebooks may also be checked.
    
    - To achieve Pass with honors, also complete the bonus exercises.

- <b>Technical issues:</b>
    
    - If you encounter problems, start with an online search to find solutions but do not simply copy and paste code. Understand any code you use and integrate it appropriately.
    
    - Cite all external sources used, whether for code or explanations.
    
    - If problems persist, ask for help in the course discussion forum, at exercise sessions, or via email at ankazl@utu.fi, zoher.orabe@utu.fi.

- <b>Use of AI and large language models:</b>
    
    - We do not encourage the use of AI tools like ChatGPT. If you use them, critically evaluate their outputs.
    
    - Describe how you used the AI tools in your work, including your input and how the output was beneficial.

- <b>Time management:</b> Do not leave your work until the last moment. No feedback will be available during weekends.

- <b>Additional notes:</b>
    
    - You can find the specific deadlines and session times for each assignment on the Moodle course page.
    
    - Ensure all your answers are concise—typically a few sentences per question.
    
    - Your .ipynb notebook is expected to be run to completion, which means that it should execute without errors when all cells are run in sequence.

<font color = green> The guided exercise session is held on the 3rd of December at 14:15-16:00, at lecture hall X, Natura building.</font>

<font color = red size = 4>The deadline is the 8th of December at 23:55</font>. Late submissions will not be accepted unless there is a valid excuse for an extension which should be asked **before** the original deadline.


------

### <font color = red> Packages needed for this exercise: </font>

You can use other packages as well, but this exercise can be completed with those below.


In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import KFold # Added for cross-validation in Ridge Regression

# Show the plots inline in the notebook
%matplotlib inline

import numpy as np
import random

# Visualization packages - matplotlib and seaborn
# Remember that pandas is also handy and capable when it comes to plotting!
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning package - scikit-learn
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


# ------------------------------------
# This is NOT necessary, but can be used for better readability of prints

class color:
   GREEN = '\033[92m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   # Ends the styling
   END = '\033[0m'

# And to use this, simply just create it and call the variable you need as
print(color.UNDERLINE + 'This' + color.END,
      color.GREEN + 'is',
      color.RED + 'just',
      color.GREEN + 'an',
      color.BOLD + 'example' + color.END,
      color.RED + ':)'
)


# Load the prostate data (read as comma-separated, the pandas default; pass sep='\t' if the file is tab-separated)
cancerData = pd.read_csv('prostate.csv')


This is just an example :)


### <font color = red> Data needed for this exercise: </font>

You can download the two datasets (Cardio and Prostate) required to complete the notebook from Moodle under the Exercise II section.

### <font color = red> Reproducibility </font>

Before the exercise itself, let's discuss the reproducibility of the experiments we conduct in research. It can be quite a nightmare if code produces different random results on every run.
To address this, we can set a **random seed** to ensure that any random processes, such as splitting our dataset into training and test sets, yield consistent results across multiple runs. By using a fixed random seed, we enhance the reproducibility of our experiments, making it easier to validate findings.
We will use the random seed <font color = red>2025</font>. For stable results and to obtain the correct quiz answers, please <font color = red> do not </font> use any other random seed anywhere in your code.
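As a minimal sketch of what the seed does (the helper `draw_sample` is hypothetical and only for illustration), resetting both generators to the same seed should reproduce exactly the same random draws:

```python
import random
import numpy as np

def draw_sample(seed):
    """Reset both generators to `seed`, then draw a few numbers from each."""
    np.random.seed(seed)
    random.seed(seed)
    return np.random.rand(3), random.random()

first_np, first_py = draw_sample(2025)
second_np, second_py = draw_sample(2025)

# Same seed, so the two runs should produce identical draws
print(np.array_equal(first_np, second_np))
print(first_py == second_py)
```

scikit-learn functions such as `train_test_split` take their own `random_state` argument, which plays the same role for that single call.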


In [2]:
np.random.seed(2025)
random.seed(2025)

______________
## <font color = lightcoral>1. Classification using k-nearest neighbors </font>

We start exploring the world of data modeling by using the <font color = lightcoral>K-Nearest Neighbors (k-NN) algorithm</font>. The k-NN algorithm is a classic supervised machine learning technique based on the assumption that data points with similar features tend to belong to the same class, and thus are likely to be near each other in feature space.

In our case, we'll use the k-NN algorithm to *predict the presence of cardiovascular disease* (CVD) using all the other variables as <font color = lightcoral>features</font> in the given data set. I.e. the <font color = lightcoral>target variable</font> that we are interested in is `cardio`. Let's have a brief look at the features again:

| Feature | Type | Explanation |
| :- | :- | :-
| age | numeric | The age of the patient in years
| sex | binary | Female == 0, Male == 1
| weight | numeric | Measured weight of the patient (kg)
| height | numeric | Measured height of the patient (cm)
| ap_hi | numeric | Measured Systolic blood pressure
| ap_lo | numeric | Measured Diastolic blood pressure
| smoke | binary | A subjective feature based on asking the patient whether or not he/she smokes
| alco | binary | A subjective feature based on asking the patient whether or not he/she consumes alcohol
| active | binary |  A subjective feature based on asking the patient whether or not he/she exercises regularly
| cholesterol | categorical | Cholesterol associated risk information evaluated by a doctor
| gluc | categorical | Glucose associated risk information evaluated by a doctor

But first, we need data for the task. The code for loading the data into the environment is provided for you. The code should work but make sure that you have the CSV file of the data in the same directory where you have this notebook file.

#### **Exercise 1 A)**

Before starting with the algorithm, perform some data exploration and determine the following characteristics of the data:

(1) What is the total number of data samples?

(2) How many distinct classes (target categories) does the dataset contain?

(3) Which person (their index) is the heaviest in the dataset?

(4) Which feature in the dataset has the largest range (max–min)?

Take a random sample of 1000 rows from the dataframe using a fixed random seed (<b>of 2025</b>). Print the first 15 rows to check that everything is ok with the dataframe.



In [6]:
### Data Loading and Exploration
# ------------------------------------------------------
# The data file should be in the same location as the
# exercise file to make sure the following lines work!
# Otherwise, fix the path.
# ------------------------------------------------------

# Path for the data
data_path = 'ex2_cardio_data.csv'

# Read the CSV file
cardio_data = pd.read_csv(data_path)

# For this task, we are going to use ALL variables but CARDIO as features, and CARDIO as label.

cardio_data.head(12)


index,age,sex,height,weight,ap_hi,ap_lo,smoke,alco,active,cardio,cholesterol_normal,cholesterol_at_risk,cholesterol_elevated,gluc_normal,gluc_at_risk,gluc_elevated
0,48,1,170,104.0,120,80,0,0,1,0,1,0,0,1,0,0
1,51,0,160,59.0,110,80,0,0,1,0,0,1,0,1,0,0
2,42,1,166,77.0,120,80,0,0,1,0,1,0,0,1,0,0
3,55,0,168,80.0,120,80,0,0,1,0,1,0,0,1,0,0
4,57,0,154,41.0,806,0,0,0,1,0,1,0,0,1,0,0
5,53,0,152,56.0,103,65,0,0,1,0,1,0,0,1,0,0
6,42,0,167,67.0,110,70,0,0,1,0,1,0,0,1,0,0
7,41,1,172,70.0,110,80,0,0,1,0,1,0,0,1,0,0
8,43,1,169,67.0,100,80,0,0,1,0,1,0,0,1,0,0
9,39,1,168,60.0,120,80,0,0,1,0,1,0,0,0,1,0


In [8]:
# Find the following specs of the data:
# (1) What is the total number of data samples?
print("Shape: ", cardio_data.shape, "\n")

# (2) How many distinct classes (target categories) does the dataset contain?
display(cardio_data['cardio'].value_counts())

# (3) Which person (their index) is the heaviest in the dataset?
print("\nHeaviest patient", cardio_data['weight'].idxmax(), "has weight", cardio_data['weight'].max(), "kg\n")

# (4) Which feature in the dataset has the largest range (max–min)?
display(cardio_data.select_dtypes(include="number").max() - cardio_data.select_dtypes(include="number").min())

Shape:  (6000, 16) 



cardio,count
0,4200
1,1800



Heaviest patient 5912 has weight 168.0 kg



feature,range
age,35.0
sex,1.0
height,130.0
weight,140.0
ap_hi,14009.0
ap_lo,9800.0
smoke,1.0
alco,1.0
active,1.0
cardio,1.0


In [11]:
# Checking actual maximums and minimums
cardio_data.describe().round(1)
display(cardio_data.select_dtypes(include="number").agg(['min', 'max']).T)

feature,min,max
age,29.0,64.0
sex,0.0,1.0
height,68.0,198.0
weight,28.0,168.0
ap_hi,11.0,14020.0
ap_lo,0.0,9800.0
smoke,0.0,1.0
alco,0.0,1.0
active,0.0,1.0
cardio,0.0,1.0


#### <font color = lightcoral> \<Answer\> </font>

(1) There are 6000 data points in total.

(2) The target variable has 2 categories (0/1).
Apparently, these values mean absence/presence of the disease.

(3) The person with the largest weight has index 5912.

(4) The largest range in the dataset belongs to the `ap_hi` variable, but this is obviously due to incorrect values (the recorded maximum is 14020).
Since the features are of different natures, it probably doesn't make much sense to compare ranges across them, except to find inconsistencies.

<!-- END -->

In [13]:
# As the notebook may be too slow to run for the full dataset, we will only use 1000 samples
# from the original data

# Resample with pandas.DataFrame.sample with random_state=2025 and print first 15 rows

### Your Code Goes Here
cardio_data = cardio_data.sample(n=1000, random_state=2025)
cardio_data.head(15)

index,age,sex,height,weight,ap_hi,ap_lo,smoke,alco,active,cardio,cholesterol_normal,cholesterol_at_risk,cholesterol_elevated,gluc_normal,gluc_at_risk,gluc_elevated
2138,52,0,162,67.0,130,80,0,0,1,0,0,0,1,0,0,1
979,44,0,161,59.0,110,70,0,0,1,0,1,0,0,1,0,0
2801,53,0,159,84.0,120,80,0,0,1,0,1,0,0,1,0,0
298,39,0,150,65.0,120,90,0,0,0,0,1,0,0,1,0,0
2689,59,0,164,66.0,160,90,0,0,1,0,0,1,0,1,0,0
3661,44,1,176,72.0,120,80,0,0,0,0,1,0,0,1,0,0
866,63,0,160,63.0,130,100,0,0,1,0,0,1,0,1,0,0
4949,55,0,160,100.0,120,80,0,0,1,1,1,0,0,1,0,0
2351,50,0,167,73.0,130,80,0,0,0,0,1,0,0,1,0,0
1580,58,0,161,82.0,120,80,0,0,1,0,1,0,0,1,0,0


----

We have the data so now, let's put it to use. All the analyses will be done based on this sample of 1000.

To teach the k-NN algorithm (or any other machine learning algorithm) to recognize patterns, we need <font color = lightcoral>training data</font>. However, to assess how well a model has learned these patterns, we require <font color = lightcoral>test data</font> which is new and unseen by the trained model. It's important to note that the test set is not revealed to the model until after the training is complete.

So, to *estimate the performance of a model*, we may use a basic <font color = lightcoral>train-test split</font>. The term "split" is there because we literally split the data into two sets.


#### **Exercise 1 B)**

Gather the features into one array and the target variable into another array. Create training and test data by splitting the data into training (80%) and test (20%) sets. Use a fixed random seed to ensure that even if you execute this cell hundreds of times, you will get the same split each time.

- Do you need stratification for our dataset? Explain your decision.

In [14]:
###Train-test split

# As we are going to use ALL THE OTHER BUT CARDIO as features, we can drop the cardio column like this
features = cardio_data.drop(columns=['cardio'])

# and as labels, basically just choose the cardio column as follows
labels = cardio_data['cardio']

# WHY STRATIFICATION is needed/not needed?
# Inspect the labels and their distribution.
display(labels.value_counts())


cardio,count
0,699
1,301


In [16]:
# ------- TRAIN-TEST SPLIT
# Use sklearn.model_selection.train_test_split() function with random_state=2025

### Your Code Goes Here
features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            random_state=2025,
                                                                            test_size=0.2,
                                                                            train_size=0.8,
                                                                            shuffle=True,
                                                                            stratify=labels)

#### <font color = lightcoral> \<Answer\> </font>

Stratification is needed because the data sample is imbalanced in the sense that label `0` is 2.3 times more frequent than label `1`, which are about 30% of all datapoints. It is possible that without stratification the majority of `1`s would go to the test set while the training set would represent a wrong distribution of labels (`1`s would be underrepresented).
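The effect of stratification can be checked with a small sketch. It uses synthetic labels with roughly the same 70/30 class balance rather than the Cardio data itself, and a dummy feature column that exists only so the split has something to split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class balance seen above (70% / 30%)
y = np.array([0] * 700 + [1] * 300)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025, stratify=y
)

# With stratify=y, both parts should keep the original share of class 1
train_share = y_train.mean()
test_share = y_test.mean()
print(f"Share of class 1 in train: {train_share:.3f}, in test: {test_share:.3f}")
```

Without `stratify`, the two shares would only match the original balance on average, and could drift in any single split.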

<!-- END -->

----------

#### **Exercise 1 C)**

Standardize the numerical features in both the train and test sets.

- Explain how the k-NN model makes predictions about whether or not a patient has cardiovascular disease (CVD) when the features are not standardized. Specifically, discuss how the varying scales of different features can influence the model's predictions, and how standardization would change this influence.


*Note: Some good information about preprocessing and how to use it for train and test data can be found here https://scikit-learn.org/stable/modules/preprocessing.html*

In [17]:
### Standardization

# We wanted to scale only the numeric variables so we can have them in a list as
numeric_features = ['age', 'weight', 'height', 'ap_hi', 'ap_lo']

# We will use the standard Z-score standardization by scikit-learn. Only one scaler needed here!
scaler = StandardScaler()

# --- first, Normalize the training set (features_train)

# Fit a StandardScaler and scale the data using the computed mean and std
scaled_features_train = features_train.copy()
scaled_features_train[numeric_features] = (scaled_features_train[numeric_features].astype(float))

### Fitting the scaler and transforming the train set
scaler.fit(scaled_features_train[numeric_features])
scaled_features_train[numeric_features] = scaler.transform(scaled_features_train[numeric_features])

# then the test set (features_test)
# USE THE ALREADY FITTED STANDARDSCALER HERE
scaled_features_test = features_test.copy()
scaled_features_test[numeric_features] = (scaled_features_test[numeric_features].astype(float))

### Transforming the test set
scaled_features_test[numeric_features] = scaler.transform(scaled_features_test[numeric_features])


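#### <font color = lightcoral> \<Answer\> </font>

The model takes the $k$ data points closest to a new point and decides the label of the new point from the labels of those $k$ points (taking the most frequent label among them).
Hence, the model needs a notion of distance between points, which has a natural implementation for numeric variables.

When the features are not standardized and have significantly different ranges, they do not contribute equally to the distance: features with large numeric ranges (here `ap_hi` and `ap_lo`, whose ranges span thousands in the full data) dominate it, while the binary features barely matter. After standardization, every numeric feature has zero mean and unit variance on the training set, so each one contributes to the distance on a comparable scale.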
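The influence of scale on the nearest neighbour can be illustrated with a small sketch. The four patients below are hypothetical, and height is given in metres (not cm as in the Cardio data) to exaggerate the difference in scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: height in metres, weight in kg
X = np.array([
    [1.60, 60.0],  # row 0: query patient
    [1.95, 62.0],  # row 1: much taller, similar weight
    [1.61, 65.0],  # row 2: almost the same height, slightly heavier
    [1.80, 80.0],  # row 3: taller and much heavier
])

def nearest_to_first(data):
    """Return the row index of the nearest neighbour of row 0 (Euclidean distance)."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(X)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest neighbour, raw features:   ", nearest_raw)
print("Nearest neighbour, scaled features:", nearest_scaled)
```

On the raw features the weight column decides almost everything, so the much taller patient counts as the closest one; after standardization the height difference is taken into account and the nearest neighbour changes.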

<!-- END -->

-------

It's time for us to train the model!

#### **Exercise 1 D)**

Train a k-NN model with $k=3$. Print out the confusion matrix and use it to compute the accuracy, the precision and the recall.

- What does each cell in the confusion matrix represent in the context of our dataset?

- How does the model perform with the different classes? Where do you think the differences come from? Interpret the performance metrics you just computed.

- With our dataset, why should you be a little more cautious when interpreting the accuracy?

*Note: We are very aware that there are functions available for these metrics, but this time, please calculate them using the confusion matrix.*
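As a hedged illustration of the calculation, the sketch below uses ten made-up labels (not the Cardio data) and derives the three metrics from the cells of the confusion matrix. For binary labels 0/1, scikit-learn arranges the matrix as `[[TN, FP], [FN, TP]]`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, purely for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are truly positive
recall = tp / (tp + fn)                     # share of true positives that were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, Recall={recall:.3f}")
```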

In [None]:
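### Use the kNN classifier sklearn.neighbors.KNeighborsClassifier() with k = 3

### Your Code Goes Here
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(scaled_features_train, labels_train)
predicted_labels = knn_model.predict(scaled_features_test)

# ---- Confusion matrix. Use sklearn.metrics.confusion_matrix() to build it

### Your Code Goes Here
conf_matrix = metrics.confusion_matrix(labels_test, predicted_labels)
print(conf_matrix)

# ----- Accuracy, precision and recall. Use previously built confusion matrix to calculate metrics

### Your Code Goes Here
# For labels {0, 1} the matrix layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = conf_matrix.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'Accuracy (manual): {accuracy:.3f}')
print(f'Precision (manual): {precision:.3f}')
print(f'Recall (manual): {recall:.3f}')
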

# Compare the results you got to the baselines by sklearn.metrics:
print()
print(f'Accuracy: {metrics.accuracy_score(labels_test, predicted_labels):.3f}')
print(f'Precision: {metrics.precision_score(labels_test, predicted_labels):.3f}')
print(f'Recall: {metrics.recall_score(labels_test, predicted_labels):.3f}')


<font color = lightcoral> \<Write your answer here\></font>

__________
## <font color = royalblue> 2. Classification accuracy using leave-one-out cross-validation

While the train-test split may provide us with an unbiased estimate of the performance, we only evaluate the model once. Especially when dealing with small datasets, the test set itself will be very small. How can we be sure that the evaluation is accurate with this small test set and not just good (or bad) luck? And what if we'd like to compare two models and one seems to be better -- how can we be sure that it's not just a coincidence?

Well, there's a great help available and it's called <font color = royalblue>cross-validation</font>. With its help, we can split the dataset into multiple different training and test sets, which allows us to evaluate models across various data partitions. This time, we'll take a closer look at the <font color = royalblue>leave-one-out cross-validation</font>.

**Exercise 2**

Let's keep the focus on detecting the CVD, so once again we utilize the k-NN model (with $k=3$) to predict the presence of the disease. Now, apply leave-one-out cross-validation to assess whether the k-NN model is suitable for addressing the problem. You may use the entire sample of 1000 on this task.

- **Exercise 2 A)** What can you say about the accuracy compared to the previous task?
- **Exercise 2 B)** What do you think: Does the k-NN model work for the problem at hand? Explain your answer.

*Tip: This can certainly be done manually, but `cross_val_score` is also a very handy function.*
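A minimal sketch of the `cross_val_score` route, run on a small synthetic dataset from `make_classification` rather than on the Cardio sample. Wrapping the scaler and the classifier in a pipeline is an assumption made here so that the scaler is refitted inside every fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary classification problem (stand-in for the real data)
X, y = make_classification(n_samples=60, n_features=5, random_state=2025)

# Scaling happens inside the pipeline, so each fold fits the scaler on its own training part
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# One fold per sample: each score is 1.0 (correct) or 0.0 (wrong) for the held-out point
scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
loo_accuracy = scores.mean()

print(f"Number of folds: {len(scores)}")
print(f"LOOCV accuracy: {loo_accuracy:.3f}")
```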

In [None]:
### Leave-one-out cross-validation

# Either manually loop through the LOOCV or use the optimized
# sklearn.model_selection.cross_val_score() function.

### Your Code Goes Here


<font color = royalblue> \<Write your answer here\></font>

____________
## <font color = forestgreen> 3. Model selection with leave-one-out cross-validation

So far, we've trained one model at a time and I've given the value of k for you. Accuracy is what it is (no spoilers here), but could we still do a little better? Let's explore that possibility through a process known as <font color=green>hyperparameter tuning</font>. Cross-validation is an especially important tool for this task. Note that model selection and model evaluation (or assessment) are two different things: we use model selection to estimate the performance of various models in order to identify the one most likely to provide the "best" predictive performance for the task. Once we have found this most suitable model, we *assess* its performance and generalisation power on unseen data.

This time, we're going to train multiple models, let's say 30, and our goal is to select the best K-Nearest Neighbors model from this set. Most models come with various hyperparameters that require careful selection, and the k-NN model is no exception. Although we're talking about the number of neighbors here, it's important to note that k-NN also has several other hyperparameters, such as the used distance measure. However, for the sake of simplicity, this time we'll focus solely on fine-tuning the number of nearest neighbors, that is, the value of k, and use default values for all the other hyperparameters.

Let's focus on the model selection part here for the sake of comprehending the cross-validation itself. We'll get later on the whole pipeline, which also includes model assessment.

**Exercise 3**

Find the optimal k value from a set of $k=1...30$ using leave-one-out cross-validation. Plot the accuracies vs. the k values. Again, you may use the entire sample of 1000 on this task. We recommend using the <i>sklearn.model_selection.GridSearchCV</i> function for it.

- **Exercise 3 A)** Which value of k produces the best accuracy when using leave-one-out cross-validation? Compare the result to the previous model with $k=3$.
- **Exercise 3 B)** If k is increased further, what limit does the accuracy approach? Why?
- **Exercise 3 C)** Discuss the impact of choosing a very small or very large number of neighbors on the k-NN model's ability to distinguish between the healthy individuals and the ones with CVD.
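The search itself might look roughly like the sketch below. It runs on a small synthetic dataset and only tries $k=1...10$ to stay fast; both the data and the reduced grid are assumptions for illustration, not the setup required by the exercise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic binary classification problem (stand-in for the real data)
X, y = make_classification(n_samples=50, n_features=5, random_state=2025)
k_values = list(range(1, 11))

# One k-NN model per candidate k, each evaluated with leave-one-out cross-validation
grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": k_values},
    cv=LeaveOneOut(),
    scoring="accuracy",
)
grid.fit(X, y)

# Mean LOOCV accuracy per k, in the same order as k_values
mean_accuracy = grid.cv_results_["mean_test_score"]
best_k = grid.best_params_["n_neighbors"]

print(f"Best k: {best_k}, LOOCV accuracy: {grid.best_score_:.3f}")
```

Plotting `mean_accuracy` against `k_values` with `plt.plot` then gives the accuracy curve.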

In [None]:
### Select best model

### Your Code Goes Here


# Plot the accuracies as a function of <i>k</i>

- **Exercise 3 D)** Plot accuracies against <i>k</i>.
- **Exercise 3 E)** Observe the line trend

In [None]:
### Plot the accuracies vs. the values for k

### Your Code Goes Here

# What if <i>k</i> increases further? Experiment with <i>k</i> values from 201 to 230

- **Exercise 3 F)** Plot accuracies against <i>k</i>.
- **Exercise 3 G)** Observe the line trend

In [None]:
# What if k increases further?
# ----- Let's try really large k values (201-230) and see what happens to the final accuracy

### Your Code Goes Here


______________
## <font color = lightcoral>4.  Data Loading and Initial Exploration for (Prostate Cancer) dataset </font>


We begin by introducing the <font color = lightcoral>prostate cancer dataset</font>, which will be used for our regression task. Our primary goal is to *predict the level of prostate-specific antigen (PSA)*, represented by the `lpsa` variable, using other relevant features as predictors.

Here's a brief overview of the features in our dataset:

| Feature | Type | Explanation |
| :- | :- | :-
| lcavol | numeric | log cancer volume
| lweight | numeric | log prostate weight
| age | numeric | patient's age
| lbph | numeric | log of benign prostatic hyperplasia amount
| svi | boolean | seminal vesicle invasion (0=no, 1=yes)
| lcp | numeric | log of capsular penetration
| gleason | numeric | Gleason score
| pgg45 | numeric | percentage of Gleason score 4+5
| lpsa | numeric | log PSA level (target variable)

Ensure that the `prostate.csv` file is located in the correct path (as specified in the code) for the code to execute correctly.

**Exercise 4 A)**

Load the prostate dataset and display summary statistics (mean, std, min, max) for all numeric variables.




In [None]:
#  Load prostate data

#  Display the first 5 rows of cancerData
# display(...)

#  Display the number of unique values per column
# display(...)

#  Display summary statistics of the dataset
# display(...)

----------

**Exercise 4 B)**

Plot pairwise relationships between the `lpsa` variable and all other features using `seaborn.pairplot` with the following parameters: `kind="reg"`, `diag_kind="kde"`, and `plot_kws={"line_kws": {"color": "red"}}`. You should **not** include the 'svi' variable in the analysis since it has only two values (0 or 1) and is binary.



In [None]:
#  Import necessary packages
# import ...

#  Select numeric columns from cancerData
# numeric_features = ...

#  Remove 'lpsa' and 'svi' from the feature list
# features_for_pairplot = ...

#  Create the pairplot

#  Add a title to the figure


______________
## <font color = royalblue> 5. Ridge Regression </font>


Having explored the relationships within our data, we are now ready to delve into regression modeling. This section will focus on <font color = royalblue>Ridge Regression</font>, a powerful regularization technique for predicting continuous outcomes. Our objective is to build a Ridge Regression model that can accurately estimate `lpsa` levels based on the other features in the dataset.

Before we train our model, it's crucial to prepare the data appropriately. This involves splitting the dataset into training and testing sets, and standardizing features to ensure optimal model performance, especially for regularization techniques like Ridge Regression.

**Exercise 5 A)**

1. Separate the features (X) from the target variable (`lpsa`).
2. Split the data into training (80%) and testing (20%) sets using `train_test_split(X, y, test_size=0.2, random_state=2025, shuffle=True)`. Set a `random_state` for reproducibility.
3. Standardize the features for both the training and testing sets using `StandardScaler()`.
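The three steps above can be sketched as follows. The data come from `make_regression` as a stand-in for the prostate dataset, so the shapes and values are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: X plays the role of the features, y the role of `lpsa`
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=2025)

# 80% / 20% split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025, shuffle=True
)

# Fit the scaler on the training set only, then reuse it for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape, "Test shape:", X_test_scaled.shape)
print("Train means (close to 0):", X_train_scaled.mean(axis=0).round(3))
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step.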





In [None]:

#  Import necessary packages
# from ...

#  Separate X and y

#  Perform train-test split

#  Standardize features
# scaler = ...

#  Print results


----------

**Exercise 5 B)**

Train a Ridge Regression model (with `Ridge(alpha=100, random_state=2025)`) using the scaled training data. Evaluate the model on both the training and testing sets using Mean Squared Error (MSE).

Compare the test MSE with the training MSE. What do you notice? Does the model generalize well, and do you think that `alpha=100` is too strong?
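To get a feeling for how strongly `alpha` shrinks the model, the sketch below fits Ridge with `alpha=100` and with a much weaker `alpha=0.01` on synthetic data. The data, the helper `fit_and_score` and the comparison value `0.01` are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prostate data
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=2025)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025, shuffle=True
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def fit_and_score(alpha):
    """Fit Ridge with the given alpha and return the model plus train and test MSE."""
    model = Ridge(alpha=alpha, random_state=2025).fit(X_train_s, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train_s))
    test_mse = mean_squared_error(y_test, model.predict(X_test_s))
    return model, train_mse, test_mse

strong, strong_train_mse, strong_test_mse = fit_and_score(100)
weak, weak_train_mse, weak_test_mse = fit_and_score(0.01)

print(f"alpha=100 : train MSE={strong_train_mse:.1f}, test MSE={strong_test_mse:.1f}")
print(f"alpha=0.01: train MSE={weak_train_mse:.1f}, test MSE={weak_test_mse:.1f}")
print("Coefficient norms:", np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
```

A larger `alpha` always gives a smaller coefficient norm and a training MSE that is at least as high; whether the test MSE improves depends on the data.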


In [None]:
#  Create Ridge model with alpha=100

#  Fit the model
# ridge.fit(...)

#  Predict on training and test data

#  Print evaluation results


____________
## <font color = forestgreen> 6. Hyperparameter Tuning with Cross-Validation </font>



Selecting the optimal hyperparameters is crucial for maximizing ML model performance. This is where <font color = forestgreen>hyperparameter tuning</font> comes into play, often coupled with <font color = forestgreen>cross-validation</font> to ensure robust and reliable model selection.

Cross-validation helps us estimate how well a model will perform on unseen data, providing a more stable evaluation than a single train–test split—especially when fine-tuning hyperparameters. In this section, we will use K-Fold cross-validation to identify the best `alpha` (regularization strength) for our Ridge Regression model.

Before tuning hyperparameters, it is important to remember the purpose of the data split:

* **Training set:** used to fit (train) the model.
* **Validation (via cross-validation):** used to evaluate model performance during tuning **without** touching the test set.
* **Test set:** used only once at the very end to estimate final model performance **after** all tuning decisions are complete.

Using the test set during hyperparameter tuning would lead to overly optimistic results.
Therefore, we rely on K-Fold cross-validation applied to the training data to select hyperparameters in a robust and unbiased way.

---

**Exercise 6 A)**

Perform K-Fold cross-validation (e.g., with 5 folds) to find the optimal `alpha` value for Ridge Regression from a predefined range of `alphas` (e.g., `np.logspace(-2, 10, num=13)`). Plot the training MSE and cross-validation MSE against the `alpha` values. Highlight the `alpha` value that yields the lowest cross-validation MSE.
Use a logarithmic x-axis (`plt.xscale('log')`) for the alpha values.

* What is the optimal `alpha` value found, and how does it compare to `alpha=100` used in the previous question?
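A possible shape for the search loop is sketched below on synthetic training data. Putting the scaler inside a pipeline is a design choice made here so that each fold is scaled using only its own training part; the plotting step is left out:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unscaled) training part of the prostate data
X_train, y_train = make_regression(n_samples=80, n_features=8, noise=10.0, random_state=2025)

alphas = np.logspace(-2, 10, num=13)
kf = KFold(n_splits=5, shuffle=True, random_state=2025)

cv_mse = []
for alpha in alphas:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back
    neg_mse = cross_val_score(model, X_train, y_train, cv=kf,
                              scoring="neg_mean_squared_error")
    cv_mse.append(-neg_mse.mean())

# The best alpha is the one with the lowest mean cross-validation MSE
best_index = int(np.argmin(cv_mse))
best_alpha = alphas[best_index]
print(f"Best alpha: {best_alpha:g} with CV MSE {cv_mse[best_index]:.1f}")
```

The test set is never touched in this loop; it is reserved for the final evaluation with `best_alpha`.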


---

In [None]:

#  Define alphas
# alphas = ...

#  Create lists to store MSE values
# train_mse = ...
# cv_mse = ...

#  Create KFold object
# kf = ...

#  Loop over each alpha and compute train MSE and CV MSE
# for alpha in alphas:
#     ...

#  Plot training and CV MSE vs alpha
# plt.figure(...)
# ...

#  Identify and mark the best alpha
# best_index = ...
# best_alpha = ...
# ...

# plt.show()


----------

**Exercise 6 B)**

Train the Ridge Regression model again using the optimal `alpha` value found in Exercise 6 A). Print the Mean Squared Error (MSE) for both the training and test sets, and compare the results with the previous model you trained with `alpha=100` in Exercise 5 B).




In [None]:
#  Create Ridge model using best_alpha
# ridge = ...

#  Fit model to the training data
# ridge.fit(...)

#  Predict on training and test sets

#  Print evaluation results and compare between models trained with best_alpha and alpha=100


_____________________

## <font color = darkorange> 7. Bonus Exercise </font>


You can stop here and get the "pass" grade! To get the pass with honors, you need to do the following exercise. This means you'll get one bonus point for the exam.

The exercise may require you to do some research of your own. You are also required to **explain** the steps you choose in your own words, and show that you tried to understand the idea behind the task. There's no single correct solution for this so just explain what you did and especially ***why*** you did it. Please note that submitting only code will not be awarded a pass with honors.

----------------

**Exercise 7 A)**

Using the Ridge Regression model with the optimal `alpha` found in Exercise 6 A, write out the complete regression equation, including the intercept and the coefficients for each feature.

- What does the intercept term represent in this model?
- Which feature has the largest coefficient? What does this tell you about its importance? You can plot the feature importances based on the coefficients, sorted by absolute coefficient value for better visualization.

Which of the following statements best describes the role of a positive coefficient for `lcavol` in the Ridge Regression model predicting `lpsa`?

- A) An increase in `lcavol` is associated with a decrease in predicted `lpsa`.
- B) `lcavol` has no influence on predicted `lpsa`.
- C) An increase in `lcavol` is associated with an increase in predicted `lpsa`.
- D) The model is not well-fitted if `lcavol` has a positive coefficient.
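
One way to assemble the equation from a fitted model is sketched below. The feature names `x1` to `x4`, the synthetic data and `alpha=1.0` are all hypothetical placeholders; in the notebook they would be the real column names, the scaled training data and the optimal `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical feature names and synthetic data (assumptions for illustration).
feature_names = [f"x{i}" for i in range(1, 5)]
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=1)

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha is a placeholder, not a tuned value

# Build "y_hat = intercept + c1*x1 + ..." with an explicit sign on each term.
terms = [f"{coef:+.3f}*{name}" for coef, name in zip(ridge.coef_, feature_names)]
equation = f"y_hat = {ridge.intercept_:.3f} " + " ".join(terms)
print(equation)

# Rank features by absolute coefficient value, largest first.
order = np.argsort(np.abs(ridge.coef_))[::-1]
ranked = [(feature_names[i], ridge.coef_[i]) for i in order]
for name, coef in ranked:
    print(f"{name}: {coef:+.3f}")

# Sanity check: the written-out equation should reproduce model.predict.
manual = ridge.intercept_ + X @ ridge.coef_
```

Comparing coefficient magnitudes across features is only meaningful when the features are on a common scale, which is why the scaled training data is the natural input here.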



In [None]:
### Code - Ridge Regression Equation

# Re-fit the best Ridge model using the optimal alpha found in Exercise 6 A

# Let's write the equation


In [None]:
### Code - Visualize Coefficients

# Create a DataFrame for easier plotting

# Sort by absolute coefficient value for better visualization

# Create horizontal bar plot



----------

**Exercise 7 B) - Regularization Path Analysis**

Visualize how coefficients change as the regularization strength (alpha) varies. This is called a regularization path or coefficient path.

1. Train Ridge Regression models for a range of alpha values (e.g., `np.logspace(-2, 4, num=50)`)
2. For each alpha, extract the coefficients
3. Plot the coefficient paths: each feature's coefficient value vs. log(alpha)
4. Add a vertical line at the optimal alpha found.

- Which of the following best describes what happens to Ridge Regression coefficients as alpha approaches infinity?

    A) All coefficients approach infinity
    B) All coefficients approach zero
    C) Coefficients remain unchanged
    D) Coefficients become random

- If a feature's coefficient changes dramatically as alpha increases, what does this suggest about that feature?

    A) The feature is not important
    B) The feature may be contributing to overfitting
    C) The feature has no correlation with the target
    D) The feature should be removed
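
Steps 1 and 2 above amount to collecting one coefficient vector per `alpha` into a matrix, which could look roughly like this. The data is a synthetic stand-in (assumption), and the plotting itself is left out of the block.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption): replace with X_train_scaled / y_train.
X, y = make_regression(n_samples=80, n_features=5, noise=5.0, random_state=2)

alphas = np.logspace(-2, 4, num=50)

# One row per alpha, one column per feature.
coef_path = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])

# The overall size of the coefficient vector shrinks as alpha grows.
norms = np.linalg.norm(coef_path, axis=1)
print(f"||coef|| at alpha={alphas[0]:g}: {norms[0]:.3f}")
print(f"||coef|| at alpha={alphas[-1]:g}: {norms[-1]:.3f}")
```

Each column `coef_path[:, j]` can then be drawn with `plt.plot(alphas, coef_path[:, j])` on a logarithmic x-axis, and `plt.axvline(best_alpha, linestyle='--')` would mark the optimal value, where `best_alpha` is assumed to come from Exercise 6 A).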




In [None]:
### Code - Regularization Path Analysis

# Define a range of alpha values

# Plot coefficient paths

# Add vertical line at optimal alpha


----------

**Exercise 7 C) - Comparison with Lasso Regression**

This exercise focuses on comparing **Ridge Regression ($\text{L}_2$ regularization)** and **Lasso Regression ($\text{L}_1$ regularization)** to understand the practical differences between the two regularization techniques.

---
Coding Tasks:

1.  **Import:** Import the `Lasso` class from `sklearn.linear_model`.
2.  **Lasso Training (Fixed Alpha):** Train a Lasso model on the training split defined in Exercise 5 A) (`X_train_scaled` and `y_train`) with a fixed regularization parameter, $\alpha = 0.1$.
3.  **Coefficient Comparison (Fixed Alpha):** Compare the coefficients of the previously optimized Ridge Model (using its best $\alpha$) with the coefficients of the **Lasso Model** trained with $\alpha=0.1$.
4.  **Coefficient Visualization (Fixed Alpha):** Create a visualization (e.g., a bar chart) to display the coefficients of both models side-by-side.
5.  **Performance Comparison (Fixed Alpha):** Compare the test set performance metrics (Mean Squared Error (MSE) and **$\text{R}^2$**) for both models on `X_test_scaled` and `y_test`.
6.  **Optimal Alpha Search (Lasso):** Perform hyperparameter tuning for Lasso Regression (similar to the method used for Ridge Regression, e.g., using cross-validation on the range `alphas_lasso = np.logspace(-4, 1, num=20)`) to find the optimal $\alpha$ value.
7.  **Final Coefficient Visualization (Optimal Alpha):** Train the Lasso model using its best $\alpha$ and visualize its final coefficients side-by-side with the previous Ridge Regression model trained with its best $\alpha$.

---

Answer the following questions based on your results:

* What effect did training with Lasso $\alpha=0.1$ have on the model's coefficients? Were any coefficients driven to exactly zero? If so, which features did this affect?
* Was the Lasso Test MSE (with $\alpha=0.1$) better or worse than the Ridge Test MSE (with its best $\alpha$)?
* What was the best alpha value for Lasso Regression determined through cross-validation?
* After training both the Ridge and Lasso models with their respective optimal $\alpha$ values, is there any significant difference in the resulting coefficient values?
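
The fixed-alpha comparison and the Lasso alpha search could be sketched as below. Everything data-related is an assumption: the data is synthetic, the split is regenerated here for self-containment, and the Ridge `alpha=1.0` is a placeholder for the value tuned in Exercise 6 A). The search uses `cross_val_score` as one possible alternative to a manual K-Fold loop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): only 4 of the 8 features are informative.
X, y = make_regression(n_samples=120, n_features=8, n_informative=4,
                       noise=10.0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)

ridge = Ridge(alpha=1.0).fit(X_train_scaled, y_train)  # placeholder for the tuned alpha
lasso = Lasso(alpha=0.1).fit(X_train_scaled, y_train)  # fixed alpha from the task

# Test-set metrics and the number of coefficients that are exactly zero.
for name, model in [("Ridge", ridge), ("Lasso", lasso)]:
    pred = model.predict(X_test_scaled)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: MSE = {mean_squared_error(y_test, pred):.2f}, "
          f"R2 = {r2_score(y_test, pred):.3f}, zero coefficients = {n_zero}")

# Cross-validated search over the suggested Lasso alpha range.
alphas_lasso = np.logspace(-4, 1, num=20)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = [
    -cross_val_score(Lasso(alpha=a, max_iter=10000), X_train_scaled, y_train,
                     scoring="neg_mean_squared_error", cv=kf).mean()
    for a in alphas_lasso
]
best_alpha_lasso = alphas_lasso[int(np.argmin(cv_mse))]
print(f"Best Lasso alpha: {best_alpha_lasso:g}")
```

Note that `alpha` is not on the same scale in the two estimators, because scikit-learn's Lasso objective divides the squared-error term by the number of samples while Ridge does not, so the two optimal values should not be compared directly.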




In [None]:
### Code - Comparison with Lasso Regression

from sklearn.linear_model import Lasso

# Find optimal alpha for Lasso using cross-validation

# Train final models for ridge and lasso with the best alpha values

# Compare coefficients

# Visualize coefficients side-by-side

# Compare test set performance
