#### Jérémy TREMBLAY

# TP2 : KNN and regression

In [23]:
# Import the libraries that will be used in this notebook.
import pandas as pd
import numpy as np
import random

# Import the pyplot module from matplotlib with the plt alias.
import matplotlib.pyplot as plt

# Import the sklearn modules.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

## Task 6: Prepare data

**Instructions:** As in the first part of the lab, analyse and prepare the data.

In [24]:
# Specify the relative path of the housing file.
file_path = 'datasets/housing.csv'

# Load the database into a DataFrame.
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame with head.
print(df.head())

      RM  LSTAT  PTRATIO      MEDV
0  6.575   4.98     15.3  504000.0
1  6.421   9.14     17.8  453600.0
2  7.185   4.03     17.8  728700.0
3  6.998   2.94     18.7  701400.0
4  7.147   5.33     18.7  760200.0


In [25]:
print(df.isnull().any())

RM         False
LSTAT      False
PTRATIO    False
MEDV       False
dtype: bool


The dataset contains no missing values, so we can explore it directly and gather some information about it.

In [26]:
# Know the dimensions of the dataframe.
df.shape

(489, 4)

There are 489 rows and 4 columns. Let's look at the content in more detail with some statistics.

In [27]:
# Display useful information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   RM       489 non-null    float64
 1   LSTAT    489 non-null    float64
 2   PTRATIO  489 non-null    float64
 3   MEDV     489 non-null    float64
dtypes: float64(4)
memory usage: 15.4 KB


In [28]:
df.describe()

               RM       LSTAT     PTRATIO       MEDV
count       489.0       489.0       489.0      489.0
mean     6.240288   12.939632   18.516564   454342.9
std       0.64365     7.08199    2.111268   165340.3
min         3.561        1.98        12.6   105000.0
25%          5.88        7.37        17.4   350700.0
50%         6.185       11.69        19.1   438900.0
75%         6.575       17.12        20.2   518700.0
max         8.398       37.97        22.0  1024800.0


In [29]:
df.MEDV.value_counts()

525000.0    8
485100.0    7
462000.0    7
455700.0    7
407400.0    6
           ..
690900.0    1
726600.0    1
636300.0    1
699300.0    1
170100.0    1
Name: MEDV, Length: 228, dtype: int64

After this analysis, we can observe the following about the dataset:

1. It contains no empty values.

2. It only contains 4 columns.

The columns contain the following data:    
  
* RM (Average number of rooms per dwelling):
  * Mean: Approximately 6.24 rooms per dwelling.
  * Range: From a minimum of 3.56 to a maximum of 8.40 rooms.    
  
* LSTAT (Percentage of lower status population):
  * Mean: Around 12.94% lower status population.
  * Range: Varies from 1.98% to 37.97%.   
  
* PTRATIO (Pupil-teacher ratio):
  * Mean: About 18.52 students per teacher.
  * Range: Ranges from 12.60 to 22.00 students per teacher.    

* MEDV (Median value of owner-occupied homes):
  * Mean: Approximately $454,342.90 for median home value.
  * Range: Median home values vary from $105,000 to $1,024,800.

Information found [here](https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset).

Let's now define the data we will use for this exercise. We will use the first three columns (RM, LSTAT and PTRATIO) as features to predict the fourth, MEDV. To do this, we define the variables and split the data into training and test sets for the following tasks.

In [30]:
# Our data.
X = df[['RM', 'LSTAT', 'PTRATIO']]
y = df['MEDV']

# Split the data into train and test sets (about 67% train, 33% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
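
The sizes of the two subsets can be worked out by hand. The sketch below assumes that `train_test_split` rounds the test count up and gives the remaining rows to the training set; the variable names are illustrative only.

```python
import math

# Values taken from the notebook: 489 rows and test_size=0.33.
n_samples = 489
test_size = 0.33

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train} rows ({n_train / n_samples:.1%})")
print(f"test:  {n_test} rows ({n_test / n_samples:.1%})")
```

In the notebook itself, `X_train.shape` and `X_test.shape` give the actual sizes and can be used to check this arithmetic.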

Now that we have analysed the data we can create and train our model.

## Task 7: Train and analyse the model

**Instructions:** Train your model and identify a value of the `n_neighbors` parameter that generalizes well. To do so, you can rely on several metrics, some of which were used in the previous lab:

* The coefficient of determination

* The mean absolute error, defined as $\frac{1}{n} \sum_{i=1}^{n} |\bar{y}_i - y_i|$

* The mean squared error, defined as $\frac{1}{n} \sum_{i=1}^{n} (\bar{y}_i - y_i)^2$

where $\bar{y}_i$ is the predicted value and $y_i$ is the expected value.
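To make these three metrics concrete, here is a minimal pure-Python sketch of each. The function names and the toy values are hypothetical and are not taken from the housing dataset. The coefficient of determination uses the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |predicted - expected|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_manual(y_true, y_pred):
    """Mean squared error: average of (predicted - expected)^2."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

print(f"MAE = {mae_manual(y_true, y_pred):.2f}")
print(f"MSE = {mse_manual(y_true, y_pred):.2f}")
print(f"R2  = {r2_manual(y_true, y_pred):.2f}")
```

On the same inputs, `mean_absolute_error`, `mean_squared_error` and `r2_score` from `sklearn.metrics` should return the same values.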

In [31]:
# List of n_neighbors values to test.
n_neighbors_values = list(range(1, 21))  # From 1 to 20 inclusive.

for n_neighbors in n_neighbors_values:
    # Create a K-Nearest Neighbors (KNN) model with the current n_neighbors value.
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    
    # Make predictions on the test set.
    y_pred = model.predict(X_test)
    
    # Calculate the coefficient of determination (R-squared).
    r2 = r2_score(y_test, y_pred)
    
    # Calculate the Mean Absolute Error (MAE).
    mae = mean_absolute_error(y_test, y_pred)
    
    # Calculate the Mean Squared Error (MSE).
    mse = mean_squared_error(y_test, y_pred)
    
    # Display the metrics for each n_neighbors value.
    print(f'n_neighbors = {n_neighbors}: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

n_neighbors = 1: R-squared = 0.7260, MAE = 64516.67, MSE = 6737091666.67
n_neighbors = 2: R-squared = 0.7488, MAE = 63012.96, MSE = 6177579722.22
n_neighbors = 3: R-squared = 0.7780, MAE = 59016.05, MSE = 5459338024.69
n_neighbors = 4: R-squared = 0.7822, MAE = 56311.11, MSE = 5355805486.11
n_neighbors = 5: R-squared = 0.7767, MAE = 56738.89, MSE = 5489780333.33
n_neighbors = 6: R-squared = 0.7821, MAE = 56069.14, MSE = 5358768549.38
n_neighbors = 7: R-squared = 0.7823, MAE = 55831.48, MSE = 5354503888.89
n_neighbors = 8: R-squared = 0.7837, MAE = 55839.58, MSE = 5320045269.10
n_neighbors = 9: R-squared = 0.7837, MAE = 55778.19, MSE = 5319446721.54
n_neighbors = 10: R-squared = 0.7798, MAE = 56307.22, MSE = 5413818083.33
n_neighbors = 11: R-squared = 0.7753, MAE = 57379.97, MSE = 5526418204.78
n_neighbors = 12: R-squared = 0.7697, MAE = 57829.94, MSE = 5663827577.16
n_neighbors = 13: R-squared = 0.7665, MAE = 57957.41, MSE = 5741286992.11
n_neighbors = 14: R-squared = 0.7608, MAE = 583

**Conclusion:**  

For `n_neighbors` = 1, the R-squared is `0.7260`, so the model already explains a large share of the variance. However, the *MAE* and *MSE* are the highest of the values shown, suggesting that this model is overfitting the training data.

For `n_neighbors` = 8, the R-squared is high (`0.7837`), indicating a good fit to the data. The *MAE* and *MSE* are lower than for `n_neighbors` = 1, which is a good sign.

By `n_neighbors` = 20, the R-squared has decreased and both *MAE* and *MSE* have increased significantly. This suggests that the model is losing predictive power and starting to underfit.

Based on these observations, we choose an `n_neighbors` value of 8 or 9. These values offer a good trade-off between model complexity and performance: a relatively high R-squared with reasonably low *MAE* and *MSE*, which indicates good predictive power and generalization to unseen data.
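The same choice can be checked programmatically. The sketch below is only an illustration: it uses the metrics for `n_neighbors` = 1 to 13 copied from the output above, and the `results` list and variable names are hypothetical.

```python
# (n_neighbors, R-squared, MAE, MSE), copied from the printed output.
results = [
    (1, 0.7260, 64516.67, 6737091666.67),
    (2, 0.7488, 63012.96, 6177579722.22),
    (3, 0.7780, 59016.05, 5459338024.69),
    (4, 0.7822, 56311.11, 5355805486.11),
    (5, 0.7767, 56738.89, 5489780333.33),
    (6, 0.7821, 56069.14, 5358768549.38),
    (7, 0.7823, 55831.48, 5354503888.89),
    (8, 0.7837, 55839.58, 5320045269.10),
    (9, 0.7837, 55778.19, 5319446721.54),
    (10, 0.7798, 56307.22, 5413818083.33),
    (11, 0.7753, 57379.97, 5526418204.78),
    (12, 0.7697, 57829.94, 5663827577.16),
    (13, 0.7665, 57957.41, 5741286992.11),
]

# Highest R-squared (ties keep the smallest k), lowest MAE, lowest MSE.
best_r2 = max(results, key=lambda row: row[1])
best_mae = min(results, key=lambda row: row[2])
best_mse = min(results, key=lambda row: row[3])

print(f"best R-squared: n_neighbors = {best_r2[0]}")
print(f"lowest MAE:     n_neighbors = {best_mae[0]}")
print(f"lowest MSE:     n_neighbors = {best_mse[0]}")
```

With the R-squared rounded to four decimals, 8 and 9 are tied, while MAE and MSE are slightly lower for 9. This is consistent with treating both values as reasonable choices.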

## Task 8: Normalize data and analyse performances

**Instructions:** Use the `StandardScaler` provided by `scikit-learn` to normalize the input data of the dataset. Then check the performance of the model.

In [32]:
# Create scaler.
scaler = StandardScaler()

We then fit the scaler on the training data and apply the same transformation to the test data.

In [33]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
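
As a rough illustration of what these two calls do, the sketch below standardizes a single feature in pure Python. It assumes the scaler subtracts the training mean and divides by the training population standard deviation. The helper names and toy values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply (value - mean) / std with already learned statistics."""
    return [(value - mean) / std for value in column]


# Toy values for one feature (illustration only).
train_col = [4.0, 6.0, 8.0, 10.0]
test_col = [5.0, 12.0]

# Statistics come from the training column only ...
mean, std = fit_standardizer(train_col)

# ... and are reused unchanged on the test column.
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set, rather than refitting, keeps information about the test data out of the preprocessing step.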

We can then train the model with the `n_neighbors` value chosen above and predict on the scaled test set.

In [34]:
model = KNeighborsRegressor(n_neighbors=8)  # We choose an appropriate n_neighbors value which is 8.
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)  # Predict.

# Compute the metrics.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Display data.
print(f'n_neighbors = 8: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

n_neighbors = 8: R-squared = 0.8143, MAE = 52477.31, MSE = 4565602222.22


**Conclusion:**  
R-squared (R²) = 0.8143: This value indicates that the model explains approximately 81.43% of the variance in the target variable. A high R² value is a positive sign, suggesting a good fit to the data.

Mean Absolute Error (MAE) = 52477.31: The MAE measures the average absolute difference between predicted and actual values. A lower MAE is desirable, and 52477.31 indicates that, on average, predictions are off by this amount.

Mean Squared Error (MSE) = 4565602222.22: The MSE measures the average squared difference between predicted and actual values. A lower MSE is better. Because the errors are squared, this value is expressed in squared dollars and penalizes large errors more heavily than the MAE does.
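Since the MSE is hard to read on its own, taking its square root brings it back to the unit of the target. The sketch below does this for the two MSE values reported for `n_neighbors` = 8. This root mean squared error is not computed in the notebook and is shown only as a reading aid.

```python
import math

# MSE values copied from the notebook output for n_neighbors = 8.
mse_raw = 5320045269.10     # without scaling
mse_scaled = 4565602222.22  # with StandardScaler

# The square root of the MSE is in the same unit as MEDV.
rmse_raw = math.sqrt(mse_raw)
rmse_scaled = math.sqrt(mse_scaled)

print(f"RMSE without scaling: {rmse_raw:.0f}")
print(f"RMSE with scaling:    {rmse_scaled:.0f}")
```

Both values are larger than the corresponding MAE, which is expected because squaring gives more weight to the largest errors.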

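Based on these metrics, the model with `n_neighbors` = 8 and normalized data performs better than before: the R-squared rises from 0.7837 to 0.8143, the MAE drops from 55839.58 to 52477.31 and the MSE drops from 5320045269.10 to 4565602222.22. Scaling the inputs with `StandardScaler` improves all three metrics.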
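One plausible explanation, not verified in the notebook, is that KNN relies on distances, so features with a wide numeric range weigh more than the others when the data is left unscaled. The sketch below compares how much each feature contributes to the squared Euclidean distance between the first two rows shown by `df.head()`. As an approximation, it divides by the standard deviations reported by `df.describe()` rather than the values the scaler actually learns on the training set.

```python
# First two rows of the dataset, as printed by df.head().
row_a = {"RM": 6.575, "LSTAT": 4.98, "PTRATIO": 15.3}
row_b = {"RM": 6.421, "LSTAT": 9.14, "PTRATIO": 17.8}

# Standard deviations from df.describe() (assumption: used as a stand-in
# for the statistics StandardScaler computes on the training set).
std = {"RM": 0.64365, "LSTAT": 7.08199, "PTRATIO": 2.111268}


def squared_gaps(a, b, scale=None):
    """Per-feature squared difference, optionally divided by a scale."""
    gaps = {}
    for name in a:
        diff = a[name] - b[name]
        if scale is not None:
            diff /= scale[name]
        gaps[name] = diff ** 2
    return gaps


def shares(gaps):
    """Fraction of the total squared distance due to each feature."""
    total = sum(gaps.values())
    return {name: value / total for name, value in gaps.items()}


raw_shares = shares(squared_gaps(row_a, row_b))
scaled_shares = shares(squared_gaps(row_a, row_b, std))

for name in row_a:
    print(f"{name}: raw {raw_shares[name]:.1%}, scaled {scaled_shares[name]:.1%}")
```

For this pair of rows, RM contributes almost nothing to the unscaled distance, and its share grows once each feature is divided by its spread.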