#### Jérémy TREMBLAY

# TD1: Decision Tree Regression

In [1]:
# Import the library that will be used in this notebook.
import pandas as pd

# Import the sklearn modules.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Task 3: Create Regression Tree

**Instructions:** Using the *redwine* dataset, propose an effective decision tree (regression) model that predicts the rating of a red wine from its attributes.  

Hints:  
* You will need to use a `DecisionTreeRegressor`.
* The error metrics must be adapted to regression: *MAE*, *MSE*, *R2* (available in the `sklearn.metrics` module).
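
To make the three metrics concrete, here is a minimal pure-Python sketch of how they are defined. The ratings in `y_true` and `y_pred` are hypothetical values chosen for illustration and do not come from the *redwine* data; in the notebook itself the `sklearn.metrics` functions are used instead.

```python
# Hypothetical ratings for illustration only (not taken from redwine).
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 5.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / n

# MSE: mean of the squared errors (penalizes large errors more heavily).
mse = sum(e ** 2 for e in errors) / n

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae}, MSE: {mse}, R2: {r2:.3f}")
```

A model that always predicts the mean of `y_true` would get an *R2* of 0, so *R2* can be read as the improvement over that baseline.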

### First step: prepare data

First, we load the data and inspect it.

In [2]:
# Specify the relative path of the redwine file.
file_path = 'datasets/redwine.csv'

# Load the database into a DataFrame.
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame with head.
print(df.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

In [3]:
print(df.isnull().any())

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool


No column contains missing values, so no imputation is needed.
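
Missing values are only one aspect of data quality. As a hedged sketch, the check can be extended to count missing values per column and to count duplicated rows. The `toy` frame below is hypothetical and only stands in for `df`; nothing here is a result about the *redwine* file.

```python
import pandas as pd

# Hypothetical toy frame (not the redwine data):
# one missing value and one row that repeats an earlier row.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 9.4],
    "quality": [5, 5, 6, 5],
})

# Count missing values per column (any() only says whether there is at least one).
missing_per_column = toy.isnull().sum()

# Count rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicated rows:", n_duplicates)
```

Running the same two calls on `df` would show whether the dataset contains repeated samples, which can matter when rows are split between train and test.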

In [4]:
# Get the dimensions of the DataFrame.
df.shape

(1599, 12)

There are 1599 rows and 12 columns. Let's look at the content in more detail with some statistics.

In [5]:
# Display useful information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [6]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


In [7]:
df.quality.value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

What we can conclude from this data:  

The dataset contains various attributes related to red wine quality. Here's a brief overview of the columns and what they represent:
* `Fixed Acidity`: Represents the fixed (non-volatile) acids in the wine.
* `Volatile Acidity`: Indicates the amount of acetic acid in the wine, which can lead to an unpleasant, vinegar-like taste.
* `Citric Acid`: Reflects the presence of citric acid, which can add freshness and flavor to the wine.
* `Residual Sugar`: Represents the amount of residual sugar in the wine after fermentation.
* `Chlorides`: Indicates the amount of salt in the wine.
* `Free Sulfur Dioxide`: Measures the free form of sulfur dioxide, which is used for preserving the wine.
* `Total Sulfur Dioxide`: Represents the total amount of sulfur dioxide, which is related to the wine's preservation and taste.
* `Density`: Reflects the density of the wine.
* `pH`: Represents the pH level, a measure of how acidic the wine is.
* `Sulphates`: Indicates the amount of sulphates, a wine additive that can contribute to sulfur dioxide levels.
* `Alcohol`: Represents the alcohol content in the wine.
* `Quality`: Indicates the wine's quality rating. In this dataset the observed ratings range from 3 to 8.

This dataset provides a collection of attributes that are often used to assess and predict the quality of red wine. Each row in the dataset corresponds to a specific red wine sample, and the "Quality" column provides a rating for each sample, which is the data we are trying to predict.

### Second step: Split data

Now let's split the data between train and test.

In [8]:
# Separate the features (X) from the target (y).
X = df.drop('quality', axis=1)
y = df['quality']

In [9]:
# Split data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # 1/3 for the test.
print("Train: ", len(X_train), ", ", len(y_train))
print("Test: ", len(X_test), ", ", len(y_test))

Train:  1071 ,  1071
Test:  528 ,  528
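
The sizes printed above can be reproduced by hand. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test count up and gives the remaining rows to the training set; that assumption is consistent with the 1071 / 528 split shown here.

```python
import math

n_samples = 1599   # rows in the redwine DataFrame (from df.shape)
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Train:", n_train)
print("Test:", n_test)
```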


### Third step: Create regression tree

The main part of the code is here: we train one regression tree per depth level to find a good value for the `max_depth` parameter.

In [10]:
# Test depths 1 through 22 (range excludes its upper bound).
depths = range(1, 23)

# For each depth: create a regression tree, train it, predict on the train and test sets, and display the error metrics.
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Compute the train MSE and the test MSE, MAE and R2, then display them.
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    mae = mean_absolute_error(y_test, y_test_pred)
    r2 = r2_score(y_test, y_test_pred)
    print(f"Model (depth={depth}): [train MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}, MAE: {mae:.2f}, R2: {r2:.2f}]")

Model (depth=1): [train MSE: 0.52, test MSE: 0.57, MAE: 0.59, R2: 0.14]
Model (depth=2): [train MSE: 0.46, test MSE: 0.52, MAE: 0.58, R2: 0.21]
Model (depth=3): [train MSE: 0.41, test MSE: 0.51, MAE: 0.56, R2: 0.24]
Model (depth=4): [train MSE: 0.36, test MSE: 0.49, MAE: 0.55, R2: 0.26]
Model (depth=5): [train MSE: 0.31, test MSE: 0.48, MAE: 0.52, R2: 0.28]
Model (depth=6): [train MSE: 0.26, test MSE: 0.52, MAE: 0.52, R2: 0.21]
Model (depth=7): [train MSE: 0.21, test MSE: 0.54, MAE: 0.51, R2: 0.18]
Model (depth=8): [train MSE: 0.16, test MSE: 0.55, MAE: 0.51, R2: 0.17]
Model (depth=9): [train MSE: 0.12, test MSE: 0.59, MAE: 0.51, R2: 0.10]
Model (depth=10): [train MSE: 0.09, test MSE: 0.63, MAE: 0.51, R2: 0.04]
Model (depth=11): [train MSE: 0.07, test MSE: 0.64, MAE: 0.51, R2: 0.04]
Model (depth=12): [train MSE: 0.05, test MSE: 0.64, MAE: 0.49, R2: 0.04]
Model (depth=13): [train MSE: 0.03, test MSE: 0.67, MAE: 0.50, R2: -0.01]
Model (depth=14): [train MSE: 0.02, test MSE: 0.66, MAE: 0.

**Conclusion:** After evaluating decision tree regression models with different maximum depths (`max_depth`), we observe the following:

With a depth of 1, the test Mean Squared Error (*MSE*) is 0.57 and the *R2* (coefficient of determination) is only 0.14, suggesting limited predictive power.

As the depth increases up to 5, the test *MSE* decreases to 0.48, indicating improved model performance. Beyond depth 5, the test *MSE* rises again while the train *MSE* keeps falling (down to 0.03 at depth 13), which indicates overfitting.

The Mean Absolute Error (*MAE*) drops from 0.59 at depth 1 to 0.52 at depth 5, then stays roughly flat, between 0.49 and 0.52, for the deeper trees shown.

The *R2* values are relatively low, indicating that the models explain only a small portion of the variance in the target variable.
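
One way to read these *R2* values: since *R2* = 1 - *MSE* / Var(`y_test`), each reported (test *MSE*, *R2*) pair implies a value for the variance of the test ratings. The sketch below uses rounded figures copied from the sweep output, so the results are approximate; the name `implied_variance` is introduced here for illustration.

```python
# Rounded (test MSE, R2) pairs copied from the depth sweep output.
reported = {1: (0.57, 0.14), 5: (0.48, 0.28), 9: (0.59, 0.10)}

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
implied_variance = {
    depth: mse / (1 - r2) for depth, (mse, r2) in reported.items()
}

for depth, variance in implied_variance.items():
    print(f"depth={depth}: implied test-set variance ~ {variance:.2f}")
```

All three depths imply a variance of roughly 0.66 to 0.67, close to the square of the overall `quality` standard deviation (0.807569² ≈ 0.65). A test *MSE* of 0.48 is therefore only a modest improvement over always predicting the mean rating.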

Considering the trade-off between model complexity and performance, a maximum depth of 5 is recommended for the decision tree regression model. At this depth, the model achieves the lowest test *MSE* (0.48), a relatively low *MAE* of 0.52, and the highest *R2* (0.28), suggesting that it strikes a balance between fitting the data and generalizing to unseen data. Beyond depth 5, the model becomes overly complex and may not generalize well to new data.
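
The same choice can be made programmatically. This is a minimal sketch that picks the depth with the lowest test *MSE* from the rounded values printed for depths 1 to 13 (the output for deeper trees is cut off, so they are left out). The names `test_mse_by_depth` and `best_depth` are introduced here for illustration.

```python
# Rounded test MSE per depth, copied from the sweep output (depths 1 to 13).
test_mse_by_depth = {
    1: 0.57, 2: 0.52, 3: 0.51, 4: 0.49, 5: 0.48, 6: 0.52, 7: 0.54,
    8: 0.55, 9: 0.59, 10: 0.63, 11: 0.64, 12: 0.64, 13: 0.67,
}

# Pick the depth whose test MSE is smallest.
best_depth = min(test_mse_by_depth, key=test_mse_by_depth.get)

print(f"Best depth: {best_depth} (test MSE: {test_mse_by_depth[best_depth]:.2f})")
```

Because the depth is selected using the test set, the test metrics for the chosen depth are likely to be slightly optimistic.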

In [11]:
optimal_depth = 5

So let's use `max_depth` = 5 to train our model:

In [12]:
# Train our model with this depth.
final_model = DecisionTreeRegressor(max_depth=optimal_depth, random_state=42)
final_model.fit(X_train, y_train)

# Predict on the training and test sets.
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

# Evaluate the final model.
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
mae = mean_absolute_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)

print(f'Model Evaluation - Max Depth {optimal_depth}')
print(f'Train MSE: {train_mse:.2f}')
print(f'Test MSE: {test_mse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'R2: {r2:.2f}')

Model Evaluation - Max Depth 5
Train MSE: 0.31
Test MSE: 0.48
MAE: 0.52
R2: 0.28
