## <span style="color:red"> Homework 3 </span>

# Disclaimer
* This notebook was written by Benjamin Amar for the 2023 DSAIS class at Em-Lyon.
* Generative AI (ChatGPT 4) was used for debugging and inspiration.

With your toy dataset, now select an appropriate quantitative outcome - the same one you used for the homework on regression or another. <br> 
All the other variables are to be used as predictors. <br>
Be careful: ID variables should not be part of the modeling!

### Step 1 : Reading Assignment

As usual, the reading assignment is part of the next quiz

Read Section 8.2 (pp 343-354) of our Textbook [An Introduction to Statistical Learning - with Applications in Python](https://www.statlearning.com/)

### Step 2 : Handling categorical data

There are mainly two ways of handling categorical data :

1. One-hot encoding (OHE). Use `pd.get_dummies` or sklearn's `OneHotEncoder`.<br>
    a. First explain what OHE accomplishes, in a small paragraph with its up and down sides.<br>
    b. Apply OHE to all of your categorical variables.<br>
    
2. Numericalization. Use sklearn's `LabelEncoder` or, better yet, the method `astype('category')` followed by the attribute `.cat.codes` <br>
    a. First explain what numericalization entails with its up and down sides. <br>
    b. Apply numericalization on your chosen categorical variables. <br>
    
### Step 3 : Training models with default parameters

1. Use sklearn `train_test_split` to select a training sample and a test sample.
2. Train at least three different models (one of them should not be tree-based).
3. Select the appropriate metrics to estimate scores and compare performance between the models.
4. Which handling of the categorical data is best ?

### Step 4 : Tuning model

1. Select the best model among the previous ones and do sequential tuning (one parameter at a time with a plot) on at least two of the parameters. <br>
    a. Print the values of the best parameters <br>
    b. Give the score of the tuned model on the test set <br>
2. Use `GridSearchCV` or `RandomizedSearchCV` from sklearn to tune 2 or more parameters <br>
    a. Print the best parameters <br>
    b. Save the best model and print its score on the test set <br>


## 0 - Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math as m
import seaborn as sns
import import_ipynb

from scipy.stats import chi2_contingency
from itertools import combinations
from scipy import stats

In [2]:
path = "Toydataset.csv"
data = None  # stays None if the file format is not supported

if path.endswith('.csv'):
    data = pd.read_csv(path)

elif path.endswith(('.xlsx', '.xls')):
    data = pd.read_excel(path)

elif path.endswith('.txt'):
    data = pd.read_csv(path, sep='*', encoding='latin')

else:
    print("Unsupported file format. Please provide a .csv, .xlsx, .xls or .txt file")
    
if data is not None:
    print(f"Dataframe {path} loaded successfully! 👍")

Dataframe Toydataset.csv loaded successfully! 👍


In [3]:
datac = data.copy()

In [4]:
datac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18600 entries, 0 to 18599
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Rank              16547 non-null  float64
 1   Game              16537 non-null  object 
 2   Month_str         16549 non-null  object 
 3   Month             16538 non-null  float64
 4   Year              16510 non-null  float64
 5   YearTop1          18600 non-null  bool   
 6   Hours_watched     16468 non-null  float64
 7   Hours_streamed    16451 non-null  float64
 8   Peak_viewers      16425 non-null  float64
 9   Peak_channels     16494 non-null  float64
 10  Streamers         16453 non-null  float64
 11  Avg_viewers       16453 non-null  float64
 12  Avg_channels      16552 non-null  float64
 13  Avg_viewer_ratio  16515 non-null  float64
dtypes: bool(1), float64(11), object(2)
memory usage: 1.9+ MB


In [5]:
datac

Unnamed: 0,Rank,Game,Month_str,Month,Year,YearTop1,Hours_watched,Hours_streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
0,1.0,League of Legends,,1.0,2016.0,False,94377226.0,1362044.0,530270.0,2903.0,129172.0,127021.0,,69.29
1,2.0,Counter-Strike: Global Offensive,January,1.0,2016.0,True,47832863.0,830105.0,372654.0,2197.0,,64378.0,1117.0,57.62
2,3.0,Dota 2,January,1.0,2016.0,False,45185893.0,433397.0,,1100.0,44074.0,60815.0,583.0,104.26
3,4.0,Hearthstone,January,1.0,2016.0,False,39936159.0,235903.0,,,36170.0,53749.0,317.0,169.29
4,5.0,Call of Duty: Black Ops III,January,1.0,2016.0,False,16153057.0,1151578.0,71639.0,,214054.0,21740.0,1549.0,14.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18595,196.0,PlateUp!,,9.0,2023.0,False,560064.0,18617.0,16476.0,71.0,4034.0,778.0,25.0,30.08
18596,197.0,PokÃ©mon GO,September,9.0,2023.0,False,551596.0,16578.0,3001.0,,,767.0,23.0,33.27
18597,,Bloons TD 6,September,9.0,2023.0,False,,20142.0,10320.0,65.0,5673.0,752.0,28.0,26.85
18598,199.0,,September,9.0,2023.0,False,533644.0,27014.0,11508.0,68.0,1144.0,,37.0,19.75


In [6]:
# Initialize a dictionary to store type consistency results
type_consistency = {}

# Iterate through DataFrame columns
for column in datac:
    # Get the unique data types in the column
    unique_types = datac[column].apply(type).unique()
    
    # Check whether the column holds a single data type
    if len(unique_types) == 1:
        type_consistency[column] = f"🟢 {column}: Consistent ({unique_types[0].__name__})"
    else:
        type_consistency[column] = f"🔴 {column}: Inconsistent ({', '.join(t.__name__ for t in unique_types)})"

# Print the type consistency results for each feature
for consistency in type_consistency.values():
    print(consistency)
    #output_file.write(consistency)

🟢 Rank: Consistent (float)
🔴 Game: Inconsistent (str, float)
🔴 Month_str: Inconsistent (float, str)
🟢 Month: Consistent (float)
🟢 Year: Consistent (float)
🟢 YearTop1: Consistent (bool)
🟢 Hours_watched: Consistent (float)
🟢 Hours_streamed: Consistent (float)
🟢 Peak_viewers: Consistent (float)
🟢 Peak_channels: Consistent (float)
🟢 Streamers: Consistent (float)
🟢 Avg_viewers: Consistent (float)
🟢 Avg_channels: Consistent (float)
🟢 Avg_viewer_ratio: Consistent (float)


In [7]:
datac.isna().sum().sum()

27308

# Step 2 : Handling categorical data

## 2.1.a

### Explanation:
One-hot encoding converts categorical data into a numerical format that machine learning algorithms can use.
Each category of a feature becomes its own binary column, set to 1 when the row belongs to that category and 0 otherwise.

The advantages of one-hot encoding include:
- It does not impose any ordinal relationship between the categories.
- It is suitable for algorithms that require numerical input, such as linear regression.

However, it has downsides:
- It can drastically increase the dimensionality of the dataset: a feature such as 'Game', with many distinct values, adds one column per category.
- It treats all categories as equally distinct, so it does not capture any inherent similarity between them.
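
As a minimal sketch of what one-hot encoding does, the snippet below applies `pd.get_dummies` to a small hypothetical frame. The `toy` data and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Game": ["Dota 2", "Hearthstone", "Dota 2", "osu!"],
    "Rank": [3, 4, 3, 150],
})

# One binary column per category. drop_first=True drops the first
# category in sorted order ("Dota 2"), which becomes the implicit baseline.
toy_ohe = pd.get_dummies(toy, columns=["Game"], drop_first=True)

print(toy_ohe.columns.tolist())
print(toy_ohe.shape)
```

With three distinct games, the single 'Game' column is replaced by two indicator columns. The same mechanism applied to a feature with many distinct values produces a correspondingly wide table.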

## 2.1.b

In [8]:
datac = data.copy().drop('Month_str', axis=1)

We drop the 'Month_str' feature because it is redundant: the 'Month' feature carries the same information and is already numerically encoded.

In [9]:
# Identify numerical and categorical columns
numerical_columns = datac.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_columns = datac.select_dtypes(include=['object']).columns.tolist()

numerical_columns, categorical_columns

(['Rank',
  'Month',
  'Year',
  'Hours_watched',
  'Hours_streamed',
  'Peak_viewers',
  'Peak_channels',
  'Streamers',
  'Avg_viewers',
  'Avg_channels',
  'Avg_viewer_ratio'],
 ['Game'])

In [10]:
from sklearn.impute import KNNImputer, SimpleImputer

# Apply KNN imputation for numerical columns
knn_imputer = KNNImputer()
datac[numerical_columns] = knn_imputer.fit_transform(datac[numerical_columns])

# Apply mode imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
datac[categorical_columns] = mode_imputer.fit_transform(datac[categorical_columns])

datac.isnull().sum()

Rank                0
Game                0
Month               0
Year                0
YearTop1            0
Hours_watched       0
Hours_streamed      0
Peak_viewers        0
Peak_channels       0
Streamers           0
Avg_viewers         0
Avg_channels        0
Avg_viewer_ratio    0
dtype: int64

In [11]:
datac

Unnamed: 0,Rank,Game,Month,Year,YearTop1,Hours_watched,Hours_streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
0,1.0,League of Legends,1.0,2016.0,False,94377226.0,1362044.0,530270.0,2903.0,129172.0,127021.0,1742.8,69.290
1,2.0,Counter-Strike: Global Offensive,1.0,2016.0,True,47832863.0,830105.0,372654.0,2197.0,30256.6,64378.0,1117.0,57.620
2,3.0,Dota 2,1.0,2016.0,False,45185893.0,433397.0,169907.0,1100.0,44074.0,60815.0,583.0,104.260
3,4.0,Hearthstone,1.0,2016.0,False,39936159.0,235903.0,94691.4,619.4,36170.0,53749.0,317.0,169.290
4,5.0,Call of Duty: Black Ops III,1.0,2016.0,False,16153057.0,1151578.0,71639.0,484.2,214054.0,21740.0,1549.0,14.030
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18595,196.0,PlateUp!,9.0,2023.0,False,560064.0,18617.0,16476.0,71.0,4034.0,778.0,25.0,30.080
18596,197.0,PokÃ©mon GO,9.0,2023.0,False,551596.0,16578.0,3001.0,121.6,4070.8,767.0,23.0,33.270
18597,118.6,Bloons TD 6,9.0,2023.0,False,444984.0,20142.0,10320.0,65.0,5673.0,752.0,28.0,26.850
18598,199.0,Retro,9.0,2023.0,False,533644.0,27014.0,11508.0,68.0,1144.0,1012.6,37.0,19.750


In [12]:
# Applying One-hot encoding to the categorical columns
datac_ohe = pd.get_dummies(datac, columns=categorical_columns, drop_first=True)

In [13]:
datac_ohe

Unnamed: 0,Rank,Month,Year,YearTop1,Hours_watched,Hours_streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,...,Game_duplicate Sleepers,Game_eBaseball Pawafuru Puroyakyu 2022,Game_eBaseball Powerful Pro Yaky<U+016B> 2020,Game_eFootball 2022,Game_eFootball PES 2020,Game_iRacing,Game_osu!,Game_skribbl.io,Game_some some convenience store,Game_theHunter: Call of the Wild
0,1.0,1.0,2016.0,False,94377226.0,1362044.0,530270.0,2903.0,129172.0,127021.0,...,0,0,0,0,0,0,0,0,0,0
1,2.0,1.0,2016.0,True,47832863.0,830105.0,372654.0,2197.0,30256.6,64378.0,...,0,0,0,0,0,0,0,0,0,0
2,3.0,1.0,2016.0,False,45185893.0,433397.0,169907.0,1100.0,44074.0,60815.0,...,0,0,0,0,0,0,0,0,0,0
3,4.0,1.0,2016.0,False,39936159.0,235903.0,94691.4,619.4,36170.0,53749.0,...,0,0,0,0,0,0,0,0,0,0
4,5.0,1.0,2016.0,False,16153057.0,1151578.0,71639.0,484.2,214054.0,21740.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18595,196.0,9.0,2023.0,False,560064.0,18617.0,16476.0,71.0,4034.0,778.0,...,0,0,0,0,0,0,0,0,0,0
18596,197.0,9.0,2023.0,False,551596.0,16578.0,3001.0,121.6,4070.8,767.0,...,0,0,0,0,0,0,0,0,0,0
18597,118.6,9.0,2023.0,False,444984.0,20142.0,10320.0,65.0,5673.0,752.0,...,0,0,0,0,0,0,0,0,0,0
18598,199.0,9.0,2023.0,False,533644.0,27014.0,11508.0,68.0,1144.0,1012.6,...,0,0,0,0,0,0,0,0,0,0


In [14]:
datac_ohe.to_csv('ToyDataset_ohe.csv', index=False)

## 2.2

### Explanation:
Numericalization, also known as label encoding, is a method that converts each category of a categorical feature into a unique number.

For example, ['Blue', 'Red', 'Green'] would become [0, 2, 1] with `.cat.codes`, because the codes follow the sorted order of the categories (Blue=0, Green=1, Red=2).

The advantages of label encoding include:
- It converts categorical data into a format that can be read by algorithms requiring numerical input.
- It does not expand the size of the dataset, unlike one-hot encoding.

However, it has downsides:
- It introduces an ordinal relationship even for nominal data, which can mislead the model and create a bias.
- It does not capture any inherent relationship between categories for nominal data, because the assigned numbers are arbitrary.
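
The following sketch illustrates numericalization with `astype('category')` and `.cat.codes` on a hypothetical list of colours (the data is invented for the example). It also shows that missing values receive the code -1, which is worth keeping in mind when encoding before imputation.

```python
import pandas as pd

# Hypothetical nominal feature; values are illustrative only.
colors = pd.Series(["Blue", "Red", "Green", "Red"])

cat = colors.astype("category")
codes = cat.cat.codes

# Recover which integer stands for which category.
mapping = dict(enumerate(cat.cat.categories))

print(codes.tolist())
print(mapping)

# Missing values are encoded as -1 rather than as a regular category.
codes_with_missing = pd.Series(["Blue", None]).astype("category").cat.codes
print(codes_with_missing.tolist())
```

The integers reflect the sorted order of the labels, not any meaningful ranking, which is exactly the artificial ordering mentioned in the downsides above.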

In [15]:
datac_num = data.copy().drop('Month_str', axis=1)

As before, we drop the 'Month_str' feature because it is redundant: the 'Month' feature carries the same information and is already numerically encoded.

In [16]:
# Identify numerical and categorical columns
numerical_columns = datac_num.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_columns = datac_num.select_dtypes(include=['object']).columns.tolist()

numerical_columns, categorical_columns

(['Rank',
  'Month',
  'Year',
  'Hours_watched',
  'Hours_streamed',
  'Peak_viewers',
  'Peak_channels',
  'Streamers',
  'Avg_viewers',
  'Avg_channels',
  'Avg_viewer_ratio'],
 ['Game'])

In [17]:
from sklearn.impute import KNNImputer, SimpleImputer

# Apply KNN imputation for numerical columns
knn_imputer = KNNImputer()
datac_num[numerical_columns] = knn_imputer.fit_transform(datac_num[numerical_columns])

# Apply mode imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
datac_num[categorical_columns] = mode_imputer.fit_transform(datac_num[categorical_columns])

datac_num.isnull().sum()

Rank                0
Game                0
Month               0
Year                0
YearTop1            0
Hours_watched       0
Hours_streamed      0
Peak_viewers        0
Peak_channels       0
Streamers           0
Avg_viewers         0
Avg_channels        0
Avg_viewer_ratio    0
dtype: int64

In [18]:
# Apply numericalization to the 'YearTop1' and 'Game' columns
datac_num['YearTop1'] = datac_num['YearTop1'].astype('category').cat.codes
datac_num['Game'] = datac_num['Game'].astype('category').cat.codes

In [19]:
datac_num

Unnamed: 0,Rank,Game,Month,Year,YearTop1,Hours_watched,Hours_streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
0,1.0,898,1.0,2016.0,0,94377226.0,1362044.0,530270.0,2903.0,129172.0,127021.0,1742.8,69.290
1,2.0,284,1.0,2016.0,1,47832863.0,830105.0,372654.0,2197.0,30256.6,64378.0,1117.0,57.620
2,3.0,431,1.0,2016.0,0,45185893.0,433397.0,169907.0,1100.0,44074.0,60815.0,583.0,104.260
3,4.0,737,1.0,2016.0,0,39936159.0,235903.0,94691.4,619.4,36170.0,53749.0,317.0,169.290
4,5.0,213,1.0,2016.0,0,16153057.0,1151578.0,71639.0,484.2,214054.0,21740.0,1549.0,14.030
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18595,196.0,1264,9.0,2023.0,0,560064.0,18617.0,16476.0,71.0,4034.0,778.0,25.0,30.080
18596,197.0,1295,9.0,2023.0,0,551596.0,16578.0,3001.0,121.6,4070.8,767.0,23.0,33.270
18597,118.6,172,9.0,2023.0,0,444984.0,20142.0,10320.0,65.0,5673.0,752.0,28.0,26.850
18598,199.0,1393,9.0,2023.0,0,533644.0,27014.0,11508.0,68.0,1144.0,1012.6,37.0,19.750


In [20]:
datac_num.to_csv('ToyDataset_num.csv', index=False)

# Step 3 : Training models with default parameters

We choose 'Avg_viewers' as our target variable.
It represents the average number of viewers of a game in a given month.

We use 'datac_num', the dataset encoded with label encoding, for the models below.

In [21]:
from sklearn.model_selection import train_test_split

# Splitting the data into features and target
X = datac_num.drop("Avg_viewers", axis=1)
y = datac_num["Avg_viewers"]

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((14880, 12), (3720, 12), (14880,), (3720,))

In [22]:
from sklearn.metrics import mean_squared_error, r2_score

def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    """
    Function:
        Trains the model using the training data and evaluates its performance on the test data.
    
    Parameters:
        model: The machine learning model to be trained and evaluated.
        X_train, y_train: The training data.
        X_test, y_test: The test data.
    
    Returns:
        A formatted string containing the evaluation metrics for both training and test data.
    """
    # Train the model
    model.fit(X_train, y_train)
    
    # Helper function to compute metrics
    def compute_metrics(y_true, y_pred):
        mse = mean_squared_error(y_true, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_true, y_pred)
        rmse_over_mean = rmse / np.mean(y_true)
        
        return {
            'Mean Squared Error': mse,
            'Root Mean Squared Error': rmse,
            'RMSE/mean(y)': rmse_over_mean,
            'R-squared': r2
        }
    
    # Predict and evaluate on training data
    y_train_pred = model.predict(X_train)
    train_metrics = compute_metrics(y_train, y_train_pred)
    
    # Predict and evaluate on test data
    y_test_pred = model.predict(X_test)
    test_metrics = compute_metrics(y_test, y_test_pred)
    
    # Format the results for better readability
    output = []
    output.append("Training Metrics:")
    for key, value in train_metrics.items():
        output.append(f"  {key}: {value:.4f}")
    
    output.append("\nTest Metrics:")
    for key, value in test_metrics.items():
        output.append(f"  {key}: {value:.4f}")
    
    return "\n".join(output)

In [23]:
# Training a Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=42)
print(train_and_evaluate(dt_model, X_train, y_train, X_test, y_test))

Training Metrics:
  Mean Squared Error: 0.0000
  Root Mean Squared Error: 0.0000
  RMSE/mean(y): 0.0000
  R-squared: 1.0000

Test Metrics:
  Mean Squared Error: 103803411.1416
  Root Mean Squared Error: 10188.3959
  RMSE/mean(y): 1.3273
  R-squared: 0.8570


In [24]:
# Training a Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)
print(train_and_evaluate(rf_model, X_train, y_train, X_test, y_test))

Training Metrics:
  Mean Squared Error: 9250663.7934
  Root Mean Squared Error: 3041.4904
  RMSE/mean(y): 0.4302
  R-squared: 0.9838

Test Metrics:
  Mean Squared Error: 68879153.3874
  Root Mean Squared Error: 8299.3466
  RMSE/mean(y): 1.0812
  R-squared: 0.9051


In [25]:
# Training a Linear Regression
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
print(train_and_evaluate(lr_model, X_train, y_train, X_test, y_test))

Training Metrics:
  Mean Squared Error: 81392426.4258
  Root Mean Squared Error: 9021.7751
  RMSE/mean(y): 1.2762
  R-squared: 0.8572

Test Metrics:
  Mean Squared Error: 104997543.2213
  Root Mean Squared Error: 10246.8309
  RMSE/mean(y): 1.3349
  R-squared: 0.8554


We now compare the three models on the test set:

- Decision Tree Regressor: R² = 0.857
- Random Forest Regressor: R² = 0.905
- Linear Regression: R² = 0.855

The Random Forest has the highest test R² and the lowest test RMSE (8299, versus about 10200 for the other two), so it performs best among the three models. The Decision Tree reaches a training R² of 1.000 but only 0.857 on the test set, which indicates overfitting.
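
A single train/test split can make small differences between models look more meaningful than they are. As a hedged sketch, the comparison could be repeated with cross-validation. The example below uses synthetic data from `make_regression` (an assumption made so that the block runs on its own), so the ranking it prints does not apply to the notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; replace with the real X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Linear Regression": LinearRegression(),
}

# Mean 5-fold cross-validated R² for each model.
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation across folds alongside the mean gives a rough idea of whether a gap between two models exceeds the fold-to-fold variation.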

# Step 4 : Tuning model

In [26]:
# Get the parameters of the trained Random Forest model
rf_params = rf_model.get_params()

# Print the parameters
for key, value in rf_params.items():
    print(f"{key}: {value}")

bootstrap: True
ccp_alpha: 0.0
criterion: squared_error
max_depth: None
max_features: 1.0
max_leaf_nodes: None
max_samples: None
min_impurity_decrease: 0.0
min_samples_leaf: 1
min_samples_split: 2
min_weight_fraction_leaf: 0.0
n_estimators: 100
n_jobs: None
oob_score: False
random_state: 42
verbose: 0
warm_start: False


Let's tune the 'n_estimators' and 'n_jobs' parameters, one at a time.

In [27]:
def tune_single_parameter(model, param_name, param_values, X_train, y_train, X_test, y_test):
    """
    Function:
        Tune a single parameter for a given model by evaluating performance on a test set.
    
    Parameters:
        - model: The initialized machine learning model to be tuned.
        - param_name: The name of the parameter to be tuned.
        - param_values: A list of values to test for the parameter.
        - X_train, y_train: The training data.
        - X_test, y_test: The test data.

    Returns:
        - best_param_value: The best value for the parameter based on test set performance.
        - best_score: The best score achieved on the test set.
    """

    best_score = float('-inf')  # Start with negative infinity
    best_param_value = None

    for value in param_values:
        # Set the current parameter value
        setattr(model, param_name, value)
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Evaluate on the test set
        score = model.score(X_test, y_test)
        
        # Update best score and best parameter value if needed
        if score > best_score:
            best_score = score
            best_param_value = value

    return best_param_value, best_score


In [28]:
tune_single_parameter(rf_model, 'n_estimators', [10, 50, 100, 150, 200], X_train, y_train, X_test, y_test)

(100, 0.9051256286106085)

In [29]:
tune_single_parameter(rf_model, 'n_jobs', [10, 50, 100, 150, 200], X_train, y_train, X_test, y_test)

(10, 0.9040109490341607)

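We won't keep a tuned 'n_jobs' value. This parameter only sets the number of parallel workers and does not affect the predictions. The slightly lower score (0.9040 versus 0.9051) comes from `rf_model` still carrying `n_estimators=200`, the last value set by the previous tuning call, since `tune_single_parameter` modifies the model in place.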
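
A quick check on synthetic data suggests that `n_jobs` changes only how the work is parallelised, not the fitted forest. This is a sketch under the assumption of a fixed `random_state`; the data and the `X_demo` / `y_demo` names are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data, only to make the check self-contained.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=0
)

# Fit the same forest with one and with two parallel workers.
preds = {}
for n_jobs in (1, 2):
    model = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=n_jobs)
    model.fit(X_demo, y_demo)
    preds[n_jobs] = model.predict(X_demo)

# With the same random_state, predictions agree up to floating-point noise.
same = np.allclose(preds[1], preds[2])
print(same)
```

For that reason `n_jobs` is usually set once for speed (for example `n_jobs=-1` to use all cores) rather than tuned for accuracy.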

In [30]:
# Training a Random Forest Regressor with the tuned parameters
rf_model_tuned = RandomForestRegressor(random_state=42, n_estimators=100)
print(train_and_evaluate(rf_model_tuned, X_train, y_train, X_test, y_test))

Training Metrics:
  Mean Squared Error: 9250663.7934
  Root Mean Squared Error: 3041.4904
  RMSE/mean(y): 0.4302
  R-squared: 0.9838

Test Metrics:
  Mean Squared Error: 68879153.3874
  Root Mean Squared Error: 8299.3466
  RMSE/mean(y): 1.0812
  R-squared: 0.9051


In [31]:
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter distribution
param_dist = {
    'n_estimators': [10, 50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30, 40, 50]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,  # Number of parameter settings that are sampled
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42
)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Get the best parameters
best_params_random_search = random_search.best_params_

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [32]:
# Save the best model and get its score on the test set
best_rf_model = random_search.best_estimator_
best_rf_score = best_rf_model.score(X_test, y_test)

best_params_random_search, best_rf_score

({'n_estimators': 100, 'max_depth': 20}, 0.9027658877170996)

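Curiously, the test R² of the model selected by the randomized search (0.9028) is slightly lower than that of the default model (0.9051). `RandomizedSearchCV` picks the parameters with the best cross-validated score on the training data, which does not guarantee the best score on this particular test set, and the gap here is small.

We may need to investigate the tuning of the model's hyperparameters further.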
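
One possible way to investigate further is to look at how the cross-validated score changes along a single parameter, without touching the test set. The sketch below uses sklearn's `validation_curve` on synthetic data. The data, the `max_depth` grid and the small `n_estimators` are assumptions chosen to keep the example fast, not settings taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in data; replace with the real X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

param_range = [2, 5, 10, 20]  # illustrative max_depth values
train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=30, random_state=42),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=param_range,
    cv=3,
    scoring="r2",
)

# Average over the folds: one row per parameter value.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_depth = param_range[int(mean_val.argmax())]

for depth, tr, va in zip(param_range, mean_train, mean_val):
    print(f"max_depth={depth}: train R² = {tr:.3f}, validation R² = {va:.3f}")
print("Best max_depth by validation R²:", best_depth)
```

Plotting `mean_train` and `mean_val` against `param_range` gives the kind of one-parameter-at-a-time plot the assignment asks for, and a widening gap between the two curves is a sign of overfitting.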