## **WORKSHOP 003 - NOTEBOOK #2: Model Selection and Training**

### **Setting Environment**

In [1]:
import os 
print(os.getcwd())

try:
    os.chdir("../../workshop-003")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to workshop-003.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\workshop-003\notebooks
d:\U\FIFTH SEMESTER\ETL\workshop-003


### **Importing modules and libraries**

In [2]:
import pandas as pd

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score
import joblib

### **Read Data**

In [4]:
df = pd.read_csv("./data/processed/world_happiness_report.csv")

df.head()

| Unnamed: 0 | country | continent | year | gdp_per_capita | health_life_expectancy | social_support | freedom | government_corruption | generosity | happiness_rank | happiness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Europe | 2015 | 1.39651 | 0.94143 | 1.34951 | 0.66557 | 0.41978 | 0.29678 | 1 | 7.587 |
| 1 | Iceland | Europe | 2015 | 1.30232 | 0.94784 | 1.40223 | 0.62877 | 0.14145 | 0.4363 | 2 | 7.561 |
| 2 | Denmark | Europe | 2015 | 1.32548 | 0.87464 | 1.36058 | 0.64938 | 0.48357 | 0.34139 | 3 | 7.527 |
| 3 | Norway | Europe | 2015 | 1.459 | 0.88521 | 1.33095 | 0.66973 | 0.36503 | 0.34699 | 4 | 7.522 |
| 4 | Canada | North America | 2015 | 1.32629 | 0.90563 | 1.32261 | 0.63297 | 0.32957 | 0.45811 | 5 | 7.427 |


### **Data Preprocessing**

#### **Dummy Variables**

In [5]:
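def create_dummy_vars(df):
    """One-hot encode the continent column and normalize the new column names."""
    # Each continent becomes its own indicator column, e.g. continent_Europe.
    df = pd.get_dummies(df, columns=["continent"])
    
    # Replace the spaces in multi-word continent names with underscores.
    new_columns = {
        "continent_North America": "continent_North_America",
        "continent_Central America": "continent_Central_America",
        "continent_South America": "continent_South_America"
    }

    df = df.rename(columns=new_columns)
    
    return df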

In [6]:
df = create_dummy_vars(df)
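
To make the effect of `create_dummy_vars` concrete, here is a minimal sketch on a toy frame. The three-row DataFrame (`toy`) and its values are made up for illustration and are not taken from the dataset; only the function body mirrors the cell above.

```python
import pandas as pd


def create_dummy_vars(df):
    # One-hot encode the continent column.
    df = pd.get_dummies(df, columns=["continent"])
    # Replace spaces in the generated column names with underscores.
    new_columns = {
        "continent_North America": "continent_North_America",
        "continent_Central America": "continent_Central_America",
        "continent_South America": "continent_South_America",
    }
    return df.rename(columns=new_columns)


# Hypothetical toy data, used only to show the shape of the result.
toy = pd.DataFrame({
    "country": ["Canada", "Brazil", "Denmark"],
    "continent": ["North America", "South America", "Europe"],
})

encoded = create_dummy_vars(toy)
print(sorted(encoded.columns))
```

Each row ends up with exactly one active continent indicator, and the original `continent` column is gone.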

#### **Splitting**

In [7]:
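# Drop the target, the rank (derived from the score, so it would leak the target), and the country label.
X = df.drop(["happiness_score", "happiness_rank", "country"], axis=1)
y = df["happiness_score"]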

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

In [8]:
print("Test data shape: ", X_test.shape)
print("Train data shape: ", X_train.shape)

Test data shape:  (235, 15)
Train data shape:  (547, 15)
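
The shapes above follow from how `train_test_split` rounds a fractional `test_size`: the test set gets the ceiling of `0.3 * n_samples` and the training set gets the rest. A small sketch, assuming a placeholder array with the same 782 rows as the dataset (the array contents and the two-column width are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 782  # 547 + 235, the total number of rows reported above

# Placeholder features and target; only the row count matters here.
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=200
)

print("Train rows:", X_tr.shape[0])
print("Test rows: ", X_te.shape[0])
```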


In [9]:
X_test.columns

Index(['year', 'gdp_per_capita', 'health_life_expectancy', 'social_support',
       'freedom', 'government_corruption', 'generosity', 'continent_Africa',
       'continent_America', 'continent_Asia', 'continent_Central_America',
       'continent_Europe', 'continent_North_America', 'continent_Oceania',
       'continent_South_America'],
      dtype='object')

### **Model Selection and Training**

#### **Model Selection: _Linear Regression_**

In [12]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

In [18]:
print("Linear Regression Model Results: \n")
print("Mean Squared Error (MSE) =", mse_lr)
print("Coefficient of determination (R^2) =", r2_lr)

Linear Regression Model Results: 

Mean Squared Error (MSE) = 0.21031583876976218
Coefficient of determination (R^2) = 0.8337305795707146
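
For reference, both metrics can be computed by hand. MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks this against scikit-learn on a few made-up values (`y_true` and `y_hat` are illustrative, not taken from the dataset). Taking the square root of the MSE gives the RMSE in the units of the target; for the linear model above that is about 0.46 points of happiness score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only.
y_true = np.array([7.5, 6.0, 5.2, 4.1])
y_hat = np.array([7.1, 6.3, 5.0, 4.6])

residuals = y_true - y_hat

# MSE: average squared residual.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: same units as the target variable.
rmse_manual = np.sqrt(mse_manual)

print("MSE  manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_hat))
print("R^2  manual vs sklearn:", r2_manual, r2_score(y_true, y_hat))
print("RMSE:", rmse_manual)
```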


#### **Model Selection: _Random Forest Regressor_**

In [19]:
rf_model = RandomForestRegressor(n_estimators=50, random_state=200)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

In [20]:
print("Random Forest Regression Model Results: \n")
print("Mean Squared Error (MSE) =", mse_rf)
print("Coefficient of determination (R^2) =", r2_rf)

Random Forest Regression Model Results: 

Mean Squared Error (MSE) = 0.17550951900650344
Coefficient of determination (R^2) = 0.8612474163822723


#### **Model Selection: _Alternative Random Forest Regressor_**

As an alternative approach, I will train another Random Forest Regressor with different hyperparameters to explore its performance.

In [22]:
rf_model_alt = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model_alt.fit(X_train, y_train)

y_pred_alt_rf = rf_model_alt.predict(X_test)

mse_alt_rf = mean_squared_error(y_test, y_pred_alt_rf)
r2_alt_rf = r2_score(y_test, y_pred_alt_rf)

In [23]:
print("Alternative Random Forest Regression Model Results: \n")
print("Mean Squared Error (MSE) =", mse_alt_rf)
print("Coefficient of determination (R^2) =", r2_alt_rf)

Alternative Random Forest Regression Model Results: 

Mean Squared Error (MSE) = 0.17210940183540657
Coefficient of determination (R^2) = 0.8639354474632258
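
Because this model differs from the previous one in both `n_estimators` and `random_state`, a single run cannot separate the effect of the larger forest from seed-to-seed noise. One way to check, sketched here on synthetic data from `make_regression` (an assumption made so the snippet is self-contained; it does not use the happiness dataset), is to repeat each configuration over several seeds and compare the spread:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; sizes and noise level are arbitrary.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=200
)

seeds = range(5)
scores = {}
for n_trees in (50, 100):
    scores[n_trees] = []
    for seed in seeds:
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(X_tr, y_tr)
        scores[n_trees].append(r2_score(y_te, model.predict(X_te)))
    print(
        f"{n_trees} trees: mean R^2 = {np.mean(scores[n_trees]):.4f}, "
        f"std = {np.std(scores[n_trees]):.4f}"
    )
```

If the gap between the two means is comparable to the standard deviation across seeds, the difference between the configurations is not meaningful on its own.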


#### **Model Selection: _Gradient Boosting Regressor_**

In [24]:
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)

mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

In [26]:
print("Gradient Boosting Regression Model Results: \n")
print("Mean Squared Error (MSE) =", mse_gb)
print("Coefficient of determination (R^2) =", r2_gb)

Gradient Boosting Regression Model Results: 

Mean Squared Error (MSE) = 0.17407216073657017
Coefficient of determination (R^2) = 0.8623837488995425


### **Save PKL File**

In [27]:
joblib.dump(rf_model_alt, "./model/alternative_rf_model.pkl")
print("Alternative Random Forest model saved to ./model/alternative_rf_model.pkl")

Alternative Random Forest model saved to ./model/alternative_rf_model.pkl
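
A saved model can be restored with `joblib.load` and used for prediction directly. The sketch below shows the round trip on a small throwaway model written to a temporary directory; the data, the file name `demo_rf_model.pkl`, and the directory layout are hypothetical. It also creates the target directory first, since `joblib.dump` fails if the directory does not exist.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Throwaway data and model, used only to demonstrate the round trip.
rng = np.random.default_rng(0)
X_demo = rng.random((40, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)  # make sure the folder exists

    model_path = os.path.join(model_dir, "demo_rf_model.pkl")
    joblib.dump(model, model_path)

    # Load the model back from disk.
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Note that pickled scikit-learn models are generally only safe to load with the same library version that produced them.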


### **Conclusions**

- All three ensemble models (the two Random Forest Regressors and the Gradient Boosting Regressor) outperform the Linear Regression model on the test split, suggesting the presence of non-linear relationships within the data.
- The Linear Regression model produces a Mean Squared Error (MSE) of 0.2103 and a Coefficient of Determination (R²) of 0.8337, explaining approximately 83.37% of the variance in happiness scores, which indicates a reasonable but improvable fit.
- The Random Forest Regressor, with an MSE of 0.1755 and an R² of 0.8612, improves upon the Linear Regression, capturing about 86.12% of the variance, demonstrating the benefit of ensemble techniques.
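- The Gradient Boosting Regressor, trained with default hyperparameters, reaches an MSE of 0.1741 and an R² of 0.8624, placing it between the two Random Forest configurations.
- The Alternative Random Forest Regressor, configured with 100 estimators and a random state of 0, achieves the best performance with an MSE of 0.1721 and an R² of 0.8639, explaining approximately 86.39% of the variance. The improvement over the original Random Forest Regressor is slight (about 0.003 in both MSE and R²), and because the number of estimators and the random state were changed together, it cannot be attributed to the larger number of estimators alone.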
- Because it has the lowest test MSE and the highest R² of the four models, the Alternative Random Forest Regressor is selected and saved for future predictions.
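
Since the test-set scores of the three ensemble models are within about 0.003 of each other, a single 70/30 split may not be enough to rank them reliably. A possible follow-up, sketched here on synthetic data from `make_regression` (an assumption to keep the snippet self-contained, not the happiness dataset), is to compare the same four configurations with k-fold cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; sizes and noise level are arbitrary.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

# Same model families as in this notebook; the dictionary keys are just labels.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest_50": RandomForestRegressor(n_estimators=50, random_state=200),
    "random_forest_100": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Shuffled 5-fold split, fixed seed so every model sees the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=200)

cv_scores = {
    name: cross_val_score(estimator, X_syn, y_syn, cv=cv, scoring="r2")
    for name, estimator in candidates.items()
}

for name, fold_scores in cv_scores.items():
    print(f"{name}: mean R^2 = {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```

Comparing the mean and spread across folds gives a more stable basis for model selection than a single split.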