# Build and train model

This notebook builds and trains an ElasticNet model. Model accuracy is evaluated with four metrics: *mean squared error*, *root mean squared error*, *mean absolute error* and *mean absolute percentage error*.
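
For reference, the four metrics are conventionally defined as follows. The symbols are introduced here only for illustration and do not appear in the notebook code: $y_i$ is an actual value, $\hat{y}_i$ the corresponding prediction, and $n$ the number of samples.

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right| \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|
\end{aligned}
```

MAPE is expressed in percent and is only defined when no actual value $y_i$ is zero.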

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import mlflow
import mlflow.sklearn  # imported explicitly because mlflow.sklearn.log_model is used below
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error

import logging
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

import warnings
warnings.filterwarnings("ignore")

In [2]:
# get data from git repository
#csv_url = 'https://raw.githubusercontent.com/Jas53/Data_Exploration_Project/main/data/WorldHappinessReport/2019.csv?token=GHSAT0AAAAAABRTCQJSXY2MLH74DIOZQA2UYQOB2QQ'
#df = pd.read_csv(csv_url)
df = pd.read_csv('./data/WorldHappinessReport/2019.csv')
# set new index for DataFrame
df.set_index('Country or region', inplace = True)
df.head()

Country or region,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Finland,1,7.769,1.34,1.587,0.986,0.596,0.153,0.393
Denmark,2,7.6,1.383,1.573,0.996,0.592,0.252,0.41
Norway,3,7.554,1.488,1.582,1.028,0.603,0.271,0.341
Iceland,4,7.494,1.38,1.624,1.026,0.591,0.354,0.118
Netherlands,5,7.488,1.396,1.522,0.999,0.557,0.322,0.298


In [3]:
# check whether any country appears more than once
df.index.value_counts()

Finland                1
Venezuela              1
Jordan                 1
Benin                  1
Congo (Brazzaville)    1
                      ..
Latvia                 1
South Korea            1
Estonia                1
Jamaica                1
South Sudan            1
Name: Country or region, Length: 156, dtype: int64

**TAKEAWAY**<br>
There are no duplicates.

In [42]:
# check every feature for null values
print(df.isnull().sum())

Overall rank                    0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64


**TAKEAWAY**<br>
There are no null values in this dataset.

## Split the data

In [4]:
# define target feature and create feature list
target_feature = 'Score'
features = df.columns.to_list()
features = [feature for feature in features if feature != target_feature]
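
One caveat when excluding the target column: `not in` applied to a string is a substring test, not an equality test. The sketch below uses a made-up column list (the column `core` is hypothetical and not part of the dataset) to show how a substring match could silently drop an extra column, while an equality comparison keeps it.

```python
# hypothetical column names; 'core' is invented to trigger the pitfall
columns = ['Overall rank', 'Score', 'core', 'Generosity']
target = 'Score'

# substring test: 'core' is contained in 'Score', so it is dropped as well
by_substring = [c for c in columns if c not in target]

# equality test: only the exact target name is dropped
by_equality = [c for c in columns if c != target]

print(by_substring)
print(by_equality)
```

With the column names of this dataset both variants happen to give the same result, since no other column name is a substring of `Score`.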

In [7]:
# split all data into training and testing data (Size 90 / 10)
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target_feature], test_size = 0.1, random_state = 1)

# split training data into training and validation data (Size 75 / 25)
# note: no random_state is set here, so this split (and the metrics below) varies between runs
X_train, X_val, y_train, y_val = train_test_split(X_train[features], y_train, test_size = 0.25)

These splits yield ~67% (105 records) training data to fit the model, ~22% (35 records) validation data to tune the hyperparameters, and ~10% (16 records) test data, which is held back to demonstrate the model in the [demo notebook](./demo.ipynb).
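
The record counts can be checked with a little arithmetic. This is a minimal sketch, assuming that `train_test_split` rounds a fractional test size up to the next whole record; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test), rounding the test share up (assumed behaviour)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 156  # number of countries in the dataset

# first split: 90 / 10 into remaining data and test data
n_rest, n_test = split_sizes(n_total, 0.1)
# second split: 75 / 25 into training data and validation data
n_train, n_val = split_sizes(n_rest, 0.25)

for name, n in [('train', n_train), ('validation', n_val), ('test', n_test)]:
    print(f'{name}: {n} records ({n / n_total:.1%})')
```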

## Train the model

In [8]:
def calc_metrics(actual_data: pd.Series, prediction_data: np.ndarray):
    """
    Calculate the metrics (mean squared error, root mean squared error, mean absolute error
        and mean absolute percentage error) used to evaluate the model accuracy.
    
    Params:
        actual_data (Series): the actual target values
        prediction_data (ndarray): the values predicted by the model
    
    Returns:
        mse (float): the calculated mean squared error value
        rmse (float): the calculated root mean squared error value
        mae (float): the calculated mean absolute error value
        mape (float): the calculated mean absolute percentage error value (in percent)
    """
    # calculate the four accuracy metrics
    mse = mean_squared_error(actual_data, prediction_data)
    # RMSE is the square root of the MSE; this avoids the `squared` argument,
    # which is deprecated in recent scikit-learn versions
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(actual_data, prediction_data)
    mape = np.mean(np.abs((actual_data - prediction_data) / actual_data)) * 100
    
    # print the calculated values
    print('########## METRICS ##########')
    print('MSE:\t%s\nRMSE:\t%s\nMAE:\t%s\nMAPE:\t%s' % (mse, rmse, mae, mape))
    
    return mse, rmse, mae, mape
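
To make the four metrics concrete, here is a small worked example that computes them by hand, without scikit-learn. The numbers are made up for illustration and are not taken from the dataset.

```python
import math

# made-up values for illustration, not taken from the dataset
actual = [2.0, 4.0, 5.0]
predicted = [2.5, 3.0, 5.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n
# MAPE divides by the actual values, so it is undefined if any of them is 0
mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100

print(f'MSE: {mse:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}  MAPE: {mape:.2f}%')
```

Because the happiness scores in this dataset are all well above zero, the division in the MAPE is unproblematic here.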

In [12]:
def train(alpha: float = 1.0, l1_ratio: float = 0.5):
    """
    Train a model with the given parameters on the training data and validate
    its accuracy on the validation data
    
    Params:
        alpha (float): the value for the alpha parameter
        l1_ratio (float): the value for the l1_ratio parameter
    """
    with mlflow.start_run():
        # create model
        model = ElasticNet(alpha = alpha, l1_ratio = l1_ratio)
        # fit model
        model.fit(X_train, y_train)
        # predict 
        pred = model.predict(X_val)
        
        # print model parameter and calculated accuracy metrics
        print('\n########## MODEL ##########')
        print('alpha:\t%s\nl1_ratio:\t%s' % (alpha, l1_ratio))
        mse, rmse, mae, mape = calc_metrics(y_val, pred)
        
        # log parameter
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        # log metrics
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("mape", mape)
        
        #log model
        mlflow.sklearn.log_model(model, "model")

In [10]:
# define the alpha and l1_ratio values to try, in order to
#  identify the parameter combination with the best accuracy
alpha_values = [1.0, 2.0, 3.0, 4.0, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 6.0]
l1_ratio_values = [0.0, 0.5, 1.0]
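
These two lists span a grid of 11 x 3 = 33 parameter combinations. As a sketch of an alternative to nested loops, `itertools.product` from the standard library enumerates the same pairs in the same order; the name `grid` is introduced here only for illustration.

```python
from itertools import product

alpha_values = [1.0, 2.0, 3.0, 4.0, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 6.0]
l1_ratio_values = [0.0, 0.5, 1.0]

# every (alpha, l1_ratio) pair, in the same order as two nested loops
grid = list(product(alpha_values, l1_ratio_values))

print(len(grid))
print(grid[:3])
```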

In [11]:
# create and train a model for every combination of the alpha and the l1_ratio parameter
for alpha_value in alpha_values:
    for l1_ratio_value in l1_ratio_values:
        train(alpha = alpha_value, l1_ratio = l1_ratio_value)

########## MODEL ##########
alpha:	1.0
l1_ratio:	0.0
########## METRICS ##########
MSE:	0.02745781055783349
RMSE:	0.16570398473734266
MAE:	0.13330644352941992
MAPE:	2.613903254275275
########## MODEL ##########
alpha:	1.0
l1_ratio:	0.5
########## METRICS ##########
MSE:	0.027650280286253093
RMSE:	0.1662837342804554
MAE:	0.13120492633303676
MAPE:	2.5593152513319386
########## MODEL ##########
alpha:	1.0
l1_ratio:	1.0
########## METRICS ##########
MSE:	0.027923408853987104
RMSE:	0.1671029887643758
MAE:	0.12877808318726336
MAPE:	2.4983861053111647
########## MODEL ##########
alpha:	2.0
l1_ratio:	0.0
########## METRICS ##########
MSE:	0.027520126793756636
RMSE:	0.16589191298480055
MAE:	0.13333964261002182
MAPE:	2.6138387548636763
########## MODEL ##########
alpha:	2.0
l1_ratio:	0.5
########## METRICS ##########
MSE:	0.02794238432637124
RMSE:	0.1671597568985168
MAE:	0.12865953315996487
MAPE:	2.49540974817139
########## MODEL ##########
alpha:	2.0
l1_ratio:	1.0
########## METRICS ##########


**TAKEAWAY**<br>
Although several models reach very similar accuracy, the best model overall is the one with the parameters alpha = 4.7 and l1_ratio = 1.0.
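
Which model counts as "best" depends on the metric used for ranking. The sketch below shows one way to pick a winner from collected results. It uses only the five complete result blocks visible in the truncated output above, rounded to six decimals, so it can neither confirm nor contradict the choice of alpha = 4.7. The `results` list and its dictionary keys are assumptions made for this illustration.

```python
# the five complete results printed above, rounded to six decimals
results = [
    {'alpha': 1.0, 'l1_ratio': 0.0, 'mse': 0.027458, 'mape': 2.613903},
    {'alpha': 1.0, 'l1_ratio': 0.5, 'mse': 0.027650, 'mape': 2.559315},
    {'alpha': 1.0, 'l1_ratio': 1.0, 'mse': 0.027923, 'mape': 2.498386},
    {'alpha': 2.0, 'l1_ratio': 0.0, 'mse': 0.027520, 'mape': 2.613839},
    {'alpha': 2.0, 'l1_ratio': 0.5, 'mse': 0.027942, 'mape': 2.495410},
]

# lower is better for both metrics
best_by_mse = min(results, key=lambda r: r['mse'])
best_by_mape = min(results, key=lambda r: r['mape'])

print('best by MSE: ', best_by_mse['alpha'], best_by_mse['l1_ratio'])
print('best by MAPE:', best_by_mape['alpha'], best_by_mape['l1_ratio'])
```

Among these five runs the two metrics disagree on the winner, which is a reason to fix the ranking metric before comparing models.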