
# Hyperparameter Optimization For The Human Freedom Index Model

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties (freedom of speech, religion, association, and assembly) in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. The index also includes nine variables pertaining to women-specific freedoms that are found in various categories of the index.

Datasource: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>

In [1]:
import os
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

<div class="alert alert-info">
    
I loaded the Human Freedom Index data from the provided link into a DataFrame called df, dropped redundant columns and rows with a missing target, and stored the independent variables in a DataFrame called X and the dependent variable hf_quartile in a Series called y.
</div>


In [2]:
#Create df
def read_df(link):
    df = pd.read_csv(link)
    return df
    
df = read_df("https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv")

In [3]:
#Drop redundant columns
cols_to_drop = ['year','ISO','countries']
words_drop = ['rank','score']

def drop_columns(data,cols,words_drop):
    #Drop specified columns
    data = data.drop(cols,axis = 1)
    #Drop every column whose name contains one of the given words ("rank", "score")
    for word in words_drop:
        data = data.drop(columns=data.filter(regex=word).columns)

    return data

#Drop NA values in target variable
target = 'hf_quartile'
def drop_na_target(data,target_var):
    data = data.dropna(subset=[target_var])
    return data

#Assign X and y:
def assign_x_y(data,target_var):
    X = data.drop(columns=target_var)
    y = data[target_var]
    return (X,y)


df = drop_columns(df,cols_to_drop,words_drop)
df = drop_na_target(df,target)
(X,y) = assign_x_y(df,target)
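
The column-dropping step can also be written with a single alternation pattern instead of a loop. The following is a minimal sketch on a toy DataFrame; the column names are illustrative stand-ins, not the real dataset's full schema.

```python
import pandas as pd

# Toy frame standing in for the real dataset (column names are illustrative only)
toy = pd.DataFrame({
    "year": [2019, 2019],
    "region": ["A", "B"],
    "hf_score": [8.1, 6.2],
    "hf_rank": [10, 80],
    "pf_rol": [7.0, 5.0],
    "hf_quartile": [1.0, 3.0],
})

# One regex with alternation matches every column containing "rank" or "score"
redundant = toy.filter(regex="rank|score").columns

# Drop the explicitly named column together with the pattern matches
toy_clean = toy.drop(columns=["year", *redundant])
print(list(toy_clean.columns))
```

Note that `hf_quartile` survives because it contains neither word, so the target column is left intact.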

<div class="alert alert-info">
    
I created a Pipeline with a SimpleImputer using the most frequent strategy, a OneHotEncoder for categorical variables, a standard scaler, and a logistic regression model with the solver saga and max_iter 2000. The resulting pipeline is stored in a variable called pipe.

</div>

In [4]:
#Define categorical and numerical variables
def cat_num_var(independent_var):
    """Function that creates a list with numerical variables and a list
    with categorical variables"""
    categorical_features = []
    numerical_features = []
    for col in independent_var.columns:
        if independent_var[col].dtype == object:
            categorical_features.append(col)
        else:
            numerical_features.append(col)
    return (categorical_features,numerical_features)

#Transformer
def transform_col(i: int): #i = position of column to transform
    """Function that creates a transformer for the pipeline"""
    #Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; `sparse` was removed in 1.4
    transformer = ColumnTransformer([("ohe_encoder", OneHotEncoder(sparse = False), [i])],
                                                                remainder = "passthrough")
    return transformer

#steps for pipeline
def build_steps(transformer):
    """Function that lists the pipeline steps in execution order"""
    steps = [("mode", SimpleImputer(strategy = "most_frequent")),
            ("data_cleaning", transformer),            
            ("normalization", StandardScaler()),
            ("training", LogisticRegression(solver='saga',max_iter=2000))]
    return steps

(categorical_features,numerical_features) = cat_num_var(X)
#Column 0 is addressed by position because the imputer outputs a NumPy array
transformer = transform_col(0)
steps = build_steps(transformer)
pipe = Pipeline(steps)
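
An alternative layout selects columns by dtype rather than by position, so the encoder does not depend on the categorical column sitting at index 0. The sketch below is not the pipeline used in this notebook: it runs on synthetic data, the names `toy_X`, `toy_y`, and `toy_pipe` are hypothetical, and it scales only the numeric branch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X and y (assumption: one categorical and two numeric columns)
rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "region": np.tile(["north", "south", "east"], n // 3),
    "indicator_a": rng.normal(size=n),
    "indicator_b": rng.normal(size=n),
})
toy_X.loc[::10, "indicator_a"] = np.nan  # inject some missing values
toy_y = (toy_X["indicator_b"] > 0).astype(int)

# Categorical branch: impute the mode, then one-hot encode
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric branch: impute the mode, then standardize
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),
])

# Route columns to each branch by dtype instead of by position
preprocess = ColumnTransformer([
    ("cat", categorical_branch, make_column_selector(dtype_exclude=np.number)),
    ("num", numeric_branch, make_column_selector(dtype_include=np.number)),
])

toy_pipe = Pipeline([
    ("preprocess", preprocess),
    ("training", LogisticRegression(solver="saga", max_iter=2000)),
])
toy_pipe.fit(toy_X, toy_y)

# Three one-hot columns plus two scaled numeric columns
n_features_out = toy_pipe.named_steps["preprocess"].transform(toy_X).shape[1]
print(n_features_out)
```

With this layout the one-hot indicator columns are left unscaled, which is a design choice rather than a requirement.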

<div class="alert alert-info">

I ran cross-validation with three stratified folds to estimate the performance of the model and stored the test score of each fold in a dictionary called fold_scores.

</div>

In [5]:
import warnings
warnings.filterwarnings('ignore')

#Train-test split
def traintest_split(X,y):
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
    return (X_train, X_test, y_train, y_test)

#Cross Validation
def cross_validations(pipe, X, y, n_folds: int = 3):
    #An integer cv gives stratified folds when the estimator is a classifier
    scores = cross_val_score(pipe, X, y, cv = n_folds)
    #Map the fold number (starting at 1) to its test score
    fold_scores = dict(enumerate(scores, start=1))
    return fold_scores



(X_train, X_test, y_train, y_test) = traintest_split(X,y)
fold_scores = cross_validations(pipe,X, y,3)
fold_scores

{1: 0.9181380417335474, 2: 0.9501607717041801, 3: 0.8954983922829582}
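
The stratification can also be made explicit by passing a `StratifiedKFold` splitter, which additionally allows shuffling with a fixed seed. This is a minimal sketch on synthetic four-class data; `X_demo`, `y_demo`, and `demo_pipe` are hypothetical names, and the scores it prints say nothing about the Human Freedom Index model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class problem standing in for the quartile target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Explicit splitter: three stratified folds, shuffled reproducibly
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv)

# Same dictionary layout as fold_scores: fold number -> test accuracy
fold_scores_demo = dict(enumerate(scores, start=1))
print(fold_scores_demo)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean and standard deviation alongside the per-fold values makes the spread between folds easier to judge.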

<div class="alert alert-info">    
I created a GridSearchCV object called grid with three folds, using the previous pipeline. The grid search tests the hyperparameters penalty with values ['l1', 'l2'] and C with values [0.1, 10]. I fitted it on the training set only and then scored the best estimator on the held-out test set; that test accuracy is stored in a variable called score.
</div>

In [6]:
param_grid = {'training__penalty':['l1', 'l2'], 'training__C':[0.1,10]}
grid = GridSearchCV(pipe, param_grid, cv = 3)
grid.fit(X_train, y_train)
grid.best_params_
grid.best_score_
score = grid.best_estimator_.score(X_test,y_test)
score


0.9491978609625669
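
To compare every candidate rather than only the winner, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below uses synthetic data and hypothetical names (`demo_grid`, `X_tr`, `X_te`); it also separates the cross-validated score from the test-set score, which measure different things.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_pipe = Pipeline([
    ("normalization", StandardScaler()),
    ("training", LogisticRegression(solver="saga", max_iter=2000)),
])
demo_grid = GridSearchCV(
    demo_pipe,
    {"training__penalty": ["l1", "l2"], "training__C": [0.1, 10]},
    cv=3,
)
demo_grid.fit(X_tr, y_tr)

# One row per candidate, best-ranked first
columns = ["param_training__C", "param_training__penalty",
           "mean_test_score", "rank_test_score"]
results = pd.DataFrame(demo_grid.cv_results_)[columns].sort_values("rank_test_score")
print(results)

cv_score = demo_grid.best_score_          # mean CV accuracy on the training data
test_score = demo_grid.score(X_te, y_te)  # accuracy of the refitted best model on the test data
print(cv_score, test_score)
```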

<div class="alert alert-info">    
To search over more of the pipeline, I created a new GridSearchCV object called grid_search whose parameter grids cover the imputer strategy, the scaler settings (with_mean and with_std), and the classifier itself (logistic regression, decision tree, or random forest) together with each model's own hyperparameters. The one-hot encoder and the column transformer are kept fixed.

</div>

In [7]:
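#Create variables that represent a Model 
#Note: LogisticRegression's default lbfgs solver does not support penalty='l1',
#so the 'l1' candidates below fail to fit unless a solver such as 'saga' is set
LR= LogisticRegression()
DT= DecisionTreeClassifier()
RF= RandomForestClassifier()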

#Creating the transformer to make the pipelines
onehotencoder= OneHotEncoder(sparse=False)
imputer= SimpleImputer(strategy='most_frequent')
inner_pipe_steps = [('impute', imputer), ('ohe', onehotencoder)]
inner_pipe= Pipeline(inner_pipe_steps)
transformer = ColumnTransformer([('inner', inner_pipe, categorical_features)], remainder = 'passthrough')

#Creating the shared pipeline; DT is only a placeholder that each grid overrides via 'classifier'
pipe_steps=[('preprocess', transformer),
            ('imputer', SimpleImputer()),
            ('scaler', StandardScaler()), 
            ('classifier', DT)]
pipe= Pipeline(pipe_steps)

#Define the parameter grid for the LR
param_gridlr = {
    'classifier':[LR],
    'imputer__strategy' : ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__C': [0.1, 1.0, 10.0], 
    'classifier__penalty' :  ['l1', 'l2']
}


#Define the parameter grid for the DT
param_griddt = {
    'classifier': [DT],
    'imputer__strategy' : ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [3, 4, 5, 6, 7, None],
    'classifier__min_samples_split': [2, 3, 4, 5],
    'classifier__min_samples_leaf': [1, 2, 3, 4]
}

#Define the parameter grid for the RF
param_gridrf = {
    'classifier': [RF],
    'imputer__strategy': ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [3, 4, 5, 6, 7, None],
    'classifier__min_samples_split': [2, 3, 4, 5],
    'classifier__min_samples_leaf': [1, 2, 3, 4]
}

#Perform the grid search over all three model grids (LR, DT, RF)
grid_param_list = [param_gridlr, param_griddt, param_gridrf]
grid_search = GridSearchCV(pipe, grid_param_list, cv=3, n_jobs=-2)
grid_search.fit(X_train, y_train)

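Best_parameters= grid_search.best_params_
#Mean cross-validated accuracy of the best candidate (computed on the training data)
Best_score = grid_search.best_score_
#Despite its name, this holds the test-set accuracy of the refitted best pipeline
Best_estimator = grid_search.best_estimator_.score(X_test, y_test)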
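
The cost of a search this size can be estimated before it is run. As a rough sketch, `ParameterGrid` counts the candidates in grids shaped like the logistic regression and decision tree grids above; the strings "LR" and "DT" are placeholders for the estimator objects, and the random forest grid adds further candidates on top.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the logistic regression grid; "LR" is a placeholder string
lr_grid = {
    "classifier": ["LR"],
    "imputer__strategy": ["median", "mean"],
    "scaler__with_mean": [True, False],
    "scaler__with_std": [True, False],
    "classifier__C": [0.1, 1.0, 10.0],
    "classifier__penalty": ["l1", "l2"],
}

# Same shape as the decision tree grid; "DT" is a placeholder string
dt_grid = {
    "classifier": ["DT"],
    "imputer__strategy": ["median", "mean"],
    "scaler__with_mean": [True, False],
    "scaler__with_std": [True, False],
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_depth": [3, 4, 5, 6, 7, None],
    "classifier__min_samples_split": [2, 3, 4, 5],
    "classifier__min_samples_leaf": [1, 2, 3, 4],
}

n_lr = len(ParameterGrid(lr_grid))               # 2*2*2*3*2 = 48
n_dt = len(ParameterGrid(dt_grid))               # 2*2*2*2*6*4*4 = 1536
n_total = len(ParameterGrid([lr_grid, dt_grid])) # a list of grids is searched grid by grid
n_fits = n_total * 3                             # three folds per candidate

print(n_lr, n_dt, n_total, n_fits)
```

When the count grows this large, `RandomizedSearchCV` is a common way to sample the space instead of enumerating it.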


In [8]:
print(Best_parameters)
print(Best_score)
print(Best_estimator)

{'classifier': LogisticRegression(), 'classifier__C': 1.0, 'classifier__penalty': 'l2', 'imputer__strategy': 'mean', 'scaler__with_mean': True, 'scaler__with_std': True}
0.9397078589340596
0.9518716577540107
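
Because warnings are suppressed earlier in the notebook, any candidate that fails to fit is scored as NaN without a visible message. The sketch below shows one way to list such candidates. It uses synthetic data and a deliberately invalid imputer strategy, built from two adjacent string literals that Python concatenates into one value.

```python
import warnings

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Adjacent string literals are concatenated: this is "medianmean", not two strategies
bad_value = "median" "mean"

demo_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression(max_iter=2000)),
])
search = GridSearchCV(demo_pipe, {"imputer__strategy": ["mean", bad_value]}, cv=3)

# Mimic the notebook's suppressed warnings for this fit only
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search.fit(X_demo, y_demo)

# Candidates whose fits raised an error have a NaN mean test score
results = pd.DataFrame(search.cv_results_)
failed = results[results["mean_test_score"].isna()]
print(failed["params"].tolist())
```

Passing `error_score="raise"` to `GridSearchCV` is a stricter option that stops the search at the first failing fit.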
