### Importing Necessary Modules
This part of the model-building process imports three modules: "joblib", "sklearn" and "pandas". The "joblib" module is used to save the trained machine learning model.

Predicting whether a passenger survived has only two outcomes, "yes" or "no". Since this is a binary classification problem, a Logistic Regression model is built as a benchmark.

In [1]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

### Read in the Data
The training set's features and labels are used first to train the model. They are loaded with the pandas "read_csv" function.

In [2]:
address1 = r"...\...\...\titanic_EDA\Split_data\train_features.csv"

address2 = r"...\...\...\titanic_EDA\Split_data\train_labels.csv"

train_features = pd.read_csv(address1)
train_labels = pd.read_csv(address2)

### Hyperparameters & Cross Validation
Hyperparameters are settings whose values control the learning process; they are chosen before training rather than learned from the data. For a Logistic Regression model, the hyperparameter most often tuned is "C", which is the inverse of the regularization strength: a smaller "C" means stronger regularization. Regularization is used to control "overfitting", where a model fits the training data too closely and generalizes poorly to new data.

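To illustrate the inverse relationship between "C" and regularization, the sketch below fits the same model with a very small and a very large "C" and compares the size of the learned coefficients. The synthetic dataset from `make_classification` and all variable names are assumptions made for illustration; this is not the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic training set).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Small C = strong regularization, large C = weak regularization.
strong = LogisticRegression(C=0.001, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=1000, max_iter=1000).fit(X, y)

# The L2 norm of the coefficient vector measures how far the weights
# are allowed to move away from zero.
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)

print("coefficient norm with C=0.001:", round(strong_norm, 3))
print("coefficient norm with C=1000: ", round(weak_norm, 3))
```

The strongly regularized model should report the smaller norm, which is the shrinkage that helps limit overfitting.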
Seven values have been chosen for the "C" parameter to determine which works best, and 5-fold cross validation is performed on the training set using GridSearchCV. This fits 35 models (7 values x 5 folds), and the value of "C" with the best mean validation score is selected. By default, GridSearchCV then refits a final model on the whole training set with that value.

In [None]:
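# Base estimator; every hyperparameter other than C keeps its default value.
logistic_regression = LogisticRegression()

# Candidate values for C, the inverse of the regularization strength.
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# 5-fold cross validation over every candidate value of C.
cross_validation = GridSearchCV(logistic_regression, parameters, cv=5)

# ravel() flattens the labels DataFrame into the 1-D array that fit() expects.
cross_validation.fit(train_features, train_labels.values.ravel())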
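
As a rough check on how many models the search trains, the sketch below runs the same grid on synthetic data and reads the counts back from the fitted object. The data from `make_classification` and the names `search`, `n_candidates` and `n_folds` are assumptions for illustration, not part of the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic training set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_['params'])  # one entry per value of C
n_folds = search.n_splits_                        # folds used per candidate
total_cv_fits = n_candidates * n_folds
print(n_candidates, "candidates x", n_folds, "folds =", total_cv_fits, "fits")

# With the default refit=True, best_estimator_ is one additional model
# trained on all of X, y using the winning value of C.
refit_uses_best_c = search.best_estimator_.C == search.best_params_['C']
print("Refit model uses the best C:", refit_uses_best_c)
```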

### Print Function
The function below is an extra step that prints the cross validation results in a readable format. It prints the best hyperparameter value, followed by the mean score of every candidate value together with a spread of two standard deviations across the folds.

In [4]:
def print_results(results):
    """Print the best parameters and the mean score (+/- 2 std) of each candidate."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    # Mean and standard deviation of the validation score across the folds.
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']

    for mean, std, params in zip(mean_score, std_score, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [5]:
print_results(cross_validation)

BEST PARAMS: {'C': 10}

0.689 (+/-0.072) for {'C': 0.001}
0.708 (+/-0.055) for {'C': 0.01}
0.798 (+/-0.094) for {'C': 0.1}
0.805 (+/-0.105) for {'C': 1}
0.807 (+/-0.11) for {'C': 10}
0.807 (+/-0.11) for {'C': 100}
0.807 (+/-0.11) for {'C': 1000}
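
An alternative way to inspect the same results is to load `cv_results_` into a pandas DataFrame and sort it by rank. The sketch below is a minimal illustration on synthetic data; the shortened grid, the data and the name `summary` are assumptions, not part of the notebook above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a shortened grid (both assumptions).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 1, 100]}, cv=5)
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so it converts directly.
columns = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

The first row of `summary` then corresponds to the candidate reported in `best_params_`.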


### Best Model
The best-performing model, with its selected hyperparameter value, is shown below.

In [6]:
cross_validation.best_estimator_

### Saving the Model

The best model is now saved using the joblib module.

In [None]:
joblib.dump(cross_validation.best_estimator_, r'...\...\...\titanic_EDA\Models\LR_model.pkl')
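
To confirm that a saved model can be restored, a round trip through `joblib.dump` and `joblib.load` can be sketched as follows. The temporary directory, the file location and the synthetic data are hypothetical stand-ins chosen so the example runs on its own; they are not the paths or data used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (assumptions; not the Titanic model).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(C=10, max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "LR_model.pkl")  # hypothetical location
    joblib.dump(model, model_path)       # write the fitted model to disk
    reloaded = joblib.load(model_path)   # read it back into memory

# The reloaded model should behave exactly like the original.
same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print("Reloaded model gives identical predictions:", same_predictions)
```

A file written this way is best loaded with the same scikit-learn version that produced it.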