# Machine Learning Analysis

### This notebook fits random forest and gradient boosting classifier models to the encoded data to predict which passengers were transported.

In [1]:
# First start by importing the relevant packages:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

#### Random forest classifier: 1<sup>st</sup> iteration

In [2]:
# Load the training data (stored locally)
train_RF_iter1 = pd.read_csv('train_logic_impute.csv')

In [3]:
# Split the data into the features and response (Transported)
X_RF_iter1 = train_RF_iter1.drop(columns = 'Transported')
y_RF_iter1 = train_RF_iter1.Transported

In [4]:
# Now further subdivide the data into training and validation sets
X_RF_iter1_train, X_RF_iter1_validate, y_RF_iter1_train, y_RF_iter1_validate = train_test_split(X_RF_iter1,
                                                                                                y_RF_iter1,
                                                                                                test_size = 0.2,
                                                                                                random_state = 17,
                                                                                                shuffle = True)

In [5]:
# Define a function to make an initial estimate of how many trees are needed in the forest
def n_estimator_eval():
    n_estimators_list = [10,25,50,75,100,150,200,250,300,350,400,450,500,550,600,700]
    for n_estimators_eval in n_estimators_list:
        model_RF_eval = RandomForestClassifier(n_estimators = n_estimators_eval, random_state = 17, n_jobs = -1)
        model_RF_eval.fit(X_RF_iter1_train, y_RF_iter1_train)
        y_RF_eval_valid_predict = model_RF_eval.predict(X_RF_iter1_validate)
        acc_score_eval = accuracy_score(y_RF_iter1_validate, y_RF_eval_valid_predict) # (# of correct predictions)/(total # of predictions)
        acc_score_eval_rounded = round(acc_score_eval, ndigits = 4)
        print(f'With {n_estimators_eval} trees, the validation accuracy score was {acc_score_eval_rounded}')

In [6]:
# Now call the function and observe the results
n_estimator_eval()

With 10 trees, the validation accuracy score was 0.7861
With 25 trees, the validation accuracy score was 0.793
With 50 trees, the validation accuracy score was 0.8045
With 75 trees, the validation accuracy score was 0.8091
With 100 trees, the validation accuracy score was 0.8068
With 150 trees, the validation accuracy score was 0.8062
With 200 trees, the validation accuracy score was 0.8114
With 250 trees, the validation accuracy score was 0.8108
With 300 trees, the validation accuracy score was 0.8074
With 350 trees, the validation accuracy score was 0.8062
With 400 trees, the validation accuracy score was 0.8051
With 450 trees, the validation accuracy score was 0.8091
With 500 trees, the validation accuracy score was 0.8062
With 550 trees, the validation accuracy score was 0.8068
With 600 trees, the validation accuracy score was 0.8085
With 700 trees, the validation accuracy score was 0.8062


The validation set accuracy is highest with 200 trees (using the default random forest parameters), so use 200 trees for the first iteration of the fitting, although this number of trees seems a little low.
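
How much of the spread between these scores could be noise? One rough check, sketched below, treats each validation prediction as an independent Bernoulli trial and computes the binomial standard error of an accuracy estimate. The validation set size `n_validate = 1739` is an assumption (in the notebook it would be `len(y_RF_iter1_validate)`), and `accuracy_standard_error` is a hypothetical helper, not a scikit-learn function.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

n_validate = 1739  # ASSUMPTION: stand-in size; use len(y_RF_iter1_validate) in the notebook
best, runner_up = 0.8114, 0.8108  # 200 and 250 trees, from the output above

se = accuracy_standard_error(best, n_validate)
print(f'standard error ~ {se:.4f}, gap to runner-up = {best - runner_up:.4f}')
```

Under that assumed size the standard error is a little under 0.01, while every score from 75 trees onward lies within about 0.006 of the best, so this single split does not clearly separate those tree counts.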

In [7]:
# Fit model with only n_estimators & random_state specified
model_RF_iter1 = RandomForestClassifier(n_estimators = 200, random_state = 17, n_jobs = -1)
model_RF_iter1.fit(X_RF_iter1_train, y_RF_iter1_train)
y_RF_valid_predict = model_RF_iter1.predict(X_RF_iter1_validate)
acc_score_iter1 = accuracy_score(y_RF_iter1_validate, y_RF_valid_predict)
print('The validation set accuracy score is:',round(acc_score_iter1, ndigits = 4))

The validation set accuracy score is: 0.8114


In [8]:
# Also load the test set and use the fitted model to predict the response
test_RF_iter1 = pd.read_csv('test_logic_impute.csv')
test_RF_iter1_predict = model_RF_iter1.predict(test_RF_iter1.drop(columns = 'PassengerId'))
test_RF_iter1_predict_S = pd.Series(test_RF_iter1_predict)
# Save the predictions in a .csv file for submission to evaluate the accuracy of the fit
# RF_iter1_submission_dict = {'PassengerId':test_RF_iter1.PassengerId,'Transported':test_RF_iter1_predict_S}
# RF_iter1_submission_df = pd.DataFrame(data = RF_iter1_submission_dict)
# RF_iter1_submission_df.to_csv(path_or_buf = 'Random_Forest_Classifier_Iteration_1.csv', index = False)

#### Random forest classifier: 2<sup>nd</sup> iteration

Now perform hyperparameter tuning to improve the model fit.

In [9]:
# Load the training data (stored locally)
train_RF_iter2 = pd.read_csv('train_logic_impute.csv')

In [10]:
# Split the data into the features and response (Transported)
X_RF_iter2 = train_RF_iter2.drop(columns = 'Transported')
y_RF_iter2 = train_RF_iter2.Transported

While the initial fitting was done with 200 trees, this feels a little low. There was another local maximum at 450 trees, which seems more reasonable, so use 450 trees for the hyperparameter grid search.

In [11]:
# Initialize classifier model
random_forest_class = RandomForestClassifier(n_estimators = 450, random_state = 17, n_jobs = -1)
# Set conditions for the grid parameter search
# Initial conditions used were:
# p_grid = {'max_depth':np.linspace(5,30,6).astype(int),'max_features':np.linspace(2,18,5).astype(int)}
p_grid = {'max_depth':np.linspace(6,14,5).astype(int),'max_features':np.linspace(3,9,7).astype(int)}
cross_val_set = KFold(n_splits = 5, shuffle = True, random_state = 17)
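
To get a feel for the cost of this search, the sketch below enumerates the candidate combinations with the standard library. The two lists are written out by hand as the values that the `np.linspace(...).astype(int)` calls above evaluate to; nothing here calls scikit-learn.

```python
from itertools import product

# Values of np.linspace(6, 14, 5) and np.linspace(3, 9, 7) after .astype(int)
max_depth_values = [6, 8, 10, 12, 14]
max_features_values = [3, 4, 5, 6, 7, 8, 9]
n_splits = 5  # number of folds in the KFold object

# Every (max_depth, max_features) pair is one candidate model
candidates = list(product(max_depth_values, max_features_values))
n_fits = len(candidates) * n_splits
print(len(candidates), n_fits)  # 35 175
```

Each of those 175 fits trains a 450-tree forest, and `GridSearchCV` then refits the best candidate once more on the full data because `refit` defaults to `True`.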

In [12]:
# Perform grid search
grid_search = GridSearchCV(estimator = random_forest_class, param_grid = p_grid, scoring = 'accuracy', n_jobs = -1, cv = cross_val_set)
grid_search.fit(X_RF_iter2, y_RF_iter2)
# Store results in a dataframe
grid_results = pd.DataFrame(grid_search.cv_results_).sort_values(by = 'rank_test_score', ignore_index = True)

In [13]:
# Extract and print key results from dataframe
grid_max_depth = grid_results.loc[0, 'param_max_depth']
grid_max_features = grid_results.loc[0,'param_max_features']
test_acc = grid_results.loc[0,'mean_test_score']

print('The optimal parameters were:')
print('max_depth =',grid_max_depth)
print('max_features =',grid_max_features)
print('The mean validation k-fold accuracy was:',round(test_acc, ndigits = 4))

The optimal parameters were:
max_depth = 10
max_features = 6
The mean validation k-fold accuracy was: 0.8026
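
The 0.8026 figure is `mean_test_score`, the accuracy averaged over the five validation folds. As a rough picture of how five-fold splitting partitions the rows, here is a standard-library sketch. It does not reproduce the shuffling of scikit-learn's `KFold`, and `kfold_indices` and the 12-row example are hypothetical.

```python
import random

def kfold_indices(n_rows, n_splits=5, seed=17):
    """Yield (train, validate) index lists; every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same folds each run
    folds = [indices[i::n_splits] for i in range(n_splits)]  # near-equal fold sizes
    for i, validate in enumerate(folds):
        # Training rows are all rows that are not in the current validation fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validate

fold_sizes = [len(validate) for _, validate in kfold_indices(12)]
print(fold_sizes)  # [3, 3, 2, 2, 2]
```

Because every row is used for validation exactly once, the averaged score draws on all of the training data rather than a single 20% hold-out.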


In [14]:
# Now fit the model using the optimal parameters
model_RF_iter2 = RandomForestClassifier(n_estimators = 450,
                                        max_depth = grid_max_depth,
                                        max_features = grid_max_features,
                                        random_state = 17)
model_RF_iter2.fit(X_RF_iter2, y_RF_iter2)

In [15]:
# Load the test set and use the fitted model to predict the response
test_RF_iter2 = pd.read_csv('test_logic_impute.csv')
test_RF_iter2_predict = model_RF_iter2.predict(test_RF_iter2.drop(columns = 'PassengerId'))
test_RF_iter2_predict_S = pd.Series(test_RF_iter2_predict)
# Save the predictions in a .csv file for submission to evaluate the accuracy of the fit
# RF_iter2_submission_dict = {'PassengerId':test_RF_iter2.PassengerId,'Transported':test_RF_iter2_predict_S}
# RF_iter2_submission_df = pd.DataFrame(data = RF_iter2_submission_dict)
# RF_iter2_submission_df.to_csv(path_or_buf = 'Random_Forest_Classifier_Iteration_2.csv', index = False)

#### Gradient boosting classifier: 1<sup>st</sup> iteration

In [16]:
# Load the training data (stored locally)
train_GBC_iter1 = pd.read_csv('train_logic_impute.csv')

In [17]:
# Split the data into the features and response (Transported)
X_GBC_iter1 = train_GBC_iter1.drop(columns = 'Transported')
y_GBC_iter1 = train_GBC_iter1.Transported

In [18]:
# Now further subdivide the data into training and validation sets
X_GBC_iter1_train, X_GBC_iter1_validate, y_GBC_iter1_train, y_GBC_iter1_validate = train_test_split(X_GBC_iter1,
                                                                                                    y_GBC_iter1,
                                                                                                    test_size = 0.2,
                                                                                                    random_state = 17,
                                                                                                    shuffle = True)

In [19]:
# Define a function to make an initial estimate of how many boosts are optimal
def n_estimator_eval():
    n_estimators_list = [10,25,50,75,100,150,175,200,225,250,300,350,400,450,500,550,600]
    for n_estimators_eval in n_estimators_list:
        model_GBC_eval = GradientBoostingClassifier(n_estimators = n_estimators_eval, random_state = 17)
        model_GBC_eval.fit(X_GBC_iter1_train, y_GBC_iter1_train)
        y_GBC_eval_valid_predict = model_GBC_eval.predict(X_GBC_iter1_validate)
        acc_score_eval = accuracy_score(y_GBC_iter1_validate, y_GBC_eval_valid_predict) # (# of correct predictions)/(total # of predictions)
        acc_score_eval_rounded = round(acc_score_eval, ndigits = 4)
        print(f'With {n_estimators_eval} boosting stages, the validation accuracy score was {acc_score_eval_rounded}')

In [20]:
n_estimator_eval()

With 10 boosting stages, the validation accuracy score was 0.7706
With 25 boosting stages, the validation accuracy score was 0.8039
With 50 boosting stages, the validation accuracy score was 0.8074
With 75 boosting stages, the validation accuracy score was 0.8114
With 100 boosting stages, the validation accuracy score was 0.8108
With 150 boosting stages, the validation accuracy score was 0.8131
With 175 boosting stages, the validation accuracy score was 0.8125
With 200 boosting stages, the validation accuracy score was 0.812
With 225 boosting stages, the validation accuracy score was 0.8125
With 250 boosting stages, the validation accuracy score was 0.812
With 300 boosting stages, the validation accuracy score was 0.8125
With 350 boosting stages, the validation accuracy score was 0.8137
With 400 boosting stages, the validation accuracy score was 0.8125
With 450 boosting stages, the validation accuracy score was 0.8154
With 500 boosting stages, the validation accuracy score was 0.8148
W

The validation accuracy is highest with 450 boosting stages, so use this number for the model.
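
Because the scores level off from about 75 stages onward, another common heuristic is to take the smallest number of stages whose accuracy is within a small tolerance of the best. The sketch below applies it to the scores printed above (the truncated final lines are omitted). The tolerance of 0.005 and the helper `smallest_within_tolerance` are hypothetical choices, not part of the original analysis.

```python
# Validation accuracies copied from the output above (truncated entries omitted)
gbc_scores = {10: 0.7706, 25: 0.8039, 50: 0.8074, 75: 0.8114, 100: 0.8108,
              150: 0.8131, 175: 0.8125, 200: 0.8120, 225: 0.8125, 250: 0.8120,
              300: 0.8125, 350: 0.8137, 400: 0.8125, 450: 0.8154, 500: 0.8148}

def smallest_within_tolerance(scores, tolerance=0.005):
    """Return the smallest key whose score is within `tolerance` of the best score."""
    threshold = max(scores.values()) - tolerance
    return min(n for n, score in scores.items() if score >= threshold)

print(smallest_within_tolerance(gbc_scores))  # 75
```

A model in this range is cheaper to fit, and the grid search in the 2nd iteration also settles on far fewer than 450 stages.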

In [21]:
# Fit model with only n_estimators & random_state specified
model_GBC_iter1 = GradientBoostingClassifier(n_estimators = 450, random_state = 17)
model_GBC_iter1.fit(X_GBC_iter1_train, y_GBC_iter1_train)
y_GBC_valid_predict = model_GBC_iter1.predict(X_GBC_iter1_validate)
acc_score_iter1 = accuracy_score(y_GBC_iter1_validate, y_GBC_valid_predict)
print('The validation set accuracy score is:',round(acc_score_iter1, ndigits = 4))

The validation set accuracy score is: 0.8154


In [22]:
# Load the test set and use the fitted model to predict the response
test_GBC_iter1 = pd.read_csv('test_logic_impute.csv')
test_GBC_iter1_predict = model_GBC_iter1.predict(test_GBC_iter1.drop(columns = 'PassengerId'))
test_GBC_iter1_predict_S = pd.Series(test_GBC_iter1_predict)
# Save the predictions in a .csv file for submission to evaluate the accuracy of the fit
# GBC_iter1_submission_dict = {'PassengerId':test_GBC_iter1.PassengerId,'Transported':test_GBC_iter1_predict_S}
# GBC_iter1_submission_df = pd.DataFrame(data = GBC_iter1_submission_dict)
# GBC_iter1_submission_df.to_csv(path_or_buf = 'Gradient_Boosting_Classifier_Iteration_1.csv', index = False)

#### Gradient boosting classifier: 2<sup>nd</sup> iteration

In [23]:
# Load the training data (stored locally)
train_GBC_iter2 = pd.read_csv('train_logic_impute.csv')

In [24]:
# Split the data into the features and response (Transported)
X_GBC_iter2 = train_GBC_iter2.drop(columns = 'Transported')
y_GBC_iter2 = train_GBC_iter2.Transported

In [25]:
# Initialize classifier model
grad_boost_class = GradientBoostingClassifier(random_state = 17)
# Set conditions for the grid parameter search
# Initial conditions used were:
# p_grid = {'n_estimators':np.linspace(100,500,9).astype(int),
#           'max_depth':np.linspace(1,8,8).astype(int),
#           'max_features':np.linspace(2,18,9).astype(int)}
p_grid = {'n_estimators':np.linspace(50,150,5).astype(int),
          'max_depth':np.linspace(5,8,4).astype(int),
          'max_features':np.linspace(1,6,6).astype(int)}
cross_val_set = KFold(n_splits = 5, shuffle = True, random_state = 17)

In [26]:
# Perform grid search
grid_search = GridSearchCV(estimator = grad_boost_class, param_grid = p_grid, scoring = 'accuracy', n_jobs = -1, cv = cross_val_set)
grid_search.fit(X_GBC_iter2, y_GBC_iter2)
# Store results in a dataframe
grid_results = pd.DataFrame(grid_search.cv_results_).sort_values(by = 'rank_test_score', ignore_index = True)

In [27]:
# Extract and print key results from dataframe
grid_n_estimators = grid_results.loc[0, 'param_n_estimators']
grid_max_depth = grid_results.loc[0, 'param_max_depth']
grid_max_features = grid_results.loc[0,'param_max_features']
test_acc = grid_results.loc[0,'mean_test_score']

print('The optimal parameters were:')
print('n_estimators =',grid_n_estimators)
print('max_depth =',grid_max_depth)
print('max_features =',grid_max_features)
print('The mean validation k-fold accuracy was:',round(test_acc, ndigits = 4))

The optimal parameters were:
n_estimators = 100
max_depth = 6
max_features = 2
The mean validation k-fold accuracy was: 0.8048
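
When a grid search returns a value at the edge of its range, the true optimum may lie outside the grid, so it is worth confirming that the selected values are interior. A small check of this follows; `on_grid_edge` is a hypothetical helper, not a scikit-learn function, and the lists are written out by hand as the values the `np.linspace(...).astype(int)` calls above evaluate to.

```python
# Candidate values in the final grid for the gradient boosting search
grid = {
    'n_estimators': [50, 75, 100, 125, 150],
    'max_depth': [5, 6, 7, 8],
    'max_features': [1, 2, 3, 4, 5, 6],
}
best = {'n_estimators': 100, 'max_depth': 6, 'max_features': 2}  # printed above

def on_grid_edge(grid, best):
    """Return the parameters whose best value is the smallest or largest candidate."""
    return [name for name, values in grid.items()
            if best[name] in (min(values), max(values))]

print(on_grid_edge(grid, best))  # []
```

An empty list means none of the three selected values sits on a boundary of the final grid.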


In [28]:
# Now fit the model using the optimal parameters
model_GBC_iter2 = GradientBoostingClassifier(n_estimators = grid_n_estimators,
                                             max_depth = grid_max_depth,
                                             max_features = grid_max_features,
                                             random_state = 17)
model_GBC_iter2.fit(X_GBC_iter2, y_GBC_iter2)

In [29]:
# Load the test set and use the fitted model to predict the response
test_GBC_iter2 = pd.read_csv('test_logic_impute.csv')
test_GBC_iter2_predict = model_GBC_iter2.predict(test_GBC_iter2.drop(columns = 'PassengerId'))
test_GBC_iter2_predict_S = pd.Series(test_GBC_iter2_predict)
# Save the predictions in a .csv file for submission to evaluate the accuracy of the fit
# GBC_iter2_submission_dict = {'PassengerId':test_GBC_iter2.PassengerId,'Transported':test_GBC_iter2_predict_S}
# GBC_iter2_submission_df = pd.DataFrame(data = GBC_iter2_submission_dict)
# GBC_iter2_submission_df.to_csv(path_or_buf = 'Gradient_Boosting_Classifier_Iteration_2.csv', index = False)

#### Summary of machine learning analysis results

| Model | Validation Set Accuracy Score | Test Set Accuracy Score |
| --- | -: | -: |
| Random forest classifier 1<sup>st</sup> iteration | 0.8114 | 0.7861 |
| Random forest classifier 2<sup>nd</sup> iteration | 0.8026 | 0.7942 |
| Gradient boosting classifier 1<sup>st</sup> iteration | 0.8154 | 0.7912 |
| Gradient boosting classifier 2<sup>nd</sup> iteration | 0.8048 | 0.7933 |

The test set accuracy scores were obtained by submitting the test set predictions to Kaggle, which scores them against the true outcomes.

The test scores are all lower than the validation scores, by roughly 0.01 to 0.025. Overall, the random forest classifier after hyperparameter tuning had the highest test set accuracy (0.7942). For both the random forest and gradient boosting classifiers, hyperparameter tuning lowered the validation accuracy but improved the test set accuracy. This may be due to slight overfitting in the 1st iterations, where the number of estimators was chosen to maximize accuracy on the same hold-out set that is used to report the validation score. Note also that the 1st-iteration validation scores come from a single 20% hold-out set, whereas the 2nd-iteration scores are 5-fold cross-validation means, so the two are not directly comparable.
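
The gap between validation and test accuracy can be tabulated directly from the summary table. The snippet below is only arithmetic on the reported numbers; the dictionary keys are shorthand labels chosen for this sketch.

```python
# (validation accuracy, test accuracy) copied from the summary table
scores = {
    'Random forest, 1st iteration': (0.8114, 0.7861),
    'Random forest, 2nd iteration': (0.8026, 0.7942),
    'Gradient boosting, 1st iteration': (0.8154, 0.7912),
    'Gradient boosting, 2nd iteration': (0.8048, 0.7933),
}

# Validation minus test accuracy for each model, rounded to 4 decimal places
gaps = {model: round(valid - test, 4) for model, (valid, test) in scores.items()}
for model, gap in gaps.items():
    print(f'{model}: {gap:.4f}')
```

The tuned models show the smallest gaps, meaning their cross-validated scores were closer to what Kaggle reported.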