This notebook has the code for the Kaggle Titanic ML project.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

First, the two sets of data are imported into test and train variables:

In [None]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

Once the two data sets are imported, I call .head() to get an overview of the data:

In [None]:
print(train_data.head())
print(test_data.head())

There are 12 columns in the train data; the test data lacks the 'Survived' column because those are the values that need to be predicted. An overview of the data suggests that some cleaning will be necessary. 'Cabin', for example, contains a number of NaN values. Rather than fill these in, I chose to drop 'Cabin' for this analysis: the column 'Pclass' contains the passenger class, which may be a better indicator than the cabin location. Similarly, I will exclude 'Ticket', 'Parch' and 'SibSp', as these do not seem as clearly associated with survival rates as the other columns. NaN values in the remaining columns will be amended in a later step. The selected columns are saved in new variables:

In [None]:
train_data_set = train_data[["Pclass", "Sex", "Age", "Fare", "Embarked", "Survived"]]
test_data_set = test_data[["Pclass", "Sex", "Age", "Fare", "Embarked"]]

From the review of the data, the 'Sex' and 'Embarked' columns are given as strings. These need converting from string categories to numeric codes so that the classification model can be fitted later.

In [None]:
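# Encode only the string columns, fitting on train and reusing the mapping for test; NaNs stay NaN.
label_train_data, label_test_data = train_data_set.copy(), test_data_set.copy()
for col in ["Sex", "Embarked"]:
    label = LabelEncoder().fit(train_data_set[col].dropna())
    codes = dict(zip(label.classes_, range(len(label.classes_))))
    label_train_data[col], label_test_data[col] = train_data_set[col].map(codes), test_data_set[col].map(codes)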
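
As a small illustration of what LabelEncoder does to a string column, here is a sketch on made-up data (toy_train and toy_test are hypothetical stand-ins, not the Titanic files). It fits the encoder on the training column and reuses it on the test column, so both share one mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for the real train/test data (hypothetical values).
toy_train = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"]})

encoder = LabelEncoder()
encoder.fit(toy_train["Sex"])  # learn the mapping from the training column only

# classes_ is sorted, so each class's code is its position in that array.
mapping = {str(c): i for i, c in enumerate(encoder.classes_)}
toy_train_codes = encoder.transform(toy_train["Sex"])
toy_test_codes = encoder.transform(toy_test["Sex"])  # same mapping reused

print(mapping)
print(toy_train_codes, toy_test_codes)
```

Fitting a separate encoder on each data set can assign different codes to the same value if the sets do not contain exactly the same categories, which is why the sketch fits once and transforms twice.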

The remaining data is checked to ensure that it is complete by searching for NaN values:

In [None]:
print(train_data_set.isna().sum())

This shows 2 missing values in 'Embarked' and 177 in 'Age'. These can be amended with sklearn's SimpleImputer, which by default fills each gap with the column mean. Once this is done, the original column headings need to be restored, because the imputer returns a plain array without them:

In [None]:
my_imputer = SimpleImputer()
imputed_train_data = pd.DataFrame(my_imputer.fit_transform(label_train_data))
imputed_test_data = pd.DataFrame(my_imputer.fit_transform(label_test_data))

imputed_test_data.columns = test_data_set.columns
imputed_train_data.columns = train_data_set.columns
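
To make the imputation step concrete, the following sketch runs SimpleImputer on a tiny made-up frame (the toy values are hypothetical). It assumes the default strategy, which is the column mean, and passes the column names back in when rebuilding the DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing age (hypothetical values, not the Titanic data).
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Pclass": [1.0, 3.0, 2.0]})

imputer = SimpleImputer()  # default strategy="mean"

# fit_transform returns a NumPy array, so the headings are re-attached here.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The missing age is replaced by the mean of the two known ages, and the other column is left unchanged.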

The data is now ready to be passed to a model. To fine-tune the model, I need held-out values for both the X and y variables (the data loaded from test.csv into imputed_test_data does not have the y data for survival). I need these so that I can calculate the error and determine which model parameters yield the best results. First, I split imputed_train_data into train and test subsets so that I can begin to fine-tune the model.

In [None]:
imputed_X = imputed_train_data[["Pclass", "Sex", "Age", "Fare", "Embarked"]]
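imputed_y = imputed_train_data["Survived"]  # 1-D Series, the shape fit() expects

# A fixed random_state keeps the split reproducible between runs
train_X, test_X, train_y, test_y = train_test_split(imputed_X, imputed_y, test_size=0.2, random_state=1)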
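
The behaviour of train_test_split can be checked on a small made-up data set. This sketch assumes two optional arguments that are not necessarily part of the analysis above: random_state, which makes the split repeatable, and stratify, which keeps the class balance the same in both parts. The toy data is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows with a perfectly balanced 0/1 target (hypothetical data).
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series([0, 1] * 5, name="Survived")

# Two calls with the same random_state give the same split.
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(y_te.tolist())
```

With test_size=0.2, eight rows go to training and two to testing, and stratification puts one row of each class in the test part.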

The prediction is whether or not each passenger survives. As there are only two possible outcomes, a classification model is best suited to make predictions. A random forest trains multiple decision trees and combines their results: for classification, scikit-learn averages the class probabilities predicted by the trees and picks the most likely class.

In [None]:
forest_model = RandomForestClassifier(random_state=1)
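
To illustrate how the forest combines its trees, the sketch below averages the per-tree class probabilities by hand and compares the result with the forest's own output. It uses synthetic data from make_classification (a hypothetical stand-in, not the Titanic set) and a small forest to keep it quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (hypothetical; not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]

print(np.allclose(tree_probs, forest.predict_proba(X)))
print((manual_pred == forest.predict(X)).all())
```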

To test for the mean squared error, I first fit the model with the training data in train_X and train_y. I can then call .predict() with the test data and view the mean squared error. Because the labels are 0 or 1, this value equals the fraction of passengers misclassified:

In [None]:
forest_model.fit(train_X, train_y)
# mean_squared_error expects (y_true, y_pred); the value printed is the MSE, not the MAE
print(f'MSE of forest is: {mean_squared_error(test_y, forest_model.predict(test_X))}')
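
For 0/1 labels, the mean squared error is the same number as the misclassification rate, i.e. 1 minus the accuracy. A small sketch with made-up labels (y_true and y_pred are hypothetical) shows the relationship, using accuracy_score as a more conventional metric for classifiers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels: two of the five predictions are wrong.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Each wrong 0/1 prediction contributes a squared error of exactly 1.
print(mse, acc, 1 - acc)
```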

Using a function to get the mean squared error and a loop, different values can be passed to the parameters of the random forest. This will help get an idea of which parameters provide the most accurate classification.

In [None]:
def get_mse(estimators, samples, depth, train_X, train_y, test_X, test_y):
    """Fit a forest with the given settings and print its MSE on the held-out split."""
    model = RandomForestClassifier(n_estimators=estimators, max_depth=depth, max_samples=samples, random_state=1)
    model.fit(train_X, train_y)
    mse = mean_squared_error(test_y, model.predict(test_X))
    print(f'MSE of forest with {depth} depth, {estimators} estimators, and {samples} samples is: {mse}')

depths = [3, 6, 9, 12, 15]

for entry in depths:
    get_mse(500, 10, entry, train_X, train_y, test_X, test_y)
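
A possible variation on this search, sketched here under assumptions, is to return the scores instead of printing them and to score each setting with cross-validation rather than a single split. The helper mean_cv_accuracy is a hypothetical name, the data is synthetic rather than the Titanic set, and the forest is kept small so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical; not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

def mean_cv_accuracy(depth):
    """Return the mean 5-fold cross-validated accuracy for one max_depth value."""
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=1)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Collect one score per candidate depth, then pick the depth with the highest score.
scores = {depth: mean_cv_accuracy(depth) for depth in [3, 6, 9]}
best_depth = max(scores, key=scores.get)

print(scores)
print(best_depth)
```

Keeping the scores in a dictionary makes it straightforward to compare settings programmatically instead of reading them off the printed output.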

After fine-tuning to find the most accurate random forest, I added these parameters to forest_model:

In [None]:
forest_model = RandomForestClassifier(n_estimators=700, max_depth=12, max_samples=20, random_state=1)
# Refit on the full training set; the target is passed as a 1-D Series
forest_model.fit(imputed_train_data[["Pclass", "Sex", "Age", "Fare", "Embarked"]], imputed_train_data["Survived"])

This model can then be used to predict the test data for passenger survival:

In [None]:
predictions = forest_model.predict(imputed_test_data)

Finally, I wrote these values to a .csv file to be submitted to Kaggle to check how well the model did. This model received a score of 0.77.

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions.astype(int)})
output.to_csv('submission.csv', index=False)