# Titanic Dataset Analysis with Random Forest Classifier

The Titanic dataset is split into a training set and a test set. The training set is used here to build a random forest classifier, with part of it held out to measure model quality. The test set, which has no labels, is used to generate predictions for submission.

In [20]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

Each ```.csv``` file has missing values in some features. For example, some entries lack an 'Age' value. These gaps must be filled before modeling. Numerical features are filled with the mean of that feature, and categorical features with the most frequent value.

In [21]:
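def missing_values_fill(features, X):
    """Fill missing values in X in place: mean for numeric columns, mode otherwise."""
    for fname in features:
        if X[fname].isna().any():
            if pd.api.types.is_numeric_dtype(X[fname]):
                # Assign back instead of calling fillna(inplace=True) on a column,
                # which may act on a copy and leave X unchanged
                X[fname] = X[fname].fillna(X[fname].mean())
            else:
                # Categorical (object or string) columns: use the most frequent value
                X[fname] = X[fname].fillna(X[fname].mode()[0])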
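
To make the two fill rules concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from ```train.csv```). The missing 'Age' becomes the mean of the observed ages, and the missing 'Embarked' becomes the most frequent port.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"Age": [22.0, None, 38.0],
                    "Embarked": ["S", None, "S"]})

# Numerical feature: fill with the mean of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical feature: fill with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```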

In [22]:
X_full = pd.read_csv('train.csv')
X_test_full = pd.read_csv('test.csv')

We separate the target from the predictors and drop features that are poor predictors in their raw form, such as name, ticket, and cabin.

In [23]:
y = X_full['Survived']
X_full.drop(['Survived'], axis=1, inplace=True)
X_full.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
# Fill in missing values for every feature that has them
features = ['Pclass', 'Sex', 'Embarked', 'Age', 'SibSp', 'Parch', 'Fare']
missing_values_fill(features, X_full)

We split the labeled data into training and validation sets with an 80-20 split.

In [24]:
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)

X_train = X_train_full[features].copy()
X_valid = X_valid_full[features].copy()
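
As a quick sanity check on what the split produces, the sketch below runs the same call on a made-up ten-row frame. The names prefixed with `toy_` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data for illustration only
toy_X = pd.DataFrame({"Fare": range(10)})
toy_y = pd.Series([0, 1] * 5, name="Survived")

toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0)

# 80-20 split of ten rows
print(len(toy_X_train), len(toy_X_valid))

# Rows keep their original index labels, so features and targets stay aligned
print(toy_X_valid.index.equals(toy_y_valid.index))
```

Because the shuffled rows keep their index labels, later steps that rebuild frames from arrays need to restore the index before concatenating.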

The categorical variables must have a numerical representation in order to be used in the model. Such features are one-hot encoded.

In [25]:
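categorical_features = [fname for fname in X_train.columns
                        if not pd.api.types.is_numeric_dtype(X_train[fname])]

# sparse_output replaces the sparse argument in scikit-learn 1.2 and later;
# on older versions, pass sparse=False instead
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# String column names keep all labels in the combined frame the same type
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train[categorical_features]),
                          columns=OH_encoder.get_feature_names_out())
OH_X_valid = pd.DataFrame(OH_encoder.transform(X_valid[categorical_features]),
                          columns=OH_encoder.get_feature_names_out())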

OH_X_train.index = X_train.index
OH_X_valid.index = X_valid.index

numerical_X_train = X_train.drop(categorical_features, axis=1)
numerical_X_valid = X_valid.drop(categorical_features, axis=1)

OH_X_train = pd.concat([numerical_X_train, OH_X_train], axis=1)
OH_X_valid = pd.concat([numerical_X_valid, OH_X_valid], axis=1)
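
The sketch below shows what one-hot encoding does on a made-up 'Embarked' column, and how ```handle_unknown='ignore'``` treats a category that was not seen during fitting. The frames are illustrative only. The dense array is obtained with `.toarray()` on the encoder's default sparse output, which sidesteps the version-dependent dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration only
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_valid = pd.DataFrame({"Embarked": ["C", "Q"]})  # "Q" is never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(toy_train).toarray()
valid_encoded = enc.transform(toy_valid).toarray()

print(enc.categories_[0])  # categories learned from the training data
print(train_encoded)       # one column per learned category
print(valid_encoded)       # the unseen "Q" row is encoded as all zeros
```

The encoder is fitted once on the training data and only applied to other data, so every frame ends up with the same columns in the same order.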

We specify our model, a ```RandomForestClassifier``` with 100 decision trees, and fit it to the training data. We measure model quality with the model's ```score``` method, which returns the mean accuracy on the held-out validation set. The result is about 86%.

In [28]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)

rf_model.fit(OH_X_train, y_train)

predictions = rf_model.predict(OH_X_valid)
print(rf_model.score(OH_X_valid, y_valid))

0.8603351955307262
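
For a classifier, ```score``` is simply the fraction of correct predictions. The sketch below checks this on synthetic data from `make_classification` (not the Titanic features). The names prefixed with `demo_` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; not the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

demo_model = RandomForestClassifier(n_estimators=100, random_state=0)
demo_model.fit(X_tr, y_tr)

demo_predictions = demo_model.predict(X_va)
manual_accuracy = (demo_predictions == y_va).mean()  # fraction of correct predictions

# The same quantity computed three ways
print(manual_accuracy,
      accuracy_score(y_va, demo_predictions),
      demo_model.score(X_va, y_va))
```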


Finally, we predict on the test data and save the predictions to ```submission.csv```. As before, the data must first be cleaned and encoded.

In [27]:
#Drop features that can't predict anything
X_test_full.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)

#Fill missing values
missing_values_fill(features, X_test_full)
X_test = X_test_full[features].copy()

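# One-hot encode categorical features with the encoder fitted on the training data
# (transform, not fit_transform, so the columns match what the model was trained on)
OH_X_test = pd.DataFrame(OH_encoder.transform(X_test[categorical_features]))
OH_X_test.index = X_test.index
numerical_X_test = X_test.drop(categorical_features, axis=1)
OH_X_test = pd.concat([numerical_X_test, OH_X_test], axis=1)
# Reuse the training frame's column labels so feature names line up
OH_X_test.columns = OH_X_train.columns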

#Make predictions
test_predictions = rf_model.predict(OH_X_test)
output = pd.DataFrame({'PassengerId': X_test_full.PassengerId,
                       'Survived': test_predictions})
output.to_csv('submission.csv', index=False)

This model scored about 77% accuracy in the Kaggle competition, which is around the 65th percentile (a fair, passable score). Techniques such as feature engineering could improve it.
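
As one illustration of feature engineering, a family-size feature can be derived from the existing 'SibSp' and 'Parch' columns. This is a minimal sketch on made-up rows. The column name `FamilySize` is hypothetical and not part of the analysis above, and whether it improves accuracy would need to be checked on the validation set.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from train.csv
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical engineered feature:
# siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy)
```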