In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

This notebook contains a solution to the Kaggle competition: Titanic.

I will start by reading the training data and the test data. The pandas module is needed for this, and it is already imported in the cell above. I ran that cell and used its output to find the paths from which to read the training and test data.

In [None]:
X = pd.read_csv('/kaggle/input/titanic/train.csv', index_col='PassengerId')
X_test = pd.read_csv('/kaggle/input/titanic/test.csv', index_col='PassengerId')

I will remove the rows in the training data that are missing the prediction target (Survived), because such rows are useless for training. I will then separate the target (y) from the predictors (X).

In [None]:
X.dropna(axis=0, subset=['Survived'], inplace=True)
y = X.Survived
X.drop(['Survived'], axis=1, inplace=True)

Initially, I will split the training data. I will use 80% of it to train an initial model, and I will use this model to predict the other 20%, which lets me validate the model. After validation, I will build a final model and train it on the full training data. The cell below performs the 80%/20% split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)
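
As a rough illustration of what an 80/20 hold-out split does, the sketch below shuffles row positions and cuts them at 80%. It is not how train_test_split is implemented; the helper name holdout_indices and the toy size of 10 rows are assumptions made for this example.

```python
import random

def holdout_indices(n_rows, train_fraction=0.8, seed=0):
    """Hypothetical helper: return (train_positions, valid_positions)."""
    rng = random.Random(seed)                 # fixed seed for a repeatable shuffle
    positions = list(range(n_rows))
    rng.shuffle(positions)                    # random order of row positions
    cut = round(n_rows * train_fraction)      # number of rows kept for training
    return positions[:cut], positions[cut:]

train_idx, valid_idx = holdout_indices(10)
print(len(train_idx), len(valid_idx))
```

Every row lands in exactly one of the two parts, so the validation rows are never seen during training.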

In the following cells, I will handle missing values.

I will start by exploring the data. The following cell outputs the number of rows and the number of columns in the training data. For each column that contains missing values, it also outputs how many values are missing.

In [None]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])


So there are three columns containing missing values: Age, Cabin, Embarked

Let's begin with the Cabin column. As more than half of the rows are missing, I will choose to drop that column.

In [None]:
X_train = X_train.drop(['Cabin'], axis = 1)
X_valid = X_valid.drop(['Cabin'], axis = 1)
X_test = X_test.drop(['Cabin'], axis = 1)
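
The "more than half" judgement can be made explicit by computing the share of missing values per column. The sketch below does this on a small made-up frame; the 0.5 threshold and the toy values are assumptions for illustration, not figures from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values (not the Titanic data)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C1", np.nan, np.nan],
    "Fare":  [7.0, 71.0, 8.0, 52.0],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_share = toy.isnull().mean()

# Columns where more than half of the rows are missing
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(cols_to_drop)
```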

As for the Embarked column, there are only 2 missing values. So I will choose to set them to "unknown".

In [None]:
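# fillna returns a new Series, so the result must be assigned back to the column
X_train['Embarked'] = X_train['Embarked'].fillna("unknown")
X_valid['Embarked'] = X_valid['Embarked'].fillna("unknown")
X_test['Embarked'] = X_test['Embarked'].fillna("unknown")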
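
By default, fillna returns a new Series rather than modifying the column in place, so its result only takes effect once it is assigned back. A minimal sketch on made-up values (the toy frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", np.nan, "C"]})

# Calling fillna without keeping the result leaves the column unchanged
toy["Embarked"].fillna("unknown")
still_missing = int(toy["Embarked"].isnull().sum())

# Assigning the result back actually replaces the missing value
toy["Embarked"] = toy["Embarked"].fillna("unknown")
now_missing = int(toy["Embarked"].isnull().sum())

print(still_missing, now_missing)
```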

In order to decide what to do with the Age column, I will further explore the data by using the describe() function in the pandas module.

In [None]:
X_train.describe()

I can see from the output that there is no big difference between the mean and the median values of the Age column. This means I can replace the missing values by either the mean or the median. The standard deviation of the Age column is also relatively low. So I will choose to fill the missing values in that column with the mean value.
I will leave this for now, handle the categorical variables first, and then come back to it.
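
The mean and the median can also be compared directly instead of reading them off the describe() output. The sketch below uses invented ages; the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented ages with two missing entries
ages = pd.Series([22.0, 38.0, np.nan, 26.0, 35.0, np.nan, 29.0])

mean_age = ages.mean()        # missing values are skipped by default
median_age = ages.median()
print(mean_age, median_age)

# If the two are close, filling with either one gives a similar result
filled = ages.fillna(mean_age)
```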

In the following cells, I will handle categorical data.

The cell below outputs the names of the categorical columns. And for each categorical variable, it outputs the number of unique values.

In [None]:
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

sorted(d.items(), key=lambda x: x[1])

So the categorical columns are: Name, Sex, Ticket, Embarked

I will begin with the Name column. I think it will not affect the result significantly, so I will choose to drop it.

In [None]:
X_train = X_train.drop(['Name'], axis = 1)
X_valid = X_valid.drop(['Name'], axis = 1)
X_test = X_test.drop(['Name'], axis = 1)

As for the Sex and the Embarked columns, these two variables are nominal, so one-hot encoding is a suitable approach. It is also worth noting that Sex has only two unique values and Embarked has only three (plus the "unknown" placeholder), so one-hot encoding these two columns will not add too many columns to the data.

In [None]:
from sklearn.preprocessing import OneHotEncoder

colss = ['Sex', 'Embarked']

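# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[colss]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[colss]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[colss]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
OH_cols_test.index = X_test.index

# Name the encoded columns (e.g. Sex_female) so that all column names are strings
OH_feature_names = OH_encoder.get_feature_names_out(colss)
OH_cols_train.columns = OH_feature_names
OH_cols_valid.columns = OH_feature_names
OH_cols_test.columns = OH_feature_names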

other_X_train = X_train.drop(['Sex', 'Embarked'], axis=1)
other_X_valid = X_valid.drop(['Sex', 'Embarked'], axis=1)
other_X_test = X_test.drop(['Sex', 'Embarked'], axis=1)

OH_X_train = pd.concat([other_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([other_X_valid, OH_cols_valid], axis=1)
OH_X_test = pd.concat([other_X_test, OH_cols_test], axis=1)
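
For intuition about what one-hot encoding produces, the sketch below encodes a tiny made-up frame. It uses pd.get_dummies, which is a different tool from the OneHotEncoder used above: it does not remember the categories seen in the training data, so it is shown here only to illustrate the shape of the result. The toy values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One new 0/1 column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the Sex columns and one 1 among the Embarked columns.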

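The remaining categorical variable is Ticket. It is a nominal variable with many unique values, so one-hot encoding it would add a large number of columns, and I will consider label encoding it instead. I will first check whether there are values of Ticket in the validation data that do not appear in the training data, since an encoder fitted on the training data cannot encode values it has not seen.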

In [None]:
from sklearn.preprocessing import LabelEncoder

if not set(OH_X_valid['Ticket']).issubset(set(OH_X_train['Ticket'])):
    print("Ticket cannot be label encoded")
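
The subset test only answers yes or no. A set difference shows which values cause the problem. The sketch below uses made-up ticket strings as an assumption for illustration.

```python
# Invented ticket values (not taken from the competition data)
train_tickets = ["A 100", "B 200", "C 300"]
valid_tickets = ["B 200", "D 400"]

# Values that occur in the validation data but never in the training data
unseen = sorted(set(valid_tickets) - set(train_tickets))
print(unseen)
```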

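The check shows that the validation data contains Ticket values that are absent from the training data. As I cannot label encode the Ticket column, I will drop it.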

In [None]:
OH_X_train = OH_X_train.drop(['Ticket'], axis = 1)
OH_X_valid = OH_X_valid.drop(['Ticket'], axis = 1)
OH_X_test = OH_X_test.drop(['Ticket'], axis = 1)

Now it is time to come back to the Age column. As I said before, I will fill it with the mean value. Note that the only column of the training data that still contains missing values is the Age column, so the imputation will affect only that column there.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(imputer.fit_transform(OH_X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(OH_X_valid))
imputed_X_test = pd.DataFrame(imputer.transform(OH_X_test))

# Imputation removed column names; put them back
imputed_X_train.columns = OH_X_train.columns
imputed_X_valid.columns = OH_X_valid.columns
imputed_X_test.columns = OH_X_test.columns
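
With its default settings, SimpleImputer replaces each missing entry with the mean of its column. The means are learned in fit and reused in transform. The sketch below reproduces that idea with plain pandas on made-up numbers; the frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})
valid = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 5.0]})

# Column means are computed from the training rows only
train_means = train.mean()

# The same training means fill the gaps in both parts
train_filled = train.fillna(train_means)
valid_filled = valid.fillna(train_means)
print(valid_filled)
```

Using the training means for the validation data keeps information from the validation rows out of the preprocessing step.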

After handling the missing values and the categorical variables, I will build the model. I chose a random forest classifier. I will train the model on 80% of the preprocessed training data (imputed_X_train) and then use it to predict the remaining 20% (imputed_X_valid). I will validate the model by calculating the mean absolute error of these predictions. Since the target is either 0 or 1, this is the fraction of validation rows that are predicted incorrectly.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(imputed_X_train, y_train)
predictions = model.predict(imputed_X_valid)
mae = mean_absolute_error(y_valid, predictions)
mae

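Because the labels are either 0 or 1, the mean absolute error ranges from 0 to 1. A mean absolute error of approximately 0.1731 means that about 17% of the validation rows were predicted incorrectly, so the performance of the model can be considered good :)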

I also measured the performance of the model using the accuracy_score() function found in the sklearn.metrics module.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_valid, predictions)

The output of the above cell is approximately 0.8268. So the accuracy of the model is around 82.68% 
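
For 0/1 labels, the mean absolute error is the share of wrong predictions and the accuracy is the share of correct ones, which is why 0.1731 and 0.8268 add up to about 1. A small sketch with made-up labels (the values are assumptions for illustration):

```python
# Invented true labels and predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

n = len(y_true)

# |t - p| is 1 for a wrong prediction and 0 for a correct one
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(mae, accuracy)
```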

I will make a new model. I will train this new model with the whole training data (rather than with only 80% of it) to increase accuracy. Then, I will use this new model to predict the preprocessed test data.

In [None]:
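# Recombine the 80% and 20% parts so that the final model sees the whole training data
full_X = pd.concat([imputed_X_train, imputed_X_valid], axis=0)
full_y = pd.concat([y_train, y_valid], axis=0)

final_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
final_model.fit(full_X, full_y)
final_preds = final_model.predict(imputed_X_test)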

Finally, I will save the predictions in a csv file in the format required in the competition.
I added a print statement to make sure the cell runs successfully.

In [None]:
output = pd.DataFrame({'PassengerId': X_test.index, 'Survived': final_preds})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
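
Beyond the print statement, the saved file itself can be checked. The sketch below writes a small made-up frame to an in-memory buffer, reads it back, and runs a few basic checks. The helper name check_submission and the toy rows are assumptions for illustration; the two column names are the ones used in the cell above.

```python
import io
import pandas as pd

def check_submission(df):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    return (
        list(df.columns) == ["PassengerId", "Survived"]   # expected columns, in order
        and df["PassengerId"].is_unique                   # one row per passenger
        and bool(df["Survived"].isin([0, 1]).all())       # predictions are 0 or 1
    )

# Invented rows standing in for real predictions
toy = pd.DataFrame({"PassengerId": [1001, 1002, 1003], "Survived": [0, 1, 0]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(check_submission(round_trip))
```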