In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
titanic_file_path = '/kaggle/input/titanic/train.csv'

titanic_data = pd.read_csv(titanic_file_path)

titanic_data.head()

First, I will make a list of all the columns so I can take a look at the correlation matrix for each of these elements.

In [3]:
columns = list(titanic_data.columns.values)
print(columns)

With the elements listed, I now need to see which are numeric and which are categorical. describe() helps here because, by default, it summarizes only the numeric columns.

In [4]:
print(titanic_data.describe())
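A minimal sketch of making the numeric/categorical split explicit with select_dtypes. The small frame below is made up for illustration and only stands in for titanic_data; the names toy, numeric_cols, and categorical_cols are my own.

```python
import pandas as pd

# Toy frame standing in for titanic_data (values are illustrative only)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Fare': [7.25, 71.28, 7.92],
    'Sex': ['male', 'female', 'female'],
    'Cabin': [None, 'C85', None],
})

# Columns with a numeric dtype
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Everything else needs encoding before it can enter a correlation matrix
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The same two calls should work on titanic_data directly, assuming the CSV loads with the usual dtypes.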

Now I need to review how each of the remaining elements is categorized to see whether they can logically be split into larger groups for comparison.

In [5]:
categorical_subset = titanic_data[['Sex', 'Embarked', 'Ticket', 'Cabin']]

print(titanic_data.Sex.head(15))
print(titanic_data.Embarked.head(15))
print(titanic_data.Ticket.head(15))
print(titanic_data.Cabin.head(15))

In [6]:


print(titanic_data['Sex'].value_counts())
print(titanic_data['Embarked'].value_counts())
print(titanic_data['Ticket'].value_counts())
print(titanic_data['Cabin'].value_counts())

In [7]:
print(titanic_data.isnull().sum())

Clearly, Sex is an easily categorized variable, as is Embarked, since those contain a minimum number of possible responses. Those can be assigned dummy values as they are. Ticket does not have any immediately noticeable patterns, though there are occasional repeating letter patterns. Additional exploration would be necessary to use that variable.

Cabin provides an interesting possibility for consideration. The naming convention of the cabins indicates that they are separated into sections by letter. It is possible that certain sections were more deadly than others, regardless of Sex or Pclass. It is also notable that Cabin has the most null values in the data set by a large margin. Both of these facts deserve exploration.

In [8]:
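# Extract the first capital letter of each cabin label as the section
# (expand=False returns a Series rather than a one-column DataFrame)
titanic_data['cabin_section'] = titanic_data['Cabin'].str.extract('([A-Z])', expand=False)
# Passengers without a recorded cabin get their own 'Unknown' category
titanic_data['cabin_section'] = titanic_data['cabin_section'].fillna('Unknown')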

# pd.set_option("max_rows", None)

titanic_data['cabin_section'].head(20)
# titanic_data.Cabin.head

In [9]:
print(titanic_data['cabin_section'].value_counts())

I have reduced the Cabin column to the section of the ship where the cabin is located and stored this in a new variable, cabin_section. I have also put NaN values into their own category, Unknown, so they can be considered in the model. It is possible that this lack of information is itself related to whether a passenger survived the disaster.
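To make the extraction rule concrete, here is a small sketch on a made-up Series of cabin labels (the values are illustrative, and the names cabins and section are my own). It assumes the section is simply the first capital letter in the label, so a multi-cabin entry is assigned to the section of its first cabin.

```python
import pandas as pd

# Made-up cabin labels, including a missing value and multi-cabin entries
cabins = pd.Series(['C85', None, 'B57 B59 B63', 'F G73', 'T'])

# The pattern captures the first capital letter; missing cabins become 'Unknown'
section = cabins.str.extract(r'([A-Z])', expand=False).fillna('Unknown')

print(section.tolist())
```

Note that an entry such as 'F G73' is assigned to F under this rule, which may or may not be the deck the passenger actually slept on.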

In [10]:
numerical_subset = titanic_data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]

categorical_subset = titanic_data[['Sex', 'Embarked', 'cabin_section']]
categorical_subset = pd.get_dummies(categorical_subset)

features = pd.concat([numerical_subset, categorical_subset], axis=1)

In [11]:
# Find all correlations and sort 
correlations_data = features.corr()['Survived'].sort_values()

# Print the most negative correlations
print(correlations_data.head(15), '\n')

# Print the most positive correlations
print(correlations_data.tail(15))

From the correlation matrix, we can see that Sex has the largest correlation with survival. Pclass appears to be negatively associated with survival, but this is because the class labels are treated as true numerical values (1 < 2 < 3) rather than the symbolic ranks they represent in the real world, where first class is the highest. cabin_section_Unknown is also negatively associated with survival, suggesting that we might be right about that variable.
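The sign of the Pclass correlation is easier to read when the class is also encoded as a category. The sketch below uses made-up data (not the Titanic file) and hypothetical names such as toy and per_class_corr; it only illustrates how the numeric and one-hot views differ.

```python
import pandas as pd

# Made-up passengers: first class survives more often than third class
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Treating Pclass as a number: a larger value means a lower class
numeric_corr = toy['Pclass'].corr(toy['Survived'])

# Treating Pclass as a category: one indicator column per class
class_dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass', dtype=int)
per_class_corr = class_dummies.corrwith(toy['Survived'])

print(numeric_corr)
print(per_class_corr)
```

In this toy data the numeric correlation is negative, while the indicator for first class correlates positively with survival and the indicator for third class negatively.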

Fare also seems to have played a role, as it is positively associated with survival. This seems like an extension of Pclass, but if so, why is the correlation reduced by roughly 8%? Perhaps it is thrown off by the crew that did not pay for the voyage? Or perhaps some wealthy people were guests on the ship and did not pay. Both are possible, so additional exploration is required.

We will start by looking at a bar chart of survival rates by cabin section.

In [12]:
section_survival = titanic_data.groupby(['cabin_section'])['Survived'].value_counts().unstack().plot(figsize=(10, 8), kind='bar',stacked = False)

We can see here that survivor rates, with the exception of Sections A, G, and T, tend to be higher when the section is known. G and T seem to be fairly even, but not well represented.
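Since the bar chart shows counts, the rates themselves can be computed directly: the mean of a 0/1 column is the share of survivors in each group. A minimal sketch on made-up data (the frame toy and the column names survival_rate and passengers are my own):

```python
import pandas as pd

# Made-up passengers in two sections
toy = pd.DataFrame({
    'cabin_section': ['B', 'B', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown'],
    'Survived':      [1, 1, 0, 0, 0, 1, 0],
})

# Survival rate and group size per section
rates = toy.groupby('cabin_section')['Survived'].agg(['mean', 'count'])
rates = rates.rename(columns={'mean': 'survival_rate', 'count': 'passengers'})

print(rates)
```

Keeping the group size next to the rate helps flag sections, such as G and T, that have too few passengers for the rate to mean much.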

Upon further research (https://titanic.fandom.com/wiki/A_Deck), it appears that there are a lot of unknown occupants of Deck A, which might be contributing to its unusual composition. From the occupants that are known, it seems that many of these particular occupants were unwilling to leave, either due to disbelief, fear of leaving the ship, or guilt at its sinking, rather than being willfully excluded from lifeboats.

Decks B and C (https://titanic.fandom.com/wiki/B_Deck#:~:text=B%20Deck%2C%20also%20called%20the,featuring%20their%20own%20private%20promenades., https://titanic.fandom.com/wiki/C_Deck) have a much higher proportion of known occupants than any other decks, so we can have a higher certainty that these proportions are fairly accurate.

G and F decks were entirely third class passengers and both of these decks were completely flooded within minutes of hitting the iceberg. Since there were 1,100 third class passengers, it appears that both G and F were full of Unknown status passengers.

Interestingly, E deck, despite being fairly low in the ship, had a comparatively high survival rate. Whether or not it played a part is uncertain, but we can note that a majority of the onboard crew were cabined on E deck (https://titanic.fandom.com/wiki/E_Deck ). It is possible that, despite a comparatively low status, their position of power on the ship afforded them a better chance of surviving.

In [13]:
sex_survival = titanic_data.groupby(['Sex'])['Survived'].value_counts().unstack().plot(figsize=(10, 8), kind='bar',stacked = False)

The difference in survival between the sexes is large, though unsurprising. The phrase "women and children first" seems to have been a guiding cultural principle for those onboard, though it was clearly not absolute, as certain men were considered exceptions.

In [14]:
class_survival = titanic_data.groupby(['Pclass'])['Survived'].value_counts().unstack().plot(figsize=(10, 8), kind='bar',stacked = False)

The most significant piece of this graph is the likelihood of a third class passenger surviving, which was significantly lower than the other classes. As seen in the section examination above, part of this had to do with how quickly third class decks flooded. Many people on these decks simply did not have time to escape. The facts of the night make it clear, though, that this is not the only reason, as many third class passengers were denied access to lifeboats.

We may have a better approximation of the true significance by examining the difference between first and second class passengers, who were both on sections that flooded more slowly. Here, we can see that the difference is still stark, but not quite so extreme as with third class. Still, class exclusion becomes harsher as class level is lowered, so the answer is most likely somewhere in the middle.

In [20]:
#import seaborn as sns
#sns.set(font_scale = 2)
#import matplotlib.pyplot as plt
#%matplotlib inline

#cabin_encode = pd.get_dummies(titanic_data[['cabin_section']])
#some_features = titanic_data[['Fare', 'Survived']]

#features = pd.concat([some_features, cabin_encode], axis=1)

# Use seaborn to plot a scatterplot of Fare vs cabin section
#sns.lmplot('Fare', 'cabin_section', 
#          hue = 'Survived', data = features,
#          scatter_kws = {'alpha': 0.8, 's': 60}, fit_reg = False,
#          size = 12, aspect = 1.2);

# Plot labeling
#plt.xlabel("Fare", size = 28)
#plt.ylabel('Deck', size = 28)
#plt.title('Class and Cabin Compared With Survival', size = 36);

(This was an attempt to make use of a scatterplot, but this particular type of representation does not seem to serve our model, regardless of how it is changed. I may return to it later.)

In [21]:
numerical_subset = titanic_data[['Pclass', 'Fare']]

categorical_subset = titanic_data[['Sex', 'cabin_section', 'Embarked']]
categorical_subset = pd.get_dummies(categorical_subset)

features = pd.concat([numerical_subset, categorical_subset], axis=1)

X = features

print(X.describe())

In [23]:
y = titanic_data.Survived

In [25]:
# Data Prep
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Potential Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Testing
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Final model decision
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

After importing necessary libraries, we split the data into a training and test set.

In [26]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 13)

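Next, we impute missing values using the median of each training column. Age is not among the selected features, and pd.get_dummies encodes a missing Embarked as all zeros, so this step mainly guards against gaps in the numeric columns.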

In [30]:
imputer = SimpleImputer(strategy='median')

imputer.fit(X_train)

X_train = imputer.transform(X_train)
X_val = imputer.transform(X_val)

In [31]:
print('Missing values in training features: ', np.sum(np.isnan(X_train)))
print('Missing values in testing features:  ', np.sum(np.isnan(X_val)))

In [33]:
# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X_train)

# Transform both the training and testing data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
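The imputer and scaler are both fitted on the training split only and then applied to the held-out split. As a sketch of that idea, the two steps can be chained with scikit-learn's make_pipeline. The arrays below are made up, and the names prep, train, and val are my own; this is one possible arrangement, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up feature arrays with a few missing values
train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
val = np.array([[np.nan, 20.0], [4.0, 40.0]])

# Median imputation followed by min-max scaling
prep = make_pipeline(SimpleImputer(strategy='median'),
                     MinMaxScaler(feature_range=(0, 1)))

# Medians, minima, and maxima are learned from the training rows only
train_scaled = prep.fit_transform(train)

# The held-out rows reuse those learned statistics
val_scaled = prep.transform(val)

print(train_scaled)
print(val_scaled)
```

Because the scaler only knows the training range, held-out values can land outside 0-1, as the last row here does.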

In [34]:
# Convert y to one-dimensional array (vector)
y_train = np.array(y_train).reshape((-1, ))
y_val = np.array(y_val).reshape((-1, ))

In [37]:
# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

# Takes in a model, trains the model, and evaluates the model on the test set
def fit_and_evaluate(model):
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions and evalute
    model_pred = model.predict(X_val)
    model_mae = mae(y_val, model_pred)
    
    # Return the performance metric
    return model_mae
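Because Survived is 0 or 1, the MAE of a regressor's output measures how far the predicted scores sit from the true labels on average. To read the same predictions as class labels, one option is to threshold the scores and compute accuracy. The sketch below uses made-up scores, and the 0.5 cutoff is an assumption of mine rather than something tuned here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up true labels and regression scores
y_true = np.array([1, 0, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Average distance between the scores and the 0/1 labels
mae_value = mean_absolute_error(y_true, scores)

# Scores at or above the cutoff count as "survived" (0.5 is an assumed cutoff)
labels = (scores >= 0.5).astype(int)
accuracy = accuracy_score(y_true, labels)

print('MAE = %0.4f' % mae_value)
print('Accuracy = %0.4f' % accuracy)
```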

In [38]:
lr = LinearRegression()
lr_mae = fit_and_evaluate(lr)

print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)

In [39]:
svm = SVR(C = 1000, gamma = 0.1)
svm_mae = fit_and_evaluate(svm)

print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)

In [40]:
random_forest = RandomForestRegressor(random_state=60)
random_forest_mae = fit_and_evaluate(random_forest)

print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)

In [41]:
gradient_boosted = GradientBoostingRegressor(random_state=60)
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)

print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)

In [42]:
knn = KNeighborsRegressor(n_neighbors=10)
knn_mae = fit_and_evaluate(knn)

print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)

In [50]:
import matplotlib.pyplot as plt  # not imported earlier (the seaborn cell above is commented out)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(20,10)) 


# Dataframe to hold the results
model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
                                           'Random Forest', 'Gradient Boosted',
                                            'K-Nearest Neighbors'],
                                 'mae': [lr_mae, svm_mae, random_forest_mae, 
                                         gradient_boosted_mae, knn_mae]})

# Horizontal bar chart of test mae
model_comparison.sort_values('mae', ascending = False).plot(x = 'model', y = 'mae', kind = 'barh',
                                                           color = 'red', edgecolor = 'black')

# Plot formatting
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);

In [51]:
# With the model design complete, use all the data for the final model.
random_forest.fit(X, y)

# Prepare Test Data
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

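test_data['cabin_section'] = test_data['Cabin'].str.extract('([A-Z])', expand=False)
test_data['cabin_section'] = test_data['cabin_section'].fillna('Unknown')

numerical_subset = test_data[['Pclass', 'Fare']]

categorical_subset = test_data[['Sex', 'cabin_section', 'Embarked']]
categorical_subset = pd.get_dummies(categorical_subset)

test_features = pd.concat([numerical_subset, categorical_subset], axis=1)

# Match the training columns; dummy columns absent from the test file are filled with 0
test_features = test_features.reindex(columns=X.columns, fill_value=0)

# Fill any missing Fare with the training median
test_features['Fare'] = test_features['Fare'].fillna(X['Fare'].median())

X_test = test_features

# The regressor returns scores between 0 and 1; threshold at 0.5 to get class labels
# (astype(int) alone would truncate a score such as 0.9 down to 0)
test_predictions = (random_forest.predict(X_test) >= 0.5).astype(int)

output = pd.DataFrame({'PassengerId' : test_data.PassengerId,
                   'Survived' : test_predictions})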

output.to_csv('submission.csv',index=False)
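One-hot encoding the training and test files separately can produce different column sets when a category appears in only one of them. A minimal sketch of lining the columns up with reindex, using made-up frames (train_cat, test_cat, and aligned are hypothetical names):

```python
import pandas as pd

# Made-up categories: section 'T' appears only in the training frame
train_cat = pd.DataFrame({'cabin_section': ['A', 'T', 'Unknown']})
test_cat = pd.DataFrame({'cabin_section': ['A', 'Unknown', 'Unknown']})

train_dummies = pd.get_dummies(train_cat, dtype=int)
test_dummies = pd.get_dummies(test_cat, dtype=int)

# Reorder to the training layout and add any missing indicator as all zeros
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(aligned.columns))
```

A model fitted on the training layout can then be given the aligned frame without a feature mismatch.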