In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [5]:
titanic_file_path = '/kaggle/input/titanic/train.csv'

titanic_data = pd.read_csv(titanic_file_path)

titanic_data.head()

First, I will list all the columns so I know which variables are available before looking at how each one correlates with survival.

In [6]:
columns = list(titanic_data.columns.values)
print(columns)

With the columns listed, I now need to see which are numeric and which are categorical. By default, describe() summarizes only the numeric columns, so any column missing from its output is non-numeric.

In [7]:
print(titanic_data.describe())
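
The numeric/non-numeric split that describe() implies can also be made explicit. The following is a minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical stand-ins for titanic_data), that uses select_dtypes to list each group of columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "female"],
    "Cabin": [None, "C85", None],
})

# Columns with a numeric dtype (these are the ones describe() reports by default)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is a candidate categorical variable
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

On the real data, the same two calls on titanic_data should give the lists that the following cells inspect by hand.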

Now I need to review the values in each of the remaining, non-numeric columns to see whether they can logically be grouped into broader categories for comparison.

In [8]:
categorical_subset = titanic_data[['Sex', 'Embarked', 'Ticket', 'Cabin']]

# Preview the first 15 values of each non-numeric column
for column in categorical_subset.columns:
    print(categorical_subset[column].head(15))

In [9]:


print(titanic_data['Sex'].value_counts())
print(titanic_data['Embarked'].value_counts())
print(titanic_data['Ticket'].value_counts())
print(titanic_data['Cabin'].value_counts())

In [10]:
print(titanic_data.isnull().sum())

Clearly, Sex is an easily categorized variable, as is Embarked, since each takes only a small number of distinct values. Both can be encoded as dummy variables as they are. Ticket does not have any immediately noticeable pattern, though some letter prefixes do repeat. Additional exploration would be necessary to use that variable.

Cabin provides an interesting possibility for consideration. The naming convention of the cabins indicates that they are separated into sections by letter. It is possible that certain sections were more deadly than others, regardless of Sex or Pclass. It is also notable that Cabin has the most null values in the data set by a large margin. Both of these facts deserve exploration.
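
To put a number on "by a large margin", the null counts can be turned into a share of rows per column. This is a minimal sketch on a made-up frame (`toy` is hypothetical, not the Titanic data); on the real data the same expression would be applied to titanic_data.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy = pd.DataFrame({
    "Cabin": ["C85", None, None, None],
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(missing_share)
```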

In [29]:
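# Take the first capital letter of each Cabin value as the ship section
titanic_data['cabin_section'] = titanic_data['Cabin'].str.extract(r'([A-Z])', expand=False)
# Rows with no recorded cabin get their own 'Unknown' category
titanic_data['cabin_section'] = titanic_data['cabin_section'].fillna('Unknown')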

# pd.set_option("max_rows", None)

titanic_data['cabin_section'].head(20)
# titanic_data.Cabin.head

In [30]:
print(titanic_data['cabin_section'].value_counts())

I have derived a new variable, cabin_section, from the Cabin column to reflect the section of the ship where each cabin is located. Missing (NaN) values are placed in their own category, 'Unknown', so they can be considered in the model. It is possible that the absence of a recorded cabin is itself related to whether a passenger survived.
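
One way to check whether the sections, including 'Unknown', differ in survival is to compare the survival rate and passenger count per section. The sketch below assumes a small made-up frame (`toy` and its values are hypothetical) and repeats the section extraction so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None, "C123", "B42"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# First capital letter of the cabin label; missing cabins become 'Unknown'
toy["cabin_section"] = (
    toy["Cabin"].str.extract(r"([A-Z])", expand=False).fillna("Unknown")
)

# Survival rate (mean of the 0/1 flag) and group size per section
summary = toy.groupby("cabin_section")["Survived"].agg(["mean", "count"])

print(summary)
```

The count column matters here: a section with only a handful of passengers can show an extreme survival rate by chance.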

In [36]:
numerical_subset = titanic_data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]

categorical_subset = titanic_data[['Sex', 'Embarked', 'cabin_section']]
categorical_subset = pd.get_dummies(categorical_subset)

features = pd.concat([numerical_subset, categorical_subset], axis=1)

In [37]:
# Find all correlations and sort 
correlations_data = features.corr()['Survived'].sort_values()

# Print the most negative correlations
print(correlations_data.head(15), '\n')

# Print the most positive correlations
print(correlations_data.tail(15))

From the correlations, we can see that Sex has the largest correlation with survival. Pclass appears to be negatively associated with survival, but this is because the class labels are treated as true numerical values rather than as the symbolic labels they are in the real world: 1 denotes first class and 3 denotes third class, so a negative correlation means that passengers in the numerically higher classes survived less often. cabin_section_Unknown is also negatively associated with survival, which suggests that the earlier idea about missing cabin information may be right.
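
If the numeric coding of Pclass is a concern, the class can be treated as a label instead. The following is a minimal sketch, assuming a made-up frame (`toy` is hypothetical), that one-hot encodes Pclass and correlates each class indicator with Survived separately. This is one possible approach, not the one used in the cells above.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Cast to string so get_dummies treats the class as a label, not a number
pclass_dummies = pd.get_dummies(
    toy["Pclass"].astype(str), prefix="Pclass", dtype=int
)

# Correlation of each class indicator with the survival flag
per_class = (
    pd.concat([pclass_dummies, toy["Survived"]], axis=1)
    .corr()["Survived"]
    .drop("Survived")
)

print(per_class)
```

Each indicator then gets its own sign, so the direction of the association can be read per class rather than inferred from the ordering of the codes.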

Fare also seems to have played a role, as it is positively associated with survival. This seems like an extension of Pclass, but if so, why is its correlation weaker by roughly 8%? Perhaps it is thrown off by crew who did not pay for the voyage, or perhaps some wealthy people were guests on the ship and did not pay. Both are possible, so additional exploration is required.
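
A possible starting point for that exploration is to count zero fares and to compare Fare across classes. The sketch below uses a made-up frame (`toy` and its values are hypothetical), so its numbers say nothing about the real data; it only shows the checks.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 0.0, 13.0, 26.0, 7.25, 8.05],
})

# How many passengers are recorded as paying nothing
zero_fare_count = int((toy["Fare"] == 0).sum())

# Typical fare per class, to see how closely Fare tracks Pclass
median_fare_by_class = toy.groupby("Pclass")["Fare"].median()

# Direct correlation between the two variables
fare_pclass_corr = toy["Fare"].corr(toy["Pclass"])

print(zero_fare_count)
print(median_fare_by_class)
print(fare_pclass_corr)
```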