# Titanic Data Wrangling and Exploration
The purpose of this notebook is to understand, describe, and explore the Titanic data set used in this project.

In [1]:
import os
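# Walk up to the project root; stop at the filesystem root instead of looping forever
while os.path.basename(os.getcwd()) != 'Synthetic_Data_GAN_Capstone':
    if os.getcwd() == os.path.dirname(os.getcwd()):
        raise FileNotFoundError('Synthetic_Data_GAN_Capstone not found above the current directory')
    os.chdir('..')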
from utils.data_loading import load_raw_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Importing data
I have written a helper function that loads the desired data set. If you have not already downloaded
the data sets to the appropriate directory, running the following code takes care of that as well:

In [2]:
titanic = load_raw_dataset('titanic')
titanic = titanic[0]  # Ignoring the test data set for now
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Inspecting the data

In [3]:
print(titanic.shape)
titanic.describe()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Looks like we have some missing values! Age, Cabin, and Embarked all contain missing values.
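The missing counts can also be read off directly rather than inferred from the `info()` output. The following is a minimal sketch on a made-up toy frame (the `toy` values are illustrative, not taken from the Titanic data); the same two lines would apply to `titanic`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Sorting in descending order puts the column with the most gaps first, which helps decide which column needs the most careful treatment.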

### Cleaning the data
The Titanic data set contains a mix of data types and several data-quality problems that would get in the way of good predictions. As such, we will use a different method to clean up each affected column.

#### Embarked (imputed)

In [5]:
titanic.groupby('Embarked').size()

Embarked
C    168
Q     77
S    644
dtype: int64

The vast majority of passengers embarked at S, and only 2 values are missing. Let's fill those two missing Embarked values with the mode, S.

In [6]:
titanic.Embarked = titanic.Embarked.fillna('S')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
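The fill above hard-codes `'S'`. As a hedged alternative, the mode can be derived from the column itself so the step still works if the data changes. This sketch uses a made-up `embarked` Series rather than the real column.

```python
import pandas as pd

# Toy stand-in for titanic.Embarked (values are made up for illustration)
embarked = pd.Series(["S", "C", None, "S", "Q", None, "S"])

# Series.mode() skips missing values by default and returns a Series,
# because ties are possible; [0] takes the first (or only) mode
most_common = embarked.mode()[0]

# Fill the gaps with the most frequent value
filled = embarked.fillna(most_common)
print(most_common, filled.isna().sum())
```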


#### Cabin --> CabinLetter (transformed)

In [7]:
titanic.groupby('Cabin').size().head()

Cabin
A10    1
A14    1
A16    1
A19    1
A20    1
dtype: int64

I propose distilling Cabin into a feature that holds only its leading letter. Rows with a missing Cabin will be placed in a group of their own. This assumes that the fact that the value is missing carries information in itself, so it may be helpful in predicting survival.

In [8]:
titanic['CabinLetter'] = titanic.Cabin.str[0]
titanic.CabinLetter = titanic.CabinLetter.fillna('NoCabin')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
CabinLetter    891 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 90.6+ KB


In [9]:
titanic.groupby('CabinLetter').size().reset_index().sort_values(0,ascending=False)

Unnamed: 0,CabinLetter,0
7,NoCabin,687
2,C,59
1,B,47
3,D,33
4,E,32
0,A,15
5,F,13
6,G,4
8,T,1
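To make the transformation concrete, here is a small sketch on a made-up `cabin` Series. It also shows what happens to an entry that lists several cabins: only the first character is kept, so such a row is assigned the letter of its first cabin.

```python
import pandas as pd

# Toy stand-in for titanic.Cabin (values chosen for illustration)
cabin = pd.Series(["C85", None, "C23 C25 C27", "E46", None, "T"])

# .str[0] takes the first character of each string and leaves missing values missing;
# fillna then moves every missing entry into its own 'NoCabin' group
cabin_letter = cabin.str[0].fillna("NoCabin")
print(cabin_letter.tolist())
```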


Age is harder to impute than Embarked or Cabin. I propose returning to it after extracting a title feature from Name, and then imputing Age using CatBoost.

#### Name --> Title (transformed)

In [10]:
titanic.Name.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

It looks like we can use a bit of regex here to extract the title. Let's give it a shot!

In [11]:
titanic['Title'] = titanic['Name'].str.extract(pat=r'(\b[A-Za-z]+\.)', expand=False)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
CabinLetter    891 non-null object
Title          891 non-null object
dtypes: float64(2), int64(5), object(7)
memory usage: 97.5+ KB


In [12]:
titanic.Title.value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Col.           2
Major.         2
Mlle.          2
Mme.           1
Jonkheer.      1
Capt.          1
Sir.           1
Lady.          1
Countess.      1
Ms.            1
Don.           1
Name: Title, dtype: int64
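The pattern `\b[A-Za-z]+\.` matches the first run of letters that is immediately followed by a period, which in these names is the title. The sketch below applies the same pattern to a few sample names (the `names` Series is an illustrative stand-in for `titanic['Name']`).

```python
import pandas as pd

# A few sample names in the same "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# \b        word boundary, so the match starts at the beginning of a word
# [A-Za-z]+ one or more letters
# \.        a literal period ending the title
# expand=False returns a Series instead of a one-column DataFrame
titles = names.str.extract(r"(\b[A-Za-z]+\.)", expand=False)
print(titles.tolist())
```

Because only the first match is returned, the pattern relies on no earlier word in the name ending in a period.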

Let's group anything other than Mr., Miss., Mrs., or Master. into a single Misc. group.

In [13]:
# Use .loc so the assignment modifies titanic itself rather than a temporary copy
titanic.loc[~titanic.Title.isin(['Mr.', 'Miss.', 'Mrs.', 'Master.']), 'Title'] = 'Misc.'
titanic.Title.value_counts()


Mr.        517
Miss.      182
Mrs.       125
Master.     40
Misc.       27
Name: Title, dtype: int64
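The same grouping can also be written without in-place assignment. This is a minimal sketch using `Series.where`, which keeps values where the condition holds and substitutes a replacement elsewhere; the `title` Series here is a made-up sample.

```python
import pandas as pd

# Toy stand-in for titanic.Title
title = pd.Series(["Mr.", "Dr.", "Miss.", "Rev.", "Mrs.", "Master.", "Mlle."])
common = ["Mr.", "Miss.", "Mrs.", "Master."]

# Keep the four common titles, replace everything else with 'Misc.'
grouped = title.where(title.isin(common), "Misc.")
print(grouped.value_counts())
```

On the real frame this would read `titanic['Title'] = titanic['Title'].where(...)`, which returns a new Series and assigns it back as a whole column.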

#### PassengerId, Ticket, Cabin, and Name (dropped)

Let's discard PassengerId and Ticket, as they are unlikely to be useful in predicting survival. Cabin and Name can go as well, since CabinLetter and Title now stand in for them.

In [14]:
titanic = titanic.drop(columns=['PassengerId', 'Ticket', 'Cabin', 'Name'])
titanic.shape

(891, 10)

#### Age (imputation)

In [15]:
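from catboost import CatBoostRegressor
# task_type='GPU' requires a CUDA-capable GPU; use task_type='CPU' otherwise
model = CatBoostRegressor(task_type='GPU')
# Split rows by whether Age is known; the known ages are the regression targets
age_labels = titanic.Age[~titanic.Age.isna()]
has_age = titanic[~titanic.Age.isna()].drop(columns='Age')
no_age = titanic[titanic.Age.isna()].drop(columns='Age')
# Positional indices of the categorical columns:
# Survived, Pclass, Sex, Embarked, CabinLetter, Title
model.fit(has_age, age_labels, cat_features=[0, 1, 2, 6, 7, 8], silent=True)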

<catboost.core.CatBoostRegressor at 0x7f720f596588>
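The positional `cat_features` indices depend on the column order after the drops above, so they break silently if a column is added or reordered. One hedged way to make this sturdier is to look the positions up by name. The sketch below rebuilds only the column layout of `has_age` in an empty frame (the names `features` and `categorical` are illustrative) and does not call CatBoost.

```python
import pandas as pd

# Column order of has_age after PassengerId, Ticket, Cabin, Name, and Age are dropped
feature_columns = ["Survived", "Pclass", "Sex", "SibSp", "Parch",
                   "Fare", "Embarked", "CabinLetter", "Title"]
features = pd.DataFrame(columns=feature_columns)

# Columns treated as categorical in the imputation model
categorical = ["Survived", "Pclass", "Sex", "Embarked", "CabinLetter", "Title"]

# Translate names to positions, so the list follows the frame if its layout changes
cat_idx = [features.columns.get_loc(name) for name in categorical]
print(cat_idx)
```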

In [16]:
new_age_labels = pd.DataFrame(model.predict(no_age))
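# Rows with a known age, re-indexed from 0
known = pd.concat([has_age, age_labels], axis=1).reset_index(drop=True)
# Rows with an imputed age, re-indexed from 0 so the predictions line up
imputed = pd.concat([no_age.reset_index(drop=True), new_age_labels], axis=1).rename(columns={0: 'Age'})
# Stack the two groups: known ages first, imputed ages last
new_titanic = pd.concat([known, imputed], axis=0)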
new_titanic.head()

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked,CabinLetter,Title,Age
0,0,3,male,1,0,7.25,S,NoCabin,Mr.,22.0
1,1,1,female,1,0,71.2833,C,C,Mrs.,38.0
2,1,3,female,0,0,7.925,S,NoCabin,Miss.,26.0
3,1,1,female,1,0,53.1,S,C,Mrs.,35.0
4,0,3,male,0,0,8.05,S,NoCabin,Mr.,35.0


In [17]:
new_titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 176
Data columns (total 10 columns):
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
CabinLetter    891 non-null object
Title          891 non-null object
Age            891 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 76.6+ KB
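Note that the index above runs from 0 to 176 for 891 rows: stacking the two groups leaves duplicate index labels and moves the imputed rows to the end. That is harmless when saving with `index=False`, but if the original row order matters, the predictions can be written back in place instead. This is a sketch on a toy frame, with a fixed `predicted` array standing in for `model.predict(no_age)`.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing ages (values are made up for illustration)
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Age": [22.0, np.nan, 26.0, np.nan],
})

# Boolean mask of the rows that need an imputed age
missing_age = toy["Age"].isna()

# Stand-in for the model's predictions, one value per missing row, in row order
predicted = np.array([30.0, 31.5])

# Work on a copy and assign only into the masked rows; order and index are preserved
imputed = toy.copy()
imputed.loc[missing_age, "Age"] = predicted
print(imputed)
```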


### Store the cleaned data for future use
Much better! We now have a cleaned data set with more useful features and no missing values. Let's save it so that we can use it in our GANs later on.

In [18]:
new_titanic.to_csv('downloads/titanic/cleaned.csv', index=False)
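A quick round-trip check can confirm that the saved file reads back as expected. The sketch below writes a small made-up frame to an in-memory buffer instead of `downloads/titanic/cleaned.csv`, so it runs without touching the file system; the same checks would apply to the real file.

```python
import io
import pandas as pd

# Toy stand-in for new_titanic (values are made up for illustration)
cleaned = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# Write without the index, as in the cell above, then read the text back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should have the same columns and no missing values
print(list(reloaded.columns), reloaded.isna().sum().sum())
```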