# Titanic - Machine Learning for Disaster
## Kaggle Competition
### Tyler J Simpson

### Classification Problem
#### Predict Survival - Yes/No
#### Utilizing Ken Jee's guide as a template https://www.youtube.com/watch?v=I3FBJdiExcg&t=576s&ab_channel=KenJee

### Import Packages

In [128]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #added to base packages
import matplotlib.pyplot as plt #added to base packages
from sklearn.preprocessing import MinMaxScaler #added to base packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Load Data

In [129]:
training_data = pd.read_csv('/kaggle/input/titanic/train.csv') #training data
test_data = pd.read_csv('/kaggle/input/titanic/test.csv') #test data

#flag the source of each row before combining, so the flag carries into all_data
training_data['train_test'] = 1
test_data['train_test'] = 0
test_data['Survived'] = np.nan #np.NaN alias was removed in NumPy 2.0
all_data = pd.concat([training_data, test_data])

%matplotlib inline
all_data.columns

### Outline

1. Describe data (EDA)
2. Numerical discovery (EDA) -- Boxplots, histograms, correlation
3. Categorical discovery (EDA) -- Value counts
4. Feature engineering
5. Preprocess data
6. Model building

### 1. Describe Data

In [130]:
training_data.head()

In [131]:
#training_data.describe().columns
training_data.info()
#note age and especially cabin have many nulls

In [132]:
training_data.describe()

In [133]:
#split between numerical and categorical values
df_num = training_data[['Age','SibSp','Parch','Fare']] #histogram
df_cat = training_data[['Survived','Pclass','Sex','Ticket','Cabin','Embarked']] #value counts

### 2. Numerical Discovery

In [134]:
#histogram for all numerical features
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()

#### Fare has an especially large range and may need to be normalized

### Normalize all numerical data

In [135]:
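scaler = MinMaxScaler()
df_num = df_num.copy() #work on a copy so the assignment below does not trigger SettingWithCopyWarning
df_num[['Age','SibSp','Parch','Fare']] = scaler.fit_transform(df_num[['Age','SibSp','Parch','Fare']]) #normalize numeric columns
print(df_num)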
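
For reference, min-max scaling maps each column onto [0, 1] using (x - min) / (max - min). The following is a minimal sketch of that formula in plain pandas rather than `MinMaxScaler`; the `toy` frame and its values are made up for illustration and are not Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), including one missing age
toy = pd.DataFrame({'Age': [22.0, np.nan, 40.0], 'Fare': [10.0, 60.0, 110.0]})

# Min-max scaling: (x - min) / (max - min), computed column by column.
# pandas skips NaN when computing min and max, so missing values stay missing.
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_scaled)
```

Because the transform is linear, it changes the axis range of each histogram but not its shape, and it leaves correlations between columns unchanged.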

In [136]:
#histogram for all scaled numerical features
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()

In [137]:
#correlation of numerical data
print(df_num.corr())
sns.heatmap(df_num.corr())

#### SibSp and Parch are the most strongly correlated pair of numerical features

In [138]:
#Compare survival rates across numerical data
pd.pivot_table(training_data, index = 'Survived', values = ['Age','SibSp','Parch','Fare'])

#### Age seems to play a minimal role, though younger passengers appear to survive more often
#### Fare: passengers who paid more seem more likely to survive
#### Parch: traveling with parents/children seems to improve survival rate
#### SibSp: traveling with siblings/spouses seems to lower survival rate
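
Group means can hide effects that are not linear, which may be the case for age. One hedged way to look closer is to bin the feature and compute the survival rate per bin. The sketch below uses made-up rows and arbitrary bin edges; `toy`, `AgeBand`, and the band labels are illustrative names, not part of the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    'Age':      [4, 9, 25, 31, 47, 62],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Arbitrary bin edges chosen for illustration only: (0, 12], (12, 40], (40, 80]
toy['AgeBand'] = pd.cut(toy['Age'], bins=[0, 12, 40, 80],
                        labels=['child', 'adult', 'older'])

# The mean of a 0/1 column is the survival rate within each band
rate_by_band = toy.groupby('AgeBand', observed=True)['Survived'].mean()
print(rate_by_band)
```

Applied to `training_data`, the same pattern would need a decision about how to treat rows with a missing `Age`, since `pd.cut` leaves them unbinned.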

### 3. Categorical Discovery

In [139]:
df_cat.head()

In [140]:
df_cat.dtypes

In [142]:
for i in df_cat.columns:
    counts = df_cat[i].value_counts()
    sns.barplot(x=counts.index, y=counts.values).set_title(i) #x and y must be keyword arguments in seaborn 0.12+
    plt.show()

#### Most people did not survive
#### Most people were in 3rd class (Pclass 3)
#### Most people were male
#### Most people came from Southampton

In [144]:
#Compare survival rates across categorical data
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Pclass', values = 'Ticket', aggfunc = 'count'))
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Sex', values = 'Ticket', aggfunc = 'count'))
print(pd.pivot_table(training_data, index = 'Survived', columns = 'Embarked', values = 'Ticket', aggfunc = 'count'))

#### Only 1st class passengers were more likely to survive than not
#### Women had a much higher survival rate than men
#### Port of embarkation seemed to have minimal effect
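
The pivot tables above report counts, so comparing groups of different sizes takes some mental arithmetic. A minimal sketch of one way to get rates directly is `pd.crosstab` with `normalize='columns'`; the `toy` frame below is made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})

# normalize='columns' divides each column by its total, so each cell is
# the share of that group that did (row 1) or did not (row 0) survive
rates = pd.crosstab(toy['Survived'], toy['Sex'], normalize='columns')
print(rates)
```

The same call with `training_data['Survived']` and `training_data['Pclass']`, `training_data['Sex']`, or `training_data['Embarked']` should give the per-group survival shares for the real data.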

### 4. Feature Engineering

In [161]:
training_data.head()

### Cabin Feature Engineering

#### Cabin has too many unique values and needs to be engineered to extract useful features

In [171]:
training_data.Cabin.head(20)

In [170]:
#check to see how many people had multiple cabins
training_data['Cabin_Mult'] = training_data.Cabin.apply(lambda x:0 if pd.isna(x) else len(x.split(' ')))
training_data['Cabin_Mult'].value_counts()

In [155]:
#Compare survival rates across cabin multiple count
pd.pivot_table(training_data, index = 'Survived', columns = 'Cabin_Mult', values = 'Ticket', aggfunc = 'count')

In [158]:
#create categories based on the cabin letter ('n' = null, because str(NaN) is 'nan')
training_data['Cabin_Letter'] = training_data.Cabin.apply(lambda x: str(x)[0])
print(training_data.Cabin_Letter.value_counts())

In [160]:
#Compare survival rates by cabin letter
pd.pivot_table(training_data, index = 'Survived', columns = 'Cabin_Letter', values = 'Name', aggfunc = 'count')

#### People with a cabin letter had a higher survival rate than those without one
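
Since the main signal seems to be whether a cabin was recorded at all, a binary flag may capture most of it. The sketch below shows one possible encoding on made-up values; `Has_Cabin` is a hypothetical column name, and the `'n'` placeholder follows the `Cabin_Letter` convention used above.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'B57 B59 B63', np.nan]})

# 1 if any cabin was recorded, 0 otherwise
toy['Has_Cabin'] = toy['Cabin'].notna().astype(int)

# Deck letter, with 'n' standing in for a missing cabin
toy['Cabin_Letter'] = toy['Cabin'].fillna('n').str[0]
print(toy)
```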

### Ticket Feature Engineering

In [169]:
training_data.Ticket.head(20)

In [163]:
#Split based on if there is text in the ticket
training_data['Ticket_Num'] = training_data.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
training_data['Ticket_Num'].value_counts()

In [165]:
#Observe trends in the text of the ticket
training_data['Ticket_Txt'] = training_data.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() 
                                                         if len(x.split(' ')[:-1]) >0 else 0)
training_data['Ticket_Txt'].value_counts()

#### Letter conventions in the ticket don't seem to be common enough to include
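
The prefix cleaning in the lambda above can be easier to follow as a named function tested on a few strings. This is a sketch of the same steps, with one deliberate difference: the hypothetical helper `ticket_prefix` returns an empty string instead of `0` when there is no prefix, so the column holds a single type.

```python
def ticket_prefix(ticket):
    """Return the cleaned text prefix of a ticket, or '' if it has none."""
    parts = ticket.split(' ')[:-1]  # everything before the last space-separated token
    return ''.join(parts).replace('.', '').replace('/', '').lower()

# Example ticket strings in the style of the dataset
for t in ['A/5 21171', 'STON/O2. 3101282', '113803']:
    print(repr(t), '->', repr(ticket_prefix(t)))
```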

In [167]:
#Compare survival rates by text present in ticket
pd.pivot_table(training_data, index = 'Survived', columns = 'Ticket_Num', values = 'Ticket', aggfunc = 'count')

#### Survival rates are similar whether or not the ticket contains text

### Name Feature Engineering

In [168]:
training_data.Name.head(20)

In [177]:
#Extract each person's title
#take the text after the comma and before the first period, then strip whitespace
training_data['Name_Title'] = training_data.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
training_data['Name_Title'].value_counts()
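
As an alternative to the chained `split` calls, the title can be pulled out with a regular expression through pandas' vectorized string methods. This is a minimal sketch on three sample names; the pattern is an assumption that every name follows the "Surname, Title. Given names" layout.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Unlike the `split` version, which raises an `IndexError` on a name without a comma, `str.extract` returns `NaN` for rows that do not match the pattern.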