# Titanic Kaggle Competition V3 - Possibly Random Forest

#### Predict the survivors of the Titanic disaster given personal information. Key predictor variables are the age, gender, family size, title, port of embarkation, passenger class, and fare paid by each passenger. The predicted variable is binary: whether the passenger survived or not.

#### The train.csv file provides the data for training the model predictions and the test.csv provides the data for testing the model.

## The following notebook is organized as follows:
##### 1. Library imports and file imports
##### 2. EDA, Missing Data Resolution, Feature Creation
##### 3. Feature Scaling and Normalization
##### 4. Model Training
##### 5. Model Testing
##### 6. Model Validation on Holdout Data Set

### 1. Library imports and file imports

In [4]:
import numpy as np
import pandas as pd
import plotly as plt
import plotly.express as px


In [19]:
#import data

train_df = pd.read_csv(r'train.csv')

test_df = pd.read_csv(r'test.csv')


### 2. EDA, Missing Data Resolution, Feature Creation

In [21]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### The SibSp and Parch values may be worth combining into a single family-size feature, as several online sources suggest.
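
#### A minimal sketch of how SibSp and Parch could be combined. The column names `FamilySize` and `IsAlone` are assumptions for illustration, and `demo_df` is a tiny stand-in for `train_df`, not the real data.

```python
import pandas as pd

# Tiny stand-in for train_df; only the two columns needed here.
demo_df = pd.DataFrame({"SibSp": [1, 1, 0, 0], "Parch": [0, 2, 0, 1]})

# FamilySize (hypothetical name): siblings/spouses + parents/children + the passenger.
demo_df["FamilySize"] = demo_df["SibSp"] + demo_df["Parch"] + 1

# IsAlone (hypothetical name): 1 when the passenger has no family aboard.
demo_df["IsAlone"] = (demo_df["FamilySize"] == 1).astype(int)

print(demo_df)
```

#### The same two assignments could be applied to `train_df` directly, and the result compared against survival rate with the groupby pattern used elsewhere in this notebook.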

In [22]:
train_df.info()

print("-------------COMPARISON OF TRAIN (ABOVE) AND TEST DATA (BELOW)------------")

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
-------------COMPARISON OF TRAIN (ABOVE) AND TEST DATA (BELOW)------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       ----

#### Name, Sex, Cabin, and Embarked will have to be encoded as categorical data if they turn out to be usable.
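
#### One possible way to encode the low-cardinality columns is one-hot encoding with `pd.get_dummies`. This is only a sketch on a made-up `demo_df`. Name and Cabin have too many unique values to encode directly and would need a derived feature first.

```python
import pandas as pd

# Made-up rows in the same format as the Sex and Embarked columns.
demo_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded_df = pd.get_dummies(demo_df, columns=["Sex", "Embarked"], dtype=int)

print(encoded_df)
```

#### For tree-based models a single integer code per category may work just as well; one-hot encoding is shown here because it makes no assumption about category order.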

In [23]:
train_df.Cabin.unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

#### The 'Age', 'Cabin', and 'Embarked' columns are missing data that will have to be accounted for ('Embarked' is missing only 2 values). I believe that multiple imputation may be best for 'Age'. 'Cabin' may be unusable given the erratic nature of the labels, although it may be possible to pull the deck letter from each cabin. Filling with the mode may be easiest for 'Embarked', given that there are very few missing values.
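
#### A rough sketch of the two simpler ideas above: pulling a deck letter out of Cabin and filling Embarked with its mode. The `Deck` column name and the "U" (unknown) placeholder are assumptions, and `demo_df` is a small made-up sample using cabin labels of the kind listed above.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    "Cabin": [np.nan, "C85", "F G73", "B57 B59 B63 B66", "T"],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Deck (hypothetical name): first character of the cabin string.
# Multi-cabin entries such as "F G73" simply keep their first letter.
# Missing cabins become "U" for unknown.
demo_df["Deck"] = demo_df["Cabin"].str[0].fillna("U")

# Fill the few missing Embarked values with the most common port.
embarked_mode = demo_df["Embarked"].mode()[0]
demo_df["Embarked"] = demo_df["Embarked"].fillna(embarked_mode)

print(demo_df)
```

#### In keeping with not training on the test set, the mode would be computed on the training data only and then reused for any missing test values.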

In [24]:
train_df['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

#### From the Survived target variable we can see that our classification model should predict roughly 62% deaths and 38% survivors on the test data, assuming the test and training sets are equally representative samples.

## Pclass Data

In [54]:
# Found a useful groupby function online for quick comparisons and examples on Titanic data set from Kaggle user ZlatanKremonic

print("Unique values in Passenger Class data set:", train_df.Pclass.unique())

train_df['Survived'].groupby(train_df['Pclass']).mean()



Unique values in Passenger Class data set: [3 1 2]


Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

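#### We can see that survival rate rises with passenger class: about 63% of first-class passengers survived, versus 47% in second class and 24% in third. Because Pclass is coded 1 to 3, the numeric correlation with survival is negative. We will treat passenger class as a categorical variable with no adjustments, given that it has no null values.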

## Name Data

#### Splitting out the titles of the passengers to see how useful the new variable is. This is done frequently by others online, and the length of the name is also compared with survival rate. Some combined the test set with the training set to gain additional titles and name lengths, but I am attempting to keep this process at least somewhat representative of a "real world" application, in which a test set would not be used to train the model at all.

In [62]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [90]:
train_df['Title'] = train_df['Name'].str.split(r'[,.]').str.get(1)

In [96]:
train_df.Survived.groupby(train_df.Title).mean()

Title
 Capt            0.000000
 Col             0.500000
 Don             0.000000
 Dr              0.428571
 Jonkheer        0.000000
 Lady            1.000000
 Major           0.500000
 Master          0.575000
 Miss            0.697802
 Mlle            1.000000
 Mme             1.000000
 Mr              0.156673
 Mrs             0.792000
 Ms              1.000000
 Rev             0.000000
 Sir             1.000000
 the Countess    1.000000
Name: Survived, dtype: float64

#### Given that the actual survival rate on the Titanic was roughly 20% for men and 80% for women, this Title variable matches real-world expectations closely. The higher survival rates for "wealthier sounding" titles may also help differentiate the target variable better than the Gender variable alone.
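
#### Note that the titles in the groupby output above carry a leading space (for example " Mr"), which could cause silent mismatches when encoding or comparing strings. The sketch below shows one alternative extraction that avoids this. The regular expression is my own assumption, not the split used in the cell above, and `names` holds a few illustrative strings in the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the first comma and the following period,
# then strip any remaining surrounding whitespace.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

#### Appending `.str.strip()` to the existing split-based cell should give the same clean titles without changing the approach.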

In [107]:
# Get name length and check if it correlates to survival rate

train_df['Name_Length'] = train_df.Name.str.len()

In [120]:
train_df.Name_Length.value_counts()

19    64
25    55
27    50
18    50
26    49
28    43
24    43
17    42
21    40
23    39
20    39
22    38
30    37
29    32
31    30
16    26
32    23
33    22
15    15
47    11
37    10
38     9
36     9
39     9
45     9
44     8
41     8
34     7
46     7
40     7
51     7
35     6
43     5
42     5
49     5
50     4
52     4
56     3
14     3
48     3
13     2
12     2
53     2
55     2
57     2
67     1
54     1
61     1
65     1
82     1
Name: Name_Length, dtype: int64

In [122]:
#Checking possible error in longest name at 82 characters; seems to be a married Spanish surname combination, OK
train_df.loc[train_df['Name_Length'] == 82]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Name_Length
307,308,1,1,"Penasco y Castellana, Mrs. Victor de Satode (M...",female,17.0,1,0,PC 17758,108.9,C65,C,Mrs,82


#### Splitting the name lengths into quartiles to check correlation with survival rate

In [142]:
train_df['Survived'].groupby(pd.qcut(train_df['Name_Length'],4)).mean()

Name_Length
(11.999, 20.0]    0.230453
(20.0, 25.0]      0.325581
(25.0, 30.0]      0.364929
(30.0, 82.0]      0.626126
Name: Survived, dtype: float64

In [143]:
pd.qcut(train_df['Name_Length'],4).value_counts()

(11.999, 20.0]    243
(30.0, 82.0]      222
(20.0, 25.0]      215
(25.0, 30.0]      211
Name: Name_Length, dtype: int64
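
#### If the name-length quartiles end up as a feature, the bin edges would need to come from the training data only, in line with not using the test set for training. A minimal sketch, assuming made-up length values: `pd.qcut(..., retbins=True)` returns the edges, and `pd.cut` reuses them on new data. The outer edges are widened to infinity so that unseen lengths outside the training range still land in a bin.

```python
import numpy as np
import pandas as pd

# Made-up name lengths standing in for the train and test columns.
train_lengths = pd.Series([12, 15, 18, 20, 22, 25, 27, 30, 35, 82])
test_lengths = pd.Series([14, 26, 40])

# Learn the quartile edges from the training values only.
_, bin_edges = pd.qcut(train_lengths, 4, retbins=True)

# Keep the inner edges and open up both ends.
open_edges = [-np.inf, *bin_edges[1:-1], np.inf]

# labels=False returns the integer bin index (0 to 3) for each value.
train_bins = pd.cut(train_lengths, bins=open_edges, labels=False)
test_bins = pd.cut(test_lengths, bins=open_edges, labels=False)

print(test_bins.tolist())
```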

## Gender Data

#### Looking at the gender data in comparison to average survival rate. This will be used as a categorical variable and encoded later. We already know from the dataframe info that there are no nulls to deal with.

In [140]:
train_df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [141]:
train_df['Survived'].groupby(train_df['Sex']).mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

#### From our work with titles and online research we can see that the training set has slightly lower average survival rates for both females (74%) and males (19%) than the roughly 80% and 20% rates cited for the actual Titanic event.

## Age Data

#### We will have to address the large number of missing values in the Age variable (177 of 891 rows). I believe that multiple imputation by chained equations may be the most accurate, given that we have a large number of other variables to iterate through in order to arrive at a "best guess" age from each passenger's other data points.
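
#### A hedged sketch of what chained-equation imputation could look like with scikit-learn's `IterativeImputer`, which is still marked experimental and must be enabled explicitly. The choice of predictor columns and the values in `demo_df` are made up for illustration. With default settings this produces a single imputed value per missing cell; true multiple imputation would repeat the fit with `sample_posterior=True` and different random seeds.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Made-up numeric rows; Age has two missing values to fill.
demo_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 51.86, 21.08],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 27.0, 54.0, np.nan],
})

# Each column with missing values is modelled from the other columns in turn.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_values = imputer.fit_transform(demo_df)

# fit_transform returns a NumPy array, so rebuild the DataFrame.
imputed_df = pd.DataFrame(imputed_values, columns=demo_df.columns, index=demo_df.index)

print(imputed_df["Age"])
```

#### The imputer would be fit on the training data and then applied to the test data with `imputer.transform`, so that no test information leaks into the fitted model.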

In [151]:
train_df['Survived'].groupby(pd.qcut(train_df['Age'],12)).mean()

Age
(0.419, 9.0]      0.612903
(9.0, 18.0]       0.415584
(18.0, 20.125]    0.300000
(20.125, 23.0]    0.313433
(23.0, 25.0]      0.381818
(25.0, 28.0]      0.393443
(28.0, 31.0]      0.393939
(31.0, 34.0]      0.440000
(34.0, 38.0]      0.474576
(38.0, 44.0]      0.370968
(44.0, 51.0]      0.396552
(51.0, 80.0]      0.350877
Name: Survived, dtype: float64

#### We can see that survival is noticeably higher for passengers aged 9 and under (about 61%). Across the other age bins the survival rate is fairly even, staying between 30% and 47%, with a modest peak for passengers between the ages of 31 and 38 (44% to 47%).

## SibSp

In [None]:
#### Given that 