# Titanic Kaggle Competition V3

#### Predict the survivors of the Titanic disaster given personal information. Key predictor variables are the age, gender, family size, title, embarkment location, passenger class, and fare price paid by the passenger. The predicted variable is a binary survival/not survival of the passenger.

#### The train.csv file provides the data for training the model predictions and the test.csv provides the data for testing the model.

## The following notebook is organized as follows:
#### 1. Library imports and file imports
#### 2. EDA 
#### 3. Feature Creation 
#### 4. Missing Values Resolution
#### 5. Variable Scaling and Normalization
#### 6. Model Training
#### 7. Model Testing
#### 8. Model Validation on Test Data Provided

# 1. Library imports and file imports

In [1]:
import numpy as np
import pandas as pd
import plotly as plt
import plotly.express as px


In [2]:
#import data

train_df = pd.read_csv(r'train.csv')

test_df = pd.read_csv(r'test.csv')


# 2. EDA

In [3]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Several online sources suggest combining the SibSp and Parch values into a single family-size variable, which is explored later in this section.

In [4]:
train_df.info()

print("-------------COMPARISON OF TRAIN (ABOVE) AND TEST DATA (BELOW)------------")

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
-------------COMPARISON OF TRAIN (ABOVE) AND TEST DATA (BELOW)------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       ----

#### Name, Sex, Cabin, Embarked will have to be converted to encoded categorical data if usable

In [5]:
train_df.Cabin.unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

#### The 'Age', 'Cabin' and 'Embarked' (almost none) variables are missing data that will have to be accounted for. I believe that multiple imputation may be best for 'Age', while 'Cabin' may be unusable given the erratic nature of the labels (possibly pull decks from cabins?), and using something like the mode may be easiest for 'Embarked' given that there are very few missing values.
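
#### The missing-value picture described above can be summarized in one small table. This is a minimal sketch: `missing_summary` and the `demo` frame are hypothetical names used for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "share_missing": df.isnull().mean(),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Tiny stand-in frame; on the real data the call would be missing_summary(train_df)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})
summary = missing_summary(demo)
print(summary)
```

#### Seeing the share of missing values next to the count makes it easier to decide between imputing a column and dropping it.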

In [6]:
train_df['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

#### From the Survived target variable we can see that a classification model should predict roughly 62% deaths and 38% survivors on the test data, assuming the test and training sets are equally representative samples.
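
#### The class balance above also gives a simple baseline: a model that always predicts the majority class. A minimal sketch, where `majority_baseline` is a hypothetical helper and the label counts (549 and 342) are reconstructed from the proportions shown above and the 891 training rows.

```python
import pandas as pd


def majority_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return labels.value_counts(normalize=True).max()


# Stand-in labels with the same 62/38 split; on the real data use train_df['Survived']
demo_labels = pd.Series([0] * 549 + [1] * 342)
baseline = majority_baseline(demo_labels)
print(round(baseline, 6))
```

#### Any tuned model should beat this figure by a clear margin to be worth keeping.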

## PClass Data

In [7]:
# Found a useful groupby function online for quick comparisons and examples on Titanic data set from Kaggle user ZlatanKremonic

print("Unique values in Passenger Class data set:", train_df.Pclass.unique())

train_df['Survived'].groupby(train_df['Pclass']).mean()



Unique values in Passenger Class data set: [3 1 2]


Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

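#### We can see that survival rate rises with passenger class: about 63% of 1st-class passengers survived versus 24% of 3rd-class passengers. Because the class number runs the other way (1 is the highest class), the numeric Pclass value is negatively correlated with survival. We will treat passenger class as a categorical variable with no adjustments, given that it has no null values.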

## Name Data

#### Splitting out the titles of the passengers to see the usefulness of the new variable. This is done frequently by others online, and the length of the name is also compared with survival rate online. Some combined the test set with the training data to gain additional titles/name lengths, but I am attempting to keep this process at least somewhat representative of a "real world" application, in which a test set would not be used to train the model at all.

In [8]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
train_df['Title'] = train_df['Name'].str.split(r'[,.]').str.get(1)

In [10]:
train_df.Survived.groupby(train_df.Title).mean()

Title
 Capt            0.000000
 Col             0.500000
 Don             0.000000
 Dr              0.428571
 Jonkheer        0.000000
 Lady            1.000000
 Major           0.500000
 Master          0.575000
 Miss            0.697802
 Mlle            1.000000
 Mme             1.000000
 Mr              0.156673
 Mrs             0.792000
 Ms              1.000000
 Rev             0.000000
 Sir             1.000000
 the Countess    1.000000
Name: Survived, dtype: float64

#### Given that the actual survival rates on the Titanic were roughly 20% for men and 80% for women, this title variable matches real-world expectations closely. The higher survival rates for "wealthier sounding" titles may also help differentiate the target variable better than the gender variable alone.
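
#### The title labels in the output above carry a leading space (for example ' Mr'), which is a side effect of splitting on the comma. A minimal alternative sketch that extracts the title with a regular expression and strips the whitespace; the `names` series is an illustrative sample, not the full data.

```python
import pandas as pd

# Illustrative names in the same "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period, then trim whitespace
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

#### Clean labels would also give tidier dummy column names later on (for example 'Title_Mr' rather than 'Title_ Mr').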

In [11]:
# Get name length and check if it correlates to survival rate

train_df['Name_Length'] = train_df.Name.str.len()

In [12]:
train_df.Name_Length.value_counts()

19    64
25    55
27    50
18    50
26    49
28    43
24    43
17    42
21    40
23    39
20    39
22    38
30    37
29    32
31    30
16    26
32    23
33    22
15    15
47    11
37    10
38     9
36     9
39     9
45     9
44     8
41     8
34     7
46     7
40     7
51     7
35     6
43     5
42     5
49     5
50     4
52     4
56     3
14     3
48     3
13     2
12     2
53     2
55     2
57     2
67     1
54     1
61     1
65     1
82     1
Name: Name_Length, dtype: int64

In [13]:
#Checking possible error in longest name at 82 characters; seems like a married Spanish surname combination, OK
train_df.loc[train_df['Name_Length'] == 82]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Name_Length
307,308,1,1,"Penasco y Castellana, Mrs. Victor de Satode (M...",female,17.0,1,0,PC 17758,108.9,C65,C,Mrs,82


#### Splitting the name lengths into quartiles to check correlation with survival rate

In [14]:
train_df['Survived'].groupby(pd.qcut(train_df['Name_Length'],4)).mean()

Name_Length
(11.999, 20.0]    0.230453
(20.0, 25.0]      0.325581
(25.0, 30.0]      0.364929
(30.0, 82.0]      0.626126
Name: Survived, dtype: float64

In [15]:
pd.qcut(train_df['Name_Length'],4).value_counts()

(11.999, 20.0]    243
(30.0, 82.0]      222
(20.0, 25.0]      215
(25.0, 30.0]      211
Name: Name_Length, dtype: int64

## Gender Data

#### Looking at the gender data in comparison to average survival rate, this will be used as a categorical variable and encoded later. We already know from the dataframe info that there are no nulls that need to be dealt with.

In [16]:
train_df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [17]:
train_df['Survived'].groupby(train_df['Sex']).mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

#### From our work with titles and online research we can see that the training set has a slightly lower average survival rate for females (74%) and a slightly lower average survival rate for males (19%) than the survival rates by gender cited above for the actual Titanic event (roughly 80% and 20%).

## Age Data

#### We will have to address the large number of missing values in the Age variable. I believe that multiple imputation by chained equations may be the most accurate approach, given that we have a large number of other variables to iterate through in order to arrive at the "best guess" age from the passenger's other data points.

In [18]:
train_df['Survived'].groupby(pd.qcut(train_df['Age'],12)).mean()

Age
(0.419, 9.0]      0.612903
(9.0, 18.0]       0.415584
(18.0, 20.125]    0.300000
(20.125, 23.0]    0.313433
(23.0, 25.0]      0.381818
(25.0, 28.0]      0.393443
(28.0, 31.0]      0.393939
(31.0, 34.0]      0.440000
(34.0, 38.0]      0.474576
(38.0, 44.0]      0.370968
(44.0, 51.0]      0.396552
(51.0, 80.0]      0.350877
Name: Survived, dtype: float64

#### We can see that survival is far more likely for passengers aged 9 and under (61%) than for any other age group. Across the remaining age bands the survival rate is fairly even, with a modest peak for passengers between the ages of 31 and 38 (44% to 47%).

## SibSp and Parch Data

In [19]:
train_df['Survived'].groupby(train_df['SibSp']).mean()

SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000
Name: Survived, dtype: float64

#### Given that there is not a clear correlation between the SibSp variable and survival, we can look at combining the variables (SibSp and Parch) to see if it correlates well to survival.

In [20]:
train_df['Fam_Size'] = train_df.SibSp + train_df.Parch

In [21]:
train_df['Survived'].groupby(train_df['Fam_Size']).mean()

Fam_Size
0     0.303538
1     0.552795
2     0.578431
3     0.724138
4     0.200000
5     0.136364
6     0.333333
7     0.000000
10    0.000000
Name: Survived, dtype: float64

#### From the grouping of average survival by family size, we can see that the combined variable (parents and children plus siblings and spouses) tracks survival more clearly than SibSp alone: the survival rate rises steadily for family sizes of zero to three, while anything over that is much lower and not significant (other than a family size of 6).
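
#### Survival rates for the larger family sizes rest on few passengers, so it helps to read the rate and the group size side by side. A minimal sketch on a made-up `demo` frame; on the real data the same `agg` call would run on `train_df`.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Fam_Size": [0, 0, 0, 0, 1, 1, 3, 6],
    "Survived": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Named aggregation: mean survival rate and number of passengers per family size
fam_summary = demo.groupby("Fam_Size")["Survived"].agg(rate="mean", n="count")
print(fam_summary)
```

#### A rate computed from a handful of rows (such as a single family) should be weighed accordingly.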

In [22]:
train_df.Fam_Size.value_counts()

0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: Fam_Size, dtype: int64

## Ticket Data

In [23]:
train_df.Ticket.head(5)

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

#### Considering the variability in the ticket variable, it may not be worth using without changes. Online sources do reference a correlation between ticket string length and survival, so that is checked first.

In [24]:
train_df['Ticket_Length'] = train_df.Ticket.str.len()

In [25]:
train_df.Survived.groupby(train_df.Ticket_Length).mean()

Ticket_Length
3     0.000000
4     0.366337
5     0.618321
6     0.319809
7     0.296296
8     0.539474
9     0.192308
10    0.341463
11    0.250000
12    0.400000
13    0.400000
15    0.333333
16    0.272727
17    0.428571
18    0.000000
Name: Survived, dtype: float64

#### From the grouping we can see that the average survival rate was far better for some ticket lengths than for others. This may be due to the locations where the tickets were bought: one location may have had more women or children and also a different ticket length. Alternatively, the difference between eight-character and nine-character tickets may reflect a ticket naming scheme that the Titanic's parent company used for higher decks of the ship versus lower decks, giving a higher survival rate for some lengths and not others.

#### Will explore whether the inclusion of any letters or special characters in the ticket string has significance for the survival of the passenger.

In [26]:
train_df.Ticket.head()

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

In [27]:
# isdigit on column of Tickets to see True/False of whether or not they are only digits

train_df['Ticket_Char'] = train_df['Ticket'].str.isdigit()

In [28]:
# Convert True False from .isdigit() to integers

train_df['Ticket_Char'] = train_df.Ticket_Char.astype(int)

In [29]:
#Check to see the correlation between the tickets with only digits and survival rate

train_df['Survived'].groupby(train_df['Ticket_Char']).mean()

Ticket_Char
0    0.382609
1    0.384266
Name: Survived, dtype: float64

#### From the grouped results above we can see that there was not a meaningful difference in average survival rate between passengers with tickets made up solely of digits and passengers with tickets containing other characters. I will keep the variable in the data set in case the Random Forest can use it deep in a decision tree to aid further classification of the target variable (this logic was seen online and applied here).

#### We will also take the first character of each ticket to see if this provides any insight into survival. Online sources used this as an example of more meaningful extraction from the ticket variable.

In [30]:
train_df['Ticket_First_Char'] = train_df.Ticket.astype(str).str[0]

In [31]:
print(train_df['Ticket_First_Char'].value_counts())

print('--------------------------------------')

print(train_df['Survived'].groupby(train_df['Ticket_First_Char']).mean())

3    301
2    183
1    146
S     65
P     65
C     47
A     29
W     13
4     10
7      9
F      7
6      6
L      4
5      3
8      2
9      1
Name: Ticket_First_Char, dtype: int64
--------------------------------------
Ticket_First_Char
1    0.630137
2    0.464481
3    0.239203
4    0.200000
5    0.000000
6    0.166667
7    0.111111
8    0.000000
9    1.000000
A    0.068966
C    0.340426
F    0.571429
L    0.250000
P    0.646154
S    0.323077
W    0.153846
Name: Survived, dtype: float64


#### From the grouping above we can see that passengers whose ticket starts with 'C' were much more likely to survive than those whose ticket starts with 'W' (34% versus 15%). This may be due to passenger class or even embarkation location, but it gives us another variable to use in the random forest model.
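
#### One way to check whether the first ticket character mostly mirrors passenger class is a cross-tabulation. A minimal sketch on a few illustrative rows; the last ticket/class pair is a hypothetical example added for variety.

```python
import pandas as pd

# Illustrative rows; on the real data use train_df before Pclass is one-hot encoded
demo = pd.DataFrame({
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450", "PC 17758", "W./C. 6608"],
    "Pclass": [3, 1, 1, 3, 1, 3],
})
demo["Ticket_First_Char"] = demo["Ticket"].str[0]

# Rows: first ticket character, columns: passenger class, cells: passenger counts
table = pd.crosstab(demo["Ticket_First_Char"], demo["Pclass"])
print(table)
```

#### If a first character falls almost entirely within one class, it adds little beyond Pclass itself.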

## Fare Data

#### We will look at the fare price relative to the average survival rate of passengers. From the quick grouping run below, we can see that fare is positively correlated with survival.

In [32]:
train_df['Survived'].groupby(pd.qcut(train_df['Fare'],5)).mean()

Fare
(-0.001, 7.854]      0.217877
(7.854, 10.5]        0.201087
(10.5, 21.679]       0.424419
(21.679, 39.688]     0.444444
(39.688, 512.329]    0.642045
Name: Survived, dtype: float64

#### From the above we can see that the top quintile, with the highest ticket prices, had a far better average survival rate than passengers who paid the lowest fares. If passengers who paid lower fares were indeed in the bottom decks of the ship at the time of the incident, then it would make sense that they were less likely to survive.

## Cabin Data

#### A large amount of data is missing from the Cabin variable, and our earlier quick look at the Cabin data showed extreme differences in the strings of the cabin assignments, so we will have to explore the data further to glean usable information from it.

In [33]:
# Explore Cabin Data

train_df.Cabin.unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

#### The cabin letters are most likely Deck assignments and may be useful for us to use. We will split the Cabin strings and then use the letters as well as the cabin numbers to see if they tell us anything.

In [34]:
# Take the first character of each cabin string as the deck assignment; missing cabins stay NaN.

train_df['Cabin_Deck'] = train_df['Cabin'].str[0]

In [35]:
train_df['Survived'].groupby(train_df['Cabin_Deck']).mean()

Cabin_Deck
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
Name: Survived, dtype: float64

#### From the above we can see that the passengers with cabins on Decks (an assumption that was verified online) 'B', 'D', 'E' were much more likely to survive than passengers on Decks 'A' or 'G'.

In [36]:
# Everything after the deck letter is the cabin number part; missing cabins stay NaN.
train_df['Cabin_Number'] = train_df['Cabin'].str[1:]
train_df['Cabin_Number'].unique()

array([nan, '85', '123', '46', '6', '103', '56', '23 C25 C27', '78', '33',
       '30', '52', '28', '83', ' G73', '31', '5', '10 D12', '26', '110',
       '58 B60', '101', ' E69', '47', '86', '2', '19', '7', '49', '4',
       '32', '80', '36', '15', '93', '35', '87', '77', '67', '94', '125',
       '99', '118', '', '22 C26', '106', '65', '54', '57 B59 B63 B66',
       '34', '18', '124', '91', '40', '128', '37', '50', '82', '96 B98',
       '10', '44', '104', '111', '92', '38', '21', '12', '63', '14', '20',
       '79', '25', '73', '95', '39', '22', '70', '16', '68', '41', '9',
       '23', '48', '58', '126', '71', '51 B53 B55', ' G63', '62 C64',
       '24', '90', '45', '8', '121', '11', '3', '82 B84', '17', '102',
       '69', '42', '148'], dtype=object)

In [37]:
#Dealing with irregular additional cabins lumped together, splitting string to take only first cabin number of string, and converting to integer

train_df['Cabin_Number'] = train_df['Cabin_Number'].str.split(' ').str[0]

train_df['Cabin_Number'] = pd.to_numeric(train_df['Cabin_Number'], errors='coerce')

In [38]:
#Checking groupby of survived and cabin numbers

train_df['Survived'].groupby(pd.qcut(train_df['Cabin_Number'],7)).mean()

Cabin_Number
(1.999, 11.857]     0.714286
(11.857, 23.714]    0.714286
(23.714, 35.0]      0.741935
(35.0, 49.0]        0.607143
(49.0, 69.286]      0.680000
(69.286, 96.0]      0.612903
(96.0, 148.0]       0.680000
Name: Survived, dtype: float64

#### Given the above values we can see only modest differentiation in the average survival rate across groups of cabin numbers (roughly 61% to 74%).
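
#### The deck and cabin-number split above can also be done in a single regular expression. A minimal sketch on a handful of cabin strings of the kinds seen in the data; the pattern keeps the first deck letter and, when present, the digits that follow it directly.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "C23 C25 C27", "F G73", "D", "B57 B59 B63 B66"])

# Group 1: leading deck letter. Group 2 (optional): digits right after that letter.
parts = cabins.str.extract(r"^([A-Za-z])\s*(\d+)?")
parts.columns = ["Cabin_Deck", "Cabin_Number"]

# Convert the digit strings to numbers; rows without digits stay NaN
parts["Cabin_Number"] = pd.to_numeric(parts["Cabin_Number"])
print(parts)
```

#### As in the cells above, entries such as 'F G73' or 'D' end up with a deck but no cabin number.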

## Embarked Data

#### Looking at the port of embarkation, we can see at a quick glance that a passenger who departed from port C had a higher average rate of survival than a passenger who departed from port Q or S. This may be because some ports had more female or higher-class passengers than others.

In [39]:
train_df['Survived'].groupby(train_df['Embarked']).mean()

Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64

In [40]:
train_df.Embarked.isnull().sum()

2

#### Given that there are only two missing values, we will use the port that occurs the most to fill in the missing values later on. 
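
#### Filling with the most frequent port could look like the sketch below. It uses a small stand-in series; on the real data the same two lines would apply to `train_df['Embarked']`, and they would need to run before the column is one-hot encoded.

```python
import numpy as np
import pandas as pd

# Stand-in for train_df['Embarked'] with two gaps
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() ignores NaN and returns the most frequent value(s); take the first
most_common_port = embarked.mode().iloc[0]
embarked_filled = embarked.fillna(most_common_port)
print(most_common_port, embarked_filled.tolist())
```

#### Computing the mode on the training data only, and reusing that value for the test data, keeps the test set out of the fitting process.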

# 3. Feature Creation

#### Encoding categorical data BEFORE the imputation section that follows. This lets the imputation regressions draw on the encoded categorical variables as well, rather than only on the variables that are already numeric.


#### After research online I settled on the get_dummies function in pandas for categorical encoding. It creates a wider dataset, so I will check whether that causes a problem with the random forest computation time. Ideally it would be worth combining multiple encoding options and model types and averaging their predicted survival on the test set (shown by user 'Volha' on Kaggle).
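
#### One practical point with get_dummies: encoding the train and test sets separately can produce different columns when a category appears in only one of them. A minimal sketch of aligning the test columns to the training columns with `reindex`; the two small frames and the titles in them are illustrative.

```python
import pandas as pd

# Illustrative frames: 'Capt' appears only in train, 'Dona' only in test
train_demo = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Capt"]})
test_demo = pd.DataFrame({"Title": ["Mr", "Dona", "Miss"]})

train_enc = pd.get_dummies(train_demo, columns=["Title"])
test_enc = pd.get_dummies(test_demo, columns=["Title"])

# Keep exactly the training columns: add missing ones as 0, drop unseen ones
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

#### This keeps the feature layout identical to what the model saw during training, without using the test set to define the categories.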

# TODO: look at a MICE imputer that can automatically create binary indicator columns for missing values

In [41]:
#### Added columns to identify missing data points in variables (1 = value is missing, 0 = value is present)
#Age
train_df['Age_missing'] = train_df['Age'].isnull().astype(int)

#Embarked
train_df['Embarked_missing'] = train_df['Embarked'].isnull().astype(int)

#Cabin_Deck
train_df['Cabin_Deck_missing'] = train_df['Cabin_Deck'].isnull().astype(int)

#Cabin_Number
train_df['Cabin_Number_missing'] = train_df['Cabin_Number'].isnull().astype(int)


#### Will need to encode: 'Pclass', 'Sex', 'Embarked', 'Title', 'Ticket_First_Char', 'Cabin_Deck'.
#### I am not going back to bin 'Fam_Size' into groups, since the size of the family itself is the informative part of the label; it stays as an ordinal numeric variable. The dummies encoding is also not used for 'Fam_Size', 'Age', or 'Fare'.


In [42]:
# Testing the dummies function on the train_df nominal categorical variables.

train_df = pd.get_dummies(train_df, columns = ['Pclass', 'Sex', 'Embarked', 'Title', 'Ticket_First_Char', 'Cabin_Deck'],prefix_sep='_', drop_first = True)

train_df.head()

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Name_Length,...,Ticket_First_Char_P,Ticket_First_Char_S,Ticket_First_Char_W,Cabin_Deck_B,Cabin_Deck_C,Cabin_Deck_D,Cabin_Deck_E,Cabin_Deck_F,Cabin_Deck_G,Cabin_Deck_T
0,1,0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,23,...,0,0,0,0,0,0,0,0,0,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,51,...,1,0,0,0,1,0,0,0,0,0
2,3,1,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,22,...,0,1,0,0,0,0,0,0,0,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,44,...,0,0,0,0,1,0,0,0,0,0
4,5,0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,24,...,0,0,0,0,0,0,0,0,0,0


#### Need to drop the columns we are not using

In [43]:
# Drop PassengerID, Name, Ticket, Cabin columns

train_df = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# 4. Missing Value Resolution

In [44]:
#Check sum of nulls in data

train_df.isnull().sum()

Survived                  0
Age                     177
SibSp                     0
Parch                     0
Fare                      0
Name_Length               0
Fam_Size                  0
Ticket_Length             0
Ticket_Char               0
Cabin_Number            695
Age_missing               0
Embarked_missing          0
Cabin_Deck_missing        0
Cabin_Number_missing      0
Pclass_2                  0
Pclass_3                  0
Sex_male                  0
Embarked_Q                0
Embarked_S                0
Title_ Col                0
Title_ Don                0
Title_ Dr                 0
Title_ Jonkheer           0
Title_ Lady               0
Title_ Major              0
Title_ Master             0
Title_ Miss               0
Title_ Mlle               0
Title_ Mme                0
Title_ Mr                 0
Title_ Mrs                0
Title_ Ms                 0
Title_ Rev                0
Title_ Sir                0
Title_ the Countess       0
Ticket_First_Char_2 

### Given the list of nulls above we will deal with the missing data in the following ways:

### Age: 
#### MICE: Age appears to be an important variable, given that we previously saw passengers aged 9 and under were far more likely to survive than others. Multiple imputation by chained equations (MICE) may be the best choice, since it can provide a better estimate of a missing age if other young passengers can be "identified" from the remaining variables in the data.

### Cabin_Number: 
#### Imputation using MICE as well. I will look at running the Random Forest with and without these variables to see if results improve.

### Age and Cabin_Number Missing Values

In [45]:
# import mice and apply
from fancyimpute import IterativeImputer

Using TensorFlow backend.


In [46]:
imputer = IterativeImputer(random_state=42)

train_df_imputed = pd.DataFrame(imputer.fit_transform(train_df), columns = train_df.columns)

train_df = train_df_imputed
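
#### An unconstrained iterative imputer can produce values outside the plausible range (the summary statistics further down show a minimum Age of about -7.15). The sketch below is an alternative, not the notebook's method: it uses scikit-learn's own `IterativeImputer` with `min_value=0` to clip imputed values and `add_indicator=True` to append binary missing-value columns automatically. The `demo` frame and the `_was_missing` suffix are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Small made-up frame with gaps in Age only
demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 2.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 21.07, 51.86, 11.13],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
})

imputer = IterativeImputer(random_state=42, min_value=0, add_indicator=True)
values = imputer.fit_transform(demo)

# Indicator columns are appended only for features that had missing values at fit time
missing_cols = [f"{c}_was_missing" for c in demo.columns[imputer.indicator_.features_]]
demo_imputed = pd.DataFrame(values, columns=list(demo.columns) + missing_cols)
print(demo_imputed)
```

#### Note that `min_value=0` is applied to every feature here, which is harmless for this frame because no column can legitimately be negative.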

# 5. Model Training

## Splitting Test Train sets for Survived column

In [47]:
train_df.describe()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Name_Length,Fam_Size,Ticket_Length,Ticket_Char,Cabin_Number,...,Ticket_First_Char_P,Ticket_First_Char_S,Ticket_First_Char_W,Cabin_Deck_B,Cabin_Deck_C,Cabin_Deck_D,Cabin_Deck_E,Cabin_Deck_F,Cabin_Deck_G,Cabin_Deck_T
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,...,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,29.67107,0.523008,0.381594,32.204208,26.965208,0.904602,6.750842,0.741863,49.97614,...,0.072952,0.072952,0.01459,0.05275,0.066218,0.037037,0.035915,0.01459,0.004489,0.001122
std,0.486592,13.647529,1.102743,0.806057,49.693429,9.281607,1.613459,2.745515,0.437855,16.685253,...,0.260203,0.260203,0.119973,0.223659,0.248802,0.188959,0.186182,0.119973,0.06689,0.033501
min,0.0,-7.15165,0.0,0.0,0.0,12.0,0.0,3.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,21.0,0.0,0.0,7.9104,20.0,0.0,5.0,0.0,49.908437,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,29.0,0.0,0.0,14.4542,25.0,0.0,6.0,1.0,49.919823,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,36.75,1.0,0.0,31.0,30.0,1.0,7.0,1.0,49.952271,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,80.0,8.0,6.0,512.3292,82.0,10.0,18.0,1.0,148.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [48]:
#Split the labels ('Survived') from the features (the unnecessary PassengerId column was already dropped above). ALWAYS SPLIT DATA BEFORE NORMALIZATION TO AVOID DATA LEAKAGE (outside influences on train data)

X = train_df.drop(columns=['Survived'])
y = train_df['Survived']


In [49]:
# Train Test Split (set random state to 42 for answer to the universe)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=42)

y_train.unique()

array([0., 1.])
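
#### Because the classes are unbalanced (about 62% to 38%), one option is to stratify the split so both parts keep roughly the same class ratio. A minimal sketch on synthetic data; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`, and stratification is a suggestion rather than what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))       # made-up features
y_demo = np.array([0] * 62 + [1] * 38)   # made-up labels with a 62/38 split

# stratify=y_demo keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```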

## Hyperparameter Tuning

#### I will look at obtaining the best parameters for the random forest classifier using a grid search. I will look at improving computation time by possibly using a GPU later on.

In [50]:
#Import grid search and classifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [57]:
# Create the parameter grid
param_grid = {"max_depth" : [2, 4, 5, 10],
             "criterion"   : ["gini", "entropy"],
             "min_samples_leaf" : [1,2,3,4],
             "min_samples_split" : [2, 3, 4, 5, 6, 10, 12, 16],
             "n_estimators": [50,75, 100, 125, 150, 200]}


# Create a Random Forest model
# 'sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted by recent scikit-learn releases
rf = RandomForestClassifier(max_features = 'sqrt', oob_score = True, random_state = 42, n_jobs=-1)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 2)

In [58]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

{'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 75}
0.8440953412784399
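
#### To get a feel for the cost of this search, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. A minimal sketch that repeats the grid from the cell above under the hypothetical name `param_grid_demo`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above
param_grid_demo = {
    "max_depth": [2, 4, 5, 10],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 3, 4],
    "min_samples_split": [2, 3, 4, 5, 6, 10, 12, 16],
    "n_estimators": [50, 75, 100, 125, 150, 200],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 4 * 2 * 4 * 8 * 6
n_fits = n_candidates * 5                           # 5-fold cross-validation
print(n_candidates, n_fits)  # 1536 7680
```

#### With several thousand fits, a randomized search (scikit-learn's `RandomizedSearchCV`) over the same ranges may be a cheaper way to cut computation time before reaching for a GPU.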


## Given the above, we will fit a random forest with the best parameters and also fit a generic untuned model to compare the two.

In [80]:
# Fit parameters from grid search
#best parameters = {'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 75}
rf_base = RandomForestClassifier(oob_score = True)
rf_best = RandomForestClassifier(criterion = 'entropy', max_depth = 10, min_samples_leaf = 1, min_samples_split = 2, n_estimators = 75, max_features = 'sqrt', oob_score = True, random_state = 42, n_jobs=-1)

In [85]:
from sklearn.metrics import classification_report

# Fit best model and test base results and compare the two results
#base
rf_base.fit(X_train, y_train)

print("Detailed classification report:")
y_base_pred = rf_base.predict(X_test)

print(classification_report(y_test, y_base_pred))

print(rf_base.oob_score_)

print('========================================================================================================')
#best
rf_best.fit(X_train, y_train)

print("Detailed classification report:")
y_best_pred = rf_best.predict(X_test)

print(classification_report(y_test, y_best_pred))

print(rf_best.oob_score_)

Detailed classification report:
              precision    recall  f1-score   support

         0.0       0.87      0.89      0.88       105
         1.0       0.83      0.81      0.82        74

    accuracy                           0.85       179
   macro avg       0.85      0.85      0.85       179
weighted avg       0.85      0.85      0.85       179

0.8286516853932584
Detailed classification report:
              precision    recall  f1-score   support

         0.0       0.86      0.90      0.88       105
         1.0       0.84      0.80      0.82        74

    accuracy                           0.85       179
   macro avg       0.85      0.85      0.85       179
weighted avg       0.85      0.85      0.85       179

0.8300561797752809


In [82]:
# X_train no longer contains 'Survived', so all of its columns are used; skipping the first column would shift every label by one
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf_base.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]

Unnamed: 0,variable,importance
8,Cabin_Number,0.113182
28,Title_ Mr,0.106051
0,Age,0.104926
15,Sex_male,0.100959
3,Fare,0.091032
4,Name_Length,0.08804
29,Title_ Mrs,0.038739
6,Ticket_Length,0.037155
5,Fam_Size,0.036549
14,Pclass_3,0.03629


In [70]:
# X_train no longer contains 'Survived', so all of its columns are used; skipping the first column would shift every label by one
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf_best.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]

Unnamed: 0,variable,importance
28,Title_ Mr,0.129654
8,Cabin_Number,0.090005
3,Fare,0.088479
4,Name_Length,0.086771
15,Sex_male,0.085799
0,Age,0.079796
5,Fam_Size,0.042571
6,Ticket_Length,0.041163
14,Pclass_3,0.037044
25,Title_ Miss,0.036833

# The line of code below is particularly important, as Kaggle would rate the predictions as wrong if the Survived value is not of int data type
# submission.Survived = submission.Survived.astype(int)
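
#### Putting that conversion in context, a submission frame could be assembled as in the sketch below. The IDs and predictions are made-up values, and the file name in the commented-out line is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Made-up IDs and float predictions standing in for the real test set and model output
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# The imputation step turns everything into floats, so cast the labels back to int
submission["Survived"] = submission["Survived"].astype(int)
print(submission.dtypes)

# submission.to_csv("submission.csv", index=False)  # hypothetical output file name
```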