# **PREDICT SURVIVABILITY ON THE TITANIC (version 3)**
### This notebook is used to solve the question of survivabilty probability on the Titanic.
### This is a Kaggle challenge and test and train data is provided from Kaggle and public data.
###### Robert M. Taylor, PhD
###### 20190305

## **A. First, I need to import the packages I'll need and get the data...**

In [16]:
#for data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import sys
from unidecode import unidecode 

#for data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

**Importing Kaggle dataset**
Let's start by importing the training and testing datasets from the Kaggle Titanic competition. Before joining them together, let's add the column Survived filled with missing values to the testing dataset to avoid re-ordering of the columns after concatenation.

In [17]:
ktrain = pd.read_csv('/Users/rmtaylor/Desktop/School/Titanic ML Challenge/train.csv')
ktest = pd.read_csv('/Users/rmtaylor/Desktop/School/Titanic ML Challenge/test.csv')
ktest.insert(1, 'Survived', float('nan'))
kcombine = pd.concat([ktrain,ktest])

kcombine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Importing Wikipedia dataset**
- To get the Titanic passenger list on Wikipedia, we could web scrape it using the function pd.read_html(). 
- It returns all tables that are found on the web page as a list of dataframes. 
- Since there are 3 tables according to the passenger classes, we assign them to separate variables wiki1, wiki2, wiki3. 
- Don't forget to mention the parameter header=0 so that the first row is used as a header.

For convenience these dataframes were saved as csv files in case the Titanic passenger list on Wikipedia changes.

In [18]:
wiki1 = pd.read_html('https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic', header=0)[0]
wiki2 = pd.read_html('https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic', header=0)[1]
wiki3 = pd.read_html('https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic', header=0)[2]

wiki1.to_csv('wiki1.csv', index=False)
wiki2.to_csv('wiki2.csv', index=False)
wiki3.to_csv('wiki3.csv', index=False)

In [19]:
wiki1 = pd.read_csv('Titanic/wiki1.csv')
wiki2 = pd.read_csv('Titanic/wiki2.csv')
wiki3 = pd.read_csv('Titanic/wiki2.csv')

Let's have a look at the table for the 1st class (see below).

Note that after some of the names there are rows starting with "and <profession>, <Title> <Name> <Surname>". These rows correspond to the servants who were travelling with some of the families from the 1st class.

Sometimes rows have a reference. For example, [59] says:

Though their employers travelled in first class, this servant was given second class accommodations, as their services were not needed while their employers were on board.

Note also that Body contains some letters after the body number. These are references for the name of the ship that found the body. For example, "MB" means "Mackay-Bennett".

For the extended Titanic dataset we will leave everything as it is to preserve as much information as possible. Later in the notebook, however, we will do a thorough analysis.

In [20]:
wiki1.head()

Unnamed: 0,Name,Age,Hometown,Boarded,Destination,Lifeboat,Body
0,"Allen, Miss Elizabeth Walton",29,"St Louis, Missouri, US",Southampton,"St Louis, Missouri, US",2.0,
1,"Allison, Mr. Hudson Joshua Creighton",30,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,135MB
2,"and chauffeur, Mr. George Swane[59]",19,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,294MB
3,"and cook, Miss Amelia Mary ""Mildred"" Brown[59]",18,"London, England, UK",Southampton,"Montreal, Quebec, Canada",11.0,
4,"Allison, Mrs. Bessie Waldo (née Daniels)",25,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,


Let's also add the information about the passenger class to the dataframe wiki1. We know that this not always true (see the reference 59 above) but this can always be taken into account later.

In [21]:
wiki1['Pclass'] = 1
wiki1.head()

Unnamed: 0,Name,Age,Hometown,Boarded,Destination,Lifeboat,Body,Pclass
0,"Allen, Miss Elizabeth Walton",29,"St Louis, Missouri, US",Southampton,"St Louis, Missouri, US",2.0,,1
1,"Allison, Mr. Hudson Joshua Creighton",30,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,135MB,1
2,"and chauffeur, Mr. George Swane[59]",19,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,294MB,1
3,"and cook, Miss Amelia Mary ""Mildred"" Brown[59]",18,"London, England, UK",Southampton,"Montreal, Quebec, Canada",11.0,,1
4,"Allison, Mrs. Bessie Waldo (née Daniels)",25,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,,1


The table for the 2nd class (see below) has the same structure as the one for the 1st class.

Note, however, that Hometown sometimes doesn't have a town and only the country is mentioned. Also, countries are sometimes not consistent. For example, in the table for the 1st class "England" is followed by "UK" (see row 3 above) but in the table for the 2nd class (see rows 3-4 below) only "England" is mentioned. The country is also sometimes missed in the column Destination (see, for example, row 4 below).

We will add the Pclass feature to the wiki2 dataframe as well.

In [22]:
wiki2['Pclass'] = 2
wiki2.head()

Unnamed: 0,Name,Age,Hometown,Boarded,Destination,Lifeboat,Body,Pclass
0,"Abelson, Mr. Samuel",30,Russia,Cherbourg,"New York, New York, US",,,2
1,"Abelson, Mrs. Anna (née Wizosky?)",28,Russia,Cherbourg,"New York, New York, US",10.0,,2
2,"Andrew, Mr. Edgar Samuel",17,"San Ambrosio, Córdoba, Argentina",Southampton,"Trenton, New Jersey, US",,,2
3,"Andrew, Mr. Frank Thomas",30,"Redruth, Cornwall, England",Southampton,"Houghton, Michigan, US",,,2
4,"Angle, Mr. William A.",32,"Warwick, Warwickshire, England",Southampton,New York City,,,2


... and the same for wiki3

In [23]:
wiki3["Pclass"] = 3 
wiki3.head()

Unnamed: 0,Name,Age,Hometown,Boarded,Destination,Lifeboat,Body,Pclass
0,"Abelson, Mr. Samuel",30,Russia,Cherbourg,"New York, New York, US",,,3
1,"Abelson, Mrs. Anna (née Wizosky?)",28,Russia,Cherbourg,"New York, New York, US",10.0,,3
2,"Andrew, Mr. Edgar Samuel",17,"San Ambrosio, Córdoba, Argentina",Southampton,"Trenton, New Jersey, US",,,3
3,"Andrew, Mr. Frank Thomas",30,"Redruth, Cornwall, England",Southampton,"Houghton, Michigan, US",,,3
4,"Angle, Mr. William A.",32,"Warwick, Warwickshire, England",Southampton,New York City,,,3


All 3 tables have the same structure and can be concatenated.

Finally, we concatenate all 3 tables and for convenience reset the index using the parameter ignore_index=True.

In [24]:
wcombine = pd.concat([wiki1, wiki2, wiki3], ignore_index=True)
wcombine.head()

Unnamed: 0,Name,Age,Hometown,Boarded,Destination,Lifeboat,Body,Pclass
0,"Allen, Miss Elizabeth Walton",29,"St Louis, Missouri, US",Southampton,"St Louis, Missouri, US",2.0,,1
1,"Allison, Mr. Hudson Joshua Creighton",30,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,135MB,1
2,"and chauffeur, Mr. George Swane[59]",19,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,294MB,1
3,"and cook, Miss Amelia Mary ""Mildred"" Brown[59]",18,"London, England, UK",Southampton,"Montreal, Quebec, Canada",11.0,,1
4,"Allison, Mrs. Bessie Waldo (née Daniels)",25,"Montreal, Quebec, Canada",Southampton,"Montreal, Quebec, Canada",,,1


## Preparing Kaggle dataset for merge
In order to merge the Kaggle and Wikipedia datasets, we need to use matching column(s), for example, Name. But if you compare it in both datasets, you will see that names are often spelled differently, for example, "Elizabeth" in Wikipedia (see row 1 above) and "Elisabeth" in Kaggle (see row 1 below). So Name is not a good matching column, however, we could use it to generate other matching columns. These could be, for example, 3 first letters of a surname and 3 first letters of a name. This should be precise enough for most of the cases and hopefully avoid the misspelling issue. In case of duplicates we could use the column Age to differentiate them. For example, the boy "Allison, Master. Hudson Trevor" (see row 2 below) and his father "Allison, Mr. Hudson Joshua Creighton" (see row 4 below) will both result in the surname-name code "All-Hud" but could be differentiated using Age.

In [25]:
kcombine.sort_values(['Pclass', 'Name']).reset_index(drop=True).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,731,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,306,1.0,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
2,298,0.0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1198,,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,499,0.0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


Before we start extracting surname and name codes (first 3 letters), note that in the Kaggle dataset the title of "Mrs." is followed first by the name of her husband and only then in parentheses by the actual wife's name (see row 4 above). Therefore, we need to extract surname and name codes separately for passengers with the title "Mrs." and combine them with the codes for the rest. For this purpose we will use regular expressions (see more in the Python regex documentation). The new dataset will be called kagg_corr and for convenience let's sort the dataframe by Pclass and Name.

Rows that contain "Mrs." in their name are matched using the first regular expression. Matched groups (?P<Surname>.*), (?P<Husband_name>.*) and (?P<Wife_name>.*) are extracted as columns in the dataframe temp. It is important not to leave any space between (?P<Husband_name>.*) and \((?P<Wife_name>.*)\) because the husband's name is sometimes missing and the matching won't work with the extra space between them.

Afterwards all the rows are matched using the second regular expression. This time the named groups for matching (?P<Surname>.*), (?P<Title>.*) and ?P<Name>.*) are extracted as columns in the dataframe temp2. In this case the rows that contain "Mrs." will be matched wrongly but we don't care about them because they have already been correctly matched using the previous regular expression.

Note that the named groups (?P<Husband_name>.*) and (?P<Title>.*) can be simply replaced by .* since they are not used later on. Nevertheless, we include them for clarity.

Before we assign the obtained surname and name codes to the additional columns, we need to make sure that they are the same for both Kaggle and Wikipedia datasets. Due to incosistencies in compound surnames, for example, "van Billiard" in Kaggle and "Van Billiard" in Wikipedia datasets, we should capitalize all codes using the built-in method title().

In [26]:
kcombine_corr = kcombine.copy()

# Sorting by class and name
kcombine_corr = kcombine_corr.sort_values(['Pclass', 'Name']).reset_index(drop=True)

# Extracting surnames and names using regular expressions
temp = kcombine_corr.Name.str.extract(r'(?P<Surname>.*), Mrs\. (?P<Husband_name>.*)\((?P<Wife_name>.*)\)')
temp2 = kcombine_corr.Name.str.extract(r'(?P<Surname>.*), (?P<Title>.*)\. (?P<Name>.*)')

temp2.head()

Unnamed: 0,Surname,Title,Name
0,Allen,Miss,Elisabeth Walton
1,Allison,Master,Hudson Trevor
2,Allison,Miss,Helen Loraine
3,Allison,Mr,Hudson Joshua Creighton
4,Allison,Mrs,Hudson J C (Bessie Waldo Daniels)


In [27]:
# Adding Kaggle surname codes
surname = temp.Surname
surname2 = temp2.Surname
surname = surname.fillna(surname2)
surname = surname.str.title()
surname_code = surname.str[0:3]
kcombine_corr['Surname_code'] = surname_code

In [28]:
# Adding Kaggle name codes
name = temp.Wife_name
name2 = temp2.Name
name = name.fillna(name2)
name = name.str.title()
name_code = name.str[0:3]
kcombine_corr['Name_code'] = name_code

kcombine_corr.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname_code,Name_code
0,731,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,All,Eli
1,306,1.0,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,All,Hud
2,298,0.0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,All,Hel
3,1198,,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,All,Hud
4,499,0.0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,All,Bes


## Preparing Wikipedia dataset for merge
Just like in the Kaggle dataset it is useful to have a column with a unique ID for each passenger. Let's name this column WikiId and insert in the beginning of the dataset using the method insert().

Then we need to correct a typo in row 339 that has "." after the surname instead of ",".

We should also replace the missing values in the Wikipedia dataset written as "--" with NaN.

Afterwards we use the same strategy as in the Kaggle dataset by matching the name strings with two regular expressions. This time the first regular expression is used to match the servants that have a different name structure starting with "and ...". Note that the group (?P<Title>.*?) now has ? in the end and indicates non-greedy matching. This is important for the cases when the title doesn't end with ".", for example, "Miss" (unlike the Kaggle dataset where all titles end with "." even "Miss."). For the same reason (?P<Title>.*?) is followed by \.* where * indicates that a dot might not be present at all.

In addition to the capitalization of the surname and name codes, we also convert special characters that might occur there, into their ASCII equivalents using the function unidecode(). Unlike the Kaggle dataset, the Wikipedia dataset has a lot of these characters especially in Scandinavian names (for example, "Björnström-Steffanson, Mr. Mauritz Håkan")

Since we will use the column Age to help resolving duplicates, we need to change its type to float just like in the Kaggle dataset. This column, however, indicates the age of babies in months (for example, "11 mo.") so it should be first extracted using the matched group (?P<Months>\d*) and divided by 12. The missing values are then filled with the rest of the values from Age and its type is changed to float. The rounding to 2 decimal digits is applied so that the format is consistent with the Kaggle dataset.

Since both Kaggle and Wikipedia datasets has columns Name, Age, Surname_code and Name_code, let's add the suffix _wiki in the Wikipedia dataset in order to distinguish them.

In [29]:
wcombine_corr = wcombine.copy()

# Adding WikiId
wcombine_corr.insert(0, 'WikiId', wcombine_corr.index + 1)

# Correcting a typo
wcombine_corr['Name'][339] = 'Beane, Mrs. Ethel (née Clarke)'

# Replacing -- with NaN
wcombine_corr = wcombine_corr.replace('--', float('nan'))

# Extracting surnames and names using regular expressions
temp = wcombine_corr.Name.str.extract(r'and (?P<Profession>.*), (?P<Title>.*?)\.* (?P<Name>.*) (?P<Surname>.*)')
temp2 = wcombine_corr.Name.str.extract(r'(?P<Surname>.*), (?P<Title>.*?)\.* (?P<Name>.*)')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [30]:
# Adding Wikipedia surname codes
surname = temp.Surname
surname2 = temp2.Surname
surname = surname.fillna(surname2)
surname = surname.str.title()
surname = surname.apply(unidecode)
surname_code = surname.str[0:3]
wiki_corr['Surname_code'] = surname_code

# Adding Wikipedia name codes
name = temp.Name
name2 = temp2.Name
name = name.fillna(name2)
name = name.str.title()
name = name.apply(unidecode)
name_code = name.str[0:3]
wiki_corr['Name_code'] = name_code

# Converting age type to float
months = wiki_corr.Age.str.extract(r'(?P<Months>\d*) mo.', expand=False).astype('float64') / 12.0
age = months.fillna(wiki_corr.Age)
wiki_corr['Age'] = age.astype('float64').round(2)

# Adding Wikipedia suffixes
wiki_corr = wiki_corr.rename(columns={'Name': 'Name_wiki', 'Age': 'Age_wiki', 'Surname_code': 'Surname_code_wiki', 'Name_code': 'Name_code_wiki'})

wiki_corr.head()

AttributeError: 'float' object has no attribute 'encode'