#### ML Project Part 2

**Introduction**

So, in the previous chapter we took on an ML task to predict survival rates of passengers on the Titanic. We successfully imported and explored the data and then built a heuristic measure based upon the sex of the passengers and a Logistic Regression model in order to provide ourselves with a baseline measure of performance. These both measured just under **80%** in terms of accuracy.

Our next step is going to be a deeper look into the data in our dataset in order to clean, combine, refine, drop and generally transform our variables to make them more predictive. This is a process called **Feature Engineering** and it's going to be pretty much the sole focus of this chapter!

Shit data + shit algorithm = shit results  
Shit data + shiny algorithm = slightly less shit results (hopefully)  
Shiny data + shit algorithm = Not bad results  

In [82]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Rerunning the code from the previous chapter:

In [83]:
# Import
path = '../data/'
df = pd.read_csv('{}titanic_train.csv'.format(path))
df.columns = [
    x.lower() for x in df.columns
]

# Basic Feature Engineering
sex_codes = {
    'male': 0,
    'female': 1,
}

embarked_codes = {
    'S': 1,
    'Q': 2,
    'C': 3,
}

# map() converts each category to its code and leaves missing values as NaN
df['sex'] = df['sex'].map(sex_codes)
df['embarked'] = df['embarked'].map(embarked_codes)
df.head(5)

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,3.0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1.0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1.0


And of course our newly functionalised model:

In [110]:
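from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def log_reg_model(features, label):
    '''
    Runs a simple Logistic Regression Model to measure
    performance.
    '''

    model = LogisticRegression()
    scores = cross_val_score(
        model,
        features,
        np.ravel(label),  # flatten the single-column label to 1-D
        cv=50             # 50-fold cross-validation
    )

    # Report the mean accuracy across all folds as a percentage
    print('Model Performance: {}%'.format(round(scores.mean()*100,1)))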



**Step 3: Feature Engineering**

It's worth noting that we already did a little bit of what could be called **feature engineering** when we converted the `sex` and `embarked` variables. **Feature engineering** purists may disagree with that assessment but at a fundamental level, part of the feature selection process is the exploratory work to see what patterns are hidden in the data and which features are suitable and which are not.

Additionally, we might also want to apply a basic algorithm to our dataset, both to set a baseline level of predictive accuracy to improve upon with feature engineering and to see how much feature engineering is required and on which variables, or whether it is even required at all (hint: it usually is). However, this is a training lesson, so we'll forgo these steps and focus on how we might be able to build additional predictive features from our dataset.

**Family Size & Lone Travellers**

Firstly we'll take a look at the siblings and parent/child variables (`sibsp` and `parch`). On their own the variables don't have that much correlation with our `label` variable of `survived`. However, what if we could combine them into a single variable for family size? This single variable will likely have more value than the two separate variables and is a basic form of **dimensionality reduction** which makes our models quicker and can improve the quality of our results also.

In [84]:
df['family_size'] = df['sibsp'] + df['parch'] + 1
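The heading also mentions lone travellers. A minimal sketch of how such a flag could be derived from `family_size` is shown below on toy rows that copy the `sibsp` and `parch` values of the first five passengers. The column name `is_alone` is an assumption for illustration and is not used elsewhere in this chapter.

```python
import pandas as pd

# Toy rows using the sibsp/parch values of the first five passengers
toy = pd.DataFrame({
    'sibsp': [1, 1, 0, 1, 0],
    'parch': [0, 0, 0, 0, 0],
})
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1

# `is_alone` is a hypothetical column name: 1 when the passenger has no family aboard
toy['is_alone'] = (toy['family_size'] == 1).astype(int)
print(toy)
```

On the real data the same two lines would run against `df` instead of `toy`.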

**Cabin**

We can see from our profile report that a lot of the values in the `cabin` column are missing. This could mean either that those people didn't have a cabin or that the cabin data is incomplete, and we don't know which.
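If you want to put a number on "a lot", one option is to measure the share of missing values directly. The sketch below does this on the `cabin` values of the first five passengers shown earlier; the variable names are illustrative, and on the real data you would call the same methods on `df['cabin']`.

```python
import numpy as np
import pandas as pd

# Cabin values of the first five passengers shown earlier (NaN = missing)
cabin = pd.Series([np.nan, 'C85', np.nan, 'C123', np.nan], name='cabin')

# isnull() gives True/False per row; the mean of that is the missing fraction
missing_share = cabin.isnull().mean()
print('Missing: {:.0%}'.format(missing_share))
```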

It might be tempting to drop it, or leave it in to see what it does, but it's good practice to do a little research on your data wherever possible. Fortunately for us, the Titanic sinking was a pretty major historical event and we also have Google to help us. A [quick search](https://www.google.co.uk/search?q=cabin+allocations+titanic&oq=cabin+allocations+titanic&aqs=chrome..69i57j69i60.3551j0j4&sourceid=chrome&ie=UTF-8) reveals [the following information](https://www.encyclopedia-titanica.org/cabins.html):

*The allocation of cabins on the Titanic is a source of continuing interest and endless speculation. Apart from the recollections of survivors and a few tickets and boarding cards, the only authoritative source of cabin data is the incomplete first class passenger list recovered with the body of steward Herbert Cave. The list below includes this data and includes the likely occupants of some other cabins determined by other means.*

*The difficulty in determining, with any degree of accuracy, the occupancy of cabins on the Titanic indicates the need for further research in this area.*

So there we have it... At best the `cabin` variable can be considered incomplete and at worst unreliable! As such, we'll drop it from our dataset:

In [85]:
df.drop('cabin', axis=1, inplace=True)

**Name**

Looking at the name field we can see a few potentially useful pieces of data.  

Firstly, we have the surname. This could be useful for linking families together via a family id variable. We also have the title. This might be a good way to infer the age or status of someone (especially considering our `age` variable has so many missing values). Lastly, for married females it was the convention at the time to go by their husband's name (albeit with a Mrs. as a title instead of a Mr.) and we get their full maiden name in brackets also. This provides us with a means to link spouses and, as the survival rates for single and married people may differ, this could be an avenue worth exploring.

Let's start by splitting the name string up into parts and creating a surname variable:

In [86]:
df['split_name'] = df['name'].str.split()
# Take everything before the comma so multi-word surnames stay intact
df['surname'] = df['name'].str.split(',').str[0].str.strip()

We can easily make a family id variable by taking the surname and the number of family members:

In [87]:
df['family_id'] = df['surname'].astype(str) + df['family_size'].astype(str)

Now we have a slight issue in how to return the passenger title. Due to the inconsistency of the `name` variable format (e.g. double barrelled surnames) we can't rely on the position of the title in the string. However we might be able to use something called a Regular Expression to extract it.

**Regular Expressions**

[Regular Expressions](https://www.regular-expressions.info/) (regex or regexp for short) are a special convention for describing a search pattern. The core syntax is largely the same across programming languages, so once you've learned it you can apply it almost anywhere! We'll not go into the detail of regular expressions here but there is [plenty of information out there](https://www.regular-expressions.info/quickstart.html) on how to get started, and to let you in on a trade secret, unless you have a very specific requirement, a quick Google search will likely yield the result you need.
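As a quick taste of the syntax, the sketch below runs two small patterns over a ticket value from our dataset using Python's built-in `re` module. The patterns and variable names are purely illustrative and aren't used later in the chapter.

```python
import re

ticket = 'STON/O2. 3101282'   # a ticket value from our dataset

# \d+ means "one or more digits"; $ anchors the match to the end of the string
number = re.search(r'\d+$', ticket).group()

# [A-Z]+ means "one or more capital letters"; ^ anchors the match to the start
prefix = re.search(r'^[A-Z]+', ticket).group()

print(prefix, number)
```

Note the `r` before each pattern: a raw string stops Python from interpreting the backslashes before the regex engine sees them.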

In this case we can see that there is a distinct pattern to the title of a passenger. The title starts with a capital letter, with the rest of the characters being lower case and at the end of the title there's a full stop. As such we can construct our regular expression as follows:

In [88]:
import re

def title(row):
    # Raw string: a space, one or more letters (captured), then a literal full stop
    title_search = re.search(r' ([A-Za-z]+)\.', row['name'])
    # If a title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""

df['title'] = df.apply(title, axis=1)
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,embarked,family_size,split_name,surname,family_id,title
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,1.0,2,"[Braund,, Mr., Owen, Harris]",Braund,Braund2,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,3.0,2,"[Cumings,, Mrs., John, Bradley, (Florence, Bri...",Cumings,Cumings2,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,1.0,1,"[Heikkinen,, Miss., Laina]",Heikkinen,Heikkinen1,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,1.0,2,"[Futrelle,, Mrs., Jacques, Heath, (Lily, May, ...",Futrelle,Futrelle2,Mrs
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,1.0,1,"[Allen,, Mr., William, Henry]",Allen,Allen1,Mr


As we can see, this has worked pretty well! We can check what unique values our function has returned as follows:

In [89]:
df['title'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'Countess',
       'Jonkheer'], dtype=object)

To answer your first question, yes [Jonkheer is a valid title](https://en.wikipedia.org/wiki/Jonkheer)!  However there are 17 unique titles here and as we'll see when we come to transform our categorical variables for inclusion in the model, this can have a severe impact on performance. As such we'll simplify this a little with a mapping:

In [90]:
title_codes = {
    'Mr': 1,       # General adult male
    'Mrs': 2,      # General adult female
    'Miss': 3,     # General young female
    'Master': 4,   # General young male
    'Don': 5,      # Noble male
    'Rev': 6,      # Professional
    'Dr': 6,       # Professional
    'Mme': 2,      # General adult female
    'Ms': 2,       # General adult female
    'Major': 6,    # Professional
    'Lady': 7,     # Noble female
    'Sir' : 5,     # Noble male
    'Mlle': 3,     # General young female
    'Col': 6,      # Professional
    'Capt': 6,     # Professional
    'Countess': 7, # Noble female
    'Jonkheer': 5  # Noble male
}

df['title'].replace(title_codes, inplace=True)
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,embarked,family_size,split_name,surname,family_id,title
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,1.0,2,"[Braund,, Mr., Owen, Harris]",Braund,Braund2,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,3.0,2,"[Cumings,, Mrs., John, Bradley, (Florence, Bri...",Cumings,Cumings2,2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,1.0,1,"[Heikkinen,, Miss., Laina]",Heikkinen,Heikkinen1,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,1.0,2,"[Futrelle,, Mrs., Jacques, Heath, (Lily, May, ...",Futrelle,Futrelle2,2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,1.0,1,"[Allen,, Mr., William, Henry]",Allen,Allen1,1


**Age**

As we saw previously, we had a lot of missing values for our `age` variable and based upon our exploratory analysis, we know that it is a good predictor of survival. For our initial model, we simply infilled the missing values with the mean. However, there may be a way we can do this more intelligently based upon the title of the passenger. We'll start by splitting out some relevant variables into a separate dataset and looking at the average ages of the various titles:

In [91]:
df_titles = df[['age','title']]

# Summarise age by title in a single pass
tb = df_titles.groupby('title')['age'].agg(['mean', 'median', 'min', 'max'])
tb

title,mean,median,min,max
1,32.36809,30.0,11.0,80.0
2,35.718182,35.0,14.0,63.0
3,21.804054,21.0,0.75,63.0
4,4.574167,3.5,0.42,12.0
5,42.333333,40.0,38.0,49.0
6,46.705882,50.0,23.0,70.0
7,40.5,40.5,33.0,48.0


Next we can create a function to infill missing `age` values with the (rounded) median value for that specific title:

In [92]:
def infer_age(row):
    '''
    Infers the age for nan values
    '''
    if(pd.isnull(row['age'])):
        
        if row['title'] == 1:    # Mr
            return 30
        elif row['title']  == 2:  # Mrs
            return 35
        elif row['title']  == 3:  # Miss
            return 21
        elif row['title']  == 4:  # Master
            return 4
        elif row['title']  == 5:  # Noble male
            return 40
        elif row['title']  == 6:  # Professional
            return 50
        elif row['title']  == 7:  # Noble female
            return 40

    else:
        return row['age']

df['age'] = df.apply(infer_age, axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
passengerid    891 non-null int64
survived       891 non-null int64
pclass         891 non-null int64
name           891 non-null object
sex            891 non-null int64
age            891 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
ticket         891 non-null object
fare           891 non-null float64
embarked       889 non-null float64
family_size    891 non-null int64
split_name     891 non-null object
surname        891 non-null object
family_id      891 non-null object
title          891 non-null int64
dtypes: float64(3), int64(8), object(5)
memory usage: 111.5+ KB


As we can see our `age` variable has been successfully infilled! Let's check the dataframe to be on the safe side:

In [93]:
df

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,embarked,family_size,split_name,surname,family_id,title
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,1.0,2,"[Braund,, Mr., Owen, Harris]",Braund,Braund2,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,3.0,2,"[Cumings,, Mrs., John, Bradley, (Florence, Bri...",Cumings,Cumings2,2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,1.0,1,"[Heikkinen,, Miss., Laina]",Heikkinen,Heikkinen1,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,1.0,2,"[Futrelle,, Mrs., Jacques, Heath, (Lily, May, ...",Futrelle,Futrelle2,2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,1.0,1,"[Allen,, Mr., William, Henry]",Allen,Allen1,1
5,6,0,3,"Moran, Mr. James",0,30.0,0,0,330877,8.4583,2.0,1,"[Moran,, Mr., James]",Moran,Moran1,1
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,1.0,1,"[McCarthy,, Mr., Timothy, J]",McCarthy,McCarthy1,1
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.0750,1.0,5,"[Palsson,, Master., Gosta, Leonard]",Palsson,Palsson5,4
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,1.0,3,"[Johnson,, Mrs., Oscar, W, (Elisabeth, Vilhelm...",Johnson,Johnson3,2
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,3.0,2,"[Nasser,, Mrs., Nicholas, (Adele, Achem)]",Nasser,Nasser2,2
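The `infer_age` function above hard-codes the medians, so it would need updating by hand if the data or the title groupings changed. A possible alternative, sketched here on toy data rather than taken from the original notebook, is to let pandas compute each group's median and fill only the gaps. Note that on the real data this would give 3.5 for the Master group rather than the rounded 4 used above, so the results would not be identical.

```python
import numpy as np
import pandas as pd

# Toy data: title codes as used above (1 = Mr, 4 = Master), with two missing ages
toy = pd.DataFrame({
    'title': [1, 1, 1, 4, 4, 4],
    'age':   [22.0, 35.0, np.nan, 2.0, 4.0, np.nan],
})

# Median age of each row's title group, aligned to the original rows
group_median = toy.groupby('title')['age'].transform('median')

# Only the missing ages are replaced; known ages are left untouched
toy['age'] = toy['age'].fillna(group_median)
print(toy)
```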


OK, so now might be a good time to check and see what our transformations of the `age`, `family_size` and `title` variables have done to the accuracy of our model. Let's start by creating our features dataset:

In [97]:
df_model = df[['pclass','sex','age','title','family_size']]
df_model.head()

Unnamed: 0,pclass,sex,age,title,family_size
0,3,0,22.0,1,2
1,1,1,38.0,2,2
2,3,1,26.0,3,1
3,1,1,35.0,2,2
4,3,0,35.0,1,1


You'll notice that all the variables in the dataset are numeric, which is of course the right thing to do as scikit-learn would produce an error if we didn't! Looking at our `pclass`, `age` and `family_size` variables, these are all quantitative in that the value relates to a scale. However, our `sex` and `title` variables are qualitative (categorical) data encoded as numbers. We want our algorithm to treat these two types of variable differently, otherwise it will treat the category codes as if they were quantities too. So what's the solution to this? The answer is a technique called One Hot Encoding.

**One Hot Encoding**

[One Hot Encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) is a process by which categorical variables are converted into a form that ML algorithms can use to do a better job in prediction. If we were to feed our dataset into an algorithm 'as-is' it would interpret the `title` mathematically (e.g. Mr + Mr = Mrs or 1 + 1 = 2). To combat this, we need to create a separate binary variable for each categorical value (e.g. Mr, Mrs, Miss, Master etc.) and assign a value of 1 (hot) if that value is applicable to the record or 0 (cold) if it's not. The 'hot' and 'cold' terminology simply means that, for each record, exactly one of these binary variables is 'hot'.

It's probably better illustrated than explained so we'll do this with pandas. Note that scikit-learn has a one-hot encoder also, but since we've done the rest of our feature engineering in pandas, we'll stick with it and explore how to do it in scikit-learn later on in the course. The method for doing so is called `get_dummies()`:

In [98]:
df_sex = pd.get_dummies(df_model['sex'])
df_sex.columns = ['sex_{}'.format(x) for x in df_sex.columns]

df_title = pd.get_dummies(df_model['title'])
df_title.columns = ['title_{}'.format(x) for x in df_title.columns]

df_model = pd.concat([df_model,df_sex,df_title], axis=1)
df_model.drop(['sex','title'], axis=1, inplace=True)
df_model.head()

Unnamed: 0,pclass,age,family_size,sex_0,sex_1,title_1,title_2,title_3,title_4,title_5,title_6,title_7
0,3,22.0,2,1,0,1,0,0,0,0,0,0
1,1,38.0,2,0,1,0,1,0,0,0,0,0
2,3,26.0,1,0,1,0,0,1,0,0,0,0
3,1,35.0,2,0,1,0,1,0,0,0,0,0
4,3,35.0,1,1,0,1,0,0,0,0,0,0
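The cell above builds the dummy columns step by step, which is useful for seeing what's going on. As a shorthand, `get_dummies()` also accepts a `columns` argument that encodes the named columns, prefixes the new columns with the original column name and drops the originals in one call. The sketch below shows this on toy data; the `dtype=int` argument is an assumption added here so the output is 0/1 rather than True/False on recent pandas versions.

```python
import pandas as pd

# Toy data with the same coded columns as df_model
toy = pd.DataFrame({
    'pclass': [3, 1, 3],
    'sex':    [0, 1, 1],
    'title':  [1, 2, 3],
})

# Encode both categorical columns in one call; the originals are dropped
encoded = pd.get_dummies(toy, columns=['sex', 'title'], dtype=int)
print(encoded.columns.tolist())
```

The resulting column names follow the same `sex_0`, `title_1` pattern as the manual version.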


**Measuring Performance**

So... The moment of truth! You'll remember we created a simple logistic regression model as a baseline. Let's re-run that now:

In [111]:
features = df[['pclass','age','sex']]
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 79.7%


Let's use our new `df_model` dataset to see if our feature engineering has improved our model:

In [112]:
features = df_model
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 83.3%


Our new dataset yields a **3.6** percentage point improvement and beats our target of **80%**! This is good news! But to illustrate the importance of one hot encoding, let's run the model without the one hot encoded values:

In [113]:
features = df[['pclass','sex','age','title','family_size']]
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 79.8%


You can see without one hot encoding we would have hardly experienced any improvement in model performance at all. 

One other thing is that we've not included the `fare` variable at all yet. This is a continuous value so we don't need to do anything special to it; we can simply include it in our features dataset. We will, however, have to re-build our `df_model` dataset:

In [114]:
# Creating the model dataset
df_model = df[['pclass','sex','age','title','family_size','fare']]

# Re-running the one hot encoding
df_sex = pd.get_dummies(df_model['sex'])
df_sex.columns = ['sex_{}'.format(x) for x in df_sex.columns]

df_title = pd.get_dummies(df_model['title'])
df_title.columns = ['title_{}'.format(x) for x in df_title.columns]

df_model = pd.concat([df_model,df_sex,df_title], axis=1)
df_model.drop(['sex','title'], axis=1, inplace=True)
df_model.head()

Unnamed: 0,pclass,age,family_size,fare,sex_0,sex_1,title_1,title_2,title_3,title_4,title_5,title_6,title_7
0,3,22.0,2,7.25,1,0,1,0,0,0,0,0,0
1,1,38.0,2,71.2833,0,1,0,1,0,0,0,0,0
2,3,26.0,1,7.925,0,1,0,0,1,0,0,0,0
3,1,35.0,2,53.1,0,1,0,1,0,0,0,0,0
4,3,35.0,1,8.05,1,0,1,0,0,0,0,0,0


In [150]:
features = df_model
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 82.6%


Interesting!! Performance has actually reduced as a result of the new `fare` variable. This could mean one of two things:

1. `fare` is a poor predictor of survival
2. It might be a useful predictor, however we need to transform it before we can use it.

In [119]:
# Infill zeros
med_fare = df['fare'].median() # Median due to outliers

# Note: this updates `df` only; `df_model` was built earlier and keeps its original zeros
df['fare'] = df['fare'].replace(0, med_fare)

In [129]:
# Scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
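# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
fare_array = df_model['fare'].to_numpy()
# The scaler expects a 2-D array, so reshape to a single column
fare_scales = scaler.fit_transform(fare_array.reshape(-1, 1))
df_model['fare_scaled'] = fare_scales.ravel()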
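With its default settings, `MinMaxScaler` rescales a column to the range 0 to 1 using `(x - min) / (max - min)`. The sketch below applies that formula by hand to the fares of the first five passengers, just to show what the scaler is doing under the hood; the variable names are illustrative.

```python
import numpy as np

# Fares of the first five passengers shown earlier
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (fares - fares.min()) / (fares.max() - fares.min())
print(scaled.round(3))
```

Because the result depends on the minimum and maximum, a single extreme fare will squash all the other values towards 0.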

Feed into the model! 

In [151]:
features = df_model.drop('fare', axis=1)
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 83.0%


Interestingly, transforming the `fare` variable into `fare_scaled` via scaling has improved performance, but it's still not quite as good as before. We know from our exploratory analysis that `fare` and `pclass` are highly correlated, so it could be that including two correlated variables is decreasing performance. We can test this by substituting our `pclass` variable with `fare_scaled`.

In [158]:
# Same columns as before, with `pclass` swapped out for `fare_scaled`
features = df_model[[
    'title_1', 'title_2', 'title_3', 'title_4', 'title_5', 'title_6', 'title_7',
    'family_size', 'age', 'sex_0', 'sex_1', 'fare_scaled'
]]
label = df[['survived']]

log_reg_model(features,label)

Model Performance: 83.3%


As we can see we're now back up to **83.3%**, our best score so far! But is there a way for us to include both `fare` and `pclass` in the model without adversely affecting performance?

Is there more we can do with the data?

In [138]:
df_model.columns

Index(['pclass', 'age', 'family_size', 'fare', 'sex_0', 'sex_1', 'title_1',
       'title_2', 'title_3', 'title_4', 'title_5', 'title_6', 'title_7',
       'fare_scaled'],
      dtype='object')

#### Sources & Further Reading

[Feature Engineering Survival on the Titanic](https://www.kaggle.com/matlihan/feature-engineering-for-titanic-dataset)  
[Introduction to PCA](https://lazyprogrammer.me/tutorial-principal-components-analysis-pca/)  