# Feature Engineering and Preparation for Titanic
In a previous exercise we explored the Titanic dataset. Now we want to extract new features from the dataset and prepare it for classification algorithms.

As always, we start by importing the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
%matplotlib inline

# Change default figure and font size for plots
matplotlib.rcParams['figure.figsize'] = (12.0, 9.0)
matplotlib.rcParams.update({'font.size': 12})

Please load the Titanic data from the path **'../data/titanic_new.csv'** as a pandas dataframe called **titanic**, then call the **head** and **info** methods on the dataframe.

In [None]:
# Load data


In [None]:
# head


In [None]:
# info


## Feature Engineering

Before we start dropping columns, filling null values and encoding categorical variables, we want to **construct some new features.**

To that end, have a look at the **'Name'** column. What kind of feature could we extract from that column?

In [None]:
# Investigate the name column


### Extract Title
Maybe the title in front of the first name could be useful? Let's extract it using a small Python function and a regular expression.
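
To get a feel for the pattern, here is a minimal sketch on a made-up passenger name (the sample string is an assumption, not taken from the dataset). The pattern `\w+\.` matches the first run of word characters that is directly followed by a literal dot.

```python
import re

# Hypothetical name in the "Last, Title. First" style
sample_name = 'Doe, Mr. John'

match = re.search(r'(\w+\.)', sample_name)
# re.search returns None when nothing matches, so check before calling group()
sample_title = match.group(1) if match else None
print(sample_title)  # Mr.
```

Note that the trailing dot is part of the match, so the extracted titles look like 'Mr.' rather than 'Mr'.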

In [None]:
# Just execute
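import re

def getTitle(name):
    'Extracts the first word that is followed by a dot (.), e.g. "Mr."'
    title = re.search(r'(\w+\.)', name)
    # Return None instead of raising an error if the name contains no such word
    return title.group(1) if title else None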

Next, use the **getTitle** function in an **apply method** on the column **'Name'** of the dataframe in order to get a new column containing all the titles. Name this new column **'Title'**.

**Hint**: Use a lambda function in the apply method.
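
If the combination of apply and a lambda function is new to you, the following sketch shows the mechanics on a toy series that has nothing to do with the Titanic data (all names here are made up for illustration).

```python
import pandas as pd

# Toy series, unrelated to the Titanic data
words = pd.Series(['apple', 'kiwi', 'banana'])

# apply calls the lambda once per element and collects the results in a new series
lengths = words.apply(lambda w: len(w))
print(lengths.tolist())  # [5, 4, 6]
```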

In [None]:
# Create title column
titanic[<FILL-IN>] = <FILL-IN>.apply(<FILL-IN>)

Extract all unique titles by using the **unique()** method on the new column.

In [None]:
# Unique titles


Some of the titles seem to be very uncommon, e.g. *Jonkheer*. If you are interested in what this title means, you can have a look at the Wikipedia article (https://en.wikipedia.org/wiki/Jonkheer). Only one person in the dataset has this title. Who is it?

In [None]:
# Get the Jonkheer


Next, **check the cardinality** of the different titles by using the method **value_counts**, or visualize them by using the seaborn function **sns.countplot(data=titanic, x='Title')**.
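
As a small illustration of what value_counts returns, here is a sketch on a toy series (the colors are made up). The result is a series indexed by the distinct values and sorted by frequency in descending order.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# One row per distinct value, most frequent value first
counts = colors.value_counts()
print(counts)
```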

In [None]:
# Check cardinality


Some titles are underrepresented. Hence, we combine all uncommon titles into a **new category** called **'rareTitle'**.
This can easily be done with the map method and a Python dictionary. For each entry in the Title column, map looks the entry up among the dictionary keys and replaces it with the corresponding dictionary value. If the entry is not among the dictionary keys, a null value is inserted, which we immediately fill with the value 'rareTitle'.
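
The lookup-then-fill behaviour can be seen in isolation on a toy series. This is only a sketch with invented values, not the Titanic titles.

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'XL', 'S'])
lookup = {'S': 'small', 'M': 'medium'}

# 'XL' is not a key of the dictionary, so map turns it into a null value
mapped = sizes.map(lookup)

# The null value is then replaced by a catch-all category
filled = mapped.fillna('other')
print(filled.tolist())  # ['small', 'medium', 'other', 'small']
```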

In [None]:
# Just execute
map_dict = {
    'Ms.': 'Miss.',
    'Mlle.': 'Miss.',
    'Mr.' : 'Mr.',
    'Mrs.': 'Mrs.',
    'Miss.': 'Miss.'
}

titanic['Title'] = titanic['Title'].map(map_dict).fillna('rareTitle')

Again, check the unique titles and the cardinality.

In [None]:
# Check titles


In [None]:
# Check cardinality


Before we combine the common titles like Miss and Mrs, we can extract another feature: whether a female passenger is married or unmarried. As before, we use the map method to generate this new feature.

Please add a **new column** called **'marital_status'** and apply a **map function** using the map_dict on the 'Title' column. Afterwards, **fill the null values** with the string **'Unknown'**.

In [None]:
# Create new feature marital_status
map_dict = {
    'Miss.' : 'no',
    'Mrs.': 'yes',
}

<FILL-IN> = <FILL-IN>

Check the unique elements and the cardinality of the new feature.

In [None]:
# Unique elements

In [None]:
# Cardinality

Finally, we **combine the common titles** in the 'Title' column into the category **'noTitle'**. To do so, we use the NumPy function **np.where()**. Check the docstring to see how it works.

Create a boolean Pandas Series which contains the value False if the title is common and True if it is rare. Call this series cond.
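
As a minimal sketch of how np.where works with a boolean series, consider the following toy example (the scores and labels are made up). Wherever the condition is True the first value is chosen, otherwise the second.

```python
import numpy as np
import pandas as pd

scores = pd.Series([3, 8, 5, 10])

# Comparing a series with a scalar yields a boolean series
cond_demo = scores > 5

# Element-wise choice between two values
labels = np.where(cond_demo, 'high', 'low')
print(labels.tolist())  # ['low', 'high', 'low', 'high']
```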

In [None]:
# Create boolean series
<FILL-IN> = <FILL-IN>

# Perform conditional replacement
titanic['Title'] = np.where(cond, 'rareTitle', 'noTitle')

Finally, check the **cardinality** of the **Title** feature.

In [None]:
# Cardinality


### Deck
We can extract another feature from the 'Cabin' column, which by itself is not very useful. The column contains many null values, since a cabin number is mostly only available for 1st class passengers. A **cabin number** looks like **'C123'**, where the **first letter refers to the deck**. We extract the deck in a similar fashion as the title.

In [None]:
# Just execute

# Cabin list
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G']

In [None]:
# Just execute

# Extract function
def charInString(charlist, text):
    'Returns the first element of charlist that occurs anywhere in text, else "Unknown"'
    for element in charlist:
        if element in text:
            return element
    return 'Unknown'

Test the above function on two different cabin numbers, e.g. 'C123' and 'X555'.

In [None]:
# Test function


Now, apply the new function to the column **'Cabin'** by using the **apply** method and add the result to the titanic dataframe as a new **column** called **'Deck'**. Note that you first have to fill the null values of that column, for which you can use the fillna method.
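
Why the null values have to be filled first can be seen in a toy example (the codes are invented). A string function cannot handle a missing value, so chaining fillna in front of apply makes sure that every element is a string.

```python
import pandas as pd

codes = pd.Series(['ab12', None, 'cd34'])

# Without fillna, the lambda would receive a missing value for the second entry
# and fail, because a missing value has no .upper() method
cleaned = codes.fillna('missing').apply(lambda c: c.upper())
print(cleaned.tolist())  # ['AB12', 'MISSING', 'CD34']
```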

In [None]:
# Create feature Deck
<FILL-IN> = <FILL-IN>.fillna(<FILL-IN>).apply(<FILL-IN>)

As always, check the unique values and the cardinality of the new feature.

In [None]:
# Unique values


In [None]:
# Cardinality


### Family Size
Finally, we add a new feature **Family_Size**, which is just the sum of SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard).

Please create this new column called **Family_Size**.

In [None]:
# Add column Family_Size
<FILL-IN> = <FILL-IN>

## Feature Preparation

Finally, we can prepare our data so that we can feed it to a classification model in the next exercise. This part will be very similar to the feature preparation part of the house pricing dataset, i.e. splitting the data into a train and a test dataset, filling null values and encoding the categorical variables as numerical ones.

First, **drop** the unnecessary columns **'Cabin'**, **'PassengerId'**, **'Name'**, **'SibSp'**, **'Parch'** and **'Ticket'**.

In [None]:
# Drop unnecessary cols
titanic.drop(<FILL-IN>, axis=<FILL-IN>, errors='ignore', inplace=True)

titanic.info()

Since none of the categorical features contains null values, we can already 'dummy encode' them. To do so, use the pandas function **get_dummies** with the arguments data=titanic, prefix_sep='=' and drop_first=True. Call the resulting dataframe **titanic_dum** and investigate it by using the **info** method.
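
To see what these arguments do, here is a sketch on a toy dataframe with invented columns. The categorical column is replaced by one indicator column per category, named with the chosen separator, and drop_first removes the first category (in sorted order) because it is implied by the others.

```python
import pandas as pd

toy = pd.DataFrame({'size': ['S', 'M', 'L', 'M'], 'price': [1.0, 2.0, 3.0, 2.5]})

# 'size' has the categories L, M and S; the indicator for 'L' is dropped
toy_dum = pd.get_dummies(data=toy, prefix_sep='=', drop_first=True)
print(toy_dum.columns.tolist())  # ['price', 'size=M', 'size=S']
```

A row in which all indicator columns are zero therefore belongs to the dropped category.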

In [None]:
# One Hot Encoding


The only remaining column which contains null values is the **'Age'** column. Instead of just filling the values with the median or mean, we design our **own imputation method**. Maybe you remember that the **median ages differed across the passenger classes**. Hence, we fill each null value with the median of its class group. In scikit-learn it is straightforward to construct custom preprocessing functions and classes, so we build a custom class:

In [None]:
# Just execute
from sklearn.base import TransformerMixin, BaseEstimator

class AgeImputer(TransformerMixin, BaseEstimator):
    '''Custom imputer which computes the median of the age column grouped 
    by another feature and fills the null values accordingly'''
    
    def __init__(self, col, copy=True):
        self.col = col # column to group by
        self.copy = copy # if True, transform works on a copy of X
        self.median = {} # maps each group value to its median age
        
    def fit(self, X, y=None):
        # fitting procedure fills the median dict
        self.median = X.groupby(self.col)['Age'].median().to_dict()
        return self
    
    def transform(self, X):
        X_ = X if not self.copy else X.copy()
        for key in self.median:
            # fill the NaN ages of all rows in group `key` with that group's median
            X_.loc[X_['Age'].isnull() & (X_[self.col] == key), 'Age'] = self.median[key]
        return X_

The class above contains a fit method. As we already know, **fit methods may only be applied** to the training set. Hence, we have to **split** our data into **test** and **training datasets**.
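
The separation between learning the medians and applying them can be sketched in plain pandas on two toy dataframes (the column names 'group' and 'value' are made up for illustration). The medians are computed from the training data only and then reused to fill the gaps in the test data.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'group': [1, 1, 2, 2], 'value': [10.0, 20.0, 100.0, 300.0]})
test = pd.DataFrame({'group': [1, 2], 'value': [np.nan, np.nan]})

# "fit": learn one median per group from the training data only
medians = train.groupby('group')['value'].median()

# "transform": fill missing test values with the medians learned above
test_filled = test['value'].fillna(test['group'].map(medians))
print(test_filled.tolist())  # [15.0, 200.0]
```

No information from the test data enters the medians, which is the point of fitting on the training set only.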

Please **import** the **train_test_split** function from the module **sklearn.model_selection** and apply it to the dataframe **titanic_dum**. Set the **test_size** to **0.2** and the **random_state** argument to **42**. Call the resulting dataframes **titanic_train** and **titanic_test**.
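
As a rough illustration of what the function returns, here is a sketch on a toy dataframe with different settings than the ones requested above (the data and the chosen values are made up). A dataframe goes in, and two dataframes with disjoint rows come out.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'x': range(8), 'y': [0, 1] * 4})

# A quarter of the 8 rows goes to the test set; random_state makes the shuffle reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)
print(len(toy_train), len(toy_test))  # 6 2
```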

In [None]:
# Import train_test_split


In [None]:
# Split dataframes in train and test
titanic_train, <FILL-IN> = <FILL-IN>

Now we can use our custom AgeImputer.

Create an instance of that class called **ageImputer**. Set the argument **col to 'Pclass'**.

In [None]:
# Create instance ageImputer


Next, use the **fit_transform** method on the **titanic_train** dataset and afterwards the **transform** method on the **titanic_test** dataset. Call the results **train_df** and **test_df**, respectively. Finally, use the **info** method on both results to check whether all null values are gone.

In [None]:
# Fill Nulls
<FILL-IN> = <FILL-IN>
test_df = <FILL-IN>

In [None]:
# Check test df


In [None]:
# Check train df


Finally, save the two dataframes as Python pickle files by using the method **to_pickle** on each of them. Call the files *titanic_train.pkl* and *titanic_test.pkl*.
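
The following sketch shows a pickle round trip on a toy dataframe. It writes to a temporary directory (an assumption made here so that the example leaves no files behind) and reads the file back with pd.read_pickle.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)              # write the dataframe to disk
    restored = pd.read_pickle(path)  # read it back, dtypes included

print(restored.equals(toy))  # True
```

Unlike a CSV file, a pickle file preserves the column dtypes, so the dataframe can be used right away in the next exercise.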

In [None]:
# Save dataframes as pickle file


**This is the end of this exercise.**