# Titanic

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Slides can be found here: 
https://docs.google.com/presentation/d/1lSzs0hmy6-aqAyRhg30wSoxdjly1RsyfmK0vQfS03Ns/edit?usp=sharing

We can use the pandas `read_csv` function to read the train and test data.
Since the train and test files are already in our working directory,
we don't have to specify the full path.

In [None]:
train = pd.read_csv("titanic_train.csv") 
test = pd.read_csv("titanic_test.csv")

In [None]:
train.head(5)

In [None]:
train.dtypes.to_frame().transpose()

To avoid performing feature engineering separately on the training and testing data, we will concatenate ("concat") the train and test sets.

axis = 0 means concatenating row-wise
axis = 1 means concatenating column-wise
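
As a quick illustration of the two axes, here is a minimal sketch on two toy frames. The frames `toy_train` and `toy_test` and their values are made up for this example; they only stand in for the real train and test sets.

```python
import pandas as pd

# Toy frames standing in for the train and test sets (hypothetical values)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3]})

# axis=0 stacks the rows; a column missing from one frame is filled with NaN
rows = pd.concat([toy_train, toy_test], axis=0, ignore_index=True, sort=True)

# axis=1 places the frames side by side, aligning them on the row index
cols = pd.concat([toy_train, toy_test], axis=1)

print(rows.shape)  # 3 rows, 2 columns
print(cols.shape)  # 2 rows, 3 columns
```

Note that the test row has no `Survived` value after the row-wise concat, which is what later lets us tell train rows from test rows.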

In [None]:
# Create full DataFrame
full = pd.concat([train,test],axis=0,ignore_index=True,sort=True)

# Create submission example
sub_example = full[["PassengerId","Survived"]]

# Imputing Missing Values

Since some ML algorithms are not robust to missing values, we need to remove NAs from our data set.

Our data set is very small, so we cannot afford to drop columns with missing values.

In such cases, we use NA imputing to resolve this issue.

Imputing changes the value of NAs to something which the ML algorithm can understand, while mitigating the risk of adding *influential* values.

For continuous features, we usually impute missing values with the mean or the median of the **training** data.

For categorical features, we usually impute missing values with a new category called "Missing".



In [None]:
# Running a for loop over the numeric (float and int) columns:
for col in full.select_dtypes(["float","int"]).columns: 
    if col != "Survived": # Making sure we are not imputing the Survived values
        # Imputing to the mean of the training set
        full.loc[full[col].isna(),col] = train[col].mean() 

Run a for loop over the columns of object dtype, and impute missing values with a new category called "Missing".
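
One possible way to approach this exercise is sketched here on a toy frame rather than on `full`. The frame `toy` and its values are made up. The sketch selects every non-numeric column with `exclude="number"`, which is an assumption on my part: it covers object columns and also behaves the same on pandas versions that store strings in a dedicated string dtype.

```python
import pandas as pd

# Toy frame standing in for `full` (hypothetical values)
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46"],
    "Embarked": ["S", "C", None],
    "Age": [22.0, 38.0, 26.0],
})

# Loop over the non-numeric columns and fill NAs with a new category
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna("Missing")

print(toy)
```

The numeric column `Age` is left untouched, so this loop can run after the numeric imputation without interfering with it.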

In [None]:
### Fill Code:


In [None]:
full.isna().sum()

# Feature Engineering


In [None]:
full.head(10)

In this section, we will create features that are not present in the data set.
For example, the data set doesn't include a feature indicating whether a passenger is married.
We can add an approximation of this feature to our model with the following code:

In [None]:
# Manual Features Example
full["IsMarriedMan"] = ((full["Sex"] == "male")&(full["SibSp"]>0)&(full["Age"]>18))*1

## Let's think of some other features which will be useful...

In [None]:
### Create Features Here:


What are some other features that you think we should add into our model?
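
As one possible answer, here is a sketch of two features that are often tried on this data set: family size and a flag for passengers travelling alone. The names `FamilySize` and `IsAlone` are my own choice, and the toy frame uses made-up values; only the `SibSp` and `Parch` columns come from the data.

```python
import pandas as pd

# Toy frame with the two family-related columns (hypothetical values)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Flag passengers with no family members on board
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Whether these features actually help is something to check on the validation set.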

## String Features

Let's create a feature that extracts the title from the name:

In [None]:
# Extract the title: drop everything up to ", " and everything from the first "." onward
full["Title"] = full["Name"].str.replace(r"(.*, )|(\..*)", "", regex=True)

In [None]:
# Count how many passengers carry each title
dist = full["Title"].value_counts()

In [None]:
dist

## Avoiding overfitting to granular data

We see that many categorical levels have very few data points.

We should merge them into more common levels to avoid overfitting.
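
The notebook merges the rare titles by hand. As an alternative, here is a sketch of a threshold-based approach that lumps every rare level into a single catch-all category. The threshold `min_count`, the label `"Other"`, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Toy titles (hypothetical values)
titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Capt", "Don"])

min_count = 2  # hypothetical threshold: levels seen fewer times are "rare"

# Count each level and collect the ones below the threshold
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep common levels as they are, replace rare ones with "Other"
grouped = titles.where(~titles.isin(rare), "Other")

print(grouped.value_counts())
```

Mapping by hand keeps more information (a "Capt" is still treated as a "Mr"), while the threshold approach needs no domain knowledge and scales to columns with many levels.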

In [None]:
# Merge rare titles into the common ones
full.loc[full.Title.isin(["Don","Capt","Major","Col","Jonkheer"]),"Title"] = "Mr"
full.loc[full.Title.isin(["Ms","the Countess","Lady"]),"Title"] = "Miss"
full.loc[full.Title.isin(["Mme","Mlle","Dona"]),"Title"] = "Mrs"
full.loc[(full.Title=="Sir"),"Title"] = "Mr"

In [None]:
dist_2 = full["Title"].value_counts()
dist_2

# Is 8 too little? That is for you to decide...

## Outliers

Let's try to find unusual outliers. Many ML algorithms are sensitive to outliers, and clipping extreme values often helps.
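
To make "clipping" concrete, here is a minimal sketch on toy data. The cap is taken from a quantile of the training values only, in keeping with the rule of not peeking at the test data. The frames, the values, and the choice of the 0.75 quantile are assumptions for this example; on real data a higher quantile such as 0.99 may be more sensible.

```python
import pandas as pd

# Toy frames standing in for `train` and `full` (hypothetical values)
toy_train = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 512.33]})
toy_full = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 512.33, 300.0]})

# Learn the cap from the training data only (hypothetical quantile choice)
upper = toy_train["Fare"].quantile(0.75)

# Values above the cap are replaced by the cap; the rest are unchanged
toy_full["FareClipped"] = toy_full["Fare"].clip(upper=upper)

print(toy_full)
```

Storing the result in a new column keeps the original values available for comparison.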

In [None]:
# Plots
for col in train.select_dtypes(["float","int"]).columns: 
    train[col].plot(kind="hist")
    plt.title(col)
    plt.show()

### Age

In [None]:
full.nsmallest(5, "Age")  # The five youngest passengers

In [None]:
train.loc[train["Age"]<2] # Only looking at training data! No Cheating :) 

In [None]:
## Make feature for baby? Clip low values


### Fare

In [None]:
# Fare
train.loc[train["Fare"]>200]

In [None]:
# Make new feature?


## Final Steps

In [None]:
# Drop useless features
full.drop(["Name","Ticket","Cabin",'PassengerId',"Embarked"],axis=1,inplace=True)

In [None]:
# One Hot Encode Variables
full = pd.get_dummies(full)

In [None]:
# Split Data
train_fe = full[~full.Survived.isna()].loc[0:599]
valid_fe = full[~full.Survived.isna()].loc[600:891]
test_fe = full[full.Survived.isna()]

# Model Fitting

## Next Time...