# Logistic Regression - processing the Titanic Dataset

In this notebook, we explore the Titanic dataset (EDA) and preprocess it,  
so that the next notebook can fit a logistic regression classifier.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from matplotlib import rcParams

In [None]:
%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sns.set_style('whitegrid')

## EDA

In [None]:
# Import the dataset
url = 'https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv'
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
titanic.head()

Here’s the data dictionary, so we can better understand the columns:

- PassengerId: passenger identifier (integer)

- Survived: whether the passenger survived (0 = no, 1 = yes)

- Pclass: travel class of the passenger (1 = 1st, 2 = 2nd, 3 = 3rd)

- Name: the name of the passenger

- Sex: gender

- Age: age of the passenger

- SibSp: number of siblings/spouses aboard

- Parch: number of parents/children aboard

- Ticket: ticket number

- Fare: the fare the passenger paid

- Cabin: cabin number

- Embarked: the port at which the passenger embarked.  
        - C: Cherbourg, S: Southampton, Q: Queenstown

In [None]:
# Distribution of target class
sns.countplot(x='Survived', data=titanic, palette='hls');

In [None]:
# Missing values
titanic.isnull().sum().sort_values(ascending=False)

In [None]:
titanic.info()

OK, so there are only 891 rows in the titanic data frame.


Cabin is almost entirely missing values, so we can drop that variable completely. But what about Age? Age seems like a relevant predictor for survival, right? We want to keep this variable, even though it has 177 missing values.


We are going to need a way to approximate those missing values!
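
To judge which columns are worth keeping, it can help to look at the share of missing values rather than the raw counts (in the full data, 177 of 891 ages is roughly 20%). Here is a minimal sketch on a made-up toy frame; the frame `toy` and its values are assumptions for illustration only, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (all values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() yields booleans; the column-wise mean is the fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly missing (like Cabin here) is a candidate for dropping, while a column with a moderate share (like Age) is a candidate for imputation.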

#### Dropping missing values: 


So let's just go ahead and drop all the variables that aren't relevant for predicting survival. We should at least keep the following:

Survived - This variable is obviously relevant.

Pclass - Does a passenger's class on the boat affect their survivability?

Sex - Could a passenger's gender impact their survival rate?

Age - Does a person's age impact their survival rate?

SibSp - Does the number of relatives on the boat (that are siblings or a spouse) affect a person's survivability? Probably.

Parch - Does the number of relatives on the boat (that are children or parents) affect a person's survivability? Probably.

Fare - Does the fare a person paid affect their survivability? Maybe - let's keep it.

Embarked - Does a person's point of embarkation matter? It depends on how the boat was filled... Let's keep it.

What about a person's name, ticket number, and passenger ID number? For now they're irrelevant for predicting survivability. And as you recall, the cabin variable is almost all missing values, so we can just drop all of these.

In [None]:
# Pass axis by keyword; the positional form is no longer accepted in pandas 2.x
titanic_data = titanic.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
titanic_data.head()

In [None]:
sns.boxplot(x='Pclass', y='Age', data=titanic_data, palette='hls');

Speaking roughly, we could say that the younger a passenger is, the more likely it is for them to be in 3rd class. The older a passenger is, the more likely it is for them to be in 1st class. So there is a loose relationship between these variables, and we can use it to approximate a passenger's age based on their class. From the box plot, it looks like the median age of 1st class passengers is about 37, of 2nd class passengers about 29, and of 3rd class passengers about 24.

So let's write a function that finds each null value in the Age variable, and for each null, checks the value of Pclass and assigns an age according to the median age of passengers in that class.
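
As an aside, the same idea can be written without hard-coded ages by letting pandas compute the per-class median and fill the gaps in one step. This is only a sketch on a made-up toy frame (the frame `toy` and the column `Age_filled` are assumptions for illustration), not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two classes, each with one missing age (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, 40.0, np.nan, 20.0, 28.0, np.nan],
})

# transform("median") returns one value per row: the median age of that
# row's class, ignoring missing ages.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; known ages stay untouched.
toy["Age_filled"] = toy["Age"].fillna(class_median)
print(toy)
```

The trade-off: this version adapts automatically if the data changes, while the hand-written function makes the chosen ages explicit.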

In [None]:
# Check for median age per class
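# Restrict to Age: Sex and Embarked are still strings, so a median over all columns fails in pandas 2.x
titanic_data.groupby('Pclass')['Age'].median()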

In [None]:
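def age_approx(cols):
    """Return Age, or the approximate median age of the passenger's class if Age is missing."""
    # Access by label; positional access such as cols[0] is deprecated on a labelled Series
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age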

When we apply the function and check again for null values, we see that there are no more null values in the age variable.

In [None]:
titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(age_approx, axis=1)
titanic_data.isnull().sum()

There are 2 null values in the Embarked variable. We can drop those 2 records without losing too much important information from our dataset, so we will do that.
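
If you want to make explicit which column decides whether a row is dropped, `dropna` accepts a `subset` argument. A small sketch on a made-up toy frame (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one missing port of embarkation (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Embarked": ["S", None, "C", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Drop only the rows where Embarked is missing.
cleaned = toy.dropna(subset=["Embarked"])
print(len(toy), "->", len(cleaned))
```

When Embarked is the only column with nulls left, this gives the same result as calling `dropna()` on the whole frame.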

In [None]:
titanic_data.dropna(inplace=True)
titanic_data.isnull().sum()

The next thing we need to do is reformat our variables so that they work with the model.
Specifically, we need to reformat the Sex and Embarked variables into numeric variables.
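
To see what the dummy encoding does, and why one level can be dropped, here is a small sketch on made-up toy data. The frame `toy` is an assumption for illustration, and `dtype=int` is just a choice to get 0/1 output instead of booleans.

```python
import pandas as pd

# Toy column with the three ports (values are made up).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category.
full = pd.get_dummies(toy["Embarked"], dtype=int)

# drop_first=True removes the first category (C); a row with Q=0 and S=0
# is then known to be C, so no information is lost.
reduced = pd.get_dummies(toy["Embarked"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one level avoids a set of indicator columns that always sum to 1, which would be perfectly collinear with the model's intercept.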

In [None]:
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
gender = pd.get_dummies(titanic_data['Sex'], drop_first=True, dtype=int)
gender.head()

In [None]:
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
embark_location = pd.get_dummies(titanic_data['Embarked'], drop_first=True, dtype=int)
embark_location.head()

In [None]:
titanic_data.head()

In [None]:
titanic_data.drop(['Sex', 'Embarked'],axis=1,inplace=True)
titanic_data.head()

In [None]:
titanic_dmy = pd.concat([titanic_data,gender,embark_location],axis=1)
titanic_dmy.head()

### Checking for independence between features

In [None]:
sns.heatmap(titanic_dmy.corr());

In [None]:
# Fare and Pclass are not independent of each other, so I am going to drop one of these. 

titanic_dmy.drop(['Fare'] ,axis=1,inplace=True)
titanic_dmy.head()
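
A heatmap is easy to scan, but the strongly correlated pairs can also be listed numerically. The sketch below uses a made-up toy frame, and the cut-off `threshold = 0.7` is an arbitrary assumption for illustration, not a rule.

```python
import pandas as pd

# Toy data in which fare falls as the class number rises (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
    "SibSp": [0, 1, 0, 1, 1, 0],
})

corr = toy.corr()
threshold = 0.7  # arbitrary cut-off chosen for illustration

# Walk the upper triangle so that each pair is reported only once.
cols = corr.columns
strong_pairs = [
    (cols[i], cols[j], round(corr.iloc[i, j], 2))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(strong_pairs)
```

Any pair that shows up in such a list is a candidate for dropping one of its two features.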

In [None]:
# Seven predictor columns remain (Pclass, Age, SibSp, Parch, male, Q, S). The rule of thumb is 50 records per feature...
# so we need to have at least 350 records in this dataset. Let's check again.
# OK, we have 889 records, so we are fine.
titanic_dmy.info() 
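
The records-per-feature check can also be written as a tiny helper. This is a sketch only: the function name `has_enough_records` and its parameters are hypothetical, and 50 per feature is the rule of thumb used above, not a hard requirement.

```python
import pandas as pd

def has_enough_records(df, target, per_feature=50):
    """Check the rule of thumb: at least `per_feature` rows per predictor column."""
    n_features = df.drop(columns=[target]).shape[1]
    return len(df) >= per_feature * n_features

# Toy frame: 200 rows, one target column and three predictors (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1] * 100,
    "Pclass": [1, 3] * 100,
    "Age": [30.0, 22.0] * 100,
    "SibSp": [0, 1] * 100,
})

print(has_enough_records(toy, target="Survived"))                   # needs 150 rows
print(has_enough_records(toy, target="Survived", per_feature=100))  # needs 300 rows
```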

In [None]:
X = titanic_dmy.iloc[:,1:].values
y = titanic_dmy.iloc[:,0].values

In [None]:
# Store the preprocessed DataFrame so that it is available in other notebooks
%store titanic_dmy
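
In the consuming notebook, `%store -r titanic_dmy` restores the variable. Since `%store` only works inside IPython, a file-based hand-over is a possible alternative. Below is a minimal sketch using a pickle file; the toy frame and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the preprocessed data (values are made up).
toy = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Age": [22.0, 38.0]})

# Write to a temporary location; in practice you would pick a project path.
path = os.path.join(tempfile.mkdtemp(), "titanic_dmy.pkl")
toy.to_pickle(path)

# Later, possibly in another notebook or script:
restored = pd.read_pickle(path)
print(restored.equals(toy))
```

Pickle keeps the column dtypes intact, but only load pickle files that come from a source you trust.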