<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Sources" data-toc-modified-id="Sources-0.1">Sources</a></span></li></ul></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-1">Data Exploration</a></span><ul class="toc-item"><li><span><a href="#Gender-influence" data-toc-modified-id="Gender-influence-1.1">Gender influence</a></span></li></ul></li><li><span><a href="#Feature-engineering" data-toc-modified-id="Feature-engineering-2">Feature engineering</a></span><ul class="toc-item"><li><span><a href="#Titres" data-toc-modified-id="Titres-2.1">Titles</a></span></li><li><span><a href="#Remplir-les-valeurs-manquantes" data-toc-modified-id="Remplir-les-valeurs-manquantes-2.2">Filling missing values</a></span></li><li><span><a href="#Traitement-des-noms-et-des-titres" data-toc-modified-id="Traitement-des-noms-et-des-titres-2.3">Processing names and titles</a></span></li><li><span><a href="#Les-catégories-de-prix" data-toc-modified-id="Les-catégories-de-prix-2.4">Fare categories</a></span></li><li><span><a href="#Point-d'embarquement" data-toc-modified-id="Point-d'embarquement-2.5">Port of embarkation</a></span></li><li><span><a href="#Cabines" data-toc-modified-id="Cabines-2.6">Cabins</a></span></li><li><span><a href="#Classes-de-voyageur" data-toc-modified-id="Classes-de-voyageur-2.7">Passenger classes</a></span></li><li><span><a href="#Ticket" data-toc-modified-id="Ticket-2.8">Ticket</a></span></li><li><span><a href="#Familles" data-toc-modified-id="Familles-2.9">Families</a></span></li></ul></li><li><span><a href="#Train-Test-split" data-toc-modified-id="Train-Test-split-3">Train Test split</a></span></li></ul></div>

## Sources
- https://www.kaggle.com/viveksrinivasan/analyzing-titanic-dataset

In [1]:
import pandas as pd
import numpy as np

pd.options.mode.chained_assignment = None

In [2]:
def get_combined_data():
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    # the target column is not a feature, so drop it before stacking
    train = train.drop(columns='Survived')

    # stack train and test so every transformation is applied to both;
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    combined = pd.concat([train, test], ignore_index=True)

    return combined

combined = get_combined_data()

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Data Exploration

## Gender influence

In [4]:
print("Survival proportions among male passengers:")
print(train["Survived"][train["Sex"] == "male"].value_counts(normalize=True))

Survival proportions among male passengers:
0    0.811092
1    0.188908
Name: Survived, dtype: float64


In [5]:
print("Survival proportions among female passengers:")
print(train["Survived"][train["Sex"] == "female"].value_counts(normalize=True))

Survival proportions among female passengers:
1    0.742038
0    0.257962
Name: Survived, dtype: float64
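
The two cells above can be condensed into a single grouped computation. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from `train.csv`) and relies on the fact that the mean of a 0/1 column equals the share of 1s.

```python
import pandas as pd

# Toy data standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

On the real data, the same expression applied to `train` should give the survival rate per gender in one call.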


# Feature engineering

## Titles

In [6]:
def get_titles():

    global combined
    
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    
    # a map of more aggregated titles
    Title_Dictionary = {
        "Capt":"Officer", "Col":"Officer","Major":"Officer","Jonkheer":"Royalty",
        "Don":"Royalty","Sir" :"Royalty","Dr":"Officer",
        "Rev":"Officer","the Countess":"Royalty","Dona":"Royalty",
        "Mme":"Mrs","Mlle":"Miss","Ms":"Mrs","Mr" :"Mr","Mrs" :"Mrs",
        "Miss" :"Miss","Master" :"Master","Lady" :"Royalty"
    }
    
    # we map each title
    combined['Title'] = combined.Title.map(Title_Dictionary)
    
get_titles()
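
To make the string handling in `get_titles` concrete, here is a minimal sketch on three sample names written in the dataset's "Surname, Title. Given names" format. The helper name `extract_title` and the shortened mapping are assumptions for illustration only. It also shows that a title missing from the mapping becomes `NaN`, which is worth checking after the real mapping is applied.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

def extract_title(name):
    # text between the first comma and the following period
    return name.split(",")[1].split(".")[0].strip()

titles = names.map(extract_title)

# deliberately incomplete mapping (illustration only)
partial_mapping = {"Mr": "Mr", "Mrs": "Mrs"}
grouped = titles.map(partial_mapping)

# titles absent from the mapping are mapped to NaN
unmapped = titles[grouped.isna()].unique()
print(list(titles))
print(list(unmapped))
```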

## Filling missing values

In [7]:
combined["Age"] = combined.groupby(['Sex','Pclass','Title'])['Age'].transform(lambda x: x.fillna(x.median()))
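
The cell above fills each missing age with the median age of passengers sharing the same `Sex`, `Pclass` and `Title`. The sketch below reproduces the idea on made-up data with two grouping columns. It also illustrates one caveat: a group whose ages are all missing has no median, so its rows stay `NaN`. The fallback to the overall median is an assumption added here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1, 2],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0, np.nan],
})

# fill missing ages with the median of the passenger's own group
toy["Age"] = toy.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
still_missing = int(toy["Age"].isna().sum())  # the (female, 2) group had no known age

# assumed fallback: use the overall median for groups with no known age
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```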

## Processing names and titles
We one-hot encode the titles and drop the names.

In [8]:
def process_names():
    
    global combined
    # we clean the Name variable
    combined.drop('Name',axis=1,inplace=True)
    
    # encoding in dummy variable
    titles_dummies = pd.get_dummies(combined['Title'],prefix='Title')
    combined = pd.concat([combined,titles_dummies],axis=1)
    
    # removing the title variable
    combined.drop('Title',axis=1,inplace=True)
    
process_names()
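
The same encode, concatenate, drop pattern recurs in the sections below, so a small sketch on made-up data may help. Passing `dtype=int` is a choice made here to get 0/1 columns; recent pandas versions otherwise return boolean dummies.

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Mr"], "Age": [22, 38, 26, 35]})

# one indicator column per distinct title, named Title_<value>
dummies = pd.get_dummies(toy["Title"], prefix="Title", dtype=int)

# append the indicator columns and drop the original categorical column
encoded = pd.concat([toy.drop(columns="Title"), dummies], axis=1)
print(list(encoded.columns))
```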

## Fare categories
Fill the missing Fare value with the mean fare.

In [9]:
def process_fares():
    
    global combined
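    # assign the result back; an inplace fillna on a selected column may not
    # modify the DataFrame in recent pandas versions
    combined['Fare'] = combined['Fare'].fillna(combined['Fare'].mean())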
    
process_fares()

## Port of embarkation
- Fill missing values with the most frequent port
- One-hot encode the ports of embarkation

In [10]:
def process_embarked():
    
    global combined
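    # 'S' is the most frequent port; assign the result back rather than
    # relying on an inplace fillna on a selected column
    combined['Embarked'] = combined['Embarked'].fillna('S')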
    
    # dummy encoding 
    embarked_dummies = pd.get_dummies(combined['Embarked'],prefix='Embarked')
    combined = pd.concat([combined,embarked_dummies],axis=1)
    combined.drop('Embarked',axis=1,inplace=True)

process_embarked()
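
The function above hardcodes `'S'` as the most frequent port. As a sketch of how that value could be derived instead of typed in, the example below computes the mode of a made-up series and uses it as the fill value.

```python
import pandas as pd

# made-up embarkation codes with one missing entry
ports = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

# mode() returns the most frequent value(s); take the first one
most_frequent = ports.mode()[0]
filled = ports.fillna(most_frequent)
print(most_frequent, list(filled))
```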

## Cabins
- Fill missing values
- Dummy encoding
- Drop 'Cabin'

In [11]:
def process_cabin():
    
    global combined
    
    # replacing missing cabins with U (for Unknown)
    combined['Cabin'] = combined['Cabin'].fillna('U')
    
    # mapping each Cabin value with the cabin letter
    combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])
    
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(combined['Cabin'],prefix='Cabin')
    
    combined = pd.concat([combined,cabin_dummies],axis=1)
    
    combined.drop('Cabin',axis=1,inplace=True)

process_cabin()
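
Before encoding, each cabin value is reduced to its first character, the deck letter. A minimal sketch on made-up cabin strings, using the vectorized `.str[0]` accessor in place of the `lambda`:

```python
import pandas as pd

cabins = pd.Series(["C85", None, "E46", "B57 B59 B63"])

# unknown cabins become 'U', then only the leading deck letter is kept
decks = cabins.fillna("U").str[0]
print(list(decks))
```

Note that a passenger with several cabins listed keeps only the letter of the first one.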

In [12]:
def process_gender():
    
    global combined
    combined['Sex'] = combined['Sex'].map({'male':1,'female':0})
    
process_gender()

## Passenger classes

In [13]:
def process_pclass():
    
    global combined
    # encoding into 3 categories:
    pclass_dummies = pd.get_dummies(combined['Pclass'],prefix="Pclass")
    
    # adding dummy variables
    combined = pd.concat([combined,pclass_dummies],axis=1)
    
    # removing "Pclass"
    combined.drop('Pclass',axis=1,inplace=True)
    

process_pclass()

## Ticket

In [14]:
def process_ticket():
    
    global combined
    
    # extracts the prefix of a ticket; returns 'XXX' if there is none (i.e. the ticket is purely numeric)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = map(lambda t : t.strip() , ticket)
        ticket = list(filter(lambda t : not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
        
    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'],prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies],axis=1)
    combined.drop('Ticket',inplace=True,axis=1)

process_ticket()
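
To show what the prefix extraction produces, here is a standalone sketch of the same logic applied to a few ticket strings in the dataset's style. The function name `ticket_prefix` is hypothetical and only used for this illustration.

```python
def ticket_prefix(ticket):
    # remove dots and slashes, then split on whitespace
    parts = ticket.replace(".", "").replace("/", "").split()
    # keep only the tokens that are not purely numeric
    non_numeric = [p for p in parts if not p.isdigit()]
    return non_numeric[0] if non_numeric else "XXX"

samples = ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"]
prefixes = [ticket_prefix(t) for t in samples]
print(prefixes)
```

Each distinct prefix becomes its own dummy column, which accounts for a large share of the final feature count.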

## Families

Creating family variables:
- Family size.
- One-hot vector for the family size category.

In [15]:
def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Single'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallF'] = combined['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
    combined['MedF'] = combined['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
    combined['LargeF'] = combined['FamilySize'].map(lambda s: 1 if s >= 5 else 0)
    
process_family()
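
The four indicator columns partition `FamilySize` into the ranges 1, 2, 3 to 4, and 5 or more. As an alternative sketch (not the notebook's method), the same bins can be expressed with `pd.cut`, which makes the boundaries explicit in one place.

```python
import numpy as np
import pandas as pd

family_size = pd.Series([1, 2, 3, 4, 5, 8])

# right-closed intervals: (0, 1], (1, 2], (2, 4], (4, inf)
categories = pd.cut(
    family_size,
    bins=[0, 1, 2, 4, np.inf],
    labels=["Single", "SmallF", "MedF", "LargeF"],
)
labels_as_text = list(categories.astype(str))
print(labels_as_text)
```

Calling `pd.get_dummies` on `categories` would then yield one indicator column per bin.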

# Train Test split

In [16]:
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv('train.csv')
    
    targets = train0.Survived
    # the first len(train0) rows of combined are the training rows
    n_train = len(train0)
    train = combined.iloc[:n_train]
    test = combined.iloc[n_train:]
    
    return train, test, targets

xtrain_fe, xtest_fe, labels = recover_train_test_target()
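
The split works because the training rows were stacked first and the index was reset, so row position alone identifies which set a row came from. A minimal round-trip sketch on made-up frames, splitting on the training length rather than on a hardcoded row number:

```python
import pandas as pd

# made-up stand-ins for train.csv and test.csv
train = pd.DataFrame({"x": [1, 2, 3], "Survived": [0, 1, 1]})
test = pd.DataFrame({"x": [4, 5]})

targets = train["Survived"]
combined = pd.concat([train.drop(columns="Survived"), test], ignore_index=True)

# training rows come first, so slice by position
n_train = len(train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape, targets.shape)
```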

In [17]:
xtest_fe.to_csv("xtest_fe", encoding='utf-8', index=False)
xtrain_fe.to_csv("xtrain_fe", encoding='utf-8', index=False)

In [18]:
print(xtrain_fe.shape)
print(xtest_fe.shape)
print(labels.shape)

(891, 69)
(418, 69)
(891,)
