# Titanic problem

## Problem definition
- Informal: predict which passengers survived and which died;
- Formal:
    - Task: create a binary classifier that labels a Titanic passenger as survived or not;
    - Experience: data on the Titanic wreck;
    - Performance: accuracy score;
    
## Assumptions
- Column-wise assumptions:
    - Pclass: passengers of different classes were treated unequally (first class had top priority);
    - Sex and age: women and children were rescued first;
    - SibSp: two situations are possible: a man could help a woman to survive, or the couple decided to die together;
    - Parch: parents could help their children to survive;
    - Fare: steerage passengers (the cheapest tickets) had the lowest chance of surviving. Ticket prices:
        - First Class (parlor suite): £870;
        - First Class (berth): £30;
        - Second Class: £12;
        - Third Class: £3 to £8;
    - Cabin: the evacuation took place from the boat deck. The nearer a passenger was to this deck, the greater the chance of surviving;
    - Embarked: English-speaking passengers had a better chance of surviving, and the port of embarkation may be a proxy for language: S is Southampton (England), C is Cherbourg (France), Q is Queenstown (Ireland, English-speaking).
- The dataset contains information only on passengers;
- The dataset contains information only on passengers who were aboard during the accident;
- The title of the passenger can matter for survival;
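
The last assumption relies on the title embedded in each `Name` value. As a minimal sketch, assuming the names follow the `Last, Title. First` layout seen in the sample rows, the title can be pulled out with a regular expression. The helper `extract_title` is a hypothetical name used only for illustration; it is not part of the notebook.

```python
import re

# Hypothetical helper; assumes the "Last, Title. First" layout of the Name column
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the text between the comma and the first dot, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(n) for n in names]
print(titles)  # ['Mr', 'Mrs', 'Miss']
```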

## Solution
- Data Selection:
    - Pclass, Sex and Age: it is widely known that different classes of passengers were treated unequally and that women and children were rescued first;
    - SibSp and Parch: check whether there is a statistically significant difference between the survived and not-survived groups;
    - Ticket and Fare: cannot be used on their own, but they might help to fill in missing cabin values;
    - Cabin: see the Assumptions section;
    - Embarked: the chance of surviving may have depended on the passenger's language. People who embarked in France might not have spoken English;
- Handling missing values in the following columns:
    - Age: build a binary classifier to predict the age group;
    - Cabin: this feature is skipped because the number of missing values in the train and holdout data sets is significant;
    - Embarked: fill with the mode;
- Explore relationships between variables:
    - Is there a statistically significant difference in survival rate between passengers who had siblings/spouses aboard and those who did not? Between those who had parents/children aboard and those who did not? Between passengers with different titles?
    - Is there a statistically significant difference in survival rate between different ports?
    - Is there a statistically significant difference in class between different ports?
- Features generation:
    - Last name, first name, title, spouse ID from Name column;
    - First class, Third class (dummies) from Pclass;
    - Female (dummy) from Sex;
    - S, Q (dummies) from Embarked;
    - SibSp/NoSibSp (dummy) from SibSp;
    - Parch/NoParch (dummy) from Parch;
    - Adult man from Sex and Age;
- Model fitting and optimization:
    - Method parameters: train, validation, algo, params(dict);
    - Method algo:
        - Fit model to the train set;
        - Make predictions;
        - Calculate metric;
        - Print information;
- Model tuning:
    - Method parameters: train, validation, algo, params;
    - Method algo:
        - Fit model to the train set;
        - Make predictions;
        - Calculate metric;
        - Print information;
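
The questions under "Explore relationships between variables" ask whether a rate differs significantly between two groups. One common way to check this is a two-proportion z-test. The sketch below is an assumption about how such a check could look, not the method used in the notebook; the function name is hypothetical and the counts are made up for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts, for illustration only (not taken from the data set)
z, p_value = two_proportion_z_test(success_a=60, n_a=100, success_b=40, n_b=100)
print(round(z, 3), round(p_value, 4))
```

A small p-value (for example below 0.05) would suggest that the two groups do not share the same survival rate.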
        
## Software development
- Class Titanic_Dataset:
    - Take pandas DataFrames train and holdout as parameters;
    - Divide the train DataFrame into train and validation parts;
    - Store the train, validation and holdout DataFrames;
    - Inherit all methods from the pandas DataFrame class;
- Filling missing values:
    - Embarked (9);
    - Age (11);
- Feature generation:
    - First class (1, dummy);
    - Third class (2, dummy);
    - Last name from Name (3);
    - First name from Name (4);
    - Title from Name (5);
    - Female (6, dummy);
    - SibSp/NoSibsp from SibSp (7, dummy);
    - Parch/NoParch from Parch (8, dummy);
    - Cherbourg from Embarked (10, dummy);
    - Adult man from Sex and Age (12, dummy);
- Data visualization and statistical exploration (if necessary):
    - Survival rate SibSp vs NoSibSp (14);
    - Survival rate Parch vs NoParch (15);
    - Survival rate between different titles (16);
    - Survival rate between different ports (17);
    - Class between different ports (18);
- Model fitting and optimization:
    - Algo (19);
- Model tuning:
    - Algo (20);
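
The dummy features listed above all follow one of two rules: the new column is 1 when the source value equals a reference value, or 1 when it exceeds a threshold. Below is a minimal pure-Python sketch of that rule on a list of records. The function `add_dummy` is a hypothetical stand-in (the notebook works on pandas DataFrames), and the three records reuse the `Pclass`, `Sex` and `SibSp` values of the first three rows of `train.head()`.

```python
def add_dummy(rows, col, value, new_name, greater_than=False):
    """Add a 0/1 column to every record (dict) in place and return the rows."""
    for row in rows:
        if greater_than:
            row[new_name] = 1 if row[col] > value else 0
        else:
            row[new_name] = 1 if row[col] == value else 0
    return rows

passengers = [
    {"Pclass": 3, "Sex": "male", "SibSp": 1},
    {"Pclass": 1, "Sex": "female", "SibSp": 1},
    {"Pclass": 3, "Sex": "female", "SibSp": 0},
]
add_dummy(passengers, "Pclass", 1, "is_1class")
add_dummy(passengers, "Sex", "female", "is_female")
add_dummy(passengers, "SibSp", 0, "have_sibsp", greater_than=True)

flags = [(p["is_1class"], p["is_female"], p["have_sibsp"]) for p in passengers]
print(flags)  # [(0, 0, 1), (1, 1, 1), (0, 1, 0)]
```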

## Age prediction problem

### Problem definition
- Informal: identify whether the passenger is a child or not;
- Formal:
    - Task: build a binary classifier to predict the age group of passengers;
    - Experience: data on Titanic passengers whose age is not missing;
    - Performance: ROC-AUC score;
    
### Assumptions
- According to Encyclopedia Titanica, a person was treated as a child if he/she was 14 years old or younger;
- Is there any difference in the share of children among:
    - Different classes of passengers;
    - Different genders;
    - Different titles;
    - Passengers who have siblings/spouses or parents/children aboard;
    - Different ports;
    
### Solution
- Create the same dummy variables as for the Titanic problem;
- Check relationships between different variables and age group;
- Build binary classifier to predict age group;
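
As a minimal sketch of the target definition, assuming the 14-year threshold from the assumptions above: passengers with a known age are labeled and can be used for training, while passengers with a missing age are the ones to predict. The function name and the age values are illustrative only.

```python
CHILD_MAX_AGE = 14  # threshold stated in the assumptions above

def age_group(age):
    """Return 'child', 'adult', or None when the age is missing."""
    if age is None:
        return None
    return "child" if age <= CHILD_MAX_AGE else "adult"

ages = [22.0, 38.0, 14.0, None, 2.0]  # illustrative values
groups = [age_group(a) for a in ages]
print(groups)  # ['adult', 'adult', 'child', None, 'child']

# Rows with a known age form the training data; the rest are to be predicted
n_known = sum(g is not None for g in groups)
n_to_predict = len(groups) - n_known
print(n_known, n_to_predict)  # 4 1
```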

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import math
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.svm import SVC
import warnings

warnings.filterwarnings("ignore")

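# Use full option names; short names such as 'max_columns' are ambiguous in recent pandas
pd.set_option('display.max_columns', 160)
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_colwidth', 5000)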

train = pd.read_csv('train.csv')
holdout = pd.read_csv('test.csv')
train.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Definition of supporting functions

In [14]:
# Define functions for name processing
def process_mrs_with_par(name):
    """(str) -> str
    
    Return first name of the name for Mrs and Lady passengers whose first name is
    in the parentheses.
    
    >>> process_mrs_with_par('Futrelle, Mrs. Jacques Heath (Lily May Peel)')
    'Lily May Peel'
    >>> process_mrs_with_par('Watt, Mrs. James (Elizabeth "Bessie" Inglis Milne)')
    'Elizabeth Inglis Milne'
    >>> process_mrs_with_par('Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")')
    'Lucille Christiana Sutherland'
    """
    # extract the string within the first parentheses, without the parentheses
    first_name = re.search(r'\((.*?)\)', name).group(1)

    # if the resulting string contains a part in quotes
    if re.search('"(.*)"', first_name):
        # remove the quoted part (and the space before it)
        first_name = re.sub(' "(.*)"', '', first_name)
    return first_name

def process_other_names(name):
    """(str) -> str
    
    Return first name from name. 
    
    >>> process_other_names('Sawyer, Mr. Frederick Charles')
    'Frederick Charles'
    >>> process_other_names('Bradley, Mr. George ("George Arthur Brayton")')
    'George'
    >>> process_other_names('Petranec, Miss. Matilda')
    'Matilda'
    >>> process_other_names('O\\'Dwyer, Miss. Ellen "Nellie"')
    'Ellen'
    >>> process_other_names('Masselmani, Mrs. Fatima')
    'Fatima'
    """
    # take all characters after the first dot
    first_name = re.search(r'\.(.*)', name).group(1)
    # if the string contains a part in parentheses or quotes:
    if re.search(r'[("](.*)[)"]', first_name):
        # remove that part, keeping only the text before it
        first_name = re.sub(r'[("](.*)[)"]', "", first_name)
    # strip surrounding spaces so that the result matches the doctests
    return first_name.strip()

def process_name(name):
    """(str) -> str
    
    Return first name from name. If title Mrs return name in parentheses without them.
    Else return name before parentheses or all characters.
    
    >>> process_name('Sawyer, Mr. Frederick Charles')
    'Frederick Charles'
    >>> process_name('Bradley, Mr. George ("George Arthur Brayton")')
    'George'
    >>> process_name('Petranec, Miss. Matilda')
    'Matilda'
    >>> process_name('O\\'Dwyer, Miss. Ellen "Nellie"')
    'Ellen'
    >>> process_name('Futrelle, Mrs. Jacques Heath (Lily May Peel)')
    'Lily May Peel'
    >>> process_name('Watt, Mrs. James (Elizabeth "Bessie" Inglis Milne)')
    'Elizabeth Inglis Milne'
    >>> process_name('Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")')
    'Lucille Christiana Sutherland'
    >>> process_name('Masselmani, Mrs. Fatima')
    'Fatima'
    """
    if re.search('Mrs|Lady', name):
        try:
            first_name = process_mrs_with_par(name)
        except AttributeError:
            first_name = process_other_names(name)
    else:
        first_name = process_other_names(name)
    
    return first_name.strip()

## Definition of Titanic_Dataset class

In [15]:
class Titanic_Dataset():
    
    def __init__(self, train_data, holdout_data):
        # Plain instance method (not a classmethod), so every object
        # keeps its own train and holdout data
        self.train = train_data[['PassengerId', 'Pclass', 'Name',
                           'Sex', 'Age', 'SibSp', 'Parch',
                           'Ticket', 'Fare', 'Cabin',
                           'Embarked', 'Survived']].copy()
        self.holdout = holdout_data.copy()
    
    def get_dummy(self, col, value, new_name, comp=False):
        # Create dummy column new_name: 1 if df[col] == value
        # (or df[col] > value when comp=True), else 0
        try:
            for df in (self.train, self.holdout):
                if not comp:
                    df[new_name] = df[col].apply(lambda x: 1 if x == value else 0)
                else:
                    df[new_name] = df[col].apply(lambda x: 1 if x > value else 0)
        except KeyError:
            # The source column is missing in one of the data sets
            pass
            
    def parse_name(self):
        # Parse name column to last_name, title and first_name columns
        for df in (self.train, self.holdout):
            # Create last_name column
            df['last_name'] = df['Name'].apply(lambda x: re.search(r'(.*)\,', x).group(1))
            # Create title column
            df['title'] = df['Name'].apply(lambda x: re.search(r'\, (.*)\.', x).group(1))
            # Create first_name column
            df['first_name'] = df['Name'].apply(process_name)
    
    def fill_embarked(self):
        # Fill missing ports with the most frequent value of each data set
        for df in (self.train, self.holdout):
            df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    
    def drop_columns(self, columns):
        for df in (self.train, self.holdout):
            df.drop(columns, axis=1, inplace=True)
    
    def exp_rel(self, var):
        # Draw one bar chart per value of var with the share of is_adult classes
        values = self.train[var].unique()
        # Number of subplot rows for a grid with three columns
        f = lambda x: 1 if x <= 3 else math.ceil(x / 3)
        fig = plt.figure(figsize=(9, 3*f(len(values))))
        for i, value in enumerate(sorted(values)):
            fig.add_subplot(f(len(values)), 3, i+1)
            data = self.train[self.train[var]==value]
            chart = sns.barplot(x='is_adult', y="is_adult",data=data,
                             estimator=lambda x: len(x) / len(data) * 100)
            chart.tick_params(bottom=False, top=False, left=False, right=False)
            chart.axes.get_yaxis().set_visible(False)
            plt.xlabel('')
            plt.ylabel('')
            plt.title('Value of variable: ' + str(value))
            plt.suptitle('The variable is ' + var, y=1.1, fontsize=15)
            sides = ['left', 'right', 'top']
            for side in sides:
                chart.spines[side].set_visible(False)
            for p in chart.patches:
                chart.annotate(str(np.round(p.get_height(),
                                            decimals=2)) + '%',
                               (p.get_x() + p.get_width()/2,
                                p.get_height() + 1),
                               horizontalalignment='center')
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def balance(data, col):
        # Find the size of the smallest class
        smallest = min(data[col].value_counts().values)
        # Find the number of rows to resample
        num_resample = max(data[col].value_counts().values) - smallest
        # Draw that many extra rows (with replacement) from the whole data set
        sample = data.sample(num_resample, replace=True, random_state=1)
        # return new dataset with resampled rows
        return pd.concat([data, sample], axis=0)
    
    @staticmethod
    def sens_spec_score(true_labels, predictions):
        tn, fp, fn, tp = confusion_matrix(true_labels, predictions).ravel()
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn+fp)
        return sensitivity, specificity
    
    def fit_and_optimize(self, model, X, y, bln=False):
        # Divide dataset to train and validation parts
        self.train_fit, self.test_fit = train_test_split(self.train[X + y], test_size=0.33)
        if bln:
            self.train_fit = self.balance(self.train_fit, y[0])
        
        # Use k-fold validation
        kf = KFold(10, shuffle=True)
        
        roc_score = list()
        sens_score = list()
        spec_score = list()
        for train_index, test_index in kf.split(self.train_fit):
            # Fit model to the train set
            fold_train = self.train_fit.iloc[train_index]
            fold_test = self.train_fit.iloc[test_index]
            model.fit(fold_train[X], fold_train[y])
            # Make predictions;
            predictions = model.predict(fold_test[X])
            # Calculate metric;
            roc_score.append(roc_auc_score(fold_test[y], predictions))
            sens, spec = self.sens_spec_score(fold_test[y], predictions)
            sens_score.append(sens)
            spec_score.append(spec)
        # Find the average metric and print it
        print('Average ROC-AUC {1}; Model: {0}'.format(type(model).__name__,
                                                       np.round(np.mean(roc_score), 5)))
        print('Average Sensitivity {}'.format(np.round(np.mean(sens_score), 5)))
        print('Average Specificity {}'.format(np.round(np.mean(spec_score), 5)))
        return self.train_fit[X + y], self.test_fit[X + y] 
    
    def tunning(self, model, grid_params, X, y):
        # Instantiate the grid search model
        grid_search = GridSearchCV(estimator = model,
                                   scoring = 'roc_auc',
                                   param_grid = grid_params, 
                                   cv = 5)
        # Fit the grid search to the data
        grid_search.fit(self.train_fit[X], self.train_fit[y])
        # fit model with best parameters
        model = grid_search.best_estimator_
        # make predictions
        model.fit(self.train_fit[X], self.train_fit[y])
        predictions = model.predict(self.test_fit[X])
        # calculate sensitivity and specificity
        sens, spec = self.sens_spec_score(self.test_fit[y], predictions)
        # print results
        print('ROC-AUC {}'.format(roc_auc_score(self.test_fit[y], predictions)))
        print('Sensitivity {}'.format(sens))
        print('Specificity {}'.format(spec))
        print(grid_search.best_params_)
        return model

In [16]:
test_train = train.copy()
test_holdout = holdout.copy()
titanic = Titanic_Dataset(test_train, test_holdout)
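
The `sens_spec_score` method above derives sensitivity and specificity from the confusion matrix. As a small worked example, the same two quantities can be computed by hand in plain Python; the labels below are made up, and the function name is hypothetical.

```python
def sensitivity_specificity(true_labels, predictions):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels: four positives (one missed), two negatives (one false alarm)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 0.5
```

In the age problem the positive class is `is_adult == 1`, so sensitivity is the share of adults recognized as adults and specificity is the share of children recognized as children.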

## Filling missing values in age column

First, we will initialize an object of the Titanic_Dataset class and add various dummy columns to the data sets.

In [17]:
# Instantiate class object
age = Titanic_Dataset(train[~train['Age'].isnull()],
                      train[train['Age'].isnull()].drop('Age', axis=1))

# Add column is_1class
age.get_dummy('Pclass', 1, 'is_1class')
# Add column is_3class
age.get_dummy('Pclass', 3, 'is_3class')
# Add columns last_name, title, first_name
age.parse_name()
# Add column is_female
age.get_dummy('Sex', 'female', 'is_female')
# Add column have_sibsp
age.get_dummy('SibSp', 0, 'have_sibsp', comp=True)
# Add column have_parch
age.get_dummy('Parch', 0, 'have_parch', comp=True)
# Fill missing Embarked values before deriving the port dummies
age.fill_embarked()
# Add column cher
age.get_dummy('Embarked', 'C', 'cher')
# Add column sout
age.get_dummy('Embarked', 'S', 'sout')
# Add column is_adult which is target variable for this part
age.get_dummy('Age', 14, 'is_adult', comp=True)

Second, we will explore the relationship between the different variables and the target variable. A dedicated method of the class, `exp_rel`, draws the bar charts for this.

In [18]:
age.train.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Survived', 'is_1class',
       'is_3class', 'last_name', 'title', 'first_name', 'is_female',
       'have_sibsp', 'have_parch', 'cher', 'sout', 'is_adult'],
      dtype='object')

In [38]:
# cols = ['Pclass', 'Sex', 'Embarked', 'is_1class',
#         'is_3class', 'have_sibsp', 'have_parch', 'cher', 'sout']
cols = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'is_1class',
        'is_3class', 'title', 'is_female', 'have_sibsp', 'have_parch',
        'cher', 'sout']
for col in cols:
    print(age.holdout[col].value_counts(normalize=True))

3    0.768362
1    0.169492
2    0.062147
Name: Pclass, dtype: float64
male      0.700565
female    0.299435
Name: Sex, dtype: float64
0    0.774011
1    0.146893
8    0.039548
3    0.022599
2    0.016949
Name: SibSp, dtype: float64
0    0.887006
2    0.067797
1    0.045198
Name: Parch, dtype: float64
S    0.508475
Q    0.276836
C    0.214689
Name: Embarked, dtype: float64
0    0.830508
1    0.169492
Name: is_1class, dtype: float64
1    0.768362
0    0.231638
Name: is_3class, dtype: float64
Mr        0.672316
Miss      0.203390
Mrs       0.096045
Master    0.022599
Dr        0.005650
Name: title, dtype: float64
0    0.700565
1    0.299435
Name: is_female, dtype: float64
0    0.774011
1    0.225989
Name: have_sibsp, dtype: float64
0    0.887006
1    0.112994
Name: have_parch, dtype: float64
0    0.785311
1    0.214689
Name: cher, dtype: float64
1    0.508475
0    0.491525
Name: sout, dtype: float64


In [50]:
prob_cols = ['Pclass']#, 'Sex', 'SibSp', 'Parch', 'title']
for col in prob_cols:
    print('Conditional probabilities')
    print(age.train[age.train['is_adult']==1][col].value_counts())
    print()
    
    print(age.train[age.train['is_adult']==0][col].value_counts())
    print()
    
    print('Common probabilities')
    print(age.train[col].value_counts())
    print('<---------------------------------------->')

Conditional probabilities
3    302
1    181
2    154
Name: Pclass, dtype: int64

3    53
2    19
1     5
Name: Pclass, dtype: int64

Common probabilities
3    355
1    186
2    173
Name: Pclass, dtype: int64
<---------------------------------------->


### Observations

The percentage of children differs across the values of Pclass, Sex, is_1class, is_3class, have_sibsp and have_parch. Because is_1class and is_3class encode the same information as Pclass, only these two dummy variables will be used; likewise, is_female will be used in place of Sex. The data set is also imbalanced on the `is_adult` column (637 adults versus 77 children in the Pclass counts above). The `balance` method of the class is meant to tackle this.
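
One common way to deal with such imbalance is random oversampling of the minority class. The sketch below illustrates that general technique in plain Python under the assumption of exactly two classes; it is not the notebook's `balance` method, which draws its extra rows from the whole data set rather than from the minority class only. The function name and the toy records are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(rows, label_key, seed=1):
    """Duplicate random minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    counts = Counter(row[label_key] for row in rows)
    # Assumes exactly two classes are present
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    minority_rows = [row for row in rows if row[label_key] == minority]
    # Draw the missing number of rows with replacement
    extra = rng.choices(minority_rows, k=n_major - n_minor)
    return rows + extra

# Toy records: eight adults and two children
rows = [{"is_adult": 1}] * 8 + [{"is_adult": 0}] * 2
balanced = oversample_minority(rows, "is_adult")
class_counts = Counter(row["is_adult"] for row in balanced)
print(class_counts[1], class_counts[0])  # 8 8
```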

In [20]:
np.random.seed(1)
X = ['is_1class', 'is_3class', 'is_female', 'have_sibsp', 'have_parch']
y = ['is_adult']

for model in [LogisticRegression(), DecisionTreeClassifier(),
             RandomForestClassifier(), MLPClassifier(), SVC(), BernoulliNB()]:
    age.fit_and_optimize(model, X, y)

Average ROC-AUC 0.58844; Model: LogisticRegression
Average Sensitivity 0.96689
Average Specificity 0.21
Average ROC-AUC 0.53491; Model: DecisionTreeClassifier
Average Sensitivity 0.94721
Average Specificity 0.12262
Average ROC-AUC 0.68138; Model: RandomForestClassifier
Average Sensitivity 0.94998
Average Specificity 0.41278
Average ROC-AUC 0.69263; Model: MLPClassifier
Average Sensitivity 0.94985
Average Specificity 0.4354
Average ROC-AUC 0.49783; Model: SVC
Average Sensitivity 0.99565
Average Specificity 0.0
Average ROC-AUC 0.75014; Model: BernoulliNB
Average Sensitivity 0.91968
Average Specificity 0.5806


The BernoulliNB algorithm showed the best results (average ROC-AUC 0.75014). We will try to balance the data set in order to improve overall performance.

### Improving overall performance

In [21]:
np.random.seed(1)
for model in [LogisticRegression(), DecisionTreeClassifier(),
             RandomForestClassifier(), MLPClassifier(), SVC(), BernoulliNB()]:
    age.fit_and_optimize(model, X, y, bln=True)

Average ROC-AUC 0.59482; Model: LogisticRegression
Average Sensitivity 0.96881
Average Specificity 0.22083
Average ROC-AUC 0.70796; Model: DecisionTreeClassifier
Average Sensitivity 0.94692
Average Specificity 0.46899
Average ROC-AUC 0.74118; Model: RandomForestClassifier
Average Sensitivity 0.95562
Average Specificity 0.52673
Average ROC-AUC 0.80454; Model: MLPClassifier
Average Sensitivity 0.9314
Average Specificity 0.67768
Average ROC-AUC 0.5; Model: SVC
Average Sensitivity 1.0
Average Specificity 0.0
Average ROC-AUC 0.7584; Model: BernoulliNB
Average Sensitivity 0.93176
Average Specificity 0.58503


We can observe that resampling has improved the average ROC-AUC of every model, although for SVC the gain is negligible (0.49783 to 0.5) and its specificity stays at 0. The short list of models which will be used for further analysis:
- RandomForestClassifier;
- MLPClassifier;
- BernoulliNB;

### Algorithm tuning

Let's update our class with a new method, `tunning`, for model tuning.

In [39]:
np.random.seed(1)
grid_params = {
    'n_estimators': [100, 200, 300, 400, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 50, 100, None],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 3],           
              }
rf_best = age.tunning(RandomForestClassifier(), grid_params, X, y)

ROC-AUC 0.6817275747508306
Sensitivity 0.9348837209302325
Specificity 0.42857142857142855
{'criterion': 'gini', 'max_depth': 100, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
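
To get a feel for the cost of this search, the number of parameter combinations in the grid can be counted directly. The sketch below reuses the random forest grid from the cell above and the `cv = 5` setting of the `tunning` method; it only counts candidates and does not fit any model.

```python
from itertools import product

# Same grid as in the random forest cell above
grid_params = {
    'n_estimators': [100, 200, 300, 400, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 50, 100, None],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 3],
}

# Every combination of one value per parameter is one candidate model
candidates = list(product(*grid_params.values()))
n_candidates = len(candidates)

cv_folds = 5  # value passed to GridSearchCV in the tunning method
n_cv_fits = n_candidates * cv_folds
print(n_candidates, n_cv_fits)  # 540 2700
```

Each candidate is fitted once per fold, so this grid requires 2700 cross-validation fits before the best estimator is refitted.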


In [40]:
np.random.seed(1)
grid_params = {
    'hidden_layer_sizes': [(50, 30), (50, 20), (50, 10)],
    'activation':['identity', 'logistic', 'tanh', 'relu']
              }
nn_best = age.tunning(MLPClassifier(), grid_params, X, y)

ROC-AUC 0.7775193798449613
Sensitivity 0.8883720930232558
Specificity 0.6666666666666666
{'activation': 'tanh', 'hidden_layer_sizes': (50, 10)}


In [41]:
np.random.seed(1)
grid_params = {
    'alpha': [1.5, 2, 2.5, 3, 3.5],
    'fit_prior': [True, False]
    }
nb_best = age.tunning(BernoulliNB(), grid_params, X, y)

ROC-AUC 0.7055370985603543
Sensitivity 0.9348837209302325
Specificity 0.47619047619047616
{'alpha': 1.5, 'fit_prior': True}
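
A note on reading these numbers: `tunning` passes the hard 0/1 labels returned by `model.predict` to `roc_auc_score`. With hard labels the ROC curve has a single operating point, so the reported ROC-AUC equals the mean of sensitivity and specificity (balanced accuracy). The quick check below uses only the three results printed above.

```python
# (ROC-AUC, sensitivity, specificity) as printed by the three tuning cells
reported = {
    "RandomForestClassifier": (0.6817275747508306, 0.9348837209302325, 0.42857142857142855),
    "MLPClassifier": (0.7775193798449613, 0.8883720930232558, 0.6666666666666666),
    "BernoulliNB": (0.7055370985603543, 0.9348837209302325, 0.47619047619047616),
}

differences = {}
for name, (auc, sens, spec) in reported.items():
    balanced_accuracy = (sens + spec) / 2
    differences[name] = abs(auc - balanced_accuracy)
    print(name, round(auc, 5), round(balanced_accuracy, 5))
```

A threshold-independent ROC-AUC would require scores such as those from `predict_proba` instead of hard labels.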
