# Titanic problem

## Problem definition
- Informal: predict which passengers survived and which died;
- Formal:
    - Task: create a binary classifier algorithm that classifies a Titanic passenger as survived or not;
    - Experience: data on the Titanic wreck;
    - Performance: accuracy score;
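
Accuracy, the performance measure named above, is the share of passengers whose predicted label matches the true one. A minimal sketch in plain Python is shown here; the function name `accuracy` and the toy labels are illustrative assumptions, and scikit-learn's `accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels made up for illustration: 1 = survived, 0 = did not survive
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)  # 4 of the 5 predictions are correct
print(score)
```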
    
## Assumptions
- Column-wise assumptions:
    - Pclass: passenger classes were treated unequally (first class had top priority);
    - Sex and Age: women and children were rescued first;
    - SibSp: two situations are possible: a man could help a woman survive, or the couple decided to die together;
    - Parch: parents could help their children survive;
    - Fare: steerage passengers, who held the cheapest tickets, had the lowest chance of survival. Ticket prices:
        - First Class (parlor suite): £870;
        - First Class (berth): £30;
        - Second Class: £12;
        - Third Class: £3 to £8;
    - Cabin: the evacuation took place from the boat deck. The nearer a passenger was to this deck, the greater the chance of survival;
    - Embarked: English-speaking passengers had a better chance of survival, and the port of embarkation may indicate the passenger's language. S: Southampton (England), C: Cherbourg (France), Q: Queenstown (Ireland, English-speaking).
- The dataset contains information only on passengers;
- The dataset contains information only on passengers who were aboard during the accident;
- The title of the passenger can matter for survival;
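
The Cabin assumption depends on knowing the deck of each passenger. A minimal sketch of how the deck could be read from a cabin code is shown here, assuming that the deck is the first character of the code; the cabin values are illustrative and not taken from a specific passenger.

```python
import pandas as pd

# Illustrative cabin codes in the format of the Cabin column;
# None stands for a passenger whose cabin is unknown.
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63", "G6"], name="Cabin")

# Assumption: the deck is the first character of the cabin code.
deck = cabins.str[0]

# Unknown cabins stay missing, so they can be imputed in a later step.
known_decks = deck.dropna().tolist()
print(known_decks)
```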

## Solution
- Data Selection:
    - Pclass, Sex and Age: it is widely known that passenger classes were treated unequally and that women and children were rescued first;
    - SibSp and Parch: check whether there is a statistically significant difference between the survived and not-survived groups;
    - Ticket and Fare: cannot be used on their own, but they may help to fill in missing cabin values;
    - Cabin: see the Assumptions section;
    - Embarked: the chance of survival depended on the passenger's language. People who embarked in France might not have spoken English;
- Filling missing values in the following columns:
    - Age:
        - The passenger is classified as adult if:
            - has title Mrs;
            - has spouse (has spouse and the same last name);
            - has child (has child and the same last name);
            - has sibling who is older than 18 (has sibling and the same last name);
            - has parent who is older than 50 (has parent and the same last name);
        - The passenger is classified as a child if:
            - has parent who is younger than 50 (has parent and the same last name);
            - has sibling who is younger than 18 (has sibling and the same last name);
            - is male and shares a cabin with women (the same ticket number, third class and different last names);
        - The passenger is classified as adult in all other cases;
    - Cabin: the deck will be extracted from this column. Algorithm:
        - Assign the same deck to passengers with the same ticket number;
        - Assign the same deck to passengers who are spouses;
        - Then implement a classifier to predict the deck of the cabin;
    - Embarked: will be filled with mode;
- Explore relationships between variables:
    - Is there a statistically significant difference in survival rate between passengers who had siblings/spouses aboard and those who did not? Between those who had parents/children aboard and those who did not? Between passengers with different titles?
    - Is there a statistically significant difference in survival rate between different decks?
    - Is there a statistically significant difference in survival rate between different ports?
    - Is there a statistically significant difference in class between different ports?
- Features generation:
    - Last name, first name, title, spouse ID from Name column;
    - First class, Third class (dummies) from Pclass;
    - Female (dummy) from Sex;
    - S, Q (dummies) from Embarked;
    - SibSp/NoSibSp (dummy) from SibSp;
    - Parch/NoParch (dummy) from Parch;
    - Adult man from Sex and Age;
    - Deck with classifier from Pclass, Sex, Age, SibSp, Parch, Ticket, Fare;
    - A-E (dummies) from Deck;
- Model fitting and optimization:
    - Method parameters: train, validation, algo, params(dict);
    - Method algo:
        - Fit model to the train set;
        - Make predictions;
        - Calculate metric;
        - Print information;
- Model tuning:
    - Method parameters: train, validation, algo, params;
    - Method algo:
        - Fit model to the train set;
        - Make predictions;
        - Calculate metric;
        - Print information;
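
The fitting method outlined above (fit on the train set, predict, calculate the metric, print) could be sketched as follows. This is an illustration under stated assumptions rather than the notebook's final implementation: the name `fit_and_score`, the extra `features` and `target` arguments, the toy DataFrames and the choice of `LogisticRegression` are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train, validation, algo, params, features, target="Survived"):
    """Fit algo(**params) on train, score it on validation, print and return."""
    model = algo(**params)
    # Fit model to the train set
    model.fit(train[features], train[target])
    # Make predictions on the validation set
    predictions = model.predict(validation[features])
    # Calculate the metric (accuracy, as in the problem definition)
    score = accuracy_score(validation[target], predictions)
    # Print information
    print(f"{algo.__name__} {params}: validation accuracy = {score:.3f}")
    return model, score


# Toy data made up for illustration only
toy_train = pd.DataFrame({
    "female": [1, 1, 0, 0, 1, 0],
    "first_class": [1, 0, 0, 1, 1, 0],
    "Survived": [1, 1, 0, 0, 1, 0],
})
toy_validation = pd.DataFrame({
    "female": [1, 0],
    "first_class": [0, 1],
    "Survived": [1, 0],
})
features = ["female", "first_class"]
model, score = fit_and_score(
    toy_train, toy_validation, LogisticRegression, {"C": 1.0}, features
)
```

Because the estimator class and its parameters are passed in, the same helper can be called repeatedly with different `params` dictionaries during tuning.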
        
## Software development
- Class Titanic_Dataset:
    - Take the pandas DataFrames train and holdout as parameters;
    - Split the train DataFrame into train and validation parts;
    - Store the train, validation and holdout DataFrames;
    - Inherit all methods from the pandas DataFrame class;
- Filling missing values:
    - Embarked (9);
    - Age (11);
    - Cabin (13), then classifier;
- Feature generation:
    - First class (1, dummy);
    - Third class (2, dummy);
    - Last name from Name (3);
    - First name from Name (4);
    - Title from Name (5);
    - Female (6, dummy);
    - SibSp/NoSibsp from SibSp (7, dummy);
    - Parch/NoParch from Parch (8, dummy);
    - Cherbourg from Embarked (10, dummy);
    - Adult man from Sex and Age (12, dummy);
    - Deck from Cabin (14);
    - A-E from Deck (15);
- Statistical exploration:
    - Survival rate SibSp vs NoSibSp (16);
    - Survival rate Parch vs NoParch (17);
    - Survival rate between different titles (18);
    - Survival rate between different decks (19);
    - Class between different ports (20);
- Model fitting and optimization:
    - Algo (21);
- Model tuning:
    - Algo (22);
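
The statistical exploration steps compare survival between two groups. One common way to test such a difference is a chi-square test of independence on a contingency table; the sketch below assumes that approach, a hypothetical `has_sibsp` dummy column, and toy rows made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data made up for illustration; has_sibsp is a hypothetical dummy
# marking passengers with siblings or spouses aboard.
toy = pd.DataFrame({
    "has_sibsp": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Survived":  [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# Contingency table: rows are the groups, columns are the outcomes
table = pd.crosstab(toy["has_sibsp"], toy["Survived"])

# Chi-square test of independence between group and outcome
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}, dof = {dof}")
```

With counts as small as in this toy table the chi-square approximation is unreliable; on the full train set the cell counts are much larger.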

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split

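# Use the full option names; abbreviated names can be ambiguous in recent pandas
pd.set_option('display.max_columns', 160)
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_colwidth', 5000)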

train = pd.read_csv('train.csv')
holdout = pd.read_csv('test.csv')

## Definition of supporting functions

In [19]:
# Define functions for name processing
def process_mrs_with_par(name):
    """(str) -> str
    
    Return first name of the name for Mrs and Lady passengers whose first name is
    in the parentheses.
    
    >>> process_mrs_with_par('Futrelle, Mrs. Jacques Heath (Lily May Peel)')
    'Lily May Peel'
    >>> process_mrs_with_par('Watt, Mrs. James (Elizabeth "Bessie" Inglis Milne)')
    'Elizabeth Inglis Milne'
    >>> process_mrs_with_par('Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")')
    'Lucille Christiana Sutherland'
    """
    # Extract the string within the first pair of parentheses, without them
    first_name = re.search(r'\((.*?)\)', name).group(1)

    # If the resulting string contains a part in quotes
    if re.search(r'"(.*)"', first_name):
        # remove the quoted part (and the space before it)
        first_name = re.sub(r' "(.*)"', '', first_name)
    return first_name

def process_other_names(name):
    """(str) -> str
    
    Return first name from name. 
    
    >>> process_other_names('Sawyer, Mr. Frederick Charles')
    'Frederick Charles'
    >>> process_other_names('Bradley, Mr. George ("George Arthur Brayton")')
    'George'
    >>> process_other_names('Petranec, Miss. Matilda')
    'Matilda'
    >>> process_other_names('O\\'Dwyer, Miss. Ellen "Nellie"')
    'Ellen'
    >>> process_other_names('Masselmani, Mrs. Fatima')
    'Fatima'
    """
    # Keep all characters after the first dot (the part following the title)
    first_name = re.search(r'\.(.*)', name).group(1)
    # If the string contains a part in parentheses or quotes
    if re.search(r'[("](.*)[)"]', first_name):
        # remove that part, keeping only what precedes ' (' or ' "'
        first_name = re.sub(r'[("](.*)[)"]', "", first_name)
    # Strip surrounding whitespace so the result matches the doctests above
    return first_name.strip()

def process_name(name):
    """(str) -> str
    
    Return first name from name. If title Mrs return name in parentheses without them.
    Else return name before parentheses or all characters.
    
    >>> process_name('Sawyer, Mr. Frederick Charles')
    'Frederick Charles'
    >>> process_name('Bradley, Mr. George ("George Arthur Brayton")')
    'George'
    >>> process_name('Petranec, Miss. Matilda')
    'Matilda'
    >>> process_name('O\\'Dwyer, Miss. Ellen "Nellie"')
    'Ellen'
    >>> process_name('Futrelle, Mrs. Jacques Heath (Lily May Peel)')
    'Lily May Peel'
    >>> process_name('Watt, Mrs. James (Elizabeth "Bessie" Inglis Milne)')
    'Elizabeth Inglis Milne'
    >>> process_name('Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")')
    'Lucille Christiana Sutherland'
    >>> process_name('Masselmani, Mrs. Fatima')
    'Fatima'
    """
    if re.search('Mrs|Lady', name):
        try:
            first_name = process_mrs_with_par(name)
        except AttributeError:
            first_name = process_other_names(name)
    else:
        first_name = process_other_names(name)
    
    return first_name.strip()
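
As a possible complement to the row-by-row helpers above, the last name and title can also be pulled out in one vectorized step with `Series.str.extract`. This is a sketch, not the approach used in the rest of the notebook; the pattern and the group names `last_name` and `title` are assumptions, and the names are the ones already used in the doctests.

```python
import pandas as pd

names = pd.Series([
    "Sawyer, Mr. Frederick Charles",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
])

# last_name: everything before the first comma
# title: everything between the comma and the first dot
pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\."

# Named groups become the column names of the resulting DataFrame
parsed = names.str.extract(pattern)
print(parsed)
```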

## Definition of Titanic_Dataset class

In [30]:
class Titanic_Dataset():
    
    def __init__(self, train_data, holdout_data):
        # Use the constructor arguments rather than the global DataFrames
        self.train = train_data[['PassengerId', 'Pclass', 'Name',
                                 'Sex', 'Age', 'SibSp', 'Parch',
                                 'Ticket', 'Fare', 'Cabin',
                                 'Embarked', 'Survived']].copy()
        self.holdout = holdout_data.copy()
        
    def get_dummy(self, col, value, new_name):
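        # Create a dummy variable named new_name from column col.
        # For a string value the dummy is 1 where the column equals value;
        # for a numeric value the dummy is 1 where the column is greater than value.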
        for df in (self.train, self.holdout):
            if isinstance(value, str):
                df[new_name] = df[col].apply(lambda x: 1 if x == value else 0)
            else:
                df[new_name] = df[col].apply(lambda x: 1 if x > value else 0)
    
    def parse_name(self):
        # Parse name column to last_name, title and first_name columns
        for df in (self.train, self.holdout):
            # Create last_name column
            df['last_name'] = df['Name'].apply(lambda x: re.search(r'(.*),', x).group(1))
            # Create title column (non-greedy, so a later dot in the name is ignored)
            df['title'] = df['Name'].apply(lambda x: re.search(r', (.*?)\.', x).group(1))
            # Create first_name column
            df['first_name'] = df['Name'].apply(process_name)
            
    def fill_embarked(self):
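        # Assign the result back instead of using inplace=True on a column selection
        self.train['Embarked'] = self.train['Embarked'].fillna(self.train['Embarked'].mode()[0])
        self.holdout['Embarked'] = self.holdout['Embarked'].fillna(self.holdout['Embarked'].mode()[0])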
        
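    def fill_age(self):
        # Loop over the train and holdout datasets and assign the average age of
        # passengers with the title Mrs to passengers with the same title but a
        # missing age. Requires parse_name() to have been called first, because
        # it relies on the 'title' column.
        for df in (self.train, self.holdout):
            is_mrs = df['title'] == 'Mrs'
            mean_age = df.loc[is_mrs, 'Age'].mean()
            df.loc[is_mrs & df['Age'].isnull(), 'Age'] = mean_age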

In [36]:
test_train = train.copy()
test_holdout = holdout.copy()
titanic = Titanic_Dataset(test_train, test_holdout)
titanic.parse_name()
titanic.train[~titanic.train['Age'].isnull()].shape

(714, 15)

In [29]:
train['Embarked'].mode()

0    S
dtype: object