Many beginners who compare their results with the leaderboard get the impression that they are doing something wrong, but a score of about 77% is normal; the hard part is pushing it a few percent higher.

This notebook uses the standard dataset and scored 0.811 on the leaderboard, which puts it in the top 1% if you don't count the top results that are based on cheating or on an extended dataset.

Some of the feature engineering shown here could be interesting for many beginners, as I am one myself, so all criticism, suggestions and rocks of any diameter thrown are very welcome :)

Have fun!

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import (SimpleImputer, IterativeImputer)
from sklearn.preprocessing import (OneHotEncoder, StandardScaler)
from sklearn.model_selection import (GridSearchCV, cross_val_score)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE
from catboost import CatBoostClassifier

# Load and analyze data

In [None]:
# Load data
full_df = pd.read_csv('/kaggle/input/titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
full_df.head()

In [None]:
# Separate test_df PassengerId (will need it for submission)
test_pass_id = test_df.pop('PassengerId')

# Keep max index that will be used to back split training and test data
X_max_index = full_df.shape[0]

# Separate features and target
y = full_df.Survived

df = full_df.drop(['Survived', 'PassengerId'], axis=1)
df = pd.concat([df, test_df], axis=0).reset_index(drop=True)

df.info()

- Some features need imputation
- The Cabin column has a lot of missing values; we will use the available 
  values to create a new feature and then drop Cabin
- We will create the feature Deck level, using the correlation between
  Pclass and the information deduced from the Cabin column. We suppose that the deck 
  level could play a role in the survivability of the people, as the lifeboats 
  were on the top level.
- From Name we will keep just the last name and use it when creating 
  Deck_level.
- We will create the feature Title by extracting the title from the Name column, 
  supposing that some categories of people had priority in boarding the lifeboats.

# Unprocessed data correlation



In [None]:
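# Keep only numeric columns: recent pandas versions raise an error
# when corr() meets text columns such as Name or Ticket
full_df.select_dtypes('number').corr()['Survived'].sort_values(ascending=False)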

- Fare and Pclass have the strongest correlation with Survived; it seems the higher classes (which also paid higher fares) had priority in boarding the lifeboats.
- Pclass has a negative correlation because classes are numbered 1, 2, 3 (high, medium, low), while survivability runs the other way (class 3 = lower chance to survive, class 1 = higher chance).
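
The sign of the Pclass correlation can be checked on a tiny made-up frame. The values below are hypothetical and are not taken from the Titanic data; they only show that when low class numbers go together with survival, the Pearson coefficient comes out negative.

```python
import pandas as pd

# Hypothetical toy data: class 1 mostly survives, class 3 does not
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3],
    'Survived': [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pearson correlation between the class number and the target
pclass_corr = toy['Pclass'].corr(toy['Survived'])

# Survival rate per class, which is what the negative sign summarises
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(round(pclass_corr, 2))
print(rate_by_class)
```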

# Features' instances

In [None]:
df.hist(bins=30, figsize=(12, 8))
plt.show()

- Attributes have different scales
- Some features are right-skewed; we should check for outliers and normalize the data
- Fare has values of 0, which look weird

In [None]:
# We treat zero values in Fare as errors or outliers and set them to NaN so they get imputed later
df.loc[df.Fare.eq(0), 'Fare'] = np.nan

# Create Lastname

In [None]:
df['Lastname'] = df.Name.str.split(', ').str[0]

# Create Title

In [None]:
# Extracting the Title from Name column
df['Title'] = df.Name.str.split(', ').str[1]
df['Title'] = df.Title.str.split('.').str[0]
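
To see what the two splits return, here is a minimal sketch on two names written in the layout of the Name column. The variable names `names`, `lastname` and `title` are only for illustration.

```python
import pandas as pd

# Names in the 'Lastname, Title. Firstname' layout used by the Name column
names = pd.Series(['Braund, Mr. Owen Harris',
                   'Heikkinen, Miss. Laina'])

# Text before ', ' is the last name
lastname = names.str.split(', ').str[0]

# Text after ', ' and before the first '.' is the title
title = names.str.split(', ').str[1].str.split('.').str[0]

print(lastname.tolist())
print(title.tolist())
```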

In [None]:
# Analyze titles
df.Title.value_counts()

There are some titles with the same meaning that should be merged, and also many rare titles that we will group under the title 'Noble'.

In [None]:
# Analyze the title Mr and the Age
df[df.Title.eq('Mr')].Age.describe()

In [None]:
# Analyze the title Master and the Age
df[df.Title.eq('Master')].Age.describe()

The title Mr was used from 11 years old and Master up to a maximum of 15 years old. 
Master is an antiquated title for an underage male.
We will merge them and then split again at age 15 to get a clean boundary.

In [None]:
# Group titles of the same type

# We also change Miss to Mrs, but later we will convert it back 
# to Miss just for young females. As it stands, Miss is not 
# very useful because it covers both a young lady and 
# an unmarried adult woman of any age
females = ['Ms', 'Miss', 'Mlle', 'Mrs', 'Mme']
df.loc[df.Title.isin(females), 'Title'] = 'Mrs'

males = ['Master', 'Mr']
df.loc[(df.Title.isin(males)), 'Title'] = 'Mr'

# Change the titles for children to Master and Miss
df.loc[(df.Title.eq('Mr') & df.Age.lt(15)), 'Title'] = 'Master'
df.loc[(df.Title.eq('Mrs') & df.Age.lt(15)), 'Title'] = 'Miss'

# Create noble title
df.loc[(~df.Title.isin(females) & ~df.Title.isin(males)), 'Title'] = 'Noble'

# Create Price

We should divide the Fare by number of passengers on the same ticket

In [None]:
# Analyze Fare by ticket number to be sure that the Fare represents 
# the full price of the ticket and not the price per person

# Split Ticket by series and number
df['Ticket_series'] = [i[0] if len(i) > 1 else 0 for i in df.Ticket.str.split()]
df['Ticket_nr'] = [i[-1] for i in df.Ticket.str.split()]

# Check if Fare min and Fare max of the same ticket number are the same
df_fare = df[~df.Fare.isna()]
multi_tickets = df_fare.groupby(df_fare.Ticket_nr[df_fare.Ticket_nr.duplicated()])
(multi_tickets.Fare.min() != multi_tickets.Fare.max()).sum()

There is just 1 ticket where min and max don't correspond; we will ignore it as a mistake.

In [None]:
# Create a column with the number of passengers per ticket 
ticket_dict = df.groupby('Ticket_nr').Lastname.count().to_dict()
df['Passengers_ticket'] = df.Ticket_nr.map(ticket_dict)

# Create Price column
df['Price'] = (df.Fare / df.Passengers_ticket).round()
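
The per-person price can be illustrated on a small made-up frame. The ticket numbers and fares below are hypothetical, and `transform('count')` is used here instead of the dictionary mapping above, so treat it as an equivalent sketch rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: ticket '111' is shared by three passengers
toy = pd.DataFrame({
    'Ticket_nr': ['111', '111', '111', '222'],
    'Fare':      [30.0, 30.0, 30.0, 8.0],
})

# Number of rows that share each ticket number
toy['Passengers_ticket'] = (toy.groupby('Ticket_nr')['Ticket_nr']
                            .transform('count'))

# Fare is the price of the whole ticket, so divide it by the group size
toy['Price'] = (toy['Fare'] / toy['Passengers_ticket']).round()

print(toy)
```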

# Create Deck

This will have the deck letter

In [None]:
# Extract Deck letter from Cabin column
df['Deck'] = df.Cabin.str[0]

# Check how many missing values we have at this step
df.Deck.isna().sum()

In [None]:
# Deck distribution by Pclass
df.groupby('Pclass').Deck.value_counts()

In [None]:
# Deck missing values by Pclass
df.loc[df.Deck.isna(), 'Pclass'].value_counts()

- In the 1st step we will impute the Deck letter based on Ticket_nr, if the same Ticket_nr already has an available 
  value for Deck in other rows
  
- In the 2nd step we will impute based on Lastname using the same method as in the first step, but to make sure that 
  the passengers are not from different families with the same Lastname we will use some filters in the process.

- In the 3rd step we will impute based on Pclass, as every Pclass was on separate decks with some overlap between them 
  (some googling confirms that the class-deck distribution corresponds to our Deck distribution by Pclass analysis). 
  To improve the accuracy we will also check the mean Price for each Pclass-Deck group to determine the Deck. 

In [None]:
# Function for imputing Deck
def impute_deck_by(feature):
    for pclass in range(1, 4):
        # Create a mapping dictionary
        map_dic = (df[~df.Deck.isna() & df.Pclass.eq(pclass)]
                   .groupby(feature).Deck.unique()
                   .apply(list).to_dict())

        # Keep just the keys with a single deck to avoid 
        # the same key on different decks
        map_dic = {i:j[0] for i, j in map_dic.items() 
                   if len(j) == 1}

        # Imputing Deck from map_dic
        df.loc[df.Deck.isna() & df.Pclass.eq(pclass), 
               'Deck'] = df[feature].map(map_dic)

    # Check how many missing values we have at this step
    print(df.Deck.isna().sum())
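
The core idea of the function, keeping only keys that point to a single deck, can be sketched on a made-up frame. The ticket numbers and decks below are hypothetical, and the Pclass loop is left out to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: ticket '10' has one known deck, ticket '20' has two
toy = pd.DataFrame({
    'Ticket_nr': ['10', '10', '20', '20', '20', '30'],
    'Deck':      ['C', None, 'B', 'D', None, None],
})

# Known decks per ticket number
known = (toy[toy['Deck'].notna()]
         .groupby('Ticket_nr')['Deck'].unique()
         .apply(list).to_dict())

# Keep only the keys that point to exactly one deck
unambiguous = {key: decks[0] for key, decks in known.items()
               if len(decks) == 1}

# Fill missing decks from the mapping; unmatched rows stay missing
missing = toy['Deck'].isna()
toy.loc[missing, 'Deck'] = toy.loc[missing, 'Ticket_nr'].map(unambiguous)

print(unambiguous)
print(toy)
```

Ticket '20' is dropped from the mapping because it appears on two decks, so its missing row is left for a later step.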

In [None]:
impute_deck_by('Ticket_nr')

In [None]:
impute_deck_by('Lastname')

We have recovered 25 values, not much, but they correspond to reality,
the rest we will impute later based on Pclass and Price as mentioned earlier.

# Impute Age

In [None]:
# List of titles
titles = list(df.Title.unique())

# Impute median Age by title
for title in titles:
    df.loc[(df.Age.isna() & df.Title.eq(title)), 'Age'] = df.loc[df.Title.eq(title), 'Age'].median()
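
The same per-title median imputation can also be written without an explicit loop. This is only an alternative sketch on hypothetical toy data, using `groupby(...).transform('median')`; it is not the code this notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age per title
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':   [30.0, 40.0, np.nan, 4.0, 8.0, np.nan],
})

# Median age of each title, broadcast back to every row
median_by_title = toy.groupby('Title')['Age'].transform('median')

# Fill only the missing ages
toy['Age'] = toy['Age'].fillna(median_by_title)

print(toy)
```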

# Analyze and impute missing prices

We impute prices first, as there are fewer missing values in Price than in Deck and we use each of them to impute the other.

In [None]:
# Analyze Price by Deck and Pclass
df.groupby(['Pclass', 'Deck']).Price.describe()

The standard deviation for Pclass 1, Deck B is very large compared to the others; we should analyze this.

In [None]:
# Cabin T was on the upper deck (google helps), 
# so we will replace it with A deck as it has just a single value
df.loc[df.Deck.eq('T'), 'Deck'] = 'A'

In [None]:
# Check the cheapest prices for Deck B
df[df.Deck.eq('B')].sort_values('Price').head()

In [None]:
# Maybe Mr Carlsson paid just 5 pounds for that 1st class ticket, 
# but this value is an outlier that we will replace with the next min
df.loc[df.Ticket_nr.eq('695'), 'Price'] = 19

In [None]:
# Check the most expensive prices for Deck B
df[df.Deck.eq('B')].sort_values('Price', ascending=False).head(10)

In [None]:
# Two most expensive tickets are outliers,
# we will cap them at the next overall highest Price 
df.loc[df.Ticket_nr.eq('17755'), 'Price'] = 68
df.loc[df.Ticket_nr.eq('17558'), 'Price'] = 68

In [None]:
# Create a data frame of mean prices by Pclass and Deck 
class_deck_price = pd.DataFrame(df.groupby(['Pclass', 'Deck'])
                                .Price.mean().round(2)).reset_index()

# Impute missing prices 
# Where Deck is missing we will use the mean price by Pclass only
for index, row in df.loc[df.Price.isna(), 
                         ['Pclass', 'Deck']].iterrows():
    if not pd.isna(row.Deck):
        new_price = class_deck_price.loc[
            (class_deck_price.Pclass.eq(row.Pclass) 
            & class_deck_price.Deck.eq(row.Deck)), 'Price'].mean()
    else:
        new_price = class_deck_price[
            class_deck_price.Pclass.eq(row.Pclass)].Price.mean()

    df.loc[[index], 'Price'] = new_price

# Analyze and impute missing Deck

In [None]:
# Create dictionaries with approximate price ranges by deck 
# derived from the previous analysis
first_cl = {'A': [25, 30],
            'B': [35, 70],
            'C': [30, 35],
            'D': [19, 25],
            'E': [9, 19]}

second_cl = {'D': [13, 17],
             'E': [5, 9],
             'F': [9, 13]}

third_cl = {'E': [8, 9],
            'F': [9, 21],
            'G': [0, 8]}

# Create a dictionary pairing Pclass and respective price dictionary
class_dict = {1: first_cl,
              2: second_cl,
              3: third_cl}

# Impute missing Deck values 
for index, row in df.loc[df.Deck.isna(), ['Pclass', 'Price']].iterrows():
    for c, d in class_dict.items():
        if row.Pclass == c:
            for i, j in d.items():
                if max(j) > row.Price >= min(j):
                    df.loc[[index], 'Deck'] = i

# Encode Deck with its deck level number counting from the bottom
deck_level = {'G': 1, 'F': 2, 'E': 3, 'D': 4, 'C': 5, 'B': 6, 'A': 7}

df.Deck = df.Deck.replace(deck_level)
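
The range lookup inside the imputation loop above can be read as a small function. `deck_for_price` is a hypothetical helper written only for this illustration; the ranges are copied from `first_cl`.

```python
# Price ranges for first class, copied from first_cl above
first_cl = {'A': [25, 30],
            'B': [35, 70],
            'C': [30, 35],
            'D': [19, 25],
            'E': [9, 19]}


def deck_for_price(price, ranges):
    """Return the deck whose half-open range [low, high) contains price."""
    for deck, (low, high) in ranges.items():
        if low <= price < high:
            return deck
    # No range matched, so the deck stays unknown
    return None


print(deck_for_price(27, first_cl))
print(deck_for_price(30, first_cl))
print(deck_for_price(70, first_cl))
```

Because the upper bound is exclusive, a price of 30 goes to deck C and not A, and a price outside every range (70 here) leaves Deck missing, so such rows still need imputation afterwards.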

# Create Escape_density

Crowded decks could lead to jams and chaos when everybody wanted to get to the upper deck, as the lifeboats were there.
This feature shows how many people the passengers of each deck had to get past to reach the top. 
Basically, each deck gets a number of people equal to the sum of its own head count and that of all the decks above it.

In [None]:
# Analyse how many people were on each deck.
# Many values were imputed approximately, but at least we will have 
# an approximate crowd mass each passenger has to pass on the way up
deck_people = df.Deck.value_counts().sort_index()
deck_people_dic = deck_people.to_dict()
deck_people_dic

In [None]:
# Create an escape density dictionary from which we will impute data to our new feature
escape_density = {}
for i in range(1, 8):
    escape_density[i] = sum(deck_people_dic.values())
    del deck_people_dic[i]
    
escape_density

In [None]:
# Create Escape_density column
df['Escape_density'] = df.Deck.replace(escape_density)
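
Summing each deck with all the decks above it is a cumulative sum taken from the top deck downwards. The sketch below shows this with pandas on hypothetical head counts; the numbers are made up and are not the notebook's deck counts.

```python
import pandas as pd

# Hypothetical head counts per deck level (1 = lowest deck, 7 = top deck)
deck_people = pd.Series({1: 50, 2: 40, 3: 30, 4: 20, 5: 10, 6: 5, 7: 2})

# Sum each deck with every deck above it:
# sort from the top deck down, accumulate, then restore the order
escape_density = (deck_people.sort_index(ascending=False)
                  .cumsum()
                  .sort_index()
                  .to_dict())

print(escape_density)
```

Unlike the loop above, this version does not modify the dictionary of deck counts while it runs.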

# Create Family_size

It will represent how big the family was

In [None]:
# We add together the person and his SibSp and Parch
df['Family_size'] = 1 + df.SibSp + df.Parch

# Create Family_survivers

This feature can't be used for modeling as it would lead to target leakage, but by analysing it later we can single out families that may have had a higher chance of surviving.

In [None]:
# Create full data frame for analysis
X = df[:X_max_index]
test_df = df[X_max_index:].copy()
full_df = pd.concat([X, y], axis=1).copy()

# Check for families that have survivors and create a dictionary with each family's mean survival rate
family_survivers = full_df[['Lastname', 'Survived']].groupby('Lastname').mean().round(2).reset_index()
family_survivers_dict = dict(zip(family_survivers.Lastname, family_survivers.Survived))

# Reduce the dictionary to the families that are both in train and test data
# (the test last names are collected once, into a set, for fast lookup)
test_lastnames = set(test_df['Lastname'].unique())
common_survivers = {lastname: survived
                    for lastname, survived in family_survivers_dict.items()
                    if lastname in test_lastnames}

# Create Family_survivers feature
test_df['Family_survivers'] = test_df.Lastname.map(common_survivers)
full_df['Family_survivers'] = full_df.Lastname.map(common_survivers)

# For the families that are not present in both train and test we will impute the overall mean value
test_df.Family_survivers = test_df.Family_survivers.fillna(test_df.Family_survivers.mean())
full_df.Family_survivers = full_df.Family_survivers.fillna(full_df.Family_survivers.mean())

# Separate back features and target
y = full_df.Survived

df = full_df.drop('Survived', axis=1)
df = pd.concat([df, test_df], axis=0).reset_index(drop=True)

# Clean data

In [None]:
# Change Pclass dtype to category as it's a categorical feature
df.Pclass = df.Pclass.astype('category')

In [None]:
# Drop further unused columns
col_drop = ['Name', 'Ticket', 'Fare', 'Cabin', 'Lastname','Ticket_nr',  
            'Ticket_series', 'Passengers_ticket']
df = df.drop(col_drop, axis=1)

# Impute and encode categoricals

In [None]:
# List of categorical columns
categ_cols = list(df.select_dtypes(['object', 'category']).columns)

# Impute categoricals with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')

df_cat = pd.DataFrame(cat_imputer.fit_transform(df[categ_cols]), 
                      columns=df[categ_cols].columns)

# Encode categorical
df_cat = pd.get_dummies(df_cat)

# Impute numericals

In [None]:
# List of numerical columns
num_cols = list(df.select_dtypes(['int64', 'float64']).columns)

# Impute numericals
it_imp = IterativeImputer()

df_num = pd.DataFrame(it_imp.fit_transform(df[num_cols]),
                      columns=df[num_cols].columns)

# Concatenate with encoded categorical columns
df = pd.concat([df_cat, df_num], axis=1)

# Create Deck_survive_ratio

In [None]:
# Create a full data frame for analysis
X = df[:X_max_index]
full_df = pd.concat([X, y], axis=1)

# Total Survived by Deck
deck_total_survived = full_df.groupby('Deck').Survived.sum()

# Dictionary with deck_survive_ratio
deck_survive_ratio = (deck_total_survived / deck_people).to_dict()

# Create Deck_survive_ratio
df['Deck_survive_ratio'] = df.Deck.map(deck_survive_ratio)

In [None]:
# Function for kde plotting
def survive_chance_by(feature, xticks=None, xlim=None):
    survived = full_df[full_df.Survived.eq(1)]
    not_survived = full_df[full_df.Survived.eq(0)]

    plt.figure(figsize=(10, 5))

    survived[feature].plot(kind='kde', label='survived')
    not_survived[feature].plot(kind='kde', label='not_survived')
    
    plt.xlim(xlim)
    plt.xticks(xticks)
    plt.legend()
    plt.grid()
    plt.xlabel(feature)
    plt.show()

# Create Age_group

In [None]:
# Survivers by Age
survive_chance_by('Age', np.arange(0, 81, 5), (0, 80))

From the points where the curves intersect we can separate 4 age groups:
    
    1. 0-16 years old have higher survivability chance
    2. 16-33 years old low chance
    3. 33-43 years old better chance
    4. For the rest the chances are almost equal

In [None]:
df['Age_group'] = pd.cut(x=df.Age, labels=[4, 1, 3, 2],
                         bins=[-1, 16, 33, 43, df.Age.max()]).astype('float')
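
To make the bin edges and the out-of-order labels easier to follow, here is a minimal sketch on a few hypothetical ages. The upper edge 80 is an assumed stand-in for `df.Age.max()`.

```python
import pandas as pd

# Hypothetical ages, at least one per bin
ages = pd.Series([5, 16, 20, 40, 60])

# Bins are right-closed: (-1, 16], (16, 33], (33, 43], (43, 80]
# The labels follow the survival chances listed above, not the age order
age_group = pd.cut(x=ages, labels=[4, 1, 3, 2],
                   bins=[-1, 16, 33, 43, 80]).astype('float')

print(age_group.tolist())
```

Note that an age of exactly 16 falls into the first bin, because `pd.cut` includes the right edge by default.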

# Create Family_group

In [None]:
# Survivers by Family_size
survive_chance_by('Family_size', np.arange(0, 10, 1), (0, 10))

Here we can separate 3 groups:

    1. Single persons had a lower chance to survive
    2. Families of 2-4 members had higher chances, as having 1-2 children with them gave some priority for the lifeboats
    3. Families of 5 or more members had almost equal chances

In [None]:
# Create Family_group feature
df['Family_group'] = pd.cut(x=df.Family_size, labels=[1, 3, 2], 
                            bins=[-1, 1, 4, df.Family_size.max()]).astype('float')

# Create Lucky_family

To create this feature we analyse the earlier created Family_survivers, which would overfit the model if used directly.

In [None]:
# Survivers by Family_survivers
survive_chance_by('Family_survivers', np.arange(0, 1.5, 0.1), (0, 1.5))

From the points where the curves intersect we can separate 4 family groups with different chances to survive.

In [None]:
# Create Lucky_family feature
df['Lucky_family'] = pd.cut(x=df.Family_survivers, labels=[2, 3, 1, 4],
                            bins=[-1, 0.22, 0.35, 0.49, df.Family_survivers.max()]).astype('float')

# Standardization

In [None]:
# Apply np.log1p to normalize the right-skewed Price
df.Price = df.Price.apply(np.log1p)

# Standardize 
std_scaler = StandardScaler()

df_scaled = std_scaler.fit_transform(df)
df = pd.DataFrame(df_scaled, columns=df.columns)
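
What the two steps above do to a single column can be sketched with plain NumPy. The prices are hypothetical, and the manual formula is only an illustration of the same mean and population standard deviation that `StandardScaler` uses.

```python
import numpy as np

# Hypothetical right-skewed prices
price = np.array([0.0, 7.0, 8.0, 13.0, 26.0, 68.0])

# log1p(x) = log(1 + x): defined at 0 and compresses the long right tail
log_price = np.log1p(price)

# Standardize: subtract the mean, divide by the population standard deviation
scaled = (log_price - log_price.mean()) / log_price.std()

print(log_price.round(2))
print(scaled.round(2))
```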

In [None]:
# Drop features not used for modeling
cols_to_drop = ['Family_survivers', 'SibSp', 'Parch', 'Family_size']
df = df.drop(cols_to_drop, axis=1)

# Split train and test data

In [None]:
X = df[:X_max_index]
test_df = df[X_max_index:]

# Processed data correlation

In [None]:
# Concatenate into a full dataset
full_df = pd.concat([X, y], axis=1)

correlation = full_df.corr()['Survived'].sort_values(ascending=False)

# Correlation graph
correlation[1:].plot(kind='bar', figsize=(10,5), title='Survivability dependency')
plt.show()

# Conclusion

On the Titanic it was better not to be an ordinary single adult male on a lower deck who embarked at Southampton with a cheap ticket in his pocket - RIP Jack Dawson :-(

# Find best features
This cell is commented out as it takes a long time to run; the resulting final_features are shown below.

In [None]:
# # Define model
# cat_model = CatBoostClassifier(thread_count=-1, verbose=False)

# # Define and fit feature selector
# sfs = SequentialFeatureSelector(cat_model, 
#                                 scoring='accuracy', 
#                                 direction = 'backward')
# sfs.fit(X, y)

# # List of the final features to be used for submission modeling
# final_features = list(sfs.get_feature_names_out())

In [None]:
# From Feature selector we've got this list of final features to use
final_features = ['Pclass_2', 'Pclass_3', 'Sex_female', 'Title_Mr', 
                  'Title_Mrs', 'Price', 'Deck_survive_ratio', 'Age_group',
                  'Family_group', 'Lucky_family']

# CatBoost grid search parameter tuning
This cell is commented out as it takes a long time to run; the resulting parameters are shown below.

In [None]:
# # Define model
# cat_model = CatBoostClassifier()

# # Define parameters' grid
# grid = {'verbose': [False],
#         'thread_count': [-1],
#         'depth': [3, 4, 5, 6],
#         'iterations': [500, 1000, 2000, 3000],
#         'learning_rate': [0.0001, 0.001, 0.01]}

# # Define GridSearchCV
# grid_cat = GridSearchCV(estimator=cat_model, param_grid=grid, cv=3, n_jobs=-1)
# grid_cat.fit(X[final_features], y)

# params = grid_cat.best_params_

# print('\n Best Score:\n', grid_cat.best_score_)
# print('\n Best parameters:\n', params)

In [None]:
# Best parameters
params = {'verbose': False,
          'thread_count': -1,
          'depth': 4, 
          'iterations': 1000, 
          'learning_rate': 0.0005}

# Final model

In [None]:
# Define and fit the model
cat_model = CatBoostClassifier(**params)
cat_model.fit(X[final_features], y)

# Check cross-validated accuracy and feature importance
# (cross_val_score uses accuracy by default for a classifier)
cat_scores = cross_val_score(cat_model, X[final_features], y, cv=5)

print(pd.Series(cat_scores).describe())
print('\n', cat_model.get_feature_importance(prettified=True))

# Submission

In [None]:
# Make predictions which we will submit.
test_preds = cat_model.predict(test_df[final_features])

# Save predictions in the format used for competition scoring
output = pd.DataFrame({'PassengerId': test_pass_id,
                       'Survived': test_preds})
output.to_csv('submission.csv', index=False)