> Igor Sorochan DSU-31

# Data quality problems

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import re

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'matplotlib'

## Prepare

|data frame| used for storing:|
|:---|:---|
|df_raw | untouched input data|
|df | cleaned data|
|df_viz|simplified plot readings|
|df_test|data for test on Kaggle|
|df_subm|example Kaggle submission file|
| X | processed Train set|
|X_subm|processed Test set|
|y (Series) | target labels|

#### Loading the data set

In [1288]:
df_raw=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_train.csv')
df_test=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_test.csv')
df_subm=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_gender_submission.csv')

In [1289]:
df_raw.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [1290]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [1291]:
def check_na(df):
  '''
  Check for missing values in a dataframe
  df - dataframe
  '''
  for col in df.columns:
    print(f'{col.ljust(12)} {df[col].isna().sum():<5}{df[col].isna().sum()/df.shape[0]:.2%}')

def get_boxplot(X, columns=None):
    '''
    Plot boxplot for each column in columns
    X - dataframe
    columns - list of columns to plot
    '''
    for i in columns or []:  # None (the default) means nothing to plot
        fig = px.box(x=X[i])
        fig.show()

def get_pairplot(X, columns=None):
    '''
    Plot pairplot for the given columns, or for all columns if none are given
    X - dataframe
    columns - list of columns to plot
    '''
    import seaborn as sns  # imported here because seaborn is not among the top-level imports
    if columns is None: # if no columns are specified, use all of them
        columns = list(X.columns)
    sns.pairplot(X[columns])

### Checking for missing values, duplicates and outliers

In [1292]:
check_na(df_raw) # check for missing values

PassengerId  0    0.00%
Survived     0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          177  19.87%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        687  77.10%
Embarked     2    0.22%
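The same per-column summary can also be built without an explicit loop. The following is a minimal sketch on a made-up toy frame (the `toy` values and the `na_summary` name are illustrative assumptions, not part of the notebook): `isna().sum()` gives the count and `isna().mean()` gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_raw (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# One row per column: absolute count and share of missing values
na_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(na_summary)
```

Returning a DataFrame instead of printing makes the result easy to sort or filter, for example to list only the columns that contain missing values.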


The data set is small, so we should avoid dropping rows with NaNs and fill them sensibly instead.  
Age directly influences one's chances of survival.  
Cabin and Embarked may also matter, but their effect is less obvious.  
Let's try to **fill in the missing values.**

### Age - passenger age  
We have 20% of nulls here.  
What other attributes can indicate the age of a passenger?  
Potential candidates are:  
* Pclass  
* Name (salutation)

Let's explore. 

In [1293]:
df_raw.groupby(['Pclass'])[['Age']].mean().style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Age
Pclass,Unnamed: 1_level_1
1,38.233441
2,29.87763
3,25.14062


Mean age falls steadily with passenger class, from about 38 in 1st class to about 25 in 3rd.  
On average, the youngest passengers are in 3rd class.

Let's look at the relationship between Salutation and Age.  
The idea is to **roughly estimate a passenger's age from their Salutation.**

First, we extract the Salutation as the first word in Name that is followed by a dot. 

In [1294]:
df = df_raw.copy() # df -dataframe for cleaned data

pattern= re.compile(r'(\w+)\.') # pattern for any word followed by a dot
df['Salutation']= df['Name'].apply(lambda x: re.findall(pattern, x)[0] ) # for train data
df_test['Salutation']= df_test['Name'].apply(lambda x: re.findall(pattern, x)[0] ) # for test data
df.Salutation.value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: Salutation, dtype: int64

|Salutation|Description|Proposed replacement values|
|:---|:---|:---|
|Master|"Master" was a courtesy title for boys who were too young to be addressed as "Mr". In the early 1900s it was in common use, particularly among wealthy or upper-class families. On passenger lists it distinguishes young male passengers from adult men.| Master|
|Rev|"Reverend", which is a title used to address members of the Christian clergy. | Mr|
|Mlle|French "Mademoiselle" (unmarried women)|Miss|
|Mme|French "Madame"|Mrs|
|Countess|The title of Countess is a noble title given to a woman in certain European countries.|Mrs|
|Jonkheer|The term "Jonkheer" is a Dutch noble title that is roughly equivalent to the English title of "Esquire".|Mr|
|Dona|Spanish "Madam" or "Lady". There was only one "Dona" on board: Doña Fermina Oliva y Ocana. She was a 39-year-old first-class passenger from Spain who boarded the Titanic in Cherbourg, France, bound for New York City. Doña Fermina Oliva y Ocana was traveling with her maid, Miss Asuncion Durán y More, who also survived the sinking. They both boarded lifeboat 8, which was one of the first to leave the Titanic. They were later transferred to the rescue ship Carpathia and eventually reached New York City on April 18, 1912.|Mrs|
|Major, Col, Capt, Sir, Don|Titles of adult male passengers|Mr|
|Ms|There is some ambiguity here. Two women were titled Ms (both with SibSp==0)|Miss|


We leave Miss, Mrs, Mr and Master as they are.

In [1295]:
# extract salutation (any word followed by a dot) from name
salut = set(df_raw.Name.str.extract(r'(\w+)\.')[0])  # [0] selects the first column; the set removes duplicates

# add the salutations from the test set (it may contain rare salutations absent from the train set)
salut = salut | set(df_test.Name.str.extract(r'(\w+)\.')[0])  

# remove the common salutations, which we keep as they are
salut -=  {'Master','Miss','Mr','Mrs'} 

salut1 = sorted(salut) # sort into a list: set order is arbitrary, and zip below pairs items by position

print('Set of Salutation replacements:')
# replacements for the rare salutations, in the same (alphabetical) order as salut1
salut2 =['Mr', 'Mr','Mrs', 'Mr', 'Mrs', 'Mr', 'Mr', 'Mrs', 'Mr', 'Miss',  'Mrs', 'Miss', 'Mr', 'Mr']
for i in zip(salut1, salut2):
    print (i[0].ljust(9), i[1])

Set of Salutation replacements:
Capt      Mr
Col       Mr
Countess  Mrs
Don       Mr
Dona      Mrs
Dr        Mr
Jonkheer  Mr
Lady      Mrs
Major     Mr
Mlle      Miss
Mme       Mrs
Ms        Miss
Rev       Mr
Sir       Mr


Finally, replace the rare Salutations with the common ones.

In [1296]:
df['Salutation'] = df.Salutation.replace(salut1, salut2)
df_test['Salutation'] = df_test.Salutation.replace(salut1, salut2)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Mr
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr


Look at passenger #887 (Rev. Montvila): he is now labeled as Mr.  
Now we can estimate a passenger's Age from his or her Salutation.  
Let's check the Salutation against Sex.

In [1297]:
pd.crosstab(df.Salutation,df.Sex).style.background_gradient(cmap='coolwarm')

Sex,female,male
Salutation,Unnamed: 1_level_1,Unnamed: 2_level_1
Master,0,40
Miss,185,0
Mr,1,537
Mrs,128,0


We've labeled one female passenger as Mr.  
Let's find out who it is.

In [1298]:
df[(df.Salutation=='Mr') & (df.Sex=='female')].head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
796,797,1,1,"Leader, Dr. Alice (Farnham)",female,49.0,0,0,17465,25.9292,D17,S,Mr


We mistakenly assumed that all doctors were men.  
Here we see a female Dr.  
Let's fix this.

In [1299]:
df.loc[796,'Salutation']='Mrs'
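The fix above relies on a hard-coded row index. A more general sketch, assuming that any female passenger mapped to `Mr` should become `Mrs` (the same choice made for Dr. Leader), uses a boolean mask and would work on either the train or the test frame. The `toy` frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Salutation": ["Mr", "Mr", "Miss", "Master"],
})

# Rows where the mapped title contradicts the recorded sex
mismatch = (toy["Salutation"] == "Mr") & (toy["Sex"] == "female")

# Assumption: such passengers are relabelled 'Mrs', as done for Dr. Leader above
toy.loc[mismatch, "Salutation"] = "Mrs"
print(toy)
```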

In [1300]:
# define a dictionary with the mean age for each Salutation and Pclass combination
df_age = pd.concat([df, df_test], axis= 0).groupby(['Salutation','Pclass'])[['Age']].mean().round(2)
age_dict = df_age.to_dict()['Age']
age_dict

{('Master', 1): 6.98,
 ('Master', 2): 2.76,
 ('Master', 3): 6.09,
 ('Miss', 1): 30.13,
 ('Miss', 2): 20.87,
 ('Miss', 3): 17.36,
 ('Mr', 1): 42.2,
 ('Mr', 2): 32.91,
 ('Mr', 3): 28.32,
 ('Mrs', 1): 42.89,
 ('Mrs', 2): 33.52,
 ('Mrs', 3): 32.33}

In [1301]:
def fill_age(row):
    '''
    Add age to missing values
    '''
    if np.isnan(row['Age']):
        return age_dict[row['Salutation'], row['Pclass']]
    else:
        return row['Age']

Fill the missing age values.

In [1302]:
df.Age = df.apply(fill_age, axis=1) # fill the age to the TRAIN dataframe
df_test.Age = df_test.apply(fill_age, axis=1) # fill the age to the TEST dataframe
df.loc[888,:] # check if the age was added           

PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                               17.36
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Salutation                                         Miss
Name: 888, dtype: object

Here we see the result: Miss Johnston from 3rd class was assigned an age of 17.36.  
Unfortunately, she didn't survive the sinking.
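The dictionary lookup with a row-wise `apply` can also be written as a vectorized group fill. The sketch below uses a made-up `toy` frame and computes the group means from that frame alone, whereas the notebook computes them on the train and test sets combined, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Salutation": ["Miss", "Miss", "Miss", "Mr", "Mr", "Mr"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [16.0, 18.0, np.nan, 40.0, np.nan, 44.0],
})

# Mean age of each (Salutation, Pclass) group, broadcast back to every row
group_mean = toy.groupby(["Salutation", "Pclass"])["Age"].transform("mean")

# Keep known ages, use the group mean only where Age is missing
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```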

We have now filled the missing ages and get the following mean Age by Pclass and Sex.

In [1303]:
df.groupby(['Pclass','Sex'])[['Age']].mean().sort_values(by="Age", ascending= False).style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Pclass,Sex,Unnamed: 2_level_1
1,male,41.439508
1,female,35.268617
2,male,30.921481
2,female,28.516316
3,male,26.742305
3,female,21.405208


In [1304]:
# df.drop('Salutation', axis=1, inplace=True) 
# df_test.drop('Salutation', axis=1, inplace=True)

### SibSp

"SibSp" is an abbreviation for "Sibling/Spouse".  
The values for "SibSp" range from 0 (no siblings or spouses on board) to 8 (eight siblings or spouses on board).  


In [1305]:
df_viz = df.copy() # create a copy of the dataframe for simplified plots reading
df_viz['Survived '] = df_viz['Survived'].map({0: 'Not survived', 1: 'Survived'})

In [1306]:
fig= px.histogram(df_viz, x="SibSp", color='Survived ', 
             color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'}, 
             barmode="group", opacity=.7)
fig.update_layout(title='Survival by "SibSp"', xaxis_title= "# Sibling / Spouse", yaxis_title= "Count")

Passengers with 1 sibling/spouse on board had the best chances of survival.  
(Look at the group where the green bar is higher than the red one.)

### Parch

"Parch" is an abbreviation for "Parent/Child".  
The values for "Parch" range from 0 (indicating that the passenger had no parents or children on board) to 6 (indicating that the passenger had six parents or children on board).  

In [1307]:
fig = px.histogram(df_viz, x=["Parch"],  color='Survived ', 
             color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'}, 
             barmode="group", opacity=.7)
fig.update_layout(title='Survival by "Parch"', xaxis_title= "# Parent / Children", yaxis_title= "Count")

As with SibSp, passengers with 1 or 2 parents or children on board had the best chances of survival.

### Ticket

In [1308]:
df.Ticket.duplicated().any()

True

There are duplicated tickets.  
Let's take a closer look.

In [1309]:
pd.set_option('display.max_colwidth', None) # to display full text in columns
df.groupby("Ticket").agg(List_of_passengers_by_one_ticket=("Name", lambda x: x.tolist()),
                                   Count=('PassengerId','count'),
                                   Survived= ("Survived",lambda x: x.tolist())).sort_values(by='Count', ascending=False).head(10)


Unnamed: 0_level_0,List_of_passengers_by_one_ticket,Count,Survived
Ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1601,"[Bing, Mr. Lee, Ling, Mr. Lee, Lang, Mr. Fang, Foo, Mr. Choong, Lam, Mr. Ali, Lam, Mr. Len, Chip, Mr. Chang]",7,"[1, 0, 1, 1, 1, 0, 1]"
CA. 2343,"[Sage, Master. Thomas Henry, Sage, Miss. Constance Gladys, Sage, Mr. Frederick, Sage, Mr. George John Jr, Sage, Miss. Stella Anna, Sage, Mr. Douglas Bullen, Sage, Miss. Dorothy Edith ""Dolly""]",7,"[0, 0, 0, 0, 0, 0, 0]"
347082,"[Andersson, Mr. Anders Johan, Andersson, Miss. Ellis Anna Maria, Andersson, Miss. Ingeborg Constanzia, Andersson, Miss. Sigrid Elisabeth, Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren), Andersson, Miss. Ebba Iris Alfrida, Andersson, Master. Sigvard Harald Elias]",7,"[0, 0, 0, 0, 0, 0, 0]"
CA 2144,"[Goodwin, Master. William Frederick, Goodwin, Miss. Lillian Amy, Goodwin, Master. Sidney Leonard, Goodwin, Master. Harold Victor, Goodwin, Mrs. Frederick (Augusta Tyler), Goodwin, Mr. Charles Edward]",6,"[0, 0, 0, 0, 0, 0]"
347088,"[Skoog, Master. Harald, Skoog, Mrs. William (Anna Bernhardina Karlsson), Skoog, Mr. Wilhelm, Skoog, Miss. Mabel, Skoog, Miss. Margit Elizabeth, Skoog, Master. Karl Thorsten]",6,"[0, 0, 0, 0, 0, 0]"
3101295,"[Panula, Master. Juha Niilo, Panula, Master. Eino Viljami, Panula, Mr. Ernesti Arvid, Panula, Mrs. Juha (Maria Emilia Ojala), Panula, Mr. Jaako Arnold, Panula, Master. Urho Abraham]",6,"[0, 0, 0, 0, 0, 0]"
S.O.C. 14879,"[Hood, Mr. Ambrose Jr, Hickman, Mr. Stanley George, Davies, Mr. Charles Henry, Hickman, Mr. Leonard Mark, Hickman, Mr. Lewis]",5,"[0, 0, 0, 0, 0]"
382652,"[Rice, Master. Eugene, Rice, Master. Arthur, Rice, Master. Eric, Rice, Master. George Hugh, Rice, Mrs. William (Margaret Norton)]",5,"[0, 0, 0, 0, 0]"
PC 17757,"[Bidois, Miss. Rosalie, Robbins, Mr. Victor, Astor, Mrs. John Jacob (Madeleine Talmadge Force), Endres, Miss. Caroline Louise]",4,"[1, 0, 1, 1]"
4133,"[Lefebre, Master. Henry Forbes, Lefebre, Miss. Mathilde, Lefebre, Miss. Ida, Lefebre, Miss. Jeannie]",4,"[0, 0, 0, 0]"


As we can see, several passengers could travel on one ticket.  
That's normal: people traveled with their families or in groups.

In [1310]:
pd.reset_option('display.max_colwidth') # reset the display option

Let's check duplicates taking Name into account.

In [1311]:
df[df[['Ticket','Name']].duplicated()] 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


So there are no duplicates once the passenger Name is taken into account.  

### Fare

"Fare" refers to the amount of money paid by each passenger for their ticket.  
The values for "Fare" range from 0 (indicating that the passenger did not pay any fare, possibly due to a complimentary or staff ticket) to 512.3292 (the highest fare paid by any passenger on board).  
Fares may also indicate a passenger's socioeconomic status or cabin class on a ship, and therefore their chances of survival.

In [1312]:
get_boxplot(df, ['Fare'])
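A box plot flags points visually; a common rule of thumb for turning that into a number is the 1.5 × IQR fence. The sketch below applies it to made-up fares (the `fare` values are illustrative, not taken from the data set). On the real Fare column this rule may flag more rows than a fixed threshold would, so it is shown as one possible criterion, not the one used in this notebook.

```python
import pandas as pd

# Made-up fares with one extreme value, standing in for df['Fare']
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 53.1, 71.28, 512.33])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers

outliers = fare[fare > upper_fence]
print(f"upper fence: {upper_fence:.2f}")
print(outliers.tolist())
```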

In [1313]:
df = df[df.Fare<500].copy() # remove the outlier fares; .copy() avoids SettingWithCopyWarning on later assignments

In [1314]:
df[df['Fare'].isna()] # check if there is a missing value in TRAIN set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


In [1315]:
df_test[df_test['Fare'].isna()] # check if there is a missing value in TEST set

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,Mr


Let's fill that NaN with the mean Fare for the same Pclass and Embarked port.

In [1316]:
fare_dict = df.groupby(['Pclass','Embarked'])[['Fare']].mean().to_dict()['Fare']
fare_dict

{(1, 'C'): 89.80594390243903,
 (1, 'Q'): 90.0,
 (1, 'S'): 70.3648622047244,
 (2, 'C'): 25.358335294117648,
 (2, 'Q'): 12.35,
 (2, 'S'): 20.327439024390245,
 (3, 'C'): 11.214083333333333,
 (3, 'Q'): 11.183393055555555,
 (3, 'S'): 14.64408300283286}

In [1317]:
def fill_fare(row):
    '''
    Add fare to missing values
    '''
    if np.isnan(row['Fare']):
        return fare_dict[row['Pclass'], row['Embarked']]
    else:
        return row['Fare']

In [1318]:
df_test.Fare = df_test.apply(fill_fare, axis=1) # fill the fare to the TEST dataframe

In [1319]:
df_test.loc[152,:]

PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                    14.644083
Cabin                         NaN
Embarked                        S
Salutation                     Mr
Name: 152, dtype: object

The missing Fare is now filled with the mean value for the corresponding Pclass and Embarked port.

### Embarked

"Embarked" refers to the port of embarkation of each passenger.  
These are the three ports at which passengers boarded the Titanic on its maiden voyage:

* C: Cherbourg
* Q: Queenstown (now known as Cobh)
* S: Southampton   

In [1320]:
df[df.Embarked.isna()] # check if there is a missing value in TRAIN set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,Miss
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,Mrs


In [1321]:
df_test[df_test.Embarked.isna()] # check if there is a missing value in TEST set

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


We can guess the Embarked port by looking at passengers with similar Pclass, SibSp, Parch and Fare values.

In [1322]:
df[(df.Pclass == 1) & (df.SibSp == 0) & (df.Parch == 0) & (df.Fare < 90) & (df.Fare > 60)].sort_values(by='Embarked', ascending=False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S,Miss
290,291,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,,S,Miss
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S,Miss
627,628,1,1,"Longley, Miss. Gretchen Fiske",female,21.0,0,0,13502,77.9583,D9,S,Miss
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S,Mrs
139,140,0,1,"Giglio, Mr. Victor",male,24.0,0,0,PC 17593,79.2,B86,C,Mr
218,219,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C,Miss
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,42.89,0,0,PC 17585,79.2,,C,Mrs
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C,Miss
369,370,1,1,"Aubart, Mme. Leontine Pauline",female,24.0,0,0,PC 17477,69.3,B35,C,Mrs


Chances are that these two passengers boarded at Southampton or Cherbourg.  
Let's select Cherbourg.

In [1323]:
df.loc[df.Embarked.isna(), 'Embarked'] = 'C' # fill only the missing Embarked values with 'C' in the TRAIN dataframe
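When filling values with `.loc`, it matters whether the row selector is a boolean mask or an index. The result of `isna()` is a Series with one entry per row, so its `.index` covers every row, not just the missing ones. A small sketch on a made-up `toy` frame shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", np.nan, "Q", np.nan]})

mask = toy["Embarked"].isna()          # boolean Series: True only where missing
print(len(mask.index), mask.sum())     # the mask's index covers all rows; only some are True

filled = toy.copy()
filled.loc[mask, "Embarked"] = "C"     # boolean mask: only the missing rows change
print(filled["Embarked"].tolist())

overwritten = toy.copy()
overwritten.loc[mask.index, "Embarked"] = "C"   # full index: every row is overwritten
print(overwritten["Embarked"].tolist())
```

Checking `value_counts()` before and after a fill is a cheap way to confirm that only the intended rows changed.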

### Cabin
Cabin number could potentially affect survival  
if passengers had been evacuated one by one according to their cabins.  
But that was not the case.

How does nan in Cabin correlate with survival?

In [1324]:
df[df['Cabin'].isna()].groupby(['Pclass','Survived'])[['PassengerId']].count().style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId
Pclass,Survived,Unnamed: 2_level_1
1,0,21
1,1,18
2,0,94
2,1,74
3,0,366
3,1,113


We may notice that a missing Cabin value is associated with both Pclass and survival, especially among 3rd class passengers.  
But there is no causal relationship here.  

We therefore encode Cabin as a binary flag: 0 if the value is missing, 1 if it is present.

In [1325]:
def fill_cabin(row):
    '''
    Return 0 if Cabin is missing, 1 otherwise
    '''
    if pd.isna(row['Cabin']):
        return 0
    else:
        return 1

In [1326]:
df.Cabin = df.apply(fill_cabin, axis=1) # fill the Cabin to train dataframe
df_test.Cabin = df_test.apply(fill_cabin, axis=1) # fill the Cabin to test dataframe
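The same 0/1 flag can be computed without a row-wise `apply`. A minimal sketch on a made-up Series (`cabin` and `has_cabin` are illustrative names):

```python
import numpy as np
import pandas as pd

cabin = pd.Series(["C85", np.nan, "B28", np.nan], name="Cabin")

# 1 where a cabin is recorded, 0 where it is missing
has_cabin = cabin.notna().astype(int)
print(has_cabin.tolist())
```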






### Sex

In [1327]:
df['Sex'].value_counts()

male      575
female    313
Name: Sex, dtype: int64

Two values. 
Ready for encoding. 

### Final value filling checks.

In [1328]:
check_na(df) # check if there are any missing values

PassengerId  0    0.00%
Survived     0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          0    0.00%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        0    0.00%
Embarked     0    0.00%
Salutation   0    0.00%


In [1329]:
check_na(df_test) # check if there are missing values in the test set

PassengerId  0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          0    0.00%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        0    0.00%
Embarked     0    0.00%
Salutation   0    0.00%


### Input data visuals.

In [1330]:
fig=px.histogram(df_viz, x='Survived ',  barmode='stack', color='Survived ', 
color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'},
 width=600, histfunc='count', text_auto=True) 
fig.update_layout( title='Passenger SURVIVAL. Target label distribution', xaxis_title="", yaxis_title="")
fig.update_xaxes(type='category') # format x_axes type
fig.update_layout(yaxis={"showticklabels": False}) # hide y_axes ticks
fig.update_layout(showlegend=False)

The target labels are slightly imbalanced,  
but not enough to need special treatment.

### Let's explore survival rate

In [1331]:
def custom_aggregation(data):
    '''
    Calculate the survival rate for each group
    '''
    d = {}
    d['total'] = data['PassengerId'].count()                # total number of passengers in the group
    d['Survived'] = (data['Survived'] == 1).sum()           # number of survived passengers in the group
    d['Not_survived'] = (data['Survived'] == 0).sum()       # number of not survived passengers in the group
    d['Surv_rate'] = round(d['Survived']/d['total']*100,1)  # survival rate in the group
    return pd.Series(d)

grouped = df.groupby(['Sex', 'Pclass'])[['Sex', 'Pclass','Survived','PassengerId']].apply(custom_aggregation)
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,total,Survived,Not_survived,Surv_rate
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
female,1,93.0,90.0,3.0,96.8
female,2,76.0,70.0,6.0,92.1
female,3,144.0,72.0,72.0,50.0
male,1,120.0,43.0,77.0,35.8
male,2,108.0,17.0,91.0,15.7
male,3,347.0,47.0,300.0,13.5


In [1332]:
# reset the index to make the groupby columns as columns, plotly doesn't support multiindex
fig = px.line(grouped.reset_index(), x="Pclass", y= 'Surv_rate',  color="Sex", text='Surv_rate',
    markers= True,  title= 'Survival rate by Sex and Pclass', width=600, height=600)
fig.update_layout(xaxis = dict(tickmode = 'linear', tick0 = 1, dtick = 1, tickfont = dict(size=30)), 
    xaxis_title="Passenger class", yaxis_title="Survival rate,  %") 
fig.update_traces(marker_size=20, line=dict(width=5))   # change marker size and line width
fig.update_traces(textposition="bottom center")         # change text position
fig.show()

If you were a `female` passenger in `1st class`,  
your chances of surviving the sinking were about `7 times higher` (96.8% vs. 13.5%)  
than those of a `male` passenger in `3rd class`.
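Because `Survived` is coded as 0/1, its mean within a group is the survival rate, so the table can also be produced with named aggregations. The sketch below uses a made-up `toy` frame; the last line checks the ratio quoted in the text from the two rates in the table.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Survived is 0/1, so its mean within a group is the survival rate
rates = (
    toy.groupby(["Sex", "Pclass"])["Survived"]
       .agg(total="count", survived="sum", surv_rate="mean")
)
rates["surv_rate"] = (rates["surv_rate"] * 100).round(1)
print(rates)

# Ratio quoted in the text, from the rates in the table above
ratio = round(96.8 / 13.5, 1)
print(ratio)
```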

In [1333]:
# custom_aggregation (defined above) is reused here for the Salutation / Pclass groups
grouped2 = df.groupby(['Salutation', 'Pclass'])[['Sex', 'Pclass','Survived','PassengerId']].apply(custom_aggregation)
grouped2

Unnamed: 0_level_0,Unnamed: 1_level_0,total,Survived,Not_survived,Surv_rate
Salutation,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Master,1,3.0,3.0,0.0,100.0
Master,2,9.0,9.0,0.0,100.0
Master,3,28.0,11.0,17.0,39.3
Miss,1,47.0,45.0,2.0,95.7
Miss,2,35.0,33.0,2.0,94.3
Miss,3,102.0,51.0,51.0,50.0
Mr,1,117.0,40.0,77.0,34.2
Mr,2,99.0,8.0,91.0,8.1
Mr,3,319.0,36.0,283.0,11.3
Mrs,1,46.0,45.0,1.0,97.8


In [1334]:
# reset the index to make the groupby columns as columns, plotly doesn't support multiindex
fig = px.line(grouped2.reset_index(), x="Salutation", y= 'Surv_rate',  color="Pclass", text='Surv_rate',
    markers= True,  title= 'Survival rate by Salutation and Pclass', width=600, height=600
    # ,category_orders={"Salutation": [ "Master","Miss", "Mrs", "Mr"]}
    )
fig.update_layout(xaxis = dict(tickmode = 'linear', tick0 = 1, dtick = 1, tickfont = dict(size=30)), 
    xaxis_title="Salutation", yaxis_title="Survival rate,  %") 
fig.update_traces(marker_size=20, line=dict(width=5))   # change marker size and line width
fig.update_traces(textposition="bottom center")         # change text position
# fig.update_xaxes(type='category')
# fig.update_layout(xaxis={'categoryorder':'total ascending'})
# fig.update_xaxes(categoryorder='total ascending')
fig.show()

In [1335]:
# ('object' or 'date') evaluates to just 'object', so compare with 'object' directly
num_col = [col for col in df.columns if df[col].dtype != 'object'] # list of numerical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object'] # list of categorical columns

fig= px.box(df[num_col], log_y= True)
fig.update_layout(title='Boxplots of numerical features', xaxis_title="", yaxis_title="")

In [1336]:
df.isna().sum().any(), df_test.isna().sum().any() # final check if there are any missing values

(False, False)

### Encoding categorical variables to numeric

In [1337]:
for i in cat_col:
    print(i,df[i].nunique()) # check the number of unique values in categorical columns

Name 888
Sex 2
Ticket 680
Embarked 3
Salutation 4


We don't plan to use `Name` or `Ticket`.


In [1338]:
df.Sex = df.Sex.map({'male':1,'female':0})      # encode Sex column in TRAIN set
df_test.Sex = df_test.Sex.map({'male':1,'female':0}) 






## Logistic regression
### Predictions based on dirty data

In [1339]:
y = df_raw['Survived'] # target label

In [1340]:
X = df_raw.drop(['Survived','PassengerId','Name','Ticket','Cabin'], axis=1) # drop the target label and the PassengerId, Name, Ticket and Cabin columns
X['Sex'] = X['Sex'].map({'male':1,'female':0}) # encode Sex variable to 0, 1
X = pd.get_dummies(X, columns=['Embarked']) # encode categorical variables to numeric

In [1341]:
# Let's fill the missing values in the Age column with the median value of the Age column.
X['Age'] = X['Age'].fillna(X['Age'].median())

In [1342]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # split the data into train and test sets

In [1343]:
lr = LogisticRegression(max_iter = 500) # create a logistic regression model
lr.fit(X_train, y_train) # fit the model to the train set
lr_pred = lr.predict(X_test) # predict the target label for the test set
print('Logistic Regression accuracy score (dirty data): ', lr.score(X_test, y_test))

Logistic Regression accuracy score (dirty data):  0.8100558659217877


### Predictions based on cleaned data

In [1344]:
y = df['Survived'] # target label
X = df.drop(['Survived','PassengerId','Name','Ticket'], axis=1) # drop the target label and PassengerId
X_subm = df_test.drop(['PassengerId','Name','Ticket'], axis=1) # drop the PassengerId

In [1345]:
X = pd.get_dummies(X, columns=['Embarked','Salutation']) # encode categorical variables to numeric
X_subm = pd.get_dummies(X_subm, columns=['Embarked','Salutation']) 
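Calling `get_dummies` separately on two frames only yields matching columns if both contain the same categories. A defensive sketch, assuming made-up `train` and `test` frames, aligns the second matrix to the first with `reindex`:

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 7.75]})
test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.05, 13.0]})   # no 'C' or 'Q'

X_tr = pd.get_dummies(train, columns=["Embarked"])
X_te = pd.get_dummies(test, columns=["Embarked"])

# Give the test matrix exactly the training columns, in the same order;
# dummy columns absent from the test set are filled with 0
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(X_te.columns.tolist())
```

If the two matrices already share the same columns, the `reindex` call leaves the data unchanged, so it is cheap to keep as a safeguard.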

In [1346]:
# X_poly = PolynomialFeatures(degree=2, interaction_only= True, include_bias = False ).fit_transform(X) # create polynomial features

In [1347]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # split the data into train and test sets

In [1354]:
lr = LogisticRegression(max_iter = 500) # create a logistic regression model
lr.fit(X_train, y_train) # fit the model to the train set
lr_pred = lr.predict(X_test) # predict the target label for the test set
print('Logistic Regression accuracy score (clean data): ', lr.score(X_test, y_test))

Logistic Regression accuracy score (clean data):  0.8651685393258427


In [1349]:
# pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(max_iter = 500))]) # create a pipeline
# pipe.fit(X_train, y_train) # fit the model to the train set
# pipe_pred = pipe.predict(X_test) # predict the target label for the test set
# print('Logistic Regression accuracy score (clean data): ', pipe.score(X_test, y_test))

### By filling the missing values carefully, we increased accuracy from 81.0% to 86.5% (about 5.5 percentage points).
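The two accuracy figures come from single hold-out splits of slightly different data (891 raw rows versus 888 cleaned rows), so part of the gap may be due to the split itself. Cross-validation gives a steadier estimate. The sketch below runs on synthetic data from `make_classification`, standing in for the prepared `X` and `y`; `X_toy`, `y_toy` and `pipe` are illustrative names, and the scaler step is an assumption taken from the commented-out pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared feature matrix X and labels y
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=500)),
])

# 5-fold cross-validated accuracy: every row is used for testing exactly once
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Putting the scaler inside the pipeline means it is refit on the training part of each fold, so no information from the held-out fold leaks into preprocessing.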

In [1350]:
subm = lr.predict(X_subm) # predict the target label for the TEST set

## Random forest

In [1351]:
dt = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=1)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print('Random forest accuracy score, train set (clean data): ', dt.score(X_train, y_train))
print('Random forest accuracy score, test set (clean data): ', dt.score(X_test, y_test))

Random forest accuracy score, train set (clean data):  0.8746478873239436
Random forest accuracy score, test set (clean data):  0.8651685393258427


In [1352]:
y_subm = dt.predict(X_subm) # predict the target label for the TEST set

In [1353]:
pd.DataFrame({'PassengerId': df_test['PassengerId'], 
              'Survived': y_subm}).to_csv('submission.csv', index=False)