> Igor Sorochan DSU-31

# Data quality problems

In [319]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import re



## Prepare

|data frame| used for storing:|
|:---|:---|
|df_raw | untouched input data|
|df | cleaned data|
|df_test|data for test w/o target labels|
|df_viz|simplified plot readings|
|submission|exemplar for Kaggle submission|
| X | independent features|
|y (Series) | target labels|
|y_transformed (Series)|encoded target labels|
### EDA
#### Loading the data set

In [320]:
df_raw=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_train.csv')
df_test=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_test.csv')
# submission=pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/Titanic_gender_submission.csv')

In [321]:
df_raw.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [322]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [323]:
def check_na(df):
  for col in df.columns:
    print(f'{col.ljust(12)} {df[col].isna().sum():<5}{df[col].isna().sum()/df.shape[0]:.2%}')

In [324]:
check_na(df_raw)

PassengerId  0    0.00%
Survived     0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          177  19.87%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        687  77.10%
Embarked     2    0.22%
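
The same summary can be built without an explicit loop. What follows is a minimal sketch, not a replacement for `check_na`: the helper name `na_summary` and the toy frame are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def na_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column."""
    missing = frame.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "share": missing / len(frame),
    }).sort_values("missing", ascending=False)


# Toy data for illustration only (not the Titanic set)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

summary = na_summary(toy)
print(summary)
```

Returning a DataFrame instead of printing makes the result sortable and easy to reuse, for example to compare the train and test sets side by side.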


The data set is small, so rather than dropping rows with NaNs we should try to recover the missing values.  
Age directly influences a passenger's chances of survival.  
Cabin and Embarked may also matter, though less obviously.  
Let's try to **fill in the missing values.**

### Age - passenger age  
About 20% of the Age values are missing (177 of 891).  
What attributes can indicate the age of a passenger?  
Potential candidates are:  
* Pclass  
* Name (salutation)

Let's explore. 

In [325]:
df_raw.groupby(['Pclass'])[['Age']].mean().style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Age
Pclass,Unnamed: 1_level_1
1,38.233441
2,29.87763
3,25.14062


Mean age falls steadily with passenger class, from about 38 in 1st class to about 25 in 3rd class.  
The youngest passengers are in 3rd class.

Let's explore the relationship between Salutation and Age.  
The idea is to **roughly estimate a passenger's age from their Salutation.**

First, we extract the Salutation as the first word in Name that is followed by a dot. 

In [326]:
# df -dataframe for cleaned data
df = df_raw.copy()

pattern= re.compile(r'(\w+)\.')
df['Salutation']= df['Name'].apply(lambda x: re.findall(pattern, x)[0] )
df_test['Salutation']= df_test['Name'].apply(lambda x: re.findall(pattern, x)[0] ) # for test data
df.Salutation.value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: Salutation, dtype: int64

|Salutation|Description|Proposed replacement values|
|:---|:---|:---|
|Master|"Master" was a courtesy title for boys who were too young to be addressed as "Mr". In the early 1900s it was in common use, particularly among wealthy or upper-class families. On passenger lists it distinguishes young male passengers from adult men.| Master|
|Rev|"Reverend", which is a title used to address members of the Christian clergy. | Mr|
|Mlle|French "Mademoiselle" (unmarried women)|Miss|
|Mme|French "Madame"|Mrs|
|Countess|The title of Countess is a noble title given to a woman in certain European countries.|Mrs|
|Jonkheer|The term "Jonkheer" is a Dutch noble title that is roughly equivalent to the English title of "Esquire".|Mr|
|Dona|Spanish  "Madam" or "Lady". There was only one "Dona" on board: Doña Fermina Oliva y Ocana. She was a 39-year-old first-class passenger from Spain who boarded the Titanic in Cherbourg, France, and disembarked in New York City. Doña Fermina Oliva y Ocana was traveling with her maid, Miss Asuncion Durán y More, who also survived the sinking. They both boarded lifeboat 8, which was one of the first to leave the Titanic. They were later transferred to the rescue ship Carpathia and eventually reached New York City on April 18, 1912.|Mrs|
|Major Col Capt Sir Don|Titles of adult male passengers|Mr|
|Ms|There is some ambiguity here. Two women were titled Ms (both with SibSp == 0)|Miss|


In [327]:
# df[df.Salutation=='Ms' ] 

In [328]:
# extract salutation (any word followed by a dot) from name
salut = set(df_raw.Name.str.extract(r'(\w+)\.')[0])  # [0] selects the first column; set() removes duplicates

# add the salutations from the test set
salut = salut | set(df_test.Name.str.extract(r'(\w+)\.')[0])  

# remove the common salutations that need no replacement
salut -=  {'Master','Miss','Mr','Mrs'} 

salut1 = sorted(salut) # sets are unordered, so sort into a list to get a stable order for zip

print('Set of Salutation replacements:')
# replacements for the rare salutations, in the same (alphabetical) order as salut1
salut2 =['Mr', 'Mr','Mrs', 'Mr', 'Mrs', 'Mr', 'Mr', 'Mrs', 'Mr', 'Miss',  'Mrs', 'Miss', 'Mr', 'Mr']
for i in zip(salut1, salut2):
    print (i[0].ljust(9), i[1])

Set of Salutation replacements:
Capt      Mr
Col       Mr
Countess  Mrs
Don       Mr
Dona      Mrs
Dr        Mr
Jonkheer  Mr
Lady      Mrs
Major     Mr
Mlle      Miss
Mme       Mrs
Ms        Miss
Rev       Mr
Sir       Mr
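
The pairing above depends on `salut2` being typed in exactly the alphabetical order of `salut1`. One way to make that less fragile is an explicit dictionary. This is only a sketch: `TITLE_MAP` is a hypothetical name, the pairs are the ones printed above, and the sample names are taken from the data shown in this notebook.

```python
import pandas as pd

# Each rare title sits next to its replacement, so the result does not
# depend on the order of two parallel lists.
TITLE_MAP = {
    "Capt": "Mr", "Col": "Mr", "Don": "Mr", "Dr": "Mr", "Jonkheer": "Mr",
    "Major": "Mr", "Rev": "Mr", "Sir": "Mr",
    "Countess": "Mrs", "Dona": "Mrs", "Lady": "Mrs", "Mme": "Mrs",
    "Mlle": "Miss", "Ms": "Miss",
}

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Montvila, Rev. Juozas",
    "Aubart, Mme. Leontine Pauline",
    "Heikkinen, Miss. Laina",
])

# expand=False returns a Series holding the first word followed by a dot
titles = names.str.extract(r"(\w+)\.", expand=False)

# Titles missing from the mapping (Mr, Mrs, Miss, Master) are left unchanged
normalized = titles.replace(TITLE_MAP)
print(normalized.tolist())
```

A title that appears in neither the mapping nor the four common ones would pass through unchanged, so a quick `value_counts()` afterwards remains a useful check.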


Finally, replace the rare Salutations with the common ones.

In [329]:
df['Salutation'] = df.Salutation.replace(salut1, salut2)
df_test['Salutation'] = df_test.Salutation.replace(salut1, salut2)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Mr
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr


Look at passenger #887 (Rev. Juozas Montvila): he is now labelled Mr.
Now we can estimate a passenger's Age from their Salutation and Pclass.

In [330]:
# define a dictionary with the mean age for each Salutation and Pclass
df_age = pd.concat([df, df_test], axis= 0).groupby(['Salutation','Pclass'])[['Age']].mean().round(2)
age_dict = df_age.to_dict()['Age']
age_dict

{('Master', 1): 6.98,
 ('Master', 2): 2.76,
 ('Master', 3): 6.09,
 ('Miss', 1): 30.13,
 ('Miss', 2): 20.87,
 ('Miss', 3): 17.36,
 ('Mr', 1): 42.24,
 ('Mr', 2): 32.91,
 ('Mr', 3): 28.32,
 ('Mrs', 1): 42.8,
 ('Mrs', 2): 33.52,
 ('Mrs', 3): 32.33}

In [331]:
def fill_age(row):
    '''
    Return the row's Age, or the mean age for its (Salutation, Pclass) group if Age is missing
    '''
    if np.isnan(row['Age']):
        return age_dict[row['Salutation'], row['Pclass']]
    else:
        return row['Age']

Fill the missing age values.

In [332]:
df.Age = df.apply(fill_age, axis=1) # fill the age to the dataframe
df_test.Age = df_test.apply(fill_age, axis=1) # fill the age to the TEST dataframe
df.loc[888,:] # check if the age was added           

PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                               17.36
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Salutation                                         Miss
Name: 888, dtype: object

Here we see the result: Miss Johnston, a 3rd-class passenger, was assigned an age of 17.36.  
Unfortunately, she did not survive the sinking.
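
The row-wise `apply` can also be written as a vectorized fill with `groupby(...).transform`. The sketch below uses invented toy values, not Titanic figures. Note one difference from the approach above: it computes the group means within a single frame, whereas `age_dict` pools the train and test sets.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Salutation": ["Miss", "Miss", "Miss", "Mr", "Mr", "Mr"],
    "Pclass":     [3, 3, 3, 1, 1, 1],
    "Age":        [16.0, 20.0, np.nan, 40.0, np.nan, 50.0],
})

# transform("mean") returns a Series aligned with the original rows,
# holding the mean Age of each row's (Salutation, Pclass) group
group_mean = toy.groupby(["Salutation", "Pclass"])["Age"].transform("mean")

# Only the missing ages are replaced; known ages stay untouched
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

To pool train and test as above, the group means could be computed on the concatenated frame first and then used to fill each set.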

With the missing ages filled in, the mean Age by Pclass and Sex looks as follows.

In [333]:
df.groupby(['Pclass','Sex'])[['Age']].mean().sort_values(by="Age", ascending= False).style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Pclass,Sex,Unnamed: 2_level_1
1,male,41.446393
1,female,35.260957
2,male,30.921481
2,female,28.516316
3,male,26.742305
3,female,21.405208


In [334]:
# fig= px.histogram(df, x= 'Age', color='Survived', 
# color_discrete_map={0:'firebrick', 1:'lightgreen'}, barmode='group', opacity=.7, height=600)
# fig.update_layout(bargap=0.1, title_text='Age distribution', xaxis = dict(
#         tickmode = 'linear', tick0 = 1, dtick = 2)
# )

### SibSp

"SibSp" is an abbreviation for "Sibling/Spouse".  
The values for "SibSp" range from 0 (no siblings or spouse on board) to 8 (eight siblings or spouses on board).  


In [335]:
df_viz = df.copy() # create a copy of the dataframe for simplified plots reading
category_names = {0: 'Not survived', 1: 'Survived'}
df_viz['Survived '] = df_viz['Survived'].map(category_names)

In [336]:
px.histogram(df_viz, x=["SibSp", "Pclass"], color='Survived ', 
             color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'}, 
             barmode="group", opacity=.7)

Passengers with exactly 1 sibling or spouse on board had the best chances of survival.

### Parch

"Parch" is an abbreviation for "Parent/Child".  
The values for "Parch" range from 0 (indicating that the passenger had no parents or children on board) to 6 (indicating that the passenger had six parents or children on board).  

In [337]:
px.histogram(df_viz, x=["Parch"],  color='Survived ', 
             color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'}, 
             barmode="group", opacity=.7)

As with SibSp, passengers with 1 or 2 parents or children on board had the best chances of survival.
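
The histograms show counts, so survival rates have to be judged by eye. A rate per group can also be computed directly. This is a sketch on invented toy values (not Titanic figures); the column names `passengers` and `survival_rate` are chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate;
# the count shows how many passengers each rate is based on.
rates = toy.groupby("SibSp")["Survived"].agg(
    passengers="count",
    survival_rate="mean",
)
print(rates)
```

Keeping the group size next to the rate helps to avoid over-reading rates that rest on only a handful of passengers. The same pattern works for `Parch`.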

### Ticket

In [338]:
df.Ticket.duplicated().any()

True

Some ticket numbers occur more than once.  
Let's look closer.

In [365]:
pd.set_option('display.max_colwidth', None) # to display full text in columns
df.groupby("Ticket").agg(List_of_passengers_by_one_ticket=("Name", lambda x: x.tolist()),
                                   Count=('PassengerId','count'),
                                   Deck=("Pclass", lambda x: x.tolist())).sort_values(by='Count', ascending=False).head(10)


Unnamed: 0_level_0,List_of_passengers_by_one_ticket,Count,Deck
Ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1601,"[Bing, Mr. Lee, Ling, Mr. Lee, Lang, Mr. Fang, Foo, Mr. Choong, Lam, Mr. Ali, Lam, Mr. Len, Chip, Mr. Chang]",7,"[3, 3, 3, 3, 3, 3, 3]"
CA. 2343,"[Sage, Master. Thomas Henry, Sage, Miss. Constance Gladys, Sage, Mr. Frederick, Sage, Mr. George John Jr, Sage, Miss. Stella Anna, Sage, Mr. Douglas Bullen, Sage, Miss. Dorothy Edith ""Dolly""]",7,"[3, 3, 3, 3, 3, 3, 3]"
347082,"[Andersson, Mr. Anders Johan, Andersson, Miss. Ellis Anna Maria, Andersson, Miss. Ingeborg Constanzia, Andersson, Miss. Sigrid Elisabeth, Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren), Andersson, Miss. Ebba Iris Alfrida, Andersson, Master. Sigvard Harald Elias]",7,"[3, 3, 3, 3, 3, 3, 3]"
CA 2144,"[Goodwin, Master. William Frederick, Goodwin, Miss. Lillian Amy, Goodwin, Master. Sidney Leonard, Goodwin, Master. Harold Victor, Goodwin, Mrs. Frederick (Augusta Tyler), Goodwin, Mr. Charles Edward]",6,"[3, 3, 3, 3, 3, 3]"
347088,"[Skoog, Master. Harald, Skoog, Mrs. William (Anna Bernhardina Karlsson), Skoog, Mr. Wilhelm, Skoog, Miss. Mabel, Skoog, Miss. Margit Elizabeth, Skoog, Master. Karl Thorsten]",6,"[3, 3, 3, 3, 3, 3]"
3101295,"[Panula, Master. Juha Niilo, Panula, Master. Eino Viljami, Panula, Mr. Ernesti Arvid, Panula, Mrs. Juha (Maria Emilia Ojala), Panula, Mr. Jaako Arnold, Panula, Master. Urho Abraham]",6,"[3, 3, 3, 3, 3, 3]"
S.O.C. 14879,"[Hood, Mr. Ambrose Jr, Hickman, Mr. Stanley George, Davies, Mr. Charles Henry, Hickman, Mr. Leonard Mark, Hickman, Mr. Lewis]",5,"[2, 2, 2, 2, 2]"
382652,"[Rice, Master. Eugene, Rice, Master. Arthur, Rice, Master. Eric, Rice, Master. George Hugh, Rice, Mrs. William (Margaret Norton)]",5,"[3, 3, 3, 3, 3]"
PC 17757,"[Bidois, Miss. Rosalie, Robbins, Mr. Victor, Astor, Mrs. John Jacob (Madeleine Talmadge Force), Endres, Miss. Caroline Louise]",4,"[1, 1, 1, 1]"
4133,"[Lefebre, Master. Henry Forbes, Lefebre, Miss. Mathilde, Lefebre, Miss. Ida, Lefebre, Miss. Jeannie]",4,"[3, 3, 3, 3]"


As we can see, several passengers could travel on one ticket.  
That is normal: people traveled with their families or in groups.

In [340]:
pd.reset_option('display.max_colwidth') # reset the display option

In [341]:
df[df[['Ticket','Name']].duplicated()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


So there are no true duplicates once the passenger Name is taken into account.  
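
The two checks in this section differ only in which columns define a duplicate. A small sketch of that difference, using a few ticket numbers and names that appear in this notebook (the variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "Ticket": ["1601", "1601", "113572", "113572", "3701"],
    "Name": [
        "Bing, Mr. Lee",
        "Ling, Mr. Lee",
        "Icard, Miss. Amelie",
        "Stone, Mrs. George Nelson (Martha Evelyn)",
        "Storey, Mr. Thomas",
    ],
})

# keep=False flags every row whose ticket number occurs more than once
shared_ticket = sample["Ticket"].duplicated(keep=False)

# With subset=..., a row is a duplicate only if the (Ticket, Name) pair repeats
true_duplicates = sample.duplicated(subset=["Ticket", "Name"])

print(shared_ticket.tolist())
print(true_duplicates.any())
```

Shared tickets are therefore a property of groups traveling together, while a repeated (Ticket, Name) pair would point to a genuinely duplicated record.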

### Fare

"Fare" refers to the amount of money paid by each passenger for their ticket.  
The values for "Fare" range from 0 (indicating that the passenger did not pay any fare, possibly due to a complimentary or staff ticket) to 512.3292 (the highest fare paid by any passenger on board).  
Fares may also indicate a passenger's socioeconomic status or cabin class on a ship, and therefore their chances of survival.

In [342]:
df[df['Fare'].isna()] # check if there is a missing value in TRAIN set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


In [343]:
df_test[df_test['Fare'].isna()] # check if there is a missing value in TEST set

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,Mr


Let's fill that NaN with the mean Fare for the passenger's Pclass and Embarked port.

In [344]:
fare_dict = df_test.groupby(['Pclass','Embarked'])[['Fare']].mean().to_dict()['Fare']
fare_dict

{(1, 'C'): 110.07351071428572,
 (1, 'Q'): 90.0,
 (1, 'S'): 76.677504,
 (2, 'C'): 20.120445454545457,
 (2, 'Q'): 11.27395,
 (2, 'S'): 23.056089743589745,
 (3, 'C'): 10.658700000000001,
 (3, 'Q'): 8.998985365853658,
 (3, 'S'): 13.913029787234043}

In [345]:
def fill_fare(row):
    '''
    Return the row's Fare, or the mean fare for its (Pclass, Embarked) group if Fare is missing
    '''
    if np.isnan(row['Fare']):
        return fare_dict[row['Pclass'], row['Embarked']]
    else:
        return row['Fare']

In [346]:
df_test.Fare = df_test.apply(fill_fare, axis=1) # fill the fare to the TEST dataframe

In [347]:
df_test.loc[152,:]

PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                     13.91303
Cabin                         NaN
Embarked                        S
Salutation                     Mr
Name: 152, dtype: object

The Fare attribute is now filled with the mean value for the matching Pclass and Embarked port.

### Embarked

"Embarked" refers to the port of embarkation of each passenger.  
Passengers boarded the Titanic at one of three ports on its maiden voyage:

* C: Cherbourg
* Q: Queenstown (now known as Cobh)
* S: Southampton   

In [348]:
df[df.Embarked.isna()] # check if there is a missing value in TRAIN set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,Miss
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,Mrs


In [349]:
df_test[df_test.Embarked.isna()] # check if there is a missing value in TEST set

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation


We can infer the Embarked port from passengers with similar Pclass, SibSp, Parch and Fare values.

In [350]:
df[(df.Pclass == 1) & (df.SibSp == 0) & (df.Parch == 0) & (df.Fare < 90) & (df.Fare > 60)].sort_values(by='Embarked', ascending=False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S,Miss
290,291,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,,S,Miss
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S,Miss
627,628,1,1,"Longley, Miss. Gretchen Fiske",female,21.0,0,0,13502,77.9583,D9,S,Miss
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S,Mrs
139,140,0,1,"Giglio, Mr. Victor",male,24.0,0,0,PC 17593,79.2,B86,C,Mr
218,219,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C,Miss
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,42.8,0,0,PC 17585,79.2,,C,Mrs
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C,Miss
369,370,1,1,"Aubart, Mme. Leontine Pauline",female,24.0,0,0,PC 17477,69.3,B35,C,Mrs


Chances are these passengers boarded at Southampton or Cherbourg.  
Let's select Cherbourg.
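
One way to make this choice less arbitrary is to compare the comparable passengers per port. The sketch below copies the Embarked and Fare values of the ten rows listed above; using the median fare as a tie-breaker is our assumption, not an established rule.

```python
import pandas as pd

# Embarked and Fare copied from the ten comparable passengers listed above
similar = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "S", "C", "C", "C", "C", "C"],
    "Fare": [86.5, 78.85, 86.5, 77.9583, 86.5,
             79.2, 76.2917, 79.2, 83.1583, 69.3],
})

# Number of comparable passengers per port: a 5 to 5 tie here
counts = similar["Embarked"].value_counts()

# Tie-breaker (an assumption): which port's median fare is closest
# to the 80.0 paid by the two passengers with a missing Embarked value?
median_fare = similar.groupby("Embarked")["Fare"].median()
target_fare = 80.0
closest_port = (median_fare - target_fare).abs().idxmin()

print(counts.to_dict())
print(median_fare.to_dict())
print(closest_port)
```

With these ten rows the counts are tied, and the median fare for Cherbourg (79.2) lies closer to 80.0 than the median for Southampton (86.5), which is consistent with selecting Cherbourg.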

In [351]:
df.Embarked = df.Embarked.fillna('C') # fill the Embarked to train dataframe

### Cabin
The cabin number could have affected passenger survival  
if everyone had been evacuated one by one according to their cabins.  
But that was not the case.

How does nan in Cabin correlate with survival?

In [352]:
df[df['Cabin'].isna()].groupby(['Pclass','Survived'])[['PassengerId']].count().style.bar(align='mid', color='coral')

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId
Pclass,Survived,Unnamed: 2_level_1
1,0,21
1,1,19
2,0,94
2,1,74
3,0,366
3,1,113


We may notice that a missing Cabin value correlates with survival and Pclass, especially among 3rd-class passengers.  
But this does not imply a causal relationship.  

We therefore reduce Cabin to a binary flag: 0 if the value is missing, 1 if it is filled.

In [353]:
def fill_cabin(row):
    '''
    Return 0 if Cabin is missing, 1 if it is filled
    '''
    if pd.isna(row['Cabin']):
        return 0
    else:
        return 1

In [354]:
df.Cabin = df.apply(fill_cabin, axis=1) # replace Cabin with the 0/1 flag in the train dataframe
df_test.Cabin = df_test.apply(fill_cabin, axis=1) # replace Cabin with the 0/1 flag in the test dataframe
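
The same flag can be produced without a row-wise `apply`. A minimal sketch on toy data follows; the column name `HasCabin` is hypothetical (the notebook overwrites `Cabin` instead).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "B28", np.nan]})

# notna() gives True where a cabin is recorded; astype(int) turns it into 1/0
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
print(toy)
```

Writing the flag to a new column keeps the original cabin strings available in case the deck letter is needed later.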

### Final checks for missing values

In [355]:
check_na(df) # check if there are any missing values

PassengerId  0    0.00%
Survived     0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          0    0.00%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        0    0.00%
Embarked     0    0.00%
Salutation   0    0.00%


In [356]:
check_na(df_test) # check if there are missing values in the test set

PassengerId  0    0.00%
Pclass       0    0.00%
Name         0    0.00%
Sex          0    0.00%
Age          0    0.00%
SibSp        0    0.00%
Parch        0    0.00%
Ticket       0    0.00%
Fare         0    0.00%
Cabin        0    0.00%
Embarked     0    0.00%
Salutation   0    0.00%



In [357]:
fig=px.histogram(df_viz, x='Survived ',  barmode='stack', color='Survived ', 
color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'},
 width=600, histfunc='count', text_auto=True) 
fig.update_layout( title='Passenger SURVIVAL. Target label distribution', xaxis_title="", yaxis_title="")
fig.update_xaxes(type='category') # format x_axes type
fig.update_layout(yaxis={
    # "tickvals": [],
    # "ticktext": [],
    "showticklabels": False
})
fig.update_layout(showlegend=False)

# fig.show()

There is a moderate target label imbalance: 342 of the 891 passengers (about 38%) survived.

In [358]:
fig=px.histogram(df_viz, x='Sex',  barmode='group', color='Survived ', 
color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'},width=600, histfunc='count', text_auto=True) 
fig.update_layout( title='Survival and passengers GENDER', xaxis_title="", yaxis_title="")
fig.update_xaxes(type='category') # format x_axes type
fig.update_layout(yaxis={
    # "tickvals": [],
    # "ticktext": [],
    "showticklabels": False
})

The survival rate among female passengers is much higher than among male passengers.

In [359]:
fig=px.histogram(df_viz, x=['Pclass'],  barmode='group', color='Survived ', 
color_discrete_map={'Not survived':'firebrick', 'Survived':'lightgreen'},width=600, histfunc='count', text_auto=True) 
fig.update_layout( title='Survival and passengers Pclass', xaxis_title="", yaxis_title="")
fig.update_layout(yaxis={"showticklabels": False})

The survival rate among 1st-class passengers is much higher than among 3rd-class passengers.

In [381]:
df_raw.iloc[:,[1,2,4,0]].pivot_table( df_raw.iloc[:,[1,2,4,0]], index= ['Sex', 'Survived','Pclass'], aggfunc=['count','mean'], margins= True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,PassengerId,PassengerId
Sex,Survived,Pclass,Unnamed: 3_level_2,Unnamed: 4_level_2
female,0.0,1.0,3,325.0
female,0.0,2.0,6,423.5
female,0.0,3.0,72,440.375
female,1.0,1.0,91,473.967033
female,1.0,2.0,70,444.785714
female,1.0,3.0,72,359.083333
male,0.0,1.0,77,413.623377
male,0.0,2.0,91,454.010989
male,0.0,3.0,300,456.75
male,1.0,1.0,45,527.777778


In [382]:
def percent(data):
    d = {}
    d['total'] = data['PassengerId'].count()
    d['Survived'] = (data['Survived'] == 1).sum()
    d['Not_survived'] = (data['Survived'] == 0).sum()
    d['Surv_rate'] = round(d['Survived']/d['total']*100,1)
    return pd.Series(d)

df.groupby(['Sex', 'Pclass'])[['Sex', 'Pclass','Survived','PassengerId']].apply(percent).reset_index()

Unnamed: 0,Sex,Pclass,total,Survived,Not_survived,Surv_rate
0,female,1,94.0,91.0,3.0,96.8
1,female,2,76.0,70.0,6.0,92.1
2,female,3,144.0,72.0,72.0,50.0
3,male,1,122.0,45.0,77.0,36.9
4,male,2,108.0,17.0,91.0,15.7
5,male,3,347.0,47.0,300.0,13.5
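
The same table can be produced with named aggregation instead of a custom function passed to `apply`. In this sketch the passenger rows are rebuilt from the counts in the table above, so that the block runs on its own; the variable and column names are illustrative.

```python
import pandas as pd

# (Sex, Pclass, survived, not survived) counts copied from the table above
counts = [
    ("female", 1, 91, 3), ("female", 2, 70, 6), ("female", 3, 72, 72),
    ("male", 1, 45, 77), ("male", 2, 17, 91), ("male", 3, 47, 300),
]

# Rebuild one row per passenger so that groupby has something to aggregate
rows = [
    {"Sex": sex, "Pclass": pclass, "Survived": flag}
    for sex, pclass, n_yes, n_no in counts
    for flag, n in ((1, n_yes), (0, n_no))
    for _ in range(n)
]
passengers = pd.DataFrame(rows)

summary = (
    passengers.groupby(["Sex", "Pclass"])["Survived"]
    .agg(total="count", survived="sum", surv_rate="mean")
    .assign(
        not_survived=lambda t: t["total"] - t["survived"],
        surv_rate=lambda t: (t["surv_rate"] * 100).round(1),
    )
    .reset_index()
)
print(summary)
```

Named aggregation keeps the counts as integers, whereas returning a `pd.Series` from `apply` mixes counts and rates in one Series, which is why they appear as floats (for example 94.0) in the table above.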


In [311]:
# toy frame for a weighted-average example; a separate name keeps the cleaned df intact
df_demo = pd.DataFrame({'group':['a','a','b','b'],
                        'd1':[5,10,100,30],
                        'd2':[7,1,3,20],
                        'weights':[.2,.8, .4, .6]},
                       columns=['group', 'd1', 'd2', 'weights'])
df_demo                 


Unnamed: 0,group,d1,d2,weights
0,a,5,7,0.2
1,a,10,1,0.8
2,b,100,3,0.4
3,b,30,20,0.6


In [312]:
def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

df_demo.groupby('group').apply(weighted_average)

Unnamed: 0_level_0,d1_wa,d2_wa
group,Unnamed: 1_level_1,Unnamed: 2_level_1
a,9.0,2.2
b,58.0,13.2


In [None]:
train_data = df_raw.copy()  # assumption: train_data is the raw training set loaded above as df_raw
train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()


'Sex' is a very interesting feature, isn't it? Let's explore more features.

In [None]:
sb.countplot(x='Pclass', hue='Survived', data=train_data)  # keyword arguments are required in recent seaborn versions
plt.title('Pclass: Survived vs Dead')
plt.show()

Wow, that looks striking. It is often said that money can't buy everything, but passengers in Class 1 were clearly given priority during the rescue. Class 3 had more passengers than Class 1 and Class 2, yet only about 24% of them survived. In Class 2 the survival and non-survival rates are roughly 47% and 53%, while in Class 1 about 63% survived. So money and status mattered here.

Let's dive into the data again to look for more interesting observations.

In [None]:
pd.crosstab([train_data.Sex,train_data.Survived],train_data.Pclass,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
sb.catplot(x='Pclass', y='Survived', hue='Sex', data=train_data, kind='point')  # catplot(kind='point') replaces the older factorplot
plt.show()

I use a point plot and a crosstab here because they make categorical variables easy to visualize. Both show that the survival rate for women in Class 1 is about 97%, as only 3 out of 94 women died. It is now clearer that women were given first priority during the rescue irrespective of class, because the survival rate for men is low even in Class 1. Pclass is nevertheless an important feature as well.

In [None]:
# these statistics cover all passengers with a known age, not only survivors
print('Oldest passenger age:',train_data['Age'].max())
print('Youngest passenger age:',train_data['Age'].min())
print('Average passenger age:',train_data['Age'].mean())

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
# x and y are passed by keyword, as recent seaborn versions require
sb.violinplot(x='Pclass',y='Age',hue='Survived',data=train_data,split=True,ax=ax[0])
ax[0].set_title('PClass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sb.violinplot(x='Sex',y='Age', hue='Survived', data=train_data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

From the violin plots above, the following observations stand out:

1) The number of children increases from Class 1 to Class 3; Class 3 has more children than the other two.  
2) The survival rate of children aged 10 and below is good irrespective of class.  
3) The survival rate for ages 20-30 is good, and noticeably better for women.

The Age feature has 177 missing values (NaN) that we have to deal with. We can't simply fill every NaN with the mean age: the mean is about 30, and we cannot assign 30 to a child or to an elderly passenger. So we need something better. Let's explore the data set further.

What if we look at the 'Name' feature? It looks interesting. Let's check it.

In [None]:
# extract the title: the first run of letters followed by a dot
train_data['Initial'] = train_data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

In [None]:
pd.crosstab(train_data.Initial,train_data.Sex).T.style.background_gradient(cmap='summer_r')

Besides the common titles (Mr, Mrs, Miss, Master) there are many rare ones, so I will replace the rare titles with the matching common ones.

In [None]:
train_data.groupby('Initial')['Age'].mean()


From the plots above, I made the following observations:

(1) Children and women were given first priority during the rescue: passengers under 5 were saved in large numbers.  
(2) The oldest saved passenger was 80.  
(3) Most deaths were among passengers aged 30-40.

In [None]:
sb.catplot(x='Pclass', y='Survived', col='Initial', data=train_data, kind='point')  # catplot(kind='point') replaces the older factorplot
plt.show()



From the point plots above it is clearly seen that women and children were saved irrespective of Pclass.


Let's explore some more

Feature: SibSp

The SibSp feature indicates whether a person traveled alone or with family members. Sibling = brother, sister, etc.; Spouse = husband or wife.

In [None]:
pd.crosstab([train_data.SibSp],train_data.Survived).style.background_gradient('summer_r')

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sb.barplot(x='SibSp',y='Survived', data=train_data,ax=ax[0])
ax[0].set_title('SibSp vs Survived in BarPlot')
# pointplot is the axes-level plot that factorplot drew by default; unlike factorplot it accepts ax
sb.pointplot(x='SibSp',y='Survived', data=train_data,ax=ax[1])
ax[1].set_title('SibSp vs Survived in PointPlot')
plt.show()

In [None]:
pd.crosstab(train_data.SibSp,train_data.Pclass).style.background_gradient('summer_r')

This feature holds several interesting facts. The bar plot and the point plot show that a passenger with no siblings or spouse on board had a survival rate of 34.5%. The rate peaks for passengers with one sibling or spouse and then falls as the number grows. At first this seems plausible: with family on board, I would try to save them before saving myself. But something else is going on: the survival rate for passengers with a SibSp of 5-8 is 0%. Is this because of Pclass? Yes: the crosstab shows that passengers with SibSp > 3 were all in Pclass 3. Evidently, the largest families in Pclass 3 did not survive.

These are some of the interesting facts we have observed in the Titanic data set.