**https://www.kaggle.com/code/jamesleslie/titanic-eda-wrangling-imputation/notebook**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
import os
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

%matplotlib inline
rcParams['figure.figsize'] = 10,8
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (12,8)})


In [None]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
df = pd.concat([train, test], axis=0, sort=True)

In [None]:
df.head()

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

display_all(df.describe(include='all').T)

In [None]:
print(df['Survived'])
print(df.Name)

In [None]:
df['Survived'].value_counts()

**2.1. Impute missing age values**
 <br>
A simple option for filling the missing age values is to use the overall median age. Let's go a little further and use each passenger's title to estimate their age. For example, a passenger with the title Dr and no recorded age gets the median age of all other passengers with that title.

**Extract title from name**
 <br>
We can use a regular expression to extract the title from the `Name` column: we capture the run of letters that is immediately followed by a full stop.

In [None]:
# raw string so that \. is a literal full stop; expand=False returns a Series
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
df['Title'].value_counts()
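To see what the pattern captures, here is a minimal sketch on three names written in the same "Surname, Title. Given names" layout as the `Name` column. The `names` Series is a hand-made sample for illustration, not loaded from the data files.

```python
import pandas as pd

# Hand-made sample in the same layout as the Name column (illustration only)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# One capture group: a run of letters immediately followed by a literal full stop.
# The surname is skipped because it is followed by a comma, not a full stop.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

Only the first match per name is returned, so a name with an earlier abbreviation ending in a full stop would yield that abbreviation instead of the title.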

In [None]:
# replace rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr',
           'Don': 'Mr', 'Mme': 'Mrs', 'Jonkheer': 'Mr', 'Lady': 'Mrs',
           'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
df.replace({'Title': mapping}, inplace=True)
df['Title'].value_counts()

**Use median of title group** 
<br>
Now, for each missing age value, we will impute the age using the median age for all people with the same title.

In [None]:
# impute missing Age values using median of Title groups
title_ages = dict(df.groupby('Title')['Age'].median())

# look up the median age for each passenger's title
age_med = df['Title'].map(title_ages)

# replace all missing ages with the median for their title;
# assign the result back, because in-place fillna on a selected column
# is deprecated and may leave df unchanged in recent pandas
df['Age'] = df['Age'].fillna(age_med)

In [None]:
print(len(title_ages))
print(title_ages)
print(type(title_ages))
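The same group-median imputation can also be written without the intermediate dictionary. The sketch below is an alternative formulation, not the approach used in the cell above: it relies on `groupby(...).transform('median')`, and the `toy` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: two title groups, each with one missing age
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':   [30.0, 40.0, np.nan, 4.0, 10.0, np.nan],
})

# transform('median') returns one value per row: the median age of that row's title group
group_median = toy.groupby('Title')['Age'].transform('median')

# fill only the missing ages; known ages are left as they are
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

The missing Mr age becomes 35.0 (median of 30 and 40) and the missing Miss age becomes 7.0 (median of 4 and 10).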

<p>We can visualize the median ages for each title group. Below, we see that each title has a distinctly different median age.</p>

<blockquote><p><strong>Note</strong>: There is no risk in plotting this after imputation: filling a title group's missing ages with that group's median leaves the group's median unchanged.</p>
</blockquote>
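A quick way to convince yourself of this is to compare the median before and after filling. The sketch below uses a made-up Series of ages, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values
ages = pd.Series([22.0, 30.0, 41.0, np.nan, np.nan])

before = ages.median()                 # NaN values are skipped by default
after = ages.fillna(before).median()   # every gap now holds the old median
print(before, after)
```

Both values are 30.0: inserting copies of the median adds values at the middle of the sorted data, so the middle does not move.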

In [None]:
# errorbar=None replaces the deprecated ci=None (seaborn 0.12 and later)
sns.barplot(x='Title', y='Age', data=df, estimator=np.median, errorbar=None, palette='Blues_d')
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.countplot(x='Title', data=df, palette='hls', hue='Survived')
plt.xticks(rotation=45)
plt.show()

<h2 id="2.2.-Impute-missing-fare-values">2.2. Impute missing fare values</h2>
<p>For the single missing fare value, I also use a group median: the median fare of the passenger's class.</p>
<blockquote><p>Perhaps you could come up with a cooler way of visualising the relationship between the price a passenger paid for their ticket and their chances of survival?</p>
</blockquote>
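One possible answer, sketched under assumptions: bin the fares into quantiles and compare the survival rate per bin. The `toy` DataFrame and the `Fare_Bin` column name are invented for illustration, and two bins are used only to keep the example small.

```python
import pandas as pd

# Invented fares and outcomes (illustration only, not real passenger records)
toy = pd.DataFrame({
    'Fare':     [7.25, 8.05, 13.0, 26.0, 53.1, 71.28, 151.55, 263.0],
    'Survived': [0,    0,    0,    1,    1,    1,     0,      1],
})

# qcut splits the fares into equal-sized quantile bins (here: below/above the median)
toy['Fare_Bin'] = pd.qcut(toy['Fare'], q=2, labels=['low', 'high'])

# the mean of a 0/1 column is the share of survivors in each bin
rate = toy.groupby('Fare_Bin', observed=True)['Survived'].mean()
print(rate)
```

On the real data, the same `Fare_Bin` column could be passed as `x` to `sns.barplot` with `y='Survived'`, using more bins (for example `q=4`) and only the rows where `Survived` is known.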

In [None]:
dff = df[['Sex', 'Survived', 'Fare']]


In [None]:
# dff.astype({'Fare': 'int64'}).dtypes
# dff

In [None]:

# drr = df.drop(axis=0, index=152)
# print(drr['Fare'].isna().sum())
# print(drr[df['Fare'].isnull()].index.tolist())


# df.drop(axis=0, index=152)
# print(df['Fare'].isna().sum())
# print(df[df['Fare'].isnull()].index.tolist())
# print(drr['Fare'][152])
# print(df[['Sex','Fare' ,'Survived']][150:155])
# print(drr[['Sex','Fare' ,'Survived']][150:155])

In [None]:
# impute missing Fare values using median of Pclass groups
class_fares = dict(df.groupby('Pclass')['Fare'].median())

# look up the median fare for each passenger's class
fare_med = df['Pclass'].map(class_fares)

# replace all missing fares with the median for their class;
# assign the result back rather than relying on in-place fillna
df['Fare'] = df['Fare'].fillna(fare_med)

In [None]:
# create the figure before plotting; calling plt.figure afterwards opens a second, empty figure
plt.figure(figsize=(4, 4))
sns.swarmplot(x='Sex', y='Fare', hue='Survived', s=2, data=df)
plt.show()

In [None]:
drr = df.astype({'Fare': 'int32'})


<h2 id="2.3.-Impute-missing-&quot;embarked&quot;-value">2.3. Impute missing "embarked" value</h2>
<p>There are also just two missing values in the <code>Embarked</code> column. Here we will simply use the pandas backfill method, which fills each missing value with the next valid value below it.</p>

In [None]:
sns.catplot(x='Embarked', y='Survived', data=df,
            kind='bar', palette='muted', errorbar=None)
plt.show()

In [None]:
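# assign the result back: with inplace=False the filled column was discarded;
# bfill() replaces the deprecated fillna(method='backfill')
df['Embarked'] = df['Embarked'].bfill()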
df.Embarked.unique()
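To make the backfill behaviour concrete, here is a minimal sketch on a made-up Series of port codes. It also shows one limitation: a missing value in the last position has nothing after it to copy from.

```python
import numpy as np
import pandas as pd

# Made-up port codes with one gap in the middle and one at the end
ports = pd.Series(['S', np.nan, 'C', np.nan])

# bfill copies the next valid value upwards into each gap
filled = ports.bfill()
print(filled.tolist())
```

The middle gap becomes `'C'`, while the trailing gap stays missing. Backfill also depends on row order, so filling with the most frequent port is a common alternative when that matters.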

In [None]:
df.columns

<h1 id="3.-Add-family-size-column">3. Add family size column</h1>
<p>We can use the two variables <strong>Parch</strong> and <strong>SibSp</strong> to create a new variable called <strong>Family_Size</strong>. This is done simply by adding <code>Parch</code> and <code>SibSp</code> together, so the result counts the relatives travelling with a passenger and does not include the passenger themselves.</p>

In [None]:
# create Family_Size column (Parch + SibSp)
df['Family_Size'] = df['Parch'] + df['SibSp']
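As a small worked example of the sum, the sketch below uses three made-up passengers; the `toy` DataFrame is for illustration only.

```python
import pandas as pd

# Made-up passengers: one with a sibling/spouse, one alone, one in a large family
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# element-wise sum of the two relative counts
toy['Family_Size'] = toy['Parch'] + toy['SibSp']
print(toy['Family_Size'].tolist())
```

A passenger travelling alone gets a `Family_Size` of 0 under this definition, because the passenger is not counted.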


In [None]:
display_all(df.describe(include='all').T)

<h1 id="4.-Save-cleaned-version">4. Save cleaned version</h1>
<p>Finally, let's save our cleaned data set so we can use it in other notebooks.</p>

In [None]:
# train = df[pd.notnull(df['Survived'])]
# test = df[pd.isnull(df['Survived'])]

In [None]:
# train.to_csv('./data/train_clean.csv', index=False)
# test.to_csv('./data/test_clean.csv', index=False)
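The split back into train and test relies on `Survived` being missing for the rows that came from the test file. A minimal sketch of that idea on made-up data follows; the names `combined`, `train_part` and `test_part` are hypothetical, and dropping the empty `Survived` column from the test part is an optional extra step, not something the cells above do.

```python
import numpy as np
import pandas as pd

# Made-up combined frame: the last two rows stand in for test rows with no label
combined = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived':    [0.0, 1.0, np.nan, np.nan],
})

# rows with a known label go to the train part
train_part = combined[combined['Survived'].notna()]

# rows without a label go to the test part; the all-NaN label column is dropped
test_part = combined[combined['Survived'].isna()].drop(columns='Survived')
print(len(train_part), len(test_part))
```

Each part could then be written out with `to_csv(..., index=False)`.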