In [1]:
# pd.read_csv needs a direct-download link, not the Drive "view" page
file = 'https://drive.google.com/uc?export=download&id=18b5oqoSnSQ7d28vnv8RMgan2tI1--uLs'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv(file)

In [None]:
df.head()

In [None]:
# make the index start from 1 instead of 0
df.index = df.index + 1

In [None]:
df.head()

In [None]:
# check data types and non-null counts
df.info()

In [None]:
# isna() and isnull() are aliases; count the missing values in each column
df.isna().sum()

One way to handle missing values is to drop every row that contains one. This is the quickest method, but consider how much data is lost in the process.

In [None]:
new_df = df.dropna()

In [None]:
new_df.shape

In [None]:
new_df.isna().sum()
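
To make the cost of dropping rows concrete, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the notebook's dataset). It compares dropping rows with any missing value against dropping only rows where one chosen column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

dropped_all = toy.dropna()                # drop a row if ANY column is missing
dropped_age = toy.dropna(subset=['Age'])  # drop only rows where Age is missing

rows_lost_all = len(toy) - len(dropped_all)
rows_lost_age = len(toy) - len(dropped_age)
print(rows_lost_all, rows_lost_age)
```

Restricting `dropna` with `subset` keeps rows that are only missing values in columns you may not need.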

For numeric data, such as missing age records, filling the gaps with a statistical measure like the mean or median is recommended.

In [None]:
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

In [None]:
df.isna().sum()
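
The choice between mean and median matters when the column has outliers. The sketch below uses a few made-up ages (hypothetical values) to show how one extreme value pulls the mean, and therefore the filled value, upward while the median stays close to the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one outlier (80)
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0])

mean_filled = ages.fillna(ages.mean())      # mean is pulled up by the outlier
median_filled = ages.fillna(ages.median())  # median is robust to the outlier

print(mean_filled[3], median_filled[3])
```

When a column is strongly skewed, the median is usually the safer fill value.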

Cabin records are categorical: they are not numbers but distinct categories. In this case, the mode could be a suitable fill value.

In [None]:
df['Cabin'].mode()

The column is multi-modal (it has more than one mode); in this case you may call it 'tri-modal'.
`NOTE:` Plotting the distribution makes this detail easier to see.
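
As a rough illustration of how a tie between modes shows up, the sketch below uses a small made-up series of cabin labels (hypothetical values). `mode()` returns every value tied for the highest count, and `value_counts()` gives the counts behind that tie.

```python
import pandas as pd

# Hypothetical cabin labels: three values tie for the highest count
cabins = pd.Series(['B96', 'B96', 'G6', 'G6', 'C23', 'C23', 'E101', None])

modes = cabins.mode()            # all values tied for the top count, sorted
counts = cabins.value_counts()   # missing values are excluded by default

n_modes = int((counts == counts.max()).sum())
print(list(modes), n_modes)
```

Calling `counts.plot(kind='bar')` on the result would draw the same information as a bar chart.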

In [None]:
df.Fare.max()

Some columns can be dropped entirely when too many of their values are missing.

In [None]:
df.drop('Cabin', axis=1, inplace=True)

In [None]:
df.isna().sum()
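
One way to decide which columns to drop is to look at the share of missing values per column. The sketch below is an assumption-laden example: the `toy` frame is made up and the 0.5 `threshold` is an arbitrary cutoff, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Cabin is mostly missing
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, np.nan, np.nan, 'C123'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

missing_share = toy.isna().mean()  # fraction of missing values per column
threshold = 0.5                    # assumed cutoff; choose one that fits your data

to_drop = missing_share[missing_share > threshold].index.tolist()
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```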

It is important to consider relationships between columns, as they may give us a clue about which values are suitable for filling in the gaps.

In [None]:
# inspect the rows with missing values in the Embarked column
df[df.Embarked.isna()]

Could passengers who paid a similar 'Fare' also have embarked at the same port? The ticket numbers of these rows are also similar.

In [None]:
df[(df.Fare >= 80.0) & (df.Fare < 90.0)]['Embarked']#.dropna().mode()

In [None]:
df[(df.Fare >= 80.0) & (df.Fare < 90.0)]['Embarked'].value_counts()

In [None]:
# assign the result back; inplace=True on a selected column may not update df
df['Embarked'] = df['Embarked'].fillna('C')

In [None]:
df.isna().sum()
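
The same idea can be written without hard-coding the port: take the most common port among rows in the same fare band and use it as the fill value. This is only a sketch on made-up rows (the `toy` frame is hypothetical), reusing the 80 to 90 fare band from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the last one has no Embarked value
toy = pd.DataFrame({
    'Fare': [80.0, 83.2, 86.5, 89.1, 7.25, 80.0],
    'Embarked': ['C', 'C', 'S', 'C', 'S', np.nan],
})

# Rows whose fare falls in the same band as the missing record
in_band = (toy['Fare'] >= 80.0) & (toy['Fare'] < 90.0)

# Most frequent port within that band (missing values are ignored by mode())
band_mode = toy.loc[in_band, 'Embarked'].mode().iloc[0]

toy['Embarked'] = toy['Embarked'].fillna(band_mode)
print(band_mode)
```

If the band had tied modes, `.iloc[0]` would simply pick the first in sorted order, so it is worth checking `value_counts()` before relying on it.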

In [None]:
# age
df['Age'].mean()

In [None]:
df['Age'].median()

In [None]:
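median_age = df['Age'].median()
# Age was already filled with the mean above, so nothing is left to fill here;
# on unfilled data this is the median alternative (assign the result back)
df['Age'] = df['Age'].fillna(median_age)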

Forward-fill and back-fill methods: these fill a missing value with the previous or the next valid value in the column, respectively.

In [None]:
# forward fill, backward fill
# take a sample of our age data
# (Age was filled earlier, so reload the data first to see missing ages here)
df_sample = df.sample(30)

In [None]:
df_sample.isna().sum()

In [None]:
# apply forward fill
df_sample['Age'].ffill()
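
To see what the two methods do on actual gaps, here is a small sketch on a made-up series (hypothetical values). Note that forward fill cannot fill a leading gap and back fill cannot fill a trailing one, so the two are sometimes chained.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps at the start, middle, and end
ages = pd.Series([np.nan, 22.0, np.nan, np.nan, 30.0, np.nan])

forward = ages.ffill()       # copy the previous valid value; the leading NaN stays
backward = ages.bfill()      # copy the next valid value; the trailing NaN stays
both = ages.ffill().bfill()  # forward fill first, then back-fill what is left

print(both.tolist())
```

Both methods depend on row order, so they make most sense for ordered data such as time series; on a random sample the neighbouring rows are arbitrary.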

Additional Resources:
[Techniques to Handle Missing Data](https://www.datacamp.com/tutorial/techniques-to-handle-missing-data-values)
