
#Feature Engineering Series Tutorial 1: Analysis of Missing Values in Titanic Dataset and its Mechanisms

Missing data, or missing values, occur when no value is stored for certain observations within a variable. Incomplete data is an unavoidable problem in most data sources and may have a significant impact on the conclusions that can be drawn from the data.

##Why is data missing?
Data can go missing for many different reasons. These are just a few examples:
* A value is missing because it was forgotten, lost or not stored properly

* For a certain observation, the value does not exist

* The value can't be known or identified

It is important to understand the mechanism by which missing data is introduced in a dataset, because we may process the missing values differently depending on that mechanism. In addition, once we know the source of the missing data, we can take action to control that source and reduce the amount of missing information in future data collection.

##Missing Data Mechanisms

There are 3 mechanisms that lead to missing data. Two of them involve data going missing randomly or almost randomly, and the third one involves a systematic loss of data.

##Missing Completely at Random (MCAR):

A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations.

##Missing at Random (MAR):

MAR occurs when there is a relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends on available information (i.e., other variables in the dataset).

##Missing Not at Random (MNAR):

Data is missing not at random (MNAR) when there is a systematic reason why the values are missing, typically one tied to the missing value itself. For example, MNAR would occur if people failed to fill in a depression survey because of their level of depression.
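To make the three mechanisms concrete, the sketch below simulates each of them on synthetic data. Everything in it is an assumption made for illustration: the `age` and `score` columns, the missingness probabilities, and the age cut-off of 60 are hypothetical and have nothing to do with the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey data: 'age' is always observed, 'score' may go missing.
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "score": rng.normal(50, 10, size=n),
})

# MCAR: every row has the same 20% chance of losing its score.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed variable (here, age).
mar_mask = rng.random(n) < np.where(df["age"] >= 60, 0.40, 0.10)

# MNAR: the chance depends on the value that goes missing (here, high scores).
mnar_mask = rng.random(n) < np.where(df["score"] > 60, 0.50, 0.05)

# Series.mask replaces the values with NaN wherever the mask is True.
df["score_mcar"] = df["score"].mask(mcar_mask)
df["score_mar"] = df["score"].mask(mar_mask)
df["score_mnar"] = df["score"].mask(mnar_mask)

# Compare the fraction of missing values in the two age groups.
older = df["age"] >= 60
summary = pd.DataFrame({
    col: df[col].isnull().groupby(older).mean()
    for col in ["score_mcar", "score_mar", "score_mnar"]
})
summary.index = ["age < 60", "age >= 60"]
print(summary.round(3))
```

Under MCAR the missing rate should be roughly the same in both age groups, while under MAR it differs clearly between them. The MNAR column is harder to diagnose from observed variables alone, because the cause of the missingness is the value that was lost.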

#In this Video:
In the following cells we will:
* Learn how to detect and quantify missing values

* Try to identify the 3 different mechanisms of missing data introduction 

We will use the toy Loan dataset and the Titanic dataset.



#Let's start!

Here we have imported the necessary libraries.
* pandas is used to read the dataset into a dataframe and perform operations on it

* numpy is used to perform basic array operations

* pyplot from matplotlib is used to visualize the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pyplot
%matplotlib inline

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/titanic.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#Number of missing values per column

In [3]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#What fraction of the data is missing?

In [4]:
data.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
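The counts and fractions above can also be combined into one summary table expressed in percent. The following is a minimal sketch on a tiny made-up frame; the `toy` DataFrame and its column names are assumptions for illustration, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the Titanic data).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# One row per column: absolute count and percentage of missing values.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean().mul(100).round(1),
})

# Keep only columns with at least one missing value, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(summary)
```

Replacing `toy` with `data` should give the same kind of summary for the Titanic columns, with Cabin, Age and Embarked as the only rows.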

#Missing Not at Random (MNAR)

##Cabin analysis


In [5]:
data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |


#Fraction of missing Cabin values among survivors and non-survivors

In [6]:
data.groupby(['Survived'])['cabin_null'].mean()

Survived
0    0.876138
1    0.602339
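Name: cabin_null, dtype: float64

Cabin is missing for about 88% of the passengers who did not survive, compared with about 60% of those who survived, so missingness in Cabin is related to survival rather than spread evenly across passengers.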

#Fraction of missing Age values among survivors and non-survivors

In [7]:
data['Age'].isnull().groupby(data['Survived']).mean()

Survived
0    0.227687
1    0.152047
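Name: Age, dtype: float64

Age is missing for about 23% of non-survivors and about 15% of survivors, the same direction as for Cabin but with a smaller gap.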
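The Cabin and Age comparisons follow the same pattern, so they can be wrapped in a small helper. The function name `missing_rate_by` and the `toy` frame below are hypothetical, introduced only to illustrate the idea on data small enough to check by hand.

```python
import numpy as np
import pandas as pd

def missing_rate_by(df, column, by):
    """Fraction of missing values in `column` within each group of `by`."""
    return df[column].isnull().groupby(df[by]).mean()

# Small made-up frame: 4 non-survivors (0) and 4 survivors (1).
toy = pd.DataFrame({
    "survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "cabin": [np.nan, np.nan, np.nan, "A1", "B2", np.nan, "C3", "D4"],
    "age": [30.0, np.nan, 41.0, 25.0, 19.0, 33.0, np.nan, 52.0],
})

cabin_rates = missing_rate_by(toy, "cabin", "survived")
age_rates = missing_rate_by(toy, "age", "survived")
print(cabin_rates)
print(age_rates)
```

In this toy frame the missing rate for `cabin` differs between the two groups while the rate for `age` does not, which is the kind of contrast the group-wise check is meant to surface. On the notebook's DataFrame the equivalent call would be `missing_rate_by(data, 'Cabin', 'Survived')`.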

#Missing Completely at Random (MCAR)

In [9]:
data['Embarked'].isnull().mean()

0.002244668911335578

In [10]:
data[data['Embarked'].isnull()]

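| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 |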
