## Introduction :
> * Data cleaning is the process of finding and __correcting__ inaccurate or incorrect data in a dataset.
> * One part of that process is handling missing values. Real-world datasets often have many of them, so dealing with them is an __important__ step.


> * Why do you need to fill in the missing data? Because most machine learning models will raise an error if you pass NaN values to them. The easiest fix is to fill them with 0, but this can __reduce__ your model's __accuracy__ significantly.

> * Many methods are available for filling missing values. To choose the best one, you need to understand the type of missing value and its significance before you start filling or deleting data.
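
> * Before choosing a method it helps to see how much is actually missing. The sketch below counts missing values per column on a small made-up DataFrame; the name `toy` and its values are assumptions for illustration only, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

> * A column with only a few gaps is usually a candidate for imputation, while a column that is mostly empty may be better dropped.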

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Titanic_Kaggle.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


>* We now delete some columns that are not useful for this data cleaning exercise (__PassengerId__, __Name__, __Ticket__, __Cabin__ and __Embarked__). In your own analysis, drop only the columns you are sure you do not need.
>* The cleaning process has not started yet; we have only dropped some __not so useful__ columns.

In [3]:
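# Drop by name rather than by position, so the intent is visible and column order does not matter
df.drop(columns = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], inplace = True)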
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05


> * Here the __'Sex'__ variable is a categorical variable.
> * We need to change it into numerical values, say male : 1; female : 0.
> * For that we can use __Label-Encoding__ or __One-Hot-Encoding__. Label encoding is used here :
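
> * The notebook uses label encoding. For comparison, here is a minimal sketch of the one-hot alternative using `pd.get_dummies`; the small `toy` frame is a made-up stand-in for the 'Sex' column.

```python
import pandas as pd

# Made-up stand-in for the 'Sex' column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(toy["Sex"], prefix="Sex", dtype=int)
print(one_hot)

# With only two categories one column carries all the information,
# so drop_first=True removes the redundant first column (Sex_female).
one_hot_single = pd.get_dummies(toy["Sex"], prefix="Sex", drop_first=True, dtype=int)
print(one_hot_single)
```

> * For a two-valued column such as 'Sex', the single remaining column is equivalent to the label-encoded one, so either approach works here.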

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

new_df = df.copy()   # Save an independent copy; plain `new_df = df` would only create a second name for the same DataFrame.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int32(1), int64(4)
memory usage: 45.4 KB


> * __Age__ has missing values (714 non-null out of 891).
> * Let's try to fit this data with __Logistic Regression__ even though we have NaN values.
> * Let's check whether it accepts the data or not.
> * If not, we will look at some techniques to impute missing values.

### Splitting the data into x and y

In [6]:
y = df.iloc[:, 0]                   # target: Survived, as a 1-D Series
x = df.iloc[:, [1,2,3,4,5,6]]       # features: Pclass, Sex, Age, SibSp, Parch, Fare

In [7]:
x.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,1,22.0,1,0,7.25
1,1,0,38.0,1,0,71.2833
2,3,0,26.0,0,0,7.925
3,1,0,35.0,1,0,53.1
4,3,1,35.0,0,0,8.05


In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

In [9]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(random_state = 0)

lg.fit(x_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

> * The logistic regression model does not work because we have NaN values in the dataset. Only some machine learning algorithms can handle missing data on their own (scikit-learn's histogram-based gradient boosting estimators are one example); most, including logistic regression, need the gaps to be dealt with first.
> * We will look at the following methods, apply each one to the logistic regression model above, and check which one is more accurate.

> __Methods__ :
> * Deleting the columns with missing data
> * Deleting the rows with missing data
> * Filling the missing data with a value – Imputation
> * Imputation with an additional column
> * Filling with a Regression Model
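
> * To compare these methods fairly, it helps to score every variant the same way. The sketch below is one possible helper, not the notebook's own code: the function name `evaluate`, the fixed seed and the synthetic `demo_X` / `demo_y` data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(features, target, seed=0):
    """Hypothetical helper: fit logistic regression on one fixed split, return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    # accuracy_score expects (y_true, y_pred) in that order.
    return accuracy_score(y_te, model.predict(X_te))


# Synthetic stand-in data: two numeric features and a target that depends on both.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo_y = (demo_X["a"] + demo_X["b"] > 0).astype(int)

score = evaluate(demo_X, demo_y)
print("Accuracy :", round(score * 100, 2), "%")
```

> * Fixing `random_state` makes the score repeatable. When a method deletes rows, the test set still changes, so the scores remain only roughly comparable.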

## 1. Deleting the columns with missing data
> * Let's delete the column __Age__, fit the model, and then check the accuracy.

In [10]:
updated_df = x.dropna(axis=1)
updated_df.head()

Unnamed: 0,Pclass,Sex,SibSp,Parch,Fare
0,3,1,1,0,7.25
1,1,0,1,0,71.2833
2,3,0,0,0,7.925
3,1,0,1,0,53.1
4,3,1,0,0,8.05


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [12]:
x_train, x_test, y_train, y_test = train_test_split(updated_df, y, test_size = 0.3, random_state = 42)

In [13]:
lg = LogisticRegression(random_state = 42)

lg.fit(x_train, y_train)
pred = lg.predict(x_test)
print('Accuracy :', round(accuracy_score(y_test, pred)*100, 2),'%')   # accuracy_score expects (y_true, y_pred)

Accuracy : 79.85 %


### Insights :
> * Here we achieved an accuracy of 79.85%, but in the process we lost one important variable, 'Age'.
> * This method can therefore lead to a loss of data, which is not advisable.
> * Even then, an accuracy of about 79% is not especially good.
> * So next we will look at another technique.
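
> * Dropping a column tends to make more sense when most of its values are missing. The sketch below keeps only columns whose missing share is below a cut-off; the 50% threshold and the `toy` data are assumptions chosen for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: one column is mostly empty, one has a single gap, one is complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

max_missing_share = 0.5  # assumed cut-off; tune it for your own data

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()

# Keep only the columns at or below the cut-off.
kept = toy.loc[:, missing_share <= max_missing_share]
print(list(kept.columns))
```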

## 2. Deleting the rows with missing data

In [14]:
updated_df = new_df.dropna(axis = 0)               # keep all columns of new_df, drop every row that contains a NaN

In [15]:
y1 = updated_df['Survived']
updated_df.drop(['Survived'], axis = 1, inplace = True)

In [16]:
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  714 non-null    int64  
 1   Sex     714 non-null    int32  
 2   Age     714 non-null    float64
 3   SibSp   714 non-null    int64  
 4   Parch   714 non-null    int64  
 5   Fare    714 non-null    float64
dtypes: float64(2), int32(1), int64(3)
memory usage: 36.3 KB


In [17]:
# Train test split :
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size = 0.3)

In [18]:
lr = LogisticRegression()

# Fit and predict with the model created in this cell (lr), not the earlier lg
lr.fit(X_train, y_train)
pred1 = lr.predict(X_test)
print('Accuracy :', round(accuracy_score(y_test, pred1)*100, 2),'%')

Accuracy : 81.4 %


### Note :
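> * In this case we achieve better accuracy than before, possibly because the column Age contains more valuable information than we expected. Note that 177 of the 891 rows were dropped and the train/test split is different, so the two numbers are not strictly comparable.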

## 3. Filling the missing data with a value – Imputation

> ### The possible ways to do this are :
> 1. Filling the missing data with the __mean__ or __median__ value if it’s a __numerical__ variable.
> 2. Filling the missing data with __mode__ if it’s a __categorical__ value.
> 3. Filling the numerical value with 0 or -999, or some other number that will not __occur__ in the data. This can be done so that the machine can recognize that the data is not real or is different.
> 4. Filling the categorical value with a new type for the missing values.
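
> * The four options above can be sketched with `fillna` on a small made-up DataFrame. The column names and values in `toy` are assumptions for illustration; `"Missing"` is just one possible label for the new category.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in a numerical and a categorical column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
})

# 1. Numerical column: fill with the mean or the median.
age_mean = toy["age"].fillna(toy["age"].mean())
age_median = toy["age"].fillna(toy["age"].median())

# 2. Categorical column: fill with the most frequent value (mode).
embarked_mode = toy["embarked"].fillna(toy["embarked"].mode()[0])

# 3. Numerical column: fill with a sentinel that does not occur in the data.
age_sentinel = toy["age"].fillna(-999)

# 4. Categorical column: fill with a new category for missing values.
embarked_new = toy["embarked"].fillna("Missing")

print(pd.DataFrame({"mean": age_mean, "median": age_median, "sentinel": age_sentinel}))
```

> * Sentinel values such as -999 can distort linear models like logistic regression, so they are usually a better fit for tree-based models.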

In [19]:
updated_df = df   # note: this is the same object as df, so the changes below also modify df
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int32(1), int64(4)
memory usage: 45.4 KB


In [20]:
y2 = updated_df['Survived']
updated_df.drop('Survived', axis = 1, inplace = True)

In [21]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(updated_df, y2, test_size = 0.3)

lgr = LogisticRegression()

lgr.fit(X_train1, y_train1)
pred2 = lgr.predict(X_test1)
print('Accuracy:', round(accuracy_score(y_test1, pred2)*100, 2),'%')   # accuracy_score expects (y_true, y_pred)

Accuracy: 73.88 %


### Insights :
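> * The accuracy comes out to 73.88% in the run shown above, which is lower than in the previous case. Because the split is random (no random_state is set), this number changes from run to run.
> * This will not always happen. Here it suggests that the mean is not a good stand-in for the missing Age values.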
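
> * In the cell above, the mean is computed over all rows before the split, so the test rows influence the fill value. A common alternative is to learn the fill value from the training rows only. The sketch below does this with a scikit-learn pipeline on synthetic data; `demo`, `target` and the 20% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: the target depends on age, and some ages are then removed.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({"age": rng.normal(30, 10, n), "fare": rng.normal(30, 5, n)})
target = (demo["age"] > 30).astype(int)
demo.loc[rng.random(n) < 0.2, "age"] = np.nan  # knock out roughly 20% of the ages

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.3, random_state=0)

# The imputer is fitted inside the pipeline, so the mean comes from X_tr only
# and is then reused unchanged when transforming X_te.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print("Accuracy:", round(accuracy * 100, 2), "%")
```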

## 4. Imputation with an additional column
> * The problem with the previous model is that it __does not know__ whether a value came from the original data or was imputed. To tell the model, we add a column __Ageismissing__, which is __True__ if the Age value was null and __False__ if it was not.

In [22]:
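# df['Age'] was already filled in section 3, so reload the raw Age column to find the originally missing values
raw_age = pd.read_csv('Titanic_Kaggle.csv')['Age']
updated_df = df.drop(columns = 'Survived', errors = 'ignore').copy()
updated_df['Age'] = raw_age
updated_df['Ageismissing'] = updated_df['Age'].isnull()

updated_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Ageismissing
0,3,1,22.0,1,0,7.25,False
1,1,0,38.0,1,0,71.2833,False
2,3,0,26.0,0,0,7.925,False
3,1,0,35.0,1,0,53.1,False
4,3,1,35.0,0,0,8.05,False


In [23]:
## Using sklearn SimpleImputer to impute the value :

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

# Impute only Age and write the result back, so the other columns keep their dtypes
updated_df['Age'] = imputer.fit_transform(updated_df[['Age']]).ravel()

updated_df.info()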

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pclass        891 non-null    int64  
 1   Sex           891 non-null    int32  
 2   Age           891 non-null    float64
 3   SibSp         891 non-null    int64  
 4   Parch         891 non-null    int64  
 5   Fare          891 non-null    float64
 6   Ageismissing  891 non-null    bool   
dtypes: bool(1), float64(2), int32(1), int64(3)
memory usage: 39.3 KB


In [24]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(updated_df, y2, test_size = 0.3)

lgr = LogisticRegression()

lgr.fit(X_train2, y_train2)
pred2 = lgr.predict(X_test2)
print('Accuracy:', round(accuracy_score(y_test2, pred2)*100, 2),'%')   # accuracy_score expects (y_true, y_pred)

Accuracy: 80.22 %
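
> * scikit-learn can also build the indicator column itself. The sketch below uses the `add_indicator=True` option of `SimpleImputer` on a small made-up array of ages; it is an alternative to creating __Ageismissing__ by hand, not the code used above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages with two missing entries (one feature, five rows).
ages = np.array([[22.0], [np.nan], [26.0], [35.0], [np.nan]])

# add_indicator=True appends one 0/1 column per feature that had missing values at fit time.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(ages)

# Column 0: age with gaps filled by the median of the observed values.
# Column 1: 1.0 where the original age was missing, 0.0 otherwise.
print(filled)
```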


## Please check next notebook for step 5. 🙏🏻