# **Missing Indicator**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('train.csv',usecols=['Age','Fare','Survived'])
df.head()

|  | Survived | Age | Fare |
| --- | --- | --- | --- |
| 0 | 0 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 0 | 35.0 | 8.05 |


## **Type 1 - Using `MissingIndicator()`**

In [3]:
x = df.drop(columns=['Survived'])
y = df[['Survived']]

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

In [4]:
si = SimpleImputer()


x_train_trf = pd.DataFrame(si.fit_transform(x_train), columns=x.columns)
x_test_trf = pd.DataFrame(si.transform(x_test), columns=x.columns)


In [5]:
reg = LogisticRegression()

reg.fit(x_train_trf, y_train)
y_test_pred = reg.predict(x_test_trf)

print(f"{np.round(accuracy_score(y_test, y_test_pred),2)*100} %")


61.0 %


In [6]:
# flagging and counting the missing data

mi = MissingIndicator()
x_train_missing = pd.DataFrame(mi.fit_transform(x_train), columns=['missing x_train'])
x_test_missing = pd.DataFrame(mi.transform(x_test), columns=['missing x_test'])
print(x_train_missing.sum())   # total number of missing values in the training set
print(x_test_missing.sum())    # total number of missing values in the test set
x_train_missing.sample(5)      # shows 5 random rows of the indicator column


missing x_train    148
dtype: int64
missing x_test    29
dtype: int64


|  | missing x_train |
| --- | --- |
| 313 | False |
| 642 | False |
| 193 | False |
| 130 | False |
| 276 | False |


`MissingIndicator()` is a class from scikit-learn (`sklearn.impute`) that creates a new feature (a new column) answering one question: "Was the value here missing?"

It transforms your dataset into a binary (True/False) matrix where:

- True = the value was missing (NaN)
- False = the value was present

By default it only returns columns for features that had missing values during `fit`, which is why a single column comes back here.

Sometimes the fact that data is missing is itself important information.

Example: if an "Income" field is missing, it might mean the person is unemployed or is hiding their wealth. That "missingness" is a pattern the model can learn from.
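To make the True/False matrix concrete, here is a minimal sketch on a made-up array (the values in `X_toy` are invented for illustration and are not taken from the Titanic data). It also shows the `features` parameter, which controls whether every column gets a flag or only the ones that had gaps during `fit`.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data (made up): column 0 has gaps, column 1 is complete.
X_toy = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [26.0, 7.92],
    [np.nan, 8.05],
])

# Default features='missing-only': only columns with NaN during fit get a flag column.
mi_default = MissingIndicator()
flags_default = mi_default.fit_transform(X_toy)
print(flags_default.shape)     # one flag column, for column 0
print(mi_default.features_)    # indices of the columns that were flagged

# features='all' returns one flag column per input column.
mi_all = MissingIndicator(features="all")
flags_all = mi_all.fit_transform(X_toy)
print(flags_all)
```

With the default setting the result has a single column, so giving the DataFrame exactly one column name, as in the cell above, only works when exactly one feature had missing values.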

In [7]:
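x_train['Age_NaN'] = x_train_missing
x_test['Age_NaN'] = x_test_missing          # the new column must have the same name ('Age_NaN') in both x_train and x_test, or prediction will throw an error
# Caution: x_train_missing and x_test_missing have a fresh 0..n-1 index, while x_train and x_test keep their
# original row labels, so these assignments align by label. That is why Age_NaN is blank (NaN) in some rows
# below and does not always match Age. Building the indicator frames with index=x_train.index and
# index=x_test.index avoids this.
x_train.sample(5)     # intended result: 'False' where Age is present, 'True' where Age is missing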


|  | Age | Fare | Age_NaN |
| --- | --- | --- | --- |
| 401 | 26.0 | 8.05 | False |
| 868 | NaN | 9.5 | NaN |
| 230 | 35.0 | 83.475 | False |
| 232 | 59.0 | 13.5 | False |
| 576 | 34.0 | 13.0 | False |


This step matters: the new feature records where a value was missing and where it was not, and a model can use that signal to improve its predictions.

Without it, an imputed value looks exactly like a real one, so the model has no way of knowing which values were filled in.
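The blank `Age_NaN` cells in the output above deserve a closer look. When a pandas object is assigned to a column, pandas matches rows by index label, not by position. The indicator frame was built without an index, so it carries labels 0..n-1, while the split data keeps its original row labels. The sketch below reproduces the effect on a made-up three-row frame (the names `toy`, `no_index` and `with_index` are hypothetical) and shows one way to keep the rows lined up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy frame (made up) with non-sequential row labels, like a shuffled train split.
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0]}, index=[401, 868, 230])

# fit_transform returns a plain (3, 1) boolean array with no row labels attached.
flags = MissingIndicator().fit_transform(toy)

# Without index=..., the new frame gets labels 0, 1, 2.
no_index = pd.DataFrame(flags, columns=["flag"])
misaligned = toy.copy()
misaligned["Age_NaN"] = no_index["flag"]   # labels 401/868/230 are not found, so NaN
print(misaligned)

# Reusing the original index keeps every flag on its own row.
with_index = pd.DataFrame(flags, columns=["flag"], index=toy.index)
aligned = toy.copy()
aligned["Age_NaN"] = with_index["flag"]
print(aligned)
```

Assigning a plain NumPy array instead of a DataFrame or Series would also work, because arrays are assigned by position.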

In [8]:
x_test.sample(5)

|  | Age | Fare | Age_NaN |
| --- | --- | --- | --- |
| 128 | NaN | 22.3583 | False |
| 546 | 19.0 | 26.0 | NaN |
| 342 | 28.0 | 13.0 | NaN |
| 275 | 63.0 | 77.9583 | NaN |
| 384 | NaN | 7.8958 | NaN |


In [9]:
si = SimpleImputer()

x_train_trf2 = si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)

reg = LogisticRegression()
reg.fit(x_train_trf2, y_train)
y_test_pred2 = reg.predict(x_test_trf2)

print(f"{np.round(accuracy_score(y_test, y_test_pred2),2)*100} %")
# as you can see, accuracy went from 61 % to 62 %, an increase of 1 percentage point


62.0 %


As you can see, doing all of this by hand every time is tedious. To avoid it, `SimpleImputer` has a built-in parameter: `add_indicator=True`.

With it, you don't need to create the extra column yourself.
Example code is below.

## **Type 2 - using `SimpleImputer(add_indicator=True)`**

In [10]:
x = df.drop(columns=['Survived'])
y = df[['Survived']]

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

si = SimpleImputer(add_indicator=True)

cols = ['Age', 'Fare', 'Age_imputed']   # 'Age_imputed' is the indicator column: 1 = Age was missing, 0 = Age was present

x_train_trf = pd.DataFrame(si.fit_transform(x_train), columns=cols)
x_test_trf = pd.DataFrame(si.transform(x_test), columns=cols)


reg = LogisticRegression()   # fresh model instead of reusing the one from the earlier cell
reg.fit(x_train_trf, y_train)
y_test_pred = reg.predict(x_test_trf)

print(f"ACCURACY : {np.round(accuracy_score(y_test, y_test_pred),2)*100} %")

# as you can see the accuracy is 63 % now



ACCURACY : 63.0 %


If you set `add_indicator=True` inside `SimpleImputer()`, it does two things at once:

- Imputes the missing values (e.g., replaces NaN with the column mean, which is the default strategy).
- Appends new indicator columns to the end of your dataset that show where the missing values originally were (1 = missing, 0 = present).


It effectively combines `SimpleImputer()` and `MissingIndicator()` into a single step.
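A minimal sketch of that equivalence, on a made-up array rather than the Titanic data: the one-step output of `SimpleImputer(add_indicator=True)` is compared with manually stacking the outputs of `SimpleImputer()` and `MissingIndicator()`. The variable names here are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

# Toy data (made up): column 0 has one gap, column 1 is complete.
X_toy = np.array([
    [20.0, 10.0],
    [np.nan, 20.0],
    [40.0, 30.0],
])

# One step: impute with the column mean and append the indicator column.
si_toy = SimpleImputer(strategy="mean", add_indicator=True)
X_one_step = si_toy.fit_transform(X_toy)
print(X_one_step)

# Two steps: impute and flag separately, then stack the results side by side.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_toy)
X_flags = MissingIndicator().fit_transform(X_toy)
X_two_step = np.hstack([X_imputed, X_flags])   # booleans become 0.0 / 1.0

print(np.allclose(X_one_step, X_two_step))
```

The missing entry in column 0 is replaced by 30.0, the mean of 20.0 and 40.0, and the appended third column is 1.0 only for that row.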

In [11]:
x_train_trf.sample(6)       # 1 = data missing, 0 = data not missing

|  | Age | Fare | Age_imputed |
| --- | --- | --- | --- |
| 660 | 26.0 | 18.7875 | 0.0 |
| 688 | 36.0 | 512.3292 | 0.0 |
| 427 | 30.5 | 7.75 | 0.0 |
| 534 | 32.0 | 15.5 | 0.0 |
| 455 | 44.0 | 8.05 | 0.0 |
| 523 | 29.785904 | 8.05 | 1.0 |


**KEEP IN MIND THAT `add_indicator=True` ONLY ADDS NEW FEATURES FOR THE FEATURES THAT HAVE MISSING VALUES IN THE DATA IT WAS FITTED ON, NOT FOR ALL THE FEATURES**

*Since `Fare` has no missing values, no `Fare_imputed` feature is created*
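Instead of hard-coding the output column names, the fitted imputer can be asked which features received an indicator. The sketch below uses a made-up frame and relies on the `indicator_` attribute of a fitted `SimpleImputer` (the internal `MissingIndicator`) and its `features_` attribute. The `_missing` suffix is just a naming choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training data (made up): only Age has a gap.
train_toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.05],
})

si_toy = SimpleImputer(add_indicator=True)
train_out = si_toy.fit_transform(train_toy)

# features_ holds the positions of the columns that got an indicator.
flagged = [train_toy.columns[i] for i in si_toy.indicator_.features_]
out_cols = list(train_toy.columns) + [f"{c}_missing" for c in flagged]
print(out_cols)

train_df = pd.DataFrame(train_out, columns=out_cols, index=train_toy.index)
print(train_df)

# New data with a gap in Fare: the value is still imputed,
# but no Fare indicator appears because Fare was complete during fit.
new_toy = pd.DataFrame({"Age": [30.0], "Fare": [np.nan]})
new_out = si_toy.transform(new_toy)
print(new_out.shape)
```

The last part is the practical consequence of the note above: a feature that was complete at fit time never gets an indicator column, even if it has gaps later.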