
### **Using `MissingIndicator`**

<div align="justify">

The `MissingIndicator` class in `sklearn.impute` identifies and marks the locations of missing values in a dataset. It generates a boolean mask, where `True` indicates a missing value at that position and `False` indicates a non-missing value.
</div>
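
As a minimal sketch of this behaviour, the toy array below (made-up values, not the income dataset) has missing entries in its first and third columns only. `X_toy`, `indicator`, and `mask` are illustrative names.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical toy data: columns 0 and 2 contain NaN, column 1 does not.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [np.nan, 3.0, 4.0],
    [5.0, 6.0, 7.0],
])

# Defaults: missing_values=np.nan, features='missing-only'
indicator = MissingIndicator()
mask = indicator.fit_transform(X_toy)

print(mask.dtype)           # boolean mask, not 0/1 integers
print(mask)                 # one column per feature that had missing values
print(indicator.features_)  # indices of those features
```

With the default settings, the mask has two columns here, one for each feature that contained a missing value.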

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
from google.colab import files
uploaded = files.upload()

Saving income_evaluation.csv to income_evaluation.csv


In [3]:
df = pd.read_csv('income_evaluation.csv', na_values=' ?')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.isna().sum()

Unnamed: 0,0
age,0
workclass,1836
fnlwgt,0
education,0
education-num,0
marital-status,0
occupation,1843
relationship,0
race,0
sex,0


In [5]:
# separate independent and dependent features
X = df.drop(' income', axis=1)
y = df[' income']

# train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [6]:
from sklearn.impute import MissingIndicator
mi = MissingIndicator()

# By default, the parameter features='missing-only'.

In [7]:
mi.fit_transform(x_train)

array([[False, False, False],
       [False, False, False],
       [False, False, False],
       ...,
       [False, False, False],
       [False, False, False],
       [False, False, False]])

In [8]:
pd.DataFrame(mi.fit_transform(x_train))

Unnamed: 0,0,1,2
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
26043,False,False,False
26044,False,False,False
26045,False,False,False
26046,False,False,False


`features='missing-only'` (the default) tells `MissingIndicator` **to generate output only for columns that contained missing values during `fit`**, which keeps the number of added indicator columns small.

In [11]:
mi.features_
# returns the indices of the features that contained missing values during fit.

array([ 1,  6, 13])
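
Because `features_` holds positional indices, it can be convenient to map them back to column names. The sketch below assumes a small numeric DataFrame with made-up column names (`age`, `hours`, `gain`). The `_missing` suffix is an illustrative naming choice, not something scikit-learn produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical frame: 'hours' and 'gain' each contain one missing value.
frame = pd.DataFrame({
    "age": [39.0, 50.0, 38.0],
    "hours": [40.0, np.nan, 40.0],
    "gain": [np.nan, 0.0, 0.0],
})

indicator = MissingIndicator().fit(frame)

# Translate positional indices into the original column names.
missing_cols = frame.columns[indicator.features_]

# Build a labelled DataFrame of indicator columns.
flags = pd.DataFrame(
    indicator.transform(frame),
    columns=[f"{col}_missing" for col in missing_cols],
    index=frame.index,
)
print(list(missing_cols))
print(flags)
```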

In [9]:
mi1 = MissingIndicator(features='all')

In [10]:
pd.DataFrame(mi1.fit_transform(x_train)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False


`features='all'` returns a boolean column for every feature, whether or not it had missing values.
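
The difference between the two settings shows up in the output shape. A small sketch on made-up data, where only the middle column has a missing value:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical toy data: only column 1 contains NaN.
X_toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
])

only_missing = MissingIndicator(features="missing-only").fit_transform(X_toy)
all_features = MissingIndicator(features="all").fit_transform(X_toy)

print(only_missing.shape)  # one indicator column (for column 1)
print(all_features.shape)  # one indicator column per input feature
```

With `features='all'`, the columns for features without missing values are entirely `False`, so they add width without adding information for this data.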



Another parameter, `error_on_new`, applies only when `features='missing-only'`. It controls whether `.transform()` raises an error when a feature that had no missing values during `fit` contains missing values in the **test/inference data**. By default, it is set to `True`.
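
A hedged illustration of `error_on_new` on made-up arrays: `X_fit` has a missing value only in its second column, while `X_new` has one in its first column, which the fitted indicator has not seen as missing.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical data: during fit, only column 1 contains NaN.
X_fit = np.array([[1.0, np.nan], [2.0, 3.0]])
# At transform time, column 0 contains NaN instead.
X_new = np.array([[np.nan, 4.0]])

strict = MissingIndicator(error_on_new=True).fit(X_fit)
try:
    strict.transform(X_new)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With error_on_new=False, the new missing value is silently ignored:
# the output still covers only the features found missing during fit.
lenient = MissingIndicator(error_on_new=False).fit(X_fit)
out = lenient.transform(X_new)
print(out)
```

Setting `error_on_new=False` avoids the exception, but the missing value in the previously complete column is not reflected in the output, so it is worth deciding deliberately which behaviour suits the pipeline.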

#### **Use-Case and Best Practices**

<div align="justify">

In some cases, the fact that a value is missing is informative and can be used as a feature in your machine learning model. Rather than simply imputing and forgetting missingness, MissingIndicator helps retain this **missingness pattern as a new feature**.

>Use `MissingIndicator` only if you suspect the missingness pattern carries information (i.e., the data is not MCAR).<br>
MCAR = Missing Completely At Random.
</div>
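
One way to keep the missingness pattern alongside imputed values is the `add_indicator` option of `SimpleImputer`, which appends the indicator columns to the imputed output. This is a sketch on made-up numeric data and an alternative to calling `MissingIndicator` directly, not the approach used in the cells above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 1 has one missing value.
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [5.0, 6.0],
])

# Mean imputation plus an appended indicator column for column 1.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# Columns: [feature 0, feature 1 (imputed), indicator for feature 1]
print(X_out)
```

The first two output columns hold the imputed features, and the last column is `1.0` wherever the second feature was originally missing.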