## 📌 Missing Indicator

### ✅ Definition
A **missing indicator** is a **binary feature** created to explicitly represent whether an original data value was missing or observed.

> A missing indicator takes the value **1 if the original feature value is missing (NaN)** and **0 if the value is present**, allowing machine learning models to learn patterns related to missingness itself.

---

### 🧠 Why Missing Indicators Are Used
- The fact that a value is missing can itself carry **important information** about the target
- Imputation alone hides the fact that data was missing
- Missing indicators help models capture **missingness patterns**

---

### 📐 Mathematical Representation
For a feature \( X \):

\[
M =
\begin{cases}
1, & \text{if } X \text{ is missing} \\
0, & \text{if } X \text{ is present}
\end{cases}
\]

---

### 🧪 Example

| Feature (FireplaceQu) | Missing Indicator |
|----------------------|------------------|
| Gd | 0 |
| NaN | 1 |
| TA | 0 |
| NaN | 1 |
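
The table above can be reproduced with a short pandas sketch. The series below is a toy stand-in for the `FireplaceQu` column (the variable names are illustrative and not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Toy column mirroring the table above; np.nan marks a missing entry.
fireplace_qu = pd.Series(["Gd", np.nan, "TA", np.nan], name="FireplaceQu")

# isna() returns booleans; casting to int gives the 1/0 indicator.
missing_indicator = fireplace_qu.isna().astype(int)

print(missing_indicator.tolist())  # [0, 1, 0, 1]
```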

---

### 🔁 Relationship with Imputation

| Step | Purpose |
|------|--------|
| Imputation | Fills missing values |
| Missing Indicator | Preserves missing-value information |

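✔ Best practice in ML: pair imputation with a missing indicator, so the model receives both the filled-in value and a flag recording that the value was originally missing.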
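
As a minimal sketch of combining the two steps, assuming a toy single-column array with hypothetical values, the imputed column and its indicator can be kept side by side:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data (hypothetical values); the middle entry is missing.
X = np.array([[10.0], [np.nan], [30.0]])

# Mean imputation fills the gap with (10 + 30) / 2 = 20.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# The indicator records where the gap was before it was filled.
indicator = np.isnan(X).astype(int)

# Keep both columns: the filled value and the missingness flag.
X_augmented = np.hstack([X_imputed, indicator])
print(X_augmented)
```

After imputation alone, the row holding 20 looks like any observed value; the second column is the only remaining trace that it was filled in.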


In [87]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer, MissingIndicator

In [88]:
df = pd.read_csv(r'C:\Users\Lenovo\Krishnaraj singh\Code\newml\Documents!.0\train.csv',usecols=['Age','Fare','Survived'])

In [89]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [90]:
x = df.drop(columns=['Survived'])
y = df['Survived']

In [91]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

In [92]:
x_train.head()

|     | Age | Fare |
|-----|-----|------|
| 30 | 40.0 | 27.7208 |
| 10 | 4.0 | 16.7 |
| 873 | 47.0 | 9.0 |
| 182 | 9.0 | 31.3875 |
| 876 | 20.0 | 9.8458 |


In [93]:
si = SimpleImputer(strategy='mean')

x_train_trf = si.fit_transform(x_train)
x_test_trf = si.transform(x_test)

In [94]:
si

In [95]:
x_train_trf

array([[ 40.        ,  27.7208    ],
       [  4.        ,  16.7       ],
       [ 47.        ,   9.        ],
       ...,
       [ 71.        ,  49.5042    ],
       [ 29.78590426, 221.7792    ],
       [ 29.78590426,  25.925     ]])
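
The value 29.78590426 that repeats in the output is the mean of `Age` learned from the training rows. The sketch below, on hypothetical toy data with the same two columns, shows where a fitted `SimpleImputer` stores that value and how it is reused on unseen rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with columns (Age, Fare).
X_train_toy = np.array([[10.0, 1.0], [20.0, 2.0], [np.nan, 3.0], [30.0, 6.0]])
X_test_toy = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_toy)

# statistics_ holds one fill value per column, computed on the training rows only.
print(imputer.statistics_)  # [20.  3.]

# The test row receives the training mean of Age, not a test-set statistic.
print(imputer.transform(X_test_toy))  # [[20.  5.]]
```

Fitting on the training split and only transforming the test split, as the cell above does, keeps test-set information out of the fill value.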

In [96]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(x_train_trf,y_train)

y_predict = clf.predict(x_test_trf)

from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_predict)*100


61.452513966480446

__Now use a missing indicator: create an instance of the `MissingIndicator` class__

In [97]:
mi = MissingIndicator()

mi.fit(x_train)


In [98]:
mi.features_



array([0])

__`features_` lists column 0 (`Age`) as the only column with missing values, so the indicator is a completely new column created from the `Age` column__
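
Why `features_` contains only index 0 can be seen in a small sketch on hypothetical toy data. It assumes the default `features="missing-only"` setting of `MissingIndicator` and contrasts it with `features="all"`:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical toy data: column 0 has a gap, column 1 is complete.
X_toy = np.array([[np.nan, 7.0], [25.0, 8.0], [40.0, 9.0]])

# Default features="missing-only": flag only columns that had gaps at fit time.
mi_default = MissingIndicator()
print(mi_default.fit(X_toy).features_)      # [0]
print(mi_default.transform(X_toy).shape)    # (3, 1)

# features="all": one indicator column per input column.
mi_all = MissingIndicator(features="all")
print(mi_all.fit(X_toy).features_)          # [0 1]
print(mi_all.transform(X_toy).shape)        # (3, 2)
```

With the default setting, the number of indicator columns depends on which columns had gaps in the data passed to `fit`, which is why the Titanic subset here yields a single column for `Age`.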

In [99]:
x_train_missing = mi.transform(x_train)

In [100]:
x_train_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...])

In [101]:
x_test_missing = mi.transform(x_test)

In [102]:
x_test_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...])

In [103]:
x_train['Age_NA'] = x_train_missing
x_test['Age_NA'] = x_test_missing

In [104]:
x_train

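|     | Age | Fare | Age_NA |
|-----|-----|------|--------|
| 30 | 40.0 | 27.7208 | False |
| 10 | 4.0 | 16.7000 | False |
| 873 | 47.0 | 9.0000 | False |
| 182 | 9.0 | 31.3875 | False |
| 876 | 20.0 | 9.8458 | False |
| ... | ... | ... | ... |
| 534 | 30.0 | 8.6625 | False |
| 584 | NaN | 8.7125 | True |
| 493 | 71.0 | 49.5042 | False |
| 527 | NaN | 221.7792 | True |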


In [105]:
x_train_trf1 = si.fit_transform(x_train)
x_test_trf1 = si.transform(x_test)

In [106]:
clf.fit(x_train_trf1,y_train)

In [107]:
y_predict1 = clf.predict(x_test_trf1)

In [108]:
accuracy_score(y_test,y_predict1)

0.6312849162011173

In [109]:
accuracy_score(y_test,y_predict)

0.6145251396648045

__The indicator can also be added directly by `SimpleImputer` by setting `add_indicator=True`__

### 📌 SimpleImputer with Missing Indicator

```python
from sklearn.impute import SimpleImputer

# Create an imputer that also adds a missing-value indicator
si = SimpleImputer(add_indicator=True)
```


In [110]:
si = SimpleImputer(add_indicator=True)

In [111]:
x_train_trf2 = si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)
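
To illustrate what `add_indicator=True` produces, the sketch below compares it with the manual two-step approach on hypothetical toy data. Note that `x_train` in the cell above already carries the `Age_NA` column from the earlier step, so the imputer appends a second copy of the same flag there; the sketch starts from plain (Age, Fare) columns instead.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Hypothetical toy data with columns (Age, Fare); only Age has gaps.
X_toy = np.array([[22.0, 7.25], [np.nan, 8.05], [38.0, 71.28], [np.nan, 13.0]])

# One step: impute and append the indicator columns.
one_step = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X_toy)

# Two steps: impute, build the indicator separately, then stack them.
imputed = SimpleImputer(strategy="mean").fit_transform(X_toy)
flags = MissingIndicator().fit_transform(X_toy)
two_step = np.hstack([imputed, flags])

print(one_step.shape)                   # (4, 3)
print(np.allclose(one_step, two_step))  # True
```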

In [112]:
clf.fit(x_train_trf2,y_train)

In [114]:
y_predict2 = clf.predict(x_test_trf2)

In [115]:
accuracy_score(y_test,y_predict2)

0.6312849162011173

In [116]:
accuracy_score(y_test,y_predict1)

0.6312849162011173
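
The impute-then-classify steps above could also be bundled into a single estimator. The following is a sketch only, assuming synthetic stand-in data rather than the Titanic file, and using scikit-learn's `make_pipeline` so that the imputer and its indicator are fitted on the training split and reused on the test split automatically:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the Titanic file): two features, gaps in column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.2, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Imputation, indicator creation, and the classifier in one object.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# score() returns accuracy as a fraction between 0 and 1.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The accuracy printed here reflects the synthetic data only and is not comparable to the Titanic results above.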