In [None]:
## 🔹 Missing Value Indicator (Missing Flag Feature)

A **Missing Value Indicator** is:

> A new binary feature that tells whether the original value was missing or not.

Instead of only imputing the value, we also create a feature that records the missingness.

---

# 🧠 Why Do We Need It?

Because:

👉 Missingness itself may carry information.

Example:

* Age missing → maybe elderly people didn't disclose it
* Income missing → maybe unemployed
* Medical test missing → maybe the doctor skipped it

If you only fill the missing values with the median, that signal is lost.

The indicator preserves that signal.
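
To make this concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the column names `income` and `unemployed`, and the chosen missingness rates (80% for unemployed people, 5% otherwise). It compares how much of the signal survives in the imputed column versus in the missing flag.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical setup: income is far more likely to be missing for unemployed people.
unemployed = rng.random(n) < 0.3
income = rng.normal(50_000, 10_000, n)
missing_prob = np.where(unemployed, 0.8, 0.05)
income[rng.random(n) < missing_prob] = np.nan

df_demo = pd.DataFrame({"income": income, "unemployed": unemployed.astype(int)})

# Median imputation alone: every filled row gets the same value.
income_imputed = df_demo["income"].fillna(df_demo["income"].median())
# The indicator records that the value was absent.
income_missing = df_demo["income"].isnull().astype(int)

corr_imputed = income_imputed.corr(df_demo["unemployed"])
corr_flag = income_missing.corr(df_demo["unemployed"])
print(f"corr(imputed income, unemployed) = {corr_imputed:.2f}")
print(f"corr(missing flag,   unemployed) = {corr_flag:.2f}")
```

In this setup the imputed column is nearly uncorrelated with `unemployed`, while the flag is strongly correlated with it, which is the signal a model could otherwise not see.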

---

# 📊 Example

Original:

| Age |
| --- |
| 25  |
| 30  |
| NaN |
| 40  |

Step 1️⃣ Create indicator:

```python
X_train['Age_missing'] = X_train['Age'].isnull().astype(int)
```

Now:

| Age | Age_missing |
| --- | ----------- |
| 25  | 0           |
| 30  | 0           |
| NaN | 1           |
| 40  | 0           |

Step 2️⃣ Impute Age (e.g., median):

```python
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].median())
```

Final:

| Age | Age_missing |
| --- | ----------- |
| 25  | 0           |
| 30  | 0           |
| 30  | 1           |
| 40  | 0           |

Now the model sees:

* The imputed value
* Whether it was originally missing
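
The two steps can be reproduced end to end on the four-row table. This is a small self-contained sketch; the DataFrame name `toy` is made up for the illustration.

```python
import numpy as np
import pandas as pd

# The four-row example table
toy = pd.DataFrame({"Age": [25, 30, np.nan, 40]})

# Step 1: flag missingness first, while the NaN is still there
toy["Age_missing"] = toy["Age"].isnull().astype(int)

# Step 2: impute with the median of the observed values (25, 30, 40)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The order matters: if the imputation ran first, `isnull()` would find nothing to flag.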

---

# 🔥 Why This Is Powerful

Mean/median imputation shrinks the feature's variance and weakens its covariance with other variables.

The indicator helps the model learn:

> “This value was artificial.”

Especially powerful for:

* Linear models
* Logistic regression
* Neural networks

---

# 🎯 When To Use Missing Indicator?

* ✅ When missingness is not completely at random (MAR / MNAR)
* ✅ When the share of missing values is moderate or high
* ✅ When using linear models
* ✅ When using median/mean imputation

---

# 🚫 When Not Necessary?

* ❌ If the share of missing values is very small (<1%)
* ❌ If using models that handle missing values internally (e.g., some boosting libraries)

---

# 🧠 Deep Intuition

Think of the missing indicator as:

> A flag saying “There was uncertainty here.”

Instead of hiding the uncertainty, we expose it to the model.

---

# 📊 How It Affects Covariance

Without indicator:

* Covariance shrinks (mean imputation problem)

With indicator:

* Model can separate “real” values from “imputed” ones
* Relationship distortion is reduced
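
The shrinkage is easy to check numerically. This is a rough sketch on synthetic data, assuming a linear relationship `y = 2x + noise` and 40% of `x` removed completely at random; both choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.5, size=n)

# Remove about 40% of x at random, then mean-impute from the observed values
mask = rng.random(n) < 0.4
x_imputed = x.copy()
x_imputed[mask] = x[~mask].mean()

var_full, var_imp = x.var(), x_imputed.var()
cov_full = np.cov(x, y)[0, 1]
cov_imp = np.cov(x_imputed, y)[0, 1]

print(f"variance:   full={var_full:.2f}  imputed={var_imp:.2f}")
print(f"covariance: full={cov_full:.2f}  imputed={cov_imp:.2f}")
```

Both numbers drop after imputation because every filled row sits exactly at the mean and contributes nothing to the spread. The indicator does not undo this, but it lets the model treat those rows differently.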

---

# 🏆 Professional Pipeline Example

```python
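for col in numerical_cols:
    # Flag missingness before imputing, otherwise the information is gone
    X_train[col + '_missing'] = X_train[col].isnull().astype(int)
    X_test[col + '_missing'] = X_test[col].isnull().astype(int)

    # Learn the median on the training set only, then apply it to both splits
    median = X_train[col].median()
    X_train[col] = X_train[col].fillna(median)
    X_test[col] = X_test[col].fillna(median)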

Always:

* Learn median from train
* Apply to test
* Create indicator in both
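
These three rules can be wrapped in a small helper and checked on toy data. The function name `add_indicator_and_impute` and the toy frames are hypothetical; the helper works on copies, so the inputs are left untouched.

```python
import numpy as np
import pandas as pd

def add_indicator_and_impute(train, test, cols):
    """Return copies of train/test with '<col>_missing' flags and train-median imputation."""
    train, test = train.copy(), test.copy()
    for col in cols:
        # Create the indicator in both splits
        train[col + "_missing"] = train[col].isnull().astype(int)
        test[col + "_missing"] = test[col].isnull().astype(int)
        # Learn the median from train only, then apply it to both
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)
    return train, test

train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 60.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 100.0]})

train_out, test_out = add_indicator_and_impute(train_toy, test_toy, ["Age"])
print(train_out)
print(test_out)
```

The missing test value is filled with 40, the training median, not with anything computed from the test set.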

---

# 🔬 Advanced Note

scikit-learn has built-in support:

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median', add_indicator=True)
```

This appends one binary indicator column to the output for each feature that had missing values when the imputer was fitted.
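
A quick way to see what the imputer returns is to run it on a tiny array. The array `X_demo` below is made up for the illustration: its first column has one missing value and its second column has none.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_demo = np.array([
    [25.0, 7.25],
    [30.0, 8.05],
    [np.nan, 9.00],
    [40.0, 10.50],
])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X_demo)

print(X_out)                         # two imputed columns plus one indicator column
print(imputer.indicator_.features_)  # indices of the features that received an indicator
```

Only the first column gets an indicator because the second column had no missing values during `fit`.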

---

# 🏁 Final Mental Model

Imputation fills missing values.
The missing indicator preserves the information that a value was missing.

Best practice for numeric data:

```text
Median Imputation + Missing Indicator
```

This is a strong, widely used baseline.
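
One way to wire this baseline into a model is a scikit-learn pipeline, which keeps the imputer fitted on the training folds only. The sketch below compares it against plain median imputation on synthetic data. The data is an assumption built so that missingness depends on the target (70% missing when `y == 1`, 10% otherwise), so the gap it shows is illustrative and not a general result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 500
X_syn = rng.normal(size=(n, 2))
y_syn = (rng.random(n) < 0.3).astype(int)

# Hypothetical mechanism: the first feature goes missing mostly when y == 1
miss = rng.random(n) < np.where(y_syn == 1, 0.7, 0.1)
X_syn[miss, 0] = np.nan

plain = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
flagged = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)

score_plain = cross_val_score(plain, X_syn, y_syn, cv=5).mean()
score_flagged = cross_val_score(flagged, X_syn, y_syn, cv=5).mean()
print(f"median only:        {score_plain:.3f}")
print(f"median + indicator: {score_flagged:.3f}")
```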

---


In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import MissingIndicator, SimpleImputer

In [4]:
df = pd.read_csv("C:\\Users\\Admin\\OneDrive\\Desktop\\DS_Resources\\ml_campus_x\\100-days-of-machine-learning\\day38-missing-indicator\\train.csv",usecols=['Age','Fare','Survived'])

In [53]:
df.head()

|     | Survived | Age  | Fare    |
| --- | -------- | ---- | ------- |
| 0   | 0        | 22.0 | 7.25    |
| 1   | 1        | 38.0 | 71.2833 |
| 2   | 1        | 26.0 | 7.925   |
| 3   | 1        | 35.0 | 53.1    |
| 4   | 0        | 35.0 | 8.05    |


In [54]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [71]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [56]:
X_train.head()

|     | Age  | Fare    |
| --- | ---- | ------- |
| 30  | 40.0 | 27.7208 |
| 10  | 4.0  | 16.7    |
| 873 | 47.0 | 9.0     |
| 182 | 9.0  | 31.3875 |
| 876 | 20.0 | 9.8458  |


In [57]:
# Baseline: mean imputation only (SimpleImputer's default strategy), no indicator
si = SimpleImputer()
X_train_trf = si.fit_transform(X_train)
X_test_trf = si.transform(X_test)

In [58]:
X_train_trf

array([[ 40.        ,  27.7208    ],
       [  4.        ,  16.7       ],
       [ 47.        ,   9.        ],
       ...,
       [ 71.        ,  49.5042    ],
       [ 29.78590426, 221.7792    ],
       [ 29.78590426,  25.925     ]])

In [59]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train_trf,y_train)

y_pred = clf.predict(X_test_trf)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6145251396648045

In [60]:
# By default MissingIndicator flags only the features that had missing values during fit
mi = MissingIndicator()

mi.fit(X_train)

MissingIndicator()

In [61]:
mi.features_

array([0], dtype=int64)

In [62]:
X_train_missing = mi.transform(X_train)

In [63]:
X_train_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...

In [43]:
X_test_missing = mi.transform(X_test)

In [45]:
X_test_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...

In [64]:
# Take an explicit copy first so the assignment does not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_train['Age_NA'] = X_train_missing


In [65]:
X_test

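|     | Age  | Fare    |
| --- | ---- | ------- |
| 707 | 42.0 | 26.2875 |
| 37  | 21.0 | 8.0500  |
| 615 | 24.0 | 65.0000 |
| 169 | 28.0 | 56.4958 |
| 68  | 17.0 | 7.9250  |
| ... | ...  | ...     |
| 89  | 24.0 | 8.0500  |
| 80  | 22.0 | 9.0000  |
| 846 | NaN  | 69.5500 |
| 870 | 26.0 | 7.8958  |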


In [66]:
# Same as for the training split: copy first to avoid SettingWithCopyWarning
X_test = X_test.copy()
X_test['Age_NA'] = X_test_missing


In [67]:
X_train

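|     | Age  | Fare     | Age_NA |
| --- | ---- | -------- | ------ |
| 30  | 40.0 | 27.7208  | False  |
| 10  | 4.0  | 16.7000  | False  |
| 873 | 47.0 | 9.0000   | False  |
| 182 | 9.0  | 31.3875  | False  |
| 876 | 20.0 | 9.8458   | False  |
| ... | ...  | ...      | ...    |
| 534 | 30.0 | 8.6625   | False  |
| 584 | NaN  | 8.7125   | True   |
| 493 | 71.0 | 49.5042  | False  |
| 527 | NaN  | 221.7792 | True   |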


In [68]:
si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [69]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train_trf2,y_train)

y_pred = clf.predict(X_test_trf2)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6312849162011173

In [70]:
# Let SimpleImputer append the indicator column itself instead of using MissingIndicator manually
si = SimpleImputer(add_indicator=True)

In [72]:
X_train = si.fit_transform(X_train)

In [73]:
X_test = si.transform(X_test)

In [74]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

# Use the arrays produced by SimpleImputer(add_indicator=True), not the earlier *_trf2 arrays
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6312849162011173