# 📌 Overview
**The Univariate Missing Indicator is a preprocessing tool that flags missing values in a dataset by creating a new binary indicator variable (0 or 1) for each feature with missing data.**

**Instead of filling missing values, this transformer preserves information about the missingness itself, which can be valuable if the pattern of missingness contains predictive signals.**

**It is commonly used in conjunction with an imputer to enhance model performance and allow algorithms to learn from both the imputed value and the fact that a value was missing.**
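
As a minimal sketch of this behaviour, the snippet below fits scikit-learn's `MissingIndicator` on a small made-up array (the values are illustrative and not taken from the dataset used later). With the default `features="missing-only"`, only columns that contained missing values during `fit` receive an indicator column.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data (hypothetical values): column 0 has one NaN, column 1 is complete.
X = np.array([
    [25.0, 7.25],
    [np.nan, 71.28],
    [38.0, 8.05],
])

indicator = MissingIndicator()  # default: features="missing-only"
mask = indicator.fit_transform(X)

print(indicator.features_)  # indices of the columns that contained NaN during fit
print(mask)                 # boolean array, one column per flagged feature
```

The original values are left untouched; the mask is a separate array that can be appended to the imputed data.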

## 📆 When to Use
Use the Univariate Missing Indicator when:

- The pattern of missingness might contain valuable predictive information.

- You want to track which values were originally missing, even after imputation.

- You're working with tree-based models (e.g., Random Forest, XGBoost).

- You’re building a scikit-learn pipeline or using AutoML workflows.



## ✅ Advantages
- 🧠 Retains missingness information: Useful if the absence of a value has meaning.

- 🔍 Enhances transparency: Helps models identify patterns of missingness.

- 🔄 Combines well with imputers: Allows both the value and its missing state to be modeled.

- 📈 Can improve performance: Especially in models that benefit from feature interactions (e.g., decision trees), provided the missingness is related to the target.

- 🔧 Pipeline-compatible: Easy to integrate with scikit-learn’s Pipeline or ColumnTransformer.
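
One way the pipeline integration could look is sketched below. It assumes synthetic data generated with NumPy and uses `FeatureUnion` to place the imputed values and the missing flags side by side; the step names (`"imputed"`, `"missing_flags"`, `"features"`, `"clf"`) are arbitrary labels chosen for this illustration.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic data (an assumption for this sketch): two numeric features,
# with roughly 20% of the first feature set to NaN after the labels are built.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.2, 0] = np.nan

# Imputed values and missing flags are computed in parallel and concatenated.
features = FeatureUnion([
    ("imputed", SimpleImputer(strategy="mean")),
    ("missing_flags", MissingIndicator()),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression()),
])
model.fit(X, y)

# Two imputed columns plus one indicator column for the feature with NaNs.
X_features = model.named_steps["features"].transform(X)
print(X_features.shape)
```

Because both transformers are fitted inside the pipeline, the same flags are produced consistently for training and test data.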

## ❌ Disadvantages
- 🧱 Increases dimensionality: Adds new features to the dataset (one per column that contains missing values).

- ⚠️ Not useful for all models: May not help linear models or models sensitive to multicollinearity.

- 🔁 Redundant if model handles missing values internally: e.g., XGBoost and LightGBM can handle NaNs directly.

- 🧪 May capture noise: If missingness is random and not related to the target, it may add unnecessary complexity.

## 🤖 Ideal Use Cases
- Datasets where values are missing not at random (MNAR) or follow structured missing patterns.

- Clinical, financial, or survey data where the reason a value is missing matters.

- Paired with SimpleImputer, KNNImputer, or IterativeImputer.

- Pipelines that require tracking which values were modified or filled.
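
The pairing with other imputers can be sketched as follows, using `KNNImputer` with its `add_indicator=True` option on a small made-up array (the values and `n_neighbors=2` are assumptions for illustration).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): NaNs in columns 0 and 1, column 2 complete.
X = np.array([
    [1.0, 10.0, 100.0],
    [np.nan, 12.0, 110.0],
    [3.0, 14.0, 120.0],
    [4.0, np.nan, 130.0],
    [5.0, 18.0, 140.0],
])

# add_indicator=True appends one flag column per feature that had NaNs in fit.
imputer = KNNImputer(n_neighbors=2, add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)    # three imputed columns followed by two indicator columns
print(X_out[:, 3:])   # the appended missing flags
```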

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import MissingIndicator, SimpleImputer

In [2]:
df = pd.read_csv(r"C:\Users\Asus\Downloads\train.csv", usecols=['Age', 'Fare', 'Survived'])

In [3]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [5]:
X_train.head()

| | Age | Fare |
|---|---|---|
| 30 | 40.0 | 27.7208 |
| 10 | 4.0 | 16.7 |
| 873 | 47.0 | 9.0 |
| 182 | 9.0 | 31.3875 |
| 876 | 20.0 | 9.8458 |


# Before MissingIndicator

In [6]:
si = SimpleImputer()
X_train_trf = si.fit_transform(X_train)
X_test_trf = si.transform(X_test)

In [7]:
X_train_trf

array([[ 40.        ,  27.7208    ],
       [  4.        ,  16.7       ],
       [ 47.        ,   9.        ],
       ...,
       [ 71.        ,  49.5042    ],
       [ 29.78590426, 221.7792    ],
       [ 29.78590426,  25.925     ]], shape=(712, 2))

In [8]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train_trf,y_train)

y_pred = clf.predict(X_test_trf)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6145251396648045

# After MissingIndicator

In [9]:
mi = MissingIndicator()

mi.fit(X_train)


In [10]:
mi.features_

array([0])
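
`features_` lists the column indices that received an indicator; here only column 0 (`Age`) contained missing values during `fit`. The sketch below, on made-up arrays, contrasts the default `features="missing-only"` with `features="all"`. It sets `error_on_new=False` for the first transformer so that a NaN appearing at transform time in a column that was complete during `fit` is ignored rather than raising an error.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data (hypothetical values): only column 0 has a NaN at fit time.
X_fit = np.array([[22.0, 7.25], [np.nan, 71.28], [26.0, 7.92]])
# New data with a NaN in column 1, which was complete during fit.
X_new = np.array([[35.0, np.nan]])

only_missing = MissingIndicator(features="missing-only", error_on_new=False).fit(X_fit)
all_features = MissingIndicator(features="all").fit(X_fit)

print(only_missing.features_)           # columns flagged with the default setting
print(all_features.features_)           # every column gets an indicator
print(only_missing.transform(X_new))    # the NaN in column 1 is not reported
print(all_features.transform(X_new))    # the NaN in column 1 is reported
```

With `features="all"` the output width is fixed regardless of where NaNs occur, at the cost of extra constant columns.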

In [11]:
X_train_missing = mi.transform(X_train)

In [12]:
X_train_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...

In [13]:
X_test_missing = mi.transform(X_test)

In [14]:
X_test_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       ...

In [15]:
X_train['Age_NA'] = X_train_missing

In [16]:
X_test

| | Age | Fare |
|---|---|---|
| 707 | 42.0 | 26.2875 |
| 37 | 21.0 | 8.0500 |
| 615 | 24.0 | 65.0000 |
| 169 | 28.0 | 56.4958 |
| 68 | 17.0 | 7.9250 |
| ... | ... | ... |
| 89 | 24.0 | 8.0500 |
| 80 | 22.0 | 9.0000 |
| 846 | NaN | 69.5500 |
| 870 | 26.0 | 7.8958 |


In [17]:
X_test['Age_NA'] = X_test_missing

In [18]:
X_train

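| | Age | Fare | Age_NA |
|---|---|---|---|
| 30 | 40.0 | 27.7208 | False |
| 10 | 4.0 | 16.7000 | False |
| 873 | 47.0 | 9.0000 | False |
| 182 | 9.0 | 31.3875 | False |
| 876 | 20.0 | 9.8458 | False |
| ... | ... | ... | ... |
| 534 | 30.0 | 8.6625 | False |
| 584 | NaN | 8.7125 | True |
| 493 | 71.0 | 49.5042 | False |
| 527 | NaN | 221.7792 | True |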
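
For a single column, the `Age_NA` flag built above matches what pandas' `isna()` returns. The sketch below checks this on a small hypothetical DataFrame (the rows are illustrative, not read from `train.csv`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical rows with the same column names as the notebook.
df = pd.DataFrame({"Age": [40.0, np.nan, 47.0], "Fare": [27.72, 16.70, 9.00]})

mi = MissingIndicator().fit(df)
flags = mi.transform(df)          # shape (3, 1): only Age has missing values

df_flagged = df.copy()            # copy to avoid modifying the original frame
df_flagged["Age_NA"] = flags[:, 0]

# Compare the transformer output with the plain pandas idiom.
same = np.array_equal(df_flagged["Age_NA"].to_numpy(), df["Age"].isna().to_numpy())
print(same)
```

The transformer is still the safer choice inside a pipeline, because it remembers which columns were flagged during `fit`.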


In [19]:
si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train_trf2,y_train)

y_pred = clf.predict(X_test_trf2)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6312849162011173
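
A single train/test split can make small accuracy differences look more meaningful than they are. A hedged sketch of a cross-validated comparison is shown below. It uses synthetic data in which the label is deliberately made to depend on the missingness, so it illustrates the evaluation procedure rather than reproducing the numbers above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for this sketch): about 25% of column 0
# is missing, and the label depends partly on that missingness.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))
missing = rng.random(n) < 0.25
y = ((X[:, 1] + 1.5 * missing + rng.normal(scale=0.5, size=n)) > 0.5).astype(int)
X[missing, 0] = np.nan

# Same classifier, with and without the missing-value flag.
plain = make_pipeline(SimpleImputer(), LogisticRegression())
flagged = make_pipeline(SimpleImputer(add_indicator=True), LogisticRegression())

scores_plain = cross_val_score(plain, X, y, cv=5, scoring="accuracy")
scores_flagged = cross_val_score(flagged, X, y, cv=5, scoring="accuracy")

print("imputation only:       ", scores_plain.mean())
print("imputation + indicator:", scores_flagged.mean())
```

Comparing the mean and spread of the fold scores gives a steadier picture than one split.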

# Using add_indicator directly in SimpleImputer

In [21]:
si = SimpleImputer(add_indicator=True)

In [22]:
X_train = si.fit_transform(X_train)

In [23]:
X_test = si.transform(X_test)

In [24]:
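from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

# Fit on the add_indicator output from In [22] and In [23]. The original cell
# passed X_train_trf2 / X_test_trf2 here, so the score recorded below is the
# In [20] result and may differ slightly when this cell is re-run.
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)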

0.6312849162011173
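
To make the relationship between the two approaches concrete, the sketch below checks on a small made-up array that `SimpleImputer(add_indicator=True)` produces the same result as imputing and stacking a `MissingIndicator` mask by hand. The array values are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Toy data (hypothetical values): column 0 has two NaNs, column 1 is complete.
X = np.array([
    [40.0, 27.72],
    [np.nan, 16.70],
    [47.0, 9.00],
    [np.nan, 31.39],
])

# Manual route: impute, build the mask, then concatenate the two.
manual = np.hstack([
    SimpleImputer().fit_transform(X),
    MissingIndicator().fit_transform(X),
])

# Built-in route: one transformer does both steps.
combined = SimpleImputer(add_indicator=True).fit_transform(X)

print(combined)
print(np.allclose(manual, combined))
```

When using this option, the input should contain only the raw feature columns; a manually added flag column such as `Age_NA` would be duplicated by the appended indicator.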