# Multivariate Imputation by Chained Equations (MICE)
When imputing missing data, we assume the missingness follows one of three mechanisms:  
* **MCAR** (Missing Completely At Random): The missingness has no pattern at all. It depends neither on the missing value itself nor on any observed data.
* **MAR** (Missing At Random): The missingness can be fully explained by other observed features, so you have enough information to impute without bias.
* **MNAR** (Missing Not At Random): The missingness depends on the missing value itself or on something unobserved, so the observed features cannot explain it and there is not enough information for an unbiased imputation.
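
The three mechanisms are easier to tell apart on simulated data. The sketch below is only an illustration: the `age` and `income` features, the missingness rates, and the thresholds are all made-up assumptions, not part of the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 10, size=n)                    # fully observed feature (hypothetical)
income = 1000 * age + rng.normal(0, 5000, size=n)   # feature that will get missing values

# MCAR: every value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed feature `age`
mar_mask = rng.random(n) < np.where(age > 40, 0.35, 0.05)

# MNAR: missingness depends on the unobserved value of `income` itself
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.35, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: missing={mask.mean():.2f}, "
          f"observed mean={observed_mean:.0f}, true mean={income.mean():.0f}")
```

Under MCAR the mean of the observed values stays close to the true mean. Under MAR and MNAR it shifts, but only in the MAR case can the shift be corrected by conditioning on the observed `age`.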


<br><br>
### Advantages & Disadvantages  
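* It is usually more accurate than simple mean or median imputation, because each missing value is predicted from the other features.
* It is slow, because a full ML model is fitted for every column with missing values, in every iteration.
* With a hand-rolled implementation like the one in this notebook, the whole training set has to be available on the server: whenever a row with null values arrives in production, the algorithm has to run against the training data.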
  <br><br>
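
One way to soften the production cost, sketched here with scikit-learn's `IterativeImputer`, is to fit the imputer once offline and only call `transform` on incoming rows. The fitted object keeps its per-column models (in `imputation_sequence_`), so `transform` does not need the training frame again. The data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_train[:, 0] += X_train[:, 1]                 # give feature 0 something to be predicted from
X_train[rng.random(200) < 0.1, 0] = np.nan     # about 10% of feature 0 is missing

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)                           # the slow part, done once offline

# A new row arriving in production with a missing first feature
new_row = np.array([[np.nan, 0.5, -1.2]])
completed_row = imputer.transform(new_row)
print(completed_row)
```

The fitted `imputer` can then be persisted (for example with `joblib`) and shipped instead of the raw training data.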

---

### Steps

MICE is an advanced method for imputing missing data by iteratively predicting missing values for each column using the other columns.

---

##### Step 1: Initial Imputation
- Fill all missing values with simple guesses (mean, median, or constant).
- Example:
```text
A: [1, 2, NaN, 4] → [1, 2, 2.33, 4]
B: [NaN, 5, 6, 7] → [6, 5, 6, 7]
C: [1, NaN, 3, 4] → [1, 2.67, 3, 4]
```

This gives a complete dataset to start the iterations.
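
The same initial fill can be reproduced with pandas. This is a minimal sketch on the toy columns `A`, `B`, and `C` from the example; the variable names `toy`, `missing_mask`, and `initial` are just illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [np.nan, 5, 6, 7],
    "C": [1, np.nan, 3, 4],
})

# Remember where the gaps were; later iterations only overwrite these cells
missing_mask = toy.isna()

# Column means are computed over the observed values only (NaN is skipped)
initial = toy.fillna(toy.mean())
print(initial.round(2))
```

Keeping `missing_mask` around matters: once the frame is filled, the imputed cells are otherwise indistinguishable from the observed ones.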

---

##### Step 2: Iteratively Update Missing Values
For each column with missing values:

1. Pick a column (e.g., `A`) as the **target**.  
2. Use all other columns (`B`, `C`) as **predictors**.  
3. Fit a regression model (or other estimator) using **rows where `A` is observed**.  
4. Predict missing values of `A` and replace the old guesses.  

Repeat the same for `B` and `C`.
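
A single pass of this step could look like the sketch below. The helper `update_column` and the synthetic data are assumptions made for illustration; the point is that the model is fitted only on rows where the target was observed, and only the originally missing cells are overwritten.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def update_column(filled, missing_mask, target):
    """Re-impute one column in place from a model fitted on its observed rows."""
    to_predict = missing_mask[target].to_numpy()
    if not to_predict.any():
        return filled
    predictors = filled.drop(columns=target)
    model = LinearRegression()
    # fit only on rows where the target was actually observed
    model.fit(predictors[~to_predict], filled.loc[~to_predict, target])
    # replace the previous guesses for the rows that were missing
    filled.loc[to_predict, target] = model.predict(predictors[to_predict])
    return filled

# Hypothetical data: A is roughly B + C, with a few gaps in every column
rng = np.random.default_rng(0)
data = pd.DataFrame({"B": rng.normal(size=30), "C": rng.normal(size=30)})
data.insert(0, "A", data["B"] + data["C"] + rng.normal(scale=0.1, size=30))
data.iloc[[2, 11], 0] = np.nan
data.iloc[[5], 1] = np.nan
data.iloc[[7, 20], 2] = np.nan

missing_mask = data.isna()
filled = data.fillna(data.mean())      # Step 1: simple initial guesses
for column in filled.columns:          # Step 2: one pass over A, B, C
    update_column(filled, missing_mask, column)
print(filled.loc[missing_mask.any(axis=1)])
```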

---

##### Step 3: Repeat Iterations
- After one round (updating all columns), go back to column `A` and repeat using updated values of `B` and `C`.  
- Continue for a fixed number of iterations (e.g., 10) or until the imputed values **stabilize**.
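
The stopping rule can be sketched as follows. The function `chained_imputation`, the tolerance `tol`, and the "largest change in any imputed cell" criterion are illustrative choices, not the only reasonable ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def chained_imputation(df, max_iter=10, tol=1e-3):
    """Mean-fill, then cycle through the columns until the imputed cells stop moving."""
    mask = df.isna().to_numpy()
    filled = df.fillna(df.mean())
    if not mask.any():
        return filled, 0, 0.0
    for iteration in range(1, max_iter + 1):
        previous = filled.to_numpy().copy()
        for j, target in enumerate(filled.columns):
            rows = mask[:, j]
            if not rows.any():
                continue
            X = filled.drop(columns=target)
            model = LinearRegression().fit(X[~rows], filled.loc[~rows, target])
            filled.loc[rows, target] = model.predict(X[rows])
        # largest change in any imputed cell during this round
        change = np.abs(filled.to_numpy() - previous)[mask].max()
        if change < tol:
            break
    return filled, iteration, change

# Synthetic example data (an assumption for this sketch)
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(40, 3)), columns=["A", "B", "C"])
data["A"] += data["B"]                 # give the columns some shared signal
data.iloc[[3, 8], 0] = np.nan
data.iloc[[5, 8], 1] = np.nan
data.iloc[[12], 2] = np.nan

filled, n_rounds, last_change = chained_imputation(data, max_iter=25)
print(f"stopped after {n_rounds} rounds, last change {last_change:.2e}")
```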

---

##### Step 4: Optional Multiple Imputations
- Perform MICE multiple times to capture the **uncertainty** in imputations.  
- This generates several complete datasets.  
- Analyze each dataset separately and combine results using Rubin’s rules.  
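
As a rough sketch of this step, scikit-learn's `IterativeImputer` can draw imputations from the predictive distribution when `sample_posterior=True`, so different seeds give different completed datasets. The example pools the mean of one column with Rubin's rules, where the total variance is the average within-imputation variance plus `(1 + 1/m)` times the between-imputation variance. The data and the choice of `m = 5` are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] += 0.8 * X[:, 1]
X_missing = X.copy()
X_missing[rng.random(200) < 0.2, 0] = np.nan   # knock out about 20% of column 0

m = 5                                          # number of imputed datasets
estimates, variances = [], []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(X_missing)
    col = completed[:, 0]
    estimates.append(col.mean())               # the analysis: mean of column 0
    variances.append(col.var(ddof=1) / len(col))

q_bar = np.mean(estimates)                     # pooled estimate
within = np.mean(variances)                    # average within-imputation variance
between = np.var(estimates, ddof=1)            # between-imputation variance
total_var = within + (1 + 1 / m) * between     # Rubin's total variance
print(f"pooled mean = {q_bar:.3f} +/- {np.sqrt(total_var):.3f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it reflects how much the answer depends on the particular values that were filled in.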

---

##### Key Points
- Column-wise modeling: each column with missing data is predicted using other columns.  
- Iterative updating improves predictions as other columns are updated.  
- Best suited for **MAR** (missing at random) data.  
- Flexible: choose any predictive model (linear regression, decision tree, Bayesian regression, etc.).
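
To illustrate the last point, the sketch below swaps the model inside scikit-learn's `IterativeImputer` through its `estimator` parameter. The synthetic data, the tree depth, and the error measure are assumptions chosen for the example, and no conclusion about which estimator is better in general should be drawn from it.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)   # non-linear relationship
X_missing = X.copy()
holes = rng.random(100) < 0.2                          # cells to hide in column 2
X_missing[holes, 2] = np.nan

estimators = {
    "BayesianRidge (scikit-learn default)": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=4, random_state=0),
}
errors = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    completed = imputer.fit_transform(X_missing)
    # compare the imputed cells with the values that were hidden
    errors[name] = np.abs(completed[holes, 2] - X[holes, 2]).mean()
    print(f"{name}: mean absolute error on the hidden cells = {errors[name]:.3f}")
```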



In [385]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [376]:
df = pd.read_csv('assets/50_Startups.csv')
df.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [377]:

df.iloc[1,0] = np.nan
df.iloc[3,0] = np.nan
df.iloc[0,2] = np.nan
df.iloc[2,1] = np.nan
df.iloc[3,2] = np.nan

In [378]:
df.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,,New York,192261.83
1,,151377.59,443898.53,California,191792.06
2,153441.51,,407934.54,Florida,191050.39
3,,118671.85,,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [379]:
# keep only the numeric predictor columns; `columns=` already implies axis=1
dfx = df.drop(columns=['State', 'Profit'])

In [380]:
dfx.head()
dfy = dfx.copy()  # untouched copy (still containing NaNs) for the scikit-learn comparison later

In [381]:
dfx.isnull().sum()

R&D Spend          2
Administration     1
Marketing Spend    2
dtype: int64

In [382]:
# collect the (row position, column position) of every missing cell, column by column
null_indexes = []
for col in range(dfx.shape[1]):
    for row in range(len(dfx)):
        if pd.isna(dfx.iloc[row, col]):
            null_indexes.append((row, col))
null_indexes

[(1, 0), (3, 0), (2, 1), (0, 2), (3, 2)]

### Impute with mean

In [383]:
# Step 1: fill every missing cell with its column mean (NaNs are skipped when averaging)
df_imputed = dfx.fillna(dfx.mean())

In [386]:
for iteration in range(1, 50):
    # ordinal suffix for the log line: 1st, 2nd, 3rd, then th
    suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(iteration, 'th')
    print(f'\n{iteration}{suffix} imputation')
    for row, col in null_indexes:
        target = df_imputed.columns[col]
        # the row to impute, without the target column, kept as a one-row DataFrame
        X_test = df_imputed.iloc[[row]].drop(columns=target)
        # train on every other row; note that this still includes rows whose
        # target value is itself an imputed guess, unlike Step 2 above
        temp = df_imputed.drop(index=row)
        X_train = temp.drop(columns=target)
        y_train = temp[target]
        # fit and predict
        model = DecisionTreeRegressor()
        model.fit(X_train, y_train)
        prediction = model.predict(X_test)
        # report the change, then overwrite the old guess
        previous = df_imputed.iloc[row, col]
        print(f"Predicted: {prediction[0]}, Previous: {previous}, Difference: {previous - prediction[0]}")
        df_imputed.iloc[row, col] = prediction[0]



1st imputation
Predicted: 153441.51, Previous: 70398.13895833334, Difference: -83043.37104166667
Predicted: 165349.2, Previous: 70398.13895833334, Difference: -94951.06104166668
Predicted: 151377.59, Previous: 121756.86591836736, Difference: -29620.724081632638
Predicted: 202005.649375, Previous: 202005.649375, Difference: 0.0
Predicted: 202005.649375, Previous: 202005.649375, Difference: 0.0

2nd imputation
Predicted: 153441.51, Previous: 153441.51, Difference: 0.0
Predicted: 165349.2, Previous: 165349.2, Difference: 0.0
Predicted: 151377.59, Previous: 151377.59, Difference: 0.0
Predicted: 202005.649375, Previous: 202005.649375, Difference: 0.0
Predicted: 202005.649375, Previous: 202005.649375, Difference: 0.0

3rd imputation
Predicted: 153441.51, Previous: 153441.51, Difference: 0.0
Predicted: 165349.2, Previous: 165349.2, Difference: 0.0
Predicted: 151377.59, Previous: 151377.59, Difference: 0.0
Predicted: 202005.649375, Previous: 202005.649375, Difference: 0.0
Predicted: 202005.64

## Now Using `sklearn`

In [349]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [360]:
it_imputer = IterativeImputer(max_iter=50, random_state=0)
imputed = it_imputer.fit_transform(dfy)


In [361]:
type(imputed)

numpy.ndarray

In [363]:
imputed_sklrn = pd.DataFrame(imputed,columns = dfy.columns)

In [374]:
# compare the scikit-learn result with the manual loop, cell by cell
for row, col in null_indexes:
    print(f"by sklearn: {imputed_sklrn.iloc[row, col]}\tcustom: {df_imputed.iloc[row, col]}")

by sklearn: 145120.91567567218	custom: 148464.20170776223
by sklearn: 70789.62146004538	custom: 70451.15329514252
by sklearn: 129354.24463892463	custom: 131825.72604625786
by sklearn: 371343.56399445207	custom: 377861.51695349603
by sklearn: 206000.85960623703	custom: 205610.0687930744

Note that the `custom` values above do not match the decision-tree log shown earlier, where the first three cells settled at 153441.51, 165349.2, and 151377.59. The cell numbers (`In [374]` here versus `In [386]` for the loop) show that this comparison was executed before the loop output shown, so it reflects an earlier run of the manual loop.
