# TPS-JUN22, Multivariate Feature Imputation 🔥
## ...
A multivariate imputer estimates each feature from all the others.
The strategy imputes missing values by modeling each feature that has missing values as a function of the other features, in a round-robin fashion.

A more sophisticated approach than filling each column with a single statistic (such as its mean) is to use scikit-learn's `IterativeImputer` class, which models each feature with missing values as a function of the other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the rows where y is known, and is then used to predict the missing values of y. This is done for each feature in turn and repeated for `max_iter` imputation rounds. The results of the final imputation round are returned.

https://scikit-learn.org/stable/modules/impute.html#iterative-imputer
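
The round-robin procedure described above can be sketched in a few lines of NumPy. This is only a toy illustration, not scikit-learn's implementation: the function name `round_robin_impute` is hypothetical, the regressor is plain least squares with an intercept, columns are visited left to right, and there is no early stopping.

```python
import numpy as np

def round_robin_impute(X, max_iter=10):
    """Toy round-robin imputation using least squares for each column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial fill: replace every NaN with the mean of its column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            known = ~missing[:, j]
            # Inputs: all other columns (current estimates) plus an intercept.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([others, np.ones(len(others))])
            # Fit on rows where column j is observed, predict where it is not.
            coef, *_ = np.linalg.lstsq(A[known], filled[known, j], rcond=None)
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

# Toy data in which the second column is exactly twice the first.
toy = np.array([[1.0, 2.0],
                [2.0, 4.0],
                [3.0, np.nan],
                [np.nan, 8.0],
                [5.0, 10.0]])
imputed = round_robin_impute(toy)
print(imputed)
```

Because the two toy columns are perfectly linearly related, the estimates settle on values consistent with that relation. Real data needs a more capable regressor, which is why this notebook plugs a gradient-boosted model into `IterativeImputer`.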

#### Credits...
Some or most of the work and ideas here come from the following notebooks. Thanks to the authors.

* https://www.kaggle.com/code/inversion/get-started-with-mean-imputation
* https://www.kaggle.com/code/hiro5299834/tps-jun-2022-iterativeimputer-baseline

# 1. Loading the Required Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
from tqdm import tqdm
from pathlib import Path
from sklearn.impute import SimpleImputer
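# IterativeImputer is experimental: this import must run before importing it from sklearn.impute.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401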
from sklearn.impute import IterativeImputer
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

---

# 2. Setting the Notebook

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Number of data rows we want to load into the model...
DATA_ROWS = None
# DataFrame display limits: the number of rows and columns to show...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to show 5 decimal places and cap the rows and columns shown, so tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3. Loading the Information (CSV) Into A Dataframe

In [None]:
%%time
# Load the CSV information into a Pandas DataFrame...
input_path = Path('/kaggle/input/tabular-playground-series-jun-2022/')

dataset = pd.read_csv(input_path / 'data.csv', index_col='row_id')
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='row-col')

---

# 4. Exploring the Information Available

In [None]:
%%time
# Explore the shape of the DataFrame...
dataset.shape

In [None]:
%%time
# Display simple information of the variables in the dataset...
dataset.info(verbose = False)

In [None]:
%%time
# Display the first few rows of the DataFrame...
dataset.head()

In [None]:
%%time
# Generate a simple statistical summary of the DataFrame, Only Numerical...
dataset.describe()

In [None]:
%%time
# Calculates the total number of missing values...
dataset.isnull().sum().sum()

In [None]:
%%time
# Display the number of missing values by variable...
dataset.isnull().sum()

In [None]:
%%time
# Display the number of missing values by variable, sorted in ascending order...
dataset.isnull().sum().sort_values()

In [None]:
%%time
# Display the number of unique values for each variable...
dataset.nunique()

In [None]:
%%time
# Display the number of unique values for each variable, sorted by quantity...
dataset.nunique().sort_values(ascending = True)

---

# 5. Multivariate Feature Imputation

In [None]:
%%time
SEED = 22
ESTIMATORS = 1024
ITERATIONS_IMPUTER = 32

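# Note: 'gpu_hist' needs a GPU. XGBoost 2.0+ deprecates it in favor of
# tree_method = 'hist' together with device = 'cuda'.
params = {'n_estimators': ESTIMATORS,
          'random_state': SEED,
          'tree_method' : 'gpu_hist',}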

estimator = XGBRegressor(**params)

imp = IterativeImputer(estimator = estimator,
                       missing_values = np.nan,
                       max_iter = ITERATIONS_IMPUTER,
                       initial_strategy = 'mean',
                       imputation_order = 'ascending',
                       verbose = 2,
                       random_state = SEED,
                      )

dataset[:] = imp.fit_transform(dataset)
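
Since the true values behind the missing cells are unknown, one way to get a feel for imputation quality is to hide some known values and score how well they are recovered. The sketch below is a hedged illustration on synthetic data, not part of the original pipeline: the data, the 10% masking rate, the RMSE metric, the helper `masked_rmse`, and the use of `IterativeImputer`'s default estimator instead of XGBoost are all assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data: three columns driven by one latent factor plus small noise.
latent = rng.normal(size=(200, 1))
X_full = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.1 * rng.normal(size=(200, 3))

# Hide roughly 10% of the entries at random and keep the mask for scoring.
mask = rng.random(X_full.shape) < 0.1
X_missing = X_full.copy()
X_missing[mask] = np.nan

def masked_rmse(imputer):
    """Fit the imputer on the masked data and score only the hidden cells."""
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2)))

rmse_mean = masked_rmse(SimpleImputer(strategy="mean"))
rmse_iterative = masked_rmse(IterativeImputer(max_iter=10, random_state=0))
print(f"mean imputation RMSE:      {rmse_mean:.3f}")
print(f"iterative imputation RMSE: {rmse_iterative:.3f}")
```

The same masking idea could be applied to a sample of fully observed cells in `dataset` to compare estimators or `max_iter` settings before committing to a long GPU run.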

---

In [None]:
# %%time
# SEED = 22
# ITERATIONS_CATBOOST = 10
# ITERATIONS_IMPUTER  = 5

# params = {'iterations': ITERATIONS_CATBOOST,
#           'task_type' :'GPU',
#           'devices'   :'0:1',
#           'verbose'   : 0}

# estimator = CatBoostRegressor(**params)

# imp = IterativeImputer(estimator = estimator,
#                        missing_values = np.nan,
#                        max_iter = ITERATIONS_IMPUTER,
#                        initial_strategy = 'mean',
#                        imputation_order = 'ascending',
#                        verbose = 2,
#                        random_state = SEED,
#                       )

# dataset[:] = imp.fit_transform(dataset)

---

# 6. Submission

In [None]:
%%time
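# Each submission index has the form "<row_id>-<column name>".
# Split it and copy the imputed value from the matching cell of the dataset.
for i in tqdm(submission.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    submission.loc[i, 'value'] = dataset.loc[row, col]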

submission.to_csv("submission.csv")
submission
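
The row-by-row loop above performs one `.loc` lookup and one assignment per submission entry, which can be slow when there are many entries. A possible vectorized alternative is sketched below on a tiny stand-in for `dataset` and `submission`. The toy values and column names are made up for illustration, and the sketch assumes every key refers to an existing row and column and that all columns are numeric.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the notebook's `dataset` and `submission` frames.
dataset = pd.DataFrame(
    {"F_1_0": [0.1, 0.2, 0.3], "F_1_1": [1.0, 2.0, 3.0]},
    index=pd.Index([0, 1, 2], name="row_id"),
)
submission = pd.DataFrame(
    {"value": 0.0},
    index=pd.Index(["0-F_1_0", "2-F_1_1", "1-F_1_0"], name="row-col"),
)

# Split each "<row_id>-<column>" key once, at the first hyphen only.
keys = [key.split("-", 1) for key in submission.index]
rows = [int(row) for row, _ in keys]
cols = [col for _, col in keys]

# Translate labels into integer positions, then pull all values in one step.
row_pos = dataset.index.get_indexer(rows)
col_pos = dataset.columns.get_indexer(cols)
submission["value"] = dataset.to_numpy()[row_pos, col_pos]
print(submission)
```

`get_indexer` returns -1 for labels it cannot find, so checking that neither `row_pos` nor `col_pos` contains -1 is a cheap safeguard before writing the file.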

---
