# TPS-JUN22, Neural Net, Multivariate Feature Imputation 🧊



## <span style="color:orange">A TensorFlow Model Using the Keras API</span>
Hello Kaggle! 

In this notebook I developed a neural network imputer using the multivariate feature imputation technique, along with examples of the same technique applied to GBDT models. The idea is quite simple but powerful.
Basically, for each feature with missing values we train a model on the other N - 1 features and use the remaining feature as the target to impute or estimate. We repeat this for every feature that has missing values.

<img src='https://drive.google.com/uc?id=1x03IDYdr80DYOFsdKY54vpGif7eLsjPT' width = 500>




A multivariate imputer estimates each feature from all the others: it imputes missing values by modeling each feature with missing values as a function of the other features in a round-robin fashion.

scikit-learn implements this approach in the `IterativeImputer` class, which models each feature with missing values as a function of other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in turn, and then repeated for `max_iter` imputation rounds. The results of the final imputation round are returned.
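As a point of reference, here is a minimal sketch of `IterativeImputer` on a small synthetic array. The data, the 10% missing rate, and the variable names are made up for illustration; they are not the competition data.

```python
import numpy as np
# IterativeImputer is still experimental, so this import is required to expose it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the last column depends on the first two, so it can be estimated from them.
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, 0] + 0.5 * X_full[:, 1] + 0.1 * rng.normal(size=200)

# Knock out roughly 10% of the cells at random.
mask = rng.random(X_full.shape) < 0.1
X_missing = X_full.copy()
X_missing[mask] = np.nan

# Each round models every column with gaps as a function of the other columns.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_imputed).sum())  # number of gaps left after imputation
```

The observed cells are passed through unchanged; only the cells that were `NaN` receive estimated values.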


### ✨ Strategy...
Build a NN model that trains on the N - 1 other features and predicts the remaining feature. Training takes a lot of time, because one model is fitted per fold for every feature with missing values.
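The strategy can be sketched on a toy DataFrame. This is only an illustration under stated assumptions: a least-squares fit stands in for the neural network, the helper name `impute_one_pass` is hypothetical, and gaps in the input columns are mean-filled before fitting.

```python
import numpy as np
import pandas as pd

def impute_one_pass(df):
    """Impute every column that has gaps from all the other columns (single pass)."""
    completed = df.copy()
    # Inputs may contain gaps too; mean-fill them so the stand-in regressor can use them.
    filled = df.fillna(df.mean())
    for col in df.columns:
        missing = df[col].isnull().to_numpy()
        if not missing.any():
            continue
        # N - 1 input columns plus an intercept term.
        X = filled.drop(columns=col).to_numpy()
        X = np.column_stack([X, np.ones(len(X))])
        y = df[col].to_numpy()[~missing]
        # Fit on rows where the target is known, predict rows where it is missing.
        coef, *_ = np.linalg.lstsq(X[~missing], y, rcond=None)
        completed.loc[missing, col] = X[missing] @ coef
    return completed

rng = np.random.default_rng(22)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy["c"] = 2 * toy["a"] - toy["b"]      # "c" is fully determined by "a" and "b"

toy_missing = toy.copy()
toy_missing.loc[[3, 10, 42], "c"] = np.nan

toy_imputed = impute_one_pass(toy_missing)
```

Swapping the least-squares step for a gradient-boosted regressor or a Keras model gives the loops used later in this notebook.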


### 📅 Updates...
**06/04/2022**
* Developed the initial Notebook using XGBoost
* Adapted the XGBoost Function to train a NN Model
* Generated Submission for the Leaderboard 

**06/05/2022**
* Maybe try some features


### 🖥️ Credits...
I based my initial work on these notebooks; thanks to the authors:
* https://www.kaggle.com/code/reymaster/0-93002-iterative-imputer-baseline
* https://www.kaggle.com/code/djustin/lgb-cpu/data
* ...


### 💽 Data Description
For this challenge, you are given (simulated) manufacturing control data that contains missing values due to electronic errors. Your task is to predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)


**Files**
* **data.csv** - the file includes normalized continuous data and categorical data; your task is to predict the values of the missing data.
* **sample_submission.csv** - a sample submission file in the correct format; the row-col indicator corresponds to the row and column of each missing value in data.csv
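To make the `row-col` indicator concrete, here is a small sketch that splits an indicator into its row number and column name and looks the value up in a completed table. The toy frame, its column names, and the helper `parse_row_col` are illustrative assumptions, not taken from the competition files.

```python
import pandas as pd

def parse_row_col(indicator):
    """Split an indicator such as '1-F_1_0' into (row number, column name)."""
    # Split only on the first '-' so the column name is kept whole.
    row, col = indicator.split("-", 1)
    return int(row), col

# Toy table standing in for the fully imputed data.
completed = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3],
                          "F_1_1": [0.4, 0.5, 0.6]})

indicators = ["1-F_1_0", "0-F_1_1"]
values = [completed.loc[row, col] for row, col in map(parse_row_col, indicators)]

toy_submission = pd.DataFrame({"value": values},
                              index=pd.Index(indicators, name="row-col"))
print(toy_submission)
```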


### 📖 Table of Contents...
* **Importing Python Libraries for the Notebook**
* **Notebook Configuration**
* **Loading the Datasets into a Pandas DataFrame**
* **Feature Engineering**
* **Multivariate Feature Imputation GBDT**
* Model Parameters
* Model Training XGBoost or LightGBM Loop, For all the Features
* **Multivariate Feature Imputation Using TensorFlow / Keras**
* Defining Model NN Libraries
* Defining a NN Model Function; Simple Network Architecture
* Defining the Model Fit Function
* Training the Model in a Cross Validation Loop for all the Features
* **Model Submission**


---

# Importing Python Libraries for the Notebook
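This section loads NumPy and pandas, lists the competition input files, and imports the GBDT regressors (LightGBM and XGBoost) and the utilities used later.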

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
# Loading more libraries for the model...
from pathlib import Path
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from tqdm import tqdm
import datetime

---

# Notebook Configuration
Settings that control warnings, how much data is loaded, and how DataFrames are displayed.

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to use 5 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# Loading the Datasets into a Pandas DataFrame
Reads `data.csv` and `sample_submission.csv` from the competition input folder.

In [None]:
%%time
# Load the CSV information into a Pandas DataFrame...
input_path = Path('/kaggle/input/tabular-playground-series-jun-2022/')

dataset = pd.read_csv(input_path / 'data.csv')
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='row-col')

In [None]:
%%time
dataset.head()

---

# Feature Engineering
Adds `num_nans`, the number of missing values in each row, as an extra feature.

In [None]:
%%time
dataset['num_nans'] = dataset.isnull().sum(axis = 1)

In [None]:
%%time
dataset.head()

In [None]:
%%time
dataset.describe()

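# Multivariate Feature Imputation GBDT
A gradient-boosted baseline: one regressor per feature with missing values. These cells are disabled with `%%script false`; remove that line to run them.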

## Model Parameters
Parameters shared by every XGBoost model in the loop.

In [None]:
%%script false --no-raise-error
%%time
SEED = 22
ESTIMATORS = 128

params = {'n_estimators': ESTIMATORS,
          'random_state': SEED,
          'tree_method' : 'gpu_hist'}

---

## Model Training XGBoost or LightGBM Loop, For all the Features
For each feature with missing values, fit a regressor on the rows where the feature is known and predict the rows where it is missing.

In [None]:
%%script false --no-raise-error
%%time 
# Loop over the columns: train on all the other columns and use the current column as the target...

features = dataset.columns.to_list()
data_completed = pd.DataFrame()


for feat in tqdm(features):
    if dataset[feat].isnull().any():
        
        # Identify missing values...
        missing_values = list(np.where(dataset[feat].isnull())[0])
        not_missing_values = list(np.where(dataset[feat].notnull())[0])
        
        trn_data = dataset.iloc[not_missing_values,]
        tst_data = dataset.iloc[missing_values,]
        
        X = trn_data.drop([feat,'row_id'],axis = 1)
        y = trn_data[feat]
        X_test = tst_data.drop([feat,'row_id'],axis = 1)
            
        model = XGBRegressor(**params)
        #model = LGBMRegressor(n_estimators = 10,metric = 'rmse')
        model.fit(X,y)
        
        y_predict = model.predict(X_test)
        imputed_data = dataset[feat]
        imputed_data.iloc[missing_values,] = y_predict
        
        data_completed = pd.concat([data_completed, imputed_data],axis = 1)
        
    else:
        data_completed = pd.concat([data_completed, dataset[feat]],axis = 1)
        

In [None]:
%%script false --no-raise-error
%%time
for i in tqdm(submission.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    # Read from the completed table built by the loop above rather than the raw dataset
    submission.loc[i, 'value'] = data_completed.loc[row, col]

submission.to_csv("gdbt_submission.csv")
submission

---

# Multivariate Feature Imputation Using TensorFlow / Keras
The same per-feature loop, with a neural network in place of the gradient-boosted regressor.

## Defining Model NN Libraries
Keras layers and callbacks, plus the scikit-learn scalers, K-fold splitter, and metric used for training.

In [None]:
%%time
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, BatchNormalization, Dropout, Concatenate

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

import random
import math

---

## Defining Model NN Parameters
Batch size, number of epochs, verbosity, and number of cross-validation folds.

In [None]:
%%time
# Defining model parameters...
BATCH_SIZE = 2048
EPOCHS     = 32 # For Testing Purposes, Increase for Final Submission...
VERBOSE    = 0 
NUM_FOLDS  = 3  # For Testing Purposes, Increase for Final Submission...

---

## Defining a NN Model Function; Simple Network Architecture
Two candidate architectures: a plain stack of dense layers with batch normalization, and a variant with a skip connection and dropout.

In [None]:
%%time
def nn_model_one():
    
    '''
    Function to define the Neural Network architecture...
    '''
    
    L2 = 65e-6
    activation_func = 'swish'
    inputs = Input(shape = (len(features) - 2,)) # Shape must be a tuple; drop the id and the current feature that's imputed
    
    x = Dense(128, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(L2), 
              activation = activation_func)(inputs)
    
    x = BatchNormalization()(x)
    
    x = Dense(64, 
          #use_bias  = True, 
          kernel_regularizer = tf.keras.regularizers.l2(L2), 
          activation = activation_func)(x)
    
    x = BatchNormalization()(x)
    
    x = Dense(32, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(L2), 
              activation = activation_func)(x)
    
    x = BatchNormalization()(x)

    x = Dense(1 , 
              #use_bias  = True, 
              #kernel_regularizer = tf.keras.regularizers.l2(L2),
              activation = 'linear')(x)
    
    model = Model(inputs, x)
    
    return model

In [None]:
%%time
def nn_model_two():
    
    '''
    Function to define the Neural Network architecture...
    '''
    
    L2 = 65e-6
    activation_func = 'swish'
    inputs = Input(shape = (len(features) - 2,)) # Shape must be a tuple; drop the id and the current feature that's imputed
    
    x0 = Dense(256, kernel_regularizer = tf.keras.regularizers.l2(L2), activation = activation_func)(inputs)
    x1 = Dense(64,  kernel_regularizer = tf.keras.regularizers.l2(L2), activation = activation_func)(x0)
    x1 = Dense(64,  kernel_regularizer = tf.keras.regularizers.l2(L2), activation = activation_func)(x1)
    x1 = Concatenate()([x1, x0])
    x1 = Dropout(0.1)(x1)
    x1 = BatchNormalization()(x1)
    x1 = Dense(16, kernel_regularizer = tf.keras.regularizers.l2(L2), activation = activation_func)(x1)
    # Linear output: a sigmoid would restrict the imputed values to the (0, 1) range
    x1 = Dense(1,  kernel_regularizer = tf.keras.regularizers.l2(4e-4), activation = 'linear')(x1)
    model = Model(inputs, x1)
    
    return model

In [None]:
# Baseline Network = [64,32,16,1] With 16 Epochs...  *
# Test Two Network = [128,64,32,1] With 16 Epochs... **
# Test Three Network = [128,64,32,1] With 32 Epochs... ???

---

## Defining the Model Fit Function
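`fit_model` scales the inputs, trains one network on a single fold with learning-rate reduction and early stopping, and stores the validation score and the predictions for the missing rows.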
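One detail of the fit function worth isolating is the scaling: the scaler is fitted on the training fold only and then reused for the validation and test inputs. A minimal sketch on synthetic arrays (the data and shapes are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
# The mean and standard deviation come from the training fold only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and are reused, not refitted, for the validation (and test) inputs.
X_val_scaled = scaler.transform(X_val)
```

Refitting the scaler on the validation or test rows would put them on a slightly different scale from the one the network was trained on.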

In [None]:
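%%time
# Defining model training function...
def fit_model(X_train, y_train, X_val, y_val, X_test, run = 0):
    '''
    Scale the inputs, train the NN on one fold, and store the results.
    Appends to the global lists history_list, score_list and predictions,
    which must exist before calling; returns the fitted model.
    '''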
    lr_start = 0.01
    start_time = datetime.datetime.now()
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)

    epochs = EPOCHS    
    lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.7, patience = 4, verbose = VERBOSE)
    es = EarlyStopping(monitor = 'val_loss',patience = 12, verbose = 1, mode = 'min', restore_best_weights = True)
    callbacks = [lr, es]
    
    model = nn_model_two() # Selection of the Second Model.
    optimizer_func = tf.keras.optimizers.Adam(learning_rate = lr_start)
    loss_func = tf.keras.losses.MeanSquaredError()
    
    model.compile(optimizer = optimizer_func, loss = loss_func)
    
    X_val = scaler.transform(X_val)
    validation_data = (X_val, y_val)
    
    history = model.fit(X_train, 
                        y_train, 
                        validation_data = validation_data, 
                        epochs          = epochs,
                        verbose         = VERBOSE,
                        batch_size      = BATCH_SIZE,
                        shuffle         = True,
                        callbacks       = callbacks
                       )
    
    history_list.append(history.history)
    #print(f'Training Loss:{history_list[-1]["loss"][-1]:.5f}')
    callbacks, es, lr, history = None, None, None, None
    
    
    y_val_pred = model.predict(X_val, batch_size = BATCH_SIZE, verbose = VERBOSE)
    score = mean_absolute_error(y_val, y_val_pred)
    
    #print(f'Run {run} | {str(datetime.datetime.now() - start_time)[-12:-7]}'
    #      f'| MAE: {score:.5f}')
    
    score_list.append(score)
    
    tst_data_scaled = scaler.transform(X_test)
    tst_pred = model.predict(tst_data_scaled, batch_size = BATCH_SIZE, verbose = VERBOSE)
    predictions.append(tst_pred)
    
    return model

---

## Training the Model in a Cross Validation Loop for all the Features
For every feature with missing values, train one network per fold on the known rows and average the fold predictions for the missing rows.
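The fold-averaging step can be sketched in isolation. This is an illustration only: a least-squares fit stands in for the neural network, and the arrays are synthetic.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(90, 3))        # rows where the target is known
y = X @ true_coef
X_test = rng.normal(size=(5, 3))    # rows where the target is missing

fold_predictions = []
for trn_idx, val_idx in KFold(n_splits=3).split(X):
    # Stand-in model: fit on the training part of the fold only.
    coef, *_ = np.linalg.lstsq(X[trn_idx], y[trn_idx], rcond=None)
    # Every fold model predicts the same missing rows.
    fold_predictions.append(X_test @ coef)

# Average across folds to get one imputed value per missing row.
mean_prediction = np.array(fold_predictions).mean(axis=0)
```

Averaging the fold models is a simple ensemble: each model sees a different subset of the known rows, so the mean is usually more stable than any single fold's prediction.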

In [None]:
%%time
# Loop over the columns: train on all the other columns and use the current column as the target...
features = dataset.columns.to_list()
data_completed = pd.DataFrame()


for feat in tqdm(features):
    # Create empty lists to store NN training metrics and predictions
    history_list = []
    score_list   = []
    predictions  = []
    
    if dataset[feat].isnull().any():
        #print('Training Model For: ',feat)
        
        # Identify missing values...
        missing_values = list(np.where(dataset[feat].isnull())[0])
        not_missing_values = list(np.where(dataset[feat].notnull())[0])
        
        
        trn_data = dataset.iloc[not_missing_values,]
        tst_data = dataset.iloc[missing_values,]
        
        # Define kfolds for training purposes...
        kf = KFold(n_splits = NUM_FOLDS)

        for fold, (trn_idx, val_idx) in enumerate(kf.split(trn_data)):
            #print(f' Training fold: {fold}...')
            X_train, X_val = trn_data.iloc[trn_idx].drop([feat,'row_id'],axis = 1), trn_data.iloc[val_idx].drop([feat,'row_id'], axis = 1)
            y_train, y_val = trn_data.iloc[trn_idx][feat], trn_data.iloc[val_idx][feat]
            X_test = tst_data.drop([feat,'row_id'], axis = 1)
            
            # Fill the remaining gaps in the inputs using the training-fold means only
            train_means = X_train.mean()
            X_train, X_val = X_train.fillna(train_means), X_val.fillna(train_means)
            X_test = X_test.fillna(train_means)
            
            fit_model(X_train, y_train, X_val, y_val, X_test)
        
        mean_values = np.array(predictions).mean(axis = 0)
        imputed_data = dataset[feat]
        imputed_data.iloc[missing_values] = mean_values.ravel()
        data_completed = pd.concat([data_completed, imputed_data],axis = 1)
    
    else:
        data_completed = pd.concat([data_completed, dataset[feat]],axis = 1)        

---

# Model Submission
Looks up each `row-col` entry of the sample submission in the completed data and writes the result to `nn_submission.csv`.

In [None]:
nn_submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='row-col')

In [None]:
%%time
data_completed.shape

In [None]:
%%time
for i in tqdm(nn_submission.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    nn_submission.loc[i, 'value'] = data_completed.loc[row, col]

nn_submission.to_csv("nn_submission.csv")

In [None]:
nn_submission

---