<h1>TPS Jun 2022: Missing data imputation</h1>

<h2> Data description </h2>

For this challenge, we are given (simulated) manufacturing control data that contains missing values due to electronic errors. 

<h3>Goal </h3>

* Predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)
    
<h3> Files </h3>

* data.csv - the file includes normalized continuous data and categorical data.

* sample_submission.csv - a sample submission file in the correct format; the row-col indicator corresponds to the row and column of each missing value in data.csv

    
<h3> Evaluation  </h3>

* Submissions are scored on the root mean squared error (RMSE) 
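
As a quick illustration of the metric, here is a minimal NumPy sketch of RMSE. The helper name `rmse_np` and the toy arrays are made up for this example; they are not part of the competition data or its scoring code.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy values (hypothetical): only the last prediction is off, by 2
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
score = rmse_np(y_true, y_pred)  # sqrt((0 + 0 + 4) / 3)
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.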

# Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' ## hide tf warnings
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, BatchNormalization, Dropout
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping


# Loading data

In [None]:
data = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')
submission = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv',
                         index_col='row-col')

# Info

In [None]:
data.info()

### 🔵 Only the float-type features contain NaN values

# The percentage of missing values

In [None]:
def get_missings(df):
    labels, values = [], []
    for column in df.columns:
        n_missing = df[column].isnull().sum()
        if n_missing:
            labels.append(column)
            values.append(n_missing / len(df) * 100)
    # Build the table once, after the loop, so it exists even if nothing is missing
    missings = pd.DataFrame({'Column': labels,
                             'Missing(Percent)': values}).sort_values(by='Missing(Percent)',
                                                                      ascending=False)
    return missings
    
get_missings(data).head(20)

### 🔵 The percentage of missing values is almost the same for every feature
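
The same table can also be obtained without an explicit loop. The sketch below runs on a small made-up DataFrame (`toy` is hypothetical, not the competition data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" has 1 of 4 missing, "b" has 2 of 4, "c" has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

# Fraction of NaN per column, expressed as a percentage
missing_pct = toy.isna().mean().mul(100)

# Keep only columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
```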


## Make a list of feature names containing NaN

In [None]:
features_with_Nan = get_missings(data).Column.to_list()

# Number of features containing NaN values
print('There are', len(features_with_Nan), 'features with NaN values')

# Creating a function for feature selection
This function takes a column name and returns the features most correlated with it (up to 32: the slice `[1:33]` skips the first entry, which is the column itself). Later I will use it inside a large loop to select the features for training each model. This reduces the runtime and, since we keep only related features, it may also improve the score.

In [None]:
def high_correlated(col):
    
    return data.corrwith(data[col]).abs().sort_values(ascending=False)[1:33].index.to_list()
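
To see what this kind of selection does, here is a small sketch on synthetic data. The helper `top_correlated`, the frame `toy`, and its column names are all hypothetical; unlike `high_correlated`, the sketch drops the target column explicitly instead of relying on the `[1:]` slice.

```python
import numpy as np
import pandas as pd

def top_correlated(df, col, k):
    """Return the k columns with the largest absolute Pearson correlation to `col`."""
    corr = df.drop(columns=[col]).corrwith(df[col]).abs()
    return corr.sort_values(ascending=False).head(k).index.to_list()

# Synthetic data: "strong" tracks the target closely, "weak" loosely, "noise" not at all
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "target": x,
    "strong": x + 0.1 * rng.normal(size=500),
    "weak": x + 2.0 * rng.normal(size=500),
    "noise": rng.normal(size=500),
})

selected = top_correlated(toy, "target", k=2)
```

Note that Pearson correlation only captures linear relationships, so a feature that is related to the target in a nonlinear way could be ranked low by this criterion.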

# Modeling

## Define RMSE loss function

In [None]:
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

## Implementing NN Model
I will run this function for each column using a loop

In [None]:
def nn_model(X_train,y_train,X_val,y_val,X_test):
    
    # Random seed
    tf.random.set_seed(42)
    
    # Create a sequential model
    model= Sequential([
    tf.keras.layers.Input(shape = X_train.shape[1:]),
    Dense(512, activation='swish'),
    BatchNormalization(),
    Dense(256, activation='swish'),
    BatchNormalization(),
    Dense(128, activation='swish'),
    BatchNormalization(),
    Dense(64, activation='swish'),
    BatchNormalization(),
    Dense(32, activation='swish'),
    BatchNormalization(),
    Dense(16, activation='swish'),
    BatchNormalization(),
    Dense(1,   activation = 'linear')
    ])
    
    # Compile the model
    model.compile(
    loss=rmse,
    optimizer=Adam(learning_rate = 0.01),
    metrics=[rmse]
    )
        
    # Define callbacks
    lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.5, patience = 3, verbose = 0)
    es = EarlyStopping(monitor = 'val_loss',patience = 12, verbose = 0, mode = 'min', restore_best_weights = True)
    callbacks=[lr,es]
    
    # Fit the model
    history = model.fit(X_train, y_train,
                       epochs=33,
                       validation_data=(X_val, y_val),
                       batch_size= 4094,
                       shuffle = True,
                       callbacks = callbacks,
                       verbose=0)
    
    return model,history

## Iterate over columns and run the model

* The outer loop goes through only the 55 columns that have missing values.
* To impute a column, only the features most correlated with it (up to 32, chosen by `high_correlated`) are selected to train the model.
* The inner loop splits the non-missing rows into 5 folds; for each fold a model is trained on the other four folds, validated on the held-out fold, and used to predict the missing rows.
* The mean of the per-fold predictions becomes the final imputed value for the current column.
* We can also save validation loss of each column in a list for later analysis.
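
The steps above can be sketched on toy data as follows. This is only an illustration of the fold-averaging idea: a plain `LinearRegression` stands in for the neural network, and the frame `toy` with columns `f1`, `f2`, `target` is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical toy data: target is a noisy linear function of two features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
toy["target"] = 2.0 * toy["f1"] - toy["f2"] + 0.01 * rng.normal(size=n)
toy.loc[toy.index[::10], "target"] = np.nan  # hide every 10th value

not_null = toy["target"].notna()
train, test = toy[not_null], toy[~not_null]
features = ["f1", "f2"]

fold_preds = []
for train_idx, val_idx in KFold(n_splits=5).split(train):
    # Stand-in model; val_idx would serve for early stopping with a neural network
    model = LinearRegression()
    model.fit(train.iloc[train_idx][features], train.iloc[train_idx]["target"])
    fold_preds.append(model.predict(test[features]))

# Average the five per-fold predictions and write them into the missing slots
imputed = toy["target"].copy()
imputed.loc[~not_null] = np.mean(fold_preds, axis=0)
```

Averaging over folds acts as a small ensemble: each model sees a slightly different subset of the rows, which tends to smooth out the variance of any single fit.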

In [None]:
data_imputed = pd.DataFrame() 
loss_per_feature={}

# Iterate over the columns that contain Nan values
for col in tqdm(data[features_with_Nan].columns):
    
    predictions=[]
    validation_loss=[]
    
    # Mask to access not_null part of the current column
    not_null = ~data[col].isnull()
    
    # Train dataset (includes non-null part of current column)
    train = data.loc[not_null]
    
    # Test dataset (includes null part of current column)
    test = data.loc[~not_null]
    
    # Feature selection
    selected_features=[n for n in high_correlated(col) if n not in ['row_id', col]]
    # Cross validation type
    kf = KFold(n_splits = 5)
    
    # Splitting data to train and validation
    for fold, (train_idx, val_idx) in enumerate(kf.split(train[selected_features])):

        # Keep only the selected (most correlated) features
        X_train, X_val = train.iloc[train_idx][selected_features], train.iloc[val_idx][selected_features]
        y_train, y_val = train.iloc[train_idx][col], train.iloc[val_idx][col]
        X_test = test[selected_features]
        
        # Fill NaN with the median
        X_train, X_val = X_train.fillna(X_train.median()), X_val.fillna(X_val.median())
        X_test = X_test.fillna(X_test.median())
        
        # Standard Scaling 
        scaler = StandardScaler()
        X_train=scaler.fit_transform(X_train)
        X_val=scaler.transform(X_val)
        X_test=scaler.transform(X_test)
        
        # Running NN model 
        model = nn_model(X_train,y_train,X_val,y_val,X_test)
        
        # Predict the missing rows with this fold's model
        y_preds = model[0].predict(X_test)
        
        # Add y_preds to a list
        predictions.append(y_preds)
        
        # Save loss for current fold
        validation_loss.append(model[1].history["val_loss"][-1])

    # Calculate the mean of the per-fold predictions
    mean_values = np.array(predictions).mean(axis = 0)
    
    # Save mean-loss for current feature
    loss_per_feature[col] = np.mean(validation_loss)
    
    # Specifying column to impute
    imputed_feature = data[col].copy()
    
    # Filling missing values
    imputed_feature.loc[~not_null] =  mean_values.ravel()
    
    # Concatenate imputed columns
    data_imputed = pd.concat([data_imputed, imputed_feature],axis = 1)

# Replace columns with imputed columns 
data[features_with_Nan] = data_imputed

# Validation loss per column

Knowing which columns have been imputed with low accuracy, we can look for another solution for them, but not now... maybe in the next update 🤔

In [None]:
loss_df=pd.DataFrame(loss_per_feature,index=['Validation_RMSE']).T.sort_values(by='Validation_RMSE',
                                                                               ascending=False)

loss_df

# Making submission

In [None]:
for i in submission.index: 
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    submission.loc[i, 'value'] = data.loc[row, col]

submission.to_csv("submission.csv")
submission
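
The row-by-row loop above can be slow on a large submission file. Below is a hedged sketch of a vectorized lookup on toy frames (`data_toy` and `sub_toy` are hypothetical). It assumes every column is numeric, that the frame has a default 0..n-1 index so row labels equal row positions, and that each `row-col` key has the form `<row>-<column name>` with no hyphen in the row part.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the imputed data and the sample submission
data_toy = pd.DataFrame({"row_id": [0, 1, 2],
                         "F_1_0": [0.1, 0.2, 0.3],
                         "F_1_1": [1.1, 1.2, 1.3]})
sub_toy = pd.DataFrame({"value": [0.0, 0.0]},
                       index=pd.Index(["0-F_1_1", "2-F_1_0"], name="row-col"))

# Split each key once, at the first hyphen, into row number and column name
parts = sub_toy.index.to_series().str.split("-", n=1, expand=True)
rows = parts[0].astype(int).to_numpy()
cols = data_toy.columns.get_indexer(parts[1])  # positions of the column names

# Fancy indexing picks one (row, column) cell per submission entry
sub_toy["value"] = data_toy.to_numpy()[rows, cols]
```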