----

$$ \text{MISSING DATA} $$

---

One of the most common problems when working with data and predictive models is dealing with missing values. In this notebook, we handle the missing values in the training dataset and evaluate how good each imputation method is. The evaluation is done by simulating missing data on samples whose values are known, so that we can compare the imputed values against the actual data.

    In this notebook we will:

    1. Analyze the subset with the missing data
        - Is it any different from the rest of the data?
    2. Create training and validation subsets
    3. Use different methods to impute data
        - Simple imputation
        - Unsupervised Learning
        - Supervised Learning
    4. Evaluate the results

In [1]:

import pandas as pd
import numpy as np

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)


In [2]:

df_train = pd.read_csv('../../data/train-validation/training-data.csv', index_col = 0)

# Remove Label
train_data = df_train.drop(columns = ['dano_na_plantacao']).iloc[:,1:]


In [3]:
n_missing = train_data[train_data.Semanas_Utilizando.isna()].shape[0]
total = train_data.shape[0]
n_perc = n_missing/total * 100

print(f'Dataset size:\t {total}')
print(f'Total missing:\t {n_missing}')
print(f'Missing data:\t {n_perc:.3} %')

Dataset size:	 56000
Total missing:	 5690
Missing data:	 10.2 %


## Missing data


There are many approaches to handling missing data. In this case, the feature with missing values is continuous, so these methods are considered:

- Deletion - This is not ideal, since any new sample with a missing value would have to be discarded at prediction time.

- Imputation
    - Simple Imputation - Mean, median and most frequent value.
    - Machine Learning - Both supervised and unsupervised learning can be used to impute missing data.

Before treating the missing values, let's look at the samples with missing data and compare them with the rest of the data. This tells us whether the samples with missing data differ systematically from the rest. If they do, simple imputation is not recommended.

In [4]:
print('\n\n\t\t\t\t\tSubset without missing Data')
display(train_data[~train_data.Semanas_Utilizando.isna()].describe())
print('\n\n\t\t\t\t\tSubset with missing Data')
display(train_data[train_data.Semanas_Utilizando.isna()].describe())



					Subset without missing Data


| Statistic | Estimativa_de_Insetos | Tipo_de_Cultivo | Tipo_de_Solo | Categoria_Pesticida | Doses_Semana | Semanas_Utilizando | Semanas_Sem_Uso | Temporada |
|---|---|---|---|---|---|---|---|---|
| count | 50310.0 | 50310.0 | 50310.0 | 50310.0 | 50310.0 | 50310.0 | 50310.0 | 50310.0 |
| mean | 1399.919579 | 0.283542 | 0.457364 | 2.270443 | 25.862353 | 28.651719 | 9.512025 | 1.896263 |
| std | 851.70579 | 0.450722 | 0.498184 | 0.465177 | 15.584169 | 12.42961 | 9.903693 | 0.703317 |
| min | 150.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 731.0 | 0.0 | 0.0 | 2.0 | 15.0 | 20.0 | 0.0 | 1.0 |
| 50% | 1212.0 | 0.0 | 0.0 | 2.0 | 20.0 | 28.0 | 7.0 | 2.0 |
| 75% | 1898.0 | 1.0 | 1.0 | 3.0 | 40.0 | 37.0 | 16.0 | 2.0 |
| max | 4097.0 | 1.0 | 1.0 | 3.0 | 95.0 | 67.0 | 50.0 | 3.0 |




					Subset with missing Data


| Statistic | Estimativa_de_Insetos | Tipo_de_Cultivo | Tipo_de_Solo | Categoria_Pesticida | Doses_Semana | Semanas_Utilizando | Semanas_Sem_Uso | Temporada |
|---|---|---|---|---|---|---|---|---|
| count | 5690.0 | 5690.0 | 5690.0 | 5690.0 | 5690.0 | 0.0 | 5690.0 | 5690.0 |
| mean | 1386.976801 | 0.288576 | 0.447276 | 2.25993 | 25.831283 | NaN | 9.547803 | 1.900176 |
| std | 834.701574 | 0.45314 | 0.497256 | 0.463573 | 15.589252 | NaN | 9.858941 | 0.702468 |
| min | 150.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | 0.0 | 1.0 |
| 25% | 731.0 | 0.0 | 0.0 | 2.0 | 15.0 | NaN | 0.0 | 1.0 |
| 50% | 1212.0 | 0.0 | 0.0 | 2.0 | 20.0 | NaN | 7.0 | 2.0 |
| 75% | 1898.0 | 1.0 | 1.0 | 3.0 | 40.0 | NaN | 16.0 | 2.0 |
| max | 4097.0 | 1.0 | 1.0 | 3.0 | 95.0 | NaN | 50.0 | 3.0 |


There are no large discrepancies between the two subsets, which means we can try to use information from the complete samples to fill in the missing values. There are different imputation methods, and here we evaluate which one fits best.
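One way to put a number on "no large discrepancies" is a standardized mean difference per feature between the two subsets. The sketch below is illustrative only: the toy DataFrame and the helper name `standardized_mean_diff` are assumptions and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, flag_col):
    """Per-feature (mean_missing - mean_present) / pooled std.

    Hypothetical helper: the NaNs in `flag_col` define the two subsets.
    """
    is_missing = df[flag_col].isna()
    features = df.drop(columns=[flag_col])
    with_na, without_na = features[is_missing], features[~is_missing]
    pooled_std = np.sqrt((with_na.var() + without_na.var()) / 2)
    return (with_na.mean() - without_na.mean()) / pooled_std

# Toy data (made up): x1 is distributed identically in both subsets,
# x2 is shifted by 10 in the rows where `target` is missing
toy = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'x2': [10.0, 20.0, 30.0, 40.0, 20.0, 30.0, 40.0, 50.0],
    'target': [5.0, 6.0, 7.0, 8.0, np.nan, np.nan, np.nan, np.nan],
})
smd = standardized_mean_diff(toy, 'target')
print(smd)
```

Values close to zero for every feature would support the reading of the two `describe()` tables above. Applied to this notebook, the call would presumably be `standardized_mean_diff(train_data, 'Semanas_Utilizando')`.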

First, we drop all samples where `Semanas_Utilizando` is missing, which gives a new DataFrame with no missing data: `df_full`. We then set aside part of it to evaluate how good each method is. The evaluation sample is 10% of the complete samples (5031 of 50310 rows).

## Train-Validation Subsets

In [5]:
# Drop missing data
df_full = train_data[~train_data.Semanas_Utilizando.isna()]

na_train, na_eval = train_test_split(df_full.copy(), test_size=0.1)
na_eval_y = na_eval.Semanas_Utilizando.values.copy()
na_eval_df = na_eval.drop(columns = ['Semanas_Utilizando'])
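The idea of hiding known values and keeping the truth for scoring can be sketched on a single toy column. This is a minimal illustration, not the notebook's code: the values are made up, and the fixed seed is an assumption added for reproducibility (the split above does not set one).

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only to make the sketch repeatable

# Toy complete column (made-up values)
values = np.arange(20, dtype=float)

# Hide 10% of the known values and keep the truth for later scoring
n_hidden = len(values) // 10
hidden_idx = rng.choice(len(values), size=n_hidden, replace=False)
truth = values[hidden_idx].copy()

masked = values.copy()
masked[hidden_idx] = np.nan

print(f'hidden positions: {sorted(hidden_idx.tolist())}')
print(f'NaNs in masked column: {int(np.isnan(masked).sum())}')
```

Any imputation method can then be scored by comparing its output at `hidden_idx` against `truth`. Passing `random_state` to `train_test_split` would make the notebook's own split repeatable in the same way.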

With an evaluation function it is possible to measure how well we are filling in the missing values. Here we use the Mean Squared Error (MSE): the smaller the value, the better our method is.


$$MSE = \frac{1}{n} \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$


In [6]:
# MSE cost function: mean of the squared differences between actual and imputed values
def mse(A, B):
    return np.square(np.asarray(A) - np.asarray(B)).mean(axis=0)

mse_score = {'Simple Imputing': {}, 'Supervised Learning':{}, 'Unsupervised Learning':{}}
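As a small worked example of the MSE formula, the sketch below scores three made-up imputed values against three made-up actual values. The numbers are illustrative only.

```python
import numpy as np

def mse(actual, imputed):
    """Mean of the squared differences between actual and imputed values."""
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.square(actual - imputed).mean()

actual = [10, 20, 30]
imputed = [12, 18, 33]

# Squared errors: (-2)^2 = 4, 2^2 = 4, (-3)^2 = 9, so MSE = 17 / 3
print(mse(actual, imputed))
```

Because the errors are squared, a single imputed value that is far off weighs more than several that are slightly off.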

## Data Imputation

### Simple Imputation

In [7]:
ones = np.ones(na_eval_y.shape)

# Most Frequent Value
mode = na_train.Semanas_Utilizando.mode()[0] * ones
mode_mse = mse(na_eval_y, mode)
mse_score['Simple Imputing']['Mode'] = mode_mse

# Mean Value
mean = na_train.Semanas_Utilizando.mean() * ones 
mean_mse =  mse(na_eval_y, mean)
mse_score['Simple Imputing']['Mean'] = mean_mse

# Median Value
median = na_train.Semanas_Utilizando.median() * ones
median_mse = mse(na_eval_y, median)
mse_score['Simple Imputing']['Median'] = median_mse
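Among constant fills, the mean is the value that minimizes MSE on the data it is computed from, so with MSE as the metric the mean is expected to score best of the three whenever the evaluation subset resembles the training subset. The sketch below shows this on a made-up, deliberately skewed column. The helper name `constant_mse` is hypothetical.

```python
import numpy as np

# Toy column (made-up values), skewed on purpose
y = np.array([1.0, 1.0, 2.0, 3.0, 13.0])

def constant_mse(y, c):
    """MSE obtained when every value is imputed with the constant c."""
    return np.mean((y - c) ** 2)

mean_c = y.mean()        # 4.0
median_c = np.median(y)  # 2.0
mode_c = 1.0             # most frequent value in this toy column

for name, c in [('mean', mean_c), ('median', median_c), ('mode', mode_c)]:
    print(f'{name:>6}: constant={c:.1f}  MSE={constant_mse(y, c):.2f}')
```

With a different metric, such as mean absolute error, the median would be the better constant, so the ranking of the simple methods depends on the chosen cost function.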



### Unsupervised learning
Here we can use [Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html) to find the n closest samples for each sample with missing data and use the mean of their `Semanas_Utilizando` values as the imputed value.


To fit the Nearest Neighbors model we need to remove the column we want to fill. This way we can query it (`kneighbors()` returns the closest samples) with the subset that lacks that column.

In [8]:
na_y = na_train['Semanas_Utilizando']
na_x = na_train.drop(columns = ['Semanas_Utilizando'])
nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree').fit(na_x)

In [9]:
_, neighbors = nbrs.kneighbors(na_eval_df)

display(neighbors.shape)

(5031, 10)

The `kneighbors` output is a matrix containing the index positions of the n closest neighbors for each sample, so the expected output shape is (n_samples, n_neighbors). Now we iterate over each sample and take the mean of the missing column across its n neighbors.

In [10]:

# Column Index
col_idx = na_train.columns.get_loc('Semanas_Utilizando')

knn_input = []
# For each sample we have n_neighbors
# Loop all samples
for sample in range(neighbors.shape[0]):
    
    # Get Sample from data set we trained on, but here with the desired column
    # desired column index -> Semanas_Utilizando -> col_idx
    # loop on all closest n_neighbors and get mean value
    sample_mean = np.mean([na_train.iloc[i, col_idx] 
                           for i in neighbors[sample,:] ])
    knn_input.append(sample_mean)

# Evaluate score
knn_mse =  mse(na_eval_y, knn_input)
mse_score['Unsupervised Learning']['Nearest Neighbors'] = knn_mse
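The per-sample loop can also be written as a single NumPy indexing step. The sketch below uses made-up stand-ins for the training target and the neighbor matrix, and checks that the vectorized form matches an explicit loop.

```python
import numpy as np

# Toy stand-ins (made up): target values of 5 training rows, and neighbor
# positions for 3 evaluation rows with n_neighbors=2
train_target = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
neighbors = np.array([[0, 1],
                      [2, 4],
                      [3, 3]])

# Fancy indexing gathers an (n_samples, n_neighbors) array of target values;
# averaging over axis 1 gives one imputed value per evaluation row
knn_input_vec = train_target[neighbors].mean(axis=1)

# Same result with an explicit loop over the neighbor rows
knn_input_loop = [np.mean([train_target[i] for i in row]) for row in neighbors]

print(knn_input_vec)
```

In this notebook the equivalent would presumably be `na_y.to_numpy()[neighbors].mean(axis=1)`, since `kneighbors` returns positions in the data the model was fitted on.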


### Supervised learning

[Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) - here we train a simple supervised regressor to predict the value of the missing feature from the other features.

In [11]:

lr = LinearRegression().fit(na_x, na_y)
lr_input = lr.predict(na_eval_df)

lr_mse = mse(na_eval_y, lr_input)
mse_score['Supervised Learning']['Linear Regression'] = lr_mse


[Iterative Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html?highlight=iterativeimputer#sklearn.impute.IterativeImputer) - Like the linear regressor, this is also a supervised approach: iteratively, for each feature with missing values, a regression is fit on `(X, y)`, where `y` is that feature and `X` holds the other features.

In [12]:

# Populate dataframe with nans
na_eval['Semanas_Utilizando'] = np.nan

# Get index for later evaluation
eval_index = na_eval.index

# Concatenate with training data
na_concat = pd.concat([na_train, na_eval], axis = 0)


In [13]:

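# Fit the imputer, then keep only the values it filled in
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(na_concat)
# The boolean mask selects the imputed cells in row order, matching na_eval_y
imp_input = imp.transform(na_concat)[na_concat.isna().to_numpy()]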

# Evaluate score
imp_mse =  mse(na_eval_y, imp_input)
mse_score['Supervised Learning']['Iterative Imputer'] = imp_mse
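The core step of regression-based imputation can be sketched with plain NumPy: fit a regression on the rows where the target feature is observed, then predict it for the rows where it is missing. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: it uses ordinary least squares on a made-up matrix and handles a single feature, whereas `IterativeImputer` uses `BayesianRidge` by default and cycles over every feature with missing values.

```python
import numpy as np

# Toy matrix (made up): the last column equals 2*x1 + x2, with two entries hidden
X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 1.0, 5.0],
    [3.0, 0.0, 6.0],
    [4.0, 1.0, 9.0],
    [5.0, 0.0, np.nan],
    [6.0, 1.0, np.nan],
])
target_col = 2
missing = np.isnan(X[:, target_col])

# Design matrix: the other features plus an intercept column
features = np.delete(X, target_col, axis=1)
design = np.column_stack([features, np.ones(len(X))])

# Fit ordinary least squares on the rows where the target is observed
coef, *_ = np.linalg.lstsq(design[~missing], X[~missing, target_col], rcond=None)

# Fill the missing entries with the regression predictions
X_filled = X.copy()
X_filled[missing, target_col] = design[missing] @ coef

print(X_filled[missing, target_col])
```

Since only `Semanas_Utilizando` has missing values here, the iterative imputer has a single feature to model, so its score would be expected to land close to the linear regression result.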


## Results

In [14]:

from plotly import graph_objects as go
fig = go.Figure()

for key in mse_score.keys():
    fig.add_trace(go.Bar(
                x=list(mse_score[key].values()),
                y=list(mse_score[key].keys()),
                orientation='h',
                name = key
                        ),
                 )
fig.update_layout(
    title="Comparison of missing-value imputation methods",
    xaxis_title="Mean Squared Error")
fig.show()
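Alongside the chart, the nested score dictionary can be flattened and sorted to list the methods from lowest to highest MSE. The sketch below uses placeholder group names, method names and scores that only mirror the `{group: {method: mse}}` layout of `mse_score`. They are not results from this notebook.

```python
# Placeholder scores only: made up to show the dict layout,
# not the results of this notebook
example_scores = {
    'Group A': {'method_1': 3.0, 'method_2': 2.0},
    'Group B': {'method_3': 1.5},
    'Group C': {'method_4': 1.0},
}

# Flatten {group: {method: mse}} into (mse, group, method) tuples, sorted ascending
ranking = sorted(
    (score, group, method)
    for group, methods in example_scores.items()
    for method, score in methods.items()
)

for score, group, method in ranking:
    print(f'{method:<10} {group:<10} MSE={score:.2f}')
```

Replacing `example_scores` with `mse_score` would presumably give the same ordering that the bar chart shows visually.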