# Missing Data Imputation 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
from sklearn.preprocessing import scale
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

import pandas as pd
import os

## Data loading

The dataset that we will use is named `data-lab2.csv`.

* Import the dataset using pandas
* Describe the available variables statistically
* Print the overall % of missing values in the dataset
* Print the % of missing data per variable
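
As a hint for the last two items, here is a minimal sketch of how missing-value percentages can be computed with pandas. It runs on a small made-up frame rather than on `data-lab2.csv`; the name `toy` and its columns and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "earnings": [10.0, np.nan, 30.0, np.nan],
    "age": [25.0, 40.0, np.nan, 31.0],
    "sex": [1, 2, 1, 2],
})

# Percentage of missing cells over the whole table.
total_missing_pct = 100 * toy.isna().to_numpy().mean()

# Percentage of missing values in each column.
missing_pct_by_column = 100 * toy.isna().mean()

print(f"Total missing: {total_missing_pct:.1f}%")
print(missing_pct_by_column.round(1))
```

The same two expressions should apply unchanged to `df` once it is loaded.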

In [None]:
df = pd.read_csv('data-lab2.csv')

In [None]:
df.head()

In [None]:
df.describe()
# What do you notice? (Hint: look at the types of the variables.)

In [None]:
# Your response

## Simple imputation strategies 

Let us consider the case of imputing the **earnings** variable in this dataset.

* Plot the distribution of the earnings variable with a histogram
* Describe what the function below is doing
* Use the function to impute the missing values of the earnings variable
* Compare the distributions of the newly imputed variable and the previous one

In [None]:
from random import sample

def random_imp(a):
    missing = a.isna()
    n_miss = int(missing.sum())
    obs = a[~missing]  # `~` is the idiomatic way to negate a boolean mask
    imputed = np.array(a)
    impute_values = sample(list(obs), n_miss)
    imputed[missing] = impute_values
    return imputed

* What are the problems with this approach?
* Propose two simple alternatives and compare the obtained results
* Create a function that encapsulates all three alternatives for imputing a variable. The choice of strategy is a parameter of the function.
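
One possible shape for such a function is sketched below. The name `impute_simple`, the choice of mean and median as the two alternatives, and the `seed` parameter are assumptions, not part of the lab statement. Note also that the random branch draws with replacement, which differs from `random_imp` above.

```python
import numpy as np
import pandas as pd

def impute_simple(a, strategy="mean", seed=None):
    """Return a copy of `a` with NaNs filled according to `strategy`."""
    a = pd.Series(a, dtype=float)
    missing = a.isna()
    obs = a[~missing]
    out = a.copy()
    if strategy == "mean":
        out[missing] = obs.mean()
    elif strategy == "median":
        out[missing] = obs.median()
    elif strategy == "random":
        # Draw observed values with replacement, so this also works when
        # there are more missing values than observed ones.
        rng = np.random.default_rng(seed)
        out[missing] = rng.choice(obs.to_numpy(), size=int(missing.sum()), replace=True)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return out

# Toy series (illustrative values only).
toy = pd.Series([1.0, np.nan, 3.0, 8.0, np.nan])
print(impute_simple(toy, "mean").tolist())    # NaNs replaced by the observed mean
print(impute_simple(toy, "median").tolist())  # NaNs replaced by the observed median
```

Mean and median imputation both fill every gap with a single value, so comparing histograms before and after imputation is a quick way to see how each strategy distorts the distribution.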

In [None]:
# Your answer 

## Trying some logical rules 

* Create a column in the data frame that indicates missingness in the earnings variable
* Check the correlation between this new variable and the other variables in the dataset
* Can you propose a simple logical rule to impute some of the missing values based on the workmos variable?
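
The steps above could be sketched as follows on made-up data. The values are invented, and the interpretation of workmos as "months worked" (hence the rule "0 months worked implies 0 earnings") is an assumption to check against the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; values and the meaning of `workmos` are assumptions.
toy = pd.DataFrame({
    "earnings": [np.nan, 20.0, np.nan, 35.0, np.nan, 50.0],
    "workmos": [0, 12, 0, 12, 6, 12],
    "age": [70, 35, 68, 41, 29, 50],
})

# 1 where earnings is missing, 0 otherwise.
toy["earnings_missing"] = toy["earnings"].isna().astype(int)

# Correlation of the indicator with the other variables.
corr = toy.drop(columns="earnings").corr()["earnings_missing"].drop("earnings_missing")
print(corr.round(2))

# Hypothetical rule: a respondent who worked 0 months earned nothing.
rule = toy["earnings_missing"].eq(1) & toy["workmos"].eq(0)
toy.loc[rule, "earnings"] = 0.0
print(toy)
```

A rule of this kind only fills the rows it covers; the remaining missing values still need one of the other strategies.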

In [None]:
# Your answer 

## Using Matching to impute missing values (deterministic)

* Implement a function that
  * for each observation with NaN, retrieves the top-k closest neighbors based on some distance
  * imputes the mean or the median earnings over the top-k closest neighbors

Parameters: k (default=3), distance (default=euclidean), strategy (default=mean)

* Impute the earnings variable and compare the results with the previous imputation strategies
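
A minimal sketch of matching-based imputation is given below. It assumes that all other columns are numeric, fully observed, and non-constant, and it hard-codes the Euclidean distance on standardised features instead of exposing a `distance` parameter. The function name `knn_impute` and the toy data are illustrative.

```python
import numpy as np
import pandas as pd

def knn_impute(df, target, k=3, strategy="mean"):
    """Fill NaNs in df[target] from the k nearest rows where target is observed."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    agg = np.mean if strategy == "mean" else np.median

    features = df.drop(columns=target)
    # Standardise so that no single feature dominates the distance.
    z = ((features - features.mean()) / features.std(ddof=0)).to_numpy()
    y = df[target].to_numpy(dtype=float)
    missing = np.isnan(y)
    donors_z, donors_y = z[~missing], y[~missing]

    out = y.copy()
    for i in np.flatnonzero(missing):
        # Euclidean distance from row i to every donor row.
        dist = np.sqrt(((donors_z - z[i]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        out[i] = agg(donors_y[nearest])
    return pd.Series(out, index=df.index, name=target)

# Toy data: two well-separated age groups, one missing value in each.
toy = pd.DataFrame({
    "age": [20, 21, 22, 60, 61, 62, 21, 61],
    "earnings": [10.0, 12.0, 14.0, 50.0, 52.0, 54.0, np.nan, np.nan],
})
filled = knn_impute(toy, "earnings", k=3)
print(filled.tolist())
```

Because each imputed value is an average over neighbors, the result is deterministic and tends to shrink the variance of the imputed variable.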

In [None]:
# Your answer

## Using Linear Regression to impute missing values (deterministic)

Recall: the objective of linear regression is to model the relationship between a continuous variable $Y$ and a set of explanatory variables $X_1, \cdots, X_d$. The linear model assumes a relationship of the form 
$Y = \theta_0 + \theta_1 X_1 + \cdots + \theta_d X_d$, and the objective is to estimate the coefficients $\theta_0, \cdots, \theta_d$.

One way of addressing the problem of missing data imputation is to cast the task as a linear regression problem. Here the target variable $Y$ is earnings and the $X$s are the other variables in the dataset.

* Learn a linear regression model to predict earnings based on the other variables 
* Comment on the results: values of the coefficients, $R^2$, etc.
* Predict the earnings values for all observations of the dataset
* Impute the predictions where earnings is missing
* Compare the final distribution with those of the previous strategies
* Include the strategy in your previous function
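
The steps above can be sketched with scikit-learn's `LinearRegression`. The sketch assumes the predictor columns are numeric and fully observed; the function name `regression_impute` and the toy data (built so that earnings is an exact linear function of age) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, target):
    """Fit on rows where `target` is observed, then fill the missing rows."""
    features = df.drop(columns=target)
    observed = df[target].notna()

    model = LinearRegression()
    model.fit(features[observed], df.loc[observed, target])
    r2 = model.score(features[observed], df.loc[observed, target])

    # Predict for every row, but only use predictions where target is missing.
    predictions = model.predict(features)
    imputed = df[target].copy()
    imputed[~observed] = predictions[~observed.to_numpy()]
    return imputed, model, r2

# Toy data where earnings = 2 * age + 5 on the observed rows.
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0, 60.0],
    "earnings": [45.0, 65.0, np.nan, 105.0, np.nan],
})
imputed, model, r2 = regression_impute(toy, "earnings")
print(imputed.tolist())
print(model.intercept_, model.coef_, r2)
```

Imputed values lie exactly on the fitted line, so this deterministic approach understates the spread of the real earnings.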

In [None]:
## Your answer

## Multiple variables imputation 

Let's move on to the case where multiple variables have missing values. 
To proceed, we will introduce missing values in the age variable. 

* Write a function that causes approximately 30% of the values of a variable x to be missing. Design the mechanism to be missing at random but not missing completely at random: the probability that age is missing should depend on some variable y.
* Apply the function to the age variable with a dependence on the sex variable.
* Explore the iterative imputer of scikit-learn.
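
Below is a minimal sketch of one such mechanism followed by a call to `IterativeImputer`. The function name `make_mar`, the coding of sex as 1/2, the per-group probabilities (0.2 and 0.4, averaging about 0.3 for balanced groups), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def make_mar(df, x, y, p_by_group, seed=0):
    """Set df[x] to NaN with a probability that depends on the value of df[y]."""
    rng = np.random.default_rng(seed)
    p = df[y].map(p_by_group).to_numpy()
    drop = rng.random(len(df)) < p
    out = df.copy()
    out.loc[drop, x] = np.nan
    return out

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    "sex": rng.integers(1, 3, size=n),   # hypothetical coding: 1 or 2
    "age": rng.normal(40, 10, size=n),
})
toy["earnings"] = 1.5 * toy["age"] + rng.normal(0, 5, size=n)

# Missingness in age depends on sex only.
toy_mar = make_mar(toy, "age", "sex", {1: 0.2, 2: 0.4})
print(toy_mar["age"].isna().groupby(toy_mar["sex"]).mean())

# Each variable with missing values is modelled from the other variables.
imputer = IterativeImputer(random_state=0)
completed = pd.DataFrame(imputer.fit_transform(toy_mar), columns=toy_mar.columns)
print(completed.isna().sum())
```

On the real dataset, where earnings also has missing values, the imputer would cycle over both variables in turn.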

In [None]:
## Your answer