# Missing values
## Plan

1. Create datasets with NAs.
2. Test hypotheses on the data with NAs and without them.
3. Imputation methods.

**Data**: a sample of the TIMSS data for 8th graders in the United Arab Emirates that we used in Practice 2.

In [None]:
import pandas as pd

data = pd.read_csv('timss_uae_sample.csv')

In [None]:
data['education'].value_counts()

In [None]:
data['education'] = data['education'].replace({'University or Higher': 3, "Don't Know": 4, 
                                               "Secondary or lower": 0, "Post-Secondary": 1})

In [None]:
data.head()

## Adding NAs

In [None]:
import numpy as np

na_5_index = np.random.choice(data.index, int(len(data) * 0.05), replace=False)
na_20_index = np.random.choice(data.index, int(len(data) * 0.2), replace=False)
na_40_index = np.random.choice(data.index, int(len(data) * 0.4), replace=False)

In [None]:
data_5_na = data.copy()
data_20_na = data.copy()
data_40_na = data.copy()

data_5_na.loc[na_5_index, 'interest'] = np.nan
data_20_na.loc[na_20_index, 'interest'] = np.nan
data_40_na.loc[na_40_index, 'interest'] = np.nan
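
The draws above change on every run because no seed is set. A minimal sketch of a seeded version, run on a synthetic stand-in for the TIMSS sample (the helper `mask_at_random`, the toy values and the seed are assumptions for illustration, not part of the practice data):

```python
import numpy as np
import pandas as pd

# Fixed seed so the same rows are masked on every run (the seed value is arbitrary).
rng = np.random.default_rng(42)

# Synthetic stand-in for the real sample; only the column names mirror the notebook.
toy = pd.DataFrame({
    "math_score": rng.normal(450, 90, size=200),
    "interest": rng.normal(10, 2, size=200),
})

def mask_at_random(df, column, share, rng):
    """Return a copy of df in which `share` of the rows have NaN in `column`."""
    out = df.copy()
    n_missing = int(len(out) * share)
    idx = rng.choice(out.index, size=n_missing, replace=False)
    out.loc[idx, column] = np.nan
    return out

toy_20_na = mask_at_random(toy, "interest", 0.2, rng)

# Share of missing values in the masked column.
print(toy_20_na["interest"].isna().mean())
```

Because the rows are chosen independently of the values, the missingness here is completely at random, which is the easiest case for every method discussed later.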

In [None]:
data_20_na[data_20_na['interest'].isna()]

In [None]:
# share of rows with a missing 'interest' value
data_20_na['interest'].isna().sum() / len(data)

## Testing Hypotheses

Regression: the target variable is math_score, the independent variables are all other columns.

1. Full data
2. Data with NAs - just dropping observations with NA

full data

In [None]:
import statsmodels.formula.api as smf

m_full = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data).fit()
m_full.summary()

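**Note**: with the formula interface, smf.ols drops rows that contain NAs by default, so the models below are fitted only on complete observations.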
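
A quick way to see this is to compare the number of observations each fit reports. The sketch below uses synthetic data; the columns `x` and `y` and the model names are placeholders, not columns of the TIMSS sample:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise.
toy = pd.DataFrame({"x": rng.normal(size=100)})
toy["y"] = 2.0 * toy["x"] + rng.normal(size=100)

# Copy in which the first 10 rows lose their predictor value.
toy_na = toy.copy()
toy_na.loc[toy_na.index[:10], "x"] = np.nan

m_full_toy = smf.ols("y ~ x", data=toy).fit()
m_na_toy = smf.ols("y ~ x", data=toy_na).fit()

# nobs is the number of rows actually used in each fit.
print(int(m_full_toy.nobs), int(m_na_toy.nobs))
```

Checking `nobs` (also shown as "No. Observations" in `summary()`) is a cheap guard against silently losing a large part of the sample.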

data with 5% NAs in the column *interest*

In [None]:
m_5 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_5_na).fit()
m_5.summary()

data with 20% NAs in the column *interest*

In [None]:
m_20 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_20_na).fit()
m_20.summary()

data with 40% NAs in the column *interest*

In [None]:
m_40 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_40_na).fit()
m_40.summary()

just dropping observations with NA

In [None]:
# dropping observations that have NA in the column 'interest'
# (a list for `subset` also works on older pandas versions)
data_5_na.dropna(subset=['interest'])

## Imputation methods

1. Some constant value:
+ distinct from other values
+ the mean, median, or mode of the column

2. Some randomly selected value from this column (or a non-random one, for example, the previous or next value)

3. A value estimated by another predictive model.
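
The first two options can be sketched on a small series. The values below are made up for illustration, and `-1` is just one possible "distinct" constant:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers).
s = pd.Series([3.0, np.nan, 5.0, 5.0, np.nan, 9.0], name="interest")

# 1. Constant values.
constant_fill = s.fillna(-1)              # a value distinct from all observed ones
mean_fill = s.fillna(s.mean())            # mean of the observed values
median_fill = s.fillna(s.median())        # median of the observed values
mode_fill = s.fillna(s.mode().iloc[0])    # most frequent value (first one if tied)

# 2. Values taken from the column itself.
prev_fill = s.ffill()                     # previous observed value
next_fill = s.bfill()                     # next observed value

rng = np.random.default_rng(1)
missing = s.isna()
random_fill = s.copy()
# Draw replacements at random (with replacement) from the observed values.
random_fill[missing] = rng.choice(s.dropna().to_numpy(), size=missing.sum())

print(pd.DataFrame({
    "mean": mean_fill, "median": median_fill, "mode": mode_fill,
    "prev": prev_fill, "next": next_fill, "random": random_fill,
}))
```

Note that `ffill` leaves a leading NaN and `bfill` a trailing NaN in place when the first or last value is missing.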

### Simple methods

In [None]:
data_5_na.fillna(0)

In [None]:
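# fill only the 'interest' column, using the mean of its observed values
# (the mean of the full data would not be available with real missing data)
data_5_na.fillna({'interest': data_5_na['interest'].mean()})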

### Model imputation

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
data_knn_imputed = imputer.fit_transform(data_40_na[['sex', 'education', 'interest', 'teaching', 'success', 'importance']])


In [None]:
# NaN never equals NaN, so `== np.nan` always selects nothing; count NaNs with np.isnan instead
np.isnan(data_knn_imputed).sum()

**MICE method for imputation in multiple columns**

Good explanation: https://stats.stackexchange.com/questions/421545/multiple-imputation-by-chained-equations-mice-explained

In Python: the class IterativeImputer in the library fancyimpute.

The library contains other imputation algorithms as well.
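
If fancyimpute is not installed, scikit-learn ships an `IterativeImputer` of its own. The sketch below assumes that class and runs on synthetic data; the toy columns only borrow names from the TIMSS sample, and `max_iter=10` and `random_state=0` are arbitrary choices:

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Three correlated synthetic columns, so each one carries information about the others.
base = rng.normal(10, 2, n)
toy = pd.DataFrame({
    "interest": base + rng.normal(0, 1, n),
    "teaching": base + rng.normal(0, 1, n),
    "success": base + rng.normal(0, 1, n),
})

# Missing values in two different columns.
toy_na = toy.copy()
toy_na.loc[rng.choice(toy.index, size=40, replace=False), "interest"] = np.nan
toy_na.loc[rng.choice(toy.index, size=40, replace=False), "teaching"] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
# fit_transform returns a NumPy array, so the column names are restored by hand.
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy_na), columns=toy_na.columns, index=toy_na.index
)

print(toy_imputed.isna().sum())
```

With its default settings this class returns a single completed dataset, so it is closer to one pass of chained equations than to full multiple imputation.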

In [None]:
from fancyimpute import IterativeImputer
mice_imputer = IterativeImputer(verbose=False)
data_imputed = mice_imputer.fit_transform(data)

You can also use some other algorithm, for example, just a linear regression (from the library sklearn or statsmodels).
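
A minimal sketch of regression imputation with sklearn, on synthetic data: the model is fitted on rows where the column is observed and then predicts the missing entries. The choice of predictors and the toy coefficients are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Synthetic data in which 'interest' depends linearly on two other columns.
toy = pd.DataFrame({
    "teaching": rng.normal(10, 2, n),
    "success": rng.normal(10, 2, n),
})
toy["interest"] = 0.5 * toy["teaching"] + 0.3 * toy["success"] + rng.normal(0, 1, n)

# Remove 20% of the 'interest' values at random.
toy.loc[rng.choice(toy.index, size=60, replace=False), "interest"] = np.nan

predictors = ["teaching", "success"]
observed = toy["interest"].notna()

# Fit on complete rows only.
reg = LinearRegression().fit(toy.loc[observed, predictors], toy.loc[observed, "interest"])

# Replace the missing entries with the model's predictions.
toy_reg_imputed = toy.copy()
toy_reg_imputed.loc[~observed, "interest"] = reg.predict(toy.loc[~observed, predictors])

print(toy_reg_imputed["interest"].isna().sum())
```

The predictions lie exactly on the fitted plane, so this kind of imputation understates the variability of the imputed column.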

## Testing hypotheses on imputed data

Let's compare two methods: imputation with the mean value and a model imputation algorithm, here kNN.

### Imputation with mean value

In [None]:
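# numeric_only=True keeps this working if the data contains non-numeric columns
data_5_mean_imputed = data_5_na.fillna(data_5_na.mean(numeric_only=True))
data_20_mean_imputed = data_20_na.fillna(data_20_na.mean(numeric_only=True))
data_40_mean_imputed = data_40_na.fillna(data_40_na.mean(numeric_only=True))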

In [None]:
m_5 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_5_mean_imputed).fit()
m_5.summary()

In [None]:
m_20 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_20_mean_imputed).fit()
m_20.summary()

In [None]:
m_40 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_40_mean_imputed).fit()
m_40.summary()

### Imputation with kNN 


In [None]:
def knn_impute(data, n_neighbors=2):
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="uniform")
    return imputer.fit_transform(data)

# columns passed to the imputer (the regressors from the formula)
knn_cols = ['sex', 'education', 'interest', 'teaching', 'success', 'importance']

data_5_knn_imputed = data_5_na.copy()
data_20_knn_imputed = data_20_na.copy()
data_40_knn_imputed = data_40_na.copy()

data_5_knn_imputed[knn_cols] = knn_impute(data_5_na[knn_cols])
data_20_knn_imputed[knn_cols] = knn_impute(data_20_na[knn_cols])
data_40_knn_imputed[knn_cols] = knn_impute(data_40_na[knn_cols])

In [None]:
m_5 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_5_knn_imputed).fit()
m_5.summary()

In [None]:
m_20 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_20_knn_imputed).fit()
m_20.summary()

In [None]:
m_40 = smf.ols(formula="math_score ~ C(sex) + interest + teaching + success + importance + C(education, Treatment(0))", 
                 data=data_40_knn_imputed).fit()
m_40.summary()
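
To compare fits side by side instead of reading each summary separately, the estimate for one term can be collected from every model. This is a sketch on synthetic data: `coef_table` is a hypothetical helper, and the toy columns `x` and `y` stand in for `interest` and `math_score`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def coef_table(models, term):
    """Collect coefficient, standard error, p-value and sample size for one term."""
    rows = {
        name: {
            "coef": m.params[term],
            "std_err": m.bse[term],
            "p_value": m.pvalues[term],
            "n_obs": int(m.nobs),
        }
        for name, m in models.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise.
toy = pd.DataFrame({"x": rng.normal(size=500)})
toy["y"] = 3.0 * toy["x"] + rng.normal(size=500)

# 40% of x removed at random, then filled with the observed mean.
toy_na = toy.copy()
toy_na.loc[rng.choice(toy.index, size=200, replace=False), "x"] = np.nan
toy_mean = toy_na.fillna({"x": toy_na["x"].mean()})

models = {
    "full": smf.ols("y ~ x", data=toy).fit(),
    "dropped": smf.ols("y ~ x", data=toy_na).fit(),
    "mean_imputed": smf.ols("y ~ x", data=toy_mean).fit(),
}
comparison = coef_table(models, "x")
print(comparison)
```

In the notebook the same helper could be called with the fitted models and the term `"interest"`, provided each model is stored under its own name rather than reusing `m_5`, `m_20` and `m_40`.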

## Task for you
**Deadline**: 27.09.2022 12:00. Send your solution by e-mail to aspestova@hse.ru in **html** format.

### Part 1.

1. Use 2 model imputation algorithms (not used in the practice) to impute the data.
2. Run the same regressions as in the practice, but on the imputed data.
3. Compare the results with those obtained during the practice.