<left><img width=100% height=100% src="img/itu_logo.png"></left>

## Lecture 04: Imputation of Missing Values

### __Gül İnan__<br><br>Istanbul Technical University

## Re-visit Video Games Data

In [11]:
#import dataset
import pandas as pd
video_df = pd.read_table("datasets/video.csv", sep = ";", na_values="99", index_col=0)
video_df.head()

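|   | time | freq | sex | age | home | math | work | own | grade |
|---|------|------|-----|-----|------|------|------|-----|-------|
| 0 | 2.0 | weekly | female | 19 | yes | no | 10.0 | yes | A |
| 1 | 0.0 | monthly | female | 18 | yes | yes | 0.0 | yes | C |
| 2 | 0.0 | monthly | male | 19 | yes | no | 0.0 | yes | B |
| 3 | 0.5 | monthly | female | 19 | yes | no | 0.0 | yes | B |
| 4 | 0.0 | semesterly | female | 19 | yes | yes | 0.0 | no | B |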


In the video games data, the `freq` and `work` features have missing values: of the 91 rows, only 78 and 88, respectively, are non-null.

In [12]:
#get some info on variables
video_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    91 non-null     float64
 1   freq    78 non-null     object 
 2   sex     91 non-null     object 
 3   age     91 non-null     int64  
 4   home    91 non-null     object 
 5   math    91 non-null     object 
 6   work    88 non-null     float64
 7   own     91 non-null     object 
 8   grade   91 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 7.1+ KB
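The notebook reads the file with `na_values="99"`, so entries coded as 99 arrive as NaN and show up as reduced non-null counts. A minimal sketch of that behaviour follows. It uses a small hypothetical inline CSV rather than the real `video.csv`, and the names `csv_text`, `toy_df`, and `missing_counts` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-row CSV in which 99 is the "missing" code
csv_text = "time;freq;work\n2.0;weekly;10\n0.0;99;0\n0.5;monthly;99\n"

# na_values="99" tells pandas to read every field equal to 99 as NaN
toy_df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values="99")

# Count the missing entries in each column
missing_counts = {col: int(n) for col, n in toy_df.isna().sum().items()}
print(missing_counts)  # {'time': 0, 'freq': 1, 'work': 1}
```

Calling `isna().sum()` on the real data frame should give the same picture as `info()`, but stated directly as counts of missing values.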


When we feed a feature with missing values to a regression algorithm, it will usually throw an error.

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

video_X = video_df[["age", "work"]]
video_y = video_df[["time"]]

#Split 90:10 (test_size=0.1 keeps 10% of the rows for testing)
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.1, random_state=1300)

In [14]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()

#no transformation
regr.fit(video_X_train,video_y_train)

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

## What is the problem?

We cannot feed features with missing values to the regression algorithm because it cannot handle NaNs.
Most algorithms cannot handle missing values.

## What are the possible ways to deal with NaNs?

- Delete the rows?
- Replace them with some reasonable values?
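As a rough illustration of these two options, the sketch below applies each of them to a small hypothetical data frame (not the video games data). The names `toy_df`, `toy_dropped`, and `toy_filled` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "work"
toy_df = pd.DataFrame({
    "age": [19, 18, 20, 19],
    "work": [10.0, np.nan, 0.0, 5.0],
})

# Option 1: delete every row that contains at least one NaN
toy_dropped = toy_df.dropna()

# Option 2: replace each NaN with the mean of its column
toy_filled = toy_df.fillna(toy_df.mean())

print(toy_dropped.shape)            # (3, 2)
print(toy_filled["work"].tolist())  # [10.0, 5.0, 0.0, 5.0]
```

Deleting rows discards the observed `age` value in the second row, whereas filling keeps all four rows.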

## Imputation methods

`Imputation` is an approach to handling missing values: it fills them in with substitute values, and there is a wide variety of methods for doing so. These methods are usually unsupervised, so they use only the features (not the target), and their fill values are learned from the training data.


The simplest strategy is to:

 - Fill in a **numerical feature** with the **mean** or **median** of that feature over the non-missing samples and
 - Fill in a **categorical feature** with the **most frequent value** of that feature over the non-missing samples.

This approach is implemented in the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) in scikit-learn.
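A minimal sketch of both strategies with `SimpleImputer`, using a hypothetical toy data frame rather than the video games data. The variable names are illustrative, and the `freq` column is explicitly stored with `object` dtype as an assumption so that NaN marks the missing category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical and one numerical feature
toy_df = pd.DataFrame({
    "freq": pd.Series(["weekly", "monthly", np.nan, "monthly"], dtype=object),
    "work": [10.0, np.nan, 0.0, 2.0],
})

# Numerical feature: fill with the median of the observed values
num_imputer = SimpleImputer(strategy="median")
work_filled = np.asarray(num_imputer.fit_transform(toy_df[["work"]])).ravel()

# Categorical feature: fill with the most frequent observed value
cat_imputer = SimpleImputer(strategy="most_frequent")
freq_filled = np.asarray(cat_imputer.fit_transform(toy_df[["freq"]])).ravel()

print(work_filled.tolist())  # [10.0, 2.0, 0.0, 2.0]
print(freq_filled.tolist())  # ['weekly', 'monthly', 'monthly', 'monthly']
```

The median of the observed `work` values (10, 0, 2) is 2, and `monthly` is the most frequent observed `freq` value, so these are the fill values.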

In [19]:
from sklearn import set_config
set_config(transform_output="pandas")  #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose the column names

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')

In [20]:
#imputer.fit(video_X_train)
#video_X_train_imp = imputer.transform(video_X_train)  #two lines can be combined as: imputer.fit_transform(video_X_train)
video_X_train_imp = imputer.fit_transform(video_X_train)
video_X_test_imp = imputer.transform(video_X_test)

In [21]:
#The imputation fill value for each feature
print(imputer.feature_names_in_)
imputer.statistics_

['age' 'work']


array([19.54320988,  7.71794872])
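The fill values stored in `statistics_` are learned when the imputer is fitted, and `transform` reuses them for any new data. The following sketch, on hypothetical toy arrays with illustrative names, suggests that missing test values are filled with the training mean, not the test mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data
toy_X_train = np.array([[1.0], [3.0], [np.nan]])
toy_X_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(toy_X_train)  # learns the mean of the observed training values: 2.0

# The NaN in the test set is replaced by the training mean
X_test_imp = np.asarray(toy_imputer.transform(toy_X_test))

print(toy_imputer.statistics_.tolist())  # [2.0]
print(X_test_imp.tolist())               # [[2.0], [100.0]]
```

This is why the notebook calls `fit_transform` on the training set but only `transform` on the test set.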

In [23]:
#Let’s check whether the NaN values have been replaced or not
video_X_train_imp.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81 entries, 49 to 58
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     81 non-null     float64
 1   work    81 non-null     float64
dtypes: float64(2)
memory usage: 1.9 KB


`Imputation` is suggested to be the `first step in any preprocessing sequence`. 

How, then, can we first impute the `work` feature and then scale it before fitting the model?

## Pipelines

- The [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function in scikit-learn allows us to build a `pipeline` of transformers with a final estimator. 

In [24]:
from sklearn import set_config
set_config(transform_output="pandas")  #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose the column names

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

#construct the pipeline

pipe = make_pipeline(
       SimpleImputer(strategy="mean"), 
       StandardScaler(), 
       LinearRegression()
)

In [25]:
pipe

In [26]:
pipe.fit(video_X_train, video_y_train)

Note that we pass the original training data (`video_X_train`, written X_train below) here, not imputed or scaled data. When you call `fit` on the pipeline, it carries out the following steps:

- Fits `SimpleImputer` on X_train,
- **Transforms** X_train using the fitted `SimpleImputer` to create X_train_imp,
- Fits `StandardScaler` on X_train_imp,
- **Transforms** X_train_imp using the fitted `StandardScaler` to create X_train_imp_scaled, and
- Fits the model (`LinearRegression` in our case) on X_train_imp_scaled.
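The steps above can be sketched by hand and compared with the pipeline. The example uses hypothetical toy arrays, not the video games data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features; the second has a missing value
X_train = np.array([[18.0, 0.0], [19.0, np.nan], [20.0, 10.0], [19.0, 5.0], [21.0, 2.0]])
y_train = np.array([0.0, 0.5, 2.0, 1.0, 0.2])
X_test = np.array([[19.0, np.nan], [22.0, 4.0]])

# Manual version: fit each step on the output of the previous one
toy_imputer = SimpleImputer(strategy="mean")
X_train_imp = toy_imputer.fit_transform(X_train)
toy_scaler = StandardScaler()
X_train_imp_scaled = toy_scaler.fit_transform(X_train_imp)
toy_model = LinearRegression().fit(X_train_imp_scaled, y_train)

# At prediction time the fitted steps are only applied, never refitted
manual_pred = toy_model.predict(toy_scaler.transform(toy_imputer.transform(X_test)))

# Pipeline version: the same sequence in a single object
toy_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), LinearRegression())
toy_pipe.fit(X_train, y_train)
pipe_pred = toy_pipe.predict(X_test)

print(np.allclose(manual_pred, pipe_pred))  # True
```

Both routes should give the same predictions. The pipeline mainly removes the bookkeeping and makes it harder to fit a transformer on test data by accident.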

In [27]:
pipe.predict(video_X_test)

array([[1.2631643 ],
       [0.63037214],
       [1.06891954],
       [1.54151206],
       [1.06891954],
       [1.5244895 ],
       [1.06891954],
       [0.8201088 ],
       [1.2166047 ],
       [1.06891954]])

Note that we pass the original (unimputed, unscaled) data to `predict` as well; the pipeline applies the fitted imputer and scaler before predicting. 

In [28]:
print('Test R2 on test data: %.2f' % pipe.score(video_X_test, video_y_test)) #A linear model is not a good fit for these data, which is why R2 is close to zero

Test R2 on test data: -0.03


## References

- https://ubc-cs.github.io/cpsc330/lectures/05_preprocessing-pipelines.html
- Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. Taylor & Francis Group.

In [None]:
import session_info
session_info.show()