# Impute missing data in a dataset

In this template, we will explain how to handle missing data in a dataset.  
Missing data is often indicated by `NaN`, `null`, an empty field, etc. Let's import a dataset that contains such missing values.
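
As a small illustration of these markers, the sketch below reads a made-up CSV string (the column names and values are hypothetical, not taken from `data.csv`) and counts what pandas treats as missing. With its default settings, `pd.read_csv` parses `NaN`, `null` and empty fields all as `NaN`.

```python
import io
import pandas as pd

# Hypothetical CSV text: the same "missing" idea written three different ways
raw = io.StringIO(
    "name,salary,cartype\n"
    "Ann,NaN,SUV\n"      # explicit NaN
    "Bob,3000,null\n"    # the string "null"
    "Cid,,sedan\n"       # empty field
)
toy = pd.read_csv(raw)

# Count the missing values pandas detected in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

If your file uses another marker (for example `?`), the `na_values` argument of `pd.read_csv` lets you declare it.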

In [None]:
# Load packages
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
%config InlineBackend.figure_format = 'retina'

In [None]:
# Upload your data as CSV and load it as a data frame
df = pd.read_csv("data.csv")
df  # Display the data frame to inspect the missing values

As we can see, some of the data is missing. A naive approach would be to simply delete the affected rows. However, a lot of interesting data could get lost this way.  
We will use a better approach: imputing the missing values from the known data in the same column. This is done with the `SimpleImputer` class from [scikit-learn](https://scikit-learn.org/stable/user_guide.html), which we imported earlier.
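
To get a feeling for how much the naive approach can cost, here is a minimal sketch on a made-up data frame (hypothetical values, not the contents of `data.csv`). Each incomplete row lacks only one value, yet `dropna()` discards the whole row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three of the four rows miss exactly one value
toy = pd.DataFrame({
    "salary": [3000.0, np.nan, 4000.0, 5000.0],
    "height": [1.80, 1.75, np.nan, 1.65],
    "cartype": ["SUV", "sedan", "SUV", np.nan],
})

# Naive approach: drop every row that contains at least one missing value
complete_rows = toy.dropna()
rows_lost = len(toy) - len(complete_rows)
print(f"{rows_lost} of {len(toy)} rows dropped")
```

The known values in the dropped rows are thrown away as well, which is the loss that imputation avoids.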

We will pass two arguments to this imputer:
* `missing_values`: as we can see in the dataframe above, the missing values are all indicated by *NaN* (not a number). The marker should be consistent throughout the whole dataset (or at least within every column), which is the case here.
* `strategy`: there are four string options ("mean", "median", "most_frequent" or "constant"). We will use "mean" for the numerical values and "most_frequent" for the strings.
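
The following sketch compares the four strategies on a single made-up column (the numbers are chosen only for illustration). For "constant" we assume a `fill_value` of 0.0; in practice that value is up to you.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value at row index 3
column = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Keep only the value that replaced the NaN
    filled[strategy] = imputer.fit_transform(column)[3, 0]

# "constant" needs the replacement value to be given explicitly
constant_imputer = SimpleImputer(
    missing_values=np.nan, strategy="constant", fill_value=0.0
)
filled["constant"] = constant_imputer.fit_transform(column)[3, 0]

print(filled)
```

Note that "mean" and "median" only work on numerical columns, while "most_frequent" and "constant" also accept strings.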

In [None]:
NUMERICAL_COLUMNS = [1,2]                                                               # The columns containing numerical values
STRING_COLUMNS = [4]                                                                    # The columns containing strings

# First, we will construct an imputer for the numerical values (salary and height)
num_imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')                 # To impute missing values that should be numerical we will use the mean of that column
num_imputer = num_imputer.fit(df.iloc[:,NUMERICAL_COLUMNS])                             # Calculate the mean of each column
df.iloc[:,NUMERICAL_COLUMNS] = num_imputer.transform(df.iloc[:,NUMERICAL_COLUMNS])      # Fill in the NaN's with the correct mean

# Now, we will do the same for column 'cartype', which consists of string values
str_imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')        # To impute missing values that should be strings we will use the most frequent type from that column
str_imputer = str_imputer.fit(df.iloc[:,STRING_COLUMNS])                                # Calculate the most frequent value
df.iloc[:,STRING_COLUMNS] = str_imputer.transform(df.iloc[:,STRING_COLUMNS])            # Fill in the NaN's with the correct value

df
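
If you want to check which values an imputer will fill in, a fitted `SimpleImputer` stores them in its `statistics_` attribute. A minimal sketch on made-up numbers (the column names mirror the example, the values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical columns with one missing value each
toy = pd.DataFrame({
    "salary": [3000.0, np.nan, 5000.0],
    "height": [1.5, 2.0, np.nan],
})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean").fit(toy)

# statistics_ holds one fill value per column, in column order
learned_means = dict(zip(toy.columns, toy_imputer.statistics_))
print(learned_means)
```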

We got rid of all missing data by filling it with either the mean or the most frequent occurrence from the corresponding column. Now we can start analyzing the data or train an ML model with it. But first, we will save it to a new CSV file.

In [None]:
df.to_csv('data_fixed.csv', index=False)  # index=False keeps the row index from being written as an extra column
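
A quick sanity check before moving on is to reload the saved data and confirm that nothing is missing anymore. The sketch below does this on a made-up frame and writes to an in-memory buffer instead of a real file; passing `index=False` keeps the row index out of the CSV so the columns come back unchanged.

```python
import io
import pandas as pd

# Hypothetical, already imputed data
toy = pd.DataFrame({
    "salary": [3000.0, 4000.0],
    "cartype": ["SUV", "sedan"],
})

# Write to an in-memory buffer (stands in for a file on disk)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back and count any remaining missing values
buffer.seek(0)
reloaded = pd.read_csv(buffer)
remaining_missing = int(reloaded.isna().sum().sum())
print(remaining_missing)
```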

### Useful links
[SimpleImputer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)