## Imputing Missing Values

This notebook shows how to impute missing values with scikit-learn's `SimpleImputer`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

## Objective

Our aim is to impute the missing values in a dataset so that an estimator can be trained on it and still perform well. In this section we work with the updated `Life Expectancy` dataset, which does not have any missing values. We therefore load the dataset first and then insert artificial `np.nan` values.

In [2]:
# load the Life Expectancy dataset.
df = pd.read_csv("../datasets/Life-Expectancy-Data-Updated.csv")
df.head()

Unnamed: 0,Country,Region,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,...,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
0,Turkiye,Middle East,2015,11.1,13.0,105.824,1.32,97,65,27.8,...,97,0.08,11006,78.53,4.9,4.8,7.8,0,1,76.5
1,Spain,European Union,2015,2.7,3.3,57.9025,10.35,97,94,26.0,...,97,0.09,25742,46.44,0.6,0.5,9.7,1,0,82.8
2,India,Asia,2007,51.5,67.9,201.0765,1.57,60,35,21.2,...,64,0.13,1076,1183.21,27.1,28.0,5.0,0,1,65.4
3,Guyana,South America,2006,32.8,40.5,222.1965,5.68,93,74,25.3,...,93,0.79,4146,0.75,5.7,5.5,7.9,0,1,67.0
4,Israel,Middle East,2012,3.4,4.3,57.951,2.89,97,89,27.0,...,94,0.08,33995,7.91,1.2,1.1,12.8,1,0,81.7


### Basic information

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2864 entries, 0 to 2863
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Country                      2864 non-null   object 
 1   Region                       2864 non-null   object 
 2   Year                         2864 non-null   int64  
 3   Infant_deaths                2864 non-null   float64
 4   Under_five_deaths            2864 non-null   float64
 5   Adult_mortality              2864 non-null   float64
 6   Alcohol_consumption          2864 non-null   float64
 7   Hepatitis_B                  2864 non-null   int64  
 8   Measles                      2864 non-null   int64  
 9   BMI                          2864 non-null   float64
 10  Polio                        2864 non-null   int64  
 11  Diphtheria                   2864 non-null   int64  
 12  Incidents_HIV                2864 non-null   float64
 13  GDP_per_capita    

In summary:

We have a total of `2864` entries and `21` columns, and we did not find any null values.

* Except for the `Country` and `Region` columns, all values are numerical.

### Descriptive stats

In [4]:
df.describe()

Unnamed: 0,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,Polio,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
count,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0
mean,2007.5,30.363792,42.938268,192.251775,4.820882,84.292598,77.344972,25.032926,86.499651,86.271648,0.894288,11540.92493,36.675915,4.865852,4.899825,7.632123,0.206704,0.793296,68.856075
std,4.610577,27.538117,44.569974,114.910281,3.981949,15.995511,18.659693,2.193905,15.080365,15.534225,2.381389,16934.788931,136.485867,4.438234,4.525217,3.171556,0.405012,0.405012,9.405608
min,2000.0,1.8,2.3,49.384,0.0,12.0,10.0,19.8,8.0,16.0,0.01,148.0,0.08,0.1,0.1,1.1,0.0,0.0,39.4
25%,2003.75,8.1,9.675,106.91025,1.2,78.0,64.0,23.2,81.0,81.0,0.08,1415.75,2.0975,1.6,1.6,5.1,0.0,1.0,62.7
50%,2007.5,19.6,23.1,163.8415,4.02,89.0,83.0,25.5,93.0,93.0,0.15,4217.0,7.85,3.3,3.4,7.8,0.0,1.0,71.4
75%,2011.25,47.35,66.0,246.791375,7.7775,96.0,93.0,26.4,97.0,97.0,0.46,12557.0,23.6875,7.2,7.3,10.3,0.0,1.0,75.4
max,2015.0,138.1,224.9,719.3605,17.87,99.0,99.0,32.1,99.0,99.0,21.68,112418.0,1379.86,27.7,28.6,14.1,1.0,1.0,83.8


### Missing Values

From the basic information we already know that this dataset has no missing values.

In [5]:
df.isna().sum()

Country                        0
Region                         0
Year                           0
Infant_deaths                  0
Under_five_deaths              0
Adult_mortality                0
Alcohol_consumption            0
Hepatitis_B                    0
Measles                        0
BMI                            0
Polio                          0
Diphtheria                     0
Incidents_HIV                  0
GDP_per_capita                 0
Population_mln                 0
Thinness_ten_nineteen_years    0
Thinness_five_nine_years       0
Schooling                      0
Economy_status_Developed       0
Economy_status_Developing      0
Life_expectancy                0
dtype: int64

### Adding missing values randomly to our dataset

First we do each step separately, then we combine them into a function.

To create a dataset with some randomly placed `np.nan` values, we follow these steps:

1. Separate the dataset into `x_data` and `y_data`.
2. Use `Life_expectancy` as `y_data` and the remaining variables as `x_data`.
3. Split the data into train and test sets.
4. Extract the number of entries and features.
5. Decide how many missing values to insert.
6. Create a boolean mask of length `n_entries` with exactly as many `True` values as there are missing values to insert.
7. Draw a random feature index between `0` and `n_features - 1` for each row marked `True`.
8. Use the mask and the feature indices together to write `np.nan` into the data.

In [6]:
# first we separate the target from the features.
y_data = df["Life_expectancy"]

# x_data keeps every column except Country, Region, Year and Life_expectancy.
x_data = df.drop(columns=["Country", "Region", "Year", "Life_expectancy"])

In [7]:
# Now testing our x_data shape
x_data.shape

(2864, 17)

In [8]:
# splitting the data into train and test dataset.
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30)

We now have a train and a test dataset. From here on we work only with the train dataset and use it to create a new dataset that contains `np.nan` values.

### Adding Missing Values

In [9]:
# to add missing values we first extract the number of samples and features from the x_train dataset.
n_sample, n_features = x_train.shape
print("Total number of samples: ", n_sample)
print("Total number of features: ", n_features)

Total number of samples:  2004
Total number of features:  17


In [10]:
# missing value rate
missing_rate = 0.30  # 30% of the training samples will receive one missing value each.

In [11]:
# compute the number of samples that will receive a missing value.
n_missing_number = int(n_sample * missing_rate)
print("Total number of missing Values will be: ", n_missing_number)

Total number of missing Values will be:  601


**Boolean mask marking the samples that get a missing value**

In [12]:
# create a boolean mask with one entry per sample, initially all False
n_missing_sample = np.zeros(n_sample, dtype=bool)

# set exactly n_missing_number entries of the mask to True.
n_missing_sample[:n_missing_number] = True

# shuffle the mask so the True values are spread randomly over the whole sample.
np.random.shuffle(n_missing_sample)

# the resulting mask
print(n_missing_sample)

[False False False ... False False False]
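
The mask construction can be checked on a small scale. The sketch below uses illustrative sizes and a seeded `np.random.default_rng` generator; the seed is an assumption added for repeatability, since the notebook itself relies on the global `np.random` state.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator: an assumption for repeatability

n_sample = 10                   # illustrative size, not the notebook's 2004
missing_rate = 0.30
n_missing_number = int(n_sample * missing_rate)

# Start with all False, mark the first n_missing_number entries, then shuffle in place.
mask = np.zeros(n_sample, dtype=bool)
mask[:n_missing_number] = True
rng.shuffle(mask)

# Shuffling moves the True values around but never changes how many there are.
print(mask)
print("marked samples:", mask.sum())
```

Because the mask is built first and shuffled afterwards, the number of marked samples is exact rather than approximate.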


In [13]:
# we also need a random feature index for each marked sample, to decide where the missing value goes.
n_missing_featrue = np.random.randint(0,n_features, n_missing_number) 

In [14]:
n_missing_featrue

array([14,  1,  4, 14,  2, 14, 11, 15,  3,  7, 15, 11, 10,  6, 11,  7,  8,
        7, 13,  1,  2,  5, 16, 14, 14, 12,  3,  4,  3,  5, 13, 11,  4, 10,
        4, 14,  8,  6, 15, 15, 11,  4,  8,  2, 16, 15,  4, 14,  3, 13, 13,
        6,  9, 12,  5,  8, 14, 12, 11,  5, 15,  6,  5,  0, 13,  6, 14,  3,
        8,  5,  4,  0, 14,  4,  3,  2, 13, 13, 15,  7,  5,  9,  1, 12,  9,
        4, 14, 12,  3,  5,  4, 10,  8,  3, 16,  5, 15,  3, 10,  2,  4, 14,
        9, 11, 14, 16,  2,  3, 13,  8,  9, 13, 13, 15,  4,  1, 13, 16,  1,
        9, 14,  5, 16, 14, 10,  3,  9,  5, 14, 15,  4,  2, 14,  5, 10, 12,
       13, 13,  3, 12, 16, 16,  6, 16, 14,  9, 13, 12, 11, 16,  7, 11,  1,
       10,  3,  6, 12,  8,  5,  7,  0, 16, 14,  5, 14, 12, 15,  1, 14, 13,
        8, 13,  2, 16, 11,  4,  9,  3,  0,  3, 12, 12,  8,  0,  4,  1,  9,
       14,  0,  2,  0,  5, 11, 16, 13,  0,  7,  7,  3,  2, 16,  8,  8,  9,
       16,  9, 16, 11, 13,  8,  3,  1, 13,  8, 14, 13,  4,  0, 15, 14,  1,
        1, 12, 11,  5,  0

In [15]:
# creating a copy to the data set
x_train_copy = x_train.copy()
y_train_copy = y_train.copy()

In [16]:
# inspect the shapes of x_train_copy, n_missing_sample and n_missing_featrue
print("x_train_copy shape: ", x_train_copy.shape)
print("n_missing_sample shape: ", n_missing_sample.shape)
print("n_missing_featrue shape: ", n_missing_featrue.shape)

x_train_copy shape:  (2004, 17)
n_missing_sample shape:  (2004,)
n_missing_featrue shape:  (601,)


The mask has one entry per training sample (2004), and the feature index array has one entry per `True` value in the mask (601), so the shapes are compatible and we are good to go.

Now we select all the samples that are marked for an `np.nan` value, and in each of them we select the feature that receives the `np.nan` value.
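
To make the indexing step concrete, here is a minimal sketch on a toy array (the array and the chosen positions are illustrative, not taken from the dataset). When a boolean row mask is combined with an integer column array, NumPy converts the mask into the indices of its `True` entries and pairs them element by element with the column indices.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)     # 4 samples, 3 features
row_mask = np.array([True, False, True, False])  # samples 0 and 2 get a NaN
col_idx = np.array([1, 2])                       # one feature index per marked sample

# The mask becomes the row indices [0, 2], which are paired with col_idx,
# so exactly the cells (0, 1) and (2, 2) are overwritten.
X[row_mask, col_idx] = np.nan

print(X)
print(np.argwhere(np.isnan(X)))
```

This pairing only works when the number of `True` entries in the mask equals the length of the column index array, which is why the shapes were inspected beforehand.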

In [17]:
np_x_train_data = x_train_copy.to_numpy()

In [18]:
np_x_train_data[n_missing_sample, n_missing_featrue] = np.nan

In [19]:
x_train_data = pd.DataFrame(np_x_train_data, columns=x_train_copy.columns)

In [20]:
x_train_data.isna().sum()

Infant_deaths                  39
Under_five_deaths              29
Adult_mortality                34
Alcohol_consumption            39
Hepatitis_B                    44
Measles                        36
BMI                            29
Polio                          32
Diphtheria                     35
Incidents_HIV                  36
GDP_per_capita                 33
Population_mln                 27
Thinness_ten_nineteen_years    40
Thinness_five_nine_years       46
Schooling                      40
Economy_status_Developed       34
Economy_status_Developing      28
dtype: int64

## Putting It Together

In [21]:
def add_missing_value(x_data, y_data, missing_rate=0.30):
    """Create artificial missing values in a dataset.

    This function randomly selects a fraction `missing_rate` of the samples and
    replaces one randomly chosen feature value with np.nan in each selected sample.
    The output vector is not modified.

    Parameters
    ----------
    x_data : ndarray of shape (n_samples, n_features)
        The input data matrix with all the features. It must have a float dtype
        so that it can hold np.nan.
    y_data : ndarray of shape (n_samples,)
        The output vector with the target values.
    missing_rate : float, default=0.30
        Fraction of samples that receive one missing value.

    Returns
    -------
    x_missing : array-like of shape (n_samples, n_features)
        The modified data matrix with missing values.
    y_missing : array-like of shape (n_samples,)
        The original output vector.
    
    """
    # extract the number of samples and features.
    n_samples, n_features = x_data.shape
    
    # calculate the number of samples that receive a missing value.
    n_missing_number = int(missing_rate * n_samples)
    
    # create a boolean mask with one entry per sample.
    n_missing_samples = np.zeros(n_samples, dtype=bool)
    
    # mark n_missing_number samples as True.
    n_missing_samples[:n_missing_number] = True
    
    # shuffle the mask so the marked samples are spread randomly
    # throughout the dataset.
    np.random.shuffle(n_missing_samples)
    
    # draw one random feature index for each marked sample.
    n_missing_features = np.random.randint(0, n_features, n_missing_number)
    
    # work on copies so the original data stays untouched.
    x_missing = x_data.copy()
    y_missing = y_data.copy()
    
    # write np.nan into the selected (sample, feature) cells.
    x_missing[n_missing_samples, n_missing_features] = np.nan
    
    return x_missing, y_missing
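
Because the function relies on the global `np.random` state, every call places the missing values at different positions. If repeatable results are wanted, one option is a seeded variant. The sketch below is an assumption, not part of the notebook: the name `add_missing_value_seeded` and its `seed` parameter are hypothetical, and it picks rows with `rng.choice` instead of shuffling a mask, which has the same effect of one `np.nan` in each selected row.

```python
import numpy as np

def add_missing_value_seeded(x_data, missing_rate=0.30, seed=0):
    """Hypothetical seeded variant: one NaN in each of int(rate * n) rows."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = x_data.shape
    n_missing = int(missing_rate * n_samples)

    # Pick distinct rows, then one random column for each picked row.
    rows = rng.choice(n_samples, size=n_missing, replace=False)
    cols = rng.integers(0, n_features, size=n_missing)

    # astype returns a copy; a float dtype is needed to store NaN.
    x_missing = x_data.astype(float)
    x_missing[rows, cols] = np.nan
    return x_missing

X_toy = np.ones((50, 4))
X_a = add_missing_value_seeded(X_toy, missing_rate=0.40, seed=42)
X_b = add_missing_value_seeded(X_toy, missing_rate=0.40, seed=42)

print(np.isnan(X_a).sum())                            # 20 missing values
print(np.array_equal(np.isnan(X_a), np.isnan(X_b)))   # True: same seed, same positions
```

The input array is left untouched, so the same clean data can be reused with different rates or seeds.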

In [22]:
# add one missing value to 40% of the training samples.
x_miss_data, y_miss_data = add_missing_value(x_train.to_numpy(), y_train.to_numpy(), 0.40)

In [23]:
# again creating the dataframe with new values
new_data = pd.DataFrame(x_miss_data, columns=x_train.columns)

In [24]:
# displaying the total number of nan values in each feature
new_data.isna().sum()

Infant_deaths                  52
Under_five_deaths              49
Adult_mortality                45
Alcohol_consumption            36
Hepatitis_B                    48
Measles                        52
BMI                            55
Polio                          46
Diphtheria                     48
Incidents_HIV                  54
GDP_per_capita                 39
Population_mln                 44
Thinness_ten_nineteen_years    64
Thinness_five_nine_years       45
Schooling                      35
Economy_status_Developed       42
Economy_status_Developing      47
dtype: int64

**Imputing the data**

In [25]:
# creating a SimpleImputer to replace all the missing value with the mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x_train_imputted = imputer.fit_transform(x_miss_data)
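
To see what mean imputation does, here is a small sketch on toy arrays (the values and the names `X_train_toy`, `X_test_toy` and `toy_imputer` are illustrative). It also shows that `transform` reuses the column means learned during `fit`, which is how a test set would normally be filled.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_toy = np.array([[np.nan, 40.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_train_filled = toy_imputer.fit_transform(X_train_toy)

# Column means are computed from the observed (non-NaN) training values only.
print(toy_imputer.statistics_)   # [ 2. 15.]
print(X_train_filled)

# transform reuses the training means, so the test gap is filled with 2.0.
X_test_filled = toy_imputer.transform(X_test_toy)
print(X_test_filled)             # [[ 2. 40.]]
```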

In [26]:
# again creating the dataframe with new values
new_data = pd.DataFrame(x_train_imputted, columns=x_train.columns)
new_data.isna().sum()

Infant_deaths                  0
Under_five_deaths              0
Adult_mortality                0
Alcohol_consumption            0
Hepatitis_B                    0
Measles                        0
BMI                            0
Polio                          0
Diphtheria                     0
Incidents_HIV                  0
GDP_per_capita                 0
Population_mln                 0
Thinness_ten_nineteen_years    0
Thinness_five_nine_years       0
Schooling                      0
Economy_status_Developed       0
Economy_status_Developing      0
dtype: int64

As we can see, after imputing the data we no longer have any `np.nan` values in the train dataset. Now we will train a `RandomForestRegressor` on both the original and the imputed data and compare the cross-validated scores.

## Regression with original data

In [34]:
# first create a RandomForestRegressor instance
regressor = RandomForestRegressor()

# here x_train and y_train are the original data.
regressor.fit(x_train, y_train)

# now perform the cross validation on the regressor
score = cross_val_score(regressor, x_train, y_train, scoring="neg_mean_squared_error", cv=4)

In [35]:
print("score mean: %.2f"% (score.mean()))
print("score standard deviation: ", score.std())

score mean: -0.53
score standard deviation:  0.04973003005274423


## Regression with imputed data

In [40]:
# regression with imputed data; note that this run uses cv=5 while the original-data run used cv=4
regressor = RandomForestRegressor()
score = cross_val_score(regressor, x_train_imputted, y_train, scoring="neg_mean_squared_error", cv=5)
print(f"NMSE : {score.mean():.2f}")
print("STD: ", score.std())

NMSE : -0.91
STD:  0.17758075798714867
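
The two cross-validation runs above differ in more than the data: they use different fold counts and unseeded forests, and the imputer was fitted on the whole training set before cross-validation. A like-for-like comparison could look like the sketch below. It is an illustration under assumptions, not the notebook's method: it uses scikit-learn's bundled diabetes data as a stand-in, fixed seeds, shared folds, and a pipeline so that the imputation means are learned inside each training fold.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)   # stand-in data, not the life expectancy set

# Put one NaN into 40% of the rows, as add_missing_value does.
rng = np.random.default_rng(0)
n_samples, n_features = X.shape
n_missing = int(0.40 * n_samples)
rows = rng.choice(n_samples, size=n_missing, replace=False)
cols = rng.integers(0, n_features, size=n_missing)
X_miss = X.copy()
X_miss[rows, cols] = np.nan

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # same folds for both runs
forest = RandomForestRegressor(n_estimators=50, random_state=0)

full_scores = cross_val_score(forest, X, y, scoring="neg_mean_squared_error", cv=cv)

# The imputer sits inside the pipeline, so its means are learned per training fold.
imputed_model = make_pipeline(SimpleImputer(strategy="mean"), forest)
imputed_scores = cross_val_score(imputed_model, X_miss, y,
                                 scoring="neg_mean_squared_error", cv=cv)

print(f"original data: {full_scores.mean():.1f} +/- {full_scores.std():.1f}")
print(f"imputed data : {imputed_scores.mean():.1f} +/- {imputed_scores.std():.1f}")
```

With `neg_mean_squared_error`, scores closer to zero are better, so the two means can be compared directly once the folds and seeds match.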


In [30]:
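# cross_val_score only fits clones, so fit the regressor itself before predicting.
y_pred = regressor.fit(x_train_imputted, y_train).predict(x_test.to_numpy())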

In [31]:
prediction =  pd.DataFrame({"y_pred": y_pred, "y_test": y_test})

In [32]:
prediction["error"] = prediction["y_pred"] - prediction["y_test"]


In [33]:
mean_squared_error(y_test, y_pred)

0.3574989720930258
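
The `error` column and `mean_squared_error` are directly related: the MSE is the mean of the squared errors. The sketch below checks this on illustrative values (the arrays `y_test_toy` and `y_pred_toy` are made up for the example) and also derives the RMSE, which is in the same units as the target.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

y_test_toy = np.array([70.0, 65.0, 80.0])   # illustrative values
y_pred_toy = np.array([71.0, 64.0, 78.0])

toy_prediction = pd.DataFrame({"y_pred": y_pred_toy, "y_test": y_test_toy})
toy_prediction["error"] = toy_prediction["y_pred"] - toy_prediction["y_test"]

# Errors are 1, -1 and -2, so the mean squared error is (1 + 1 + 4) / 3 = 2.0.
mse_from_errors = (toy_prediction["error"] ** 2).mean()
mse_sklearn = mean_squared_error(y_test_toy, y_pred_toy)
rmse = np.sqrt(mse_sklearn)

print(mse_from_errors, mse_sklearn, rmse)
```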