![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - Advanced Data Manipulation - Data Imputation

*Basic initialization of the workspace.*

In [None]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [None]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

# disable warnings for chained assignment
pd.set_option('mode.chained_assignment', None)

Pandas installed at version: 1.3.5


In [None]:
!python -m pip install scikit-learn
import sklearn as skl
import sklearn.experimental as skle
import sklearn.impute as skli

print ("Sklearn installed at version: {}".format(skl.__version__))

Sklearn installed at version: 1.0.2


### 1 Loading Data

We will work with a dataset of immigration data. It contains the number of foreign-born citizens in different countries, broken down by gender and year.

The dataset has missing values for a critical feature: the stock of citizens with a foreign origin/nationality.


#### 1.1 Loading and exploring data

First, we load the data and explore the basic structure of the dataset.

In [None]:
# load raw data for processing
raw_data = pd.read_parquet(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%204%20-%20Advanced%20Data%20Manipulation/Session%202%20-%20Advanced%20Data%20Manipulation/data/migration_dataset_extended.parquet"
)

print(
    "A sample of raw data is \n {}".format(
      raw_data[0:10]  
    )
)

A sample of raw data is 
      Year COU_ORIG Origin Country Gender COU_DEST Destination Country  Immigrant Stock  Origin Country Population  Destination Country Population
0  2000.0      AFG    Afghanistan    MEN      AUS           Australia           6500.0                 20779957.0                      19153000.0
1  2001.0      AFG    Afghanistan    MEN      AUS           Australia           7410.0                 21606992.0                      19413000.0
2  2002.0      AFG    Afghanistan    MEN      AUS           Australia           8710.0                 22600774.0                      19651400.0
3  2003.0      AFG    Afghanistan    MEN      AUS           Australia           9260.0                 23680871.0                      19895400.0
4  2004.0      AFG    Afghanistan    MEN      AUS           Australia           9810.0                 24726689.0                      20127400.0
5  2005.0      AFG    Afghanistan    MEN      AUS           Australia          10600.0         

We can observe that the dataset has several features:

*  **Year** - the year of observation;
*  **COU_ORIG** - the ISO3 code for the country of origin;
*  **Origin Country** - the name of the country of origin (country of birth/nationality);
*  **Gender** - the gender of the immigrants;
*  **COU_DEST** - the ISO3 code for the country of destination;
*  **Destination Country** - the name of the country of destination (country of residence);
*  **Immigrant Stock** - the number of immigrants (foreign-born citizens);
*  **Origin Country Population** - the population in the country of origin;
*  **Destination Country Population** - the population in the country of destination.

Several of these features have missing data.
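A quick way to see which features have missing data is to count the missing entries per column. The sketch below uses a small made-up frame (the `toy` values are assumptions, not taken from the dataset); the same call should work on `raw_data`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns of the dataset (values are made up).
toy = pd.DataFrame(
    {
        "Year": [2000.0, 2001.0, np.nan, 2003.0],
        "COU_ORIG": ["AFG", "AFG", "AFG", None],
        "Immigrant Stock": [6500.0, np.nan, np.nan, 9260.0],
    }
)

# isna() flags both NaN and None; summing the flags counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```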

#### 1.2 Deciding which data to drop and which data to keep

We first identify the records with missing values in the features we need as a bare minimum: **Year**, **COU_ORIG**, **COU_DEST**, **Gender**. These records will be dropped outright, as we cannot process them in any meaningful way.

In [None]:
# identify the records to be dropped
print(
    "The count of records to be dropped as totally invalid is {}".format(
      raw_data[
          np.isnan(raw_data["Year"]) |
          pd.isnull(raw_data["COU_ORIG"]) |
          pd.isnull(raw_data["COU_DEST"]) |
          pd.isnull(raw_data["Gender"])
      ].shape[0]    
    )
)

The count of records to be dropped as totally invalid is 1


In [None]:
# keep basic valid data
basic_valid_data = raw_data[
         ~(
              np.isnan(raw_data["Year"]) |
              pd.isnull(raw_data["COU_ORIG"]) |
              pd.isnull(raw_data["COU_DEST"]) |
              pd.isnull(raw_data["Gender"])
         )    
]
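The same filter can likely be written more compactly with `DataFrame.dropna` and its `subset` parameter. The following is a minimal sketch on made-up data (the `toy` frame and the `required` list are illustrative names), showing that it keeps the same rows as an explicit boolean mask.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both required and optional columns (values are made up).
toy = pd.DataFrame(
    {
        "Year": [2000.0, np.nan, 2002.0, 2003.0],
        "COU_ORIG": ["AFG", "AFG", None, "AFG"],
        "COU_DEST": ["AUS", "AUS", "AUS", "AUS"],
        "Gender": ["MEN", "MEN", "MEN", "WMN"],
        "Immigrant Stock": [6500.0, 7410.0, 8710.0, np.nan],
    }
)

required = ["Year", "COU_ORIG", "COU_DEST", "Gender"]

# Drop a row only if one of the required features is missing.
valid = toy.dropna(subset=required)

# Equivalent explicit mask, in the style of the cell above.
mask_version = toy[~toy[required].isna().any(axis=1)]

print(valid)
```

Note that the last row is kept even though its **Immigrant Stock** is missing, because that column is not part of `required`.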

Furthermore, the raw data has some invalid (negative) values for the **Immigrant Stock** feature. We will replace these values with **np.nan** (missing data) so the data imputation process will handle these values as well.

In [None]:
print(
      "There are {} records with an invalid value for Immigrant Stock.\n\
These values will be set to np.nan(missing data)".format(
        basic_valid_data[basic_valid_data["Immigrant Stock"] < 0].shape[0]
       )
    )

# set negative Immigrant Stock values to np.nan
basic_valid_data.loc[basic_valid_data["Immigrant Stock"] < 0, "Immigrant Stock"] = np.nan  

There are 66 records with an invalid value for Immigrant Stock.
These values will be set to np.nan(missing data)
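As an alternative to assigning through `.loc`, `Series.mask` returns a new series with the flagged values replaced by `NaN`, leaving the original untouched. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up stock values, two of them invalid (negative).
stock = pd.Series([6500.0, -1.0, 8710.0, -99.0, np.nan], name="Immigrant Stock")

# mask() replaces entries where the condition is True; the default replacement is NaN.
cleaned = stock.mask(stock < 0)
print(cleaned)
```

Because `mask` does not modify `stock` in place, this style avoids writing into a slice of another frame.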


To preserve the information about which values were originally missing, we add an indicator feature, **Immigrant Stock Missing Indicator**, that flags records with missing values.

In [None]:
# set missing indicator if immigrant stock information is missing
basic_valid_data["Immigrant Stock Missing Indicator"] = basic_valid_data["Immigrant Stock"].isna().astype(int)

print(
    "A sample of data with the information of missing data for Immigrant Stock is \n {}".format(
      basic_valid_data[0:10]  
    )
)

A sample of data with the information of missing data for Immigrant Stock is 
      Year COU_ORIG Origin Country Gender COU_DEST Destination Country  Immigrant Stock  Origin Country Population  Destination Country Population  Immigrant Stock Missing Indicator
0  2000.0      AFG    Afghanistan    MEN      AUS           Australia           6500.0                 20779957.0                      19153000.0                                  0
1  2001.0      AFG    Afghanistan    MEN      AUS           Australia           7410.0                 21606992.0                      19413000.0                                  0
2  2002.0      AFG    Afghanistan    MEN      AUS           Australia           8710.0                 22600774.0                      19651400.0                                  0
3  2003.0      AFG    Afghanistan    MEN      AUS           Australia           9260.0                 23680871.0                      19895400.0                                  0
4  2004.0      A

For each combination of ("COU_ORIG", "COU_DEST", "Gender"), we need to identify how many years have a missing value relative to the total number of records associated with the combination.

In [None]:
# aggregate the count of records with a missing Immigrant Stock value 
# versus the count of all records for each combination of
# ("COU_ORIG", "COU_DEST", "Gender")  
missing_value_aggregate = basic_valid_data.groupby(
    ["COU_ORIG", "COU_DEST", "Gender"]
).agg(
        count_records = ("Immigrant Stock Missing Indicator", "count"),
        count_missing_records = ("Immigrant Stock Missing Indicator", "sum")       
  )

print(
    "A sample of data with the count of missing data records versus all records \
in a combination of ('COU_ORIG', 'COU_DEST', 'Gender') is \n {}".format(
      missing_value_aggregate[0:20]  
    )
)

A sample of data with the count of missing data records versus all records in a combination of ('COU_ORIG', 'COU_DEST', 'Gender') is 
                           count_records  count_missing_records
COU_ORIG COU_DEST Gender                                      
AFG      AUS      MEN                21                      0
                  WMN                21                      0
         AUT      MEN                21                      2
                  WMN                21                      2
         BEL      MEN                21                     12
                  WMN                21                     12
         CAN      MEN                21                     18
                  WMN                21                     19
         CHE      MEN                21                     10
                  WMN                21                     10
         CHL      MEN                21                     19
                  WMN            

To make the data easier to use, we need to flatten the aggregate information into a standard data frame. This can be done with the [**reset_index**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) method.

Once this is done, we will also calculate the ratio of missing records towards all existing records for each combination of interest.
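The aggregation, flattening and ratio steps can be sketched on a toy frame as follows. The data is made up, and computing the ratio directly as the `mean` of the 0/1 indicator is a shortcut assumed here for illustration; it gives the same number as dividing the two counts.

```python
import pandas as pd

# Toy indicator data for three combinations (values are made up).
toy = pd.DataFrame(
    {
        "COU_ORIG": ["AFG"] * 6,
        "COU_DEST": ["AUT", "AUT", "AUT", "AUT", "BEL", "BEL"],
        "Gender": ["MEN", "MEN", "WMN", "WMN", "MEN", "MEN"],
        "Immigrant Stock Missing Indicator": [1, 0, 0, 0, 1, 1],
    }
)

stats = (
    toy.groupby(["COU_ORIG", "COU_DEST", "Gender"])
    .agg(
        count_records=("Immigrant Stock Missing Indicator", "count"),
        count_missing_records=("Immigrant Stock Missing Indicator", "sum"),
        # the mean of a 0/1 indicator is the share of flagged records
        ratio_missing_records=("Immigrant Stock Missing Indicator", "mean"),
    )
    # turn the group keys from index levels back into ordinary columns
    .reset_index()
)
print(stats)
```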

In [None]:
# reset index from aggregate
missing_value_combinations = missing_value_aggregate.reset_index()

# compute the ratio of missing records
missing_value_combinations["ratio_missing_records"] = missing_value_combinations["count_missing_records"] / missing_value_combinations["count_records"]

# ignore data with no missing records
missing_value_combinations =  missing_value_combinations[missing_value_combinations["ratio_missing_records"] > 0]

print(
    "A sample of missing data statistics information is \n {}".format(
      missing_value_combinations[0:10]  
    )
)

print(
    "There are in total {} combinations to be analysed".format(
        missing_value_combinations.shape[0]
    )
)

A sample of missing data statistics information is 
    COU_ORIG COU_DEST Gender  count_records  count_missing_records  ratio_missing_records
2       AFG      AUT    MEN             21                      2               0.095238
3       AFG      AUT    WMN             21                      2               0.095238
4       AFG      BEL    MEN             21                     12               0.571429
5       AFG      BEL    WMN             21                     12               0.571429
6       AFG      CAN    MEN             21                     18               0.857143
7       AFG      CAN    WMN             21                     19               0.904762
8       AFG      CHE    MEN             21                     10               0.476190
9       AFG      CHE    WMN             21                     10               0.476190
10      AFG      CHL    MEN             21                     19               0.904762
11      AFG      CHL    WMN             21               

At this point we decide to address only the combinations that have a ratio of missing records of 0.4 or less. This is an empirical decision; the threshold can vary depending on the data analyst and business needs.

Therefore, we will leave all the combinations with a ratio above 0.4 as they are; for the rest we will use data imputation to replace missing values with estimated ones.

In [None]:
# add the imputed Immigrant Stock value
basic_valid_data["Immigrant Stock Imputed"] = np.nan
basic_valid_data = basic_valid_data.reindex(
    columns = [
               "Year",
               "COU_ORIG",
               "Origin Country",
               "Gender", 
               "COU_DEST",
               "Destination Country",
               "Immigrant Stock",
               "Immigrant Stock Imputed",
               "Immigrant Stock Missing Indicator",
               "Origin Country Population",
               "Destination Country Population"
    ]
)

# keep only the combinations with a ratio of missing records of at most 0.4
missing_value_combinations_imputation = missing_value_combinations[missing_value_combinations["ratio_missing_records"] <= 0.4]

# drop the statistics columns, which are no longer needed
missing_value_combinations_imputation = missing_value_combinations_imputation.drop(
    ["count_records", "count_missing_records", "ratio_missing_records"],
    axis = 1
  )

print(
    "The count of combinations for which we will try to attempt imputation is {}".format(
        missing_value_combinations_imputation.shape[0]
    )
)

The count of combinations for which we will try to attempt imputation is 2394


#### 1.3 Performing data imputation

We will use the imputers from the Sklearn package. One of these imputers is the [**SimpleImputer**](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) which allows basic data imputation using mean, median or a constant value.

We intend to impute data within each ("COU_ORIG", "COU_DEST", "Gender") combination selected for imputation. As a first strategy, we will use median imputation.

In [None]:
# we will use the simple imputer with the strategy
# to impute the median value for missing data
simple_imputer = skli.SimpleImputer(
  missing_values= np.nan,  
  strategy = "median"
)

# as an example - impute the first ("COU_ORIG", "COU_DEST", "Gender")
# combination selected for imputing
missing_value_combination = missing_value_combinations_imputation.values[0]

# extract relevant data
country_origin = missing_value_combination[0]
country_destination = missing_value_combination[1]
gender = missing_value_combination[2]

# slice from full data for the 
# combination selected for imputing
slice_for_imputation = basic_valid_data[basic_valid_data["COU_ORIG"] == country_origin]
slice_for_imputation = slice_for_imputation[slice_for_imputation["COU_DEST"] == country_destination]
slice_for_imputation = slice_for_imputation[slice_for_imputation["Gender"] == gender]

# extract values for imputation, we only need the "Immigrant Stock" feature 
values_to_transform = slice_for_imputation["Immigrant Stock"].values

# fit and transform values for imputation
transformed_values = simple_imputer.fit_transform(values_to_transform.reshape(-1, 1))

# extract in a meaningful way the transformed values
# make sure they represent an integer
transformed_values = transformed_values.reshape(1, -1)[0]
transformed_values = transformed_values.astype(np.int32)

# set the imputer values back in the full data frame
basic_valid_data.loc[slice_for_imputation.index, "Immigrant Stock Imputed"] = transformed_values

# display the data slice for edification
print(
      "The transformed data slice is \n{}".format(
          basic_valid_data.loc[slice_for_imputation.index]
      )
    )

The transformed data slice is 
      Year COU_ORIG Origin Country Gender COU_DEST Destination Country  Immigrant Stock  Immigrant Stock Imputed  Immigrant Stock Missing Indicator  Origin Country Population  Destination Country Population
21  2000.0      AFG    Afghanistan    MEN      AUT             Austria              NaN                   5650.0                                  1                 20779957.0                       8011566.0
22  2001.0      AFG    Afghanistan    MEN      AUT             Austria              NaN                   5650.0                                  1                 21606992.0                       8042293.0
23  2002.0      AFG    Afghanistan    MEN      AUT             Austria           1710.0                   1710.0                                  0                 22600774.0                       8081957.0
24  2003.0      AFG    Afghanistan    MEN      AUT             Austria           2112.0                   2112.0                             

We can observe that the simple imputer did not generate credible data: the median is computed over all years, so the values imputed for 2000 and 2001 (5650) are much larger than the values observed in the years that follow (1710 in 2002).
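The effect is easy to reproduce on a small series with a steady upward trend. The sketch below uses made-up numbers and only illustrates why a single median is a poor fit for early gaps in a growing series.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up series that grows steadily; the first two years are missing.
stock = np.array([np.nan, np.nan, 100.0, 200.0, 300.0, 400.0, 500.0])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# SimpleImputer expects a 2D array, so the series is reshaped to one column.
filled = imputer.fit_transform(stock.reshape(-1, 1)).ravel()
print(filled)
```

Both gaps receive the median of the observed values (300), which is three times the first observed value that follows them.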

We can also use the [**IterativeImputer**](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer), which takes additional features (such as the year) into account in order to capture data patterns and trends during imputation.

In [None]:
# enable iterative imputer
from sklearn.experimental import enable_iterative_imputer

# we will use an iterative imputer which will consider also the 
# Year feature in addition to the Immigrant Stock one 
iterative_imputer = skli.IterativeImputer(
  missing_values= np.nan,
  random_state= 0  
)

# reset the imputed values
basic_valid_data["Immigrant Stock Imputed"] = np.nan

# as an example - impute the 10th ("COU_ORIG", "COU_DEST", "Gender")
# combination selected for imputing
missing_value_combination = missing_value_combinations_imputation.values[9]

# extract relevant data
country_origin = missing_value_combination[0]
country_destination = missing_value_combination[1]
gender = missing_value_combination[2]

# slice from full data for the 
# combination selected for imputing
slice_for_imputation = basic_valid_data[basic_valid_data["COU_ORIG"] == country_origin]
slice_for_imputation = slice_for_imputation[slice_for_imputation["COU_DEST"] == country_destination]
slice_for_imputation = slice_for_imputation[slice_for_imputation["Gender"] == gender]

# we will consider both "Year" and "Immigrant Stock" for imputation
values_to_transform = slice_for_imputation[["Year", "Immigrant Stock"]].values

# fit and transform values for imputation
transformed_values = iterative_imputer.fit_transform(values_to_transform)

# make sure the transformed values represent integers
transformed_values = transformed_values.astype(np.int32)

# set the imputer values back in the full data frame
basic_valid_data.loc[slice_for_imputation.index, "Immigrant Stock Imputed"] = transformed_values[:, 1]

# display the data slice for edification
print(
      "The transformed data slice is \n{}".format(
          basic_valid_data.loc[slice_for_imputation.index]
      )
    )

The transformed data slice is 
        Year COU_ORIG Origin Country Gender COU_DEST Destination Country  Immigrant Stock  Immigrant Stock Imputed  Immigrant Stock Missing Indicator  Origin Country Population  Destination Country Population
1008  2000.0      AFG    Afghanistan    WMN      HUN             Hungary              NaN                     14.0                                  1                 20779957.0                      10210971.0
1009  2001.0      AFG    Afghanistan    WMN      HUN             Hungary              NaN                     40.0                                  1                 21606992.0                      10187576.0
1010  2002.0      AFG    Afghanistan    WMN      HUN             Hungary              NaN                     66.0                                  1                 22600774.0                      10158608.0
1011  2003.0      AFG    Afghanistan    WMN      HUN             Hungary              NaN                     93.0                   

We can observe that the iterative imputer picks up the general trend in the data. However, because the data varies widely, it may predict negative values. Negative values make no sense from a business domain perspective, so they should be set to 0.
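One possible way to keep negative values out is the `min_value` parameter of `IterativeImputer`, which clips imputed values to a lower bound. The sketch below uses made-up data with a steep trend so that extrapolating backwards goes below zero; whether a floor at imputation time is preferable to replacing negatives afterwards is a judgment call.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data: a steep upward trend with the first three years missing.
years = np.arange(2000, 2010, dtype=float)
stock = np.array(
    [np.nan, np.nan, np.nan, 12.0, 58.0, 111.0, 161.0, 208.0, 262.0, 309.0]
)
values = np.column_stack([years, stock])

# Without a bound, the backward extrapolation can fall below zero.
unbounded = IterativeImputer(random_state=0).fit_transform(values)

# With min_value=0, imputed values are clipped to be non-negative.
bounded = IterativeImputer(random_state=0, min_value=0).fit_transform(values)

print(unbounded[:3, 1])
print(bounded[:3, 1])
```

Observed values are not changed by either imputer; only the three missing entries differ between the two results.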

#### 1.4 Putting it all together

We will use the following data imputation strategy for the **Immigrant Stock** feature:

*  For any combination of ("COU_ORIG", "COU_DEST", "Gender") where at most 40% of the records are missing, the missing values will be imputed by the iterative imputer using the **Year** and **Immigrant Stock** features;
*  If the iterative imputer generates negative imputation values, they will be set to 0;
*  All other missing values will be set to 0.

An implementation of this strategy can be found below, reusing the imputer from the previous example:

In [None]:
# reset the imputed values
basic_valid_data["Immigrant Stock Imputed"] = np.nan

# imputing will be performed for all combination of 
# ("COU_ORIG", "COU_DEST", "Gender") selected for imputing
for country_origin, country_destination, gender in missing_value_combinations_imputation.values:

  # slice from full data for the 
  # combination selected for imputing
  slice_for_imputation = basic_valid_data[basic_valid_data["COU_ORIG"] == country_origin]
  slice_for_imputation = slice_for_imputation[slice_for_imputation["COU_DEST"] == country_destination]
  slice_for_imputation = slice_for_imputation[slice_for_imputation["Gender"] == gender]

  # we will consider both "Year" and "Immigrant Stock" for imputation
  values_to_transform = slice_for_imputation[["Year", "Immigrant Stock"]].values

  # fit and transform values for imputation
  transformed_values = iterative_imputer.fit_transform(values_to_transform)

  # make sure the transformed values represent integers
  transformed_values = transformed_values.astype(np.int32)

  # set the imputer values back in the full data frame
  basic_valid_data.loc[slice_for_imputation.index, "Immigrant Stock Imputed"] = transformed_values[:, 1]
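The loop above filters the full data frame three times for every combination. A possible alternative is to let `groupby` partition the rows once and to process only the selected groups. The sketch below runs on made-up data; `impute_group` is a hypothetical placeholder that fills gaps with the group median, used here only to keep the example small, and is not the iterative imputer used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with two combinations (values are made up).
toy = pd.DataFrame(
    {
        "COU_ORIG": ["AFG"] * 6,
        "COU_DEST": ["AUT"] * 3 + ["BEL"] * 3,
        "Gender": ["MEN"] * 6,
        "Year": [2000.0, 2001.0, 2002.0] * 2,
        "Immigrant Stock": [np.nan, 10.0, 30.0, np.nan, np.nan, 5.0],
    }
)
toy["Immigrant Stock Imputed"] = np.nan

# Combinations selected for imputation (a hypothetical selection for this toy).
selected = {("AFG", "AUT", "MEN")}


def impute_group(group: pd.DataFrame) -> np.ndarray:
    """Hypothetical placeholder: fill gaps with the group median."""
    stock = group["Immigrant Stock"]
    return stock.fillna(stock.median()).to_numpy()


# groupby partitions the rows once; each group keeps the original index labels,
# so results can be written back with .loc exactly as in the loop above.
for key, group in toy.groupby(["COU_ORIG", "COU_DEST", "Gender"]):
    if key in selected:
        toy.loc[group.index, "Immigrant Stock Imputed"] = impute_group(group)

print(toy)
```

In the notebook, the selection could presumably be built as `set(map(tuple, missing_value_combinations_imputation.values))`, and `impute_group` would wrap the `fit_transform` call on the **Year** and **Immigrant Stock** columns.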

In [None]:
# display statistics about imputation results
count_total_imputed = basic_valid_data[~np.isnan(basic_valid_data["Immigrant Stock Imputed"])].shape[0]
count_negatively_imputed = basic_valid_data[basic_valid_data["Immigrant Stock Imputed"] < 0].shape[0]

print (
    "The total values imputed are {}, out of which {} are imputed with negative values".format(
        count_total_imputed,
        count_negatively_imputed
    )
)

# set negatively imputed values to 0
basic_valid_data.loc [(basic_valid_data["Immigrant Stock Imputed"] < 0), "Immigrant Stock Imputed"] = 0 

The total values imputed are 50274, out of which 1725 are imputed with negative values


In [None]:
# obtain count of values not imputed 
count_not_imputed = basic_valid_data[
                                      (basic_valid_data["Immigrant Stock Missing Indicator"] == 1)
                                      &
                                      np.isnan(basic_valid_data["Immigrant Stock Imputed"])
                                    ].shape[0]

print (
    "The total values not imputed yet are {}. They will be imputed to value 0.".format(
        count_not_imputed
    )
)

# impute all the remaining missing values to 0
basic_valid_data.loc [
                      (basic_valid_data["Immigrant Stock Missing Indicator"] == 1)
                      &
                      np.isnan(basic_valid_data["Immigrant Stock Imputed"])
, "Immigrant Stock Imputed"] = 0 

The total values not imputed yet are 151047. They will be imputed to value 0.


In [None]:
print(
    "A sample of basic processed data is \n {}".format(
      basic_valid_data[0:10]  
    )
)

A sample of basic processed data is 
      Year COU_ORIG Origin Country Gender COU_DEST Destination Country  Immigrant Stock  Immigrant Stock Imputed  Immigrant Stock Missing Indicator  Origin Country Population  Destination Country Population
0  2000.0      AFG    Afghanistan    MEN      AUS           Australia           6500.0                      NaN                                  0                 20779957.0                      19153000.0
1  2001.0      AFG    Afghanistan    MEN      AUS           Australia           7410.0                      NaN                                  0                 21606992.0                      19413000.0
2  2002.0      AFG    Afghanistan    MEN      AUS           Australia           8710.0                      NaN                                  0                 22600774.0                      19651400.0
3  2003.0      AFG    Afghanistan    MEN      AUS           Australia           9260.0                      NaN                           

In [None]:
# use an indicator for imputation as well, all the missing values are now imputed
# we will create a new column "Immigrant Stock Processed" containing
# the "Immigrant Stock" values if not empty
# otherwise the "Immigrant Stock Imputed" values
basic_valid_data["Immigrant Stock Imputed Indicator"] = basic_valid_data["Immigrant Stock Missing Indicator"]
basic_valid_data["Immigrant Stock Processed"] = np.nan

# store imputed values in the "Immigrant Stock Processed" column where applicable 
basic_valid_data.loc[basic_valid_data["Immigrant Stock Imputed Indicator"] == 1, "Immigrant Stock Processed"] = \
  basic_valid_data[basic_valid_data["Immigrant Stock Imputed Indicator"] == 1]["Immigrant Stock Imputed"] 

# store the original values in the "Immigrant Stock Processed" column where applicable
basic_valid_data.loc[basic_valid_data["Immigrant Stock Imputed Indicator"] == 0, "Immigrant Stock Processed"] = \
  basic_valid_data[basic_valid_data["Immigrant Stock Imputed Indicator"] == 0]["Immigrant Stock"] 
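Because the original value is present exactly when the indicator is 0, the two assignments above amount to "take the original value, and fall back to the imputed one where it is missing". A minimal sketch of that reading with `Series.fillna`, on made-up values:

```python
import numpy as np
import pandas as pd

# Toy columns: the imputed value is only relevant where the original is missing.
toy = pd.DataFrame(
    {
        "Immigrant Stock": [6500.0, np.nan, 8710.0, np.nan],
        "Immigrant Stock Imputed": [np.nan, 14.0, np.nan, 0.0],
    }
)

# fillna with another series fills each gap with the value at the same index label.
toy["Immigrant Stock Processed"] = toy["Immigrant Stock"].fillna(
    toy["Immigrant Stock Imputed"]
)
print(toy)
```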


In [None]:
# create the final processed data
# retaining only the relevant information
final_data = basic_valid_data[[
                                "Year", 
                                "Gender", 
                                "COU_ORIG", 
                                "Origin Country",
                                "Origin Country Population",
                                "COU_DEST",
                                "Destination Country",
                                "Destination Country Population",
                                "Immigrant Stock Processed",
                                "Immigrant Stock Imputed Indicator"
                               ]]

# convert the year from a floating point value to an integer value
final_data["Year"] = final_data["Year"].astype("int16")

In [None]:
print(
    "A sample of final processed data is \n {}".format(
      final_data[0:10]  
    )
)

A sample of final processed data is 
    Year Gender COU_ORIG Origin Country  Origin Country Population COU_DEST Destination Country  Destination Country Population  Immigrant Stock Processed  Immigrant Stock Imputed Indicator
0  2000    MEN      AFG    Afghanistan                 20779957.0      AUS           Australia                      19153000.0                     6500.0                                  0
1  2001    MEN      AFG    Afghanistan                 21606992.0      AUS           Australia                      19413000.0                     7410.0                                  0
2  2002    MEN      AFG    Afghanistan                 22600774.0      AUS           Australia                      19651400.0                     8710.0                                  0
3  2003    MEN      AFG    Afghanistan                 23680871.0      AUS           Australia                      19895400.0                     9260.0                                  0
4  2004    MEN   

In [None]:
# save the processed data
final_data.to_parquet("migration_dataset_imputed.parquet")