![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - Advanced Data Manipulation - Numerical Data Processing

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.21.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

# disable warnings for chained assignment
pd.set_option('mode.chained_assignment', None)

Pandas installed at version: 1.3.5


In [3]:
!python -m pip install scikit-learn
import sklearn as skl
import sklearn.preprocessing as sklp

print ("Sklearn installed at version: {}".format(skl.__version__))

Sklearn installed at version: 1.0.2


In [4]:
import warnings

# suppress warnings that are not relevant (this ignores all warning categories)
warnings.filterwarnings("ignore")

### 1 Loading Data

We will process a dataset on immigration. It contains the number of foreign-born citizens in different countries, broken down by gender and year.

The dataset's values for immigrant stock have been imputed - so no missing data is expected on this feature.


#### 1.1 Loading and exploring data

First, we load the data and explore the basic structure of the dataset.

In [5]:
# load data for processing
loaded_data = pd.read_parquet(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%204%20-%20Advanced%20Data%20Manipulation/Session%202%20-%20Advanced%20Data%20Manipulation/data/migration_dataset_imputed.parquet"
)

print(
    "A sample of loaded data is \n {}".format(
      loaded_data
    )
)

A sample of loaded data is 
         Year Gender COU_ORIG Origin Country  Origin Country Population COU_DEST Destination Country  Destination Country Population  Immigrant Stock Processed  Immigrant Stock Imputed Indicator
0       2000    MEN      AFG    Afghanistan                 20779957.0      AUS           Australia                      19153000.0                     6500.0                                  0
1       2001    MEN      AFG    Afghanistan                 21606992.0      AUS           Australia                      19413000.0                     7410.0                                  0
2       2002    MEN      AFG    Afghanistan                 22600774.0      AUS           Australia                      19651400.0                     8710.0                                  0
3       2003    MEN      AFG    Afghanistan                 23680871.0      AUS           Australia                      19895400.0                     9260.0                                  

We can observe that the dataset has several features:

*  **Year** - the year of observation;
*  **Gender** - the gender of the immigrants;
*  **COU_ORIG** - the ISO3 code for the country of origin;
*  **Origin Country** - the name of the country of origin (country of birth/nationality);
*  **Origin Country Population** - the population in the country of origin;
*  **COU_DEST** - the ISO3 code for the country of destination;
*  **Destination Country** - the name of the country of destination (country of residence);
*  **Destination Country Population** - the population in the country of destination;
*  **Immigrant Stock Processed** - the number of immigrants (foreign-born citizens), with imputed values filled in where the original data was missing;
*  **Immigrant Stock Imputed Indicator** - an indicator specifying whether the immigrant stock value is the original one or was missing and has been imputed.


We want to process the numerical features so that they are as useful as possible both for data insights and for machine learning.

First of all, let's find out some statistical information about these features:

In [6]:
# define the target numerical values
target_features = [
                    "Immigrant Stock Processed",
                    "Origin Country Population",
                    "Destination Country Population"
                  ]

# print basic statistics about target features
for target_feature in target_features :
  target_feature_valid = loaded_data[~np.isnan(loaded_data[target_feature])]

  print("Target feature: {} \n".format(target_feature))
  print("Min Value: {:.2f}, Max Value: {:.2f}, Average Value: {:.2f}, Standard Deviation {:.2f} \n".format(
      np.min(target_feature_valid[target_feature]),
      np.max(target_feature_valid[target_feature]),
      np.average(target_feature_valid[target_feature]),
      np.std(target_feature_valid[target_feature])
    )
  )


Target feature: Immigrant Stock Processed 

Min Value: 0.00, Max Value: 11714489.00, Average Value: 6853.38, Standard Deviation 88985.54 

Target feature: Origin Country Population 

Min Value: 9392.00, Max Value: 1410929362.00, Average Value: 34891321.29, Standard Deviation 133379259.47 

Target feature: Destination Country Population 

Min Value: 281205.00, Max Value: 329484123.00, Average Value: 32103057.69, Standard Deviation 56298756.97 
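
As a side note, similar statistics can be obtained in one call with the pandas `describe()` method. The sketch below is only an illustration: `toy_data` is a hypothetical stand-in for `loaded_data`, and its missing value is made up. Keep in mind that pandas reports the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to the population standard deviation (`ddof=0`), so the two figures can differ slightly.

```python
import pandas as pd

# hypothetical stand-in for loaded_data; the missing value is made up
toy_data = pd.DataFrame({
    "Immigrant Stock Processed": [6500.0, 7410.0, 8710.0, None],
    "Origin Country Population": [20779957.0, 21606992.0, 22600774.0, 23680871.0],
})

# count, mean, std, min, quartiles and max for every numerical column;
# missing values are skipped automatically
summary = toy_data.describe()
print(summary)

# population standard deviation (ddof=0), for comparison with np.std
population_std = toy_data.std(ddof=0)
print(population_std)
```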



#### 1.2 Feature scaling

We can observe that there is a lot of variability in the numerical features. 

For many machine learning algorithms, high variability combined with a wide range of values hurts performance. In many cases, the data should be rescaled so that it fits into a clearly defined interval (usually [0,1]).
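
To make the idea concrete, rescaling to [0,1] is commonly done with the min-max formula `(x - min) / (max - min)`. The following is a minimal NumPy sketch of that formula; the helper name `min_max_scale` is hypothetical and is not part of the course code.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1]; a constant array maps to zeros."""
    values = np.asarray(values, dtype=float)
    value_min = np.nanmin(values)
    value_range = np.nanmax(values) - value_min
    if value_range == 0:
        # avoid a division by zero when all values are identical
        return values - value_min
    return (values - value_min) / value_range

# a few population values taken from the data sample shown earlier
populations = np.array([19153000.0, 19413000.0, 19651400.0, 19895400.0])
scaled = min_max_scale(populations)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.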

The scikit-learn library has strong support for data scaling via the [**MinMaxScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) and [**RobustScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) classes.

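When the data variability is high, the RobustScaler class is preferred since it is quite resilient to outliers (data that have extreme or unusual values). Note that it centers the data on the median and scales it by the interquartile range, so its output is not confined to [0,1].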
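
The difference between the two scalers can be seen on a small example. The values below are made up for illustration and include one extreme outlier; both scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# made-up values with a single extreme outlier at the end
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# min-max scaling: the outlier defines the upper bound,
# so the ordinary values are squeezed close to 0
min_max_scaled = MinMaxScaler().fit_transform(values).ravel()

# robust scaling: based on the median and the interquartile range,
# so the ordinary values keep a usable spread
robust_scaled = RobustScaler().fit_transform(values).ravel()

print("MinMaxScaler:", min_max_scaled)
print("RobustScaler:", robust_scaled)
```

With MinMaxScaler the result stays inside [0,1], but the four ordinary values become nearly indistinguishable. With RobustScaler they remain well separated, while the outlier is mapped to a large value outside any fixed interval.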

For the **Origin Country Population** and **Destination Country Population** features, we can scale the values at the country (origin or destination) level. We would like to keep the values in the interval [0,1], so we apply a MinMaxScaler to each slice of data associated with an origin or destination country. 

In [7]:
# create a column for origin and destination country population
# scaled values
loaded_data["Origin Country Population Scaled"] = np.nan
loaded_data["Destination Country Population Scaled"] = np.nan

# process all origin countries
origin_countries = loaded_data["COU_ORIG"]
origin_countries = origin_countries.drop_duplicates()

for origin_country in origin_countries.values :
  # determine the slice of data for the origin country
  origin_country_slice = loaded_data[loaded_data["COU_ORIG"] == origin_country]

  # scale the data and assign it to the corresponding feature
  origin_country_scaler = sklp.MinMaxScaler()
  loaded_data.loc[origin_country_slice.index, "Origin Country Population Scaled"] = \
    origin_country_scaler.fit_transform(
        origin_country_slice["Origin Country Population"].values.reshape((-1,1)) 
      )

# we process the destination countries as well
destination_countries = loaded_data["COU_DEST"]
destination_countries = destination_countries.drop_duplicates()

for destination_country in destination_countries.values :
  # slice the data for a specific destination country
  destination_country_slice = loaded_data[loaded_data["COU_DEST"] == destination_country]

  # scale the data and assign it to the corresponding feature
  destination_country_scaler = sklp.MinMaxScaler()
  loaded_data.loc[destination_country_slice.index, "Destination Country Population Scaled"] = \
    destination_country_scaler.fit_transform(
        destination_country_slice["Destination Country Population"].values.reshape((-1,1)) 
      )

We will scale the **Immigrant Stock Processed** feature as well, this time within each applicable (Origin Country, Destination Country) combination.

In [8]:
# get all the applicable (origin country, destination country) combinations
country_combinations = loaded_data[["COU_ORIG", "COU_DEST"]]
country_combinations = country_combinations.drop_duplicates()

# create a column for scaled values of Immigrant Stock Processed feature
loaded_data["Immigrant Stock Processed Scaled"] = np.nan

for origin_country, destination_country in country_combinations.values:
  # determine the slice of data for the origin and destination country
  data_slice = loaded_data[loaded_data["COU_ORIG"] == origin_country]
  data_slice = data_slice[data_slice["COU_DEST"] == destination_country]

  # scale the data and assign it to the corresponding feature
  immigrant_stock_scaler = sklp.MinMaxScaler()
  loaded_data.loc[data_slice.index, "Immigrant Stock Processed Scaled"] = \
      immigrant_stock_scaler.fit_transform(
        data_slice["Immigrant Stock Processed"].values.reshape((-1,1))
      )
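
The per-group loops used so far can also be expressed with pandas `groupby(...).transform(...)`, which avoids slicing the DataFrame once per group. The sketch below is an alternative formulation, not the course's implementation: the helper `min_max_by_group`, the toy rows, and the output column names are hypothetical. It assumes that a group with a single distinct value should map to 0, which is, to my understanding, how MinMaxScaler treats a constant feature.

```python
import numpy as np
import pandas as pd

def min_max_by_group(frame, group_columns, value_column):
    """Min-max scale value_column within each group of group_columns."""
    grouped = frame.groupby(group_columns)[value_column]
    group_min = grouped.transform("min")
    group_range = grouped.transform("max") - group_min
    # a constant group has zero range; divide by 1 so it maps to 0
    return (frame[value_column] - group_min) / group_range.replace(0, 1)

# made-up rows that mimic the structure of loaded_data
toy_data = pd.DataFrame({
    "COU_ORIG": ["AFG", "AFG", "AFG", "ALB", "ALB", "ALB"],
    "COU_DEST": ["AUS", "AUS", "AUT", "AUS", "AUS", "AUS"],
    "Immigrant Stock Processed": [6500.0, 7410.0, 100.0, 50.0, 75.0, 100.0],
})

# scaling per origin country (single grouping key)
toy_data["Scaled By Origin"] = min_max_by_group(
    toy_data, "COU_ORIG", "Immigrant Stock Processed"
)

# scaling per (origin, destination) combination (two grouping keys)
toy_data["Scaled By Pair"] = min_max_by_group(
    toy_data, ["COU_ORIG", "COU_DEST"], "Immigrant Stock Processed"
)

print(toy_data)
```

On large datasets this vectorized form is usually faster than looping over groups, at the cost of not keeping a fitted scaler object per group.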

In [9]:
# display a data sample
print(
    "A sample of the scaled data is as follows \n{}".format(
        loaded_data
    )
)

A sample of the scaled data is as follows 
        Year Gender COU_ORIG Origin Country  Origin Country Population COU_DEST Destination Country  Destination Country Population  Immigrant Stock Processed  Immigrant Stock Imputed Indicator  Origin Country Population Scaled  Destination Country Population Scaled  Immigrant Stock Processed Scaled
0       2000    MEN      AFG    Afghanistan                 20779957.0      AUS           Australia                      19153000.0                     6500.0                                  0                          0.000000                               0.000000                          0.055057
1       2001    MEN      AFG    Afghanistan                 21606992.0      AUS           Australia                      19413000.0                     7410.0                                  0                          0.045571                               0.039792                          0.082286
2       2002    MEN      AFG    Afghanistan           

In [10]:
# define the target numerical values
target_features = [
                    "Immigrant Stock Processed Scaled",
                    "Origin Country Population Scaled",
                    "Destination Country Population Scaled"
                  ]

# print basic statistics about target features
for target_feature in target_features :
  target_feature_valid = loaded_data[~np.isnan(loaded_data[target_feature])]

  print("Target feature: {} \n".format(target_feature))
  print("Min Value: {:.2f}, Max Value: {:.2f}, Average Value: {:.2f}, Standard Deviation {:.2f} \n".format(
      np.min(target_feature_valid[target_feature]),
      np.max(target_feature_valid[target_feature]),
      np.average(target_feature_valid[target_feature]),
      np.std(target_feature_valid[target_feature])
    )
  )

Target feature: Immigrant Stock Processed Scaled 

Min Value: 0.00, Max Value: 1.00, Average Value: 0.21, Standard Deviation 0.32 

Target feature: Origin Country Population Scaled 

Min Value: 0.00, Max Value: 1.00, Average Value: 0.48, Standard Deviation 0.32 

Target feature: Destination Country Population Scaled 

Min Value: 0.00, Max Value: 1.00, Average Value: 0.48, Standard Deviation 0.32 



At this point we decide to address only the combinations that have a ratio of missing records of 0.4 or less. This is an empirical decision; the ratio can vary depending on the data analyst and business needs.

Therefore we will leave all the combinations having a ratio of more than 0.4 as is; for the rest we will use data imputation in order to replace missing values with estimated ones.  

In [11]:
# create the final processed data
# retaining only the relevant information
final_data = loaded_data[[
                                "Year", 
                                "Gender", 
                                "COU_ORIG", 
                                "Origin Country",
                                "Origin Country Population",
                                "Origin Country Population Scaled",
                                "COU_DEST",
                                "Destination Country",
                                "Destination Country Population",
                                "Destination Country Population Scaled",
                                "Immigrant Stock Processed",
                                "Immigrant Stock Processed Scaled",
                                "Immigrant Stock Imputed Indicator"
                               ]].copy()

# make sure the year remains an integer value
final_data["Year"] = final_data["Year"].astype("int16")

In [12]:
# display a sample of the final data
print(
    "A sample of final processed data is \n {}".format(
      final_data
    )
)

A sample of final processed data is 
         Year Gender COU_ORIG Origin Country  Origin Country Population  Origin Country Population Scaled COU_DEST Destination Country  Destination Country Population  Destination Country Population Scaled  Immigrant Stock Processed  Immigrant Stock Processed Scaled  Immigrant Stock Imputed Indicator
0       2000    MEN      AFG    Afghanistan                 20779957.0                          0.000000      AUS           Australia                      19153000.0                               0.000000                     6500.0                          0.055057                                  0
1       2001    MEN      AFG    Afghanistan                 21606992.0                          0.045571      AUS           Australia                      19413000.0                               0.039792                     7410.0                          0.082286                                  0
2       2002    MEN      AFG    Afghanistan                

In [13]:
# save the processed data
final_data.to_parquet("migration_dataset_numerically_processed.parquet")