![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - Advanced Data Manipulation - Categorical Data Processing

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

# disable warnings for chained assignment
pd.set_option('mode.chained_assignment', None)

Pandas installed at version: 1.3.5


In [3]:
!python -m pip install scikit-learn
import sklearn as skl
import sklearn.preprocessing as sklp

print ("Sklearn installed at version: {}".format(skl.__version__))

Sklearn installed at version: 1.0.2


In [4]:
import warnings

# suppress warnings that are not relevant for this session
warnings.filterwarnings("ignore")

### 1 Loading Data

We will process a dataset of immigration data. It contains the number of foreign-born citizens in different countries, broken down by gender and year.

The dataset's values for immigrant stock have been imputed - so no missing data is expected on this feature.


#### 1.1 Loading and exploring data

First, we load the data and explore the basic structure of the dataset.

In [5]:
# load data for processing
loaded_data = pd.read_parquet(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%204%20-%20Advanced%20Data%20Manipulation/Session%202%20-%20Advanced%20Data%20Manipulation/data/migration_dataset_numerically_processed.parquet"
)

print(
    "A sample of loaded data is \n {}".format(
        loaded_data[0:10]
    )
)

A sample of loaded data is 
    Year Gender COU_ORIG Origin Country  Origin Country Population  Origin Country Population Scaled COU_DEST Destination Country  Destination Country Population  Destination Country Population Scaled  Immigrant Stock Processed  Immigrant Stock Processed Scaled  Immigrant Stock Imputed Indicator
0  2000    MEN      AFG    Afghanistan                 20779957.0                          0.000000      AUS           Australia                      19153000.0                               0.000000                     6500.0                          0.055057                                  0
1  2001    MEN      AFG    Afghanistan                 21606992.0                          0.045571      AUS           Australia                      19413000.0                               0.039792                     7410.0                          0.082286                                  0
2  2002    MEN      AFG    Afghanistan                 22600774.0               

We can observe that the dataset has several features:

*  **Year** - the year of observation;
*  **Gender** - the gender of the immigrants;
*  **COU_ORIG** - the ISO3 code for the country of origin;
*  **Origin Country** - the name of the country of origin (country of birth/nationality);
*  **Origin Country Population** - the population in the country of origin;
*  **Origin Country Population Scaled** - the population in the country of origin (scaled values);
*  **COU_DEST** - the ISO3 code for the country of destination;
*  **Destination Country** - the name of the country of destination (country of residence);
*  **Destination Country Population** - the population in the country of destination;
*  **Destination Country Population Scaled** - the population in the country of destination (scaled values);
*  **Immigrant Stock Processed** - the number of immigrants (foreign born citizens) which was processed by providing imputed values where the original data has been missing;
*  **Immigrant Stock Processed Scaled** - the number of immigrants (foreign born citizens) which was processed by providing imputed values where the original data has been missing (scaled values);
*  **Immigrant Stock Imputed Indicator** - an indicator specifying if the immigrant stock value is the original one or it has been missing and the actual value is imputed.


#### 1.2 Categorical feature encoding

We can observe that several critical data features are in textual format: **Gender**, **COU_ORIG** and **COU_DEST**. Text values are difficult to process numerically and are prone to errors such as typos or inconsistent casing.

Therefore, we would like to encode these textual values in a numeric format that is easier to process.
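To illustrate why raw text labels are error-prone, here is a minimal sketch on hypothetical data (the `gender` series below is invented for illustration and is not taken from the migration dataset). It shows how small spelling variations inflate the number of distinct categories until the labels are normalized.

```python
import pandas as pd

# Hypothetical labels: "Men" and " men" are textual variants of "MEN"
gender = pd.Series(["MEN", "WMN", "Men", " men", "WMN"])

# Every distinct raw string counts as its own category
print("Distinct raw labels: {}".format(gender.nunique()))

# Trimming whitespace and upper-casing collapses the variants
normalized = gender.str.strip().str.upper()
print("Distinct normalized labels: {}".format(normalized.nunique()))
```

A check of this kind is worth running before encoding, since every spurious label would otherwise become its own encoded column.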



One mechanism for encoding textual values into a more processable format is to create a vector of binary variables whose length equals the number of distinct values of the categorical feature. For a given value, every position in the vector is set to False (or 0), except the position corresponding to that value, which is set to True (or 1).

This encoding mechanism is called **one-hot encoding**.
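The idea can be sketched by hand with NumPy before turning to a library implementation. The sample `values` array is hypothetical, and this is only an illustration of the principle, not the way the notebook performs the encoding.

```python
import numpy as np

# Hypothetical sample of a categorical feature
values = np.array(["MEN", "WMN", "WMN", "MEN"])

# categories: the sorted distinct values
# codes: for each input value, its position in categories
categories, codes = np.unique(values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of category i,
# so indexing the identity matrix by codes encodes every input value
one_hot = np.eye(len(categories), dtype=int)[codes]

print("Categories: {}".format(categories))
print("One-hot encoded values:\n{}".format(one_hot))
```

Each row contains exactly one 1, placed in the column of the category it represents.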

The sklearn library supports the one-hot encoding mechanism via the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class.

We will first use this mechanism to encode the **Gender** feature.

In [6]:
# get the Gender values
gender_values = loaded_data["Gender"]

# encode the Gender values 
gender_encoder = sklp.OneHotEncoder(
    # return a dense array; scikit-learn 1.2+ renames this parameter to sparse_output
    sparse = False
)
gender_encoded_values = gender_encoder.fit_transform(gender_values.values.reshape(-1,1))

# obtain the feature names
gender_feature_names = gender_encoder.get_feature_names_out(["Gender"])

# create a data frame for encoded values 
gender_encoded_data = pd.DataFrame(
  columns = gender_feature_names,
  data = gender_encoded_values    
)

# join the encoded data
processed_data = pd.merge(
  loaded_data,
  gender_encoded_data, 
  how = "inner",
  left_index= True,
  right_index= True
)

# display a data sample
print(
    "A sample of the encoded data is as follows \n{}".format(
        processed_data[0:10]
    )
)

A sample of the encoded data is as follows 
   Year Gender COU_ORIG Origin Country  Origin Country Population  Origin Country Population Scaled COU_DEST Destination Country  Destination Country Population  Destination Country Population Scaled  Immigrant Stock Processed  Immigrant Stock Processed Scaled  Immigrant Stock Imputed Indicator  Gender_MEN  Gender_WMN
0  2000    MEN      AFG    Afghanistan                 20779957.0                          0.000000      AUS           Australia                      19153000.0                               0.000000                     6500.0                          0.055057                                  0         1.0         0.0
1  2001    MEN      AFG    Afghanistan                 21606992.0                          0.045571      AUS           Australia                      19413000.0                               0.039792                     7410.0                          0.082286                                  0         1.0         

We will also encode the country-related data for the **COU_ORIG** and **COU_DEST** features. We will encode these values together, using a single encoder for both columns.
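As a small-scale illustration of what encoding two columns at once produces, the sketch below uses `pandas.get_dummies` on hypothetical data. This is a swapped-in technique chosen for brevity, not the `OneHotEncoder` approach used in this notebook, and the miniature `countries` frame is invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the two country-code columns
countries = pd.DataFrame({
    "COU_ORIG": ["AFG", "ALB", "AFG"],
    "COU_DEST": ["AUS", "AUS", "AUT"],
})

# One indicator column per distinct value of each input column,
# named <column>_<value>
encoded = pd.get_dummies(
    countries,
    columns = ["COU_ORIG", "COU_DEST"],
    dtype = int
)

print("Encoded columns: {}".format(list(encoded.columns)))
print(encoded)
```

Every row has exactly two indicators set, one for the origin and one for the destination, and the number of new columns equals the sum of the distinct values in each input column. This is why the full dataset becomes very wide after encoding.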

In [7]:
# get the COU_ORIG & COU_DEST values
country_values = loaded_data[["COU_ORIG", "COU_DEST"]]
                             
# encode the COU_ORIG & COU_DEST values
countries_encoder = sklp.OneHotEncoder(
    # return a dense array; scikit-learn 1.2+ renames this parameter to sparse_output
    sparse = False
)
country_encoded_values = countries_encoder.fit_transform(country_values)

# obtain the feature names
country_feature_names = countries_encoder.get_feature_names_out(["COU_ORIG", "COU_DEST"])

# create a data frame for encoded values 
country_encoded_data = pd.DataFrame(
  columns = country_feature_names,
  data = country_encoded_values    
)

# join the encoded data
processed_data = pd.merge(
    processed_data,
    country_encoded_data, 
    how = "inner",
    left_index= True,
    right_index= True
  )

# display a data sample
print(
    "A sample of the encoded data is as follows \n{}".format(
        processed_data[0:10]
    )
)

A sample of the encoded data is as follows 
   Year Gender COU_ORIG Origin Country  Origin Country Population  Origin Country Population Scaled COU_DEST Destination Country  Destination Country Population  Destination Country Population Scaled  Immigrant Stock Processed  Immigrant Stock Processed Scaled  Immigrant Stock Imputed Indicator  Gender_MEN  Gender_WMN  COU_ORIG_AFG  COU_ORIG_AGO  COU_ORIG_ALB  COU_ORIG_AND  COU_ORIG_ARE  COU_ORIG_ARG  COU_ORIG_ARM  COU_ORIG_ATG  COU_ORIG_AUS  COU_ORIG_AUT  COU_ORIG_AZE  COU_ORIG_BDI  COU_ORIG_BEL  COU_ORIG_BEN  COU_ORIG_BFA  COU_ORIG_BGD  COU_ORIG_BGR  COU_ORIG_BHR  COU_ORIG_BHS  COU_ORIG_BIH  COU_ORIG_BLR  COU_ORIG_BLZ  COU_ORIG_BMU  COU_ORIG_BOL  COU_ORIG_BRA  ...  COU_ORIG_WSM  COU_ORIG_YEM  COU_ORIG_YUCS  COU_ORIG_YYY  COU_ORIG_ZAF  COU_ORIG_ZMB  COU_ORIG_ZWE  COU_DEST_AUS  COU_DEST_AUT  COU_DEST_BEL  COU_DEST_CAN  COU_DEST_CHE  COU_DEST_CHL  COU_DEST_CZE  COU_DEST_DEU  COU_DEST_DNK  COU_DEST_ESP  COU_DEST_EST  COU_DEST_FIN  COU_DEST_FRA 

In [8]:
# create the final processed data
# retaining only the relevant information

columns = ["Year", "Gender"]
columns = np.append(columns, gender_encoder.get_feature_names_out(["Gender"]))
columns = np.append(columns, ["COU_ORIG", "COU_DEST", "Origin Country", "Destination Country"])
columns = np.append(columns, countries_encoder.get_feature_names_out(["COU_ORIG", "COU_DEST"]))
columns = np.append(columns, [
                              "Origin Country Population", 
                              "Origin Country Population Scaled",
                              "Destination Country Population",
                              "Destination Country Population Scaled",
                              "Immigrant Stock Processed",
                              "Immigrant Stock Processed Scaled",
                              "Immigrant Stock Imputed Indicator"
])

# take an explicit copy so the Year assignment below operates on an independent DataFrame
final_data = processed_data[columns].copy()

# make sure the year remains an integer value
final_data["Year"] = final_data["Year"].astype("int16")

In [9]:
# display a sample of the final data
print(
    "A sample of final processed data is \n {}".format(
      final_data[0:10]  
    )
  )

A sample of final processed data is 
    Year Gender  Gender_MEN  Gender_WMN COU_ORIG COU_DEST Origin Country Destination Country  COU_ORIG_AFG  COU_ORIG_AGO  COU_ORIG_ALB  COU_ORIG_AND  COU_ORIG_ARE  COU_ORIG_ARG  COU_ORIG_ARM  COU_ORIG_ATG  COU_ORIG_AUS  COU_ORIG_AUT  COU_ORIG_AZE  COU_ORIG_BDI  COU_ORIG_BEL  COU_ORIG_BEN  COU_ORIG_BFA  COU_ORIG_BGD  COU_ORIG_BGR  COU_ORIG_BHR  COU_ORIG_BHS  COU_ORIG_BIH  COU_ORIG_BLR  COU_ORIG_BLZ  COU_ORIG_BMU  COU_ORIG_BOL  COU_ORIG_BRA  COU_ORIG_BRB  COU_ORIG_BRN  COU_ORIG_BTN  COU_ORIG_BWA  COU_ORIG_CAF  COU_ORIG_CAN  COU_ORIG_CHE  ...  COU_DEST_AUS  COU_DEST_AUT  COU_DEST_BEL  COU_DEST_CAN  COU_DEST_CHE  COU_DEST_CHL  COU_DEST_CZE  COU_DEST_DEU  COU_DEST_DNK  COU_DEST_ESP  COU_DEST_EST  COU_DEST_FIN  COU_DEST_FRA  COU_DEST_GBR  COU_DEST_GRC  COU_DEST_HUN  COU_DEST_IRL  COU_DEST_ISL  COU_DEST_ISR  COU_DEST_ITA  COU_DEST_LUX  COU_DEST_LVA  COU_DEST_MEX  COU_DEST_NLD  COU_DEST_NOR  COU_DEST_NZL  COU_DEST_POL  COU_DEST_PRT  COU_DEST_SVK  COU_DEST_S

In [10]:
# save the processed data
final_data.to_parquet("migration_dataset_fully_processed.parquet")