# Analytica - Feature Engineering

## Import Libraries

In [3]:
## Import core libraries

# For data
import pandas as pd
import numpy as np

# For train/test splitting and scaling processes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# For modelling (specifically for generating the constant column as part of FE)
import statsmodels.api as sm

## Importing Data

In [5]:
## Read the data set

# Create data frame from the WHO data
who = pd.read_csv("https://raw.githubusercontent.com/Ale42RA/WHO_analytica/refs/heads/main/life_expectancy_data.csv")

In [7]:
## Preview the data frame

who.head()

Unnamed: 0,Country,Region,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,...,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
0,Turkiye,Middle East,2015,11.1,13.0,105.824,1.32,97,65,27.8,...,97,0.08,11006,78.53,4.9,4.8,7.8,0,1,76.5
1,Spain,European Union,2015,2.7,3.3,57.9025,10.35,97,94,26.0,...,97,0.09,25742,46.44,0.6,0.5,9.7,1,0,82.8
2,India,Asia,2007,51.5,67.9,201.0765,1.57,60,35,21.2,...,64,0.13,1076,1183.21,27.1,28.0,5.0,0,1,65.4
3,Guyana,South America,2006,32.8,40.5,222.1965,5.68,93,74,25.3,...,93,0.79,4146,0.75,5.7,5.5,7.9,0,1,67.0
4,Israel,Middle East,2012,3.4,4.3,57.951,2.89,97,89,27.0,...,94,0.08,33995,7.91,1.2,1.1,12.8,1,0,81.7


## Train/Test Splitting

In [11]:
## Separate features and target for train/test splitting

# Split the target from the columns
feature_cols = list(who.columns)
feature_cols.remove('Life_expectancy')

# Create X (features), and y (target) variables.
X = who[feature_cols]
y = who['Life_expectancy']

In [13]:
## Using the train-test split function from sklearn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify= who['Region'])

In [15]:
## Visually check that the number of records between X and y (across Train and Test) both match

print(f"X_train:   {X_train.shape}\ny_train:   {y_train.shape}\n")
print(f"X_test:    {X_test.shape}\ny_test:    {y_test.shape}")

X_train:   (2291, 20)
y_train:   (2291,)

X_test:    (573, 20)
y_test:    (573,)


In [17]:
## Assert that the indices and number of records between X and y (across Train and Test) both match

# Indices
assert(all(X_train.index == y_train.index)), "There is some index mismatch in Train"
assert(all(X_test.index == y_test.index)), "There is some index mismatch in Test"

# Number of records
assert(X_train.shape[0] == y_train.shape[0]), "There is some records mismatch in Train"
assert(X_test.shape[0] == y_test.shape[0]), "There is some records mismatch in Test"

## Feature Engineering

### Feature Selection

We removed the continuous numerical columns that have a non-linear relationship with the target:
- Alcohol_consumption
- Hepatitis_B
- Thinness_five_nine_years
- Thinness_ten_nineteen_years
- Measles
- Population_mln

Incidents_HIV and GDP_per_capita showed an exponential relationship with the target, so we applied a log transformation to each to make the relationship linear.

|Feature|Correlation Before Transformation|Correlation After Transformation|
|-|-|-|
|Incidents_HIV|-0.56|0.69|
|GDP_per_capita|0.58|0.79|
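
The effect of a log transformation on a linear correlation can be illustrated with a minimal sketch. The data below is synthetic and the names `gdp`, `target`, and `demo` are assumptions made for illustration; it is not the WHO data and does not reproduce the figures in the table.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the WHO data):
# the target depends linearly on log(gdp), so raw gdp is curved against it.
rng = np.random.default_rng(0)
gdp = np.exp(rng.uniform(np.log(500), np.log(80000), size=500))
target = 40 + 4 * np.log(gdp) + rng.normal(0, 2, size=500)

demo = pd.DataFrame({"gdp": gdp, "target": target})
demo["gdp_log"] = np.log(demo["gdp"])

# Pearson correlation with the target before and after the log transform
corr_raw = demo["gdp"].corr(demo["target"])
corr_log = demo["gdp_log"].corr(demo["target"])
print(f"raw: {corr_raw:.2f}, log: {corr_log:.2f}")
```

Because Pearson correlation only measures linear association, the log-transformed column correlates more strongly with the target in this sketch than the raw column does.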

We initially started with all the features that have a linear relationship with the target, plus the non-continuous features. We then removed the following features:

|Feature|Reason|
|-|-|
|Year & Status|Neither improved RMSE performance, and removing them reduced the condition number (Cond. No.).|
|Polio & Diphtheria|The two are very strongly correlated (0.95). Diphtheria on its own did not improve RMSE, and Polio on its own did not decrease the condition number.|
|Infant_deaths|Very strongly correlated with Under_five_deaths (0.98). It was removed to reduce multicollinearity because, of the two, it had the lesser impact on reducing RMSE.|
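
The link between strongly correlated features and a large condition number can be sketched as follows. This is an illustration on synthetic data, not the analysis behind the table: the arrays `under_five`, `infant`, and `bmi` are hypothetical stand-ins, and `np.linalg.cond` is used here on the assumption that it behaves comparably to the Cond. No. reported in a statsmodels summary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-ins for two strongly correlated features
under_five = rng.normal(0, 1, size=n)
infant = under_five + rng.normal(0, 0.05, size=n)  # nearly a copy of under_five
bmi = rng.normal(0, 1, size=n)                     # unrelated feature
const = np.ones(n)

# Design matrices with and without the near-duplicate column
X_both = np.column_stack([const, under_five, infant, bmi])
X_dropped = np.column_stack([const, under_five, bmi])

cond_both = np.linalg.cond(X_both)
cond_dropped = np.linalg.cond(X_dropped)
print(f"with both: {cond_both:.1f}, one dropped: {cond_dropped:.1f}")
```

Dropping one column of a near-collinear pair brings the condition number down sharply in this sketch, which is the behaviour the removals above rely on.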

Additional note:
- Under_five_deaths and Adult_mortality are strongly correlated (0.79)

We investigated reducing this correlation by combining the two features; however, this made the RMSE worse, so we decided not to pursue it.

### Model Scaler Variables

In [26]:
## Feature Engineering process

# Create standard scaler variables     
standard_scaler_bmi = StandardScaler()
standard_scaler_schooling = StandardScaler()

# Create minmax scaler variables
minmax_scaler_gdp = MinMaxScaler()
minmax_scaler_hiv = MinMaxScaler()

# Create robust scaler variables
robust_scaler_under_five = RobustScaler()
robust_scaler_adult_mortality = RobustScaler()

### Custom Function

In [29]:
## Feature Engineering process

# Custom function for all FE processes
def feature_eng(who, train=False):
        # Create copy of the data frame
        who = who.copy()
       
        # One Hot Encoding (OHE)
        who = pd.get_dummies(who, columns = ['Region'], drop_first = True, prefix = 'Region', dtype = int)

        # Converting features into logarithm
        who['Incidents_HIV_log'] = np.log(who['Incidents_HIV'])
        who['GDP_per_capita_log'] = np.log(who['GDP_per_capita'])

        # Fit the scalers and transform when training
        if train:    
            # Normally distributed feature: BMI
            who[['BMI']] = standard_scaler_bmi.fit_transform(who[['BMI']])
        
            # Normally distributed feature: Schooling
            who[['Schooling']] = standard_scaler_schooling.fit_transform(who[['Schooling']])
        
            # MinMax scaling for bounded features
            who[['GDP_per_capita_log']] = minmax_scaler_gdp.fit_transform(who[['GDP_per_capita_log']])
            who[['Incidents_HIV_log']] = minmax_scaler_hiv.fit_transform(who[['Incidents_HIV_log']])
        
            # Robust scaling for features with outliers
            who[['Under_five_deaths']] = robust_scaler_under_five.fit_transform(who[['Under_five_deaths']])
            who[['Adult_mortality']] = robust_scaler_adult_mortality.fit_transform(who[['Adult_mortality']])

        # Only transform (using the already-fitted scalers) when not training
        else:
            # Normally distributed feature: BMI
            who[['BMI']] = standard_scaler_bmi.transform(who[['BMI']])
        
            # Normally distributed feature: Schooling
            who[['Schooling']] = standard_scaler_schooling.transform(who[['Schooling']])
        
            # MinMax scaling for bounded features
            who[['GDP_per_capita_log']] = minmax_scaler_gdp.transform(who[['GDP_per_capita_log']])
            who[['Incidents_HIV_log']] = minmax_scaler_hiv.transform(who[['Incidents_HIV_log']])
        
            # Robust scaling for features with outliers
            who[['Under_five_deaths']] = robust_scaler_under_five.transform(who[['Under_five_deaths']])
            who[['Adult_mortality']] = robust_scaler_adult_mortality.transform(who[['Adult_mortality']])
            
            
        # Constant (intercept) column required for statsmodels. Must always be present
        who = sm.add_constant(who)

        # Return the results
        return who
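
The fit-on-train, transform-only-on-test pattern used in `feature_eng` can be seen in isolation in the minimal sketch below. The arrays `train_vals` and `test_vals` are made-up values chosen for illustration, not WHO data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (an assumption for illustration)
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[2.0], [4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_scaled = scaler.fit_transform(train_vals)

# transform reuses the training statistics; nothing is learned from the test data
test_scaled = scaler.transform(test_vals)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

Since the test value 2.0 equals the training mean, it is scaled to 0, while 4.0 lands outside the range of the scaled training values. Fitting on the test set as well would leak information about it into the preprocessing step.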

### Train/Test Transformation

In [32]:
## Transform the X train and test data with the feature engineering

X_train_fe = feature_eng(X_train, True)
X_test_fe = feature_eng(X_test)
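
Because `feature_eng` calls `pd.get_dummies` separately on each data set, the train and test frames only end up with the same dummy columns if every Region appears in both. The stratified split makes that likely here, but a guard can be sketched as below. The frames `train` and `test` are hypothetical toy data, and reindexing to the training columns is one possible approach, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy frames: the test set is missing the "Oceania" region
train = pd.DataFrame({"Region": ["Asia", "Africa", "Oceania"], "BMI": [21.0, 23.5, 26.1]})
test = pd.DataFrame({"Region": ["Asia", "Africa"], "BMI": [22.0, 24.0]})

train_ohe = pd.get_dummies(train, columns=["Region"], drop_first=True, dtype=int)
test_ohe = pd.get_dummies(test, columns=["Region"], drop_first=True, dtype=int)

# Align the test columns to the training columns; missing dummies are filled with 0
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(train_ohe.columns))
print(list(test_aligned.columns))
```

A simple check such as `assert list(X_train_fe.columns) == list(X_test_fe.columns)` would flag a mismatch before modelling.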

### Quality Checks

In [35]:
## Visually check the transformation

X_train_fe.head()

Unnamed: 0,const,Country,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,...,Region_Asia,Region_Central America and Caribbean,Region_European Union,Region_Middle East,Region_North America,Region_Oceania,Region_Rest of Europe,Region_South America,Incidents_HIV_log,GDP_per_capita_log
682,1.0,Luxembourg,2000,3.7,-0.333631,-0.449104,13.21,77,86,0.260232,...,0,0,1,0,0,0,0,0,0.233255,0.970644
1094,1.0,Lebanon,2002,15.1,-0.101695,-0.502102,2.06,77,28,0.673138,...,0,0,0,1,0,0,0,0,0.090235,0.573123
2325,1.0,Comoros,2011,60.5,1.049063,0.541184,0.11,83,64,-0.519702,...,0,0,0,0,0,0,0,0,0.090235,0.319012
524,1.0,Moldova,2014,13.7,-0.128457,0.070414,7.4,92,92,0.856652,...,0,0,0,0,0,0,1,0,0.466509,0.438908
2822,1.0,Tunisia,2005,18.8,-0.021409,-0.418259,1.25,97,99,0.214353,...,0,0,0,0,0,0,0,0,0.143019,0.470362


In [37]:
## Check that the X_train_fe contains no nulls and that all data types are ready for modelling

# Null values
print(f"Sum of nulls: {sum(X_train_fe.isnull().sum())}")

# Data types
X_train_fe.dtypes

Sum of nulls: 0


const                                   float64
Country                                  object
Year                                      int64
Infant_deaths                           float64
Under_five_deaths                       float64
Adult_mortality                         float64
Alcohol_consumption                     float64
Hepatitis_B                               int64
Measles                                   int64
BMI                                     float64
Polio                                     int64
Diphtheria                                int64
Incidents_HIV                           float64
GDP_per_capita                            int64
Population_mln                          float64
Thinness_ten_nineteen_years             float64
Thinness_five_nine_years                float64
Schooling                               float64
Economy_status_Developed                  int64
Economy_status_Developing                 int64
Region_Asia                             

*Note: Country has the object data type, which is not suitable for modelling. Since this feature is not used in the models, it did not need converting.*

## Feature Columns for each model

### Feature Columns for the 'Precise' model

In [42]:
## Precise model feature columns used

feature_cols_pre = [
 'const',
 'Under_five_deaths',
 'Adult_mortality',
 'BMI',
 'Schooling',
 'Region_Asia',
 'Region_Central America and Caribbean',
 'Region_European Union',
 'Region_Middle East',
 'Region_North America',
 'Region_Oceania',
 'Region_Rest of Europe',
 'Region_South America',
 'Incidents_HIV_log',
 'GDP_per_capita_log'
 ]

### Feature Columns for the 'Minimalistic' model

In [45]:
## Minimalistic model feature columns used

feature_cols_min = [
 'const',
 'Under_five_deaths',
 'Adult_mortality',
 'BMI'
 ]

#### Ethical Considerations

Starting from the precise feature columns, the following features were left out to minimise the ethical implications of this model:

- Region
- Incidents_HIV_log
- GDP_per_capita_log

The reasoning is that using these features could have the following effects:

- Reinforces Stereotypes: 
    - These features can perpetuate harmful stereotypes and lead to discrimination based on geography, health status, or economic background.
- Ethical Concerns: 
    - HIV incidence is sensitive data, so using it could raise privacy concerns, and relying on GDP could reinforce economic inequality.