# Machine Learning Pipeline - Feature Engineering

The following notebooks implement each step of the Machine Learning pipeline below.

Machine Learning Pipeline:


1. Data Analysis
2. ***Feature Engineering***
3. Feature Selection
4. Model Training
5. Obtaining Predictions/Scoring

This notebook focuses on Feature Engineering.


> Dataset Source: The dataset comes from [Kaggle](https://www.kaggle.com/datasets/overload10/adult-census-dataset?resource=download), as per the project requirement. See below for more details:

===================================================================================================================

## Predicting Adult Census Income

> The aim of this project is to build a machine learning model that predicts the class of adult census income, i.e., whether a sample falls under >50K or <=50K, based on different explanatory variables describing each sample.

### Why is this important?

> Predicting the class of adult census income would benefit financial institutions and could improve their profitability. It would also help consumer-based services target the right consumers.

### What is the objective of the machine learning model?

1. To perform in-depth exploratory data analysis of the dataset.
2. To engineer new predictive features from the available variables.
3. To develop a supervised model to classify census income into >50K and <=50K.
4. To recommend a classification threshold that performs better in terms of F1 score.
5. To create an API endpoint for the trained model and deploy it.


### Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important to set the seed for each step that includes some element of randomness.
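
As a minimal sketch of what a fixed seed provides, the snippet below shuffles the same array with two NumPy generators created from the same seed. The seed value `0` and the variable names are assumptions chosen for illustration; the notebook itself passes `random_state = 0` to `train_test_split`.

```python
import numpy as np

values = np.arange(10)

# Two generators created with the same seed produce the same shuffle
first = np.random.default_rng(seed=0).permutation(values)
second = np.random.default_rng(seed=0).permutation(values)

print(np.array_equal(first, second))
```

Without a fixed seed, each run would be free to produce a different order, and any result that depends on that order would change between runs.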


In [1]:
#Import Required Libraries

#to handle dataset
import pandas as pd
import numpy as np

#for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
#setting the style to white
sns.set_style('white')

#for the yeo-johnson transformation
import scipy.stats as stats

#to divide train and test dataset
from sklearn.model_selection import train_test_split

#feature scaling
from sklearn.preprocessing import MinMaxScaler

#to persist the fitted scaler
import joblib

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [2]:
#load dataset
df = pd.read_csv('..\\Dataset\\adult.csv')

#rename columns
df.rename(columns = {'fnlwgt': 'final_weight', 'education-num' : 'education_num', 
                    'marital-status' : 'marital_status', 'capital-gain' : 'capital_gain'
                    , 'capital-loss' : 'capital_loss', 'hours-per-week' : 'hours_per_week'}, inplace = True)

#rows and columns of the data
print(df.shape)

#visualize the dataset
df.head()

(32561, 15)


Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



### Separate dataset into train and test

It is important to separate the data into a training set and a testing set.

When features are engineered, some techniques learn parameters from the data. It is important to learn these parameters only from the train set. This keeps information from the test set out of the training process and helps to avoid over-fitting.

Our feature engineering techniques will learn:

    - mean
    - mode
    - exponents for the yeo-johnson
    - category frequency
    - and category to number mappings

from the train set.

Separating the data into train and test sets involves randomness; therefore, we need to set the seed.
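
The idea of learning a parameter from the train set only can be sketched on toy data. The frames and their values below are hypothetical and are not taken from the census file; the mode stands in for any learned parameter.

```python
import pandas as pd

# Toy data (hypothetical values)
train = pd.DataFrame({"workclass": ["Private", "Private", None, "State-gov"]})
test = pd.DataFrame({"workclass": [None, "State-gov", "State-gov"]})

# Learn the parameter (here the mode) from the train set only
train_mode = train["workclass"].mode()[0]

# Reuse that same value on both sets
train_filled = train["workclass"].fillna(train_mode)
test_filled = test["workclass"].fillna(train_mode)

print(train_mode)
print(test_filled.tolist())
```

In this toy case the test set's own mode would be a different label, which is exactly the information that should not influence the transformation.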


In [3]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['salary'], axis = 1), 
                                                    df['salary'], 
                                                    test_size = 0.1, 
                                                    random_state = 0,)

In [4]:
print(X_train.shape, X_test.shape)

(29304, 14) (3257, 14)



### Feature Engineering

In the following cells, we will engineer the variables of the Salary Dataset so that we tackle:

    1. Missing values
    2. Non-Gaussian distributed variables
    3. Categorical variables: remove rare labels
    4. Categorical variables: convert strings to numbers
    5. Put the variables in a similar scale

### Missing Values

In [5]:
#In the Analysis notebook, we observed that the categorical variables have extra spaces in their values
#Let's remove those spaces
#Extract the categorical variables
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O' and var != 'salary']

print("Total number of categorical variables: " + str(len(cat_vars)))
print("Column names : " + str(cat_vars))

# remove every whitespace character from the string values,
# then replace any value that contains '?' with NaN
X_train = X_train.replace(r'\s', '', regex=True)
X_train = X_train.replace(r'\?', np.nan, regex=True)
for col in cat_vars:
    print(col)
    print(X_train[col].unique())

Total number of categorical variables: 8
Column names : ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'country']
workclass
['Private' 'Self-emp-inc' 'Self-emp-not-inc' 'Federal-gov' nan 'Local-gov'
 'State-gov' 'Never-worked' 'Without-pay']
education
['7th-8th' 'Masters' 'Prof-school' 'Bachelors' 'HS-grad' '11th'
 'Assoc-acdm' 'Some-college' '5th-6th' 'Doctorate' '1st-4th' 'Assoc-voc'
 '10th' '12th' '9th' 'Preschool']
marital_status
['Divorced' 'Married-civ-spouse' 'Widowed' 'Never-married'
 'Married-spouse-absent' 'Separated' 'Married-AF-spouse']
occupation
['Sales' 'Exec-managerial' 'Prof-specialty' 'Machine-op-inspct'
 'Craft-repair' 'Other-service' 'Adm-clerical' 'Farming-fishing' nan
 'Transport-moving' 'Priv-house-serv' 'Handlers-cleaners' 'Tech-support'
 'Protective-serv' 'Armed-Forces']
relationship
['Unmarried' 'Husband' 'Not-in-family' 'Own-child' 'Other-relative' 'Wife']
race
['White' 'Black' 'Asian-Pac-Islander' 'Other' 'Amer-Ind
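
The two `replace` calls in the cell above remove every whitespace character and turn any value that contains `?` into `NaN`. A narrower variant, sketched here on a toy column with hypothetical values, strips only the surrounding whitespace and treats only cells that are exactly `?` as missing. It is an alternative for illustration, not what this notebook runs.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the raw file, where values carry a leading space
raw = pd.Series([" Private", " ?", " State-gov"], name="workclass")

# Strip surrounding whitespace, then treat cells that are exactly "?" as missing
cleaned = raw.str.strip().replace("?", np.nan)

print(cleaned.tolist())
print(cleaned.isna().sum())
```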

In [6]:
#apply the same clean-up to the test set
# remove whitespace from the string values, then replace any value that contains '?' with NaN
X_test = X_test.replace(r'\s', '', regex=True)
X_test = X_test.replace(r'\?', np.nan, regex=True)
for col in cat_vars:
    print(col)
    print(X_test[col].unique())

workclass
['Private' 'Local-gov' nan 'Federal-gov' 'Self-emp-not-inc' 'Self-emp-inc'
 'State-gov' 'Without-pay']
education
['Some-college' 'Bachelors' 'Assoc-acdm' '5th-6th' '11th' 'HS-grad'
 'Assoc-voc' 'Masters' 'Doctorate' 'Prof-school' '10th' '7th-8th'
 '1st-4th' '9th' '12th' 'Preschool']
marital_status
['Divorced' 'Never-married' 'Married-civ-spouse' 'Separated'
 'Married-spouse-absent' 'Widowed' 'Married-AF-spouse']
occupation
['Adm-clerical' 'Prof-specialty' 'Sales' 'Transport-moving'
 'Other-service' 'Exec-managerial' nan 'Protective-serv' 'Craft-repair'
 'Machine-op-inspct' 'Farming-fishing' 'Handlers-cleaners' 'Tech-support'
 'Priv-house-serv' 'Armed-Forces']
relationship
['Unmarried' 'Not-in-family' 'Husband' 'Own-child' 'Other-relative' 'Wife']
race
['White' 'Amer-Indian-Eskimo' 'Black' 'Asian-Pac-Islander' 'Other']
sex
['Female' 'Male']
country
['United-States' nan 'India' 'Mexico' 'El-Salvador' 'England' 'France'
 'Germany' 'Peru' 'Philippines' 'South' 'Cuba' 'Canada' 'Ho

In [7]:
# make a list of the categorical variables that contain missing values
# .isnull().sum() counts the null values in each column
cat_vars_with_na = [
    var for var in cat_vars
    if X_train[var].isnull().sum() > 0
]

# print the fraction of missing values per variable
# .isnull().mean() gives (number of nulls) / (number of rows)
print(X_train[cat_vars_with_na].isnull().mean().sort_values(ascending = False))

occupation    0.056375
workclass     0.056136
country       0.017950
dtype: float64


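The percentage of missing values is below 6% for every variable (about 5.6% for occupation and workclass, and about 1.8% for country). The missing values could be imputed with the mode of each column, but we first analyze how the missing values relate to the other variables. If the values are missing at random, NA can be replaced with the mode of the column; if they are not missing at random, a dedicated category is more appropriate.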
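
One way to study how missing values in two columns relate to each other is to cross-tabulate their missing-value indicators. The sketch below uses a toy frame with hypothetical values; the names `toy` and `co_missing` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", np.nan, np.nan, "State-gov", "Never-worked"],
    "occupation": ["Sales", np.nan, np.nan, "Adm-clerical", np.nan],
})

# Rows: is workclass missing? Columns: is occupation missing?
co_missing = pd.crosstab(
    toy["workclass"].isna().rename("workclass_missing"),
    toy["occupation"].isna().rename("occupation_missing"),
)
print(co_missing)
```

If most of the count sits on the diagonal, the two variables tend to be missing together, which suggests the missingness is not random.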

In [8]:
# The % of missing values in occupation is higher than in the other variables.
# Let's check whether the missing values of occupation, workclass and country are related

missing_df = df[df['occupation'] == ' ?']
m_df = missing_df[['workclass', 'country', 'education', 'sex', 'salary']].drop_duplicates()

print('\n Country : \n')
print(m_df.groupby(['country', 'workclass','salary']).agg('count').sort_values(by = 'country',ascending = False))

print('\n Education : \n')
print(m_df.groupby(['education', 'workclass','salary']).agg('count').sort_values(by = 'education',ascending = False))

print('\n Sex : \n')
print(m_df.groupby(['sex', 'workclass','salary']).agg('count').sort_values(by = 'sex',ascending = False))
# print('\n Workclass : \n')
# print(m_df.groupby('workclass').agg('count'))
#print('Workclass : ' + na_m['workclass'].unique())
# print('\n Occupation : \n')
# print(df['occupation'].unique())


 Country : 

                                          education  sex
country             workclass     salary                
 Vietnam             ?             <=50K          3    3
 United-States       Never-worked  <=50K          5    5
                     ?             >50K          19   19
                                   <=50K         31   31
 Trinadad&Tobago     ?             <=50K          1    1
 Thailand            ?             <=50K          1    1
 Taiwan              ?             >50K           1    1
                                   <=50K          5    5
 South               ?             <=50K          5    5
                                   >50K           1    1
 Scotland            ?             >50K           1    1
 Puerto-Rico         ?             <=50K          5    5
 Portugal            ?             <=50K          3    3
 Poland              ?             >50K           1    1
                                   <=50K          3    3
 Philippines     

**Observation:**
- From the distribution above, it is hard to conclude that the values were lost while the dataset was being built: the rows with missing values lean towards salary <=50K for most countries and towards adults with average education qualifications, and they are evenly distributed between male and female. We therefore assume that these adults opted out of disclosing their workclass and occupation. Thus, missing workclass is coded as 'Never-worked', missing occupation as 'No-Occupation' and missing country as 'Missing'.

In [9]:
# assign the result back instead of filling in place on a column selection
X_train['occupation'] = X_train['occupation'].fillna('No-Occupation')
X_train['workclass'] = X_train['workclass'].fillna('Never-worked')
X_train['country'] = X_train['country'].fillna('Missing')

In [10]:
# same fill values as for the train set
X_test['occupation'] = X_test['occupation'].fillna('No-Occupation')
X_test['workclass'] = X_test['workclass'].fillna('Never-worked')
X_test['country'] = X_test['country'].fillna('Missing')

In [11]:
print('X_train\n' + str(X_train[cat_vars_with_na].isna().sum()) + '\n')

print('X_test\n' + str(X_test[cat_vars_with_na].isna().sum()) + '\n')

X_train
workclass     0
occupation    0
country       0
dtype: int64

X_test
workclass     0
occupation    0
country       0
dtype: int64



There are no missing values left in the training and testing datasets.

### Numerical Variable Transformation

#### Yeo-Johnson Transformation


In [12]:
#Let's apply transformations to final_weight and hours_per_week
# the yeo-johnson transformation learns the best exponent to transform the variable
# it needs to learn it from the train set: 
X_train['final_weight'], param0 = stats.yeojohnson(X_train['final_weight'])
print(param0)
X_train['hours_per_week'], param1 = stats.yeojohnson(X_train['hours_per_week'])
print(param1)

# and then apply the transformation to the test set with the same
# parameter: see how this time we pass param as argument to the
# yeo-johnson
X_test['final_weight'] = stats.yeojohnson(X_test['final_weight'], lmbda=param0)

X_test['hours_per_week'] = stats.yeojohnson(X_test['hours_per_week'],lmbda=param1)


0.42370200361598553
1.0070128126235112
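
For reference, the standard Yeo-Johnson definition for a given exponent can be written in a few lines of NumPy. The function name `yeo_johnson` and the toy `hours` array are assumptions for illustration; the notebook itself relies on `stats.yeojohnson`, and this sketch only evaluates the formula without estimating the exponent.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of an array for a given exponent lmbda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values
    if abs(lmbda - 2) > 1e-12:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

hours = np.array([10.0, 40.0, 60.0])
print(yeo_johnson(hours, 1.0))  # an exponent of 1 leaves the values unchanged
print(yeo_johnson(hours, 0.0))  # an exponent of 0 gives log(x + 1)
```

Because an exponent of 1 returns the input unchanged, the exponent of about 1.007 printed above for `hours_per_week` alters that variable only slightly, whereas the exponent of about 0.42 for `final_weight` compresses its large values considerably.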


#### Logarithmic transformation

In [13]:
X_train['age'] = np.log(X_train['age'])
X_test['age'] = np.log(X_test['age'])

In [14]:
# check absence of na in the train set
print([var for var in X_train.columns if X_train[var].isnull().sum() > 0])

# check absence of na in the test set
print([var for var in X_test.columns if X_test[var].isnull().sum() > 0])

[]
[]


#### Binarize skewed variables

A few variables are very skewed; we transform those into binary variables.
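
A quick way to see why a binary flag suits such a variable is to check how many of its values are exactly zero. The sketch below uses a toy column with hypothetical values; `share_zero` and `has_gain` are illustrative names.

```python
import pandas as pd

# Toy capital_gain column (hypothetical values): mostly zeros, a few large amounts
capital_gain = pd.Series([0, 0, 0, 0, 2174, 0, 0, 14084, 0, 0])

# Share of observations that are exactly zero
share_zero = (capital_gain == 0).mean()

# Binary version: 0 when the value is zero, 1 otherwise
has_gain = (capital_gain != 0).astype(int)

print(share_zero)
print(has_gain.tolist())
```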


In [15]:
skewed = ['capital_gain','capital_loss']

for var in skewed:
    
    # map the variable values into 0 and 1
    X_train[var] = np.where(X_train[var]==0, 0, 1)
    X_test[var] = np.where(X_test[var]==0, 0, 1)

In [16]:
X_train[X_train['workclass']=='Without-pay']

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country
29158,4.204693,Without-pay,352.157062,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,Asian-Pac-Islander,Male,0,0,12.150742,Philippines
25500,3.367296,Without-pay,424.444546,Some-college,10,Married-civ-spouse,Farming-fishing,Own-child,White,Male,0,0,66.501569,United-States
28829,4.219508,Without-pay,390.379978,Some-college,10,Married-spouse-absent,Farming-fishing,Unmarried,White,Female,0,0,25.422614,United-States
15533,3.044522,Without-pay,441.123566,HS-grad,9,Never-married,Craft-repair,Own-child,Black,Male,0,0,40.795678,United-States
32262,4.127134,Without-pay,375.935076,Some-college,10,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,16.227348,United-States
27747,4.276666,Without-pay,333.790095,HS-grad,9,Married-civ-spouse,Other-service,Husband,White,Male,0,0,56.209166,United-States
21944,3.951244,Without-pay,412.012797,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,30.54142,United-States
15695,3.091042,Without-pay,493.137184,HS-grad,9,Never-married,Handlers-cleaners,Own-child,White,Male,1,0,40.795678,United-States
20073,4.174387,Without-pay,388.71203,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,1,0,20.310749,United-States
22215,2.944439,Without-pay,216.37103,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,10.115601,United-States


In [17]:
#workclass - 'Without-pay' will be mapped to 'Never-worked'
#education and education_num represent the same information. Hence, rare education levels such as 1st-4th will be grouped together.
#marital_status - 'Married-AF-spouse' will be mapped to 'Rare'.
#similarly, rare labels of the other variables will be mapped to 'Rare'.
#if the predictive model performs poorly, we will revisit this section as well.

In [18]:
def identify_frequent_labels(df, var, rare_perc):
    # share of observations per label of the variable, e.g. for workclass:
    #     Federal-gov         0.029483
    #     Local-gov           0.064279
    #     Never-worked        0.000215
    tmp = df.groupby(var)[var].count() / len(df)

    # return the labels whose share is above rare_perc
    return tmp[tmp > rare_perc].index

for var in cat_vars:

    # frequent labels are learned from the train set only
    frequent_ls = identify_frequent_labels(X_train, var, 0.01)
    print(var, frequent_ls)
    print()

    # rare workclass labels are mapped to 'Never-worked';
    # rare labels of every other variable are replaced by the string 'Rare'
    rare_label = 'Never-worked' if var == 'workclass' else 'Rare'

    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], rare_label)
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], rare_label)

workclass Index(['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc',
       'Self-emp-not-inc', 'State-gov'],
      dtype='object', name='workclass')

education Index(['10th', '11th', '12th', '5th-6th', '7th-8th', '9th', 'Assoc-acdm',
       'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
       'Prof-school', 'Some-college'],
      dtype='object', name='education')

marital_status Index(['Divorced', 'Married-civ-spouse', 'Married-spouse-absent',
       'Never-married', 'Separated', 'Widowed'],
      dtype='object', name='marital_status')

occupation Index(['Adm-clerical', 'Craft-repair', 'Exec-managerial', 'Farming-fishing',
       'Handlers-cleaners', 'Machine-op-inspct', 'No-Occupation',
       'Other-service', 'Prof-specialty', 'Protective-serv', 'Sales',
       'Tech-support', 'Transport-moving'],
      dtype='object', name='occupation')

relationship Index(['Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried',
       'Wife'],
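
The grouping of rare labels can be followed more easily on a toy example. The values and the 20% threshold below are assumptions for illustration (the notebook uses 1%), and `value_counts(normalize=True)` is used here as a compact way to get the label shares.

```python
import pandas as pd

# Toy train/test columns (hypothetical values)
train = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"], name="label")
test = pd.Series(["a", "c", "d"], name="label")

# Share of each label, learned from the train set only
freq = train.value_counts(normalize=True)
frequent = freq[freq > 0.2].index

# Keep frequent labels; everything else, including unseen labels, becomes "Rare"
train_grouped = train.where(train.isin(frequent), "Rare")
test_grouped = test.where(test.isin(frequent), "Rare")

print(sorted(frequent))
print(test_grouped.tolist())
```

Note that a label that never appears in the train set (`"d"` here) also ends up in the rare group, so the test set cannot introduce categories the model has not seen.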
    

In [19]:
for var in cat_vars:
    print(X_train[var].value_counts().sort_values(ascending = False))
    print("\n")
    print(X_test[var].value_counts().sort_values(ascending = False))
    print("\n")

workclass
Private             20459
Self-emp-not-inc     2282
Local-gov            1884
Never-worked         1665
State-gov            1165
Self-emp-inc          989
Federal-gov           860
Name: count, dtype: int64


workclass
Private             2237
Self-emp-not-inc     259
Local-gov            209
Never-worked         192
State-gov            133
Self-emp-inc         127
Federal-gov          100
Name: count, dtype: int64


education
HS-grad         9478
Some-college    6559
Bachelors       4788
Masters         1563
Assoc-voc       1249
11th            1049
Assoc-acdm       951
10th             835
7th-8th          584
Prof-school      527
9th              468
12th             399
Doctorate        356
5th-6th          301
Rare             197
Name: count, dtype: int64


education
HS-grad         1023
Some-college     732
Bachelors        567
Masters          160
Assoc-voc        133
11th             126
Assoc-acdm       116
10th              98
7th-8th           62
Doctorate      

#### Encoding of categorical variables

Next, we need to transform the strings of the categorical variables into numbers.

We do this so that the integers capture the monotonic relationship between each label and the target: labels with a lower share of >50K samples receive smaller numbers.
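
A minimal sketch of this target-ordered encoding on toy data follows. The frames and their values are hypothetical; `mapping` plays the role of the category-to-number dictionary learned from the train set.

```python
import pandas as pd

# Toy train set: one categorical feature and a binary target (hypothetical values)
train = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Masters"],
    "salary": [0, 0, 0, 1, 1],
})
test = pd.DataFrame({"education": ["Masters", "HS-grad", "Doctorate"]})

# Order the categories by their mean target, lowest first
ordered = train.groupby("education")["salary"].mean().sort_values().index

# Map each category to its rank in that order
mapping = {label: rank for rank, label in enumerate(ordered)}

train["education"] = train["education"].map(mapping)
test["education"] = test["education"].map(mapping)

print(mapping)
print(test["education"].tolist())
```

A category that is absent from the train set (`"Doctorate"` in this toy case) has no entry in the mapping and becomes `NaN`, which is why it is worth checking for missing values again after encoding.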

In [20]:
# this function assigns discrete values to the strings of the variables,
# so that the smaller value corresponds to the category with the smaller
# share of salaries above 50K

def replace_categories(train, test, y_train, var, target):
    # binary target (0 for <=50K, 1 otherwise), aligned with the train index
    y_bin = pd.Series(np.where(y_train == ' <=50K', 0, 1),
                      index=train.index, name=target)
    tmp = pd.concat([train, y_bin], axis=1)

    # order the categories in a variable from that with the lowest
    # share of salaries above 50K, to that with the highest
    ordered_labels = tmp.groupby([var])[target].mean().sort_values().index

    # create a dictionary of ordered categories to integer values
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

    print(var, ordinal_label)
    print()

    # use the dictionary to replace the categorical strings by integers
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)

In [21]:
for var in cat_vars:
    replace_categories(X_train, X_test, y_train, var, 'salary')

workclass {'Never-worked': 0, 'Private': 1, 'State-gov': 2, 'Self-emp-not-inc': 3, 'Local-gov': 4, 'Federal-gov': 5, 'Self-emp-inc': 6}

education {'Rare': 0, '9th': 1, '11th': 2, '5th-6th': 3, '10th': 4, '7th-8th': 5, '12th': 6, 'HS-grad': 7, 'Some-college': 8, 'Assoc-acdm': 9, 'Assoc-voc': 10, 'Bachelors': 11, 'Masters': 12, 'Prof-school': 13, 'Doctorate': 14}

marital_status {'Never-married': 0, 'Separated': 1, 'Married-spouse-absent': 2, 'Widowed': 3, 'Divorced': 4, 'Rare': 5, 'Married-civ-spouse': 6}

occupation {'Rare': 0, 'Other-service': 1, 'Handlers-cleaners': 2, 'No-Occupation': 3, 'Farming-fishing': 4, 'Machine-op-inspct': 5, 'Adm-clerical': 6, 'Transport-moving': 7, 'Craft-repair': 8, 'Sales': 9, 'Tech-support': 10, 'Protective-serv': 11, 'Prof-specialty': 12, 'Exec-managerial': 13}

relationship {'Own-child': 0, 'Other-relative': 1, 'Unmarried': 2, 'Not-in-family': 3, 'Husband': 4, 'Wife': 5}

race {'Rare': 0, 'Black': 1, 'White': 2, 'Asian-Pac-Islander': 3}

sex {'Female'

In [22]:
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]

In [23]:
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

[]

In [24]:
#drop education_num from both train and test set, since education carries the same information
X_train.drop(['education_num'], axis = 1, inplace = True)
X_test.drop(['education_num'], axis = 1, inplace = True)

#### Feature Scaling
For use in linear models, features need to be scaled. We will scale each feature using the minimum and maximum values learned from the train set:
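
The calculation behind min-max scaling can be sketched with plain NumPy. The toy arrays below are hypothetical; with its default settings, `MinMaxScaler` applies the same `(x - min) / (max - min)` formula per feature.

```python
import numpy as np

# Toy feature values (hypothetical)
train = np.array([20.0, 30.0, 40.0, 60.0])
test = np.array([25.0, 70.0])

# Learn the minimum and maximum from the train set only
lo, hi = train.min(), train.max()

# Apply the same parameters to both sets
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

Since the parameters come from the train set, a test value outside the train range (70 here) is mapped outside the interval from 0 to 1.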

In [25]:
# create scaler
scaler = MinMaxScaler()

#  fit  the scaler to the train set
scaler.fit(X_train) 

# transform the train and test set

# sklearn returns numpy arrays, so we wrap the
# array with a pandas dataframe

X_train = pd.DataFrame(
    scaler.transform(X_train),
    columns=X_train.columns
)

X_test = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_train.columns
)

In [26]:
X_train.head()

Unnamed: 0,age,workclass,final_weight,education,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country
0,0.482644,0.166667,0.352022,0.357143,0.666667,0.692308,0.4,0.666667,0.0,0.0,0.0,0.395685,0.666667
1,0.528238,1.0,0.279148,0.857143,1.0,1.0,0.8,0.666667,1.0,0.0,0.0,0.497826,0.666667
2,0.704502,0.5,0.440719,0.928571,0.5,0.923077,0.6,0.666667,1.0,0.0,0.0,0.548951,0.666667
3,0.813899,0.5,0.272839,0.785714,1.0,0.692308,0.8,0.666667,1.0,0.0,0.0,0.395685,0.666667
4,0.49823,0.166667,0.282357,0.5,0.0,0.384615,0.6,0.333333,0.0,0.0,0.0,0.518272,0.666667


#### Target

In [27]:
y_train = pd.Series(np.where(y_train == ' <=50K', 0, 1), name = 'salary')

y_traini = X_train.index
y_train.index = y_traini

y_test = pd.Series(np.where(y_test == ' <=50K', 0, 1), name = 'salary')

y_testi = X_test.index
y_test.index = y_testi

y_test.head()

0    0
1    0
2    0
3    0
4    1
Name: salary, dtype: int32
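
The comparison with `' <=50K'` works because the raw labels carry a leading space. A slightly more defensive variant, sketched here on a toy series with hypothetical values, strips the whitespace before comparing; `naive` shows what happens when the space is forgotten.

```python
import numpy as np
import pandas as pd

# Toy target with the leading space found in the raw file
y_raw = pd.Series([" <=50K", " >50K", " <=50K"], name="salary")

# Comparing without the leading space matches nothing, so every row becomes 1
naive = np.where(y_raw == "<=50K", 0, 1)

# Strip whitespace first so the comparison does not depend on the leading space
y_bin = pd.Series(
    np.where(y_raw.str.strip() == "<=50K", 0, 1),
    index=y_raw.index,
    name="salary",
)

print(naive.tolist())
print(y_bin.tolist())
```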

In [28]:
# let's now save the train and test sets for the next notebook!

X_train.to_csv('xtrain.csv', index=False)
X_test.to_csv('xtest.csv', index=False)

y_train.to_csv('ytrain.csv', index=False)
y_test.to_csv('ytest.csv', index=False)

In [29]:
# now let's save the scaler

joblib.dump(scaler, 'minmax_scaler.joblib') 

['minmax_scaler.joblib']

This concludes the Feature Engineering Section.