# Introduction

In order to streamline and automate the data science lifecycle, we have implemented a customized data pipeline that automates the below stages of the lifecycle:

1. Preprocessing/Wrangling
2. Feature Engineering
3. Handling class imbalance
4. Modelling/Making Predictions

# Loading the required libraries

Our pipeline uses the below libraries to preprocess the data and make predictions.

## Installing the required libraries

To ensure that no errors arise due to package version mismatches, it is highly recommended to run the below cell to install the libraries.

__NOTE:__ In order to run the below commands in the command line interface/terminal (without the leading "!", which is only needed inside a notebook), please ensure the below:

- You have Python version 3.10 installed (https://www.python.org/downloads/release/python-3100/)
- You have a standard pip accompanying it (https://pip.pypa.io/en/stable/installation/)
- Python is added to the PATH of your Operating System (https://geek-university.com/add-python-to-the-windows-path/)

In [None]:
!pip install pandas==1.4.4
!pip install category_encoders
!pip install xgboost==1.6.2
!pip install imbalanced-learn==0.9.1
!pip install joblib
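
After installation, it can help to confirm which versions were actually picked up. The snippet below is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package names are the PyPI distribution names assumed for the libraries used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# PyPI distribution names (imbalanced-learn is the distribution that provides imblearn)
required = ["pandas", "category_encoders", "xgboost", "imbalanced-learn", "joblib"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed, or with a version different from the pinned one, is a likely source of errors further down.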

## Importing the required libraries after installation

Once the libraries are available to use, we need to load them in the current working environment. Please use the below cell to import all the required methods/classes from the libraries that will be used by the pipeline.

In [155]:
import pandas as pd
from sklearn.compose import ColumnTransformer
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from category_encoders.target_encoder import TargetEncoder
from xgboost import XGBClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from imblearn.over_sampling import SMOTE
from joblib import dump, load
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

# Reading the training dataset

We will be first reading the training dataset that will be used to train the pipeline.

Please note that the training dataset CSV file needs to be included in the same path as the ipynb notebook for the below code to work as intended.
Also, please note that the below cell works only if the dataset is a CSV file.

Please provide the path to the training dataset here inside pd.read_csv(""). Use the below code as reference.

In [569]:
try:
    data = pd.read_csv("A2_change_job_labeled.csv")  # labeled training data
    test_data = pd.read_csv("A2_change_job_submission.csv")  # test data
    kaggle_data = pd.read_csv("A2_change_job_kaggle.csv")  # Kaggle data
except Exception as e:
    print(f"An error occurred while reading the dataset: {e}. Please check that the CSV files exist at the given paths.")
    raise  # stop here, since the cells below cannot run without the data

## Verifying the data loaded

Please run the below code that shows the first 5 rows of the dataset as a means of verification that the data is read properly.

In [570]:
data.head()

Unnamed: 0,cid,city.100,city.10,city.105,city.103,city.104,city.102,city.101,city.2119,city.114,...,education_level,academic_discipline,experience,company_size,company_type,last_new_job,city_development_index,relevant_experience,training_hours,target
0,7124,0,0,0,0,0,0,0,0,0,...,Undergraduate,STEM,15,100-500,NGO,1,0.942116,Has relevant experience,21,no
1,51,0,0,1,0,0,0,0,0,0,...,Undergraduate,STEM,4,50-99,Funded startup,1,0.219561,No relevant experience,92,yes
2,13137,0,0,0,0,0,0,0,0,0,...,High School,STEM,8,50-99,Pvt Ltd,>4,0.499002,Has relevant experience,21,no
3,2769,0,0,0,0,0,0,0,0,0,...,Undergraduate,STEM,2,50-99,Pvt Ltd,never,0.351297,No relevant experience,114,no
4,8374,0,0,0,0,0,0,0,0,0,...,PhD,STEM,7,5000-9999,Public sector,1,0.626747,Has relevant experience,24,yes


In [571]:
test_data.head()

Unnamed: 0,cid,city.100,city.10,city.105,city.103,city.104,city.102,city.101,city.2119,city.114,...,education_level,academic_discipline,experience,company_size,company_type,last_new_job,city_development_index,relevant_experience,training_hours,target
0,12321,0,0,0,0,1,0,0,0,0,...,Masters,STEM,>20,5000-9999,Pvt Ltd,>4,0.942116,Has relevant experience,26,no
1,941,0,0,0,0,0,0,0,0,0,...,Masters,STEM,>20,100-500,Pvt Ltd,2,0.834331,Has relevant experience,78,no
2,17715,0,0,0,0,0,0,0,0,0,...,Undergraduate,STEM,>20,Oct-49,Pvt Ltd,2,0.922156,Has relevant experience,32,no
3,6540,0,0,0,0,1,0,0,0,0,...,Undergraduate,Business degree,11,5000-9999,Pvt Ltd,2,0.942116,Has relevant experience,7,no
4,6760,0,0,0,0,0,0,0,0,0,...,Undergraduate,STEM,5,100-500,Pvt Ltd,never,0.351297,Has relevant experience,62,no


In [572]:
kaggle_data.head()

Unnamed: 0,cid,city.100,city.10,city.105,city.103,city.104,city.102,city.101,city.2119,city.114,...,education_level,academic_discipline,experience,company_size,company_type,last_new_job,city_development_index,relevant_experience,training_hours,target
0,1,0,0,0,0,1,0,0,0,0,...,Undergraduate,STEM,>20,500-999,Other,1,0.942116,Has relevant experience,36,no
1,3,0,0,0,0,0,0,0,0,0,...,Undergraduate,STEM,5,50-99,Pvt Ltd,never,0.351297,No relevant experience,83,no
2,7,0,0,0,0,0,0,0,0,0,...,High School,No major,5,50-99,Funded startup,1,0.942116,Has relevant experience,24,no
3,12,0,0,0,0,1,0,0,0,0,...,Undergraduate,STEM,5,5000-9999,Pvt Ltd,1,0.942116,Has relevant experience,108,no
4,15,0,0,0,0,0,0,0,0,0,...,High School,STEM,5,<10,Early stage startup,never,0.351297,No relevant experience,26,no


# Extracting the features and target from the data

After the data is read successfully, we need to extract the features __X__ which will be the input columns to the pipeline and the target __y__, which is the response/target variable we are trying to predict.

Please run the below cell to perform the extraction.

__NOTE:__ Please ensure that the target variable is named as "target" in your dataset

In [655]:
# Please ensure that your target variable is named as "target"
try:
    final_columns = [] # Used for keeping track of the final columns in the output of the pipeline


    X = data.drop(["target"],axis = 1) # Original feature data
    X_mirror = data.drop(["target"],axis = 1) # Used as a version which will keep track of the final data coming from the pipeline
    y = data.loc[:,"target"] # Target variable

    # Doing the same for the test data
    X_test = test_data.drop(["target"],axis = 1) # Original feature data
    X_mirror_test = test_data.drop(["target"],axis = 1) # Used as a version which will keep track of the final data coming from the pipeline
    y_test = test_data.loc[:,"target"] # Target variable

    X_kaggle = kaggle_data.drop(["target"],axis = 1) # Original feature data
    X_mirror_kaggle = kaggle_data.drop(["target"],axis = 1) # Used as a version which will keep track of the final data coming from the pipeline
    y_kaggle = kaggle_data.loc[:,"target"] # Target variable


except Exception as e:

    print(f"There was an error extracting features/target from the data. Error message: {e}. It could be possible that the target variable doesn't exist in the dataset. Attempting to extract based on the assumption that the last variable in the dataset is the target variable.")

    X = data.iloc[:,0:-1] # Original feature data
    X_mirror = data.iloc[:,0:-1] # Used as a version which will keep track of the final data coming from the pipeline
    y = data.iloc[:,-1] # Target variable

    # Doing the same for the test data
    X_test = test_data.iloc[:,0:-1] # Original feature data
    X_mirror_test = test_data.iloc[:,0:-1] # Used as a version which will keep track of the final data coming from the pipeline
    y_test = test_data.iloc[:,-1] # Target variable

    X_kaggle = kaggle_data.iloc[:,0:-1] # Original feature data
    X_mirror_kaggle = kaggle_data.iloc[:,0:-1] # Used as a version which will keep track of the final data coming from the pipeline
    y_kaggle = kaggle_data.iloc[:,-1] # Target variable
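
The extraction logic above (use the "target" column if present, otherwise fall back to the last column) can also be written as a small reusable function. This is only a sketch: the name `split_features_target` and the toy data are hypothetical and not part of the pipeline.

```python
import pandas as pd


def split_features_target(df, target="target"):
    """Split df into (X, y); fall back to the last column if `target` is absent."""
    if target in df.columns:
        return df.drop(columns=[target]), df[target]
    # Fallback mirrors the except-branch above: treat the last column as the target
    return df.iloc[:, :-1], df.iloc[:, -1]


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({"cid": [1, 2], "training_hours": [21, 92], "target": ["no", "yes"]})
X_toy, y_toy = split_features_target(toy)
print(list(X_toy.columns), list(y_toy))
```

Checking for the column explicitly, rather than catching any exception, keeps unrelated errors from silently triggering the fallback.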

# Pipeline

We have defined a number of custom transformer classes below, each extending the scikit-learn base classes (BaseEstimator, TransformerMixin). They implement additional logic on top of the base classes, catering to our specific needs.

Please run all the cells below where the custom transformer classes are defined.

### Custom Label Encoder

We defined a custom label encoder that plays the role of scikit-learn's LabelEncoder, but encodes certain variables in an order based on our domain knowledge of their categories. For instance, enrolled_university is encoded as follows: "No enrollment": 0, "Part time": 1, "Full time": 2.

In [666]:
# Inheriting from (BaseEstimator, TransformerMixin) makes this class compatible with scikit-learn's Pipelines
class CustomLabelEncoder(BaseEstimator, TransformerMixin):
    # initializer
    def __init__(self, columns = None):
        # save the features list internally in the class
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        global X_mirror
        global final_columns

        # Encode the ordered categories using mappings based on domain knowledge
        X["enrolled_university_encoded"] = X.enrolled_university.replace({"No enrollment": 0, "Part time":1, "Full time":2})

        X["relevant_experience_encoded"] = X.relevant_experience.replace({"Has relevant experience": 1, "No relevant experience":0})

        X["last_new_job_encoded"] = X.last_new_job.replace({"1": 1, "2": 2, "3": 3, "4": 4, ">4": 5, "never": 0})

        X["experience_regrouped_encoded"] = X_mirror.experience_regrouped.replace({"<1": 1,"1-3":2,"3-7":3,"7-14":4,">20":5 })

        X_mirror["enrolled_university_encoded"] = X["enrolled_university_encoded"]
        X_mirror["relevant_experience_encoded"] = X["relevant_experience_encoded"]
        X_mirror["last_new_job_encoded"] = X["last_new_job_encoded"]
        X_mirror["experience_regrouped_encoded"] = X["experience_regrouped_encoded"]

        # Returning only the necessary columns from here
        return_cols = []

        for col in fs_cols:
            if col in X.columns:
                return_cols.append(col)


        final_columns += return_cols

        return X[return_cols]
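
For comparison, the same kind of fixed-mapping encoding can be written without global variables. The sketch below is an alternative, not the class used in the pipeline: `MappingEncoder` and its `mappings` parameter are hypothetical names. Note that it uses `Series.map`, which turns unmapped categories into NaN, whereas `replace` above leaves them unchanged.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MappingEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace categories using fixed per-column mappings."""

    def __init__(self, mappings=None):
        # mappings: {column name: {category: integer code}}
        self.mappings = mappings

    def fit(self, X, y=None):
        # Nothing is learned from the data; the mappings are fixed up front
        return self

    def transform(self, X):
        X_out = X.copy()  # leave the caller's DataFrame untouched
        for col, mapping in (self.mappings or {}).items():
            # Series.map leaves unmapped categories as NaN
            X_out[col + "_encoded"] = X_out[col].map(mapping)
        return X_out


demo = pd.DataFrame({"enrolled_university": ["No enrollment", "Full time", "Part time"]})
encoder = MappingEncoder(
    mappings={"enrolled_university": {"No enrollment": 0, "Part time": 1, "Full time": 2}}
)
encoded = encoder.fit_transform(demo)
print(encoded["enrolled_university_encoded"].tolist())  # [0, 2, 1]
```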

### Custom One Hot Encoder

We defined a custom one-hot encoder that wraps scikit-learn's OneHotEncoder. This allowed us to one-hot encode certain variables and also set the required variable names for the one-hot encoded columns.

In [667]:
# Inheriting from (BaseEstimator, TransformerMixin) makes this class compatible with scikit-learn's Pipelines
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    # initializer
    def __init__(self, columns = None):
        # save the features list internally in the class
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        # return the dataframe with the specified features
        X_copy = X.copy()
        global X_mirror
        global final_columns

        # Creating a OneHotEncoder instance.

        ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)

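        # Fitting the encoder to the input list of features and transforming them in one step.
        # NOTE: the encoder is re-fitted on every call to transform(), so the categories
        # are learned from whatever data is passed in (training or new data).
        data_ohe = pd.DataFrame(ohe.fit_transform(X[list(self.columns)]), columns=ohe.get_feature_names_out())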

        # Resetting the index lost during transformation
        data_ohe.index = X_copy.index

        # Adding the newly generated one-hot encoded columns to the original data
        X = pd.concat([X,data_ohe],axis=1)

        # Replicating the same for the mirror data
        X_mirror = pd.concat([X_mirror,data_ohe],axis=1)

        # Returning only the necessary columns from here
        return_cols = []

        for col in fs_cols:
            if col in X.columns:
                return_cols.append(col)

        final_columns += return_cols
        return X[return_cols]

### Feature Engineering

We defined a feature engineering transformer that performs __binning__: the original categories of the experience variable are regrouped into coarser bins according to the conditions defined in the _transform_experience_ function.

In [668]:
class FeatureEngineering(BaseEstimator, TransformerMixin):
    # initializer
    def __init__(self, columns = None):
        # save the features list internally in the class
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform_experience(self, row):
        """
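        Regroup the experience category into bins of increasing width (roughly
        exponential), which suits the experience variable.
        :param row: a single value of the experience column, e.g. '<1', '5' or '>20'
        :return: the label of the bin the value falls into, or None if it matches no bin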
        """
        try:
            if int(row) >= 1 and int(row) <= 3:
                return '1-3'
            elif int(row) > 3 and int(row) <= 7:
                return '3-7'
            elif int(row) > 7 and int(row) <= 14:
                return '7-14'
            elif int(row) > 14 and int(row) <= 20:
                return '>20'  # 15-20 years are grouped under the '>20' label

        except ValueError:
            if row == '<1':
                return '<1'
            elif row == '>20':
                return '>20'

    def transform(self, X, y = None):
        # return the dataframe with the specified features
        global final_columns
        global X_mirror
        final_columns = []

        X['experience_regrouped'] = X.experience.apply(self.transform_experience)
        X_mirror['experience_regrouped'] = X['experience_regrouped']

        # Returning only the necessary columns from here
        return_cols = []

        for col in fs_cols:
            if col in X.columns:
                return_cols.append(col)

        final_columns += return_cols
        return X[return_cols]
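
To see how the binning behaves on individual values, the rules can be tried out in isolation. The function below is a standalone copy of the logic for illustration only; the name `bin_experience` is hypothetical.

```python
def bin_experience(value):
    """Standalone copy of the binning rules, for illustration only."""
    try:
        years = int(value)
    except ValueError:
        # Non-numeric categories: only '<1' and '>20' are recognised
        return value if value in ("<1", ">20") else None
    if 1 <= years <= 3:
        return "1-3"
    if 3 < years <= 7:
        return "3-7"
    if 7 < years <= 14:
        return "7-14"
    if 14 < years <= 20:
        return ">20"  # 15-20 years share the '>20' label, as in the class above
    return None  # values outside 1-20 are not assigned a bin


for raw in ["<1", "2", "7", "14", "15", ">20"]:
    print(raw, "->", bin_experience(raw))
```

The boundaries are inclusive on the upper end, so "7" falls into '3-7' and "14" into '7-14'.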

### Custom Target Encoder

We defined a custom target encoder that wraps the TargetEncoder of category_encoders. This allowed us to target encode certain variables and also set the required variable names for the target-encoded columns.

In [669]:
# Inheriting from (BaseEstimator, TransformerMixin) makes this class compatible with scikit-learn's Pipelines
class CustomTargetEncoder(BaseEstimator, TransformerMixin):
    # initializer
    def __init__(self, columns, y):
        # save the features list internally in the class
        self.columns = columns
        self.y = y

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        global final_columns
        te=TargetEncoder()
        for col in (self.columns + ["experience_regrouped"]):
            X_mirror[col + "_te"]=te.fit_transform(X_mirror[col],self.y)
            X[col + "_te"] = X_mirror[col + "_te"]

        # Returning only the necessary columns from here
        return_cols = []

        for col in fs_cols:
            if col in X.columns:
                return_cols.append(col)

        final_columns += return_cols
        return X[return_cols]
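
As a reminder of what target encoding does, the sketch below replaces each category with the mean of the target within that category, using plain pandas on made-up data. This is the unsmoothed version of the idea; the TargetEncoder in category_encoders additionally smooths the category means toward the overall mean, so its output will generally differ from these numbers.

```python
import pandas as pd

# Toy data (values are made up for illustration)
toy = pd.DataFrame({
    "company_type": ["Pvt Ltd", "Pvt Ltd", "NGO", "NGO", "NGO"],
    "target": [1, 0, 1, 1, 0],
})

# Mean of the target within each category
category_means = toy.groupby("company_type")["target"].mean()

# Replace each category by its mean target value
toy["company_type_te"] = toy["company_type"].map(category_means)
print(toy)
```

Here "Pvt Ltd" is encoded as 1/2 and "NGO" as 2/3, the share of positive targets in each group.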

### Custom Feature Selector

We defined a feature selector transformer which selects features based on a pre-determined best-feature list (generated using 6 different feature selection techniques). These features will later be used for the final modelling.

Please refer to the report for in-depth explanation of how the feature selection process was employed.

In [670]:
# Inheriting from (BaseEstimator, TransformerMixin) makes this class compatible with scikit-learn's Pipelines
class FeatureSelector(BaseEstimator, TransformerMixin):
    # initializer
    def __init__(self, columns = None):
        # save the features list internally in the class
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        global X_mirror
        X_mirror = pd.DataFrame()

        # Returning only the specific features given to it
        return_cols = []
        global final_columns
        for col in fs_cols:
            if col in X.columns:
                return_cols.append(col)


        final_columns += return_cols
        return X[return_cols]

## Pipeline Stage I: Column Transformer

We bundled all the transformations that are applied to specific sets of features by the custom classes (defined earlier) into a single ColumnTransformer, which forms the first stage of the pipeline.

In [690]:
# Defining the columns which will be passed to specific stages of the Column Transformer
le_cols = ["enrolled_university","relevant_experience","last_new_job"]
ohe_cols = ["gender","academic_discipline","company_type","company_size","education_level"]
fe_cols = ["experience"]
te_cols = [col for col in X.columns if X[col].dtype == 'object']

fs_cols = [
           'city_development_index',
           'enrolled_university_encoded',
           'company_type_Pvt Ltd',
           'relevant_experience_encoded',
           'education_level_Undergraduate',
            'company_size_10000+',
           ]


# Bundle preprocessing for all the transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('feature_engineering', FeatureEngineering(),fe_cols),
        ('custom_label_encoder', CustomLabelEncoder(columns=le_cols), le_cols),
        ('onehot', CustomOneHotEncoder(columns=ohe_cols), ohe_cols),
        ('feature_selector', FeatureSelector(columns=fs_cols), list(X.columns)),
    ], remainder="drop")

## Pipeline Stage II: Handling class imbalance using SMOTE

The minority class in the target variable is oversampled using the SMOTE algorithm, which forms the second stage of the pipeline.

In [673]:
smote = SMOTE(random_state=1211)
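
The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below shows that interpolation step on made-up points. It is a simplification, not the imblearn implementation: it always uses the single nearest neighbour, whereas SMOTE picks among several nearest neighbours, and the function name `smote_like_sample` is hypothetical.

```python
import numpy as np


def smote_like_sample(minority, rng):
    """Create one synthetic point between a random minority sample and its nearest neighbour."""
    i = rng.integers(len(minority))
    base = minority[i]
    # Euclidean distance from the chosen sample to every minority sample
    distances = np.linalg.norm(minority - base, axis=1)
    distances[i] = np.inf  # exclude the sample itself
    neighbour = minority[np.argmin(distances)]
    # Interpolate at a random position on the segment base -> neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)


rng = np.random.default_rng(1211)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])
synthetic = smote_like_sample(minority, rng)
print(synthetic)
```

Because the new point lies on a line segment between two existing minority samples, it stays inside the region already occupied by the minority class.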

## Pipeline Stage III: Modelling using Extreme Gradient Boosting Tree algorithm

The final stage of the pipeline uses an XGBoost classifier with hyperparameters tuned using HyperOpt's Bayesian optimization.

In [691]:
model = XGBClassifier(**{'max_depth': 16, 'learning_rate':0.13, 'n_estimators':350, 'objective':'binary:logistic',
         'booster':'gbtree',
         'n_jobs':1,
         'nthread':None,
         'gamma':7.2,
         'min_child_weight':1,
         'max_delta_step':0,
         'subsample':1,
         'colsample_bytree':0.7,
         'colsample_bylevel':1,
         'reg_alpha':0,
         'reg_lambda':0.5,
         'scale_pos_weight':1,
         'base_score':0.5,
         'random_state':0,
         'seed':None,
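         'missing':1})  # NOTE: 'missing' is the value XGBoost treats as missing, so entries equal to 1 are handled as missing values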

## Bundling all the stages as a pipeline

Finally, we are bundling all the stages as steps in the pipeline defined below. The stages/steps in the pipeline will be performed in the following sequence:

Stages:
1. preprocessor
2. class_balancing_SMOTE
3. model

In [692]:
# Bundle preprocessing and modeling code in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('class_balancing_SMOTE', smote),
                           ('model', model)
                           ])

# Fitting the pipeline on the given data

Having defined the pipeline, we will now fit it on the given data. But first, we need to encode the target variable from "yes" and "no" to 1 and 0 respectively. Please note that this could not be performed inside the pipeline, because the transformers in a scikit-learn pipeline only transform the features X and not the target y, and hence it had to be performed manually.

### Performing 5-split stratified shuffle-split validation and computing the mean F1 score

In [693]:
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

nfold_f1 = []
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_valid = X.iloc[train_index,:], X.iloc[test_index,:]
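    y_train, y_valid = y.iloc[train_index], y.iloc[test_index]

    # Encoding the target as 1/0; replace() returns a new Series, leaving y unchanged
    y_train = y_train.replace({"yes": 1, "no": 0})
    y_valid = y_valid.replace({"yes": 1, "no": 0})
    pipeline.fit(X_train, y_train)
    # f1_score expects the true labels first, then the predictions
    nfold_f1.append(f1_score(y_valid, pipeline.predict(X_valid)))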

print(np.mean(nfold_f1))

TRAIN: [ 8118  2797  9401 ... 11355  1827  8332] TEST: [10648  2600  4735 ...   822  1627  9292]
TRAIN: [ 742  711 1898 ... 2005 2262 5392] TEST: [ 9022  1821  4843 ... 10862  8175  2241]
TRAIN: [ 5352  6183  7785 ...  3517  7811 10844] TEST: [ 5953 10250  9056 ...  5052  3837   397]
TRAIN: [ 2561  1997 10476 ...  3330  1496 11030] TEST: [ 6188 10595  6015 ...  6682 11358 11032]
TRAIN: [6739  845 1630 ... 2547 4452 1935] TEST: [ 616  512 2657 ... 3525 5929 3245]
0.5475946136062944
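
The value printed above is the mean, over the five splits, of the F1 score of the positive class, i.e. the harmonic mean of precision and recall. The sketch below computes it from scratch on made-up labels to make the definition concrete; `f1_from_labels` is a hypothetical helper, not the scikit-learn function.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F1 score of the positive class from two label sequences."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_from_labels(y_true, y_pred))
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.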


### Fitting pipeline on complete dataset

In [683]:
y.replace({"yes": 1, "no":0}, inplace = True)
pipeline.fit(X,y)

## Optional: Persisting the pipeline and loading it

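This optional code persists the pipeline to a file, which can later be loaded on any machine without having to retrain the pipeline on the given data. Please note that the custom transformer classes and the global variables they rely on (such as fs_cols) must be defined in the session that loads the file.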

In [684]:
dump(pipeline, 'pipeline.joblib') # Dump the pipeline to a file with extension .joblib
pipeline_loaded = load('pipeline.joblib') # Load the stored pipeline using the filename

__Testing the loaded pipeline by calculating the F1 score on the training data__

In [685]:
f1_score(y, pipeline_loaded.predict(X))  # f1_score expects the true labels first

0.5540495867768594

# Making predictions on test dataset and exporting the predictions

Having trained the pipeline on the training dataset, we will now make our predictions on the test dataset and export them.

### Generating the predictions file from Test Data

In [695]:
predictions =  pd.concat([X_test.cid,pd.Series(pipeline.predict(X_test),name= "target").replace({1:"yes",0:"no"})],axis=1)

print(predictions.head())

predictions.to_csv("pred_labels.csv",index= False, header = True)

     cid target
0  12321     no
1    941     no
2  17715     no
3   6540     no
4   6760    yes
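
The concatenation above pairs ids and predictions by index label, which works because the test data has the default 0..n-1 index. If the rows had been filtered or reordered beforehand, the labels would no longer line up. The sketch below pairs the two by position instead; `build_predictions` is a hypothetical helper and the values are made up.

```python
import numpy as np
import pandas as pd


def build_predictions(ids, predicted, label_map=None):
    """Pair ids with predictions by position, optionally mapping codes back to labels."""
    target = pd.Series(np.asarray(predicted), name="target")
    if label_map is not None:
        target = target.map(label_map)
    # Plain arrays carry no index, so rows are paired strictly by position
    return pd.DataFrame({"cid": np.asarray(ids), "target": target.to_numpy()})


# Toy ids with a non-default index, as would result from filtering rows
ids = pd.Series([12321, 941, 17715], index=[10, 20, 30], name="cid")
preds = np.array([0, 0, 1])
print(build_predictions(ids, preds, label_map={1: "yes", 0: "no"}))
```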


### Generating the predictions from Kaggle data

In [694]:
predictions =  pd.concat([X_kaggle.cid,pd.Series(pipeline.predict(X_kaggle),name= "target")],axis=1)

print(predictions.head())

predictions.to_csv("pred_labels_kaggle_tuned.csv",index= False, header = True)

   cid  target
0    1       0
1    3       1
2    7       0
3   12       0
4   15       1


# References

- Creating custom scikit-learn Transformers. (n.d.). Andrew Villazon. Retrieved October 23, 2022, from https://www.andrewvillazon.com/custom-scikit-learn-transformers/