# Building a Modelling Pipeline

One of the most important steps in machine learning is building a pipeline that lets you, as a data scientist, experiment, experiment, and experiment. A good pipeline saves time, makes it quick to run many experiments, and reduces the bugs in the ML systems you want to deploy. Let's begin:

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, FunctionTransformer, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipe
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score

pd.set_option('display.max_columns', None)

## Checking the data and data types

One of the first things we will do is check that pandas interprets the data types in our CSV correctly. Why do we care? Not all types take the same space in memory: a column with dtype `object` takes much more memory than one with dtype `int`. The right dtypes also let us perform type-specific operations. Let's read the file and set some of these dtypes explicitly.
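To make the memory argument concrete, here is a minimal sketch on a made-up column (the values and the variable names are assumptions, not taken from the real dataset). It stores the same numbers once as strings and once as integers and compares the footprint.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: the same scores stored as strings and as integers.
scores = np.arange(1_000) % 100
as_object = pd.Series(scores.astype(str), dtype="object")
as_int = pd.Series(scores, dtype="int64")

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
int_bytes = as_int.memory_usage(deep=True)

print(f"object: {object_bytes} bytes, int64: {int_bytes} bytes")
```

On the real frame, `df.memory_usage(deep=True)` gives the same per-column breakdown.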

In [2]:
date_cols = ['your_birthdate']
columns_scheme = {'your_dept_code_resides':'int',
                 'your_municipality_code_resides':'int',
                 'inst_institution_code':'int',
                 'your_prgm_municipality_code':'int',
                 'your_inst_municipality_code':'int',
                 'your_inst_department_code':'int',
                  'score_language_saber_11':'int',
                  'score_mathematics_saber_11':'int',
                  'score_biology_saber_11':'int',
                  'score_chemistry_saber_11':'int',
                  'score_physics_saber_11':'int',
                  'score_social_science_saber_11':'int',
                  'score_philosophy_saber_11':'int',
                  'score_english_saber_11':'int',
                 'score_optative_saber_11':'int'}

In [3]:
df = pd.read_csv("../input/saberpro-preprocessed/saber_combined_preprocessed.csv", 
                 parse_dates=date_cols,
                dtype=columns_scheme)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212010 entries, 0 to 212009
Data columns (total 98 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   your_type_of_document            212010 non-null  object 
 1   your_nationality                 212010 non-null  object 
 2   your_gender                      212010 non-null  object 
 3   your_birthdate                   212010 non-null  object 
 4   your_foreigner                   212010 non-null  object 
 5   period                           212010 non-null  int64  
 6   your_consecutive                 212010 non-null  object 
 7   your_marital_status              212010 non-null  object 
 8   your_student                     212010 non-null  object 
 9   your_country_resides             212010 non-null  object 
 10  your_have_ethnicity              212010 non-null  object 
 11  your_dept_resides                212010 non-null  object 
 12  yo

In [4]:
df['your_undergraduate_core'].value_counts()

ADMINISTRATION                                      35790
LAW                                                 15263
PUBLIC ACCOUNTING                                   14334
EDUCATION                                           14219
INDUSTRIAL ENGINEERING                              10660
                                                    ...  
NUTRITION AND DIET                                    264
AGRICULTURAL, FOREST ENGINEERING                      244
PHISICS                                               191
TRAINING RELATED TO THE MILITARY OR POLICE FIELD      168
REPRESENTATIVE ARTS                                   164
Name: your_undergraduate_core, Length: 64, dtype: int64

In [5]:
df['your_undergraduate_core'].value_counts().index

Index(['ADMINISTRATION', 'LAW', 'PUBLIC ACCOUNTING', 'EDUCATION',
       'INDUSTRIAL ENGINEERING', 'PSYCHOLOGY',
       'ENVIRONMENTAL, SANITARY ENGINEERING', 'UNCLASSIFIED',
       'CIVIL ENGINEERING', 'SOCIAL COMMUNICATION, JOURNALISM', 'ECONOMY',
       'COMPUTER SYSTEMS, TELEMATICS ENGINEERING', 'MECHANICAL ENGINEERING',
       'DESIGN', 'ARCHITECTURE', 'SOCIOLOGY, SOCIAL WORK', 'NURSING',
       'ELECTRONIC ENGINEERING, TELECOMMUNICATIONS', 'MEDICINE',
       'SUPERIOR NORMALS', 'THERAPIES', 'MILITARY OR POLICE TRAINING',
       'POLITICAL SCIENCE, INTERNATIONAL RELATIONS', 'CHEMICAL ENGINEERING',
       'AGRONOMY', 'BIOLOGY, MICROBIOLOGY', 'MINING, METALLURGY ENGINEERING',
       'PUBLIC HEALTH', 'OTHER ENGINEERING', 'VETERINARY MEDICINE',
       'ADVERTISING', 'ELECTRICAL ENGINEERING', 'CHEMISTRY',
       'SPORTS, PHYSICAL EDUCATION AND RECREATION',
       'MODERN LANGUAGES, LITERATURE, LINGUISTICS', 'ODONTOLOGY',
       'PLASTIC ARTS, VISUAL ARTS', 'SURGICAL INSTRUMENTATION', '

In [6]:
replace_dict = {'AGRICULTURAL, FORESTRY ENGINEERING':'AGRICULTURAL, FOREST ENGINEERING',
               'AGROINDUSTRIAL ENGINEERING, FOOD':'AGROINDUSTRIAL AND FOOD ENGINEERING',
               'TRAINING RELATED TO THE MILITARY OR POLICE FIELD':'MILITARY OR POLICE TRAINING',
               'NUTRITION AND DIET':'NUTRITION AND DIETETICS'}
df['your_undergraduate_core'] = df['your_undergraduate_core'].replace(replace_dict)

## Splitting into Train and Test

Awesome, with the data loaded it's time for the next step. Whenever we intend to train an ML model, **the first thing to do, before anything else, is split the data into TRAIN and TEST datasets**.

Why? Because we want our algorithm to generalize. The test data works as a held-out dataset that lets us say with confidence how well the algorithm is likely to perform in production.

We will use scikit-learn's `train_test_split` function and pass it a `random_state` so that the split, and therefore our results, are reproducible.

In [7]:
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)
del df  # Free up RAM
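The reproducibility that `random_state` gives can be checked on a small example. This is only a sketch on a hypothetical toy frame (`toy` and its columns are made up for illustration): two calls with the same seed should select the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({"score": range(10), "label": list("ababababab")})

# Two calls with the same random_state pick exactly the same rows.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

same_split = test_a.index.equals(test_b.index)
print(same_split)
```

With a target as imbalanced as ours, passing the target column through the `stratify` argument of `train_test_split` is another option worth trying, since it keeps the class proportions similar in both splits.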

## Pre-model pipeline

In this first pipeline we define the target (which gives us flexibility in case we want to change it later), select the columns we will use during training, and drop the rest.

In [8]:
columns_to_model =[
    'score_language_saber_11',
      'score_mathematics_saber_11',
      'score_biology_saber_11',
      'score_chemistry_saber_11',
      'score_physics_saber_11',
      'score_social_science_saber_11',
      'score_philosophy_saber_11',
      'score_english_saber_11',
    'optative_field_saber_11']

target = 'your_undergraduate_core'

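class PrepareData(BaseEstimator, TransformerMixin):
    """Split a raw DataFrame into the modelling columns (X) and the target (y)."""
    def __init__(self, columns_to_model, target = 'your_undergraduate_core'):
        self.target = target
        self.columns_to_model = columns_to_model
    def fit(self, X, y=None):
        # Nothing to learn; return self as the scikit-learn API expects
        return self
    def transform(self, X, y=None):
        X = X.copy()
        y = X.loc[:, self.target]
        # Use the columns stored on the instance, not the global variable
        X = X.loc[:, self.columns_to_model]
        return X, y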
    
prepare_data_pipe = Pipeline(steps=[('prepare',
                                     PrepareData(columns_to_model=columns_to_model,
                                                 target=target))])
prepare_data_pipe.fit(X_train)

Pipeline(steps=[('prepare',
                 PrepareData(columns_to_model=['score_language_saber_11',
                                               'score_mathematics_saber_11',
                                               'score_biology_saber_11',
                                               'score_chemistry_saber_11',
                                               'score_physics_saber_11',
                                               'score_social_science_saber_11',
                                               'score_philosophy_saber_11',
                                               'score_english_saber_11',
                                               'optative_field_saber_11']))])

Now we can transform any raw dataset with the same pipeline, which is useful in case we receive more raw data later on.

In [9]:
X_train, y_train = prepare_data_pipe.transform(X_train)
X_test, y_test = prepare_data_pipe.transform(X_test)
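For comparison, a custom transformer that follows the usual scikit-learn convention returns `self` from `fit` and only the features from `transform`. The sketch below uses a hypothetical `ColumnSelector` class and made-up column names. It is not part of the notebook's pipeline, only an illustration of the pattern.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the listed columns."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        # Returning self lets fit(...).transform(...) be chained
        return self
    def transform(self, X):
        return X.loc[:, self.columns].copy()

# Made-up raw data with one column we do not want to model.
raw = pd.DataFrame({"score_a": [56, 72], "score_b": [66, 65], "unused": ["x", "y"]})
selected = ColumnSelector(columns=["score_a", "score_b"]).fit_transform(raw)
print(list(selected.columns))
```

Because `fit` returns `self`, the `fit_transform` method inherited from `TransformerMixin` works without extra code.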

In [10]:
X_train.head(1)

Unnamed: 0,score_language_saber_11,score_mathematics_saber_11,score_biology_saber_11,score_chemistry_saber_11,score_physics_saber_11,score_social_science_saber_11,score_philosophy_saber_11,score_english_saber_11,optative_field_saber_11
139381,56,72,66,65,69,62,56,72,SCORE_DEEPEN_BIOLOGY


## Model Pipeline

In this section we define a training pipeline that can receive any X and y and train on that data. For now we include only a `OneHotEncoder` for the categorical variables and a `StandardScaler` for the numerical ones. The power of these pipelines is that other transformations are relatively easy to add: just append a new tuple containing the name of the step and the scikit-learn transformer.
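As an illustration of how little is needed to add a step, the sketch below puts a `SimpleImputer` in front of the scaler. The toy data and the choice of a median imputer are assumptions made for the example; the notebook's own pipeline does not impute.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing score.
toy = pd.DataFrame({
    "score": [50.0, np.nan, 70.0, 60.0],
    "field": ["BIOLOGY", "MATH", "BIOLOGY", "MATH"],
})

# The extra ('imputer', ...) tuple is the only change needed to add a step.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("numerical", num_pipeline, ["score"]),
    ("categorical", cat_pipeline, ["field"]),
])

# One scaled numeric column plus two one-hot columns.
features = preprocessor.fit_transform(toy)
print(features.shape)
```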

In [11]:
X_train.columns

Index(['score_language_saber_11', 'score_mathematics_saber_11',
       'score_biology_saber_11', 'score_chemistry_saber_11',
       'score_physics_saber_11', 'score_social_science_saber_11',
       'score_philosophy_saber_11', 'score_english_saber_11',
       'optative_field_saber_11'],
      dtype='object')

In [12]:
SMOTE_strategy = {}
for i in y_train.value_counts()[y_train.value_counts()<5000].index:
    SMOTE_strategy[i] = 5000


def ModelPipelineTrain(X_train, y_train, model, model_params):
    '''
    Receives a processed X_train with its target values (y_train) and trains a model
    using SMOTE oversampling.
    
    The best model is found with a randomized search over model_params, a dictionary
    with all the hyperparameters to test.
    
    Returns the fitted RandomizedSearchCV object.
    '''
    
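    # Defining Categorical Pipeline (OneHotEncoder returns a sparse matrix by default)
    cat_pipeline = Pipeline([
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Defining Numerical Pipeline
    num_pipeline = Pipeline([
        ('scaler', StandardScaler())
    ])
     
    # Combine categorical and numerical pipelines.
    # Note: the score columns were read as int, so selector(dtype_include="float")
    # matches none of them and they are one-hot encoded with the categorical column;
    # use dtype_include="number" / dtype_exclude="number" to scale them instead.
    preprocessor = ColumnTransformer([
        ('categorical', cat_pipeline, selector(dtype_exclude="float")),
        ('numerical', num_pipeline, selector(dtype_include="float"))
    ], remainder = 'drop')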
    
    # Fit the processed data with a model
    model_pipe = ImbPipe([
        ('pre-process', preprocessor),
        ('oversampling', SMOTE(random_state=42, n_jobs= -1, sampling_strategy=SMOTE_strategy)),
        ('model', model)
    ])
    
    # Randomized Search
    search = RandomizedSearchCV(model_pipe,
                               model_params,
                               cv = 5,
                               verbose = 50,
                                n_jobs = -1,
                               scoring = 'recall_micro',
                               random_state = 42,
                               n_iter=8,
                               )
    # Fit
    search.fit(X_train, y_train)
    return search
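Which columns each branch of the `ColumnTransformer` receives depends on the dtypes, so it is worth checking what the selectors return. The sketch below uses a hypothetical toy frame with an integer score, a float score, and a string column to show how `dtype_include="float"` and `dtype_include="number"` differ.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: an int score, a float score and one string column.
toy = pd.DataFrame({
    "score_int": pd.Series([56, 72, 66], dtype="int64"),
    "score_float": pd.Series([56.0, 72.0, 66.0], dtype="float64"),
    "field": ["BIOLOGY", "MATH", "BIOLOGY"],
})

# Calling a selector on a DataFrame returns the list of matching column names.
float_cols = selector(dtype_include="float")(toy)
not_float_cols = selector(dtype_exclude="float")(toy)
number_cols = selector(dtype_include="number")(toy)

print("float:", float_cols)
print("not float:", not_float_cols)
print("number:", number_cols)
```

Integer columns are not matched by `"float"`, so on data loaded with an `int` scheme they end up in the non-float branch. Running the selectors on `X_train` before training is a quick way to confirm the routing.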

In [13]:
rf_params = {'model__max_depth':list(range(5,10)),
            'model__n_estimators':[int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)],
            }
random_forest = ModelPipelineTrain(X_train, y_train, RandomForestClassifier(), rf_params)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed: 11.6min
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed: 14.1min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed: 17.3min
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed: 17.4min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed: 18.2
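Once a search like this finishes, the returned object exposes the winning configuration. The sketch below repeats the idea on a small synthetic dataset so it runs in seconds; the data, the reduced parameter grid, and the absence of the SMOTE step are simplifications assumed for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Keys follow the '<step name>__<parameter>' convention used in rf_params.
params = {"model__max_depth": [2, 3, 4], "model__n_estimators": [10, 20]}

search = RandomizedSearchCV(pipe, params, n_iter=3, cv=3,
                            scoring="recall_micro", random_state=42)
search.fit(X, y)

# best_params_ and best_score_ summarise the winning candidate;
# cv_results_ holds the scores of every sampled candidate.
print(search.best_params_)
print(round(search.best_score_, 3))
```

The same attributes are available on the `random_forest` object returned by `ModelPipelineTrain`.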

In [14]:
from joblib import dump, load
joblib_file = "rf_models.joblib"  
#dump(random_forest, joblib_file) 

In [15]:
xgb_params = {'model__eta':[0.1, 0.3, 0.5],
            'model__gamma':[0, 0.1, 10],
            'model__max_depth':[3,6,10],
            'model__lambda':[1, 2, 6, 10],
            }
xgb_classifier = xgb.XGBClassifier(tree_method='gpu_hist',
                                   objective='multi:softprob',
                                   eval_metric='mlogloss')
xgb_model = ModelPipelineTrain(X_train, y_train, xgb_classifier, xgb_params)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.5min
[CV] model__n_estimators=800, model__max_depth=6 .....................
[CV]  model__n_estimators=800, model__max_depth=6, score=0.169, total= 3.8min
[CV] model__n_estimators=800, model__max_depth=6 .....................
[CV]  model__n_estimators=800, model__max_depth=6, score=0.169, total= 4.0min
[CV] model__n_estimators=400, model__max_depth=8 .....................
[CV]  model__n_estimators=400, model__max_depth=8, score=0.169, total= 2.6min
[CV] model__n_estimators=400, model__max_depth=8 .....................
[CV]  model__n_estimators=400, model__max_depth=8, score=0.169, total= 2.5min
[CV] model__n_estimators=400, model__max_depth=8 .....................
[CV]  model__n_estimators=400, model__max_depth=8, score=0.169, total= 2.5min
[CV] model__n_estimators=200, model__max_depth=5 ........



[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  1.5min
[CV] model__n_estimators=800, model__max_depth=6 .....................
[CV]  model__n_estimators=800, model__max_depth=6, score=0.169, total= 3.8min
[CV] model__n_estimators=800, model__max_depth=6 .....................
[CV]  model__n_estimators=800, model__max_depth=6, score=0.169, total= 3.9min
[CV] model__n_estimators=800, model__max_depth=6 .....................
[CV]  model__n_estimators=800, model__max_depth=6, score=0.169, total= 3.8min
[CV] model__n_estimators=400, model__max_depth=8 .....................
[CV]  model__n_estimators=400, model__max_depth=8, score=0.169, total= 2.5min
[CV] model__n_estimators=400, model__max_depth=8 .....................
[CV]  model__n_estimators=400, model__max_depth=8, score=0.169, total= 2.5min
[CV] model__n_estimators=200, model__max_depth=5 .....................
[CV]  model__n_estimators=200, model__max_depth=5, score=0.169, total=  52.5s
[CV] model__n_estimators=200, model__max_dep



[Parallel(n_jobs=-1)]: Done  38 out of  40 | elapsed: 33.2min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 34.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 34.5min finished
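Depending on the installed XGBoost version, `XGBClassifier` may reject string class labels such as the values of `your_undergraduate_core` and expect integers instead. If that happens, one possible workaround is to encode the target with the `LabelEncoder` imported at the top. The sketch below uses made-up labels and hypothetical variable names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string targets standing in for y_train / y_test.
y_train_toy = ["LAW", "ADMINISTRATION", "LAW", "EDUCATION"]
y_test_toy = ["EDUCATION", "LAW"]

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_toy)  # fit on train only
y_test_encoded = label_encoder.transform(y_test_toy)

# inverse_transform maps predicted integers back to the original names.
decoded = label_encoder.inverse_transform(y_test_encoded)
print(list(y_train_encoded), list(decoded))
```

Note that the keys of `SMOTE_strategy` would then need to be the encoded integers rather than the class names.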




In [16]:
from joblib import dump, load
joblib_file = "xgb_models.joblib"  
dump(xgb_model, joblib_file)

['xgb_models.joblib']
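A saved search can be loaded back later with `load` and used for prediction straight away. The round trip is sketched below with a small stand-in model and a temporary directory, both of which are assumptions for the example rather than the notebook's actual files.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical small model standing in for the fitted search object.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(model, path)      # serialise to disk
    restored = load(path)  # read it back

# The restored model should predict exactly like the original one.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Files written this way are pickles, so they should only be loaded from trusted sources and ideally with the same library versions used to create them.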

In [17]:
def model_report(model, X_infer, y_true):
    print('-- Model Report --')
    y_pred = model.predict(X_infer)
    print('Accuracy: '+str(accuracy_score(y_true, y_pred)))
    print('Recall (Micro): '+str(recall_score(y_true, y_pred,average='micro')))
    print('F1-Score (Micro): '+str(f1_score(y_true, y_pred,average='micro')))

model_report(random_forest.best_estimator_, X_train, y_train)
model_report(random_forest.best_estimator_, X_test, y_test)

-- Model Report --
Accuracy: 0.16987995849252394
Recall (Micro): 0.16987995849252394
F1-Score (Micro): 0.16987995849252394
-- Model Report --
Accuracy: 0.16824678081222583
Recall (Micro): 0.16824678081222583
F1-Score (Micro): 0.16824678081222583


In [18]:
model_report(xgb_model.best_estimator_, X_train, y_train)
model_report(xgb_model.best_estimator_, X_test, y_test)

-- Model Report --
Accuracy: 0.1598627423234753
Recall (Micro): 0.1598627423234753
F1-Score (Micro): 0.1598627423234753
-- Model Report --
Accuracy: 0.15895476628460922
Recall (Micro): 0.15895476628460922
F1-Score (Micro): 0.15895476628460922
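The three numbers in each report are identical because, for single-label multiclass problems, micro-averaged recall and F1 reduce to accuracy. The sketch below shows this on made-up labels and adds macro-averaged recall, which weights every class equally. The labels and the always-majority predictions are assumptions for the example.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: class 'A' dominates, as ADMINISTRATION does in the real target.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "A"]  # a model that always predicts the majority class

accuracy = accuracy_score(y_true, y_pred)
recall_micro = recall_score(y_true, y_pred, average="micro")
f1_micro = f1_score(y_true, y_pred, average="micro")

# Macro averaging computes recall per class and then takes the plain mean.
recall_macro = recall_score(y_true, y_pred, average="macro")

print(accuracy, recall_micro, f1_micro, recall_macro)
```

In our data ADMINISTRATION accounts for 35790 of 212010 rows, roughly 0.169, which is close to the reported accuracies. That suggests, though it does not prove, that the models may be predicting mostly the majority class, so a macro-averaged metric or a per-class report would be a useful addition to `model_report`.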
