# ML Pipeline Tuning
This notebook was used to:
- Compare the XGBoost classifier against the random forest classifier
- Determine optimal parameters for the XGBoost classifier

In [1]:
! pip install --upgrade setuptools
! pip install --upgrade pip
! pip install --upgrade xgboost
! pip install --upgrade hyperopt

Collecting setuptools
  Downloading setuptools-68.0.0-py3-none-any.whl (804 kB)
     ------------------------------------- 804.0/804.0 kB 16.9 MB/s eta 0:00:00
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 65.6.3
    Uninstalling setuptools-65.6.3:
      Successfully uninstalled setuptools-65.6.3
Successfully installed setuptools-68.0.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.27 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.27 requires nbformat==5.4.0, but you have nbformat 5.7.0 which is incompatible.


Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 22.1 MB/s eta 0:00:00


ERROR: To modify pip, please run the following command:
C:\Users\DanielJoseph.Onsiter\anaconda3\python.exe -m pip install --upgrade pip


Collecting xgboost
  Downloading xgboost-1.7.6-py3-none-win_amd64.whl (70.9 MB)
     --------------------------------------- 70.9/70.9 MB 29.7 MB/s eta 0:00:00
Installing collected packages: xgboost
Successfully installed xgboost-1.7.6
Collecting hyperopt
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
     ---------------------------------------- 1.6/1.6 MB 25.4 MB/s eta 0:00:00
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ---------------------------------------- 200.5/200.5 kB ? eta 0:00:00
Installing collected packages: py4j, hyperopt
Successfully installed hyperopt-0.2.7 py4j-0.10.9.7


In [35]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK
import timeit
import pickle

In [9]:
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger','omw-1.4'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...


True

### Importing the data

In [4]:
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('MessageCategories',con=engine)
X = df['message'] 
Y = df[df.columns.difference(['id','message','genre','original'])]
df.shape

(26215, 40)

In [5]:
display(X.head(),Y.head())

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

Unnamed: 0,aid_centers,aid_related,buildings,child_alone,clothing,cold,death,direct_report,earthquake,electricity,...,request,search_and_rescue,security,shelter,shops,storm,tools,transport,water,weather_related
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
for col in Y.columns:
    print(Y[col].value_counts())

0    25906
1      309
Name: aid_centers, dtype: int64
0    15355
1    10860
Name: aid_related, dtype: int64
0    24882
1     1333
Name: buildings, dtype: int64
0    26215
Name: child_alone, dtype: int64
0    25810
1      405
Name: clothing, dtype: int64
0    25685
1      530
Name: cold, dtype: int64
0    25021
1     1194
Name: death, dtype: int64
0    21140
1     5075
Name: direct_report, dtype: int64
0    23760
1     2455
Name: earthquake, dtype: int64
0    25683
1      532
Name: electricity, dtype: int64
0    25933
1      282
Name: fire, dtype: int64
0    24060
1     2155
Name: floods, dtype: int64
0    23292
1     2923
Name: food, dtype: int64
0    25932
1      283
Name: hospitals, dtype: int64
0    24510
1     1705
Name: infrastructure_related, dtype: int64
0    24131
1     2084
Name: medical_help, dtype: int64
0    24902
1     1313
Name: medical_products, dtype: int64
0    25355
1      860
Name: military, dtype: int64
0    25917
1      298
Name: missing_people, dtype: int64
0    2

#### Observation
Many of the columns are heavily imbalanced: aid_centers, for example, has only 309 positive labels out of 26,215 messages, and child_alone has none.
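
To put numbers on the imbalance, the positive rate of a few categories can be worked out directly from the counts printed above. This is only a sketch in plain Python: the `positives` dictionary is a hand-copied subset of the `value_counts()` output, and both it and `positive_rate` are names introduced here for illustration rather than anything the notebook defines.

```python
# Positive-label counts copied from the value_counts() output above
# (a hand-picked subset, for illustration only).
total_rows = 26215
positives = {
    "aid_related": 10860,
    "direct_report": 5075,
    "aid_centers": 309,
    "fire": 282,
    "child_alone": 0,
}

# Fraction of messages carrying each label, printed from rarest to most common.
positive_rate = {name: count / total_rows for name, count in positives.items()}
for name, rate in sorted(positive_rate.items(), key=lambda item: item[1]):
    print(f"{name:<15} {rate:.2%}")
```

Labels with a positive rate near zero give the classifier very few examples to learn from, which is worth keeping in mind when reading the per-category reports further down.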

### Creating tokenizer
This is the tokenizer used in the case studies.

In [7]:
def tokenize(text):
    # Replace every URL with a fixed placeholder so that links do not create spurious tokens
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # Split into word tokens, then lemmatize, lowercase and strip each one
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
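
The URL-replacement step of the tokenizer can be tried on its own, without the NLTK downloads. The sketch below uses only the standard library; `replace_urls` is a hypothetical helper that isolates that step, and the example message is made up rather than taken from the dataset.

```python
import re

# Same pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Swap every URL matched by URL_REGEX for a fixed placeholder token."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text


# Hypothetical message, not taken from the dataset.
example = "Shelter list posted at http://example.com/shelters please share"
cleaned = replace_urls(example)
print(cleaned)
```

Because the character class `[$-_@.&+]` covers most ASCII punctuation, a URL followed directly by a comma or full stop is matched together with that trailing character.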

### 1. Random Forest Classifier

The Random Forest Classifier was used first, as a performance benchmark.

In [14]:
def build_model():

    classifier = RandomForestClassifier()
    pipeline = Pipeline([
                        ('vect', CountVectorizer(tokenizer=tokenize)),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MultiOutputClassifier(classifier))
                        ])

    parameters = {
        'vect__ngram_range': ((1, 1), (1, 2)),
        'clf__estimator__n_estimators': [5,10,20]
    }
    
    cv = GridSearchCV(pipeline, param_grid=parameters, cv=2, n_jobs=6)
    
    return cv

st1 = timeit.default_timer()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .25, random_state=420)

# train classifier
model = build_model()
model.fit(X_train,y_train)

# predict on test data
y_pred = model.predict(X_test)
st2 = timeit.default_timer()
print('time taken: ' + str(st2-st1) +'s')



time taken: 978.425593300024s


In [15]:
model.best_params_

{'clf__estimator__n_estimators': 20, 'vect__ngram_range': (1, 2)}
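
For context on the run time above, it helps to count how many pipeline fits this grid search performs. The sketch below enumerates the grid with the standard library only; it assumes the scikit-learn default `refit=True`, and `candidates`, `cv_fits` and `total_fits` are names introduced here for illustration.

```python
from itertools import product

# The grid from build_model() above.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10, 20],
}
cv_folds = 2

# Every combination of the listed values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold; with refit=True the best
# candidate is then fitted once more on the full training set.
cv_fits = len(candidates) * cv_folds
total_fits = cv_fits + 1
print(len(candidates), cv_fits, total_fits)
```

Each of those fits trains one forest per output column through `MultiOutputClassifier`, so even this small grid adds up quickly.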

In [16]:
#Show performance per column
macro_f1_scores = []
for i,col in enumerate(y_test.columns):
    y_pred_col = y_pred[:,i]
    y_test_col = y_test[col].values
    print('col: ' + col)
    print(classification_report(y_test_col, y_pred_col))
    print(confusion_matrix(y_test_col, y_pred_col))
    macro_f1_scores.append([col,f1_score(y_test_col, y_pred_col, average='macro')])
    print('')

col: aid_centers
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6473
           1       0.00      0.00      0.00        81

    accuracy                           0.99      6554
   macro avg       0.49      0.50      0.50      6554
weighted avg       0.98      0.99      0.98      6554

[[6473    0]
 [  81    0]]

col: aid_related
              precision    recall  f1-score   support

           0       0.73      0.89      0.80      3821
           1       0.78      0.54      0.64      2733

    accuracy                           0.74      6554
   macro avg       0.75      0.72      0.72      6554
weighted avg       0.75      0.74      0.73      6554

[[3397  424]
 [1248 1485]]

col: buildings
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      6253
           1       0.74      0.10      0.18       301

    accuracy                           0.96      6554
   macro avg       0.85

  _warn_prf(average, modifier, msg_start, len(result))



col: hospitals
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6485
           1       0.00      0.00      0.00        69

    accuracy                           0.99      6554
   macro avg       0.49      0.50      0.50      6554
weighted avg       0.98      0.99      0.98      6554

[[6485    0]
 [  69    0]]

col: infrastructure_related
              precision    recall  f1-score   support

           0       0.93      1.00      0.97      6123
           1       0.40      0.00      0.01       431

    accuracy                           0.93      6554
   macro avg       0.67      0.50      0.49      6554
weighted avg       0.90      0.93      0.90      6554

[[6120    3]
 [ 429    2]]

col: medical_help
              precision    recall  f1-score   support

           0       0.92      1.00      0.96      6014
           1       0.71      0.05      0.10       540

    accuracy                           0.92      6554
   macro a




col: shops
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6528
           1       0.00      0.00      0.00        26

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99      6554

[[6528    0]
 [  26    0]]

col: storm
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      5898
           1       0.81      0.30      0.44       656

    accuracy                           0.92      6554
   macro avg       0.87      0.65      0.70      6554
weighted avg       0.92      0.92      0.91      6554

[[5852   46]
 [ 458  198]]

col: tools
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6508
           1       0.00      0.00      0.00        46

    accuracy                           0.99      6554
   macro avg       0.50      0.50     



#### Observation

Judging by the per-category reports above and the macro f1 scores below, this classifier handles many of the categories poorly: for rare labels such as aid_centers, hospitals and shops it never predicts the positive class. The macro f1 score is used to evaluate the model because it is more appropriate for imbalanced data ['3'](https://stephenallwright.com/micro-vs-macro-f1-score/).
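
As a worked example of why accuracy is misleading here, the macro f1 score for aid_centers can be recomputed by hand from the confusion matrix printed above. This is a plain-Python sketch, not the scikit-learn implementation; `f1_from_counts` is a hypothetical helper, and it follows the convention of reporting an f1 of 0 when a class has no true positives.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true-positive, false-positive and false-negative counts."""
    # With no true positives the F1 is reported as 0.
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0


# Confusion matrix printed above for aid_centers: rows are true labels,
# columns are predicted labels, i.e. [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 6473, 0, 81, 0

f1_positive = f1_from_counts(tp=tp, fp=fp, fn=fn)  # class 1
f1_negative = f1_from_counts(tp=tn, fp=fn, fn=fp)  # class 0, with the roles swapped
macro_f1 = (f1_positive + f1_negative) / 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy={accuracy:.4f}  macro f1={macro_f1:.4f}")
```

Accuracy stays close to 0.99 because almost every message is negative, while the macro f1 score drops to about 0.50 because the positive class contributes an f1 of 0 to the average.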

In [17]:
macro_f1_scores_df = pd.DataFrame(macro_f1_scores,columns=['category','macro f1 score RF'])
macro_f1_scores_df.sort_values(by='macro f1 score RF')

Unnamed: 0,category,macro f1 score RF
25,related,0.481496
14,infrastructure_related,0.487542
21,other_aid,0.491043
22,other_infrastructure,0.491802
28,security,0.495303
0,aid_centers,0.496891
18,missing_people,0.496968
10,fire,0.497123
13,hospitals,0.497354
32,tools,0.498239


To evaluate the model as a whole, I introduce a metric that combines the macro f1 scores of all columns: score = sqrt(sum(f1_score^2)/num_columns). Drawing inspiration from the calculation of RMSE, the f1 score of each column is squared so that higher f1 scores carry more weight.
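
The effect of the squaring can be seen on a toy example. The sketch below compares the score with a plain arithmetic mean for two made-up sets of per-category f1 scores; `rms_score` and the values in `uneven` and `even` are illustrative assumptions, not results from this notebook.

```python
import math


def rms_score(f1_scores):
    """Square root of the mean squared f1 score across categories."""
    return math.sqrt(sum(f1 ** 2 for f1 in f1_scores) / len(f1_scores))


# Hypothetical per-category macro f1 scores with the same arithmetic mean (0.75).
uneven = [0.5, 1.0]
even = [0.75, 0.75]

print(rms_score(uneven), sum(uneven) / len(uneven))
print(rms_score(even), sum(even) / len(even))
```

Both lists have the same mean, but the uneven one gets the higher score, so this metric rewards a model with a few very well classified categories more than a plain average would.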

In [18]:
score = 0
for i,col in enumerate(y_test.columns):
    y_pred_col = y_pred[:,i]
    y_test_col = y_test[col].values
    x = f1_score(y_test_col, y_pred_col, average='macro')
    score = score + x**2

score = np.sqrt(score/y_test.shape[1])
score

0.6097763746479982

This score will serve as a benchmark; our goal is to achieve a score higher than 0.61.

### 2. XGBoost classifier

The XGBoost classifier was tested next. These Medium articles were referenced to implement XGBoost and hyperparameter optimization using the hyperopt library: ['1'](https://medium.com/@rithpansanga/optimizing-xgboost-a-guide-to-hyperparameter-tuning-77b6e48e289d), ['2'](https://towardsdatascience.com/automate-hyperparameter-tuning-with-hyperopts-for-multiple-models-22b499298a8a).

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .25, random_state=666)

#Tuned parameters
space = {
    'max_depth': hp.choice('max_depth', range(5, 30, 1)),
    'learning_rate': hp.loguniform('learning_rate', -5, -2),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'n_estimators' : hp.choice('n_estimators', range(5, 50, 1)),
    'reg_lambda' : hp.uniform ('reg_lambda', 0,1),
    'reg_alpha' : hp.uniform ('reg_alpha', 0,1)
}

def custom_loss(y_test,y_pred):
    score = 0
    
    for i,col in enumerate(y_test.columns):
        y_pred_col = y_pred[:,i]
        y_test_col = y_test[col].values
        x = f1_score(y_test_col, y_pred_col, average='macro')
        score = score + (x)**2

    score = np.sqrt(score/y_test.shape[1])
    return score
    

# Define the objective function to minimize
def objective(params):
    xgb_model = xgb.XGBClassifier(**params)
    pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(xgb_model))
                    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    #The loss function here is the one I defined previously
    score = custom_loss(y_test,y_pred)
    
    #The score is returned here as a negative value, as fmin will attempt to minimize this value
    return {'loss': -score, 'status': STATUS_OK}

# Perform the optimization
best_params = fmin(objective, space, algo=tpe.suggest, max_evals=10)
print("Best set of hyperparameters: ", best_params)

  0%|          | 0/10 [00:00<?, ?trial/s, best loss=?]




 10%|█         | 1/10 [04:19<38:55, 259.49s/trial, best loss: -0.7028727897760098]




 20%|██        | 2/10 [05:52<21:30, 161.32s/trial, best loss: -0.7028727897760098]




 30%|███       | 3/10 [10:18<24:24, 209.15s/trial, best loss: -0.7028727897760098]




 40%|████      | 4/10 [12:36<18:06, 181.06s/trial, best loss: -0.7028727897760098]




 50%|█████     | 5/10 [19:12<21:33, 258.71s/trial, best loss: -0.7049036656933776]




 60%|██████    | 6/10 [19:37<11:57, 179.28s/trial, best loss: -0.7049036656933776]




 70%|███████   | 7/10 [22:23<08:44, 174.95s/trial, best loss: -0.7050858297356839]




 80%|████████  | 8/10 [25:31<05:58, 179.05s/trial, best loss: -0.7050858297356839]




 90%|█████████ | 9/10 [27:12<02:34, 154.76s/trial, best loss: -0.7050858297356839]




100%|██████████| 10/10 [28:12<00:00, 169.28s/trial, best loss: -0.7050858297356839]
Best set of hyperparameters:  {'learning_rate': 0.01996511800675754, 'max_depth': 17, 'n_estimators': 18, 'reg_alpha': 0.09938147837291078, 'reg_lambda': 0.49525470882855027, 'subsample': 0.8815543226911875}
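
Two details help when reading this result. First, `hp.loguniform('learning_rate', -5, -2)` samples values between exp(-5) and exp(-2). Second, as far as I am aware, `fmin` reports `hp.choice` parameters as the position of the chosen option rather than the option itself, and `hyperopt.space_eval(space, best_params)` is the library call that maps them back. The sketch below reproduces that mapping with the standard library only, so the variable names are illustrative and hyperopt itself is not called.

```python
import math

# Bounds implied by hp.loguniform('learning_rate', -5, -2): exp(low) to exp(high).
lr_low, lr_high = math.exp(-5), math.exp(-2)

# Values reported by fmin above (subset).
reported = {"learning_rate": 0.01996511800675754, "max_depth": 17, "n_estimators": 18}

# The option lists that were passed to hp.choice in the search space.
max_depth_options = list(range(5, 30, 1))
n_estimators_options = list(range(5, 50, 1))

# Assumption: the reported hp.choice entries are positions in those lists.
resolved = {
    "learning_rate": reported["learning_rate"],
    "max_depth": max_depth_options[reported["max_depth"]],
    "n_estimators": n_estimators_options[reported["n_estimators"]],
}
print(f"learning_rate range: {lr_low:.4f} to {lr_high:.4f}")
print(resolved)
```

Under that assumption the best trial used `max_depth=22` and `n_estimators=23`, so passing the reported 17 and 18 straight to `XGBClassifier` would train a slightly different model from the one that achieved the best loss.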


Using the best set of hyperparameters: 

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .25, random_state=420)

classifier = xgb.XGBClassifier(learning_rate=best_params['learning_rate'],
                               max_depth=best_params['max_depth'],
                               n_estimators=best_params['n_estimators'],
                               reg_alpha=best_params['reg_alpha'],
                               reg_lambda=best_params['reg_lambda'],
                               subsample=best_params['subsample'])
pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(classifier))
                    ])

# train classifier
pipeline.fit(X_train,y_train)
# predict on test data
y_pred = pipeline.predict(X_test)



In [32]:
#Show performance per column
macro_f1_scores = []
for i,col in enumerate(y_test.columns):
    y_pred_col = y_pred[:,i]
    y_test_col = y_test[col].values
    print('col: ' + col)
    print(classification_report(y_test_col, y_pred_col))
    print(confusion_matrix(y_test_col, y_pred_col))
    macro_f1_scores.append([col,f1_score(y_test_col, y_pred_col, average='macro')])
    print('')

col: aid_centers
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6473
           1       0.21      0.04      0.06        81

    accuracy                           0.99      6554
   macro avg       0.60      0.52      0.53      6554
weighted avg       0.98      0.99      0.98      6554

[[6462   11]
 [  78    3]]

col: aid_related
              precision    recall  f1-score   support

           0       0.74      0.85      0.79      3821
           1       0.74      0.58      0.65      2733

    accuracy                           0.74      6554
   macro avg       0.74      0.72      0.72      6554
weighted avg       0.74      0.74      0.73      6554

[[3256  565]
 [1142 1591]]

col: buildings
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      6253
           1       0.66      0.38      0.48       301

    accuracy                           0.96      6554
   macro avg       0.81



              precision    recall  f1-score   support

           0       0.88      0.98      0.93      5678
           1       0.53      0.16      0.25       876

    accuracy                           0.87      6554
   macro avg       0.71      0.57      0.59      6554
weighted avg       0.84      0.87      0.84      6554

[[5550  128]
 [ 733  143]]

col: other_infrastructure
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      6260
           1       0.36      0.03      0.05       294

    accuracy                           0.95      6554
   macro avg       0.66      0.51      0.51      6554
weighted avg       0.93      0.95      0.94      6554

[[6246   14]
 [ 286    8]]

col: other_weather
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      6210
           1       0.48      0.15      0.23       344

    accuracy                           0.95      6554
   macro avg       0.72    

In [33]:
macro_f1_scores_df = pd.DataFrame(macro_f1_scores,columns=['category','macro f1 score XGB'])
macro_f1_scores_df.sort_values(by='macro f1 score XGB')

Unnamed: 0,category,macro f1 score XGB
32,tools,0.498201
20,offer,0.498661
30,shops,0.49893
22,other_infrastructure,0.51359
28,security,0.518517
0,aid_centers,0.528159
14,infrastructure_related,0.530626
25,related,0.533829
21,other_aid,0.588681
13,hospitals,0.597544


In [34]:
score = 0
for i,col in enumerate(y_test.columns):
    y_pred_col = y_pred[:,i]
    y_test_col = y_test[col].values
    x = f1_score(y_test_col, y_pred_col, average='macro')
    score = score + x**2

score = np.sqrt(score/y_test.shape[1])
score

0.7016988709720189

#### Observation
While some classes do not see much improvement in classification accuracy, others improve quite significantly. The overall score is also much higher than before (0.70 versus 0.61). For this reason, I use the XGBoost classifier in the pipeline.
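
The per-category change can be read off the two sorted tables above. The sketch below copies the macro f1 scores for the categories that appear in both excerpts and prints the difference; the dictionaries are transcribed by hand, and `rf`, `xgb_scores` and `delta` are names introduced here for illustration.

```python
# Macro f1 per category, copied from the two sorted tables above
# (only the categories that appear in both excerpts).
rf = {
    "related": 0.481496, "infrastructure_related": 0.487542,
    "other_aid": 0.491043, "other_infrastructure": 0.491802,
    "security": 0.495303, "aid_centers": 0.496891,
    "hospitals": 0.497354, "tools": 0.498239,
}
xgb_scores = {
    "related": 0.533829, "infrastructure_related": 0.530626,
    "other_aid": 0.588681, "other_infrastructure": 0.51359,
    "security": 0.518517, "aid_centers": 0.528159,
    "hospitals": 0.597544, "tools": 0.498201,
}

# Positive values mean XGBoost scored higher than the random forest.
delta = {name: xgb_scores[name] - rf[name] for name in rf}
for name, change in sorted(delta.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} {change:+.4f}")
```

Among these categories, hospitals and other_aid gain roughly 0.10 each, while tools is essentially unchanged.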

In [36]:
# Use a context manager so the file handle is closed after writing
with open('XGB.pkl', 'wb') as f:
    pickle.dump(pipeline, f)