## ML Pipeline

This notebook was used to explore options and make the design decisions behind the ML pipeline implemented in the accompanying Python file.


### Import libraries and load data

Let's start by importing the libraries and downloading the NLTK resources, then set up a few static values and warning options and load the dataset.

In [16]:
# import libraries
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk

nltk.download('wordnet')
nltk.download('punkt') 
nltk.download('stopwords') 

from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize, punkt
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from sqlalchemy import create_engine

# Prevent sklearn from printing ConvergenceWarning (due to max iterations limit)
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category = ConvergenceWarning)

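# static values: a regex to detect hyperlinks (raw string, so the backslashes
# are not treated as invalid escape sequences) and the English stop-word set
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
STOP_WORDS = set(stopwords.words('english'))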

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\JakubBelow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\JakubBelow\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JakubBelow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# load datasets
engine = create_engine('sqlite:///DB/disaster_messages.db')
df = pd.read_sql_table('DB/disaster_messages', con=engine)

In [5]:
df.head(1)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Process data and engineer features

We don't have much feature engineering to do here, so let's define a tokenizer to be used in the pipeline that will:
- replace all hyperlinks with a placeholder
- tokenize words
- lemmatize and transform tokens to lowercase
- remove stop words from the tokens

In [6]:
# the target columns start at 'related'; drop 'child_alone', which doesn't seem to appear at all in the dataset
features = df.loc[:,'related':].columns.to_list()
features.remove('child_alone')

In [7]:
def tokenize(text):
    """
    Desc: Returns cleaned and lemmatized tokens from a text to be used by an NLP vectorizer
    
        Parameters:
            text (str): a document to be processed (e.g. a twitter message)
        Returns:
            clean_tokens (list[str]): a list of cleaned and lemmatized word tokens
    """
    
    # find and replace all hyperlinks
    urls = re.findall(URL_REGEX, text)
    
    for url in urls:
        text = text.replace(url, '<url>')
        
    # tokenize
    tokens = word_tokenize(text)
    
    # lemmatize and clean words
    lemmatizer = WordNetLemmatizer()
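    # lowercase first: WordNet lookups are case-sensitive, so capitalised words would otherwise stay unlemmatized
    clean_tokens = [lemmatizer.lemmatize(tok.lower().strip()) for tok in tokens]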
    clean_tokens = [tok for tok in clean_tokens if tok not in STOP_WORDS]
        
    return clean_tokens
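
The hyperlink step can be tried in isolation without NLTK. The sketch below is a minimal illustration, not the notebook's code: it reuses the same pattern as `URL_REGEX` but swaps the `findall`/`replace` loop for a single `re.sub` call, which should give the same result for typical messages. The helper name `replace_urls` and the sample message are made up for the example.

```python
import re

# Same pattern as URL_REGEX in the notebook, written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder='<url>'):
    """Replace every hyperlink in `text` with a placeholder (hypothetical helper)."""
    return re.sub(URL_REGEX, placeholder, text)


# Made-up sample message, not taken from the dataset.
sample = 'Water needed, details at http://example.com/help now'
cleaned = replace_urls(sample)
print(cleaned)  # Water needed, details at <url> now
```

The match stops at the first whitespace character, so the words after the link are left untouched.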

In [19]:
# define X and y
X = df['message']
y = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8)

In [20]:
len(features)

35

### Build the pipeline and quickly assess three different models 

It's worth checking different estimators for the multi-target (multi-output) model and choosing the one that performs best before tuning it.

In [21]:
# define a function to build a pipeline with given estimator
def model_pipeline(model):
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, token_pattern=None)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator=model))
    ])
    
    return pipeline

In [22]:
# Run three different estimators to check which one's the best

models = [
    LogisticRegression(max_iter=1000),
    MultinomialNB(),
    SGDClassifier()
]

for model in models:
    pipeline = model_pipeline(model)
    print(f'{model}:\n')
    # train classifier
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    pred_df = pd.DataFrame(y_pred, columns=y_test.columns)
    report_df = pd.DataFrame(columns=['precision', 'recall', 'f1-score'])

    for col in pred_df.columns:
        scores = classification_report(y_test[col], pred_df[col], output_dict=True, zero_division=0)['weighted avg']
        precision, recall, f1_score, _ = [score for score in scores.values()]
        report_df.loc[len(report_df)] = [precision, recall, f1_score]

    report_df.index = pred_df.columns

    # average the weighted scores over all target columns
    print('success:\n', report_df.mean(), '\n\n')
    

LogisticRegression(max_iter=1000):

success:
 precision    0.939473
recall       0.947385
f1-score     0.936945
dtype: float64 


MultinomialNB():

success:
 precision    0.906830
recall       0.932930
f1-score     0.908542
dtype: float64 


SGDClassifier():

success:
 precision    0.939719
recall       0.948921
f1-score     0.938432
dtype: float64 




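SGDClassifier seems to perform the best, by a small margin over LogisticRegression. SGDClassifier is sensitive to feature scaling, but that is not a problem here: the inputs are TF-IDF vectors, which TfidfTransformer L2-normalizes by default, so all features are already on a comparable scale.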
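
On the scaling point, it may help to look at what actually reaches the classifier. The sketch below re-implements in plain Python what TfidfTransformer does with its default settings (smoothed idf, L2 row normalisation). The helper name `tfidf_l2` and the toy documents are made up for illustration. Every document vector comes out with unit length and entries between 0 and 1, which would explain why the pipeline gets by without a separate scaler.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """TF-IDF with smoothed idf and L2-normalised rows (hypothetical helper).

    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, tf is the raw term count.
    """
    n_docs = len(docs)
    # document frequency: in how many documents each token occurs
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows


# Toy, already-tokenized documents (not from the dataset).
docs = [['water', 'food', 'water'], ['food', 'shelter'], ['water']]
vocab, rows = tfidf_l2(docs)

for row in rows:
    # squared entries of each row sum to 1 (up to floating-point error)
    print([round(v, 3) for v in row], round(sum(v * v for v in row), 6))
```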

Now, let's use this estimator to search for the best hyperparameters. We will tune the following three:
1. penalty
2. loss
3. max iterations

In [23]:
%%time
pipeline = model_pipeline(SGDClassifier())
parameters = {'clf__estimator__penalty' : ['l1', 'l2', 'elasticnet'],
              'clf__estimator__loss': ['hinge', 'log_loss', 'squared_hinge', 'perceptron'],
              'clf__estimator__max_iter' : [200, 500, 1000]
             }

cv = GridSearchCV(pipeline, param_grid=parameters)
best_clf = cv.fit(X_train, y_train)

CPU times: total: 28min 46s
Wall time: 1h 16min 19s
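
The run time is easier to interpret once the number of fits is counted. A rough sketch, assuming GridSearchCV's defaults apply (5-fold cross-validation and a final refit on the full training set, since neither `cv` nor `refit` is set in the cell above); the variable names are illustrative:

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'clf__estimator__penalty': ['l1', 'l2', 'elasticnet'],
    'clf__estimator__loss': ['hinge', 'log_loss', 'squared_hinge', 'perceptron'],
    'clf__estimator__max_iter': [200, 500, 1000],
}

# Every combination of the listed values is one candidate.
n_candidates = len(list(product(*param_grid.values())))

cv_folds = 5                              # assumed default, cv is not set above
n_fits = n_candidates * cv_folds + 1      # +1 for the final refit of the best candidate

print(n_candidates)  # 36
print(n_fits)        # 181
```

Each of those pipeline fits trains one SGDClassifier per target column (35 here) and runs the tokenizer again on its share of the messages, so trimming the grid or the number of folds shortens the search roughly in proportion.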


In [24]:
best_clf.best_params_

{'clf__estimator__loss': 'hinge',
 'clf__estimator__max_iter': 1000,
 'clf__estimator__penalty': 'l2'}

It took over an hour of wall time (about 29 minutes of CPU time), but we have the winners. The loss function will be 'hinge' with 1000 max iterations and the l2 penalty as the regularizer. Finally, let's estimate precision, recall, and f1-score for each target class.

### Report scores for each target class

In [25]:
# predict y values on the test sample
y_pred = best_clf.predict(X_test)
pred_df = pd.DataFrame(y_pred, columns=y_test.columns)
report_df = pd.DataFrame(columns=['precision', 'recall', 'f1-score'])

for col in pred_df.columns:
    scores = classification_report(y_test[col], pred_df[col], output_dict=True, zero_division=0)['weighted avg']
    precision, recall, f1_score, _ = [score for score in scores.values()]
    report_df.loc[len(report_df)] = [precision, recall, f1_score]

report_df.index = pred_df.columns
report_df

class,precision,recall,f1-score
related,0.805688,0.814645,0.792316
request,0.899475,0.90389,0.896253
offer,0.991247,0.995614,0.993426
aid_related,0.779925,0.779939,0.776681
medical_help,0.913982,0.929825,0.90754
medical_products,0.949685,0.957475,0.945979
search_and_rescue,0.963667,0.973494,0.963102
security,0.95513,0.977307,0.966091
military,0.965476,0.970442,0.959547
water,0.964773,0.966629,0.965412


8 out of 35 classes are never predicted for the test data set (see the prevalence comparison below). This is a potential pain point for the next iteration of the model. Oversampling might help with the issue. Otherwise, a business decision may be made to drop these classes for the time being.
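
As a rough illustration of the oversampling idea, the sketch below duplicates randomly chosen positive examples of a single binary target until both classes are the same size. It is not part of the pipeline above: the helper name `oversample_positive` and the toy data are assumptions, and because MultiOutputClassifier fits every target on the same X, using this here would likely mean fitting the rare targets separately. Only the training split should be resampled, never the test split.

```python
import random


def oversample_positive(texts, labels, seed=0):
    """Randomly duplicate positive examples until classes are balanced (hypothetical helper)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Nothing to do if there are no positives or they are not the minority.
    if not pos or len(pos) >= len(neg):
        return list(texts), list(labels)

    # Draw the missing positives with replacement, then shuffle everything.
    extra = rng.choices(pos, k=len(neg) - len(pos))
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [texts[i] for i in idx], [labels[i] for i in idx]


# Toy training data: 5 positives out of 100 messages.
texts = [f'message {i}' for i in range(100)]
labels = [1] * 5 + [0] * 95

new_texts, new_labels = oversample_positive(texts, labels)
print(len(new_labels), sum(new_labels))  # 190 95
```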

In [26]:
# get % of predicted classes
pred_counts = pd.DataFrame(y_pred, columns=y_test.columns)
pred_counts = pred_counts.sum().sort_values() / pred_counts.shape[0]

# get % of actual classes
test_counts = pd.DataFrame(y_test, columns=y_test.columns)
test_counts = test_counts.sum().sort_values() / test_counts.shape[0]

# create a new dataframe to compare
classes_prevalencs_df = pd.concat([test_counts, pred_counts], axis=1)
classes_prevalencs_df.columns = ['actual prevalence', 'predicted_pct']
classes_prevalencs_df.sort_values(by='predicted_pct')

class,actual prevalence,predicted_pct
shops,0.003814,0.0
offer,0.004386,0.0
tools,0.007056,0.0
aid_centers,0.009725,0.0
hospitals,0.012777,0.0
infrastructure_related,0.063692,0.0
other_infrastructure,0.041571,0.0
security,0.022693,0.0
fire,0.008772,0.000572
missing_people,0.012777,0.000572


Interestingly, some of the classes that were not predicted at all are not actually that underrepresented. For instance, the "other_infrastructure" class accounts for approx. 4.2% of the test dataset, and "infrastructure_related" for approx. 6.4%.
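
The weighted-average scores reported earlier can stay high for classes that are never predicted. The "offer" row shows the pattern: its recall of 0.995614 is exactly one minus its prevalence of 0.004386. The sketch below reproduces the effect in plain Python on made-up labels (1% positives, all predictions negative). The function `weighted_scores` is a hypothetical stand-in that mimics the support-weighted average with zero-division scores set to 0, not scikit-learn's own code.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true (hypothetical helper)."""
    n = len(y_true)
    totals = {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        support = sum(t == cls for t in y_true)

        # Undefined ratios count as 0, like zero_division=0.
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

        weight = support / n
        totals['precision'] += weight * precision
        totals['recall'] += weight * recall
        totals['f1-score'] += weight * f1
    return totals


# Made-up target: 10 positives in 1000 rows, and a model that never predicts 1.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

scores = weighted_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
# precision is about 0.9801 and recall 0.99, although no positive is ever found
```

Reporting the scores of the positive class alone (the `'1'` entry of the classification report rather than `'weighted avg'`) would make these failures visible.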

In [28]:
# export the model
import pickle

# use a context manager so the file handle is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(best_clf, f)
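
For completeness, a minimal sketch of the save/load round trip. A plain dict stands in for `best_clf` and a temporary directory replaces the working directory, both purely for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the fitted best_clf.
model = {'loss': 'hinge', 'penalty': 'l2', 'max_iter': 1000}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'model.pkl'

    # Write the object; the context manager closes the file afterwards.
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Read it back in the same way.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

Pickle stores functions by reference, so loading the real model elsewhere requires the `tokenize` function to be importable under the same module path it had when the model was saved. Pickle files should only be loaded from trusted sources.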

In [35]:
df['message'][500]

'Please, we are staying in a church. There are no wounded but we are in dire need of food, water, gas. Our house is completely leveled'

In [38]:
df.head()
classes_prevalencs_df

class,actual prevalence,predicted_pct
shops,0.003814,0.0
offer,0.004386,0.0
tools,0.007056,0.0
fire,0.008772,0.000572
aid_centers,0.009725,0.0
hospitals,0.012777,0.0
missing_people,0.012777,0.000572
clothing,0.015446,0.008963
cold,0.020404,0.004577
electricity,0.022311,0.003814
