This notebook is the machine learning part of the NLP disaster response project for the Data Science Nanodegree. The first part, covering EDA and ETL, can be found <a href='http://localhost:8888/notebooks/Desktop/Pipelines_testing/data/EDA%20and%20ETL.ipynb'>here</a>. As mentioned in the previous notebook, the data is provided by <a href='https://appen.com/'>Figure Eight (acquired by Appen)</a>, and the objective of the project is to build a web app that correctly classifies disaster messages for a better and quicker response. This notebook covers text preprocessing and the building, evaluation, tuning and testing of a classification pipeline, and it describes the steps of the ML pipeline. After the ETL pipeline, our data is stored in the disaster_messages table of the DisasterResponse.db database file.

First, we need to import the libraries and load the data.

In [1]:
# import the libraries
import warnings
warnings.filterwarnings('ignore')

import time
import numpy as np
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
import re

import nltk
nltk.download(['punkt','wordnet','stopwords','averaged_perceptron_tagger','omw'])
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import recall_score, f1_score, brier_score_loss, roc_auc_score
from scipy.stats import zscore

import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HurazovRuslan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HurazovRuslan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HurazovRuslan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HurazovRuslan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw to
[nltk_data]     C:\Users\HurazovRuslan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw is already up-to-date!


In [2]:
# load the data

# connect to the database
# SQLite needs a local file path rather than a URL, so download the file first from
# https://github.com/KhurazovRuslan/disaster_response_etl_and_ml_pipelines/blob/main/data/DisasterResponse.db
engine = create_engine('sqlite:///C:/Users/HurazovRuslan/Desktop/Pipelines_testing/data/DisasterResponse.db')
# sql query
query = 'SELECT * FROM disaster_messages'

# load data into dataframe
df = pd.read_sql(sql=query, con=engine)


df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26216 entries, 0 to 26215
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      26216 non-null  int64 
 1   message                 26216 non-null  object
 2   original                10170 non-null  object
 3   genre                   26216 non-null  object
 4   related                 26216 non-null  int64 
 5   request                 26216 non-null  int64 
 6   offer                   26216 non-null  int64 
 7   aid_related             26216 non-null  int64 
 8   medical_help            26216 non-null  int64 
 9   medical_products        26216 non-null  int64 
 10  search_and_rescue       26216 non-null  int64 
 11  security                26216 non-null  int64 
 12  military                26216 non-null  int64 
 13  child_alone             26216 non-null  int64 
 14  water                   26216 non-null  int64 
 15  fo

So, we have 40 columns, and 36 of them are the categories we need to predict. As mentioned in the <a href='http://localhost:8888/notebooks/Desktop/Pipelines_testing/data/EDA%20and%20ETL.ipynb'>previous notebook</a>, we'll drop the 'child_alone' column for simplicity, as it only contains the label 0. The features will come from the 'message' column.

In [4]:
df.drop('child_alone', axis=1, inplace=True)

I want to use TF-IDF for feature extraction, but the text has to be preprocessed first. I need to standardize the case of the letters, expand contractions, get rid of all non-alphanumeric characters, lemmatize the words and throw out stop words. All these steps were done in the <a href='http://localhost:8888/notebooks/Desktop/Pipelines_testing/data/EDA%20and%20ETL.ipynb'>previous notebook</a> for exploratory analysis, where we found some extra words that should be added to the stop word vocabulary. This time I also want to use part-of-speech tags before lemmatizing.
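
To make the idea behind TF-IDF a bit more concrete, here is a minimal sketch of the inverse document frequency part in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is my understanding of TfidfVectorizer's default; the function name `smoothed_idf` and the three toy messages are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

def smoothed_idf(tokenized_docs):
    """Return {term: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

# hypothetical, already preprocessed messages
docs = [
    ['need', 'water', 'food'],
    ['need', 'shelter'],
    ['need', 'water'],
]

idf = smoothed_idf(docs)
for term in sorted(idf, key=idf.get):
    print(term, round(idf[term], 3))
```

A term that appears in every message ('need') gets the smallest weight, while a rare term ('shelter') gets the largest, which is why cleaning the tokens before this step matters.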

In [5]:
# functions and variables for text preprocessing

# additional stop words
add_stopwords = ['', ' ', 'say', 's', 'u', 'ap', 'afp', '...', 'n', '\\','http','bit','ly','like','know']

# stop_words
stop_words = ENGLISH_STOP_WORDS.union(add_stopwords)

# contractions dictionary to replace short versions with full versions
contractions_dict = {
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "i'd": "i would",
  "i'd've": "i would have",
  "i'll": "i will",
  "i'll've": "i will have",
  "i'm": "i am",
  "i've": "i have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}

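# compile a single pattern; longer keys go first so that e.g. "can't've" is not cut short by "can't"
c_re = re.compile('(%s)' % '|'.join(sorted(map(re.escape, contractions_dict), key=len, reverse=True)))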

# function to expand contractions
def expand_contractions(message, c_re=c_re):
    
    """
    Expands contractions according to pre-compiled dictionary.
    message - string, text where contractions have to be expanded.
    """
    
    def replace(match):
        return contractions_dict[match.group(0)]
    
    return c_re.sub(replace, message)


# a function to replace nltk pos tags with corresponding word_net pos tags (to use with WordNetLemmatizer)
def word_net_tags(nltk_tag):
    
    """
    Replaces nltk pos tag with corresponding WordNet pos tag
    nltk_tag - nltk pos tag
    Returns a corresponding WordNet pos tag 
    """
    
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
    
    
# lemmatizing function
def lemmatize(tagged_text):
    
    """
    Lemmatizes tokens
    tagged_text - list of tokens and their nltk pos tags
    Returns list of lemmatized tokens 
    """
    
    # create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()
    
    # list of lemmatized tokens
    lem_tokens = []
    
    # a for loop to populate the list
    for word,tag in tagged_text:
        
        word_net_tag = word_net_tags(tag)
        
        if word_net_tag is None:
            # no matching WordNet tag: fall back to the lemmatizer's default (noun)
            lem_tokens.append(lemmatizer.lemmatize(word))
            
        else:
            lem_tokens.append(lemmatizer.lemmatize(word, pos=word_net_tag))
            
            
    return lem_tokens


# text processing function for TfidfVectorizer
def process_text(text):
    
    """
    text - string
    Expands contractions, removes all non-alphanumeric characters, tokenizes and lemmatizes the text and removes stop words
    Returns clean, lemmatized tokens
    """
    
    # decontract text
    text = expand_contractions(text.lower())
           
    # keep only letters and numbers
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
            
    # tokenize
    tokens = word_tokenize(text)
            
    # nltk pos tags
    tokens = nltk.pos_tag(tokens)
            
    # change nltk to wordnet pos tags and lemmatize
    tokens = lemmatize(tokens)
            
    # remove stop words and return clean tokens
    return [token for token in tokens if token not in stop_words]
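
One subtlety with a pattern built by joining dictionary keys with `|` is that Python's regex engine tries the alternatives from left to right, so a short key can match before a longer key that starts with it. The sketch below shows the effect on a tiny, hypothetical subset of the contractions dictionary; the names `contractions`, `naive_re`, `ordered_re` and `expand` are illustrative only.

```python
import re

# tiny, hypothetical subset of a contractions dictionary
contractions = {"can't": "cannot", "can't've": "cannot have", "i'm": "i am"}

# naive pattern: alternatives are tried in insertion order
naive_re = re.compile('|'.join(map(re.escape, contractions)))

# longest-first pattern: longer keys are tried before their own prefixes
ordered_re = re.compile(
    '|'.join(sorted(map(re.escape, contractions), key=len, reverse=True))
)

def expand(text, pattern):
    """Replace every matched contraction with its expansion."""
    return pattern.sub(lambda m: contractions[m.group(0)], text)

naive = expand("i can't've done it", naive_re)
ordered = expand("i can't've done it", ordered_re)
print(naive)
print(ordered)
```

With the naive pattern the leftover "'ve" stays attached to the word, while the longest-first pattern expands the whole contraction.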

Now I want to try several different classifiers in the pipeline, choose the best one and then tune it for better performance. Our data is highly imbalanced, and the number of positive responses (label 1) is very low for each class. I think our classifier has to be very good at identifying label 1 specifically, as missing a true disaster message could have dangerous consequences. So I chose recall score ('weighted') as the main metric: the idea is to identify as many positive labels as possible, even if we get some false positives along the way. F1 score ('weighted') will be displayed as a secondary metric. ROC AUC score shows the quality of the predicted probabilities, which I also want to use in the web app to show the probability that a text is a disaster message. Brier score loss will be used as a secondary metric for the probabilities. Accuracy will also be displayed in the ML pipeline because the project requires it, but with data this imbalanced I don't think accuracy is a good metric here.
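
To show what 'weighted' recall means for multi-label targets, here is a small plain-Python sketch. It assumes that the weighted average takes the recall of the positive class for each label and weights it by the number of true positives of that label (its support), which is my understanding of scikit-learn's `average='weighted'`; the function `weighted_recall` and the toy labels are hypothetical.

```python
def weighted_recall(y_true, y_pred):
    """y_true, y_pred: lists of rows, each row a list of 0/1 labels."""
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0
    for j in range(n_labels):
        true_col = [row[j] for row in y_true]
        pred_col = [row[j] for row in y_pred]
        support = sum(true_col)  # number of actual positives for this label
        hits = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 1)
        recall_j = hits / support if support else 0.0
        weighted_sum += recall_j * support
        total_support += support
    return weighted_sum / total_support

# two hypothetical labels, four messages
y_true_toy = [[1, 0], [1, 1], [1, 0], [0, 1]]
y_pred_toy = [[1, 0], [1, 0], [0, 0], [0, 1]]

score = weighted_recall(y_true_toy, y_pred_toy)
print(score)
```

Under this definition false positives never lower the score, which matches the goal of catching as many real disaster messages as possible; it is also the reason F1 is reported next to it.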

In [6]:
# functions to train and evaluate different classifiers


# function to evaluate probabilities
def evaluate_proba(y_test,y_proba,metric):
    """
    Evaluates predicted probabilities
    y_test - true labels
    y_proba - predicted probabilities
    metric - string, one of 2 metrics, 'brier_score_loss' or 'roc_auc_score'
    Returns the mean score of each predicted probabilities' labels
    """
    
    # list of scores of each label
    scores = []
    
    # a for loop to populate the list
    for i,category in enumerate(list(y_test.columns)):
        
        # if roc_auc_score chosen
        if metric=='roc_auc_score':
            scores.append(roc_auc_score(y_test[category],y_proba[i][:,1]))
            
        elif metric=='brier_score_loss':
            scores.append(brier_score_loss(y_test[category],y_proba[i][:,1]))
            
    return np.mean(scores)


def score_model(models,X_train,y_train,X_test,y_test):
    """
    Scores different models.
    Models are trained on NLP pipeline with multioutputclassifier output
    Returns a dataframe with model name and recall, roc_auc, f1 scores and brier_score_loss 
    
    
    models - a dictionary of different classifiers
    X_train, y_train - features and labels to train on
    X_test, y_test - features and labels to test for scoring
    """
    
    # lists of models' names and scores
    model_name, recall, roc_auc, f1, brier_loss = [],[],[],[],[]
    
    # fit each model and populate score lists
    for label,model in models.items():
        
        model_name.append(label)
        
        pipeline = Pipeline(steps=[
            ('features',TfidfVectorizer(tokenizer=process_text, ngram_range=(1,2), max_df=0.95, min_df=2)),
            ('clf',MultiOutputClassifier(estimator=model))
        ])
        
        pipeline.fit(X_train,y_train)
        
        y_pred = pipeline.predict(X_test)
        
        y_prob = np.array(pipeline.predict_proba(X_test))
        
        recall.append(recall_score(y_test,y_pred, average='weighted'))
        roc_auc.append(evaluate_proba(y_test,y_prob,metric='roc_auc_score'))
        f1.append(f1_score(y_test,y_pred, average='weighted'))
        brier_loss.append(evaluate_proba(y_test,y_prob,metric='brier_score_loss'))
        
        
        
    # dataframe
    df = pd.DataFrame({'model_name':model_name,
                       'recall_score':recall,
                       'roc_auc_score':roc_auc,
                       'f1_score':f1,
                       'brier_score_loss':brier_loss})
    df = df.sort_values(by=['recall_score','roc_auc_score','f1_score'], ascending=False)
    
    return df
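
For intuition about the probability metric, here is a minimal sketch of the Brier score averaged over labels, written in plain Python. The category names and probabilities are made up, and `brier_score` is a hypothetical helper that mirrors the usual definition (mean squared difference between the predicted probability and the 0/1 outcome).

```python
def brier_score(y_true, p_positive):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_positive)) / len(y_true)

# hypothetical true labels and predicted P(label = 1) for two categories
y_true_toy = {'water': [1, 0, 0, 0], 'food': [0, 1, 0, 0]}
y_prob_toy = {'water': [0.9, 0.1, 0.2, 0.0], 'food': [0.4, 0.6, 0.1, 0.1]}

# one score per category, then the mean over categories
per_label = {name: brier_score(y_true_toy[name], y_prob_toy[name])
             for name in y_true_toy}
mean_brier = sum(per_label.values()) / len(per_label)

# a model that always predicts probability 0 on a label with 5% positives
always_zero = brier_score([1] + [0] * 19, [0.0] * 20)

print(per_label, mean_brier, always_zero)
```

The last value shows why a low Brier loss has to be read with care on imbalanced labels: always predicting "no" on a rare category already gives a small loss.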

In [7]:
# models to train
models = {
    'Logistic regression':LogisticRegression(random_state=42),
    'Decision tree':DecisionTreeClassifier(random_state=42),
    'Random forest':RandomForestClassifier(random_state=42),
    'Adaboost':AdaBoostClassifier(random_state=42),
    'Multinomial Naive Bayes':MultinomialNB(),
    'XGBoost':XGBClassifier(random_state=42)
}

In [8]:
# split data
X = df['message']
y = df.iloc[:,4:]

In [9]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# need some data for evaluation
X_test, X_eval, y_test, y_eval = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
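
The two splits above give roughly a 70/15/15 division into train, test and evaluation sets. The sketch below works out the approximate row counts, assuming that train_test_split rounds the held-out share up (my understanding of its behaviour) and using the 26216 rows reported by df.info(); the variable names are illustrative.

```python
import math

n_rows = 26216                        # row count reported by df.info()

# first split: test_size=0.3 is held out, the rest is used for training
n_holdout = math.ceil(0.3 * n_rows)
n_train = n_rows - n_holdout

# second split: half of the held-out rows become the evaluation set
n_eval = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_eval

print(n_train, n_test, n_eval)
```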

In [115]:
%%time
scores_bow = score_model(models,X_train,y_train,X_test,y_test)
scores_bow

Wall time: 37min 24s


Unnamed: 0,model_name,recall_score,roc_auc_score,f1_score,brier_score_loss
5,XGBoost,0.603534,0.830475,0.641227,0.04031
1,Decision tree,0.596386,0.662217,0.601412,0.069652
3,Adaboost,0.583133,0.809574,0.618481,0.225074
0,Logistic regression,0.533092,0.85656,0.577866,0.041543
2,Random forest,0.531566,0.813457,0.575077,0.04189
4,Multinomial Naive Bayes,0.398474,0.660139,0.415164,0.053886


So, XGBoost looks very good across the board. It shows the highest recall and F1 scores, the second-highest ROC AUC score and the lowest Brier score loss.

Now, let's try adding word count as a feature and see if that's going to improve our scores. We'll have to create a custom object with 'fit' and 'transform' methods in order to add it into the pipeline.

In [10]:
class WordCount(BaseEstimator,TransformerMixin):
    """
    Counts number of words in a message.
    Returns standardized number of words in message. Uses z-score for standardization
    """
    
    # constructor
    def __init__(self):
        pass
    
    # fit method
    def fit(self,df,y_true=None):
        """
        df - Series of messages or texts
        Returns itself
        """
        return self
    
    # transform method
    def transform(self,df,y_true=None):
        """
        df - Series of messages or texts
        Returns a 2d array of standardized numbers of words in each element of df
        """
        
        # word count
        word_count = np.array([len(str(text).split(' ')) for text in df])
        
        # scipy's zscore to standardize the numbers
        word_count = zscore(word_count)
        
        # return a 2d array with one column
        return word_count.reshape(-1, 1)
    
    # fit_transform
    def fit_transform(self,df,y=None):
        """
        df - Series of messages or texts
        Performs a fitting and transformation over df
        """
        
        return self.fit(df,y).transform(df)

Now let's add the new object to the pipeline and score the classifiers again. This time I won't use the Multinomial Naive Bayes classifier, mainly because standardizing the word count can produce negative feature values, and Multinomial Naive Bayes doesn't work with negative values. Plus, it performed the worst previously, so I don't think dropping it will make any difference.
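
A quick plain-Python sketch shows why standardized word counts end up negative for short messages. It uses the population standard deviation, which I believe matches the default of scipy's zscore; the helper `zscores` and the three toy messages are made up.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# hypothetical messages
messages = ['need water', 'we need food and water now', 'help']
word_counts = [len(m.split(' ')) for m in messages]

z = zscores(word_counts)
print(word_counts, [round(v, 3) for v in z])
```

Every message shorter than the average gets a negative value, so a feature like this cannot be fed to a classifier that expects non-negative counts.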

In [11]:
# models to train
models = {
    'Logistic regression':LogisticRegression(random_state=42),
    'Decision tree':DecisionTreeClassifier(class_weight='balanced', random_state=42),
    'Random forest':RandomForestClassifier(class_weight='balanced', random_state=42),
    'Adaboost':AdaBoostClassifier(random_state=42),
    'XGBoost':XGBClassifier(random_state=42)
}

def score_model(models,X_train,y_train,X_test,y_test):
    """
    Scores different models.
    Models are trained on NLP pipeline with multioutputclassifier output
    Returns a dataframe with model name and recall, roc_auc, f1 scores and brier_score_loss 
    
    
    models - a dictionary of different classifiers
    X_train, y_train - features and labels to train on
    X_test, y_test - features and labels to test for scoring
    """
    
    # lists of models' names and scores
    model_name, recall, roc_auc, f1, brier_loss = [],[],[],[],[]
    
    # fit each model and populate score lists
    for label,model in models.items():
        
        model_name.append(label)
        
        pipeline = Pipeline(steps=[
            ('features',FeatureUnion(transformer_list=[
                ('tfidf',TfidfVectorizer(tokenizer=process_text, ngram_range=(1,2), max_df=0.95, min_df=2)),
                ('wordcount',WordCount())
            ])),
            ('clf',MultiOutputClassifier(estimator=model))
        ])
        
        pipeline.fit(X_train,y_train)
        
        y_pred = pipeline.predict(X_test)
        
        y_prob = np.array(pipeline.predict_proba(X_test))
        
        recall.append(recall_score(y_test,y_pred, average='weighted'))
        roc_auc.append(evaluate_proba(y_test,y_prob,metric='roc_auc_score'))
        f1.append(f1_score(y_test,y_pred, average='weighted'))
        brier_loss.append(evaluate_proba(y_test,y_prob,metric='brier_score_loss'))
        
        
        
    # dataframe
    df = pd.DataFrame({'model_name':model_name,
                       'recall_score':recall,
                       'roc_auc_score':roc_auc,
                       'f1_score':f1,
                       'brier_score_loss':brier_loss})
    df = df.sort_values(by=['recall_score','roc_auc_score','f1_score'], ascending=False)
    
    return df

In [118]:
%%time
scores_wc = score_model(models,X_train,y_train,X_test,y_test)
scores_wc

Wall time: 33min 51s


Unnamed: 0,model_name,recall_score,roc_auc_score,f1_score,brier_score_loss
1,Decision tree,0.628594,0.677532,0.587169,0.090209
4,XGBoost,0.604177,0.831996,0.643115,0.040343
3,Adaboost,0.570924,0.818169,0.618956,0.225262
0,Logistic regression,0.534297,0.858119,0.581687,0.041291
2,Random forest,0.516145,0.830301,0.556463,0.04173


Well, the only classifier whose recall clearly improved is the Decision tree; the others either barely changed or got worse. Keep in mind that the Decision tree and Random forest were also given class_weight='balanced' in this run, so the change in their scores cannot be attributed to the word count alone. Although the Decision tree has the highest recall score now (around 0.63), XGBoost still looks more balanced, with a ROC AUC score around 0.83, the highest F1 score at 0.64, the lowest Brier loss of 0.04 and the second-highest recall score at 0.6. So, I'll continue with the XGBoost classifier.

Since adding the word count did not improve the chosen classifier, I'm not going to add it to the pipeline.

Alright, we have our overall best-performing classifier. Now it's time to tune the pipeline to try to get a better score. I'll use GridSearchCV for that. Since this process takes a lot of time, I'll tune only a few parameters: 2 parameters for TfidfVectorizer and 3 parameters for the XGBoost classifier. I'll also use only 2-fold cross-validation, with weighted recall as the scoring metric.

But first, let's evaluate our pipeline with only XGBoost classifier as estimator.

In [16]:
# pipeline with chosen classifier
pipeline = Pipeline(steps=[
    ('features',TfidfVectorizer(tokenizer=process_text, ngram_range=(1,2), max_df=0.95, min_df=2)),
    ('clf',MultiOutputClassifier(XGBClassifier(random_state=42)))
])

In [19]:
# a function to evaluate chosen classifier
def eval_model(pipeline=pipeline,X_train=X_train,y_train=y_train,X_test=X_test,y_test=y_test):
    """
    Evaluates a chosen model.
    pipeline - model (pipeline) to evaluate,
    X_train, y_train - features and labels to train on
    X_test, y_test - features and labels to evaluate the model
    Returns a dataframe with model's recall (weighted), roc_auc, f1 (weighted) scores and Brier score loss
    """
    
    # train the model
    pipeline.fit(X_train,y_train)
    
    # predictions
    y_pred = pipeline.predict(X_test)
    y_prob = np.array(pipeline.predict_proba(X_test))
    
    # evaluations in dataframe
    evaluations = pd.DataFrame({'recall_score':[recall_score(y_test,y_pred, average='weighted')],
                                'roc_auc_score':[evaluate_proba(y_test, y_prob, metric='roc_auc_score')],
                                'f1_score':[f1_score(y_test,y_pred, average='weighted')],
                                'brier_score_loss':[evaluate_proba(y_test, y_prob, metric='brier_score_loss')]})
    
    return evaluations

In [20]:
%%time
# evaluate
eval_model()

Wall time: 9min 25s


Unnamed: 0,recall_score,roc_auc_score,f1_score,brier_score_loss
0,0.603534,0.830475,0.641227,0.04031


Now, the Grid Search!

In [119]:
# pipeline
pipeline = Pipeline(steps=[
    ('features',TfidfVectorizer(tokenizer=process_text, ngram_range=(1,2), max_df=0.95, min_df=2)),
    ('clf',MultiOutputClassifier(XGBClassifier(random_state=42)))
])

# parameters to tune
params = {
    'features__ngram_range':[(1,2),(1,3)],
    'features__max_df':[0.95,0.5,0.3],
    'clf__estimator__n_estimators':[500,1000],
    'clf__estimator__max_depth':range(6,8),
    'clf__estimator__eta':[0.01,0.1]
}


# model
model = GridSearchCV(estimator=pipeline, param_grid=params, cv=2, scoring='recall_weighted', verbose=10)
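
Before starting a search like this it helps to count how much work the grid implies. The sketch below enumerates the parameter combinations with itertools; the grid values are copied from the cell above, while `n_folds` and `n_labels` are my own names, and the 35 labels are derived from the column counts earlier (40 columns, minus 4 non-label columns, minus 'child_alone').

```python
from itertools import product

param_grid = {
    'features__ngram_range': [(1, 2), (1, 3)],
    'features__max_df': [0.95, 0.5, 0.3],
    'clf__estimator__n_estimators': [500, 1000],
    'clf__estimator__max_depth': range(6, 8),
    'clf__estimator__eta': [0.01, 0.1],
}

n_folds = 2     # cv=2
n_labels = 35   # assumption: one XGBoost model per output label

# every combination of one value per parameter is one candidate
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * n_folds
n_boosters = n_fits * n_labels

print(len(candidates), n_fits, n_boosters)
```

Because MultiOutputClassifier trains a separate model for each label, every pipeline fit hides many XGBoost fits, which goes a long way towards explaining a long running time.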

In [120]:
%%time
model.fit(X_train,y_train)

Fitting 2 folds for each of 48 candidates, totalling 96 fits
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 2), score=0.559, total=17.8min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 17.8min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 2), score=0.558, total=17.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 3) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 34.9min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 3), score=0.559, total=17.9min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 3) 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 52.9min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.95, features__ngram_range=(1, 3), score=0.559, total=19.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 72.1min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 2), score=0.559, total=16.1min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 88.2min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 2), score=0.558, total=17.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 3) 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 105.4min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 3), score=0.559, total=19.6min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 3) 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 125.0min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.5, features__ngram_range=(1, 3), score=0.559, total=20.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 145.3min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 2), score=0.559, total=16.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 161.5min remaining:    0.0s


[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 2), score=0.558, total=16.8min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 3), score=0.559, total=17.6min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 3), score=0.559, total=19.0min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.01, clf__e

[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=500, features__max_df=0.3, features__ngram_range=(1, 3), score=0.565, total=20.4min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2), score=0.581, total=31.5min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2), score=0.581, total=33.2min
[CV] clf__estimator__eta=0.01, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.01

[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 2), score=0.604, total=29.8min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3), score=0.601, total=31.1min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3), score=0.604, total=33.5min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=6, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.1, clf_

[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.95, features__ngram_range=(1, 3), score=0.602, total=35.3min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 2), score=0.595, total=29.2min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 2) 
[CV]  clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 2), score=0.603, total=30.8min
[CV] clf__estimator__eta=0.1, clf__estimator__max_depth=7, clf__estimator__n_estimators=1000, features__max_df=0.5, features__ngram_range=(1, 3) 
[CV]  clf__estimator__eta=0.1, clf__est

[Parallel(n_jobs=1)]: Done  96 out of  96 | elapsed: 2413.2min finished


Wall time: 1d 17h 26min 52s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('features',
                                        TfidfVectorizer(max_df=0.95, min_df=2,
                                                        ngram_range=(1, 2),
                                                        tokenizer=<function process_text at 0x000002024AF60B88>)),
                                       ('clf',
                                        MultiOutputClassifier(estimator=XGBClassifier(base_score=None,
                                                                                      booster=None,
                                                                                      colsample_bylevel=None,
                                                                                      colsample_bynode=None,
                                                                                      colsample_bytree=None,
                                                                                 

In [122]:
model.best_params_

{'clf__estimator__eta': 0.1,
 'clf__estimator__max_depth': 6,
 'clf__estimator__n_estimators': 1000,
 'features__max_df': 0.95,
 'features__ngram_range': (1, 3)}

Now let's build the pipeline with the best parameters found. I also want to combine the train and test sets into one training set, train the pipeline on it, and evaluate the model on the held-out evaluation set (X_eval, y_eval). Finally, I want to change the evaluation function a bit so that the trained classifier is saved as a .pkl file.

In [50]:
# a function to evaluate and save trained classifier
def eval_model(pipeline=pipeline,X_train=X_train,y_train=y_train,X_test=X_test,y_test=y_test):
    """
    Evaluates a chosen model.
    pipeline - model (pipeline) to evaluate,
    X_train, y_train - features and labels to train on
    X_test, y_test - features and labels to evaluate the model
    Returns a dataframe with the model's recall (weighted), roc_auc, f1 (weighted) scores and Brier score loss
    """
    
    # train the model
    pipeline.fit(X_train,y_train)
    
    # predictions
    y_pred = pipeline.predict(X_test)
    y_prob = np.array(pipeline.predict_proba(X_test))
    
    # evaluations in dataframe
    evaluations = pd.DataFrame({'recall_score':[recall_score(y_test,y_pred, average='weighted')],
                                'roc_auc_score':[evaluate_proba(y_test, y_prob, metric='roc_auc_score')],
                                'f1_score':[f1_score(y_test,y_pred, average='weighted')],
                                'brier_score_loss':[evaluate_proba(y_test, y_prob, metric='brier_score_loss')]})
    
    # save classifier (the context manager closes the file handle after writing)
    with open('trained_classifier.pkl','wb') as f:
        pickle.dump(pipeline, f)
    
    return evaluations

In [37]:
# pipeline with tuned parameters
pipeline = Pipeline(steps=[
    ('features',TfidfVectorizer(tokenizer=process_text, ngram_range=(1,3), max_df=0.95, min_df=2)),
    ('clf',MultiOutputClassifier(XGBClassifier(max_depth=6, n_estimators=1000, eta=0.1, random_state=42)))
])
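
The tuned values in the cell above are copied by hand from model.best_params_. As a possible alternative, the same dictionary can be passed to the pipeline's set_params method, which routes each 'step__param' key to the matching nested estimator. The sketch below is only an illustration under stated assumptions: best_params is a hand-written stand-in for model.best_params_, tuned_pipeline is a hypothetical name, and LogisticRegression (with its C parameter) replaces XGBClassifier purely so the example runs without xgboost.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Stand-in for model.best_params_ (assumption: in the notebook this dict
# would come from the fitted GridSearchCV object, with XGBoost parameters).
best_params = {
    'clf__estimator__C': 0.5,
    'features__max_df': 0.95,
    'features__ngram_range': (1, 3),
}

# A fresh, unfitted pipeline with the same step names as in the notebook.
# LogisticRegression is a swapped-in classifier, not the notebook's model.
tuned_pipeline = Pipeline(steps=[
    ('features', TfidfVectorizer(min_df=2)),
    ('clf', MultiOutputClassifier(LogisticRegression())),
])

# set_params sends each 'step__param' key to the corresponding nested estimator
tuned_pipeline.set_params(**best_params)

params = tuned_pipeline.get_params()
print(params['features__ngram_range'], params['clf__estimator__C'])
```

This keeps the final pipeline in sync with the search results if the grid is rerun, at the cost of hiding the chosen values from the reader of the cell.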

In [36]:
train_X = pd.concat([X_train,X_test])
train_y = pd.concat([y_train,y_test])

In [38]:
%%time
# evaluate
eval_model(pipeline,train_X,train_y,X_eval,y_eval)

Wall time: 1h 36min 25s


   recall_score  roc_auc_score  f1_score  brier_score_loss
0      0.611016       0.808242  0.651459          0.039614


We can see that with tuned parameters our classifier performed almost the same as with default parameters (recall, f1 and Brier score loss are a bit better, while roc_auc is a bit worse). I think this is a good sign, considering it was evaluated on the previously unseen evaluation set. This model was saved and will now be used in the web app.
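
For reference, a minimal sketch of the save and load round trip that the web app would rely on. It uses a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, so the names stand_in_model, model_path and loaded_model, as well as the example labels, are assumptions for illustration and not part of the project.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline (hypothetical content); any picklable
# object goes through the same dump/load steps.
stand_in_model = {'name': 'trained_classifier', 'labels': ['water', 'food']}

# Temporary location so the sketch does not overwrite a real model file
model_path = os.path.join(tempfile.mkdtemp(), 'trained_classifier.pkl')

# save: the context manager closes the file once the dump is done
with open(model_path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# load: this is the step an application would run at start-up
with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```

One caveat for the real pipeline: pickle stores functions by reference (module and name), so the process that loads the file needs the process_text tokenizer to be importable under the same module path it had when the pipeline was saved, and compatible library versions should be installed.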

The full script for the ML pipeline can be found in the train_classifier.py file.