# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.<br>

My general notes:<br>
Keep in mind that this is a multi-class, multi-label text classification task that assigns a set of category target labels to each message sample. The messages are short and the data distribution is imbalanced. The dataset has 19634 data points with 32 different target categories.

While processing the disaster messages, the English text is tokenized, lower-cased and lemmatized, and contractions are expanded. Additionally, whitespace, punctuation and English stop words, among others, are removed.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
#
# import libraries
#

# download necessary NLTK data
import nltk
#nltk.download(['punkt', 'wordnet', 'stopwords'])

# import libraries
import random as rn
import numpy as np
import pandas as pd
import string
import pickle
from sqlalchemy import create_engine
import Contractions

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from bs4 import BeautifulSoup

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# warnings status to show: report each distinct warning only once
import warnings
warnings.filterwarnings("once")



Make the code reproducible ...

In [2]:
FIXED_SEED = 42

# The below is necessary for starting NumPy generated random numbers in a well-defined initial state.
np.random.seed(FIXED_SEED)

# The below is necessary for starting core Python generated random numbers in a well-defined state.
rn.seed(FIXED_SEED)

In [3]:
# load data from database
try:
    engine = create_engine('sqlite:///Disaster_Messages_engine.db')
    df = pd.read_sql_table('Messages_Categories_table', engine)
    
    # success
    print("The dataset has {} data points with {} variables each.".format(*df.shape))
except Exception as e:
    print("The database 'Disaster_Messages_engine.db' could not be loaded. No ML pipeline activities possible.")
    print("Reason: {}".format(e))

The dataset has 14403 data points with 36 variables each.


In [4]:
df.head()

Unnamed: 0,message,original,genre,lang_code,related,request,offer,aid_related,medical_help,medical_products,...,tools,hospitals,shops,aid_centers,weather_related,floods,storm,fire,earthquake,cold
0,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,en,1,0,0,1,0,0,...,0,0,0,0,1,0,1,0,0,0
1,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,en,1,1,0,1,0,1,...,0,1,0,0,0,0,0,0,0,0
2,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,direct,en,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
3,"Please, we need tents and water. We are in Sil...",Tanpri nou bezwen tant avek dlo nou zon silo m...,direct,en,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,I am in Croix-des-Bouquets. We have health iss...,"Nou kwadebouke, nou gen pwoblem sant m yo nan ...",direct,en,1,1,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [34]:
# create input (X) and output (y) samples, we know that related is always one ...
# as input we have to take care about the messages
# the categories are the targets of the multi-class, multi-label classification
X = df[['message']]
y = df[df.columns[4:]]

In [35]:
X.head(2)

Unnamed: 0,message
0,Is the Hurricane over or is it not over
1,UN reports Leogane 80-90 destroyed. Only Hospi...


In [6]:
y

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,tools,hospitals,shops,aid_centers,weather_related,floods,storm,fire,earthquake,cold
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
1,1,1,0,1,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
3,1,1,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,1,0,1,1,1,0,0,0,1,...,0,0,0,0,1,1,0,0,0,0
6,1,1,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,1,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,1,1,0,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [7]:
for group in y.columns:
    print("'{}' includes {} x value 1.".format(group, y[group].sum()))

'related' includes 14403 x value 1.
'request' includes 4374 x value 1.
'offer' includes 117 x value 1.
'aid_related' includes 10729 x value 1.
'medical_help' includes 2066 x value 1.
'medical_products' includes 1297 x value 1.
'search_and_rescue' includes 718 x value 1.
'security' includes 467 x value 1.
'military' includes 857 x value 1.
'water' includes 1650 x value 1.
'food' includes 2885 x value 1.
'shelter' includes 2281 x value 1.
'clothing' includes 401 x value 1.
'money' includes 598 x value 1.
'missing_people' includes 297 x value 1.
'refugees' includes 872 x value 1.
'death' includes 1187 x value 1.
'other_aid' includes 3392 x value 1.
'infrastructure_related' includes 1688 x value 1.
'transport' includes 1197 x value 1.
'buildings' includes 1313 x value 1.
'electricity' includes 528 x value 1.
'tools' includes 158 x value 1.
'hospitals' includes 283 x value 1.
'shops' includes 118 x value 1.
'aid_centers' includes 308 x value 1.
'weather_related' includes 7209 x value 1.
'fl

In [8]:
# label count 1 shall not exist anymore
df[df.columns[4:]].iloc[:,:].sum(axis=1).value_counts()

3     3857
4     3232
5     2332
6     1675
7     1084
2      742
8      661
9      356
10     176
11     125
12      57
13      43
14      22
16      11
15      10
17       5
19       4
18       4
20       3
25       1
24       1
22       1
23       1
dtype: int64

### 2. Write a tokenization function to process your text data

During the ETL pipeline activities we realised that there are messages which are not useful (e.g. 'nonsense' character sequences, html characters) and that web links are probably included. We have to deal with this in the tokenize() function.

In [9]:
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [13]:
# function from Dipanjan's repository:
# https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/nlp%20proven%20approach/NLP%20Strategy%20I%20-%20Processing%20and%20Understanding%20Text.ipynb

def expand_contractions(text, contraction_mapping):
    
    # escape the keys and try longer ones first, otherwise "can't" would shadow "can't've"
    sorted_keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in sorted_keys)), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # look up the contraction as written, fall back to its lower-case form
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        # keep the original capitalisation; keys like "'cause" start with an apostrophe
        if first_char.isalpha():
            expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    # remove the remaining apostrophes
    expanded_text = re.sub("'", "", expanded_text)
    
    return expanded_text
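
A regular expression alternation tries its alternatives from left to right, so a short key such as `can't` can shadow a longer one such as `can't've`. The following minimal sketch shows the effect of sorting the keys longest-first. `SAMPLE_MAP` and `expand_sample` are hypothetical names used only for this illustration, and the map is a small subset, not the full `CONTRACTION_MAP`.

```python
import re

# Small illustrative subset of a contraction map (hypothetical, not the full CONTRACTION_MAP)
SAMPLE_MAP = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
    "i'm": "i am",
}

def expand_sample(text, mapping):
    # Longest keys first, so that "can't've" is tried before "can't"
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # keep the capitalisation of the first character
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)

print(expand_sample("I'm sure we can't've missed it, don't worry.", SAMPLE_MAP))
# prints: I am sure we cannot have missed it, do not worry.
```

Without the sorting step, `can't've` would be expanded to `cannot` followed by the leftover `'ve`.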

In [10]:
def tokenize(text):
    # have in mind that we use this for a web app adding new messages;
    # if still html, xml or other undefined parts in the existing messages:
    # first remove such metatext from English messages
    # see: https://docs.python.org/3.7/library/codecs.html#encodings-and-unicode
    # "To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence,
    # there’s the so called BOM (“Byte Order Mark”). [...]
    # In UTF-8, the use of the BOM is discouraged and should generally be avoided."
    # specific ones are e.g. notepad signatures from Microsoft as part of the messages which should be avoided;
    # other undefined characters have the coding of the 'replacement character' unicode u"\ufffd"
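    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # Python 3 strings are already decoded (str has no decode() method), therefore
    # the BOM and the 'replacement character' are handled directly on the string
    bom_removed = souped.replace(u"\ufeff", "").replace(u"\ufffd", "?")
    
    stop_words = set(stopwords.words('english'))
    #stop_words.remove('no')
    #stop_words.remove('not')
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'       
    detected_urls = re.findall(url_regex, bom_removed)
    # start from the cleaned text, so that every detected URL is replaced
    # and messages without URLs are cleaned as well
    text = bom_removed
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")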
        
    # change the negation wordings like don't to do not, won't to will not 
    # or other contractions like I'd to I would, I'll to I will etc. via dictionary
    text = expand_contractions(text, CONTRACTION_MAP)

    # remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
    text = text.translate(str.maketrans('','', string.punctuation))
    # remove numbers
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # during the ETL pipeline we have reduced the dataset to English messages ('en' language coding),
    # but there can be some wrong codings
    tokens = word_tokenize(letters_only, language='english')
    lemmatizer = WordNetLemmatizer()  # for the lexical correctly found word stem (root)

    clean_tokens = []
    for tok in tokens:
        # use only lower cases, remove leading and ending spaces
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        # remember: there have been nonsense sentences, so, now some strings could be empty
        # toDo: what is the correct length number to use now? Small ones are probably no relevant words ...
        # remove English stop words
        if (len(clean_tok) > 1) and (clean_tok not in stop_words):
            clean_tokens.append(clean_tok)

    return clean_tokens

In [11]:
# example for unit test to remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
example_str = 'This [is an] example? {of} string. with.? some &punctuation &signs!!??!!'
result = example_str.translate(str.maketrans('','', string.punctuation))
print(result)
# output shall be: This is an example of string with some punctuation signs

This is an example of string with some punctuation signs


In [36]:
# test tokenize
for message in X['message'][:10]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Is the Hurricane over or is it not over
['hurricane'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'needs', 'supply', 'desperately'] 

Storm at sacred heart of jesus
['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you!
['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
['croixdesbouquets', 'health', 'issue', 'worker', 'santo', 'area', 'croixdesbouquets'] 

There's nothing to eat and water, we starving and thirsty.
['nothing', 'eat', 'water', 'starving', 'thirsty'] 

I am in Thomassin number 32, in the area named Pyron. I would like to have some water. Thank God we are fine, but we desperately need water. Thanks
['thomassin', 'number', 'area', 'named', 'pyron', 'would', 'like', 'wa

### 3. Build a machine learning pipeline
Notes:
- Regarding the class default parameters, for this Python implementation scikit-learn version 0.21.2 is used.
- We use np.random.seed() in addition to the random_state/random_seed parameters ([reason](https://stackoverflow.com/questions/47923258/random-seed-on-svm-sklearn-produces-different-results))
- For the pipeline workflow a `FeatureUnion` instance concatenates the results of multiple transformer objects

Remember that we are dealing with an imbalanced dataset, so not every model is suitable. A machine learning classifier can become biased towards the majority class and classify the minority class badly. We start with <i>LogisticRegression</i> and use other appropriate models later in this Python implementation. If the metric evaluation of the models shows issues, we have to change the dataset, e.g. by undersampling or oversampling, to get a more balanced one.

This machine learning pipeline should take the `message` column as input and output classification results for the remaining 31 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

As the first estimator we use [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression). Its default parameter values are:<br>
LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

We have to solve a supervised, multi-class, multi-label problem, therefore some parameters have to be changed:<br>
- solver = 'saga'  (handles L1 and L2; according to the scikit-learn documentation it is often the best choice)
- multi_class = 'multinomial' (the pipeline below passes 'auto', which according to the scikit-learn documentation selects 'ovr' for binary data and 'multinomial' otherwise)
- C = the optimal value for the inverse of the regularization strength is going to be set later, in the cross validation optimisation subchapter of this project.


Regarding feature extraction:<br>
To measure the word frequency of each text term, scikit-learn offers the <i>Term Frequency - Inverse Document Frequency</i> approach in two classes: a vectoriser and a transformer. The class used here, [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), has the default parameters:<br>
TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Both pipeline classes rely on the L2 norm: <i>TfidfTransformer</i> uses it to normalise each feature vector, and <i>LogisticRegression</i> uses it as regularisation penalty. Because of the normalisation all feature vectors have a Euclidean norm of 1, so the number of words in a message has no influence on our result. As stated in the scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_20newsgroups.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-20newsgroups-py), "if the goal is to get the best predictive accuracy, it is better to use the non sparsity-inducing l2 penalty instead."<br>
The text messages are transformed to a number vector representation used to train supervised classifiers able to predict the associated categories of future, new messages.

Other machine learning models for imbalanced datasets, like the Linear Support Vector Machine, the Multinomial Naive Bayes classifier or the Random Forest ensemble classifier, as well as the model parameter optimisation are covered in the project subchapters below. Keep in mind that text data live in high-dimensional bag-of-words spaces, where the Euclidean distance does not work well.
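
To illustrate why the message length drops out after L2 normalisation, here is a minimal pure-Python sketch. It assumes the smoothed idf formula documented for `TfidfTransformer(smooth_idf=True)`, i.e. `ln((1 + n) / (1 + df)) + 1`, and raw term counts (not the `sublinear_tf=True` variant used in the pipeline). The toy documents and the helper `tfidf_l2` are hypothetical.

```python
import math
from collections import Counter

# toy token lists of different lengths (hypothetical data)
docs = [
    ["need", "water"],
    ["need", "water", "food", "tent", "need", "water"],
    ["storm", "destroyed", "hospital"],
]

n_docs = len(docs)
# document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf_l2(doc):
    counts = Counter(doc)
    # term count times smoothed inverse document frequency
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in counts.items()}
    # L2 normalisation: divide by the Euclidean length of the vector
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_l2(doc) for doc in docs]
norms = [math.sqrt(sum(w * w for w in vec.values())) for vec in vectors]
print(norms)  # every vector has Euclidean length 1 (up to rounding)
```

The short and the long message end up on the same unit sphere, only the direction of the vector differs.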

In [15]:
pipeline = Pipeline([
        ('features', FeatureUnion([ 
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1,2))),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
            ]))
            
        ])),
    
        ('clf', MultiOutputClassifier(LogisticRegression(solver='saga', multi_class='auto',
                                                         max_iter=500, class_weight='balanced', 
                                                         n_jobs=-1, random_state=FIXED_SEED)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

Before using the pipeline on the whole dataset, do a single prediction with its components in default form for an example disaster message of the dataset.

In [16]:
X.shape

(14403, 1)

In [17]:
y.shape

(14403, 32)

In [27]:
# shuffle is by default set on True,
# usage of stratify param leads to stratify split technique for this imbalanced dataset,
# having both would be a StratifiedShuffleSplit algorithm in the background,
# but
# stratify=y leads to a ValueError: The least populated class in y has only 1 member, which is too few.
# The minimum number of groups for any class cannot be less than 2.
# ToDo: clarify why => must be: stratify=y.iloc[:,1]
# Therefore I added randomised resampling to reduce the probability of getting ValueError's during model fit()
# and after clarification comment them out

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y.iloc[:,1],
                                                    test_size=0.2, random_state=FIXED_SEED)
#X_train = X_train.sample(n = X_train.shape[0], axis=0, random_state=FIXED_SEED) 
#y_train = y_train.sample(n = y_train.shape[0], axis=0, random_state=FIXED_SEED)
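
A possible explanation for the `ValueError` with `stratify=y`: for a two-dimensional target, scikit-learn's stratified splitting treats every distinct row, i.e. every label combination, as its own class. Rare combinations then have a single member, while a single column such as `request` has plenty of members per value. A minimal sketch with hypothetical toy rows:

```python
from collections import Counter

# toy multi-label target rows (hypothetical data): columns = (request, water, fire)
rows = [
    (1, 0, 0), (1, 0, 0), (1, 0, 0),
    (0, 1, 0), (0, 1, 0),
    (1, 1, 1),  # unique label combination, only one member
]

# stratifying on all columns: each distinct row is one "class"
combo_counts = Counter(rows)
singletons = [combo for combo, n in combo_counts.items() if n < 2]
print(singletons)  # label combinations that are too rare to be split

# stratifying on one column only: every value has at least two members
request_counts = Counter(row[0] for row in rows)
print(request_counts)
```

Stratifying on one column keeps the split feasible, at the price that only the distribution of this single label is preserved.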

In [28]:
X_train.shape

(11522, 1)

In [29]:
y_train.shape

(11522, 32)

Now, we train the pipeline ...

In [30]:
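# CountVectorizer expects an iterable of strings; passing the one-column DataFrame X_train
# raised "ValueError: Found input variables with inconsistent numbers of samples: [1, 11522]",
# therefore the 'message' column is passed as a Series
pipeline.fit(X_train['message'], y_train)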
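
The `ValueError` about inconsistent sample numbers `[1, 11522]` can be traced back to how a one-column DataFrame is iterated: it yields its column labels, so a vectoriser sees a single "document". A minimal sketch, assuming pandas is available and using hypothetical toy messages:

```python
import pandas as pd

frame = pd.DataFrame({"message": ["need water", "storm destroyed hospital", "thank you"]})

as_frame = frame[["message"]]   # one-column DataFrame, shape (3, 1)
as_series = frame["message"]    # Series of strings, shape (3,)

# Iterating a DataFrame yields its column labels, not its rows
print(list(as_frame))           # ['message']

# Iterating a Series yields the values, here one string per message
print(len(list(as_series)))     # 3
```

Passing the Series (or `X_train['message']`) to the text pipeline therefore gives one sample per message.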

And calculate the LogisticRegression model prediction ...

In [35]:
y_logr_pred = pipeline.predict(X_test['message'])

In [23]:
#y_test.shape

In [24]:
#y_logr_pred.shape

In [None]:
#df_pred = pd.DataFrame(y_logr_pred, columns=y_test.columns)
#df_pred.to_csv("y_logr_pred_file.csv")  
#y_test.to_csv("y_logr_test_file.csv")  

In [25]:
# Are there differences in the test dataFrame and in y_pred, means exists a test item without corresponding prediction
#set(y_test) - set(df_pred)

### 5. Test your model
For evaluation:<br>
Report accuracy score, f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each, where:

TP = TruePositive; FP = FalsePositive; TN = TrueNegative; FN = FalseNegative.

**Accuracy Score** is a classification score. It is the number of correct predictions made divided by the total number of predictions made. In a multilabel classification task it computes subset accuracy. 
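
Subset accuracy is a strict measure: a sample only counts as correct if all of its labels are predicted correctly. The following sketch with hypothetical toy labels contrasts it with the share of correctly predicted single labels.

```python
# toy multi-label data (hypothetical): 4 samples, 3 labels each
y_true = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
y_pred = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 0)]

# subset accuracy: the whole label row has to match
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# label-wise accuracy: every single label is counted on its own
n_labels = len(y_true[0])
label_hits = sum(t[j] == p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
label_accuracy = label_hits / (len(y_true) * n_labels)

print(subset_accuracy)   # 0.5
print(label_accuracy)    # 10 of 12 labels are correct
```

With many target categories, a low subset accuracy is therefore expected even if most single labels are predicted correctly.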
  
Besides accuracy, we add further metrics to compare the model performance on the originally imbalanced dataset. Accuracy focuses too much on the majority classes: a model that mainly fits these classes reaches a high accuracy value, which is therefore misleading.

**Precision** quantifies the binary precision, in other words it is a measure of a classifier's exactness. It is the ratio of true positives (messages correctly classified to their categories) to all positives (all messages classified to categories, irrespective of whether that was the correct classification), in other words it is the ratio of

TP / (TP + FP)

**Recall** tells us what proportion of the messages that actually belong to a specific category were assigned to this category by the model. It is a measure of a classifier's completeness: the ratio of true positives to all messages that actually belong to the category, in other words it is the ratio of

TP / (TP + FN)

A model's ability to precisely predict the correctly categorised disaster messages is more important than its ability to recall all of them.

We can use the **F-beta score** as a metric that considers both precision and recall. According to scikit-learn, the F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

In particular, when β=0.5, more emphasis is placed on precision. And when β=1.0 recall and precision are equally important.
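
The effect of β can be checked with a small calculation. This is a minimal sketch with hypothetical counts for a classifier that is precise but incomplete; `precision_recall_fbeta` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    # F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

# hypothetical counts: 30 true positives, 10 false positives, 70 false negatives
p, r, f1 = precision_recall_fbeta(tp=30, fp=10, fn=70, beta=1.0)
_, _, f05 = precision_recall_fbeta(tp=30, fp=10, fn=70, beta=0.5)

print("precision={:.2f} recall={:.2f}".format(p, r))
print("F1={:.3f} F0.5={:.3f}".format(f1, f05))
```

Because precision (0.75) is higher than recall (0.30) in this example, the precision-weighted F0.5 comes out higher than F1.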

According to scikit-learn: "The **F1 score** ... reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter."

From scikit-learn documentation for the classification report:<br>
The classification_report() function returns an additional value: **Support** - the number of occurrences of each label in y_true.<br>
The reported averages include macro average (averaging the unweighted mean per label), weighted average (averaging the support-weighted mean per label), sample average (only for multilabel classification) and micro average (averaging the total true positives, false negatives and false positives). The micro average is only shown for multi-label or multi-class with a subset of classes, because it is accuracy otherwise.
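
The difference between macro and micro averaging becomes visible on imbalanced labels. A minimal sketch with hypothetical per-label counts, one frequent and one rare label (the label names are only borrowed for illustration):

```python
# hypothetical per-label counts as (tp, fp, fn)
counts = {"aid_related": (600, 200, 400), "offer": (0, 5, 30)}

def f1(tp, fp, fn):
    # F1 expressed with counts: 2*TP / (2*TP + FP + FN)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# macro: unweighted mean of the per-label F1 scores
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# micro: pool the counts of all labels first, then compute one F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print("macro F1 = {:.3f}, micro F1 = {:.3f}".format(macro_f1, micro_f1))
```

The rare label with an F1 of 0 pulls the macro average down, while the micro average is dominated by the frequent label.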

Note: Although they would be useful for the imbalanced dataset, Cohen's Kappa and the Confusion Matrix are not used, because scikit-learn's `cohen_kappa_score` and `confusion_matrix` do not accept multi-label targets.

In [25]:
def display_results(y_test, y_pred, cv=None):
    # text summary of the overall accuracy, precision, recall, F1 score for each class   
    print("First: overall accuracy score: {:5f}".format(accuracy_score(y_test, y_pred)))
    
    target_names = y_test.columns
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # shows F1_score, precision and recall
    class_report = classification_report(y_test, y_pred, target_names=target_names)
    print("Classification Report for each target class:\n", class_report)

    if cv is not None:
        print("\n\n---- Best Parameters: ----\n{}".format(cv.best_params_))

In [None]:
display_results(y_test, y_logr_pred, None)

Such kind of behaviour has been ..... 

### 6. Improve your model
Use grid search to find better parameters. 

In [26]:
pipeline.get_params()

In [55]:
# specify parameters for grid search
# keys are built as <step>__<parameter> with double underscores between all levels;
# the LogisticRegression parameters are reached via clf__estimator__<parameter>
parameters = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1,3)],
    'features__text_pipeline__vect__max_df': [0.5, 0.75, 1.0],
    'features__text_pipeline__tfidf__sublinear_tf': [True],
    'clf__estimator__C' : [0.01, 0.1, 1, 10, 50, 100, 200],  
    'clf__estimator__max_iter': [1000, 1500, 3000, 5000]
}

# create grid search object
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
grid_cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1, cv=10, verbose=1)
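
Grid search fits one model per parameter combination and fold, so the run time grows multiplicatively with every added value. A rough estimate for the value lists of the grid as written in the cell above, with `cv=10`; the dictionary keys are shortened here for readability and are not valid pipeline parameter names.

```python
# value lists copied from the grid above; keys shortened (illustration only)
grid = {
    "ngram_range": [(1, 2), (1, 3)],
    "max_df": [0.5, 0.75, 1.0],
    "sublinear_tf": [True],
    "C": [0.01, 0.1, 1, 10, 50, 100, 200],
    "max_iter": [1000, 1500, 3000, 5000],
}

# number of candidates = product of the list lengths
n_candidates = 1
for values in grid.values():
    n_candidates *= len(values)

n_folds = 10
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 168 1680
```

Reducing the number of folds or the length of a single value list shrinks the number of fits proportionally.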

### 7. Test your model
Show the accuracy, precision, recall and F-score of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [44]:
# model = cv
grid_cv.fit(X_train['message'], y_train)

Fitting 5 folds for each of 56 candidates, totalling 280 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 75.0min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 157.0min
[Parallel(n_jobs=4)]: Done 280 out of 280 | elapsed: 442.3min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('vect',
                                                                                         CountVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.int64'>,
  

In [45]:
y_logr_pred = grid_cv.predict(X_test['message'])

In [46]:
print("Evaluation results for the cross validation tuned 'Logistic Regression' estimator:")
display_results(y_test, y_logr_pred, grid_cv)

Evaluation results for the cross validation tuned 'Logistic Regression' estimator:
First: overall accuracy score: 0.086576
Classification Report for each target class:
                         precision    recall  f1-score   support

               request       0.55      0.47      0.51      1082
                 offer       0.00      0.00      0.00        31
           aid_related       0.57      0.60      0.58      2700
          medical_help       0.20      0.01      0.03       564
      medical_products       0.06      0.00      0.01       317
     search_and_rescue       0.50      0.01      0.02       178
              security       0.00      0.00      0.00       118
              military       0.14      0.00      0.01       213
           child_alone       0.00      0.00      0.00        10
                 water       0.10      0.01      0.02       409
                  food       0.26      0.09      0.13       722
               shelter       0.15      0.03      0.04       55

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


The evaluation result is a little bit better, but still includes categories with 0.0 values. These are categories with a support of fewer than 150 observations, for which the metrics could not be calculated properly. Additionally, the precision values are in general still higher than the recall values, which means a lot of false negatives. We have to improve the model conditions again.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

**First**, we try out other machine learning algorithms which are tuned by cross validation to compare their prediction results. Further models are:
- Naïve Bayes: `MultinomialNB`<br>
    The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g. word counts for text classification). Its default setting is: MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None). The parameter 'alpha' is its "Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing)."
    
- Support Vector Machines (regular `LinearSVC`)<br>
  Linear Support Vector Classification default setting is: LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)<br>
  According to scikit-learn, the parameter multi_class (string 'ovr' or 'crammer_singer' (default='ovr')) determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.<br>
  And the parameter C is the penalty parameter of the error term.

- Ensemble Method: `RandomForestClassifier`<br>
    A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.<br>
    Its default setting is: RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
    
**Second**, because the dataset is imbalanced, we balance it before classification. The category classes with few observations are outnumbered, so the dataset is highly skewed. Several strategies exist to create a balanced dataset:
- Undersampling the majority classes
- Oversampling the minority classes
- Combining over- and under-sampling
- Create ensemble balanced sets

But keep in mind that oversampling the minority classes before cross-validation can lead to overfitting problems: copies of the same samples can end up in both the training and the validation folds, so information from the validation data leaks into the training dataset, which is forbidden.
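
A minimal sketch of the safe order of operations, shown for a single binary target and one train/test split (inside cross-validation the same idea applies per fold). The toy arrays and variable names are assumptions made for this illustration; `sklearn.utils.resample` stands in for a naive random oversampler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng_seed = 42

# toy single-label data (hypothetical): 90 majority and 10 minority samples
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# 1) split first, so the held-out part never contains duplicated samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=rng_seed)

# 2) oversample the minority class of the training part only
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=rng_seed)

X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

# class counts of the balanced training part
print(np.bincount(y_tr_bal))
# number of samples shared between the balanced training part and the test part
print(np.intersect1d(X_tr_bal.ravel(), X_te.ravel()).size)
```

Because the duplicates are created after the split, no test sample can reappear in the training data.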

Note:<br>
For the balancing activities the package 'imbalanced-learn' is imported.<br>
For combining the strategies we implement a naive random oversampling of the minority classes.<br>
For undersampling the package can be used as well to create the pipeline with `PipelineImb`. The pipeline itself includes the class `RandomUnderSampler()` directly before the MultiOutputClassifier to equalize the number of samples in all the classes before the training.

Using this package throws the following ValueError: 'Imbalanced-learn currently supports binary, multiclass and binarized encoded multiclasss targets. Multilabel and multioutput targets are not supported.' So, the associated package classes do not support the multi-target classification we need for our project. Therefore this code has been removed.

According to the [paper](https://arxiv.org/ftp/arxiv/papers/1810/1810.11612.pdf) <i>Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting</i> of 27 October 2018 by Genta Indra Winata and Masayu Leylia Khodra, with regard to new data it is more appropriate to balance the dataset on the algorithm level instead of the data level to avoid overfitting. The algorithm "approach modifies algorithm by adjusting weight or cost of various classes."

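E.g. the `RandomForestClassifier` is an ensemble model estimator whose trees can be trained with class weights that compensate for the imbalance: with `class_weight='balanced'` the weights are inversely proportional to the class frequencies, and with `class_weight='balanced_subsample'` they are computed from the bootstrap sample of each tree. We will have a look if this configuration is already enough to get a good model performance.<br>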
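
To make the effect of the 'balanced' setting concrete, the following sketch computes the class weights for one hypothetical binary target column with 90 negative and 10 positive samples. The weights follow the scikit-learn heuristic `n_samples / (n_classes * np.bincount(y))`; the toy label vector is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy target column (hypothetical): 90 negatives, 10 positives
y_label = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_label)

# weights as used for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_label)

# the same heuristic written out by hand
manual = len(y_label) / (len(classes) * np.bincount(y_label))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
print(np.allclose(weights, manual))
```

The rare class receives a weight of 5.0 and the frequent class a weight of roughly 0.56, so errors on the rare class count about nine times as much during training.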

If this is not the case, could we use a variant of the popular method called `SMOTE` (Synthetic Minority Oversampling Technique)? It creates synthetic samples from the minority category classes rather than creating copies like the naive random oversampling method. According to Max Kuhn and Kjell Johnson in their book 'Applied Predictive Modeling', chapter 16.7 'Sampling Methods' (as part of chapter 16 'Remedies for Severe Class Imbalance'), SMOTE is a sampling procedure using both up-sampling and down-sampling.

Note: Using the `RandomForestClassifier` has not been enough; the improvement has been too small. The usage of another meta-estimator ('BoostingClassifier') directly for the other base-estimator models leads to an endless loop of ConvergenceWarnings during the fit() call. The ensemble BaggingClassifier is, according to scikit-learn, a "meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction." It seems that this does not work for a multi-label classification task.

The ideal solution for the drastic differences in the number of observations per target class would be to get more data for the rare classes, but this is not possible for this project. So, we try the synthetic creation of new data points of the minority classes by the SMOTE technique. But according to the [research paper](https://docplayer.net/13735758-Preprocessing-imbalanced-dataset-using-oversampling-approach.html) of Dasharath C.Magar and S.M.Rokade called <i>Preprocessing Imbalanced Dataset Using Oversampling Approach</i> we have to take care which SMOTE variant to use, as some methods could be inappropriate. They propose three steps to work on:
- Selection of an appropriate subset of the original minority class samples
- Assigning weights to the selected samples according to their importance in the data
- Using a clustering approach for generating the useful synthetic minority class samples

So again, could we try to improve the dataset with the Python library ['smote_variants'](https://smote-variants.readthedocs.io/en/latest/index.html) described in the [paper](https://www.sciencedirect.com/science/article/pii/S0925231219311622) <i>Smote-variants: A python implementation of 85 minority oversampling techniques</i> from György Kovács, e.g. running sampling via ('MWMOTE', "{'proportion': 0.7, 'k1': 5, 'k2': 5, 'k3': 5, 'M': 10, 'cf_th': 5.0, 'cmax': 10.0, 'n_jobs': 4, 'random_state': 42}") on our 'X' and 'y' datasets? No, we cannot, because 'X' contains text data (strings), so we get ValueErrors like: could not convert string to float: 'Weather update - a cold front from Cuba that could pass over Haiti'.
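
The cause of this error can be reproduced without any oversampling library. The sketch below, with two toy messages as an assumption, shows that raw strings cannot be cast to floats, whereas a vectorized representation is numeric. It only illustrates the data type issue; whether oversampling in the vectorized feature space is useful for this multi-label task is left open here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_messages = [
    "Weather update - a cold front from Cuba that could pass over Haiti",
    "We need water and food",
]

# raw text cannot be interpreted as a float array
try:
    np.asarray(toy_messages, dtype=float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("ValueError:", err)

# after vectorization every message is a row of numeric features
X_numeric = TfidfVectorizer().fit_transform(toy_messages)
print(X_numeric.shape, X_numeric.dtype)
```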

**Finally**, as a possible improvement, we use other multi-label classifiers from the `scikit-multilearn` package. They transform our task into multiple single-label tasks, so existing single-label models can be used. One possible solution is [binary relevance](http://scikit.ml/api/skmultilearn.problem_transform.br.html#skmultilearn.problem_transform.BinaryRelevance), which trains one binary classifier per label and treats the labels as independent, i.e. label correlations are not taken into account.
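
The idea behind binary relevance can be sketched with plain scikit-learn: one independent binary classifier is fitted per label column, which is conceptually what `MultiOutputClassifier` does as well. The synthetic dataset and the choice of `LogisticRegression` are assumptions for this illustration; the sketch does not use the `scikit-multilearn` API itself.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# synthetic multi-label data (hypothetical): 200 samples, 4 labels
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=42)

base = LogisticRegression(max_iter=1000)

# binary relevance by hand: one independent classifier per label column
per_label_models = [
    clone(base).fit(X_toy, Y_toy[:, j]) for j in range(Y_toy.shape[1])
]
Y_manual = np.column_stack([m.predict(X_toy) for m in per_label_models])

# the same decomposition via the scikit-learn meta-estimator
Y_moc = MultiOutputClassifier(clone(base)).fit(X_toy, Y_toy).predict(X_toy)

print(Y_manual.shape, np.array_equal(Y_manual, Y_moc))
```

Since every label gets its own model, a rare label such as `child_alone` is learned without any help from related, more frequent labels.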

In [75]:
def build_model(model_type, params):
    '''
    input:
    model_type - the estimator model used for the MultiOutputClassifier
    params - the estimator model parameter grid used for the GridSearchCV
    output:
    cv - the not yet fitted GridSearchCV object wrapping the pipeline
    '''

    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer()),
            ]))
        ])),

        ('clf', MultiOutputClassifier(model_type))
    ])

    #return pipeline.get_params()

    # the higher the verbose number the more information is thrown
    # cv not higher than 5 buckets, child_alone support has been only 10
    cv = GridSearchCV(pipeline, param_grid=params, n_jobs=4, cv=5, verbose=1)

    return cv
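
How such a grid search object is used afterwards can be sketched on a tiny toy problem. The messages, targets, the flat pipeline without `FeatureUnion` and the small grid are assumptions chosen to keep the example fast; they are not the project configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data (hypothetical): eight messages, two target labels [water, food]
toy_messages = [
    "need water", "water please", "need food", "food please",
    "we need water now", "send water", "we need food now", "send food",
]
toy_targets = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
                        [1, 0], [1, 0], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(MultinomialNB())),
])

# nested parameter names follow the pattern <step>__<parameter>
toy_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__alpha': [0.1, 1.0],
}

toy_search = GridSearchCV(toy_pipeline, param_grid=toy_grid, cv=2)
toy_search.fit(toy_messages, toy_targets)

# best parameter combination and number of evaluated candidates
print(toy_search.best_params_)
print(len(toy_search.cv_results_['params']))
```

After fitting, the search object itself can be used for `predict`, because by default it refits the best candidate on the whole training data.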

To display the evaluation metric results, this time we use the classification report version for imbalanced datasets. Its metrics are: precision, recall, specificity, geometric mean, and index balanced accuracy of the geometric mean. In addition, the 'normal' accuracy score and the best parameter set found for the model are given.
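
For one binary target column these metrics can be derived from the confusion matrix, as the following sketch shows. The toy label vectors are assumptions. The index balanced accuracy is written as `(1 + alpha * (recall - specificity)) * gmean**2` with `alpha = 0.1`, which is our understanding of the default used by the imbalanced-learn report and should be checked against its documentation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy predictions for one target column (hypothetical)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # sensitivity: share of true positives found
specificity = tn / (tn + fp)    # share of true negatives found
gmean = np.sqrt(recall * specificity)

# index balanced accuracy of the geometric mean (assumed alpha = 0.1)
alpha = 0.1
iba = (1 + alpha * (recall - specificity)) * gmean ** 2

print(round(precision, 3), round(recall, 3), round(specificity, 3))
print(round(gmean, 3), round(iba, 3))
```

Unlike the accuracy score, the geometric mean drops to zero as soon as a model ignores one of the two classes completely.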

In [74]:
# create param grids for the models
# during former testing with different params, ngrams of (1,2) have been the best compared to (1,1)
# nested names use double underscores: <step>__<sub step>__<parameter>;
# the base model inside the MultiOutputClassifier is reached via clf__estimator__<parameter>
mnb_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'features__text_pipeline__vect__max_df': [0.5, 0.75, 1.0],
    'features__text_pipeline__tfidf__sublinear_tf': [False, True],
    'clf__estimator__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]
}

svm_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'features__text_pipeline__vect__max_df': [0.5, 0.75, 1.0],
    'features__text_pipeline__tfidf__sublinear_tf': [False, True],
    'clf__estimator__C': [0.01, 0.1, 1, 1.5, 5],
    'clf__estimator__multi_class': ['ovr', 'crammer_singer'],
    'clf__estimator__max_iter': [1000, 1500, 3000, 5000]
}

rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'features__text_pipeline__vect__max_df': [0.5, 0.75, 1.0],
    'features__text_pipeline__tfidf__sublinear_tf': [False, True],
    'clf__estimator__n_estimators': [10, 100, 500, 1000, 1500],
    'clf__estimator__max_depth': [3, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample']
}
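
Before starting a search that runs for hours, it can help to check the grid keys against the parameter names the pipeline really exposes and to count the candidates. The sketch below rebuilds the pipeline structure with the default tokenizer (an assumption, to stay self-contained) and uses a grid for `MultinomialNB` as an example; the names `check_pipeline` and `check_grid` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

check_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ]))
    ])),
    ('clf', MultiOutputClassifier(MultinomialNB()))
])

check_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'features__text_pipeline__vect__max_df': [0.5, 0.75, 1.0],
    'features__text_pipeline__tfidf__sublinear_tf': [False, True],
    'clf__estimator__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1],
}

# every grid key has to be a parameter name known to the pipeline
valid_names = set(check_pipeline.get_params().keys())
unknown = [key for key in check_grid if key not in valid_names]

# number of parameter combinations; multiply by the cv folds to get the fits
n_candidates = len(ParameterGrid(check_grid))

print("unknown keys:", unknown)
print("candidates:", n_candidates)
```

A key written with single underscores, such as `features__text_pipeline_tfidf_sublinear_tf`, would show up in the `unknown` list, and `GridSearchCV` would reject it with an invalid parameter error during `fit`.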

In [None]:
print("\n----- MultinomialNB -----")
print("Build best model: ...")
# MultinomialNB is deterministic and has no random_state parameter
cv_mnb_model = build_model(MultinomialNB(), mnb_param_grid)
print("Train model: ...")
cv_mnb_model.fit(X_train, y_train)

In [None]:
y_mnb_pred = cv_mnb_model.predict(X_test)

In [49]:
print("\nModel evaluation on tuned MultinomialNB ...")
display_results_imbalanced(y_test, y_mnb_pred, cv_mnb_model)


Model evaluation: ...
First: accuracy score: 0.1901954700589513
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       0.76      1.00      0.86      4880
               request       0.00      0.00      0.00      1119
                 offer       0.00      0.00      0.00        29
           aid_related       0.42      0.02      0.04      2673
          medical_help       0.00      0.00      0.00       514
      medical_products       0.00      0.00      0.00       340
     search_and_rescue       0.00      0.00      0.00       169
              security       0.00      0.00      0.00       110
              military       0.00      0.00      0.00       205
           child_alone       0.00      0.00      0.00         6
                 water       0.00      0.00      0.00       406
                  food       0.00      0.00      0.00       741
               shelter       0.00      0.00      0.00   



In [76]:
print("\n----- LinearSVC -----")
print("Build best model: ...")
cv_svm_model = build_model(LinearSVC(random_state=FIXED_SEED), svm_param_grid)
print("Train model: ...")
cv_svm_model.fit(X_train, y_train)


----- LinearSVC -----
Build best model: ...
Train model: ...
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  9.5min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 51.7min
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed: 174.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('vect',
                                                                                         CountVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.int64'>,
  

In [77]:
y_svm_pred = cv_svm_model.predict(X_test)

In [78]:
print("\nModel evaluation on tuned LinearSVC: ...")
display_results(y_test, y_svm_pred, cv_svm_model)


Model evaluation on tuned LinearSVC: ...
First: overall accuracy score: 0.088205
Classification Report for each target class:
                         precision    recall  f1-score   support

               request       0.55      0.48      0.51      1082
                 offer       0.00      0.00      0.00        31
           aid_related       0.57      0.60      0.59      2700
          medical_help       0.25      0.01      0.03       564
      medical_products       0.10      0.00      0.01       317
     search_and_rescue       0.67      0.01      0.02       178
              security       0.00      0.00      0.00       118
              military       0.11      0.00      0.01       213
           child_alone       0.00      0.00      0.00        10
                 water       0.05      0.00      0.00       409
                  food       0.31      0.08      0.13       722
               shelter       0.16      0.02      0.04       554
              clothing       0.25      



In [45]:
print("\n----- RandomForestClassifier -----")
print("Build best model: ...")
cv_rfc_model = build_model(RandomForestClassifier(random_state=FIXED_SEED), rfc_param_grid)
print("Train model: ...")
cv_rfc_model.fit(X_train, y_train)


----- RandomForestClassifier -----
Build model: ...
Train model: ...
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed: 187.5min
[Parallel(n_jobs=3)]: Done 180 out of 180 | elapsed: 500.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('vect',
                                                                                         CountVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.int64'>,
  

In [46]:
y_rfc_pred = cv_rfc_model.predict(X_test)

In [48]:
print("\nModel evaluation RandomForestClassifier: ...")
display_results(y_test, y_rfc_pred, cv_rfc_model)


Model evaluation RandomForestClassifier: ...
First: accuracy score: 0.097576
Classification Report for each target class:
                         precision    recall  f1-score   support

               request       0.53      0.36      0.43      1082
                 offer       0.00      0.00      0.00        31
           aid_related       0.58      0.57      0.57      2700
          medical_help       0.40      0.01      0.02       564
      medical_products       0.10      0.00      0.01       317
     search_and_rescue       0.00      0.00      0.00       178
              security       0.00      0.00      0.00       118
              military       0.00      0.00      0.00       213
           child_alone       0.00      0.00      0.00        10
                 water       0.00      0.00      0.00       409
                  food       0.29      0.02      0.04       722
               shelter       0.11      0.01      0.01       554
              clothing       0.50      0.01



Regarding the evaluation results the best model ...


In [None]:
# model = ... best evaluated model with its best params ...

### 9. Export your model as a pickle file

Finally, having found the best model from our model selection list, we save this model with its best parameters as a pickle file. Pickle is the standard way of serialising objects in Python. With this pickle file we can deserialise our model and use it to make new predictions.

In [None]:
def save_model(model, model_filepath):
    # the context manager closes the file handle even if pickling fails
    with open(model_filepath, "wb") as model_file:
        pickle.dump(model, model_file)

In [None]:
# see train_classifier.py file
model_filepath = "classifier.p"
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Best trained model saved!')
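
The way back, deserialising the model and predicting with it, can be sketched as a round trip. The small `LogisticRegression` model, the toy arrays and the temporary file name are assumptions for this illustration and do not represent the project model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy model (hypothetical) standing in for the tuned classifier
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_classifier.p")

    # serialise the fitted model
    with open(toy_path, "wb") as model_file:
        pickle.dump(toy_model, model_file)

    # deserialise it again
    with open(toy_path, "rb") as model_file:
        restored_model = pickle.load(model_file)

# the restored model gives the same predictions as the original one
same = np.array_equal(toy_model.predict(X_toy), restored_model.predict(X_toy))
print(same)
```

Pickle stores functions by reference, so a pipeline that uses a custom `tokenize` function can only be loaded where that function is importable under the same name. Pickle files should also only be loaded from trusted sources.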

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.