# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.<br>

My general notes:<br>
Keep in mind that we work on a multi-class, multi-label text classification task which assigns a set of category target labels to each message sample. The messages are short and the label distribution is imbalanced. The dataset has 19634 data points with 36 target categories.

During the processing of the disaster messages, the English text is tokenized, lowercased and lemmatized, and contractions are expanded. In addition, surplus whitespace, punctuation and English stop words are removed.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
#
# import libraries
#

# download necessary NLTK data
import nltk
#nltk.download(['punkt', 'wordnet', 'stopwords'])

# import libraries
import random as rn
import numpy as np
import pandas as pd
import string
import pickle
from sqlalchemy import create_engine
import Contractions

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from bs4 import BeautifulSoup

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier 
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# warnings status to show: report each distinct warning only once
import warnings
warnings.simplefilter("once")



Make the code reproducible ...

In [2]:
FIXED_SEED = 42

# The below is necessary for starting NumPy generated random numbers in a well-defined initial state.
np.random.seed(FIXED_SEED)

# The below is necessary for starting core Python generated random numbers in a well-defined state.
rn.seed(FIXED_SEED)

In [3]:
# load data from database
try:
    engine = create_engine('sqlite:///Disaster_Messages_engine.db')
    df = pd.read_sql_table('Messages_Categories_table', engine)

    # success
    print("The dataset has {} data points with {} variables each.".format(*df.shape))
except Exception as error:
    # the following cells need 'df', so report the reason as well
    print("The database 'Disaster_Messages_engine.db' could not be loaded. No ML pipeline activities possible.")
    print("Reason: {}".format(error))

The dataset has 19634 data points with 40 variables each.


In [4]:
df.head()

Unnamed: 0,message,original,genre,lang_code,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,en,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,en,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,en,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,en,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,direct,en,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [5]:
# create input (X) and output (y) samples, we know that related is always one ...
# as input we have to take care about the messages
# the categories are the targets of the multi-class, multi-label classification
X = df['message']
y = df[df.columns[4:]]
TARGET_NAMES = y.columns

In [6]:
print("X datatype: {}".format(type(X)))
print("y datatype: {}".format(type(y)))

X datatype: <class 'pandas.core.series.Series'>
y datatype: <class 'pandas.core.frame.DataFrame'>


In [7]:
X.head(2)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
Name: message, dtype: object

In [8]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [9]:
y.iloc[0:5,:].values

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)

In [10]:
for group in y.columns:
    print("'{}' includes {} x value 1.".format(group, y[group].sum()))

'related' includes 19634 x value 1.
'request' includes 4374 x value 1.
'offer' includes 117 x value 1.
'aid_related' includes 10729 x value 1.
'medical_help' includes 2066 x value 1.
'medical_products' includes 1297 x value 1.
'search_and_rescue' includes 718 x value 1.
'security' includes 467 x value 1.
'military' includes 857 x value 1.
'child_alone' includes 0 x value 1.
'water' includes 1650 x value 1.
'food' includes 2885 x value 1.
'shelter' includes 2281 x value 1.
'clothing' includes 401 x value 1.
'money' includes 598 x value 1.
'missing_people' includes 297 x value 1.
'refugees' includes 872 x value 1.
'death' includes 1187 x value 1.
'other_aid' includes 3392 x value 1.
'infrastructure_related' includes 1688 x value 1.
'transport' includes 1197 x value 1.
'buildings' includes 1313 x value 1.
'electricity' includes 528 x value 1.
'tools' includes 158 x value 1.
'hospitals' includes 283 x value 1.
'shops' includes 118 x value 1.
'aid_centers' includes 308 x value 1.
'other_inf
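
To make the imbalance easier to read, the counts can be turned into the share of positive samples per label. The sketch below does this for four of the counts printed above; the dictionary `positive_counts` and the other names are illustrative only and not part of the notebook.

```python
# Positive counts copied from the printed output above (subset only, for illustration).
N_SAMPLES = 19634
positive_counts = {"related": 19634, "request": 4374, "offer": 117, "child_alone": 0}

# Share of messages that carry each label.
positive_share = {label: count / N_SAMPLES for label, count in positive_counts.items()}

# Labels that are always 1 or always 0 give a classifier nothing to learn.
constant_labels = [label for label, share in positive_share.items() if share in (0.0, 1.0)]

for label, share in positive_share.items():
    print("{:<12} {:6.2%}".format(label, share))
print("constant labels:", constant_labels)
```

In this subset, `related` is set for every message and `child_alone` for none, so both columns are constant, while `offer` is positive for well under one percent of the messages.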

### 2. Write a tokenization function to process your text data

During the ETL pipeline activities we realised that some messages are not useful (e.g. 'nonsense' character sequences, HTML characters) and that web links are probably included. We have to deal with this in the tokenize() function.
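
As a minimal sketch of this kind of cleaning, the snippet below strips HTML remnants and replaces web links with a placeholder using only the standard library. It is a simplified stand-in and not the notebook's `tokenize()` function; the helper name `clean_raw_message`, both patterns and the sample message are assumptions made for illustration.

```python
import html
import re

# Simplified patterns, assumed for this illustration only.
URL_PATTERN = re.compile(r"https?://\S+")
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_raw_message(text):
    """Decode HTML entities, drop tags, mask URLs and collapse whitespace."""
    text = html.unescape(text)                       # '&amp;' -> '&'
    text = TAG_PATTERN.sub(" ", text)                # '<br>' -> ' '
    text = URL_PATTERN.sub("urlplaceholder", text)   # mask every link
    return re.sub(r"\s+", " ", text).strip()


sample = "Need water &amp; food <br> see http://example.org/help?id=1 now"
cleaned = clean_raw_message(sample)
print(cleaned)
```

Masking links with one fixed token keeps the information that a link was present without adding thousands of unique URL strings to the vocabulary.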

In [11]:
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [12]:
# function from Dipanjan's repository:
# https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%\
# 20content/nlp%20proven%20approach/NLP%20Strategy%20I%20-%20Processing%20and%20Understanding%20Text.ipynb

def expand_contractions(text, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    
    return expanded_text
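
One detail of the pattern above may be worth checking: Python's `re` module tries alternatives from left to right, so a short key such as "can't" can match before a longer key such as "can't've". The sketch below is a variant, not the function from the cited repository. It sorts the keys by length first; the name `expand_longest_first` and the three-entry map are assumptions for illustration, and the map keys are assumed to be lowercase.

```python
import re

# Tiny illustrative map with lowercase keys (assumption of this sketch).
SMALL_MAP = {"can't": "cannot", "can't've": "cannot have", "i'm": "i am"}


def expand_longest_first(text, mapping):
    """Expand contractions, trying the longest keys first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys),
                         flags=re.IGNORECASE)

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # keep the capitalisation of the first character of the match
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)


expanded = expand_longest_first("I'm sure we can't've missed it", SMALL_MAP)
print(expanded)
```

Without the sorting step, "can't've" would be expanded as "can't" plus a leftover "'ve".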

In [13]:
stop_words = set(stopwords.words('english'))
#stop_words.remove('no')
#stop_words.remove('not')

def tokenize(text):
    # have in mind that we use this for a web app adding new messages;
    # if still html, xml or other undefined parts in the existing messages:
    # first remove such metatext from English messages
    # see: https://docs.python.org/3.7/library/codecs.html#encodings-and-unicode
    # "To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence,
    # there’s the so called BOM (“Byte Order Mark”). [...]
    # In UTF-8, the use of the BOM is discouraged and should generally be avoided."
    # specific ones are e.g. notepad signatures from Microsoft as part of the messages which should be avoided;
    # other undefined characters have the coding of the 'replacement character' unicode u"\ufffd"
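    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # in Python 3 'souped' is already a str and has no decode() method,
    # so strip the BOM character directly and map the replacement character to '?'
    bom_removed = souped.replace(u"\ufeff", "").replace(u"\ufffd", "?")

    # raw string, so that the regex escapes are not read as string escapes
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, bom_removed)
    # start from the cleaned text, so that it is used even if no URL is found,
    # and replace every detected URL, not only the last one
    text = bom_removed
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")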
        
    # change the negation wordings like don't to do not, won't to will not 
    # or other contractions like I'd to I would, I'll to I will etc. via dictionary
    text = expand_contractions(text, CONTRACTION_MAP)

    # remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
    text = text.translate(str.maketrans('','', string.punctuation))
    # remove numbers
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # during ETL pipeline we have reduced the dataset on English messages ('en' language coding,
    # but there can be some wrong codings
    tokens = word_tokenize(letters_only, language='english')
    lemmatizer = WordNetLemmatizer()  # for the lexical correctly found word stem (root)

    clean_tokens = []
    for tok in tokens:
        # use only lower cases, remove leading and ending spaces
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        # remember: there have been nonsense sentences, so, now some strings could be empty
        # toDo: what is the correct length number to use now? Small ones are probably no relevant words ...
        # remove English stop words
        if len(clean_tok) > 1 and clean_tok not in stop_words:
            clean_tokens.append(clean_tok)

    return clean_tokens

In [14]:
# example for unit test to remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
example_str = 'This [is an] example? {of} string. with.? some &punctuation &signs!!??!!'
result = example_str.translate(str.maketrans('','', string.punctuation))
print(result)
# output shall be: This is an example of string with some punctuation signs

This is an example of string with some punctuation signs


In [15]:
# test tokenize
for message in X[:10]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'needs', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 

Storm at sacred heart of jesus
['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you!
['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
['croixdesbouquets', 'health', 'issue', 'worker', 'santo', 'area', 'croixdesbouquets'] 

There's nothing to eat and water, we starving and t

### 3. Build a machine learning pipeline
Notes:
- Regarding the class default parameters: this Python implementation uses scikit-learn version 0.21.2.
- Besides the random_state/random_seed parameters we also call np.random.seed() ([reason](https://stackoverflow.com/questions/47923258/random-seed-on-svm-sklearn-produces-different-results)).
- For the pipeline workflow, a `FeatureUnion` instance concatenates the results of multiple transformer objects.

Remember, we are dealing with an imbalanced dataset, so not every model is suitable. A machine learning classifier can become biased towards the majority class and then classify the minority class badly. Therefore the estimator and its class weighting have to be chosen with care.

This machine learning pipeline should take in the `message` column as input and output classification results on the 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

According to the scikit-learn [documentation](https://scikit-learn.org/stable/modules/multiclass.html), only specific classifiers can be used with this meta-estimator. We start with `RandomForestClassifier`.<br>

In [16]:
pipeline = Pipeline([
        ('features', FeatureUnion([ 
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1,2))),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
            ]))
            
        ])),
    
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                                             n_jobs=-1, random_state=FIXED_SEED)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [17]:
# shuffle is by default set on True,
# usage of stratify param leads to stratify split technique for this imbalanced dataset,
# having both would be a StratifiedShuffleSplit algorithm in the background,
# but
# stratify=y leads to a ValueError: The least populated class in y has only 1 member, which is too few.
# The minimum number of groups for any class cannot be less than 2.
# ToDo: clarify why => must be: stratify=y.iloc[:,1]
# Therefore I added randomised resampling to reduce the probability of getting ValueError's during model fit()
# and after clarification comment them out

X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, stratify=y.iloc[:,1],
                                                    test_size=0.2, random_state=FIXED_SEED)
#X_train = X_train.sample(n = X_train.shape[0], axis=0, random_state=FIXED_SEED) 
#y_train = y_train.sample(n = y_train.shape[0], axis=0, random_state=FIXED_SEED)
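
A likely explanation for the ValueError mentioned in the comments above: when a two-dimensional label matrix is passed to `stratify`, every distinct combination of labels in a row is treated as its own class, and with 36 labels many combinations occur only once. The sketch below counts label combinations on an invented toy matrix; `toy_y` and the other names are assumptions for illustration.

```python
from collections import Counter

# Invented toy label matrix: one tuple per message, three labels each.
toy_y = [
    (1, 0, 0),
    (1, 0, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 0, 1),  # this label combination occurs only once
]

# Each distinct row acts as one "class" for a stratified split.
combo_counts = Counter(toy_y)
singletons = [combo for combo, count in combo_counts.items() if count < 2]

print("label combinations:", len(combo_counts))
print("combinations with a single member:", singletons)
```

Stratifying on a single column, as done with `y.iloc[:,1]`, avoids this because that column has only two classes, but it balances the split only with respect to that one label.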

In [18]:
X_train.shape

(15707,)

In [19]:
y_train.shape

(15707, 36)

In [20]:
print("X_train datatype: {}".format(type(X_train)))
print("y_train datatype: {}".format(type(y_train)))

X_train datatype: <class 'numpy.ndarray'>
y_train datatype: <class 'numpy.ndarray'>


In [21]:
for i in range(y_train.shape[1]):
    print("{}. numpy.ndarray element is: {}".format(i, y_train[i]))
    print(set(y_train[i]))

0. numpy.ndarray element is: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1]
{0, 1}
1. numpy.ndarray element is: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]
{0, 1}
2. numpy.ndarray element is: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
3. numpy.ndarray element is: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
{0, 1}
4. numpy.ndarray element is: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
5. numpy.ndarray element is: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
6. numpy.ndarray element is: [1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
7. numpy.ndarray element is: [1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
8. numpy.ndarray element is: [1 1 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
{0, 1}
9. numpy.n

Now, we train the pipeline ...

In [22]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('text_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                                                  low

And calculate the model prediction ...

In [23]:
y_rfc_pred = pipeline.predict(X_test)

### 5. Test your model
For evaluation:<br>
Report accuracy score, f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each, where:

TP = TruePositive; FP = FalsePositive; TN = TrueNegative; FN = FalseNegative.

**Accuracy Score** is a classification score. It is the number of correct predictions made divided by the total number of predictions made. In a multilabel classification task it computes subset accuracy. 
  
Besides accuracy, we add further metrics to compare the model performance on this originally imbalanced dataset. Accuracy is dominated by the majority classes: a model that mostly predicts the majority class still obtains a high value, which is therefore misleading.
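
The difference between subset accuracy and accuracy per individual label can be sketched in plain Python. The two small matrices are invented for illustration and the variable names are not part of the notebook.

```python
# Invented toy data: 4 messages, 3 labels each.
y_true = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 0, 0]]
y_pred = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 0]]

# Subset accuracy: a message counts only if its whole label vector matches.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Label-wise accuracy: every single label decision counts on its own.
n_labels = len(y_true) * len(y_true[0])
label_accuracy = sum(
    t == p
    for row_t, row_p in zip(y_true, y_pred)
    for t, p in zip(row_t, row_p)
) / n_labels

print("subset accuracy: {:.3f}".format(subset_accuracy))
print("label accuracy:  {:.3f}".format(label_accuracy))
```

With many labels per message, a single wrong label is enough to make the whole vector count as wrong, so subset accuracy can be far lower than the label-wise value.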

**Precision** quantifies the binary precision, in other words, it is a measure of a classifier's exactness. It is the ratio of true positives (messages correctly assigned to a category) to all predicted positives (all messages assigned to that category, irrespective of whether the assignment was correct), in other words it is the ratio of

TP / (TP + FP)

**Recall** tells us what proportion of the messages that actually belong to a specific category were assigned to this category by the model. It is a measure of a classifier's completeness: the ratio of true positives to all messages that actually belong to the category, in other words it is the ratio of

TP / (TP + FN)
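
Both ratios can be computed for one label column directly from the counts. The following sketch uses two invented binary vectors; the helper `count_outcomes` is a hypothetical name used only here.

```python
def count_outcomes(y_true, y_pred):
    """Return (TP, FP, FN) for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


# Invented example: 3 messages belong to the category, the model finds 1 of them.
label_true = [1, 1, 1, 0, 0, 0, 0, 0]
label_pred = [1, 0, 0, 1, 0, 0, 0, 0]

tp, fp, fn = count_outcomes(label_true, label_pred)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("TP={}, FP={}, FN={}".format(tp, fp, fn))
print("precision={:.2f}, recall={:.2f}".format(precision, recall))
```

The guards for an empty denominator return 0.0 when nothing was predicted or nothing was present for a label.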

For this task, a model's ability to precisely predict the category of a disaster message is treated as more important than its ability to recall all messages of that category.

We can use the **F-beta score** as a metric that considers both precision and recall. According to scikit-learn, the F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

In particular, when β=0.5, more emphasis is placed on precision. And when β=1.0 recall and precision are equally important.
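
As a small worked example, the F-beta formula can be evaluated for the precision and recall values that the classification report in this notebook lists for the `request` category (0.56 and 0.39). The function `f_beta` is a hypothetical helper written for this sketch, not a scikit-learn function.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    denominator = (beta ** 2 * precision) + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Rounded values for 'request' taken from the classification report.
request_precision, request_recall = 0.56, 0.39

f1 = f_beta(request_precision, request_recall, beta=1.0)
f_half = f_beta(request_precision, request_recall, beta=0.5)

print("F1   = {:.3f}".format(f1))
print("F0.5 = {:.3f}".format(f_half))
```

Because precision is higher than recall for this category, the score with β=0.5 comes out higher than the F1 score.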

According to scikit-learn: "The **F1 score** ... reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter."

From scikit-learn documentation for the classification report:<br>
The classification_report() function returns an additional value: **Support** - the number of occurrences of each label in y_true.<br>
The reported averages include the macro average (the unweighted mean per label), the weighted average (the support-weighted mean per label), the sample average (only for multilabel classification) and the micro average (computed from the total true positives, false negatives and false positives). The micro average is only shown for multi-label targets or for multi-class targets with a subset of classes, because otherwise it equals the accuracy.
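
How differently the macro and the micro average react to a rare label can be sketched with two invented labels. The counts below are made up for illustration and are not results of this notebook.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Invented per-label counts: one frequent label, one rare label never predicted.
label_counts = {
    "frequent_label": {"tp": 70, "fp": 30, "fn": 30},
    "rare_label": {"tp": 0, "fp": 0, "fn": 10},
}

per_label_precision = {
    name: safe_div(c["tp"], c["tp"] + c["fp"]) for name, c in label_counts.items()
}

# Macro: unweighted mean of the per-label values.
macro_precision = sum(per_label_precision.values()) / len(per_label_precision)

# Micro: pool the counts of all labels first, then compute one ratio.
total_tp = sum(c["tp"] for c in label_counts.values())
total_fp = sum(c["fp"] for c in label_counts.values())
micro_precision = safe_div(total_tp, total_tp + total_fp)

print("macro precision: {:.2f}".format(macro_precision))
print("micro precision: {:.2f}".format(micro_precision))
```

The macro average drops because the rare label counts as much as the frequent one, while the micro average is dominated by the frequent label.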

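Note: Having the imbalanced dataset in mind, Cohen's Kappa and a confusion matrix would be helpful, but scikit-learn's `cohen_kappa_score` and `confusion_matrix` cannot be applied directly to the multi-label target matrix of this classification task.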

In [24]:
def display_results(target_names, y_test, y_pred, cv=None):
    # text summary of the overall accuracy, precision, recall, F1 score for each class   
    print("First: overall accuracy score: {:5f}".format(accuracy_score(y_test, y_pred)))

    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # shows F1_score, precision and recall
    class_report = classification_report(y_test, y_pred, target_names=target_names)
    print("Classification Report for each target class:\n", class_report)

    if cv is not None:
        print("\n\n---- Best Parameters: ----\n{}".format(cv.best_params_))

In [25]:
display_results(TARGET_NAMES, y_test, y_rfc_pred, None)

First: overall accuracy score: 0.070283
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.56      0.39      0.46       875
                 offer       0.00      0.00      0.00        29
           aid_related       0.58      0.70      0.63      2146
          medical_help       0.14      0.00      0.01       386
      medical_products       0.18      0.01      0.02       253
     search_and_rescue       0.20      0.01      0.01       142
              security       0.00      0.00      0.00       103
              military       0.00      0.00      0.00       179
           child_alone       0.00      0.00      0.00         0
                 water       0.10      0.00      0.01       332
                  food       0.21      0.01      0.01       553
               shelter       0.06      0.00      0.00       456
              clo

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


Such behaviour was expected: the dataset is imbalanced, and in the output vector of each message most target label values are 0 and only a few are 1.<br>
The accuracy metric is not an appropriate measure of model performance for this kind of dataset. A model could assign all instances to the majority class, treating the minority class targets as noise. Moreover, in the multi-label case the reported value is the subset accuracy, which counts a message as correct only if its complete output vector matches; this fits the low value of about 0.07.<br>
Additionally, several metrics in this classification report are not reliable because they are set to 0.0 by the calculation rules when a label has no predicted or no true samples. Where values are available, precision is in general higher than recall, in other words, we have a high rate of false negatives (items wrongly classified as not being part of the specific target class). Therefore we start to improve the model by tuning its hyperparameters with cross-validation.

### 6. Improve your model
Use grid search to find better parameters. 

In [26]:
pipeline.get_params()

{'memory': None, 'steps': [('features', FeatureUnion(n_jobs=None,
                transformer_list=[('text_pipeline',
                                   Pipeline(memory=None,
                                            steps=[('vect',
                                                    CountVectorizer(analyzer='word',
                                                                    binary=False,
                                                                    decode_error='strict',
                                                                    dtype=<class 'numpy.int64'>,
                                                                    encoding='utf-8',
                                                                    input='content',
                                                                    lowercase=True,
                                                                    max_df=1.0,
                                                                    max_fea

In [27]:
# specify parameters for grid search
rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1,3)],
    'clf__estimator__n_estimators': [10, 100, 500, 1000],
    'clf__estimator__max_depth': [None, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample']
}

# create grid search object
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
# cv not higher than 5 buckets, training needs days with cv=10 if e.g. amazon AWS EC2 service is not available
grid_cv = GridSearchCV(pipeline, param_grid=rfc_param_grid, n_jobs=-1, cv=5, verbose=1)
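
The size of this search can be estimated before starting it. The sketch below enumerates the combinations of the grid above with `itertools.product`; it only counts candidates and does not train anything, and `N_FOLDS` mirrors the `cv=5` setting.

```python
from itertools import product

# Same values as in rfc_param_grid above.
param_grid = {
    "features__text_pipeline__vect__ngram_range": [(1, 2), (1, 3)],
    "clf__estimator__n_estimators": [10, 100, 500, 1000],
    "clf__estimator__max_depth": [None, 5, 10],
    "clf__estimator__class_weight": ["balanced", "balanced_subsample"],
}
N_FOLDS = 5

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * N_FOLDS

print("{} candidates x {} folds = {} fits".format(len(candidates), N_FOLDS, n_fits))
```

Each of these fits trains one random forest per target column, which is why adding a single value to any list increases the run time noticeably.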

### 7. Test your model
Show the accuracy, precision, recall and F-score of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [None]:
# model = cv
grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


In [45]:
y_rfc_pred2 = grid_cv.predict(X_test)

In [1]:
print("Evaluation results for the cross validation tuned 'RandomForestClassifier' estimator:")
display_results(TARGET_NAMES, y_test, y_rfc_pred2, grid_cv)

The tuned evaluation result is ......

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

**First**, we try out other machine learning algorithms which are tuned by cross validation to compare their prediction results. Other estimator models for the requested `MultiOutputClassifier` are:
- `KNeighborsClassifier` with its default parameters: (n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, metric=’minkowski’, metric_params=None, n_jobs=None, **kwargs)
- `RadiusNeighborsClassifier` with its default parameters: (radius=1.0, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, metric=’minkowski’, outlier_label=None, metric_params=None, n_jobs=None, **kwargs)

As stated in the scikit-learn [documentation](https://scikit-learn.org/stable/modules/neighbors.html#classification) "scikit-learn implements two different nearest neighbors classifiers: <i>KNeighborsClassifier</i> implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.<br> <i>RadiusNeighborsClassifier</i> implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user."
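
The difference between the two neighbor selections can be sketched in one dimension with plain Python. This only illustrates which training points would vote; it is not how scikit-learn implements the classifiers, and the points and helper names are invented.

```python
def k_nearest(points, query, k):
    """Pick the k training points closest to the query."""
    return sorted(points, key=lambda point: abs(point - query))[:k]


def within_radius(points, query, radius):
    """Pick all training points whose distance to the query is at most radius."""
    return [point for point in points if abs(point - query) <= radius]


train_points = [0.0, 0.2, 0.9, 1.5, 3.0]

knn_votes = k_nearest(train_points, query=0.1, k=3)
radius_votes = within_radius(train_points, query=0.1, radius=0.5)
far_votes = within_radius(train_points, query=10.0, radius=0.5)

print("k=3 neighbors:       ", sorted(knn_votes))
print("radius=0.5 neighbors:", radius_votes)
print("far query neighbors: ", far_votes)
```

A fixed k always returns k voters, however far away they are, whereas a fixed radius can return none at all; for that case `RadiusNeighborsClassifier` has the `outlier_label` parameter listed above.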
    
**Second**, because the dataset is imbalanced, we could balance it before classification. The category classes with few observations are outnumbered, so the dataset is highly skewed. Several strategies exist to create a balanced dataset:
- Undersampling the majority classes
- Oversampling the minority classes
- Combining over- and under-sampling
- Create ensemble balanced sets

But keep in mind that oversampling the minority classes before cross-validation can cause overfitting problems: copies of the same sample may end up in both the training and the validation data, so information from the validation data would leak into the training dataset, which must be avoided.
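
A minimal sketch of the safe order, assuming a single binary label for simplicity (the notebook's task is multi-label): hold out the validation part first and oversample only the training part. The function `random_oversample` and the toy data are hypothetical and written only for this illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=42):
    """Duplicate random minority samples until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Invented toy data: 8 majority and 2 minority messages.
messages = ["m{}".format(i) for i in range(10)]
labels = [0] * 8 + [1] * 2

# 1) hold out the validation part first ...
train_x, train_y = messages[:4] + messages[8:9], labels[:4] + labels[8:9]
valid_x, valid_y = messages[4:8] + messages[9:], labels[4:8] + labels[9:]

# 2) ... then oversample the training part only.
bal_x, bal_y = random_oversample(train_x, train_y)

print("training counts before:", Counter(train_y))
print("training counts after: ", Counter(bal_y))
```

Because the duplicates are drawn from the training part only, no validation message can appear in the balanced training set.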

Note:<br>
For the balancing activities, the scikit-learn compatible package 'imbalanced-learn' is imported.<br>
For combining the strategies we implement a naive random oversampling of the minority classes.<br>
For undersampling, the package can be used as well to create the pipeline with `PipelineImb`. This pipeline includes the class `RandomUnderSampler()` directly before the `MultiOutputClassifier` to equalize the number of samples in all classes before the training.

Using this package throws the following ValueError: 'Imbalanced-learn currently supports binary, multiclass and binarized encoded multiclasss targets. Multilabel and multioutput targets are not supported.' So, the associated package classes do not support the multi-label classification with multiple outputs that we need for our project. Therefore this code was removed after the experiment.

According to the [paper](https://arxiv.org/ftp/arxiv/papers/1810/1810.11612.pdf) <i>Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting</i> of 27 October 2018 by Genta Indra Winata and Masayu Leylia Khodra, it is more appropriate, with regard to new data, to balance the dataset on the algorithm level instead of the data level to avoid overfitting. The algorithm "approach modifies algorithm by adjusting weight or cost of various classes."
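
In scikit-learn, one way to adjust class weights on the algorithm level is the `class_weight` parameter that several estimators offer. The sketch below only shows how the `'balanced'` heuristic computes its weights, n_samples / (n_classes * count per class), for a made-up binary label with an 8/2 distribution; it is not the method of the cited paper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# made-up single label (assumption): 8 negative and 2 positive samples
y_label = np.array([0] * 8 + [1] * 2)

# 'balanced' weights: n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_label)

# class 0: 10 / (2 * 8) = 0.625, class 1: 10 / (2 * 2) = 2.5
print(dict(zip([0, 1], weights)))
```

The rare class gets the larger weight, so misclassifying it costs more during training.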

If the mentioned library cannot be used for our task, what could we do instead?<br>
Another option is a feature selection approach, applied after the feature extraction of the `TfidfTransformer`, which creates the [feature vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

Scikit-learn offers the [feature selection](https://scikit-learn.org/stable/modules/feature_selection.html) module: "The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets."<br>
So, we first select the most important features and afterwards predict with only these highest-scoring features (say, n of them). We could add this directly to our pipeline by using `SelectFromModel`, as proposed by scikit-learn.
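
The following sketch shows one possible way to combine `SelectFromModel` with multi-label targets. `LinearSVC` is fitted on a single target column, so here the selection step is placed inside the estimator that `MultiOutputClassifier` clones for each label. The random toy data, the names and the parameter values are assumptions for illustration; this is a variant to experiment with, not the pipeline used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# toy data (assumption): 30 random features, 3 binary labels driven by a few of them
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 30)
Y_demo = np.column_stack([
    X_demo[:, 0] > 0.5,
    X_demo[:, 1] > 0.5,
    X_demo[:, 2] + X_demo[:, 3] > 1.0,
]).astype(int)

# per-label estimator: select features with a linear model, then classify
per_label_pipeline = Pipeline([
    ('select', SelectFromModel(LinearSVC(dual=False, random_state=42))),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=42)),
])

# MultiOutputClassifier fits one clone of the pipeline per label column
multi_clf = MultiOutputClassifier(per_label_pipeline).fit(X_demo, Y_demo)
Y_demo_pred = multi_clf.predict(X_demo)

# number of features kept for each label
n_selected = [int(est.named_steps['select'].get_support().sum())
              for est in multi_clf.estimators_]
print("selected features per label:", n_selected)
```

With this layout every label keeps its own feature subset, at the cost of fitting one selector per label.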

Further information can be found here:<br>
- https://scikit-learn.org/stable/modules/feature_selection.html#tree-based-feature-selection
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
- https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-as-part-of-a-pipeline
- https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py

In [75]:
def build_model(model_type, params):
    '''
    Build a GridSearchCV object around the text classification pipeline.

    input:
    model_type - the estimator model used for the MultiOutputClassifier
    params - the estimator model parameter grid used for the GridSearchCV

    output:
    cv - the unfitted GridSearchCV instance
    '''

    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
            ]))
        ])),

        # keep only the features whose LinearSVC weights reach the selection threshold;
        # note: LinearSVC expects a single target column, so this step may need
        # adapting for the multi-label y used here
        ('feature_selection', SelectFromModel(LinearSVC(multi_class='crammer_singer'))),
        ('clf', MultiOutputClassifier(model_type))
    ])

    # the higher the verbose number the more information is printed
    cv = GridSearchCV(pipeline, param_grid=params, n_jobs=-1, cv=5, verbose=1)

    return cv

First, we try this new pipeline again with the <i>RandomForestClassifier</i> and a reduced parameter grid.

In [74]:
# create param grids for the models
rfc_reduced_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1,3)],
    'clf__estimator__n_estimators': [10, 100, 500, 1000],
    'clf__estimator__max_depth': [None, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample']
}

knn_param_grid  = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1,3)],
    'clf__estimator__weights': ['uniform', 'distance']
}

rn_param_grid  = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1,3)],
    'clf__estimator__weights': ['uniform', 'distance']
}

In [None]:
print("\n----- RandomForestClassifier with SelectFromModel -----")
print("Build best model: ...")
cv_rfc_reduced_model = build_model(RandomForestClassifier(random_state=FIXED_SEED), rfc_reduced_param_grid)
print("Train model: ...")
cv_rfc_reduced_model.fit(X_train, y_train)

In [None]:
y_rfc_reduced_pred = cv_rfc_reduced_model.predict(X_test)

In [None]:
print("\nModel evaluation on second tuned RandomForestClassifier ...")
display_results_imbalanced(y_test, y_rfc_reduced_pred, cv_rfc_reduced_model)

In [None]:
print("\n----- KNeighborsClassifier with SelectFromModel -----")
print("Build best model: ...")
# KNeighborsClassifier has no random_state parameter
cv_knn_model = build_model(KNeighborsClassifier(), knn_param_grid)
print("Train model: ...")
cv_knn_model.fit(X_train, y_train)

In [None]:
y_knn_pred = cv_knn_model.predict(X_test)

In [2]:
print("\nModel evaluation on tuned KNeighborsClassifier ...")
display_results_imbalanced(y_test, y_knn_pred, cv_knn_model)

In [None]:
print("\n----- RadiusNeighborsClassifier with SelectFromModel -----")
print("Build best model: ...")
# RadiusNeighborsClassifier has no random_state parameter
cv_rn_model = build_model(RadiusNeighborsClassifier(), rn_param_grid)
print("Train model: ...")
cv_rn_model.fit(X_train, y_train)

In [None]:
y_rn_pred = cv_rn_model.predict(X_test)

In [2]:
print("\nModel evaluation on tuned RadiusNeighborsClassifier ...")
display_results_imbalanced(y_test, y_rn_pred, cv_rn_model)

Regarding the evaluation results the best model ...


### 9. Export your model as a pickle file

Finally, having found the best model from our model selection list, we save this model with its best parameters as a pickle file. Pickle is the standard way of serialising objects in Python. With this pickle file we can deserialise our model and use it to make new predictions.

In [None]:
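def save_model(model, model_filepath):
    ''' Serialise the model to model_filepath as a pickle file. '''
    # the context manager closes the file even if pickling fails
    with open(model_filepath, "wb") as model_file:
        pickle.dump(model, model_file)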

In [None]:
# see train_classifier.py file
model_filepath = "classifier.p"
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Best trained model saved!')
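
As a quick check of the deserialisation step mentioned at the start of this section, here is a minimal round-trip sketch. The helper `load_model`, the small `LogisticRegression` demo model and the temporary file name are hypothetical and only serve the illustration; they are not part of `train_classifier.py`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression


def load_model(model_filepath):
    ''' Hypothetical helper: deserialise a pickled model from model_filepath. '''
    with open(model_filepath, "rb") as model_file:
        return pickle.load(model_file)


# small stand-in model (assumption), fitted on made-up data
X_small = np.array([[0.0], [0.2], [0.8], [1.0]])
y_small = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_small, y_small)

# save to a temporary file and load it again
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_classifier.p"
    with open(demo_path, "wb") as model_file:
        pickle.dump(demo_model, model_file)
    restored_model = load_model(demo_path)

# the restored model should predict exactly like the original one
same_predictions = np.array_equal(demo_model.predict(X_small),
                                  restored_model.predict(X_small))
print("identical predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that was used for saving.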

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.