# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.<br>

My general notes:<br>
Keep in mind that we work on a multi-class, multi-output text classification task that assigns a set of target categories to each message. The messages are short and the class distribution is imbalanced. The dataset has 19634 data points with 40 variables, 36 of which are target categories.

During processing of the disaster messages, the English text is tokenized, lower-cased and lemmatized, and contractions are expanded. Additionally, items such as extra spaces, punctuation and English stop words are removed.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and y

In [1]:
#
# import libraries
#

# download necessary NLTK data
#%pip install nltk
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

import random as rn
import numpy as np
import pandas as pd
import string
import pickle
from sqlalchemy import create_engine
from collections import Counter

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

#%pip install bs4
from bs4 import BeautifulSoup

import sklearn.neighbors
from sklearn.utils import resample
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report

from skmultilearn.model_selection import IterativeStratification

# from imblearn.combine import SMOTETomek  -  resampling not possible because of having a multi-class, multi-output task
from imblearn.ensemble import BalancedRandomForestClassifier

# warnings status to show
import warnings
warnings.filterwarnings("once")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Using TensorFlow backend.


Make the code reproducible ...

In [2]:
FIXED_SEED = 42

# The below is necessary for starting NumPy generated random numbers in a well-defined initial state.
np.random.seed(FIXED_SEED)

# The below is necessary for starting core Python generated random numbers in a well-defined state.
rn.seed(FIXED_SEED)

In [3]:
# load data from database
try:
    engine = create_engine('sqlite:///Disaster_Messages_engine.db')
    df = pd.read_sql_table('Messages_Categories_table', engine)
    
    # success
    print("The dataset has {} data points with {} variables each.".format(*df.shape))
except Exception as err:
    print("The database 'Disaster_Messages_engine.db' could not be loaded. No ML pipeline activities possible.")
    print("Reason: {}".format(err))

The dataset has 19634 data points with 40 variables each.


In [4]:
df.head()

Unnamed: 0,message,original,genre,lang_code,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,en,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,en,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,en,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,en,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,direct,en,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [5]:
# create input (X) and output (y) samples, we know that related is always one ...
# as input we have to take care about the messages
# the categories are the targets of the multi-class, multi-output classification
X = df['message']
y = df[df.columns[4:]]
TARGET_NAMES = y.columns

In [6]:
print("X datatype: {}".format(type(X)))
print("y datatype: {}".format(type(y)))

X datatype: <class 'pandas.core.series.Series'>
y datatype: <class 'pandas.core.frame.DataFrame'>


In [7]:
X.head(2)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
Name: message, dtype: object

In [8]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [9]:
y.iloc[0:5,:].values

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)

In [10]:
# for creation of train and test datasets it is important that no column includes only 0 values
# stratification will not work properly (errors are thrown)
for group in y.columns:
    print("'{}' includes {} x value 1.".format(group, y[group].sum()))

'related' includes 19634 x value 1.
'request' includes 4374 x value 1.
'offer' includes 117 x value 1.
'aid_related' includes 10729 x value 1.
'medical_help' includes 2066 x value 1.
'medical_products' includes 1297 x value 1.
'search_and_rescue' includes 718 x value 1.
'security' includes 467 x value 1.
'military' includes 857 x value 1.
'child_alone' includes 19 x value 1.
'water' includes 1650 x value 1.
'food' includes 2885 x value 1.
'shelter' includes 2281 x value 1.
'clothing' includes 401 x value 1.
'money' includes 598 x value 1.
'missing_people' includes 297 x value 1.
'refugees' includes 872 x value 1.
'death' includes 1187 x value 1.
'other_aid' includes 3392 x value 1.
'infrastructure_related' includes 1688 x value 1.
'transport' includes 1197 x value 1.
'buildings' includes 1313 x value 1.
'electricity' includes 528 x value 1.
'tools' includes 158 x value 1.
'hospitals' includes 283 x value 1.
'shops' includes 118 x value 1.
'aid_centers' includes 308 x value 1.
'other_in

### 2. Write a tokenization function to process your text data

During the ETL pipeline activities we realised that some messages are not useful (e.g. 'nonsense' character sequences, HTML characters) and that web links are probably included. We have to deal with this in the tokenize() function.
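
As a minimal sketch of the web link handling mentioned above, the snippet below replaces every URL in a message with a placeholder token in a single `re.sub` call. The helper name `replace_urls` and the sample message are hypothetical; the pattern and the `urlplaceholder` token are the ones that appear in the `tokenize()` function of this notebook.

```python
import re

# Same URL pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every detected URL in `text` with `placeholder` (hypothetical helper)."""
    return re.sub(URL_REGEX, placeholder, text)


# hypothetical sample message with two links
sample = "Water needed see http://example.org/a and https://example.org/b now"
cleaned = replace_urls(sample)
print(cleaned)
```

Because the substitution works on the whole string at once, all links are replaced, no matter how many a message contains.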

In [11]:
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [12]:
# function from Dipanjan's repository:
# https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%\
# 20content/nlp%20proven%20approach/NLP%20Strategy%20I%20-%20Processing%20and%20Understanding%20Text.ipynb

def expand_contractions(text, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    
    return expanded_text
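
A remark on the pattern built above: Python's regex alternation tries the alternatives from left to right, so a short key that is listed before a longer one (e.g. `can't` before `can't've`) wins, and the remainder may stay unexpanded. The following sketch shows one possible variant that sorts the keys longest first and escapes them with `re.escape`. The function name `expand_longest_first` and the shortened mapping `SMALL_MAP` are hypothetical and only serve as illustration; the sketch assumes lower-case keys.

```python
import re

# Hypothetical, shortened mapping for illustration only (lower-case keys assumed)
SMALL_MAP = {"can't": "cannot", "can't've": "cannot have", "i'm": "i am"}


def expand_longest_first(text, mapping):
    """Expand contractions, trying the longest keys first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys),
                         flags=re.IGNORECASE)

    def _expand(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # keep the capitalisation of the first character of the original text
        return found[0] + expanded[1:]

    return pattern.sub(_expand, text)


expanded = expand_longest_first("I'm sure we can't've missed it", SMALL_MAP)
print(expanded)
```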

In [13]:
stop_words = set(stopwords.words('english'))
stop_words.remove('no')
stop_words.remove('not')

def tokenize(text):
    # have in mind that we use this for a web app adding new messages;
    # if still html, xml or other undefined parts in the existing messages:
    # first remove such metatext from English messages
    # see: https://docs.python.org/3.7/library/codecs.html#encodings-and-unicode
    # "To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence,
    # there’s the so called BOM (“Byte Order Mark”). [...]
    # In UTF-8, the use of the BOM is discouraged and should generally be avoided."
    # specific ones are e.g. notepad signatures from Microsoft as part of the messages which should be avoided;
    # other undefined characters have the coding of the 'replacement character' unicode u"\ufffd"
    soup = BeautifulSoup(text, 'html')
    souped = soup.get_text()
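    # `souped` is already a Python 3 str, so there is nothing left to decode:
    # drop byte order marks and map the replacement character directly
    bom_removed = souped.replace(u"\ufeff", "").replace(u"\ufffd", "?")
    
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # start from the cleaned text so that messages without URLs are cleaned as well,
    # and replace on `text` so that every detected URL is substituted, not only the last one
    text = bom_removed
    for url in re.findall(url_regex, text):
        text = text.replace(url, "urlplaceholder")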
        
    # change the negation wordings like don't to do not, won't to will not 
    # or other contractions like I'd to I would, I'll to I will etc. via dictionary
    text = expand_contractions(text, CONTRACTION_MAP)

    # remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
    text = text.translate(str.maketrans('','', string.punctuation))
    # remove numbers
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # during ETL pipeline we have reduced the dataset on English messages ('en' language coding,
    # but there can be some wrong codings
    tokens = word_tokenize(letters_only, language='english')
    lemmatizer = WordNetLemmatizer()  # for the lexical correctly found word stem (root)

    clean_tokens = []
    for tok in tokens:
        # use only lower cases, remove leading and ending spaces
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        # remember: there have been nonsense sentences, so, now some strings could be empty
        # toDo: what is the correct length number to use now? Small ones are probably no relevant words ...
        # remove English stop words
        if len(clean_tok) > 1 and clean_tok not in stop_words:
            clean_tokens.append(clean_tok)

    return clean_tokens

In [14]:
# example for unit test to remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
example_str = 'This [is an] example? {of} string. with.? some &punctuation &signs!!??!!'
result = example_str.translate(str.maketrans('','', string.punctuation))
print(result)
# output shall be: This is an example of string with some punctuation signs

This is an example of string with some punctuation signs


In [14]:
# test tokenize
for message in X[:10]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane', 'not'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'needs', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 

Storm at sacred heart of jesus
['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you!
['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
['croixdesbouquets', 'health', 'issue', 'worker', 'santo', 'area', 'croixdesbouquets'] 

There's nothing to eat and water, we starvin

### 3. Build a machine learning pipeline
Notes:
- Regarding the class default parameters: this Python implementation uses scikit-learn version 0.21.2 and scikit-multilearn version 0.2.0.
- Besides the random_state/random_seed parameters, we also call np.random.seed() ([reason](https://stackoverflow.com/questions/47923258/random-seed-on-svm-sklearn-produces-different-results))
- For the pipeline workflow, a `FeatureUnion` instance concatenates the results of multiple transformer objects

Remember, we are dealing with an imbalanced dataset, so not every model is suitable. Some classifiers are more biased towards the majority class than others and therefore classify the minority classes poorly. For this reason we have to evaluate several of them.

This machine learning pipeline should take the `message` column as input and output classification results for the remaining target categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

According to the scikit-learn [documentation](https://scikit-learn.org/stable/modules/multiclass.html), only specific classifiers can be used with this meta-estimator. We start with `RandomForestClassifier`.<br>
Its default parameter values are:<br>
<i>RandomForestClassifier</i>(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None).

For our classification task, the most important parameters are <i>n_estimators</i> and <i>max_features</i>. As stated in the scikit-learn documentation, "using a random subset of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data)" is in general the best choice for the prediction results. This is the case with max_features='auto'; therefore, we will not change this parameter.

<i>n_jobs=1</i> is used because all other values threw errors and crashed the training task.

In [15]:
pipeline = Pipeline([
        ('features', FeatureUnion([
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1,2))),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
            ]))
            
        ])),
    
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                                             n_jobs=1, random_state=FIXED_SEED)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [18]:
# shuffle is by default set on True,
# usage of stratify param leads to stratify split technique for this imbalanced dataset,
# having both would be a StratifiedShuffleSplit algorithm in the background,
# but
# stratify=y leads to a ValueError: The least populated class in y has only 1 member, which is too few.
# The minimum number of groups for any class cannot be less than 2.
# ToDo: clarify why => solution, must be: stratify=y.iloc[:,:] but that throws errors;
# wrong coding with y.iloc[:,1] for getting the rest to run (wrong results with and after training)
#X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, stratify=y.iloc[:,1],
#                                                    test_size=0.2, random_state=FIXED_SEED)

# therefore: creation of X and y with scikit-multilearn iterative stratifier,
# works only because 'child_alone' target class has been mapped to some messages
# if this would be still 0 on all rows ValueError would be thrown
test_size = 0.2
stratifier = IterativeStratification(n_splits=2, order=1,
                                     sample_distribution_per_fold=[test_size, 1.0-test_size],
                                     random_state=FIXED_SEED)
train_indexes, test_indexes = next(stratifier.split(X, y))

# y slicing with iloc because y is a dataframe, X is a series;
# by adding values to X and y we create numpy arrays
X_train, y_train = X[train_indexes].values, y.iloc[train_indexes, :].values  
X_test, y_test = X[test_indexes].values, y.iloc[test_indexes, :].values
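
A possible explanation for the `ValueError` quoted in the comments of the split cell above: as far as we know, scikit-learn's stratified split treats every distinct row of a two-dimensional `y`, i.e. every label combination, as one class. A combination that occurs only once can then not be distributed over the train and test set. The following sketch counts such combinations in plain Python; the toy matrix `y_toy` and the helper `rare_label_combinations` are hypothetical.

```python
from collections import Counter

# Hypothetical toy target matrix: one row per message, one column per category
y_toy = [
    (1, 0, 0),
    (1, 0, 0),
    (1, 1, 0),
    (1, 1, 0),
    (1, 0, 1),  # this label combination occurs only once
]


def rare_label_combinations(rows, min_count=2):
    """Return the label combinations that occur fewer than `min_count` times."""
    counts = Counter(tuple(row) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_count}


rare = rare_label_combinations(y_toy)
print(rare)
```

Applied to the rows of `y.values`, a non-empty result would be consistent with the error message. The iterative stratifier used above is configured with `order=1` and therefore considers the individual labels instead of complete label combinations.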

In [19]:
X_train.shape

(15707,)

In [20]:
y_train.shape

(15707, 36)

In [21]:
print("X_train datatype: {}".format(type(X_train)))
print("y_train datatype: {}".format(type(y_train)))

X_train datatype: <class 'numpy.ndarray'>
y_train datatype: <class 'numpy.ndarray'>


In [22]:
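# note: y_train[i] is the i-th row (the label vector of one message), not the i-th column;
# the loop therefore prints the first y_train.shape[1] = 36 label vectors
for i in range(y_train.shape[1]):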
    print("{}. numpy.ndarray element is: {}".format(i, y_train[i]))
    print(set(y_train[i]))

0. numpy.ndarray element is: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0]
{0, 1}
1. numpy.ndarray element is: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
2. numpy.ndarray element is: [1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
{0, 1}
3. numpy.ndarray element is: [1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1]
{0, 1}
4. numpy.ndarray element is: [1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
{0, 1}
5. numpy.ndarray element is: [1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
{0, 1}
6. numpy.ndarray element is: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]
{0, 1}
7. numpy.ndarray element is: [1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
{0, 1}
8. numpy.ndarray element is: [1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
{0, 1}
9. numpy.n

**Note:**<br>
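As we already know, the dataset is imbalanced, so the models will tend to emphasise the majority target classes. One way to obtain a more balanced distribution is **oversampling**, i.e. duplicating minority class instances of the training set, at the risk of some overfitting. Note, however, that `resample` from `sklearn.utils`, as it is called in this notebook without any class-specific selection, only draws a random bootstrap sample of 7000 training rows: the label proportions stay roughly the same, as the label counts before and after resampling show.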
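
To make the idea of duplicating minority class instances concrete, here is a minimal sketch in plain Python. The helper `oversample_label` and the toy data are hypothetical and not part of this notebook's pipeline; the sketch only illustrates the principle for a single rare label.

```python
import random


def oversample_label(X, y, label_idx, factor, seed=42):
    """Return shuffled copies of X and y in which every row with
    y[row][label_idx] == 1 appears `factor` times (hypothetical helper)."""
    rows = []
    for text, labels in zip(X, y):
        copies = factor if labels[label_idx] == 1 else 1
        rows.extend([(text, labels)] * copies)
    random.Random(seed).shuffle(rows)
    X_out = [text for text, _ in rows]
    y_out = [labels for _, labels in rows]
    return X_out, y_out


# hypothetical toy data: column 0 is a frequent label, column 1 a rare one
X_toy = ["need water", "storm coming", "child alone here", "road blocked"]
y_toy = [(1, 0), (1, 0), (1, 1), (1, 0)]

X_res, y_res = oversample_label(X_toy, y_toy, label_idx=1, factor=3)
print(len(X_res), sum(labels[1] for labels in y_res))
```

In a multi-label setting every duplicated row also carries its other labels, so the frequent labels grow as well. A perfectly balanced distribution over all target categories is therefore hard to reach this way.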

In [23]:
TARGET_NAMES

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [24]:
# datatypes are class 'numpy.ndarray'

print('Before resampling, shape of X_train: {}'.format(X_train.shape))
print('Before resampling, shape of y_train: {} \n'.format(y_train.shape))

print("Before resampling, label counts '1': {}".format(sum(y_train==1)))
print("Before resampling, label counts '0': {} \n".format(sum(y_train==0)))

Before resampling, shape of X_train: (15707,)
Before resampling, shape of y_train: (15707, 36) 

Before resampling, label counts '1': [15707  2564    95  8493  1775  1055   595   400   825    10  1229  1994
  1727   268   517   240   778  1028  2596  1516  1084  1129   461   136
   247    99   270  1043  6122  2013  2186   255  1727   490  1251  3152]
Before resampling, label counts '0': [    0 13143 15612  7214 13932 14652 15112 15307 14882 15697 14478 13713
 13980 15439 15190 15467 14929 14679 13111 14191 14623 14578 15246 15571
 15460 15608 15437 14664  9585 13694 13521 15452 13980 15217 14456 12555] 



In [25]:
# resampling with scikit-learn utils package;
# with the default replace=True this draws a bootstrap sample of 7000 training rows
X_train_res, y_train_res = resample(X_train, y_train, n_samples=7000, random_state=FIXED_SEED)

In [26]:
print('After resampling, shape of X_train_res: {}'.format(X_train_res.shape))
print('After resampling, shape of y_train_res: {} \n'.format(y_train_res.shape))

print("After resampling, label counts '1': {}".format(sum(y_train_res==1)))
print("After resampling, label counts '0': {}".format(sum(y_train_res==0)))

After resampling, shape of X_train_res: (7000,)
After resampling, shape of y_train_res: (7000, 36) 

After resampling, label counts '1': [7000 1137   36 3860  812  490  304  177  353    2  555  940  791  118
  249  130  361  468 1180  653  489  539  225   62  103   35  128  456
 2670  875  941  116  769  199  568 1487]
After resampling, label counts '0': [   0 5863 6964 3140 6188 6510 6696 6823 6647 6998 6445 6060 6209 6882
 6751 6870 6639 6532 5820 6347 6511 6461 6775 6938 6897 6965 6872 6544
 4330 6125 6059 6884 6231 6801 6432 5513]


Now we train the pipeline, first with the original training set and afterwards with the resampled one ...

In [27]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('text_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                                                  low

And calculate the model predictions for the original test data ...

In [28]:
y_rfc_pred = pipeline.predict(X_test)

Now, we do the same thing with the resampled dataset ...

In [31]:
pipeline.fit(X_train_res, y_train_res)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('text_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                                                  low

In [32]:
y_rfc_pred_res = pipeline.predict(X_test)

### 5. Test your model
For evaluation:<br>
Report accuracy score, f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each, where:

TP = TruePositive; FP = FalsePositive; TN = TrueNegative; FN = FalseNegative.

**Accuracy Score** is a classification score. It is the number of correct predictions made divided by the total number of predictions made. In a multilabel classification task it computes subset accuracy. 
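
The difference between subset accuracy and a per-label view can be sketched in plain Python. The two helper functions and the toy label vectors below are hypothetical; they only illustrate why the subset accuracy of a multilabel task is usually much lower than the share of correctly predicted single labels.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose complete label vector is predicted correctly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def label_accuracy(y_true, y_pred):
    """Share of correctly predicted single label values over all samples."""
    total = sum(len(t) for t in y_true)
    correct = sum(ti == pi
                  for t, p in zip(y_true, y_pred)
                  for ti, pi in zip(t, p))
    return correct / total


# hypothetical toy data: 3 messages, 4 target categories
y_true = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 1]]

print(subset_accuracy(y_true, y_pred))  # only the second row matches completely
print(label_accuracy(y_true, y_pred))   # 10 of 12 single labels are correct
```

With 36 target categories a single wrong label makes the whole row count as wrong, which is one reason why the subset accuracy of this task can be very low.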
  
Furthermore, besides accuracy, we add further metrics to compare the model performance on the originally imbalanced dataset. Accuracy focuses too much on the majority classes: because the model overfits them, the accuracy value would look too good and would therefore be misleading.

**Precision** quantifies the binary precision, in other words, it is a measure of a classifier's exactness. It is the ratio of true positives (messages correctly assigned to a category) to all predicted positives (all messages assigned to that category, irrespective of whether the assignment was correct), in other words it is the ratio of

TP / (TP + FP)

**Recall** tells us what proportion of the messages that actually belong to a specific category were assigned to this category by the model. It is a measure of a classifier's completeness: the ratio of true positives to all messages that truly belong to the category, in other words it is the ratio of

TP / (TP + FN)
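
Both ratios can be computed from the confusion counts of a single target category. The following plain Python sketch uses hypothetical helper names and a hypothetical toy label column; returning 0.0 for an undefined ratio is an assumption made here for simplicity.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for one binary target category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # assumption: an undefined ratio (empty denominator) is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# hypothetical toy column: 4 positive and 4 negative messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

In this toy column the model finds only one of the four positive messages, so recall (0.25) is lower than precision (0.5).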

For this task, we consider the model's ability to predict the categories of disaster messages precisely more important than its ability to recall every message of a category.

We can use the **F-beta score** as a metric that considers both precision and recall. According to scikit-learn, the F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. For β=1 it reduces to the plain harmonic mean of precision and recall.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

In particular, when β=0.5, more emphasis is placed on precision. And when β=1.0 recall and precision are equally important.
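
The effect of β can be checked with a few lines of plain Python. The function `fbeta` is a hypothetical helper that implements the formula above; the example values 0.59 (precision) and 0.71 (recall) are taken from the `aid_related` row of the classification report in this notebook.

```python
def fbeta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    denominator = (beta ** 2) * precision + recall
    if denominator == 0:
        # assumption: an undefined score is reported as 0.0
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


precision, recall = 0.59, 0.71

f1 = fbeta(precision, recall, beta=1.0)
f05 = fbeta(precision, recall, beta=0.5)
print(round(f1, 2), round(f05, 2))
```

For these values the F1 score is about 0.64, which agrees with the reported f1-score. With β=0.5 the score moves towards the lower precision value.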

According to scikit-learn: "The **F1 score** ... reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter."

From scikit-learn documentation for the classification report:<br>
The classification_report() function returns an additional value: **Support** - the number of occurrences of each label in y_true.<br>
The reported averages include the macro average (the unweighted mean over the labels), the weighted average (the support-weighted mean over the labels), the sample average (only for multilabel classification) and the micro average (computed from the total true positives, false negatives and false positives). The micro average is only shown for multi-label classification or for multi-class classification with a subset of classes, because otherwise it equals the accuracy.
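
How the averaging variants differ can be sketched for precision in plain Python. The helper `averaged_precision` and the toy counts are hypothetical; each tuple holds the true positives, false positives and support of one label.

```python
def averaged_precision(counts):
    """counts: list of (tp, fp, support) tuples, one per label."""
    per_label = [tp / (tp + fp) if (tp + fp) else 0.0 for tp, fp, _ in counts]

    # macro: every label has the same weight
    macro = sum(per_label) / len(per_label)

    # weighted: every label is weighted by its support
    total_support = sum(support for _, _, support in counts)
    weighted = sum(p * support
                   for p, (_, _, support) in zip(per_label, counts)) / total_support

    # micro: pool the counts of all labels first
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    micro = tp_sum / (tp_sum + fp_sum) if (tp_sum + fp_sum) else 0.0

    return {"macro": macro, "weighted": weighted, "micro": micro}


# hypothetical counts: one frequent, well predicted label and one rare, badly predicted label
toy_counts = [(90, 10, 100), (1, 3, 20)]
averages = averaged_precision(toy_counts)
print(averages)
```

The macro average is pulled down most by the rare label, while the weighted and micro averages are dominated by the frequent one. For an imbalanced dataset the macro average is therefore the more critical summary.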

In [29]:
def display_results(target_names, y_test, y_pred, cv=None, parameters=None):
   
    # text summary of the overall accuracy, precision, recall, F1 score for each class   
    print("\nFirst: overall accuracy score: {:5f}".format(accuracy_score(y_test, y_pred)))

    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # shows F1_score, precision and recall
    class_report = classification_report(y_test, y_pred, target_names=target_names)
    print("Classification Report for each target class:\n", class_report)
    
    if cv is not None:
        print("\n\n---- Best Parameters: ----\n")
        print("Best score: {:3f}".format(cv.best_score_))
        print("Best estimators parameters set:")
        best_parameters = cv.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print("\t {}: {}".format(param_name, best_parameters[param_name]))  

What are the metric results for our original data without resampling?

In [30]:
display_results(TARGET_NAMES, y_test, y_rfc_pred, None, None)


First: overall accuracy score: 0.065190
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.56      0.40      0.46      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.59      0.71      0.64      2236
          medical_help       0.07      0.00      0.01       291
      medical_products       0.20      0.01      0.02       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.25      0.01      0.01       421
                  food       0.31      0.02      0.03       891
               shelter       0.06      0.00      0.01       554
              cl

  'precision', 'predicted', average, warn_for)


This behaviour was expected: the dataset is imbalanced, and in the output vector of each message most target labels are 0 and only a few are 1, so the vector is sparse.<br>
Accuracy is not an appropriate metric to evaluate model performance on this kind of dataset. A model can classify all instances as part of the majority class and treat the minority class targets as noise. Accuracy is therefore not able to evaluate the model performance on a multi-class dataset with multi-output vectors.<br>
Additionally, many metrics in this classification report are not reliable because they are set to 0.0 by the calculation rules, for example when a label has no predicted samples. Where values are available, precision is often higher than recall; in other words, we have a high rate of false negatives (items wrongly classified as not belonging to the specific target class). A huge share of the input tokens are noise features that are not associated with the target classes.<br>
Appropriate F1-score values exist mainly for support values >1000 (except earthquake, score >10%). This holds for the following target features: request, aid_related, weather_related, earthquake and direct_report.<br>
And as we know from the ETL pipeline, some target features are correlated.
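
One reason for the very low overall accuracy is that `accuracy_score` computes the subset accuracy for multilabel targets: a message only counts as correct if all of its labels match. The following sketch on made-up toy data compares this with a per-label accuracy. Both helper functions are hypothetical illustrations, not scikit-learn functions.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose complete label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Share of individual label cells that are predicted correctly."""
    cells = [(ti, pi) for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)]
    return sum(ti == pi for ti, pi in cells) / len(cells)


# toy data (made up): the model only ever predicts the majority label
y_true = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

exact = subset_accuracy(y_true, y_pred)
cellwise = per_label_accuracy(y_true, y_pred)
print("subset accuracy: {:.4f}, per-label accuracy: {:.4f}".format(exact, cellwise))
```

The per-label value looks good although every minority label is missed, while the subset value is low because a single wrong label invalidates the whole message. Neither number describes the minority classes well, which is why the per-class F1 scores are used here.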

Therefore, we try to improve the model by tuning its hyperparameters with cross-validation.

What are the metric results for our resampled data?

In [33]:
display_results(TARGET_NAMES, y_test, y_rfc_pred_res, None, None)


First: overall accuracy score: 0.050675
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.57      0.36      0.44      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.58      0.77      0.66      2236
          medical_help       0.33      0.00      0.01       291
      medical_products       0.00      0.00      0.00       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.30      0.01      0.01       421
                  food       0.26      0.01      0.02       891
               shelter       0.40      0.00      0.01       554
              cl

  'precision', 'predicted', average, warn_for)


Comparing the F1 scores of each class for both trained models leads to the conclusion that, for this dataset, the applied oversampling brings no improvement. After resampling, the counts of labels 0 and 1 for each target class still look imbalanced, and several target features still have a very low score.

The idea behind resampling was that a hybrid method, resampling first and then using an ensemble classification model, would be less prone to imbalanced data and would lead to better prediction results. This particular oversampling did not help, but other resampling methods may give the desired result. We try some of them in the next chapters.

### 6. Improve your model
We use grid search to find better parameters for our model. 

In [34]:
pipeline.get_params()

{'memory': None, 'steps': [('features', FeatureUnion(n_jobs=None,
                transformer_list=[('text_pipeline',
                                   Pipeline(memory=None,
                                            steps=[('vect',
                                                    CountVectorizer(analyzer='word',
                                                                    binary=False,
                                                                    decode_error='strict',
                                                                    dtype=<class 'numpy.int64'>,
                                                                    encoding='utf-8',
                                                                    input='content',
                                                                    lowercase=True,
                                                                    max_df=1.0,
                                                                    max_fea

In [35]:
# specify parameters for grid search
rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1,2), (1,3)],
    'clf__estimator__n_estimators': [200, 500, 1000],
    'clf__estimator__max_depth': [10, 20],
    'clf__estimator__class_weight': ['balanced']
}

# create grid search object
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
# cv not higher than 5 folds: training needs days with cv=10 if e.g. the Amazon AWS EC2 service is not available
# n_jobs: the cloud service throws TerminatedWorkerError if n_jobs > 1 (use n_jobs=1 there); this run uses n_jobs=-1 (all available cores)
# for scoring and refit see: https://stackoverflow.com/questions/57591311/combination-of-gridsearchcvs-refit-and-scorer-unclear
#scoring = {'f1': make_scorer(f1_score, average="samples"), 'Accuracy': make_scorer(accuracy_score)}
grid_cv = GridSearchCV(pipeline, param_grid=rfc_param_grid, n_jobs=-1, cv=5,
                       return_train_score=True, verbose=2)# scoring = scoring, refit='f1', return_train_score=True, verbose=2)
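
As a quick sanity check before starting a long run, the size of the search can be counted by hand. This sketch only uses the standard library. It enumerates the same grid as above and multiplies the number of candidates by the number of folds. With the default `refit=True`, one more fit of the best candidate on the whole training set follows.

```python
from itertools import product

# same grid as in the cell above
rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'clf__estimator__n_estimators': [200, 500, 1000],
    'clf__estimator__max_depth': [10, 20],
    'clf__estimator__class_weight': ['balanced'],
}
cv_folds = 5

# every combination of one value per parameter is one candidate
names = sorted(rfc_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(rfc_param_grid[name] for name in names))]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print("candidates:", n_candidates, "fits:", n_fits)
```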

### 7. Test your model
Show the accuracy, precision, recall and F-score of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [36]:
# model = cv
grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 122.3min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 331.7min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('vect',
                                                                                         CountVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.int64'>,
  

In [37]:
y_rfc_pred2 = grid_cv.predict(X_test)
y_rfc_pred2

array([[1, 0, 0, ..., 0, 0, 1],
       [1, 1, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [1, 1, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [38]:
print("CV results:")
sorted(grid_cv.cv_results_.keys())

CV results:


['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'mean_train_score',
 'param_clf__estimator__class_weight',
 'param_clf__estimator__max_depth',
 'param_clf__estimator__n_estimators',
 'param_features__text_pipeline__vect__ngram_range',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split0_train_score',
 'split1_test_score',
 'split1_train_score',
 'split2_test_score',
 'split2_train_score',
 'split3_test_score',
 'split3_train_score',
 'split4_test_score',
 'split4_train_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score',
 'std_train_score']

In [39]:
for param_name, param_value in grid_cv.cv_results_.items():
    print(param_name, "=", param_value, "\n")

mean_fit_time = [ 143.32499766  222.84364619  420.73699727  584.24256101  832.77612128
 1144.81913681  420.15419774  548.97682629  880.09615598 1191.90661998
 1714.74533973 2363.58026776] 

std_fit_time = [  8.79552331  29.07888301  28.34485122  45.49333393  44.03554115
  80.26441434  20.05158935  27.73019138  32.80056408  78.54556821
  87.63899992 135.97262028] 

mean_score_time = [ 22.0082262   44.34183908  54.73562617 101.05241728 106.72635608
 190.67582159  30.54257665  44.96658883  57.79411554  99.36488132
 102.66963129 193.37914872] 

std_score_time = [ 2.27988077  3.9070024   3.51014375 12.98181257 18.30584527 17.27878492
  1.17855691  2.26240257  5.69602013 15.49061136  1.33642001 54.79085304] 

param_clf__estimator__class_weight = ['balanced' 'balanced' 'balanced' 'balanced' 'balanced' 'balanced'
 'balanced' 'balanced' 'balanced' 'balanced' 'balanced' 'balanced'] 

param_clf__estimator__max_depth = [10 10 10 10 10 10 20 20 20 20 20 20] 

param_clf__estimator__n_estimators = [2

In [40]:
type(grid_cv.best_estimator_)

sklearn.pipeline.Pipeline

In [41]:
print("Evaluation results for the 5 buckets cross validation tuned 'RandomForestClassifier' estimator:")
display_results(TARGET_NAMES, y_test, y_rfc_pred2, grid_cv, rfc_param_grid)

Evaluation results for the 5 buckets cross validation tuned 'RandomForestClassifier' estimator:

First: overall accuracy score: 0.035905
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.51      0.89      0.65      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.60      0.70      0.65      2236
          medical_help       0.00      0.00      0.00       291
      medical_products       1.00      0.01      0.02       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.16      0.08      0.11       421
                  food       0.26      0.59     

  'precision', 'predicted', average, warn_for)


The evaluation result of the RandomForestClassifier with tuned hyperparameters is better, even though many categories are still set to 0.0. Some recall values of specific target features have improved. Where the recall of minority target classes remains very low, the model is still biased towards the majority classes. This issue is reduced, but this is still not the best model.

With this approach, appropriate F1-score values (>10%) exist for target features with support values of roughly >400. This holds for the following target features: request, direct_report, aid_related, earthquake, weather_related, food, other_aid, shelter, storm and water. Additionally, the weighted avg F1 value is now 55%, and the samples avg F1 value is still the same (57%).

Furthermore, keep in mind that some target features are not disaster related but document-type related, like 'direct_report' or 'request'. Other target features deliver no value for the prediction task: 'related' is always set to 1 for a disaster message, and 'child_alone' was originally set to 0 for all messages, meaning no message was labelled with this target and the existing training examples were changed manually during the ETL pipeline activities. Nevertheless, there are not enough samples for this target to make a good prediction.
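
The tuned grid fixes `class_weight='balanced'` for the random forest. As a rough sketch of what this setting does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The pure-Python helper below (a hypothetical name, with made-up label counts) reproduces that formula for one binary target column. Since `MultiOutputClassifier` fits one estimator per target, the weights are computed separately for every label.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights as n_samples / (n_classes * count of the class) for one label column."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# made-up label column: 95 messages without the label, 5 with it
y_label = [0] * 95 + [1] * 5
weights = balanced_class_weights(y_label)
print(weights)
```

The rare class receives a much larger weight, so errors on it cost more during training. This reduces the bias towards the majority class but cannot create information for labels with almost no examples.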

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

**First**, we try out other machine learning algorithms, tuned by cross-validation, to compare their prediction results. Other estimator models for the requested `MultiOutputClassifier` are:
- `KNeighborsClassifier` with its default parameters: (n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)<br>
According to [KNN with TF-IDF Based Framework for Text Categorization](https://core.ac.uk/download/pdf/82438337.pdf) by Bruno Trstenjak, Sasa Mikac and Dzenana Donko in '24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013', "The algorithm assumes that it is possible to classify documents in the Euclidean space as points. Euclidean  distance is the distance between two points in Euclidean space."<br>
But [Effects of Distance Measure Choice on KNN Classifier Performance - A Review](https://arxiv.org/pdf/1708.04321.pdf) by V. B. Surya Prasath et al., 29 Sept. 2019, mentions 4 disadvantages of KNN in chapter '2.1. Brief overview of KNN classifier'. Determining a proper distance metric is one of them. Because the right distance metric depends on the problem and the dataset, we first try the Euclidean default of the KNN classifier and afterwards other ones.
- `AdaBoostClassifier` with its default parameters: (base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None).

As stated in the scikit-learn [documentation](https://scikit-learn.org/stable/modules/neighbors.html#classification), scikit-learn implements two different nearest neighbors classifiers. One of them, the <i>KNeighborsClassifier</i>, "implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user."
    
**Second**, because the dataset is imbalanced, we could balance it before classification. The category classes with a low number of observations are outnumbered, so the dataset is highly skewed. Several strategies exist to create a balanced dataset:
- Undersampling the majority classes
- Oversampling the minority classes
- Combining over- and under-sampling
- Creating ensemble balanced sets

But keep in mind that oversampling the minority classes before cross-validation can result in overfitting. Therefore we tried to use the 'imbalanced-learn' package to make our dataset more balanced.

Note:<br>
For the balancing activities, the scikit-learn compatible package 'imbalanced-learn' is imported.<br>
For combining the strategies, we implement a naive random oversampling of the minority classes.<br>
For undersampling, the package can be used as well to create the pipeline with `PipelineImb`. The pipeline itself includes the class `RandomUnderSampler` directly before the MultiOutputClassifier to equalize the number of samples in all classes before training. Another possible approach is using the `SMOTETomek` class directly on the training dataset before classification.

But using this package throws the following ValueError: 'Imbalanced-learn currently supports binary, multiclass and binarized encoded multiclasss targets. Multilabel and multioutput targets are not supported.' So, the package classes do not support the multi-target classification with multiple outputs that we need for our project. Therefore this code was removed after the experiment.
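
The naive random oversampling mentioned in the note can be sketched for a single binary label with the standard library only. This is an illustration under simplifying assumptions, not the resampling code used earlier in this notebook: the function name `random_oversample`, the toy messages and the seed are made up.

```python
import random


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # draw the missing rows with replacement from the same class
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# toy data (made up): one 'water' message among six
messages = ["need water", "we are fine", "road blocked",
            "all good here", "no news", "send food"]
water_label = [1, 0, 0, 0, 0, 0]

rows_res, labels_res = random_oversample(messages, water_label)
print(labels_res.count(0), labels_res.count(1))
```

In the multi-output setting each duplicated message also carries all its other labels, so balancing one target can unbalance another. Duplicates created before the cross-validation split can also end up in both the training and the validation fold.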

Another resampling technique is `cross-validation`, a method that repeatedly draws training samples from the original training dataset to obtain additional fit information about the selected model. It sets aside part of the training data as a validation set. The prediction model is fitted on the remaining training set and afterwards makes its predictions on the validation set. The resulting validation error rate is an estimate of the test error rate. Several cross-validation strategies exist; we use `k-fold cross-validation`, which divides the training set into k non-overlapping groups, called folds. One of these folds acts as the validation set and the rest is used for training. This process is repeated k times, each time with a different fold as the validation set. The k-fold cross-validation estimate is calculated by averaging the k individual results. We use k=5 instead of 10 because the calculations are time consuming.
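
The splitting logic can be sketched in a few lines of plain Python. This is a simplified illustration with contiguous, unshuffled folds. The helper `k_fold_indices` is a hypothetical name and the fold scores are made-up placeholders, not results of this notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, non-overlapping folds."""
    # the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)

# placeholder scores, one per fold; the CV estimate is their mean
fold_scores = [0.5, 0.6, 0.4, 0.5, 0.5]
cv_estimate = sum(fold_scores) / len(fold_scores)
print("k-fold estimate:", cv_estimate)
```

Every sample is used for validation exactly once and for training k-1 times.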

According to the [paper](https://arxiv.org/ftp/arxiv/papers/1810/1810.11612.pdf) <i>Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting</i> of 27 October 2018 by Genta Indra Winata and Masayu Leylia Khodra, it is more appropriate, with regard to new data, to balance the dataset on the algorithm level instead of the data level to avoid overfitting. The algorithm "approach modifies algorithm by adjusting weight or cost of various classes."<br>
The `AdaBoostClassifier` is such an ensemble method; it uses a boosting process to optimise weights. We will try this estimator as well for the <i>MultiOutputClassifier</i>. The <i>AdaBoostClassifier</i> uses the <i>DecisionTreeClassifier</i> as its base estimator. The tree parameters are changed in the parameter grid to improve the imbalanced data situation. Weak learners are boosted to become stronger learners and the results are aggregated at the end.
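
To illustrate the reweighting idea, the following sketch shows one round of the classic discrete AdaBoost update for binary labels in {-1, +1}. It is a textbook simplification, not scikit-learn's implementation: the 'SAMME.R' variant works with class probabilities instead. The function name and the toy predictions are made up.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost step: learner weight alpha and updated sample weights.

    Assumes the weighted error lies strictly between 0 and 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)
    # misclassified samples are scaled up, correct ones are scaled down
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, -1]  # the weak learner misses one positive example
weights = [0.2] * 5           # uniform start

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("alpha: {:.4f}".format(alpha))
print("weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. This is how boosting can shift attention towards hard, often minority-class, examples.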

If the mentioned library cannot be used for our task, what could we do instead to get an appropriate input for the classifier model? We do feature engineering.<br>
One option is a `feature-selection` approach, which can be applied after the feature extraction of the `TfidfVectorizer`, which creates [feature vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

Additionally, scikit-learn offers the [decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) module to reduce the complexity of features. With its help a subsampling is added:
- For the sparse matrix delivered by the `TfidfVectorizer` instance, we use the 3000 most frequent text features, each token must appear in at least 2 messages (`min_df=2`), and the n-gram range is tuned during grid search. The importance of a token increases with the number of times it appears in a message.
- Feature relationships in the sparse matrix are handled with `TruncatedSVD` for latent semantic analysis (LSA). Its number of components is evaluated via grid search hyperparameter tuning. Afterwards we have to normalise again.
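
The effect of the two vocabulary limits can be sketched without scikit-learn. The helper below is a hypothetical pure-Python illustration of the documented behaviour of `min_df` (minimum number of documents containing the term) and `max_features` (top terms ordered by term frequency across the corpus). The alphabetical tie-break and the toy messages are assumptions of this sketch.

```python
from collections import Counter


def select_vocabulary(tokenized_docs, min_df=2, max_features=3):
    """Keep terms found in at least min_df documents, then the most frequent ones."""
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total occurrences across the corpus
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # most frequent first; ties are broken alphabetically in this sketch
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])


# toy corpus (made up), already tokenized
docs = [
    ["need", "water", "food"],
    ["need", "shelter", "water"],
    ["water", "water", "flood"],
    ["need", "food"],
]

vocab_top2 = select_vocabulary(docs, min_df=2, max_features=2)
vocab_top3 = select_vocabulary(docs, min_df=2, max_features=3)
print(vocab_top2, vocab_top3)
```

Terms that occur in a single message, like 'shelter' and 'flood' here, are dropped before the frequency cut is applied.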

In [66]:
# This resampling with imbalanced package is not possible:
# The following ValueError is thrown:
# Imbalanced-learn currently supports binary, multiclass and binarized encoded multiclasss targets.
# Multilabel and multioutput targets are not supported.

# smote_tomek = SMOTETomek(random_state=FIXED_SEED)
# X_train_res, y_train_res = smote_tomek.fit_sample(X_train, y_train)

In [None]:
#print('After resampling, shape of train_X: {}'.format(X_train_res.shape))
#print('After resampling, shape of train_y: {} \n'.format(y_train_res.shape))

#print("After resampling, label counts '1': {}".format(sum(y_train_res==1)))
#print("After resampling, label counts '0': {}".format(sum(y_train_res==0)))

In [42]:
def build_model(model_type, params):
    '''
    input:
    model_type - the estimator model used for the MultiOutputClassifier
    params - the estimator model parameter grid used for the GridSearchCV
    output:
    cv - the unfitted GridSearchCV instance wrapping the pipeline
    '''

    # TfidfVectorizer, by default: use_idf=True, norm='l2'
    # TruncatedSVD: for LSA n_components of 100 is recommended, but it is stated:
    # Desired dimensionality of output data. Must be strictly less than the number of features.
    # We have 36 target categories. Some of them are 'useless'. We want to know the prio list of all.
    # The max features are 3000 tokens, so we use a smaller value as n_components for LSA.
    # A token is part of the result if it appears in at least 2 messages (min_df=2)
    #
    # For RandomizedSearchCV: for RandomForestClassifier, we have 8 parameters => n_iter=8
    pipeline2 = Pipeline([
        ('features', FeatureUnion([            
            ('text_pipeline', Pipeline([
                ('tfidf', TfidfVectorizer(tokenizer=tokenize, sublinear_tf=True,                                   
                                          max_features=3000, min_df=2)),
                ('best', TruncatedSVD(random_state=FIXED_SEED)),
                ('normalizer', Normalizer(copy=False))
            ]))           
        ])),

        ('clf', MultiOutputClassifier(model_type))
    ])
    
    # the higher the verbose number the more information is thrown
    cv = GridSearchCV(pipeline2, param_grid=params, return_train_score=True, n_jobs=1, cv=5, verbose=2)
    
    return cv

In [57]:
def build_model_randomcv(model_type, params, cv_iter):
    '''
    input:
    model_type - the estimator model used for the MultiOutputClassifier
    params - the estimator model parameter distributions used for the RandomizedSearchCV
    cv_iter - the number of parameter settings that are sampled (n_iter)
    output:
    cv - the unfitted RandomizedSearchCV instance wrapping the pipeline
    '''

    # TfidfVectorizer, by default: use_idf=True, norm='l2'
    # TruncatedSVD: for LSA n_components of 100 is recommended, but it is stated:
    # Desired dimensionality of output data. Must be strictly less than the number of features.
    # We have 36 target categories. Some of them are 'useless'. We want to know the prio list of all.
    # The max features are 3000 tokens, so we use a smaller value as n_components for LSA.
    # A token is part of the result if it appears in at least 2 messages (min_df=2)
    #
    # For RandomizedSearchCV:
    # RandomForestClassifier: we have 8 parameters => n_iter=8
    # AdaBoostClassifier: we have  parameters => n_iter=
    pipeline2 = Pipeline([
        ('features', FeatureUnion([            
            ('text_pipeline', Pipeline([
                ('tfidf', TfidfVectorizer(tokenizer=tokenize, sublinear_tf=True,                                   
                                          max_features=3000, min_df=2)),
                ('best', TruncatedSVD(random_state=FIXED_SEED)),
                ('normalizer', Normalizer(copy=False))
            ]))           
        ])),

        ('clf', MultiOutputClassifier(model_type))
    ])
    
    # the higher the verbose number the more information is thrown
    cv = RandomizedSearchCV(pipeline2, param_distributions=params, n_jobs=1, cv=5, n_iter=cv_iter,
                            return_train_score=True, verbose=2, random_state=FIXED_SEED)
    
    return cv

We try this new pipeline including feature selection and decomposition first with the other mentioned classifiers and afterwards with an additionally tuned RandomForestClassifier.

This simple `KNN` parameter grid needs a long time for calculation, i.e. the computational cost is high. As stated in the mentioned KNN paper from Sept. 2019, Euclidean distance is not an appropriate metric if the feature dimension is high. This is the case with a high n_components value of >=1000. So, we try n_components=100 and 500 for the 'best' step instead of 1000 or higher (note: the scikit-learn documentation proposes 100 for LSA tasks) and make other parameter modifications.

In [43]:
sorted(sklearn.neighbors.VALID_METRICS['brute'])

['braycurtis',
 'canberra',
 'chebyshev',
 'cityblock',
 'correlation',
 'cosine',
 'cosine',
 'dice',
 'euclidean',
 'hamming',
 'haversine',
 'jaccard',
 'kulsinski',
 'l1',
 'l2',
 'mahalanobis',
 'manhattan',
 'matching',
 'minkowski',
 'precomputed',
 'rogerstanimoto',
 'russellrao',
 'seuclidean',
 'sokalmichener',
 'sokalsneath',
 'sqeuclidean',
 'wminkowski',
 'yule']

In [54]:
# create param grids for the models

# KNeighborsClassifier
# according to http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.5135&rep=rep1&type=pdf
# the cosine distance metric is commonly used:
# it compares the cosine of the angle between two documents/vectors
# (the term frequencies in different documents collected as metrics).
# This metric is used when the orientation of the vectors matters, not their magnitude.
#
# The hamming distance tells us about the differences of compared strings of equal length.
# It is defined as the number of positions having different characters or symbols.

knn_param_grid  = {
    'features__text_pipeline__tfidf__ngram_range': [(1, 2), (1,3)],
    'features__text_pipeline__best__n_components':[100, 500],
    'clf__estimator__n_neighbors': [1, 3],
    'clf__estimator__metric': ['euclidean', 'cosine', 'hamming'],
    'clf__estimator__weights': ['uniform', 'distance']
}
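
The three metrics of the grid can be compared on two made-up vectors that point in the same direction but differ in length. The functions below are plain-Python sketches for illustration, not the scikit-learn implementations. The Hamming distance is returned as the fraction of differing positions, which is an assumption of this sketch.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hamming_distance(a, b):
    """Fraction of positions at which the two vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# same orientation, different magnitude (e.g. a short and a long message)
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]

d_euclid = euclidean_distance(doc_a, doc_b)
d_cosine = cosine_distance(doc_a, doc_b)
d_hamming = hamming_distance(doc_a, doc_b)
print("euclidean: {:.3f}, cosine: {:.3f}, hamming: {:.3f}".format(d_euclid, d_cosine, d_hamming))
```

For vectors of unit length the squared Euclidean distance equals twice the cosine distance, so after the `Normalizer` step both metrics should rank neighbours in the same order. The Hamming distance compares positions for exact equality and is therefore more natural for binary or categorical features than for continuous values.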

In [45]:
# according to scikit-learn: we have a sparse matrix, therefore use algorithm 'brute'
print("\n----- KNeighborsClassifier with feature engineering -----")
print("Build best model: ...")
cv_knn_model = build_model(KNeighborsClassifier(n_jobs=1, algorithm='brute'), knn_param_grid)
print("Train model: ...")
cv_knn_model.fit(X_train, y_train)


----- KNeighborsClassifier with feature engineering -----
Build best model: ...
Train model: ...
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 1.5min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.6min remaining:    0.0s


[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.6min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, f

[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.4min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.1min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=1

[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.8min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 3.0min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, f

[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.4min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=euclidean, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=1

[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.2min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 3.0min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_p

[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 1.9min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.0min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__t

[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_p

[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.2min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.4min
[CV] clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=cosine, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__t

[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 7.6min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 7.5min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__t

[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 7.0min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 7.0min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=1, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, featur

[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 7.1min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 6.9min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__t

[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 6.1min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 6.4min
[CV] clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__metric=hamming, clf__estimator__n_neighbors=3, clf__estimator__weights=distance, features__text_pipeline__best__n_components=100, featur

[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 7668.3min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('tfidf',
                                                                                         TfidfVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.float64'>,

In [46]:
y_knn_pred = cv_knn_model.predict(X_test)

In [47]:
type(cv_knn_model.estimator)

sklearn.pipeline.Pipeline

In [48]:
type(cv_knn_model.estimator['features']) 

sklearn.pipeline.FeatureUnion

In [49]:
type(cv_knn_model.estimator['features'].get_params()['transformer_list'][0])

tuple

In [50]:
type(cv_knn_model.estimator['features'].get_params()['transformer_list'][0][1])

sklearn.pipeline.Pipeline

In [51]:
cv_knn_model.estimator['features'].get_params()['transformer_list'][0][1]

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=3000,
                                 min_df=2, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=True,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x000000789273AC80>,
                                 use_idf=True, vocabulary=None)),
                ('best',
                 TruncatedSVD(algorithm='randomized', n_components=2, n_iter=5,
                              

In [52]:
print("Best score: %0.3f" % cv_knn_model.best_score_)
print("Best parameters set:")
best_parameters = cv_knn_model.best_estimator_.get_params()
for param_name in sorted(knn_param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.076
Best parameters set:
	clf__estimator__metric: 'euclidean'
	clf__estimator__n_neighbors: 3
	clf__estimator__weights: 'distance'
	features__text_pipeline__best__n_components: 100
	features__text_pipeline__tfidf__ngram_range: (1, 2)
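
Note that cells In [47] to In [51] inspect `cv_knn_model.estimator`. In scikit-learn this attribute is the unfitted template pipeline that was handed to the search, which is presumably why `n_components=2` and `ngram_range=(1, 1)` show up there instead of the best parameters listed above. The refitted pipeline lives in `best_estimator_`. The following is a minimal sketch of that difference on toy data; `toy_pipeline` and `toy_search` are illustrative names and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy data standing in for the vectorized messages (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=120, n_features=20, random_state=0)

toy_pipeline = Pipeline([
    ("best", TruncatedSVD(random_state=0)),  # n_components defaults to 2
    ("clf", KNeighborsClassifier()),
])
toy_search = GridSearchCV(toy_pipeline, {"best__n_components": [5, 10]}, cv=3)
toy_search.fit(X_toy, y_toy)

# .estimator is the untouched template, .best_estimator_ is the refitted winner
template_n = toy_search.estimator.get_params()["best__n_components"]
fitted_n = toy_search.best_estimator_.get_params()["best__n_components"]
print(template_n, fitted_n, toy_search.best_params_)
```

To inspect the tuned text pipeline of the real model, the same indexing as in the cells above could be applied to `cv_knn_model.best_estimator_` instead.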


In [53]:
print("\nModel evaluation on tuned KNeighborsClassifier ...")
display_results(TARGET_NAMES, y_test, y_knn_pred, cv_knn_model, knn_param_grid)


Model evaluation on tuned KNeighborsClassifier ...

First: overall accuracy score: 0.089127
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.53      0.43      0.47      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.58      0.60      0.59      2236
          medical_help       0.10      0.03      0.04       291
      medical_products       0.15      0.04      0.06       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.11      0.03      0.05       421
                  food       0.24      0.12      0.16       891
               shelter      



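Can we improve the hyperparameter settings for the KNN classifier? By default, `KNeighborsClassifier` uses the Minkowski metric with p=2, which is equivalent to the Euclidean metric, so the metric is left out of the next parameter grid.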
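
As a quick check of the default metric, the sketch below reads the default parameters of `KNeighborsClassifier` and compares Minkowski distances with p=2 against Euclidean distances. `X_demo` is random toy data, not the disaster messages.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

# Default KNN settings: Minkowski metric with p=2
default_params = KNeighborsClassifier().get_params()
print(default_params["metric"], default_params["p"])

# Minkowski with p=2 gives the same distances as the Euclidean metric
rng = np.random.RandomState(42)
X_demo = rng.rand(5, 3)  # toy data for illustration only
d_minkowski = pairwise_distances(X_demo, metric="minkowski", p=2)
d_euclidean = pairwise_distances(X_demo, metric="euclidean")
same_distances = np.allclose(d_minkowski, d_euclidean)
print(same_distances)
```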

In [59]:
better_knn_param_grid  = {
    'features__text_pipeline__tfidf__ngram_range': [(1, 2)],
    'features__text_pipeline__best__n_components':[35, 50, 100],
    'clf__estimator__n_neighbors': [5, 7],
    'clf__estimator__weights': ['distance', 'uniform']
}

In [60]:
# according to the scikit-learn docs: we have a sparse matrix, therefore use algorithm 'brute'
print("\n----- KNeighborsClassifier with feature engineering, better param grid -----")
print("Build best model: ...")
better_cv_knn_model = build_model(KNeighborsClassifier(n_jobs=1, algorithm='brute'), better_knn_param_grid)
print("Train model: ...")
better_cv_knn_model.fit(X_train, y_train)


----- KNeighborsClassifier with feature engineering, better param grid -----
Build best model: ...
Train model: ...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.7min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  9.8min remaining:    0.0s


[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.2min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.1min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=distance, features__text_pipeline__best__n_co

[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 3.5min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 3.3min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=100, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.9min
[CV] clf__estimator__n_neighbors=5, clf__estimator__weights=uniform, features__text_pipeline__best__n_com

[CV]  clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=35, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=50, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=50, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.5min
[CV] clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=50, features__text_pipeline__tfidf__ngram_range=(1, 2) 
[CV]  clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_components=50, features__text_pipeline__tfidf__ngram_range=(1, 2), total= 2.4min
[CV] clf__estimator__n_neighbors=7, clf__estimator__weights=uniform, features__text_pipeline__best__n_componen

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 538.2min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('text_pipeline',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('tfidf',
                                                                                         TfidfVectorizer(analyzer='word',
                                                                                                         binary=False,
                                                                                                         decode_error='strict',
                                                                                                         dtype=<class 'numpy.float64'>,

In [62]:
better_y_knn_pred = better_cv_knn_model.predict(X_test)

In [63]:
print("\nModel evaluation on second better tuned KNeighborsClassifier ...")
display_results(TARGET_NAMES, y_test, better_y_knn_pred, better_cv_knn_model, better_knn_param_grid)


Model evaluation on second better tuned KNeighborsClassifier ...

First: overall accuracy score: 0.082506
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.54      0.42      0.47      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.58      0.64      0.61      2236
          medical_help       0.20      0.01      0.03       291
      medical_products       0.18      0.01      0.02       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.14      0.01      0.01       421
                  food       0.24      0.07      0.11       891
              



The results of this KNN training and prediction are still poor for the individual categories. Only the categories with the largest number of samples are predicted properly.
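
The very low overall accuracy next to acceptable scores for the frequent categories is easier to read with a small example. This sketch assumes that the overall score is computed with `accuracy_score` on the full label matrix (`display_results` is not shown in this part of the notebook), in which case scikit-learn counts a sample as correct only if all of its labels match. The arrays are made-up toy values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy multi-label targets: 4 samples, 3 categories (made-up values)
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0],   # misses the rare third category
                   [1, 0, 0]])  # misses the second category

# Exact-match ratio: a row counts only if every label is right
subset_acc = accuracy_score(y_true, y_pred)

# One F1 value per category; a never-predicted category scores 0
per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
print(subset_acc, per_label_f1)
```

A frequent category can reach a perfect score here while the exact-match accuracy stays low, so the per-category report is the more informative view for this dataset.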

**Now**, we try another ensemble model for prediction: the `AdaBoostClassifier`. AdaBoost is an iterative ensemble method that builds a strong classifier by combining multiple weak classifiers. In each iteration it increases the weights of the training samples that were misclassified so far, so that the next weak classifier concentrates on them, and the final prediction is a weighted vote of all weak classifiers. Because hard and often rare samples receive more attention, it tends to deal with imbalanced datasets better than e.g. KNN. So, we expect better prediction results.
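
A minimal sketch of this idea on a synthetic, imbalanced binary problem (not the message data): a depth-1 decision tree with balanced class weights is boosted for up to 50 rounds, and the recall of the minority class is reported. All names (`X_syn`, `stump`, `ada`) are illustrative, and the base learner is passed positionally because the keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 10% positive samples
X_syn, y_syn = make_classification(n_samples=1000, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn,
                                          random_state=42)

# Weak learner: a decision stump with balanced class weights
stump = DecisionTreeClassifier(max_depth=1, class_weight="balanced",
                               random_state=42)
ada = AdaBoostClassifier(stump, n_estimators=50, random_state=42)
ada.fit(X_tr, y_tr)

y_te_pred = ada.predict(X_te)
minority_recall = recall_score(y_te, y_te_pred)

# Each boosting round contributes one weak learner with its own weight
print(len(ada.estimators_), ada.estimator_weights_[:3], minority_recall)
```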

In [64]:
# ensemble model AdaBoostClassifier
# class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0,
# algorithm='SAMME.R', random_state=None)
# base_estimator defaults to DecisionTreeClassifier(max_depth=1); here it is replaced by a class-weighted tree
ada_param_grid = {
    'features__text_pipeline__tfidf__ngram_range': [(1,2), (1,3)],
    'features__text_pipeline__best__n_components':[35, 50, 100],
    'clf__estimator__base_estimator__max_depth': [1, 3],
    'clf__estimator__n_estimators': [50, 100]
}

In [65]:
print("\n----- AdaBoostClassifier with feature engineering -----")
print("Build best model: ...")
cv_ada_model = build_model_randomcv(model_type=AdaBoostClassifier(
                                                base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                                      random_state=FIXED_SEED),
                                                random_state=FIXED_SEED),
                                    params=ada_param_grid, cv_iter=24)
print("Train model: ...")
cv_ada_model.fit(X_train, y_train)


----- AdaBoostClassifier with feature engineering -----
Build best model: ...
Train model: ...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total= 8.8min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 10.0min remaining:    0.0s


[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total= 8.8min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total= 8.8min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total= 8.6min
[CV] features__text_pipeline__tfidf__ngr

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total=21.0min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total=20.9min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=1, total=20.9min
[CV] features__text_pipeline__tfidf

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=1, total=21.9min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=1, total=22.0min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=1 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=1, total=21.6min
[CV] features__text_pipeline__tfidf

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3, total=20.5min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3, total=20.4min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3, total=20.5min
[CV] features__text_pipeline__tfidf__ngr

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=50, clf__estimator__base_estimator__max_depth=3, total=53.5min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3, total=38.5min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3, total=40.1min
[CV] features__text_pipeline__tfidf

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3, total=108.6min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3, total=109.3min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 2), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=100, clf__estimator__base_estimator__max_depth=3, total=108.5min
[CV] features__text_pipelin

[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed: 4398.0min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('features',
                                              FeatureUnion(n_jobs=None,
                                                           transformer_list=[('text_pipeline',
                                                                              Pipeline(memory=None,
                                                                                       steps=[('tfidf',
                                                                                               TfidfVectorizer(analyzer='word',
                                                                                                               binary=False,
                                                                                                               decode_error='strict',
                                                                           

In [66]:
y_ada_pred = cv_ada_model.predict(X_test)

In [67]:
print("\nModel evaluation on tuned AdaBoostClassifier ...")
display_results(TARGET_NAMES, y_test, y_ada_pred, cv_ada_model, ada_param_grid)


Model evaluation on tuned AdaBoostClassifier ...

First: overall accuracy score: 0.005093
Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.50      0.68      0.58      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.59      0.60      0.60      2236
          medical_help       0.07      0.17      0.10       291
      medical_products       0.07      0.12      0.09       242
     search_and_rescue       0.04      0.06      0.05       123
              security       0.01      0.01      0.01        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.11      0.24      0.15       421
                  food       0.25      0.43      0.31       891
               shelter       0



In [69]:
# print every entry of the cross-validation results dictionary
for param_name, param_value in cv_ada_model.cv_results_.items():
    print(param_name, "=", param_value, "\n")

mean_fit_time = [ 500.81215472  518.36387935  661.96901188  678.1040729  1236.76648321
 1201.19012208  827.04902592  939.91503806 1260.20814085 1284.69999075
 2389.56403565 2387.29037957 1182.47269912 1201.14438825 1652.31215382
 1686.02609415 3106.19282417 3212.97365623 2355.07406769 2338.73415484
 3277.80501637 3249.80948391 6335.19864755 6429.43141322] 

std_fit_time = [  8.47003324   2.52748568  42.09369518   9.42296901   6.65352007
  48.74900499 139.98581277   6.2688438   52.92003108   5.07312582
  54.70592671  42.07549276  43.46590703  16.46598529  59.09347479
  13.28271674 177.09868614  80.95115605  35.47786403  72.78984239
  74.35818212  95.04932708 279.46561469  69.63976451] 

mean_score_time = [21.3091783  18.82001257 19.09625058 18.72874808 24.22245455 24.28735037
 23.99989381 25.65693498 26.947825   25.40314484 35.97902799 37.59176879
 18.39714007 21.13625941 22.5374536  18.29763508 22.71581182 24.37856922
 26.40464034 25.89849601 23.85839944 26.32019887 31.92602406 38.2958
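
The raw `cv_results_` arrays are hard to compare by eye. One option is to load the dictionary into a pandas DataFrame and sort by rank; this is a sketch on a small toy search (`toy_search` is an illustrative name), and the same two lines could be applied to `cv_ada_model.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small toy search so the example runs in seconds
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=0)
toy_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]},
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# cv_results_ is a dict of equally long arrays, one entry per candidate
results = pd.DataFrame(toy_search.cv_results_)
summary = results[
    ["params", "mean_fit_time", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```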

The evaluation results of the <i>KNeighborsClassifier</i> model are not acceptable when looking at the individual target categories. The Hamming distance was of no value and also took noticeably longer per fit (about 6 to 8 minutes versus 2 to 3 minutes in the log above); the Euclidean metric remained the best.
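
A likely reason for the poor Hamming results is that this distance only counts the fraction of coordinates that differ. On continuous features, such as the components produced by TruncatedSVD, almost every coordinate differs, so all samples end up equally far apart. The sketch below uses random dense vectors as a stand-in for such features; it is an illustration under that assumption, not a measurement on the notebook's data.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X_dense = rng.normal(size=(4, 6))  # stand-in for dense SVD components

# Condensed pairwise distances between the 4 toy samples
d_hamming = pdist(X_dense, metric="hamming")
d_euclid = pdist(X_dense, metric="euclidean")

# Hamming: every pair differs in all coordinates, so no neighbor is closer
print(d_hamming)
# Euclidean: distances vary, so nearest neighbors are meaningful
print(d_euclid.round(2))
```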

Compared to the KNN model, the <i>AdaBoostClassifier</i> model handles the imbalanced dataset better and reaches a higher recall for several minority categories (e.g. water 0.24 versus 0.03 and food 0.43 versus 0.12), although its overall accuracy score is lower (0.005 versus 0.089). So far, this is the best model we have evaluated.

Would feature selection and decomposition improve the RandomForestClassifier? Because of the long computation time we use <i>RandomizedSearchCV</i>, knowing that it evaluates only a sample of the parameter combinations and may therefore miss the best one.
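
The difference between the two search strategies can be made concrete by counting candidates. The sketch below copies the values of `better_rfc_param_grid` from the next cell and assumes that `cv_iter=8` is passed on as `n_iter` of the randomized search, which matches the "8 candidates" reported in the training log; `build_model_randomcv` itself is not shown in this part of the notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Values copied from better_rfc_param_grid (see next cell)
rfc_grid = {
    "features__text_pipeline__tfidf__ngram_range": [(1, 3)],
    "features__text_pipeline__best__n_components": [35, 50, 100],
    "clf__estimator__n_estimators": [200, 600, 800],
    "clf__estimator__max_depth": [20],
    "clf__estimator__class_weight": ["balanced"],
}

# A grid search would evaluate every combination
all_candidates = list(ParameterGrid(rfc_grid))

# A randomized search draws n_iter combinations (without replacement
# when all parameters are given as lists)
sampled = list(ParameterSampler(rfc_grid, n_iter=8, random_state=42))

print(len(all_candidates), len(sampled))
```

With this grid the randomized search skips only one combination, so most of the saving in computation time has to come from the reduced grid itself.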

In [71]:
# for the other models, n_components=100 was the best hyperparameter value for TruncatedSVD
better_rfc_param_grid = {
    'features__text_pipeline__tfidf__ngram_range': [(1,3)],
    'features__text_pipeline__best__n_components':[35, 50, 100],
    'clf__estimator__n_estimators': [200, 600, 800],
    'clf__estimator__max_depth': [20],
    'clf__estimator__class_weight': ['balanced']
}

In [72]:
print("\n----- RandomForestClassifier with feature engineering and modified param grid -----")
print("Build best model: ...")
cv_better_rfc_model = build_model_randomcv(model_type=RandomForestClassifier(n_jobs=1, random_state=FIXED_SEED),
                                           params=better_rfc_param_grid, cv_iter=8)
print("Train model: ...")
cv_better_rfc_model.fit(X_train, y_train)


----- RandomForestClassifier with feature engineering and modified param grid -----
Build best model: ...
Train model: ...
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=110.8min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 118.1min remaining:    0.0s


[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=90.8min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=132.6min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=50, clf__estimator__n_estimators=800

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=188.7min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=187.3min
[CV] features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimators=800, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced 
[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=100, clf__estimator__n_estimat

[CV]  features__text_pipeline__tfidf__ngram_range=(1, 3), features__text_pipeline__best__n_components=35, clf__estimator__n_estimators=600, clf__estimator__max_depth=20, clf__estimator__class_weight=balanced, total=72.3min


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 3687.1min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('features',
                                              FeatureUnion(n_jobs=None,
                                                           transformer_list=[('text_pipeline',
                                                                              Pipeline(memory=None,
                                                                                       steps=[('tfidf',
                                                                                               TfidfVectorizer(analyzer='word',
                                                                                                               binary=False,
                                                                                                               decode_error='strict',
                                                                           

In [73]:
y_better_rfc_pred = cv_better_rfc_model.predict(X_test)

In [74]:
print("\nModel evaluation on tuned RandomForestClassifier with feature engineering...")
display_results(TARGET_NAMES, y_test, y_better_rfc_pred, cv_better_rfc_model, better_rfc_param_grid)


Model evaluation on tuned RandomForestClassifier with feature engineering...

First: overall accuracy score: 0.048892




Classification Report for each target class:
                         precision    recall  f1-score   support

               related       1.00      1.00      1.00      3927
               request       0.54      0.54      0.54      1810
                 offer       0.00      0.00      0.00        22
           aid_related       0.59      0.74      0.66      2236
          medical_help       0.00      0.00      0.00       291
      medical_products       0.60      0.01      0.02       242
     search_and_rescue       0.00      0.00      0.00       123
              security       0.00      0.00      0.00        67
              military       0.00      0.00      0.00        32
           child_alone       0.00      0.00      0.00         9
                 water       0.14      0.00      0.01       421
                  food       0.29      0.06      0.09       891
               shelter       0.17      0.01      0.02       554
              clothing       0.20      0.02      0.03    

In [75]:
# print every entry of the cross-validation results
for param_name, param_value in cv_better_rfc_model.cv_results_.items():
    print(param_name, "=", param_value, "\n")

mean_fit_time = [ 6766.61998363  1934.81381388  7879.77318654  1270.81696072
 10694.54701014  2624.38108959  5573.72268753  4256.54027619] 

std_fit_time = [1137.62940298   23.27127936  121.31311604  182.72848141  192.71065327
   72.68445393  338.75292779  144.00183303] 

mean_score_time = [406.60862017  35.02166605 136.52052999  33.98980808 278.84143052
  38.55520611 130.07929959 133.5517818 ] 

std_score_time = [315.68765054   4.12343799   6.40957199   8.13886705  99.33203965
   1.16232684   9.0522394   14.06957846] 

param_features__text_pipeline__tfidf__ngram_range = [(1, 3) (1, 3) (1, 3) (1, 3) (1, 3) (1, 3) (1, 3) (1, 3)] 

param_features__text_pipeline__best__n_components = [50 50 100 35 100 100 50 35] 

param_clf__estimator__n_estimators = [800 200 600 200 800 200 600 600] 

param_clf__estimator__max_depth = [20 20 20 20 20 20 20 20] 

param_clf__estimator__class_weight = ['balanced' 'balanced' 'balanced' 'balanced' 'balanced' 'balanced'
 'balanced' 'balanced'] 

params = [{'fe

**Note**:<br>
For the RandomForestClassifier, using feature selection and decomposition improves the prediction results for the individual target categories, and the model is much less biased towards the majority classes.

Nevertheless, the `AdaBoostClassifier` still handles the imbalanced dataset much better than all other model types we tried. So, we store it as our pickle file.

### 9. Export your model as a pickle file

Finally, having found the best model from our model selection list, we save this model with its best parameters as a pickle file. Pickle is the standard way of serialising objects in Python. With this pickle file we can deserialise our model and use it to make new predictions.
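
A minimal sketch of this round trip, using a plain dictionary as a stand-in for the fitted model (the object, its contents and the file name are hypothetical and only serve as illustration):

```python
import os
import pickle
import tempfile

# stand-in for a fitted model; any picklable Python object behaves the same way
toy_model = {"name": "toy_classifier", "params": {"n_estimators": 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_filepath = os.path.join(tmp_dir, "toy_classifier.pkl")

    # serialise the object to disk
    with open(toy_filepath, "wb") as file:
        pickle.dump(toy_model, file)

    # deserialise it again
    with open(toy_filepath, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == toy_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.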

In [76]:
def save_model(model, model_filepath):
    # use a context manager so the file handle is closed after writing
    with open(model_filepath, "wb") as file:
        pickle.dump(model, file)

In [77]:
# see train_classifier.py file
model_filepath = "classifier.pkl"
model = cv_ada_model
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Best trained model saved!')

Saving model...
    MODEL: classifier.pkl
Best trained model saved!


### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.