# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger'])

# import libraries
import random as rn
import numpy as np
import pandas as pd
import string
import pickle
from sqlalchemy import create_engine

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report  # text summary of the precision, recall, F1 score for each class

import Contractions
import warnings
warnings.warn("once")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ilona\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Make the code reproducible ...

In [2]:
# The below is necessary for starting NumPy generated random numbers
# in a well-defined initial state.
np.random.seed(42)

# The below is necessary for starting core Python generated random numbers
# in a well-defined state.
rn.seed(1042)

In [3]:
# load data from database
try:
    engine = create_engine('sqlite:///Disaster_Messages_engine.db')
    df = pd.read_sql_table('Messages_Categories_table', engine)
except:
    print("The database 'Disaster_Messages_engine.db' could not be loaded. No ML pipeline activities possible.")

# success
print("The dataset has {} data points with {} variables each.".format(*df.shape))
df.head(3)

The dataset has 25783 data points with 37 variables each.


Unnamed: 0,message,original,genre,lang_code,related,request,offer,aid_related,medical_help,medical_products,...,tools,hospitals,shops,aid_centers,weather_related,floods,storm,fire,earthquake,cold
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,en,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,en,1,0,0,1,0,0,...,0,0,0,0,1,0,1,0,0,0
2,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,en,1,1,0,1,0,1,...,0,1,0,0,0,0,0,0,0,0


In [4]:
# create input (X) and output (y) samples
# as input we have to take care about the messages; additionally we include the 'genre', perhaps we need it for the app,
# for the models only the 'message' column is used;
# the categories are the targets of the classification
X = df[['message', 'genre']]
y = df[df.columns[4:]]

### 2. Write a tokenization function to process your text data

During EPL pipeline activities we realised that there are messages which are not useful (e.g. 'nonsense' character sequences, html characters) and there are probably web links included. We have to deal with this in the tokenize() function.

In [5]:
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

In [6]:
# function from Dipanjan's repository:
# https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%\
# 20content/nlp%20proven%20approach/NLP%20Strategy%20I%20-%20Processing%20and%20Understanding%20Text.ipynb

def expand_contractions(text, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    
    return expanded_text

In [7]:
def tokenize(text):
    stop_words = set(stopwords.words('english'))
    stop_words.remove('no')
    stop_words.remove('not')
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'   
    
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
        
    # change the negation wordings like don't to do not, won't to will not 
    # or other contractions like I'd to I would, I'll to I will etc. via dictionary
    text = expand_contractions(text, CONTRACTION_MAP)

    # remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
    text = text.translate(str.maketrans('','', string.punctuation))
    # during ETL pipeline we have reduced the dataset on English messages ('en' language coding,
    # but there can be some wrong codings
    tokens = word_tokenize(text, language='english')
    lemmatizer = WordNetLemmatizer()  # for the lexical correctly found word stem (root)

    clean_tokens = []
    for tok in tokens:
        # use only lower cases, remove leading and ending spaces
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        # remember: there have been nonsense sentences, so, now some strings could be empty
        # numbers are still in, could be important and 
        # toDo: what is the correct length number to use now? Small ones are probably no relevant words ...
        # remove English stop words
        if (len(clean_tok) > 1) & (clean_tok not in stop_words):
            clean_tokens.append(clean_tok)

    return clean_tokens

In [8]:
# example for unit test to remove punctuation [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
example_str = 'This [is an] example? {of} string. with.? some &punctuation &signs!!??!!'
result = example_str.translate(str.maketrans('','', string.punctuation))
print(result)
# output shall be: This is an example of string with some punctuation signs

This is an example of string with some punctuation signs


In [9]:
# test tokenize
for message in X['message'][:10]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane', 'not'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '8090', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'needs', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 

Information about the National Palace-
['information', 'national', 'palace'] 

Storm at sacred heart of jesus
['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you!
['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I would like to receive the messages, thank you
['would', 'like', 'receive', 'message', 'thank'] 

I am in Croix-des-Bouquets. We have health issues. They ( workers 

### 3. Build a machine learning pipeline
Notes:
- Regarding the class default parameters, for this Python implementation scikit-learn version 0.21.2 is used.
- We use np.random.seed() beside of random_state/random_seed parameters ([reason](https://stackoverflow.com/questions/47923258/random-seed-on-svm-sklearn-produces-different-results))

Remember, we are dealing with an imbalanced dataset, therefore not all models can be used. A machine learning classifier could be more biased towards the majority class, causing bad classification of the minority class. Therefore we have to take care. We start with <i>LogisticRegression</i> and use other appropriate models later in this Python implementation. If the metric evaluation of the used models shows issues, we have to change our dataset doing undersampling or oversampling.

This machine pipeline should take in the `message` column as input and output classification results on the other remaining 33 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

As its first estimator we use [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression). Its default parameter values are:<br>
LogisticRegression(penalty=’l2’, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver=’warn’, max_iter=100, multi_class=’auto’, verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

We have to solve a supervised, multi-class problem, therefore some parameters have to be changed:<br>
- solver = 'saga'  (handles L1 and L2, according scikit-learn documentation it is often the best choice)
- multi_class = 'multinomial'
- C = the optimal value for the inverse of regularization strength is going to be set later, in the cross validation optimisation subchapter of this project.


Regarding feature extraction:<br>
For having a measure of the word frequency of each text term the <i>Term Frequency - Inverse Document Frequency</i> class exists in the library scikit-learn with 2 types - vectoriser and transformer. The used class [TfidTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) has the default parameters:<br>
TfidfTransformer(norm=’l2’, use_idf=True, smooth_idf=True, sublinear_tf=False)

Both pipeline classes <i>TfidTransformer</i> and <i>LogisticRegression</i> use a L2 normalisation for scaling. Therefore, the amount of words has no influence on our result. All feature vectors have an euclidian norm of 1.<br>
The text messages are transformed to a number vector representation used to train supervised classifiers able to predict the associated categories of future, new messages.

The usage of other machine learning models for imbalanced datasets, like Linear Support Vector Machine or Multinomial Naive Bayes classification model or RandomForest ensemble classification model, as well as model parameter optimisation is part of the project subchapters below.

In [10]:
pipeline = Pipeline([
        ('features', FeatureUnion([ 
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
            
        ])),
    
        ('clf', MultiOutputClassifier(LogisticRegression(solver='newton-cg', multi_class='multinomial', random_state=42)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
# model = pipeline and we use the given default test_size=0.25 and we need the 'message' column
X_train, X_test, y_train, y_test = train_test_split(X['message'], y, random_state=42)
# X_train
# y_train

Before using the pipeline, do single prediction with its components for some disaster messages.

In [12]:
print("Category target labels:\n{}".format(y.columns))

Category target labels:
Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'weather_related',
       'floods', 'storm', 'fire', 'earthquake', 'cold'],
      dtype='object')


In [13]:
example_child_alone_sentence = "Is there any help in place for orphans? My mother and father have died in the tragedy."

In [14]:
count_vect = CountVectorizer(tokenizer=tokenize)
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultiOutputClassifier(LogisticRegression(solver = 'newton-cg', multi_class = 'multinomial'))
clf.fit(X_train_tfidf, y_train)

MultiOutputClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                   dual=False,
                                                   fit_intercept=True,
                                                   intercept_scaling=1,
                                                   l1_ratio=None, max_iter=100,
                                                   multi_class='multinomial',
                                                   n_jobs=None, penalty='l2',
                                                   random_state=None,
                                                   solver='newton-cg',
                                                   tol=0.0001, verbose=0,
                                                   warm_start=False),
                      n_jobs=None)

In [15]:
print(clf.predict(count_vect.transform([example_child_alone_sentence])))

[[1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [16]:
X.query("message == 'Is there any help in place for orphans? My mother and father have died in the tragedy.'")

Unnamed: 0,message,genre
2258,Is there any help in place for orphans? My mot...,direct


In [17]:
y.iloc[2258:2259]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,tools,hospitals,shops,aid_centers,weather_related,floods,storm,fire,earthquake,cold
2258,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


This example showed, that the mechanism works in general, but the prediction has not been completely correct compared to the given labeled categories. To my opinion, the ones which are chosen by prediction are better ones regarding the context and its request for help. Means imagine, that they would have been in the original dataset as well.

Nevertheless, we train the pipeline now.

In [19]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('text_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('vect',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8',
                                                                                  input='content',
                                                                                  low

In [20]:
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each, where:

TP = TruePositive; FP = FalsePositive; TN = TrueNegative; FN = FalseNegative 

**Precision** quantifies the binary precision. It is a ratio of true positives (messages correctly classified to their categories)) to all positives (all messages classified to categories, irrespective of whether that was the correct classification), in other words it is the ratio of

TP / (TP + FP)

**Recall** tells us what proportion of messages that actually were classified to specific categories were classified by us as this categories. It is a ratio of true positives to all the correctly category classified messages that were actually disaster messages, in other words it is the ratio of

TP / (TP + FN)

A model's ability to precisely predict those that are correctly categoriesed disaster messages is more important than the model's ability to recall those individuals. 

We can use **F-beta score** as a metric that considers both precision and recall. According scikit-learn, the F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. F – Measure is nothing but the harmonic mean of Precision and Recall.

Fβ=(1 + β2)  (precision⋅recall / ((β2⋅precision) + recall))

In particular, when β=0.5, more emphasis is placed on precision. And when β=1.0 recall and precision are equally important.

According scikit-learn: "The **F1 score** ... reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter."

The classification_report() function returns an additional value: **Support** - the number of occurrences of each label in y_true.

In [21]:
def display_results(y_test, y_pred, cv=None):
    target_names = y_test.columns
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # shows F1_score, precision and recall
    class_report = classification_report(y_test, y_pred, target_names=target_names)

    print("Classification Report for each target class:\n", class_report)

    if cv != None:
        print("\n\n---- Best Parameters: ----\n{}".format(cv.best_params_))

In [22]:
y_test.shape

(6446, 33)

In [23]:
y_pred.shape

(6446, 33)

In [24]:
df_pred = pd.DataFrame(y_pred, columns=y_test.columns)
df_pred.to_csv("y_pred_file.csv")  
y_test.to_csv("y_test_file.csv")  

In [26]:
# Are there differences in the test dataFrame and in y_pred, means exists a test item without corresponding prediction
set(y_test) - set(df_pred)

set()


In [27]:
display_results(y_test, y_pred, None)

Classification Report for each target class:
                         precision    recall  f1-score   support

               related       0.76      0.96      0.85      4880
               request       0.48      0.13      0.21      1119
                 offer       0.00      0.00      0.00        29
           aid_related       0.45      0.29      0.35      2673
          medical_help       0.00      0.00      0.00       514
      medical_products       0.00      0.00      0.00       340
     search_and_rescue       0.00      0.00      0.00       169
              security       0.00      0.00      0.00       110
              military       0.00      0.00      0.00       205
           child_alone       0.00      0.00      0.00         6
                 water       0.00      0.00      0.00       406
                  food       0.45      0.01      0.01       741
               shelter       0.00      0.00      0.00       615
              clothing       0.00      0.00      0.00    

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


Such kind of behaviour, means the UndefinedMetricWarnings, has been expected because having an imbalanced dataset. It may result in having 'undefined metric warnings'. In this classification report table, often the metrics are not reliable because of being set to 0.0. They are ill defined, e.g. devision by zero may appear. Therefore we start to improve the model by using cross-validated values as hyperparameters and not only a single value. If this is not enough to solve the issue, we have to create a balanced dataset for all the features or do a feature reduction by using a PCA (Principle Component Analysis) method using only the most important features.

### 6. Improve your model
Use grid search to find better parameters. 

In [28]:
pipeline.get_params()

{'memory': None, 'steps': [('features', FeatureUnion(n_jobs=None,
                transformer_list=[('text_pipeline',
                                   Pipeline(memory=None,
                                            steps=[('vect',
                                                    CountVectorizer(analyzer='word',
                                                                    binary=False,
                                                                    decode_error='strict',
                                                                    dtype=<class 'numpy.int64'>,
                                                                    encoding='utf-8',
                                                                    input='content',
                                                                    lowercase=True,
                                                                    max_df=1.0,
                                                                    max_fea

In [35]:
# specify parameters for grid search
parameters = {
    #'features__text_pipeline__vect__encoding' : ['utf-8', 'latin-1'],  # utf-8 is useful only for the English messages
    #'features__text_pipeline__vect__min_df' : [1, 2, 3],  # minimum amount of docs the token is included
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__C' : [0.01, 0.1, 1, 10, 100],  
    'clf__estimator__max_iter': [100, 500, 1000, 1500],
    'clf__estimator__solver' : ['saga', 'newton-cg']
}

# create grid search object
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
grid_cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=3, cv=5, verbose=1)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [None]:
# model = cv
grid_cv.fit(X_train, y_train)

  


In [None]:
y_pred = grid_cv.predict(X_test)

In [None]:
print("Evaluation results for the cross validation tuned 'Logistic Regression' estimator:\nFirst: accuracy score: ")
accuracy_score(y_test, y_pred)
display_results(y_test, y_pred, grid_cv)

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

First, we try out other machine learning algorithms which are tuned by cross validation to compare their prediction results. Further models are:
- Naïve Bayes: `MultinomialNB`<br>
    The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g. word counts for text classification). Its default setting is: MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None). The parameter 'alpha' is its "Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing)."
    
- Support Vector Machines (regular `LinearSVC`)<br>
  Linear Support Vector Classification default setting is: LinearSVC(penalty=’l2’, loss=’squared_hinge’, dual=True, tol=0.0001, C=1.0, multi_class=’ovr’, fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)<br>
  According scikit-learn the parameter multi_class (string ‘ovr’ or ‘crammer_singer’ (default=’ovr’)): Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.<br>
  And the parameter C is the penalty parameter of the error term.

- Ensemble Method: `RandomForestClassifier`<br>
    A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.<br>
    Its default setting is: RandomForestClassifier(n_estimators=’warn’, criterion=’gini’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)

In [None]:
def build_model(model_type, params):
    ''' 
    input:
    model_type - the estimator model used for the MultiOutputClassifier
    params - the estimator model parameter grid used for the GridSearchCV 
    '''
    
    pipeline = Pipeline([
        ('features', FeatureUnion([ 
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
            
        ])),
    
        ('clf', MultiOutputClassifier(model_type))
    ])

    # the higher the verbose number the more information is thrown
    cv = GridSearchCV(pipeline, param_grid=params, n_jobs=3, cv=5, verbose=1) 
    
    return cv

In [None]:
# create param grids for the models
mnb_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]
}

svm_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__C': [0.01, 0.1, 1, 1.5, 5],
    'clf__estimator__multi_class': ['ovr', 'crammer_singer'],
    'clf__estimator__max_iter': [1000, 1200, 1500]
}

rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__n_estimators': [10, 100, 200], # The number of trees in the forest, will be 100 as default in v0.22
    'clf__estimator__max_depth': [None, 3, 5]
}

In [None]:
models = [
    [MultinomialNB(), mnb_param_grid],
    [LinearSVC(random_state=42), svm_param_grid],
    [RandomForestClassifier(random_state=42), rfc_param_grid] 
]

In [None]:
cv_model_list = []
for model in models:
    model_name = model[0].__class__.__name__
    print("\n----- {} -----".format(model_name))
    print("Build model: ...")
    cv_model = build_model(model_name, model[1])
    
    print("\nTrain model: ...")
    cv_model.fit(X_train, y_train)    
    y_pred = cv_model.predict(X_test)
    
    print("\nModel evaluation: ...")
    print("First: accuracy score: ")
    accuracy_score(y_test, y_pred)
    display_results(y_test, y_pred, cv_model)
    
    cv_model_list.append(cv_model)

Regarding the evaluation results the best model is .......

In [None]:
# model = ... best evaluated model with its best params ...

Finally, having found the best model from our selection, we save this model as a pickle file.

### 9. Export your model as a pickle file

In [None]:
def save_model(model, model_filepath):
    pickle.dump(model, open(model_filepath, "wb" ) )

In [None]:
# see train_classifier.py file
model_filepath = "classifier.p"
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Trained model saved!')

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.