# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and y

In [48]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

import re
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger'])
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier  # note: also importing GradientBoostingClassifier here raised an error
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin

import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
def load_db_data(db_path, db_table):
    '''
    Load the contents of a DB table into a DataFrame.
    
    Input:
    - db_path: the path to the database
    - db_table: the name of the database table containing the data to load
    
    Output:
    - a DataFrame with each column being a field of the DB table
                   and each row being a record
    '''
    engine = create_engine(db_path)
    df = pd.read_sql_table(db_table, con=engine)
    
    return df

In [3]:
# Load the Disaster Response SQLite table
df = load_db_data('sqlite:///Bunn_DisasterResponse.db', 'Bunn_DisasterResponse')
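
To check the loading logic without the real database file, the same read can be exercised against an in-memory SQLite database. This is only a sketch: the table name `demo_messages` and its rows are made up for illustration, and it assumes pandas and SQLAlchemy versions that work together.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    'id': [1, 2],
    'message': ['need water', 'road is blocked'],
    'related': [1, 1],
})

# 'sqlite://' with no path creates an in-memory database.
engine = create_engine('sqlite://')
sample.to_sql('demo_messages', engine, index=False)

# Same call as in load_db_data(), pointed at the throwaway table.
loaded = pd.read_sql_table('demo_messages', con=engine)
print(loaded.shape)
```

Passing `index=False` to `to_sql` keeps the DataFrame index from being stored as an extra column in the table.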

In [20]:
# Take a peek at the contents of the DataFrame
df.head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# The features in X are just the text contained in column 'message'
# The targets in y are the last 36 columns; we drop:
# - 'id': useless
# - 'message': the features in X
# - 'original': not always in English
# - 'genre': not one of the 36 target categories
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
# X = df.iloc[:,1:2]    # Cleaner but less explicit than above...
# y = df.iloc[:,4:]
print('\n-- Labels --\n{}\n\n'.format(df.columns.values))
print('\n-- X --\n{}\n\n'.format(X[:2]))
print('\n-- y --\n{}\n\n'.format(y[:2]))


-- Labels --
['id' 'message' 'original' 'genre' 'related' 'request' 'offer'
 'aid_related' 'medical_help' 'medical_products' 'search_and_rescue'
 'security' 'military' 'child_alone' 'water' 'food' 'shelter' 'clothing'
 'money' 'missing_people' 'refugees' 'death' 'other_aid'
 'infrastructure_related' 'transport' 'buildings' 'electricity' 'tools'
 'hospitals' 'shops' 'aid_centers' 'other_infrastructure' 'weather_related'
 'floods' 'storm' 'fire' 'earthquake' 'cold' 'other_weather'
 'direct_report']



-- X --
0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
Name: message, dtype: object



-- y --
   related  request  offer  aid_related  medical_help  medical_products  \
0        1        0      0            0             0                 0   
1        1        0      0            1             0                 0   

   search_and_rescue  security  military  child_alone      ...        \
0                  0         0        

### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    '''
    Normalizes, removes punctuation, lemmatizes, and
    removes stop words and trailing spaces from the input text.
    
    Input - text string
    Output - list of the resulting words/tokens
    '''
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens if word not in stop_words]
    
    return tokens
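
To see what the normalization step of `tokenize` does on its own, here is a minimal sketch that uses only the standard library. The helper name `normalize` is hypothetical, and `str.split()` stands in for `word_tokenize` (an assumption made so the snippet runs without NLTK data).

```python
import re

def normalize(text):
    # Lowercase, then replace every non-alphanumeric character with a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Weather update - a cold front from Cuba!"
cleaned = normalize(sample)

# str.split() stands in for word_tokenize here (an assumption for this sketch).
tokens = cleaned.split()
print(tokens)
```

Because the character class only keeps ASCII letters and digits, accented letters are replaced by spaces as well: `normalize("café")` gives `"caf "`.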

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [6]:
def build_model1(verbose):
    '''
    Builds a simple Pipeline model for Natural Language Processing
    
    Input:
    - verbose: level of verbosity while running (0 is none; the higher the value, the more messages)
    
    Output:
    - the resulting Pipeline model
    '''
    rfc = RandomForestClassifier(verbose=verbose)

    pipeline = Pipeline([
#                    ('vect', CountVectorizer(tokenizer=tokenize)),
#                    ('tfidf', TfidfTransformer()),
                    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                    ('clf', MultiOutputClassifier(rfc, n_jobs=-1)),
               ])
    return pipeline

In [7]:
model1 = build_model1(0)
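
To illustrate how a pipeline of this shape behaves, the sketch below trains the same two-step structure on a handful of made-up messages with two made-up targets. It uses the default `TfidfVectorizer` tokenizer rather than `tokenize()`, so it runs without NLTK. The data and the `toy_pipeline` name are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages and two made-up binary targets (water, storm).
texts = ['we need water', 'water please', 'storm is coming',
         'big storm tonight', 'need water after storm', 'hello there']
labels = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # default tokenizer, not tokenize()
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(texts, labels)

# One row per message, one column per target.
predictions = toy_pipeline.predict(['any water left', 'storm warning'])
print(predictions.shape)
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, which is why the prediction array has as many columns as there are targets.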

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=68)
print('X_train:\n{}\n\ny_train:\n{}\n\nX_test:\n{}\n\ny_test:\n{}\n'.format(
    X_train[:2], y_train[:2], X_test[:2], y_test[:2]))

X_train:
20238    The current government is trying to keep in pl...
10672    I 'm being encouraged to evacuate my apartment...
Name: message, dtype: object

y_train:
       related  request  offer  aid_related  medical_help  medical_products  \
20238        0        0      0            0             0                 0   
10672        0        0      0            0             0                 0   

       search_and_rescue  security  military  child_alone      ...        \
20238                  0         0         0            0      ...         
10672                  0         0         0            0      ...         

       aid_centers  other_infrastructure  weather_related  floods  storm  \
20238            0                     0                0       0      0   
10672            0                     0                0       0      0   

       fire  earthquake  cold  other_weather  direct_report  
20238     0           0     0              0              0  
10672     0   
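
The `train_test_split` call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. A small sketch with stand-in data (the `items` and `targets` lists are made up) shows the resulting proportions:

```python
from sklearn.model_selection import train_test_split

items = list(range(100))          # stand-in for the messages
targets = [i % 2 for i in items]  # stand-in for the labels

# No test_size given: scikit-learn holds out 25% by default.
a_train, a_test, b_train, b_test = train_test_split(
    items, targets, random_state=68)
print(len(a_train), len(a_test))
```

The support of 6554 rows shown in the classification reports later in this notebook is the size of that held-out test set.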

In [9]:
# Train the model
model1.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [10]:
def simple_model_test(model, phrases, targets, num_tests):
    '''
    Prints the list of targets resulting from applying
    the NLP model to a few sample phrases.
    
    Input:
    - model: the model to be tested
    - phrases: a Series of sample text
    - targets: a DataFrame whose columns name the target categories
    - num_tests: the number of tests we want to perform
    '''
    
    # avoid shadowing the built-in max()
    n_tests = min(phrases.shape[0], num_tests)
    
    for i in range(n_tests):
        test = model.predict([phrases.iloc[i]])
        print('Test #{}:\n{}\n{}\n'.format(i, phrases.iloc[i],
                                           targets.columns.values[(test.flatten() == 1)]))

In [11]:
simple_model_test(model1, X_test, y_test, 5)

Test #0:
In cooperation with communities and local organizations, Mercy Corps is currently focused on long-term recovery programs designed to rebuild and revitalize tsunami-stricken communities.
['related' 'aid_related']

Test #1:
2nd stage: Greece was the first country to send a hired aircraft - from Olympic Airlines - to Phuket to pick up the first wave of stranded tourists, Finally, and as a result of the continuous collaboration with the European Union and the member states in the special flight besides the Greek and Cypriot citizens were also boarded other citizens from the EU members.
['related' 'search_and_rescue']

Test #2:
Santiago tomorrow to Plan the big trip south. Anyone between Maule and Bio Bio who wanna meet and tell us their recent stories, DM me
['related']

Test #3:
As a result of emergency situation 7,200 houses (4,870 houses in Krymsk), 7 socially important facilities were flooded and gas, power and water supply sistems were disrupted in the cities of Gelendzhik, N

In [12]:
y_pred1 = model1.predict(X_test)
y_pred1_df = pd.DataFrame(y_pred1, columns=y.columns)

In [13]:
print('y_pred1 (type = {}):\n\n{}\n\n'.format(type(y_pred1), y_pred1[:2]))
print('y_pred1_df (type = {}):\n\n{}\n\n'.format(type(y_pred1_df), y_pred1_df[:2]))

y_pred1 (type = <class 'numpy.ndarray'>):

[[1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


y_pred1_df (type = <class 'pandas.core.frame.DataFrame'>):

   related  request  offer  aid_related  medical_help  medical_products  \
0        1        0      0            1             0                 0   
1        1        0      0            0             0                 0   

   search_and_rescue  security  military  child_alone      ...        \
0                  0         0         0            0      ...         
1                  1         0         0            0      ...         

   aid_centers  other_infrastructure  weather_related  floods  storm  fire  \
0            0                     0                0       0      0     0   
1            0                     0                0       0      0     0   

   earthquake  cold  other_weather  direct_report  
0           0 

In [14]:
# As seen above, y_pred needs to be converted to a DataFrame, to match y_test's structure

for column in y.columns:
    print('\n---- {} ----\n{}\n'.format(column, classification_report(y_test[column], y_pred1_df[column])))


---- related ----
             precision    recall  f1-score   support

          0       0.61      0.48      0.54      1436
          1       0.86      0.91      0.89      5072
          2       0.18      0.35      0.24        46

avg / total       0.80      0.81      0.81      6554



---- request ----
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5402
          1       0.79      0.41      0.54      1152

avg / total       0.87      0.88      0.86      6554



---- offer ----
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6529
          1       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      6554



---- aid_related ----
             precision    recall  f1-score   support

          0       0.74      0.85      0.79      3746
          1       0.75      0.60      0.67      2808

avg / total       0.75      0.74      0.74      6554



----

  'precision', 'predicted', average, warn_for)
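
The `related` report above lists three classes (0, 1 and 2), while the other targets shown are binary. One way to spot such columns before training is to check the distinct values per target. The sketch below uses a small made-up DataFrame (`toy_y`) rather than the real `y`.

```python
import pandas as pd

# Made-up targets; 'related' deliberately contains a stray 2.
toy_y = pd.DataFrame({
    'related': [1, 0, 2, 1],
    'request': [0, 1, 0, 0],
    'offer':   [0, 0, 0, 0],
})

# Columns whose values fall outside {0, 1}.
non_binary = [col for col in toy_y.columns
              if not toy_y[col].isin([0, 1]).all()]
print(non_binary)

# Columns with a single value carry no signal for a classifier.
constant = [col for col in toy_y.columns if toy_y[col].nunique() == 1]
print(constant)
```

How to handle such values (for example, recoding or dropping rows) is a data-cleaning decision that this notebook does not make.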


### 6. Improve your model
Use grid search to find better parameters. 

In [15]:
# List the current model's parameters
model1.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fb76da07620>, use_idf=True,
           vocabulary=None)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None, verbose=0,

In [16]:
# When passing a tokenizer to CountVectorizer(), it overrides the string tokenization step,
# as long as analyzer = 'word' (default), so I don't bother with the 'vect__*' parameters
# during the grid search and concentrate only on 'tfidf__*' and 'clf__*'.

parameters = {
#    'tfidf__use_idf': [True, False],
    'clf__estimator__criterion': ['entropy', 'gini'],
    'clf__estimator__max_depth': [None, 10],
    'clf__estimator__max_leaf_nodes': [None, 5],
    'clf__estimator__min_samples_leaf': [1, 4],
    'clf__estimator__min_samples_split': [2, 4],
    'clf__estimator__n_estimators':  [10, 50],
}

model2 = GridSearchCV(model1, param_grid=parameters, verbose=1)

In [17]:
# Train the model
model2.fit(X_train, y_train)

Fitting 3 folds for each of 64 candidates, totalling 192 fits


[Parallel(n_jobs=1)]: Done 192 out of 192 | elapsed: 150.5min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__criterion': ['entropy', 'gini'], 'clf__estimator__max_depth': [None, 10], 'clf__estimator__max_leaf_nodes': [None, 5], 'clf__estimator__min_samples_leaf': [1, 4], 'clf__estimator__min_samples_split': [2, 4], 'clf__estimator__n_estimators': [10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
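
The grid keys follow scikit-learn's double-underscore convention: `clf` is the pipeline step, `estimator` is the classifier wrapped by `MultiOutputClassifier`, and the last part is the random forest parameter. The number of fits reported in the log can be reproduced with `ParameterGrid`, as in this small sketch (the fold count is copied from the log above, not computed):

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'clf__estimator__criterion': ['entropy', 'gini'],
    'clf__estimator__max_depth': [None, 10],
    'clf__estimator__max_leaf_nodes': [None, 5],
    'clf__estimator__min_samples_leaf': [1, 4],
    'clf__estimator__min_samples_split': [2, 4],
    'clf__estimator__n_estimators': [10, 50],
}

n_candidates = len(ParameterGrid(parameters))  # 2**6 combinations
n_folds = 3  # taken from the log above; newer scikit-learn defaults differ
print(n_candidates, n_candidates * n_folds)
```

Each extra value in any list multiplies the number of candidates, so counting the fits before launching a search gives a rough idea of how long it will run.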

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [18]:
# Display the model's best parameters, according to the grid search above
model2.best_params_

{'clf__estimator__criterion': 'gini',
 'clf__estimator__max_depth': None,
 'clf__estimator__max_leaf_nodes': None,
 'clf__estimator__min_samples_leaf': 1,
 'clf__estimator__min_samples_split': 2,
 'clf__estimator__n_estimators': 50}

In [25]:
# Predict on the test data and display the metrics
y_pred2 = model2.predict(X_test)
y_pred2_df = pd.DataFrame(y_pred2, columns=y.columns)
for column in y.columns:
    print('\n---- {} ----\n{}\n'.format(column, classification_report(y_test[column], y_pred2_df[column])))


---- related ----
             precision    recall  f1-score   support

          0       0.68      0.45      0.54      1436
          1       0.86      0.94      0.89      5072
          2       0.27      0.35      0.30        46

avg / total       0.81      0.83      0.81      6554



---- request ----
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      5402
          1       0.84      0.46      0.60      1152

avg / total       0.89      0.89      0.88      6554



---- offer ----
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6529
          1       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      6554



---- aid_related ----
             precision    recall  f1-score   support

          0       0.78      0.84      0.81      3746
          1       0.76      0.68      0.72      2808

avg / total       0.77      0.77      0.77      6554



----

  'precision', 'predicted', average, warn_for)
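
The reports above give precision and recall but not accuracy, which this step also asks for. One possible way to collect all three per category into a single table is sketched below on made-up labels. The `toy_true` and `toy_pred` frames are assumptions, and `zero_division=0` assumes a scikit-learn version that supports that argument.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true and predicted labels for two targets.
toy_true = pd.DataFrame({'request': [1, 0, 1, 0], 'offer': [0, 0, 1, 0]})
toy_pred = pd.DataFrame({'request': [1, 0, 0, 0], 'offer': [0, 0, 0, 0]})

rows = []
for column in toy_true.columns:
    rows.append({
        'category': column,
        'accuracy': accuracy_score(toy_true[column], toy_pred[column]),
        # zero_division=0 avoids a warning when no positive is predicted
        # (assumes a scikit-learn version that supports this argument).
        'precision': precision_score(toy_true[column], toy_pred[column],
                                     zero_division=0),
        'recall': recall_score(toy_true[column], toy_pred[column],
                               zero_division=0),
    })

summary = pd.DataFrame(rows).set_index('category')
print(summary)
```

In this toy data, `offer` has 75% accuracy with zero precision and recall, which mirrors how a rare category can look accurate while never being detected. For a column with more than two classes, such as `related`, `precision_score` and `recall_score` would need an explicit `average` argument.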


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [44]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    '''
    Custom Transformer for detecting phrases that start with a verb or retweets
    '''

    def starting_verb(self, text):

        # tokenize into sentences
        sentence_list = nltk.sent_tokenize(text)

        for sentence in sentence_list:

            # tokenize into words and tag Part of Speech
            pos_tags = nltk.pos_tag(tokenize(sentence))
            #print('pos_tags = {}\n'.format(pos_tags))

            if pos_tags:
                # get the 1st word and PoS tag
                first_word, first_tag = pos_tags[0]

                # return True if the 1st word is a verb or indicates a retweet
                # (tokenize() lowercases the text, so compare against 'rt')
                if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                    return True

        return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

In [45]:
def build_model3(classifier):
    '''
    Builds a slightly more complex Pipeline model for Natural Language Processing
    
    Input:
    - classifier: the learning algorithm (build_model1() used RandomForestClassifier exclusively)
    
    Output:
    - the resulting Pipeline model
    '''

    pipeline = Pipeline([
                ('features', FeatureUnion([
                    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                    ('starting_verb', StartingVerbExtractor())
                ])),
                ('clf', MultiOutputClassifier(classifier, n_jobs=-1))
               ])
    return pipeline
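
To show how `FeatureUnion` stacks a custom feature next to the TF-IDF matrix, here is a simplified sketch that avoids NLTK. The `QuestionMarkExtractor` class is hypothetical and much cruder than `StartingVerbExtractor`; it only demonstrates the shape of the combined output.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class QuestionMarkExtractor(BaseEstimator, TransformerMixin):
    '''Hypothetical transformer: flags messages containing a question mark.'''

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return a 2-D column so it can be stacked next to the TF-IDF matrix.
        return np.array([[1.0 if '?' in text else 0.0] for text in X])


messages = ['where is the water?', 'storm is coming', 'is anyone there?']

union = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('question', QuestionMarkExtractor()),
])
features = union.fit_transform(messages)

# Width = TF-IDF vocabulary size + 1 extra column.
print(features.shape)
```

The custom transformer must return one row per input message. Returning a two-dimensional column, as here, lets it be concatenated horizontally with the sparse TF-IDF output.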

In [50]:
# Build and train a 3rd model, based on the multi-feature Pipeline above
# Try with different types of classifiers

# RandomForestClassifier
classifier = RandomForestClassifier(criterion='gini', max_depth=None, max_leaf_nodes=None,
                                    min_samples_leaf=1, min_samples_split=2, n_estimators=50)
model3 = build_model3(classifier)
model3.fit(X_train, y_train)
y_pred3 = model3.predict(X_test)
y_pred3_df = pd.DataFrame(y_pred3, columns=y.columns)
print("\n===== RandomForestClassifier =====\n")

for column in y.columns:
    print('\n---- {} ----\n{}\n'.format(column, classification_report(y_test[column], y_pred3_df[column])))

# SVC (Support Vector Machine Classifier)
'''classifier = SVC()
model4 = build_model3(classifier)
model4.fit(X_train, y_train)
y_pred4 = model4.predict(X_test)
y_pred4_df = pd.DataFrame(y_pred4, columns=y.columns)
print("\n===== SVC =====\n")

for column in y.columns:
    print('\n---- {} ----\n{}\n'.format(column, classification_report(y_test[column], y_pred4_df[column])))'''


===== RandomForestClassifier =====


---- related ----
             precision    recall  f1-score   support

          0       0.70      0.45      0.54      1436
          1       0.86      0.94      0.90      5072
          2       0.27      0.37      0.31        46

avg / total       0.82      0.83      0.82      6554



---- request ----
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      5402
          1       0.83      0.48      0.61      1152

avg / total       0.89      0.89      0.88      6554



---- offer ----
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6529
          1       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      6554



---- aid_related ----
             precision    recall  f1-score   support

          0       0.78      0.84      0.81      3746
          1       0.76      0.68      0.72      2808

avg / total       0.77 

  'precision', 'predicted', average, warn_for)



### 9. Export your model as a pickle file

In [51]:
def serialize_object(py_object, filepath):
    '''
    Serialize a Python object into a pickle file.
    In the context of this Jupyter Notebook, the object is an ML model.
    
    Input:
    - py_object: Python object to be serialized
    - filepath: path and name of the file that will contain the serialized Python object
    '''
    with open(filepath, "wb") as f:
        pickle.dump(py_object, f)

In [52]:
# Export/serialize the model to a Pickle file
model_pickle_file = "ML_pipeline_classifier.p"
serialize_object(model3, model_pickle_file)
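
Reading the file back is the mirror image of writing it. The sketch below round-trips a plain dictionary through a temporary file. The dictionary stands in for the trained model, and the file name `toy_model.p` is made up.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained model in this sketch.
stand_in_model = {'name': 'toy', 'n_estimators': 50}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.p')

    # Write, making sure the file is closed afterwards.
    with open(path, 'wb') as f:
        pickle.dump(stand_in_model, f)

    # Read it back.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

For the real pipeline, note that pickle stores functions and classes by reference, so `tokenize` and `StartingVerbExtractor` need to be importable in whatever script later loads the file.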

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.