# ML Pipeline Preparation
This notebook is used to prepare and develop the NLP pipeline used in this project
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import nltk
nltk.download('stopwords')
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sqlalchemy import create_engine 
import sqlite3
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
import time
import re
from sklearn.tree import DecisionTreeClassifier
import pickle

[nltk_data] Downloading package stopwords to /home/ernest/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ernest/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ernest/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ernest/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [5]:
# load data from sqlite database
engine = create_engine('sqlite:///DisasterResponseETL4.db')

conn = engine.connect()

df = pd.read_sql_table('ETLTable', con=conn)

# Get the input parameter
X = df.iloc[:,1]

# Get the multi-output parameters Y
Y = df.iloc[:, 4:]

category_names = Y.columns

category_names

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [20]:
for i in range(20, 40):
    print(df.message[i])


I would like to know if one of the radio ginen Journalist died?
I'm in Laplaine, I am a victim
There's a lack of water in Moleya, please informed them for me.
Those people who live at Sibert need food they are hungry.
I want to say hello, my message is to let you know that there's an area in faustin Anhy street that has nothing neither food, water and medicine,
Can you tell me about this service
People I'm at Delma 2, we don't anything what so ever, please provide us with some food, water, and medicine
We are at Gressier we needs assistance right away. ASAP, Come help us.
How can we get water and food in Fontamara 43 cite Tinante?
We need help. Carrefour has been forgotten completely. The foul odor is killing us. Just letting you know. Thanks!!
Good evening, Radio one please. I would like information on Tiyous.
We have a lot of problem at Delma 75 Avenue Albert Jode, those people need water and food.
I'm here, I didn't find the person that I needed to send the pant by phone
People have

In [21]:
# Convert the Y to numpy (1-d array) array
Y = Y.values

### 2. A tokenization function to process the text data

In [22]:
def tokenize(text):
    # Remove special characters
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = word_tokenize(text)
    stop_words = stopwords.words('english')
    tokens = [w for w in words if w not in stop_words]
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tok = lemmatizer.lemmatize(tok, pos='v')
        clean_tokens.append(clean_tok)

    return clean_tokens
    

### 3. Build a machine learning pipeline
This machine pipeline takes in the `message` column as input and output classification results on the other 36 categories in the dataset. Example sklearn MultiOutputClassifier is found in [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)  for predicting multiple target variables.

In [23]:
# Define a pipeline with AdaBoostClassifier as a Classifier

pipeline1 = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
                ('tfidf_transformer', TfidfTransformer())
            ]))
            
        ])),

        ('classifier', MultiOutputClassifier(AdaBoostClassifier()))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [24]:
%%time
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state= 0)

# train and fit the pipeline

pipeline1.fit(X_train, y_train)

CPU times: user 1min 16s, sys: 230 ms, total: 1min 16s
Wall time: 1min 17s


Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('text_pipeline',
                                                 Pipeline(steps=[('count_vectorizer',
                                                                  CountVectorizer(tokenizer=<function tokenize at 0x7f9ae40df1f0>)),
                                                                 ('tfidf_transformer',
                                                                  TfidfTransformer())]))])),
                ('classifier',
                 MultiOutputClassifier(estimator=AdaBoostClassifier()))])

In [25]:
#pipeline1.get_params()

### 6. Improve the model
Grid search is used to find better parameters. 

In [8]:
pipeline_cv1 = GridSearchCV(pipeline1, param_grid=parameters, scoring='f1_micro', n_jobs=-1)

In [None]:
# Train the new pipeline with Grid Search 
# %%time

pipeline_cv1.fit(X_train, y_train)

In [10]:
pipeline_cv1.best_params_

{'classifier__estimator__learning_rate': 0.01,
 'classifier__estimator__n_estimators': 10}

In [30]:
# Define a pipeline with AdaBoostClassifier using the best parameters above

pipeline2 = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
                ('tfidf_transformer', TfidfTransformer())
            ]))
            
        ])),

        ('classifier', MultiOutputClassifier(AdaBoostClassifier(learning_rate = 0.01, n_estimators = 10)))
    ])

In [31]:
%%time
pipeline2.fit(X_train, y_train)

CPU times: user 22.4 s, sys: 260 ms, total: 22.7 s
Wall time: 22.8 s


Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('text_pipeline',
                                                 Pipeline(steps=[('count_vectorizer',
                                                                  CountVectorizer(tokenizer=<function tokenize at 0x7f9ae40df1f0>)),
                                                                 ('tfidf_transformer',
                                                                  TfidfTransformer())]))])),
                ('classifier',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(learning_rate=0.01,
                                                                    n_estimators=10)))])

In [32]:
base = [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=8) ]

parameters2 = {
    #'classifier__estimator__algorithm': ['SAMME.R', 'SAMME'],
    'classifier__estimator__base_estimator': base
    
}

In [34]:
# Search for a better base_estimator
#%%timeit -n 10

start_time = time.time()
pipeline_cv2 = GridSearchCV(pipeline2, param_grid=parameters2, scoring='f1_micro', n_jobs=-1)




print(f"Time taken is {time.time() - start_time}")

Time taken is 0.00017452239990234375


# Search for a better base_estimator
#%%timeit -n 10

start_time = time.time()



# Train the new pipeline with Grid Search 
pipeline_cv2.fit(X_train, y_train)

print(f"Time taken is {time.time() - start_time}")

In [36]:
#pipeline_cv2.best_params_

In [37]:
# Define a pipeline with AdaBoostClassifier using the best parameters above

pipeline3 = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
                ('tfidf_transformer', TfidfTransformer())
            ]))
            
        ])),

        ('classifier', MultiOutputClassifier(AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=8) , learning_rate = 0.01, n_estimators = 10)))
    ])

In [38]:
start_time = time.time()

pipeline3.fit(X_train, y_train)

print(f"Time take is {time.time() - start_time}")

Time take is 102.11413025856018


### 5. Test the models
The f1 score, precision and recall for each output category of the dataset are evaluated. This is done by iterating through the columns and calling sklearn's `classification_report`  and `accuracy_score` on each.

In [39]:
# Use the models to Predict the y_test values using X_test as input

#%%time
y_pred= pipeline1.predict(X_test)
#labels = y_test.columns


In [41]:
y_pred3 = pipeline2.predict(X_test)

In [43]:
y_pred4 = pipeline3.predict(X_test)

In [44]:
labels = ['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
      'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
      'missing_people', 'refugees', 'death', 'other_aid',
      'infrastructure_related', 'transport', 'buildings', 'electricity',
      'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
     'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
        'other_weather', 'direct_report']

In [46]:
print(classification_report(y_test, y_pred4, target_names=labels))

                        precision    recall  f1-score   support

               related       0.79      0.97      0.87      5045
               request       0.77      0.45      0.57      1135
                 offer       0.00      0.00      0.00        23
           aid_related       0.68      0.56      0.62      2742
          medical_help       0.61      0.20      0.30       524
      medical_products       0.65      0.28      0.39       327
     search_and_rescue       0.46      0.17      0.25       190
              security       0.26      0.10      0.15       109
              military       0.61      0.26      0.36       199
           child_alone       0.00      0.00      0.00         0
                 water       0.77      0.59      0.67       424
                  food       0.81      0.76      0.78       727
               shelter       0.73      0.54      0.62       588
              clothing       0.57      0.36      0.44       109
                 money       0.52      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [47]:
print(accuracy_score(y_test, y_pred4))

0.20216661580714068


In [25]:
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    print(classification_report(y_test[:, ix], y_pred[:, ix]))
    print('Accuracy {0:.2g}\n\n'.format(accuracy_score(y_test[:, ix], y_pred[:, ix])))

related: **************************************************
             precision    recall  f1-score   support

          0       0.62      0.20      0.31      1509
          1       0.80      0.96      0.87      5045

avg / total       0.76      0.79      0.74      6554

Accuracy 0.79


request: **************************************************
             precision    recall  f1-score   support

          0       0.91      0.97      0.93      5419
          1       0.76      0.52      0.61      1135

avg / total       0.88      0.89      0.88      6554

Accuracy 0.89


offer: **************************************************
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6531
          1       0.00      0.00      0.00        23

avg / total       0.99      1.00      0.99      6554

Accuracy 1


aid_related: **************************************************
             precision    recall  f1-score   support

          0   

In [26]:
print("GridSearch Pipeline metrics:")
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    print(classification_report(y_test[:, ix], y_pred[:, ix]))
    print('Accuracy {0:.2g}\n\n'.format(accuracy_score(y_test[:, ix], y_pred3[:, ix])))

GridSearch Pipeline metrics:
related: **************************************************
             precision    recall  f1-score   support

          0       0.62      0.20      0.31      1509
          1       0.80      0.96      0.87      5045

avg / total       0.76      0.79      0.74      6554

Accuracy 0.77


request: **************************************************
             precision    recall  f1-score   support

          0       0.91      0.97      0.93      5419
          1       0.76      0.52      0.61      1135

avg / total       0.88      0.89      0.88      6554

Accuracy 0.84


offer: **************************************************
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6531
          1       0.00      0.00      0.00        23

avg / total       0.99      1.00      0.99      6554

Accuracy 1


aid_related: **************************************************
             precision    recall  f1-sc

### 7. Test the tuned model
Show the accuracy, precision, and recall of the tuned model.  



In [29]:
#%%time
y_pred4 = pipeline3.predict(X_test)

In [30]:
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    print(classification_report(y_test[:, ix], y_pred[:, ix]))
    print('Accuracy {0:.2g}\n\n'.format(accuracy_score(y_test[:, ix], y_pred4[:, ix])))

related: **************************************************
             precision    recall  f1-score   support

          0       0.62      0.20      0.31      1509
          1       0.80      0.96      0.87      5045

avg / total       0.76      0.79      0.74      6554

Accuracy 0.78


request: **************************************************
             precision    recall  f1-score   support

          0       0.91      0.97      0.93      5419
          1       0.76      0.52      0.61      1135

avg / total       0.88      0.89      0.88      6554

Accuracy 0.87


offer: **************************************************
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6531
          1       0.00      0.00      0.00        23

avg / total       0.99      1.00      0.99      6554

Accuracy 1


aid_related: **************************************************
             precision    recall  f1-score   support

          0   

### Compare the accuracies of the pipelines 

In [35]:

d = 0
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    #print(classification_report(y_test[:, ix], y_pred[:, ix]))
    a = accuracy_score(y_test[:, ix], y_pred4[:, ix])
    b = accuracy_score(y_test[:, ix], y_pred[:, ix])
    c = a -b 
   
    if c >= 0:
          d +=1
    print('Difference in accuracy {0:.2g}\n\n'.format(c))
print("The number of positive improvement is {}".format(d))

related: **************************************************
Difference in accuracy -0.012


request: **************************************************
Difference in accuracy -0.021


offer: **************************************************
Difference in accuracy 0.0012


aid_related: **************************************************
Difference in accuracy -0.074


medical_help: **************************************************
Difference in accuracy -0.0026


medical_products: **************************************************
Difference in accuracy -0.0023


search_and_rescue: **************************************************
Difference in accuracy -0.00031


security: **************************************************
Difference in accuracy -0.00015


military: **************************************************
Difference in accuracy -0.0046


child_alone: **************************************************
Difference in accuracy 0


water: ***************************************

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [42]:
class WordCountEstimator(BaseEstimator, TransformerMixin):
    """Determine the number of words in a message."""

    def find_total_words(self, text):
        
        # Split the text into list of words
        words = word_tokenize(text)

        # Remove stop words, as thye are non-informative in our case
        stop_words = stopwords.words('english')
        tokens = [word for word in words if word not in stop_words]
        # Get and return an array of length of words and token
        obs = np.array([len(words), len(tokens)])
        return obs

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        res = np.array(list(map(self.find_total_words, X)))
        res = pd.DataFrame(res)
        res.columns = ['num_words', 'num_non_stops']
        return res

In [43]:
improvedPipeline1 = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize))
            , ('tfidf', TfidfTransformer())]))
        , ('word_count', WordCountEstimator())]))
    , ('classifier', MultiOutputClassifier(AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=3) , learning_rate = 0.01, n_estimators = 10)))])


#improvedPipeline.get_params()

In [44]:
improvedPipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize))
            , ('tfidf', TfidfTransformer())]))
        , ('word_count', WordCountEstimator())]))
    , ('clf', RandomForestClassifier())])


#improvedPipeline.get_params()

In [45]:
start_time = time.time()

improvedPipeline.fit(X_train, y_train)

print(f"Time take is {time.time() - start_time}")

Time take is 39.867016315460205


In [46]:
start_time = time.time()

improvedPipeline1.fit(X_train, y_train)

print(f"Time take is {time.time() - start_time}")

Time take is 72.42438912391663


#y_train = y_train.values

In [47]:
%%time
y_pred_improved = improvedPipeline.predict(X_test)

CPU times: user 8.08 s, sys: 200 ms, total: 8.28 s
Wall time: 8.31 s


In [48]:
%%time

y_pred_improved1 = improvedPipeline1.predict(X_test)

CPU times: user 8.3 s, sys: 172 ms, total: 8.47 s
Wall time: 8.5 s


In [49]:
print("Improved Pipeline metrics")
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    print(classification_report(y_test[:, ix], y_pred_improved[:, ix]))
    print('Accuracy {0:.2g}\n\n'.format(accuracy_score(y_test[:, ix], y_pred_improved[:, ix])))

Improved Pipeline metrics
related: **************************************************
             precision    recall  f1-score   support

          0       0.58      0.51      0.55      1509
          1       0.86      0.89      0.87      5045

avg / total       0.80      0.80      0.80      6554

Accuracy 0.8


request: **************************************************
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5419
          1       0.81      0.41      0.54      1135

avg / total       0.87      0.88      0.86      6554

Accuracy 0.88


offer: **************************************************
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6531
          1       0.00      0.00      0.00        23

avg / total       0.99      1.00      0.99      6554

Accuracy 1


aid_related: **************************************************
             precision    recall  f1-score 

  'precision', 'predicted', average, warn_for)


In [52]:
print("Improved Pipeline metrics")
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    print(classification_report(y_test[:, ix], y_pred_improved1[:, ix]))
    print('Accuracy {0:.0%} \n\n'.format(accuracy_score(y_test[:, ix], y_pred_improved1[:, ix])))

Improved Pipeline metrics
related: **************************************************
             precision    recall  f1-score   support

          0       0.58      0.51      0.55      1509
          1       0.86      0.89      0.87      5045

avg / total       0.80      0.80      0.80      6554

Accuracy 77% 


request: **************************************************
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5419
          1       0.81      0.41      0.54      1135

avg / total       0.87      0.88      0.86      6554

Accuracy 87% 


offer: **************************************************
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6531
          1       0.00      0.00      0.00        23

avg / total       0.99      1.00      0.99      6554

Accuracy 100% 


aid_related: **************************************************
             precision    recall  f1-s

  'precision', 'predicted', average, warn_for)


In [55]:
d = 0
for ix, label in enumerate(labels):
    print(f"{label}: {'*'*50}")
    #print(classification_report(y_test[:, ix], y_pred[:, ix]))
    a = accuracy_score(y_test[:, ix], y_pred_improved1[:, ix])
    b = accuracy_score(y_test[:, ix], y_pred[:, ix])
    c = a -b 
   
    if c >= 0:
          d +=1
    print('Difference in accuracy {0:.2g}\n\n'.format(c))
print("The number of positive improvement is {}".format(d))

related: **************************************************
Difference in accuracy -0.016


request: **************************************************
Difference in accuracy -0.021


offer: **************************************************
Difference in accuracy 0.0012


aid_related: **************************************************
Difference in accuracy -0.09


medical_help: **************************************************
Difference in accuracy -0.0024


medical_products: **************************************************
Difference in accuracy -0.0021


search_and_rescue: **************************************************
Difference in accuracy -0.00046


security: **************************************************
Difference in accuracy -0.00015


military: **************************************************
Difference in accuracy -0.0044


child_alone: **************************************************
Difference in accuracy 0


water: ****************************************

### 9. Export the model as a pickle file

In [36]:
pickle.dump(pipeline3, open("model.pickle", "wb"))

### 10. This notebook was used to complete the `ML Pipeline in `train_classifier.py`
