# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import re

import nltk
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier

# show up to 50 columns when displaying DataFrames
pd.set_option('display.max_columns', 50)

# tokenizer, lemmatizer and POS-tagger data used further down
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponseTable.db')
df = pd.read_sql_table('DisasterResponseTable', engine)

df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [3]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.0,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.0,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [4]:
# Finding the max values of the data
df.max()

id                                                                    30265
message                   | News Update | Serious loss of life expected ...
genre                                                                social
related                                                                   2
request                                                                   1
offer                                                                     1
aid_related                                                               1
medical_help                                                              1
medical_products                                                          1
search_and_rescue                                                         1
security                                                                  1
military                                                                  1
child_alone                                                               0
water       

In [5]:
# Finding the minimum values of the data 
df.min()

id                             2
message                         
genre                     direct
related                        0
request                        0
offer                          0
aid_related                    0
medical_help                   0
medical_products               0
search_and_rescue              0
security                       0
military                       0
child_alone                    0
water                          0
food                           0
shelter                        0
clothing                       0
money                          0
missing_people                 0
refugees                       0
death                          0
other_aid                      0
infrastructure_related         0
transport                      0
buildings                      0
electricity                    0
tools                          0
hospitals                      0
shops                          0
aid_centers                    0
other_infr

In [6]:
df.related.value_counts()

1    19906
0     6122
2      188
Name: related, dtype: int64

In [7]:
df.child_alone.value_counts()

0    26216
Name: child_alone, dtype: int64

The `child_alone` column is 0 for every row, so it carries no information for training and is dropped.


In [8]:
df = df.drop('child_alone', axis = 1)
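
A column with a single distinct value gives a classifier nothing to learn from, which is why `child_alone` is dropped. A more general check, sketched here on a small made-up frame (the values are illustrative only, not taken from the dataset), finds every constant column with `nunique`:

```python
import pandas as pd

# Toy stand-in for the label columns; values are made up for illustration.
toy = pd.DataFrame({
    "water": [0, 1, 0, 1],
    "child_alone": [0, 0, 0, 0],  # constant, like the real column
    "food": [1, 1, 0, 0],
})

# A column with only one distinct value carries no signal.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy_clean = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(toy_clean.columns))
```

Running the same check on `df` would be one way to confirm that `child_alone` is the only such column.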


Some values of `related` are 2. Assuming these are data-entry errors, I recode them from 2 to 1.

In [9]:
df['related'] = df['related'].replace(2,1)


In [10]:
#checking that the values have been changed from 2 to 1
df['related'].value_counts()

1    20094
0     6122
Name: related, dtype: int64
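
That a 2 is really a mislabeled 1 is an assumption. One way to probe it, sketched below on made-up rows rather than the real data, is to count how many other positive labels the `related == 2` rows carry before recoding them:

```python
import pandas as pd

# Made-up rows for illustration; the real dataset is not inspected here.
toy = pd.DataFrame({
    "related": [1, 0, 2, 2, 1],
    "request": [1, 0, 0, 0, 0],
    "water":   [0, 0, 0, 0, 1],
})

other_cols = [col for col in toy.columns if col != "related"]
twos = toy[toy["related"] == 2]

# Total number of positive labels on the rows where related == 2.
labels_on_twos = int(twos[other_cols].sum().sum())
print(labels_on_twos)

# Recode 2 to 1, as the notebook does.
toy["related"] = toy["related"].replace(2, 1)
print(sorted(toy["related"].unique()))
```

If that count were zero on the real data, recoding to 0 or dropping those rows would be equally defensible alternatives to the choice made here.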

In [11]:
df.iloc[:,4:]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
7,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


#### Defining X and y variables

In [12]:
X = df['message']
y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [13]:
# raw string so that \( and \) are not treated as invalid string escapes
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
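
The URL step of `tokenize` can be tried on its own without any NLTK data. This small sketch reuses the same pattern on a made-up message (the example URL is hypothetical):

```python
import re

url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

sample = "Flood map at http://example.com/map please share"

# Replace every detected URL with a fixed placeholder token,
# so that all links map to the same vocabulary entry.
for url in re.findall(url_regex, sample):
    sample = sample.replace(url, "urlplaceholder")

print(sample)
```

Note that the character class `[$-_@.&+]` contains a range from `$` to `_`, so punctuation directly after a URL (a trailing period, for instance) is swallowed into the match.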

#### Creating a StartingVerbExtractor Class

In [14]:
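class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """Flag messages in which any sentence starts with a verb or a retweet marker."""

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that produce no tokens, otherwise pos_tags[0] raises IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases every token, so the retweet marker arrives as 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False

    def fit(self, X, y=None):
        # nothing to learn; present so the class works inside a Pipeline
        return self

    def transform(self, X):
        # one boolean feature per message
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)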


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 here, since `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [15]:
def build_model():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    return pipeline


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = build_model()
model.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])
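
`train_test_split` shuffles at random, so every unseeded call produces a different split. That is consistent with the support counts differing between the reports later in this notebook (5024 versus 4994 for `related`, for example). A minimal sketch of pinning the split, on toy data and with an arbitrary seed of 42 that is not a value used in this notebook:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y.
data = list(range(20))
labels = [i % 2 for i in data]

# The same random_state gives the same shuffle, hence the same split.
split_a = train_test_split(data, labels, test_size=0.25, random_state=42)
split_b = train_test_split(data, labels, test_size=0.25, random_state=42)

# Index 1 of the returned list is the test portion of `data`.
same_test_rows = split_a[1] == split_b[1]
print(same_test_rows)
```

With a fixed seed, models trained in different sessions are evaluated on the same test rows, which makes their reports directly comparable.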

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [46]:
# Getting our y prediction results and producing a classification report for this.
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names = y.columns))

                        precision    recall  f1-score   support

               related       0.82      0.93      0.87      5024
               request       0.79      0.37      0.50      1074
                 offer       0.00      0.00      0.00        34
           aid_related       0.78      0.52      0.62      2739
          medical_help       0.63      0.05      0.09       529
      medical_products       0.68      0.09      0.16       329
     search_and_rescue       0.58      0.13      0.22       186
              security       0.29      0.02      0.03       119
              military       0.48      0.05      0.09       214
                 water       0.86      0.23      0.36       397
                  food       0.82      0.34      0.49       712
               shelter       0.86      0.25      0.39       582
              clothing       0.73      0.07      0.13       109
                 money       1.00      0.02      0.04       164
        missing_people       1.00      

  'precision', 'predicted', average, warn_for)
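
The cell above gets all categories in a single `classification_report` call. The per-column variant that the instructions describe could look roughly like this sketch, which uses made-up labels and predictions. The `zero_division` and `output_dict` arguments assume a newer scikit-learn (0.22 or later) than the one that produced the outputs in this notebook.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels and predictions for two categories.
y_true = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_hat = pd.DataFrame({"water": [1, 0, 0, 0], "food": [0, 0, 1, 1]})

reports = {}
for col in y_true.columns:
    # zero_division=0 reports 0.0 instead of warning when a class is never predicted
    reports[col] = classification_report(
        y_true[col], y_hat[col], zero_division=0, output_dict=True
    )
    print(col)
    print(classification_report(y_true[col], y_hat[col], zero_division=0))
```

Keeping the dictionary form makes it easy to collect, say, the recall of the positive class for every category into one table.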


### 6. Improve your model
Use grid search to find better parameters. 

In [45]:
# use get_params() to list the parameters our model exposes for tuning
build_model().get_params().keys()
 

dict_keys(['memory', 'steps', 'features', 'clf', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__text_pipeline', 'features__starting_verb', 'features__text_pipeline__memory', 'features__text_pipeline__steps', 'features__text_pipeline__vect', 'features__text_pipeline__tfidf', 'features__text_pipeline__vect__analyzer', 'features__text_pipeline__vect__binary', 'features__text_pipeline__vect__decode_error', 'features__text_pipeline__vect__dtype', 'features__text_pipeline__vect__encoding', 'features__text_pipeline__vect__input', 'features__text_pipeline__vect__lowercase', 'features__text_pipeline__vect__max_df', 'features__text_pipeline__vect__max_features', 'features__text_pipeline__vect__min_df', 'features__text_pipeline__vect__ngram_range', 'features__text_pipeline__vect__preprocessor', 'features__text_pipeline__vect__stop_words', 'features__text_pipeline__vect__strip_accents', 'features__text_pipeline__vect__token_pattern', 'features__text_p

#### When choosing my parameters, it became apparent that a large grid would take a very long time to run and was hard to complete without my computer crashing. I therefore searched only two values for each of two parameters, so the tuned model is probably not as good as a wider search could have made it.

In [48]:

model_2 = build_model()

parameters_grid = {'clf__estimator__min_samples_split': [2, 3],
              'clf__estimator__n_estimators': [50, 100]}


cv = GridSearchCV(model_2, param_grid=parameters_grid,  cv=3, n_jobs=-1, verbose=3)
cv.fit(X_train, y_train)
# cv=3 runs 3-fold cross-validation for each parameter combination
# n_jobs=-1 uses all CPU cores to speed up the search
# verbose=3 prints a progress log for every fit

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, score=0.23924321025328044, total= 3.7min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50 


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  4.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, score=0.23848031736344216, total= 3.7min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50 


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  9.0min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50, score=0.2593835825450107, total= 3.7min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, score=0.23954836740921576, total= 6.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, score=0.23771742447360392, total= 6.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, score=0.2580103753433018, total= 6.4min
[CV] clf__estimator__min_samples_split=3, clf__estimator__n_estimators=50 
[CV]  clf__estimator__min_samples_split=3, clf__estimator__n_estimators=50, score=0.23298748855660664, total= 3.4min
[CV] clf__estimator__min_samples_split=3, clf__estimator__n_estimators=50 
[CV]  clf__estimator__min_samples_sp

[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed: 69.7min finished
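
The figure of 12 fits in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A small sketch of that arithmetic, using only the standard library:

```python
from itertools import product

parameters_grid = {
    'clf__estimator__min_samples_split': [2, 3],
    'clf__estimator__n_estimators': [50, 100],
}
folds = 3

# Every combination of one value per parameter is a candidate.
candidates = [
    dict(zip(parameters_grid, values))
    for values in product(*parameters_grid.values())
]
total_fits = len(candidates) * folds

print(len(candidates), total_fits)
```

Adding a third parameter with two values would double this to 24 fits, which is why the grid was kept small. With `refit=True`, one further fit of the best candidate on the full training set follows the search.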


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__min_samples_split': [2, 3], 'clf__estimator__n_estimators': [50, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)
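
After fitting, a `GridSearchCV` object records which combination won. The sketch below shows the attributes involved on a tiny synthetic problem; the data, the small parameter values, and the name `search` are illustrative assumptions, not this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes in moments.
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "min_samples_split": [2, 3]},
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)           # winning parameter combination
print(search.best_score_)            # its mean cross-validated score
best_model = search.best_estimator_  # refitted on all of X_toy, y_toy
```

On the fitted object in this notebook, the equivalent calls would be `cv.best_params_` and `cv.best_score_`.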

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [50]:
y_pred_2 = cv.predict(X_test)

# classification report on test data
print(classification_report(y_test, y_pred_2, target_names=y.columns))

                        precision    recall  f1-score   support

               related       0.82      0.97      0.89      5024
               request       0.85      0.42      0.57      1074
                 offer       0.00      0.00      0.00        34
           aid_related       0.79      0.60      0.69      2739
          medical_help       0.75      0.05      0.09       529
      medical_products       0.63      0.05      0.10       329
     search_and_rescue       0.80      0.02      0.04       186
              security       0.25      0.01      0.02       119
              military       0.60      0.04      0.08       214
                 water       0.92      0.20      0.32       397
                  food       0.86      0.40      0.55       712
               shelter       0.85      0.17      0.28       582
              clothing       0.57      0.04      0.07       109
                 money       1.00      0.01      0.02       164
        missing_people       1.00      

  'precision', 'predicted', average, warn_for)


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
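
As one illustration of the second idea, a numeric feature can sit next to TF-IDF in a `FeatureUnion`. The sketch below uses a hypothetical `TextLengthExtractor` (not part of this notebook) and `TfidfVectorizer` with its default tokenizer, so that it runs without NLTK data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one feature, the character count of each text."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # 2D array with one column, as FeatureUnion expects
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLengthExtractor()),
])

docs = ["we need water", "storm", "food and shelter needed"]
matrix = features.fit_transform(docs)

print(matrix.shape)
print(matrix.toarray()[:, -1])  # the appended length column
```

Whether such a feature helps on the disaster messages would have to be measured; the sketch only shows how one plugs in.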

#### For another model, I removed my StartingVerbExtractor and used the random forest classifier again, to see whether the model performs better with or without the transformer.

In [17]:
def build_model_3():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
        ])),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    return pipeline

In [18]:
model_3 = build_model_3()
model_3.fit(X_train, y_train)
y_pred_3 = model_3.predict(X_test)

print(classification_report(y_test, y_pred_3, target_names=y.columns))

                        precision    recall  f1-score   support

               related       0.82      0.93      0.87      4994
               request       0.80      0.35      0.49      1087
                 offer       0.00      0.00      0.00        25
           aid_related       0.75      0.53      0.62      2675
          medical_help       0.52      0.07      0.12       516
      medical_products       0.68      0.07      0.12       315
     search_and_rescue       0.88      0.07      0.13       192
              security       1.00      0.01      0.02       125
              military       0.75      0.10      0.18       232
                 water       0.80      0.13      0.23       417
                  food       0.84      0.38      0.52       736
               shelter       0.89      0.17      0.28       568
              clothing       0.81      0.13      0.23        97
                 money       0.89      0.05      0.09       169
        missing_people       0.67      

  'precision', 'predicted', average, warn_for)


In [19]:
import pickle

### 9. Export your model as a pickle file

In [20]:
# save model in pickle file
# the with-block closes the file even if pickling fails
with open('model.pkl', 'wb') as f:
    pickle.dump(model_3, f)

Although model_2 was marginally better, I've decided to export model_3 to avoid having to re-run the grid search code.
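
A quick way to check an exported pickle is to load it back and compare predictions. This is a minimal sketch on a toy text classifier; the pipeline, the texts, and the temporary file are stand-ins, not the model trained above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy model standing in for model_3.
texts = ["need water", "need food", "sunny day", "nice weather"]
labels = [1, 1, 0, 0]
toy_model = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
toy_model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original.
round_trip_ok = list(restored.predict(texts)) == list(toy_model.predict(texts))
print(round_trip_ok)
```

For the real pipeline, the `tokenize` function must be importable wherever the pickle is loaded, because pickle stores a reference to the function rather than its code.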

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.