# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [413]:
# Load Libraries
import pandas as pd
import numpy as np
import re
import warnings

import pickle

from sqlalchemy import create_engine

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import multioutput
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report

import nltk
nltk.download(['punkt', 'wordnet'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Charles\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Charles\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [379]:
# define project directory
projects_dir = "D:/DSN Projects/Disaster-Response-ML-Model/"

### Load Dataset

In [387]:
def load_data():
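    """Load the messages and their category labels from the SQLite database.

    Parameters
    ----------
    none

    Returns
    -------
    X : pandas Series
        The raw message text
    y : pandas DataFrame
        The category columns (every column from position 4 onward)
    """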
    # load data from database
    engine = create_engine('sqlite:///{}/data//DisasterResponse.db'.format(projects_dir))
    df = pd.read_sql_table('messages', engine)
    
    X =  df['message']
    y =  df.iloc[:,4:]
    return X, y

In [394]:
X, y = load_data()

### Utility functions

In [395]:
def consolidate_classification_reports(y_test, y_pred):
    # convert predictions to a DataFrame for reporting purposes
    y_pred = pd.DataFrame(y_pred, columns=y_test.columns)
    
    feature_scores = pd.DataFrame()
    
    for col in y_test.columns:
        # get classification report in dictionary format
        report_dict = classification_report(y_test[col],y_pred[col], output_dict=True)
        
        # convert report to dataframe
        report_df = pd.DataFrame.from_dict(report_dict)
        
        # remove unnecessary support row
        report_df = report_df.iloc[:-1]
        
        # drop the aggregate columns so only the per-class scores remain
        report_df.drop(['micro avg', 'macro avg', 'weighted avg'], axis =1, inplace = True)
        
        # calculate the average scores
        report_df = pd.DataFrame(report_df.transpose().mean())
        
        # reshape df
        report_df = report_df.transpose()
        
        feature_scores = feature_scores.append(report_df, ignore_index=True)
    return feature_scores

In [396]:
def print_classification_results(y_test, y_pred):
    # convert predictions to a DataFrame for reporting purposes
    y_pred = pd.DataFrame(y_pred, columns=y_test.columns)
    for col in y_test.columns:
        print('Category feature : {}'.format(col.capitalize()))
        print('.................................................................\n')
        print(classification_report(y_test[col],y_pred[col]))

### 2. Write a tokenization function to process your text data

In [397]:
def tokenize(text):
    """Function to tokenize text

    Parameters
    ----------
    text : str
        String to tokenize

    Returns
    -------
    List
        List of tokenized words
    """
    # raw string so the backslash escapes reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
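
The URL-replacement step of `tokenize` can be tried on its own, without NLTK. The following is a minimal sketch; the helper name `replace_urls`, the constant `URL_REGEX`, and the sample message are illustrative assumptions and not part of the notebook.

```python
import re

# Same pattern as in tokenize, written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every substring matched by URL_REGEX with a placeholder token."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text


# Made-up sample message (hypothetical, for illustration only)
sample = "Water needed, see http://example.com/help for details"
cleaned = replace_urls(sample)
print(cleaned)
```

Note that the character class `[$-_@.&+]` is a range that also covers `.` and `,`, so punctuation directly after a URL may be absorbed into the match.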

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
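
To see what "multiple target variables" means in practice, here is a small sketch on made-up data. The four messages, the two label columns, and the choice of `DecisionTreeClassifier` are assumptions made only to keep the example quick; the notebook itself uses random forest and AdaBoost.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy messages and two made-up binary label columns (illustrative only)
toy_messages = [
    "we need water and food",
    "house destroyed by flood",
    "send water please",
    "flood took the bridge",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # columns: water, flood

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # MultiOutputClassifier fits one copy of the estimator per label column
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

toy_pred = toy_pipeline.predict(["water shortage", "flood warning"])
print(toy_pred.shape)  # one row per message, one column per label
```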

In [398]:
# Baseline pipeline
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier())),
    ])
    return pipeline

In [399]:
# optimised model using AdaBoost Classifier instead of random forest
def build_new_model():
    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
        ])),

        ('clf', MultiOutputClassifier(AdaBoostClassifier()))
    ])

    return pipeline

In [400]:
pipeline = build_model()

In [401]:
print(pipeline)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
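
By default `train_test_split` holds out 25% of the rows and shuffles them differently on every call. The sketch below, on made-up arrays, shows how passing `random_state` makes the split repeatable; the value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed return identical partitions
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(len(split_a[0]), len(split_a[1]))  # train rows, test rows
```

A fixed seed matters here because a model scored on one random test set is not directly comparable with a model scored on another.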

In [402]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
# train classifier
with warnings.catch_warnings():
    warnings.simplefilter("ignore") # ignore future warnings
    pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
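
Averaging the per-class rows of `classification_report`, as `consolidate_classification_reports` does, should give the same numbers as macro averaging. The sketch below gets them directly from `precision_recall_fscore_support`; the two label columns and their values are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels for two categories
y_true_demo = pd.DataFrame({'water': [1, 0, 1, 0], 'food': [0, 0, 1, 1]})
y_pred_demo = pd.DataFrame({'water': [1, 0, 0, 0], 'food': [0, 0, 1, 1]})

rows = {}
for col in y_true_demo.columns:
    # average='macro' is the unweighted mean of the per-class scores
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_demo[col], y_pred_demo[col], average='macro')
    rows[col] = {'precision': precision, 'recall': recall, 'f1-score': f1}

demo_scores = pd.DataFrame.from_dict(rows, orient='index')
print(demo_scores.round(3))
```

This variant does not depend on which summary columns (`micro avg`, `accuracy`, and so on) a given scikit-learn version adds to the report dictionary.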

In [403]:
y_pred = pipeline.predict(X_test)

In [404]:
print_classification_results(y_test, y_pred)

Category feature : Related
.................................................................

              precision    recall  f1-score   support

         0.0       0.33      0.10      0.15      1569
         1.0       0.76      0.93      0.84      4894

   micro avg       0.73      0.73      0.73      6463
   macro avg       0.55      0.52      0.50      6463
weighted avg       0.66      0.73      0.67      6463

Category feature : Request
.................................................................

              precision    recall  f1-score   support

         0.0       0.83      0.98      0.90      5329
         1.0       0.50      0.08      0.14      1134

   micro avg       0.82      0.82      0.82      6463
   macro avg       0.67      0.53      0.52      6463
weighted avg       0.77      0.82      0.77      6463

Category feature : Offer
.................................................................

              precision    recall  f1-score   support

         0.



              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      6341
         1.0       0.00      0.00      0.00       122

   micro avg       0.98      0.98      0.98      6463
   macro avg       0.49      0.50      0.50      6463
weighted avg       0.96      0.98      0.97      6463

Category feature : Tools
.................................................................

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6425
         1.0       0.00      0.00      0.00        38

   micro avg       0.99      0.99      0.99      6463
   macro avg       0.50      0.50      0.50      6463
weighted avg       0.99      0.99      0.99      6463

Category feature : Hospitals
.................................................................

              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      6393
         1.0       0.00      0.00      0.00      

In [405]:
scores = consolidate_classification_reports(y_test, y_pred)
scores.index = y_test.columns

In [406]:
scores.head()

| category | f1-score | precision | recall |
|---|---|---|---|
| related | 0.497663 | 0.546344 | 0.517465 |
| request | 0.518916 | 0.665456 | 0.530797 |
| offer | 0.498798 | 0.497602 | 0.5 |
| aid_related | 0.497903 | 0.533936 | 0.522205 |
| medical_help | 0.478958 | 0.460543 | 0.498908 |


In [407]:
overall_accuracy = (y_pred == y_test).mean().mean()*100
print('Overall Accuracy: {0:.1f} %'.format(overall_accuracy))

Overall Accuracy: 92.4 %


In [408]:
scores['f1-score'].mean()

0.5113479430172512
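
The gap between 92.4 % accuracy and a mean F1 of about 0.51 is what heavy class imbalance produces. As a worked example, take the support counts from the Tools report above and assume a classifier that always predicts 0 (an assumption for illustration, not a claim about the fitted model):

```python
# Support counts taken from the Tools category report above
n_negative, n_positive = 6425, 38

# A classifier that always predicts 0 gets every negative right
accuracy = n_negative / (n_negative + n_positive)

# Class 0: precision = correct zeros / predicted zeros, recall = 1
precision_0 = n_negative / (n_negative + n_positive)
recall_0 = 1.0
f1_class_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Class 1 is never predicted, so its F1 is taken as 0
f1_class_1 = 0.0

# Macro F1 is the unweighted mean over the two classes
macro_f1 = (f1_class_0 + f1_class_1) / 2
print(round(accuracy, 3), round(macro_f1, 3))
```

Accuracy stays above 0.99 while macro F1 sits just under 0.5, which matches the pattern in the rare categories of the report.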

### 6. Improve your model
Use grid search to find better parameters. 

In [409]:
def multioutput_f1_scorer(y_test, y_pred): 

    # convert predictions to a DataFrame for reporting purposes
    y_pred = pd.DataFrame(y_pred, columns=y_test.columns)
    
    feature_scores = pd.DataFrame()
    
    for col in y_test.columns:
        # get classification report in dictionary format
        report_dict = classification_report(y_test[col],y_pred[col], output_dict=True)
        
        # convert report to dataframe
        report_df = pd.DataFrame.from_dict(report_dict)
        
        # remove unnecessary support row
        report_df = report_df.iloc[:-1]
        
        # drop the aggregate columns so only the per-class scores remain
        report_df.drop(['micro avg', 'macro avg', 'weighted avg'], axis=1, inplace = True)
        
        # calculate the average scores
        report_df = pd.DataFrame(report_df.transpose().mean())
        
        # reshape df
        report_df = report_df.transpose()
        
        feature_scores = feature_scores.append(report_df, ignore_index=True)
    # for grid search, return the overall f1 score for all features
    return feature_scores['f1-score'].mean()

In [410]:
# create scoring object for grid search
scorer = make_scorer(multioutput_f1_scorer, greater_is_better = True)

parameters = { 
    'clf__estimator__n_estimators': [50,100],
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
    'features__text_pipeline__vect__max_df': (0.75, 1.0),
    'features__text_pipeline__tfidf__use_idf': (True, False)
}

new_pipeline = build_new_model()

X_train, X_test, y_train, y_test = train_test_split(X, y)

cv = GridSearchCV(new_pipeline, param_grid=parameters, scoring=scorer, verbose=2)
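
The double-underscore keys in the grid follow the nesting of the pipeline: step name, then nested step name, then parameter. The grid has 2 x 2 x 2 x 2 = 16 combinations, which with 3 folds gives the 48 fits reported in the log. A quick way to check that every key is valid before starting a long search is to compare it with `get_params()`. The sketch below rebuilds the same structure without the custom tokenizer; `demo_pipeline` and `grid_keys` are illustrative names.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Same nesting as build_new_model, minus the custom tokenizer
demo_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
    ])),
    ('clf', MultiOutputClassifier(AdaBoostClassifier())),
])

# get_params() lists every tunable name, including nested ones
available = demo_pipeline.get_params()
grid_keys = [
    'clf__estimator__n_estimators',
    'features__text_pipeline__vect__ngram_range',
    'features__text_pipeline__vect__max_df',
    'features__text_pipeline__tfidf__use_idf',
]
missing = [key for key in grid_keys if key not in available]
print(missing)  # an empty list means every grid key is valid
```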

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [411]:
cv.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.4min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.8min remaining:    0.0s


[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.4min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1) 




[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.4min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 2), total= 3.8min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=True, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 2), total= 3.6min
[CV] clf__estimator__n_estimators=50, features__text_pipelin



[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.2min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.1min
[CV] clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=50, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 2), total= 3.3min
[CV] clf__estimator__n_estimators=50, features__text_pipelin

[CV]  clf__estimator__n_estimators=100, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=0.75, features__text_pipeline__vect__ngram_range=(1, 2), total= 6.5min
[CV] clf__estimator__n_estimators=100, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=100, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.9min
[CV] clf__estimator__n_estimators=100, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=100, features__text_pipeline__tfidf__use_idf=False, features__text_pipeline__vect__max_df=1.0, features__text_pipeline__vect__ngram_range=(1, 1), total= 1.9min
[CV] clf__estimator__n_estimators=100, features__text_

[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 184.4min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=None,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, ma...or=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'clf__estimator__n_estimators': [50, 100], 'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)), 'features__text_pipeline__vect__max_df': (0.75, 1.0), 'features__text_pipeline__tfidf__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(multioutput_f1_scorer), verbose=2)

In [415]:
# let's test the new model
y_pred_new = cv.predict(X_test)

In [417]:
new_overall_accuracy = (y_pred_new == y_test).mean().mean()*100
print('Overall Accuracy: {0:.1f} %'.format(new_overall_accuracy))

Overall Accuracy: 92.2 %


In [418]:
# new scores
new_scores = consolidate_classification_reports(y_test, y_pred_new)
new_scores.index = y_test.columns

In [424]:
# let's calculate the % change in overall F1 score
change = new_scores['f1-score'].mean() - scores['f1-score'].mean()
percent_change = 100*change/scores['f1-score'].mean()

In [426]:
print('Overall F1 Score increases by {0:.2f} %'.format(percent_change))

Overall F1 Score increases by 1.17 %


In [427]:
print_classification_results(y_test, y_pred_new)

Category feature : Related
.................................................................

              precision    recall  f1-score   support

         0.0       0.36      0.06      0.11      1542
         1.0       0.77      0.96      0.85      4921

   micro avg       0.75      0.75      0.75      6463
   macro avg       0.56      0.51      0.48      6463
weighted avg       0.67      0.75      0.68      6463

Category feature : Request
.................................................................

              precision    recall  f1-score   support

         0.0       0.85      0.95      0.90      5345
         1.0       0.42      0.17      0.24      1118

   micro avg       0.82      0.82      0.82      6463
   macro avg       0.63      0.56      0.57      6463
weighted avg       0.77      0.82      0.78      6463

Category feature : Offer
.................................................................

              precision    recall  f1-score   support

         0.


Category feature : Tools
.................................................................

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6423
         1.0       0.00      0.00      0.00        40

   micro avg       0.99      0.99      0.99      6463
   macro avg       0.50      0.50      0.50      6463
weighted avg       0.99      0.99      0.99      6463

Category feature : Hospitals
.................................................................

              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      6388
         1.0       0.08      0.01      0.02        75

   micro avg       0.99      0.99      0.99      6463
   macro avg       0.53      0.51      0.51      6463
weighted avg       0.98      0.99      0.98      6463

Category feature : Shops
.................................................................

              precision    recall  f1-score   support

         0

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

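From the cells above, switching to an AdaBoost classifier did not increase overall accuracy (92.2 % after tuning versus 92.4 % for the baseline), but it raised the mean F1 score by about 1.2 % relative to the baseline. Because the data was re-split before the grid search, the two models were scored on different test sets, so the comparison is only approximate. The grid search took over 3 hours (184.4 minutes for 48 fits). We could try other techniques, such as custom estimators.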
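
The 1.17 % printed earlier is a relative change, computed as `100*change/scores['f1-score'].mean()`, not a change in percentage points. The arithmetic below converts it back to an absolute difference using the two values shown in the outputs; because 1.17 is already rounded, the results are estimates.

```python
baseline_f1 = 0.5113479430172512   # mean F1 of the baseline, from the output above
relative_gain_pct = 1.17           # rounded percent change printed above

# Relative change -> absolute change in the F1 score
absolute_gain = baseline_f1 * relative_gain_pct / 100
tuned_f1_estimate = baseline_f1 + absolute_gain

print('Absolute change: {0:.4f}'.format(absolute_gain))
print('Estimated tuned mean F1: {0:.4f}'.format(tuned_f1_estimate))
```

So the mean F1 moved by roughly 0.006, that is, about 0.6 percentage points.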

### 9. Export your model as a pickle file

In [414]:
# Save the model to a pickle file
filename = projects_dir + 'Classifier.model'
# use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
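
A round trip is a cheap way to confirm the export worked. The sketch below writes and reloads a stand-in object in a temporary directory; the dictionary is a placeholder assumption, not the real fitted `cv` object.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object behaves the same way
stand_in_model = {'name': 'demo', 'params': {'n_estimators': 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'Classifier.model')

    # Write with a context manager so the file is closed before reading
    with open(demo_path, 'wb') as f:
        pickle.dump(stand_in_model, f)

    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

When loading the real model, keep in mind that pickle stores functions by reference, so the `tokenize` function has to be importable under the same name in the process that loads the file.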

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
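
As a starting point, the script needs to accept the database and model paths from the user. The following is only a sketch of that part; the argument names `database_filepath` and `model_filepath` and the helper `parse_args` are hypothetical and may differ from the provided template.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the training script needs (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description='Train the classifier and export it as a pickle file.')
    parser.add_argument('database_filepath', help='Path to the SQLite database')
    parser.add_argument('model_filepath', help='Where to write the pickled model')
    return parser.parse_args(argv)


# Example invocation with explicit arguments instead of sys.argv
args = parse_args(['DisasterResponse.db', 'Classifier.model'])
print(args.database_filepath, args.model_filepath)
```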