# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
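
As a rough illustration of the three steps above, the sketch below builds a tiny in-memory SQLite database and reads it back with `read_sql_table`. The table name `messages` and the toy rows are assumptions made for this example only; the real database and table names may differ.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for the real file (assumption).
engine = create_engine("sqlite://")

# Hypothetical miniature of the messages table: metadata columns plus two labels.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "road is blocked", "thank you"],
    "genre": ["direct", "news", "social"],
    "request": [1, 0, 0],
    "aid_related": [1, 1, 0],
})
toy.to_sql("messages", engine, index=False)

# read_sql_table expects the table name (not the database file name) and the engine.
df_toy = pd.read_sql_table("messages", engine)

# Features are the raw text; targets are every column that is not metadata.
X_toy = df_toy["message"].values
Y_toy = df_toy.drop(columns=["id", "message", "genre"]).values

print(X_toy.shape, Y_toy.shape)
```

If the table name is unknown, `sqlalchemy.inspect(engine).get_table_names()` lists the tables available in the database.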

In [47]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

import string
import re

# nlp libraries
import nltk
nltk.download(['punkt', 'stopwords', 'wordnet'])

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# ml libraries
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, recall_score, precision_score
from sklearn.multioutput import MultiOutputClassifier

[nltk_data] Downloading package punkt to /home/rachneet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rachneet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/rachneet/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# !pip install scikit-learn --upgrade
print(sklearn.__version__)

0.23.2


In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql('DisasterResponse.db', engine)
X = df['message'].values
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1).values
# df.head()

In [4]:
df[df.aid_related==2]

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report


### 2. Write a tokenization function to process your text data

In [None]:
from contractions import contractions_dict

def expand_contractions(text, contractions_dict):
    contractions_pattern = re.compile('({})'.format('|'.join(contractions_dict.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)
    
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


def expand_match(contraction):
    # look up the matched contraction, falling back to its lower-case form
    match = contraction.group(0)
    expanded_contraction = contractions_dict.get(match) \
        if contractions_dict.get(match) \
        else contractions_dict.get(match.lower())
    # keep the original text if no expansion is found (re.sub cannot accept None)
    return expanded_contraction if expanded_contraction else match

In [126]:
def tokenize(text):
    '''
    Args:
        text(string): a string containing the message
    Return:
        tokenized_message(list): a list of words containing the processed message

    '''
    tokenized_message = []
    try:
        
        # for unbalanced parenthesis problem
        text = text.replace(')','')
        text = text.replace('(','')
        
        url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        
        # get list of all urls using regex
        detected_urls = re.findall(url_regex, text)
        
        # replace each url in text string with placeholder
        # (plain string replacement, so regex metacharacters in a url are taken literally)
        for url in detected_urls:
            text = text.replace(url, "urlplaceholder")

        # collapse runs of spaces into a single space
        text = re.sub(r" +", " ", text)
        
        # expand contractions
        text = expand_contractions(text, contractions_dict)

        # tokenize text
        tokens = word_tokenize(text)
       
        # initiate lemmatizer
        lemmatizer = WordNetLemmatizer()
        # get stopwords
        stopwords_english = stopwords.words('english')
        stopwords_english.append('u')

        for word in tokens:
            # normalize word
            word = word.lower()
          
            if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
                
                word = lemmatizer.lemmatize(word)  # lemmatizing word
                tokenized_message.append(word)
                
    except Exception as e:
        print(e)
#         print(text)
        
    return tokenized_message

In [127]:
text = "The first time you   see The Second Renaissance it may   look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones https://bachda.com)  who started the war ? Is AI a bad thing ?"
print(tokenize(text))

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'urlplaceholder', 'started', 'war', 'ai', 'bad', 'thing']
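
The URL handling inside `tokenize` can also be written as a single substitution, which avoids re-using each detected URL as a pattern. This is only a sketch of that one step: the helper name `replace_urls` and the sample sentence are made up for illustration, and the pattern is the notebook's own regex written as a raw string.

```python
import re

# The notebook's URL pattern, written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder token."""
    return re.sub(URL_REGEX, placeholder, text)


# Hypothetical sample containing characters that are special in a regex.
sample = "Details at https://example.com/help?id=(42) please share"
cleaned = replace_urls(sample)
print(cleaned)
```

Because the pattern is applied directly to the text, parentheses or question marks inside a URL never need to be escaped or stripped beforehand.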


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
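
To make the multi-target idea concrete, here is a minimal sketch on random toy data (not the disaster messages): `MultiOutputClassifier` fits one copy of the wrapped estimator per target column and returns one prediction column per target. The array sizes and seeds are arbitrary choices for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(40, 5)                       # 40 samples, 5 numeric features
Y_demo = (rng.rand(40, 3) > 0.5).astype(int)   # 3 independent binary targets

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_demo, Y_demo)

pred_demo = clf.predict(X_demo)
print(pred_demo.shape)        # one column per target
print(len(clf.estimators_))   # one fitted forest per target
```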

In [39]:
# multi output classifier
pipeline_multi = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs=10)))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [40]:
from time import time

start = time()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
pipeline_multi.fit(X_train, y_train)
end = time()

print("Training time:{}".format(end-start))

Training time:62.05896782875061


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
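
A small sketch of that per-column loop, assuming toy label arrays and three category names taken from the dataset's column header. Passing `zero_division=0` (available in scikit-learn 0.22 and later) silences the undefined-metric warning that appears when a class is never predicted.

```python
import numpy as np
from sklearn.metrics import classification_report

# Three of the dataset's category names, used here with toy labels.
category_names = ["related", "request", "offer"]
y_true_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0]])

reports = {}
for idx, name in enumerate(category_names):
    # One report per output column, keyed by category name.
    reports[name] = classification_report(
        y_true_demo[:, idx],
        y_pred_demo[:, idx],
        output_dict=True,
        zero_division=0,
    )

for name, rep in reports.items():
    print(name, round(rep["weighted avg"]["f1-score"], 3))
```

Keeping the reports in a dictionary keyed by category name makes it easier to look up a single category later than indexing into a plain list.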

In [41]:
y_pred = pipeline_multi.predict(X_test)
report = []
for idx, col in enumerate(y_pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))

In [42]:
full_report = []
for idx, col in enumerate(y_pred.T):
    full_report.append(classification_report(y_test.T[idx], col))



In [43]:
print(report)
print(np.mean(report))

[0.8030172581621196, 0.8889930969213857, 0.9917124722746894, 0.7786338971674004, 0.9038678741959928, 0.9369072804110671, 0.9609627922494944, 0.9718597584401955, 0.9500228109988827, 1.0, 0.9473010789661409, 0.9387282095617763, 0.9242893395900669, 0.9770436829909154, 0.9691966670769512, 0.9836900874854219, 0.9493531561502692, 0.9538228729096186, 0.8201524540975088, 0.9123017873618181, 0.9412920741602253, 0.9422969143189367, 0.9728452539713291, 0.9922835323617831, 0.9834403633829247, 0.9925690898928685, 0.9814459719245847, 0.9445668168570569, 0.8743640983114663, 0.9408875264511519, 0.9327767450195307, 0.9823842130916282, 0.9726861292412279, 0.9725205543430628, 0.9285204600684861, 0.8370026862463706]
0.9376038612959542


In [44]:
print(full_report[0])

              precision    recall  f1-score   support

           0       0.73      0.40      0.52      1238
           1       0.84      0.95      0.89      4006

    accuracy                           0.82      5244
   macro avg       0.78      0.68      0.70      5244
weighted avg       0.81      0.82      0.80      5244



### 6. Improve your model
Use grid search to find better parameters. 
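
Grid-search keys must match the pipeline's nested parameter names exactly. When the classifier is wrapped in `MultiOutputClassifier`, the inner estimator's parameters sit one level deeper, under `clf__estimator__`. The sketch below uses a stand-in pipeline without the custom tokenizer and lists the valid names with `get_params()`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Stand-in pipeline with the same step names as the notebook's pipeline.
demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

valid_names = demo_pipeline.get_params().keys()

# Parameters of the wrapped forest are reachable through 'clf__estimator__...'.
forest_params = sorted(n for n in valid_names if n.startswith("clf__estimator__"))
print(forest_params[:5])
print("clf__estimator__n_estimators" in valid_names)
print("clf__n_estimators" in valid_names)
```

Checking the key list before launching a long search is cheap, and `GridSearchCV` raises an error on the first fit if a key is not a valid parameter.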

In [15]:
parameters = {
    'vect__ngram_range': ((1,1), (1,2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
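    # NOTE: these two keys address a bare RandomForestClassifier in the 'clf' step (as in the
    # recorded output below); with a MultiOutputClassifier wrapper they would be
    # 'clf__estimator__n_estimators' and 'clf__estimator__min_samples_split'
    'clf__n_estimators': [100, 200, 300],
    'clf__min_samples_split': [2, 3, 4],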
}

cv = GridSearchCV(pipeline_multi, param_grid=parameters, n_jobs=10, verbose=10)

In [16]:
cv.fit(X_train, y_train)



GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x7f1d9fbbcbf8>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf', RandomForestClassifier())]),
             n_jobs=10,
             param_grid={'clf__min_samples_split': [2, 3, 4],
                         'clf__n_estimators': [100, 200, 300],
                         'tfidf__use_idf': (True, False),
                         'vect__max_df': (0.5, 0.75, 1.0),
                         'vect__max_features': (None, 5000, 10000),
                         'vect__ngram_range': ((1, 1), (1, 2))})

In [17]:
import joblib

joblib.dump(cv, "best_params.pkl")

['best_params.pkl']

In [18]:
cv.best_params_

{'clf__min_samples_split': 2,
 'clf__n_estimators': 100,
 'tfidf__use_idf': False,
 'vect__max_df': 0.5,
 'vect__max_features': 5000,
 'vect__ngram_range': (1, 2)}

In [35]:
# train with best params
# multi output classifier
pipeline_multi_best = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, max_df=0.5, max_features=5000, ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, min_samples_split=2, n_jobs=10)))
])

In [36]:
from time import time

start = time()
pipeline_multi_best.fit(X_train, y_train)
end = time()

print("Training time:{}".format(end-start))

Training time:47.22592377662659


In [37]:
y_pred = pipeline_multi_best.predict(X_test)
report = []
for idx, col in enumerate(y_pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))

In [38]:
print(report)
print(np.mean(report))

[0.80826145170927, 0.8839689542463472, 0.9917124722746894, 0.7739841853976814, 0.9043479337286663, 0.947266174588642, 0.965884411896301, 0.9736646879100621, 0.9559498484153882, 1.0, 0.9586550302483187, 0.9479940879403576, 0.9350153975990199, 0.9816585517191361, 0.9665131609339072, 0.9780519218494087, 0.9540567997208698, 0.9515079610756865, 0.8194629236133815, 0.9009293106489187, 0.9420527135590082, 0.9421775783331847, 0.9729654932210757, 0.9891436100131752, 0.9828704447364472, 0.994854209149384, 0.9823006000810917, 0.931509206100485, 0.8786325645865102, 0.9495559046587569, 0.9437514842008289, 0.9853017181560437, 0.971446950540802, 0.9746307332171688, 0.9309944311226099, 0.8332650295154341]
0.9390093871307794


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!
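
One possible way to collect accuracy, precision, and recall per category is sketched below. The helper name `per_category_scores`, the toy arrays, and the choice of weighted averaging are assumptions of this example, not requirements of the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score


def per_category_scores(y_true, y_pred, names):
    """Return {category: {metric: value}} for each output column."""
    rows = {}
    for idx, name in enumerate(names):
        t, p = y_true[:, idx], y_pred[:, idx]
        rows[name] = {
            "accuracy": accuracy_score(t, p),
            "precision": precision_score(t, p, average="weighted", zero_division=0),
            "recall": recall_score(t, p, average="weighted", zero_division=0),
        }
    return rows


# Toy labels for two categories.
y_true_demo = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
y_pred_demo = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])

scores = per_category_scores(y_true_demo, y_pred_demo, ["related", "request"])
for name, row in scores.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```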

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [74]:
# add new transformers for features
from sklearn.base import BaseEstimator, TransformerMixin

class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    
    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            if pos_tags:
                first_word, first_tag = pos_tags[0][0], pos_tags[0][1]
                # tokenize() lower-cases every token, so compare against 'rt'
                if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                    return True
        return False
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        x_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(x_tagged)

In [75]:
pipeline_improved = Pipeline([
    ('features', FeatureUnion([
        
        ('nlp_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
        
        ('starting_verb', StartingVerbExtractor())
    ])),
    
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs=10)))
])

In [76]:
%timeit pipeline_improved.fit(X_train, y_train)

1min 21s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [77]:
%timeit pipeline_improved.predict(X_test)
pred = pipeline_improved.predict(X_test)  # %timeit does not keep assignments made in the timed statement
report = []
for idx, col in enumerate(pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))
print(np.mean(report))

11.8 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
0.9374718642268118


XGBoost for better performance

In [78]:
# try using xgboost
import xgboost as xgb

pipeline_xgb = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(xgb.sklearn.XGBClassifier()))
])

In [79]:
start = time()
# X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
pipeline_xgb.fit(X_train, y_train)
end = time()

In [82]:
print("Training Time: {}".format(end-start))

Training Time: 75.94581055641174


In [80]:
pred = pipeline_xgb.predict(X_test)
report = []
for idx, col in enumerate(pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))
print(np.mean(report))

0.9399919814415109


In [100]:
c_report = []
# for idx, col in enumerate(pred):
#     c_report.append(classification_report(y_test[:idx], pred[:idx], labels=df.columns[4:].tolist()))
cols = df.columns[4:].tolist()

for idx in range(pred.shape[1]):
    c_report.append(classification_report(y_test[:, idx],pred[:, idx], output_dict=True))



In [108]:
f1= []
for i in range(len(c_report)):
    f1.append(c_report[i]['weighted avg']['f1-score'])
print(np.mean(f1))

0.9399919814415109


Optimize XGBoost parameters

In [113]:
parameters = {
#     'vect__ngram_range': ((1,1), (1,2)),
#     'vect__max_df': (0.5, 0.75, 1.0),
#     'vect__max_features': (None, 5000, 10000),
#     'tfidf__use_idf': (True, False),
    'clf__estimator__learning_rate': [0.05, 0.15, 0.25],  # step-size shrinkage applied to each new tree's weights
    'clf__estimator__max_depth': [4, 6, 8, 10],  # maximum depth of each tree
    'clf__estimator__min_child_weight': [1, 3, 5, 7],   # minimum sum of instance weights required in a child node
    'clf__estimator__gamma': [0.0, 0.1, 0.2, 0.3, 0.4],  # minimum loss reduction required to split a leaf node
    'clf__estimator__colsample_bytree': [0.3, 0.4, 0.5, 0.7]  # fraction of columns sampled when each tree is built
}

xgb_cv = GridSearchCV(pipeline_xgb, param_grid=parameters, n_jobs=10, verbose=10)

In [114]:
xgb_cv.fit(X_train, y_train)
joblib.dump(xgb_cv, 'xgb_params.pkl')

Fitting 5 folds for each of 960 candidates, totalling 4800 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed:  1.2min
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:  1.7min
[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed:  2.7min
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:  3.9min
[Parallel(n_jobs=10)]: Done  41 tasks      | elapsed:  6.0min
[Parallel(n_jobs=10)]: Done  52 tasks      | elapsed:  7.3min
[Parallel(n_jobs=10)]: Done  65 tasks      | elapsed: 10.4min
[Parallel(n_jobs=10)]: Done  78 tasks      | elapsed: 12.5min
[Parallel(n_jobs=10)]: Done  93 tasks      | elapsed: 14.0min
[Parallel(n_jobs=10)]: Done 108 tasks      | elapsed: 15.8min
[Parallel(n_jobs=10)]: Done 125 tasks      | elapsed: 18.4min
[Parallel(n_jobs=10)]: Done 142 tasks      | elapsed: 21.5min
[Parallel(n_jobs=10)]: Done 161 tasks      | elapsed: 24.1min
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed: 25.9min
[Parallel(n_jobs=10)]: Done 201 tasks      | elapsed: 2

['xgb_params.pkl']

In [115]:
xgb_cv.best_params_

{'clf__estimator__colsample_bytree': 0.7,
 'clf__estimator__gamma': 0.4,
 'clf__estimator__learning_rate': 0.25,
 'clf__estimator__max_depth': 10,
 'clf__estimator__min_child_weight': 7}

In [116]:
xgb_cv.best_score_

0.29587037046510056
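
The cross-validated `best_score_` values in this notebook (around 0.30) are on a different scale from the mean F1 values (around 0.94). When `GridSearchCV` is given no `scoring` argument it uses the estimator's own `score` method, which for a multi-output classifier is exact-match accuracy: a sample counts as correct only if all of its labels are right. The sketch below contrasts the two on toy labels and builds a scorer that could be passed as `scoring=`. The function name `mean_weighted_f1` is hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer


def mean_weighted_f1(y_true, y_pred):
    """Average of the per-column weighted F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_column = [
        f1_score(y_true[:, i], y_pred[:, i], average="weighted", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(per_column))


# Scorer object usable as GridSearchCV(..., scoring=f1_scorer).
f1_scorer = make_scorer(mean_weighted_f1)

# Toy labels for two categories.
y_true_demo = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
y_pred_demo = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])

# Exact-match accuracy: fraction of rows where every label is correct.
exact_match = float(np.mean(np.all(y_true_demo == y_pred_demo, axis=1)))
mean_f1 = mean_weighted_f1(y_true_demo, y_pred_demo)
print(exact_match, round(mean_f1, 3))
```

Searching with the same metric that is reported afterwards keeps the selected parameters and the final evaluation comparable.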

In [117]:
pipeline_xgb = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(xgb.sklearn.XGBClassifier(colsample_bytree=0.7, gamma=0.4, learning_rate=0.25, max_depth=10, min_child_weight=7)))
])

start = time()
# X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
pipeline_xgb.fit(X_train, y_train)
end = time()

print("Training Time: {}".format(end-start))

pred = pipeline_xgb.predict(X_test)
report = []
for idx, col in enumerate(pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))
print("Mean f1-score: {}".format(np.mean(report)))

Training Time: 124.384770154953
Mean f1-score: 0.9455706541768107


In [118]:
parameters = {
    'vect__ngram_range': ((1,1), (1,2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False)
#     'clf__estimator__learning_rate': [0.05, 0.15, 0.25],  # shrinks feature values for better boosting
#     'clf__estimator__max_depth': [4, 6, 8, 10],
#     'clf__estimator__min_child_weight': [1, 3, 5, 7],   # sum of child weights for further partitioning
#     'clf__estimator__gamma': [0.0, 0.1, 0.2, 0.3, 0.4],  # prevents overfitting, split leaf node if min. gamma loss
#     'clf__estimator__colsample_bytree': [0.3, 0.4, 0.5, 0.7]  # subsample ratio of columns when tree is constructed
}

vect_cv = GridSearchCV(pipeline_xgb, param_grid=parameters, n_jobs=10, verbose=10)

vect_cv.fit(X_train, y_train)
joblib.dump(vect_cv, 'vect_params.pkl')

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed:  3.8min
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:  7.1min
[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed: 10.8min
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed: 14.5min
[Parallel(n_jobs=10)]: Done  41 tasks      | elapsed: 19.2min
[Parallel(n_jobs=10)]: Done  52 tasks      | elapsed: 23.7min
[Parallel(n_jobs=10)]: Done  65 tasks      | elapsed: 29.6min
[Parallel(n_jobs=10)]: Done  78 tasks      | elapsed: 35.3min
[Parallel(n_jobs=10)]: Done  93 tasks      | elapsed: 42.0min
[Parallel(n_jobs=10)]: Done 108 tasks      | elapsed: 48.6min
[Parallel(n_jobs=10)]: Done 125 tasks      | elapsed: 56.6min
[Parallel(n_jobs=10)]: Done 142 tasks      | elapsed: 64.5min
[Parallel(n_jobs=10)]: Done 161 tasks      | elapsed: 72.1min
[Parallel(n_jobs=10)]: Done 180 out of 180 | elapsed: 80.8min remaining:    0.0s

['vect_params.pkl']

In [125]:
vect_cv.best_params_

{'tfidf__use_idf': False,
 'vect__max_df': 0.5,
 'vect__max_features': None,
 'vect__ngram_range': (1, 2)}

In [120]:
vect_cv.best_score_

0.3023075362215049

In [121]:
pipeline_xgb = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, max_df=0.5, max_features=None, ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', MultiOutputClassifier(xgb.sklearn.XGBClassifier(colsample_bytree=0.7, gamma=0.4, learning_rate=0.25, max_depth=10, min_child_weight=7)))
])

start = time()
# X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
pipeline_xgb.fit(X_train, y_train)
end = time()

print("Training Time: {}".format(end-start))

pred = pipeline_xgb.predict(X_test)
report = []
for idx, col in enumerate(pred.T):
    report.append(f1_score(y_test.T[idx], col, average='weighted'))
print("Mean f1-score: {}".format(np.mean(report)))

Training Time: 428.9596199989319
Mean f1-score: 0.9459521628170968


In [123]:
type(pipeline_xgb)

sklearn.pipeline.Pipeline

### 9. Export your model as a pickle file

In [122]:
joblib.dump(pipeline_xgb, 'models/xgboost_model.pkl')

['models/xgboost_model.pkl']
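
A quick round-trip check can confirm that a saved model loads and predicts as before. The sketch below uses a tiny stand-in model and a temporary directory; the file name `demo_model.pkl` is hypothetical. Note that `joblib.dump` does not create missing directories, so a target folder such as `models/` has to exist first.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (the real pipeline would be saved the same way).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "models")
    os.makedirs(model_dir, exist_ok=True)  # create the folder before dumping
    path = os.path.join(model_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give identical predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

A pickled pipeline stores custom functions such as `tokenize` by reference, so the function has to be importable under the same module path wherever the model is loaded.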

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
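
The template itself is not shown here, so the following is only a hypothetical skeleton of how such a script might take its two paths from the command line. The argument names `database_filepath` and `model_filepath` and the step descriptions are assumptions for illustration.

```python
import argparse


def build_parser():
    """Command-line interface with two positional paths (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder plan; each entry would call the matching notebook step.
    return [
        f"load data from {args.database_filepath}",
        "build pipeline",
        "train and evaluate",
        f"save model to {args.model_filepath}",
    ]


# Example invocation with explicit arguments instead of sys.argv.
plan = main(["DisasterResponse.db", "models/xgboost_model.pkl"])
for step in plan:
    print(step)
```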