# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables `X` and `y`
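
As a minimal sketch of the last step, the target columns can be obtained by dropping the non-label columns. The toy frame below is invented and only mimics the layout of the `message_categories` table; the names `toy`, `X_toy`, `y_toy` and `non_label_columns` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the message_categories table (values are invented).
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
    "original": ["Un front froid", "Cyclone nan fini"],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "storm": [0, 1],
})

non_label_columns = ["id", "message", "original", "genre"]
X_toy = toy["message"]                       # feature: raw message text
y_toy = toy.drop(columns=non_label_columns)  # targets: every remaining column

print(list(y_toy.columns))
```

Dropping by name keeps the selection readable when more metadata columns are added later.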

In [1]:
# import libraries
import re
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /Users/Shrushti/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/Shrushti/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Shrushti/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///disaster_messages.db')
df = pd.read_sql_table('message_categories', engine)
X = df['message']
# every column except the identifiers and the text columns is a category label
category_columns = [col for col in df.columns
                    if col not in ('id', 'message', 'original', 'genre')]
y = df[category_columns]

In [6]:
pd.set_option('display.max_columns', 100)

In [7]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [38]:
df.shape

(26216, 40)

In [99]:
basic_amenities = (df['water'] == 1) | (df['food'] == 1) | (df['shelter'] == 1)
natural_disasters = ['weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather']
need_aid = ['aid_related', 'medical_help', 'medical_products', 'other_aid']

In [14]:
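# basic_amenities is a boolean row mask, so the three amenity columns are listed by name here
df[['water', 'food', 'shelter'] + natural_disasters].sum()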

water              1672
food               2923
shelter            2314
weather_related    7297
floods             2155
storm              2443
fire                282
earthquake         2455
cold                530
other_weather      1376
dtype: int64
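
The cells above mix two kinds of selection: a boolean Series picks rows, while a list of names picks columns. A small sketch on invented data (the frame `toy` and its values are assumptions for illustration) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "water": [1, 0, 0],
    "food":  [0, 1, 0],
    "storm": [1, 0, 1],
})

needs_mask = (toy["water"] == 1) | (toy["food"] == 1)  # boolean Series: picks rows
weather_cols = ["storm"]                                # list of names: picks columns

rows = toy[needs_mask]                                  # messages mentioning water or food
col_totals = toy[["water", "food"] + weather_cols].sum()

print(len(rows))
print(col_totals.to_dict())
```

Adding a boolean Series to a list of names does not concatenate them, which is why the column names have to be spelled out when totals per column are wanted.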

In [101]:
top_10_basic_amenities = pd.DataFrame((df[basic_amenities][df.columns[4:]]
                                       .sum()
                                       .sort_values(ascending=False)[:10]),
                                      columns=['counts'])
top_10_basic_amenities

Unnamed: 0,counts
related,5082
aid_related,5082
food,2923
request,2429
direct_report,2398
shelter,2314
water,1672
weather_related,1650
other_aid,898
medical_products,760


In [104]:
messages_aid_related = pd.DataFrame(df[need_aid].sum(), columns=['counts'])
messages_aid_related

Unnamed: 0,counts
aid_related,10860
medical_help,2084
medical_products,1313
other_aid,3446


In [93]:
disasters_counts = pd.DataFrame(df[natural_disasters].sum(), columns=['counts'])
disasters_counts

Unnamed: 0,counts
weather_related,7297
floods,2155
storm,2443
fire,282
earthquake,2455
cold,530
other_weather,1376


In [23]:
X[:5]

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [24]:
y[:5]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [5]:
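# Build the stop-word set once, before tokenize() is used; without this step grid search raises errors.
stop_words = set(stopwords.words('english'))

def tokenize(text):
    """Normalize, tokenize, remove stop words from and lemmatize a message."""
    # replace everything except letters and digits with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    tokens = word_tokenize(text)
    # note: the stop-word list is lowercase, so capitalized stop words (e.g. 'The') pass this filter
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()

    # lemmatize, lowercase and strip each remaining token
    return [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]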

In [6]:
X[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

In [7]:
print(tokenize(X[0]))

['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti']
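
Because `tokenize` filters stop words before lowercasing, the order of these two steps affects which tokens survive. The sketch below uses only the standard library; `simple_tokenize` and the four-word `STOP_WORDS` set are hypothetical stand-ins, not NLTK's tokenizer or stop-word list.

```python
import re

# Hypothetical, tiny stop-word list for illustration only (not NLTK's list).
STOP_WORDS = {"the", "is", "a", "over"}

def simple_tokenize(text, lowercase_first):
    """Regex-only stand-in for tokenize(): strip punctuation, split, drop stop words."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    if lowercase_first:
        text = text.lower()
    tokens = text.split()
    return [tok.lower() for tok in tokens if tok not in STOP_WORDS]

sample = "The storm is over"
late = simple_tokenize(sample, lowercase_first=False)   # filter, then lowercase
early = simple_tokenize(sample, lowercase_first=True)   # lowercase, then filter

print(late)   # ['the', 'storm']
print(early)  # ['storm']
```

Lowercasing first removes capitalized stop words as well. Applying that change to `tokenize` would alter the vocabulary, so the scores reported in this notebook would no longer match exactly.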


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [9]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(), n_jobs=-1))
    ], verbose=True)
    return pipeline
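
To make the role of `MultiOutputClassifier` concrete, here is a small sketch on invented messages with two label columns. It uses the default `CountVectorizer` tokenizer instead of `tokenize` so that it runs without NLTK data; the data, `toy_pipeline` and the forest settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented messages and labels; the label columns are [water, storm].
messages = ["need clean water", "storm destroyed roof", "water and storm damage",
            "no water here", "big storm coming", "thank you all"]
labels = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),   # default tokenizer instead of tokenize()
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

fitted = toy_pipeline.named_steps["clf"].estimators_  # one forest per label column
predictions = toy_pipeline.predict(["water needed"])

print(len(fitted), predictions.shape)
```

One independent classifier is fitted per label column, so training time grows with the number of categories.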

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
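
The split is random, so scores change from run to run unless a seed is fixed. The sketch below shows the effect of `random_state` on invented arrays; the seed value 42 and the variable names are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Two calls with the same seed produce the same partition.
split_a = train_test_split(features, targets, test_size=0.25, random_state=42)
split_b = train_test_split(features, targets, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(len(X_tr), len(X_te), same_split)
```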

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = build_model()
model.fit(X_train, y_train)

[Pipeline] .............. (step 1 of 3) Processing vect, total=   5.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total= 3.1min


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
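
The per-column loop can be sketched on a tiny invented example. The frames `truth` and `guess` are assumptions; `output_dict=True` and `zero_division=0` are existing parameters of `classification_report`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Invented true labels and predictions for two output categories.
truth = pd.DataFrame({"water": [1, 0, 1, 0], "storm": [0, 0, 1, 1]})
guess = pd.DataFrame({"water": [1, 0, 0, 0], "storm": [0, 1, 1, 1]})

reports = {}
for col in truth.columns:
    # output_dict=True returns nested dicts instead of a formatted string;
    # zero_division=0 avoids warnings when a class is never predicted.
    reports[col] = classification_report(truth[col], guess[col],
                                         output_dict=True, zero_division=0)

print(reports["water"]["1"]["recall"])
print(reports["storm"]["accuracy"])
```

Without `target_names`, the report keys are the string form of each label (`"0"`, `"1"`).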

In [11]:
y_pred = model.predict(X_test)

In [12]:
y_pred = pd.DataFrame(y_pred, columns=y_test.columns, index=y_test.index)
y_pred.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
7338,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19032,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9739,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11273,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11346,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
for col in y_test.columns:
    print(col, y_test[col].unique())

related [1 0 2]
request [0 1]
offer [0 1]
aid_related [0 1]
medical_help [0 1]
medical_products [0 1]
search_and_rescue [0 1]
security [0 1]
military [0 1]
child_alone [0]
water [0 1]
food [0 1]
shelter [0 1]
clothing [0 1]
money [0 1]
missing_people [0 1]
refugees [0 1]
death [0 1]
other_aid [0 1]
infrastructure_related [0 1]
transport [0 1]
buildings [0 1]
electricity [0 1]
tools [0 1]
hospitals [0 1]
shops [0 1]
aid_centers [0 1]
other_infrastructure [0 1]
weather_related [0 1]
floods [0 1]
storm [0 1]
fire [0 1]
earthquake [0 1]
cold [0 1]
other_weather [0 1]
direct_report [0 1]
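
Two columns stand out in this output: `related` has a third value, 2, and `child_alone` only ever takes the value 0. This notebook handles both by special-casing `target_names` in the next cell. As a sketch of an alternative that is not used here, such columns can be detected and cleaned before training. The toy data and the decision to map 2 to 1 are assumptions for illustration.

```python
import pandas as pd

toy_y = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "child_alone": [0, 0, 0, 0],
    "water": [0, 1, 0, 1],
})

# Columns with a single unique value carry no signal for a classifier.
constant_cols = [col for col in toy_y.columns if toy_y[col].nunique() == 1]

# Assumption for illustration: treat the rare label 2 as 1. Whether this is
# appropriate depends on what 2 means in the source data.
binary_related = toy_y["related"].replace(2, 1)

print(constant_cols, sorted(binary_related.unique().tolist()))
```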


In [14]:
# collect one classification report per output category
reports = []
for col in y_test.columns:
    if col == 'related':
        target_names = ['False', 'True', 'Neither']   # labels 0, 1 and 2
    elif col == 'child_alone':
        target_names = ['False']                      # only label 0 occurs
    else:
        target_names = ['False', 'True']
    output = classification_report(y_test[col].values, y_pred[col].values,
                                   target_names=target_names, output_dict=True)
    class_scores = pd.DataFrame(output).transpose()
    class_scores['class_name'] = col
    reports.append(class_scores)
# stack the reports and index them by (category, class or average)
scores = pd.concat(reports, axis=0)[['class_name', 'f1-score', 'precision', 'recall', 'support']]
scores = scores.reset_index().rename(columns={'index': 'values'})
scores = scores.set_index(['class_name', 'values'])
scores

Unnamed: 0_level_0,Unnamed: 1_level_0,f1-score,precision,recall,support
class_name,values,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
related,False,0.516451,0.730271,0.399485,1552.000000
related,True,0.887740,0.830869,0.952967,4954.000000
related,Neither,0.450704,0.695652,0.333333,48.000000
related,accuracy,0.817363,0.817363,0.817363,0.817363
related,macro avg,0.618298,0.752264,0.561928,6554.000000
...,...,...,...,...,...
direct_report,False,0.917645,0.862853,0.979867,5265.000000
direct_report,True,0.503219,0.815652,0.363848,1289.000000
direct_report,accuracy,0.858712,0.858712,0.858712,0.858712
direct_report,macro avg,0.710432,0.839253,0.671857,6554.000000


### 6. Improve your model
Use grid search to find better parameters. 
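
As a minimal sketch of how such a search is wired up, the example below runs `GridSearchCV` over a toy pipeline. The messages, labels, grid values and `cv=2` are invented to keep it fast; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented messages and labels; the label columns are [water, storm].
messages = ["need clean water", "storm destroyed roof", "water and storm damage",
            "no water here", "big storm coming", "thank you all",
            "water supply cut", "storm warning issued"]
labels = np.array([[1, 0], [0, 1], [1, 1], [1, 0],
                   [0, 1], [0, 0], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Parameter names follow <step>__<parameter>; the forest sits inside
# MultiOutputClassifier, hence the extra "estimator__" level.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10],
}
search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(messages, labels)

print(search.best_params_)
```

The keys printed by `model.get_params()` are the names that can be used in `param_grid`.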

In [15]:
print(model.get_params())

{'memory': None, 'steps': [('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function tokenize at 0x11b4a9280>, vocabulary=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)), ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_

In [17]:
def tuned_model():
    # pipeline configured with the best parameters found by the grid search
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2),
                                 max_features=5000)),
        ('tfidf', TfidfTransformer(use_idf=True)),
        ('clf', MultiOutputClassifier(
            RandomForestClassifier(n_estimators=200, min_samples_split=3),
            n_jobs=-1)),
    ], verbose=True)

    # grid search that produced these values (kept commented out):
#     parameters = {
#         'vect__ngram_range': ((1, 1), (1, 2)),
#         'vect__max_features': (None, 5000),
#         'tfidf__use_idf': (True, False),
#         'clf__estimator__n_estimators': [100, 200],
#         'clf__estimator__min_samples_split': [3, 4],
#     }

#     cv = GridSearchCV(pipeline, parameters, verbose=2, n_jobs=8)

    return pipeline

In [18]:
new_model = tuned_model()
new_model.fit(X_train, y_train)

[Pipeline] .............. (step 1 of 3) Processing vect, total=   6.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total= 3.5min


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=5000, min_df=1,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

__Grid search results:__
- best params:
    - 'clf__estimator__min_samples_split': 3,
    - 'clf__estimator__n_estimators': 200,
    - 'tfidf__use_idf': True,
    - 'vect__max_features': 5000,
    - 'vect__ngram_range': (1, 2)
- best score:
    - 0.2831851493182216

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [19]:
y_pred = new_model.predict(X_test)

In [20]:
y_pred = pd.DataFrame(y_pred, columns=y_test.columns, index=y_test.index)
y_pred.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
7338,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19032,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
9739,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11273,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11346,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
# collect one classification report per output category
reports = []
for col in y_test.columns:
    if col == 'related':
        target_names = ['False', 'True', 'Neither']   # labels 0, 1 and 2
    elif col == 'child_alone':
        target_names = ['False']                      # only label 0 occurs
    else:
        target_names = ['False', 'True']
    output = classification_report(y_test[col].values, y_pred[col].values,
                                   target_names=target_names, output_dict=True)
    class_scores = pd.DataFrame(output).transpose()
    class_scores['class_name'] = col
    reports.append(class_scores)
# stack the reports and index them by (category, class or average)
scores = pd.concat(reports, axis=0)[['class_name', 'f1-score', 'precision', 'recall', 'support']]
scores = scores.reset_index().rename(columns={'index': 'values'})
scores = scores.set_index(['class_name', 'values'])
scores

Unnamed: 0_level_0,Unnamed: 1_level_0,f1-score,precision,recall,support
class_name,values,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
related,False,0.561213,0.707150,0.465206,1552.000000
related,True,0.888995,0.846477,0.936011,4954.000000
related,Neither,0.407767,0.381818,0.437500,48.000000
related,accuracy,0.820873,0.820873,0.820873,0.820873
related,macro avg,0.619325,0.645148,0.612906,6554.000000
...,...,...,...,...,...
direct_report,False,0.918091,0.868270,0.973979,5265.000000
direct_report,True,0.527620,0.788580,0.396431,1289.000000
direct_report,accuracy,0.860391,0.860391,0.860391,0.860391
direct_report,macro avg,0.722856,0.828425,0.685205,6554.000000


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
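
As a sketch of the second idea, `FeatureUnion` can place a hand-written feature next to the TF-IDF matrix. The `MessageLength` transformer below is a hypothetical example feature, not something used in this notebook, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X])

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),   # CountVectorizer followed by TfidfTransformer
    ("length", MessageLength()),
])

messages = ["need water", "storm coming tonight"]
matrix = features.fit_transform(messages)

print(matrix.shape)  # 5 vocabulary columns plus 1 length column
```

The union can replace the `vect` and `tfidf` steps of the pipeline, with the classifier step left unchanged.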

In [40]:
# Tried different preprocessing steps such as stop-word removal and WordNetLemmatizer
scores.xs('accuracy', level=1)

Unnamed: 0_level_0,f1-score,precision,recall,support
class_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
related,0.820873,0.820873,0.820873,0.820873
request,0.89762,0.89762,0.89762,0.89762
offer,0.995575,0.995575,0.995575,0.995575
aid_related,0.779066,0.779066,0.779066,0.779066
medical_help,0.925542,0.925542,0.925542,0.925542
medical_products,0.958041,0.958041,0.958041,0.958041
search_and_rescue,0.973756,0.973756,0.973756,0.973756
security,0.981691,0.981691,0.981691,0.981691
military,0.9704,0.9704,0.9704,0.9704
child_alone,1.0,1.0,1.0,1.0


### 9. Export your model as a pickle file

In [22]:
import pickle
# now you can save it to a file
with open('new_model.pkl', 'wb') as f:
    pickle.dump(new_model, f)
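
One thing to keep in mind when loading the file later: pickle stores functions by module path and name, not by their code. Since the pipeline refers to `tokenize`, the script that loads `new_model.pkl` has to provide a `tokenize` function under the same module path it had when the model was saved. The sketch below illustrates the by-reference behaviour with `re.sub` and round-trips a small fitted estimator through a temporary file; the file name `toy_model.pkl` and the toy vectorizer are assumptions.

```python
import os
import pickle
import re
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Functions are pickled as a module path plus a name, not as code.
payload = pickle.dumps(re.sub)
restored_function = pickle.loads(payload)

# Round trip of a small fitted estimator through a temporary file.
vectorizer = CountVectorizer().fit(["need water", "storm coming"])
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored_function is re.sub)
print(restored.vocabulary_ == vectorizer.vocabulary_)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.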