# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
import pandas as pd

import re
import sys
from sqlalchemy import create_engine
import pickle

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [2]:
# Load the disaster data from the SQLite database
engine = create_engine('sqlite:///DisasterData.db')
df = pd.read_sql_table('DisasterData', engine)
# X holds the input messages; y holds the category labels the model should predict
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis = 1).values
category_names = df.columns[-36:]
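
The feature/target split above can also be illustrated without pandas. The following is a minimal sketch using only the standard-library `sqlite3` module and a hypothetical two-category miniature of the table (the rows and the `X_demo`, `y_demo`, and `label_names` names are illustrative assumptions, not the real data). It derives the label columns by excluding the non-label columns rather than by counting from the end of the column list.

```python
import sqlite3

# Hypothetical miniature of the DisasterData table (only two category columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DisasterData ("
    "id INTEGER, message TEXT, original TEXT, genre TEXT, "
    "related INTEGER, request INTEGER)"
)
conn.executemany(
    "INSERT INTO DisasterData VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2, "Weather update", "Un front froid", "direct", 1, 0),
        (7, "Is the Hurricane over", "Cyclone nan fini", "direct", 1, 1),
    ],
)

cursor = conn.execute("SELECT * FROM DisasterData")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

# Columns that carry metadata or input text rather than labels.
non_label_columns = ["id", "message", "original", "genre"]
label_names = [name for name in columns if name not in non_label_columns]

# X is the message text; y is one list of label values per row.
message_index = columns.index("message")
X_demo = [row[message_index] for row in rows]
y_demo = [[row[columns.index(name)] for name in label_names] for row in rows]

print(label_names)
print(X_demo)
print(y_demo)
```

Deriving the label names from the same exclusion list that builds `y` keeps the two in step if the table ever gains or loses a column.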

In [3]:
df.head(2)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """Tokenize text, remove stop words, and lemmatize the remaining words.
    Input: text to tokenize
    Output: list of cleaned tokens"""
    # Normalize case and replace punctuation with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Split the text into word tokens
    words = word_tokenize(text)
    # Remove stop words (build the set once instead of once per word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    # Lemmatize each word with a single shared lemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in words]
    return tokens

In [5]:
# Test the tokenize function
print("Before tokenize :", df.message[10])
print("Using tokenizing:", tokenize(df.message[10]))

Before tokenize : There's nothing to eat and water, we starving and thirsty.
Using tokenizing: ['nothing', 'eat', 'water', 'starving', 'thirsty']
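
To make the normalization and stop-word steps concrete, here is a dependency-free sketch. It assumes a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, hypothetical) in place of NLTK's English list, splits on whitespace instead of calling `word_tokenize`, and skips lemmatization, so it only approximates the function above.

```python
import re

# Assumption: a tiny stand-in stop-word set; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"there", "s", "to", "and", "we"}


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split, and drop stop words."""
    normalized = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [word for word in normalized.split() if word not in DEMO_STOP_WORDS]


sample = "There's nothing to eat and water, we starving and thirsty."
print(simple_tokenize(sample))
# ['nothing', 'eat', 'water', 'starving', 'thirsty']
```

Note that replacing the apostrophe with a space turns "There's" into "there" and "s", which is why both fragments have to be filtered out as stop words.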


### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
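
As the scikit-learn documentation describes it, `MultiOutputClassifier` fits one estimator per target column. The sketch below shows that idea in plain Python. `MajorityClassifier` and `PerColumnClassifier` are hypothetical stand-ins written for illustration, and the toy labels are invented; this is not scikit-learn's implementation.

```python
from collections import Counter


class MajorityClassifier:
    """Toy single-label estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class PerColumnClassifier:
    """Fits one independent copy of a base estimator per target column."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_targets = len(Y[0])
        self.estimators_ = [
            self.make_estimator().fit(X, [row[j] for row in Y])
            for j in range(n_targets)
        ]
        return self

    def predict(self, X):
        per_column = [estimator.predict(X) for estimator in self.estimators_]
        # Transpose so each row holds one prediction per target column.
        return [list(row) for row in zip(*per_column)]


X_demo = ["need water", "storm coming", "send food", "flood damage"]
Y_demo = [[1, 0], [0, 1], [1, 0], [1, 0]]  # toy labels for two categories

clf = PerColumnClassifier(MajorityClassifier).fit(X_demo, Y_demo)
print(clf.predict(["any message"]))
# [[1, 0]]
```

Because each column gets its own estimator, the categories are learned independently; correlations between labels are not modeled.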

In [6]:
def build_model():
    pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])
    return pipeline

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
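
What a seeded 80/20 split does can be sketched with the standard library. The `seeded_split` helper below is hypothetical and does not reproduce scikit-learn's shuffling, so it will not select the same rows as `train_test_split`; it only shows that a fixed seed makes the split repeatable and that inputs and labels stay aligned.

```python
import random


def seeded_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out the first test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


messages = [f"message {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = seeded_split(messages, labels)
print(len(X_tr), len(X_te))
# 8 2
```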

In [8]:
model = build_model()

In [9]:
model.fit(X_train,Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [6]:
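def evaluate_model(model, X_test, Y_test, category_names):
    """Print precision, recall, and F1 score for every output category."""
    Y_pred = model.predict(X_test)
    # For multi-label indicator arrays, classification_report prints one row per category
    print(classification_report(Y_test, Y_pred, target_names=category_names, digits=3))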
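
To show what the per-category numbers mean, here is a plain-Python sketch that iterates through the label columns and computes precision, recall, and F1 for the positive class. The `column_scores` helper and the toy arrays are assumptions made for illustration; in the notebook itself `classification_report` does this work.

```python
def column_scores(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class of one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # An undefined ratio (zero denominator) is reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


names = ["related", "request"]
Y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]  # toy ground truth
Y_hat = [[1, 0], [1, 0], [1, 1], [1, 0]]   # toy predictions

report = {}
for j, name in enumerate(names):
    true_column = [row[j] for row in Y_true]
    pred_column = [row[j] for row in Y_hat]
    report[name] = column_scores(true_column, pred_column)
    print(f"{name:>10}  {report[name][0]:.3f}  {report[name][1]:.3f}  {report[name][2]:.3f}")
```

A category with no positive predictions, or no positive examples in the test set, has an undefined precision or recall; reporting it as zero is consistent with the 0.000 rows for rare categories such as `offer` and `child_alone` in this notebook's reports.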

In [11]:
evaluate_model(model, X_test, Y_test, category_names)

                        precision    recall  f1-score   support

               related      0.850     0.916     0.882      3998
               request      0.818     0.428     0.562       891
                 offer      0.000     0.000     0.000        24
           aid_related      0.748     0.586     0.657      2164
          medical_help      0.625     0.103     0.178       435
      medical_products      0.692     0.097     0.170       279
     search_and_rescue      0.679     0.140     0.232       136
              security      0.400     0.021     0.040        96
              military      0.560     0.089     0.153       158
           child_alone      0.000     0.000     0.000         0
                 water      0.826     0.397     0.536       335
                  food      0.853     0.358     0.504       584
               shelter      0.779     0.385     0.515       468
              clothing      0.909     0.143     0.247        70
                 money      0.692     0



### 6. Improve your model
Use grid search to find better parameters. 

In [12]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
        ])
    
    parameters = {
        'vect__min_df': [1],
        'tfidf__use_idf': [True, False],
        'clf__estimator__n_estimators': [10],
        'clf__estimator__min_samples_split': [2]}
   
    model = GridSearchCV(pipeline, param_grid = parameters, verbose = 2)
    return model
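
The cost of a grid search grows with the product of the value lists. The sketch below expands a parameter grid with `itertools.product` to count candidates and fits. The `expand_grid` helper is hypothetical, and the fold count of 3 is taken from this notebook's fit log (newer scikit-learn releases default to 5 folds).

```python
from itertools import product


def expand_grid(parameters):
    """Return every combination of the listed parameter values as a dict."""
    names = sorted(parameters)
    value_lists = (parameters[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


small_grid = {
    "vect__min_df": [1],
    "tfidf__use_idf": [True, False],
    "clf__estimator__n_estimators": [10],
    "clf__estimator__min_samples_split": [2],
}
larger_grid = {
    "vect__min_df": [1, 5],
    "tfidf__use_idf": [True, False],
    "clf__estimator__n_estimators": [10, 25],
    "clf__estimator__min_samples_split": [2, 5],
}

n_folds = 3  # assumption: matches the "3 folds" reported in this notebook's log
for grid in (small_grid, larger_grid):
    candidates = expand_grid(grid)
    print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# 2 candidates, 6 fits
# 16 candidates, 48 fits
```

Doubling the number of values for every parameter multiplies the work by eight here, which is why the larger grid takes so much longer to run.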

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = build_model()
model.fit(X_train,Y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, total= 3.0min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, total= 3.0min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, total= 3.0min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1, total= 3.1min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1, total= 3.1min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split

[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 27.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__min_df': [1], 'tfidf__use_idf': [True, False], 'clf__estimator__n_estimators': [10], 'clf__estimator__min_samples_split': [2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [14]:
evaluate_model(model, X_test, Y_test, category_names)

                        precision    recall  f1-score   support

               related      0.844     0.919     0.880      3998
               request      0.798     0.440     0.567       891
                 offer      0.000     0.000     0.000        24
           aid_related      0.750     0.592     0.662      2164
          medical_help      0.548     0.053     0.096       435
      medical_products      0.795     0.111     0.195       279
     search_and_rescue      0.613     0.140     0.228       136
              security      0.000     0.000     0.000        96
              military      0.571     0.051     0.093       158
           child_alone      0.000     0.000     0.000         0
                 water      0.811     0.269     0.404       335
                  food      0.865     0.505     0.638       584
               shelter      0.783     0.293     0.426       468
              clothing      0.727     0.114     0.198        70
                 money      0.600     0



In [9]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
        ])
    
    parameters = {
        'vect__min_df':[1,5],
        'tfidf__use_idf':[True, False],
        'clf__estimator__n_estimators': [10,25],
        'clf__estimator__min_samples_split':[2,5]}

    model = GridSearchCV(pipeline, param_grid = parameters, verbose = 2)
    return model

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = build_model()
model.fit(X_train,Y_train)
evaluate_model(model, X_test, Y_test, category_names)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [10]:
def build_model():
    pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier()))
                    ])
    parameters = {
                    'vect__min_df':[1],
                    'clf__estimator__learning_rate': [0.1],
                    'tfidf__smooth_idf': [True, False]
                    }
    model = GridSearchCV(pipeline, param_grid=parameters)
    return model

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = build_model()
model.fit(X_train,Y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__min_df': [1], 'clf__estimator__learning_rate': [0.1], 'tfidf__smooth_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [27]:
evaluate_model(model, X_test, Y_test, category_names)

                        precision    recall  f1-score   support

               related      0.763     1.000     0.865      3998
               request      0.857     0.263     0.402       891
                 offer      0.000     0.000     0.000        24
           aid_related      0.814     0.357     0.496      2164
          medical_help      0.522     0.028     0.052       435
      medical_products      0.786     0.039     0.075       279
     search_and_rescue      0.875     0.051     0.097       136
              security      0.000     0.000     0.000        96
              military      0.000     0.000     0.000       158
           child_alone      0.000     0.000     0.000         0
                 water      0.845     0.537     0.657       335
                  food      0.827     0.712     0.765       584
               shelter      0.853     0.397     0.542       468
              clothing      0.824     0.200     0.322        70
                 money      0.750     0



In [6]:
def build_model():
    pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier()))
                    ])

    parameters = {
                    'vect__min_df':[1,10],
                    'clf__estimator__learning_rate': [0.01, 0.1],
                    'tfidf__smooth_idf': [True, False]
                    }
    model = GridSearchCV(pipeline, param_grid=parameters, cv=2)
    return model

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = build_model()
model.fit(X_train,Y_train)

GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__min_df': [1, 10], 'clf__estimator__learning_rate': [0.01, 0.1], 'tfidf__smooth_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [None]:
evaluate_model(model, X_test, Y_test, category_names)

### 9. Export your model as a pickle file

In [28]:
# Serialize the fitted model; the with-block closes the file automatically
with open('classifier.pkl', 'wb') as pkl_file:
    pickle.dump(model, pkl_file)

In [29]:
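# Load with pickle to match the pickle.dump call used for the export
with open('classifier.pkl', 'rb') as pkl_file:
    model = pickle.load(pkl_file)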

In [30]:
model

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__min_df': [1], 'clf__estimator__learning_rate': [0.1], 'tfidf__smooth_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
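
The export and reload steps can be checked in isolation with a small round trip. This sketch pickles a stand-in dictionary (`model_stub`, hypothetical) to a temporary directory instead of a fitted model, and confirms that the `with` block has closed the file by the time it exits.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object works the same way.
model_stub = {"name": "demo-model", "params": {"learning_rate": 0.1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")

    with open(path, "wb") as pkl_file:
        pickle.dump(model_stub, pkl_file)
    # The with-block has already closed the file; no explicit close() is needed.
    file_closed = pkl_file.closed

    with open(path, "rb") as pkl_file:
        restored = pickle.load(pkl_file)

print(file_closed, restored == model_stub)
# True True
```

One caveat for the real model: pickle stores functions by reference, so a pipeline built with `CountVectorizer(tokenizer=tokenize)` can only be unpickled in a process where the same `tokenize` function is importable under the same module path.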

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

Note: Models with several parameters take a very long time to run. RandomForestClassifier() with the largest parameter setup was chosen to complete the train_classifier.py script.