# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [24]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import pickle
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [25]:
# configure libraries
# NLTK data (stop words, tokenizer models, WordNet) must be downloaded beforehand with nltk.download()
stop_words = stopwords.words("english")
stop_words.extend(['us', '000', 'http'])

In [26]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Message', con=engine)
X = df['message']
Y = df.drop(['message', 'genre', 'id', 'original'], axis=1)
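The performance reports further down list `index` as a category, which suggests the table was saved with an extra index column that ends up in `Y`. The sketch below shows one way to keep such columns out of the targets. The toy DataFrame and the `non_label_cols` list are assumptions made for illustration, not the real table contents.

```python
import pandas as pd

# Toy stand-in for the 'Message' table (made-up rows, only two label columns).
df = pd.DataFrame({
    "index": [0, 1, 2],
    "id": [2, 7, 8],
    "message": ["need water", "storm coming", "we are fine"],
    "original": [None, None, None],
    "genre": ["direct", "news", "direct"],
    "related": [1, 1, 0],
    "request": [1, 0, 0],
})

# Columns that describe a message rather than label it (assumed list).
non_label_cols = ["index", "id", "message", "original", "genre"]

X = df["message"]
# errors="ignore" skips names that are absent, so the same line works
# whether or not the table contains an index column.
Y = df.drop(columns=non_label_cols, errors="ignore")

print(list(Y.columns))
```

Checking `Y.columns` before training is a cheap way to confirm that only category labels remain.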

### 2. Write a tokenization function to process your text data

In [6]:
def tokenize(text):
    """
    Tokenize a text input.

    Args:
        text: Source text to be tokenized

    Returns:
        tokenized_text: List of lowercased, lemmatized tokens with
            punctuation and stop words removed
    """
    # convert text to lower case
    text = text.lower()
    
    # remove punctuation with a regex
    text = re.sub(r'[^a-zA-Z0-9]',' ',text)
    
    # tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # remove stop words
    filtered_tokens = [w for w in tokens if w not in stop_words]
    
    # lemmatize words
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    tokenized_text = [lemmatizer.lemmatize(w) for w in filtered_tokens]
    return tokenized_text
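To see what the normalization steps do without downloading any NLTK data, here is a simplified, standard-library-only sketch. It is not the `tokenize` function above: whitespace splitting stands in for `nltk.word_tokenize`, the stop-word set is a tiny hand-picked assumption, and lemmatization is left out. The sample message is hypothetical.

```python
import re

# Tiny illustrative stop-word set (assumption; the notebook uses NLTK's English list).
demo_stop_words = {"at", "us", "000", "http"}

def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return [w for w in text.split() if w not in demo_stop_words]

# Hypothetical sample message, not taken from the dataset.
sample = "Water needed at 5th Street, URGENT!!"
tokens = normalize(sample)
print(tokens)
```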

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
# construct pipeline with tf-idf vectorizer and multi-output classifier
pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=tokenize)),
    ('mo_clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=2)))
])
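As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a small made-up numeric data set (not the message data) and shows that it trains one copy of the base estimator per target column. The names `X_toy` and `Y_toy` and the forest settings are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 6 samples, 2 features, 3 binary target columns (all made up).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0]])
Y_toy = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=5, random_state=0))
clf.fit(X_toy, Y_toy)

# One fitted forest per target column.
n_fitted = len(clf.estimators_)
# Predictions have one column per target.
pred_shape = clf.predict(X_toy).shape
print(n_fitted, pred_shape)
```

Because each category gets its own independent model, training time grows roughly in proportion to the number of target columns.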

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
# split out training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=40)

# train the classifier through the pipeline
# (.as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement)
pipeline.fit(X_train.to_numpy(), y_train.to_numpy())

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
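A minimal sketch of the column-by-column approach described above, using made-up labels and predictions for two categories. The function defined in the next cell takes a different route and aggregates the scores with `precision_recall_fscore_support`; this example only illustrates the `classification_report` variant.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels and predictions for two categories.
y_true = pd.DataFrame({"related": [1, 0, 1, 1], "request": [0, 0, 1, 0]})
y_hat = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

reports = {}
for i, column in enumerate(y_true.columns):
    # Column i of the prediction array corresponds to the i-th target column.
    reports[column] = classification_report(y_true[column], y_hat[:, i])
    print(column)
    print(reports[column])
```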

In [9]:
def performance_report(pipeline, X_test, y_test):
    '''
    Computes weighted precision, recall, and F1 score for each output
    category of the data set and prints their means across categories

    Args:
        pipeline: the fitted model pipeline to evaluate
        X_test: the X values for testing
        y_test: the y values to test on (DataFrame with one column per category)

    Returns:
        output_df: performance report results for the model
    '''
    # get predicted values based on using the pipeline on X_test
    y_pred = pipeline.predict(X_test)
    
    # build a dataframe to store the outputs of our test
    output_df = pd.DataFrame(columns=['Category', 'Precision', 'Recall', 'F1_Score'])
    
    # loop through categories to retrieve performance scores and append to output
    for i, item in enumerate(y_test.columns):
        precision, recall, f1_score, _ = precision_recall_fscore_support(
            y_test[item], y_pred[:, i], average='weighted')
        output_df.at[i + 1, 'Category'] = item
        output_df.at[i + 1, 'Precision'] = precision
        output_df.at[i + 1, 'Recall'] = recall
        output_df.at[i + 1, 'F1_Score'] = f1_score
        
    # print aggregated outputs
    print('Mean precision:', output_df['Precision'].mean())
    print('Mean recall:', output_df['Recall'].mean())
    print('Mean f1_score:', output_df['F1_Score'].mean())

    # return results
    return output_df
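The report relies on `average='weighted'`. As a small sanity check of what that setting computes, the sketch below compares it with a manual support-weighted mean of the per-class precision values. The labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up binary labels with an imbalanced class distribution.
y_true = [0, 0, 0, 1, 1, 0]
y_hat = [0, 0, 1, 1, 0, 0]

# Per-class scores plus the number of true samples in each class (support).
per_class_p, per_class_r, per_class_f, support = precision_recall_fscore_support(
    y_true, y_hat, average=None)

# The weighted variant as used in performance_report.
weighted_p, weighted_r, weighted_f, _ = precision_recall_fscore_support(
    y_true, y_hat, average='weighted')

# Manual support-weighted mean of the per-class precision.
manual_p = np.sum(per_class_p * support) / np.sum(support)
print(manual_p, weighted_p)
```

For heavily imbalanced categories the majority class dominates this average, so a high weighted score does not by itself show that the rare class is predicted well.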

In [10]:
# run report
performance_report(pipeline, X_test, y_test)

Mean precision: 0.8958085979414805
Mean recall: 0.9080977990068899
Mean f1_score: 0.8949804650538401




| | Category | Precision | Recall | F1_Score |
|---|---|---|---|---|
| 1 | index | 0.0 | 0.0 | 0.0 |
| 2 | related | 0.751968 | 0.697012 | 0.714771 |
| 3 | request | 0.842431 | 0.857343 | 0.831688 |
| 4 | offer | 0.991626 | 0.995804 | 0.993711 |
| 5 | aid_related | 0.694409 | 0.686205 | 0.663446 |
| 6 | medical_help | 0.89484 | 0.919135 | 0.895726 |
| 7 | medical_products | 0.914573 | 0.944565 | 0.922224 |
| 8 | search_and_rescue | 0.957626 | 0.972791 | 0.961401 |
| 9 | security | 0.965942 | 0.981437 | 0.973628 |
| 10 | military | 0.952501 | 0.966942 | 0.956029 |


### 6. Improve your model
Use grid search to find better parameters. 

In [14]:
# print parameters to see available options for tuning
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x10d7c67b8>, use_idf=True,
           vocabulary=None)),
  ('mo_clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=2, n_jobs=None,
               oob_score=False, random_state=None, verbose

In [15]:
# set some parameter options to tune
parameters = {
    'mo_clf__estimator__max_depth': [None, 3],
    'mo_clf__estimator__min_samples_split': [2, 4],
    'vect__max_df': (0.7, 1.0),
}

cv = GridSearchCV(pipeline, parameters)
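To get a feel for how much work the grid search will do, the candidate combinations can be enumerated with `ParameterGrid`. This is a sketch that reuses the same parameter dictionary as above; the number of fits is this count multiplied by the number of cross-validation folds, plus one final refit when `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'mo_clf__estimator__max_depth': [None, 3],
    'mo_clf__estimator__min_samples_split': [2, 4],
    'vect__max_df': (0.7, 1.0),
}

# Each key follows the <step name>__<parameter> convention of the pipeline;
# 'mo_clf__estimator__...' reaches into the forest wrapped by MultiOutputClassifier.
grid = ParameterGrid(parameters)
n_candidates = len(grid)
print(n_candidates)
```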

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [16]:
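# run the grid search on the training data; with refit=True the best combination is refit at the end
# (.as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement)
cv.fit(X_train.to_numpy(), y_train.to_numpy())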

  


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'mo_clf__estimator__max_depth': [None, 3], 'mo_clf__estimator__min_samples_split': [2, 4], 'vect__max_df': (0.7, 1.0)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [17]:
# check model performance (performance_report calls predict internally)
performance_report(cv, X_test, y_test)



Mean precision: 0.89659529706891
Mean recall: 0.9079912716276353
Mean f1_score: 0.8943757066695766


| | Category | Precision | Recall | F1_Score |
|---|---|---|---|---|
| 1 | index | 0.0 | 0.0 | 0.0 |
| 2 | related | 0.753645 | 0.706294 | 0.720532 |
| 3 | request | 0.846703 | 0.86014 | 0.835542 |
| 4 | offer | 0.991625 | 0.995677 | 0.993647 |
| 5 | aid_related | 0.681466 | 0.675906 | 0.652263 |
| 6 | medical_help | 0.881776 | 0.914812 | 0.888281 |
| 7 | medical_products | 0.930929 | 0.947616 | 0.929199 |
| 8 | search_and_rescue | 0.95857 | 0.972409 | 0.962237 |
| 9 | security | 0.965955 | 0.9822 | 0.974009 |
| 10 | military | 0.959966 | 0.969231 | 0.957639 |


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
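One possible way to add a feature alongside TF-IDF is a `FeatureUnion` that concatenates the TF-IDF matrix with a custom transformer's output. The sketch below uses message length in characters as the extra feature. The `MessageLength` class and the sample messages are hypothetical and chosen only for illustration; the notebook itself does not use them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each message's character count."""

    def fit(self, X, y=None):
        # Nothing to learn; present for scikit-learn API compatibility.
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])

features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', MessageLength()),
])

# Made-up messages, not taken from the dataset.
docs = ["need water", "storm coming tonight", "we are fine"]
matrix = features.fit_transform(docs)

vocab_size = len(features.transformer_list[0][1].vocabulary_)
print(matrix.shape, vocab_size)
```

In a pipeline, `features` would take the place of the `'vect'` step. The raw length is on a different scale than the TF-IDF weights, which matters little for tree-based models but may call for scaling with other classifiers.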

In [18]:
# try a regular decision tree classifier as an alternative
moc_alternate = MultiOutputClassifier(DecisionTreeClassifier())

pipeline_alternate = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=tokenize)),
    ('clf', moc_alternate)
    ])

In [19]:
# split out training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=40)

# train
# (.as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement)
pipeline_alternate.fit(X_train.to_numpy(), y_train.to_numpy())

# test performance
performance_report(pipeline_alternate, X_test, y_test)



Mean precision: 0.9058512753484796
Mean recall: 0.9075170529715982
Mean f1_score: 0.9066189473099828


| | Category | Precision | Recall | F1_Score |
|---|---|---|---|---|
| 1 | index | 0.0 | 0.0 | 0.0 |
| 2 | related | 0.768252 | 0.759568 | 0.763411 |
| 3 | request | 0.849841 | 0.85213 | 0.85093 |
| 4 | offer | 0.992177 | 0.99288 | 0.992525 |
| 5 | aid_related | 0.718208 | 0.71939 | 0.718654 |
| 6 | medical_help | 0.895317 | 0.896122 | 0.895717 |
| 7 | medical_products | 0.939758 | 0.944565 | 0.941894 |
| 8 | search_and_rescue | 0.961697 | 0.962492 | 0.96209 |
| 9 | security | 0.968271 | 0.971138 | 0.969689 |
| 10 | military | 0.960653 | 0.961475 | 0.961057 |


### 9. Export your model as a pickle file

In [20]:
# save the fitted grid search object (the tuned model) as a compressed pickle file
# pickle.dump(pipeline, open('model.pkl', 'wb'))
joblib.dump(cv, 'model.pkl', compress=3)

['model.pkl']
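A saved model can be restored with `joblib.load`. The round trip below is a minimal sketch that uses a small stand-in classifier and a temporary directory rather than the real `cv` object and `model.pkl`, so it runs without the message data.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model (assumption); the notebook saves the fitted grid search instead.
model = DecisionTreeClassifier(random_state=0).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(model, path, compress=3)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = list(restored.predict([[0], [3]])) == list(model.predict([[0], [3]]))
print(same)
```

Note that functions are pickled by reference, so a pipeline that uses the custom `tokenize` function can only be loaded where `tokenize` is importable under the same module path.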

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.