# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
import sys
# import libraries
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

import re
import pickle
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sqlalchemy import create_engine

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
def load_data(database_filepath):
    # Read the whole `df` table from the SQLite database
    engine = create_engine('sqlite:///' + database_filepath)
    df = pd.read_sql("SELECT * FROM df", engine)
    # Messages are the features; the category columns start at position 4
    X = df.message.values
    y = df.iloc[:, 4:].astype(int)
    labels = y.columns
    return X, y.values, labels
#database_filepath='data/DisasterResponse.db'
#X, Y, category_names = load_data(database_filepath)

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    # Raw string, so the backslash escapes reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
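
The URL handling in `tokenize` can be checked on its own, without the NLTK downloads. The sketch below reuses the same pattern but swaps the `findall`/`replace` loop for a single `re.sub` call; the helper name `replace_urls` and the sample message are made up for illustration.

```python
import re

# Same pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder."""
    return re.sub(URL_REGEX, placeholder, text)


sample = "Need water, details at http://example.com/help?id=42 please"  # made-up message
cleaned = replace_urls(sample)
print(cleaned)
```

The character class `[$-_@.&+]` contains the range `$` to `_`, which also covers `,` and `.`, so punctuation that directly follows a URL is absorbed into the match.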


### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
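
As a rough illustration of what the wrapper does, the sketch below fits it on made-up toy data with two binary targets. The toy arrays and the choice of `DecisionTreeClassifier` are assumptions for the example and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: four samples, two features, two binary target columns.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0)).fit(X, Y)

print(len(clf.estimators_))   # one fitted estimator per target column
print(clf.predict(X).shape)   # predictions have one column per target
```

The wrapper fits one independent copy of the base estimator per target column, which is why grid-search parameter names for the inner estimator take the form `clf__estimator__<name>`.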

In [9]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        #('clf', RandomForestClassifier())
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    return pipeline

In [59]:
   
def evaluate_model(model, X_test, Y_test):
    # Score every category column separately, then average over the categories
    from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
    Y_pred = model.predict(X_test)
    f1 = []
    p = []
    r = []
    a = []
    # Use the number of label columns instead of a hard-coded 36
    for c in range(Y_test.shape[1]):
        f1.append(f1_score(Y_test[:, c], Y_pred[:, c], average='macro'))
        p.append(precision_score(Y_test[:, c], Y_pred[:, c], average='macro'))
        r.append(recall_score(Y_test[:, c], Y_pred[:, c], average='macro'))
        a.append(accuracy_score(Y_test[:, c], Y_pred[:, c]))

    dfCategoriesAccuracy = pd.DataFrame({"Precision": p, "Recall": r, "Accuracy": a},
                                        columns=["Precision", "Recall", "Accuracy"])
    return (np.mean(a), np.mean(p), np.mean(r), np.mean(f1), dfCategoriesAccuracy)


def save_model(model, model_filepath):
    # The with-block closes the file once the model has been written
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
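
A minimal sketch of the split step on toy stand-in data. The arrays below are made up, and `random_state=42` is an arbitrary value added here so the split is reproducible; the notebook's own call does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the message array X and the label matrix Y.
X = np.array(["msg {}".format(i) for i in range(10)])
Y = np.arange(20).reshape(10, 2)   # row k is [2k, 2k + 1]

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```

Messages and label rows are shuffled together, so each row of `Y_test` still belongs to the message at the same position in `X_test`.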

In [79]:
database_filepath='DisasterResponse.db'
X, Y, category_names = load_data(database_filepath)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print('Building model...')
model = build_model()
print('Model built')
print('Training model...')
model.fit(X_train, Y_train)
print('Model Trained')

Building model...
Model built
Training model...
Model Trained


### 5. Test your model
Report the accuracy, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 
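
The per-category scores in this notebook are computed with `average='macro'`, which averages the score of each class with equal weight. The sketch below reimplements that idea in plain Python for a single label column, to show how accuracy can stay high while macro recall drops on an imbalanced category. The helper `macro_scores` and the label lists are made up for illustration.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for one label column."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        # A class that is never predicted (or never present) scores 0 here;
        # scikit-learn does the same by default and emits a warning.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


# Made-up, imbalanced column: 2 positives out of 10, one of them missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
accuracy, precision, recall = macro_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy column the single missed positive costs only one tenth of accuracy but halves the recall of the positive class, so macro recall falls to 0.75.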

In [80]:
#Testing
def display_results(model, X_train, Y_train, X_test, Y_test, category_names):
    # Aggregate and per-category scores on the training set
    accuracy, precision, recall, f1score, df = evaluate_model(model, X_train, Y_train)
    print("Train Stats")
    print("Accuracy:", accuracy)
    print("precision:", precision)
    print("recall:", recall)
    print("f1score:", f1score)
    df.index = category_names
    print("Precision, Recall and Accuracy at Category Level for Train set\n", df)
    # Same scores on the held-out test set
    accuracy, precision, recall, f1score, dftest = evaluate_model(model, X_test, Y_test)
    print("------------------------------------------------------------------------")
    print("Test Stats")
    print("Accuracy:", accuracy)
    print("precision:", precision)
    print("recall:", recall)
    print("f1score:", f1score)
    dftest.index = category_names
    print("Precision, Recall and Accuracy at Category Level for Test set\n", dftest)


In [81]:
display_results(model, X_train, Y_train,X_test, Y_test,category_names)

Train Stats
Accuracy: 0.993126695389
precision: 0.995450316705
recall: 0.922378511707
f1score: 0.954660796802
Precision, Recall and Accuracy at Category Level for Train set
                         Precision    Recall  Accuracy
related                  0.991704  0.966697  0.991192
request                  0.992668  0.965675  0.988272
offer                    0.999424  0.866667  0.998851
aid_related              0.985139  0.979738  0.982863
medical_help             0.993817  0.926614  0.988463
medical_products         0.994337  0.911336  0.991336
search_and_rescue        0.996029  0.900638  0.994208
security                 0.997479  0.858696  0.995022
military                 0.997931  0.938053  0.995979
child_alone              1.000000  1.000000  1.000000
water                    0.996274  0.962490  0.995165
food                     0.996000  0.976605  0.994639
shelter                  0.995798  0.958742  0.992772
clothing                 0.996644  0.908614  0.997319
money           



### 6. Improve your model
Use grid search to find better parameters. 

In [21]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    #'vect__max_df': (0.5, 1.0),
    #'vect__max_features': (None, 5000),
    #'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [20, 30],
    'clf__estimator__min_samples_split': [2, 3],
}

cv = GridSearchCV(build_model(), param_grid=parameters)
cv.fit(X_train, Y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__ngram_range': ((1, 1), (1, 2)), 'clf__estimator__n_estimators': [20, 30], 'clf__estimator__min_samples_split': [2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
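
To get a feel for the cost of this search, the candidate combinations can be enumerated by hand. The sketch below copies the active (uncommented) entries of the grid and uses `itertools.product` in place of scikit-learn's own enumeration. The fold count is an assumption: with `cv=None` the default was 3 folds in older scikit-learn releases and is 5 in newer ones.

```python
from itertools import product

# Copy of the active entries of the parameter grid used above.
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__estimator__n_estimators': [20, 30],
    'clf__estimator__min_samples_split': [2, 3],
}

names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

n_folds = 3  # assumption: depends on the scikit-learn version and the cv argument
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Each additional two-valued parameter doubles the number of pipeline fits, which is a likely reason for leaving some entries commented out.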

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [23]:
#clf.get_params().keys()
print("\nBest Parameters:", cv.best_params_)


Best Parameters: {'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 30, 'vect__ngram_range': (1, 2)}


In [82]:
display_results(cv, X_train, Y_train,X_test, Y_test,category_names)

Train Stats
Accuracy: 0.987882293495
precision: 0.988434814498
recall: 0.901077713611
f1score: 0.939100512338
Precision, Recall and Accuracy at Category Level for Train set
                         Precision    Recall  Accuracy
related                  0.958906  0.875814  0.958066
request                  0.980017  0.941539  0.978124
offer                    0.999424  0.866667  0.998851
aid_related              0.951764  0.941370  0.948013
medical_help             0.988264  0.892023  0.982719
medical_products         0.986602  0.893237  0.989086
search_and_rescue        0.993392  0.884860  0.993202
security                 0.997648  0.868207  0.995357
military                 0.992651  0.891494  0.992772
child_alone              1.000000  1.000000  1.000000
water                    0.988393  0.930073  0.990522
food                     0.985027  0.946858  0.986788
shelter                  0.987358  0.926136  0.986357
clothing                 0.985491  0.880253  0.996266
money           

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
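
One way to add a feature next to TF-IDF is a small custom transformer combined through `FeatureUnion`. The sketch below is only an illustration: `MessageLengthExtractor` is a hypothetical feature (message length in characters), the sample messages are made up, and `CountVectorizer` is used with its default tokenizer so the example runs without NLTK.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


features = FeatureUnion([
    ('text_pipeline', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('msg_length', MessageLengthExtractor()),
])

docs = ["we need water", "send food and water now"]  # made-up messages
matrix = features.fit_transform(docs)
print(matrix.shape)  # TF-IDF columns plus one length column
```

In the notebook's pipeline, a step such as `('features', features)` could take the place of the separate `vect` and `tfidf` steps, followed by the same `clf` step.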

### 9. Export your model as a pickle file

In [27]:
model_filepath="bestmodel.pkl"
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(cv, model_filepath)

Saving model...
    MODEL: bestmodel.pkl
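
Reading the file back is the mirror image of saving it. The sketch below round-trips a stand-in object through a temporary file. `load_model` is a hypothetical helper that is not defined elsewhere in this notebook, and the dictionary stands in for the fitted `cv` object.

```python
import os
import pickle
import tempfile


def save_model(model, model_filepath):
    """Pickle the model; the with-block closes the file even on errors."""
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)


def load_model(model_filepath):
    """Hypothetical counterpart to save_model."""
    with open(model_filepath, 'rb') as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be the fitted `cv`.
stand_in = {'best_params': {'clf__estimator__n_estimators': 30}}
path = os.path.join(tempfile.mkdtemp(), 'bestmodel.pkl')
save_model(stand_in, path)
restored = load_model(path)
print(restored == stand_in)
```

Pickle stores functions by reference, so loading the real pipeline should require `tokenize` to be importable under the same module name in the loading process. Pickle files from untrusted sources should not be loaded.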


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
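
A possible skeleton for the script's entry point, assuming the database path and the model path are passed as two command-line arguments. The function names `parse_args` and `main`, the usage text, and the argument order are assumptions and may differ from the provided template.

```python
import sys


def parse_args(argv):
    """Return (database_filepath, model_filepath), or None if the count is wrong."""
    if len(argv) != 3:
        return None
    return argv[1], argv[2]


def main(argv):
    args = parse_args(argv)
    if args is None:
        print('Usage: python train.py <database_filepath> <model_filepath>')
        return 1
    database_filepath, model_filepath = args
    print('Loading data...\n    DATABASE: {}'.format(database_filepath))
    # X, Y, category_names = load_data(database_filepath)
    # ... split the data, build_model(), fit, evaluate_model() ...
    print('Saving model...\n    MODEL: {}'.format(model_filepath))
    # save_model(model, model_filepath)
    return 0


# In train.py this call would be: sys.exit(main(sys.argv))
exit_code = main(['train.py', 'DisasterResponse.db', 'bestmodel.pkl'])
```

The commented lines mark where the functions defined in this notebook would be called once they are copied into the script.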