# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
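Before calling `read_sql_table`, it can help to confirm the table name and column order in the SQLite file. The sketch below uses only the standard-library `sqlite3` module and an in-memory database with a made-up, two-label miniature of the table; the schema and the single row are assumptions for illustration, not the real dataset.

```python
import sqlite3

# In-memory stand-in for DisasterResponse.db; this tiny schema is an
# assumption for illustration, not the real table definition.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Disaster_Messages "
    "(id INTEGER, message TEXT, original TEXT, genre TEXT, "
    "related INTEGER, request INTEGER)"
)
conn.execute(
    "INSERT INTO Disaster_Messages VALUES "
    "(2, 'Weather update', NULL, 'direct', 1, 0)"
)

# List the tables so the name passed to read_sql_table can be checked.
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(table_names)

# Column names in table order (field 1 of each PRAGMA row is the name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(Disaster_Messages)")]

# Everything after 'genre' is treated as a category label (the Y columns).
label_columns = columns[columns.index("genre") + 1:]
print(label_columns)

conn.close()
```

Selecting labels by position, as `df.iloc[:, 4:]` does, relies on this column order staying fixed.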

In [1]:
# import libraries

import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger','stopwords'])

import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier 

import pickle



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Disaster_Messages', engine)


In [3]:
df.head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
X = df['message']
Y = df.iloc[:,4:]
Y.head()

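| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |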


### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    """Normalize, tokenize, remove stop words, then stem and lemmatize a message."""
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # tokenize text
    tokens = word_tokenize(text)

    # remove stop words (build the set once instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokens = [w for w in tokens if w not in stop_words]

    # stemming (reuse a single stemmer instance)
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in tokens]

    # lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in stemmed]

    return tokens
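
To see what the normalization step of `tokenize` does on its own, the sketch below applies the same regular expression to the first sentence of one of the sample messages. It uses only the standard library; `normalize` is a hypothetical helper name, and `str.split()` is a rough whitespace stand-in for `word_tokenize`, not an equivalent.

```python
import re


def normalize(text):
    """Lowercase the text and replace each non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


sample = "UN reports Leogane 80-90 destroyed."
normalized = normalize(sample)

# The hyphen and the full stop both become spaces, so "80-90" splits in two.
rough_tokens = normalized.split()  # whitespace split, standing in for word_tokenize
print(rough_tokens)
```

Because punctuation is replaced rather than deleted, numeric ranges and hyphenated words end up as separate tokens.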


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
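
Conceptually, a multi-output wrapper of this kind fits one independent classifier per target column and stacks their predictions. The sketch below illustrates that idea in plain Python with a toy majority-vote estimator; `MajorityClassifier` and `PerColumnClassifier` are hypothetical names, and this is not scikit-learn's implementation.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class PerColumnClassifier:
    """Fits one independent copy of a base estimator per target column."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_targets = len(Y[0])
        self.estimators_ = [
            self.make_estimator().fit(X, [row[j] for row in Y])
            for j in range(n_targets)
        ]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        # Transpose so each row holds the predictions for one sample.
        return [list(row) for row in zip(*per_column)]


X = ["msg a", "msg b", "msg c"]
Y = [[1, 0], [1, 1], [0, 0]]  # two target columns
model = PerColumnClassifier(MajorityClassifier).fit(X, Y)
predictions = model.predict(["new msg"])
print(len(model.estimators_), predictions)
```

With 36 categories, the same idea means 36 separately fitted classifiers, which is why training time grows with the number of targets.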

In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

    

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
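
A train/test split is a shuffle followed by a cut. The plain-Python sketch below (a hypothetical `shuffled_split` helper, not scikit-learn's code) shows why fixing the seed matters: the same seed reproduces the same split, so later models can be scored on identical test data. `train_test_split` holds out 25% of the rows by default and accepts a `random_state` argument for the same purpose.

```python
import random


def shuffled_split(items, test_fraction=0.25, seed=None):
    """Shuffle the indices with a seeded generator and cut off a test portion."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


messages = ["message {}".format(i) for i in range(8)]  # placeholder data

train_a, test_a = shuffled_split(messages, seed=42)
train_b, test_b = shuffled_split(messages, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same split
```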

In [7]:
# perform train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# train classifier
pipeline.fit(X_train, Y_train)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
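
For reference, the three metrics for a single binary category can be computed by hand from the true-positive, false-positive, and false-negative counts. The sketch below is a plain-Python illustration with made-up labels; `binary_scores` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class of one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (assumption: report 0.0 in that case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up labels for one category: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = binary_scores(y_true, y_pred)
print(precision, recall, f1)

# A model that never predicts the positive class scores zero on all three,
# even though most of its predictions are "correct".
never_positive = binary_scores([0, 0, 0, 1], [0, 0, 0, 0])
print(never_positive)
```

The second case is why accuracy alone can look strong on rare categories while precision and recall stay at zero.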

In [8]:


Y_pred = pipeline.predict(X_test)



In [9]:
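def display_results(Y_test, Y_pred):
    """Print per-category accuracy, then a classification report for each category."""
    # element-wise comparison gives one accuracy value per category column
    accuracy = (Y_pred == Y_test).mean()
    print('Accuracy :', accuracy)
    for n, col in enumerate(Y_test.columns):
        print('Category: {}\n'.format(col))
        print(classification_report(Y_test[col], Y_pred[:, n]))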

In [10]:
display_results(Y_test, Y_pred)


Accuracy : related                   0.801986
request                   0.886937
offer                     0.994652
aid_related               0.749885
medical_help              0.918258
medical_products          0.953858
search_and_rescue         0.973415
security                  0.977998
military                  0.968984
child_alone               1.000000
water                     0.954469
food                      0.933537
shelter                   0.936593
clothing                  0.984721
money                     0.978304
missing_people            0.989610
refugees                  0.966234
death                     0.960886
other_aid                 0.867685
infrastructure_related    0.928037
transport                 0.955233
buildings                 0.953400
electricity               0.981513
tools                     0.993736
hospitals                 0.987777
shops                     0.994041
aid_centers               0.987930
other_infrastructure      0.950802
weather_r



### 6. Improve your model
Use grid search to find better parameters. 
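
Grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold, so the cost grows quickly. The sketch below enumerates a 3 x 3 grid with `itertools.product` to count the fits for 3 folds; it only counts candidates and does not train anything.

```python
from itertools import product

# The two parameters searched over, with three values each.
param_grid = {
    "clf__estimator__n_estimators": [10, 50, 100],
    "clf__estimator__min_samples_split": [2, 3, 4],
}
n_folds = 3

names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is fitted once per fold.
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

With `refit=True`, one further fit of the best candidate on the full training set follows the search.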

In [10]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fe26705b510>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [None]:
parameters = {
    # 'vect__ngram_range': ((1, 1), (1, 2), (2, 2)),
    # 'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000),
    # 'tfidf__norm': ('l1', 'l2'),
    # 'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [10, 50, 100],
    'clf__estimator__min_samples_split': [2, 3, 4]
}

cv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1)

In [11]:


parameters = {
    # 'vect__max_df': (0.75, 1.0),
    # 'vect__stop_words': ('english', None),
    # 'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__min_samples_split': [2, 3, 5]
}
cv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1)



In [None]:
cv.fit(X_train, Y_train)
print("Best parameters set found on development set:")
print()
print(cv.best_params_)

In [25]:
# .as_matrix() is deprecated (and removed in pandas 1.0); .values returns the same NumPy arrays
cv.fit(X_train.values, Y_train.values)


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__min_samples_split': [2, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [13]:
Y_pred = cv.predict(X_test)



In [23]:
display_results(Y_test, Y_pred)

Accuracy : related                   0.798319
request                   0.884186
offer                     0.995569
aid_related               0.751566
medical_help              0.923453
medical_products          0.952483
search_and_rescue         0.972956
security                  0.982277
military                  0.969901
child_alone               1.000000
water                     0.954469
food                      0.933231
shelter                   0.932926
clothing                  0.985332
money                     0.978457
missing_people            0.991291
refugees                  0.967914
death                     0.962108
other_aid                 0.867838
infrastructure_related    0.935371
transport                 0.957372
buildings                 0.953552
electricity               0.977540
tools                     0.994041
hospitals                 0.988083
shops                     0.996180
aid_centers               0.988846
other_infrastructure      0.956761
weather_r

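
To judge whether tuning helped, the per-category accuracies from the two runs can be compared directly. The sketch below copies five of the values printed in this notebook (step 5 for the first model, the listing above for the tuned one) and assumes both runs were scored on the same test split.

```python
# Accuracies copied from the printed outputs for five categories.
baseline = {
    "related": 0.801986,
    "request": 0.886937,
    "aid_related": 0.749885,
    "medical_help": 0.918258,
    "security": 0.977998,
}
tuned = {
    "related": 0.798319,
    "request": 0.884186,
    "aid_related": 0.751566,
    "medical_help": 0.923453,
    "security": 0.982277,
}

# Positive delta means the tuned model scored higher on that category.
deltas = {name: round(tuned[name] - baseline[name], 6) for name in baseline}
improved = sorted(name for name, delta in deltas.items() if delta > 0)

for name, delta in deltas.items():
    print("{:<14} {:+.6f}".format(name, delta))
print("improved:", improved)
```

For these five categories every change is below one percentage point in either direction, so the comparison on its own does not show a clear gain from tuning.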


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
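
As one possible reading of "other features", simple per-message statistics can be computed alongside TF-IDF. The sketch below is plain Python; `message_stats` and the four features it returns are hypothetical choices for illustration, not features used elsewhere in this notebook.

```python
def message_stats(text):
    """Return a few simple hand-crafted features for one message (hypothetical set)."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_digits": sum(ch.isdigit() for ch in text),
        "has_question_mark": int("?" in text),
    }


# One of the sample messages shown in the data preview.
features = message_stats("Is the Hurricane over or is it not over")
print(features)
```

In scikit-learn, features like these are usually wrapped in a custom transformer and combined with the text features, for example through `FeatureUnion`; that wiring is left out of this sketch.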

In [6]:
from sklearn.decomposition import TruncatedSVD

In [7]:
pipeline_2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    # TruncatedSVD() keeps only 2 components by default
    ('other', TruncatedSVD()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])

# perform train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# train classifier
pipeline_2.fit(X_train, Y_train)
   

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ion_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=1))])

In [8]:
parameters = { #'clf__estimator__n_estimators': [10, 50],
               'clf__estimator__min_samples_split': [2,3,5]
              }
cv2 = GridSearchCV(pipeline_2, parameters, cv=3, n_jobs=-1)

cv2.fit(X_train, Y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ion_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__min_samples_split': [2, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [11]:
Y_pred = cv2.predict(X_test)

display_results(Y_test, Y_pred)

Accuracy : related                   0.668449
request                   0.808251
offer                     0.993430
aid_related               0.614057
medical_help              0.855615
medical_products          0.908785
search_and_rescue         0.948663
security                  0.965011
military                  0.939037
child_alone               1.000000
water                     0.907257
food                      0.860963
shelter                   0.853170
clothing                  0.967914
money                     0.957678
missing_people            0.978610
refugees                  0.934607
death                     0.915966
other_aid                 0.782124
infrastructure_related    0.884034
transport                 0.917494
buildings                 0.900993
electricity               0.962720
tools                     0.988694
hospitals                 0.979068
shops                     0.991597
aid_centers               0.977846
other_infrastructure      0.917494
weather_r

### 9. Export your model as a pickle file

In [12]:
# save the model to disk
filename = 'finalized_model.pkl'
with open(filename, 'wb') as pickle_out:
    pickle.dump(cv2, pickle_out)
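
Loading the file back is the mirror image of saving it. The sketch below round-trips a small placeholder dictionary instead of the fitted model (the dictionary contents are made up) and writes to a temporary directory so nothing in the working folder is touched.

```python
import os
import pickle
import tempfile

# Placeholder for the fitted search object; any picklable object behaves the same way.
model = {"best_params": {"clf__estimator__min_samples_split": 2}}

path = os.path.join(tempfile.mkdtemp(), "finalized_model.pkl")

# Context managers close the file even if dump or load raises.
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions by reference, so a pipeline that uses a custom `tokenize` function can only be loaded where that function is importable under the same name.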

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
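
As a rough starting point, a script of this kind usually reads its file paths from the command line and then runs the notebook steps in order. The skeleton below is a hypothetical outline, not the provided template: the argument names and the step list are assumptions, and each step only prints what it would do.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths (hypothetical argument names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and save it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps; each would call the matching code from the notebook.
    steps = [
        "load data from {}".format(args.database_filepath),
        "build pipeline",
        "train and evaluate",
        "save model to {}".format(args.model_filepath),
    ]
    for step in steps:
        print(step)
    return steps


# Example invocation with explicit arguments instead of sys.argv.
steps = main(["DisasterResponse.db", "finalized_model.pkl"])
```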