# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables `X` and `y`

In [1]:
import nltk
nltk.download(['punkt', 'wordnet'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [20]:
# import libraries
import pandas as pd
import numpy as np

from sqlalchemy import create_engine


from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

import pickle

In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages', engine)

# features: raw message text; targets: the 36 category columns
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis=1)

### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """Split text into word tokens, then lemmatize, lowercase, and strip each one."""
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
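
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits the same kind of pipeline on a hypothetical four-message toy dataset with two made-up categories. The toy data, the `toy_` names, and the small `n_estimators` value are assumptions for illustration only, and the default tokenizer is used so the block runs without NLTK.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four messages, two binary categories (water, food).
toy_messages = [
    "we need water",
    "send food please",
    "water and food needed",
    "thank you for the update",
]
toy_labels = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# predict returns one row per message and one column per category.
toy_pred = toy_pipeline.predict(toy_messages)
print(toy_pred.shape)

# MultiOutputClassifier fits one copy of the base estimator per target column.
print(len(toy_pipeline.named_steps["clf"].estimators_))
```

The two-dimensional prediction array is why the evaluation step can be done column by column.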

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
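
The splitting step can be sketched on made-up data to show how `test_size` and `random_state` behave. The 100 toy messages and the variable names below are assumptions for illustration, not part of the dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the message and label columns.
toy_X = [f"message {i}" for i in range(100)]
toy_y = [i % 2 for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42)

# Repeating the call with the same random_state reproduces the split.
X_tr2, X_te2, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42)

print(len(X_tr), len(X_te))  # roughly two thirds / one third
print(X_te == X_te2)         # same seed, same test set
```

Fixing `random_state` keeps later comparisons between models on the same held-out messages.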

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
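
One way to read the instruction above is a loop that slices one column at a time from the true labels and from the prediction array. The sketch below uses hypothetical toy labels and predictions (the `toy_` names and values are made up) rather than the real `y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels (DataFrame, like y_test) and predictions
# (2-D array, like the output of pipeline.predict).
toy_true = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 1, 1, 0]})
toy_pred = np.array([
    [1, 0],
    [0, 0],
    [1, 1],
    [0, 0],
])

reports = {}
for i, column in enumerate(toy_true.columns):
    # Column i of the prediction array lines up with column i of the labels.
    reports[column] = classification_report(toy_true[column], toy_pred[:, i])
    print(column)
    print(reports[column])
```

Reporting per column keeps each category's label names and support counts unambiguous.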

In [7]:
y_pred = pipeline.predict(X_test)

In [8]:
# Column 0 ('related') is left out of the report, so target_names must skip it too.
print(classification_report(y_test.iloc[:, 1:].values, y_pred[:, 1:], target_names=y_test.columns[1:]))

                        precision    recall  f1-score   support

               request       0.83      0.39      0.53      1472
                 offer       0.00      0.00      0.00        38
           aid_related       0.74      0.49      0.59      3545
          medical_help       0.61      0.10      0.17       701
      medical_products       0.67      0.08      0.15       446
     search_and_rescue       0.30      0.01      0.03       226
              security       0.00      0.00      0.00       160
              military       0.54      0.07      0.13       267
           child_alone       0.00      0.00      0.00         0
                 water       0.82      0.44      0.58       543
                  food       0.85      0.29      0.43       965
               shelter       0.79      0.19      0.30       775
              clothing       0.57      0.03      0.06       127
                 money       0.77      0.05      0.10       191
        missing_people       1.00      


### 6. Improve your model
Use grid search to find better parameters. 

In [16]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [100, 200],
}


cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1, verbose=2)
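
To see how large this search is before starting it, the candidate combinations can be counted with plain Python. The sketch assumes 3-fold cross-validation, which is what the fit log reports for this run (newer scikit-learn releases default to 5 folds); `n_folds`, `candidates`, and `total_fits` are illustrative names.

```python
from itertools import product

parameters = {
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "clf__estimator__n_estimators": [100, 200],
}
n_folds = 3  # assumption: matches the "Fitting 3 folds" line in the log

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(parameters, values))
    for values in product(*parameters.values())
]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 8 24
```

Each extra value in the grid multiplies the number of fits, which matters when a single fit takes several minutes.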

In [17]:
cv.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1), total= 5.1min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  5.6min remaining:    0.0s


[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1), total= 5.1min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 1), total= 5.2min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2), total=11.3min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2), total=11.5min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__ngram_range=(1, 2), total=12.7min
[CV] clf__estimator__n_estimators=100, tfidf__use_idf=False, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_estimators=100, tfidf__use_idf=False, ve

[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed: 326.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': ((1, 1), (1, 2)), 'tfidf__use_idf': (True, False), 'clf__estimator__n_estimators': [100, 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [18]:
cv_y_pred = cv.predict(X_test)
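
After a search has been fitted, the chosen combination can be inspected through `best_params_` and `best_score_`. The following is a minimal sketch on hypothetical toy data with a deliberately tiny grid; the data, the `toy_` names, and the grid values are assumptions and say nothing about which parameters win on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six messages, two binary categories.
toy_messages = [
    "we need water",
    "send food please",
    "water and food needed",
    "thank you for the update",
    "no water here",
    "food is running out",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

toy_search = GridSearchCV(
    toy_pipeline,
    param_grid={"clf__estimator__n_estimators": [5, 10]},
    cv=2,
)
toy_search.fit(toy_messages, toy_labels)

print(toy_search.best_params_)  # the winning parameter combination
print(toy_search.best_score_)   # its mean cross-validated score
```

With `refit=True` (the default), `predict` on the search object uses a pipeline refitted with those best parameters.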

### 7. Test your model
Show the precision, recall, and f1 score of the tuned model.


In [19]:
# Column 0 ('related') is left out of the report, so target_names must skip it too.
print(classification_report(y_test.iloc[:, 1:].values, cv_y_pred[:, 1:], target_names=y_test.columns[1:]))

                        precision    recall  f1-score   support

               request       0.89      0.45      0.60      1472
                 offer       0.00      0.00      0.00        38
           aid_related       0.79      0.58      0.67      3545
          medical_help       0.69      0.04      0.08       701
      medical_products       0.81      0.10      0.17       446
     search_and_rescue       0.94      0.07      0.13       226
              security       0.00      0.00      0.00       160
              military       0.67      0.01      0.03       267
           child_alone       0.00      0.00      0.00         0
                 water       0.89      0.30      0.45       543
                  food       0.86      0.53      0.66       965
               shelter       0.85      0.30      0.45       775
              clothing       0.80      0.06      0.12       127
                 money       0.89      0.04      0.08       191
        missing_people       1.00      


### 9. Export your model as a pickle file

In [21]:
# use a context manager so the file is closed once the model is written
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)
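
A saved pickle can be read back with `pickle.load`. The round trip below is a small sketch that uses a plain dictionary as a stand-in for the fitted search object and writes to a temporary directory; both choices are assumptions made so the block runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted GridSearchCV object (hypothetical contents).
stand_in_model = {"best_params": {"clf__estimator__n_estimators": 100}}

# Write to a temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Later, possibly in another script: open in binary mode and load.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Because pickle stores functions by reference rather than by value, loading the real `model.pkl` elsewhere would likely also require the `tokenize` function to be importable in that script.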

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.