# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables `X` and `y`

In [7]:
# import libraries
import nltk
import re
nltk.download(['punkt', 'wordnet'])
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [48]:
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.multioutput import MultiOutputClassifier
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [9]:
# load data from database
engine = create_engine('sqlite:///FigureEight.db')
df = pd.read_sql_table('Message', engine)
X = df.message.values  # raw message text is the only feature
y = df[df.columns[4:]].values  # columns from index 4 onward are the category labels

### 2. Write a tokenization function to process your text data

In [11]:
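def tokenize(text):
    """Normalize, tokenize, and lemmatize a message into a list of clean tokens."""
    # Lowercase the text and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # Lemmatize each token and strip surrounding whitespace
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    return clean_tokens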
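
To see what the normalization step contributes on its own, here is a minimal sketch that uses only the standard library. The helper name `normalize` and the sample message are made up for illustration, and `str.split` stands in for `word_tokenize`, so the result is only an approximation of what `tokenize` returns.

```python
import re


def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


# Hypothetical message, not taken from the dataset
sample = "Need water & food, URGENT!"
normalized = normalize(sample)

# str.split is a simple stand-in for word_tokenize in this sketch
rough_tokens = normalized.split()
print(rough_tokens)  # ['need', 'water', 'food', 'urgent']
```

Punctuation and symbols never reach the vectorizer because they are turned into spaces before tokenization.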

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [28]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, max_depth = 10, random_state=0)))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the F1 score, precision, and recall on both the training set and the test set. You can use sklearn's `classification_report` function here.

In [27]:
from sklearn.metrics import classification_report
target_names = list(df.columns[4:])
print(classification_report(y_test, y_pred, target_names = target_names))

                        precision    recall  f1-score   support

               related       0.77      1.00      0.87      4044
               request       0.00      0.00      0.00       891
                 offer       0.00      0.00      0.00        25
           aid_related       0.95      0.09      0.17      2159
          medical_help       1.00      0.00      0.00       426
      medical_products       0.00      0.00      0.00       262
     search_and_rescue       0.00      0.00      0.00       132
              security       0.00      0.00      0.00        98
              military       0.00      0.00      0.00       174
           child_alone       0.00      0.00      0.00         0
                 water       0.00      0.00      0.00       326
                  food       0.00      0.00      0.00       556
               shelter       0.00      0.00      0.00       436
              clothing       0.00      0.00      0.00        70
                 money       0.00      
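
Several rows in the report pair a precision of 1.00 with a recall and F1 score of 0.00. The sketch below shows how that combination can arise. The counts are hypothetical, chosen only to be consistent with a row whose support is 426. The real counts are not shown in the report.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical case: one positive predicted, and it is correct, but 425 positives are missed
p, r, f = precision_recall_f1(tp=1, fp=0, fn=425)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 1.00 0.00 0.00

# Hypothetical case: the label is never predicted at all
print(precision_recall_f1(tp=0, fp=0, fn=25))  # (0.0, 0.0, 0.0)
```

A high precision on a rare label can therefore mean that the classifier almost never predicts it, which is why recall and F1 are the more informative columns here.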

### 6. Improve your model
Use grid search to find better parameters. 

In [43]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(max_depth = 10, random_state=0)))
    ])
parameters = {'tfidf__norm': ['l1', None],
              'clf__estimator__n_estimators': [10, 100, 200],
             }

cv = GridSearchCV(pipeline, param_grid=parameters, scoring='f1_weighted')
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'tfidf__norm': ['l1', None], 'clf__estimator__n_estimators': [10, 100, 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_weighted', verbose=0)
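
To get a feel for the cost of this search, the sketch below enumerates the parameter grid with the standard library only. It also shows how the double-underscore names are read: pipeline step, then nested attribute, then parameter. The fold count of 3 is an assumption. It was the default for `cv=None` in older scikit-learn releases such as the one that printed the output above, while newer releases default to 5.

```python
from itertools import product

parameters = {
    "tfidf__norm": ["l1", None],
    "clf__estimator__n_estimators": [10, 100, 200],
}

# Each key is a path: step name in the pipeline, then nested attributes, joined by "__"
for name in parameters:
    print(name.split("__"))

# Every combination of values is one candidate configuration
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 3  # assumption, see the note above
total_fits = len(candidates) * n_folds + 1  # +1 for the final refit (refit=True)
print(len(candidates), total_fits)  # 6 19
```

Because every fit trains one random forest per output column, adding even one more value to the grid increases the run time noticeably.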

### 7. Test your model
Show the precision, recall, and F1 score of the tuned model.

In [44]:
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names = target_names))

                        precision    recall  f1-score   support

               related       0.77      1.00      0.87      4044
               request       1.00      0.01      0.01       891
                 offer       0.00      0.00      0.00        25
           aid_related       0.83      0.21      0.33      2159
          medical_help       1.00      0.01      0.01       426
      medical_products       0.50      0.00      0.01       262
     search_and_rescue       0.00      0.00      0.00       132
              security       0.00      0.00      0.00        98
              military       0.00      0.00      0.00       174
           child_alone       0.00      0.00      0.00         0
                 water       0.67      0.01      0.01       326
                  food       1.00      0.01      0.01       556
               shelter       1.00      0.00      0.01       436
              clothing       0.00      0.00      0.00        70
                 money       0.00      

In [45]:
cv.best_params_

{'clf__estimator__n_estimators': 10, 'tfidf__norm': None}

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
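
One possible way to add a feature next to TF-IDF is scikit-learn's `FeatureUnion`, sketched below. The `MessageLength` transformer is a hypothetical example feature, not part of this notebook, and the default `CountVectorizer` tokenizer is used instead of `tokenize` so that the snippet runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words per message."""

    def fit(self, X, y=None):
        # Nothing to learn for this feature
        return self

    def transform(self, X):
        # One row per message, one column holding the word count
        return np.array([[len(text.split())] for text in X])


features = FeatureUnion([
    ("text", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("length", MessageLength()),
])

# Made-up messages for illustration
messages = ["we need water and food", "help"]
matrix = features.fit_transform(messages)
print(matrix.shape)  # TF-IDF columns plus one length column
```

In the full pipeline, `features` would take the place of the separate `vect` and `tfidf` steps, followed by the `clf` step as before.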

In [57]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(max_depth = 10, random_state=0)))
    ])
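parameters = {
              # Left disabled: learning_rate is a GradientBoostingClassifier parameter.
              # RandomForestClassifier does not accept it, so enabling that line
              # with the pipeline above would raise an error.
              #'clf__estimator__learning_rate' : [0.1, 0.4],
              #'clf__estimator__max_depth': [3, 5],
              'clf__estimator__n_estimators' : [10]
             }

# GridSearchCV was already imported earlier in this notebook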
cv = GridSearchCV(pipeline, param_grid=parameters, scoring='f1_weighted')
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_weighted', verbose=0)

### 9. Export your model as a pickle file
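
A minimal sketch of the export step with the standard `pickle` module follows. The helper names `save_model` and `load_model` and the file name `classifier.pkl` are assumptions, and a plain dictionary stands in for the fitted `cv` object so that the snippet runs without training anything.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a picklable object to the given file path."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object from the given file path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the fitted model; in the notebook this would be `cv`
stand_in_model = {"clf__estimator__n_estimators": 10, "tfidf__norm": None}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")  # hypothetical file name
save_model(stand_in_model, path)
restored = load_model(path)
print(restored == stand_in_model)  # True
```

Note that pickle stores functions by reference, so a pipeline built with `tokenizer=tokenize` can only be loaded by code that also defines or imports `tokenize`.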

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
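
As a starting point, the command-line handling of such a script might look like the sketch below. The argument names `database_filepath` and `model_filepath` are assumptions rather than names taken from the template, and the pipeline steps are left as comments.

```python
import argparse


def parse_args(argv=None):
    """Parse the database and model paths supplied by the user."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    # Hypothetical argument names; adapt them to the template
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # 1. Load X and y from args.database_filepath
    # 2. Build the pipeline and run the grid search
    # 3. Print the classification report on the test set
    # 4. Pickle the fitted model to args.model_filepath
    return args


# Example call with explicit arguments instead of sys.argv
args = main(["FigureEight.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

Passing `argv` explicitly keeps the functions easy to test. In the finished script, `main()` would be called without arguments so that `argparse` reads from the command line.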