# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [4]:
# import libraries
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])
import re
import numpy as np
import pandas as pd

from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.



In [5]:
# load data from database
engine = create_engine('sqlite:///cleaned_message_data.db')
df = pd.read_sql(sql="SELECT * FROM MessageData", con=engine)
X = df['message'].values
Y = df.drop(['id','message','original','genre'], axis=1)
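
The task list links to `read_sql_table`, while the cell above uses `pd.read_sql` with a query; both load the full table. The sketch below is a minimal illustration on an invented toy table written to a temporary SQLite file (the file name `toy.db`, the rows, and the `*_toy` variable names are assumptions, not the project data).

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for the cleaned message table (all values are invented).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "road blocked", "all fine here"],
    "original": ["text a", "text b", "text c"],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

db_path = os.path.join(tempfile.mkdtemp(), "toy.db")
engine = create_engine(f"sqlite:///{db_path}")
toy.to_sql("MessageData", engine, index=False)

# read_sql_table loads a whole table by name, so no SQL string is needed.
df_toy = pd.read_sql_table("MessageData", engine)
engine.dispose()

# Features are the raw message text; targets are the remaining label columns.
X_toy = df_toy["message"].values
Y_toy = df_toy.drop(["id", "message", "original", "genre"], axis=1)
print(list(Y_toy.columns))
```

Keeping `Y` as a DataFrame, as the notebook does, preserves the category names for later per-column reporting.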

### 2. Write a tokenization function to process your text data

In [6]:
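def tokenize(text):
    """Split text into word tokens, then lemmatize, lowercase, and strip each one."""
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        # Lemmatize first, then normalize case and surrounding whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens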
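
The `re` module is imported above but not used by `tokenize`. One possible refinement, which is an assumption here and not part of the notebook's function, is to replace URLs with a placeholder and drop punctuation before tokenizing. The sketch below is dependency-free: a plain `split()` stands in for `word_tokenize`, and the names `normalize_text` and `urlplaceholder` are hypothetical.

```python
import re

# Matches http/https URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def normalize_text(text):
    """Replace URLs with a placeholder, lowercase, drop punctuation, and split on whitespace."""
    text = URL_PATTERN.sub("urlplaceholder", text)
    # Keep only letters and digits; every other character becomes a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return text.split()


tokens = normalize_text("Flood near http://example.com/map, send HELP!")
print(tokens)
```

A step like this could run before `word_tokenize` inside `tokenize`, but whether it helps the scores would need to be measured.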

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
    ])),
    ('clf', MultiOutputClassifier(OneVsRestClassifier(SGDClassifier())))
])
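
`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, so `predict` returns one column per category. The sketch below shows that shape on invented toy messages; it drops the `FeatureUnion` and `OneVsRestClassifier` layers for brevity, and the data and the `toy_pipeline` name are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented toy messages with two binary label columns: [water, transport].
messages = np.array([
    "we need clean water",
    "water supply is gone",
    "no drinking water left",
    "the road is blocked",
    "bridge collapsed on the road",
    "trucks cannot pass the road",
])
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(SGDClassifier(random_state=0))),
])
toy_pipeline.fit(messages, labels)

pred = toy_pipeline.predict(np.array(["water is needed", "road closed"]))
print(pred.shape)  # one row per message, one column per label
```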

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)
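
Called with no extra arguments, `train_test_split` holds out 25% of the rows and shuffles differently on each run, so the scores further down are not exactly reproducible. The sketch below, on invented data, fixes the split with `test_size` and `random_state`; the values 0.2 and 42 are arbitrary choices, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 10 messages with 2 label columns each.
X_demo = np.array([f"message {i}" for i in range(10)])
Y_demo = np.arange(20).reshape(10, 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```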

In [9]:
import warnings; warnings.simplefilter('ignore')

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...   shuffle=True, tol=None, verbose=0, warm_start=False),
          n_jobs=1),
           n_jobs=1))])

In [10]:
y_pred = pipeline.predict(X_test)

In [11]:
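def display_model_results(y_test, y_pred):
    """Print the predicted label values and the per-category accuracy."""
    labels = np.unique(y_pred)
    # Element-wise comparison, averaged column by column (y_test is a DataFrame)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Accuracy:", accuracy)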


### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
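
The per-column loop described above can be sketched as follows. The helper name `report_per_category` and the toy labels are assumptions; `zero_division=0` is available in recent scikit-learn releases and may need to be dropped on older versions such as the one that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


def report_per_category(y_true, y_pred):
    """Return {column name: classification_report text}.

    y_true is a DataFrame of labels; y_pred is a 2-D array with the same column order.
    """
    y_pred = np.asarray(y_pred)
    reports = {}
    for i, column in enumerate(y_true.columns):
        reports[column] = classification_report(
            y_true[column].values, y_pred[:, i], zero_division=0
        )
    return reports


# Invented toy labels and predictions for two categories.
y_true_demo = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

reports = report_per_category(y_true_demo, y_pred_demo)
for column, text in reports.items():
    print(column)
    print(text)
```

With the notebook's variables this would be called as `report_per_category(y_test, y_pred)`.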

In [12]:
display_model_results(y_test, y_pred)

Labels: [0 1 2]
Accuracy: related                   0.821941
request                   0.905401
offer                     0.995880
aid_related               0.787916
medical_help              0.922948
medical_products          0.956973
search_and_rescue         0.972841
security                  0.980165
military                  0.969484
child_alone               1.000000
water                     0.963686
food                      0.954074
shelter                   0.950412
clothing                  0.987489
money                     0.980317
missing_people            0.988709
refugees                  0.968569
death                     0.967501
other_aid                 0.871987
infrastructure_related    0.932560
transport                 0.957431
buildings                 0.959262
electricity               0.977418
tools                     0.994507
hospitals                 0.988557
shops                     0.996033
aid_centers               0.988862
other_infrastructure      0.9

### 6. Improve your model
Use grid search to find better parameters. 
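
As a small illustration of how parameter names address nested steps, the sketch below runs scikit-learn's plain `GridSearchCV` over a toy pipeline. The data, the two-value grids, and `cv=2` are assumptions chosen to keep it fast; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented messages, alternating between the two categories so that
# every cross-validation fold contains both classes for each label.
messages = np.array([
    "we need clean water", "the road is blocked",
    "water supply is gone", "bridge on the road collapsed",
    "no drinking water left", "trucks cannot pass the road",
    "please send water", "road is flooded and closed",
])
labels = np.array([[1, 0], [0, 1]] * 4)  # columns: [water, transport]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(SGDClassifier(random_state=0))),
])

# "<step>__<parameter>" reaches into a step; "clf__estimator__alpha"
# reaches the SGDClassifier wrapped by MultiOutputClassifier.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(toy_pipeline, param_grid=param_grid, cv=2)
search.fit(messages, labels)
print(search.best_params_)
```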

In [13]:
# Check the list of available parameters
for key in pipeline.get_params().keys():
    print(key)

memory
steps
features
clf
features__n_jobs
features__transformer_list
features__transformer_weights
features__text_pipeline
features__text_pipeline__memory
features__text_pipeline__steps
features__text_pipeline__vect
features__text_pipeline__tfidf
features__text_pipeline__vect__analyzer
features__text_pipeline__vect__binary
features__text_pipeline__vect__decode_error
features__text_pipeline__vect__dtype
features__text_pipeline__vect__encoding
features__text_pipeline__vect__input
features__text_pipeline__vect__lowercase
features__text_pipeline__vect__max_df
features__text_pipeline__vect__max_features
features__text_pipeline__vect__min_df
features__text_pipeline__vect__ngram_range
features__text_pipeline__vect__preprocessor
features__text_pipeline__vect__stop_words
features__text_pipeline__vect__strip_accents
features__text_pipeline__vect__token_pattern
features__text_pipeline__vect__tokenizer
features__text_pipeline__vect__vocabulary
features__text_pipeline__tfidf__norm
features__text_p

In [14]:
parameters = {
    'features__text_pipeline__vect__ngram_range': ((1,1),(1,2),(1,3)),
    'features__text_pipeline__tfidf__use_idf': (True, False),
    'features__text_pipeline__tfidf__smooth_idf': (True, False),
    'features__transformer_weights': (
        {'text_pipeline': 1},
        {'text_pipeline': 0.5},
        {'text_pipeline': 0.2}
    ),
    'clf__estimator__estimator__n_jobs': [50],# 100, 200],
    'clf__estimator__estimator__alpha': [0.0001] #0.001,0.01]    
}

cv = GridSearchCVProgressBar(pipeline, param_grid=parameters, verbose=1,n_jobs=-1)
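
Grid search cost grows multiplicatively with each parameter list, so it helps to count the candidates before fitting. The sketch below uses `ParameterGrid` on a copy of the grid above; the 3-fold figure is taken from the fit log later in this notebook ("Fitting 3 folds for each of 36 candidates").

```python
from sklearn.model_selection import ParameterGrid

# Copy of the notebook's grid.
parameters = {
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'features__text_pipeline__tfidf__use_idf': (True, False),
    'features__text_pipeline__tfidf__smooth_idf': (True, False),
    'features__transformer_weights': (
        {'text_pipeline': 1},
        {'text_pipeline': 0.5},
        {'text_pipeline': 0.2},
    ),
    'clf__estimator__estimator__n_jobs': [50],
    'clf__estimator__estimator__alpha': [0.0001],
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 2 * 2 * 3 * 1 * 1
n_folds = 3
print(n_candidates, n_candidates * n_folds)  # 36 108
```

Adding the commented-out `alpha` values would triple both numbers, which is worth weighing against the roughly 38 minutes the 108 fits took here.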

In [1]:
!pip install pactools

Collecting pactools
  Downloading https://files.pythonhosted.org/packages/a2/d7/3f49de72a91e98f0d69043a5821efa28c1e9cab322dd7aede44c9cca4c2f/pactools-0.2.0b0.tar.gz (66kB)
    100% |████████████████████████████████| 71kB 2.1MB/s
Building wheels for collected packages: pactools
  Running setup.py bdist_wheel for pactools ... done
  Stored in directory: /root/.cache/pip/wheels/5b/a6/57/f1df50567735175243e07792baced6076d67ab30ca1c138b71
Successfully built pactools
Installing collected packages: pactools
Successfully installed pactools-0.2.0b0


In [2]:
from pactools import simulate_pac
from pactools.grid_search import ExtractDriver, AddDriverDelay
from pactools.grid_search import DARSklearn, MultipleArray
from pactools.grid_search import GridSearchCVProgressBar



In [15]:
fs = 200.  # Hz
high_fq = 50.0  # Hz
low_fq = 5.0  # Hz
low_fq_width = 1.0  # Hz

n_epochs = 3
n_points = 10000
noise_level = 0.4

low_sig = np.array([
    simulate_pac(n_points=n_points, fs=fs, high_fq=high_fq, low_fq=low_fq,
                 low_fq_width=low_fq_width, noise_level=noise_level,
                 random_state=i) for i in range(n_epochs)
])

In [16]:
import warnings; warnings.simplefilter('ignore')

X = MultipleArray(low_sig, None)
cv.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[........................................] 100% | 2309.01 sec | GridSearchCV 


[ParallelProgressBar(n_jobs=-1)]: Done 108 out of 108 | elapsed: 38.5min finished


GridSearchCVProgressBar(cv=None, error_score='raise',
            estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...   shuffle=True, tol=None, verbose=0, warm_start=False),
          n_jobs=1),
           n_jobs=1))]),
            fit_params=None, iid=True, n_jobs=-1,
            param_grid={'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2), (1, 3)), 'features__text_pipeline__tfidf__use_idf': (True, False), 'features__text_pipeline__tfidf__smooth_idf': (True, False), 'features__transformer_weights': ({'text_pipeline': 1}, {'text_pipeline': 0.5}, {'text_pipeline': 0.2}), 'clf__estimator__estimator__n_jobs': [50], 'clf__estimator__estimator__alpha': [0.0001]},
            pre_dispatch='2*n_j

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
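
Accuracy alone can look high for categories with very few positive examples, so precision and recall are worth tabulating per category. The sketch below is one way to do that, on invented labels; the helper name `summarize_scores` and the choice of `average="weighted"` are assumptions, the latter made because the printed labels include a value of 2 and so not every column is binary.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support


def summarize_scores(y_true, y_pred):
    """Return a DataFrame of weighted precision, recall, and F1, one row per category."""
    y_pred = np.asarray(y_pred)
    rows = []
    for i, column in enumerate(y_true.columns):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[column].values, y_pred[:, i],
            average="weighted", zero_division=0,
        )
        rows.append({"category": column, "precision": precision,
                     "recall": recall, "f1": f1})
    return pd.DataFrame(rows).set_index("category")


# Invented labels: "water" is predicted perfectly, "food" has one false positive.
y_true_demo = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_pred_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])

summary = summarize_scores(y_true_demo, y_pred_demo)
print(summary)
```

With the notebook's variables this would be `summarize_scores(y_test, cv.predict(X_test))`.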

In [17]:
import pickle

filename = 'trained_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)

In [21]:
def display_results(cv, y_test, y_pred):
    labels = np.unique(y_pred)
    
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    
    print("Accuracy:", accuracy)
    print("\nBest Parameters:", cv.best_params_)

In [22]:
y_pred = cv.predict(X_test)

In [23]:
display_results(cv, y_test, y_pred)

Labels: [0 1 2]
Accuracy: related                   0.821483
request                   0.906164
offer                     0.995880
aid_related               0.789594
medical_help              0.922948
medical_products          0.957278
search_and_rescue         0.972994
security                  0.980165
military                  0.969637
child_alone               1.000000
water                     0.962923
food                      0.954532
shelter                   0.950259
clothing                  0.987641
money                     0.980317
missing_people            0.988709
refugees                  0.969179
death                     0.966585
other_aid                 0.871834
infrastructure_related    0.932560
transport                 0.957278
buildings                 0.958193
electricity               0.977571
tools                     0.994507
hospitals                 0.988557
shops                     0.996033
aid_centers               0.988862
other_infrastructure      0.9

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
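
For the second idea, `BaseEstimator`, `TransformerMixin`, and `FeatureUnion` are already imported at the top of the notebook. The sketch below adds a word-count feature next to TF-IDF; the `MessageLengthExtractor` class is a hypothetical example, and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words per message."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works inside a pipeline.
        return self

    def transform(self, X):
        # One row per message, one column holding the word count.
        return np.array([[len(text.split())] for text in X], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLengthExtractor()),
])

docs = ["need water now", "road blocked"]
matrix = features.fit_transform(docs)
print(matrix.shape)  # 5 TF-IDF columns plus 1 length column
```

Because the length column is on a different scale from the TF-IDF values, `transformer_weights` or a scaler may be needed before a linear model benefits from it.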

### 9. Export your model as a pickle file
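
The cell in step 7 pickles the whole search object `cv`. The sketch below shows a save-and-load round trip on a toy classifier; the file name `toy_model.pkl` and the data are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy model standing in for the fitted grid search object.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = SGDClassifier(random_state=0).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

sample = np.array([[0.5], [2.5]])
same = bool((restored.predict(sample) == model.predict(sample)).all())
print(same)
```

Pickle stores functions by reference, so a pipeline that uses the custom `tokenize` function can only be loaded where `tokenize` is defined or importable under the same module name.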

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
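
The template file itself is not shown here, so the following is only a hypothetical skeleton of how such a script could accept user-specified paths. The argument names `database_filepath` and `model_filepath` and the step list are assumptions; the example paths reuse the file names from this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths the script needs (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database with the cleaned messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    """Outline the notebook's steps in order; each string marks where real code would go."""
    args = parse_args(argv)
    steps = [
        f"load data from {args.database_filepath}",
        "build the pipeline and grid search",
        "fit on the training split",
        "report per-category scores on the test split",
        f"pickle the model to {args.model_filepath}",
    ]
    for step in steps:
        print(step)
    return steps


# Example call with explicit arguments instead of the command line.
steps = main(["cleaned_message_data.db", "trained_model.sav"])
```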