# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger', 'stopwords'])

import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sqlalchemy import create_engine
from time import time

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponses.db')
df = pd.read_sql_table('DisasterResponses', engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
X = df['message']
y = df.drop(['message', 'id', 'original', 'genre'], axis=1)

### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """Replace URLs with a placeholder, tokenize, drop English stopwords, and lemmatize."""

    # detect URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # get list of all urls using regex
    detected_urls = re.findall(url_regex, text)

    # replace each url in text string with placeholder
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # tokenize text and initiate lemmatizer
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # remove stopwords; load the stopword list once instead of once per token
    stop_words = set(stopwords.words("english"))
    tokens = [w for w in tokens if w not in stop_words]

    # iterate through each token
    clean_tokens = []
    for tok in tokens:

        # lemmatize, normalize case, and remove leading/trailing white space
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
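
The URL-replacement step can be checked on its own, without NLTK. A minimal sketch using only the standard library: `replace_urls` is a hypothetical helper name, and `re.sub` stands in for the `findall`/`replace` loop used in `tokenize`.

```python
import re

# Same pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Return text with every match of URL_REGEX swapped for the placeholder."""
    return re.sub(URL_REGEX, placeholder, text)


print(replace_urls("Shelter map at http://bit.ly/abc123 please share"))
# prints: Shelter map at urlplaceholder please share
```

Note that the pattern stops only at whitespace or a character outside its classes, so punctuation directly after a link (a trailing full stop, for instance) is absorbed into the match.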
    

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
def build_pipeline():

    # build pipeline
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier(n_estimators=100, n_jobs=6))
    ])
    
    return pipeline

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = build_pipeline()

# train classifier
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...n_jobs=6,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
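
To make the three metrics concrete, here is a minimal plain-Python sketch for a single binary category. `binary_scores` is a hypothetical helper, not part of scikit-learn. The second call illustrates why the positive class of a rare category scores 0 when a model never predicts it.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Fall back to 0.0 when a denominator is zero;
    # scikit-learn does the same and emits a warning
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two true positives, one false positive, one false negative
print(binary_scores([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))

# A classifier that never predicts the positive class
print(binary_scores([0, 0, 0, 1], [0, 0, 0, 0]))
```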

In [7]:
from sklearn.metrics import precision_score, recall_score, f1_score

In [8]:
# predict on test data
y_pred = pipeline.predict(X_test)

In [9]:
from sklearn.metrics import classification_report

for i, col in enumerate(y.columns):
    # if i in [0, 9]: continue  # Skip bad column, TODO: Fix 0th column to not be 3 classes for no reason

    print("Column {}: {}".format(i, col))

    Y_true = list(y_test.values[:, i])
    Y_pred = list(y_pred[:, i])
    # classification_report assigns target_names to the sorted labels (0, then 1),
    # so the name for the negative class must come first
    target_names = ['is_not_{}'.format(col), 'is_{}'.format(col)]
    print(classification_report(Y_true, Y_pred, target_names=target_names))

Column 0: related
                precision    recall  f1-score   support

is_not_related       0.67      0.47      0.55      1556
    is_related       0.85      0.93      0.89      4998

   avg / total       0.81      0.82      0.81      6554

Column 1: request
                precision    recall  f1-score   support

is_not_request       0.90      0.98      0.94      5446
    is_request       0.85      0.46      0.59      1108

   avg / total       0.89      0.89      0.88      6554

Column 2: offer
              precision    recall  f1-score   support

is_not_offer       1.00      1.00      1.00      6527
    is_offer       0.00      0.00      0.00        27

 avg / total       0.99      1.00      0.99      6554

Column 3: aid_related
                    precision    recall  f1-score   support

is_not_aid_related       0.74      0.92      0.82      3869
    is_aid_related       0.82      0.53      0.64      2685

       avg / total       0.77      0.76      0.75      6554

Column 4: 

  'precision', 'predicted', average, warn_for)
  .format(len(labels), len(target_names))


              precision    recall  f1-score   support

is_not_tools       0.99      1.00      1.00      6506
    is_tools       0.00      0.00      0.00        48

 avg / total       0.99      0.99      0.99      6554

Column 24: hospitals
                  precision    recall  f1-score   support

is_not_hospitals       0.99      1.00      0.99      6487
    is_hospitals       0.00      0.00      0.00        67

     avg / total       0.98      0.99      0.98      6554

Column 25: shops
              precision    recall  f1-score   support

is_not_shops       1.00      1.00      1.00      6522
    is_shops       0.00      0.00      0.00        32

 avg / total       0.99      1.00      0.99      6554

Column 26: aid_centers
                    precision    recall  f1-score   support

is_not_aid_centers       0.99      1.00      0.99      6481
    is_aid_centers       0.00      0.00      0.00        73

       avg / total       0.98      0.99      0.98      6554

Column 27: other_infras

### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7ff033df6a60>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=6,
               oob_score=False, random_state=None, verbose=0,
               warm

In [11]:
from sklearn.multioutput import MultiOutputClassifier

def build_model():
    """
    Build the text-processing pipeline and wrap it in a grid search.

    :return: unfitted GridSearchCV object wrapping the pipeline
    """
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])
    # specify parameters for grid search
    parameters = {
        'clf__min_samples_split': [3, 4],
        'clf__n_estimators': [50, 100]
    }
    cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=2, cv=3)

    return cv
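
As a rough illustration of what this grid asks for, the sketch below expands the same `parameters` dictionary into its candidate settings with `itertools.product`. The `clf__` prefix routes each value to the pipeline step named `clf`. `expand_grid` is a hypothetical helper written for this illustration; it is not how `GridSearchCV` is implemented.

```python
from itertools import product

parameters = {
    'clf__min_samples_split': [3, 4],
    'clf__n_estimators': [50, 100],
}


def expand_grid(grid):
    """Yield one dict per combination of the listed values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(parameters))
for candidate in candidates:
    print(candidate)

# With cv=3, each candidate is fitted on three train/validation splits
n_folds = 3
print(len(candidates) * n_folds, "fits before the final refit")
```

Four candidates times three folds gives twelve fits, so every value added to the grid multiplies the training time.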

In [12]:
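def main():
    """Split the data, run the grid search, and return the fitted model with its test predictions."""
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model = build_model()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # return the results so they are not discarded when the function exits
    return model, y_test, y_pred


tuned_model, y_test_tuned, y_pred_tuned = main()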

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [13]:
from sklearn.metrics import accuracy_score

In [14]:

def build_report(y_test, y_pred):
    """Return accuracy, weighted precision, and weighted recall for each output category."""

    # compute the metrics for each output category of the dataset
    performances = []
    for i in range(len(y_test.columns)):
        performances.append([accuracy_score(y_test.iloc[:, i].values, y_pred[:, i]),
                         precision_score(y_test.iloc[:, i].values, y_pred[:, i], average='weighted'),
                         recall_score(y_test.iloc[:, i].values, y_pred[:, i], average='weighted')])

    # build dataframe
    performances = pd.DataFrame(performances, columns=['accuracy', 'precision', 'recall'], index=y_test.columns)

    return performances

In [15]:
# y_test and y_pred are the notebook-level variables from In [6] and In [8]
build_report(y_test, y_pred)

  'precision', 'predicted', average, warn_for)


Unnamed: 0,accuracy,precision,recall
related,0.818737,0.805926,0.818737
request,0.894416,0.890426,0.894416
offer,0.99588,0.991778,0.99588
aid_related,0.75801,0.769583,0.75801
medical_help,0.917913,0.884546,0.917913
medical_products,0.954989,0.947208,0.954989
search_and_rescue,0.968874,0.949471,0.968874
security,0.983522,0.983793,0.983522
military,0.968264,0.955652,0.968264
child_alone,1.0,1.0,1.0


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [16]:
np.unique(X)

array(['    ', '                           ', '          .', ...,
       "zone ''salo'' near in the common ''Verettes 4th section ",
       "zone in hunger their killing us in zone, we can't. ..",
       '| News Update | Serious loss of life expected in devastating earthquake in Haiti http ow.ly 16klRU'], dtype=object)

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [18]:
from copy import deepcopy, copy
from functools import wraps

def copy_class(cls):
    copy_cls = type(f'{cls.__name__}', cls.__bases__, dict(cls.__dict__))
    for name, attr in cls.__dict__.items():
        try:
            hash(attr)
        except TypeError:
            # Assume lack of __hash__ implies mutability. This is NOT
            # a bullet proof assumption but good in many cases.
            setattr(copy_cls, name, deepcopy(attr))
    return copy_cls

def upgrade_to_work_with_single_class(SklearnPredictor):
    SklearnPredictor = copy_class(SklearnPredictor)
    original_init = deepcopy(SklearnPredictor.__init__)
    original_fit = deepcopy(SklearnPredictor.fit)
    original_predict = deepcopy(SklearnPredictor.predict)

    @staticmethod
    def _has_only_one_class(y):
        return len(np.unique(y)) == 1

    def _fitted_on_single_class(self):
        return self._single_class_label is not None

    @wraps(SklearnPredictor.__init__)
    def new_init(self, *args, **kwargs):
        self._single_class_label = None
        original_init(self, *args, **kwargs)

    @wraps(SklearnPredictor.fit)
    def new_fit(self, X, y=None):
        if self._has_only_one_class(y):
            self._single_class_label = np.asarray(y)[0]
        else:
            original_fit(self, X, y)
        return self

    @wraps(SklearnPredictor.predict)
    def new_predict(self, X):
        if self._fitted_on_single_class():
            return np.full(X.shape[0], self._single_class_label)
        else:
            return original_predict(self, X)

    setattr(SklearnPredictor, '_has_only_one_class', _has_only_one_class)
    setattr(SklearnPredictor, '_fitted_on_single_class', _fitted_on_single_class)
    SklearnPredictor.__init__ = new_init
    SklearnPredictor.fit = new_fit
    SklearnPredictor.predict = new_predict
    return SklearnPredictor

LogisticRegression = upgrade_to_work_with_single_class(LogisticRegression)

In [19]:
logreg = LogisticRegression(solver='liblinear', multi_class='ovr')

In [20]:
new_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(logreg, n_jobs=-1))
])

In [21]:
# Split datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [22]:
new_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...one, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
           n_jobs=-1))])

### 9. Export your model as a pickle file

In [29]:
import pickle
import joblib

In [31]:
with open('rfc_model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

In [36]:
joblib.dump(new_pipeline, 'logreg_model.pkl')

PicklingError: Can't pickle <class 'sklearn.linear_model.logistic.LogisticRegression'>: it's not the same object as sklearn.linear_model.logistic.LogisticRegression
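
The error arises because `pickle` stores a class by its import path, and step 8 rebinds `LogisticRegression` to a copy that is no longer the object found at that path. A possible alternative, sketched below under that reading, wraps the estimator instead of copying the class. `SingleClassSafe` is a hypothetical name, not a scikit-learn class. To load the resulting pickle elsewhere, the class would need to be defined in an importable module (for example alongside `train.py`).

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier


class SingleClassSafe(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: predict a constant when the training target has one class."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        if len(self.classes_) == 1:
            # Nothing to learn: only one label was seen
            self.estimator_ = None
        else:
            self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        if self.estimator_ is None:
            return np.full(X.shape[0], self.classes_[0])
        return self.estimator_.predict(X)


# Toy data: the second target column contains a single class
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_demo = np.array([[0, 0], [0, 0], [1, 0], [1, 0]])

model = MultiOutputClassifier(SingleClassSafe(LogisticRegression()))
model.fit(X_demo, Y_demo)
print(model.predict(X_demo)[:, 1])  # constant column: [0 0 0 0]
```

Because the library class itself is left untouched, the wrapped estimator is pickled through ordinary import paths.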

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.