# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pickle
import re
import warnings

import nltk
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# silence library warnings in the notebook output
warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
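# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
data = pd.read_sql_table('InsertTableName', con=engine)

# every column from the fifth onward is treated as a category label
cat = data.columns[4:]

# X: raw message text as a 1-D array; Y: one column per category
X = data['message'].values
Y = data[cat].values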
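
To make the slicing above concrete, here is a minimal sketch on a toy table. The column names (`id`, `original`, `genre`, `water`, `transport`) are hypothetical and only illustrate the assumption that the first four columns are not labels; the real table may be laid out differently.

```python
import pandas as pd

# Hypothetical toy table: four leading non-label columns, then category columns.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "roads blocked", "we are safe"],
    "original": [None, None, None],
    "genre": ["direct", "news", "direct"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

# Same slicing as in the cell above: labels start at the fifth column.
category_names = data.columns[4:]
X = data["message"].values        # 1-D array of message strings
Y = data[category_names].values   # 2-D array, one column per category

print(list(category_names), X.shape, Y.shape)
```

If the label columns do not start at position 4 in your table, selecting them by name is safer than slicing by position.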

### 2. Write a tokenization function to process your text data

In [3]:
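def tokenize(text):
    """Lowercase, strip punctuation, tokenize, drop English stop words, and stem."""
    # replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    tokens = word_tokenize(text)

    stemmer = PorterStemmer()
    # a set makes the stop-word membership test O(1) per token
    stop = set(stopwords.words("english"))

    return [stemmer.stem(word) for word in tokens if word not in stop]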
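
The first step of `tokenize` can be tried without NLTK. The sketch below isolates the regular-expression normalization and uses `str.split` as a simplified stand-in for `word_tokenize`; the helper name `normalize` and the sample sentence are illustrative only, and no stop-word removal or stemming happens here.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Water needed!! Please send help to #Port-au-Prince..."
tokens = normalize(sample).split()  # whitespace split stands in for word_tokenize
print(tokens)
```

Note that hyphenated names are broken into separate tokens, which is a direct consequence of treating every non-alphanumeric character as a separator.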

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(class_weight='balanced')))
])
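
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on toy numeric data and checks that one copy of the base estimator is trained per target column. The data, shapes, and the names `X_toy`, `Y_toy`, and `clf` are made up for the example; the text vectorization steps are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data (hypothetical): 12 samples, 4 numeric features, 3 binary target columns.
rng = np.random.RandomState(0)
X_toy = rng.rand(12, 4)
Y_toy = np.array([[i % 2, (i // 2) % 2, (i // 3) % 2] for i in range(12)])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_toy, Y_toy)

n_fitted = len(clf.estimators_)       # one fitted forest per target column
pred_shape = clf.predict(X_toy).shape  # predictions keep the (samples, targets) layout
print(n_fitted, pred_shape)
```

Because each column gets its own independent estimator, training cost grows roughly linearly with the number of categories.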

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=30)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None,
            verbose=0, warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [6]:
def multiout_class_report(y_true, y_pred):
    """Print accuracy and support-weighted precision, recall, and F1 for each category in `cat`."""
    for i, name in enumerate(cat):
        print(name)
        print("Accuracy: {:.3f}\t Precision: {:.3f}\t Recall: {:.3f}\t F1_score: {:.3f}\n ".format(
            accuracy_score(y_true[:, i], y_pred[:, i]),
            precision_score(y_true[:, i], y_pred[:, i], average='weighted'),
            recall_score(y_true[:, i], y_pred[:, i], average='weighted'),
            f1_score(y_true[:, i], y_pred[:, i], average='weighted')
        ))
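
One property of `average='weighted'` is worth keeping in mind when reading the reports: support-weighted recall is arithmetically identical to accuracy, so it can hide poor performance on a rare class. The sketch below reproduces the definition by hand in plain Python; the helper names and the toy labels are illustrative and not part of the notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(label, y_true, y_pred):
    """Recall of a single class: correctly predicted members / all true members."""
    members = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for t, p in members if p == label) / len(members)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    return sum((n / len(y_true)) * recall_for(label, y_true, y_pred)
               for label, n in support.items())

# Imbalanced toy column: 18 negatives, 2 positives, and a model that always predicts 0.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20

acc = accuracy(y_true, y_pred)
w_recall = weighted_recall(y_true, y_pred)
pos_recall = recall_for(1, y_true, y_pred)
print(acc, w_recall, pos_recall)
```

For imbalanced categories, reporting the positive-class recall (or a per-class `classification_report`) alongside the weighted figures gives a clearer picture.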

In [7]:
# these scores are computed on the training set, so they measure fit rather than generalization
y_pred = pipeline.predict(X_train)

multiout_class_report(y_train, y_pred)

related
Accuracy: 0.989	 Precision: 0.989	 Recall: 0.989	 F1_score: 0.989
 
request
Accuracy: 0.988	 Precision: 0.988	 Recall: 0.988	 F1_score: 0.987
 
offer
Accuracy: 0.998	 Precision: 0.998	 Recall: 0.998	 F1_score: 0.998
 
aid_related
Accuracy: 0.985	 Precision: 0.985	 Recall: 0.985	 F1_score: 0.985
 
medical_help
Accuracy: 0.988	 Precision: 0.988	 Recall: 0.988	 F1_score: 0.987
 
medical_products
Accuracy: 0.992	 Precision: 0.992	 Recall: 0.992	 F1_score: 0.992
 
search_and_rescue
Accuracy: 0.993	 Precision: 0.993	 Recall: 0.993	 F1_score: 0.993
 
security
Accuracy: 0.995	 Precision: 0.995	 Recall: 0.995	 F1_score: 0.995
 
military
Accuracy: 0.996	 Precision: 0.996	 Recall: 0.996	 F1_score: 0.996
 
child_alone
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
water
Accuracy: 0.995	 Precision: 0.995	 Recall: 0.995	 F1_score: 0.995
 
food
Accuracy: 0.994	 Precision: 0.994	 Recall: 0.994	 F1_score: 0.994
 
shelter
Accuracy: 0.993	 Precision: 0.993	 Recall: 0.993	 F1_

### 6. Improve your model
Use grid search to find better parameters. 

In [8]:
parameters = {
    "clf__estimator__n_estimators": [25, 50, 100],
    "clf__estimator__min_samples_split": [2, 3],
}

cv = GridSearchCV(pipeline, param_grid=parameters)
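
Before launching the search, it can help to estimate how many model fits it implies. The sketch below enumerates the grid with `itertools.product`. The fold count of 3 is an assumption (the default for `cv=None` depends on the scikit-learn version), and the extra fit accounts for `refit=True`.

```python
from itertools import product

parameters = {
    "clf__estimator__n_estimators": [25, 50, 100],
    "clf__estimator__min_samples_split": [2, 3],
}

# Enumerate every combination of parameter values in the grid.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_folds = 3  # assumption: 3-fold cross-validation; newer scikit-learn defaults to 5
total_fits = len(candidates) * n_folds + 1  # +1 for the final refit on all training data
print(len(candidates), total_fits)
```

Each of these fits trains one forest per category, so even a small grid can take a long time on the full dataset.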

In [9]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None,
            verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [25, 50, 100], 'clf__estimator__min_samples_split': [2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [10]:
# training-set scores for the tuned model
y_pred = cv.predict(X_train)
multiout_class_report(y_train, y_pred)

related
Accuracy: 0.997	 Precision: 0.997	 Recall: 0.997	 F1_score: 0.997
 
request
Accuracy: 0.999	 Precision: 0.999	 Recall: 0.999	 F1_score: 0.999
 
offer
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
aid_related
Accuracy: 0.998	 Precision: 0.998	 Recall: 0.998	 F1_score: 0.998
 
medical_help
Accuracy: 0.999	 Precision: 0.999	 Recall: 0.999	 F1_score: 0.999
 
medical_products
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
search_and_rescue
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
security
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
military
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
child_alone
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
water
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
food
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
shelter
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_

In [11]:
# held-out test-set scores for the tuned model
y_pred = cv.predict(X_test)
multiout_class_report(y_test, y_pred)

related
Accuracy: 0.823	 Precision: 0.813	 Recall: 0.823	 F1_score: 0.813
 
request
Accuracy: 0.903	 Precision: 0.898	 Recall: 0.903	 F1_score: 0.898
 
offer
Accuracy: 0.996	 Precision: 0.993	 Recall: 0.996	 F1_score: 0.995
 
aid_related
Accuracy: 0.782	 Precision: 0.782	 Recall: 0.782	 F1_score: 0.782
 
medical_help
Accuracy: 0.922	 Precision: 0.902	 Recall: 0.922	 F1_score: 0.900
 
medical_products
Accuracy: 0.951	 Precision: 0.940	 Recall: 0.951	 F1_score: 0.933
 
search_and_rescue
Accuracy: 0.975	 Precision: 0.976	 Recall: 0.975	 F1_score: 0.964
 
security
Accuracy: 0.982	 Precision: 0.973	 Recall: 0.982	 F1_score: 0.973
 
military
Accuracy: 0.969	 Precision: 0.961	 Recall: 0.969	 F1_score: 0.960
 
child_alone
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
water
Accuracy: 0.956	 Precision: 0.953	 Recall: 0.956	 F1_score: 0.947
 
food
Accuracy: 0.943	 Precision: 0.939	 Recall: 0.943	 F1_score: 0.939
 
shelter
Accuracy: 0.940	 Precision: 0.934	 Recall: 0.940	 F1_
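
Comparing the training and test reports for the tuned model gives a quick sense of how much the model overfits. The sketch below uses accuracy values copied from three of the categories printed above and ranks them by the train/test gap; the dictionary layout and variable names are illustrative.

```python
# (train accuracy, test accuracy) copied from the tuned-model reports above
reported = {
    "related": (0.997, 0.823),
    "request": (0.999, 0.903),
    "aid_related": (0.998, 0.782),
}

# A large positive gap suggests the model memorizes training messages for that category.
gaps = {name: round(train - test, 3) for name, (train, test) in reported.items()}
worst = max(gaps, key=gaps.get)
print(gaps, worst)
```

Categories with the widest gap are natural candidates for stronger regularization, for example a larger `min_samples_split` or a limit on tree depth.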

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [15]:
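pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'))))
])

parameters_ada = {
    "clf__estimator__n_estimators": [25, 50, 100],
    "clf__estimator__min_samples_split": [2, 3],
}

# Caution: this call searches `pipeline` (the random forest), not `pipeline_ada`.
# To tune AdaBoost, pass `pipeline_ada` and address tree parameters through the
# base estimator, e.g. "clf__estimator__base_estimator__min_samples_split".
cv_ada = GridSearchCV(pipeline, param_grid=parameters_ada)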
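
Grid keys follow the path `step__parameter`, with one extra `estimator__` level for the model wrapped by `MultiOutputClassifier`. A cheap way to catch a key that does not exist is to compare the grid against `get_params()` before fitting. The sketch below does this for an AdaBoost pipeline built with default settings; `demo_pipeline` and `candidate_grid` are names made up for the example, and the custom tokenizer is omitted so the snippet runs without NLTK.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(AdaBoostClassifier())),
])

# Every tunable parameter of the pipeline, including nested ones.
valid_keys = set(demo_pipeline.get_params().keys())

candidate_grid = {
    "clf__estimator__n_estimators": [25, 50, 100],
    "clf__estimator__min_samples_split": [2, 3],
}

# Keys that GridSearchCV would reject for this pipeline.
unknown_keys = sorted(key for key in candidate_grid if key not in valid_keys)
print(unknown_keys)
```

Here `min_samples_split` is flagged because it belongs to the decision tree inside AdaBoost rather than to `AdaBoostClassifier` itself.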

In [16]:
np.random.seed(34)
cv_ada.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None,
            verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [25, 50, 100], 'clf__estimator__min_samples_split': [2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [17]:
cv_ada.best_params_

{'clf__estimator__min_samples_split': 3, 'clf__estimator__n_estimators': 100}

In [18]:
y_pred = cv_ada.predict(X_train)
multiout_class_report(y_train, y_pred)

related
Accuracy: 0.997	 Precision: 0.997	 Recall: 0.997	 F1_score: 0.997
 
request
Accuracy: 0.999	 Precision: 0.999	 Recall: 0.999	 F1_score: 0.999
 
offer
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
aid_related
Accuracy: 0.998	 Precision: 0.998	 Recall: 0.998	 F1_score: 0.998
 
medical_help
Accuracy: 0.999	 Precision: 0.999	 Recall: 0.999	 F1_score: 0.999
 
medical_products
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
search_and_rescue
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
security
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
military
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
child_alone
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
water
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
food
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
shelter
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_

In [19]:
y_pred = cv_ada.predict(X_test)
multiout_class_report(y_test, y_pred)

related
Accuracy: 0.822	 Precision: 0.812	 Recall: 0.822	 F1_score: 0.813
 
request
Accuracy: 0.904	 Precision: 0.899	 Recall: 0.904	 F1_score: 0.898
 
offer
Accuracy: 0.996	 Precision: 0.993	 Recall: 0.996	 F1_score: 0.995
 
aid_related
Accuracy: 0.783	 Precision: 0.784	 Recall: 0.783	 F1_score: 0.784
 
medical_help
Accuracy: 0.922	 Precision: 0.901	 Recall: 0.922	 F1_score: 0.899
 
medical_products
Accuracy: 0.950	 Precision: 0.936	 Recall: 0.950	 F1_score: 0.931
 
search_and_rescue
Accuracy: 0.976	 Precision: 0.976	 Recall: 0.976	 F1_score: 0.964
 
security
Accuracy: 0.982	 Precision: 0.973	 Recall: 0.982	 F1_score: 0.973
 
military
Accuracy: 0.970	 Precision: 0.962	 Recall: 0.970	 F1_score: 0.960
 
child_alone
Accuracy: 1.000	 Precision: 1.000	 Recall: 1.000	 F1_score: 1.000
 
water
Accuracy: 0.959	 Precision: 0.956	 Recall: 0.959	 F1_score: 0.953
 
food
Accuracy: 0.946	 Precision: 0.943	 Recall: 0.946	 F1_score: 0.943
 
shelter
Accuracy: 0.940	 Precision: 0.934	 Recall: 0.940	 F1_

### 9. Export your model as a pickle file

In [21]:
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv, f)
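
A quick round trip is a simple way to confirm the export worked. The sketch below writes to a temporary directory and uses a plain dictionary as a stand-in for the fitted search object, so the names and contents are illustrative only. For the real model, keep in mind that pickle stores functions by reference, so `tokenize` must be importable wherever the file is loaded.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical contents).
model_stand_in = {"kind": "stand-in", "categories": ["water", "food"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.pkl")

    # Context managers close the file even if dumping or loading fails.
    with open(model_path, "wb") as f:
        pickle.dump(model_stand_in, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```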

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
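
The template itself is not shown here, so the following is only a sketch of how such a script might accept its inputs. The argument names `database_filepath` and `model_filepath` and the function `parse_args` are hypothetical and may not match the template.

```python
import argparse

def parse_args(argv=None):
    """Parse the two file paths the training script needs (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database to read")
    parser.add_argument("model_filepath", help="path of the pickle file to write")
    return parser.parse_args(argv)

# Example invocation with explicit arguments instead of the real command line.
args = parse_args(["InsertDatabaseName.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

The script body would then run the notebook steps in order: load the data, build the pipeline, fit, report, and save the model.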