# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [7]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

# import libraries
import numpy as np
import pandas as pd
import pickle
import statistics
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import FeatureUnion
from custom_transformer import StartingVerbExtractor

# disable warnings
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\P335437\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\P335437\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# load data from database
engine = create_engine('sqlite:///test.db')
df = pd.read_sql_table('Responses', engine)
X = df.message.values
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1).values

# df.genre.value_counts()

### 2. Write a tokenization function to process your text data

In [9]:
def tokenize(text):
    # tokenize text
    word_list = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # iterate through each token of word list
    clean_tokens = []
    for tok in word_list:
        # lemmatize, normalize case, and remove leading/trailing white space
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
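
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a tiny synthetic dataset. The data, `n_estimators=10` and `random_state=0` are assumptions made for this example, not values from the project. The wrapper trains one copy of the base estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data (illustrative only): 8 samples, 2 features, 3 binary target columns
rng = np.random.RandomState(0)
X_demo = rng.rand(8, 2)
Y_demo = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
])

multi_clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
multi_clf.fit(X_demo, Y_demo)

# One fitted estimator per target column
n_fitted = len(multi_clf.estimators_)

# Predictions have one column per target, like Y_demo
pred_shape = multi_clf.predict(X_demo).shape
print(n_fitted, pred_shape)
```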

In [10]:
# build pipeline containing a CountVectorizer, TfidfTransformer and a
# MultiOutputClassifier using Random Forest as an estimator
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])   

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
# splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y)

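# number of training rows to use; None uses all rows
# (an end index of -1 would silently drop the last row)
num_rows = None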

# convert data using NLP and train classifier
pipeline.fit(X_train[0:num_rows], y_train[0:num_rows])
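
The `num_rows` slice controls how many training rows reach `fit`. A small sketch of how Python treats the end index of a slice (the list `rows` is made up for illustration): `-1` stops before the last element, while `None` keeps everything.

```python
rows = ['a', 'b', 'c', 'd']  # stand-in for training rows

drops_last = rows[0:-1]   # end index -1 excludes the final element
keeps_all = rows[0:None]  # end index None runs to the end of the sequence
first_two = rows[0:2]     # a positive end index keeps a subset for quick experiments

print(drops_last, keeps_all, first_two)
```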


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
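
A minimal sketch of that per-column loop, using made-up label arrays rather than the project data. `output_dict=True` returns the report as nested dictionaries, which makes individual scores easy to pick out, and `zero_division=0` is an assumption here to keep categories without predicted positives from raising warnings.

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy labels (illustrative only): 4 samples, 2 output categories
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [1, 1], [1, 0]])

reports = []
for col in range(y_true_demo.shape[1]):
    # One report per output category (column)
    report = classification_report(
        y_true_demo[:, col],
        y_pred_demo[:, col],
        output_dict=True,
        zero_division=0,
    )
    reports.append(report)
    weighted_f1 = report['weighted avg']['f1-score']
    print(f"category {col}: weighted f1 = {weighted_f1:.3f}")
```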

In [12]:
def test_model(model, X_test, y_test):
    '''
    Function to determine model performance scores
    Args:
        model - model to be investigated
        X_test - testing data for model input
        y_test - true labels for the testing data
    Returns:
        y_pred: prediction of the model
    '''
    # make predictions
    y_pred = model.predict(X_test)

    # calculate scores for each Class
    reports_list = []
    precisions_list = []
    recalls_list = []
    f1_scores_list = []

    for i in range(y_test.shape[1]):
        reports_list.append(classification_report(y_test[:, i], y_pred[:, i], output_dict=True))
        precisions_list.append(precision_score(y_test[:, i], y_pred[:, i], average='weighted'))
        recalls_list.append(recall_score(y_test[:, i], y_pred[:, i], average='weighted'))
        f1_scores_list.append(f1_score(y_test[:, i], y_pred[:, i], average='weighted'))

    # print best Parameters (only a fitted GridSearchCV has best_params_)
    try:
        print("\nBest Parameters:", model.best_params_)
    except AttributeError:
        print("\nBest Parameters cant be determined since no GridSearch was used")

    # print mean element-wise accuracy over all Classes
    accuracy = (y_pred == y_test).mean()
    print("Accuracy:", accuracy)

    # print mean precision over all Classes
    precision = statistics.mean(precisions_list)
    print("Precision:", precision)

    # print mean recall over all Classes
    recall = statistics.mean(recalls_list)
    print("Recall:", recall)

    # print mean F1 score over all Classes
    try:
        f1 = statistics.mean(f1_scores_list)
        print("F1-score:", f1)
    except Exception as e:
        print("F1 score could not be determined")
        print(e)

    return y_pred

In [13]:
# test initial pipeline
test_model(pipeline, X_test, y_test)


Best Parameters cant be determined since no GridSearch was used
Accuracy: 0.9467246465262944
Precision: 0.9371091343193743
Recall: 0.9467246465262944
F1-score: 0.9329718254069351


array([[1, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)
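
The accuracy and recall printed above are identical. That is expected: for a single label column, support-weighted recall reduces to plain accuracy, so its mean over columns reproduces the element-wise accuracy. A small check on made-up labels (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels (illustrative only): 5 samples, 2 output categories
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [1, 1], [1, 0], [1, 1]])

# Element-wise accuracy, computed the same way as in test_model
elementwise_accuracy = float((y_pred_demo == y_true_demo).mean())

# Mean of the per-column weighted recall, computed the same way as in test_model
weighted_recalls = [
    recall_score(y_true_demo[:, i], y_pred_demo[:, i], average='weighted')
    for i in range(y_true_demo.shape[1])
]
mean_weighted_recall = float(np.mean(weighted_recalls))

print(elementwise_accuracy, mean_weighted_recall)
```

Because many categories are rare, element-wise and weighted scores can look high even when few positive labels are found, so the per-class entries of `classification_report` are worth checking as well.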

### 6. Improve your model
Use grid search to find better parameters. 
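
Grid-search keys follow scikit-learn's `<step>__<parameter>` naming, with one extra level for each nested estimator. The sketch below rebuilds a pipeline of the same shape (without the custom tokenizer, to stay self-contained) and shows that such a key can be read and set directly; `demo_pipeline` and the value `50` are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# 'clf' is the pipeline step, 'estimator' is the model wrapped by
# MultiOutputClassifier, and 'n_estimators' is a RandomForestClassifier parameter.
param_names = list(demo_pipeline.get_params().keys())
key_exists = 'clf__estimator__n_estimators' in param_names
print(key_exists)

# set_params accepts the same keys that GridSearchCV varies
demo_pipeline.set_params(clf__estimator__n_estimators=50)
n_trees = demo_pipeline.get_params()['clf__estimator__n_estimators']
print(n_trees)
```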

In [14]:
# determine parameters of the pipeline available for GridSearch
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001211C1F63B0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001211C1F63B0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,


In [15]:
# define the parameters to be investigated via Gridsearch
parameters = {
    'clf__estimator__n_estimators': [100, 200],
    'clf__estimator__min_samples_split': [2, 3, 4]
}

cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1)
cv.fit(X_train[0:num_rows], y_train[0:num_rows])

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [None]:
# test Random Forest pipeline
test_model(cv, X_test, y_test)


Best Parameters: {'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 80}
Accuracy: 0.9455591157223748
Precision: 0.9361997258750273
Recall: 0.9455591157223748
F1-score: 0.9315481148624439


array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
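
One way to add a feature alongside TF-IDF is a small custom transformer combined through `FeatureUnion`. The sketch below is hypothetical: `MessageLengthExtractor` is a made-up example, not the `StartingVerbExtractor` imported from `custom_transformer`, whose implementation is not shown in this notebook. `TfidfVectorizer` is used for brevity in place of `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric feature, the word count per message."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the transformer pipeline-compatible
        return self

    def transform(self, X):
        # Return a 2-D array of shape (n_samples, 1) so it can be stacked
        # next to the TF-IDF columns
        return np.array([[len(text.split())] for text in X])


demo_messages = ['we need water', 'roads are blocked near the bridge']

features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', MessageLengthExtractor()),
])
combined = features.fit_transform(demo_messages)

length_feature = MessageLengthExtractor().transform(demo_messages).tolist()
print(combined.shape, length_feature)
```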

In [None]:
# build pipeline containing a CountVectorizer, TfidfTransformer and a
# MultiOutputClassifier using AdaBoost as an estimator
Ada_pipeline = Pipeline([
    ('features', FeatureUnion([

                ('text_pipeline', Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ])),
            ])),

    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

# determine parameters of the pipeline available for GridSearch
Ada_pipeline.get_params()

{'memory': None,
 'steps': [('features',
   FeatureUnion(transformer_list=[('text_pipeline',
                                   Pipeline(steps=[('vect',
                                                    CountVectorizer(tokenizer=<function tokenize at 0x00000186D26996C0>)),
                                                   ('tfidf',
                                                    TfidfTransformer())]))])),
  ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier()))],
 'verbose': False,
 'features': FeatureUnion(transformer_list=[('text_pipeline',
                                 Pipeline(steps=[('vect',
                                                  CountVectorizer(tokenizer=<function tokenize at 0x00000186D26996C0>)),
                                                 ('tfidf',
                                                  TfidfTransformer())]))]),
 'clf': MultiOutputClassifier(estimator=AdaBoostClassifier()),
 'features__n_jobs': None,
 'features__transformer_list': [

In [None]:
# define the parameters to be investigated via Gridsearch
parameters = {
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__learning_rate': [0.01, 0.1]    
}

Ada_pipeline = GridSearchCV(Ada_pipeline, param_grid=parameters, n_jobs=-1)
Ada_pipeline.fit(X_train[0:num_rows], y_train[0:num_rows])

In [None]:
# test Ada_pipeline
test_model(Ada_pipeline, X_test, y_test)


Best Parameters: {'clf__estimator__learning_rate': 0.1, 'clf__estimator__n_estimators': 30}
Accuracy: 0.9388244668226359
Precision: 0.9237393669182867
Recall: 0.9388244668226359
F1-score: 0.9198448176735545


array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [None]:
# build pipeline containing a CountVectorizer, StartingVerbExtractor and a
# MultiOutputClassifier using AdaBoost as an estimator
Ada_SVE_pipeline = Pipeline([
    ('features', FeatureUnion([

                ('text_pipeline', Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ])),

                ('starting_verb', StartingVerbExtractor())
            ])),

    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

# determine parameters of the pipeline available for GridSearch
Ada_SVE_pipeline.get_params()

{'memory': None,
 'steps': [('features',
   FeatureUnion(transformer_list=[('text_pipeline',
                                   Pipeline(steps=[('vect',
                                                    CountVectorizer(tokenizer=<function tokenize at 0x00000186D26996C0>)),
                                                   ('tfidf',
                                                    TfidfTransformer())])),
                                  ('starting_verb', StartingVerbExtractor())])),
  ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier()))],
 'verbose': False,
 'features': FeatureUnion(transformer_list=[('text_pipeline',
                                 Pipeline(steps=[('vect',
                                                  CountVectorizer(tokenizer=<function tokenize at 0x00000186D26996C0>)),
                                                 ('tfidf',
                                                  TfidfTransformer())])),
                                ('starting_ver

In [None]:
# define the parameters to be investigated via Gridsearch
parameters = {
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__learning_rate': [0.01, 0.1]    
}

Ada_SVE_pipeline = GridSearchCV(Ada_SVE_pipeline, param_grid=parameters, n_jobs=-1)
Ada_SVE_pipeline.fit(X_train[0:num_rows], y_train[0:num_rows])

KeyboardInterrupt: 

In [None]:
# test Ada_SVE_pipeline
test_model(Ada_SVE_pipeline, X_test, y_test)


Best Parameters: {'clf__estimator__learning_rate': 0.1, 'clf__estimator__n_estimators': 50}
Accuracy: 0.9412402943071237
Precision: 0.9250339511750043
Recall: 0.9412402943071237
F1-score: 0.9248711278449603


array([[1, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

### 9. Export your model as a pickle file

In [None]:
# Saving the trained model as a pickle file
with open('trained_model.pkl', 'wb') as file:
    pickle.dump(Ada_SVE_pipeline, file)
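
To reuse the exported model, it can be loaded back with `pickle.load`. The sketch below round-trips a stand-in dictionary rather than the real pipeline; the object and the temporary file location are assumptions for illustration. Pickle stores classes and functions by reference, so anything custom used by the pipeline, such as `tokenize` or `StartingVerbExtractor`, has to be importable in the process that loads the file.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (illustrative only)
stand_in_model = {'name': 'demo', 'params': {'n_estimators': 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'trained_model.pkl')

    # Save, mirroring the export step
    with open(model_path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # Load it back from disk
    with open(model_path, 'rb') as file:
        restored_model = pickle.load(file)

print(restored_model == stand_in_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.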

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.