# ML Pipeline Preparation
The following workflow will be executed in this notebook:
1. Import common libraries for machine learning pipelines
2. Load the dataset
3. Create a Custom Text Transformer
4. Build a machine learning pipeline
5. Test model performance - 1st iteration
6. Hyperparameter tuning: Improve Model Performance
7. Test model performance - 2nd iteration
8. Train on best hyperparameters
9. Compare Training Results
10. Improve Model Performance: Train another classifier
11. Final Comparison of all Models - 4th iteration
12. Export the model as a pickle file





### 1. Import Required Libraries

In [1]:
# import libraries
import re # for regular expressions
import numpy as np # numeric python, vector operations
import pandas as pd # data manipulation
from nltk.corpus import stopwords # natural language tool kit: stopwords
from nltk.tokenize import word_tokenize # natural language tool kit: word_tokenize
from nltk.stem import WordNetLemmatizer # natural language tool kit: lemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer # for text processing
from sklearn.model_selection import train_test_split # for splitting data into training and testing
from sklearn.multioutput import MultiOutputClassifier   # for multi-output classification
from sklearn.ensemble import RandomForestClassifier # for random forest classifier
from sklearn.pipeline import Pipeline   # for creating a pipeline
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, f1_score # for model evaluation

import warnings 
warnings.filterwarnings('ignore')

###  2. Load the dataset
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [2]:
# load data from database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/disaster_response.db')
df = pd.read_sql_table('features', engine) 
X = df['message']
y = df.iloc[:,4:]

### 3. Create a Custom Text Transformer

In [3]:
# create custom text transformer

def tokenize(text):
    """
    Clean and tokenize a text, returning a list of lemmatized words.
    Designed to be used in a pipeline as the tokenizer of a CountVectorizer,
    followed by a TfidfTransformer.

    Args:
    text: str: a string of text to be tokenized

    Returns:
    words: list: a list of cleaned, tokenized, and lemmatized words
    """
    # Normalize text: lowercase and replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize text
    words = word_tokenize(text)
    
    # Remove stop words (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Lemmatization (reuse a single lemmatizer instance)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w) for w in words]

    # Stemming (not used, does not improve performance)
    # words = [PorterStemmer().stem(w) for w in words]
    
    return words
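
To make the normalization and stop-word steps concrete without the NLTK downloads, here is a minimal standard-library sketch. The tiny `STOP_WORDS` set and the helper name `normalize_and_split` are assumptions for illustration only; the notebook itself relies on NLTK's `word_tokenize`, its English stop-word list, and `WordNetLemmatizer`, none of which are reproduced here (so, for example, `lies` is not lemmatized to `lie`).

```python
import re

# Hypothetical, tiny stop-word list (NLTK's English list is much longer)
STOP_WORDS = frozenset({"the", "a", "of", "and", "in", "for"})


def normalize_and_split(text, stop_words=STOP_WORDS):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # str.split() stands in for word_tokenize in this sketch
    return [word for word in cleaned.split() if word not in stop_words]


tokens = normalize_and_split("In every alley, a mujahid (Islamist fighter) lies in wait!")
print(tokens)
```

Keeping the stop words in a set makes each membership check constant time, which matters when the tokenizer runs on every message in the dataset.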
    

### Test Tokenization Function

In [4]:
# test out function
for message in X.sample(5):
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

In every alley, every corner and every lane of Mogadishu, a mujahid (Islamist fighter) lies in wait for them, the rebels added.
['every', 'alley', 'every', 'corner', 'every', 'lane', 'mogadishu', 'mujahid', 'islamist', 'fighter', 'lie', 'wait', 'rebel', 'added'] 

How is the capital now, because I learned that there were people enjoying themselves pillaging the stores and businesses that were still standing 
['capital', 'learned', 'people', 'enjoying', 'pillaging', 'store', 'business', 'still', 'standing'] 

ANTANANARIVO, March 11 (AFP) - Thirty-six people were killed and 42 were missing in north Madagascar after a storm lashed the region at the weekend, while scores more were feared drowned at sea, rescue services said Thursday.
['antananarivo', 'march', '11', 'afp', 'thirty', 'six', 'people', 'killed', '42', 'missing', 'north', 'madagascar', 'storm', 'lashed', 'region', 'weekend', 'score', 'feared', 'drowned', 'sea', 'rescue', 'service', 'said', 'thursday'] 

As a result of the disru

### 4. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
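
The idea behind a multi-output wrapper can be sketched in plain Python: fit one independent classifier per target column and stack the per-column predictions. The classes `MajorityClassifier` and `OnePerTarget` and the toy data below are hypothetical and are not scikit-learn's implementation; they only illustrate the "one classifier per target" strategy.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OnePerTarget:
    """Fit one independent estimator per target column (hypothetical sketch)."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        columns = list(zip(*Y))  # transpose rows into one tuple per target
        self.estimators_ = [self.make_estimator().fit(X, col) for col in columns]
        return self

    def predict(self, X):
        per_target = [est.predict(X) for est in self.estimators_]
        return [list(row) for row in zip(*per_target)]  # back to one row per sample


X = ["need water", "storm damage", "send food"]
Y = [[1, 0], [0, 0], [1, 1]]  # two made-up target categories

model = OnePerTarget(MajorityClassifier).fit(X, Y)
predictions = model.predict(["new message", "another message"])
print(predictions)
```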

In [5]:
# Pipeline will have 3 steps
# 1. CountVectorizer - Convert a collection of text documents to a matrix of token counts
# 2. TfidfTransformer - Transform a count matrix to a normalized tf or tf-idf representation
# 3. MultiOutputClassifier - This is a simple meta-estimator for fitting one classifier per target.
pipeline = Pipeline([ 
    ('vect', CountVectorizer(tokenizer=tokenize)), # here is where you use the custom text transformer
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(verbose=1)))
])
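
As a rough illustration of what the first two pipeline steps compute, the sketch below builds token counts and then tf-idf weights by hand. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which corresponds to the `smooth_idf=True` and `norm='l2'` settings; the three toy documents are made up, and this is not the library's code.

```python
import math
from collections import Counter

# Hypothetical tokenized messages (stand-ins for the output of tokenize)
docs = [
    ["water", "food", "water"],
    ["food", "shelter"],
    ["water", "medical"],
]

vocab = sorted({token for doc in docs for token in doc})

# Step 1 (CountVectorizer-like): raw token counts per document
counts = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Step 2 (TfidfTransformer-like): smoothed idf, then L2-normalize each row
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf.append([value / norm for value in weighted])

for term, weight in zip(vocab, idf):
    print(f"{term}: idf={weight:.3f}")
```

Terms that appear in fewer documents (here `medical` and `shelter`) receive a larger idf weight than terms that appear in most documents.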

### Train pipeline with the Benchmark model (we will look to improve through more iterations)
- Split data into train and test sets
- Train pipeline

In [6]:
# show params for benchmark model
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x1651182c0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(verbose=1)))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x1651182c0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier(verbose=1)),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': Fal

In [7]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# show shape of the different datasets
print(f'total training observations: {X_train.shape[0]}')
print(f'total testing observations: {X_test.shape[0]}')

total training observations: 20972
total testing observations: 5244
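
The sizes printed above are consistent with rounding the test share up: `ceil(0.2 * 26216) = 5244`, which leaves 20972 rows for training. The standard-library sketch below reproduces that arithmetic with a seeded shuffle so the split is repeatable; the helper name `split_indices` and the seed value are assumptions. In scikit-learn, passing `random_state` to `train_test_split` serves the same reproducibility purpose.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, so the global state is untouched
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(26216)
print(f"total training observations: {len(train_idx)}")
print(f"total testing observations: {len(test_idx)}")
```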


In [8]:
# train classifier
pipeline.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:   11.4s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    9.3s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.3s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:   11.8s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    6.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    4.8s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    3.8s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    3.2s


### 5. Test your model - 1st iteration
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
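
For reference, the three metrics for a single binary category can be computed by hand as in the following sketch. The function name `binary_prf` and the example labels are hypothetical; `classification_report` reports the same quantities per class.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class of one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels for one category: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```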

### Create reusable function to evaluate model performance for all 36 classes

In [None]:
def extract_precision(df,iteration):
    """
    Extract the positive-class precision from the results dataframe and annotate the column for comparison

    Args:
    df: DataFrame: the results dataframe
    iteration: int: the iteration number

    Returns:
    DataFrame: a dataframe containing the label and its precision for this iteration
    """
    # keep only the rows for the positive class ('1') and the label and precision columns
    df = df.loc[df.metric=='1',['label','precision']].reset_index(drop=True)

    # suffix the precision column with the iteration number
    df.columns = ['label', f'precision_{iteration}']

    return df


def evaluate_models(X_test,y_test,estimator,iteration):
    """
    Evaluate the model with a classification report per label and keep the positive-class precision

    Args:
    X_test: the test features
    y_test: the test labels
    estimator: the trained model
    iteration: int: the iteration number, used to suffix the precision column

    Returns:
    results: a dataframe containing the positive-class precision for each label
    """
    # import report that captures appropriate metrics
    from sklearn.metrics import classification_report

    # predict on test data
    y_pred = estimator.predict(X_test)

    # initialize an empty dataframe to store the results
    results = pd.DataFrame()

    # calculate and store the precision, recall, and f1-score for each label
    for i, col in enumerate(y_test.columns):
        report = classification_report(y_test[col], y_pred[:, i], output_dict=True)
        df = pd.DataFrame(report).transpose()
        df['label'] = col
        results = pd.concat([results, df])

    # reset the index of the results dataframe
    results.reset_index(inplace=True)
    results.rename(columns={'index': 'metric'}, inplace=True)

    # rearrange the columns to put 'label' first
    cols = ['label'] + [col for col in results.columns if col != 'label']
    results = results[cols]

    # round the results to 2 decimal places
    results = results.round(2)

    # convert the 'support' column to integers
    results.support = results.support.astype(int)

    # extract the precision from the results dataframe
    results = extract_precision(results,iteration)
    
    return results



In [None]:
# evaluate the model, 1st iteration
results1 = evaluate_models(X_test,y_test,pipeline,1)

results1


### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# use grid search to find better parameters
from sklearn.model_selection import GridSearchCV

# we are limiting the grid to these options (2 x 2 x 2 = 8 combinations), which will take 2 hrs to train
# every additional value multiplies the number of combinations, and each combination is fit once per fold
parameters = {
    'clf__estimator__n_estimators':[5, 10],
    'clf__estimator__max_depth': [3, 5],
    'clf__estimator__min_samples_split': [2, 4]
}

# instantiate grid search object with appropriate parameters
cv = GridSearchCV(pipeline, 
                  param_grid=parameters, 
                  verbose=1, 
                  cv=5, 
                  n_jobs=1, 
                  return_train_score=True, 
                  scoring='f1_weighted')

# train the model
cv.fit(X_train, y_train)
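
To see how quickly a grid grows, the number of candidate settings and cross-validation fits can be counted with the standard library. This sketch only enumerates the combinations for the grid above; it does not train anything. The final refit on the full training set assumes the default `refit=True`.

```python
from itertools import product

parameters = {
    'clf__estimator__n_estimators': [5, 10],
    'clf__estimator__max_depth': [3, 5],
    'clf__estimator__min_samples_split': [2, 4],
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate
combinations = [dict(zip(parameters, values)) for values in product(*parameters.values())]

n_cv_fits = len(combinations) * cv_folds
n_total_fits = n_cv_fits + 1  # plus one refit of the best candidate (assumes refit=True)

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_cv_fits} fits")
print(f"total fits including the final refit: {n_total_fits}")
```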

### 7. Test your model - 2nd iteration
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [None]:
# evaluate the model - 2nd iteration    
results2 = evaluate_models(X_test,y_test,cv,2)

results2


In [None]:
# show the best parameters
print('best parameters:', cv.best_params_)

### 8. Train the model on the best parameters and evaluate performance

In [None]:
# build pipeline
pipeline = Pipeline([ 
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(verbose=1)))
])

# set parameters to the best parameters from grid search
pipeline.set_params(**cv.best_params_)

# fit the model
pipeline.fit(X_train, y_train)
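
The double-underscore names in `best_params_` encode a path through the pipeline: the part before the first `__` is the step name, and the remainder is passed down to that step. The sketch below mimics that routing on a plain dictionary. The parameter values are hypothetical placeholders, not the actual grid search result.

```python
from collections import defaultdict

# Hypothetical values, NOT the notebook's actual cv.best_params_
best_params = {
    "clf__estimator__n_estimators": 10,
    "clf__estimator__max_depth": 5,
    "clf__estimator__min_samples_split": 2,
}

# Split each name at the first "__": step name on the left, nested name on the right
routed = defaultdict(dict)
for name, value in best_params.items():
    step, _, remainder = name.partition("__")
    routed[step][remainder] = value

for step, params in routed.items():
    print(step, params)
```

Here every parameter is routed to the `clf` step, which in turn forwards `estimator__...` names to the wrapped classifier.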

In [None]:
# evaluate the model - 3rd iteration
results3 = evaluate_models(X_test,y_test,pipeline,3)

results3

### 9. Compare training results with the benchmark model
The second model is clearly not catching the true positive classes<br>
We will attempt to train a different classifier to see if we can improve accuracy<br>
<br>
Now we will look closer at the different model results

In [None]:
# consolidate results between 3 models
from functools import reduce

# combine the precision dataframes
dfs = [results1, results2, results3]

# merge the dataframes on the 'label' column
results = reduce(lambda left, right: pd.merge(left, right, on='label'), dfs)

results
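
The `reduce` call folds the list pairwise: it merges the first two tables, then merges that result with the third. A dictionary-based sketch of the same inner join on `label` is shown below; the helper `merge_on_label` and all precision values are made up for illustration.

```python
from functools import reduce


def merge_on_label(left, right):
    """Inner-join two lists of row dicts on the 'label' key."""
    right_by_label = {row["label"]: row for row in right}
    return [
        {**row, **right_by_label[row["label"]]}
        for row in left
        if row["label"] in right_by_label
    ]


# Hypothetical per-iteration results (values are placeholders)
table1 = [{"label": "water", "precision_1": 0.9}, {"label": "food", "precision_1": 0.8}]
table2 = [{"label": "water", "precision_2": 0.7}, {"label": "food", "precision_2": 0.6}]
table3 = [{"label": "food", "precision_3": 0.5}, {"label": "water", "precision_3": 0.4}]

combined = reduce(merge_on_label, [table1, table2, table3])
print(combined)
```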

### 10.  Improve Model Performance: Train another classifier
We will attempt to train the same pipeline with an XGBoost classifier

In [None]:
# use the XGBoost classifier; MultiOutputClassifier fits one XGBoost model per category
from xgboost import XGBClassifier

# create a pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(XGBClassifier()))

])

pipeline.fit(X_train, y_train)

In [None]:
# evaluate the model
results4 = evaluate_models(X_test, y_test, pipeline,4)

results4

### 11.  Final Comparison of all Models - 4th iteration


In [None]:
# merge the results with other iterations
results = pd.merge(results, results4, on='label', how='left')


In [None]:

# extract the average precision per model 
avg_precision = (
                        results.iloc[:,1:]
                        .agg({col: 'mean' for col in results.columns[1:]})
                        .to_frame()
                        .rename(columns={0: 'avg_precision'})
                        .rename_axis('model')
                        .reset_index()
                )

# keep only the per-model precision rows (precision_1 ... precision_4)
avg_precision = avg_precision[avg_precision['model'].str.startswith('precision_')].reset_index(drop=True)

avg_precision

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches  # import the patches module

# create a new column for colors
avg_precision['color'] = avg_precision['model']

# create the figure and draw one bar per model
fig, ax = plt.subplots(figsize=(10, 6))
barplot = sns.barplot(x='model', y='avg_precision', hue='color', data=avg_precision, ax=ax, dodge=False, palette='tab10')

# add labels and title
plt.title('Average Precision by Model')
plt.ylabel('Average Precision')

# change the x labels to match the legend and orient them at a 45-degree angle
x_labels = ['Random Forest 1', 'Random Forest 2', 'Random Forest 3', 'XGBoost']
ax.set_xticklabels(x_labels, rotation=45, ha='right')  # adjust the horizontal alignment

# create a custom legend
legend_labels = ['Random Forest 1', 'Random Forest 2', 'Random Forest 3', 'XGBoost']
legend_colors = [barplot.patches[i].get_facecolor() for i in range(len(legend_labels))]
legend = plt.legend(handles=[mpatches.Patch(color=c) for c in legend_colors], labels=legend_labels, bbox_to_anchor=(1.05, 1), loc='upper left')  # use mpatches.Patch and move the legend outside of the plot

plt.show()

In [None]:
# plot the precision per category for each model
import matplotlib.pyplot as plt

# set the rc parameters for font weight, size, and line width
plt.rcParams['font.weight'] = 'bold'
plt.rcParams['font.size'] = 12
plt.rcParams['lines.linewidth'] = 2

# set the colormap (plt.get_cmap is used because cm.get_cmap was deprecated and later removed from Matplotlib)
cmap = plt.get_cmap('tab10')  # get the 'tab10' colormap

# set the figure size
plt.figure(figsize=(10,10))

# create stacked horizontal bar plots with a different color palette
plt.barh(results.label, results.precision_1, color=cmap(0), alpha=0.5, label='Random Forest 1')
plt.barh(results.label, results.precision_2, left=results.precision_1, color=cmap(1), alpha=0.5, label='Random Forest 2')
plt.barh(results.label, results.precision_3, left=results.precision_1 + results.precision_2, color=cmap(2), alpha=0.5, label='Random Forest 3')
plt.barh(results.label, results.precision_4, left=results.precision_1 + results.precision_2 + results.precision_3, color=cmap(3), alpha=0.5, label='XGBoost')

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

### Summary of Model Training Iterations
1.  Benchmark model: RandomForestClassifier - showed 95% accuracy
2.  Grid Search: We tried 8 hyperparameter combinations (2 x 2 x 2) with 5-fold cross validation.  However, the tuned model seriously underfit and could not detect most of the positive classes.
3.  XGBoost - We trained an XGBoost classifier with its default parameters, as in the cell above, and it showed 95% accuracy.  We saved this model as the final model.
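
A note on reading these numbers: with rare categories, accuracy alone can look high even when a model never predicts the positive class. The sketch below uses made-up labels (5 positives out of 100) to show how 95% accuracy can coexist with zero recall, which is why the comparison in this notebook looks at positive-class precision rather than accuracy.

```python
# Hypothetical imbalanced category: 5 positive messages out of 100
y_true = [1] * 5 + [0] * 95
# A model that always predicts the negative class
y_pred = [0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"recall on the positive class: {recall:.2f}")
```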

### 12. Export your model as a pickle file

In [None]:
# export your model as a pickle file
import joblib
with open('classifier.pkl', 'wb') as file:
    joblib.dump(pipeline, file, compress=5)

# load model
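
A save-and-load round trip can be sketched with the standard-library `pickle` module, which `joblib` builds on. The dictionary `model_like` is a hypothetical stand-in for the fitted pipeline, and a temporary directory is used so nothing is left on disk. With joblib, `joblib.load` is the counterpart to `joblib.dump`. Note also that the real pipeline stores a reference to the `tokenize` function, so that function needs to be defined or importable under the same module name when the file is loaded.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline
model_like = {"n_estimators": 10, "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")

    # save the object
    with open(path, "wb") as file:
        pickle.dump(model_like, file)

    # load it back and check that it matches
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model_like)
```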


### 13. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.