# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, f1_score

import warnings 
warnings.filterwarnings('ignore')

In [2]:
# load data from database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/disaster_response.db')
df = pd.read_sql_table('features', engine) 
X = df['message']
y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    # Normalize text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize text
    words = word_tokenize(text)
    
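    # Remove stop words (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Lemmatization (reuse a single lemmatizer instance)
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w) for w in words]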

    # Stemming
    # stemmed = [PorterStemmer().stem(w) for w in lemmed]
    
    return lemmed
    

### Test Tokenization Function

In [4]:
# test out function
for message in X.sample(5):
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Please, we need food, medication etc. WE are in the croix-des-mission, Rail road, original impass ( dead end ) #17. Thanks
['please', 'need', 'food', 'medication', 'etc', 'croix', 'de', 'mission', 'rail', 'road', 'original', 'impass', 'dead', 'end', '17', 'thanks'] 

I need more infos about the country 
['need', 'info', 'country'] 

Big problem that we have in Barade, we are asking Digi to see what it can do for us so we do not die of food starvation. Ok Thank you.
['big', 'problem', 'barade', 'asking', 'digi', 'see', 'u', 'die', 'food', 'starvation', 'ok', 'thank'] 

In another Bangkok hospital, doctors discovered the case of a 35-year-old man who had also inhaled seawater, but whose infection had produced a large amount of "green-colored purulent material" and sand packed into his sinuses.
['another', 'bangkok', 'hospital', 'doctor', 'discovered', 'case', '35', 'year', 'old', 'man', 'also', 'inhaled', 'seawater', 'whose', 'infection', 'produced', 'large', 'amount', 'green', 'colored'
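
As a rough illustration of the normalization and stop-word steps only, the sketch below uses nothing but the standard library. It is not the notebook's `tokenize`: splitting on whitespace stands in for `word_tokenize`, there is no lemmatization, and `STOP_WORDS` is a tiny hypothetical set rather than NLTK's English list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English stop-word list.
STOP_WORDS = {"we", "are", "in", "the", "a", "i"}

def simple_tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Whitespace split is an assumption standing in for word_tokenize
    return [w for w in text.split() if w not in STOP_WORDS]

tokens = simple_tokenize("WE are in the croix-des-mission, Rail road #17. Thanks")
print(tokens)
# ['croix', 'des', 'mission', 'rail', 'road', '17', 'thanks']
```

In the notebook output above the same message yields `'de'` instead of `'des'`, because the lemmatizer runs after this step.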

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
# Pipeline will have 3 steps
# 1. CountVectorizer - Convert a collection of text documents to a matrix of token counts
# 2. TfidfTransformer - Transform a count matrix to a normalized tf or tf-idf representation
# 3. MultiOutputClassifier - This is a simple meta-estimator for fitting one classifier per target.
pipeline = Pipeline([ 
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(verbose=3)))
])
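
To make the first two pipeline steps more concrete, here is a minimal pure-Python sketch of what counting tokens and then applying tf-idf weighting amounts to. It assumes scikit-learn's default settings for `TfidfTransformer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Step 1 (CountVectorizer-like): raw term counts per document
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for c in counts for term in c)
    # Smoothed idf, assumed to match TfidfTransformer's defaults
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        # Step 2 (TfidfTransformer-like): L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Toy tokenized messages (hypothetical)
docs = [["need", "food", "water"], ["need", "shelter"], ["food", "food", "medicine"]]
weights = tfidf(docs)
for row in weights:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents (such as `water`) end up with a larger weight than terms shared across documents (such as `need`), which is the property the classifier step relies on.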

### 4. Train pipeline with the Benchmark model (we will look to improve through more iterations)
- Split data into train and test sets
- Train pipeline

In [44]:
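# show params for the pipeline
# note: this cell was last run as In [44], after `pipeline` had been rebuilt with
# XGBClassifier in step 8, so the output below lists XGBClassifier rather than
# the benchmark RandomForestClassifier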
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x164b30400>)),
  ('tfidf', TfidfTransformer()),
  ('clf',
   MultiOutputClassifier(estimator=XGBClassifier(base_score=None, booster=None,
                                                 callbacks=None,
                                                 colsample_bylevel=None,
                                                 colsample_bynode=None,
                                                 colsample_bytree=None,
                                                 device=None,
                                                 early_stopping_rounds=None,
                                                 enable_categorical=False,
                                                 eval_metric=None,
                                                 feature_types=None, gamma=None,
                                                 grow_policy=None,
                                                 importance_ty

In [42]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# show shape of the different datasets
print(f'total training observations: {X_train.shape[0]}')
print(f'total testing observations: {X_test.shape[0]}')

total training observations: 20972
total testing observations: 5244


In [None]:

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
y_pred = pipeline.predict(X_test)

In [7]:
# evaluate model accuracy: fraction of correct predictions per label, averaged over all labels
accuracy = (y_pred == y_test).mean().mean()
print('Accuracy: {:.4f}'.format(accuracy))


Accuracy: 0.9482


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
# evaluate model
from sklearn.metrics import multilabel_confusion_matrix

for i, col in enumerate(y_test.columns):
    print('label:', col)
    print(classification_report(y_test[col], y_pred[:, i]))
    print('Accuracy: {:.2f}'.format(accuracy_score(y_test[col], y_pred[:, i])))
    print('F1 Score: {:.2f}'.format(f1_score(y_test[col], y_pred[:, i], average='weighted')))
    print()
    print('Confusion Matrix:\n ', multilabel_confusion_matrix(y_test[col], y_pred[:, i]))
    print('------------------------------------------------------')

label: related
              precision    recall  f1-score   support

           0       0.70      0.41      0.52      1200
           1       0.84      0.94      0.89      4001
           2       0.39      0.33      0.35        43

    accuracy                           0.82      5244
   macro avg       0.64      0.56      0.59      5244
weighted avg       0.80      0.82      0.80      5244

Accuracy: 0.82
F1 Score: 0.80

Confusion Matrix:
  [[[3830  214]
  [ 702  498]]

 [[ 527  716]
  [ 221 3780]]

 [[5179   22]
  [  29   14]]]
------------------------------------------------------
label: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      4354
           1       0.80      0.49      0.61       890

    accuracy                           0.89      5244
   macro avg       0.85      0.73      0.78      5244
weighted avg       0.89      0.89      0.88      5244

Accuracy: 0.89
F1 Score: 0.88

Confusion Matrix:
  [[[ 440  450]
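
As a sanity check on how the report figures relate to the confusion matrix, the sketch below recomputes precision, recall and F1 from raw counts. The counts are those of class 2 of `related` shown above (TP=14, FP=22, FN=29). The helper name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is a choice made in this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts for class 2 of `related`, read from the confusion matrix above
p, r, f = precision_recall_f1(tp=14, fp=22, fn=29)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.39 recall=0.33 f1=0.35
```

These values agree with the `2` row of the `related` classification report.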
 

### Create Dataframe to Test Accuracy for Each Label

In [9]:
# Calculate metrics for each column
from sklearn.metrics import multilabel_confusion_matrix

# Initialize lists to store metrics
cols = []
accuracies = []
tps = []
tns = []
fps = []
fns = []

# Calculate metrics for each column
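for i, col in enumerate(y_test.columns):
    # one-vs-rest confusion matrix per class found in this column
    mcm = multilabel_confusion_matrix(y_test[col], y_pred[:, i])
    # keep the counts for the last class only (class 1 for binary labels,
    # class 2 for `related`, class 0 for the single-class `child_alone`)
    tn, fp, fn, tp = mcm[-1].ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)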
    
    # Append metrics to lists
    cols.append(f'{col}')
    accuracies.append(accuracy)
    tps.append(tp)
    tns.append(tn)
    fps.append(fp)
    fns.append(fn)

# Create DataFrame
df = pd.DataFrame({
    'Label': cols,
    'Accuracy': accuracies,
    'TP': tps,
    'TN': tns,
    'FP': fps,
    'FN': fns
})

# capture the results from training the random forest model
# compare later with different hyperparameters and models
random_forrest_results_1 = df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

random_forrest_results_1

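|   | Label | Accuracy | TP | TN | FP | FN |
|---|---|---|---|---|---|---|
| 0 | child_alone | 1.0 | 5244 | 0 | 0 | 0 |
| 1 | shops | 0.996377 | 0 | 5225 | 0 | 19 |
| 2 | offer | 0.995233 | 0 | 5219 | 0 | 25 |
| 3 | tools | 0.992754 | 0 | 5206 | 0 | 38 |
| 4 | hospitals | 0.990465 | 0 | 5194 | 0 | 50 |
| 5 | related | 0.990275 | 14 | 5179 | 22 | 29 |
| 6 | missing_people | 0.988558 | 1 | 5183 | 0 | 60 |
| 7 | fire | 0.987605 | 1 | 5178 | 0 | 65 |
| 8 | aid_centers | 0.986842 | 0 | 5175 | 0 | 69 |
| 9 | clothing | 0.985889 | 10 | 5160 | 1 | 73 |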
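
The TP/TN/FP/FN columns come from treating one class of a label as "positive" and everything else as "negative". The pure-Python sketch below shows that one-vs-rest counting on a toy three-class column. The function name and the toy data are made up, and it is meant as an illustration of the idea rather than a replacement for `multilabel_confusion_matrix`.

```python
def one_vs_rest_counts(y_true, y_pred):
    """Return {class: (tn, fp, fn, tp)}, treating each class in turn as 'positive'."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = len(pairs) - tp - fp - fn
        counts[c] = (tn, fp, fn, tp)
    return counts

# Toy column with three classes (hypothetical data)
y_true = [0, 1, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 1, 1]

results = one_vs_rest_counts(y_true, y_pred)
for cls, (tn, fp, fn, tp) in results.items():
    accuracy = (tp + tn) / len(y_true)
    print(cls, (tn, fp, fn, tp), round(accuracy, 2))
```

In this toy data class 2 is never predicted, yet its one-vs-rest accuracy is still 5/6, because true negatives dominate. The same effect explains rows in the table with high accuracy and zero true positives.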


### 6. Improve your model
Use grid search to find better parameters. 
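
Before launching a search it helps to estimate how many fits it implies. The sketch below enumerates the candidates the way an exhaustive grid does, using the grid values and the 5-fold setting from the search in this section. The variable names are illustrative only.

```python
from itertools import product

# Grid values and fold count taken from the search in this section
parameters = {
    'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__max_depth': [3, 5],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
# 4 20
```

Each additional value for a parameter multiplies the candidate count, so the number of fits grows quickly. By default `GridSearchCV` also refits the best candidate once on the full training set, which is not included in this count.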

In [10]:
# use grid search to find better parameters
from sklearn.model_selection import GridSearchCV

# limit the grid to these options, which take about 2 hrs to train
# each additional parameter value multiplies the number of fits
parameters = {
    'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__max_depth': [3, 5],
}

# instantiate grid search object with appropriate parameters
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=3, cv=5, n_jobs=1, return_train_score=True, scoring='f1_weighted')

# train the model
cv.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [11]:
# show the best parameters
print('best parameters:', cv.best_params_)

best parameters: {'clf__estimator__max_depth': 3, 'clf__estimator__n_estimators': 10}


In [12]:
# before testing for accuracy, we will make predictions on the test set
y_pred = cv.predict(X_test)

In [13]:
# show accuracy, precision and recall
for i, col in enumerate(y_test.columns):
    print('label :', col)
    print()
    print(classification_report(y_test[col], y_pred[:, i]))
    print('_________________________________________________________')
    print()


label : related

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1200
           1       0.76      1.00      0.87      4001
           2       0.00      0.00      0.00        43

    accuracy                           0.76      5244
   macro avg       0.25      0.33      0.29      5244
weighted avg       0.58      0.76      0.66      5244

_________________________________________________________

label : request

              precision    recall  f1-score   support

           0       0.83      1.00      0.91      4354
           1       0.00      0.00      0.00       890

    accuracy                           0.83      5244
   macro avg       0.42      0.50      0.45      5244
weighted avg       0.69      0.83      0.75      5244

_________________________________________________________

label : offer

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5219
           1    

### Train the model on the best parameters and evaluate performance

In [14]:
# build pipeline
pipeline = Pipeline([ 
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(verbose=3)))
])

# set parameters to the best parameters from grid search
pipeline.set_params(**cv.best_params_)

# fit the model
pipeline.fit(X_train, y_train)

building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10

### Show accuracy, precision, and recall of the tuned model

In [15]:
# make predictions on the test set
y_pred = pipeline.predict(X_test)

# evaluate the results
for i, col in enumerate(y_test.columns):
    print('label:', col)
    print(classification_report(y_test[col], y_pred[:, i]))
    print('Accuracy: {:.2f}'.format(accuracy_score(y_test[col], y_pred[:, i])))
    print('F1 Score: {:.2f}'.format(f1_score(y_test[col], y_pred[:, i], average='weighted')))
    print()
    print('Confusion Matrix:\n ', multilabel_confusion_matrix(y_test[col], y_pred[:, i]))
    print('------------------------------------------------------')

label: related
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1200
           1       0.76      1.00      0.87      4001
           2       0.00      0.00      0.00        43

    accuracy                           0.76      5244
   macro avg       0.25      0.33      0.29      5244
weighted avg       0.58      0.76      0.66      5244

Accuracy: 0.76
F1 Score: 0.66

Confusion Matrix:
  [[[4044    0]
  [1200    0]]

 [[   0 1243]
  [   0 4001]]

 [[5201    0]
  [  43    0]]]
------------------------------------------------------
label: request
              precision    recall  f1-score   support

           0       0.83      1.00      0.91      4354
           1       0.00      0.00      0.00       890

    accuracy                           0.83      5244
   macro avg       0.42      0.50      0.45      5244
weighted avg       0.69      0.83      0.75      5244

Accuracy: 0.83
F1 Score: 0.75

Confusion Matrix:
  [[[   0  890]
 

### Create a dataframe with accuracy, true pos, true neg, false pos, false neg

In [16]:
# create empty lists to store metrics
cols = []
accuracies = []
tps = []
tns = []
fps = []
fns = []

# calculate metrics for each column
for i, col in enumerate(y_test.columns):
    # one-vs-rest confusion matrix per class found in this column
    mcm = multilabel_confusion_matrix(y_test[col], y_pred[:, i])
    # extract tn, fp, fn, tp for the last class only (class 1 for binary labels,
    # class 2 for `related`, class 0 for the single-class `child_alone`)
    tn, fp, fn, tp = mcm[-1].ravel()
    # calculate accuracy
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    # append metrics to lists
    cols.append(f'{col}')
    accuracies.append(accuracy)
    tps.append(tp)
    tns.append(tn)
    fps.append(fp)
    fns.append(fn)

# create DataFrame
df = pd.DataFrame({
    'Label': cols,
    'Accuracy': accuracies,
    'TP': tps,
    'TN': tns,
    'FP': fps,
    'FN': fns
})

# sort DataFrame by accuracy, highest first
random_forrest_results_2 = df.sort_values(by='Accuracy', ascending=False)

random_forrest_results_2



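|   | Label | Accuracy | TP | TN | FP | FN |
|---|---|---|---|---|---|---|
| 9 | child_alone | 1.0 | 5244 | 0 | 0 | 0 |
| 25 | shops | 0.996377 | 0 | 5225 | 0 | 19 |
| 2 | offer | 0.995233 | 0 | 5219 | 0 | 25 |
| 23 | tools | 0.992754 | 0 | 5206 | 0 | 38 |
| 0 | related | 0.9918 | 0 | 5201 | 0 | 43 |
| 24 | hospitals | 0.990465 | 0 | 5194 | 0 | 50 |
| 15 | missing_people | 0.988368 | 0 | 5183 | 0 | 61 |
| 31 | fire | 0.987414 | 0 | 5178 | 0 | 66 |
| 26 | aid_centers | 0.986842 | 0 | 5175 | 0 | 69 |
| 13 | clothing | 0.984172 | 0 | 5161 | 0 | 83 |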


### Comparing Training for 1st and 2nd model
The second model is clearly not catching the positive classes: its true positive count is 0 for every label shown above except `child_alone`<br>
We will attempt to train a different classifier to see if we can improve accuracy<br>
<br>
Now we will look closer at the different model results
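
The gap between high accuracy and missing positives can be reproduced with simple arithmetic. The sketch below uses the `shops` row of the tuned model's table above (TP=0, TN=5225, FP=0, FN=19). The helper name is hypothetical, and recall is defined as 0.0 when there are no positives, which is a choice made for this sketch.

```python
def accuracy_and_recall(tp, tn, fp, fn):
    """Accuracy over all samples and recall over the actual positives."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# `shops` row from the tuned random forest results above
acc, rec = accuracy_and_recall(tp=0, tn=5225, fp=0, fn=19)
print(f"accuracy={acc:.6f} recall={rec:.2f}")
# accuracy=0.996377 recall=0.00
```

A model that never predicts `shops` still scores 99.6% accuracy on that label, so accuracy alone cannot separate the models. The true positive counts are the more informative comparison here.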

In [45]:
# create dataframe comparing the two random forest models (benchmark vs. grid search) on accuracy and true positives

compare_models = (

    random_forrest_results_1.loc[:,['Label','Accuracy','TP']]
    .rename(columns={'Accuracy':'Model_1_accuracy','TP':'Model_1_TP'})
    .merge(random_forrest_results_2.loc[:,['Label','Accuracy','TP']]
    .rename(columns={'Accuracy':'Model_2_accuracy','TP':'Model_2_TP'}), on='Label')
    .round(2)
)

compare_models['missed_true_positives'] = compare_models['Model_1_TP'] - compare_models['Model_2_TP']

print('TP stands for True Positives, which is the number of correct positive predictions')
display(compare_models)

TP stands for True Positives, which is the number of correct positive predictions


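|   | Label | Model_1_accuracy | Model_1_TP | Model_2_accuracy | Model_2_TP | missed_true_positives |
|---|---|---|---|---|---|---|
| 0 | child_alone | 1.0 | 5244 | 1.0 | 5244 | 0 |
| 1 | shops | 1.0 | 0 | 1.0 | 0 | 0 |
| 2 | offer | 1.0 | 0 | 1.0 | 0 | 0 |
| 3 | tools | 0.99 | 0 | 0.99 | 0 | 0 |
| 4 | hospitals | 0.99 | 0 | 0.99 | 0 | 0 |
| 5 | related | 0.99 | 14 | 0.99 | 0 | 14 |
| 6 | missing_people | 0.99 | 1 | 0.99 | 0 | 1 |
| 7 | fire | 0.99 | 1 | 0.99 | 0 | 1 |
| 8 | aid_centers | 0.99 | 0 | 0.99 | 0 | 0 |
| 9 | clothing | 0.99 | 10 | 0.98 | 0 | 10 |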


In [46]:
# sum missed true positives
print(f'There were a total of {compare_models["missed_true_positives"].sum()} missed true positives')

There were a total of 4998 missed true positives


### 8. Try improving your model further. Here are a few ideas:
We will attempt to train the same pipeline with an XGBoost classifier

In [26]:
# use the XGBoost classifier (XGBClassifier comes from the xgboost package, not sklearn)
from xgboost import XGBClassifier

# create a pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(XGBClassifier()))

])

pipeline.fit(X_train, y_train)

In [27]:
# evaluate the model accuracy and training logs
y_pred = pipeline.predict(X_test)

accuracy = (y_pred == y_test).mean().mean()
print('Accuracy: {:.4f}'.format(accuracy))

Accuracy: 0.9508


In [28]:
# create empty lists to store metrics
cols = []
accuracies = []
tps = []
tns = []
fps = []
fns = []

# calculate metrics for each column
for i, col in enumerate(y_test.columns):
    # one-vs-rest confusion matrix per class found in this column
    mcm = multilabel_confusion_matrix(y_test[col], y_pred[:, i])
    # extract tn, fp, fn, tp for the last class only (class 1 for binary labels,
    # class 2 for `related`, class 0 for the single-class `child_alone`)
    tn, fp, fn, tp = mcm[-1].ravel()
    # calculate accuracy
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    # append metrics to lists
    cols.append(f'{col}')
    accuracies.append(accuracy)
    tps.append(tp)
    tns.append(tn)
    fps.append(fp)
    fns.append(fn)

# create DataFrame
df = pd.DataFrame({
    'Label': cols,
    'Accuracy': accuracies,
    'TP': tps,
    'TN': tns,
    'FP': fps,
    'FN': fns
})

xgboost_model_results = df.sort_values(by='Accuracy', ascending=False)

xgboost_model_results

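|   | Label | Accuracy | TP | TN | FP | FN |
|---|---|---|---|---|---|---|
| 9 | child_alone | 1.0 | 5244 | 0 | 0 | 0 |
| 25 | shops | 0.996186 | 0 | 5224 | 1 | 19 |
| 2 | offer | 0.994851 | 0 | 5217 | 2 | 25 |
| 0 | related | 0.992754 | 7 | 5199 | 2 | 36 |
| 23 | tools | 0.992754 | 1 | 5205 | 1 | 37 |
| 13 | clothing | 0.990656 | 47 | 5148 | 13 | 36 |
| 24 | hospitals | 0.990465 | 6 | 5188 | 6 | 44 |
| 31 | fire | 0.989512 | 19 | 5170 | 8 | 47 |
| 15 | missing_people | 0.98913 | 7 | 5180 | 3 | 54 |
| 26 | aid_centers | 0.986461 | 5 | 5168 | 7 | 64 |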


### Compare training attempts
- We will merge the accuracy and true positive columns to compare the three models

In [30]:
# merge xgboost results with compare_models
compare_models = (
    compare_models
    .merge(xgboost_model_results.loc[:,['Label','Accuracy','TP']]
    .rename(columns={'Accuracy':'XGBoostAccuracy','TP':'XGBoost_TP'}), on='Label')
    .round(2)
)

compare_models

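|   | Label | Model_1_accuracy | Model_1_TP | Model_2_accuracy | Model_2_TP | XGBoostAccuracy | XGBoost_TP |
|---|---|---|---|---|---|---|---|
| 0 | child_alone | 1.0 | 5244 | 1.0 | 5244 | 1.0 | 5244 |
| 1 | shops | 1.0 | 0 | 1.0 | 0 | 1.0 | 0 |
| 2 | offer | 1.0 | 0 | 1.0 | 0 | 0.99 | 0 |
| 3 | tools | 0.99 | 0 | 0.99 | 0 | 0.99 | 1 |
| 4 | hospitals | 0.99 | 0 | 0.99 | 0 | 0.99 | 6 |
| 5 | related | 0.99 | 14 | 0.99 | 0 | 0.99 | 7 |
| 6 | missing_people | 0.99 | 1 | 0.99 | 0 | 0.99 | 7 |
| 7 | fire | 0.99 | 1 | 0.99 | 0 | 0.99 | 19 |
| 8 | aid_centers | 0.99 | 0 | 0.99 | 0 | 0.99 | 5 |
| 9 | clothing | 0.99 | 10 | 0.98 | 0 | 0.99 | 47 |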


### There is an obvious difference in true positive counts with XGBoost
Let's look closer at the difference in those metrics

In [41]:
# sum true positives for model_1 and xgboost
compare_models = compare_models.groupby('Label')[['Model_1_TP','XGBoost_TP']].sum().reset_index()

# calculate difference in true positive counts
compare_models['difference'] = compare_models['XGBoost_TP'] - compare_models['Model_1_TP']  

compare_models.agg({'Model_1_TP':'sum','XGBoost_TP':'sum','difference':'sum'}).to_frame().T 

|   | Model_1_TP | XGBoost_TP | difference |
|---|---|---|---|
| 0 | 10290 | 11642 | 1352 |


### Summary of Model Training Iterations
1.  Benchmark model: RandomForestClassifier with default parameters showed about 95% average accuracy (0.9482).
2.  Grid search: we tried 4 different sets of hyperparameters with 5-fold cross-validation. However, the resulting model seriously underfit and could not detect most of the positive classes.
3.  XGBoost: we trained XGBClassifier with its default parameters (no grid search) and it showed about 95% average accuracy (0.9508), with more true positives than the benchmark. We saved this model as the final model.

### 9. Export your model as a pickle file

In [None]:
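# export your model as a pickle file
import joblib
with open('classifier.pkl', 'wb') as file:
    joblib.dump(pipeline, file, compress=5)

# load model (the `tokenize` function must be defined or importable when loading)
loaded_pipeline = joblib.load('classifier.pkl')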


### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.