# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
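
The steps listed above can be sketched end to end on a toy table. This is a minimal illustration, assuming an in-memory SQLite database and made-up rows; the table name `data_cleaned` matches the notebook, but the columns `water` and `storm` and the `_toy` variable names are placeholders chosen for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database (assumption: stands in for data_cleaned.db)
engine = create_engine("sqlite://")

# Made-up rows with the same kind of layout as the cleaned dataset
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "storm damage"],
    "genre": ["direct", "news"],
    "water": [1, 0],
    "storm": [0, 1],
})
toy.to_sql("data_cleaned", engine, index=False)

# read_sql_table loads a whole table by name, without writing a SQL query
df_toy = pd.read_sql_table("data_cleaned", engine)

# Feature: the message text. Targets: every remaining category column.
X_toy = df_toy["message"]
Y_toy = df_toy.drop(columns=["id", "message", "genre"])
print(Y_toy.columns.tolist())
```

`pd.read_sql` with a `SELECT *` query, as used in the cell further down, returns the same rows; `read_sql_table` simply avoids writing the query by hand.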

In [1]:
# import libraries
import pickle
import re
import warnings

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# download the NLTK resources used by the tokenizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shchen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shchen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shchen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load the cleaned data from the SQLite database
engine = create_engine('sqlite:///data_cleaned.db')
df = pd.read_sql("SELECT * FROM data_cleaned", engine)


In [3]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,...,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,...,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,...,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [4]:
# The 'related' column contains the value 2 in addition to 0 and 1 (see the max row above).
# Collect those rows for inspection.
df_related_2 = df[df['related'] == 2]
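
A label column with three distinct values matters later on, because scikit-learn's `precision_score`, `recall_score` and `f1_score` with their default `average='binary'` expect exactly two classes. The sketch below shows two common ways of dealing with such values on a made-up frame. Both options are assumptions for illustration; this notebook does not show which, if either, was applied.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are made up for illustration
df_toy = pd.DataFrame({
    "message": ["a", "b", "c", "d"],
    "related": [0, 1, 2, 1],
})

# Inspect how often each value occurs
print(df_toy["related"].value_counts().sort_index())

# Option 1 (assumption): treat 2 as "related" and fold it into 1
df_mapped = df_toy.assign(related=df_toy["related"].replace(2, 1))

# Option 2 (assumption): drop the ambiguous rows entirely
df_dropped = df_toy[df_toy["related"] != 2]

print(sorted(df_mapped["related"].unique()), len(df_dropped))
```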

In [6]:
# 'child_alone' has only one constant value 0. Remove it.
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre','child_alone'], axis = 1)

### 2. Write a tokenization function to process your text data

In [7]:
def tokenize(text):
    """
    Tokenize a message, replacing URLs with a placeholder.

    Args:
      text (str): the input text
    Returns:
      clean_tokens (list of str): lemmatized, lowercased, stripped tokens

    Note: this first version is superseded by the definition in the next cell.
    """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls = re.findall(url_regex, text)
    # Use a separate loop variable so the list `urls` is not shadowed
    for url in urls:
        text = text.replace(url, "i_am_just_a_placeholder")
    tokens = nltk.word_tokenize(text)
    lemmatizer = nltk.WordNetLemmatizer()

    # List of clean tokens
    clean_tokens = [lemmatizer.lemmatize(w).lower().strip() for w in tokens]
    return clean_tokens

In [8]:
def tokenize(text):
    """
    Function: split text into words and return the root form of the words
    Args:
      text(str): the message
    Returns:
      lemm(list of str): a list of the root form of the message words
    """
    # Normalize text: lowercase and replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize text
    words = word_tokenize(text)
    
    # Remove stop words (a set makes the membership test fast)
    stop = set(stopwords.words("english"))
    words = [t for t in words if t not in stop]
    
    # Lemmatization (reuse one lemmatizer instead of creating one per word)
    lemmatizer = WordNetLemmatizer()
    lemm = [lemmatizer.lemmatize(w) for w in words]
    return lemm
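
The normalization step above replaces every non-alphanumeric character with a space, so a URL would be broken into fragments such as `http` and `com`. The sketch below shows one way to combine the URL placeholder idea from the first cell with that normalization, using only the standard library. The simplified URL pattern, the `normalize` helper and the `urlplaceholder` token are assumptions for illustration, and the whitespace split is a stand-in for `word_tokenize`, not an equivalent.

```python
import re

# Simplified URL pattern for illustration (assumption, not the notebook's regex)
URL_PATTERN = re.compile(r"https?://\S+")


def normalize(text, url_placeholder="urlplaceholder"):
    """Replace URLs with a placeholder, lowercase, and strip punctuation."""
    # Replace URLs first, while they are still intact
    text = URL_PATTERN.sub(url_placeholder, text)
    # Same normalization as in the tokenizer above
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Whitespace split as a dependency-free stand-in for word_tokenize
    return text.split()


print(normalize("Need water! See http://example.com/help now."))
# ['need', 'water', 'see', 'urlplaceholder', 'now']
```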

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 in this notebook, since `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [9]:
# Random Forest pipeline
pipeline_rf = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# AdaBoost pipeline
pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])
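
To see what such a pipeline consumes and produces, here is a minimal sketch on made-up messages with two hypothetical categories. It assumes the default `CountVectorizer` tokenizer instead of the custom `tokenize` function, and a small fixed-seed forest, purely to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages; label columns are two hypothetical categories: water, storm
messages = [
    "we need water",
    "no drinking water here",
    "storm destroyed the roof",
    "heavy storm last night",
    "storm cut off our water",
    "thank you for the help",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One row per message, one column per category
predictions = toy_pipeline.predict(["please send water", "big storm coming"])
print(predictions.shape)  # (2, 2)
```

`MultiOutputClassifier` fits one independent copy of the wrapped estimator per label column, which is why the prediction has as many columns as `labels`.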

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 1221)
pipeline_fit = pipeline_rf.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [11]:
y_prediction_train = pipeline_fit.predict(X_train)
y_prediction_test = pipeline_fit.predict(X_test)
print(classification_report(y_test.values, y_prediction_test, target_names=y_test.columns.values))

                        precision    recall  f1-score   support

               related       0.84      0.95      0.89      4966
               request       0.80      0.48      0.60      1077
                 offer       0.00      0.00      0.00        28
           aid_related       0.76      0.69      0.72      2719
          medical_help       0.72      0.06      0.11       541
      medical_products       0.76      0.07      0.13       339
     search_and_rescue       0.65      0.08      0.15       183
              security       0.33      0.01      0.02       110
              military       0.64      0.08      0.14       201
                 water       0.90      0.40      0.55       416
                  food       0.86      0.63      0.73       719
               shelter       0.80      0.39      0.53       575
              clothing       0.75      0.11      0.20       105
                 money       1.00      0.01      0.03       145
        missing_people       1.00      

  _warn_prf(average, modifier, msg_start, len(result))
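
The `_warn_prf` lines above are the tail of scikit-learn's undefined-metric warning, which is raised when precision or F1 cannot be computed, for example because a category is never predicted. The sketch below iterates over the columns as the instructions suggest and passes `zero_division=0`, which reports 0.0 for such cases without warning. The labels and category names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels for two hypothetical categories; rows are messages
category_names = ["request", "offer"]
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

reports = {}
for i, name in enumerate(category_names):
    # One report per category column
    print(name)
    print(classification_report(y_true[:, i], y_pred[:, i], zero_division=0))
    # The same numbers as a nested dict, convenient for further processing
    reports[name] = classification_report(
        y_true[:, i], y_pred[:, i], zero_division=0, output_dict=True
    )
```

Here `offer` is never predicted, so its precision for class 1 is reported as 0.0, the same pattern as the `offer` row in the output above.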


In [12]:
def eval_metrics(True_label, Predict_label, col_names):
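    """Evaluate metrics of the ML pipeline model.
    
    Args:
      True_label (array): true labels, one column per category
      Predict_label (array): predicted labels, same shape as True_label
      col_names (list of str): category names, one per column

    Returns:
      data_metrics (DataFrame): accuracy, precision, recall and F1 per category
    """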
    metrics = []
    
    for i in range(len(col_names)):
        accuracy = accuracy_score(True_label[:, i], Predict_label[:, i])
        precision = precision_score(True_label[:, i], Predict_label[:, i])
        recall = recall_score(True_label[:, i], Predict_label[:, i])
        f1 = f1_score(True_label[:, i], Predict_label[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    metrics = np.array(metrics)
    data_metrics = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return data_metrics

In [13]:
# Calculate evaluation metrics for test set
y_test_pred = pipeline_rf.predict(X_test)

pred_metrics = eval_metrics(np.array(y_test), y_test_pred, y_test.columns.values)
print(pred_metrics)

  _warn_prf(average, modifier, msg_start, len(result))


                        Accuracy  Precision    Recall        F1
related                 0.824036   0.839162  0.951873  0.891971
request                 0.894114   0.795732  0.484680  0.602424
offer                   0.995697   0.000000  0.000000  0.000000
aid_related             0.779007   0.760472  0.687753  0.722287
medical_help            0.919932   0.717391  0.060998  0.112436
medical_products        0.950515   0.757576  0.073746  0.134409
search_and_rescue       0.972952   0.652174  0.081967  0.145631
security                0.982941   0.333333  0.009091  0.017699
military                0.970186   0.640000  0.079602  0.141593
water                   0.958660   0.901639  0.396635  0.550918
food                    0.947749   0.856874  0.632823  0.728000
shelter                 0.937759   0.801418  0.393043  0.527421
clothing                0.985093   0.750000  0.114286  0.198347
money                   0.978024   1.000000  0.013793  0.027211
missing_people          0.989857   1.000

In [14]:
# Calculate evaluation metrics for training set
y_train_pred = pipeline_rf.predict(X_train)

pred_metrics_t = eval_metrics(np.array(y_train), y_train_pred, y_train.columns.values)
print(pred_metrics_t)

                        Accuracy  Precision    Recall        F1
related                 0.998463   0.999196  0.998795  0.998996
request                 0.999180   0.999114  0.996173  0.997642
offer                   1.000000   1.000000  1.000000  1.000000
aid_related             0.999180   0.999263  0.998772  0.999017
medical_help            0.999641   1.000000  0.995463  0.997727
medical_products        0.999949   1.000000  0.998973  0.999486
search_and_rescue       0.999846   0.996310  0.998152  0.997230
security                0.999795   1.000000  0.988920  0.994429
military                0.999846   0.998480  0.996965  0.997722
water                   0.999949   1.000000  0.999204  0.999602
food                    1.000000   1.000000  1.000000  1.000000
shelter                 0.999949   1.000000  0.999425  0.999712
clothing                1.000000   1.000000  1.000000  1.000000
money                   0.999949   1.000000  0.997821  0.998909
missing_people          0.999949   1.000

### 6. Improve your model
Use grid search to find better parameters. 
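
Grid search addresses pipeline parameters by name, joining the step name and the parameter name with a double underscore; nested estimators add further levels. A minimal sketch for listing the tunable names of a pipeline shaped like the ones above follows. The pipeline is rebuilt here with default settings only so the example is self-contained.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

example_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# get_params() exposes every parameter that GridSearchCV can set
param_names = sorted(example_pipeline.get_params().keys())

# Show only the TF-IDF and wrapped-classifier parameters
for name in param_names:
    if name.startswith(("tfidf__", "clf__estimator__")):
        print(name)
```

Names such as `tfidf__use_idf` and `clf__estimator__n_estimators` are exactly the keys expected in `param_grid`.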

In [30]:
# To keep computation time manageable, grid search over a single parameter with two values
param_grid = { 
    'tfidf__use_idf': (True, False),
}

CV_rfc = GridSearchCV(estimator=pipeline_rf, param_grid=param_grid, scoring='f1_micro', cv = 5)
CV_rfc.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000002226D6105E8>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'tfidf__use_idf': (True, False)}, scoring='f1_micro')

In [31]:
# show grid search results
CV_rfc.cv_results_

{'mean_fit_time': array([377.21718359, 378.29851952]),
 'std_fit_time': array([ 9.22895786, 11.1561009 ]),
 'mean_score_time': array([17.14813747, 18.56588645]),
 'std_score_time': array([1.83232558, 2.63847786]),
 'param_tfidf__use_idf': masked_array(data=[True, False],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'tfidf__use_idf': True}, {'tfidf__use_idf': False}],
 'split0_test_score': array([0.64590653, 0.645     ]),
 'split1_test_score': array([0.63797094, 0.63750247]),
 'split2_test_score': array([0.65129434, 0.65298686]),
 'split3_test_score': array([0.65082376, 0.65365659]),
 'split4_test_score': array([0.65138909, 0.64850881]),
 'mean_test_score': array([0.64747693, 0.64753095]),
 'std_test_score': array([0.00517511, 0.00591985]),
 'rank_test_score': array([2, 1])}
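
The raw `cv_results_` dictionary is easier to read as a table. In the output above, the two mean test scores (0.64747693 and 0.64753095) differ by far less than the spread across folds (about 0.005), which a tabular view makes easy to spot. The sketch below shows the conversion on a small synthetic problem; the data, the `LogisticRegression` estimator and the `C` grid are assumptions chosen only so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic binary problem (stand-in for the real text data)
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(), {"C": [0.1, 1.0]}, scoring="f1_micro", cv=3
)
search.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```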

In [32]:
# Get best parameters
CV_rfc.best_params_

{'tfidf__use_idf': False}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [33]:
pred_test_tuned = CV_rfc.predict(X_test)
pred_metrics_tuned = eval_metrics(np.array(y_test), pred_test_tuned, y_test.columns.values)
print(pred_metrics_tuned)

  _warn_prf(average, modifier, msg_start, len(result))


                        Accuracy  Precision    Recall        F1
related                 0.823114   0.835178  0.957108  0.891996
request                 0.898110   0.814590  0.497679  0.617867
offer                   0.995697   0.000000  0.000000  0.000000
aid_related             0.775780   0.752000  0.691431  0.720445
medical_help            0.920393   0.734694  0.066543  0.122034
medical_products        0.951898   0.842105  0.094395  0.169761
search_and_rescue       0.973260   0.800000  0.065574  0.121212
security                0.982941   0.333333  0.009091  0.017699
military                0.970186   0.666667  0.069652  0.126126
water                   0.957277   0.915663  0.365385  0.522337
food                    0.946980   0.843750  0.638387  0.726841
shelter                 0.935300   0.810484  0.349565  0.488457
clothing                0.984325   0.666667  0.057143  0.105263
money                   0.978485   1.000000  0.034483  0.066667
missing_people          0.989857   1.000

In [35]:
pred_metrics_tuned.describe()

        Accuracy  Precision    Recall        F1
count       35.0       35.0      35.0      35.0
mean    0.945742   0.577581  0.203633   0.25216
std     0.051558   0.372377  0.275044  0.293965
min      0.77578        0.0       0.0       0.0
25%     0.937682   0.166667  0.004545   0.00885
50%     0.957277   0.756098  0.066543  0.122034
75%     0.982711   0.838642  0.357475  0.505397
max     0.995697        1.0  0.957108  0.891996


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
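
As one possible reading of the second idea, extra features can be combined with TF-IDF through `FeatureUnion`. The sketch below adds a message-length feature. The `MessageLength` transformer is hypothetical, written only for this example, and `TfidfVectorizer` is used as a shorthand for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; present for scikit-learn API compatibility
        return self

    def transform(self, X):
        # One row per message, one column holding its length
        return np.array([[len(text)] for text in X], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

messages = ["we need water", "storm destroyed the bridge near the river"]
matrix = features.fit_transform(messages)

# Columns: one per vocabulary term, plus one for the length feature
print(matrix.shape)
```

The resulting `FeatureUnion` could replace the `vect` and `tfidf` steps in a pipeline, with the classifier step left unchanged.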

In [36]:
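param_grid_ada = {
    'clf__estimator__n_estimators': [50, 100],
    'tfidf__use_idf': (True, False)
}

# No `scoring` argument is passed here, so GridSearchCV falls back to the pipeline's own `score` method
cv_ada = GridSearchCV(pipeline_ada, param_grid = param_grid_ada)
cv_ada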

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000002226D6105E8>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=AdaBoostClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [50, 100],
                         'tfidf__use_idf': (True, False)})

In [39]:
# try other machine learning algorithms
cv_ada.fit(X_train, y_train)
pred_test_tuned_ada = cv_ada.predict(X_test)
pred_metrics_tuned_ada = eval_metrics(np.array(y_test), pred_test_tuned_ada, y_test.columns.values)
print(pred_metrics_tuned_ada)

                        Accuracy  Precision    Recall        F1
related                 0.793914   0.806114  0.961136  0.876826
request                 0.891194   0.728059  0.546890  0.624602
offer                   0.994314   0.000000  0.000000  0.000000
aid_related             0.765944   0.762511  0.638838  0.695217
medical_help            0.923467   0.587755  0.266174  0.366412
medical_products        0.953127   0.595506  0.312684  0.410058
search_and_rescue       0.972952   0.563636  0.169399  0.260504
security                0.981712   0.304348  0.063636  0.105263
military                0.973721   0.627119  0.368159  0.463950
water                   0.962195   0.727273  0.653846  0.688608
food                    0.948824   0.796923  0.720445  0.756757
shelter                 0.943138   0.737819  0.553043  0.632207
clothing                0.985554   0.603774  0.304762  0.405063
money                   0.977716   0.500000  0.200000  0.285714
missing_people          0.989857   0.521

In [41]:
pred_metrics_tuned_ada.describe()

        Accuracy  Precision    Recall        F1
count       35.0       35.0      35.0      35.0
mean    0.944671   0.550003  0.327724  0.395084
std     0.055655    0.22935  0.248553  0.246606
min     0.765944        0.0       0.0       0.0
25%     0.940526   0.444674  0.142042  0.213162
50%     0.962195   0.595506  0.266174  0.366412
75%     0.982788   0.726312  0.532673   0.61465
max     0.994314   0.873646  0.961136  0.876826


### 9. Export your model as a pickle file

In [43]:
# pickle the tuned random forest grid search; the context manager closes the file
with open('classifier.pkl', 'wb') as f:
    pickle.dump(CV_rfc, f)
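
A quick round trip confirms that a pickled object can be restored. The sketch below uses a plain dictionary as a placeholder for the fitted model and writes to a temporary directory; both are assumptions to keep the example independent of the trained pipeline.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the fitted model (not a real model)
model_stand_in = {"best_params": {"tfidf__use_idf": False}}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Write the object, closing the file automatically
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Read it back and compare with the original
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

When the real pipeline is loaded in another script, the custom `tokenize` function has to be importable under the same module path, because pickle stores functions by reference rather than by value.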

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.