# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sqlalchemy import create_engine
import pickle


# download NLTK data
import re
import nltk
nltk.download(['punkt', 'wordnet','stopwords'])

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///messages.db')
df = pd.read_sql_table('messages',engine)
X = df['message']
Y = df.iloc[:,4:]
categories = list(df.columns[4:])

In [3]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [4]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    '''
    Applies Natural Language Processing to raw text, namely: normalizes case, removes punctuation and english stop words, tokenizes and lemmatizes words.
    
    Args:
    text: str - raw message (text) to be cleaned
    
    Returns:
    tokens: cleaned, tokenized and lemmatized text
    '''
    
    #Normalize case and remove punctuation
    text = re.sub(r'[^a-zA-Z0-9]',' ' , text.lower())
    
    #Split text into words
    tokens = word_tokenize(text)
    
    # Initiate Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #Lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stopwords.words('english')]
    
    return tokens


In [6]:
#test the tokenize function
for message in X[:5]:
    tokens=tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane'] 

Looking for someone but no name
['looking', 'someone', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '80', '90', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 



### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
#ML Pipeline using Random Forest Classifier
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
#Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y)

#Train pipeline
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [9]:
#Predict on test data
y_pred = pipeline.predict(X_test)

for i in range(Y.shape[1]):
    print('Category:', Y.columns[i], '\n', classification_report(y_test.iloc[:,1].values, y_pred[:,i]))    

Category: related 
              precision    recall  f1-score   support

          0       0.95      0.20      0.33      5423
          1       0.20      0.94      0.33      1122
          2       0.00      0.00      0.00         0

avg / total       0.82      0.33      0.33      6545

Category: request 
              precision    recall  f1-score   support

          0       0.89      0.98      0.93      5423
          1       0.78      0.42      0.55      1122

avg / total       0.87      0.88      0.87      6545

Category: offer 
              precision    recall  f1-score   support

          0       0.83      1.00      0.91      5423
          1       0.00      0.00      0.00      1122

avg / total       0.69      0.83      0.75      6545

Category: aid_related 
              precision    recall  f1-score   support

          0       0.91      0.73      0.81      5423
          1       0.33      0.66      0.44      1122

avg / total       0.81      0.72      0.75      6545

Categ

  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [10]:
accuracy = (y_pred == y_test).mean()

avg_accuracy = accuracy.mean()

print("Accuracy:", accuracy)
print("Average Accuracy:", avg_accuracy)

Accuracy: related                   0.802445
request                   0.880825
offer                     0.995416
aid_related               0.755539
medical_help              0.920397
medical_products          0.954469
search_and_rescue         0.971429
security                  0.981054
military                  0.966539
child_alone               1.000000
water                     0.950649
food                      0.929412
shelter                   0.928189
clothing                  0.985791
money                     0.979526
missing_people            0.989610
refugees                  0.970053
death                     0.960886
other_aid                 0.865241
infrastructure_related    0.937510
transport                 0.956303
buildings                 0.949121
electricity               0.980138
tools                     0.993430
hospitals                 0.989458
shops                     0.996028
aid_centers               0.988694
other_infrastructure      0.959358
weather_re

In [11]:
#Function to calculate basic statistics for total accuracy of the model
def calculate_stats(accuracy):
    '''
    Takes a list of accuracies and calculates the basic statistics, like minimum, maximum, mean and median
    
    Args:
    accuracy: str - list of accuracies for each category
    
    Returns: non
    '''
    minimum = accuracy.min()
    maximum = accuracy.max()
    mean = accuracy.mean()
    median = accuracy.median()
    
    return print('Min:', minimum ,'\n','Max:', maximum ,'\n','Mean:', mean ,'\n','Median:', median)

In [12]:
#Apply stats function to Random Forest Pipeline
calculate_stats(accuracy)

Min: 0.755538579068 
 Max: 1.0 
 Mean: 0.944346829641 
 Median: 0.958441558442


### 6. Improve your model
Use grid search to find better parameters. 

In [13]:
#Get pipeline parameters
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fcf754bc620>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [14]:
parameters = {#'clf__estimator__bootstrap': [True,False],
              #'clf__estimator__criterion': ['gini', 'entropy']
              #'clf__estimator__n_estimators':[1,10,20,30,60],
              'clf__estimator__n_estimators': [10,30]
             }

In [15]:
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [10, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [16]:
cv.best_estimator_

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [17]:
cv.best_params_

{'clf__estimator__n_estimators': 30}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [18]:
#Predict on test data with tuned model
y_pred_cv = cv.predict(X_test)

for i in range(Y.shape[1]):
    print('Category:', Y.columns[i], '\n', classification_report(y_test.iloc[:,1].values, y_pred_cv[:,i]))

Category: related 
              precision    recall  f1-score   support

          0       0.96      0.18      0.30      5423
          1       0.20      0.96      0.33      1122
          2       0.00      0.00      0.00         0

avg / total       0.83      0.31      0.31      6545

Category: request 
              precision    recall  f1-score   support

          0       0.90      0.97      0.94      5423
          1       0.80      0.50      0.61      1122

avg / total       0.89      0.89      0.88      6545

Category: offer 
              precision    recall  f1-score   support

          0       0.83      1.00      0.91      5423
          1       0.00      0.00      0.00      1122

avg / total       0.69      0.83      0.75      6545

Category: aid_related 
              precision    recall  f1-score   support

          0       0.93      0.71      0.80      5423
          1       0.34      0.74      0.47      1122

avg / total       0.83      0.71      0.75      6545

Categ

  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [19]:
accuracy_tunned = (y_pred_cv == y_test).mean()
avg_accuracy_tunned = accuracy_tunned.mean()

print("Accuracy Tunned:", accuracy_tunned)

print("Average Accuracy Tunned:", avg_accuracy_tunned)

Accuracy Tunned: related                   0.814515
request                   0.892743
offer                     0.995416
aid_related               0.771123
medical_help              0.922383
medical_products          0.956914
search_and_rescue         0.973415
security                  0.981054
military                  0.966387
child_alone               1.000000
water                     0.953094
food                      0.940107
shelter                   0.935523
clothing                  0.986555
money                     0.980138
missing_people            0.989610
refugees                  0.970053
death                     0.960886
other_aid                 0.863866
infrastructure_related    0.938732
transport                 0.957219
buildings                 0.949885
electricity               0.980443
tools                     0.993430
hospitals                 0.989458
shops                     0.996028
aid_centers               0.988694
other_infrastructure      0.959206
wea

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [20]:
#K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

#Pipeline with K Nearest Neighbors estimator
pipeline_knn = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))
    ])

#Train KNN pipeline
pipeline_knn.fit(X_train, y_train)

#Predict on test data with KNN classifier
y_pred_knn = pipeline_knn.predict(X_test)

for i in range(Y.shape[1]):
    print('Category:', Y.columns[i], '\n', classification_report(y_test.iloc[:,1].values, y_pred_knn[:,i]))

Category: related 
              precision    recall  f1-score   support

          0       0.95      0.06      0.11      5423
          1       0.33      0.16      0.22      1122
          2       0.00      0.00      0.00         0

avg / total       0.84      0.08      0.13      6545

Category: request 
              precision    recall  f1-score   support

          0       0.84      0.99      0.91      5423
          1       0.74      0.09      0.16      1122

avg / total       0.82      0.84      0.78      6545

Category: offer 
              precision    recall  f1-score   support

          0       0.83      1.00      0.91      5423
          1       0.00      0.00      0.00      1122

avg / total       0.69      0.83      0.75      6545

Category: aid_related 
              precision    recall  f1-score   support

          0       0.84      0.99      0.91      5423
          1       0.60      0.09      0.15      1122

avg / total       0.80      0.83      0.78      6545

Categ

  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [21]:
accuracy_knn = (y_pred_knn == y_test).mean()

avg_accuracy_knn = accuracy_knn.mean()

print("Accuracy KNN:", accuracy)

print("Average Accuracy KNN:", avg_accuracy_knn)

Accuracy KNN: related                   0.802445
request                   0.880825
offer                     0.995416
aid_related               0.755539
medical_help              0.920397
medical_products          0.954469
search_and_rescue         0.971429
security                  0.981054
military                  0.966539
child_alone               1.000000
water                     0.950649
food                      0.929412
shelter                   0.928189
clothing                  0.985791
money                     0.979526
missing_people            0.989610
refugees                  0.970053
death                     0.960886
other_aid                 0.865241
infrastructure_related    0.937510
transport                 0.956303
buildings                 0.949121
electricity               0.980138
tools                     0.993430
hospitals                 0.989458
shops                     0.996028
aid_centers               0.988694
other_infrastructure      0.959358
weathe

In [22]:
#AdaBoostClassifier

#Pipeline with  AdaBoost Classifier
pipeline_boost = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(AdaBoostClassifier()))
    ])

#Train SVC pipeline
pipeline_boost.fit(X_train, y_train)

#Predict on test data with SVC classifier
y_pred_boost = pipeline_boost.predict(X_test)

for i in range(Y.shape[1]):
    print('Category:', Y.columns[i], '\n', classification_report(y_test.iloc[:,1].values, y_pred_boost[:,i]))    

Category: related 
              precision    recall  f1-score   support

          0       0.93      0.06      0.10      5423
          1       0.18      0.98      0.30      1122
          2       0.00      0.00      0.00         0

avg / total       0.80      0.21      0.14      6545

Category: request 
              precision    recall  f1-score   support

          0       0.91      0.96      0.93      5423
          1       0.72      0.52      0.60      1122

avg / total       0.87      0.88      0.88      6545

Category: offer 
              precision    recall  f1-score   support

          0       0.83      1.00      0.91      5423
          1       0.14      0.00      0.00      1122

avg / total       0.71      0.83      0.75      6545

Category: aid_related 
              precision    recall  f1-score   support

          0       0.91      0.73      0.81      5423
          1       0.34      0.67      0.45      1122

avg / total       0.81      0.72      0.75      6545

Categ

  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [23]:
accuracy_boost = (y_pred_boost == y_test).mean()

avg_accuracy_boost = accuracy_boost.mean()

print("Accuracy Boost:", accuracy_boost)

print("Average Accuracy AdaBoost:", avg_accuracy_boost)

Accuracy Boost: related                   0.772956
request                   0.882964
offer                     0.994347
aid_related               0.756608
medical_help              0.923147
medical_products          0.962108
search_and_rescue         0.972956
security                  0.979832
military                  0.967303
child_alone               1.000000
water                     0.962414
food                      0.942552
shelter                   0.943468
clothing                  0.990069
money                     0.980596
missing_people            0.989305
refugees                  0.968831
death                     0.965470
other_aid                 0.864171
infrastructure_related    0.936287
transport                 0.959664
buildings                 0.956455
electricity               0.980749
tools                     0.993125
hospitals                 0.987930
shops                     0.995264
aid_centers               0.987471
other_infrastructure      0.954927
weat

In [25]:
#Accuracy for RandomForest Model
calculate_stats(accuracy)

Min: 0.755538579068 
 Max: 1.0 
 Mean: 0.944346829641 
 Median: 0.958441558442


In [26]:
#Accuracy for RandomForest Model Tunned
calculate_stats(accuracy_tunned)

Min: 0.771122994652 
 Max: 1.0 
 Mean: 0.948030727442 
 Median: 0.960045836516


In [27]:
#Accuracty for AdaBoost Model
calculate_stats(accuracy_boost)

Min: 0.756608097785 
 Max: 1.0 
 Mean: 0.947237076649 
 Median: 0.963941940413


In [28]:
#Accuracy for KNN Model
calculate_stats(accuracy_knn)

Min: 0.111688311688 
 Max: 1.0 
 Mean: 0.910376878024 
 Median: 0.955233002292


In [29]:
#We will consider Random Forest Tunned (with 30 estimators) as our final model

### 9. Export your model as a pickle file

In [30]:
pickle.dump(cv, open('model.pkl', "wb"))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.