# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
conda install -c anaconda nltk


Note: you may need to restart the kernel to use updated packages.


In [2]:
# import libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

import sys
import os
import re
from sqlalchemy import create_engine
import pickle

from scipy.stats import gmean
# import relevant functions/modules from the sklearn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, classification_report, confusion_matrix, make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.svm import SVC

np.random.seed(42)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\diarm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\diarm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\diarm\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load data from database
database_filepath = r"C:\Users\diarm\OneDrive\DOCUMENTS\EDUCATION\Data_Science\Udacity data science nanodegree\data engineering\Project_disaster response pipeline\disaster_response_db.db"
engine = create_engine('sqlite:///' + database_filepath)
table_name = database_filepath.replace(".db","") + "_table"
df = pd.read_sql_table(table_name,engine)
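
Because the table name above is derived from the database path, it can be worth confirming which tables the file actually contains before calling `read_sql_table`. The following is a minimal sketch using only the standard-library `sqlite3` module; the in-memory database and the table name `demo_messages_table` are stand-ins for illustration, not the project's real objects.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables stored in an SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# In-memory stand-in for the project database (hypothetical table name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_messages_table (id INTEGER, message TEXT)")
tables = list_tables(conn)
print(tables)
conn.close()
```

Pointing `sqlite3.connect` at the real `.db` file instead of `":memory:"` would list the tables written by the ETL step.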

In [4]:
df.head(5000)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,5704,"2 KM east of the city of Jacmel, in the South-...",On est 2km de la ville de Jacmel (entre princ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4996,5705,"I am a student, I want to get help and a job","Je suis un etudiant, je veux trouver de l'aide...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4997,5706,The commitee is waiting for an answer from the...,LE COMITE ATTENDS UNE REPONS DE L INSTITUTION ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4998,5707,The people of La voute have lots of water prob...,Pp lavout la gen problem dlo anpil,direct,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.0,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.0,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# The value 2 occurs in only a negligible number of rows of the 'related' field, so it is probably a data error.
# Replace 2 with 1 so that these rows are treated as valid responses.
# They could equally have been mapped to 0; in the absence of further information, the majority class (1) is used.
df['related']=df['related'].map(lambda x: 1 if x == 2 else x)
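
Before recoding, it may help to look at how often each value occurs. The sketch below uses a small toy `Series` in place of `df['related']` (the toy values are invented for illustration) and shows `value_counts` together with `Series.replace`, which gives the same result as the lambda-based `map` above.

```python
import pandas as pd

# Toy stand-in for df['related']; the values are illustrative only.
related = pd.Series([1, 0, 2, 1, 1, 0, 2])

# Count how often each value occurs before deciding how to treat the 2s.
counts = related.value_counts().to_dict()
print(counts)

# Vectorised equivalent of: related.map(lambda x: 1 if x == 2 else x)
cleaned = related.replace(2, 1)
print(cleaned.tolist())
```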

In [7]:
# Extract X and y variables from the data for the modelling
X = df['message']
Y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text,url_place_holder_string="urlplaceholder"):
    """
    Tokenize a text message.
    
    Arguments:
        text -> Text message to be tokenized
        url_place_holder_string -> String substituted for each URL found in the text
    Output:
        clean_tokens -> List of lemmatized, lowercased tokens extracted from the text
    """
    
    # Regular expression that matches http and https URLs (raw string avoids invalid-escape warnings)
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    # Extract all the urls from the provided text 
    detected_urls = re.findall(url_regex, text)
    
    # Replace url with a url placeholder string
    for detected_url in detected_urls:
        text = text.replace(detected_url, url_place_holder_string)

    # Extract the word tokens from the provided text
    tokens = nltk.word_tokenize(text)
    
    # Lemmatizer to reduce inflected forms of a word to a common base form
    lemmatizer = nltk.WordNetLemmatizer()

    # List of clean tokens
    clean_tokens = [lemmatizer.lemmatize(w).lower().strip() for w in tokens]
    return clean_tokens
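
The URL-replacement step can be tried in isolation. This is a minimal sketch that reuses the same pattern but applies it with `re.sub` in a single pass, a swapped-in alternative to the `findall` plus `str.replace` loop above; the helper name `replace_urls` and the sample messages are illustrative assumptions.

```python
import re

# Same pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL in `text` with `placeholder` in one pass."""
    return re.sub(URL_REGEX, placeholder, text)

sample = "Water needed, see http://example.org/map?id=7 for details"
cleaned = replace_urls(sample)
print(cleaned)

# Two URLs where one is a prefix of the other are both replaced whole.
nested = replace_urls("http://a.com http://a.com/b")
print(nested)
```

A single-pass substitution sidesteps one edge case of the loop: when one detected URL is a prefix of another, replacing the shorter one first can leave a fragment of the longer one behind.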

In [9]:
# Build a custom transformer which will extract the starting verb of a sentence
class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """
    Starting Verb Extractor class
    
    This class extracts the starting verb of a sentence,
    creating a new feature for the ML classifier
    """

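    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # Skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases its output, so compare against 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False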

    # This transformer has nothing to learn, so fit simply returns self
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [10]:
# pipeline = 

pipeline = Pipeline ( [
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ] )

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.2, random_state = 42)

print(X_train.shape)
print(Y_train.shape)

(20972,)
(20972, 36)
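
The training-set size printed above can be reproduced by hand. The sketch below assumes, as I understand scikit-learn's behaviour for a float `test_size`, that the test count is rounded up and the remainder goes to training; the row count comes from the `df.describe()` output earlier.

```python
import math

n_samples = 26216   # row count reported by df.describe()
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```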


In [12]:
# Train pipeline model
model1=pipeline.fit(X_train, Y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
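
The column-by-column approach described here could look roughly like the following. It is only a sketch on invented toy labels: the two category names and the arrays `y_true_demo` and `y_pred_demo` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy labels: 6 messages, 2 illustrative categories.
col_names_demo = ["water", "food"]
y_true_demo = np.array([[1, 0], [0, 0], [1, 1], [0, 1], [1, 0], [0, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

reports = {}
for i, name in enumerate(col_names_demo):
    # Text report for reading, dict report for programmatic access.
    print(name)
    print(classification_report(y_true_demo[:, i], y_pred_demo[:, i], zero_division=0))
    reports[name] = classification_report(
        y_true_demo[:, i], y_pred_demo[:, i], output_dict=True, zero_division=0
    )
```

With real data, `y_true_demo` would be replaced by `np.array(Y_test)` and `y_pred_demo` by the pipeline's predictions.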

In [14]:
def get_eval_metrics(actual, predicted, col_names):
    """
    Calculate evaluation metrics for model
    
    Args:
    actual: array. Array containing actual labels.
    predicted: array. Array containing predicted labels.
    col_names: List of strings. List containing names for each of the predicted fields.
       
    Returns:
    metrics_df: dataframe. Dataframe containing the accuracy, precision, recall 
    and f1 score for a given set of actual and predicted labels.
    """
    metrics = []
    
    # Options for `average`: 'micro', 'macro', 'samples', 'weighted', 'binary' or None (default 'binary')
    avg_type='weighted'  # per-class scores are averaged, weighted by each class's support
    zero_division_treatment=0  # options: 0, 1 or 'warn'
    
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i], average=avg_type, zero_division=zero_division_treatment)
        recall = recall_score(actual[:, i], predicted[:, i], average=avg_type, zero_division=zero_division_treatment)
        f1 = f1_score(actual[:, i], predicted[:, i], average=avg_type, zero_division=zero_division_treatment)
        
        metrics.append( [accuracy, precision, recall, f1] )
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return metrics_df
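
One consequence of `average='weighted'` is that the Accuracy and Recall columns in the result tables of this notebook are identical: weighting each class's recall by its support and summing gives the overall fraction of correct predictions. The pure-Python sketch below illustrates this on invented labels; the helper names are hypothetical and it does not call scikit-learn.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        score += (n / total) * (tp / n)
    return score

# Toy labels for one imbalanced category.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0]

acc = accuracy(y_true, y_pred)
w_rec = weighted_recall(y_true, y_pred)
print(acc, w_rec)
```

For imbalanced labels, this means the weighted scores are dominated by the majority class, so per-class or macro-averaged scores may be more informative for rare categories.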

In [15]:
# Calculate evaluation metrics for training set
Y_train_pred = pipeline.predict(X_train)
col_names = list(Y.columns.values)

eval_metrics0 = get_eval_metrics(np.array(Y_train), Y_train_pred, col_names)
print(eval_metrics0)

                        Accuracy  Precision    Recall        F1
related                 0.999189   0.999189  0.999189  0.999189
request                 0.999619   0.999618  0.999619  0.999618
offer                   0.999952   0.999952  0.999952  0.999952
aid_related             0.999475   0.999476  0.999475  0.999475
medical_help            0.999666   0.999666  0.999666  0.999666
medical_products        0.999714   0.999714  0.999714  0.999714
search_and_rescue       0.999905   0.999905  0.999905  0.999905
security                0.999809   0.999809  0.999809  0.999809
military                0.999762   0.999762  0.999762  0.999761
child_alone             1.000000   1.000000  1.000000  1.000000
water                   1.000000   1.000000  1.000000  1.000000
food                    0.999952   0.999952  0.999952  0.999952
shelter                 0.999905   0.999905  0.999905  0.999905
clothing                0.999952   0.999952  0.999952  0.999952
money                   1.000000   1.000

In [16]:
# Calculate predicted classes for test dataset
Y_test_pred = pipeline.predict(X_test)

# Calculate evaluation metrics
eval_metrics1 = get_eval_metrics(np.array(Y_test), Y_test_pred, col_names)
print(eval_metrics1)

                        Accuracy  Precision    Recall        F1
related                 0.802632   0.792049  0.802632  0.767437
request                 0.891304   0.891330  0.891304  0.874233
offer                   0.995042   0.990108  0.995042  0.992569
aid_related             0.772311   0.774104  0.772311  0.765828
medical_help            0.923150   0.910847  0.923150  0.892395
medical_products        0.950610   0.942869  0.950610  0.929770
search_and_rescue       0.976735   0.973995  0.976735  0.966453
security                0.983028   0.972491  0.983028  0.975171
military                0.971396   0.965996  0.971396  0.958866
child_alone             1.000000   1.000000  1.000000  1.000000
water                   0.948131   0.944653  0.948131  0.933627
food                    0.923913   0.922349  0.923913  0.909381
shelter                 0.928299   0.925374  0.928299  0.908485
clothing                0.986842   0.987015  0.986842  0.980990
money                   0.981312   0.979

### 6. Improve your model
Use grid search to find better parameters. 

In [17]:
# Define performance metric for use in grid search scoring object
def performance_metric(y_true, y_pred)->float:
    """
    Calculate the median micro-averaged F1 score across all output classifiers.
    
    For a single label column, the micro-averaged F1 score equals accuracy.
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median micro-averaged F1 score across all output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i],average='micro')
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score
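
Since the scorer above uses `average='micro'`, it is worth seeing what that means for a single label column. In the pure-Python sketch below (the helper names and toy labels are illustrative), true positives, false positives and false negatives are pooled over both classes, and the resulting F1 equals plain accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from TP, FP and FN counts pooled over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0]

acc = accuracy(y_true, y_pred)
f1_micro = micro_f1(y_true, y_pred)
print(acc, f1_micro)
```

If the goal is to reward correct detection of the rare positive class, `average='binary'` or `average='macro'` would be alternatives to consider, at the cost of changing the scores reported in the grid search.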

In [18]:
# Create grid search object

# Some parameters are commented out, and the rest use only a few values each, to reduce runtime

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
#               'clf__estimator__n_estimators':[100, 150], 
#               'clf__estimator__min_samples_split':[2, 5, 10]
             }

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, cv=3, verbose = 10, n_jobs=None)

# Find best parameters
np.random.seed(42)
model2 = cv.fit(X_train, Y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 1/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.956 total time= 5.6min
[CV 2/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 2/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.957 total time= 5.6min
[CV 3/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 3/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.959 total time= 5.6min
[CV 1/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 1/3; 2/4] END tfidf__use_idf=True, vect__min_df=5;, score=0.958 total time= 4.8min
[CV 2/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 2/3; 2/4] END tfidf__use_idf=True, vect__min_df=5;, score=0.959 total time= 4.8min
[CV 3/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 3/3; 2/4] END t

In [19]:
# Print the best parameters in the GridSearch
cv.best_params_

{'tfidf__use_idf': False, 'vect__min_df': 5}
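
The "4 candidates, totalling 12 fits" line in the log follows from the size of the grid. The sketch below enumerates the combinations with `itertools.product`; it is a stand-alone illustration of the arithmetic rather than scikit-learn's own code, and the sorted key order is an assumption that happens to match the order seen in the log.

```python
from itertools import product

param_grid = {
    "vect__min_df": [1, 5],
    "tfidf__use_idf": [True, False],
}
n_folds = 3

# Build every combination of parameter values (keys sorted for a stable order).
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)

# Enabling the two commented-out parameters (2 and 3 values) multiplies the grid.
extended_fits = len(candidates) * 2 * 3 * n_folds
print(extended_fits)
```

At roughly five minutes per fit, as in the log above, the extended grid would take several hours, which is presumably why those parameters were left commented out.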

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [33]:
# Calculate evaluation metrics for test set
model2_pred_test = model2.predict(X_test)

eval_metrics2 = get_eval_metrics(np.array(Y_test), model2_pred_test, col_names)

print(eval_metrics2)

                        Accuracy  Precision    Recall        F1
related                 0.804539   0.795187  0.804539  0.769825
request                 0.893021   0.891904  0.893021  0.877353
offer                   0.995042   0.990108  0.995042  0.992569
aid_related             0.771358   0.770930  0.771358  0.766598
medical_help            0.922960   0.904164  0.922960  0.895149
medical_products        0.953089   0.944806  0.953089  0.936979
search_and_rescue       0.977498   0.972524  0.977498  0.969293
security                0.982647   0.970248  0.982647  0.974972
military                0.971205   0.961612  0.971205  0.960287
child_alone             1.000000   1.000000  1.000000  1.000000
water                   0.949847   0.946285  0.949847  0.936963
food                    0.942220   0.938924  0.942220  0.937465
shelter                 0.937071   0.931155  0.937071  0.926474
clothing                0.987033   0.984475  0.987033  0.982012
money                   0.981312   0.978

In [34]:
# Get summary stats for tuned model test
eval_metrics2.describe()

| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| count | 36.0 | 36.0 | 36.0 | 36.0 |
| mean | 0.948227 | 0.940942 | 0.948227 | 0.935894 |
| std | 0.053253 | 0.054867 | 0.053253 | 0.059989 |
| min | 0.771358 | 0.77093 | 0.771358 | 0.766598 |
| 25% | 0.94098 | 0.927554 | 0.94098 | 0.926224 |
| 50% | 0.959477 | 0.956828 | 0.959477 | 0.950015 |
| 75% | 0.983743 | 0.977744 | 0.983743 | 0.976448 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
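
One way to add a feature alongside TF-IDF is scikit-learn's `FeatureUnion`, which concatenates the outputs of several transformers. The following is a rough sketch only: `MessageLength` is a hypothetical transformer invented for this example, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair used in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn, so return self unchanged.
        return self

    def transform(self, X):
        # One row per message, one column holding its length.
        return np.array([[len(text)] for text in X], dtype=float)

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

messages = ["we need water", "storm destroyed the bridge", "food please"]
matrix = features.fit_transform(messages)
print(matrix.shape)
```

A custom transformer such as `StartingVerbExtractor` could be placed in the union in the same way, and the union could then replace the `vect` and `tfidf` steps in front of the classifier.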

In [35]:
from sklearn.tree import DecisionTreeClassifier

In [36]:
# Try using DecisionTreeClassifier instead of Random Forest Classifier
pipeline3 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier( DecisionTreeClassifier(splitter='best') ))])

In [37]:
# List all the parameters for this pipeline
pipeline3.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001D5A95903A0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=DecisionTreeClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001D5A95903A0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=DecisionTreeClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text, url_place_holder_string='urlplaceholder')>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_i

In [38]:
# Create grid search object
# Some parameters are commented out, and the rest use only a few values each, to reduce runtime

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
#               'clf__estimator__criterion':['gini', 'entropy'], 
#               'clf__estimator__min_samples_leaf':[1, 3]
             }

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline3, param_grid = parameters, scoring = scorer, cv=3, verbose = 10, n_jobs=None)

In [39]:
# Find best parameters
np.random.seed(42)
model3 = cv.fit(X_train, Y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 1/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.957 total time= 4.5min
[CV 2/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 2/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.956 total time= 4.5min
[CV 3/3; 1/4] START tfidf__use_idf=True, vect__min_df=1.........................
[CV 3/3; 1/4] END tfidf__use_idf=True, vect__min_df=1;, score=0.955 total time= 4.4min
[CV 1/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 1/3; 2/4] END tfidf__use_idf=True, vect__min_df=5;, score=0.956 total time= 3.5min
[CV 2/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 2/3; 2/4] END tfidf__use_idf=True, vect__min_df=5;, score=0.954 total time= 3.5min
[CV 3/3; 2/4] START tfidf__use_idf=True, vect__min_df=5.........................
[CV 3/3; 2/4] END t

In [40]:
# Print the best parameters in the GridSearch
cv.best_params_

{'tfidf__use_idf': False, 'vect__min_df': 1}

In [41]:
# Calculate evaluation metrics for training set
Y_train_pred = model3.predict(X_train)
col_names = list(Y.columns.values)

In [42]:
eval_metrics3 = get_eval_metrics(np.array(Y_train), Y_train_pred, col_names)
print(eval_metrics3)

                        Accuracy  Precision    Recall        F1
related                 0.999189   0.999192  0.999189  0.999190
request                 0.999714   0.999714  0.999714  0.999714
offer                   0.999952   0.999952  0.999952  0.999952
aid_related             0.999475   0.999476  0.999475  0.999475
medical_help            0.999809   0.999809  0.999809  0.999809
medical_products        0.999809   0.999809  0.999809  0.999809
search_and_rescue       0.999952   0.999952  0.999952  0.999952
security                1.000000   1.000000  1.000000  1.000000
military                0.999809   0.999809  0.999809  0.999809
child_alone             1.000000   1.000000  1.000000  1.000000
water                   1.000000   1.000000  1.000000  1.000000
food                    0.999952   0.999952  0.999952  0.999952
shelter                 0.999952   0.999952  0.999952  0.999952
clothing                0.999952   0.999952  0.999952  0.999952
money                   1.000000   1.000

In [43]:
# Get summary stats for tuned model
eval_metrics3.describe()

| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| count | 36.0 | 36.0 | 36.0 | 36.0 |
| mean | 0.999875 | 0.999876 | 0.999875 | 0.999875 |
| std | 0.000174 | 0.000174 | 0.000174 | 0.000174 |
| min | 0.999189 | 0.999192 | 0.999189 | 0.99919 |
| 25% | 0.999809 | 0.999809 | 0.999809 | 0.999809 |
| 50% | 0.999952 | 0.999952 | 0.999952 | 0.999952 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


In [44]:
# Calculate evaluation metrics for test set
model3_pred_test = model3.predict(X_test)

eval_metrics4 = get_eval_metrics(np.array(Y_test), model3_pred_test, col_names)

print(eval_metrics4)

                        Accuracy  Precision    Recall        F1
related                 0.754005   0.744384  0.754005  0.748535
request                 0.850114   0.845739  0.850114  0.847736
offer                   0.991609   0.990091  0.991609  0.990850
aid_related             0.697368   0.696101  0.697368  0.696639
medical_help            0.900458   0.896086  0.900458  0.898187
medical_products        0.944889   0.939344  0.944889  0.941795
search_and_rescue       0.962052   0.960395  0.962052  0.961213
security                0.973684   0.969284  0.973684  0.971437
military                0.956140   0.957221  0.956140  0.956675
child_alone             1.000000   1.000000  1.000000  1.000000
water                   0.953280   0.951978  0.953280  0.952583
food                    0.942410   0.942079  0.942410  0.942240
shelter                 0.936117   0.934522  0.936117  0.935263
clothing                0.986842   0.985364  0.986842  0.985986
money                   0.972540   0.973

In [45]:
# Get summary stats for model3 test
eval_metrics4.describe()

| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| count | 36.0 | 36.0 | 36.0 | 36.0 |
| mean | 0.932786 | 0.929815 | 0.932786 | 0.931183 |
| std | 0.069603 | 0.071136 | 0.069603 | 0.070439 |
| min | 0.697368 | 0.696101 | 0.697368 | 0.696639 |
| 25% | 0.931541 | 0.923787 | 0.931541 | 0.927869 |
| 50% | 0.95471 | 0.954599 | 0.95471 | 0.954629 |
| 75% | 0.97869 | 0.976534 | 0.97869 | 0.977523 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


### 9. Export your model as a pickle file

In [46]:
# Pickle the tuned decision-tree model (model3); the file is closed automatically
with open('response_message_model.pkl', 'wb') as model_file:
    pickle.dump(model3, model_file)
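
Loading the file back mirrors the dump call. The sketch below shows the round trip with a plain dictionary standing in for the fitted model, and writes to a temporary directory so that nothing in the project folder is touched; `model_stub` is purely illustrative.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model.
model_stub = {"best_params": {"tfidf__use_idf": False, "vect__min_df": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "response_message_model.pkl")

    # Write the object, then read it back from disk.
    with open(model_path, "wb") as f:
        pickle.dump(model_stub, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

One caveat for the real model: pickle stores functions by reference, so the `tokenize` function must be importable under the same module name in whatever script later loads the file.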

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
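
As a starting point, the command-line handling of such a script might be sketched as follows. The argument names `database_filepath` and `model_filepath` and the overall layout are assumptions for illustration, not the contents of the actual template, and the body of `main` only prints what it would do.

```python
import argparse

def parse_args(argv=None):
    """Parse the two file paths the script needs (hypothetical argument names)."""
    parser = argparse.ArgumentParser(description="Train the message classifier.")
    parser.add_argument("database_filepath", help="Path to the SQLite database")
    parser.add_argument("model_filepath", help="Where to write the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Steps to port from the notebook, in order:
    # load data, build the pipeline, fit, evaluate, pickle the model.
    print(f"Would load data from {args.database_filepath}")
    print(f"Would save model to {args.model_filepath}")
    return args

# Demo call with explicit arguments; a real script would instead call main()
# under an `if __name__ == "__main__":` guard so it reads sys.argv.
demo_args = main(["disaster_response_db.db", "response_message_model.pkl"])
```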