# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
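
As a rough illustration of the steps listed above, the sketch below builds a tiny in-memory SQLite table and separates it into features and targets. It uses only the standard library `sqlite3` module rather than pandas, and the two example rows and the reduced set of label columns are made up for the demonstration.

```python
import sqlite3

# Hypothetical toy table standing in for the real messages table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages_data (message TEXT, genre TEXT, related REAL, request REAL)"
)
conn.executemany(
    "INSERT INTO messages_data VALUES (?, ?, ?, ?)",
    [
        ("We need water", "direct", 1.0, 1.0),
        ("Storm warning tonight", "news", 1.0, 0.0),
    ],
)

rows = conn.execute(
    "SELECT message, genre, related, request FROM messages_data ORDER BY rowid"
).fetchall()
conn.close()

# Features: the raw message text. Targets: every column after `genre`.
X = [row[0] for row in rows]
Y = [list(row[2:]) for row in rows]

print(X)
print(Y)
```

The same column layout (message first, genre second, labels afterwards) is what the positional slicing later in this notebook relies on.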

In [None]:
!pip install --upgrade pip
!pip install --upgrade tensorflow

In [1]:
# import libraries
import re
import nltk
import pickle
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import (confusion_matrix, 
                             f1_score, 
                             classification_report, 
                             precision_score, 
                             recall_score)

from sklearn.feature_extraction.text import (CountVectorizer, 
                                             TfidfTransformer, 
                                             TfidfVectorizer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# display formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 120)

In [3]:
# load data from database
engine = create_engine('sqlite:///messages.db')
df = pd.read_sql('messages_data', con = engine)
df.head(3)

Unnamed: 0,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that could pass over Haiti,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Is the Hurricane over or is it not over,direct,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Looking for someone but no name,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
# display a few messages
df['message'][:5].tolist()

['Weather update - a cold front from Cuba that could pass over Haiti',
 'Is the Hurricane over or is it not over',
 'Looking for someone but no name',
 'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.',
 'says: west side of Haiti, rest of the country today and tonight']

### 2. Write a tokenization function to process your text data

In [5]:
def tokenizer(text):
    """
    Transform raw text by removing punctuation, tokenizing, lemmatizing
    and lowercasing it
    Args:
        text: messages (raw text)
    Returns:
        clean_tokens: clean tokenized text
    """
    # remove punctuation from raw text
    text = re.sub(r'[^\w\s]', '', text) 
    
    # tokenize filtered text
    tokens = word_tokenize(text)
    
    # create lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    # iterate over the tokens
    for token in tokens:
        # lemmatize, lowercase and strip spaces
        clean_token = lemmatizer.lemmatize(token).lower().strip()
        clean_tokens.append(clean_token)
        
    return clean_tokens

In [6]:
# Compare the output of tokenizer function with raw text
for mess in df.message[:5]:
    print(f'Raw text: {mess}')
    print(f'Tokenized text: {tokenizer(mess)}')
    print('\n')

Raw text: Weather update - a cold front from Cuba that could pass over Haiti
Tokenized text: ['weather', 'update', 'a', 'cold', 'front', 'from', 'cuba', 'that', 'could', 'pas', 'over', 'haiti']


Raw text: Is the Hurricane over or is it not over
Tokenized text: ['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over']


Raw text: Looking for someone but no name
Tokenized text: ['looking', 'for', 'someone', 'but', 'no', 'name']


Raw text: UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
Tokenized text: ['un', 'report', 'leogane', '8090', 'destroyed', 'only', 'hospital', 'st', 'croix', 'functioning', 'needs', 'supply', 'desperately']


Raw text: says: west side of Haiti, rest of the country today and tonight
Tokenized text: ['say', 'west', 'side', 'of', 'haiti', 'rest', 'of', 'the', 'country', 'today', 'and', 'tonight']
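
The output above shows `80-90` turning into the single token `8090`, because punctuation is deleted rather than replaced. The sketch below compares both options on a shortened message. It uses `str.split` instead of `word_tokenize` to stay dependency-free, so it is only an approximation of the tokenizer in this notebook.

```python
import re

text = "UN reports Leogane 80-90 destroyed."

# Deleting punctuation (as the tokenizer above does) glues "80" and "90" together.
deleted = re.sub(r"[^\w\s]", "", text).lower().split()

# Replacing punctuation with a space keeps them as separate tokens.
spaced = re.sub(r"[^\w\s]", " ", text).lower().split()

print(deleted)
print(spaced)
```

Whether the separate tokens are preferable depends on the data; for ranges and hyphenated words, replacing with a space usually keeps more of the original meaning.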




### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
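
Conceptually, a multi-output wrapper fits one copy of the base estimator per target column and stacks the per-column predictions. The sketch below shows that idea in plain Python. `MajorityClassifier` and `OnePerLabel` are hypothetical names invented for this illustration, and this is not scikit-learn's implementation.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OnePerLabel:
    """Toy version of the one-estimator-per-target idea."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_labels = len(Y[0])
        # Fit an independent estimator on each target column.
        self.estimators_ = [
            self.make_estimator().fit(X, [row[j] for row in Y])
            for j in range(n_labels)
        ]
        return self

    def predict(self, X):
        per_label = [est.predict(X) for est in self.estimators_]
        # Transpose from (labels x samples) to (samples x labels).
        return [list(row) for row in zip(*per_label)]


X = ["need water", "storm tonight", "send food"]
Y = [[1, 0], [1, 0], [1, 1]]  # two made-up label columns

model = OnePerLabel(MajorityClassifier).fit(X, Y)
preds = model.predict(["any new message"])
print(preds)
```

Because each label gets its own estimator, labels with very few positive examples are learned in isolation, which is worth keeping in mind when reading the per-label scores later on.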

In [7]:
# simple machine learning pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenizer)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
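
The split used in this notebook is stratified by `genre`, so each genre keeps roughly the same share in the training and validation sets. A minimal sketch of that idea follows. The helper `stratified_split` and the toy data are assumptions for illustration and do not reproduce how `train_test_split` works internally.

```python
import random
from collections import defaultdict


def stratified_split(items, strata, test_size=0.2, seed=42):
    """Split items so that each stratum contributes about test_size of its rows."""
    by_stratum = defaultdict(list)
    for index, stratum in enumerate(strata):
        by_stratum[stratum].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train = [items[i] for i in sorted(train_idx)]
    val = [items[i] for i in sorted(val_idx)]
    return train, val


# Made-up data: 10 "direct" messages followed by 5 "news" messages.
genres = ["direct"] * 10 + ["news"] * 5
messages = [f"msg {i}" for i in range(15)]

train, val = stratified_split(messages, genres)
print(len(train), len(val))
```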

In [8]:
# split the data into train and validation sets
def split_data(df, features = None, labels = None, test_size = 0.2):
    """
    Split the data into training and validation sets, stratified by genre
    Args:
        df: dataframe to be split
        features: list of column names to be used as features
                  (default: the message column)
        labels: list of column names to be used as labels
                (default: every column after genre)
        test_size: fraction of rows held out for validation (default = 0.2)
    Returns:
       x_train, y_train, x_val, y_val: training and validation sets
    """
    
    # without feature engineering, the raw message column is the only feature
    x = df.iloc[:, 0] if features is None else df[features]
    
    # use the labels argument instead of relying on a global variable
    y = df.iloc[:, 2:] if labels is None else df[labels]
    
    x_train, x_val, y_train, y_val = train_test_split(x,
                                                      y,
                                                      test_size = test_size, 
                                                      random_state = 42, 
                                                      stratify = df['genre'])
        
    # print the shape of the training and validation sets
    print(f'x_train shape: {x_train.shape}\ny_train shape: {y_train.shape}')
    print(f'x_val shape: {x_val.shape}\ny_val shape: {y_val.shape}')
    
    return x_train, y_train, x_val, y_val

In [9]:
# split data
x_train, y_train, x_val, y_val = split_data(df)

# fit the training data 
pipeline.fit(x_train, y_train.values)

x_train shape: (20965,)
y_train shape: (20965, 36)
x_val shape: (5242,)
y_val shape: (5242, 36)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
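
For reference, the three per-label scores can be computed by hand from true positives, false positives and false negatives. The sketch below does this for two made-up label columns. The helper name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is a choice made here to mirror the 0.0 rows that appear for labels with no predicted positives.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Fall back to 0.0 when a score is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up ground truth and predictions for two labels.
y_true = {"related": [1, 1, 0, 1], "offer": [0, 0, 1, 0]}
y_pred = {"related": [1, 0, 0, 1], "offer": [0, 0, 0, 0]}

report = {label: precision_recall_f1(y_true[label], y_pred[label]) for label in y_true}
for label, scores in report.items():
    print(label, scores)
```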

In [10]:
def evaluate_model(x_val, y_val, pipeline = pipeline): 
    """
    Evaluate the model by calculating the precision, recall and f1 score
    of every label and printing a classification report per label
    
    Args:
        x_val: validation features 
        y_val: ground truth labels of validation data (dataframe)
        pipeline: fitted pipeline or grid search object used for prediction
    Returns:
        report: dataframe with one column per label and the precision,
                recall and f1 score in rows 0, 1 and 2
    """
    # make predictions on validation data
    y_pred = pipeline.predict(x_val)
    
    # create dataframe from predictions, named after the label columns of y_val
    preds_df = pd.DataFrame(y_pred, columns = y_val.columns)
    
    # Create dictionary to hold the evaluation scores
    report = {}
    for col in y_val.columns:
        report[col] = []
        # calculate precision score
        report[col].append(precision_score(y_val[col], preds_df[col]))
        # calculate recall score
        report[col].append(recall_score(y_val[col], preds_df[col]))
        # calculate f1 score
        report[col].append(f1_score(y_val[col], preds_df[col]))
    
    # create dataframe from report dictionary
    report = pd.DataFrame(report)
    
    # print classification report 
    for i, col in enumerate(y_val.columns):
        print("Precision, Recall, F1 Score for {}".format(col))
        print(classification_report(y_val.iloc[:, i], y_pred[:, i]))
        
    return report

In [11]:
# create a list of label columns
label_cols = df.columns[2:].tolist()
label_cols

['related',
 'request',
 'offer',
 'aid_related',
 'medical_help',
 'medical_products',
 'search_and_rescue',
 'security',
 'military',
 'child_alone',
 'water',
 'food',
 'shelter',
 'clothing',
 'money',
 'missing_people',
 'refugees',
 'death',
 'other_aid',
 'infrastructure_related',
 'transport',
 'buildings',
 'electricity',
 'tools',
 'hospitals',
 'shops',
 'aid_centers',
 'other_infrastructure',
 'weather_related',
 'floods',
 'storm',
 'fire',
 'earthquake',
 'cold',
 'other_weather',
 'direct_report']

In [12]:
# evaluate model on the validation data
report = evaluate_model(x_val, y_val)
report.T



Precision, Recall, F1 Score for related
             precision    recall  f1-score   support

        0.0       0.32      0.13      0.18      1226
        1.0       0.77      0.92      0.84      4016

avg / total       0.67      0.73      0.69      5242

Precision, Recall, F1 Score for request
             precision    recall  f1-score   support

        0.0       0.84      0.98      0.90      4347
        1.0       0.43      0.08      0.13       895

avg / total       0.77      0.82      0.77      5242

Precision, Recall, F1 Score for offer
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00      5217
        1.0       0.00      0.00      0.00        25

avg / total       0.99      0.99      0.99      5242

Precision, Recall, F1 Score for aid_related
             precision    recall  f1-score   support

        0.0       0.60      0.83      0.70      3095
        1.0       0.44      0.19      0.27      2147

avg / total       0.54      0.57

Unnamed: 0,0,1,2
related,0.774837,0.918576,0.840606
request,0.42515,0.07933,0.13371
offer,0.0,0.0,0.0
aid_related,0.444444,0.193759,0.269867
medical_help,0.192308,0.012285,0.023095
medical_products,0.05,0.003861,0.007168
search_and_rescue,0.2,0.007299,0.014085
security,0.0,0.0,0.0
military,0.0,0.0,0.0
child_alone,0.0,0.0,0.0


### 6. Improve your model
Use grid search to find better parameters. 

In [13]:
# display all the parameters of the pipeline
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estimator__verbose', 'clf__estimator__

In [14]:
# prepare dictionary of parameters
parameters = {'vect__max_df': [0.5, 1.0],
              'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__estimator__max_depth': [50, 100],
              'tfidf__use_idf': [True, False]
             }

# create GridSearchCV object
grid_cv = GridSearchCV(pipeline, parameters, cv = 3, n_jobs = -1)

# Fit and tune the model
grid_cv.fit(x_train, y_train)

# display the best parameters
grid_cv.best_params_

{'clf__estimator__max_depth': 50,
 'tfidf__use_idf': False,
 'vect__max_df': 0.5,
 'vect__ngram_range': (1, 2)}

In [15]:
# refit is True by default, so the best estimator has already been refit on the entire training data
grid_cv.refit

True

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!
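
Accuracy alone can be misleading for rare labels. Using the support counts reported for `offer` in this notebook (5217 negatives, 25 positives), the sketch below scores a hypothetical model that always predicts 0. The always-zero predictions are an assumption for illustration, not the actual model output.

```python
# Support counts for the `offer` label, taken from the classification report.
negatives, positives = 5217, 25

y_true = [0] * negatives + [1] * positives
y_pred = [0] * len(y_true)  # hypothetical model that never predicts the label

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / positives

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The accuracy is above 99% even though no positive example is found, which is why precision and recall per label are the more informative numbers here.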

In [16]:
# evaluate the model performance on validation dataset
report = evaluate_model(x_val, y_val, pipeline = grid_cv)
report.T

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Precision, Recall, F1 Score for related
             precision    recall  f1-score   support

        0.0       0.33      0.00      0.00      1226
        1.0       0.77      1.00      0.87      4016

avg / total       0.67      0.77      0.67      5242

Precision, Recall, F1 Score for request
             precision    recall  f1-score   support

        0.0       0.83      1.00      0.91      4347
        1.0       0.43      0.00      0.01       895

avg / total       0.76      0.83      0.75      5242

Precision, Recall, F1 Score for offer
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00      5217
        1.0       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      5242

Precision, Recall, F1 Score for aid_related
             precision    recall  f1-score   support

        0.0       0.59      0.94      0.73      3095
        1.0       0.43      0.06      0.11      2147

avg / total       0.52      0.58

Unnamed: 0,0,1,2
related,0.766291,0.998506,0.867121
request,0.428571,0.003352,0.006652
offer,0.0,0.0,0.0
aid_related,0.428571,0.060084,0.105392
medical_help,0.0,0.0,0.0
medical_products,0.0,0.0,0.0
search_and_rescue,0.0,0.0,0.0
security,0.0,0.0,0.0
military,0.0,0.0,0.0
child_alone,0.0,0.0,0.0


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [17]:
def add_features(df):
    """
    This function will create additional features to improve the performance
    of the model. Features such as length of the message, number of words, 
    number of non stopwords and average word length in each message will be
    created by this method.
    
    Args:
        df: original dataframe
        
    Returns:
        df: dataframe with new added features
    """
    # create a set of stopwords
    StopWords = set(stopwords.words('english'))
    
    # lowering and removing punctuation
    df['processed_text'] = df['message'].apply(lambda x: re.sub(r'[^\w\s]', '', x.lower()))
    
    # apply lemmatization
    df['processed_text'] = df['processed_text'].apply(
        lambda x: ' '.join([WordNetLemmatizer().lemmatize(token) for token in x.split()]))
    # get length of the message
    df['length'] = df['processed_text'].apply(lambda x: len(x))
    
    # get number of words in each message
    df['num_words'] = df['processed_text'].apply(lambda x: len(x.split()))
    
    # get the number of non stopwords in each message
    df['non_stopwords'] = df['processed_text'].apply(
        lambda x: len([t for t in x.split() if t not in StopWords]))
    
    # get the average word length
    df['avg_word_len'] = df['processed_text'].apply(
        lambda x: np.mean([len(t) for t in x.split() if t not in StopWords]) \
        if len([len(t) for t in x.split() if t not in StopWords]) > 0 else 0)
    
    # update stop words (didn't want to remove negation)
    StopWords = StopWords.difference(
        ["aren't", 'nor', 'not', 'no', "isn't", "couldn't", "hasn't", "hadn't", "haven't",
         "didn't", "doesn't", "wouldn't", "can't"])
    
    # remove stop words from processed text message
    df['processed_text'] = df['processed_text'].apply(
        lambda x: ' '.join([token for token in x.split() if token not in StopWords]))
        
    # filter the words with length > 2
    df['processed_text'] = df['processed_text'].apply(
        lambda x: ' '.join([token for token in x.split() if len(token) > 2]))
    
    return df

In [18]:
# dataframe with additional features
df = add_features(df)
df.head()

Unnamed: 0,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,processed_text,length,num_words,non_stopwords,avg_word_len
0,Weather update - a cold front from Cuba that could pass over Haiti,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,weather update cold front cuba could pas haiti,63,12,8,4.875
1,Is the Hurricane over or is it not over,direct,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,hurricane not,39,9,1,9.0
2,Looking for someone but no name,direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,looking someone name,31,6,3,6.0
3,UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.,direct,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,report leogane 8090 destroyed hospital croix functioning need supply desperately,91,13,12,6.25
4,"says: west side of Haiti, rest of the country today and tonight",direct,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,say west side haiti rest country today tonight,60,12,8,4.875


In [19]:
# one hot encoding of genre column
# df = pd.concat([df, pd.get_dummies(df['genre'])], axis = 1)

# select feature columns from the dataframe
features = ['processed_text', 'genre', 'length', 'num_words', 'non_stopwords', 'avg_word_len']

# split the data into train and validation sets
x_train, y_train, x_val, y_val = split_data(df, features = features, labels = label_cols)

x_train shape: (20965, 6)
y_train shape: (20965, 36)
x_val shape: (5242, 6)
y_val shape: (5242, 36)


In [20]:
class TextColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations.
    This class will select columns containing text data.
    """
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        return X[self.key]
    
    

class NumColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations.
    This class will select the columns containing numeric data.
    """
    def __init__(self, key):
        self.key = key
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        return X[[self.key]]
    
    

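class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    """
    This class will create custom label binarizer for one hot encoding the genre column.
    The categories are learned in fit and reused in transform, so training and
    validation data are encoded with the same columns.
    """
    def __init__(self, sparse_output = False):
        self.sparse_output = sparse_output
        
    def fit(self, X, y = None):
        # learn the genre categories from the training data only
        self.label_encoder_ = LabelBinarizer(sparse_output = self.sparse_output)
        self.label_encoder_.fit(X)
        return self
    
    def transform(self, X, y = None):
        # encode with the categories learned during fit
        return self.label_encoder_.transform(X)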
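
One thing to watch with an encoding step like this: the categories should be learned during `fit` and reused in `transform`, otherwise training and validation data can end up with different column layouts. The sketch below shows the effect with a hypothetical pure-Python encoder, `ToyOneHot`. The genre values other than `direct` are assumed for the example.

```python
class ToyOneHot:
    """Hypothetical minimal one-hot encoder used only for this illustration."""

    def fit(self, values):
        # Remember the categories seen during fitting.
        self.classes_ = sorted(set(values))
        return self

    def transform(self, values):
        # One column per remembered category.
        return [[1 if v == c else 0 for c in self.classes_] for v in values]


train_genres = ["direct", "news", "social", "news"]
val_genres = ["news", "direct"]  # "social" happens to be absent here

encoder = ToyOneHot().fit(train_genres)
val_consistent = encoder.transform(val_genres)  # same 3 columns as training

# Refitting on the validation data silently drops a column.
val_refit = ToyOneHot().fit(val_genres).transform(val_genres)

print(val_consistent)
print(val_refit)
```

A classifier trained on three genre columns cannot be applied reliably to a two-column encoding, so keeping the fitted state inside the transformer is the safer pattern.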

In [21]:
# create separate pipelines to process individual features

# pipeline to process num_words column
num_words = Pipeline([
    ('selector', NumColumnSelector(key = 'num_words')),
    ('scaler', StandardScaler())
])

# pipeline to process non_stopwords column
num_non_stopwords = Pipeline([
    ('selector', NumColumnSelector(key = 'non_stopwords')),
    ('scaler', StandardScaler())
])

# pipeline to process avg_word_len column
avg_word_length = Pipeline([
    ('selector', NumColumnSelector(key = 'avg_word_len')),
    ('scaler', StandardScaler())
])

# pipeline to process processed_text column
message_processing = Pipeline([
    ('selector', TextColumnSelector(key = 'processed_text')),
    ('tfidf', TfidfVectorizer(stop_words = 'english'))
])


# pipeline to process length column
length = Pipeline([
    ('selector', NumColumnSelector(key = 'length')),
    ('scaler', StandardScaler())
])

# pipeline to process genre column
genre = Pipeline([
    ('selector', TextColumnSelector(key = 'genre')),
    ('scaler', CustomLabelBinarizer())
])

In [22]:
# process all the pipelines in parallel using feature union
feature_union = FeatureUnion([
    ('num_words', num_words),
    ('num_non_stopwords', num_non_stopwords),
    ('avg_word_length', avg_word_length),
    ('message_processing', message_processing),
    ('length', length),
    ('genre_ohe', genre)
])


# create final pipeline to train the classifier
final_pipeline = Pipeline([
    ('feature_union', feature_union),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# fit the pipeline on training data
final_pipeline.fit(x_train, y_train.values)

Pipeline(memory=None,
     steps=[('feature_union', FeatureUnion(n_jobs=1,
       transformer_list=[('num_words', Pipeline(memory=None,
     steps=[('selector', NumColumnSelector(key='num_words')), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('num_non_stopwords', Pipeline(memory=None,
     steps=[...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [23]:
# evaluate the model performance on validation dataset
report = evaluate_model(x_val, y_val, pipeline = final_pipeline)
report.T



Precision, Recall, F1 Score for related
             precision    recall  f1-score   support

        0.0       0.37      0.14      0.20      1226
        1.0       0.78      0.93      0.85      4016

avg / total       0.68      0.74      0.70      5242

Precision, Recall, F1 Score for request
             precision    recall  f1-score   support

        0.0       0.84      0.97      0.90      4347
        1.0       0.40      0.09      0.15       895

avg / total       0.76      0.82      0.77      5242

Precision, Recall, F1 Score for offer
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00      5217
        1.0       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      5242

Precision, Recall, F1 Score for aid_related
             precision    recall  f1-score   support

        0.0       0.60      0.81      0.69      3095
        1.0       0.43      0.21      0.28      2147

avg / total       0.53      0.56

Unnamed: 0,0,1,2
related,0.779403,0.929034,0.847666
request,0.395238,0.092737,0.150226
offer,0.0,0.0,0.0
aid_related,0.433301,0.207266,0.280403
medical_help,0.125,0.004914,0.009456
medical_products,0.055556,0.003861,0.00722
search_and_rescue,0.0,0.0,0.0
security,0.0,0.0,0.0
military,0.333333,0.00565,0.011111
child_alone,0.0,0.0,0.0


In [24]:
# get the parameters from the pipeline
final_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'feature_union', 'clf', 'feature_union__n_jobs', 'feature_union__transformer_list', 'feature_union__transformer_weights', 'feature_union__num_words', 'feature_union__num_non_stopwords', 'feature_union__avg_word_length', 'feature_union__message_processing', 'feature_union__length', 'feature_union__genre_ohe', 'feature_union__num_words__memory', 'feature_union__num_words__steps', 'feature_union__num_words__selector', 'feature_union__num_words__scaler', 'feature_union__num_words__selector__key', 'feature_union__num_words__scaler__copy', 'feature_union__num_words__scaler__with_mean', 'feature_union__num_words__scaler__with_std', 'feature_union__num_non_stopwords__memory', 'feature_union__num_non_stopwords__steps', 'feature_union__num_non_stopwords__selector', 'feature_union__num_non_stopwords__scaler', 'feature_union__num_non_stopwords__selector__key', 'feature_union__num_non_stopwords__scaler__copy', 'feature_union__num_non_stopwords__scaler__with_mean', 'fea

### GridSearchCV is computationally intensive for hyperparameter tuning 
- Uncomment the code cell below to **tune the hyperparameters of the model**. Parameters can be added or removed depending on the computational resources available.
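
To get a feel for the cost before running a search, the number of model fits can be counted up front. The sketch below uses the parameter names and values of the commented-out grid in this section with 3-fold cross-validation, and adds one fit for the final refit that `GridSearchCV` performs when `refit=True` (its default).

```python
from itertools import product

# Same shape as the grid in this section: four parameters, two values each.
param_grid = {
    "feature_union__message_processing__tfidf__max_df": [0.9, 0.95],
    "feature_union__message_processing__tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__max_depth": [200, 400],
    "clf__estimator__min_samples_leaf": [1, 2],
}
cv_folds = 3

# Every combination of values is one candidate configuration.
combos = list(product(*param_grid.values()))

# Each candidate is fit once per fold, plus one final refit on all training data.
n_fits = len(combos) * cv_folds + 1

print(f"{len(combos)} candidates, {n_fits} pipeline fits")
```

Each of those fits trains one random forest per label, so adding a single extra value to any parameter increases the total noticeably.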

In [25]:
# # prepare dictionary of parameters
# parameters = {'feature_union__message_processing__tfidf__max_df': [0.9, 0.95],
#                'feature_union__message_processing__tfidf__ngram_range': [(1, 1), (1, 2)],
#                'clf__estimator__max_depth': [200, 400],
#                'clf__estimator__min_samples_leaf': [1, 2]
#              }

# # create GridSearchCV object
# grid_cv = GridSearchCV(final_pipeline, parameters, cv = 3, n_jobs = -1)

# # Fit and tune model
# grid_cv.fit(x_train, y_train)

# # display the best parameters
# grid_cv.best_params_

# # refit is True by default, so the best estimator has already been refit on the entire training data
# grid_cv.refit

In [26]:
# evaluate the model performance on validation dataset
report = evaluate_model(x_val, y_val, pipeline = final_pipeline)
report.T



Precision, Recall, F1 Score for related
             precision    recall  f1-score   support

        0.0       0.37      0.14      0.20      1226
        1.0       0.78      0.93      0.85      4016

avg / total       0.68      0.74      0.70      5242

Precision, Recall, F1 Score for request
             precision    recall  f1-score   support

        0.0       0.84      0.97      0.90      4347
        1.0       0.40      0.09      0.15       895

avg / total       0.76      0.82      0.77      5242

Precision, Recall, F1 Score for offer
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00      5217
        1.0       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      5242

Precision, Recall, F1 Score for aid_related
             precision    recall  f1-score   support

        0.0       0.60      0.81      0.69      3095
        1.0       0.43      0.21      0.28      2147

avg / total       0.53      0.56

Unnamed: 0,0,1,2
related,0.779403,0.929034,0.847666
request,0.395238,0.092737,0.150226
offer,0.0,0.0,0.0
aid_related,0.433301,0.207266,0.280403
medical_help,0.125,0.004914,0.009456
medical_products,0.055556,0.003861,0.00722
search_and_rescue,0.0,0.0,0.0
security,0.0,0.0,0.0
military,0.333333,0.00565,0.011111
child_alone,0.0,0.0,0.0


### 9. Export your model as a pickle file

In [28]:
# save the fitted pipeline; the context manager closes the file afterwards
with open('model_pipeline', 'wb') as f:
    pickle.dump(final_pipeline, f)
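
A saved model can be restored with `pickle.load`. The sketch below round-trips a stand-in dictionary through a temporary file to show the pattern. The dictionary and the temporary directory are assumptions for the demo; in the notebook the object would be `final_pipeline`.

```python
import os
import pickle
import tempfile

# Stand-in object; the notebook would save `final_pipeline` instead.
model = {"name": "demo-model", "labels": ["related", "request"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pipeline")

    # Write the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Note that unpickling a pipeline containing custom classes such as `TextColumnSelector` requires those class definitions to be importable in the process that loads the file.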

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.