# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
import nltk
nltk.download(['punkt','stopwords','wordnet'])

# import libraries
import re
import numpy as np
import pandas as pd

from sqlalchemy import create_engine

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)
X = df.message.values

## getting the categorical values 
categorical_col = df.columns[4:]
y = df[categorical_col]


In [49]:
categorical_col

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [3]:
y

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
7,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """
    Process the raw texts includes:
        1. replace any urls with the string 'urlplaceholder'
        2. remove punctuation
        3. tokenize texts
        4. remove stop words
        5. normalize and lemmatize texts
    Args:
    text (str): raw texts
    Return: a list of clean words in their roots form
    """
    
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # get list of all urls using regex
    detected_urls = re.findall(url_regex,text)
    # replace each url in text strings with placeholder
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')
    
    # remove puntuation characters
    text = re.sub(r"[^a-zA-Z0-9]", " ",text)
    
    #tokenizing the text
    tokens = word_tokenize(text)
    token_list =[]
    #initializing lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    
    # remove stop words
    for tok in tokens:
        if tok not in stopwords.words("english"):
             token_list.append(tok)

    clean_tokens = []
    for tok in token_list:
        #lemmatizing, case normalization and removing leading and trailing whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
  

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
#Craeting a pipeline
pipeline = pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
#splitting data set into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

## train classifier 
pipeline.fit(X_train, y_train)

# predict test data
y_pred = pipeline.predict(X_test)

In [7]:
# finding the accurcay 

accuracy = (y_pred == y_test).mean()

print("Accuracy:", accuracy)

Accuracy: related                   0.810650
request                   0.886482
offer                     0.995880
aid_related               0.747025
medical_help              0.922948
medical_products          0.954379
search_and_rescue         0.976350
security                  0.981538
military                  0.969027
child_alone               1.000000
water                     0.950565
food                      0.923863
shelter                   0.925236
clothing                  0.985810
money                     0.979707
missing_people            0.990998
refugees                  0.966738
death                     0.959262
other_aid                 0.872139
infrastructure_related    0.932255
transport                 0.954532
buildings                 0.950412
electricity               0.979554
tools                     0.993592
hospitals                 0.987946
shops                     0.995423
aid_centers               0.988862
other_infrastructure      0.954837
weather_re

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
def get_scores(y_test,y_pred):
    i =0
    for col in y_test:
        print('Feature {}:{}'.format(i+1,col))
        print(classification_report(y_test[col],y_pred[:,i]))
        i=i+1
    rfc_accuracy = (y_pred == y_test.values).mean()
    print('The model accuracy score is {:.2f}'.format(accuracy))

In [9]:
get_scores(y_test,y_pred)


Feature 1:related
             precision    recall  f1-score   support

          0       0.64      0.46      0.53      1501
          1       0.85      0.92      0.88      5008
          2       0.50      0.11      0.18        45

avg / total       0.80      0.81      0.80      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5399
          1       0.82      0.45      0.58      1155

avg / total       0.88      0.89      0.87      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.75      0.85      0.80      3822
          1       0.74      0.61      0.67      2732

avg / total       0.75      0.75      0.74      6554

Feature 5:med

  'precision', 'predicted', average, warn_for)


TypeError: unsupported format string passed to Series.__format__

In [10]:
rfc_accuracy = (y_pred == y_test).mean()
rfc_accuracy

related                   0.810650
request                   0.886482
offer                     0.995880
aid_related               0.747025
medical_help              0.922948
medical_products          0.954379
search_and_rescue         0.976350
security                  0.981538
military                  0.969027
child_alone               1.000000
water                     0.950565
food                      0.923863
shelter                   0.925236
clothing                  0.985810
money                     0.979707
missing_people            0.990998
refugees                  0.966738
death                     0.959262
other_aid                 0.872139
infrastructure_related    0.932255
transport                 0.954532
buildings                 0.950412
electricity               0.979554
tools                     0.993592
hospitals                 0.987946
shops                     0.995423
aid_centers               0.988862
other_infrastructure      0.954837
weather_related     

### 6. Improve your model
Use grid search to find better parameters. 

In [11]:
parameters = {
    'clf__estimator__max_depth': [20, 25],
    'clf__estimator__n_estimators': [100,150]
        }

# create grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

It takes a long time to fir the cv into the train set

In [12]:

cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__max_depth': [20, 25], 'clf__estimator__n_estimators': [100, 150]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [None]:
##classification report on train set 

In [13]:
y_pred = cv.predict(X_test)

In [14]:
## prinitng classification report
i =0
for col in y_test:
    print('Feature {}:{}'.format(i+1,col))
    print(classification_report(y_test[col],y_pred[:,i]))
    i=i+1

Feature 1:related
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1501
          1       0.76      1.00      0.87      5008
          2       0.00      0.00      0.00        45

avg / total       0.58      0.76      0.66      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.83      1.00      0.90      5399
          1       1.00      0.01      0.02      1155

avg / total       0.86      0.83      0.75      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.65      0.98      0.78      3822
          1       0.89      0.28      0.43      2732

avg / total       0.75      0.69      0.63      6554

Feature 5:med

  'precision', 'predicted', average, warn_for)


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [15]:
rfc_grid_accuracy = (y_pred == y_test).mean()


In [16]:
##accuracy score on the test set
print(rfc_grid_accuracy)

related                   0.764114
request                   0.825603
offer                     0.995880
aid_related               0.685841
medical_help              0.920659
medical_products          0.950412
search_and_rescue         0.975587
security                  0.982453
military                  0.968111
child_alone               1.000000
water                     0.933933
food                      0.882362
shelter                   0.905706
clothing                  0.984437
money                     0.978486
missing_people            0.990388
refugees                  0.966738
death                     0.956057
other_aid                 0.872292
infrastructure_related    0.933476
transport                 0.953464
buildings                 0.947208
electricity               0.978639
tools                     0.993592
hospitals                 0.987946
shops                     0.995423
aid_centers               0.988862
other_infrastructure      0.955294
weather_related     

### 8. Try improving your model further. Here are a few ideas:
try other machine learning algorithms
add other features besides the TF-IDF

#### Decision Tree
We will try both with grid search and without

##### With Grid Search

In [17]:
#creating the tree pipeline 

Decisiontree_pipe = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])

## get the parameters
Decisiontree_pipe.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f473f0cfd90>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=None,
               splitter='best'),
              n_jobs=1))],
 

In [18]:
## specifying params
Decisiontree_params = {'clf__estimator__max_depth':[5]}

## creating the grid search
grid_Decisiontree = GridSearchCV(Decisiontree_pipe, param_grid=Decisiontree_params, cv=3, verbose=3)

## fitting the test sets
grid_Decisiontree.fit(X_train,y_train)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] clf__estimator__max_depth=5 .....................................
[CV]  clf__estimator__max_depth=5, score=0.2076594446139762, total= 1.4min
[CV] clf__estimator__max_depth=5 .....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.3min remaining:    0.0s


[CV]  clf__estimator__max_depth=5, score=0.21635642355813245, total= 1.4min
[CV] clf__estimator__max_depth=5 .....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.7min remaining:    0.0s


[CV]  clf__estimator__max_depth=5, score=0.22184925236496797, total= 1.4min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  7.0min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ion_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__max_depth': [5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [19]:
grid_Decisiontree.best_params_

{'clf__estimator__max_depth': 5}

In [20]:
## getting the y pred
y_grid_Decisiontree_pred = grid_Decisiontree.predict(X_test)


In [21]:
# classification report on test set
print('Grid tree Test Scores')
i =0
for col in y_test:
    print('Feature {}:{}'.format(i+1,col))
    print(classification_report(y_test[col],y_grid_Decisiontree_pred[:,i]))
    i=i+1


Grid tree Test Scores
Feature 1:related
             precision    recall  f1-score   support

          0       0.67      0.14      0.23      1501
          1       0.79      0.98      0.87      5008
          2       0.00      0.00      0.00        45

avg / total       0.75      0.78      0.72      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.88      0.98      0.93      5399
          1       0.81      0.38      0.52      1155

avg / total       0.87      0.88      0.86      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.70      0.86      0.77      3822
          1       0.72      0.48      0.58      2732

avg / total       0.71      0.70      0.69   

  'precision', 'predicted', average, warn_for)


In [22]:
# accuracy score on test set
print('Grid tree Accuracy')
grid_Decisiontree_accuracy = (y_grid_Decisiontree_pred == y_test).mean()
print(grid_Decisiontree_accuracy)

Grid tree Accuracy
related                   0.779982
request                   0.875648
offer                     0.995728
aid_related               0.704303
medical_help              0.923711
medical_products          0.961550
search_and_rescue         0.976655
security                  0.982453
military                  0.967806
child_alone               1.000000
water                     0.959109
food                      0.948886
shelter                   0.944309
clothing                  0.990388
money                     0.980165
missing_people            0.991608
refugees                  0.969789
death                     0.968569
other_aid                 0.872902
infrastructure_related    0.933781
transport                 0.958956
buildings                 0.955600
electricity               0.979249
tools                     0.992218
hospitals                 0.987641
shops                     0.995575
aid_centers               0.987794
other_infrastructure      0.954074
w

##### Without GridSearch


In [24]:
## fitting the test set
Decisiontree_pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ion_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=1))])

In [25]:
y_Decisiontree_pred = Decisiontree_pipe.predict(X_test)

In [26]:
# classification report on test set
print('Decisiontree Test Scores')
i =0
for col in y_test:
    print('Feature {}:{}'.format(i+1,col))
    print(classification_report(y_test[col],y_Decisiontree_pred[:,i]))
    i=i+1


Decisiontree Test Scores
Feature 1:related
             precision    recall  f1-score   support

          0       0.54      0.52      0.53      1501
          1       0.86      0.86      0.86      5008
          2       0.17      0.49      0.25        45

avg / total       0.79      0.78      0.78      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.90      0.92      0.91      5399
          1       0.58      0.54      0.56      1155

avg / total       0.85      0.85      0.85      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      0.99      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.75      0.76      0.75      3822
          1       0.66      0.65      0.65      2732

avg / total       0.71      0.71      0.71

In [27]:
# accuracy score on test set
print('Decisiontree Accuracy')
Decisiontree_accuracy = (y_Decisiontree_pred == y_test).mean()
print(Decisiontree_accuracy)

Decisiontree Accuracy
related                   0.779829
request                   0.850473
offer                     0.992981
aid_related               0.713000
medical_help              0.903113
medical_products          0.943088
search_and_rescue         0.962161
security                  0.967958
military                  0.963534
child_alone               1.000000
water                     0.955600
food                      0.936985
shelter                   0.931187
clothing                  0.985810
money                     0.973604
missing_people            0.983979
refugees                  0.958651
death                     0.961550
other_aid                 0.820415
infrastructure_related    0.901587
transport                 0.942325
buildings                 0.948428
electricity               0.973604
tools                     0.989472
hospitals                 0.981080
shops                     0.992524
aid_centers               0.980623
other_infrastructure      0.93393

##### Comparing the accuracy 

In [28]:
# model names
model_names = ['rfc','grid rfc','dtree','grid_dtree']

# concatenate accuracry scores
accuracy_df = pd.concat([pd.Series(rfc_accuracy), pd.Series(rfc_grid_accuracy),pd.Series(Decisiontree_accuracy),pd.Series(grid_Decisiontree_accuracy)
                        ],axis=1)
accuracy_df.columns=model_names

#Decisiontree_accuracy,grid_Decisiontree_accuracy
print('Models Accuracy Score Comparison')
accuracy_df

Models Accuracy Score Comparison


Unnamed: 0,rfc,grid rfc,dtree,grid_dtree
related,0.81065,0.764114,0.779829,0.779982
request,0.886482,0.825603,0.850473,0.875648
offer,0.99588,0.99588,0.992981,0.995728
aid_related,0.747025,0.685841,0.713,0.704303
medical_help,0.922948,0.920659,0.903113,0.923711
medical_products,0.954379,0.950412,0.943088,0.96155
search_and_rescue,0.97635,0.975587,0.962161,0.976655
security,0.981538,0.982453,0.967958,0.982453
military,0.969027,0.968111,0.963534,0.967806
child_alone,1.0,1.0,1.0,1.0


#### Using a Custom Transformer 
1.Create a custom estimator that identify the buzzwords relates to Disaster
2.Apply the feature union in above pipleline using rfc as the estimator
3.Choose to apply of rfc only

In [29]:
class DisasterWordExtractor(BaseEstimator, TransformerMixin):

    def disaster_words(self, text):
        """
        INPUT: text - string, raw text data
        OUTPUT: bool -bool object, True or False
        """
        # Build a list of words that are constantly used during a disaster event
        words = ['food','hunger','hungry','starving','water','drink','eat','thrist',
            'need','hospital','medicine','medical','ill','pain','disease','injured','falling',
            'wound','dying','death','dead','aid','help','assistance','cloth','cold','wet','shelter',
                'hurricane','earthquake','flood','live','alive','child','people','shortage','blocked',
                 'gas','pregnant','baby'
        ]
        
        
        # lemmatize the words
        lemmatized_words = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
        # Get the stem words of each word in  lemmatized_words 
        stem_disaster_words = [PorterStemmer().stem(w) for w in lemmatized_words]
       
        # get list of all urls using regex
        url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        detected_urls = re.findall(url_regex,text)
        # replace each url in text strings with placeholder
        for url in detected_urls:
            text = text.replace(url, 'urlplaceholder')
            
        #tokenize the text
        clean_tokens = tokenize(text)
        for token in clean_tokens:
            if token in stem_disaster_words:
                return True
        return False
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_disaster_word = pd.Series(X).apply(self.disaster_words)
        return pd.DataFrame(X_disaster_word)

In [30]:
# create the pipeline

dis_pipeline = Pipeline([
        ('features',FeatureUnion([
        ('text_pipeline',Pipeline([
        ('vect',CountVectorizer(tokenizer=tokenize)),
        ('tfidf',TfidfTransformer())])),
        ('disaster_words',DisasterWordExtractor())
        ])),
        ('clf',MultiOutputClassifier(RandomForestClassifier()))
        ])

In [31]:
# train classifier
dis_pipeline.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [32]:
y_dis_pred = dis_pipeline.predict(X_test)

In [33]:
## printing classification report
print('DiastorWordExtractor Test Scores')
i =0
for col in y_test:
    print('Feature {}:{}'.format(i+1,col))
    print(classification_report(y_test[col],y_dis_pred[:,i]))
    i=i+1

DiastorWordExtractor Test Scores
Feature 1:related
             precision    recall  f1-score   support

          0       0.62      0.45      0.52      1501
          1       0.85      0.91      0.88      5008
          2       0.30      0.40      0.34        45

avg / total       0.79      0.80      0.79      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      5399
          1       0.82      0.47      0.60      1155

avg / total       0.88      0.89      0.88      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.75      0.85      0.80      3822
          1       0.75      0.61      0.67      2732

avg / total       0.75      0.75  

  'precision', 'predicted', average, warn_for)


In [35]:
## printing accuracy
print('DiastorWordExtractor Accuracy')
dis_accuracy = (y_dis_pred == y_test).mean()
print(dis_accuracy)

DiastorWordExtractor Accuracy
related                   0.804699
request                   0.889075
offer                     0.995880
aid_related               0.750229
medical_help              0.923253
medical_products          0.956210
search_and_rescue         0.976655
security                  0.981996
military                  0.968569
child_alone               1.000000
water                     0.952548
food                      0.930119
shelter                   0.927373
clothing                  0.985047
money                     0.979249
missing_people            0.989930
refugees                  0.968721
death                     0.960024
other_aid                 0.873207
infrastructure_related    0.933476
transport                 0.953464
buildings                 0.951785
electricity               0.979707
tools                     0.993592
hospitals                 0.987946
shops                     0.995423
aid_centers               0.988862
other_infrastructure     

##### with grid search


In [36]:
params = {'clf__estimator__n_estimators':[100,200],
              'clf__estimator__max_depth':[5]}

# create grid search object
grid_dis_pipeline = GridSearchCV(dis_pipeline, param_grid=params, cv=3, verbose=3)

In [38]:
grid_dis_pipeline.fit(X_train,y_train)


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=100 ...
[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=100, score=0.18858712236801953, total= 3.4min
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=100 ...


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.6min remaining:    0.0s


[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=100, score=0.20170888007323773, total= 3.4min
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=100 ...


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 11.2min remaining:    0.0s


[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=100, score=0.19606347268843455, total= 3.4min
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=200 ...
[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=200, score=0.18889227952395485, total= 3.7min
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=200 ...
[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=200, score=0.20186145865120536, total= 3.7min
[CV] clf__estimator__max_depth=5, clf__estimator__n_estimators=200 ...
[CV]  clf__estimator__max_depth=5, clf__estimator__n_estimators=200, score=0.1962160512664022, total= 3.7min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 34.6min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [100, 200], 'clf__estimator__max_depth': [5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [40]:
grid_dis_pipeline.best_params_
y_grid_dispipeline_pred= grid_dis_pipeline.predict(X_test)

In [41]:
# classification report on test set
print('With Disaster Word Extractor Test Scores')
i =0
for col in y_test: 
    print('Feature {}:{}'.format(i+1,col))
    print(classification_report(y_test[col],y_grid_dispipeline_pred[:,i]))
    i=i+1

With Disaster Word Extractor Test Scores
Feature 1:related
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1501
          1       0.76      1.00      0.87      5008
          2       0.00      0.00      0.00        45

avg / total       0.58      0.76      0.66      6554

Feature 2:request
             precision    recall  f1-score   support

          0       0.82      1.00      0.90      5399
          1       0.00      0.00      0.00      1155

avg / total       0.68      0.82      0.74      6554

Feature 3:offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Feature 4:aid_related
             precision    recall  f1-score   support

          0       0.59      1.00      0.74      3822
          1       0.96      0.01      0.02      2732

avg / total       0.74    

  'precision', 'predicted', average, warn_for)


In [43]:
# accuracy score on test set
print('With Disaster Word Extractor Accuracy')
grid_dispipeline_test_accuracy = (y_grid_dispipeline_pred == y_test).mean()
print(grid_dispipeline_test_accuracy)

With Disaster Word Extractor Accuracy
related                   0.764114
request                   0.823772
offer                     0.995880
aid_related               0.586970
medical_help              0.920659
medical_products          0.950412
search_and_rescue         0.975587
security                  0.982453
military                  0.968111
child_alone               1.000000
water                     0.933933
food                      0.882362
shelter                   0.905706
clothing                  0.984437
money                     0.978486
missing_people            0.990388
refugees                  0.966738
death                     0.956057
other_aid                 0.872292
infrastructure_related    0.933476
transport                 0.953464
buildings                 0.947208
electricity               0.978639
tools                     0.993592
hospitals                 0.987946
shops                     0.995423
aid_centers               0.988862
other_infrastruct

In [44]:
## Comparing 
# model names
model_names = ['disaster','grid disaster']

# concatenate accuracry scores
accuracy_df2 = pd.concat([pd.Series(dis_accuracy), pd.Series(grid_dispipeline_test_accuracy)
                        ],axis=1)
accuracy_df2.columns=model_names


print('Models Accuracy Score Comparison')
accuracy_df2

Models Accuracy Score Comparison


Unnamed: 0,disaster,grid disaster
related,0.804699,0.764114
request,0.889075,0.823772
offer,0.99588,0.99588
aid_related,0.750229,0.58697
medical_help,0.923253,0.920659
medical_products,0.95621,0.950412
search_and_rescue,0.976655,0.975587
security,0.981996,0.982453
military,0.968569,0.968111
child_alone,1.0,1.0


The final model will not use the grid search as it has less accuracy 

### 9. Export your model as a pickle file

In [45]:
import pickle


In [48]:
# save the model to disk
pickle.dump(dis_pipeline, open('classifier.pickle','wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.