# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pickle

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sqlalchemy import create_engine

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# download the NLTK resources needed for tokenizing and lemmatizing
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('ETL', con=engine)
X = df['message']
Y = df.iloc[:,4:]
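
Selecting the targets with `df.iloc[:, 4:]` relies on the category columns starting at position 4. Selecting by name is a little more robust if the column order ever changes. The sketch below uses a toy frame; the non-category column names (`id`, `message`, `original`, `genre`) and the two category names are assumptions for illustration and are not read from the `ETL` table.

```python
import pandas as pd

# Toy stand-in for the ETL table; all column names here are assumed.
toy_df = pd.DataFrame({
    'id': [1, 2],
    'message': ['we need water', 'road is blocked'],
    'original': [None, None],
    'genre': ['direct', 'news'],
    'water': [1, 0],
    'transport': [0, 1],
})

non_target_cols = ['id', 'message', 'original', 'genre']

X_toy = toy_df['message']
# Drop the non-category columns by name instead of slicing by position.
Y_toy = toy_df.drop(columns=non_target_cols)

print(list(Y_toy.columns))
```

On this toy frame the name-based selection returns the same columns as `toy_df.iloc[:, 4:]`.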

In [3]:
X.shape

(26216,)

In [4]:
Y.shape

(26216, 36)

### 2. Write a tokenization function to process your text data

In [10]:
def tokenize(text):

    '''Process raw text: tokenize it, lemmatize each token, convert it to lower case and strip special characters from its ends
    
    Parameters:
    text (str): Raw text to be processed
    
    Returns:
    tokens (list): tokens processed from raw text 
    
    '''
    
    raw_tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    # raw string so that the backslash is kept as a literal character to strip
    special_chars = r'^°!"§$%&/()=?\}][{+~+#-_.:,;<>|'

    tokens = []
    for token in raw_tokens:
        clean_token = lemmatizer.lemmatize(token).lower().strip(special_chars)
        # skip tokens that consisted only of special characters
        if clean_token:
            tokens.append(clean_token)

    return tokens

In [12]:
#spot check functionality of tokenize function
print(tokenize(X[5]))

['information', 'about', 'the', 'national', 'palace']
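
One detail of the cleaning step is worth spelling out: `str.strip` with a character set only trims the ends of a token, so interior punctuation survives and punctuation-only tokens collapse to an empty string. A small sketch, using a shortened, hypothetical subset of the character set and made-up sample tokens:

```python
# Shortened, hypothetical subset of the characters stripped in tokenize().
PUNCT = '!"()?.,;:-_'

samples = ['(palace)', 'national-palace', '...', 'Hello!']
cleaned = [s.lower().strip(PUNCT) for s in samples]
print(cleaned)

# Dropping empty strings keeps punctuation-only tokens out of the vocabulary.
non_empty = [tok for tok in cleaned if tok]
print(non_empty)
```

The hyphen inside `national-palace` is untouched because it is not at either end of the token.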


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [13]:
# create pipeline with vectorizer, tfidf transformer and random forest classifier
pipeline1 = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultiOutputClassifier(RandomForestClassifier(random_state=42)))
            ])
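
To see what the three steps do together, here is a minimal sketch on toy data. The messages and the two target columns are made up, and the default tokenizer is used instead of `tokenize` so the block runs without NLTK. `MultiOutputClassifier` fits one copy of the wrapped estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two made-up binary targets (not from the real dataset).
toy_messages = [
    'we need water and food',
    'the bridge is destroyed',
    'please send drinking water',
    'roads are blocked by debris',
    'no food left in the village',
    'the road to the city is closed',
]
toy_targets = np.array([
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 1],
])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),      # raw text -> token counts
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('classifier', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=42))),
])
toy_pipeline.fit(toy_messages, toy_targets)

fitted = toy_pipeline.named_steps['classifier']
n_fitted = len(fitted.estimators_)    # one forest per target column
pred_shape = toy_pipeline.predict(['we have no water']).shape
print(n_fitted, pred_shape)
```

With 36 target columns, the real pipeline trains 36 random forests, which is part of why fitting and grid search are slow.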

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [14]:
# create test and training sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

#fit defined pipeline1
pipeline1.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))])
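
`train_test_split(X, Y)` is called without `random_state`, so each run of the notebook produces a different split and slightly different metrics. A minimal sketch of a seeded split follows; the seed of 42 is an assumption (borrowed from the classifier's `random_state`) and is not set in the original cell.

```python
from sklearn.model_selection import train_test_split

n_rows = 26216                    # number of messages reported by X.shape
indices = list(range(n_rows))     # stand-in for the real X

# With no test_size given, scikit-learn holds out 25% of the rows.
train_idx, test_idx = train_test_split(indices, random_state=42)
print(len(train_idx), len(test_idx))

# Repeating the call with the same seed reproduces the same split.
train_again, test_again = train_test_split(indices, random_state=42)
print(test_idx == test_again)
```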

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [15]:
#predict based on pipeline1
Y_pred1=pd.DataFrame(pipeline1.predict(X_test))

In [22]:
def print_metrics(Y_test, Y_pred):
    '''Print the average precision, recall and fscore over all categories and return the per-category details in a dataframe
    
    Parameters:
    Y_test (pandas dataframe): Test labels
    Y_pred (pandas dataframe): Predicted labels, with columns in the same order as Y_test
    
    Returns:
    results (pandas dataframe): weighted precision, recall and fscore per category 
    
    '''    
    
    rows = []
    for i, category in enumerate(Y_test.columns):
        # compute all three metrics with a single call per category
        precision, recall, fscore, _ = precision_recall_fscore_support(
            Y_test.iloc[:, i], Y_pred.iloc[:, i], average='weighted')
        rows.append({'categories': category, 'precision': precision,
                     'recall': recall, 'fscore': fscore})
    results = pd.DataFrame(rows, columns=['categories', 'precision', 'recall', 'fscore'])

    print('Overall average precision is: ' + str(results['precision'].mean()))
    print('Overall average recall is: ' + str(results['recall'].mean()))
    print('Overall average fscore is: ' + str(results['fscore'].mean()))
    
    return results

In [23]:
# check print_metrics function with prediction1
results1=print_metrics(Y_test, Y_pred1)

Overall average precision is: 0.931493498224
Overall average recall is: 0.940638456583
Overall average fscore is: 0.925891246715
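
These averages use `average='weighted'`, which weights each class by its number of true instances. For rare categories this can look good even when the positive class is never predicted. The sketch below recomputes weighted precision and recall by hand on hypothetical labels; it is an illustration of the weighting, not scikit-learn's implementation.

```python
def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall over all classes in y_true."""
    n = len(y_true)
    precision_sum = 0.0
    recall_sum = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        # A class that is never predicted has undefined precision; count it as 0.
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        precision_sum += support * precision
        recall_sum += support * recall
    return precision_sum / n, recall_sum / n

# Hypothetical rare category: 1 positive in 10 messages, model predicts all zeros.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
precision, recall = weighted_precision_recall(y_true, y_pred)
print(precision, recall)
```

Here the weighted recall is 0.9 although the single positive message is missed entirely, so the per-class numbers from `classification_report` are worth checking for rare categories.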




### 6. Improve your model
Use grid search to find better parameters. 

***Note: I ran the grid search only once because it took a very long time. Pipeline2, which is defined in section 7, holds the result of that search, i.e. its parameters are set to the best parameters the grid search found.***

In [16]:
parameters = {
        'classifier__estimator__n_estimators':[11,12,13],
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__max_df': (0.5, 0.75, 1.0),
        'tfidf__use_idf': (True, False)
}
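
The size of this grid determines the run time: every combination of values is one candidate, and each candidate is fitted once per cross-validation fold. A quick sketch of the count, assuming the 3 folds that the grid search log reports:

```python
from itertools import product

parameters = {
    'classifier__estimator__n_estimators': [11, 12, 13],
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
}

# Every combination of values is one candidate configuration.
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

n_folds = 3   # fold count reported in the grid search log
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Dropping one value from any parameter shrinks the total multiplicatively, which is the cheapest way to shorten the search.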

In [17]:
cv = GridSearchCV(pipeline1, param_grid=parameters, verbose=50)

In [18]:
cv.fit(X_train, Y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2050656087885261, total=  49.4s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   59.5s remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2216966737870003, total=  50.9s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.0min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21330485199877938, total=  50.2s

[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.20735428745804088, total=  49.7s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed: 38.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2181873664937443, total=  50.7s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed: 39.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2148306377784559, total=  49.4s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed: 40.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 

[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.22505340250228867, total=  51.9s
[Parallel(n_jobs=1)]: Done  49 out of  49 | elapsed: 77.2min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.22047604516325908, total=  53.2s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22871528837351235, total= 1.9min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22810497406164174, total= 2.0min

[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21574610924626184, total=  56.9s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.21849252364967958, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22779981690570644, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2177296307598413, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__estimator__n_estimators': [11, 12, 13], 'vect__ngram_range': ((1, 1), (1, 2)), 'vect__max_df': (0.5, 0.75, 1.0), 'tfidf__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=50)
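
With `scoring=None`, `GridSearchCV` falls back to the estimator's own `score` method. For a multi-output classifier this is, to my understanding, exact-match (subset) accuracy: a message only counts as correct if all 36 categories are predicted correctly. That would explain why the cross-validation scores in the log are around 0.2 while the per-category metrics in section 5 are above 0.9. A small sketch of the difference on hypothetical labels:

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of rows where every label is predicted correctly."""
    return sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)

def per_label_accuracy(Y_true, Y_pred):
    """Fraction of individual label cells predicted correctly."""
    cells = [(a, b) for t, p in zip(Y_true, Y_pred) for a, b in zip(t, p)]
    return sum(a == b for a, b in cells) / len(cells)

# Hypothetical labels: 4 messages, 3 categories.
Y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
Y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]

exact = subset_accuracy(Y_true, Y_pred)
per_label = per_label_accuracy(Y_true, Y_pred)
print(exact, per_label)
```

Three of the four rows have a single wrong label, so only one row counts as an exact match even though most individual labels are right. Passing an explicit `scoring` argument to `GridSearchCV` would be one way to select parameters by a metric closer to the one reported in section 5.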

In [19]:
cv.best_params_

{'classifier__estimator__n_estimators': 12,
 'tfidf__use_idf': True,
 'vect__max_df': 0.75,
 'vect__ngram_range': (1, 2)}
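
The keys in `best_params_` follow the `step__parameter` naming of the pipeline, so they can be passed straight to `set_params`. The sketch below applies them to a fresh pipeline with the same step names; `tuned_pipeline` is a hypothetical name and the default tokenizer is used to keep the block self-contained, so this is not necessarily how the later pipeline is built.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Best parameters as reported by cv.best_params_.
best_params = {
    'classifier__estimator__n_estimators': 12,
    'tfidf__use_idf': True,
    'vect__max_df': 0.75,
    'vect__ngram_range': (1, 2),
}

# Same step names as pipeline1, so the parameter keys resolve unchanged.
tuned_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])
tuned_pipeline.set_params(**best_params)

applied = tuned_pipeline.get_params()
print(applied['classifier__estimator__n_estimators'], applied['vect__ngram_range'])
```

Because the search ran with `refit=True`, `cv.best_estimator_` already holds a pipeline refitted with these parameters on the full training set, so rebuilding is only needed when the search itself is not rerun.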

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2050656087885261, total=  49.4s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   59.5s remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2216966737870003, total=  50.9s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.0min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21330485199877938, total=  50.2s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.0min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21956057369545315, total= 1.9min
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  5.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21757705218187368, total= 1.9min
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  7.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.2085749160817821, total= 1.9min
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  9.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21238938053097345, total=  49.2s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 10.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2230698809887092, total=  50.4s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 11.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2122368019530058, total=  50.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 12.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.21788220933780897, total= 1.9min
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 14.5min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2210863594751297, total= 2.0min
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed: 16.7min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2125419591089411, total= 2.0min
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 18.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21086359475129693, total=  49.3s
[Parallel(n_jobs=1)]: Done  13 out of  13 | elapsed: 19.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21330485199877938, total=  49.6s
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed: 20.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.20918523039365272, total=  49.8s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 21.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.2171193164479707, total= 2.0min
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed: 24.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21589868782422947, total= 1.9min
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed: 26.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21025328043942632, total= 1.8min
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 28.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.20598108025633202, total=  50.1s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed: 29.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2171193164479707, total=  50.4s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 30.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2070491303021056, total=  49.8s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed: 31.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21589868782422947, total= 1.9min
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed: 33.5min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.2162038449801648, total= 2.0min
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed: 35.7min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21040585901739395, total= 1.9min
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 37.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.20735428745804088, total=  49.7s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed: 38.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2181873664937443, total=  50.7s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed: 39.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2148306377784559, total=  49.4s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed: 40.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.21864510222764724, total= 1.8min
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed: 42.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22261214525480622, total= 2.0min
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed: 45.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.21238938053097345, total= 1.8min
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 47.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.20582850167836436, total=  48.8s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed: 48.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21330485199877938, total=  50.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed: 49.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21147390906316754, total=  49.5s
[Parallel(n_jobs=1)]: Done  33 out of  33 | elapsed: 50.0min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21742447360390602, total= 1.8min
[Parallel(n_jobs=1)]: Done  34 out of  34 | elapsed: 52.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21788220933780897, total= 1.9min
[Parallel(n_jobs=1)]: Done  35 out of  35 | elapsed: 54.2min remaining:    0.0s
[CV] classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=11, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.2181873664937443, total= 1.8min
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed: 56.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.22154409520903265, total=  52.9s
[Parallel(n_jobs=1)]: Done  37 out of  37 | elapsed: 57.3min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.23298748855660664, total=  53.6s
[Parallel(n_jobs=1)]: Done  38 out of  38 | elapsed: 58.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.22093378089716204, total=  53.8s
[Parallel(n_jobs=1)]: Done  39 out of  39 | elapsed: 59.5min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.23146170277693012, total= 2.1min
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 61.8min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22810497406164174, total= 2.1min
[Parallel(n_jobs=1)]: Done  41 out of  41 | elapsed: 64.1min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22001830942935613, total= 2.1min
[Parallel(n_jobs=1)]: Done  42 out of  42 | elapsed: 66.4min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.22642660970399756, total=  52.7s
[Parallel(n_jobs=1)]: Done  43 out of  43 | elapsed: 67.5min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.2331400671345743, total=  54.2s
[Parallel(n_jobs=1)]: Done  44 out of  44 | elapsed: 68.5min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.221391516631065, total=  53.3s
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed: 69.6min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2308513884650595, total= 1.9min
[Parallel(n_jobs=1)]: Done  46 out of  46 | elapsed: 71.7min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2308513884650595, total= 2.0min
[Parallel(n_jobs=1)]: Done  47 out of  47 | elapsed: 73.9min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22841013121757706, total= 2.0min
[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 76.2min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.22505340250228867, total=  51.9s
[Parallel(n_jobs=1)]: Done  49 out of  49 | elapsed: 77.2min remaining:    0.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.22047604516325908, total=  53.2s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22871528837351235, total= 1.9min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22810497406164174, total= 2.0min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22444308819041806, total= 1.8min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21986573085138847, total=  53.1s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.22963075984131828, total=  53.5s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2187976808056149, total=  52.8s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22902044552944767, total= 1.9min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22520598108025633, total= 2.0min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22413793103448276, total= 2.0min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.22322245956667683, total=  52.4s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.22978333841928594, total=  53.0s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.22886786695148, total=  52.6s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2308513884650595, total= 2.0min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.23069880988709185, total= 1.9min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2268843454379005, total= 2.1min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21986573085138847, total=  53.4s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.2282575526396094, total=  53.7s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.2268843454379005, total=  53.4s
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.2233750381446445, total= 1.9min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22902044552944767, total= 2.2min
[CV] classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=12, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22978333841928594, total= 1.9min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21361000915471468, total=  56.8s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.22245956667683858, total=  57.5s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21635642355813245, total=  56.9s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.2233750381446445, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.221391516631065, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21696673787000306, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21177906621910284, total=  57.1s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.22261214525480622, total=  57.5s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21574610924626184, total=  56.9s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.21849252364967958, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22779981690570644, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2177296307598413, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.2168141592920354, total=  56.6s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21757705218187368, total=  57.1s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.20918523039365272, total=  56.5s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22184925236496797, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.2187976808056149, total= 2.3min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=True, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21605126640219713, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21391516631065, total=  56.2s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.2197131522734208, total=  56.9s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.21315227342081172, total=  56.2s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.22215440952090326, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.221391516631065, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.21742447360390602, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21361000915471468, total=  57.3s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.221391516631065, total=  57.6s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 1), score=0.21849252364967958, total=  56.0s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22261214525480622, total= 2.1min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.22551113823619165, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=0.75, vect__ngram_range=(1, 2), score=0.2148306377784559, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.2105584375953616, total=  56.1s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.2220018309429356, total=  56.9s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 1), score=0.21895025938358254, total=  56.7s
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.2148306377784559, total= 2.2min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.21833994507171192, total= 2.4min
[CV] classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2) 
[CV]  classifier__estimator__n_estimators=13, tfidf__use_idf=False, vect__max_df=1.0, vect__ngram_range=(1, 2), score=0.22154409520903265, total= 2.2min
[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed: 179.6min finished
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__estimator__n_estimators': [11, 12, 13], 'vect__ngram_range': ((1, 1), (1, 2)), 'vect__max_df': (0.5, 0.75, 1.0), 'tfidf__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=50)

{'classifier__estimator__n_estimators': 12,
 'tfidf__use_idf': True,
 'vect__max_df': 0.75,
 'vect__ngram_range': (1, 2)}
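
The 108 fits reported in the log follow from the parameter grid: each combination of the four parameters is fitted once per cross-validation fold. A minimal sketch of that arithmetic, using only the standard library. The fold count of 3 is an assumption read off the log (three `[CV]` runs per combination), and the variable names are illustrative:

```python
from itertools import product

# Parameter grid as shown in the GridSearchCV repr above.
param_grid = {
    'classifier__estimator__n_estimators': [11, 12, 13],
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
}

# Assumption: 3 folds, since the log shows three fits per combination.
n_folds = 3

# Every combination of parameter values is one candidate model.
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)  # 36 108
```

With 179.6 minutes elapsed in total, that is roughly 1.7 minutes per fit on average, so each extra value added to the grid noticeably lengthens the search.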

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

***Note: I evaluated the GridSearch results only once, because the search took about three hours (179.6 minutes) to run. Pipeline 2 reproduces the best parameters found by the GridSearch.***

In [20]:
Y_pred_GS = pd.DataFrame(cv.predict(X_test))

In [21]:
print_metrics(Y_test, Y_pred_GS)

                categories precision    recall    fscore
0                  related  0.771251  0.791425  0.766427
1                  request  0.882071  0.884651  0.868169
2                    offer  0.990866  0.995423  0.993139
3              aid_related  0.745637   0.73787  0.726531
4             medical_help  0.909314  0.925389  0.897141
5         medical_products  0.949574  0.955905  0.939422
6        search_and_rescue  0.966241  0.973299  0.961536
7                 security  0.981732  0.981385  0.972316
8                 military   0.95639  0.967043  0.953681
9              child_alone         1         1         1
10                   water    0.9463  0.948276  0.935318
11                    food  0.921022   0.92661  0.917769
12                 shelter  0.924933  0.930729   0.91481
13                clothing  0.981908   0.98459   0.97879
14                   money  0.972784  0.977724  0.967835
15          missing_people   0.98437  0.988099  0.982618
16                refugees  0.9
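
`print_metrics` is defined earlier in the notebook and is not shown in this section, so the exact averaging it uses is not confirmed here. As a rough illustration, the sketch below assumes support-weighted averaging over both classes of a category (the behaviour of `average='weighted'` in scikit-learn) and reimplements it in plain Python. The function name and the toy labels are hypothetical:

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F-score over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    precision = recall = fscore = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        # A ratio with a zero denominator is undefined and counted as 0.
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / support if support else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        # Each class contributes in proportion to its number of true samples.
        weight = support / n
        precision += weight * p_l
        recall += weight * r_l
        fscore += weight * f_l
    return precision, recall, fscore


# Hypothetical toy category: 8 negatives, 2 positives, one positive missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(weighted_scores(y_true, y_pred))
```

Under this assumption the weighted recall equals plain accuracy, so a rare category can score high even when most of its positives are missed. A category whose labels all belong to one class and are all predicted correctly scores 1.0 on every metric, which would be consistent with the `child_alone` row.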





In [24]:
# Definition of pipeline2 that replicates the results of the GridSearch
pipeline2 = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize, max_df=0.75, ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultiOutputClassifier(RandomForestClassifier(random_state=42, n_estimators=12)))
            ])

In [25]:
# fit the defined pipeline2
pipeline2.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        stri...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))])

In [26]:
# predict based on pipeline2
Y_pred2 = pd.DataFrame(pipeline2.predict(X_test))

Compare the performance of the models from pipeline 1 and pipeline 2.

In [29]:
# print results of prediction created with pipeline1
results1 = print_metrics(Y_test, Y_pred1)

Overall average precision is: 0.931493498224
Overall average recall is: 0.940638456583
Overall average fscore is: 0.925891246715




In [30]:
# print results of prediction created with pipeline2
results2 = print_metrics(Y_test, Y_pred2)

Overall average precision is: 0.932276071014
Overall average recall is: 0.942240531652
Overall average fscore is: 0.92813983796




### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [31]:
# create pipeline3: same as pipeline2, but with AdaBoost as the classifier
pipeline3 = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize, max_df=0.75, ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultiOutputClassifier(AdaBoostClassifier(random_state=42)))
            ])

In [32]:
# fit the defined pipeline3
pipeline3.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        stri...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=1))])

In [33]:
# predict based on pipeline3
Y_pred3 = pd.DataFrame(pipeline3.predict(X_test))

In [34]:
# create pipeline4, again with AdaBoost, but without the vectorizer parameters from the GridSearch, which were tuned only with the Random Forest classifier
pipeline4 = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultiOutputClassifier(AdaBoostClassifier(random_state=42)))
            ])

In [35]:
# fit the defined pipeline4
pipeline4.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=1))])

In [36]:
# create prediction based on pipeline4
Y_pred4 = pd.DataFrame(pipeline4.predict(X_test))

Compare predictions based on pipeline3 and pipeline4

In [37]:
results3 = print_metrics(Y_test, Y_pred3)

Overall average precision is: 0.93758581898
Overall average recall is: 0.945597260367
Overall average fscore is: 0.938396499505


In [38]:
results4 = print_metrics(Y_test, Y_pred4)

Overall average precision is: 0.937724588244
Overall average recall is: 0.945923608992
Overall average fscore is: 0.938442591884


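Overall, pipeline4 (AdaBoost with default vectorizer settings) gave the best results, with an average F-score of 0.9384 versus 0.9281 for the tuned Random Forest in pipeline2. Its margin over pipeline3 is negligible, and it could be refined further with GridSearch.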
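
To make the comparison concrete, the overall averages printed above can be collected in one place and ranked. A small sketch; the numbers are copied from the cell outputs, and the dictionary layout is just one convenient choice:

```python
# Overall averages as printed by print_metrics in the cells above.
results = {
    'pipeline1': {'precision': 0.931493498224, 'recall': 0.940638456583, 'fscore': 0.925891246715},
    'pipeline2': {'precision': 0.932276071014, 'recall': 0.942240531652, 'fscore': 0.92813983796},
    'pipeline3': {'precision': 0.93758581898, 'recall': 0.945597260367, 'fscore': 0.938396499505},
    'pipeline4': {'precision': 0.937724588244, 'recall': 0.945923608992, 'fscore': 0.938442591884},
}

# Pick the pipeline with the highest average F-score.
best = max(results, key=lambda name: results[name]['fscore'])

# Show how far each pipeline trails the best one.
for name, scores in results.items():
    gap = results[best]['fscore'] - scores['fscore']
    print(f"{name}: fscore={scores['fscore']:.4f}, gap to best={gap:.4f}")
```

The two AdaBoost pipelines differ by less than 0.0001 in F-score, while both lead the Random Forest pipelines by about 0.01.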

### 9. Export your model as a pickle file

In [39]:
filename = 'model.sav'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(pipeline4, f)
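
Loading the exported file back is the mirror image of the dump. The sketch below shows the round trip with a hypothetical stand-in object instead of the fitted pipeline, so that it runs without the training data. The temporary directory is likewise only there to keep the example self-contained:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline, so the sketch runs anywhere.
model = {'name': 'pipeline4', 'classifier': 'AdaBoostClassifier'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.sav')

    # Write the object; the context manager closes the file handle.
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Read it back, as a prediction script would.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

Because pickle stores functions by reference rather than by value, a script that loads the real pipeline would presumably also need the custom `tokenize` function to be importable under the same name it had when the model was saved.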

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.