# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
nltk.download()

In [13]:
# import libraries
import re
import pickle

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer in tokenize()

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score,
                             precision_score, recall_score, f1_score, make_scorer)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


[nltk_data] Downloading package punkt to /home/enzo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/enzo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# load data from database
engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('InsertTableName', con=engine)

relevant_categories = df.columns[4:]

X = df['message']
y = df[relevant_categories]

In [4]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
X.head(5)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [6]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [7]:
def tokenize(text):
    """Normalize, tokenize, and lemmatize a message; return a list of clean tokens."""

    # Replace any URL with the token "urlplaceholder"
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # Convert to lowercase and replace punctuation with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    tokens = word_tokenize(text)

    # Remove stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

tokenize("testing: tested I do not doesn't lovVed loves loving you warnings .... http://www.hotmail.com")
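The URL replacement and normalization steps can be checked in isolation without NLTK. The sketch below is a simplified stand-in, not the notebook's `tokenize`: it splits on whitespace instead of calling `word_tokenize`, and it skips stop-word removal and lemmatization. The helper name `normalize` and the sample message are hypothetical.

```python
import re

# Same URL pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize(text):
    """Replace URLs, lowercase, turn punctuation into spaces, split on whitespace."""
    text = re.sub(URL_REGEX, "urlplaceholder", text)
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return text.split()

print(normalize("Need water!! See http://www.hotmail.com now"))
# ['need', 'water', 'see', 'urlplaceholder', 'now']
```

Replacing the URL before stripping punctuation matters: done in the other order, the URL would be broken into fragments such as `http`, `www`, and `com`.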



### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
        
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
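A note on reproducibility: `train_test_split` shuffles the rows, so calling it without a fixed `random_state` produces a different split on every run. The dependency-free sketch below illustrates the idea of a seeded split; the helper `split_indices`, the 25% test fraction, and the seed value are assumptions for illustration, not scikit-learn's implementation.

```python
import random

def split_indices(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(20)
train_b, test_b = split_indices(20)

print(len(train_a), len(test_a))  # 15 5
print(test_a == test_b)           # True: same seed, same split
```

Fixing the seed keeps every model evaluated on the same test rows, which makes the per-category scores comparable across runs.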

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
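To make the per-category numbers concrete, here is a minimal pure-Python sketch of what is computed for the positive class of each column. The helper `binary_scores` and the toy labels are hypothetical; in the notebook itself, `classification_report` does this work.

```python
def binary_scores(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class of one column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical toy labels: one list per category column.
y_true = {"related": [1, 1, 0, 1], "offer": [0, 1, 0, 0]}
y_pred = {"related": [1, 0, 0, 1], "offer": [0, 0, 0, 0]}

# Iterate through the columns and score each one separately.
for category in y_true:
    p, r, f = binary_scores(y_true[category], y_pred[category])
    print(f"{category:>10}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

A category with no predicted positives, like `offer` here, has an undefined precision and is scored as 0.0 by convention. scikit-learn reports the same value and emits an `UndefinedMetricWarning` in that case.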

#### Evaluation for the train set 

In [9]:
y_train_pred = pipeline.predict(X_train)
print(classification_report(y_train, y_train_pred, target_names=relevant_categories))

                        precision    recall  f1-score   support

               related       0.99      0.99      0.99     14909
               request       1.00      0.92      0.95      3408
                 offer       1.00      0.77      0.87        90
           aid_related       1.00      0.97      0.98      8167
          medical_help       1.00      0.85      0.92      1594
      medical_products       1.00      0.84      0.91       974
     search_and_rescue       0.99      0.80      0.89       532
              security       0.99      0.74      0.85       349
              military       1.00      0.88      0.93       656
                 water       1.00      0.91      0.95      1251
                  food       1.00      0.95      0.97      2235
               shelter       1.00      0.92      0.96      1738
              clothing       0.99      0.84      0.91       316
                 money       1.00      0.80      0.89       458
        missing_people       0.99      

#### Evaluation for the test set 

In [10]:
y_test_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_test_pred, target_names=relevant_categories))

                        precision    recall  f1-score   support

               related       0.85      0.91      0.88      4997
               request       0.77      0.42      0.55      1066
                 offer       0.00      0.00      0.00        28
           aid_related       0.73      0.61      0.67      2693
          medical_help       0.52      0.11      0.19       490
      medical_products       0.70      0.13      0.22       339
     search_and_rescue       0.71      0.06      0.11       192
              security       0.20      0.01      0.02       122
              military       0.57      0.08      0.14       204
                 water       0.89      0.30      0.45       421
                  food       0.80      0.46      0.58       688
               shelter       0.82      0.31      0.45       576
              clothing       0.40      0.07      0.12        89
                 money       1.00      0.03      0.07       146
        missing_people       0.00      



### 6. Improve your model
Use grid search to find better parameters. 

In [11]:
parameters = {'vect__min_df': [1, 5, 10],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 25, 50], 
              'clf__estimator__min_samples_split':[2, 5, 10, 20, 40]}

cv = GridSearchCV(pipeline, param_grid = parameters, scoring = 'f1_weighted', verbose = 1, cv=3)

In [12]:
cv.fit(X_train, y_train)

Fitting 3 folds for each of 90 candidates, totalling 270 fits
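The candidate count in the output above follows directly from the grid: every combination of one value per parameter is one candidate, and each candidate is fitted once per fold. A small standard-library sketch that reproduces the arithmetic (the variable names `n_folds` and `candidates` are illustrative):

```python
from itertools import product

parameters = {
    'vect__min_df': [1, 5, 10],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [10, 25, 50],
    'clf__estimator__min_samples_split': [2, 5, 10, 20, 40],
}
n_folds = 3  # matches cv=3

# Cartesian product of all value lists: one dict per candidate setting.
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

print(len(candidates))            # 90
print(len(candidates) * n_folds)  # 270
```

Because the count is a product, adding one more value to any list multiplies the total, so it is worth running this check before launching a long search.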


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__min_df': [1, 5, 10], 'tfidf__use_idf': [True, False], 'clf__estimator__n_estimators': [10, 25, 50], 'clf__estimator__min_samples_split': [2, 5, 10, 20, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_weighted', verbose=1)

In [13]:
cv.best_params_

{'clf__estimator__min_samples_split': 10,
 'clf__estimator__n_estimators': 50,
 'tfidf__use_idf': False,
 'vect__min_df': 10}


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [14]:
best_model_predict=cv.predict(X_test)

In [15]:
print(classification_report(y_test, best_model_predict, target_names=relevant_categories))

                        precision    recall  f1-score   support

               related       0.86      0.93      0.89      4997
               request       0.78      0.51      0.62      1066
                 offer       0.00      0.00      0.00        28
           aid_related       0.73      0.72      0.73      2693
          medical_help       0.56      0.16      0.25       490
      medical_products       0.73      0.19      0.31       339
     search_and_rescue       0.64      0.08      0.15       192
              security       0.12      0.01      0.02       122
              military       0.60      0.13      0.21       204
                 water       0.82      0.49      0.61       421
                  food       0.79      0.72      0.76       688
               shelter       0.79      0.48      0.59       576
              clothing       0.69      0.35      0.46        89
                 money       1.00      0.05      0.09       146
        missing_people       0.67      



In [17]:
# Use a context manager so the file is flushed and closed after writing
with open('multiforest.sav', 'wb') as f:
    pickle.dump(cv, f)

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [1]:
def test_pipeline(pipeline):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipeline.fit(X_train, y_train)
    y_test_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_test_pred, target_names=relevant_categories))

#### LinearSVC

In [10]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(LinearSVC(multi_class="crammer_singer"), n_jobs=1))
        #('clf', RandomForestClassifier())
    ])

test_pipeline(pipeline)  

                        precision    recall  f1-score   support

               related       0.87      0.90      0.88      4994
               request       0.73      0.59      0.65      1100
                 offer       0.00      0.00      0.00        29
           aid_related       0.72      0.72      0.72      2697
          medical_help       0.59      0.32      0.42       520
      medical_products       0.65      0.31      0.42       308
     search_and_rescue       0.72      0.15      0.25       183
              security       0.75      0.03      0.06       101
              military       0.61      0.33      0.43       210
                 water       0.71      0.64      0.67       411
                  food       0.78      0.75      0.76       699
               shelter       0.75      0.58      0.66       572
              clothing       0.74      0.53      0.62        91
                 money       0.66      0.24      0.36       177
        missing_people       0.85      



#### LogisticRegressionCV(multi_class="multinomial")

In [12]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(LogisticRegressionCV(multi_class="multinomial")))
        #('clf', RandomForestClassifier())
    ])

test_pipeline(pipeline)  

                        precision    recall  f1-score   support

               related       0.84      0.93      0.88      4916
               request       0.94      0.14      0.25      1077
                 offer       0.00      0.00      0.00        32
           aid_related       0.90      0.27      0.42      2660
          medical_help       0.71      0.01      0.02       528
      medical_products       0.67      0.01      0.01       344
     search_and_rescue       1.00      0.01      0.01       180
              security       0.00      0.00      0.00       125
              military       1.00      0.02      0.04       206
                 water       0.70      0.04      0.07       424
                  food       0.88      0.08      0.14       744
               shelter       0.90      0.05      0.09       561
              clothing       1.00      0.03      0.06       100
                 money       0.00      0.00      0.00       146
        missing_people       0.00      



#### MultinomialNB

In [14]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(MultinomialNB()))
        
    ])

test_pipeline(pipeline) 

                        precision    recall  f1-score   support

               related       0.78      0.99      0.88      5020
               request       0.83      0.23      0.36      1127
                 offer       0.00      0.00      0.00        33
           aid_related       0.74      0.63      0.68      2763
          medical_help       0.33      0.00      0.00       505
      medical_products       1.00      0.00      0.01       320
     search_and_rescue       0.00      0.00      0.00       206
              security       0.00      0.00      0.00       115
              military       0.00      0.00      0.00       231
                 water       0.00      0.00      0.00       407
                  food       0.77      0.02      0.04       759
               shelter       0.67      0.00      0.01       563
              clothing       0.00      0.00      0.00        98
                 money       0.00      0.00      0.00       145
        missing_people       0.00      



#### DecisionTreeClassifier

In [15]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
        
    ])

test_pipeline(pipeline) 

                        precision    recall  f1-score   support

               related       0.84      0.86      0.85      4936
               request       0.55      0.54      0.54      1112
                 offer       0.04      0.03      0.03        33
           aid_related       0.65      0.64      0.64      2702
          medical_help       0.34      0.31      0.32       529
      medical_products       0.46      0.43      0.44       342
     search_and_rescue       0.24      0.24      0.24       175
              security       0.10      0.08      0.09       111
              military       0.48      0.38      0.43       229
                 water       0.64      0.69      0.67       432
                  food       0.71      0.73      0.72       739
               shelter       0.63      0.62      0.62       561
              clothing       0.59      0.54      0.56       102
                 money       0.37      0.33      0.35       149
        missing_people       0.23      

#### AdaBoostClassifier

In [16]:
from sklearn.ensemble import AdaBoostClassifier
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(AdaBoostClassifier()))
        
    ])

test_pipeline(pipeline) 

                        precision    recall  f1-score   support

               related       0.79      0.98      0.88      4993
               request       0.74      0.49      0.59      1107
                 offer       0.10      0.04      0.05        27
           aid_related       0.76      0.61      0.68      2756
          medical_help       0.58      0.27      0.37       513
      medical_products       0.62      0.35      0.45       309
     search_and_rescue       0.52      0.18      0.27       179
              security       0.29      0.05      0.08       107
              military       0.60      0.35      0.44       199
                 water       0.73      0.64      0.68       427
                  food       0.79      0.70      0.74       725
               shelter       0.79      0.52      0.63       604
              clothing       0.72      0.48      0.58        97
                 money       0.61      0.29      0.39       157
        missing_people       0.56      

#### SVC

In [17]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(SVC()))
        
    ])

test_pipeline(pipeline) 

                        precision    recall  f1-score   support

               related       0.76      1.00      0.86      4957
               request       0.00      0.00      0.00      1088
                 offer       0.00      0.00      0.00        31
           aid_related       0.00      0.00      0.00      2733
          medical_help       0.00      0.00      0.00       482
      medical_products       0.00      0.00      0.00       324
     search_and_rescue       0.00      0.00      0.00       178
              security       0.00      0.00      0.00       100
              military       0.00      0.00      0.00       210
                 water       0.00      0.00      0.00       409
                  food       0.00      0.00      0.00       735
               shelter       0.00      0.00      0.00       609
              clothing       0.00      0.00      0.00        90
                 money       0.00      0.00      0.00       153
        missing_people       0.00      



### Summary Performance Models

Model | precision | recall | f1-score
------------ | ------------- | ------------- | -------------
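One way to produce rows for a summary table like this is to collapse each model's per-category scores into support-weighted averages, the same weighting idea behind the `f1_weighted` scorer used in the grid searches. The sketch below is an illustration only: the model names and all numbers are hypothetical, and `weighted_average` is a helper introduced here, not a library function.

```python
def weighted_average(rows, key):
    """Support-weighted mean of one metric across category rows."""
    total_support = sum(row["support"] for row in rows)
    return sum(row[key] * row["support"] for row in rows) / total_support

# Hypothetical per-category rows; replace with values from a real report.
reports = {
    "model_a": [
        {"precision": 0.80, "recall": 0.90, "f1": 0.85, "support": 100},
        {"precision": 0.50, "recall": 0.20, "f1": 0.29, "support": 50},
    ],
    "model_b": [
        {"precision": 0.70, "recall": 0.80, "f1": 0.75, "support": 100},
        {"precision": 0.60, "recall": 0.40, "f1": 0.48, "support": 50},
    ],
}

# Emit one markdown table row per model.
print("Model | precision | recall | f1-score")
print("--- | --- | --- | ---")
for name, rows in reports.items():
    p = weighted_average(rows, "precision")
    r = weighted_average(rows, "recall")
    f = weighted_average(rows, "f1")
    print(f"{name} | {p:.2f} | {r:.2f} | {f:.2f}")
```

Weighting by support keeps rare categories from dominating the summary, but it also hides how poorly they are predicted, so the per-category reports remain worth reading.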

In [18]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(LinearSVC(multi_class="crammer_singer"), n_jobs=1))
        
    ])
parameters = {
        'clf__estimator__C': [1, 1.2, 1.4, 1.8],
        'clf__estimator__max_iter': [500, 1000, 1200, 1500, 2000],
    } 
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = 'f1_weighted', verbose = 1, cv=3)


In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
cv.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 25.3min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...rammer_singer', penalty='l2', random_state=None,
     tol=0.0001, verbose=0),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__C': [1, 1.2, 1.4, 1.8], 'clf__estimator__max_iter': [500, 1000, 1200, 1500, 2000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_weighted', verbose=1)

In [21]:
cv.best_params_

{'clf__estimator__C': 1, 'clf__estimator__max_iter': 500}

In [22]:
best_model_predict=cv.predict(X_test)
print(classification_report(y_test, best_model_predict, target_names=relevant_categories))

                        precision    recall  f1-score   support

               related       0.87      0.90      0.88      4973
               request       0.73      0.58      0.64      1113
                 offer       0.00      0.00      0.00        32
           aid_related       0.72      0.70      0.71      2692
          medical_help       0.58      0.31      0.40       514
      medical_products       0.62      0.31      0.41       351
     search_and_rescue       0.66      0.20      0.30       168
              security       0.25      0.01      0.02       126
              military       0.52      0.34      0.41       201
                 water       0.77      0.63      0.69       423
                  food       0.81      0.72      0.76       745
               shelter       0.74      0.59      0.65       583
              clothing       0.81      0.57      0.67        99
                 money       0.67      0.28      0.39       170
        missing_people       0.58      



### 9. Export your model as a pickle file
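A minimal sketch of the export and reload round trip, using a plain dictionary as a stand-in for the fitted model so that it runs without scikit-learn. The file name `model.pkl` and the temporary directory are assumptions for illustration; a fitted `GridSearchCV` object can be written and read the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; any picklable object behaves the same way.
model = {"name": "stand-in model", "best_params": {"C": 1, "max_iter": 500}}

# Hypothetical output location inside a fresh temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the object in binary mode; the context manager closes the file.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Read it back and confirm nothing was lost.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

When unpickling a real model, the same library versions and the `tokenize` function must be importable, because pickle stores references to code rather than the code itself.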

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
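One possible shape for such a script, sketched with the standard library's `argparse`. The argument names `database_filepath` and `model_filepath`, the function names, and the example paths are assumptions for illustration; the actual template in the Resources folder may differ. The body only echoes the paths at the points where the notebook steps (load, build, train, evaluate, export) would go.

```python
import argparse

def parse_args(argv=None):
    """Read the two user-supplied paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it as a pickle file.")
    parser.add_argument("database_filepath",
                        help="SQLite database holding the cleaned messages")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholder for steps 1-2: load X and y from args.database_filepath.
    print(f"Loading data from {args.database_filepath}")
    # Placeholder for steps 3-8: build, train, tune, and evaluate the pipeline.
    # Placeholder for step 9: pickle the model to args.model_filepath.
    print(f"Saving model to {args.model_filepath}")
    return args

# Example invocation with explicit arguments instead of sys.argv.
args = main(["disaster.db", "classifier.pkl"])
```

Passing `argv` explicitly keeps the functions testable from a notebook; in a real script, `main()` would be called without arguments under an `if __name__ == "__main__":` guard.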