### Business Questions:

From the Udacity rubric:
-    Go into more detail about the dataset and your data cleaning and modeling process in your README file, add screenshots of your web app and model results.
-    Add more visualizations to the web app.
-    Based on the categories that the ML algorithm classifies text into, advise some organizations to connect to.
-    Customize the design of the web app.
-    Deploy the web app to a cloud service provider.
-    Improve the efficiency of the code in the ETL and ML pipeline.
-    This dataset is imbalanced (i.e., some labels like water have few examples). In your README, discuss this imbalance, how it affects training the model, and your thoughts about emphasizing precision or recall for the various categories.

From my point of view:
- (tag: Tech) How should different expressions that share the same meaning be handled so that the feature space can be compressed? Word embeddings? An n-gram model?
- (tag: Tech) How can what the model has learned be visualized?
- (tag: Tech) How can the model's ability to generalize be measured?
- (tag: Business) Is this model suitable for Figure Eight? Show the reasons :)
- (tag: Business) If the model is applied, what is needed to keep it operating normally? How can the cost of the changes it may require in the company, such as staffing or finances, be evaluated?
- (tag: Business) Draw a data flow map (for example from client to host, and from host to company staff), find the most time-consuming link, and try to optimize it.

# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# TODO 2019/3/8: reload all libraries


import re, sys
import logging

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sqlalchemy import create_engine

import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger', 'brown'])
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import brown
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
!pip install lightgbm
import lightgbm as lgb

You are using pip version 9.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
import gensim

In [37]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse2.db')
df = pd.read_sql("SELECT * FROM DisasterResponse2", engine)
X = df["message"]
Y = df.drop(["id", "message", "original", "genre"], axis=1)

In [6]:
X

0        Weather update - a cold front from Cuba that c...
1                  Is the Hurricane over or is it not over
2                          Looking for someone but no name
3        UN reports Leogane 80-90 destroyed. Only Hospi...
4        says: west side of Haiti, rest of the country ...
5                   Information about the National Palace-
6                           Storm at sacred heart of jesus
7        Please, we need tents and water. We are in Sil...
8          I would like to receive the messages, thank you
9        I am in Croix-des-Bouquets. We have health iss...
10       There's nothing to eat and water, we starving ...
11       I am in Petionville. I need more information r...
12       I am in Thomassin number 32, in the area named...
13       Let's do it together, need food in Delma 75, i...
14       More information on the 4636 number in order f...
15       A Comitee in Delmas 19, Rue ( street ) Janvier...
16       We need food and water in Klecin 12. We are dy.

In [4]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [5]:
def tokenize_sent(text):
    """Tokenize one message, sentence by sentence.

    Combines NLTK tokenizers with a few hand-written rules (URL
    replacement, punctuation removal, lowercasing, lemmatization). It may
    fail on strings that fall outside those rules.

    Args:
        text: A single message string, i.e. one row of the text column
        of the pandas DataFrame

    Returns:
        A numpy array containing the clean tokens extracted from text

        "we were friends, good friends" --> ['we', 'were', 'friend', 'good', 'friend']

    Raises:
        None yet
    """
    
    # Raw string so that the backslashes reach the regex engine unchanged.
    url_reg = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
#     punct_reg = "[^a-zA-Z0-9@]+"
    punct_reg = "[^a-zA-Z0-9]+"
    detected_urls = re.findall(url_reg, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    text = re.sub(punct_reg, " ", text)
    text = text.lower()
    
    word_tokenizer_tweet = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    clean_tokens_list = []
    
    sentence_list = nltk.sent_tokenize(text)
    for sentence in sentence_list:
        tokens = word_tokenizer_tweet.tokenize(sentence)
        clean_tokens_list.append(
            [lemmatizer.lemmatize(tok).strip() for tok in tokens]
        )
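    # Stemming is left disabled here: PorterStemmer is not callable, so the
    # call has to be stemmer.stem(tok), and this line would also discard the
    # per-sentence token lists built in the loop above.
#     clean_tokens_list = [lemmatizer.lemmatize(stemmer.stem(tok)).strip() for tok in tokens]  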
    
    # Note 2019/3/7: remove any text followed by  "@",  fail because unsolved Error "TypeError: iteration over a 0-d array"
    clean_tokens_arr = np.array(clean_tokens_list).squeeze()
#     func = lambda x: "@" in x
#     try:
#         clean_tokens_arr_ = clean_tokens_arr[[
#             not(item) for item in map(func, clean_tokens_arr)
#         ]]
#     except TypeError:  # TypeError: iteration over a 0-d array
#         return clean_tokens_arr
#     else:
#         return clean_tokens_arr_
    
    return clean_tokens_arr

In [5]:
def tokenize_word(text):
    """Tokenize one message word by word.

    Combines the NLTK tweet tokenizer with a few hand-written rules (URL
    replacement, punctuation removal, lowercasing, lemmatization). It may
    fail on strings that fall outside those rules.

    Args:
        text: A single message string, i.e. one row of the text column
        of the pandas DataFrame
        
    Returns:
        A list containing the clean tokens extracted from text

        "we were friends, good friends" --> ['we', 'were', 'friend', 'good', 'friend']

    Raises:
        None yet
    """
    
    # Raw string so that the backslashes reach the regex engine unchanged.
    url_reg = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    punct_reg = "[^a-zA-Z0-9@]+"
#     punct_reg = "[^a-zA-Z0-9]+"
    detected_urls = re.findall(url_reg, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    text = re.sub(punct_reg, " ", text)
    text = text.lower()
    
    word_tokenizer_tweet = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    tokens = word_tokenizer_tweet.tokenize(text)
    clean_tokens_list = [lemmatizer.lemmatize(tok).strip() for tok in tokens]
#     clean_tokens_list = [lemmatizer.lemmatize(stemmer(tok)).strip() for tok in tokens]  
    
    # Note 2019/3/7: remove any text followed by  "@",  fail because unsolved Error "TypeError: iteration over a 0-d array "
#     clean_tokens_arr = np.array(clean_tokens_list).squeeze()
#     func = lambda x: "@" in x
#     try:
#         clean_tokens_arr_ = clean_tokens_arr[[
#             not(item) for item in map(func, clean_tokens_arr)
#         ]]
#     except TypeError:
#         return clean_tokens_arr
#     else:
#         return clean_tokens_arr_
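    # Build a new list instead of deleting while iterating, which would
    # skip the token that follows each deleted one.
    clean_tokens_list = [tok for tok in clean_tokens_list if "@" not in tok]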
        
    return clean_tokens_list

In [7]:
tokenize_sent("we were friends, good friends. wont't we? I hpe this is true @123")

array(['we', 'were', 'friend', 'good', 'friend', 'wont', 't', 'we', 'i',
       'hpe', 'this', 'is', 'true', '123'], 
      dtype='<U6')

In [13]:
tokenize_word("hi i am fine. The sun is really warm and happy! @everyone")

['hi', 'i', 'am', 'fine', 'the', 'sun', 'is', 'really', 'warm', 'and', 'happy']
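
When tokens such as "@" mentions are filtered out of a token list, deleting items while enumerating the same list skips the element that follows each deletion. A minimal sketch of the pitfall next to a filter that builds a new list instead; the helper names `drop_mentions_in_place` and `drop_mentions` are hypothetical and not part of the notebook:

```python
def drop_mentions_in_place(tokens):
    """Pitfall: delete items while enumerating the same list."""
    tokens = list(tokens)  # copy so the caller's list is untouched
    for ind, tok in enumerate(tokens):
        if "@" in tok:
            del tokens[ind]  # later items shift left, so the next one is skipped
    return tokens


def drop_mentions(tokens):
    """Safer: build a new list that leaves the unwanted tokens out."""
    return [tok for tok in tokens if "@" not in tok]


sample = ["hello", "@alice", "@bob", "thanks"]
buggy = drop_mentions_in_place(sample)
clean = drop_mentions(sample)
print(buggy)  # "@bob" survives because it directly follows a deleted token
print(clean)
```

The difference only shows up when two matching tokens are adjacent, which is why a single-mention test message does not reveal it.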

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize_word)),
                ('tfidf', TfidfTransformer())
            ])),
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, min_samples_split=3)))
    ])
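
To make the `vect` and `tfidf` steps less opaque, here is a dependency-free sketch of what they compute on a toy corpus: raw term counts, a smoothed inverse document frequency, then L2 normalization per document. The formula `idf = ln((1 + n) / (1 + df)) + 1` follows scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The corpus and the function name `tfidf_rows` are made up for illustration, and whitespace splitting stands in for `tokenize_word`.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["we need water", "we need food and water", "storm warning"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weight, term = max(zip(row, vocab))
    print(doc, "->", term, round(weight, 3))
```

Terms that occur in fewer documents (here `food`) receive a larger weight than terms shared across documents (here `water`), which is the property the classifier relies on.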

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)

In [54]:
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
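
For reference, the three metrics requested here can be computed from the counts of true positives, false positives and false negatives of a single label. The sketch below is a plain-Python illustration on made-up labels, not a replacement for `classification_report`; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 0, 0, 1, 1, 0, 0, 0]  # made-up ground truth
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]  # made-up predictions
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Precision answers "how many flagged messages were right", recall answers "how many relevant messages were found"; for rare disaster categories the second question is often the more costly one to get wrong.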

In [55]:
Y_pred = pipeline.predict(X_test)

In [30]:
# Use default parameters of clf


(Y_test == Y_pred).mean()

related                   0.798347
request                   0.881755
offer                     0.995931
aid_related               0.735919
medical_help              0.925111
medical_products          0.950922
search_and_rescue         0.971011
security                  0.981945
military                  0.966052
child_alone               1.000000
water                     0.950286
food                      0.919644
shelter                   0.923840
clothing                  0.985124
money                     0.976987
missing_people            0.988303
refugees                  0.967196
death                     0.958423
other_aid                 0.870947
infrastructure_related    0.936300
transport                 0.957406
buildings                 0.951430
electricity               0.979402
tools                     0.994024
hospitals                 0.989320
shops                     0.996186
aid_centers               0.989320
other_infrastructure      0.957661
weather_related     

In [56]:
# Use specific parameters of clf as "n_estimators=100, min_samples_split=3"

(Y_test == Y_pred).mean()

related                   0.804069
request                   0.895232
offer                     0.996058
aid_related               0.783217
medical_help              0.926891
medical_products          0.950286
search_and_rescue         0.971011
security                  0.981945
military                  0.965798
child_alone               1.000000
water                     0.951939
food                      0.923586
shelter                   0.932104
clothing                  0.984743
money                     0.977622
missing_people            0.988175
refugees                  0.966561
death                     0.957788
other_aid                 0.872473
infrastructure_related    0.937190
transport                 0.959186
buildings                 0.950286
electricity               0.979275
tools                     0.994024
hospitals                 0.989320
shops                     0.996186
aid_centers               0.989320
other_infrastructure      0.957661
weather_related     
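
Per-label accuracy is easy to over-read when a label is rare. The `offer` classification report in this notebook shows 31 rows in the smaller class out of 7865 test rows, so a classifier that always predicts the larger class already scores very high. A small check, with `majority_baseline_accuracy` as a hypothetical helper:

```python
def majority_baseline_accuracy(n_minority, n_total):
    """Accuracy obtained by always predicting the majority class."""
    return (n_total - n_minority) / n_total


# Support counts taken from the `offer` classification report in this notebook.
offer_baseline = majority_baseline_accuracy(n_minority=31, n_total=7865)
print(round(offer_baseline, 6))
```

The baseline agrees to six decimals with the `offer` accuracy printed above, which is consistent with the model never predicting the minority class for that label. Recall on the minority class is the more informative number in such cases.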

In [50]:
for ind, each_col in enumerate(Y_test.columns):
    print("----------Check metrics for feature named as {}----------".format(each_col))
    # classification_report sorts the labels, so the name for 0 (false) comes before the name for 1 (true)
    print(classification_report(Y_test.loc[:, each_col], Y_pred[:, ind], target_names=[each_col+"_false", each_col+"_true"]))
    print("")

----------Check metrics for feature named as related----------
               precision    recall  f1-score   support

related_false       0.64      0.35      0.46      1829
 related_true       0.82      0.94      0.88      5977

  avg / total       0.78      0.80      0.78      7865


----------Check metrics for feature named as request----------
               precision    recall  f1-score   support

request_false       0.89      0.98      0.93      6524
 request_true       0.83      0.38      0.53      1341

  avg / total       0.88      0.88      0.86      7865


----------Check metrics for feature named as offer----------
             precision    recall  f1-score   support

offer_false       1.00      1.00      1.00      7834
 offer_true       0.00      0.00      0.00        31

avg / total       0.99      1.00      0.99      7865


----------Check metrics for feature named as aid_related----------
                   precision    recall  f1-score   support

aid_related_false     

  .format(len(labels), len(target_names))
  .format(len(labels), len(target_names))
  'precision', 'predicted', average, warn_for)


### 6. Improve your model
Use grid search to find better parameters. 

In [6]:
import signal
from contextlib import contextmanager
import requests

DELAY = INTERVAL = 4 * 60  # interval time in seconds
MIN_DELAY = MIN_INTERVAL = 2 * 60
KEEPALIVE_URL = "https://nebula.udacity.com/api/v1/remote/keep-alive"
TOKEN_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token"
TOKEN_HEADERS = {"Metadata-Flavor":"Google"}


def _request_handler(headers):
    def _handler(signum, frame):
        requests.request("POST", KEEPALIVE_URL, headers=headers)
    return _handler


@contextmanager
def active_session(delay=DELAY, interval=INTERVAL):
    """
    Example:

    from workspace_utils import active_session

    with active_session():
        # do long-running work here
    """
    token = requests.request("GET", TOKEN_URL, headers=TOKEN_HEADERS).text
    headers = {'Authorization': "STAR " + token}
    delay = max(delay, MIN_DELAY)
    interval = max(interval, MIN_INTERVAL)
    original_handler = signal.getsignal(signal.SIGALRM)
    try:
        signal.signal(signal.SIGALRM, _request_handler(headers))
        signal.setitimer(signal.ITIMER_REAL, delay, interval)
        yield
    finally:
        signal.signal(signal.SIGALRM, original_handler)
        signal.setitimer(signal.ITIMER_REAL, 0)

In [10]:
parameters = {
        'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
#         'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
        'features__text_pipeline__vect__max_features': (None, 5000, 10000),
#         'features__text_pipeline__tfidf__use_idf': (True, False),
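        # Disabled: 'starting_verb' is not a transformer in the FeatureUnion
        # defined above, so these weights do not apply to this pipeline.
#         'features__transformer_weights': (
#             {'text_pipeline': 1, 'starting_verb': 0.5},
#             {'text_pipeline': 0.5, 'starting_verb': 1},
#             {'text_pipeline': 0.8, 'starting_verb': 1},
#         )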
    }

cv = GridSearchCV(pipeline, param_grid=parameters)
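
Grid search cost grows multiplicatively: every combination of values is refit once per cross-validation fold. Expanding the grid by hand gives a quick estimate of the number of fits before launching a long run. The sketch below uses two of the keys from `parameters`; the fold count of 3 is an assumption (the default depends on the scikit-learn version), and `count_fits` is a hypothetical helper.

```python
from itertools import product


def count_fits(grid, n_folds):
    """Return (candidates, total_fits) for an exhaustive grid search."""
    candidates = list(product(*grid.values()))
    return len(candidates), len(candidates) * n_folds


grid = {
    "features__text_pipeline__vect__ngram_range": ((1, 1), (1, 2)),
    "features__text_pipeline__vect__max_features": (None, 5000, 10000),
}
n_candidates, n_fits = count_fits(grid, n_folds=3)  # n_folds=3 is assumed
print(n_candidates, n_fits)
```

Each of those fits trains one forest per output label when the classifier is wrapped in `MultiOutputClassifier`, so even a small grid can take a long time on this dataset.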

In [None]:
with active_session():
    cv.fit(X_train, Y_train)

In [None]:
cv

In [None]:
import joblib


with active_session():
    joblib.dump(cv, 'DisasterResponse_gcv.pkl')

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [None]:
Y_pred = cv.predict(X_test)

In [None]:
(Y_test == Y_pred).mean()

In [None]:
for ind, each_col in enumerate(Y_test.columns):
    print("----------Check metrics for feature named as {}----------".format(each_col))
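    # classification_report sorts the labels, so the name for 0 (false) comes before the name for 1 (true)
    print(classification_report(Y_test.loc[:, each_col], Y_pred[:, ind], target_names=[each_col+"_false", each_col+"_true"]))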
    print("")

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [4]:
# TODO 2019/3/8: update all descriptions here


def number_normalizer(tokens):
    """Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.

    Args:
        tokens: An iterable of non-empty token strings

    Returns:
        A generator that yields "numberplaceholder" for every token that
        starts with a digit and the unchanged token otherwise

        ['delma', '75'] --> 'delma', 'numberplaceholder'

    Raises:
        None yet
    """
    return ("numberplaceholder" if token[0].isdigit() else token for token in tokens)


def tokenize_word(text):
    """Tokenize one message word by word, with number normalization and stemming.

    Combines the NLTK tweet tokenizer with a few hand-written rules (URL
    replacement, punctuation removal, lowercasing, number placeholders,
    stemming, lemmatization). It may fail on strings that fall outside
    those rules.

    Args:
        text: A single message string, i.e. one row of the text column
        of the pandas DataFrame
        
    Returns:
        A list containing the clean tokens extracted from text

        "we were friends, good friends" --> ['we', 'were', 'friend', 'good', 'friend']

    Raises:
        None yet
    """
    
    # Raw string so that the backslashes reach the regex engine unchanged.
    url_reg = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    punct_reg = "[^a-zA-Z0-9@]+"
    detected_urls = re.findall(url_reg, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    text = re.sub(punct_reg, " ", text)
    text = text.lower()
    
    word_tokenizer_tweet = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    tokens = word_tokenizer_tweet.tokenize(text)
    tokens = number_normalizer(tokens)
    clean_tokens_list = [lemmatizer.lemmatize(stemmer.stem(tok)).strip() for tok in tokens]  

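    # Build a new list instead of deleting while iterating, which would
    # skip the token that follows each deleted one.
    clean_tokens_list = [tok for tok in clean_tokens_list if "@" not in tok]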
        
    return clean_tokens_list
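
A quick way to see what `number_normalizer` does is to run it on a short token list. The sketch repeats the one-line function body so that it runs on its own; the sample tokens are made up. The function returns a generator, so it has to be consumed (for example with `list`) before it can be indexed or reused.

```python
def number_normalizer(tokens):
    """Map every token that starts with a digit to a shared placeholder."""
    return ("numberplaceholder" if token[0].isdigit() else token for token in tokens)


tokens = ["need", "food", "in", "delma", "75", "call", "4636", "2nd", "street"]
normalized = list(number_normalizer(tokens))
print(normalized)
```

Tokens such as `2nd` are replaced as well, because only the first character is inspected.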

In [14]:
# TODO 2019/3/9: construct to deal with the error


# class ExpandForBalance(BaseEstimator, TransformerMixin):
#     """Summary of class here.

#     Longer class information....
#     Longer class information....

#     Attributes:
#         likes_spam: A boolean indicating if we like SPAM or not.
#         eggs: An integer count of the eggs we have laid.
#     """
    
#     def starting_verb(self, arr):
#         """Inits SampleClass with blah."""
        
#         return False

#     def fit(self, X, y=None):
#         """Inits SampleClass with blah."""        
#         return self

#     def transform(self, X):
#         """Inits SampleClass with blah."""        
#         X_tagged = pd.Series(X).apply(self.starting_verb)
#         return pd.DataFrame([X, X_count, X_std, X_mean, X_sum])

In [39]:
for each_col in Y.columns:
    print(Y.loc[:, each_col].value_counts().shape[0])

2
2
2
2
2
2
2
2
2
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2


In [38]:
X_df = pd.DataFrame(X)
X_df = X_df.drop((Y[Y.loc[:, "related"] == 2]).index).reset_index(drop=True)
Y = Y.drop((Y[Y.loc[:, "related"] == 2]).index).reset_index(drop=True)

X_train_df, X_test_df, Y_train, Y_test = train_test_split(X_df, Y, test_size=0.2, shuffle=True)
X_train_part_df, X_valid_df, Y_train_part, Y_valid = train_test_split(X_train_df, Y_train, test_size=0.2, shuffle=True)
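
Two successive 80/20 splits leave roughly 64% of the rows for training, 16% for validation and 20% for testing. The arithmetic can be checked without loading the data. The row count of 26,028 is an assumed figure for illustration, and the rounding rule (test share rounded up, remainder used for training) is an assumption about how `train_test_split` resolves fractions, so treat the exact counts as approximate.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_rows = 26028  # assumed row count, not taken from the notebook
n_train, n_test = split_sizes(n_rows, 0.2)
n_train_part, n_valid = split_sizes(n_train, 0.2)
print(n_train_part, n_valid, n_test)
```

Under these assumptions the training part has 16,657 rows, which matches the first dimension in the `bad input shape (16657, 2)` message that appears later in this notebook.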

#### evaluate_model

In [5]:
# https://lightgbm.readthedocs.io/en/latest/

params = {
          'boosting_type': 'rf',
          'num_leaves': 70, 
          'max_depth': -1,
          'learning_rate': 0.008,
          'n_estimators': 100,
          'subsample_for_bin': 200000,
          'objective': 'binary',
          'class_weight': None,
          'min_split_gain': 0,
          'min_child_weight': 0,
          'min_child_samples': 35,
          'subsample': 1,
          'subsample_freq': 0,
          'colsample_bytree': 1,
          'reg_alpha': 0.11,  # L1 reg
          'reg_lambda': 0.5,  # L2 reg
          'random_state': 23,
          'n_jobs': 3,
          'silent': False,
          'importance_type': 'split',

          "feature_fraction": 0.85,
          "bagging_freq": 1,
          "bagging_fraction": 0.82,
          "bagging_seed": 42,
          "metric_freq": 200,
          "early_stopping_round": 150,
          "num_iterations": 10000,
#           "is_unbalance": True,  # used only in `binary` application
         
          "num_class": 1
         }

pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize_word)),
                ('tfidf', TfidfTransformer()), 
#                 ('expand', ExpandForBalance())  # TODO 2019/3/9: add new transformation to deal with the error
            ])),
             
        ])),
    
        ('clf', lgb.LGBMClassifier(**params))
    ])

In [46]:
Y_pred = {}

# Fit the text features once so that the validation messages can be
# vectorized as well: `eval_set` reaches the classifier untransformed.
features = pipeline.named_steps['features']
clf = pipeline.named_steps['clf']
# Note 2019/3/9: use numpy arrays to avoid the check failure of X shape by sklearn.utils: validation.py
X_train_part_vec = features.fit_transform(X_train_part_df.values[:, 0])
X_valid_vec = features.transform(X_valid_df.values[:, 0])
X_test_vec = features.transform(X_test_df.values[:, 0])

for ind, each_col in enumerate(Y_train.columns):
    # A label with a single distinct value cannot be fit as a binary target.
    if Y_train_part.loc[:, each_col].nunique() < 2:
        continue
    # LGBMClassifier expects 1-D label arrays. One-hot encoding the labels with
    # keras `to_categorical` raised "ValueError: bad input shape (16657, 2)".
    clf.fit(
        X_train_part_vec,
        Y_train_part.loc[:, each_col].values,
        eval_set=[(X_valid_vec, Y_valid.loc[:, each_col].values)],
        eval_metric='binary_logloss'
    )
    Y_pred[each_col] = clf.predict(X_test_vec)
    print("Finished training label {}".format(ind+1))

Y_pred_df = pd.DataFrame(Y_pred)
    
# for each_col in Y_pred_df.columns:
#     print("----------Check metrics for feature named as {}----------".format(each_col))
#     print(classification_report(Y_test.loc[:, each_col], Y_pred_df.loc[:, each_col], target_names=[each_col+"_false", each_col+"_true"]))
#     print("")

In [67]:
with active_session():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = brown.sents()
    model = gensim.models.Word2Vec(sentences, min_count=1)
    model.save('brown_model')
#     model = gensim.models.Word2Vec.load('brown_model')

# model.wv['computer']  # raw NumPy vector of a requested word, for this example the array is something like this "array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)"

2019-03-09 11:13:12,163 : INFO : collecting all words and their counts
2019-03-09 11:13:12,172 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-03-09 11:13:12,929 : INFO : PROGRESS: at sentence #10000, processed 219770 words, keeping 23488 word types
2019-03-09 11:13:13,616 : INFO : PROGRESS: at sentence #20000, processed 430477 words, keeping 34367 word types
2019-03-09 11:13:14,377 : INFO : PROGRESS: at sentence #30000, processed 669056 words, keeping 42365 word types
2019-03-09 11:13:15,069 : INFO : PROGRESS: at sentence #40000, processed 888291 words, keeping 49136 word types
2019-03-09 11:13:15,616 : INFO : PROGRESS: at sentence #50000, processed 1039920 words, keeping 53024 word types
2019-03-09 11:13:16,041 : INFO : collected 56057 word types from a corpus of 1161192 raw words and 57340 sentences
2019-03-09 11:13:16,045 : INFO : Loading a fresh vocabulary
2019-03-09 11:13:16,202 : INFO : min_count=1 retains 56057 unique words (100% of original 5605
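
Once word vectors are available, expressions with similar meaning can be compared by the cosine of the angle between their vectors, which is one way to approach the synonym question raised at the top of this notebook. A dependency-free sketch with made-up 3-dimensional vectors; real Word2Vec vectors have many more dimensions, and the values below are not model output:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen only to illustrate the comparison.
toy_vectors = {
    "water": [0.9, 0.1, 0.0],
    "drink": [0.8, 0.2, 0.1],
    "tent": [0.0, 0.3, 0.9],
}
sim_close = cosine_similarity(toy_vectors["water"], toy_vectors["drink"])
sim_far = cosine_similarity(toy_vectors["water"], toy_vectors["tent"])
print(round(sim_close, 3), round(sim_far, 3))
```

Words whose similarity exceeds a chosen threshold could then be mapped to a shared token before vectorization, which is one possible way to shrink the vocabulary.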

### 9. Export your model as a pickle file

In [None]:
import pickle
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases


# pickle produces bytes, so the file must be opened in binary mode
# with open('DisasterResponse_model.txt', 'wb') as fw:
#     pickle.dump(pipeline, fw)
# with open('DisasterResponse_model.txt', 'rb') as fr:
#     model = pickle.load(fr)

joblib.dump(pipeline, 'DisasterResponse_model.pkl')
# model = joblib.load('DisasterResponse_model.pkl')
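
`pickle` serializes to bytes, so a model written by hand with `pickle` needs a file opened in binary mode (`'wb'` for writing, `'rb'` for reading). A self-contained round trip that uses a small dictionary as a stand-in for the fitted pipeline; the stand-in object and the temporary file name are illustrative only:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object works the same way.
model_state = {"labels": ["water", "food", "shelter"], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.pkl")
    with open(path, "wb") as fw:  # binary mode: pickle writes bytes
        pickle.dump(model_state, fw)
    with open(path, "rb") as fr:
        restored = pickle.load(fr)

print(restored == model_state)
```

`joblib.dump` and `joblib.load` follow the same save-then-load pattern and are usually preferred for estimators that hold large NumPy arrays.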

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.