# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [80]:
conda create -n env -c conda-forge python=3.9 scikit-learn-intelex

^C

Note: you may need to restart the kernel to use updated packages.


In [81]:
conda install scikit-learn-intelex -c conda-forge

^C

Note: you may need to restart the kernel to use updated packages.


In [1]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet','stopwords'])

import re
import string
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB



[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [78]:
from sklearnex import patch_sklearn
patch_sklearn()

ImportError: cannot import name '_is_arraylike_not_scalar' from 'sklearn.utils.validation' (C:\Users\runqi\anaconda3\lib\site-packages\sklearn\utils\validation.py)

In [2]:
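# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql('SELECT * FROM DisasterResponse', engine)
# features: the raw message text
X = df['message']
# targets: the category columns that follow id, message, original, and genre
y = df.iloc[:, 4:]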
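As a minimal sketch of the same loading step with `read_sql_table`, the snippet below writes a toy table to an in-memory SQLite database and reads it back by name. The rows and the two category columns are made up for illustration; only the table name `DisasterResponse` and the layout (four metadata columns followed by category columns) are taken from this notebook.

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for DisasterResponse.db (in-memory, illustrative rows only)
engine = create_engine("sqlite://")
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["we need water", "storm is coming"],
    "original": ["", ""],
    "genre": ["direct", "direct"],
    "water": [1, 0],
    "storm": [0, 1],
})
toy.to_sql("DisasterResponse", engine, index=False)

# read_sql_table loads a whole table by name, so no SQL string is needed
df_toy = pd.read_sql_table("DisasterResponse", engine)

# Same split as in the notebook: message text as X, category columns as Y
X_toy = df_toy["message"]
Y_toy = df_toy.iloc[:, 4:]
print(list(Y_toy.columns))
```

`read_sql_table` requires a SQLAlchemy connectable, whereas `read_sql` with a query string also accepts a plain `sqlite3` connection.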

In [5]:
df

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26211,30261,The training demonstrated how to enhance micro...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,30262,A suitable candidate has been selected and OCH...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,30263,"Proshika, operating in Cox's Bazar municipalit...",,news,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26214,30264,"Some 2,000 women protesting against the conduc...",,news,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [6]:
list(df['message'])

['Weather update - a cold front from Cuba that could pass over Haiti',
 'Is the Hurricane over or is it not over',
 'Looking for someone but no name',
 'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.',
 'says: west side of Haiti, rest of the country today and tonight',
 'Information about the National Palace-',
 'Storm at sacred heart of jesus',
 'Please, we need tents and water. We are in Silo, Thank you!',
 'I would like to receive the messages, thank you',
 'I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )',
 "There's nothing to eat and water, we starving and thirsty.",
 'I am in Petionville. I need more information regarding 4636',
 'I am in Thomassin number 32, in the area named Pyron. I would like to have some water. Thank God we are fine, but we desperately need water. Thanks',
 "Let's do it together, need food in Delma 75, in didine area",
 'More informati

In [7]:
def tokenize(text):
    """Lowercase, tokenize, drop stop words and punctuation, then lemmatize as verbs."""
    # build the stop word set once so each membership test is a fast set lookup
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    # normalize case and split into word tokens
    tokens = word_tokenize(text.lower())

    clean_tokens = []
    for tok in tokens:
        # skip stop words and punctuation tokens
        if tok in stop_words or tok in string.punctuation:
            continue
        # lemmatize as a verb and remove leading/trailing white space
        clean_tokens.append(lemmatizer.lemmatize(tok, pos='v').strip())

    return clean_tokens

In [75]:
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [8]:
for message in X[:5]:
    tokens = tokenize(message)
    print(tokens,'\n')

['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pass', 'haiti'] 

['hurricane'] 

['look', 'someone', 'name'] 

['un', 'report', 'leogane', '80-90', 'destroy', 'hospital', 'st.', 'croix', 'function', 'need', 'supply', 'desperately'] 

['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 



In [65]:
sentence='Please we need help, food and toiletries.'
pos_tags = nltk.pos_tag(tokenize(sentence))
first_word, first_tag = pos_tags[0]
first_word, first_tag

('please', 'VB')

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other category columns in the dataset (35 in this database). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
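To make the multi-output idea concrete, here is a small sketch on made-up messages with two invented target columns. It uses `LogisticRegression` and the default `TfidfVectorizer` tokenization purely to stay lightweight; neither is the choice made in this notebook. The point is that `MultiOutputClassifier` fits one copy of the estimator per target column and returns one prediction per column.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Illustrative data only: columns of `targets` are (water, storm)
messages = [
    "we need water", "water please", "storm is coming",
    "big storm tonight", "need water after the storm", "hello there",
]
targets = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
toy_pipeline.fit(messages, targets)

# One fitted estimator per target column
n_fitted = len(toy_pipeline.named_steps["clf"].estimators_)

# Predictions have one column per target
pred = toy_pipeline.predict(["water and storm"])
print(n_fitted, pred.shape)
```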

In [9]:
def ML_pipeline_1(clf = RandomForestClassifier()):
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
        ('clf', MultiOutputClassifier(clf))
        ])
    return pipeline


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
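As a quick check of what the split produces, the sketch below applies `train_test_split` with its default `test_size` (0.25) to a plain index array of 26216 entries, the row count of `df` in this notebook. The index array is only a stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26216  # number of rows in df in this notebook
rows = np.arange(n_rows)

# With no test_size given, 25% of the rows go to the test set
train_rows, test_rows = train_test_split(rows, random_state=123)

print(len(train_rows), len(test_rows))
```

The resulting sizes, 19662 and 6554, match the `y_train.shape` and `y_test.shape` outputs shown below.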

In [3]:
# perform train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=123)

In [5]:
X_train[:20]

14734                  ZIMBABWE: No election boycott - MDC
10262    blimey it s a bad day for Haiti this morning o...
4376     we are hungra please help us now otherwise we ...
23213    From 04-07 August, apart from simple diarrheal...
20153    Survivors, traumatized by the barbarous acts o...
7436     About tap water we can drink it without treati...
2660     She is waiting for help, needs money and food ...
3631     I wrote you many times to tell you that I can'...
17781    Seismologists at the central weather bureau sa...
24152    Treated mosquito nets are an effective, low-co...
4073            Where can you get the card (to get food)? 
19593    Eyewitnesses reported that security forces fir...
13315    A strong southwest monsoon prevailing over the...
9509     i'm happy you come back with this program we h...
13014    RIP to those who were taken from us due to thi...
23370    In the north-western Nigerian state of Niger, ...
6687     Living in Santo 6. I would like for you to sen.

In [48]:
y_train.shape

(19662, 35)

In [49]:
y_test.shape

(6554, 35)

In [50]:
train_test_compare = pd.DataFrame({'train_mean': y_train.mean(), 'test_mean': y_test.mean()})
train_test_compare 

Unnamed: 0,train_mean,test_mean
related,0.766351,0.76686
request,0.17221,0.166005
offer,0.004679,0.003967
aid_related,0.413997,0.415014
medical_help,0.079392,0.079799
medical_products,0.050097,0.050046
search_and_rescue,0.028736,0.02426
security,0.018004,0.017852
military,0.03255,0.033567
water,0.063829,0.063625


In [None]:
# overview of the training set

In [None]:
# We want to minimize false negatives (FN), so the metric to maximize is recall = TP / (TP + FN).

In [51]:
# train classifier
model1 = ML_pipeline_1()
model1.fit(X_train, y_train)


Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
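The column-by-column approach described above could look roughly like the following. The data is a made-up two-category example, and `per_category_scores` is a hypothetical helper name. Passing `zero_division=0` sets precision to 0 for a category that is never predicted, instead of emitting an undefined-metric warning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Illustrative labels and predictions (not results from this notebook)
y_true = pd.DataFrame({"water": [1, 0, 1, 1, 0, 0], "offer": [0, 0, 0, 1, 0, 0]})
y_hat = np.array([[1, 0], [0, 0], [0, 0], [1, 0], [1, 0], [0, 0]])


def per_category_scores(y_true, y_hat):
    """Return precision, recall and F1 of the positive class for each column."""
    rows = []
    for i, col in enumerate(y_true.columns):
        p, r, f, _ = precision_recall_fscore_support(
            y_true[col], y_hat[:, i], average="binary", zero_division=0
        )
        rows.append({"category": col, "precision": p, "recall": r, "f1": f})
    return pd.DataFrame(rows).set_index("category")


scores = per_category_scores(y_true, y_hat)
print(scores)

# Full report for a single category, if more detail is wanted
print(classification_report(y_true["water"], y_hat[:, 0], zero_division=0))
```

Collecting the scores in a DataFrame makes it easy to sort categories by recall and see which ones the model misses most often.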

In [52]:
# predict on test data
y_pred=model1.predict(X_test)


In [53]:
print(classification_report(y_test.values, y_pred, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.84      0.95      0.89      5026
               request       0.85      0.50      0.63      1088
                 offer       0.00      0.00      0.00        26
           aid_related       0.78      0.69      0.73      2720
          medical_help       0.60      0.07      0.12       523
      medical_products       0.79      0.07      0.13       328
     search_and_rescue       0.78      0.09      0.16       159
              security       0.33      0.01      0.02       117
              military       0.84      0.07      0.13       220
                 water       0.91      0.31      0.46       417
                  food       0.90      0.52      0.66       731
               shelter       0.83      0.36      0.50       574
              clothing       0.75      0.09      0.16        99
                 money       0.80      0.06      0.10       144
        missing_people       1.00      

  _warn_prf(average, modifier, msg_start, len(result))


### 6. Improve your model
Use grid search to find better parameters. 

In [14]:
model1.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'tfidf': TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': <function __main__.tokenize(text)>,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'clf__estimator_

In [25]:
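# specify parameters for grid search
# alternative grid: compare estimators instead of hyperparameters
# parameters = {
#         'clf__estimator': [RandomForestClassifier(), SVC(), MLPClassifier()]
# }
from joblib import parallel_backend
parameters = {
        'clf__estimator__n_estimators': [50, 100, 200],
        'clf__estimator__min_samples_split': [2, 3, 4],
        'clf__estimator__max_depth': [None, 10, 20, 30]
}
model2 = ML_pipeline_1()
cv = GridSearchCV(model2, param_grid=parameters, cv=5)

# the backend applies only to work started inside the `with` block,
# so fit() must be called here for n_jobs=-1 to take effect
# (if the notebook-defined tokenize cannot be pickled by 'multiprocessing',
# joblib's default 'loky' backend may work instead)
with parallel_backend('multiprocessing', n_jobs=-1):
    cv.fit(X_train, y_train)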


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__max_depth': [None, 10, 20, 30],
                         'clf__estimator__min_samples_split': [2, 3, 4],
                         'clf__estimator__n_estimators': [50, 100, 200]})

In [27]:
cv.best_params_

{'clf__estimator__max_depth': None,
 'clf__estimator__min_samples_split': 3,
 'clf__estimator__n_estimators': 200}
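Two points about the search are worth a sketch. First, a grid with 3 x 3 x 4 values has 36 candidates, so `cv=5` means 180 pipeline fits. Second, with no `scoring` argument `GridSearchCV` ranks candidates by the estimator's own `score`, which for a multi-output classifier is exact-match accuracy across all labels. The example below counts the grid with `ParameterGrid` and runs a much smaller search on synthetic multilabel data with `scoring="recall_macro"`; the synthetic data and the reduced grid are assumptions chosen to keep it fast, not the notebook's setup.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.multioutput import MultiOutputClassifier

# Size of a grid with 3 x 3 x 4 candidate values
grid = {
    "n_estimators": [50, 100, 200],
    "min_samples_split": [2, 3, 4],
    "max_depth": [None, 10, 20, 30],
}
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # one fit per candidate per fold with cv=5
print(n_candidates, n_fits)

# Small synthetic multilabel problem (stand-in for the TF-IDF features)
X_toy, Y_toy = make_multilabel_classification(
    n_samples=120, n_features=10, n_classes=3, random_state=0
)

# Rank candidates by macro-averaged recall instead of exact-match accuracy
search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={"estimator__n_estimators": [5, 20]},
    scoring="recall_macro",
    cv=3,
)
search.fit(X_toy, Y_toy)
print(search.best_params_)
```

Choosing a recall-based scorer ties the search to the stated goal of minimizing false negatives; whether macro averaging is the right aggregation for this dataset is a judgment call.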

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [28]:
y_pred_cv=cv.predict(X_test)
print(classification_report(y_test.values, y_pred_cv, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.83      0.95      0.89      4990
               request       0.84      0.49      0.62      1080
                 offer       0.00      0.00      0.00        27
           aid_related       0.75      0.70      0.72      2664
          medical_help       0.55      0.06      0.11       530
      medical_products       0.81      0.09      0.17       324
     search_and_rescue       0.53      0.05      0.09       185
              security       0.50      0.01      0.02       130
              military       0.71      0.06      0.10       216
                 water       0.90      0.38      0.54       392
                  food       0.86      0.59      0.70       704
               shelter       0.79      0.38      0.51       548
              clothing       0.67      0.09      0.15        93
                 money       1.00      0.04      0.08       142
        missing_people       1.00      

  _warn_prf(average, modifier, msg_start, len(result))


In [29]:
# why does 'offer' score zero? check how many positive examples it has
df['offer'].value_counts()

0    26098
1      118
Name: offer, dtype: int64
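The counts above explain the zero scores for `offer`. A short worked calculation, using only those two numbers, shows how a model that never predicts `offer` still looks excellent on accuracy while finding none of the positives:

```python
# Counts taken from df['offer'].value_counts() above
negatives, positives = 26098, 118
total = negatives + positives

positive_rate = positives / total

# Baseline that predicts 0 for every message
baseline_accuracy = negatives / total  # every negative counts as correct
baseline_recall = 0 / positives        # no positive is ever found

print(f"positive rate:     {positive_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

With fewer than half a percent positive examples, accuracy says almost nothing about this category, which is why recall is the more informative metric here.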

In [32]:
df.iloc[:,4:].sum()

related                   20094
request                    4474
offer                       118
aid_related               10860
medical_help               2084
medical_products           1313
search_and_rescue           724
security                    471
military                    860
water                      1672
food                       2923
shelter                    2314
clothing                    405
money                       604
missing_people              298
refugees                    875
death                      1194
other_aid                  3446
infrastructure_related     1705
transport                  1201
buildings                  1333
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7297
floods                     2155
storm                      2443
fire                        282
earthqua

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [58]:
# Add a custom transformer to count the number of words in each message
from sklearn.base import BaseEstimator, TransformerMixin

class WordCountTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        word_count = []
        for message in X:
            words = message.split()
            word_count.append(len(words))
        return np.array(word_count).reshape(-1, 1)
    


In [72]:
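class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """Flag messages in which any sentence starts with a verb or with 'RT'."""

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(word_tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError below
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one boolean column: True if the message has a sentence starting with a verb
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)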

In [73]:
# ML_pipeline_2 adds WordCountTransformer; ML_pipeline_3 adds StartingVerbExtractor

def ML_pipeline_2(clf = RandomForestClassifier()):
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
            ('wordCount', WordCountTransformer())            
        ])),        
        ('clf', MultiOutputClassifier(clf))
        ])
    return pipeline


def ML_pipeline_3(clf = RandomForestClassifier()):
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
            ('verb', StartingVerbExtractor())            
        ])),        
        ('clf', MultiOutputClassifier(clf))
        ])
    return pipeline
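To see what `FeatureUnion` hands to the classifier, the sketch below combines TF-IDF features with a word-count column on three made-up messages. `WordCount` is a compact re-statement of the `WordCountTransformer` idea so the snippet runs on its own, and the default `TfidfVectorizer` tokenization is used instead of `tokenize` to avoid the NLTK dependency.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class WordCount(BaseEstimator, TransformerMixin):
    """One feature column holding the number of whitespace-separated words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([len(m.split()) for m in X]).reshape(-1, 1)


messages = ["we need water", "storm", "please send food and water"]

# TF-IDF alone, for comparison of the column count
tfidf_only = TfidfVectorizer().fit_transform(messages)

# FeatureUnion stacks the outputs of its transformers side by side
union = FeatureUnion([("tfidf", TfidfVectorizer()), ("wordCount", WordCount())])
combined = union.fit_transform(messages)

print(tfidf_only.shape, combined.shape)
print(combined.toarray()[:, -1])  # the appended word-count column
```

The word counts are not scaled like the TF-IDF values; tree-based models such as random forests are insensitive to that, but scale-sensitive estimators may need the column normalized.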

In [74]:
# train classifier
model_RF = ML_pipeline_3()
model_RF.fit(X_train, y_train)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('tfidf',
                                                 TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
                                                ('verb',
                                                 StartingVerbExtractor())])),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

In [76]:
# predict on test data
y_pred_RF=model_RF.predict(X_test)
print(classification_report(y_test.values, y_pred_RF, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.84      0.95      0.89      5026
               request       0.86      0.48      0.62      1088
                 offer       0.00      0.00      0.00        26
           aid_related       0.78      0.68      0.73      2720
          medical_help       0.58      0.06      0.11       523
      medical_products       0.75      0.06      0.12       328
     search_and_rescue       0.82      0.11      0.20       159
              security       0.00      0.00      0.00       117
              military       0.73      0.04      0.07       220
                 water       0.91      0.32      0.47       417
                  food       0.91      0.57      0.70       731
               shelter       0.84      0.33      0.47       574
              clothing       0.77      0.10      0.18        99
                 money       0.75      0.04      0.08       144
        missing_people       1.00      

  _warn_prf(average, modifier, msg_start, len(result))


In [77]:
# train SVC
model_SVC1 = ML_pipeline_1(clf = SVC())
model_SVC1.fit(X_train, y_train)

KeyboardInterrupt: 

In [None]:
# predict on test data
y_pred_SVC1 = model_SVC1.predict(X_test)
print(classification_report(y_test.values, y_pred_SVC1, target_names=y.columns.values))

In [85]:
# train MLPClassifier()
model_MLP1 = ML_pipeline_1(clf = MLPClassifier())
model_MLP1.fit(X_train, y_train)

  self.validation_scores_ = self.validation_scores_

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
                ('clf', MultiOutputClassifier(estimator=MLPClassifier()))])

In [86]:
# predict on test data
y_pred_MLP1 = model_MLP1.predict(X_test)
print(classification_report(y_test.values, y_pred_MLP1, target_names=y.columns.values))

KeyboardInterrupt: 

In [83]:
# train XGBoost()
model_XGB1 = ML_pipeline_1(clf =  XGBClassifier())
model_XGB1.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x0000021C05BDD9D0>)),
                ('clf',
                 MultiOutputClassifier(estimator=XGBClassifier(base_score=None,
                                                               booster=None,
                                                               callbacks=None,
                                                               colsample_bylevel=None,
                                                               colsample_bynode=None,
                                                               colsample_bytree=None,
                                                               early_stopping_rounds=None,
                                                               enable_categorical=False,
                                                               eval_metric=None,
                                                               feature_types=None,
                       

In [84]:
# predict on test data
y_pred_XGB1 = model_XGB1.predict(X_test)
print(classification_report(y_test.values, y_pred_XGB1, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.83      0.96      0.89      5026
               request       0.81      0.55      0.65      1088
                 offer       0.50      0.04      0.07        26
           aid_related       0.79      0.65      0.71      2720
          medical_help       0.62      0.24      0.34       523
      medical_products       0.68      0.27      0.38       328
     search_and_rescue       0.66      0.26      0.38       159
              security       0.33      0.03      0.06       117
              military       0.70      0.35      0.46       220
                 water       0.79      0.65      0.71       417
                  food       0.84      0.73      0.78       731
               shelter       0.77      0.59      0.67       574
              clothing       0.83      0.48      0.61        99
                 money       0.46      0.19      0.27       144
        missing_people       0.56      

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
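# train Naive Bayes classifier
# GaussianNB rejects the sparse matrix that TfidfVectorizer produces,
# so MultinomialNB (which accepts sparse input) is swapped in here
from sklearn.naive_bayes import MultinomialNB
model_NB1 = ML_pipeline_1(clf = MultinomialNB())
model_NB1.fit(X_train, y_train)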

### 9. Export your model as a pickle file
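A minimal sketch of the export step, assuming the standard-library `pickle` module is the intended format. It trains a tiny stand-in pipeline on made-up messages, writes it to a temporary file (`classifier.pkl` is a hypothetical name), reloads it, and checks that the restored model predicts the same labels.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in model; in the notebook this would be one of the fitted pipelines
toy_model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_model.fit(["need water", "send water", "storm coming", "big storm"], [1, 1, 0, 0])

# Write the fitted pipeline to disk
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Read it back and confirm it behaves the same
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(["water"]) == toy_model.predict(["water"])).all())
print(same)
```

One caveat for the pipelines in this notebook: pickle stores a custom function such as `tokenize` by reference, so the script that loads the model has to define or import `tokenize` under the same module path.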

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
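The template file itself is not shown in this notebook, so the following is only a rough sketch of how such a script might be organized. The function names, the two positional arguments, and the use of the default `TfidfVectorizer` tokenization are all assumptions; only the table name `DisasterResponse`, the `message` column, and the column split come from the cells above.

```python
import argparse
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine


def build_arg_parser():
    """Command-line interface: input database and output model file (assumed layout)."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite file, e.g. DisasterResponse.db")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser


def load_data(database_filepath, table="DisasterResponse"):
    """Return message text and category columns, split as in the notebook."""
    engine = create_engine(f"sqlite:///{database_filepath}")
    df = pd.read_sql_table(table, engine)
    return df["message"], df.iloc[:, 4:]


def build_model():
    """TF-IDF features feeding one random forest per category."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(RandomForestClassifier())),
    ])


def main(argv=None):
    args = build_arg_parser().parse_args(argv)
    X, Y = load_data(args.database_filepath)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=123)

    model = build_model()
    model.fit(X_train, Y_train)
    print(classification_report(Y_test.values, model.predict(X_test),
                                target_names=Y.columns.values, zero_division=0))

    with open(args.model_filepath, "wb") as f:
        pickle.dump(model, f)
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard, for example via `python train.py DisasterResponse.db classifier.pkl`.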