# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [8]:
pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Using cached scikit_learn-0.24.1-cp39-cp39-macosx_10_13_x86_64.whl (7.3 MB)
Collecting scipy>=0.19.1
  Using cached scipy-1.6.0-cp39-cp39-macosx_10_9_x86_64.whl (30.9 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Using legacy 'setup.py install' for sklearn, since package 'wheel' is not installed.
Installing collected packages: threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed scikit-learn-0.24.1 scipy-1.6.0 sklearn-0.0 threadpoolctl-2.1.0
Note: you may need to restart the kernel to use updated packages.


In [19]:

# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import nltk
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
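# nltk.download takes one package per call; a second positional argument is the download directory
nltk.download('stopwords')
nltk.download('wordnet')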
from nltk.corpus import stopwords, wordnet
import re
from collections import defaultdict

from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier

In [45]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
  

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/algirdasducinskas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/algirdasducinskas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/algirdasducinskas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/algirdasducinskas/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [46]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('data', engine)
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis=1)

y.sum()

related                   20093
request                    4474
offer                       118
aid_related               10860
medical_help               2084
medical_products           1313
search_and_rescue           724
security                    471
military                    860
child_alone                   0
water                      1672
food                       2923
shelter                    2314
clothing                    405
money                       604
missing_people              298
refugees                    875
death                      1194
other_aid                  3446
infrastructure_related     1705
transport                  1201
buildings                  1333
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7297
floods                     2155
storm                      2443
fire    

In [47]:
# child_alone has no positive examples (its sum above is 0), so drop it
y = y.drop('child_alone', axis=1)
y.sum()

related                   20093
request                    4474
offer                       118
aid_related               10860
medical_help               2084
medical_products           1313
search_and_rescue           724
security                    471
military                    860
water                      1672
food                       2923
shelter                    2314
clothing                    405
money                       604
missing_people              298
refugees                    875
death                      1194
other_aid                  3446
infrastructure_related     1705
transport                  1201
buildings                  1333
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7297
floods                     2155
storm                      2443
fire                        282
earthqua

### 2. Write a tokenization function to process your text data

In [48]:
def tokenize(text):
    
    """
    Convert text into tokens
    
    Input:
        text - message that needs to be tokenized
    Output:
        clean_tokens - list of tokens from the given message
    """
    
    # replace URLs with a placeholder token
    url_regex = r'(https?://\S+)'
    text = re.sub(url_regex, 'urlplaceholder', text)

    # tokenize message into words
    tokens = word_tokenize(text)

    # remove the stop words; build the set once instead of reloading the list
    # for every token (the match is case-sensitive, so 'The' is kept)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words]

    # remove punctuation and tokens containing non-alphabetic symbols
    alpha_tokens = [token.lower() for token in filtered_tokens if token.isalpha()]

    # default dictionary mapping the first letter of a POS tag to a WordNet POS
    tag_map = defaultdict(lambda: wordnet.NOUN)
    tag_map['J'] = wordnet.ADJ
    tag_map['V'] = wordnet.VERB
    tag_map['R'] = wordnet.ADV

    # lemmatize tokens using the POS tags from the default dict
    clean_tokens = []
    lmtzr = WordNetLemmatizer()
    for token, tag in pos_tag(alpha_tokens):
        clean_tokens.append(lmtzr.lemmatize(token, tag_map[tag[0]]))

    return clean_tokens
    
    


#### Building a custom transformer

In [49]:
class ContainsHelpNeed(BaseEstimator, TransformerMixin):
    """
    Custom transformer that flags messages whose tokens include 'help' or 'need'.
    It creates a single new feature with the values 1 (True) and 0 (False).
    """

    def filter_verb(self, text):
        # True when the tokenized message contains 'help' or 'need'
        words = tokenize(text)
        return 'help' in words or 'need' in words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.filter_verb)
        # cast the boolean flags to 1/0 so the feature matches the docstring
        return pd.DataFrame(X_tagged.astype(int))

In [50]:
tokenize('Labas diena , kaip sekasi?')

['labas', 'diena', 'kaip', 'sekasi']

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the category columns in the dataset (36 in the original data; 35 remain here because `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [51]:
pipeline1 = Pipeline([
    ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultiOutputClassifier(AdaBoostClassifier()))
])


In [52]:
pipeline2 = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
            ('tfidf_transformer', TfidfTransformer())
        ])),
        ('need_help_transformer', ContainsHelpNeed())
    ])),
    ('classifier', MultiOutputClassifier(AdaBoostClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
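
Because `train_test_split` shuffles the rows randomly, each run of the notebook produces a different split unless a seed is fixed. The following is a minimal sketch of pinning the split, assuming a fixed seed is acceptable for this project. The toy data, the seed `42`, and `test_size=0.25` (scikit-learn's default) are illustrative choices, not values taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the message column and the category matrix (hypothetical data)
X_demo = pd.Series([f"message {i}" for i in range(20)])
y_demo = pd.DataFrame({
    "request": np.arange(20) % 2,
    "food": (np.arange(20) % 3 == 0).astype(int),
})

# Fixing random_state makes the shuffle, and therefore the split, repeatable
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(len(X_train_a), len(X_test_a))
print(X_test_a.index.equals(X_test_b.index))
```

With a fixed seed, the support counts in later reports stay comparable between runs.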

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline_fitted = pipeline1.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
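
The per-column iteration described above can be sketched as follows. This is an illustration on made-up labels rather than the notebook's data: `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and the array returned by `predict`. Setting `zero_division=0` is an assumption about how to treat categories that are never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for two categories
y_true_demo = pd.DataFrame({
    "water": [1, 0, 1, 0, 1, 0],
    "food": [0, 0, 1, 1, 0, 1],
})
y_pred_demo = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [0, 0], [1, 1]])

reports = {}
for i, column in enumerate(y_true_demo.columns):
    # One report per output category; zero_division=0 reports 0.0 instead of
    # warning when a class is never predicted
    reports[column] = classification_report(
        y_true_demo[column], y_pred_demo[:, i], zero_division=0, output_dict=True
    )
    print(column)
    print(classification_report(y_true_demo[column], y_pred_demo[:, i], zero_division=0))
```

Each per-column report lists both the negative (`0`) and positive (`1`) class. That is more detailed than a single multi-label report, which shows only the positive class of each category.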

In [54]:
y_prediction_train = pipeline_fitted.predict(X_train)
y_prediction_test = pipeline_fitted.predict(X_test)

In [55]:
print(classification_report(y_test.values, y_prediction_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.81      0.97      0.88      5012
               request       0.74      0.52      0.61      1124
                 offer       0.00      0.00      0.00        28
           aid_related       0.75      0.62      0.68      2711
          medical_help       0.65      0.30      0.41       548
      medical_products       0.62      0.37      0.46       350
     search_and_rescue       0.67      0.21      0.32       184
              security       0.33      0.08      0.14       118
              military       0.53      0.31      0.39       201
                 water       0.74      0.64      0.68       433
                  food       0.79      0.71      0.75       768
               shelter       0.75      0.53      0.62       567
              clothing       0.77      0.34      0.47       121
                 money       0.61      0.29      0.40       153
        missing_people       0.63      



In [56]:
print('\n',classification_report(y_train.values, y_prediction_train, target_names=y.columns.values))


                         precision    recall  f1-score   support

               related       0.81      0.97      0.88     15081
               request       0.79      0.52      0.63      3350
                 offer       0.52      0.12      0.20        90
           aid_related       0.77      0.62      0.69      8149
          medical_help       0.64      0.28      0.39      1536
      medical_products       0.71      0.36      0.48       963
     search_and_rescue       0.69      0.22      0.33       540
              security       0.47      0.08      0.13       353
              military       0.69      0.41      0.52       659
                 water       0.78      0.66      0.71      1239
                  food       0.81      0.72      0.76      2155
               shelter       0.81      0.56      0.66      1747
              clothing       0.78      0.46      0.58       284
                 money       0.61      0.31      0.41       451
        missing_people       0.59    



### 6. Improve your model
Use grid search to find better parameters. 

In [57]:
pipeline1.get_params()

{'memory': None,
 'steps': [('count_vectorizer',
   CountVectorizer(tokenizer=<function tokenize at 0x12971ea60>)),
  ('tfidf_transformer', TfidfTransformer()),
  ('classifier', MultiOutputClassifier(estimator=AdaBoostClassifier()))],
 'verbose': False,
 'count_vectorizer': CountVectorizer(tokenizer=<function tokenize at 0x12971ea60>),
 'tfidf_transformer': TfidfTransformer(),
 'classifier': MultiOutputClassifier(estimator=AdaBoostClassifier()),
 'count_vectorizer__analyzer': 'word',
 'count_vectorizer__binary': False,
 'count_vectorizer__decode_error': 'strict',
 'count_vectorizer__dtype': numpy.int64,
 'count_vectorizer__encoding': 'utf-8',
 'count_vectorizer__input': 'content',
 'count_vectorizer__lowercase': True,
 'count_vectorizer__max_df': 1.0,
 'count_vectorizer__max_features': None,
 'count_vectorizer__min_df': 1,
 'count_vectorizer__ngram_range': (1, 1),
 'count_vectorizer__preprocessor': None,
 'count_vectorizer__stop_words': None,
 'count_vectorizer__strip_accents': None,
 

In [58]:
parameters = {'classifier__estimator__n_estimators': [40,70,100] }

cv = GridSearchCV(pipeline1, param_grid=parameters)
cv.fit(X_train, y_train)

KeyboardInterrupt: 
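
The search above was interrupted before it finished. With three candidate values and scikit-learn's default 5-fold cross-validation, it fits the whole pipeline 15 times, plus one final refit, and every fit re-tokenizes its training messages. The following is a sketch of bounding that cost, assuming fewer folds are acceptable for a first pass. The toy corpus, `cv=2`, and the candidate values are illustrative only, and `TfidfVectorizer` stands in for the notebook's vectorizer steps.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus with two binary categories
texts = ["need water", "need food", "water please", "food please"] * 5
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]] * 5)

demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultiOutputClassifier(AdaBoostClassifier())),
])

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={"classifier__estimator__n_estimators": [5, 10]},
    cv=2,       # fewer folds than the default of 5
    n_jobs=1,   # raise to -1 to use all CPU cores
    verbose=1,  # prints progress, so a long search is easier to monitor
)
demo_search.fit(texts, labels)
print(demo_search.best_params_)
```

Running the same settings on a subsample of `X_train` first gives a rough estimate of how long the full search will take.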

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
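
The `classification_report` calls in this notebook cover precision and recall but do not show accuracy for each category. The following is a minimal sketch of computing it, using made-up labels: `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and the prediction array.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and predictions for three messages and two categories
y_true_demo = pd.DataFrame({"water": [1, 0, 1], "food": [0, 1, 1]})
y_pred_demo = np.array([[1, 0], [0, 0], [0, 1]])

# Fraction of messages labelled correctly, computed separately per category
per_category_accuracy = (y_true_demo.values == y_pred_demo).mean(axis=0)
accuracy_table = pd.Series(per_category_accuracy, index=y_true_demo.columns)
print(accuracy_table)
```

Accuracy should be read alongside precision and recall. For a rare category such as `offer`, a model that never predicts the label still scores a high accuracy.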

In [17]:
y_cv_prediction_test = cv.predict(X_test)
y_cv_prediction_train = cv.predict(X_train)

In [18]:
print(classification_report(y_test.values, y_cv_prediction_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.81      0.96      0.88      4959
               request       0.76      0.53      0.62      1128
                 offer       0.00      0.00      0.00        25
           aid_related       0.76      0.63      0.69      2678
          medical_help       0.54      0.24      0.33       542
      medical_products       0.62      0.30      0.40       358
     search_and_rescue       0.50      0.19      0.27       189
              security       0.25      0.08      0.12       115
              military       0.52      0.31      0.39       231
                 water       0.72      0.63      0.67       394
                  food       0.79      0.73      0.76       714
               shelter       0.72      0.57      0.63       580
              clothing       0.67      0.39      0.49       106
                 money       0.51      0.31      0.38       133
        missing_people       0.28      



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
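
For the first idea, one option is to swap the estimator inside `MultiOutputClassifier` with `set_params`, which avoids rebuilding the pipeline. The sketch below assumes a toy corpus and two arbitrary candidate estimators; neither is a recommendation for this dataset. The scores are computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical corpus with two binary categories
texts = ["need water", "need food", "water please", "food please"] * 5
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]] * 5)

demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, estimator in candidates.items():
    # Replace only the wrapped estimator; the text features stay the same
    demo_pipeline.set_params(classifier__estimator=estimator)
    demo_pipeline.fit(texts, labels)
    # score() here is subset accuracy: all categories of a message must match
    scores[name] = demo_pipeline.score(texts, labels)

print(scores)
```

In practice the comparison belongs on held-out data, for example through `GridSearchCV` with `classifier__estimator` as one of the searched parameters.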

In [19]:
# Try to improve the model with a custom transformer that checks whether a message contains 'need' or 'help'
pipeline2_fitted = pipeline2.fit(X_train, y_train)

In [20]:
y_2_prediction_train = pipeline2_fitted.predict(X_train)
y_2_prediction_test = pipeline2_fitted.predict(X_test)

In [21]:
print(classification_report(y_test.values, y_2_prediction_test, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.79      0.97      0.87      4959
               request       0.75      0.48      0.59      1128
                 offer       0.00      0.00      0.00        25
           aid_related       0.75      0.62      0.68      2678
          medical_help       0.56      0.25      0.34       542
      medical_products       0.62      0.30      0.41       358
     search_and_rescue       0.59      0.20      0.29       189
              security       0.32      0.09      0.14       115
              military       0.59      0.31      0.40       231
                 water       0.71      0.63      0.67       394
                  food       0.81      0.65      0.72       714
               shelter       0.78      0.58      0.67       580
              clothing       0.74      0.35      0.47       106
                 money       0.53      0.32      0.40       133
        missing_people       0.43      



### 9. Export your model as a pickle file

In [27]:
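import os
import pickle

# build the path portably; '\c' in a string literal is not a valid escape sequence
os.makedirs('models', exist_ok=True)
with open(os.path.join('models', 'classifier.pkl'), 'wb') as model_file:
    # pickle.dump returns None, so there is nothing to assign
    pickle.dump(cv, model_file)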
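
A saved model is only useful if it can be loaded again. The sketch below shows the round trip with a plain dictionary standing in for the fitted search object; the temporary directory and the `model_stand_in` contents are hypothetical. One caveat for the real model: `pickle` stores functions and classes by reference. The custom `tokenize` function, and `ContainsHelpNeed` if it is used, must therefore be importable under the same module name wherever the file is loaded.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted model (hypothetical placeholder)
model_stand_in = {"name": "classifier", "n_estimators": 70}

# Write into a temporary directory so the example leaves the project untouched
model_dir = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, "classifier.pkl")  # portable path separator

with open(model_path, "wb") as model_file:
    pickle.dump(model_stand_in, model_file)

with open(model_path, "rb") as model_file:
    restored = pickle.load(model_file)

print(restored == model_stand_in)
```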


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
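
The template itself is not reproduced here, so the following is only a sketch of how such a script might accept its inputs. The argument names `database_filepath` and `model_filepath` and the function names are assumptions, not the template's actual interface. The body of `main` is a placeholder for the notebook steps.

```python
import argparse


def parse_args(argv=None):
    """Read the database path and the output model path from the command line."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database to read messages from")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would run here in order: load the data, build the
    # pipeline, fit it, print the evaluation, and pickle the result.
    print(f"Would train on {args.database_filepath} and save to {args.model_filepath}")
    return args


# Example invocation with explicit arguments instead of sys.argv
demo_args = main(["DisasterResponse.db", "models/classifier.pkl"])
```

In a real script the last line would be replaced by the usual `if __name__ == "__main__": main()` guard, so that the arguments come from the command line.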