# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [36]:
# import libraries
import re
import pandas as pd
import numpy as np
import nltk
import pickle
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
import warnings
warnings.simplefilter('ignore')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report

[nltk_data] Downloading package punkt to /Users/kunal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/kunal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [37]:
# load data from database
engine = create_engine('sqlite:///processed_data.db')

df = pd.read_sql("SELECT * FROM df", engine)
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)
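The bullet list above mentions `read_sql_table`, while the cell uses `pd.read_sql` with a query. As a minimal sketch of the `read_sql_table` route, the snippet below writes a tiny, made-up table to an in-memory SQLite database and reads it back. The table contents and the `demo_*` names are illustrative assumptions, not the project's data.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database (hypothetical stand-in for processed_data.db)
demo_engine = create_engine('sqlite://')

# Tiny made-up table with the same kind of columns as the real dataset
demo_df = pd.DataFrame({
    'id': [1, 2],
    'message': ['We need water', 'Roads are blocked'],
    'original': ['We need water', 'Roads are blocked'],
    'genre': ['direct', 'news'],
    'water': [1, 0],
    'related': [1, 1],
})
demo_df.to_sql('df', demo_engine, index=False)

# read_sql_table loads a whole table by name, without writing a SELECT statement
loaded = pd.read_sql_table('df', demo_engine)

demo_X = loaded['message']
demo_Y = loaded.drop(['id', 'message', 'original', 'genre'], axis=1)
print(list(demo_Y.columns))
```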

### 2. Write a tokenization function to process your text data

In [38]:
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Build the stopword set once; looking it up per token inside the function is very slow
STOP_WORDS = set(stopwords.words('english'))

def tokenize(text):
    # Find URLs and replace each with a placeholder
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')

    # Normalize case, replace punctuation with spaces, and tokenize
    # word_tokenize splits a string into a list of word tokens
    tokens = word_tokenize(re.sub(r"[^a-zA-Z0-9]", " ", text.lower()))

    # Remove stopwords
    # Stopwords are very common words (e.g. "the", "is") that say little about the topic of a message
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Lemmatization
    # Lemmatization replaces a word with its dictionary form (lemma), e.g. "things" -> "thing"
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

In [None]:
# Example
tokenize('I will be going to IKEA today to eat and buy a few things for home.')
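To see the first two steps of `tokenize` in isolation, here is a small sketch that uses only the standard library, so it runs without the NLTK corpora. It applies the same URL pattern and punctuation replacement, then splits on whitespace instead of calling `word_tokenize`. The helper name `normalize_text` and the sample message are made up for illustration.

```python
import re

url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize_text(text):
    # Replace every URL with a placeholder token
    for url in re.findall(url_regex, text):
        text = text.replace(url, 'urlplaceholder')
    # Lowercase, turn every non-alphanumeric character into a space, split on whitespace
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

sample_tokens = normalize_text('Water needed, see http://example.com/map now!')
print(sample_tokens)
# ['water', 'needed', 'see', 'urlplaceholder', 'now']
```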

In [23]:
cv = CountVectorizer()
cv_transformer = cv.fit_transform(['I will visit IKEA today.',
                                    'Fried chicken tastes good.',
                                    'This is an example... :D'])

In [32]:
# Column Names
cv.get_feature_names_out()

array(['an', 'chicken', 'example', 'fried', 'good', 'ikea', 'is',
       'tastes', 'this', 'today', 'visit', 'will'], dtype=object)

In [25]:
# Matrix
cv_transformer.toarray()

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
       [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]])

In [33]:
# Simpler way to understand:
# (stored as bow_df so the dataset loaded into `df` above is not overwritten)
bow_df = pd.DataFrame(data    = cv_transformer.toarray(),
                      index   = ['Sentence 1:', 'Sentence 2:', 'Sentence 3:'],
                      columns = cv.get_feature_names_out()
                     )
bow_df

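| | an | chicken | example | fried | good | ikea | is | tastes | this | today | visit | will |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentence 1: | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| Sentence 2: | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Sentence 3: | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |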
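The pipeline in the next section follows `CountVectorizer` with `TfidfTransformer`. As a rough sketch of what that second step does to a count matrix like the one above, the snippet below re-weights the same three sentences. Variable names such as `tfidf_matrix` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sentences = ['I will visit IKEA today.',
             'Fried chicken tastes good.',
             'This is an example... :D']

# Raw term counts: 3 sentences x 12 vocabulary terms
counts = CountVectorizer().fit_transform(sentences)

# TF-IDF re-weights the counts and, by default, L2-normalizes each row
tfidf_matrix = TfidfTransformer().fit_transform(counts)

# Each row keeps the same non-zero positions as the count matrix,
# but is scaled so that its Euclidean length is 1
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(tfidf_matrix.shape, row_norms)
```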



The goal here is to learn a classification rule whose output is a set, or vector, of labels, one per category:

$\hat{v}_1 = \{\hat{y}_1 \in Y_1, \hat{y}_2 \in Y_2, \ldots, \hat{y}_n \in Y_n\}$
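In scikit-learn terms, such a label vector is one row of a 2-D target array with one column per category. The following sketch, on synthetic data rather than the disaster messages, shows `MultiOutputClassifier` fitting one classifier per column. The data-generating rule and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Three binary labels, each derived from the sign of one feature (synthetic rule)
Y_demo = (X_demo[:, :3] > 0).astype(int)

multi_clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
multi_clf.fit(X_demo, Y_demo)

pred = multi_clf.predict(X_demo[:4])
print(pred.shape)                  # (4, 3): one predicted label vector per sample
print(len(multi_clf.estimators_))  # 3: one fitted forest per label column
```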

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [39]:
# Step names must match the prefixes used in the parameter listing and
# grid-search parameters below (vect__, tfidf__, clf__)
pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer = tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

In [49]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estim

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [40]:
#define the predictor variables and the response variable
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

#split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 1234)

#fit the model using the training data
pipeline.fit(X_train, Y_train)

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [41]:
pipeline.predict_proba(X_test)

[array([[0.05, 0.95],
        [0.26, 0.74],
        [0.52, 0.48],
        ...,
        [0.11, 0.89],
        [0.12, 0.88],
        [0.06, 0.94]]),
 array([[0.95, 0.05],
        [0.84, 0.16],
        [0.66, 0.34],
        ...,
        [0.59, 0.41],
        [0.93, 0.07],
        [0.94, 0.06]]),
 array([[1.  , 0.  ],
        [0.99, 0.01],
        [1.  , 0.  ],
        ...,
        [1.  , 0.  ],
        [0.99, 0.01],
        [1.  , 0.  ]]),
 array([[0.65, 0.35],
        [0.68, 0.32],
        [0.78, 0.22],
        ...,
        [0.28, 0.72],
        [0.41, 0.59],
        [0.73, 0.27]]),
 array([[0.93, 0.07],
        [0.96, 0.04],
        [0.91, 0.09],
        ...,
        [0.89, 0.11],
        [1.  , 0.  ],
        [0.99, 0.01]]),
 array([[0.99, 0.01],
        [1.  , 0.  ],
        [1.  , 0.  ],
        ...,
        [0.97, 0.03],
        [0.97, 0.03],
        [1.  , 0.  ]]),
 array([[0.91, 0.09],
        [0.99, 0.01],
        [1.  , 0.  ],
        ...,
        [0.77, 0.23],
        [0.96, 0.
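The output above is a Python list with one array per category; each array has one row per test message and, as printed, one column per class (0 and 1). A sketch of collecting the positive-class probabilities into a single matrix, on synthetic data with made-up names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
Y_demo = (X_demo[:, :3] > 0).astype(int)   # three synthetic binary labels

clf_demo = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf_demo.fit(X_demo, Y_demo)

# One probability array per label column
proba_list = clf_demo.predict_proba(X_demo)

# Pick the column belonging to class 1 for every label and stack them side by side
positive_proba = np.column_stack([
    p[:, list(est.classes_).index(1)]
    for p, est in zip(proba_list, clf_demo.estimators_)
])
print(positive_proba.shape)   # (200, 3)
```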

In [51]:
#use model to make predictions on test data
y_hat = pipeline.predict(X_test)
y_true = np.array(Y_test)

In [55]:
print(classification_report(y_true=y_true, 
                            y_pred=y_hat, 
                            target_names = Y.columns))

                        precision    recall  f1-score   support

               related       0.83      0.95      0.89      4955
               request       0.85      0.50      0.63      1119
                 offer       0.00      0.00      0.00        30
           aid_related       0.77      0.69      0.72      2726
          medical_help       0.72      0.08      0.14       540
      medical_products       0.73      0.10      0.18       317
     search_and_rescue       0.79      0.06      0.12       176
              security       0.00      0.00      0.00       115
              military       0.67      0.04      0.08       231
                 water       0.95      0.35      0.51       411
                  food       0.85      0.58      0.69       710
               shelter       0.84      0.39      0.53       585
              clothing       0.63      0.11      0.19       109
                 money       0.67      0.03      0.06       130
        missing_people       0.00      
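The cell above passes the whole label matrix to `classification_report` at once. The instructions also suggest iterating through the columns; a minimal sketch of that variant on a tiny hand-made example follows (the arrays and the two column names are placeholders, not model output).

```python
import numpy as np
from sklearn.metrics import classification_report

columns = ['related', 'request']
y_true_demo = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
y_pred_demo = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])

reports = {}
for i, col in enumerate(columns):
    # One binary report per category; output_dict=True returns numbers instead of text
    reports[col] = classification_report(y_true_demo[:, i], y_pred_demo[:, i],
                                         output_dict=True, zero_division=0)
    # Metrics for the positive class of this category
    print(col, reports[col]['1'])
```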

`Evaluation Metric Note:`
* When the class distribution is imbalanced, accuracy is usually a poor choice because it gives high scores to models that simply predict the most frequent class.

* With the exception of accuracy, these metrics are naturally applied at the class level: as the classification report printed above shows, they are defined for each class. They rely on concepts such as true positives and false negatives, which require defining which class is the positive one.

Here is an overview of some metrics:

#### 1) Accuracy
- ![Accuracy](https://miro.medium.com/max/720/1*gFW6rXbctrhWHxD8OXi4wg.webp)

- Accuracy represents the number of correctly classified data instances over the total number of data instances.

- Accuracy may not be a good measure if the classes are imbalanced.
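A small illustration of that caveat, with made-up numbers: when only 5 of 1,000 samples are positive, a model that always predicts 0 scores 99.5% accuracy while finding none of the positives.

```python
# Hypothetical imbalanced label: 5 positives among 1,000 samples
y_true_demo = [1] * 5 + [0] * 995
y_pred_demo = [0] * 1000          # a "model" that always predicts the majority class

correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy = correct / len(y_true_demo)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
recall = true_positives / sum(y_true_demo)

print(accuracy, recall)   # 0.995 0.0
```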




#### 2) Precision
- ![Precision](https://miro.medium.com/max/640/1*VXnUvOEdf3IiYVCD6Wd2vg.webp)

- Percentage of correct positive predictions relative to total positive predictions.

- Precision should ideally be 1 for a good classifier. Precision equals 1 only when the numerator and denominator are equal, i.e. **TP = TP + FP**.

- This means FP is zero. As FP increases, the denominator grows larger than the numerator and precision decreases (which we don’t want).


#### 3) Recall / Sensitivity / TPR
- ![Recall](https://miro.medium.com/max/640/1*Aj3aYW4vwYAoJqyL36PVtQ.webp)

- Percentage of correct positive predictions relative to total actual positives.

- Recall should ideally be 1 for a good classifier. Recall equals 1 only when the numerator and denominator are equal, i.e. **TP = TP + FN**.

- This means FN is zero. As FN increases, the denominator grows larger than the numerator and recall decreases (which we don’t want).

##### Therefore...
- Ideally, a good classifier has both precision and recall equal to 1, which means FP and FN are both zero.
- We therefore need a metric that takes both precision and recall into account. The F1-score is such a metric and is defined as follows:

#### 4) F1-Score
- ![F1_Score](https://miro.medium.com/max/720/1*9uo7HN1pdMlMwTbNSdyO3A.webp)

- The harmonic mean of precision and recall, with both weighted equally. The closer to 1, the better the model.

- F1 Score becomes 1 only when precision and recall are both 1.

- F1 score becomes high only when both precision and recall are high.
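Written out, $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The sketch below applies this to two rows of the first classification report above. Because the report's precision and recall are already rounded to two decimals, agreement is only expected to two decimals. The function name `f1_from` is made up.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_from(0.83, 0.95), 2))   # related:      0.89
print(round(f1_from(0.72, 0.08), 2))   # medical_help: 0.14
```

The `medical_help` row shows why F1 is informative: a precision of 0.72 looks reasonable on its own, but the low recall pulls the harmonic mean down to 0.14.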




***Courtesy of Harikrishnan (https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd)***

### 6. Improve your model
Use grid search to find better parameters. 

In [70]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estim

In [94]:
%%time

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 20, 30], 
              'clf__estimator__min_samples_split':[2, 5, 8]}

pipeline_cv = GridSearchCV(estimator=pipeline, param_grid=parameters)

CPU times: user 17 µs, sys: 1e+03 ns, total: 18 µs
Wall time: 21 µs


In [95]:
%%time
pipeline_cv.fit(X_train, Y_train)


CPU times: user 5h 16min 7s, sys: 36min 34s, total: 5h 52min 41s
Wall time: 6h 2min 8s
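The six-hour wall time is easier to understand after counting the fits. A sketch using scikit-learn's `ParameterGrid`, assuming `GridSearchCV`'s default of 5-fold cross-validation (the default since scikit-learn 0.22; the notebook does not set `cv` for this search):

```python
from sklearn.model_selection import ParameterGrid

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf': [True, False],
              'clf__estimator__n_estimators': [10, 20, 30],
              'clf__estimator__min_samples_split': [2, 5, 8]}

n_candidates = len(ParameterGrid(parameters))   # 2 * 2 * 3 * 3 combinations
n_folds = 5                                     # assumed default cv
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)   # 36 180
```

Each of those fits trains one forest per target category, so passing `n_jobs=-1` to `GridSearchCV` or searching a smaller grid are common ways to cut the runtime.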


In [96]:
print('Best Parameters:', pipeline_cv.best_params_)

Best Parameters: {'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 30, 'tfidf__use_idf': True, 'vect__min_df': 5}


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [97]:
%%time

y_hat2 = pipeline_cv.predict(X_test)
y_true = np.array(Y_test)

CPU times: user 23.2 s, sys: 3.34 s, total: 26.5 s
Wall time: 26.7 s


In [98]:
print(classification_report(y_true=y_true, 
                            y_pred=y_hat2, 
                            target_names = Y.columns))

                        precision    recall  f1-score   support

               related       0.84      0.94      0.89      4955
               request       0.82      0.49      0.61      1119
                 offer       0.00      0.00      0.00        30
           aid_related       0.74      0.69      0.71      2726
          medical_help       0.66      0.12      0.21       540
      medical_products       0.80      0.19      0.30       317
     search_and_rescue       0.75      0.15      0.25       176
              security       0.00      0.00      0.00       115
              military       0.68      0.08      0.15       231
                 water       0.88      0.28      0.43       411
                  food       0.82      0.66      0.73       710
               shelter       0.82      0.43      0.57       585
              clothing       0.65      0.10      0.17       109
                 money       0.62      0.04      0.07       130
        missing_people       0.00      

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [103]:
%%time

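# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4;
# on newer versions pass estimator=DecisionTreeClassifier(...) instead.
pipeline_ada = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultiOutputClassifier(AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',max_depth=1))))
                        ])

parameters_ada = {'clf__estimator__learning_rate': [0.1, 0.3],
                  'clf__estimator__n_estimators': [100, 200]}

# Caution: scoring='f1' is the binary F1 scorer. Y is a multilabel matrix here, so every
# fold scores nan (see the log below); a multilabel-aware scorer such as 'f1_macro' or
# 'f1_weighted' is needed for the search to rank candidates.
cv_ada = GridSearchCV(estimator=pipeline_ada, param_grid=parameters_ada, cv=3, scoring='f1')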

#define the predictor variables and the response variable
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

#split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 1234)

CPU times: user 23.5 ms, sys: 162 ms, total: 185 ms
Wall time: 455 ms


In [104]:
%%time

tuned_model = cv_ada.fit(X_train, Y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100;, score=nan total time= 3.5min
[CV 2/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100;, score=nan total time= 3.3min
[CV 3/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100;, score=nan total time= 3.2min
[CV 1/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200;, score=nan total time= 5.2min
[CV 2/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200;, score=nan total time= 5.4min
[CV 3/3] END clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200;, score=nan total time= 6.1min
[CV 1/3] END clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100;, score=nan total time= 4.2min
[CV 2/3] END clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100;, score=nan total time= 4.2min
[CV 3/3] END clf__estimator__learning_rate=0

In [105]:
print(cv_ada.best_params_)

{'clf__estimator__learning_rate': 0.1, 'clf__estimator__n_estimators': 100}
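Every fold in the log above reports `score=nan`. A likely cause is that `scoring='f1'` is scikit-learn's binary F1 scorer, which cannot score a multilabel target, so the search has nothing to rank and the reported best parameters are not a meaningful choice. The sketch below, on synthetic data with illustrative names, uses the multilabel-compatible `'f1_macro'` scorer instead. Whether macro, micro, or weighted averaging suits this project is a separate decision.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
Y_demo = (X_demo[:, :3] > 0).astype(int)   # three synthetic binary labels

search = GridSearchCV(
    estimator=MultiOutputClassifier(DecisionTreeClassifier(random_state=0)),
    param_grid={'estimator__max_depth': [1, 3]},
    cv=3,
    scoring='f1_macro',   # averages the per-label F1 scores, so a multilabel Y is accepted
)
search.fit(X_demo, Y_demo)

# With a usable scorer, best_score_ is a finite number and best_params_ reflects a real ranking
print(search.best_params_, search.best_score_)
```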


In [106]:
%%time

y_hat3 = cv_ada.predict(X_test)
y_true = np.array(Y_test)

CPU times: user 23.5 s, sys: 3.43 s, total: 26.9 s
Wall time: 27.4 s


In [107]:
print(classification_report(y_true=y_true, 
                            y_pred=y_hat3, 
                            target_names = Y.columns))

                        precision    recall  f1-score   support

               related       0.93      0.60      0.73      4955
               request       0.53      0.69      0.60      1119
                 offer       0.02      0.60      0.04        30
           aid_related       0.71      0.61      0.65      2726
          medical_help       0.44      0.55      0.49       540
      medical_products       0.28      0.69      0.40       317
     search_and_rescue       0.09      0.57      0.15       176
              security       0.06      0.38      0.11       115
              military       0.40      0.68      0.50       231
                 water       0.56      0.87      0.68       411
                  food       0.76      0.81      0.78       710
               shelter       0.62      0.73      0.67       585
              clothing       0.40      0.62      0.49       109
                 money       0.19      0.71      0.30       130
        missing_people       0.04      

`testing ada2`
- using the best parameters
- dropping certain rare target categories (those with less than 1%) from `Y`

In [115]:
# Testing ada with best params:
# {'clf__estimator__learning_rate': 0.1, 'clf__estimator__n_estimators': 100}


pipeline_ada = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultiOutputClassifier(AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced'))))
                        ])

parameters_ada = {'clf__estimator__learning_rate': [0.1],
                  'clf__estimator__n_estimators': [100]}

cv_ada2 = GridSearchCV(estimator=pipeline_ada, param_grid=parameters_ada, cv=2, scoring='f1')

#define the predictor variables and the response variable
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre', 'offer', 'clothing', 'missing_people', 'tools', 'shops', 'aid_centers'], axis = 1)

#split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 1234)

In [116]:
%%time

tuned_model = cv_ada2.fit(X_train, Y_train)

CPU times: user 1h 6min 16s, sys: 1min 17s, total: 1h 7min 34s
Wall time: 1h 9min 49s


In [117]:
%%time

y_hat4 = cv_ada2.predict(X_test)
y_true = np.array(Y_test)

CPU times: user 30 s, sys: 5.29 s, total: 35.3 s
Wall time: 38.1 s


In [118]:
print(classification_report(y_true=y_true, 
                            y_pred=y_hat4, 
                            target_names = Y.columns))

                        precision    recall  f1-score   support

               related       0.86      0.84      0.85      4955
               request       0.56      0.60      0.58      1119
           aid_related       0.69      0.65      0.67      2726
          medical_help       0.42      0.36      0.39       540
      medical_products       0.44      0.42      0.43       317
     search_and_rescue       0.23      0.16      0.19       176
              security       0.10      0.10      0.10       115
              military       0.41      0.42      0.41       231
                 water       0.60      0.70      0.65       411
                  food       0.69      0.78      0.73       710
               shelter       0.62      0.62      0.62       585
                 money       0.20      0.38      0.26       130
              refugees       0.34      0.33      0.33       206
                 death       0.64      0.56      0.60       323
             other_aid       0.37      

### 9. Export your model as a pickle file

In [30]:
# Save the fitted grid-search object; the context manager closes the file handle
with open('disaster_model.pkl', 'wb') as f:
    pickle.dump(tuned_model, f)
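To check the export, the file can be loaded back and used for prediction. Below is a self-contained sketch with a small stand-in model and a temporary file, both made up for the example. One caveat that applies to the real pipeline: pickle stores the custom `tokenize` function by reference, so the script that loads the model must make a function with the same name available under the same module path.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in model trained on made-up messages
messages = ['we need water', 'send food please', 'the road is open', 'weather is fine']
labels = [1, 1, 0, 0]
demo_model = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
demo_model.fit(messages, labels)

# Write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(demo_model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = (restored.predict(messages) == demo_model.predict(messages)).all()
print(same)
```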

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.