# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [36]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
import re
import numpy as np
# nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet') # download for lemmatization
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
# from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
# from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# other models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# pickle
import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterData.db')
df = pd.read_sql_table('TextMessages', engine)
X = df[["message", "original", "genre"]]
Y = df.drop(columns= ["id", "message", "original", "genre"])

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    # Normalization
    
    # Convert to lower case
    text = text.lower()
    
    # Remove punctuation: replace every character that is not a letter or a digit
    # with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    
    # Tokenization
    
    # Split into tokens
    words = word_tokenize(text)
    
    
    # Remove stopwords (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Part-of-speech tagging may be useful here?
    # Named entity recognition useful here?
    
    # Stemming - only keep the stem of a word; a simple rule-based method which removes e.g. "ing"
    # stemmed = [PorterStemmer().stem(w) for w in words]
    
    # Lemmatization - more complex approach using dictionaries which can e.g. map "is" and "was" to "be"
    lemmatizer = WordNetLemmatizer()
    # Lemmatize verbs by specifying pos
    lemmed_verbs = [lemmatizer.lemmatize(w, pos='v') for w in words]
    # Reduce nouns to their root form
    lemmed_nouns = [lemmatizer.lemmatize(w) for w in lemmed_verbs]
    return lemmed_nouns
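
To see what the normalization and stop-word steps do without downloading any NLTK data, here is a minimal stand-alone sketch. It uses `str.split` in place of `word_tokenize` and a tiny hand-picked stop-word set in place of NLTK's English list, and it skips lemmatization entirely. `simple_tokenize` and `TOY_STOPWORDS` are hypothetical names used only for illustration.

```python
import re

# Assumption: a tiny stand-in for stopwords.words("english")
TOY_STOPWORDS = {"we", "the", "in", "a", "is", "are", "and", "of", "to"}


def simple_tokenize(text):
    """Lower-case, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # Same regex as in tokenize(): anything that is not a letter or digit becomes a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return [w for w in text.split() if w not in TOY_STOPWORDS]


print(simple_tokenize("We need water & food in Port-au-Prince!"))
# ['need', 'water', 'food', 'port', 'au', 'prince']
```

Note that the hyphenated place name is split into three tokens, because the regex treats the hyphen like any other punctuation character.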

In [4]:
# Split the data into training and test sets.
# train_size is kept very small (5%) so that the grid search further below runs in a feasible amount of time.
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.05)

In [5]:
# Calculate the average accuracy for each target column
def print_acc(name, model, y_test, y_pred):
    columns = y_test.columns
    y_pred_df = pd.DataFrame(y_pred, columns = columns)
    accuracy = (y_pred_df == y_test.reset_index(drop=True)).mean()
    print(f"Accuracy per category {name}: ")
    print(f"Average accuracy: {accuracy.mean()}")
    print(accuracy)
    return {'name' : name, 'model': model, 'accuracy' : accuracy}
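
The per-label accuracy computed in `print_acc` can be reproduced on a toy example. The sketch below uses made-up labels and predictions (the column names are borrowed from the dataset, the values are not) to show why the index of `y_test` is reset before the comparison: the rows of the test split keep their shuffled original labels, while the prediction frame is numbered from 0.

```python
import pandas as pd

# Toy ground truth with a shuffled index, as train_test_split would produce
y_true = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]}, index=[7, 2, 9, 4])
# Toy predictions as plain rows, like the array returned by predict()
y_hat = [[1, 0], [0, 1], [0, 1], [0, 1]]

y_hat_df = pd.DataFrame(y_hat, columns=y_true.columns)
# Resetting the index gives both frames the labels 0..n-1, so rows are compared by position
per_label = (y_hat_df == y_true.reset_index(drop=True)).mean()

print(per_label)         # fraction of correct predictions per column
print(per_label.mean())  # average over all columns
```

Here three of four predictions are correct in each column, so both per-label accuracies and their mean come out as 0.75.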

In [6]:
# Create an empty list to store all results and models so that we can pick the best one at the end
results = []

# Baseline model without optimization (MultiOutputClassifier with RandomForestClassifier)

In [7]:
# pipeline = Pipeline([
#         ('features', FeatureUnion([

#             ('text_pipeline', Pipeline([
#                 ('vect', CountVectorizer(tokenizer=tokenize)),
#                 ('tfidf', TfidfTransformer())
#             ]))
#         ])),

#         ('clf', MultiOutputClassifier(RandomForestClassifier()))
#     ])

random_forest_pipe = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

random_forest_pipe.fit(X_train["message"], y_train)
y_pred = random_forest_pipe.predict(X_test["message"])
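
As a rough illustration of how the three pipeline stages fit together, the following sketch trains the same kind of pipeline on four made-up messages with two made-up labels each. It relies on the default tokenizer of `CountVectorizer` rather than the `tokenize` function above, and the small `n_estimators` and fixed `random_state` are choices made only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages and two made-up binary labels per message: [water, food]
texts = ["we need water", "send food please", "water and food needed", "all fine here"]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),    # text -> token counts
    ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
    # One random forest is fitted per label column
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipe.fit(texts, labels)

pred = toy_pipe.predict(["need water now"])
print(pred.shape)  # one row per message, one column per label
```

The prediction has the same column layout as the label matrix, which is what allows `print_acc` to wrap it in a DataFrame using the columns of `y_test`.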

In [8]:
results.append(print_acc("MultiOutputClassifier RandomForest", random_forest_pipe, y_test, y_pred))

Accuracy per category MultiOutputClassifier RandomForest: 
Average accuracy: 0.9399697146986956
related                   0.793182
request                   0.879065
offer                     0.995463
aid_related               0.741869
medical_help              0.920220
medical_products          0.950213
search_and_rescue         0.972095
security                  0.982012
military                  0.967438
water                     0.942905
food                      0.922629
shelter                   0.915763
clothing                  0.984943
money                     0.977395
missing_people            0.988557
refugees                  0.967076
death                     0.954870
other_aid                 0.868706
infrastructure_related    0.934755
transport                 0.954910
buildings                 0.950012
electricity               0.979844
tools                     0.993777
hospitals                 0.988999
shops                     0.995262
aid_centers               0.9
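
Per-label accuracy can look high simply because a label is rare. As a hedged illustration with made-up numbers (not taken from the dataset), the sketch below scores a trivial predictor that always outputs 0 on a label that is positive in 5 of 1,000 messages. The `classification_report` function imported at the top could be used to inspect precision and recall per label instead.

```python
# Made-up label column: 5 positives out of 1,000 messages
y_true = [1] * 5 + [0] * 995
# A trivial predictor that never flags the label
y_hat = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.995
print(recall)    # 0.0
```

An accuracy of 0.995 paired with a recall of 0 shows that values close to 1 for rare categories are not by themselves evidence that the model detects them.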

# kNN

In [9]:
knn_pipe = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', KNeighborsClassifier())
    ])
knn_pipe.fit(X_train["message"], y_train)
y_pred_knn = knn_pipe.predict(X_test["message"])

In [10]:
results.append(print_acc("kNN", knn_pipe, y_test, y_pred_knn))

Accuracy per category kNN: 
Average accuracy: 0.9303656032396095
related                   0.775395
request                   0.852847
offer                     0.995463
aid_related               0.641532
medical_help              0.920501
medical_products          0.950373
search_and_rescue         0.972216
security                  0.982012
military                  0.967438
water                     0.939733
food                      0.900145
shelter                   0.913716
clothing                  0.985626
money                     0.977515
missing_people            0.988557
refugees                  0.966996
death                     0.956035
other_aid                 0.865976
infrastructure_related    0.934795
transport                 0.955192
buildings                 0.950414
electricity               0.980085
tools                     0.993777
hospitals                 0.988999
shops                     0.995262
aid_centers               0.988155
other_infrastructure     

# Decision tree


In [11]:
decision_tree_pipe = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', DecisionTreeClassifier())
    ])
decision_tree_pipe.fit(X_train["message"], y_train)
y_pred_decision_tree = decision_tree_pipe.predict(X_test["message"])

In [12]:
results.append(print_acc("Decision Tree", decision_tree_pipe, y_test, y_pred_decision_tree))

Accuracy per category Decision Tree: 
Average accuracy: 0.9243796675499878
related                   0.744479
request                   0.836826
offer                     0.994740
aid_related               0.674295
medical_help              0.896732
medical_products          0.927126
search_and_rescue         0.965591
security                  0.971252
military                  0.961816
water                     0.943267
food                      0.926925
shelter                   0.900827
clothing                  0.982494
money                     0.969164
missing_people            0.984702
refugees                  0.943066
death                     0.946800
other_aid                 0.822653
infrastructure_related    0.912391
transport                 0.941139
buildings                 0.928571
electricity               0.976190
tools                     0.990324
hospitals                 0.987433
shops                     0.995222
aid_centers               0.985546
other_infrastru

# Random Forest

In [13]:
random_forest_only_pipe = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])
random_forest_only_pipe.fit(X_train["message"], y_train)
y_pred_random_forest_only = random_forest_only_pipe.predict(X_test["message"])

In [14]:
results.append(print_acc("Random Forest", random_forest_only_pipe, y_test, y_pred_random_forest_only))

Accuracy per category Random Forest: 
Average accuracy: 0.9397701070310077
related                   0.805348
request                   0.879788
offer                     0.995463
aid_related               0.728459
medical_help              0.920180
medical_products          0.950253
search_and_rescue         0.972095
security                  0.982012
military                  0.967317
water                     0.941781
food                      0.922589
shelter                   0.913956
clothing                  0.985305
money                     0.977395
missing_people            0.988557
refugees                  0.967076
death                     0.954750
other_aid                 0.868546
infrastructure_related    0.934795
transport                 0.954429
buildings                 0.949370
electricity               0.979844
tools                     0.993777
hospitals                 0.988999
shops                     0.995262
aid_centers               0.988196
other_infrastru

In [15]:
for result in results:
    print(result["name"])
    print(result["accuracy"].mean())

MultiOutputClassifier RandomForest
0.9399697146986956
kNN
0.9303656032396095
Decision Tree
0.9243796675499878
Random Forest
0.9397701070310077


# Improve models using GridSearch

## MultiOutputClassifier + RandomForestClassifier

In [16]:
# Check for available parameters to optimize
random_forest_pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf_

In [17]:
parameters_mo_rf = {
    # vect
    # https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    
    # tfidf
    # https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    'tfidf__norm' : ['l1', 'l2'],
  #  'tfidf__use_idf' : [True, False],
   # 'tfidf__smooth_idf': [True, False],
   # 'tfidf__sublinear_tf' : [True, False],

    # clf
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    'clf__estimator__criterion' : ['gini', 'entropy'],
    'clf__estimator__n_estimators': [50, 100, 150, 200],
    'clf__estimator__max_depth' : [None, 5, 10],
}

cv_parameters_mo_rf = GridSearchCV(random_forest_pipe, param_grid=parameters_mo_rf) 
cv_parameters_mo_rf.fit(X_train["message"], y_train)
y_pred_mo_rf_cv = cv_parameters_mo_rf.predict(X_test["message"])
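
The size of this grid explains why the training split was made so small. The sketch below counts the parameter combinations and the resulting number of pipeline fits, assuming the default of 5 cross-validation folds (the default in scikit-learn 0.22 and later) and the default `refit=True`.

```python
from itertools import product

# Same grid as parameters_mo_rf above
grid = {
    "tfidf__norm": ["l1", "l2"],
    "clf__estimator__criterion": ["gini", "entropy"],
    "clf__estimator__n_estimators": [50, 100, 150, 200],
    "clf__estimator__max_depth": [None, 5, 10],
}

n_candidates = len(list(product(*grid.values())))
cv_folds = 5  # assumption: GridSearchCV default when cv is not passed
n_fits = n_candidates * cv_folds + 1  # +1 for the final refit on the whole training split

print(n_candidates)  # 48
print(n_fits)        # 241
```

Each of these fits also runs the tokenizer over the training messages and trains one forest per label, so the cost grows quickly with every value added to the grid.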

In [18]:
results.append(print_acc("MultiOutputClassifier Random Forest CV", cv_parameters_mo_rf, y_test, y_pred_mo_rf_cv))

Accuracy per category MultiOutputClassifier Random Forest CV: 
Average accuracy: 0.9394546351424211
related                   0.797358
request                   0.880350
offer                     0.995463
aid_related               0.739420
medical_help              0.920220
medical_products          0.950293
search_and_rescue         0.972095
security                  0.982012
military                  0.967397
water                     0.941902
food                      0.913234
shelter                   0.914318
clothing                  0.985024
money                     0.977395
missing_people            0.988557
refugees                  0.967076
death                     0.954750
other_aid                 0.868626
infrastructure_related    0.934755
transport                 0.954629
buildings                 0.949570
electricity               0.979844
tools                     0.993777
hospitals                 0.988999
shops                     0.995262
aid_centers              

## kNN

In [19]:
knn_pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__algorithm', 'clf__leaf_size', 'clf__metric', 'clf__metric_params', 'clf__n_jobs', 'clf__n_neighbors', 'clf__p', 'clf__weights'])

In [23]:
parameters_knn = {
    # vect
    # https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    
    # tfidf
    # https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    'tfidf__norm' : ['l1', 'l2'],
  #  'tfidf__use_idf' : [True, False],
  #  'tfidf__smooth_idf': [True, False],
  #  'tfidf__sublinear_tf' : [True, False],

    # clf
    # https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
    'clf__n_neighbors' : [3, 5, 8],
    'clf__weights' : ['uniform', 'distance'],
    'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],

}

cv_knn = GridSearchCV(knn_pipe, param_grid=parameters_knn) 
cv_knn.fit(X_train["message"], y_train)
y_pred_knn_cv = cv_knn.predict(X_test["message"])



In [24]:
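# Score the kNN grid-search predictions. The output recorded below was produced with
# y_pred_mo_rf_cv passed by mistake, which is why it repeats the random forest CV numbers.
results.append(print_acc("kNN CV", cv_knn, y_test, y_pred_knn_cv))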

Accuracy per category kNN CV: 
Average accuracy: 0.9394546351424211
related                   0.797358
request                   0.880350
offer                     0.995463
aid_related               0.739420
medical_help              0.920220
medical_products          0.950293
search_and_rescue         0.972095
security                  0.982012
military                  0.967397
water                     0.941902
food                      0.913234
shelter                   0.914318
clothing                  0.985024
money                     0.977395
missing_people            0.988557
refugees                  0.967076
death                     0.954750
other_aid                 0.868626
infrastructure_related    0.934755
transport                 0.954629
buildings                 0.949570
electricity               0.979844
tools                     0.993777
hospitals                 0.988999
shops                     0.995262
aid_centers               0.988196
other_infrastructure  

# Evaluate the results

In [25]:
for result in results:
    print(result["name"])
    print(result["accuracy"].mean())

MultiOutputClassifier RandomForest
0.9399697146986956
kNN
0.9303656032396095
Decision Tree
0.9243796675499878
Random Forest
0.9397701070310077
MultiOutputClassifier Random Forest CV
0.9394546351424211
kNN CV
0.9394546351424211


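As we can see, all models perform very similarly; only the decision tree is slightly worse than the others. Surprisingly, our unoptimized original model, a MultiOutputClassifier wrapping a RandomForestClassifier, achieved the highest average accuracy (0.9400), although its lead over the plain random forest (0.9398) and the grid-searched version (0.9395) is well under a tenth of a percentage point. Note that the "kNN CV" entry is identical to the random forest CV entry because the recorded run scored `y_pred_mo_rf_cv` for it rather than `y_pred_knn_cv`, so it says nothing about the tuned kNN model. The default configuration therefore appears to fit our problem well, and the grid search brought no measurable improvement. An average accuracy of about 94% is a good result, so we stick with that model.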

In [30]:
best_model = results[0]['model']
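
Instead of hard-coding index 0, the best entry could also be selected programmatically. The sketch below does this on a plain list of dictionaries holding the mean accuracies printed above. The `mean_accuracy` key is a hypothetical simplification; in the notebook itself the equivalent would be `max(results, key=lambda r: r["accuracy"].mean())`.

```python
# Mean accuracies copied from the summary printed above
summary = [
    {"name": "MultiOutputClassifier RandomForest", "mean_accuracy": 0.9399697146986956},
    {"name": "kNN", "mean_accuracy": 0.9303656032396095},
    {"name": "Decision Tree", "mean_accuracy": 0.9243796675499878},
    {"name": "Random Forest", "mean_accuracy": 0.9397701070310077},
    {"name": "MultiOutputClassifier Random Forest CV", "mean_accuracy": 0.9394546351424211},
    {"name": "kNN CV", "mean_accuracy": 0.9394546351424211},
]

# max() with a key function returns the entry with the highest mean accuracy
best = max(summary, key=lambda r: r["mean_accuracy"])
print(best["name"])  # MultiOutputClassifier RandomForest
```

This keeps the selection correct if models are appended to `results` in a different order.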

Now that we have found the best model configuration, we retrain the model on 80% of the data.

In [34]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X, Y, train_size = 0.80)
best_model.fit(X_train_new["message"], y_train_new)
y_pred_final = best_model.predict(X_test_new["message"])

### 9. Export your model as a pickle file

In [37]:
model_params = best_model.get_params()
model = best_model

# Context managers close the files even if pickling fails
with open('model_params.obj', 'wb') as file_obj:
    pickle.dump(model_params, file_obj)

with open('model.obj', 'wb') as file_obj:
    pickle.dump(model, file_obj)
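
Loading the model back works the same way in reverse. The sketch below shows a full round trip with a small dictionary standing in for the fitted pipeline and a temporary directory standing in for the working directory. Because pickle records functions by reference rather than by value, a script that loads the real pipeline would presumably also need the `tokenize` function to be importable under the same module path.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way here
toy_model = {"steps": ["vect", "tfidf", "clf"], "n_estimators": 100}

path = os.path.join(tempfile.mkdtemp(), "model.obj")

# Write the object to disk
with open(path, "wb") as file_obj:
    pickle.dump(toy_model, file_obj)

# Read it back
with open(path, "rb") as file_obj:
    restored = pickle.load(file_obj)

print(restored == toy_model)  # True
```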

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
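
One possible skeleton for such a script is sketched below. The argument names `database_filepath` and `model_filepath` and the helper `parse_args` are hypothetical; the template in the Resources folder may be organized differently. The notebook steps are only listed as comments.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths a training script of this kind would need."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    # Hypothetical argument names, chosen for this sketch
    parser.add_argument("database_filepath", help="SQLite database holding the messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would go here, in order:
    # 1. load X and Y from args.database_filepath
    # 2. split the data, build the pipeline, fit it
    # 3. evaluate on the test split
    # 4. pickle the fitted model to args.model_filepath
    print(f"database: {args.database_filepath}")
    print(f"model:    {args.model_filepath}")


# Example invocation with the file names used in this notebook
main(["DisasterData.db", "model.obj"])
```

In a real script the last line would typically be replaced by a call to `main()` under an `if __name__ == "__main__":` guard, so that the paths come from the command line.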