# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import basic libraries
import pandas as pd
import re
# import SQL-libraries
from sqlalchemy.engine import create_engine
# import nltk libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# import sk-learn libraries
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion

In [2]:
# custom randomstate
myseed = 42

In [3]:
# download NLTK package
nltk.download(['stopwords', 'wordnet', 'punkt', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [4]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', con=engine)
# Create X and Y from the df
X = df.iloc[:, :4]
Y = df.iloc[:, 4:]
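
Splitting by position assumes the table always stores its columns in the same order. The sketch below shows the same split done by column name on a toy frame. The names `id`, `original` and `genre` are assumptions made for illustration; only `message` and the category names appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the loaded table; 'id', 'original' and 'genre'
# are assumed column names, not confirmed by this notebook.
toy = pd.DataFrame({
    'id': [1, 2],
    'message': ['we need water', 'send food please'],
    'original': [None, None],
    'genre': ['direct', 'news'],
    'water': [1, 0],
    'food': [0, 1],
})

# Columns that describe a message rather than label it (assumed names).
meta_columns = ['id', 'message', 'original', 'genre']

X_toy = toy[meta_columns]
Y_toy = toy.drop(columns=meta_columns)

print(list(Y_toy.columns))
```

Selecting by name keeps the split stable if columns are later added or reordered in the database.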

### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    '''
    Tokenizes a given text string.
    Arg:
        text (str): twitter message.
    Return:
        cleaned (list): lemmatized tokens of the message.
    '''
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # get list of all urls using regex
    detected_urls = re.findall(url_regex, text)
    # replace each url in text string with placeholder
    for url in detected_urls:
        text = text.replace(url, 'urlsub')
    # initiate Lemmatizer
    lemmatizer = WordNetLemmatizer()
    # remove special characters, transform capital letters and split into words
    words = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()
    # remove stop words (build the set once instead of once per word)
    stop_words = set(stopwords.words('english'))
    text_tokens = [word for word in words if word not in stop_words]
    # lemmatize tokens
    cleaned = [lemmatizer.lemmatize(word) for word in text_tokens]
    return cleaned
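
To see the URL replacement and normalization steps in isolation, here is a minimal sketch that uses only the standard library. `simple_tokenize` and `TOY_STOP_WORDS` are hypothetical names, the stop word set is a tiny hand-picked stand-in for NLTK's English list, and lemmatization is left out.

```python
import re

URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Tiny hand-picked stop word set, for illustration only (assumption);
# the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {'the', 'is', 'at', 'a', 'we', 'to'}

def simple_tokenize(text):
    # Replace every URL with a placeholder token.
    text = re.sub(URL_REGEX, 'urlsub', text)
    # Lowercase, turn non-alphanumeric characters into spaces, split.
    words = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower()).split()
    # Drop stop words; no lemmatization in this sketch.
    return [word for word in words if word not in TOY_STOP_WORDS]

print(simple_tokenize('Water needed at the shelter! See http://example.com/info'))
# prints ['water', 'needed', 'shelter', 'see', 'urlsub']
```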

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
# Setting up basic pipeline
pipeline = Pipeline([
    ('vec', TfidfVectorizer(tokenizer = tokenize)),
    ('classifier', MultiOutputClassifier(RandomForestClassifier()))
])
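
As a quick illustration of what `MultiOutputClassifier` produces, the sketch below fits the same two-step pipeline on a handful of made-up messages with two binary label columns. The toy data, the default tokenizer and the small forest size are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two binary label columns (made up for illustration).
messages = ['we need water', 'no water here', 'send food please',
            'food and water needed', 'hungry, need food', 'thirsty, need water']
labels = np.array([[1, 0], [1, 0], [0, 1], [1, 1], [0, 1], [1, 0]])

toy_pipeline = Pipeline([
    ('vec', TfidfVectorizer()),
    ('classifier', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=42))),
])
toy_pipeline.fit(messages, labels)

# One row per message, one column per label.
predictions = toy_pipeline.predict(['need water and food'])
print(predictions.shape)
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per label column, which is why training time grows with the number of categories.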

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
# splitting into training and test dataset
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state = myseed)

In [9]:
# fitting pipeline
test = pipeline.fit(x_train["message"], y_train)

In [10]:
# predicting pipeline with test dataset
y_pred = pipeline.predict(x_test["message"])
# transforming into pd.Dataframe
y_pred = pd.DataFrame(y_pred, columns=y_train.columns)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [7]:
def get_scores(y_test, y_pred):
    '''
    Function: 
        Gets precision, recall and F1 scores of the model for each output category of the test set, plus the averages over all categories.
    Args:
        y_test (pd.DataFrame): labels of the test data.
        y_pred (pd.DataFrame): predicted labels of the test data.
    Return:
        df_scores (pd.DataFrame): precision, recall and F1 scores per category.
        averages (pd.Series): model averages for precision, recall and F1.
    '''
    scores = []
    # Extract Precision, Recall and F1 scores for each output category
    for column in y_test.columns:
        scores.append(classification_report(y_test[column], y_pred[column]).split()[-4:-1])
    df_scores = pd.DataFrame(scores, index=y_test.columns, columns = ['Precision', 'Recall', 'F1-Score'])
    df_scores = df_scores.astype(float)
    # calculate averages over all output categories
    averages = pd.Series([df_scores[column].mean() for column in df_scores.columns], index = df_scores.columns)
    return df_scores, averages
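
Parsing the text of `classification_report` depends on its exact layout. A possible alternative, sketched below, reads the numbers directly from `precision_recall_fscore_support` with `average='weighted'`, which is intended to correspond to the weighted average row of the report. `get_scores_numeric` is a hypothetical name, and the `zero_division` argument assumes scikit-learn 0.22 or newer.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def get_scores_numeric(y_true, y_hat):
    # Collect weighted precision, recall and F1 per output column.
    rows = {}
    for column in y_true.columns:
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[column], y_hat[column], average='weighted', zero_division=0)
        rows[column] = {'Precision': precision, 'Recall': recall, 'F1-Score': f1}
    df_scores = pd.DataFrame.from_dict(rows, orient='index')
    # Column-wise mean gives the averages over all categories.
    return df_scores, df_scores.mean()

# Toy labels and predictions (made up for illustration).
y_true = pd.DataFrame({'water': [1, 0, 1, 0], 'food': [0, 0, 1, 1]})
y_hat = pd.DataFrame({'water': [1, 0, 1, 0], 'food': [0, 0, 0, 1]})
toy_scores, toy_averages = get_scores_numeric(y_true, y_hat)
print(toy_scores)
```

Setting `zero_division=0` also silences the warning raised for categories that are never predicted.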

In [12]:
scores, means = get_scores(y_test, y_pred)



### 6. Improve your model
Use grid search to find better parameters. 
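
Grid search addresses nested hyperparameters with double underscores: step name, then attribute name, then parameter name. A small sketch of how to list the names that reach the wrapped classifier, using the step names `vec` and `classifier` from the pipeline above (`demo_pipeline` and `forest_params` are hypothetical names):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('vec', TfidfVectorizer()),
    ('classifier', MultiOutputClassifier(RandomForestClassifier())),
])

# List every tunable name that reaches the wrapped random forest.
forest_params = sorted(
    name for name in demo_pipeline.get_params()
    if name.startswith('classifier__estimator__'))
print(forest_params[:5])
```

Any key in `get_params()` is a valid key for the `param_grid` passed to `GridSearchCV`.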

In [13]:
class Model:
    '''
    Class: allows fitting, predicting and scoring classifier with given parameters.
    '''
    def __init__(self, classifier, parameters):
        '''
        Function:
            Instantiates Pipeline object with given parameters.
        Args:
            classifier (Estimator): classifier object to use as estimator.
            parameters (dict): hyperparameter grid to use in grid search.
        '''
        self.classifier = classifier
        self.parameters = parameters
        # Instantiating Pipeline object with custom tokenizer
        self.pipeline = Pipeline([
            ('vec', TfidfVectorizer(tokenizer = tokenize)),
            ('classifier', MultiOutputClassifier(self.classifier))
        ])
    def fit(self, X, y, verbose=3, scoring='f1_micro'):
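        '''
        Function:
            Runs grid search on the classifier with the given hyperparameters.
        Args:
            X (pd.Series): messages of the training data.
            y (pd.DataFrame): labels of the training data.
            verbose (int): verbosity level passed to GridSearchCV.
            scoring (str): scoring metric passed to GridSearchCV.
        Return:
            cv (GridSearchCV): fitted grid search object.
        '''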
        self.cv = GridSearchCV(self.pipeline, param_grid=self.parameters, verbose=verbose, scoring=scoring)
        self.columns = y.columns
        self.cv.fit(X, y)
        self.best_params_ = self.cv.best_params_
        self.best_estimator_ = self.cv.best_estimator_
        self.best_score_ = self.cv.best_score_
        return self.cv

    def predict(self, X):
        '''
        Function:
            Predicts labels on the given dataset.
        Args:
            X (pd.Series): messages of the dataset.
        Return:
            y_pred (pd.DataFrame): predicted labels of the dataset.
        '''
        self.y_pred = self.cv.predict(X)
        self.y_pred = pd.DataFrame(self.y_pred, columns=self.columns)
        return self.y_pred

    def get_scores(self, y_test):
        '''
        Function: 
            Gets precision, recall and F1 scores of the model for each output category of the test set, plus the averages over all categories.
            Uses the predictions stored by predict(), so predict() must be called first.
        Args:
            y_test (pd.DataFrame): labels of the test data.
        Return:
            df_scores (pd.DataFrame): precision, recall and F1 scores per category.
            averages (pd.Series): model averages for precision, recall and F1.
        '''
        scores = []
        for column in y_test.columns:
            scores.append(classification_report(y_test[column], self.y_pred[column]).split()[-4:-1])
        df_scores = pd.DataFrame(scores, index=y_test.columns, columns = ['Precision', 'Recall', 'F1-Score'])
        df_scores = df_scores.astype(float)
        averages = pd.Series([df_scores[column].mean() for column in df_scores.columns], index = df_scores.columns)
        return df_scores, averages
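
The `fit` method scores candidates with `f1_micro`. For multi-label targets, micro averaging pools true positives, false positives and false negatives over all categories before computing one F1 value. The sketch below spells this out in plain Python on made-up data; `micro_f1` is a hypothetical helper, not part of scikit-learn.

```python
def micro_f1(y_true_rows, y_pred_rows):
    # Pool true positives, false positives and false negatives over
    # every (message, category) cell, then compute a single F1 score.
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true_rows, y_pred_rows):
        for truth, guess in zip(true_row, pred_row):
            if truth == 1 and guess == 1:
                tp += 1
            elif truth == 0 and guess == 1:
                fp += 1
            elif truth == 1 and guess == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Toy multi-label data: rows are messages, columns are categories.
y_true_rows = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred_rows = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
print(micro_f1(y_true_rows, y_pred_rows))  # 3 TP, 1 FP, 2 FN -> 6 / 9
```

Because correctly predicted zeros do not enter the formula, this score can be much lower than the per-category weighted averages reported by `get_scores`.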

In [35]:
# Fitting and predicting RandomForestClassifier with hyperparameters
parameters_rfc =    {
        'classifier__estimator__n_estimators': [100, 200],
        'classifier__estimator__min_samples_split': [2, 4]
                }

model_rfc = Model(RandomForestClassifier(), parameters=parameters_rfc)
model_rfc.fit(x_train["message"], y_train)
y_pred = model_rfc.predict(x_test['message'])

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100, score=0.6395359057748088, total= 6.9min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.4min remaining:    0.0s


[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100, score=0.6447088946434286, total= 7.1min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 17.1min remaining:    0.0s


[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100, score=0.6410732466910224, total= 7.1min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200, score=0.6449636825063446, total=12.3min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200, score=0.6442828047716032, total=12.3min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200, score=0.6395826462387192, total=12.4min
[CV] classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100 
[CV]  classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100, score=0.646574229854383, 

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 128.9min finished


In [None]:
# Scoring RandomForestClassifier
scores_rfc, averages_rfc = model_rfc.get_scores(y_test)

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [8]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    '''
    Class: Transformer that checks for every text entry in a Series if a sentence starts with a verb
    '''
    def starting_verb(self, text):
        '''
        Function: 
            Checks whether any sentence in a text starts with a verb.
        Arg:
            text (string): text containing at least one sentence
        Return:
            Bool (bool): whether any sentence starts with a verb
        '''
        # splits text into lists of sentences
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # Check if sentence starts with verb
            if pos_tags:
                first_word, first_tag = pos_tags[0]
                # tokenize() lowercases its input, so retweet markers arrive as 'rt'
                if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                    return True
        return False

    def fit(self, X, y=None):
        '''
        Function: 
            returns itself.
        Arg:
            X (pd.Series): Series of text-messages
        Return:
            self (StartingVerbExtractor object): returns reference to object
        '''
        return self
    
    def transform(self, X):
        '''
        Function: 
            Applies starting_verb function on a Series.
        Arg:
            X (pd.Series): Series of text-messages.
        Return:
            X_tagged (pd.DataFrame): Booleans stating whether a sentence in each datapoint starts with a verb.
        '''
        # Applies starting_verb function on series
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)
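
The same pattern works for any extra feature: a transformer with `fit` and `transform` that returns one row per message can be stacked next to the TF-IDF matrix with `FeatureUnion`. The sketch below uses a hypothetical `TextLengthExtractor` on made-up messages, only to show how the combined matrix gains one column.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    # Hypothetical transformer: emits the character count of each message.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return a 2-D column so FeatureUnion can stack it next to TF-IDF.
        return np.array([[len(text)] for text in X])

messages = ['we need water', 'send food please', 'road is blocked']

features = FeatureUnion([
    ('vec', TfidfVectorizer()),
    ('length', TextLengthExtractor()),
])
combined = features.fit_transform(messages)

# The combined matrix has one column per vocabulary term plus one extra.
vocabulary_size = len(features.transformer_list[0][1].vocabulary_)
print(combined.shape, vocabulary_size)
```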

In [11]:
class Improved_Model:
    '''
    Class: allows fitting, predicting and scoring classifier with given parameters.
    '''
    def __init__(self, classifier, parameters):
        '''
        Function:
            Instantiates Pipeline object with given parameters, custom tokenizer and StartingVerbExtractor.
        Args:
            classifier (Estimator): classifier object to use as estimator.
            parameters (dict): hyperparameter grid to use in grid search.
        '''
        self.classifier = classifier
        self.parameters = parameters
        # Instantiating Pipeline object with custom tokenizer and StartingVerbExtractor
        self.pipeline = Pipeline([
            ('features', FeatureUnion([
                ('vec', TfidfVectorizer(tokenizer = tokenize)),
                ('starting_verb', StartingVerbExtractor())
            ])),
            ('classifier', MultiOutputClassifier(self.classifier))
        ])
    def fit(self, X, y, verbose=3, scoring='f1_micro'):
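        '''
        Function:
            Runs grid search (2-fold cross-validation) on the classifier with the given hyperparameters.
        Args:
            X (pd.Series): messages of the training data.
            y (pd.DataFrame): labels of the training data.
            verbose (int): verbosity level passed to GridSearchCV.
            scoring (str): scoring metric passed to GridSearchCV.
        Return:
            cv (GridSearchCV): fitted grid search object.
        '''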
        self.cv = GridSearchCV(self.pipeline, param_grid=self.parameters, verbose=verbose, scoring=scoring, cv=2)
        self.columns = y.columns
        self.cv.fit(X, y)
        self.best_params_ = self.cv.best_params_
        self.best_estimator_ = self.cv.best_estimator_
        self.best_score_ = self.cv.best_score_
        return self.cv

    def predict(self, X):
        '''
        Function:
            Predicts labels on the given dataset.
        Args:
            X (pd.Series): messages of the dataset.
        Return:
            y_pred (pd.DataFrame): predicted labels of the dataset.
        '''
        self.y_pred = self.cv.predict(X)
        self.y_pred = pd.DataFrame(self.y_pred, columns=self.columns)
        return self.y_pred

    def get_scores(self, y_test):
        '''
        Function: 
            Gets precision, recall and F1 scores of the model for each output category of the test set, plus the averages over all categories.
            Uses the predictions stored by predict(), so predict() must be called first.
        Args:
            y_test (pd.DataFrame): labels of the test data.
        Return:
            df_scores (pd.DataFrame): precision, recall and F1 scores per category.
            averages (pd.Series): model averages for precision, recall and F1.
        '''
        scores = []
        for column in y_test.columns:
            scores.append(classification_report(y_test[column], self.y_pred[column]).split()[-4:-1])
        df_scores = pd.DataFrame(scores, index=y_test.columns, columns = ['Precision', 'Recall', 'F1-Score'])
        df_scores = df_scores.astype(float)
        averages = pd.Series([df_scores[column].mean() for column in df_scores.columns], index = df_scores.columns)
        return df_scores, averages

In [19]:
# Fitting and predicting SupportVectorClassifier with hyperparameters
parameters_svc = {
        'classifier__estimator__C': [0.1,0.5,1]
                }

model_svc = Model(SVC(), parameters=parameters_svc)
model_svc.fit(x_train['message'], y_train)
model_svc.predict(x_test['message'])

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] classifier__estimator__C=0.1 ....................................
[CV]  classifier__estimator__C=0.1, score=0.3643571558465176, total= 5.4min
[CV] classifier__estimator__C=0.1 ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.7min remaining:    0.0s


[CV]  classifier__estimator__C=0.1, score=0.36374230671182484, total= 5.3min
[CV] classifier__estimator__C=0.1 ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 17.4min remaining:    0.0s


[CV]  classifier__estimator__C=0.1, score=0.36827596483385483, total= 5.3min
[CV] classifier__estimator__C=0.5 ....................................
[CV]  classifier__estimator__C=0.5, score=0.3643571558465176, total= 5.3min
[CV] classifier__estimator__C=0.5 ....................................
[CV]  classifier__estimator__C=0.5, score=0.36374230671182484, total= 5.3min
[CV] classifier__estimator__C=0.5 ....................................
[CV]  classifier__estimator__C=0.5, score=0.36827596483385483, total= 5.4min
[CV] classifier__estimator__C=1 ......................................
[CV]  classifier__estimator__C=1, score=0.3643571558465176, total= 5.4min
[CV] classifier__estimator__C=1 ......................................
[CV]  classifier__estimator__C=1, score=0.36374230671182484, total= 5.3min
[CV] classifier__estimator__C=1 ......................................
[CV]  classifier__estimator__C=1, score=0.36827596483385483, total= 5.4min


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 78.6min finished


AttributeError: 'Model' object has no attribute 'y_pred'

In [23]:
# Scoring SupportVectorClassifier
scores_svc, averages_svc = model_svc.get_scores(y_test)
print(scores_svc, averages_svc)



                        Precision  Recall  F1-Score
related                      0.57    0.75      0.65
request                      0.68    0.83      0.75
offer                        0.99    1.00      0.99
aid_related                  0.34    0.58      0.43
medical_help                 0.85    0.92      0.88
medical_products             0.91    0.95      0.93
search_and_rescue            0.94    0.97      0.96
security                     0.96    0.98      0.97
military                     0.93    0.96      0.95
water                        0.88    0.94      0.90
food                         0.79    0.89      0.84
shelter                      0.83    0.91      0.87
clothing                     0.97    0.99      0.98
money                        0.96    0.98      0.97
missing_people               0.98    0.99      0.98
refugees                     0.93    0.96      0.94
death                        0.91    0.95      0.93
other_aid                    0.76    0.87      0.81
infrastructu

In [25]:
# Fitting and predicting KNeighborsClassifier with hyperparameters
parameters_knn = {
        'classifier__estimator__n_neighbors': [2],
        'classifier__estimator__weights': ["uniform", "distance"]
                }

model_knn = Model(KNN(), parameters=parameters_knn)
model_knn.fit(x_train['message'], y_train)
model_knn.predict(x_test['message'])

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform 
[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform, score=0.08599430920012646, total= 2.8min
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  6.6min remaining:    0.0s


[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform, score=0.08935920047031158, total= 2.8min
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 13.1min remaining:    0.0s


[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=uniform, score=0.08917494519845157, total= 2.8min
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance 
[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance, score=0.1881324374645835, total= 2.7min
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance 
[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance, score=0.19105707476673264, total= 2.7min
[CV] classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance 
[CV]  classifier__estimator__n_neighbors=2, classifier__estimator__weights=distance, score=0.1929228474351537, total= 2.8min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 38.5min finished


In [26]:
# scoring KNeighborsClassifier
scores_knn, averages_knn = model_knn.get_scores(y_test)
print(scores_knn, averages_knn)

                        Precision  Recall  F1-Score
related                      0.68    0.35      0.32
request                      0.83    0.85      0.82
offer                        0.99    0.99      0.99
aid_related                  0.64    0.61      0.53
medical_help                 0.86    0.91      0.88
medical_products             0.92    0.95      0.93
search_and_rescue            0.95    0.97      0.96
security                     0.96    0.98      0.97
military                     0.94    0.96      0.95
water                        0.91    0.93      0.91
food                         0.87    0.90      0.87
shelter                      0.89    0.91      0.89
clothing                     0.98    0.99      0.98
money                        0.96    0.98      0.97
missing_people               0.98    0.99      0.98
refugees                     0.93    0.96      0.94
death                        0.94    0.95      0.94
other_aid                    0.81    0.86      0.82
infrastructu

In [12]:
# Fitting and predicting RandomForestClassifier with hyperparameters
# on improved model
parameters_rfc =    {
        'classifier__estimator__n_estimators': [100, 200],
        'classifier__estimator__min_samples_split': [2, 4]
                }

model_rfc = Improved_Model(RandomForestClassifier(), parameters=parameters_rfc)
model_rfc.fit(x_train["message"], y_train)
y_pred = model_rfc.predict(x_test['message'])

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100, score=0.635406661047184, total= 6.4min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.3min remaining:    0.0s


[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=100, score=0.6367458108828232, total= 6.5min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 16.6min remaining:    0.0s


[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200, score=0.6379340771387675, total= 9.6min
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=200, score=0.6381218733997714, total= 9.7min
[CV] classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100 
[CV]  classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100, score=0.6394350811485642, total= 5.8min
[CV] classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100 
[CV]  classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=100, score=0.6400329005346337, total= 5.8min
[CV] classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=200 
[CV]  classifier__estimator__min_samples_split=4, classifier__estimator__n_estimators=200, score=0.6417506186792415,

[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 76.8min finished


In [14]:
# scoring RandomForestClassifier on improved Model
scores_rfc, averages_rfc = model_rfc.get_scores(y_test)
print(scores_rfc, averages_rfc)

                        Precision  Recall  F1-Score
related                      0.81    0.82      0.81
request                      0.89    0.89      0.88
offer                        0.99    1.00      0.99
aid_related                  0.78    0.78      0.78
medical_help                 0.91    0.92      0.90
medical_products             0.94    0.95      0.93
search_and_rescue            0.97    0.97      0.96
security                     0.96    0.98      0.97
military                     0.95    0.96      0.95
water                        0.95    0.96      0.95
food                         0.94    0.94      0.94
shelter                      0.94    0.94      0.93
clothing                     0.98    0.99      0.98
money                        0.97    0.98      0.97
missing_people               0.98    0.99      0.98
refugees                     0.96    0.96      0.94
death                        0.95    0.96      0.95
other_aid                    0.86    0.87      0.82
infrastructu



In [22]:
# show best parameters
model_rfc.best_params_

{'classifier__estimator__min_samples_split': 4,
 'classifier__estimator__n_estimators': 200}

### 9. Export your model as a pickle file

In [None]:
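# save best model
import pickle

# NOTE: model_filepath is not defined earlier in this notebook;
# set it to the desired output path before running this cell
with open(model_filepath, 'wb') as file:
    pickle.dump(model_rfc, file)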
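
A minimal round-trip sketch of saving and reloading with `pickle`, using only the standard library. The dictionary is a placeholder standing in for the fitted model, and the temporary path is hypothetical.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the fitted model (assumption).
placeholder_model = {'name': 'demo', 'best_params': {'n_estimators': 200}}

# Hypothetical output path inside a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# The context manager closes the file even if dump() raises.
with open(demo_path, 'wb') as file:
    pickle.dump(placeholder_model, file)

with open(demo_path, 'rb') as file:
    restored = pickle.load(file)

print(restored == placeholder_model)
```

Note that `pickle` stores functions and classes by reference, so loading the real model requires `tokenize` and the custom classes to be importable in the loading process.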