# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import time
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import AdaBoostClassifier
import pickle
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\darka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\darka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\darka\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse',engine)
#split data into X,y
X = df['message']
y = df[df.columns.difference(['id','message','original','genre'])]

In [3]:
category_names = list(y.columns)
category_names

['aid_centers',
 'aid_related',
 'buildings',
 'child_alone',
 'clothing',
 'cold',
 'death',
 'direct_report',
 'earthquake',
 'electricity',
 'fire',
 'floods',
 'food',
 'hospitals',
 'infrastructure_related',
 'medical_help',
 'medical_products',
 'military',
 'missing_people',
 'money',
 'offer',
 'other_aid',
 'other_infrastructure',
 'other_weather',
 'refugees',
 'related',
 'request',
 'search_and_rescue',
 'security',
 'shelter',
 'shops',
 'storm',
 'tools',
 'transport',
 'water',
 'weather_related']

In [4]:
#Check X
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [5]:
#Check y
y.head()

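| | aid_centers | aid_related | buildings | child_alone | clothing | cold | death | direct_report | earthquake | electricity | ... | request | search_and_rescue | security | shelter | shops | storm | tools | transport | water | weather_related |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |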


### 2. Write a tokenization function to process your text data

In [6]:
#Set up word tokenizer function

def tokenize(text):
    #Normalize text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())    
    #tokenize text
    tokens = word_tokenize(text)
    # remove stop words (build the set once instead of reloading the list for every token)
    stop_words = set(stopwords.words("english"))
    tokens = [w for w in tokens if w not in stop_words]
    #initiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    #iterate through each token
    for tok in tokens:
        # lemmatize, and remove white space
        clean_tok = lemmatizer.lemmatize(tok).strip()
        clean_tokens.append(clean_tok)
    return clean_tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\darka\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
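
To see what the normalization step in `tokenize` does on its own, the minimal sketch below applies the same regular expression to two of the sample messages and filters the result against a small stop word set. `TOY_STOPWORDS` and `normalize_and_split` are hypothetical names used only for illustration. The notebook itself relies on NLTK's `word_tokenize` and the full English stop word list, which plain whitespace splitting only approximates.

```python
import re

# Hypothetical, deliberately tiny stop word set (the notebook uses NLTK's list).
TOY_STOPWORDS = frozenset({"is", "the", "or", "it", "a", "of", "but", "no", "for"})


def normalize_and_split(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # str.split() with no argument collapses runs of whitespace.
    return [tok for tok in cleaned.split() if tok not in TOY_STOPWORDS]


print(normalize_and_split("Is the Hurricane over or is it not over"))
print(normalize_and_split("UN reports Leogane 80-90 destroyed."))
```

Holding the stop words in a set built once keeps each membership test cheap, which matters when the tokenizer is called for every message in the dataset.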


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
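
The idea behind a multi-output wrapper is to fit one independent classifier per target column and then stack the per-column predictions. The pure-Python sketch below illustrates that structure only. `MajorityClassifier` and `OnePerColumn` are hypothetical toy classes, not scikit-learn's implementation, and the two label columns are made-up data.

```python
from collections import Counter


class MajorityClassifier:
    """Toy binary classifier that always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OnePerColumn:
    """Fit an independent estimator for every target column."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_targets = len(Y[0])
        self.estimators_ = [
            self.make_estimator().fit(X, [row[j] for row in Y])
            for j in range(n_targets)
        ]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        # Transpose so each row holds one prediction per target column.
        return [list(row) for row in zip(*per_column)]


# Made-up messages with two made-up label columns.
messages = ["need water", "storm coming", "send food", "road blocked"]
labels = [[1, 0], [0, 1], [1, 0], [1, 0]]

model = OnePerColumn(MajorityClassifier).fit(messages, labels)
print(len(model.estimators_))
print(model.predict(["new message", "another one"]))
```

Because each column gets its own estimator, training cost grows with the number of categories, which is one reason fitting 36 targets takes noticeable time.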

In [7]:
# Create pipeline
pipeline = Pipeline([
    ('features', FeatureUnion([

        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
    ])),

    ('clf',  MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
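
The split in this notebook is made without a fixed seed, so the train and test sets, and therefore the scores, can differ from run to run. `train_test_split` accepts a `random_state` argument for this purpose. The sketch below shows the underlying idea in plain Python. `seeded_split` is a hypothetical helper, not scikit-learn's algorithm, and the 0.25 test fraction mirrors the library default when no size is given.

```python
import random


def seeded_split(items, test_fraction=0.25, seed=42):
    """Shuffle indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


train, test = seeded_split(list(range(8)))
print(len(train), len(test))
# Calling it again with the same seed reproduces the same split.
print(seeded_split(list(range(8))) == (train, test))
```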

In [8]:
# train/test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

#train model
start = time.time()
pipeline.fit(X_train,y_train)

end = time.time()
print(end - start)

201.77833366394043


In [9]:
#Predict
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
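
Accuracy alone can be misleading for rare categories, which is why precision, recall, and F1 are reported per column. In the results below, for example, `aid_centers` reaches about 0.987 accuracy with precision and recall of 0. The sketch below reproduces that effect on made-up labels with plain Python. `binary_scores` is a hypothetical helper that returns 0 when a denominator is 0, the same convention the notebook's own helpers use.

```python
def binary_scores(actual, predicted):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up rare category: 1 positive in 20 messages, and a model that always predicts 0.
actual = [1] + [0] * 19
always_zero = [0] * 20
print(binary_scores(actual, always_zero))
```

The always-zero model scores 95% accuracy here while never finding the one positive message, so its recall and F1 are both 0.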

In [10]:
# accuracy is the number of correct predictions divided by the total number of predictions
def calc_accuracy(actual, preds):
    '''
    INPUT
    preds - predictions as a numpy array or pandas series
    actual - actual values as a numpy array or pandas series
    
    OUTPUT:
    returns the accuracy as a float
    '''
    return np.sum(preds == actual)/len(actual)

# precision is the true positives over the predicted positive values
def calc_precision(actual, preds):
    '''
    INPUT
    (assumes positive = 1 and negative = 0)
    preds - predictions as a numpy array or pandas series 
    actual - actual values as a numpy array or pandas series
    
    OUTPUT:
    returns the precision as a float
    '''
    total = len( np.intersect1d(np.where(preds == 1),np.where(actual==1)))
    pred_pos = (preds==1).sum()
    
    #check for division by zero
    if pred_pos == 0:
        result = 0
    else:
        result = total/(pred_pos)
    
    return result

# recall is true positives over all actual positive values
def calc_recall(actual, preds):
    '''
    INPUT
    preds - predictions as a numpy array or pandas series
    actual - actual values as a numpy array or pandas series
    
    OUTPUT:
    returns the recall as a float
    '''

    total = len( np.intersect1d(np.where(preds == 1),np.where(actual==1)))
    act_pos = (actual==1).sum()
    
    #check for division by zero
    if act_pos == 0:
        result = 0
    else:
        result = total/(act_pos)    
    
    return result

# f1_score is 2*(precision*recall)/(precision+recall)
# note: unlike the helpers above, this one takes its arguments as (preds, actual)
def calc_f1(preds, actual):
    '''
    INPUT
    preds - predictions as a numpy array or pandas series
    actual - actual values as a numpy array or pandas series
    
    OUTPUT:
    returns the f1score as a float
    '''
    rec = calc_recall(actual,preds)
    prec = calc_precision(actual,preds)
    
    #check for division by zero
    if prec+rec == 0:
        result = 0
    else:
        result = (2 *  (prec * rec)/(prec+rec))       
    
    return result

In [11]:
# Prints the scores for each category and returns the average accuracy, precision, recall, and F1 score
def model_results(y_test,y_pred):
    #Create lists to append results
    accuracy = []
    precision = []
    recall = []
    f1_score = []

    # Show f1 score, precision, and recall for each category
    for col in list(range(y_pred.shape[1])):
        actual = y_test.iloc[:,col]
        pred = y_pred[:,col]
        accuracy.append(calc_accuracy(actual,pred))
        precision.append(calc_precision(actual,pred))
        recall.append(calc_recall(actual,pred))
        f1_score.append(calc_f1(pred,actual))
        print('Results for ',category_names[col])
        print('Accuracy',accuracy[col])
        print('Precision',precision[col])
        print('Recall',recall[col])
        print('F1_score',f1_score[col])
        

    model1_results = pd.DataFrame({'Accuracy': accuracy,
                                   'Precision': precision,
                                   'Recall': recall,
                                   'F1_Score': f1_score})
    results = model1_results.sum()/model1_results.shape[0]
    
    print('Overall Scores: ',results)
    
    return results

model_results(y_test,y_pred)

Results for  aid_centers
Accuracy 0.9868601986249045
Precision 0
Recall 0.0
F1_score 0
Results for  aid_related
Accuracy 0.747135217723453
Precision 0.7441093308199811
Recall 0.5867707172054998
F1_score 0.6561396218574692
Results for  buildings
Accuracy 0.9533995416348358
Precision 0.8378378378378378
Recall 0.09393939393939393
F1_score 0.16893732970027248
Results for  child_alone
Accuracy 1.0
Precision 0
Recall 0
F1_score 0
Results for  clothing
Accuracy 0.9873185637891521
Precision 0.7727272727272727
Recall 0.17894736842105263
F1_score 0.2905982905982906
Results for  cold
Accuracy 0.9813598166539343
Precision 0.7142857142857143
Recall 0.11450381679389313
F1_score 0.19736842105263158
Results for  death
Accuracy 0.9624140565317036
Precision 0.75
Recall 0.21875
F1_score 0.3387096774193548
Results for  direct_report
Accuracy 0.8469060351413292
Precision 0.7291666666666666
Recall 0.33175355450236965
F1_score 0.4560260586319218
Results for  earthquake
Accuracy 0.959511077158136
Precision 0.

Accuracy     0.944287
Precision    0.595086
Recall       0.185771
F1_Score     0.243735
dtype: float64

### 6. Improve your model
Use grid search to find better parameters. 
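
An exhaustive grid search fits the pipeline once per parameter combination and per cross-validation fold, so the cost grows multiplicatively with every parameter added. The sketch below counts the fits for the grid used later in this section, with and without its commented-out entries. `count_fits` is a hypothetical helper, and the fold count of 3 is an assumption, since it depends on the scikit-learn version and the `cv` argument.

```python
from itertools import product

# Values copied from the grid in this notebook, including the commented-out entries.
full_grid = {
    "features__text_pipeline__vect__ngram_range": [(1, 1), (1, 2)],
    "features__text_pipeline__vect__max_df": [0.5, 0.75, 1.0],
    "features__text_pipeline__vect__max_features": [None, 5000, 10000],
    "features__text_pipeline__tfidf__use_idf": [True, False],
    "clf__estimator__n_estimators": [10, 50, 100, 200],
}


def count_fits(grid, n_folds):
    """Number of model fits for an exhaustive search, excluding the final refit."""
    n_candidates = sum(1 for _ in product(*grid.values()))
    return n_candidates * n_folds


N_FOLDS = 3  # assumption: depends on the scikit-learn version and the `cv` argument

one_param = {k: v for k, v in full_grid.items() if k.endswith("ngram_range")}
print(count_fits(one_param, N_FOLDS))  # 2 candidates x 3 folds
print(count_fits(full_grid, N_FOLDS))  # 144 candidates x 3 folds
```

Under these assumptions the single-parameter search needs 6 fits and the full grid 432. With one pipeline fit taking roughly 200 seconds in this notebook, that gap explains why the search is limited to one parameter.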

In [12]:
# look for parameters to vary
pipeline.get_params()

{'memory': None, 'steps': [('features', FeatureUnion(n_jobs=1,
          transformer_list=[('text_pipeline', Pipeline(memory=None,
        steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip...y=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))]))],
          transformer_weights=None)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
              

In [13]:
# Set up grid search adjusting parameters
parameters = {
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2))
    #'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
    #'features__text_pipeline__vect__max_features': (None, 5000, 10000),
    #'features__text_pipeline__tfidf__use_idf': (True, False),
    #'clf__estimator__n_estimators': [10, 50, 100, 200]
}
# I limited it to 1 parameter because of the time it takes to run
cv = GridSearchCV(pipeline, parameters)
cv

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2))},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
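
The keys in `parameters` follow scikit-learn's double-underscore convention: each `__` steps one level deeper into the nested pipeline, and the last segment names the parameter to set. The sketch below mimics that lookup on plain nested dictionaries. `set_nested_param` and `pipeline_config` are hypothetical stand-ins that mirror the step names in this notebook, not the library's own code.

```python
def set_nested_param(config, path, value):
    """Walk a `step__substep__param` path through nested dicts and set the leaf."""
    *parents, leaf = path.split("__")
    node = config
    for name in parents:
        node = node[name]
    if leaf not in node:
        raise KeyError(f"unknown parameter: {path}")
    node[leaf] = value
    return config


# Nested dicts mirroring the step names used in this notebook's pipeline.
pipeline_config = {
    "features": {
        "text_pipeline": {
            "vect": {"ngram_range": (1, 1)},
            "tfidf": {"use_idf": True},
        }
    },
    "clf": {"estimator": {"n_estimators": 10}},
}

set_nested_param(pipeline_config, "features__text_pipeline__vect__ngram_range", (1, 2))
print(pipeline_config["features"]["text_pipeline"]["vect"]["ngram_range"])
```

A misspelled step or parameter name fails loudly in this sketch, and checking the keys returned by `pipeline.get_params()` is a practical way to catch the same mistake in the real pipeline.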

In [14]:
#Train
start = time.time()
cv.fit(X_train,y_train)

end = time.time()
print(end - start)

2021.2436521053314


In [15]:
#Predict
y_pred_cv = cv.predict(X_test)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [16]:
# Results from initial model
print(model_results(y_test,y_pred))

Results for  aid_centers
Accuracy 0.9868601986249045
Precision 0
Recall 0.0
F1_score 0
Results for  aid_related
Accuracy 0.747135217723453
Precision 0.7441093308199811
Recall 0.5867707172054998
F1_score 0.6561396218574692
Results for  buildings
Accuracy 0.9533995416348358
Precision 0.8378378378378378
Recall 0.09393939393939393
F1_score 0.16893732970027248
Results for  child_alone
Accuracy 1.0
Precision 0
Recall 0
F1_score 0
Results for  clothing
Accuracy 0.9873185637891521
Precision 0.7727272727272727
Recall 0.17894736842105263
F1_score 0.2905982905982906
Results for  cold
Accuracy 0.9813598166539343
Precision 0.7142857142857143
Recall 0.11450381679389313
F1_score 0.19736842105263158
Results for  death
Accuracy 0.9624140565317036
Precision 0.75
Recall 0.21875
F1_score 0.3387096774193548
Results for  direct_report
Accuracy 0.8469060351413292
Precision 0.7291666666666666
Recall 0.33175355450236965
F1_score 0.4560260586319218
Results for  earthquake
Accuracy 0.959511077158136
Precision 0.

In [17]:
# Results from cv model
print(model_results(y_test,y_pred_cv))

Results for  aid_centers
Accuracy 0.986707410236822
Precision 0.0
Recall 0.0
F1_score 0
Results for  aid_related
Accuracy 0.7417876241405653
Precision 0.7557485947879408
Recall 0.5496098104793757
F1_score 0.6364027538726335
Results for  buildings
Accuracy 0.9538579067990832
Precision 0.7
Recall 0.1484848484848485
F1_score 0.24500000000000002
Results for  child_alone
Accuracy 1.0
Precision 0
Recall 0
F1_score 0
Results for  clothing
Accuracy 0.9873185637891521
Precision 0.8
Recall 0.16842105263157894
F1_score 0.2782608695652174
Results for  cold
Accuracy 0.9801375095492743
Precision 0.6
Recall 0.022900763358778626
F1_score 0.044117647058823525
Results for  death
Accuracy 0.9640947288006112
Precision 0.7431192660550459
Recall 0.28125
F1_score 0.4080604534005038
Results for  direct_report
Accuracy 0.8482811306340718
Precision 0.7407407407407407
Recall 0.33175355450236965
F1_score 0.458265139116203
Results for  earthquake
Accuracy 0.9555385790679908
Precision 0.8724279835390947
Recall 0.64

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [18]:
# Create pipeline
pipeline_ada = Pipeline([
    ('features', FeatureUnion([

        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
    ])),

    ('clf',  MultiOutputClassifier(AdaBoostClassifier()))
])

#train model
start = time.time()
pipeline_ada.fit(X_train,y_train)

end = time.time()
print(end - start)

192.31594729423523


In [19]:
# Predict
y_pred_ada = pipeline_ada.predict(X_test)
print(model_results(y_test,y_pred_ada))

Results for  aid_centers
Accuracy 0.9857906799083269
Precision 0.29411764705882354
Recall 0.05813953488372093
F1_score 0.0970873786407767
Results for  aid_related
Accuracy 0.7627196333078686
Precision 0.7600548446069469
Recall 0.6179858788554441
F1_score 0.681697069071531
Results for  buildings
Accuracy 0.9581359816653934
Precision 0.6346153846153846
Recall 0.4
F1_score 0.49070631970260226
Results for  child_alone
Accuracy 1.0
Precision 0
Recall 0
F1_score 0
Results for  clothing
Accuracy 0.9905271199388846
Precision 0.7704918032786885
Recall 0.49473684210526314
F1_score 0.6025641025641025
Results for  cold
Accuracy 0.9844155844155844
Precision 0.7101449275362319
Recall 0.37404580152671757
F1_score 0.49
Results for  death
Accuracy 0.9673032849503438
Precision 0.7010869565217391
Recall 0.4479166666666667
F1_score 0.5466101694915255
Results for  direct_report
Accuracy 0.8493506493506493
Precision 0.6881720430107527
Recall 0.40442338072669826
F1_score 0.5094527363184079
Results for  earth

Because the initial AdaBoostClassifier achieved a better F1 score than the initial random forest model, we run GridSearchCV on it next.

In [20]:
# Set up GridSearchCV to improve model
parameters = {
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2))
    #'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
    #'features__text_pipeline__vect__max_features': (None, 5000, 10000),
    #'features__text_pipeline__tfidf__use_idf': (True, False),
    #'clf__estimator__n_estimators': [10, 50, 100, 200]
}
# I limited it to 1 parameter because of the time it takes to run
cv_ada = GridSearchCV(pipeline_ada, parameters)
cv_ada

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2))},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [21]:
# Improve adaboost model
start = time.time()
cv_ada.fit(X_train,y_train)

end = time.time()
print(end - start)

2072.5882489681244


In [22]:
# Predict and show results
y_pred_cv_ada = cv_ada.predict(X_test)
print(model_results(y_test,y_pred_cv_ada))

Results for  aid_centers
Accuracy 0.9857906799083269
Precision 0.29411764705882354
Recall 0.05813953488372093
F1_score 0.0970873786407767
Results for  aid_related
Accuracy 0.7627196333078686
Precision 0.7600548446069469
Recall 0.6179858788554441
F1_score 0.681697069071531
Results for  buildings
Accuracy 0.9581359816653934
Precision 0.6346153846153846
Recall 0.4
F1_score 0.49070631970260226
Results for  child_alone
Accuracy 1.0
Precision 0
Recall 0
F1_score 0
Results for  clothing
Accuracy 0.9905271199388846
Precision 0.7704918032786885
Recall 0.49473684210526314
F1_score 0.6025641025641025
Results for  cold
Accuracy 0.9844155844155844
Precision 0.7101449275362319
Recall 0.37404580152671757
F1_score 0.49
Results for  death
Accuracy 0.9673032849503438
Precision 0.7010869565217391
Recall 0.4479166666666667
F1_score 0.5466101694915255
Results for  direct_report
Accuracy 0.8493506493506493
Precision 0.6881720430107527
Recall 0.40442338072669826
F1_score 0.5094527363184079
Results for  earth

### 9. Export your model as a pickle file

In [23]:
#Save as pickle file
filename = 'finalized_model.sav'
# use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(cv_ada, f)
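
To check that a saved model can be restored, the file can be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted search object. `model_stand_in` and the temporary directory are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `cv_ada` search.
model_stand_in = {"name": "toy_model", "ngram_range": (1, 2)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")

    # The context manager closes the file even if pickling raises.
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

Pickle stores functions by reference, so loading the real model requires the `tokenize` function to be importable in the loading script. Pickle files should also only be loaded from trusted sources.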

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.