# ML Pipeline Preparation

#### The second project due as part of Udacity's Data Science Nanodegree involves building an app that can use a NLP model to classify messages relating to natural disasters.

#### There are three stages to this project:
#### 1) Extracting and cleaning data
#### 2) Using a machine learning pipeline to create a model for classifying data
#### 3) Deploying the model as a web app using Flask.

#### This notebook relates to step 2 and shows how the data is used to train a machine learning pipeline and create a model that can be used to classify new data


### 1. Import libraries and load data from database.
#### Import Python libraries

In [9]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report

from sklearn.model_selection import GridSearchCV
import pickle

####  Load data from database

In [10]:
engine = create_engine('sqlite:///NLPproject.db')
df = pd.read_sql('SELECT * FROM NLPtraining', engine)
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)

In [11]:
pd.set_option('display.max_columns', None)
Y.apply(pd.Series.value_counts)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,6116,21716.0,26062.0,15339.0,24099.0,24869.0,25456.0,25709.0,25321.0,26180.0,24511.0,23263.0,23872.0,25776.0,25577.0,25882.0,25306.0,24988.0,22739.0,24475.0,24981.0,24849.0,25648.0,26021.0,25897.0,26060.0,25871.0,25029.0,18894.0,24031.0,23740.0,25898.0,23728.0,25652.0,24804.0,21116.0
1,19876,4464.0,118.0,10841.0,2081.0,1311.0,724.0,471.0,859.0,,1669.0,2917.0,2308.0,404.0,603.0,298.0,874.0,1192.0,3441.0,1705.0,1199.0,1331.0,532.0,159.0,283.0,120.0,309.0,1151.0,7286.0,2149.0,2440.0,282.0,2452.0,528.0,1376.0,5064.0
2,188,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### There are two features that stand out in the data above. The first is that 'related' takes three values : 0, 1 and 2. This needs to be changed to just 0 and 1. The other feature is that 'child alone' only takes 0's. This makes it impossible to build a model that can identify message about abandoned children, as the estimator used to create the model never learns what such a message looks like. I would have deleted this column, however the project rubrik says the model should include all 36 categories. In a professional context I would seek out additional data that allows the model to identify messages about children alone.

In [12]:
Y['related'] = Y['related'].replace([2], 1 )

#### A caveat on the code above- while the replace method seems like a standard way of editting data, this threw an error later on when running the run.py file in the IDE.  As a result, this command appears in the run.py file as:

#### X['related'] = X['related'].map({0: 0, 1: 1, 2: 1})

In [13]:
Y.apply(pd.Series.value_counts)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,6116,21716,26062,15339,24099,24869,25456,25709,25321,26180.0,24511,23263,23872,25776,25577,25882,25306,24988,22739,24475,24981,24849,25648,26021,25897,26060,25871,25029,18894,24031,23740,25898,23728,25652,24804,21116
1,20064,4464,118,10841,2081,1311,724,471,859,,1669,2917,2308,404,603,298,874,1192,3441,1705,1199,1331,532,159,283,120,309,1151,7286,2149,2440,282,2452,528,1376,5064


In [14]:
Y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26180 entries, 0 to 26179
Data columns (total 36 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   related                 26180 non-null  int64
 1   request                 26180 non-null  int64
 2   offer                   26180 non-null  int64
 3   aid_related             26180 non-null  int64
 4   medical_help            26180 non-null  int64
 5   medical_products        26180 non-null  int64
 6   search_and_rescue       26180 non-null  int64
 7   security                26180 non-null  int64
 8   military                26180 non-null  int64
 9   child_alone             26180 non-null  int64
 10  water                   26180 non-null  int64
 11  food                    26180 non-null  int64
 12  shelter                 26180 non-null  int64
 13  clothing                26180 non-null  int64
 14  money                   26180 non-null  int64
 15  missing_people     

In [15]:
df

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26175,30261,The training demonstrated how to enhance micro...,,news,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26176,30262,A suitable candidate has been selected and OCH...,,news,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26177,30263,"Proshika, operating in Cox's Bazar municipalit...",,news,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26178,30264,"Some 2,000 women protesting against the conduc...",,news,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [16]:
"""
This function takes a text or series of texts and removes punctuation, sets to lower case, tokenizes text, lemmatizes it and removes stopwords.

Inputs: text (str or dataframe)
Outputs: text (str or dataframe) - text with the processing carried out
"""
    
def tokenize(text):
    
    #remove punctuation 
    text = re.sub(r'[^\w\s]', '', text) # w matches alphanumeric, s matches whitespace
    
    tokens = word_tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    
    no_stops = [w for w in tokens if w not in stopwords.words("english")]
        
    tidy_tokens = []
    
    for t in no_stops:
     lemmed = lemmatizer.lemmatize(t).lower().strip()
     tidy_tokens.append(lemmed)
          
    return tidy_tokens

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [17]:
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))])


### 4. Train pipeline
- Split data into train and test sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, Y) 

- Train model on the training data

In [19]:
pipeline2.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000025F0FD1E3A0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

- Generate predictions

In [20]:
y_pred = pipeline2.predict(X_test)

In [21]:
y_pred.shape

(6545, 36)

In [22]:
y_test.shape

(6545, 36)

In [23]:
names = y_test.columns
y_pred = pd.DataFrame(y_pred, columns = names)

In [24]:
y_pred.apply(pd.Series.value_counts)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,883,5848,6545.0,4064,6498,6497,6526,6545.0,6526,6545.0,6371,6049,6283,6526,6540,6545.0,6531,6492,6490,6542,6524,6489,6533,6545.0,6545.0,6545.0,6545.0,6541,5108,6301,6141,6545.0,5985,6536,6518,5959
1,5662,697,,2481,47,48,19,,19,,174,496,262,19,5,,14,53,55,3,21,56,12,,,,,4,1437,244,404,,560,9,27,586


In [25]:
y_test = y_test.reset_index(drop=True)
y_pred = y_pred.reset_index(drop=True)

- Let's get a general idea of how the model performed. The scores below show that the accuracy was generally good across the different categories

In [26]:
#labels = np.unique(y_pred)
accuracy = (y_pred == y_test).mean()

print("Accuracy:", accuracy)

Accuracy: related                   0.822307
request                   0.894270
offer                     0.995569
aid_related               0.779374
medical_help              0.925592
medical_products          0.955844
search_and_rescue         0.974484
security                  0.981971
military                  0.969290
child_alone               1.000000
water                     0.957830
food                      0.943621
shelter                   0.934301
clothing                  0.983957
money                     0.975859
missing_people            0.990222
refugees                  0.968526
death                     0.957066
other_aid                 0.872269
infrastructure_related    0.936440
transport                 0.957066
buildings                 0.953705
electricity               0.980749
tools                     0.995416
hospitals                 0.991902
shops                     0.993888
aid_centers               0.988235
other_infrastructure      0.956455
weather_re

In [27]:
print("y_pred", y_pred.shape)
print("y_test", y_test.shape)

y_pred (6545, 36)
y_test (6545, 36)


- Let's look at the percentage of observations that took each value in the training data. Looking at this allows us to check that the model performed compared to just predicting the same value for every row. 

In [28]:
Q = (y_test.apply(pd.Series.value_counts)) / len(y_test) *100

In [29]:
Q 

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,23.437739,82.444614,99.556914,59.159664,92.268908,95.187166,97.37204,98.197097,96.791444,100.0,93.613445,88.861727,91.413293,98.319328,97.540107,99.022154,96.760886,95.26356,87.028266,93.659282,95.477464,94.820474,98.013751,99.541635,99.190222,99.388846,98.823529,95.706646,72.89534,92.452254,90.970206,98.884645,90.741024,98.044309,94.667685,80.473644
1,76.562261,17.555386,0.443086,40.840336,7.731092,4.812834,2.62796,1.802903,3.208556,,6.386555,11.138273,8.586707,1.680672,2.459893,0.977846,3.239114,4.73644,12.971734,6.340718,4.522536,5.179526,1.986249,0.458365,0.809778,0.611154,1.176471,4.293354,27.10466,7.547746,9.029794,1.115355,9.258976,1.955691,5.332315,19.526356


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [30]:
report = classification_report(y_test, y_pred, output_dict=True, target_names = names )
report_df = pd.DataFrame(report).transpose()

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


- I formatted the report as a dataframe as it previously looked really messy. 

- Notice how the columns with almost no variation in the data (e.g. 'tools','hospitals' & 'shops') have 0 in the first three columns

In [31]:
report_df 

Unnamed: 0,precision,recall,f1-score,support
related,0.839809,0.948912,0.891033,5011.0
request,0.827834,0.502176,0.625135,1149.0
offer,0.0,0.0,0.0,29.0
aid_related,0.747682,0.693977,0.719829,2673.0
medical_help,0.702128,0.065217,0.119349,506.0
medical_products,0.770833,0.11746,0.203857,315.0
search_and_rescue,0.631579,0.069767,0.125654,172.0
security,0.0,0.0,0.0,118.0
military,0.736842,0.066667,0.122271,210.0
child_alone,0.0,0.0,0.0,0.0


### 6. Improve your model
- Use grid search to find better parameters. 
- An issue I encountered was that this was too computationally heavy for my machine. As a result I only passed one parameter for Grid Search to evaluate. I have included the code showing all the parameters list that I wanted to evaluate.

In [32]:
parameters = {
    #'clf__estimator__bootstrap': [True, False],
    #'clf__estimator__criterion': ['gini', 'entropy', 'log_loss'],
    'tfidf__smooth_idf': [True, False]
}

In [33]:
cv = GridSearchCV(pipeline2, param_grid=parameters) 

In [34]:
cv.fit(X_train, y_train) # took about an hour and a half to run

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x0000025F0FD1E3A0>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'tfidf__smooth_idf': [True, False]})

In [35]:
y_pred2 = cv.predict(X_test)

- Let's see the best parameters identified by Grid Search. I can then specify these when working in Flask, rather than running Grid Search again.

In [36]:
best_params = cv.best_params_

In [37]:
best_params

{'tfidf__smooth_idf': False}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [38]:
grid_report = classification_report(y_test, y_pred2, output_dict=True, target_names = names )
grid_report_df = pd.DataFrame(report).transpose()


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [39]:
grid_report_df

Unnamed: 0,precision,recall,f1-score,support
related,0.839809,0.948912,0.891033,5011.0
request,0.827834,0.502176,0.625135,1149.0
offer,0.0,0.0,0.0,29.0
aid_related,0.747682,0.693977,0.719829,2673.0
medical_help,0.702128,0.065217,0.119349,506.0
medical_products,0.770833,0.11746,0.203857,315.0
search_and_rescue,0.631579,0.069767,0.125654,172.0
security,0.0,0.0,0.0,118.0
military,0.736842,0.066667,0.122271,210.0
child_alone,0.0,0.0,0.0,0.0


### 8. Export your model as a pickle file

In [40]:
filename = 'NLPmodel.pkl'
pickle.dump(cv, open(filename, 'wb'))

### 9. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.fff

In [None]:
"""
This functon loads in the disaster messages and categories data from an sql database and creates X, Y and category_names variables.

Inputs - filepath(str) - this is filepath of the database where the data is stored
Outputs - X, Y, category names - these coreespond to the X and Y variables to be used in build our  odel as well as a list of column names.
"""
def load_data(filepath):
    engine = create_engine(filepath)
    df = pd.read_sql('SELECT * FROM NLPtraining', engine)
    X = df['message'] # Part of developing the model further could be adding the 'genre' column
    Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
    category_names = Y.columns
    return X, Y, category_names

In [None]:
"""
This function takes a text or series of texts and removes punctuation, sets to lower case, tokenizes text, lemmatizes it and removes stopwords.

Inputs: text (str or dataframe)
Outputs: text (str or dataframe) - text with the processing carried out
"""

def tokenize(text):
    #remove punctuation 
    text = re.sub(r'[^\w\s]', '', text) # w matches alphanumeric, s matches whitespace
    
    tokens = word_tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    
    no_stops = [w for w in tokens if w not in stopwords.words("english")]
        
    tidy_tokens = []
    
    for t in no_stops:
     lemmed = lemmatizer.lemmatize(t).lower().strip()
     tidy_tokens.append(lemmed)
          
    return tidy_tokens


In [None]:
"""
This function builds a machine learning pipeline, in which the text is vectorised then weighed for frequency using 
Tfidf Transformer, then data passed to one Random Forest Classifer per column in Y. In order to select model parameters that
optimise performance, Grid Search is used.

Inputs - none
Outputs - pipeline - a machine learning pipeline
"""
def build_model():   
    pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfid', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))])
    
    parameters = {
    #'clf__estimator__bootstrap': [True, False],
    #'clf__estimator__criterion': ['gini', 'entropy', 'log_loss'],
    'tfidf__smooth_idf': [True, False]
}
    cv = GridSearchCV(pipeline, param_grid=parameters) 
    
    return cv 

In [None]:
"""
Create predicted values and produce a classification report for the model built in the previous cell.

Inputs - model - our trainined model
X_test - the messages data from a our test dataset
Y_test - the categories data from our test dataset
category-names - a list of column names from the categories data

Outputs - report_df - a classification report evaluating the performance of the model on the test data

"""
def evaluate_model(model, X_test, Y_test, category_names):
    
    y_pred = model.predict(X_test)
    y_pred = pd.DataFrame(y_pred, columns = category_names)
    Y_test = Y_test.reset_index(drop=True)
    y_pred = y_pred.reset_index(drop=True)
    accuracy = (y_pred == Y_test).mean()
    print("Accuracy:", accuracy)
    report = classification_report(Y_test, y_pred, output_dict=True, target_names = category_names)
    report_df = pd.DataFrame(report).transpose()
    print(report_df)

In [None]:
"""
Using the function below save a trained model as a pickle file.

Inputs - filepath (str) - the destination and name of the pickle file
Outputs - model saved at pickle file at stated location
"""

def save_model(model, model_filepath):
 pickle.dump(model, open(model_filepath, 'wb'))


In [None]:
"""
This function executes all of the function above. Here I have included all of the arguments that I will use when 
running this in the Flask app
"""

def main():
    if len(sys.argv) == 3:
        database_filepath, model_filepath = sys.argv[1:]
        print('Loading data...\n    DATABASE: {}'.format(database_filepath))
        X, Y, category_names = load_data('data/NLPproject.db')
        Y['related'] = Y['related'].replace([2], 1 )
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
        
        print('Building model...')
        model = build_model()
        
        print('Training model...')
        model.fit(X_train, Y_train)
        
        print('Evaluating model...')
        evaluate_model(model, X_test, Y_test, category_names)

        print('Saving model...\n    MODEL: {}'.format(model_filepath))
        save_model(model, 'models/NLPmodel.pkl')

        print('Trained model saved!')

    else:
        print('Please provide the filepath of the disaster messages database '\
              'as the first argument and the filepath of the pickle file to '\
              'save the model to as the second argument. \n\nExample: python '\
              'train_classifier.py ../data/DisasterResponse.db classifier.pkl')


if __name__ == '__main__':
    main()