# ML Model Building

This notebook covers feature extraction and model training for classifying disaster messages into categories. We use the cleaned dataset produced by two earlier notebooks, 'Data Augmentation.ipynb' and 'Data Cleaning.ipynb'.

At the end of this notebook, we will save the prediction pipeline so that it can be loaded and used directly in the future.

### Setting Up the Environment

Required libraries are first imported, and then the datasets are read into the environment.

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import re
import nltk
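nltk.download('stopwords')
# Note: word_tokenize and WordNetLemmatizer need further NLTK resources
# (e.g. 'punkt' and 'wordnet') if they are not already installed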
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import multioutput
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, fbeta_score, make_scorer
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gianatmaja/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Read data
Train = pd.read_csv('Cleaned/Train.csv', index_col = [0])
Test = pd.read_csv('Cleaned/Test.csv', index_col = [0])

### Looking at the Data

We print the first few rows of the training data to inspect the columns present.

In [3]:
Train.head(5)

| | index | ID | date | labeled | message | original | language | related | request | aid_related | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2010-01-01 | 0 | With the cooperation of First Hawaiian Bank, t... | | en | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 2 | 2010-01-01 | 1 | PEWODEN FIFTH SECTION OF THE DEPARTEMEN OF L'A... | Pewoden 5em Seksyon Depatman Atibonit ap fe no... | ht | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 2 | 3 | 2010-01-01 | 1 | Today on a call with Dr. Chan, Director Genera... | | en | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 4 | 2010-01-01 | 0 | YANGON, Jul 08, 2008 (Xinhua via COMTEX News N... | | en | 1 | 0 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 5 | 2010-01-01 | 1 | Throughout the year there were growing signs o... | | en | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Defining Helper Functions

Here, we define some helper functions needed for training. The first one averages the precision, recall, and F1-score from scikit-learn's classification report for each of the target variables.

In [4]:
# Averages precision, recall, F1-score and support for each target column of a multi-output prediction problem
def f1_pre_acc_evaluation(y_true, y_pred): 
    
    rows = []
    
    for col in y_true.columns:
        # Dictionary from classification report
        class_dict = classification_report(y_true = y_true.loc[:,col], y_pred = y_pred.loc[:,col], output_dict = True)
    
        # Converting to dataframe
        eval_df = pd.DataFrame.from_dict(class_dict)
        
        # Mean of each metric over the report entries, kept as a single row
        av_eval_df = eval_df.transpose().mean().to_frame().transpose()
    
        # Record result
        rows.append(av_eval_df)
    
    # DataFrame.append was removed in pandas 2.0, so the rows are concatenated instead
    report = pd.concat(rows, ignore_index = True)
    report.index = y_true.columns
    
    return report
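
To make the metrics in the report concrete, here is a minimal pure-Python sketch of precision, recall, and F1-score for a single binary category. The function name `binary_prf`, its `zero_division` fallback argument, and the example labels are hypothetical and only for illustration; scikit-learn computes these values for us in the helper above, and it issues a warning when a metric is undefined, for instance when a category is never predicted.

```python
# Pure-Python sketch of precision, recall and F1 for one binary target column.
def binary_prf(y_true, y_pred, zero_division=0.0):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Precision is undefined when nothing is predicted positive,
    # recall when the column has no positive examples at all.
    precision = tp / (tp + fp) if (tp + fp) else zero_division
    recall = tp / (tp + fn) if (tp + fn) else zero_division
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else zero_division)
    return precision, recall, f1


# Hypothetical labels for a single category column
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```

With many rare categories, some columns may have few or no positive predictions, which is when the fallback value comes into play.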


Next, we define a function that preprocesses the text data by performing text normalization, tokenization, stop word removal, stemming, and lemmatization.

In [5]:
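def tokenize(text):
    
    # Normalize text: lower-case and replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Get stopwords (a set gives fast membership tests)
    stop_words = set(stopwords.words("english"))
    
    # Tokenize
    words = word_tokenize(text)
    
    # Stemming (one stemmer instance is reused for all words)
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in words]
    
    # Lemmatizing; stop words are filtered on the stemmed tokens
    lemmatizer = WordNetLemmatizer()
    words_lemmed = [lemmatizer.lemmatize(w) for w in stemmed if w not in stop_words]
   
    return words_lemmed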
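
To see what the normalization and stop word steps do without any NLTK downloads, the following is a simplified sketch. It is not the `tokenize` function itself: it uses a plain `str.split` instead of `word_tokenize`, skips stemming and lemmatization, and relies on a tiny hypothetical stop word set (`TOY_STOP_WORDS`) rather than NLTK's English list.

```python
import re

# Hypothetical, tiny stop word set for illustration only;
# the notebook uses NLTK's English stop word list instead.
TOY_STOP_WORDS = {"we", "are", "in", "the", "and", "a"}


def normalize_and_split(text):
    """Lower-case, replace non-alphanumeric characters with spaces,
    split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [w for w in cleaned.split() if w not in TOY_STOP_WORDS]


tokens = normalize_and_split("We are here, in zone Menelas... We have children!")
print(tokens)
```

Punctuation and casing disappear before the words reach the vectorizer, so "Menelas..." and "menelas" end up as the same token.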

### Preparing the Dataset

Below, we prepare the data for training by separating the message text (X_train, X_test) from the category labels (y_train, y_test), which are the columns from position 7 onwards.

In [6]:
# Create training and testing set
X_train = Train['message']
y_train = Train.iloc[:,7:]

X_test = Test['message']
y_test = Test.iloc[:,7:]

### Building the Pipeline

Below, we build the prediction pipeline. A count vectorizer followed by a tf-idf transformer extracts features from the message text, and a multi-output random forest classifier is trained on those features. We fit the pipeline on the training set and then check how it performs on the test set.

In [7]:
# Building pipeline
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier()))
        ])
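
As a rough illustration of what the two feature extraction steps compute, here is a pure-Python sketch of tf-idf weighting on pre-tokenized messages. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that scikit-learn documents as the defaults of `TfidfTransformer`. The function `tfidf_rows` and the toy documents are hypothetical; the pipeline itself works on sparse matrices rather than Python lists.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) where each row holds the L2-normalised
    tf-idf weights of one document, using a smoothed idf."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each token occurs
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # the "count vectorizer" step
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical tokenized messages
docs = [["need", "water"], ["need", "food"], ["water", "water", "please"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Tokens that occur in fewer messages receive a larger idf, so in the second toy message "food" is weighted above "need".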

In [8]:
# Fit pipeline on the training set
pipeline.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x7fb004706680>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

We will predict on the test dataset below.

In [9]:
# Predict test set
y_pred = pipeline.predict(X_test)
y_pred = pd.DataFrame(y_pred, columns = y_test.columns)

In [10]:
# Evaluate results
report = f1_pre_acc_evaluation(y_test, y_pred)

  _warn_prf(average, modifier, msg_start, len(result))


We can observe the precision, recall and F1-Score below.

In [11]:
# View results
report

| category | precision | recall | f1-score | support |
|---|---|---|---|---|
| related | 0.786179 | 0.774017 | 0.779427 | 1572.172366 |
| request | 0.899284 | 0.795555 | 0.822057 | 1572.178321 |
| aid_related | 0.810601 | 0.80327 | 0.805015 | 1572.161832 |
| medical_help | 0.839252 | 0.70071 | 0.708548 | 1572.184122 |
| medical_products | 0.943159 | 0.71 | 0.722181 | 1572.192443 |
| search_and_rescue | 0.918008 | 0.729401 | 0.752104 | 1572.198092 |
| security | 0.795347 | 0.717327 | 0.729566 | 1572.198779 |
| military | 0.666111 | 0.684342 | 0.675055 | 1572.19229 |
| water | 0.95001 | 0.790167 | 0.829515 | 1572.190229 |
| food | 0.943393 | 0.852828 | 0.884631 | 1572.188855 |


### Saving the Prediction Pipeline

Now, we will save our prediction pipeline using the joblib library so that we can simply load it for prediction purposes in the future, without the need to train it from scratch.

In [12]:
joblib.dump(pipeline, 'Prediction_Pipeline.joblib')

['Prediction_Pipeline.joblib']

### Loading and Testing the Pipeline

Let's see how we can load and use the pipeline for prediction in the future.

In [13]:
Pipe_loaded = joblib.load('Prediction_Pipeline.joblib')

We will use this pipeline to classify the sample message below.

In [14]:
n = 5
trial = X_test.iloc[n:(n+1)]
print(trial.values[0])

We are here in a repatriated village in zone menelas, We are in the sun and being burned by the sun, we ask that you help us please. We have children here.


For this message, the model outputs are as follows.

In [15]:
pred = Pipe_loaded.predict(trial)
pd.Series(pred[0], index = ['related', 'request', 'aid_related', 'medical_help', 'medical_products',
       'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'])

related                   1
request                   1
aid_related               1
medical_help              0
medical_products          0
search_and_rescue         0
security                  0
military                  0
water                     0
food                      0
shelter                   0
clothing                  0
money                     0
missing_people            0
refugees                  0
death                     0
other_aid                 0
infrastructure_related    0
transport                 0
buildings                 0
electricity               0
tools                     0
hospitals                 0
shops                     0
aid_centers               0
other_infrastructure      0
weather_related           0
floods                    0
storm                     0
fire                      0
earthquake                0
cold                      0
other_weather             0
direct_report             1
dtype: int64

Here, we can see that the model predicts 1 for 'related', which fits the message being a relevant emergency message. The message is also classified as a request, as aid related, and as a direct report.
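
When the pipeline is used in an application, it can be handy to list only the categories that were flagged instead of reading the full 0/1 vector. The sketch below shows one way to do that; the helper `positive_labels` and the shortened category list are hypothetical and not part of the saved pipeline.

```python
def positive_labels(pred_row, category_names):
    """Return the names of the categories whose predicted value is 1."""
    if len(pred_row) != len(category_names):
        raise ValueError("prediction row and category names differ in length")
    return [name for name, value in zip(category_names, pred_row) if value == 1]


# Hypothetical, shortened example using five of the categories
names = ["related", "request", "aid_related", "water", "direct_report"]
row = [1, 1, 1, 0, 1]
flagged = positive_labels(row, names)
print(flagged)
```

In the notebook, the same helper could be called with `pred[0]` and the full list of category names used above.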