# Prediction Illustration

In this notebook, we illustrate how the pipeline file created in a previous notebook ('ML Model Building.ipynb') can be used to make predictions on some sample test data.

### Setting-up the Environment

First, we import the required libraries and load the dataset, as well as the pipeline for prediction.

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
# tokenize is imported so that joblib can resolve it if the saved pipeline references it
from Helpers import tokenize
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gianatmaja/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
pd.set_option("display.max_columns", 50)

In [3]:
# Import test dataset
Test = pd.read_csv('Cleaned/Test.csv', index_col = [0])

In [4]:
# Load pipeline
Pipeline = joblib.load('Prediction_Pipeline.joblib')

### Preparing Data for Prediction Testing

We take the first five rows of the testing dataset as a sample for illustration, then split them into the input columns (X) and the label columns (y).

In [5]:
# Take a sample of the testing dataset
Sample = Test.iloc[:5]

# Split X and y variables
X_sample = Sample.iloc[:,:7]
y_sample = Sample.iloc[:,7:]
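The positional split above relies on the first seven columns being inputs. As a minimal sketch, the same split can be written by column name, which keeps working if the column order changes. The toy frame and the two label names below are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for Test: two input columns followed by two label columns.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "message": ["need water", "road blocked", "all fine"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

label_cols = ["water", "transport"]  # hypothetical label names for this toy frame

# Name-based split: inputs are everything that is not a label column.
X_toy = toy.drop(columns=label_cols)
y_toy = toy[label_cols]

# Position-based split, as in the cell above, gives the same result here.
X_pos = toy.iloc[:, :2]
y_pos = toy.iloc[:, 2:]
```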

In [6]:
# Viewing the sample
X_sample

Unnamed: 0,index,ID,date,labeled,message,original,language
0,0,21047,2020-01-01,1,.. about the answer but we don't have enough s...,SEVRE KONSNE REPONSE LA MEN NOU PAGEN FSE ANK ...,ht
1,1,21048,2020-01-01,0,"Soon after, explosions and radioactive leaks r...",,en
2,2,21049,2020-01-01,0,"With telecommunication lines down, efforts to ...",,en
3,3,21050,2020-01-01,0,It appeared that whenever assessment missions ...,,en
4,4,21051,2020-01-01,1,I HEAR THAT THERE'S CYCLONE IS IT TRUE?,mwen tande yo di ke gen sikl ske se vre,sl


### Running the Prediction

Next, we run the prediction on the messages in the sample above that aren't labelled. If a message is labelled, we simply take its y values from the sample dataset. This happens when users have defined the message classes before submitting the ticket.

In [7]:
# Target class names, in the order assumed for the pipeline output
Label_cols = ['related', 'request', 'aid_related', 'medical_help', 'medical_products',
              'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter',
              'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
              'infrastructure_related', 'transport', 'buildings', 'electricity',
              'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
              'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
              'other_weather', 'direct_report']

# Collect one row of class labels per message
Rows = []
for i in range(len(X_sample)):
    if X_sample['labeled'].iloc[i] == 1:
        # Already labelled by the user: keep the supplied labels
        Rows.append(y_sample.iloc[i])
    else:
        # Not labelled: predict the classes from the message text
        pred = Pipeline.predict(X_sample['message'].iloc[i:(i+1)])
        Rows.append(pd.Series(pred[0], index = Label_cols))

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
Pred_df = pd.DataFrame(Rows, columns = Label_cols).reset_index(drop = True)
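The row-by-row logic above can also be expressed with a boolean mask, so the model is called once for all unlabelled messages. The sketch below is only an illustration: `StubPipeline`, the shortened `label_cols` list, and the toy data are hypothetical stand-ins for the loaded pipeline and the real sample.

```python
import numpy as np
import pandas as pd

label_cols = ["related", "request", "water"]  # shortened, hypothetical label list

class StubPipeline:
    """Hypothetical stand-in for the loaded pipeline: flags 'water' when the word appears."""
    def predict(self, messages):
        rows = []
        for text in messages:
            has_water = int("water" in text.lower())
            rows.append([has_water, 0, has_water])
        return np.array(rows).reshape(-1, len(label_cols))

X_toy = pd.DataFrame({
    "labeled": [1, 0, 0],
    "message": ["we need food", "We need water", "thank you"],
})
y_toy = pd.DataFrame([[1, 1, 0], [0, 0, 0], [0, 0, 0]], columns=label_cols)

# Start from the user-supplied labels, then overwrite only the unlabelled rows.
pred_toy = y_toy.copy()
unlabeled = X_toy["labeled"] == 0
if unlabeled.any():
    pred_toy.loc[unlabeled, label_cols] = StubPipeline().predict(
        X_toy.loc[unlabeled, "message"]
    )
```

Calling `predict` once on all unlabelled messages avoids repeated per-row overhead, which may matter for larger batches than the five-row sample used here.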

### Determine Emergency Category

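Here, we determine the emergency category from the number of classes flagged for each message (the row sum of the predictions): a sum of 0 gives 'N/A', 1 gives 'Low', 2 or 3 gives 'Med', and anything above 3 gives 'High'.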

In [8]:
Sums = Pred_df.sum(axis = 1)
Category = []
for j in range(len(Sums)):
    s = Sums[j]
    if (s == 0):
        Category.append('N/A')
    elif (s == 1):
        Category.append('Low')
    elif (s <= 3):
        Category.append('Med')
    else:
        Category.append('High')
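The thresholds in the cell above can be wrapped in a small function, which makes them easy to check in isolation. This is a sketch; the name `categorize` and the example totals are introduced here for illustration only.

```python
def categorize(total):
    """Map the number of flagged classes to the emergency category used above."""
    if total == 0:
        return 'N/A'
    if total == 1:
        return 'Low'
    if total <= 3:
        return 'Med'
    return 'High'

# Totals chosen to hit every branch, including both sides of the Med/High boundary.
example_totals = [0, 1, 3, 4]
example_categories = [categorize(t) for t in example_totals]
```

With such a function, the loop could plausibly be replaced by `Pred_df.sum(axis = 1).map(categorize)`, assuming the row sums are numeric.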

### Joining the Input Data and Results

Finally, we combine the predictions with the input data and drop the columns that are not needed in the output ('index' and 'original').

In [9]:
Pred_df['category'] = Category

In [10]:
Concat_df = pd.concat([X_sample, Pred_df], axis = 1)
Output_df = Concat_df.drop(['index', 'original'], axis = 1)
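Note that `pd.concat` with `axis = 1` aligns rows on the index, not on position. The join above works because the sample is the first five rows, so both frames are indexed 0 to 4. The sketch below, using hypothetical toy frames, shows what happens when the indexes differ and how resetting the index avoids it.

```python
import pandas as pd

# Inputs that kept their index from a larger frame, and predictions with a fresh index.
inputs = pd.DataFrame({"message": ["a", "b"]}, index=[10, 11])
preds = pd.DataFrame({"related": [1, 0]})

# Indexes differ, so concat takes their union and pads the gaps with NaN.
misaligned = pd.concat([inputs, preds], axis=1)

# Resetting the input index first restores a row-by-row join.
aligned = pd.concat([inputs.reset_index(drop=True), preds], axis=1)
```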

### Viewing the Results

The results can be viewed below.

In [11]:
Output_df

Unnamed: 0,ID,date,labeled,message,language,related,request,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,category
0,21047,2020-01-01,1,.. about the answer but we don't have enough s...,ht,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,21048,2020-01-01,0,"Soon after, explosions and radioactive leaks r...",en,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Low
2,21049,2020-01-01,0,"With telecommunication lines down, efforts to ...",en,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Low
3,21050,2020-01-01,0,It appeared that whenever assessment missions ...,en,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,High
4,21051,2020-01-01,1,I HEAR THAT THERE'S CYCLONE IS IT TRUE?,sl,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,Med
