# Inference notebook

This notebook demonstrates the model's performance on a given dataset or on individual sentences.
You can use a model fine-tuned with this repository or a pretrained model from HuggingFace.
The input can come from the test set or from sentences you provide yourself.

In [29]:
# import libraries
import sys
import os
import numpy as np
import pandas as pd
from pathlib import Path
pd.options.display.max_colwidth = 100  # show up to 100 characters per column before truncating

In [30]:
# Add the directory two levels above the working directory to sys.path so that `src` can be imported
cwd = Path(os.getcwd())
sys.path.append(str(cwd.parents[1]))
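
The fixed `parents[1]` index only works when the notebook is started from a directory exactly two levels below the one that contains `src`. A minimal sketch of a more tolerant alternative walks upward until it finds such a directory. The helper names `find_repo_root` and `add_to_sys_path` and the `src` marker are assumptions for illustration and are not part of the repository.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker name is an assumption based on the
    `from src...` imports used in this notebook.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")


def add_to_sys_path(path: Path) -> None:
    """Append `path` to sys.path once, avoiding duplicate entries."""
    path_str = str(path)
    if path_str not in sys.path:
        sys.path.append(path_str)
```

In the notebook this could be called as `add_to_sys_path(find_repo_root(Path.cwd()))`.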

In [31]:
# Import modules
from src.model.nlp_models_selector import get_model_and_tokenizer
from src.utils.text_cleaning import text_cleaning
from src.utils.get_predictions import get_prdiction

## Load model

You can load the model from a local directory or from the HuggingFace Hub.

In [32]:
MODEL_PATH = "KubiakJakub01/finetuned-distilbert-base-uncased"
model, tokenizer = get_model_and_tokenizer(MODEL_PATH)

Some layers from the model checkpoint at KubiakJakub01/finetuned-distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at KubiakJakub01/finetuned-distilbert-base-uncased and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Load data

You can use sample.csv or provide your own data.

In [65]:
PATH_TO_DATA = "sample.csv"
df = pd.read_csv(PATH_TO_DATA)
df.head(10)

Unnamed: 0,id,keyword,location,text,target
0,2668,crush,"San Diego, Texas.",Love love love do you remember your first crush ? ??,0
1,7721,panicking,UK,@ushiocomics I may be panicking a little I wasn't as fast submitting the form as I usually am,0
2,7547,outbreak,,Families to sue over Legionnaires: More than 40 families affected by the fatal outbreak of Legio...,1
3,10687,wreck,,I am a wreck,0
4,4547,emergency,,#EMERGENCY in Odai Bucharest Romania 600 Dogs Dying!They are so Hungry that they EAT EACH OTHER!...,1
5,10805,wrecked,probably not home,coleslaw #wrecked http://t.co/sijNBmCZIJ,0
6,2332,collapse,Fakefams,Correction: Tent Collapse Story http://t.co/S7VYGeNJuv,1
7,3694,destroy,SEA Server,dazzle destroy the fun ??,0
8,394,annihilation,"Chandler, AZ",U.S National Park Services Tonto National Forest: Stop the Annihilation of the Salt River Wild H...,1
9,5602,flood,,survived the plague\nfloated the flood\njust peeked our heads above the mud\nno one's immune\nde...,0


In [66]:
# Clean text
df["text"] = df["text"].apply(text_cleaning)
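
Judging by the cleaned rows shown later in this notebook, the cleaning step appears to lowercase the text and drop URLs, mentions, hashtags, punctuation, and common stop words. The sketch below approximates that behaviour. It is not the repository's `text_cleaning` implementation: the function name `clean_text_sketch`, the regular expressions, and the tiny stop-word set are all assumptions. Unlike the notebook output, it also collapses newlines into spaces.

```python
import re

# Tiny illustrative stop-word set (assumption); a real pipeline would use a full list.
STOPWORDS = {"a", "am", "as", "do", "i", "of", "the", "to", "you", "your"}


def clean_text_sketch(text: str) -> str:
    """Approximate tweet cleaning: lowercase, strip noise, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)
```

For example, `clean_text_sketch("coleslaw #wrecked http://t.co/sijNBmCZIJ")` returns `"coleslaw"`, which matches the corresponding cleaned row in this notebook.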

In [67]:
# Get predictions
id_list = df["id"].tolist()
text_list = df["text"].tolist()
batch_size = 8
predictions = get_prdiction(model, tokenizer, id_list, text_list, batch_size)

Sample text: ['love love love remember first crush ', ' may panicking little fast submitting form usually', 'families sue legionnaires families affected fatal outbreak legionnaires disease edinburgh ', 'wreck', ' odai bucharest romania dogs dying they hungry eat other ']


Predictions: 100%|██████████| 20/20 [00:00<00:00, 191958.99it/s]
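
The progress bar above reports 20 items, so with `batch_size = 8` a batched loop would process groups of 8, 8, and 4 texts. The sketch below shows one common way to split parallel id and text lists into batches. Whether `get_prdiction` batches its input exactly like this is an assumption, and `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    ids: Sequence[int], texts: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[int], List[str]]]:
    """Yield (ids, texts) pairs of at most `batch_size` items, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        # Slicing past the end is safe, so the last batch may be shorter.
        yield list(ids[start:end]), list(texts[start:end])
```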


### Show predictions

* id - identifier of the sentence
* score - the model's probability that the sentence belongs to the predicted target class
* target - the predicted class of the sentence, where:
        1 means the tweet is about a disaster
        0 means the tweet is not about a disaster
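
One common way to obtain such a score and label from a two-class classifier is to apply a softmax to the output logits and take the most probable class. The sketch below assumes the values are derived this way. The function name `logits_to_prediction` and the example logits are illustrative and are not taken from the repository.

```python
import math
from typing import Dict, Sequence, Union


def logits_to_prediction(logits: Sequence[float]) -> Dict[str, Union[int, float]]:
    """Convert raw class logits into a predicted class and its probability."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability.
    target = max(range(len(probs)), key=probs.__getitem__)
    return {"target": target, "score": probs[target]}


# Hypothetical logits: class 0 (not a disaster) scores higher than class 1.
example = logits_to_prediction([2.0, 0.0])
```

Under this reading, a score close to 0.5 marks a prediction the model is unsure about, whichever class it picked.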

In [68]:
# Show results
predicted_list = [x["target"] for x in predictions]
score_list = [x["score"] for x in predictions]
df["predictions"] = predicted_list
df["score"] = score_list
df.head(10)

Unnamed: 0,id,keyword,location,text,target,predictions,score
0,2668,crush,"San Diego, Texas.",love love love remember first crush,0,0,0.9453
1,7721,panicking,UK,may panicking little fast submitting form usually,0,0,0.9474
2,7547,outbreak,,families sue legionnaires families affected fatal outbreak legionnaires disease edinburgh,1,1,0.9906
3,10687,wreck,,wreck,0,0,0.902
4,4547,emergency,,odai bucharest romania dogs dying they hungry eat other,1,1,0.9519
5,10805,wrecked,probably not home,coleslaw,0,0,0.9242
6,2332,collapse,Fakefams,correction tent collapse story,1,1,0.6246
7,3694,destroy,SEA Server,dazzle destroy fun,0,0,0.9661
8,394,annihilation,"Chandler, AZ",u s national park services tonto national forest stop annihilation salt river wild horse via,1,0,0.6337
9,5602,flood,,survived plague\nfloated flood\njust peeked heads mud\nno one immune\ndeafening bells\nmy god su...,0,1,0.5445


In [37]:
# Import metric utilities
from src.utils.nlp_metric import Metric
from src.utils.compute_results import get_results

In [69]:
# Define metrics
metrics_list = ["accuracy", "precision", "recall", "f1"]
metrics = [Metric(metric_name) for metric_name in metrics_list]

target_list = df["target"].tolist()

# Compute metrics
results = get_results(
    preds=predictions,
    labels=target_list,
    metrics=metrics,
)

In [63]:
# Show results
print(results)

{'accuracy': '0.850', 'precision': '0.833', 'recall': '0.714', 'f1': '0.769'}
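
These values are consistent with a confusion matrix of 5 true positives, 1 false positive, 2 false negatives, and 12 true negatives over 20 samples. The counts are inferred from the reported metrics rather than printed by the notebook, and the function below is an illustrative sketch of the standard formulas, not the repository's `get_results`.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts inferred from the reported metrics (assumption, see the text above).
values = binary_metrics(tp=5, fp=1, fn=2, tn=12)
formatted = {name: f"{value:.3f}" for name, value in values.items()}
print(formatted)
```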


In [70]:
# Show correct predictions
df[df["predictions"] == df["target"]]

Unnamed: 0,id,keyword,location,text,target,predictions,score
0,2668,crush,"San Diego, Texas.",love love love remember first crush,0,0,0.9453
1,7721,panicking,UK,may panicking little fast submitting form usually,0,0,0.9474
2,7547,outbreak,,families sue legionnaires families affected fatal outbreak legionnaires disease edinburgh,1,1,0.9906
3,10687,wreck,,wreck,0,0,0.902
4,4547,emergency,,odai bucharest romania dogs dying they hungry eat other,1,1,0.9519
5,10805,wrecked,probably not home,coleslaw,0,0,0.9242
6,2332,collapse,Fakefams,correction tent collapse story,1,1,0.6246
7,3694,destroy,SEA Server,dazzle destroy fun,0,0,0.9661
10,5627,flooding,,maybe plan dilute safely say start charging over longs would come flooding in,0,0,0.9258
11,10110,upheaval,Oregon,look state actions year ferguson upheaval,0,0,0.576


In [71]:
# Show incorrect predictions
df[df["predictions"] != df["target"]]

Unnamed: 0,id,keyword,location,text,target,predictions,score
8,394,annihilation,"Chandler, AZ",u s national park services tonto national forest stop annihilation salt river wild horse via,1,0,0.6337
9,5602,flood,,survived plague\nfloated flood\njust peeked heads mud\nno one immune\ndeafening bells\nmy god su...,0,1,0.5445
13,4624,emergency%20services,USA,call tasmania emergency services trained horse,1,0,0.705
