# Metrics

### This notebook introduces the metrics used for model evaluation

In [2]:
import pandas as pd

import sys
sys.path.append('..')
from src.models.metrics import semantic_similarity, style_accuracy, fluency, j_metric

Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Loading an example on which the metrics will be demonstrated

In [12]:
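# Load the filtered dataset and keep only its first row as a demo example
df = pd.read_csv('../data/interim/filtered.csv')
df = df.iloc[0]  # df is now a single row (a pandas Series)

# Here `pred` holds the original (toxic) text and `label` its non-toxic rewrite
pred, label = df['reference'], df['translation']
print(f'Original text: {pred}\n\nNon-toxic text: {label}')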

Original text: They're all laughing at us, so we'll kick your ass.

Non-toxic text: they're laughing at us. We'll show you.


### Semantic similarity measures how close two texts are in meaning (1 = same meaning, 0 = different meaning). In our problem it is important to preserve the meaning while removing toxicity from the text

In [15]:
similarity = semantic_similarity([pred], [label])

print(f'Semantic similarity: {similarity:.5f}')

Semantic similarity: 0.75373
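
A common way to score meaning similarity is the cosine similarity between sentence embeddings. Whether `semantic_similarity` works this way is not shown in this notebook, so the sketch below is only an illustration under that assumption. The helper name `cosine_similarity` and the toy vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Hypothetical helper: cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" made up for illustration; real ones would come from a sentence encoder
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(f'Same direction: {same_direction:.5f}')
print(f'Orthogonal: {orthogonal:.5f}')
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which matches the 1 = same meaning, 0 = different meaning reading used above.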


### Style accuracy measures how non-toxic a text is (1 = non-toxic, 0 = toxic)

In [19]:
original_accuracy = style_accuracy([pred])
non_toxic_accuracy = style_accuracy([label])

print(f'Style Accuracy of original: {original_accuracy:.5f}\n\nStyle Accuracy of non-toxic: {non_toxic_accuracy:.5f}')

Style Accuracy of original: 0.00073

Style Accuracy of non-toxic: 0.99966
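
The import log above shows that a toxicity classifier (`SkolkovoInstitute/roberta_toxicity_classifier`) is loaded. One plausible reading of the score is "one minus the classifier's toxic probability". The sketch below illustrates that idea only: the function names, the position of the toxic class, and the logits are all assumptions, not the notebook's actual implementation.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def non_toxic_score(logits, toxic_index=1):
    """Hypothetical helper: 1 - P(toxic); toxic_index=1 is an assumed label order."""
    probs = softmax(logits)
    return 1.0 - probs[toxic_index]

# Made-up logits in the order [neutral, toxic], for illustration only
toxic_like = non_toxic_score([-3.0, 4.0])
neutral_like = non_toxic_score([4.0, -3.0])

print(f'Toxic-looking input: {toxic_like:.5f}')
print(f'Neutral-looking input: {neutral_like:.5f}')
```

Under this reading, a confidently toxic prediction yields a score near 0 and a confidently neutral one a score near 1, the same pattern as the two outputs above.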


### Fluency measures how grammatically correct the given text is according to English grammar (1 = correct, 0 = incorrect)

In [20]:
original_fluency = fluency([pred])
non_toxic_fluency = fluency([label])

print(f'Fluency of original: {original_fluency:.5f}\n\nFluency of non-toxic: {non_toxic_fluency:.5f}')

Fluency of original: 0.99038

Fluency of non-toxic: 0.99189


### The J metric combines the three metrics above into a single score

In [21]:
original_j = j_metric(similarity, original_accuracy, original_fluency)
non_toxic_j = j_metric(similarity, non_toxic_accuracy, non_toxic_fluency)

print(f'J-Metric of original: {original_j:.5f}\n\nJ-Metric of non-toxic: {non_toxic_j:.5f}')

J-Metric of original: 0.00054

J-Metric of non-toxic: 0.74736
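
The printed values are consistent with J being the product of the three scores, although the body of `j_metric` is not shown here. The sketch below checks that assumption against the rounded numbers printed in this notebook; `j_product` is a hypothetical helper, not the function imported from `src.models.metrics`.

```python
def j_product(similarity, accuracy, fluency):
    """Hypothetical helper: combine three scores in [0, 1] by multiplication."""
    return similarity * accuracy * fluency

# Rounded scores copied from the outputs printed earlier in this notebook
similarity = 0.75373
original_j = j_product(similarity, 0.00073, 0.99038)
non_toxic_j = j_product(similarity, 0.99966, 0.99189)

print(f'Product for original: {original_j:.5f}')
print(f'Product for non-toxic: {non_toxic_j:.5f}')
```

Both products agree with the J values above to within rounding of the inputs. A product also explains why the original text scores near zero: a single low component (here style accuracy) pulls the whole score down, however high the other two are.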
