# Text Mining Group Project
## DE-EN Corpus

##### TOC for Implemented Metrics.

---

In [1]:
# Necessary Installs.
# !pip install rouge

In [2]:
# Imports
import pandas as pd
from collections import Counter
from rouge import Rouge
from nltk.translate import chrf_score

Load Dataset

In [3]:
df1 = pd.read_csv("scores.csv")

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21704 entries, 0 to 21703
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   source       21704 non-null  object 
 1   reference    21704 non-null  object 
 2   translation  21704 non-null  object 
 3   z-score      21704 non-null  float64
 4   avg-score    21704 non-null  float64
 5   annotators   21704 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 1017.5+ KB


In [5]:
df1.head()

Unnamed: 0,source,reference,translation,z-score,avg-score,annotators
0,"Ihr Zeitlupentempo maßen sie, als sie vor Spit...",Her timeless pace measures them when they equi...,Their slow speed was measured by researchers o...,-0.345024,76.0,1
1,"Er sagte, dass die Bereiche ruhige Treffpunkte...",He said the areas offer quiet meeting points b...,He said the spaces provided calm meeting point...,0.9038,97.5,2
2,Für die Geschäftsleute an der B 27 ist es nur ...,"For businessmen at the B 27, it's only a small...",This is only a small consolation for businesse...,0.700503,94.0,1
3,Diese Fähigkeit sei möglicherweise angeboren o...,This ability may be born or developed with gen...,"This ability may be innate, or may develop as ...",-1.256572,51.5,2
4,Weil sie Wassertemperaturen um die sechs Grad ...,Because they prefer water temperatures around ...,They generally only come to the surface in win...,0.293909,87.0,2


---
### PreProcessing

In [6]:
# Preprocessing: work on a copy so the raw data in df1 stays untouched.
df = df1.copy()

for x in ["source","reference","translation"]:
    # Lowercase every text column.
    df[x] = df1[x].str.lower()

---
### Metrics

The ROUGE metric, as described in:

https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460

In [7]:
reference = df["reference"][0]
model_out = df["translation"][0]

In [8]:
rouge = Rouge()

In [9]:
print(reference)
print(model_out)

her timeless pace measures them when they equipped six animals with a broadcaster before spitsbergen.
their slow speed was measured by researchers off svalbard, who fitted six animals with a tracker.


In [10]:
# get_scores returns three values for each metric: F1 score (f), precision (p) and recall (r).
# The metrics are ROUGE-1 (unigrams), ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence).
rouge.get_scores(model_out,reference)

[{'rouge-1': {'f': 0.2580645111342353, 'p': 0.25, 'r': 0.26666666666666666},
  'rouge-2': {'f': 0.20689654673008337, 'p': 0.2, 'r': 0.21428571428571427},
  'rouge-l': {'f': 0.2580645111342353, 'p': 0.25, 'r': 0.26666666666666666}}]
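
To make the `rouge-1` numbers less opaque, here is a minimal sketch of unigram precision, recall and F1. It assumes plain whitespace tokenization and clipped token counts; `rouge_1` is a hypothetical helper written for illustration and is not the `rouge` package's own code, which may tokenize or count differently. For this sentence pair it gives p = 0.25 and r = 4/15, in line with the output above.

```python
from collections import Counter

def rouge_1(hypothesis: str, reference: str) -> dict:
    """Illustrative ROUGE-1 on whitespace tokens (hypothetical helper)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    p = overlap / max(sum(hyp_counts.values()), 1)  # share of hypothesis tokens matched
    r = overlap / max(sum(ref_counts.values()), 1)  # share of reference tokens matched
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

reference = ("her timeless pace measures them when they equipped six animals "
             "with a broadcaster before spitsbergen.")
model_out = ("their slow speed was measured by researchers off svalbard, "
             "who fitted six animals with a tracker.")

scores = rouge_1(model_out, reference)
print(scores)
```

Only four tokens ("six", "animals", "with", "a") are shared, which is why a fluent translation scores low against a poor reference.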

In [11]:
# To score the entire dataset, model_out and reference need to be lists of strings of equal length.
model_out = df["translation"].to_list()
reference = df["reference"].to_list()
rouge_scores = rouge.get_scores(model_out,reference)

In [12]:
# For each of the three scores, output a new column in the df with the f1 scores.
for key in rouge_scores[0].keys():
    df[(key+" score")] = pd.Series([score[key]["f"] for score in rouge_scores])

---
#### chrF metric

Check the paper here: https://www.aclweb.org/anthology/W15-3049.pdf

The general formula for the chrF score is:

`CHRFBeta = (1 + Beta**2) * ((chrP * chrR) / (Beta**2*chrP + chrR))`

where:
* chrP is the percentage of character n-grams in the hypothesis which have a counterpart in the reference.
* chrR is the percentage of character n-grams in the reference which are also present in the hypothesis.
* Beta is a parameter which assigns beta times more importance to recall than to precision (if beta == 1, they have the same importance).
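
The formula can be sketched for a single n-gram order as follows. This is a simplified illustration, not NLTK's implementation: `char_ngrams` and `chrf_single_order` are hypothetical helpers, whitespace is removed before counting (an assumption made here), and only one value of n is used rather than a range of orders.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_single_order(reference: str, hypothesis: str, n: int = 2, beta: float = 3.0) -> float:
    """F-beta over character n-grams of one order (illustrative only)."""
    ref = char_ngrams(reference.replace(" ", ""), n)
    hyp = char_ngrams(hypothesis.replace(" ", ""), n)
    matches = sum((ref & hyp).values())  # clipped n-gram matches
    chr_p = matches / sum(hyp.values()) if hyp else 0.0
    chr_r = matches / sum(ref.values()) if ref else 0.0
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)

# The hypothesis is a prefix of the reference: precision is 1.0, recall is 0.5.
f_beta1 = chrf_single_order("abcde", "abc", n=2, beta=1)
f_beta3 = chrf_single_order("abcde", "abc", n=2, beta=3)
print(f_beta1, f_beta3)
```

With perfect precision but low recall, raising Beta from 1 to 3 lowers the score, because the result is pulled toward the weaker recall value.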

In [None]:
# Adds one new column per parameter combination, holding the chrF score for each row of the df.
# The default n-gram values are min == 1, max == 6.
# The default beta is 3.

# Parameter grid for the chrF scores. Adjust these lists to test other combinations.
# Note: this takes a few minutes to run.
min_len = [1,2]
max_len = [6]
beta = [1,3]

chrf_scores = []
for min_l in min_len:
    for max_l in max_len:
        for b in beta:
            col_name = f"chrf_b{b}_n{min_l}{max_l}"
            chrf_scores.append(col_name)
            df[col_name] = df.apply(
                lambda row: chrf_score.sentence_chrf(
                    row["reference"], row["translation"],
                    min_len=min_l, max_len=max_l, beta=b,
                ),
                axis=1,
            )

df.loc[:,chrf_scores]

---
### Comparison of Applied Metrics
Because these metrics can be on different numeric scales, the most meaningful way to compare them is to check how strongly each one correlates with the annotators' scores.
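
A small sketch of why correlation sidesteps the scale problem: Pearson's r does not change when a metric is rescaled. The `pearson` function below is a hand-written illustration, not the pandas implementation. The human scores are the five `avg-score` values from the preview above, while the metric values are made up purely for the demonstration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

human = [76.0, 97.5, 94.0, 51.5, 87.0]         # avg-score of the first five rows
metric_0_1 = [0.26, 0.55, 0.48, 0.31, 0.40]    # made-up metric on a 0-1 scale
metric_0_100 = [100 * m for m in metric_0_1]   # the same metric on a 0-100 scale

r_small = pearson(metric_0_1, human)
r_large = pearson(metric_0_100, human)
print(r_small, r_large)
```

Both calls print the same coefficient up to floating-point error, so metrics reported on different scales can be ranked by their correlation with human judgement.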

In [None]:
# Initialize a dict to be transformed to a df later, for score comparison.
scores_dict = {"pearson":[],"kendall":[],"spearman":[]}
scores_index = []

In [None]:
# pandas provides a corr method on Series.

# For each declared correlation method, correlate every computed metric with the avg-score column.
for corr in scores_dict.keys():
    for key in rouge_scores[0].keys():
        scores_dict[corr].append(df.loc[:,(key+ " score")].corr(df.loc[:,"avg-score"],method=corr))

# The loop variable is named chrf_col so that it does not shadow the imported chrf_score module.
for corr in scores_dict.keys():
    for chrf_col in chrf_scores:
        scores_dict[corr].append(df.loc[:,chrf_col].corr(df.loc[:,"avg-score"],method=corr))

# Build also a list that will be used to create the index for the scores dataframe.
scores_index.extend(list(rouge_scores[0].keys()))
scores_index.extend(chrf_scores)

In [None]:
scores_df = pd.DataFrame(scores_dict,index=scores_index)
scores_df