<a href="https://colab.research.google.com/github/Hernanros/SOTA/blob/master/summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rethinking Semantic Similarity Metrics
### Cohn, Hershkovitz, Rosenblum, Serfaty, Solomon (Ydata) - 2020
##### Data files and code can be found in this [repository](https://github.com/Hernanros/SOTA)

As elaborated in the seminal work (Yamshchikov 2020), there is currently no good metric available for semantic similarity, which is a major obstacle to progress in style transfer and paragraph summarization. Currently an ensemble method (of WMD, BLEU and POS) works best to measure semantic preservation, but even then “[t]here is still no metric that could distinguish paraphrases from style transfers definitively.” While the main focus seems to be on using larger and larger LMs (BERT) for richer sentence embeddings, we propose that an effective solution will only exist when we start “climbing the right hill, not just the hill on whose slope we currently sit…” (Bender 2020).


As the Cheshire Cat said so succinctly, “if you don’t know where you are going, any road will take you there.” While we struggle to find a viable metric for semantic preservation, more fundamental questions must first be addressed:

1. What does ‘Semantic Similarity’ even refer to? Besides the larger epistemological problem of “What does it mean to have meaning?”, even humans have an incredibly hard time parsing intent without a larger context.<br><br>
2. As we don’t have a clear definition of what the question even is, on what basis are we even evaluating ‘human-labeled metrics’? There is no clear intuition as to what value, out of five, comparing *“The tiramisu was simply divine.”* and *“Trump once again confuses son-in-law Kushner with oversized Elmo doll.”* should give us. As such, why are we even using human labeling as some form of guiding star?<br><br>
3. As we don’t have a clear intuition as to why the metrics work at all, if we use an ensemble method, how do we weigh them? Similarly, treating the established metrics as “black box” methods of extracting results, why should we trust any of them at all?<br><br>
4. Unlike image recognition, where the information of what is in the picture is found within the pixels of the image, communicative intents are about something that is outside of language. Any model that doesn’t incorporate some form of larger knowledge base will never be able to “understand” anything.<br><br>

Before you get too excited, we do not intend to solve any of these problems. Rather, we would like to propose that we would not need to solve any of them fully in order to have a better defined metric.

First off, what we are trying to optimize here is not actually semantic similarity, but **perceived** semantic similarity. Unlike other ML tasks (classification, regression, etc.) where we are trying to approximate some real-world distribution, in our situation the goal is to model the intuition people feel for semantic preservation.

In tackling perceived semantic similarity, we are no longer trying to define understanding itself, but rather the heuristics humans use when making comparisons. As such, human labeling is the end goal. However, because subjective evaluation is, by its very definition, subjective, a rigorous approach is to identify the underlying heuristics and find metrics that measure them effectively.

**Our hypothesis is that there are several heuristics underlying our conception of perceived semantic similarity, and through exploring and creating metrics that evaluate those underlying heuristics, we can eventually come to a fair evaluation of semantic similarity.** The current metrics in use represent some of the underlying heuristics, which is why they have some effectiveness, but we are not sure as to which heuristics they are mapping onto and to what degree of complexity.

Therefore, our goal here is to explore two areas:<br>
1. The practical element - if we can assume that the current metrics do capture some of the underlying heuristics, how much can we trust our current metrics (ensemble or otherwise) for the various datasets, and how versatile are these ensemble methods if we wanted to transfer them to new datasets?<br>

2. The theoretical element - We propose the beginning steps (what would start off initially as a social experiment) to explore the underlying heuristics to perceived semantic preservation.


## The Practical element

### Initializations

In [None]:
import pandas as pd
import numpy as np
import os, urllib.parse, glob, sys
from getpass import getpass

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

import re

from scipy import stats

In [None]:
user = input('User name: ')
password = getpass('Password: ')
password = urllib.parse.quote(password) # your password is converted into url format
cmd_string = "! git clone https://{0}:{1}@github.com/Hernanros/SOTA".format(user, password)

os.system(cmd_string)
cmd_string, password = "", "" # removing the password from the variable

%cd SOTA/data

In [None]:
root = !pwd
root = root[0]
# Paraphrase dataset with predictions based off linear and non-linear models
df_para = pd.read_csv(f"{root}/data/Paraphrase_labeled_data_with_predictions_both.csv").drop(columns=["Unnamed: 0"])

# Base dataset with texts and labels
df_texts = pd.read_csv(f"{root}/data/Paraphrase.csv")

# All datasets with predictions based off linear and non-linear models
df_all = pd.read_csv(f"{root}/data/combined_data_with_predictions_on_separate_datasets_both.csv")

# Loss for the MLP model on all datasets
df_mlp_testloss = pd.read_csv(f"{root}/data/test_loss_on_MLP.csv",names = ['Dataset','Test Loss'],header=0)

df_linear_testloss = pd.read_csv(f"{root}/data/linear_datasets_loss.csv")


### Analysis of Paraphrase Dataset

We initially started with the Paraphrase dataset and processed the scores for all of the following metrics (normalized with MinMax Scaler [0,5] for easier readability):

In [None]:
df_para.head(3)
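
The [0, 5] rescaling mentioned above can be sketched as follows. This is a minimal standalone illustration on made-up scores, not the notebook's preprocessing code; scikit-learn's `MinMaxScaler(feature_range=(0, 5))` applies the same formula column by column, and the function name `min_max_scale` is only an assumption for this example.

```python
def min_max_scale(values, low=0.0, high=5.0):
    """Linearly rescale values so the smallest maps to `low` and the largest to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no information; map it to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


# Hypothetical raw scores for a single metric (illustrative values only).
raw_scores = [0.2, 0.5, 0.8, 1.0]
scaled_scores = min_max_scale(raw_scores)
print(scaled_scores)
```

Putting every metric on the same [0, 5] range as the human labels makes the raw metric scores directly comparable to the labels in the error analysis further down.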

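We then trained a shallow random forest regressor (RF). It is referred to as the “linear model” throughout this notebook, although a random forest is not strictly linear.

```
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(max_depth=2)
```

and explored its feature importances: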

<table border="1" class="dataframe">
  <thead>
    <tr>
      <th></th>
      <th>feature</th>
      <th>importance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>3</th>
      <td>WMD</td>
      <td>0.376846</td>
    </tr>
    <tr>
      <th>2</th>
      <td>chrf_score_norm</td>
      <td>0.248155</td>
    </tr>
    <tr>
      <th>13</th>
      <td>BertScore</td>
      <td>0.195598</td>
    </tr>
    <tr>
      <th>14</th>
      <td>L2_score</td>
      <td>0.087413</td>
    </tr>
    <tr>
      <th>5</th>
      <td>ROUGE-1 precision</td>
      <td>0.021447</td>
    </tr>
    <tr>
      <th>12</th>
      <td>ROUGE-L F</td>
      <td>0.021338</td>
    </tr>
    <tr>
      <th>11</th>
      <td>ROUGE-L precision</td>
      <td>0.017681</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1-gram_overlap</td>
      <td>0.010945</td>
    </tr>
    <tr>
      <th>7</th>
      <td>ROUGE-2 recall</td>
      <td>0.009196</td>
    </tr>
    <tr>
      <th>10</th>
      <td>ROUGE-L recall</td>
      <td>0.004518</td>
    </tr>
    <tr>
      <th>6</th>
      <td>ROUGE-1 F</td>
      <td>0.003775</td>
    </tr>
    <tr>
      <th>4</th>
      <td>ROUGE-1 recall</td>
      <td>0.003087</td>
    </tr>
    <tr>
      <th>0</th>
      <td>POS dist score</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>8</th>
      <td>ROUGE-2 precision</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>9</th>
      <td>ROUGE-2 F</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>

and the test loss (MAE):  0.6027



We also trained a simple non-linear MLP model:

```
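import torch
from torch import nn
import torch.nn.functional as F

class Basemodel(nn.Module):

  def __init__(self, n_feature, n_hidden, n_output, keep_probab=0.1):
    '''
    input : tensor of dimensions (batch_size * n_feature)
    output: tensor of dimensions (batch_size * n_output)

    Values used here:
    n_feature = 15 (number of metrics)
    n_hidden = 128
    n_output = 1
    '''
    super().__init__()

    self.input_dim = n_feature
    self.hidden = nn.Linear(n_feature, n_hidden)
    self.predict = nn.Linear(n_hidden, n_output)
    # Note: nn.Dropout takes the probability of zeroing a unit,
    # so keep_probab=0.1 drops 10% of the activations.
    self.dropout = nn.Dropout(keep_probab)

  def forward(self, x):
    # Hidden layer -> dropout -> ReLU, followed by a linear output layer
    x = F.relu(self.dropout(self.hidden(x)))
    x = self.predict(x)
    return x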
```
This gave a test loss (MAE) of 0.645.
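
For reference, the MAE quoted for both models is the average absolute gap between label and prediction. A minimal sketch on made-up numbers (the values below are illustrative and are not taken from the Paraphrase dataset):

```python
def mean_absolute_error(labels, predictions):
    """Average of |label - prediction| over all sentence pairs."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical labels and predictions on the 0-5 scale.
labels = [1.0, 3.0, 4.5, 2.0]
predictions = [1.5, 2.5, 4.0, 3.0]
mae = mean_absolute_error(labels, predictions)
print(f"MAE: {mae:.3f}")
```

Because the labels live on a 0-5 scale, an MAE of roughly 0.6 means a typical prediction is a bit more than half a point away from the human score.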

We then explored not only the accuracy of the metrics but also their respective distributions:

In [None]:
df_para_z = df_para.copy()
df_para_z['Normal_Label'],df_para_z['Normal_Linear_Preds'],df_para_z['Normal_MLP_Preds'] = stats.zscore(df_para.label),stats.zscore(df_para.Predictions), stats.zscore(df_para["MLP predictions"])

errors = []
for col in df_para_z.columns:
    if 'label' not in col.lower():
        if "Normal" in col:
          diff = df_para_z['Normal_Label'] - df_para_z[col]
          diff.name = col
          errors.append(diff.abs())
        else:
          diff = df_para_z['label'] - df_para_z[col]
          diff.name = col
          errors.append(diff.abs())
error_df = pd.concat(errors, axis=1)
error_df.describe()
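
The cell above standardizes the labels and both sets of predictions with `stats.zscore` before comparing them. As a rough standalone illustration of what that does (made-up numbers; it uses the population standard deviation, which is also the default for `scipy.stats.zscore`):

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std, i.e. ddof=0
    return [(v - mu) / sigma for v in values]


# Made-up labels and predictions (illustrative only).
labels = [1, 2, 3, 4, 5]
predictions = [2.0, 2.5, 3.5, 3.0, 4.0]

z_labels = zscore(labels)
z_predictions = zscore(predictions)

# Absolute error on the standardized scale, as computed for the "Normal_*" columns.
z_errors = [abs(zl - zp) for zl, zp in zip(z_labels, z_predictions)]
print([round(e, 3) for e in z_errors])
```

Standardizing removes differences in location and scale, so the "Normal" errors compare the shape of the predictions against the shape of the labels rather than their raw values.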

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(111)
ax.hist(x=df_para_z.Normal_Label, bins=15, alpha=0.5, rwidth=0.85, label='Normal_Label')
ax.hist(x=df_para_z.Normal_Linear_Preds, bins=15, alpha=0.5, rwidth=0.85, label='Normal_Linear_Preds')
ax.hist(x=df_para_z.Normal_MLP_Preds, bins=15, alpha=0.5, rwidth=0.85, label='Normal_MLP_Preds')
plt.title('Histogram of Predicted Output vs Labels')
plt.legend()
plt.show()

What we see is that the (normalized) human labels follow a normal distribution (with a slight left skew), the (normalized) linear model predictions seem to be roughly normal with two peaks, and the (normalized) non-linear predictions also follow a normal distribution (just with a much smaller variance).


Just looking at the scores themselves, neither model does exceedingly well (one must take into account the severe noisiness of human labeling, which we will discuss later on). However, it does seem that the non-linear model better captures the label distribution.

Below is an interactive plot to explore the distributions of all of the predictions and metrics.

In [None]:
fig = go.Figure(layout_title_text="Histogram of Similarity Metric Scores",)
for col in df_para_z.columns:
    fig.add_trace(go.Histogram(x=df_para_z[col], name=col, nbinsx=25))

# Overlay both histograms
fig.update_layout(barmode='overlay', height=600)
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.55)
fig.show()

Below you can explore the distribution of errors for each metric, and we can see that there is significant variance and difference between them.

In [None]:
fig = go.Figure(layout_title_text="Similarity Metric Errors Histogram")
for col in error_df.columns:
    fig.add_trace(go.Histogram(x=error_df[col], name=col, nbinsx=50), )

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.5)
fig.show()

What we also see is that our predictions (both linear and MLP) are within the variance of the annotations (±1), which is relatively fair.
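
One way to make the "within ±1" observation concrete is to count the share of predictions that land within one point of the label. The sketch below is an assumption about how such a check could look, run on made-up numbers; the helper name `fraction_within` is hypothetical and is not part of the repository.

```python
def fraction_within(labels, predictions, tolerance=1.0):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for y, p in zip(labels, predictions) if abs(y - p) <= tolerance)
    return hits / len(labels)


# Hypothetical labels and predictions on the 0-5 scale.
labels = [0.5, 2.0, 3.5, 4.0, 1.0]
predictions = [1.0, 2.4, 2.0, 4.9, 1.2]
share = fraction_within(labels, predictions)
print(f"{share:.0%} of predictions are within 1 point of the label")
```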

In [None]:
plt.scatter(df_para_z.label,error_df.Normal_MLP_Preds, label="Normal_MLP_Error")
plt.scatter(df_para_z.label,error_df.Normal_Linear_Preds, label="Normal_Linear_Error")
plt.plot(df_para_z.label,[error_df.Normal_MLP_Preds.mean()] * df_para_z.label.shape[0], label="mlp_mean_error")
plt.plot(df_para_z.label,[error_df.Normal_Linear_Preds.mean()] * df_para_z.label.shape[0], label="linear_mean_error")

plt.xlabel('label')
plt.ylabel('error')
plt.title('Error rate per labels')
plt.legend()
plt.gcf().set_size_inches(20, 10)

While the MLP model has a distribution more similar to the human labeling, the linear model seems to have an overall smaller error variance. (Theoretically speaking, with enough fine-tuning we should be able to get the MLP model to match the linear model, if not beat it. This is just the exploratory phase for a POC.)


In [None]:
df_para['text_1'] = df_texts['text_1']
df_para['text_2'] = df_texts['text_2']

In [None]:
high_err = df_para[np.abs(df_para.Predictions- df_para.label)>1]
low_err = df_para[np.abs(df_para.Predictions- df_para.label)<.25]

In [None]:
high_samps, low_samps = np.random.choice(high_err.index,10),np.random.choice(low_err.index,10)
print("samples of high error sentences")
for i in high_samps:
  # the sampled values are index labels, so look rows up with .loc rather than .iloc
  print(f"\nsentence 1 ({i}): {df_para.loc[i].text_1}\n" +
      f"sentence 2:{df_para.loc[i].text_2}\n" +
      f"label:{np.round(df_para.loc[i].label,2)} prediction:{np.round(df_para.loc[i].Predictions,2)}," +
      f"difference: {np.round(df_para.loc[i].Predictions- df_para.loc[i].label,2)}")

print("\n\n\nsamples of low error sentences")
for i in low_samps:
  print(f"\nsentence 1: {df_para.loc[i].text_1}\n" +
        f"sentence 2:{df_para.loc[i].text_2}\n" +
        f"label:{np.round(df_para.loc[i].label,2)} prediction:{np.round(df_para.loc[i].Predictions,2)},"+
        f"difference: {np.round(df_para.loc[i].Predictions- df_para.loc[i].label,2)}")

Reflecting on anecdotal evidence, we can see examples where our predictions would actually be preferred to the human-label, as well as examples of where the human labeling didn't seem intuitive at all.

While this is anecdotal, it strengthens the argument that the human labels are quite noisy and unreliable.

### Analysis of All Datasets

We then wanted to see how well our models worked on other datasets (all together, and each separately).

In both scenarios, we used the same architectures as for the Paraphrase dataset, but trained them on their respective datasets.

In [None]:
df_all.head(3)

The test loss (MAE) between the prediction and label for each dataset based off the MLP model:




In [None]:
df_mlp_testloss = df_mlp_testloss.sort_values(by="Test Loss").reset_index().drop(columns=['index'])
df_mlp_testloss

The test loss (MAE) between the prediction and label for each dataset based off the linear model:

In [None]:
df_linear_testloss

Feature weights (from the Linear Model) trained on all the datasets:<br><br>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>feature</th>
      <th>importance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>bleu_withoutstop</td>
      <td>0.905143</td>
    </tr>
    <tr>
      <th>5</th>
      <td>ftext_withoutstop</td>
      <td>0.094857</td>
    </tr>
    <tr>
      <th>0</th>
      <td>bleu_allwords</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>14</th>
      <td>ROUGE-2 recall</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>23</th>
      <td>L2_score</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>22</th>
      <td>POS dist score</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>21</th>
      <td>chrf_score_norm</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>20</th>
      <td>chrf_score</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>19</th>
      <td>ROUGE-L F</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>18</th>
      <td>ROUGE-L precision</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>17</th>
      <td>ROUGE-L recall</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>16</th>
      <td>ROUGE-2 F</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>15</th>
      <td>ROUGE-2 precision</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>12</th>
      <td>ROUGE-1 precision</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>13</th>
      <td>ROUGE-1 F</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>11</th>
      <td>ROUGE-1 recall</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>10</th>
      <td>4-gram_overlap</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>9</th>
      <td>3-gram_overlap</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>8</th>
      <td>2-gram_overlap</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>7</th>
      <td>1-gram_overlap</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>6</th>
      <td>WMD</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>4</th>
      <td>ftext_allwords</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>3</th>
      <td>glove_withoutstop</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>2</th>
      <td>glove_allwords</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>24</th>
      <td>bert</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>

In [None]:
with open(f'{root}/data/feature_weights_per_dataset.txt', 'r') as f:
  txt = f.read()

In [None]:
all_scores =  txt.split(sep="********************************************************************\n")
new_scores = [scores.split("\n") for scores in all_scores if scores != '']
scores_dict = {}
for score in new_scores:
  title = re.sub("prediect labels for dataset ","",score[0])
  _, mse_score = score[1].split(": ")
  scores_dict[title] = {'mse': mse_score}
  for values in score[4:-1]:
    value = values.split("  ")
    value = [val.strip() for val in value if val != ""]   
    scores_dict[title][value[1]] = float(value[-1])

In [None]:
pd.DataFrame.from_dict(scores_dict).T

In [None]:
df_all['Normal_Label'],df_all['Normal_Linear_Preds'] = stats.zscore(df_all.label),stats.zscore(df_all.Predictions)

In [None]:
cmap = {ds:i for i, ds in enumerate(df_all.dataset.unique())}
df_all['abs_diff'],df_all['abs_zdiff'] = np.abs(df_all.Predictions - df_all.label),np.abs(stats.zscore(df_all.Predictions - df_all.label))


fig, ax = plt.subplots()

scatter = ax.scatter(df_all.label,df_all.Predictions, c = [cmap[d] for d in df_all.dataset])
plt.gcf().set_size_inches(20, 10)
plt.xlabel('label')
plt.ylabel('predictions')
plt.title('predictions VS actual labels')

legend1 = ax.legend(*scatter.legend_elements(),
                    loc="best", title="datasets")
ax.add_artist(legend1)
plt.show();

In [None]:
f, axes = plt.subplots(8, 4, figsize=(40, 40))
# one scatter plot per dataset, filling the 8x4 grid row by row
for ax, s in zip(axes.ravel(), df_all.dataset.unique()):
  subset = df_all[df_all.dataset == s]
  ax.scatter(subset.label, subset.Predictions)
  ax.set_xlabel('labels')
  ax.set_ylabel('predictions')
  ax.set_title(s)

While the ensemble works for some datasets, for many others the ensemble method doesn't work at all, and our results look like noise. This could be due to a lack of labels for the dataset, or because the sentences in the dataset are too ambiguous. Exploring which datasets work better would also help us develop better heuristics (part 2).
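
A possible way to put a number on "looks like noise" is the correlation between labels and predictions within each dataset. This was not part of the original analysis; the sketch below is only a suggestion, uses made-up rows, and the dataset names are placeholders.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_per_dataset(rows):
    """rows: iterable of (dataset, label, prediction) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for dataset, label, prediction in rows:
        grouped[dataset][0].append(label)
        grouped[dataset][1].append(prediction)
    return {name: pearson(labels, preds) for name, (labels, preds) in grouped.items()}


# Placeholder data: predictions track the labels in dataset_a but not in dataset_b.
rows = [
    ("dataset_a", 1.0, 1.2), ("dataset_a", 2.0, 2.1),
    ("dataset_a", 3.0, 2.9), ("dataset_a", 4.0, 4.2),
    ("dataset_b", 1.0, 3.0), ("dataset_b", 2.0, 1.0),
    ("dataset_b", 3.0, 1.0), ("dataset_b", 4.0, 3.0),
]
correlations = correlation_per_dataset(rows)
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

On the notebook's `df_all`, a roughly equivalent pandas expression would be `df_all.groupby('dataset').apply(lambda g: g['label'].corr(g['Predictions']))`. Datasets with a correlation near zero would be the ones where the scatter plots above look like noise.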

In [None]:
high_err, low_err = df_all[df_all.abs_zdiff>1],df_all[df_all.abs_zdiff<.25]
high_err = high_err.drop(['text_1_tokens','text_2_tokens'],axis=1)
low_err = low_err.drop(['text_1_tokens','text_2_tokens'],axis=1)

In [None]:
high_samps, low_samps = np.random.choice(high_err.index,10),np.random.choice(low_err.index,10)
print("samples of high error sentences")
for i in high_samps:
  print(f"\ndataset: {high_err.dataset[i]}\nsentence 1: {high_err.text_1[i]}\nsentence 2:{high_err.text_2[i]}\nlabel:{np.round(high_err.label[i],2)} prediction:{np.round(high_err.Predictions[i],2)}, difference: {np.round(high_err.Predictions[i]- high_err.label[i],2)}")

print("\n\n\nsamples of low error sentences")
for i in low_samps:
  print(f"\ndataset: {low_err.dataset[i]}\nsentence 1: {low_err.text_1[i]}\nsentence 2:{low_err.text_2[i]}\nlabel:{np.round(low_err.label[i],2)} prediction:{np.round(low_err.Predictions[i],2)}, difference: {np.round(low_err.Predictions[i]- low_err.label[i],2)}")


1. We need to tackle the noisiness of the human-labeled data. We offer a suggestion in part 2.
2. While the non-linear model's prediction distribution was similar to that of the human labels, the model didn't perform exceedingly well, and was even less beneficial when applied to other datasets.
3. While the ensemble method works for some datasets, for other datasets (or as a generic ensemble) it doesn't perform well at all.

## The Theoretical Element

What we need to do is hypothesize what the underlying heuristics beneath perceived semantic similarity are.

We would then develop particular pairs of sentences in which the distinction would be on that given heuristic, and see how it impacts the human score.<br> 

Ultimately, instead of having one human-labeled metric for semantic similarity, we would have a human-labeled score for each of the underlying heuristics. The current ideas for heuristics are:<br>
1. Word Overlap
2. Word similarity 
3. Subject-Object relationship 
4. Type of sentence (Question/Statement/Elaboration/etc.)
5. Mood
6. Sentiment Similarity (Could be seen as a subdivision of Mood)
7. Similar sentence length.
8. Bias within size of the sentences 
9. Grammatical consistency (same form of mistakes, etc.)
10. Spelling/Writing dialect<br>
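
To give a sense of how some of these heuristics could be turned into measurable features, here is a minimal sketch for two of them: word overlap (item 1) and similar sentence length (item 7). The specific formulas (Jaccard overlap on whitespace tokens, ratio of token counts) are assumptions chosen for illustration; the list above does not prescribe them.

```python
def word_overlap(sentence_1, sentence_2):
    """Jaccard overlap of lowercase whitespace tokens (heuristic 1)."""
    tokens_1 = set(sentence_1.lower().split())
    tokens_2 = set(sentence_2.lower().split())
    if not tokens_1 and not tokens_2:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)


def length_ratio(sentence_1, sentence_2):
    """Shorter token count divided by the longer one (heuristic 7)."""
    len_1, len_2 = len(sentence_1.split()), len(sentence_2.split())
    if max(len_1, len_2) == 0:
        return 1.0
    return min(len_1, len_2) / max(len_1, len_2)


# Illustrative sentence pair (not taken from any of the datasets).
pair = ("the cat sat on the mat", "the cat lay on the rug")
overlap = word_overlap(*pair)
ratio = length_ratio(*pair)
print(f"word overlap: {overlap:.2f}, length ratio: {ratio:.2f}")
```

Scoring sentence pairs that differ on only one such feature, and comparing against the human score for that pair, is the kind of controlled experiment proposed above.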

As we gain a better understanding of the underlying features, we may also be able to come back to our current metrics with a deeper understanding of what exactly they are capturing.

We should also establish clearer guidelines as to what form of human labeling we accept. Since labelers are still asked to reflect on their subjective evaluation, we need a way to measure whether or not humans can give us a good approximation for any particular pair. This can be encouraged by only taking labels where there is a form of **consensus** among a majority of the labelers (e.g., where the variance of the labels for a given pair falls within a certain range).
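
The consensus rule described above could be sketched as follows. This is a minimal illustration under stated assumptions: each pair has several annotator scores, agreement is measured by the population variance of those scores, and the threshold of 0.5 is an arbitrary placeholder rather than a recommended value.

```python
from statistics import mean, pvariance


def filter_by_consensus(annotations, max_variance=0.5):
    """Keep pairs whose annotator scores agree; return their mean label.

    annotations: dict mapping a pair id to a list of annotator scores.
    """
    kept = {}
    for pair_id, scores in annotations.items():
        if len(scores) < 2:
            continue  # a single score cannot show agreement
        if pvariance(scores) <= max_variance:
            kept[pair_id] = mean(scores)
    return kept


# Hypothetical annotator scores on the 0-5 scale.
annotations = {
    "pair_1": [4, 4, 5],  # close agreement
    "pair_2": [0, 3, 5],  # strong disagreement
    "pair_3": [2, 2, 2],  # full agreement
}
consensus = filter_by_consensus(annotations)
print(consensus)
```

Pairs that fail the check are not necessarily bad data; they may be exactly the ambiguous cases worth examining when developing the heuristics.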

With a richer language to discuss semantic similarity, we can look deeper into style transfer and paragraph generation and ask what underlying heuristics we are actually looking for in a given task.

While we have not fully explored every possible avenue of curiosity, we hope we have at least provided you with a promising idea for future research. We look at our exploration as 'laying the ground-work' as opposed to 'defending a thesis'. 

To quote the great Geoffrey Hinton: "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."