# HW-4 - Predicting Sentence Similarity
In lecture 4 we saw several techniques for sentence embedding. The most simple one which also leads to competitive results is simply taking the average of pre-trained word embedding. Therefore, in this exercise, we will explore the STS dataset and we will see how can we improve the existing word embedding to better handle this task.


## 1 - Package installation

For some of the methods below, we also need the word frequencies estimated from the corpus. As we currently don't have this available for the GloVe pretrained vectors / Common Crawl corpus, we use the wordfreq package (https://github.com/LuminosoInsight/wordfreq/)

The SemEval data are obtained from the datasets-sts repo: https://github.com/brmson/dataset-sts

GloVe - Global Vectors for Word Representation (https://nlp.stanford.edu/projects/glove/). Pre-trained word vectors have been downloaded (we use the 300-dimensional vectors trained on the 840 billion token Common Crawl corpus: http://nlp.stanford.edu/data/glove.840B.300d.zip) - *The download may take a while*

In [1]:
from google.colab import drive
#drive.flush_and_unmount()
#drive.mount("/content/drive", force_remount=True)
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [2]:
cd "/content/drive/My Drive/WORK/ML/YDATA/NLP"


/content/drive/My Drive/WORK/ML/YDATA/NLP


In [None]:
!pip install wordfreq

In [None]:
!pip install -q transformers


In [5]:
%%bash
WORDFILE=glove.840B.300d.zip
if [ -f "$WORDFILE" ]; then
    echo "$WORDFILE exists."
else
    wget http://nlp.stanford.edu/data/glove.840B.300d.zip
fi


glove.840B.300d.zip exists.


In [6]:
%%bash
STS_FOLDER=dataset-sts
if [ -d "$STS_FOLDER" ]; then
    echo "$STS_FOLDER exists."
else
    git clone https://github.com/brmson/dataset-sts
fi

dataset-sts exists.


We will convert the downloaded word embedding to a dictionary for further usage.

In [7]:
import pandas as pd
import zipfile

z = zipfile.ZipFile("./glove.840B.300d.zip")
glove_pd = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in glove_pd.T.items()}
del glove_pd



After this you can easily access the word embedding by accessing a dictionary

In [None]:
glove['test']

If we want to acess the frequency of a specific word it can be done easily using wordfreq (note that in general it would have been better to calculate the word frequency based on our on corpus)

In [9]:
import wordfreq
wordfreq.word_frequency('test', 'en', wordlist='large')

0.000158

## 2 - Loading the datasets


In [10]:
%matplotlib inline
import glob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import sys
sys.path.append('./dataset-sts/')
import pysts
from pysts.loader import load_sts
import torch

import scipy
from scipy.stats import pearsonr
import re
import os

from sklearn.decomposition import TruncatedSVD


nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Now we will load the 2015 STS dataset on the healines segement. As you can see each example is composed by two sentences s0 and s1 and a label

In [11]:
s0, s1, labels = load_sts("dataset-sts/data/sts/semeval-sts/2015/headlines.test.tsv")

In [12]:
print(f"Sentence A: {s0[0]}")
print(f"Sentence B: {s1[0]}")
print(f"Label: {labels[0]}")


Sentence A: ['The', 'foundations', 'of', 'South', 'Africa', 'are', 'built', 'on', 'Nelson', 'Mandela', "'s", 'memory']
Sentence B: ['Australian', 'politicians', 'lament', 'over', 'Nelson', 'Mandela', "'s", 'death']
Label: 1.3



## 3. Predict similarity between sentences based on GloVe
To predict the similarity between two sentences, the word embeddings (using the GloVe word vectors) are combined into a sentence embedding.

Similarity is calculate as the cosine similarity of the two sentence embeddings, and the overall performance is evaluated as the Pearson's coefficient between the predicted scores and the labels.



In [13]:
# Implement the following functions:
# 1.  A function which gets as an input a sentence and returns the average of it's word embedding.
# 2.  A function which gets as an input two sentence embeddings and returns their cosine-similarity
# 3.  A function which gets the the predicted scores and the labels and returns the pearson's r coefficent Tip: For calculating the pearson's r coefficent you can use from scipy.stats import pearsonr

def preprocess(words):
    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", str(words))
    # convert to lower case and split 
    words = letters_only_text.lower().split()
    return words 

def cosine_distance_wordembedding_method(s1, s2):
    s1=str(s1)
    s2=str(s2)
    s1 = list(filter(lambda x: x in glove.keys(), s1))
    s2 = list(filter(lambda x: x in glove.keys(), s2))
    vector_1 = [glove[word] for word in preprocess(s1)]
    vector_2 = [glove[word] for word in preprocess(s2)]      
    vector_1 = np.mean(np.stack(vector_1),axis=0)
    vector_2 = np.mean(np.stack(vector_2),axis=0)
    cosine = scipy.spatial.distance.cosine(vector_1, vector_2)
    return round((1-cosine),2)

def pearson_metric(preds, labels):
    return (pearsonr(preds,labels)[0])


You can check the implementation using a small batch of the data

## 6. Predict similarity between sentences based on average BERT embeddings

Evaluate the similarity using average BERT embeddings. You are adviced to use the transformers package.

Place the result from this section next to the results from above so we can spot the diffrences.


In [14]:
import transformers as ppb # pytorch transformers
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
bert_model = model_class.from_pretrained(pretrained_weights)



HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=440473133.0, style=ProgressStyle(descri…




In [15]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

In [47]:
files = glob.glob("dataset-sts/data/sts/semeval-sts/all/*.test.tsv")
#files = glob.glob("dataset-sts/data/sts/semeval-sts/2015/head*.test.tsv")

data_df =  pd.DataFrame(columns=['st0','st1','lbl'])


for f in files:

    #if "OnWN.test.tsv" not in f:
    #    continue

    fsize = os.stat(f).st_size
    if fsize < 40:
        a = open(f,"r").read()
        f = os.path.join(os.path.dirname(f),a)

    dataset_name = os.path.basename(f)
    print(dataset_name)

    s0, s1, labels = load_sts(f)

    for i in range(len(s0)):
      #st0 = [" ".join(words) for words in [s0[i]]][0]
      #st1 = [" ".join(words) for words in [s1[i]]][0]
      
      st0 = s0[i]
      st1 = s1[i]

      lbl = labels[i]
      data_df = data_df.append({'dataset':f,'st0': st0,'st1': st1,'lbl': lbl}, ignore_index=True)

data_df['cosine']=0.0  #float
data_df

MSRpar.test.tsv
OnWN.test.tsv
SMTeuroparl.test.tsv
SMTnews.test.tsv
FNWN.test.tsv
OnWN.test.tsv
headlines.test.tsv
OnWN.test.tsv
deft-forum.test.tsv
deft-news.test.tsv
headlines.test.tsv
images.test.tsv
tweet-news.test.tsv
answers-forums.test.tsv
answers-students.test.tsv
belief.test.tsv
headlines.test.tsv
images.test.tsv
2015.test.tsv
answer-answer.test.tsv
headlines.test.tsv
plagiarism.test.tsv
postediting.test.tsv
question-question.test.tsv


Unnamed: 0,st0,st1,lbl,dataset,cosine
0,"[The, problem, likely, will, mean, corrective,...","[He, said, the, problem, needs, to, be, correc...",4.4,dataset-sts/data/sts/semeval-sts/all/../2012/M...,0.0
1,"[The, technology-laced, Nasdaq, Composite, Ind...","[The, broad, Standard, &, Poor, 's, 500, Index...",0.8,dataset-sts/data/sts/semeval-sts/all/../2012/M...,0.0
2,"[``, It, 's, a, huge, black, eye, ,, '', said,...","[``, It, 's, a, huge, black, eye, ,, '', Arthu...",3.6,dataset-sts/data/sts/semeval-sts/all/../2012/M...,0.0
3,"[SEC, Chairman, William, Donaldson, said, ther...","[``, I, think, there, 's, a, building, confide...",3.4,dataset-sts/data/sts/semeval-sts/all/../2012/M...,0.0
4,"[Vivendi, shares, closed, 1.9, percent, at, 15...","[In, New, York, ,, Vivendi, shares, were, 1.4,...",1.4,dataset-sts/data/sts/semeval-sts/all/../2012/M...,0.0
...,...,...,...,...,...
18539,"[How, to, make, good, coffee, in, a, Moka, pot...","[How, to, make, more, than, one, good, cup, of...",4.0,dataset-sts/data/sts/semeval-sts/all/../2016/q...,0.0
18540,"[How, do, I, prepare, this, porous, interior, ...","[How, do, I, install, a, new, interior, partit...",1.0,dataset-sts/data/sts/semeval-sts/all/../2016/q...,0.0
18541,"[What, could, be, causing, my, GFCI, to, trip, ?]","[What, could, be, causing, my, GFCI, outlet, t...",4.0,dataset-sts/data/sts/semeval-sts/all/../2016/q...,0.0
18542,"[How, do, I, prepare, this, porous, interior, ...","[How, do, I, make, this, paint, match, ?]",1.0,dataset-sts/data/sts/semeval-sts/all/../2016/q...,0.0


In [48]:
MAX_LEN=100 #sentence padding length

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.st0 = dataframe.st0
        self.st1 = dataframe.st1
        self.lbl = self.data.lbl
        self.max_len = max_len

    def __len__(self):
        return len(self.st0)

    def sentence_tokenize(self,word_list):
        assert isinstance(word_list, list), 'input has to be a list'
        ret = f"{' '.join(word_list)}"
        #print("st0:",st0) 
        ret = self.tokenizer.tokenize(ret)
        return ret     

    def __getitem__(self, index):
      
        st0 = self.sentence_tokenize(self.st0[index])
        inputs0 = {k: torch.tensor([v]) for k, v in self.tokenizer.encode_plus(
            st0,
            max_length=self.max_len,
            pad_to_max_length=True
            ).items()}

        st1 = self.sentence_tokenize(self.st1[index])
        inputs1 = {k: torch.tensor([v]) for k, v in self.tokenizer.encode_plus(
            st1,
            max_length=self.max_len,
            pad_to_max_length=True
            ).items()}

        return {
            'ind': index,
            's0': inputs0,
            's1': inputs1,
            'lbl': torch.tensor(self.lbl[index], dtype=torch.float)
        }

In [50]:
params = {'batch_size': 32,
                'shuffle': False,
                'num_workers': 0
                }

data_set = CustomDataset(data_df, tokenizer, MAX_LEN)
                

data_ld = DataLoader(data_set, **params)

In [51]:
data_set[2]

{'ind': 2,
 'lbl': tensor(3.6000),
 's0': {'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0]]),
  'input_ids': tensor([[  101,  1036,  1036,  2009,  1005,  1055,  1037,  4121,  2304,  3239,
            1010,  1005,  1005,  2056,  6674,  4300, 28166,  2015, 21396,  2480,
           14859,  3781,  1012,  1010,  3005,  2155,  2038,  4758,  1996,  3259,
            2144,  6306,  1012,   102,     0,     0,     0,     0,     0,     0,
               0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
               0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
               0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
   

In [52]:
bert_model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [53]:
j=0

for data_batch in (data_ld):

    for batch_i in range(len(data_batch['s0']['input_ids'])):

        index = data_batch['ind'][batch_i].item()

        if (index % 50 ==0):
            print(".", end='')
        if (index % 1000==0):
            print(".")

        ids0 = data_batch['s0']['input_ids'][batch_i].to(device, dtype = torch.long)
        mask0 = data_batch['s0']['attention_mask'][batch_i].to(device, dtype = torch.long)
        token_type_ids0 = data_batch['s0']['token_type_ids'][batch_i].to(device, dtype = torch.long)

        ids1 = data_batch['s1']['input_ids'][batch_i].to(device, dtype = torch.long)
        mask1 = data_batch['s1']['attention_mask'][batch_i].to(device, dtype = torch.long)
        token_type_ids1 = data_batch['s1']['token_type_ids'][batch_i].to(device, dtype = torch.long)

        #print("ids0:",ids0)   
        #print(mask0)   
        #print(token_type_ids0)   
        #print("ids1:",ids1)   
        #print(mask1)   
        #print(token_type_ids0)  

        with torch.no_grad():
          vector_0 = bert_model(ids0, mask0, token_type_ids0)[0].squeeze(0).mean(axis=0, keepdim=True)
          vector_1 = bert_model(ids1, mask1, token_type_ids1)[0].squeeze(0).mean(axis=0, keepdim=True)

        vector_0 = vector_0.cpu().detach().numpy()
        vector_1 = vector_1.cpu().detach().numpy()

        #print(vector_0.shape)   

        for i in range(len(vector_0)):
          v0 = vector_0[i,:]     
          v1 = vector_1[i,:]     
          #print(v0.shape)   
          cosine = scipy.spatial.distance.cosine(v0, v1)
          ret = round((1-cosine),2)
          #print(ret)
          #print(index,ret)
          data_df.at[index,'cosine']=ret
    
    #j=j+1
    #print("j:",j)
    #if j>4:
    #  break

print("finished!!")
#data_df.head(5)

#print(f"labels:{labels}")
#print(f"scores:{scores}")
#print("pearson for small batch:",pearson_metric(scores,labels))

    

..
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
..........finished!!


In [59]:
data_t = data_df.groupby('dataset').agg({'cosine':lambda x: list(x),'lbl':lambda x: list(x)})
data_t.reset_index(level=0, inplace=True)
for i in range(len(data_t)):
  l1 = list(data_t.lbl.items())[i][1]
  l2 = list(data_t.cosine.items())[i][1]
  res = pearson_metric(l1,l2)
  print(i,res)
  data_t.at[i,'pearson']=res
data_t


0 0.320006411946197
1 0.5622344660222475
2 0.4348767942642349
3 0.5456227686134182
4 0.3710429069813618
5 0.3980330879274827
6 0.6054066186981093
7 0.5894256527058297
8 0.18707666278281884
9 0.7280248085735032
10 0.5876428860443536
11 0.4296092354224354
12 0.5815861026427618
13 0.540343232105273
14 0.5207151548686301
15 0.6306817314673456
16 0.6269587311922952
17 0.5454201762024197
18 0.43073535356224757
19 0.6534526704427118
20 0.7097849050300201
21 0.7330836113171911
22 0.2717563175040206
23 0.5012044751238144


Unnamed: 0,dataset,cosine,lbl,pearson
0,dataset-sts/data/sts/semeval-sts/all/../2012/M...,"[0.91, 0.94, 0.94, 0.9, 0.92, 0.98, 0.92, 0.9,...","[4.4, 0.8, 3.6, 3.4, 1.4, 4.6, 1.4, 3.6, 2.0, ...",0.320006
1,dataset-sts/data/sts/semeval-sts/all/../2012/O...,"[0.79, 0.88, 0.84, 0.86, 0.86, 0.89, 0.86, 0.7...","[5.0, 3.25, 3.25, 4.0, 3.25, 4.0, 3.333, 4.75,...",0.562234
2,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.92, 0.99, 0.92, 0.95, 1.0, 0.92, 0.87, 0.97...","[4.5, 5.0, 4.25, 4.5, 5.0, 5.0, 4.667, 5.0, 5....",0.434877
3,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.94, 0.88, 0.92, 0.66, 0.89, 0.9, 0.89, 0.93...","[4.0, 5.0, 5.0, 4.667, 4.5, 5.0, 4.5, 5.0, 4.0...",0.545623
4,dataset-sts/data/sts/semeval-sts/all/../2013/F...,"[0.67, 0.71, 0.7, 0.67, 0.68, 0.64, 0.59, 0.69...","[0.6, 0.8, 0.8, 1.2, 0.4, 1.8, 2.2, 0.0, 1.6, ...",0.371043
5,dataset-sts/data/sts/semeval-sts/all/../2013/O...,"[0.72, 0.89, 0.78, 0.88, 0.84, 0.86, 0.9, 0.74...","[0.8, 3.0, 3.8, 2.2, 3.8, 2.6, 0.6, 0.2, 1.0, ...",0.398033
6,dataset-sts/data/sts/semeval-sts/all/../2013/h...,"[0.81, 0.98, 0.9, 0.92, 0.89, 0.83, 0.87, 0.88...","[2.6, 4.4, 2.6, 3.8, 4.2, 3.0, 3.8, 3.2, 4.0, ...",0.605407
7,dataset-sts/data/sts/semeval-sts/all/../2014/O...,"[0.9, 0.92, 0.85, 0.89, 0.93, 0.79, 0.8, 0.9, ...","[4.0, 3.8, 4.2, 1.8, 4.0, 4.0, 3.6, 4.4, 3.0, ...",0.589426
8,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.81, 0.68, 0.94, 0.76, 0.95, 0.95, 0.91, 0.9...","[3.0, 0.8, 3.8, 1.0, 0.4, 4.6, 5.0, 2.0, 5.0, ...",0.187077
9,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.92, 0.97, 0.93, 0.91, 0.95, 0.9, 0.9, 0.92,...","[4.0, 4.2, 4.2, 3.2, 0.6, 3.6, 3.6, 2.0, 4.4, ...",0.728025


In [None]:
df_bert = evaluate_tasks_bert_batch(files,inv_frq_cosine_distance_wordembedding_method)
df_weigthed_mean.rename(columns={'pearson':'weigthed_mean'}, inplace=True)
df_results = df_results. join(df_weigthed_mean.weigthed_mean)
df_results

In [129]:
df_results

Unnamed: 0,dataset,mean_emb
0,MSRpar.test.tsv,0.332961
1,OnWN.test.tsv,0.535711
2,SMTeuroparl.test.tsv,0.372015
3,SMTnews.test.tsv,0.444419
4,FNWN.test.tsv,0.155544
5,OnWN.test.tsv,0.323323
6,headlines.test.tsv,0.474409
7,OnWN.test.tsv,0.397907
8,deft-forum.test.tsv,0.383872
9,deft-news.test.tsv,0.492019


# Albert

In [67]:
model_class, tokenizer_class, pretrained_weights = (ppb.AlbertModel, ppb.AlbertTokenizer, 'albert-base-v2')
albert_tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
albert_model = model_class.from_pretrained(pretrained_weights)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=760289.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=684.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=47376696.0, style=ProgressStyle(descrip…




In [69]:
params = {'batch_size': 128,
                'shuffle': False,
                'num_workers': 0
                }

albert_data_set = CustomDataset(data_df, albert_tokenizer, MAX_LEN)
                

albert_data_ld = DataLoader(data_set, **params)

albert_data_set[2]

{'ind': 2,
 'lbl': tensor(3.6000),
 's0': {'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0]]),
  'input_ids': tensor([[   2,   13,    7,   32,   13,   22,   18,   21, 2329,  319, 1356,   13,
             15,   13,    7,   87, 5916, 2614,   13, 4550,   18, 6065,  380, 8135,
           2000,    9,   13,   15, 1196,  190,   63, 3959,   14, 1397,  179, 6213,
             13,    9,    3,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  

In [70]:
albert_model.to(device)

AlbertModel(
  (embeddings): AlbertEmbeddings(
    (word_embeddings): Embedding(30000, 128, padding_idx=0)
    (position_embeddings): Embedding(512, 128)
    (token_type_embeddings): Embedding(2, 128)
    (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0, inplace=False)
  )
  (encoder): AlbertTransformer(
    (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
    (albert_layer_groups): ModuleList(
      (0): AlbertLayerGroup(
        (albert_layers): ModuleList(
          (0): AlbertLayer(
            (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (attention): AlbertAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0, inplace=False)
              (d

In [71]:
j=0

for data_batch in (albert_data_ld):

    for batch_i in range(len(data_batch['s0']['input_ids'])):

        index = data_batch['ind'][batch_i].item()

        if (index % 50 ==0):
            print(".", end='')
        if (index % 1000==0):
            print(".")

        ids0 = data_batch['s0']['input_ids'][batch_i].to(device, dtype = torch.long)
        mask0 = data_batch['s0']['attention_mask'][batch_i].to(device, dtype = torch.long)
        token_type_ids0 = data_batch['s0']['token_type_ids'][batch_i].to(device, dtype = torch.long)

        ids1 = data_batch['s1']['input_ids'][batch_i].to(device, dtype = torch.long)
        mask1 = data_batch['s1']['attention_mask'][batch_i].to(device, dtype = torch.long)
        token_type_ids1 = data_batch['s1']['token_type_ids'][batch_i].to(device, dtype = torch.long)

        #print("ids0:",ids0)   
        #print(mask0)   
        #print(token_type_ids0)   
        #print("ids1:",ids1)   
        #print(mask1)   
        #print(token_type_ids0)  

        with torch.no_grad():
          vector_0 = albert_model(ids0, mask0, token_type_ids0)[0].squeeze(0).mean(axis=0, keepdim=True)
          vector_1 = albert_model(ids1, mask1, token_type_ids1)[0].squeeze(0).mean(axis=0, keepdim=True)

        vector_0 = vector_0.cpu().detach().numpy()
        vector_1 = vector_1.cpu().detach().numpy()

        #print(vector_0.shape)   

        for i in range(len(vector_0)):
          v0 = vector_0[i,:]     
          v1 = vector_1[i,:]     
          #print(v0.shape)   
          cosine = scipy.spatial.distance.cosine(v0, v1)
          ret = round((1-cosine),2)
          #print(ret)
          #print(index,ret)
          data_df.at[index,'cosine']=ret
    
    #j=j+1
    #print("j:",j)
    #if j>4:
    #  break

print("finished!!")
#data_df.head(5)

#print(f"labels:{labels}")
#print(f"scores:{scores}")
#print("pearson for small batch:",pearson_metric(scores,labels))

    

..
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
.....................
..........finished!!


In [66]:
#albert-xxlarge-v2

data_t = data_df.groupby('dataset').agg({'cosine':lambda x: list(x),'lbl':lambda x: list(x)})
data_t.reset_index(level=0, inplace=True)
for i in range(len(data_t)):
  l1 = list(data_t.lbl.items())[i][1]
  l2 = list(data_t.cosine.items())[i][1]
  res = pearson_metric(l1,l2)
  print(i,res)
  data_t.at[i,'pearson']=res
data_t

0 0.15655423512446995
1 0.18431110110398852
2 0.35501891433453864
3 0.4612358031615824
4 0.1337351598175543
5 0.24812421781332009
6 0.28451691767009646
7 0.21390750715651793
8 0.16487095719800057
9 0.28680098023333156
10 0.15574145779342238
11 0.41979789372830784
12 0.14328544450499273
13 0.2818736779080512
14 0.1730356425823683
15 0.34031429043201117
16 0.28004363295110674
17 0.22045784932393075
18 0.22954100844412645
19 0.2202169351330345
20 0.4848062964244785
21 0.5237701580967015
22 0.24313439235866707
23 0.11216596762050791


Unnamed: 0,dataset,cosine,lbl,pearson
0,dataset-sts/data/sts/semeval-sts/all/../2012/M...,"[0.94, 0.96, 0.92, 0.96, 0.93, 0.99, 0.9, 0.96...","[4.4, 0.8, 3.6, 3.4, 1.4, 4.6, 1.4, 3.6, 2.0, ...",0.156554
1,dataset-sts/data/sts/semeval-sts/all/../2012/O...,"[0.79, 0.85, 0.9, 0.86, 0.81, 0.83, 0.83, 0.79...","[5.0, 3.25, 3.25, 4.0, 3.25, 4.0, 3.333, 4.75,...",0.184311
2,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.89, 0.98, 0.96, 0.9, 1.0, 0.97, 0.94, 0.96,...","[4.5, 5.0, 4.25, 4.5, 5.0, 5.0, 4.667, 5.0, 5....",0.355019
3,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.93, 0.91, 0.97, 0.6, 0.93, 0.94, 0.94, 0.98...","[4.0, 5.0, 5.0, 4.667, 4.5, 5.0, 4.5, 5.0, 4.0...",0.461236
4,dataset-sts/data/sts/semeval-sts/all/../2013/F...,"[0.78, 0.8, 0.81, 0.79, 0.85, 0.76, 0.63, 0.84...","[0.6, 0.8, 0.8, 1.2, 0.4, 1.8, 2.2, 0.0, 1.6, ...",0.133735
5,dataset-sts/data/sts/semeval-sts/all/../2013/O...,"[0.73, 0.66, 0.84, 0.65, 0.89, 0.81, 0.34, 0.8...","[0.8, 3.0, 3.8, 2.2, 3.8, 2.6, 0.6, 0.2, 1.0, ...",0.248124
6,dataset-sts/data/sts/semeval-sts/all/../2013/h...,"[0.84, 0.97, 0.84, 0.79, 0.9, 0.9, 0.94, 0.51,...","[2.6, 4.4, 2.6, 3.8, 4.2, 3.0, 3.8, 3.2, 4.0, ...",0.284517
7,dataset-sts/data/sts/semeval-sts/all/../2014/O...,"[0.74, 0.81, 0.91, 0.55, 0.77, 0.66, 0.8, 0.84...","[4.0, 3.8, 4.2, 1.8, 4.0, 4.0, 3.6, 4.4, 3.0, ...",0.213908
8,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.91, 0.92, 0.94, 0.85, 0.93, 0.96, 0.95, 0.9...","[3.0, 0.8, 3.8, 1.0, 0.4, 4.6, 5.0, 2.0, 5.0, ...",0.164871
9,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.89, 0.96, 0.96, 0.93, 0.95, 0.97, 0.96, 0.9...","[4.0, 4.2, 4.2, 3.2, 0.6, 3.6, 3.6, 2.0, 4.4, ...",0.286801


In [72]:
#albert-base-v2

data_t = data_df.groupby('dataset').agg({'cosine':lambda x: list(x),'lbl':lambda x: list(x)})
data_t.reset_index(level=0, inplace=True)
for i in range(len(data_t)):
  l1 = list(data_t.lbl.items())[i][1]
  l2 = list(data_t.cosine.items())[i][1]
  res = pearson_metric(l1,l2)
  print(i,res)
  data_t.at[i,'pearson']=res
data_t

0 0.28872626780402605
1 0.19678706263976456
2 0.4870466827249592
3 0.437361478576295
4 0.1560696404288193
5 0.11575328867648949
6 0.3289338066723639
7 0.11770020262575273
8 -0.006200079141500893
9 0.5120101000793377
10 0.3230125304856292
11 0.32785802310390577
12 0.3497321362748582
13 0.30711726482307516
14 0.4307957964550391
15 0.2806357593885825
16 0.3019806840420445
17 0.170045907491601
18 0.2888526327990853
19 0.4702967807272483
20 0.5161341952546794
21 0.36896290892723405
22 0.09967366075799787
23 0.17255642945329858


Unnamed: 0,dataset,cosine,lbl,pearson
0,dataset-sts/data/sts/semeval-sts/all/../2012/M...,"[0.95, 0.97, 0.98, 0.91, 0.95, 0.99, 0.96, 0.9...","[4.4, 0.8, 3.6, 3.4, 1.4, 4.6, 1.4, 3.6, 2.0, ...",0.288726
1,dataset-sts/data/sts/semeval-sts/all/../2012/O...,"[0.93, 0.88, 0.93, 0.92, 0.69, 0.88, 0.89, 0.4...","[5.0, 3.25, 3.25, 4.0, 3.25, 4.0, 3.333, 4.75,...",0.196787
2,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.94, 1.0, 0.98, 0.95, 1.0, 0.97, 0.98, 0.99,...","[4.5, 5.0, 4.25, 4.5, 5.0, 5.0, 4.667, 5.0, 5....",0.487047
3,dataset-sts/data/sts/semeval-sts/all/../2012/S...,"[0.98, 0.96, 0.98, 0.98, 0.97, 0.95, 0.97, 0.9...","[4.0, 5.0, 5.0, 4.667, 4.5, 5.0, 4.5, 5.0, 4.0...",0.437361
4,dataset-sts/data/sts/semeval-sts/all/../2013/F...,"[0.7, 0.87, 0.81, 0.78, 0.85, 0.78, 0.26, 0.87...","[0.6, 0.8, 0.8, 1.2, 0.4, 1.8, 2.2, 0.0, 1.6, ...",0.15607
5,dataset-sts/data/sts/semeval-sts/all/../2013/O...,"[0.92, 0.93, 0.93, 0.94, 0.89, 0.91, 0.78, 0.9...","[0.8, 3.0, 3.8, 2.2, 3.8, 2.6, 0.6, 0.2, 1.0, ...",0.115753
6,dataset-sts/data/sts/semeval-sts/all/../2013/h...,"[0.92, 0.99, 0.95, 0.98, 0.96, 0.91, 0.94, 0.9...","[2.6, 4.4, 2.6, 3.8, 4.2, 3.0, 3.8, 3.2, 4.0, ...",0.328934
7,dataset-sts/data/sts/semeval-sts/all/../2014/O...,"[0.94, 0.88, 0.89, 0.93, 0.91, 0.38, 0.88, 0.9...","[4.0, 3.8, 4.2, 1.8, 4.0, 4.0, 3.6, 4.4, 3.0, ...",0.1177
8,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.92, 0.97, 0.97, 0.97, 0.98, 0.98, 0.98, 0.9...","[3.0, 0.8, 3.8, 1.0, 0.4, 4.6, 5.0, 2.0, 5.0, ...",-0.0062
9,dataset-sts/data/sts/semeval-sts/all/../2014/d...,"[0.97, 0.98, 0.98, 0.97, 0.97, 0.95, 0.97, 0.9...","[4.0, 4.2, 4.2, 3.2, 0.6, 3.6, 3.6, 2.0, 4.4, ...",0.51201
