# COMPUTE SCORES

This notebook computes scores for every evaluation metric reported in the repository. 

We cover the following types of metrics:

*   ROUGE scores (measure n-gram overlap)
*   sentence-transformer based models (siamese networks)
*   pure-transformer based (takes the [CLS] token embedding of each summary and then measures cosine similarity)
*   spaCy based (document embedding of both candidate and gold summary)
*   gensim based (word2vec and GloVe)
*   BERTScore (computes pairwise contextual similarity)


**List of metrics we computed for measuring correlation with human judgements:**

1.  **R1**:	ROUGE1

2.  **R2**:	ROUGE2

3.  **RL**:	ROUGEL

4.  **CS1**:	sentence-transformers/sentence-t5-xl

5.  **CS2**:	sentence-transformers/sentence-t5-large

6.  **CS3**:	sentence-transformers/multi-qa-MiniLM-L6-cos-v1

7.  **CS4**:	sentence-transformers/distiluse-base-multiling...

8.  **CS5**:	sentence-transformers/paraphrase-MiniLM-L6-v2

9.  **CS6**:	bert-base-uncased

10. **CS7**:	roberta-base

11. **CS8**:	en_core_web_sm

12. **CS9**:	en_core_web_md

13. **CS10**:	en_core_web_lg

14. **CS11**:	word2vec-google-news-300

15. **CS12**:	glove-twitter-25

16. **BS00**:	bert-base-uncased

17. **BS01**:	bert-large-uncased

18. **BS02**:	bert-base-cased-finetuned-mrpc

19. **BS03**:	roberta-base

20. **BS04**:	roberta-large

21. **BS05**:	roberta-large-mnli

22. **BS06**:	facebook/bart-base

23. **BS07**:	facebook/bart-large

24. **BS08**:	facebook/bart-large-cnn

25. **BS09**:	facebook/bart-large-mnli

26. **BS10**:	facebook/bart-large-xsum

27. **BS11**:	t5-small

28. **BS12**:	t5-base

29. **BS13**:	t5-large

30. **BS14**:	microsoft/deberta-base

31. **BS15**:	microsoft/deberta-base-mnli

32. **BS16**:	microsoft/deberta-large

33. **BS17**:	microsoft/deberta-large-mnli

34. **BS18**:	microsoft/deberta-xlarge

35. **BS19**:	microsoft/deberta-xlarge-mnli

36. **BS20**:	google/pegasus-xsum


In [None]:
!pip install rouge-score
!pip install bert-score
!pip install datasets 
!pip install sentencepiece 
!pip install spacy==3.2
!pip install gensim
!pip install -U sentence-transformers

Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4
Collecting bert-score
  Downloading bert_score-0.3.11-py3-none-any.whl (60 kB)
Collecting transformers>=3.0.0numpy
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md 
!python -m spacy download en_core_web_lg 

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful

In [None]:
import torch 
import pandas as pd 
import matplotlib.pyplot as plt 
from rouge_score import rouge_scorer
from tqdm import tqdm 
from scipy import spatial
import spacy
import gensim.downloader as api
import re
from sentence_transformers import SentenceTransformer
from torch.nn import CosineSimilarity
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
from bert_score import BERTScorer
import gc

# Evaluation Metrics

Modular implementations of the evaluation metrics listed above.

In [None]:
class RougeScorer:
    def __init__(self):
        """
            class: RougeScorer

            computes ROUGE-1, ROUGE-2 and ROUGE-L for a given set of candidate and reference summaries.

            Parameters:
                None

            Returns:
                None

        """
        self.ROUGE1 = 'rouge1'
        self.ROUGE2 = 'rouge2'
        self.ROUGEL = 'rougeL'
        self.scorer = rouge_scorer.RougeScorer([self.ROUGE1, self.ROUGE2, self.ROUGEL], use_stemmer=True)

    def pred(self, cands, refs):
        """
            method: RougeScorer.pred

            computes ROUGE-1, ROUGE-2 and ROUGE-L for a given set of candidate and reference summaries.

            Parameters:
                cands (array-like) : list of candidate summaries
                refs (array-like)  : list of reference summaries

            Returns:
                ROUGE-1 (List(float)): f-measure of rouge-1 score
                ROUGE-2 (List(float)): f-measure of rouge-2 score
                ROUGE-L (List(float)): f-measure of rouge-l score

        """
        rouge1, rouge2, rougeL = [], [], []
        for cand, ref in tqdm(zip(cands, refs)):
            # rouge_score expects score(target, prediction); the F-measure is the same
            # in either order, but precision and recall would be swapped
            scores = self.scorer.score(ref, cand)
            rouge1.append(scores[self.ROUGE1].fmeasure)
            rouge2.append(scores[self.ROUGE2].fmeasure)
            rougeL.append(scores[self.ROUGEL].fmeasure)
            gc.collect()

        return rouge1, rouge2, rougeL

In [None]:
class SpacyModel: 
    def __init__(self, model_name):
        """
            class: SpacyModel
    
            computes cosine similarity of doc embedding of given set of candidate and reference summaries.
    
            Parameters:
                model_name (str): name of spaCy model, e.g. en_core_web_sm, en_core_web_md, en_core_web_lg
    
            Returns:
                None
  
        """
        self.model = spacy.load(model_name)
    
    def pred(self, cands, refs): 
        """
            method: SpacyModel.pred
    
            computes cosine similarity of doc embedding of given set of candidate and reference summaries.
    
            Parameters:
                cands (array-like) : list of candidate summaries 
                refs (array-like)  : list of reference summaries

            Returns:
                cs (list(float)): list of cosine similarity of computed doc embeddings
  
        """
        cs = []
        for cand, ref in tqdm(zip(cands, refs)): 
            cand_e, ref_e = self.model(cand).vector, self.model(ref).vector
            cs.append(1 - spatial.distance.cosine(cand_e, ref_e))
            gc.collect()

        return cs 

In [None]:
class GensimModel: 
    def __init__(self, model_name):
        """
            class: GensimModel

            computes cosine similarity of doc embedding of given set of candidate and reference summaries. 
            doc embeddings are computed using avg of token embeddings. 

            Parameters:
                model_name (str): name of gensim model, e.g. word2vec-google-news-300, glove-twitter-25

            Returns:
                None

        """
        self.wv = api.load(model_name)

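    def gen_embedding(self, sent): 
        # average the embeddings of in-vocabulary tokens; out-of-vocabulary tokens are skipped
        n = 0
        total = 0
        for tok in sent:
            if tok in self.wv:
                total += self.wv[tok]
                n += 1
        # the small constant avoids division by zero when no token is in the vocabulary
        return total/(n + 1e-6)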

    def preprocess(self, sent): 
        words = sent.lower().split()
        regex = re.compile('[^a-zA-Z]')

        processed = []
        for word in words:
            #First parameter is the replacement, second parameter is your input string
            res = regex.sub('', word)
            if res != '':
                processed.append(res)
        
        return processed

    def pred(self, cands, refs): 
        """
            method: GensimModel.pred
    
            computes cosine similarity of doc embedding of given set of candidate and reference summaries. we get the doc embedding by taking the average of token embeddings.
    
            Parameters:
                cands (array-like) : list of candidate summaries 
                refs (array-like)  : list of reference summaries

            Returns:
                cs (list(float)): list of cosine similarity of computed doc embeddings
  
        """
        cs = []
        for cand, ref in tqdm(zip(cands, refs)): 
            cand_emb = self.gen_embedding(self.preprocess(cand))
            ref_emb = self.gen_embedding(self.preprocess(ref))
            cs.append(1 - spatial.distance.cosine(cand_emb, ref_emb))
            gc.collect()

        return cs


In [None]:
class SentenceTransformerModel:
    def __init__(self, model_name, device): 
        """
            class: SentenceTransformerModel

            computes cosine similarity of doc embedding of given set of candidate and reference summaries. 
            doc embeddings are computed using sentence transformer models. 

            Parameters:
                model_name (str): name of sentence-transformer model, e.g. sentence-transformers/sentence-t5-xl
                device (torch.device): torch device object (cuda/cpu)

            Returns:
                None

        """
        self.model = SentenceTransformer(model_name).to(device)
        self.model.eval()

    def pred(self, cands, refs): 
        """
            method: SentenceTransformerModel.pred
    
            computes cosine similarity of doc embedding of given set of candidate and reference summaries. we get doc embedding by sentence-transformer models. 
    
            Parameters:
                cands (array-like) : list of candidate summaries 
                refs (array-like)  : list of reference summaries

            Returns:
                cs (list(float)): list of cosine similarity of computed doc embeddings
  
        """
        with torch.no_grad():
            cand_e, ref_e = self.model.encode(cands), self.model.encode(refs)
        
        gc.collect()
        return CosineSimilarity()(torch.tensor(cand_e), torch.tensor(ref_e)).detach().numpy()

In [None]:
class PureTransformerModel: 
    def __init__(self, model_name, device, model_type='bert'):

        """
            class: PureTransformerModel

            computes cosine similarity of doc embedding of given set of candidate and reference summaries. 
            doc embeddings are taken from the first token of the given string ([CLS] for BERT, <s> for RoBERTa). 

            Parameters:
                model_name (str): name of transformer model, e.g. bert-base-uncased 
                device (torch.device): torch device object (cuda/cpu)
                model_type (str): type of model being used (bert/roberta)

            Returns:
                None

        """

        self.device = device

        if model_type == 'bert': 
            self.model = BertModel.from_pretrained(model_name).to(device)
            self.tokenizer = BertTokenizer.from_pretrained(model_name)
        else: 
            self.model = RobertaModel.from_pretrained(model_name).to(device)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_name)

        self.model.eval()

    def pred(self, cands, refs):
        """
            method: PureTransformerModel.pred
    
            computes cosine similarity of doc embedding of given set of candidate and reference summaries.
            doc embeddings are computed using [CLS] token of given string. 
    
            Parameters:
                cands (array-like) : list of candidate summaries 
                refs (array-like)  : list of reference summaries

            Returns:
                cs (list(float)): list of cosine similarity of computed doc embeddings
  
        """
        cs = []
        with torch.no_grad():
            for cand, ref in tqdm(zip(cands, refs)):
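                # truncation=True keeps inputs within the model's maximum sequence length
                cand_toks = self.tokenizer(cand, padding=True, truncation=True, return_tensors="pt").to(self.device)
                ref_toks  = self.tokenizer(ref, padding=True, truncation=True, return_tensors="pt").to(self.device)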

                cand_embs = self.model(**cand_toks).last_hidden_state
                ref_embs = self.model(**ref_toks).last_hidden_state
                gc.collect()

                cs.append(CosineSimilarity()(cand_embs[0][0].unsqueeze(dim=0), ref_embs[0][0].unsqueeze(dim=0)).cpu().numpy()[0])
        return cs

In [None]:
class BertScoreModel: 
    def __init__(self, model_name, device):
        """
            class: BertScoreModel

            computes BERTScore, i.e. pairwise cosine similarity between the token embeddings of candidate and reference summaries. token embeddings are computed using transformer models. 

            Parameters:
                model_name (str): name of transformer model, e.g. bert-base-uncased 
                device (torch.device): torch device object (cuda/cpu)

            Returns:
                None

        """
        self.bert_scorer = BERTScorer(model_type = model_name, device = device, lang="en")

    def pred(self, cands, refs): 
        """
            method: BertScoreModel.pred
    
            computes the BERTScore F1 for each candidate/reference pair
    
            Parameters:
                cands (array-like) : list of candidate summaries 
                refs (array-like)  : list of reference summaries

            Returns:
                f1 (list(float)): list of BERTScore F1 values, one per pair
  
        """

        with torch.no_grad():
            _, _, f1 = self.bert_scorer.score(cands, refs)
            
        gc.collect()
        return f1.detach().tolist()

In [None]:
def printAlias(dct):
    '''
        transforms given dict object into pandas dataframe

        Parameters:
            dct (dictionary object): (alias, model_name) pairs

        Returns:
            dataframe object: two columns namely, alias and model
    '''
    return pd.DataFrame({'alias':dct.keys(), 'model':dct.values()})

## Device to run the computation (cuda/cpu)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Reading the base CSV file

In [None]:
# location of base data csv file 
# change it according to your settings 
file_loc = '/content/drive/MyDrive/tarang_bertscore/DATA.csv'

scores = pd.read_csv(file_loc).drop(['Unnamed: 0'], axis=1).sort_values(by=['summaryID'])
data = scores.loc[:, ['summaryID', 'title','candidate','gold']]
data.head(2)

Unnamed: 0,summaryID,title,candidate,gold
0,0,Daman & Diu revokes mandatory Rakshabandhan in...,The Daman and Diu administration on Wednesday ...,The Administration of Union Territory Daman an...
1,1,Malaika slams user who trolled her for 'divorc...,Malaika Arora Khan is the brand ambassador of ...,Malaika Arora slammed an Instagram user who tr...


# ROUGE SCORES
ROUGE measures the n-gram overlap between a candidate summary and the gold summary. It can be reported as precision, recall or F1; we use the F-measure of ROUGE-1, ROUGE-2 and ROUGE-L.
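
To make the overlap computation concrete, here is a minimal sketch of ROUGE-N precision, recall and F1 on whitespace tokens. The function `rouge_n` and the example sentences are illustrative assumptions; unlike the `rouge_score` package used in this notebook, the sketch applies no stemming and uses a naive tokenizer, so its numbers will not match the library exactly.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N on whitespace tokens; returns (precision, recall, f1)."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # clipped overlap: an n-gram counts at most as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 5 of the 6 candidate unigrams also occur in the reference
p, r, f = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.833 0.833 0.833
```

Calling `rouge_n(..., n=2)` gives the bigram variant of the same computation.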

In [None]:
rougeScoreResuts = dict()

rougeScoreModelsList = [
    'ROUGE1',
    'ROUGE2',
    'ROUGEL' 
]
rougeScoreModels = dict(zip(['R1', 'R2', 'RL'], rougeScoreModelsList))

printAlias(rougeScoreModels)

Unnamed: 0,alias,model
0,R1,ROUGE1
1,R2,ROUGE2
2,RL,ROUGEL


In [None]:
rougeScorer = RougeScorer()
r1, r2, rl = rougeScorer.pred(data.candidate.tolist(), data.gold.tolist())

rougeScoreResuts['R1'] = r1
rougeScoreResuts['R2'] = r2
rougeScoreResuts['RL'] = rl

1001it [02:53,  5.77it/s]


# SENTENCE_TRANSFORMERS

Sentence transformers are a family of transformer models available on the Hugging Face Hub. They are trained with a siamese-style architecture: the same encoder embeds both documents, and the resulting document embeddings are then compared.
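
As a rough illustration of the comparison step, the sketch below scores the i-th candidate embedding only against the i-th reference embedding, which is the row-wise pairing used for these metrics. The helper names and the 3-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def paired_cosine(cand_embs, ref_embs):
    """Score the i-th candidate against the i-th reference only (not all pairs)."""
    if len(cand_embs) != len(ref_embs):
        raise ValueError("candidates and references must be aligned one-to-one")
    return [cosine(c, r) for c, r in zip(cand_embs, ref_embs)]

# hypothetical 3-dimensional embeddings standing in for encoder outputs
cand_embs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ref_embs = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
scores = paired_cosine(cand_embs, ref_embs)
print(scores)  # [1.0, 0.0]
```

Cosine similarity ignores vector length, so the first pair scores 1.0 even though the two vectors differ in magnitude.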

In [None]:
sentenceTransformerResults = dict()

sentenceTransformerModelsList = [
    'sentence-transformers/sentence-t5-xl',
    'sentence-transformers/sentence-t5-large',
    'sentence-transformers/multi-qa-MiniLM-L6-cos-v1',
    'sentence-transformers/distiluse-base-multilingual-cased-v1', 
    'sentence-transformers/paraphrase-MiniLM-L6-v2'
]

sentenceTransformerModels = dict(zip(['CS1', 'CS2', 'CS3', 'CS4','CS5'], sentenceTransformerModelsList))
printAlias(sentenceTransformerModels)

Unnamed: 0,alias,model
0,CS1,sentence-transformers/sentence-t5-xl
1,CS2,sentence-transformers/sentence-t5-large
2,CS3,sentence-transformers/multi-qa-MiniLM-L6-cos-v1
3,CS4,sentence-transformers/distiluse-base-multiling...
4,CS5,sentence-transformers/paraphrase-MiniLM-L6-v2


In [None]:
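for alias, model_name in sentenceTransformerModels.items(): 
    print(f'{alias} : {model_name}')
    stm = SentenceTransformerModel(model_name, device)
    # pass plain lists, as in the other metric cells, so the texts are indexed by position rather than by DataFrame index label
    sentenceTransformerResults[alias] = stm.pred(data.candidate.tolist(), data.gold.tolist())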

CS1 : sentence-transformers/sentence-t5-xl
CS2 : sentence-transformers/sentence-t5-large
CS3 : sentence-transformers/multi-qa-MiniLM-L6-cos-v1
CS4 : sentence-transformers/distiluse-base-multilingual-cased-v1
CS5 : sentence-transformers/paraphrase-MiniLM-L6-v2

# PURE_TRANSFORMER_MODELS 

Pure transformer models are vanilla transformer encoders. They use a sub-word tokenizer to tokenize the given document and produce tokens in the form **List([CLS],  TOKENS,  [SEP])** (RoBERTa uses `<s>` and `</s>` in the same positions). 

We take the final hidden state of the [CLS] token as the document embedding, since self-attention lets it aggregate information from every token in the sequence. 
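
The sketch below contrasts taking the vector at position 0 ([CLS] pooling, as in this notebook) with averaging all token vectors. The token list and the 2-dimensional hidden states are made-up values for illustration, not real model outputs.

```python
def cls_pooling(hidden_states):
    """Use the vector at position 0 (the [CLS] token) as the document embedding."""
    return hidden_states[0]

def mean_pooling(hidden_states):
    """Average all token vectors, shown only for comparison."""
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / len(hidden_states) for d in range(dim)]

tokens = ["[CLS]", "good", "summary", "[SEP]"]
# hypothetical 2-dimensional final-layer hidden states, one row per token
hidden_states = [[0.5, 0.1], [0.9, 0.3], [0.1, 0.7], [0.0, 0.2]]

doc_cls = cls_pooling(hidden_states)    # the row aligned with "[CLS]"
doc_mean = mean_pooling(hidden_states)  # element-wise mean over all four rows
print(doc_cls, doc_mean)
```

The two strategies generally give different document vectors, so scores from one are not directly comparable with scores from the other.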

In [None]:
pureTransformerResults = dict()
pureTransformerModelslist = [
    ['bert-base-uncased', 'bert'], 
    ['roberta-base', 'roberta']                         
]

pureTransformerModels = dict(zip(['CS6', 'CS7'], pureTransformerModelslist))
printAlias(pureTransformerModels)

Unnamed: 0,alias,model
0,CS6,"[bert-base-uncased, bert]"
1,CS7,"[roberta-base, roberta]"


In [None]:
for alias, model_name in pureTransformerModels.items():
    model_name, model_type = model_name
    print(f'{alias} : {model_name}')
    
    pt = PureTransformerModel(model_name, device, model_type)
    pureTransformerResults[alias] = pt.pred(data.candidate.tolist(), data.gold.tolist())

CS6 : bert-base-uncased


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


1001it [03:33,  4.69it/s]


CS7 : roberta-base


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


1001it [03:27,  4.82it/s]


# SPACY MODELS 

We used the [en_core_web_sm, en_core_web_md, en_core_web_lg] models from spaCy to compute the document embeddings. 

In [None]:
spacyResults = dict()

spacyModelsList = ['en_core_web_sm', 'en_core_web_md','en_core_web_lg']
spacyModels = dict(zip(['CS8', 'CS9', 'CS10'], spacyModelsList))

printAlias(spacyModels)

Unnamed: 0,alias,model
0,CS8,en_core_web_sm
1,CS9,en_core_web_md
2,CS10,en_core_web_lg


In [None]:
for alias, model_name in spacyModels.items():
    print(f'{alias} : {model_name}')
    sm = SpacyModel(model_name)
    spacyResults[alias] = sm.pred(data.candidate.tolist(), data.gold.tolist())

CS8 : en_core_web_sm


1001it [03:37,  4.60it/s]


CS9 : en_core_web_md


1001it [03:39,  4.55it/s]


CS10 : en_core_web_lg


1001it [03:39,  4.56it/s]


# GENSIM MODELS 

We used [word2vec-google-news-300, glove-twitter-25] from gensim to look up token embeddings, then averaged the embeddings of the in-vocabulary tokens to represent the given document. 
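
A minimal sketch of this averaging step, assuming a tiny hypothetical vocabulary in place of the real word2vec/GloVe vectors. Unlike `GensimModel.gen_embedding`, which divides by `n + 1e-6`, this version returns `None` when no token is in the vocabulary; that is a design choice of the sketch, not the notebook's behavior.

```python
import math

# hypothetical 2-dimensional word vectors standing in for word2vec/GloVe lookups
word_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "sat": [0.0, 1.0]}

def average_embedding(tokens, vectors):
    """Mean of the vectors of in-vocabulary tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# "the" is out of vocabulary and is skipped in both sentences
cand = average_embedding("the cat sat".split(), word_vectors)
ref = average_embedding("the dog sat".split(), word_vectors)
score = cosine(cand, ref)
print(cand, ref, round(score, 4))
```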

In [None]:
gensimResults = dict()

gensimModelsList = ['word2vec-google-news-300', 'glove-twitter-25']
gensimModels = dict(zip(['CS11', 'CS12'], gensimModelsList))

printAlias(gensimModels)

Unnamed: 0,alias,model
0,CS11,word2vec-google-news-300
1,CS12,glove-twitter-25


In [None]:
for alias, model_name in gensimModels.items():
    print(f'{alias} : {model_name}')
    model = GensimModel(model_name)
    gensimResults[alias] = model.pred(data.candidate.tolist(), data.gold.tolist())

CS11 : word2vec-google-news-300


1001it [11:12,  1.49it/s]


CS12 : glove-twitter-25


1001it [06:13,  2.68it/s]


# BERTSCORE_MODELS 

We used the models listed below with BERTScore to compute similarity scores. At its core, BERTScore computes the pairwise cosine similarity between the contextual token embeddings of the candidate and the reference; we report the resulting F1. 
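
The greedy matching behind BERTScore can be sketched as follows. This is a simplified illustration under stated assumptions: the 2-dimensional token embeddings are invented, and the sketch omits parts of the real `bert_score` package such as layer selection, optional idf weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(cand_embs, ref_embs):
    """BERTScore-style precision, recall and F1 from token embeddings."""
    # precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_embs) for c in cand_embs) / len(cand_embs)
    # recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_embs) for r in ref_embs) / len(ref_embs)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical token embeddings: 2 candidate tokens, 3 reference tokens
cand_embs = [[1.0, 0.0], [0.0, 1.0]]
ref_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p, r, f1 = greedy_match_f1(cand_embs, ref_embs)
print(round(p, 3), round(r, 3), round(f1, 3))  # 1.0 0.667 0.8
```

Every candidate token has an exact match, so precision is 1.0, while the third reference token has no good match in the candidate and pulls recall down to 2/3.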

In [None]:
bertScoreModelsList = [
    'bert-base-uncased', 
    'bert-large-uncased',
    'bert-base-cased-finetuned-mrpc', 
    'roberta-base', 
    'roberta-large',
    'roberta-large-mnli',
    'facebook/bart-base',
    'facebook/bart-large',
    'facebook/bart-large-cnn',
    'facebook/bart-large-mnli',
    'facebook/bart-large-xsum',
    't5-small',
    't5-base',
    't5-large',
    'microsoft/deberta-base', 
    'microsoft/deberta-base-mnli', 
    'microsoft/deberta-large', 
    'microsoft/deberta-large-mnli', 
    'microsoft/deberta-xlarge', 
    'microsoft/deberta-xlarge-mnli', 
    'google/pegasus-xsum', 
]

bertScoreResults = dict()
# aliases BS00, BS01, ... with zero-padded two-digit indices
bertScoreModels = dict(zip([f'BS{i:02d}' for i in range(len(bertScoreModelsList))], bertScoreModelsList))
printAlias(bertScoreModels)

Unnamed: 0,alias,model
0,BS00,bert-base-uncased
1,BS01,bert-large-uncased
2,BS02,bert-base-cased-finetuned-mrpc
3,BS03,roberta-base
4,BS04,roberta-large
5,BS05,roberta-large-mnli
6,BS06,facebook/bart-base
7,BS07,facebook/bart-large
8,BS08,facebook/bart-large-cnn
9,BS09,facebook/bart-large-mnli


In [None]:
for alias, model_name in bertScoreModels.items():
    print(f'{alias} : {model_name}')
    bsm = BertScoreModel(model_name, device)
    bertScoreResults[alias] = bsm.pred(data.candidate.tolist(), data.gold.tolist())

BS00 : bert-base-uncased


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS01 : bert-large-uncased


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS02 : bert-base-cased-finetuned-mrpc


Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS03 : roberta-base


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS04 : roberta-large


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS05 : roberta-large-mnli


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaModel: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS06 : facebook/bart-base


BS07 : facebook/bart-large


BS08 : facebook/bart-large-cnn


BS09 : facebook/bart-large-mnli


Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS10 : facebook/bart-large-xsum


BS11 : t5-small


Some weights of T5EncoderModel were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BS12 : t5-base


Some weights of T5EncoderModel were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BS13 : t5-large


Some weights of T5EncoderModel were not initialized from the model checkpoint at t5-large and are newly initialized: ['encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BS14 : microsoft/deberta-base


Some weights of the model checkpoint at microsoft/deberta-base were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS15 : microsoft/deberta-base-mnli


Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaModel: ['pooler.dense.bias', 'classifier.bias', 'config', 'classifier.weight', 'pooler.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS16 : microsoft/deberta-large


Some weights of the model checkpoint at microsoft/deberta-large were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS17 : microsoft/deberta-large-mnli


Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaModel: ['pooler.dense.bias', 'classifier.bias', 'config', 'classifier.weight', 'pooler.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS18 : microsoft/deberta-xlarge


Some weights of the model checkpoint at microsoft/deberta-xlarge were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS19 : microsoft/deberta-xlarge-mnli


Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaModel: ['pooler.dense.bias', 'classifier.bias', 'classifier.weight', 'pooler.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BS20 : google/pegasus-xsum


Some weights of the model checkpoint at google/pegasus-xsum were not used when initializing PegasusModel: ['final_logits_bias']
- This IS expected if you are initializing PegasusModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PegasusModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PegasusModel were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# RECONCILIATION OF DATA

In this step, we combine the scores computed in each of the steps above into a single table and attach the human annotation columns.

In [None]:
finalData = {**rougeScoreResuts, **sentenceTransformerResults, **pureTransformerResults, **spacyResults, **gensimResults, **bertScoreResults}
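
Merging with `**` unpacking lets a later dictionary silently overwrite an earlier one whenever both share a key. A minimal sketch of a guard against that, assuming each results dictionary maps a metric name (such as `R1` or `BS00`) to a list of scores; `merge_disjoint` and the toy dictionaries are hypothetical and not part of the notebook:

```python
def merge_disjoint(*result_dicts):
    """Merge dictionaries, raising if any key appears more than once."""
    merged = {}
    for d in result_dicts:
        overlap = merged.keys() & d.keys()
        if overlap:
            raise ValueError(f"duplicate metric names: {sorted(overlap)}")
        merged.update(d)
    return merged


# Toy stand-ins for the per-metric results (hypothetical values).
rouge_demo = {"R1": [0.41, 0.37], "R2": [0.18, 0.15]}
bert_demo = {"BS00": [0.88, 0.85]}

merged_demo = merge_disjoint(rouge_demo, bert_demo)
print(list(merged_demo))
```

If the metric names are known to be unique, the plain `{**a, **b}` merge is equivalent and shorter.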

In [None]:
import pickle 

In [None]:
with open('finalData.pickle', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(finalData, f)
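
The saved file can be read back in a later session with `pickle.load`, which avoids recomputing every metric. A minimal round-trip sketch, assuming the object is a plain dictionary of lists; the toy dictionary and the temporary directory are hypothetical stand-ins for `finalData` and the working directory:

```python
import os
import pickle
import tempfile

# Toy stand-in for finalData (hypothetical values).
final_demo = {"R1": [0.41, 0.37], "BS00": [0.88, 0.85]}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalData.pickle")
    with open(path, "wb") as f:
        pickle.dump(final_demo, f)
    with open(path, "rb") as f:
        restored_demo = pickle.load(f)

# True when the restored object matches what was saved.
print(restored_demo == final_demo)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.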

In [None]:
# Add each metric's scores to `data` as a new column named after the metric.
for k, v in finalData.items():
    data[k] = v
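
Assigning a list as a DataFrame column raises an error when its length differs from the number of rows, so it can help to check all lengths up front and report every mismatch at once. A sketch in plain Python, assuming each value in `finalData` is a sequence with one score per summary; `find_length_mismatches`, `n_rows`, and the toy data are hypothetical:

```python
def find_length_mismatches(results, n_rows):
    """Return {metric: length} for metrics whose score count differs from n_rows."""
    return {
        name: len(values)
        for name, values in results.items()
        if len(values) != n_rows
    }


# Toy results: R2 is missing one score (hypothetical values).
results_demo = {
    "R1": [0.41, 0.37, 0.52],
    "R2": [0.18, 0.15],
    "BS00": [0.88, 0.85, 0.90],
}

mismatches = find_length_mismatches(results_demo, n_rows=3)
print(mismatches)
```

In the notebook, `n_rows` would correspond to `len(data)`.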

In [None]:
data.columns 

Index(['summaryID', 'title', 'candidate', 'gold', 'R1', 'R2', 'RL', 'CS1',
       'CS2', 'CS3', 'CS4', 'CS5', 'CS6', 'CS7', 'CS8', 'CS9', 'CS10', 'CS11',
       'CS12', 'BS00', 'BS01', 'BS02', 'BS03', 'BS04', 'BS05', 'BS06', 'BS07',
       'BS08', 'BS09', 'BS10', 'BS11', 'BS12', 'BS13', 'BS14', 'BS15', 'BS16',
       'BS17', 'BS18', 'BS19', 'BS20'],
      dtype='object')

In [None]:
annotation_columns = ['url', 'grammatical_correctness_1',
       'arrangement_1', 'quality_1', 'conciseness_1', 'exhaustiveness_1',
       'subjectiveScore_1', 'annotator_1', 'grammatical_correctness_2',
       'arrangement_2', 'quality_2', 'conciseness_2', 'exhaustiveness_2',
       'subjectiveScore_2', 'annotator_2'] 

In [None]:
# Copy the human annotation columns from `scores` into `data`.
for column in annotation_columns:
    data[column] = scores[column]
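
When `scores[column]` is a pandas Series, this assignment aligns rows by index label rather than by position, so any row of `data` whose label is missing from `scores` receives `NaN`. A small sketch with toy frames, assuming `data` and `scores` are both pandas DataFrames (the frame contents here are hypothetical):

```python
import pandas as pd

data_demo = pd.DataFrame({"summaryID": [10, 11, 12]})  # index labels 0, 1, 2
scores_demo = pd.DataFrame({"quality_1": [4, 5, 3]}, index=[0, 2, 3])

# Assignment aligns on index labels: label 1 is absent from scores_demo.
data_demo["quality_1"] = scores_demo["quality_1"]
print(data_demo["quality_1"].isna().tolist())

# Positional assignment, valid only if both frames share the same row order.
data_demo["quality_1_by_position"] = scores_demo["quality_1"].to_numpy()
print(data_demo["quality_1_by_position"].tolist())
```

If both frames were loaded from the same file in the same order, the two approaches give the same result.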

In [None]:
data.columns

Index(['summaryID', 'title', 'candidate', 'gold', 'R1', 'R2', 'RL', 'CS1',
       'CS2', 'CS3', 'CS4', 'CS5', 'CS6', 'CS7', 'CS8', 'CS9', 'CS10', 'CS11',
       'CS12', 'BS00', 'BS01', 'BS02', 'BS03', 'BS04', 'BS05', 'BS06', 'BS07',
       'BS08', 'BS09', 'BS10', 'BS11', 'BS12', 'BS13', 'BS14', 'BS15', 'BS16',
       'BS17', 'BS18', 'BS19', 'BS20', 'url', 'grammatical_correctness_1',
       'arrangement_1', 'quality_1', 'conciseness_1', 'exhaustiveness_1',
       'subjectiveScore_1', 'annotator_1', 'grammatical_correctness_2',
       'arrangement_2', 'quality_2', 'conciseness_2', 'exhaustiveness_2',
       'subjectiveScore_2', 'annotator_2'],
      dtype='object')

In [None]:
# Change the output path to match your environment.
data.to_csv('/content/drive/MyDrive/tarang_bertscore/DATA_with_scores.csv')
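
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read with `read_csv`. A sketch of the difference, using a toy frame and a temporary file (both hypothetical):

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"summaryID": [10, 11], "R1": [0.41, 0.37]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "DATA_with_scores.csv")

    # Default behaviour: the index is written as an unnamed first column.
    demo.to_csv(path)
    with_index = pd.read_csv(path)

    # Passing index=False skips the index when writing.
    demo.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, the default output can be read back with `pd.read_csv(path, index_col=0)` to restore the index.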