# Bidirectional Encoder Representations from Transformers (BERT)


Bidirectional Encoder Representations from Transformers or the BERT is a method of pre-training language representations, that is a general-purpose language  model is trained on a large pain text corpus, and then that model is used for other downstream NLP tasks. BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP models. 

### Install bert server and client using commands below in cmd prompt or terminal

pip install bert-serving-server  # server

pip install bert-serving-client  # client, independent of `bert-serving-server`

Download the BERT pretrained embeddings -  [**BERT-Base, Uncased**](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) and save it in the folder 'Pretrained_embeddings'

### Start the server with the pretrained model and keep it running (in the cmd prompt or terminal)

bert-serving-start -model_dir Pretrained_embeddings\BERT\uncased_L-12_H-768_A-12

In [2]:
# imports

from bert_serving.client import BertClient
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [3]:
# imports for preprocessing

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
bc = BertClient(check_length=False)

### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [10]:
candidate_doc =  '../Testsets/DE-EN/multi30k.test.pred.en.atok'  
reference_doc = '../Testsets/DE-EN/test2016.en.atok' 

with  open( candidate_doc ,'r') as cand, open( reference_doc ,'r') as ref:
    candidate_en = cand.readlines()
    reference_en = ref.readlines()   

In [6]:
candidate_en[:5]

['A man in an orange hat presenting something .\n',
 'A Boston traveler runs across lush , green fence in front of a white fence .\n',
 'A girl in a karate uniform is blocking a board with a kick .\n',
 'Five people in winter jackets and helmets are standing in the snow with vials in the background .\n',
 'People moving off the roof of a house .\n']

In [7]:
reference_en[:5]

['A man in an orange hat starring at something .\n',
 'A Boston Terrier is running on lush green grass in front of a white fence .\n',
 'A girl in karate uniform breaking a stick with a front kick .\n',
 'Five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .\n',
 'People are fixing the roof of a house .\n']

###  Optional preprocessing

In [8]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [9]:
# use only if you want to preprocess the sentences

candidate_en = preprocessing(candidate_en, False) # True to remove stopwords, default only removes punctuation
reference_en = preprocessing(reference_en, False) 

### Semantic similarity scores

In [11]:
embeding_bert_cand = bc.encode(candidate_en)

In [12]:
embeding_bert_ref = bc.encode(reference_en)

In [14]:
# Cosine similarity function

similarity_bert =[]
for i in range(len(embeding_bert_ref)):
    similarity_bert.append(np.dot(embeding_bert_cand[i],embeding_bert_ref[i]) / (np.linalg.norm(embeding_bert_cand[i])*(np.linalg.norm(embeding_bert_ref[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [15]:
# for machine translation evaluation

bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(candidate_en[i],reference_en[i], smoothing_function=smoother.method4))

In [11]:
# for text summarization evaluation

rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(candidate_en[i], reference_en[i], 1) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [16]:
human_annotation = pd.read_excel('../Human annotations/DE-EN.xlsx')

In [17]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [18]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, bleu_scores) #bleu_scores or rouge_scores

(0.3901802069640419, 1.0390848166845472e-37)

In [19]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, similarity_bert) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.584123098114762, 1.559683843723382e-92)