# Bidirectional Encoder Representations from Transformers (BERT)


Bidirectional Encoder Representations from Transformers or the BERT is a method of pre-training language representations, that is a general-purpose language  model is trained on a large pain text corpus, and then that model is used for other downstream NLP tasks. BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP models. 

### Install bert server and client using commands below in cmd prompt or terminal

pip install bert-serving-server  # server

pip install bert-serving-client  # client, independent of `bert-serving-server`

Download the BERT pretrained embeddings -  [**BERT-Base, Uncased**](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)

### Start the server with the pretrained model and keep it running (in that folder in the cmd prompt)

bert-serving-start -model_dir pretrained_embeddings\BERT\uncased_L-12_H-768_A-12

In [1]:
# imports
from bert_serving.client import BertClient
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [2]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [3]:
bc = BertClient(check_length=False)

### Load testsets for evaluation

The Automatically generated texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br>  **reference-** '../testsets/de-en/test2016.en.atok'   **prediction-** '../testsets/de-en/multi30k.test.pred.en.atok' <br>


- For **RO-EN** translation, <br>  **reference-** '../testsets/ro-en/newstest2016_ref_1000.en'  **prediction-**- '../testsets/ro-en/newstest2016_output_1000.en'<br>


- For **giga word** summarization(titles), <br>  **reference-** '../testsets/giga/task1_ref0_giga_450.txt'  **prediction-**'../testsets/giga/giga.10_300000_450.txt'


- For **CNN-DM** summariation, <br>  **reference-** '../testsets/cnn/preprocessed_1000.ref'  **prediction-** '../testsets/cnn/preprocessed_1000.pred'


- For **Duc 2003** summarization, <br>  **reference-** '../testsets/duc/task1_ref0_duc2003.txt'  **prediction-** '../testsets/duc/duc2003.10_300000.txt'

In [25]:
reference_doc = '../testsets/cnn/preprocessed_1000.ref'
prediction_doc =  '../testsets/cnn/preprocessed_1000.pred'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

###  Optional preprocessing

In [26]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [27]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, False) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, False)

### Semantic similarity scores

In [28]:
embeding_bert_ref = bc.encode(reference_en)

In [29]:
embeding_bert_pred = bc.encode(prediction_en)

In [30]:
similarity_bert =[]
for i in range(len(embeding_bert_ref)):
    similarity_bert.append(np.dot(embeding_bert_ref[i],embeding_bert_pred[i]) / (np.linalg.norm(embeding_bert_ref[i])*(np.linalg.norm(embeding_bert_pred[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [None]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [11]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../human annotated/DE-EN.xlsx'


- For **RO-EN** translation, '../human annotated/RO-EN.xlsx'


- For **giga word** summarization(titles),'../human annotated/giga.xlsx'


- For **CNN-DM** summariation, '../human annotated/CNN_1000.xlsx'


- For **Duc 2003** summarization,  '../human annotated/duc2003.xlsx'


In [8]:
human_annotation = pd.read_excel('../human annotated/CNN_1000.xlsx')

In [9]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [12]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, rouge_scores) #bleu_scores or rouge_scores

(0.14908203814975426, 2.190813725020308e-06)

In [31]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, similarity_bert) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.09988277691376685, 0.001564036550240868)