# **NLP tasks using BERT**


---




BERT (Bidirectional Encoder Representations from Transformers) is a language representation model introduced in a paper by researchers at Google AI Language. It caused a stir in the machine learning community by presenting state-of-the-art results on a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference (MNLI), and others.


BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.


## Masked LM (MLM)

Before feeding token sequences into BERT, 15% of the tokens in each sequence are selected for prediction. Of the selected tokens, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence.

The BERT loss function takes into consideration only the predictions at the masked positions and ignores the predictions for the non-masked tokens. As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness (see Takeaways #3).
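The masking step can be sketched in plain Python. This is a minimal illustration, not the original pre-training code: the helper `mask_tokens`, the toy vocabulary, and the per-token coin flip (instead of selecting exactly 15% of positions) are assumptions made for the example. The label value `-100` follows the PyTorch convention for positions that the cross-entropy loss should skip.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using the 80/10/10 replacement rule."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)  # 10%: keep the original token
        else:
            labels.append(IGNORE)  # this position is excluded from the loss
            corrupted.append(tok)
    return corrupted, labels


tokens = "the capital of france is paris".split() * 50
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
selected = sum(label != IGNORE for label in labels)
print(f"{selected} of {len(tokens)} positions selected for prediction")
```

Only the positions whose label is not `-100` would be scored, which is what "ignores the prediction of the non-masked words" means in practice.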

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
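# Labels are the token IDs of the unmasked sentence. Passed as-is, every position is scored;
# to score only the masked position, set the labels of all other positions to -100.
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]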

outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

print(loss)
print(logits)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor(4.1324, grad_fn=<NllLossBackward>)
tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
         [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7618],
         [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
         ...,
         [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
         [-11.6598, -11.4274, -11.9267,  ...,  -9.8772, -10.2103,  -4.7594],
         [-11.7266, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]],
       grad_fn=<AddBackward0>)


## Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
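The 50/50 sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the original data pipeline: `make_nsp_pairs` is a hypothetical helper, and the negative sentence is drawn from a different document so that it cannot accidentally be the true next sentence.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from lists of sentences."""
    rng = random.Random(seed)
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the actual next sentence.
                pairs.append((doc[i], doc[i + 1], 1))
            else:
                # Negative example: a random sentence from another document
                # (assumes the corpus has at least two documents).
                other = rng.choice([d for j, d in enumerate(documents) if j != doc_idx])
                pairs.append((doc[i], rng.choice(other), 0))
    return pairs


documents = [
    ["I went to the store.", "I bought some milk.", "Then I walked home."],
    ["BERT is a language model.", "It is trained on unlabeled text."],
]
for a, b, is_next in make_nsp_pairs(documents):
    print(is_next, "|", a, "->", b)
```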

In [None]:
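# Note: this cell uses BertForMultipleChoice, which scores each (prompt, choice) pair with a
# newly initialized classifier. The pretrained NSP head itself is exposed as BertForNextSentencePrediction.
from transformers import BertTokenizer, BertForMultipleChoice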
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels)  # batch size is 1

# the linear classifier is freshly initialized and still needs to be trained,
# so the logits printed below are not meaningful yet
loss = outputs.loss
logits = outputs.logits
print(logits)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly

tensor([[0.4631, 0.5342]], grad_fn=<ViewBackward>)


## Text Summarization

Extractive summarization identifies the most important sentences in a text and returns them verbatim, so the summary is a subset of the sentences in the original text. Abstractive summarization instead interprets the text and uses natural language generation to produce a new, shorter text that conveys the most critical information from the original in different words.

Abstractive summarization is closer to how a human would summarize. Though it has more potential (and is generally more interesting for researchers and developers), so far the more traditional extractive methods have often yielded better results.
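To make the "extractive" idea concrete, here is a toy sketch that scores sentences by average word frequency and returns the top ones verbatim. This is deliberately not BERT-based: the frequency scoring and the helper name `extractive_summary` are assumptions chosen for illustration, and the library used below ranks sentences with BERT embeddings instead.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them verbatim, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


text = (
    "BERT is a language model. BERT is trained on unlabeled text. "
    "The weather was pleasant yesterday. Fine-tuning adapts BERT to a task."
)
print(extractive_summary(text, num_sentences=2))
```

Whatever the scoring function, every sentence in the output appears word for word in the input, which is the defining property of an extractive summary.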



In [None]:
!pip install bert-extractive-summarizer
!pip install spacy==2.1.3
!pip install transformers==2.2.2
!pip install neuralcoref

!python -m spacy download en_core_web_md

In [None]:
from summarizer import Summarizer

body = 'Gemini Solutions is an IT Consulting and Product Development firm. Our services provide clients\
 with a flexibility to choose from an array of automation and application development solutions as well\
  as giving them an option to choose from outsourcing, onshore or offshore engagement models. Gemini offers\
   several management services and is able to combine our range of services to suit a diverse range\
    of needs. We cater to the diversified portfolio of clients across sectors such as banking & financial\
     services, retail, healthcare, education and government sector. We are proud to say that we have a\
      well-structured IT community that has been handpicked from the best colleges across India who keep\
       abreast with today’s rapidly changing and ever-evolving technological advancements. We strive to\
        continuously provide customizable, affordable and quality products & services to our patrons through\
         our creative & skilled teams who demonstrate an inherent agility towards projects. CMT \
         (Comprehensive Monitoring Tool) is a tool meant to ensure that your IT operations keep running\
          smoothly and without hitches. It’s a monitoring tool that allows you to monitor the entire\
           production environment and infrastructure very closely and generates notifications as soon as\
            any issues are identified either with the infrastructure, the models that are running or the\
             data itself. What differentiates this tool from the run-of-the-mill tools is how it embeds\
              machine learning thus being able to predict a failure even before it occurs.'

model = Summarizer()
model(body)

'Gemini Solutions is an IT Consulting and Product Development firm. CMT          (Comprehensive Monitoring Tool) is a tool meant to ensure that your IT operations keep running          smoothly and without hitches.'

In [None]:
result = model(body, ratio=0.2)  # Specified with ratio
result

'Gemini Solutions is an IT Consulting and Product Development firm. CMT          (Comprehensive Monitoring Tool) is a tool meant to ensure that your IT operations keep running          smoothly and without hitches.'

In [None]:
result = model(body, num_sentences=3)  # Will return 3 sentences 

## Sentence Similarity



### Sentence Embeddings

Word embeddings represent the meaning of individual words.
However, sometimes we need to go a step further and encode the meaning of a whole sentence, so that the context in which the words are used is captured as well.

A sentence-level representation is important for many tasks. It lets us work with the intent of a sentence as a single vector instead of handling the embedding of each word separately. It also lets us compare sentences, cluster them by similarity, or predict values for them, such as sentiment.
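One common way to turn per-token vectors into a single sentence vector is mean pooling: average the token vectors while skipping padding positions. The sketch below uses plain Python lists and made-up two-dimensional vectors; `mean_pool` is a hypothetical helper, not part of any library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (assumes at least one such token)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # third token is padding
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

The result has the same dimensionality as a single token vector, regardless of how many tokens the sentence contains.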

In [None]:
!pip install sentence-transformers

In [None]:
import spacy
from sentence_transformers import SentenceTransformer 
from sentence_transformers import models

doc = "This is a sentence. We need to get its embeddings."

sp = spacy.load('en_core_web_sm')
tokenized = sp(doc)
sentences = [sent.text for sent in tokenized.sents]  # split the document into sentences

# Use encoder for mapping tokens to embeddings
word_embedding_model = models.Transformer('bert-base-cased')
# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sentence_embeddings = model.encode(sentences)
res = zip(sentences, sentence_embeddings)

In [None]:
a = sentence_embeddings[0] 
b = sentence_embeddings[1]

In [None]:
for tup in res:  
  print("Sentence:", tup[0])
  print("Embedding:", tup[1])
  print("")

### Cosine Similarity

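Here we compute the cosine similarity between the two BERT sentence embeddings obtained above.

Cosine similarity is the cosine of the angle between two vectors. It measures how closely the vectors point in the same direction, regardless of their magnitude: 1 means the same direction, 0 means orthogonal, and -1 means opposite directions. The cosine similarity between two vectors A and B is

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$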

In [None]:
import numpy as np

def cosine_similarity_calc(vec_1,vec_2):
	sim = np.dot(vec_1,vec_2)/(np.linalg.norm(vec_1)*np.linalg.norm(vec_2))
	return sim

In [None]:
print('Sentence A and B similarity:',cosine_similarity_calc(a,b))

Sentence A and B similarity: 0.74686426


## Named Entity Recognition

In any text, some terms are more informative than others and refer to specific things in the real world. Named Entity Recognition (NER), a subtask of information extraction that is sometimes called entity chunking, is the process of locating such named entities in text and classifying them into predefined categories such as person, place, time, and organization.


bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and is reported to achieve state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

### Example code

In [None]:
!pip install transformers

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Jacob Littleton and I work at Microsoft Corporation. I live in Dublin and would be moving to Greater Manchester shortly."

ner_results = nlp(example)
print(pd.DataFrame(ner_results))

          word     score entity  index  start  end
0        Jacob  0.999431  B-PER      4     11   16
1       Little  0.999339  I-PER      5     17   23
2        ##ton  0.997708  I-PER      6     23   26
3    Microsoft  0.999369  B-ORG     11     41   50
4  Corporation  0.999229  I-ORG     12     51   62
5       Dublin  0.999596  B-LOC     17     74   80
6      Greater  0.999152  B-LOC     23    104  111
7   Manchester  0.998406  I-LOC     24    112  122
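The table above lists one row per WordPiece token, with `B-` marking the beginning of an entity and `I-` a continuation. A small sketch of how those rows could be merged into whole entities is shown below. `group_entities` is a hypothetical helper written for this example; it joins words with a single space rather than using the `start`/`end` offsets. Recent versions of `transformers` can also do this grouping inside the pipeline via an aggregation option.

```python
def group_entities(rows):
    """Merge WordPiece pieces and B-/I- tags into (entity_type, text) spans."""
    entities = []
    for row in rows:
        prefix, ent_type = row["entity"].split("-", 1)
        word = row["word"]
        if word.startswith("##") and entities:
            entities[-1][1] += word[2:]  # glue the subword onto the previous word
        elif prefix == "I" and entities and entities[-1][0] == ent_type:
            entities[-1][1] += " " + word  # continue the current entity
        else:
            entities.append([ent_type, word])  # a B- tag (or stray I- tag) starts a new entity
    return [tuple(e) for e in entities]


# The word and entity columns from the output above.
rows = [
    {"word": "Jacob", "entity": "B-PER"},
    {"word": "Little", "entity": "I-PER"},
    {"word": "##ton", "entity": "I-PER"},
    {"word": "Microsoft", "entity": "B-ORG"},
    {"word": "Corporation", "entity": "I-ORG"},
    {"word": "Dublin", "entity": "B-LOC"},
    {"word": "Greater", "entity": "B-LOC"},
    {"word": "Manchester", "entity": "I-LOC"},
]
print(group_entities(rows))
# [('PER', 'Jacob Littleton'), ('ORG', 'Microsoft Corporation'), ('LOC', 'Dublin'), ('LOC', 'Greater Manchester')]
```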


## Question-Answering

In [None]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Load the model and tokenizer once, not on every call (the checkpoint is large).
QA_CHECKPOINT = 'bert-large-uncased-whole-word-masking-finetuned-squad'
qa_model = BertForQuestionAnswering.from_pretrained(QA_CHECKPOINT)
qa_tokenizer = BertTokenizer.from_pretrained(QA_CHECKPOINT)


def qa_bert(question, answer_text):
    # Encode the pair as: [CLS] question [SEP] answer_text [SEP]
    input_ids = qa_tokenizer.encode(question, answer_text)
    print('Query has {:,} tokens.\n'.format(len(input_ids)))  # Report how long the input sequence is

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(qa_tokenizer.sep_token_id)
    num_seg_a = sep_index + 1  # The number of segment A tokens includes the [SEP] token itself.
    num_seg_b = len(input_ids) - num_seg_a  # The remainder are segment B.
    segment_ids = [0] * num_seg_a + [1] * num_seg_b  # Construct the list of 0s and 1s.
    assert len(segment_ids) == len(input_ids)  # There should be a segment_id for every input token.

    with torch.no_grad():  # Inference only, so no gradients are needed.
        out = qa_model(torch.tensor([input_ids]),  # The tokens representing our input text.
                       token_type_ids=torch.tensor([segment_ids]))  # The segment IDs to differentiate question from answer_text
    start_scores = out['start_logits']
    end_scores = out['end_logits']

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores).item()
    answer_end = torch.argmax(end_scores).item()
    tokens = qa_tokenizer.convert_ids_to_tokens(input_ids)  # Get the string versions of the input tokens.

    answer = tokens[answer_start]  # Start with the first token.
    for i in range(answer_start + 1, answer_end + 1):
        if tokens[i].startswith('##'):
            answer += tokens[i][2:]  # If it's a subword token, recombine it with the previous token.
        else:
            answer += ' ' + tokens[i]  # Otherwise, add a space then the token.
    print('Answer: "' + answer + '"')
    return answer




def test_qa():
	text = 'Gemini Solutions is an IT Consulting and Product Development firm. Our services provide clients\
 with a flexibility to choose from an array of automation and application development solutions as well\
  as giving them an option to choose from outsourcing, onshore or offshore engagement models. Gemini offers\
   several management services and is able to combine our range of services to suit a diverse range\
    of needs. We cater to the diversified portfolio of clients across sectors such as banking & financial\
     services, retail, healthcare, education and government sector. We are proud to say that we have a\
      well-structured IT community that has been handpicked from the best colleges across India who keep\
       abreast with today’s rapidly changing and ever-evolving technological advancements. We strive to\
        continuously provide customizable, affordable and quality products & services to our patrons through\
         our creative & skilled teams who demonstrate an inherent agility towards projects. CMT \
         (Comprehensive Monitoring Tool) is a tool meant to ensure that your IT operations keep running\
          smoothly and without hitches. It’s a monitoring tool that allows you to monitor the entire\
           production environment and infrastructure very closely and generates notifications as soon as\
            any issues are identified either with the infrastructure, the models that are running or the\
             data itself. What differentiates this tool from the run-of-the-mill tools is how it embeds\
              machine learning thus being able to predict a failure even before it occurs.'


	ans = 'y'
	while ans == 'y':
		print('User:')
		question = input()
		qa_bert(question, text)
		print('\n\nAny more questions? (y/n)')
		ans = input()
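One caveat of taking the highest `start` and `end` scores independently, as `qa_bert` does, is that the end position can land before the start position. A possible refinement is to pick the best valid pair jointly. The sketch below works on plain lists of scores; `best_span` and the `max_answer_len` limit are hypothetical names and values chosen for this example.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start + end score, subject to start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Independent argmax would give start=2 and end=1, which is not a valid span.
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [0.2, 6.0, 0.1, 4.0, 0.3]
print(best_span(start_scores, end_scores))  # (2, 3)
```

With real model outputs, the score lists could be obtained from the logits with `start_scores[0].tolist()` and `end_scores[0].tolist()`.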
