## *We present an implementation of the BERT model for the Named Entity Recognition (NER) task.*

### **Summary of the Task**
To summarize, we want to build an NLP model that detects the named entities in an input text. For now, we consider two entities: "Data group" and "Data movement". 

### **To do so, we will follow these steps:** 
- Data import: the data we work on has a specific format suited to the NER task. 

- Data engineering: in this step we clean the dataset and separate the input data from the targets (the named entity tags). The data is split into a training set and a validation set so that the performance of the model can be evaluated. 
We also use the PyTorch DataLoader to load batches of both the training set and the validation set. 

- Model building: we build our BERT model for the NER task. The pretrained model is available in the **transformers** library.  

- Training and validation: after defining the model architecture, we train it with the train dataloader, then validate it with the test dataloader.

- To evaluate the performance of our model, we use the F1-score. It is the harmonic mean of precision and recall, and it is widely used in real-world classification problems. It can be written as: 

$$ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

- Finally, we will deploy the model to production using Flask and AWS. 
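
As a small worked example of the F1-score from the steps above, the sketch below computes it from raw counts. The function name `f1_from_counts` and the toy counts are assumptions made for illustration; they do not come from the notebook's data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score from true positive, false positive and false negative counts."""
    if tp == 0:
        # No correct prediction: the score is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 entities found correctly, 2 spurious, 4 missed.
score = f1_from_counts(tp=8, fp=2, fn=4)
print(round(score, 3))
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low, which is why it is preferred to plain accuracy when most tokens carry the "O" tag.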


In [1]:
## mount the drive in the colab notebook 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Package installation**

In [2]:
!pip install ktrain 


### **1- Data Description**

For the moment, the data we will use for estimating the CFP is unknown. For illustrative purposes, we use a dataset with a similar format to train the BERT model. The implementation does not depend on the data, so the same code can be reused with a different dataset. The data is stored as "ner.csv" in the data_estimancy folder. 

In [3]:
import pandas as pd


In [4]:
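# error_bad_lines=False skips malformed rows; it was removed in pandas 2.0, where on_bad_lines='skip' is the equivalent
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data_estimancy/ner.csv', encoding='ISO-8859-1', error_bad_lines=False)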

b'Skipping line 281837: expected 25 fields, saw 34\n'


In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,pos,prev-iob,prev-lemma,prev-pos,prev-prev-iob,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
0,0,thousand,of,demonstr,NNS,lowercase,demonstrators,IN,lowercase,of,NNS,__START1__,__start1__,__START1__,__START2__,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,1.0,capitalized,Thousands,O
1,1,of,demonstr,have,VBP,lowercase,have,NNS,lowercase,demonstrators,IN,O,thousand,NNS,__START1__,__start1__,__START1__,wildcard,__START1__,capitalized,Thousands,1.0,lowercase,of,O
2,2,demonstr,have,march,VBN,lowercase,marched,VBP,lowercase,have,NNS,O,of,IN,O,thousand,NNS,capitalized,Thousands,lowercase,of,1.0,lowercase,demonstrators,O
3,3,have,march,through,IN,lowercase,through,VBN,lowercase,marched,VBP,O,demonstr,NNS,O,of,IN,lowercase,of,lowercase,demonstrators,1.0,lowercase,have,O
4,4,march,through,london,NNP,capitalized,London,IN,lowercase,through,VBN,O,have,VBP,O,demonstr,NNS,lowercase,demonstrators,lowercase,have,1.0,lowercase,marched,O


We keep only three features of the data: "pos", "word" and "sentence_idx", i.e. the part of speech of each word and the id of the sentence it belongs to, together with the target column "tag". This small feature set is chosen for simplicity; it can later be extended to improve the performance of the model. 

In [6]:
df = df[["pos", "word", "sentence_idx", "tag"]].dropna()

In [7]:
df.head()

Unnamed: 0,pos,word,sentence_idx,tag
0,NNS,Thousands,1.0,O
1,IN,of,1.0,O
2,NNS,demonstrators,1.0,O
3,VBP,have,1.0,O
4,VBN,marched,1.0,O


In [None]:
# turn the rows of one sentence into a list of [word, pos, tag] triples
func = lambda s : [[w, p, t] for w, p, t in zip(s["word"].values.tolist(), s["pos"].values.tolist(), s["tag"].values.tolist())]

The next step is the data engineering. This involves grouping the words that belong to the same sentence, separating the target data from the input data, converting the textual data to numerical data, etc. 

### **2- Data engineering**
To perform the data engineering, we use the following `Dataset` class.  

In [None]:
class Dataset:

  def __init__(self, dataset):

      self.dataset = dataset 
      self.n_sentence = 1
      self.labels = self.dataset["tag"]
      # turn the rows of one sentence into a list of [word, pos, tag] triples
      funct = lambda s : [ [w, p, t] for w, p, t in zip(s["word"].values.tolist(), s["pos"].values.tolist(), s["tag"].values.tolist())]
      # one entry per sentence, indexed by sentence_idx
      self.group = self.dataset.groupby("sentence_idx").apply(funct)
      self.sentences = [s for s in self.group] 

  def get_next_sentence(self, index):
      # look up the sentence by its sentence_idx label
      next_sent = self.group.loc[index]
      self.n_sentence +=1 
      return next_sent 


In [None]:
data = Dataset(df)  

In [None]:
# sorted so that the tag-to-index mapping is the same on every run
list_tag = sorted(set(data.labels))
sentences = [' '.join([s[0] for s in sent]) for sent in data.sentences]
tag2idx = {t : i for i, t in enumerate(list_tag)}  
labels = [ [tag2idx[s[2]] for s in sent] for sent in data.sentences] 
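
To make the result of this cell concrete, here is a dependency-free sketch of the same grouping and encoding on a few toy rows. The rows and the `toy_` names are assumptions for illustration; the notebook itself does this with pandas `groupby` on the full DataFrame.

```python
from collections import defaultdict

# Hypothetical toy rows in the same (sentence_idx, word, tag) layout as the DataFrame.
rows = [
    (1.0, "Thousands", "O"),
    (1.0, "marched", "O"),
    (1.0, "through", "O"),
    (1.0, "London", "B-geo"),
    (2.0, "Families", "O"),
    (2.0, "of", "O"),
    (2.0, "soldiers", "O"),
]

# Group (word, tag) pairs by sentence id, preserving word order.
grouped = defaultdict(list)
for sentence_idx, word, tag in rows:
    grouped[sentence_idx].append((word, tag))

# A sorted tag list gives a reproducible tag -> index mapping.
toy_tags = sorted({tag for _, _, tag in rows})
toy_tag2idx = {t: i for i, t in enumerate(toy_tags)}

# One string per sentence and one list of tag indices per sentence.
toy_sentences = [" ".join(w for w, _ in pairs) for pairs in grouped.values()]
toy_labels = [[toy_tag2idx[t] for _, t in pairs] for pairs in grouped.values()]

print(toy_sentences)
print(toy_labels)
```

Each sentence string has exactly one label per whitespace-separated word, which is the invariant the rest of the pipeline relies on.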

### **3-Preparing the Dataset and the Dataloader** 

Dataset and DataLoader are constructs of the PyTorch library. A Dataset applies some processing to the data before it is sent to the model. A DataLoader sends the data to the model in batches, which makes training more efficient.  
We define a Dataset class that takes `tokenizer`, `sentences` and `labels` as inputs and outputs the tokenized sentences and tags used to train the model.


In [None]:
!pip install transformers 



In [None]:

import numpy as np
import pandas as pd
import transformers
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertForTokenClassification, BertTokenizer, BertConfig, BertModel

In [None]:
!pip install torch 
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl



#### We will make use of torch_xla, which allows computation on the TPU. 

In [None]:
import torch
import torch_xla
import torch_xla.core.xla_model as xm

#### **Dataset Class** 
- The Dataset class takes `tokenizer`, `sentences` and `labels` as input and tokenizes the sentences.
- We use BertTokenizer to encode each sentence as token ids (`ids`) and an attention mask (`mask`). 
- The tokenizer uses `encode_plus` to tokenize the data. 

*Dataloader*
The dataloader divides the dataset into batches. Two parameters control this: `batch_size`, the number of examples per batch, and `max_len`, the fixed length to which every sequence is padded or truncated so that the examples of a batch can be stacked. 
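
The roles of `max_len` and `batch_size` can be illustrated without any library. The sketch below is a simplified stand-in, under the assumption that padding uses id 0 and that truncation simply cuts the sequence; the helper names `pad_or_truncate` and `batches` and the token ids are made up for the example, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask) of length max_len; mask is 1 for real tokens, 0 for padding."""
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# A short sequence is padded up to max_len; the mask marks the real tokens.
ids, mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
print(ids, mask)

# Five examples with batch_size=2 give batches of sizes 2, 2 and 1.
print([len(b) for b in batches(list(range(5)), batch_size=2)])
```

The attention mask is what lets the model ignore the padded positions, so every sequence can share the same length inside a batch.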


In [None]:
class Dataset:
  def __init__(self, tokenizer, sentences, labels, max_len):
    self.tokenizer = tokenizer 
    self.sentences = sentences
    self.labels = labels
    self.max_len = max_len 
    self.len = len(self.sentences)

  def __getitem__(self, index):

    sentence = str(self.sentences[index])
    # tokenize into WordPiece ids, padded or truncated to max_len
    tokenized = self.tokenizer.encode_plus(
        sentence,
        add_special_tokens =True,
        truncation = True, 
        max_length = self.max_len, 
        padding = 'max_length',
        return_token_type_ids = True 
    )
    ids = tokenized["input_ids"]
    mask = tokenized["attention_mask"]
    # copy the labels so the stored list is not modified, then pad with 0 and cut to max_len
    # note: labels are per word, whereas ids are per WordPiece token (plus [CLS] and [SEP]),
    # so label positions do not line up exactly with token positions
    labels = list(self.labels[index]) + [0]*self.max_len
    labels = labels[:self.max_len]

    return {'ids':torch.tensor(ids, dtype = torch.long), 'mask':torch.tensor(mask, dtype = torch.long),
            'tags':torch.tensor(labels, dtype = torch.long)}
  
  def __len__ (self):
    return self.len 
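
The class above pairs one label per word with one id per WordPiece token. A common way to line the two up is sketched below, under stated assumptions: `word_ids` is a list giving, for each token, the index of the word it came from (or `None` for special and padding tokens), as fast tokenizers in `transformers` can provide; the notebook's `BertTokenizer` does not expose this, so switching tokenizer is an assumption. The helper name `align_labels` is hypothetical. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, max_len, ignore_index=IGNORE_INDEX):
    """Expand per-word labels to per-token labels of length max_len.

    word_ids[i] is the index of the word that token i came from, or None for
    special tokens such as [CLS], [SEP] and padding.
    """
    aligned = []
    previous = None
    for word_id in word_ids[:max_len]:
        if word_id is None or word_id == previous:
            # Special token, or a continuation sub-token: excluded from the loss.
            aligned.append(ignore_index)
        else:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        previous = word_id
    # Pad with the ignore index rather than with a real tag index.
    aligned += [ignore_index] * (max_len - len(aligned))
    return aligned


# Hypothetical example: "[CLS] Thou ##sands marched [SEP]" with word labels [1, 3].
token_labels = align_labels([None, 0, 0, 1, None], [1, 3], max_len=7)
print(token_labels)
```

With this scheme the padded positions and the sub-token continuations no longer contribute to the loss, and only the first sub-token of each word is scored.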

In [None]:
### Constant variables

train_percent = 0.8 
max_len = 200
train_batch_size = 20  
test_batch_size = 20 
train_size = int(len(sentences)*train_percent)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [None]:
train_sentences = sentences[:train_size]  
train_labels    =  labels[:train_size]
test_sentences  = sentences [train_size:]
test_labels     = labels[ train_size:]


In [None]:
train_data = Dataset(tokenizer, train_sentences, train_labels, max_len) 
test_data  = Dataset(tokenizer, test_sentences, test_labels, max_len)

#### **Dataloader**

In [None]:
train_dataloader = DataLoader(train_data, batch_size=train_batch_size , shuffle=True) 
test_dataloader  =  DataLoader(test_data, batch_size=test_batch_size, shuffle=True )


### **4-Model building -Training and Validation**

In [None]:
import torch.nn as nn

In [None]:
class BertClass(nn.Module): 
  def __init__(self):
      super(BertClass, self).__init__()
      # pretrained BERT encoder with a token-classification head (one output per tag)
      self.l1 = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels = len(list_tag)) 
      # self.Dropout = nn.Dropout(0.2)
      # self.fc1 = nn.Linear(784, 200)
      # self.fc2 = nn.Linear(200, 100)
      # self.fc3 = nn.Linear(100, len(list_tag))

  def forward(self, ids, mask, labels): 
      # since labels are passed, the output holds the loss first, then the logits
      x = self.l1(ids, attention_mask=mask, labels=labels)
      # x=  self.Dropout(x)
      # x = self.fc1(x)
      # x=  self.fc2(x)
      # x = self.fc3(x) 

      return x 
      


In [None]:
### to perform computations of the model on the TPU 
dev = xm.xla_device()

In [None]:
model = BertClass()
model.to(dev)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

BertClass(
  (l1): BertForTokenClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
       

In [None]:
Learning_rate = 2e-04
EPOCHS= 20

In [None]:
optimizer = torch.optim.SGD(params = model.parameters(), lr=Learning_rate, momentum=0.95) 

#### **Training of the model** 

In [None]:
def train(epochs):
  model.train()
  for ep in range(epochs): 
      for i, data in enumerate(train_dataloader):
        ids = data["ids"].to(dev, dtype=torch.long)
        mask = data["mask"].to(dev, dtype = torch.long)
        labels = data["tags"].to(dev, dtype= torch.long)

        ## reset the gradients accumulated in the previous step
        optimizer.zero_grad()
        ## forward pass: the first output is the loss
        loss = model(ids, mask, labels=labels)[0]  
        ## log the loss every 1000 batches
        if i%1000==0:
          print(f"epoch={ep+1}  loss={loss.item()}") 

        ## backward pass + parameter update on the XLA device
        loss.backward()
        xm.optimizer_step(optimizer)
        xm.mark_step()

In [None]:
train(EPOCHS)

epoch=1  loss=2.963646411895752
epoch=1  loss=0.6478630304336548
epoch=2  loss=0.49692726135253906
epoch=2  loss=0.3359334468841553
epoch=3  loss=0.43199533224105835
epoch=3  loss=0.26246723532676697
epoch=4  loss=0.3453127443790436
epoch=4  loss=0.32650381326675415
epoch=5  loss=0.18063196539878845
epoch=5  loss=0.2453652322292328


In [None]:
# data = next(iter(test_dataloader))

In [None]:
# ids = data["ids"].to(dev, dtype = torch.long)
# mask = data["mask"].to(dev, dtype= torch.long)
# labels = data["tags"].to(dev, dtype = torch.long)


#### **Validation of the model**

In [None]:
!pip install seqeval



In [None]:
from seqeval.metrics import f1_score

In [None]:
def val(epochs):

    model.eval()
    predictions, true_labels = [], []
    eval_accuracy, eval_loss, n_batch = 0.0 , 0.0, 0.0

    def Accuracy(predictions, labels):
        flat_predictions = np.argmax(predictions, axis=2).flatten()
        flat_labels =labels.flatten()
        return (flat_labels==flat_predictions).sum()/len(flat_labels)

    with torch.no_grad():
      for _, data in enumerate(test_dataloader):
          ids   = data["ids"].to(dev, dtype = torch.long)
          mask  = data["mask"].to(dev, dtype = torch.long)
          labels = data["tags"].to(dev, dtype = torch.long)

          output = model(ids, mask, labels=labels)
          loss, logits = output[:2]

          logits = logits.detach().to('cpu').numpy()
          labels = labels.to('cpu').numpy()

          predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
          true_labels.extend([list(l) for l in labels])

          eval_accuracy+= Accuracy(logits, labels)
          eval_loss+= loss.mean().item()
          n_batch+=1 

      eval_accuracy/=n_batch
      eval_loss/=n_batch 
      predic_tags = [[list_tag[p_i] for p_i in p ] for p in predictions]
      true_tags = [[list_tag[l_i] for l_i in l ] for l in true_labels]
      print(f"validation  loss ={eval_loss}")
      print(f"validation Accuracy ={eval_accuracy}")
      # seqeval expects the true tags first, then the predicted tags
      print(f"f1-score= {f1_score(true_tags, predic_tags)}")


### **val loss**

In [None]:
val(EPOCHS)

validation  loss =0.20162769114937296
validation Accuracy =0.4463959517045451
f1-score= 0.10355338515573088
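
The `Accuracy` helper above averages over every position of each padded sequence, padding included. A minimal sketch of a variant restricted to real tokens follows, written in plain Python on toy lists. The function name `masked_accuracy` and the toy values are assumptions for illustration, not part of the notebook, and the sketch is not a diagnosis of the scores reported above.

```python
def masked_accuracy(pred_rows, label_rows, mask_rows):
    """Token accuracy computed only over positions whose attention mask is 1."""
    correct = 0
    total = 0
    for preds, labels, mask in zip(pred_rows, label_rows, mask_rows):
        for p, l, m in zip(preds, labels, mask):
            if m == 1:
                total += 1
                correct += int(p == l)
    return correct / total if total else 0.0


# Hypothetical batch of one sequence: two real tokens followed by two padded positions.
toy_preds = [[1, 2, 0, 0]]
toy_true = [[1, 1, 0, 0]]
toy_mask = [[1, 1, 0, 0]]

# Only the two real tokens are scored, and one of them is correct.
print(masked_accuracy(toy_preds, toy_true, toy_mask))
```

In `val`, the same idea would amount to collecting `mask` alongside the logits and labels and skipping the positions where it is 0.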


###  **5-Ktrain library to improve the BERT model**

Ktrain is a library that simplifies both the implementation and the deployment of deep learning models. Like the fastai library that inspired it, it can estimate a suitable learning rate for a deep learning model in only a few lines of code. It can also handle the preprocessing of the input data. 

For further details on the ktrain library, see its GitHub repository: [https://github.com/amaiya/ktrain](https://github.com/amaiya/ktrain)

In [8]:
import ktrain
from ktrain import text, get_learner 


In [9]:
size_data = 500000

In [10]:
class Ktrain:

  def __init__(self, data_set, size_data):
    self.data_size = data_set.shape[0]
    self.size = size_data
    self.train_df =  data_set.iloc[:size_data]
    self.preproc = None 
    self.train = None
    self.val = None
    self.model = None 
    self.learner = None 

  def processing(self):
      """
      Load entities from a pandas DataFrame with text.entities_from_df, called with:
        train_df (pd.DataFrame): training data
        word_column (str): name of the column containing the words
        tag_column (str): name of the column containing the labels
        sentence_column (str): name of the column containing the sentence IDs
        val_pct (float): fraction of the training data held out for validation
        use_char (bool): if True, data is preprocessed to use character embeddings in addition to word embeddings
        verbose (bool): verbosity

      """
      self.train, self.val, self.preproc = text.entities_from_df(train_df=self.train_df, 
                                                                   word_column="word", sentence_column='sentence_idx',
                                                                   tag_column='tag',
                                                                   val_pct = 0.2, 
                                                                   use_char=False, verbose=1)

  def Model(self, ModelName):
      # ModelName is the name of a ktrain sequence tagger, e.g. 'bilstm-bert' as used below
      self.model = text.sequence_tagger(ModelName, self.preproc)
      self.learner = get_learner(self.model, train_data=self.train, val_data=self.val,
                                 batch_size=200, eval_batch_size=20)
      return self.model 

In [11]:
if __name__=='__main__': 
  Ktrain = Ktrain(data_set=df, size_data=size_data)
  print("******************** preprocessing step ******************************")
  print("                                                                      ")

  print("                                                                      ")

  Ktrain.processing()
  print("******************* Model Building ***********************************")

  model = Ktrain.Model('bilstm-bert')

  print("                                                                      ")
  print("                                                                      ")
  print("*********************** Model training and Evaluation ******************")
  print("                                                                      ")
  print("                                                                      ")
  learner = Ktrain.learner
  learner.fit(1e-03, 5)

  

******************** preprocessing step ******************************
                                                                      
                                                                      
Number of sentences:  12906
Number of words in the dataset:  18519
Tags: ['I-geo', 'O', 'I-org', 'I-eve', 'I-gpe', 'B-art', 'I-per', 'B-nat', 'I-tim', 'B-per', 'B-gpe', 'I-art', 'B-eve', 'I-nat', 'B-org', 'B-tim', 'B-geo']
Number of Labels:  17
Longest sentence: 140 words
******************* Model Building ***********************************
Embedding schemes employed (combined with concatenation):
	word embeddings initialized randomly
	BERT embeddings with bert-base-multilingual-cased





                                                                      
                                                                      
*********************** Model training and Evaluation ******************
                                                                      
                                                                      
preparing training data ...done.
preparing validation data ...done.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [12]:
print("******************** Model Summary *************************************")
learner.validate()

print("                                                                      ")
print("                                                                      ") 
print("(****************** Model view top losses *****************************")

print("                                                                       ")
print("                                                                      ")
learner.view_top_losses(n=1)
print("                                                                      ")
print("                                                                      ")
print("****************** Definition of a Predictor ************************")
predictor = ktrain.get_predictor(learner.model, Ktrain.preproc)




******************** Model Summary *************************************
   F1:  75.31
              precision    recall  f1-score   support

         art       0.00      0.00      0.00        47
         eve       0.00      0.00      0.00        24
         geo       0.74      0.87      0.80      3538
         gpe       0.91      0.86      0.88      1499
         nat       0.00      0.00      0.00        23
         org       0.54      0.55      0.55      1838
         per       0.74      0.75      0.75      1628
         tim       0.79      0.79      0.79      1915

   micro avg       0.73      0.77      0.75     10512
   macro avg       0.46      0.48      0.47     10512
weighted avg       0.73      0.77      0.75     10512

                                                                      
                                                                      
(****************** Model view top losses *****************************
                                                

NameError: ignored
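
The `macro avg` and `weighted avg` rows of the report above can be reproduced from the per-class rows. The sketch below copies the f1-score and support columns from that report; the variable names are chosen for the example. The macro average gives every entity type the same weight, whereas the weighted average weights each type by its support.

```python
# (f1-score, support) per entity type, copied from the report above.
report = {
    "art": (0.00, 47),
    "eve": (0.00, 24),
    "geo": (0.80, 3538),
    "gpe": (0.88, 1499),
    "nat": (0.00, 23),
    "org": (0.55, 1838),
    "per": (0.75, 1628),
    "tim": (0.79, 1915),
}

# Macro average: plain mean of the per-class scores.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: per-class scores weighted by their support.
total_support = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"macro avg f1    = {macro_f1:.2f}")     # 0.47
print(f"weighted avg f1 = {weighted_f1:.2f}")  # 0.75
```

The gap between the two values comes from the three rare types (art, eve, nat), which score 0.00 but account for only 94 of the 10512 entities.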

In [13]:
print("                                                                      ")
print("                                                                      ")
print("****************** Definition of a Predictor ************************")
predictor = ktrain.get_predictor(learner.model, Ktrain.preproc)


                                                                      
                                                                      
****************** Definition of a Predictor ************************


In [14]:

print("                                                                      ")
print("                                                                      ")
print("****************** Save the Predictor ************************")
predictor.save('/content/bert')




                                                                      
                                                                      
****************** Save the Predictor ************************


### **6-BERT Model deployment**