
**Importing Python Libraries and preparing the environment**

This cell installs the "transformers" library developed by Hugging Face, which contains PyTorch and TensorFlow implementations of several transfer-learning models. The version is pinned to 3.0.2.

In [None]:
!pip install transformers==3.0.2 

Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting tokenizers==0.8.1.rc1
  Downloading tokenizers-0.8.1rc1-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tokenizers, sentencepiece, sacremoses, transformers
Successfully installed sacremoses-0.0.47 sentencepiece-0.1.96 tokenizers-0.8.1rc1 transformers-3.0.2


The necessary libraries for the deep learning model are imported.

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification 

Setting up the device. In practice a GPU is needed to fine-tune the deep learning model; the code falls back to the CPU if none is available.

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

**Importing and Pre-Processing the data**

In [None]:
import gzip
import json
from pathlib import Path

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/gdrive')
data_path = Path('/gdrive/MyDrive/industry_data/')
file_name1 = 'train_small.ndjson.gz'
file_name2 = 'test_small.ndjson.gz'

Mounted at /gdrive


In [None]:
# open train file
with gzip.open(data_path/file_name1, "rt", encoding='UTF-8') as file:
    data1 = [json.loads(line) for line in file]

In [None]:
# open test file
with gzip.open(data_path/file_name2, "rt", encoding='UTF-8') as file:
    data2 = [json.loads(line) for line in file]
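
The two loading cells assume gzip-compressed, newline-delimited JSON, that is, one JSON object per line. A minimal round trip through a temporary file might look like the sketch below. The `html` and `industry_label` keys are the ones used later in this notebook; the record contents and the file name are invented for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Two made-up records; only the key names come from the notebook.
records = [
    {"html": "<p>We build houses.</p>", "industry_label": "Construction"},
    {"html": "<p>We insure cars.</p>", "industry_label": "Insurance"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.ndjson.gz"  # hypothetical file name

    # Write one JSON object per line, gzip-compressed, in text mode.
    with gzip.open(sample_path, "wt", encoding="UTF-8") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")

    # Read it back the same way the notebook does.
    with gzip.open(sample_path, "rt", encoding="UTF-8") as file:
        loaded = [json.loads(line) for line in file]

print(len(loaded), loaded[0]["industry_label"])
```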

The content of the "html" key is cleaned in two steps: the BeautifulSoup library parses the HTML and keeps only its text, and then all punctuation is removed, since it is treated as irrelevant for training the model.

In [None]:
import string
from bs4 import BeautifulSoup 
def cleaning_text(text):
    """Replace newlines, strip the HTML markup and remove the punctuation."""
    text = text.replace('\n', ' ')
    text = BeautifulSoup(text, 'html.parser').text  # keep only the text content of the HTML
    text = text.translate(str.maketrans('', '', string.punctuation))  # delete ASCII punctuation
    return text
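
To see what the punctuation step does on its own, the sketch below applies the same `str.translate` call to an invented sentence. HTML parsing is left out so that the example needs only the standard library. Note that deleting punctuation also fuses dotted or hyphenated tokens together.

```python
import string

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans('', '', string.punctuation)

sample = "Contact us: info@example.com, or call +1-555-0100!"  # invented text
cleaned = sample.translate(remove_punctuation)
print(cleaned)
```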

In [None]:
# Cleaning of train dataset
for i in range(len(data1)):
  data1[i]['html'] = cleaning_text(data1[i]['html'])

In [None]:
# Cleaning of test dataset
for i in range(len(data2)):
  data2[i]['html'] = cleaning_text(data2[i]['html'])

In [None]:
# Defining the inputs and the labels of the train dataset 
texts = [doc["html"] for doc in data1]
labels = [doc["industry_label"] for doc in data1]

In [None]:
# Defining the inputs and the labels of the test dataset
test_texts = [doc["html"] for doc in data2]
test_labels = [doc["industry_label"] for doc in data2]

The development dataset, also known as the validation dataset, is a random 15% split of the train dataset.

In [None]:
from sklearn.model_selection import train_test_split
train_texts, dev_texts, train_labels, dev_labels = train_test_split(texts, labels, test_size=0.15, random_state=1)
print("Train size:", len(train_texts))
print("Dev size:", len(dev_texts))
print("Test size:", len(test_texts))

Train size: 21407
Dev size: 3778
Test size: 8396
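
The train and dev sizes printed above can be reproduced by hand. The sketch assumes scikit-learn's rule for a fractional `test_size`, which rounds the held-out share up and gives the remainder to the training set; the total of 25,185 documents is inferred from the two sizes in the output.

```python
import math

n_documents = 21407 + 3778   # train + dev sizes from the output above
test_size = 0.15

n_dev = math.ceil(test_size * n_documents)   # held-out share, rounded up
n_train = n_documents - n_dev                # everything else

print(n_train, n_dev)
```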


There are 19 industry classes in the dataset. Each of these classes is mapped to an index.

In [None]:
target_names = list(set(labels))
label2idx = {label: idx for idx, label in enumerate(target_names)}
print(label2idx)

{'Real Estate': 0, 'Recreational Facilities and Services': 1, 'Construction': 2, 'Leisure, Travel & Tourism': 3, 'Marketing and Advertising': 4, 'Financial Services': 5, 'Mechanical or Industrial Engineering': 6, 'Medical Practice': 7, 'Management Consulting': 8, 'Human Resources': 9, 'Legal Services': 10, 'Logistics and Supply Chain': 11, 'Telecommunications': 12, 'Insurance': 13, 'Wholesale': 14, 'Consumer Goods': 15, 'Information Technology and Services': 16, 'Automotive': 17, 'Renewables & Environment': 18}
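
One caveat: the iteration order of a `set` of strings is not guaranteed to be the same in every Python session, so a mapping built from `list(set(labels))` can change between runs. If the mapping has to stay stable, for example to reload a saved model in a later session, sorting the labels first is one option. A minimal sketch with three of the label names from this dataset (the `stable_*` names are illustrative):

```python
# Toy label list; the label names are taken from the mapping printed above.
toy_labels = ["Insurance", "Automotive", "Insurance", "Wholesale", "Automotive"]

# sorted() gives the same order no matter how the set happens to be iterated.
stable_names = sorted(set(toy_labels))
stable_label2idx = {label: idx for idx, label in enumerate(stable_names)}

# Inverse mapping, handy for turning predicted indices back into names.
stable_idx2label = {idx: label for label, idx in stable_label2idx.items()}

print(stable_label2idx)
```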


**Initializing the deep learning model**

BERT is chosen as the deep learning model to classify companies' landing pages into the given industry classes.

In [None]:
BERT_MODEL = "bert-base-uncased"

Each model comes with its own tokenizer, which splits texts into word pieces. Since the uncased model is used here, the tokenizer also lowercases the text.
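
The idea of word pieces can be illustrated with a greedy longest-match-first split over a tiny vocabulary. This is only a sketch: `toy_vocab` and `wordpiece_split` are hypothetical names, not part of the `transformers` API, and the real tokenizer applies more normalization and a far larger vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: the word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"insurance", "re", "##insur", "##ance", "##er", "broker"}  # hypothetical

for word in ["Insurance", "Reinsurance", "Broker", "xyz"]:
    print(word, "->", wordpiece_split(word.lower(), toy_vocab))
```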

In [None]:
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)

A full BERT model consists of a common, pretrained core, and an extension on top that depends on the particular NLP task. In this case, it's a classification task. Hence, the pretrained BERT model is used with a final layer for text classification on top.
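
In `BertForSequenceClassification` the task-specific part is small: a dropout layer followed by a single linear layer that maps the 768-dimensional pooled output to one score per class. A back-of-the-envelope count of the parameters this layer adds, assuming the 19 classes of this dataset:

```python
hidden_size = 768   # hidden size of bert-base-uncased
num_labels = 19     # industry classes in this dataset

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_parameters = hidden_size * num_labels + num_labels
print(head_parameters)
```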

In [None]:
model = BertForSequenceClassification.from_pretrained(BERT_MODEL, num_labels = len(label2idx))
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

**Preparing the Dataset and Dataloader**

The dataset needs to be prepared for BERT. Every input text is represented as a `BertInputItem` object, which contains all the information BERT needs: a list of input ids, the input mask, the segment ids, and the label id.
*   Every text is split up into subword units. Frequent words are kept intact, while less frequent words are split into several subword units. This allows the model to process every text as a sequence of tokens from a vocabulary of limited size.
*   The [CLS] token is added at the beginning of every document. The output vector at this position is what BERT uses for classification tasks: it serves as the input of the final, task-specific part of the neural network.
*   The input mask tells the model which parts of the input it should look at and which parts it should ignore. In this example, every input is truncated or zero-padded to exactly 512 tokens, the maximum length BERT supports, and the mask marks the padding positions so that they are ignored. BERT therefore never takes more than the first 512 tokens of a document into account.
*   The segment ids tell BERT which sequence every token belongs to. Here every document is a single sequence, so all segment ids are 0.
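
How the three lists fit together can be seen on a toy example. This is a sketch under stated assumptions: `build_inputs` is an illustrative helper, the token ids are placeholders apart from 101 and 102 (the ids of `[CLS]` and `[SEP]` in the `bert-base-uncased` vocabulary) and 0 for padding, and the maximum length is shortened to 8 so that the result stays readable.

```python
def build_inputs(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], truncate, and zero-pad to a fixed length."""
    # Reserve two positions for the special tokens before truncating.
    token_ids = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + token_ids + [sep_id]

    input_mask = [1] * len(input_ids)    # 1 = real token
    segment_ids = [0] * len(input_ids)   # single-sequence input: segment 0

    n_pad = max_seq_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    input_mask += [0] * n_pad            # 0 = padding, ignored by attention
    segment_ids += [0] * n_pad
    return input_ids, input_mask, segment_ids


short_example = build_inputs([7, 8, 9], max_seq_length=8)
long_example = build_inputs(list(range(10, 30)), max_seq_length=8)
print(short_example)
print(long_example)
```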



In [None]:
import logging
import numpy as np

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

MAX_SEQ_LENGTH=512

class BertInputItem(object):
    """An item with all the necessary attributes for finetuning BERT."""

    def __init__(self, text, input_ids, input_mask, segment_ids, label_id):
        self.text = text
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        

def convert_examples_to_inputs(example_texts, example_labels, label2idx, max_seq_length, tokenizer, verbose=0):
    """Converts texts and labels into a list of `BertInputItem`s."""

    input_items = []
    examples = zip(example_texts, example_labels)
    for (ex_index, (text, label)) in enumerate(examples):

        # Create a list of token ids. encode() adds [CLS] and [SEP] itself,
        # so they are not written into the text as well; truncation=True cuts
        # long texts to max_seq_length while keeping the final [SEP].
        input_ids = tokenizer.encode(text, add_special_tokens=True,
                                     max_length=max_seq_length, truncation=True)
        # All our tokens are in the first input segment (id 0).
        segment_ids = [0] * len(input_ids)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label2idx[label]

        input_items.append(
            BertInputItem(text=text,
                          input_ids=input_ids,
                          input_mask=input_mask,
                          segment_ids=segment_ids,
                          label_id=label_id))

        
    return input_items

train_features = convert_examples_to_inputs(train_texts, train_labels, label2idx, MAX_SEQ_LENGTH, tokenizer, verbose=0)
dev_features = convert_examples_to_inputs(dev_texts, dev_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)
test_features = convert_examples_to_inputs(test_texts, test_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)

Streaming output truncated to the last 5000 lines.


In the following, a data loader is initialized for each of the training, development and test sets. These data loaders put all the data in tensors and make it possible to iterate over them in batches during training.

In [None]:
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

def get_data_loader(features, max_seq_length, batch_size, shuffle=True): 

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
    data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)

    dataloader = DataLoader(data, shuffle=shuffle, batch_size=batch_size)
    return dataloader

BATCH_SIZE = 16

train_dataloader = get_data_loader(train_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=True)
dev_dataloader = get_data_loader(dev_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=False)
test_dataloader = get_data_loader(test_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=False)
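
With a batch size of 16, the number of batches per pass is the dataset size divided by 16 and rounded up, since the last batch may be smaller. This assumes `DataLoader`'s default behaviour of keeping the incomplete final batch. For the split sizes printed earlier:

```python
import math

batch_size = 16
split_sizes = {"train": 21407, "dev": 3778, "test": 8396}  # sizes printed earlier

# One batch per full group of 16 examples, plus one for any remainder.
batches = {name: math.ceil(size / batch_size) for name, size in split_sizes.items()}
print(batches)
```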

**Evaluation method**

The model is evaluated with the method below. It takes as input a model and a data loader with the data to be evaluated. For each batch, it computes the output of the model and the loss, and it returns the average loss together with the correct and the predicted labels.

In [None]:
def evaluate(model, dataloader):
    model.eval()
    
    eval_loss = 0
    nb_eval_steps = 0
    predicted_labels, correct_labels = [], []

    for step, batch in enumerate(tqdm(dataloader, desc="Evaluation iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        with torch.no_grad():
            tmp_eval_loss, logits = model(input_ids, attention_mask=input_mask,
                                          token_type_ids=segment_ids, labels=label_ids)
        
        outputs = logits.argmax(dim=1).cpu().numpy()  # predicted class index per example
        label_ids = label_ids.cpu().numpy()

        predicted_labels += list(outputs)
        correct_labels += list(label_ids)
        
        eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    
    correct_labels = np.array(correct_labels)
    predicted_labels = np.array(predicted_labels)
        
    return eval_loss, correct_labels, predicted_labels
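
Stripped of the PyTorch details, the bookkeeping in `evaluate` amounts to taking the highest-scoring class for each example and averaging the per-batch losses. A plain-Python sketch with invented logits and losses:

```python
# Invented values: two batches of logits over three classes, with batch losses.
batches = [
    {"logits": [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]], "labels": [1, 2], "loss": 0.9},
    {"logits": [[-0.5, 0.0, 3.0]], "labels": [2], "loss": 0.3},
]

predicted_labels, correct_labels, total_loss = [], [], 0.0
for batch in batches:
    for row in batch["logits"]:
        # argmax: index of the largest logit in the row
        predicted_labels.append(max(range(len(row)), key=row.__getitem__))
    correct_labels += batch["labels"]
    total_loss += batch["loss"]

eval_loss = total_loss / len(batches)   # mean of the batch losses, as in evaluate()
print(eval_loss, predicted_labels, correct_labels)
```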

**Training**

Training uses the AdamW optimizer with a base learning rate of 5e-5 and runs for up to 6 epochs, which is enough here: as the log further down shows, the development loss stops improving after the fourth epoch. A linear schedule with warmup (`get_linear_schedule_with_warmup`) varies the learning rate during training. It starts with a small learning rate, which increases linearly during the warmup stage (the first 10% of the steps), and afterwards decreases linearly again.
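
The shape of this schedule is easy to write down. The sketch below mirrors the linear warmup and linear decay rule in plain Python; `lr_at` is an illustrative helper, not a library function, and the step counts are the ones the next cell derives from the training-set size.

```python
LEARNING_RATE = 5e-5
num_train_steps = int(21407 / 16 / 1 * 6)        # 8027, as computed in the notebook
num_warmup_steps = int(0.1 * num_train_steps)    # 802


def lr_at(step):
    """Learning rate after `step` optimizer updates (linear warmup, linear decay)."""
    if step < num_warmup_steps:
        return LEARNING_RATE * step / num_warmup_steps
    remaining = max(0, num_train_steps - step)
    return LEARNING_RATE * remaining / (num_train_steps - num_warmup_steps)


for step in (0, 401, 802, 4000, 8027):
    print(step, lr_at(step))
```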

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

GRADIENT_ACCUMULATION_STEPS = 1
NUM_TRAIN_EPOCHS = 6
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1
MAX_GRAD_NORM = 5

num_train_steps = int(len(train_dataloader.dataset) / BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(WARMUP_PROPORTION * num_train_steps)

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

optimizer = AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps)
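
The parameter grouping above is purely name-based: a parameter is exempt from weight decay if its name contains any of the `no_decay` substrings. A small sketch with a handful of parameter names of the kind `model.named_parameters()` yields (the list is abbreviated and chosen for illustration):

```python
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

# Abbreviated, illustrative parameter names.
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

with_decay = [n for n in names if not any(nd in n for nd in no_decay)]
without_decay = [n for n in names if any(nd in n for nd in no_decay)]

print("weight_decay=0.01:", with_decay)
print("weight_decay=0.0: ", without_decay)
```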

Now, the model is ready to be trained. In each epoch, the model is trained on the training data and evaluated on the development data. A history of the development loss is kept, and the training is stopped when this loss has not improved for a certain number of consecutive epochs (this number is called the patience). Whenever the development loss of the model improves, the model is saved immediately.

In [None]:
import os
from tqdm import trange
from tqdm.notebook import tqdm
from sklearn.metrics import classification_report, precision_recall_fscore_support

OUTPUT_DIR = "/tmp/"
MODEL_FILE_NAME = "pytorch_model.bin"
PATIENCE = 2

loss_history = []
no_improvement = 0
for _ in trange(int(NUM_TRAIN_EPOCHS), desc="Epoch"):
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(tqdm(train_dataloader, desc="Training iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        outputs = model(input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
        loss = outputs[0]

        if GRADIENT_ACCUMULATION_STEPS > 1:
            loss = loss / GRADIENT_ACCUMULATION_STEPS

        loss.backward()
        tr_loss += loss.item()

        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)  
            
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()
    dev_loss, _, _ = evaluate(model, dev_dataloader)
    
    print("Loss history:", loss_history)
    print("Dev loss:", dev_loss)
    
    if len(loss_history) == 0 or dev_loss < min(loss_history):
        no_improvement = 0
        model_to_save = model.module if hasattr(model, 'module') else model
        output_model_file = os.path.join(OUTPUT_DIR, MODEL_FILE_NAME)
        torch.save(model_to_save.state_dict(), output_model_file)
    else:
        no_improvement += 1
    
    if no_improvement >= PATIENCE: 
        print("No improvement on development set. Finish training.")
        break
        
    
    loss_history.append(dev_loss)

Loss history: []
Dev loss: 1.7432071413168928

Epoch:  17%|█▋        | 1/6 [20:23<1:41:59, 1223.93s/it]

Loss history: [1.7432071413168928]
Dev loss: 1.4447934997232654

Epoch:  33%|███▎      | 2/6 [40:49<1:21:39, 1224.87s/it]

Loss history: [1.7432071413168928, 1.4447934997232654]
Dev loss: 1.3976019852272066

Epoch:  50%|█████     | 3/6 [1:01:16<1:01:17, 1225.74s/it]

Loss history: [1.7432071413168928, 1.4447934997232654, 1.3976019852272066]
Dev loss: 1.3685818793652933

Epoch:  67%|██████▋   | 4/6 [1:21:44<40:53, 1226.60s/it]

Epoch:  83%|████████▎ | 5/6 [1:42:12<20:27, 1227.30s/it]

Loss history: [1.7432071413168928, 1.4447934997232654, 1.3976019852272066, 1.3685818793652933]
Dev loss: 1.5484715031169134

Epoch:  83%|████████▎ | 5/6 [2:02:40<24:32, 1472.16s/it]

Loss history: [1.7432071413168928, 1.4447934997232654, 1.3976019852272066, 1.3685818793652933, 1.5484715031169134]
Dev loss: 1.6404196916753229
No improvement on development set. Finish training.
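
Replaying the development losses from the log above through the same patience rule shows which checkpoint ends up on disk. The sketch keeps only the bookkeeping of the training loop; no model or training is involved, and `saved_epoch` and `stopped_epoch` are names introduced here for illustration.

```python
dev_losses = [
    1.7432071413168928, 1.4447934997232654, 1.3976019852272066,
    1.3685818793652933, 1.5484715031169134, 1.6404196916753229,
]  # one value per epoch, copied from the log
PATIENCE = 2

loss_history, no_improvement = [], 0
saved_epoch, stopped_epoch = None, None
for epoch, dev_loss in enumerate(dev_losses, start=1):
    if not loss_history or dev_loss < min(loss_history):
        no_improvement = 0
        saved_epoch = epoch          # the notebook saves the model at this point
    else:
        no_improvement += 1
    if no_improvement >= PATIENCE:
        stopped_epoch = epoch
        break
    loss_history.append(dev_loss)

print("checkpoint from epoch", saved_epoch, "- stopped after epoch", stopped_epoch)
```

Under this rule the checkpoint that is reloaded for evaluation comes from epoch 4, where the development loss was about 1.369.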





**Evaluating the model**

Now, the test dataset is used to evaluate the model on data it has never seen. Precision, recall and F1-score are displayed for the training, development and test sets, followed by a full classification report for the test set.

In [None]:
model_state_dict = torch.load(os.path.join(OUTPUT_DIR, MODEL_FILE_NAME), map_location=lambda storage, loc: storage)
model = BertForSequenceClassification.from_pretrained(BERT_MODEL, state_dict=model_state_dict, num_labels = len(target_names))
model.to(device)

model.eval()

_, train_correct, train_predicted = evaluate(model, train_dataloader)
_, dev_correct, dev_predicted = evaluate(model, dev_dataloader)
_, test_correct, test_predicted = evaluate(model, test_dataloader)

print("Training performance:", precision_recall_fscore_support(train_correct, train_predicted, average="micro"))
print("Development performance:", precision_recall_fscore_support(dev_correct, dev_predicted, average="micro"))
print("Test performance:", precision_recall_fscore_support(test_correct, test_predicted, average="micro"))

bert_accuracy = np.mean(test_predicted == test_correct)

print(classification_report(test_correct, test_predicted, target_names=target_names))

02/18/2022 16:26:01 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
02/18/2022 16:26:01 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "

Training performance: (0.8419675806979026, 0.8419675806979026, 0.8419675806979026, None)
Development performance: (0.6278454208575966, 0.6278454208575966, 0.6278454208575966, None)
Test performance: (0.6277989518818485, 0.6277989518818485, 0.6277989518818485, None)
                                      precision    recall  f1-score   support

                         Real Estate       0.67      0.71      0.69       240
Recreational Facilities and Services       0.56      0.39      0.46       163
                        Construction       0.64      0.48      0.55       390
           Leisure, Travel & Tourism       0.88      0.74      0.80       148
           Marketing and Advertising       0.70      0.64      0.67       830
                  Financial Services       0.72      0.74      0.73       469
Mechanical or Industrial Engineering       0.65      0.66      0.66      1004
                    Medical Practice       0.69      0.67      0.68       395
               Management Consu
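
The three numbers in each performance tuple are identical because, when every example has exactly one label, micro-averaged precision, recall and F1 all reduce to accuracy: each wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class). A small check with invented labels:

```python
# Invented labels for a three-class problem.
correct = [0, 1, 2, 2, 1, 0, 2]
predicted = [0, 2, 2, 2, 1, 1, 0]

pairs = list(zip(correct, predicted))
classes = sorted(set(correct) | set(predicted))

# Pool true positives, false positives and false negatives over all classes.
tp = sum(sum(1 for c, p in pairs if c == k and p == k) for k in classes)
fp = sum(sum(1 for c, p in pairs if c != k and p == k) for k in classes)
fn = sum(sum(1 for c, p in pairs if c == k and p != k) for k in classes)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = sum(c == p for c, p in pairs) / len(pairs)

print(micro_precision, micro_recall, micro_f1, accuracy)
```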

In general, transformer models such as BERT can’t handle more than 512 tokens at a time, whereas this task is a long-text classification problem. Part of the html content of this dataset is therefore truncated, which likely lowers the average F1-score. This probably explains why this deep learning model performs slightly worse than the Support Vector Machine model and the Logistic Regression model.