
# BBC Article Genre Classification with BERT using the FARM Framework

## Setup

In [None]:
!pip install farm

In [None]:
!git clone https://github.com/deepset-ai/FARM.git
!pip install -r FARM/requirements.txt
!pip install FARM/

In [2]:
from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import TextClassificationProcessor
from farm.modeling.optimization import initialize_optimizer
from farm.infer import Inferencer
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
import logging
import pandas as pd

10/25/2020 16:56:41 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [3]:
# FARM allows simple logging of many parameters and metrics. Let's use the MLflow framework to track our experiment.
# You will see your results on https://public-mlflow.deepset.ai/

ml_logger = MLFlowLogger(tracking_uri="https://public-mlflow.deepset.ai/")
ml_logger.init_experiment(experiment_name="BBC_Articles", run_name="BBC News Articles")


 __          __  _                            _        
 \ \        / / | |                          | |       
  \ \  /\  / /__| | ___ ___  _ __ ___   ___  | |_ ___  
   \ \/  \/ / _ \ |/ __/ _ \| '_ ` _ \ / _ \ | __/ _ \ 
    \  /\  /  __/ | (_| (_) | | | | | |  __/ | || (_) |
     \/  \/ \___|_|\___\___/|_| |_| |_|\___|  \__\___/ 
  ______      _____  __  __  
 |  ____/\   |  __ \|  \/  |              _.-^-._    .--.
 | |__ /  \  | |__) | \  / |           .-'   _   '-. |__|
 |  __/ /\ \ |  _  /| |\/| |          /     |_|     \|  |
 | | / ____ \| | \ \| |  | |         /               \  |
 |_|/_/    \_\_|  \_\_|  |_|        /|     _____     |\ |
                                     |    |==|==|    |  |
|---||---|---|---|---|---|---|---|---|    |--|--|    |  |
|---||---|---|---|---|---|---|---|---|    |==|==|    |  |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 


In [9]:
set_all_seeds(seed=42)
device, n_gpu = initialize_device_settings(use_cuda=True)
n_epochs = 1
batch_size = 1
evaluate_every = 100

10/24/2020 10:42:39 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None


## Building own blocks

### Tokenizer

In [10]:
lang_model = "bert-base-cased"
do_lower_case = False

tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)

10/24/2020 10:42:42 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'BertTokenizer'


### Data Processor

In [11]:
label_list = ['entertainment', 'sport', 'politics', 'business', 'tech']  # labels in our data set
metric = "f1_macro"  # desired metric for evaluation

processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=512,  # BERT can only handle sequence lengths of up to 512
    data_dir='generated_data',
    label_list=label_list,
    label_column_name="genre",  # our labels are located in the "genre" column
    metric=metric,
    quote_char='"',
    multilabel=True,
    train_filename="train.tsv",
    dev_filename=None,
    test_filename="test.tsv",
    dev_split=0.1)  # this will extract 10% of the train set to create a dev set
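
The processor reads tab-separated files and takes the label from the `genre` column. The real `train.tsv` is not shown in this notebook, so the following is only a minimal sketch of what such a file might look like. It assumes the text column is named `text` (the notebook names only the label column), and the two rows are made up for illustration.

```python
import csv
import io

label_list = ['entertainment', 'sport', 'politics', 'business', 'tech']

# Hypothetical sample rows; the actual BBC data is not reproduced here.
sample_rows = [
    {"text": 'Shares rose after the "record" quarter.', "genre": "business"},
    {"text": "The striker scored twice in the final.", "genre": "sport"},
]

# Write a TSV into memory using the same quote character the processor is given.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "genre"],
                        delimiter="\t", quotechar='"')
writer.writeheader()
writer.writerows(sample_rows)

# Read it back and check that every label is one the processor knows about.
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter="\t", quotechar='"'))
unknown_labels = [row["genre"] for row in rows if row["genre"] not in label_list]

print(rows[0]["genre"], len(rows), unknown_labels)
```

Checking for labels outside `label_list` before training is a cheap way to catch typos in the label column.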

In [12]:
data_silo = DataSilo(
    processor=processor,
    batch_size=batch_size)

10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -   Loading train set from: generated_data\train.tsv 
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -   Got ya 3 parallel workers to convert 1780 dictionaries to pytorch datasets (chunksize = 119)...
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -    0    0    0 
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -   /w\  /w\  /w\
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -   /'\  / \  /'\
10/24/2020 10:42:44 - INFO - farm.data_handler.data_silo -       
Preprocessing Dataset generated_data\train.tsv: 100%|██████████| 1780/1780 [00:49<00:00, 35.89 Dicts/s]
10/24/2020 10:43:34 - INFO - farm.data_handler.data_silo -   Loading dev set as a slice of trai

### Modeling

In [13]:
# loading the pretrained BERT base cased model
language_model = LanguageModel.load(lang_model)
# prediction head for our model that is suited for classifying news article genres
prediction_head = MultiLabelTextClassificationHead(num_labels=len(label_list))

model = AdaptiveModel(
        language_model=language_model,
        prediction_heads=[prediction_head],
        embeds_dropout_prob=0.1,
        lm_output_types=["per_sequence"],
        device=device)

	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
10/24/2020 10:44:41 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 5]
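
The head above maps the 768-dimensional sequence representation to 5 logits, one per genre. As a rough illustration of how a multi-label head typically turns such logits into labels, the sketch below applies a sigmoid to each logit independently and keeps every label above a threshold. The threshold of 0.5 and the logit values are assumptions for illustration, not values taken from FARM or from this run.

```python
import math

label_list = ['entertainment', 'sport', 'politics', 'business', 'tech']


def sigmoid(x: float) -> float:
    """Squash a logit into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_decode(logits, labels, threshold=0.5):
    """Keep every label whose independent probability exceeds the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    return [label for label, p in zip(labels, probabilities) if p > threshold]


# Made-up logits for one article; only the 'business' logit is positive.
example_logits = [-3.1, -2.4, -4.0, 2.7, -1.9]
predicted = multilabel_decode(example_logits, label_list)
print(predicted)
```

Because each logit is thresholded on its own, the result is a list that may hold zero, one, or several labels.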


In [14]:
model, optimizer, lr_schedule = initialize_optimizer(
        model=model,
        learning_rate=3e-5,
        device=device,
        n_batches=len(data_silo.loaders["train"]),
        n_epochs=n_epochs)

10/24/2020 10:44:41 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 3e-05}'
10/24/2020 10:44:42 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
10/24/2020 10:44:42 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_steps': 154.70000000000002, 'num_training_steps': 1547}'
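
The schedule log above reports 1547 training steps and about 154.7 warmup steps, which is consistent with a warmup proportion of 0.1 (an inference from the log, not a value set in this notebook). The sketch below recomputes those numbers and shows the shape of a linear warmup followed by a linear decay. It is a simplified stand-in written for illustration, not the library's own scheduler.

```python
n_train_batches = 1547    # len(data_silo.loaders["train"]) according to the log
n_epochs = 1
warmup_proportion = 0.1   # assumed; matches 154.7 / 1547 in the log
base_lr = 3e-5

num_training_steps = n_train_batches * n_epochs
num_warmup_steps = num_training_steps * warmup_proportion


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1.0, num_warmup_steps)
    remaining = max(0.0, num_training_steps - step)
    return base_lr * remaining / max(1.0, num_training_steps - num_warmup_steps)


for step in (0, 77, 155, 800, 1547):
    print(step, lr_at(step))
```

With `batch_size = 1` each step is a single article, so the learning rate peaks after roughly 155 articles and then decays for the rest of the epoch.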


### Training

In [15]:
trainer = Trainer(
        model=model,
        optimizer=optimizer,
        data_silo=data_silo,
        epochs=n_epochs,
        n_gpu=n_gpu,
        lr_schedule=lr_schedule,
        evaluate_every=evaluate_every,
        device=device)

In [16]:
trainer.train()

10/24/2020 10:45:17 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/0 (Cur. train loss: 0.4709):   6%|▋         | 100/1547 [21:47<4:56:15, 12.28s/it]
Evaluating:   0%|          | 0/233 [00:00<?, ?it/s]
Evaluating:   1%|▏         | 3/233 [00:11<14:13,  3.71s/it]
Evaluating:   3%|▎         | 7/233 [00:21<12:40,  3.36s/i

10/24/2020 14:37:04 - INFO - farm.eval -   
 _________ text_classification _________
10/24/2020 14:37:04 - INFO - farm.eval -   loss: 0.05575110925193636
10/24/2020 14:37:04 - INFO - farm.eval -   task_name: text_classification
10/24/2020 14:37:04 - INFO - farm.eval -   f1_macro: 0.9677729409613468
10/24/2020 14:37:04 - INFO - farm.eval -   report: 
                precision    recall  f1-score   support

entertainment     0.9756    1.0000    0.9877        40
        sport     1.0000    1.0000    1.0000        53
     politics     0.9434    0.9615    0.9524        52
     business     0.9800    0.9074    0.9423        54
         tech     0.9429    0.9706    0.9565        34

    micro avg     0.9698    0.9657    0.9677       233
    macro avg     0.9684    0.9679    0.9678       233
 weighted avg     0.9702    0.9657    0.9675       233
  samples avg     0.9657    0.9657    0.9657       233

Train epoch 0/0 (Cur. train loss: 0.0088):  26%|██▌       | 400/1547 [4:11:01<3:22:39, 10.60s/

Evaluating:  98%|█████████▊| 228/233 [10:04<00:12,  2.58s/it]
Evaluating: 100%|██████████| 233/233 [10:16<00:00,  2.65s/it]
10/24/2020 16:05:18 - INFO - farm.eval -   

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
***************************************************
***** EVALUATION | DEV SET | AFTER 600 BATCHES *****
***************************************************
\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10/24/2020 16:05:18 - INFO - farm.eval -   
 _________ text_classification _________
10/24/2020 16:05:18 - INFO - farm.eval -   loss: 0.08029513205141789
10/24/2020 16:05:18 - INFO - farm.eval -   task_name: text_classification
10/24/2020 16:05:18 - INFO - farm.eval -   f1_macro: 0.9521094216243364
10/24/2020 16:05:18 - INFO - farm.eval -   report: 
                precision    recall  f1-score   support

entertainment     0.9750    0.9750    0.9750  

10/24/2020 17:33:24 - INFO - farm.eval -   
 _________ text_classification _________
10/24/2020 17:33:24 - INFO - farm.eval -   loss: 0.03386410057760551
10/24/2020 17:33:24 - INFO - farm.eval -   task_name: text_classification
10/24/2020 17:33:25 - INFO - farm.eval -   f1_macro: 0.9821596136260355
10/24/2020 17:33:25 - INFO - farm.eval -   report: 
                precision    recall  f1-score   support

entertainment     0.9756    1.0000    0.9877        40
        sport     0.9815    1.0000    0.9907        53
     politics     0.9808    0.9808    0.9808        52
     business     1.0000    0.9630    0.9811        54
         tech     0.9706    0.9706    0.9706        34

    micro avg     0.9828    0.9828    0.9828       233
    macro avg     0.9817    0.9829    0.9822       233
 weighted avg     0.9830    0.9828    0.9828       233
  samples avg     0.9828    0.9828    0.9828       233

Train epoch 0/0 (Cur. train loss: 0.0051):  65%|██████▍   | 1000/1547 [7:07:02<1:33:27, 10.25s

10/24/2020 19:02:08 - INFO - farm.eval -   f1_macro: 0.9821596136260355
10/24/2020 19:02:08 - INFO - farm.eval -   report: 
                precision    recall  f1-score   support

entertainment     0.9756    1.0000    0.9877        40
        sport     0.9815    1.0000    0.9907        53
     politics     0.9808    0.9808    0.9808        52
     business     1.0000    0.9630    0.9811        54
         tech     0.9706    0.9706    0.9706        34

    micro avg     0.9828    0.9828    0.9828       233
    macro avg     0.9817    0.9829    0.9822       233
 weighted avg     0.9830    0.9828    0.9828       233
  samples avg     0.9828    0.9828    0.9828       233

Train epoch 0/0 (Cur. train loss: 0.0036):  84%|████████▍ | 1300/1547 [8:35:04<43:43, 10.62s/it]    
Evaluating:   0%|          | 0/233 [00:00<?, ?it/s]
Evaluating:   2%|▏         | 4/233 [00:12<12:10,  3.19s/it]
Evaluating:   4%|▍         | 9/233 [00:24<11:00,  2.95s/it]
Evaluating:   4%|▍         | 9/233 [00:3

Train epoch 0/0 (Cur. train loss: 0.0039): 100%|██████████| 1547/1547 [9:52:12<00:00, 22.97s/it]   
Evaluating: 100%|██████████| 445/445 [18:48<00:00,  2.54s/it]
10/24/2020 20:56:18 - INFO - farm.eval -   

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
***************************************************
***** EVALUATION | TEST SET | AFTER 1547 BATCHES *****
***************************************************
\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10/24/2020 20:56:18 - INFO - farm.eval -   
 _________ text_classification _________
10/24/2020 20:56:19 - INFO - farm.eval -   loss: 0.05163107706296645
10/24/2020 20:56:19 - INFO - farm.eval -   task_name: text_classification
10/24/2020 20:56:19 - INFO - farm.eval -   f1_macro: 0.9763426913345613
10/24/2020 20:56:19 - INFO - farm.eval -   report: 
                precision    recall  f1-score   support

entertainmen

AdaptiveModel(
  (language_model): Bert(
    (model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
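
The `f1_macro` value in the evaluation reports is the unweighted mean of the per-class F1 scores. As a check, the sketch below recomputes it from the rounded precision and recall columns of the first complete dev-set report (the one logged with `f1_macro: 0.9677729409613468`). Because the inputs are rounded to four decimals, the result only approximates the logged value.

```python
# (precision, recall) per class, copied from the first complete dev-set report.
report = {
    "entertainment": (0.9756, 1.0000),
    "sport":         (1.0000, 1.0000),
    "politics":      (0.9434, 0.9615),
    "business":      (0.9800, 0.9074),
    "tech":          (0.9429, 0.9706),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}
f1_macro = sum(per_class_f1.values()) / len(per_class_f1)

print(per_class_f1)
print(round(f1_macro, 4))
```

Macro averaging gives every genre the same weight regardless of its support, so a weak class such as `business` in this report pulls the score down as much as a large class would.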
            

## Saving and Inferencing

In [17]:
save_dir = "saved_models/bert-english-news-article"
model.save(save_dir)
processor.save(save_dir)

In [3]:
save_dir = "saved_models/bert-english-news-article"
inferenced_model = Inferencer.load(save_dir)

10/25/2020 16:57:03 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
10/25/2020 16:57:12 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
10/25/2020 16:57:12 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 5]
10/25/2020 16:57:13 - INFO - farm.modeling.prediction_head -   Loading prediction head from saved_models\bert-english-news-article\prediction_head_0.bin
10/25/2020 16:57:15 - INFO - farm.data_handler.processor -   Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
10/25/2020 16:57:15 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
10/25/2020 16:57:16 - INFO - farm.infer -   Got ya 3 parallel workers to do inference ...
10/25/2020 16:57:16 - INFO - farm.infer -

In [4]:
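def read_file(file_name: str) -> dict:
  # Read the article as UTF-8; without an explicit encoding, symbols such as "€" can come out garbled on Windows.
  # The context manager closes the file, and line breaks are collapsed into spaces.
  with open(file_name, 'r', encoding='utf-8') as text_file:
    text = text_file.read().replace('\n', ' ')
  return {'text': text}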
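
When reading article files, the text encoding matters. The log paths in this notebook use backslashes, so it was presumably run on Windows, where `open()` without an `encoding` argument often falls back to a legacy code page such as cp1252. The sketch below shows, on a made-up sentence, how UTF-8 bytes decoded as cp1252 turn currency symbols into garbage, and how the round trip recovers them.

```python
original = "cost up to €1.2bn (£1.1bn)"

# Simulate a UTF-8 file being read with the wrong (cp1252) encoding.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")

# Reversing the wrong decode step recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

Passing `encoding='utf-8'` to `open()` avoids the problem at the source, assuming the files are in fact stored as UTF-8.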

In [5]:
def create_input(text_files:list) -> list:
  model_input = list()
  for text_file in text_files:
    model_input.append(read_file(text_file['file']))
  return model_input

In [6]:
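def create_result_overview(articles: list, result: list) -> pd.DataFrame:
  # The Inferencer may return several result dicts (one per batch), so flatten their predictions first.
  all_predictions = [pred for batch in result for pred in batch['predictions']]
  files = list()
  labels = list()
  predictions = list()
  for article, pred in zip(articles, all_predictions):
    files.append(article['file'])
    labels.append(article['genre'])
    # strip the list punctuation from the label string, e.g. "['business']" -> "business"
    predictions.append(pred['label'].strip("'[]'"))
  data = {'file': files, 'actual': labels, 'prediction': predictions}
  df = pd.DataFrame(data)
  return df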
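
The `.strip("'[]'")` call in the function above suggests that the predicted label arrives as the string form of a list, for example `"['business']"`. That shape is an assumption inferred from the code rather than something shown in the output. Under it, the sketch below compares stripping with parsing the string via `ast.literal_eval`, which also copes with more than one label.

```python
import ast

single = "['business']"          # assumed shape of a one-label prediction
multi = "['business', 'tech']"   # hypothetical two-label prediction

# Stripping works for a single label ...
stripped_single = single.strip("'[]'")
# ... but leaves stray quotes and commas when several labels are present.
stripped_multi = multi.strip("'[]'")

# Parsing the string as a Python literal yields a proper list in both cases.
parsed_single = ast.literal_eval(single)
parsed_multi = ast.literal_eval(multi)

print(stripped_single, "|", stripped_multi)
print(parsed_single, parsed_multi)
```

For this data set every article has exactly one genre, so stripping is sufficient; parsing is the safer choice if multi-label predictions can occur.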

In [7]:
articles = [{'file': 'generated_data/inferencing/business.txt', 'genre': 'business'},
            {'file': 'generated_data/inferencing/sport.txt', 'genre': 'sport'}]

article_texts = create_input(articles)

In [8]:
article_texts

[{'text': '"This plan is essential," said chief executive Clotilde Delbos, who announced cuts in production to focus on more profitable car models. Some 4,600 job cuts will be in France, and the firm said on Friday that it had begun talks with unions. On Thursday, Renault\'s strategic partner Nissan unveiled huge job cuts. Renault, 15% owned by the French state, said six sites are under review. The company is slashing costs by cutting the number of subcontractors in areas such as engineering, reducing the number of components it uses, freezing expansion plans in Romania and Morocco and shrinking gearbox manufacturing worldwide. The French firm plans to trim its global production capacity to 3.3 million vehicles in 2024 from 4 million now, focusing on areas like small vans or electric cars. Renault, which claims more than 4% of the global car market, said the plans would affect about 10% of its 179,000-strong global workforce and cost up to €1.2bn (£1.1bn). Falling sales Renault is p

In [24]:
articles = [{'file': 'generated_data/inferencing/business.txt', 'genre': 'business'},
            {'file': 'generated_data/inferencing/sport.txt', 'genre': 'sport'}]

article_texts = create_input(articles)

result = inferenced_model.inference_from_dicts(article_texts)

df = create_result_overview(articles, result)

df.head()

Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.83s/ Batches]


|   | file | actual | prediction |
|---|------|--------|------------|
| 0 | generated_data/inferencing/business.txt | business | business |
| 1 | generated_data/inferencing/sport.txt | sport | sport |
