## BERT (Bidirectional Encoder Representations from Transformers)

A pretrained BERT model is fine-tuned on news articles with known labels. The resulting model can then be used to classify unseen news articles by topic.


### Imports

In [37]:
import pandas as pd
import re # Regular expressions
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ahmat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Loading the data

In [38]:
df = pd.read_csv('data/bbc_data.csv')
df.rename(columns={'data': 'text', 'labels': 'category'}, inplace=True)
df.head()

,text,category
0,Musicians to tackle US red tape Musicians gro...,entertainment
1,"U2s desire to be number one U2, who have won ...",entertainment
2,Rocker Doherty in on-stage fight Rock singer ...,entertainment
3,Snicket tops US box office chart The film ada...,entertainment
4,"Oceans Twelve raids box office Oceans Twelve,...",entertainment


### Data preprocessing

#### Label encoding

In [39]:
possible_labels = df.category.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'entertainment': 0, 'business': 1, 'sport': 2, 'politics': 3, 'tech': 4}

In [40]:
# Map each category name to its integer id
df['label'] = df.category.map(label_dict)
df.head()

,text,category,label
0,Musicians to tackle US red tape Musicians gro...,entertainment,0
1,"U2s desire to be number one U2, who have won ...",entertainment,0
2,Rocker Doherty in on-stage fight Rock singer ...,entertainment,0
3,Snicket tops US box office chart The film ada...,entertainment,0
4,"Oceans Twelve raids box office Oceans Twelve,...",entertainment,0


#### Text preprocessing

In [41]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Replace non-word characters with spaces
    text = re.sub(r'\W', ' ', text)
    # Remove English stopwords
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text
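
To see what this cleaning does to a headline, the sketch below re-implements the same steps on a toy string. `TOY_STOPWORDS` and `preprocess_text_demo` are hypothetical names; the small set stands in for NLTK's much longer English list so the example runs without any download.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
TOY_STOPWORDS = {"to", "be", "the", "who", "have", "a", "of"}

def preprocess_text_demo(text, stop_words=TOY_STOPWORDS):
    text = text.lower()               # lowercase
    text = re.sub(r'\d+', '', text)   # drop digits
    text = re.sub(r'\W', ' ', text)   # non-word characters become spaces
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "U2s desire to be number one U2, who have won three Grammys"
cleaned = preprocess_text_demo(sample)
print(cleaned)
```

Because digits are removed before tokenization, "U2s" collapses to "us" and "U2" to "u", which explains the odd-looking tokens in the cleaned rows.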

In [42]:
df['text'] = df['text'].apply(preprocess_text)
df.head()

,text,category,label
0,musicians tackle us red tape musicians groups ...,entertainment,0
1,us desire number one u three prestigious gramm...,entertainment,0
2,rocker doherty stage fight rock singer pete do...,entertainment,0
3,snicket tops us box office chart film adaptati...,entertainment,0
4,oceans twelve raids box office oceans twelve c...,entertainment,0


In [43]:
df.groupby('category').describe()

,label,label,label,label,label,label,label,label
category,count,mean,std,min,25%,50%,75%,max
business,510.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
entertainment,386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
politics,417.0,3.0,0.0,3.0,3.0,3.0,3.0,3.0
sport,511.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
tech,401.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0


#### Splitting long texts into chunks for BERT

In [44]:
def get_split(text):
    words = text.split()
    total_words = len(words)
    chunk_size = 200
    overlap = 50
    step = chunk_size - overlap

    if total_words <= chunk_size:
        return [text]

    chunks = []
    for start in range(0, total_words, step):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)

    return chunks
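
The windowing is easier to follow with small numbers. The sketch below uses the same logic as `get_split` but exposes the sizes as parameters and returns lists of words instead of joined strings; `split_with_overlap` is a hypothetical helper written for illustration only.

```python
def split_with_overlap(words, chunk_size, overlap):
    """Same windowing logic as get_split, with the sizes exposed as parameters."""
    if len(words) <= chunk_size:
        return [words]
    step = chunk_size - overlap
    return [words[start:start + chunk_size]
            for start in range(0, len(words), step)]

words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With 10 words, a window of 4 and an overlap of 1, the last window holds only `w9`, which already appears at the end of the previous window. The same can happen with the 200/50 setting whenever a text leaves a short remainder. Stopping once a window reaches the end of the text would avoid these tail chunks, but it would also change the chunk counts reported in this notebook.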

In [52]:
split_df = df.copy()
split_df['split'] = split_df['text'].apply(get_split)
split_df.head()

,text,category,label,split
0,musicians tackle us red tape musicians groups ...,entertainment,0,[musicians tackle us red tape musicians groups...
1,us desire number one u three prestigious gramm...,entertainment,0,[us desire number one u three prestigious gram...
2,rocker doherty stage fight rock singer pete do...,entertainment,0,[rocker doherty stage fight rock singer pete d...
3,snicket tops us box office chart film adaptati...,entertainment,0,[snicket tops us box office chart film adaptat...
4,oceans twelve raids box office oceans twelve c...,entertainment,0,[oceans twelve raids box office oceans twelve ...


In [53]:
def flatten_column(df):
    """
    Flattens the 'split' column of a DataFrame into individual chunks,
    associating each chunk with the category and label of its source row.
    
    Returns:
    tuple: Three lists containing the chunks, their categories, and their labels.
    """
    column_elements = []
    category_elements = []
    label_elements = []

    for _, row in df.iterrows():
        for element in row['split']:
            column_elements.append(element)
            category_elements.append(row['category'])
            label_elements.append(row['label'])
    
    return column_elements, category_elements, label_elements
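
As a possible alternative to the explicit loop, pandas can do the same flattening with `DataFrame.explode`. The sketch below runs on a made-up three-chunk frame (`toy` and `flat` are illustrative names) and is not the code used to produce the results in this notebook.

```python
import pandas as pd

# Toy frame shaped like split_df: one list of chunks per article
toy = pd.DataFrame({
    'category': ['sport', 'tech'],
    'label': [2, 4],
    'split': [['chunk a', 'chunk b'], ['chunk c']],
})

# explode() emits one row per list element and repeats the other columns
flat = (toy.explode('split')
           .rename(columns={'split': 'text'})
           .reset_index(drop=True)[['text', 'category', 'label']])
print(flat)
```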

In [66]:
text_l, category_l, label_l = flatten_column(split_df)
bert_df = pd.DataFrame({'text': text_l, 'category': category_l, 'label': label_l})
bert_df.head()

,text,category,label
0,musicians tackle us red tape musicians groups ...,entertainment,0
1,us market seen holy grail one benchmarks succe...,entertainment,0
2,us desire number one u three prestigious gramm...,entertainment,0
3,band done everything considerable powers ensur...,entertainment,0
4,songs like sunday bloody sunday new years day ...,entertainment,0


#### Train/validation split

In [67]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(bert_df.index.values, 
                                                  bert_df.label.values, 
                                                  test_size=0.20, 
                                                  random_state=42, 
                                                  stratify=bert_df.label.values)
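
One caveat: the split is done at chunk level, and chunks from the same article overlap by 50 words, so near-duplicate text from one article can end up in both sets and may inflate validation scores. A minimal sketch of splitting by article instead is shown below. It assumes a per-chunk `article_ids` list, which this notebook does not currently keep, and `split_by_group` is a hypothetical helper.

```python
import random

def split_by_group(group_ids, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so that no group appears in both sets."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train_idx = [i for i, g in enumerate(group_ids) if g not in val_groups]
    val_idx = [i for i, g in enumerate(group_ids) if g in val_groups]
    return train_idx, val_idx

# Hypothetical article id for each chunk (article 0 produced two chunks, etc.)
article_ids = [0, 0, 1, 2, 2, 2, 3, 4, 4, 5]
train_idx, val_idx = split_by_group(article_ids)
print(train_idx, val_idx)
```

This sketch does not stratify by label. scikit-learn's `GroupShuffleSplit` and `StratifiedGroupKFold` offer library versions of the same idea.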

In [68]:
bert_df['data_type'] = ['not_set']*bert_df.shape[0]

bert_df.loc[X_train, 'data_type'] = 'train'
bert_df.loc[X_val, 'data_type'] = 'val'

bert_df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,text
business,1,train,598
business,1,val,150
entertainment,0,train,451
entertainment,0,val,113
politics,3,train,665
politics,3,val,166
sport,2,train,601
sport,2,val,150
tech,4,train,719
tech,4,val,180


### BERT Tokenizer

In [77]:
import torch
from torch.utils.data import TensorDataset
from transformers import BertTokenizer

# Set the logging level to ERROR to hide INFO and WARNING messages
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

In [62]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [69]:
encoded_data_train = tokenizer.batch_encode_plus(
    bert_df[bert_df.data_type=='train'].text.tolist(), 
    add_special_tokens=True, # Add [CLS] and [SEP] tokens at the beginning and end of each sequence
    return_attention_mask=True, # Return attention masks
    padding='max_length', # Pad every sequence to max_length (replaces the deprecated pad_to_max_length=True)
    truncation=True, # Truncate sequences longer than max_length
    max_length=256, # Maximum sequence length
    return_tensors='pt' # Return PyTorch tensors
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(bert_df[bert_df.data_type=='train'].label.values)
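
The padding and attention-mask options can be illustrated without the tokenizer. The toy function below is not the `transformers` API: `pad_and_mask` is a hypothetical helper that right-pads a list of token ids and marks real tokens with 1 and padding with 0. Its truncation is a plain slice, whereas the real tokenizer keeps the closing `[SEP]` token.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased; the middle ids are arbitrary
ids, mask = pad_and_mask([101, 2023, 2003, 102], max_length=6)
print(ids)
print(mask)
```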



In [70]:
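encoded_data_val = tokenizer.batch_encode_plus(
    bert_df[bert_df.data_type=='val'].text.tolist(), 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', # Replaces the deprecated pad_to_max_length=True
    truncation=True, # Truncate sequences longer than max_length
    max_length=256, 
    return_tensors='pt'
)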

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(bert_df[bert_df.data_type=='val'].label.values)

In [71]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
len(dataset_train), len(dataset_val)

(3034, 759)

### BERT from pre-trained model

In [73]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)


#### Data Loaders

In [74]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 3
dataloader_train = DataLoader(dataset_train, sampler=RandomSampler(dataset_train), batch_size=batch_size)
dataloader_validation = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=batch_size)

#### Optimizer and scheduler

In [86]:
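from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# transformers.AdamW is deprecated; torch.optim.AdamW is used instead.
# weight_decay=0.0 matches the default of the old transformers implementation.
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8, weight_decay=0.0)
epochs = 3
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader_train)*epochs)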
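
To make the schedule concrete, the sketch below reproduces the linear warmup/decay rule in plain Python and evaluates it for this run. `linear_schedule_lr` is a hypothetical helper, and the step count is derived from the 3034 training chunks and the batch size of 3 used in this notebook.

```python
import math

def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

steps_per_epoch = math.ceil(3034 / 3)   # training chunks / batch size
total_steps = steps_per_epoch * 3       # three epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 1e-5, total_steps))
```

With no warmup, the learning rate starts at its full value of 1e-5, is halved midway through training, and reaches zero at the last step.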



#### Metrics

In [87]:
import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
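
For readers unfamiliar with the `average='weighted'` option, the sketch below computes the same quantity by hand on made-up labels: a per-class F1 score, averaged with each class weighted by its number of true samples. `weighted_f1` is a hypothetical helper written for illustration; the notebook itself relies on scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1   # weight each class by its support
    return total / len(y_true)

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
score = weighted_f1(y_true, y_pred)
print(score)
```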

#### Training and evaluation

In [88]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cpu


In [89]:
def evaluate(dataloader_val):
    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [90]:
import os
from tqdm.notebook import tqdm

# Make sure the checkpoint directory exists before the first save
os.makedirs('bert_finetuned', exist_ok=True)

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)

    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        # Clip gradients to a maximum norm of 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        # The loss is already averaged over the batch; len(batch) is the number of
        # tensors in the tuple (3), not the batch size, so it is not used as a divisor
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    torch.save(model.state_dict(), f'bert_finetuned/finetuned_BERT_epoch_{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')



Epoch 1
Training loss: 0.2823709606654526
Validation loss: 0.17123412908553678
F1 Score (Weighted): 0.9655609524525839




Epoch 2
Training loss: 0.10388365186703265
Validation loss: 0.13282264063821317
F1 Score (Weighted): 0.9723514140247191




Epoch 3
Training loss: 0.05127869044852617
Validation loss: 0.13440831445673804
F1 Score (Weighted): 0.9710191306431812
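
The three epochs can be compared programmatically. The sketch below copies the validation metrics from the log above into a dictionary (`history` is an illustrative name) and picks the epoch with the lowest validation loss.

```python
# Validation metrics copied from the training log
history = {
    1: {'val_loss': 0.17123412908553678, 'val_f1': 0.9655609524525839},
    2: {'val_loss': 0.13282264063821317, 'val_f1': 0.9723514140247191},
    3: {'val_loss': 0.13440831445673804, 'val_f1': 0.9710191306431812},
}

best_by_loss = min(history, key=lambda e: history[e]['val_loss'])
best_by_f1 = max(history, key=lambda e: history[e]['val_f1'])
print(best_by_loss, best_by_f1)
```

Both criteria point to epoch 2, although the gap to epoch 3 is small and could be within run-to-run noise.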


#### Loading and evaluating the model

In [91]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)
model.to(device)

# Load the checkpoint saved after epoch n onto the active device
n = 3
model.load_state_dict(torch.load(f'bert_finetuned/finetuned_BERT_epoch_{n}.model', map_location=device))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)

Class: entertainment
Accuracy: 111/113

Class: business
Accuracy: 142/150

Class: sport
Accuracy: 148/150

Class: politics
Accuracy: 163/166

Class: tech
Accuracy: 173/180
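
The per-class counts can be combined into a single overall accuracy. The short sketch below does the arithmetic with the figures printed above; `per_class` is an illustrative name.

```python
# (correct, total) per class, copied from the output above
per_class = {
    'entertainment': (111, 113),
    'business': (142, 150),
    'sport': (148, 150),
    'politics': (163, 166),
    'tech': (173, 180),
}

correct = sum(c for c, _ in per_class.values())
total = sum(t for _, t in per_class.values())
overall_accuracy = correct / total

for name, (c, t) in per_class.items():
    print(f'{name}: {c / t:.4f}')
print(f'overall: {correct}/{total} = {overall_accuracy:.4f}')
```

Business and tech have the lowest per-class accuracy, while the other three classes each have at most three misclassified validation chunks.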

