## Fine-Tuning BERT for Text Classification

### Load Data

In [1]:
import pandas as pd

excel_file = './Euraxess_GNSS_Keywords.xlsx'
df = pd.read_excel(excel_file)

df['Concatenated'] = df[['Title', 'OfferDescription', 'Requirements', 'AdditionalInformation']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

descriptions = df['Concatenated'].tolist()

### Initialize BERT Tokenizer

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Batch Tokenization of Texts

In [3]:
from sklearn.model_selection import train_test_split

tokenized_texts = tokenizer(descriptions, truncation=True, padding=True)

combined_texts = [{'input_ids': input_ids, 'attention_mask': attention_mask} 
                  for input_ids, attention_mask in zip(tokenized_texts['input_ids'], tokenized_texts['attention_mask'])]

labels = list(df['Position'])

# dict.fromkeys keeps first-seen order, so the mapping is reproducible across runs
label_map = {label: idx for idx, label in enumerate(dict.fromkeys(labels))}
labels = [label_map[label] for label in labels]

train_texts, val_texts, train_labels, val_labels = train_test_split(combined_texts, labels, test_size=0.2, random_state=42)
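
A minimal sketch, with hypothetical label values, of building a reproducible label-to-index mapping (first-seen order via `dict.fromkeys`) and its inverse. The inverse is what turns predicted indices back into label names at inference time. The `example_*` names are illustrative and not part of the notebook.

```python
# Hypothetical label values, for illustration only.
example_labels = ["PhD", "PostDoc", "Bachelor's Degree", "PhD", "PostDoc"]

# dict.fromkeys removes duplicates while keeping first-seen order,
# so the same input list always yields the same indices.
example_label_map = {
    label: idx for idx, label in enumerate(dict.fromkeys(example_labels))
}

# Inverse mapping: class index -> label name.
example_id_to_label = {idx: label for label, idx in example_label_map.items()}

# Encode the string labels as integer class ids.
encoded = [example_label_map[label] for label in example_labels]

print(example_label_map)
print(encoded)
```

Building the mapping from a bare `set` would also work within one session, but the index assigned to each string label can differ between interpreter runs, which matters if the mapping is rebuilt after the model has been saved.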

### Create Custom Dataset

In [4]:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {
            'input_ids': torch.tensor(self.encodings[idx]['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(self.encodings[idx]['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CustomDataset(train_texts, train_labels)
val_dataset = CustomDataset(val_texts, val_labels)

### Create DataLoaders

In [5]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

### Model Setup and Training

In [6]:
from transformers import BertForSequenceClassification
# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW
import torch

# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_map))

# Set device (GPU/CPU)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Optimizer, Loss function
optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Training loop
for epoch in range(3):  # Adjust number of epochs as needed
    print("Epoch: ",(epoch + 1))
    model.train()
    for i,batch in enumerate(train_loader): 
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        pred = outputs.logits
        # Use the labels already moved to the device; batch['labels'] is still on the CPU
        loss = loss_fn(pred, labels)
        loss.backward()
        optimizer.step()
        
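        # CrossEntropyLoss already returns the batch mean; dividing by the
        # batch size (16) only rescales the value that is printed.
        train_batch_loss = loss.item()
        train_last_loss = train_batch_loss / 16
        print('Training batch {} last loss: {}'.format(i + 1, train_last_loss))
    # Reports the (rescaled) loss of the last batch, not an average over the epoch
    print(f"\nTraining epoch {epoch + 1} loss: ",train_last_loss)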
    
    # Validation
    model.eval()
    val_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()
            
            _, predicted = torch.max(outputs.logits, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f'Epoch {epoch + 1}, Validation Loss: {val_loss / len(val_loader)}, Validation Accuracy: {(correct / total) * 100}%')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch:  1
Training batch 1 last loss: 0.14404307305812836
Training batch 2 last loss: 0.11648021638393402
Training batch 3 last loss: 0.11102175712585449
Training batch 4 last loss: 0.11154456436634064
Training batch 5 last loss: 0.10801249742507935
Training batch 6 last loss: 0.10873870551586151
Training batch 7 last loss: 0.10070991516113281
Training batch 8 last loss: 0.10630971938371658
Training batch 9 last loss: 0.10036035627126694
Training batch 10 last loss: 0.093991719186306
Training batch 11 last loss: 0.08659720420837402
Training batch 12 last loss: 0.08865533024072647
Training batch 13 last loss: 0.08101295679807663
Training batch 14 last loss: 0.08280676603317261
Training batch 15 last loss: 0.07919792085886002
Training batch 16 last loss: 0.08716821670532227
Training batch 17 last loss: 0.07308534532785416
Training batch 18 last loss: 0.07089197635650635
Training batch 19 last loss: 0.06962834298610687
Training batch 20 last loss: 0.06647060066461563
Training batch 21 las

KeyboardInterrupt: 
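
The training loop prints the loss of each batch, and its per-epoch line reports only the last batch. A minimal sketch, using made-up numbers, of accumulating a sample-weighted average over an epoch instead. `batch_losses` and `batch_sizes` are hypothetical stand-ins for `loss.item()` and `labels.size(0)` collected inside the loop.

```python
# Hypothetical per-batch values, for illustration only.
batch_losses = [2.0, 1.0, 0.5]  # mean loss of each batch
batch_sizes = [16, 16, 8]       # the last batch may be smaller

total_loss = 0.0
total_samples = 0
for batch_loss, batch_size in zip(batch_losses, batch_sizes):
    # Undo the per-batch mean so every sample carries equal weight.
    total_loss += batch_loss * batch_size
    total_samples += batch_size

epoch_loss = total_loss / total_samples
print(epoch_loss)
```

Weighting by batch size keeps a short final batch from counting as much as a full one.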

### Use Trained Model

In [None]:
new_texts = [
    """Offer Description
        SuperGPS-2 projectThis research project aims to develop a robust and efficient terrestrial system for accurate positioning and time-transfer, using virtual ultra-wideband radio signals, which can serve as a backup and complement to GNSS (Global Navigation Satellite System) in environments with reduced GNSS availability. The virtual ultra-wideband approach allows for a limited demand for expensive radio frequency spectrum by using multiband signals and, through flexible signal design, straightforward implementation and integration with current and new generation telecommunications standards, such as 5G, are expected. In this project, we address multiband radio channel modelling, signal design for multiband ranging, estimation of the channel impulse response from a multiband signal, and multiband carrier-phase based ranging and positioning. Furthermore, a proof-of-principle hardware prototype test-bed will be developed for carrying out measurements and demonstrating the concept.The SuperGPS-2 project has started in Fall 2023 and the research team currently consists of two PhD-students supervised by two academic staff members at TU Delft. This project is actively supported by several partners from industry.Job descriptionThe SuperGPS-2 project is carried out jointly by the TU Delft Faculty of Electrical Engineering, Mathematics, and Computer Sciences (EEMCS) and the Faculty of Civil Engineering and Geosciences (CEG). The PostDoc should complement the current project-team, and will be appointed, for a 2.5 years term, at the former faculty. Next to research on radio signal processing and positioning, he/she will be leading the development of the proof-of-principle virtual ultra-wideband prototype test-bed for positioning and time-transfer, with the aim of a live-demonstration by the end of the project. 
Specifically, the PostDoc will supervise and contribute to the PhD students’ work on the prototype test-bed, and coordinate the actual demonstration.TU Delft offers an excellent and stimulating research environment with extensive available infrastructure and expertise.
        Requirements         
        Specific RequirementsCandidates are expected to have a PhD-degree in electrical engineering: telecommunications, signal processing or micro-electronics, excellent command of English (certified through a TOEFL or IELTS test), proven team-work competences, as well as good laboratory and programming competences (working with measurement equipment, electronics, and hardware). In addition, hardware programming skills (FPGA design, VHDL code development) are required and need to be proven."""
]

# Tokenize and move the tensors to the same device as the model
inputs = tokenizer(new_texts, truncation=True, padding=True, return_tensors="pt").to(device)

model.eval()
# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Pick the highest-scoring class for each text
predictions = torch.argmax(logits, dim=-1)

# Map predicted class indices back to label names
id_to_label = {idx: label for label, idx in label_map.items()}
predicted_labels = [id_to_label[pred.item()] for pred in predictions]

print(predicted_labels)

["Bachelor's Degree"]
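
The prediction cell reports only the arg-max class. A minimal sketch, in plain Python with made-up logits, of turning logits into probabilities so that a confidence can be shown next to the predicted label. In the notebook the equivalent call would be `torch.softmax(logits, dim=-1)`. The `softmax` helper and `example_logits` are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes, for illustration only.
example_logits = [2.0, 0.5, -1.0]

probs = softmax(example_logits)
# Index of the most probable class, same result as arg-max over the logits.
best = max(range(len(probs)), key=probs.__getitem__)

print(best, round(probs[best], 3))
```

A low top probability is a hint that the text does not clearly belong to any of the trained classes, which a bare arg-max hides.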
