## Advanced

## City or Not: A Few-Shot Classifier (Adaptable to Other Categories)

This section walks through the code below step by step.

**Objective:**

The code builds a binary classifier that distinguishes Latin city names from other Latin words. It fine-tunes a pre-trained multilingual BERT (Bidirectional Encoder Representations from Transformers) model, `bert-base-multilingual-uncased`, on a small labeled set. This is the "few-shot" aspect: BERT is taught to recognize one specific pattern (Latin city names) from only a few dozen examples per class.

**Code Structure:**

1. **Setup:**
   - Import necessary libraries (PyTorch, Transformers, scikit-learn).
   - Load the pre-trained BERT model and its tokenizer.

2. **Data Preparation:**
   - `latin_cities`: A list of Latin city names (positive examples).
   - `non_cities`: A list of other Latin words (negative examples).
   - Combine the data, create labels (1 for city, 0 for non-city), and shuffle.
   - Split the data into training (about 70%), validation (about 15%), and test (about 15%) sets.
   - `LatinWordDataset`: A custom class to load and tokenize the words for BERT.

3. **Model Definition:**
   - `LatinCityClassifier`: A custom neural network class.
     - Leverages the pre-trained BERT model.
     - Adds a dropout layer (for regularization) and a linear classifier layer to make the final prediction (city or not city).
     
   - `prototypical_loss`: A prototype-based loss that scores each example by its distance to one prototype embedding per class, so that city names are pulled toward the city prototype and away from the non-city prototype. The function is defined in the code, but the training loop does not call it; training uses cross-entropy.

4. **Training and Validation:**
   - Set up the training environment (device - CPU or GPU).
   - Define the optimizer (AdamW) and loss function (cross-entropy).
   - Iterate over training epochs.
     - In each epoch, calculate the loss and update the model's parameters based on the training data.
     - Evaluate the model's performance on the validation set, save the model whenever validation accuracy improves, and stop early after 3 epochs without improvement.

5. **Final Testing:**
   - Load the best-performing model from the validation phase.
   - Evaluate the model on the test set and print the accuracy, precision, recall, and F1 score.
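
The split sizes in step 2 can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test share up when only `test_size` is given; `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumed rounding rule: the test share is rounded up and the
    remainder becomes training data.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 40 + 42  # 40 city names plus 42 non-city entries in the notebook's lists
n_train, n_temp = split_sizes(n_total, 0.3)  # first split: train vs. held-out
n_val, n_test = split_sizes(n_temp, 0.5)     # second split: validation vs. test
print(n_train, n_val, n_test)  # 57 12 13
```

With only 12 validation and 13 test examples, a single misclassified word moves accuracy by roughly 8 percentage points, so the reported metrics are coarse.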

**Few-Shot Learning Adaptations:**

This code specifically addresses a few-shot learning scenario through these modifications:

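- **Small Dataset:** The data is deliberately limited: 40 city names and 42 non-city entries in total (five personal names are listed twice), split roughly 70/15/15 into training, validation, and test sets.
- **Regularization:** A relatively high dropout rate (0.5) is applied to BERT's pooled output before the classifier, to limit overfitting to the small training set.
- **Prototypical Loss:** The code defines a `prototypical_loss` function. This kind of loss suits few-shot learning because it encourages distinct clusters (one prototype per class) for city and non-city words in the feature space. Note that the training loop as written optimizes standard cross-entropy and never calls this function.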
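
To make the prototype idea concrete, here is a minimal pure-Python sketch. It is a toy under stated assumptions, not the notebook's implementation: the 2-D vectors are made-up stand-ins for BERT embeddings, and `mean_vector`, `euclidean`, and `prototypical_loss_toy` are hypothetical names. Each class prototype is the mean of its support examples, and the loss is cross-entropy over a softmax of negative distances.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prototypical_loss_toy(prototypes, query, label):
    """Cross-entropy over a softmax of negative distances to each prototype."""
    logits = [-euclidean(query, p) for p in prototypes]  # closer means higher score
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[label]), probs

# Hypothetical 2-D "embeddings"; a real run would use BERT outputs instead.
support = {
    0: [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2]],  # non-city examples
    1: [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]],  # city examples
}
prototypes = [mean_vector(support[0]), mean_vector(support[1])]

query = [0.8, 0.9]  # a point that lies near the city cluster
loss, probs = prototypical_loss_toy(prototypes, query, label=1)
predicted = probs.index(max(probs))
print(predicted, round(loss, 3))
```

The loss is small when the query sits close to the prototype of its true class and grows as it drifts toward the other prototype.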

**Key Points:**

- **Transfer Learning:** By starting from a pre-trained BERT model, the classifier reuses the model's existing knowledge of language, which matters most when training data is limited.
- **Custom Dataset Class:** This simplifies the process of loading and preparing your specific data for the BERT model.
- **Regularization:** Dropout helps the model generalize better to unseen data.
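
As a rough illustration of what dropout does, the sketch below implements "inverted" dropout in plain Python: each value is zeroed with probability `p` during training, and the survivors are scaled by `1 / (1 - p)`. `nn.Dropout` follows the same scheme in training mode and passes values through unchanged in evaluation mode. The `dropout` function and its `rng` argument are hypothetical names used only for this example.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode passes values through unchanged
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 8

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
print(train_out)  # each entry is either 0.0 or 2.0
print(eval_out)   # identical to the input
```

This is why the code calls `model.train()` before each training pass and `model.eval()` before validation and testing: the dropout layer only drops values in training mode.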




In [None]:
import torch
import random
from torch import nn
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Load pre-trained model and tokenizer
model_name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_model = AutoModel.from_pretrained(model_name)

# Latin city names dataset
latin_cities = [
    "Roma", "Athenae", "Alexandria", "Carthago", "Constantinopolis",
    "Neapolis", "Mediolanum", "Lugdunum", "Vindobona", "Antiochia",
    "Hierosolyma", "Damascus", "Tyrus", "Thessalonica", "Byzantium",
    "Sparta", "Corinth", "Lutetia", "Londinium", "Eboracum",
    "Augusta Treverorum", "Brundisium", "Toletum", "Florentia", "Syracusae",
    "Ephesus", "Pergamum", "Aquileia", "Massilia", "Tarraco",
    "Caesaraugusta", "Corduba", "Emerita Augusta", "Gades", "Pompeii",
    "Herculaneum", "Ravenna", "Nicomedia", "Mediolanum Santonum", "Colonia Agrippina"
]


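# Non-city words for negative examples (verbs, nouns, personal names, conjunctions).
# Note: the five personal names are listed twice, so copies of the same word
# can end up in different splits.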
non_cities = [
    "amo", "laudo", "dico", "habeo", "audio", "venio", "capio", "vivo",
    "scribo", "canto", "bellum", "pax", "sol", "luna", "terra",
    "aqua", "ignis", "aer", "arbor", "flumen", "mons", "mare", "coelum", "animal", "Caesar", "Cicero",
    "Augustus", "Vergilius", "Livius", "Caesar", "Cicero", "Augustus", "Vergilius",
    "Livius",  "et", "sed", "aut", "vel", "non", "si", "quia", "ut",
]

# Combine datasets and create labels (1 = city, 0 = non-city)
data = [(city, 1) for city in latin_cities] + [(word, 0) for word in non_cities]
random.seed(42)  # seed the shuffle so the splits below are reproducible
random.shuffle(data)

# Split into roughly 70% training, 15% validation, and 15% test data
train_data, temp_data = train_test_split(data, test_size=0.3, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Custom dataset class: tokenizes one word at a time for BERT
class LatinWordDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data  # list of (word, label) pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        word, label = self.data[idx]
        # Pad/truncate to a fixed length so examples can be stacked into batches
        encoding = self.tokenizer(word, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Create datasets and dataloaders (after the class they depend on is defined;
# small batch size for few-shot)
train_dataset = LatinWordDataset(train_data, tokenizer)
val_dataset = LatinWordDataset(val_data, tokenizer)
test_dataset = LatinWordDataset(test_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4)
test_loader = DataLoader(test_dataset, batch_size=4)

# Modified model for few-shot learning
class LatinCityClassifier(nn.Module):
    def __init__(self, bert_model):
        super(LatinCityClassifier, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.5)  # Higher dropout for regularization
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output  # Use pooler output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

# Prototypical loss function (defined for reference; the training loop uses cross-entropy)
def prototypical_loss(prototype, query, labels):
    # prototype: (num_classes, dim), one row per class, ordered by class index
    # query: (num_queries, dim); labels: (num_queries,) true class indices
    distances = torch.cdist(query, prototype, p=2.0)
    # A smaller distance should give a higher score, so negate distances to get logits
    loss_val = nn.CrossEntropyLoss()(-distances, labels)
    return loss_val


# Initialize model and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LatinCityClassifier(bert_model).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

# Training loop with validation
num_epochs = 8
best_val_acc = 0
patience = 3
epochs_without_improvement = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)

    # Validation loop
    model.eval()
    val_loss = 0
    all_val_labels = []
    all_val_preds = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

            _, predicted = torch.max(outputs, 1)
            all_val_labels.extend(labels.cpu().numpy())
            all_val_preds.extend(predicted.cpu().numpy())

    val_loss /= len(val_loader)
    val_acc = accuracy_score(all_val_labels, all_val_preds)
    val_precision = precision_score(all_val_labels, all_val_preds)
    val_recall = recall_score(all_val_labels, all_val_preds)
    val_f1 = f1_score(all_val_labels, all_val_preds)

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, "
          f"Accuracy: {val_acc:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1: {val_f1:.4f}")

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_model_few_shot.pt')
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("Early stopping triggered!")
            break

# Load the best model for final evaluation (map_location keeps this working on CPU-only machines)
model.load_state_dict(torch.load('best_model_few_shot.pt', map_location=device))
model.eval()

all_test_labels = []
all_test_preds = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, 1)
        all_test_labels.extend(labels.cpu().numpy())
        all_test_preds.extend(predicted.cpu().numpy())

accuracy = accuracy_score(all_test_labels, all_test_preds)
precision = precision_score(all_test_labels, all_test_preds)
recall = recall_score(all_test_labels, all_test_preds)
f1 = f1_score(all_test_labels, all_test_preds)
print(f"Final Test Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

Epoch 1/8, Train Loss: 0.6793, Val Loss: 0.6906, Accuracy: 0.5000, Precision: 0.3333, Recall: 1.0000, F1: 0.5000
Epoch 2/8, Train Loss: 0.6884, Val Loss: 0.6803, Accuracy: 0.3333, Precision: 0.2727, Recall: 1.0000, F1: 0.4286
Epoch 3/8, Train Loss: 0.5061, Val Loss: 0.5625, Accuracy: 0.8333, Precision: 0.6000, Recall: 1.0000, F1: 0.7500
Epoch 4/8, Train Loss: 0.4399, Val Loss: 0.4546, Accuracy: 0.9167, Precision: 0.7500, Recall: 1.0000, F1: 0.8571
Epoch 5/8, Train Loss: 0.3179, Val Loss: 0.3016, Accuracy: 0.9167, Precision: 0.7500, Recall: 1.0000, F1: 0.8571
Epoch 6/8, Train Loss: 0.2066, Val Loss: 0.1375, Accuracy: 1.0000, Precision: 1.0000, Recall: 1.0000, F1: 1.0000
Epoch 7/8, Train Loss: 0.0819, Val Loss: 0.1012, Accuracy: 1.0000, Precision: 1.0000, Recall: 1.0000, F1: 1.0000
Epoch 8/8, Train Loss: 0.0411, Val Loss: 0.0348, Accuracy: 1.0000, Precision: 1.0000, Recall: 1.0000, F1: 1.0000
Final Test Accuracy: 0.9231, Precision: 1.0000, Recall: 0.8889, F1: 0.9412
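
The final test line above can be reproduced from a single confusion matrix. The counts below are a reconstruction, not values printed by the notebook: they assume 13 test examples, which is what a 70/15/15 split of 82 items gives when the test share is rounded up.

```python
# Inferred confusion counts for the 13 test examples:
# precision 1.0 implies no false positives, recall 0.8889 matches 8 of 9 cities,
# and accuracy 0.9231 matches 12 of 13 correct predictions.
tp, fp, fn, tn = 8, 0, 1, 4

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9231 1.0000 0.8889 0.9412
```

Under this reading, the model made exactly one test error: a city name classified as a non-city word.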


In [None]:
# Define a function to preprocess and predict
def predict_word(word, model, tokenizer, device):
    model.eval()
    with torch.no_grad():
        inputs = tokenizer(word, return_tensors='pt', max_length=128, padding='max_length', truncation=True)
        input_ids = inputs['input_ids'].to(device)
        attention_mask = inputs['attention_mask'].to(device)

        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, 1)
        return predicted.item()

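# Load the saved model (map_location keeps this working on CPU-only machines)
model = LatinCityClassifier(bert_model).to(device)
model.load_state_dict(torch.load('best_model_few_shot.pt', map_location=device))

# Words to probe the model with. "Tarraco", "Pompeii", "Cicero", and "amo" also
# appear in the labeled lists above, so only the remaining words are truly unseen.
unseen_words = ["Tarraco", "Pompeii", "Marcus", "Cicero", "Senatus", "amo", "ego", "budapest", "szeged", "Wien", "vienna", "Graz"]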

# Make predictions
for word in unseen_words:
    prediction = predict_word(word, model, tokenizer, device)
    print(f"Word: {word}, Prediction: {'City' if prediction == 1 else 'Not a city'}")

Word: Tarraco, Prediction: City
Word: Pompeii, Prediction: City
Word: Marcus, Prediction: Not a city
Word: Cicero, Prediction: Not a city
Word: Senatus, Prediction: Not a city
Word: amo, Prediction: Not a city
Word: ego, Prediction: Not a city
Word: budapest, Prediction: City
Word: szeged, Prediction: City
Word: Wien, Prediction: Not a city
Word: vienna, Prediction: City
Word: Graz, Prediction: City
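
Not every probe word is new to the model. The sketch below separates the probe words that already occur in the labeled lists from those that do not. The comparison is case-insensitive because the tokenizer is uncased. The names `probe_words`, `labeled_vocab`, `seen`, and `unseen` are introduced here for illustration, and the non-city list is written without its repeated names.

```python
latin_cities = [
    "Roma", "Athenae", "Alexandria", "Carthago", "Constantinopolis",
    "Neapolis", "Mediolanum", "Lugdunum", "Vindobona", "Antiochia",
    "Hierosolyma", "Damascus", "Tyrus", "Thessalonica", "Byzantium",
    "Sparta", "Corinth", "Lutetia", "Londinium", "Eboracum",
    "Augusta Treverorum", "Brundisium", "Toletum", "Florentia", "Syracusae",
    "Ephesus", "Pergamum", "Aquileia", "Massilia", "Tarraco",
    "Caesaraugusta", "Corduba", "Emerita Augusta", "Gades", "Pompeii",
    "Herculaneum", "Ravenna", "Nicomedia", "Mediolanum Santonum", "Colonia Agrippina",
]
non_cities = [
    "amo", "laudo", "dico", "habeo", "audio", "venio", "capio", "vivo",
    "scribo", "canto", "bellum", "pax", "sol", "luna", "terra",
    "aqua", "ignis", "aer", "arbor", "flumen", "mons", "mare", "coelum", "animal",
    "Caesar", "Cicero", "Augustus", "Vergilius", "Livius",
    "et", "sed", "aut", "vel", "non", "si", "quia", "ut",
]
probe_words = ["Tarraco", "Pompeii", "Marcus", "Cicero", "Senatus", "amo", "ego",
               "budapest", "szeged", "Wien", "vienna", "Graz"]

# Every labeled word, lowercased; a word found here was in the train, validation, or test split
labeled_vocab = {word.lower() for word in latin_cities + non_cities}

seen = [w for w in probe_words if w.lower() in labeled_vocab]
unseen = [w for w in probe_words if w.lower() not in labeled_vocab]
print(seen)    # ['Tarraco', 'Pompeii', 'Cicero', 'amo']
print(unseen)  # ['Marcus', 'Senatus', 'ego', 'budapest', 'szeged', 'Wien', 'vienna', 'Graz']
```

Only the predictions for the second group say anything about generalization. Within that group, the output above shows the model accepting several modern city names but rejecting "Wien".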
