### Data Loading and Tokenization

In [1]:
!pip install uv



In [2]:
# Install the required libraries and packages
!uv pip install datasets
!uv pip install conllu
!uv pip install torchviz

!uv pip install wandb
!uv pip install ufal.chu-liu-edmonds


Using Python 3.11.11 environment at: /usr
Audited 1 package in 123ms
Using Python 3.11.11 environment at: /usr
Audited 1 package in 121ms
Using Python 3.11.11 environment at: /usr
Audited 1 package in 101ms
Using Python 3.11.11 environment at: /usr
Audited 1 package in 104ms
Using Python 3.11.11 environment at: /usr
Audited 1 package in 109ms


In [3]:
# main.py
import torch
from datasets import load_dataset
from config import DATASET_PATH, DATASET_NAME, EXPERIMENT_NAME, RELATION_NUM, HIDDEN_DIM, OUTPUT_DIM
from data import dataset_reading_and_encoding, print_first_batch
from models import model_initializing
from utils import count_parameters
from train import train

device = 'cuda' if torch.cuda.is_available() else 'cpu'


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
# Load and process datasets
dataset = load_dataset(path=DATASET_PATH, name=DATASET_NAME, trust_remote_code=True)
data = dataset_reading_and_encoding(dataset)
print_first_batch(data["train"])

First Batch:
input_ids shape: torch.Size([32, 200])
attention_mask shape: torch.Size([32, 200])
head shape: torch.Size([32, 200])
deprel_ids shape: torch.Size([32, 200])


In [5]:
# Initialize and train base model
base_model = model_initializing("base", hidden_dim=HIDDEN_DIM, output_dim=OUTPUT_DIM, relation_num=RELATION_NUM)
count_parameters(base_model)
base_model = train(base_model, data, EXPERIMENT_NAME, save_model=True, model_name="base_model")


Total parameters: 279,346,968
Trainable parameters: 29,654,808


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mghta00001[0m ([33mghta00001-university-of-saarland[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch 1/3:   1%|          | 2/392 [01:35<5:10:45, 47.81s/it]


KeyboardInterrupt: 

In [None]:
# Initialize and train extended model with adapters
extended_model = model_initializing("pfeiffer", hidden_dim=768, output_dim=256, relation_num=RELATION_NUM, trained_base_model=base_model)
count_parameters(extended_model)
train(extended_model, basque_data, "Basque_Adapter_Experiment", save_model=True, model_name="pfeiffer_adapter")

In [5]:
# Loading dataset from Huggingface
dataset = load_dataset(path=DATASET_PATH, name=DATASET_NAME, trust_remote_code=True)

# Map each dependency relation label to an integer id (its index in ALL_DEPRELS)
deprel_to_id = {deprel: idx for idx, deprel in enumerate(ALL_DEPRELS)}

# Inverse map: integer id back to its dependency relation label
id_to_deprel = {idx: deprel for idx, deprel in enumerate(ALL_DEPRELS)}
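
A small sketch of how the two maps can be used to encode and decode relation labels. The `ALL_DEPRELS` list here is a hypothetical subset chosen for illustration, and the helper names `encode_deprels` and `decode_deprels` as well as the `unknown_id=-100` fallback are assumptions, not part of the project code.

```python
# Hypothetical subset of dependency relation labels, for illustration only.
ALL_DEPRELS = ["root", "nsubj", "obj", "amod", "punct", "flat", "parataxis"]

deprel_to_id = {deprel: idx for idx, deprel in enumerate(ALL_DEPRELS)}
id_to_deprel = {idx: deprel for deprel, idx in deprel_to_id.items()}


def encode_deprels(deprels, unknown_id=-100):
    """Map labels to ids; labels outside the inventory get unknown_id (assumed)."""
    return [deprel_to_id.get(deprel, unknown_id) for deprel in deprels]


def decode_deprels(ids):
    """Map ids back to labels; ids outside the inventory become None."""
    return [id_to_deprel.get(idx) for idx in ids]


encoded = encode_deprels(["amod", "nsubj", "parataxis"])
decoded = decode_deprels(encoded)
print(encoded, decoded)
```

Building the inverse map from `deprel_to_id` rather than from a second `enumerate` keeps the two dictionaries consistent by construction.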

In [10]:
sample_tokenized_inputs = tokenize_and_align_labels(dataset["train"][:10])
explore_some_data(dataset["train"], sample_tokenized_inputs)

Token : <s>        -> Head: N/A        -> Deprel: None       -> Word: N/A
Token : ▁Al        -> Head: N/A        -> Deprel: root       -> Word: N/A
Token : ▁-         -> Head: N/A        -> Deprel: punct      -> Word: -
Token : ▁Zaman     -> Head: N/A        -> Deprel: flat       -> Word: Zaman
Token : ▁:         -> Head: N/A        -> Deprel: punct      -> Word: :
Token : ▁American  -> Head: forces     -> Deprel: amod       -> Word: American
Token : ▁forces    -> Head: killed     -> Deprel: nsubj      -> Word: forces
Token : ▁killed    -> Head: N/A        -> Deprel: parataxis  -> Word: killed
Token : ▁Sha       -> Head: killed     -> Deprel: obj        -> Word: Shaikh
Token : ikh        -> Head: N/A        -> Deprel: None       -> Word: Shaikh
Token : ▁Abdullah  -> Head: Shaikh     -> Deprel: flat       -> Word: Abdullah
Token : ▁al        -> Head: Shaikh     -> Deprel: flat       -> Word: al
Token : ▁Ani       -> Head: Shaikh     -> Deprel: punct      -> Word: Ani
Token : ▁          
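
The output above suggests that only the first subword of each word carries a label, while special tokens and continuation pieces (such as `ikh`) carry none. A minimal sketch of that alignment scheme follows. It assumes a `word_ids` list in the style returned by Hugging Face fast tokenizers (`None` for special tokens, a word index otherwise). The function name and the `ignore_index=-100` placeholder are assumptions, and the actual `tokenize_and_align_labels` may differ, for example by remapping heads to subword positions.

```python
def align_labels_to_subwords(word_ids, heads, deprel_ids, ignore_index=-100):
    """Give each word's labels to its first subword; mask everything else.

    word_ids   : one entry per subword token (None for special tokens)
    heads      : word-level head index per word (kept at word level here)
    deprel_ids : word-level relation id per word
    """
    aligned_heads, aligned_deprels = [], []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token or continuation piece: no label
            aligned_heads.append(ignore_index)
            aligned_deprels.append(ignore_index)
        else:
            # First subword of a new word: copy the word-level labels
            aligned_heads.append(heads[word_id])
            aligned_deprels.append(deprel_ids[word_id])
        previous_word = word_id
    return aligned_heads, aligned_deprels


# Tokens: <s>, ▁forces, ▁killed, ▁Sha, ikh, </s>
word_ids = [None, 0, 1, 2, 2, None]
heads = [2, 0, 2]        # illustrative word-level heads (0 = root)
deprel_ids = [1, 0, 2]   # illustrative relation ids
aligned_heads, aligned_deprels = align_labels_to_subwords(word_ids, heads, deprel_ids)
print(aligned_heads, aligned_deprels)
```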

Here we initialize the dataset, create the dataloaders, and print the first batch of the training set.

In [22]:
# Step 1: Rebuild the base model and load the weights saved after epoch 3
trained_model = InitialModel(HIDDEN_DIM, OUTPUT_DIM, RELATION_NUM)
trained_model.load_state_dict(torch.load("model_epoch_3.pth", map_location=device))
# Step 2: Wrap the trained model with Houlsby adapters
extended_model = ExtendedModelWithHoulsby(trained_model, adapter_dim=64)
extended_model.to(device)
# Step 3: Train on the low-resource language
# (the optimizer should update only parameters with requires_grad=True, i.e., the adapters)
optimizer = torch.optim.Adam(
    [p for p in extended_model.parameters() if p.requires_grad],
    lr=1e-4
)

### Testing

In [None]:
# Testing model
test_loader = data["test"]
test_acc = test_model(model, test_loader, device)

Test UAS: 0.8408, LAS: 0.7992
Unlabeled Attachment Score (UAS): 0.7942
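
For reference, a minimal sketch of how UAS and LAS are commonly computed from token-level predictions. UAS counts tokens whose predicted head is correct. LAS additionally requires the relation label to be correct, so LAS can never exceed UAS. The function name and the `ignore_index=-100` convention for unlabeled positions are assumptions, and the project's `test_model` may compute the scores differently.

```python
def attachment_scores(gold_heads, pred_heads, gold_deprels, pred_deprels,
                      ignore_index=-100):
    """Return (UAS, LAS) over positions whose gold head is not ignore_index."""
    total = uas_correct = las_correct = 0
    for gold_h, pred_h, gold_d, pred_d in zip(
        gold_heads, pred_heads, gold_deprels, pred_deprels
    ):
        if gold_h == ignore_index:
            continue  # special token, padding, or continuation subword
        total += 1
        if gold_h == pred_h:
            uas_correct += 1
            if gold_d == pred_d:
                las_correct += 1
    if total == 0:
        return 0.0, 0.0
    return uas_correct / total, las_correct / total


gold_heads = [-100, 2, 0, 2, -100]
pred_heads = [0, 2, 0, 1, 2]
gold_deprels = [-100, 1, 0, 2, -100]
pred_deprels = [0, 1, 3, 2, 0]
uas, las = attachment_scores(gold_heads, pred_heads, gold_deprels, pred_deprels)
print(f"UAS: {uas:.4f}, LAS: {las:.4f}")
```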


TEST : 91.75 percent

UAS : percent


VALIDATION :

# NOTE
The code comments were written with the help of an LLM.

# BIG NOTE

Colab stopped at around 10.10.
I will send the loss and accuracy figures by email, but you can see the results here.
