# Poems classifier

This notebook shows how to train a classifier of poems, either by topic or by form.

Set `run_as_standalone_nb = True` if you are running this notebook outside of a clone of its repository (https://github.com/Poems-AI/AI.git), for example in a Colab or Kaggle notebook.

In [None]:
run_as_standalone_nb = True


from pathlib import Path


if run_as_standalone_nb:
    import sys    
    root_lib_path = Path('AI').resolve()
    if not root_lib_path.exists():
        !git clone https://github.com/Poems-AI/AI.git
    if str(root_lib_path) not in sys.path:
        sys.path.insert(0, str(root_lib_path))
        
    !pip install transformers
    !apt-get install git-lfs
    !git lfs install
else:
    import local_lib_import

In [None]:
from datasets import load_dataset, load_metric
from enum import auto, Enum
from functools import partial
from huggingface_hub import login, notebook_login
import numpy as np
import os
import pandas as pd
from poemsai.nb_utils import commit_checkpoint_to_hf_hub, download_checkpoint_from_hf_hub
import transformers
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, 
                          Trainer, TrainingArguments)
from transformers.optimization import SchedulerType
from transformers.trainer_utils import get_last_checkpoint
import torch
import torch.nn.functional as F
from typing import List

Clone our datasets repo:

In [None]:
!git clone https://github.com/Poems-AI/dataset.git

In [None]:
# Prevent wandb login requirement
os.environ["WANDB_DISABLED"] = "true"

# Labels selection

In [None]:
class LabelsType(Enum):
    Forms = "forms"
    Topics = "topics"

Choose whether to train a classifier of poems by form (`LabelsType.Forms`) or by topic (`LabelsType.Topics`).

In [None]:
classify_by = LabelsType.Forms

# Login to HuggingFace

In [None]:
HF_USER = "YOUR_HF_USER"

**Option 1: notebook_login.**

In [None]:
notebook_login()

**Option 2: get token.** Unfortunately, you need to set your password manually. Every time you push to the Hub, you'll need to pass `use_auth_token=login_token`.

In [None]:
pwd = 'YOUR_HF_PASSWORD'
login_token = login(HF_USER, pwd)
pwd = None

**Option 3 (recommended): interact with the git repo that stores your model** and pass the password every time you commit
<br><br>
Before committing, you need to tell git your user name and email (from HuggingFace).

In [None]:
HF_EMAIL = "YOUR_HF_EMAIL"
!git config --global user.email $HF_EMAIL
!git config --global user.name $HF_USER

You can push to the Hub by calling `commit_checkpoint_to_hf_hub`. For instance:
```
commit_checkpoint_to_hf_hub('distilbert-poems-clf-by-form.en', HF_USER, './checkpoints/checkpoint-7170', 
                            message='Update model after 50 epochs', pwd='YOUR_HF_PASSWORD')
```

# Data

We are going to use the same splits we used to train a simple generator:

In [None]:
splits_df_path = 'dataset/all.txt/en.txt/simple/all_poems.en.splits.csv'
splits_df = pd.read_csv(splits_df_path, index_col=0)
splits_df

If you are running outside of Kaggle, set `kaggle_ds_root` to the root folder that contains the poems dataset
by Kaggle user michaelarman (https://www.kaggle.com/michaelarman/poemsdataset).

In [None]:
kaggle_ds_root_placeholder = '[KAGGLE_DS_ROOT]'
# If outside of Kaggle, replace with the path of a root folder that contains the poems dataset
# by Kaggle user michaelarman (https://www.kaggle.com/michaelarman/poemsdataset)
kaggle_ds_root = '/kaggle/input'
kaggle_ds_splits_df = splits_df.copy()[
    splits_df.Location.str.contains(f'/{classify_by.value}/', regex=False)
    & splits_df.Location.str.contains(kaggle_ds_root_placeholder, regex=False)
]
kaggle_ds_splits_df.Location = kaggle_ds_splits_df.Location.str.replace(kaggle_ds_root_placeholder, 
                                                                        kaggle_ds_root,
                                                                        regex=False)
kaggle_ds_splits_df
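
The selection above boils down to two string operations per row: keep only the locations that belong to the chosen label type and still carry the placeholder, then swap the placeholder for a real root folder. A minimal plain-Python sketch of the same idea follows. The `Location` values in it are made up for illustration and are not rows from the real CSV:

```python
# Hypothetical Location values; only the placeholder and the '/forms/' segment
# mirror what the notebook filters on.
placeholder = '[KAGGLE_DS_ROOT]'
root = '/kaggle/input'
label_type = 'forms'

locations = [
    '[KAGGLE_DS_ROOT]/poemsdataset/forms/sonnet/example_a.txt',
    '[KAGGLE_DS_ROOT]/poemsdataset/topics/love/example_b.txt',
    'some/other/source/forms/ode/example_c.txt',
]

# Keep rows of the chosen label type that contain the placeholder,
# then replace the placeholder with the actual root folder.
selected = [
    loc.replace(placeholder, root)
    for loc in locations
    if f'/{label_type}/' in loc and placeholder in loc
]
print(selected)
```

Only the first entry survives: the second belongs to the other label type and the third has no placeholder to resolve.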

In [None]:
train_split_df = kaggle_ds_splits_df[kaggle_ds_splits_df.Split == 'Train']
valid_split_df = kaggle_ds_splits_df[kaggle_ds_splits_df.Split == 'Validation']
train_split_df, valid_split_df

In [None]:
def get_content_of_file_path(path:str):
    if not Path(path).exists():
        # Some poems contain strange characters in the title that don't match 
        # the original poem name, but they are about 1% and some are in French 
        # or other languages, so we don't mind discarding them
        #print('skipped ', path)
        return ''
    with open(path) as f:
        return f.read()

In [None]:
def split_to_labeled_df(split_df):
    labeled_df = pd.DataFrame({
        'text': split_df.Location.map(get_content_of_file_path), 
        'labels': split_df.Location.map(lambda path: Path(path).parent.name), 
    })
    return labeled_df


train_df = split_to_labeled_df(train_split_df)
valid_df = split_to_labeled_df(valid_split_df)
train_empty_selector = train_df.text == ''
valid_empty_selector = valid_df.text == ''
train_df = train_df[~train_empty_selector]
valid_df = valid_df[~valid_empty_selector]
train_df, valid_df, train_empty_selector.sum(), valid_empty_selector.sum()
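
The label of each poem is simply the name of the folder that contains its file, and poems whose file could not be read end up with an empty text and are dropped. A small sketch of both steps without pandas, assuming hypothetical paths and texts:

```python
from pathlib import PurePosixPath

# Hypothetical (path, text) pairs; an empty text stands for a file
# that was not found on disk.
examples = [
    ('/kaggle/input/poemsdataset/forms/sonnet/example_a.txt', 'first poem text'),
    ('/kaggle/input/poemsdataset/forms/haiku/example_b.txt', ''),
    ('/kaggle/input/poemsdataset/forms/haiku/example_c.txt', 'third poem text'),
]

# The label is the name of the parent folder of each file.
rows = [
    {'text': text, 'labels': PurePosixPath(path).parent.name}
    for path, text in examples
]

# Drop the poems whose content could not be read.
rows = [row for row in rows if row['text'] != '']
print(rows)
```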

Print the number of poems by category:

In [None]:
with pd.option_context('display.max_rows', None):
    print(train_df.labels.value_counts())

In [None]:
with pd.option_context('display.max_rows', None):
    print(valid_df.labels.value_counts())

[OPTIONAL]: set `min_poems_by_category` to a value greater than 1 to drop the poems whose category has fewer than `min_poems_by_category` training poems. Validation poems whose category is no longer present in the training set are dropped too.

In [None]:
min_poems_by_category = 4
train_df = train_df.groupby(by='labels').filter(lambda x: x.shape[0] >= min_poems_by_category)
valid_df = valid_df[valid_df.labels.isin(train_df.labels.unique())]
labels = train_df.labels.unique()
num_labels = len(labels)
train_df, valid_df, num_labels, valid_df.labels.nunique()
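
To make the effect of the threshold concrete, here is a plain-Python sketch of the same filtering logic on invented label lists (the category names and counts are illustrative only):

```python
from collections import Counter

# Hypothetical label columns for the training and validation sets.
train_labels = ['sonnet'] * 5 + ['haiku'] * 4 + ['ode'] * 2
valid_labels = ['sonnet', 'haiku', 'ode', 'ode']
min_poems = 4

# Categories with enough training poems are kept.
counts = Counter(train_labels)
kept = {label for label, n in counts.items() if n >= min_poems}

# Training poems of rare categories are dropped, and so are the
# validation poems whose category is no longer in the training set.
train_kept = [label for label in train_labels if label in kept]
valid_kept = [label for label in valid_labels if label in kept]
print(sorted(kept), len(train_kept), valid_kept)
```

Filtering the validation set against the surviving training categories guarantees that every validation label has an id in the mappings built later.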

Export to CSV so that the `datasets` library can load the data easily:

In [None]:
train_ds_path = 'train.csv'
valid_ds_path = 'valid.csv'
train_df.to_csv(train_ds_path)
valid_df.to_csv(valid_ds_path)

In [None]:
data_files = {"train": train_ds_path, "validation": valid_ds_path}
raw_datasets = load_dataset("csv", data_files=data_files)
raw_datasets

## Tokenization and numericalization

In [None]:
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}
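
The two dictionaries are inverses of each other: one maps class ids to category names and the other maps names back to ids. A quick sketch with made-up categories:

```python
# Hypothetical categories, in the order they would appear in `labels`.
example_labels = ['sonnet', 'haiku', 'ode']

# Same construction as in the cell above.
example_id2label = dict(enumerate(example_labels))
example_label2id = {label: i for i, label in example_id2label.items()}

print(example_id2label)
print(example_label2id)

# Mapping a name to its id and back returns the original name.
round_trip = [example_id2label[example_label2id[label]] for label in example_labels]
```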

In [None]:
def preprocess_function(examples):
    result = tokenizer(examples["text"], truncation=True)
    result["labels"] = [label2id[l] for l in examples["labels"]]
    return result
columns_to_remove = [c for c in raw_datasets['train'].column_names if c != 'labels']
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True, remove_columns=columns_to_remove)
tokenized_datasets

We choose a collator that dynamically pads the inputs to the length of the longest sequence in the batch:

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
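
Dynamic padding means that each batch is padded only up to its own longest sequence instead of a global maximum length. The following is a simplified illustration of the idea in plain Python, not the library's implementation; the function `pad_batch`, the token ids and the default `pad_id=0` are assumptions made for the example:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch.

    Hypothetical helper: `pad_id` is assumed to be the padding token id.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Two made-up tokenized poems of different lengths.
batch = pad_batch([[5, 6, 7], [8]])
print(batch)
```

A batch of short poems therefore stays short, which saves computation compared with padding everything to the longest poem in the dataset.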

# Train

In [None]:
def freeze(params):
    for p in params: p.requires_grad = False
        
def freeze_backbone(model):#:DistilBertForSequenceClassification):
    freeze(model.distilbert.parameters())
    
def create_opt_disc_lrs(model, min_lr, max_lr, head_lr):
    n_blocks = len(model.distilbert.transformer.layer)
    lr_mult = (max_lr / min_lr) ** (1 / (n_blocks - 1))
    blocks_lrs = [min_lr * lr_mult ** i for i in range(n_blocks)]
    blocks_params = [{'params': model.distilbert.transformer.layer[i].parameters(), 'lr': blocks_lrs[i]}
                     for i in range(n_blocks)]
    return torch.optim.AdamW([
        {'params': model.distilbert.embeddings.parameters(), 'lr': min_lr},
        {'params': model.pre_classifier.parameters()}, #, 'lr': head_lr},
        {'params': model.classifier.parameters()},#, 'lr': head_lr},
        *blocks_params,
    ], lr=head_lr, weight_decay=0, betas=(0.9, 0.999))
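
`create_opt_disc_lrs` assigns discriminative learning rates: the embeddings get `min_lr`, the transformer blocks get rates that grow geometrically from `min_lr` to `max_lr`, and the classification head gets `head_lr`. The sketch below isolates the geometric progression so it can be inspected without a model; `layerwise_lrs` is a hypothetical helper name, and the 6 blocks correspond to `distilbert-base-uncased`:

```python
import math


def layerwise_lrs(min_lr, max_lr, n_blocks):
    """Return one learning rate per block, spaced geometrically.

    Hypothetical helper mirroring the formula used in `create_opt_disc_lrs`.
    It requires n_blocks > 1, otherwise the exponent divides by zero.
    """
    lr_mult = (max_lr / min_lr) ** (1 / (n_blocks - 1))
    return [min_lr * lr_mult ** i for i in range(n_blocks)]


lrs = layerwise_lrs(1e-7, 2e-5, 6)
for i, lr in enumerate(lrs):
    print(f'block {i}: {lr:.2e}')
```

With these values each block learns roughly 2.9 times faster than the one below it, so the layers closest to the input change the least.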

In [None]:
custom_model_name = f'distilbert-poems-clf-by-{classify_by.value[:-1]}.en'
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, 
    num_labels=num_labels,
    id2label=id2label, 
    label2id=label2id,
    #dropout=0.3,
    #seq_classif_dropout=0.5,
    #attention_dropout=0.3,
)
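
The model name is derived from the label type by dropping the trailing "s" of the enum value. A standalone sketch of that naming rule, where `model_name_for` is a hypothetical helper introduced only for this example:

```python
from enum import Enum


class LabelsType(Enum):
    Forms = "forms"
    Topics = "topics"


def model_name_for(labels_type):
    # value[:-1] turns "forms" into "form" and "topics" into "topic".
    return f'distilbert-poems-clf-by-{labels_type.value[:-1]}.en'


names = [model_name_for(labels_type) for labels_type in LabelsType]
print(names)
```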

In [None]:
# Clone repo of our model to commit there later
resume_training = False
if resume_training:
    hf_pwd = 'YOUR_HF_PASSWORD'
    download_checkpoint_from_hf_hub(custom_model_name, HF_USER, hf_pwd)
    hf_pwd = ''

[Optional]: create your own optimizer. If you choose to use it, uncomment the `optimizers=(opt, None)` line in the `Trainer` constructor below.

In [None]:
opt = create_opt_disc_lrs(model, 1e-7, 2e-5, 5e-5)
[(len(pg['params']), pg['lr']) for pg in opt.param_groups]

[Optional]: freeze some layers

In [None]:
freeze_backbone(model)
sum(1 for p in model.parameters() if p.requires_grad), sum(1 for p in model.parameters() if not p.requires_grad)

In [None]:
metric = load_metric("accuracy")


def compute_metrics(eval_preds):
    # eval_preds unpacks into the model logits and the reference label ids
    logits, labels = eval_preds
    # The predicted class is the index of the highest logit
    predictions = np.argmax(logits, axis=-1).reshape(-1)
    return metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    evaluation_strategy=transformers.trainer_utils.IntervalStrategy.EPOCH,  
    save_strategy=transformers.trainer_utils.IntervalStrategy.EPOCH,
    lr_scheduler_type=transformers.trainer_utils.SchedulerType.CONSTANT,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    #optimizers=(opt, None),
)

trainer.train(resume_from_checkpoint=custom_model_name if resume_training else None)
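
`compute_metrics` turns the logits into class predictions with an argmax and then measures the fraction of predictions that match the references. The sketch below reproduces that computation in plain Python on invented logits; `accuracy_from_logits` is a hypothetical stand-in for the `np.argmax` call plus the accuracy metric:

```python
def accuracy_from_logits(logits, references):
    """Hypothetical stand-in for `compute_metrics`, without numpy or datasets."""
    # The predicted class of each row is the index of its largest logit.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return {'accuracy': correct / len(references)}


# Made-up logits for 4 poems and 3 categories.
example_logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.0, 1.0, 0.0],
]
example_references = [1, 0, 2, 1]

result = accuracy_from_logits(example_logits, example_references)
print(result)
```

Three of the four predictions match their reference, so the sketch reports an accuracy of 0.75.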

In [None]:
!ls -l checkpoints

In [None]:
last_checkpoint = get_last_checkpoint('./checkpoints')
custom_model_name, last_checkpoint

In [None]:
commit_checkpoint_to_hf_hub(custom_model_name, HF_USER, last_checkpoint,
                            message='Add model, 20 epochs', pwd='YOUR_HF_PASSWORD')