<a href="https://colab.research.google.com/github/Srinivasulu2003/DeepLearning/blob/main/finetune_bloom_token_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning BLOOM for Token Classification


**Information about BLOOM:**

* Documentation: https://huggingface.co/docs/transformers/model_doc/bloom
* Model: https://huggingface.co/bigscience/bloom
* Github: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme

**Transformers Package Documentation in Huggingface.co:**

* Glossary (Attention Mask): https://huggingface.co/docs/transformers/glossary#attention-mask
* Trainer Class: https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/trainer#transformers.Trainer
* Finetuning using Trainer: https://huggingface.co/docs/transformers/training
* Token Classification: https://huggingface.co/docs/transformers/tasks/token_classification

**Architecture explained:**

* The Technology Behind BLOOM Training: https://huggingface.co/blog/bloom-megatron-deepspeed
* Understand BLOOM, the Largest Open-Access AI, and Run It on Your Local Computer:
    https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32

**Dataset used for Training explained:**

* Corpus Map: https://huggingface.co/spaces/bigscience-catalogue-lm-data/corpus-map
* Building a TB Scale Multilingual Dataset for Language Modeling: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling


**Dataset for Finetuning:**

* Conll2003: https://huggingface.co/datasets/conll2003

## About BLOOM:

**The Model**:
* 176B parameters decoder-only architecture (GPT-like)
* 70 layers - 112 attention heads per layer - hidden dimensionality of 14336 - sequence length of 2048 tokens
    
    
BLOOM uses a Transformer architecture composed of an input embeddings layer, 70 Transformer blocks, and an output language-modeling layer, as shown in the figure below. Each Transformer block has a self-attention layer and a multi-layer perceptron layer, with input and post-attention layer norms.

![](https://miro.medium.com/max/1400/1*uwWJBgEx3Rtovbcb7HcRdA.jpeg)
    
**The Dataset**:
* Multilingual: 46 languages (full list: [https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling))
* 341.6 billion tokens (1.5 TB of text data)
* Tokenizer vocabulary: 250 680 tokens

![](https://github.com/bigscience-workshop/model_card/blob/main/assets/data/pie_v2.svg?raw=true)
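
As a rough sanity check, the parameter count can be approximated from the layer count, hidden size, and vocabulary size quoted above. The sketch below assumes the common rule of thumb of about 12·h² weights per Transformer block (4h² for attention, 8h² for an MLP with a 4h inner dimension) and a single embedding matrix shared with the output layer. It ignores biases, layer norms, and any vocabulary padding, so it is an estimate, not an exact count.

```python
# Back-of-the-envelope parameter estimate for a GPT-like decoder.
# Assumption: ~12 * h^2 weights per block (4h^2 attention + 8h^2 MLP),
# ignoring biases and layer norms.
n_layers = 70
hidden = 14336
vocab = 250_680

block_params = 12 * hidden ** 2
# Assumption: one embedding matrix, shared with the output layer.
embedding_params = vocab * hidden
total = n_layers * block_params + embedding_params

print(f"Estimated parameters: {total / 1e9:.1f}B")  # Estimated parameters: 176.2B
```

The estimate lands close to the advertised 176B, which suggests the listed dimensions are consistent with each other.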

***

## Imports

In [None]:
from transformers import (BloomTokenizerFast,
                          BloomForTokenClassification,
                          DataCollatorForTokenClassification,
                          AutoModelForTokenClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import torch
import os

## Use Pretrained Model

**Load Model and Tokenizer:**

The list of available Models can be found here: https://huggingface.co/docs/transformers/model_doc/bloom

In [None]:
model_name = "bloom-560m"
tokenizer = BloomTokenizerFast.from_pretrained(f"bigscience/{model_name}", add_prefix_space=True)
model = BloomForTokenClassification.from_pretrained(f"bigscience/{model_name}")

In [None]:
model.config

**Predict Labels:**

Since BLOOM has not been finetuned for token classification yet, the classification head is randomly initialized and the predictions are poor, as expected.

In [None]:
inputs = tokenizer(
    "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_token_class_ids = logits.argmax(-1)

# Note that tokens are classified rather than input words, which means that
# there might be more predicted token classes than words.
# Several token classes might belong to the same word.
predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
predicted_tokens_classes
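
Because predictions are made per sub-word token, one word can receive several labels. One common way to get back to one label per word is to keep only the label of each word's first sub-token. The helper below is a minimal sketch of that idea. The function name `first_subtoken_labels` and the example data are made up for illustration and are not part of the Transformers API.

```python
def first_subtoken_labels(word_ids, token_labels):
    """Keep the label predicted for the first sub-token of each word."""
    word_labels = []
    previous_word = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:  # tokens that do not belong to any word
            continue
        if word_id != previous_word:
            word_labels.append(label)
        previous_word = word_id
    return word_labels

# Hypothetical example: word 0 is split into three sub-tokens.
word_ids = [0, 0, 0, 1, 2]
token_labels = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
print(first_subtoken_labels(word_ids, token_labels))  # ['B-ORG', 'O', 'B-LOC']
```

With the objects from the cell above, `inputs.word_ids()` and `predicted_tokens_classes` would be the natural arguments, assuming a fast tokenizer is used.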

## Download Dataset for Finetuning

See:
* Dataset on Huggingface: https://huggingface.co/datasets/conll2003
* Load Datasets: https://huggingface.co/docs/datasets/v2.4.0/en/package_reference/loading_methods

In [None]:
datasets = load_dataset('conll2003')

### About the Dataset:

**Training Examples:**

In [None]:
print("Dataset Object Type:", type(datasets["train"]))
print("Training Examples:", len(datasets["train"]))

**Sample Structure:**

In [None]:
datasets["train"][100]

**Class Labels:**

In [None]:
label_list = datasets["train"].features["ner_tags"].feature.names
label_list
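
The labels follow the BIO scheme: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below shows how a tag sequence can be grouped into entity spans. The helper `bio_to_spans` is a hypothetical name, and the label order and example sentence are written out by hand for illustration, so they should be checked against the actual `label_list`.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # A span starts at a B- tag, or at an I- tag whose type does not
        # match the span that is currently open.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            if starts_new:
                current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Assumed label order for CoNLL-2003 NER tags.
conll_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]
spans = bio_to_spans(tokens, [conll_labels[i] for i in tag_ids])
print(spans)  # [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```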

## Tokenize Dataset

### Tokenize a Single Sample:

In [None]:
example = datasets["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

Sample after Tokenization:

In [None]:
tokenized_input

Word IDs:

In [None]:
tokenized_input.word_ids()

### Tokenize Whole Dataset

In [None]:
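def tokenizeInputs(inputs):
    # Tokenize the pre-split words; each sub-word token remembers its source word.
    tokenized_inputs = tokenizer(inputs["tokens"], max_length=512, truncation=True, is_split_into_words=True)
    word_ids = tokenized_inputs.word_ids()
    ner_tags = inputs["ner_tags"]
    # Give every sub-word token the label of its word. Tokens without a word
    # (word_id is None, e.g. special tokens) get -100 so the loss ignores them.
    labels = [ner_tags[word_id] if word_id is not None else -100 for word_id in word_ids]
    tokenized_inputs["labels"] = labels

    return tokenized_inputs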

In [None]:
example = datasets["train"][100]
tokenizeInputs(example)
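
The function above copies a word's label to all of its sub-word tokens. A frequently used alternative labels only the first sub-token and masks the rest with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below contrasts the two strategies on plain lists. `align_labels` and its `label_all_tokens` flag are hypothetical names chosen for this example.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default

def align_labels(word_ids, ner_tags, label_all_tokens=True):
    """Map word-level tags onto sub-word tokens."""
    labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Token does not belong to a word (e.g. a special token).
            labels.append(IGNORE_INDEX)
        elif word_id != previous_word or label_all_tokens:
            labels.append(ner_tags[word_id])
        else:
            # Later sub-token of the same word: mask it out.
            labels.append(IGNORE_INDEX)
        previous_word = word_id
    return labels

# Hypothetical example: three words split into 2, 1 and 3 sub-tokens.
word_ids = [0, 0, 1, 2, 2, 2]
ner_tags = [3, 0, 7]
print(align_labels(word_ids, ner_tags))                          # [3, 3, 0, 7, 7, 7]
print(align_labels(word_ids, ner_tags, label_all_tokens=False))  # [3, -100, 0, 7, -100, -100]
```

Labeling all sub-tokens gives the model more training signal per sentence, while first-token labeling keeps the loss weighted per word. Which works better here is an empirical question.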

In [None]:
tokenized_datasets = datasets.map(tokenizeInputs)

**Count of Tokens in the Training Set:**

In [None]:
token_count = 0
for sample in tokenized_datasets["train"]:
    token_count = token_count + len(sample["labels"])

print("Tokens in Training Set:", token_count)

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["id", "tokens", "ner_tags", "pos_tags", "chunk_tags"])

## Define Data Collator

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
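
The collator pads every sample in a batch to the same length and pads the labels with `-100` so that padded positions do not contribute to the loss. The function below is a simplified, list-based sketch of that behavior, not the library's implementation. The real collator returns tensors and takes the padding side from `tokenizer.padding_side`. The name `pad_batch` and the pad token id used here are assumptions made for the example.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100, padding_side="right"):
    """Pad input_ids, attention_mask and labels to the longest sample."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Each column is padded with its own fill value.
        columns = {
            "input_ids": (f["input_ids"], pad_token_id),
            "attention_mask": ([1] * len(f["input_ids"]), 0),
            "labels": (f["labels"], label_pad_id),
        }
        for name, (values, pad_value) in columns.items():
            padding = [pad_value] * n_pad
            padded = values + padding if padding_side == "right" else padding + values
            batch[name].append(padded)
    return batch

# Hypothetical token ids and labels.
features = [
    {"input_ids": [11, 12, 13], "labels": [3, 0, 0]},
    {"input_ids": [21], "labels": [7]},
]
batch = pad_batch(features, pad_token_id=3, padding_side="left")
print(batch["input_ids"])  # [[11, 12, 13], [3, 3, 21]]
print(batch["labels"])     # [[3, 0, 0], [-100, -100, 7]]
```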

## Define Trainer

Load Model Class which can be finetuned:

In [None]:
model = AutoModelForTokenClassification.from_pretrained(f"bigscience/{model_name}", num_labels=len(label_list)).cuda()

About the Model:

see https://github.com/huggingface/transformers/blob/v4.21.1/src/transformers/modeling_utils.py#L829

In [None]:
print("Parameters:", model.num_parameters())
print("Main Input Name:", model.main_input_name)

# Estimate the floating-point operations (FLOPs) needed for one training example
sample = tokenized_datasets["train"][0]
sample["input_ids"] = torch.tensor(sample["input_ids"])
flops_est = model.floating_point_ops(input_dict=sample, exclude_embeddings=False)

print("FLOPs needed per Training Sample:", flops_est)
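
Judging from the linked source, the estimate appears to follow the common rule of thumb of roughly 6 floating-point operations per parameter per token for one forward and backward pass. The sketch below reproduces that arithmetic with made-up numbers. The function name and both inputs are assumptions for illustration, not values measured in this notebook.

```python
def estimate_training_flops(n_tokens, n_parameters):
    """Approximate forward + backward cost: 6 * tokens * parameters."""
    return 6 * n_tokens * n_parameters

# Hypothetical numbers: a 14-token sample and a 560M-parameter model.
flops = estimate_training_flops(n_tokens=14, n_parameters=560_000_000)
print(f"{flops:.2e} FLOPs")  # 4.70e+10 FLOPs
```

Multiplying this per-sample figure by the token count of the whole training set and the number of epochs gives a rough total compute budget.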

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="epoch",        # save a checkpoint at the end of every epoch
    evaluation_strategy="steps",  # evaluate every `eval_steps` steps
    eval_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
    #fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)



## Train Model

GPU used by Kaggle: https://www.nvidia.com/de-de/data-center/tesla-p100/

In [None]:
!nvidia-smi

In [None]:
%%time

trainer.train()

In [None]:
eval_results = trainer.evaluate()
print(f"Eval Loss: {eval_results['eval_loss']}")
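
The evaluation loss alone says little about tagging quality. As a minimal sketch of an additional metric, the function below computes token-level accuracy while skipping positions labeled `-100`. The name `token_accuracy` and the example ids are hypothetical. For NER, entity-level precision, recall, and F1 are the more commonly reported metrics.

```python
def token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of correctly predicted tokens, skipping ignored positions."""
    correct = total = 0
    for pred_seq, label_seq in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Hypothetical predictions and labels for two sequences.
predicted_ids = [[3, 0, 0, 5], [0, 0, 1]]
label_ids = [[3, 0, 7, 5], [-100, 0, 1]]
accuracy = token_accuracy(predicted_ids, label_ids)
print(f"{accuracy:.3f}")  # 0.833
```

A function like this could be wrapped and passed to the `Trainer` through its `compute_metrics` argument, after taking the argmax over the logits.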

## Use Finetuned Model

Load checkpoint:

In [None]:
model_tuned = BloomForTokenClassification.from_pretrained("./results/checkpoint-1171")
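
The number in the checkpoint directory is presumably the global training step at which it was saved. With one checkpoint per epoch, it can be estimated from the dataset size and batch size. The sketch below assumes a single device, no gradient accumulation, and 14,041 training examples, the commonly cited size of the CoNLL-2003 training split (compare with the count printed earlier in the notebook).

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Optimizer steps per epoch; the last, smaller batch still counts."""
    return math.ceil(n_examples / batch_size)

# Assumption: 14,041 training examples, batch size 12 as in TrainingArguments.
steps = steps_per_epoch(n_examples=14_041, batch_size=12)
print(steps)  # 1171
```

Under these assumptions, `checkpoint-1171` corresponds to the end of the first epoch and the second epoch would end at step 2342.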

Set correct class labels:

In [None]:
label_names = datasets["train"].features["ner_tags"].feature.names

id2label = {idx: label for idx, label in enumerate(label_names)}
label2id = {label: idx for idx, label in enumerate(label_names)}

model_tuned.config.id2label = id2label
model_tuned.config.label2id = label2id

In [None]:
model_tuned.config.id2label

In [None]:
inputs = tokenizer(
    "HuggingFace is a company based in Paris and New York",
    add_special_tokens=False, return_tensors="pt"
)

with torch.no_grad():
    logits = model_tuned(**inputs).logits

predicted_token_class_ids = logits.argmax(-1)

# Note that tokens are classified rather than input words, which means that
# there might be more predicted token classes than words.
# Several token classes might belong to the same word.
predicted_tokens_classes = [model_tuned.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
predicted_tokens_classes