<a href="https://colab.research.google.com/github/PanoEvJ/GenAI-CoverLetter/blob/main/PE_GenAI_CoverLetter_Fine_tuning_BLOOM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a BLOOM-based cover letter generation model using `peft`, `transformers` and `bitsandbytes`

We can use the [job_postings_GPT dataset](https://huggingface.co/datasets/PanoEvJ/job_postings_GPT) to fine-tune BLOOM to generate cover letters from job postings!

### Overview of PEFT and LoRA:

The [`peft` library](https://github.com/huggingface/peft) implements parameter-efficient fine-tuning (PEFT) methods such as LoRA, which let us train/fine-tune large models a lot more efficiently.

The overview given in the linked repository explains it well:

```
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of
pre-trained language models (PLMs) to various downstream applications without 
fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often 
prohibitively costly. In this regard, PEFT methods only fine-tune a small 
number of (extra) model parameters, thereby greatly decreasing the 
computational and storage costs. Recent State-of-the-Art PEFT techniques 
achieve performance comparable to that of full fine-tuning.
```
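To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer: the pretrained weight `W` stays frozen and only two small matrices `A` and `B` are trained, with the update scaled by `alpha / r`. The toy sizes, the helper names (`matvec`, `lora_forward`) and the initialisation are illustrative assumptions, not code from `peft`.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x): frozen weight plus low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 16, 16, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero-initialised

x = [random.gauss(0, 1) for _ in range(d_in)]

full_params = d_in * d_out        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by the adapter

# With B initialised to zero, the adapted layer starts out identical to the base layer.
same_at_init = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
print(full_params, lora_params, same_at_init)
```

The saving grows with the layer size: the adapter cost is linear in `d_in + d_out`, while the full matrix is quadratic.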

### Install requirements

First, if the environment requirements were not already installed by the previous notebook, run the cell below:

In [None]:
!python -m pip install -r https://raw.githubusercontent.com/PanoEvJ/GenAI-CoverLetter/main/requirements.txt

### Model loading

Let's load the `bloom-1b7` model!

We're also going to load `bigscience/tokenizer`, which is the tokenizer for all of the BLOOM models.

This step will take some time, as we have to download the model weights which are ~3.44GB.
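As a rough sanity check on that figure, the arithmetic below estimates the weight size from the parameter count. The base parameter count is derived from the trainable-parameter printout later in this notebook. The assumption that the downloaded checkpoint stores 2 bytes per parameter is mine, and `size_gb` is a hypothetical helper.

```python
# Counts taken from the print_trainable_parameters output in this notebook.
all_params_with_lora = 1_725_554_688
lora_params = 3_145_728
base_params = all_params_with_lora - lora_params  # parameters of the base model alone

def size_gb(n_params, bytes_per_param):
    """Approximate size in decimal gigabytes for a given storage precision."""
    return n_params * bytes_per_param / 1e9

# Assumption: the checkpoint on the Hub is stored in a 16-bit format.
print(f"16-bit weights: {size_gb(base_params, 2):.2f} GB")
# Loading with torch_dtype=torch.float32 doubles the in-memory footprint.
print(f"32-bit weights: {size_gb(base_params, 4):.2f} GB")
```

Under these assumptions the download matches the ~3.44GB quoted above, while the `float32` copy in memory needs roughly twice that.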

In [None]:
import torch
torch.cuda.is_available()

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7", 
    torch_dtype=torch.float32,
    load_in_8bit=False, 
    device_map='auto',
    # offload_folder='offload'  # activate this argument in case no cuda devices are available and RAM is limited
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

### Post-processing on the model

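Finally, we need to apply some post-processing to the model to enable training. We freeze all the layers and cast the layer norms to `float32` for stability; we also cast the output of the last layer to `float32` for the same reason. This recipe is written for an 8-bit model. Here the model is loaded with `load_in_8bit=False` and `torch_dtype=torch.float32`, so the casts are effectively no-ops, but they keep the cell valid if you switch to 8-bit loading.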

In [4]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

### Apply LoRA

Here comes the magic with `peft`! Let's wrap the model in a `PeftModel` that uses low-rank adapters (LoRA), via the `get_peft_model` utility function from `peft`.

In [5]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [6]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 3145728 || all params: 1725554688 || trainable%: 0.18230242262828822
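The trainable count above can be reproduced by hand. The sketch below assumes that `bloom-1b7` has a hidden size of 2048 and 24 transformer blocks, and that each `query_key_value` layer maps the hidden size to three times the hidden size; those values are not stated in this notebook, so treat them as assumptions.

```python
hidden_size = 2048  # assumed hidden size of bloom-1b7
num_layers = 24     # assumed number of transformer blocks
r = 16              # LoRA rank from the LoraConfig above

in_features = hidden_size
out_features = 3 * hidden_size  # fused query, key and value projections

# Each adapted layer adds A (r x in_features) and B (out_features x r).
params_per_layer = r * in_features + out_features * r
trainable = num_layers * params_per_layer

all_params = 1_725_554_688  # "all params" from the printout above
percent = 100 * trainable / all_params
print(f"trainable params: {trainable} || trainable%: {percent:.4f}")
```

With `bias="none"` no bias terms are trained, so the adapters account for every trainable parameter.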


### Preprocessing

We can simply load our dataset from 🤗 Hugging Face with the `load_dataset` method!

In [None]:
import transformers
from datasets import load_dataset

dataset = load_dataset("PanoEvJ/job_postings_GPT")

Inspect the dataset:

In [None]:
print(dataset)
print(dataset['train'][0])

In [11]:
def generate_prompt(job: str, letter: str) -> str:
  prompt = f"Below is a job posting, please write a cover letter for this job.\n\n### Job:\n{job}\n\n### Letter:\n{letter}"
  return prompt

mapped_dataset = dataset.map(lambda samples: tokenizer(generate_prompt(samples['job_postings'], samples['cover_letters'])))
# mapped_dataset[0]

Map:   0%|          | 0/297 [00:00<?, ? examples/s]
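To see what the model is actually trained on, the sketch below renders the template for a made-up example. `generate_prompt` is copied from the cell above; the job and letter strings are hypothetical placeholders, not rows from the dataset.

```python
def generate_prompt(job: str, letter: str) -> str:
  prompt = f"Below is a job posting, please write a cover letter for this job.\n\n### Job:\n{job}\n\n### Letter:\n{letter}"
  return prompt

# Hypothetical toy inputs, only to show the layout of one training example.
toy_job = "Data Analyst at ExampleCorp"
toy_letter = "Dear Hiring Manager, ..."

training_text = generate_prompt(toy_job, toy_letter)
print(training_text)

# Passing an empty letter leaves the prompt open right after the
# "### Letter:" header, which is where a trained model would continue.
open_prompt = generate_prompt(toy_job, "")
print(repr(open_prompt[-15:]))
```

Because the job posting and the letter are concatenated into one sequence, the language-modeling loss covers the whole text, not only the letter.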

In [None]:
trainer = transformers.Trainer(
    model=model, 
    train_dataset=mapped_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100, 
        learning_rate=1e-3, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
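It helps to work out what these arguments imply. The sketch below assumes a single GPU, the 297 training examples reported by the `Map` progress output, and a linear warmup-then-decay schedule, which I believe is the `Trainer` default. `linear_warmup_lr` is a hypothetical helper written for illustration, not a `transformers` function.

```python
num_examples = 297            # from the Map progress output above
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1               # assumption: one visible GPU
max_steps = 100
warmup_steps = 100
learning_rate = 1e-3

# Examples contributing to each optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
# Approximate number of passes over the training set.
epochs = effective_batch * max_steps / num_examples

def linear_warmup_lr(step, base_lr, warmup, total):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))

print(f"effective batch size: {effective_batch}, epochs: {epochs:.2f}")
for step in (0, 50, 99):
    print(step, linear_warmup_lr(step, learning_rate, warmup_steps, max_steps))
```

Under these assumptions, `warmup_steps` equal to `max_steps` means the learning rate ramps up for the entire run and only approaches `1e-3` at the very end. A shorter warmup may be worth trying.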

## Share adapters on the 🤗 Hub

In [None]:
HUGGING_FACE_USER_NAME = "PanoEvJ"

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model_name = "GenAI-CoverLetter"

model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=False, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

## Inference

You can then use either the model you just trained or the one loaded from the 🤗 Hub for inference, exactly as you would with any other `transformers` model.

### Take it for a spin!

In [None]:
from IPython.display import display, Markdown

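def make_inference(job_posting):
  # NOTE: this prompt's wording and headers differ from the training template in
  # generate_prompt ("### Job:" / "### Letter:"); aligning the two may improve results.
  batch = tokenizer(f"Below is a job posting, please write a cover letter for this product.\n\n### Job posting:\n{job_posting} \n\n### Cover letter:\n", return_tensors='pt')
  # move the input tensors to the same device as the model before generating
  batch = batch.to(model.device)

  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=200)

  # the decoded text contains the prompt followed by the generated continuation
  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))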
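Since the decoded output repeats the whole prompt before the generated text, you may want to keep only the letter. The sketch below shows one way to do that with plain string handling. `extract_cover_letter` is a hypothetical helper, and it assumes the `### Cover letter:` header used in the inference prompt above.

```python
def extract_cover_letter(generated_text, marker="### Cover letter:"):
    """Return the text after the first marker, or the full text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

# Hypothetical decoded output, shortened for illustration.
decoded = (
    "Below is a job posting, please write a cover letter for this product.\n\n"
    "### Job posting:\nMachine Learning Engineer \n\n"
    "### Cover letter:\nDear Hiring Manager,\n\nI am excited to apply."
)

letter = extract_cover_letter(decoded)
print(letter)
```

This cuts at the first occurrence of the marker, so a job posting that happened to contain the same header would need a different delimiter.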

In [None]:
your_job_posting_here = """Machine Learning Engineer
Full-time · Mid-Senior level

VantAI is building artificial intelligence to revolutionize drug discovery and development. We produce technology to help scientists deliver novel compounds for life saving therapies. We were launched by Roivant Sciences, a global biopharma focused on rapidly developing innovative medicines and drug development technologies.

About You:

We are looking for an experienced Machine Learning Engineer/Scientist to join our machine learning team to help develop the world’s most advanced ML pipeline for the design of proximity inducing molecules. You will work with a team of world-class machine learning engineers in a research-heavy position on a range of unsolved problems around representation learning of proteins, small molecules, biological networks and genomics.


Key Responsibilities:

Scientifically direct the design and training of large-scale, state-of-the art Deep Learning systems
Develop novel architecture and training paradigms to lead the industry in unsolved scientific problems
Collaborate with content experts from other domains (e.g., chemistry, physics, biology) to enable innovative feature-engineering and novel cross-disciplinary approaches

Basic Requirements:

MS/PhD degree in Computer Science, Statistics, Applied Mathematics, Computational Biology, Computational Chemistry or other related subject (will also consider BS degrees in these areas for candidates highly qualified across all other requirements or with significant work experience)
Track record of contributing to novel methods for state-of-the-art Deep Learning (in industry or through publications) including large-scale Transformers, Graph Neural Nets, ConvNets, etc.
2+ years of experience on machine learning teams, ideally at a start-up
4+ years of ML research experience in industry or academia, with strong familiarity with PyTorch; familiarity with Jax is a plus
Experience with Python is required; programming skills in Rust, C, C++ is a plus
Relevant experience working in Linux/UNIX environment with basic data engineering and scripting abilities
Ability to understand business problems and how to build models that can quickly drive value, while prioritizing your research efforts accordingly

Preferred Qualifications:

Experience in Cheminformatics, Computational Biology or Computational Chemistry
Competitive programming or scientific experience, including Kaggle, PUTNAM, CTFs, iGEM, Biology/Chemistry Olympiad
Strong working knowledge of containerized production (e.g., Go/Flask-Server running within Docker, Kubernetes), DevOps and CI/CD principles
Experience with state-of-the-art tools such as TensorFlow, MXNet and Sklearn
Experience working with large data sets, simulation/optimization, and distributed computing tools (e.g., Spark, Airflow, Dash, etc.)
"""

make_inference(your_job_posting_here)

Below is a job posting, please write a cover letter for this product.

### Job posting:
Machine Learning Engineer
Full-time · Mid-Senior level

VantAI is building artificial intelligence to revolutionize drug discovery and development. We produce technology to help scientists deliver novel compounds for life saving therapies. We were launched by Roivant Sciences, a global biopharma focused on rapidly developing innovative medicines and drug development technologies.

About You:

We are looking for an experienced Machine Learning Engineer/Scientist to join our machine learning team to help develop the world’s most advanced ML pipeline for the design of proximity inducing molecules. You will work with a team of world-class machine learning engineers in a research-heavy position on a range of unsolved problems around representation learning of proteins, small molecules, biological networks and genomics.


Key Responsibilities:

Scientifically direct the design and training of large-scale, state-of-the art Deep Learning systems
Develop novel architecture and training paradigms to lead the industry in unsolved scientific problems
Collaborate with content experts from other domains (e.g., chemistry, physics, biology) to enable innovative feature-engineering and novel cross-disciplinary approaches

Basic Requirements:

MS/PhD degree in Computer Science, Statistics, Applied Mathematics, Computational Biology, Computational Chemistry or other related subject (will also consider BS degrees in these areas for candidates highly qualified across all other requirements or with significant work experience)
Track record of contributing to novel methods for state-of-the-art Deep Learning (in industry or through publications) including large-scale Transformers, Graph Neural Nets, ConvNets, etc.
2+ years of experience on machine learning teams, ideally at a start-up
4+ years of ML research experience in industry or academia, with strong familiarity with PyTorch; familiarity with Jax is a plus
Experience with Python is required; programming skills in Rust, C, C++ is a plus
Relevant experience working in Linux/UNIX environment with basic data engineering and scripting abilities
Ability to understand business problems and how to build models that can quickly drive value, while prioritizing your research efforts accordingly

Preferred Qualifications:

Experience in Cheminformatics, Computational Biology or Computational Chemistry
Competitive programming or scientific experience, including Kaggle, PUTNAM, CTFs, iGEM, Biology/Chemistry Olympiad
Strong working knowledge of containerized production (e.g., Go/Flask-Server running within Docker, Kubernetes), DevOps and CI/CD principles
Experience with state-of-the-art tools such as TensorFlow, MXNet and Sklearn
Experience working with large data sets, simulation/optimization, and distributed computing tools (e.g., Spark, Airflow, Dash, etc.)
 

### Cover letter:
Dear Hiring Manager,

I am excited to apply for the position of Machine Learning Engineer at VantAI. As a highly skilled and experienced Machine Learning Scientist, I am confident that I can contribute to the development of the world’s most advanced ML pipeline for the design of proximity inducing molecules. I am particularly drawn to the opportunity to work with a team of world-class machine learning engineers on a range of unsolved problems around representation learning of proteins, small molecules, biological networks and genomics.

I have a Bachelor's degree in Computer Science and have a track record of contributing to novel methods for state-of-the-art Deep Learning (in industry or through publications). I have also worked on large-scale Transformers, Graph Neural Nets, ConvNets, etc. I am confident that my skills and experience can add value to your team and help you deliver innovative compounds for life saving therapies.

I am particularly impressed with the company's focus on rapidly developing innovative medicines and drug development technologies.