BLOOM's architecture is a causal decoder-only transformer, the design most widely used for LLMs with more than 100B parameters.

Note: BLOOM is a decoder-only model, so there is no separate encoder.
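
To make "causal decoder" concrete, here is a minimal pure-Python sketch of the attention mask such a model uses: the token at position i may attend only to positions 0..i. The function name `causal_mask` is illustrative and not part of any library.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1 if token i may attend to token j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row has ones up to and including its own position, which is why the model can be trained to predict the next token of its own input.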

*   peft: applies LoRA (low-rank adaptation) to the model
*   bitsandbytes: handles quantization



In [None]:
!pip install opendatasets datasets transformers peft accelerate bitsandbytes --upgrade --quiet


In [None]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: dholesneha
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail
Downloading newspaper-text-summarization-cnn-dailymail.zip to ./newspaper-text-summarization-cnn-dailymail


100%|██████████| 503M/503M [00:23<00:00, 22.1MB/s]





In [None]:
import os
import re

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    Trainer,
    TrainingArguments,
    logging,
    pipeline,
)

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", quantization_config=quant_config)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/693 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]
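
As a rough back-of-the-envelope check of what 4-bit loading buys, the sketch below estimates weight storage at different precisions. The parameter count is an assumption derived from the `print_trainable_parameters()` output later in this notebook (all params minus the LoRA params). The 4-bit figure is a lower bound, since bitsandbytes generally leaves some layers, such as the embeddings, in higher precision and stores extra quantization constants.

```python
# Assumed parameter count of bigscience/bloom-1b1:
# 1,066,493,952 total minus 1,179,648 LoRA parameters.
N_PARAMS = 1_066_493_952 - 1_179_648

def weight_size_gb(n_params, bits_per_param):
    """Storage needed for n_params weights at the given precision, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_size_gb(N_PARAMS, bits):.2f} GB")
```

The 16-bit estimate is consistent with the 2.13G `model.safetensors` download shown above.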

In [None]:
train_df = pd.read_csv("/content/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv")[["article", "highlights"]]


In [None]:
train_df = train_df.sample(500)

In [None]:
def filter_text(text):
    # Lowercase the text and replace every run of non-alphanumeric characters with a single space.
    text = text.lower()
    text = re.sub('[^A-Za-z0-9]+', ' ', text)
    return text

train_df["article"] = train_df["article"].apply(filter_text)
train_df["highlights"] = train_df["highlights"].apply(filter_text)
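
To see what this cleaning step does, the short sketch below applies the same function to a made-up sentence (the sample string is illustrative, not from the dataset).

```python
import re

def filter_text(text):
    # Same logic as the cell above: lowercase, then replace non-alphanumeric runs with a space.
    text = text.lower()
    return re.sub('[^A-Za-z0-9]+', ' ', text)

sample = "PM2.5 is the World's worst pollutant!"
print(repr(filter_text(sample)))
```

Casing, punctuation and decimal points are all lost (`PM2.5` becomes `pm2 5`), which is visible in the generated text further down.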

In [None]:
train_df.head()

Unnamed: 0,article,highlights
22991,paris france cnn in a city famous for being th...,watch out for spikey haired parisians getting ...
168285,here s a new kind of fast fashion a zip up tie...,innovative fashion statement is draped round t...
263106,david cameron was personally assured the west ...,the prime minister intervened after public app...
87089,forget about william and catherine the real st...,palace releases three images of prince george ...
155208,pretty in a navy and blue lace dress and with ...,the duchess of york 55 looked slim and healthy...


In [None]:
# Build one training string per row: instruction, article, then the reference summary.
train_df["final_statement"] = (
    "Summarize the following article.\n\n"
    + train_df["article"].astype(str)
    + "\nSummary:\n"
    + train_df["highlights"].astype(str)
)


In [None]:
train_df = train_df[["final_statement"]]

In [None]:
train_df['final_statement'].iloc[1]

'Summarize the following article.\n\nhere s a new kind of fast fashion a zip up tie for men the innovative fashion statement is draped round the neck and then the two sections are zipped up which could save precious seconds for anyone late for a meeting and eliminates the need to re tie it later in the day the tie costs 42 and comes in three different sizes and shades of blue and grey but even the designer josh jakus from oakland california admits it won t be to everyone s taste and is more for those who want to assert their individuality time saver the zip up tie can be donned in seconds with no need to re tie later in the day he said i don t think it will replace the original tie there are a lot of people who would prefer to stick with the tradition that s fine with me mr jakus added that his tie is perfect for any occasion where you want to both look good and different at the same time a friend has worn the tie to government functions in washington dc and he didn t look out of place
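
The same template has to be used at training time (with the summary appended) and at inference time (ending right after `Summary:`). One way to keep the two in sync is a small helper; `build_prompt` is a hypothetical name introduced here for illustration.

```python
PROMPT_HEADER = "Summarize the following article.\n\n"
SUMMARY_MARKER = "\nSummary:\n"

def build_prompt(article, summary=None):
    """Build a training example (summary given) or an inference prompt (summary omitted)."""
    prompt = PROMPT_HEADER + str(article) + SUMMARY_MARKER
    if summary is not None:
        prompt += str(summary)
    return prompt

print(build_prompt("some article text", "some summary"))
print(build_prompt("some article text"))
```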

BLOOM is a causal decoder-only transformer, so we need only one token sequence per example: the same ids serve as both the input and the labels.

In [None]:
# The model is decoder-only, but the Trainer still expects both input_ids and labels,
# so the labels are a copy of input_ids.

tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(example):
    # Tokenize once and reuse the ids; plain lists are enough when mapping with batched=True.
    tokens = tokenizer(example["final_statement"], padding="max_length", max_length=250, truncation=True)
    example["input_ids"] = tokens["input_ids"]
    example["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return example

# Convert your DataFrame into a Dataset object
train_data = Dataset.from_pandas(train_df)

# Mapping would leave three columns (final_statement, input_ids, labels);
# remove_columns drops the raw text so only input_ids and labels remain.
train_tokenized_datasets = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
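
Because the labels are a straight copy of `input_ids`, padded positions also contribute to the loss. A common refinement, not used in this notebook, is to set the labels of padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists. `mask_padding` is a hypothetical helper, and it relies on the attention mask rather than the pad token id because here the pad token is also the EOS token.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]

# Toy example: token id 2 is used for padding on the left, three real tokens follow.
input_ids = [2, 2, 101, 102, 103]
attention_mask = [0, 0, 1, 1, 1]
print(mask_padding(input_ids, attention_mask))
```

To apply this in the notebook, the tokenizer's `attention_mask` would need to be kept alongside `input_ids` instead of being discarded.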

In [None]:
print(tokenizer.decode(train_tokenized_datasets[5]["input_ids"], skip_special_tokens = True))




In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, peft_params)
peft_model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 1,066,493,952 || trainable%: 0.1106
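
The trainable-parameter count can be reproduced by hand. LoRA adds two matrices per adapted layer, of shapes `d_in x r` and `r x d_out`, so each adapted layer contributes `r * (d_in + d_out)` parameters. The sketch assumes bloom-1b1 has 24 transformer blocks with hidden size 1536, and that PEFT's default LoRA target for BLOOM is the fused `query_key_value` projection (1536 to 3 * 1536). These values are assumptions, not taken from the notebook, though they do reproduce the printed number.

```python
def lora_param_count(n_layers, d_in, d_out, r):
    """Parameters added by LoRA when one d_in -> d_out linear layer per block is adapted."""
    return n_layers * r * (d_in + d_out)

HIDDEN = 1536    # assumed hidden size of bloom-1b1
N_LAYERS = 24    # assumed number of transformer blocks
R = 8            # LoRA rank from peft_params

trainable = lora_param_count(N_LAYERS, HIDDEN, 3 * HIDDEN, R)
total = 1_066_493_952  # "all params" from the output above
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / total:.4f}")
```

Doubling `r` doubles the adapter size, so the rank is the main knob for trading capacity against memory.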


In [None]:
training_args = TrainingArguments(
    output_dir='./model_checkpoints',
    save_total_limit=1,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_tokenized_datasets,
)

trainer.train()

trainer.model.save_pretrained('./final_model')
tokenizer.save_pretrained('./final_model')



Step,Training Loss


('./final_model/tokenizer_config.json',
 './final_model/special_tokens_map.json',
 './final_model/tokenizer.json')
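
For a sense of how long this run is: with 500 sampled rows and 5 epochs, the number of optimizer steps depends on the batch size. The sketch assumes a single device, no gradient accumulation, and the `TrainingArguments` default `per_device_train_batch_size` of 8, which `auto_find_batch_size=True` may reduce if memory runs out.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run with no gradient accumulation on one device."""
    return math.ceil(n_examples / batch_size) * epochs

for batch_size in (8, 4, 2):
    print(batch_size, total_steps(500, batch_size, 5))
```

With the default `logging_steps` of 500, a 315-step run would finish before the first loss is logged, which would be consistent with the empty Step/Training Loss table above.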

In [None]:
# Free the GPU memory held by the fine-tuned model
del peft_model
torch.cuda.empty_cache()

In [None]:
news_article = """
All but one of the 100 cities with the world’s worst air pollution last year were in Asia, according to a new report, with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide.

The vast majority of these cities — 83 — were in India and all exceeded the World Health Organization’s air quality guidelines by more than 10 times, according to the report by IQAir, which tracks air quality worldwide.

The study looked specifically at fine particulate matter, or PM2.5, which is the tiniest pollutant but also the most dangerous. Only 9% of more than 7,800 cities analyzed globally recorded air quality that met WHO’s standard, which says average annual levels of PM2.5 should not exceed 5 micrograms per cubic meter.

“We see that in every part of our lives that air pollution has an impact,” said IQAir Global CEO Frank Hammes. “And it typically, in some of the most polluted countries, is likely shaving off anywhere between three to six years of people’s lives. And then before that will lead to many years of suffering that are entirely preventable if there’s better air quality.”

"""

filtered_news_article = "Summarize the following article.\n\n" +filter_text(news_article) + "\nSummary:\n"

print("len of sent: ",len(filtered_news_article))

# Tokenize the prompt and move it to the same device as the model
tokenized_news_article = tokenizer(filtered_news_article, return_tensors="pt").to(model.device)
output = model.generate(**tokenized_news_article, max_new_tokens = 100)
summary = tokenizer.decode(output[0], skip_special_tokens = True)

print("len of summary: ",len(summary))

print("output: ",summary)


In [None]:
print(summary.split("\nSummary:\n")[1])
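
Indexing `[1]` after `split` raises an `IndexError` if the marker is missing from the decoded text, for example when the prompt was truncated before `Summary:`. A slightly more defensive variant, with the hypothetical helper name `extract_summary`, could look like this:

```python
SUMMARY_MARKER = "\nSummary:\n"

def extract_summary(generated_text, marker=SUMMARY_MARKER):
    """Return the text after the first marker, or an empty string if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else ""

print(extract_summary("Summarize the following article.\n\nsome text\nSummary:\na short summary"))
print(repr(extract_summary("prompt that was cut off before the marker")))
```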




In [None]:
ftokenizer = AutoTokenizer.from_pretrained("final_model")
fmodel = AutoModelForCausalLM.from_pretrained("final_model")



In [None]:
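# Reload the 4-bit base model; this overwrites the ftokenizer and fmodel loaded from final_model above
ftokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
fmodel = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", quantization_config=quant_config)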

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [None]:
# Tokenize the prompt and move the ids to the same device as the model
input_ids = ftokenizer(filtered_news_article, padding="max_length", truncation=True, max_length=512, return_tensors="pt").input_ids.to(fmodel.device)


In [None]:
output = fmodel.generate(input_ids, max_new_tokens = 50)
summary = ftokenizer.decode(output[0], skip_special_tokens = True)


In [None]:
print("fin model 50 :",summary)

fin model 50 : Summarize the following article.

 all but one of the 100 cities with the world s worst air pollution last year were in asia according to a new report with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide the vast majority of these cities 83 were in india and all exceeded the world health organization s air quality guidelines by more than 10 times according to the report by iqair which tracks air quality worldwide the study looked specifically at fine particulate matter or pm2 5 which is the tiniest pollutant but also the most dangerous only 9 of more than 7 800 cities analyzed globally recorded air quality that met who s standard which says average annual levels of pm2 5 should not exceed 5 micrograms per cubic meter we see that in every part of our lives that air pollution has an impact said iqair global ceo frank hammes and it typically in some of the most polluted countries is likely shaving off a

In [None]:
print(summary)

Summarize the following article.

 all but one of the 100 cities with the world s worst air pollution last year were in asia according to a new report with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide the vast majority of these cities 83 were in india and all exceeded the world health organization s air quality guidelines by more than 10 times according to the report by iqair which tracks air quality worldwide the study looked specifically at fine particulate matter or pm2 5 which is the tiniest pollutant but also the most dangerous only 9 of more than 7 800 cities analyzed globally recorded air quality that met who s standard which says average annual levels of pm2 5 should not exceed 5 micrograms per cubic meter we see that in every part of our lives that air pollution has an impact said iqair global ceo frank hammes and it typically in some of the most polluted countries is likely shaving off anywhere between

In [None]:
print("fin model 20 :",summary)

Summarize the following article.

 all but one of the 100 cities with the world s worst air pollution last year were in asia according to a new report with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide the vast majority of these cities 83 were in india and all exceeded the world health organization s air quality guidelines by more than 10 times according to the report by iqair which tracks air quality worldwide the study looked specifically at fine particulate matter or pm2 5 which is the tiniest pollutant but also the most dangerous only 9 of more than 7 800 cities analyzed globally recorded air quality that met who s standard which says average annual levels of pm2 5 should not exceed 5 micrograms per cubic meter we see that in every part of our lives that air pollution has an impact said iqair global ceo frank hammes and it typically in some of the most polluted countries is likely shaving off anywhere between

In [None]:
print(len(summary))