In [None]:
%pip install --upgrade datasets peft huggingface_hub bitsandbytes accelerate transformers torch torchvision nltk evaluate rouge_score wandb

In [1]:
import wandb
wandb.init(mode="offline")

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


The code uses load_dataset from the datasets library to load the "knkarthick/dialogsum" dataset, which is divided into train, validation, and test splits. The dataset is printed out, showing the structure with the number of rows and features (id, dialogue, summary, topic) for each split.
It then converts the train split into a Pandas DataFrame, prints the first 5 rows, and checks the shape of the train dataset.

In [2]:
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")


In [3]:
print(ds)
print(ds['train'][:5])
df = ds['train'].to_pandas()
print(df.head())
print(df.shape)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})
{'id': ['train_0', 'train_1', 'train_2', 'train_3', 'train_4'], 'dialogue': ["#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr.

The dataset consists of 12,460 rows in the train split, 500 rows in the validation split, and 1,500 rows in the test split, all containing id, dialogue, summary, and topic features.
The first 5 rows display a sample of dialogues, summaries, and topics, showing conversations on various subjects like health check-ups, vaccines, and relationships.
The shape of the train dataset is (12460, 4), confirming it has 12,460 rows and 4 columns (id, dialogue, summary, topic).

The code first shuffles the train dataset using a random seed (42) and selects the first 50 samples from the shuffled data using the select method.
Similarly, it shuffles the validation dataset and selects the first 20 samples.
A new DatasetDict is created, containing the selected train and validation subsets, which is then printed to show the resulting data structure.

In [4]:
# Shuffle and then select 50 samples
train_subset = ds['train'].shuffle(seed=42).select(range(50))

In [5]:
print(train_subset)

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 50
})


In [6]:
# Shuffle and then select 20 samples
validation_subset = ds['validation'].shuffle(seed=42).select(range(20))

In [7]:
from datasets import DatasetDict, Dataset
# Create DatasetDict with the selected samples
data_subsets = DatasetDict({
    "train": train_subset,
    "validation": validation_subset
})

In [8]:
print(data_subsets)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 50
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 20
    })
})


The train subset now contains 50 rows, shuffled and sampled from the train split, with the same features (id, dialogue, summary, topic).
The validation subset contains 20 rows with the same feature structure after applying the shuffle and sample operations.
The printed DatasetDict confirms the new size of both subsets: 50 samples for train and 20 samples for validation, preserving the original features.
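
The shuffle-then-select pattern used above can be illustrated without the datasets library. The following is a minimal sketch using only Python's standard library; `shuffled_subset` and the stand-in ids are hypothetical, and because datasets uses its own random generator, the rows picked here will not match the rows picked by `ds['train'].shuffle(seed=42)`. The point is only that a fixed seed makes the sample reproducible.

```python
import random

def shuffled_subset(items, n, seed=42):
    """Return the first n items of a seeded shuffle, leaving `items` unchanged."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return [items[i] for i in indices[:n]]

# Stand-in ids with the same count as the train split (not the real records)
rows = [f"train_{i}" for i in range(12460)]

subset_a = shuffled_subset(rows, 50)
subset_b = shuffled_subset(rows, 50)

print(len(subset_a))          # 50
print(subset_a == subset_b)   # True: the sample is reproducible
```

Changing the seed gives a different but equally reproducible subset, which is useful when comparing runs on small samples like the 50/20 split here.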

In [9]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

The code imports AutoTokenizer from the transformers library and loads the tokenizer of the facebook/bart-large-cnn model.
It defines a preprocessing function that tokenizes the input dialogues and summaries, applying padding and truncation to ensure fixed-length sequences.
The preprocessing function is applied to the train and validation subsets, creating tokenized datasets, followed by removing the original text columns (like id, dialogue, summary, topic) to keep only the necessary model inputs (input_ids, attention_mask, and labels).

In [10]:
from transformers import AutoTokenizer
model_checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
print(type(tokenizer))

<class 'transformers.models.bart.tokenization_bart_fast.BartTokenizerFast'>


In [13]:

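# BART already defines a <pad> token; this line makes padding reuse the EOS token instead
tokenizer.pad_token = tokenizer.eos_token
# Sequence length limits (the commented-out values are larger alternatives)
# max_source_length = 1024
# max_target_length = 176
max_source_length = 512
max_target_length = 88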

def preprocess_function(examples):
    # Tokenize inputs and labels with padding and truncation
    inputs = examples["dialogue"]
    targets = examples["summary"]
    
    model_inputs = tokenizer(
        inputs,
        max_length=max_source_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

    # Tokenize labels with padding and truncation
    labels = tokenizer(
        targets,
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    ).input_ids

    # Update model inputs with labels
    model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to both the train and validation subsets
tokenized_dataset = data_subsets.map(preprocess_function, batched=True)

# Remove unnecessary columns for the model
tokenized_dataset = tokenized_dataset.remove_columns(['id', 'dialogue', 'summary', 'topic'])

In [14]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 50
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20
    })
})


The train dataset now has 50 rows, with features input_ids, attention_mask, and labels, representing the tokenized inputs and their corresponding labels.
The validation dataset contains 20 rows with the same features (input_ids, attention_mask, and labels).
The printed DatasetDict confirms that the datasets are now preprocessed and tokenized, with the original columns removed.
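
To make the padding and truncation step concrete, here is a small sketch in plain Python. It is a simplification, not the tokenizer's implementation: `pad_or_truncate` and `mask_padding` are hypothetical helpers, the token ids are toy values, and a real tokenizer also keeps the closing special token when it truncates. The second helper shows a common variation: in the preprocessing cell the labels are padded with the pad token id itself, whereas many seq2seq recipes replace padded label positions with -100 so the loss skips them (the same -100 value that `compute_metric` later swaps back before decoding).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, then right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return clipped + [pad_id] * n_pad, attention_mask

def mask_padding(label_ids, pad_id):
    """Replace padded label positions so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]

PAD_ID = 1  # toy value for this sketch

short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6, pad_id=PAD_ID)

print(short_ids, short_mask)            # [0, 713, 16, 2, 1, 1] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)              # [10, 11, 12, 13, 14, 15] [1, 1, 1, 1, 1, 1]
print(mask_padding(short_ids, PAD_ID))  # [0, 713, 16, 2, -100, -100]
```

Note that masking by pad id only works cleanly when the pad id differs from every real token id in the labels; if padding reuses the EOS id, the genuine end-of-sequence token would be masked as well.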

In [15]:
from peft import LoraConfig, TaskType
from peft import get_peft_model

from nltk.tokenize import sent_tokenize
from transformers import (
    AutoModelForSeq2SeqLM
)
# Adapt a large pre-trained model (here facebook/bart-large-cnn) to a
# specific task without fine-tuning all of its parameters.
# LoRA makes training more efficient by updating only a small set of
# added low-rank parameters rather than the entire model, which reduces
# the computational resources needed for training.
print("Quantize=False, lora=True")
# Define the LoRA configuration parameters in a dictionary
lora_config_params = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": TaskType.SEQ_2_SEQ_LM
}

# Pass the dictionary to LoraConfig using the ** unpacking operator
lora_config = LoraConfig(**lora_config_params)
# base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)



Quantize=False, lora=True


In [16]:
# add LoRA adaptor
base_model = get_peft_model(base_model, lora_config)
base_model.print_trainable_parameters()

'NoneType' object has no attribute 'cadam32bit_grad_fp32'
trainable params: 1,179,648 || all params: 407,470,080 || trainable%: 0.2895


  warn("The installed version of bitsandbytes was compiled without GPU support. "
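
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual BART-large layout (hidden size 1024, 12 encoder layers, 12 decoder layers, with each decoder layer containing both a self-attention and a cross-attention block); those architecture numbers are assumptions, not values read from the notebook. With `r = 8` and `target_modules = ["q_proj", "v_proj"]`, each adapted projection gains two low-rank matrices.

```python
# Assumed BART-large architecture values (not taken from the notebook output)
d_model = 1024
encoder_layers = 12
decoder_layers = 12

# Values from the LoRA config in the notebook
r = 8
targets_per_attention = 2  # q_proj and v_proj

# One attention block per encoder layer, two (self + cross) per decoder layer
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

# Each adapted projection adds A (r x d_model) and B (d_model x r)
params_per_module = r * d_model + d_model * r

trainable = adapted_modules * params_per_module
all_params = 407_470_080  # "all params" as printed by print_trainable_parameters()

print(f"{trainable:,}")                         # 1,179,648
print(f"{100 * trainable / all_params:.4f}")    # 0.2895
```

Under these assumptions the arithmetic matches the printed line exactly: 72 adapted projections times 16,384 parameters each.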


In [17]:
import torch
from nltk.tokenize import sent_tokenize
from transformers import AutoModelForSeq2SeqLM

# Note: reloading the checkpoint rebinds base_model to a plain BART model,
# so the LoRA-wrapped model from In [16] is not the one passed to the trainer below.
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model = base_model.to(device)

In [19]:
from transformers import (
    Seq2SeqTrainingArguments, 
    Seq2SeqTrainer,
)

training_args_params = {
    "output_dir": "fine-tuned-bart",
    "overwrite_output_dir": True,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "learning_rate": 0.00005,
    "weight_decay": 0.005,
    "evaluation_strategy": "steps",
    "eval_steps": 2,
    "save_strategy": "epoch",  # Save less frequently
    "save_total_limit": 1,
    "report_to": None,
    "run_name": "facebook-bart-large-cnn-finetuning",
    "predict_with_generate": True,
    "fp16": True
}

# Pass the dictionary to Seq2SeqTrainingArguments using the ** unpacking operator
training_args = Seq2SeqTrainingArguments(**training_args_params)


In [20]:
import evaluate
import numpy as np
import nltk

# Load metric
metric = evaluate.load("rouge")
nltk.download("punkt")
nltk.download("punkt_tab")

Using the latest cached version of the module from /Users/nanchen/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--rouge/b01e0accf3bd6dd24839b769a5fda24e14995071570870922c71970b3a6ed886 (last modified on Sat Apr 20 12:58:46 2024) since it couldn't be found locally at evaluate-metric--rouge, or remotely on the Hugging Face Hub.
[nltk_data] Downloading package punkt to /Users/nanchen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/nanchen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [21]:
# compute_metric (defined in the next cell) calculates ROUGE metrics, which
# measure word and phrase overlap between the generated summary and the
# reference summary. postprocess_text puts each sentence on its own line,
# which is the format rougeLsum expects.
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]
    
    return preds, labels

In [22]:
def compute_metric(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # metric = evaluate.load("rouge")
    rouge_results = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    rouge_results = {k: round(v * 100, 4) for k, v in rouge_results.items()}
    
    results = {
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
        "rougeLsum": rouge_results["rougeLsum"],
        "gen_len": np.mean([np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds])
    }

    return results
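
For intuition about what the ROUGE numbers measure, here is a simplified unigram-overlap (ROUGE-1 style) F1 score in plain Python. It is a sketch only: `rouge1_f1` is a hypothetical helper that splits on whitespace and lowercases, with no stemming, so its values will not match `metric.compute(..., use_stemmer=True)` exactly. The example sentences are made up.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the boss bans instant messaging",
    "the boss bans instant messaging at work",
)
print(round(score * 100, 4))  # 83.3333 (precision 5/5, recall 5/7)
```

The scaling by 100 mirrors the `round(v * 100, 4)` step in `compute_metric`, so a value such as `eval_rouge1: 35.1248` reads as roughly 35% unigram-overlap F1.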

In [23]:
from transformers import DataCollatorForSeq2Seq
# Define a data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=base_model,  # Optional: lets the collator build decoder_input_ids from the labels
)

In [24]:
# Use the data collator instead of `tokenizer` in the trainer
trainer = Seq2SeqTrainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,  # Replaces tokenizer
    compute_metrics=compute_metric
)

In [25]:
# Train model
trainer.train()


  0%|          | 0/12 [00:00<?, ?it/s]

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

{'eval_loss': 0.8015881776809692, 'eval_rouge1': 31.7302, 'eval_rouge2': 9.1392, 'eval_rougeL': 22.8738, 'eval_rougeLsum': 29.1659, 'eval_gen_len': 140.0, 'eval_runtime': 110.2163, 'eval_samples_per_second': 0.181, 'eval_steps_per_second': 0.091, 'epoch': 0.16}


{'eval_loss': 0.7838362455368042, 'eval_rouge1': 35.4187, 'eval_rouge2': 12.1684, 'eval_rougeL': 27.0919, 'eval_rougeLsum': 31.5229, 'eval_gen_len': 140.0, 'eval_runtime': 106.2856, 'eval_samples_per_second': 0.188, 'eval_steps_per_second': 0.094, 'epoch': 0.32}


{'eval_loss': 0.7310712337493896, 'eval_rouge1': 32.2426, 'eval_rouge2': 11.5009, 'eval_rougeL': 25.2475, 'eval_rougeLsum': 28.5862, 'eval_gen_len': 140.0, 'eval_runtime': 111.1074, 'eval_samples_per_second': 0.18, 'eval_steps_per_second': 0.09, 'epoch': 0.48}


{'eval_loss': 0.7013152837753296, 'eval_rouge1': 34.0449, 'eval_rouge2': 13.143, 'eval_rougeL': 26.1982, 'eval_rougeLsum': 31.2627, 'eval_gen_len': 140.0, 'eval_runtime': 106.1452, 'eval_samples_per_second': 0.188, 'eval_steps_per_second': 0.094, 'epoch': 0.64}


{'eval_loss': 0.686744213104248, 'eval_rouge1': 33.9441, 'eval_rouge2': 12.2572, 'eval_rougeL': 25.3834, 'eval_rougeLsum': 29.9013, 'eval_gen_len': 140.0, 'eval_runtime': 114.6587, 'eval_samples_per_second': 0.174, 'eval_steps_per_second': 0.087, 'epoch': 0.8}


{'eval_loss': 0.6724153161048889, 'eval_rouge1': 35.1248, 'eval_rouge2': 12.5808, 'eval_rougeL': 25.9394, 'eval_rougeLsum': 30.2941, 'eval_gen_len': 140.0, 'eval_runtime': 96.1994, 'eval_samples_per_second': 0.208, 'eval_steps_per_second': 0.104, 'epoch': 0.96}
{'train_runtime': 794.0306, 'train_samples_per_second': 0.063, 'train_steps_per_second': 0.015, 'train_loss': 0.9207168420155843, 'epoch': 0.96}


TrainOutput(global_step=12, training_loss=0.9207168420155843, metrics={'train_runtime': 794.0306, 'train_samples_per_second': 0.063, 'train_steps_per_second': 0.015, 'total_flos': 52010510450688.0, 'train_loss': 0.9207168420155843, 'epoch': 0.96})
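
The step counts and epoch fractions in the training log follow from the subset sizes and the training arguments. The sketch below reproduces them under two assumptions that are not stated in the notebook: training runs on a single device, and a leftover batch that does not fill a full gradient-accumulation group does not count as an optimizer step (which is consistent with the `global_step=12` reported above).

```python
import math

train_samples = 50       # rows in the train subset
eval_samples = 20        # rows in the validation subset
per_device_batch = 2     # per_device_train_batch_size and per_device_eval_batch_size
grad_accum = 2           # gradient_accumulation_steps
eval_every = 2           # eval_steps

batches_per_epoch = math.ceil(train_samples / per_device_batch)   # 25 mini-batches
optimizer_steps = batches_per_epoch // grad_accum                  # 12 updates
eval_batches = math.ceil(eval_samples / per_device_batch)          # 10 batches per evaluation

# Fraction of an epoch completed at each evaluation step
eval_epochs = [
    round(step * grad_accum / batches_per_epoch, 2)
    for step in range(eval_every, optimizer_steps + 1, eval_every)
]

print(optimizer_steps)   # 12
print(eval_batches)      # 10
print(eval_epochs)       # [0.16, 0.32, 0.48, 0.64, 0.8, 0.96]
```

This explains why the run ends at epoch 0.96 rather than 1.0: 12 updates of 2 batches each cover 24 of the 25 mini-batches.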

In [26]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
print(eval_results)

Perplexity: 1.96
{'eval_loss': 0.6724153161048889, 'eval_rouge1': 35.1248, 'eval_rouge2': 12.5808, 'eval_rougeL': 25.9394, 'eval_rougeLsum': 30.2941, 'eval_gen_len': 140.0, 'eval_runtime': 106.4071, 'eval_samples_per_second': 0.188, 'eval_steps_per_second': 0.094, 'epoch': 0.96}
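
The perplexity printed above is simply the exponential of the evaluation loss. As a small worked example, the sketch below applies the same formula to the six `eval_loss` values copied from the training log, which shows the perplexity falling as the loss falls.

```python
import math

# eval_loss values copied from the six evaluation steps in the training log
eval_losses = [
    0.8015881776809692,
    0.7838362455368042,
    0.7310712337493896,
    0.7013152837753296,
    0.686744213104248,
    0.6724153161048889,
]

# Perplexity = exp(mean cross-entropy loss per token)
perplexities = [math.exp(loss) for loss in eval_losses]

for loss, ppl in zip(eval_losses, perplexities):
    print(f"loss={loss:.4f}  perplexity={ppl:.2f}")
```

The last line reproduces the `Perplexity: 1.96` value from the evaluation cell. Because padded label positions are included in the loss in this notebook, the figure is best read as a relative indicator across steps rather than an absolute measure.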


In [27]:
base_model.save_pretrained("./fine_tuned_bart_summarizer")


In [28]:
tokenizer.save_pretrained("./fine_tuned_bart_summarizer")

('./fine_tuned_bart_summarizer/tokenizer_config.json',
 './fine_tuned_bart_summarizer/special_tokens_map.json',
 './fine_tuned_bart_summarizer/vocab.json',
 './fine_tuned_bart_summarizer/merges.txt',
 './fine_tuned_bart_summarizer/added_tokens.json',
 './fine_tuned_bart_summarizer/tokenizer.json')

In [25]:
%pip install huggingface-hub


Note: you may need to restart the kernel to use updated packages.


In [29]:
from huggingface_hub import login

# Prompt for login
login()


In [30]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("./fine_tuned_bart_summarizer")
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_bart_summarizer")

model.push_to_hub("Mia2024/CS5100TextSummarization")
tokenizer.push_to_hub("Mia2024/CS5100TextSummarization")


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Mia2024/CS5100TextSummarization/commit/9565526e7526334baaa78df17cdd29895e8a8506', commit_message='Upload tokenizer', commit_description='', oid='9565526e7526334baaa78df17cdd29895e8a8506', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Mia2024/CS5100TextSummarization', endpoint='https://huggingface.co', repo_type='model', repo_id='Mia2024/CS5100TextSummarization'), pr_revision=None, pr_num=None)

In [24]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

checkpoint = "Mia2024/CS5100TextSummarization"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
print("Model and tokenizer loaded successfully.")


Model and tokenizer loaded successfully.


In [25]:
generation_config = GenerationConfig(
        min_new_tokens=10,
        max_new_tokens=256,
        temperature=0.9,
        top_p=1.0,
        top_k=50         
    )
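
To illustrate what `temperature` and `top_k` control when sampling is enabled, here is a toy sampler in plain Python. It is a sketch of the general technique, not the transformers implementation: `sample_top_k` and the logits are hypothetical, and nucleus (`top_p`) filtering is left out because `top_p=1.0` keeps the whole distribution anyway.

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=50, rng=None):
    """Sample one token index from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the indices of the top_k largest logits
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax, numerically stable
    return rng.choices(ranked, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary

# With top_k=2 only token 0 or token 1 can ever be chosen
picks = [sample_top_k(toy_logits, top_k=2, rng=random.Random(seed)) for seed in range(20)]
print(sorted(set(picks)))

# With top_k=1 sampling degenerates to greedy decoding
print(sample_top_k(toy_logits, top_k=1, rng=random.Random(0)))  # 0
```

These settings only take effect when sampling is switched on, which the notebook does by passing `do_sample=True` to `model.generate`.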

In [26]:
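# Note: this rebinds checkpoint, tokenizer, and model to the original
# facebook/bart-large-cnn weights, replacing the fine-tuned model loaded in In [24]
checkpoint = "facebook/bart-large-cnn"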
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)
print("Model and tokenizer loaded successfully.")

Model and tokenizer loaded successfully.


In [27]:
prefix = "Summarize the following conversation: \n###\n"
suffix = "\n### Summary:"
input_text = "#Person1#: Ms. Dawson, I need you to take a dictation for me. #Person2#: Yes, sir... #Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready? #Person2#: Yes, sir. Go ahead. #Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited. #Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications? #Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications. #Person2#: But sir, many employees use Instant Messaging to communicate with their clients. #Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Where were we? #Person2#: This applies to internal and external communications. #Person1#: Yes. Any employee who persists in using Instant Messaging will first receive a warning and be placed on probation. At second offense, the employee will face termination. Any questions regarding this new policy may be directed to department heads. #Person2#: Is that all? #Person1#: Yes. Please get this memo typed up and distributed to all employees before 4 pm."
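# Wrap the dialogue in the prompt and request a summary of roughly 15% of its word count
word_budget = int(0.15 * len(input_text.split()))
input_ids = tokenizer.encode(
    prefix + input_text + f"The generated summary should be around {word_budget} words." + suffix,
    return_tensors="pt",
    truncation=True,
    max_length=1024,  # Ensure input fits the model's max length
).to(device)  # Keep the inputs on the same device as the model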

In [28]:
print(input_ids)

tensor([[    0, 38182,  3916,  2072,     5,   511,  1607,    35,  1437, 50118,
         48134, 50118, 10431, 41761,   134, 10431,    35,  2135,     4, 14820,
             6,    38,   240,    47,     7,   185,    10, 28700,  1258,    13,
           162,     4,   849, 41761,   176, 10431,    35,  3216,     6, 21958,
           734,   849, 41761,   134, 10431,    35,   152,   197,   213,    66,
            25,    41, 18592,    12, 23252, 20834,     7,    70,  1321,    30,
            42,  1390,     4,  3945,    47,  1227,   116,   849, 41761,   176,
         10431,    35,  3216,     6, 21958,     4,  2381,   789,     4,   849,
         41761,   134, 10431,    35, 35798,    70,   813,   734, 33355,  1320,
             6,    70,   558,  4372,    32,  9393,     7,  1047, 25778,     8,
           781, 29966,     4,    20,   304,     9, 26596, 32236,  1767,    30,
          1321,   148,   447,   722,    16, 14657,  9986,     4,   849, 41761,
           176, 10431,    35,  5348,     6,   473,  

In [29]:
output_ids = model.generate(input_ids, do_sample=True, generation_config=generation_config)

In [30]:
print(output_ids)

tensor([[    2,  4030,    16,     5,   766,     9,    10,  3924,   341, 17194,
             4,    20, 17194,    16,  1887,     7,    28, 16556,    30,     5,
           138,     4,    83,  4819,     9,     5, 17194,     4,    20, 11054,
            22,  1121,  8304,   113,    64,    28,   341,     7,  6364,   143,
           761,     9,  1100,     4,    20, 17194,    18,  5131,  4819,   197,
            28,   198,  2357,  1617,     4,     2]])


In [31]:
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [32]:
print(output_text)

New is the name of a widely used algorithm. The algorithm is designed to be administered by the company. A summary of the algorithm. The phrase "Inbox" can be used to indicate any kind of address. The algorithm's recommended summary should be around 33 words.
