<a href="https://colab.research.google.com/github/Aligreu/Text-Based-Projects/blob/main/summarization_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers peft evaluate loralib datasets

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting peft
  Downloading peft-0.6.1-py3-none-any.whl (135 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting loralib
  Downloading loralib-0.1.2-py3-none-any.whl (10 kB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)


In [10]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer, GenerationConfig
import torch
import evaluate
from peft import LoraConfig
import pandas as pd
from datasets import load_dataset

In [34]:
dataset = load_dataset("knkarthick/dialogsum")

In [12]:
pd.DataFrame(dataset["train"])

Unnamed: 0,id,dialogue,summary,topic
0,train_0,"#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. ...","Mr. Smith's getting a check-up, and Doctor Haw...",get a check-up
1,train_1,"#Person1#: Hello Mrs. Parker, how have you bee...",Mrs Parker takes Ricky for his vaccines. Dr. P...,vaccines
2,train_2,"#Person1#: Excuse me, did you see a set of key...",#Person1#'s looking for a set of keys and asks...,find keys
3,train_3,#Person1#: Why didn't you tell me you had a gi...,#Person1#'s angry because #Person2# didn't tel...,have a girlfriend
4,train_4,"#Person1#: Watsup, ladies! Y'll looking'fine t...",Malik invites Nikki to dance. Nikki agrees if ...,dance
...,...,...,...,...
12455,train_12455,#Person1#: Excuse me. You are Mr. Green from M...,Tan Ling picks Mr. Green up who is easily reco...,pick up someone
12456,train_12456,#Person1#: Mister Ewing said we should show up...,#Person1# and #Person2# plan to take the under...,conference center
12457,train_12457,#Person1#: How can I help you today?\n#Person2...,#Person2# rents a small car for 5 days with th...,rent a car
12458,train_12458,#Person1#: You look a bit unhappy today. What'...,#Person2#'s mom lost her job. #Person2# hopes ...,job losing


In [13]:
model_name = 'google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


## Using the base model to generate summaries

In [14]:
prompt = f"""

summarise the following dialogue {dataset["test"][0]["dialogue"]}\n\n

summary: \n\n

"""

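# Tokenize the prompt and return PyTorch tensors.
ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate up to 100 new tokens and decode them back to text.
output = model.generate(ids, max_new_tokens=100)
generated_summary = tokenizer.decode(output[0], skip_special_tokens=True)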

print(f"The original summary:\n {dataset['test'][0]['summary']}")
print(f"The generated summary:\n {generated_summary}")

The original summary:
 Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
The generated summary:
 The memo will go out to all employees by this afternoon.
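
The two summaries above differ noticeably, and a numeric score makes that gap easier to track across runs. The sketch below computes a unigram-overlap F1, a simplified stand-in for ROUGE-1 (whitespace tokenization, no stemming). The function name `unigram_f1` is hypothetical and is not part of `evaluate` or any other library used in this notebook.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1 stand-in)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The reference and generated summaries printed above.
reference = (
    "Ms. Dawson helps #Person1# to write a memo to inform every employee that "
    "they have to change the communication method and should not use Instant "
    "Messaging anymore."
)
generated = "The memo will go out to all employees by this afternoon."

score = unigram_f1(generated, reference)
print(f"unigram F1: {score:.3f}")
```

The same function can be applied to the fine-tuned model's output to compare the two models on equal terms.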


## Fine-tune the model for more accurate results

In [15]:
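def tokenize_dataset(example):
    # Wrap each dialogue in an instruction prompt; the blank lines keep the
    # instruction, the dialogue, and the "summary:" cue from running together.
    start_prompt = "summarize the following conversation.\n\n"
    end_prompt = "\n\nsummary: "
    full_prompt = [start_prompt + ex + end_prompt for ex in example["dialogue"]]

    # Pad or truncate to the tokenizer's maximum length so every row has the same shape.
    example["input_ids"] = tokenizer(full_prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example["labels"] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example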

tokenized_dataset = dataset.map(tokenize_dataset, batched=True)
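
Because the labels are padded to a fixed length, the loss is also computed on padding positions unless they are masked. A common convention with Hugging Face seq2seq models is to replace padding ids in the labels with -100, which the cross-entropy loss ignores. The helper below is a minimal sketch of that idea on plain Python lists. `mask_label_padding` is a hypothetical name, and the pad id of 0 is an assumption that matches the T5 tokenizer (check `tokenizer.pad_token_id`).

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_label_padding(labels: List[List[int]], pad_token_id: int = 0) -> List[List[int]]:
    """Return a copy of `labels` with every padding id replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in row]
        for row in labels
    ]


# Toy batch: two label rows padded with 0 to length 5 (the ids are made up).
toy_labels = [[71, 1707, 1, 0, 0], [363, 1, 0, 0, 0]]
masked = mask_label_padding(toy_labels)
print(masked)  # [[71, 1707, 1, -100, -100], [363, 1, -100, -100, -100]]
```

Inside `tokenize_dataset`, this could be applied to the tokenized summaries (with `return_tensors="pt"` dropped so the ids are plain lists) before assigning `example["labels"]`.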


In [16]:
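# Drop the raw text columns; only input_ids and labels are needed for training.
tokenized_dataset = tokenized_dataset.remove_columns(['id', 'topic', 'dialogue', 'summary'])
# Keep every 100th example of each split to shorten training time.
tokenized_dataset = tokenized_dataset.filter(lambda example, index: index % 100 == 0, with_indices=True)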


In [17]:
# Free the intermediate tensors; `dataset` is kept because the final test prompt reads from it.
del ids, output, generated_summary

In [18]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=8,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

peft_model = get_peft_model(model,
                            lora_config)
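
To get a feel for how small the adapter is, the number of trainable LoRA weights can be worked out by hand. The sketch below assumes the published FLAN-T5-base shape (projection width 768, 12 encoder and 12 decoder blocks, with each decoder block carrying both self-attention and cross-attention) and that every adapted `q` or `v` projection gains two matrices of shape `r x 768` and `768 x r`. These architecture figures are assumptions; they are not stated in the notebook.

```python
# Assumed FLAN-T5-base dimensions (not taken from the notebook itself).
d_model = 768          # input and output width of the q and v projections
encoder_blocks = 12    # each has one self-attention layer
decoder_blocks = 12    # each has self-attention plus cross-attention

r = 32                 # LoRA rank, as in lora_config
lora_alpha = 8         # as in lora_config
targets_per_attention = 2  # "q" and "v"

attention_layers = encoder_blocks + 2 * decoder_blocks
adapted_modules = attention_layers * targets_per_attention

# Each adapted module adds A (r x d_model) and B (d_model x r).
params_per_module = r * d_model + d_model * r
trainable_params = adapted_modules * params_per_module

# LoRA scales the low-rank update by lora_alpha / r (standard scaling).
scaling = lora_alpha / r

print(f"adapted modules:  {adapted_modules}")
print(f"trainable params: {trainable_params:,}")
print(f"update scaling:   {scaling}")
```

On the live model, `peft_model.print_trainable_parameters()` reports the trainable and total parameter counts directly, which is a quick way to check this estimate.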

In [25]:
args = TrainingArguments(
    output_dir="./data",
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=0.0001,
    hub_model_id="fine-tuned-text-summarization"  # replaces the deprecated push_to_hub_model_id
)

trainer = Trainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)



In [26]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=80, training_loss=49.171875, metrics={'train_runtime': 175.7573, 'train_samples_per_second': 3.556, 'train_steps_per_second': 0.455, 'total_flos': 434768117760000.0, 'train_loss': 49.171875, 'epoch': 5.0})
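
The `global_step=80` in the output follows from the dataset size and the batch size. The arithmetic below is a sketch that assumes a single device and the `TrainingArguments` default `per_device_train_batch_size` of 8, since the notebook does not set it.

```python
import math

train_rows = 12460   # size of the train split before filtering
keep_every = 100     # the filter keeps rows where index % 100 == 0
batch_size = 8       # assumed: TrainingArguments default, one device
epochs = 5           # num_train_epochs

# Indices 0, 100, ..., 12400 survive the filter.
kept_rows = math.ceil(train_rows / keep_every)
steps_per_epoch = math.ceil(kept_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(f"kept rows:       {kept_rows}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```

With 125 rows seen over 5 epochs in about 175.8 seconds, throughput works out to roughly 3.56 samples per second, in line with the reported `train_samples_per_second`.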

In [27]:
trainer.push_to_hub()


'https://huggingface.co/ig4i/fine-tuned-text-summarization/tree/main/'

In [None]:
from google.colab import userdata
userdata.get('HF')

In [24]:
from huggingface_hub import login
login()  # authenticate with the Hugging Face Hub; this must run before trainer.push_to_hub()


In [31]:
tokenizer.save_pretrained("./models")

('./models/tokenizer_config.json',
 './models/special_tokens_map.json',
 './models/tokenizer.json')

In [32]:
model_id = "ig4i/fine-tuned-text-summarization"
mytokenizer = AutoTokenizer.from_pretrained(model_id)


In [30]:
myModel = AutoModelForSeq2SeqLM.from_pretrained(model_id)


In [44]:
prompt = f"""summarize the following dialogue.\n
{dataset['test'][100]['dialogue']}

summary:\n\n

"""

print(prompt)

# A single prompt needs no padding; truncation keeps it within the model's maximum length.
ids = mytokenizer(prompt, return_tensors="pt", truncation=True)["input_ids"]

output = myModel.generate(ids, max_new_tokens=200)

print(mytokenizer.decode(output[0], skip_special_tokens=True))

summarize the following dialogue.

#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

summary:




The two people are trying to figure out how to react to a cut.
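
The fine-tuning prompt ("summarize the following conversation.") and the inference prompt ("summarize the following dialogue.") are worded differently, and a fine-tuned model generally does best when inference prompts match the training format. One way to keep them aligned is a single helper used in both places. The sketch below is an illustration only; `build_prompt` and its template are hypothetical and not part of the notebook.

```python
INSTRUCTION = "summarize the following conversation."
CUE = "summary: "


def build_prompt(dialogue: str) -> str:
    """Format one dialogue with the instruction and the summary cue."""
    return f"{INSTRUCTION}\n\n{dialogue.strip()}\n\n{CUE}"


# First two turns of the test dialogue shown above.
example = (
    "#Person1#: OK, that's a cut! Let's start from the beginning, everyone.\n"
    "#Person2#: What was the problem that time?"
)
prompt = build_prompt(example)
print(prompt)
```

Calling the same helper from both `tokenize_dataset` and the inference cell would remove the wording mismatch.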
