### Lab 1 - Generative AI Use Case: Summarize Dialogue

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

##### Load dataset 

In [2]:
dialog_dataset = load_dataset("knkarthick/dialogsum")

#Print first 5 examples
print("First 5 examples from the dataset:")
for i in range(5):
    print(f"example {i+1}: {dialog_dataset['test'][i]}")

First 5 examples from the dataset:
example 4: {'id': 'test_1_1', 'dialogue': "#Person1#: You're finally here! What took so long?\n#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.\n#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.\n#Person2#: I don't think it can be avoided, to be honest.\n#Person1#: perhaps it would be better if you started taking public transport system to work.\n#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.\n#Person1#: It would be better for the environment, too.\n#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.\n#Person1#: Taking the subway would be a lot less stressful than driving as well.\n#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.\n#Person1#: Well, when it

#### Lets look at few samples from the dataset.

In [3]:
dash_line = "_".join(' ' for x in range(100))

example_indices = [7, 103]

#Print first 5 examples
print("First 5 examples from the dataset:")
for i in example_indices:
    print(f"EXAMPLE {i}")
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dialog_dataset['test'][i]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY')
    print(dialog_dataset['test'][i]['summary'])
    print(dash_line)

First 5 examples from the dataset:
EXAMPLE 7
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
INPUT DIALOGUE:
#Person1#: Kate, you never believe what's happened.
#Person2#: What do you mean?
#Person1#: Masha and Hero are getting divorced.
#Person2#: You are kidding. What happened?
#Person1#: Well, I don't really know, but I heard that they are having a separation for 2 months, and filed for divorce.
#Person2#: That's really surprising. I always thought they are well matched. What about the kids? Who get custody?
#Person1#: Masha, it seems quiet and makable, no quarrelling about who get the house and stock and then contesting the divorce with other details worked out.
#Person2#: That's the change from all the back stepping we usually hear about. Well, I still can't believe it, Masha and Hero, the perfect couple. When would they divorce be

In [15]:
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")


#### Summarizing without any prompt engineering 
NOTE: The GenerationConfig with num_beams reduces the memory overhead and thus runs properly locally, otherwise it crashes silently probably becuase of memory getting overwhelmed. 

In [16]:
input_text = dialog_dataset['test'][7]['dialogue']
encoded_prompt = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

generation_config = GenerationConfig(
    max_new_tokens=50,  # Limit generation length
    num_beams=1,        # Avoid beam search (less memory)
    do_sample=False     # Greedy decoding
)

outputs = model.generate(
    encoded_prompt, 
    generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Masha and Hero are getting divorced.


#### Summarizing with few shot learning 

In [17]:
prompt = f"""
Use following dialogue and summary examples to summarize the given dialogue with a polite tone.
Dialogue: {dialog_dataset['test'][5]['dialogue']}
Summary: {dialog_dataset['test'][5]['summary']}

Dialogue: {dialog_dataset['test'][11]['dialogue']}
Summary: {dialog_dataset['test'][11]['summary']}

Now, Summarize given dialogue.
Dialogue: {dialog_dataset['test'][37]['dialogue']}
"""
encoded_prompt = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation_config = GenerationConfig(
    max_new_tokens=80,  # Limit generation length
    num_beams=1,        # Avoid beam search (less memory)
    do_sample=False     # Greedy decoding
)
outputs = model.generate(
    encoded_prompt,
    generation_config=generation_config,
)
print("Dialogues: ")
print(dash_line)
print(dialog_dataset['test'][43]['dialogue'])
print(dash_line)
print("Generated Summary: ")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Token indices sequence length is longer than the specified maximum sequence length for this model (856 > 512). Running this sequence through the model will result in indexing errors


Dialogues: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
#Person1#: I don't know how to adjust my life. Would you give me a piece of advice?
#Person2#: You look a bit pale, don't you?
#Person1#: Yes, I can't sleep well every night.
#Person2#: You should get plenty of sleep.
#Person1#: I drink a lot of wine.
#Person2#: If I were you, I wouldn't drink too much.
#Person1#: I often feel so tired.
#Person2#: You better do some exercise every morning.
#Person1#: I sometimes find the shadow of death in front of me.
#Person2#: Why do you worry about your future? You're very young, and you'll make great contribution to the world. I hope you take my advice.
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 