<a href="https://colab.research.google.com/github/Arnav710/Arnav710/blob/main/Fine_tune_FLAN_T5_for_chat_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tune FLAN-T5 for chat summarization

### Installations

In [7]:
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade

Collecting py7zr
  Downloading py7zr-0.21.1-py3-none-any.whl (67 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxha

### Load Dataset

In [8]:
dataset_id = "knkarthick/dialogsum"

In [9]:
from datasets import load_dataset

# Load dataset from the HuggingFace hub
dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Train dataset size: 12460
Test dataset size: 1500


### Data Exploration

The `dataset` loaded from the Hugging Face Hub is a `DatasetDict` object with three splits: train, validation, and test.

Train: 12,460 records

Validation: 500 records

Test: 1,500 records
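
As a quick sanity check on the split sizes listed above, the following sketch computes each split's share of the full corpus. The sizes are copied from the notebook output; the variable names are illustrative only.

```python
# Split sizes copied from the notebook output above
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

# Total number of records across all splits
total = sum(split_sizes.values())

# Fraction of the corpus that each split represents
shares = {name: size / total for name, size in split_sizes.items()}

print(f"Total records: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 14,460 is the record count that appears when all three splits are concatenated later in the notebook.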


In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [11]:
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]

Each record in the dataset has the following features: `id`, `dialogue`, `summary`, and `topic`.

In [12]:
train

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 12460
})

In [14]:
from random import randrange

rand_idx = randrange(len(dataset["train"]))
sample = train[rand_idx]
print(f"dialogue: \n{sample['dialogue']}\n")
print(f"summary: \n{sample['summary']}\n")

dialogue: 
#Person1#: Did you see today's newspaper? That building over there in center view was just struck by lightning for the fourth time.
#Person2#: I'm not surprised. If the conditions for lightning to strike are right one time, they might be as good another time.
#Person1#: Well, I don't take any chances. If I'm caught in a thunderstorm, I will look for a building or a closed car. Also, I was told that if you're stuck outdoors, the best thing you can do is to keep yourself close to the ground and avoid bodies of water.
#Person2#: To tell you the truth, even when I'm at home, I don't take baths or showers during a thunderstorm. And I don't use anything that works electrically. Maybe I'm too anxious.
#Person1#: I wouldn't say that. According to the article, lightning starts thousands of fires every year in the United States alone. Hundreds of people are injured or even killed. I think you're just being sensible.

summary: 
#Person1# tells #Person2# the news about a building gettin

In [16]:
# looking at the various topics associated with the records

import pandas as pd

# Read the whole column at once instead of indexing row by row
topics = list(train["topic"])

# Create a frequency distribution of the topics
topic_counts = pd.Series(topics).value_counts()

In [17]:
print(topic_counts)

shopping                 174
job interview            161
daily casual talk        125
phone call                89
order food                79
                        ... 
eat ice creams             1
marriage predicaments      1
ways of commuting          1
food comment               1
baggage pack               1
Name: count, Length: 7434, dtype: int64


We can see that the dataset has 7,434 unique conversation topics. The most common are shopping, job interviews, daily casual talk, phone calls, and ordering food.
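
The long tail in the output above (many topics with a single occurrence) can be quantified by counting how many topics appear exactly once. The sketch below uses a small made-up list of topics as a stand-in for `train["topic"]`, so the numbers are illustrative and not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for train["topic"]; not real dataset values
toy_topics = ["shopping", "shopping", "job interview", "phone call",
              "shopping", "job interview", "order food"]

# Frequency of each topic
topic_counts = Counter(toy_topics)

# Topics that occur exactly once form the "long tail"
singletons = [topic for topic, count in topic_counts.items() if count == 1]
singleton_share = len(singletons) / len(topic_counts)

print(topic_counts.most_common(2))
print(f"{len(singletons)} of {len(topic_counts)} topics occur once ({singleton_share:.0%})")
```

Running the same count on the real `topic` column would show how much of the 7,434-topic vocabulary is made up of one-off labels.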

### Tokenization

In [19]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [23]:
from datasets import concatenate_datasets

# Combine the train, test, and validation datasets
full_dataset = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]])

# Tokenize the dialogue inputs
tokenized_dialogues = full_dataset.map(
    lambda example: tokenizer(example["dialogue"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"]
)

# Calculate and display the maximum length of tokenized inputs
max_input_length = max(len(tokens) for tokens in tokenized_dialogues["input_ids"])
print(f"Maximum length of tokenized inputs: {max_input_length}")

# Tokenize the summary targets
tokenized_summaries = full_dataset.map(
    lambda example: tokenizer(example["summary"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"]
)

# Calculate and display the maximum length of tokenized targets
max_output_length = max(len(tokens) for tokens in tokenized_summaries["input_ids"])
print(f"Maximum length of tokenized targets: {max_output_length}")


Map:   0%|          | 0/14460 [00:00<?, ? examples/s]

Maximum length of tokenized inputs: 512


Map:   0%|          | 0/14460 [00:00<?, ? examples/s]

Maximum length of tokenized targets: 277
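
Note that the reported maximum input length is exactly 512. With `truncation=True` and no explicit `max_length`, the tokenizer truncates to its own maximum length (assumed here to be 512 for this tokenizer), so the measured value is a cap and not necessarily the length of the longest dialogue. The sketch below imitates this effect with plain Python lists; `truncate` and `cap` are hypothetical names, not part of the `transformers` API.

```python
def truncate(token_ids, cap):
    """Keep at most `cap` tokens, mimicking truncation to a maximum length."""
    return token_ids[:cap]

cap = 512  # assumed truncation limit

# Hypothetical token sequences of different lengths
toy_sequences = [list(range(n)) for n in (120, 512, 900)]

# Longest sequence before and after truncation
true_max = max(len(seq) for seq in toy_sequences)
observed_max = max(len(truncate(seq, cap)) for seq in toy_sequences)

print(f"True maximum length: {true_max}")
print(f"Maximum after truncation: {observed_max}")
```

To measure untruncated lengths on the real data, one option is to tokenize with `truncation=False` and inspect the length distribution before choosing `max_length`.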


In [30]:
def preprocess_function(sample):

    # add "summarize: " prefix to specify the task type for FLAN T5
    conversations = []
    prefix = "summarize: "
    for conversation in sample["dialogue"]:
        conversations.append(prefix + conversation)

    # tokenize input: convert the conversations to tokens
    model_inputs = tokenizer(conversations, max_length=max_input_length, padding="max_length", truncation=True)

    # Tokenize targets
    labels = tokenizer(text_target=sample["summary"], max_length=max_output_length, padding="max_length", truncation=True)

    # Iterate over the target for all the records
    for target_label in labels["input_ids"]:
        # Iterate over each token that comprises the target for a single record
        for i in range(len(target_label)):
            # Check if any token matches the token used for padding
            if target_label[i] == tokenizer.pad_token_id:
                # Replace padding with -100 so it is ignored when computing the loss
                target_label[i] = -100


    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])


Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]
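
The padding-masking loop in `preprocess_function` can also be written as a list comprehension. The sketch below shows the idea on toy token IDs. It assumes a pad token ID of 0 (the value T5 tokenizers use) and relies on -100 being the index that PyTorch's cross-entropy loss ignores by default. `mask_padding` is a hypothetical helper, not part of the notebook.

```python
PAD_TOKEN_ID = 0      # assumed; corresponds to tokenizer.pad_token_id for T5
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(batch_label_ids, pad_id=PAD_TOKEN_ID, ignore=IGNORE_INDEX):
    """Return a copy of the label IDs with padding positions replaced by `ignore`."""
    return [
        [ignore if token == pad_id else token for token in label_ids]
        for label_ids in batch_label_ids
    ]

# Two toy target sequences padded to length 5
toy_labels = [[71, 3, 9, 1, 0], [42, 1, 0, 0, 0]]
masked = mask_padding(toy_labels)
print(masked)
```

Unlike the in-place loop, this version leaves the original lists untouched, which can make the preprocessing step easier to test in isolation.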

### Fine tuning

In [31]:
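from transformers import AutoModelForSeq2SeqLM

# Hugging Face Hub model id
model_id="google/flan-t5-base"

# Load the model with its language-modeling head from the Hub.
# AutoModel would return the bare T5Model, which has no LM head and does not accept labels.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

In [32]:
print(model)

T5ForConditionalGeneration(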
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features