<a href="https://colab.research.google.com/github/MainuddinAlam/Gen_AI_Project/blob/main/finalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Final Assignment</h1>
<h2>Task: Text Summarization</h2>
<h2>Submitted by: Mainuddin Alam Irteja</h2>

In [25]:
# Installing necessary libraries
!pip install transformers datasets torch huggingface_hub



In [26]:
# Loading FLAN-T5 model

# Importing necessary modules
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Assigning the model name and loading the tokenizer and model
modelName = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(modelName)
model = AutoModelForSeq2SeqLM.from_pretrained(modelName)

In [27]:
# Move the model to the GPU if one is available, otherwise keep it on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Print out which device we're using (GPU or CPU)
print(device)

cuda


In [28]:
# Loading CNN/DailyMail dataset
from datasets import load_dataset
cnn_Dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Hold out 12% of the data for evaluation
split_Dataset = cnn_Dataset.train_test_split(test_size=0.12)
# Keep only 2% of the remaining data for training to shorten the training time
train_Dataset = split_Dataset['train'].train_test_split(test_size=0.98)['train']
eval_Dataset = split_Dataset['test']
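
The two nested splits can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test fraction up and gives the remainder to training (the scikit-learn convention), and that the 3.0.0 "train" split holds 287,113 articles. Both are assumptions, and `split_sizes` is a hypothetical helper rather than part of the `datasets` API.

```python
import math

def split_sizes(n_examples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # The test split is rounded up; the training split receives the remainder
    n_test = math.ceil(test_size * n_examples)
    return n_examples - n_test, n_test

# Assumed size of the CNN/DailyMail 3.0.0 "train" split
full_train = 287_113

kept, eval_size = split_sizes(full_train, 0.12)  # first split: hold out 12%
train_size, _ = split_sizes(kept, 0.98)          # second split: discard 98%
print(train_size, eval_size)  # 5053 34454
```

These sizes agree with the example counts shown by the `Map` progress bars in the tokenization step.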

In [29]:
# Preprocessing the dataset

"""
Function to preprocess the dataset

@param givenData The dataset given to be preprocessed
@returns model_inputs The preprocessed model inputs
"""
def preprocessDataset(givenData):
  # Extract the raw text from the data
  inputs = [doc for doc in givenData['article']]

  # Tokenize the articles with padding and truncation to a max length of 512
  model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)

  # Tokenize the summaries as targets (text_target replaces the deprecated as_target_tokenizer())
  labels = tokenizer(text_target=givenData['highlights'], max_length=128, padding="max_length", truncation=True)

  # Attach the tokenized summaries as labels to the model inputs
  model_inputs["labels"] = labels["input_ids"]

  # No device transfer here: Dataset.map stores plain arrays, and the Trainer
  # moves each batch to the GPU/CPU during training and evaluation

  # Return the preprocessed model inputs
  return model_inputs
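
Because the summaries are padded to 128 tokens, every label sequence ends in a run of pad tokens. A common convention in Hugging Face seq2seq fine-tuning is to replace those positions with -100, which PyTorch's cross-entropy loss ignores by default. The original run did not do this, so the losses reported further down were likely computed with padded positions included. The helper below is a hypothetical sketch, and the default pad id of 0 is an assumption based on the T5 tokenizer (in practice `tokenizer.pad_token_id` would be passed in).

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace pad token ids in each label sequence with ignore_index."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Two toy label sequences, padded with 0 to a common length
print(mask_label_padding([[37, 1712, 1, 0, 0], [5, 1, 0, 0, 0]]))
# [[37, 1712, 1, -100, -100], [5, 1, -100, -100, -100]]
```

Inside `preprocessDataset`, this would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`.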

In [30]:
# Tokenize the training and testing datasets
tokenized_train_dataset = train_Dataset.map(preprocessDataset, batched=True)
tokenized_eval_dataset = eval_Dataset.map(preprocessDataset, batched=True)

Map:   0%|          | 0/5053 [00:00<?, ? examples/s]

Map:   0%|          | 0/34454 [00:00<?, ? examples/s]

In [31]:
from transformers import Seq2SeqTrainingArguments

# Setting training parameters for text summarization
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',               # Directory to save model checkpoints
    evaluation_strategy="epoch",          # Evaluate the model at the end of each epoch
    learning_rate=2e-5,                   # Learning rate
    per_device_train_batch_size=8,        # Batch size for training
    per_device_eval_batch_size=8,         # Batch size for evaluation
    weight_decay=0.01,                    # Regularization to prevent overfitting
    save_total_limit=3,                   # Only keep the last 3 checkpoints
    num_train_epochs=3,                   # Number of epochs to train the model
    predict_with_generate=True,           # Enable text generation during evaluation
    logging_dir="./logs"                  # Directory for storing training logs
)



In [32]:
from transformers import Seq2SeqTrainer

# Initializing the trainer object for text summarization
trainer = Seq2SeqTrainer(
    model=model,                            # The model to be trained
    args=training_args,                     # The training arguments adapted for text generation
    train_dataset=tokenized_train_dataset,  # Tokenized training dataset
    eval_dataset=tokenized_eval_dataset,    # Tokenized evaluation dataset
    tokenizer=tokenizer                     # The tokenizer to handle input and output
)

In [33]:
# Training the model
trainer.train()

Epoch  Training Loss  Validation Loss
1      5.6505         1.533256
2      1.7354         1.207462
3      1.3927         1.19181


TrainOutput(global_step=1896, training_loss=2.6014186360162017, metrics={'train_runtime': 2592.6128, 'train_samples_per_second': 5.847, 'train_steps_per_second': 0.731, 'total_flos': 2817914160807936.0, 'train_loss': 2.6014186360162017, 'epoch': 3.0})
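
The `global_step` value can be reproduced from the dataset size and the training arguments. This is a minimal sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. `total_optimizer_steps` is a hypothetical helper, not a Trainer function.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 5053 training examples, batch size 8, 3 epochs
print(total_optimizer_steps(5053, 8, 3))  # 632 steps per epoch * 3 = 1896
```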

In [34]:
# Evaluating the model
metrics = trainer.evaluate()

# Display the evaluation metrics
print(metrics)

{'eval_loss': 1.1918097734451294, 'eval_runtime': 589.7175, 'eval_samples_per_second': 58.425, 'eval_steps_per_second': 7.303, 'epoch': 3.0}
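
The evaluation above reports only the loss, because no metric function was passed to the trainer. Summarization quality is often reported with ROUGE scores as well. The function below is a simplified sketch of ROUGE-1 F1 (unigram overlap on whitespace tokens, no stemming); it is not the official implementation, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap between two lowercased texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: the candidate covers 2 of the 6 reference words
print(rouge1_f1("the cat sat on the mat", "the cat"))
```

For real experiments, an established ROUGE package would be a safer choice than this sketch, since tokenization and stemming affect the scores.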


In [47]:
# Function to summarize texts

"""
Function to summarize texts

@param text The text provided to the function
@returns The generated summary as plain text
"""
def summarizeTexts(text):
  # Tokenize the input text and move it to the correct device
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)

  # Generate the summary using the fine-tuned model
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)

  # Return the decoded summary back into text
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
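
Beam search can repeat phrases, as the dialogue summary at the end of this notebook shows. One option worth trying is the `no_repeat_ngram_size` argument of `generate`, which prevents any n-gram of the given size from being generated twice. The variant below is a sketch under that assumption. The function name and the choice of 3 are illustrative, and it takes the model and tokenizer as arguments so that it does not depend on global state. It also forwards the attention mask by unpacking `inputs`.

```python
def summarize_without_repeats(text, model, tokenizer, device="cpu"):
    """Summarize text while blocking any 3-gram from being generated twice."""
    # Tokenize the input text and move it to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)

    # Unpacking passes both input_ids and attention_mask to generate
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3,  # illustrative value, not tuned
    )

    # Decode the generated token ids back into text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

It would be called as `summarize_without_repeats(text, model, tokenizer, device)` with the objects defined earlier in the notebook.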


In [50]:
print(summarizeTexts(
    """
Soccer consists of teams playing against each other with each team having 11 players.
The playtime of soccer is 90 minutes. It is divided into two halves with each consisting of 45 minutes.
Each time will try to score goals against the other. The one with the more goals after 90 minutes is the winner"
If both teams score equal goals, it is a draw.
"""
))


The playtime of soccer is 90 minutes. It is divided into two halves with each consisting of 45 minutes. Each time will try to score goals against the other. If both teams score equal goals, it is a draw.


In [46]:
print(summarizeTexts(
    """
Person A: Hey, did you hear about the new project management software our company is planning to implement?

Person B: Yeah, I heard a bit about it. What’s the deal with it?

Person A: It’s called "TaskFlow." The management thinks it’s going to streamline our workflow, especially with remote teams. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform.

Person B: That sounds interesting. But I’m a bit concerned about the learning curve. Is it user-friendly?

Person A: From what I’ve seen, it looks pretty intuitive. They’re also planning to run a couple of training sessions to get everyone up to speed. The first one is next Monday.

Person B: Okay, that helps. I guess I’ll have to attend that session. How does it compare to what we’re using now?

Person A: It’s supposed to be much more efficient. We’ll be able to track project progress more easily and get real-time updates. Plus, it has built-in analytics to help us with performance tracking.

Person B: That sounds promising. I just hope it doesn’t come with too many bugs at launch.

Person A: Yeah, that’s always a concern with new software. But they’ve been testing it for a while now, so fingers crossed it goes smoothly.

Person B: Let’s hope for the best. Thanks for the info!

Person A: No problem. See you at the training!
"""
))

Project management software is going to streamline our workflow, especially with remote teams. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It’s supposed to be much more efficient. We’ll be able to track project progress more easily and get real-time updates
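
The summary above repeats one sentence three times. As a simple post-processing step, repeated sentences can be dropped after generation. The sketch below splits on sentence-ending punctuation with a regular expression, which is a rough heuristic and an assumption rather than a proper sentence tokenizer. `drop_repeated_sentences` is a hypothetical helper.

```python
import re

def drop_repeated_sentences(summary):
    """Keep the first occurrence of each sentence, preserving order."""
    # Split after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())

    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)

print(drop_repeated_sentences("It integrates tools. It integrates tools. It is efficient."))
# It integrates tools. It is efficient.
```

This only hides the symptom. Constraining the decoding itself, or training on more data, would address the cause.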
