In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/news-summary/news_summary_more.csv
/kaggle/input/news-summary/news_summary.csv



### Introduction

#### In this tutorial we will be fine tuning a transformer model for Summarization Task. In this task a summary of a given article/document is generated when passed through a network. There are 2 types of summary generation mechanisms:

   ##### 1) Extractive Summary: the network calculates the most important sentences from the article and gets them together to provide the most meaningful information from the article.
   ##### 2) Abstractive Summary: The network creates new sentences to encapsulate maximum gist of the article and generates that as output. The sentences in the summary may or may not be contained in the article.

##### In this tutorial we will be generating Extractive Summary Summary.

#### The notebook will be divided into separate sections to provide a organized walk through for the process used.
####  The sections are:

####  1st : We will be installing the necessary libraries followed by importing the libraries and modules needed to run our script. We will be installing: transformers
####  Libraries imported are:

 ##### - Pandas
 ##### - Pytorch
 ##### - Pytorch Utils for Dataset and Dataloader
 ##### - Transformers

In [2]:
!pip install transformers -q
import transformers
print(transformers.__version__)

4.39.3


In [3]:
# Checking out the GPU we have access to. This is output is from the google colab version. 
!nvidia-smi

Sat Apr 27 19:26:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |  

In [4]:
# Setting up GPU Usage:
import torch
from torch import cuda

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

#### 2nd :- Importing Transformer Model and Tokenizer: T5 Model and T5 Tokenizer

In [5]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Language Model Used:

  #####  This notebook uses one of the most recent and novel transformers model T5. Research Paper
   #####  T5 in many ways is one of its kind transformers architecture that not only gives state of the art results in many NLP tasks, but also has a very radical approach to NLP tasks.
   ##### Text-2-Text - According to the graphic taken from the T5 paper. All NLP tasks are converted to a text-to-text problem. Tasks such as translation, classification, summarization and question answering, all of them are treated as a text-to-text conversion problem, rather than seen as separate unique problem statements.
  #####  Unified approach for NLP Deep Learning - Since the task is reflected purely in the text input and output, you can use the same model, objective, training procedure, and decoding process to ANY task. Above framework can be used for any task - show Q&A, summarization, etc.
  #####  We will be taking inputs from the T5 paper to prepare our dataset prior to fine tuning and training. 

#### 3rd : - Loading Data:

In [6]:
import pandas as pd

# Load the dataset with error handling
try:
    data1 = pd.read_csv('/kaggle/input/news-summary/news_summary.csv', encoding='cp437')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

# Drop rows with any missing values
data1 = data1.dropna()

# Display a larger sample of the dataset
print("Shape of the dataset:", data1.shape)
print("First 10 rows of the dataset:")
print(data1.head(10))

Dataset loaded successfully.
Shape of the dataset: (4396, 6)
First 10 rows of the dataset:
               author                  date  \
0        Chhavi Tyagi  03 Aug 2017,Thursday   
1         Daisy Mowke  03 Aug 2017,Thursday   
2      Arshiya Chopra  03 Aug 2017,Thursday   
3       Sumedha Sehra  03 Aug 2017,Thursday   
4  Aarushi Maheshwari  03 Aug 2017,Thursday   
5         Sonu Kumari  03 Aug 2017,Thursday   
6        Parmeet Kaur  03 Aug 2017,Thursday   
7        Chhavi Tyagi  03 Aug 2017,Thursday   
8        Parmeet Kaur  03 Aug 2017,Thursday   
9       Sumedha Sehra  03 Aug 2017,Thursday   

                                           headlines  \
0  Daman & Diu revokes mandatory Rakshabandhan in...   
1  Malaika slams user who trolled her for 'divorc...   
2  'Virgin' now corrected to 'Unmarried' in IGIMS...   
3  Aaj aapne pakad liya: LeT man Dujana before be...   
4  Hotel staff to get training to spot signs of s...   
5  Man found dead at Delhi police station, kin al...   

In [7]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4396 entries, 0 to 4513
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     4396 non-null   object
 1   date       4396 non-null   object
 2   headlines  4396 non-null   object
 3   read_more  4396 non-null   object
 4   text       4396 non-null   object
 5   ctext      4396 non-null   object
dtypes: object(6)
memory usage: 240.4+ KB


### Data:We are using the News Summary dataset available at Kaggle
#### This dataset is the collection created from Newspapers published in India, extracting, details that are listed below. We are referring only to the first csv file from the data dump: news_summary.csv
#### There are4514 rows of data. Where each row has the following data-point:
  ##### 1) author : Author of the article
  ##### 2) date : Date the article was published
  ##### 3) headline: Headline for the published article
  ##### 4) read_more : URL for the article to follow online
  ##### 5) text: This is the summary of the article
  ##### 6) ctext: This is the complete article



In [8]:
#     This line selects only the 'ctext' and 'text' columns from the DataFrame data1.
#     It rearranges the columns so that 'ctext' becomes the first column and 'text' becomes the second column.
sum_data=data1[['ctext','text']]
sum_data.columns = ['text', 'summary']
sum_data.head()

Unnamed: 0,text,summary
0,The Daman and Diu administration on Wednesday ...,The Administration of Union Territory Daman an...
1,"From her special numbers to TV?appearances, Bo...",Malaika Arora slammed an Instagram user who tr...
2,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Hotels in Mumbai and other Indian cities are t...,Hotels in Maharashtra will train their staff t...


#### 4th :-This function clean_text seems to perform some basic text preprocessing tasks. 

In [9]:
!pip install unidecode -q

In [10]:
import logging
import time
import string

import numpy as np
import matplotlib.pyplot as plt

from unidecode import unidecode

import tensorflow as tf

2024-04-27 19:27:13.382188: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-27 19:27:13.382267: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-27 19:27:13.383834: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [11]:
from unidecode import unidecode

def clean_text(texts):
    texts=str(texts)
    texts = texts.lower() #convert uppercase to lowercase
    texts = unidecode(texts, errors='ignore') #convert accented letters into unaccented letters. Ignore unknown characters.
    texts = ''.join((char if char in (string.punctuation + string.ascii_lowercase) else ' ' for char in texts)) #keep the selected letters and punctuation.
    
    return texts

In [12]:
sum_data.loc[:, 'text'] = sum_data['text'].apply(lambda x: clean_text(x))

In [13]:
sum_data.loc[:,'summary'] = sum_data['summary'].apply(lambda x: clean_text(x))

In [14]:
sum_data.head()

Unnamed: 0,text,summary
0,the daman and diu administration on wednesday ...,the administration of union territory daman an...
1,"from her special numbers to tv?appearances, bo...",malaika arora slammed an instagram user who tr...
2,the indira gandhi institute of medical science...,the indira gandhi institute of medical science...
3,lashkar-e-taiba's kashmir commander abu dujana...,lashkar-e-taiba's kashmir commander abu dujana...
4,hotels in mumbai and other indian cities are t...,hotels in maharashtra will train their staff t...


### 5th :-

In [15]:
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

In [16]:
# Split the dataset into training and testing sets
train_data, test_data = train_test_split(sum_data, test_size=0.2, random_state=42)

In [17]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3516 entries, 4348 to 869
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     3516 non-null   object
 1   summary  3516 non-null   object
dtypes: object(2)
memory usage: 82.4+ KB


In [18]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 880 entries, 4047 to 3498
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     880 non-null    object
 1   summary  880 non-null    object
dtypes: object(2)
memory usage: 20.6+ KB


#### Custom Dataset Class:

In [19]:
class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=512, max_target_length=150):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_text = self.data.iloc[idx]['summary']
        target_summary = self.data.iloc[idx]['text']

        # Tokenize input text and target summary
        input_ids = self.tokenizer.encode(input_text, max_length=self.max_input_length, truncation=True, padding='max_length')
        target_ids = self.tokenizer.encode(target_summary, max_length=self.max_target_length, truncation=True, padding='max_length')

        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'target_ids': torch.tensor(target_ids, dtype=torch.long)
        }

#### Preparing Datasets and Dataloaders:

#### Create instances of the SummarizationDataset class for both training and testing data.
##### DataLoader is used to create iterators over the datasets with specified batch sizes. The shuffle=True argument shuffles the training data before each epoch.

In [20]:
train_dataset = SummarizationDataset(train_data, tokenizer)
test_dataset = SummarizationDataset(test_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4)

#### Training Loop:

#### Define the number of training epochs and an AdamW optimizer with a learning rate of 1e-4.
   #### We iterate over each epoch and each batch in the training dataloader.
   ##### Inside the loop, we perform forward pass, compute loss, backward pass, and optimizer step to update model parameters.
   ##### We print the loss at the end of each epoch.

In [21]:
epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(model.device)
        target_ids = batch['target_ids'].to(model.device)
        loss = model(input_ids=input_ids, labels=target_ids).loss
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item()}')

Epoch 1/3, Loss: 2.893789768218994
Epoch 2/3, Loss: 2.501337766647339
Epoch 3/3, Loss: 1.9947530031204224


#### Evaluation Loop:

#### We switch the model to evaluation mode using model.eval().
   #####  We iterate over each batch in the test dataloader and compute the loss without gradient computation.
   ##### We print the test loss at the end.

In [22]:
model.eval()
with torch.no_grad():
    for batch in test_dataloader:
        input_ids = batch['input_ids'].to(model.device)
        target_ids = batch['target_ids'].to(model.device)
        loss = model(input_ids=input_ids, labels=target_ids).loss
    print(f'Test Loss: {loss.item()}')

Test Loss: 2.2409591674804688


#### Saving Trained Model:

In [23]:
model.save_pretrained("trained_model")

#### Inference Example:

#### We define a function generate_summary to generate summaries for input texts.
  ##### Inside the function, we tokenize the input text, 
  ##### Generate the summary using the model's generate method, and decode the summary tokens into human-readable text.
  #####  We provide an example input text and print the generated summary.

In [29]:
input_text =sum_data['summary'].iloc[2]
input_text

"the indira gandhi institute of medical sciences (igims) in patna on thursday made corrections in its marital declaration form by changing 'virgin' option to 'unmarried'. earlier, bihar health minister defined virgin as being an unmarried woman and did not consider the term objectionable. the institute, however, faced strong backlash for asking new recruits to declare their virginity in the form."

In [31]:
def generate_summary(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(input_ids, max_length=150, num_beams=2, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

summary = generate_summary(input_text)
print("Generated Summary:", summary)


Generated Summary: the indira gandhi institute of medical sciences (igims) in patna on thursday made corrections in its marital declaration form by changing 'virgin' option to 'unmarried'. earlier, bihar health minister defined virgin as being an unmarried woman and did not consider the term objectionable. however, the institute faced strong backlash for asking new recruits to declare their virginity in the form.
