<a href="https://colab.research.google.com/github/RoyElkabetz/Text-Summarization-with-Deep-Learning/blob/main/notebooks/T5_Summarizer_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## Run this cell only when working in Google Colab
# clone the git repository
!git clone https://github.com/RoyElkabetz/Text-Summarization-with-Deep-Learning
# add the path to the .py source files so they can be imported
import sys
sys.path.insert(1, "/content/Text-Summarization-with-Deep-Learning/src")

Cloning into 'Text-Summarization-with-Deep-Learning'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (312/312), done.
remote: Compressing objects: 100% (294/294), done.
remote: Total 312 (delta 163), reused 47 (delta 14), pack-reused 0
Receiving objects: 100% (312/312), 7.61 MiB | 5.13 MiB/s, done.
Resolving deltas: 100% (163/163), done.


In [1]:
## Run this cell to mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [3]:
!pip install --quiet transformers==4.5.0
!pip install --quiet pytorch-lightning==1.2.7



In [4]:
import time
import pandas as pd
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from sklearn.model_selection import train_test_split
from termcolor import colored
from torchtext.datasets import AG_NEWS, IMDB 
from tqdm.auto import tqdm


from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5TokenizerFast as T5Tokenizer
)


# my packages
import models
import utils

# plotting packages 
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc

%matplotlib inline
%config InlineBackend.figure_format='retina'
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
rcParams['figure.figsize'] = 16, 10

# set seed
pl.seed_everything(216)

Global seed set to 216


216

In [3]:
SAVE_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB/'
CHECKPOINTS_PATH = '/content/gdrive/MyDrive/Checkpoints'
MY_MODEL_NAME = 'Text_Summarizer_T5-v1'
MODEL_NAME = 't5-base'
PATH_TO_LAST_CHECKPOINT = ''.join([CHECKPOINTS_PATH, '/', MY_MODEL_NAME, '.ckpt'])
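
The checkpoint path above is assembled by string joining. As a minimal sketch, the same path can be built with `pathlib`, which takes care of the separators. `PurePosixPath` is used here only so the result is identical on every operating system, and the name `checkpoint_path` is illustrative rather than part of the notebook.

```python
from pathlib import PurePosixPath

# Same constants as in the configuration cell above.
CHECKPOINTS_PATH = '/content/gdrive/MyDrive/Checkpoints'
MY_MODEL_NAME = 'Text_Summarizer_T5-v1'

# The "/" operator joins path components; the f-string appends the extension.
checkpoint_path = PurePosixPath(CHECKPOINTS_PATH) / f'{MY_MODEL_NAME}.ckpt'

print(checkpoint_path)
```

Converting with `str(checkpoint_path)` gives a plain string for any call that expects one.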

In [6]:
# load the T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

# load the summarizer that was fine-tuned on the "news summary" dataset
base_model = models.NewsSummaryModel()
trained_model = base_model.load_from_checkpoint(PATH_TO_LAST_CHECKPOINT)
# freeze the weights and switch to evaluation mode (inference only)
trained_model.freeze()


In [7]:
def summarizer(text, summary_max_length=150):
  """Summarize `text` with the fine-tuned T5 model, generating at most `summary_max_length` tokens."""
  # tokenize the input text, truncating or padding it to 512 tokens
  text_encoding = tokenizer(
      text,
      max_length=512,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      add_special_tokens=True,
      return_tensors='pt'
  )

  # generate the summary token ids with beam search
  generated_ids = trained_model.model.generate(
      input_ids=text_encoding['input_ids'],
      attention_mask=text_encoding['attention_mask'],
      max_length=summary_max_length,
      num_beams=2,
      repetition_penalty=2.5,
      length_penalty=1.0,
      early_stopping=True
  )

  # decode the generated ids back into text
  preds = [
    tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for gen_id in generated_ids
  ]

  # a single input text yields a single prediction, so the join returns it unchanged
  return ''.join(preds)

## Load, process, split and resave the IMDB Train, Validation and Test datasets as CSV files

In [7]:
# download the Train dataset as an iterator
data_iter = IMDB(split='train')
labels = []
texts = []

# collect the (label, text) pairs into two parallel lists
for label, text in data_iter:
  labels.append(label)
  texts.append(text)

# build a DataFrame, drop incomplete rows and save it as a CSV file
df = pd.DataFrame.from_dict({'label': labels, 'text': texts})
df = df.dropna()
df.to_csv(path_or_buf=SAVE_DATASET_PATH + 'train.csv', columns=['label', 'text'])

In [15]:
# download the Test dataset as an iterator
data_iter = IMDB(split='test')
labels = []
texts = []

# collect the (label, text) pairs into two parallel lists
for label, text in data_iter:
  labels.append(label)
  texts.append(text)

## Create a pandas DataFrame from the test data


In [16]:
df = pd.DataFrame.from_dict({'label': labels, 'text': texts})
df = df.dropna()
df.head()

Unnamed: 0,label,text
0,neg,I love sci-fi and am willing to put up with a ...
1,neg,"Worth the entertainment value of a rental, esp..."
2,neg,its a totally average film with a few semi-alr...
3,neg,STAR RATING: ***** Saturday Night **** Friday ...
4,neg,"First off let me say, If you haven't enjoyed a..."


In [35]:
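# number of reviews to take from each class
i = 500
summaries_lengths = [150, 128, 64, 32, 16, 8, 4]
negative_df = df[df['label']=='neg']
positive_df = df[df['label']=='pos']
# balanced test set: the first i negative and the first i positive reviews
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
test_df = pd.concat([negative_df[:i], positive_df[:i]], ignore_index=True)

# add one placeholder column per summary length, to be filled by the summarizer
for length in summaries_lengths:
    column_name = 'summary-' + str(length)
    test_df[column_name] = ['empty'] * len(test_df)
# Note: the pipeline below reads this frame back from 'test.csv',
# so it has to be written there once before the first run.
print(f'Size of dataframe is: {len(test_df)}')
test_df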

Size of dataframe is: 1000


Unnamed: 0,label,text,summary-150,summary-128,summary-64,summary-32,summary-16,summary-8,summary-4
0,neg,I love sci-fi and am willing to put up with a ...,empty,empty,empty,empty,empty,empty,empty
1,neg,"Worth the entertainment value of a rental, esp...",empty,empty,empty,empty,empty,empty,empty
2,neg,its a totally average film with a few semi-alr...,empty,empty,empty,empty,empty,empty,empty
3,neg,STAR RATING: ***** Saturday Night **** Friday ...,empty,empty,empty,empty,empty,empty,empty
4,neg,"First off let me say, If you haven't enjoyed a...",empty,empty,empty,empty,empty,empty,empty
...,...,...,...,...,...,...,...,...,...
995,pos,A very delightful bit of filmwork that should ...,empty,empty,empty,empty,empty,empty,empty
996,pos,"Ordinarily, Anthony Mann made westerns with 't...",empty,empty,empty,empty,empty,empty,empty
997,pos,<br /><br />`The Last Frontier' is a superior ...,empty,empty,empty,empty,empty,empty,empty
998,pos,One of the more obscure of Anthony Mann's West...,empty,empty,empty,empty,empty,empty,empty


In [36]:
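columns = ['label', 'text']
# validation set: all the remaining negative and positive reviews
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
valid_df = pd.concat([negative_df[i:], positive_df[i:]], ignore_index=True)
valid_df.to_csv(path_or_buf=SAVE_DATASET_PATH + 'valid.csv', columns=columns)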
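
The split above takes the first `i` reviews of each class for the test set and leaves the rest for validation. The following is a small sketch of the same idea on toy data. The frame `toy_df` and the names `n_per_class`, `test_part` and `valid_part` are made up for illustration, and `pd.concat` is used for stacking.

```python
import pandas as pd

# Toy stand-in for the IMDB frame: three reviews per class.
toy_df = pd.DataFrame({
    'label': ['neg', 'neg', 'neg', 'pos', 'pos', 'pos'],
    'text': ['n0', 'n1', 'n2', 'p0', 'p1', 'p2'],
})

n_per_class = 2  # plays the role of i = 500 in the notebook
neg = toy_df[toy_df['label'] == 'neg']
pos = toy_df[toy_df['label'] == 'pos']

# First n_per_class rows of each class go to the test part, the rest to validation.
test_part = pd.concat([neg[:n_per_class], pos[:n_per_class]], ignore_index=True)
valid_part = pd.concat([neg[n_per_class:], pos[n_per_class:]], ignore_index=True)

print(test_part['label'].value_counts().to_dict())
print(len(valid_part))
```

Because the two parts are cut from disjoint slices, no review can appear in both of them.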

## Summarizer Pipeline - get summary and save as a pd.DataFrame

In [8]:
columns = ['label', 'text', 'summary-150', 'summary-128', 'summary-64', 'summary-32', 'summary-16', 'summary-8', 'summary-4']
test_df = pd.read_csv(SAVE_DATASET_PATH + 'test.csv', usecols=columns)
test_df.head()

Unnamed: 0,label,text,summary-150,summary-128,summary-64,summary-32,summary-16,summary-8,summary-4
0,neg,I love sci-fi and am willing to put up with a ...,"Actors of 'Babylon 5', the original Star Trek ...","Actors of 'Babylon 5', the original Star Trek ...","Actors of 'Babylon 5', the original Star Trek ...","Actors of 'Babylon 5', the original Star Trek ...","Actors of 'Babylon 5', which is",Actors of 'B,Actors
1,neg,"Worth the entertainment value of a rental, esp...",The film is rated 4/5 (Atlanta) and 4/5 (Terro...,The film is rated 4/5 (Atlanta) and 4/5 (Terro...,The film is rated 4/5 (Atlanta) and 4/5 (Terro...,The film is rated 4/5 (Atlanta) and 4/5 (Terro...,The film is rated 4/5 (Atlanta) and 4,The film is rated 4/5,The film is
2,neg,its a totally average film with a few semi-alr...,The end plot is that of a very basic type that...,The end plot is that of a very basic type that...,The end plot is that of a very basic type that...,The end plot is that of a very basic type that...,The end plot is that of a very basic type that...,The end plot is that of,The end plot
3,neg,STAR RATING: ***** Saturday Night **** Friday ...,Former New Orleans homicide cop Jack Robideaux...,Former New Orleans homicide cop Jack Robideaux...,Former New Orleans homicide cop Jack Robideaux...,Former New Orleans homicide cop Jack Robideaux...,Former New Orleans homicide cop Jack Robideaux...,Former New Orleans homicide cop,Former New Orleans
4,neg,"First off let me say, If you haven't enjoyed a...",A Van Damme movie is worth watching. It has th...,A Van Damme movie is worth watching. It has th...,A Van Damme movie is worth watching. It has th...,A Van Damme movie is worth watching. It has th...,A Van Damme movie is worth watching. It has th...,The Van Damme movie is,A Van Dam
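
Since `to_csv` writes the row index by default, the saved CSV files carry an extra unnamed first column, which `read_csv` labels `Unnamed: 0` unless it is filtered out, for instance with `usecols` as in the cell above. A minimal sketch of that round trip on an in-memory buffer (the frame `df_demo` is made up for illustration):

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'label': ['neg', 'pos'], 'text': ['bad', 'good']})

# Write the frame the same way the notebook does: the index is included.
buffer = io.StringIO()
df_demo.to_csv(buffer)

# Reading everything back exposes the index as an extra column.
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))

# Restricting the columns with usecols drops it again.
buffer.seek(0)
filtered = pd.read_csv(buffer, usecols=['label', 'text'])
print(list(filtered.columns))
```

An alternative would be to save with `index=False`, which avoids writing the index in the first place.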


In [9]:
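summaries_lengths = [150, 128, 64, 32, 16, 8, 4]
columns = list(test_df.columns.values)
with torch.no_grad():
    for i in tqdm(range(len(test_df))):
        row = test_df.iloc[i]
        for j, max_length in enumerate(summaries_lengths):
            # the summary columns start at index 2, after 'label' and 'text';
            # only cells still holding the 'empty' placeholder are summarized, so the run can be resumed
            if row[columns[j + 2]] == 'empty':
                text = row['text']
                summary = summarizer(text, summary_max_length=max_length)
                test_df.iloc[i, j + 2] = summary
        # checkpoint the progress to disk every second row
        if i % 2 == 0:
            test_df.to_csv(path_or_buf=SAVE_DATASET_PATH + 'test.csv', columns=columns)
# final save, so the last (odd-indexed) row is also written to disk
test_df.to_csv(path_or_buf=SAVE_DATASET_PATH + 'test.csv', columns=columns)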


To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
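
The pipeline cell fills only the cells that still hold the `'empty'` placeholder and saves its progress periodically, which makes the long run resumable. Below is a rough, self-contained sketch of that pattern on toy data. `fake_summarizer` is a hypothetical stand-in that keeps the first words of the text instead of calling the T5 model, and `fill_summaries`, `demo_df` and `save_every` are illustrative names, not part of the notebook. Unlike the notebook, the sketch saves with `index=False`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def fake_summarizer(text, summary_max_length=150):
    # Hypothetical stand-in for the T5 summarizer: keep the first N words.
    return ' '.join(text.split()[:summary_max_length])


def fill_summaries(df, lengths, csv_path, summarize, save_every=2):
    """Fill every 'empty' summary cell and checkpoint the frame to CSV."""
    for i in range(len(df)):
        for max_length in lengths:
            col = f'summary-{max_length}'
            # Skip cells that were already filled by a previous run.
            if df.at[i, col] == 'empty':
                df.at[i, col] = summarize(df.at[i, 'text'],
                                          summary_max_length=max_length)
        if i % save_every == 0:
            df.to_csv(csv_path, index=False)
    # Final save so the last rows are written even if they missed a checkpoint.
    df.to_csv(csv_path, index=False)
    return df


demo_df = pd.DataFrame({
    'label': ['neg', 'pos'],
    'text': ['a dull and slow film', 'a bright and funny film'],
    'summary-4': ['a dull and slow', 'empty'],  # first row already done
    'summary-2': ['empty', 'empty'],
})

csv_path = Path(tempfile.mkdtemp()) / 'test.csv'
demo_df = fill_summaries(demo_df, [4, 2], csv_path, fake_summarizer)
reloaded = pd.read_csv(csv_path)
print(reloaded)
```

Rerunning `fill_summaries` on the reloaded frame would leave it unchanged, since no placeholder cells remain.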



