<a href="https://colab.research.google.com/github/RoyElkabetz/Text-Summarization-with-Deep-Learning/blob/main/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Run this cell only when working in Google Colab
# clone the git repository
!git clone https://github.com/RoyElkabetz/Text-Summarization-with-Deep-Learning
# add path to .py files for import
import sys
sys.path.insert(1, "/content/Text-Summarization-with-Deep-Learning")

Cloning into 'Text-Summarization-with-Deep-Learning'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 53 (delta 28), reused 6 (delta 0), pack-reused 0
Unpacking objects: 100% (53/53), done.


In [2]:
## Run this cell only if you want to mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
%matplotlib inline
import time
import pandas as pd

import torch
from torchtext.datasets import IMDB
import torchtext.data as data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataset import random_split
from torch import nn




print(f'torch {torch.__version__}')
print('Device properties:')
if torch.cuda.is_available():
    device = torch.device("cuda")
    gpu_data = torch.cuda.get_device_properties(0)
    gpu_name = gpu_data.name
    gpu_mem  = f'{gpu_data.total_memory * 1e-9:.02f} Gb'
    print(f'GPU: {gpu_name}\nMemory: {gpu_mem}')
else:
    device = torch.device("cpu")
    print('CPU')

torch 1.9.0+cu102
Device properties:
GPU: Tesla T4
Memory: 15.84 Gb


In [4]:
class DataFrameDataset(Dataset):
  """Create a torch.utils.data.Dataset from a pandas.DataFrame or a CSV file."""

  def __init__(self, csv_file_path=None, pd_dataframe=None, only_columns=None):
    """
      Args:
      csv_file_path (string): Path to the CSV file with annotations.
      pd_dataframe (pandas.DataFrame): A DataFrame containing the data;
      takes precedence over csv_file_path when given.
      only_columns (list): A list of column names to return from each row.
    """
    if isinstance(pd_dataframe, pd.DataFrame):
      self.df = pd_dataframe 
    else:
      self.df = pd.read_csv(csv_file_path)

    if only_columns is not None:
      if isinstance(only_columns, list):
        for item in only_columns:
          if item not in self.df.columns:
            raise ValueError(f"Got a column name '{item}' in only_columns which is not in DataFrame columns.")
        self.only_columns = only_columns
      else:
        raise TypeError(f"only_columns must be a <class 'list'>, instead got a {type(only_columns)}.")
    else:
      self.only_columns = list(self.df.columns)

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    row = self.df.iloc[idx][self.only_columns]
    row_list = [item for item in row]
    return row_list

## Get the IMDB dataset and build a vocabulary from the training data
The IMDB `test` split serves as the training set here; the validation and test data are loaded from CSV files further down.

In [5]:
tokenizer = get_tokenizer('basic_english')
train_iter = IMDB(split='test')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>", "<sos>", "<eos>"])
vocab.set_default_index(vocab["<unk>"])

aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 60.8MB/s]
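
To make the vocabulary step concrete, here is a minimal pure-Python sketch of the idea: special tokens get the first ids, the remaining tokens are ordered by frequency, and unseen tokens fall back to the `<unk>` id. The helpers `build_vocab` and `encode` are hypothetical stand-ins written for this illustration, not torchtext functions, and the alphabetical tie-breaking is an assumption of the sketch.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken alphabetically here; that ordering is an assumption of this sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk="<unk>"):
    """Look up each token, falling back to the <unk> id for unseen tokens."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Tiny made-up corpus, already tokenized.
corpus = [["a", "good", "movie"], ["a", "bad", "movie"]]
stoi = build_vocab(corpus, specials=("<unk>", "<sos>", "<eos>"))

# "great" never appeared in the corpus, so it maps to the <unk> id.
ids = encode(["a", "great", "movie"], stoi)
print(ids)
```

The fallback in `encode` plays the role that `vocab.set_default_index(vocab["<unk>"])` plays in the cell above.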


## Create text and labels pipelines

In [6]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: 0 if x=='neg' else 1

## Print some random samples and the size of the dataset

In [7]:
train_iter = IMDB(split='test')
n_samples = len(train_iter)
# the upper bound of randint is exclusive, so n_samples lets the last sample be drawn too
random_list = torch.randint(0, n_samples, (4, )).tolist()
labels = []
for i, (label, text) in enumerate(train_iter):
    labels.append(label)
    if i in random_list:
        print(f'Label: {label_pipeline(label)}')
        print(f'Text: {text}')
        print(f'Split: {tokenizer(text)}')
        print(f'Tokens: {text_pipeline(text)}\n')
print('Number of classes: {}'.format(len(set(labels))))
print('Number of samples: {}'.format(n_samples))

Label: 0
Text: I must be honest, I like romantic comedies, but this was not what I had hoped for. I thought Ellen Degeneres was having the biggest part, which should have been, because I didn't like the two struggling bed partners. It was awful. Poor Tom Selleck!! He had to act with someone who was that much in the picture while it should have been him and Ellen to be in most of the film. They were the only believable ones. And the only really funny parts starred them, not Kate Capshaw and that Everett guy.. Cool that mummy is coming out of the closet, I thought that was a nice surprise. <br /><br />I'm just glad I saw it on the cable and I didn't pay any money renting it..
Split: ['i', 'must', 'be', 'honest', ',', 'i', 'like', 'romantic', 'comedies', ',', 'but', 'this', 'was', 'not', 'what', 'i', 'had', 'hoped', 'for', '.', 'i', 'thought', 'ellen', 'degeneres', 'was', 'having', 'the', 'biggest', 'part', ',', 'which', 'should', 'have', 'been', ',', 'because', 'i', 'didn', "'", 't', 'li

In [8]:
VALID_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB_validation_dataset.csv'
TEST_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB_test_with_summary_dataset.csv'

valid_dataset = DataFrameDataset(csv_file_path=VALID_DATASET_PATH, only_columns=['label', 'text'])
test_dataset = DataFrameDataset(csv_file_path=TEST_DATASET_PATH, only_columns=['label', 'summary'])
full_test_dataset = DataFrameDataset(csv_file_path=TEST_DATASET_PATH, only_columns=['label', 'text', 'summary'])
print(f'Validation dataset size is: {len(valid_dataset)}')
print(f'Test dataset size is: {len(test_dataset)}')

Validation dataset size is: 16434
Test dataset size is: 8467


In [9]:
test_loader = DataLoader(full_test_dataset, batch_size=2, shuffle=False)
for batch in test_loader:
  print('Print samples from a single batch:\n')
  labels, texts, summaries = batch
  for i in range(len(labels)):
    print(f'Sample: {i}')
    print(f'Label: {labels[i]}')
    print(f'Text: {texts[i]}')
    print(f'Summary: {summaries[i]}')
    print('\n')
  break

Print samples from a single batch:

Sample: 0
Label: neg
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to maki
Summary: I AM CURIOUS-YELLOW is a film about a young Swedish drama student named Lena who wants to learn everything she can about life. The plot is centered around a maki girl named Maki who wants to focus her attentions on the maki boy's maki, which is a game of maki. The film was released in 1967 and has been rated 4/5 (Sweden).


Sample: 1
Label: neg
Text: "I Am Curious: Yellow" is a risible and pretentious steaming pil
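
The unpacking `labels, texts, summaries = batch` works because, when no `collate_fn` is given, the `DataLoader` regroups the per-sample lists column by column. A rough pure-Python sketch of that regrouping for string fields follows; `transpose_batch` is a hypothetical helper and the rows are made-up placeholders, not samples from the dataset.

```python
def transpose_batch(rows):
    """Turn a list of per-sample rows into one tuple per column."""
    return [tuple(column) for column in zip(*rows)]

# Two made-up samples in the [label, text, summary] layout used above.
rows = [
    ["neg", "first review text", "first summary"],
    ["pos", "second review text", "second summary"],
]

labels, texts, summaries = transpose_batch(rows)
print(labels)
print(summaries)
```

Each returned column has one entry per sample, which is why indexing `labels[i]`, `texts[i]` and `summaries[i]` lines up within a batch.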

In [10]:
def collate_batch(batch):
    """Flatten a batch of (label, text) pairs into the format nn.EmbeddingBag expects."""
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # record each sequence length; the cumulative sum below turns lengths into start offsets
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)
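
The offsets bookkeeping in `collate_batch` is easier to follow on a toy batch. The sketch below reproduces the same arithmetic in pure Python: all token ids are concatenated into one flat list, and `offsets` records where each sample starts. The helper name `flatten_with_offsets` and the token ids are made up for illustration.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate sequences and return the start index of each one in the flat list."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Drop the last length: a start offset is the sum of all *previous* lengths.
    offsets = [0] + list(accumulate(lengths[:-1]))
    return flat, offsets

# Three made-up token-id sequences of lengths 3, 2 and 4.
batch_ids = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
flat, offsets = flatten_with_offsets(batch_ids)
print(flat)     # one flat list with 9 ids
print(offsets)  # start positions of the three samples
```

Here the second sample starts at position 3 and the third at position 5, which matches what `torch.tensor(offsets[:-1]).cumsum(dim=0)` computes in the cell above.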

In [11]:
class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
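
`nn.EmbeddingBag` in its default `"mean"` mode averages the embedding vectors of all tokens that belong to the same sample, using the offsets to find the sample boundaries. The following pure-Python sketch mimics that pooling on a tiny hand-written weight table; `embedding_bag_mean` is a hypothetical helper, and it assumes every bag contains at least one token.

```python
def embedding_bag_mean(weight, flat_ids, offsets):
    """Average the embedding rows of each bag; bags are delimited by offsets."""
    dim = len(weight[0])
    bounds = list(offsets) + [len(flat_ids)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [weight[i] for i in flat_ids[start:end]]
        # Assumes a non-empty bag; an empty one would divide by zero here.
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Made-up embedding table: 4 tokens, 2 dimensions each.
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 3]   # two samples, concatenated
offsets = [0, 2]          # sample 0 = ids[0:2], sample 1 = ids[2:4]

pooled = embedding_bag_mean(weight, flat_ids, offsets)
print(pooled)
```

The result has one fixed-size vector per sample regardless of review length, which is what lets the single `nn.Linear` layer classify variable-length text.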

In [12]:
train_iter = IMDB(split='test')
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

In [13]:
def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 200
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
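
The call `clip_grad_norm_(model.parameters(), 0.1)` in `train` rescales all gradients together so that their combined L2 norm does not exceed 0.1. A minimal pure-Python sketch of that rule is shown below; `clip_by_global_norm` is a hypothetical helper, the gradient values are made up, and the small `eps` in the denominator is an assumption of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients already inside the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Made-up gradients for two parameters; their global norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length shrinks, which helps keep the large learning rate used below from producing oversized steps.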

In [14]:
# Hyperparameters
EPOCHS = 10 # number of training epochs
LR = 5  # initial learning rate, reduced by the scheduler below
BATCH_SIZE = 32 # batch size for the training, validation and test loaders
  
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
total_accu = None
train_iter = IMDB(split='test')
train_dataset = to_map_style_dataset(train_iter)
# test_dataset = to_map_style_dataset(test_iter)
# num_train = int(len(train_dataset) * 0.95)
# split_train_, split_valid_ = \
#     random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE,
                              shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=False, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
      scheduler.step()
    else:
      total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   200/  782 batches | accuracy    0.585
| epoch   1 |   400/  782 batches | accuracy    0.694
| epoch   1 |   600/  782 batches | accuracy    0.755
-----------------------------------------------------------
| end of epoch   1 | time: 14.05s | valid accuracy    0.643 
-----------------------------------------------------------
| epoch   2 |   200/  782 batches | accuracy    0.800
| epoch   2 |   400/  782 batches | accuracy    0.809
| epoch   2 |   600/  782 batches | accuracy    0.814
-----------------------------------------------------------
| end of epoch   2 | time: 13.73s | valid accuracy    0.762 
-----------------------------------------------------------
| epoch   3 |   200/  782 batches | accuracy    0.836
| epoch   3 |   400/  782 batches | accuracy    0.847
| epoch   3 |   600/  782 batches | accuracy    0.842
-----------------------------------------------------------
| end of epoch   3 | time: 13.98s | valid accuracy    0.780 
-------------------------------
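
The training loop only calls `scheduler.step()` when the validation accuracy falls below the best value seen so far, so the learning rate is multiplied by `gamma=0.1` on those epochs and left alone otherwise. The sketch below replays that rule in pure Python; `simulate_schedule` is a hypothetical helper and the accuracy values are invented for illustration, not taken from the log above.

```python
def simulate_schedule(val_accuracies, lr=5.0, gamma=0.1):
    """Replay the loop's rule: decay lr whenever validation accuracy drops below the best."""
    best = None
    history = []
    for acc in val_accuracies:
        if best is not None and best > acc:
            lr *= gamma          # corresponds to scheduler.step() with step_size=1
        else:
            best = acc           # new best (or first epoch): keep the current lr
        history.append(lr)
    return history

# Invented validation accuracies: two dips, at epochs 4 and 6.
history = simulate_schedule([0.60, 0.70, 0.75, 0.74, 0.76, 0.75])
for epoch, lr in enumerate(history, start=1):
    print(f"epoch {epoch}: lr = {lr:g}")
```

With these made-up numbers the learning rate stays at 5 for three epochs, drops to 0.5 after the first dip, and drops again to 0.05 after the second.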

In [15]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.742


In [23]:
imdb_label = {0: 'negative',
              1: 'positive'}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item()

ex_positive_str = ("Full of suspense, gripping the entire time, intense, "
                   "two stories in parallel that come together, somewhat predictable betrayals "
                   "and twist, felt a bit dark at times, satisfying ending, not a huge amount "
                   "of action but a solid storyline that keeps you on edge.")

ex_negative_str = ("This is not your traditional Guy Ritchie movie with slick "
                   "fast paced action, clever humour and lots of twists. Which I have loved in "
                   "the past. It is basically a combination of heist movie and revenge "
                   "thriller. But it's played very straight, without a lot of effort to "
                   "build characters, and doesn't ever seem to build much momentum. So "
                   "a few times during the movie I found myself looking at my watch, "
                   "wondering if it was really going anywhere. The action is fairly "
                   "tight but mainly gunplay, not much physical action as Statham is "
                   "famous for. There are no heroes either, Stathams character seems "
                   "to be a pretty nasty piece of work himself. All in all, it's an "
                   "average thriller with nothing in particular to recommend it.")

model = model.to("cpu")

print("This is a %s review" %imdb_label[predict(ex_negative_str, text_pipeline)])

This is a negative review
