<a href="https://colab.research.google.com/github/RoyElkabetz/Text-Summarization-with-Deep-Learning/blob/main/Convert_Pandas_DataFrame_to_torchtext_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Mount Google Drive (Colab only); skip this cell when running locally
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [166]:
LOAD_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB_train_with_summary_dataset.csv'
SAVE_VALID_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB_validation_dataset.csv'
SAVE_TEST_DATASET_PATH = '/content/gdrive/MyDrive/Datasets/Text/IMDB_test_with_summary_dataset.csv'

In [155]:
import pandas as pd

import torch
from torchtext.data.utils import get_tokenizer
from torchtext.datasets import IMDB
from torch.utils.data import Dataset
from torchtext.vocab import build_vocab_from_iterator

In [160]:
class DataFrameDataset(Dataset):
  """Create a torch.utils.data.Dataset from a pandas.DataFrame or a CSV file."""

  def __init__(self, csv_file_path=None, pd_dataframe=None, only_columns=None):
    """
    Args:
      csv_file_path (string): Path to the CSV file holding the data.
      pd_dataframe (pandas.DataFrame): A DataFrame containing the data. If
        given, it takes precedence over csv_file_path.
      only_columns (list): Names of the columns to return for each row.
        Defaults to all columns of the DataFrame.
    """
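    if isinstance(pd_dataframe, pd.DataFrame):
      self.df = pd_dataframe
    elif csv_file_path is not None:
      self.df = pd.read_csv(csv_file_path)
    else:
      # Fail early with a clear message instead of an error from pd.read_csv(None)
      raise ValueError("Either csv_file_path or pd_dataframe must be provided.")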

    if only_columns is not None:
      if isinstance(only_columns, list):
        for item in only_columns:
          if item not in self.df.columns:
            raise ValueError(f"Got a column name '{item}' in only_columns which is not in DataFrame columns.")
        self.only_columns = only_columns
      else:
        raise TypeError(f"only_columns must be a <class 'list'>, instead got a {type(only_columns)}.")
    else:
      self.only_columns = list(self.df.columns)

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    row = self.df.iloc[idx][self.only_columns]
    row_list = [item for item in row]
    return row_list

## Compare the DataFrameDataset class to the original IMDB dataset iterator
Iterate through each dataset and print the first batch.

In [108]:
my_dataset = DataFrameDataset(LOAD_DATASET_PATH)
train_loader = torch.utils.data.DataLoader(my_dataset, batch_size=3, shuffle=True)
for data in train_loader:
  print(data)
  break

[tensor([ 9385, 22798, 18157]), ('neg', 'pos', 'pos'), ('I, too, was fooled by the packaging. I, too, fell for the gory packaging and the DVD casing that claims "grieved fans as every copy was pulled from shelves". Though it was inexpensive ($6.99), it wasn\'t really all that worth it - no scares, and very limited gore. The ending was very cheesy and didn\'t deliver the punch it should have. I really don\'t even know how it became a "Video Nasty" with how very tame it is. The story drags, the characters are obvious amateur actors...it doesn\'t live up to the promise.', "Now either you like Mr Carrey's humour or you don't. Me, Myself and Irene had audiences both walking out in droves and, on the other hand, cheering and collapsing in puddles of mirth. Bruce Almighty is a bit more mainstream, but you have been warned.<br /><br />If you're not sure, watch the trailer. I saw the trailer three times and still laughed at the same gags when I saw the film. If you don't find the sight of a dog

In [102]:
imdb_dataset = IMDB(split='train')
imdb_train_loader = torch.utils.data.DataLoader(imdb_dataset, batch_size=5)
for data in imdb_train_loader:
  print(data)
  break

[('neg', 'neg', 'neg', 'neg', 'neg'), ('I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nu
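
Both printed batches are lists with one entry per column rather than one entry per sample. A minimal pure-Python sketch of that transposition, assuming every field is a string (the real DataLoader collation also stacks numeric fields into tensors, as the first element of the DataFrameDataset batch shows). The helper name `collate_columns` and the toy rows are hypothetical:

```python
# Toy rows shaped like the output of DataFrameDataset.__getitem__: [label, text].
rows = [
    ["neg", "no scares and very limited gore"],
    ["pos", "watch the trailer first"],
    ["pos", "a fun mainstream comedy"],
]


def collate_columns(batch):
    """Transpose a list of rows into one tuple per column."""
    return [tuple(column) for column in zip(*batch)]


labels, texts = collate_columns(rows)
print(labels)    # all labels of the batch, in sample order
print(texts[0])  # text of the first sample
```

This is why `batch[1]` can be used to read every text in a batch when the columns are ordered as label, text, summary.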

### Creating a vocabulary from text using DataFrameDataset

In [176]:
data_columns = ['label', 'text', 'summary']
new_dataset = DataFrameDataset(LOAD_DATASET_PATH, only_columns=data_columns)
train_loader = torch.utils.data.DataLoader(new_dataset, batch_size=1, shuffle=True)
for data_sample in train_loader:
  print('A single batch from the data:')
  for i, obj in enumerate(data_sample):
    print(f'{data_columns[i]}s:')
    print(obj)
    print('\n')
  break
tokenizer = get_tokenizer('basic_english')

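def yield_text_tokens(data_iter):
    """Yield the token list of every review text in the given DataLoader."""
    for batch in data_iter:
        texts = batch[1]  # index 1 is the 'text' column in data_columns
        for text in texts:
            yield tokenizer(text)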

vocab = build_vocab_from_iterator(yield_text_tokens(train_loader), specials=["<unk>", "<sos>", "<eos>"])
vocab.set_default_index(vocab["<unk>"])
print(f'Size of vocabulary is: {len(vocab)}')

A single batch from the data:
labels:
('neg',)


texts:
('I run a group to stop comedian exploitation and I just spent the past 2 months hearing horror stories from comedians who attempted to audition for, "Last Comic Standing." If they don\'t have a GOOD agent, then they don\'t even get a chance to audition so more than 80% of the comedians who turn up are rejected before they can show anyone that they have talent! If they do make it to an audition, I was told that it\'s "pre-determined" if they get a second chance. So what the TV audience sees is NOT the best comic',)


summarys:
('Actor-comedians who try to audition for \'Last Comic Standing\' have been rejected by over 80% of the comedians who turn up. "If they don\'t have a GOOD agent, then they don\'t even get a chance to audition so more than 80% of the comedians who turn up are rejected before they can show anyone that they have talent!"',)


Size of vocabulary is: 62093
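
To make the vocabulary step more concrete, here is a rough standard-library sketch of the same idea: count tokens, put the special tokens first, and map unknown tokens to the `<unk>` index. It is not torchtext's implementation. `simple_tokenizer`, `build_vocab` and `lookup` are hypothetical names, and the lowercase-and-split tokenizer is only a stand-in for `basic_english`:

```python
from collections import Counter

SPECIALS = ["<unk>", "<sos>", "<eos>"]


def simple_tokenizer(text):
    # Hypothetical stand-in for get_tokenizer('basic_english'):
    # lowercase the text and split on whitespace.
    return text.lower().split()


def build_vocab(token_lists, specials=SPECIALS):
    """Map each token to an index, special tokens first, then by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = list(specials) + [
        tok for tok, _ in counts.most_common() if tok not in specials
    ]
    return {tok: idx for idx, tok in enumerate(ordered)}


toy_texts = ["the film drags", "the film delivers"]
stoi = build_vocab(simple_tokenizer(t) for t in toy_texts)
unk_index = stoi["<unk>"]


def lookup(token):
    # Mirrors the effect of vocab.set_default_index(vocab["<unk>"]).
    return stoi.get(token, unk_index)


print(len(stoi))
print(lookup("film"), lookup("unseen"))
```

The fallback to `unk_index` is the part that matters at inference time: any token that never appeared in the training texts still maps to a valid index.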


## Cleaning and splitting the summarized IMDB data into Validation and Test CSV files
#### Load the data from CSV as a pandas.DataFrame and clean it up.

In [162]:
imdb_data = pd.read_csv(LOAD_DATASET_PATH, encoding='utf-8')
imdb_data.drop_duplicates(subset=['text'], inplace=True)  # drop duplicate reviews
imdb_data.dropna(axis=0, inplace=True)  # drop rows with missing values
columns = ['label', 'text', 'summary']
imdb_data = imdb_data[columns]
imdb_data.head()

|   | label | text | summary |
|---|---|---|---|
| 0 | neg | I rented I AM CURIOUS-YELLOW from my video sto... | I AM CURIOUS-YELLOW is a film about a young Sw... |
| 1 | neg | "I Am Curious: Yellow" is a risible and preten... | "I Am Curious: Yellow" is a risible and preten... |
| 2 | neg | If only to avoid making this type of film in t... | The film is interesting as an experiment but t... |
| 3 | neg | This film was probably inspired by Godard's Ma... | Actress Lena Nyman has to be the most annoying... |
| 4 | neg | Oh, brother...after hearing about this ridicul... | After hearing about this ridiculous film for u... |
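
The effect of the two cleaning calls can be checked on a small hand-made DataFrame. This is a sketch on made-up rows, not on the IMDB data; the name `toy` and its contents are hypothetical:

```python
import pandas as pd

# Row 1 repeats the text of row 0, and row 2 has no summary.
toy = pd.DataFrame({
    "label": ["neg", "neg", "pos", "pos"],
    "text": ["same review", "same review", "other review", "third review"],
    "summary": ["s1", "s1", None, "s3"],
})

# Same two steps as the cell above, written without inplace=True.
cleaned = toy.drop_duplicates(subset=["text"]).dropna(axis=0)

print(cleaned.index.tolist())
print(len(toy), "->", len(cleaned))
```

Note that the surviving rows keep their original index labels; neither call renumbers the index.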


#### Split the data into Validation and Test DataFrames

In [163]:
valid_df = imdb_data[imdb_data['summary'] == 'empty']
test_df = imdb_data[imdb_data['summary'] != 'empty']
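
The split relies on two complementary boolean masks over the `summary` column, where the string `'empty'` presumably marks reviews that have no generated summary. A small sketch on made-up rows (the `toy` DataFrame and variable names are hypothetical) shows that every row lands in exactly one of the two parts:

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "text": ["a", "b", "c", "d"],
    "summary": ["short a", "empty", "empty", "short d"],
})

# One mask, used directly and negated, so the two parts cannot overlap.
no_summary = toy["summary"] == "empty"
valid_toy = toy[no_summary]
test_toy = toy[~no_summary]

# Every row ends up in exactly one of the two DataFrames.
print(len(valid_toy) + len(test_toy) == len(toy))
print(valid_toy["label"].value_counts().to_dict())
```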

In [164]:
print('There are {} samples in the Validation dataset with {} positive and {}\
 negative samples.'.format(len(valid_df), (valid_df['label'] == 'pos').sum(),
                           (valid_df['label'] == 'neg').sum()))
valid_df.head()

There are 16434 samples in the Validation dataset with 9943 positive and 6491 negative samples.


|   | label | text | summary |
|---|---|---|---|
| 5961 | neg | If this is supposed to be the black experience... | empty |
| 5962 | neg | As a fan of Notorious B.I.G., I was looking fo... | empty |
| 5963 | neg | Look,I'm reading and reading this comments and... | empty |
| 5964 | neg | I thought the movie was OK but very disappoint... | empty |
| 5965 | neg | if you didn't live in the 90's or didn't liste... | empty |


In [165]:
print('There are {} samples in the Test dataset with {} positive and {}\
 negative samples.'.format(len(test_df), (test_df['label'] == 'pos').sum(),
                           (test_df['label'] == 'neg').sum()))
test_df.head()

There are 8467 samples in the Test dataset with 2527 positive and 5940 negative samples.


|   | label | text | summary |
|---|---|---|---|
| 0 | neg | I rented I AM CURIOUS-YELLOW from my video sto... | I AM CURIOUS-YELLOW is a film about a young Sw... |
| 1 | neg | "I Am Curious: Yellow" is a risible and preten... | "I Am Curious: Yellow" is a risible and preten... |
| 2 | neg | If only to avoid making this type of film in t... | The film is interesting as an experiment but t... |
| 3 | neg | This film was probably inspired by Godard's Ma... | Actress Lena Nyman has to be the most annoying... |
| 4 | neg | Oh, brother...after hearing about this ridicul... | After hearing about this ridiculous film for u... |


#### Save the Validation and Test DataFrames into CSV files

In [167]:
valid_df.to_csv(path_or_buf=SAVE_VALID_DATASET_PATH, columns=columns)
test_df.to_csv(path_or_buf=SAVE_TEST_DATASET_PATH, columns=columns)
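
By default `to_csv` also writes the DataFrame index as an unnamed first column, which `pd.read_csv` then loads back as `Unnamed: 0`. That is one reason `only_columns` is useful when reloading such files with DataFrameDataset. The sketch below, using an in-memory buffer and made-up rows (all names are hypothetical), compares the default behavior with `index=False`:

```python
import io

import pandas as pd

toy = pd.DataFrame({"label": ["neg", "pos"], "text": ["a", "b"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to pass `index=False` when saving is a design choice: keeping the index preserves the original row numbers, while dropping it gives files that contain only `label`, `text` and `summary`.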