# Preprocess data

Additional packages may need to be installed for the code to run. The cell below shows an example; it is commented out so that it is not executed, because the packages are already installed in this environment.

In [None]:
# !pip install transformers

### Initialize parameters


earlyStop is a parameter that lets the user preprocess only a subset of the dataset, namely the first earlyStop data points.

In [1]:
earlyStop = 20 
batch_size = 32
loading_preprocessed_data = False
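
As a rough illustration of what such a limit does, the sketch below truncates a list of records to its first early_stop entries. This is not the code inside BlurbDataset; the helper name take_first and the use of None for "no limit" are assumptions made for the example.

```python
from itertools import islice


def take_first(records, early_stop=None):
    """Return the first `early_stop` records, or all of them if no limit is given."""
    if early_stop is None:
        return list(records)
    return list(islice(records, early_stop))


blurbs = [f"blurb {i}" for i in range(100)]  # stand-in for the raw data points
subset = take_first(blurbs, early_stop=20)
print(len(subset))  # 20
```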

### Load and open dataset

The following code loads the dataset and displays the first five rows of the training DataFrame.

In [2]:
from blurb_dataset.blurb_dataset import BlurbDataset

if not loading_preprocessed_data:
    data = BlurbDataset(
        # earlyStop=earlyStop,  # leave this commented out to disable early stopping
        batch_size=batch_size,
    )
    display(data.trainDF.head())



Unnamed: 0,title,body,author,published,topics
0,The New York Times Daily Crossword Puzzles: Th...,Monday’s Crosswords Do with EaseTuesday’s Cros...,New York Times,"Dec 28, 1996","[Nonfiction, Games]"
1,Creatures of the Night (Second Edition),Two of literary comics modern masters present ...,Neil Gaiman,"Nov 29, 2016","[Fiction, Graphic Novels & Manga]"
2,Cornelia and the Audacious Escapades of the So...,Eleven-year-old Cornelia is the daughter of tw...,Lesley M. M. Blume,"Jan 08, 2008","[Children’s Books, Children’s Middle Grade Books]"
3,The Alchemist's Daughter,"During the English Age of Reason, a woman cloi...",Katharine McMahon,"Oct 24, 2006","[Fiction, Historical Fiction]"
4,Dangerous Boy,A modern-day retelling of The Strange Case of ...,Mandy Hubbard,"Aug 30, 2012","[Teen & Young Adult, Teen & Young Adult Myster..."


This is the unprocessed data.

### Preprocess the dataset

Here we do the actual preprocessing: we select the parts of the data that are relevant for text classification, preprocess the labels, and then save the data. First, let's preprocess the labels:

In [3]:
if not loading_preprocessed_data:
    data.preprocessLabels()
    display(data.trainDF.head())

Unnamed: 0,title,body,author,published,topics,labels
0,The New York Times Daily Crossword Puzzles: Th...,Monday’s Crosswords Do with EaseTuesday’s Cros...,New York Times,"Dec 28, 1996","[Nonfiction, Games]","[0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
1,Creatures of the Night (Second Edition),Two of literary comics modern masters present ...,Neil Gaiman,"Nov 29, 2016","[Fiction, Graphic Novels & Manga]","[1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
2,Cornelia and the Audacious Escapades of the So...,Eleven-year-old Cornelia is the daughter of tw...,Lesley M. M. Blume,"Jan 08, 2008","[Children’s Books, Children’s Middle Grade Books]","[2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
3,The Alchemist's Daughter,"During the English Age of Reason, a woman cloi...",Katharine McMahon,"Oct 24, 2006","[Fiction, Historical Fiction]","[1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
4,Dangerous Boy,A modern-day retelling of The Strange Case of ...,Mandy Hubbard,"Aug 30, 2012","[Teen & Young Adult, Teen & Young Adult Myster...","[3, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1]"
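
The labels column above is consistent with a scheme in which each topic receives an integer id per hierarchy level and the list is padded with -1 to a fixed length of 11. The sketch below reproduces the first four rows under that assumption. The function name encode_labels and the rule of assigning ids in order of first appearance are hypothetical; the actual logic of preprocessLabels may differ.

```python
def encode_labels(topic_lists, max_depth=11, pad_value=-1):
    """Map each topic to an integer id per hierarchy level and pad to a fixed length."""
    level_dicts = [{} for _ in range(max_depth)]  # one topic -> id mapping per level
    encoded = []
    for topics in topic_lists:
        ids = []
        for level, topic in enumerate(topics[:max_depth]):
            mapping = level_dicts[level]
            # Assign the next free id the first time a topic is seen at this level.
            ids.append(mapping.setdefault(topic, len(mapping)))
        ids += [pad_value] * (max_depth - len(ids))
        encoded.append(ids)
    return encoded, level_dicts


topics = [
    ["Nonfiction", "Games"],
    ["Fiction", "Graphic Novels & Manga"],
    ["Children’s Books", "Children’s Middle Grade Books"],
    ["Fiction", "Historical Fiction"],
]
encoded, level_dicts = encode_labels(topics)
for row in encoded:
    print(row)
```

Padding every label list to the same length makes it possible to stack the labels into a single rectangular array later on.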


Next, we tokenize the text input. Note that this takes a while when early stopping is not enabled.

In [4]:
if not loading_preprocessed_data:
    data.tokenization()
    display(data.trainDF.head())

Tokenization:   0%|          | 0/20 [00:00<?, ?it/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 15.4MB/s]

Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 13.7kB/s]

Downloading (…)lve/main/config.json: 100%|██████████| 570/570 [00:00<00:00, 340kB/s]
Tokenization: 100%|██████████| 20/20 [00:01<00:00, 16.73it/s]
Tokenization: 100%|██████████| 10/10 [00:00<00:00, 59.07it/s]
Tokenization: 100%|██████████| 10/10 [00:00<00:00, 57.83it/s]


Unnamed: 0,title,body,author,published,topics,labels,tokenizedTopics,attentionMask
0,The New York Times Daily Crossword Puzzles: Th...,Monday’s Crosswords Do with EaseTuesday’s Cros...,New York Times,"Dec 28, 1996","[Nonfiction, Games]","[0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 6928, 1521, 1055, 2892, 22104, 2079, 200...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Creatures of the Night (Second Edition),Two of literary comics modern masters present ...,Neil Gaiman,"Nov 29, 2016","[Fiction, Graphic Novels & Manga]","[1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2048, 1997, 4706, 5888, 2715, 5972, 2556...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,Cornelia and the Audacious Escapades of the So...,Eleven-year-old Cornelia is the daughter of tw...,Lesley M. M. Blume,"Jan 08, 2008","[Children’s Books, Children’s Middle Grade Books]","[2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 5408, 1011, 2095, 1011, 2214, 9781, 1390...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,The Alchemist's Daughter,"During the English Age of Reason, a woman cloi...",Katharine McMahon,"Oct 24, 2006","[Fiction, Historical Fiction]","[1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2076, 1996, 2394, 2287, 1997, 3114, 1010...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,Dangerous Boy,A modern-day retelling of The Strange Case of ...,Mandy Hubbard,"Aug 30, 2012","[Teen & Young Adult, Teen & Young Adult Myster...","[3, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1]","[101, 1037, 2715, 1011, 2154, 2128, 23567, 207...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
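
In the usual BERT convention, an attention mask holds 1 for positions that contain real tokens and 0 for padding positions. The dependency-free sketch below shows how a list of token ids can be brought to a fixed length together with its mask. The helper pad_and_mask, the chosen max_length and the example ids are illustrative assumptions, not the code behind data.tokenization().

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad a list of token ids and build the matching attention mask."""
    ids = list(token_ids[:max_length])  # naive truncation; may drop a final separator token
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_and_mask([101, 2076, 1996, 2394, 102], max_length=8)
print(ids)   # [101, 2076, 1996, 2394, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```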


Finally, we convert the pd.DataFrame into a torch.utils.data.DataLoader object, which is more efficient to iterate over during training.

In [5]:
if not loading_preprocessed_data:
    data.convertToDataloader()
    display(data.trainDF.head())
    data.isPreprocessed = True



Unnamed: 0,title,body,author,published,topics,labels,tokenizedTopics,attentionMask
0,The New York Times Daily Crossword Puzzles: Th...,Monday’s Crosswords Do with EaseTuesday’s Cros...,New York Times,"Dec 28, 1996","[Nonfiction, Games]","[0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 6928, 1521, 1055, 2892, 22104, 2079, 200...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Creatures of the Night (Second Edition),Two of literary comics modern masters present ...,Neil Gaiman,"Nov 29, 2016","[Fiction, Graphic Novels & Manga]","[1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2048, 1997, 4706, 5888, 2715, 5972, 2556...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,Cornelia and the Audacious Escapades of the So...,Eleven-year-old Cornelia is the daughter of tw...,Lesley M. M. Blume,"Jan 08, 2008","[Children’s Books, Children’s Middle Grade Books]","[2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 5408, 1011, 2095, 1011, 2214, 9781, 1390...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,The Alchemist's Daughter,"During the English Age of Reason, a woman cloi...",Katharine McMahon,"Oct 24, 2006","[Fiction, Historical Fiction]","[1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2076, 1996, 2394, 2287, 1997, 3114, 1010...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,Dangerous Boy,A modern-day retelling of The Strange Case of ...,Mandy Hubbard,"Aug 30, 2012","[Teen & Young Adult, Teen & Young Adult Myster...","[3, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1]","[101, 1037, 2715, 1011, 2154, 2128, 23567, 207...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
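
To make the role of batch_size concrete, the following plain-Python sketch splits a list of examples into consecutive batches. It only illustrates the batching idea and is not a substitute for torch.utils.data.DataLoader, which additionally handles shuffling, collation into tensors and parallel loading. The generator name iterate_batches is hypothetical.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(20))  # stand-in for 20 preprocessed rows
batch_sizes = [len(batch) for batch in iterate_batches(examples, batch_size=8)]
print(batch_sizes)  # [8, 8, 4]
```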


To perform all of the preceding steps at once, simply call the following function:

In [3]:
if not loading_preprocessed_data and not data.isPreprocessed:  # only run this cell if the cells above have not been run
    data.prepareData()
    display(data.trainDF.head())

Tokenization: 100%|██████████| 58715/58715 [03:30<00:00, 279.20it/s]
Tokenization: 100%|██████████| 7392/7392 [00:25<00:00, 285.95it/s]
Tokenization: 100%|██████████| 9197/9197 [00:32<00:00, 283.68it/s]


Unnamed: 0,title,body,author,published,topics,labels,tokenizedTopics,attentionMask
0,The New York Times Daily Crossword Puzzles: Th...,Monday’s Crosswords Do with EaseTuesday’s Cros...,New York Times,"Dec 28, 1996","[Nonfiction, Games]","[0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 6928, 1521, 1055, 2892, 22104, 2079, 200...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Creatures of the Night (Second Edition),Two of literary comics modern masters present ...,Neil Gaiman,"Nov 29, 2016","[Fiction, Graphic Novels & Manga]","[1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2048, 1997, 4706, 5888, 2715, 5972, 2556...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,Cornelia and the Audacious Escapades of the So...,Eleven-year-old Cornelia is the daughter of tw...,Lesley M. M. Blume,"Jan 08, 2008","[Children’s Books, Children’s Middle Grade Books]","[2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 5408, 1011, 2095, 1011, 2214, 9781, 1390...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,The Alchemist's Daughter,"During the English Age of Reason, a woman cloi...",Katharine McMahon,"Oct 24, 2006","[Fiction, Historical Fiction]","[1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1]","[101, 2076, 1996, 2394, 2287, 1997, 3114, 1010...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,Dangerous Boy,A modern-day retelling of The Strange Case of ...,Mandy Hubbard,"Aug 30, 2012","[Teen & Young Adult, Teen & Young Adult Myster...","[3, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1]","[101, 1037, 2715, 1011, 2154, 2128, 23567, 207...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


### Saving and loading preprocessed data

Instead of doing the preprocessing every time the pipeline is executed, we can simply save and load the preprocessed data. To save:

In [5]:
if not loading_preprocessed_data:
    data.saveData()

Note that this saves only the first earlyStop elements of the data, as these are the only ones that have actually been preprocessed. The name of the saved file reflects this. There is only ever one saved file for a given value of earlyStop: if the preprocessing is repeated for the same value of earlyStop, the saved file is overwritten.

Loading the data then only requires the following step:

In [7]:
if loading_preprocessed_data:  # create a new data object, which automatically loads the saved data
    data = BlurbDataset(earlyStop=earlyStop, batch_size=batch_size, tokenizedDataPath="blurb_dataset")
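
The save-then-load pattern described above can be sketched independently of the dataset class as a small cache: reuse the file if it exists, otherwise run the preprocessing and write the result. The helper load_or_build, the JSON format and the file name are assumptions chosen to keep the example self-contained; the notebook itself stores .pt files through saveData().

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_file, build):
    """Return cached data if `cache_file` exists; otherwise build, save and return it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data


calls = []


def expensive_preprocessing():
    calls.append(1)  # record each real preprocessing run
    return {"labels": [[0, 0], [1, 1]]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "preprocessedEarlyStop20.json"
    first = load_or_build(cache, expensive_preprocessing)   # runs the preprocessing
    second = load_or_build(cache, expensive_preprocessing)  # served from the cache

print(first == second, len(calls))  # True 1
```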

### Save the dataloader to MinIO

In [6]:
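import os

import boto3

# Read the MinIO credentials from the environment instead of hard-coding secrets in the notebook.
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]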

def generateFileNames():
    """Generate the names of the files to be uploaded."""
    PATH = "blurb_dataset" 
    appendix = ""
    if earlyStop != 1e50:
        appendix = f"EarlyStop{earlyStop}"
        
    pathNames = []
    pathNames.append(os.path.join(PATH, f"preprocessedTrainBLURBDataset{appendix}.pt"))
    pathNames.append(os.path.join(PATH, f"preprocessedTestBLURBDataset{appendix}.pt"))
    pathNames.append(os.path.join(PATH, f"preprocessedValBLURBDataset{appendix}.pt"))
    pathNames.append(os.path.join(PATH, "listLabelDict.json"))
    return pathNames

s3 = boto3.resource('s3',
    endpoint_url='http://10.240.5.123:9099',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_session_token=None,
    config=boto3.session.Config(signature_version='s3v4'),
    verify=False
)

bucket_name = "idoml"  # define your bucket name
# Names of the files to upload. TODO: generate these properly; this could also be moved into the dataset class.
fileNames = generateFileNames()
minio_path = "trainBERT-v1.0.0/preprocessed_data/"

for fileName in fileNames:
    s3.Bucket(bucket_name).upload_file(fileName, minio_path + fileName)
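
The upload loop builds each object key by concatenating minio_path with a local path produced by os.path.join, which yields backslashes on Windows, whereas S3-style keys conventionally use forward slashes. A minimal sketch of a platform-independent way to build the key follows; the helper name to_object_key is hypothetical and not part of the notebook.

```python
import posixpath
from pathlib import PureWindowsPath


def to_object_key(prefix, local_path):
    """Join a key prefix and a local path using forward slashes only."""
    # PureWindowsPath accepts both "/" and "\\" as separators, so this
    # normalises paths produced by os.path.join on any platform.
    relative = PureWindowsPath(local_path).as_posix()
    return posixpath.join(prefix, relative)


key = to_object_key("trainBERT-v1.0.0/preprocessed_data/", "blurb_dataset\\listLabelDict.json")
print(key)  # trainBERT-v1.0.0/preprocessed_data/blurb_dataset/listLabelDict.json
```

With such a helper, the second argument of upload_file could be computed as to_object_key(minio_path, fileName).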