# Dataset Tokenization for Fine-tuning T5 for Summarization
This model will summarize why messages flagged as inappropriate were considered inappropriate. This notebook tokenizes the summary dataset that will be used for training.

In [1]:
#!pip install sentencepiece

## Load Summary Dataset (derived from Wikipedia Toxic Comments)

In [2]:
import pandas as pd

summary_df = pd.read_csv('summary_dataset.csv')

print(f'Summary dataset shape: {summary_df.shape}')

Summary dataset shape: (13742, 2)


The summary dataset was created by feeding the Wikipedia toxic comments dataset into Code Interpreter. Code Interpreter was asked to remove the rows that contained appropriate comments, so that the remaining rows would all be inappropriate comments. It was then asked to state, in the 'reason' column, why each remaining comment was inappropriate. Code Interpreter failed to remove all of the appropriate comments, but the reasons it provided generally appear to be correct. Because of this, I used the Rule Adherence Classifier I had trained previously to identify which comments in the 'text' column still needed to be removed, and removed them.
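The second filtering pass described above can be sketched roughly as follows. This is only an illustration: `predict_inappropriate` is a hypothetical stand-in for the Rule Adherence Classifier (whose real interface is not shown in this notebook), and `drop_appropriate_rows` and the toy rows are made up for the example.

```python
import pandas as pd

def predict_inappropriate(text: str) -> bool:
    """Hypothetical stand-in for the Rule Adherence Classifier.

    The real classifier is a trained model; this toy rule only exists
    so the example runs on its own.
    """
    return "stupid" in text.lower()

def drop_appropriate_rows(df: pd.DataFrame, predictor) -> pd.DataFrame:
    # Build a boolean mask: True where the classifier flags the comment
    mask = df["text"].map(predictor)
    # Keep only flagged rows and renumber the index from 0
    return df[mask].reset_index(drop=True)

toy_df = pd.DataFrame({
    "text": ["Thanks for the edit!", "You are stupid", "See the talk page."],
    "reason": ["n/a", "contains a derogatory insult", "n/a"],
})

cleaned_df = drop_appropriate_rows(toy_df, predict_inappropriate)
print(cleaned_df)
```

Resetting the index after filtering keeps the later index-based train/test split straightforward.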

One of the advantages of this approach is that it ensures the summary itself will be clean. Because the labeling process utilized Code Interpreter, the dataset takes advantage of Code Interpreter's sophisticated understanding of language to benefit the training of a simpler seq2seq model.

More scrupulous dataset creation will lead to a better model. The plan for now is to build out the bot as an approximation of the final version, both to gain insight into any other features that might be useful and to determine what else in the code might be unnecessary.

In [3]:
summary_df.head()

Unnamed: 0,text,reason
0,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,contains derogatory swear words
1,You are gay or antisemmitian? \n\nArchangel WH...,contains potentially offensive group references
2,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",contains derogatory swear words
3,GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK T...,contains derogatory swear words
4,Stupid peace of shit stop deleting my stuff as...,"contains derogatory swear words, contains dero..."


I am also going to split the data here, before encoding. I am using an 80% training split, with the remaining 20% held out for testing.

In [4]:
# Split the data
train_df = summary_df.sample(frac=0.8, random_state=42)
test_df = summary_df.drop(train_df.index)

print("train size:", len(train_df))
print("test size:", len(test_df))

train size: 10994
test size: 2748
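The `sample` / `drop` pattern above relies on the DataFrame index being unique, since the test set is defined as every index label not drawn for training. A minimal sketch on a toy frame (the `toy_df` data is made up for illustration) that checks the two parts do not overlap:

```python
import pandas as pd

# Toy stand-in for summary_df, with the default unique RangeIndex
toy_df = pd.DataFrame({"text": [f"comment {i}" for i in range(10)]})

# drop() removes rows by index label, so duplicate labels would break the split
assert toy_df.index.is_unique

train_toy = toy_df.sample(frac=0.8, random_state=42)
test_toy = toy_df.drop(train_toy.index)

# No index label should appear in both parts
overlap = train_toy.index.intersection(test_toy.index)
print(len(train_toy), len(test_toy), len(overlap))
```

With `frac=0.8`, pandas draws `round(0.8 * len(df))` rows, which is consistent with the 10994 / 2748 sizes reported for the 13742-row dataset.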


## Encoding w/ T5TokenizerFast
For this initial summary model I am going to fine-tune T5 (Text-to-Text Transfer Transformer) on the summary dataset.

In [2]:
#!pip install transformers

The apply_encoding_to_df() function below takes a DataFrame and applies the encode() function defined just above it to each row. The result is a pandas Series of dictionaries, which apply_encoding_to_df() then converts into a single DataFrame.

In [None]:
from transformers import T5TokenizerFast

# Instantiate the tokenizer
tokenizer = T5TokenizerFast.from_pretrained('t5-small')

def encode(text, reason):
    # Encode the text with the prefix "summarize: " for the T5 model
    encoded = tokenizer(f"summarize: {text}", truncation=True, padding='max_length', max_length=128, return_tensors="pt")
    
    # return_tensors="pt" yields tensors of shape (1, max_length); take row 0 and convert it to a Python list
    attention_mask = encoded['attention_mask'][0].tolist()
    input_ids = encoded['input_ids'][0].tolist()
    
    # Convert reason (summary) into ids. No need for attention masks for target.
    target_ids = tokenizer(reason, truncation=True, padding='max_length', max_length=128, return_tensors="pt").input_ids[0].tolist()
    
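    # NOTE: when 'labels' are passed, Hugging Face seq2seq models such as T5 and BART
    # build decoder_input_ids themselves by shifting the labels one position to the right,
    # so this column is redundant. It is stored unshifted here for simplicity and is
    # probably best dropped before training rather than fed to the model as-is.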
    decoder_input_ids = target_ids

    # Return data in a dictionary format with appropriate keys for the trainer
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': target_ids,  # using 'labels' instead of 'target_ids' for trainer
        'decoder_input_ids': decoder_input_ids
    }

def apply_encoding_to_df(df):
    # Apply the encode function to each row of the dataframe
    # With axis=1, df.apply() returns a pandas Series whose elements are the dicts from encode()
    encoded_data = df.apply(lambda row: encode(row['text'], row['reason']), axis=1)
    
    # Collect the dicts from the Series into a plain Python list
    list_of_dicts = encoded_data.tolist()
    
    # Then return the encoded data as a DataFrame, one column per dict key
    return pd.DataFrame(list_of_dicts)
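As far as I understand the Hugging Face implementation, when `labels` are supplied without `decoder_input_ids`, T5 derives the decoder inputs by shifting the labels one position to the right and prepending the decoder start token (the pad token, id 0, for T5). The following is a pure-Python sketch of that idea, not the library's own code; `shift_right` is an illustrative helper name.

```python
PAD_TOKEN_ID = 0     # T5's pad token id, which also serves as the decoder start token
IGNORE_INDEX = -100  # label value that the loss ignores

def shift_right(labels, start_token_id=PAD_TOKEN_ID):
    # Prepend the start token and drop the last position
    shifted = [start_token_id] + list(labels[:-1])
    # Masked label positions become ordinary padding on the input side
    return [PAD_TOKEN_ID if t == IGNORE_INDEX else t for t in shifted]

labels = [2579, 6149, 12130, 563, 9811, 1, 0, 0]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At each position the decoder then sees the previous target token and is trained to predict the current one, which is why feeding it the unshifted labels would leak the answer.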


Encode the training and test data:

In [7]:
encoded_train_df = apply_encoding_to_df(train_df)
encoded_test_df = apply_encoding_to_df(test_df)

In [8]:
encoded_test_df.head()

Unnamed: 0,input_ids,attention_mask,labels,decoder_input_ids
0,"[21603, 10, 148, 33, 16998, 42, 1181, 7, 15, 6...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2579, 6149, 12130, 563, 9811, 1, 0, 0, 0, 0, ...","[2579, 6149, 12130, 563, 9811, 1, 0, 0, 0, 0, ..."
1,"[21603, 10, 3, 13076, 12417, 3065, 13, 3, 7, 1...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2579, 20, 3822, 6546, 23782, 1234, 6, 2579, 2...","[2579, 20, 3822, 6546, 23782, 1234, 6, 2579, 2..."
2,"[21603, 10, 499, 4483, 5545, 31, 7, 29431, 5, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2579, 20, 3822, 6546, 23782, 1234, 6, 2579, 6...","[2579, 20, 3822, 6546, 23782, 1234, 6, 2579, 6..."
3,"[21603, 10, 71, 3116, 13, 528, 210, 18, 547, 5...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2579, 6149, 12130, 563, 9811, 1, 0, 0, 0, 0, ...","[2579, 6149, 12130, 563, 9811, 1, 0, 0, 0, 0, ..."
4,"[21603, 10, 148, 225, 36, 12744, 6, 25, 31, 60...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2579, 20, 3822, 6546, 1353, 1, 0, 0, 0, 0, 0,...","[2579, 20, 3822, 6546, 1353, 1, 0, 0, 0, 0, 0,..."
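In the output above, the `labels` column is padded with token id 0. A common convention with Hugging Face models is to replace padding positions in the labels with -100, which the cross-entropy loss ignores, so that the model is not trained to predict padding. A minimal sketch, assuming pad id 0 as in T5 (`mask_label_padding` is an illustrative helper, not part of this notebook's pipeline):

```python
PAD_TOKEN_ID = 0     # T5's pad token id
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_label_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    # Keep real tokens (including the end-of-sequence id 1), mask only padding
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]

labels = [2579, 6149, 12130, 563, 9811, 1, 0, 0, 0, 0]
masked_labels = mask_label_padding(labels)
print(masked_labels)
```

This could be applied to `target_ids` inside `encode()`, or left to a data collator at training time; which is preferable depends on the training setup.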


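I should note that T5 does not explicitly require attention masks to perform seq2seq tasks: if no mask is passed, every position is treated as a real token. Because the inputs here are padded to max_length, the mask is still worth keeping so that the encoder can ignore the padding positions.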
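To make the role of the mask concrete, here is a small sketch of what it encodes: 1 for real tokens and 0 for padding. The tokenizer already returns this, so `attention_mask_from_ids` is only an illustrative helper, and it assumes pad id 0 as in T5.

```python
PAD_TOKEN_ID = 0  # T5's pad token id

def attention_mask_from_ids(input_ids, pad_token_id=PAD_TOKEN_ID):
    # 1 marks a real token the model should attend to, 0 marks padding
    return [0 if t == pad_token_id else 1 for t in input_ids]

# A short padded sequence: "summarize", ":", two content tokens, EOS (id 1), then padding
input_ids = [21603, 10, 148, 33, 1, 0, 0, 0]
mask = attention_mask_from_ids(input_ids)
print(mask)
```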

### Converting pandas dataframes to datasets.Dataset objects for training

In [1]:
#!pip install datasets

For input into the model, I am converting the pandas DataFrames to datasets.Dataset objects, which are backed by Apache Arrow and help with memory management.

In [10]:
import datasets

summary_train_dataset = datasets.Dataset.from_pandas(encoded_train_df)
summary_test_dataset = datasets.Dataset.from_pandas(encoded_test_df)

Save the datasets to disk:

In [None]:
summary_train_dataset.save_to_disk('summary_train_dataset')
summary_test_dataset.save_to_disk('summary_test_dataset')