# LSAP Data Preprocessing
---
This notebook combines the pretraining datasets and creates training, validation, and test splits from them.

In [47]:
import os
import pandas as pd
import numpy as np

# set the data folders
data_folders = ['polyai-bank', 'wikihow']

#Set train, val, and test split sizes (these match the 80/10/10 split produced by the code below)
TRAIN_SPLIT = 0.8
VAL_SPLIT   = 0.1
TEST_SPLIT  = 0.1
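
A two-stage call to train_test_split needs the three target fractions translated into two test_size values. The sketch below shows one way to do that translation; the helper name two_stage_test_sizes is hypothetical and is not part of the notebook.

```python
import math


def two_stage_test_sizes(train, val, test):
    """Translate target fractions into the two test_size values of a two-stage split.

    Hypothetical helper, not part of the original notebook.
    """
    if not math.isclose(train + val + test, 1.0):
        raise ValueError("split fractions must sum to 1")
    holdout = val + test          # fraction removed from train in the first split
    test_share = test / holdout   # fraction of the held-out part that becomes test
    return holdout, test_share


# An 80/10/10 target maps to test_size=0.2 followed by test_size=0.5
first_size, second_size = two_stage_test_sizes(0.8, 0.1, 0.1)
print(round(first_size, 3), round(second_size, 3))
```

Deriving both values from the constants keeps the configured fractions and the actual split from drifting apart.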

In [48]:
#create empty dataframe for the combined data
combined_df = pd.DataFrame()

# search each folder for 'data.csv' files
for folder in data_folders:
    # get 'data.csv' in the folder
    folder_path = os.path.join( folder, 'data.csv' )
            
    # read the file
    df = pd.read_csv( folder_path, index_col=0 )
    
    # add the data to the combined dataframe
    combined_df = pd.concat([ combined_df, df ])

combined_df.sample( frac = 1 ).head( 10 )

Unnamed: 0,text,label,label_name
68755,Set the pan on the grill and barbecue the bris...,68755,BBQ Brisket.
2511,There was a fee charged when I paid with my card.,15,Card Payment Fee Charged.
1807,How can i get new card?,23,Contactless Not Working.
110262,Look for the points associated with feelings o...,110262,Read an Ear Reflexology Chart.
50224,Wrap candles to make them fancy .,50224,Make Profitable Crafts.
15004,Extend your tape measure or measuring stick to...,15004,Measure the Size of a TV Screen.
85696,"Come back, and decide what does not make sense...",85696,Write About Fictional Deaths.
95360,Use a small square to mark vertical lines with...,95360,Make a Wooden Cabinet with Dovetail Joints.
70497,Insert the otoscope's speculum 1–2 cm (0.39–0....,70497,Look Into Your Own Ear.
81622,Visit a mental health professional.,81622,Stop Losing Weight.
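
As an alternative to growing the combined DataFrame inside the loop, the per-folder frames can be collected in a list and concatenated once. The following is a minimal sketch under that assumption; load_and_combine, the folder names 'a' and 'b', and the two-row CSV files are all made up for illustration and are written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def load_and_combine(folders, filename="data.csv"):
    """Read <folder>/<filename> from each folder and concatenate the results.

    Hypothetical helper, not part of the original notebook.
    """
    frames = []
    for folder in folders:
        path = os.path.join(folder, filename)
        frames.append(pd.read_csv(path, index_col=0))
    # One concat call at the end avoids re-copying the data on every iteration
    return pd.concat(frames)


with tempfile.TemporaryDirectory() as root:
    folders = []
    for name in ["a", "b"]:  # illustrative folder names
        folder = os.path.join(root, name)
        os.makedirs(folder)
        toy = pd.DataFrame({
            "text": [f"{name} text 0", f"{name} text 1"],
            "label": [0, 1],
            "label_name": [f"{name} intent 0", f"{name} intent 1"],
        })
        toy.to_csv(os.path.join(folder, "data.csv"))
        folders.append(folder)
    combined = load_and_combine(folders)

print(combined.shape)
# Each file keeps its own row index, so index values can repeat after concat
print(combined.index.is_unique)
```

Because every file is read with index_col=0, the combined frame keeps the per-file row numbers; passing ignore_index=True to pd.concat would renumber the rows instead.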


## Convert to HuggingFace Dataset
---
Here, we convert the combined pandas DataFrame to a HuggingFace Dataset so that the HuggingFace datasets library can handle data loading during training.

In [55]:
from datasets import Dataset, ClassLabel, DatasetDict

# create a dataset from the dataframe
dataset = Dataset.from_pandas( combined_df , preserve_index=False )

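#Make the dataset return pandas objects when indexed (the CSV export in the last cell relies on this),
#then copy its contents into a pandas DataFrame
dataset.set_format( type='pandas' )
dataset_df = dataset.data.to_pandas()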

#Drop 'label' and regenerate it from 'label_name'
dataset_df = dataset_df.drop( ["label"], axis = 1 )

#Build a ClassLabel feature with one class per unique intent name
labels = dataset_df[ 'label_name' ].unique().tolist()
ClassLabels = ClassLabel( num_classes=len( labels ), names=labels )

#Map an intent name to its integer class id
def map_label2id( row ):
    return ClassLabels.str2int( row )

#Add back the 'label' column
dataset_df["label"] = dataset_df[ "label_name" ].apply( map_label2id )
dataset_df = dataset_df.reset_index( drop = True )

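#Split off 20% of the data as a held-out set (80% train)
#NOTE: the split is taken from 'dataset', which still carries the original 'label' column;
#the regenerated labels exist only in 'dataset_df'
data_dict    = dataset.train_test_split( test_size=0.2, shuffle=True )

#Split the held-out 20% evenly into validation and test sets
train_val_dict = data_dict[ "test" ].train_test_split( test_size=0.5, shuffle=True )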

train_test_valid_dataset = DatasetDict({
    'train': data_dict['train'],
    'test':  train_val_dict['test'],
    'valid': train_val_dict['train']
})

print(train_test_valid_dataset)
dataset_df.sample( frac = 1 ).head( 10 )


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_name'],
        num_rows: 98924
    })
    test: Dataset({
        features: ['text', 'label', 'label_name'],
        num_rows: 12366
    })
    valid: Dataset({
        features: ['text', 'label', 'label_name'],
        num_rows: 12366
    })
})


Unnamed: 0,text,label_name,label
733,I've been charged an extra £1 and I don't know...,Extra Charge On Statement.,4
100244,Press and hold ⌘ Command and R. You'll need to...,Transfer Files Between Two Macs.,87238
80132,Obtain an Employer Identification Number (EIN).,Set up a Sole Proprietorship in Tennessee.,67126
62273,Understand that confidence is key.,Be Cool in High School.,49267
31355,Install the ecommerce extension you downloaded.,Create an Ecommerce Website from Scratch Using...,18349
50162,"Add the red chilli powder, salt and beat the ...",Cook Prawns with Scrambled Egg.,37156
106629,Ask the lender what credit report shows the di...,Remove a Dispute from a Credit Report.,93623
69823,Find out if a trial by declaration is allowed ...,Write a Letter Pleading Not Guilty.,56817
118270,Take a Vitamin D supplement to strengthen bone...,Balance Vitamins and Minerals on the Atkins Diet.,105264
9609,Do I need to re-apply to order a new card when...,Card About To Expire.,73


In [59]:
#Save the full (unsplit) dataset to disk in the HuggingFace format

dataset.save_to_disk( 'dataset/huggingface-cache' )

#Save train/test/valid splits as csv files
train_test_valid_dataset[ "train" ][ : ].to_csv( 'dataset/train.csv' )
train_test_valid_dataset[ "test" ][ : ].to_csv( 'dataset/test.csv' )
train_test_valid_dataset[ "valid" ][ : ].to_csv( 'dataset/valid.csv' )
