# LSAP Data Preprocessing
---
This notebook combines the pretraining datasets and creates training, validation, and test splits from them.

In [32]:
import os
import pandas as pd
import numpy as np

# set the data folders
data_folders = ['polyai-bank', 'wikihow']

#Train, validation, and test fractions (must sum to 1)
#These match the 80/10/10 row counts printed by the split cell below
TRAIN_SPLIT = 0.8
VAL_SPLIT   = 0.1
TEST_SPLIT  = 0.1
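
A two-stage split (hold out part of the data, then divide the held-out part) needs two `test_size` arguments rather than three fractions. The sketch below shows one way to derive them. `two_stage_test_sizes` is a hypothetical helper introduced for illustration; it is not part of this notebook or of any library.

```python
import math

def two_stage_test_sizes(train, val, test):
    """Hypothetical helper: map three fractions to two test_size values."""
    if not math.isclose(train + val + test, 1.0):
        raise ValueError("split fractions must sum to 1")
    # Share of all rows removed from the training set in the first split.
    holdout = val + test
    # Share of the held-out rows that becomes the test set in the second split.
    return holdout, test / holdout

# An 80/10/10 split: hold out 20%, then cut the held-out part in half.
first_size, second_size = two_stage_test_sizes(0.8, 0.1, 0.1)
print(first_size, second_size)
```

Deriving the arguments this way keeps the split cell and the declared fractions from drifting apart.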

In [33]:
#collect one dataframe per data folder
frames = []

# each folder is expected to contain a 'data.csv' file
for folder in data_folders:
    # build the path to 'data.csv' in the folder
    folder_path = os.path.join( folder, 'data.csv' )

    # read the file, using its first column as the index
    df = pd.read_csv( folder_path, index_col=0 )
    frames.append( df )

# combine all folders into a single dataframe
combined_df = pd.concat( frames )

combined_df.sample( frac = 1 ).head( 10 )

Unnamed: 0,text,label,label_name
56029,Refute her attempts to control your decisions.,56029,Deal With Your Mother.
100905,Spend quality time together away from all resp...,100905,Handle Divergent Career Paths in a Relationship.
104625,Pour in the grounded peanut pieces and mix the...,104625,Make Peanut Cookies.
64871,Surround yourself with positive supporters.,64871,Stop Making Excuses for Your Weight.
94632,"Note that Russians say ""raz"" for ""one"" when co...",94632,Count to Ten in Russian.
61264,Remove the flans from the cups by turning them...,61264,Make Coffee Flan.
14096,Plot the pairs of stock return data to obtain ...,14096,Calculate Stock Correlation Coefficient.
22525,Take a moment to think about why you’re being ...,22525,"Answer ""What Do You Like About Me""."
34502,Use target training to get your cat comfortabl...,34502,Target Train a Cat.
100526,Connect with other LGBTQ people of faith.,100526,Be an Openly Gay Christian.


### Clean up Data

In [34]:
from datasets import Dataset, ClassLabel, DatasetDict

#Drop 'label' and regenerate it from 'label_name'
combined_df = combined_df.drop( ["label"], axis = 1 )

#Convert each intent to a ClassLabel
labels = combined_df[ 'label_name' ].unique().tolist()
ClassLabels = ClassLabel( num_classes=len( labels ), names=labels )


#Map an intent name to its integer class id
def map_label2id( name ):
    return ClassLabels.str2int( name )

#Add back the 'label' column
combined_df["label"] = combined_df[ "label_name" ].apply( map_label2id )
combined_df = combined_df.reset_index( drop = True )

#Rename 'text' to 'utterance' and 'label_name' to 'intent'
combined_df = combined_df.rename( columns = { "text": "utterance", "label_name": "intent" } )

#Display
combined_df.sample( frac = 1 ).head( 10 )

Unnamed: 0,utterance,intent,label
3079,Why have I not yet received my virtual card?,Getting Virtual Card.,23
66789,Remove the tamales from the pot and leave them...,Steam Tamales.,53783
43515,Fellowship with Him and as this develops you w...,Increase Christian Faith.,30509
33793,Enjoy the best output of your garden by applyi...,Grow an Australian Native Garden.,20787
16759,"Type ""winver"" without quotation marks in the ""...",Check a PC Operating System.,3753
2092,I messed up a transfer and need to reverse it.,Cancel Transfer.,17
101976,Select the option to send or transfer money.,Use UPI.,88970
107443,Talk to your child’s friend’s parents if your ...,React when Your Toddler Swears.,94437
64889,Use Google Adwords Keyword Planner.,Make Your Mobile App Findable in App Store.,51883
9182,How do I verify my new card?,Activate My Card.,71
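
The relabeling step assigns each distinct intent name an integer id in order of first appearance. This is consistent with the banking intents in the sample above having small ids, since `polyai-bank` is loaded first. The sketch below reproduces that mapping in plain Python on a few intent names taken from the sample. It illustrates the idea only and is not how `ClassLabel` is implemented.

```python
# A few intent names from the sample output, with one repeat.
intents = ["Cancel Transfer.", "Activate My Card.", "Cancel Transfer.", "Use UPI."]

# Unique names in order of first appearance (dict keys preserve insertion order).
names = list(dict.fromkeys(intents))

# Name -> id and id -> name lookups, analogous to str2int / int2str.
str2int = {name: idx for idx, name in enumerate(names)}
int2str = dict(enumerate(names))

labels = [str2int[name] for name in intents]
print(labels)  # [0, 1, 0, 2]
```

Because the ids depend on the order in which names are first seen, changing the order of `data_folders` would change every id.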


## Convert to HuggingFace Dataset
---
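Here, we convert the DataFrame to a HuggingFace Dataset so that the HuggingFace API can handle data loading and training. The dataset is then split in two stages: 20% of the rows are held out, and that held-out part is divided equally into validation and test sets.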

In [42]:
# create a dataset from the dataframe
dataset = Dataset.from_pandas( combined_df , preserve_index=False )

#Return pandas DataFrames when the dataset is indexed
dataset.set_format( type='pandas' )

#Hold out 20% of the data; the remaining 80% is the training set
data_dict    = dataset.train_test_split( test_size=0.2, shuffle=True )

#Split the held-out 20% in half to get the validation and test sets
test_valid_dict = data_dict[ "test" ].train_test_split( test_size=0.5, shuffle=True )

train_test_valid_dataset = DatasetDict({
    'train': data_dict['train'],
    'test':  test_valid_dict['test'],
    'valid': test_valid_dict['train']
})

print(train_test_valid_dataset)
( data_dict['train'][ : ] ).sample( frac = 1 ).head( 10 )


DatasetDict({
    train: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 98924
    })
    test: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 12366
    })
    valid: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 12366
    })
})


Unnamed: 0,utterance,intent,label
83957,Make a poultice out of fresh leaves and oil.,Use Lemon Balm.,77105
82894,"Avoid letting your pets, particularly dogs, be...",Attract Owls to Your Garden.,39814
74501,Refrain from updating your computer's operatin...,Avoid Game Lag on a Low End System.,69588
62569,Purchase dress shields or garment pads (they a...,Hide Sweat Stains.,15511
32458,"Replace the memory with something positive, bu...",Forget Something Horrible You Saw on the Inter...,46502
52121,I requested a refund directly from the seller ...,Refund Not Showing Up.,44
78727,I'm seeing a direct debit payment that is not me.,Direct Debit Payment Not Recognised.,37
3059,Start with low weights and move to heavier wei...,Build Muscles (for Girls).,15927
66930,"Cut the cake into small cubes, approximately 2...",Make Peach Trifle.,106212
16336,You're finished and have arrived at a very use...,Use the ATAN Function in Excel.,30458
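
The row counts printed above can be checked by hand. The sketch below assumes that the held-out part of each split is rounded up to a whole row, which is consistent with the numbers in the output. `split_sizes` is a hypothetical helper introduced here, not a `datasets` function.

```python
import math

def split_sizes(n_rows, test_size):
    """Hypothetical helper: (rows kept, rows held out), held-out part rounded up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# Total rows, taken from the three split sizes reported in the output above.
total = 98924 + 12366 + 12366

# Stage 1: hold out 20% of all rows.
n_train, n_holdout = split_sizes(total, 0.2)

# Stage 2: split the held-out rows in half.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 98924 12366 12366
```

The result is an 80/10/10 split of the 123,656 combined rows.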


In [36]:

#Save the full (unsplit) dataset to disk in HuggingFace format
dataset.save_to_disk( 'dataset/huggingface-cache' )

#Make sure the csv output folder exists
os.makedirs( 'dataset/csv', exist_ok=True )

#Save train/test/valid splits as csv files
train_test_valid_dataset[ "train" ][ : ].to_csv( 'dataset/csv/train.csv' )
train_test_valid_dataset[ "test" ][ : ].to_csv( 'dataset/csv/test.csv' )
train_test_valid_dataset[ "valid" ][ : ].to_csv( 'dataset/csv/valid.csv' )


### Write to JSON
---

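Below, we write each split to a JSON Lines file with one JSON object per line. Each object stores the utterance as `src`, the intent name as `tgt`, and a fixed task prefix.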

In [37]:
import json, re

def convert_to_json( csv_file ):
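    df = pd.read_csv( csv_file )
    #Swap every 'csv' in the path for 'json', e.g. 'dataset/csv/train.csv' -> 'dataset/json/train.json'
    json_file = re.sub( r'csv', 'json', csv_file )
    #Create the output folder if needed ('os' is imported in the first cell)
    os.makedirs( os.path.dirname( json_file ) or '.', exist_ok=True )
    with open( json_file, 'w' ) as out_data: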
        for _, row in df.iterrows():
            utterance = row["utterance"]
            intent = row["intent"]

            json_obj = json.dumps({"translation":
                {"src": utterance, "tgt": intent, "prefix": "intent classification: "}
            })
            out_data.write(json_obj + '\n')

#Convert csv files to json files
convert_to_json( "dataset/csv/train.csv" )
convert_to_json( "dataset/csv/test.csv" )
convert_to_json( "dataset/csv/valid.csv" )
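
Since each line of the output is an independent JSON object, a consumer can parse the file one line at a time. The sketch below writes two records in the same layout to an in-memory buffer and reads them back, using utterances from the sample output earlier in the notebook. Building the model input by prepending `prefix` to `src` is an assumption about the downstream training script; the notebook itself does not show how these fields are used.

```python
import io
import json

# Two (utterance, intent) pairs taken from the sample output above.
rows = [
    ("How do I verify my new card?", "Activate My Card."),
    ("I messed up a transfer and need to reverse it.", "Cancel Transfer."),
]

# Write one JSON object per line, mirroring the layout used by convert_to_json.
buffer = io.StringIO()
for utterance, intent in rows:
    record = {"translation": {"src": utterance, "tgt": intent,
                              "prefix": "intent classification: "}}
    buffer.write(json.dumps(record) + "\n")

# Read the records back line by line.
records = [json.loads(line)["translation"] for line in buffer.getvalue().splitlines()]

# Assumption: the consumer prepends the prefix to the source text.
model_inputs = [rec["prefix"] + rec["src"] for rec in records]
targets = [rec["tgt"] for rec in records]
print(model_inputs[0])
```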