# LSAP Data Preprocessing
---
This notebook combines the pretraining datasets and creates training, validation, and test splits from them.

In [18]:
import os
import pandas as pd
import numpy as np

# set the data folders
data_folders = ['polyai-bank', 'wikihow']

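#Set train, val, and test split fractions (they must sum to 1)
#These values match the 80/10/10 split that the train_test_split calls produce
TRAIN_SPLIT = 0.8
VAL_SPLIT   = 0.1
TEST_SPLIT  = 0.1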
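
A three-way split is produced later in this notebook with two successive `train_test_split` calls, so the three fractions have to be turned into two `test_size` arguments. The sketch below shows that arithmetic for the 80/10/10 proportions seen in the split output; `two_stage_fractions` is a hypothetical helper for illustration and is not used elsewhere in the notebook.

```python
import math

def two_stage_fractions(train, val, test):
    """Return (holdout_fraction, test_share_of_holdout) for a two-step split."""
    if not math.isclose(train + val + test, 1.0):
        raise ValueError("split fractions must sum to 1")
    # Step 1: carve val + test off the full dataset.
    holdout = val + test
    # Step 2: divide that holdout between test and validation.
    return holdout, test / holdout

holdout, test_share = two_stage_fractions(0.8, 0.1, 0.1)
print(holdout, test_share)
```

The first value would be passed as `test_size` to the first split and the second value as `test_size` to the split of the held-out portion.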

In [19]:
# read each folder's 'data.csv', using its first column as the index
frames = []
for folder in data_folders:
    folder_path = os.path.join( folder, 'data.csv' )
    frames.append( pd.read_csv( folder_path, index_col=0 ) )

# concatenate once instead of growing a dataframe inside the loop
combined_df = pd.concat( frames )

combined_df.sample( frac = 1 ).head( 10 )

Unnamed: 0,text,label,label_name
8889,There has a recent suspicious withdrawal on my...,20,Cash Withdrawal Not Recognised.
15972,Make sure your speakers or headphones are plug...,15972,Troubleshoot a Skype Call.
3639,Wear eyeliner on the top lid and put on mascar...,3639,Get the Cute Hobo Look.
90418,Learn how to structure your sentence when the ...,90418,Use the Verb Suggest.
38923,Make people laugh without telling jokes.,38923,Write a Eulogy for a Grandparent.
88176,Keep your mind and eyes open while downloading...,88176,Decrease Your Chances of Being Hacked on a PC.
72456,Hire a medical bill advocate if the hospital w...,72456,Pay Medical Bills.
90579,Attach the 6.35 millimeter adaptor to your 3.5...,90579,Play Your iPod or MP3 Through an Amp.
96284,Pick a hoodie that is 2 sizes larger than your...,96284,Wear an Oversized Hoodie.
10575,Contactless isn't working for me,23,Contactless Not Working.


## Clean up Data
---

In [20]:
from datasets import Dataset, ClassLabel, DatasetDict

#Drop 'label' and regenerate it from 'label_name'
combined_df = combined_df.drop( ["label"], axis = 1 )

#Build a ClassLabel feature from the unique intent names (ids follow order of first appearance)
labels = combined_df[ 'label_name' ].unique().tolist()
ClassLabels = ClassLabel( num_classes=len( labels ), names=labels )


#Map a single intent name to its integer class id
def map_label2id( label_name ):
    return ClassLabels.str2int( label_name )

#Add back the 'label' column
combined_df["label"] = combined_df[ "label_name" ].apply( map_label2id )
combined_df = combined_df.reset_index( drop = True )

#Rename 'text' to 'utterance' and 'label_name' to 'intent'
combined_df = combined_df.rename( columns = { "text": "utterance", "label_name": "intent" } )

#Display
combined_df.sample( frac = 1 ).head( 10 )

Unnamed: 0,utterance,intent,label
22945,"Add the dried green tea leaves, cardamoms, clo...",Make Cardamom and Clove Green Tea.,9939
20782,Use networking sites to connect to potential e...,Move to New York.,7776
74274,Read all labels and avoid additives.,Use Stevia.,61268
90697,"After you have made your flower, take the flor...",Make Paper Flowers.,77691
91097,Purchase a specialized hand massage tool.,Massage Hands.,78091
66383,Become an online customer service rep.,Earn Extra Cash from Home.,53377
19694,Don't try to say you know exactly how she feels.,Comfort a Girl.,6688
84373,Follow your pediatrician's treatment plan if y...,Prevent UTI in Children.,71367
114499,"Add the dill, horseradish, garlic, mustard see...",Make Polish Dill Pickles.,101493
109147,Add the downloaded book to your Adobe Digital ...,Get Books on the Google Play Store.,96141
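
The relabeling above assigns ids by order of first appearance: `unique()` preserves that order, and `ClassLabel` numbers its `names` by position. As a minimal plain-Python sketch of the same idea (the helper `build_label_maps` is hypothetical, and the sample intents are taken from the outputs shown in this notebook):

```python
def build_label_maps(intent_names):
    """Build name->id and id->name maps in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    names = list(dict.fromkeys(intent_names))
    label2id = {name: idx for idx, name in enumerate(names)}
    id2label = dict(enumerate(names))
    return label2id, id2label

intents = ["Atm Support.", "Contactless Not Working.", "Atm Support.", "Use Stevia."]
label2id, id2label = build_label_maps(intents)

# Equivalent of the regenerated 'label' column for these four rows.
labels = [label2id[name] for name in intents]
print(labels)
```

Because the ids depend on row order, rebuilding the mapping from a reordered or filtered dataframe can give different ids for the same intent, so it may be worth saving the mapping alongside the data.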


## Convert to HuggingFace Dataset
---
Here, we convert the pandas DataFrame to a HuggingFace `Dataset` so that the HuggingFace API can handle data loading and training.

In [21]:
# create a dataset from the dataframe
dataset = Dataset.from_pandas( combined_df , preserve_index=False )

#Make indexing return pandas objects (the .to_csv calls when saving rely on this) and keep a DataFrame copy for display
dataset.set_format( type='pandas' )
dataset_df = dataset.data.to_pandas()

#Hold out 20% of the data; the remaining 80% is the training set
data_dict    = dataset.train_test_split( test_size=0.2, shuffle=True )

#Split the held-out 20% in half: 10% validation, 10% test
val_test_dict = data_dict[ "test" ].train_test_split( test_size=0.5, shuffle=True )

train_test_valid_dataset = DatasetDict({
    'train': data_dict['train'],
    'test':  val_test_dict['test'],
    'valid': val_test_dict['train']
})

print(train_test_valid_dataset)
dataset_df.sample( frac = 1 ).head( 10 )


DatasetDict({
    train: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 98924
    })
    test: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 12366
    })
    valid: Dataset({
        features: ['utterance', 'intent', 'label'],
        num_rows: 12366
    })
})


Unnamed: 0,utterance,intent,label
81961,Subscribe to a genealogy website.,Learn About Genealogy.,68955
83202,Pick saltwater fish that can live in water wit...,Choose Good Tankmates for Seahorses.,70196
76824,Tap the icon of a person.,Make Free Calls on an Android.,63818
21346,Understand that using spirulina for weight los...,Improve Your Health with Spirulina.,8340
115302,Recharge or replace the batteries if you don’t...,Open a Digital Safe Without a Key.,102296
4590,I need to withdraw money. Where can I do that?,Atm Support.,36
53116,Understand that if you are sure that something...,Prepare for a Tsunami.,40110
24016,Execute permission is denoted by the letter 'x...,Run a File in Unix.,11010
13796,Recognize defensive attitudes concerning the s...,Identify Substance Abuse in the Work Place.,790
106403,"Turn your PC or laptop on, and wait for Window...",Set Up a Second Display with Windows.,93397
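
The row counts printed in the `DatasetDict` above can be reproduced by hand. The sketch below assumes the test side of each split is rounded up, which is consistent with the printed numbers but is an assumption about the library's rounding rather than something this notebook confirms; `split_counts` is a hypothetical helper, and 123656 is the sum of the three printed counts.

```python
import math

def split_counts(n_rows, test_size):
    """Return (n_first, n_second), with the second (test) side rounded up."""
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second

total = 98924 + 12366 + 12366  # 123656 rows in the combined dataset

# First split: 80% train, 20% held out.
n_train, n_holdout = split_counts(total, 0.2)

# Second split: the held-out rows are halved into validation and test.
n_valid, n_test = split_counts(n_holdout, 0.5)

print(n_train, n_valid, n_test)
```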


In [22]:

#Save the full (unsplit) dataset to disk in HuggingFace format
dataset.save_to_disk( 'dataset/huggingface-cache' )

#Save train/test/valid splits as csv files
train_test_valid_dataset[ "train" ][ : ].to_csv( 'dataset/train.csv' )
train_test_valid_dataset[ "test" ][ : ].to_csv( 'dataset/test.csv' )
train_test_valid_dataset[ "valid" ][ : ].to_csv( 'dataset/valid.csv' )

Saving the dataset (0/1 shards):   0%|          | 0/123656 [00:00<?, ? examples/s]

In [26]:
import json

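#Reload the training split saved as CSV
train = pd.read_csv("dataset/train.csv")

#Write one JSON object per line (JSON Lines), framing each example as a src -> tgt pair
with open("dataset/train.json", 'w') as out_data:
    for _, row in train.iterrows():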
        utterance = row["utterance"]
        intent = row["intent"]

        json_obj = json.dumps({"translation":
            {"src": utterance, "tgt": intent, "prefix": "intent classification: "}
        })
        out_data.write(json_obj + '\n')
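
Although the output file is named `train.json`, it holds one JSON object per line rather than a single JSON document, so it should be read back line by line instead of with a single `json.load`. A minimal round-trip sketch, using an in-memory buffer in place of the file and two utterances from the samples above; `to_translation_record` is a hypothetical helper that mirrors the record layout written by the cell:

```python
import io
import json

def to_translation_record(utterance, intent, prefix="intent classification: "):
    """Serialize one example in the same layout the cell above writes."""
    return json.dumps(
        {"translation": {"src": utterance, "tgt": intent, "prefix": prefix}}
    )

rows = [
    ("Contactless isn't working for me", "Contactless Not Working."),
    ("Read all labels and avoid additives.", "Use Stevia."),
]

# Write one record per line, as the notebook does for train.json.
buffer = io.StringIO()
for utterance, intent in rows:
    buffer.write(to_translation_record(utterance, intent) + "\n")

# Read it back: parse each line independently.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records[0]["translation"]["tgt"])
```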