# LSAP Data Preprocessing
---
This notebook contains code for combining the pretraining datasets and creating training, validation, and test splits for them.

In [1]:
import os
import pandas as pd

# set the data folders
data_folders = ['ATIS', 'SNIPS', 'TOPS_Reminder', 'TOPS_Weather']

# Whether to split the data into train, val, and test sets
SPLIT = False

#Set train, val, and test split sizes
TRAIN_SPLIT = 0.6
VAL_SPLIT   = 0.2
TEST_SPLIT  = 0.2
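
The split fractions above are defined, but the cells in this section do not apply them. As a minimal sketch of how they might be used when `SPLIT` is enabled, the hypothetical helper `split_dataframe` below (not part of the original notebook) shuffles a DataFrame and slices it by the configured fractions. The `seed` parameter and the toy data are assumptions for illustration.

```python
import pandas as pd

TRAIN_SPLIT = 0.6
VAL_SPLIT   = 0.2
TEST_SPLIT  = 0.2

def split_dataframe(df, train=TRAIN_SPLIT, val=VAL_SPLIT, seed=0):
    """Hypothetical helper: shuffle df, then cut it into train/val/test parts."""
    # Shuffle all rows reproducibly and discard the old index.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # round() avoids off-by-one sizes caused by floating-point products.
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)

    train_df = shuffled.iloc[:n_train]
    val_df = shuffled.iloc[n_train:n_train + n_val]
    test_df = shuffled.iloc[n_train + n_val:]  # whatever remains goes to test
    return train_df, val_df, test_df

# Toy data standing in for one of the real datasets.
toy = pd.DataFrame({
    'utterance': [f'utterance {i}' for i in range(10)],
    'intent': ['Get Weather.'] * 10,
})
train_df, val_df, test_df = split_dataframe(toy)
print(len(train_df), len(val_df), len(test_df))
```

Assigning the remainder to the test set means `TEST_SPLIT` is implied by the other two fractions, which keeps every row in exactly one split even when the sizes do not divide evenly.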

### Load and Combine Data
---

Below, we read each dataset's `data.csv` into a DataFrame and concatenate them into a single combined dataset.

In [2]:
#create dictionary to store data
dataframes = { k : pd.DataFrame() for k in data_folders }

# search each folder for 'data.csv' files
for folder in data_folders:
    # get 'data.csv' in the folder
    folder_path = os.path.join( folder, 'data.csv' )

    # read the file
    df = pd.read_csv( folder_path, index_col=0 )

    # add the dataframe to the dictionary
    dataframes[ folder ] = df

    #name the dataframe
    dataframes[ folder ].name = folder

# combine the dataframes
combined_df = pd.concat( dataframes.values() )
combined_df.name = 'combined'

#add it to the data folder
dataframes['combined'] = combined_df

dataframes['combined'].sample( frac = 1 ).head( 10 )

Unnamed: 0,utterance,intent
3189,pull up all of my reminders about dinner with ...,Get Reminder.
6027,Play easy listening,Play Music.
4238,What's the weather forecast in Melcher-Dallas ?,Get Weather.
19448,I was wondering whether it will be windy tomor...,Get Weather.
3915,What is the weather forecast in Pinson South D...,Get Weather.
666,Do I need to bring an umbrella?,Get Weather.
4989,Set a reminder for ten minutes to check dinner.,Create Reminder.
15809,What's the weather like in Egypt?,Get Weather.
5913,Tell me today's weather forecast,Get Weather.
232,Please delete the reminder about eggs.,Delete Reminder.


### Clean up Data

In [3]:
from datasets import ClassLabel
import json, re

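def convert_to_json( csv_file ):
    """Write the utterance/intent pairs of a CSV file as one JSON object per line."""
    df = pd.read_csv( csv_file )
    # re.sub replaces every 'csv' in the path, so
    # 'dataset/csv/X.csv' is written to 'dataset/json/X.json'
    with open( re.sub( r'csv', 'json', csv_file ), 'w' ) as out_data:
        for _, row in df.iterrows():
            utterance = row["utterance"]
            intent = row["intent"]

            # one translation-style record per utterance
            json_obj = json.dumps({"translation":
                {"src": utterance, "tgt": intent, "prefix": "intent classification: "}
            })
            out_data.write(json_obj + '\n')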

def convert_intent_labels_to_integers(df):
  """Converts the intent labels in a DataFrame to integers.

  Args:
    df: The DataFrame to convert.

  Returns:
    The converted DataFrame.
  """

  # Convert each intent to a ClassLabel.
  labels = df['intent'].unique().tolist()
  ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

  # Append ClassLabels into DataFrame.
  def map_label2id(row):
    return ClassLabels.str2int(row)

  df['label'] = df['intent'].apply(map_label2id)

  # Reset the index of the DataFrame.
  df = df.reset_index(drop=True)

  # Rename the 'text' and 'label_name' columns to 'utterance' and 'intent', respectively
  # (this has no effect when those columns are absent).
  df = df.rename(columns={'text': 'utterance', 'label_name': 'intent'})

  return df
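
Because `convert_intent_labels_to_integers` builds its label list from `df['intent'].unique()`, the integer assigned to an intent depends on the DataFrame it is called on, so the same intent can receive different ids in different datasets. If consistent ids across files are wanted (an assumption, since the notebook does not say), one option is to build the mapping once and reuse it. The sketch below uses a plain dictionary instead of `ClassLabel`, and the toy frames and the name `label2id` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for two per-dataset frames (illustrative data only).
weather = pd.DataFrame({
    'utterance': ['Will it rain?'],
    'intent': ['Get Weather.'],
})
reminder = pd.DataFrame({
    'utterance': ['Remind me to call mom', 'Delete my reminder'],
    'intent': ['Create Reminder.', 'Delete Reminder.'],
})
combined = pd.concat([reminder, weather], ignore_index=True)

# Build the intent -> id table once, from the combined frame.
label2id = {intent: i for i, intent in enumerate(combined['intent'].unique())}

# Apply the same table to every frame so the ids agree across files.
for frame in (weather, reminder, combined):
    frame['label'] = frame['intent'].map(label2id)

print(label2id)
```

With a shared table, 'Get Weather.' maps to the same integer in the weather-only frame as in the combined one, whereas a per-frame label list would give it id 0 in the weather-only frame.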

### Write to JSON
---

Below, we convert each dataset's intent labels to integers and write the result to its own CSV and JSON files.

In [4]:
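# make sure the output folders exist
# ('dataset/json' is where convert_to_json writes its files)
os.makedirs( 'dataset/csv', exist_ok=True )
os.makedirs( 'dataset/json', exist_ok=True )

for folder, df in dataframes.items():
  dataframes[ folder ] = convert_intent_labels_to_integers( df )
  #save to csv
  dataframes[ folder ].to_csv( f'dataset/csv/{folder}.csv' )
  #save to json
  convert_to_json( f'dataset/csv/{folder}.csv' )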

dataframes["combined"].sample( frac = 1 ).head( 10 )

Unnamed: 0,utterance,intent,label
66976,What will the weather be on the 15th of january,Get Weather.,28
16294,Looking for the trailer for Shaolin Temple,Search Creative Work.,31
16952,Show Family Plot,Search Creative Work.,31
8344,book a table for 2 in Gleed,Book Restaurant.,27
23984,Remove any reminders for Spirit and Song choir...,Delete Reminder.,34
32589,what is the reminder for dinner tonight?,Get Reminder.,38
20155,Wish to find the movie the Heart Beat,Search Creative Work.,31
72243,Will I need a jacket today?,Get Weather.,28
45305,Remind me to take my pills at 7 p.m. everyday,Create Reminder.,33
6603,add Totally Stress Free by bobby lord to my pl...,Add To Playlist.,26
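
The JSON files written by `convert_to_json` hold one object per line, each with `src`, `tgt`, and `prefix` fields under a `translation` key. The sketch below shows how such a line might be read back. The helper `parse_example` is hypothetical, and joining `prefix` and `src` into a single model input is an assumption about how the prefix is meant to be used, not something this notebook states.

```python
import json

# One line in the format produced by convert_to_json,
# using a sample utterance from the combined dataset.
line = json.dumps({"translation": {
    "src": "Play easy listening",
    "tgt": "Play Music.",
    "prefix": "intent classification: ",
}})

def parse_example(line):
    """Hypothetical helper: turn one JSON line into an (input, target) pair."""
    record = json.loads(line)["translation"]
    # Assumption: the task prefix is prepended to the utterance.
    return record["prefix"] + record["src"], record["tgt"]

model_input, target = parse_example(line)
print(model_input)
print(target)
```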
