# LSAP Data Preprocessing
---
This notebook contains code for combining the pretraining datasets and creating training, validation, and test splits for them.

In [1]:
import os
import pandas as pd

# set the data folders
data_folders = ['polyai-bank', 'wikihow']

# whether to split the data into train, val, and test sets
SPLIT = False

# train, val, and test split fractions (should sum to 1)
TRAIN_SPLIT = 0.6
VAL_SPLIT   = 0.2
TEST_SPLIT  = 0.2
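
The split fractions above are not applied in the cells shown here. As a minimal sketch of how they could be used, the hypothetical helper below (`split_dataframe` and its `seed` parameter are assumptions, not part of the original notebook) shuffles a DataFrame and slices it by position. It is a plain random split, not a stratified one.

```python
import pandas as pd

def split_dataframe(df, train=0.6, val=0.2, test=0.2, seed=42):
    """Shuffle df and slice it into train/val/test parts (hypothetical helper)."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")

    # Shuffle once with a fixed seed so the split is reproducible.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Row counts for the first two parts; the test part takes the remainder.
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)

    return (
        shuffled.iloc[:n_train],
        shuffled.iloc[n_train:n_train + n_val],
        shuffled.iloc[n_train + n_val:],
    )

# Toy data standing in for one of the loaded datasets.
demo = pd.DataFrame({"text": [f"utterance {i}" for i in range(10)], "label": range(10)})
train_df, val_df, test_df = split_dataframe(demo)
print(len(train_df), len(val_df), len(test_df))
```

In the notebook, the call would be guarded by `if SPLIT:` and passed `TRAIN_SPLIT`, `VAL_SPLIT`, and `TEST_SPLIT`.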

### Load and Combine Data
---

Below, we read each dataset's `data.csv` file into a DataFrame and concatenate the results into a single combined DataFrame.

In [2]:
#create dictionary to store data
dataframes = { k : pd.DataFrame() for k in data_folders }

# search each folder for 'data.csv' files
for folder in data_folders:
    # get 'data.csv' in the folder
    folder_path = os.path.join( folder, 'data.csv' )

    # read the file
    df = pd.read_csv( folder_path, index_col=0 )

    # add the dataframe to the dictionary
    dataframes[ folder ] = df

    #name the dataframe
    dataframes[ folder ].name = folder

# combine the dataframes
combined_df = pd.concat( dataframes.values() )
combined_df.name = 'combined'

# add the combined dataframe to the dictionary
dataframes['combined'] = combined_df

# preview 10 randomly chosen rows
dataframes['combined'].sample( frac = 1 ).head( 10 )

| | text | label | label_name |
|---|---|---|---|
| 108999 | Protect your psychological well-being by maint... | 108999 | Live a Long Life. |
| 82178 | Microwave butter, corn syrup, and brown sugar ... | 82178 | Make Sweet Party Mix. |
| 33621 | Calculate your Intense Target Heart Rate (Targ... | 33621 | Calculate Your Target Heart Rate. |
| 85374 | Remove dried mushroom's stems before re-hydrat... | 85374 | Clean Shiitake Mushrooms. |
| 76022 | Supervise your guinea pig closely while they'r... | 76022 | Play with a Guinea Pig. |
| 85529 | Apply cucumber slices on your skin. | 85529 | Turn Sunburn Into a Tan. |
| 82775 | Notice if you tend to punish people when you d... | 82775 | Know if You Are Being Selfish. |
| 6346 | Check the safety regulations of the location y... | 6346 | Make a Sparkler Bomb. |
| 51796 | Shuffle the discard pile if you use up all of ... | 51796 | Play Sorry. |
| 21314 | Alt or Opt-Click on "Merge Visible" to combine... | 21314 | Combine Layers in Photoshop. |


### Clean up Data

In [4]:
import json
import re

from datasets import ClassLabel

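def convert_to_json( csv_file ):
    """Write each row of a CSV file as one JSON object per line."""
    df = pd.read_csv( csv_file )
    # NOTE: every 'csv' in the path is replaced, so 'dataset/csv/x.csv' becomes 'dataset/json/x.json'
    with open( re.sub( r'csv', 'json', csv_file ), 'w' ) as out_data: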
        for _, row in df.iterrows():
            utterance = row["utterance"]
            intent = row["intent"]

            json_obj = json.dumps({"translation":
                {"src": utterance, "tgt": intent, "prefix": "intent classification: "}
            })
            out_data.write(json_obj + '\n')

def convert_intent_labels_to_integers(df):
  """Converts the intent labels in a DataFrame to integers.

  Args:
    df: The DataFrame to convert.

  Returns:
    The converted DataFrame.
  """

  # Drop the original 'label' column; it is regenerated below.
  df = df.drop('label', axis=1)

  # Build a ClassLabel from the unique intent names.
  labels = df['label_name'].unique().tolist()
  ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

  # Map each intent name to its integer id and store it in a new 'label' column.
  def map_label2id(row):
    return ClassLabels.str2int(row)

  df['label'] = df['label_name'].apply(map_label2id)

  # Reset the index of the DataFrame.
  df = df.reset_index(drop=True)

  # Rename the 'text' and 'label_name' columns to 'utterance' and 'intent', respectively.
  df = df.rename(columns={'text': 'utterance', 'label_name': 'intent'})

  return df
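
To illustrate what the label conversion does, the sketch below reproduces the name-to-integer mapping in plain Python: each distinct label gets the index of its first appearance, which matches how `ClassLabel.str2int` looks a name up in `names`. The helper `build_label_map` is hypothetical and only meant as an illustration, not a replacement for `ClassLabel`.

```python
def build_label_map(label_names):
    """Map each distinct label to an integer id in order of first appearance."""
    label2id = {}
    for name in label_names:
        if name not in label2id:
            # The next free id equals the number of labels seen so far.
            label2id[name] = len(label2id)
    return label2id

# Toy intents; the repeated label receives the same id both times.
intents = ["Play Sorry.", "Live a Long Life.", "Play Sorry."]
label2id = build_label_map(intents)
ids = [label2id[name] for name in intents]
print(ids)  # [0, 1, 0]
```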

### Write to JSON
---

Below, we convert the labels of each dataset to integers, save the result as a CSV file, and write the corresponding JSON file.

In [None]:
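# make sure the output directories exist ('dataset/json' is where convert_to_json writes)
os.makedirs( 'dataset/csv', exist_ok=True )
os.makedirs( 'dataset/json', exist_ok=True )

for folder, df in dataframes.items():
  dataframes[ folder ] = convert_intent_labels_to_integers( df )
  # save to csv
  dataframes[ folder ].to_csv( f'dataset/csv/{folder}.csv' )
  # save to json
  convert_to_json( f'dataset/csv/{folder}.csv' )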

dataframes["combined"].sample( frac = 1 ).head( 10 )
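
The files written by `convert_to_json` hold one JSON object per line rather than a single JSON array, so they need to be parsed line by line. The sketch below shows one way to read such a file back for a quick sanity check. The helper `read_json_lines` and the temporary sample file are assumptions made for illustration; in the notebook you would point it at one of the generated files instead.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Parse a file containing one JSON object per line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Write a one-record sample in the same shape that convert_to_json emits.
sample = {
    "translation": {
        "src": "Shuffle the discard pile.",
        "tgt": "Play Sorry.",
        "prefix": "intent classification: ",
    }
}
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps(sample) + "\n")

# Read the file back and inspect the first record.
records = read_json_lines(path)
print(records[0]["translation"]["tgt"])
```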