# Exploring TEACh Data

In [1]:
import os
import sys
import json
import copy

# sys.path.append("../../")

In [2]:
from teach.dataset.definitions import Definitions
from teach.dataset.dataset import Dataset
from teach.dataset.actions import Action_Keyboard, Action_ObjectInteraction

In [3]:
# Edit data directory if changed when using `teach_download`
data_dir = "/media/Blue2TB3/jpei/teach-dataset"

In [4]:
! ls /media/Blue2TB3/jpei/teach-dataset

all_game_files		edh_instances.tar.gz	     images_and_states.tar.gz
all_games.tar.gz	et_pretrained_models.tar.gz  IMAGESLICENSE
baseline_models.tar.gz	experiment_games.tar.gz      tfd_instances.tar.gz
DATALICENSE		games
edh_instances		images


### Definitions

Instantiate a `Definitions` object to access various definitions, mappings of agent IDs and actions to names, as well as task definitions. 
The code uses `Driver` when referring to the `Follower` in the paper. 

In [5]:
definitions = Definitions(version="2.0")
print("Agent IDs to agents: ", definitions.map_agents_id2info)
print("Status IDs to names: ", definitions.map_status_id2name)

Agent IDs to agents:  OrderedDict([(0, OrderedDict([('agent_name', 'Commander'), ('agent_type', 0)])), (1, OrderedDict([('agent_name', 'Driver'), ('agent_type', 1)]))])
Status IDs to names:  OrderedDict([(0, 'Success'), (1, 'Failure')])


Display the mapping of action IDs to action names. Only a subset of these actions is used in TEACh data. To access actions by name, `definitions.map_actions_name2info` is usually more convenient.

In [8]:
print("Action IDs to names:")
for action_id, action in definitions.map_actions_id2info.items():
    print("\t ", action_id, ":", action["action_name"])

Action IDs to names:
"Stop": ""
"Move to": ""
"Forward": ""
"Backward": ""
"Turn Left": ""
"Turn Right": ""
"Look Up": ""
"Look Down": ""
"Pan Left": ""
"Pan Right": ""
"Move Up": ""
"Move Down": ""
"Double Forward": ""
"Double Backward": ""
"Navigation": ""
"Pickup": ""
"Place": ""
"Open": ""
"Close": ""
"ToggleOn": ""
"ToggleOff": ""
"Slice": ""
"Dirty": ""
"Clean": ""
"Fill": ""
"Empty": ""
"Pour": ""
"Break": ""
"BehindAboveOn": ""
"BehindAboveOff": ""
"OpenProgressCheck": ""
"SelectOid": ""
"SearchObject": ""
"Text": ""
"Speech": ""
"Beep": ""


In [7]:
len(definitions.map_actions_id2info)

36
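When only the ID-keyed mapping is at hand, a name-keyed lookup can be derived from it. The sketch below is illustrative: `actions_id2info` is a small hand-written stand-in for `definitions.map_actions_id2info`, its IDs are made up for the example, and `invert_action_lookup` is a hypothetical helper, not part of the TEACh codebase.

```python
# Stand-in for definitions.map_actions_id2info; the IDs are illustrative only.
actions_id2info = {
    0: {"action_name": "Stop"},
    1: {"action_name": "Forward"},
    2: {"action_name": "Pickup"},
}


def invert_action_lookup(id2info):
    """Return a dict mapping action_name -> action_id."""
    name2id = {}
    for action_id, info in id2info.items():
        name = info["action_name"]
        if name in name2id:
            # A duplicate name would make the inverse lookup ambiguous.
            raise ValueError(f"duplicate action name: {name}")
        name2id[name] = action_id
    return name2id


name2id = invert_action_lookup(actions_id2info)
print(name2id["Pickup"])
```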

Tasks are also most convenient to access by name via `definitions.map_tasks_name2info` but can be accessed via ID using `definitions.map_tasks_id2info`. The values of both of these dictionaries are of type `Task_THOR`.  

When a `Definitions` object is instantiated, all tasks defined under `src/teach/meta_data_files/task_definitions` get loaded. The Task Definition Language is explained in Appendix F of the [TEACh paper](https://arxiv.org/pdf/2110.00534.pdf). To create a new task, create a new JSON file under `src/teach/meta_data_files/task_definitions`. Each task needs to have a unique `task_id` and `task_name`. Tasks can be referenced in other tasks by their `task_name`. After creating a new task, test that it can be loaded and that any inter-task dependencies can be resolved by instantiating a `Definitions` object.
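Before instantiating `Definitions`, the uniqueness requirement can be checked on the raw task dictionaries. This is a minimal sketch, assuming each task JSON has already been loaded into a dict with `task_id` and `task_name` keys; `find_duplicate_task_fields` and the "My New Task" entry are hypothetical and only serve the example.

```python
from collections import Counter


def find_duplicate_task_fields(task_dicts):
    """Return the task_id and task_name values that occur more than once."""
    ids = Counter(t["task_id"] for t in task_dicts)
    names = Counter(t["task_name"] for t in task_dicts)
    return {
        "task_id": sorted(k for k, n in ids.items() if n > 1),
        "task_name": sorted(k for k, n in names.items() if n > 1),
    }


# Two existing tasks plus a hypothetical new one that reuses an ID.
tasks = [
    {"task_id": 304, "task_name": "Candles"},
    {"task_id": 303, "task_name": "Salad"},
    {"task_id": 303, "task_name": "My New Task"},
]
duplicates = find_duplicate_task_fields(tasks)
print(duplicates)
```

An empty list for both keys means the new task does not clash with any existing one.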

The following code snippet demonstrates how to print a few task details. Note that `#n` (where `n` is a number) indicates a variable.

In [8]:
print("Task details by name:")
print("Task name".ljust(33, " "), "Task ID".ljust(10, " "), "Num task params".ljust(20, " "), "Task component names")
for task_name, task in definitions.map_tasks_name2info.items():
    print(
        task_name.ljust(35, " "),
        str(task.task_id).ljust(15, " "),
        str(task.task_nparams).ljust(10, " "),
        str(list(task.components.keys())),
    )

Task details by name:
Task name                         Task ID    Num task params      Task component names
Candles                             304             0          ['candles', 'bathtub']
Breakfast                           301             14         ['coffee', 'toast', 'potatoes', 'apple', 'sandwich', 'salad', 'serving_spot']
Salad                               303             3          ['lettuce', 'tomato', 'potato', 'plate']
Put All X In One Y                  111             3          ['#0', '#2']
N Cooked Slices Of X In Y           107             4          ['#1', '#3']
Custom Properties Kitchen Tasks     405             0          ['boiled_potato', 'poached_egg']
Boil X                              112             1          ['boiled_#0']
Workspace                           305             3          ['writing', 'laptop', 'book', 'gather_spot', 'lights']
Toggle X All Y                      116             3          ['#1']
Plate Of Toast                      106        

### Gameplay Sessions
Gameplay sessions are stored in `json` files. The `games` subdirectory contains one subdirectory per split, each holding the game files of that split. Once loaded, a game is a plain dictionary, and for many purposes it is sufficient to analyze that dictionary directly. Some examples:
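To get a quick overview of that layout, the game files can be counted per split. The sketch below assumes the files end in `.game.json`, as in the example path used in this notebook; `count_games_per_split` is a hypothetical helper, and the demo runs on a temporary directory rather than the real dataset.

```python
import os
import tempfile


def count_games_per_split(games_dir):
    """Map each split subdirectory to its number of *.game.json files."""
    counts = {}
    for split in sorted(os.listdir(games_dir)):
        split_dir = os.path.join(games_dir, split)
        if os.path.isdir(split_dir):
            counts[split] = sum(
                name.endswith(".game.json") for name in os.listdir(split_dir)
            )
    return counts


# Demo on a throwaway directory that mimics the games/<split>/ layout.
with tempfile.TemporaryDirectory() as tmp:
    for split, n_files in [("train", 2), ("valid_seen", 1)]:
        os.makedirs(os.path.join(tmp, split))
        for i in range(n_files):
            with open(os.path.join(tmp, split, f"game{i}.game.json"), "w") as h:
                h.write("{}")
    demo_counts = count_games_per_split(tmp)

print(demo_counts)
```

On the real data this would be called as `count_games_per_split(os.path.join(data_dir, "games"))`.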

In [12]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
print(game_dict.keys())

dict_keys(['version', 'task_type', 'comments', 'definitions', 'tasks'])


While the game dictionary contains other keys, the important one is `tasks`. `version`, `task_type` and `comments` are dataset-specific metadata, and `definitions` contains the version of the `Definitions` object used to collect the data. However, all games in the `games` subdirectory have been verified to be replayable and to result in task success using the current (released) version of the `Definitions` object. `tasks` is always a list of length 1 in this dataset.

In [13]:
print(game_dict["tasks"][0].keys())

dict_keys(['task_id', 'task_name', 'task_params', 'task_nparams', 'task_anchor_object', 'desc', 'components', 'relations', 'comments', 'episodes'])


This is a dictionary that can be converted to a `Task_THOR` object. All keys except `episodes` are associated with the task definition and can be better understood by reading Appendix F of the [TEACh paper](https://arxiv.org/pdf/2110.00534.pdf). For all game files in this dataset `game_dict['tasks'][0]['episodes']` will be a list of length 1 and `game_dict['tasks'][0]['episodes'][0]` contains the actual sequence of actions taken in the episode. 
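Since both lists are expected to have length 1, a small accessor can make that assumption explicit instead of indexing with `[0]` everywhere. This is a sketch; `get_single_episode` is a hypothetical helper and the `toy_game` dictionary only mimics the nesting described above.

```python
def get_single_episode(game_dict):
    """Return the only episode of the only task, or raise if the shape differs."""
    tasks = game_dict["tasks"]
    if len(tasks) != 1:
        raise ValueError(f"expected exactly 1 task, found {len(tasks)}")
    episodes = tasks[0]["episodes"]
    if len(episodes) != 1:
        raise ValueError(f"expected exactly 1 episode, found {len(episodes)}")
    return episodes[0]


# Toy dictionary with the same nesting as a loaded game file.
toy_game = {"tasks": [{"task_id": 111, "episodes": [{"episode_id": "demo", "interactions": []}]}]}
episode = get_single_episode(toy_game)
print(episode["episode_id"])
```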

In [14]:
print(game_dict["tasks"][0]["episodes"][0].keys())

dict_keys(['episode_id', 'world', 'world_type', 'commander_embodied', 'initial_state', 'interactions', 'final_state'])


Episodes are used to store the initial and final simulator state, as well as the sequence of actions taken in a gameplay session. The components of an episode are:
* `episode_id` - A unique id
* `world_type` - Type of room which is one of `Kitchen`, `Bedroom`, `Bathroom` and `Living room` 
* `world` - ID of the specific AI2-THOR floor plan used for this gameplay session
* `commander_embodied` - False for all TEACh games
* `initial_state`, `final_state` - Dictionaries consisting of the initial and final state of the world including
    * `time_start` - 
    * `agents` - Position and orientation of each agent/camera at the start and end of the episode
    * `objects` - A list of the state of all objects at the start and end of the episode. Each object is represented by a dictionary whose keys are property names and values are property values.
    * `custom_object_metadata` - A dictionary to track custom properties in our codebase that are not present in AI2-THOR. This is a dictionary with AI2-THOR objectId as key and a dictionary of (custom_property_name, custom_property_value) pairs as values
* `interactions` - An ordered list of interactions that occurred in the environment, each represented by a dictionary of
    * `agent_id` - The agent that took the action
    * `action_id` - Which action was taken
    * `time_start` - Duration of time between start of episode and when this action started
    * `duration` - Duration of time (in sec) taken to execute this action
    * `success` - 1 if the action was successfully executed during data collection and 0 otherwise. An example of a case where `success` might be 0 is if the human annotator tried to pick up an object from too far away 
    * Action specific keys. Some examples include
        * `utterance` for a `Text` action - Stores the text value of the utterance made
        * `pose_delta` and `pose` for a navigation action
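
The interaction fields listed above are enough for simple episode statistics. The sketch below counts interactions per agent, failed actions, and utterances; `summarize_interactions` is a hypothetical helper and `toy_interactions` is hand-written data that only uses the keys described in the list.

```python
from collections import Counter


def summarize_interactions(interactions):
    """Count interactions per agent_id, failed actions, and utterances."""
    per_agent = Counter(i["agent_id"] for i in interactions)
    n_failed = sum(1 for i in interactions if i["success"] == 0)
    n_utterances = sum(1 for i in interactions if "utterance" in i)
    return {
        "per_agent": dict(per_agent),
        "failed": n_failed,
        "utterances": n_utterances,
    }


# Hand-written interactions; action_id values are placeholders.
toy_interactions = [
    {"agent_id": 0, "action_id": 1, "time_start": 1.0, "duration": 1, "success": 0},
    {"agent_id": 0, "action_id": 2, "time_start": 2.5, "duration": 1, "success": 1, "utterance": "hello"},
    {"agent_id": 1, "action_id": 3, "time_start": 4.0, "duration": 1, "success": 1},
]
summary = summarize_interactions(toy_interactions)
print(summary)
```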
        
Code snippet to print out the sequence of actions taken in an episode:

In [15]:
def print_actions_from_game_dict(game_dict, definitions):
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    print(
        "Time Start",
        "Action Success".ljust(15, " "),
        "Agent".ljust(15, " "),
        "Action".ljust(20, " "),
        "Utterance text / Object ID / Object X, Y",
    )
    for interaction in interactions:
        output_str = "".rjust(2, " ")
        output_str += ("%.2f" % interaction["time_start"]).ljust(15, " ")
        output_str += str(interaction["success"]).ljust(10, " ")
        output_str += definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"].ljust(15, " ")
        output_str += definitions.map_actions_id2info[interaction["action_id"]]["action_name"].ljust(20, " ")
        if "utterance" in interaction:
            output_str += interaction["utterance"]
        elif "oid" in interaction and interaction["oid"] is not None:
            output_str += interaction["oid"]
        elif "x" in interaction and "y" in interaction:
            output_str += "(" + str(interaction["x"]) + ", " + str(interaction["y"]) + ")"
        print(output_str)

In [16]:
print_actions_from_game_dict(game_dict, definitions)

Time Start Action Success  Agent           Action               Utterance text / Object ID / Object X, Y
  15.29          0         Commander      OpenProgressCheck   
  27.85          1         Commander      Text                I need the newspaper to be placed on a single table.
  29.49          1         Commander      SelectOid           
  39.11          1         Driver         Text                what should i do
  61.21          1         Driver         Pan Left            
  61.59          1         Driver         Pan Left            
  61.84          1         Driver         Pan Left            
  62.12          1         Commander      Text                I need the newspaper placed on a single table.
  70.16          1         Driver         Pickup              Newspaper|-04.15|+00.36|-02.48
  87.74          1         Driver         Place               CoffeeTable|-02.47|+00.00|-02.49
  92.55          1         Commander      OpenProgressCheck   


Note that for all object interactions, the relative coordinates of the object on the agent's egocentric image are available in `interaction['x']` and `interaction['y']`. When the wrapper was able to resolve these coordinates to an object ID using the segmentation frame, the ID of the object interacted with is also available in `interaction['oid']`; when the wrapper had to back off to raycasting, it is not.
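A consumer of the data therefore needs a fallback for interactions without an `oid`. The sketch below prefers the object ID and otherwise converts the relative coordinates to pixels. It assumes that `x` and `y` are fractions in [0, 1] along the image width and height respectively, which is an assumption of this example and should be checked against the data; `describe_target` and the 900x900 image size are likewise hypothetical.

```python
def describe_target(interaction, image_width, image_height):
    """Return ("oid", id) when available, else ("pixel", (col, row)), else None."""
    oid = interaction.get("oid")
    if oid is not None:
        return ("oid", oid)
    if "x" in interaction and "y" in interaction:
        # Assumption: x and y are fractions of the image width and height.
        col = round(interaction["x"] * image_width)
        row = round(interaction["y"] * image_height)
        return ("pixel", (col, row))
    return None


resolved = describe_target({"oid": "Newspaper|-04.15|+00.36|-02.48", "x": 0.5, "y": 0.25}, 900, 900)
fallback = describe_target({"oid": None, "x": 0.5, "y": 0.25}, 900, 900)
print(resolved)
print(fallback)
```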

It is also possible to import a game file into a `Dataset` object as follows.

In [17]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
game = Dataset.import_json(f)

The same action info can also be printed in a more compact form, with one line per utterance, action, or object. The snippet below still works directly on the game dictionary:

In [18]:
def extract_actions_context_from_game_dict(game_dict, definitions):
    # Print utterances as "<agent>: <text>". Every other interaction is printed as
    # an "action" line, followed by an "object" line when a target is known.
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]

    for interaction in interactions:
        agent_name = definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"]
        action_name = definitions.map_actions_id2info[interaction["action_id"]]["action_name"]

        if action_name == "Text":
            print(f'{agent_name}: {interaction.get("utterance", "")}')
            continue

        # Action name followed by its success flag (1 or 0)
        print("action", action_name, interaction["success"])

        if interaction.get("oid") is not None:
            # The wrapper resolved the click to an object ID
            print("object", interaction["oid"])
        elif "x" in interaction and "y" in interaction:
            # Only the relative image coordinates are available
            print("object", f'({interaction["x"]}, {interaction["y"]})')

extract_actions_context_from_game_dict(game_dict, definitions)

action OpenProgressCheck 0
Commander: I need the newspaper to be placed on a single table.
action SelectOid 1
Driver: what should i do
action Pan Left 1
action Pan Left 1
action Pan Left 1
Commander: I need the newspaper placed on a single table.
action Pickup 1
object Newspaper|-04.15|+00.36|-02.48
action Place 1
object CoffeeTable|-02.47|+00.00|-02.49
action OpenProgressCheck 1


In [19]:
import pandas as pd

def create_dataframe_from_game_dict(game_dict, definitions):
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    
    data = []
    for interaction in interactions:
        row = {
            "Time Start": "%.2f" % interaction["time_start"],
            "Action Success": str(interaction["success"]),
            "Agent": definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"],
            "Action": definitions.map_actions_id2info[interaction["action_id"]]["action_name"]
        }
        
        if "utterance" in interaction:
            row["Utterance"] = interaction["utterance"]
        elif "oid" in interaction and interaction["oid"] is not None:
            row["Utterance"] = interaction["oid"]
        elif "x" in interaction and "y" in interaction:
            row["Utterance"] = f"({interaction['x']}, {interaction['y']})"
        
        data.append(row)
    
    df = pd.DataFrame(data)
    return df

# Example usage
# game_dict = {"tasks": [{"episodes": [{"interactions": [{"time_start": 1.23, "success": True, "agent_id": 1, "action_id": 2}]}]}]}
# definitions = {
#     "map_agents_id2info": {1: {"agent_name": "Agent1"}},
#     "map_actions_id2info": {2: {"action_name": "Action2"}}
# }

df = create_dataframe_from_game_dict(game_dict, definitions)
df.fillna('')




Unnamed: 0,Time Start,Action Success,Agent,Action,Utterance
0,15.29,0,Commander,OpenProgressCheck,
1,27.85,1,Commander,Text,I need the newspaper to be placed on a single ...
2,29.49,1,Commander,SelectOid,
3,39.11,1,Driver,Text,what should i do
4,61.21,1,Driver,Pan Left,
5,61.59,1,Driver,Pan Left,
6,61.84,1,Driver,Pan Left,
7,62.12,1,Commander,Text,I need the newspaper placed on a single table.
8,70.16,1,Driver,Pickup,Newspaper|-04.15|+00.36|-02.48
9,87.74,1,Driver,Place,CoffeeTable|-02.47|+00.00|-02.49


In [28]:
def convert_to_action_finetuned_df(df):
    # Prepare the fine-tune dataset
    finetune_dataset = []
    
    for i in range(1, len(df)):
        prev_actions = df.loc[:i - 1, 'Action'].astype(str).tolist()
        prev_attributes = df.loc[:i - 1, 'Utterance'].astype(str).tolist()
        current_action = df.loc[i, 'Action']
        
        input_sequence = ", ".join(filter(None, prev_actions + prev_attributes))
        
        data_point = {'input': input_sequence, 'output': current_action}
        finetune_dataset.append(data_point)
    
    finetune_df = pd.DataFrame(finetune_dataset)
    finetune_df.replace('nan', pd.NA, inplace=True)
    # Display the fine-tune dataset
    return finetune_df
convert_to_action_finetuned_df(df)



Unnamed: 0,input,output
0,"OpenProgressCheck, nan",Text
1,"OpenProgressCheck, Text, nan, I need the newsp...",SelectOid
2,"OpenProgressCheck, Text, SelectOid, nan, I nee...",Text
3,"OpenProgressCheck, Text, SelectOid, Text, nan,...",Pan Left
4,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",Pan Left
5,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",Pan Left
6,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",Text
7,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",Pickup
8,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",Place
9,"OpenProgressCheck, Text, SelectOid, Text, Pan ...",OpenProgressCheck
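
The prefix construction used above can be reproduced without pandas, which makes it easier to see what each `input` contains: all previous action names, followed by all previous non-empty attributes. This is a minimal sketch of that logic; `build_prefix_examples` is a hypothetical helper, and missing attributes are represented as empty strings so that `filter(None, ...)` drops them.

```python
def build_prefix_examples(actions, attributes):
    """For each step i >= 1, pair the history before i with the action at i."""
    examples = []
    for i in range(1, len(actions)):
        # History = previous action names, then previous attributes.
        history = actions[:i] + attributes[:i]
        examples.append(
            {"input": ", ".join(filter(None, history)), "output": actions[i]}
        )
    return examples


actions = ["OpenProgressCheck", "Text", "Pickup"]
attributes = ["", "I need the newspaper", "Newspaper|-04.15|+00.36|-02.48"]
examples = build_prefix_examples(actions, attributes)
for example in examples:
    print(example)
```

With empty strings instead of `NaN`, no literal `nan` ends up in the input sequence.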


In [94]:
from tqdm.notebook import tqdm

def create_action_dataset(mode='train'):
    # Collect every game file of the requested split, e.g. games/train/
    split_dir = os.path.join(data_dir, "games", mode)
    action_json_files = [os.path.join(split_dir, f) for f in os.listdir(split_dir)]
    # One fine-tuning DataFrame per game
    dfs = []
    for f in tqdm(action_json_files):
        with open(f) as h:
            game_dict = json.load(h)
        df = create_dataframe_from_game_dict(game_dict, definitions)
        # fillna returns a new DataFrame, so assign it back; otherwise missing
        # values end up as the literal string "nan" in the input sequences
        df = df.fillna('')
        df = convert_to_action_finetuned_df(df)
        dfs.append(df)
    # Merge all DataFrames into a single DataFrame
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df
teach_action_train_dataset = create_action_dataset(mode='train')
teach_action_valid_dataset = create_action_dataset(mode='valid_seen')
teach_action_test_dataset = create_action_dataset(mode='valid_unseen')

  0%|          | 0/1482 [00:00<?, ?it/s]

  0%|          | 0/181 [00:00<?, ?it/s]

  0%|          | 0/612 [00:00<?, ?it/s]

In [96]:
# Import the Hugging Face Dataset under an alias so that it does not shadow
# teach.dataset.dataset.Dataset, which was imported at the top of the notebook
from datasets import Dataset as HFDataset, DatasetDict
train_dataset = HFDataset.from_pandas(teach_action_train_dataset)
eval_dataset = HFDataset.from_pandas(teach_action_valid_dataset)
test_dataset = HFDataset.from_pandas(teach_action_test_dataset)
teach_action_dataset = DatasetDict({"train": train_dataset, "validation": eval_dataset, "test": test_dataset})

In [97]:
teach_action_dataset.push_to_hub("Jiahuan/teach_action")


In [50]:
def convert_to_object_finetuned_df(df, doc_id):
    # Pair every object interaction with the most recent utterance (the "query")
    llm_finetune_dataset = []
    current_query = None

    for _, row in df.iterrows():
        if row['Action'] == 'Text':
            current_query = row['Utterance']
        elif isinstance(row['Utterance'], str) and '|' in row['Utterance']:
            # Object IDs contain '|', e.g. "Newspaper|-04.15|+00.36|-02.48"
            llm_finetune_dataset.append({
                'doc_id': doc_id,
                'start_time': row['Time Start'],
                'query': current_query,
                'action': row['Action'],
                'action_success': row['Action Success'],
                'object': row['Agent'] + ': ' + row['Utterance'],
            })

    # Create a new DataFrame for LLM fine-tuning
    llm_finetune_df = pd.DataFrame(llm_finetune_dataset)
    return llm_finetune_df
convert_to_object_finetuned_df(df, doc_id='7d2a79f43e605c36_1657')

Unnamed: 0,doc_id,start_time,query,action,action_success,object
0,7d2a79f43e605c36_1657,70.16,I need the newspaper placed on a single table.,Pickup,1,Driver: Newspaper|-04.15|+00.36|-02.48
1,7d2a79f43e605c36_1657,87.74,I need the newspaper placed on a single table.,Place,1,Driver: CoffeeTable|-02.47|+00.00|-02.49
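
The object IDs in the `object` column can be split into their parts when only the object type is needed. The sketch below assumes the pattern visible in the outputs of this notebook: an object type, three signed numbers, and optionally a further suffix, all separated by `|`. Reading the three numbers as a position is an assumption of this example, and `parse_object_id` is a hypothetical helper.

```python
def parse_object_id(oid):
    """Split an ID such as "Newspaper|-04.15|+00.36|-02.48" into its parts."""
    parts = oid.split("|")
    if len(parts) < 4:
        raise ValueError(f"unexpected object ID format: {oid!r}")
    # float() accepts the explicit sign and leading zeros used in the IDs.
    x, y, z = (float(p) for p in parts[1:4])
    # Anything after the three numbers is kept verbatim, or None if absent.
    suffix = "|".join(parts[4:]) or None
    return {"type": parts[0], "numbers": (x, y, z), "suffix": suffix}


newspaper = parse_object_id("Newspaper|-04.15|+00.36|-02.48")
bread = parse_object_id("Bread|-00.25|+00.80|+00.81|BreadSliced_4")
print(newspaper)
print(bread)
```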


In [51]:
from tqdm.notebook import tqdm

def create_object_dataset(mode='train'):
    # Collect every game file of the requested split, e.g. games/train/
    split_dir = os.path.join(data_dir, "games", mode)
    action_json_files = [os.path.join(split_dir, f) for f in os.listdir(split_dir)]
    # One fine-tuning DataFrame per game
    dfs = []
    for f in tqdm(action_json_files):
        with open(f) as h:
            game_dict = json.load(h)
        # The file name (including the .game.json extension) serves as document ID
        doc_id = os.path.basename(f)
        df = create_dataframe_from_game_dict(game_dict, definitions)
        # fillna returns a new DataFrame, so assign it back
        df = df.fillna('')
        df = convert_to_object_finetuned_df(df, doc_id)
        dfs.append(df)
    # Merge all DataFrames into a single DataFrame
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df
teach_object_train_dataset = create_object_dataset(mode='train')
teach_object_valid_dataset = create_object_dataset(mode='valid_seen')
teach_object_test_dataset = create_object_dataset(mode='valid_unseen')

  0%|          | 0/1482 [00:00<?, ?it/s]

  0%|          | 0/181 [00:00<?, ?it/s]

  0%|          | 0/612 [00:00<?, ?it/s]

In [53]:
# Import the Hugging Face Dataset under an alias so that it does not shadow
# teach.dataset.dataset.Dataset, which was imported at the top of the notebook
from datasets import Dataset as HFDataset, DatasetDict
train_dataset = HFDataset.from_pandas(teach_object_train_dataset)
eval_dataset = HFDataset.from_pandas(teach_object_valid_dataset)
test_dataset = HFDataset.from_pandas(teach_object_test_dataset)
teach_object_dataset = DatasetDict({"train": train_dataset, "validation": eval_dataset, "test": test_dataset})

In [54]:
train_dataset.to_pandas()

Unnamed: 0,doc_id,start_time,query,action,action_success,object
0,0008f3c95e006303_2053.game.json,124.61,Good day! We are preparing breakfast. We fir...,Pickup,1,Driver: Mug|-02.43|+00.59|+00.17
1,0008f3c95e006303_2053.game.json,145.61,The mug is located under the sink,Place,1,Driver: Sink|+00.02|+00.77|-01.71
2,0008f3c95e006303_2053.game.json,151.78,The mug is located under the sink,ToggleOn,1,Driver: Faucet|-00.19|+00.92|-01.75
3,0008f3c95e006303_2053.game.json,156.92,The mug is located under the sink,ToggleOff,1,Driver: Faucet|-00.19|+00.92|-01.75
4,0008f3c95e006303_2053.game.json,160.70,Oh you found one! Okay.,Pickup,1,Driver: Mug|-02.43|+00.59|+00.17
...,...,...,...,...,...,...
32482,ffeaead76b9103a8_d411.game.json,157.73,plate is inside the fridge,Place,1,Driver: Fridge|+02.10|+00.00|-00.28
32483,ffeaead76b9103a8_d411.game.json,162.81,plate is inside the fridge,Pickup,1,Driver: Plate|+02.16|+00.60|-00.37
32484,ffeaead76b9103a8_d411.game.json,180.45,plate is inside the fridge,Place,1,Driver: CounterTop|+01.07|+00.97|+02.67
32485,ffeaead76b9103a8_d411.game.json,185.14,plate is inside the fridge,Pickup,1,Driver: Bread|-00.25|+00.80|+00.81|BreadSliced_4


In [55]:
teach_object_dataset

DatasetDict({
    train: Dataset({
        features: ['doc_id', 'start_time', 'query', 'action', 'action_success', 'object'],
        num_rows: 32487
    })
    validation: Dataset({
        features: ['doc_id', 'start_time', 'query', 'action', 'action_success', 'object'],
        num_rows: 4139
    })
    test: Dataset({
        features: ['doc_id', 'start_time', 'query', 'action', 'action_success', 'object'],
        num_rows: 13738
    })
})

In [46]:
import huggingface_hub
huggingface_hub.interpreter_login()




In [56]:
teach_object_dataset.push_to_hub("Jiahuan/teach_object")

DEBUG:urllib3.connectionpool:Resetting dropped connection: huggingface.co
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/repos/create HTTP/1.1" 409 110


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/33 [00:00<?, ?ba/s]

DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/datasets/Jiahuan/teach_object/preupload/main HTTP/1.1" 200 96
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/Jiahuan/teach_object.git/info/lfs/objects/batch HTTP/1.1" 200 937
DEBUG:urllib3.connectionpool:Resetting dropped connection: hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com
DEBUG:urllib3.connectionpool:https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443 "PUT /repos/76/b4/76b4a91ebe66b939c8dae1b4e6f5f710c4cbda29c455d7d6d165208c7ac8709e/3b91f3fac828172d4b47770e4eff42d28b7d5b0444a20fdd0a384e9741aa2781?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQFN2FTF47%2F20231216%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231216T001247Z&X-Amz-Expires=900&X-Amz-Signature=ea5fc0c61706860c4ffbb416722096ef24041c4a06ca8d91bb0487d5f9a07f6f&X-Amz-SignedHeaders=host&x-amz-storage-class=INTELLIGENT_TIERING&x-id=PutObject HTTP/1.1" 200 0
DEBUG:urllib

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /api/datasets/Jiahuan/teach_object/preupload/main HTTP/1.1" 200 101
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/Jiahuan/teach_object.git/info/lfs/objects/batch HTTP/1.1" 200 936
DEBUG:urllib3.connectionpool:https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com:443 "PUT /repos/76/b4/76b4a91ebe66b939c8dae1b4e6f5f710c4cbda29c455d7d6d165208c7ac8709e/438444eec80aa2f816f16fbfd2fe0201541da31288f1eb65b5231be2574ad898?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQFN2FTF47%2F20231216%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231216T001249Z&X-Amz-Expires=900&X-Amz-Signature=6cca559620a5baf8c8a22f494890616c7bb539451d7733d4155949f40e757f1e&X-Amz-SignedHeaders=host&x-amz-storage-class=INTELLIGENT_TIERING&x-id=PutObject HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "POST /datasets/Jiahuan/teach_object.git/info/lfs/objects/verif

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/14 [00:00<?, ?ba/s]


README.md:   0%|          | 0.00/596 [00:00<?, ?B/s]



In [109]:
game_dict["tasks"][0].keys()

dict_keys(['task_id', 'task_name', 'task_params', 'task_nparams', 'task_anchor_object', 'desc', 'components', 'relations', 'comments', 'episodes'])

In [11]:
episode_data = game_dict["tasks"][0]["episodes"][0]

In [12]:
episode_data.keys()

dict_keys(['episode_id', 'world', 'world_type', 'commander_embodied', 'initial_state', 'interactions', 'final_state'])

In [13]:
episode_data['world'], episode_data['world_type']

('FloorPlan209_physics', None)

In [14]:
episode_data['commander_embodied']

'False'

In [15]:
episode_data['initial_state'].keys()

dict_keys(['time_start', 'agents', 'objects', 'custom_object_metadata'])

In [16]:
episode_data['final_state'].keys()

dict_keys(['time_start', 'agents', 'objects', 'custom_object_metadata'])

In [17]:
# A list of 2 dicts (the third-party camera and the agent)
episode_data['initial_state']['agents']

[{'thirdPartyCameraId': 0,
  'position': {'x': -5.5, 'y': 0.9027014970779419, 'z': -2.0},
  'rotation': {'x': -0.0, 'y': 270.0, 'z': 0.0},
  'fieldOfView': 90.0},
 {'name': 'agent',
  'position': {'x': -2.75, 'y': 0.9027014970779419, 'z': -3.5},
  'rotation': {'x': -0.0, 'y': 0.0, 'z': 0.0},
  'cameraHorizon': 30.000003814697266,
  'isStanding': True,
  'inHighFrictionArea': True}]

In [105]:
episode_data['final_state']['agents']

[{'thirdPartyCameraId': 0,
  'position': {'x': 1.25, 'y': 0.9010001420974731, 'z': -0.25},
  'rotation': {'x': -0.0, 'y': 90.0, 'z': 0.0},
  'fieldOfView': 90.0},
 {'name': 'agent',
  'position': {'x': 1.0, 'y': 0.9009994268417358, 'z': 2.0},
  'rotation': {'x': -0.0, 'y': 90.00000762939453, 'z': 0.0},
  'cameraHorizon': 6.597455922019435e-06,
  'isStanding': True,
  'inHighFrictionArea': False}]

In [106]:
# A list of 47 object dicts
episode_data['initial_state']['objects']

[{'name': 'Bowl_dfba074d(Clone)_copy_45',
  'position': {'x': -0.033100008964538574,
   'y': 0.7424726486206055,
   'z': 2.0614430904388428},
  'rotation': {'x': -0.0, 'y': 0.0, 'z': 0.0},
  'visible': False,
  'obstructed': True,
  'receptacle': True,
  'toggleable': False,
  'isToggled': False,
  'breakable': True,
  'isBroken': False,
  'canFillWithLiquid': True,
  'isFilledWithLiquid': True,
  'dirtyable': True,
  'isDirty': True,
  'canBeUsedUp': False,
  'isUsedUp': False,
  'cookable': False,
  'isCooked': False,
  'ObjectTemperature': 'RoomTemp',
  'canChangeTempToHot': False,
  'canChangeTempToCold': False,
  'sliceable': False,
  'isSliced': False,
  'openable': False,
  'isOpen': False,
  'openness': 0.0,
  'pickupable': True,
  'isPickedUp': False,
  'moveable': False,
  'mass': 0.4699999988079071,
  'salientMaterials': ['Ceramic'],
  'receptacleObjectIds': [],
  'distance': 2.989400863647461,
  'objectType': 'Bowl',
  'objectId': 'Bowl|-00.03|+00.74|+02.06',
  'parentRecep

In [107]:
# A list of 47 object dicts
episode_data['final_state']['objects']

[{'name': 'Bread_27_Slice_7',
  'position': {'x': -0.33846086263656616,
   'y': 0.7977312803268433,
   'z': 0.6675292253494263},
  'rotation': {'x': 359.7679443359375,
   'y': 31.615781784057617,
   'z': 359.650634765625},
  'visible': False,
  'obstructed': True,
  'receptacle': False,
  'toggleable': False,
  'isToggled': False,
  'breakable': False,
  'isBroken': False,
  'canFillWithLiquid': False,
  'isFilledWithLiquid': False,
  'dirtyable': False,
  'isDirty': False,
  'canBeUsedUp': False,
  'isUsedUp': False,
  'cookable': True,
  'isCooked': False,
  'ObjectTemperature': 'RoomTemp',
  'canChangeTempToHot': False,
  'canChangeTempToCold': False,
  'sliceable': False,
  'isSliced': False,
  'openable': False,
  'isOpen': False,
  'openness': 0.0,
  'pickupable': True,
  'isPickedUp': False,
  'moveable': False,
  'mass': 0.05829999968409538,
  'salientMaterials': ['Food'],
  'receptacleObjectIds': None,
  'distance': 1.8914598226547241,
  'objectType': 'BreadSliced',
  'objectI

In [None]:
# 57 entries
episode_data['initial_state']['custom_object_metadata']

In [None]:
# 57 entries
len(episode_data['final_state']['custom_object_metadata'])

Note that while the object-oriented representation of the game is easier to manipulate in code, it does not load the game's task perfectly. Specifically, when a game file is loaded, no attempt is made to resolve task components that are themselves tasks. Additionally, the final state is not loaded. The following code snippet shows how to check whether the task associated with a gameplay session is complete at the final state by loading the game JSON file directly as a dictionary.

In [None]:
definitions = Definitions(version="2.0")
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
game_task = game_dict["tasks"][0]
task_to_check = copy.deepcopy(
    definitions.map_tasks_name2info[game_task["task_name"]]
)  # Copying is important if you're sharing a definitions object across calls
task_to_check.task_params = game_task["task_params"]
final_state_objects = game_dict["tasks"][0]["episodes"][0]["final_state"]["objects"]
task_check_output = task_to_check.check_episode_progress(final_state_objects)
print(task_check_output["success"])

The utterances in successful human-human sessions in the TEACh dataset are now annotated with dialog acts. This was done in two steps: first, utterances were corrected to fix spelling mistakes, expand contractions, and resolve some other issues; the corrected utterances were then annotated with dialog acts. An utterance can contain more than one dialog act, in which case it is divided into segments, one per dialog act. The following code snippet prints the original utterance, the corrected utterance, and each segment with its associated dialog act.  
**Note: Currently the object-oriented representation does not load dialog act annotations. If you wish to use the dialog act annotations, please load the game JSON file directly into a dictionary.**

In [None]:
def print_utterances_and_dialog_acts(game_dict, definitions):
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    for interaction in interactions:
        if "utterance" in interaction:
            output_str = ""
            output_str += definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"].ljust(15, " ")
            output_str += "Utterance: " + interaction["utterance"] + "\n"
            output_str += "".ljust(15, " ") + "Corrected: " + interaction["corrected_utterance"] + "\n"
            output_str += "".ljust(15, " ") + "DAs with segments: \n"
            for idx in range(len(interaction["da_metadata"]["das"])):
                # interaction["da_metadata"]["text_segments"] and interaction["da_metadata"]["das"] are lists of length 3
                # If an utterance has fewer than 3 DAs then the extra segments and DAs are empty
                # No utterance has more than 3 DAs
                utt_segment = interaction["da_metadata"]["text_segments"][idx]
                da = interaction["da_metadata"]["das"][idx]
                if len(da) > 0:
                    output_str += "".ljust(30, " ") + da + ": " + utt_segment + "\n"
            print(output_str + "\n")

In [None]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
print_utterances_and_dialog_acts(game_dict, definitions)
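If you want the dialog acts as data rather than printed text, a small helper along the following lines may be convenient. This is only a sketch: `extract_da_segments` is a hypothetical helper name, and the interaction below is hand-made for illustration (its dialog act labels are placeholders, not necessarily labels used in TEACh). It relies only on the `da_metadata` layout described in the comments of the cell above, where `das` and `text_segments` are padded with empty strings.

```python
def extract_da_segments(interaction):
    """Return (dialog_act, segment) pairs for one utterance interaction.

    Empty padding entries in `das` are skipped; interactions without
    `da_metadata` yield an empty list.
    """
    da_metadata = interaction.get("da_metadata", {})
    das = da_metadata.get("das", [])
    segments = da_metadata.get("text_segments", [])
    return [(da, segment) for da, segment in zip(das, segments) if da]


# Hand-made example interaction (assumption: labels and text are illustrative only)
example_interaction = {
    "agent_id": 0,
    "utterance": "hi, slice the bread",
    "corrected_utterance": "Hi, slice the bread.",
    "da_metadata": {
        "das": ["Greeting", "Instruction", ""],
        "text_segments": ["Hi,", "slice the bread.", ""],
    },
}

pairs = extract_da_segments(example_interaction)
print(pairs)
```

Applied to a real game, the helper would be called on each entry of `game_dict["tasks"][0]["episodes"][0]["interactions"]` that contains an `"utterance"` key.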

### EDH Instances
EDH instances are stored in `json` files. The `edh_instances` subdirectory consists of one subdirectory per split, each containing the EDH instances of that split. EDH instances do not have a corresponding object-oriented representation and need to be manipulated as dictionaries.

In [None]:
f = os.path.join(data_dir, "edh_instances/train/7d2a79f43e605c36_1657.edh0.json")
with open(f) as h:
    edh_instance = json.load(h)
print(edh_instance.keys())

The components of an EDH instance are:
* `game_id` - ID of the gameplay session this was created from (the filename of a gameplay session file is of the form game_id.game.json)
* `instance_id` - ID of this EDH instance
* `interactions` - Subset of game interactions used to create this EDH instance (note that on the test set `interactions` will be modified so that actions to be predicted will not be included); Utterance interactions now have dialog act information in the same format as in the game
* `pred_start_idx` - Start index of actions to be predicted in `interactions` 
* `dialog_history` - Utterances in dialog history of the EDH instance paired with the speaker for each turn
* `dialog_history_cleaned` - Cleaned version of `dialog_history` with spell correction and removal of utterances commenting on the annotation interface (see Appendix B for details of data cleaning)
* `driver_action_history` - Environment actions provided as history. Each action is represented as a dictionary containing
    * `action_id`, `action_name` of the action according to the action definition
    * `action_idx` - Modified `action_id` to be in the range 0-35 for easier use in prediction (note that this still contains unused actions)
    * `time_start` - Timestamp from `interaction` corresponding to this action
    * `obj_interaction_action` - 1 if the action is an object interaction action and 0 otherwise
    * `oid` - Object ID of the object interacted with; None if the object is unknown or if the action was not an object interaction action
    * `x`, `y` - Relative coordinates on egocentric image indicating the coordinate used for an object interaction action; None if the action was not an object interaction action
* `driver_image_history` - Filename of image file of egocentric driver observation preceding each action in `driver_action_history`, that is, `driver_image_history[idx]` is the filename where the image for the driver's egocentric observation just before taking action `driver_action_history[idx]` is saved. 
* `driver_actions_future` - Environment actions to be predicted; Format is identical to `driver_action_history`; Not available at test time
* `driver_images_future` - Image observations corresponding to environment actions to be predicted; Format is identical to `driver_image_history`; Not available at test time
* `history_subgoals` - Programmatically created sequence of "subgoals" corresponding to environment actions provided as history - this is created by replacing every sequence of navigation actions with an abstract "Navigate" action with the destination as the next object manipulated. 
* `future_subgoals` - Programmatically created sequence of "subgoals" corresponding to environment actions to be predicted; Format identical to `history_subgoals`; Not available at test time
* `expected_init_success` - Should be 1 for all EDH instances; This flag was used to filter EDH instances whose action history could not be reliably replayed
* `expected_init_goal_conditions_total`, `expected_init_goal_conditions_satisfied` - When task completion status is checked, two of the statistics returned are `goal_conditions_total`, which is the number of object properties in the environment that were checked, and `goal_conditions_satisfied`, which is the number of checked object properties that were satisfied. These entries cache the values for these two statistics after replaying all history actions in the EDH instance. For calculating the goal condition success rate metric (GC), the task completion status is checked again after the model-predicted trajectory ends. At this time, along with the final task success rate, we also obtain final values, `final_goal_conditions_total` and `final_goal_conditions_satisfied`. GC is then calculated as `(1.0 - ((final_goal_conditions_total - final_goal_conditions_satisfied) / (expected_init_goal_conditions_total - expected_init_goal_conditions_satisfied)))`
* `init_state_diff` - Differences in object properties between the initial state of the gameplay session and the state at the end of actions taken in the dialog history
* `final_state_diff` - Differences in object properties between the initial state of the gameplay session and the state after playing all ground truth actions in the EDH instance
* `state_changes` - State changes between `init_state_diff` and `final_state_diff` used to construct the task that will be used to evaluate this EDH instance
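
The GC formula given for `expected_init_goal_conditions_total` and `expected_init_goal_conditions_satisfied` can be written out as a short function. This is a minimal sketch, not the evaluation code of the repository: the function name `goal_condition_success` and the numbers in the example are made up for illustration, and returning `None` when no goal conditions remain unsatisfied after replaying the history is an assumption made here to avoid a division by zero.

```python
def goal_condition_success(init_total, init_satisfied, final_total, final_satisfied):
    """Goal condition success rate (GC), following the formula in the list above.

    init_*  : cached values after replaying the history actions
    final_* : values after the model-predicted trajectory ends
    """
    remaining_at_start = init_total - init_satisfied
    remaining_at_end = final_total - final_satisfied
    if remaining_at_start == 0:
        # Assumption: the formula is undefined in this case, so signal it with None
        return None
    return 1.0 - remaining_at_end / remaining_at_start


# Hypothetical numbers: 3 of 4 conditions unsatisfied at the start, 1 of 4 at the end
gc_example = goal_condition_success(
    init_total=4, init_satisfied=1, final_total=4, final_satisfied=3
)
print(gc_example)
```

With these illustrative numbers, two of the three initially unsatisfied conditions were completed, so GC is two thirds.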

For inference and evaluation, we recommend using the provided inference script at [src/teach/cli/inference.py](https://github.com/alexa/teach/blob/main/src/teach/cli/inference.py).

# Prepare dataset for vox fine-tuning
## Trial play

In [None]:
def save_utterances_and_dialog_acts_to_csv(game_dict, definitions):
    data_list = []
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    game_id = '' # ID of the gameplay session this was created from (the filename of a gameplay session file is of the form game_id.game.json)
    instance_id = ''
    episode_id = ''
    for i, interaction in enumerate(interactions):
        if "utterance" in interaction:
            role = definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"]
            utterance = interaction["utterance"].strip()
            da_utterance = ""
            for idx in range(len(interaction["da_metadata"]["das"])):
                utt_segment = interaction["da_metadata"]["text_segments"][idx]
                da = interaction["da_metadata"]["das"][idx]
                if len(da) > 0:
                    da_utterance += da + ": " + utt_segment + "|"
            # print(role, utterance, da_utterance)
            # data_list.append((game_id, instance_id, i, role, utterance, da_utterance))
            data_list.append((game_id, instance_id, i, role, utterance, da_utterance))
    return data_list

In [None]:
definitions = Definitions(version="2.0")

In [None]:
json_files = [os.path.join(data_dir, f"games/train/{f}") for f in os.listdir(os.path.join(data_dir, "games/train"))] # + [os.path.join(data_dir, f"games/valid_seen/{f}") for f in os.listdir(os.path.join(data_dir, "games/valid_seen"))]

In [None]:
len(json_files)

In [None]:
json_files

In [None]:
import pandas as pd
all_data_list = []

for filename in json_files[:5]:
    with open(filename) as h:
        game_dict = json.load(h)
        one_data_list = save_utterances_and_dialog_acts_to_csv(game_dict, definitions)
        all_data_list.extend(one_data_list)

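# Column names follow the tuple layout returned by save_utterances_and_dialog_acts_to_csv
df = pd.DataFrame(all_data_list, columns=['game_id', 'instance_id', 'interaction_idx', 'role', 'utterance', 'da_utterance'])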

In [None]:
df

In [None]:
f = os.path.join(data_dir, "edh_instances/train/c29e584989d391ac_b7d5.edh2.json")
with open(f) as h:
    edh_instance = json.load(h)

In [None]:
edh_instance.keys()

In [None]:
[': '.join(utter) for utter in edh_instance['dialog_history_cleaned']]

In [None]:
edh_instance['game_id'], edh_instance['instance_id']

In [None]:
def generate_role_utterance_pairs(utterances):
    role_utterance_pairs = []
    current_role = None
    current_utterance = ''

    for utterance in utterances:
        role, text = utterance.split(': ', 1)

        if role == current_role:
            # If the current role is the same as the previous one, combine the utterances
            current_utterance += '. ' + text
        else:
            # If the role changes, add the previous combined utterance to the list
            if current_role is not None:
                role_utterance_pairs.append((current_role, current_utterance.strip()))

            # Start a new combined utterance for the new role
            current_role = role
            current_utterance = text

    # Add the last combined utterance to the list (skipped when the input was empty)
    if current_role is not None:
        role_utterance_pairs.append((current_role, current_utterance.strip()))

    return role_utterance_pairs

# Input utterances
utterances = ['Driver: hi', # Human trainee
              'Driver: what should I do?',
              'Commander: today we need to make a salad', # AI trainer
              'Commander: please cut the lettuce using a knife',
              "Driver: what's next?",
              'Commander: please cut the potato using the knife',
              'Driver: did that',
              'Commander: you need to cook the potato slice']

# Merge consecutive utterances by the same role into role-utterance pairs
role_utterance_pairs = generate_role_utterance_pairs(utterances)

# Print the result
for role, utterance in role_utterance_pairs:
    print(f'{role}: {utterance}')


In [None]:
def generate_one_turn_dialogue_history_pairs(role_conversation_pairs):
    dialogue_pairs = []

    current_role = None
    dialogue_history = []

    for role, text in role_conversation_pairs:
        # role, text = utterance.split(': ', 1)

        if role == current_role:
            # If the current role is the same as the previous one, add to the dialogue history
            dialogue_history.append(f"{role}: {text}\n")
        else:
            # If the role changes, create a dialogue history-current utterance pair
            if current_role is not None:
                dialogue_pairs.append((' '.join(dialogue_history), f"{role}: {text}"))

            # Update current role and reset dialogue history
            current_role = role
            dialogue_history = [f"{role}: {text}"]

    # Add the last dialogue history-current utterance pair to the list
    dialogue_pairs.append((' '.join(dialogue_history), ''))

    return dialogue_pairs

# Generate dialogue history-current utterance pairs
dialogue_pairs = generate_one_turn_dialogue_history_pairs(role_utterance_pairs)

# Print the result
for dialogue_history, current_utterance in dialogue_pairs:
    print(f'### Context:\n{dialogue_history}')
    print(f'### Response:\n{current_utterance}\n')
    # print({'input': dialogue_history, 'output': current_utterance})


In [None]:
def generate_all_turn_dialogue_history_pairs(role_conversation_pairs):
    dialogue_pairs = []

    dialogue_history = []

    for role, text in role_conversation_pairs:
        # Add the current utterance to the dialogue history
        dialogue_history.append(f'{role}: {text}')

        # Create a dialogue history-current utterance pair
        dialogue_pairs.append((' '.join(dialogue_history[:-1]), dialogue_history[-1]))

    return dialogue_pairs

In [None]:
# Generate dialogue history-current utterance pairs
full_dialogue_pairs = generate_all_turn_dialogue_history_pairs(role_utterance_pairs)


In [None]:
for dialogue_history, current_utterance in full_dialogue_pairs:
    print(f'### Context:\n{dialogue_history}')
    print(f'### Response:\n{current_utterance}\n')
    # print({'input': dialogue_history, 'output': current_utterance})

## Start preparing!

In [None]:
data_dir = "/media/PampusData/jpei/teach-dataset/edh_instances"
json_files_train = [os.path.join(data_dir, f"train/{f}") for f in os.listdir(os.path.join(data_dir, "train"))] 
json_files_valid = [os.path.join(data_dir, f"valid_seen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_seen"))] 
json_files_test = [os.path.join(data_dir, f"valid_unseen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_unseen"))] 

In [None]:
len(json_files_train), len(json_files_valid), len(json_files_test)

In [None]:
json_files_train[0]

In [None]:
with open(json_files_train[0], 'r') as h:
    edh_instance = json.load(h)
    print(edh_instance['dialog_history_cleaned'])

In [None]:
from tqdm.notebook import tqdm
import json

def generate_finetune_QA_pairs(json_files, data_dir=data_dir, data_name='teach_edh_train.jsonl'):
    data_list = []
    count_pairs = 0
    for f in tqdm(json_files):
        with open(f) as h:
            edh = json.load(h)
            # print(edh['dialog_history_cleaned'])
            utterances = [': '.join(utter) for utter in edh['dialog_history_cleaned']]
            role_utterance_pairs = generate_role_utterance_pairs(utterances)
            full_dialogue_pairs = generate_all_turn_dialogue_history_pairs(role_utterance_pairs)
            for dialogue_history, current_utterance in full_dialogue_pairs:
                # print(f'Dialogue History: {dialogue_history}')
                # print(f'Current Utterance: {current_utterance}\n')
                if dialogue_history!='' and current_utterance!='':
                    count_pairs +=1
                    line = (dialogue_history, current_utterance)
                    # line = {"input": f"{dialogue_history}", "output": f"{current_utterance}"}
                    # line = '{"input": "%s", "output": "%s"}\n' % (dialogue_history, current_utterance)
                    data_list.append(line)
                    # print({"input": dialogue_history, "output": current_utterance})
    
    # Keep only the unique (context, response) pairs
    data_list = [{"input": f"{dialogue_history}", "output": f"{current_utterance}"} for dialogue_history, current_utterance in list(set(data_list))]
    print(data_list[0])
    # Write one JSON object per line so the file content matches its .jsonl extension
    with open(f'{data_dir}/{data_name}', 'w') as fw:
        for record in data_list:
            fw.write(json.dumps(record) + '\n')

    print(f'Dataset {data_name} contains QA pairs: {len(data_list)} unique / {count_pairs} total')

In [None]:
generate_finetune_QA_pairs(json_files_valid, data_dir=data_dir, data_name='teach_edh_valid.jsonl')

In [None]:
generate_finetune_QA_pairs(json_files_train, data_dir=data_dir, data_name='teach_edh_train.jsonl')

In [None]:
generate_finetune_QA_pairs(json_files_test, data_dir=data_dir, data_name='teach_edh_test.jsonl')

In [None]:
! ls "/media/PampusData/jpei/teach-dataset/edh_instances"

In [None]:
from datasets import load_dataset, DatasetDict
teach_data_dir = "/media/PampusData/jpei/teach-dataset/edh_instances"
# teach_data_dir = "/media/PampusData/jpei/teach-dataset/edh_instances"
train_dataset = load_dataset('json', data_files=f'{teach_data_dir}/teach_edh_train.jsonl', split='train')  
eval_dataset = load_dataset('json', data_files=f'{teach_data_dir}/teach_edh_valid.jsonl', split='train')
test_dataset = load_dataset('json', data_files=f'{teach_data_dir}/teach_edh_test.jsonl', split='train')
teach_dataset = DatasetDict({"train":train_dataset, "validation": eval_dataset,"test":test_dataset})

teach_dataset.push_to_hub("Jiahuan/teach_edh", private=True)
# teach_dataset = load_dataset("Jiahuan/teach_edh")

In [None]:
from datasets import load_dataset
cache_dir = '/media/Blue2TB3/datasets'
teach_dataset = load_dataset("Jiahuan/teach_edh", cache_dir=cache_dir)

In [None]:
teach_dataset['train'][0]

In [None]:
train_dataset[0]

# Second version in the Alpaca dataset format


In [1]:
import os
data_dir = "/media/Blue2TB3/jpei/teach-dataset/edh_instances"
json_files_train = [os.path.join(data_dir, f"train/{f}") for f in os.listdir(os.path.join(data_dir, "train"))] 
json_files_valid = [os.path.join(data_dir, f"valid_seen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_seen"))] 
json_files_test = [os.path.join(data_dir, f"valid_unseen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_unseen"))] 

In [59]:
from tqdm.notebook import tqdm
import json
import itertools

def generate_finetune_QA_pairs(json_files, data_dir=data_dir, data_name='teach_edh_train_v2.jsonl'):
    all_data_list = []
    count_pairs = 0
    for f in tqdm(json_files):
        data_list = []
        with open(f) as h:
            edh = json.load(h)
            # print(edh['dialog_history_cleaned']) 
            # [['Commander', 'Good day! We are preparing breakfast. We first need to wash a dirty mug.']]
            for i in range(1, len(edh['dialog_history_cleaned'])):
                if edh['dialog_history_cleaned'][i][0] == 'Commander':
                    if edh['dialog_history_cleaned'][i-1][0] == 'Driver': # Commander is the AI trainer
                        sample = {
                            # "instruction": ': '.join(edh['dialog_history_cleaned'][i-1]), # with the role
                            "instruction": f"{edh['dialog_history_cleaned'][i-1][1]}", 
                            "input": "", 
                            # "output": ': '.join(edh['dialog_history_cleaned'][i]), # with the role 
                            "output": f"{edh['dialog_history_cleaned'][i][1]}",
                            "history": []
                        }
                        if sample not in data_list: # Only add the unique elements
                            count_pairs += 1
                            data_list.append(sample)                        
                    else:
                        pass
                        # sample = {
                        #     "instruction": ': '.join(edh['dialog_history_cleaned'][i-2]), 
                        #     # "instruction": f"{edh['dialog_history_cleaned'][i-2][1]}", 
                        #     "input": "", 
                        #     "output": ': '.join(edh['dialog_history_cleaned'][i-1]) + '. '+edh['dialog_history_cleaned'][i][1], 
                        #     # "output": f"{edh['dialog_history_cleaned'][i-1][1]+'.'+edh['dialog_history_cleaned'][i][1]}", 
                        #     "history": []
                        # }
                elif edh['dialog_history_cleaned'][i][0] == 'Driver':
                    continue
            # Get all history from one conversation
            for i in range(1, len(data_list)):
                # for VLM we consider only QA not history now.
                data_list[i]['history'] = [[q['instruction'], a['output']] for q, a in zip(data_list[:i], data_list[:i])]    
            all_data_list.extend(data_list)
    
    # Remove redundant dict elements
    all_data_list = [a[0] for a in itertools.groupby(all_data_list)]
    
    # Save to the file, one JSON object per line to match the .jsonl extension
    with open(f'{data_dir}/{data_name}', 'w') as fw:
        for sample in all_data_list:
            fw.write(json.dumps(sample) + '\n')
        
    print(f'Dataset {data_name} contains QA pairs: {len(all_data_list)} unique /{count_pairs} total')
    return all_data_list

In [60]:
train_data = generate_finetune_QA_pairs(json_files_train+json_files_valid, data_name='teach_edh_train_v2.jsonl')

  0%|          | 0/6083 [00:00<?, ?it/s]

Dataset teach_edh_train_v2.jsonl contains QA pairs: 17422 unique /18603 total


In [61]:
test_data = generate_finetune_QA_pairs(json_files_test, data_name='teach_edh_test_v2.jsonl')

  0%|          | 0/2149 [00:00<?, ?it/s]

Dataset teach_edh_test_v2.jsonl contains QA pairs: 5552 unique /5987 total


In [62]:
test_data[3]

{'instruction': 'where is the newspaper',
 'input': '',
 'output': 'over the black stand',
 'history': [['hello', 'put the newspaper on the sofa']]}

In [63]:
import os

# Replace 'username/dataset-name' with your actual username and dataset name
dataset_identifier='Jiahuan/teach_edh_v2'
from datasets import load_dataset, DatasetDict, Dataset
from huggingface_hub import login

# Read the Hugging Face API token from the HF_TOKEN environment variable;
# set it outside the notebook and never hard-code a token in source
api_token = os.environ['HF_TOKEN']
# Log in to the Hugging Face Hub
login(token=api_token)

# train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)
train_dataset, test_dataset = Dataset.from_list(train_data), Dataset.from_list(test_data)
dataset = DatasetDict({"train":train_dataset, "test":test_dataset})
dataset.push_to_hub(dataset_identifier)

# Print some information about the dataset
print(dataset)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/jpei/.cache/huggingface/token
Login successful


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/18 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/504 [00:00<?, ?B/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'history'],
        num_rows: 17422
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'history'],
        num_rows: 5552
    })
})
