# Exploring TEACh Data

In [8]:
import os
import sys
import json
import copy

# sys.path.append("../../")

In [9]:
from teach.dataset.definitions import Definitions
from teach.dataset.dataset import Dataset
from teach.dataset.actions import Action_Keyboard, Action_ObjectInteraction

In [10]:
# Edit data directory if changed when using `teach_download`
data_dir = "/media/PampusData/jpei/teach-dataset"

In [11]:
! ls /media/PampusData/jpei/teach-dataset

all_game_files		edh_instances.tar.gz	     images_and_states.tar.gz
all_games.tar.gz	et_pretrained_models.tar.gz  IMAGESLICENSE
baseline_models.tar.gz	experiment_games.tar.gz      tfd_instances.tar.gz
DATALICENSE		games
edh_instances		images


### Definitions

Instantiate a `Definitions` object to access various definitions, mappings of agent IDs and actions to names, as well as task definitions. 
The code uses `Driver` when referring to the `Follower` in the paper. 

In [12]:
definitions = Definitions(version="2.0")
print("Agent IDs to agents: ", definitions.map_agents_id2info)
print("Status IDs to names: ", definitions.map_status_id2name)

Agent IDs to agents:  OrderedDict([(0, OrderedDict([('agent_name', 'Commander'), ('agent_type', 0)])), (1, OrderedDict([('agent_name', 'Driver'), ('agent_type', 1)]))])
Status IDs to names:  OrderedDict([(0, 'Success'), (1, 'Failure')])


Display the mapping of action IDs to action names. Only a subset of these actions is used in TEACh data. To look up an action by name, `definitions.map_actions_name2info` is usually more convenient.

In [13]:
print("Action IDs to names:")
for action_id, action in definitions.map_actions_id2info.items():
    print("\t ", action_id, ":", action["action_name"])

Action IDs to names:
	  0 : Stop
	  1 : Move to
	  2 : Forward
	  3 : Backward
	  4 : Turn Left
	  5 : Turn Right
	  6 : Look Up
	  7 : Look Down
	  8 : Pan Left
	  9 : Pan Right
	  10 : Move Up
	  11 : Move Down
	  12 : Double Forward
	  13 : Double Backward
	  300 : Navigation
	  200 : Pickup
	  201 : Place
	  202 : Open
	  203 : Close
	  204 : ToggleOn
	  205 : ToggleOff
	  206 : Slice
	  207 : Dirty
	  208 : Clean
	  209 : Fill
	  210 : Empty
	  211 : Pour
	  212 : Break
	  400 : BehindAboveOn
	  401 : BehindAboveOff
	  500 : OpenProgressCheck
	  501 : SelectOid
	  502 : SearchObject
	  100 : Text
	  101 : Speech
	  102 : Beep


Tasks are also most convenient to access by name via `definitions.map_tasks_name2info` but can be accessed via ID using `definitions.map_tasks_id2info`. The values of both of these dictionaries are of type `Task_THOR`.  

When a `Definitions` object is instantiated, all tasks defined under `src/teach/meta_data_files/task_definitions` get loaded. The Task Definition Language is explained in Appendix F of the [TEACh paper](https://arxiv.org/pdf/2110.00534.pdf). To create a new task, create a new JSON file under `src/teach/meta_data_files/task_definitions`. Each task needs to have a unique `task_id` and `task_name`. Tasks can be referenced in other tasks by their `task_name`. After creating a new task, test that it can be loaded and that any inter-task dependencies can be resolved by instantiating a `Definitions` object.
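
Before instantiating `Definitions`, the uniqueness requirement on `task_id` and `task_name` can be checked with a small script. The sketch below assumes each task definition file is a single JSON object with top-level `task_id` and `task_name` keys; `find_duplicate_task_keys` is a hypothetical helper, and the two toy task files are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def find_duplicate_task_keys(task_dir):
    """Return task_id and task_name values that occur in more than one JSON file."""
    seen_ids, seen_names = {}, {}
    duplicates = {"task_id": [], "task_name": []}
    for path in sorted(Path(task_dir).glob("*.json")):
        with open(path) as handle:
            task = json.load(handle)
        for key, seen in (("task_id", seen_ids), ("task_name", seen_names)):
            value = task[key]
            if value in seen:
                # Record the value plus the two files that share it
                duplicates[key].append((value, seen[value], path.name))
            else:
                seen[value] = path.name
    return duplicates


# Toy demonstration with two made-up task files sharing a task_id.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(json.dumps({"task_id": 900, "task_name": "Toy Task A"}))
    Path(tmp, "b.json").write_text(json.dumps({"task_id": 900, "task_name": "Toy Task B"}))
    demo_duplicates = find_duplicate_task_keys(tmp)

print(demo_duplicates)
```

This only catches duplicate keys; resolving inter-task dependencies still requires instantiating a `Definitions` object.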

The following code snippet demonstrates how to print a few task details. Note that `#n` (where `n` is a number) indicates a variable.

In [14]:
print("Task details by name:")
print("Task name".ljust(33, " "), "Task ID".ljust(10, " "), "Num task params".ljust(20, " "), "Task component names")
for task_name, task in definitions.map_tasks_name2info.items():
    print(
        task_name.ljust(35, " "),
        str(task.task_id).ljust(15, " "),
        str(task.task_nparams).ljust(10, " "),
        str(list(task.components.keys())),
    )

Task details by name:
Task name                         Task ID    Num task params      Task component names
Candles                             304             0          ['candles', 'bathtub']
Breakfast                           301             14         ['coffee', 'toast', 'potatoes', 'apple', 'sandwich', 'salad', 'serving_spot']
Salad                               303             3          ['lettuce', 'tomato', 'potato', 'plate']
Put All X In One Y                  111             3          ['#0', '#2']
N Cooked Slices Of X In Y           107             4          ['#1', '#3']
Custom Properties Kitchen Tasks     405             0          ['boiled_potato', 'poached_egg']
Boil X                              112             1          ['boiled_#0']
Workspace                           305             3          ['writing', 'laptop', 'book', 'gather_spot', 'lights']
Toggle X All Y                      116             3          ['#1']
Plate Of Toast                      106        

### Gameplay Sessions
Gameplay sessions are stored in `json` files. The `games` subdirectory contains one subdirectory per split, each holding the game files of that split. When loaded, these files are plain dictionaries, and for many purposes it is sufficient to analyze the dictionaries directly. Some examples:

In [15]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
print(game_dict.keys())

dict_keys(['version', 'task_type', 'comments', 'definitions', 'tasks'])


While the game dictionary contains other keys, the important one is `tasks`. `version`, `task_type` and `comments` are dataset-specific metadata, and `definitions` contains the version of the `Definitions` object used to collect the data. However, all games in the subdirectory `games` have been verified to be replayable and to result in task success using the current (released) version of the `Definitions` object. `tasks` is always a list of length 1 in this dataset.

In [16]:
print(game_dict["tasks"][0].keys())

dict_keys(['task_id', 'task_name', 'task_params', 'task_nparams', 'task_anchor_object', 'desc', 'components', 'relations', 'comments', 'episodes'])


This is a dictionary that can be converted to a `Task_THOR` object. All keys except `episodes` are associated with the task definition and can be better understood by reading Appendix F of the [TEACh paper](https://arxiv.org/pdf/2110.00534.pdf). For all game files in this dataset `game_dict['tasks'][0]['episodes']` will be a list of length 1 and `game_dict['tasks'][0]['episodes'][0]` contains the actual sequence of actions taken in the episode. 

In [17]:
print(game_dict["tasks"][0]["episodes"][0].keys())

dict_keys(['episode_id', 'world', 'world_type', 'commander_embodied', 'initial_state', 'interactions', 'final_state'])


Episodes are used to store the initial and final simulator state, as well as the sequence of actions taken in a gameplay session. The components of an episode are:
* `episode_id` - A unique id
* `world_type` - Type of room which is one of `Kitchen`, `Bedroom`, `Bathroom` and `Living room` 
* `world` - ID of the specific AI2-THOR floor plan used for this gameplay session
* `commander_embodied` - False for all TEACh games
* `initial_state`, `final_state` - Dictionaries consisting of the initial and final state of the world including
    * `time_start` - 
    * `agents` - Position and orientation of each agent/camera at the start and end of the episode
    * `objects` - A list of the state of all objects at the start and end of the episode. Each object is represented by a dictionary whose keys are property names and values are property values.
    * `custom_object_metadata` - A dictionary to track custom properties in our codebase that are not present in AI2-THOR. This is a dictionary with AI2-THOR objectId as key and a dictionary of (custom_property_name, custom_property_value) pairs as values
* `interactions` - An ordered list of interactions that occurred in the environment, each represented by a dictionary of
    * `agent_id` - The agent that took the action
    * `action_id` - Which action was taken
    * `time_start` - Duration of time between start of episode and when this action started
    * `duration` - Duration of time (in sec) taken to execute this action
    * `success` - 1 if the action was successfully executed during data collection and 0 otherwise. An example of a case where `success` might be 0 is if the human annotator tried to pick up an object from too far away 
    * Action specific keys. Some examples include
        * `utterance` for a `Text` action - Stores the text value of the utterance made
        * `pose_delta` and `pose` for a navigation action
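
As a small illustration of working with the `interactions` list, the sketch below counts interactions per agent and collects the indices of failed actions. It relies only on the `agent_id` and `success` keys described above; the toy values are made up, and `summarize_interactions` is a hypothetical helper.

```python
from collections import Counter

# Toy interactions using only the keys described above; values are made up.
toy_interactions = [
    {"agent_id": 0, "action_id": 100, "time_start": 27.85, "duration": 1, "success": 1},
    {"agent_id": 1, "action_id": 8, "time_start": 61.21, "duration": 1, "success": 1},
    {"agent_id": 1, "action_id": 200, "time_start": 70.16, "duration": 1, "success": 0},
    {"agent_id": 1, "action_id": 200, "time_start": 72.40, "duration": 1, "success": 1},
]


def summarize_interactions(interactions):
    """Count interactions per agent and collect the indices of failed actions."""
    per_agent = Counter(interaction["agent_id"] for interaction in interactions)
    failed = [idx for idx, interaction in enumerate(interactions) if interaction["success"] == 0]
    return {"per_agent": dict(per_agent), "failed_indices": failed}


summary = summarize_interactions(toy_interactions)
print(summary)
```

On a real game file, the list would come from `game_dict["tasks"][0]["episodes"][0]["interactions"]`.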
        
Code snippet to print out the sequence of actions taken in an episode:

In [18]:
def print_actions_from_game_dict(game_dict, definitions):
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    print(
        "Time Start",
        "Action Success".ljust(15, " "),
        "Agent".ljust(15, " "),
        "Action".ljust(20, " "),
        "Utterance text / Object ID / Object X, Y",
    )
    for interaction in interactions:
        output_str = "".rjust(2, " ")
        output_str += ("%.2f" % interaction["time_start"]).ljust(15, " ")
        output_str += str(interaction["success"]).ljust(10, " ")
        output_str += definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"].ljust(15, " ")
        output_str += definitions.map_actions_id2info[interaction["action_id"]]["action_name"].ljust(20, " ")
        if "utterance" in interaction:
            output_str += interaction["utterance"]
        elif "oid" in interaction and interaction["oid"] is not None:
            output_str += interaction["oid"]
        elif "x" in interaction and "y" in interaction:
            output_str += "(" + str(interaction["x"]) + ", " + str(interaction["y"]) + ")"
        print(output_str)

In [19]:
print_actions_from_game_dict(game_dict, definitions)

Time Start Action Success  Agent           Action               Utterance text / Object ID / Object X, Y
  15.29          0         Commander      OpenProgressCheck   
  27.85          1         Commander      Text                I need the newspaper to be placed on a single table.
  29.49          1         Commander      SelectOid           
  39.11          1         Driver         Text                what should i do
  61.21          1         Driver         Pan Left            
  61.59          1         Driver         Pan Left            
  61.84          1         Driver         Pan Left            
  62.12          1         Commander      Text                I need the newspaper placed on a single table.
  70.16          1         Driver         Pickup              Newspaper|-04.15|+00.36|-02.48
  87.74          1         Driver         Place               CoffeeTable|-02.47|+00.00|-02.49
  92.55          1         Commander      OpenProgressCheck   


Note that for all object interactions, the relative coordinates of the object on the agent's egocentric image are available in `interaction['x'], interaction['y']`. In the cases where the wrapper was able to resolve these to an object ID using the segmentation frame, the ID of the object interacted with is also available in `interaction['oid']`, but if the wrapper was forced to back off to raycasting, it is not.
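
To see how often the object ID was resolved, object interactions can be split into those with an `oid` and those with coordinates only. This is a minimal sketch on made-up data; `split_by_oid_resolution` is a hypothetical helper, and treating "has both `x` and `y`" as the marker of an object interaction is an assumption taken from the printing code above.

```python
def split_by_oid_resolution(interactions):
    """Separate object interactions with a resolved `oid` from those with only (x, y)."""
    resolved, coords_only = [], []
    for interaction in interactions:
        if "x" not in interaction or "y" not in interaction:
            continue  # assumed not to be an object interaction
        if interaction.get("oid") is not None:
            resolved.append(interaction["oid"])
        else:
            coords_only.append((interaction["x"], interaction["y"]))
    return resolved, coords_only


# Toy data: the coordinate values are made up for illustration.
toy_object_interactions = [
    {"action_id": 200, "x": 0.41, "y": 0.63, "oid": "Newspaper|-04.15|+00.36|-02.48"},
    {"action_id": 201, "x": 0.52, "y": 0.70, "oid": None},
    {"action_id": 100, "utterance": "what should i do"},
]

resolved_oids, unresolved_coords = split_by_oid_resolution(toy_object_interactions)
print(len(resolved_oids), "resolved,", len(unresolved_coords), "coordinates only")
```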

It is also possible to import a game file into a `Dataset` object as follows.

In [20]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
game = Dataset.import_json(f)

The following shows how the code snippet to print the same action info would look using the object-oriented representation:

In [21]:
def print_actions_from_game_as_dataset(game, definitions):
    interactions = game.tasks[0].episodes[0].interactions
    print(
        "Time Start",
        "Action Success".ljust(15, " "),
        "Agent".ljust(15, " "),
        "Action".ljust(20, " "),
        "Utterance text / Object ID / Object X, Y",
    )
    for interaction in interactions:
        output_str = "".rjust(2, " ")
        output_str += ("%.2f" % interaction.time_start).ljust(15, " ")
        output_str += str(interaction.status).ljust(10, " ")
        output_str += definitions.map_agents_id2info[interaction.agent_id]["agent_name"].ljust(15, " ")
        output_str += definitions.map_actions_id2info[interaction.action.action_id]["action_name"].ljust(20, " ")
        if isinstance(interaction.action, Action_Keyboard):
            output_str += interaction.action.utterance
        if isinstance(interaction.action, Action_ObjectInteraction):
            if interaction.action.oid is None:
                output_str += "(" + str(interaction.action.x) + ", " + str(interaction.action.y) + ")"
            else:
                output_str += interaction.action.oid
        print(output_str)

In [22]:
print_actions_from_game_as_dataset(game, definitions)

Time Start Action Success  Agent           Action               Utterance text / Object ID / Object X, Y
  15.29          None      Commander      OpenProgressCheck   
  27.85          None      Commander      Text                I need the newspaper to be placed on a single table.
  29.49          None      Commander      SelectOid           
  39.11          None      Driver         Text                what should i do
  61.21          None      Driver         Pan Left            
  61.59          None      Driver         Pan Left            
  61.84          None      Driver         Pan Left            
  62.12          None      Commander      Text                I need the newspaper placed on a single table.
  70.16          None      Driver         Pickup              Newspaper|-04.15|+00.36|-02.48
  87.74          None      Driver         Place               CoffeeTable|-02.47|+00.00|-02.49
  92.55          None      Commander      OpenProgressCheck   


Note that while the object-oriented representation of the game is easier to manipulate in code, the task of the game is not loaded completely. Specifically, when loading a game file, no attempt is made to resolve components of tasks that are themselves tasks. Additionally, the final state is not loaded. The following code snippet shows how to check whether the task associated with a gameplay session is complete at the final state by loading the game JSON file directly as a dictionary.

In [23]:
definitions = Definitions(version="2.0")
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
game_task = game_dict["tasks"][0]
task_to_check = copy.deepcopy(
    definitions.map_tasks_name2info[game_task["task_name"]]
)  # Copying is important if you're sharing a definitions object across calls
task_to_check.task_params = game_task["task_params"]
final_state_objects = game_dict["tasks"][0]["episodes"][0]["final_state"]["objects"]
task_check_output = task_to_check.check_episode_progress(final_state_objects)
print(task_check_output["success"])

True


The utterances in successful human-human sessions in the TEACh dataset are annotated with dialog acts. This was done in two steps: first, utterances were edited to correct spelling mistakes, expand contractions and resolve some other issues; the corrected utterances were then annotated with dialog acts. An utterance can contain more than one dialog act, in which case it is divided into segments, one per dialog act. The following code snippet prints the original utterance, the corrected utterance and each segment with the associated dialog act.  
**Note: Currently the object-oriented representation does not load dialog act annotations. If you wish to use the dialog act annotations, please load the game JSON file directly into a dictionary.**

In [24]:
def print_utterances_and_dialog_acts(game_dict, definitions):
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    for interaction in interactions:
        if "utterance" in interaction:
            output_str = ""
            output_str += definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"].ljust(15, " ")
            output_str += "Utterance: " + interaction["utterance"] + "\n"
            output_str += "".ljust(15, " ") + "Corrected: " + interaction["corrected_utterance"] + "\n"
            output_str += "".ljust(15, " ") + "DAs with segments: \n"
            for idx in range(len(interaction["da_metadata"]["das"])):
                # interaction["da_metadata"]["text_segments"] and interaction["da_metadata"]["das"] are lists of length 3
                # If an utterance has fewer than 3 DAs then the extra segments and DAs are empty
                # No utterance has more than 3 DAs
                utt_segment = interaction["da_metadata"]["text_segments"][idx]
                da = interaction["da_metadata"]["das"][idx]
                if len(da) > 0:
                    output_str += "".ljust(30, " ") + da + ": " + utt_segment + "\n"
            print(output_str + "\n")

In [25]:
f = os.path.join(data_dir, "games/train/7d2a79f43e605c36_1657.game.json")
with open(f) as h:
    game_dict = json.load(h)
print_utterances_and_dialog_acts(game_dict, definitions)

Commander      Utterance: I need the newspaper to be placed on a single table.
               Corrected: I need the newspaper to be placed on a single table.
               DAs with segments: 
                              Instruction: I need the newspaper to be placed on a single table.


Driver         Utterance: what should i do
               Corrected: what should i do
               DAs with segments: 
                              RequestForInstruction: what should i do


Commander      Utterance: I need the newspaper placed on a single table.
               Corrected: I need the newspaper placed on a single table.
               DAs with segments: 
                              Instruction: I need the newspaper placed on a single table.


### EDH Instances
EDH instances are stored in `json` files. The `edh_instances` subdirectory consists of one subdirectory per split each containing EDH instances of that split. EDH instances do not have a corresponding object oriented representation and need to be manipulated as dictionaries.

In [26]:
f = os.path.join(data_dir, "edh_instances/train/7d2a79f43e605c36_1657.edh0.json")
with open(f) as h:
    edh_instance = json.load(h)
print(edh_instance.keys())

dict_keys(['dialog_history', 'driver_action_history', 'driver_image_history', 'driver_actions_future', 'driver_images_future', 'interactions', 'game_id', 'instance_id', 'pred_start_idx', 'init_state_diff', 'final_state_diff', 'state_changes', 'history_subgoals', 'future_subgoals', 'expected_init_goal_conditions_total', 'expected_init_goal_conditions_satisfied', 'dialog_history_cleaned', 'dialog_history_with_das'])


The components of an EDH instance are:
* `game_id` - ID of the gameplay session this was created from (the filename of a gameplay session file is of the form game_id.game.json)
* `instance_id` - ID of this EDH instance
* `interactions` - Subset of game interactions used to create this EDH instance (note that on the test set `interactions` will be modified so that actions to be predicted will not be included); Utterance interactions now have dialog act information in the same format as in the game
* `pred_start_idx` - Start index of actions to be predicted in `interactions` 
* `dialog_history` - Utterances in dialog history of the EDH instance paired with the speaker for each turn
* `dialog_history_cleaned` - Cleaned version of `dialog_history` with spell correction and removal of utterances commenting on the annotation interface (see Appendix B for details of data cleaning)
* `driver_action_history` - Environment actions provided as history. Each action is represented as a dictionary containing
    * `action_id`, `action_name` of the action according to the action definition
    * `action_idx` - Modified `action_id` to be in the range 0-35 for easier use in prediction (note that this still contains unused actions)
    * `time_start` - Timestamp from `interaction` corresponding to this action
    * `obj_interaction_action` - 1 if the action is an object interaction action and 0 otherwise
    * `oid` - Object ID of the object interacted with; None if the object is unknown or if the action was not an object interaction action
    * `x`, `y` - Relative coordinates on the egocentric image indicating the point used for an object interaction action; None if the action was not an object interaction action
* `driver_image_history` - Filename of image file of egocentric driver observation preceding each action in `driver_action_history`, that is, `driver_image_history[idx]` is the filename where the image for the driver's egocentric observation just before taking action `driver_action_history[idx]` is saved. 
* `driver_actions_future` - Environment actions to be predicted; Format is identical to `driver_action_history`; Not available at test time
* `driver_images_future` - Image observations corresponding to environment actions to be predicted; Format is identical to `driver_image_history`; Not available at test time
* `history_subgoals` - Programmatically created sequence of "subgoals" corresponding to environment actions provided as history - this is created by replacing every sequence of navigation actions with an abstract "Navigate" action with the destination as the next object manipulated. 
* `future_subgoals` - Programmatically created sequence of "subgoals" corresponding to environment actions to be predicted; Format identical to `history_subgoals`; Not available at test time
* `expected_init_success` - Should be 1 for all EDH instances; This flag was used to filter EDH instances whose action history could not be reliably replayed
* `expected_init_goal_conditions_total`, `expected_init_goal_conditions_satisfied` - When task completion status is checked, two of the statistics returned are `goal_conditions_total`, which is the number of object properties in the environment that were checked, and `goal_conditions_satisfied`, which is the number of checked object properties that were satisfied. These entries cache the values for these two statistics after replaying all history actions in the EDH instance. For calculating the goal condition success rate metric (GC), the task completion status is checked again after the model-predicted trajectory ends. At this time, along with the final task success rate, we also obtain final values, `final_goal_conditions_total` and `final_goal_conditions_satisfied`. GC is then calculated as `(1.0 - ((final_goal_conditions_total - final_goal_conditions_satisfied) / (expected_init_goal_conditions_total - expected_init_goal_conditions_satisfied)))`
* `init_state_diff` - Differences in object properties between the initial state of the gameplay session and the state at the end of actions taken in the dialog history
* `final_state_diff` - Differences in object properties between the initial state of the gameplay session and the state after playing all ground truth actions in the EDH instance
* `state_changes` - State changes between `init_state_diff` and `final_state_diff` used to construct the task that will be used to evaluate this EDH instance
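
The goal condition success rate formula given above can be written as a short function. This is a sketch of the formula as stated, not the evaluation code itself; `goal_condition_success` is a hypothetical helper, and raising an error when no goal conditions were initially unsatisfied is an assumption, since the text does not say how that case is handled.

```python
def goal_condition_success(init_total, init_satisfied, final_total, final_satisfied):
    """Compute GC = 1 - (final unsatisfied / initially unsatisfied)."""
    init_unsatisfied = init_total - init_satisfied
    final_unsatisfied = final_total - final_satisfied
    if init_unsatisfied == 0:
        # Assumption: the formula is undefined here, so fail loudly
        raise ValueError("No unsatisfied goal conditions at the start of prediction")
    return 1.0 - final_unsatisfied / init_unsatisfied


# Made-up numbers: 3 of 4 conditions unsatisfied initially, 1 of 4 at the end.
gc_example = goal_condition_success(
    init_total=4, init_satisfied=1, final_total=4, final_satisfied=3
)
print(round(gc_example, 4))  # 0.6667
```

In this example the model-predicted trajectory satisfied two of the three conditions that were open after replaying the history, hence a GC of 2/3.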

For inference and evaluation, it is recommended to use the provided inference script at [src/teach/cli/inference.py](https://github.com/alexa/teach/blob/main/src/teach/cli/inference.py).

# Prepare dataset for vox fine-tuning
## Trial play

In [86]:
def save_utterances_and_dialog_acts_to_csv(game_dict, definitions):
    """Collect (role, utterance, dialog acts) rows for every utterance in a game."""
    data_list = []
    interactions = game_dict["tasks"][0]["episodes"][0]["interactions"]
    for interaction in interactions:
        if "utterance" in interaction:
            role = definitions.map_agents_id2info[interaction["agent_id"]]["agent_name"]
            utterance = interaction["utterance"].strip()
            da_utterance = ""
            for idx in range(len(interaction["da_metadata"]["das"])):
                utt_segment = interaction["da_metadata"]["text_segments"][idx]
                da = interaction["da_metadata"]["das"][idx]
                if len(da) > 0:
                    da_utterance += da + ": " + utt_segment + "|"
            # Three fields per row, matching the DataFrame columns ['Role', 'Utterance', 'DA']
            data_list.append((role, utterance, da_utterance))
    return data_list

In [81]:
definitions = Definitions(version="2.0")

In [82]:
json_files = [os.path.join(data_dir, f"games/train/{f}") for f in os.listdir(os.path.join(data_dir, "games/train"))] # + [os.path.join(data_dir, f"games/valid_seen/{f}") for f in os.listdir(os.path.join(data_dir, "games/valid_seen"))]

In [83]:
len(json_files)

1482

In [89]:
import pandas as pd
all_data_list = []

for filename in json_files:
    with open(filename) as h:
        game_dict = json.load(h)
        one_data_list = save_utterances_and_dialog_acts_to_csv(game_dict, definitions)
        all_data_list.extend(one_data_list)

df = pd.DataFrame(all_data_list, columns=['Role', 'Utterance', 'DA'])

In [90]:
df

| Unnamed: 0 | Role | Utterance | DA |
|---|---|---|---|
| 0 | Commander | Good day! We are preparing breakfast. We fir... | Greetings/Salutations: Good day!;Instruction: ... |
| 1 | Commander | The mug is located under the sink\n | InformationOnObjectDetails: The mug is located... |
| 2 | Commander | Oh you found one! Okay.\n | Acknowledge: Oh you found one! Okay.; |
| 3 | Driver | done\n | Acknowledge: done; |
| 4 | Commander | Great. We are making a sandwich. we need a k... | FeedbackPositive: Great.;Instruction: We are m... |
| ... | ... | ... | ... |
| 19322 | Driver | What shall I do today?\n | RequestForInstruction: What shall I do today?; |
| 19323 | Commander | take the knife and slice the bread\n | Instruction: take the knife and slice the bread; |
| 19324 | Commander | toast the slice\n | Instruction: toast the slice; |
| 19325 | Commander | place the slices on the plate\n | Instruction: place the slices on the plate; |


In [99]:
f = os.path.join(data_dir, "edh_instances/train/c29e584989d391ac_b7d5.edh2.json")
with open(f) as h:
    edh_instance = json.load(h)

In [102]:
edh_instance.keys()

dict_keys(['dialog_history', 'driver_action_history', 'driver_image_history', 'driver_actions_future', 'driver_images_future', 'interactions', 'game_id', 'instance_id', 'pred_start_idx', 'init_state_diff', 'final_state_diff', 'state_changes', 'history_subgoals', 'future_subgoals', 'expected_init_goal_conditions_total', 'expected_init_goal_conditions_satisfied', 'dialog_history_cleaned', 'dialog_history_with_das'])

In [118]:
[': '.join(utter) for utter in edh_instance['dialog_history_cleaned']]

['Driver: hi',
 'Driver: what should I do?',
 'Commander: today we need to make a salad',
 'Commander: please cut the lettuce using a knife',
 "Driver: what's next?",
 'Commander: please cut the potato using the knife',
 'Driver: did that',
 'Commander: you need to cook the potato slice']

In [113]:
edh_instance['game_id'], edh_instance['instance_id']

('c29e584989d391ac_b7d5', 'c29e584989d391ac_b7d5.edh2')

In [128]:
def generate_role_utterance_pairs(utterances):
    role_utterance_pairs = []
    current_role = None
    current_utterance = ''

    for utterance in utterances:
        role, text = utterance.split(': ', 1)

        if role == current_role:
            # If the current role is the same as the previous one, combine the utterances
            current_utterance += '. ' + text
        else:
            # If the role changes, add the previous combined utterance to the list
            if current_role is not None:
                role_utterance_pairs.append((current_role, current_utterance.strip()))

            # Start a new combined utterance for the new role
            current_role = role
            current_utterance = text

    # Add the last combined utterance to the list (nothing to add if the input was empty)
    if current_role is not None:
        role_utterance_pairs.append((current_role, current_utterance.strip()))

    return role_utterance_pairs

# Input utterances
utterances = ['Driver: hi',
              'Driver: what should I do?',
              'Commander: today we need to make a salad',
              'Commander: please cut the lettuce using a knife',
              "Driver: what's next?",
              'Commander: please cut the potato using the knife',
              'Driver: did that',
              'Commander: you need to cook the potato slice']

# Merge consecutive utterances from the same role
role_utterance_pairs = generate_role_utterance_pairs(utterances)

# Print the result
for role, utterance in role_utterance_pairs:
    print(f'{role}: {utterance}')


Driver: hi. what should I do?
Commander: today we need to make a salad. please cut the lettuce using a knife
Driver: what's next?
Commander: please cut the potato using the knife
Driver: did that
Commander: you need to cook the potato slice


In [131]:
def generate_one_turn_dialogue_history_pairs(role_conversation_pairs):
    dialogue_pairs = []

    current_role = None
    dialogue_history = []

    for role, text in role_conversation_pairs:
        # role, text = utterance.split(': ', 1)

        if role == current_role:
            # If the current role is the same as the previous one, add to the dialogue history
            dialogue_history.append(text)
        else:
            # If the role changes, create a dialogue history-current utterance pair
            if current_role is not None:
                dialogue_pairs.append((' '.join(dialogue_history), text))

            # Update current role and reset dialogue history
            current_role = role
            dialogue_history = [text]

    # Add the last dialogue history-current utterance pair to the list
    dialogue_pairs.append((' '.join(dialogue_history), ''))

    return dialogue_pairs

# Generate dialogue history-current utterance pairs
dialogue_pairs = generate_one_turn_dialogue_history_pairs(role_utterance_pairs)

# Print the result
for dialogue_history, current_utterance in dialogue_pairs:
    print(f'Dialogue History: {dialogue_history}')
    print(f'Current Utterance: {current_utterance}\n')
    # print({'input': dialogue_history, 'output': current_utterance})


Dialogue History: hi. what should I do?
Current Utterance: today we need to make a salad. please cut the lettuce using a knife

Dialogue History: today we need to make a salad. please cut the lettuce using a knife
Current Utterance: what's next?

Dialogue History: what's next?
Current Utterance: please cut the potato using the knife

Dialogue History: please cut the potato using the knife
Current Utterance: did that

Dialogue History: did that
Current Utterance: you need to cook the potato slice

Dialogue History: you need to cook the potato slice
Current Utterance: 


In [135]:
def generate_all_turn_dialogue_history_pairs(role_conversation_pairs):
    dialogue_pairs = []

    dialogue_history = []

    for role, text in role_conversation_pairs:
        # Add the current utterance to the dialogue history
        dialogue_history.append(text)

        # Create a dialogue history-current utterance pair
        dialogue_pairs.append((' '.join(dialogue_history[:-1]), dialogue_history[-1]))

    return dialogue_pairs

In [136]:
# Generate dialogue history-current utterance pairs
full_dialogue_pairs = generate_all_turn_dialogue_history_pairs(role_utterance_pairs)


In [137]:
for dialogue_history, current_utterance in full_dialogue_pairs:
    print(f'Dialogue History: {dialogue_history}')
    print(f'Current Utterance: {current_utterance}\n')
    # print({'input': dialogue_history, 'output': current_utterance})

Dialogue History: 
Current Utterance: hi. what should I do?

Dialogue History: hi. what should I do?
Current Utterance: today we need to make a salad. please cut the lettuce using a knife

Dialogue History: hi. what should I do? today we need to make a salad. please cut the lettuce using a knife
Current Utterance: what's next?

Dialogue History: hi. what should I do? today we need to make a salad. please cut the lettuce using a knife what's next?
Current Utterance: please cut the potato using the knife

Dialogue History: hi. what should I do? today we need to make a salad. please cut the lettuce using a knife what's next? please cut the potato using the knife
Current Utterance: did that

Dialogue History: hi. what should I do? today we need to make a salad. please cut the lettuce using a knife what's next? please cut the potato using the knife did that
Current Utterance: you need to cook the potato slice
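
The concatenated history above no longer shows who said what. If speaker information matters for fine-tuning, one possible variant keeps a `Role:` prefix on each history turn. This is a sketch of an alternative, not the preprocessing used in the rest of this notebook; `generate_tagged_history_pairs` is a hypothetical helper.

```python
def generate_tagged_history_pairs(role_utterance_pairs, separator=" "):
    """Build (history, role, utterance) triples where the history keeps speaker tags."""
    pairs = []
    history = []
    for role, text in role_utterance_pairs:
        pairs.append((separator.join(history), role, text))
        history.append(f"{role}: {text}")
    return pairs


# First three merged turns from the example above.
sample_pairs = [
    ("Driver", "hi. what should I do?"),
    ("Commander", "today we need to make a salad"),
    ("Driver", "what's next?"),
]

tagged_pairs = generate_tagged_history_pairs(sample_pairs)
for history, role, utterance in tagged_pairs:
    print(f"History: {history}\nNext ({role}): {utterance}\n")
```

Returning the role of the target turn as well makes it possible to keep, for example, only Commander utterances as outputs.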


## Prepare the data

In [218]:
data_dir = "/media/PampusData/jpei/teach-dataset/edh_instances"
json_files_train = [os.path.join(data_dir, f"train/{f}") for f in os.listdir(os.path.join(data_dir, "train"))] 
json_files_valid = [os.path.join(data_dir, f"valid_seen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_seen"))] 
json_files_test = [os.path.join(data_dir, f"valid_unseen/{f}") for f in os.listdir(os.path.join(data_dir, "valid_unseen"))] 

In [146]:
len(json_files_train), len(json_files_valid), len(json_files_test)

(5475, 608, 2149)

In [148]:
json_files_train[0]

'/media/PampusData/jpei/teach-dataset/edh_instances/train/0008f3c95e006303_2053.edh0.json'

In [152]:
with open(json_files_train[0], 'r') as h:
    edh_instance = json.load(h)
    print(edh_instance['dialog_history_cleaned'])

[['Commander', 'Good day! We are preparing breakfast. We first need to wash a dirty mug.']]


In [223]:
from tqdm.notebook import tqdm

def generate_finetune_QA_pairs(json_files, data_dir=data_dir, data_name='teach_edh_train.jsonl'):
    """Write unique (dialogue history, next utterance) pairs from EDH instances to one file."""
    data_list = []
    for f in tqdm(json_files):
        with open(f) as h:
            edh = json.load(h)
        utterances = [': '.join(utter) for utter in edh['dialog_history_cleaned']]
        role_utterance_pairs = generate_role_utterance_pairs(utterances)
        full_dialogue_pairs = generate_all_turn_dialogue_history_pairs(role_utterance_pairs)
        for dialogue_history, current_utterance in full_dialogue_pairs:
            # Skip pairs with an empty history (first turn) or an empty target
            if dialogue_history != '' and current_utterance != '':
                data_list.append((dialogue_history, current_utterance))

    count_pairs = len(data_list)
    # Keep only unique pairs; dict.fromkeys preserves first-seen order, unlike set()
    unique_pairs = list(dict.fromkeys(data_list))
    data_list = [{"input": dialogue_history, "output": current_utterance} for dialogue_history, current_utterance in unique_pairs]
    # Note: json.dump writes a single JSON array, even though the file name ends in .jsonl
    with open(os.path.join(data_dir, data_name), 'w') as fw:
        json.dump(data_list, fw)

    print(f'Dataset {data_name} contains QA pairs: {len(data_list)} unique /{count_pairs} total')

In [224]:
generate_finetune_QA_pairs(json_files_valid, data_dir=data_dir, data_name='teach_edh_valid.jsonl')

Dataset teach_edh_valid.jsonl contains QA pairs: 1124 unique /3491 total


In [225]:
generate_finetune_QA_pairs(json_files_train, data_dir=data_dir, data_name='teach_edh_train.jsonl')

Dataset teach_edh_train.jsonl contains QA pairs: 9634 unique /29965 total


In [226]:
generate_finetune_QA_pairs(json_files_test, data_dir=data_dir, data_name='teach_edh_test.jsonl')

Dataset teach_edh_test.jsonl contains QA pairs: 3831 unique /10774 total


In [186]:
! ls "/media/PampusData/jpei/teach-dataset/edh_instances"

teach_edh_test.jsonl   teach_edh_valid.jsonl  valid_seen
teach_edh_train.jsonl  train		      valid_unseen
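
The generated files carry a `.jsonl` extension, but `json.dump` on a list writes one JSON array rather than one object per line. If the downstream fine-tuning tool expects JSON Lines (an assumption; this notebook does not say which format it consumes), a writer along the following lines could be used instead. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the two records are made up.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round-trip two toy records through a temporary file.
toy_records = [
    {"input": "hi. what should I do?", "output": "today we need to make a salad"},
    {"input": "what's next?", "output": "please cut the potato using the knife"},
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.jsonl")
    write_jsonl(toy_records, toy_path)
    with open(toy_path, encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    loaded_records = read_jsonl(toy_path)

print(line_count, loaded_records == toy_records)
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line even if an utterance contains a line break.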
