# Video Scene Graph Task Generation
In this notebook, we show how to generate scene graph video QA test cases in TaskVerse using three generators: `WhatObjectVideoSceneGraphTaskGenerator`, `WhatRelationVideoSceneGraphTaskGenerator`, and `WhatActionVideoSceneGraphTaskGenerator`.

## Requirements

To run this notebook, the `tma` package must be on the Python package search path. The cell below does this by appending `../..`, relative to the notebook's working directory, to `sys.path`:

In [1]:
import sys
sys.path.append("../..")
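A relative entry such as `"../.."` is resolved against the current working directory, so the import only works when the notebook is launched from its own folder. Below is a minimal sketch of a slightly more defensive variant. It assumes, as the cell above does, that the package root sits two levels above the working directory; `repo_root` is a name introduced here for illustration.

```python
import sys
from pathlib import Path

# Assumption: the notebook runs from its own folder, so the package root
# is two directory levels above the current working directory.
repo_root = (Path.cwd() / ".." / "..").resolve()

# Add the absolute path only once, so re-running the cell does not grow sys.path.
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))
```

Storing an absolute path keeps the import working even if the working directory changes later in the session.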

## Download Scene Graph
We use the GQA version of the scene graphs, which you can download from: https://downloads.cs.stanford.edu/nlp/data/gqa/sceneGraphs.zip

## Using TaskGenerator to enumerate task plans and store them into local parquet file

* `TaskGenerator` is a class that enumerates all possible task plans from the source metadata and generates VQA test cases from those task plans (e.g. `WhatObjectVideoSceneGraphTaskGenerator`).
* `TaskStore` is a class that handles the task plans produced by a task generator. The task store can output data as a pandas dataframe.
  
In this notebook, instead of using a single type of generator, we use several types of generators to generate test cases, so we introduce a new class, `JointTaskGenerator`:
* `JointTaskGenerator` is a class that assembles multiple `TaskGenerator` classes into one joint task generator.

After downloading the source data, first create a `VideoSceneGraphMetaData` object to load the video scene graph metadata. Then build a dict that maps task-type names to `TaskGenerator` classes and pass it to `JointTaskGenerator` to create a joint task generator. This generator has the same interface as a single task generator, but it generates tasks from all the task generators in the dict.

In [2]:
from tma.videoqa.scene_graph import *
from tma.base import JointTaskGenerator
from tma.videoqa.metadata import VideoSceneGraphMetaData


path = '/your_path/TaskMeAnything-v1-source/agqa_video'  # path to the video scene graph folder
metadata = VideoSceneGraphMetaData('../../annotations', video_scene_graph_folder=path)

generators = {
    'what object video': WhatObjectVideoSceneGraphTaskGenerator,
    'what relation video': WhatRelationVideoSceneGraphTaskGenerator,
    'what action video': WhatActionVideoSceneGraphTaskGenerator,
}
generator = JointTaskGenerator(metadata, generators)
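To make the idea of a joint generator concrete, here is a minimal, self-contained sketch. It is not the `tma` implementation: `ToyObjectGenerator`, `ToyActionGenerator`, `ToyJointGenerator`, and the toy metadata layout are hypothetical names invented for illustration. The real `enumerate_task_plans` also takes a task store, which this sketch leaves out.

```python
class ToyObjectGenerator:
    """Hypothetical single generator: one task plan per (video, object)."""

    def __init__(self, metadata):
        self.metadata = metadata

    def enumerate_task_plans(self):
        for video_id, graph in self.metadata.items():
            for obj in graph["objects"]:
                yield {"video scene graph id": video_id, "object": obj}


class ToyActionGenerator:
    """Hypothetical single generator: one task plan per (video, action)."""

    def __init__(self, metadata):
        self.metadata = metadata

    def enumerate_task_plans(self):
        for video_id, graph in self.metadata.items():
            for action in graph["actions"]:
                yield {"video scene graph id": video_id, "action": action}


class ToyJointGenerator:
    """Hypothetical wrapper that exposes several generators as one."""

    def __init__(self, metadata, generators):
        # Instantiate every generator class with the same shared metadata.
        self.generators = {name: cls(metadata) for name, cls in generators.items()}
        self.stats = {}

    def enumerate_task_plans(self):
        for name, gen in self.generators.items():
            count = 0
            for plan in gen.enumerate_task_plans():
                count += 1
                # Tag each plan with the name of the generator that produced it.
                yield {"task type": name, **plan}
            self.stats[name] = count


toy_metadata = {
    "vid_a": {"objects": ["cup", "door"], "actions": ["running"]},
    "vid_b": {"objects": ["bed"], "actions": ["sitting", "eating"]},
}
joint = ToyJointGenerator(
    toy_metadata,
    {"toy object": ToyObjectGenerator, "toy action": ToyActionGenerator},
)
plans = list(joint.enumerate_task_plans())
print(joint.stats)  # {'toy object': 3, 'toy action': 3}
```

Tagging every plan with a `task type` key is what lets plans from different generators share one table and be filtered by type later.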

Once the metadata has been passed to the task generator, call `generator.enumerate_task_plans()` to enumerate all possible task plans. A `TaskStore` is needed to store them. When `TaskStore` is initialized with an `output_file`, it writes the task plans to a local parquet file instead of holding them all in memory. After enumeration, call `task_store.close()` to make sure the parquet writer is closed properly.

Note that `buffer_size` is the maximum number of task plans held in memory before they are written to the parquet file on disk. If memory is limited, set a smaller `buffer_size` to avoid running out of memory. Typically, `1e6` is a good choice for a server with less than 32GB of memory.
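The buffering behaviour described above can be sketched in a few lines. This is an illustrative toy, not the `TaskStore` code: `ToyBufferedStore` and `flush_fn` are hypothetical names, and the flush callback stands in for whatever writes a batch to the parquet file.

```python
class ToyBufferedStore:
    """Hypothetical buffer-then-flush store (not the tma TaskStore)."""

    def __init__(self, flush_fn, buffer_size):
        self.flush_fn = flush_fn
        self.buffer_size = int(buffer_size)  # accepts float literals such as 1e6
        self.buffer = []

    def add(self, plan):
        self.buffer.append(plan)
        # Write the batch out as soon as the buffer is full.
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

    def close(self):
        # Flush whatever is left; skipping this would lose the last partial batch.
        self._flush()


batches = []  # records the size of every batch that was "written"
store = ToyBufferedStore(flush_fn=lambda rows: batches.append(len(rows)), buffer_size=4)
for i in range(10):
    store.add({"id": i})
print(batches)  # [4, 4]
store.close()
print(batches)  # [4, 4, 2]
```

The last two rows only reach the sink on `close()`, which is why closing the store after enumeration matters.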


In [3]:
from tma.task_store import TaskStore

save_path = '../cache/videosg.parquet'  # path to save task plans to a parquet
task_store = TaskStore(output_file=save_path, schema=generator.schema, buffer_size=1e6)
generator.enumerate_task_plans(task_store)
task_store.close()

Writing to ../cache/videosg.parquet


enumerating [what object video] task: 100%|██████████| 9601/9601 [00:02<00:00, 3270.74it/s]


Generated [428342] what object video tasks


enumerating [what relation video] task: 100%|██████████| 9601/9601 [00:02<00:00, 3230.58it/s]


Generated [428342] what relation video tasks


enumerating [what action video] task: 100%|██████████| 9601/9601 [00:02<00:00, 3905.58it/s]


Generated [335386] what action video tasks


The `JointTaskGenerator` also maintains a `stats` attribute, which is a dictionary that records the number of tasks generated by each task generator.

In [4]:
generator.stats

{'what object video': 428342,
 'what relation video': 428342,
 'what action video': 335386}
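Since `stats` is a plain dictionary, the overall number of enumerated task plans is just the sum of its values. A small worked example using the counts printed above (the dictionary is copied by hand here so the snippet runs on its own):

```python
# Counts copied from the `generator.stats` output above.
stats = {
    "what object video": 428342,
    "what relation video": 428342,
    "what action video": 335386,
}

total = sum(stats.values())
print(total)  # 1192070

# Share of each task type in the full set of task plans.
for name, count in stats.items():
    print(f"{name}: {count / total:.1%}")
```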

## Sample task plans from local parquet file and generate VQA test cases

We can load the task plans from the local parquet file into a pandas dataframe with `pq.read_table(save_path, filters=f).to_pandas().astype(get_pd_schema(generator.schema))`. After that, we can use the `generator.generate()` method to generate VQA test cases from a task plan.
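Each filter passed to `pq.read_table` in this notebook is a list of `(column, operator, value)` tuples, which pyarrow treats as a conjunction: a row is kept only if every predicate holds. The pure-Python sketch below mimics that semantics on a few hand-written rows so the selection logic is easy to see. `matches` and `OPS` are hypothetical helpers written for this illustration, not part of pyarrow or `tma`, and only a subset of operators is covered.

```python
import operator

# Subset of comparison operators, for illustration only.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row, conjunction):
    """Return True if the row satisfies every (column, op, value) predicate."""
    return all(OPS[op](row.get(col), val) for col, op, val in conjunction)


rows = [
    {"task type": "what object video", "object": "door"},
    {"task type": "what action video", "action": "running somewhere"},
    {"task type": "what object video", "object": "bed"},
]

types = ["what object video", "what action video"]
filters = [[("task type", "==", t)] for t in types]  # one conjunction per type

counts = {}
for task_filter, task_type in zip(filters, types):
    selected = [row for row in rows if matches(row, task_filter)]
    counts[task_type] = len(selected)

print(counts)  # {'what object video': 2, 'what action video': 1}
```

Filtering at read time means only the rows of one task type are loaded into memory per iteration.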

In [5]:
import pandas as pd
import pyarrow.parquet as pq
from tma.task_store import get_pd_schema
from tqdm import tqdm
from pprint import pprint # use pprint to print the task
from IPython.display import Video, display
import os
import random

types = ['what object video', 'what relation video', 'what action video'] # select the types of task plans you want
filters = [[('task type', '==', t)] for t in types]

pool = []
sample_num = 100

save_path = '../cache/videosg.parquet'
for task_filter in tqdm(filters):  # renamed to avoid shadowing the built-in `filter`
    df = pq.read_table(save_path, filters=task_filter).to_pandas().astype(get_pd_schema(generator.schema))
    # pick one random task plan among the first `sample_num` rows (fewer if the table is smaller)
    row_index = random.randint(0, min(len(df), sample_num) - 1)
    task = generator.generate(df.iloc[row_index].dropna().to_dict(), return_data=True)

    binary_video_data = task['video']
    task['video'] = "see the video below"
    pprint(task)
    with open("temp_video.mp4", "wb") as f:
        f.write(binary_video_data)

    display(Video("temp_video.mp4", embed=True, width=600))

    if len(df) > sample_num:
        df = df.sample(sample_num)
    pool.append(df)
    
if os.path.exists("temp_video.mp4"):
    os.remove("temp_video.mp4")

  0%|          | 0/3 [00:00<?, ?it/s]

{'answer': 'clothes',
 'options': ['clothes', 'shoe', 'floor', 'medicine'],
 'question': 'What is the object that the person is on the side of after the '
             'person taking some clothes from somewhere?',
 'task_plan': 'task type: what object video\n'
              'object: clothes\n'
              'relation: on the side of\n'
              'reference action: taking some clothes from somewhere\n'
              'relation type: spatial\n'
              'temporal reference type: after',
 'video': 'see the video below',
 'video_scene_graph_id': 'KRF68'}


 33%|███▎      | 1/3 [00:00<00:00,  4.15it/s]

{'answer': 'holding',
 'options': ['holding', 'lying on', 'covered by', 'having it on the back'],
 'question': 'What is the person doing to the blanket while the person '
             'sneezing somewhere?',
 'task_plan': 'task type: what relation video\n'
              'object: blanket\n'
              'relation: holding\n'
              'reference action: sneezing somewhere\n'
              'relation type: contact\n'
              'temporal reference type: while',
 'video': 'see the video below',
 'video_scene_graph_id': 'N11GT'}


 67%|██████▋   | 2/3 [00:00<00:00,  4.05it/s]

{'answer': 'running somewhere',
 'options': ['running somewhere',
             'putting a dish somewhere',
             'taking a box from somewhere',
             'working on a book'],
 'question': 'What action is the person doing before the person taking a phone '
             'from somewhere?',
 'task_plan': 'task type: what action video\n'
              'reference action: taking a phone from somewhere\n'
              'temporal reference type: before\n'
              'action: running somewhere',
 'video': 'see the video below',
 'video_scene_graph_id': 'KRF68'}


100%|██████████| 3/3 [00:00<00:00,  4.21it/s]


Above are three VQA test cases, one from each type of generator.
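Each generated task is a dictionary with `question`, `options`, and `answer` keys, so it is easy to turn one into a multiple-choice prompt and to score a reply. The sketch below uses the "what relation video" example printed above. `format_prompt` and `is_correct` are hypothetical helpers written for this illustration and are not part of `tma`.

```python
import string

# Fields copied from the "what relation video" example above.
task = {
    "question": "What is the person doing to the blanket while the person sneezing somewhere?",
    "options": ["holding", "lying on", "covered by", "having it on the back"],
    "answer": "holding",
}


def format_prompt(task):
    """Render the question followed by lettered options (A, B, C, ...)."""
    lines = [task["question"]]
    for letter, option in zip(string.ascii_uppercase, task["options"]):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def is_correct(task, chosen_letter):
    """Check whether the option behind the chosen letter equals the answer."""
    index = string.ascii_uppercase.index(chosen_letter.upper())
    return task["options"][index] == task["answer"]


print(format_prompt(task))
print(is_correct(task, "A"))  # True
print(is_correct(task, "c"))  # False
```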

The dataframe below shows the example task plans we sampled from the parquet file.

Note that different task types have different schemas. For example, "what action video" task plans have no `object`, `relation`, or `relation type`, so those columns are empty for them, while the `action` column is empty for "what object video" task plans.

In [6]:
df = pd.concat(pool)
df

| | task type | object | relation | reference action | relation type | temporal reference type | video scene graph id | action |
|---|---|---|---|---|---|---|---|---|
| 213762 | what object video | closet | holding | watching a picture | contact | while | STDCJ | |
| 335556 | what object video | door | holding | closing a door | contact | while | XJE7I | |
| 19951 | what object video | bed | on the side of | taking food from somewhere | spatial | after | Q4932 | |
| 78940 | what object video | doorknob | touching | fixing their hair | contact | after | ET224 | |
| 203788 | what object video | box | on the side of | taking something from a box | spatial | before | J06RS | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 104504 | what action video | | | holding a cup of something | | before | 2XJG9 | taking a cup from somewhere |
| 164679 | what action video | | | opening a closet | | while | 13YII | holding some food |
| 305016 | what action video | | | putting a dish somewhere | | while | 7NXWU | working at a table |
| 223466 | what action video | | | putting a box somewhere | | after | ZS7X6 | reaching for and grabbing a picture |
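Because the joint table has empty cells for columns that do not apply to a task type, the sampling cell calls `.dropna().to_dict()` on a row before passing it to `generator.generate()`. The pure-Python sketch below shows the effect on row `164679` from the table above, with `None` standing in for a missing pandas value (an assumption made so the snippet runs without pandas).

```python
# Row 164679 from the table above; None marks the empty cells.
row = {
    "task type": "what action video",
    "object": None,
    "relation": None,
    "reference action": "opening a closet",
    "relation type": None,
    "temporal reference type": "while",
    "video scene graph id": "13YII",
    "action": "holding some food",
}

# Keep only the fields that are actually set for this task type.
task_plan = {key: value for key, value in row.items() if value is not None}

print(sorted(task_plan))
# ['action', 'reference action', 'task type', 'temporal reference type', 'video scene graph id']
```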
