# ETOT VLM Planning

In [None]:
from vlm import BaseVLM, MultiVLM
api_key = ""
base_url = ""
model = "gpt-4o"
erp = ""
system_message = "You are a helpful assistant."

## 2.0 Failure Detection

For each action, compare the state of every object in the first and last frames of the action, and predict the ideal (goal) state of the objects from the subtask.

If an object's observed state change does not match the expected one, the action is considered a failure.
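The comparison described above can be sketched as a simple predicate check. This is a minimal illustration, not the notebook's implementation: the helper name `find_violations` and the tuple representation of predicates are assumptions made for the example.

```python
def find_violations(goal_state, observed_state):
    """Return the goal predicates that are missing from the observed state."""
    observed = set(observed_state)
    return [pred for pred in goal_state if pred not in observed]


# Hypothetical data: goal state for [OPEN, microwave] and an observation
# in which the tennis ball is no longer on the desk.
goal = [
    ("open", "microwave"),
    ("ontop", "microwave", "desk"),
    ("ontop", "tennis", "desk"),
]
observed = [
    ("open", "microwave"),
    ("ontop", "microwave", "desk"),
]

violations = find_violations(goal, observed)
action_failed = bool(violations)  # any unmet goal predicate marks a failure
print(violations, action_failed)
```

Under this sketch, an action counts as failed as soon as one goal predicate is not observed, which mirrors the rule stated above.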

In [26]:

task = "[OPEN, microwave]"
env_objects = "microwave, tennis, desk"
object_state = "1. [closed, microwave] 2. [ontop, microwave, desk] 3. [ontop, tennis, desk]"

content = '''
Task background: The robot is in front of the desk and needs to complete a multi-step task.
You need to generate some criteria for determining whether the task has been successfully executed, which will be handed over to the simulator for evaluation. 
You can make requirements for the position relationship of objects (up and down), the orientation of objects (vertical up), the state of the articulated object (open/closed), and so on.

Example1:
Input:
Task: [PICK UP, apple 1]
Environment_object: apple 1, apple 2, cabinet, plate, desk
Object_state:
1. [ontop, apple 1, desk]
2. [ontop, apple 2, desk]
3. [ontop, cabinet, desk]
4. [closed, cabinet]
5. [ontop, plate, desk]
Output:
GOAL State:
1. [inhand, apple 1]
2. [ontop, apple 2, desk]
3. [ontop, cabinet, desk]
4. [closed, cabinet]
5. [ontop, plate, desk]

Example2:
Input:
Task: [PUT INTO, holder]
Environment_object: tennis, holder, desk
Object_state:
1. [inhand, tennis]
2. [ontop, holder, desk]
Output:
GOAL State:
1. [inside, tennis, holder]
2. [ontop, holder, desk]

Example3:
Input:
Task: [PICK UP, tennis]
Environment_object: tennis, holder, desk
Object_state:
1. [ontop, tennis, desk]
2. [ontop, holder, desk]
Output:
GOAL State:
1. [inhand, tennis]
2. [ontop, holder, desk]

Input:
Task: {task}
Environment_object: {env_objects}
Object_state:
{object_state}
Output:

'''
content1 = content.format(task=task, env_objects=env_objects, object_state=object_state)
# Use BaseVLM (text-only query)
base_vlm = BaseVLM(api_key, model, system_message, base_url)
response = base_vlm.call_model(content1)
print(response)

GOAL State:
1. [open, microwave]
2. [ontop, microwave, desk]
3. [ontop, tennis, desk]


In [27]:
task = "[PUT INTO, drawer]"
env_objects = "apple, drawer, desk"
object_state = "1. [open, drawer] 2. [ontop, drawer, desk] 3. [inhand, apple]"
content2 = content.format(task=task, env_objects=env_objects, object_state=object_state)
base_vlm = BaseVLM(api_key, model, system_message, base_url)
response = base_vlm.call_model(content2)
print(response)

GOAL State:
1. [inside, apple, drawer]
2. [ontop, drawer, desk]
3. [open, drawer]
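To hand a response like the one above to a simulator, the bracketed predicates first have to be extracted from the text. The following is a minimal sketch under the assumption that every predicate appears as a comma-separated list inside square brackets; `parse_predicates` is a hypothetical helper, not part of the `vlm` module.

```python
import re


def parse_predicates(text):
    """Extract bracketed predicates such as '[ontop, tennis, desk]' as tuples."""
    predicates = []
    for body in re.findall(r"\[([^\[\]]+)\]", text):
        predicates.append(tuple(part.strip() for part in body.split(",")))
    return predicates


response_text = """GOAL State:
1. [inside, apple, drawer]
2. [ontop, drawer, desk]
3. [open, drawer]"""

print(parse_predicates(response_text))
```

The same helper also works on the single-line `object_state` strings used as input, since it only looks at the bracketed groups.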


In [68]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/1_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()
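The sampling loop above walks backwards from the last frame, so it may help to see which video positions end up in which file. The sketch below reproduces only the index arithmetic without OpenCV; `key_frame_positions` is a hypothetical helper, and it ignores the early `break` taken when a frame cannot be read.

```python
def key_frame_positions(total_frames, key_frames=10):
    """Map each saved file index (frame_<k>.jpg) to the video position it is read from."""
    interval = total_frames // key_frames
    positions = {}
    for i in range(key_frames):
        # Same arithmetic as the loop above: start at the last frame and step back.
        positions[key_frames - 1 - i] = total_frames - i * interval - 1
    return positions


positions = key_frame_positions(100)
print(positions[0], positions[9])
```

For a 100-frame video, `frame_9.jpg` is read from position 99 (the last frame) and `frame_0.jpg` from position 9, so `frame_0.jpg` is close to, but not exactly, the first frame of the video.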

In [74]:
task = "open the microwave"
env_objects = "microwave, tennis, desk"
action = "OPEN, microwave"
image_url = ['./examples/images/vlm_frames/frame_0.jpg', './examples/images/vlm_frames/frame_9.jpg']
content = '''
Environmental objects include: {env_objects}.
Action: {action}
Check step by step the states of environmental objects before and after the action.
For example: 
Before action: box is on the desk; apple is in the holder, tennis is in robot gripper.
After action: box is on the desk; apple is in the holder, tennis is on the desk.
If the object is not in the scene, then it is lost.
If the action being performed is to close or open an articulated object, the loss of the object may be due to visual obstruction, and you need to reason carefully. If you are unsure, output 'unsure'
'''
content = content.format(env_objects=env_objects, action=action)
multi_vlm = MultiVLM(api_key, model, system_message, base_url, erp=erp)
multi_response0 = multi_vlm.call_model_with_multi(content, image_url)
print("VLM Response:", multi_response0)



VLM Response: Before action:  
- Microwave is closed  
- Tennis ball is on the desk  

After action:  
- Microwave is open  
- Tennis ball is unsure (possibility of being inside the microwave or lost due to visual obstruction)

The exact state of the tennis ball after the action is unclear because the microwave door might be obstructing the view.


For objects classified as unsure, we further verify their states in simulation using their precise pose information. 

During real execution, if an object is judged as unsure, we compare it with its corresponding case in the simulation plan: if the object likewise appeared unsure after executing the same action in simulation, we update the object's state to the verified state determined in simulation.
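The update rule for unsure objects can be sketched as follows. This is an illustration only: the three dictionaries (VLM states from real execution, VLM states from the simulated run of the same action, and pose-verified states from simulation) and the function name `resolve_unsure` are assumptions made for the example.

```python
def resolve_unsure(real_states, sim_states, sim_verified):
    """Replace 'unsure' states with the state verified in simulation, when available."""
    resolved = dict(real_states)
    for obj, state in real_states.items():
        # Only trust the simulation result if the object was also 'unsure'
        # after the same action in the simulated plan.
        if state == "unsure" and sim_states.get(obj) == "unsure" and obj in sim_verified:
            resolved[obj] = sim_verified[obj]
    return resolved


real_states = {"microwave": "open", "tennis": "unsure"}
sim_states = {"microwave": "open", "tennis": "unsure"}
sim_verified = {"tennis": "[ontop, tennis, desk]"}

print(resolve_unsure(real_states, sim_states, sim_verified))
```

If the object was not unsure in simulation, the sketch leaves the real-world state as `unsure` rather than guessing.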


## 2. Reflective Branching

In [28]:
content = '''
Task background: The robot is in front of the desk and needs to complete a multi-step task.
Now, it has encountered a failure during subtask execution.

The subtask is {subtask}. The complete task plan is {task}. Environmental objects include: {env_objects}.
The two images show the initial state and the end state of the environment.
Some objects' state changes do not meet expectations: before task execution, {object_origin_state}, but after execution, it should be {object_target_state} but did not meet the requirements, combining images to analyze what causes it.

For example: 
Opening the drawer knocked over the vase;
Picking up an apple knocked over the pen holder next to the apple.

Action Skills Set:
- PICK UP
- PUT ON
- PUT INTO
- OPEN
- CLOSE
Consider whether it is possible to correct the task using only the action skills set, and be careful not to change the object without authorization. 
For example, correcting a failed capture of tennis 1 should not be about grabbing tennis 2, but about how to improve the planning for tennis 1.
Note that only actions from the action skills can be used to form a plan. If there is no possible correction solution, output 'None'!!!
Do not change the purpose of the object's operation, for example, if the original plan intended to put the apple in the holder, the new plan should maintain it.

###You only have three abilities for correction: 
### 1. If an object is knocked over outside the desktop due to your actions, move the object to a safe place first.
Example 1:
The task is open the drawer.
The origin task plan is 1. [OPEN, drawer]
I observed that opening the drawer would knock over the book, so please move the book away before opening the drawer
Correct task: 
1. [PICK UP, book] 
2. [PUT ON, desk](safe place) 
3. [OPEN, drawer]

### 2. If the action of picking up a container fails due to the container containing other objects, you should first pick up and put the other objects in a safe place, and then pick up the container 
Example 2:
The task is put the tennis into the holder and put them on the drawer.
The origin task plan is 1. [PICK UP, tennis] 2. [PUT INTO, holder] 3. [PICK UP, holder] 4. [PUT ON, drawer]
I noticed that loading tennis into the holder first would make it impossible to grab the holder, so I need to first grab the holder and place it in the designated location before loading tennis into the holder.
Correct task: 
1. [PICK UP, holder] 
2. [PUT ON, desk] 
3. [PICK UP, tennis] 
4. [PUT INTO, holder]

### 3. If the opening/closing of an articulated object causes other objects to be obstructed and unable to operate, you should move the target object to a safe position before opening/closing the articulated object.
Example 3:
The task is pick up the eraser and put it into the drawer.
The origin task plan is 1. [OPEN, drawer] 2. [PICK UP, eraser] 3. [PUT INTO, drawer] 4. [CLOSE, drawer]
I observed that the opened drawer obstructs the view of and the path to the eraser. The first step [OPEN, drawer] causes the eraser to be obstructed, so please move the eraser away before opening the drawer
Correct task: 
1. [PICK UP, eraser](In order not to be obstructed after opening the drawer) 
2. [PUT ON, drawer](safe place; make sure there are no objects in hand)
3. [OPEN, drawer](Then, open and put it in)
4. [PICK UP, eraser] 
5. [PUT INTO, drawer] 
6. [CLOSE, drawer]

### If it is due to a mismatch in the physical properties of the object itself, such as an object that is too large to fit into a drawer, an object that is too heavy causing the container to fall, etc., you do not need to correct the plan!!!
### If the object causing the failure is one of the two objects involved in the PUT INTO or PUT ON action, it is considered an unchangeable problem of the task itself, and the output is None!!!
Example 4:
The task is put the corn into the bowl.
The origin task plan is 1. [PUT INTO, bowl]
Because the size of the corn itself does not match the bowl, it causes the bowl to tilt, so there is no improvement plan.
Correct task: None

The task is {task}
The origin task plan is {task_plan}.
I observed that:
'''

screen = '''
Task background: The robot is in front of the desk and needs to complete a multi-step task.
Given a corrective plan for the observed failure, determine whether this plan relies on reinforcing or stabilizing an object (e.g., holding, supporting, wedging, or otherwise increasing its stability).
On the contrary, methods such as removing obstacles and adjusting the order of movements are not simply reinforcing ideas, but are acceptable.
If the corrective plan involves reinforcing/stabilizing an object, answer: Yes, else answer: No.
Corrective plan:{plan}

'''

In [29]:
format_check = '''
Task background: The robot is in front of the desk and needs to complete a multi-step task.
You need to check if there are any irregularities in the generated tasks. The task specifications are as follows:
1. Firstly, the plan must consist of the following actions, and any undefined actions will be considered as errors:
Action Skills Set:
- PICK UP
- PUT ON
- PUT INTO
- OPEN
- CLOSE

2. We require that when multiple objects are placed into an articulated container, each insertion must follow an open, pick up, put into, close cycle. That is, the container must be opened, the object placed inside, and then closed before inserting the next object.

3. After action [PICK UP, object], the next action can only be PUT ON, PUT INTO, or None(PICK UP is the last action), not OPEN or CLOSE. While after action [PUT ON, obj] or [PUT INTO, obj] or [CLOSE, obj] or [OPEN, obj], the next action can be any action.

4. Microwave ovens and drawers are articulated objects, holders are not articulated objects, but they can all be put into objects as containers. Drawer can also be used as a platform to place objects.

5. Plans for different branches should be judged separately. There is no relationship between different branches!!!

6. Output format: First output Yes!!!/No!!!, then output analysis.

7. The PICK UP, OPEN, and CLOSE actions can appear alone in the plan.

8. After PICK UP, you can only PICK UP again after PUT ON, PUT INTO.

Example1:
Input:
1. [OPEN, drawer]
Output:
Yes!!! The final generated plan meets the specifications, OPEN action can be executed alone

Example2:
Input:
1. [CLOSE, drawer]
Output:
Yes!!! The final generated plan meets the specifications, CLOSE action can be executed alone

Example3:
Input:
1. [PICK UP, apple]
2. [PUT ON, drawer] (safe place)
Output:
Yes!!! The final generated plan meets the specifications. 
The first action is picking up the apple. After [PICK UP apple], you can do second action [PUT ON, drawer].

Example4:
Input:
Branch 1:
1. [PICK UP, apple 1]
Branch 2:
1. [PICK UP, apple 2]
Output:
Yes!!! Plans for different branches should be judged separately. Each branch plan meets the specifications.
The PICK UP action can be executed alone in each branch.
After picking up an object, the next action can be None(PICK UP is the last action)

Example5:
Input:
Branch 1:
1. [PICK UP, tennis]
2. [PUT INTO, holder]
3. [OPEN, drawer]
4. [PICK UP, apple]
5. [PUT INTO, drawer]
6. [CLOSE, drawer]
Output:
Yes!!! The final generated plan meets the specifications.
The first action is picking up the tennis, so the second action can be [PUT INTO, holder]. 
And after [PUT INTO, holder], the third action can be [OPEN, drawer].
And after [OPEN, drawer], the fourth action can be [PICK UP, apple].
And after [PICK UP, apple], the fifth action can be [PUT INTO, drawer].
And after [PUT INTO, drawer], the sixth action can be [CLOSE, drawer].



Example6:
Input:
MultiVLM Response: In analyzing the images and the scenario, we can infer the issue might have occurred while attempting to pick up the tennis ball after opening the drawer. This action was likely impeded by the drawer being open, hindering the robot's access to the tennis ball on the desk. 

To correct the task plan while adhering to only the action skills set provided, we can modify the steps to move the tennis ball to a safe place before dealing with the drawer.

Here is the corrected task plan:

1. [PICK UP, apple]
2. [PUT INTO, holder]
3. [PICK UP, tennis]
4. [OPEN, drawer]
5. [PUT INTO, drawer]
6. [CLOSE, drawer]

Output:
No!!! The final generated plan encountered an error: the fourth action OPEN drawer cannot be executed directly after the third action picking up tennis.

Example7:
Input:
MultiVLM Response: 
Correct task:
1. [OPEN, holder]
2. [PICK UP, apple]
3. [PUT INTO, holder]
4. [CLOSE, holder]
Output:
No!!! The holder is not an articulated object and cannot be opened

Example8:
Input:
Branch 1:
1. [PICK UP, apple 1]
2. [PICK UP, apple 2]
Output:
No!!! The final generated plan encountered an error: the second action pick up apple 2 cannot be executed directly after the first action picking up apple 1

Example9:
Input:
MultiVLM Response: 
Correction:
1. [PICK UP, tennis 1] (move to accessible position)
2. [PUT ON, desk](safe and accessible place)
3. [PICK UP, tennis 2] 
4. [PUT INTO, holder]
5. [PICK UP, holder]
6. [PUT ON, drawer](safe and accessible place)
Output:
Yes!!! The final generated plan meets the specifications. 
After the first action picking up the tennis 1, the second action can be [PUT ON, desk]. 
And after the second action [PUT ON, desk], the third action can be [PICK UP, tennis 2]. 
After the third action picking up the tennis 2, the fourth action can be [PUT INTO, holder]. 
And after the fourth action [PUT INTO, holder], the fifth action can be [PICK UP, holder]. 
And after the fifth action [PICK UP, holder], the sixth action can be [PUT ON, drawer].


You only need to focus on the corrected plan and don't need to think about the specific purpose of the plan. You just need to strictly check whether each action meets the specifications

### Input:
{task_plan}

Output:

'''
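Some of the specifications in the `format_check` prompt are purely syntactic and could also be checked without a model call. The sketch below covers only rules 1, 3 and 8 (known actions, and only PUT ON or PUT INTO directly after PICK UP) for a single branch; it does not handle rule 2, rule 4, or multi-branch input. `check_plan` and its regular expression are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

ACTIONS = {"PICK UP", "PUT ON", "PUT INTO", "OPEN", "CLOSE"}
# Matches '[ACTION, target]' and captures the action and the rest separately.
STEP = re.compile(r"\[([^,\[\]]+),\s*([^\[\]]+)\]")


def check_plan(plan_text):
    """Return (ok, reason) for a single-branch plan, checking rules 1, 3 and 8."""
    holding = False  # True right after a PICK UP
    for index, (action, _target) in enumerate(STEP.findall(plan_text), start=1):
        action = action.strip()
        if action not in ACTIONS:
            return False, f"step {index}: undefined action {action!r}"
        if holding and action not in {"PUT ON", "PUT INTO"}:
            return False, f"step {index}: {action} cannot follow PICK UP"
        holding = action == "PICK UP"
    return True, "ok"


print(check_plan("1. [PICK UP, tennis]\n2. [PUT ON, desk]\n3. [OPEN, microwave]"))
print(check_plan("1. [PICK UP, tennis]\n2. [OPEN, drawer]"))
```

A check like this could run before the VLM-based format check to reject obviously invalid plans cheaply, while the prompt still handles the rules that need world knowledge.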



In [54]:
def reflective_plan(task, task_plan, subtask, env_objects, object_origin_state, object_target_state, image_url):
    """Ask the VLM for a corrective plan and regenerate it until the format check passes."""
    # Relies on the notebook globals `content`, `format_check`, and `base_vlm`.
    content1 = content.format(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state)
    multi_vlm = MultiVLM(api_key, model, system_message, base_url, erp)
    verdict = "NO"  # renamed from `format` to avoid shadowing the built-in
    while "yes" not in verdict.lower():
        plan = multi_vlm.call_model_with_multi(content1, image_url)
        print("PLAN!!!!!!!!!!!!!!!!!!!!!!")
        print("MultiVLM Response:", plan)
        print("FORMAT!!!!!!!!!!!!!!!!!!!!!!")
        # No correction is possible: skip the format check and stop.
        if 'none' in plan.lower():
            plan = "NONE"
            print("NONE!!!!!!!!!!!!!!!!!!!!!!")
            return plan
        format_prompt = format_check.format(task_plan=plan)
        verdict = base_vlm.call_model(format_prompt)
        print(verdict)
    # Return the plan that passed the format check.
    return plan
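The generate-then-verify loop above repeats until the checker answers yes, so it never terminates if no plan ever passes. A bounded variant can be sketched with the two model calls passed in as plain callables; `generate_until_valid`, its parameters, and the `max_attempts` limit are assumptions for illustration, and the stub callables below stand in for the real VLM calls.

```python
def generate_until_valid(generate, verify, max_attempts=3):
    """Regenerate a plan until `verify` answers yes, giving up after `max_attempts`."""
    for _ in range(max_attempts):
        plan = generate()
        if "none" in plan.lower():
            return "NONE"  # the planner reports that no correction exists
        if "yes" in verify(plan).lower():
            return plan
    return None  # no valid plan within the attempt budget


# Stub callables standing in for the planner and the format checker.
plans = iter(["1. [PICK UP, tennis]\n2. [OPEN, drawer]", "1. [OPEN, drawer]"])
verdicts = iter(["No!!! OPEN cannot follow PICK UP.", "Yes!!! The plan meets the specifications."])

result = generate_until_valid(lambda: next(plans), lambda plan: next(verdicts))
print(result)
```

Returning `None` on exhaustion keeps that case distinct from the string `"NONE"`, which signals that the planner itself found no possible correction.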

### Task 1

In [None]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/1_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()


In [35]:

task = "Open the microwave door"
task_plan = "1. [OPEN, microwave]"
subtask = "[OPEN, microwave]"
env_objects = "microwave, tennis, desk"
object_origin_state = "[ontop, tennis, desk]"
object_target_state = "[ontop, tennis, desk]"
image_url = ['./examples/images/vlm_frames/frame_0.jpg', './examples/images/vlm_frames/frame_2.jpg', './examples/images/vlm_frames/frame_4.jpg']

reflective_plan(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state, image_url=image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: When opening the microwave door, it appears that the tennis ball, which was initially on top of the desk, got knocked off its position. Here is the correction plan using the provided action skills:

Corrected Task Plan:
1. [PICK UP, tennis]
2. [PUT ON, desk] (move the tennis ball to a safe place on the desk, away from where the microwave door will swing)
3. [OPEN, microwave]

This plan ensures that opening the microwave door does not knock over or displace the tennis ball.
FORMAT!!!!!!!!!!!!!!!!!!!!!!
Yes!!! The final generated plan meets the specifications.

After the first action picking up the tennis, the second action can be [PUT ON, desk]. And after the second action [PUT ON, desk], the third action can be [OPEN, microwave].


### Task 2

In [47]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/2_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()


In [49]:

task = "Reorient the pen and place it into a pen holder"
task_plan = "1. [PICK UP, pen] 2. [PUT INTO, holder 1]"
subtask = "[PUT INTO, holder 1]"
env_objects = "pen, apple, drawer, holder 1, holder 2, desk"
object_origin_state = "[inhand, pen]"
object_target_state = "[inside, pen, holder 1]"
image_url = ['./examples/images/vlm_frames/frame_7.jpg', './examples/images/vlm_frames/frame_8.jpg', './examples/images/vlm_frames/frame_9.jpg']

reflective_plan(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state, image_url=image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: I observed that the 'put into' action was not successful. Based on the images, the pen appears to be misaligned with the holder and is resting outside the holder.

To correct this issue, I would suggest a modified plan that involves reorienting the pen before attempting to place it in the holder. However, given that the failure might also be due to a mismatch in the object's properties or the robot's misalignment during the execution, it seems the objects involved in the "PUT INTO" action are the source of the problem. Since this issue pertains to the placement of the pen into the holder, which directly involves the two objects in the task, it is considered an unchangeable problem of the task itself.

Therefore, the correct task adjustment plan for this particular scenario is:

None.
FORMAT!!!!!!!!!!!!!!!!!!!!!!
NONE!!!!!!!!!!!!!!!!!!!!!!


'NONE'

### Task 4

In [50]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/4_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()



In [None]:
task = "Close the drawer"
task_plan = "1. [CLOSE, drawer]"
subtask = "[CLOSE, drawer]"
env_objects = "drawer, toy, desk"
object_origin_state = "[open, drawer]"
object_target_state = "[closed, drawer]"
image_url = ['./examples/images/vlm_frames/frame_0.jpg', './examples/images/vlm_frames/frame_5.jpg', './examples/images/vlm_frames/frame_9.jpg']

reflective_plan(task, task_plan, subtask, env_objects, object_origin_state, object_target_state, image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: I observed that the drawer cannot be closed because the toy inside the drawer is obstructing it from fully closing. To correct this, the toy must be removed from the drawer before attempting to close it. The new task plan is:

1. [PICK UP, toy]
2. [PUT ON, desk] (safe place)
3. [CLOSE, drawer]
FORMAT!!!!!!!!!!!!!!!!!!!!!!
No!!! The final generated plan encountered an error: the third action CLOSE drawer cannot be executed directly after the second action PUT ON desk.

Explanation:
According to the specifications, after a PUT ON action, the next action can be any action. However, in this specific context, to close the drawer, it should first be opened (or already be open). The plan does not include an action to open the drawer before attempting to close it.

Here’s how you could correct the plan to meet the specifications:

1. [OPEN, drawer]
2. [PICK UP, toy]
3. [PUT ON, desk] (safe place)
4. [CLOSE, drawer]
PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiV

### Task 5

In [None]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/5_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()


In [None]:
task = "Reorient the pen and place it into a pen holder"
task_plan = "1. [PICK UP, pen] 2. [PUT INTO, holder 2]"
subtask = "[PUT INTO, holder 2]"
env_objects = "pen, apple, drawer, holder 1, holder 2, desk"
object_origin_state = "[inhand, pen]"
object_target_state = "[inside, pen, holder 2]"
image_url = ['./examples/images/vlm_frames/frame_7.jpg', './examples/images/vlm_frames/frame_8.jpg', './examples/images/vlm_frames/frame_9.jpg']

reflective_plan(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state, image_url=image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: the pen was not placed into holder 2 as required. Upon examining the images, the issue seems to arise from the presence of the apple on top of holder 2, which obstructs the pen from being placed properly into the holder.

To correct this, we need to move the apple to a safe place before attempting to place the pen into holder 2. Here is the corrected task plan:

1. [PICK UP, apple] 
2. [PUT ON, desk] 
3. [PICK UP, pen] 
4. [PUT INTO, holder 2]

By following this corrected task plan, we can ensure that the pen is successfully placed into holder 2 without any obstruction.
FORMAT!!!!!!!!!!!!!!!!!!!!!!
No!!! The final generated plan encountered an error: 

Analysis:
- The first action is picking up the apple, which is valid.
- The second action is putting the apple on the desk, which is valid.
- The third action is picking up the pen, which is valid.

The fourth action, [PUT INTO, holder 2], is also valid after the third action. 

However, the 

In [None]:
task = "[OPEN, microwave]"
env_objects = "microwave, tennis, desk"
object_state = "1. [closed, microwave] 2. [ontop, microwave, desk] 3. [ontop, tennis, desk]"
plan = '''1. [PICK UP, tennis] (Remove the tennis ball from the area that will be obstructed.)
2. [PUT ON, a different part of the desk (safe place)] (Place the tennis ball in a safe location on the desk where it will not fall due to the door opening.)
3. [OPEN, microwave] (Proceed with opening the microwave door.)'''
screen1 = screen.format(plan=plan)
# Use BaseVLM (text-only query)
base_vlm = BaseVLM(api_key, model, system_message, base_url)
response = base_vlm.call_model(screen1)
print(response)

No


### Task 6

In [51]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/6_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()


In [56]:
task = "Place the apple and the holder on the drawer, with the apple inside the holder"
task_plan = "1. [PICK UP, apple] 2. [PUT INTO, apple, holder] 3. [PICK UP, holder] 4. [PUT ON, holder, drawer]"
subtask = "[PICK UP, holder]"
env_objects = "apple, drawer, holder, desk"
object_origin_state = "[ontop, holder, desk]"
object_target_state = "[inhand, holder]"
image_url = ['./examples/images/vlm_frames/frame_7.jpg', './examples/images/vlm_frames/frame_8.jpg', './examples/images/vlm_frames/frame_9.jpg']

reflective_plan(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state, image_url=image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: I observed that the action of picking up the holder failed because the apple was placed inside the holder, which made it difficult to grasp the holder. 

To correct this task, I need to pick up and place the holder in the designated location first before placing the apple inside the holder.

Correct task plan:
1. [PICK UP, holder] 
2. [PUT ON, drawer]
3. [PICK UP, apple] 
4. [PUT INTO, apple, holder]
FORMAT!!!!!!!!!!!!!!!!!!!!!!
No!!! The final generated plan encountered an error: the fourth action [PUT INTO, apple, holder] is incorrectly formatted and should only specify the object being placed and the container, i.e., [PUT INTO, holder]. Additionally, there seems to be an error in action sequence as [PUT INTO] should directly follow the [PICK UP] action of the object being placed into the container. Below is the corrected task plan:

1. [PICK UP, holder] 
2. [PUT ON, drawer]
3. [PICK UP, apple] 
4. [PUT INTO, holder]

This plan now follow

### Task 7

In [41]:
import cv2
from vlm import BaseVLM, MultiVLM, HistoryVLM
cap = cv2.VideoCapture('./examples/videos/7_s_f.mp4')

total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
key_frames = 10
interval = total_frames // key_frames

for i in range(key_frames):
    cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - (i) * interval - 1)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(f'./examples/images/vlm_frames/frame_{key_frames-1-i}.jpg', frame)
    else:
        break

cap.release()

In [None]:
task = "Put the apple and the tennis ball in either the drawer or the pen holder, together or separately. Ensure the drawer is closed."
task_plan = "1. [PICK UP, apple] 2. [PUT INTO, holder] 3. [OPEN, drawer] 4. [PICK UP, tennis] 5. [PUT INTO, drawer] 6. [CLOSE, drawer]"
subtask = "[PICK UP, tennis]"
env_objects = "apple, drawer, tennis, holder, desk"
object_origin_state = "[ontop, tennis, desk]"
object_target_state = "[inhand, tennis]"
image_url = ['./examples/images/vlm_frames/frame_7.jpg', './examples/images/vlm_frames/frame_8.jpg', './examples/images/vlm_frames/frame_9.jpg']

reflective_plan(task=task, task_plan=task_plan, subtask=subtask, env_objects=env_objects, object_origin_state=object_origin_state, object_target_state=object_target_state, image_url=image_url)


PLAN!!!!!!!!!!!!!!!!!!!!!!
MultiVLM Response: the execution of step 4 [PICK UP, tennis] failed due to the tennis ball being blocked by the open drawer. To correct this, we need to adjust the task plan to avoid this obstruction.

Here's the corrected task plan:

1. [PICK UP, apple] 
2. [PUT INTO, holder]
3. [PICK UP, tennis]  (Ensure this action is not obstructed)
4. [OPEN, drawer]
5. [PUT INTO, drawer]
6. [CLOSE, drawer] 

By picking up the tennis ball before opening the drawer, we avoid the failure caused by the open drawer blocking access to the tennis ball.
FORMAT!!!!!!!!!!!!!!!!!!!!!!
No!!! The final generated plan encountered an error: the fourth action OPEN drawer cannot be executed directly after the third action picking up tennis.

Analysis:
1. [PICK UP, apple] - Valid
2. [PUT INTO, holder] - Valid
3. [PICK UP, tennis] - Valid
4. [OPEN, drawer] - Invalid (as the next action after PICK UP can only be PUT ON, PUT INTO, or None, not OPEN)
5. [PUT INTO, drawer] - Conditional Validi