<!--
This content is kept in sync with `theory.ipynb` via `jupytext`. Run

  ```bash
  jupytext --set-formats ipynb,md --sync theory.ipynb
  ```

to link the markdown and ipynb versions of this file. Then run

  ```bash
  jupytext --sync theory.ipynb
  ```

to update either file with the other's changes.
-->

# Computatrum

## Introduction

Introduction

Motivation

Overview

## Computer interaction

Main idea

Mathematical model

Explain several categories of tasks (including sections of user manuals and YouTube videos)

Common features shared by many or all tasks include:

- Being able to operate at varying levels of abstraction simultaneously.
- Another common feature of all tasks is...

## Environment

Code: the code is on my GitHub (footnote with link). Explain how to use it.

In [None]:
example using random action space samples

Animated GIF output
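Until the real environment code is filled in, here is a minimal sketch of a random-action rollout. `ToyDesktopEnv`, `Action`, and the `reset`/`step`/`sample_action` methods are hypothetical stand-ins invented for illustration; they are not the computatrum API.

```python
import random
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical action: move the pointer by (dx, dy) and optionally click."""
    dx: int
    dy: int
    click: bool


class ToyDesktopEnv:
    """Stand-in for a GUI environment; NOT the computatrum API."""

    def __init__(self, width=64, height=48, seed=0):
        self.width, self.height = width, height
        self.rng = random.Random(seed)
        self.pointer = (width // 2, height // 2)

    def reset(self):
        """Move the pointer back to the center of the screen."""
        self.pointer = (self.width // 2, self.height // 2)
        return self.pointer

    def sample_action(self):
        """Draw a uniformly random action from the toy action space."""
        return Action(self.rng.randint(-5, 5), self.rng.randint(-5, 5),
                      self.rng.random() < 0.1)

    def step(self, action):
        """Apply the action, clamping the pointer to the screen bounds."""
        x = min(max(self.pointer[0] + action.dx, 0), self.width - 1)
        y = min(max(self.pointer[1] + action.dy, 0), self.height - 1)
        self.pointer = (x, y)
        return self.pointer


def random_rollout(env, steps):
    """Run `steps` random actions and return the visited pointer positions."""
    trajectory = [env.reset()]
    for _ in range(steps):
        trajectory.append(env.step(env.sample_action()))
    return trajectory


trajectory = random_rollout(ToyDesktopEnv(seed=0), steps=20)
print(len(trajectory))
```

The real example would record screen frames instead of pointer positions and render them as the animated GIF.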

And if I wanted to do action Y,

In [None]:
example of performing a preprogrammed action sequence

Animated GIF output

Conclude this section and lead into the next two: data and evaluation.

## Evaluation

Briefly highlight shortcomings of naive approaches.

- These are 'quick-trick', one-off, brittle hand-engineered approaches that work -- but fail to generalize.
- The problem is that most tasks are too complex to parametrize and there isn't enough data.
- These are naive by themselves, but not necessarily a bad idea.
- Examples include imitation learning and task classification.

### Language analysis

**The main idea**: make a text summary of the state/trajectory. Compare that summary against an expected summary in semantic embedding space. Minimize the distance between the two.

Explain how this is a more general, principled, and preferred way to evaluate the task.
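A minimal sketch of the comparison step, assuming cosine distance as the metric. `toy_embed` is a hypothetical bag-of-words stand-in for whatever sentence encoder ends up being used; only the distance computation is meant to carry over.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words counts; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty summary as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical summaries, invented for illustration.
expected = toy_embed("the settings window is open")
close = toy_embed("the settings window is open now")
far = toy_embed("a terminal shows an error")

print(cosine_distance(expected, close) < cosine_distance(expected, far))
```

A summary that shares most of its content with the expected one lands closer in this toy space than an unrelated one, which is the property the evaluation relies on.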

Static vision-language analysis

Example with CLIP (or whatever vision-language model I plan on using)

In [None]:
# Re-use a frame from the earlier GIF

(show frame and text summary)

Dynamic task estimation

Example with another language model

In [None]:
# Re-use the earlier GIF input

(show animation and text summary)

Explain how these approaches can be formed into a reward function.

Algorithm.
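One way the distances between summary embeddings could be turned into a reward is sketched below. This is an illustrative assumption, not the reward function from the computatrum repo: `combined_reward`, `progress_rewards`, the weighting scheme, and the example distances are all hypothetical.

```python
def combined_reward(static_distance, dynamic_distance, static_weight=0.5):
    """Blend two embedding distances into one reward; higher is better.

    The weighting scheme is an assumption made for illustration only.
    """
    if not 0.0 <= static_weight <= 1.0:
        raise ValueError("static_weight must lie in [0, 1]")
    blended = (static_weight * static_distance
               + (1.0 - static_weight) * dynamic_distance)
    return -blended


def progress_rewards(distances):
    """Per-step reward as the decrease in distance between consecutive steps."""
    return [prev - curr for prev, curr in zip(distances, distances[1:])]


# Hypothetical distances measured after each step of one episode.
episode_distances = [0.9, 0.7, 0.7, 0.2]
step_rewards = progress_rewards(episode_distances)
print(step_rewards)
print(combined_reward(static_distance=0.2, dynamic_distance=0.4))
```

The per-step variant sums to the total reduction in distance over the episode, so it rewards progress without paying twice for standing still.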

Code snippets of reward function from computatrum repo.

Demonstrate how to use the reward function.

In [None]:
# Re-use the earlier GIF input

(show inputs and outputs)

Convince the reader that this reward function is hard to game. Try to perform the task manually and see whether it's possible to get a high reward without completing the objective.

In [None]:
# show inputs (myself) and outputs (my score)

## Training

I am going to collect demonstrations. Explain how. Show code/script:

In [None]:
# collect demonstrations

Using this methodology, I have collected a dataset of diverse GUI interactions spanning ENUMERATE. Details on this dataset. I uploaded it to WHERE. It consists of the following demonstrations:

In [None]:
# show how to download dataset
# display dataset

(dataset dataframe)

Data augmentations. Explain the concept and justify. I use these augmentations:

- mouse/keyboard jerky/smooth/fast/slow in-between press/release
- different display sizes, observation-action update rates
- different window themes, accessibility features enabled/disabled, and other visual variations
- Different language prompt variations
- In some cases, resize the window
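Two of the augmentations above (playback speed and display size) can be sketched as pure transformations of a recorded input stream. `InputEvent` and its fields are a hypothetical recording format, not the computatrum data format, and the sketch only rewrites inputs; matching observations would still have to be regenerated by replaying in the environment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InputEvent:
    """Hypothetical recorded input event; not the computatrum data format."""
    time_s: float   # seconds since the start of the demonstration
    x: int          # pointer x in pixels
    y: int          # pointer y in pixels
    kind: str       # e.g. "move", "press", "release"


def scale_time(events, factor):
    """Slow down (factor > 1) or speed up (factor < 1) a demonstration."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [replace(e, time_s=e.time_s * factor) for e in events]


def scale_display(events, old_size, new_size):
    """Map pointer coordinates from one display size to another."""
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    return [replace(e, x=round(e.x * sx), y=round(e.y * sy)) for e in events]


demo = [
    InputEvent(0.0, 100, 50, "move"),
    InputEvent(0.5, 200, 100, "press"),
    InputEvent(0.6, 200, 100, "release"),
]
slow = scale_time(demo, 2.0)
large = scale_display(demo, old_size=(800, 600), new_size=(1600, 1200))
print(slow[-1].time_s, large[1].x, large[1].y)
```

Because each function returns a new list, augmentations compose: a single demonstration can be slowed down and then rescaled to yield several variants.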

These augmentations require the demonstrations to be performed in a deterministic environment, OTHER CONSTRAINTS. However, using them, I am able to expand the dataset from X demonstrations to Y total demonstrations, a ZZZ% increase.

Show code/script:

In [None]:
# augment all data in original dataset
# print demonstration count before and after
# select a few demonstrations to display

(outputs)

Synthetic data. Explain the concept and justify. Details: the precise locations to click can be identified by specific pixel colors. I synthesize the following data:

- Automated form filling: the form is auto-generated and should be filled with specified data. Forms should include
  - pages of just different buttons that can be identified by text, color, shape, or image.
  - traditional forms (input fields and a submit)

  Show example yaml description and image for each

- Automated form filling with on-screen instructions: the form is auto-generated and should be filled with specified data. The instructions are displayed on-screen instead of through a separate modality.

  Show example yaml description and image for each
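The idea of identifying click locations by pixel color can be sketched as follows. This is a toy illustration in pure Python: `render_button`, `find_marker_center`, and the reserved `MARKER` color are hypothetical, and frames are nested lists of RGB tuples in place of real screenshots.

```python
def find_marker_center(frame, marker_rgb):
    """Return the (x, y) centroid of pixels equal to marker_rgb, or None."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if pixel == marker_rgb:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def render_button(width, height, box, marker_rgb, background=(255, 255, 255)):
    """Draw one filled rectangle.

    box is (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    return [
        [marker_rgb if left <= x < right and top <= y < bottom else background
         for x in range(width)]
        for y in range(height)
    ]


MARKER = (255, 0, 128)  # hypothetical color reserved for the click target
frame = render_button(width=20, height=10, box=(4, 2, 9, 5), marker_rgb=MARKER)
print(find_marker_center(frame, MARKER))
```

Since the generator controls the rendering, the ground-truth click location comes for free and can serve as the label for each synthetic example.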

Curriculum learning. Methodology: go from easier to harder based on how the learner is progressing. Advantages and disadvantages. In this case, the advantages seem to outweigh the disadvantages. This is the curriculum:

-
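The easier-to-harder progression could be driven by a scheduler like the one sketched here. `CurriculumScheduler`, the sliding window, and the promote/demote thresholds are illustrative assumptions rather than the curriculum actually used; demotion on a low success rate is also an addition of this sketch.

```python
from collections import deque


class CurriculumScheduler:
    """Advance or retreat a difficulty level based on recent success rate.

    Thresholds and window size are illustrative assumptions.
    """

    def __init__(self, num_levels, window=10, promote_at=0.8, demote_at=0.3):
        self.num_levels = num_levels
        self.level = 0
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.results = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome; update the level once the window is full."""
        self.results.append(bool(success))
        if len(self.results) < self.results.maxlen:
            return self.level
        rate = sum(self.results) / len(self.results)
        if rate >= self.promote_at and self.level < self.num_levels - 1:
            self.level += 1
            self.results.clear()  # start a fresh window at the new level
        elif rate <= self.demote_at and self.level > 0:
            self.level -= 1
            self.results.clear()
        return self.level


scheduler = CurriculumScheduler(num_levels=3, window=5)
for outcome in [True] * 5 + [False] * 5:
    scheduler.record(outcome)
print(scheduler.level)
```

Clearing the window after each level change keeps results from an easier level from counting toward promotion at a harder one.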

## Policy Architecture

Design philosophy and criteria for policy architecture(s).

I'd be willing to bet that if neuroscientists analyzed my brain and those of my millennial peers, they would find a structure in the homunculus or SMA dedicated just to the mouse. But the intelligence should extend far beyond pixel-level representations. (Picture of butterfly, dog, baby, man, and manager with different perspectives/abstractions on the man's actions.)

Choices made.

Picture of policy architecture(s).

## Putting it all together

Figure showing entire architecture.

Explain overall architecture.

Explain details.

Provide some justification for design. Link to my separate posts on various design criteria.

## Experiments

Each experiment should include:

- user demonstration
- invocation code
- quantitative analysis (metrics)
- at least one raw animation
- other qualitative analysis (visualizations)
- critical summary

## Discussion

General discussion

Experimental analyses

### Broader impact

Discuss how revolutionary this project could become.

Highlight some negative uses of this technology.

- Captchas. Demonstrate solving one programmatically.
- Increased attack surface for social engineering
- Weapon for hacking, social engineering, and disinformation.
- Malicious content generation. Demonstrate

### Safety

General discussion on how larger systems will be supervised while given access to the Internet.

The answers are not clear and we will need to proceed with caution. Justify why I am open-sourcing everything. Maybe cite my paper(s) from CSE-4314.

### Future directions

General expectations: More experiments, more complexity, more data.

#### MAN

Various ideas I want to experiment with

Transfer learning

#### The Artificial Experience

#### Other artificial-ecosystem projects

### Conclusion

Clearly restate what was accomplished, how it was accomplished, and what its impact is. Invite readers to start using the computatrum and contribute to its development.

## Appendix

Footnotes.

Make sure citations are listed.

Make sure the discussion is enabled.