
Sigma Transparency Note



What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the system (or technology) works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who will use or be affected by your system.

Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see the Microsoft AI principles.

What is the "Situated Interactive Guidance Monitoring and Assistance" (SIGMA) system?

Situated Interactive Guidance Monitoring and Assistance, or SIGMA for short, is an open-source prototype system intended to enable and accelerate research on mixed-reality task-assistive agents. It comprises a baseline end-to-end system that attempts to interactively guide a user through procedural tasks. The system runs on a HoloLens 2 device and requires an additional desktop server for processing. Researchers can adopt and build upon this prototype to investigate the many challenges of developing real-time interactive mixed-reality agents and of using LLMs and other foundation models in this context.

What can SIGMA do?

SIGMA can guide users step by step through procedural tasks. During the interaction, the system renders a virtual floating panel that displays the instructions to the user and reads them out loud step by step. The system can also display relevant, spatially placed holograms to help guide the user (e.g., an arrow that points to a specific object part or indicates a direction in which something should be turned). The user can navigate through the steps with simple voice commands like "next step" and "previous step".
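As a rough illustration of this interaction pattern, step navigation can be viewed as a small dispatcher over recognized utterances. The sketch below is hypothetical; the function and command strings are assumptions, not SIGMA's actual implementation:

```python
# Hypothetical sketch of step navigation driven by recognized voice commands;
# command strings and structure are illustrative, not SIGMA's actual code.

def navigate(current_step: int, num_steps: int, utterance: str) -> int:
    """Map a recognized utterance to a new step index, clamped to a valid range."""
    command = utterance.strip().lower()
    if command in ("next", "next step"):
        return min(current_step + 1, num_steps - 1)
    if command in ("previous", "previous step", "go back"):
        return max(current_step - 1, 0)
    return current_step  # unrecognized commands leave the step unchanged

# Example: the user says "next step" while on step 0 of a 5-step task.
assert navigate(0, 5, "Next step") == 1
```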

The system can guide the user through tasks in two different modes. In the first (default) mode, tasks are manually defined ahead of time and specified in a task library (.json format). Alternatively, the system can be configured in a second mode that automatically generates recipes for novel tasks via an LLM. SIGMA can also be configured to answer open-ended questions from the user at each step, again by using an LLM.
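To make the first mode concrete, a task-library entry might look roughly like the following. The schema shown here is an assumption for illustration only and may differ from SIGMA's actual .json format:

```python
import json

# Hypothetical task-library entry; the field names are assumptions for
# illustration and may differ from SIGMA's actual .json schema.
task_entry = json.loads("""
{
  "name": "Make pour-over coffee",
  "steps": [
    "Boil water to approximately 95 degrees Celsius.",
    "Place a filter in the dripper and rinse it with hot water.",
    "Add 20 grams of ground coffee to the filter.",
    "Pour the water slowly over the grounds in circular motions."
  ]
}
""")

for i, step in enumerate(task_entry["steps"], start=1):
    print(f"Step {i}: {step}")
```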

Finally, on the computer vision side, SIGMA can be configured to use off-the-shelf large vision models (e.g., Detic, SEEM) to recognize and highlight objects and tools relevant to the task at hand.

SIGMA was designed to enable and accelerate research in mixed reality task assistance. The system can be configured to collect and log a variety of data streams from the HoloLens 2 device. These include: audio (from the user and from the system), color and depth camera images, mixed reality preview stream (color images + overlaid holograms), eye gaze direction (3D ray), head pose, and hand pose (3D poses for 26 joints per hand).
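For researchers planning data collection, the logged streams can be pictured schematically as follows. This dataclass is a hypothetical sketch of the stream types listed above; the field names and types are assumptions, not SIGMA's actual on-disk format:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schematic of the data streams listed above; field names and
# types are assumptions, not SIGMA's actual logging format.
@dataclass
class SigmaFrame:
    timestamp_ns: int                            # capture time in nanoseconds
    audio_chunk: bytes                           # audio from the user and the system
    color_image: bytes                           # encoded color camera frame
    depth_image: bytes                           # encoded depth camera frame
    preview_image: bytes                         # mixed-reality preview (color + holograms)
    gaze_origin: Tuple[float, float, float]      # origin of the 3D eye gaze ray
    gaze_direction: Tuple[float, float, float]   # direction of the 3D eye gaze ray
    head_pose: List[float]                       # e.g., a flattened 4x4 pose matrix
    left_hand_joints: List[List[float]]          # 26 joints x 3D position
    right_hand_joints: List[List[float]]         # 26 joints x 3D position
```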

What are SIGMA's intended uses?

SIGMA is an open-source experimental research prototype under active development and is intended for research purposes only (see also the license agreement).

The primary use case is for academic and industry researchers who work in the space of mixed-reality interaction and want to explore the physically situated interaction challenges connected to mixed-reality task assistance. SIGMA facilitates research in this space by providing a baseline end-to-end interactive system that resolves basic engineering problems (e.g., streaming multimodal sensor data to and from the HoloLens device, basic UI, speech recognition, etc.). SIGMA is intended as a research tool to push the state of the art in mixed-reality procedural task assistance.

SIGMA is not intended for any commercial, product, business, or mission-critical purposes. It should also not be used for any medical or health-related tasks or for tasks that are dangerous or carry an increased risk of physical harm.

How was SIGMA evaluated? What metrics are used to measure performance?

SIGMA is an integrative AI research prototype that brings together a varied set of technologies, including speech recognition and synthesis, large language models, and computer vision models. The performance of the end-to-end system has not yet been extensively evaluated with human users. Information on the performance and limitations of individual components may be available on a per-component basis from the individual component providers, e.g., Azure OpenAI, Detic, SEEM, etc.

Researchers who wish to make use of SIGMA should first familiarize themselves with the system, its limitations, and the risks involved in using it in a user-study context. Researchers who wish to conduct studies with SIGMA and human participants should undergo a full IRB or ethics board review as appropriate for their institution. As part of any user-study protocol, researchers should communicate to participants all known limitations and risks of using the system, as well as the data types collected (listed above).

What are the limitations of SIGMA? How can users minimize the impact of SIGMA’s limitations when using the system?

SIGMA's reliance on the HoloLens 2 mixed-reality headset introduces certain limitations and risks, such as:

  • Wearing the HoloLens 2 headset for long periods of time may lead to discomfort. Users should never wear the headset for longer than they feel comfortable doing so.
  • The headset itself, as well as the virtual content it renders, may sometimes occlude physical objects in the environment, so users should carry out tasks with extra care and caution.

SIGMA can be configured to rely on LLMs to generate task guidance and answer questions. Experiments with SIGMA will therefore carry over the known limitations of the LLMs used, including:

  • Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.
  • Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.
  • Lack of Transparency: Due to the complexity and size, large language models can act as "black boxes," making it difficult to comprehend the rationale behind specific outputs or decisions.
  • Inaccurate or ungrounded content: These models can fabricate content, and it is not obvious how to prevent this without authoritative input sources. Users should be cautious and should not rely entirely on a given language model for critical decisions or information that might have a deep impact.
  • Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content.

SIGMA can be configured with off-the-shelf computer vision models (such as Detic and SEEM) to recognize objects in the vicinity and use them as context for the LLM when generating recipes and answering open-ended questions; a hypothetical sketch of this configuration appears after the list below. These computer vision models have limited accuracy, so objects may be misidentified, which may in turn lead to erroneous instructions and answers. The use of LLMs and computer vision models in a physical-world context may also create physical risks:

  • Inappropriate instructions generated by the LLM, due to fabrication, misrecognized objects, etc., may be physically unsafe for the user to perform. Users should not blindly trust SIGMA's instructions and must always use their own best judgment before carrying out any task steps.
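To make the object-context configuration described above concrete, here is a hedged sketch of how detected objects might be folded into an LLM prompt. The prompt wording and function names are assumptions, not SIGMA's actual code:

```python
# Hypothetical sketch of building an LLM prompt from detected objects;
# the prompt structure is an assumption, not SIGMA's actual implementation.

def build_question_prompt(question: str, detected_objects: list[str], step: str) -> str:
    """Combine the current step, detected objects, and the user's question into a prompt."""
    context = ", ".join(detected_objects) if detected_objects else "none detected"
    return (
        f"The user is performing this task step: {step}\n"
        f"Objects currently visible: {context}\n"
        f"Answer the user's question concisely: {question}"
    )

prompt = build_question_prompt(
    question="Which screwdriver should I use?",
    detected_objects=["phillips screwdriver", "flathead screwdriver", "hammer"],
    step="Remove the back panel screws.",
)
print(prompt)
```

Note that in a setup like this, a misdetected object propagates directly into the model's answer, which is precisely the risk described above.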

What operational factors and settings allow for effective and responsible use of SIGMA?

Hardware setup: The system relies on a client-server architecture where sensor capture and UI rendering are performed on a HoloLens 2 device, while perception and computation are offloaded live to a separate compute server. For optimal operation, a powerful desktop server, as well as good WiFi or USB connectivity to the HoloLens device, is required.
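As a minimal illustration of this offloading pattern, the server side of a sensor-streaming link might look like the sketch below. The transport and wire format shown (TCP with length-prefixed frames) are assumptions for illustration; SIGMA's actual streaming mechanism is not specified in this note:

```python
import socket
import struct

# Hypothetical sketch of a server-side receive loop for length-prefixed sensor
# frames; SIGMA's actual transport and wire format will differ.
def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    with socket.create_server((host, port)) as server:
        conn, addr = server.accept()
        print(f"Client connected from {addr}")
        with conn:
            while True:
                header = conn.recv(4)                 # 4-byte length prefix
                if len(header) < 4:
                    break                             # client disconnected
                (length,) = struct.unpack("!I", header)
                frame = b""
                while len(frame) < length:
                    chunk = conn.recv(length - len(frame))
                    if not chunk:
                        return
                    frame += chunk
                # ... hand the frame off to perception/LLM components here
```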

Computer Vision: The system should be used in indoor spaces with good lighting conditions, to improve the performance of the underlying vision models.

LLMs: Users can configure the LLMs that are used by SIGMA for generating task recipes and answering open-ended questions. We encourage developers to review OpenAI's Usage policies and Azure OpenAI's Code of Conduct.
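As a rough illustration of wiring up an LLM for recipe generation, a researcher might do something like the following with the Azure OpenAI Python client. The endpoint, deployment name, and prompts are placeholders, and this is not SIGMA's actual configuration code:

```python
import os
from openai import AzureOpenAI

# Hypothetical sketch of calling an Azure OpenAI deployment to generate a task
# recipe; the endpoint, deployment name, and prompts are placeholders.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

response = client.chat.completions.create(
    model="my-gpt-deployment",  # placeholder deployment name
    messages=[
        {"role": "system", "content": "You generate short, numbered step lists for everyday tasks."},
        {"role": "user", "content": "Generate the steps for assembling a small bookshelf."},
    ],
)
print(response.choices[0].message.content)
```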

Tasks: Do not attempt to use SIGMA for dangerous, medical or health-related, or mission-critical tasks. The system is only intended to explore mixed-reality task assistance for everyday tasks like fixing or maintaining appliances, assembling furniture, cooking, etc.

Experimental protocols: Researchers who wish to make use of SIGMA should first familiarize themselves with the system and its limitations and risks. Researchers should conduct initial pilot experimentation and make sure the spaces in which the experiments are conducted are clear of additional hazards. When configuring SIGMA to use LLMs, we recommend verifying during early pilot experimentation that the responses provided by the models are reasonable for the tasks/domains investigated. Researchers who wish to conduct studies with SIGMA and human participants should undergo a full IRB or ethics board review as appropriate for their institution. As part of the user-study protocol, they should also communicate to participants the known limitations and risks of using the system, as well as the data types collected.
