# Integrating YOLO and LLMs for Object Detection in Unity VR

## Introduction:

In this tutorial, we'll explore the integration of YOLO for real-time object detection in Unity VR environments. We'll also use LLMs (Large Language Models) to provide transparency and insight into the detection process. By combining these technologies, we'll create immersive VR experiences with powerful object detection capabilities.

## Breakdown:

1. **Natural Language Processing (NLP):**
   NLP is a field of AI focused on enabling computers to understand, interpret, and generate human language. It covers tasks such as text understanding, language translation, and sentiment analysis. NLP algorithms combine linguistic principles with machine learning to process large volumes of text, which enables applications such as virtual assistants.

2. **LLM (Large Language Models):**
   LLMs are AI models, such as GPT, that are pre-trained on very large text datasets and can understand and generate natural language text. That pre-training is what makes them effective at tasks such as text generation and question answering.

    **GPT-3.5** is the OpenAI Generative Pre-trained Transformer (GPT) model family used in this tutorial, through the `gpt-3.5-turbo` chat model. It builds on GPT-3, whose largest published version has **175** billion parameters. Models at this scale handle a wide range of natural language tasks, including text completion, translation, question answering, and sentiment analysis. The ability to generate human-like text and follow nuanced prompts makes them useful across many domains, from education and healthcare to content generation.


3. **Explainability:**
   Explainability in AI refers to the ability of models to provide understandable explanations for their decisions and predictions. It enhances transparency, trust, and accountability in AI systems, particularly in high-stakes domains. Explainability techniques make complex AI models more interpretable by providing insights into their findings, which supports human understanding and collaboration.

4. **Explainability in AI and Object Detection:**
   In this tutorial, an LLM adds a plain-language description to each detected object, which helps users understand what was detected in the scene and what it means. This transparency fosters trust between users and the AI system and improves the overall VR experience.

## Conclusion:

By using YOLO for object detection and an LLM (GPT) for explainability and richer interaction, we can create immersive Unity VR experiences with robust object detection capabilities.


Materials:

*   Tutorial #2 Unity Scene
    *   Continue working from your Tutorial #2 Unity Scene; this tutorial adds to it.

**IMPORTANT**


---
Do **NOT** share this key outside of this course. Access will be turned off at the end of the course, so attempting to share or use it will be **useless**.

*   **GPT 3.5 API Key** = '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'

---


**IMPORTANT**

# PyCharm and Flask

Go to **File** > **New**, select **Python File**, and give it a name of your choice.

Then copy the following code into the Python file:

```python
from flask import Flask, request, jsonify
import tensorflow as tf
from openai import OpenAI
import numpy as np
import cv2
import os

app = Flask(__name__)

# Set these to the paths of your model and labels file
MODEL_PATH = ''
LABELS_PATH = ''

# Load the TensorFlow SavedModel
model = tf.saved_model.load(MODEL_PATH)
print('Model loaded successfully.')

# Set your OpenAI API key here (use the course key from the IMPORTANT section above)
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key_here'

def load_labels(label_path):
    """Read one label per line from the labels file."""
    with open(label_path, 'r') as f:
        return [line.strip() for line in f.readlines()]

labels = load_labels(LABELS_PATH)

def preprocess_image_from_memory(filestr):
    """Decode uploaded image bytes into a (1, 3, 640, 640) float32 array."""
    npimg = np.frombuffer(filestr, np.uint8)
    image = cv2.imdecode(npimg, cv2.IMREAD_COLOR)
    if image is None:
        # cv2.imdecode returns None when the bytes are not a valid image
        raise ValueError("Could not decode the uploaded image.")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)   # OpenCV decodes to BGR
    image = cv2.resize(image, (640, 640))
    image = image.astype(np.float32) / 255.0         # scale pixels to [0, 1]
    image = np.transpose(image, (2, 0, 1))           # HWC -> CHW
    image = np.expand_dims(image, axis=0)            # add the batch dimension
    return image

def detect_with_tensorflow(model, image, labels, confidence_threshold=0.3):
    print("Detecting with TensorFlow model...")
    # The 'input' and 'output' names must match the signature of your exported model
    inputs = {'input': tf.convert_to_tensor(image)}
    outputs = model.signatures['serving_default'](**inputs)
    output_tensor = outputs['output'].numpy()

    # Each row is read as (x, y, w, h, confidence)
    predictions = output_tensor.reshape(-1, 5)

    detections = []
    for detection in predictions:
        x, y, w, h, confidence = detection
        if confidence > confidence_threshold:
            # The rows carry no class scores, so every detection gets the first label
            class_id = 0
            detections.append({'confidence': confidence, 'class_id': class_id})

    return detections

def generate_description(shoe_type, max_tokens=50):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant who answers in 1 complete sentence."},
            {"role": "system", "content": "You are a knowledgeable expert on shoes."},
            {"role": "user", "content": f"Describe the functionality and typical use cases of {shoe_type}."}
        ],
        max_tokens=max_tokens
    )
    # Return the string form of the whole message object;
    # the Unity script extracts the description text from this string
    message = str(response.choices[0].message)
    return message


@app.route('/detect', methods=['POST'])
def detect():
    if 'image' not in request.files:
        return jsonify({"error": "No image part in the request"}), 400

    filestr = request.files['image'].read()
    try:
        image = preprocess_image_from_memory(filestr)
    except ValueError as exc:
        return jsonify({"error": str(exc)}), 400
    detections = detect_with_tensorflow(model, image, labels, confidence_threshold=0.6)

    results = []
    for detection in detections:
        label = labels[detection['class_id']]
        description = generate_description(label)
        results.append({
            "label": label,
            "confidence": float(detection['confidence']),  # NumPy float -> Python float
            "description": description
        })

    return jsonify({"detections": results})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
```


## Flask Application with TensorFlow Object Detection and GPT-3.5 Description Generation


## Application Setup and Routes

The Flask application is initialized and configured to listen for incoming HTTP POST requests on the `/detect` endpoint. This endpoint expects an image file as part of the request, which it processes to detect objects and generate descriptions.

```python
app = Flask(__name__)
```
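To exercise the `/detect` endpoint without Unity, one option is to post an image from a short Python script. The sketch below uses only the standard library. The helper names `build_multipart_body` and `post_image` are hypothetical; the form field name `image` matches what the route reads from `request.files`.

```python
import json
import urllib.request
import uuid


def build_multipart_body(field_name, filename, data, content_type="image/jpeg"):
    """Encode one file as a multipart/form-data body; returns (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_image(url, image_bytes):
    """POST image bytes to the Flask server and return the decoded JSON reply."""
    body, content_type = build_multipart_body("image", "image.jpg", image_bytes)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, calling `post_image` with the server's `/detect` URL and the bytes of any JPEG file should return the same JSON that Unity receives.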

### Environment Configuration

The OpenAI API key is set through the `OPENAI_API_KEY` environment variable, which the `OpenAI()` client reads when it is created. The key authenticates every request sent to the GPT-3.5 model.

```python
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key_here'
```
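Assigning the key in code is convenient for a course exercise, but it also puts the key in the source file. As an alternative, here is a minimal sketch that assumes the variable was already set in the shell that starts the app and fails early with a clear message if it was not. The helper name `get_api_key` is hypothetical.

```python
import os


def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the Flask app.")
    return key
```

Calling `get_api_key()` once at startup surfaces a missing key immediately rather than on the first detection request.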

## Image Processing and Object Detection

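Upon receiving an image, the application preprocesses it into the format the TensorFlow model expects: it decodes the bytes, converts BGR to RGB, resizes to 640×640, scales pixel values to the range 0 to 1, and reorders the array into a channels-first batch of one.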

```python
image = preprocess_image_from_memory(filestr)
```
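The array shapes are easier to follow on a concrete example. This sketch leaves out the OpenCV decoding and resizing steps and uses a random array as a stand-in for a decoded 640×640 RGB frame, so only the normalization and layout changes are shown.

```python
import numpy as np

# Stand-in for a decoded, resized RGB frame: height x width x channels, values 0-255.
frame = np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)

scaled = frame.astype(np.float32) / 255.0          # pixel values now in [0, 1]
channels_first = np.transpose(scaled, (2, 0, 1))   # (3, 640, 640)
batch = np.expand_dims(channels_first, axis=0)     # (1, 3, 640, 640)

print(batch.shape, batch.dtype)
```

The final shape `(1, 3, 640, 640)` is what `preprocess_image_from_memory` returns; whether a given exported model expects channels-first input depends on how it was exported.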

The TensorFlow model then processes the preprocessed image to detect objects. Each detection is accompanied by a confidence score.

```python
detections = detect_with_tensorflow(model, image, labels, confidence_threshold=0.6)
```
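The thresholding inside `detect_with_tensorflow` can be illustrated in isolation. The sketch below uses made-up rows in the `(x, y, w, h, confidence)` layout rather than real model output, and the helper name `filter_by_confidence` is hypothetical.

```python
def filter_by_confidence(rows, threshold=0.6):
    """Keep rows of (x, y, w, h, confidence) whose confidence exceeds the threshold."""
    return [row for row in rows if row[4] > threshold]


# Hypothetical predictions, not output from a real model.
rows = [
    (0.50, 0.50, 0.20, 0.30, 0.91),
    (0.10, 0.80, 0.05, 0.05, 0.42),
    (0.70, 0.20, 0.10, 0.15, 0.60),
]
kept = filter_by_confidence(rows)
print(len(kept))  # 1: only the 0.91 row passes, since 0.60 is not strictly greater than 0.6
```

Because the comparison is strict, a detection that lands exactly on the threshold is dropped.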

## GPT-3.5 Description Generation

For each detected object, the application asks OpenAI's `gpt-3.5-turbo` model for a short description of it.

### Generating Descriptions with GPT-3.5

The `generate_description` function builds a prompt from the detected object's label, sends it to the model, and returns the reply as a string.

```python
description = generate_description(label)
```

This function demonstrates interacting with the OpenAI API, specifying the model to use (`gpt-3.5-turbo` in this case), and how to structure the prompt for effective results.
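Because the Unity client sends frames repeatedly, the same label can come back many times, and each occurrence triggers a new API request. One possible refinement, not part of the tutorial code, is to memoize descriptions per label. In this sketch `fetch_description` is a stand-in for the real OpenAI request, so it runs without a key.

```python
from functools import lru_cache

api_calls = []  # records each simulated request so the effect of the cache is visible


def fetch_description(label):
    """Stand-in for the real OpenAI request made by generate_description."""
    api_calls.append(label)
    return f"Placeholder description for {label}."


@lru_cache(maxsize=128)
def cached_description(label):
    """Return the description for a label, calling the API only on first use."""
    return fetch_description(label)


for label in ["sneaker", "boot", "sneaker", "sneaker"]:
    cached_description(label)

print(api_calls)  # ['sneaker', 'boot']
```

The trade-off is that a cached label always returns its first description until the server restarts.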

### OpenAI API Integration

This segment shows the request itself: a list of messages that steers the model toward a relevant, one-sentence description of the detected object. The two `system` messages set the answer style and the domain, and the `user` message carries the label.

```python
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant who answers in 1 complete sentence."},
        {"role": "system", "content": "You are a knowledgeable expert on shoes."},
        {"role": "user", "content": f"Describe the functionality and typical use cases of {shoe_type}."}
    ],
    max_tokens=max_tokens
)
```

## Response Formatting and Server Response

The application aggregates the detection and description data, formatting it into a JSON response that is returned to the client.

```python
return jsonify({"detections": results})
```

This final step sends a structured response back to the requester, containing the label, the confidence, and the generated description for each detected object. It is a practical example of combining machine learning models for vision and language.
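For reference, the shape of that response can be sketched with the standard `json` module. The values are placeholders, not real model or GPT output.

```python
import json

# Hypothetical values; a real response carries the model's confidence and GPT's text.
payload = {
    "detections": [
        {"label": "sneaker", "confidence": 0.87, "description": "<text returned by GPT>"},
    ]
}
text = json.dumps(payload)
first = json.loads(text)["detections"][0]
print(sorted(first))  # ['confidence', 'description', 'label']
```

These three keys are the ones the `Detection` class in the Unity script declares, which is what lets `JsonUtility` fill it in.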

## Running the Flask Application

```python
if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
```

The application runs in debug mode and, because `host='0.0.0.0'`, listens on all network interfaces, so other devices on the same network can reach it for testing.


Hit the green **play** button in the top right corner to run the app. In your PyCharm console, you should see that your Flask app is running. Two links will be present. This means your Flask app is successfully launched and ready to accept requests.

Copy and store the second link. It should look something like this:
 **http://192.168.68.116:5000**. We will use it in the Unity code, with `/detect` appended, as the `serverURL`.

# Unity

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Networking;
using TMPro;

[Serializable]
public class Detection
{
    public float confidence;
    public string label;
    public string description; // Initially holds the complete ChatCompletionMessage string.
}

[Serializable]
public class RootObject
{
    public List<Detection> detections;
}

public class ContinuousObjectDetection : MonoBehaviour
{
    public string serverURL = "address/detect";
    public int captureWidth = 640;
    public int captureHeight = 640;
    public float captureIntervalSeconds = 0.1f;
    private Camera _camera;
    public TextMeshProUGUI detectionResultsText;
    public TextMeshProUGUI descriptionResultsText;

    private Dictionary<string, string> labelDescriptions = new Dictionary<string, string>();
    private HashSet<string> detectedLabelsThisFrame = new HashSet<string>();

    void Start()
    {
        _camera = GetComponent<Camera>();
        if (_camera == null)
        {
            Debug.LogError("Camera component not found.");
            return;
        }
        if (detectionResultsText == null || descriptionResultsText == null)
        {
            Debug.LogError("TextMeshProUGUI component(s) not set.");
            return;
        }
        Debug.Log("Camera found, capture starting.");
        InvokeRepeating(nameof(CaptureAndSend), 2.0f, captureIntervalSeconds);
    }

    void CaptureAndSend()
    {
        Debug.Log("Preparing to capture and send image.");
        StartCoroutine(CaptureAndSendCoroutine());
    }

    IEnumerator CaptureAndSendCoroutine()
    {
        detectedLabelsThisFrame.Clear(); // Clear the set at the start of each detection cycle.
        Debug.Log("Inside coroutine, capturing frame.");
        yield return new WaitForEndOfFrame();

        RenderTexture renderTexture = new RenderTexture(captureWidth, captureHeight, 24);
        _camera.targetTexture = renderTexture;
        _camera.Render();

        Texture2D screenShot = new Texture2D(captureWidth, captureHeight, TextureFormat.RGB24, false);
        RenderTexture.active = renderTexture;
        screenShot.ReadPixels(new Rect(0, 0, captureWidth, captureHeight), 0, 0);
        screenShot.Apply();

        byte[] imageData = screenShot.EncodeToJPG();

        _camera.targetTexture = null;
        RenderTexture.active = null;
        Destroy(renderTexture);
        Destroy(screenShot);

        WWWForm form = new WWWForm();
        form.AddBinaryData("image", imageData, "image.jpg", "image/jpeg");

        using (UnityWebRequest www = UnityWebRequest.Post(serverURL, form))
        {
            yield return www.SendWebRequest();

            if (www.result != UnityWebRequest.Result.Success)
            {
                Debug.LogError($"Error sending image: {www.error}");
                ClearText(); // Clear text if there's an error.
            }
            else
            {
                string jsonResponse = www.downloadHandler.text;
                Debug.Log($"Image uploaded successfully! Response: {jsonResponse}");
                ProcessDetections(jsonResponse);
            }
        }

        // Clear labels and descriptions if no detections were made this frame.
        if (detectedLabelsThisFrame.Count == 0)
        {
            ClearText();
        }
    }

    void ProcessDetections(string jsonResponse)
    {
        RootObject rootObject = JsonUtility.FromJson<RootObject>(jsonResponse);
        // Guard against a missing "detections" array as well as an empty one.
        if (rootObject != null && rootObject.detections != null && rootObject.detections.Count > 0)
        {
            foreach (var detection in rootObject.detections)
            {
                Debug.Log($"Detected label: {detection.label}, Confidence: {detection.confidence}");
                detectedLabelsThisFrame.Add(detection.label); // Track detected labels this frame.

                if (!labelDescriptions.ContainsKey(detection.label))
                {
                    // Save description only if not already saved.
                    labelDescriptions[detection.label] = ExtractDescriptionContent(detection.description);
                }

                // Update UI with the most recent detection and its description.
                detectionResultsText.text = $"{detection.label} ({detection.confidence * 100:F1}% confidence)";
                descriptionResultsText.text = labelDescriptions[detection.label];
            }
        }
        else
        {
            Debug.Log("No detections in view.");
        }
    }

    private string ExtractDescriptionContent(string description)
    {
        const string contentPrefix = "ChatCompletionMessage(content='";
        // IndexOf returns -1 when the prefix is missing, which makes startIdx fall below the prefix length.
        int startIdx = description.IndexOf(contentPrefix) + contentPrefix.Length;
        if (startIdx >= contentPrefix.Length)
        {
            int endIdx = description.IndexOf("', role='assistant'", startIdx);
            if (endIdx > startIdx)
            {
                return description.Substring(startIdx, endIdx - startIdx);
            }
        }
        return "Description not available."; // Fallback text if parsing fails or format is unexpected.
    }

    private void ClearText()
    {
        detectionResultsText.text = "";
        descriptionResultsText.text = "";
    }
}
```


## Continuous Object Detection in Unity


## Using Directives

- `System`: Provides core types and attributes, including `[Serializable]`, which the data classes below use.
- `System.Collections`: Contains interfaces and classes that define various collections of objects, including `IEnumerator` for coroutines.
- `System.Collections.Generic`: Provides classes for strongly typed collections.
- `UnityEngine`: Main namespace for Unity engine functionalities.
- `UnityEngine.Networking`: Contains `UnityWebRequest`, used to send the captured image to the server.
- `TMPro`: Namespace for TextMesh Pro functionalities, used for advanced text rendering.

## Serializable Classes

### `Detection` Class

- **Purpose**: Represents a single detection result, including confidence, label, and a description.
- **Fields**:
  - `confidence`: A `float` indicating the detection confidence level.
  - `label`: A `string` representing the label of the detected object.
  - `description`: A `string` initially holding a complete message string, intended for further parsing.

### `RootObject` Class

- **Purpose**: Acts as a container for multiple `Detection` objects.
- **Fields**:
  - `detections`: A `List<Detection>` holding the detection results.

## Main MonoBehaviour Class: `ContinuousObjectDetection`

### Fields

- Public fields for configuration:
  - `serverURL`: Server address for sending images.
  - `captureWidth`, `captureHeight`: Dimensions for the captured images.
  - `captureIntervalSeconds`: Time interval between captures.
  - `detectionResultsText`, `descriptionResultsText`: `TextMeshProUGUI` components for displaying detection results and descriptions.
- Private fields for internal logic:
  - `_camera`: Reference to the Camera component.
  - `labelDescriptions`: Dictionary to hold label descriptions.
  - `detectedLabelsThisFrame`: HashSet to track labels detected in the current frame.

### MonoBehaviour Methods

#### `Start()`

- **Purpose**: Initializes the script, checks for necessary components, and starts the continuous capture process.
- **Key Operations**:
  - Checks for the camera component and UI text components, logging errors if not found.
  - Uses `InvokeRepeating` to start the `CaptureAndSend` method at a set interval.

#### `CaptureAndSend()`

- **Purpose**: Prepares for capturing an image and starts the coroutine for the capture and send process.

#### `CaptureAndSendCoroutine()`

- **Coroutine**: Handles the actual capture, encoding, and sending of the image to the server.
- **Process**:
  - Clears detected labels from the previous frame.
  - Captures the image using a `RenderTexture` and `Texture2D`.
  - Encodes the image to JPG format and prepares a `WWWForm`.
  - Sends the image to the server using `UnityWebRequest`.
  - On success, processes the server response; on failure, clears the UI text.

### Utility Methods

#### `ProcessDetections(string jsonResponse)`

- **Purpose**: Parses the JSON response, updates the UI, and stores the descriptions.
- **Key Operations**:
  - Parses the JSON response to `RootObject`.
  - Updates detected labels and descriptions, handling UI updates.
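The same logic can be sketched in Python for quick experiments against the server response. This is a simplified mirror of the C# method, not a replacement for it: it stores the raw `description` string instead of calling `ExtractDescriptionContent`, and the function name `process_detections` is hypothetical.

```python
import json


def process_detections(json_text, label_descriptions):
    """Parse a /detect response; return (label line, description) for the last detection."""
    detections = json.loads(json_text).get("detections") or []
    shown = None
    for det in detections:
        label = det["label"]
        # Keep only the first description seen for each label.
        label_descriptions.setdefault(label, det["description"])
        line = f"{label} ({det['confidence'] * 100:.1f}% confidence)"
        shown = (line, label_descriptions[label])
    return shown
```

Passing the same dictionary on every call reproduces the behavior of `labelDescriptions`: later responses for a known label reuse the stored description.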

#### `ExtractDescriptionContent(string description)`

- **Purpose**: Extracts the actual content from the description string.
- **Key Operations**:
  - Searches for specific substrings to locate and extract the relevant content.
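A Python version of the same parsing makes the expected string format easy to test. The sample string below is shaped like the pattern the C# method searches for; the exact string the server produces may differ depending on the installed `openai` version, in which case the fallback text is returned.

```python
def extract_description_content(description):
    """Pull the text between the content prefix and the role marker, if both are present."""
    prefix = "ChatCompletionMessage(content='"
    start = description.find(prefix)
    if start == -1:
        return "Description not available."
    start += len(prefix)
    end = description.find("', role='assistant'", start)
    if end <= start:
        return "Description not available."
    return description[start:end]


sample = "ChatCompletionMessage(content='Running shoes cushion impact.', role='assistant')"
print(extract_description_content(sample))  # Running shoes cushion impact.
```

String matching of this kind is brittle. For instance, Python typically switches to double quotes when the text contains an apostrophe, which would also trigger the fallback.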

#### `ClearText()`

- **Purpose**: Clears the UI text components.


- Ensure you have added a second text element for the description:
  - **Right-click** the **Canvas** object, then choose **UI > Text - TextMeshPro** to add a text element.
  - Click **Main Camera** and drag the new text object into the **descriptionResultsText** field in the Inspector.

## Positioning for description and label text blocks

It's important to achieve the perfect positioning for your text to appear on your VR HUD. Feel free to copy the positioning of this setup or adjust it as you see fit, as long as it remains in front of and visible from your **Main Camera** object.

![First Image](https://drive.google.com/uc?export=view&id=1EI0uflu8U4g_LbieaUcSOPVs1i_2bau4)
![Second Image](https://drive.google.com/uc?export=view&id=1JfHjR7WyQoa-cNDm7aZmtzrza8NlSPLt)


Ensure your Oculus Quest 2 is connected via USB and hit run!

**Note**: You may need to move close to objects for detection to work.