# Introduction

**Retrieval-Augmented Generation (RAG)** combines a large language model with an external knowledge source—like a database, document library, or web search—to produce more accurate, context-aware answers.

A business example would be giving an LLM proprietary information, such as user manuals or technical and product specifications, so that it can better assist customers.

# Models

I chose **llama3** as the backend large language model and **mxbai-embed-large** as the embedding model. Both run through **Ollama** on the following NVIDIA GPU and driver:

<details>
<summary>Click to reveal GPU details</summary>

``` bash
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        On  |   00000000:0C:00.0  On |                  N/A |
|  0%   27C    P5             32W /  340W |     689MiB /  10240MiB |     31%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
</details>

The two loaded models and their sizes, as listed by `ollama ps`, are shown below:

``` bash
NAME                        ID              SIZE      PROCESSOR    UNTIL              
mxbai-embed-large:latest    468836162de7    1.2 GB    100% GPU     4 minutes from now    
llama3:latest               365c0bd3c000    6.7 GB    100% GPU     4 minutes from now  
```

Embedding models are trained specifically to generate vector embeddings: arrays of numbers that represent the semantic meaning of a given sequence of text. Texts with similar meanings end up with vectors that point in similar directions.

<img src="what-are-embeddings.svg">
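To make the idea concrete, here is a minimal sketch of how embeddings can be compared. The `embed` helper assumes an Ollama server on the default port and uses its `/api/embeddings` endpoint; it is not called in the example itself. The three short vectors are made-up stand-ins for real embeddings, which are far longer, and cosine similarity is just one common way to compare them.

```python
import json
import math
import urllib.request

# Assumption: Ollama is running locally on its default port
OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"


def embed(text, model="mxbai-embed-large"):
    """Ask a local Ollama server for the embedding of `text`."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_EMBED_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical values, not real model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

similar = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {similar:.3f}")
print(f"cat vs invoice: {unrelated:.3f}")
```

With the server running, `cosine_similarity(embed("cat"), embed("kitten"))` could be used in the same way on real embeddings.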

# Streamlit

Getting started with Ollama is a one-liner thanks to its install script.

Once installed, it is just a matter of running `ollama pull <model>` and then `ollama run <model>`. Ollama starts a local server, by default at `http://localhost:11434`, which is what you use to interact with the models.
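As a rough illustration of talking to that local server, the sketch below sends a single non-streaming request to the `/api/chat` endpoint using only the standard library. The helper names `build_chat_payload` and `ask` are hypothetical, and the code assumes the `llama3` model has already been pulled and the server is listening on the default port.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt, model="llama3"):
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }


def ask(prompt, model="llama3"):
    """Send `prompt` to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read())
    # A non-streaming reply carries the assistant message under "message"
    return reply["message"]["content"]
```

With the server running, `ask("Explain RAG in one sentence.")` should return the model's answer as a plain string.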

The code below is a quick mock-up of a ChatGPT-like chat interface using Streamlit.

<details>
<summary>Click to reveal ChatGPT-like clone using Ollama and Streamlit</summary>

```python
import streamlit as st
import requests
import json

OLLAMA_URL = "http://localhost:11434/api/chat"


st.title("💬 ChatGPT-like Demo (Ollama + Streamlit)")

if "model" not in st.session_state:
    st.session_state.model = "llama3"

if "messages" not in st.session_state:
    st.session_state.messages = []

# Display existing chat
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Handle user input
if prompt := st.chat_input("What is up?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        response_text = ""
        payload = {
            "model": st.session_state.model,
            "messages": st.session_state.messages,
            "stream": True
        }

        # Request a streamed reply and fail early on an HTTP error
        response = requests.post(OLLAMA_URL, json=payload, stream=True)
        response.raise_for_status()

        placeholder = st.empty()

        # Ollama streams one JSON object per line; append each chunk and re-render
        for line in response.iter_lines():
            if line:
                data = json.loads(line.decode("utf-8"))
                content = data.get("message", {}).get("content", "")
                response_text += content
                placeholder.markdown(response_text)

    st.session_state.messages.append({"role": "assistant", "content": response_text})
```
</details>

<img src="demo.gif">

For storing data, I opted to use ChromaDB as the vector store.