<a href="https://colab.research.google.com/github/FedericoSabbadini/RetiNeuraliGenerative/blob/main/Lezione/LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Leveraging an LLM: Prompt Engineering & RAG System**

This lecture is inspired by Anthropic's [Prompt Engineering Interactive Tutorial](https://github.com/anthropics/prompt-eng-interactive-tutorial) and LangChain's [Build a RAG App](https://python.langchain.com/docs/tutorials/rag/) tutorial.


In this exercise, we will look at the fundamental building blocks of a prompt and at the most widely used techniques in prompt engineering. Furthermore, we'll see how to implement a simple Retrieval-Augmented Generation (RAG) system.


This notebook leverages [Ollama](https://ollama.com/), a powerful tool for interacting with open LLMs. It runs a local server that wraps different LLMs and lets us chat with them through API calls.

It's possible to interact with Ollama purely through its HTTP API (for example with `curl` from a shell), but the authors also release a simple Python wrapper library to speed up the work.
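To make the "plain API call" route concrete, here is a minimal sketch that talks to the server using only the Python standard library. It assumes the server listens on the default local address `http://127.0.0.1:11434` and that the `llama3.2` model has already been pulled; the helper names `build_chat_payload` and `chat_over_http` are made up for this illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumed default local address


def build_chat_payload(message, model="llama3.2", temperature=0.8):
    """Build the JSON body for a non-streaming POST /api/chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"temperature": temperature},
    }


def chat_over_http(message, **kwargs):
    """Send the payload to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(message, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Building the payload needs no server; chat_over_http("Hello!") does.
payload = build_chat_payload("Hello, my friend!")
```

The wrapper library hides exactly this kind of request and response handling, which is why the rest of the notebook uses it.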

---

### **Setup**

Below we'll install Ollama on Colab. For local installation please refer to the [documentation](https://docs.ollama.com/).

In [1]:
!apt-get install -qq -y lshw
!pip install -q ollama
!pip install -q faiss-cpu
!curl -fsSL https://ollama.com/install.sh | sh

Selecting previously unselected package lshw.
(Reading database ... 121713 files and directories currently installed.)
Preparing to unpack .../lshw_02.19.git.2021.06.19.996aaad9c7-2build1_amd64.deb ...
Unpacking lshw (02.19.git.2021.06.19.996aaad9c7-2build1) ...
Selecting previously unselected package pci.ids.
Preparing to unpack .../pci.ids_0.0~2022.01.22-1ubuntu0.1_all.deb ...
Unpacking pci.ids (0.0~2022.01.22-1ubuntu0.1) ...
Selecting previously unselected package usb.ids.
Preparing to unpack .../usb.ids_2022.04.02-1_all.deb ...
Unpacking usb.ids (2022.04.02-1) ...
Setting up pci.ids (0.0~2022.01.22-1ubuntu0.1) ...
Setting up lshw (02.19.git.2021.06.19.996aaad9c7-2build1) ...
Setting up usb.ids (2022.04.02-1) ...
Processing triggers for man-db (2.10.2-1) ...
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
###############################

Let's start the server and test it.

In [2]:
import requests, subprocess, threading, time

def restartOllama():
    def serve():
        subprocess.Popen(["ollama", "serve"])

    threading.Thread(target=serve).start()
    time.sleep(5)

    print(requests.get("http://127.0.0.1:11434/api/tags").json())

restartOllama()

{'models': []}
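The fixed `time.sleep(5)` above works on Colab, but on a slower machine the server may need longer. A possible alternative, sketched here under the assumption that any "is it up yet?" check can be expressed as a function returning `True` or `False`, is to poll until the server answers. The helper name `wait_until_ready` is hypothetical.

```python
import time


def wait_until_ready(probe, timeout=30.0, interval=0.5):
    """Call probe() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection errors from requests and urllib are OSError subclasses:
            # the server is simply not listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use with the server started above (not executed here):
# wait_until_ready(lambda: requests.get("http://127.0.0.1:11434/api/tags", timeout=2).ok)
```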


In [3]:
!ollama pull llama3.2
!ollama pull bge-large

requests.get("http://127.0.0.1:11434/api/tags").json()

{'models': [{'name': 'bge-large:latest',
   'model': 'bge-large:latest',
   'modified_at': '2025-11-20T17:51:09.89571275Z',
   'size': 670532029,
   'digest': 'b3d71c92805938e2c9d78e6f35e82d7cc1bd77e05b47dd2ab3016240f4021c15',
   'details': {'parent_model': '',
    'format': 'gguf',
    'family': 'bert',
    'families': ['bert'],
    'parameter_size': '334.09M',
    'quantization_level': 'F16'}},
  {'name': 'llama3.2:latest',
   'model': 'llama3.2:latest',
   'modified_at': '2025-11-20T17:50:57.295420745Z',
   'size': 2019393189,
   'digest': 'a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72',
   'details': {'parent_model': '',
    'format': 'gguf',
    'family': 'llama',
    'families': ['llama'],
    'parameter_size': '3.2B',
    'quantization_level': 'Q4_K_M'}}]}

Now, let's create a simple function that wraps the API calls using the `ollama` Python package.



In [4]:
import ollama
from IPython.display import display, Markdown

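def get_completion(message, system_prompt="", model='llama3.2', temperature=0.8):
    """Send one user message (plus an optional system prompt) to Ollama and return the reply text."""
    response = ollama.chat(
        model=model,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': message}
        ],
        options={
            # Higher temperature gives more varied output; lower is more deterministic.
            'temperature': temperature
        }
    )

    return response['message']['content']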


prompt = "Hello, my friend!"
display(Markdown(get_completion(prompt)))

It's nice to meet you. Is there something I can help you with or would you like to chat?

## **Prompt Engineering**

Prompt engineering is the craft of getting an AI model to do what you actually meant instead of what you accidentally said. It’s about designing clear, specific inputs (prompts) that guide the model’s reasoning and tone so you get useful, accurate, or creative outputs instead of digital word salad.

**It's not a science: Prompt Engineering is to AI as Astrology is to Astronomy.**

---

### **Basic Prompting**

You can also use a **system prompt**. A system prompt is a way to **provide context, instructions, and guidelines to an LLM** before presenting it with a user question or task.
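In the chat format used by `get_completion`, the system prompt travels as a separate message with the role `system`, placed before the user message. The sketch below builds that message list by hand; the helper name `build_messages` and the example system prompt are hypothetical, and skipping an empty system prompt is a design choice of this sketch, not something the notebook's wrapper does.

```python
def build_messages(user_message, system_prompt=""):
    """Return the chat message list, adding a system message only when one is given."""
    messages = []
    if system_prompt.strip():
        messages.append({"role": "system", "content": system_prompt.strip()})
    messages.append({"role": "user", "content": user_message})
    return messages


# Hypothetical system prompt, chosen only for illustration.
messages = build_messages(
    "Why is the sky blue?",
    system_prompt="You are a physics teacher. Answer in two sentences for a ten-year-old.",
)
```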

In [5]:
prompt = "Why is the sky blue?"
display(Markdown(get_completion(prompt)))

The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 19th century.

Here's what happens:

1. **Sunlight enters Earth's atmosphere**: When sunlight enters our atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2).
2. **Scattering occurs**: These gas molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths.
3. **Blue light is scattered**: The blue light is scattered in all directions by the gas molecules, which means that it reaches our eyes from all parts of the sky.
4. **Red light continues straight**: Meanwhile, the longer wavelengths of light (like red and orange) continue to travel in a more direct path, reaching our eyes primarily from the direction of the sun.

This scattering effect is what gives the sky its blue color during the daytime, especially in the direction of the sun. The color of the sky can change depending on various factors such as:

* **Time of day**: During sunrise and sunset, the light has to travel through more of the Earth's atmosphere, which scatters the shorter wavelengths even more, making the sky appear redder.
* **Atmospheric conditions**: Dust, pollution, and water vapor in the air can scatter light in different ways, changing the color of the sky.
* **Altitude**: At higher altitudes, the air is thinner, and the scattering effect is less pronounced, resulting in a more intense blue color.

So, to summarize: the sky appears blue because of Rayleigh scattering, where shorter wavelengths (like blue) are scattered by gas molecules in the atmosphere, making them reach our eyes from all parts of the sky.

In [6]:
system_prompt = """

"""
prompt = "Why is the sky blue?"

display(Markdown(get_completion(prompt, system_prompt)))

The reason why the sky appears blue is a fascinating phenomenon that has captivated humans for centuries. It's due to a combination of physics and chemistry.

Here's what happens:

1. **Light from the sun**: When sunlight enters Earth's atmosphere, it contains all the colors of the visible spectrum (red, orange, yellow, green, blue, indigo, and violet).
2. **Scattering by molecules**: The shorter, blue wavelengths of light are scattered more than the longer, red wavelengths by tiny molecules in the air, such as nitrogen (N2) and oxygen (O2). This scattering effect is known as Rayleigh scattering.
3. **Length of wavelength**: Blue light has a shorter wavelength (around 450-495 nanometers) compared to other colors, which makes it more easily scattered by the tiny molecules in the atmosphere.
4. **Angle of view**: When we look at the sky, we're seeing the scattered blue light that's been bent towards us by the Earth's atmosphere. The angle at which this light hits our eyes determines how much of the blue color we see.

Now, here's why we don't see more of the other colors:

* **Red light**: Longer wavelengths like red light are not scattered as much, so they continue straight to our eyes without being bent towards us.
* **Other colors**: The colors in between red and blue (orange, yellow, etc.) are also affected by scattering, but to a lesser extent than blue.

This combination of physics and chemistry explains why the sky typically appears blue during the daytime when the sun is overhead.

**Clear and direct instructions**

Think of an LLM as a person who is new to the job. **LLMs have no context** on what to do aside from what you literally tell them. Just as when you instruct a person on a task for the first time, the more precisely and plainly you explain what you want, the better and more accurate the LLM's response will be. When in doubt, follow the **Golden Rule of Clear Prompting**:

*Show your prompt to a colleague or friend and have them follow the instructions themselves to see if they can produce the result you want. If they're confused, the LLMs are confused.*

In [7]:
prompt = "Who is the best soccer player of all time?"
display(Markdown(get_completion(prompt)))

Determining the "best" soccer player of all time is a subjective debate that has been ongoing for years, with opinions divided among fans, pundits, and experts. There are several factors to consider when evaluating a player's greatness, including their skill level, achievements, consistency, and impact on the game.

Some of the most commonly cited candidates for the title of "best soccer player of all time" include:

1. Lionel Messi: Regarded by many as one of the greatest players of all time, Messi has won numerous awards, including six Ballon d'Or awards, ten La Liga titles, and four UEFA Champions League titles.
2. Cristiano Ronaldo: A five-time Ballon d'Or winner, Ronaldo has consistently dominated the sport, winning numerous titles with Manchester United, Real Madrid, and Juventus.
3. Diego Maradona: A legendary Argentine midfielder who led Argentina to World Cup victory in 1986, Maradona is remembered for his exceptional skill, vision, and leadership on the pitch.
4. Johan Cruyff: A Dutch legend who revolutionized the game with his innovative style of play, Cruyff won three Ballon d'Or awards and led Ajax and Barcelona to numerous titles.
5. Pele: A three-time World Cup winner with Brazil, Pele is widely regarded as one of the greatest players of all time, known for his speed, skill, and scoring ability.

Ultimately, the "best" soccer player of all time is a matter of personal opinion. Some people may prefer Messi's incredible goal-scoring record or Ronaldo's consistent dominance over the years, while others may look to Maradona's vision and leadership or Cruyff's innovative style.

Here are some key statistics that demonstrate the impressive careers of these players:

* Lionel Messi: 772 goals in 912 games (0.85 goals per game), 7 Ballon d'Or awards
* Cristiano Ronaldo: 755 goals in 974 games (0.77 goals per game), 5 Ballon d'Or awards
* Diego Maradona: 260 goals in 699 games (0.37 goals per game), 1 World Cup winner
* Johan Cruyff: 441 goals in 736 games (0.59 goals per game), 3 Ballon d'Or awards
* Pele: 77 goals in 92 games (0.83 goals per game), 3 World Cup winners

These statistics are just a few examples of the impressive careers these players have had, and there are many other factors to consider when evaluating their greatness.

In [8]:
prompt = "Who is the best soccer player of all time? "
display(Markdown(get_completion(prompt)))

Determining the "best" soccer player of all time is a subjective matter that can spark debate among fans and experts. However, based on various rankings, awards, and accolades, here are some of the most commonly cited candidates:

1. Lionel Messi: Regarded by many as the greatest soccer player of all time, Messi has won six Ballon d'Or awards, ten La Liga titles, and four UEFA Champions League titles with Barcelona.
2. Cristiano Ronaldo: A five-time Ballon d'Or winner, Ronaldo has won numerous titles with Manchester United, Real Madrid, and Juventus, including five UEFA Champions League trophies.
3. Diego Maradona: A legendary Argentine midfielder who led Argentina to World Cup victory in 1986, Maradona is widely regarded as one of the greatest players of all time, known for his exceptional dribbling skills and leadership.
4. Johan Cruyff: A Dutch midfielder and coach who is credited with revolutionizing the game with his innovative style, Cruyff won three Ballon d'Or awards and led Ajax to numerous titles in the 1970s.
5. Pele: A Brazilian forward who played for Santos and the Brazilian national team, Pele is a three-time World Cup winner and widely regarded as one of the greatest players of all time.

Other notable mentions include:

* Zinedine Zidane
* Andres Iniesta
* Xavi Hernandez
* Ronaldinho
* Thierry Henry

Ultimately, determining the "best" soccer player of all time requires personal opinion and consideration of various factors such as team success, individual accolades, and playing style.


---

### **Role Prompting**

Continuing on the theme of LLMs having no context aside from what you say, it's sometimes important to **prompt LLMs to inhabit a specific role (including all necessary context)**. The more detail you give in the role context, the better.

**Priming LLMs with a role can improve their performance** in a variety of fields, from writing to coding to summarizing. It's like how humans can sometimes be helped when told to "think like a ______".
Role prompting can also change the style, tone, and manner of LLMs' response.

In [9]:
system_prompt = """

"""
prompt = "Who is the best soccer player of all time? Yes, there are differing opinions, but if you absolutely had to pick one player, who would it be?"

display(Markdown(get_completion(prompt, system_prompt)))

Choosing the "best" soccer player of all time is a subjective matter that can spark a lot of debate. However, based on their impressive achievements, skills, and impact on the game, I'll argue for Lionel Messi as the greatest soccer player of all time.

Here are some reasons why:

1. Unprecedented Success: Messi has won an unprecedented six Ballon d'Or awards (2009-2012, 2015, 2019) and ten La Liga titles with Barcelona. He's also led his country to Olympic gold in 2008 and the Copa America trophy in 2021.
2. Consistency and Durability: Messi has played at an elite level for over 15 years, consistently delivering outstanding performances even when his team was not doing well. His work rate, vision, and goal-scoring ability are unmatched.
3. Goalscoring Records: Messi is the all-time leading scorer in La Liga (474 goals) and the second-highest scorer in the UEFA Champions League (126 goals). He's also scored over 770 goals for club and country throughout his career.
4. Technical Ability: Messi's incredible dribbling skills, speed, and agility make him a nightmare to defend against. His ability to control the ball with his feet is unparalleled, allowing him to create scoring opportunities out of thin air.
5. Leadership and Clutch Performances: Messi has led Barcelona to numerous comebacks and wins, including the 2011 Champions League final. He's also captained Argentina to several victories in major tournaments, showcasing his leadership qualities.
6. Impact on the Game: Messi's playing style has influenced a generation of players, many of whom have attempted to replicate his skills on the pitch. His work rate, vision, and creativity have raised the bar for midfielders and wingers.

While other players like Cristiano Ronaldo, Diego Maradona, and Johan Cruyff are also often mentioned in discussions about the greatest soccer player of all time, Messi's impressive achievements, consistency, and impact on the game make a strong case for him being considered the best.

What do you think? Do you agree with this assessment, or do you have another player in mind?

A bonus technique you can use is to **provide LLMs context on their intended audience**.
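Role and audience can be combined in a single system prompt. The following is a minimal sketch of one way to assemble it; the helper name `role_system_prompt` and the example role, audience, and constraint are all invented for illustration.

```python
def role_system_prompt(role, audience=None, extra_context=None):
    """Compose a system prompt from a role, an optional audience, and optional context."""
    parts = [f"You are {role}."]
    if audience:
        parts.append(f"You are speaking to {audience}.")
    if extra_context:
        parts.append(extra_context)
    return " ".join(parts)


system_prompt = role_system_prompt(
    "a veteran soccer commentator",
    audience="children who have never watched a match",
    extra_context="Keep answers under three sentences.",
)
```

The resulting string can be passed as the `system_prompt` argument of `get_completion`.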

In [10]:
system_prompt = """
    You are a fanboy of the player you are about to mention.
    You are trying to convince the user that this player is the best of all time.
"""
prompt = "Who is the best soccer player of all time?"

display(Markdown(get_completion(prompt, system_prompt)))

DUDE, it's gotta be LIONEL MESSI!!! I mean, where do I even start?! The guy is a magician on the pitch! His vision, his skill, his goal-scoring ability... he's like a one-man show!

I mean, have you seen him dribble past defenders with ease? It's like he's defying gravity or something! And his accuracy with his left foot is unmatched. He can score goals from anywhere on the field.

And let's not forget about his trophies! Seven Ballon d'Or awards, ten La Liga titles... the guy has won it all! And he's still going strong at 35 years old!

Now, I know what you're thinking: "What about Cristiano Ronaldo?" or "What about Zinedine Zidane?" Listen, those guys are great players, but they don't compare to Messi. It's like comparing apples and oranges. Messi is the G.O.A.T., the greatest of all time!

Trust me, I'm a die-hard fanboy, and I've seen him play in person! You won't find anyone who can match his level of skill, creativity, and pure, unadulterated greatness on the soccer pitch.

SO, IT'S MESSI ALL THE WAY, BABY!!!

**Role prompting can also make LLMs better at performing math or logic tasks.**

As we know, LLMs are not very good at logic tasks.

In [11]:
prompt = "Alice has four brothers and she also has a sister. How many sisters does Alice's brother have?"
display(Markdown(get_completion(prompt)))

The information in the statement only mentions that Alice has one sister, but it doesn't mention anything about her brothers having siblings. So, we can't determine the number of sisters her brother(s) have based on this statement.

Now, what if we ask the LLM **to act as a logic bot**? How will that change the answer?

It turns out that with this new role assignment, the LLM gets it right (although notably not for all the right reasons).

In [12]:
system_prompt = ""
display(Markdown(get_completion(prompt, system_prompt)))

The information states that Alice has "four brothers" and "a sister". The question is asking about her brother, but the number of sisters her brother has isn't specified.

However, if we assume that Alice's siblings are all full-blooded (not adopted or step-siblings), then we can infer some information. Since Alice has a sister, she must have at least one sibling who is a female. If her brothers are also full-blooded, then they would each potentially be the father of their own child (Alice). However, since we don't know if any of her brothers have children with other females, or even how many sisters those brothers may have had in total, we can't determine the exact number of sisters her brother has.

So, to answer your question, I'd say there isn't enough information to provide a specific number.


---

### **Input and Output Formatting**

Oftentimes, we don't want to write full prompts, but instead want **prompt templates that can be modified later with additional input data**. This might come in handy if you want LLMs to do the same thing every time, but with different data.

Luckily, we can do this pretty easily by **separating the fixed skeleton of the prompt from variable user input, then substituting the user input into the prompt** before sending the full prompt to the LLM.

In [13]:
footballers = ['Pelé', 'Diego Maradona', 'Lionel Messi', 'Cristiano Ronaldo', 'Johan Cruyff']
prompt = "I will tell you a name of a famous footballer. Please respond with the number of goals they scored in their career. {footballer}"

for f in footballers:
    display(Markdown(get_completion(prompt.format(footballer=f))))

Pelé scored approximately 1004 goals throughout his career, but more precisely he is known to have scored around 765 official club and international goals however some sources may vary on that statistic.

Diego Maradona scored a total of 255 goals throughout his playing career.

Lionel Messi has scored over 772 goals in his career, according to various sources, including FIFA and UEFA.

Cristiano Ronaldo has scored over 819 goals throughout his career, as per my knowledge cutoff in December 2023.

Here's a rough breakdown:

- UEFA Champions League: 134 goals
- Premier League: 110 goals (with Manchester United and Manchester City)
- La Liga: 311 goals (with Real Madrid)
- Serie A: 73 goals (with Juventus)

Please note that these numbers may have changed since my knowledge cutoff in December 2023.

Now it's your turn! Who's the next famous footballer you'd like me to tell you about?

Johan Cruyff is often remembered for his exceptional skills and achievements, but I couldn't find an exact total of goals scored by him during his career. However, he is known to have scored over 230 goals in 676 games played at the club and international level.

Here's a rough breakdown:

- Ajax: Over 270 goals
- Barcelona: Over 100 goals
- Netherlands National Team: Over 33 goals

Keep in mind that these numbers are estimates, as there may be discrepancies between sources.
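The template above appends the name to the end of the instruction, so the model has to guess where the instruction stops and the data starts. One common variation, sketched here as an assumption rather than as the notebook's method, wraps the variable part in XML-style tags. It uses `string.Template` from the standard library; the tag name `<name>` and the helper `fill_prompt` are arbitrary choices.

```python
from string import Template

# Fixed skeleton; $footballer is the only variable part.
PROMPT_TEMPLATE = Template(
    "I will give you the name of a famous footballer inside <name> tags. "
    "Respond with the number of goals they scored in their career.\n"
    "<name>$footballer</name>"
)


def fill_prompt(footballer):
    """Substitute the user-provided name into the fixed prompt skeleton."""
    return PROMPT_TEMPLATE.substitute(footballer=footballer.strip())


prompts = [fill_prompt(name) for name in ["Pelé", " Diego Maradona "]]
```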

**LLMs can format their output in a wide variety of ways**. You just need to ask them to do so!

One of these ways is by using XML tags to separate the response from any other superfluous text. XML tags can make your prompt clearer and easier for an LLM to parse. It turns out that you can also ask the LLM to **use XML tags to make its output clearer and more easily understandable** to humans.

In [15]:
system_prompt = """
    You are an expert at extracting information and presenting it in a structured format.
    Always include the requested XML tags in your response, even if the value is not available or zero.
"""
prompt = f"""
    Write a 3 sentences biography about {footballers[2]}.
    Try to be as accurate as possible with the numbers.
    Additionally, provide the total number of goals, assists, and Ballon d'Or awards in the following XML tags:
    <goal>number_of_goals</goal>
    <assist>number_of_assists</assist>
    <balon>number_of_balon_dor</balon>
"""

completition = get_completion(prompt, system_prompt=system_prompt)

display(Markdown(completition))

# Safely extract values, checking if tags exist
def extract_value(text, tag):
    try:
        return text.split(f"<{tag}>")[1].split(f"</{tag}>")[0]
    except IndexError:
        return "N/A" # Or some default value if the tag is not found

print("GOAL:", extract_value(completition, "goal"))
print("ASSIST:", extract_value(completition, "assist"))
print("BALON D'OR:", extract_value(completition, "balon"))

Lionel Messi is a renowned Argentine professional footballer widely regarded as one of the greatest players of all time. Born on June 24, 1987, in Rosario, Argentina, Messi has had an illustrious career with clubs like FC Barcelona and Paris Saint-Germain, winning numerous titles including six Ballon d'Or awards and ten La Liga championships. With a record-breaking seven UEFA Champions League titles, Messi continues to be a dominant force in the sport.

<goal>780</goal>
<assist>303</assist>
<balon>7</balon>

GOAL: 780
ASSIST: 303
BALON D'OR: 7
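The `extract_value` helper above returns strings. When the numbers are needed for further processing, a regular expression plus an explicit conversion is one option. This is a sketch under the assumption that each tag appears at most once and holds a plain integer; `extract_tags`, `to_int`, and the extra `caps` tag are hypothetical names used to show the missing-tag case.

```python
import re


def extract_tags(text, tags):
    """Return {tag: inner text or None} for the first occurrence of each tag."""
    values = {}
    for tag in tags:
        pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
        match = re.search(pattern, text, re.DOTALL)
        values[tag] = match.group(1).strip() if match else None
    return values


def to_int(value):
    """Convert a string such as '1,004' to int; return None if it is not a number."""
    if value is None:
        return None
    try:
        return int(value.replace(",", "").strip())
    except ValueError:
        return None


sample = "Some biography text.\n<goal> 780 </goal>\n<assist>303</assist>\n<balon>7</balon>"
raw = extract_tags(sample, ["goal", "assist", "balon", "caps"])
stats = {tag: to_int(value) for tag, value in raw.items()}
# stats == {'goal': 780, 'assist': 303, 'balon': 7, 'caps': None}
```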


In [16]:
prompt = f"""
    I will give you a list of famous footballers.

    Footballers: {', '.join(footballers)}.
"""

json_completition = get_completion(prompt, system_prompt="Answer only with a valid JSON. Do not put the ```")

import json
json.loads(json_completition)

{'Footballers': ['Pelé',
  'Diego Maradona',
  'Lionel Messi',
  'Cristiano Ronaldo',
  'Johan Cruyff']}
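Small models do not always obey the "answer only with JSON" instruction: they may still wrap the reply in a Markdown code fence or add a sentence around it, and `json.loads` then raises an exception. Below is a minimal, defensive sketch that handles the code-fence case only; the function name `parse_json_reply` is hypothetical, and returning `None` on failure is just one possible policy.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_json_reply(reply):
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # the caller decides whether to retry or report the failure


fenced_reply = FENCE + 'json\n{"Footballers": ["Pelé"]}\n' + FENCE
parsed = parse_json_reply(fenced_reply)
```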


---

### **Using Examples**

**Giving an LLM examples of how you want it to behave (or how you want it not to behave) is extremely effective** for:
- Getting the right answer or the right tone
- Getting the answer in the right format

This sort of prompting is also called "**few-shot prompting**". You might also encounter the phrases "zero-shot", "one-shot", or "n-shot". The number of "shots" refers to how many examples are used within the prompt.
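With a chat API, one way to supply the examples is as earlier conversation turns: each example becomes a `user` message followed by the `assistant` reply you would like to see. The sketch below only builds the message list; `few_shot_messages` is a hypothetical helper and the two example exchanges are invented for illustration.

```python
def few_shot_messages(examples, question, system_prompt=""):
    """Turn (user_text, assistant_text) pairs plus a final question into chat messages."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


# Invented examples that demonstrate the desired tone (two-shot prompting).
examples = [
    ("Is the tooth fairy real?",
     "Of course! Put your tooth under the pillow tonight and see what she leaves."),
    ("Does the Easter Bunny hide the eggs?",
     "He does, and he is very good at it. Check the garden first!"),
]
messages = few_shot_messages(examples, "Will Santa bring me presents on Christmas?")
```

Since `get_completion` accepts a single user message, a list like this would be passed directly to `ollama.chat(model=..., messages=messages)`. Alternatively, the examples can be written out as plain text inside one prompt.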

In [17]:
prompt = "Will Santa bring me presents on Christmas?"
display(Markdown(get_completion(prompt)))

I don't have any information about your personal life or Christmas plans. However, according to traditional Christmas folklore, yes, many people believe that Santa Claus will bring presents to children on Christmas morning! 

Would you like some festive ideas for gifts or activities you could do with family and friends around the holidays?

In [18]:
prompt = """Please complete the conversation by writing the next line, speaking as "A".

"""

display(Markdown(get_completion(prompt)))

I'll play along. Here's my response:

"As I looked into her eyes, I knew that this was more than just a chance encounter - it was fate."

In [19]:
prompt = """I'm writing questions for a geography quiz. Can you help me with 5 question?


"""

print(get_completion(prompt))

I'd be happy to help you with five geography-related questions. What type of questions are you looking for? Would you like them to cover specific regions, topics (e.g., mountains, rivers, cities), or themes (e.g., climate, economy)? Let me know and I'll do my best to assist you.

Also, is there a specific level of difficulty or age group you're targeting with your quiz questions?



---

### **Avoiding Hallucinations**

Some bad news: **LLMs sometimes "hallucinate" and make claims that are untrue or unjustified**. The good news: there are techniques you can use to minimize hallucinations.
This often happens because **the LLM is trying to be as helpful as possible**.

Below, we'll go over a few of these techniques, namely:
- Giving LLMs the option to say they don't know the answer to a question
- Asking LLMs to find evidence before answering
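Both techniques can be expressed in the prompt itself. The following sketch shows one possible wording; the tag names `<question>`, `<quotes>`, and `<answer>` and the helper `grounded_prompt` are assumptions made for this example, not a fixed convention. It expects the raw document text without surrounding tags.

```python
def grounded_prompt(document, question):
    """Build a prompt that asks for supporting quotes first and allows "I don't know"."""
    return (
        f"<document>\n{document}\n</document>\n\n"
        f"<question>{question}</question>\n\n"
        "First, copy the sentences from the document that are most relevant to the "
        "question inside <quotes> tags. Then answer inside <answer> tags, using only "
        "those quotes. If the quotes do not contain the answer, write "
        "\"I don't know\" inside <answer> tags."
    )


example = grounded_prompt("The sky is blue.", "What color is the sky?")
```

Asking for the quotes before the answer matters: the model writes its evidence first and can then base the answer on it, instead of justifying a guess afterwards.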

In [20]:
document = """<document>
Matterport SEC filing 10-K 2023
Item 1. Business
Our Company
Matterport is leading the digitization and datafication of the built world. We believe the digital transformation of the built world will fundamentally change the way people interact with buildings and the physical spaces around them.
Since its founding in 2011, Matterport’s pioneering technology has set the standard for digitizing, accessing and managing buildings, spaces and places online. Our platform’s innovative software, spatial data-driven data science, and 3D capture technology have broken down the barriers that have kept the largest asset class in the world, buildings and physical spaces, offline and underutilized for many years. We believe the digitization and datafication of the built world will continue to unlock significant operational efficiencies and property values, and that Matterport is the platform to lead this enormous global transformation.
The world is rapidly moving from offline to online. Digital transformation has made a powerful and lasting impact across every business and industry today. According to International Data Corporation, or IDC, over $6.8 trillion of direct investments will be made on digital transformation from 2020 to 2023, the global digital transformation spending is forecasted to reach $3.4 trillion in 2026 with a five-year compound annual growth rate (“CAGR”) of 16.3%, and digital twin investments are expected to have a five-year CAGR of 35.2%. With this secular shift, there is also growing demand for the built world to transition from physical to digital. Nevertheless, the vast majority of buildings and spaces remain offline and undigitized. The global building stock, estimated by Savills to be $327 trillion in total property value as of 2021, remains largely offline today, and we estimate that less than 0.1% is penetrated by digital transformation.
Matterport was among the first to recognize the increasing need for digitization of the built world and the power of spatial data, the unique details underlying buildings and spaces, in facilitating the understanding of buildings and spaces. In the past, technology advanced physical road maps to the data-rich, digital maps and location services we all rely on today. Matterport now digitizes buildings, creating a data-rich environment to vastly increase our understanding and the full potential of each and every space we capture. Just as we can instantly, at the touch of a button, learn the fastest route from one city to another or locate the nearest coffee shops, Matterport’s spatial data for buildings unlocks a rich set of insights and learnings about properties and spaces worldwide. In addition, just as the geo-spatial mapping platforms of today have opened their mapping data to industry to create new business models such as ridesharing, e-commerce, food delivery marketplaces, and even short-term rental and home sharing, open access to Matterport’s structured spatial data is enabling new opportunities and business models for hospitality, facilities management, insurance, construction, real estate and retail, among others.
We believe the total addressable market opportunity for digitizing the built world is over $240 billion, and could be as high as $1 trillion as the market matures at scale. This is based on our analysis, modeling and understanding of the global building stock of over 4 billion properties and 20 billion spaces in the world today. With the help of artificial intelligence (“AI”), machine learning (“ML”) and deep learning (“DL”) technologies, we believe that, with the additional monetization opportunities from powerful spatial data-driven property insights and analytics, the total addressable market for the digitization and datafication of the built world will reach more than $1 trillion.

Our spatial data platform and capture of digital twins deliver value across a diverse set of industries and use cases. Large retailers can manage thousands of store locations remotely, real estate agencies can provide virtual open houses for hundreds of properties and thousands of visitors at the same time, property developers can monitor the entirety of the construction process with greater detail and speed, and insurance companies can more precisely document and evaluate claims and underwriting assessments with efficiency and precision. Matterport delivers the critical digital experience, tools and information that matter to our subscribers about properties of virtually any size, shape, and location worldwide.
For nearly a decade, we have been growing our spatial data platform and expanding our capabilities in order to create the most detailed, accurate, and data-rich digital twins available. Moreover, our 3D reconstruction process is fully automated, allowing our solution to scale with equal precision to millions of buildings and spaces of any type, shape, and size in the world. The universal applicability of our service provides Matterport significant scale and reach across diverse verticals and any geography. As of December 31, 2022, our subscriber base had grown approximately 39% to over 701,000 subscribers from 503,000 subscribers as of December 31, 2021, with our digital twins reaching more than 170 countries. We have digitized more than 28 billion square feet of space across multiple industries, representing significant scale and growth over the rest of the market.

As we continue to transform buildings into data worldwide, we are extending our spatial data platform to further transform property planning, development, management and intelligence for our subscribers across industries to become the de facto building and business intelligence engine for the built world. We believe the demand for spatial data and resulting insights for enterprises, businesses and institutions across industries, including real estate, architecture, engineering and construction (“AEC”), retail, insurance and government, will continue to grow rapidly.
We believe digitization and datafication represent a tremendous greenfield opportunity for growth across this massive category and asset class. From the early stages of design and development to marketing, operations, insurance and building repair and maintenance, our platform’s software and technology provide subscribers critical tools and insights to drive cost savings, increase revenues and optimally manage their buildings and spaces. We believe that hundreds of billions of dollars in unrealized utilization and operating efficiencies in the built world can be unlocked through the power of our spatial data platform. Our platform and data solutions have universal applicability across industries and building categories, giving Matterport a significant advantage as we can address the entirety of this large market opportunity and increase the value of what we believe to be the largest asset class in the world.
With a demonstrated track record of delivering value to our subscribers, our offerings include software subscription, data licensing, services and product hardware. As of December 31, 2022, our subscriber base included over 24% of Fortune 1000 companies, with less than 10% of our total revenue generated from our top 10 subscribers. We expect more than 80% of our revenue to come from our software subscription and data license solutions by 2025. Our innovative 3D capture products, the Pro2 and Pro3 Cameras, have played an integral part in shaping the 3D building and property visualization ecosystem. The Pro2 and Pro3 Cameras have driven adoption of our solutions and have generated the unique high-quality and scaled data set that has enabled Cortex, our proprietary AI software engine, to become the pioneering engine for digital twin creation. With this data advantage initially spurred by the Pro2 Camera, we have developed a capture device agnostic platform that scales and can generate new building and property insights for our subscribers across industries and geographies.
We have recently experienced rapid growth. Our subscribers have grown approximately 49-fold from December 31, 2018 to December 31, 2022. Our revenue increased by approximately 22% to $136.1 million for the year ended December 31, 2022, from approximately $111.2 million for the year ended December 31, 2021. Our gross profit decreased by $8.1 million or 14%, to $51.8 million for the year ended December 31, 2022, from $60.0 million for the year ended December 31, 2021, primarily attributable to certain disruptive and incremental costs due to the global supply chain constraints in fiscal year 2022. Our ability to retain and grow the subscription revenue generated by our existing subscribers is an important measure of the health of our business and our future growth prospects. We track our performance in this area by measuring our net dollar expansion rate from the same set of customers across comparable periods. Our net dollar expansion rate of 103% for the three months ended December 31, 2022 demonstrates the stickiness and growth potential of our platform.
Our Industry and Market Opportunity
Today, the vast majority of buildings and spaces remain undigitized. We estimate our current serviceable addressable market includes approximately 1.3 billion spaces worldwide, primarily from the real estate and travel and hospitality sectors. With approximately 9.2 million spaces under management as of December 31, 2022, we are continuing to penetrate the global building stock and expand our footprint across various end markets, including residential and commercial real estate, facilities management, retail, AEC, insurance and repair, and travel and hospitality. We estimate our total addressable market to be more than 4 billion buildings and 20 billion spaces globally, yielding a more than $240 billion market opportunity. We believe that as Matterport’s unique spatial data library and property data services continue to grow, this opportunity could increase to more than $1 trillion based on the size of the building stock and the untapped value creation available to buildings worldwide. The constraints created by the COVID-19 pandemic have only reinforced and accelerated the importance of our scaled 3D capture solution that we have developed for diverse industries and markets over the past decade.

Our Spatial Data Platform
Overview
Our technology platform uses spatial data collected from a wide variety of digital capture devices to transform physical buildings and spaces into dimensionally accurate, photorealistic digital twins that provide our subscribers access to previously unavailable building information and insights.
As a first mover in this massive market for nearly a decade, we have developed and scaled our industry-leading 3D reconstruction technology powered by Cortex, our proprietary AI-driven software engine that uses machine learning to recreate a photorealistic, 3D virtual representation of an entire building structure, including contents, equipment and furnishings. The finished product is a detailed and dynamic replication of the physical space that can be explored, analyzed and customized from a web browser on any device, including smartphones. The power to manage even large-scale commercial buildings is in the palm of each subscriber’s hands, made possible by our advanced technology and breakthrough innovations across our entire spatial data technology stack.
Key elements of our spatial data platform include:
•Bringing offline buildings online. Traditionally, our customers needed to conduct in-person site visits to understand and assess their buildings and spaces. While photographs and floor plans can be helpful, these forms of two-dimensional (“2D”) representation have limited information and tend to be static and rigid, and thus lack the interactive element critical to a holistic understanding of each building and space. With the AI-powered capabilities of Cortex, our proprietary AI software, representation of physical objects is no longer confined to static 2D images and physical visits can be eliminated. Cortex helps to move the buildings and spaces from offline to online and makes them accessible to our customers in real-time and on demand from anywhere. After subscribers scan their buildings, our visualization algorithms accurately infer spatial positions and depths from flat, 2D imagery captured through the scans and transform them into high- fidelity and precise digital twin models. This creates a fully automated image processing pipeline to ensure that each digital twin is of professional grade image quality.
•Driven by spatial data. We are a data-driven company. Each incremental capture of a space grows the richness and depth of our spatial data library. Spatial data represents the unique and idiosyncratic details that underlie and compose the buildings and spaces in the human- made environment. Cortex uses the breadth of the billions of data points we have accumulated over the years to improve the 3D accuracy of our digital twins. We help our subscribers pinpoint the height, location and other characteristics of objects in their digital twin. Our sophisticated algorithms also deliver significant commercial value to our subscribers by generating data-based insights that allow them to confidently make assessments and decisions about their properties. For instance, property developers can assess the amount of natural heat and daylight coming from specific windows, retailers can ensure each store layout is up to the same level of code and brand requirements, and factories can insure machinery layouts meet specifications and location guidelines. With approximately 9.2 million spaces under management as of December 31, 2022, our spatial data library is the clearinghouse for information about the built world.
•Powered by AI and ML. Artificial intelligence and machine learning technologies effectively utilize spatial data to create a robust virtual experience that is dynamic, realistic, interactive, informative and permits multiple viewing angles. AI and ML also make costly cameras unnecessary for everyday scans—subscribers can now scan their spaces by simply tapping a button on their smartphones. As a result, Matterport is a device agnostic platform, helping us more rapidly scale and drive towards our mission of digitizing and indexing the built world.
Our value proposition to subscribers is designed to serve the entirety of the digital building lifecycle, from design and build to maintenance and operations, promotion, sale, lease, insure, repair, restore, secure and finance. As a result, we believe we are uniquely positioned to grow our revenue with our subscribers as we help them to discover opportunities to drive short- and long-term return on investment by taking their buildings and spaces from offline to online across their portfolios of properties.
Ubiquitous Capture
Matterport has become the standard for 3D space capture. Our technology platform empowers subscribers worldwide to quickly, easily and accurately digitize, customize and manage interactive and dimensionally accurate digital twins of their buildings and spaces.
The Matterport platform is designed to work with a wide range of LiDAR, spherical, 3D and 360 cameras, as well as smartphones, to suit the capture needs of all of our subscribers. This provides the flexibility to capture a space of any size, scale, and complexity, at anytime and anywhere.
•Matterport Pro3 is our newest 3D camera that scans properties faster than earlier versions to help accelerate project completion. Pro3 provides the highest accuracy scans of both indoor and outdoor spaces and is designed for speed, fidelity, versatility and accuracy. Capturing 3D data up to 100 meters away at less than 20 seconds per sweep, Pro3’s ultra-fast, high-precision LiDAR sensor can run for hours and takes millions of measurements in any conditions.
•Matterport Pro2 is our proprietary 3D camera that has been used to capture millions of spaces around the world with a high degree of fidelity, precision, speed and simplicity. Capable of capturing buildings more than 500,000 square feet in size, it has become the camera of choice for many residential, commercial, industrial and large-scale properties.
•360 Cameras. Matterport supports a selection of 360 cameras available in the market. These affordable, pocket sized devices deliver precision captures with high fidelity and are appropriate for capturing smaller homes, condos, short-term rentals, apartments, and more. The spherical lens image capture technology of these devices gives Cortex robust, detailed image data to transform panoramas into our industry-leading digital twins.
•LEICA BLK360. Through our partnership with Leica, our 3D reconstruction technology and our AI powered software engine, Cortex, transform this powerful LiDAR camera into an ultra-precise capture device for creating Matterport digital twins. It is the solution of choice for AEC professionals when exacting precision is required.
•Smartphone Capture. Our capture apps are commercially available for both iOS and Android. Matterport’s smartphone capture solution has democratized 3D capture, making it easy and accessible for anyone to digitize buildings and spaces with a recent iPhone device since the initial introduction of Matterport for iPhone in May 2020. In April 2021, we announced the official release of the Android Capture app, giving Android users the ability to quickly and easily capture buildings and spaces in immersive 3D. In February 2022, we launched Matterport Axis, a motorized mount that holds a smartphone and can be used with the Matterport Capture app to capture 3D digital twins of any physical space with increased speed, precision, and consistency.
Cortex and 3D Reconstruction (the Matterport Digital Twin)
With a spatial data library, as of December 31, 2022, of approximately 9.2 million spaces under management, representing approximately 28 billion captured square feet of space, we use our advanced ML and DL technologies to algorithmically transform the spatial data we capture into an accurate 3D digital reproduction of any physical space. This intelligent, automated 3D reconstruction is made possible by Cortex, our AI-powered software engine that includes a deep learning neural network that uses our spatial data library to understand how a building or space is divided into floors and rooms, where the doorways and openings are located, and what types of rooms are present, such that those forms are compiled and aligned with dimensional accuracy into a dynamic, photorealistic digital twin. Other components of Cortex include AI-powered computer vision technologies to identify and classify the contents inside a building or space, and object recognition technologies to identify and segment everything from furnishings and equipment to doors, windows, light fixtures, fire suppression sprinklers and fire escapes. Our highly scalable artificial intelligence platform enables our subscribers to tap into powerful, enhanced building data and insights at the click of a button.

The Science Behind the Matterport Digital Twin: Cortex AI Highlights
Matterport Runs on Cortex
Cortex is our AI-powered software engine that includes a precision deep learning neural network to create digital twins of any building or space. Developed using our proprietary spatial data captured with our Pro2 and Pro3 cameras, Cortex delivers a high degree of precision and accuracy while enabling 3D capture using everyday devices.
Generic neural networks struggle with 3D reconstruction of the real world. Matterport-optimized networks deliver more accurate and robust results. More than just raw training data, Matterport’s datasets allow us to develop new neural network architectures and evaluate them against user behavior and real-world data in millions of situations.
•Deep learning: Connecting and optimizing the detailed neural network data architecture of each space is key to creating robust, highly accurate 3D digital twins. Cortex evaluates and optimizes each 3D model against Matterport’s rich spatial data aggregated from millions of buildings and spaces and the human annotations of those data provided by tens of thousands of subscribers worldwide. Cortex’s evaluative abilities and its data-driven optimization of 3D reconstruction yield consistent, high-precision results across a wide array of building configurations, spaces and environments.
•Dynamic 3D reconstruction: Creating precise 3D spatial data at scale from 2D visuals and static images requires a combination of photorealistic, detailed data from multiple viewpoints and millions of spaces that train and optimize Cortex’s neural network and learning capabilities for improved 3D reconstruction of any space. Cortex’s capabilities combined with real-time spatial alignment algorithms in our 3D capture technology create an intuitive “preview” of any work in progress, allowing subscribers to work with their content interactively and in real-time.
•Computer vision: Cortex enables a suite of powerful features to enhance the value of digital twins. These include automatic measurements for rooms or objects in a room, automatic 2D-from-3D high-definition photo gallery creation, auto face blurring for privacy protection, custom videos, walkthroughs, auto room labeling and object recognition.
•Advanced image processing: Matterport’s computational photography algorithms create a fully automated image processing pipeline to help ensure that each digital twin is of professional grade image quality. Our patented technology makes 3D capture as simple as pressing a single button. Matterport’s software and technology manage the remaining steps, including white balance and camera-specific color correction, high dynamic range tone mapping, de-noising, haze removal, sharpening, saturation and other adjustments to improve image quality.
Spatial Data and AI-Powered Insights
Every Matterport digital twin contains extensive information about a building, room or physical space. The data uses our AI-powered Cortex engine. In addition to the Matterport digital twin itself, our spatial data consists of precision building geometry and structural detail, building contents, fixtures and condition, along with high-definition imagery and photorealistic detail from many vantage points in a space. Cortex employs a technique we call deep spatial indexing. Deep spatial indexing uses artificial intelligence, computer vision and deep learning to identify and convey important details about each space, its structure and its contents with precision and fidelity. We have created a robust spatial data standard that enables Matterport subscribers to harness an interoperable digital system of record for any building.
In addition to creating a highly interactive digital experience for subscribers through the construction of digital twins, we ask ourselves two questions for every subscriber: (1) what is important about their building or physical space and (2) what learnings and insights can we deliver for this space? Our AI-powered Cortex engine helps us answer these questions using our spatial data library to provide aggregated property trends and operational and valuation insights. Moreover, as the Matterport platform ecosystem continues to expand, our subscribers, partners and other third-party developers can bring their own tools to further the breadth and depth of insights they can harvest from our rich spatial data layer.
Extensible Platform Ecosystem
Matterport offers the largest and most accurate library of spatial data in the world, with, as of December 31, 2022, approximately 9.2 million spaces under management and approximately 28 billion captured square feet. The versatility of our spatial data platform and extensive enterprise software development kit and application programming interfaces (“APIs”) has allowed us to develop a robust global ecosystem of channels and partners that extend the Matterport value proposition by geography and vertical market. We intend to continue to deploy a broad set of workflow integrations with our partners and their subscribers to promote an integrated Matterport solution across our target markets. We are also developing a third-party software marketplace to extend the power of our spatial data platform with easy-to-deploy and easy-to-access Matterport software add-ons. The marketplace enables developers to build new applications and spatial data mining tools, enhance the Matterport 3D experience, and create new productivity and property management tools that supplement our core offerings. These value-added capabilities created by third-party developers enable a scalable new revenue stream, with Matterport sharing the subscription and services revenue from each add-on that is deployed to subscribers through the online marketplace. The network effects of our platform ecosystem contributes to the growth of our business, and we believe that it will continue to bolster future growth by enhancing subscriber stickiness and user engagement.
Examples of Matterport add-ons and extensions include:
•Add-ons: Encircle (easy-to-use field documentation tools for faster claims processing); WP Matterport Shortcode (free Wordpress plugin that allows Matterport to be embedded quickly and easily with a Matterport shortcode), WP3D Models (WordPress + Matterport integration plugin); Rela (all-in-one marketing solution for listings); CAPTUR3D (all-in-one Content Management System that extends value to Matterport digital twins); Private Model Emded (feature that allows enterprises to privately share digital twins with a large group of employees on the corporate network without requiring additional user licenses); Views (new workgroup collaboration framework to enable groups and large organizations to create separate, permissions-based workflows to manage different tasks with different teams); and Guided Tours and Tags (tool to elevate the visitor experience by creating directed virtual tours of any commercial or residential space tailored to the interests of their visitors). We unveiled our private beta integration with Amazon Web Services (AWS) IoT TwinMaker to enable enterprise customers to seamlessly connect IoT data into visually immersive and dimensionally accurate Matterport digital twin.
•Services: Matterport ADA Compliant Digital Twin (solution to provide American Disability Act compliant digital twins) and Enterprise Cloud Software Platform (reimagined cloud software platform for the enterprise that creates, publishes, and manages digital twins of buildings and spaces of any size of shape, indoors or outdoors).
Our Competitive Strengths
We believe that we have a number of competitive strengths that will enable our market leadership to grow. Our competitive strengths include:
•Breadth and depth of the Matterport platform. Our core strength is our all-in-one spatial data platform with broad reach across diverse verticals and geographies such as capture to processing to industries without customization. With the ability to integrate seamlessly with various enterprise systems, our platform delivers value across the property lifecycle for diverse end markets, including real estate, AEC, travel and hospitality, repair and insurance, and industrial and facilities. As of December 31, 2022, our global reach extended to subscribers in more than 170 countries, including over 24% of Fortune 1000 companies.
•Market leadership and first-mover advantage. Matterport defined the category of digitizing and datafying the built world almost a decade ago, and we have become the global leader in the category. As of December 31, 2022, we had over 701,000 subscribers on our platform and approximately 9.2 million spaces under management. Our leadership is primarily driven by the fact that we were the first mover in digital twin creation. As a result of our first mover advantage, we have amassed a deep and rich library of spatial data that continues to compound and enhance our leadership position.
•Significant network effect. With each new capture and piece of data added to our platform, the richness of our dataset and the depth of insights from our spaces under management grow. In addition, the combination of our ability to turn data into insights with incremental data from new data captures by our subscribers enables Matterport to develop features for subscribers to our platform. We were a first mover in building a spatial data library for the built world, and our leadership in gathering and deriving insights from data continues to compound and the relevance of those insights attracts more new subscribers.
•Massive spatial data library as the raw material for valuable property insights. The scale of our spatial data library is a significant advantage in deriving insights for our subscribers. Our spatial data library serves as vital ground truth for Cortex, enabling Matterport to create powerful 3D digital twins using a wide range of camera technology, including low-cost digital and smartphone cameras. As of December 31, 2022, our data came from approximately 9.2 million spaces under management and approximately 28 billion captured square feet. As a result, we have taken property insights and analytics to new levels, benefiting subscribers across various industries. For example, facilities managers significantly reduce the time needed to create building layouts, leading to a significant decrease in the cost of site surveying and as-built modeling. AEC subscribers use the analytics of each as-built space to streamline documentation and collaborate with ease.
•Global reach and scale. We are focused on continuing to expand our AI-powered spatial data platform worldwide. We have a significant presence in North America, Europe and Asia, with leadership teams and a go-to-market infrastructure in each of these regions. We have offices in London, Singapore and several across the United States, and we are accelerating our international expansion. As of December 31, 2022, we had over 701,000 subscribers in more than 170 countries. We believe that the geography-agnostic nature of our spatial data platform is a significant advantage as we continue to grow internationally.
•Broad patent portfolio supporting 10 years of R&D and innovation. As of December 31, 2022, we had 54 issued and 37 pending patent applications. Our success is based on almost 10 years of focus on innovation. Innovation has been at the center of Matterport, and we will continue to prioritize our investments in R&D to further our market leading position.
•Superior capture technology. Matterport’s capture technology platform is a software framework that enables support for a wide variety of capture devices required to create a Matterport digital twin of a building or space.
This includes support for LiDAR cameras, 360 cameras, smartphones, Matterport Axis and the Matterport Pro2 and Pro3 cameras. The Pro2 camera was foundational to our spatial data advantage, and we have expanded that advantage with an array of Matterport-enabled third-party capture devices. In August 2022, we launched and began shipment of our Pro3 Camera along with major updates to our industry-leading digital twin cloud platform. The Matterport Pro3 Camera is an advanced 3D capture device, which includes faster boot time, swappable batteries, and a lighter design. The Pro3 camera can perform both indoors and outdoors and is designed for speed, fidelity, versatility and accuracy. Along with our Pro2 Camera, we expect that future sales of our Pro3 Camera will continue to drive increased adoption of our solutions. Matterport is democratizing the 3D capture experience, making high-fidelity and high-accuracy 3D digital twins readily available for any building type and any subscriber need in the property life cycle. While there are other 3D capture solution providers, very few can produce true, dimensionally accurate 3D results, and fewer still can automatically create a final product in photorealistic 3D, and at global scale. This expansive capture technology offering would not be possible without our rich spatial data library available to train the AI-powered Cortex engine to automatically generate accurate digital twins from photos captured with a smartphone or 360 camera.
</document>"""

In [21]:
# Variant 1: the question comes first and the long document comes last.
prompt = f"""<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>
Please read the below document. Do not make any summary, just write a brief numerical answer inside <answer> tags.

{document}
"""

display(Markdown(get_completion(prompt)))

This is a lengthy document outlining Matterport's business model, technology, and competitive strengths. Here are some key takeaways:

**Business Model:**

* Matterport offers a spatial data platform that provides a suite of tools for creating, managing, and analyzing digital twins of buildings and spaces.
* The platform includes capture technology, software, and APIs to integrate with various enterprise systems.
* Subscribers can access valuable property insights and analytics through the platform.

**Technology:**

* Matterport has developed an AI-powered spatial data engine called Cortex that enables the creation of accurate 3D digital twins from photos captured with a smartphone or 360 camera.
* The company's capture technology platform supports a wide variety of devices, including LiDAR cameras, 360 cameras, and smartphones.
* Cortex uses deep learning and computer vision to identify and convey important details about each space.

**Competitive Strengths:**

* Breadth and depth of the Matterport platform
* Market leadership and first-mover advantage
* Significant network effect from growing spatial data library
* Massive spatial data library as raw material for valuable property insights
* Global reach and scale
* Broad patent portfolio supporting 10 years of R&D and innovation

**Key Features:**

* All-in-one spatial data platform with capture, processing, and analysis capabilities
* AI-powered Cortex engine for creating accurate 3D digital twins
* Support for a wide variety of devices, including LiDAR cameras, 360 cameras, smartphones, and Matterport Pro2 and Pro3 cameras.
* Integration with various enterprise systems through APIs.
* Valuable property insights and analytics for subscribers.

**Target Markets:**

* Real estate
* AEC (architecture, engineering, and construction)
* Travel and hospitality
* Repair and insurance
* Industrial and facilities

**Partnerships and Ecosystem:**

* Matterport has developed a third-party software marketplace to extend the power of its spatial data platform.
* The company partners with various enterprises and organizations to promote its value proposition.

Overall, Matterport's business model is focused on providing a comprehensive suite of tools for creating, managing, and analyzing digital twins of buildings and spaces. The company's technology, including Cortex and its capture platform, enables the creation of accurate 3D digital twins from photos captured with various devices. Its competitive strengths include market leadership, significant network effects, and a broad patent portfolio supporting 10 years of R&D and innovation.

In [22]:
# Variant 2: the document comes first; the question and instructions follow it.
prompt = f"""Please read the below document.

{document}

Please read the following question and provide a brief numerical answer inside <answer> tags.
<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>
"""

display(Markdown(get_completion(prompt)))

<answer>0</answer>

In [23]:
# Variant 3: ask for a <scratchpad> with the most relevant quote before the final <answer>.
prompt = f"""
    {document}

    <question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>
    Please read the document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail.
    Then write a brief numerical answer in <answer> tags.
"""

display(Markdown(get_completion(prompt)))

<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>

There is no mention of a specific date in the provided document for Matterport's subscriber base as of May 31, 2020.

However, according to the document, as of December 31, 2022, Matterport had over 701,000 subscribers.

<answer>701,000</answer>
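
Since the prompts above ask for the result inside `<answer>` tags (and the reasoning inside `<scratchpad>` tags), the final value can be pulled out of the raw response programmatically. The following is a minimal sketch using only the standard library; the helper name `extract_tag` is our own and is not part of Ollama or of this notebook.

```python
import re

def extract_tag(response, tag="answer"):
    """Return the stripped text inside the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    # DOTALL lets the tag content span several lines (useful for <scratchpad>).
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

sample = "There is no mention of May 31, 2020.\n<answer>701,000</answer>"
print(extract_tag(sample))          # 701,000
print(extract_tag("no tags here"))  # None
```

Returning `None` when the tag is missing makes it easy to detect responses that ignored the formatting instruction, as happened with the first prompt above.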


---

### **Conclusion**

We have now seen many prompt engineering techniques. Here is some final advice on how to assemble them into a complex prompt (not all of the elements have to be present at the same time):

- **Task Context**: give the LLM context about the role it should take on or what goals and overarching tasks you want it to undertake with the prompt;
- **Tone Context**: if important to the interaction, tell the LLM what tone it should use;
- **Task Description**: expand on the specific tasks you want the LLM to do, as well as any rules that it might have to follow (e.g., allow it to say that it doesn't have an answer or doesn't know);
- **Examples**: provide the LLM with at least one example of an ideal response that it can emulate;
- **Input Data**: if there is data that the LLM needs to process within the prompt, include it within relevant tags;
- **Immediate Task**: tell the LLM exactly what it's expected to immediately do to fulfill the prompt's task;
- **Precognition**: for tasks with multiple steps, it's good to tell the LLM to think step by step before giving an answer;
- **Output Formatting**: if there is a specific way you want the LLM's response formatted, clearly tell the LLM what that format is.
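
As an illustration, the elements above can be assembled into one prompt with a small helper. This is only a sketch: `assemble_prompt`, its parameter names, and the example strings are hypothetical, and the `<example>` / `<document>` tag names are one possible choice. Elements that are not provided are simply skipped.

```python
def assemble_prompt(task_context="", tone_context="", task_description="",
                    examples="", input_data="", immediate_task="",
                    precognition="", output_formatting=""):
    """Join the non-empty prompt elements, in the order of the list above."""
    parts = [
        task_context,
        tone_context,
        task_description,
        f"<example>\n{examples}\n</example>" if examples else "",
        f"<document>\n{input_data}\n</document>" if input_data else "",
        immediate_task,
        precognition,
        output_formatting,
    ]
    # Drop the missing elements and separate the others with a blank line
    return "\n\n".join(p.strip() for p in parts if p.strip())


example_prompt = assemble_prompt(
    task_context="You are a financial analyst assistant.",
    task_description="Answer only from the document. If the answer is not there, say you don't know.",
    input_data="As of December 31, 2022, the company had over 701,000 subscribers.",
    immediate_task="<question>How many subscribers were there on December 31, 2022?</question>",
    precognition="First quote the relevant sentence inside <scratchpad> tags.",
    output_formatting="Then give a brief numerical answer inside <answer> tags.",
)
print(example_prompt)
```

The resulting string can be passed to `get_completion` like any other prompt in this notebook.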


Many other techniques exist, mainly in the **precognition** area (such as Tree of Thoughts or Metacognitive Prompting), but those presented today are the fundamental building blocks of prompt engineering.

In recent years, *automatic prompting* techniques have also been developed:
- APO: Automatic Prompt Optimization [(Pryzant et al., 2023)](https://arxiv.org/pdf/2305.03495)
- AutoPDL: Automatic Prompt Optimization for LLM Agents [(Spiess et al., 2025)](https://arxiv.org/pdf/2504.04365)

## **PokéRAG: a Pokémon Retrieval Augmented Generation (RAG)**


![](https://upload.wikimedia.org/wikipedia/commons/thumb/9/98/International_Pok%C3%A9mon_logo.svg/1200px-International_Pok%C3%A9mon_logo.svg.png)


In this section, we will explore **Retrieval-Augmented Generation (RAG)** using the Pokémon universe as our guiding example. RAG is an advanced approach in natural language processing that combines two powerful components: **retrieval**, which searches for relevant information from external sources, and **generation**, which uses that information to produce coherent and contextually accurate responses.

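First we need the data. Our source is the [Pokémon Database](https://pokemondb.net/pokedex), and the scraping procedure is inspired by [Pokédex Scraper](https://github.com/vossenwout/ai-pokedex-scraper/tree/main). Before downloading it, we ask the plain model a Pokémon question, to see what it answers without any retrieved context.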

In [24]:
display(Markdown(get_completion("Which is the height of Pikachu?")))

The height of Pikachu varies depending on the version, but according to official sources, Pikachu's height is typically around 16-17 inches (40-43 cm) tall.

In [25]:
import gdown
gdown.download("https://drive.google.com/uc?id=14ACWP3fMvmLo3zPwRGzf2NKBhNoNNwNI", "pokedex.zip", quiet=False)

!unzip -q pokedex.zip

Downloading...
From: https://drive.google.com/uc?id=14ACWP3fMvmLo3zPwRGzf2NKBhNoNNwNI
To: /content/pokedex.zip
100%|██████████| 3.91M/3.91M [00:00<00:00, 53.7MB/s]


### **Indexing**

Indexing is the foundation of a Retrieval-Augmented Generation (RAG) system.  
It transforms unstructured data (text, PDFs, images, or web content) into a searchable vector database that the model can query efficiently.  
The process typically unfolds through four key phases:


![Load - Split - Embed - Store](https://mintcdn.com/langchain-5e9cc07a/I6RpA28iE233vhYX/images/rag_indexing.png?fit=max&auto=format&n=I6RpA28iE233vhYX&q=85&s=21403ce0d0c772da84dcc5b75cff4451)

1. **Load**: raw information is collected from various sources (documents in our case) and prepared for processing

2. **Split**: each document is divided into smaller, manageable passages called *chunks*. This segmentation ensures that context windows are concise and semantically meaningful, improving retrieval precision later. Chunking can be based on paragraphs, sentences, or tokens.

3. **Embed**: every chunk is converted into a dense vector representation using an embedding model. These vectors encode semantic meaning, allowing similar pieces of text to be located by proximity in vector space.

4. **Store**: the resulting embeddings are saved in a **vector database** (like FAISS in our case). This index serves as the *memory* of the RAG system. When a query arrives, the database retrieves the most relevant chunks by comparing embeddings.
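
The four phases above can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the two in-memory documents are made up, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and a plain NumPy matrix plays the role of the vector database.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalised (stand-in for a real embedding model)."""
    vec = np.zeros(dim, dtype=np.float32)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-12)

# 1. Load: two tiny in-memory "documents"
toy_docs = {
    "pikachu": "Pikachu is an Electric type.\n\nPikachu evolves from Pichu.",
    "bulbasaur": "Bulbasaur is a Grass and Poison type.\n\nBulbasaur evolves into Ivysaur.",
}

# 2. Split: one chunk per paragraph, keeping the source name as metadata
toy_chunks = [(name, par) for name, text in toy_docs.items() for par in text.split("\n\n")]

# 3. Embed: one vector per chunk
toy_matrix = np.stack([toy_embed(par) for _, par in toy_chunks])

# 4. Store: here the matrix itself is the index; search is a dot product
def toy_search(query, k=1):
    scores = toy_matrix @ toy_embed(query)
    best = np.argsort(-scores)[:k]
    return [(toy_chunks[i][0], toy_chunks[i][1], float(scores[i])) for i in best]

print(toy_search("Bulbasaur evolves into Ivysaur."))
```

A query identical to a stored chunk retrieves that chunk with a score of about 1. A real embedding model would also place paraphrases close to each other, which this word-hashing stand-in cannot do.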

#### **Loading and Splitting**

In our project, the **Load** and **Split** phases of indexing operate on Pokémon entries taken from a cleaned Markdown dump of [pokemondb.net/pokedex](https://pokemondb.net/pokedex).  
Each Markdown file represents a single Pokémon and contains detailed, structured information: description, abilities, stats, moves, evolutions, and more.

**Load**
During the loading phase, all Markdown files are read from the dataset directory and prepared for processing.  
Metadata such as the Pokémon’s name and file path are extracted to maintain a link between each document and its original source.  
This step transforms raw `.md` files into a structured format that can be handled by the indexing pipeline.

**Split**
Once loaded, each entry is divided into smaller text chunks.  
Because Pokédex files often include long tables or lists, the cleaning step first strips Markdown noise, and the splitter then segments the content into semantically coherent sections, for example:
- General info and description  
- Abilities and forms  
- Base stats  
- Move lists  

This ensures that each chunk remains meaningful and contextually complete, improving the relevance of later retrieval during the RAG query phase.
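
To make the idea of section-aware splitting concrete, here is a minimal sketch of a greedy splitter that cuts before each `## ` header and packs consecutive sections into chunks of bounded size. It is not the splitter used in this notebook: `split_by_sections`, `max_chars`, and the sample text are hypothetical, and a single section longer than `max_chars` is kept whole rather than split further.

```python
def split_by_sections(text, max_chars=200):
    """Cut before each '## ' header, then pack sections into chunks of at most max_chars."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, buf = [], ""
    for sec in sections:
        candidate = f"{buf}\n{sec}" if buf else sec
        if buf and len(candidate) > max_chars:
            chunks.append(buf)   # current chunk is full: start a new one
            buf = sec
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks


sample = (
    "# Pikachu\nElectric mouse.\n"
    "## Base stats\nHP: 35\nAttack: 55\n"
    "## Moves\nThunder Shock\nQuick Attack"
)
for chunk in split_by_sections(sample, max_chars=40):
    print(repr(chunk))
```

With `max_chars=40` each of the three sections ends up in its own chunk, while a larger limit packs them together.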

In [26]:
import re

SYM_MAP = {
    "": "1x",
    "—": "1x",
    "-": "1x",
    "0": "0x",
    "½": "0.5x",
    "¼": "0.25x",
    "2": "2x",
    "4": "4x",
}

TYPE_DEF_RE = re.compile(
    r"(?ims)^(type defenses[^\n]*?)\s*\n(-{3,})\s*\n(.*?)(?=^\s*[\w][^\n]*\n[-=]{3,}|^\s*#{1,6}\s+\S|\Z)"
)

def _render_list_from_table(section: str) -> str:
    # drop any leading text description (everything before the first line containing '|')
    lines = section.splitlines()
    first_table_idx = next((i for i, l in enumerate(lines) if "|" in l), 0)
    raw_lines = lines[first_table_idx:]

    # keep only lines containing a pipe
    raw_lines = [ln.rstrip() for ln in raw_lines if '|' in ln]
    # drop separator rows such as | --- | --- |
    table_lines = [ln for ln in raw_lines if not re.match(r"^\s*\|\s*-", ln)]

    all_types, all_values = [], []

    i = 0
    while i + 1 < len(table_lines):
        header_line = table_lines[i].strip()
        value_line  = table_lines[i+1].strip()
        if '|' not in header_line or '|' not in value_line:
            i += 1
            continue

        header_cells = [c.strip() for c in header_line.strip('|').split('|')]
        value_cells  = [c.strip() for c in value_line.strip('|').split('|')]

        # pad so that empty columns are kept as neutral (1x)
        if len(value_cells) < len(header_cells):
            value_cells += [""] * (len(header_cells) - len(value_cells))
        elif len(value_cells) > len(header_cells):
            value_cells = value_cells[:len(header_cells)]

        all_types.extend(header_cells)
        all_values.extend(value_cells)
        i += 2

    # build the output lines
    lines = []
    for t, v in zip(all_types, all_values):
        if not t or t.lower().startswith("the effectiveness"):
            continue
        mult = SYM_MAP.get(v, f"{v}x" if v else "1x")
        lines.append(f"{t}: {mult}")
    return "\n".join(lines)


def rewrite_type_defenses(md: str) -> str:
    def _repl(m: re.Match) -> str:
        title   = m.group(1)          # e.g. "Type defenses Hawlucha"
        underline = m.group(2)        # e.g. "------------"
        body    = m.group(3)          # tables
        try:
            list_txt = _render_list_from_table(body)
            if not list_txt.strip():
                # if parsing fails, return the original section
                return m.group(0)
            return f"{title}\n{underline}\n\n{list_txt}\n"
        except Exception:
            return m.group(0)

    return TYPE_DEF_RE.sub(_repl, md).strip()

def extract_pokemon_questions(md):
    match = re.search(
        r"(?is)answers to .*? questions\s*-+\s*(.*?)\n(?:\s*\n|other languages|name origin|$)",
        md,
        re.DOTALL
    )
    if not match:
        return []

    section = match.group(1).strip()

    questions = re.findall(r"\*\s+([^*].+?)\s*(?:\n|$)", section)
    # Drop PokéBase footer if present
    questions = [q for q in questions if "PokéBase" not in q]

    return [q.strip() for q in questions if q.strip()]

def replace_evolution_chart(md):

    pattern = (
        r'(?is)'  # case-insensitive, dot matches newlines
        r'(evolution chart\s*-+\s*)'   # header and underline
        r'(.*?)'                       # section content
        r'(?=\n[A-Z][^\n]*\n[-=]{3,}|\n#+\s+\S|\Z)'  # stop at next header or EOF
    )
    match = re.search(pattern, md)
    if not match:
        return md

    header, section = match.groups()

    # --- Clean obvious markdown junk ---
    section = re.sub(r'!\[.*?\]\(.*?\)', '', section)   # remove images
    section = re.sub(r'\[.*?\]\(.*?\)', '', section)    # remove inline links
    section = re.sub(r'\r', '', section)
    section = re.sub(r'\n{2,}', '\n', section).strip()

    # --- Remove Pokédex IDs (#0001, #0503, etc) globally before splitting ---
    section = re.sub(r'#\d{3,4}', '', section)

    # --- Split by (Level ...) ---
    parts = re.split(r'\(Level[^)]*\)', section)
    levels = re.findall(r'\(Level[^)]*\)', section)

    if len(parts) == 1:
        return md  # no evolutions found

    def clean_lines(block):
        lines = [re.sub(r'\s+', ' ', l.strip()) for l in block.split('\n') if l.strip()]
        # drop any dangling numeric lines like "0001"
        lines = [l for l in lines if not re.fullmatch(r'\d{3,4}', l)]
        return lines

    results = []

    # First Pokémon
    first = clean_lines(parts[0])
    if len(first) >= 2:
        results.append(f"Starting: {first[0]} - {' - '.join(first[1:])}")

    # Evolutions
    for lvl, chunk in zip(levels, parts[1:]):
        lvl_txt = lvl.strip("()")
        lines = clean_lines(chunk)
        if not lines:
            continue
        name = lines[0]
        types = " - ".join(lines[1:]) if len(lines) > 1 else ""
        results.append(f"At {lvl_txt}: {name} - {types}")

    new_section = f"{header}\n" + "\n".join(results) + "\n"

    return md[:match.start()] + new_section + md[match.end():]


def clean_table(md: str) -> str:
    lines = md.splitlines()
    out = []
    for line in lines:
        # Skip table dividers and empty rows
        if re.match(r'^\s*\|?(\s*-+\s*\|)+\s*-*\s*$', line):  # matches | --- | --- | ...
            continue
        if re.match(r'^\s*\|?\s*(\|\s*)+$', line):  # matches |  |  |  |  |
            continue

        if "|" in line:
            # Split and clean parts
            parts = [p.strip(" *") for p in line.split("|") if p.strip()]
            if len(parts) == 2:
                out.append(f"{parts[0]}: {parts[1]}")
            elif len(parts) > 2:
                out.append(", ".join(parts))
            # else skip true empties
        else:
            out.append(line)
    return "\n".join(out)

In [27]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os, glob, re, pathlib, json
from tqdm import tqdm, trange
import numpy as np
import ollama
import faiss

POKEDIR = "pokedex"

def clean_markdown(raw_md):

    questions = extract_pokemon_questions(raw_md)
    artworks_imgs = re.findall(r"https://img\.pokemondb\.net/artwork/(?:[a-z0-9\-]+/)?[a-z0-9\-]+\.jpg", raw_md)
    artworks = {img.split("/")[-1].split(".")[0]: img for img in artworks_imgs}


    # Remove "Contents" section and Other fake menus
    clean_md = re.sub(r"\* Contents.*?(\n\n|$)", "", raw_md, flags=re.DOTALL)
    clean_md = re.sub(r"\* In other generations.*?(\n\n|$)", "", clean_md, flags=re.DOTALL)

    # Removes sprites section (a table with images)
    clean_md = re.sub(r"(?is)\n+[^#\n]*sprites\s*-+\s*.*?(?=\n[^#\n]*where to find|\n[^#\n]+-+\n|$)", "\n", clean_md)

    # Rewrite Type Defenses table
    clean_md = rewrite_type_defenses(clean_md)

    # Drop last sections (Other languages - Name origin)
    m = re.search(r'(?im)^\s*answers to .*?questions\s*$', clean_md)
    if m:
        clean_md = clean_md[:m.start()]

    # Clean tables into lists
    clean_md = clean_table(clean_md)

    clean_md = re.sub(r"!\[.*?\]\(.*?\)", "", clean_md)             # remove images
    clean_md = re.sub(r"\[.*?\]\(.*?\)", "", clean_md)              # remove links
    clean_md = re.sub(r"\*{1,2}(.*?)\*{1,2}", r"\1", clean_md)      # remove markdown bold/italic
    clean_md = re.sub(r"`+", "", clean_md)                          # remove inline code
    clean_md = clean_md.replace("Additional artwork", "")           # remove junk line

    # Rewrite Evolution chart
    clean_md = replace_evolution_chart(clean_md)

    # Convert header in MD header
    pattern = re.compile(r"^([^\n]+)\n[-=]{3,}\s*$", re.MULTILINE)
    clean_md = re.sub(pattern, r"## \1", clean_md)

    clean_md = clean_md.strip()[1:]                             # remove first #

    with open("temp.md", "w", encoding="utf-8") as f:
        f.write(clean_md)

    return clean_md, questions, artworks


def read_markdown_files(data_dir):
    docs = []
    questions = []
    for p in tqdm(glob.glob(os.path.join(data_dir, "**/*.md"), recursive=True), desc="Loading..."):
        with open(p, "r", encoding="utf-8") as f:
            text = f.read()

        text, qsts, artworks = clean_markdown(text)
        docs.append({"path": p, "title": pathlib.Path(p).stem, "text": text, "artworks": artworks})
        questions.extend(qsts)

    return docs, questions



docs, questions = read_markdown_files(POKEDIR)

# Compute stats on documents
lengths = [len(d["text"]) for d in docs]
lines = [len(d["text"].splitlines()) for d in docs]

print(f"\n\nLoaded {len(docs)} Pokédex entries")
print(f"Average length: {np.mean(lengths):.1f} ± {np.std(lengths):.1f} chars")
print(f"Min length: {np.min(lengths)} - Max length: {np.max(lengths)}")
print(f"Extracted {len(questions)} unique questions from FAQs")

Loading...: 100%|██████████| 1027/1027 [00:03<00:00, 325.04it/s]



Loaded 1027 Pokédex entries
Average length: 7745.3 ± 3097.6 chars
Min length: 2807 - Max length: 49904
Extracted 4555 unique questions from FAQs





In [28]:
CHUNK_DIR = "chunks"
os.makedirs(CHUNK_DIR, exist_ok=True)

def split_corpus(docs, chunk_size=2000, chunk_overlap=200, chunk_min_threshold=150, save=True):

    rc = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n## ", "\n### ", "\n\n", "\n"]  # prefer keeping sections intact
    )

    corpus = []
    for d in tqdm(docs, desc="Splitting..."):
        pokemon = d['title']

        if save:
            os.makedirs(os.path.join(CHUNK_DIR, pokemon), exist_ok=True)

        for i, piece in enumerate(rc.split_text(d["text"])):

            if len(piece.strip()) < chunk_min_threshold:
                continue

            sections = re.findall(r"## ([^\n]+)", piece)

            meta = {
                "id": f"{pokemon}_{i}",
                "pokemon": pokemon,
                "sections": sections,
                "path": d["path"],
                "artworks": d["artworks"],
            }

            corpus.append({"meta": meta, "text": piece})

            if save:
                with open(os.path.join(CHUNK_DIR, pokemon, f"chunk_{i}.txt"), "w", encoding="utf-8") as f:
                    f.write(piece)

    return corpus

corpus = split_corpus(docs)

# Stats on chunks
chunk_lengths = [len(c["text"]) for c in corpus]
print(f"\n\n{len(corpus)} chunks created from {len(docs)} Pokédex entries")
print(f"Average chunk length: {np.mean(chunk_lengths):.1f} ± {np.std(chunk_lengths):.1f} chars")
print(f"Min chunk length: {np.min(chunk_lengths)} chars")
print(f"1° percentile: {np.quantile(chunk_lengths, 0.01):.2f} chars")
print(f"25° percentile: {np.quantile(chunk_lengths, 0.25):.2f} chars")
print(f"50° percentile: {np.quantile(chunk_lengths, 0.50):.2f} chars")
print(f"75° percentile: {np.quantile(chunk_lengths, 0.75):.2f} chars")
print(f"99° percentile: {np.quantile(chunk_lengths, 0.99):.2f} chars")
print(f"Max chunk length: {np.max(chunk_lengths)} chars")

Splitting...: 100%|██████████| 1027/1027 [00:00<00:00, 2174.99it/s]



6104 chunks created from 1027 Pokédex entries
Average chunk length: 1324.3 ± 526.6 chars
Min chunk length: 150 chars
1° percentile: 209.00 chars
25° percentile: 927.75 chars
50° percentile: 1434.00 chars
75° percentile: 1790.00 chars
99° percentile: 1994.00 chars
Max chunk length: 1999 chars





#### **Embed**

After loading and splitting the Markdown entries, the next step in our pipeline is **embedding**.  
Each text chunk generated from a Pokémon’s entry is transformed into a dense numerical vector, a compact representation of its semantic meaning.

In our case, we use the **`bge-large`** embedding model through [**Ollama**](https://ollama.com/library/bge-large), which encodes every chunk into a 1024-dimensional vector.  

This process allows the system to understand the underlying meaning of descriptions, stats, or abilities.
The output of this phase is a collection of high-dimensional embeddings, each linked to its original chunk and Pokémon source.
These vectors will form the searchable index used later during retrieval.
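
Proximity between embeddings is usually measured with cosine similarity, which becomes a plain dot product once the vectors are scaled to unit length. The sketch below shows this on three made-up 3-dimensional vectors; `l2_normalize` and the toy data are illustrative names, not part of the pipeline.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length (the small epsilon avoids division by zero)."""
    return matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-12)

# Three made-up 3-dimensional "embeddings"
toy_vectors = np.array([
    [3.0, 4.0, 0.0],   # A
    [6.0, 8.0, 0.0],   # B: same direction as A, twice as long
    [0.0, 0.0, 5.0],   # C: orthogonal to A and B
])

unit = l2_normalize(toy_vectors)
cosine = unit @ unit.T   # after normalisation, dot product == cosine similarity
print(np.round(cosine, 3))
```

A and B point in the same direction, so their similarity is 1 regardless of their length, while C is orthogonal to both and gets a similarity of 0.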


In [33]:
def l2norm(a):
    norms = np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
    return a / norms

def embedd(chunks, model="bge-large", batch_size=64):
    all_embeddings = []
    for i in trange(0, len(chunks), batch_size, desc="Embedding..."):
        batch_texts = chunks[i:i + batch_size]
        # Call ollama.embeddings for each text individually
        for text in batch_texts:
            response = ollama.embeddings(model=model, prompt=text)
            all_embeddings.append(response['embedding'])
    return l2norm(np.array(all_embeddings))


embeddings = embedd([c["text"] for c in corpus])
embeddings.shape

Embedding...: 100%|██████████| 96/96 [05:32<00:00,  3.47s/it]


(6104, 1024)

#### **Store**

Once all Pokédex chunks have been embedded, the resulting vectors are **stored** in a dedicated vector database using [**Facebook AI Similarity Search (FAISS)**](https://faiss.ai/) by [Johnson et al. (2021)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8733051).  
This phase transforms the collection of numerical embeddings into a searchable index, enabling fast and accurate retrieval.

In our setup, each embedding is stored together with its metadata, such as the Pokémon name, file path, and chunk identifier, ensuring full traceability between the retrieved vector and the original Markdown source.  
FAISS allows us to efficiently compare the similarity between query embeddings and stored vectors, returning the most relevant Pokémon entries during question answering.

By completing this phase, the Pokédex knowledge base becomes *query-ready*: the RAG system can now locate and retrieve precise contextual information whenever a user asks a question about a Pokémon.
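
The index built in this notebook is a flat L2 index, which compares the query with every stored vector and, in FAISS, reports squared Euclidean distances. On L2-normalised vectors this ranking coincides with cosine similarity, since ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks the identity with NumPy alone on random toy vectors; the data and variable names are made up for illustration, and the brute-force search is not a substitute for FAISS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five random unit vectors play the role of the stored embeddings
stored = rng.normal(size=(5, 8))
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

# The query is a slightly perturbed copy of stored vector number 2
query = stored[2] + 0.05 * rng.normal(size=8)
query /= np.linalg.norm(query)

sq_dist = ((stored - query) ** 2).sum(axis=1)   # squared L2 distance to each stored vector
cos = stored @ query                            # cosine similarity (all vectors have unit length)
order = np.argsort(sq_dist)                     # nearest first

print("nearest stored vector:", order[0])
print("identity holds:", np.allclose(sq_dist, 2 - 2 * cos))
```

Because of this identity, a similarity can be recovered from a reported squared distance `d` as `1 - d / 2`.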

In [35]:
INDEX_DIR = "db"
os.makedirs(INDEX_DIR, exist_ok=True)


# Save metadata as a JSONL file
with open(os.path.join(INDEX_DIR, "pokedex_corpus.jsonl"), "w", encoding="utf-8") as f:
    for c in corpus:
        f.write(json.dumps(c, ensure_ascii=False) + "\n")

# Create and save the FAISS index
# Ensure embeddings are available from the previous step
if 'embeddings' in locals() and embeddings.shape[0] > 0:
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings.astype('float32')) # FAISS expects float32
    faiss.write_index(index, os.path.join(INDEX_DIR, "pokedex.index"))
    print("FAISS index created and saved.")
    print("Index size:", index.ntotal)
else:
    print("No embeddings found to create the FAISS index.")

FAISS index created and saved.
Index size: 6104


Once saved, the index and its corpus can be reloaded at any time.
New Pokédex entries can be added later with `index.add(...)`, as long as the corresponding chunks are appended to the corpus in the same order.

In [36]:
index = faiss.read_index("db/pokedex.index")
with open("db/pokedex_corpus.jsonl", encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]

### **Retrieval**

Once the index is built, the **Retrieval-Augmented Generation (RAG)** process connects user questions to relevant knowledge and produces context-aware answers.  


This phase unfolds in two main steps:

![Retrieval - Answer](https://mintcdn.com/langchain-5e9cc07a/I6RpA28iE233vhYX/images/rag_retrieval_generation.png?fit=max&auto=format&n=I6RpA28iE233vhYX&q=85&s=994c3585cece93c80873d369960afd44)

1. **Retrieval**: when a user submits a question, the system first converts it into an embedding vector using the same model employed during indexing. This query vector is then compared against the stored embeddings in the FAISS database.  The most semantically similar chunks, those most likely to contain the answer, are retrieved and passed along with their original text.

2. **Generation**: the retrieved text segments are inserted into a **prompt** that provides context for the language model. The **LLM**, in our case *Llama 3.2* through [Ollama](https://ollama.com/library/llama3.2), processes both the user query and the contextual passages to generate a coherent, grounded answer.

This ensures the model’s output is not purely generative but informed by actual data from the Pokédex, improving factual accuracy and reliability.

In [38]:
def search(query, topK=6):
    # Embed the query
    query_embedding = ollama.embeddings(model="bge-large", prompt=query)['embedding']
    query_embedding_norm = l2norm(np.array([query_embedding]))[0] # Apply L2 norm

    # Perform similarity search
    distances, indices = index.search(np.array([query_embedding_norm]).astype('float32'), topK)

    results = []
    for i, idx in enumerate(indices[0]):
        chunk_data = corpus[idx]
        results.append({
            "chunk": chunk_data,
            # IndexFlatL2 returns squared L2 distances; on unit vectors
            # 1 - d equals 2*cos - 1, so the score lies in [-3, 1] (1 = identical)
            "score": 1 - distances[0][i],
            "distance": distances[0][i]
        })
    return results


# smoke test
results = search("height of Bulbasaur", topK=2)

def print_search_results(results):
    for r in results:
        meta = r["chunk"]["meta"]
        print(f"{meta['id']} | score: {r['score']:.3f}")

    # for r in results:
    #     display(Markdown(r["chunk"]["text"]))
    #     print("-------------------")

print_search_results(results)

bulbasaur_1 | score: 0.471
bulbasaur_0 | score: 0.465


### **Generation**

In [39]:
SYSTEM_PROMPT = """
    You are a knowledgeable Pokémon expert.
    Use the provided context to answer user questions accurately and concisely.
    If the context does not contain the answer, respond with "I don't know.".

    Write answers in Markdown format, using tables, bullet points, numbered lists, and bold text where appropriate.

    Always start your answer with the Pokémon Name with its Pokémon Artwork at the start of your answer leveraging the Markdown image syntax ![alt text](image_url).

    Here an example:
    # Pikachu
    ![Pikachu](https://img.pokemondb.net/artwork/pikachu.jpg)

    [YOUR ANSWER HERE]
"""

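def build_context(query, rag_results):
    # Minimal sketch of one possible prompt layout: each retrieved chunk is
    # wrapped in <document> tags (with its Pokémon name and artwork URLs),
    # followed by the user question.
    blocks = []
    for r in rag_results:
        meta = r["chunk"]["meta"]
        artworks = ", ".join(meta["artworks"].values())
        blocks.append(
            f'<document pokemon="{meta["pokemon"]}" artworks="{artworks}">\n'
            f'{r["chunk"]["text"]}\n</document>'
        )
    context = "\n\n".join(blocks)
    prompt = f"<context>\n{context}\n</context>\n\n<question>{query}</question>"

    return prompt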


def chat_llama_rag(question, topK=6):

    rag_results = search(question, topK)
    print_search_results(rag_results)
    prompt = build_context(question, rag_results)

    response = ollama.chat(
        model='llama3.2',
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        options={"temperature": 0.1}
    )

    return response['message']['content'], prompt, rag_results


res, prompt, hits = chat_llama_rag("Which are the starter pokémon?", topK=10)
display(Markdown(res))

bulbasaur_0 | score: 0.425
charmander_0 | score: 0.356
squirtle_0 | score: 0.327
national_0 | score: 0.285
sceptile_7 | score: 0.273
deoxys_17 | score: 0.265
electrode_0 | score: 0.265
shiny_0 | score: 0.263
wigglytuff_0 | score: 0.262
geodude_1 | score: 0.262



In [40]:
import random

qst = random.choice(questions)
display(Markdown(qst))

res, prompt, hits = chat_llama_rag(qst)
display(Markdown(res))

Will a Delcatty with Normalize paralyze a Electivire with Motor Drive using Thunder Wave?

meowstic_3 | score: 0.213
electrike_5 | score: 0.195
manectric_9 | score: 0.194
delcatty_3 | score: 0.182
tyrantrum_5 | score: 0.173
electivire_1 | score: 0.172



---

### **Conclusion**

In this final section, we demonstrated how to build a complete Retrieval-Augmented Generation (RAG) system using a real-world dataset, the Pokémon Pokédex. The project combined data preprocessing, vector indexing, and LLM reasoning into a cohesive pipeline capable of answering domain-specific questions in a factual and grounded way.

There is still much room for refinement:

- **Improved preprocessing** (cleaner Markdown parsing, enhanced entity linking, and data normalization) can yield more semantically meaningful chunks.
- **Reranking mechanisms** can reorder retrieved results using secondary models, such as cross-encoders or hybrid retrieval (dense + lexical).
- **Dynamic topK selection** adapts the number of retrieved documents to the complexity or ambiguity of the question.
- **Caching and evaluation pipelines** can measure factual accuracy, recall, and latency, helping tune system performance for real applications.
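
As an example of the dynamic topK idea, the sketch below keeps only the hits whose score is close to the best one instead of always returning a fixed number of chunks. `select_dynamic_topk`, `max_k`, and `margin` are hypothetical names and values chosen for illustration; the scores are the first six retrieved for the starter Pokémon question above.

```python
def select_dynamic_topk(hits, max_k=6, margin=0.1):
    """Keep at most max_k hits whose score is within `margin` of the best one."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)[:max_k]
    if not ranked:
        return []
    best = ranked[0]["score"]
    return [h for h in ranked if best - h["score"] <= margin]


starter_hits = [
    {"id": "bulbasaur_0", "score": 0.425},
    {"id": "charmander_0", "score": 0.356},
    {"id": "squirtle_0", "score": 0.327},
    {"id": "national_0", "score": 0.285},
    {"id": "sceptile_7", "score": 0.273},
    {"id": "deoxys_17", "score": 0.265},
]

selected = select_dynamic_topk(starter_hits)
print([h["id"] for h in selected])
```

With a margin of 0.1 only the Bulbasaur, Charmander, and Squirtle chunks are kept, and the weaker matches never reach the prompt. A suitable margin depends on the embedding model and on how the score is computed, so it would have to be tuned on real questions.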