# RAG

In recent years, LLMs have shown impressive capabilities in generating content from user inputs. However, they suffer from several issues. First, because they are pretrained, they are not up to date with the latest information. In addition, being trained on very large amounts of data makes them good at generalising how to build answers, but not necessarily at answering accurately, which makes them prone to hallucinations. `Retrieval Augmented Generation` (RAG) is an effective way to mitigate these issues: it provides an up-to-date context to the LLM so it can generate human-like answers while benefiting from a current source of information. In this experiment, I am building a RAG pipeline that takes a user input and retrieves information relevant to the query from `Wikipedia` to build a context for an LLM. The end goal is to observe and analyse how the LLM's answers differ with and without this context.

**Keywords:** `NLP`, `RAG`, `LLM`, `Wikipedia`, `OpenAI`, `ChatGPT`, `Ngrams`, `NLTK`, `ChromaDB`, `Vector Database`

## Experiment plan

First, an input will be defined to mimic a user asking a question. This input will be tokenised using the `NLTK` library for `Natural Language Processing` (NLP), stop words will be filtered using NLTK's default English stop word list, and Ngrams will be extracted from the remaining tokens. These Ngrams are sequences of N consecutive words (`3` is the starting value used in the experiment) extracted from the user input; they are used as search terms to retrieve relevant pages from Wikipedia.
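
As a rough illustration of this step, the sketch below mimics the tokenise, filter and Ngram sequence using only the standard library. The whitespace tokeniser, the tiny `STOP_WORDS` set and the `extract_ngrams` helper are stand-ins chosen for this example; they are not NLTK's `word_tokenize`, stop word list or `ngrams`.

```python
# Simplified stand-in for the NLTK steps: whitespace tokenisation,
# a hand-picked stop word set (NOT NLTK's list) and sliding-window Ngrams.
STOP_WORDS = {"this", "the", "of", "is", "a"}  # illustrative assumption


def simple_tokens(text: str) -> list[str]:
    """Split on whitespace and drop the illustrative stop words."""
    return [word for word in text.split() if word.lower() not in STOP_WORDS]


def extract_ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every run of n consecutive tokens joined into a search term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = simple_tokens("How many tourists visited France this year")
search_terms = extract_ngrams(tokens, 3)
for term in search_terms:
    print(term)
```

Each resulting term would then be used as one Wikipedia search query.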


Then, using the Ngrams, the relevant Wikipedia pages will be retrieved using the `wikipedia` Python module. The pages will be chunked on `\n` to separate paragraphs, and the section titles will be removed as they carry little meaning compared to the page's text itself.
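
The chunking step can be sketched as follows. This assumes section titles appear in the page text as lines wrapped in `=` signs (for example `== History ==`); the `chunk_page` helper and the sample text are made up for the illustration.

```python
def chunk_page(content: str) -> list[str]:
    """Split page text into paragraph chunks, dropping blank lines and section titles."""
    chunks = []
    for line in content.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator line
        if stripped.startswith("==") and stripped.endswith("=="):
            continue  # section title such as '== History =='
        chunks.append(stripped)
    return chunks


# Invented sample text standing in for a page's content
sample = (
    "France is a country in Western Europe.\n"
    "\n"
    "== Tourism ==\n"
    "Tourism is a major part of the economy.\n"
    "=== Statistics ===\n"
    "Visitor numbers are published every year."
)
print(chunk_page(sample))
```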


The pages' content will then be loaded into ChromaDB, a Vector Database, along with its embeddings, which will be generated using ChromaDB's default embedding function with [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) as the embedding model. Embeddings are crucial to the retrieval step: they represent each chunk of text as a vector, so that texts with similar meaning end up close to each other. This is what makes semantic search over the documents possible later on.
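
To make the idea of "close" embeddings concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are toy values invented for the example; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and the distance metric ChromaDB actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, not produced by a real model)
query = [1.0, 0.0, 1.0]
chunk_about_tourism = [0.9, 0.1, 0.8]
chunk_about_weather = [0.0, 1.0, 0.1]

scores = {
    "tourism": cosine_similarity(query, chunk_about_tourism),
    "weather": cosine_similarity(query, chunk_about_weather),
}
best = max(scores, key=scores.get)
print(best)
```

In this toy setup the query points in almost the same direction as the first chunk, so that chunk is the one a semantic search would rank first.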


Once the documents are stored in the Vector Database, the user input will be converted to an embedding and used to query the `top 5` most relevant chunks from ChromaDB. These chunks will then be used to build a prompt passed to `gpt-4o-mini`, containing the user's original request along with the context coming from the documents.
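
The prompt assembly can be sketched in isolation. The `build_rag_prompt` helper and the sample chunks are hypothetical, used here only to show the shape of the final prompt (the user query followed by the retrieved chunks as a bullet list); the retrieval itself and the call to the model are left out.

```python
def build_rag_prompt(user_query: str, retrieved_chunks: list[str]) -> str:
    """Combine the user's question with the retrieved chunks into one prompt."""
    lines = ["## User query", user_query, "", "Relevant context:"]
    lines.extend(f"- {chunk}" for chunk in retrieved_chunks)
    return "\n".join(lines) + "\n"


# Placeholder chunks standing in for the top results returned by the vector database
chunks = ["first retrieved paragraph", "second retrieved paragraph"]
prompt = build_rag_prompt("How many tourists visited France this year", chunks)
print(prompt)
```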


Finally, the results will be analysed and discussed to understand how much RAG adds, by comparing the answer to the plain user prompt with the answer to the prompt enriched with the Wikipedia context.

In [None]:
!pip install wikipedia
!pip install sentence-transformers
!pip install chromadb # Using chromadb as I want something simple to use
!pip install openai

import os
import pickle
import wikipedia
import nltk
import chromadb
import openai

from chromadb.utils import embedding_functions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.lm import NgramCounter

nltk.download('punkt_tab')
nltk.download('stopwords')

path = '<PATH_TO_OPEN_AI_KEY_FILE>'
openai_key_path = path + 'key'
store_docs_in_fs = False

default_ef = embedding_functions.DefaultEmbeddingFunction()
client = chromadb.Client()

with open(openai_key_path, 'r') as key_file:
  key = key_file.read()

openai_client = openai.OpenAI(api_key=key)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
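def get_tokens(input: str) -> list[str]:
  # Build the English stop word set once instead of reloading every language's list per token
  english_stopwords = set(stopwords.words('english'))
  tokens = word_tokenize(input)
  return [word for word in tokens if word not in english_stopwords]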

def get_results_for_ngrams(tokens: list[str], n_grams: int) -> list[str]:
  # Build every sequence of n_grams consecutive tokens and use each one as a search term
  search_terms = [" ".join(gram) for gram in ngrams(tokens, n_grams)]

  results = []
  for term in search_terms:
      try:
          # wikipedia.search returns a list of page titles matching the term
          results.extend(wikipedia.search(term))
      except wikipedia.exceptions.DisambiguationError:
          pass
      except wikipedia.exceptions.PageError:
          pass

  # Remove duplicate results. Stop word removal helped a lot here: some results still do not seem relevant, but fewer than before.
  return list(set(results))

In [None]:
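def get_page_chunks(page_titles):
  # Build the English stop word set once; rebuilding the list for every word is very slow
  english_stopwords = set(stopwords.words('english'))
  pages_chunks = {}

  for title in page_titles:
    try:
      page = wikipedia.page(title)
      # One chunk per non-empty line, skipping section titles such as '== History =='
      chunks = [line for line in page.content.split('\n')
                if line.strip() != '' and not line.strip().startswith('==')]
      chunks_cleaned = []
      for chunk in chunks:
        chunk_sw_cleaned = [word for word in chunk.split(' ') if word not in english_stopwords]
        chunks_cleaned.append(' '.join(chunk_sw_cleaned))
      pages_chunks[page.title] = chunks_cleaned
    except Exception:
      # Skip pages that cannot be resolved (disambiguation pages, missing pages, network errors)
      pass
  return pages_chunks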

In [None]:
def save_to_pickle(documents: dict, save_path: str):
  for k, v in documents.items():
    # Use a context manager so each file is closed once its chunks are written
    with open(save_path + '/Documents/' + k, 'wb') as file:
      pickle.dump(v, file)

In [None]:
def load_documents(path: str):
  documents = {}
  documents_path = path + '/Documents'
  for f in os.listdir(documents_path):
    if os.path.isfile(os.path.join(documents_path, f)):
      with(open(documents_path + '/' + f, 'rb')) as file:
        documents[f] = pickle.load(file)
  return documents

In [None]:
def create_or_get_collection(client: chromadb.Client, name: str):
  try:
    return client.get_collection(name)
  except Exception as e:
    print(f'Collection [{name}] does not exist, creating it...', e)
  return client.create_collection(name, embedding_function=default_ef)

def create_collection(client: chromadb.Client, name: str):
  try:
    client.delete_collection(name=name)
  except Exception as e:
    pass
  return client.create_collection(name, embedding_function=default_ef)


def add_documents(documents: dict, collection: chromadb.Collection):
  docs = []
  metadatas = []
  ids = []

  for source, documents_list in documents.items():
    for i, document in enumerate(documents_list):
      chunk_id = f"{source}_{i}"
      ids.append(chunk_id)
      metadatas.append({'source': source})
      docs.append("".join(document))

  collection.add(
      documents=docs,
      metadatas=metadatas,
      ids=ids,
  )

In [None]:
def run_experiment(input: str, openai_client: openai.OpenAI, collection: chromadb.Collection):
  embedded_text = default_ef([input])

  query_results = collection.query(
      query_embeddings=embedded_text,
      n_results=5
  )

  prompt = f"## User query\n{input}\n\nRelevant context:\n"

  for i in range(len(query_results['documents'][0])):
      prompt += f"- {query_results['documents'][0][i]}\n"

  response = openai_client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": prompt},
      ]
  )

  print('\n=============== RESULTS ===============\n')
  generated_text = response.choices[0].message.content
  print("RAG Prompt result: ", generated_text)
  print('-------------------------------------------')
  response = openai_client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": input},
      ]
  )

  generated_text = response.choices[0].message.content
  print("Vanilla result: " + generated_text)

GPT-4o-mini does not find our context useful. This needs a rework, as it does not allow the RAG to be evaluated.

After fixing the issues with the chunked documents, GPT-4o-mini is now able to give a number, although it is not 100% accurate since the question was about this year (2025). When no context is given, GPT-4o-mini is not able to give the information and suggests checking relevant sources.

## Experiments

### Experiment 1
In this experiment, I am asking ChatGPT how many tourists visited France this year (2025). Since the first quarter of the year had not yet ended at the time of writing, it is not realistic to expect figures for this year. However, Wikipedia's page about [Tourism in France](https://en.wikipedia.org/wiki/Tourism_in_France) gives the 2023 figures and states 100 million foreign visitors.

Hence, the expectation is that the RAG result should state the figure of 100 million for 2023, while the vanilla GPT answer should be more generic and point towards looking the information up in reliable sources, since it does not have access to relevant sources.

In [None]:
input = "How many tourists visited France this year"

trigram_res = get_results_for_ngrams(get_tokens(input), 3)
page_chunks = get_page_chunks(trigram_res)
collection = create_or_get_collection(client, "rag-database")
add_documents(page_chunks, collection)

run_experiment(input, openai_client, collection)





Collection [rag-database] does not exist, creating it... Collection rag-database does not exist.






RAG Prompt result:  In 2023, France welcomed approximately 100 million foreign tourists, making it the most visited country in the world.
-------------------------------------------
Vanilla result: I don't have access to real-time data, but I can tell you that tourism statistics for a given year are typically compiled and released by official sources such as the French government or tourism boards. As of the end of 2022, France was recovering from the impacts of the COVID-19 pandemic, which had significantly affected tourism in previous years. 

For the most accurate and up-to-date information on the number of tourists who visited France in 2023, I recommend checking the official website of the French Ministry of Culture and Communication, the French National Institute of Statistics and Economic Studies (INSEE), or publications from the French tourism board.


The results confirm the expected outcome: when the context is provided, the model outputs a sentence taken from the context itself, whereas without context ChatGPT outputs a methodology for finding the answer.

### Experiment 2
In this experiment, the answer being sought is unambiguous. While in the previous experiment "this year" could not be resolved, we can see from the Wikipedia page that the information about the current president of the United States is up to date.

In [None]:
input = "Who is the president of the united states of america"

trigram_res = get_results_for_ngrams(get_tokens(input), 3)
page_chunks = get_page_chunks(trigram_res)
collection = create_or_get_collection(client, "rag-database")
add_documents(page_chunks, collection)

run_experiment(input, openai_client, collection)



RAG Prompt result:  As of now, the president of the United States is Donald Trump, who assumed office on January 20, 2025.
-------------------------------------------
Vanilla result: As of my last knowledge update in October 2023, the President of the United States is Joe Biden. He took office on January 20, 2021. Please verify with up-to-date sources to ensure this information is still current.


As hinted by the previous experiment, GPT-4o-mini's training data stops in 2023. At that time, the US president was different, so the model was not expected to find the answer without the current context. This experiment confirms this assumption: the RAG helped ChatGPT find the correct answer, while vanilla ChatGPT named the former US president. Note that the vanilla ChatGPT answer is phrased far less assertively than the RAG result.

### Experiment 3
This experiment calls for a more elaborate answer: the model is asked to analyse and compare the weather of the current year with that of previous years. The expectation is that the RAG answer should find some relevant context from the Wikipedia searches, helping the model give a relatively good picture of the situation. The expectation for the vanilla ChatGPT answer is that it should point out a way to get an assessment, without necessarily coming to a conclusion.

In [None]:
input = "How does the weather compares in 2025 with past years"

trigram_res = get_results_for_ngrams(get_tokens(input), 3)
page_chunks = get_page_chunks(trigram_res)
collection = create_or_get_collection(client, "rag-database")
add_documents(page_chunks, collection)

run_experiment(input, openai_client, collection)



RAG Prompt result:  To provide a comparison of the weather in 2025 with past years like 2024 and earlier periods, we can look at several factors, including trends in temperature, precipitation, extreme weather events, and the impact of climate change. However, keep in mind that detailed data for 2025 would require access to real-time weather reports, which may not be fully available yet. 

### General Trends

1. **Global Warming Impact**: Over the past few decades, the average global temperature has been rising due to climate change, which could continue into 2025. This might result in higher average temperatures in comparison to years in the early 2000s like 2005.

2. **Extreme Weather Events**: Increased frequency and intensity of extreme weather events, such as hurricanes, droughts, and tornadoes, have been noted in recent years. If this trend continues, 2025 may see more tornadoes and severe weather compared to 2024 and earlier years.

3. **Regional Variations**: Weather patterns

ChatGPT was able to use the context from the RAG to structure an answer around general trends and comparisons with previous years (2005-2008), and to form a conclusion. While the tone of the conclusion is not as assertive as in Experiment 2, it seems relevant. The vanilla GPT answer points towards relevant steps for building the conclusion ourselves. Interestingly, comparing the vanilla answer with the RAG answer, it cannot be concluded that the RAG answer builds on the steps suggested by vanilla GPT. This means that, with RAG, the model used the context to build an answer without considering that it may be missing components it would itself suggest gathering before coming to a conclusion.

## Discussion
Throughout this experiment, the answers produced with RAG were compared with the answers of the LLM alone. The experiments showed that, while the LLM alone was able to provide answers that made sense to a human reader, it either could not provide concrete answers to the questions or provided them with a rather low level of assertiveness. When RAG was used to provide a context, the LLM was able to form conclusions using that context. While this demonstrates the benefit of using RAG, the last experiment highlights that the LLM did not use the context in an `intelligent` manner: without context, its suggestion involved a much more thorough analysis and data gathering, whereas when a context was provided, the LLM came to a conclusion of its own, without questioning the information or using it to perform the task it had suggested.

While this experiment comes to relevant conclusions, it opens the door to further experiments. First, the information coming from Wikipedia could instead be sourced from Google search, Reddit or other platforms. This would bring in differing opinions and show how the LLM interprets and uses conflicting information to build its answers.

On a purely technical level, the Ngram size could be increased from `3` to `4` or `5` to observe the impact on the retrieval. More models from different providers (Google Gemini, Grok, Anthropic Claude, etc.) could be included to observe how different models build different answers and whether they show any signs of bias or other noticeable behaviours.
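
A quick way to see what changing the Ngram size would do is to count the search terms it produces. The sketch below assumes the five tokens that would plausibly remain from the Experiment 2 question after stop word filtering; the token list and the `ngram_terms` helper are assumptions made for the illustration.

```python
def ngram_terms(tokens: list[str], n: int) -> list[str]:
    """Join every run of n consecutive tokens into one search term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed filtered tokens for the Experiment 2 question (illustrative)
tokens = ["Who", "president", "united", "states", "america"]

terms_per_size = {n: ngram_terms(tokens, n) for n in (3, 4, 5, 6)}
for n, terms in terms_per_size.items():
    print(n, len(terms), terms)
```

Larger values give fewer, more specific search terms, and a value above the number of tokens gives none, so short questions limit how far N can be raised.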

Finally, the RAG could also be included in an Agent Workflow as a tool for information retrieval.