# RAG Workshop Notebook - Naive RAG


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

## 1. Data Ingestion
1. Load pre-downloaded Wikipedia articles from the data directory.
2. Chunk the text into smaller pieces to create embeddings.
3. Create embeddings using OpenAI's text-embedding-3-small model.
4. Index the embeddings using Qdrant Vector Store.

![../imgs/ingestion.png](../imgs/ingestion.png)

**Note**: Articles are pre-downloaded to avoid repetitive API calls during workshops. Use `scripts/fetch_additional_articles.py` to fetch new articles when needed.

### 1.1. Load pre-downloaded Wikipedia articles

**Configuration options**: Change which articles are loaded by editing the next cell:
- **Specific articles**: Pass a list of titles as the second argument to `load_existing_wiki_articles()` (Option 1)
- **Predefined combinations**: Uncomment and use `SELECTED_COMBINATION`
- **All articles**: Call `load_existing_wiki_articles()` without a title list (Option 2)

In [2]:
# Import the wiki article loader utility
import sys
sys.path.append('../scripts')
from wiki_article_loader import load_existing_wiki_articles

# === CONFIGURABLE: Choose which articles to load ===
# Uncomment one of the options below:

# Option 1: Load specific articles (recommended for focused learning)
articles = load_existing_wiki_articles("../data/wiki_articles", ["Deep learning", "Artificial neural network"])

# Option 2: Load all available articles
# articles = load_existing_wiki_articles("../data/wiki_articles")

print(f"Successfully loaded {len(articles)} articles")
for i, article in enumerate(articles, 1):
    print(f"{i}. {article['title']}")

Successfully loaded 2 articles
1. Artificial neural network
2. Deep learning


### 1.2. Verify loaded articles (text is already cleaned)

In [3]:
# Articles are already cleaned and ready to use
# The pre-downloaded articles have markup and citations removed

print("Loaded articles summary:")
for article in articles:
    content_length = len(article['content'])
    print(f"- {article['title']}: {content_length:,} characters")

print(f"\nTotal articles: {len(articles)}")
print(f"Total content: {sum(len(a['content']) for a in articles):,} characters")

Loaded articles summary:
- Artificial neural network: 57,787 characters
- Deep learning: 56,760 characters

Total articles: 2
Total content: 114,547 characters


In [4]:
articles[1]['content'][:1000]

'In machine learning, deep learning focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be supervised, semi-supervised or unsupervised.\nSome common deep learning network architectures include fully connected networks, deep belief networks, recurrent neural networks, convolutional neural networks, generative adversarial networks, transformers, and neural radiance fields. These architectures have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board 

### 1.3. Chunk the text into smaller pieces to create embeddings.

In [5]:
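# Chunking function: split text into overlapping windows of words
def chunk_text(text, chunk_size=300, overlap=50):
    # chunk_size and overlap are counted in whitespace-separated words, not tokens
    words = text.split()
    # Each window starts (chunk_size - overlap) words after the previous one
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size - overlap)]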


# Prepare chunks and metadata
corpus = []
metadata = []
for article in articles:
    chunks = chunk_text(article["content"])
    corpus.extend(chunks)
    metadata.extend([{"title": article["title"], "url": article["url"]}] * len(chunks))
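
To see what the sliding window does, the sketch below runs the same splitting logic on a toy 10-word text with small, made-up parameters (`chunk_size=4`, `overlap=2`). It also exposes an edge case: the last window can consist entirely of words the previous chunk already covers. `chunk_text_trimmed` is a hypothetical variant, not part of this notebook, that stops once a window reaches the end of the text.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    # Same logic as the notebook cell: fixed-size word windows that
    # advance by (chunk_size - overlap) words.
    words = text.split()
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size - overlap)]


def chunk_text_trimmed(text, chunk_size=300, overlap=50):
    # Hypothetical variant: stop as soon as a window reaches the end of the
    # text, so the final chunk is never a pure repeat of the previous tail.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break
    return chunks


toy = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
original = chunk_text(toy, chunk_size=4, overlap=2)
trimmed = chunk_text_trimmed(toy, chunk_size=4, overlap=2)

for chunk in original:
    print(chunk)
print("chunks:", len(original), "vs trimmed:", len(trimmed))
```

With these toy settings the original function emits a fifth chunk, `w8 w9`, that is fully contained in the fourth. With the notebook's defaults the same effect can occur for the last chunk of an article, which adds a short, redundant vector to the index.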

In [6]:
print('Total Corpus:', len(corpus))
print('Total Metadata:', len(metadata))

deep_learning_chunks = [chunk for chunk, meta in zip(corpus, metadata) if meta['title'] == 'Deep learning']

Total Corpus: 67
Total Metadata: 67


In [7]:
len(deep_learning_chunks)

34
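
The counts above follow from the window arithmetic: with a step of `chunk_size - overlap = 250` words, a text of `n` words yields `ceil(n / 250)` chunks. So 34 chunks means the "Deep learning" article has between 8,251 and 8,500 words. The helper below, `expected_chunks`, is a hypothetical name used only for this check, and the word count of 8,400 is an illustrative value, not a measured one.

```python
import math


def expected_chunks(n_words, chunk_size=300, overlap=50):
    # One chunk starts at every multiple of the step that is below n_words.
    step = chunk_size - overlap
    return math.ceil(n_words / step)


# Cross-check against the range() used by the chunking function.
for n in (0, 1, 250, 251, 8400):
    assert expected_chunks(n) == len(range(0, n, 250))

print(expected_chunks(8400))
```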

### 1.4. Create embeddings using OpenAI's text-embedding-3-small model.

In [8]:
from openai import OpenAI
from tqdm import tqdm

openai_client = OpenAI()


# Define the embedding function using OpenAI's API (model: text-embedding-3-small)
def openai_embedding(text):
    text = text.replace("\n", " ")
    response = openai_client.embeddings.create(
        input=[text],  # Passing the text as a list
        model="text-embedding-3-small"
    )
    # Use dot notation to access the embedding from the response object
    embeddings = [data.embedding for data in response.data]
    return embeddings
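
The embeddings endpoint accepts a list of inputs in one request, so several chunks can be embedded per call instead of one at a time. The sketch below shows only the batching logic. `embed_in_batches` and `embed_fn` are hypothetical names: `embed_fn` stands in for a function that wraps `openai_client.embeddings.create(input=batch, model=...)` and is assumed to return one vector per input, in input order. The fake embedder exists only so the example runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=16):
    """Embed texts in batches; embed_fn takes a list of strings and
    returns a list of vectors of the same length."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input")
        vectors.extend(batch_vectors)
    return vectors


calls = []


def fake_embed(batch):
    # Offline stand-in for the real API call: records the batch size and
    # returns a 2-dimensional dummy vector per text.
    calls.append(len(batch))
    return [[float(len(t)), 0.0] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(vectors), calls)
```

Five texts with a batch size of 2 result in three calls instead of five; with 67 chunks and a batch size of 16 that would be five requests.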

In [11]:
import time

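embeddings = []
chunked_texts = []
# test_corpus = corpus[:10]  # uncomment to try the loop on a small subset first

for chunk in tqdm(corpus):
    # openai_embedding returns a list holding one vector for the single input chunk
    embedding = openai_embedding(chunk)
    embeddings.extend(embedding)
    # Keep texts aligned with vectors so chunked_texts[i] matches embeddings[i]
    chunked_texts.extend([chunk] * len(embedding))
    time.sleep(0.4)  # brief pause between requests (a simple guard against rate limits)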




100%|██████████| 67/67 [01:18<00:00,  1.16s/it]
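
A fixed `time.sleep(0.4)` slows down every request, even when no rate limit is hit. One common alternative is to retry only on failure, with exponential backoff. The sketch below is generic: `with_backoff` is a hypothetical helper, and it catches any `Exception` for simplicity, whereas real code would catch only the client's rate-limit error.

```python
import time


def with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a function that fails twice before succeeding.
attempts = {"n": 0}
delays = []


def flaky():
    attempts["n"] += 1
    if attempts["n"] in (1, 2):
        raise RuntimeError("simulated rate limit")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly.
result = with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In the ingestion loop this could be used as `with_backoff(lambda: openai_embedding(chunk))`, replacing the unconditional sleep.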


### 1.5. Index the embeddings using Qdrant Vector Store.

In [12]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Create an in-memory Qdrant instance
client = QdrantClient(":memory:")
collection_name = "wikipedia_articles"

# Create the collection with the specified vector configuration
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert points into the collection using PointStruct for each point
client.upsert(
    collection_name=collection_name,
    points=[
        PointStruct(
            id=idx,
            vector=embedding,
            payload={"text": chunked_texts[idx]}
        )
        for idx, embedding in enumerate(embeddings)
    ]
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
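
`Distance.COSINE` ranks points by the cosine of the angle between the query vector and each stored vector, so only direction matters, not length. A minimal pure-Python sketch of that score follows. `cosine_similarity` is written here for illustration and is not Qdrant's implementation.

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same = cosine_similarity(v, v)
scaled = cosine_similarity(v, [10 * x for x in v])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same, 6), round(scaled, 6), round(orthogonal, 6))
```

Scaling a vector by 10 leaves the score unchanged (up to floating-point rounding), which is why cosine is a common choice for text embeddings.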

## 2. Build the Q/A Chatbot

![../imgs/naive-rag.png](../imgs/naive-rag.png)


### 2.1. Retrieval - Search the database for the most relevant embeddings.

In [13]:
# Function to search the database
def vector_search(query, top_k=3):
    # create embedding of the query
    response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embeddings = response.data[0].embedding
    # similarity search: return the top_k points closest to the query embedding
    search_result = client.query_points(
        collection_name=collection_name,
        query=query_embeddings,
        with_payload=True,
        limit=top_k,
    ).points
    return [result.payload for result in search_result]


search_result = vector_search("What does the word 'deep' in 'deep learning' refer to?")

from pprint import pprint

pprint(search_result)

[{'text': 'In machine learning, deep learning focuses on utilizing '
          'multilayered neural networks to perform tasks such as '
          'classification, regression, and representation learning. The field '
          'takes inspiration from biological neuroscience and is centered '
          'around stacking artificial neurons into layers and "training" them '
          'to process data. The adjective "deep" refers to the use of multiple '
          'layers (ranging from three to several hundred or thousands) in the '
          'network. Methods used can be supervised, semi-supervised or '
          'unsupervised. Some common deep learning network architectures '
          'include fully connected networks, deep belief networks, recurrent '
          'neural networks, convolutional neural networks, generative '
          'adversarial networks, transformers, and neural radiance fields. '
          'These architectures have been applied to fields including computer '
          '
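
Conceptually, the query above scores stored vectors against the query embedding and keeps the `top_k` best. The brute-force sketch below mimics that over a tiny hand-made index. The 2-dimensional vectors and payload texts are invented for illustration, and `brute_force_search` is a hypothetical helper. A vector store such as Qdrant typically relies on an approximate index rather than a full scan for larger collections.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def brute_force_search(query_vec, points, top_k=3):
    """points: list of (vector, payload) pairs. Returns payloads, best match first."""
    scored = ((cosine(query_vec, vec), i) for i, (vec, _) in enumerate(points))
    best = heapq.nlargest(top_k, scored)
    return [points[i][1] for _, i in best]


toy_points = [
    ([1.0, 0.0], {"text": "about layers"}),
    ([0.0, 1.0], {"text": "about history"}),
    ([0.9, 0.1], {"text": "about depth"}),
    ([-1.0, 0.0], {"text": "unrelated"}),
]

hits = brute_force_search([1.0, 0.05], toy_points, top_k=2)
print(hits)
```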

### 2.2. Generation - Use the retrieved chunks to generate the answer.

In [14]:
def model_generate(prompt, model="gpt-4o"):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content

In [15]:
import json


def prompt_template(question, context):
    return """You are an AI assistant that answers the question at the end using the following
  pieces of context. Make sure to only use the context to answer the question. Keep the wording very close to the context.
  context:
  ```
  """ + json.dumps(context) + """
  ```
  User question: """ + question + """
  Answer in markdown:"""
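
`json.dumps(context)` places the raw list of payload dicts into the prompt. An alternative some prompts use is to render each retrieved chunk as a numbered snippet, which makes it easier to ask the model which snippet it relied on. The sketch below is one hypothetical way to do that. `format_context` is not part of this notebook, and the payloads are invented sample data.

```python
def format_context(payloads):
    """Render retrieved payloads as numbered snippets, one per line."""
    lines = []
    for n, payload in enumerate(payloads, 1):
        text = " ".join(payload["text"].split())  # collapse whitespace and newlines
        lines.append(f"[{n}] {text}")
    return "\n".join(lines)


sample = [
    {"text": "The adjective deep refers to\nmultiple layers."},
    {"text": "Methods can be supervised   or unsupervised."},
]

rendered = format_context(sample)
print(rendered)
```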


In [16]:
def generate_answer(question):
    # Retrieval: find the chunks most similar to the question
    search_result = vector_search(question)

    # Augmentation: combine the question and the retrieved chunks into one prompt
    prompt = prompt_template(question, search_result)
    # Generation: have the LLM answer from that prompt
    return model_generate(prompt)


question = "What does the word 'deep' in 'deep learning' refer to?"
answer = generate_answer(question)
print("Answer:", answer)

Answer: The word "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output, describing potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. Most researchers agree that deep learning involves CAP depth higher than two.
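
The pipeline above always calls the model, even if retrieval returns nothing, for example when the collection is empty. A small guard can short-circuit that case. In the sketch below, `answer_with_guard`, `retrieve`, `generate`, and the fallback message are all hypothetical. The retriever and generator are passed in as arguments so the example runs without any API calls.

```python
def answer_with_guard(question, retrieve, generate,
                      fallback="I could not find relevant context for this question."):
    """Retrieve, then generate; skip generation when nothing was retrieved."""
    context = retrieve(question)
    if not context:
        return fallback
    prompt = f"Context: {context}\nQuestion: {question}"
    return generate(prompt)


# Offline stand-ins for vector_search and model_generate.
def fake_retrieve(question):
    return [{"text": "deep = multiple layers"}] if "deep" in question else []


def fake_generate(prompt):
    return "ANSWER based on: " + prompt.splitlines()[0]


found = answer_with_guard("What does deep mean?", fake_retrieve, fake_generate)
missing = answer_with_guard("Who won the match?", fake_retrieve, fake_generate)
print(found)
print(missing)
```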


In [17]:
question = "Who introduced the time delay neural network (TDNN), and when?"
answer = generate_answer(question)
print("Answer:", answer)
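
Because the prompt ends with `Answer in markdown:`, the model may wrap its entire reply in a markdown code fence, as this cell's output shows. A small post-processing step can strip such a wrapper before display. `strip_markdown_fence` is a hypothetical helper written for this example and handles only a single fence around the whole reply.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside the example


def strip_markdown_fence(text):
    """Remove one wrapping code fence (with optional language tag), if present."""
    pattern = rf"^\s*{FENCE}[\w-]*\n(.*?)\n?{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else text


wrapped = (
    f"{FENCE}markdown\n"
    "The time delay neural network (TDNN) was introduced by Alex Waibel in 1987.\n"
    f"{FENCE}"
)

cleaned = strip_markdown_fence(wrapped)
print(cleaned)
print(strip_markdown_fence("plain answer"))
```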

Answer: ```markdown
The time delay neural network (TDNN) was introduced by Alex Waibel in 1987.
```
