# RAG Workshop Notebook
## From Naive RAG to Advanced Techniques

## 1. Setup Environment

In [1]:
%pip install wikipedia mwparserfromhell beautifulsoup4 openai qdrant-client tqdm python-dotenv ragas


Note: you may need to restart the kernel to use updated packages.


In [9]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

## 2. Data Collection & Preprocessing

In [3]:
import wikipedia
import json
import re
from mwparserfromhell import parse
from bs4 import BeautifulSoup

ARTICLE_TITLES = [
    "Machine learning", "Deep learning",
    "Transformer (machine learning model)", "Natural language processing",
    "Computer vision", "Reinforcement learning",
    "Artificial neural network", "Generative pre-trained transformer",
    "BERT (language model)", "Overfitting"
]

def fetch_wikipedia_article(title):
    try:
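        # Note: wikipedia.page() defaults to auto_suggest=True, which can rewrite the
        # title before lookup; auto_suggest=False may avoid some of the skips below.
        page = wikipedia.page(title)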
        return {
            "title": title,
            "url": page.url,
            "raw_content": page.content
        }
    except wikipedia.exceptions.DisambiguationError as e:
        return fetch_wikipedia_article(e.options[0])
    except wikipedia.exceptions.PageError:
        print(f"Skipping {title}")
        return None

def clean_text(text):
    # Remove wiki markup and citation numbers
    text = parse(text).strip_code()
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text()
    return re.sub(r'\[\d+\]', '', text).strip()

articles = []
for title in ARTICLE_TITLES:
    article = fetch_wikipedia_article(title)
    if article:
        article["content"] = clean_text(article["raw_content"])
        articles.append(article)



Skipping Machine learning
Skipping Computer vision


In [4]:
len(articles)

8

In [5]:
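# Chunking function: sliding window over words, with `overlap` words shared between neighbors
def chunk_text(text, chunk_size=300, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]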

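# Prepare chunks and metadata (one metadata dict per chunk, kept in the same order)
corpus = []
metadata = []
for article in articles:
    chunks = chunk_text(article["content"])
    corpus.extend(chunks)
    # Build a separate dict per chunk instead of repeating one shared object
    metadata.extend({"title": article["title"], "url": article["url"]} for _ in chunks)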

In [ ]:
# A cell only displays its last expression, so return both lengths as one tuple
len(corpus), len(metadata)
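
The sliding-window chunking above is easier to follow on a toy input. The sketch below copies the same logic into a hypothetical helper, `sliding_window_chunks` (the name and the ten-word sample are assumptions for illustration, not part of the notebook), and uses a tiny `chunk_size` so the overlap is visible.

```python
def sliding_window_chunks(text, chunk_size=300, overlap=50):
    # Same logic as the notebook's chunk_text: advance by (chunk_size - overlap) words
    words = text.split()
    step = chunk_size - overlap
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Ten placeholder words: w0 ... w9
toy_text = ' '.join(f"w{i}" for i in range(10))
toy_chunks = sliding_window_chunks(toy_text, chunk_size=4, overlap=1)

for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one. Note that the final chunk here contains only `w9`, which the previous chunk already covers; with this windowing scheme a short, redundant tail chunk can appear at the end of an article.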



## 3. Create Embeddings with OpenAI

In [10]:
from openai import OpenAI
from tqdm import tqdm

openai_client = OpenAI()

# Define the embedding function using OpenAI's API (model: text-embedding-3-small)
def openai_embedding(text):
    text = text.replace("\n", " ")
    response = openai_client.embeddings.create(
        input=[text],  # Passing the text as a list
        model="text-embedding-3-small"
    )
    # Use dot notation to access the embedding from the response object
    embeddings = [data.embedding for data in response.data]
    return embeddings


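embeddings = []
chunked_texts = []
metadata_chunks = []
# Embed only the first 10 chunks to keep the demo fast and cheap
test_corpus = corpus[:10]

for idx, chunk in enumerate(tqdm(test_corpus)):
    embedding = openai_embedding(chunk)
    embeddings.extend(embedding)
    chunked_texts.extend([chunk] * len(embedding))
    # Keep the metadata aligned with the embedded chunks
    metadata_chunks.extend([metadata[idx]] * len(embedding))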

100%|██████████| 10/10 [00:08<00:00,  1.24it/s]
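
The loop above makes one API call per chunk. The embeddings endpoint also accepts a list of inputs, so chunks could be grouped into batches. The sketch below shows only the batching logic; `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for a real API call so the example runs offline.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most batch_size items
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    # Call embed_batch once per batch and keep vectors in input order
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(list(batch))
        if len(batch_vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors than inputs")
        vectors.extend(batch_vectors)
    return vectors

def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real embedding call: [character count, space count]
    return [[float(len(t)), float(t.count(" "))] for t in batch]

demo_texts = ["alpha", "beta gamma", "delta epsilon zeta"]
demo_vectors = embed_in_batches(demo_texts, fake_embed_batch, batch_size=2)
print(demo_vectors)
```

With the notebook's client, `embed_batch` could plausibly wrap `openai_client.embeddings.create(input=batch, model="text-embedding-3-small")` and return `[d.embedding for d in response.data]`; treat that wiring as an assumption to verify against the API's input limits.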


## 4. Indexing with Qdrant Vector Store

In [11]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Create an in-memory Qdrant instance
client = QdrantClient(":memory:")
collection_name = "wikipedia_articles"

# Create the collection with the specified vector configuration
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert points into the collection using PointStruct for each point
client.upsert(
    collection_name=collection_name,
    points=[
        PointStruct(
            id=idx,
            vector=embedding,
            payload={"text": chunked_texts[idx]}
        )
        for idx, embedding in enumerate(embeddings)
    ]
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
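
The collection above is configured with `Distance.COSINE`, so points are ranked by cosine similarity to the query vector. The brute-force sketch below illustrates that ranking on toy two-dimensional vectors. It is not how Qdrant implements search internally; `cosine_similarity`, `top_k_by_cosine`, and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); defined as 0.0 if either vector has zero length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query, vectors, k=3):
    # Score every stored vector, then keep the k highest-scoring (index, score) pairs
    scored = [(idx, cosine_similarity(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_by_cosine([2.0, 0.1], toy_vectors, k=2)
print(toy_hits)
```

The query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 second. Magnitude does not matter: scaling the query leaves the ranking unchanged.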

In [12]:
# Function to search the database
def vector_search(query, top_k=3):
    # Embed the query with the same model used for the chunks
    response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embedding = response.data[0].embedding
    # Similarity search: return the payloads of the top_k points closest to the query embedding
    search_result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        with_payload=True,
        limit=top_k,
    ).points
    return [result.payload for result in search_result]

# Only the first 10 chunks were indexed above, so the top hit may come from an unrelated article
search_result = vector_search("What is Reinforcement learning?")
print(search_result[0])

{'text': 'Deep learning is a subset of machine learning that focuses on utilizing neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be either supervised, semi-supervised or unsupervised. Some common deep learning network architectures include fully connected networks, deep belief networks, recurrent neural networks, convolutional neural networks, generative adversarial networks, transformers, and neural radiance fields. These architectures have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material ins

In [13]:
def model_generate(prompt, model="gpt-4o-mini"):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # 0 keeps the output as deterministic as possible
    )
    return response.choices[0].message.content

## 5. Build the Q/A Chatbot


In [14]:
import json

def prompt_template(question, context):
  return """You are an AI assistant that answers the question at the end using the following
  pieces of context.
  context:
  ```
  """+ json.dumps(context) + """
  ```
  User question: """+ question +"""
  Answer in markdown:"""
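
To see what the model actually receives, it can help to render a prompt for a toy context. The sketch below uses a hypothetical variant, `build_prompt`, which follows the same layout as `prompt_template` but pretty-prints the JSON context. The function name, the `FENCE` constant, and the sample payload are assumptions for illustration.

```python
import json

# Three backticks, built programmatically so this example can sit inside a fenced block
FENCE = "`" * 3

def build_prompt(question, context):
    # Serialize the retrieved payloads as JSON and wrap them in a fenced context section
    context_json = json.dumps(context, ensure_ascii=False, indent=2)
    return (
        "You are an AI assistant that answers the question at the end "
        "using the following pieces of context.\n"
        f"context:\n{FENCE}\n{context_json}\n{FENCE}\n"
        f"User question: {question}\n"
        "Answer in markdown:"
    )

toy_context = [{"text": "Overfitting occurs when a model fits noise in the training data."}]
toy_prompt = build_prompt("What is overfitting?", toy_context)
print(toy_prompt)
```

Because the context is plain JSON between the two fences, it can be parsed back out of the prompt, which is handy when logging or debugging what was retrieved.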


In [15]:
def generate_answer(question):
    # Retrieval: fetch the chunks most similar to the question
    search_result = vector_search(question)

    # Augmentation: place the retrieved chunks into the prompt
    prompt = prompt_template(question, search_result)
    # Generation: let the LLM answer from that context
    return model_generate(prompt)

question = "What is deep learning?"
answer = generate_answer(question)
print("Answer:", answer)

Answer: Deep learning is a subset of machine learning that utilizes neural networks to perform various tasks such as classification, regression, and representation learning. It is inspired by biological neuroscience and involves stacking artificial neurons into multiple layers, which allows the model to process and learn from data in a hierarchical manner.

### Key Features of Deep Learning:

- **Multiple Layers**: The term "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the neural network. Each layer transforms the input data into progressively more abstract representations.
  
- **Automatic Feature Learning**: Unlike traditional machine learning techniques that often require hand-crafted feature engineering, deep learning models automatically discover useful feature representations from the data.

- **Types of Networks**: Common architectures include:
  - Fully Connected Networks
  - Convolutional Neural Networks (CNNs)
  - Recurren