In [1]:
import requests

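def fetch_wikivoyage_data(query):
    # Note: despite the function name, this endpoint is the English Wikipedia API,
    # so the extract is the Wikipedia article rather than a Wikivoyage guide.
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "titles": query,
        "prop": "extracts",
        "explaintext": True,
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    data = response.json()
    pages = data.get("query", {}).get("pages", {})
    # Only one title is requested, so return the extract of the first page
    for page_data in pages.values():
        return page_data.get("extract", "No data found")
    return "No data found"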


In [2]:
travel_info = fetch_wikivoyage_data("Paris")
print(travel_info)

Paris (French pronunciation: [paʁi] ) is the capital and largest city of France. With an official estimated population of 2,102,650 residents in January 2023 in an area of more than 105 km2 (41 sq mi), Paris is the fourth-largest city in the European Union and the 30th most densely populated city in the world in 2022. Since the 17th century, Paris has been one of the world's major centres of finance, diplomacy, commerce, culture, fashion, and gastronomy. For its leading role in the arts and sciences, as well as its early and extensive system of street lighting, in the 19th century, it became known as the City of Light.
The City of Paris is the centre of the Île-de-France region, or Paris Region, with an official estimated population of 12,271,794 inhabitants in January 2023, or about 19% of the population of France. The Paris Region had a nominal GDP of €765 billion (US$1.064 trillion when adjusted for PPP) in 2021, the highest in the European Union. According to the Economist Intellig
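
The lookup in fetch_wikivoyage_data assumes the title exists. A minimal sketch of a more defensive parser follows, assuming the JSON layout used above in which a page that does not exist carries a "missing" key. The helper name extract_first_page and both sample payloads are made up for illustration.

```python
def extract_first_page(data, default="No data found"):
    """Return the plain-text extract of the first existing page in an API response."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages that do not exist carry a "missing" key and no extract.
        if "missing" in page:
            continue
        return page.get("extract", default)
    return default

# Hypothetical payloads shaped like the API response, not real API output.
sample_ok = {"query": {"pages": {"1": {"title": "Paris", "extract": "Paris is the capital of France."}}}}
sample_missing = {"query": {"pages": {"-1": {"title": "Nosuchpage", "missing": ""}}}}

print(extract_first_page(sample_ok))
print(extract_first_page(sample_missing))
```

Keeping the parsing separate from the HTTP call makes it possible to check the behaviour without network access.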

In [3]:
def split_into_sections(text):
    sections = text.split("\n==")
    cleaned_sections = [sec.replace("==", "").strip() for sec in sections if sec.strip()]
    return cleaned_sections

documents = split_into_sections(travel_info)
for doc in documents:
    print(doc[:100])

Paris (French pronunciation: [paʁi] ) is the capital and largest city of France. With an official es
Etymology 

The ancient oppidum that corresponds to the modern city of Paris was first mentioned in 
History
= Origins =

The Parisii, a sub-tribe of the Celtic Senones, inhabited the Paris area from around th
= High and Late Middle Ages to Louis XIV =

By the end of the 12th century, Paris had become the pol
= 18th and 19th centuries =

Paris grew in population from about 400,000 in 1640 to 650,000 in 1780.
= 20th and 21st centuries =

By 1901, the population of Paris had grown to about 2,715,000. At the b
Geography
= Location =

Paris is located in northern central France, in a north-bending arc of the river Seine
= Climate =

Paris has an oceanic climate within the Köppen climate classification, typical of weste
Administration
= City government =

For almost all of its long history, except for a few brief periods, Paris was g
= Métropole du Grand Paris =

In January 2016, the Métropo
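
Splitting on "\n==" leaves subsection headings such as "= Origins =" attached to the text, as the output above shows. One possible alternative, sketched here under the assumption that headings appear on their own line as "== Title ==" or "=== Title ===", is to match headings with a regular expression and keep the level and title next to each body. The function name split_with_headings and the sample string are illustrative only.

```python
import re

# A heading line: two or more "=", a title, then the same run of "=".
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

def split_with_headings(text):
    """Return (level, title, body) tuples; the lead section has level 0 and an empty title."""
    sections = []
    last_end = 0
    level, title = 0, ""
    for match in HEADING.finditer(text):
        body = text[last_end:match.start()].strip()
        if body or title:
            sections.append((level, title, body))
        level, title = len(match.group(1)), match.group(2)
        last_end = match.end()
    sections.append((level, title, text[last_end:].strip()))
    return sections

sample = (
    "Lead text.\n\n\n== History ==\n\n\n=== Origins ===\n"
    "The Parisii lived here.\n\n\n== Geography ==\nParis is on the Seine."
)
sections = split_with_headings(sample)
for level, title, body in sections:
    print(level, repr(title), repr(body))
```

A parent heading with no text of its own, such as "History" here, comes back with an empty body, so it can be dropped or merged with its children later.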

In [4]:
import re

In [5]:
def clean_text(text):
    # Remove pronunciation guides and references in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"==+", "", text)  # Remove section markers
    # Collapse runs of two or more whitespace characters (including blank lines) into one space
    text = re.sub(r"\s{2,}", " ", text)
    # Keep only ASCII letters, digits, basic punctuation, and whitespace.
    # This also strips accented letters, hyphens, and symbols such as % and €.
    text = re.sub(r"[^a-zA-Z0-9.,!?'\s]", "", text)
    return text.strip()

cleaned_documents = [clean_text(doc) for doc in documents]

In [6]:
for doc in cleaned_documents:
    print(doc[:100])

Paris French pronunciation  is the capital and largest city of France. With an official estimated po
Etymology The ancient oppidum that corresponds to the modern city of Paris was first mentioned in th
History
Origins  The Parisii, a subtribe of the Celtic Senones, inhabited the Paris area from around the mid
High and Late Middle Ages to Louis XIV  By the end of the 12th century, Paris had become the politic
18th and 19th centuries  Paris grew in population from about 400,000 in 1640 to 650,000 in 1780. A n
20th and 21st centuries  By 1901, the population of Paris had grown to about 2,715,000. At the begin
Geography
Location  Paris is located in northern central France, in a northbending arc of the river Seine, who
Climate  Paris has an oceanic climate within the Kppen climate classification, typical of western Eu
Administration
City government  For almost all of its long history, except for a few brief periods, Paris was gover
Mtropole du Grand Paris  In January 2016, the Mtropole du 
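
The output above shows "Kppen" and "Mtropole": the ASCII-only character filter removes accented letters along with the noise. A possible variant is sketched below. It relies on \w being Unicode-aware for str patterns in Python 3, and the set of punctuation it keeps is an assumption that may need tuning. The name clean_text_unicode is hypothetical.

```python
import re

def clean_text_unicode(text):
    """Variant of clean_text that keeps accented letters, hyphens, and a few symbols."""
    text = re.sub(r"\[.*?\]", "", text)   # Drop bracketed pronunciation guides
    text = re.sub(r"=+", "", text)        # Drop heading markers of any level
    # \w matches Unicode letters and digits, so characters like "ö" and "é" survive.
    text = re.sub(r"[^\w\s.,!?'()%€$-]", "", text)
    text = re.sub(r"[ \t]{2,}", " ", text)  # Collapse runs of spaces, keep newlines
    return text.strip()

climate = clean_text_unicode("= Climate =\n\nParis has an oceanic climate within the Köppen classification.")
region = clean_text_unicode("Île-de-France holds about 19% of the population [citation].")
print(climate)
print(region)
```

Newlines are left in place here so that a later paragraph split on "\n" still has something to split on.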

In [7]:
def split_into_paragraphs(documents):
    all_paragraphs = []
    for text in documents:
        paragraphs = text.split("\n")  # Use single newline as delimiter
        paragraphs = [para.strip() for para in paragraphs if len(para.strip()) > 10]  # Drop lines of 10 characters or fewer
        all_paragraphs.extend(paragraphs)
    return all_paragraphs

paragraphs = split_into_paragraphs(cleaned_documents)
print(f"Total paragraphs: {len(paragraphs)}")  # Check total count
for paragraph in paragraphs[:5]:
    print(paragraph)  # Preview first 5 paragraphs


Total paragraphs: 173
Paris French pronunciation  is the capital and largest city of France. With an official estimated population of 2,102,650 residents in January 2023 in an area of more than 105 km2 41 sq mi, Paris is the fourthlargest city in the European Union and the 30th most densely populated city in the world in 2022. Since the 17th century, Paris has been one of the world's major centres of finance, diplomacy, commerce, culture, fashion, and gastronomy. For its leading role in the arts and sciences, as well as its early and extensive system of street lighting, in the 19th century, it became known as the City of Light.
The City of Paris is the centre of the ledeFrance region, or Paris Region, with an official estimated population of 12,271,794 inhabitants in January 2023, or about 19 of the population of France. The Paris Region had a nominal GDP of 765 billion US1.064 trillion when adjusted for PPP in 2021, the highest in the European Union. According to the Economist Intelli

In [8]:
def tag_sections(paragraphs):
    tagged_data = []
    for para in paragraphs:
        text = para.lower()
        # Plain substring checks: "art" also matches words such as "part" or "department"
        if "museum" in text or "art" in text:
            tag = "Culture & Art"
        elif "transport" in text or "metro" in text:
            tag = "Transportation"
        elif "history" in text or "roman" in text:
            tag = "History"
        else:
            tag = "General"
        tagged_data.append({"tag": tag, "content": para})
    return tagged_data

tagged_paragraphs = tag_sections(paragraphs)
for data in tagged_paragraphs[:5]:
    print(data)  # Preview tagged data

{'tag': 'Culture & Art', 'content': "Paris French pronunciation  is the capital and largest city of France. With an official estimated population of 2,102,650 residents in January 2023 in an area of more than 105 km2 41 sq mi, Paris is the fourthlargest city in the European Union and the 30th most densely populated city in the world in 2022. Since the 17th century, Paris has been one of the world's major centres of finance, diplomacy, commerce, culture, fashion, and gastronomy. For its leading role in the arts and sciences, as well as its early and extensive system of street lighting, in the 19th century, it became known as the City of Light."}
{'tag': 'General', 'content': 'The City of Paris is the centre of the ledeFrance region, or Paris Region, with an official estimated population of 12,271,794 inhabitants in January 2023, or about 19 of the population of France. The Paris Region had a nominal GDP of 765 billion US1.064 trillion when adjusted for PPP in 2021, the highest in the Eu
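
Because tag_sections tests substrings, any paragraph containing "part" or "department" is tagged "Culture & Art". A minimal sketch of whole-word matching with regular expressions follows. The keyword lists are illustrative assumptions, not the notebook's, and tag_paragraph is a hypothetical helper.

```python
import re

# Illustrative keyword lists; extend or replace them as needed.
TAG_KEYWORDS = {
    "Culture & Art": ["museum", "museums", "art", "arts", "gallery"],
    "Transportation": ["transport", "metro", "railway"],
    "History": ["history", "roman", "medieval"],
}

# One case-insensitive pattern per tag, anchored on word boundaries.
TAG_PATTERNS = {
    tag: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for tag, words in TAG_KEYWORDS.items()
}

def tag_paragraph(paragraph):
    """Return the first tag whose keywords appear as whole words, else 'General'."""
    for tag, pattern in TAG_PATTERNS.items():
        if pattern.search(paragraph):
            return tag
    return "General"

print(tag_paragraph("The department is part of a larger region."))
print(tag_paragraph("The Louvre is an art museum."))
print(tag_paragraph("The Paris Metro opened in 1900."))
```

Tags are still checked in dictionary order, so a paragraph that mentions both a museum and the metro gets the first matching tag.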

In [9]:
def structure_data(tagged_data):
    structured_data = []
    for data in tagged_data:
        tag = data["tag"]
        content = data["content"]
        structured_data.append({"tag": tag, "content": content})  # Copy each entry as a flat tag/content record
    return structured_data

structured_data = structure_data(tagged_paragraphs)
for entry in structured_data:
    print(entry["tag"], len(entry["content"]))

Culture & Art 613
General 486
Transportation 626
Culture & Art 575
General 1105
History 475
History 719
General 131
History 682
Culture & Art 623
General 795
General 407
Culture & Art 526
General 449
General 728
Culture & Art 882
Culture & Art 902
History 899
General 385
Culture & Art 558
Culture & Art 1378
General 439
Culture & Art 497
Culture & Art 366
Culture & Art 597
Transportation 961
General 598
Culture & Art 1103
Culture & Art 505
Culture & Art 1179
General 316
General 184
Culture & Art 537
Transportation 658
General 293
General 302
General 404
General 456
General 308
General 14
History 1014
Culture & Art 445
General 733
Culture & Art 620
Transportation 570
General 19
Culture & Art 600
Culture & Art 424
General 1043
Culture & Art 515
Culture & Art 409
General 598
General 352
Culture & Art 664
General 560
Culture & Art 369
General 371
Culture & Art 1188
Culture & Art 1024
Culture & Art 416
General 555
Culture & Art 532
General 370
Culture & Art 693
General 270
Culture & Art 657


In [10]:
structured_data

[{'tag': 'Culture & Art',
  'content': "Paris French pronunciation  is the capital and largest city of France. With an official estimated population of 2,102,650 residents in January 2023 in an area of more than 105 km2 41 sq mi, Paris is the fourthlargest city in the European Union and the 30th most densely populated city in the world in 2022. Since the 17th century, Paris has been one of the world's major centres of finance, diplomacy, commerce, culture, fashion, and gastronomy. For its leading role in the arts and sciences, as well as its early and extensive system of street lighting, in the 19th century, it became known as the City of Light."},
 {'tag': 'General',
  'content': 'The City of Paris is the centre of the ledeFrance region, or Paris Region, with an official estimated population of 12,271,794 inhabitants in January 2023, or about 19 of the population of France. The Paris Region had a nominal GDP of 765 billion US1.064 trillion when adjusted for PPP in 2021, the highest in

In [11]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.3.0-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.3.0-py3-none-any.whl (268 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.7/268.7 kB 4.5 MB/s eta 0:00:00
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 3.2.1
    Uninstalling sentence-transformers-3.2.1:
      Successfully uninstalled sentence-transformers-3.2.1
Successfully installed sentence-transformers-3.3.0


In [12]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.5/27.5 MB 34.2 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


In [19]:
#Generating Embeddings

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract documents and tags
documents = [entry['content'] for entry in structured_data]
tags = [entry['tag'] for entry in structured_data]

embeddings = embedding_model.encode(documents)

np.save("embeddings.npy", embeddings)
with open("documents.txt", "w") as f:
    f.writelines([f"{doc}\n" for doc in documents])
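
Writing one document per line works while no document contains a newline, but the tags are not saved alongside the text. A small sketch of a JSON Lines alternative that keeps tag and content together is shown below. The file name records.jsonl, the helper names, and the sample records are assumptions for illustration.

```python
import json
import os
import tempfile

def save_records(records, path):
    """Write one JSON object per line; newlines inside strings are escaped by json.dumps."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_records(path):
    """Read the records back in the same order they were written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records in the same shape as structured_data.
records = [
    {"tag": "General", "content": "Paris is the capital of France."},
    {"tag": "Culture & Art", "content": "Line one.\nLine two mentions the Musée d'Orsay."},
]
path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
save_records(records, path)
loaded = load_records(path)
print(loaded == records)
```

Since the order is preserved, row i of the saved embeddings still corresponds to record i after reloading.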


In [20]:
# Initialize FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Save the FAISS index
faiss.write_index(index, "travel_index.faiss")
print("FAISS index created and saved.")

FAISS index created and saved.
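
To make the index less of a black box: as far as I understand it, IndexFlatL2 performs an exact, exhaustive comparison of the query against every stored vector and ranks by squared L2 distance. The NumPy sketch below illustrates that idea on toy vectors. It is not FAISS code, and flat_l2_search is a hypothetical name.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    # Pairwise differences, shape (n_queries, n_vectors, dim).
    diffs = queries[:, None, :] - vectors[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances for each query, nearest first.
    order = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, order, axis=1), order

# Toy 2-D vectors standing in for sentence embeddings.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")
distances, indices = flat_l2_search(vectors, queries, k=2)
print(indices)
print(distances)
```

This brute-force approach costs one distance computation per stored vector, which is fine for a few hundred paragraphs.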


In [26]:
# Query the retriever
def query_faiss(query, embedding_model, index, documents, top_k=3):
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(np.array(query_embedding).astype('float32'), k=top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors are available
    results = [documents[i].strip() for i in indices[0] if i != -1]
    return results

# Load the FAISS index, embeddings, and documents saved earlier
index = faiss.read_index("travel_index.faiss")
embeddings = np.load("embeddings.npy")  # Plain float array, so pickle support is not needed
with open("documents.txt", "r") as f:
    documents = [line.strip() for line in f]

# Query example
query = "What are some cultural landmarks in Paris?"
results = query_faiss(query, embedding_model, index, documents)
for result in results:
    print(result)

Outline of Paris
Art Deco in Paris
Paris' top cultural attractions in 2022 were the Louvre Museum 7.7 million visitors, the Eiffel Tower 5.8 million visitors, the Muse d'Orsay 3.27 million visitors and the Centre Pompidou 3 million visitors.
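
Two of the three hits above are bare link titles ("Outline of Paris", "Art Deco in Paris") rather than passages with content. One simple mitigation, sketched here as an assumption rather than a tuned rule, is to drop very short passages before they are embedded and indexed. The threshold of 8 words and the name drop_short_passages are arbitrary choices for illustration.

```python
def drop_short_passages(passages, min_words=8):
    """Keep only passages with at least min_words whitespace-separated words."""
    return [p for p in passages if len(p.split()) >= min_words]

# The first two strings come from the output above; the third is shortened.
candidates = [
    "Outline of Paris",
    "Art Deco in Paris",
    "Paris' top cultural attractions in 2022 were the Louvre Museum and the Eiffel Tower.",
]
kept = drop_short_passages(candidates)
print(kept)
```

The filter has to run before the embeddings are computed so that the index rows and the document list stay aligned.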


In [22]:
# Set up the generator (LLM)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
generator_model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_response(prompt, model, tokenizer, max_new_tokens=50):
    # GPT-2 has no pad token, so reuse the end-of-sequence token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,  # Upper bound on newly generated tokens
    )
    # The decoded text contains the prompt followed by the continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "Based on these documents, suggest cultural activities in Paris:\n" + "\n".join(results)
response = generate_response(prompt, generator_model, tokenizer, max_new_tokens=150)  # Generate up to 150 new tokens
print(response)

Based on these documents, suggest cultural activities in Paris:
Outline of Paris
Art Deco in Paris
Paris' top cultural attractions in 2022 were the Louvre Museum 7.7 million visitors, the Eiffel Tower 5.8 million visitors, the Muse d'Orsay 3.27 million visitors and the Centre Pompidou 3 million visitors.
Some of the finest manouche musicians in the world are found here playing the cafs of the city at night. Some of the more notable jazz venues include the New Morning, Le Sunset, La Chope des Puces and Bouquet du Nord. Several yearly festivals take place in Paris, including the Paris Jazz Festival and the rock festival Rock en Seine. The Orchestre de Paris was established in 1967. December 2015 was the 100th anniversary of the birth of Edith Piafwidely regarded as France's national chanteuse, as well as being one of France's greatest international stars.
Paris is known for its museums and architectural landmarks the Louvre received 8.9 million visitors in 2023, on track for keeping its 
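
The printed response starts with the prompt because generate_response decodes the whole output sequence, and a causal model's output begins with the prompt tokens. A minimal sketch of separating the continuation is shown below with toy token ids. The ids and the helper name split_generated are made up.

```python
def split_generated(output_ids, prompt_length):
    """Split a causal LM output sequence into (prompt_ids, continuation_ids)."""
    return list(output_ids[:prompt_length]), list(output_ids[prompt_length:])

# Toy ids standing in for tokenizer output; real ids come from the tokenizer.
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [8, 15, 16]
_, continuation = split_generated(output_ids, len(prompt_ids))
print(continuation)

# With the objects from the cell above, the equivalent slice would be
# outputs[0][inputs["input_ids"].shape[1]:], passed to tokenizer.decode.
```

Decoding only the continuation keeps the retrieved passages out of the final answer text.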

In [33]:
from transformers import pipeline

# Initialize the summarizer
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_documents(documents):
    summaries = []
    for doc in documents:
        input_length = len(doc.split())  # Number of words in the input document
        if input_length < 10:  # If input is very short, skip summarization
            summaries.append(doc)
            continue

        max_length = min(30, input_length + 10)  # Cap the summary length
        # Keep min_length below max_length; input_length // 2 alone exceeds 30 for long inputs
        min_length = min(max(5, input_length // 2), max_length - 1)

        summary = summarizer(doc, max_length=max_length, min_length=min_length, do_sample=False)
        summaries.append(summary[0]['summary_text'])
    return summaries


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [34]:
def clean_response(response):
    # Split into lines and remove duplicates or overly repetitive phrases
    lines = response.split("\n")
    unique_lines = list(dict.fromkeys(lines))  # Remove duplicate lines
    filtered_lines = [line for line in unique_lines if len(line.strip()) > 5]  # Remove very short lines
    return "\n".join(filtered_lines)
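
clean_response removes repeated lines, but a sentence repeated many times inside a single line passes through unchanged. A rough sketch of sentence-level de-duplication follows. The splitting rule (sentence-ending punctuation followed by whitespace) is a simplification, and dedupe_sentences is a hypothetical helper.

```python
import re

def dedupe_sentences(text):
    """Drop repeated sentences, keeping the first occurrence of each."""
    # Split after ., ! or ? when followed by whitespace; "7.7 million" stays intact.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)

sample = (
    "The Louvre is popular. The Louvre is popular. "
    "The Eiffel Tower is tall. The Louvre is popular."
)
deduped = dedupe_sentences(sample)
print(deduped)
```

This only hides repetition after the fact; it does not stop the model from spending its token budget on repeats.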


In [35]:
def rag_pipeline(query, retriever, embedding_model, generator_model, tokenizer, documents, max_new_tokens=150):
    # Step 1: Retrieve and summarize documents
    retrieved_docs = query_faiss(query, embedding_model, retriever, documents, top_k=3)
    if not retrieved_docs:
        return "No relevant documents found for the query."

    summarized_docs = summarize_documents(retrieved_docs)

    # Step 2: Construct the prompt
    prompt = (
        "You are a helpful assistant. Based on these documents, provide a concise and well-structured list of "
        "cultural activities and landmarks in Paris:\n" + "\n".join(summarized_docs)
    )

    # Step 3: Generate the response
    response = generate_response(prompt, generator_model, tokenizer, max_new_tokens=max_new_tokens)

    # Step 4: Post-process the response
    response = clean_response(response)
    return response
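
Note that rag_pipeline receives a query but builds a fixed prompt about cultural activities in Paris, so the query only affects retrieval. A minimal sketch of a query-aware variant with the retriever and generator passed in as plain callables is given below. The prompt wording, build_prompt, run_rag, and the stub functions are all assumptions made for illustration.

```python
def build_prompt(query, passages):
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def run_rag(query, retrieve, generate, top_k=3):
    """retrieve(query, top_k) returns passages; generate(prompt) returns text."""
    passages = retrieve(query, top_k)
    if not passages:
        return "No relevant documents found for the query."
    return generate(build_prompt(query, passages))

# Stubs standing in for the FAISS retriever and the language model.
def fake_retrieve(query, top_k):
    return ["The Louvre is a museum in Paris."][:top_k]

def fake_generate(prompt):
    return f"(stub) received a prompt of {len(prompt.splitlines())} lines"

print(build_prompt("What should I see?", fake_retrieve("", 3)))
print(run_rag("What should I see?", fake_retrieve, fake_generate))
```

Passing callables makes each stage replaceable, so the flow can be checked with stubs before loading any model.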

In [36]:
# Example query
query = "What are some cultural landmarks in Paris?"

# Execute the RAG pipeline
response = rag_pipeline(
    query=query,
    retriever=index,  # FAISS index
    embedding_model=embedding_model,  # SentenceTransformer model
    generator_model=generator_model,  # GPT-2 model
    tokenizer=tokenizer,  # Tokenizer for the GPT-2 model
    documents=documents,  # Cleaned paragraph texts
    max_new_tokens=150  # Limit the generated response length
)

# Print the response
print("Generated Response:")
print(response)

Generated Response:
You are a helpful assistant. Based on these documents, provide a concise and well-structured list of cultural activities and landmarks in Paris:
Outline of Paris
Art Deco in Paris
The Louvre Museum will have 7.7 million visitors in 2022. The Eiffel Tower will have 5.8 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.5 million visitors. The Louvre will have 1.
