## **08 - External Index Retrievers**

In [1]:
!uv pip install langchain-community
!uv pip install arxiv          
!uv pip install wikipedia       
!uv pip install tavily-python

Audited 1 package in 230ms
Audited 1 package in 80ms
Audited 1 package in 69ms
Audited 1 package in 81ms


In [2]:
from dotenv import load_dotenv
import os

# Load variables from a local .env file into the environment
load_dotenv()
from langchain_community.retrievers import ArxivRetriever, WikipediaRetriever, TavilySearchAPIRetriever
from langchain_core.documents import Document
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

import langchain
print(f"✅ LangChain version: {langchain.__version__}")
print("✅ Setup complete!")

✅ LangChain version: 1.2.0
✅ Setup complete!
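
The later cells read several settings from the environment, so it can help to check them right after loading the `.env` file. The following is a minimal sketch, not part of the original notebook: the `AZURE_*` names are the ones read in the RAG cells further down, while `TAVILY_API_KEY` is an assumption about what the Tavily retriever expects, and `missing_env_vars` is a hypothetical helper.

```python
import os

# The AZURE_* names are read later in this notebook.
# TAVILY_API_KEY is an assumption about what the Tavily retriever expects.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
    "TAVILY_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All expected environment variables are set.")
```

Passing a plain dict as `env` makes the helper easy to try without touching the real environment.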



In [4]:
arxiv_retriever = ArxivRetriever(load_max_docs=3)

query = "Large language model"
docs = arxiv_retriever.invoke(query)

print(f"📚 Found {len(docs)} papers on '{query}'\n")

# Display first paper
print("=" * 80)
print(f"Title: {docs[0].metadata.get('Title', 'N/A')}")
print(f"Authors: {docs[0].metadata.get('Authors', 'N/A')}")
print(f"Published: {docs[0].metadata.get('Published', 'N/A')}")
print(f"\nAbstract (first 500 chars):\n{docs[0].page_content[:500]}...")
print("=" * 80)
print(f"Title: {docs[1].metadata.get('Title', 'N/A')}")
print("=" * 80)
print(f"Title: {docs[2].metadata.get('Title', 'N/A')}")

📚 Found 3 papers on 'Large language model'

Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
Published: 2024-04-16

Abstract (first 500 chars):
Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine...
Title: Demystifying Instruction Mixing for Fine-tuning Large Language Models
Title: WizardLM: Empowering large pre-trained language models to follow complex instructions


### **INTERMEDIATE: Advanced ArxivRetriever Features**

In [5]:
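arxiv_advanced = ArxivRetriever(
    # The run below returns 3 docs despite load_max_docs=5; the number of
    # summaries appears to follow top_k_results (default 3) instead.
    load_max_docs=5,
    load_all_available_meta=True
)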
query = "transformers attention mechanism"
arxiv_advanced_result = arxiv_advanced.invoke(query)

print(f"Retrieved docs length: {len(arxiv_advanced_result)}")
for i, doc in enumerate(arxiv_advanced_result, 1):
    print(f"{i}. {doc.metadata.get('Title', 'N/A')}")
    print(f"   Authors: {doc.metadata.get('Authors', 'N/A')}")
    print(f"   Published: {doc.metadata.get('Published', 'N/A')}")
    print(f"   Entry ID: {doc.metadata.get('entry_id', 'N/A')}")
    print()

Retrieved docs length: 3
1. Vision Transformer with Quadrangle Attention
   Authors: Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao
   Published: 2023-03-27
   Entry ID: N/A

2. Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation
   Authors: Jibang Wu, Renqin Cai, Hongning Wang
   Published: 2020-01-29
   Entry ID: N/A

3. Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
   Authors: Nihal Mehta
   Published: 2025-11-16
   Entry ID: N/A
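
Every paper above prints `Entry ID: N/A`, which suggests the `entry_id` key is not present in the returned metadata. One way to investigate is to list the keys that are actually there and try several candidate spellings. The sketch below uses a hypothetical helper, `first_present`, and a hand-made metadata dict shaped like the printed summaries; the alternate key `Entry ID` is an assumption, not something the output confirms.

```python
def first_present(metadata, keys, default="N/A"):
    """Return the value of the first key found in metadata, else default."""
    for key in keys:
        if key in metadata:
            return metadata[key]
    return default


# Hypothetical metadata dict shaped like the summaries printed above.
sample_metadata = {
    "Title": "Vision Transformer with Quadrangle Attention",
    "Published": "2023-03-27",
}

# Inspect which keys exist before relying on any of them.
print(sorted(sample_metadata))

# Try candidate spellings in order; both are assumptions about the real keys.
print(first_present(sample_metadata, ["Entry ID", "entry_id"]))
```

With a real result, `sorted(doc.metadata)` would show the available keys for that retriever configuration.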



### **INTERMEDIATE: Using .batch() for Multiple Queries**

In [6]:
queries = [
    "RAG retrieval augmented generation",
    "vector embeddings",
    "prompt engineering"
]

arxiv_retriever_batch = ArxivRetriever(
    load_max_docs=3
)
batch_results = arxiv_retriever_batch.batch(queries)

print("📚 Batch Search Results:\n")
for query, docs in zip(queries, batch_results):
    print(f"Query: '{query}'")
    print(f"  → Found {len(docs)} papers")
    if docs:
        print(f"  → Top result: {docs[0].metadata.get('Title', 'N/A')}")
    print()

📚 Batch Search Results:

Query: 'RAG retrieval augmented generation'
  → Found 3 papers
  → Top result: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Query: 'vector embeddings'
  → Found 3 papers
  → Top result: Part-of-Speech Relevance Weights for Learning Word Embeddings

Query: 'prompt engineering'
  → Found 3 papers
  → Top result: Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey
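
Conceptually, batching takes a list of inputs, runs the same call on each, and returns the results in input order. The sketch below illustrates that idea with a thread pool from the standard library. It is not LangChain's implementation: `fake_retrieve` is a hypothetical stand-in for `retriever.invoke`, and `batch_invoke` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_retrieve(query):
    """Hypothetical stand-in for retriever.invoke(); returns fake titles."""
    return [f"{query} result {i}" for i in range(1, 4)]


def batch_invoke(fn, inputs, max_workers=4):
    """Call fn on every input concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fn, inputs))


results = batch_invoke(
    fake_retrieve,
    ["RAG", "vector embeddings", "prompt engineering"],
)
for query_results in results:
    print(query_results[0])
```

Because results keep the order of the inputs, they can be paired with the original queries using `zip`, as the cell above does.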



## **WikipediaRetriever - General Knowledge**

In [7]:
wiki_retriever = WikipediaRetriever(top_k_results=3)

query = "Python programming language"
docs = wiki_retriever.invoke(query)

print(f"Found {len(docs)} Wikipedia articles on: {query}\n")
print("+"*80)
for i, doc in enumerate(docs):
    print(f"Title: {doc.metadata.get('title','N/A')}")
    print(f"Source: {doc.metadata.get('source','N/A')}")
    print(f"\nContent: {doc.page_content[:400]}")
    print("="*60)

Found 3 Wikipedia articles on: Python programming language

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Title: Python (programming language)
Source: https://en.wikipedia.org/wiki/Python_(programming_language)

Content: Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Guido van Rossum began working on Python in the late 
Title: Outline of the Python programming language
Source: https://en.wikipedia.org/wiki/Outline_of_the_Python_programming_language

Content: The following outline is provided as an overview of and topical guide to Python:
Python is a general-purpose, interpreted, object-oriented, multi-paradigm, and dynamically typed programming language kno

### **Advanced WikipediaRetriever Features**

In [8]:
wiki_retriever_advanced = WikipediaRetriever(
  top_k_results=3,
  doc_content_chars_max=1000
)
query = "machine learning"
docs = wiki_retriever_advanced.invoke(query)

print(f"📖 Retrieved {len(docs)} Wikipedia articles\n")
print(docs)
print()

for i, doc in enumerate(docs,1):
    print(f"{i}. Title: {doc.metadata.get('title','N/A')}")
    print(f"   Summary: {doc.metadata.get('summary','N/A')}")
    print(f"   Content length: {len(doc.page_content)} characters")
    print()


📖 Retrieved 3 Wikipedia articles

[Document(metadata={'title': 'Machine learning', 'summary': 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\nStatistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (E

### **INTERMEDIATE: Multilingual Support**

In [9]:
wiki_retriever_es = WikipediaRetriever(
    top_k_results=3,
    lang="es"
)
query = "Inteligencia Artificial"
docs = wiki_retriever_es.invoke(query)

print(docs)
print()

print(f"🌐 Search in Spanish Wikipedia: '{query}'\n")
print(f"Title: {docs[0].metadata.get('title', 'N/A')}")
print(f"Content preview:\n{docs[0].page_content[:400]}...")

[Document(metadata={'title': 'Inteligencia artificial', 'summary': 'La inteligencia artificial, abreviado como IA, en el contexto de las ciencias de la computación, es una disciplina y un conjunto de capacidades cognoscitivas e intelectuales expresadas por sistemas informáticos o combinaciones de algoritmos cuyo propósito es la creación de máquinas que imiten la inteligencia humana.\nEstas tecnologías permiten que las máquinas aprendan de la experiencia, se adapten a nuevas entradas y realicen tareas humanas como el reconocimiento de voz, la toma de decisiones, la traducción de idiomas o la visión por computadora.\u200b\u200b\nEn la actualidad, la inteligencia artificial abarca una gran variedad de subcampos. Estos van desde áreas de propósito general, aprendizaje y percepción, a otras más específicas como el reconocimiento de voz, el juego de ajedrez, la demostración de teoremas matemáticos, la escritura de poesía y el diagnóstico de enfermedades. La inteligencia art

In [10]:
queries = [
    "Albert Einstein",
    "Quantum Computing",
    "Neural Networks"
]
wiki_retriever_batch = WikipediaRetriever(
    top_k_results=1,
    doc_content_chars_max=500
)
batch_results = wiki_retriever_batch.batch(queries)
print(batch_results)

print("📖 Batch Wikipedia Search Results:\n")
for query, docs in zip(queries, batch_results):
    print(query)
    if docs:
        print(f"   → Title: {docs[0].metadata.get('title','N/A')}")
        print(f"   → Summary: {docs[0].page_content[:200]}...")
    print()

[[Document(metadata={'title': 'Albert Einstein', 'summary': 'Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity. Einstein also made important contributions to quantum theory. His mass–energy equivalence formula E = mc2, which arises from special relativity, has been called "the world\'s most famous equation". He received the 1921 Nobel Prize in Physics for "his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".\nBorn in the German Empire, Einstein moved to Switzerland in 1895, forsaking his German citizenship (as a subject of the Kingdom of Württemberg) the following year. In 1897, at the age of seventeen, he enrolled in the mathematics and physics teaching diploma program at the Swiss federal polytechnic school in Zurich, graduating in 1900. He acquired Swiss citizenship a year later, which he kept for the rest of his life, and afterwards

## **TavilySearchAPIRetriever - Web Search 🔍**


In [12]:
tavily_retriever = TavilySearchAPIRetriever(k=3)
query = "latest developments in artificial intelligence 2025"

docs = tavily_retriever.invoke(query)
print(docs)
print(f"Found {len(docs)} web results for '{query}'\n")

print("=" * 80)
print(f"Source: {docs[0].metadata.get('source','N/A')}")
print(f"\nContent: \n{docs[0].page_content[:500]}")

[Document(metadata={'title': 'How artificial intelligence grew in 2025 and what could come next', 'source': 'https://www.youtube.com/watch?v=y_7WFvBPEeM', 'score': 0.998103, 'images': []}, page_content="In 2025, the integration of artificial intelligence into the U.S. economy and people's everyday lives grew to historic levels. CBS News"), Document(metadata={'title': 'The Latest AI News and AI Breakthroughs that Matter Most: 2025', 'source': 'https://www.crescendo.ai/news/latest-ai-news-and-updates', 'score': 0.99580115, 'images': []}, page_content='Summary: DeepCogito v2, an open-source AI model, has been released with improved logical reasoning and task planning. Developers say it outperforms many closed'), Document(metadata={'title': 'The State of AI 2025: 12 Eye-Opening Graphs - IEEE Spectrum', 'source': 'https://spectrum.ieee.org/ai-index-2025', 'score': 0.9952077, 'images': []}, page_content="1. U.S. Companies Are Out Ahead · 2. Speaking of Training Costs... · 3. Yet the Cost o

### **INTERMEDIATE: Advanced TavilySearchAPIRetriever Features**

In [13]:
from langchain_community.retrievers import TavilySearchAPIRetriever

tavily_retriever_advanced = TavilySearchAPIRetriever(
    k=5,
    search_depth="advanced",
    include_domains=["github.com", "stackoverflow.com"]
)
query = "langchain tutorials"
docs = tavily_retriever_advanced.invoke(query)

print(f"🔍 Retrieved {len(docs)} web results\n")
for i, doc in enumerate(docs,1):
    print(f"{i}. Source: {doc.metadata.get('source','N/A')}")
    print(f"   Content preview: {doc.page_content[:300]}...")
    print()

🔍 Retrieved 5 web results

1. Source: https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial
   Content preview: ## Repository files navigation

# ü¶úÔ∏èüîó The LangChain Open Tutorial for Everyone

This tutorial delves into LangChain, starting from an overview then providing practical examples.

The LangChain community in Seoul is excited to announce the LangChain OpenTutorial, a brand-new resource designed for eve...

2. Source: https://github.com/NickScherbakov/LangChain-Tutorial
   Content preview: Fork this repository and contribute your own tutorials.
 Raise issues or share suggestions through the GitHub platform.
 Join the LangChain community for further discussions and support.

By diving into these LangChain tutorials, you'll gain practical skills and expand your understanding of this pow...

3. Source: https://github.com/TirendazAcademy/LangChain-Tutorials
   Content preview: ## Repository files navigation

# Welcome to LangChain Tutorials

This repo contains q

In [14]:
from datetime import datetime
current_date = datetime.now().strftime("%B %d %Y")

queries = [
    f"latest AI news {current_date}",
    "current weather in San Francisco",
    "NVIDIA stock price today"
]

tavily_realtime = TavilySearchAPIRetriever(k=3)
print(f"🕐 Real-Time Information (as of {current_date}):\n")

for query in queries:
    docs = tavily_realtime.invoke(query)
    if docs:
        print(f"  → {docs[0].page_content[:350]}")
        print(f"  → Source: {docs[0].metadata.get('source','N/A')}")
    print()

🕐 Real-Time Information (as of December 23 2025):

  → Tech Pulse: December 23, 2025 - AI, Cybersecurity & Development News Roundup ; Lovable (Swedish AI startup): $330M Series B at $6.6B valuation, ...Read more
  → Source: https://dev.to/krlz/tech-pulse-december-23-2025-ai-cybersecurity-development-news-roundup-1jeh

  → {'location': {'name': 'San Francisco', 'region': 'California', 'country': 'United States of America', 'lat': 37.775, 'lon': -122.4183, 'tz_id': 'America/Los_Angeles', 'localtime_epoch': 1766468038, 'localtime': '2025-12-22 21:33'}, 'current': {'last_updated_epoch': 1766467800, 'last_updated': '2025-12-22 21:30', 'temp_c': 14.4, 'temp_f': 57.9, 'is_
  → Source: https://www.weatherapi.com/

  → The NVIDIA Corporation share price today is $183.08, reflecting a -0.33% change over the last 24 hours and 3.35% over the past week.
  → Source: https://www.etoro.com/markets/nvda
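
The loop above makes one live web request per query, so a single transient network error would stop the whole cell. A possible safeguard is to retry each call a few times with a growing delay. The sketch below is an illustration under stated assumptions: `invoke_with_retry` and `flaky_search` are hypothetical names, and the flaky function only simulates a failing search rather than calling any real API.

```python
import time


def invoke_with_retry(fn, query, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(query), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(query)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_search(query):
    """Hypothetical search that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [f"result for {query}"]


# The sleep argument is replaced with a no-op so the demo runs instantly.
print(invoke_with_retry(flaky_search, "latest AI news", sleep=lambda s: None))
```

In the notebook, the same wrapper could be given `tavily_realtime.invoke` as `fn`; which exceptions are worth retrying depends on the client and is left open here.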



## **Integration with RAG Chains üîó**

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

wiki_retriever = WikipediaRetriever(
    top_k_results=2,
    doc_content_chars_max=2000
)
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)
template = """
Answer the question based on the following context from Wikipedia:

Context:
{context}

Question: {question}

Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
def format_doc(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": wiki_retriever | format_doc, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

question = "What is quantum computer and how does it work?"
answer = rag_chain.invoke(question)

print(f"Question: {question}")
print(f"Answer: {answer}")

Question: What is quantum computer and how does it work?
Answer: A quantum computer is a type of computer that uses principles of quantum mechanics to perform calculations. Unlike classical computers, which encode data in binary bits (0s and 1s), quantum computers use quantum bits, or qubits. Qubits can exist in a superposition state, meaning they can represent both 0 and 1 simultaneously, which allows quantum computers to process vast amounts of data and carry out complex computations simultaneously.

Quantum computers leverage two main principles of quantum mechanics:

1. **Superposition**: Qubits can exist in multiple states (both 0 and 1) at the same time, enabling parallel computation. This property makes quantum computers ideal for solving problems that require evaluating many possibilities simultaneously.

2. **Entanglement**: Quantum entanglement is a phenomenon where qubits become interconnected, such that the state of one qubit is directly related to the state of the other, e

### **INTERMEDIATE: Multi-Source RAG Chain**

In [17]:
arxiv_retriever = ArxivRetriever(load_max_docs=2)
wiki_retriever = WikipediaRetriever(top_k_results=2, doc_content_chars_max=1500)

def multiRetriever(query):
    arxiv_docs = arxiv_retriever.invoke(query)
    wiki_docs = wiki_retriever.invoke(query)

    all_docs = []

    if arxiv_docs:
        all_docs.append("=== Academic Papers (ArXiv) ===")
        all_docs.extend([doc.page_content[:500] for doc in arxiv_docs])

    if wiki_docs:
        all_docs.append("\n=== General Knowledge (Wikipedia) ===")
        all_docs.extend([doc.page_content[:500] for doc in wiki_docs])
    return "\n\n".join(all_docs)

multiple_source_template = """Answer the question using information from multiple sources below:

{context}

Question: {question}

Provide a comprehensive answer that synthesizes information from both academic and general sources:"""

multi_prompt = ChatPromptTemplate.from_template(multiple_source_template)
multi_rag_chain = (
    {"context": multiRetriever, "question": RunnablePassthrough()}
    | multi_prompt
    | llm
    | StrOutputParser()
)

question = "What are transformers in machine learning?"
answer = multi_rag_chain.invoke(question)

print(f"Question: {question}\n")
print(f"Answer (from multiple sources):\n{answer}")

Question: What are transformers in machine learning?

Answer (from multiple sources):
Transformers in machine learning are advanced network architectures that have revolutionized the broader field of artificial intelligence by utilizing a mechanism called multi-head attention. First introduced in the landmark 2017 research paper *"Attention Is All You Need"* by Google researchers, the transformer architecture is foundational to modern AI applications, enabling state-of-the-art performance in a variety of tasks, such as natural language processing, computer vision, and more.

At their core, transformers take text or other sequential input data and process it by converting each element (e.g., words in a sentence) into numerical representations known as *tokens*. These tokens are further transformed into vectors through a word embedding lookup table. The key innovation lies in the way the transformer architecture processes these tokens: instead of analyzing the input sequentially, it uses
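
`multiRetriever` above queries both sources one after the other, so an exception from either one fails the whole chain. A more tolerant variant could skip a failing source and still return what the others found. The following is a minimal sketch of that idea: `combine_sources`, `fake_arxiv`, and `broken_wiki` are hypothetical names, and the two fetch functions are stand-ins that do not contact any service.

```python
def combine_sources(sources, query, max_chars=500):
    """Query each named source and join the snippets under labeled headers.

    `sources` maps a label to a callable that takes a query and returns a
    list of strings. A source that raises is skipped instead of failing
    the whole lookup.
    """
    sections = []
    for label, fetch in sources.items():
        try:
            snippets = fetch(query)
        except Exception as exc:
            print(f"Skipping {label}: {exc}")
            continue
        if snippets:
            sections.append(f"=== {label} ===")
            sections.extend(text[:max_chars] for text in snippets)
    return "\n\n".join(sections)


def fake_arxiv(query):
    """Hypothetical stand-in that returns one fake abstract."""
    return [f"Paper abstract about {query}"]


def broken_wiki(query):
    """Hypothetical stand-in that always fails."""
    raise TimeoutError("simulated network timeout")


context = combine_sources(
    {
        "Academic Papers (ArXiv)": fake_arxiv,
        "General Knowledge (Wikipedia)": broken_wiki,
    },
    "transformers",
)
print(context)
```

With real retrievers, each callable could wrap `retriever.invoke` and map the returned documents to their `page_content`.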

In [18]:
tavily_retriever = TavilySearchAPIRetriever(k=3)
realtime_template = """Based on the latest information from the web:

{context}

Question: {question}

Provide an up-to-date answer of at least 500 words with source attribution:"""

realtime_prompt = ChatPromptTemplate.from_template(realtime_template)

realtime_rag_chain = (
    {"context": tavily_retriever | format_doc, "question": RunnablePassthrough()}
    | realtime_prompt
    | llm
    | StrOutputParser()
)

question = "What are the latest developments in AI regulation?"
answer = realtime_rag_chain.invoke(question)

print(f"Question: {question}\n")
print(f"Answer (from real-time web search):\n{answer}")

Question: What are the latest developments in AI regulation?

Answer (from real-time web search):
I’m afraid I cannot access real-time updates from the web since my training only extends to October 2023, nor can I browse the web for real-time developments. However, I can provide a detailed overview of developments in AI regulation based on information up to that time and general trends in policymaking concerning artificial intelligence.

**AI Regulation Overview (as of October 2023)**

Artificial intelligence regulation has become a significant priority globally due to the rapid advancements in AI and its profound implications for ethics, safety, privacy, and fairness. Below are key trends and developments in AI regulation.

---

### **1. United States: Interest and Action on AI Governance**

The U.S. has historically embraced a light-touch approach to AI regulation, but recent developments indicate growing interest in crafting guidelines to govern AI development and deployment. Poli
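
The prompt in this cell asks for source attribution, but `format_doc` passes only `page_content` to the model, so the URLs stored in `metadata['source']` never reach it. One possible alternative is a formatter that numbers each snippet and appends its source. The sketch below is illustrative only: `format_docs_with_sources` is a hypothetical helper, `FakeDoc` is a stand-in with the same two attributes the notebook uses, and the `example.com` URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in with the same two attributes used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs):
    """Number each document and append its source URL so the model can cite it."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown source")
        blocks.append(f"[{i}] {doc.page_content}\n(Source: {source})")
    return "\n\n".join(blocks)


# Placeholder documents; the URL is not a real search result.
docs = [
    FakeDoc("First snippet.", {"source": "https://example.com/a"}),
    FakeDoc("Second snippet."),
]
print(format_docs_with_sources(docs))
```

A formatter like this could take the place of `format_doc` in the chain, since it also accepts a list of documents and returns a single string.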