## **08 - External Index Retrievers**

In [2]:
!uv pip install langchain-community
!uv pip install arxiv          
!uv pip install wikipedia       
!uv pip install tavily-python

Audited 1 package in 153ms
Resolved 8 packages in 5.46s
   Building sgmllib3k==1.0.0
      Built sgmllib3k==1.0.0
Prepared 3 packages in 1.70s
Installed 3 packages in 93ms
 + arxiv==2.3.1
 + feedparser==6.0.12
 + sgmllib3k==1.0.0
Resolved 9 packages in 4.29s
   Building wikipedia==1.4.0
      Built wikipedia==1.4.0
Prepared 1 package in 1.80s
Installed 1 package in 47ms
 + wikipedia==1.4.0
Audited 1 package in 83ms


In [1]:
from dotenv import load_dotenv
import os

load_dotenv()  # must be called; the bare name `load_dotenv` loads nothing from .env
from langchain_community.retrievers import ArxivRetriever, WikipediaRetriever, TavilySearchAPIRetriever
from langchain_core.documents import Document
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

import langchain
print(f"✅ LangChain version: {langchain.__version__}")
print("✅ Setup complete!")

✅ LangChain version: 1.2.0
✅ Setup complete!


In [2]:
!uv pip install arxiv

Audited 1 package in 159ms


In [3]:
arxiv_retriever = ArxivRetriever(load_max_docs=3)

query = "Large language model"
docs = arxiv_retriever.invoke(query)

print(f"üìö Found {len(docs)} papers on '{query}'\n")

# Display first paper
print("=" * 80)
print(f"Title: {docs[0].metadata.get('Title', 'N/A')}")
print(f"Authors: {docs[0].metadata.get('Authors', 'N/A')}")
print(f"Published: {docs[0].metadata.get('Published', 'N/A')}")
print(f"\nAbstract (first 500 chars):\n{docs[0].page_content[:500]}...")
print("=" * 80)
print(f"Title: {docs[1].metadata.get('Title', 'N/A')}")
print("=" * 80)
print(f"Title: {docs[2].metadata.get('Title', 'N/A')}")

📚 Found 3 papers on 'Large language model'

Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
Published: 2024-04-16

Abstract (first 500 chars):
Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine...
Title: Demystifying Instruction Mixing for Fine-tuning Large Language Models
Title: WizardLM: Empowering large pre-trained language models to follow complex instructions
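
The cell above indexes `docs[0]`, `docs[1]`, and `docs[2]` directly, which raises `IndexError` if arXiv returns fewer than three results. A minimal sketch of a more defensive printer follows. `PaperDoc` and `summarize_papers` are hypothetical names; `PaperDoc` is only a stand-in exposing the same `page_content` and `metadata` attributes as a LangChain `Document`, so the snippet runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class PaperDoc:
    """Hypothetical offline stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_papers(docs, abstract_chars=500):
    """Return one formatted text block per document, however many there are."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        meta = doc.metadata
        blocks.append(
            f"{i}. {meta.get('Title', 'N/A')}\n"
            f"   Authors: {meta.get('Authors', 'N/A')}\n"
            f"   Published: {meta.get('Published', 'N/A')}\n"
            f"   Abstract: {doc.page_content[:abstract_chars]}"
        )
    return blocks


# Illustrative data only, not a real arXiv record.
sample = [PaperDoc("An abstract about agents.", {"Title": "Paper A", "Authors": "A. Author"})]
for block in summarize_papers(sample):
    print(block)
```

The same function should accept the list returned by `arxiv_retriever.invoke(query)`, since it only relies on the two attributes named above.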


### **INTERMEDIATE: Advanced ArxivRetriever Features**

In [5]:
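# Note: in the default summary mode the result count appears to be governed by
# `top_k_results` (default 3) rather than `load_max_docs`, which would explain
# why only 3 documents are returned below.
arxiv_advanced = ArxivRetriever(
    load_max_docs=5,
    load_all_available_meta=True
)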
query = "transformers attention mechanism"
arxiv_advanced_result = arxiv_advanced.invoke(query)

print(f"Retrieved docs length: {len(arxiv_advanced_result)}")
for i, doc in enumerate(arxiv_advanced_result, 1):
    print(f"{i}. {doc.metadata.get('Title', 'N/A')}")
    print(f"   Authors: {doc.metadata.get('Authors', 'N/A')}")
    print(f"   Published: {doc.metadata.get('Published', 'N/A')}")
    print(f"   Entry ID: {doc.metadata.get('entry_id', 'N/A')}")
    print()

Retrieved docs length: 3
1. Vision Transformer with Quadrangle Attention
   Authors: Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao
   Published: 2023-03-27
   Entry ID: N/A

2. Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation
   Authors: Jibang Wu, Renqin Cai, Hongning Wang
   Published: 2020-01-29
   Entry ID: N/A

3. Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
   Authors: Nihal Mehta
   Published: 2025-11-16
   Entry ID: N/A
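
Every `Entry ID` above prints `N/A` because the loop looks up the key `'entry_id'`. In summary mode the retriever appears to store this field under `'Entry ID'`, matching the capitalized `'Title'` and `'Authors'` keys; treat that key name as an assumption to verify against your installed version. A small hypothetical helper, `first_meta`, tries several candidate keys in order:

```python
def first_meta(metadata, *keys, default="N/A"):
    """Return the value of the first usable key in `metadata`, else `default`."""
    for key in keys:
        if key in metadata and metadata[key] not in (None, ""):
            return metadata[key]
    return default


# Illustrative metadata only; the key names are assumptions, not confirmed API.
summary_meta = {"Title": "Example paper", "Entry ID": "example-id"}

print(first_meta(summary_meta, "entry_id", "Entry ID"))
print(first_meta(summary_meta, "doi"))
```

Inside the loop, this would replace the single lookup with `first_meta(doc.metadata, 'entry_id', 'Entry ID')`.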



###  **INTERMEDIATE: Using .batch() for Multiple Queries**

In [4]:
queries = [
    "RAG retrieval augmented generation",
    "vector embeddings",
    "prompt engineering"
]

arxiv_retriever_batch = ArxivRetriever(
    load_max_docs=3
)
batch_results = arxiv_retriever_batch.batch(queries)

print("üìö Batch Search Results:\n")
for query, docs in zip(queries, batch_results):
    print(f"Query: '{query}'")
    print(f"  ‚Üí Found {len(docs)} papers")
    if docs:
        print(f"  ‚Üí Top result: {docs[0].metadata.get('Title', 'N/A')}")
    print()

📚 Batch Search Results:

Query: 'RAG retrieval augmented generation'
  → Found 3 papers
  → Top result: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Query: 'vector embeddings'
  → Found 3 papers
  → Top result: Part-of-Speech Relevance Weights for Learning Word Embeddings

Query: 'prompt engineering'
  → Found 3 papers
  → Top result: Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey
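
`.batch()` takes a list of inputs and returns a list of outputs in the same order, which is why `zip(queries, batch_results)` pairs each query with its own results. By default, LangChain runnables fan the individual calls out over a thread pool. The sketch below imitates that behaviour with the standard library only; `fake_search` and `batch_search` are hypothetical stand-ins, not LangChain APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(query):
    """Hypothetical stand-in for retriever.invoke(query); returns fake titles."""
    return [f"{query} result {n}" for n in range(1, 4)]


def batch_search(search, queries, max_workers=4):
    """Run `search` for each query concurrently, keeping input order."""
    if not queries:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `queries`,
        # regardless of which call finishes first.
        return list(pool.map(search, queries))


queries = ["RAG retrieval augmented generation", "vector embeddings", "prompt engineering"]
for query, results in zip(queries, batch_search(fake_search, queries)):
    print(f"{query!r}: {len(results)} results, top = {results[0]}")
```

With the real retriever, concurrency can be capped through the runnable config, for example `arxiv_retriever_batch.batch(queries, config={"max_concurrency": 2})`, which may help when an external API rate-limits requests.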



## **WikipediaRetriever - General Knowledge**

In [8]:
wiki_retriever = WikipediaRetriever(top_k_results=3)

query = "Python programming language"
docs = wiki_retriever.invoke(query)

print(f"Found {len(docs)} Wikipedia articles on: {query}\n")
print("+"*80)
for i, doc in enumerate(docs):
    print(f"Title: {doc.metadata.get('title','N/A')}")
    print(f"Source: {doc.metadata.get('source','N/A')}")
    print(f"\nContent: {doc.page_content[:400]}")
    print("="*60)

Found 3 Wikipedia articles on: Python programming language

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Title: Python (programming language)
Source: https://en.wikipedia.org/wiki/Python_(programming_language)

Content: Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Guido van Rossum began working on Python in the late 
Title: Outline of the Python programming language
Source: https://en.wikipedia.org/wiki/Outline_of_the_Python_programming_language

Content: The following outline is provided as an overview of and topical guide to Python:
Python is a general-purpose, interpreted, object-oriented, multi-paradigm, and dynamically typed programming language kno

### **Advanced WikipediaRetriever Features**

In [10]:
wiki_retriever_advanced = WikipediaRetriever(
  top_k_results=3,
  doc_content_chars_max=1000
)
query = "machine learning"
docs = wiki_retriever_advanced.invoke(query)

print(f"üìñ Retrieved {len(docs)} Wikipedia articles\n")
print(docs)
print()

for i, doc in enumerate(docs,1):
    print(f"{i}. Title: {doc.metadata.get('title','N/A')}")
    print(f"   Summary: {doc.metadata.get('summary','N/A')}")
    print(f"   Content length: {len(doc.page_content)} characters")
    print()


📖 Retrieved 3 Wikipedia articles

[Document(metadata={'title': 'Machine learning', 'summary': 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\nStatistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (E
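
`doc_content_chars_max` caps the length of each document's `page_content`; the `summary` kept in the metadata is a separate field. Assuming the cap is a plain character slice (an assumption about the wrapper's behaviour that is worth checking in your installed version), the effect can be sketched as follows. `truncate_content` is a hypothetical helper, not part of LangChain.

```python
def truncate_content(text, max_chars=1000):
    """Keep at most `max_chars` characters, mirroring a simple slice-based cap."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    return text[:max_chars]


# Illustrative text only, repeated to exceed the cap.
article = "Machine learning (ML) is a field of study in artificial intelligence. " * 40
capped = truncate_content(article, max_chars=1000)
print(len(article), len(capped))
```

A slice like this can cut a sentence or word in half, so downstream prompts should not assume the content ends on a clean boundary.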

### **INTERMEDIATE: Multilingual Support**

In [11]:
wiki_retriever_es = WikipediaRetriever(
    top_k_results=3,
    lang="es"
)
query = "Inteligencia Artificial"
docs = wiki_retriever_es.invoke(query)

print(docs)
print()

print(f"üåê Search in Spanish Wikipedia: '{query}'\n")
print(f"Title: {docs[0].metadata.get('title', 'N/A')}")
print(f"Content preview:\n{docs[0].page_content[:400]}...")

[Document(metadata={'title': 'Inteligencia artificial', 'summary': 'La inteligencia artificial, abreviado como IA, en el contexto de las ciencias de la computación, es una disciplina y un conjunto de capacidades cognoscitivas e intelectuales expresadas por sistemas informáticos o combinaciones de algoritmos cuyo propósito es la creación de máquinas que imiten la inteligencia humana.\nEstas tecnologías permiten que las máquinas aprendan de la experiencia, se adapten a nuevas entradas y realicen tareas humanas como el reconocimiento de voz, la toma de decisiones, la traducción de idiomas o la visión por computadora.\u200b\u200b\nEn la actualidad, la inteligencia artificial abarca una gran variedad de subcampos. Estos van desde áreas de propósito general, aprendizaje y percepción, a otras más específicas como el reconocimiento de voz, el juego de ajedrez, la demostración de teoremas matemáticos, la escritura de poesía y el diagnóstico de enfermedades. La inteligencia art

In [12]:
queries = [
    "Albert Einstein",
    "Quantum Computing",
    "Neural Networks"
]
wiki_retriever_batch = WikipediaRetriever(
    top_k_results=1,
    doc_content_chars_max=500
)
batch_results = wiki_retriever_batch.batch(queries)
print(batch_results)

print("üìñ Batch Wikipedia Search Results:\n")
for query, doc in zip(queries,batch_results):
    print(query)
    if doc:
        print(f"   ‚Üí Title: {doc[0].metadata.get('title','N/A')}")
        print(f"   ‚Üí Summary: {doc[0].page_content[:200]}...")
    print()

[[Document(metadata={'title': 'Albert Einstein', 'summary': 'Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity. Einstein also made important contributions to quantum theory. His mass–energy equivalence formula E = mc2, which arises from special relativity, has been called "the world\'s most famous equation". He received the 1921 Nobel Prize in Physics for "his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".\nBorn in the German Empire, Einstein moved to Switzerland in 1895, forsaking his German citizenship (as a subject of the Kingdom of Württemberg) the following year. In 1897, at the age of seventeen, he enrolled in the mathematics and physics teaching diploma program at the Swiss federal polytechnic school in Zurich, graduating in 1900. He acquired Swiss citizenship a year later, which he kept for the rest of his life, and afterwards

## **TavilySearchAPIRetriever - Web Search 🔍**