# LangChain `ArxivRetriever`

For retrieval, we will use LangChain's `ArxivRetriever`.



In [None]:
!pip install langchain_community arxiv

In [1]:
from langchain_community.retrievers import ArxivRetriever

Under the hood, it relies on LangChain's `ArxivAPIWrapper`, which in turn calls the `arxiv` Python library.
arXiv search is built on a search engine that ranks documents by relevance using the **BM25 metric**.


[`arxiv.py` API documentation](https://lukasschwab.me/arxiv.py/arxiv.html)

[arXiv search repo](https://github.com/arXiv/arxiv-search)
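
To make the BM25 ranking concrete, here is a minimal pure-Python sketch of Okapi BM25 on a toy corpus. It is only an illustration: the whitespace tokenizer, the `bm25_scores` helper, and the `k1`/`b` values are assumptions chosen for the example, not settings taken from arXiv's search engine.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "reinforcement learning for supply chain logistics",
    "bounded cohomology of groups",
    "deep learning for image classification",
]
scores = bm25_scores("reinforcement learning", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(toy_docs[best])
```

A document that matches both query terms outranks one that matches only the more common term, and a document with no matching terms scores zero.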

In [5]:
retriever = ArxivRetriever(
    top_k_results=5,
    get_full_documents=False,
    doc_content_chars_max=10
)
query = 'AI'
docs = retriever.invoke(query)
len(docs)

5

Because the retriever queries the arXiv database in its current state, the results stay up to date.

In [6]:
docs = retriever.invoke("2412.04451")  # arXiv ID of a paper submitted a few days ago
docs[0].metadata  # meta-information of the Document

{'Entry ID': 'http://arxiv.org/abs/2412.04451v1',
 'Published': datetime.date(2024, 12, 5),
 'Title': 'Bordism and resolution of singularities',
 'Authors': 'Mohammed Abouzaid, Shaoyun Bai'}
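
The query above is an arXiv identifier rather than free text. As a rough sketch of how the two kinds of query could be told apart, the hypothetical helper below checks for new-style identifiers of the form `YYMM.NNNNN` with an optional version suffix. The regular expression and the name `looks_like_arxiv_id` are assumptions made for illustration, not the library's own implementation.

```python
import re

# New-style arXiv identifier, e.g. 2412.04451 or 2412.04451v1 (assumed pattern).
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def looks_like_arxiv_id(query: str) -> bool:
    """Return True if every whitespace-separated token looks like an arXiv ID."""
    tokens = query.split()
    return bool(tokens) and all(ARXIV_ID.fullmatch(t) for t in tokens)


print(looks_like_arxiv_id("2412.04451"))
print(looks_like_arxiv_id("Bordism and resolution of singularities"))
```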

In [7]:
docs = retriever.invoke("Bordism and resolution of singularities")
docs[0]

Document(metadata={'Entry ID': 'http://arxiv.org/abs/2412.04451v1', 'Published': datetime.date(2024, 12, 5), 'Title': 'Bordism and resolution of singularities', 'Authors': 'Mohammed Abouzaid, Shaoyun Bai'}, page_content='We adapt algorithms for resolving the singularities of complex algebraic\nvarieties to prove that the natural map of homology theories from complex\nbordism to the bordism theory of complex derived orbifolds splits. In\nequivariant stable homotopy theory, our techniques yield a splitting of\nhomology theories for the map from bordism to the equivariant bordism theory of\na finite group $\\Gamma$, given by assigning to a manifold its product with\n$\\Gamma$. In symplectic topology, and using recent work of\nAbouzaid-McLean-Smith and Hirshi-Swaminathan, we conclude that one can define\ncomplex cobordism-valued Gromov-Witten invariant for arbirary (closed)\nsymplectic manifolds. We apply our results to constrain the topology of the\nspace of Hamiltonian fibrations over $S

Let's test it on a few queries from different fields.

In [None]:
retriever = ArxivRetriever(
    top_k_results=3,           # number of summaries returned per query
    get_full_documents=False,  # return abstracts rather than full papers
)

queries = [
    'bounded cohomology and its applications', 
    'differential privacy in real-world data', 
    'chaotic behavior in financial markets'
]

for num, query in enumerate(queries):
    docs = retriever.invoke(query)
    print(f'TOPIC {num}: {query}')
    print('DOCS')
    print(*[('TITLE: ' + doc.metadata['Title'] + '\n' + 'SUMMARY: ' + doc.page_content) for doc in docs], sep='\n')
    print()


TOPIC 0: bounded cohomology and its applications
DOCS
TITLE: Index bounded relative symplectic cohomology
SUMMARY: We study the relative symplectic cohomology with the help of an index bounded
contact form. For a Liouville domain with an index bounded boundary, we
construct a spectral sequence which starts from its classical symplectic
cohomology and converges to the relative symplectic cohomology of it inside a
Calabi-Yau manifold. In the appendix we compare the relative symplectic
cohomology of a Liouville domain inside its completion with its classical
symplectic cohomology. As an application, we obtain a version of the Viterbo
isomorphism.
TITLE: Cohomology of twisted tensor products
SUMMARY: It is well known that the cohomology of a tensor product is essentially the
tensor product of the cohomologies. We look at twisted tensor products, and
investigate to which extend this is still true. We give an explicit description
of the $\Ext$-algebra of the tensor product of two modules, an

The retriever copes even with long, complicated queries. When building a RAG pipeline, we can still simplify them first with the help of an LLM.

In [51]:
queries = [
    'What are the latest advancements in using reinforcement learning techniques for optimizing supply chain logistics, particularly under uncertainty and constraints like limited storage or fluctuating demand?', 
    'Reinforcement learning for optimizing supply chain logistics under uncertainty and constraints?'
]

for num, query in enumerate(queries):
    docs = retriever.invoke(query)
    print(f'TOPIC {num}: {query}')
    print('DOCS')
    print(*[('TITLE: ' + doc.metadata['Title'] + '\n' + 'SUMMARY: ' + doc.page_content) for doc in docs], sep='\n')
    print()


TOPIC 0: What are the latest advancements in using reinforcement learning techniques for optimizing supply chain logistics, particularly under uncertainty and constraints like limited storage or fluctuating demand?
DOCS
TITLE: Controllability Analyses on Firm Networks Based on Comprehensive Data
SUMMARY: Since governments give stimulus to firms and expect the spillover effect by
fiscal policies, it is important to know the effectiveness that they can
control the economy. To clarify the controllability of the economy, we
investigate a firm production network observed exhaustively in Japan and what
firms should be directly or indirectly controlled by using control theory. By
control theory, we can classify firms into three different types: (a) firms
that should be directly controlled; (b) firms that should be indirectly
controlled; (c) neither of them (ordinary). Since there is a direction
(supplier and client) in the production network, we can consider controls of
two different directio
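
When no LLM is at hand, a crude stand-in for the simplification step is to strip filler words from the question before sending it to the retriever. The sketch below is a plain keyword heuristic, not an LLM rewrite; the `simplify_query` helper and its small stopword list are assumptions made for illustration.

```python
import re

# Small illustrative stopword list (an assumption; extend it as needed).
STOPWORDS = {
    "what", "are", "is", "the", "a", "an", "in", "on", "of", "for", "to",
    "and", "or", "with", "like", "using", "latest", "particularly",
}


def simplify_query(query: str, max_terms: int = 12) -> str:
    """Drop stopwords and repeated words, keeping at most `max_terms` terms."""
    words = re.findall(r"[a-z0-9-]+", query.lower())
    kept, seen = [], set()
    for word in words:
        if word in STOPWORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return " ".join(kept[:max_terms])


print(simplify_query("What are the latest advancements in reinforcement learning?"))
```

The shortened string can then be passed to `retriever.invoke` in place of the original question.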

You can adjust other parameters as well, including:
- `doc_content_chars_max`: the maximum number of characters kept in each document's `page_content`

In [91]:
retriever = ArxivRetriever(
    load_max_docs=2,
    get_full_documents=True,  # download and parse the full paper text
    doc_content_chars_max=10000000000  # very large cap, so full articles are not truncated
)

In [94]:
docs = retriever.invoke('asdf')

In [95]:
full_docs = retriever.invoke('asdf')

In [96]:
len(full_docs[0].page_content)

72751

In [None]:
len(full_docs[1].page_content)

16873
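
The effect of `doc_content_chars_max` can be pictured as a simple prefix cut on the document text. The helper below is a hypothetical sketch of that idea, assuming the cap is a plain character slice and that `None` means no limit. In the first example, `doc_content_chars_max=10` did not shorten the returned abstract, which suggests the cap matters mainly when `get_full_documents=True`.

```python
from typing import Optional


def truncate_content(text: str, max_chars: Optional[int]) -> str:
    """Keep at most `max_chars` characters; None means no limit."""
    return text if max_chars is None else text[:max_chars]


article = "x" * 72751  # same length as the first full document above
capped = truncate_content(article, 4000)   # example cap, chosen for illustration
uncapped = truncate_content(article, None)
print(len(capped), len(uncapped))
```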