In [1]:
%run setup.ipynb

### Research Paper Data Loader
arXiv is an open-access archive of more than 2 million scholarly articles in fields including physics, mathematics, computer science, quantitative finance, statistics, electrical engineering and systems science, economics, and quantitative biology. That makes it a rich source of data across many domains, and its papers are easy to load into our applications.

To access arXiv data, we need to install the `langchain-community`, `arxiv`, and `pymupdf` packages. PyMuPDF converts the PDF files downloaded from arXiv into plain text.

In [4]:
%pip install -qU langchain-community arxiv pymupdf
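# Work around SSL certificate errors on macOS by building the default HTTPS
# context from certifi's CA bundle, which keeps certificate verification enabled
import ssl
import certifi
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())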

Note: you may need to restart the kernel to use updated packages.


In [5]:

from langchain_community.document_loaders import ArxivLoader
import fitz

# Fix PyMuPDF module structure issue
if not hasattr(fitz, 'fitz'):
    fitz.fitz = fitz

try:
    # Supports all arguments of `ArxivAPIWrapper`
    loader = ArxivLoader(
        query="Yolov8",
        load_max_docs=2,
        # doc_content_chars_max=1000,
        # load_all_available_meta=False,
        # ...
    )

    docs = loader.load()
    print(f"Successfully loaded {len(docs)} documents")
    print("First document preview:")
    print(f"Title: {docs[0].metadata.get('Title', 'Unknown')}")
    print(f"Authors: {docs[0].metadata.get('Authors', 'Unknown')}")
    print(f"Content length: {len(docs[0].page_content)} characters")
    print(f"Content preview: {docs[0].page_content[:500]}...")
    
    # get the summary of the paper
    print("\n" + "="*50)
    print("Getting summaries...")
    summary_docs = loader.get_summaries_as_docs()
    print(f"Successfully retrieved {len(summary_docs)} summaries")
    if summary_docs:
        print(f"First summary: {summary_docs[0].page_content}")
    
except Exception as e:
    print(f"Error occurred: {e}")
    print("\nTrying alternative approach with direct search...")
    
    # Alternative approach using ArxivAPIWrapper directly
    from langchain_community.utilities import ArxivAPIWrapper
    
    try:
        arxiv = ArxivAPIWrapper(top_k_results=2, doc_content_chars_max=10000)
        docs = arxiv.run("Yolov8")
        print("Alternative approach successful!")
        print(f"Content preview: {docs[:1000]}...")
    except Exception as e2:
        print(f"Alternative approach also failed: {e2}")
        print("This might be due to network connectivity issues or ArXiv being temporarily unavailable.")

Successfully loaded 3 documents
First document preview:
Title: Optimizing YOLOv8 for Parking Space Detection: Comparative Analysis of Custom YOLOv8 Architecture
Authors: Apar Pokhrel, Gia Dao
Content length: 29956 characters
Content preview: arXiv:2505.17364v1  [cs.CV]  23 May 2025
Optimizing YOLOv8 for Parking Space Detection: Comparative Analysis of
Custom Backbone Architectures
Apar Pokhrel
The University of Texas at Arlington
apar.pokhrel@mavs.uta.edu
Gia Dao
The University of Texas at Arlington
gia.daoduyduc@mavs.uta.edu
Abstract
Parking space occupancy detection is a critical compo-
nent in the development of intelligent parking management
systems. Traditional object detection approaches, such as
YOLOv8, provide fast and accur...

Getting summaries...
Successfully retrieved 3 summaries
First summary: Parking space occupancy detection is a critical component in the development
of intelligent parking management systems. Traditional object detection
approaches, such as YOLOv8, provi
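
The content preview above still carries the PDF's line layout: words are split across lines with a trailing hyphen (`compo-` / `nent`) and sentences are broken at every line end. The following is a minimal sketch of one way to tidy such text before further processing. The helper name `unwrap_pdf_text` is hypothetical and not part of LangChain or PyMuPDF, and the heuristic is deliberately simple: it will also remove the hyphen from a genuine compound word that happens to break at a line end.

```python
import re


def unwrap_pdf_text(text: str) -> str:
    """Undo PDF line wrapping (hypothetical helper, simple heuristic)."""
    # Join words hyphenated across a line break: "compo-\nnent" -> "component"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces, but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


sample = (
    "Parking space occupancy detection is a critical compo-\n"
    "nent in the development of intelligent parking management\n"
    "systems."
)
print(unwrap_pdf_text(sample))
```

With documents from the loader, the same helper could be applied to each `doc.page_content` string.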

## Alternative SSL Certificate Fix

If you're still experiencing SSL certificate issues, you can permanently fix them by installing certificates properly:

### For macOS:
```bash
# Option 1: Run the certificate installer bundled with the python.org build
# (adjust the version in the path to match your installation)
/Applications/Python\ 3.11/Install\ Certificates.command

# Option 2: Upgrade the certifi CA bundle
pip install --upgrade certifi
```

### For all systems:
```python
import ssl
import certifi

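# Build the default HTTPS context from certifi's CA bundle.
# Patch ssl._create_default_https_context, not ssl.create_default_context:
# rebinding the latter to a lambda that calls it would recurse forever.
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())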
```
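
A common workaround for certificate errors is to switch to an unverified context, which silences the error by turning verification off altogether. As a small illustration using only the standard library, the sketch below compares the two kinds of context; the `describe` helper is a hypothetical name introduced here for display purposes.

```python
import ssl

# Context created the normal way: certificates and hostnames are checked
verified = ssl.create_default_context()
# Context created by the private "unverified" factory: no checks at all
unverified = ssl._create_unverified_context()


def describe(ctx: ssl.SSLContext) -> str:
    """Summarize whether a context validates the server (hypothetical helper)."""
    checks_cert = ctx.verify_mode == ssl.CERT_REQUIRED
    return f"verifies certificate: {checks_cert}, checks hostname: {ctx.check_hostname}"


print("default:   ", describe(verified))
print("unverified:", describe(unverified))
```

An unverified context is best kept to short-lived local debugging, since it accepts any certificate a server presents.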

## Troubleshooting Tips

1. **Network Issues**: If arXiv is temporarily unavailable, try running the cell again later.
2. **Version Compatibility**: Ensure you have compatible versions of `langchain-community` and `pymupdf`.
3. **Query Optimization**: Try different search terms or reduce `load_max_docs` if you run into timeouts.
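
For the first tip, re-running by hand can be replaced with a small retry loop. The following is a sketch only, assuming that transient failures surface as ordinary exceptions; the `retry` helper and its parameters (`attempts`, `base_delay`, `sleep`) are hypothetical names, not part of LangChain or the `arxiv` package.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise RuntimeError("unreachable")


# Possible usage with the loader defined earlier in this notebook:
# docs = retry(lambda: loader.load(), attempts=3, base_delay=2.0)
```

The `sleep` parameter exists so the wait can be swapped out (for example in tests); in normal use the default `time.sleep` applies.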
