# PDFtoConceptAgentSystem

- Author: [Jiwon Kim](https://github.com/brian604)
- Design: []
- Peer Review: []
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)


## Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)

## Overview

This tutorial covers a proof of concept for mining concepts from PDF or HTML documents. It was inspired by the release of the Large Concept Model by Meta's research team ([arXiv](https://arxiv.org/abs/2412.08821) and [GitHub](https://github.com/facebookresearch/large_concept_model)).



### References

- [txtai: All-in-one embeddings database](https://neuml.github.io/txtai/)
- [annotateai github: Automatically annotate papers using LLMs](https://github.com/neuml/annotateai)
- [markitdown github: Python tool for converting files and office documents to Markdown.](https://github.com/microsoft/markitdown)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "markitdown",
        "annotateai",
        "pydantic",
        "httpx",
        "scholarly",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Automatically select the appropriate device
import torch
import platform


def get_device():
    if platform.system() == "Darwin":  # macOS specific
        if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
            print("✅ Using MPS (Metal Performance Shaders) on macOS")
            return "mps"
    if torch.cuda.is_available():
        print("✅ Using CUDA (NVIDIA GPU)")
        return "cuda"
    else:
        print("✅ Using CPU")
        return "cpu"


# Set the device
device = get_device()
print("🖥️ Current device in use:", device)

✅ Using MPS (Metal Performance Shaders) on macOS
🖥️ Current device in use: mps


In [4]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "12-PDFtoConceptAgentSystem",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [5]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

In [7]:
from pydantic import BaseModel, HttpUrl, ValidationError
import httpx
import markitdown
from scholarly import scholarly


class LinkValidator(BaseModel):
    url: HttpUrl  # This ensures the URL is valid during validation

    def is_reachable(self) -> bool:
        try:
            # Convert the `HttpUrl` object to a string before making the request
            response = httpx.head(str(self.url), timeout=5)
            return response.status_code < 400
        except httpx.RequestError:
            return False

# Function to validate and categorize links
def validate_links(links: list[str]) -> dict:
    results = {"valid": [], "invalid": [], "unreachable": []}
    for link in links:
        try:
            validated_link = LinkValidator(url=link)
            if validated_link.is_reachable():
                results["valid"].append(link)
            else:
                results["unreachable"].append(link)
        except ValidationError:
            results["invalid"].append(link)
    return results

In [8]:
# from scholarly import ProxyGenerator
# # Set up a ProxyGenerator object to use free proxies
# # This needs to be done only once per session
# pg = ProxyGenerator()
# pg.FreeProxies()
# scholarly.use_proxy(pg)


def google_scholar_search_links(query: str, limit_search=20, link_limit=5) -> dict:
    links = 0
    results = scholarly.search_pubs(query)
    output_dict = {"titles": [], "links": []}  # Initialize with empty lists

    for i, result in enumerate(results):
        if i >= limit_search:  # Limit the number of results
            break
        title = result["bib"]["title"]
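        link = result.get("pub_url")  # None when the result has no URL
        # validate_links expects a list and returns a dict, so check its "valid" bucket
        if link and validate_links([link])["valid"]: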
            # Append title and link to the lists in the dictionary
            output_dict["titles"].append(title)
            output_dict["links"].append(link)
            links += 1
        if links >= link_limit:
            break

    return output_dict


query = "Multiple Agentic Framework AND LLM"
search_dict = google_scholar_search_links(query=query)
print(search_dict)
links = search_dict["links"]

{'titles': ['A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops', 'Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A Framework for Senior Design Projects', 'Frontiers of Large Language Model-Based Agentic Systems-Construction, Efficacy and Safety', 'LLM-based agentic systems in medicine and healthcare', 'Practical Considerations for Agentic LLM Systems'], 'links': ['https://arxiv.org/abs/2412.17149', 'https://arxiv.org/abs/2501.01205', 'https://dl.acm.org/doi/abs/10.1145/3627673.3679105', 'https://www.nature.com/articles/s42256-024-00944-1', 'https://arxiv.org/abs/2412.04093']}


In [9]:
print(links)

['https://arxiv.org/abs/2412.17149', 'https://arxiv.org/abs/2501.01205', 'https://dl.acm.org/doi/abs/10.1145/3627673.3679105', 'https://www.nature.com/articles/s42256-024-00944-1', 'https://arxiv.org/abs/2412.04093']


- Next, I manually saved the PDF for each link, without any code (the PDFs will not be stored locally).
- `https://www.nature.com/articles/s42256-024-00944-1` is not accessible (title: LLM-based agentic systems in medicine and healthcare).

In [10]:
#!uv pip install annotateai llama-cpp-python # already ran

In [11]:
from annotateai import Annotate

# macOS users should run this instead
annotate = Annotate(
    "bartowski/Llama-3.1_OpenScholar-8B-GGUF/Llama-3.1_OpenScholar-8B-Q4_K_M.gguf"
)

ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml

In [12]:
#!pip install txtai # already ran

In [13]:
annotate("./data/2412.04093v1.pdf")  # link-way

Extracting page text: 0it [00:00, ?it/s]

Extracting title:   0%|          | 0/1 [00:00<?, ?it/s]

ValueError: Requested tokens (1162) exceed context window of 512

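- The run failed: the requested 1162 tokens exceed the 512-token context window reported in the error. I attribute this to the CPU/GPU setup.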

As of Wednesday, January 8, I plan to use `markitdown` and chunk the text paragraph by paragraph.
Then, I plan to label the topic sentence and the supporting evidence in each paragraph. Next, I will build central ideas from this paragraph-level organizational information.
Building on this, I plan to build a `Concept` Agent that derives central and accessory ideas and concepts.
After the `Concept` Agent has run, I will look up the resulting concepts in wikis. I will query with a similarity index, because exact-match queries against the wikis may miss concepts that are worded differently. Beyond the wikis, I will build a `Search` Agent that expands the knowledge beyond the wikis and the ideas and concepts in the paper.
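
The first two steps of this plan (paragraph-level chunking and topic-sentence labeling) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the final implementation: the helper names `split_paragraphs`, `label_paragraph`, and `pdf_to_labeled_paragraphs` are hypothetical, paragraphs are assumed to be separated by blank lines in the Markdown that `markitdown` produces, and the "first sentence is the topic sentence" rule is a naive placeholder for the labeling that an LLM would eventually do.

```python
import re


def split_paragraphs(markdown_text: str) -> list[str]:
    """Split Markdown text into paragraphs, assuming blank lines separate them."""
    blocks = re.split(r"\n\s*\n", markdown_text)
    # Collapse line breaks inside each block and drop empty blocks
    return [" ".join(block.split()) for block in blocks if block.strip()]


def label_paragraph(paragraph: str) -> dict:
    """Naive placeholder: treat the first sentence as the topic sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    return {"topic_sentence": sentences[0], "supporting_evidence": sentences[1:]}


def pdf_to_labeled_paragraphs(pdf_path: str) -> list[dict]:
    """Convert a PDF to Markdown text with markitdown, then chunk and label it."""
    # Imported lazily so the helpers above work without markitdown installed
    from markitdown import MarkItDown

    text = MarkItDown().convert(pdf_path).text_content
    return [label_paragraph(p) for p in split_paragraphs(text)]
```

Calling `pdf_to_labeled_paragraphs("./data/2412.04093v1.pdf")` would return one dictionary per paragraph. Depending on the installed `markitdown` version, PDF conversion may require its optional PDF dependencies. Text extracted from PDFs often breaks sentences and paragraphs in unexpected places, so both regular expressions would likely need tuning on real papers.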