add pdf loader #171

Merged: 5 commits, Apr 7, 2023

Conversation

zzstoatzz (Collaborator) commented Apr 5, 2023

may also want a bulk pdf loader at some point to avoid extra chroma clients

closes #161

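import asyncio

# Assumed import path for the loader added in this PR; adjust if it lives elsewhere.
from marvin.loaders.pdf import PDFLoader


async def main():
    # Load the same paper once from a URL and once from a local copy.
    remote_pdf_document = await PDFLoader(
        file_path="https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf"
    ).load()

    local_pdf_document = await PDFLoader(
        file_path="/Users/nate/Downloads/MMR.pdf"
    ).load()

    # Both sources should produce the same number of documents.
    assert len(remote_pdf_document) == len(local_pdf_document)

    # The first and last documents should match in text and order.
    assert remote_pdf_document[0].text == local_pdf_document[0].text
    assert remote_pdf_document[0].order == local_pdf_document[0].order

    assert remote_pdf_document[-1].text == local_pdf_document[-1].text
    assert remote_pdf_document[-1].order == local_pdf_document[-1].order


asyncio.run(main())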
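
The example above passes either a URL or a local path as file_path. A minimal sketch of how the two cases could be told apart is shown here; it is not taken from the merged code, and the helper name is_remote is hypothetical.

```python
from urllib.parse import urlparse


def is_remote(file_path: str) -> bool:
    """Return True when file_path looks like an http(s) URL (hypothetical helper)."""
    parsed = urlparse(file_path)
    # A remote PDF needs both a web scheme and a host; plain filesystem
    # paths have neither, so they fall through to the local branch.
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```

A loader could fetch the bytes over HTTP when is_remote returns True and open the file from disk otherwise.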

from urllib.parse import urlparse

import httpx
import pypdf
Member review comment:

This would be a good place to warn about needing to install marvin[pdf]
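
One way to act on this suggestion is to wrap the optional import so that a missing dependency produces an install hint. The sketch below is an assumption about how that could look, not the code that was merged; import_optional is a hypothetical helper, and the hint assumes the extra is named marvin[pdf] as in the comment above.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str, package: str = "marvin") -> ModuleType:
    """Import an optional dependency, or raise with an install hint (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that tells the user which extra to install.
        raise ImportError(
            f"{module_name!r} is required for this loader. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc


# Intended use at the top of the loader module:
# pypdf = import_optional("pypdf", extra="pdf")
```

Chaining with "from exc" keeps the original traceback, so the underlying import failure is still visible next to the hint.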

jlowin merged commit a821663 into main on Apr 7, 2023
jlowin deleted the pdf-loader branch on April 7, 2023 13:51
Linked issue: Add PDF loaders