What is ragna's approach to data connectors? #202

nenb · 2023-11-13T13:18:24Z

nenb
Nov 13, 2023
Maintainer

Question: Like the title - how does ragna intend to develop support for other data connectors eg different file extensions like .docx, or documents that are not on the local filesystem such as in the cloud or behind an API?

Context: Similar in spirit to #177. At the moment, ragna only has support for reading local files with .txt and .pdf extensions. This support is provided via classes in the core/_document.py module. These classes are built on top of a custom import system that eagerly loads all available file handlers at runtime. The custom import system would seem to suggest that all new data connectors will need to be written from scratch.
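To make the mechanism concrete, a registry that only loads handlers whose dependencies are installed could look roughly like the sketch below. This is not ragna's implementation: `HandlerRegistry`, `required_modules`, and both handler classes are hypothetical names chosen for illustration, and only the decorator name `load_if_available` is taken from the ragna code quoted later in this thread.

```python
import importlib.util


class HandlerRegistry:
    """Hypothetical stand-in for a registry such as DOCUMENT_HANDLERS."""

    def __init__(self) -> None:
        self._by_suffix: dict[str, type] = {}

    def load_if_available(self, cls: type) -> type:
        # Register the handler only if every module it needs can be found.
        modules = cls.required_modules()
        if all(importlib.util.find_spec(name) is not None for name in modules):
            for suffix in cls.supported_suffixes():
                self._by_suffix[suffix] = cls
        return cls

    def get(self, suffix: str):
        # Returns the handler class for a suffix, or None if none was loaded.
        return self._by_suffix.get(suffix)


HANDLERS = HandlerRegistry()


@HANDLERS.load_if_available
class TxtHandler:
    @classmethod
    def required_modules(cls) -> list[str]:
        return []  # standard library only, so always available

    @classmethod
    def supported_suffixes(cls) -> list[str]:
        return [".txt"]


@HANDLERS.load_if_available
class FancyHandler:
    @classmethod
    def required_modules(cls) -> list[str]:
        # Made-up module name, so this handler is never registered.
        return ["surely_not_installed_pkg_xyz"]

    @classmethod
    def supported_suffixes(cls) -> list[str]:
        return [".fancy"]


print(HANDLERS.get(".txt"), HANDLERS.get(".fancy"))
```

With a design like this, every handler has to be defined next to the registry at import time, which is why the approach suggests that new connectors live inside the project.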

Personal Thoughts: On first impressions, this seems to be creating a very large body of work and distracting from ragna's goal as a very accessible and effective RAG orchestration framework eg see the number of data connectors supported by langchain here. Duplicating this effort in ragna also seems to me like a questionable use of community resources (but this is a very personal opinion). Is there any way that ragna could hook in to the community effort from other projects (but only for the data connectors), and hence allow the ragna team to focus on optimising the RAG orchestration experience?

pmeier · 2023-11-15T11:33:09Z

pmeier
Nov 15, 2023
Maintainer

Question: Like the title - how does ragna intend to develop support for other data connectors eg different file extensions like .docx, or documents that are not on the local filesystem such as in the cloud or behind an API?

These are two different concepts. Let's tackle them individually.

how does ragna intend to develop support for other data connectors eg different file extensions like .docx

This is a decision that I struggled with for a bit when implementing it. Ultimately, I decided to keep it internal for now as part of the bigger refactor in #72. Meaning, for now, if you want to add a new document handler, you need to send a PR to ragna to add it. My reasoning is that this is a common component in almost all use cases.

If you can show me a good use case where you want a custom DocumentHandler that likely won't make its way into Ragna core, I'm happy to reconsider.

Duplicating this effort in ragna also seems to me like a questionable use of community resources (but this is a very personal opinion). Is there any way that ragna could hook in to the community effort from other projects (but only for the data connectors), and hence allow the ragna team to focus on optimising the RAG orchestration experience?

I soft-disagree here. The code for our DocumentHandlers is super simple right now. For example, the whole PDF handling is < 30 LoC:

ragna/ragna/core/_document.py

Lines 237 to 263 in 4962665

    
    @DOCUMENT_HANDLERS.load_if_available
    class PdfDocumentHandler(DocumentHandler):
        """Document handler for `.pdf` documents.

        !!! info "Package requirements"

            - [`pymupdf`](https://pymupdf.readthedocs.io/en/latest/)
        """

        @classmethod
        def requirements(cls) -> list[Requirement]:
            return [PackageRequirement("pymupdf>=1.23.6")]

        @classmethod
        def supported_suffixes(cls) -> list[str]:
            # TODO: pymudpdf supports a lot more formats, while .pdf is by far the most
            #  prominent. Should we expose the others here as well?
            return [".pdf"]

        def extract_pages(self, document: Document) -> Iterator[Page]:
            import fitz

            with fitz.Document(
                stream=document.read(), filetype=Path(document.name).suffix
            ) as document:
                for number, page in enumerate(document, 1):
                    yield Page(text=page.get_text(sort=True), number=number)

with the actual core part being only 5 lines:

ragna/ragna/core/_document.py

Lines 259 to 263 in 4962665

    
    with fitz.Document(
        stream=document.read(), filetype=Path(document.name).suffix
    ) as document:
        for number, page in enumerate(document, 1):
            yield Page(text=page.get_text(sort=True), number=number)

So we aren't looking at crazy amounts of duplication here. IMO it is not worth it to pull a dependency in for this.
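To illustrate how little code a handler for a plain-text format might need, here is a sketch that mirrors the shape of the PDF handler above. `Page` and `Document` are simplified stand-ins rather than ragna's real classes, and `MarkdownDocumentHandler` is hypothetical; a real handler would presumably subclass `DocumentHandler` and be registered with `DOCUMENT_HANDLERS.load_if_available`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class Page:
    # Stand-in for ragna's Page; only the two fields used in the PDF snippet.
    text: str
    number: int


@dataclass
class Document:
    # Stand-in for ragna's Document: a name plus a way to get the raw bytes.
    name: str
    content: bytes

    def read(self) -> bytes:
        return self.content


class MarkdownDocumentHandler:
    """Hypothetical handler that treats a Markdown file as a single page."""

    @classmethod
    def supported_suffixes(cls) -> list[str]:
        return [".md", ".markdown"]

    def extract_pages(self, document: Document) -> Iterator[Page]:
        # No parsing at all: the raw Markdown text becomes page 1.
        yield Page(text=document.read().decode("utf-8"), number=1)


doc = Document(name="notes.md", content=b"# Title\n\nSome text.")
handler = MarkdownDocumentHandler()
if Path(doc.name).suffix in handler.supported_suffixes():
    pages = list(handler.extract_pages(doc))
    print(pages)
```

The only format-specific part is the body of `extract_pages`, which is the point being made about duplication.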

That being said, maybe ragna-langchain is a good idea for a third-party library that provides adapters between the two projects?

or documents that are not on the local filesystem such as in the cloud or behind an API?

This is what the Document class in Ragna is all about. We are defaulting to LocalDocument, but there is a (not well documented) example for storing documents in S3: https://github.com/Quansight/ragna/tree/main/examples/s3_documents. Meaning, this is entirely possible already.
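As a rough sketch of the idea (not Ragna's actual class hierarchy or the S3 example linked above), the abstraction boils down to "a name plus a way to read bytes". All classes below are simplified stand-ins, and `RemoteDocument` with its `fetch` callable is a hypothetical illustration of a non-local source.

```python
import abc
import tempfile
from pathlib import Path
from typing import Callable


class Document(abc.ABC):
    """Stand-in for the Document concept: a name plus a way to read the bytes."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abc.abstractmethod
    def read(self) -> bytes: ...


class LocalDocument(Document):
    def __init__(self, path: Path) -> None:
        super().__init__(path.name)
        self.path = path

    def read(self) -> bytes:
        return self.path.read_bytes()


class RemoteDocument(Document):
    """Hypothetical document whose bytes come from a user-supplied fetcher."""

    def __init__(self, name: str, fetch: Callable[[str], bytes]) -> None:
        super().__init__(name)
        self._fetch = fetch

    def read(self) -> bytes:
        return self._fetch(self.name)


def total_size(documents: list[Document]) -> int:
    # Downstream code only relies on read(), not on where the bytes live.
    return sum(len(d.read()) for d in documents)


# Usage: a dict stands in for an object store such as an S3 bucket.
fake_bucket = {"report.txt": b"remote bytes"}
with tempfile.TemporaryDirectory() as tmp:
    local_path = Path(tmp) / "local.txt"
    local_path.write_bytes(b"local bytes")
    docs = [
        LocalDocument(local_path),
        RemoteDocument("report.txt", fake_bucket.__getitem__),
    ]
    size = total_size(docs)
print(size)
```

Document handlers such as the PDF one above only call `document.read()` and `document.name`, so they do not need to know which subclass they were given.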


nenb · 2023-11-15T16:06:51Z

nenb
Nov 15, 2023
Maintainer Author

I recognise that extensions for fairly-arbitrary data connectors are possible at the moment, and I really appreciate the thought that was put in originally to allow this.

My question though was i) is it the intention for ragna to implement support for all data connectors from scratch in the project (based on your answer, this seems to be the case) and ii) whether this is the best use of available resources.

I realise that adding support for a new connector is relatively straightforward (tens of LoC, as you point out). What concerned me more is the sheer number of connectors required (see the langchain link in my original message) and the maintenance burden that this brings. I was hopeful that there might be a way to hook into community support for these connectors in some way (as you point out, they aren't complex or particularly novel), which would allow focusing more on the orchestration + UX aspects of ragna.

7 replies

nenb Nov 20, 2023
Maintainer Author

Local and Remote File Systems

ragna is currently designed for file formats on a local file system with the LocalDocument class. Extending this to remote 'file systems' (I'm being lazy and referring to object storage as a file system here) seems like it could be a bit of work, and feels like something that has been re-implemented in a bunch of OSS projects already. What about the use of fsspec as an abstraction over these different types of file systems?

To me, fsspec feels like it aligns with ragna in terms of its lightweight nature and attention to details like serialisability. Perhaps it could be possible to extend the LocalDocument class to also include fsspec support? This way, it wouldn't matter where a user's pdf file (for example) was located; as long as ragna had a pdf handler, it would just load it. ragna wouldn't have to spend (too much) time implementing support for all the different file systems, and could share the maintenance burden with the rest of the OSS community. What do you think?
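A minimal sketch of what I have in mind, assuming fsspec is installed (the code falls back to plain local paths if it is not). `read_document_bytes` is a hypothetical helper, not part of ragna; `fsspec.open` is the real fsspec entry point, and remote schemes such as `s3://` additionally need the matching backend package (e.g. s3fs).

```python
import tempfile
from pathlib import Path

try:
    import fsspec  # third-party; treated as optional in this sketch
except ImportError:
    fsspec = None


def read_document_bytes(url: str) -> bytes:
    """Read a document from a local path or, with fsspec, any URL it understands."""
    if fsspec is not None:
        # fsspec picks the filesystem from the URL scheme; a plain path
        # without a scheme is treated as a local file.
        with fsspec.open(url, "rb") as f:
            return f.read()
    # Fallback without fsspec: local files only.
    return Path(url).read_bytes()


# Usage with a local temporary file; the same call would accept e.g. an
# "s3://bucket/key.pdf" URL if fsspec and s3fs were installed and configured.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_bytes(b"hello")
    data = read_document_bytes(str(path))
print(data)
```

The bytes returned here are what a document handler would receive, so the handlers themselves would not need to change.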

External APIs

I personally think this is important. I have seen so many use cases for RAG that involve hitting an org's collection of Slack messages, or their archive of decision documents in Google Workspace. I think it would be great to allow some sort of extendability for loading documents from external APIs eg an alternative class to the LocalDocument that can be extended by the user. There are so many APIs available, that I don't think that ragna could implement them all (see langchain for an example of the effort required). Hence why I am suggesting an extension class.

It could be argued that a user should be responsible for exporting data from these external APIs to some file system before using ragna. That way, ragna wouldn't need to worry about including this support. I'm not sure about this though: some of these tools feel so embedded in daily workflows now that it would be a shame not to support them. I'd be interested to hear ragna's views on this (which can of course change over time too!)

File formats

I haven't got any bright ideas for building on community work here. I would personally find it helpful to be explicit about the number of file formats that aim to be supported before starting the work, to get an idea of the amount of work involved. (And I would of course be happy to help implement a bunch of these file format parsers once/if they have been selected!)

Other

I'm definitely getting ahead of myself here, but it would be great to try and keep the data connector part of the project as uncoupled as possible to the rest of the project. It would be great to be able to share something like this with the rest of the community, as my impression is that a lightweight (ie minimal dependency and limited LoC) solution doesn't currently exist (and if it does, then perhaps we should consider using it!)

pmeier Nov 20, 2023
Maintainer

What about the use of fsspec as an abstraction over these different types of file systems?

Sounds good. Please open a feature request for it. That being said, I would very much prefer universal-pathlib, which relates to fsspec as os.path relates to pathlib. I never want to deal with str paths again.

I think it would be great to allow some sort of extendability for loading documents from external APIs eg an alternative class to the LocalDocument that can be extended by the user.

Apart from the fact that I think this should be an external library and not in Ragna core (not saying that the Ragna team cannot maintain something like that), I think the biggest issue is consistency with the UI. Right now we only allow uploading documents from the local machine to some backend. By default this is the machine the API is running on, but it can be changed to S3, for example.

However, as soon as you change the intake to no longer being documents from the local machine, the UI needs to change as well. And with that comes the big question of how to do that properly. Should we put the relevant UI code on the Document class? What happens if the user doesn't even want the UI or wants to use a custom one? Should we still enforce UI code for Ragna to be present on the Document class?

This is not impossible, but we need to answer all of these questions (and potentially more that I cannot think of right now) before we even start to consider other data sources, and it needs a clean design. This would also bring us closer to a solution for #176 (comment).

I haven't got any bright ideas for building on community work here. I would personally find it helpful to be explicit about the number of file formats that aim to be supported before starting the work, to get an idea of the amount of work involved. (And I would of course be happy to help implement a bunch of these file format parsers once/if they have been selected!)

IMO, we should do this on demand. So far we only got a request for adding support for Markdown files in #209 (see #210). And I suspect that we will also need support for Word documents and PowerPoint slides in the near future. But other than that, I can't think of any right now. So we shouldn't just add features for the sake of having them. This increases maintenance burden for no reason.

I'm definitely getting ahead of myself here, but it would be great to try and keep the data connector part of the project as uncoupled as possible to the rest of the project. It would be great to be able to share something like this with the rest of the community, as my impression is that a lightweight (ie minimal dependency and limited LoC) solution doesn't currently exist (and if it does, then perhaps we should consider using it!)

I feel like I'm repeating myself here. Ragna is already really decoupled. You are free to implement any Document class that you want to use. If you think this is useful enough for other people as well, throw it in a small package and publish it. With your own #192, using these external packages is really simple.

IMO, one of the largest mistakes langchain made is to have everything inside a single package. Ragna goes a different way. To re-iterate: I'm not saying everything that is not part of Ragna core has to be maintained by the community. Think of it more like pytest. They have a super customizable base with clean scope and maintain plenty of plugins themselves: https://github.com/pytest-dev

nenb Dec 7, 2023
Maintainer Author

Sorry for taking so long to follow up on this!

Sounds good. Please open a feature request for it.

I have shared some work in #233. I used fsspec rather than universal-pathlib largely because it supports more filesystems at present and I am far more familiar with it. But I'm open to change here. If the work is of interest, I will then open a feature request.

Apart from the fact that I think this should be an external library and not in Ragna core (not saying that the Ragna team cannot maintain something like that), I think the biggest issue is consistency with the UI.

I'll play around with this over the next couple of weeks, and share anything I find of interest.

IMO, we should do this on demand. So far we only got a request for adding support for Markdown files in #209 (see #210). And I suspect that we will also need support for Word documents and PowerPoint slides in the near future. But other than that, I can't think of any right now. So we shouldn't just add features for the sake of having them. This increases maintenance burden for no reason.

My two cents: ragna is 'only' (I mean this in a very positive way, the minimalism is v attractive and is a core design philosophy as you outlined) a RAG orchestrator. Hence, it needs to do RAG well! If I was looking for a RAG tool, and I saw that a package didn't implement the file formats that I needed, then I think I might just move on. Demand might be a bit tricky to measure.

My opinion is that a baseline of .md, .mdx, .pdf, HTML, .txt, a range of office-suite tools (.doc, .docx, .ppt, .pptx at least), and also .csv + .json is necessary. Would you agree? If so, I will open issues for the formats that aren't currently supported, and will start adding support over the next few weeks.

pmeier Dec 8, 2023
Maintainer

My opinion is that a baseline of .md, .mdx, .pdf, HTML, .txt, a range of office-suite tools (.doc, .docx, .ppt, .pptx at least), and also .csv + .json is necessary. Would you agree? If so, I will open issues for the formats that aren't currently supported, and will start adding support over the next few weeks.

.md, .mdx 👍 see Add support for markdown documents #210
.pdf 👍 already supported
HTML 🤷 not against it, but we should discuss the proper use case first. Let's do so on the issue.
.txt 👍 already supported
.doc, .docx 👍 see [ENH] - Add support for .doc / .docx #225
.ppt, .pptx 👍
.csv 👎 CSV is usually for structured numerical data, which is kinda on the opposite end of the spectrum from a text document. Unless there is a convincing use case that I'm missing, I'm against this. If you feel strongly, please open an issue and detail why Ragna should support this.
.json 🤷 My gut says that JSON is too variable for us to safely extract relevant text. I would need to see an actual use case to make a call here. On the other hand, you are not the first person to request that. See [ENH] - Limit displayed document pills in header #224 (comment)

peachkeel Dec 8, 2023

I recant my comment in #224 about JSON. The way we're using JSON does not really fit the document handler paradigm. I might start a separate discussion about my use-case, which has more to do with corpora than documents.
