When we started out with the document handlers, we assumed that a `Page` is a good unit to base the extraction on:

ragna/ragna/core/_document.py, lines 179 to 188 in 51fcda4

ragna/ragna/core/_document.py, lines 203 to 213 in 51fcda4

However, right from the beginning this was an issue for plain text documents like `.txt` and, after #270, also `.md`, since they don't have a concept of pages. In addition, we were disillusioned to find that not even `.docx` documents can provide page information (#281 (comment)). There is some hope for `.pptx` (#296 (comment)).

In any case, it is clear now that besides `.pdf`, most other popular formats do not support page information. Let's have a look at how the information is used next.

As seen above, the `Page` object is just a thin dataclass. The builtin source storages process it by transforming the objects into chunks:

ragna/ragna/source_storages/_vector_database.py, lines 44 to 48 in 51fcda4

ragna/ragna/source_storages/_vector_database.py, lines 79 to 97 in 51fcda4

After that, the text is chunked differently than before, and the only thing left from the original structure is the list of page numbers. They are stored alongside the chunk in the source storage as metadata and re-appear on the retrieved `Source`s as `source.location`:

ragna/ragna/core/_components.py, lines 79 to 96 in 51fcda4

ragna/ragna/source_storages/_chroma.py, lines 129 to 135 in 51fcda4

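To make the current flow concrete, here is a minimal sketch of how pages could be re-chunked while only the page numbers survive as a location. It is not the code from the files referenced above: `chunk_pages`, `format_location`, the field names, and the whitespace "tokenizer" are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Page:
    text: str
    number: Optional[int] = None  # None for formats without pages, e.g. .txt


@dataclass
class Chunk:
    text: str
    page_numbers: Optional[list[int]]


def chunk_pages(
    pages: Iterable[Page], *, chunk_size: int, chunk_overlap: int
) -> Iterator[Chunk]:
    """Re-chunk pages into windows of roughly `chunk_size` tokens (hypothetical helper)."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Tag every token with the page it came from. Splitting on whitespace is a
    # stand-in for a real tokenizer.
    tagged = [(token, page.number) for page in pages for token in page.text.split()]

    for start in range(0, len(tagged), step):
        window = tagged[start : start + chunk_size]
        numbers = sorted({n for _, n in window if n is not None})
        yield Chunk(
            text=" ".join(token for token, _ in window),
            page_numbers=numbers or None,
        )
        if start + chunk_size >= len(tagged):
            break  # the last window already reached the end of the text


def format_location(page_numbers: Optional[list[int]]) -> str:
    """Turn page numbers into a location string for the source metadata (hypothetical)."""
    if not page_numbers:
        return ""
    return ", ".join(str(number) for number in page_numbers)


pages = [Page("a b c", number=1), Page("d e f", number=2)]
for chunk in chunk_pages(pages, chunk_size=4, chunk_overlap=1):
    print(repr(chunk.text), "->", repr(format_location(chunk.page_numbers)))
```

Note that a plain text document yields `Page` objects without a number, so the resulting location is empty, which is exactly the limitation discussed here.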
Note that here we already deviated from the `Page` abstraction and used the more generic `location`. However, we still only put page numbers in.

Ultimately, this location can help users track down the sources in the original document that were used in an answer. Even after #264, i.e. now that the sources in the API / UI contain the full text alongside the location information, I still believe this is quite useful.

Ok, where do we go from here? My initial thought was to generalize the `Page` object to `Text` and, instead of having the page number on there, have a more generic `location` field similar to the one on `Source`. Instead of just a string, this could be of type `class Location(abc.ABC)`. Potential subclasses could be `class Page(Location)`, `class Paragraph(Location)`, or `class Line(Location)`. This should give us more flexibility for the supported data formats.

After some more thought, one thing I dislike about the approach above is that we still have an "input object" `Text` and a processed object `Chunk`. Apart from where in the pipeline they appear, they are practically the same. One can think of the original extraction from the document as the first chunking and our later processing into chunks of a constant number of tokens as re-chunking. Thus, what I think makes the most sense here is to make `Chunk` the object we extract from the document. We can still use the location approach I described in the paragraph above. And in the source storages we could have something like a `Chunker` that performs the re-chunking.

This `Chunker` could actually become a standalone component separated from the `SourceStorage` (@nenb IIRC we have discussed this at some point in the past, but I can't find it. If you do, please link it to this thread). However, this would not be required in the first step, as the change can also stand on its own as is.
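
As a rough illustration of the proposal, the following sketch combines the `Location` hierarchy with a `Chunk` that is both the extraction result and the re-chunking result. Only the class names `Location`, `Page`, `Paragraph`, `Line`, `Chunk`, and `Chunker` come from the discussion; the `describe` method, the `locations` field, the `Chunker` signature, and the whitespace "tokenizer" are assumptions and not an agreed-upon design.

```python
import abc
from dataclasses import dataclass
from typing import Iterable, Iterator


class Location(abc.ABC):
    """Where in the original document a piece of text came from."""

    @abc.abstractmethod
    def describe(self) -> str:
        """Human-readable form, e.g. for display next to a source (hypothetical)."""


@dataclass(frozen=True)
class Page(Location):
    number: int

    def describe(self) -> str:
        return f"page {self.number}"


@dataclass(frozen=True)
class Paragraph(Location):
    index: int

    def describe(self) -> str:
        return f"paragraph {self.index}"


@dataclass(frozen=True)
class Line(Location):
    number: int

    def describe(self) -> str:
        return f"line {self.number}"


@dataclass
class Chunk:
    text: str
    locations: tuple[Location, ...] = ()


class Chunker:
    """Re-chunks extracted chunks into windows of `chunk_size` tokens (sketch)."""

    def __init__(self, chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def __call__(self, chunks: Iterable[Chunk]) -> Iterator[Chunk]:
        # Tag every token with the locations of the chunk it came from.
        # Splitting on whitespace is a stand-in for a real tokenizer.
        tagged = [
            (token, chunk.locations)
            for chunk in chunks
            for token in chunk.text.split()
        ]
        for start in range(0, len(tagged), self.chunk_size):
            window = tagged[start : start + self.chunk_size]
            locations: list[Location] = []
            for _, token_locations in window:
                for location in token_locations:
                    if location not in locations:  # keep first-seen order
                        locations.append(location)
            yield Chunk(
                text=" ".join(token for token, _ in window),
                locations=tuple(locations),
            )


# A document handler would yield the first chunking, here one chunk per page.
extracted = [Chunk("a b c", (Page(1),)), Chunk("d e", (Page(2),))]
for chunk in Chunker(chunk_size=2)(extracted):
    print(repr(chunk.text), [location.describe() for location in chunk.locations])
```

A handler for `.md` or `.txt` could emit `Paragraph` or `Line` locations instead of `Page`, and the same `Chunker` would carry them through unchanged.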