[ Feature: document module ] document endpoints #59

jerome-white · 2025-03-17T08:34:10Z

Summary

Target issue is #32

Endpoints for manipulating documents meant for RAG interaction.

Checklist

Before submitting a pull request, please ensure that you mark these task.

Ran poetry run uvicorn src.app.main:app --reload in the repository root and test.
If you've fixed a bug or added code that is tested and has test cases.

Notes

This is designed to be a "working" PR: it will remain in draft state as it is fleshed out to provide transparency and get ongoing discussion around its practices.

Update on moving from draft

This has become a large PR. It was started before a lot of our development processes were stablized. Notably:

Before we confirmed that small progressively merged features were the preference. Rather than single document module endpoints, this PR contains about half of the overall document module specification.
Before there was clear delineation between branch purposes and how those branches would work together. The branch against which this code is developed is most certainly out-of-date.

To address the second point, I ask that the reviewers look at this code in isolation, not necessarily how the changes fit with what is currently on main. "Approving" and merging this branch will start the following process:

Code from this PR will go back into my out-of-date sub-branch
I will work to bring the sub-branch inline with main. This not only means a manual merge, but updates to the code to use current best-practices (see Document Management: Integrate organization/project into endpoints #85 for example). Each non-merge update will proceed as its own PR
From there I will proceed with the open sub-issues in Module for managing documents in RAG #26

Route names will conform to common Unix operations.

jerome-white · 2025-03-17T09:33:11Z

There are several questions I have so far (relevant to 4cbb94f):

AWS

How are we managing AWS credentials across development and production:

Are we assuming there will be a single AWS user for all organizations/projects?
Is there a set of AWS credentials we are using for development?
Where are we storing AWS credentials? The .env seems like the best place, but there is nothing there right now (even though we seem to be testing deployment in AWS)

S3 management

The current version of this PR assumes a bucket per project/organization. Alternatives include:

a bucket per organization or project, with "directories" for project or organization (respectively)
s3://org/project_{1,2,3}
a single bucket for all users, differentiated by "directories" for organization and project
s3://ai-platform/org/project/

Using buckets per project/org (s3://org-proj/) means we will have several buckets, but in my mind will be much easier to manage; you can clear a bucket, for example, without worry about effecting other users. This becomes particularly easier if the ai-platform has its own AWS sub-account within T4D.

Project and organization information

How are we passing the project and organization to routes? Right now I assume this information is carried in the "current user"; although in my branch the current user has no awareness of a org/project breakdown. I believe #63 will bring this functionality in, but it's not clear whether that will hide the details behind the user facade (like I'm hoping it will)

sourabhlodha · 2025-03-17T10:54:52Z

@jerome-white I can share you AWS credentials you can add in .env. Let's use same bucket and different directive for all organisation

#59 (comment)

* Since there is a single bucket per client: storage class no longer responsible for name generation * Since the bucket is created at platform startup: storage class no longer needs to worry about it, and client creation needs to be globally accessible.

See: https://stackoverflow.com/a/49665620

Storage class builds bucket path based on user information and the basename. This allows the basename to follow the clients convention.

* Fewer fixtures that are more general * Remove class fixtures; assume someone else manages database state * Consolidate manual database interaction into a single class * Fix test class name clashes

jerome-white · 2025-03-27T06:37:08Z

There are several questions...

Based on various discord discussions and meetings:

For now there will be a single AWS user that all clients will share
We will use a single bucket for all clients
Project and organization are now present/available in the main branch; how to replace "user" with them is to me still unknown

Each method takes into account whether the client is allowed to access the documents they are trying to touch. To do this, owner is dropped from the method signatures; instead being specified at construction.

The preference for client creation is to take parameters values from the environment; platform "settings" are a backup. This ensures that when mocking S3, things work as expected.

jerome-white · 2025-03-31T03:44:44Z

I can no longer wait for a second review. Merging this

* [ Feature: document module ] Schema setup and basic route (#29) * Basic document model * Document route skeleton Implements document list based on current user. * Imports for document route exposure * Add relationship between users and documents * [ Feature: document module ] document endpoints (#59) * Cleaner syntax * Interface for cloud storage functionality * Create module dedicated to cloud functionality * Take the user in the constructor * Rename the document list route Route names will conform to common Unix operations. * Implementation of file upload * Default placeholders for AWS variables * Take AWS credentials from .env * Perform bucket creation at startup * Use a single bucket for all clients #59 (comment) * Changes to support new bucket semantics * Since there is a single bucket per client: storage class no longer responsible for name generation * Since the bucket is created at platform startup: storage class no longer needs to worry about it, and client creation needs to be globally accessible. * Missing imports * Unused imports * Lift timestamp generation to common location * Timestamp generation is global resource * Updating a document is not yet supported * Flesh out document remove and stat * Must specify region when creating a bucket See: https://stackoverflow.com/a/49665620 * Add boto3 requirement * Repeating AWS environment variables in the settings * Client expected to pass the basename of the destination Storage class builds bucket path based on user information and the basename. This allows the basename to follow the clients convention. * Corrected upload file specification * Build basename that matches the UUID expectation of the model * SQLAlchemy cannot process Path types natively * Ensure document ID is passed to route body * Corrected database interaction when deleting * More graceful handling of non singular results * Move document database interactions to CRUD * Ignore Emacs backup files * Allow document list to be iterable * Corrected parameter naming * Initial document CRUD tests * Test for read_* methods * Infrastructure for making it work (utils) * Document update returns inserted document * Lift document creation function * Test document update * Whitespace * Linted * Corrected update_at test * Appropriate class variable naming * Move document creation into a class * Linted * Test document CRUD delete * All test methods cleanup after themselves * Gracefully handle negative skip's and limit's * Fixture usage is more explicit and straightforward * Better usage of fixtures * Import reordering * Lift document crud testing utils to global testing utilities * Take read error into account * Test document list route * Simplify test parameters * Lift common document endpoint testing resources to utils * Return number of rows deleted * Test document endpoint deletion * Consisten Session variable naming * Special list document type * Ability to add component to URL path * Corrections to injected types * Simplification of Route semantics * Route type now handles all URL operations required for the crawler * Crawler assumes it will be called with a single type (Route) * Linted * Delete uses update * Lift document comparison to test utilities * Tests assert * Must refresh the session before interacting with the database * Tests for document stat route * Linted * Better temporary bucket naming * Move from deprecated way of Pydantic to dict * Lift bucket creation to global module * Unused import * Return all information about an uploaded document * Updates to Python packages * Bump boto3 version * Add moto dependency * Test upload endpoint * Use object to upload documents * General document test cleanup * Fewer fixtures that are more general * Remove class fixtures; assume someone else manages database state * Consolidate manual database interaction into a single class * Fix test class name clashes * Remove unused imports * Remove extraneous code * Document crud takes respects user throughout Each method takes into account whether the client is allowed to access the documents they are trying to touch. To do this, owner is dropped from the method signatures; instead being specified at construction. * Routes respect new document CRUD interface * Integer to UUID now a standalone generator * Better variable naming * Tests take into account new document CRUD interface * Unused imports * More descriptive AWS error type * Catch generic exceptions to ensure service does not go down * Ensure boto3 respects environment The preference for client creation is to take parameters values from the environment; platform "settings" are a backup. This ensures that when mocking S3, things work as expected. * Log cloud storage errors during startup * Corrected variable naming * Provide default values for AWS credentials * Unused import * Proper handling of missing AWS keys * Updated UV lock

jerome-white added 6 commits March 12, 2025 16:06

Cleaner syntax

1d53062

Interface for cloud storage functionality

e8f0d9c

Create module dedicated to cloud functionality

3a8e62a

Take the user in the constructor

41e4652

Rename the document list route

6af4395

Route names will conform to common Unix operations.

Implementation of file upload

4cbb94f

jerome-white added the enhancement New feature or request label Mar 17, 2025

jerome-white self-assigned this Mar 17, 2025

jerome-white added this to Kaapi-dev Mar 17, 2025

jerome-white moved this to In progress in Kaapi-dev Mar 17, 2025

jerome-white linked an issue Mar 17, 2025 that may be closed by this pull request

[ Document endpoint ] Remaining endpoints from specification #32

Closed

jerome-white added 17 commits March 17, 2025 21:58

Default placeholders for AWS variables

56f2887

Take AWS credentials from .env

235ef85

Perform bucket creation at startup

303c981

Use a single bucket for all clients

11e3d54

#59 (comment)

Missing imports

0350ce2

Unused imports

d1e97e1

Lift timestamp generation to common location

c5319cb

Timestamp generation is global resource

9607567

Updating a document is not yet supported

4d7c7c7

Flesh out document remove and stat

6e118c6

Must specify region when creating a bucket

a4510ff

See: https://stackoverflow.com/a/49665620

Add boto3 requirement

31e5fae

Repeating AWS environment variables in the settings

18654ba

Client expected to pass the basename of the destination

67274e6

Storage class builds bucket path based on user information and the basename. This allows the basename to follow the clients convention.

Corrected upload file specification

fa14685

Build basename that matches the UUID expectation of the model

483eb3d

jerome-white requested a review from AkhileshNegi March 26, 2025 09:35

jerome-white marked this pull request as ready for review March 26, 2025 09:35

jerome-white requested review from Ishankoradia, avirajsingh7, fatchat, nishika26 and sourabhlodha March 26, 2025 09:35

Use object to upload documents

3293d07

jerome-white moved this from In progress to In review in Kaapi-dev Mar 26, 2025

General document test cleanup

397e316

* Fewer fixtures that are more general * Remove class fixtures; assume someone else manages database state * Consolidate manual database interaction into a single class * Fix test class name clashes

jerome-white added 8 commits March 27, 2025 12:16

Remove unused imports

318d803

Remove extraneous code

d39b03e

Document crud takes respects user throughout

847eed9

Each method takes into account whether the client is allowed to access the documents they are trying to touch. To do this, owner is dropped from the method signatures; instead being specified at construction.

Routes respect new document CRUD interface

cb5e994

Integer to UUID now a standalone generator

a2dbcdf

Better variable naming

3ad53f0

Tests take into account new document CRUD interface

8b7637f

Unused imports

96472e1

sourabhlodha approved these changes Mar 28, 2025

View reviewed changes

jerome-white added 5 commits March 28, 2025 11:34

More descriptive AWS error type

20ddd20

Catch generic exceptions to ensure service does not go down

33cac51

Ensure boto3 respects environment

0bf097a

The preference for client creation is to take parameters values from the environment; platform "settings" are a backup. This ensures that when mocking S3, things work as expected.

Log cloud storage errors during startup

8eb7ae2

Corrected variable naming

df9e0d1

jerome-white merged commit 612462c into feature/document-module/dev Mar 31, 2025
2 of 14 checks passed

jerome-white deleted the feature/document-module/-docs branch March 31, 2025 03:44

github-project-automation bot moved this from In review to Closed in Kaapi-dev Mar 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ Feature: document module ] document endpoints #59

[ Feature: document module ] document endpoints #59

Uh oh!

jerome-white commented Mar 17, 2025 •

edited

Loading

Uh oh!

jerome-white commented Mar 17, 2025 •

edited

Loading

Uh oh!

sourabhlodha commented Mar 17, 2025

Uh oh!

jerome-white commented Mar 27, 2025

Uh oh!

jerome-white commented Mar 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[ Feature: document module ] document endpoints #59

[ Feature: document module ] document endpoints #59

Uh oh!

Conversation

jerome-white commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Checklist

Notes

Update on moving from draft

Uh oh!

jerome-white commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AWS

S3 management

Project and organization information

Uh oh!

sourabhlodha commented Mar 17, 2025

Uh oh!

jerome-white commented Mar 27, 2025

Uh oh!

jerome-white commented Mar 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jerome-white commented Mar 17, 2025 •

edited

Loading

jerome-white commented Mar 17, 2025 •

edited

Loading