Question: what's the current status of the REST API? #496

Open
zblesk opened this issue Oct 1, 2020 · 35 comments

Comments

@zblesk

zblesk commented Oct 1, 2020

I'd like to add new pages by sending an HTTP request to an endpoint. I saw it mentioned in issues such as #339 and items linked in that thread.

There seemed to be commit names that mentioned adding a REST API, but I haven't been able to find whether those are already implemented and released.

Are they? If so, how do I call them?

I've tried just capturing a request to the "Add" method when I click it in the browser, but it looks like there is some CSRF protection, so I can't just copy-paste some bearer token and re-issue requests. I'm asking here before I spend time reverse-engineering something just because I missed an already existing API. :)

@pirate
Member

pirate commented Oct 2, 2020

The current status of the API is "unstable" I'd say. Reverse engineering the UI is the way to go for now, but we have plans to stabilize it more in future versions and split out a proper API with django-rest-framework or something so that external tools don't have to shoehorn their needs into requests used by the UI.
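
For anyone going the reverse-engineering route, here is a rough sketch of the usual Django CSRF dance: fetch the form page, pull the hidden token out of the HTML, and send it back with the POST along with the cookies from the first request. The field name `csrfmiddlewaretoken` is Django's default; the `url` and `depth` form fields are assumptions for illustration, so check the actual form in your browser's dev tools.

```python
import re
import urllib.parse
from typing import Optional

# Django's default hidden form field name for the CSRF token.
CSRF_FIELD = "csrfmiddlewaretoken"


def extract_csrf_token(html: str) -> Optional[str]:
    """Pull the hidden CSRF token value out of a rendered Django form."""
    pattern = r'name=["\']' + CSRF_FIELD + r'["\']\s+value=["\']([^"\']+)["\']'
    match = re.search(pattern, html)
    return match.group(1) if match else None


def build_add_form(token: str, url: str, depth: int = 0) -> bytes:
    """URL-encode a form body. 'url' and 'depth' are hypothetical field names."""
    fields = {CSRF_FIELD: token, "url": url, "depth": str(depth)}
    return urllib.parse.urlencode(fields).encode("utf-8")
```

The POST has to reuse the cookies from the GET (the `csrftoken` cookie in particular), which is why replaying a single captured request on its own tends to fail.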

@mAAdhaTTah
Contributor

@pirate I would be interested in working on this. I shot you an email a week or so ago cuz I think the underlying data model needs to be solidified and would love to help move this along. Let me know how I can help.

@cdvv7788
Contributor

@mAAdhaTTah that is great! Currently, on master, we have the sqlite database working. We can now start working with django-rest-framework to enable a proper API (Like @pirate mentioned).
What are the issues that you are finding with the data model? Something that needs to be improved? We can start the discussion here, so we can all have the proper context, and find a way to get started soon.

@mAAdhaTTah
Contributor

@cdvv7788 Generally, I think the split/transformation between Link <-> Snapshot is a bit weird. Snapshot seems to be db-only (it's transformed into Links as it's fetched out of the db for most of the operations I was looking at). I also think the double duty of timestamp being "the time it was bookmarked" as well as "the path in the archive" is a bit of an issue. From my email:

I believe you're currently looking to move from timestamp -> sha for the Snapshots and their relationship to the on-disk archive. If we want to eventually allow multiple snapshots per link (to avoid the hash hack), reifying the Link model into the database and making the Snapshot a single download of a Link seems like a good way to do it. Part of the benefit for me of moving away from timestamps is that I want to track when an article was read so I can group them by read day, and manipulating the timestamp for this seems a bit fragile if it can break the relationship to the archive. Having added, updated, etc. properties for that purpose seems a lot clearer.

(For context, I'd like to use ArchiveBox as a reading list, which I would then pull into my website, hence needing a REST API to pull that from. That's the reference to the "benefit for me" line.)

@cdvv7788
Contributor

@mAAdhaTTah We have discussed those topics before. I think that @pirate has some progress on the timestamp issue, and it will be changed once we come up with a good solution.
The Link <-> Snapshot stuff is a leftover of the recent migration. In the latest release (v0.4.x), Link was generated from the index.json, and Snapshot was updated on a best-effort basis. After the refactor, this has changed, and we definitely want to get rid of this relationship, leaving everything directly in Snapshot if possible. Multiple snapshots for the same URL are not supported at this moment, but after we remove the dependency on the Link schema, it should not be hard to add if we decide to go that way.
The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own. We need to find a way to circumvent that (@pirate do you know if this is possible?) or we need to get more creative about initializing django. Some research on this specific topic would be of great help (this is one of our short-term objectives).

@mAAdhaTTah
Contributor

Multiple snapshots for the same URL are not supported at this moment, but after we remove the dependency on the Link schema, it should not be hard to add if we decide to go that way.

So my thinking/proposal is to actually remove the Link schema, migrate what is currently considered a Snapshot to be a Link instead (mostly as a naming convention change), then add a Snapshot model that represents a single download of a website. Based on your explanation, I think we'd need to include a migration in v0.5 that migrates the index.json into the db, then, once we're solely dependent on the db, perform the above migrations, splitting the existing Snapshot into two models, Snapshot and Link, with one Link owning many Snapshots (plus whatever UI updates are needed to account for this).

Does that make sense? Happy to elaborate and/or provide some code to explain.
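
To make the proposed shape concrete, here is a minimal sketch using plain dataclasses rather than Django models. All field and method names are hypothetical and only illustrate the one-Link-to-many-Snapshots idea, with `added`/`updated` kept separate from the archive path.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Snapshot:
    """One download of a Link at a point in time (hypothetical fields)."""
    captured_at: datetime
    archive_path: str  # on-disk location, no longer derived from the bookmark time


@dataclass
class Link:
    """A bookmarked URL that can own many Snapshots."""
    url: str
    added: datetime
    updated: Optional[datetime] = None
    snapshots: List[Snapshot] = field(default_factory=list)

    def add_snapshot(self, archive_path: str, when: datetime) -> Snapshot:
        """Record a new download without touching the 'added' time."""
        snap = Snapshot(captured_at=when, archive_path=archive_path)
        self.snapshots.append(snap)
        self.updated = when
        return snap

    def latest_snapshot(self) -> Optional[Snapshot]:
        """Return the most recent download, or None if nothing was archived yet."""
        return max(self.snapshots, key=lambda s: s.captured_at, default=None)
```

With this split, changing `added` (for example to group by read day) cannot break the pointer to the files on disk, because that pointer lives on the Snapshot.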

The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own.

Not sure I understand this. Could you provide some background here?

@cdvv7788
Contributor

So my thinking/proposal is to actually remove the Link schema, migrate what is currently considered a Snapshot to be a Link instead (mostly as a naming convention change), then add a Snapshot model that represents a single download of a website. Based on your explanation, I think we'd need to include a migration in v0.5 that migrates the index.json into the db, then, once we're solely dependent on the db, perform the above migrations, splitting the existing Snapshot into two models, Snapshot and Link, with one Link owning many Snapshots (plus whatever UI updates are needed to account for this).

At this moment we only have the means to represent a single download per website. I understand what you propose, and that does make sense. At this point we already migrated the index.json into the sqlite database. In fact, if you check #502, we are already removing the automatic generation of those indexes completely. This, however, cannot be done without first solving the other issue, which takes me to:

The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own.

Snapshot is a django model. We cannot use that model in a place where django has not been initialized yet. If you try to do that, it will complain because the module will try to use some django internals. This is the only reason we have not gotten rid of Link as we know it. I am going to spend some time figuring out alternatives to make Snapshot usable in the whole application. You are welcome to help us pursue this. As I mentioned earlier, this is a blocker, and the other work cannot proceed until it is resolved. (The REST API could actually be implemented now, but once we fix this we would need to refactor it in a big way, so I think it is better to solve this layer first.)

@mAAdhaTTah
Contributor

We cannot use that model in a place where django has not been initialized yet.

All of this makes sense so far. I can do some investigating and see what I can come up with. Just to clarify, when you say "use that model", is that "interacting with it" or is importing it enough to make it fail?

@cdvv7788
Contributor

Importing it is enough to make it fail. There is a function in the codebase named setup_django which initializes what is required.

@pirate
Member

pirate commented Oct 21, 2020

I don't believe we need Link or Snapshot anywhere that Django is not initialized, so that is a non-issue. If you're worried about oneshot I have an idea to fix that (we can discuss more in Zulip).

@cdvv7788 cdvv7788 mentioned this issue Oct 22, 2020
6 tasks
@mAAdhaTTah
Contributor

@pirate Does that change if the idea is to turn Link & Snapshot into db models?

@cdvv7788 cdvv7788 mentioned this issue Oct 26, 2020
6 tasks
mAAdhaTTah added a commit to mAAdhaTTah/ArchiveBox that referenced this issue Nov 8, 2020
This pulls in DRF to configure our API. Pretty straightforward binding of a view
to a serializer & a model and making the data available. For this first pass,
we're using the model even though it's currently unstable. From a feature
standpoint, we get a lot for free from DRF with very little code, including
pagination. The `list_links` method loads all of the snapshots, which would
require pagination to be implemented manually on the entire list of snapshots,
which won't work well on large databases. Because archivebox is a CLI first
and a web application second, the way Exceptions are thrown and errors
logged doesn't always make those methods conducive to integrating w/ an API.

On the testing side, this shows up in how we're configuring things. The
`setup_django` function doesn't fully work when passing `out_path`;
some variables in the Django settings aren't updated or configured correctly.
Instead, we use `subprocess` the same way the other tests do to start up the
server and hit it with `requests`.

# Summary

This is obviously a work in progress but wanted to get some feedback on the
direction. It would be helpful if the API functions exposed by archivebox were
more decoupled from the CLI context specifically, but I think we're going to
want to bind the Models directly (at least for querying).

# Related issues

ArchiveBox#496

# Changes these areas

- [ ] Bugfixes
- [X] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
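
The pagination point in the commit message above is easier to see with a small example. DRF provides this through its own pagination classes; the sketch below only shows the underlying limit/offset idea against plain sqlite, so that one page of rows is read instead of the whole table. The `snapshot` table and its columns are made up for the demo and are not ArchiveBox's real schema.

```python
import sqlite3
from typing import List, Tuple


def fetch_page(conn: sqlite3.Connection, page: int, page_size: int = 50) -> List[Tuple[int, str]]:
    """Return one page of rows (1-indexed) instead of loading the whole table."""
    offset = (page - 1) * page_size
    cursor = conn.execute(
        "SELECT id, url FROM snapshot ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cursor.fetchall()


def make_demo_db(count: int) -> sqlite3.Connection:
    """Build an in-memory table with `count` fake rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO snapshot (url) VALUES (?)",
        [(f"https://example.com/{i}",) for i in range(count)],
    )
    return conn
```

Calling `fetch_page(make_demo_db(120), 3)` returns only the last 20 rows; asking for a page past the end returns an empty list.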
@zblesk
Author

zblesk commented Sep 14, 2021

Hello!
I see there's been some progress here.
What's the current status? Is the API available yet?

One of the linked tasks seems to mention it's available in 'dev' - is that an available docker tag?
Is it safe to use?
To be more specific: I understand the API is still in alpha, and I can accept that. However, I don't understand what else can be unstable in dev - I don't want to risk my instance and my data.

Thank you!

@mAAdhaTTah
Contributor

I have not made any additional progress since opening my PR here: #529. I don't think we will be continuing down that path, as we were also considering using Django Ninja instead of DRF. Eventually, I'd like to pick this back up again but haven't had the time.

@pirate
Member

pirate commented Apr 12, 2022

Copying over my earlier message here from the API discussion related to the ArchiveBox browser extension #577:

I think a minimal API can be worked on before the Huey refactor, as the user-facing API is going to be relatively stable even with the change to the internals. These endpoints are already partially available through the Django Admin:

  • /add GET, POST (CSRF exempt, usable as an API from external origins and used by the browser extension)
  • /api/core/snapshot/ GET, POST, PUT
  • /api/core/snapshot/<id> GET, PATCH, DELETE
  • /api/core/archiveresult/ GET, POST
  • /api/core/archiveresult/<id> GET, PATCH, DELETE
  • /api/core/tag/ GET, POST, PUT
  • /api/core/tag/<id> GET, PATCH, DELETE

and this bonus escape hatch endpoint is planned to be added to do everything else not possible with the above ^:

  • /api/cli/<command> POST (simulate running any archivebox CLI command with a given dict of args and kwargs to populate the CLI flags and args)
    e.g. /api/cli/add POST {urls: 'https://example.com', depth: 1, extractors: ['wget', 'media', 'screenshot'], ...}
    or /api/cli/schedule POST {urls: 'https://example.com', depth: 1, every: 'day', ...}

I'm leaning towards using FastAPI for the API instead of DRF. I like the pydantic type-based API definitions better than DRF's serializers but I could be convinced either way.
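
As a rough illustration of the planned `/api/cli/<command>` escape hatch, this is one way a posted dict could be turned into an argv list. It is only a sketch: the key-to-flag mapping is mechanical and hypothetical, so the resulting flag names may not match the real archivebox CLI.

```python
from typing import Any, Dict, List


def build_cli_args(command: str, payload: Dict[str, Any]) -> List[str]:
    """Translate a JSON-style dict into an argv list (hypothetical mapping)."""
    args = ["archivebox", command]
    data = dict(payload)  # copy so the caller's dict is not mutated

    # Treat 'urls' as positional arguments; accept a single string or a list.
    urls = data.pop("urls", [])
    if isinstance(urls, str):
        urls = [urls]

    for key, value in data.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:  # booleans become bare flags, False is omitted
                args.append(flag)
        elif isinstance(value, (list, tuple)):
            args.append(f"{flag}={','.join(str(v) for v in value)}")
        else:
            args.append(f"{flag}={value}")

    args.extend(urls)
    return args
```

Building a list rather than a shell string means it can be handed to `subprocess.run` without shell quoting issues. A real endpoint would presumably also restrict which commands are allowed.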

@zblesk
Author

zblesk commented Apr 23, 2022

Thanks for the update. Looking forward to this.

Though I'm not sure I read those correctly. For instance, what is the difference between a GET and a POST to /add?
Will it support adding many links at once, as well?

And which endpoint should be used for 'return the archive URL for this input URL, if it exists'?

@djkemmet

@pirate hey there are you still working on this / need help? I'm thinking this is possibly something I could put together with FastAPI and the CLI hopefully next weekend. let me know! cheers

Copying over my earlier message here from the API discussion related to the ArchiveBox browser extension #577:

I think a minimal API can be worked on before the Huey refactor, as the user-facing API is going to be relatively stable even with the change to the internals. These endpoints are already partially available through the Django Admin:

  • /add GET, POST (CSRF exempt, usable as an API from external origins and used by the browser extension)
  • /api/core/snapshot/ GET, POST, PUT
  • /api/core/snapshot/<id> GET, PATCH, DELETE
  • /api/core/archiveresult/ GET, POST
  • /api/core/archiveresult/<id> GET, PATCH, DELETE
  • /api/core/tag/ GET, POST, PUT
  • /api/core/tag/<id> GET, PATCH, DELETE

and this bonus escape hatch endpoint is planned to be added to do everything else not possible with the above ^:

  • /api/cli/<command> POST (simulate running any archivebox CLI command with a given dict of args and kwargs to populate the CLI flags and args)
    e.g. /api/cli/add POST {urls: 'https://example.com', depth: 1, extractors: ['wget', 'media', 'screenshot'], ...}
    or /api/cli/schedule POST {urls: 'https://example.com', depth: 1, every: 'day', ...}

I'm leaning towards using FastAPI for the API instead of DRF. I like the pydantic type-based API definitions better than DRF's serializers but I could be convinced either way.

@pirate
Member

pirate commented Sep 15, 2022

Definitely open to contribution on the API front! I'm more focused on internals refactoring at the moment but as mentioned in that quoted comment I believe my changes can be kept insulated from anything external facing.

If you want to share gists or a fork with your work, I can leave feedback on your mock-up as you go to save time on PR review later.

@joedavison

I would use an API like this.

@djkemmet

hi, if anyone is following this issue and could give me some guidance please see this issue: #1030

@FunctionDJ

i think @zblesk brought up an important point. a route like /add/ feels like it violates REST principles by implying an action. ideally, if the API is meant to be REST, the routes should be resources and the action should be determined by the HTTP method (GET, PUT etc).
so i feel like it would make more sense to make a GET to /archive to get archived items and to make a POST to /archive to store a new link etc.

@joedavison

Sure let's start with POST to /archive in addition to the current command line input method.

@pirate
Member

pirate commented Nov 18, 2022

Let's keep the REST API URLs in line with the model names and use /api/snapshot GET/POST and /api/archiveresult GET/POST.
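
A tiny sketch of what "resource path plus HTTP method" means in practice, with the path named after the model. This is framework-free and uses an in-memory list as a stand-in for the Snapshot table; the handlers and response shapes are hypothetical, not the actual ArchiveBox API.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# In-memory stand-in for the Snapshot table; not ArchiveBox's real storage.
SNAPSHOTS: List[Dict[str, Any]] = []


def list_snapshots(body: Optional[dict]) -> Tuple[int, Any]:
    """GET on the collection returns every stored snapshot."""
    return 200, list(SNAPSHOTS)


def create_snapshot(body: Optional[dict]) -> Tuple[int, Any]:
    """POST on the collection creates a new snapshot from the request body."""
    if not body or "url" not in body:
        return 400, {"error": "url is required"}
    snapshot = {"id": len(SNAPSHOTS) + 1, "url": body["url"]}
    SNAPSHOTS.append(snapshot)
    return 201, snapshot


# The same resource path maps to different handlers depending on the method.
ROUTES: Dict[Tuple[str, str], Callable[[Optional[dict]], Tuple[int, Any]]] = {
    ("GET", "/api/snapshot"): list_snapshots,
    ("POST", "/api/snapshot"): create_snapshot,
}


def dispatch(method: str, path: str, body: Optional[dict] = None) -> Tuple[int, Any]:
    """Look up a handler by (method, path); 405 for known paths, 404 otherwise."""
    normalized = path.rstrip("/")
    handler = ROUTES.get((method.upper(), normalized))
    if handler is None:
        known_paths = {p for _, p in ROUTES}
        status = 405 if normalized in known_paths else 404
        return status, {"error": "not supported"}
    return handler(body)
```

Here `dispatch("POST", "/api/snapshot", {"url": ...})` creates an entry and `dispatch("GET", "/api/snapshot")` lists them, so no verb like "add" has to appear in the URL.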

@FunctionDJ

@pirate good point. my comment was less about the specific endpoint names and more about the REST conformity of using proper HTTP methods and resource endpoints. depending on the application design it might not make sense to map the models to endpoints 1-to-1 because some data is simply always a composition of different data models. i'm not familiar with the archivebox software project so i can't tell.

@pirate
Member

pirate commented Nov 23, 2022

I think keeping endpoints the same as model names is better than the alternative because more layers of indirection/leaky abstraction make it harder to grep through the source code and understand.

@PeterPilley

Hi everyone, can I ask what the status of the REST API is? Definitely +1 for FastAPI instead of DRF.

Is this something you need help with, or is there a list of active tasks for the current implementation?

@pirate
Member

pirate commented Feb 19, 2023

It's still on the list but slow going, I haven't had a lot of big blocks of coding time to work on ArchiveBox over the last year, so I've mostly been devoting my time to support and docs.

On the plus side I have interest from a big multinational org to use ArchiveBox, and may be able to turn that into a consulting contract to fund some work towards the API. They are a slow-moving org so it may take 6~12 months, but it's exciting news nonetheless.

@cogscides

Hope this will be implemented. In my case, I want to scrape and store websites on my local network, then process them with AI and put the results into my personal knowledge management system. The AI and PKM stuff is on my side; I just need an API 🙏

@aitorllj93

hello! what's the current state of this? It's kinda confusing since it says it's in Alpha, but reading the comments I can't tell whether it's possible to use it on Docker. I'm interested in building an alternative front end for this application and the REST API would help me a lot

@pirate
Member

pirate commented Nov 18, 2023

Alpha = There are a few POST/GET etc. endpoints exposed by the admin UI and the /add page that allow quick things to be hacked together, but it's not a proper REST API by any means. I'm working on a django-huey-monitor refactor to add an event-driven queue system in the backend, and the new REST API I'm planning will insert messages into this queue to manage extractor jobs and snapshots.

Can I ask why you're going in the direction of an alternative frontend vs contributing changes to AB directly? I'd definitely be open to PRs improving our existing frontend!

See the discussion here too: #1126
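
To give a feel for the "API inserts messages into a queue" design described above, here is a minimal sketch using only Python's standard `queue` and `threading` modules. It is not Huey and not the planned implementation; the job fields and function names are made up for illustration.

```python
import queue
import threading
from typing import Dict, List, Optional

# Shared job queue; None is used as a shutdown sentinel for the worker.
jobs: "queue.Queue[Optional[Dict[str, str]]]" = queue.Queue()
completed: List[str] = []


def enqueue_snapshot(url: str, extractor: str) -> None:
    """What an API handler might do: record the request and return immediately."""
    jobs.put({"url": url, "extractor": extractor})


def worker() -> None:
    """Background consumer; a real worker would run the extractor here."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            completed.append(f"{job['extractor']}:{job['url']}")
        finally:
            jobs.task_done()


def run_demo() -> List[str]:
    """Start one worker, queue two jobs, then shut down and report results."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    enqueue_snapshot("https://example.com", "wget")
    enqueue_snapshot("https://example.com", "screenshot")
    jobs.put(None)  # ask the worker to stop after draining earlier jobs
    jobs.join()
    thread.join(timeout=5)
    return list(completed)
```

The point of the design is that the HTTP request only has to enqueue a message, so slow extractor runs never block the API response.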

@aitorllj93

@pirate my main issue with contributing to the existing frontend is that the current version is far from what I think would be useful for me, so my changes would probably be too disruptive to include in a PR without prior discussion. If you still think this project could benefit from a total rework of the frontend (which I do), I can think about making some proposals so we can reach an agreement

@pirate
Member

pirate commented Nov 20, 2023

I'm down to add a new frontend to the existing app as long as we keep the Django admin one available as well in parallel. I was considering using htmx to do this myself (it plays well with Django templates) but haven't gotten around to it.

One of the core principles is that we should rely on JS as little as possible because I want ArchiveBox views to be extremely durable long term and viewable across many different types of devices.

I'm ok with some of the UI requiring JS but ideally the most critical parts should fall back to working with old school plain html.

If that design direction sounds compatible with your ideas then I'm down to work together to add your UI changes to AB directly, otherwise maybe an independent app/mod may be better.

@aitorllj93

@pirate sure, that sounds nice. I don't want to include a JavaScript framework either. Regarding htmx, we can give it a try if we need it; I already did some work with it on a side project and it's great. About the CSS: I saw the current implementation uses Bootstrap. I wonder if we can move to Tailwind, which I think fits an open source project better these days; that way we don't need to implement custom classes and it's easier for external contributors

@pirate
Member

pirate commented Nov 21, 2023

Nice! I also prefer tailwind to bootstrap, happy to move to that.

If you want to open a new issue for your UI ideas as they come up I think we should move frontend discussion away from the REST API thread so we don't spam everyone.

@zblesk
Author

zblesk commented Nov 21, 2023

If you do create a new thread for that, can you please @ me? Thanks.

@pirate
Member

pirate commented Apr 26, 2024

Hey everyone, check out the new REST API on dev! Big thanks to @Brandl!

https://github.com/ArchiveBox/ArchiveBox/releases/v0.8.0-rc

