Solid-search compatibility #275

Closed
joepio opened this issue Oct 19, 2020 · 23 comments

@joepio
Contributor

joepio commented Oct 19, 2020

We (Ontola) have been awarded a grant to implement a full-text search module for Solid pods, called Solid-Search. We're planning on linking this newly created module to this community-server. The goal is to have a re-usable search module that can be used by various RDF servers (most notably solid pods).

Here's some thoughts on how we'll be doing this:

  • Whenever a resource is updated (created / edited), the search index should know it needs to be updated. We need some kind of callback / middleware for this.
  • We need to communicate these changes to the search module. We've used linked-deltas for this in the past, which allows for persisting these updates as well as using them for events that need handling. Ideally, these deltas are persisted in the pod. related.
  • The Solid-Search module needs to run somewhere. Ideally, this is in some runtime inside of a pod, but we could also require the pod admin to spin up a (multi-tenant) solid-search instance.
  • The Solid-Search API needs to be exposed - ideally on the domain of the pod, as an endpoint. This could mean that the pod has proxy functionality. Otherwise, we need to have a search service somewhere else, which feels weird.

We'll start development of Solid-Search in December this year, and plan to link it to community-server in August 2021. Any thoughts are very welcome!

@joepio joepio self-assigned this Oct 19, 2020
@RubenVerborgh RubenVerborgh added the ❓question Further information is requested label Oct 19, 2020
@RubenVerborgh RubenVerborgh self-assigned this Oct 19, 2020
@RubenVerborgh
Member

RubenVerborgh commented Oct 22, 2020

We (Ontola) have been awarded a grant to implement a full-text search module for Solid pods, called Solid-Search.

Excellent news, congrats!

If you're interested in exchanging ideas, my team has quite some experience with this. Happy to discuss.

We're planning on linking this newly created module to this community-server. The goal is to have a re-usable search module that can be used by various RDF servers (most notably solid pods).

Excellent!

  • Whenever a resource is updated (created / edited), the search index should know it needs to be updated. We need some kind of callback / middleware for this.

Given that you want this to be reusable across multiple servers, I'll write how I envision such types of agents to function for the Solid ecosystem in general.

Conceptually, an agent providing a service in a network (such as, in your case, index and search) should subscribe via Linked Data Notifications to the source it wants to index. Whenever a change occurs in that source, the source will generate a Linked Data Notification and send it to the service, upon which the service performs its task. (Possibly, subscriptions need to be periodically renewed.)

This is the general mechanism. The case where the service and the pod run on the same server (or even in the same process) is an optimization. That is: the LDN mechanism would still work, it's just that there are more efficient channels available for local communication. Architecturally, you could use the decorator pattern to implement a ResourceStore (I recommend subclassing PassthroughStore) that creates notifications in case of addResource, deleteResource, setRepresentation, modifyRepresentation.
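To make the decorator idea concrete, here is a minimal sketch. It does not use the server's real classes: `SimpleStore`, `MemoryStore`, and `NotifyingStore` are hypothetical stand-ins with string arguments, whereas the actual store methods take richer types. The only point is that the wrapper forwards each write and then notifies.

```typescript
// Hypothetical, simplified stand-in for the server's store interface.
interface SimpleStore {
  setRepresentation(id: string, body: string): Promise<void>;
  deleteResource(id: string): Promise<void>;
}

type ChangeKind = 'updated' | 'deleted';
type ChangeListener = (id: string, kind: ChangeKind) => void;

// In-memory store, present only to make the sketch runnable.
class MemoryStore implements SimpleStore {
  public readonly data = new Map<string, string>();

  public async setRepresentation(id: string, body: string): Promise<void> {
    this.data.set(id, body);
  }

  public async deleteResource(id: string): Promise<void> {
    this.data.delete(id);
  }
}

// Decorator: forwards every call to the wrapped store, then notifies.
class NotifyingStore implements SimpleStore {
  public constructor(
    private readonly source: SimpleStore,
    private readonly notify: ChangeListener,
  ) {}

  public async setRepresentation(id: string, body: string): Promise<void> {
    await this.source.setRepresentation(id, body);
    this.notify(id, 'updated');
  }

  public async deleteResource(id: string): Promise<void> {
    await this.source.deleteResource(id);
    this.notify(id, 'deleted');
  }
}

// Small demo: collect the notifications produced by two writes.
async function demoNotifyingStore(): Promise<string[]> {
  const seen: string[] = [];
  const store = new NotifyingStore(new MemoryStore(), (id, kind): void => {
    seen.push(`${kind}:${id}`);
  });
  await store.setRepresentation('http://localhost:3000/a', '<a> <b> <c>.');
  await store.deleteResource('http://localhost:3000/a');
  return seen;
}
```

Whether the listener then sends a Linked Data Notification over HTTP or calls an in-process indexer is a separate choice; the wrapped store does not need to know.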

I'd strongly recommend implementing it the notification way. I can offer help from @Dexagod, who is working on notifications. Also CC'ing @csarven.

  • We need to communicate these changes to the search module. We've used linked-deltas for this in the past, which allows for persisting these updates as well as using them for events that need handling. Ideally, these deltas are persisted in the pod. related.

Mmm, I think several things are unnecessarily conflated here.

  1. As a first approximation, I don't think you need deltas at all. It's an LDP interface, so when an update happens, you know the affected document. Just delete its old version and reindex. That's what Lucene would do as well.

  2. If for some reason this is too slow, the indexer could take a delta. No guarantee that this would be faster though than just deleting and reindexing, so I strongly recommend against premature optimization here. But if the indexer takes the delta, then as a first approximation, I think it's the indexer that should store the old version.

  3. However, if we use the decorator pattern as recommended above, we can hook into modifyRepresentation such that the generated notification includes the changes. Concretely, if a PATCH request arrives for a resource, then modifyRepresentation will see the requested changes and can communicate them. No need to store any deltas, as we just see them on the fly. (We could also make this work for PUT, but if the client overwrites an entire representation, having the indexer do the same seems ok.)

In general, I see no need for delta processing (as explained above), and even if that were the case, no need for the pod to be responsible for delta storage. But if there is, we got you covered.
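As an illustration of the "delete the old version and reindex" approach from point 1, here is a toy inverted index. `TinyIndex` and its tokenizer are assumptions made for this sketch; they are not how Lucene, Sonic, or any real search engine is implemented.

```typescript
// Toy inverted index: term -> set of document URLs.
class TinyIndex {
  private readonly postings = new Map<string, Set<string>>();
  private readonly termsByDoc = new Map<string, string[]>();

  // Remove every posting that points at the document.
  public remove(url: string): void {
    for (const term of this.termsByDoc.get(url) ?? []) {
      const docs = this.postings.get(term);
      if (docs !== undefined) {
        docs.delete(url);
        if (docs.size === 0) {
          this.postings.delete(term);
        }
      }
    }
    this.termsByDoc.delete(url);
  }

  // Delete the old version, then index the new text from scratch.
  public reindex(url: string, text: string): void {
    this.remove(url);
    const words = text.toLowerCase().split(/\W+/).filter((word) => word.length > 0);
    const terms = Array.from(new Set(words));
    this.termsByDoc.set(url, terms);
    for (const term of terms) {
      let docs = this.postings.get(term);
      if (docs === undefined) {
        docs = new Set<string>();
        this.postings.set(term, docs);
      }
      docs.add(url);
    }
  }

  // Return the URLs of all documents containing the term.
  public search(term: string): string[] {
    return Array.from(this.postings.get(term.toLowerCase()) ?? []).sort();
  }
}
```

The indexer only needs the URL of the changed document and its current text, so no delta has to be stored anywhere for this to work.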

  • The Solid-Search module needs to run somewhere. Ideally, this is in some runtime inside of a pod, but we could also require the pod admin to spin up a (multi-tenant) solid-search instance.

See my above suggestion of designing it as an independent agent, for instance, as a Docker container.
If written in a JavaScript-compatible host language, we could also provide in-process integration, but that is really just an optimization.

  • The Solid-Search API needs to be exposed - ideally on the domain of the pod, as an endpoint. This could mean that the pod has proxy functionality. Otherwise, we need to have a search service somewhere else, which feels weird.

There is no problem with an external search or indexing service. Clients just follow links, and they don't care whether they go to the same domain or a different one. We might want an explicit trust statement "this service can be an indexer for my data", but that's about it.

That said, the server could perfectly proxy.

If you are implementing this in a JavaScript-ish language, I'd suggest implementing the HttpHandler interface, such that it can easily be injected as a module into the Community Server.

We'll start development of Solid-Search in December this year, and plan to link it to community-server in August 2021. Any thoughts are very welcome!

Additional thought: integrate early. I haven't seen the details of your planning, but I wonder why the integration would come this late. In a fail-fast mode, this is what I would test first. All the rest, I can pretty much imagine how it will work, given that good indexers etc. are available.

Really excited about this project, and always reachable for a chat!

@RubenVerborgh RubenVerborgh removed their assignment Oct 22, 2020
@joepio
Contributor Author

joepio commented Oct 31, 2020

Thanks for the comments and thoughts, @RubenVerborgh! Very helpful.

You suggest replacing the deltas with linked data notifications. These notifications do not contain descriptions of how the resource has changed, which means that the service listening to notifications (in this case the search index) needs to fetch each individual document again and rebuild its index. This would lead to horrible performance, and this will happen with every system that has a dependency on changing resources. I think this is why we really need to have some bus / event log / delta store where the changes are persisted and made accessible to modules, such as a search module.

If you are implementing this in a JavaScript-ish language, I'd suggest implementing the HttpHandler interface, such that it can easily be injected as a module into the Community Server.

For the search itself, it's very unlikely to be JS - performance is crucial, and most performant search engines are powered by systems-level languages or maybe Java (Lucene). I personally think the Rust-based Sonic project is really interesting for solid-search, as it is incredibly fast, lightweight and returns just URLs (which can be resolved in a pod anyway). Lightweight is important, as I want to let people run their own solid pods on lightweight (ARM-based) devices.

But anyway - it will probably also be made available as an independent docker-image.

Additional thought: integrate early.

Gotta agree with that, I'll change the planning!

Really excited about this project, and always reachable for a chat!

Thanks, we'll be in touch soon enough!

@RubenVerborgh
Member

RubenVerborgh commented Oct 31, 2020

You suggest replacing the deltas with linked data notifications. These notifications do not contain descriptions of how the resource has changed

LDNs could perfectly be designed to include such a description, and there are indeed good reasons to do so.

which means that the service listening to notifications (in this case the search index) needs to fetch each individual document again and rebuild its index.

So no 🙂

This would lead to horrible performance

In principle, you could be right, but let's not make such assumptions before we adequately calculate and/or measure.
The good news is that I have a team of people whose job it is to exactly do that, so happy to help.

I think this is why we really need to have some bus / event log / delta store

It is crucial that we agree that the interface and storage are different, orthogonal concerns. Whether or not a notification (= interface) contains a delta is completely independent of whether or not the back-end (= storage) contains deltas.

It's really important to keep these separate in order to have a clear design discussion.
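One way to picture this separation is a notification whose interface optionally carries a delta, while neither side keeps a delta store. The `ChangeNotification` shape below is hypothetical and is not taken from the Linked Data Notifications specification; statements are modeled as plain strings.

```typescript
// Hypothetical notification shape, for illustration only.
interface ChangeNotification {
  resource: string;
  // Optional description of the change, e.g. derived from a PATCH request.
  delta?: { added: string[]; removed: string[] };
}

// Consumer-side state: the statements currently indexed per resource.
type IndexState = Map<string, Set<string>>;

// Apply the delta when one is present and the resource is already known;
// otherwise fall back to fetching the whole document again.
function applyNotification(
  state: IndexState,
  note: ChangeNotification,
  fetchAll: (resource: string) => string[],
): 'patched' | 'refetched' {
  const current = state.get(note.resource);
  if (note.delta !== undefined && current !== undefined) {
    for (const statement of note.delta.removed) {
      current.delete(statement);
    }
    for (const statement of note.delta.added) {
      current.add(statement);
    }
    return 'patched';
  }
  state.set(note.resource, new Set(fetchAll(note.resource)));
  return 'refetched';
}
```

The sender can build the delta on the fly from the incoming request and the receiver applies it immediately, so the choice of message format says nothing about what either back-end persists.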

For the search itself, it's very unlikely to be JS

That's okay for the back-end.

Rust

WebAssembly could be interesting.

@joepio
Contributor Author

joepio commented Nov 12, 2021

Hi @RubenVerborgh!

I have an update: the search server is working. I've got some documentation right here, although the code itself still has to be merged. Please read the docs before reading further.

So I'm looking for how to integrate this into CSS. IIRC, there's a bunch of middlewares in CSS which could execute something, like POSTing a turtle representation of the resource to some endpoint. If we can get that working, we're mostly there.

Next step is some strategy for running the search app. It's a rust binary or a dockerized image, whichever you prefer. I can imagine it makes sense to simply cargo install the binary in the dockerfile for CSS, and add it to its run script. Make sure to set a custom port, of course, because you probably want to handle HTTP routing in CSS.

Now, we need to add the route to the search instance. We can link to the built-in front-end, but that would expose all data to the public - probably not the best idea!

So I think the other approach is to make the endpoint available behind some authorization check.

What are your thoughts?

@RubenVerborgh
Member

Hi @joepio!

IIRC, there's a bunch of middlewares in CSS which could execute something, like POSTing a turtle representation of the resource to some endpoint. If we can get that working, we're mostly there.

Yes indeed.

You could either:

  • have an intercepting ResourceStore that updates the index on change
  • tap into the existing MonitoringStore (configured as urn:solid-server:default:ResourceStore), which will notify your component of changes (example in config/http/server-factory/websockets.json / UnsecureWebSocketsProtocol)

I suggest the second; you then basically create a component that listens to a source, which will tell you when something changes.
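A rough sketch of the second option follows. `FakeMonitoringStore` is a hypothetical stand-in that only mimics an `on('changed', ...)` subscription so the example runs on its own; `IndexUpdater` and its `dirty` list are likewise invented for illustration.

```typescript
interface ResourceIdentifier {
  path: string;
}

type ChangedListener = (changed: ResourceIdentifier) => void;

// Minimal stand-in for a store that emits 'changed' events.
class FakeMonitoringStore {
  private readonly listeners = new Map<string, ChangedListener[]>();

  public on(event: 'changed', listener: ChangedListener): this {
    const existing = this.listeners.get(event) ?? [];
    existing.push(listener);
    this.listeners.set(event, existing);
    return this;
  }

  public emitChanged(changed: ResourceIdentifier): void {
    for (const listener of this.listeners.get('changed') ?? []) {
      listener(changed);
    }
  }
}

// Component that listens to a source and records what needs reindexing.
class IndexUpdater {
  public readonly dirty: string[] = [];

  public constructor(source: FakeMonitoringStore) {
    source.on('changed', (changed): void => {
      this.dirty.push(changed.path);
    });
  }
}
```

The component never intercepts requests itself; it is handed the source and reacts to whatever that source reports.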

It's a rust binary or a dockerized image, whichever you prefer.

Docker seems good. We already have Docker containers (in the integration tests) for a SPARQL endpoint and Redis.
Seems like a handy way of packaging and testing it.

Now, we need to add the route to the search instance. We can link to the built-in front-end, but that would expose all data to the public - probably not the best idea!

Is there any authentication going on with the search server?

An approach that I have used previously relied on there being a low number of unique users per pod. As such, I created indexes for every user: https://github.com/solid/solid-tpf

Alternatively, or as a start, we can only expose that interface to the owner of the pod.

So I think the other approach is to make the endpoint available behind some authorization check.

In any case, you should be able to reuse existing authentication components such as AuthorizingHttpHandler to shield access. Looks like you probably want to create a custom instance (see config/ldp/handler/default.json for inspiration), where:

  • credentialsExtractor is the default urn:solid-server:default:CredentialsExtractor
  • modesExtractor is a constant extractor that returns Read (i.e., always require read access)
  • permissionReader is either the default (so you can use the ACL system to set access to the index), or just the OwnerPermissionReader
  • authorizer is the default
  • operationHandler is your custom handler that provides access to your search endpoint
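The flow those five parts add up to can be sketched as below. This is a conceptual model only: the field names echo the list above, but the types, the `webIdHeader` field, and `ownerOnlySearch` are assumptions invented for the sketch, the authorizer step is folded into a simple loop, and none of it is the server's actual API or a real authentication scheme.

```typescript
type Mode = 'read' | 'write';

interface Credentials {
  webId: string | null;
}

// Toy request: a real request would carry tokens, not a bare WebID.
interface SearchRequest {
  webIdHeader: string | null;
  query: string;
}

interface GuardParts {
  credentialsExtractor: (request: SearchRequest) => Credentials;
  modesExtractor: (request: SearchRequest) => Mode[];
  permissionReader: (credentials: Credentials) => Mode[];
  operationHandler: (request: SearchRequest) => string[];
}

// Who is asking, what is required, what is granted; then run the handler.
function handleGuarded(parts: GuardParts, request: SearchRequest): string[] {
  const credentials = parts.credentialsExtractor(request);
  const granted = parts.permissionReader(credentials);
  for (const mode of parts.modesExtractor(request)) {
    if (granted.indexOf(mode) === -1) {
      throw new Error(`Forbidden: ${mode} access required`);
    }
  }
  return parts.operationHandler(request);
}

// Owner-only variant: always require read, grant it only to the pod owner.
function ownerOnlySearch(ownerWebId: string, search: (query: string) => string[]): GuardParts {
  return {
    credentialsExtractor: (request) => ({ webId: request.webIdHeader }),
    modesExtractor: () => ['read'],
    permissionReader: (credentials) => (credentials.webId === ownerWebId ? ['read'] : []),
    operationHandler: (request) => search(request.query),
  };
}
```

Swapping the owner-only reader for an ACL-based one changes who is granted read access without touching the search handler.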

@joepio
Contributor Author

joepio commented Mar 23, 2022

Hi there! I'd love to give this another go, but I feel like I need a bit of help getting started when I want to run everything locally. Is there a contributor who might have a moment to help me out in a video call for an hour sometime this week? @joachimvh maybe? I'd be grateful!

@joachimvh
Member

@joepio sure, send me a mail or a message through Slack/Gitter to arrange. There is not that much "this week" left though 😄

@joepio
Contributor Author

joepio commented Mar 25, 2022

Thanks to @joachimvh, we've been able to set up a repo that almost works. It contains very simple logic: post Turtle documents to some endpoint (atomic-server) whenever a resource is updated.

There's still at least one major thing that's holding us back: the constructor of the exported class (SearchListener) is not called. It's not being instantiated by Components.js.

Some thoughts from Joachim:

The fact that the SearchListener constructor is not called means that it is not being instantiated by Components.js.
I think I know why, but it's annoying.
You have to tell Components.js which class is the entry point, and it then recursively creates all the classes that are linked to it.
But SearchListener is in fact a separate class that is not linked to the main entry point; it's its own thing.
I'd have to think about how the Components.js setup needs to be adjusted to get that working.
The solution is that some component in CSS should reference this class, so that Components.js finds it when it walks the config paths.
The easiest way to do that is probably to turn this class into an Initializer and add it to the list of Initializers that are called before the server starts (even though it doesn't really need to initialize anything).
So feel free to start looking in that direction at what's available if you want to keep playing with it :D

Seems like I should make an Initializer from SearchListener, so I tried extending from Initializer and making an empty required handle function. I don't think this is what he meant, though.

@joachimvh
Member

Seems like I should make an Initializer from SearchListener, so I tried extending from Initializer and making an empty required handle function. I don't think this is what he meant, though.

That is actually exactly what I meant. It's just that it also requires a few lines of config to work. The handle function will be called when the server is being started so you can put whatever you want in there, such as a log message that the server is being started with the search functionality. But it can be empty.

I'll first translate and explain the snippet above. When you run Components.js you have to tell it the URI of the component it needs to generate. It will then instantiate that component and its constructor parameters recursively. This means that your component will only be instantiated if there is a path from our entry component to your component. That URI is hardcoded here:

const DEFAULT_APP = 'urn:solid-server:default:App';

Currently there is no path from our AppRunner to your SearchListener, which means there are 2 solutions: create a custom components.js call that also references the URI of your component, or make it so there is such a path, even though it is not really needed. The second one is much easier when you want people to combine this with CSS so we'll go for that.
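The reachability rule can be illustrated with a toy model. This is not Components.js itself: `ConfigGraph` and `instantiateFrom` are invented for the sketch, and the edges between the ids are simplified assumptions rather than the server's real configuration.

```typescript
// Toy model: each component id lists the ids its constructor refers to.
type ConfigGraph = Record<string, string[]>;

// Create the entry component and, recursively, everything it references.
// Returns the ids in the order they were created.
function instantiateFrom(entry: string, graph: ConfigGraph, created: string[] = []): string[] {
  if (created.indexOf(entry) !== -1) {
    return created;
  }
  created.push(entry);
  for (const dependency of graph[entry] ?? []) {
    instantiateFrom(dependency, graph, created);
  }
  return created;
}

const APP = 'urn:solid-server:default:App';
const INITIALIZERS = 'urn:solid-server:default:ParallelInitializer';
const LISTENER = 'urn:solid-server:search:SearchListener';
const STORE = 'urn:solid-server:default:ResourceStore';

// The listener is defined but nothing reachable from the app refers to it.
const unlinked: ConfigGraph = {
  [APP]: [INITIALIZERS, STORE],
  [INITIALIZERS]: [],
  [LISTENER]: [STORE],
};

// Adding the listener to the initializer list creates the missing path.
const linked: ConfigGraph = {
  ...unlinked,
  [INITIALIZERS]: [LISTENER],
};

const createdWithoutLink = instantiateFrom(APP, unlinked);
const createdWithLink = instantiateFrom(APP, linked);
```

In the first graph the listener's constructor would never run, which matches the symptom described above; in the second it is created as a side effect of building the initializer list.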

Somewhere in our config we have a list of Initializers that get called before the server is started:

"comment": "These handlers are called whenever the server is started, and can be used to ensure that all necessary resources for booting are available.",
"@id": "urn:solid-server:default:ParallelInitializer",
"@type": "ParallelHandler",
"handlers": [

Once your class extends Initializer it still needs to be added to that list so the application has a reference to your class. You do that by changing your config/search.json to the following:

    {
      "@type": "SearchListener",
      "@id": "urn:solid-server:search:SearchListener",
      "source": { "@id": "urn:solid-server:default:ResourceStore" },
      "store": { "@id": "urn:solid-server:default:ResourceStore" },
      "searchEndpoint": "http://example.org/my-search-endpoint"
    },
    {
      "@id": "urn:solid-server:default:ParallelInitializer",
      "@type": "ParallelHandler",
      "handlers": [
        { "@id": "urn:solid-server:search:SearchListener" }
      ]
    }

Note that an @id needs to be added to the search listener so it can be referenced.
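For orientation, the class that such a config describes could look roughly like this. The `Initializer`, `ResourceIdentifier`, and `ChangeSource` declarations are hypothetical stand-ins so the sketch compiles on its own (a real module would take them from the server package), and `changedPaths` exists only to make the behaviour observable.

```typescript
// Hypothetical stand-ins for types a real module would import.
abstract class Initializer {
  public abstract handle(): Promise<void>;
}

interface ResourceIdentifier {
  path: string;
}

interface ChangeSource {
  on(event: 'changed', listener: (changed: ResourceIdentifier) => void): unknown;
}

class SearchListener extends Initializer {
  public readonly changedPaths: string[] = [];

  // Parameter names mirror the config keys: source, store, searchEndpoint.
  public constructor(
    source: ChangeSource,
    public readonly store: object,
    public readonly searchEndpoint: string,
  ) {
    super();
    // The subscription happens at construction time, not in handle().
    source.on('changed', (changed): void => {
      this.changedPaths.push(changed.path);
    });
  }

  // Called once at startup. Nothing to initialize: being in the initializer
  // list is only there so that the class gets constructed at all.
  public async handle(): Promise<void> {
    // Intentionally empty.
  }
}
```

All the real work is wired up in the constructor, which is why an empty `handle` is enough.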

@joepio
Contributor Author

joepio commented Mar 28, 2022

@joachimvh Thanks for the help again! I've modified the config file and made the class extend Initializer, and I get a new error now. Seems like my newly defined module cannot be found:

2022-03-28T11:15:27.267Z [Components.js] info: Loaded configs
2022-03-28T11:15:28.034Z [Components.js] error: Detected fatal error. Generated 'componentsjs-error-state.json' with more information.
Could not create the server
Cause: Cannot find module 'solid-search-community-server'
Require stack:
- /Users/joep/dev/github/joepio/solid-search-community-server/node_modules/componentsjs/lib/loading/ComponentsManagerBuilder.js

I'm also not entirely sure if my config files are correct now, they seem to have some duplicate info 1, 2.

@RubenVerborgh
Member

https://github.com/RubenVerborgh/solid-hue might provide help; do you use the -m . option when starting the server?

@joepio
Contributor Author

joepio commented Mar 28, 2022

Thanks, adding -m . seems to help!

@joepio
Contributor Author

joepio commented Mar 28, 2022

Feels like I'm getting real close!

I'm now getting errors when trying to get the Resource:

await this.store.getRepresentation(changed, { type: { 'text/turtle': 1 } })
SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at JsonResourceStorage.get (/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/storage/keyvalue/JsonResourceStorage.js:29:25)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async GreedyReadWriteLocker.incrementCount (/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/util/locking/GreedyReadWriteLocker.js:119:28)
    at async /Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/util/locking/GreedyReadWriteLocker.js:77:27
    at async GreedyReadWriteLocker.withInternalReadLock (/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/util/locking/GreedyReadWriteLocker.js:105:20)
    at async GreedyReadWriteLocker.preReadSetup (/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/util/locking/GreedyReadWriteLocker.js:76:9)
    at async GreedyReadWriteLocker.withReadLock (/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/util/locking/GreedyReadWriteLocker.js:33:9)

Also tried without the option (await this.store.getRepresentation(changed, {})), but getting the same error.

@joachimvh
Member

I'm also not entirely sure if my config files are correct now, they seem to have some duplicate info

You should indeed put that information in only one of the two. I would suggest putting everything in the search.json config, except perhaps the path as that is probably the field users will want to edit most often. The block with the path will still need the id and the type, so those will be duplicated if you split over 2 files.

The error message you're getting is quite weird. For some reason the JSON files that are generated to keep track of resource locks are invalid. Does the server process have write permissions on the disk? If you still have a .internal folder in your data directory, delete it in case something got corrupted.

@joepio
Contributor Author

joepio commented Mar 28, 2022

Removing my local files solved the issue! Thanks again @joachimvh 👏

@joepio
Contributor Author

joepio commented Mar 28, 2022

Two problems:

Can't get turtle representations

I'm trying to get the turtle representation of the resource with await this.store.getRepresentation(changed, { type: { 'text/turtle': 1 } }).

This function returns a Representation, which has a Readable in its data field.

I need the turtle string, but for some reason I'm getting an error when I try to turn the Readable into a string:

/Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/storage/conversion/RdfToQuadConverter.js:27
        const data = (0, StreamUtil_1.pipeSafely)(rawQuads, pass, (error) => new BadRequestHttpError_1.BadRequestHttpError(error.message));
                                                                             ^

BadRequestHttpError: Missing context link header for media type application/json on http://localhost:3000/.internal/setup/Y3VycmVudC1iYXNlLXVybA==
    at /Users/joep/dev/github/joepio/solid-search-community-server/node_modules/@solid/community-server/dist/storage/conversion/RdfToQuadConverter.js:27:78

Not sure if the Missing context link header is problematic, and not sure where it's coming from (well, from RdfToQuadConverter).

If I remove the { type: { 'text/turtle': 1 } } option, it seems to work, but now I'm getting various (non-turtle) representations that my server can't parse, such as:

http://localhost:3000/

and

3.0.0

I could get around this for now by simply ignoring all resources that aren't valid turtle, but that's not a real solution I think.

changed event not emitted on PUT

But another problem is more fundamental: the source.on('changed', ...) handler doesn't seem to be called when I PUT a resource:

curl -X PUT -H "Content-Type: text/turtle"  -d '<http://example.com/test> <ex:p> "testme".'  http://localhost:3000/myfile.ttl

Which is a bit weird, as the request seems to be properly handled:

2022-03-28T13:49:00.202Z [BaseHttpServerFactory] info: Received PUT request for /myfile.ttl

The MonitoringStore's functions (modifyResource, setRepresentation, emitChanged) are not called when this PUT is processed.

@joachimvh
Member

I need the turtle string, but for some reason I'm getting an error when I try to turn the Readable into a string:

This is going to happen if the target resource is not an RDF resource.

Not sure if the Missing context link header is problematic, and not sure where it's coming from (well, from RdfToQuadConverter).

It's because the target resource is JSON and the converter is trying to interpret it as JSON-LD.

I could get around this for now by simply ignoring all resources that aren't valid turtle, but that's not a real solution I think.

That actually is the solution I think. Or you could first check the content-type in the metadata to check if the data is RDF.

All this data you're seeing is internal data. 3.0.0 for example is the server storing the version number when the server gets started. The reason is that for internal data we use the same backend as for data. Although we probably should make it so the MonitoringStore does not fire events about these resources as they are not relevant for whomever is listening.
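A small filter along these lines might look as follows. The list of RDF media types is an assumption and not exhaustive, and treating any path containing `/.internal/` as internal is inferred only from the URL in the error message above; neither comes from the server's code.

```typescript
// Media types treated as RDF in this sketch (assumed, non-exhaustive).
const RDF_TYPES: string[] = [
  'text/turtle',
  'application/ld+json',
  'application/n-triples',
  'application/n-quads',
  'application/trig',
  'text/n3',
  'application/rdf+xml',
];

// True when the content type names one of the RDF media types above.
function isRdfContentType(contentType: string | undefined): boolean {
  if (contentType === undefined) {
    return false;
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mediaType = (contentType.split(';')[0] ?? '').trim().toLowerCase();
  return RDF_TYPES.indexOf(mediaType) !== -1;
}

// Assumption: internal bookkeeping resources live under a /.internal/ segment.
function isInternalPath(path: string): boolean {
  return path.indexOf('/.internal/') !== -1;
}

// Decide whether a changed resource is worth sending to the search index.
function shouldIndex(path: string, contentType: string | undefined): boolean {
  return !isInternalPath(path) && isRdfContentType(contentType);
}
```

Checking the content type first avoids asking the server to convert data that was never RDF to begin with.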

The MonitoringStore's functions (modifyResource, setRepresentation, emitChanged) are not called when this PUT is processed.

Can you double check that setRepresentation is not called when doing a PUT? Because this is the PUT function, and if this function is not being called it would also mean that no data is stored in the backend when doing a PUT.

@joepio
Contributor Author

joepio commented Mar 28, 2022

@joachimvh

Thanks, I'll use the metadata.contentType to filter for turtle. Still not ideal, as I'd like to parse every valid RDF file.

Can you double check that setRepresentation is not called when doing a PUT? Because this is the PUT function, and if this function is not being called it would also mean that no data is stored in the backend when doing a PUT.

Yeah, I can verify this. emitChanged and setRepresentation are properly called during initialization (e.g. for the 3.0.0), but not when the PUT is processed.

When I do

curl -X PUT -H "Content-Type: text/turtle"  -d '<http://localhost:3000/myfile.ttl> <ex:p> "testmsse".'  http://localhost:3000/myfile.ttl

I don't see any response, but I also can't read the file after posting.

curl -H "Accept: text/turtle"  "http://localhost:3000/myfile.ttl"

Gives the same response (empty) as

curl -H "Accept: text/turtle"  "http://localhost:3000/nonexisting"

I'm trying to work with a clean setup (removing my ./local-files often), but maybe that messed up some acl stuff? The /setup isn't performed, anyway.

EDIT: It was /setup, after completing that it worked! I didn't realise this step was mandatory. Maybe add an error if it's in /setup mode instead of empty responses?

@joepio
Contributor Author

joepio commented Mar 28, 2022

It's working!

Thanks so much @RubenVerborgh and @joachimvh :)

Closing for now. Any questions / issues can be posted here:

https://github.com/ontola/solid-search-community-server/issues

@joepio joepio closed this as completed Mar 28, 2022
@RubenVerborgh
Member

RubenVerborgh commented Mar 28, 2022

I'll admit I've only skimmed this, but does await this.store.getRepresentation(changed, { type: { 'text/turtle': 1 } }) not give what we expect? I.e., fail on non-RDF resources? So you can keep it in; whenever it fails, it was not RDF. When it succeeds, it was RDF—in any format.

But I think the JSON case should be a 406 for us, no, @joachimvh?
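The "keep the preference and treat failure as not RDF" idea could be sketched like this. `TurtleSource` is a hypothetical, reduced interface whose data is any async iterable of text chunks; the real store returns a richer representation object. Reading the stream sits inside the `try` because the error reported earlier in this thread surfaced while the data was being consumed.

```typescript
// Hypothetical minimal store: resolves to text chunks, or fails when the
// resource cannot be converted to the requested type.
interface TurtleSource {
  getRepresentation(
    path: string,
    preferences: { type: Record<string, number> },
  ): Promise<AsyncIterable<string>>;
}

// Returns the Turtle text, or undefined when the resource is not RDF.
async function readTurtleOrSkip(store: TurtleSource, path: string): Promise<string | undefined> {
  try {
    const data = await store.getRepresentation(path, { type: { 'text/turtle': 1 } });
    let turtle = '';
    for await (const chunk of data) {
      turtle += chunk;
    }
    return turtle;
  } catch {
    // Conversion failed, so treat the resource as non-RDF and skip it.
    return undefined;
  }
}
```

One caveat of this shortcut is that it also swallows unrelated failures, so a real implementation would probably want to inspect the error before deciding to skip.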

And also the

EDIT: It was /setup, after completing that it worked! I didn't realise this step was mandatory. Maybe add an error if it's in /setup mode instead of empty responses?

@joepio
Contributor Author

joepio commented Mar 28, 2022

I'll admit I've only skimmed this, but does await this.store.getRepresentation(changed, { type: { 'text/turtle': 1 } }) not give what we expect? I.e., fail on non-RDF resources? So you can keep it in; whenever it fails, it was not RDF. When it succeeds, it was RDF—in any format.

You're right, it works, but it throws errors for non-turtle / internal JSON resources:

Missing context link header for media type application/json on http://localhost:3000/.internal/setup/Y3VycmVudC1iYXNlLXVybA==

I'll revert to using { type: { 'text/turtle': 1 } }; I assume this will make more serialisations compatible (I assume the server does some conversions for RDF serialization formats).

@joachimvh
Member

EDIT: It was /setup, after completing that it worked! I didn't realise this step was mandatory. Maybe add an error if it's in /setup mode instead of empty responses?

The returned status code is a 302 though.

I assume this will make more serialisations compatible

Indeed, that's the reason for the text/turtle preference.

@joepio
Contributor Author

joepio commented Mar 29, 2022

The returned status code is a 302 though.

Ah, I guess I don't use curl enough to interpret its empty output well enough. Note to self: use -i. Scratch my suggestion.
