Define a federated Open Terms Archive collections APIs #1016

Closed
Ndpnt opened this issue Jul 17, 2023 · 27 comments
Labels
RFC Request for comments

Comments

@Ndpnt
Member

Ndpnt commented Jul 17, 2023

Context and Problem Statement

Open Terms Archive is a decentralised system that tracks collections of services and documents across multiple servers. Each collection operates its own API, which exposes the services and terms it tracks. Because these APIs are decentralised, identifying which services and documents are currently tracked requires querying every one of them.

We propose the creation of a federated API to enable easy querying of the distributed database and thus facilitate collaboration with external applications.

Proposed solution

Base URL

http://api.opentermsarchive.org/:version

Endpoints

Note: The failures object is detailed below in the Error Handling section

GET /collections

Enumerate all collections

Returns

A JSON array of all collections

Example

GET /collections
[
  {
    "id": "collection-1",
    "name": "Collections 1",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
      "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
    },
    "url": "162.162.162.162",
    "maintainers": [
      {
        "name": "Open Evidence",
        "url": "https://open-evidence.com/"
      },
      {
        "name": "European Commission",
        "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
      }
    ]
  },
  {
    "id": "collection-2",
    "name": "Collections 2",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Services needed to operate the Open Terms Archive engine",
      "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
    },
    "url": "162.162.162.162",
    "maintainers": [
      {
        "name": "Open Terms Archive",
        "url": "https://opentermsarchive.org"
      }
    ]
  }
]

GET /services?searchName=:searchName

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| searchName | URL-encoded string | The string to search for in service names |

Returns

A JSON array of all matching services across all collections with the URL where they can be found.
Returns all services if no searchName param is passed.
Returns an empty array if no matching service is found.

Example

GET /services?searchName=tube
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "peartube",
        "name": "PEARTUBE",
        "url": "http://173.173.173.173/api/v1/service/peartube",
        "termsTypes": [ "Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "yourtube",
        "name": "YourTube",
        "url": "http://162.162.162.162/api/v1/service/yourtube",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}
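As a rough sketch of the matching behaviour described above: the example suggests a case-insensitive substring match, since "tube" matches both "PEARTUBE" and "YourTube". That is inferred from the example rather than stated in the RFC, and the `search_services` helper below is a hypothetical name used only for illustration.

```python
def search_services(services, search_name=None):
    """Filter service dicts by name.

    Assumption: matching is a case-insensitive substring test, as the
    "tube" example suggests. The RFC does not state this explicitly.
    """
    # No search term: return every service, as the RFC specifies.
    if not search_name:
        return list(services)
    needle = search_name.casefold()
    # An empty list is returned naturally when nothing matches.
    return [s for s in services if needle in s["name"].casefold()]
```

Calling it with no search term returns every service, and a term that matches nothing yields an empty list, in line with the three "Returns" rules.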

GET /service/:serviceId

Returns every service with the given ID, across all collections.

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| serviceId | URL-encoded string | The ID of the service. |

Returns

A JSON array of services with the given ID across all collections with the URL where they can be found.
Returns an HTTP 404 if no matching service is found.

Example

GET /service/service1
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://173.173.173.173/api/v1/service/service1",
        "termsTypes": [ "Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://162.162.162.162/api/v1/service/service1",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}

Notes

Duplicates

We have considered multiple duplicate resolution solutions (specifying priority order as query params, defining an arbitrary priority based on data quality, returning an arbitrary result with a key alternatives to other results, using HTTP code 300 Multiple Choices, …) but we have come to the conclusion that they do not align with our fundamental philosophy of decentralization and resilience. The idea is therefore to embrace the fact that it is possible to have the same service declared in multiple collections and thus to always return an array of results.

Error Handling

To handle errors in the underlying APIs, the idea is to return a failures array containing objects describing the collection that failed and why. For example:

{
  "results": [
    
  ],
  "failures": [
    {
      "collection": "demo",
      "message": "The API service encountered an internal error while processing the request."
    },
    {
      "collection": "contrib",
      "message": "The API is currently unreachable."
    }
  ]
}
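To make the shape of this response concrete, here is a minimal sketch of how an aggregator could collect results and failures. It assumes one request per collection and an injected `fetch` callable that returns a list of services or raises on error. Both `aggregate` and `fetch` are hypothetical names, not part of any existing implementation.

```python
def aggregate(collections, fetch):
    """Query every collection once and gather results and failures.

    `collections` maps a collection ID to its API base URL.
    `fetch` is a hypothetical callable: it takes a base URL and returns
    a list of service dicts, or raises an exception on failure.
    """
    results, failures = [], []
    for collection_id, base_url in collections.items():
        try:
            services = fetch(base_url)
        except Exception as error:
            # A failing collection does not abort the whole request;
            # it is reported alongside whatever the others returned.
            failures.append({"collection": collection_id, "message": str(error)})
            continue
        results.extend(
            {"collection": collection_id, "service": service}
            for service in services
        )
    return {"results": results, "failures": failures}
```

Injecting `fetch` keeps the sketch independent of any HTTP library and makes the failure path easy to exercise in tests.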

Compatibility with different underlying API versions

By definition, a federated API may interact with multiple versions of underlying APIs. To effectively manage this, the proposed approach is to only gather the necessary fields and directly provide the resource URL in the underlying API. Moreover, to allow the client to determine the shape of the result, it is proposed to include the API version in the response headers of each underlying API.

Naming convention for collection ID

As the collection ID will then become a differentiating element that should be easy to handle with scripts and other tools, we suggest the following naming convention:

  • Non-ASCII characters are not supported; they should be normalized into ASCII.
    • Example: france-élections → france-elections.
  • Capitals and spaces are not supported. The ID should be in lowercase kebab-case (spaces are replaced with a dash -):
    • Example: France Elections → france-elections.
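The convention above could be applied with a small helper such as the following sketch. The function name `normalize_collection_id` is hypothetical. It relies on Unicode NFKD decomposition, so characters that have no ASCII decomposition (for example "ß") are simply dropped, which may or may not be the desired behaviour.

```python
import re
import unicodedata


def normalize_collection_id(name):
    """Turn a human-readable collection name into a lowercase kebab-case ASCII ID."""
    # Decompose accented characters into base letter + combining mark,
    # then drop everything that is not ASCII (including the marks).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase and replace each run of whitespace with a single dash.
    return re.sub(r"\s+", "-", ascii_only.strip().lower())
```

Both examples from the list, "france-élections" and "France Elections", normalize to the same ID with this helper.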
@Ndpnt Ndpnt added the RFC Request for comments label Jul 17, 2023
@madoleary

I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side. The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?

Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?

Thank you!

@Ndpnt
Member Author

Ndpnt commented Jul 19, 2023

Hi @madoleary,

I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side.

The idea is to leave the responsibility for handling duplicates to each client of the federated API: the API returns all the results and lets the client choose the collection from which it wants to obtain the document.
I think Open Terms Archive does not aim to be an intermediary that makes crucial choices for federated API clients, such as which collection should be more reliable than another.

The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?

As mentioned, the idea of "defining an arbitrary priority based on data quality" was not retained, so a priori the question of a data quality criterion will not be addressed on the OTA side.

Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?

For example, a result for a query like GET /service/facebook could look like this:

{
  "results": [
    {
      "collection": "pga",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://173.173.173.173/api/v1/service/facebook",
        "termsTypes": [ "Terms of Service", "Privacy Policy", "Developer Terms", "Trackers Policy", "Data Processor Agreement"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://162.162.162.162/api/v1/service/facebook",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}

And on your side, you could define that you prefer to use data from the pga collection because this collection is dedicated to tracking only gatekeepers with a high quality of maintenance whereas the contrib collection has no clearly defined maintainers. Another element of choice for you could be that the pga collection has more types of terms tracked for the Facebook service. It's up to you 🙂.
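As an illustration of such client-side logic, here is a sketch of one possible selection policy: prefer collections in an order the client defines, then fall back to the result tracking the most terms types. The `pick_preferred` name and the tie-breaking rule are assumptions for the example, not something the federated API prescribes.

```python
def pick_preferred(results, preferred_collections):
    """Pick one result among duplicates, according to client preferences.

    `results` is the "results" array of a federated API response.
    `preferred_collections` lists collection IDs, most trusted first.
    """
    if not results:
        return None
    rank = {collection: index for index, collection in enumerate(preferred_collections)}

    def sort_key(result):
        # Unlisted collections rank after all listed ones; among equals,
        # the result with more tracked terms types wins.
        collection_rank = rank.get(result["collection"], len(rank))
        return (collection_rank, -len(result["service"]["termsTypes"]))

    return min(results, key=sort_key)
```

With the Facebook response above, `pick_preferred(results, ["pga"])` selects the pga entry, and so does an empty preference list, because pga tracks more terms types.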

@madoleary

Very helpful, thank you, @Ndpnt !

@MattiSG
Member

MattiSG commented Jul 19, 2023

Thanks @Ndpnt for this clear RFC!

Proposition 1.B

This is a suggested improvement of proposition 1 (initially posted) on GET /collections.

GET /collections

The provided url examples are just a hostname (162.162.162.162). I believe they should be full-fledged URLs to the base endpoint of the API (http://162.162.162.162/api) so that API calls can be programmatically written. We should also specify in the spec that it has no trailing slash.

[
  {
    "id": "collection-1",
    "name": "Collections 1",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
      "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
    },
-    "url": "162.162.162.162",
+    "url": "http://162.162.162.162/api",
    "maintainers": [
      {
        "name": "Open Evidence",
        "url": "https://open-evidence.com/"
      },
      {
        "name": "European Commission",
        "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
      }
    ]
  },
  {
    "id": "collection-2",
    "name": "Collections 2",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Services needed to operate the Open Terms Archive engine",
      "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
    },
-    "url": "162.162.162.162",
+    "url": "https://api.ota.openmirrors.example/arbitrary/long/path",
    "maintainers": [
      {
        "name": "Open Terms Archive",
        "url": "https://opentermsarchive.org"
      }
    ]
  }
]
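To show why a full base URL without a trailing slash helps, here is a sketch of how a client could build a service URL from a collection entry. The `/:version/service/:serviceId` path layout is taken from the examples in this thread; the `service_url` helper itself is hypothetical.

```python
from urllib.parse import quote


def service_url(collection_base_url, api_version, service_id):
    """Build the URL of a service in a collection API.

    Assumes the base URL has no trailing slash, as proposed above, so
    plain concatenation is enough. A stray trailing slash is stripped
    defensively anyway.
    """
    base = collection_base_url.rstrip("/")
    # The service ID is percent-encoded so it is safe as a path segment.
    return f"{base}/{api_version}/service/{quote(service_id, safe='')}"
```

For example, `service_url("http://162.162.162.162/api", "v1", "peartube")` gives `http://162.162.162.162/api/v1/service/peartube`, and a base URL with an arbitrary long path works the same way.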

Proposition 2

This is an alternative to proposition 1 (initially posted) on GET /services?searchName=:searchName

GET /services/search?name=:searchName

My rationale is to prefer a /services/search route with a ?name query string, as this feels more future-proof with regard to other future routes: we don't reserve query parameters at /services level, and avoid repeating search as a query parameter name if we, for example, add support for searching by ID in the future, or support fuzzy search.

Parameters

| Parameter | Type   | Description            |
| --------- | ------ | ---------------------- |
- | searchName | URL-encoded string | The string to search for in service names |
+ | name | URL-encoded string | The string to search for in service names |

Returns

A JSON array of all matching services across all collections with the URL where they can be found.
Returns all services if no name param is passed.
Returns an empty array if no matching service is found.

Example

- GET /services?searchName=tube
+ GET /services/search?name=tube
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "peartube",
        "name": "PEARTUBE",
-        "url": "http://173.173.173.173/api/v1/service/peartube",
+        "url": "http://162.162.162.162/api/v1/service/peartube",
        "termsTypes": [ "Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "yourtube",
        "name": "YourTube",
-        "url": "http://162.162.162.162/api/v1/service/yourtube",
+        "url": "https://api.ota.openmirrors.example/arbitrary/long/path/v1/service/yourtube",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}

@madoleary

I think that the ?name query string is a good suggestion.

@madoleary

I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.

@madoleary

Also: is there a specific message returned when a service is not found?

@madoleary

Also: is there a specific message returned when a service is not found?

Sorry, I see the HTTP 404 note!

@Ndpnt
Member Author

Ndpnt commented Jul 24, 2023

Thanks @MattiSG for your propositions.

I fully agree with the Proposition 1.B.

For proposition 2:

  • I'm in favor of renaming the query string parameter to name.
  • For the route /services/search, I think having this route is less in line with the REST philosophy than /services?name=:searchName.
    REST encourages URLs that represent resources, named with nouns, whereas actions are expressed by HTTP methods. Yet with a route like /services/search?name=:searchName, search really looks like an action on the collection of services resources. We could think of it as a resource, but I don't think that is what first comes to mind. I think it is more RESTful to think: "There is a services collection where I apply some filters", hence a route like /services?name=:searchName.

@Ndpnt
Member Author

Ndpnt commented Jul 24, 2023

I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.

As I suggest making search a simple filter on the services collection, for me the response object will look exactly the same. And if you need to retrieve all the services for each collection, we could add a collection query string to allow filtering on the collection ID as well.

@madoleary

madoleary commented Jul 25, 2023

Proposition 3

This is a suggested improvement on proposition 1, GET /services?name=:searchName, initially posted as GET /services?searchName=:searchName.

GET /services?name=:searchName&termsType=:termsType

The idea is to add the ability to query by termsType, so that the results can be filtered by both service name and terms type. This is to avoid having to iterate through all service results and verify their termsTypes fields at each iteration, just to locate a specific terms type within a specific service.

Details

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| name | URL-encoded string | The string to search for in service names |
| termsType | URL-encoded string | The string to search for in service terms |

Returns

A JSON array of all matching services across all collections that also include the terms type, as indicated by the termsType query param, in their termsTypes fields.
Returns all matching services if no termsType param is passed.
Returns an empty array if no matching service with the terms type is found.

Example

GET /services?name=facebook&termsType=cookies%20policy

{
  "results": [
     {
        "collection": "contrib",
        "service": {
          "id": "facebook",
          "name": "Facebook",
          "url": "http://162.162.162.162/api/v1/service/facebook",
          "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}

@MattiSG MattiSG changed the title from "Define a federate Open Terms Archive collections APIs" to "Define a federated Open Terms Archive collections APIs" Jul 31, 2023
@Ndpnt
Member Author

Ndpnt commented Jul 31, 2023

Hi @madoleary,
Thanks for your proposition 3. I would make a minor change to allow passing multiple terms types, like this:

Proposition 3.B

GET /services?name=:searchName&termsTypes=:termsType1,termsType2

Details

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| name | URL-encoded string | The string to search for in service names |
| termsTypes | URL-encoded string | A comma-separated string representing the array of terms types to search for |

Returns

A JSON array of all matching services across all collections that also include the terms types, as indicated by the termsTypes query param, in their termsTypes fields.
Returns all matching services if no termsTypes param is passed.
Returns an empty array if no matching service with the terms types is found.

Example

GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service

{
  "results": [
     {
        "collection": "contrib",
        "service": {
          "id": "facebook",
          "name": "Facebook",
          "url": "http://162.162.162.162/api/v1/service/facebook",
          "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}
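Percent-encoding such a query by hand is error-prone, so a client would probably build it programmatically. Here is a minimal sketch using `urllib.parse.quote` from the Python standard library. The `build_services_query` helper is a hypothetical name, and keeping the comma separator unencoded is an assumption about how the federated API would parse the parameter.

```python
from urllib.parse import quote


def build_services_query(name, terms_types=()):
    """Build the path and query string for a /services search."""
    query = "name=" + quote(name, safe="")
    if terms_types:
        # Each terms type is encoded on its own (spaces become %20),
        # then the encoded values are joined with a literal comma.
        encoded = ",".join(quote(terms_type, safe="") for terms_type in terms_types)
        query += "&termsTypes=" + encoded
    return "/services?" + query
```

Encoding each value separately means a comma inside a terms type name would be sent as %2C and could not be confused with the separator.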

@madoleary

That looks great, @Ndpnt ! I'm in favor of proposition 3.B

@MattiSG
Member

MattiSG commented Aug 1, 2023

Love it!

I think it is more RESTful to think: "There is a services collection where I apply some filters"

💯

Thank you both for your contributions, I fully support 3.B!

@Ndpnt
Member Author

Ndpnt commented Sep 6, 2023

Hi everyone,

This RFC has received no further feedback for a month, so I think we can conclude that proposal 3.B seems acceptable to everyone and will therefore be implemented.

Thanks again for your contributions 🙏 .

Please note that we will probably not be able to work on its implementation for a few weeks, as we have a lot of things to handle this month.

@Ndpnt Ndpnt closed this as completed Sep 6, 2023
@MattiSG
Member

MattiSG commented Sep 6, 2023

Thanks @Ndpnt!

It's not entirely clear to me what will be implemented: 3.B is concerned with GET /services?name=:searchName&termsTypes=:termsType1,termsType2. What about GET /service/:serviceId (proposition 2? With your further amendments?) and GET /collections (1 or 1.B?)? 🤔 What is the final proposed API layout?

@Ndpnt
Member Author

Ndpnt commented Sep 6, 2023

Proposed final API layout:

GET /collections

Returns

A JSON array of all collections

Example

GET /collections
[
  {
    "id": "collection-1",
    "name": "Collections 1",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Online intermediation services for businesses subject to the European <a href=\"https://eur-lex.europa.eu/eli/reg/2019/1150/oj\">platforms-to-businesses (“P2B” / 2019/1150) regulation</a>",
      "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen <a href=\"https://eur-lex.europa.eu/legal-content/FR/TXT/HTML/?uri=CELEX:32019R1150&from=EN\">P2B / 2019/1150</a>"
    },
    "url": "http://162.162.162.162/api",
    "maintainers": [
      {
        "name": "Open Evidence",
        "url": "https://open-evidence.com/"
      },
      {
        "name": "European Commission",
        "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
      }
    ]
  },
  {
    "id": "collection-2",
    "name": "Collections 2",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Services needed to operate the Open Terms Archive engine",
      "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
    },
    "url": "https://api.ota.openmirrors.example/arbitrary/long/path",
    "maintainers": [
      {
        "name": "Open Terms Archive",
        "url": "https://opentermsarchive.org"
      }
    ]
  }
]

GET /services?name=:searchName&termsTypes=:termsType1,termsType2

Details

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| name | URL-encoded string | The string to search for in service names |
| termsTypes | URL-encoded string | A comma-separated string representing the array of terms types to search for |

Returns

A JSON array of all matching services across all collections that also include the terms types, as indicated by the termsTypes query param, in their termsTypes fields.
Returns all matching services if no termsTypes param is passed.
Returns an empty array if no matching service with the terms types is found.

Example

GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service

{
  "results": [
     {
        "collection": "contrib",
        "service": {
          "id": "facebook",
          "name": "Facebook",
          "url": "http://162.162.162.162/api/v1/service/facebook",
          "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}

GET /service/:serviceId

Parameters

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| serviceId | URL-encoded string | The ID of the service. |

Returns

A JSON array of services with the given ID across all collections with the URL where they can be found.
Returns an HTTP 404 if no matching service is found.

Example

GET /service/service1
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://173.173.173.173/api/v1/service/service1",
        "termsTypes": [ "Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://162.162.162.162/api/v1/service/service1",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}

@MattiSG
Member

MattiSG commented Sep 6, 2023

Much clearer, thank you very much! 😃

@MattiSG
Member

MattiSG commented Nov 28, 2023

In 3.B (#1016 (comment)), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃

@Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?

@MattiSG
Member

MattiSG commented Nov 28, 2023

We also did not specify what happens if /services is called with no parameter at all. I suggest it sends a 400 Bad Request error, as we don't want the federated API to proceed with aggregating every existing declaration.

@Ndpnt
Member Author

Ndpnt commented Nov 29, 2023

In 3.B (#1016 (comment)), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃

@Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?

My intention was to make it possible to search for a service containing at least the specified terms types, in order to help me find the most appropriate collection for the terms types I was interested in. So for me, it was an AND logical operator for terms types.
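The difference between the two readings can be sketched as two predicates on a service declaration. The function names are hypothetical, and the case-insensitive comparison is an assumption based on the earlier "cookies policy" example matching "Cookies Policy".

```python
def _normalized(terms_types):
    # Case-insensitive comparison is an assumption, not a specified rule.
    return {terms_type.casefold() for terms_type in terms_types}


def tracks_all(service, terms_types):
    """AND semantics: the service tracks every requested terms type."""
    return _normalized(terms_types) <= _normalized(service["termsTypes"])


def tracks_any(service, terms_types):
    """OR semantics: the service tracks at least one requested terms type."""
    return bool(_normalized(terms_types) & _normalized(service["termsTypes"]))
```

A service tracking only "Terms of Service" and "Cookies Policy" passes `tracks_any` but not `tracks_all` for a request that also includes "Privacy Policy".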

@Ndpnt
Member Author

Ndpnt commented Nov 29, 2023

We also did not specify what happens if /services is called with no parameter at all. I suggest it sends a 400 Bad Request error, as we don't want the federated API to proceed with aggregating every existing declaration.

I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services, and when we do, we'll be able to set up pagination. It's important to bear in mind that this means just one request to each collection API and not a request per service.

@Ndpnt
Member Author

Ndpnt commented Nov 29, 2023

After some discussion, it seems that we don't currently have a use case for searching with multiple term types on /services, so we'll revert to a single termsType parameter.

Ndpnt added a commit to OpenTermsArchive/federation-api that referenced this issue Nov 29, 2023
@MattiSG
Member

MattiSG commented Dec 2, 2023

If we have no results but all collections have failures, is that still a 404 or is that a 502 at some point? 🤔
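To make the question concrete, here is a sketch of one possible policy, purely as a basis for discussion: answer 502 only when there are no results and every collection failed, and 404 when at least one collection answered but none had the service. Nothing in this thread settles the choice, and `status_for` is a hypothetical helper.

```python
def status_for(results, failures, collection_count):
    """Choose an HTTP status code for GET /service/:serviceId.

    This is one hypothetical policy among several possible ones.
    """
    if results:
        # Partial failures are reported in the body, not in the status.
        return 200
    if collection_count and len(failures) == collection_count:
        # No upstream API could be reached: nothing is known about the service.
        return 502
    # At least one collection answered and none of them has the service.
    return 404
```

Under this policy a 404 means "not found where we could look", and clients would still need to inspect the failures array to know how complete that answer is.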

@MattiSG
Member

MattiSG commented Dec 2, 2023

it seems that we don't currently have a use case for searching with multiple term types

Complementary note: we also found that all hypothetical use cases (AND, OR) could be implemented with the basic function provided here and a tiny bit of client-side logic. There will always be time to add more power to the API later on, once we better understand the most common use cases 🙂
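That client-side logic could look like the following sketch, which assumes the client has already made one request per terms type and holds the "results" array of each response. The `combine` helper is hypothetical, and identifying a declaration by its collection and service ID is an assumption.

```python
def _key(result):
    # A declaration is identified by its collection and its service ID.
    return (result["collection"], result["service"]["id"])


def combine(result_lists, mode="and"):
    """Merge the results of several single-termsType queries.

    mode="and" keeps declarations present in every list (intersection);
    mode="or" keeps declarations present in at least one list (union).
    """
    if not result_lists:
        return []
    keyed = [{_key(result): result for result in results} for results in result_lists]
    keys = set(keyed[0])
    for mapping in keyed[1:]:
        keys = keys & set(mapping) if mode == "and" else keys | set(mapping)
    merged = {}
    for mapping in keyed:
        merged.update(mapping)
    # Sorting by key gives a stable, predictable order.
    return [merged[key] for key in sorted(keys)]
```

The cost is one federated request per terms type instead of one in total, which seems acceptable while use cases for multi-type searches remain hypothetical.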

@MattiSG
Member

MattiSG commented Dec 2, 2023

I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services

After discussion I agree, this was premature optimisation on my side. This “no parameter” route is very easy to cache. If it becomes very popular and the contents grow big, we can just decrease the poll rate and warn that this route only updates every hour / every day…
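As a rough illustration of the caching idea, here is a sketch of a time-based cache wrapped around the aggregation of all services. The `TtlCache` class and its `loader` parameter are hypothetical. A real deployment might instead rely on HTTP caching headers or a reverse proxy.

```python
import time


class TtlCache:
    """Cache the result of an expensive call for a fixed duration."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical: aggregates all collections
        self._ttl = ttl_seconds    # raising this lowers the poll rate
        self._clock = clock        # injectable so it can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        # Reload on first use, or once the cached value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

Raising `ttl_seconds` from an hour to a day is then the only change needed to poll the underlying collection APIs less often.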

@madoleary

Hi all, I appreciate the discussion about multiple terms types. In my specs, I only have us searching for one terms type at a time, e.g., cookies policy. I, too, don't think searching for multiple terms types is necessary. I also think all services should be returned on /services. I think that's more like the RESTful behavior I've seen.


3 participants