# General Introduction

## OData
The new Reaxys API is based on the OData protocol, which is a well-documented industry standard.  
Conceptually, OData can be thought of as a serialized version of SQL, so some of the concepts will be familiar.  
The following resources are helpful for learning more about OData:  

### Documentation and Specifications  
https://www.odata.org/documentation/  
https://docs.oasis-open.org/odata/odata/v4.0/odata-v4.0-part1-protocol.html  

### Resources  
https://www.odata.org/getting-started/learning-odata-on-postman/  
https://pragmatiqa.com/xodata/odatadir.html  
https://www.odata.org/libraries/  

## Python support
OData support in Python for the latest OData version (v4), which the Reaxys API uses, is still evolving.  
Therefore, not all of the more advanced features are currently available in Python.

However, the library **python-odata** used for the examples is a good starting point for using the new Reaxys API from Python.  
The documentation, installation instructions, and source code for the python-odata library can be found here:  

### Python Resources
https://python-odata.readthedocs.io/en/latest/index.html  
https://pypi.org/project/python-odata/  
https://github.com/eblis/python-odata  

# Initiate API Connection

The library **python-odata** can dynamically generate the Python representation of the Reaxys data entity types from the OData metadata specification. To generate the entities dynamically, the OData service has to be initiated with the option `reflect_entities=True`.   
The metadata specification can be retrieved from the endpoint https://demodal-data-api.rx-nonprod.cm-elsevier.com/data/$metadata
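
To give a feel for what such a metadata document contains, the following sketch parses a small, hand-written CSDL fragment with the Python standard library. The sample XML, the namespace `Reaxys.Sample`, and the property types shown are assumptions invented for illustration; the real Reaxys metadata document is much larger and may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened metadata document (NOT the real Reaxys schema)
SAMPLE_METADATA = """<edmx:Edmx Version="4.0" xmlns:edmx="http://docs.oasis-open.org/odata/ns/edmx">
  <edmx:DataServices>
    <Schema Namespace="Reaxys.Sample" xmlns="http://docs.oasis-open.org/odata/ns/edm">
      <EntityType Name="Substance">
        <Key><PropertyRef Name="ReaxysRegistryNumber"/></Key>
        <Property Name="ReaxysRegistryNumber" Type="Edm.Int32"/>
        <Property Name="ChemicalNames" Type="Collection(Edm.String)"/>
      </EntityType>
      <EntityContainer Name="Container">
        <EntitySet Name="Substances" EntityType="Reaxys.Sample.Substance"/>
      </EntityContainer>
    </Schema>
  </edmx:DataServices>
</edmx:Edmx>"""

# XML namespace of the OData v4 Entity Data Model (CSDL)
EDM = "{http://docs.oasis-open.org/odata/ns/edm}"


def list_entity_properties(metadata_xml):
    """Return {entity type name: {property name: property type}}."""
    root = ET.fromstring(metadata_xml)
    result = {}
    for entity_type in root.iter(f"{EDM}EntityType"):
        result[entity_type.get("Name")] = {
            prop.get("Name"): prop.get("Type")
            for prop in entity_type.findall(f"{EDM}Property")
        }
    return result


for name, properties in list_entity_properties(SAMPLE_METADATA).items():
    print(name, properties)
```

With `reflect_entities=True` the python-odata library performs this kind of inspection itself, so the sketch is only meant as an aid for reading the metadata by hand.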


In [None]:
# Import OData connector from python-odata
from odata import ODataService
import json

To initiate the connection, the API base URL has to be specified, which is `https://demodal-data-api.rx-nonprod.cm-elsevier.com/data/`.  
The service then retrieves the metadata to generate the entity classes.

**Note:** *Currently no authentication* is required. However, with the next release an access token will have to be provided. Details will be provided ahead of this change.

In [None]:
# Connect to Reaxys API and build Reaxys entity classes
Service = ODataService('https://demodal-data-api.rx-nonprod.cm-elsevier.com/data/', reflect_entities=True)

# Browse main resources

The **main OData resources** **represent** what is defined as **Reaxys contexts** in the existing XML-based API.  
These resources are *Substances*, *Citations*, *Reactions*, *Datapoints* (aka Bioactivities), and *Targets*.  
In OData these resources are modelled as collections (aka lists) of entity types.  

The main resources can be retrieved directly by calling the OData service method `query` with the resource class as parameter.

# Define a query object

Conceptually, the OData library lets you build up a query from OData features; the query is then used to submit the request and retrieve the results.

These queries can be constructed stepwise by successively adding OData features to the query.  
The first step in building a query is to define which resource should be retrieved using the OData service method `query`:  
`service.query(RESOURCE)`


In [None]:
# Construct the base query to retrieve all Substances
Substances = Service.entities['Substances']
query_substance = Service.query(Substances)

# The string representation of a query object shows the OData API call
print(query_substance)

In [None]:
# Construct the base query to retrieve all Citations
Citations = Service.entities['Citations']
query_citations = Service.query(Citations)

# The string representation of a query object shows the OData API call
print(query_citations)

## Filtering main resources by resource properties

Each main resource is represented by a collection of entity types, e.g. Substances are a collection of Substance entity types.   
Each **entity type has** a defined set of **properties**. These properties are the database fields defined by the so-called core facts in the existing XML-based API, i.e.  
A ```Substance``` has the properties defined in the ```IDE``` fact, meaning the field names that start with ```IDE.``` such as ```IDE.XRN```, ```IDE.NA```, ```IDE.CN```  
A ```Citation``` has the properties defined in the ```CNR``` and ```CIT``` facts, meaning the field names that start with ```CNR.``` or ```CIT.``` such as ```CIT.PNX```  
A ```Reaction``` has the properties defined in the ```RX``` fact, meaning the field names that start with ```RX.```  
A ```Datapoint``` has the properties defined in the ```DAT``` fact, meaning the field names that start with ```DAT.``` such as ```DAT.VALUE```

These properties can have different data types such as  
simple **integer**: ```Substance.ReaxysRegistryNumber``` (IDE.XRN)  
simple **decimal**: ```Substance.NumberOfAtoms``` (IDE.NA)  
simple **string**: ```Citations.CommonPatentNumber``` (CIT.PNX)  
**list of strings**: ```Substance.ChemicalNames``` (IDE.CN)  ```Substance.CasRegistryNumbers``` (IDE.RN)  
**complex range type**: ```Datapoint.MeasurementValue``` (DAT.VALUE)  

These **data types** are **supported** by the python-odata library **to varying degrees** for building filters:  
- for simple types (string, integer, decimal) the query method `filter` can be used  
- for lists of values the filter has to be added manually as an OData expression using the query method `raw`
- support for complex types is currently unknown
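
As a rough illustration of how the OData `$filter` expressions differ between simple and list-valued properties, the sketch below builds the expressions as plain strings. The helper names `odata_literal`, `simple_filter`, and `any_filter` are hypothetical and not part of python-odata, and the registry number used in the usage lines is an arbitrary example value.

```python
def odata_literal(value):
    """Format a Python value as an OData literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # OData escapes a single quote inside a string by doubling it
    return "'" + str(value).replace("'", "''") + "'"


def simple_filter(prop, operator, value):
    """Filter on a simple (string, integer, decimal) property."""
    return f"{prop} {operator} {odata_literal(value)}"


def any_filter(collection_prop, operator, value, var="x"):
    """Filter on a list-valued property using the OData `any` lambda operator."""
    return f"{collection_prop}/any({var}:{var} {operator} {odata_literal(value)})"


print(simple_filter("InChiKey", "eq", "LZFSKNPPWIFMFL-UHFFFAOYSA-N"))
print(simple_filter("ReaxysRegistryNumber", "eq", 1209))  # arbitrary example number
print(any_filter("ChemicalNames", "eq", "pyrrole", var="cn"))
```

The first two expressions correspond to what the `filter` method produces for simple properties, while an expression like the third one has to be passed through `raw`.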

## Construct simple string filters

For string properties a filter can be constructed using the query method `filter`, passing the resource property and the comparison condition as a single expression:  
`query.filter(RESOURCE.PROPERTY COMPARISON_OPERATOR VALUE)` 

In [None]:
# Filter substances by INCHI Key
query_sub_by_inchi = query_substance.filter(Substances.InChiKey == 'LZFSKNPPWIFMFL-UHFFFAOYSA-N')
print(query_sub_by_inchi)

In [None]:
# Filter citations by Patent Number
# Patent number format conversion supported

query_cit_by_patent_number = query_citations.filter(Citations.CommonPatentNumber == 'US202417248')
print(query_cit_by_patent_number)

## Execute Query

A query is only executed if results are requested.  
Requesting results can be accomplished using different methods:

- request the first matching record `query.first()`  

- request to retrieve all results `query.all()`  

- iterate over query object
```python
for entity in query:
  service.values(entity)
```


### Retrieve first matching record

The service method `values` makes it possible to pretty-print the record, which helps when exploring the record type.

In [None]:
patent = query_cit_by_patent_number.first()
Service.values(patent)

### Retrieve results using all

The `all` method executes the query and retrieves the matching records in one go.  
This method is supposed to exhaust the query until all results are retrieved.
However, the Reaxys API currently does not provide the required information in the response, and each request returns at most a fixed maximum number of records.
Therefore, only the number of records defined by the page size is retrieved.
How to retrieve result sets exceeding the page size is described later under **Broader queries and Pagination**.

In [None]:
print("## Retrieve all results at once")
patents = query_cit_by_patent_number.all()
print(f"Number of matching patents: {len(patents)}")
for patent in patents:
    print(f"Citation ID: {patent.CitationNumberId}")
    print(f"Patent Assignee: {patent.PatentAssignee}")

### Retrieve result by iterating query

A query object is iterable, so results can be retrieved one at a time in a for-loop.  
However, the same limitation applies here as for the `all` method, i.e. only the number of records defined by the page size is retrieved.
How to retrieve larger result sets is described later under **Broader queries and Pagination**.

In [None]:
print()
print("## Iterate over query to retrieve records one-by-one")
for patent in query_cit_by_patent_number:
    print(f"Citation ID: {patent.CitationNumberId}")
    print(f"Patent Assignee: {patent.PatentAssignee}")

## Filter by multiple criteria

To construct more complex filter criteria different methods are available.  

The `filter` method can be called multiple times, which concatenates the conditions with `and`:  
`query.filter(CONDITION).filter(CONDITION)`  

The `filter` method also accepts logically combined conditions as a single parameter:  
`query.filter( (CONDITION A) BOOLEAN OPERATOR (CONDITION B))`  


### Calling `filter` multiple times

In [None]:
# Calling the filter method multiple times concatenates the conditions with 'and'

# The query query_cit_by_patent_number already contains a filter
print("Original filter query:")
print(query_cit_by_patent_number)
patents = query_cit_by_patent_number.all()
print(f"Number of hits: {len(patents)}")
print()

# calling `filter` a second time will add the new condition using 'and' 
query_cit_by_patent_number_and_assignee = query_cit_by_patent_number.filter(Citations.PatentAssignee == 'BAKER HUGHES OILFIELD OPERATIONS LLC')
print("Query with additional filter criteria:")
print(query_cit_by_patent_number_and_assignee)

patents = query_cit_by_patent_number_and_assignee.all()
print(f"Number of hits: {len(patents)}")

### Use logically combined conditions as `filter` parameter

In [None]:
query_cit_by_patent_numbers = query_citations.filter((Citations.CommonPatentNumber == 'US202417248') | (Citations.CommonPatentNumber == 'CN114206964'))
print(query_cit_by_patent_numbers)

patents = query_cit_by_patent_numbers.all()
print(f"Number of hits: {len(patents)}")

## Construct queries for list of string properties

Filtering by properties that represent a list of string values is currently not supported by the `filter` method.  
However, the OData library makes it possible to pass OData request options directly as a filter using the `raw` method.  

For list-valued properties OData uses a lambda expression to compare each value with a given filter condition,  
e.g. to filter substances by a ChemicalName the filter statement would be `ChemicalNames/any(cn:cn eq 'pyrrole')`  

These filter statements can be defined using the `raw` method.
However, the `raw` method executes the query directly and returns the raw JSON response instead of the entity objects.

In [None]:
substance_results = query_substance.raw({'$filter': "ChemicalNames/any(cn:cn eq 'pyrrole')"})
print(json.dumps(substance_results, indent=2))
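
Because `raw` returns plain JSON rather than entity objects, the records have to be picked out of the response by hand. The sketch below assumes the response follows the standard OData v4 JSON layout, in which the records sit in a `value` array. The sample payload, including the registry numbers and names, is invented for illustration, and the helper `records_from_raw` is hypothetical. Whether python-odata hands back the full envelope or only the list is not confirmed here, so the helper accepts both.

```python
import json

# Hypothetical raw response shaped like a standard OData v4 JSON payload.
# All field values are invented for illustration.
SAMPLE_RAW_RESPONSE = json.loads("""
{
  "@odata.context": "$metadata#Substances",
  "value": [
    {"ReaxysRegistryNumber": 1, "ChemicalNames": ["pyrrole", "1H-pyrrole"]},
    {"ReaxysRegistryNumber": 2, "ChemicalNames": ["pyrrole"]}
  ]
}
""")


def records_from_raw(raw):
    """Return the list of record dicts from a raw OData JSON result."""
    if isinstance(raw, dict):
        # Standard OData envelope: the records are stored under "value"
        return raw.get("value", [])
    # Otherwise assume the records were already unwrapped into a list
    return list(raw)


for record in records_from_raw(SAMPLE_RAW_RESPONSE):
    print(record["ReaxysRegistryNumber"], record["ChemicalNames"])
```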

## Include additional facts in response using expand

Besides the core properties, a resource such as Citations also has additional associated facts, which are defined as individual OData entities.
For example, **Patent Citations** will **include Patent Bibliographic Data** such as **Patent Claims**.  
Retrieving these facts requires requesting them explicitly using the `expand` method. 

In [None]:
print("Original filter query:")
print(query_cit_by_patent_number)
print()

# Request to include PatentBibliographies
query_cit_by_patent_number_expand_pbib = query_cit_by_patent_number.expand(Citations.PatentBibliographies)
print("Query with expand request:")
print(query_cit_by_patent_number_expand_pbib)

In [None]:
patent = query_cit_by_patent_number_expand_pbib.first()
for pbib in patent.PatentBibliographies:
    Service.values(pbib)

# Broader queries and Pagination

## Pagination
By default the Reaxys OData API returns up to 10 records per request.  
This default can be increased using the OData `$top` option. The Python library provides this feature through the query method `limit`.  
However, only a maximum page size of 100 is supported.  
Therefore, for larger result sets a pagination mechanism must be applied.
To retrieve result sets larger than the page size, page through the results using the OData `$skip` option. The Python library provides this feature through the query method `offset`.
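
The `$top`/`$skip` mechanism can be sketched independently of the API as a small generator. The function `iterate_pages` and the in-memory `fake_fetch` stand-in are hypothetical and used only so the example runs without a network connection. With python-odata, the page fetcher could plausibly be `lambda skip, top: list(query.offset(skip).limit(top))`.

```python
from typing import Callable, Iterator, List

MAX_PAGE_SIZE = 100  # upper limit for $top stated above


def iterate_pages(fetch_page: Callable[[int, int], List], page_size: int = 10) -> Iterator:
    """Yield records page by page until a short (or empty) page signals the end.

    fetch_page(skip, top) is a caller-supplied function that returns the
    records for one $skip/$top window.
    """
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            break
        skip += page_size


# Stand-in for the API: 25 fake record IDs served from memory
FAKE_RECORDS = list(range(1, 26))


def fake_fetch(skip, top):
    return FAKE_RECORDS[skip:skip + top]


records = list(iterate_pages(fake_fetch, page_size=10))
print(f"Results fetched: {len(records)}")
```

Stopping on a page shorter than `page_size` costs one extra request when the total is an exact multiple of the page size, but it does not rely on the response reporting a total count.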

## Substring Queries
One common way to broaden a text search is a substring search. While Reaxys uses `*` as a wildcard to define a substring, OData provides dedicated functions:  
`startswith` identifies string property values that start with the filter string. This corresponds to `FILTERVALUE*`  
`endswith` identifies string property values that end with the filter string. This corresponds to `*FILTERVALUE`  
`contains` identifies string property values that contain the filter string. This corresponds to `*FILTERVALUE*`  
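
The correspondence between the Reaxys wildcard notation and the OData functions can be written down as a small translation helper. This is a minimal sketch: the function `wildcard_to_odata` is hypothetical, and it only handles a wildcard at the start and/or end of the pattern.

```python
def wildcard_to_odata(prop, pattern):
    """Translate a Reaxys-style wildcard pattern into an OData filter expression."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern.strip("*")
    if not core or "*" in core:
        raise ValueError("only leading and/or trailing wildcards are handled in this sketch")
    # OData escapes a single quote inside a string by doubling it
    literal = "'" + core.replace("'", "''") + "'"
    if leading and trailing:
        return f"contains({prop}, {literal})"
    if trailing:
        return f"startswith({prop}, {literal})"
    if leading:
        return f"endswith({prop}, {literal})"
    return f"{prop} eq {literal}"


for pattern in ("keishu", "keishu*", "*keishu", "*keishu*"):
    print(pattern, "->", wildcard_to_odata("PatentAssignee", pattern))
```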

### Exact string filter
To filter a string property by exact match, use the OData `eq` operator, which is represented by the Python equality operator `==`.

In [None]:
query_cit_by_orig_assignee = query_citations.filter(Citations.PatentAssignee == 'keishu')
print(query_cit_by_orig_assignee)

### Sub-string filter
To filter a string property by substring match, use the OData `contains` function. The Python library provides this feature through the property method `contains`.

In [None]:
query_cit_contains_orig_assignee = query_citations.filter(Citations.PatentAssignee.contains('keishu'))
print(query_cit_contains_orig_assignee)

### Increase Page size
By default the Reaxys API returns only up to 10 records per request. This page size can be changed using the OData `$top` option, which the Python library exposes as the query method `limit`.
However, a page size of 100 cannot be exceeded.

As mentioned above, the method `all` should exhaust the query, which is currently not supported.  
Even then, `all` will exhaust the query only if the default page size is used. If `$top` is used to change the page size, the user has to perform the pagination.

In [None]:
# Default page size is 10
patents = query_cit_contains_orig_assignee.all()
print(f"By default 10 records per page are returned: {len(patents)}")

In [None]:
# Page size can be changed using the `limit` method
query_cit_contains_orig_assignee_limit = query_cit_contains_orig_assignee.limit(20)
print(query_cit_contains_orig_assignee_limit)
patents = query_cit_contains_orig_assignee_limit.all()
print(f"Setting page size to 20 returns 20 records: {len(patents)}")

### Use `$skip` to page through results

Use the query method `offset`, which maps to the OData `$skip` option, to advance through the result set.

In [None]:
page_size = 10
offset = 0
result_count_total = 0
while True:
    print(f"Fetching results {offset+1} to {offset+page_size}")
    query_page = query_cit_contains_orig_assignee.offset(offset).limit(page_size)
    # Reset the per-page counter so that an empty page ends the loop cleanly
    result_count_current = 0
    for entity in query_page:
        print(f"Citation ID: {entity.CitationNumberId}")
        result_count_current += 1
    result_count_total += result_count_current
    offset += page_size
    # A page shorter than page_size means the last page has been reached
    if result_count_current < page_size:
        break
print(f"Results fetched: {result_count_total}")
    
    

# Retrieve associated main resource records

Using OData it is possible to retrieve the records of a main resource, e.g. Citations, and simultaneously retrieve the records of an associated main resource, such as the substances linked to a given citation.  

To accomplish this, the same `expand` mechanism is used as described above for the inclusion of additional resource facts.

In [None]:
query_cit_contains_orig_assignee_expand_subs = query_cit_contains_orig_assignee.expand(Citations.Substances)
print(query_cit_contains_orig_assignee_expand_subs)

In [None]:
for patent in query_cit_contains_orig_assignee_expand_subs.limit(10):
    substance_count = len(patent.Substances)
    print(f"Patent {patent.CitationNumberId} has {substance_count} excerpted substances")
    if substance_count == 0:
        continue
    print("First substance record for patent:")
    Service.values(patent.Substances[0])

# Current Known Major Limitations

## python-odata library

- **Structure and Similarity Search**: The Reaxys API uses OData bound actions for structure and similarity search. The Reaxys action definitions are currently not properly accessible via the odata library.
- **Complex range types**: Values such as melting points are stored in Reaxys as ranges. The odata Python library currently does not handle these complex data types.

Workaround examples showing how to execute these queries will be shared.

## Reaxys API

- **aggregations** are not yet available. This includes the aggregations `count` and `groupby`, which are supported by the existing XML API
- **cross-context filtering**, i.e. filtering substances by citation properties, such as filtering substances by patent number: `data/substance?$filter=CommonPatentNumber eq 'WO2023/235490'` - under development
- **filtering datapoints** (aka bioactivities) **by target names with taxonomy** using the property `TargetNameUniprotIdPdbId`, which corresponds to the `DAT.TNAME` field. This field requires integration with the GPT taxonomy, which is currently under investigation. As a workaround for the evaluation, the datapoints property `TargetShortKeyBioactivity`, which corresponds to the `DAT.TSKEY` field, can be used. This property, however, does not allow searching for UniProt IDs or other external identifiers.
Example usage of the workaround property `TargetShortKeyBioactivity`:  
`data/Datapoints?$filter=DatapointTarget/any(dt:contains(dt/TargetShortKeyBioactivity, 'Zein Protein'))`  
`data/Datapoints?$filter=DatapointTarget/any(dt:dt/TargetShortKeyBioactivity eq 'Zein Protein')`
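
Filter expressions like the ones above contain spaces and quotes, so they need to be percent-encoded before being sent as a URL. The following sketch builds such a request URL with the standard library only. The helper `build_request_url` is hypothetical, and it only constructs the URL without sending a request.

```python
from urllib.parse import parse_qs, quote, urlencode, urljoin, urlsplit

BASE_URL = "https://demodal-data-api.rx-nonprod.cm-elsevier.com/data/"


def build_request_url(resource, **options):
    """Build a request URL; keyword names are OData options without the leading '$'."""
    params = {f"${name}": str(value) for name, value in options.items()}
    # Keep '$' readable in the option names and percent-encode everything else
    query = urlencode(params, quote_via=quote, safe="$")
    url = urljoin(BASE_URL, resource)
    return f"{url}?{query}" if query else url


expression = "DatapointTarget/any(dt:dt/TargetShortKeyBioactivity eq 'Zein Protein')"
url = build_request_url("Datapoints", filter=expression, top=20)
print(url)

# Decoding the query string again recovers the original filter expression
print(parse_qs(urlsplit(url).query)["$filter"][0])
```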