# Using [linkML](https://linkml.io) to define schemas in the context of PGO:

## Background:

* [linkML](https://linkml.io)  is a "flexible modeling language that allows you to author schemas in YAML that describe the structure of your data. Additionally, it is a framework for working with and validating data in a variety of formats (JSON, RDF, TSV), with generators for compiling linkML schemas to other frameworks."
* PGO, the Pistoia Alliance Pharma General Ontology, is a project that aims to document 20 key concepts in pharmaceutical industry R&D. The documentation process first identifies these key entities and then identifies open or well-characterised authoritative resources that can serve as community-approved references and be recommended when defining a data exchange process.
* **Why is PGO looking at linkML?** There are three answers to that question:
    * Define a data structure for describing each type/class defined by PGO
    * Define a data structure for describing one or more established and authoritative resources to reference instances of the PGO types.
    * Define a data structure for defining a data transfer between a CRO and a Pharma customer

## Notebook scope:

The main goals of this notebook are:
* to highlight the key functions provided by the linkML framework to validate schema-level and instance-level data.
* to showcase one possible way of representing data dictionaries using the linkML framework.
* to discuss the strengths and weaknesses of the framework.

### 1. Viewing a linkML-based PGO Resource Description Minimal Information profile schema:

In [55]:
import os
# os.chdir("/change_me_path_to/git/linkml/data_dictionary/outputs/")
os.getcwd()

'./git/Pistoia-Alliance-Pharma-General-Ontology/doc'

In [56]:
# os.chdir("/change_me_path_to/git/Pistoia-Alliance-Pharma-General-Ontology/doc")
# os.getcwd()

In [57]:
!cat  ./linkml/pgo-entities-by-import.yaml

id: https://w3id.org/linkml/PGO/core-entities
name: core-entities
prefixes:
  linkml: https://w3id.org/linkml/
  namespace: http://example.com

version: 0.0.0
license: https://creativecommons.org/publicdomain/zero/1.0/

imports:
  - linkml:types
  - ./pgo-enums
  - ./pgo-entity-named-thing
  - ./pgo-entity-substance
  - ./pgo-entity-cell
  - ./pgo-entity-drug

default_range: string
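
Because the schema above only aggregates other files through `imports`, it can help to list the local files it pulls in before linting them one by one. The following is a minimal sketch, assuming imports are written one per line as `- ./name` (as in the listing above) and that each local import maps to a `.yaml` file in the same directory as the importing schema. `local_import_files` is a hypothetical helper, not part of linkML, and its line-based parsing is a simplification that would not handle arbitrary YAML.

```python
from pathlib import PurePosixPath

# Excerpt of the schema shown above (imports block only).
SCHEMA_TEXT = """\
imports:
  - linkml:types
  - ./pgo-enums
  - ./pgo-entity-named-thing
  - ./pgo-entity-substance
  - ./pgo-entity-cell
  - ./pgo-entity-drug

default_range: string
"""


def local_import_files(schema_text, base_dir="./linkml"):
    """Return the YAML files referenced by relative imports (hypothetical helper)."""
    files = []
    in_imports = False
    for line in schema_text.splitlines():
        stripped = line.strip()
        if stripped == "imports:":
            in_imports = True
            continue
        if in_imports:
            if stripped.startswith("- "):
                target = stripped[2:].strip()
                # CURIE-style imports such as linkml:types are not local files.
                if target.startswith("./"):
                    files.append(str(PurePosixPath(base_dir) / (target[2:] + ".yaml")))
            elif stripped:
                # The first non-empty, non-list line ends the imports block.
                in_imports = False
    return files


for path in local_import_files(SCHEMA_TEXT):
    print(path)
```

Each printed path can then be passed to the linter individually, which makes it easier to see which imported file a problem comes from.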


### 2. Linting and Validating a linkML schema:

#### 2.1 Let's check if this LinkML schema validates. 

To do so, we can invoke the `linkml-lint` command. First, let's check the options available:

In [58]:
!linkml-lint --help

Usage: linkml-lint [OPTIONS] SCHEMA

  Run linter on SCHEMA.

  SCHEMA can be a single LinkML YAML file or a directory. If it is a directory
  every YAML file found in the directory (recursively) will be linted.

Options:
  -a, --all                       Process files that start with '.'.
  -c, --config FILE               Custom linter configuration file.
  -f, --format [terminal|markdown|json|tsv]
                                  Report format.  [default: terminal]
  --validate                      Validate the schema against the LinkML
                                  Metamodel before linting.
  --validate-only                 Validate the schema against the LinkML
                                  Metamodel and then exit without checking
                                  linter rules.
  -v, --verbose
  -o, --output FILENAME           Report file name.
                                  found.  [default: 0]
  --fix / --no-fix
  -V, --version                   Show the version and e

#### 2.2 We'll now invoke the command, with the `--validate` option, on the individual PGO entity schemas imported above:

In [60]:
!linkml-lint --validate ./linkml/pgo-entity-named-thing.yaml

✓ No problems found


In [61]:
!linkml-lint --validate ./linkml/pgo-entity-drug.yaml

✓ No problems found


In [74]:
!linkml-lint --validate ./linkml/pgo-entity-cell.yaml

/git/Pistoia-Alliance-Pharma-General-Ontology/doc/linkml/pgo-entity-cell.yaml
  error    In classes > Cell: Additional properties are not allowed ('definitions' was unexpected)  (valid-schema)

✖ Found 1 problem in 1 schema


In [63]:
!linkml-lint --validate ./linkml/pgo-entity-substance.yaml

✓ No problems found


#### 2.3 Let's invoke the same command, but skip the warnings in the report:

In [75]:
!linkml-lint --ignore-warnings ./linkml/pgo-resource.yaml

//git/Pistoia-Alliance-Pharma-General-Ontology/doc/linkml/pgo-resource.yaml
  error    File is not a valid LinkML schema. Use --validate for more details.

✖ Found 1 problem in 1 schema


```note
The root cause of the error is the attempt to provide an array of values for the class_iri attribute, which is not supported by the linkml parser.
```
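
To locate this kind of problem before running the linter or the generators, one can scan the parsed schema for URI-valued keys that hold a list instead of a single string. The following is a minimal sketch, assuming the schema has already been loaded into plain Python dictionaries (for example with a YAML parser) and that the keys of interest are `class_uri` and `slot_uri`. The `schema` dictionary below is an invented example, not the actual content of `pgo-resource.yaml`.

```python
# Keys assumed to hold a single URI or CURIE (assumption for this sketch).
URI_KEYS = ("class_uri", "slot_uri")


def find_list_valued_uris(node, path=()):
    """Yield (location, value) for every URI key whose value is a list."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = path + (key,)
            if key in URI_KEYS and isinstance(value, list):
                yield " > ".join(here), value
            else:
                yield from find_list_valued_uris(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_list_valued_uris(item, path + (str(index),))


# Invented schema fragment: one class uses a list where a single value is expected.
schema = {
    "classes": {
        "Resource": {
            "class_uri": ["schema:name", "dcterms:title"],
            "attributes": {
                "abbreviation": {"slot_uri": "schema:alternateName"},
            },
        }
    }
}

problems = list(find_list_valued_uris(schema))
for location, value in problems:
    print(f"{location}: expected a single URI or CURIE, got {value}")
```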

#### 2.4 Invoking the linter with the `--validate` option:



In [77]:
# !linkml-lint --validate ./linkml/pgo-entities-by-import.yaml

#### 2.5 Understanding the validation report:

The difference between `linting` and `validating` is that validating checks that the schema is syntactically valid against the LinkML metamodel, while linting checks compliance with stylistic rules.
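
The two modes can be driven from Python as well as from the shell. The sketch below only uses the `--validate` and `--validate-only` flags listed in the help output above; `lint_command` is a hypothetical helper, and the command is executed only when the `linkml-lint` executable is found on the `PATH`.

```python
import shutil
import subprocess

# Schema files linted individually earlier in this notebook.
SCHEMAS = [
    "./linkml/pgo-entity-named-thing.yaml",
    "./linkml/pgo-entity-drug.yaml",
    "./linkml/pgo-entity-cell.yaml",
    "./linkml/pgo-entity-substance.yaml",
]

# mode -> extra flags: style rules only, metamodel check then style rules,
# or metamodel check only.
MODE_FLAGS = {
    "lint": [],
    "validate": ["--validate"],
    "validate-only": ["--validate-only"],
}


def lint_command(schema_path, mode="lint"):
    """Build the linkml-lint command line for one schema file (hypothetical helper)."""
    return ["linkml-lint", *MODE_FLAGS[mode], schema_path]


for schema_path in SCHEMAS:
    command = lint_command(schema_path, mode="validate-only")
    print(" ".join(command))
    if shutil.which("linkml-lint"):  # run only when the CLI is installed
        completed = subprocess.run(command, capture_output=True, text=True)
        print(completed.stdout)
```

Running `validate-only` first separates structural errors, which block every generator, from stylistic warnings, which do not.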

#### Warnings:
All the warnings reported by the linter are minor and relate to names that do not follow the conventions adopted by the linkML group. They can be ignored, but it should be noted that our usage departs from the recommendations and may reduce reusability.

#### Errors:
The linkML linter/validator reports an `error` for the slot `abbreviation`. The error simply says that the `slot_uri` cannot be left empty and, if present, the attribute must be filled with a valid `uniform resource identifier`. A compact URI (CURIE) may be used if the corresponding namespace prefix has been declared in the schema.

#### Fixes:
To clear the error in the linkML schema, set the `slot_uri` of the `abbreviation` slot to a valid URI or CURIE, for instance `schema:alternateName`:

```
abbreviation:
  slot_uri: schema:alternateName
  required: true
  description: the abbreviation of the resource
```
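
The rule behind this error (a `slot_uri` must be a single URI, or a CURIE whose prefix is declared) can be approximated with a short check. This is a simplified sketch, not the linkML implementation: the regular expressions are deliberately loose, `check_slot_uri` is a hypothetical helper, and the `schema` prefix mapping is an assumption, since the schema shown earlier only declares `linkml` and `namespace`.

```python
import re

# Simplified patterns; the real URI and CURIE grammars are more permissive.
URI_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://\S+$")
CURIE_PATTERN = re.compile(r"^([A-Za-z_][\w.-]*):(\S+)$")


def check_slot_uri(value, prefixes):
    """Return None if value looks usable as a slot_uri, else an error message."""
    if not isinstance(value, str) or not value:
        return "slot_uri must be a non-empty string"
    if URI_PATTERN.match(value):
        return None
    match = CURIE_PATTERN.match(value)
    if match is None:
        return f"{value!r} is neither a URI nor a CURIE"
    prefix = match.group(1)
    if prefix not in prefixes:
        return f"prefix {prefix!r} is not declared in the schema"
    return None


# Assumed prefix declarations; 'schema' is added here for illustration.
prefixes = {
    "linkml": "https://w3id.org/linkml/",
    "schema": "http://schema.org/",
}

for candidate in ["schema:alternateName", "dcterms:description", "", ["schema:name", "dcterms:title"]]:
    print(repr(candidate), "->", check_slot_uri(candidate, prefixes) or "ok")
```

A CURIE such as `dcterms:description` would pass only after a `dcterms` prefix has been added to the `prefixes` section of the schema.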


In [42]:
!linkml-lint --validate ./linkml/pgo-entity.yaml

//git/Pistoia-Alliance-Pharma-General-Ontology/doc/linkml/pgo-entity.yaml

✖ Found 41 problems in 1 schema


### 3. Generating an Excel Spreadsheet template from the LinkML schema for use by the Business:

The linkML framework provides a number of generators that take a linkML schema as input and produce generator-specific outputs.
One such output is a tabular template suitable for data entry by non-technical personnel.
The command is shown below:

In [66]:
!gen-excel ./linkml/pgo-resource.yaml -o  ./linkml/pgo-resource_in_linkml_as_excel_template.xlsx

ValueError: ['schema:name', 'dcterms:title'] is not a valid URI or CURIE


*Observations:*
The resulting Microsoft Excel template contains one worksheet for each linkML class declared in the schema.
The header of each worksheet corresponds to the class attributes (`slots` in linkML terms).

`+` For slots whose range is an enumeration (a set of controlled values), the corresponding Excel field is provisioned with a data-validation drop-down showing the value set defined in the linkML schema.

`-` Although the linkML schema declares a textual description for each field, the resulting Excel rendering has no contextual information or tooltips.

`-` Although some field ranges are specified as a linkML class, the resulting Excel rendering does not show the cross-referencing.
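
The worksheet layout described above can be approximated in a few lines. This is an illustrative sketch only, not the `gen-excel` implementation: it assumes the schema is available as plain dictionaries with slots declared under `attributes`, and the class, slot, and enum names (`Resource`, `Organization`, `ResourceKind`) are invented for the example.

```python
# Invented schema fragment, already parsed into plain dictionaries.
schema = {
    "enums": {
        "ResourceKind": {"permissible_values": {"ontology": {}, "database": {}}},
    },
    "classes": {
        "Resource": {
            "attributes": {
                "name": {"range": "string"},
                "kind": {"range": "ResourceKind"},
                "maintainer": {"range": "Organization"},
            }
        },
        "Organization": {"attributes": {"name": {"range": "string"}}},
    },
}


def template_layout(schema):
    """Map each class (worksheet) to its columns and any drop-down values."""
    enums = schema.get("enums", {})
    layout = {}
    for class_name, class_def in schema.get("classes", {}).items():
        columns = {}
        for slot_name, slot_def in class_def.get("attributes", {}).items():
            enum_def = enums.get(slot_def.get("range"))
            # Only enum-ranged slots get a list of allowed values (the drop-down);
            # class-ranged slots such as 'maintainer' carry no cross-reference.
            columns[slot_name] = list(enum_def["permissible_values"]) if enum_def else None
        layout[class_name] = columns
    return layout


layout = template_layout(schema)
for sheet, columns in layout.items():
    print(sheet, columns)
```

The `None` entries mirror the two limitations listed above: neither the slot descriptions nor the class-valued ranges leave any trace in the generated columns.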

### 4. Let's open the excel file resulting from the operation

In [14]:
!open ./linkml/pgo-resource_in_linkml_as_excel_template.xlsx

The file ./git/Pistoia-Alliance-Pharma-General-Ontology/doc/linkml/pgo-resource_in_linkml_as_excel_template.xlsx does not exist.


### 5. Validate instance data against a linkML schema:

In this section of the tutorial, we will show how to validate instance data against a linkML schema.
While linkML is meant to be capable of validating instances expressed in various forms (e.g., yaml, tsv), we will highlight some inconsistencies and maturity issues associated with the technology stack.
In this section, we use the `pgo-resource.yaml` linkML schema and the associated instance data file, `pgo_resource_instance_data.yaml`:

In [15]:
!cat ./linkml/pgo_resource_instance_data.yaml

cat: ./linkml/pgo_resource_instance_data.yaml: No such file or directory


### Let's validate this instance data yaml file against the schema:

Note: This is different from checking that the linkML schema itself is valid against the LinkML metamodel. Here, we want to check that all instance data complies with the requirements described by the schema (`pgo-resource.yaml` in the command below).
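
To make the distinction concrete, the sketch below checks one instance against one class definition. It is a simplified stand-in for what instance validation involves, not the `linkml-validate` implementation: it only covers required slots, unknown slots, and enumeration ranges, and the class definition, enum, and instance are invented for illustration.

```python
def validate_instance(instance, class_def, enums):
    """Return a list of human-readable problems for one instance (a dict)."""
    problems = []
    attributes = class_def["attributes"]
    # 1. Every required slot must be present and non-empty.
    for slot_name, slot_def in attributes.items():
        if slot_def.get("required") and instance.get(slot_name) in (None, ""):
            problems.append(f"missing required slot '{slot_name}'")
    # 2. Every provided slot must be declared, and enum values must be permissible.
    for slot_name, value in instance.items():
        if slot_name not in attributes:
            problems.append(f"unknown slot '{slot_name}'")
            continue
        allowed = enums.get(attributes[slot_name].get("range"))
        if allowed is not None and value not in allowed:
            problems.append(f"'{value}' is not a permissible value for '{slot_name}'")
    return problems


# Invented class definition and enum, for illustration only.
resource_class = {
    "attributes": {
        "name": {"required": True, "range": "string"},
        "abbreviation": {"required": True, "range": "string"},
        "kind": {"range": "ResourceKind"},
    }
}
enums = {"ResourceKind": ["ontology", "database"]}

instance = {"name": "Example resource", "kind": "spreadsheet", "colour": "red"}
problems = validate_instance(instance, resource_class, enums)
for problem in problems:
    print(problem)
```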

In [16]:
! linkml-validate --schema ./linkml/pgo-resource.yaml --target-class Resource ./linkml/pgo-resource-instances.yaml

Traceback (most recent call last):
  File "/git/venv/bin/linkml-validate", line 8, in <module>
    sys.exit(cli())
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/git/venv/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/git/venv/lib/python3.9/site-packages/linkml/validator/cli.py", line 191, in cli
    validator = Validator(config.schema_path, validation_plugins=plugins, strict=exit_on_first_failure)
  File "/git/venv/lib/python3.9/site-packages/linkml/validator/va

### Let's now do the same using instance data formatted as a tabular file (here, CSV):

In [17]:
!linkml-validate --schema ./linkml/pgo-resource.yaml ./linkml/pgo_resource_instance_data_as_excel_template-populated.csv --target-class Resource    
  

Usage: linkml-validate [OPTIONS] [DATA_SOURCES]...
Try 'linkml-validate --help' for help.

Error: Invalid value for '[DATA_SOURCES]...': Path './linkml/pgo_resource_instance_data_as_excel_template-populated.csv' does not exist.
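
Before handing a tabular file to the validator, a quick pre-flight check can confirm that its header row matches the slot names expected by the target class. The following is a minimal sketch using only the Python standard library; `compare_header` is a hypothetical helper, and the CSV content and slot names are invented rather than taken from the PGO files.

```python
import csv
import io

# Invented CSV content and slot names, for illustration only.
CSV_TEXT = "name,abbreviation,homepage\nExample resource,EXR,https://example.org\n"
EXPECTED_SLOTS = {"name", "abbreviation", "kind"}


def compare_header(csv_text, expected_slots):
    """Return (missing, unexpected) column names for a CSV with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(expected_slots - header), sorted(header - expected_slots)


missing, unexpected = compare_header(CSV_TEXT, EXPECTED_SLOTS)
print("missing columns:", missing)
print("unexpected columns:", unexpected)
```

For a file on disk, the same function can be applied to the text returned by `pathlib.Path(...).read_text()`, after checking `Path.exists()` to avoid the "does not exist" error shown above.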


In [18]:
!gen-jsonld-context --base https://schema.org ./linkml/pgo-resource.yaml > ./linkml/pgo_resource_new_context.jsonld

yaml.scanner.ScannerError: while scanning a plain scalar
  in "pgo-resource.yaml", line 66, column 31
found unexpected ':'
  in "pgo-resource.yaml", line 66, column 38


In [19]:
!gen-jsonld-context --help

Usage: gen-jsonld-context [OPTIONS] YAMLFILE

  Generate jsonld @context definition from LinkML model

Options:
  --base TEXT                     Base URI for model
  --prefixes / --no-prefixes      Emit context for prefixes
                                  (default=--prefixes)  [default: prefixes]
  --model / --no-model            Emit context for model elements
                                  (default=--model)  [default: model]
  --flatprefixes / --no-flatprefixes
                                  Emit non-JSON-LD compliant prefixes as an
                                  object (deprecated: use gen-prefix-map
                                  instead).  [default: no-flatprefixes]
  -V, --version                   Show the version and exit.
  -f, --format [context|json]     Output format  [default: context]
  --metadata / --no-metadata      Include metadata in output  [default:
                                  metadata]
  --useuris / --metauris          Use class and slot URIs ov

In [20]:
!linkml-convert --help

Usage: linkml-convert [OPTIONS] INPUT

  Converts instance data to and from different LinkML Runtime serialization
  formats.

  The instance data must conform to a LinkML model, and either a path to a
  python module must be passed, or a path to a schema.

  The converter works by first using a linkml-runtime *loader* to instantiate
  in-memory model objects, then a *dumper* is used to serialize. A validation
  step is optionally performed in between

  When converting to or from RDF, a path to a schema must be provided.

  For more information, see https://linkml.io/linkml/data/index.html

Options:
  -m, --module TEXT               Path to python datamodel module
  -o, --output TEXT               Path to output file
  -f, --input-format [yml|yaml|json|rdf|ttl|json-ld|csv|tsv]
                                  Input format. Inferred from input suffix if
                                  not specified
  -t, --output-format [yml|yaml|json|rdf|ttl|json-ld|csv|tsv]
                       

In [21]:
!linkml-convert -s ./linkml/pgo-resource.yaml   -c ./linkml/pgo_resource_new_context.jsonld -f yaml  ./linkml/pgo-resource-instances.yaml  -t ttl -o ./linkml/pgo_resources_instance_data_in_turtle.ttl





Traceback (most recent call last):
  File "/git/venv/bin/linkml-convert", line 8, in <module>
    sys.exit(cli())
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "git/venv/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/git/venv/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/git/venv/lib/python3.9/site-packages/linkml/utils/converter.py", line 108, in cli
    python_module = PythonGenerator(schema).compile_module()
  File "<string>", line 29, in __init__
  File "/git/venv/lib/python3.9/site-packages/linkml/generators/pythongen.py", line 75, in __post_init__
    self.schemaview = SchemaView(self.schema, base_dir=self.base_dir)
  File "/git/venv/lib/python3.9/sit

### 6. Generating documentation for the schema:

In [22]:
!gen-doc  ./linkml/pgo-resource.yaml  -d ./linkml/pgo_resource_schema_documentation

yaml.scanner.ScannerError: while scanning a plain scalar
  in "pgo-resource.yaml", line 66, column 31
found unexpected ':'
  in "pgo-resource.yaml", line 66, column 38


### 7. Generating SQL tables from the linkML schema

In [23]:
!gen-sqlddl ./linkml/pgo-resource.yaml > ./linkml/resources_linkml.ddl.sql

yaml.scanner.ScannerError: while scanning a plain scalar
  in "pgo-resource.yaml", line 66, column 31
found unexpected ':'
  in "pgo-resource.yaml", line 66, column 38


In [24]:
!grep -A35 'CREATE TABLE "Resource"' ./linkml/resources_linkml.ddl.sql

## Conclusion:

With this last step, we reach the end of this tutorial. 
In the next guidance document, we will demonstrate how to generate a basic prototype for a user interface using streamlit and the streamlit-sqlalchemy packages.