# Prerequisites

1) Using Python 3.8, install the packages from requirements.txt in a virtual environment (run [installation.sh](installation.sh) on Unix or [installation.bat](installation.bat) on Windows).
2) Have a Neo4j database with the APOC and n10s (neosemantics) libraries available. See [here](https://github.com/GSK-Biostatistics/neointerface) for an introduction to Neo4j and [here](https://github.com/GSK-Biostatistics/neointerface/blob/main/Appendixes.md) for recommended ways to deploy it.
3) Set environment variables to enable the connection to your Neo4j database. The following are required:
NEO4J_HOST, NEO4J_USER, NEO4J_PASSWORD, NEO4J_RDF_HOST  
You may opt to populate the template in setenv.sh (on Unix) or setenv.bat (on Windows) and run it.

Once the above is done, you are ready to run the example in this notebook.
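
Before running the cells below, it can help to confirm that the four required variables are actually visible to the Python process. The following is a minimal sketch, not part of the CLD packages; the helper name `missing_env_vars` is made up for illustration.

```python
import os

# The four variables listed in the prerequisites above.
REQUIRED_VARS = ("NEO4J_HOST", "NEO4J_USER", "NEO4J_PASSWORD", "NEO4J_RDF_HOST")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required Neo4j environment variables are set.")
```

If anything is reported as missing, re-run setenv.sh or setenv.bat in the same shell that launches the notebook.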

In [1]:
import os
import time

# Loading CDISC metadata into Neo4j graph database
The files SDTM_v1.4.csv, SDTMIG_v3.2.csv and CT2022Q1.csv in the [cdisc_data](cdisc_data) folder were downloaded from the [CDISC Library](https://library.cdisc.org/browser).  
The CdiscStandardLoader class loads this metadata and creates links between its entities.  
It then enriches the metadata with SDTM ontology information from the [phuse-org repository](https://github.com/phuse-org/rdf.cdisc.org/blob/master/std/sdtm-1-3.ttl).

In [2]:
from cdisc_model_managers.cdisc_standard_loader import CdiscStandardLoader

standards_folder = "cdisc_data"
standards_model = "SDTM_v1.4.csv"
standards_file = "SDTMIG_v3.2.csv"
sdtm_terminology = "CT2022Q1.csv"

csl = CdiscStandardLoader(    
    standards_folder=standards_folder,
    sdtm_file=standards_model, 
    sdtmig_file=standards_file, 
    terminology_file=sdtm_terminology    
)

csl.clean_slate()  # cleaning the database
csl.load_standard()
print ("CDISC metadata loading and linked. "
       f"You may browse it using Neo4j Browser: {os.getenv('NEO4J_HOST').replace('neo4j', 'http').replace('7687','7474')}")

2022-11-28 14:41:03,366    cld    INFO    -------------------------------   Loaded CLD Logger    -------------------------------


---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
Connection to http://10.40.225.48:27474/rdf/ established
 --- Deleting nodes with label: `Resource` ---
Loading content
Standards folder: cdisc_data
Standards model: SDTM_v1.4.csv
Standards Implementation Guide: SDTMIG_v3.2.csv
Terminology file: CT2022Q1.csv
---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
Linking CDISC content
CDISC Data Link Complete
[{'terminationStatus': 'OK', 'triplesLoaded': 1778, 'triplesParsed': 1778, 'namespaces': None, 'extraInfo': '', 'callParams': {}}]
SDTM TTL Loaded and Linked
CDISC metadata loading and linked. You may browse it using Neo4j Browser: http://10.40.225.48:27474/
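
The Neo4j Browser URL printed above is derived with chained `str.replace` calls, which would also rewrite a host name that happens to contain `neo4j`. The sketch below shows one alternative that parses the URL first. It keeps the notebook's own convention that the HTTP port is the Bolt port with `7687` swapped for `7474`, which is an assumption about this deployment rather than a general rule; `browser_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def browser_url(bolt_url):
    """Derive the Neo4j Browser URL from a bolt/neo4j connection URL.

    Only the scheme and the port are changed, so a host name containing
    'neo4j' is left intact. The port mapping (7687 -> 7474) mirrors the
    replace() call used in the cell above and is an assumption.
    """
    parts = urlsplit(bolt_url)
    port = str(parts.port).replace("7687", "7474") if parts.port else "7474"
    return f"http://{parts.hostname}:{port}/"


print(browser_url("neo4j://10.40.225.48:27687/"))
print(browser_url("neo4j://neo4j.example.com:7687"))
```

The second call uses a made-up host name to show the case where plain string replacement would corrupt the address.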


# Generating Clinical Linked Data (CLD) model from CDISC metadata
To ingest data into a linked model using the neo4cdisc/tab2neo packages, the graph must contain metadata in a specific form: in particular, nodes with the labels "Class" and "Relationship" must exist. The following step derives that metadata from the CDISC metadata loaded in the previous step.

In [3]:
from model_managers.model_manager import ModelManager
mm = ModelManager()
mm.generate_excel_based_model()

---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
---------------- <class 'model_managers.model_manager.ModelManager'> initialized -------------------
Creating indexes on Class and Term
Creating indexes on Source Data Table and Source Data Column
Mapping Term GSK Codes and NCI Codes together
Creating Classes from Dataset and ObservationClass
Creating Class from Variable
Creating Class from dataElement and Relationship from Variable
Creating Class from SUPP domain Terms
Creating Relationships based on SDTM ontology
Creating additional Relationships based on business need
Linking Result Qualifiers to Finding Topics
Linking grouping classes to topics
Linking Category to Sub-Category
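
Since later steps depend on "Class" and "Relationship" nodes being present, a quick count per label is a cheap sanity check. The sketch below assumes a `run_query` callable that takes a Cypher string and returns a list of dict rows, which is how `dp.query` behaves later in this notebook. The helper `count_nodes_by_label` and the stand-in `fake_run_query` are hypothetical and exist only so the example runs without a database.

```python
def count_nodes_by_label(run_query, labels=("Class", "Relationship")):
    """Count nodes per label using any callable that runs Cypher.

    `run_query` is assumed to accept a Cypher string and return a list of
    dict rows (one dict per result record).
    """
    counts = {}
    for label in labels:
        rows = run_query(f"MATCH (n:`{label}`) RETURN count(n) AS n")
        counts[label] = rows[0]["n"] if rows else 0
    return counts


def fake_run_query(cypher):
    """Stand-in for a real database call; the returned counts are made up."""
    return [{"n": 3}] if "`Class`" in cypher else [{"n": 0}]


counts = count_nodes_by_label(fake_run_query)
empty_labels = [label for label, n in counts.items() if n == 0]
print("Node counts:", counts)
print("Labels with no nodes:", empty_labels)
```

With a live connection, passing a real query function in place of `fake_run_query` would report the actual counts.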


# Downloading CDISC pilot study data
To demonstrate clinical data ingestion into a graph, we will download some data from the [phuse-org repository](https://github.com/phuse-org/phuse-scripts/tree/master/data/sdtm/cdiscpilot01).  
For demonstration purposes we load only two domains; feel free to experiment with loading other domains as well.

In [4]:
from utils.utils import get_cdisc_pilot_data
domains = ['DM', 'DS']
#domains = ['DM', 'EX', 'AE', 'LB', 'VS', 'DS']
downloads = get_cdisc_pilot_data(domains)
print("Done: ", downloads['ok'])
if downloads['error']:
    print("Failed to download some files:")
    print(downloads['error'])


--------
Getting domain: DM
Download from url: https://github.com/phuse-org/phuse-scripts/raw/master/data/sdtm/cdiscpilot01/dm.xpt

--------
Getting domain: DS
Download from url: https://github.com/phuse-org/phuse-scripts/raw/master/data/sdtm/cdiscpilot01/ds.xpt
Done:  [{'folder': 'temp/data/sdtm/cdiscpilot01', 'file': 'dm.xpt'}, {'folder': 'temp/data/sdtm/cdiscpilot01', 'file': 'ds.xpt'}]
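
The `'ok'` list printed above holds one entry per downloaded file, each with a `folder` and a `file` key. Before loading, the entries can be checked against the file system. This is a small sketch based only on that printed structure; `missing_downloads` is a hypothetical helper, not part of `utils.utils`.

```python
import os


def missing_downloads(entries):
    """Return the paths from {'folder', 'file'} entries that are not on disk."""
    paths = (os.path.join(entry["folder"], entry["file"]) for entry in entries)
    return [path for path in paths if not os.path.isfile(path)]


# Same shape as the downloads['ok'] list shown in the output above.
entries = [
    {"folder": "temp/data/sdtm/cdiscpilot01", "file": "dm.xpt"},
    {"folder": "temp/data/sdtm/cdiscpilot01", "file": "ds.xpt"},
]
print("Files not found on disk:", missing_downloads(entries))
```

An empty list means every file reported as downloaded is present where the loader will look for it.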


# The next step is to load tabular data "as is" into the graph database
For details on FileDataLoader, see the [tab2neo data_loaders folder](https://github.com/GSK-Biostatistics/tab2neo/tree/main/data_loaders).

In [5]:
from data_loaders.file_data_loader import FileDataLoader
fdl = FileDataLoader()
fdl.delete_source_data() #deleting the data in case it was already uploaded to the graph
for download in downloads['ok']:
    df = fdl.load_file(download['folder'], download['file'])
    print(f"{download['file']} loaded")

---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
dm.xpt loaded
ds.xpt loaded


# Auto-mapping the data
Now that the graph contains the loaded data, CLD will attempt to map it to the model automatically. To avoid confusion, it then deletes the Classes to which no data is mapped.

In [6]:
standard_label = os.path.basename(csl.sdtmig_file)
mm.automap_excel_based_model(domain=domains, standard=standard_label)
mm.remove_unmapped_classes()

# Applying the graph model to the data (reshaping)


In [7]:
from model_appliers.model_applier import ModelApplier
ma = ModelApplier(mode="schema_CLASS")
ma.delete_reshaped(batch_size=10000) #deleting reshaped data in case it was reshaped previously

start_time = time.time()
ma.reshape_all()
print(
    f"Reshaping done in: {(time.time() - start_time):.2f} seconds\n"
    f"You may browse clinical data in a form of a graph using Neo4j Browser"
    f": {os.getenv('NEO4J_HOST').replace('neo4j', 'http').replace('7687','7474')}"
)

---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
 ------------------ Deleting reshaped instances ------------------
[[{'total': 0, 'batches': 1, 'failedBatches': 0, 'failedOperations': 0}], [{'total': 0, 'batches': 1, 'failedBatches': 0, 'failedOperations': 0}]]
 ------ Refactoring loaded data per graph class_ definition.  EXECUTING PART 1 --------- 
 ------ Refactoring loaded data per graph class_ definition.  EXECUTING PART 2 --------- 
    LOOPING OVER  38  entries in helper list:
--------------- Creating IS_A relationship ----------------------
[[{'total': 6567, 'batches': 7, 'failedBatches': 0, 'failedOperations': 0}]]
--------------- Linking classes ----------------------
[[{'total': 20019, 'batches': 21, 'failedBatches': 2, 'failedOperations': 1019}], [{'total': 486, 'batches': 1, 'failedBatches': 0, 'failedOperations': 0}]]
[{'total': 387, 'batches': 1, 'failedBatches': 0, 'failedOperations': 0}]
Reshaping done
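
The batch summaries printed during reshaping include `failedBatches` and `failedOperations` counters, and in the run above the class-linking step reports 2 failed batches and 1019 failed operations. Such failures are easy to miss in the log, so it can be useful to total them programmatically. The sketch below works on the nested list-of-dicts shape shown in the output; the helper names are hypothetical, and how to obtain these summaries as return values from `ModelApplier` is not shown here.

```python
def iter_summaries(result):
    """Yield every summary dict from an arbitrarily nested list structure."""
    if isinstance(result, dict):
        yield result
    elif isinstance(result, (list, tuple)):
        for item in result:
            yield from iter_summaries(item)


def failed_totals(result):
    """Sum the failure counters across all batch summaries."""
    totals = {"failedBatches": 0, "failedOperations": 0}
    for summary in iter_summaries(result):
        for key in totals:
            totals[key] += summary.get(key, 0)
    return totals


# Summary copied from the "Linking classes" line of the output above.
linking = [
    [{"total": 20019, "batches": 21, "failedBatches": 2, "failedOperations": 1019}],
    [{"total": 486, "batches": 1, "failedBatches": 0, "failedOperations": 0}],
]
totals = failed_totals(linking)
print(totals)
if totals["failedOperations"]:
    print("Warning: some operations failed during reshaping")
```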

# Clean-up

In [8]:
# To enable Methods to work with already existing Relationships (between a parent and a child's neighbour)
# created during refactoring, we explicitly create Relationship nodes
mm.propagate_rels_to_parent_class()
# labels from the domains that were not loaded may be confusing
print("Removing auxilary term labels")
mm.remove_auxilary_term_labels()
print("Done")

Copying Relationships to 'parent' Classes where (child)-[:SUBCLASS_OF]->(parent)
Removing auxilary term labels
Done


# Checking reshaping worked as expected
Now that the data is in the form of a graph, we can use further CLD modules to wrangle and query it.  
DataProvider offers a Python interface for querying data from the model without the need to write Cypher queries:

In [9]:
from data_providers.data_provider import DataProvider
dp = DataProvider()
df = dp.get_data_generic(
    labels=['Subject', 'Disposition', 'Dictionary-Derived Term', 'Start Date/Time of Observation'], #entities
    where_map={'Dictionary-Derived Term': {'rdfs:label':'COMPLETED'}}, #where condition
    infer_rels=True,
    return_nodeid=False,
    use_shortlabel=True,
    use_rel_labels=True,
    return_propname=False
)
assert not(df.empty)
df

---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
---------------- <class 'model_managers.model_manager.ModelManager'> initialized -------------------
---------------- Initializing NeoInterface -------------------
Connection to neo4j://10.40.225.48:27687/ established
---------------- <class 'data_providers.data_provider.DataProvider'> initialized -------------------


2022-11-28 14:41:55,475    cld    INFO    QUERY: MATCH (`DS`:`Disposition`),
(`USUBJID`:`Subject`),
(`DSDECOD`:`Dictionary-Derived Term`),
(`DSSTDTC`:`Start Date/Time of Observation`),
(`DS`)-[`DS_Dictionary-Derived Term_DSDECOD`:`Dictionary-Derived Term`]->(`DSDECOD`),
(`DS`)-[`DS_Start Date/Time of Observation_DSSTDTC`:`Start Date/Time of Observation`]->(`DSSTDTC`),
(`DS`)-[`DS_Subject_USUBJID`:`Subject`]->(`USUBJID`)
WHERE `DSDECOD`.`rdfs:label` = $par_1


RETURN apoc.map.mergeList([apoc.map.fromPairs([key in keys(CASE WHEN `USUBJID`{.*} IS NULL THEN {} ELSE `USUBJID`{.*} END) | ["USUBJID", CASE WHEN `USUBJID`{.*} IS NULL THEN {} ELSE `USUBJID`{.*} END[key]]])
, apoc.map.fromPairs([key in keys(CASE WHEN `DS`{.*} IS NULL THEN {} ELSE `DS`{.*} END) | ["DS", CASE WHEN `DS`{.*} IS NULL THEN {} ELSE `DS`{.*} END[key]]])
, apoc.map.fromPairs([key in keys(CASE WHEN `DSDECOD`{.*} IS NULL THEN {} ELSE `DSDECOD`{.*} END) | ["DSDECOD", CASE WHEN `DSDECOD`{.*} IS NULL THEN {} ELSE `DSDECOD

| | DSDECOD | DSSTDTC | USUBJID |
|---|---|---|---|
| 0 | COMPLETED | 2014-09-09 | 01-701-1118 |
| 1 | COMPLETED | 2014-07-28 | 01-713-1269 |
| 2 | COMPLETED | 2013-07-12 | 01-710-1408 |
| 3 | COMPLETED | 2014-07-28 | 01-717-1109 |
| 4 | COMPLETED | 2013-11-11 | 01-708-1084 |
| ... | ... | ... | ... |
| 105 | COMPLETED | 2013-05-27 | 01-704-1218 |
| 106 | COMPLETED | 2014-04-18 | 01-704-1351 |
| 107 | COMPLETED | 2013-11-05 | 01-708-1253 |
| 108 | COMPLETED | 2013-05-01 | 01-710-1354 |
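
The logged query shows the `where_map` condition rendered as a parameterised `WHERE` clause with a `$par_1` placeholder, so the value `COMPLETED` is passed separately from the query text. The sketch below illustrates that general idea only. It is not DataProvider's implementation: `build_where` is a hypothetical function, and it is keyed by Cypher variable name (`DSDECOD`) rather than by class label as `where_map` is.

```python
def build_where(conditions_by_variable):
    """Build a parameterised WHERE clause from {variable: {property: value}}.

    Returns the clause text and a dict of parameters, so values never have
    to be spliced into the Cypher string itself.
    """
    conditions, params = [], {}
    for variable, props in conditions_by_variable.items():
        for prop, value in props.items():
            name = f"par_{len(params) + 1}"
            conditions.append(f"`{variable}`.`{prop}` = ${name}")
            params[name] = value
    clause = "WHERE " + " AND ".join(conditions) if conditions else ""
    return clause, params


clause, params = build_where({"DSDECOD": {"rdfs:label": "COMPLETED"}})
print(clause)  # WHERE `DSDECOD`.`rdfs:label` = $par_1
print(params)  # {'par_1': 'COMPLETED'}
```

Keeping values in parameters avoids quoting problems with labels such as `rdfs:label` values that contain quotes or other special characters.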


# What else can we do with a graph?
Holding clinical data together with its metadata in a graph has many useful applications.  
For example, we can find where the loaded data does not comply with the loaded controlled terminology. To do that, we run a Cypher query against the database:

In [10]:
dp.query("""
MATCH (c:Class)<-[:IS_A]-(instance)
WHERE (c)-[:HAS_CONTROLLED_TERM]->(:Term)
AND NOT (c)-[:HAS_CONTROLLED_TERM]->(:Term)<-[:Term]-(instance)
RETURN *
""")

[{'c': {'short_label': '--DECOD',
   'count': 12,
   'create': False,
   'label': 'Dictionary-Derived Term'},
  'instance': {'rdfs:label': 'FINAL RETRIEVAL VISIT'}},
 {'c': {'short_label': '--DECOD',
   'count': 12,
   'create': False,
   'label': 'Dictionary-Derived Term'},
  'instance': {'rdfs:label': 'FINAL LAB VISIT'}}]

The query result shows that the terms 'FINAL LAB VISIT' and 'FINAL RETRIEVAL VISIT' are not part of the metadata (CT2022Q1.csv) but are present in the data.
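
When such a query returns many rows, grouping the offending values by class makes the result easier to read. The sketch below post-processes rows of the shape shown in the output above (abridged to the keys it uses); `non_compliant_by_class` is a hypothetical helper, not part of the CLD packages.

```python
from collections import defaultdict


def non_compliant_by_class(rows):
    """Group instance labels by the label of the class they belong to."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["c"]["label"]].add(row["instance"]["rdfs:label"])
    return {label: sorted(values) for label, values in grouped.items()}


# Rows copied from the query output above, keeping only the keys used here.
rows = [
    {"c": {"short_label": "--DECOD", "label": "Dictionary-Derived Term"},
     "instance": {"rdfs:label": "FINAL RETRIEVAL VISIT"}},
    {"c": {"short_label": "--DECOD", "label": "Dictionary-Derived Term"},
     "instance": {"rdfs:label": "FINAL LAB VISIT"}},
]
report = non_compliant_by_class(rows)
for label, values in report.items():
    print(f"{label}: {', '.join(values)}")
```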