

# Queries for Knowledge Graph 

Queries that can be used in Neo4j for our Knowledge Graph.

Change these placeholders to fit your needs:
* 'NAME' should be replaced with the specific drug, substance, reason etc. name you want to run the query for. Make sure this exact name actually exists in the Knowledge Graph.
* 'KEYWORD' should be replaced with a drug, substance or producer keyword you want to run the query for. This does not have to be the whole or exact name; the query searches for the keyword and may return multiple results.
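
The placeholder substitution can also be done programmatically. The following Python sketch is only an illustration and not part of the original notebook: the helper name `keyword_query` and the `$pattern` query parameter are assumptions. Instead of pasting the keyword into the query text, it builds the case-insensitive pattern as a parameter value and escapes regular-expression metacharacters so that the keyword is matched literally.

```python
import re

# Keyword search with the pattern passed as a query parameter
# ($pattern is a hypothetical parameter name, not used in the notebook).
KEYWORD_QUERY = """
MATCH (drug:drug)
WHERE drug.name =~ $pattern
RETURN drug.name
"""

def keyword_query(keyword: str) -> tuple[str, dict[str, str]]:
    """Return the query text and its parameters for a literal keyword search."""
    # re.escape backslash-escapes characters such as '.', '(' or '+'
    # so they lose their special meaning in the regular expression.
    pattern = "(?i).*" + re.escape(keyword) + ".*"
    return KEYWORD_QUERY, {"pattern": pattern}

query, params = keyword_query("ibuprofen")
print(params["pattern"])  # (?i).*ibuprofen.*
```

The returned pair of query text and parameters can then be handed to whichever Neo4j client is in use.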

## Queries for specific aspects inside the graph

These queries are mostly simple and explore single aspects of the graph.

### Show only the substances relevant to supply
will return a graph with the substance nodes whose property relevant_to_supply is set to 'Yes'

In [None]:
MATCH (substance:substance)
WHERE substance.relevant_to_supply = 'Yes'
RETURN substance;

### Show drugs of specific dosage form
will return a graph with all drug nodes that have the specified dosage form

for example 'Retardtablette'

In [None]:
MATCH (drug:drug)
WHERE drug.dosage_form = 'NAME'
RETURN drug;

### Show one producer and all their drugs
will return a graph with a specific producer in the middle and all their drugs. This is especially useful to see whether drugs of the same producer are connected as alternatives.

for example use 'Riemser Pharma GmbH'

In [None]:
MATCH (p:producer {producer: 'NAME'})
OPTIONAL MATCH (p)-[rel]->(otherNode)
RETURN p, rel, otherNode;

### Search for your medication
use any keyword; it might not find anything, or it might find multiple matches. It is also possible to search for dosages or producers, since these keywords are sometimes part of the drug name.

for example use 'ibuprofen'


In [None]:
//shows only the names

MATCH (drug:drug)
WHERE drug.name =~ '(?i).*KEYWORD .*'
RETURN drug.name;

In [None]:
//shows the matching drug nodes in a graph

MATCH (drug:drug)
WHERE drug.name =~ '(?i).*KEYWORD .*'
RETURN drug
LIMIT 25;  // Limiting to 25 nodes for example, adjust as needed

In [None]:
//shows the matching drug nodes and all their relationships in a graph

MATCH (drug:drug)-[r]-(connectedNode)
WHERE drug.name =~ '(?i).*KEYWORD .*'
RETURN drug, r, connectedNode;

## Queries to calculate percentages and ratios

these queries return total counts and the share of nodes that have a given property value

### How many drugs are generic drugs
will return the total number of drugs, the number of generic drugs and the share of generic drugs (as a fraction between 0 and 1)

In [None]:
MATCH (drug:drug)
WITH COUNT(drug) AS total_drugs, 
		SUM(CASE WHEN drug.is_generic = 'TRUE' THEN 1 ELSE 0 END) AS is_generic_drugs
RETURN total_drugs, is_generic_drugs, 
		toFloat(is_generic_drugs) / toFloat(total_drugs) AS percentage_of_total_drugs;

### How many substances are relevant to supply
will return the total number of substances, the number of substances relevant to supply and their share of all substances

In [None]:
MATCH (substance:substance)
WITH COUNT(substance) AS total_Substances, 
		SUM(CASE WHEN substance.relevant_to_supply = 'Yes' THEN 1 ELSE 0 END) AS relevant_to_supply_Substances
RETURN total_Substances, relevant_to_supply_Substances, 
			toFloat(relevant_to_supply_Substances) / toFloat(total_Substances) AS percentage_of_relevant;

### How many drugs contain substances relevant to supply
will return the total number of drugs that have at least one substance listed, the number of those drugs containing a substance relevant to supply, and their share


In [None]:
MATCH (drug:drug)-[:has_substance]->(substance:substance)
WITH COUNT(DISTINCT drug) AS total_Drugs,
COUNT(DISTINCT CASE WHEN substance.relevant_to_supply = 'Yes' THEN drug END) AS relevant_to_supply_Drugs
RETURN relevant_to_supply_Drugs, total_Drugs,
toFloat(relevant_to_supply_Drugs) / toFloat(total_Drugs) AS percentage_of_relevant;

### How many reports are issued because of a specific reason
will return the total number of reports that have been issued because of a specific reason.

for example use 'Unzureichende Produktionskapazitäten'

In [None]:
MATCH (m:reason {reason: 'NAME'})<-[:because]-(r:report)
RETURN m.reason, COUNT(r);

### How many drugs have alternatives
returns the total count of drugs and the count of drugs which have at least one alternative listed

In [None]:
MATCH (d:drug)
OPTIONAL MATCH (d)-[:HAS_ALTERNATIVE]->(a:drug)
WITH d, COUNT(a) AS alternatives
RETURN COUNT(d) AS totalDrugs, SUM(CASE WHEN alternatives > 0 THEN 1 ELSE 0 END) AS drugsWithAlternatives;

### How many alternatives point onto themselves
returns a graph with every drug that points to itself as an alternative

In [None]:
MATCH p=(n)-[r:HAS_ALTERNATIVE]->(n)
RETURN p
LIMIT 250;

### How many alternatives point to each other
returns a graph with every pair or triple of drugs that point to each other as an alternative

In [None]:
MATCH path = (n:drug)-[:HAS_ALTERNATIVE*2..3]->(n)
RETURN path;

## Queries for Durations and Time periods

### How long was a drug reported in days
returns the duration in days of each report for a specific drug, together with the begin date and PZN

In [None]:
MATCH (report:report)-[:report_has]->(:drug {name: 'NAME'})
RETURN duration.inDays(date(report.begin), date(report.end)).days, 
	report.begin, report.PZN

### Which reports go into 2024 
returns a graph of all report nodes that end in 2024, the table view shows the report id together with the end date

In [None]:
MATCH (report:report)
WHERE date(report.end).year = 2024
RETURN report.report_id AS reportId, report.end AS endDate, report;

### Which reports are active today
returns a table with every report whose time span overlaps with the current date

In [None]:
MATCH (report:report)-[:report_has]->(:drug)
WHERE date(report.begin) <= date() AND date() <= date(report.end)
RETURN report.report_id AS reportId, 
				report.begin AS beginDate, report.end AS endDate;

### Which reports were active in January 2023
returns a table with every report id, begin and end date

In [None]:
MATCH (report:report)-[:report_has]->(:drug)
WHERE date(report.begin) <= date('2023-01-31') 
			AND date('2023-01-01') <= date(report.end)
RETURN report.report_id AS reportId, 
			report.begin AS beginDate, report.end AS endDate;

### How many reports were active in January 2023
returns the number of reports that were active in January 2023

In [None]:
MATCH (report:report)-[:report_has]->(:drug)
WHERE date(report.begin) <= date('2023-01-31') 
			AND date('2023-01-01') <= date(report.end)
RETURN COUNT(report) AS numberOfReports;

change ENDDATE and BEGINDATE to match a time period, using the format yyyy-mm-dd:

In [None]:
MATCH (report:report)-[:report_has]->(:drug)
WHERE date(report.begin) <= date('ENDDATE') 
			AND date('BEGINDATE') <= date(report.end)
RETURN COUNT(report) AS numberOfReports;
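
The WHERE clause used in these time-period queries is the standard interval-overlap test: a report counts as active if it began on or before the last day of the period and ended on or after the first day. A small Python sketch of the same condition, with a hypothetical helper name `overlaps` and made-up report dates, can make the edge cases easier to check:

```python
from datetime import date

def overlaps(report_begin: date, report_end: date,
             period_begin: date, period_end: date) -> bool:
    """Mirror of the WHERE clause: begin <= period end AND period begin <= end."""
    return report_begin <= period_end and period_begin <= report_end

january = (date(2023, 1, 1), date(2023, 1, 31))

# Started in December 2022 and ended mid-January: active in January.
print(overlaps(date(2022, 12, 15), date(2023, 1, 10), *january))  # True
# Started only in February: not active in January.
print(overlaps(date(2023, 2, 1), date(2023, 3, 1), *january))     # False
# Ended exactly on the first day of the period: still counted as active.
print(overlaps(date(2022, 11, 1), date(2023, 1, 1), *january))    # True
```

Both comparisons are inclusive, so a report that begins or ends exactly on a boundary date of the period is counted.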

### Average duration of all shortages 

In [None]:
MATCH (r:report)
WHERE r.time_span IS NOT NULL
RETURN AVG(toFloat(r.time_span)) AS averageTimeSpan;

average duration of reports that ended before 2023-10-28, the date the original dataset was obtained:

In [None]:
MATCH (r:report)
WHERE date(r.end) < date('2023-10-28') AND r.time_span IS NOT NULL 
RETURN AVG(toFloat(r.time_span)) AS avgTimeSpan;

### Average Duration of Shortages for each producer
calculates the average report duration for each producer

we limited the reports to those that ended before 2023-10-28, the date we obtained the original dataset, in order to exclude estimated durations.

sorted descending and ascending

In [None]:
//descending

MATCH (p:producer)-[:producer_of]->(d:drug)<-[:report_has]-(report:report)
WHERE date(report.end) < date('2023-10-28')
WITH p, AVG(toFloat(report.time_span)) AS avgDuration_in_days
RETURN p.producer AS producer, avgDuration_in_days
ORDER BY avgDuration_in_days DESC;

In [None]:
//ascending

MATCH (p:producer)-[:producer_of]->(d:drug)<-[:report_has]-(report:report)
WHERE date(report.end) < date('2023-10-28')
WITH p, AVG(toFloat(report.time_span)) AS avgDuration_in_days
RETURN p.producer AS producer, avgDuration_in_days
ORDER BY avgDuration_in_days ASC;

## Queries using basic graph theory

### Produce an inventory of the nodes
The Cypher query below performs a full graph traversal and keeps a random sample of about 10% of the nodes through the condition 'rand() <= 0.1'. As a result, the numbers returned are effectively based on a 10% sample of the graph.

source: https://neo4j.com/developer/kb/how-do-i-produce-an-inventory-of-statistics-on-nodes-relationships-properties/

returns these values for each type of node:
Average, minimum and maximum number of properties,
Average, minimum and maximum number of relationships

In [None]:
MATCH (n) WHERE rand() <= 0.1
WITH labels(n) as labels, size(keys(n)) as props, COUNT{(n)--()} as degree
RETURN
DISTINCT labels,
count(*) AS NumofNodes,
avg(props) AS AvgNumOfPropPerNode,
min(props) AS MinNumPropPerNode,
max(props) AS MaxNumPropPerNode,
avg(degree) AS AvgNumOfRelationships,
min(degree) AS MinNumOfRelationships,
max(degree) AS MaxNumOfRelationships
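
Because the sampled query only sees about one node in ten, NumofNodes in its output has to be scaled up by a factor of about 10 to estimate the real count per label, while the averages can be read directly. The Python sketch below imitates the 'rand() <= 0.1' filter on made-up node ids; the function name `sample_nodes` and the fixed seed are assumptions for illustration only.

```python
import random

def sample_nodes(node_ids, fraction, seed=None):
    """Keep each node with probability `fraction`, like WHERE rand() <= fraction."""
    rng = random.Random(seed)
    return [node for node in node_ids if rng.random() <= fraction]

node_ids = list(range(10_000))  # stand-in for the nodes of a graph
sample = sample_nodes(node_ids, 0.1, seed=42)

# The sample size varies from run to run (here it is fixed by the seed),
# so the scaled-up value is only an estimate of the true node count.
estimated_total = len(sample) / 0.1
print(len(sample), round(estimated_total))
```

With a fraction of 1 every node passes the filter, so the sample is the whole graph.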

changed query to use the whole graph and not just a 10% sample:

In [None]:
MATCH (n)
WITH labels(n) as labels, size(keys(n)) as props, COUNT{(n)--()} as degree
RETURN
DISTINCT labels,
count(*) AS NumofNodes,
avg(props) AS AvgNumOfPropPerNode,
min(props) AS MinNumPropPerNode,
max(props) AS MaxNumPropPerNode,
avg(degree) AS AvgNumOfRelationships,
min(degree) AS MinNumOfRelationships,
max(degree) AS MaxNumOfRelationships

### Most Central Nodes
Returns the top 5 most central nodes. Central here means having the highest number of relationships (incoming and outgoing).


The top 6 most central nodes of the whole graph are all reasons, which is why n.reason is used instead of n.name, which would return null for reason nodes.

you can adjust the query to show more or fewer nodes by changing the limit


In [None]:
MATCH (n)
RETURN n.reason AS node, size([ (n)--() | 1 ]) AS degree
ORDER BY degree DESC
LIMIT 5;
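
Degree, as used here, is simply the number of relationships touching a node, regardless of direction. As a sketch of what the query computes, the following Python snippet counts degrees on a small, invented edge list; the node names and the helper `top_by_degree` are hypothetical and do not come from the Knowledge Graph.

```python
from collections import Counter

def top_by_degree(edges, limit=5):
    """Return the `limit` nodes with the most relationships (in plus out)."""
    degree = Counter()
    for source, target in edges:
        # Every relationship adds one to the degree of both of its endpoints.
        degree[source] += 1
        degree[target] += 1
    return degree.most_common(limit)

# Invented toy graph: reports pointing to one reason and one drug.
edges = [
    ("report1", "reasonA"),
    ("report2", "reasonA"),
    ("report3", "reasonA"),
    ("report1", "drugX"),
    ("report2", "drugX"),
]

print(top_by_degree(edges, limit=1))  # [('reasonA', 3)]
```

Nodes that many others point to, such as the reason in this toy example, come out on top, which matches the observation that the most central nodes in the graph are reasons.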

### Incoming, outgoing and total degree of a node type
shown here for substance nodes; you can change the label to any other type of node

Total:

In [None]:
// Calculate Total node degrees for substance nodes (both incoming and outgoing relationships)
MATCH (n:substance)
OPTIONAL MATCH (n)-[]->(outgoing)
OPTIONAL MATCH (incoming)-[]->(n)
WITH n, COUNT(DISTINCT outgoing) + COUNT(DISTINCT incoming) AS degree
RETURN
COALESCE(MIN(degree), 0) AS minDegree,
COALESCE(AVG(degree), 0) AS avgDegree,
COALESCE(MAX(degree), 0) AS maxDegree;

Incoming:

In [None]:
// Calculate incoming node degrees for substance nodes
MATCH (n:substance)
OPTIONAL MATCH (incoming)-[]->(n)
WITH n, COUNT(DISTINCT incoming) AS incomingDegree
RETURN
COALESCE(MIN(incomingDegree), 0) AS minIncomingDegree,
COALESCE(AVG(incomingDegree), 0) AS avgIncomingDegree,
COALESCE(MAX(incomingDegree), 0) AS maxIncomingDegree;

Outgoing:

In [None]:
// Calculate outgoing node degrees for substance nodes
MATCH (n:substance)
OPTIONAL MATCH (n)-[]->(outgoing)
WITH n, COUNT(DISTINCT outgoing) AS outgoingDegree
RETURN
COALESCE(MIN(outgoingDegree), 0) AS minOutgoingDegree,
COALESCE(AVG(outgoingDegree), 0) AS avgOutgoingDegree,
COALESCE(MAX(outgoingDegree), 0) AS maxOutgoingDegree;

### Top 5 Treatments by number of drugs
sorts treatments depending on which one includes the most drugs in the graph

In [None]:
MATCH (d:drug)-[:has_substance]->(:substance)-[:used_for]->(treatment:treatment)
RETURN treatment.treatment AS treatmentGroup, COUNT(DISTINCT d) AS numberOfDrugs
ORDER BY numberOfDrugs DESC
LIMIT 5;

### Top Substances by number of drugs

returns a graph of the top 3 substance nodes and their connected drugs

In [None]:
MATCH (s:substance)<-[:has_substance]-(d:drug)
WITH s, COLLECT(d) AS drugs, COUNT(*) AS numDrugs
RETURN s, numDrugs, drugs
ORDER BY numDrugs DESC
LIMIT 3

returns a table with the top 10 substances by number of drugs connected to them

In [None]:
MATCH (s:substance)<-[:has_substance]-(d:drug)
WITH s, COLLECT(d) AS drugs, COUNT(*) AS numDrugs
RETURN s.name AS Substance, numDrugs 
ORDER BY numDrugs DESC
LIMIT 10

### Top 5 drugs by number of reports

returns graph with top 5 drug nodes and their connected reports:

In [None]:
MATCH (d:drug)<-[:report_has]-(r:report)
WITH d, COLLECT(r) AS reports, COUNT(*) AS degree
RETURN d, degree, reports
ORDER BY degree DESC
LIMIT 5;

returns a table with PZN, drug name and degree (number of reports):

In [None]:
MATCH (d:drug)<-[:report_has]-(r:report)
WITH d, COLLECT(r) AS reports, COUNT(*) AS degree
RETURN d.PZN, d.name, degree
ORDER BY degree DESC
LIMIT 5;