[DevDoc] Notes on the API implementation

Rough, partly outdated, to-be-reviewed notes of mine (MB) on how the KnetMiner API is implemented. They give a decent developer-level introduction to our code, especially if you open the mentioned components while reading.

KnetminerServer

/{ds}/{mode}, handle(), GET case

Request hub: gets the DS, the "mode" (ie, the name of the API call) and the general params,
then dispatches to handleRaw()

/{ds}/{mode}, handle(), POST case

TODO

handleRaw ()

Invokes DS.$method, where $method is resolved from the mode name
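
The mode-to-method dispatch described above can be sketched with plain Java reflection. This is only an illustration under assumptions: `DemoDataSource`, `ModeDispatcher` and the `Map`-based parameter type are hypothetical names made up for the example, not the actual KnetMiner classes or signatures.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical stand-in for the DS object: one public method per API "mode".
class DemoDataSource {
    public String countHits(Map<String, String> params) {
        return "countHits for keyword=" + params.get("keyword");
    }

    public String synonyms(Map<String, String> params) {
        return "synonyms for keyword=" + params.get("keyword");
    }
}

// Hypothetical dispatcher, playing the role of handleRaw() in these notes.
class ModeDispatcher {
    // Looks up the public method named after the mode and invokes it on the data source.
    static String dispatch(Object dataSource, String mode, Map<String, String> params) {
        try {
            Method method = dataSource.getClass().getMethod(mode, Map.class);
            return (String) method.invoke(dataSource, params);
        } catch (NoSuchMethodException ex) {
            // No method with this name: the mode is not a known API call
            throw new IllegalArgumentException("Unknown API mode: " + mode, ex);
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Error while running API mode: " + mode, ex);
        }
    }
}
```

With this shape, adding an API call only requires adding a public method to the data source class; the trade-off is that a mistyped mode fails at run time rather than at compile time.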

/synonyms

Searches synonyms, using UIService.renderSynonymTable()
- Uses searchService.searchTopConceptsByName() to get relevant concepts
  - Uses luceneMgr.searchTopConceptsByIdxField()
  - Prepares a table where, for each keyword, there is an entry conceptName, conceptType, conceptId

/countHits

@param keyword
DS.countHits()
new SemanticMotifSearchMgr( keyword ), assuming keyword && ! geneList
- luceneConcepts: Map<Concept -> Score>, from SearchService.searchGeneRelatedConcepts ()
  - Splits the keyword into a list and gets the 'not query'
  - notList = this.searchTopConceptsByName(), if necessary
  - Populates hit2score (Concept -> Score) with a series of Lucene searches, involving keyword (the search string) and notList
  - countLinkedGenes()
    - Uses luceneConcepts and SM.concepts2Genes to count SM-linked concepts (luceneDocumentsLinked) and matched unique genes (numConnectedGenes)
Puts SMSearchMgr counts into the response

/countLoci()

DataService.getLociGeneCount() to count the loci in the request's QTL
Used in the genome regions input

/genome and _keyword()

@param keyword, list, listMode, qtl
DS.genome(), prepares GenomeResponse, calls DS._keyword()
- Extracts the userGenes, using KGUtils.filterGenesByAccessionKeywords()
  - This turns the list into genes, using 1) searches over accessions and names and 2) a filter on taxId
    - Probably not to be filtered with user taxId (check it's valid and configured)
  - Adds qtl to userGenes, using genome regions, via KGUtils.fetchQTLs ( ONDEXGraph graph, List<String> taxIds, List<String> qtlsStr )
    - QTL.fromStringList ( qtlsStr ) to build the QTL region structures
      - Then double loop over all regions and all genes in the graph
  - smSearchMgr = new SemanticMotifSearchMgr ( searchString, genes )
    - As described above, searches concepts based on keywords and scores them
  - candidateGenesMap = smSearchMgr.getSortedGeneCandidates() # Map<Concept->Score>. This is based on SemanticMotifsSearchResult.getScoredGenes ( Lucene-scored concepts ), which works as follows:
    - From lucene-hit concepts, compute gene2HitConcepts, ie, a subfilter over gene->concepts map (coming from sem motifs)
    - use gene2HitConcepts to compute knet scores for each gene => scoredGeneCandidates: Map<Gene -> KnetScore>
    - return gene -> score result, ranked by score and with a filter over (unlikely) duplicated genes
  - Then, this is (possibly) filtered using user genes + QTL genes
  - Finally, we have genesMap and genes
  - Next is the chromosome view
    - What to do with the multi-species case?
  - Next is exportService.exportGeneTable()
  - Next is exportService.exportEvidenceTable()

/network

Does the same gene filtering as _keyword()
ondexServiceProvider.getSemanticMotifService ().findSemanticMotifs( keyword, seed (genes) )
- Map<ONDEXConcept, Float> luceneResults = searchService.searchGeneRelatedConcepts ( keyword, seed, false )
- Then, semanticMotifDataService.getGraphTraverser () with the seed genes => Map<ONDEXConcept, List<EvidencePathNode>> results
- Splits the search string into actual keywords (SearchUtils.getSearchWords())
  - get a colour map for them (UIUtils.createHilightColorMap())
  - Uses the found paths to create the network view graph
  - highlights paths and node labels based on the search keywords

/dataset-info

General info on the current dataset
Served by DatasetInfo DatasetInfoService.datasetInfo()
Mostly based on the dataset section in the config YAML

/dataset-info/network-stats

Gets per-type topological information. Used by the 'Release notes' button
Served by DatasetInfoService.networkStats()
Based on the JSON file produced by KnetMinerInitializer.exportGraphStats()
which mostly gets its data from the Semantic Motif summary data

/dataset-info/knetspace-url

Served by DatasetInfoService.knetSpaceURL()
Using a dedicated config variable

[REMOVED] /evidencePath

@param keyword, used to extract an evidenceOndexId
list: usual gene list (except QTL)
Similar to /network, see #631
No longer used, removed

[REMOVED] /latestNetworkStats

Replaced by /dataset-info/network-stats, see #657
Fetches stats on the whole dataset,
- which were computed by ExportService.exportGraphStats()
- which was invoked by OSP.initData()

[REMOVED] /geneCount

Searches genes based on user input (uses KGUtils.filterGenesByAccessionKeywords() as above)
Adds genes in QTL regions, as above
Finds sem motifs and builds the subgraph
exports the subgraph to JSON
puts counts into the response
WTH?!?!?!?
No longer used, removed

[REMOVED] /{ds}/genepage

Prepares data to perform a network view request
Then forwards to genepage.jsp (via MVC)
which will know how to invoke /network
We moved it to the client, where it belongs

[REMOVED] /{ds}/evidencepage

Works similarly to genepage above

[REMOVED] /ksHost

Replaced by /dataset-info/knetspace-url.
returns the KnetSpace host, set in the config.

[REMOVED] /dataSource

Replaced by /dataset-info.

Some general info. Very rubbish format: it puts JSON into a string, instead of using the usual fields in the response class. The taxIds overwrite each other:

 summaryJSON.put("dbVersion", dataService.getDatasetVersion () );
 summaryJSON.put("sourceOrganization", dataService.getDatasetOrganization ());
 dataService.getTaxIds ().forEach( taxID -> {
 		summaryJSON.put("speciesTaxid", taxID);
 });
 summaryJSON.put("speciesName", dataService.getSpecies());

 // TODO: in future, this might come from OXL metadata (the graph descriptor)
 SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm");  
 var timestampStr = formatter.format ( oxlFile.lastModified () );
 summaryJSON.put("dbDateCreated", timestampStr);

 summaryJSON.put("provider", dataService.getDatasetProvider () );
 String jsonString = summaryJSON.toString();
 // Removing the pesky double quotes
 jsonString = jsonString.substring(1, jsonString.length() - 1);
 log.info("response.dataSource= " + jsonString);
 response.dataSource = jsonString;

It's used by save-knet.js, for exportAsJson(). This is very messy.
It's also used in showNetworkStats.js::fetchStats(), but only dbVersion is taken from the API output

Code details

`SemanticMotifSearchMgr`

Map<ONDEXConcept, Float> scoredConcepts: the keyword-related concepts, got from Lucene
- Based on SearchService.searchGeneRelatedConcepts() (see below)
SemanticMotifsSearchResult searchResult
- Uses SearchService.getScoredGenes ( scoredConcepts, this.taxId ) (see below)

`countLinkedGenes()`

Counts concepts in scoredConcepts, just using its size
Counts the genes linked to scoredConcepts
- For each concept:
  - Get genes in concept2Genes.get ( concept )
  - Filter by taxId
  - Finally, count
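
A minimal sketch of the two counts described above. It assumes concepts and genes are represented by integer IDs and that a hypothetical `gene2TaxId` map gives each gene's taxId; the class and method signatures are made up for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not the actual SemanticMotifSearchMgr code.
class LinkedGeneCounter {
    // The concept count is just the size of the scored concepts map.
    static int countConcepts(Map<Integer, Float> scoredConcepts) {
        return scoredConcepts.size();
    }

    // Counts the unique genes linked to any scored concept, keeping only the given taxId.
    static int countLinkedGenes(
            Map<Integer, Float> scoredConcepts,
            Map<Integer, Set<Integer>> concept2Genes,
            Map<Integer, String> gene2TaxId, // ASSUMPTION: illustrative way to get a gene's taxId
            String taxId) {
        Set<Integer> linkedGenes = new HashSet<>();
        for (Integer conceptId : scoredConcepts.keySet()) {
            for (Integer geneId : concept2Genes.getOrDefault(conceptId, Set.of())) {
                if (taxId.equals(gene2TaxId.get(geneId))) linkedGenes.add(geneId);
            }
        }
        // The set removes genes reached through more than one concept
        return linkedGenes.size();
    }
}
```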

SearchService

`searchGeneRelatedConcepts()`

Case with a gene list only:

(gene list is normalised)

for each gene in gene list: add genes2Concepts ( gene ) to the result, with score = 1
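
The gene-list-only case can be sketched like this, assuming integer IDs for genes and concepts. `GeneListConceptSearch` and its method are hypothetical names, and the normalisation of the gene list is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, only illustrating the "gene list only" branch.
class GeneListConceptSearch {
    // Every concept linked to one of the genes enters the result with a fixed score of 1.
    static Map<Integer, Float> conceptsForGenes(
            List<Integer> geneIds, Map<Integer, Set<Integer>> genes2Concepts) {
        Map<Integer, Float> scoredConcepts = new HashMap<>();
        for (Integer geneId : geneIds) {
            for (Integer conceptId : genes2Concepts.getOrDefault(geneId, Set.of())) {
                scoredConcepts.put(conceptId, 1f);
            }
        }
        return scoredConcepts;
    }
}
```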

Case with keyword

get the notQuery expression from keywords
Search concepts via Lucene, using keywords

`SemanticMotifsSearchResult getScoredGenes ( Map<ONDEXConcept, Float> scoredConcepts, taxId )`

Map<Integer, Set<Integer>> gene2HitConcepts

For each concept in scoredConcepts:
- add concept2Genes.get ( concept ) to result
  - possibly, filter by taxId
Then, group by gene

Map<ONDEXConcept, Double> scoredGeneCandidates

for each gene in gene2HitConcepts:
- for concept in gene2HitConcepts.get ( gene )
  - luceneScore = scoredEvidenceConcepts.get ( concept )
  - igf = log ( genesCount / concepts2Gene.get ( concept ).size () )
  - invGraphDist = 1 / genes2PathLens.get ( gene, concept )
  - knetScore = the three above combined
- Sum of knetScore for each concept is knetScore ( gene )
scoredGeneCandidates are sorted
The final SemanticMotifsSearchResult result contains:
- geneId2RelatedConceptIds = gene2HitConcepts
- gene2Score = sorted scoredGeneCandidates
genesCount is the total number of genes in the traverser seed that belong to one of the configured species. In Neo4j: does it need to be stored?
concepts2Gene.get ( concept ).size (), needs to be stored in Neo4j?
genes2PathLens.get ( gene, concept ) in Neo4j, is in the gene/concept link
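
The per-gene scoring loop above can be sketched as follows. These notes only say that the three factors are "combined", so multiplying them is an assumption of this sketch, as are the integer IDs, the `"geneId:conceptId"` key used for path lengths and the `KnetScoreSketch` name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the scoring loop, not the actual getScoredGenes() code.
class KnetScoreSketch {
    static Map<Integer, Double> scoreGenes(
            Map<Integer, Set<Integer>> gene2HitConcepts,
            Map<Integer, Float> scoredConcepts,
            Map<Integer, Set<Integer>> concept2Genes,
            Map<String, Integer> genes2PathLens, // ASSUMPTION: keyed by "geneId:conceptId"
            int genesCount) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : gene2HitConcepts.entrySet()) {
            int geneId = entry.getKey();
            double geneScore = 0;
            for (int conceptId : entry.getValue()) {
                double luceneScore = scoredConcepts.get(conceptId);
                // Inverse gene frequency: concepts linked to few genes weigh more
                double igf = Math.log((double) genesCount / concept2Genes.get(conceptId).size());
                double invGraphDist = 1.0 / genes2PathLens.get(geneId + ":" + conceptId);
                // ASSUMPTION: "combined" is taken to mean the product of the three factors
                geneScore += luceneScore * igf * invGraphDist;
            }
            scores.put(geneId, geneScore);
        }
        // Rank by descending score; LinkedHashMap keeps the ranking order
        Map<Integer, Double> sorted = new LinkedHashMap<>();
        scores.entrySet().stream()
            .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
            .forEachOrdered(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }
}
```

Under the product assumption, a concept contributes more to a gene when it has a high Lucene score, is linked to few genes and is close to the gene in the graph.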

exportGeneTable()

Params:

* List<ONDEXConcept> candidateGenes
* Set<ONDEXConcept> userGenes
* List<String> userQtlsStr
* String listMode
* SemanticMotifsSearchResult searchResult

Best name function in ondex
The gene's evidences are got from searchResult.getGeneId2RelatedConceptIds()
The gene score is got from searchResult.getGene2Score ()
The graph distances are got from genes2PathLengths (SemMotif summaries)
- In Neo4j, gene/concept links

exportEvidenceTable()