Technical documentation of mapping suggestion pipeline

Python scripts

  • id-generator-templates: Takes in a template file and adds IDs to cells where there are none.
  • merge-mapping-suggestions: Simple script that collects mappings from the various mapping pipelines (currently only zooma and nlp) and updates the templates with Suggested Categories.
  • mapping-suggest-qc: Reviews existing mappings and compares them with the suggestions generated by the pipeline. Identifies similar terms mapped to different GECKO classes, as well as differences between the mappings actually performed and the suggestions the pipeline generates (which points to a need to reconcile conflicting mappings).
  • mapping-suggest-nlp: Uses a standard NLP pipeline that takes in all known term-GECKO mappings and attempts to predict suitable mappings for a data dictionary.
  • mapping-suggest-zooma: Uses Zooma and OLS to match a data dictionary against known term mappings.
  • ihcc-zooma-dataset: Takes in all templates, processes some of the associated strings and produces a Zooma dataset with all known term mappings.
  • lib: Contains all functions used and re-used by the scripts above.
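
As an illustration of the first step in the list above (adding IDs to cells where there are none), here is a minimal sketch. It is not the actual id-generator-templates code: the function name `fill_missing_ids`, the `ID` column name, the `EX:` prefix, and the zero-padded numbering scheme are all assumptions made for this example.

```python
import csv
import io


def fill_missing_ids(rows, id_column="ID", prefix="EX:", width=7):
    """Return copies of rows in which every empty ID cell gets a fresh, unused ID.

    The column name, prefix, and padding width are hypothetical defaults.
    """
    # Collect the IDs that are already present so they are never reused.
    used = {
        (row.get(id_column) or "").strip()
        for row in rows
        if (row.get(id_column) or "").strip()
    }
    counter = 1
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if not (new_row.get(id_column) or "").strip():
            candidate = f"{prefix}{counter:0{width}d}"
            # Skip over numbers that already appear in the template.
            while candidate in used:
                counter += 1
                candidate = f"{prefix}{counter:0{width}d}"
            new_row[id_column] = candidate
            used.add(candidate)
        filled.append(new_row)
    return filled


# A tiny made-up template: one row already has an ID, two do not.
template = "ID\tLabel\nEX:0000001\tAge\n\tSmoking status\n\tBody mass index\n"
rows = list(csv.DictReader(io.StringIO(template), delimiter="\t"))

for row in fill_missing_ids(rows):
    print(row["ID"], row["Label"], sep="\t")
```

Existing IDs are collected first so that newly generated ones never collide with them, and the input rows are copied rather than modified in place.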

Jupyter notebooks

The Jupyter notebooks are not part of the official pipeline, but were used to develop it and to explore a variety of configurations. They can be safely ignored, but will come in handy when the pipeline is extended or improved.

  • ihcc-matching.ipynb: Matches a single template and generates Suggested Category items.
  • zooma_ihcc_zooma_dataset.ipynb: Builds a Zooma dataset from all the lexical information contained in all data dictionaries and GECKO.
  • ihcc-nlp-mapping.ipynb: Explores some very basic machine learning classifiers (Naive Bayes, SVM) with a bit of NLP processing. The notebook is for exploration only, and can be used to improve the classification model.
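
To give a rough idea of the kind of classifier the last notebook explores, the sketch below implements a small multinomial Naive Bayes model with add-one smoothing using only the Python standard library. It is not the notebook's code, which may well rely on an existing machine learning library. The class name `NaiveBayesTermClassifier`, the training labels, and the category names are invented for illustration and are not real GECKO classes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a term label and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTermClassifier:
    """Multinomial Naive Bayes over label tokens, with add-one smoothing."""

    def fit(self, labels, categories):
        self.class_counts = Counter(categories)
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for label, category in zip(labels, categories):
            tokens = tokenize(label)
            self.token_counts[category].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, label):
        tokens = tokenize(label)
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(count / total)
            n_tokens = sum(self.token_counts[category].values())
            for token in tokens:
                seen = self.token_counts[category][token]
                score += math.log((seen + 1) / (n_tokens + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical known mappings: (term label, toy category name).
known = [
    ("age at enrolment", "Age"),
    ("age of participant", "Age"),
    ("current smoking status", "Smoking"),
    ("ever smoked cigarettes", "Smoking"),
    ("body weight in kg", "Anthropometry"),
    ("standing height in cm", "Anthropometry"),
]
model = NaiveBayesTermClassifier().fit(
    [label for label, _ in known], [category for _, category in known]
)

for term in ["participant age at baseline", "smoking history", "weight"]:
    print(term, "->", model.predict(term))
```

Working in log space avoids numeric underflow for longer labels, and the add-one smoothing keeps a token that was never seen in a category from driving that category's probability to zero.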