Commit

Classification (#93)
* A temporary OWL file for terms available for curation.

This is a temporary OWL file for terms that are not currently available in ontologies. It makes those terms available for further consideration and curation in the ontologies, and is used to enable classification in the interim.

* Revamping of the classification component based on the newer IFSAC+ classification schema.

Includes updated predefined resources.
Adds some new predefined resources.
Adds functionality to handle multi-class label values in bucket labels (sketched after this list).
Adds functionality to customize the ordering of class labels in the final class assignment.
Adds updated and new rules for post-classification refinement of IFSAC labels.
Excludes "hasBroadSynonym" from the mapping considerations.
Adds functionality to convert matched components to standard uppercase ontology IDs.
Adds four new modules for assigning a confidence level to mappings (for future use).
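
The implementation of these features lives in lexmapr/pipeline_classification.py and lexmapr/pipeline_helpers.py (see the diff below). As a rough illustration only, the following minimal Python sketch shows what multi-class bucket label handling, custom label ordering, and uppercase ontology ID standardization could look like; the function names, the ";" delimiter, the preferred-order list, and the example IDs are assumptions made for this sketch, not the merged code.

```python
import re

# Hypothetical preferred ordering of IFSAC+ class labels; the ordering
# actually used by the pipeline may differ.
PREFERRED_LABEL_ORDER = ["plant", "fruit", "vegetable", "dairy", "poultry"]


def split_multi_class_labels(bucket_label):
    """Split a bucket label carrying several class labels.

    Assumes labels are ';'-delimited, e.g. "fruit;vegetable".
    """
    return [label.strip() for label in bucket_label.split(";") if label.strip()]


def order_class_labels(labels):
    """Order class labels by the preferred order, with unknown labels last."""
    def sort_key(label):
        try:
            return PREFERRED_LABEL_ORDER.index(label)
        except ValueError:
            return len(PREFERRED_LABEL_ORDER)

    return sorted(set(labels), key=sort_key)


def standardize_ontology_id(matched_component):
    """Convert a matched component such as 'apple:foodon_00001234'
    (made-up ID) to an uppercase, colon-delimited ontology ID,
    e.g. 'apple:FOODON:00001234'.
    """
    term, _, raw_id = matched_component.rpartition(":")
    standard_id = re.sub(r"_(?=\d)", ":", raw_id).upper()
    return f"{term}:{standard_id}" if term else standard_id
```

Under these assumptions, a bucket label like "fruit;vegetable" would be split into ["fruit", "vegetable"] and reordered by PREFERRED_LABEL_ORDER before the classification columns are written.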

* Update lexmapr/predefined_resources/ifsac-refinement.csv

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_resources.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* A few coding style changes in function naming, documentation, indentation, etc., to adhere to PEP 8 style.

A few code changes to resolve "TODO" items regarding the refine_ifsac_final_labels, remove_duplicate_tokens, and retain_phrase functions (a sketch of the latter two follows).
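
The function names below come from the commit message, but their bodies are only a plausible sketch of the resolved behaviour (order-preserving token de-duplication, and retention of matched components that are not contained in a longer matched phrase); they are not the code that was actually merged into lexmapr/pipeline_helpers.py.

```python
def remove_duplicate_tokens(sample):
    """Drop repeated tokens from a phrase while preserving order.

    e.g. "chicken breast chicken" -> "chicken breast"
    """
    seen = set()
    kept = []
    for token in sample.split():
        if token not in seen:
            seen.add(token)
            kept.append(token)
    return " ".join(kept)


def retain_phrase(matched_components):
    """Keep only matched components whose term is not contained, as a
    whole-word sub-phrase, in another matched term (a simplified
    reading of the original retainedPhrase behaviour).
    """
    terms = [component.split(":", 1)[0] for component in matched_components]
    retained = []
    for component, term in zip(matched_components, terms):
        if not any(term != other and f" {term} " in f" {other} " for other in terms):
            retained.append(component)
    return retained
```

Under this reading, retain_phrase(["chicken:ID1", "chicken breast:ID2"]) would keep only "chicken breast:ID2".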

* Update tests and classification table

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>
lexmapr and ivansg44 committed Aug 11, 2020
1 parent 9093063 commit f0d0253
Showing 51 changed files with 22,526 additions and 2,232 deletions.
lexmapr/pipeline.py (52 changes: 40 additions & 12 deletions)
@@ -33,6 +33,11 @@ def run(args):
# terms from online ontologies to lookup tables.
lookup_table = pipeline_resources.get_predefined_resources()

# Scientific names dictionary fetched from lookup tables.
# Todo: Move to ontology_lookup_table later
scientific_names_dict = pipeline_resources.get_resource_dict(
"foodon_ncbi_synonyms.csv")

# To contain resources fetched from online ontologies, if any.
# Will eventually be added to ``lookup_table``.
ontology_lookup_table = None
@@ -57,14 +62,20 @@
output_fields = [
"Sample_Id",
"Sample_Desc",
"Cleaned_Sample",
"Processed_Sample",
"Processed_Sample (With Scientific Name)",
"Matched_Components"
]

if args.full:
output_fields += [
"Match_Status(Macro Level)",
"Match_Status(Micro Level)"
"Match_Status(Micro Level)",
"Sample_Transformations"
]
else:
output_fields += [
"Match_Status(Macro Level)"
]

if args.bucket:
@@ -100,13 +111,15 @@ def run(args):
sample_id = row[0].strip()
original_sample = " ".join(row[1:]).strip()
cleaned_sample = ""
cleaned_sample_scientific_name = ""
matched_components = []
macro_status = "No Match"
micro_status = []
lexmapr_classification = []
lexmapr_bucket = []
third_party_bucket = []
third_party_classification = []
sample_conversion_status = {}

# Standardize sample to lowercase and with punctuation
# treatment.
@@ -116,26 +129,32 @@
sample_tokens = word_tokenize(sample)

# Get ``cleaned_sample``
for tkn in sample_tokens:
for token in sample_tokens:
# Ignore dates
if helpers.is_date(tkn) or helpers.is_number(tkn):
if helpers.is_date(token) or helpers.is_number(token):
continue
# Some preprocessing
tkn = helpers.preprocess(tkn)
token = helpers.preprocess(token)

lemma = helpers.singularize_token(tkn, lookup_table, micro_status)
lemma = helpers.singularize_token(token, lookup_table, micro_status)
lemma = helpers.spelling_correction(lemma, lookup_table, micro_status)
lemma = helpers.abbreviation_normalization_token(lemma, lookup_table, micro_status)
lemma = helpers.non_English_normalization_token(lemma, lookup_table, micro_status)

if not token == lemma:
sample_conversion_status[token] = lemma
cleaned_sample = helpers.get_cleaned_sample(cleaned_sample, lemma, lookup_table)
cleaned_sample = re.sub(' +', ' ', cleaned_sample)
cleaned_sample = helpers.abbreviation_normalization_phrase(cleaned_sample,
lookup_table, micro_status)
cleaned_sample = helpers.non_English_normalization_phrase(cleaned_sample, lookup_table,
micro_status)
cleaned_sample_scientific_name = helpers.get_annotated_sample(
cleaned_sample_scientific_name, lemma, scientific_names_dict)
cleaned_sample_scientific_name = re.sub(' +', ' ', cleaned_sample_scientific_name)

cleaned_sample = helpers.remove_duplicate_tokens(cleaned_sample)
cleaned_sample_scientific_name = helpers.remove_duplicate_tokens(
cleaned_sample_scientific_name)

# Attempt full term match
full_term_match = helpers.map_term(sample, lookup_table)
@@ -222,11 +241,11 @@ def run(args):
# We do need it, but perhaps the function could be
# simplified?
if len(matched_components):
matched_components = helpers.retainedPhrase(matched_components)
matched_components = helpers.retain_phrase(matched_components)

# Finalize micro_status
# TODO: This is ugly, so revisit after revisiting
# ``retainedPhrase``.
# ``retain_phrase``.
micro_status_covered_matches = set()
for component_match in component_matches:
possible_matched_component = component_match["term"] + ":" + component_match["id"]
@@ -249,17 +268,26 @@ def run(args):
third_party_classification = classification_result["ifsac_final_labels"]

# Write to row
matched_components = helpers.get_matched_component_standardized(matched_components)

# Get post-processed cleaned sample with embedded scientific
# name.
cleaned_sample_scientific_name = helpers.refine_sample_sc_name(
sample, cleaned_sample, cleaned_sample_scientific_name,
third_party_classification)

fw.write("\n" + sample_id + "\t" + original_sample + "\t" + cleaned_sample + "\t"
+ str(matched_components))
+ cleaned_sample_scientific_name + "\t" + str(matched_components) + "\t"
+ macro_status)

if args.full:
fw.write("\t" + macro_status + "\t" + str(micro_status))
fw.write("\t" + str(micro_status)+"\t" + str(sample_conversion_status))

if args.bucket:
if args.full:
fw.write("\t" + str(lexmapr_classification) + "\t" + str(lexmapr_bucket)
+ "\t" + str(third_party_bucket))
fw.write("\t" + str(sorted(third_party_classification)))
fw.write("\t" + str(third_party_classification))

fw.write('\n')
# Output files closed
