Commit

Classification (#93)
* A temporary OWL file for terms available for curation.

This is a temporary OWL file for terms that are not currently available in ontologies. It makes those terms available for further consideration and curation in the ontologies, and is used to enable classification in the interim.

* Revamping of the classification component based on the newer IFSAC+ classification schema.

Includes updated predefined resources.
Adds some new predefined resources.
Adds functionality to handle multi-class label values in bucket labels (sketched after this list).
Adds functionality to customize the ordering of class labels in the final class assignment.
Adds updated and new rules for post-classification refinement of IFSAC labels.
Excludes "hasBroadSynonym" from the mapping considerations.
Adds functionality to convert matched components to standard uppercase ontology IDs.
Adds four new modules for assigning a confidence level to mappings (for future use).
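
The implementation of these features lives in lexmapr/pipeline_classification.py and lexmapr/pipeline_helpers.py (see the diff below). As a rough illustration only, the following minimal Python sketch shows what multi-class bucket label handling, custom label ordering, and uppercase ontology ID standardization could look like; the function names, the ";" delimiter, the preferred-order list, and the example IDs are assumptions made for this sketch, not the merged code.

```python
import re

# Hypothetical preferred ordering of IFSAC+ class labels; the ordering
# actually used by the pipeline may differ.
PREFERRED_LABEL_ORDER = ["plant", "fruit", "vegetable", "dairy", "poultry"]


def split_multi_class_labels(bucket_label):
    """Split a bucket label carrying several class labels.

    Assumes labels are ';'-delimited, e.g. "fruit;vegetable".
    """
    return [label.strip() for label in bucket_label.split(";") if label.strip()]


def order_class_labels(labels):
    """Order class labels by the preferred order, with unknown labels last."""
    def sort_key(label):
        try:
            return PREFERRED_LABEL_ORDER.index(label)
        except ValueError:
            return len(PREFERRED_LABEL_ORDER)

    return sorted(set(labels), key=sort_key)


def standardize_ontology_id(matched_component):
    """Convert a matched component such as 'apple:foodon_00001234'
    (made-up ID) to an uppercase, colon-delimited ontology ID,
    e.g. 'apple:FOODON:00001234'.
    """
    term, _, raw_id = matched_component.rpartition(":")
    standard_id = re.sub(r"_(?=\d)", ":", raw_id).upper()
    return f"{term}:{standard_id}" if term else standard_id
```

Under these assumptions, a bucket label like "fruit;vegetable" would be split into ["fruit", "vegetable"] and reordered by PREFERRED_LABEL_ORDER before the classification columns are written.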

* Update lexmapr/predefined_resources/ifsac-refinement.csv

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_resources.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* A few coding style changes in function naming, documentation, indentation, etc., to adhere to PEP 8 style.

A few code changes to resolve "TODO" items regarding the refine_ifsac_final_labels, remove_duplicate_tokens, and retain_phrase functions (a sketch of the latter two follows).
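
The function names below come from the commit message, but their bodies are only a plausible sketch of the resolved behaviour (order-preserving token de-duplication, and retention of matched components that are not contained in a longer matched phrase); they are not the code that was actually merged into lexmapr/pipeline_helpers.py.

```python
def remove_duplicate_tokens(sample):
    """Drop repeated tokens from a phrase while preserving order.

    e.g. "chicken breast chicken" -> "chicken breast"
    """
    seen = set()
    kept = []
    for token in sample.split():
        if token not in seen:
            seen.add(token)
            kept.append(token)
    return " ".join(kept)


def retain_phrase(matched_components):
    """Keep only matched components whose term is not contained, as a
    whole-word sub-phrase, in another matched term (a simplified
    reading of the original retainedPhrase behaviour).
    """
    terms = [component.split(":", 1)[0] for component in matched_components]
    retained = []
    for component, term in zip(matched_components, terms):
        if not any(term != other and f" {term} " in f" {other} " for other in terms):
            retained.append(component)
    return retained
```

Under this reading, retain_phrase(["chicken:ID1", "chicken breast:ID2"]) would keep only "chicken breast:ID2".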

* Update tests and classification table

* Update lexmapr/pipeline.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_classification.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

* Update lexmapr/pipeline_helpers.py

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>

Co-authored-by: ivansg44 <ivan.gill@bccdc.ca>
lexmapr and ivansg44 committed Aug 11, 2020
1 parent 9093063 commit f0d0253
Showing 51 changed files with 22,526 additions and 2,232 deletions.
lexmapr/pipeline.py (52 changes: 40 additions & 12 deletions)
@@ -33,6 +33,11 @@ def run(args):
# terms from online ontologies to lookup tables.
lookup_table = pipeline_resources.get_predefined_resources()

# Scientific names dictionary fetched from lookup tables.
# Todo: Move to ontology_lookup_table later
scientific_names_dict = pipeline_resources.get_resource_dict(
"foodon_ncbi_synonyms.csv")

# To contain resources fetched from online ontologies, if any.
# Will eventually be added to ``lookup_table``.
ontology_lookup_table = None
@@ -57,14 +62,20 @@
output_fields = [
"Sample_Id",
"Sample_Desc",
"Cleaned_Sample",
"Processed_Sample",
"Processed_Sample (With Scientific Name)",
"Matched_Components"
]

if args.full:
output_fields += [
"Match_Status(Macro Level)",
"Match_Status(Micro Level)"
"Match_Status(Micro Level)",
"Sample_Transformations"
]
else:
output_fields += [
"Match_Status(Macro Level)"
]

if args.bucket:
@@ -100,13 +111,15 @@ def run(args):
sample_id = row[0].strip()
original_sample = " ".join(row[1:]).strip()
cleaned_sample = ""
cleaned_sample_scientific_name = ""
matched_components = []
macro_status = "No Match"
micro_status = []
lexmapr_classification = []
lexmapr_bucket = []
third_party_bucket = []
third_party_classification = []
sample_conversion_status = {}

# Standardize sample to lowercase and with punctuation
# treatment.
@@ -116,26 +129,32 @@
sample_tokens = word_tokenize(sample)

# Get ``cleaned_sample``
for tkn in sample_tokens:
for token in sample_tokens:
# Ignore dates
if helpers.is_date(tkn) or helpers.is_number(tkn):
if helpers.is_date(token) or helpers.is_number(token):
continue
# Some preprocessing
tkn = helpers.preprocess(tkn)
token = helpers.preprocess(token)

lemma = helpers.singularize_token(tkn, lookup_table, micro_status)
lemma = helpers.singularize_token(token, lookup_table, micro_status)
lemma = helpers.spelling_correction(lemma, lookup_table, micro_status)
lemma = helpers.abbreviation_normalization_token(lemma, lookup_table, micro_status)
lemma = helpers.non_English_normalization_token(lemma, lookup_table, micro_status)

if not token == lemma:
sample_conversion_status[token] = lemma
cleaned_sample = helpers.get_cleaned_sample(cleaned_sample, lemma, lookup_table)
cleaned_sample = re.sub(' +', ' ', cleaned_sample)
cleaned_sample = helpers.abbreviation_normalization_phrase(cleaned_sample,
lookup_table, micro_status)
cleaned_sample = helpers.non_English_normalization_phrase(cleaned_sample, lookup_table,
micro_status)
cleaned_sample_scientific_name = helpers.get_annotated_sample(
cleaned_sample_scientific_name, lemma, scientific_names_dict)
cleaned_sample_scientific_name = re.sub(' +', ' ', cleaned_sample_scientific_name)

cleaned_sample = helpers.remove_duplicate_tokens(cleaned_sample)
cleaned_sample_scientific_name = helpers.remove_duplicate_tokens(
cleaned_sample_scientific_name)

# Attempt full term match
full_term_match = helpers.map_term(sample, lookup_table)
@@ -222,11 +241,11 @@ def run(args):
# We do need it, but perhaps the function could be
# simplified?
if len(matched_components):
matched_components = helpers.retainedPhrase(matched_components)
matched_components = helpers.retain_phrase(matched_components)

# Finalize micro_status
# TODO: This is ugly, so revisit after revisiting
# ``retainedPhrase``.
# ``retain_phrase``.
micro_status_covered_matches = set()
for component_match in component_matches:
possible_matched_component = component_match["term"] + ":" + component_match["id"]
@@ -249,17 +268,26 @@ def run(args):
third_party_classification = classification_result["ifsac_final_labels"]

# Write to row
matched_components = helpers.get_matched_component_standardized(matched_components)

# Get post-processed cleaned sample with embedded scientific
# name.
cleaned_sample_scientific_name = helpers.refine_sample_sc_name(
sample, cleaned_sample, cleaned_sample_scientific_name,
third_party_classification)

fw.write("\n" + sample_id + "\t" + original_sample + "\t" + cleaned_sample + "\t"
+ str(matched_components))
+ cleaned_sample_scientific_name + "\t" + str(matched_components) + "\t"
+ macro_status)

if args.full:
fw.write("\t" + macro_status + "\t" + str(micro_status))
fw.write("\t" + str(micro_status)+"\t" + str(sample_conversion_status))

if args.bucket:
if args.full:
fw.write("\t" + str(lexmapr_classification) + "\t" + str(lexmapr_bucket)
+ "\t" + str(third_party_bucket))
fw.write("\t" + str(sorted(third_party_classification)))
fw.write("\t" + str(third_party_classification))

fw.write('\n')
# Output files closed
