NOTE NLP table #85

Closed
clairblacketer opened this issue Jul 13, 2017 · 20 comments

@clairblacketer
Contributor

clairblacketer commented Jul 13, 2017

Addition of NOTE NLP table and new fields in NOTE table


Proposal

Relevant table: NOTE

NOTE table additions

| Field | Required | Type | Description |
|---|---|---|---|
| note_id | Yes | integer | A unique identifier for each note. |
| person_id | Yes | integer | A foreign key identifier to the Person about whom the Note was recorded. The demographic details of that Person are stored in the PERSON table. |
| note_date | Yes | date | The date the note was recorded. |
| note_datetime | No | datetime | The date and time the note was recorded. |
| note_type_concept_id | Yes | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the type, origin or provenance of the Note. |
| note_class_concept_id | Yes | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the HL7 LOINC Document Type Vocabulary classification of the note. |
| note_title | No | varchar(250) | The title of the Note as it appears in the source. |
| note_text | No | RDBMS-dependent text | The content of the Note. |
| encoding_concept_id | Yes | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the note character encoding type. |
| language_concept_id | Yes | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the language of the note. |
| provider_id | No | integer | A foreign key to the Provider in the PROVIDER table who took the Note. |
| note_source_value | No | varchar(50) | The source value associated with the origin of the note. |
| visit_occurrence_id | No | integer | A foreign key to the Visit in the VISIT_OCCURRENCE table when the Note was taken. |

New Fields

  • note_class_concept_id: a foreign key to the CONCEPT table to describe a standardized combination of five LOINC axes (role, domain, setting, type of service, and document kind). See Section 3 for description of mapping of clinical documents to Clinical Document Ontology (CDO) and standard terminology.
  • note_title: This field represents the title of a note.
  • encoding_concept_id: a foreign key to the predefined Concept in the Standardized Vocabularies reflecting the note character encoding type. Create the concepts in the CONCEPT table for note encoding type.
  • language_concept_id: a foreign key that refers to an identifier in the CONCEPT table for the note language. Use SNOMED qualifier concepts for all major languages.

Field Changes

The note_text type depends on the RDBMS because not all engines support CLOB; e.g. in MS SQL Server this will be VARCHAR(MAX).

Outstanding issues

note_id - convert to BIGINT due to a large table size.
Changing identifier fields from INT to BIGINT should be a larger group discussion/decision, as it would significantly affect all existing implementations. We should consider whether to change all the identifier fields or only a subset. CONDITION_OCCURRENCE and PROCEDURE_OCCURRENCE are likely to be even larger tables.

NOTE_NLP table

This table will encode all output of NLP on clinical notes. Each row represents a single extracted term from a note.

| Field | Required | Type | Description |
|---|---|---|---|
| note_nlp_id | Yes | Big Integer | A unique identifier for each term extracted from a note. |
| note_id | Yes | integer | A foreign key to the note in the NOTE table that the term was extracted from. |
| section_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies representing the section of the extracted term. |
| snippet | No | varchar(250) | A small window of text surrounding the term. |
| offset | No | varchar(50) | Character offset of the extracted term in the input note. |
| lexical_variant | Yes | varchar(250) | Raw text extracted from the NLP tool. |
| note_nlp_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the normalized concept for the extracted term. Domain of the term is represented as part of the Concept table. |
| note_nlp_source_concept_id | No | integer | A foreign key to a Concept that refers to the code in the source vocabulary used by the NLP system. |
| nlp_system | No | varchar(250) | Name and version of the NLP system that extracted the term. Useful for data provenance. |
| nlp_date | Yes | date | The date of the note processing. Useful for data provenance. |
| nlp_datetime | No | datetime | The date and time of the note processing. Useful for data provenance. |
| term_exists | No | varchar(1) | A summary modifier that signifies presence or absence of the term for a given patient. Useful for quick querying (see Term_exists below). |
| term_temporal | No | varchar(50) | An optional time modifier associated with the extracted term (for now “past” or “present” only). Standardize it later. |
| term_modifiers | No | varchar(2000) | A compact description of all the modifiers of the specific term extracted by the NLP system (e.g. “son has rash” → “negated=no,subject=family,certainty=undef,conditional=false,general=false”). |

Term_exists
Term_exists is defined as a flag that indicates if the patient actually has or had the condition. Any of the following modifiers would make Term_exists false:

  • Negation = true
  • Subject = [anything other than the patient]
  • Conditional = true
  • Rule_out = true
  • Uncertain = very low certainty or any lower certainties

A complete lack of modifiers would make Term_exists true.

For Term_exists to be true, any modifiers that are present must have these values:

  • Negation = false
  • Subject = patient
  • Conditional = false
  • Rule_out = false
  • Uncertain = true or high or moderate or even low (could argue about low)
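The rules above can be sketched as a small function. This is only an illustration under stated assumptions: the modifier keys (`negation`, `subject`, `conditional`, `rule_out`, `certainty`) and the certainty labels are hypothetical spellings, since the thread leaves the final list of modifiers and values to later standardization.

```python
from typing import Mapping

# Hypothetical labels for "very low certainty or lower"; not a standardized value set.
CERTAINTY_TOO_LOW = {"very low", "none"}


def term_exists(modifiers: Mapping[str, str]) -> bool:
    """Return False if any modifier rules the term out for the patient."""

    def is_true(key: str) -> bool:
        # A missing modifier is treated as "false", matching
        # "a complete lack of modifiers would make Term_exists true".
        return modifiers.get(key, "false").strip().lower() in {"true", "yes"}

    if is_true("negation") or is_true("conditional") or is_true("rule_out"):
        return False
    if modifiers.get("subject", "patient").strip().lower() != "patient":
        return False
    if modifiers.get("certainty", "high").strip().lower() in CERTAINTY_TOO_LOW:
        return False
    return True
```

Note that "low" certainty still counts as existing here, which follows the list above but is exactly the point the proposal flags as arguable.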

Term_temporal
Term_temporal is to indicate if a condition is “present” or just in the “past”.

The following would be past:

  • History = true
  • Concept_date = anything before the time of the report
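A minimal sketch of that rule, assuming the NLP system supplies a history flag and, optionally, a date attached to the concept. The parameter names `history`, `concept_date`, and `report_date` are made up for illustration.

```python
from datetime import date
from typing import Optional


def term_temporal(history: bool, concept_date: Optional[date], report_date: date) -> str:
    """Classify a term as "past" or "present", the only two values allowed for now."""
    if history:
        return "past"
    # A concept dated before the report counts as past.
    if concept_date is not None and concept_date < report_date:
        return "past"
    return "present"
```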

Term_modifiers
Term_modifiers will concatenate all modifiers for different types of entities (conditions, drugs, labs etc) into one string. Lab values will be saved as one of the modifiers. A list of allowable modifiers (e.g., signature for medications) and their possible values will be standardized later.
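As a rough sketch of how such a concatenated string could be written and read back, the helpers below use the `key=value` pairs separated by commas shown in the term_modifiers field description. The function names are hypothetical, and the sketch assumes that keys and values never contain `,` or `=`, which a real standard would have to address.

```python
from typing import Dict, Optional


def encode_modifiers(modifiers: Dict[str, str]) -> str:
    """Join modifiers into a single "key=value,key=value" string."""
    return ",".join(f"{key}={value}" for key, value in modifiers.items())


def decode_modifiers(text: Optional[str]) -> Dict[str, str]:
    """Split a term_modifiers string back into a dictionary."""
    result: Dict[str, str] = {}
    if not text:
        return result
    for part in text.split(","):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            result[key.strip()] = value.strip()
    return result


example = "negated=no,subject=family,certainty=undef,conditional=false,general=false"
decoded = decode_modifiers(example)
```

A lab value could be carried the same way as one more pair, but the allowable keys and values are still to be standardized.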

Mapping of clinical documents to Clinical Document Ontology (CDO) and standard terminology

HL7/LOINC CDO is a standard for consistent naming of documents to support a range of use cases: retrieval, organization, display, and exchange. It guides the creation of LOINC codes for clinical notes. CDO annotates each document with 5 dimensions:

  • Kind of Document: Characterizes the general structure of the document at a macro level (e.g. Anesthesia Consent)
  • Type of Service: Characterizes the kind of service or activity (e.g. evaluations, consultations, and summaries). The notion of time sequence, e.g., at the beginning (admission) at the end (discharge) is subsumed in this axis. Example: Discharge Teaching.
  • Setting: Setting is an extension of CMS’s definitions (e.g. Inpatient, Outpatient)
  • Subject Matter Domain (SMD): Characterizes the subject matter domain of a note (e.g. Anesthesiology)
  • Role: Characterizes the training or professional level of the author of the document, but does not break down to specialty or subspecialty (e.g. Physician)

Each combination of these 5 dimensions should roll up to a unique LOINC code. For example, Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1) has the following dimensions: Kind of Document = Note, Type of Service = Progress, Setting = Outpatient, Subject Matter Domain = Dentistry, and Role = Hygienist.

  • According to CDO requirements, only 2 of the 5 dimensions are required to properly annotate a document: Kind of Document and any one of the other 4 dimensions.
  • However, not all the permutations of the CDO dimensions will necessarily yield an existing LOINC code. The HL7/LOINC workforce is committed to establishing new LOINC codes for each newly encountered combination of CDO dimensions.

Automation of mapping of clinical notes to a standard terminology based on the note title is possible when it is driven by ontology (aka CDO). Mapping to individual LOINC codes which may or may not exist for a particular note type cannot be fully automated. To support mapping of clinical notes to CDO in OMOP CDM, we propose the following approach:

1. Add all LOINC concepts representing 5 CDO dimensions to the Concept table. For example:

| Field | Record 1 | Record 2 |
|---|---|---|
| concept_id | 55443322132 | 55443322175 |
| concept_name | Administrative note | Against medical advice note |
| concept_code | LP173418-7 | LP173388-2 |
| vocabulary_id | LOINC | LOINC |

2. Represent CDO hierarchy in the Concept_Relationship table using the “Subsumes” – “Is a” relationship pair. For example:

| Field | Record 1 | Record 2 |
|---|---|---|
| concept_id_1 | 55443322132 | 55443322175 |
| concept_id_2 | 55443322175 | 55443322132 |
| relationship_id | Subsumes | Is a |

3. Add LOINC document codes to the Concept table (e.g. Dentistry Hygienist Outpatient Progress note, LOINC code 34127-1). For example:

| Field | Record 1 | Record 2 |
|---|---|---|
| concept_id | 193240 | 193241 |
| concept_name | Dentistry Hygienist Outpatient Progress note | Consult note |
| concept_code | 34127-1 | 11488-4 |
| vocabulary_id | LOINC | LOINC |

4. Represent dimensions of each document concept in Concept_Relationship table by its relationships to the respective concepts from CDO. Use the “Member Of” – “Has Member” (new) relationship pair. Using example from the Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1):

| concept_id_1 | concept_id_2 | relationship_id |
|---|---|---|
| 193240 | 55443322132 | Member Of |
| 55443322132 | 193240 | Has Member |
| 193240 | 55443322175 | Member Of |
| 55443322175 | 193240 | Has Member |
| 193240 | 55443322166 | Member Of |
| 55443322166 | 193240 | Has Member |
| 193240 | 55443322107 | Member Of |
| 55443322107 | 193240 | Has Member |
| 193240 | 55443322146 | Member Of |
| 55443322146 | 193240 | Has Member |

Where concept codes represent the following concepts:

| concept_id | Description |
|---|---|
| 193240 | Corresponds to LOINC 34127-1, Dentistry Hygienist Outpatient Progress note |
| 55443322132 | Corresponds to LOINC LP173418-7, Kind of Document = Note |
| 55443322175 | Corresponds to LOINC LP173213-2, Type of Service = Progress |
| 55443322166 | Corresponds to LOINC LP173051-6, Setting = Outpatient |
| 55443322107 | Corresponds to LOINC LP172934-4, Subject Matter Domain = Dentistry |
| 55443322146 | Corresponds to LOINC LP173071-4, Role = Hygienist |

Most of the codes will not have all 5 dimensions. Therefore, they may be represented by 2-5 relationship pairs.
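The relationship pairs in step 4 can be generated mechanically from a document concept and its 2-5 dimension concepts. The helper below is a sketch only; `membership_rows` is a hypothetical name, and the concept IDs are the illustrative ones from the example above.

```python
from typing import Iterable, List, Tuple

# (concept_id_1, concept_id_2, relationship_id)
Row = Tuple[int, int, str]


def membership_rows(document_id: int, dimension_ids: Iterable[int]) -> List[Row]:
    """Build the "Member Of" / "Has Member" pair for every CDO dimension of a document."""
    rows: List[Row] = []
    for dimension_id in dimension_ids:
        rows.append((document_id, dimension_id, "Member Of"))
        rows.append((dimension_id, document_id, "Has Member"))
    return rows


rows = membership_rows(
    193240,
    [55443322132, 55443322175, 55443322166, 55443322107, 55443322146],
)
```

A document annotated with only two dimensions would simply yield four rows instead of ten.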

5. If LOINC does not have a code corresponding to a permutation of the 5 CDO dimensions encountered in the source, this code will be generated as an OMOP vocabulary code. Its relationships to the CDO dimensions will be represented exactly as those of existing LOINC concepts (as described above). If/when a proper LOINC code for this permutation is released, the old code should be deprecated. The transition between the old and new codes should be represented by “Concept replaces” – “Concept replaced by” pairs.

6. Mapping from the source data will be performed to the 2-5 CDO dimensions.

The query below finds the LOINC code for the Dentistry Hygienist Outpatient Progress note (see example above), which has all 5 dimensions. The HAVING clause keeps only documents linked to every listed dimension:

SELECT concept_id_2 FROM Concept_Relationship WHERE relationship_id = 'Has Member' AND concept_id_1 IN (55443322132, 55443322175, 55443322166, 55443322107, 55443322146) GROUP BY concept_id_2 HAVING COUNT(*) = 5

If fewer than 5 dimensions are available, list only those dimensions and set the HAVING count to n, the number of dimensions available, to get the record at the intersection of these dimensions:

SELECT concept_id_2 FROM Concept_Relationship WHERE relationship_id = 'Has Member' AND concept_id_1 IN (55443322132, 55443322175, 55443322146) GROUP BY concept_id_2 HAVING COUNT(*) = 3
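The intersection lookup described above can be tried against an in-memory SQLite database. This is a sketch under assumptions: the table holds only the three columns used here, document concept 999001 is a made-up concept with two dimensions added for contrast, and `documents_with_dimensions` is a hypothetical helper. It counts matching "Has Member" rows per document and keeps documents matching all supplied dimensions.

```python
import sqlite3
from typing import List, Sequence


def documents_with_dimensions(conn: sqlite3.Connection, dimension_ids: Sequence[int]) -> List[int]:
    """Return document concepts that are members of every given CDO dimension."""
    placeholders = ",".join("?" for _ in dimension_ids)
    sql = (
        "SELECT concept_id_2 FROM concept_relationship "
        "WHERE relationship_id = 'Has Member' "
        f"AND concept_id_1 IN ({placeholders}) "
        "GROUP BY concept_id_2 HAVING COUNT(*) = ? "
        "ORDER BY concept_id_2"
    )
    params = (*dimension_ids, len(dimension_ids))
    return [row[0] for row in conn.execute(sql, params).fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_relationship "
    "(concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT)"
)
DIMENSIONS = [55443322132, 55443322175, 55443322166, 55443322107, 55443322146]
data = [(d, 193240, "Has Member") for d in DIMENSIONS]
# Hypothetical second document that only has the first two dimensions.
data += [(d, 999001, "Has Member") for d in DIMENSIONS[:2]]
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", data)

all_five = documents_with_dimensions(conn, DIMENSIONS)
first_two = documents_with_dimensions(conn, DIMENSIONS[:2])
```

With all five dimensions only 193240 comes back, while the two-dimension lookup also returns the made-up concept. So a lookup on fewer dimensions may match several documents unless the result is further restricted to documents with no additional dimensions.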

To identify appropriate dimension while mapping source documents, use the following concept classes:

  • Note Provider Role
  • Note Domain
  • Note Setting
  • Note Service Type
  • Note Kind

The proposed approach will ensure that any combination of the 5 CDO dimensions encountered in the source data has a corresponding concept in the vocabulary. It will also support a consistent approach to the OMOP CDM/Vocabulary conventions:

  • One required _type_concept_id field will be populated in a corresponding domain table, NOTE.
  • Vocabulary-related attributes are stored in a vocabulary data model in a uniform way
  • Usage of a standard vocabulary, LOINC, is ensured where possible
  • Introduction of new OMOP concepts when a standard does not provide adequate coverage of the source data

A similar mapping approach can be applied to labs.

Use Cases

Example 1 - Left ventricular ejection fraction

Left ventricular ejection fraction is an important indicator of heart health. It is measured during echocardiogram procedures but also during a range of various procedures. The value is frequently reported in clinical reports and has to be extracted using natural language processing.

| Name | Value |
|---|---|
| Note_NLP_id | 123456 |
| note_id | 123446425 |
| section_concept_id | <foreign key to "Echocardiogram Report"> |
| snippet | ejection fraction was estimated at 60% |
| lexical_variant | ejection fraction |
| Note_NLP_concept_id | <foreign key to "Left Ventricular Ejection Fraction" concept> |
| NLP_system | EchoExtractor_EF(v.2016) |
| NLP_date | 3/30/16 |
| Term_exists | TRUE |
| Value_as_concept_id | null |
| Value_as_number | 60.0 |
| Unit_concept_id | <foreign key to "percent"> |
| Term_temporal | present |
| Term_modifiers | null |
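To make the example concrete, the sketch below stores this record in an in-memory SQLite table built from the proposed NOTE_NLP columns. Several things here are assumptions rather than part of the proposal: SQLite type names stand in for the RDBMS-specific ones, `offset` is quoted because it is an SQL keyword in several engines, TRUE is stored as 'Y' to fit varchar(1), 3/30/16 is read as 2016-03-30, and the concept foreign keys are left NULL because the example gives no IDs. The value and unit fields of the example are omitted since they are not columns of the proposed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE note_nlp (
        note_nlp_id                INTEGER PRIMARY KEY,
        note_id                    INTEGER NOT NULL,
        section_concept_id         INTEGER,
        snippet                    VARCHAR(250),
        "offset"                   VARCHAR(50),
        lexical_variant            VARCHAR(250) NOT NULL,
        note_nlp_concept_id        INTEGER,
        note_nlp_source_concept_id INTEGER,
        nlp_system                 VARCHAR(250),
        nlp_date                   DATE NOT NULL,
        nlp_datetime               DATETIME,
        term_exists                VARCHAR(1),
        term_temporal              VARCHAR(50),
        term_modifiers             VARCHAR(2000)
    )
    """
)

# Values taken from Example 1; columns not listed stay NULL.
conn.execute(
    """
    INSERT INTO note_nlp
        (note_nlp_id, note_id, snippet, lexical_variant,
         nlp_system, nlp_date, term_exists, term_temporal)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (123456, 123446425, "ejection fraction was estimated at 60%",
     "ejection fraction", "EchoExtractor_EF(v.2016)", "2016-03-30", "Y", "present"),
)

# Quick query on the summary flag: current, affirmed terms for one note.
present_terms = conn.execute(
    "SELECT lexical_variant FROM note_nlp "
    "WHERE note_id = ? AND term_exists = 'Y' AND term_temporal = 'present'",
    (123446425,),
).fetchall()
```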

Example 2 - eMERGE Phenotypes

Existence of specific report or specific note section

  1. Presence of a Pathology Report [Appendicitis].
  2. Must contain at least two Past Medical History sections and Medication lists (could substitute two non-acute clinic visits or requirement for annual physical) [Hypothyroidism].
  3. At least 1 abdominal CT or colonoscopy [Diverticulosis].
  4. Patients have to have had a colonoscopy [colonPolyp].
  5. Must have at least a problem list and/or note containing non-empty (can say “none”) medication list and past medical history before or immediately after the time of the ECG [QRS].

Term/Concept mentioning in notes or specific sections

  1. Positive result of inflammation and non-inflammation concept (CUI) in post-surgical biopsy report [Appendicitis].
  2. Reported History of Appendicitis [Appendicitis].
  3. Individual’s patient chart includes one or more mentions of ADHD or hyperkinesia [ADHD].
  4. SSTI cases must have the following or similar keywords in the text results of a bacterial culture lab test, such as skin, wound, boil, abscess, but also recognizing that anatomic sites (e.g. foot/hand/leg/buttock, etc.) [caMRSA].
  5. At least one diagnosis code for C. diff and at least one affirmative mention of C. diff infection (unqualified by negation, uncertainty, or historical reference) in progress notes [CDiff].
  6. Retrieve DSM-IV Symptom criteria (Social Interaction/Communication/Behavior, Interests and Activities) terms from notes to confirm Autism [Autism].
  7. Patient has colonoscopy without positive mention of diverticulosis as control [Diverticulosis].
  8. Positive mention of HF in the problem list through either NLP or structured problem list [HeartFailure].
  9. Cases are those that have polyps in any of their colonoscopy or associated pathology reports [colonPolyp].
  10. Notes contain no evidence of heart disease concepts (NLP for notes, Problem Lists at or near ECG time, ignoring Family Medical History and Allergy sections (using section tagger), ICD9 and CPT codes at or near ECG time describing heart disease) before ECG time or within one month following [QRS].

Related terms mentioning in the same line or adjacent lines

  1. Potential cases were identified if they contained at least one term from List 1 (terms identifying an ace-inhibitor, see below) AND List 2 (terms identifying cough, see below) on the same line (e.g., sentence) within the “Allergy section”, “Medication section” or within the entire “Patient summary section” of the EMR [ACEIcough].
  2. At least one non-negated “Disorder related terms” mention and “Anatomical site related terms” mention either in the SAME or adjacent sentences in a ‘section of interest’ [VTE].

Numeric values with/without temporal constraints

  1. Exclude all patients with an Ejection Fraction (EF or LVEF) <35% within 1 year before or after meeting the CASE 1 definition [ResHTN].
  2. Have evidence from a carotid imaging study of >50% carotid artery stenosis (at least unilaterally) [CAAD].
  3. Classify the type of HF using the numeric EF results (use the lowest EF recorded in the time window) [HeartFailure].
  4. In defining “Normal” ECG, QRSd between 65-120ms, ECG designated as “NORMAL”, Heart Rate between 50-100, ECG Impression must not contain evidence of heart disease concepts [QRS].
@clairblacketer clairblacketer added this to the CDM v5.2.0 milestone Jul 13, 2017
@schuemie
Member

Too late now I guess, but me and some other folks have requested the following (but not in the right places apparently, just in some e-mail communication):

Please split up note_nlp.term_modifiers into several predefined flags (as already mentioned above):

  • term_negated
  • term_subject
  • term_certainty

Others can go into an other_term_modifiers bucket, but these are common to almost all NLP systems and it makes no sense to have to parse a field you'll be using quite often.

@clairblacketer
Contributor Author

@schuemie Just to be sure I'm clear, would the new table look something like this:

NOTE_NLP

| Field | Required | Type | Description |
|---|---|---|---|
| note_nlp_id | Yes | Big Integer | A unique identifier for each term extracted from a note. |
| note_id | Yes | integer | A foreign key to the note in the NOTE table that the term was extracted from. |
| section_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies representing the section of the extracted term. |
| snippet | No | varchar(250) | A small window of text surrounding the term. |
| offset | No | varchar(50) | Character offset of the extracted term in the input note. |
| lexical_variant | Yes | varchar(250) | Raw text extracted from the NLP tool. |
| note_nlp_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the normalized concept for the extracted term. Domain of the term is represented as part of the Concept table. |
| note_nlp_source_concept_id | No | integer | A foreign key to a Concept that refers to the code in the source vocabulary used by the NLP system. |
| nlp_system | No | varchar(250) | Name and version of the NLP system that extracted the term. Useful for data provenance. |
| nlp_date | Yes | date | The date of the note processing. Useful for data provenance. |
| nlp_date_time | No | datetime | The date and time of the note processing. Useful for data provenance. |
| term_exists | No | varchar(1) | A summary modifier that signifies presence or absence of the term for a given patient. Useful for quick querying (see Term_exists in the proposal above). |
| term_temporal | No | varchar(50) | An optional time modifier associated with the extracted term (for now “past” or “present” only). Standardize it later. |
| term_negated | No | varchar(50) | |
| term_subject | No | varchar(50) | |
| term_certainty | No | varchar(50) | |
| other_term_modifiers | No | varchar(2000) | |

@cgreich
Contributor

cgreich commented Jul 14, 2017

@schuemie: Have you talked to Hua? Because he was running the entire subgroup that came up with the definition.

@pbr6cornell
Contributor

pbr6cornell commented Jul 14, 2017 via email

@hripcsa

hripcsa commented Jul 14, 2017 via email

@clairblacketer
Contributor Author

@schuemie, @cgreich, @pbr6cornell, @hripcsa I did contact Hua, Noemie and Rimma about this issue with these additional columns and I will post their response once I hear back from them.

@huaxu7000

As George mentioned, we have gone through extensive discussion with the modifier fields for the NLP table. We have voted and decided on the current version. So the plan is to go forward with the current table and we start implementing it. After we learn more, we can discuss and make changes in the future versions. thanks.

@clairblacketer
Contributor Author

clairblacketer commented Jul 14, 2017

So we will keep the table as proposed with the one modifier field.

clairblacketer added a commit that referenced this issue Jul 14, 2017
@schuemie
Member

Sorry, didn't mean to upset the decision making process.

I don't agree with the decision (I can make a good case at least for a bit-field term_negated), but I'm not a member of the NLP working group so I will respect its decision.

@hripcsa

hripcsa commented Jul 15, 2017 via email

@schuemie
Member

The context in which I was thinking of using the note_nlp table and negation flag is specifically the construction of features which would subsequently be used in (for example) prediction models. It seems to me that although the most important features would be things that are present, there would be considerable information in things that a doctor (or whoever wrote the note) took the trouble to negate, so I would additionally create negated features. The semantics of negation being a negative statement about something, so "... observed no rash..." or "... rule out pneumonia ..." would be examples of things (rash and pneumonia, resp.) that are negated. Whether those features are informative I don't know, I would have to see if the prediction algorithm selects them into the model. But my hypothesis is that they might be. But with the current structure of note_nlp, where negation isn't standardized, I cannot create these features.

@hripcsa

hripcsa commented Jul 16, 2017 via email

@schuemie
Member

I think the issues you mention are generic to NLP: we have no NLP that can figure out the full semantics of natural language, and we are likely to get the 'present' label wrong many times for the same reasons. Anyone using NLP output has to consider its noisy nature.

Despite my poor attempt at defining negation (I guess I meant 'ruled out'), it is a common concept in NLP, for example as implemented in NegEx. And although the boundaries of what negation means are perhaps vague, it suggests quite different semantics than non-negated things, and that distinction may be informative for example for a machine learning algorithm.

@cgreich
Contributor

cgreich commented Jul 16, 2017

Friends:

Usually it helps in these debates when you do concrete use cases. Then it is much easier to vote on adding the feature or not.

@huaxu7000

huaxu7000 commented Jul 16, 2017 via email

@schuemie
Member

@cgreich, the specific use case I have is this: We want to use NLP features in predictive models. More specifically, right now we want to fit propensity models in a Dutch GP EHR system. We have an algorithm for identifying negations, and I want to implement a covariate builder in the FeatureExtraction package that creates separate features for negated and non-negated terms, because I hypothesize there may be value in that (better predictions). I can then plug the covariate builder into CohortMethod.

Right now we would have to come up with a string we would put in the term_modifiers field, and FeatureExtraction would have to look for that string when creating features. But since that string is not standardized, another site will probably use a different string, so we can't create a covariate builder that automatically runs everywhere.

@hripcsa

hripcsa commented Jul 17, 2017 via email

@huaxu7000

Yes, we suggested a format like “negation: negated; uncertainty: certain”. You can query negation from this concatenated field. I agree the more problematic issue is standardizing the modifiers and their values. As different sites may use different NLP systems, the modifier outputs could differ, which makes it challenging to run studies across sites. It will take us some time to get everyone to agree on a standard for modifiers and their values.
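For what it's worth, querying negation from a field in that format could look roughly like the sketch below. It assumes the exact spelling quoted in this comment (`key: value` pairs separated by `;`, with `negation: negated` marking a negated term). The function names and the feature-name scheme are made up for illustration and would only work across sites once the modifier names and values are agreed on.

```python
from typing import Dict, Optional


def parse_modifiers(text: Optional[str]) -> Dict[str, str]:
    """Parse "negation: negated; uncertainty: certain" into a dictionary."""
    pairs: Dict[str, str] = {}
    for part in (text or "").split(";"):
        key, sep, value = part.partition(":")
        if sep:  # ignore fragments without a "key: value" shape
            pairs[key.strip().lower()] = value.strip().lower()
    return pairs


def feature_name(concept_id: int, term_modifiers: Optional[str]) -> str:
    """Name separate features for negated and non-negated mentions of a concept."""
    negated = parse_modifiers(term_modifiers).get("negation") == "negated"
    suffix = "negated" if negated else "present"
    return f"concept_{concept_id}_{suffix}"
```

A site that writes the same information with a different key or value would silently fall into the non-negated branch here, which is the cross-site problem discussed in this thread.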

@clairblacketer clairblacketer mentioned this issue Jul 20, 2017
clairblacketer added a commit that referenced this issue Jul 27, 2017
@clairblacketer
Contributor Author

Closing this issue as the NOTE_NLP table was added to CDM v5.2 as it appears at the top, though the discussion is still open

@clairblacketer
Contributor Author

for my reference - the document ontology referred to
DocumentOntology.xlsx

This was referenced Oct 11, 2018