# Covid-19 NLP pipeline

### The pipeline repository [link](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/)

### Introduction:
The primary objective of the NLP pipeline is to identify individuals who have been positively diagnosed with COVID-19 by extracting pertinent information from unstructured free-text narratives found within the Electronic Health Record (EHR) of the Department of Veterans Affairs (VA). By automating this process, the pipeline streamlines the screening of a substantial volume of clinical text, significantly reducing the time and effort required for identification.
The original pipeline is built on the medspaCy framework and defines its own UI.
Our goal is to rewrite the pipeline in the rgxlog language, as a real-world NLP example of the benefits of the rgxlog framework.

### Pipeline stages:
- [Concept tagger](#concept-tag-rules)
- [Target matcher](#target-rules)
- [Sectionizer](#section-rules)
- [Context matcher](#context-rules)
- [Postprocessor](#postprocess-rules)
- [Document Classifier](#document_classifier)

We will implement each stage separately later on.

### First, we need to install some requirements to work with [medspacy](https://github.com/medspacy/medspacy) framework 

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [2]:
import spacy

### Import what we need from the rgxlog framework and define some IE functions that will be used in every stage of the pipeline:

In [3]:
import re
import pandas as pd
from pandas import DataFrame
import rgxlog
from rgxlog import magic_session
from rgxlog import Session
from rgxlog.engine.datatypes.primitive_types import DataTypes
from rgxlog.engine.datatypes.primitive_types import Span
session = rgxlog.magic_session

In [4]:
def read_from_file(text_path):
    """
    Reads a file and returns its content.

    Parameters:
        text_path (str): The path to the text file to read from.

    Returns:
        str: The content of the file.
    """
    with open(f"{text_path}", 'r') as file:
        content = file.read()
    yield content
magic_session.register(ie_function=read_from_file,
                       ie_function_name = "read_from_file",
                       in_rel=[DataTypes.string],
                       out_rel=[DataTypes.string])

In [5]:
def resolve_interval_conflicts(replacements):
    """
    This function takes a list of replacements, where each replacement is represented
    as a list containing a label and a span (interval). It checks for conflicts among
    the intervals and returns a list of resolved replacements, ensuring that no two
    intervals overlap.

    Parameters:
    replacements (list of lists): A list of replacements, where each replacement
        is represented as a list [label, span].

    Returns:
    list of lists: A list of resolved replacements, where each replacement is a list
        [label, span], ensuring that there are no conflicts among intervals.
    """
    # Sort the replacements by the size of the spans in descending order
    replacements.sort(key=lambda x: x[1].span_end - x[1].span_start, reverse=True)

    # Initialize a list to keep track of intervals that have been replaced
    resolved_replacements = []
    
    for label, span in replacements:
        conflict = False

        for _, existing_span in resolved_replacements:
            existing_start = existing_span.span_start
            existing_end = existing_span.span_end

            if not (span.span_end <= existing_start or span.span_start >= existing_end):
                conflict = True
                break

        if not conflict:
            resolved_replacements.append([label, span])

    return resolved_replacements
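
To see the longest-span-first policy in action, here is a minimal, self-contained sketch. `FakeSpan` is a hypothetical stand-in for rgxlog's `Span` (only the `span_start` and `span_end` attributes read above are modelled), `keep_longest_non_overlapping` is a compact re-statement of the same greedy policy, and the candidate matches are invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for rgxlog's Span; only the two attributes
# read by resolve_interval_conflicts are modelled.
FakeSpan = namedtuple("FakeSpan", ["span_start", "span_end"])


def keep_longest_non_overlapping(replacements):
    """Same greedy policy as resolve_interval_conflicts: longest spans win."""
    ordered = sorted(replacements,
                     key=lambda r: r[1].span_end - r[1].span_start,
                     reverse=True)
    kept = []
    for label, span in ordered:
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = any(span.span_start < other.span_end and other.span_start < span.span_end
                       for _, other in kept)
        if not overlaps:
            kept.append([label, span])
    return kept


# Invented candidates: "novel coronavirus" [83, 100) contains "coronavirus" [89, 100).
candidates = [
    ["COVID-19", FakeSpan(89, 100)],
    ["COVID-19", FakeSpan(83, 100)],
    ["positive", FakeSpan(70, 78)],
]
resolved = keep_longest_non_overlapping(candidates)
print(resolved)
```

The shorter span nested inside the longer one is dropped, while the disjoint `positive` span is kept.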

In [6]:
def replace_spans(spans_table, paths_table):
    """
    This function takes the name of a spans table and the name of a paths table.
    For each path it queries the spans table through the session, resolves
    overlapping spans, and replaces each remaining span in the file with its label.
    The file is rewritten in place.

    Parameters:
    spans_table (str): Name of the spans table to process; its columns are formatted as (Label, Span, Path).
    paths_table (str): Name of the paths table to process; its single column is formatted as (Path).

    Returns:
    None: the adjusted text is written back to each file.
    """
    # Get a list of all the paths
    paths = session.run_commands(f"?{paths_table}(Path)", print_results=False, format_results=True)
    paths = paths[0].values.tolist()
    for path_list in paths:
        path = path_list[0]

        # Query the spans for this path; the result columns are (Label, Span)
        results = session.run_commands(f'?{spans_table}(Label, Span, "{path}")', print_results=True, format_results=True)
        if len(results[0]) == 0:
            continue
        # replacements is a list of lists, where each inner list is a [Label, Span]
        replacements = results[0].values.tolist()

        with open(f"{path}", 'r') as file:
            adjusted_string = file.read()

        # Resolve span conflicts
        resolved_replacements = resolve_interval_conflicts(replacements)

        # Sort the resolved replacements by start index in descending order, so that
        # replacing one span never shifts the offsets of the spans still to be processed
        resolved_replacements.sort(key=lambda x: x[1].span_start, reverse=True)

        # Iterate over the resolved replacements and replace each span with its label
        for replace_string, span in resolved_replacements:
            adjusted_string = adjusted_string[:span.span_start] + replace_string + adjusted_string[span.span_end:]
    
        with open(f"{path}", 'w') as file:
            file.writelines(adjusted_string)
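
The descending sort matters because a label and the span it replaces rarely have the same length. The sketch below replays the replacement step on the lemmatized text and spans that the sample run reports for sample3.txt later in this notebook. Plain `(start, end)` tuples stand in for rgxlog spans, and the helper name `apply_replacements` is hypothetical.

```python
def apply_replacements(text, replacements):
    """Replace each (start, end) span with its label, working right to left."""
    # Processing the right-most span first keeps the offsets of the
    # spans further left valid, even when label and span lengths differ.
    for label, (start, end) in sorted(replacements, key=lambda r: r[1][0], reverse=True):
        text = text[:start] + label + text[end:]
    return text


text = "Problem List : like_num . Pneumonia like_num . Novel Coronavirus like_num"
replacements = [
    ("associated_diagnosis", (26, 35)),  # covers "Pneumonia"
    ("COVID-19", (47, 64)),              # covers "Novel Coronavirus"
]
print(apply_replacements(text, replacements))
```

Applied left to right instead, the first replacement would shift the second span by the 11 characters that `associated_diagnosis` adds over `Pneumonia`.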

### The paths of the text files to be classified should be written in "files_paths.csv" file

In [7]:
%%bash
cat files_paths.csv

sample1.txt
sample2.txt
sample3.txt
sample4.txt
sample5.txt

In [8]:
session.import_relation_from_csv("files_paths.csv", relation_name="FilesPaths", delimiter=",")

In [9]:
%%rgxlog
FilesContent(Path, Content) <- FilesPaths(Path), read_from_file(Path) -> (Content)
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                                   Content
-------------+---------------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | Patient presents to be tested for COVID-19. His wife recently tested positive for novel coronavirus. SARS-COV-2 results came back positive.
 sample2.txt |                                         The patient was tested for COVID-19. Results are positive.
 sample3.txt |                                            Problem List: 1. Pneumonia 2. Novel Coronavirus 2019
 sample4.txt |                                                            neg covid education.
 sample5.txt |                                                         positive covid precaution.



<a id='concept-tag-rules'></a>
### [Concept Tag Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/concept_tag_rules.py):
Concept tag rules, also known as pattern-based rules or custom rules, are a way to specify and define patterns that an NLP (Natural Language Processing) system should recognize within text data. These rules are used to identify specific concepts or entities within text documents. In the context of MedSpaCy and medical NLP, concept tag rules are often used to identify medical entities and concepts accurately.

In the original project, the TargetRule class defines a rule for identifying a specific concept or entity in text.
Each concept TargetRule looks like this:

TargetRule(
            literal="coronavirus",
            category="COVID-19",
            pattern=[{"LOWER": {"REGEX": "coronavirus|hcov|ncov$"}}],
          )

**Literal** : This specifies the literal text or word that this rule is targeting.

**Category** : This specifies the category or label associated with the identified entity.

**Pattern** : This defines the pattern or conditions under which the entity should be recognized. It's a list of dictionaries specifying conditions for token matching. These rules sometimes use the lemma or POS attribute of each token. Documentation can be found at: https://spacy.io/usage/rule-based-matching.

Instead, we defined regex patterns and added them to the concept_tags_rules.csv file. There are two types of patterns, lemma and pos, each of which is implemented below.
Each rule in the CSV file has the form: regexPattern, label, type
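
As a rough illustration of how such a row can be read and applied, the sketch below splits two rows of `concept_tags_rules.csv` into their three fields and runs each pattern with Python's `re` module. The helper `match_rules` is hypothetical; in the notebook itself the matching is done by rgxlog's `py_rgx_span`.

```python
import re

# Two rows copied from concept_tags_rules.csv: regexPattern,label,type
ROWS = [
    r"(?i)(?:hcov|covid(?:(?:-)?(?:\s)?19|10)?|2019-cov|cov2|ncov-19|covd 19|no-cov|sars cov),COVID-19,lemma",
    r"(?i)(?:\+(?: ve)?|\(\+\)|positive|\bpos\b|active|confirmed),positive,lemma",
]


def match_rules(text, rows):
    """Return (label, start, end) for every match of every rule, ordered by start."""
    matches = []
    for row in rows:
        # These patterns contain no commas, so the last two commas delimit the fields.
        pattern, label, _rule_type = row.rsplit(",", 2)
        for m in re.finditer(pattern, text):
            matches.append((label, m.start(), m.end()))
    return sorted(matches, key=lambda item: item[1])


text = "The patient was tested for COVID-19. Results are positive."
for label, start, end in match_rules(text, ROWS):
    print(label, repr(text[start:end]))
```

Splitting from the right keeps the pattern intact even though it is full of regex punctuation; a pattern that itself contained a comma would need a quoted CSV field instead.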

In [10]:
%%bash
cat concept_tags_rules.csv

(?i)(?:hcov|covid(?:(?:-)?(?:\s)?19|10)?|2019-cov|cov2|ncov-19|covd 19|no-cov|sars cov),COVID-19,lemma
(?i)(?:coivid|(?:novel )?corona(?:virus)?(?: (?:20)?19)?|sars(?:\s)?(?:-)?(?:\s)?cov(?:id)?(?:-)?(?:2|19)),COVID-19,lemma
(?i)(?:\+(?: ve)?|\(\+\)|positive|\bpos\b|active|confirmed),positive,lemma
(?i)(?:pneum(?:onia)?|pna|hypoxia|septic shoc|ards\(?(?:(?:[12])/2)\)?|(?:hypoxemic|acute|severe)? resp(?:iratory)? failure(?:\(?(?:[12]/2)\)?)?)",associated_diagnosis,lemma
(?i)(?:(?:diagnos(?:is|ed)|dx(?:\.)?)(?:of|with)?),diagnosis,lemma
(?i)(?:^screen),screening,lemma
(?i)(?:in contact with|any one|co-worker|at work|(?:the|a)(?:wo)?man|(?:another|a) (?:pt|patient|pt\.)),other_experiencer,lemma
(?i)(?:patient|pt(?:\.)?|vt|veteran),patient,lemma
(?i)(?:like_num (?:days|day|weeks|week|months|month) (?:ago|prior)),timesx,lemma
(?i)(?:(?:antibody|antibodies|ab) test),antibody test,lemma
(?i)(?:(?:coronavirus|hcovs?|ncovs?|covs?)(?:\s)?(?:-)?(?:\s)?(?: infection)?(?: strain)?(?:\s)?(?:229(?:e)

In [11]:
session.import_relation_from_csv("concept_tags_rules.csv", relation_name="ConceptTagRules", delimiter=",")

#### Lemma Rules:
Lemma rules are rules that use the lemma_ attribute of each token. We defined the function below to lemmatize the text. Since most of the rules use only the raw text, we lemmatize only the tokens we need.

Example of a lemma rule from the original pipeline:

        TargetRule(
            "results positive",
            "positive",
            pattern=[
                {"LOWER": "results"},
                {"LEMMA": "be", "OP": "?"},
                {"LOWER": {"IN": ["pos", "positive"]}},
            ],
        ),
We use py_rgx_span to capture the patterns. The resulting spans are later passed to replace_spans, which replaces each span with the correct label.
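
For instance, once the text has been lemmatized, the token pattern above can be approximated by a plain regular expression. The regex below is an illustrative assumption, not a row copied from `concept_tags_rules.csv`; the input is the lemmatized text of sample2.txt as printed later in this notebook.

```python
import re

# Illustrative regex for: "results", optionally the lemma "be", then "pos" or "positive".
RESULTS_POSITIVE = r"(?i)\bresults(?: be)? (?:pos|positive)\b"

# Lemmatized text of sample2.txt, as printed by the notebook.
lemmatized = "The patient be tested for COVID-19 . Results be positive ."

match = re.search(RESULTS_POSITIVE, lemmatized)
print(match.span(), repr(match.group(0)))
```

The optional `{"LEMMA": "be", "OP": "?"}` token becomes the optional group `(?: be)?`, which only works because every form of "to be" has already been rewritten to `be` in the text.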

In [12]:
def lemmatize_text(text_path, lemma_words_path):
    """
    This function reads a text file, lemmatizes its content using spaCy's English language model,
    and replaces the listed words with their lemmas. Number-like tokens are replaced with "like_num",
    and all other tokens keep their original text. The updated text is then written back to the same file.

    Parameters:
        text_path (str): The path to the text file to be lemmatized.
        lemma_words_path (str): The path to a file listing the lemmas to substitute, one per line.

    Returns:
        str: The lemmatized text.
    """
    # Read the list of lemmas to substitute, closing the file afterwards
    with open(lemma_words_path) as lemma_file:
        lemma_words = [line.strip() for line in lemma_file if line.strip()]

    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    lemmatized_text = ""
    for token in doc:
        if token.lemma_ in lemma_words:
            lemmatized_text += token.lemma_
        elif token.like_num:
            lemmatized_text += "like_num"
        else:
            lemmatized_text += token.text
        lemmatized_text += " "

    # Write the lemmatized text back to the same file
    with open(text_path, 'w') as file:
        file.writelines(lemmatized_text)

    yield lemmatized_text
magic_session.register(ie_function=lemmatize_text, ie_function_name = "lemmatize_text", in_rel=[DataTypes.string, DataTypes.string], out_rel=[DataTypes.string])
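
The selection logic can be tried without loading a spaCy model. In this sketch `Token` is a hypothetical stand-in for a spaCy token (it carries only `text`, `lemma_`, and `like_num`), the token list is written by hand, and the lemma set `{"patient", "be"}` is an assumption inferred from the sample output, since the contents of `lemma_words.txt` are not shown.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes that
# lemmatize_text reads are modelled.
Token = namedtuple("Token", ["text", "lemma_", "like_num"])


def selective_lemmatize(tokens, lemma_words):
    """Keep raw text unless the lemma is listed; mask number-like tokens."""
    pieces = []
    for token in tokens:
        if token.lemma_ in lemma_words:
            pieces.append(token.lemma_)
        elif token.like_num:
            pieces.append("like_num")
        else:
            pieces.append(token.text)
    # A space follows every token, including the last one, as in lemmatize_text.
    return "".join(piece + " " for piece in pieces)


# Hand-written tokens for "Results are positive. 2 days ago"; the lemmas are assumed values.
tokens = [
    Token("Results", "result", False),
    Token("are", "be", False),
    Token("positive", "positive", False),
    Token(".", ".", False),
    Token("2", "2", True),
    Token("days", "day", False),
    Token("ago", "ago", False),
]
print(repr(selective_lemmatize(tokens, {"patient", "be"})))
```

Because every token is followed by a space, punctuation ends up separated from the preceding word and the text gains a trailing space. This appears to be why the rule files write section headers as `Problem List :` rather than `Problem List:`, and it means that spans refer to the rewritten text, not to the original file.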

In [13]:
%%rgxlog
lemma_texts(Path, LemmaText) <- FilesPaths(Path), lemmatize_text(Path, "lemma_words.txt") -> (LemmaText)
?lemma_texts(Path, LemmaText)

LemmaMatches(Label, Span, Path) <- lemma_texts(Path, Text), ConceptTagRules(Pattern, Label, "lemma"), py_rgx_span(Text, Pattern) -> (Span)

printing results for query 'lemma_texts(Path, LemmaText)':
    Path     |                                                                    LemmaText
-------------+--------------------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for novel coronavirus . SARS - COV-2 results came back positive .
 sample2.txt |                                            The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                                    Problem List : like_num . Pneumonia like_num . Novel Coronavirus like_num
 sample4.txt |                                                              neg covid education .
 sample5.txt |                                                           positive covid precaution .



In [14]:
# replace the matches with the correct label
replace_spans("LemmaMatches", "FilesPaths")

printing results for query 'LemmaMatches(Label, Span, "sample1.txt")':
  Label   |    Span
----------+------------
 COVID-19 |  [34, 42)
 COVID-19 | [103, 115)
 COVID-19 | [83, 100)
 positive | [134, 142)
 positive |  [70, 78)
 patient  |   [0, 7)

printing results for query 'LemmaMatches(Label, Span, "sample2.txt")':
  Label   |   Span
----------+----------
 COVID-19 | [26, 34)
 positive | [48, 56)
 patient  | [4, 11)

printing results for query 'LemmaMatches(Label, Span, "sample3.txt")':
        Label         |   Span
----------------------+----------
       COVID-19       | [47, 64)
 associated_diagnosis | [26, 35)

printing results for query 'LemmaMatches(Label, Span, "sample4.txt")':
  Label   |  Span
----------+--------
 COVID-19 | [4, 9)

printing results for query 'LemmaMatches(Label, Span, "sample5.txt")':
  Label   |  Span
----------+---------
 COVID-19 | [9, 14)
 positive | [0, 8)



In [15]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                     The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                             Problem List : like_num . associated_diagnosis like_num . COVID-19 like_num
 sample4.txt |                                                      neg COVID-19 education .
 sample5.txt |                                                   positive COVID-19 precaution .



#### POS Rules:
As mentioned above, these rules use the POS attribute of each token. Since only a small number of rules need it, we apply POS tagging only to the tokens we need.
Example of such a rule from the original pipeline:

        TargetRule(
            "other experiencer",
            category="other_experiencer",
            pattern=[
                {
                    "POS": {"IN": ["NOUN", "PROPN", "PRON", "ADJ"]},
                    "LOWER": {
                        "IN": [
                            "someone",
                            "somebody",
                            "person",
                            "anyone",
                            "anybody",
                        ]
                    },
                }
            ],
        ),

The patterns we define match the words listed under "IN". A word is captured only if its part-of-speech (POS) tag is one of ["NOUN", "PROPN", "PRON", "ADJ"]. To accomplish this, two functions are used: the first determines the POS of each token, and the second, py_rgx_span, captures the predefined patterns. After matching, we confirm the POS tags of the matched words by joining the two results on their spans.
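
The confirmation step amounts to a join on `(Span, Path)`: a regex match is kept only if a token with an allowed POS tag covers exactly the same span. A minimal pure-Python sketch of that join follows. The POS rows are copied from the sample output further down, the `family` match row is inferred from the final `POSRuleMatches` result, and plain tuples stand in for rgxlog spans.

```python
# (POS, span, path) rows, as reported by annotate_text_with_pos for sample1.txt.
pos_table = [
    ("ADJ", (0, 7), "sample1.txt"),
    ("NOUN", (49, 53), "sample1.txt"),
    ("NOUN", (103, 110), "sample1.txt"),
]

# (label, span, path) rows from the regex stage; this row is inferred, see above.
pos_matches = [
    ("family", (49, 53), "sample1.txt"),
]

# Index the POS-tagged tokens by (span, path), then keep the regex matches
# whose span coincides exactly with one of them.
tagged = {(span, path) for _pos, span, path in pos_table}
pos_rule_matches = [row for row in pos_matches if (row[1], row[2]) in tagged]
print(pos_rule_matches)
```

Since the join requires identical spans, a regex match that covers only part of a token, or more than one token, would be dropped at this step.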

In [16]:
def annotate_text_with_pos(text_path):
    """
    This function reads a text file, processes its content using spaCy's English language model,
    and yields a (POS, Span) tuple for each token whose POS is one of NOUN, PROPN, PRON, or ADJ;
    for any other token an empty tuple is yielded.
    
    Parameters:
        text_path (str): The path to the text file to be annotated.

    Returns:
        tuple(str, Span): The POS of the token and its span
    """
    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    for token in doc:
        if token.pos_ in ["NOUN", "PROPN", "PRON", "ADJ"]:
            yield token.pos_, Span(token.idx, token.idx + len(token.text))
        else:
            yield tuple()
magic_session.register(ie_function=annotate_text_with_pos, ie_function_name = "annotate_text_with_pos", in_rel=[DataTypes.string], out_rel=[DataTypes.string, DataTypes.span])

In [17]:
%%rgxlog
POSTable(POS, Span, Path) <- lemma_texts(Path, Text), annotate_text_with_pos(Path) -> (POS, Span)
?POSTable(POS, Span, Path)

POSMatches(Label, Span, Path) <- lemma_texts(Path, Text), ConceptTagRules(Pattern, Label, "pos"), py_rgx_span(Text, Pattern) -> (Span)
?POSMatches(Label, Span, Path)

POSRuleMatches(Label, Span, Path) <- POSTable(POS, Span, Path), POSMatches(Label, Span, Path)
?POSRuleMatches(Label, Span, Path)

printing results for query 'POSTable(POS, Span, Path)':
  POS  |    Span    |    Path
-------+------------+-------------
  ADJ  |   [0, 7)   | sample1.txt
  ADJ  | [121, 129) | sample1.txt
  ADJ  |  [70, 78)  | sample1.txt
 NOUN  | [103, 110) | sample1.txt
 NOUN  |  [49, 53)  | sample1.txt
 NOUN  |  [8, 16)   | sample1.txt
 PROPN |  [83, 91)  | sample1.txt
  ADJ  |  [48, 56)  | sample2.txt
 NOUN  |  [26, 34)  | sample2.txt
 NOUN  |  [37, 44)  | sample2.txt
 NOUN  |  [4, 11)   | sample2.txt
 NOUN  |  [15, 23)  | sample3.txt
 PROPN |   [0, 7)   | sample3.txt
 PROPN |  [26, 46)  | sample3.txt
 PROPN |  [47, 55)  | sample3.txt
 PROPN |  [67, 75)  | sample3.txt
 PROPN |  [8, 12)   | sample3.txt
 NOUN  |  [13, 22)  | sample4.txt
 PROPN |   [0, 3)   | sample4.txt
 PROPN |  [4, 12)   | sample4.txt
  ADJ  |   [0, 8)   | sample5.txt
 NOUN  |  [18, 28)  | sample5.txt

printing results for query 'POSMatches(Label, Span, Path)':
  Label  |   Span   |    Path
---------+----------+-------------
 fami

In [18]:
# replace the matches with the correct label
replace_spans("POSRuleMatches", "FilesPaths")

printing results for query 'POSRuleMatches(Label, Span, "sample1.txt")':
  Label  |   Span
---------+----------
 family  | [49, 53)

printing results for query 'POSRuleMatches(Label, Span, "sample2.txt")':
[]

printing results for query 'POSRuleMatches(Label, Span, "sample3.txt")':
[]

printing results for query 'POSRuleMatches(Label, Span, "sample4.txt")':
[]

printing results for query 'POSRuleMatches(Label, Span, "sample5.txt")':
[]



In [19]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                                Content
-------------+---------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His family recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                      The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                              Problem List : like_num . associated_diagnosis like_num . COVID-19 like_num
 sample4.txt |                                                       neg COVID-19 education .
 sample5.txt |                                                    positive COVID-19 precaution .



<a id='target-rules'></a>
### [Target Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/target_rules.py):
These rules use the labels assigned by the concept tagger to capture more complex patterns and assign a label to them, in order to reduce the number of false positives.
Each rule looks like this:

        TargetRule(
            literal="coronavirus screening",
            category="IGNORE",
            pattern=[
                {"_": {"concept_tag": "COVID-19"}},
                {"LOWER": {"IN": ["screen", "screening", "screenings"]}},
            ],
        ),
Since we replaced the spans we found with the corresponding labels, we don't need the concept_tag attribute of the token/span.
To simplify the patterns, we have divided them into two groups: PreTargetRules and TargetRules.

### PreTargetRules:
To simplify the process, we implemented PreTarget rules that squash consecutive identical labels assigned by the concept tagger into a single label.

In [20]:
%%bash
cat pre_target_rules.csv

(?i)(?:COVID-19(?: COVID-19)+),COVID-19
(?i)(?:positive(?: positive)+),positive
(?i)(?:patient(?: patient)+),patient
(?i)(?:other_experiencer(?: other_experiencer)+),other_experiencer
(?i)(?:screening(?: screening)+),screening
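
The effect of these rules can be reproduced with `re.sub`. The input sentence below is invented for illustration (none of the five sample files contains repeated labels), only two of the rows are used, and the helper name `squash_repeated_labels` is hypothetical.

```python
import re

# Two rows from pre_target_rules.csv as (pattern, label) pairs.
PRE_TARGET_RULES = [
    (r"(?i)(?:COVID-19(?: COVID-19)+)", "COVID-19"),
    (r"(?i)(?:positive(?: positive)+)", "positive"),
]


def squash_repeated_labels(text):
    """Collapse runs of the same label, separated by single spaces, into one label."""
    for pattern, label in PRE_TARGET_RULES:
        text = re.sub(pattern, label, text)
    return text


# Invented input: the concept tagger produced adjacent duplicate labels.
print(squash_repeated_labels("patient tested positive positive for COVID-19 COVID-19 COVID-19 ."))
```

The `+` quantifier requires at least one repetition, so a label that occurs only once is left untouched.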

In [21]:
session.import_relation_from_csv("pre_target_rules.csv", relation_name="PreTargetTagRules", delimiter=",")

In [22]:
%%rgxlog
PreTargetMatches(Label, Span, Path) <- lemma_texts(Path, Text), PreTargetTagRules(Pattern, Label), py_rgx_span(Text,Pattern) -> (Span)
?PreTargetMatches(Label, Span, Path)

printing results for query 'PreTargetMatches(Label, Span, Path)':
[]



In [23]:
replace_spans("PreTargetMatches", "FilesPaths")

printing results for query 'PreTargetMatches(Label, Span, "sample1.txt")':
[]

printing results for query 'PreTargetMatches(Label, Span, "sample2.txt")':
[]

printing results for query 'PreTargetMatches(Label, Span, "sample3.txt")':
[]

printing results for query 'PreTargetMatches(Label, Span, "sample4.txt")':
[]

printing results for query 'PreTargetMatches(Label, Span, "sample5.txt")':
[]



In [24]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                                Content
-------------+---------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His family recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                      The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                              Problem List : like_num . associated_diagnosis like_num . COVID-19 like_num
 sample4.txt |                                                       neg COVID-19 education .
 sample5.txt |                                                    positive COVID-19 precaution .



### Target Rules:

In [25]:
%%bash
cat target_rules.csv

(?i)(?:COVID-19 positive (?:unit|floor)|positive COVID-19 (?:unit|floor|exposure)),COVID-19
(?i)(?:known(?: positive)? COVID-19(?: positive)? (?:exposure|contact)),COVID-19
(?i)(?:COVID-19 positive screening|positive COVID-19 screening|screening COVID-19 positive|screening positive COVID-19),positive coronavirus screening
(?i)(?:diagnosis : COVID-19 (?:test|screening)),COVID-19
(?i)(?:COVID-19 screening),coronavirus screening
(?i)(?:active COVID-19 precaution|droplet isolation precaution|positive for (?:flu|influenza)|(?:the|a) positive case|results are confirm),1 2 3
(?i)(?:exposed to positive|[ ] COVID-19|age like_num(?: )?\+|(?:return|back) to work|COVID-19 infection rate),1 2 3
(?i)(?:COVID-19 (?:restriction|emergency|epidemic|outbreak|crisis|breakout|pandemic|spread|screening)|droplet precaution),1 2
(?i)(?:contact precautions|positive (?:flu|influenza)|positive (?:patient|person)|confirm (?:with|w/(?:/)?|w)|(?:the|positive) case),1 2
(?i)(?:results confirm|(?:neg|pos)\S+ pressure

In [26]:
session.import_relation_from_csv("target_rules.csv", relation_name="TargetTagRules", delimiter=",")

In [27]:
%%rgxlog
TargetTagMatches(Label, Span, Path) <- lemma_texts(Path, Text), TargetTagRules(Pattern, Label), py_rgx_span(Text,Pattern) -> (Span)

In [28]:
replace_spans("TargetTagMatches", "FilesPaths")

printing results for query 'TargetTagMatches(Label, Span, "sample1.txt")':
[]

printing results for query 'TargetTagMatches(Label, Span, "sample2.txt")':
[]

printing results for query 'TargetTagMatches(Label, Span, "sample3.txt")':
[]

printing results for query 'TargetTagMatches(Label, Span, "sample4.txt")':
[]

printing results for query 'TargetTagMatches(Label, Span, "sample5.txt")':
[]



In [29]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                                Content
-------------+---------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His family recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                      The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                              Problem List : like_num . associated_diagnosis like_num . COVID-19 like_num
 sample4.txt |                                                       neg COVID-19 education .
 sample5.txt |                                                    positive COVID-19 precaution .



<a id='section-rules'></a>
### [Section Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/section_rules.py):
Here, we add a section detection component that defines rules for detecting section titles, which usually end with a colon.
Section rules identify specific section names, so that the text can be split into parts. COVID-19 mentions occurring in certain sections are considered positive.

In the original project, the SectionRule class was used to define rules for identifying specific section text. Each SectionRule has the following structure:

      SectionRule(category="problem_list", literal="Active Problem List:"),
      SectionRule(category="problem_list", literal="Current Problems:"),
    
    
**Literal** : This specifies the literal section text or word that this rule is targeting.

**Category** : This specifies the section category associated with the identified section.




Similar to the approach used in the concept tagger stage, regex patterns were derived from these literals. These patterns are stored in the 'section_rules.csv' file and are used to match section titles and replace them with the appropriate category.

Each rule in the CSV file follows this format: regexPattern, sectionLabel

In [30]:
%%bash
cat section_rules.csv

(?i)(?:Lab results :),labs :
(?i)(?:Addendum :),addendum :
(?i)(?:(?:ALLERGIC REACTIONS|ALLERGIES) :),allergies :
(?i)(?:(?:CC|Chief Complaint) :),chief_complaint :
(?i)(?:COMMENTS :),comments :
(?i)(?:(?:(?:ADMISSION )?DIAGNOSES|Diagnosis|Primary Diagnosis|Primary|Secondary(?: (?:Diagnoses|Diagnosis))) :),diagnoses :
(?i)(?:(?:Brief Hospital Course|CONCISE SUMMARY OF HOSPITAL COURSE BY ISSUE/SYSTEM|HOSPITAL COURSE|SUMMARY OF HOSPITAL COURSE) :),hospital_course :
(?i)(?:(?:Imaging|MRI|INTERPRETATION|Radiology) :),imaging :
(?i)(?:(?:ADMISSION LABS|Discharge Labs|ECHO|Findings|INDICATION|Labs|Micro|Microbiology|Studies|Pertinent Results) :),labs_and_studies :
(?i)(?:(?:ACTIVE MEDICATIONS(?: LIST)|ADMISSION MEDICATIONS|CURRENT MEDICATIONS|DISCHARGE MEDICATIONS|HOME MEDICATIONS|MEDICATIONS) :),medications :
(?i)(?:(?:MEDICATIONS AT HOME|MEDICATIONS LIST|MEDICATIONS ON ADMISSION|MEDICATIONS ON DISCHARGE|MEDICATIONS ON TRANSFER|MEDICATIONS PRIOR TO ADMISSION) :),medications :
(?i)(?:Neuro :

In [31]:
session.import_relation_from_csv("section_rules.csv", relation_name="SectionRules", delimiter=",")

In [32]:
def is_span_contained(span1, span2):
    """
    Checks if one span is contained within the other span and returns the smaller span if yes.

    Parameters:
        span1 (span)
        span2 (span)

    Returns:
        span: span1 if contained within span2 or vice versa; nothing is yielded if neither span contains the other.
    """
    start1, end1 = span1.span_start, span1.span_end
    start2, end2 = span2.span_start, span2.span_end
    
    if start2 <= start1 and end1 <= end2:
        yield span1
        
    elif start1 <= start2 and end2 <= end1:
        yield span2

magic_session.register(is_span_contained, "is_span_contained", in_rel=[DataTypes.span, DataTypes.span], out_rel=[DataTypes.span])

In [33]:
%%rgxlog
SectionRulesMatches(Label, Span, Path) <- lemma_texts(Path, Text), SectionRules(Pattern, Label), py_rgx_span(Text,Pattern) -> (Span)
?SectionRulesMatches(Label, Span, Path)

printing results for query 'SectionRulesMatches(Label, Span, Path)':
     Label      |  Span   |    Path
----------------+---------+-------------
 problem_list : | [0, 14) | sample3.txt



In [34]:
replace_spans("SectionRulesMatches", "FilesPaths")

printing results for query 'SectionRulesMatches(Label, Span, "sample1.txt")':
[]

printing results for query 'SectionRulesMatches(Label, Span, "sample2.txt")':
[]

printing results for query 'SectionRulesMatches(Label, Span, "sample3.txt")':
     Label      |  Span
----------------+---------
 problem_list : | [0, 14)

printing results for query 'SectionRulesMatches(Label, Span, "sample4.txt")':
[]

printing results for query 'SectionRulesMatches(Label, Span, "sample5.txt")':
[]



In [35]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                                Content
-------------+---------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His family recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                      The patient be tested for COVID-19 . Results be positive .
 sample3.txt |                              problem_list : like_num . associated_diagnosis like_num . COVID-19 like_num
 sample4.txt |                                                       neg COVID-19 education .
 sample5.txt |                                                    positive COVID-19 precaution .



### Attribute Assertion:

 Next, we will explore how to assert attributes indicating whether a mention of COVID-19 is positive or not. In our project, we have created a table named 'CovidAttributes' that contains all attributes for each COVID-19 mention. This table will be used for classifying documents.

In [36]:
%%rgxlog
# Here, we use a pattern to identify entities that appear in specific sections, mark them as positive,
# and add them to the 'CovidAttributes' table.

pattern = "(?i)(?:diagnoses :|observation_and_plan :|past_medical_history :|problem_list :)(?:(?!labs :|addendum :|allergies :|chief_complaint :|comments :|family_history :|hospital_course :|imaging :|labs_and_studies :|medications :|neurological :|other :|patient_education :|physical_exam :|reason_for_examination :|signature :|social_history :).)*"
new SectionRulesAttribute(str, str)
SectionRulesAttribute(pattern, "positive")

SectionMatches(Path, Span, CovidAttribute) <- lemma_texts(Path, Text), SectionRulesAttribute(Pattern, CovidAttribute), py_rgx_span(Text, Pattern) -> (Span)
?SectionMatches(Path, Span, CovidAttribute)

CovidMatches(Path, Span) <- lemma_texts(Path, Text), py_rgx_span(Text, "COVID-19") -> (Span)
?CovidMatches(Path, Span)

printing results for query 'SectionMatches(Path, Span, CovidAttribute)':
    Path     |  Span   |  CovidAttribute
-------------+---------+------------------
 sample3.txt | [0, 76) |     positive

printing results for query 'CovidMatches(Path, Span)':
    Path     |   Span
-------------+-----------
 sample1.txt | [34, 42)
 sample1.txt | [85, 93)
 sample1.txt | [96, 104)
 sample2.txt | [26, 34)
 sample3.txt | [58, 66)
 sample4.txt |  [4, 12)
 sample5.txt |  [9, 17)



In [37]:
%%rgxlog
SectionCovidAttributes(Path, CovidSpan, CovidAttribute) <- SectionMatches(Path, Span1, CovidAttribute), CovidMatches(Path, Span2), is_span_contained(Span1, Span2) -> (CovidSpan)
?SectionCovidAttributes(Path, CovidSpan, CovidAttribute)

printing results for query 'SectionCovidAttributes(Path, CovidSpan, CovidAttribute)':
    Path     |  CovidSpan  |  CovidAttribute
-------------+-------------+------------------
 sample3.txt |  [58, 66)   |     positive
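
The section rule and the containment check above can be traced in plain Python. The following is only a sketch: the pattern is a shortened, hypothetical version of the real one (two "positive" headers, two terminating headers), the example text is made up, and `contained` is an illustrative stand-in for `is_span_contained`, not the rgxlog implementation.

```python
import re

# Shortened, hypothetical version of the section rule: a "positive" header
# followed by any characters, stopping before the next terminating header.
SECTION_PATTERN = re.compile(
    r"(?i)(?:problem_list :|diagnoses :)"
    r"(?:(?!labs :|family_history :).)*"
)

def contained(outer, inner):
    """Return True when the half-open span `inner` lies inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Made-up text with one mention inside the section and one outside it.
text = "problem_list : like_num . COVID-19 like_num family_history : COVID-19"

section_spans = [m.span() for m in SECTION_PATTERN.finditer(text)]
covid_spans = [m.span() for m in re.finditer(r"COVID-19", text)]

# Keep only the mentions that fall inside a "positive" section.
positive_mentions = [
    c for c in covid_spans if any(contained(s, c) for s in section_spans)
]
print(section_spans, covid_spans, positive_mentions)
```

The negative lookahead inside the repeated group is what ends the section: the match stops at the first position where a terminating header begins, so the second mention is left out.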



### Tokenizing the Text into Sentences:

The following stages, which assign attributes to COVID-19 mentions, differ from the previous ones: patterns are no longer applied to the entire text but to individual sentences, because the attributes of a COVID-19 mention are typically determined by the sentence in which it appears. The text is therefore split into sentences using spaCy's English language model, through ie functions and relations.

In [38]:
def sent_tokenization(text_path):
    """
    This function reads a text file, processes its content using spaCy's English language model,
    tokenizing it into sentences and returns each individual sentence in the processed text using a generator.
    
    Parameters:
        text_path (str): The path to the text file to be annotated.

    Yields:
        str: Individual sentences extracted from the input text.
    """
    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    for sentence in doc.sents:
        yield sentence.text

magic_session.register(ie_function=sent_tokenization, ie_function_name = "sent_tokenization", in_rel=[DataTypes.string], out_rel=[DataTypes.string])

In [39]:
%%rgxlog
#Sentences of the text
Sents(Path, Sent) <- FilesPaths(Path), sent_tokenization(Path) -> (Sent)
?Sents(Path, Sent)

#SentSpan is the span of the sentence in the text
SentSpans(Path, Sent, SentSpan) <- lemma_texts(Path, Text), Sents(Path, Sent), py_rgx_span(Text, Sent) -> (SentSpan)
?SentSpans(Path, Sent, SentSpan)

printing results for query 'Sents(Path, Sent)':
    Path     |                        Sent
-------------+----------------------------------------------------
 sample1.txt |       COVID-19 results came back positive .
 sample1.txt | His family recently tested positive for COVID-19 .
 sample1.txt |    patient presents to be tested for COVID-19 .
 sample2.txt |               Results be positive .
 sample2.txt |        The patient be tested for COVID-19 .
 sample3.txt |                 COVID-19 like_num
 sample3.txt |          associated_diagnosis like_num .
 sample3.txt |             problem_list : like_num .
 sample4.txt |              neg COVID-19 education .
 sample5.txt |           positive COVID-19 precaution .

printing results for query 'SentSpans(Path, Sent, SentSpan)':
    Path     |                        Sent                        |  SentSpan
-------------+----------------------------------------------------+------------
 sample1.txt |       COVID-19 results came back positive
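
`SentSpans` locates each sentence by passing the sentence itself to `py_rgx_span` as the pattern. Assuming that `py_rgx_span` interprets its second argument as a regular expression (the name suggests this, but it is an assumption here), a sentence containing metacharacters such as parentheses may fail to match its own text. The sketch below shows the effect with Python's `re` module on a made-up sentence, and how escaping avoids it.

```python
import re

# Made-up lemmatized text; the first sentence contains regex metacharacters.
text = "COVID-19 ( pcr ) positive . follow up ?"
sentence = "COVID-19 ( pcr ) positive ."

# Used as a raw pattern, the parentheses are read as a group and "." as a
# wildcard, so the sentence does not match its own literal text.
raw_match = re.search(sentence, text)

# Escaping turns every metacharacter into a literal character.
escaped_match = re.search(re.escape(sentence), text)
sentence_span = escaped_match.span()

print(raw_match, sentence_span)
```

The sample files in this notebook contain no such characters, so the issue does not show up in the outputs here.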

In [40]:
def get_relative_span(span1, span2):
    """
    Computes the position of the contained span relative to the start of the span that contains it.

    Parameters:
        span1 (Span): The first span object.
        span2 (Span): The second span object.

    Yields:
        Span: The contained span, re-expressed relative to the containing span.
        Nothing is yielded if neither span contains the other.
    """
    start1, end1 = span1.span_start, span1.span_end
    start2, end2 = span2.span_start, span2.span_end
    
    if start2 <= start1 and end1 <= end2:
        yield Span(span1.span_start - span2.span_start, span1.span_end - span2.span_start)
        
    elif start1 <= start2 and end2 <= end1:
        yield Span(span2.span_start - span1.span_start, span2.span_end - span1.span_start)

magic_session.register(get_relative_span, "get_relative_span", in_rel=[DataTypes.span, DataTypes.span], out_rel=[DataTypes.span])

In [41]:
%%rgxlog
CovidAttributes(Path, CovidSpan, CovidAttribute, Sent) <- SectionCovidAttributes(Path, AbsCovidSpan, CovidAttribute),\
SentSpans(Path, Sent, SentSpan) ,get_relative_span(AbsCovidSpan, SentSpan) -> (CovidSpan)
?CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)

printing results for query 'CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)':
    Path     |  CovidSpan  |  CovidAttribute  |       Sent
-------------+-------------+------------------+-------------------
 sample3.txt |   [0, 8)    |     positive     | COVID-19 like_num
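
As a worked example of the offset arithmetic, here is a simplified sketch that uses plain tuples instead of rgxlog `Span` objects. The helper name `relative_span` is hypothetical and only handles the case where the first span is the contained one; the sentence offsets (58, 75) are an assumption, since the `SentSpans` output is not shown in full above.

```python
def relative_span(inner, outer):
    """Re-express the half-open span `inner` relative to the start of `outer`.

    Returns None when `inner` is not contained in `outer`.
    """
    if outer[0] <= inner[0] and inner[1] <= outer[1]:
        return (inner[0] - outer[0], inner[1] - outer[0])
    return None

abs_covid_span = (58, 66)  # COVID-19 mention in sample3.txt, document offsets
sentence_span = (58, 75)   # assumed offsets of the sentence "COVID-19 like_num"

print(relative_span(abs_covid_span, sentence_span))
```

Subtracting the sentence start from both ends turns the document-level span into the sentence-level span reported in the table above.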



<a id='context-rules'></a>
### [Context Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/context_rules.py):
These rules assign an attribute to each COVID-19 mention based on its context. The attributes are used later to classify each text.

An example of such a rule:

    ConTextRule(
        literal="Not Detected",
        category="NEGATED_EXISTENCE",
        direction="BACKWARD",
        pattern=[
            {"LOWER": {"IN": ["not", "non"]}},
            {"IS_SPACE": True, "OP": "*"},
            {"TEXT": "-", "OP": "?"},
            {"LOWER": {"REGEX": "detecte?d"}},
        ],
        allowed_types={"COVID-19"},
    ),
   **direction** specifies whether the allowed_types should appear before or after the pattern, and
   **allowed_types** specifies the labels to which the rule applies.
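
To illustrate how such a rule can be expressed as a single regular expression, the sketch below uses a simplified variant of the "not detected" pattern from `context_rules.csv` (the `<IGNORE>` lookahead is omitted). The name `NOT_DETECTED` and both example sentences are made up for illustration.

```python
import re

# Simplified regex counterpart of the "Not Detected" rule. The COVID-19 mention
# must come first, optionally followed by other tokens, which mirrors the
# BACKWARD direction: the modifier applies to a mention that precedes it.
NOT_DETECTED = re.compile(r"(?i)COVID-19(?: \S+)*? (?:not|non) (?:- )?detecte?d")

after_mention = NOT_DETECTED.search("COVID-19 pcr not detected .")
before_mention = NOT_DETECTED.search("not detected in flu panel . COVID-19 pending .")

print(after_mention, before_mention)
```

Because the mention is part of the pattern, the direction and the allowed type are both encoded in the regex itself, and a modifier that appears before the mention does not match.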

In [42]:
%%bash
cat context_rules.csv

(?i)(?:positive COVID-19|COVID-19 (?:\([^)]*\)) (?:positive|detected)|COVID-19(?: positive)? associated_diagnosis)#positive
(?i)(?:COVID-19 status : positive)#positive
(?i)(?:associated_diagnosis COVID-19|associated_diagnosis (?:with|w|w//|from) (?:associated_diagnosis )?COVID-19)#positive
(?i)(?:COVID-19 positive(?: patient| precaution)?|associated_diagnosis (?:due|secondary) to COVID-19)#positive
(?i)(?:(?:current|recent) COVID-19 diagnosis)#positive
(?i)(?:COVID-19 (?:- )?related (?:admission|associated_diagnosis)|admitted (?:due to|(?:with|w|w/)) COVID-19)#positive
(?i)(?:COVID-19 infection|b34(?:\.)?2|b97.29|u07.1)#positive
(?i)(?:COVID-19 eval(?:uation)?|(?:positive )? COVID-19 symptoms|rule out COVID-19)#uncertain
(?i)(?:patient (?:do )?have COVID-19)#positive
(?i)(?:diagnosis : COVID-19(?: (?:test|screen)(?:ing|ed|s)? positive)?(?: positive)?)#positive
(?i)(?:COVID-19(?: (?!<IGNORE>)\S+)*? (?:not|non) (?:- )?detecte?d)#negated
(?i)(?:COVID-19(?: (?!<IGNORE>)\S+){0,1} negative s

In [43]:
session.import_relation_from_csv("context_rules.csv", relation_name="ContextRules", delimiter="#")
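
The rule files use `#` rather than a comma to separate the pattern from the attribute. A plausible reason (an assumption, not stated in the original project) is that regex patterns can contain commas, for example in quantifiers such as `{0,2}`. The sketch below parses two hypothetical rule lines with Python's `csv` module to show the difference; it does not use rgxlog.

```python
import csv
import io

# Two hypothetical rule lines in the same "pattern#attribute" layout.
RULE_LINES = (
    r"(?i)COVID-19(?: \S+){0,2} negative#negated" "\n"
    r"(?i)positive COVID-19#positive" "\n"
)

# Splitting on "#" keeps each pattern intact.
rules = [tuple(row) for row in csv.reader(io.StringIO(RULE_LINES), delimiter="#")]

# Splitting the first line on "," would cut the pattern inside the quantifier.
comma_split = next(csv.reader(io.StringIO(RULE_LINES)))

print(rules)
print(comma_split)
```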

In [44]:
%%rgxlog
#covid_attributes: negated, other_experiencer, is_future, not_relevant, uncertain, positive
ContextMatches(CovidAttribute, Span, Path, Sent) <- Sents(Path, Sent), ContextRules(Pattern, CovidAttribute),\
py_rgx_span(Sent, Pattern) -> (Span)
?ContextMatches(CovidAttribute, Span, Path, Sent)

CovidSpans(Path, Span, Sent) <- Sents(Path, Sent), py_rgx_span(Sent, "COVID-19") -> (Span)
?CovidSpans(Path, Span, Sent)

printing results for query 'ContextMatches(CovidAttribute, Span, Path, Sent)':
  CovidAttribute  |   Span   |    Path     |                        Sent
------------------+----------+-------------+----------------------------------------------------
     positive     | [0, 35)  | sample1.txt |       COVID-19 results came back positive .
     positive     | [27, 48) | sample1.txt | His family recently tested positive for COVID-19 .
     negated      | [4, 48)  | sample1.txt | His family recently tested positive for COVID-19 .
     negated      | [0, 12)  | sample4.txt |              neg COVID-19 education .
      future      | [4, 22)  | sample4.txt |              neg COVID-19 education .
     positive     | [0, 17)  | sample5.txt |           positive COVID-19 precaution .
      future      | [9, 28)  | sample5.txt |           positive COVID-19 precaution .

printing results for query 'CovidSpans(Path, Span, Sent)':
    Path     |   Span   |                        Sent
-------------+----

In [45]:
%%rgxlog
CovidAttributes(Path, CovidSpan, CovidAttribute, Sent) <- ContextMatches(CovidAttribute, Span1, Path, Sent), CovidSpans(Path, Span2, Sent), is_span_contained(Span1, Span2) -> (CovidSpan)
?CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)

printing results for query 'CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)':
    Path     |  CovidSpan  |  CovidAttribute  |                        Sent
-------------+-------------+------------------+----------------------------------------------------
 sample1.txt |   [0, 8)    |     positive     |       COVID-19 results came back positive .
 sample1.txt |  [40, 48)   |     negated      | His family recently tested positive for COVID-19 .
 sample1.txt |  [40, 48)   |     positive     | His family recently tested positive for COVID-19 .
 sample3.txt |   [0, 8)    |     positive     |                 COVID-19 like_num
 sample4.txt |   [4, 12)   |      future      |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |     negated      |              neg COVID-19 education .
 sample5.txt |   [9, 17)   |      future      |           positive COVID-19 precaution .
 sample5.txt |   [9, 17)   |     positive     |           positive COVID-19 precaution .



<a id='postprocess-rules'></a>
### [Postprocessor](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/postprocess_rules.py):
These rules assign an additional attribute to each COVID-19 mention based on its existing attributes or on the sentence it appears in. This gives us the flexibility to correct problems in the data and make targeted improvements. For example, spotting and correcting mentions that were wrongly tagged as positive makes the classification more accurate.


Example rule in the original project:

    PostprocessingRule(
        patterns=[
        
            PostprocessingPattern(lambda ent: ent.label_ == "COVID-19"),
            PostprocessingPattern(
                postprocessing_functions.sentence_contains,
                condition_args=({"deny", "denies", "denied"},),
            ),
            PostprocessingPattern(
                postprocessing_functions.sentence_contains,
                condition_args=({"contact", "contacts", "confirmed"},),
            ),
        ],
        action=postprocessing_functions.remove_ent,
        description="Remove a coronavirus entity if 'denies' and 'contact' are in. This will help get rid of false positives from screening.",
    ),    

This rule iterates over the entities and checks a series of conditions, each expressed as a "PostprocessingPattern". If all conditions evaluate to True, an action is taken on the entity; in this example the action removes it. Other actions can change the entity's attributes.


In our case, we assign an "IGNORE" attribute to the COVID-19 mention, which excludes it from consideration during document classification.

Each rule in the CSV file follows this format: regexPattern#Attribute


In [46]:
%%bash
cat postprocess_pattern_rules.csv

.*education.*#IGNORE
.* \?#IGNORE
(?=.*\b(?:deny|denies|denied)\b)(?=.*\b(?:contact|confirm)\b).*#IGNORE
(?=.*\b(?:setting of|s/o)\b)(?!.*\b(?:COVID-19 infection|COVID-19 ards)\b).*#no_positive
(?i)(.*benign.*)#uncertain
admitted to COVID-19 unit#positive
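
The third rule above replaces the two `sentence_contains` conditions of the original project with two lookaheads, which succeed only if both word groups occur somewhere in the sentence, in any order. The sketch below checks this with Python's `re` module; the helper `should_ignore` and the example sentences are made up for illustration.

```python
import re

# Pattern copied from postprocess_pattern_rules.csv: both lookaheads must succeed.
DENY_CONTACT = re.compile(
    r"(?=.*\b(?:deny|denies|denied)\b)(?=.*\b(?:contact|confirm)\b).*"
)

def should_ignore(sentence):
    """Return True when the sentence contains both a denial and a contact word."""
    return DENY_CONTACT.search(sentence) is not None

print(should_ignore("patient deny contact with COVID-19 positive person ."))
print(should_ignore("contact with COVID-19 case , patient deny fever ."))
print(should_ignore("patient deny fever or cough ."))
```

Only the last sentence is kept, because it has a denial word but no contact word.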

In [47]:
session.import_relation_from_csv("postprocess_pattern_rules.csv", relation_name="PostprocessRules", delimiter="#")

In [48]:
%%rgxlog
PostprocessMatches(CovidAttribute, Span, Path, Sent) <- Sents(Path, Sent), PostprocessRules(Pattern, CovidAttribute),\
py_rgx_span(Sent, Pattern) -> (Span)
?PostprocessMatches(CovidAttribute, Span, Path, Sent)

printing results for query 'PostprocessMatches(CovidAttribute, Span, Path, Sent)':
  CovidAttribute  |  Span   |    Path     |           Sent
------------------+---------+-------------+--------------------------
      IGNORE      | [0, 24) | sample4.txt | neg COVID-19 education .



In [49]:
%%rgxlog
CovidAttributes(Path, CovidSpan, CovidAttribute, Sent) <- PostprocessMatches(CovidAttribute, Span1, Path, Sent), CovidSpans(Path, Span2, Sent), is_span_contained(Span1, Span2) -> (CovidSpan)
?CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)

printing results for query 'CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)':
    Path     |  CovidSpan  |  CovidAttribute  |                        Sent
-------------+-------------+------------------+----------------------------------------------------
 sample1.txt |   [0, 8)    |     positive     |       COVID-19 results came back positive .
 sample1.txt |  [40, 48)   |     negated      | His family recently tested positive for COVID-19 .
 sample1.txt |  [40, 48)   |     positive     | His family recently tested positive for COVID-19 .
 sample3.txt |   [0, 8)    |     positive     |                 COVID-19 like_num
 sample4.txt |   [4, 12)   |      IGNORE      |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |      future      |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |     negated      |              neg COVID-19 education .
 sample5.txt |   [9, 17)   |      future      |           positive COVID-19 precaution .
 sample5.txt |   [9,

### Postprocess rules with attributes example:

    PostprocessingRule(
        patterns=[
        
            PostprocessingPattern(lambda ent: ent.label_ == "COVID-19"),
            PostprocessingPattern(
                postprocessing_functions.is_modified_by_category,
                condition_args=("DEFINITE_POSITIVE_EXISTENCE",),
            ),
            # PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("TEST",)),
            PostprocessingPattern(
                postprocessing_functions.sentence_contains,
                condition_args=(
                    {
                        "should",
                        "unless",
                        "either",
                        "if comes back",
                        "if returns",
                        "if s?he tests positive",
                    },
                    True,
                ),
            ),
        ],
        action=set_is_uncertain,
        action_args=(True,),
        description="Subjunctive of test returning positive. 'Will contact patient should his covid-19 test return positive.'",
    ),

This rule checks whether a COVID-19 mention has a positive attribute and whether the sentence containing it includes any of the words specified in 'condition_args'. If both conditions are met, the uncertain attribute is set to True.


In our case, for each COVID-19 mention in the 'CovidAttributes' table we check whether it is labeled 'positive' and, using a regex search, whether any of the words from 'condition_args' appear in the same sentence. If both conditions are met, we assign the mention an 'uncertain' attribute.

Each rule in the CSV file follows this format: regexPattern#ExistingAttribute#NewAttribute


In [50]:
%%bash
cat postprocess_attributes_rules.csv

.*pending.*#negated#no_negated
.*(?:should|unless|either|if comes back|if returns|if s?he tests positive).*#positive#uncertain
.*precaution.*#positive#no_future
.*(?:re[ -]?test|second test|repeat).*#negated#no_negated
.*(?:sign|symptom|s/s).*#positive#uncertain
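
The way these three-column rules are meant to apply can be sketched in plain Python: a rule adds its new attribute only when the sentence matches the pattern and the mention already carries the existing attribute. The function `extra_attributes` is a hypothetical helper that uses two of the rules listed above; it is not the rgxlog implementation.

```python
import re

# (pattern, existing attribute, new attribute), as in postprocess_attributes_rules.csv
RULES = [
    (r".*precaution.*", "positive", "no_future"),
    (r".*(?:should|unless|either|if comes back|if returns|if s?he tests positive).*",
     "positive", "uncertain"),
]

def extra_attributes(sentence, attributes):
    """Return the attributes added by every rule whose pattern and precondition hold."""
    return {
        new
        for pattern, existing, new in RULES
        if existing in attributes and re.search(pattern, sentence)
    }

print(extra_attributes("positive COVID-19 precaution .", {"positive", "future"}))
```

For the sample5.txt sentence, the mention is already 'positive' and the sentence contains "precaution", so 'no_future' is added and later cancels the 'future' attribute during attribute filtering.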

In [51]:
session.import_relation_from_csv("postprocess_attributes_rules.csv", relation_name="PostprocessRulesWithAttributes", delimiter="#")

In [52]:
%%rgxlog
PostprocessWithAttributesMatches(CovidAttribute, NewAttribute, Span, Path, Sent) <- Sents(Path, Sent), PostprocessRulesWithAttributes(Pattern, CovidAttribute, NewAttribute),\
py_rgx_span(Sent, Pattern) -> (Span)
?PostprocessWithAttributesMatches(CovidAttribute, NewAttribute, Span, Path, Sent)

printing results for query 'PostprocessWithAttributesMatches(CovidAttribute, NewAttribute, Span, Path, Sent)':
  CovidAttribute  |  NewAttribute  |  Span   |    Path     |              Sent
------------------+----------------+---------+-------------+--------------------------------
     positive     |   no_future    | [0, 30) | sample5.txt | positive COVID-19 precaution .



In [53]:
%%rgxlog
CovidAttributes(Path, CovidSpan, NewAttribute, Sent) <- CovidAttributes(Path, CovidSpan, CovidAttribute, Sent), PostprocessWithAttributesMatches(CovidAttribute, NewAttribute, Span, Path, Sent)
?CovidAttributes(Path, CovidSpan, NewAttribute, Sent)

printing results for query 'CovidAttributes(Path, CovidSpan, NewAttribute, Sent)':
    Path     |  CovidSpan  |  NewAttribute  |                        Sent
-------------+-------------+----------------+----------------------------------------------------
 sample1.txt |   [0, 8)    |    positive    |       COVID-19 results came back positive .
 sample1.txt |  [40, 48)   |    negated     | His family recently tested positive for COVID-19 .
 sample1.txt |  [40, 48)   |    positive    | His family recently tested positive for COVID-19 .
 sample3.txt |   [0, 8)    |    positive    |                 COVID-19 like_num
 sample4.txt |   [4, 12)   |     IGNORE     |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |     future     |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |    negated     |              neg COVID-19 education .
 sample5.txt |   [9, 17)   |     future     |           positive COVID-19 precaution .
 sample5.txt |   [9, 17)   |   no_future  

### Postprocess rules with next_sentence:

One rule checks whether the following sentence contains positive mentions. If it does, the COVID-19 mentions in the current sentence are also marked as positive. To implement this rule in our project, we define a new relation that pairs each sentence with the sentence that follows it.


In [54]:
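def next_sent(text_path):
    """
    Reads a text file, splits it into sentences using spaCy's English language model,
    and yields each sentence together with the sentence that follows it.

    Parameters:
        text_path (str): The path to the text file to be processed.

    Yields:
        tuple(str, str): A sentence and its subsequent sentence.
    """
    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    # Tokenize sentences
    sentences = list(doc.sents)
    for i in range(len(sentences) - 1):  # Iterate until the second-to-last sentence
        yield sentences[i].text, sentences[i + 1].text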

magic_session.register(ie_function=next_sent, ie_function_name = "next_sent", in_rel=[DataTypes.string], out_rel=[DataTypes.string,DataTypes.string])

In [55]:
%%rgxlog
NextSent(Path, Sent1, Sent2) <- FilesPaths(Path), next_sent(Path) -> (Sent1, Sent2)
?NextSent(Path, Sent1, Sent2)

printing results for query 'NextSent(Path, Sent1, Sent2)':
    Path     |                       Sent1                        |                       Sent2
-------------+----------------------------------------------------+----------------------------------------------------
 sample1.txt | His family recently tested positive for COVID-19 . |       COVID-19 results came back positive .
 sample1.txt |    patient presents to be tested for COVID-19 .    | His family recently tested positive for COVID-19 .
 sample2.txt |        The patient be tested for COVID-19 .        |               Results be positive .
 sample3.txt |          associated_diagnosis like_num .           |                 COVID-19 like_num
 sample3.txt |             problem_list : like_num .              |          associated_diagnosis like_num .



In [56]:
%%rgxlog
new PostProcessWithNextSentenceRules(str, str)
PostProcessWithNextSentenceRules("(?i)(?:^(?:positive|detected)|results?(?: be)? positive)", "positive")
PostProcessWithNextSentenceMatches(CovidAttribute, Span, Path, Sent) <- Sents(Path, Sent), PostProcessWithNextSentenceRules(Pattern, CovidAttribute),\
py_rgx_span(Sent, Pattern) -> (Span)

CovidAttributes(Path, CovidSpan, CovidAttribute, Sent1) <- CovidSpans(Path, CovidSpan, Sent1), NextSent(Path, Sent1, Sent2), PostProcessWithNextSentenceMatches(CovidAttribute, Span, Path, Sent2)
?CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)

printing results for query 'CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)':
    Path     |  CovidSpan  |  CovidAttribute  |                        Sent
-------------+-------------+------------------+----------------------------------------------------
 sample1.txt |   [0, 8)    |     positive     |       COVID-19 results came back positive .
 sample1.txt |  [40, 48)   |     negated      | His family recently tested positive for COVID-19 .
 sample1.txt |  [40, 48)   |     positive     | His family recently tested positive for COVID-19 .
 sample2.txt |  [26, 34)   |     positive     |        The patient be tested for COVID-19 .
 sample3.txt |   [0, 8)    |     positive     |                 COVID-19 like_num
 sample4.txt |   [4, 12)   |      IGNORE      |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |      future      |              neg COVID-19 education .
 sample4.txt |   [4, 12)   |     negated      |              neg COVID-19 education .
 sample5.txt |   
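
The next-sentence logic can be traced on sample2.txt without spaCy or rgxlog. This is a minimal sketch: the sentence list is typed in by hand, and the names `NEXT_POSITIVE`, `pairs` and `marked` are illustrative only.

```python
import re

# Pattern taken from the PostProcessWithNextSentenceRules fact above.
NEXT_POSITIVE = re.compile(r"(?i)(?:^(?:positive|detected)|results?(?: be)? positive)")

# Sentences of sample2.txt in document order.
sentences = [
    "The patient be tested for COVID-19 .",
    "Results be positive .",
]

# Pair every sentence with its successor; the last sentence has none.
pairs = list(zip(sentences, sentences[1:]))

# A sentence with a COVID-19 mention is marked positive when the
# sentence that follows it matches the rule.
marked = [cur for cur, nxt in pairs if "COVID-19" in cur and NEXT_POSITIVE.search(nxt)]
print(marked)
```

This is why the mention in sample2.txt ends up with a 'positive' attribute even though its own sentence says nothing about the result.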

<a id='document_classifier'></a>
### [Document Classifier](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/document_classifier.py):
Now we have the basic pieces in place to classify documents. Each document is classified as 'POS', 'UNK', or 'NEG' according to the attributes of its COVID-19 mentions. The results are stored in a DataFrame.

The Document Classifier stage has two parts:
 1) **Attribute filtering**: The pipeline may assign several attributes to each COVID-19 mention. In this stage each mention is reduced to a single attribute, following the conditions in the 'attribute_filter' function.
 2) **Document classification**: Each document is classified from the filtered attributes of its mentions, following the conditions in the 'classify_doc_helper' function.


In [57]:
def attribute_filter(group):
    """
    Filters attributes within each "CovidSpan" of a DataFrame table based on specific conditions.

    Parameters:
        group (pandas.Series): A pandas Series representing attributes for each "CovidSpan" within a DataFrame.

    Returns:
        str: Filtered "CovidSpan" attribute determined by the following rules:
            - If 'IGNORE' is present, returns 'IGNORE'.
            - If 'negated' is present (and 'no_negated' is not present), returns 'negated'.
            - If 'future' is present (and 'no_future' is not present), returns 'negated'.
            - If 'other experiencer' or 'not relevant' is present, returns 'negated'.
            - If 'positive' is present (and 'uncertain' and 'no_positive' are not present), returns 'positive'.
            - Otherwise, returns 'uncertain'.
    """
    if 'IGNORE' in group.values:
        return 'IGNORE'
    elif 'negated' in group.values and 'no_negated' not in group.values:
        return 'negated'
    elif 'future' in group.values and 'no_future' not in group.values:
        return 'negated'
    elif 'other experiencer' in group.values or 'not relevant' in group.values:
        return 'negated'
    elif 'positive' in group.values and 'uncertain' not in group.values and 'no_positive' not in group.values:
        return 'positive'
    else:
        return 'uncertain'

In [58]:
df = (session.run_commands("?CovidAttributes(Path, CovidSpan, CovidAttribute, Sent)", print_results=False, format_results=True))[0]
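if len(df) == 0:
    # Include every column used below so that an empty result does not raise a KeyError
    df = DataFrame(columns=["Path", "CovidSpan", "CovidAttribute", "Sent"])
# Reduce the attributes of each mention (same document, span and sentence) to a single one
df['CovidAttribute'] = df.groupby(['Path', 'CovidSpan', 'Sent'])['CovidAttribute'].transform(attribute_filter)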
df = df.drop_duplicates().reset_index(drop=True)
df

   |    Path     | CovidSpan | CovidAttribute | Sent
---+-------------+-----------+----------------+---------------------------------------------------
 0 | sample1.txt | [0, 8)    | positive       | COVID-19 results came back positive .
 1 | sample1.txt | [40, 48)  | negated        | His family recently tested positive for COVID-...
 2 | sample2.txt | [26, 34)  | positive       | The patient be tested for COVID-19 .
 3 | sample3.txt | [0, 8)    | positive       | COVID-19 like_num
 4 | sample4.txt | [4, 12)   | IGNORE         | neg COVID-19 education .
 5 | sample5.txt | [9, 17)   | positive       | positive COVID-19 precaution .
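
To see how the precedence plays out, the filtering rules can be restated over plain Python sets and applied to the attribute sets collected earlier. This is an illustrative restatement, not the function used by the pipeline: `filter_attributes` is a hypothetical name and the 'other experiencer' / 'not relevant' branch is left out for brevity.

```python
def filter_attributes(attributes):
    """Collapse the attribute set of one COVID-19 mention into a single label."""
    if "IGNORE" in attributes:
        return "IGNORE"
    if "negated" in attributes and "no_negated" not in attributes:
        return "negated"
    if "future" in attributes and "no_future" not in attributes:
        return "negated"
    if "positive" in attributes and not {"uncertain", "no_positive"} & attributes:
        return "positive"
    return "uncertain"

# Attribute sets taken from the CovidAttributes outputs above.
print(filter_attributes({"positive", "future", "no_future"}))  # sample5.txt mention
print(filter_attributes({"negated", "positive"}))              # sample1.txt, span [40, 48)
print(filter_attributes({"IGNORE", "future", "negated"}))      # sample4.txt mention
```

The sample5.txt mention stays positive because 'no_future' cancels 'future', while the family mention in sample1.txt is negated because 'negated' is checked before 'positive'.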


In [59]:
def classify_doc_helper(group):
    """
    Classifies a document as 'POS', 'UNK', or 'NEG' based on COVID-19 attributes.

    Parameters:
        group (pandas.Series): A pandas Series representing COVID-19 attributes for each document within a DataFrame.

    Returns:
        str: Document classification determined as follows:
             - 'POS': If at least one "positive" attribute is present in the group.
             - 'UNK': If there is no "positive" attribute and at least one "uncertain" attribute is present,
                      or if the group contains only 'IGNORE' attributes.
             - 'NEG': Otherwise, i.e. a "negated" attribute is present without any "positive" or "uncertain" one.
    """
    if 'positive' in group.values:
        return 'POS'
    elif 'uncertain' in group.values:
        return 'UNK'
    elif 'negated' in group.values:
        return 'NEG'
    else:
        return 'UNK'

In [60]:
df['DocResult'] = df.groupby('Path')['CovidAttribute'].transform(classify_doc_helper)
df = df[['Path', 'DocResult']]
df = df.drop_duplicates().reset_index(drop=True)
df

   |    Path     | DocResult
---+-------------+-----------
 0 | sample1.txt | POS
 1 | sample2.txt | POS
 2 | sample3.txt | POS
 3 | sample4.txt | UNK
 4 | sample5.txt | POS


In [61]:
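# Documents without any COVID-19 mention have no row yet: add them via an outer merge
# with the list of all file paths and classify them as 'UNK'
df_path = (session.run_commands("?FilesPaths(Path)", print_results=False, format_results=True))[0]
df = (pd.merge(df, df_path, on='Path', how='outer'))
df['DocResult'] = df['DocResult'].fillna("UNK")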
df

   |    Path     | DocResult
---+-------------+-----------
 0 | sample1.txt | POS
 1 | sample2.txt | POS
 2 | sample3.txt | POS
 3 | sample4.txt | UNK
 4 | sample5.txt | POS
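
The last two steps, classifying each document and defaulting documents without mentions to 'UNK', can be summarized in a few lines of plain Python. This is a sketch under stated assumptions: `classify` mirrors the conditions of 'classify_doc_helper', the label lists are typed in from the filtered table above, and `sample6.txt` is a hypothetical file with no COVID-19 mention.

```python
def classify(labels):
    """Classify one document from the filtered labels of its mentions."""
    if "positive" in labels:
        return "POS"
    if "uncertain" in labels:
        return "UNK"
    if "negated" in labels:
        return "NEG"
    return "UNK"  # only IGNORE labels, or no mentions at all

mention_labels = {
    "sample1.txt": ["positive", "negated"],
    "sample4.txt": ["IGNORE"],
}
all_paths = ["sample1.txt", "sample4.txt", "sample6.txt"]  # sample6.txt is hypothetical

# Looking up a missing path with an empty default plays the role of the
# outer merge followed by fillna("UNK") in the pandas version.
results = {path: classify(mention_labels.get(path, [])) for path in all_paths}
print(results)
```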
