# Covid-19 NLP pipeline

### The pipeline repository [link](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/)

### Introduction:
The primary objective of the NLP pipeline is to identify individuals who have been positively diagnosed with COVID-19 by extracting pertinent information from unstructured free-text narratives found within the Electronic Health Record (EHR) of the Department of Veterans Affairs (VA). By automating this process, the pipeline streamlines the screening of a substantial volume of clinical text, significantly reducing the time and effort required for identification.
The original pipeline is built on the medspaCy framework and defines a new UI to use.
Our goal is to rewrite the pipeline in the rgxlog language, as a real-world NLP example that shows the benefits of the rgxlog framework.

### Pipeline stages:
- [Concept tagger](#concept-tag-rules)
- [Target matcher](#target-rules)
- [Context matcher](#context-rules)
- Sectionizer
- Postprocessor
- Document classification

We will implement each stage separately later on.

### First, we need to install some requirements to work with the [medspacy](https://github.com/medspacy/medspacy) framework

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [1]:
import spacy

### Import what we need from the rgxlog framework and define some IE functions that will be used in every stage of the pipeline:

In [2]:
import re
import pandas as pd
from pandas import DataFrame
import rgxlog
from rgxlog import magic_session
from rgxlog import Session
from rgxlog.engine.datatypes.primitive_types import DataTypes
from rgxlog.engine.datatypes.primitive_types import Span
session = rgxlog.magic_session

In [3]:
def read_from_file(text_path):
    """
    Reads a file and returns its content.

    Parameters:
        text_path (str): The path to the text file to read from.

    Returns:
        str: The content of the file.
    """
    with open(f"{text_path}", 'r') as file:
        content = file.read()
    yield content
magic_session.register(ie_function=read_from_file,
                       ie_function_name = "read_from_file",
                       in_rel=[DataTypes.string],
                       out_rel=[DataTypes.string])

In [4]:
def resolve_interval_conflicts(replacements):
    """
    This function takes a list of replacements, where each replacement is represented
    as a list containing a label and a span (interval). It checks for conflicts among
    the intervals and returns a list of resolved replacements, ensuring that no two
    intervals overlap.

    Parameters:
    replacements (list of lists): A list of replacements, where each replacement
        is represented as a list [label, span].

    Returns:
    list of lists: A list of resolved replacements, where each replacement is a list
        [label, span], ensuring that there are no conflicts among intervals.
    """
    # Sort the replacements by the size of the spans in descending order
    replacements.sort(key=lambda x: x[1].span_end - x[1].span_start, reverse=True)

    # Initialize a list to keep track of intervals that have been replaced
    resolved_replacements = []
    
    for label, span in replacements:
        conflict = False

        for _, existing_span in resolved_replacements:
            existing_start = existing_span.span_start
            existing_end = existing_span.span_end

            if not (span.span_end <= existing_start or span.span_start >= existing_end):
                conflict = True
                break

        if not conflict:
            resolved_replacements.append([label, span])

    return resolved_replacements
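
The longest-first strategy above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `SimpleSpan` class that stands in for rgxlog's `Span` (only the `span_start` and `span_end` attributes are mimicked) and a hypothetical helper name `resolve_longest_first`; the labels and offsets are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSpan:
    # Hypothetical stand-in for rgxlog's Span; only the two attributes used above
    span_start: int
    span_end: int


def resolve_longest_first(replacements):
    """Keep the longest spans, dropping any span that overlaps one already kept."""
    ordered = sorted(replacements, key=lambda r: r[1].span_end - r[1].span_start, reverse=True)
    kept = []
    for label, span in ordered:
        # Two half-open intervals overlap when each one starts before the other ends
        overlaps = any(
            span.span_start < other.span_end and other.span_start < span.span_end
            for _, other in kept
        )
        if not overlaps:
            kept.append([label, span])
    return kept


candidates = [
    ["COVID-19", SimpleSpan(0, 8)],
    ["positive COVID-19 screening", SimpleSpan(0, 26)],
    ["patient", SimpleSpan(30, 37)],
]
resolved = resolve_longest_first(candidates)
print([label for label, _ in resolved])
# ['positive COVID-19 screening', 'patient']
```

The short `COVID-19` span is dropped because it lies inside the longer span that was kept first.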

In [5]:
def replace_spans(spans_table, paths_table):
    """
    Replaces matched spans in each text file with their corresponding labels.

    For every path in the paths table, this function queries the spans table for the
    spans found in that file, resolves overlapping spans, and rewrites the file with
    each remaining span replaced by its label.

    Parameters:
    spans_table (str): The name of the spans relation; its columns are formatted as (Label, Span, Path).
    paths_table (str): The name of the paths relation; its columns are formatted as (Path).

    Returns:
    None: The files are modified in place. If a path has no matching spans, the function
        stops and returns " " without processing the remaining paths.
    """
    # Get a list of all the paths
    paths = session.run_commands(f"?{paths_table}(Path)", print_results=False, format_results=True)
    paths = paths[0].values.tolist()
    for path_list in paths:
        path = path_list[0]

        # Generate a spans query for each path; the query is formatted as (Label, Span, Path)
        results = session.run_commands(f'?{spans_table}(Label, Span, "{path}")', print_results=True, format_results=True)
        if len(results[0]) == 0:
            return " "
        # replacements is a list of lists, where each inner list is [Label, Span]
        replacements = results[0].values.tolist()
        
        with open(f"{path}", 'r') as file:
            adjusted_string = file.read()
    
        # Resolve spans conflicts
        resolved_replacements = resolve_interval_conflicts(replacements)
    
        # Sort the resolved replacements by the starting index of each span in descending order
        resolved_replacements.sort(key=lambda x: x[1].span_start, reverse=True)
    
        # Iterate over the resolved replacements and replace each span with its label
        for replace_string, span in resolved_replacements:
            adjusted_string = adjusted_string[:span.span_start] + replace_string + adjusted_string[span.span_end:]
    
        with open(f"{path}", 'w') as file:
            file.writelines(adjusted_string)
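
The reason `replace_spans` sorts the spans by start offset in descending order is that a label rarely has the same length as the text it replaces, so every replacement shifts the offsets to its right. Working from the end of the text backwards keeps the offsets of the spans still to be processed valid. A minimal sketch with plain `(label, start, end)` tuples (a simplification of the `[Label, Span]` pairs used above; the sentence and offsets are made up):

```python
def replace_right_to_left(text, replacements):
    """Replace (label, start, end) spans in text, processing the rightmost span first."""
    for label, start, end in sorted(replacements, key=lambda r: r[1], reverse=True):
        text = text[:start] + label + text[end:]
    return text


text = "pt tested pos for covid19"
replacements = [("patient", 0, 2), ("positive", 10, 13), ("COVID-19", 18, 25)]
print(replace_right_to_left(text, replacements))
# patient tested positive for COVID-19
```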

### The paths of the text files to be classified should be listed in the "files_paths.csv" file

In [6]:
%%bash
cat files_paths.csv

sample1.txt
sample2.txt

In [7]:
session.import_relation_from_csv("files_paths.csv", relation_name="FilesPaths", delimiter=",")

In [8]:
%%rgxlog
FilesContent(Path, Content) <- FilesPaths(Path), read_from_file(Path) -> (Content)
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19.



<a id='concept-tag-rules'></a>
### [Concept Tag Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/concept_tag_rules.py):
Concept tag rules, also known as pattern-based rules or custom rules, are a way to specify and define patterns that an NLP (Natural Language Processing) system should recognize within text data. These rules are used to identify specific concepts or entities within text documents. In the context of MedSpaCy and medical NLP, concept tag rules are often used to identify medical entities and concepts accurately.

The original project uses the TargetRule class, which defines a rule for identifying a specific concept or entity in text.
Each concept TargetRule looks like this:

TargetRule(
            literal="coronavirus",
            category="COVID-19",
            pattern=[{"LOWER": {"REGEX": "coronavirus|hcov|ncov$"}}],
          )

**Literal** : This specifies the literal text or word that this rule is targeting.

**Category** : This specifies the category or label associated with the identified entity

**Pattern** : This defines the pattern or conditions under which the entity should be recognized. It is a list of dictionaries specifying conditions for token matching. These rules sometimes use the lemma or POS attribute of each token. Documentation can be found at: https://spacy.io/usage/rule-based-matching

Instead, we defined regex patterns and added them to the concept_tags_rules.csv file. There are two types of patterns, lemma and pos; we implement each of them later on.
Each rule in the csv file has the form: regexPattern, label, type

In [9]:
%%bash
cat concept_tags_rules.csv

(?i)(?:hcov|covid(?:(?:-)?(?:\s)?19|10)?|2019-cov|cov2|ncov-19|covd 19|no-cov|sars cov),COVID-19,lemma
(?i)(?:coivid|(?:novel )?corona(?:virus)?(?: (?:20)?19)?|sars(?:\s)?(?:-)?(?:\s)?cov(?:id)?(?:-)?(?:2|19)),COVID-19,lemma
(?i)(?:\+(?: ve)?|\(\+\)|positive|\bpos\b|active|confirmed),positive,lemma
(?i)(?:pneum(?:onia)?|pna|hypoxia|septic shoc|ards\(?(?:(?:[12])/2)\)?|(?:hypoxemic|acute|severe)? resp(?:iratory)? failure(?:\(?(?:[12]/2)\)?)?)",associated_diagnosis,lemma
(?i)(?:(?:diagnos(?:is|ed)|dx(\.)?)(?:of|with)?),diagnosis,lemma
(?i)(?:^screen),screening,lemma
(?i)(?:in contact with|any one|co-worker|at work|(?:the|a)(?:wo)?man|(?:another|a) (?:pt|patient|pt\.)),other_experiencer,lemma
(?i)(?:patient|pt(?:\.)?|vt|veteran),patient,lemma
(?i)(?:like_num (?:days|day|weeks|week|months|month) (?:ago|prior)),timesx,lemma
(?i)(?:(?:antibody|antibodies|ab) test),antibody test,lemma
(?i)(?:(?:coronavirus|hcovs?|ncovs?|covs?)(?:\s)?(?:-)?(?:\s)?(?: infection)?(?: strain)?(?:\s)?(?:229(?:e)?|

In [10]:
session.import_relation_from_csv("concept_tags_rules.csv", relation_name="ConceptTagRules", delimiter=",")
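
To illustrate how a row of the rules file is meant to be read, the sketch below applies two rows copied from the file listing above to the text of sample2.txt using only Python's `re` module. It does not go through rgxlog; the helper name `match_rules` and the way each row is split are assumptions made for this illustration.

```python
import re

# Two rows copied from concept_tags_rules.csv, in the form regexPattern,label,type
rows = [
    r"(?i)(?:hcov|covid(?:(?:-)?(?:\s)?19|10)?|2019-cov|cov2|ncov-19|covd 19|no-cov|sars cov),COVID-19,lemma",
    r"(?i)(?:patient|pt(?:\.)?|vt|veteran),patient,lemma",
]


def match_rules(text, rows):
    """Return (label, start, end) for every match of every rule row."""
    matches = []
    for row in rows:
        # Split from the right so that commas inside a pattern would be left untouched
        pattern, label, _rule_type = row.rsplit(",", 2)
        for m in re.finditer(pattern, text):
            matches.append((label, m.start(), m.end()))
    return matches


text = "The patient be tested negative for COVID-19 ."
print(match_rules(text, rows))
# [('COVID-19', 35, 43), ('patient', 4, 11)]
```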

#### Lemma Rules:
Lemma rules are rules that use the `lemma_` attribute of each token. We defined the function below to lemmatize the text. Since most of the rules match only the raw text, we lemmatize only the tokens that the rules need.

Example for a lemma rule from the original NLP:

        TargetRule(
            "results positive",
            "positive",
            pattern=[
                {"LOWER": "results"},
                {"LEMMA": "be", "OP": "?"},
                {"LOWER": {"IN": ["pos", "positive"]}},
            ],
        ),
We used py_rgx_span to capture the patterns; the resulting spans are later passed to replace_spans, which replaces each span with the correct label.
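
As a rough illustration of how such a token pattern can be expressed as a regex over partially lemmatized text, consider the sketch below. The lemma table `LEMMAS` is a hypothetical, hand-written stand-in for spaCy's lemmatizer, and the regex is our own translation of the rule above for this example, not a row taken from the rules file.

```python
import re

# Hypothetical, hand-written lemma table standing in for spaCy's lemmatizer
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}


def lemmatize_selected(text):
    """Replace only the tokens listed in LEMMAS; leave every other token as written."""
    return " ".join(LEMMAS.get(tok.lower(), tok) for tok in text.split())


# Regex counterpart of: "results", an optional token with lemma "be", then "pos" or "positive"
RESULTS_POSITIVE = re.compile(r"(?i)results(?: be)? (?:positive|pos\b)")

for sentence in ["COVID-19 results are positive .", "results pos ."]:
    lemmatized = lemmatize_selected(sentence)
    m = RESULTS_POSITIVE.search(lemmatized)
    print(lemmatized, "->", m.group(0) if m else None)
# COVID-19 results be positive . -> results be positive
# results pos . -> results pos
```

Lemmatizing only the listed words keeps the rest of the text identical to the input, so the other, purely textual patterns are unaffected.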

In [11]:
def lemmatize_text(text_path, lemma_words_path):
    """
    This function reads a text file, lemmatizes its content using spaCy's English language model,
    and replaces certain words with their lemmas the rest will remain the same. The updated text is then written back to the same file.

    Parameters:
        text_path (str): The path to the text file to be lemmatized.
        lemma_words_path(str): The path that contains the list of words to be lemmatized

    Returns:
        str: The lemmatized text.
    """
    # Load the list of words to be lemmatized (one word per line)
    with open(lemma_words_path, 'r') as file:
        lemma_words = [line.strip() for line in file if line.strip()]

    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    lemmatized_text = ""
    for token in doc:
        if token.lemma_ in lemma_words:
            lemmatized_text += token.lemma_
        elif token.like_num:
            lemmatized_text += "like_num"
        else:
            lemmatized_text += token.text
        lemmatized_text += " "

    # Write the lemmatized text back to the same file
    with open(text_path, 'w') as file:
        file.writelines(lemmatized_text)

    yield lemmatized_text
magic_session.register(ie_function=lemmatize_text, ie_function_name = "lemmatize_text", in_rel=[DataTypes.string, DataTypes.string], out_rel=[DataTypes.string])

In [12]:
%%rgxlog
lemma_texts(Path, LemmaText) <- FilesPaths(Path), lemmatize_text(Path, "lemma_words.txt") -> (LemmaText)
?lemma_texts(Path, LemmaText)

LemmaMatches(Label, Span, Path) <- lemma_texts(Path, Text), ConceptTagRules(Pattern, Label, "lemma"), py_rgx_span(Text, Pattern) -> (Span)

printing results for query 'lemma_texts(Path, LemmaText)':
    Path     |                                                              LemmaText
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19 .



In [13]:
# replace the matches with the correct label
replace_spans("LemmaMatches", "FilesPaths")

printing results for query 'LemmaMatches(Label, Span, "sample1.txt")':
  Label   |    Span
----------+------------
 positive | [121, 129)
 positive |  [70, 78)
 COVID-19 |  [34, 42)
 COVID-19 |  [83, 91)
 COVID-19 | [94, 102)
 patient  |   [0, 7)

printing results for query 'LemmaMatches(Label, Span, "sample2.txt")':
  Label   |   Span
----------+----------
 COVID-19 | [35, 43)
 patient  | [4, 11)



In [14]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19 .



#### POS Rules:
As mentioned above, these rules use the POS attribute of each token. Since there are only a few such rules, we use the POS only for the tokens we need.
Example of a rule from the original NLP:

        TargetRule(
            "other experiencer",
            category="other_experiencer",
            pattern=[
                {
                    "POS": {"IN": ["NOUN", "PROPN", "PRON", "ADJ"]},
                    "LOWER": {
                        "IN": [
                            "someone",
                            "somebody",
                            "person",
                            "anyone",
                            "anybody",
                        ]
                    },
                }
            ],
        ),
The patterns we defined capture what should come after the POS token, so we use two IE functions. The first one finds the POS of each token. We then use py_rgx_span to capture the patterns we defined and check whether a POS token appears right before a captured pattern; if so, that is a match for the rule.

In [15]:
def annotate_text_with_pos(text_path):
    """
    This function reads a text file, processes its content using spaCy's English language model,
    and yields a (POS, Span) tuple for each token whose POS is one of NOUN|PROPN|PRON|ADJ;
    for any other token an empty tuple is yielded.
    
    Parameters:
        text_path (str): The path to the text file to be annotated.

    Returns:
        tuple(str, Span): The POS of the token and its span.
    """
    with open(text_path, 'r') as file:
        contents = file.read()

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(contents)

    for token in doc:
        if token.pos_ in ["NOUN", "PROPN", "PRON", "ADJ"]:
            yield token.pos_, Span(token.idx, token.idx + len(token.text))
        else:
            yield tuple()
magic_session.register(ie_function=annotate_text_with_pos, ie_function_name = "annotate_text_with_pos", in_rel=[DataTypes.string], out_rel=[DataTypes.string, DataTypes.span])

In [16]:
def is_next_span(first_span, second_span):
    """
    Determine if two spans are adjacent in a text sequence and return a new span
    representing their combination if they are.

    Parameters:
        first_span (Span): The first span to be checked.
        second_span (Span): The second span to be checked.

    Returns:
        A new Span representing their combination if they are adjacent, or an empty tuple otherwise.

    """
    if (first_span.span_end + 1) == second_span.span_start:
        yield Span(first_span.span_start, second_span.span_end)
    else:
        yield tuple() 
magic_session.register(ie_function=is_next_span,
                       ie_function_name = "is_next_span",
                       in_rel=[DataTypes.span, DataTypes.span],
                       out_rel=[DataTypes.span])
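
The adjacency test used by `is_next_span` treats two spans as adjacent when exactly one character (normally a space) separates them. A minimal sketch with plain `(start, end)` tuples instead of rgxlog spans; the sentence, the offsets, and the assumption about which token is tagged and which text is matched by a rule are all made up for illustration:

```python
def combine_if_adjacent(first, second):
    """Return the merged (start, end) span if `second` starts one character after `first` ends."""
    first_start, first_end = first
    second_start, second_end = second
    if first_end + 1 == second_start:
        return (first_start, second_end)
    return None


text = "someone at work tested positive"
pos_span = (0, 7)      # "someone"; assumed to be tagged PRON by the POS tagger
rule_span = (8, 15)    # "at work"; assumed to be matched by a "pos"-type pattern
far_span = (16, 22)    # "tested"; not directly after "someone"

merged = combine_if_adjacent(pos_span, rule_span)
print(merged, repr(text[merged[0]:merged[1]]))
# (0, 15) 'someone at work'
print(combine_if_adjacent(pos_span, far_span))
# None
```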

In [17]:
%%rgxlog
POSTable(POS, Span, Path) <- lemma_texts(Path, Text), annotate_text_with_pos(Path) -> (POS, Span)
?POSTable(POS, Span, Path)

POSMatches(Label, Span, Path) <- lemma_texts(Path, Text), ConceptTagRules(Pattern, Label, "pos"), py_rgx_span(Text, Pattern) -> (Span)
?POSMatches(Label, Span, Path)

POSRuleMatches(Label, Span, Path) <- POSTable(POS, FirstSpan, Path), POSMatches(Label, SecondSpan, Path), is_next_span(FirstSpan, SecondSpan) -> (Span)

printing results for query 'POSTable(POS, Span, Path)':
  POS  |    Span    |    Path
-------+------------+-------------
  ADJ  |   [0, 7)   | sample1.txt
  ADJ  | [121, 129) | sample1.txt
  ADJ  |  [70, 78)  | sample1.txt
 NOUN  | [103, 110) | sample1.txt
 NOUN  |  [49, 53)  | sample1.txt
 NOUN  |  [8, 16)   | sample1.txt
 PROPN |  [83, 91)  | sample1.txt
  ADJ  |  [22, 30)  | sample2.txt
 NOUN  |  [4, 11)   | sample2.txt
 PROPN |  [35, 43)  | sample2.txt

printing results for query 'POSMatches(Label, Span, Path)':
  Label  |   Span   |    Path
---------+----------+-------------
 family  | [49, 53) | sample1.txt



In [18]:
# replace the matches with the correct label
replace_spans("POSRuleMatches", "FilesPaths")

printing results for query 'POSRuleMatches(Label, Span, "sample1.txt")':
[]



' '

In [19]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19 .



<a id='target-rules'></a>
### [Target Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/target_rules.py):
These rules use the labels assigned by the concept tagger to capture more complex patterns and assign labels to them, in order to reduce the number of false positives.
Each rule looks like this:

        TargetRule(
            literal="coronavirus screening",
            category="IGNORE",
            pattern=[
                {"_": {"concept_tag": "COVID-19"}},
                {"LOWER": {"IN": ["screen", "screening", "screenings"]}},
            ],
        ),
Since we replaced the spans we found with their corresponding labels, we did not need the concept_tag attribute of the token/span.
To simplify the patterns, we divided them into two groups: PreTargetRules and TargetRules.

### PreTargetRules: 
To ease the process, we implemented PreTarget rules that squash consecutive identical labels assigned by the concept tagger into a single label.

In [20]:
%%bash
cat pre_target_rules.csv

(?i)(?:COVID-19(?: COVID-19)*),COVID-19
(?i)(?:positive(?: positive)*),positive
(?i)(?:patient(?: patient)*),patient
(?i)(?:other_experiencer(?: other_experiencer)*),other_experiencer
(?i)(?:screening(?: screening)*),screening
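
The effect of these rules can be previewed with `re.sub`. This is only a sketch: the notebook itself collects the matches as spans and rewrites the files through `replace_spans`, and the input sentence below is made up.

```python
import re

# Two of the patterns listed above, paired with their labels
SQUASH_RULES = [
    (r"(?i)(?:COVID-19(?: COVID-19)*)", "COVID-19"),
    (r"(?i)(?:positive(?: positive)*)", "positive"),
]


def squash_repeats(text):
    """Collapse each run of identical, space-separated labels into a single label."""
    for pattern, label in SQUASH_RULES:
        text = re.sub(pattern, label, text)
    return text


print(squash_repeats("patient tested positive positive for COVID-19 COVID-19 COVID-19 ."))
# patient tested positive for COVID-19 .
```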

In [21]:
session.import_relation_from_csv("pre_target_rules.csv", relation_name="PreTargetTagRules", delimiter=",")

In [22]:
%%rgxlog
PreTargetMatches(Label, Span, Path) <- lemma_texts(Path, Text), PreTargetTagRules(Pattern, Label), py_rgx_span(Text,Pattern) -> (Span)
?PreTargetMatches(Label, Span, Path)

printing results for query 'PreTargetMatches(Label, Span, Path)':
  Label   |    Span    |    Path
----------+------------+-------------
 COVID-19 |  [35, 43)  | sample2.txt
 COVID-19 |  [34, 42)  | sample1.txt
 COVID-19 |  [83, 91)  | sample1.txt
 COVID-19 | [94, 102)  | sample1.txt
 positive | [121, 129) | sample1.txt
 positive |  [70, 78)  | sample1.txt
 patient  |  [4, 11)   | sample2.txt
 patient  |   [0, 7)   | sample1.txt



In [23]:
replace_spans("PreTargetMatches", "FilesPaths")

printing results for query 'PreTargetMatches(Label, Span, "sample1.txt")':
  Label   |    Span
----------+------------
 COVID-19 |  [34, 42)
 COVID-19 |  [83, 91)
 COVID-19 | [94, 102)
 positive | [121, 129)
 positive |  [70, 78)
 patient  |   [0, 7)

printing results for query 'PreTargetMatches(Label, Span, "sample2.txt")':
  Label   |   Span
----------+----------
 COVID-19 | [35, 43)
 patient  | [4, 11)



In [24]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19 .



### Target Rules:

In [25]:
%%bash
cat target_rules.csv

(?i)(COVID-19 positive (?:unit|floor)|positive COVID-19 (?:unit|floor|exposure)),COVID-19
(?i)(known(?: positive)? COVID-19(?: positive)? (?:exposure|contact)),COVID-19
(?i)(COVID-19 positive screening|positive COVID-19 screening|screening COVID-19 positive|screening positive COVID-19),positive coronavirus screening
(?i)(diagnosis : COVID-19 (?:test|screening)),COVID-19
(?i)(COVID-19 screening),coronavirus screening
(?i)(active COVID-19 precaution),IGNORE
(?i)(COVID-19 (?:restriction|emergency|epidemic|outbreak|crisis|breakout|pandemic|spread|screening)|droplet (?:isolation )?precaution),IGNORE
(?i)(contact precautions|positive (?:for )?(?:flu|influenza)|positive (?:patient|person)|confirm (?:with|w/(?:/)?|w)|the (?:positive )?case),IGNORE
(?i)(positive cases|results (?:are )?confirmed|exposed to positive|(?:neg|pos) pressure|a positive case|positive (?:attitude|feedback|serology)|[ ] COVID-19),IGNORE
(?i)(has (?:the )?patient been diagnosed (?:with|w/(?:/)?|w)|(?:person|patient) with 

In [26]:
session.import_relation_from_csv("target_rules.csv", relation_name="TargetTagRules", delimiter=",")

In [27]:
%%rgxlog
TargetTagMatches(Label, Span, Path) <- lemma_texts(Path, Text), TargetTagRules(Pattern, Label), py_rgx_span(Text,Pattern) -> (Span)

In [28]:
replace_spans("TargetTagMatches", "FilesPaths")

printing results for query 'TargetTagMatches(Label, Span, "sample1.txt")':
[]



' '

In [29]:
%%rgxlog
?FilesContent(Path, Content)

printing results for query 'FilesContent(Path, Content)':
    Path     |                                                               Content
-------------+-------------------------------------------------------------------------------------------------------------------------------------
 sample1.txt | patient presents to be tested for COVID-19 . His wife recently tested positive for COVID-19 . COVID-19 results came back positive .
 sample2.txt |                                            The patient be tested negative for COVID-19 .



<a id='context-rules'></a>
### [Context Rules](https://github.com/abchapman93/VA_COVID-19_NLP_BSV/blob/master/cov_bsv/knowledge_base/context_rules.py):
These rules assign an attribute to each COVID-19 label based on its context; these attributes are used later to classify each text.

An example of such a rule:

    ConTextRule(
        literal="Not Detected",
        category="NEGATED_EXISTENCE",
        direction="BACKWARD",
        pattern=[
            {"LOWER": {"IN": ["not", "non"]}},
            {"IS_SPACE": True, "OP": "*"},
            {"TEXT": "-", "OP": "?"},
            {"LOWER": {"REGEX": "detecte?d"}},
        ],
        allowed_types={"COVID-19"},
    ),

**direction** specifies whether the allowed_types should appear before or after the pattern.

**allowed_types** specifies the labels to which this rule applies.

In [30]:
%%bash
cat context_rules.csv

(?i)(?:positive COVID-19|COVID-19 (?:\([^)]*\)) (?:positive|detected)|COVID-19(?: positive)? associated_diagnosis)#positive
(?i)(?:COVID-19 status : positive)#positive
(?i)(?:associated_diagnosis COVID-19|associated_diagnosis (?:with|w|w//|from) (?:associated_diagnosis )?COVID-19)#positive
(?i)(?:COVID-19 positive(?: patient| precaution)?|associated_diagnosis (?:due|secondary) to COVID-19)#positive
(?i)(?:(?:current|recent) COVID-19 diagnosis)#positive
(?i)(?:COVID-19 (?:- )?related (?:admission|associated_diagnosis)|admitted (?:due to|(?:with|w|w/)) COVID-19)#positive
(?i)(?:COVID-19 infection|b34(?:\.)?2|b97.29|u07.1)#positive
(?i)(?:COVID-19 eval(?:uation)?|(?:positive )? COVID-19 symptoms|rule out COVID-19)#uncertain
(?i)(?:patient (?:do )?have COVID-19)#positive
(?i)(?:diagnosis : COVID-19(?: (?:test|screen)(?:ing|ed|s)? positive)?(?: positive)?)#positive
(?i)(?:COVID-19(?: (?!IGNORE)\w+)*? (?:not|non) (?:- )?detecte?d)#negated
(?i)(?:COVID-19(?: (?!IGNORE)\w+){0,1} negative scree

(?i)(?:COVID-19(?: (?!IGNORE)\w+){0,1} precaution)#negated
(?i)(?:(?:(?:precaution|protection|protect) (?:for|against)|concern about|reports of|vaccine|protect yourself|prevent(?:ed|ion|s|ing)|avoid)(?: (?!IGNORE)\w+)*? COVID-19)#negated
(?i)(?:COVID-19(?: (?!IGNORE)\w+)*? (?:prevent(?:ed|ion|s|ing)|vaccine|education(?:ion|ed|ing|ed)?|instruction))#negated
(?i)(?:(?:questions (?:about|regarding|re|concerning|on|for)|(?:anxiety|ask(?:ing|ed|es|ed)?) about|educat(?:ion|ed|ing|ed)?|instruction)(?: (?!IGNORE)\w+)*? COVID-19)#negated
(?i)(?:(?:information(?: )?(?:on|about|regarding|re)?|protocols?)(?: (?!IGNORE)\w+){0,2} COVID-19)#negated
(?i)(?:COVID-19(?: (?!IGNORE)\w+){0,2} protocols?)#negated
(?i)(?:(?:materials|fact(?: )?sheet|literature|(?:informat(?:ion|ed|ing) )?handouts?|(?:anxious|worr(?:ied|ies|y|ying)) (?:about|re|regarding))(?: (?!IGNORE)\w+)*? COVID-19)#negated
(?i)(?:COVID-19(?: (?!IGNORE)\w+)*? (?:materials|fact(?: )?sheet|literature|(?:informat(?:ion|ed|ing) )?handouts?))#n

In [31]:
session.import_relation_from_csv("context_rules.csv", relation_name="ContextRules", delimiter="#")

In [32]:
%%rgxlog
#covid_attributes: negated, other_experiencer, is_future, not_relevant, uncertain, positive
ContextMatches(CovidAttribute, Span, Path, Pattern) <- lemma_texts(Path, Text), ContextRules(Pattern, CovidAttribute), py_rgx_span(Text, Pattern) -> (Span)
?ContextMatches(CovidAttribute, Span, Path, Pattern)

CovidSpans(Path, Span) <- lemma_texts(Path, Text), py_rgx_span(Text, "COVID-19") -> (Span)
?CovidSpans(Path, Span)

printing results for query 'ContextMatches(CovidAttribute, Span, Path, Pattern)':
  CovidAttribute  |   Span    |    Path     |                                                                                    Pattern
------------------+-----------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     negated      | [22, 43)  | sample2.txt |                                   (?i)(?:(?:answer(?:ed|s|ing)? (?:no|negative|neg)|(?:neg|negative)(?: for)?)(?: (?!IGNORE)\w+)*? COVID-19)
     positive     | [70, 91)  | sample1.txt |                  (?i)(?:(?:(?:test(?:ed|s|ing)?)?positive(?: for)?|notif(?:y|ied) of positive (?:results?|test(?:ing)?|status))(?: (?!IGNORE)\w+)*? COVID-19)
     positive     | [94, 129) | sample1.txt | (?i)(?:COVID-19(?: (?!IGNORE)\w+)*? (?:positiv(?:e|ity)|test(?:ed|s|ing)? positive|(?:test|pcr) remains positive|notif(?:y

In [33]:
def is_span_contained(span1, span2):
    """
    Checks if one span is contained within the other span and yields the smaller span if so.

    Parameters:
        span1 (span)
        span2 (span)

    Returns:
        span: span1 if it is contained within span2, or span2 if it is contained within span1;
        nothing is yielded if neither span contains the other.
    """
    start1, end1 = span1.span_start, span1.span_end
    start2, end2 = span2.span_start, span2.span_end
    
    if start2 <= start1 and end1 <= end2:
        yield span1
        
    elif start1 <= start2 and end2 <= end1:
        yield span2

magic_session.register(is_span_contained, "is_span_contained", in_rel=[DataTypes.span, DataTypes.span], out_rel=[DataTypes.span])
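
The way a context match is tied back to a COVID-19 mention can be sketched with plain `re` and `(start, end)` tuples. The rule below is a shortened, illustrative negation pattern written in the `pattern#attribute` form of context_rules.csv; it is not claimed to be a row of that file. The helper `contained` is hypothetical and, unlike `is_span_contained`, only checks one direction.

```python
import re

text = "The patient be tested negative for COVID-19 ."

# Illustrative rule in the pattern#attribute form used by context_rules.csv
rule = r"(?i)(?:(?:neg|negative)(?: for)?(?: (?!IGNORE)\w+)*? COVID-19)#negated"
pattern, attribute = rule.rsplit("#", 1)

context_spans = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
covid_spans = [(m.start(), m.end()) for m in re.finditer("COVID-19", text)]


def contained(inner, outer):
    """True if the half-open span `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# A COVID-19 mention receives the attribute of every context match that contains it
covid_attributes = [
    (covid, attribute)
    for ctx in context_spans
    for covid in covid_spans
    if contained(covid, ctx)
]
print(context_spans)
# [(22, 43)]
print(covid_attributes)
# [((35, 43), 'negated')]
```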

In [34]:
%%rgxlog
CovidAttributes(Path, CovidSpan, CovidAttribute) <- ContextMatches(CovidAttribute, Span1, Path, Pattern), CovidSpans(Path, Span2), is_span_contained(Span1, Span2) -> (CovidSpan)
?CovidAttributes(Path, CovidSpan, CovidAttribute)

printing results for query 'CovidAttributes(Path, CovidSpan, CovidAttribute)':
    Path     |  CovidSpan  |  CovidAttribute
-------------+-------------+------------------
 sample1.txt |  [83, 91)   |     positive
 sample1.txt |  [94, 102)  |     positive
 sample2.txt |  [35, 43)   |     negated



In [35]:
def attribute_filter(group):
    """
    Helper function to filter attributes within each "CovidSpan" of a DataFrame table based on specific conditions.
    
    Parameters:
        group (pandas.Series): A pandas Series representing a group of attributes for each "CovidSpan" within a DataFrame.
        
    Returns:
        str: the filtered "CovidSpan" attribute, determined as follows:
        - 'negated' if 'negated' is present in the group.
        - 'uncertain' if 'uncertain' is present in the group and 'negated' is not present.
        - 'positive' if neither 'negated' nor 'uncertain' are present in the group.
        
    """
    if 'negated' in group.values:
        return 'negated'
    elif 'uncertain' in group.values:
        return 'uncertain'
    else:
        return 'positive'

In [36]:
df = (session.run_commands("?CovidAttributes(Path, CovidSpan, CovidAttribute)", print_results=False, format_results=True))[0]
if len(df) == 0:
    df = DataFrame(columns=["Path","CovidSpan","CovidAttribute"])
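# Group by both Path and CovidSpan so that identical offsets in different files are not merged
df['CovidAttribute'] = df.groupby(['Path', 'CovidSpan'])['CovidAttribute'].transform(attribute_filter)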
df = df.drop_duplicates().reset_index(drop=True)
df

Unnamed: 0,Path,CovidSpan,CovidAttribute
0,sample1.txt,"[83, 91)",positive
1,sample1.txt,"[94, 102)",positive
2,sample2.txt,"[35, 43)",negated


In [37]:
def classify_doc_helper(group):
    """
    Helper function to classify a document as either 'POS', 'UNK', or 'NEG' based on COVID-19 attributes.
    
    Parameters:
        group (pandas.Series): A pandas Series representing a group of COVID-19 attributes for each document within a DataFrame.
        
    Returns:
       str: the document classification determined as follows:
       - 'POS': At least one COVID-19 attribute with "positive" in the group.
       - 'UNK': At least one COVID-19 attribute with "uncertain" in the group and no "positive" attributes.
       - 'NEG': Otherwise.
    """
    if 'positive' in group.values:
        return 'POS'
    elif 'uncertain' in group.values:
        return 'UNK'
    else:
        return 'NEG'
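
The two precedence rules, negated over uncertain over positive for a single span and POS over UNK over NEG for a document, can be traced without pandas. The sketch below uses plain dictionaries; the first three rows are taken from the CovidAttributes output above, while `hypothetical.txt` and the function names are made up to show a conflicting pair of attributes. Grouping by `(path, span)` is a choice made for this sketch.

```python
from collections import defaultdict


def span_attribute(attributes):
    """Pick one attribute per span: negated beats uncertain, which beats positive."""
    if "negated" in attributes:
        return "negated"
    if "uncertain" in attributes:
        return "uncertain"
    return "positive"


def classify_document(span_attributes):
    """Classify a document from the attributes of its COVID-19 spans."""
    if "positive" in span_attributes:
        return "POS"
    if "uncertain" in span_attributes:
        return "UNK"
    return "NEG"


rows = [
    ("sample1.txt", (83, 91), "positive"),
    ("sample1.txt", (94, 102), "positive"),
    ("sample2.txt", (35, 43), "negated"),
    ("hypothetical.txt", (10, 18), "positive"),  # made-up row
    ("hypothetical.txt", (10, 18), "negated"),   # made-up row for the same span
]

# Collect every attribute seen for each (path, span) pair
per_span = defaultdict(set)
for path, span, attribute in rows:
    per_span[(path, span)].add(attribute)

# Reduce each span to one attribute, then collect the attributes per document
per_doc = defaultdict(list)
for (path, _span), attributes in per_span.items():
    per_doc[path].append(span_attribute(attributes))

results = {path: classify_document(attrs) for path, attrs in per_doc.items()}
print(results)
# {'sample1.txt': 'POS', 'sample2.txt': 'NEG', 'hypothetical.txt': 'NEG'}
```

In the made-up document the single span is both positive and negated, so the span resolves to negated and the document is classified as NEG.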

In [38]:
df['DocResult'] = df.groupby('Path')['CovidAttribute'].transform(classify_doc_helper)
df = df[['Path', 'DocResult']]
df = df.drop_duplicates().reset_index(drop=True)
df

Unnamed: 0,Path,DocResult
0,sample1.txt,POS
1,sample2.txt,NEG


In [39]:
df_path = (session.run_commands("?FilesPaths(Path)", print_results=False, format_results=True))[0]
df = (pd.merge(df, df_path, on='Path', how='outer'))
df['DocResult'] = df['DocResult'].fillna("UNK")
df

Unnamed: 0,Path,DocResult
0,sample1.txt,POS
1,sample2.txt,NEG
