## Step 1: Import packages

In [1]:
import pandas as pd
import re

## Step 2: Import dataset and perform initial preprocessing steps
##### The next piece of code accomplishes several important preprocessing steps. First, we read the dataset into a pandas dataframe from Excel. Next, we'll convert the ReportText column to string and convert any contiguous spaces or carriage returns to a single space. This transforms the ReportText column into a much more readable format for identifying PFT report templates. 

In [2]:
# Import data from Excel to pandas dataframe. Important to include PatientICN, PFT date, and ReportText columns
df = pd.read_excel('[Insert Directory Here]/[Insert File Name Here].xlsx')

# Convert ReportText to string and collapse any run of whitespace (spaces, tabs, carriage returns, newlines) into a single space
df['ReportText'] = df['ReportText'].astype('str')
df['ReportText'] = df['ReportText'].str.replace(r'\s+', ' ', regex=True)

# Convert PFT date column to date data type (pd.to_datetime guards against dates that were read in as text)
df['pft_date'] = pd.to_datetime(df['pft_date']).dt.date
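
##### To see what the whitespace cleanup does, the sketch below applies the same `str.replace` call to a small made-up dataframe. The sample notes, patient IDs, and dates are hypothetical; only the column names follow the ones used in this step.

```python
import pandas as pd

# Hypothetical two-row dataframe standing in for the Excel import
demo = pd.DataFrame({
    'PatientICN': [1001, 1002],
    'pft_date': pd.to_datetime(['2020-01-15 09:30', '2021-06-02 14:00']),
    'ReportText': ['Spirometry   Interpretation:\r\n\r\nFEV1  2.10L/65',
                   'Pulmonary Function Tests:\n\tnormal spirometry'],
})

# Cast to string, then collapse every run of whitespace into one space
demo['ReportText'] = demo['ReportText'].astype('str')
demo['ReportText'] = demo['ReportText'].str.replace(r'\s+', ' ', regex=True)

# Keep only the calendar date of each PFT
demo['pft_date'] = demo['pft_date'].dt.date

print(demo['ReportText'].tolist())
```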

## Step 3: Create function to generate snippets from notes
##### To identify templates, use the following code to inspect a random sample of 100 notes, each printed with a record counter and a `'-'` separator line for readability. You must identify templates before moving on to the next step, where you create the function `extract_fev1_context()`. Feel free to add columns to the loop below for more thorough validation (e.g. PFT date, patient ID).
```
n = 1
for index,row in df.sample(frac=1)[:100].iterrows():
    print(f"Row Number: {n}")
    print(f"Note: {row['ReportText']}")
    print('-'*100)
    n+=1
```
##### Once a template is identified, modify the `pattern` variable in `extract_fev1_context()` to match the beginning phrases or characters identified in the template (for example, `'Spirometry Interpretation:'`), followed by the approximate number of characters after the template start phrase needed to include the variables of interest. This is written as `.{0,n}`, where `n` is the desired length of the snippet in characters after the template start phrase. The template start phrase itself is included in the snippet. You may optionally place `.{0,n}` directly before the template start phrase to include up to `n` characters preceding it. If you identify multiple possible template start phrases, add the pipe operator (`|`) after the first one and follow it with another template start phrase in the same format.

In [2]:
# Function to create snippet based on template start phrases
def extract_fev1_context(text):
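    # '.?' keeps one leading character when present but still matches a phrase at the very start of a note
    pattern = re.compile(r'.?Pulmonary Function Tests:.{0,350}|.?Spirometry Interpretation.{0,350}', re.IGNORECASE)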
    matches = pattern.findall(text)
    return ' '.join(matches)
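
##### The pattern-building rules described in this step can be sketched as a small helper. `build_snippet_pattern` and its parameters are hypothetical names introduced for illustration, and the sample note is made up; the helper simply assembles the `.{0,n}` windows and the `|` alternation from a list of start phrases.

```python
import re

def build_snippet_pattern(start_phrases, n_after=350, n_before=0):
    """Hypothetical helper: one alternation pattern from several template start phrases."""
    parts = [
        # re.escape keeps punctuation inside a phrase literal
        f".{{0,{n_before}}}{re.escape(phrase)}.{{0,{n_after}}}"
        for phrase in start_phrases
    ]
    return re.compile('|'.join(parts), re.IGNORECASE)

pattern = build_snippet_pattern(
    ['Spirometry Interpretation:', 'Pulmonary Function Tests:'],
    n_after=20,   # characters kept after the start phrase
    n_before=5,   # characters kept before the start phrase
)

note = 'History: cough. Spirometry Interpretation: FEV1 2.10L/65 FVC 3.00L/70'
matches = pattern.findall(note)
print(matches)
```

##### A short `n_after` is used here only to make the truncation visible; in practice choose a window long enough to cover every value in the template.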

## Step 4: Generate `Snippet` column by applying the `extract_fev1_context()` function to the `ReportText` column
##### Here, we run the `ReportText` column through the snippet generation function and store the result in a new column called `Snippet`. First, we create a copy of the dataframe to prevent a `FutureWarning` from appearing. Then we create the new column and initialize a new dataframe called `notes_with_fev`, which keeps only rows with an identified snippet.

In [2]:
# Create 'Snippet' column based on 'ReportText' column in original dataframe by running the ReportText values through the function
df = df.copy(deep = True)
df['Snippet'] = df['ReportText'].apply(extract_fev1_context)

# Create new dataframe where all rows with no snippet are dropped
notes_with_fev = df[df['Snippet'] != ''].reset_index(drop=True)

##### As in the previous step, you can check the snippets against the ReportText to make sure the function captures the right text fragments at an adequate length.
```
n = 1
for index,row in notes_with_fev.sample(frac=1)[:100].iterrows():
    print(f"Row Number: {n}")
    print(f"Note: {row['ReportText']}")
    print(f"Snippet: {row['Snippet']}")
    print('-'*100)
    n+=1
```
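
##### Alongside reading individual notes, a quick summary can show how many notes produced a snippet and how long the snippets are. This is a minimal sketch on a made-up dataframe; the rows are hypothetical and only the `ReportText` and `Snippet` column names come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe after the Snippet column is created
demo = pd.DataFrame({
    'ReportText': ['Spirometry Interpretation: FEV1 2.10L/65',
                   'Chest x-ray: clear',
                   'Pulmonary Function Tests: normal spirometry'],
    'Snippet': ['Spirometry Interpretation: FEV1 2.10L/65',
                '',
                'Pulmonary Function Tests: normal spirometry'],
})

has_snippet = demo['Snippet'] != ''
coverage = has_snippet.mean()                        # share of notes with a snippet
lengths = demo.loc[has_snippet, 'Snippet'].str.len() # snippet lengths in characters

print(f"Notes with a snippet: {has_snippet.sum()} of {len(demo)} ({coverage:.0%})")
print(lengths.describe())
```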

## Step 5: Initialize Function to Classify PFT Results 
##### The function below uses regular expressions to extract specific PFT values from snippets and appends the results to the relevant lists. You may rename variables, adapt the regex patterns to identify and extract new or different values, or change the number of values extracted per variable. The function reads snippets from `notes_with_fev` row by row and returns the extracted values as a pandas Series built from a dictionary whose keys are the new column names and whose values are the lists populated earlier in the function.

In [9]:
def classify_fev1(row):
    
    # Initialize variables of interest as lists to hold extracted values
    fev1_abs_pre = []
    fev1_abs_post = []
    fev1_perc_predicted_pre = []
    fev1_perc_predicted_post = []
    fev1_fvc_pre = []
    fev1_fvc_post = []
    
    text = row['Snippet']
    
    # FEV1 absolute value pre BD
    fev1_abs_pre_pattern = re.compile(r'FEV1.*?(\d*\.\d+)L?/\d{2}', re.IGNORECASE)
    fev1_abs_pre_pattern_results = fev1_abs_pre_pattern.findall(text)
    
    if fev1_abs_pre_pattern_results:
        fev1_abs_pre.append(fev1_abs_pre_pattern_results[0])
        
    # FEV1 absolute value post BD
    fev1_abs_post_pattern = re.compile(r'post BD\s(\d*\.\d+)L?/\d{2}', re.IGNORECASE)
    fev1_abs_post_pattern_results = fev1_abs_post_pattern.findall(text)
    
    if fev1_abs_post_pattern_results:
        fev1_abs_post.append(fev1_abs_post_pattern_results[0])
        
    
    # FEV1 Percent Predicted pre BD
    fev1_perc_pred_pre = re.compile(r'FEV1.*?\d*\.\d+L?/(\d{2})', re.IGNORECASE)
    fev1_perc_pred_pre_results = fev1_perc_pred_pre.findall(text)

    if fev1_perc_pred_pre_results:
        fev1_perc_predicted_pre.append(fev1_perc_pred_pre_results[0])
        
    # FEV1 Percent Predicted post BD
    fev1_perc_pred_post = re.compile(r'post BD\s\d*\.\d+L?/(\d{2})', re.IGNORECASE)
    fev1_perc_pred_post_results = fev1_perc_pred_post.findall(text)

    if fev1_perc_pred_post_results:
        fev1_perc_predicted_post.append(fev1_perc_pred_post_results[0])
        
    # FEV1/FVC pre BD
    # Adjacent raw strings are joined into one pattern; a triple-quoted string would embed
    # the line break and indentation in the regex, which the whitespace-normalized text cannot match
    fev1_fvc_pre_pattern = re.compile(r'FEV1/FVC.*?\d*\.\d+L?/'
                                      r'\d{2}\s\d*\.\d+L?/\d{2}\s(\d{2})',
                                      re.IGNORECASE)
    fev1_fvc_pre_results = fev1_fvc_pre_pattern.findall(text)

    if fev1_fvc_pre_results:
        fev1_fvc_pre.append(fev1_fvc_pre_results[0])
        
    # FEV1/FVC post BD
    fev1_fvc_post_pattern = re.compile(r'post BD\s\d*\.\d+L?/'
                                       r'\d{2}\s\d*\.\d+L?/\d{2}\s(\d{2})',
                                       re.IGNORECASE)
    fev1_fvc_post_results = fev1_fvc_post_pattern.findall(text)

    if fev1_fvc_post_results:
        fev1_fvc_post.append(fev1_fvc_post_results[0])
     
    # Initialize positive qualitative descriptor variable
    fev1_qual_hi = []
    
    # Pattern matches for positive descriptors of FEV
    # (adjacent raw strings keep line breaks and indentation out of the regex)
    fev1_qual_hi_pattern = re.compile(r'(no obstructive ventilatory defect|'
                                      r'normal spirometry|no obstruction|'
                                      r'non-specific ventilatory)',
                                      re.IGNORECASE)
    fev1_qual_hi_matches = fev1_qual_hi_pattern.findall(text)

    # Append positive matches to list variable
    for match in fev1_qual_hi_matches:
        if len(match) > 0:
            fev1_qual_hi.append(match)
    
    # Initialize negative qualitative descriptor variable
    fev1_qual_lo = []
    
    # Pattern matches for negative descriptors of FEV
    # (adjacent raw strings keep line breaks and indentation out of the regex)
    # The final alternative replaces a duplicated 'moderately severe ...' entry so that
    # the 'moderate obstructive ventilatory defect' mapping in Step 9a can be reached
    fev1_qual_lo_pattern = re.compile(r'(mild obstructive defect|'
                                      r'mild obstructive ventilatory defect|'
                                      r'moderately severe obstructive ventilatory defect|'
                                      r'very severe obstructive ventilatory defect|'
                                      r'very severe obstruction ventilatory defect|'
                                      r'severe obstructive ventilatory defect|'
                                      r'moderate severe obstructive ventilatory defect|'
                                      r'mild obstruction|moderate obstruction|'
                                      r'severe obstruction|'
                                      r'moderate obstructive ventilatory defect)',
                                      re.IGNORECASE)
    fev1_qual_lo_matches = fev1_qual_lo_pattern.findall(text)
    
    # Append negative matches to list variable
    for match in fev1_qual_lo_matches:
        if len(match) > 0:
            fev1_qual_lo.append(match)
    
    # If a negative match was identified, it supersedes any positive match
    if len(fev1_qual_lo) != 0:
        fev1_qual_hi = []
    
    '''
    Return the results of the above capturing patterns as Series, 
    which are joined to the original dataframe as new columns row-wise. 
    Names are modifiable.
    '''
    return pd.Series({'FEV1_Abs_Pre': fev1_abs_pre if fev1_abs_pre else None,
                      'FEV1_Perc_Pred_Pre': fev1_perc_predicted_pre if fev1_perc_predicted_pre else None,
                      'FEV1_FVC_Pre': fev1_fvc_pre if fev1_fvc_pre else None,
                      'FEV1_Abs_Post': fev1_abs_post if fev1_abs_post else None,
                      'FEV1_Perc_Pred_Post': fev1_perc_predicted_post if fev1_perc_predicted_post else None,
                      'FEV1_FVC_Post': fev1_fvc_post if fev1_fvc_post else None,
                      'FEV1_Qual_High': fev1_qual_hi if fev1_qual_hi else None,
                      'FEV1_Qual_Low': fev1_qual_lo if fev1_qual_lo else None})
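
##### To make the capture groups concrete, the sketch below runs the pre- and post-bronchodilator FEV1 patterns against a single made-up snippet. The snippet text is hypothetical, and for brevity each pattern here captures both the absolute value and the percent predicted in one pass, whereas `classify_fev1()` uses a separate pattern per value.

```python
import re

# Hypothetical snippet in the 'value L/percent' layout the patterns assume
snippet = 'Spirometry Interpretation: FEV1 2.10L/65 post BD 2.35L/72'

pre = re.search(r'FEV1.*?(\d*\.\d+)L?/(\d{2})', snippet, re.IGNORECASE)
post = re.search(r'post BD\s(\d*\.\d+)L?/(\d{2})', snippet, re.IGNORECASE)

# group(1) is the absolute value in litres, group(2) the percent predicted
extracted = {
    'FEV1_Abs_Pre': pre.group(1),
    'FEV1_Perc_Pred_Pre': pre.group(2),
    'FEV1_Abs_Post': post.group(1),
    'FEV1_Perc_Pred_Post': post.group(2),
}
print(extracted)
```

##### Note that `\d{2}` captures exactly two digits, so a percent predicted of 100 or more would be cut to its first two digits. If your template reports such values, consider widening the group to `\d{2,3}`.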

## Step 6: Run dataframe through the PFT extraction function

In [23]:
'''
Initialize a new dataframe called 'results' which
adds the new variables as columns to the original dataframe
'''
results = notes_with_fev.join(notes_with_fev.apply(classify_fev1, axis = 1))
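
##### The `apply` plus `join` idiom used here can be seen on a toy dataframe. `classify_demo` is a hypothetical stand-in for `classify_fev1()` and the two snippets are made up; the point is only how a Series returned per row becomes new columns.

```python
import pandas as pd

def classify_demo(row):
    """Hypothetical stand-in for classify_fev1: returns two new values per row."""
    text = row['Snippet']
    return pd.Series({'Has_FEV1': 'FEV1' in text, 'Length': len(text)})

demo = pd.DataFrame({'Snippet': ['FEV1 2.10L/65', 'normal spirometry']})

# apply(..., axis=1) calls the function once per row; the returned Series
# are stacked into a dataframe whose columns are the dictionary keys
new_cols = demo.apply(classify_demo, axis=1)

# join aligns on the index, so each new value lands beside its source row
demo_results = demo.join(new_cols)
print(demo_results)
```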

## Step 7: Extract values from FEV1 % predicted, FEV1:FVC pre-BD, and qualitative variables.
##### These values are stored as nested lists, so we need to apply a function that extracts the value via indexing and converts it to either an integer (quantitative) or string (qualitative) for later processing. These values are stored in new variables added to the `results` dataframe.

In [26]:
def extract_value(nested_list):
    if nested_list is not None:
        return int(nested_list[0])
    
# Create new variables 'FEV1_Percent_Pred' and 'fev1_fvc' to hold extracted quantitative values
results['FEV1_Percent_Pred'] = results['FEV1_Perc_Pred_Pre'].apply(extract_value)
results['fev1_fvc'] = results['FEV1_FVC_Pre'].apply(extract_value)

def extract_fev1_qualitative(nested_list):
    if nested_list is not None:
        return str(nested_list[0])

# Create new variables to hold qualitative data
results['fev1_qual_neg'] = results['FEV1_Qual_Low'].apply(extract_fev1_qualitative)
results['fev1_qual_pos'] = results['FEV1_Qual_High'].apply(extract_fev1_qualitative)
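
##### The sketch below shows what this unpacking does to a column that mixes one-element lists with missing entries. `first_as_int` is a hypothetical copy of the same idea and the values are made up.

```python
import pandas as pd

def first_as_int(values):
    """First list element as an int; returns None when nothing was extracted."""
    if values is not None:
        return int(values[0])

# Hypothetical column as produced by the extraction step: lists of strings or None
col = pd.Series([['65'], None, ['82']], dtype='object')
converted = col.apply(first_as_int)
print(converted.tolist())
```

##### Depending on the pandas version, the missing entry may come back as `None` or as `NaN`, so `pd.isna()` is the safer way to test for it in later steps.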

## Step 8: Create mapping functions to map quantitative values to the standard clinical definitions of obstruction and severity of obstruction

In [29]:
# Create mapping function
def fev1_severity(value):
    if value >= 80:
        return "Normal"
    if 70 <= value <= 79:
        return "Mild"
    if 60 <= value <= 69:
        return "Moderate"
    if 50 <= value <= 59:
        return "Moderately Severe"
    if 35 <= value < 50:
        return "Severe"
    if value < 35:
        return "Very Severe"
    
def obstruction(value):
    if value >= 70:
        return "Normal"
    if value < 70:
        return "Reduced"

# Create new variables 'FEV1_Severity' and 'Obstruction' by running the FEV1 % predicted and FEV1:FVC variables through the mapping functions
results['FEV1_Severity'] = results['FEV1_Percent_Pred'].map(fev1_severity)
results['Obstruction'] = results['fev1_fvc'].map(obstruction)
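
##### As an alternative to the chained comparisons, the same severity bands can be expressed with `pd.cut`. This is a sketch of a different technique, not the method used in this notebook; the sample percentages are made up and the band edges are copied from `fev1_severity()`.

```python
import pandas as pd

# Hypothetical FEV1 % predicted values, including one missing entry
pct = pd.Series([85, 79, 64, 50, 35, 20, None], dtype='float')

bins = [float('-inf'), 35, 50, 60, 70, 80, float('inf')]
labels = ['Very Severe', 'Severe', 'Moderately Severe', 'Moderate', 'Mild', 'Normal']

# right=False closes each interval on the left: [35, 50), [50, 60), ...
severity = pd.cut(pct, bins=bins, labels=labels, right=False)
print(severity.tolist())
```

##### Unlike the integer ranges in `fev1_severity()`, these half-open intervals also cover fractional values such as 79.5, and missing inputs stay missing.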

## Step 9a: Impute FEV1 severity values from qualitative data
##### This function assigns values to the `FEV1_Severity` variable created in the previous step, based on qualitative data in note snippets. The value is only imputed if the quantitative mapping function produced no FEV1 severity results, indicating that, for that row, no quantitative value for FEV1 severity was extracted from the templated note snippet.

In [35]:
# Lower-cased qualitative descriptors mapped to severity categories
qual_to_severity = {
    'normal spirometry': 'Normal',
    'mild obstructive ventilatory defect': 'Mild',
    'mild obstructive defect': 'Mild',
    'moderate obstructive ventilatory defect': 'Moderate',
    'moderately severe obstructive ventilatory defect': 'Moderately Severe',
    'moderate severe obstructive ventilatory defect': 'Moderately Severe',
    'severe obstructive ventilatory defect': 'Severe',
    'severe obstruction': 'Severe',
    'very severe obstructive ventilatory defect': 'Very Severe',
    'very severe obstruction ventilatory defect': 'Very Severe',
}

def fev1_severity_from_qual(row):
    # Keep the quantitative result whenever one exists (pd.notna handles both None and NaN)
    if pd.notna(row['FEV1_Severity']):
        return row['FEV1_Severity']
    # Check the positive descriptor first, then the negative one
    for col in ('fev1_qual_pos', 'fev1_qual_neg'):
        descriptor = row[col]
        # Lower-casing makes the lookup independent of capitalization in the note
        if isinstance(descriptor, str) and descriptor.lower() in qual_to_severity:
            return qual_to_severity[descriptor.lower()]
    return row['FEV1_Severity']

# If quantitative data is missing for FEV1 % predicted, use qualitative data to map value
results['FEV1_Severity'] = results.apply(fev1_severity_from_qual, axis = 1)

## Step 9b: Impute obstruction values from qualitative data
##### This function assigns values to the `Obstruction` variable created in the previous step, based on qualitative data in note snippets. The value is only imputed if the quantitative mapping function produced no results.

In [17]:
def obstruction_from_qual(row):
    # Keep the quantitative result whenever one exists (pd.notna handles both None and NaN)
    if pd.notna(row['Obstruction']):
        return row['Obstruction']
    positive = row['fev1_qual_pos']
    # Lower-casing makes the comparison independent of capitalization in the note
    if isinstance(positive, str) and positive.lower() in ['no obstructive ventilatory defect',
                                                          'normal spirometry',
                                                          'no obstruction',
                                                          'non-specific ventilatory']:
        return "Normal"
    # Any negative descriptor indicates obstruction
    if pd.notna(row['fev1_qual_neg']):
        return "Reduced"
    return row['Obstruction']
    
# If quantitative data is missing for Obstruction, use qualitative data to map value
results['Obstruction'] = results.apply(obstruction_from_qual, axis = 1)
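
##### The "quantitative first, qualitative as fallback" rule can also be written without a row-wise function. The sketch below is one possible vectorized variant using `fillna`, shown on made-up rows; it is an illustration of the rule, not a replacement for the function in this step.

```python
import pandas as pd

# Hypothetical rows: one quantitative result, two qualitative-only, one with nothing
demo = pd.DataFrame({
    'Obstruction': ['Reduced', None, None, None],
    'fev1_qual_pos': [None, 'no obstruction', None, None],
    'fev1_qual_neg': [None, None, 'mild obstruction', None],
})

# Qualitative guess: a positive descriptor means Normal, a negative one means Reduced
qual_guess = pd.Series(None, index=demo.index, dtype='object')
qual_guess.loc[demo['fev1_qual_pos'].notna()] = 'Normal'
qual_guess.loc[demo['fev1_qual_neg'].notna()] = 'Reduced'

# fillna only touches rows where the quantitative value is missing
demo['Obstruction'] = demo['Obstruction'].fillna(qual_guess)
print(demo['Obstruction'].tolist())
```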

## Step 10: Drop duplicate rows or rows missing extracted PFT data

In [1]:
# List columns to define on which variables you would like to drop duplicates
list_cols = ['Obstruction', 'FEV1_Severity', 'FEV1_Abs_Pre', 'FEV1_Perc_Pred_Pre', 
             'FEV1_FVC_Pre', 'FEV1_Abs_Post', 'FEV1_Perc_Pred_Post', 'FEV1_FVC_Post', 
             'FEV1_Qual_High', 'FEV1_Qual_Low']

# Drop rows that have no extracted PFT values
results = results.dropna(subset = list_cols, how = 'all')

# Convert columns in list_cols to string
for col in list_cols:
    results[col] = results[col].apply(lambda x: str(x))
    
# Drop duplicates of PFT results based on columns of interest + PatientICN and PFT date
results = results.drop_duplicates(subset = ['PatientICN', 'pft_date'] + list_cols)

# Replace 'None' cells with empty strings for readability in the output Excel file
results.replace('None','',inplace = True)
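
##### The string conversion matters because the extracted values are Python lists, which are unhashable, and `drop_duplicates` typically raises a `TypeError` when asked to compare them. The sketch below shows the conversion and de-duplication on made-up rows.

```python
import pandas as pd

# Hypothetical rows: the first two are the same PFT result extracted twice
demo = pd.DataFrame({
    'PatientICN': [1001, 1001, 1002],
    'FEV1_Abs_Pre': [['2.10'], ['2.10'], None],
})

# Casting to string makes the values comparable; None becomes the text 'None'
demo['FEV1_Abs_Pre'] = demo['FEV1_Abs_Pre'].apply(str)

# Remove the repeated row, then blank out the 'None' placeholders
deduped = demo.drop_duplicates().replace('None', '')
print(deduped)
```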

## Step 11: Merge rows from same PFT with multiple notes containing values for different variables

In [19]:
# Define columns that keep the max value when the rows being merged hold different values
columns_to_max = ['PatientSID', 'Obstruction', 'FEV1_Severity', 'FEV1_Abs_Pre', 'FEV1_Perc_Pred_Pre', 
                  'FEV1_FVC_Pre', 'FEV1_Abs_Post', 'FEV1_Perc_Pred_Post', 'FEV1_FVC_Post', 'FEV1_Qual_High', 'FEV1_Qual_Low']

# This function ensures that we don't lose one of the snippets upon merge, but rather append them together
def concatenate_strings(series):
    return ''.join(series.unique())

# Define aggregation function to keep the max value for columns that both have data across the rows
agg_funcs = {col: 'max' for col in columns_to_max}

# Create concatenated snippets for merged rows (instead of taking the "max" snippet value)
agg_funcs['Snippet'] = concatenate_strings

# Regenerate dataframe with collapsed rows for identical PFTs with multiple notes
results = results.groupby(['PatientICN','pft_date'], sort = False).agg(agg_funcs).reset_index()
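
##### The effect of this aggregation is easiest to see on two made-up rows from the same PFT, each holding a different variable. All values below are hypothetical; the aggregation mirrors the `'max'` plus snippet concatenation used in this step.

```python
import pandas as pd

# Two hypothetical notes from one PFT, each carrying a different extracted value
demo = pd.DataFrame({
    'PatientICN': [1001, 1001],
    'pft_date': ['2020-01-15', '2020-01-15'],
    'FEV1_Severity': ['Moderate', ''],
    'Obstruction': ['', 'Reduced'],
    'Snippet': ['FEV1 2.10L/65 ', 'FEV1/FVC 62'],
})

agg_demo = {
    'FEV1_Severity': 'max',                   # a non-empty string beats ''
    'Obstruction': 'max',
    'Snippet': lambda s: ''.join(s.unique()), # keep both snippets
}

merged = demo.groupby(['PatientICN', 'pft_date'], sort=False).agg(agg_demo).reset_index()
print(merged.iloc[0].to_dict())
```

##### Because the columns are strings at this point, `'max'` compares them alphabetically. That fills gaps correctly when one row is empty, but if two rows hold different non-empty values, the alphabetically later one is kept, which is not necessarily the clinically worse one.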

## Step 12: Export data to Excel for validation/analysis

In [22]:
# Select columns to export
columns_to_export = ['Snippet', 'PatientICN', 'PatientSID', 'pft_date', 
                     'Obstruction', 'FEV1_Severity', 'FEV1_Abs_Pre', 
                     'FEV1_Perc_Pred_Pre', 'FEV1_FVC_Pre', 'FEV1_Abs_Post', 
                     'FEV1_Perc_Pred_Post', 'FEV1_FVC_Post', 'FEV1_Qual_High', 'FEV1_Qual_Low']

# Define desired output directory, file name, and file path
output_dir = '[Insert Output Directory Here]/'
file_name = '[Insert Output File Name Here].xlsx'
full_path = output_dir + file_name
to_export = results

# Export data as .xlsx file
to_export.to_excel(full_path, columns = columns_to_export, index = False)
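
##### If you prefer not to rely on a trailing slash in `output_dir`, the path can be assembled with `pathlib` instead of string concatenation. The directory and file names below are hypothetical placeholders; `to_excel` accepts a `Path` object as well as a string.

```python
from pathlib import Path

# Hypothetical directory and file name; replace with your own
output_dir = Path('pft_output')
file_name = 'pft_results.xlsx'

# The '/' operator inserts the separator, so a missing trailing slash is harmless
full_path = output_dir / file_name
print(full_path)
```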