## Part 1: Data Extraction

In this section, I will extract raw data from the physical therapist evaluations and clean it for later use in data analysis and in building several machine learning models.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.chunk import RegexpParser
import spacy


## Feature information
age_sex = []
fusion = []
ordering_provider = []
brace = []
pain = []
plof = []
prior_loc = []
ste = []
hr = []
ad = []
num_falls = []
mobility = []
sup_sit = []
sit_stand = []
amb_distance = []
amb_assist = []
stairs_assist = []
num_stairs = []

# Initial label Information
los = []
dc_loc = []

# 10 .txt files of Evaluations are located in the folder 'Initial_Evaluations'
data_folder = os.path.join(os.getcwd(),'Initial_Evaluations')

for root, folder, files in os.walk(data_folder):
    for file in files:
        path = os.path.join(root, file)

        ## Avoid analysis of checkpoint folders
        if '.ipynb_checkpoints' in path:
            continue

        with open(path) as curr_file:
            lines = curr_file.readlines()
            
        collect_plof = False
        collect_mobility = False
        curr_mobility = []
        curr_plof = []
        ## Initial data collection from evaluations
        for line in lines:
            line = line.lower()
        
            if line.startswith('reason for admission:'):
                fusion.append(line.split('reason for admission:')[1].strip())
            if line.startswith('ordered by:'):
                ordering_provider.append(line.split('ordered by:')[1].strip())
            if line.startswith('precautions:'):
                brace.append(line.split('precautions:')[1].strip())
            if line.startswith('subjective:'):
                pain.append(line.split('subjective:')[1].strip())
            if line.startswith('living situation/prior level of function:'):
                collect_plof = True
                curr_plof.append(line.split('living situation/prior level of function:')[1].strip())
                continue
            if line.startswith("objective:"):
                collect_plof = False
            if collect_plof:
                curr_plof.append(line)
            if line.startswith('appearance:'):
                age_sex.append(line.split('appearance:')[1].strip())
            if line.startswith('mobility assessment:'):
                collect_mobility = True
                continue
            if line.startswith("assessment:"):
                collect_mobility = False
            if collect_mobility:
                curr_mobility.append(line)
        mobility.append(''.join(curr_mobility))
        plof.append(''.join(curr_plof))

Let's confirm we extracted information from 10 documents.

In [2]:
print(f'Length of brace list: {len(brace)}')

Length of brace list: 10
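
Checking one list does not rule out a field that was missing from a single file, which would leave that list shorter than the others. Below is a minimal sketch of a broader check, assuming the feature lists are gathered into a dictionary. The helper name `find_length_mismatches` and the sample values are hypothetical and not part of the original notebook.

```python
def find_length_mismatches(features, expected):
    """Return {name: length} for every list whose length differs from `expected`."""
    return {name: len(values) for name, values in features.items() if len(values) != expected}


# Hypothetical sample: 'pain' is one entry short, e.g. a file lacked a 'subjective:' line
sample = {
    'brace': ['lso', 'aspen', 'tlso'],
    'fusion': ['lumbar', 'cervical', 'lumbar'],
    'pain': ['7/10', '5/10'],
}
print(find_length_mismatches(sample, expected=3))
```

An empty dictionary means every list has the expected number of entries.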


Note: I am not able to collect labels as these are not real evaluations. In a live setting, the labels, length of stay and discharge location, would be collected from the electronic medical record, specifically from the discharge summary. I will create a function to 'guess' this data in the **Evaluation Generator** section of this project.

## Part 2: Data Cleansing and Transformation

I will now clean age_sex, fusion, ordering_provider, brace, and pain using *re*, regular expressions.

In [3]:
# Raw strings keep the backslashes from being read as string escapes
age_pattern = r'\d+'
sex_pattern = r'\S*male'
fusion_pattern = r'(cervical|lumbar)'
brace_pattern = r'\D*lso|aspen'
pain_pattern = r'\d+'

def extract_pattern(pattern,raw_list):
    clean_list = [re.search(pattern,patient).group() for patient in raw_list]
    return clean_list

age = extract_pattern(age_pattern, age_sex)
sex = extract_pattern(sex_pattern, age_sex)
fusion = extract_pattern(fusion_pattern,fusion)
brace = extract_pattern(brace_pattern,brace)
pain = extract_pattern(pain_pattern,pain)

print(f' Ages: {age}')
print(f' Sex: {sex}')
print(f' Fusion Types: {fusion}')
print(f' Brace required: {brace}')
print(f' Pain level: {pain}')

 Ages: ['68', '71', '70', '75', '70', '82', '65', '72', '78', '75']
 Sex: ['female', 'male', 'female', 'male', 'female', 'female', 'male', 'male', 'male', 'male']
 Fusion Types: ['lumbar', 'lumbar', 'lumbar', 'cervical', 'cervical', 'cervical', 'cervical', 'cervical', 'lumbar', 'lumbar']
 Brace required: ['lso', 'tlso', 'lso', 'aspen', 'aspen', 'aspen', 'aspen', 'aspen', 'lso', 'tlso']
 Pain level: ['7', '5', '3', '8', '2', '5', '4', '10', '9', '9']
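
`re.search` returns `None` when a pattern is absent, so `extract_pattern` would raise an `AttributeError` on the first evaluation that lacks a match. Below is a hedged sketch of a more forgiving variant that records `None` instead. The name `extract_pattern_safe` and the sample strings are illustrative assumptions.

```python
import re


def extract_pattern_safe(pattern, raw_list):
    """Like extract_pattern, but yields None where the pattern is not found."""
    clean_list = []
    for patient in raw_list:
        match = re.search(pattern, patient)
        clean_list.append(match.group() if match else None)
    return clean_list


# Hypothetical inputs: the second entry contains no digits
print(extract_pattern_safe(r'\d+', ['68 year old female', 'age not recorded']))
```

Any `None` entries would still need to be handled before the columns are converted to integers later on.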


### Named Entity Recognition
I will use Named Entity Recognition to retrieve the last names of the ordering providers. I believe this will be easier to maintain over time than searching for specific providers in a list that would require updating. 

\* *I filtered out titles (Md, Np, Do, Pa), as some would otherwise be flagged incorrectly as proper nouns.*

In [4]:
def tokenize_tag(names):
    tagged_tokens = [pos_tag(word_tokenize(name)) for name in names]
    return tagged_tokens

def extract_last_name(names):
    tagged_tokens_list = tokenize_tag(names)

    # Search and parse for proper nouns
    grammar = "NAME: {<NNP>+}"
    chunk_parser = RegexpParser(grammar)
    last_name = []
    for tagged_tokens in tagged_tokens_list:
        chunked = chunk_parser.parse(tagged_tokens)

        # Titles to avoid categorizing as proper nouns 
        titles = {'Md', 'Np', 'Do', 'Pa'}

        
        for subtree in chunked:
            if isinstance(subtree, nltk.Tree) and subtree.label() == 'NAME':
                # Filter out titles and get the last proper noun
                name_tokens = [token for token, pos in subtree.leaves() if token not in titles]
                if name_tokens:
                    last_name.append(name_tokens[-1]) # Retrieve only the last name
        
    return last_name

name_string = ['Bob Kuzak, Md, Np','Billy Smith, Md']
ordering_provider = [provider.title() for provider in ordering_provider]
provider = extract_last_name(ordering_provider)

print(f' Provider last names: {provider}')

 Provider last names: ['Nolan', 'Myers', 'Myers', 'Myers', 'Kuzak', 'Kuzak', 'Myers', 'Myers', 'Nolan', 'Nolan']


## Conversion of unstructured text into structured data: an Exploration of Gensim

There is significant variation in how prior living situation and level of function may be documented by therapists. Most choose to document in unstructured paragraphs. The information typically contains:

**Living location**, including whether the home is a single story, or multi-story home and thus requiring stairs to reach the patient's bedroom/bathroom

**Stairs to enter the home (ste) and corresponding number of handrails**

**Falls in the last 6 months**, a significant risk factor for future falls

**Prior use of an assistive device**

Regular expressions (*re*) alone will struggle to capture these pieces of information accurately, because therapists vary in sentence structure and word choice. Thus, I will attempt to handle this with Gensim. Gensim is an open-source Python library that uses vector models to handle natural language processing tasks. I will be using Word2Vec with the pre-trained Google News model, which Google trained on roughly 100 billion words and which contains 300-dimensional vectors for 3 million English words and phrases.

By taking advantage of this pre-trained model, I am able to calculate a numerical difference between texts. This is termed the 'Word Mover's Distance'. In the next cell I will calculate the difference between a living scenario of a single story house compared to various other living scenarios.
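
For reference, the standard definition of Word Mover's Distance can be written as the optimization below. The notation is an assumption taken from the usual presentation of the metric, not from this notebook: $d_i$ and $d'_j$ are the normalized word frequencies of the two texts, $x_i$ is the embedding of word $i$, and $T_{ij}$ is the amount of word $i$ "moved" to word $j$.

```latex
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \sum_{i,j} T_{ij}\, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j
```

Because only word frequencies and embeddings enter the formula, word order plays no role in the result.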

In [5]:
import gensim ## requires SciPy 1.10.1 because triu and tril were deprecated in scipy.linalg
from gensim.models import Word2Vec, KeyedVectors
from gensim import models
import gensim.downloader as api

nlp = spacy.load("en_core_web_sm")
stop_words = stopwords.words('english')

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


# Strings of different living scenarios
ssh = 'I live in a single story house'
two_sh = 'I live in a two story house'
nf = 'I live in a skilled nursing facility'
alf = 'I live in an assisted living facility'
ilf = 'I live in an independent living facility'
apt = 'I live in an apartment'

prior_location = [two_sh, nf, alf, ilf, apt]

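# Removal of stop words and use of lemmatization have been shown to improve results when using Word Mover's Distance.
# Note: gensim documents wmdistance as taking lists of tokens (e.g. processed_ssh.split()); a plain string is
# iterated character by character, so the distances printed below may not reflect a word-level comparison.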
processed_ssh = ' '.join([word.lemma_ for word in nlp(ssh) if not word.text in stop_words])

for loc in prior_location:
    processed_loc = ' '.join([word.lemma_ for word in nlp(loc) if not word.text in stop_words])
    wmd = model.wmdistance(processed_ssh, processed_loc)
    print(f'Distance between "{processed_ssh}" and "{processed_loc}": {wmd:.4f}')

# The difference between two of the same sentences is 0.000
wmd_ssh = model.wmdistance(processed_ssh, processed_ssh)
print(f'Distance between "{processed_ssh}" and "{processed_ssh}": {wmd_ssh:.4f}')

Distance between "I live single story house" and "I live two story house": 0.2388
Distance between "I live single story house" and "I live skilled nursing facility": 0.3446
Distance between "I live single story house" and "I live assisted living facility": 0.3535
Distance between "I live single story house" and "I live independent living facility": 0.4262
Distance between "I live single story house" and "I live apartment": 0.4733
Distance between "I live single story house" and "I live single story house": 0.0000


We would expect that two strings that mention houses would be more similar than a string that mentions living in a house and another that mentions living in a type of facility. As seen in the text above, strings that are more similar have a lower word mover's distance than those that are more different. This seems promising. 

The following cell will show if word mover's distance can handle sentence structure and word choice.

In [6]:
# The equivalent of I live in a two story house
sentence_two_sh = 'I do not live in a single story house, I live in a two story house'

process_two_sh = ' '.join([word.lemma_ for word in nlp(sentence_two_sh) if not word.text in stop_words])

# The equivalent of I live in a single story house
sentence_ssh = 'I do not live in a two story house, I live in a single story house'

process_ssh = ' '.join([word.lemma_ for word in nlp(sentence_ssh) if not word.text in stop_words])

# Calculate WMD
wmd_sentences = model.wmdistance(process_two_sh, process_ssh)
print(f'Distance between {sentence_ssh} and {sentence_two_sh}: {wmd_sentences:.4f}')

Distance between I do not live in a two story house, I live in a single story house and I do not live in a single story house, I live in a two story house: 0.0000


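These sentences depict different living situations; however, they return a word mover's distance of 0.0000. Word mover's distance compares the words each text contains and ignores their order, and after preprocessing these two sentences contain exactly the same words. This may lead to inaccuracies over time.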
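
The zero distance can be illustrated without any embeddings. The sketch below uses plain word counts, which is a simplification and not the spaCy preprocessing used in the cell above, to show that the two sentences are the same bag of words. The helper `bag_of_words` is a hypothetical name.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence, drop punctuation, and count each word."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


sentence_two_sh = 'I do not live in a single story house, I live in a two story house'
sentence_ssh = 'I do not live in a two story house, I live in a single story house'

# Same multiset of words, different order
print(bag_of_words(sentence_two_sh) == bag_of_words(sentence_ssh))
```

Any metric that depends only on these counts cannot tell the two sentences apart.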

I will now experiment on what might occur when the model is given a paragraph with information rather than a sentence. I will include typical living situations a therapist might document. In this example, in an attempt to challenge word mover's distance further, the second paragraph will describe a patient who typically lives in a single story home but plans to move in with their son in a two story home.

In [7]:
ssh = 'Patient lives in a single story house'
two_sh = 'Patient lives in a two story house'
nf = 'Patient lives in a skilled nursing facility'
alf = 'Patient lives in an assisted living facility'
ilf = 'Patient lives in an independent living facility'
apt = 'Patient lives in an apartment'

prior_location = [ssh, two_sh, nf, alf, ilf, apt]

# Paragraph 1 describes a patient who lives in a single story house
ssh_paragraph = (f'Patient lives in a single story house with his wife. He was previously independent with all ADLs without the use of '
                 f'an assistive device. He has 2+2 STE with one handrail to enter the home. Denies falls in the last 6 months.')

# Paragraph 2 describes a patient who lives in a single story house but plans to discharge to a two story house
two_sh_paragraph = (f'Patient lives in a single story house with her daughter. However, she plans to move in with her son in a two story home. '
                    f'The staircase to the second floor has 2 handrails. '
                    f'She was previously modified independent with ADLs with the use of a walker. She has 1+1 STE. 3x falls in the '
                    f'last 6 months.')


my_paragraphs = [ssh_paragraph, two_sh_paragraph]

def process_paragraph(paragraph):
    processed_paragraph = ' '.join([word.lemma_ for word in nlp(paragraph) if not word.text in stop_words])
    return processed_paragraph
    
for index, paragraph in enumerate(my_paragraphs):
    print(f'Paragraph {index+1}:\n')
    for loc in prior_location:
        processed_paragraph = process_paragraph(paragraph)
        processed_loc = ' '.join([word.lemma_ for word in nlp(loc) if not word.text in stop_words])
        wmd = model.wmdistance(processed_paragraph, processed_loc)
        print(f'Distance between paragraph {index+1} and "{processed_loc}": {wmd:.4f}')
    print(f'\n')

Paragraph 1:

Distance between paragraph 1 and "patient live single story house": 0.2148
Distance between paragraph 1 and "patient live two story house": 0.2907
Distance between paragraph 1 and "patient live skilled nursing facility": 0.3265
Distance between paragraph 1 and "patient live assisted living facility": 0.3791
Distance between paragraph 1 and "patient live independent living facility": 0.3881
Distance between paragraph 1 and "patient live apartment": 0.4109


Paragraph 2:

Distance between paragraph 2 and "patient live single story house": 0.2634
Distance between paragraph 2 and "patient live two story house": 0.2828
Distance between paragraph 2 and "patient live skilled nursing facility": 0.3664
Distance between paragraph 2 and "patient live assisted living facility": 0.4055
Distance between paragraph 2 and "patient live independent living facility": 0.4463
Distance between paragraph 2 and "patient live apartment": 0.4939




Word mover's distance alone was not able to decipher that the patient in the second paragraph has a discharge location of a two story house. The WMD between the second paragraph and the string referenced by ssh (0.2634) is smaller than the WMD between the second paragraph and the string referenced by two_sh (0.2828), so the paragraph would be matched to the single story house.
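
Reading the paragraph 2 distances as a nearest-label classifier makes the failure concrete. Below is a small sketch using the values printed above; the dictionary and variable names are illustrative.

```python
# Distances for paragraph 2, copied from the output above
paragraph_2_distances = {
    'patient live single story house': 0.2634,
    'patient live two story house': 0.2828,
    'patient live skilled nursing facility': 0.3664,
    'patient live assisted living facility': 0.4055,
    'patient live independent living facility': 0.4463,
    'patient live apartment': 0.4939,
}

# The predicted label is the reference string with the smallest distance
closest = min(paragraph_2_distances, key=paragraph_2_distances.get)
print(closest)
```

The smallest distance belongs to the single story house, although the patient plans to discharge to a two story home.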

Given the unpredictability of the unstructured data being input by therapists currently, it may end up being more efficient and thus more economical to utilize a stronger natural language processing tool.

## An exploration of OpenAI API to use a GPT to retrieve the correct data

Unfortunately, word mover's distance will likely be too unreliable and produce varied results when confronted with complex paragraphs. We will now use the gpt-4o model through the OpenAI API. While this does require payment, I have found that each evaluation costs much less than one cent to analyze, and the model is able to handle much more complexity within paragraphs.

In [8]:
import openai

from openai import OpenAI

# Initialize the OpenAI API client
client = OpenAI()

In [9]:
# We will supply a prompt for OpenAI:
prompt = f"""
Analyze the following paragraph. The paragraph describes the living situation and prior level of function of 
a patient who is admitted in the hospital.

Extract the following information and return it in the format listed below the paragraph.
1. Number of stories in the living arrangement that the patient was previously at, or the patient is planning to return.
If this is a apartment, independent living facility (ILF), skilled nursing facility (SNF), nursing home (NH), 
or assisted living facility (ALF) then the number of stories is 1. 
2. Number of stairs (steps) and the presence of a handrail.
3. Number of falls in the last six months.
4. Assistive device used to ambulate, if any. If the patient did not use one often, then this should be 'None'.
Paragraph:
\"\"\"
{two_sh_paragraph}
\"\"\"

Example response format:
1. Number of stories in the house: 2
2. Number of stairs (steps): 10 (5+2+3) with 1 handrail
3. Number of falls in the last six months: 1
4. Assistive device used: Walker
"""
response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [
        {"role":"user","content":prompt}
    ]
)
print(response.choices[0].message.content)

1. Number of stories in the house: 2
2. Number of stairs (steps): 2 with 2 handrails
3. Number of falls in the last six months: 3
4. Assistive device used: Walker


It appears gpt-4o was able to decipher the correct number of stories which will need to be navigated in the house. The prompt still needs refinement, however. The response pairs the 2 handrails with the stairs to enter, while the paragraph mentions handrails only for the staircase to the second floor. Also, for our purposes, "1+1 ste" is fundamentally different from 2 stairs to enter, because a patient is able to use a walker on a single step. We will modify our prompt.


In [10]:
prompt = f"""
Analyze the following paragraph. The paragraph describes the living situation and prior level of function of 
a patient who is admitted in the hospital.

Extract the following information and return it in the format listed below the paragraph.
1. Number of stories in the living arrangement that the patient was previously at, or the patient is planning to return.
If this is a apartment, independent living facility (ILF), skilled nursing facility (SNF), nursing home (NH), long-term care facility (LTC),
or assisted living facility (ALF) then the number of stories is 1. 
2. Number of stairs (steps). If the paragraph mentions "1+1 ste" or "1+1 stairs to enter", reduce this to 1.
3. Number of handrails present. If the paragraph makes no mention of handrails, sometimes abbreviated as "hr" or "hrs", then this is 0.
4. Number of falls in the last six months.
5. We will encode the assistive device used, if any:
    No assistive device used: 0
    Cane: 1
    Walker: 2

Paragraph:
\"\"\"
{two_sh_paragraph}
\"\"\"

Example response format:
1. Number of stories in the house: 2
2. Number stairs (steps): 10 (5+2+3) with 0 handrails
3. Number falls in last 6 months: 1
4. Assistive device used, if any: Walker
"""
response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [
        {"role":"user","content":prompt}
    ]
)
print(response.choices[0].message.content)

1. Number of stories in the house: 2
2. Number stairs (steps): 1 with 2 handrails
3. Number of falls in last 6 months: 3
4. Assistive device used, if any: Walker


All information is correct. Let's run the prompt again on the same paragraph, this time with an example response format that returns the assistive device as its integer code.

In [11]:
prompt = f"""
Analyze the following paragraph. The paragraph describes the living situation and prior level of function of 
a patient who is admitted in the hospital.

Extract the following information and return it in the format listed below the paragraph.
1. Number of stories in the living arrangement that the patient was previously at, or the patient is planning to return.
If this is a apartment, independent living facility (ILF), skilled nursing facility (SNF), nursing home (NH), long-term care facility (LTC),
or assisted living facility (ALF) then the number of stories is 1. 
2. Number of stairs (steps). If the paragraph mentions "1+1 ste" or "1+1 stairs to enter", reduce this to 1.
3. Number of handrails present. If the paragraph makes no mention of handrails, sometimes abbreviated as "hr" or "hrs", then this is 0.
4. Number of falls in the last six months.
5. We will encode the assistive device used, if any:
    No assistive device used: 0
    Cane: 1
    Walker: 2

Paragraph:
\"\"\"
{two_sh_paragraph}
\"\"\"

Example response format:
1. Number of stories in the house: 2
2. Number stairs (steps): 10 with 0 handrails
3. Number falls in last 6 months: 1
4. Assistive device used, if any: 2
"""
response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [
        {"role":"user","content":prompt}
    ]
)
print(response.choices[0].message.content)

1. Number of stories in the house: 2
2. Number stairs (steps): 1 with 2 handrails
3. Number of falls in last 6 months: 3
4. Assistive device used, if any: 2


Again, all of the information is correct. We will now use the OpenAI API to analyze the prior level of function and mobility assessment portions of the document. We will supply the model with background information and specific response formatting instructions so that the returned data can go directly into our final dataframe.

In [12]:
import ast
list_pattern = r'(\[.*\])'

for eval_plof in plof:
    plof_prompt = f"""
    You are an automated data extractor. Your only goal is to analyze the following paragraph for the data listed and return it in the format
    required below. The paragraph describes the living situation and prior level of function of 
    a patient who is admitted in the hospital.
    
    Extract the following information and return it in the format listed below the paragraph.
    1. Type of location that the patient was living in. If they plan to live in another location, such as an individual who lives in a two story home
    but plans to stay in a ranch home, then encode the ranch home. We will encode this as follows:
        Ranch home or single story home (ssh): 0
        Two or more story home (2sh): 1
        Apartment: 2
        Independent living facility (ILF): 3
        Assisted living facility (ALF): 4
        Skilled nursing facility (SNF): 5
        Inpatient rehabilitation hospital: 6
        Long-term care facility (LTC): 7
    2. Number of stairs (steps). If the paragraph mentions "1+1 ste" or "1+1 stairs to enter", reduce this to 1. Otherwise, 
    perform the necessary addition.
    3. Number of handrails present. If the paragraph makes no mention of handrails, sometimes abbreviated as "hr" or "hrs", then this is 0.
    4. Type of assistive device used. If this is not mentioned, then assume no assistive device. We will encode the assistive device used, if any:
        No assistive device used: 0
        Cane: 1
        Walker: 2
    5. Number of falls in the last six months.
    
    
    Paragraph:
    \"\"\"
    {eval_plof}
    \"\"\"
    
    Response format will be in the form of a python list with only the integers returned. For example: "[ 1, 3, 2, 0, 3]" where:
        Living location code is at index 0.
        Number of stairs (steps) is at index 1.
        Number of handrails is at index 2.
        Assistive device used is at index 3.
        Number of falls in the last 6 months is at index 4.

    Do not return an explanation. Only return the python list.
    """
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = [
            {"role":"user","content":plof_prompt}
        ]
    )
    gpt_plof_response = response.choices[0].message.content
    print(gpt_plof_response)
    plof_list = ast.literal_eval(re.search(list_pattern,gpt_plof_response).group())
    prior_loc.append(plof_list[0])
    ste.append(plof_list[1])
    hr.append(plof_list[2])
    ad.append(plof_list[3])
    num_falls.append(plof_list[4])

print(prior_loc)

```python
[2, 16, 0, 1, 1]
```
```python
[0, 1, 0, 0, 2]
```
```python
[4, 0, 0, 2, 2]
```
[4, 0, 0, 0, 2]
[3, 0, 0, 1, 0]
```python
[0, 10, 1, 0, 1]
```
```python
[1, 1, 0, 0, 1]
```
```python
[0, 1, 0, 0, 0]
```
```python
[4, 0, 0, 0, 1]
```
```python
[4, 0, 0, 2, 5]
```
[2, 0, 4, 4, 3, 0, 1, 0, 4, 4]
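
As the output shows, the model sometimes wraps its answer in a code fence and sometimes returns the bare list. Below is a hedged sketch of a parser that accepts both and also validates the result before it is appended to the feature lists. The function name `parse_int_list` and its `expected_len` parameter are assumptions for illustration, not part of the original notebook.

```python
import ast
import re


def parse_int_list(response_text, expected_len):
    """Pull the first [...] list out of a model response and validate it."""
    match = re.search(r'\[[^\[\]]*\]', response_text)
    if match is None:
        raise ValueError(f'No list found in response: {response_text!r}')
    values = ast.literal_eval(match.group())
    if len(values) != expected_len or not all(isinstance(v, int) for v in values):
        raise ValueError(f'Expected {expected_len} integers, got: {values!r}')
    return values


# A fenced response like the ones printed above
print(parse_int_list('```python\n[2, 16, 0, 1, 1]\n```', expected_len=5))
```

Raising an error on a malformed response keeps a short or non-numeric list from silently misaligning the columns.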


In [13]:
# We are using the ast package to convert our returned python literal into a list to be iterated
import ast
list_pattern = r'(\[.*\])'

for eval_mob in mobility:
    mobility_prompt = f"""
    
    You are an automated data extractor. Your only goal is to analyze the following paragraph for the data listed and return it in the format
    required below. The paragraph describes the mobility assessment performed on a patient following a surgical fusion 
    that is in an inpatient hospital. 
    
    The purpose of this analysis is to retrieve information regarding the assistance required to perform various mobility, 
    distance of ambulation, and number of stairs completed during the session. Some portions of information may not be available as they 
    might not have been performed.
    
    Assistance ranges from:
    SBA: Stand-by-assistance
    Min-A: Minimal assistance
    Mod-A: Moderate assistance
    Max-A: Maximal Assistance
    'x#' where # is the number of individuals assisting. For example, minAx1 means minimal assistance of 1.
    
    Distance is measured in feet.
    
    Extract the following information and return it in the format listed below the paragraph. 
    1. Supine to sit (also written as sup < > sit or sup to sit) assistance required
    2. Sit to stand (also written as sit < > stand) assistance required
    3. Ambulation assistance required
    4. Ambulation distance (If Ambulation assistance required is 4, then this value is 0.)
    5. Stairs assistance required
    6. Stairs completed (If stairs assistance is 4, then this value is 0.)
    
    Paragraph:
    \"\"\"
    {eval_mob}
    \"\"\"
    
    Return the assistance variables in the encoded format:
    Stand-by-assistance: 0
    Minimal assistance of 1: 1
    Moderate assistance of 1: 2
    Maximal assistance of 1 or more, moderate assistance of 2: 3
    Unable to complete, did not complete, or not feasible to complete: 4
    
    Response format will only be in the form of a python list with only the integers returned. For example: "[ 1, 3, 2, 100, 3, 6]"
    where:
        Supine to sit assistance is index 0.
        Sit to stand assistance is index 1.
        Ambulation assistance is index 2.
        Ambulation distance is index 3.
        Stairs assistance is index 4.
        Stairs completed is index 5.

    Do not return an explanation. Only return the python list.
    """
    
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = [
            {"role":"user","content":mobility_prompt}
        ]
    )
    gpt_mob_response = response.choices[0].message.content
    print(gpt_mob_response)
    mobility_list = ast.literal_eval(re.search(list_pattern,gpt_mob_response).group())
    sup_sit.append(mobility_list[0])
    sit_stand.append(mobility_list[1])
    amb_assist.append(mobility_list[2])
    amb_distance.append(mobility_list[3])
    stairs_assist.append(mobility_list[4])
    num_stairs.append(mobility_list[5])
print(sup_sit)

```python
[2, 4, 4, 0, 4, 0]
```
```python
[1, 1, 1, 125, 2, 1]
```
```python
[0, 0, 1, 50, 4, 0]
```
[1, 2, 1, 5, 4, 0]
```python
[0, 0, 0, 150, 0, 9]
```
```python
[2, 1, 1, 100, 4, 0]
```
```python
[1, 1, 1, 350, 0, 18]
```
[2, 2, 4, 0, 4, 0]
```python
[1, 1, 1, 50, 4, 0]
```
```python
[3, 3, 4, 0, 4, 0]
```
[2, 1, 0, 1, 0, 2, 1, 2, 1, 3]
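
The mobility prompt states two rules: ambulation distance is 0 when ambulation assistance is 4, and stairs completed is 0 when stairs assistance is 4. Below is a minimal sketch of a check that a returned row follows them. The function name `mobility_row_is_consistent` and the second sample row are hypothetical.

```python
UNABLE = 4  # code for "unable to complete" in the mobility prompt


def mobility_row_is_consistent(row):
    """Check the two prompt rules: no distance or stairs when the task was not completed."""
    sup_sit, sit_stand, amb_assist, amb_distance, stairs_assist, num_stairs = row
    if amb_assist == UNABLE and amb_distance != 0:
        return False
    if stairs_assist == UNABLE and num_stairs != 0:
        return False
    return True


# First row is copied from the output above; the second is a hypothetical bad response
print(mobility_row_is_consistent([2, 4, 4, 0, 4, 0]))
print(mobility_row_is_consistent([1, 1, 4, 50, 4, 0]))
```

A row that fails the check could be sent back to the model or flagged for manual review.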


## Aggregation of the Data into a single Dataframe

In [14]:
# Named `data` rather than `dict` so the built-in dict type is not shadowed
data = {
    'age' : age,
    'sex' : sex,
    'fusion' : fusion,
    'provider' : provider,
    'brace' : brace,
    'pain' : pain,
    'prior_loc' : prior_loc,
    'ste' : ste,
    'hr' : hr,
    'ad' : ad,
    'num_falls' : num_falls,
    'sup_sit' : sup_sit,
    'sit_stand' : sit_stand,
    'amb_assist' : amb_assist,
    'amb_distance' : amb_distance,
    'stairs_assist' : stairs_assist,
    'num_stairs' : num_stairs
}

df = pd.DataFrame(data)

columns = df.columns.tolist()

## Conversion of integer columns to type Integer
for col in columns: 
    if col not in ('sex', 'fusion', 'provider', 'brace'):
        df[col] = df[col].astype(int)

We will import the functions used in the **Evaluation Generator** part of this project to categorize pain and predict our label columns. 

In [15]:
from clean_predict import guess_los, guess_dc_loc, categorize_pain, calc_rehab

df['los'] = (
    df.apply(lambda row: guess_los(row['fusion'], row['pain'], row['amb_distance'], row['num_falls']), axis = 1)
)
df['dc_loc'] = (
    df.apply(lambda row: guess_dc_loc(row['prior_loc'], row['fusion'], row['pain'], row['amb_distance'],
                                      row['num_falls']), axis=1)
)
df['pain'] = (
    df['pain'].apply(lambda x: categorize_pain(x))
)

df['need_rehab'] = df['dc_loc'].apply(lambda x: calc_rehab(x))

In [17]:
df.head(10)

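| | age | sex | fusion | provider | brace | pain | prior_loc | ste | hr | ad | num_falls | sup_sit | sit_stand | amb_assist | amb_distance | stairs_assist | num_stairs | los | dc_loc | need_rehab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | female | lumbar | Nolan | lso | moderate | 2 | 16 | 0 | 1 | 1 | 2 | 4 | 4 | 0 | 4 | 0 | 9.7 | 6 | 1 |
| 1 | 71 | male | lumbar | Myers | tlso | moderate | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 125 | 2 | 1 | 4.4 | 0 | 0 |
| 2 | 70 | female | lumbar | Myers | lso | mild | 4 | 0 | 0 | 2 | 2 | 0 | 0 | 1 | 50 | 4 | 0 | 10.1 | 6 | 1 |
| 3 | 75 | male | cervical | Myers | aspen | severe | 4 | 0 | 0 | 0 | 2 | 1 | 2 | 1 | 5 | 4 | 0 | 9.1 | 6 | 1 |
| 4 | 70 | female | cervical | Kuzak | aspen | mild | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 150 | 0 | 9 | 1.6 | 3 | 0 |
| 5 | 82 | female | cervical | Kuzak | aspen | moderate | 0 | 10 | 1 | 0 | 1 | 2 | 1 | 1 | 100 | 4 | 0 | 1.7 | 0 | 0 |
| 6 | 65 | male | cervical | Myers | aspen | moderate | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 350 | 0 | 18 | 2.2 | 1 | 0 |
| 7 | 72 | male | cervical | Myers | aspen | severe | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 0 | 4 | 0 | 8.3 | 6 | 1 |
| 8 | 78 | male | lumbar | Nolan | lso | severe | 4 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 50 | 4 | 0 | 5.8 | 4 | 0 |
| 9 | 75 | male | lumbar | Nolan | tlso | severe | 4 | 0 | 0 | 2 | 5 | 3 | 3 | 4 | 0 | 4 | 0 | 10.0 | 6 | 1 |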


### Final Statements

In a live environment, there are likely changes to be made to this setup. For example, it would be better to check providers against a list of providers in the hospital database. It may also not be entirely feasible to use this form of GPT, as OpenAI has a clause stating that it may keep prompts for up to 30 days, and the prompts in this case contain PHI. As I am new to the field, there are likely other, possibly more precise, methods of cleaning the raw data of which I am currently unaware. There is also more data to be analyzed, such as whether the patient contracted an infection following the procedure, which could have significant effects on length of stay and need for rehabilitation. Lastly, live environments will come with actual label data, which would be essential for drawing real insights from this analysis.