# Drug Amount Identification
For my HRP, I will use AI tools to scrape medical notes and identify the drug amounts administered to patients. If this works, I will use these methods to flag discrepancies between the drugs given to patients according to the prescriptions table and the notes written about those patients. Flagging discrepancies for closer examination will help hospitals in the following ways:
1. More accurately track inventory
2. Identify common pain points in data entry
3. Build more robust datasets to work with (less bad data)
4. Treat patients more effectively (more accurate information on treatment a patient has already received)
5. Identify potential fraud

## Setup and Data Exploration
Create the database connection and identify common prescriptions so that we can work with a subset of the data and determine the best model.

In [1]:
import re
from collections import defaultdict
from pathlib import Path

import pandas as pd
import mysql.connector
import yaml
import spacy
import medspacy
from spacy.pipeline import EntityRuler
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from gpt4all import GPT4All

In [48]:
# load config file and connect to MySQL
with open(r"D:\AI_In_Healthcare\config.yaml", 'r') as f:
    config = yaml.safe_load(f)

conn = mysql.connector.connect(
    host=config["mysql"]["host"],
    user=config["mysql"]["user"],
    password=config["mysql"]["password"],
    port=config["mysql"].get("port", 3306)
)
conn.database = "healthcare_db"

In [49]:
def execute_query(query):
    """Run a query and return the full result set as a DataFrame."""
    # discard any unread results left over from a previous query
    try:
        conn.consume_results()
    except Exception:
        pass

    cursor = conn.cursor(buffered=True)
    try:
        cursor.execute(query)
        result = cursor.fetchall()
        columns = [i[0] for i in cursor.description]
        df = pd.DataFrame(result, columns=columns)
    finally:
        cursor.close()

    return df

In [50]:
query = """
SELECT
  LOWER(COALESCE(DRUG_NAME_GENERIC, DRUG)) AS drug_name,
  COUNT(*) AS prescription_count,
  COUNT(DISTINCT SUBJECT_ID) AS patient_count
FROM prescriptions
GROUP BY drug_name
ORDER BY prescription_count DESC
LIMIT 10;
"""

prescription_count_df = execute_query(query)
prescription_count_df

                    drug_name  prescription_count  patient_count
0                         d5w               32823           4021
1                          ns               28535           4001
2          potassium chloride               27652           4324
3                  furosemide               21260           2869
4                  metoprolol               19373           3038
5           magnesium sulfate               12847           3691
6        iso-osmotic dextrose               12133           2986
7              heparin sodium               12125           3505
8                          sw                9750           2912
9  sodium chloride 0.9% flush                8927           3154


We will focus on these top drugs today. We also need to map the different names used for each drug, since hospital employees do not always refer to a drug by the same name.
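As a rough illustration of why a name mapping is needed, the sketch below normalizes casing and whitespace so that trivially different spellings collapse to one key. The helper name `normalize_drug_name` is hypothetical and is not used elsewhere in this notebook; it only handles case and spacing, not true synonyms or abbreviations.

```python
import re

def normalize_drug_name(name):
    """Lowercase a drug name, collapse runs of whitespace, and trim the ends."""
    return re.sub(r"\s+", " ", name).strip().lower()

# spellings that differ only in case or spacing end up as the same key
variants = ["HEPARIN", "Heparin ", "heparin"]
normalized = {normalize_drug_name(v) for v in variants}
print(normalized)
```

Normalizing like this reduces the number of aliases to track, but names such as `ns` for a saline product still need an explicit lookup table.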

In [51]:
# list every distinct (lowercased raw name, generic name, raw name) combination;
# the generic falls back to the raw name when DRUG_NAME_GENERIC is null
query = """
SELECT
    LOWER(DRUG) AS drug_name,
    LOWER(COALESCE(DRUG_NAME_GENERIC, DRUG)) AS generic_name,
    DRUG AS raw_name
FROM prescriptions
GROUP BY drug_name, generic_name, raw_name
ORDER BY drug_name;
"""

drug_name_mapping_df = execute_query(query)

In [52]:
# find the top generic name with the most raw names
top_generic_name = drug_name_mapping_df.groupby('generic_name').size().idxmax()
top_raw_names = drug_name_mapping_df[drug_name_mapping_df['generic_name'] == top_generic_name]['raw_name'].tolist()
print(f"Top generic name: {top_generic_name}")
print(f"Raw names for {top_generic_name}: {', '.join(top_raw_names)}")

Top generic name: heparin flush
Raw names for heparin flush: HEPARIN, Heparin , Heparin CRRT, Heparin Flush, Heparin Flush (10 units/ml), Heparin Flush (100 units/ml), Heparin Flush CVL  (100 units/ml), Heparin Flush CVL  (100 units/ml) , Heparin Flush Hickman (100 units/ml), Heparin Flush Midline (100 units/ml), Heparin Flush PICC (100 units/ml)


In [53]:
# 1) No two rows share the same raw_name
print(drug_name_mapping_df['raw_name'].nunique() == len(drug_name_mapping_df))

# 2) No missing values in either column
print(drug_name_mapping_df['raw_name'].notna().all() and drug_name_mapping_df['generic_name'].notna().all())

# 3) Each raw_name maps to exactly one generic_name
print((drug_name_mapping_df.groupby('raw_name')['generic_name']
          .nunique() == 1).all())

False
True
False


In [54]:
counts = drug_name_mapping_df.groupby('raw_name')['generic_name'].nunique()

bad_raws = counts[counts > 1].index
bad_cases = (
    drug_name_mapping_df
      .loc[drug_name_mapping_df['raw_name'].isin(bad_raws)]
      .groupby('raw_name')['generic_name']
      .apply(list)
      .reset_index(name='generic_names')
)

print(bad_cases.head(10))

                  raw_name                                      generic_names
0              *NF* Niacin                   [*nf* niacin, nicotinic acid sr]
1                      *nf  [*nf* cilostazol, *nf* moxifloxacin, *nf* tria...
2         Abacavir Sulfate         [abacavir oral solution, abacavir sulfate]
3                  Acetami            [acetaminophen, acetaminophen (liquid)]
4                 Acetamin            [acetaminophen, acetaminophen (liquid)]
5            Acetaminophen  [acetaminophen, acetaminophen (liquid), acetam...
6           Acetaminophen    [acetaminophen (liquid), acetaminophen (rectal)]
7   Acetaminophen (Liquid)            [acetaminophen, acetaminophen (liquid)]
8  Acetaminophen w/Codeine  [acetaminophen w/codeine, acetaminophen w/code...
9           Acetylcysteine  [acetylcysteine, acetylcysteine  20% (neb), ac...


OK, these many-to-many relationships are going to be a problem. Today we will ignore them, because we are focusing on a subset of the drugs and making sure we have a model that can extract those drugs and their amounts. In the future, though, this issue will have to be dealt with in the HRP. Hopefully the notes will largely specify, say, which type of acetaminophen was administered, or give enough context clues to make a reasonable guess. If not, it may make sense to simply flag these cases as notation that was not specific enough.
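One possible way to set up that flagging step is to split the mapping into unambiguous and ambiguous raw names up front. This is only a sketch: `split_by_ambiguity` is a hypothetical helper, and the toy pairs stand in for rows of `drug_name_mapping_df` (for example from `drug_name_mapping_df[['raw_name', 'generic_name']].itertuples(index=False)`).

```python
from collections import defaultdict

def split_by_ambiguity(pairs):
    """Split (raw_name, generic_name) pairs into raws with one generic and raws with several."""
    generics_by_raw = defaultdict(set)
    for raw, generic in pairs:
        generics_by_raw[raw].add(generic)
    # raws with exactly one generic can be mapped directly
    unambiguous = {r: next(iter(g)) for r, g in generics_by_raw.items() if len(g) == 1}
    # raws with several generics are kept aside so they can be flagged for review
    ambiguous = {r: sorted(g) for r, g in generics_by_raw.items() if len(g) > 1}
    return unambiguous, ambiguous

toy_pairs = [
    ("Acetaminophen", "acetaminophen"),
    ("Acetaminophen", "acetaminophen (liquid)"),
    ("Furosemide", "furosemide"),
]
unambiguous, ambiguous = split_by_ambiguity(toy_pairs)
print(unambiguous)
print(ambiguous)
```

A note that mentions an ambiguous raw name could then be routed to a "notation not specific enough" bucket instead of being forced onto one generic.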

In [None]:
# find only the raw_names that map to exactly one generic
many_to_one_raws = (
    drug_name_mapping_df
      .groupby('raw_name')['generic_name']
      .nunique()
      .loc[lambda s: s == 1]
      .index
      .tolist()
)

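# quote each name for the SQL IN list: lowercase it to match LOWER(DRUG) and
# double any single quotes so names containing apostrophes do not break the query
prescriptions = ", ".join(
    "'" + r.lower().replace("'", "''") + "'" for r in many_to_one_raws
)

# query for the top 10 generics among those raws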
query = f"""
SELECT
  LOWER(COALESCE(DRUG_NAME_GENERIC, DRUG)) AS generic_name,
  COUNT(*) AS prescription_count
FROM prescriptions
WHERE LOWER(DRUG) IN ({prescriptions})
GROUP BY generic_name
ORDER BY prescription_count DESC
LIMIT 10;
"""

most_common_generics_df = execute_query(query)
most_common_generics_df

In [None]:
raw_to_generic = dict(
    zip(
        drug_name_mapping_df['raw_name'],
        drug_name_mapping_df['generic_name']
    )
)

# for every raw name in the dictionary, add lowercase and uppercase versions mapping to the same generic
for raw_name in list(raw_to_generic.keys()):
    generic_name = raw_to_generic[raw_name]
    raw_to_generic[raw_name.lower()] = generic_name
    raw_to_generic[raw_name.upper()] = generic_name

# Quick sanity check: look at the first 10 mappings
for raw, gen in list(raw_to_generic.items())[:10]:
    print(f"{raw} --> {gen}")

  -->  
*IND* Pexelizumab/Placebo --> *ind* pexelizumab/placebo
*nf --> *nf* triazolam
*nf*  Loteprednol 0.2% ophth --> *nf*  loteprednol 0.2% ophth
*NF* Alemtuzumab --> *nf* alemtuzumab
*NF* Allopurinol Sodium --> *nf* allopurinol sodium
*NF* Arginine HCl --> *nf* arginine hcl
*NF* Basiliximab --> *nf* basiliximab
*NF* Beclomethasone Dipropionate Inhalation --> *nf* beclomethasone dipropionate inhalation
*NF* Benzoyl Peroxide 5% Wash --> *nf* benzoyl peroxide 5% wash


In [None]:
# query unique categories
query = """
SELECT DISTINCT category
FROM noteevents
WHERE category IS NOT NULL AND category != '';
"""

execute_query(query)

            category
0  Discharge summary
1               Echo
2                ECG
3            Nursing
4          Physician
5     Rehab Services
6    Case Management
7        Respiratory
8          Nutrition
9            General


Now we take 8 of our top 10 generics and query the notes that contain the generic name or any of its raw names. Why only 8? `sw` and `ns` appear inside many ordinary words, which makes them hard to search for. We will pull one note from each of the categories nursing, general, discharge summary, physician, pharmacy, and case management to quickly decide which category to focus on.
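To make the problem with `sw` and `ns` concrete, the sketch below counts matches with and without word boundaries on a made-up sentence. The helper `count_matches` and the sample text are illustrative assumptions, not part of the pipeline. Word boundaries remove the hits inside longer words, although a standalone "ns" in a note can still mean something other than the drug.

```python
import re

def count_matches(term, text, whole_word):
    """Count case-insensitive occurrences of term, optionally only as a whole word."""
    pattern = re.escape(term)
    if whole_word:
        pattern = rf"\b{pattern}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# made-up sentence for illustration only
sample = "Patient answers questions. Given 500 cc NS bolus."

print(count_matches("ns", sample, whole_word=False))  # hits inside "answers" and "questions" too
print(count_matches("ns", sample, whole_word=True))   # only the standalone "NS"
print(count_matches("sw", sample, whole_word=False))  # hit inside "answers"
print(count_matches("sw", sample, whole_word=True))
```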

In [None]:
def build_query(or_squence, category):
  return f"""
    SELECT
      SUBJECT_ID,
      HADM_ID,
      CHARTDATE,
      CATEGORY,
      TEXT
    FROM NOTEEVENTS
    WHERE LOWER(TEXT) REGEXP '{or_squence}'
    AND CATEGORY = '{category}'
    LIMIT 1;
    """

# drop sw and ns from the top generics
most_common_generics_df = most_common_generics_df[
    ~most_common_generics_df['generic_name'].isin(['sw', 'ns'])
]
top_gens = most_common_generics_df['generic_name'].tolist()

search_terms = {
    raw
    for raw, gen in raw_to_generic.items()
    if gen in top_gens
} | set(top_gens)

or_squence = "|".join(re.escape(term) for term in search_terms)

notes_df = execute_query(build_query(or_squence, 'discharge summary'))
print(f"Found {len(notes_df)} matching notes.")

Found 1 matching notes.


In [None]:
def find_top10_drugs(text):
    text_lower = text.lower()
    found = {
        raw_to_generic.get(term, term)
        for term in search_terms
        if term in text_lower
    }
    return list(found)

def print_first_note(notes_df):
    if not notes_df.empty:
        first_note = notes_df.iloc[0]
        print(f"Top 10 drugs in note: {first_note['top10_drugs_in_note']}")
        print(f"First note text: {first_note['TEXT']}")
    else:
        print("No notes found.")

notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

print_first_note(notes_df)

Top 10 drugs in note: ['heparin sodium', 'lorazepam', 'levofloxacin']
First note text: Admission Date:  [**2118-6-2**]       Discharge Date:  [**2118-6-14**]

Date of Birth:                    Sex:  F

Service:  MICU and then to [**Doctor Last Name **] Medicine

HISTORY OF PRESENT ILLNESS:  This is an 81-year-old female
with a history of emphysema (not on home O2), who presents
with three days of shortness of breath thought by her primary
care doctor to be a COPD flare.  Two days prior to admission,
she was started on a prednisone taper and one day prior to
admission she required oxygen at home in order to maintain
oxygen saturation greater than 90%.  She has also been on
levofloxacin and nebulizers, and was not getting better, and
presented to the [**Hospital1 18**] Emergency Room.

In the [**Hospital3 **] Emergency Room, her oxygen saturation was
100% on CPAP.  She was not able to be weaned off of this
despite nebulizer treatment and Solu-Medrol 125 mg IV x2.

Review of systems is ne

In [None]:
notes_df = execute_query(build_query(or_squence, 'Nursing/other'))
notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)
print_first_note(notes_df)

Top 10 drugs in note: ['levofloxacin', 'd5w']
First note text: CSRU NPN

Neuro:  Propofol weaned to 10 mcg/kg/min.  Pt following commands, opens eyes minimally to request.  Denies pain by head nods.  Aggitated/restless at times-versed given w/ effect.  Left pupil 3mm, sluggish but reactive.

CV:  Attempted v pacing for HR initially in 60's nsr->BP down, no change in CI. HR up to 70's NSR with wakefulness.  APC's noted this afternoon.  Lytes repleted as needed.  Neo weaned to off w/ stable BP. Vasopressin continues. Milrinone at .125 mcg/kg/min.  MVO@ mainly in low 60's.  CI 1.8-2 via CCO, 2.75 by FICK. [** **] gtt increased to 800u/hr for PTT 61.8-repeat PTT pending.  Hct stable at 27.

Resp:  Weaned to CPAP w/ 12 IPS to maintain VT's approx 400cc, RR high teens at rest, up to 30's w/ aggitation.  CPAP ABG stable.  Suctioned for small amts clear to yellow sputum.

GI:  Abd soft, ND, hypoactive BS.  TF off x 4 hours for residuals of 130-170cc.  Reglan continues. Restarted criticare at 2

In [None]:
notes_df = execute_query(build_query(or_squence, 'Physician'))
notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

print_first_note(notes_df)

No notes found.


In [None]:
notes_df = execute_query(build_query(or_squence, 'General'))
notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

print_first_note(notes_df)

Top 10 drugs in note: ['heparin sodium']
First note text: Attending MICU Note
   Chief Complaint:  abd pain and  hematuria ,  transient hypotension in
   ED
   I saw and examined the patient, and was physically present with the ICU
   Resident for key portions of the services provided.  I agree with his /
   her note above, including assessment and plan.
   HPI:
   85 yo F with stable CLL, not requiring treatment, presents to ED with
   abd pain, hematuria.
   C/o several month h/o ns, fatigue, poor appetite and intermittent L abd
   discomfort Also notes tinnitus, ongoing new since [**Month (only) **].
   Was seen [**1-9**] by pcp for viral uri sx/dry cough
   Returned to PCP [**1-23**] w/ ongoing sputum production and also had noted to
   have  episode of hematuria and ongoing L LQ pelvic/abd pain.  Treated
   with azithro for presumed CAP/bronchitis.  Labs showed increase in WBC
   to 40 from baseline [**9-24**]'s and blood in urine.  Abd  u/s performed as
   outpt showed splenomega

In [None]:
notes_df = execute_query(build_query(or_squence, 'Pharmacy'))
notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

print_first_note(notes_df)

Top 10 drugs in note: ['vancomycin hcl']
First note text: PHARMACY - VANCOMYCIN
   ASSESSMENT:
   Mr. [**Known lastname 86**] continues on vancomycin 1000 mg q48h (day 11); Currently
   on peritoneal dialysis with continually rising creatinine (2.3 to 5
   mg/dL over 5 to 7 days) and decreasing residual urine output.  Last
   vancomycin dose given [**2-2**] at 8 am; Most recent vanco trough 28.8 @
   8:31 am [**2133-2-4**].
   RECOMMENDATION:
          Hold vancomycin dose today and consider decreasing
   dose/frequency to 500 mg q48h
          Start new regimen [**2-5**] or [**2-6**] if decide to continue
   vancomycin therapy
          Goal vancomycin level 15
 20 mcg/mL
   [**Initials (NamePattern4) **] [**Last Name (NamePattern4) 79**], PharmD #[**Numeric Identifier 80**]



In [None]:
notes_df = execute_query(build_query(or_squence, 'Case Management'))
notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

print_first_note(notes_df)

No notes found.


We will use the pharmacy notes: they are the clearest, and each one focuses on a single drug.

In [None]:
query = f"""
SELECT
    TEXT
FROM NOTEEVENTS
WHERE LOWER(TEXT) REGEXP '{or_squence}'
AND CATEGORY = 'Pharmacy'
LIMIT 1000;
"""

notes_df = execute_query(query)
print(f"Found {len(notes_df)} pharmacy notes matching the search terms.")

notes_df['top10_drugs_in_note'] = notes_df['TEXT'].apply(find_top10_drugs)

Found 26 pharmacy notes matching the search terms.


In [None]:
conn.close()

## Extraction:
### Regex
First, we're going to do some standard regex scraping. This is unlikely to work well, but it provides a baseline that we will aim to beat.

In [None]:
# build generic to set of aliases map
generic_aliases = defaultdict(set)
for raw, gen in raw_to_generic.items():
    generic_aliases[gen].add(raw)
for gen in list(generic_aliases):
    generic_aliases[gen].add(gen)

# common dosage patterns regex
unit_pattern   = r'(?:mg|g|mcg|μg|units|puffs)'
number_pattern = r'(\d+(?:\.\d+)?)'

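def extract_with_aliases(text, drug_list):
    """Return dosage mentions found within 20 characters of any alias of the given drugs."""
    text_lower = text.lower()
    extractions = []
    for gen in drug_list:
        # aliases that differ only in case all match the lowercased text,
        # so the same mention can be extracted more than once
        for alias in generic_aliases[gen]:
            esc = re.escape(alias)
            # p1: alias followed by a dose; p2: dose followed by the alias
            p1 = re.compile(fr'{esc}.{{0,20}}?{number_pattern}\s*{unit_pattern}', re.IGNORECASE)
            p2 = re.compile(fr'{number_pattern}\s*{unit_pattern}.{{0,20}}?{esc}', re.IGNORECASE)
            for pat in (p1, p2):
                for m in pat.finditer(text_lower):
                    # each match is stored as a one-element set
                    extractions.append({
                        m.group(0).strip()
                    })
    return extractions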

notes_df['dosage_extractions'] = notes_df.apply(
    lambda row: extract_with_aliases(row['TEXT'], row['top10_drugs_in_note']),
    axis=1
)

In [None]:
print("Note 0")
print(notes_df.iloc[0]['top10_drugs_in_note'])
print(f"\nExtractions: {notes_df.iloc[0]['dosage_extractions']}\n")
print(notes_df.iloc[0]['TEXT'])

Note 0
['vancomycin hcl']

Extractions: [{'vancomycin 1000 mg'}, {'vancomycin 1000 mg'}, {'vancomycin 1000 mg'}]

PHARMACY - VANCOMYCIN
   ASSESSMENT:
   Mr. [**Known lastname 86**] continues on vancomycin 1000 mg q48h (day 11); Currently
   on peritoneal dialysis with continually rising creatinine (2.3 to 5
   mg/dL over 5 to 7 days) and decreasing residual urine output.  Last
   vancomycin dose given [**2-2**] at 8 am; Most recent vanco trough 28.8 @
   8:31 am [**2133-2-4**].
   RECOMMENDATION:
          Hold vancomycin dose today and consider decreasing
   dose/frequency to 500 mg q48h
          Start new regimen [**2-5**] or [**2-6**] if decide to continue
   vancomycin therapy
          Goal vancomycin level 15
 20 mcg/mL
   [**Initials (NamePattern4) **] [**Last Name (NamePattern4) 79**], PharmD #[**Numeric Identifier 80**]



In [None]:
print("Note 1")
print(notes_df.iloc[1]['top10_drugs_in_note'])
print(f"\nExtractions: {notes_df.iloc[1]['dosage_extractions']}\n")
print(notes_df.iloc[1]['TEXT'])

Note 1
['lorazepam']

Extractions: [{'lorazepam 2 mg'}, {'lorazepam 1 mg'}, {'lorazepam 2mg'}, {'lorazepam 1mg'}, {'lorazepam 2 mg'}, {'lorazepam 1 mg'}, {'lorazepam 2mg'}, {'lorazepam 1mg'}, {'lorazepam 2 mg'}, {'lorazepam 1 mg'}, {'lorazepam 2mg'}, {'lorazepam 1mg'}]

Pharmacy Note
   TRANSITIONING and WEANING OPIOIDS: Continue fentanyl infusion and
   initiate methadone intermittent doses (overlap therapy initially).
   Begin to wean fentanyl infusion approximately 3 hours after beginning
   methadone by decreasing the fentanyl infusion by 50% initially.
   MONITORING: Monitor level of analgesia and response to drug as per
   [**Hospital1 54**] sedation/analgesia guideline.
   When fentanyl, propofol and midazolam are weaned off can wean methadone
   as tolerated
 Dose of methadone should be 20mg q6h >> q8h >>
   q12h>>daily.
   Transitioning from midazolam to lorazepam during the 24 hour period in
   anticipation of extubation. Discontinue midazolam. Begin lorazepam 2 mg
   FT q4hr

In [None]:
print("Note 2")
print(notes_df.iloc[2]['top10_drugs_in_note'])
print(f"\nExtractions: {notes_df.iloc[2]['dosage_extractions']}\n")
print(notes_df.iloc[2]['TEXT'])

Note 2
['vancomycin hcl']

Extractions: [{'vancomycin 1 g'}, {'vancomycin 1 g'}, {'vancomycin 1 g'}]

PHARMACY
 VANCO DOSING IN CRRT
   ASSESSMENT - Mr [**Known lastname 86**] is currently on vancomycin 1 gram q48h and is
   now on continuous renal replacement therapy (CRRT) which provides
   variable removal of drugs, but is more efficient than hemodialysis or
   peritoneal dialysis.
   RECOMMENDATION: While on CRRT, drugs may be dosed for an
estimated
   creatinine clearance 20 to 30 mL/minute; mostly dependent on the actual
   ultrafiltration rate.  When levels can be monitored, obtain levels q24h
   and redose drugs to goal (e.g. vancomycin).
     * Vancomycin dose/regimen should likely be changed to 1 gram PRN for
       goal level 15
 20 mcg/mL particularly in the setting of critical
       illness.
     * When vanco level < 20 mcg/mL give 1 gram
     * When vanco level > 20 mcg/mL hold dose and obtain another level 24
       hours later and reassess
   Spoke to Dr. [**Last Name 

I'm actually pretty impressed with how well this has worked, but it is not very robust for several reasons:
1. Using a fixed window to extract a number will fail when the language is more complex. We are counting on getting lucky with the regex, and our code may pick up the wrong number or no number at all.
2. We are able to get units, but 1000mg is very different from 1000mg hourly for 24 hours. For this to be useful for our goal, we need to capture frequency and duration so that we can compute these totals.
3. We get multiple differing extractions for certain entities with no way to differentiate between them, which cripples our ability to develop an automated system.
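Points 2 and 3 can be partly illustrated with a small post-processing sketch: convert each extracted dose to milligrams so that repeats and equivalent doses collapse, and scale a dose by its interval to get a daily total. The helpers `dose_in_mg`, `unique_doses`, and `daily_total_mg` are hypothetical, only handle mcg, mg, and g, and assume the dosing interval in hours is already known, which the regex baseline does not provide.

```python
import re

DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|g)\b", re.IGNORECASE)
TO_MG = {"mcg": 0.001, "mg": 1.0, "g": 1000.0}

def dose_in_mg(extraction):
    """Return the first dose in the string converted to mg, or None if there is none."""
    m = DOSE_RE.search(extraction)
    if m is None:
        return None
    return float(m.group(1)) * TO_MG[m.group(2).lower()]

def unique_doses(extractions):
    """Map each distinct dose (in mg) to the first extraction that produced it."""
    seen = {}
    for text in extractions:
        mg = dose_in_mg(text)
        if mg is not None:
            seen.setdefault(mg, text)
    return seen

def daily_total_mg(dose_mg, every_hours):
    """Total mg over 24 hours for a dose repeated every `every_hours` hours."""
    return dose_mg * (24 / every_hours)

doses = unique_doses(["vancomycin 1000 mg", "vancomycin 1000 mg", "vancomycin 1 g"])
print(doses)
print(daily_total_mg(1000, 1))   # 1000 mg hourly
print(daily_total_mg(1000, 48))  # 1000 mg every 48 hours
```

This removes exact and unit-equivalent repeats, but it cannot decide which of two genuinely different doses for the same drug was actually administered.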

### Medspacy
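We need to improve on the regex approach, so let's use some NLP tools. I will use named entity recognition (NER) to hopefully understand this data better and get more comprehensive extraction. Note that although medspacy is imported, the pipeline below is built from spaCy's `en_core_web_sm` model with an entity ruler for drug names.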

In [None]:
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe(
    "entity_ruler",
    before="ner",
    config={"phrase_matcher_attr": "LOWER"}
)

patterns = [{"label": "DRUG", "pattern": raw} for raw in raw_to_generic]
ruler.add_patterns(patterns)

unit_pat = r'(?:mg|g|mcg|μg|units|puffs)'
num_pat  = r'(\d+(?:\.\d+)?)'
dosage_re = re.compile(fr'{num_pat}\s*{unit_pat}', re.IGNORECASE)
freq_re   = re.compile(
    r'\b(?:hourly|daily|bid|tid|qid|q\d+h|every\s+\d+\s+(?:hours?|days?))\b',
    re.IGNORECASE
)

In [None]:
def extract_spacy(text):
    doc = nlp(text)
    meds = []
    for ent in doc.ents:
        if ent.label_ != "DRUG":
            continue
        span = ent.text
        generic = raw_to_generic.get(span.lower(), span.lower())
        if generic not in most_common_generics_df['generic_name'].tolist():
            continue
        
        # still need a sliding window around the span for regex
        start = max(ent.start_char - 50, 0)
        end   = min(ent.end_char + 50, len(text))
        window = text[start:end]
        
        dose_m = dosage_re.search(window)
        if not dose_m:
            continue
        
        freq_m = freq_re.search(window)
        meds.append({
            "generic":   generic,
            "matched":   span,
            "strength":  dose_m.group(0),
            "frequency": freq_m.group(0) if freq_m else None
        })
    return meds

In [None]:
notes_df['spacy_meds'] = notes_df['TEXT'].apply(extract_spacy)

num = notes_df['spacy_meds'].apply(bool).sum()
print(f"Extracted meds from {num}/{len(notes_df)} notes")
for _, row in notes_df.head(5).iterrows():
    print(row['spacy_meds'])

Extracted meds from 11/26 notes
[]
[{'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '2 mg', 'frequency': None}, {'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '2 mg', 'frequency': None}, {'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '2mg', 'frequency': 'q24h'}, {'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '2mg', 'frequency': 'q24h'}]
[]
[{'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '20 mg', 'frequency': None}, {'generic': 'lorazepam', 'matched': 'lorazepam', 'strength': '2 mg', 'frequency': None}]
[]


I'm disappointed with how poorly this has worked. Compared to the regex version, its outputs are a little better when they occur, but it attempts an answer less often: only 11 of 26 notes produced any extraction. It still relies on a sliding window and regex, which is not good, and it still returns repeated, differing outputs for the same drug. The only improvement is that we sometimes get frequencies. As is, this is not worth taking over the plain regex solution.

### Llama
The final model I will try is a Llama model downloaded through gpt4all. It will run far slower on my hardware than the previous two approaches, but it can use context from across the whole note and should label the drugs far better. We will create a reusable prompt and feed it in with each note text.

In [None]:
model_dir = Path(r"D:\gpt4all\models")  
model_dir.mkdir(exist_ok=True)

model = GPT4All(
    model_name="Meta-Llama-3-8B-Instruct.Q4_0.gguf",
    model_path=model_dir,
    allow_download=True,
    n_threads=4,
    backend="llama"
)

In [None]:
def find_raws_in_text(generics, text):
    text_lower = text.lower()
    raws = []
    for gen in generics:
        for raw in generic_aliases.get(gen, []):
            if raw in text_lower:
                raws.append(raw)
    return raws

In [None]:
notes_df['raws_in_note'] = notes_df.apply(
    lambda row: find_raws_in_text(row['top10_drugs_in_note'], row['TEXT']),
    axis=1
)
notes_df.head(1)

Unnamed: 0,TEXT,top10_drugs_in_note,dosage_extractions,spacy_meds,raws_in_note
0,PHARMACY - VANCOMYCIN\n ASSESSMENT:\n Mr. ...,[vancomycin hcl],"[{vancomycin 1000 mg}, {vancomycin 1000 mg}, {...",[],[vancomycin ]


In [None]:
def make_prompt(target_words, text):
    prompt = (
        "You are a medical expert specializing in drug dosage extraction. "
        "Your task is to identify and extract the dosage information for the following drugs: "
        f"{', '.join(target_words)}.\n\n"
        "For each drug, provide the dosage in the format:\n"
        "drug_name,dosage_value,unit,frequency\n"
        "If a field is not present in the text, use 'None' for that field.\n\n"
        "Here is the text to analyze:\n"
        f"{text}\n\n"
        "Return one line per drug."
    )
    return prompt

def extract_with_gpt4all(row, max_tokens=256):
    drugs = row["raws_in_note"]
    if not drugs:
        return ""
    prompt = make_prompt(drugs, row["TEXT"])

    response = model.generate(prompt, n_predict=max_tokens)
    return response.strip()

In [None]:
#notes_df["gpt4all_output"] = notes_df.apply(extract_with_gpt4all, axis=1)

# extract with gpt4all on the first note only
# the line below ran for 130 minutes with no results; a GPU was needed.
# See the Supplement_LLM notebook for outputs on examples fed from here.
#extract_with_gpt4all(notes_df.iloc[0])

#### No GPU and not enough memory
I was not able to run this on my machine, so Supplement_LLM.ipynb, which is attached with this assignment, shows how I got the prompted outputs from Llama. I ran Supplement_LLM.ipynb with a GPU in Colab and copied the outputs below.

In [None]:
# From GPU enabled environment: Supplement_LLM.ipynb
output1 = "(vancomycin,1000 mg,q48h)"
output2 = "(methadone,20mg,q6h), (fentanyl,350mcg/hr,), (midazolam,10mg/hr,), (propofol,75mcg/kg/min,), (lorazepam,2mg,q4hr),(lorazepam,1mg,q4prn)"
output3 = "(vancomycin,1 gram,q48h), (vancomycin,1 gram,PRN),(vancomycin,1 gram,when level <20 mcg/mL),(vancomycin,hold dose when level >20 mcg/mL)"
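The pasted outputs are flat strings, so they would need to be turned into records before they can be compared with the prescriptions table. The sketch below is one possible parser, assuming the parenthesized, comma-separated layout seen in the three outputs above and no commas or parentheses inside a field. The name `parse_llm_tuples` is hypothetical.

```python
import re

def parse_llm_tuples(output):
    """Parse '(drug,dose,frequency), ...' text into a list of dicts; missing fields become None."""
    records = []
    for body in re.findall(r"\(([^()]*)\)", output):
        # empty fields such as the trailing one in '(fentanyl,350mcg/hr,)' become None
        fields = [f.strip() or None for f in body.split(",")]
        # pad short tuples so that unpacking three values always works
        fields += [None] * (3 - len(fields))
        drug, dose, frequency = fields[:3]
        records.append({"drug": drug, "dose": dose, "frequency": frequency})
    return records

example = "(vancomycin,1000 mg,q48h), (fentanyl,350mcg/hr,)"
for record in parse_llm_tuples(example):
    print(record)
```

A tuple with only two fields, such as the "hold dose" entry in `output3`, ends up with its free text in the `dose` slot, so a real pipeline would need an extra check for entries like that.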

In [None]:
note = notes_df.iloc[0]
print(f"Drugs in note: {note['raws_in_note']}")
print(output1)
print(f"text: {note['TEXT']}")

Drugs in note: ['vancomycin ']
(vancomycin,1000 mg,q48h)
text: PHARMACY - VANCOMYCIN
   ASSESSMENT:
   Mr. [**Known lastname 86**] continues on vancomycin 1000 mg q48h (day 11); Currently
   on peritoneal dialysis with continually rising creatinine (2.3 to 5
   mg/dL over 5 to 7 days) and decreasing residual urine output.  Last
   vancomycin dose given [**2-2**] at 8 am; Most recent vanco trough 28.8 @
   8:31 am [**2133-2-4**].
   RECOMMENDATION:
          Hold vancomycin dose today and consider decreasing
   dose/frequency to 500 mg q48h
          Start new regimen [**2-5**] or [**2-6**] if decide to continue
   vancomycin therapy
          Goal vancomycin level 15
 20 mcg/mL
   [**Initials (NamePattern4) **] [**Last Name (NamePattern4) 79**], PharmD #[**Numeric Identifier 80**]



In [None]:
note = notes_df.iloc[1]
print(f"Drugs in note: {note['raws_in_note']}")
print(output2)
print(f"text: {note['TEXT']}")

Drugs in note: ['lorazepam']
(methadone,20mg,q6h), (fentanyl,350mcg/hr,), (midazolam,10mg/hr,), (propofol,75mcg/kg/min,), (lorazepam,2mg,q4hr),(lorazepam,1mg,q4prn)
text: Pharmacy Note
   TRANSITIONING and WEANING OPIOIDS: Continue fentanyl infusion and
   initiate methadone intermittent doses (overlap therapy initially).
   Begin to wean fentanyl infusion approximately 3 hours after beginning
   methadone by decreasing the fentanyl infusion by 50% initially.
   MONITORING: Monitor level of analgesia and response to drug as per
   [**Hospital1 54**] sedation/analgesia guideline.
   When fentanyl, propofol and midazolam are weaned off can wean methadone
   as tolerated
 Dose of methadone should be 20mg q6h >> q8h >>
   q12h>>daily.
   Transitioning from midazolam to lorazepam during the 24 hour period in
   anticipation of extubation. Discontinue midazolam. Begin lorazepam 2 mg
   FT q4hr and lorazepam 1 mg IV q4hr:prn
   [**2142-2-22**]  1400
   Begin methadone 20mg IV q6hrs
   Fentany

In [None]:
note = notes_df.iloc[2]
print(f"Drugs in note: {note['raws_in_note']}")
print(output3)
print(f"text: {note['TEXT']}")

Drugs in note: ['vancomycin ']
(vancomycin,1 gram,q48h), (vancomycin,1 gram,PRN),(vancomycin,1 gram,when level <20 mcg/mL),(vancomycin,hold dose when level >20 mcg/mL)
text: PHARMACY
 VANCO DOSING IN CRRT
   ASSESSMENT - Mr [**Known lastname 86**] is currently on vancomycin 1 gram q48h and is
   now on continuous renal replacement therapy (CRRT) which provides
   variable removal of drugs, but is more efficient than hemodialysis or
   peritoneal dialysis.
   RECOMMENDATION: While on CRRT, drugs may be dosed for an
estimated
   creatinine clearance 20 to 30 mL/minute; mostly dependent on the actual
   ultrafiltration rate.  When levels can be monitored, obtain levels q24h
   and redose drugs to goal (e.g. vancomycin).
     * Vancomycin dose/regimen should likely be changed to 1 gram PRN for
       goal level 15
 20 mcg/mL particularly in the setting of critical
       illness.
     * When vanco level < 20 mcg/mL give 1 gram
     * When vanco level > 20 mcg/mL hold dose and obtain another

This is by far the best of the options. It does not output repeated copies of the same drug unless the frequency or dosage differs, and, to my eye, it picks everything up accurately. It also does not rely on regex and sliding windows, which makes this method far more robust and transferable to different types of documents and new drugs. Interestingly, the second output added correct information for drugs that were not even in our list. That tells me I need to prompt more carefully, but it also shows the model's potential to generalize to my larger project.
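Until the prompt is tightened, one stopgap is to drop any returned drug that was not requested. The sketch below assumes the model output has already been parsed into dicts with a `drug` key; `keep_requested` is a hypothetical helper. It strips whitespace before comparing because the requested names in this notebook can carry trailing spaces, as in `'vancomycin '`.

```python
def keep_requested(records, requested):
    """Keep only records whose drug name is in the requested list (case and spacing ignored)."""
    wanted = {name.strip().lower() for name in requested}
    return [r for r in records if r["drug"].strip().lower() in wanted]

# hand-written records mirroring part of the second output above
records = [
    {"drug": "methadone", "dose": "20mg", "frequency": "q6h"},
    {"drug": "lorazepam", "dose": "2mg", "frequency": "q4hr"},
    {"drug": "lorazepam", "dose": "1mg", "frequency": "q4prn"},
]
filtered = keep_requested(records, ["lorazepam"])
print(filtered)
```

Filtering afterwards keeps the extra drugs available for the larger project while keeping this comparison limited to the drugs that were searched for.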

## Citation

@misc{gpt4all,
  author = {Yuvanesh Anand and Zach Nussbaum and Brandon Duderstadt and Benjamin Schmidt and Andriy Mulyar},
  title = {GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nomic-ai/gpt4all}},
}