# Dataset Exploration

This notebook is split into three main parts, corresponding to three stages of development:

1. General Data Exploration - The first part builds an overall understanding of the data and how it is structured.
2. Common Headers - The second part extracts the most common headers from the discharge summaries, so that zero-shot models have a structure to work with.
3. Additional Constraints - The dataset was causing out-of-memory errors, so I had to reduce it to a smaller size. This part works out a reasonable cutoff point.


### Part 1 - General Data Exploration


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("./data/NOTEEVENTS.csv")

In [None]:
categories = data.groupby("CATEGORY").size().reset_index().rename(columns={0: "count"})
categories

In [None]:
discharge_summaries = data[data["CATEGORY"] == "Discharge summary"]
discharge_summaries = discharge_summaries[
    discharge_summaries["DESCRIPTION"] == "Report"
]
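# 16000 is approximately the context window for GPT-3.5.
# Note that len() counts characters here, not tokens.
discharge_summaries = discharge_summaries[
    discharge_summaries["TEXT"].map(len) < 16000
]

discharge_summaries["TEXT"].map(len).hist(bins=100)

# Pick an example summary that is exactly 5000 characters long
# (assumes at least one such note exists, otherwise iloc[0] raises)
sample = discharge_summaries[discharge_summaries["TEXT"].map(len) == 5000]

# Use print so that the newlines in the note are rendered
print(sample.iloc[0]["TEXT"])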

In [None]:
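# .iloc[0] extracts the scalar ID; comparing against a one-row Series would fail
random_patient = data.sample()["SUBJECT_ID"].iloc[0]
random_patient = 99082  # Override with a fixed patient for reproducibility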

notes = data[data["SUBJECT_ID"] == random_patient]
notes

In [None]:
summary = notes[notes["CATEGORY"] == "Discharge summary"]
print(summary.iloc[0]["TEXT"])

### Part 2 - Common Headers


In [None]:
# Figure out what the most common headings are in the discharge summaries
data = pd.read_csv("./data/single-discharge-8k.csv")
data = data[data["CATEGORY"] == "Discharge summary"]

In [None]:
import re

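headings = {}

# Match from the start of a line up to a colon followed by whitespace
regex = re.compile(r"^.+:\s", re.MULTILINE)

for text in data["TEXT"]:
    text = text.lower()
    matches = regex.findall(text)
    for match in matches:
        # Drop the whitespace after each colon so that "heading: " and
        # "heading:\n" are counted as the same heading
        match = re.sub(r":\s", ":", match)
        if match not in headings:
            headings[match] = 0
        headings[match] += 1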

headings, len(headings)
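
The counting loop above can also be written with `collections.Counter`. The sketch below is a minimal, self-contained variant run on two made-up notes (the sample text is hypothetical, not taken from the dataset). It swaps in a stricter pattern than the one used above: the match stops at the first colon on a line, so a line such as `sex: f service: medicine` is counted as `sex:` only, rather than as one long heading. Whether that is the desired behavior depends on how the headers are meant to be used.

```python
import re
from collections import Counter

# Assumption: a heading is the text before the FIRST colon on a line.
# This differs from the greedy pattern used in the notebook cell.
HEADING_RE = re.compile(r"^([^:\n]+):\s", re.MULTILINE)


def count_headings(texts):
    """Count lower-cased headings across an iterable of note texts."""
    counts = Counter()
    for text in texts:
        for heading in HEADING_RE.findall(text.lower()):
            counts[heading.strip() + ":"] += 1
    return counts


# Hypothetical sample notes, for illustration only
sample_notes = [
    "Admission Date: 2101-01-01\nSex: F Service: MEDICINE\nAllergies:\nNone\n",
    "Admission Date: 2102-02-02\nAllergies:\nPenicillin\n",
]

heading_counts = count_headings(sample_notes)
print(heading_counts.most_common(3))
```

In the notebook, `count_headings(data["TEXT"])` would produce the same kind of mapping as `headings`, and `most_common(n)` replaces the manual sort.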

In [None]:
# Sort by the most common headings and show the top 20

sorted_headings = sorted(headings.items(), key=lambda x: x[1], reverse=True)

# Skip the first entry because it appears in every discharge summary,
# then take the next 20
sorted_headings[1:21]

### Part 3 - Additional Constraints


In [None]:
data = pd.read_csv("./data/single-discharge-8k-test-formatted.csv")

In [None]:
# Plot the distribution of note lengths (in characters)

data["notes"].map(len).hist(bins=100)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-1.1-7b-it",
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)

In [None]:
DEFAULT_SYSTEM_PROMPT = """
You are an expert clinical assistant. You will receive a collection of clinical notes. You will summarize them in the style of a discharge summary.
""".strip()


def generate_testing_prompt_gemma(
    notes: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""<start_of_turn>user {system_prompt}

### Input:

{notes.strip()}

<end_of_turn>
<start_of_turn>model
""".strip()


# Build the full prompt for every row, then tokenize it
prompts = data["notes"].map(generate_testing_prompt_gemma)
tokens = prompts.map(tokenizer.tokenize)

In [None]:
# Index of the longest prompt, and its length in tokens
biggest = tokens.map(len).idxmax()
len(tokens[biggest])
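
To choose a cutoff from these token counts, it can help to see what fraction of the prompts each candidate limit would keep. The following is a minimal sketch on made-up token counts. The function name `coverage_at_cutoffs`, the sample lengths, and the candidate limits are all hypothetical and only illustrate the idea.

```python
def coverage_at_cutoffs(lengths, cutoffs):
    """Return {cutoff: fraction of lengths that are <= cutoff}."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("lengths must not be empty")
    return {
        cutoff: sum(length <= cutoff for length in lengths) / len(lengths)
        for cutoff in cutoffs
    }


# Hypothetical token counts per prompt, for illustration only
token_lengths = [1200, 2500, 3900, 4100, 7800, 8200, 15000, 31000]

coverage = coverage_at_cutoffs(token_lengths, cutoffs=[4096, 8192, 16384])
for cutoff, fraction in coverage.items():
    print(f"<= {cutoff} tokens: {fraction:.1%} of prompts kept")
```

In the notebook, the lengths would come from `tokens.map(len)`. A cutoff can then be picked as the smallest limit that fits in memory while still keeping an acceptable share of the data.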