In [1]:
import pandas as pd
df_sample = pd.read_csv("soap_generated_summaries.csv")
# Display the first few rows
print(df_sample.head())

# Check DataFrame info (info() prints its report directly and returns None)
df_sample.info()

                                               input  \
0  Good afternoon, champ, how you holding up? Goo...   
1  What brings you in here today? Hi, I'm um, I'm...   
2  Do you have any known allergies to medications...   
3  How may I help you today? Yeah I've had, a fev...   
4  It sounds like that you're experiencing some c...   

                                              output  \
0  Subjective:\n- Symptoms: Lower back pain, radi...   
1  Subjective:\n- Presenting with dry cough for 1...   
2  Subjective:\n- No known allergies to medicatio...   
3  Subjective:\n- Fever and dry cough started 4 d...   
4  Subjective:\n- Presenting with chest pain for ...   

                                   generated_summary  
0  The 75-year-old man has been experiencing lowe...  
1  , but as it went along, the smell started comi...  
2  The defendant is charged with murder in connec...  
3  that you could have contracted something, poss...  
4  into this further and try to find out what's g..
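
Several of the `generated_summary` previews above start mid-sentence or with a stray comma, which suggests the model sometimes continued the transcript instead of summarizing it. A rough way to flag such rows is sketched below. `looks_like_continuation` is a hypothetical helper, and it assumes that a well-formed summary starts with an uppercase letter or a digit. It checks form only, so an off-topic but well-formed summary (row 2) is not caught.

```python
def looks_like_continuation(summary: str) -> bool:
    """Heuristic: flag text that is empty or does not start like a sentence."""
    text = summary.strip()
    if not text:
        return True
    first = text[0]
    # Assumption: a proper summary opens with an uppercase letter or a digit.
    return not (first.isupper() or first.isdigit())


# The five previews printed above, copied as-is (including their truncation).
previews = [
    "The 75-year-old man has been experiencing lowe...",
    ", but as it went along, the smell started comi...",
    "The defendant is charged with murder in connec...",
    "that you could have contracted something, poss...",
    "into this further and try to find out what's g..",
]

flags = [looks_like_continuation(p) for p in previews]
print(flags)  # [False, True, False, True, True]
```

On the full DataFrame the same check could be applied with `df_sample["generated_summary"].map(looks_like_continuation)`.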

## Summary

In [None]:
!pip install -q transformers huggingface_hub
!pip install -q --upgrade accelerate
!pip install -q -U bitsandbytes

In [None]:
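import os
from huggingface_hub import login

# Read the Hugging Face token from an environment variable rather than
# hard-coding it in the notebook. HF_TOKEN is an assumed variable name;
# if it is unset, login() falls back to an interactive prompt.
login(token=os.environ.get("HF_TOKEN"))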

In [23]:
import os
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd

# -----------------------------------------------------
# 1. Environment setup (optional but often helpful)
# -----------------------------------------------------
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# -----------------------------------------------------
# 2. Load your model and tokenizer
# -----------------------------------------------------
model_name = "meta-llama/Llama-3.2-1B-Instruct"  # Example model name
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # Automatic GPU/CPU placement
    torch_dtype=torch.float16     # Use FP16 for reduced memory usage
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Llama tokenizers typically define no pad token: reuse EOS for padding,
# and pad on the left, as decoder-only generation expects
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eos_token_id

# -----------------------------------------------------
# 3. Define a prompt construction function
# -----------------------------------------------------
def construct_prompt(input_text):
    """
    Constructs an instruction-based prompt for summarization.
    """
    prompt = (
        "Summarize the following case. "
        "Do not include any extra or verbatim text from the input. "
        f"Case:\n{input_text}\n\nSummary:"
    )
    return prompt

# -----------------------------------------------------
# 4. Set your generation parameters
# -----------------------------------------------------
generation_params = {
    "do_sample": True,
    "top_p": 0.8,
    "temperature": 0.9,
    "top_k": 10,
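    "max_new_tokens": 30,  # hard cap; a summary longer than 30 tokens is cut off mid-sentence
    "repetition_penalty": 1.1,
    "eos_token_id": tokenizer.eos_token_id,
    "pad_token_id": tokenizer.eos_token_id,  # explicit, so generate() does not warn on every call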
}

# -----------------------------------------------------
# 5. Load your sample DataFrame (df_sample) with columns "input" and "output"
#    For example, if you've already saved and loaded your CSV:
# -----------------------------------------------------
# df_sample = pd.read_csv("sample_summary.csv")
# For demonstration, if you need to create a dummy DataFrame:
# df_sample = pd.DataFrame({"input": ["Your input text here..."], "output": ["Ground truth summary here..."]})

# -----------------------------------------------------
# 6. Summarize df_sample, decoding only the tokens generated after the prompt
# -----------------------------------------------------
batch_size = 8  # Rows per progress-bar step; each row is still generated one at a time
inputs_list = df_sample["input"].tolist()
generated_summaries = []

def process_batch(batch_inputs):
    batch_generated = []
    for text in batch_inputs:
        prompt = construct_prompt(text)
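        # Tokenize prompt. Truncation drops tokens from the right by default, so a
        # prompt over 1000 tokens loses its trailing "Summary:" cue.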
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1000)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        prompt_length = inputs["input_ids"].shape[1]

        # Generate output tokens
        summary_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            **generation_params
        )
        # Slice out only the tokens that were generated after the prompt
        generated_tokens = summary_ids[0, prompt_length:]
        generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
        batch_generated.append(generated_text)
    return batch_generated

# -----------------------------------------------------
# 7. Process the DataFrame in batches with a progress bar
# -----------------------------------------------------
with tqdm(total=len(inputs_list), desc="Generating Summaries", unit="row") as pbar:
    for i in range(0, len(inputs_list), batch_size):
        batch = inputs_list[i:i + batch_size]
        try:
            batch_generated = process_batch(batch)
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                # Rows are generated one at a time, so batch_size does not affect memory use
                print("Out of memory error; try lowering max_length or max_new_tokens.")
            raise
        generated_summaries.extend(batch_generated)
        torch.cuda.empty_cache()
        pbar.update(len(batch))

# -----------------------------------------------------
# 8. Store and Save
# -----------------------------------------------------
# Add generated summaries as a new column in df_sample
df_sample["generated_summary"] = generated_summaries

# Save the DataFrame with input, output, and generated_summary
df_sample.to_csv("soap_generated_summaries.csv", index=False)
print("Summaries saved to 'soap_generated_summaries.csv'")


Generating Summaries:   0%|          | 0/100 [00:00<?, ?row/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Generating Summaries:   8%|▊         | 8/100 [00:06<01:18,  1.18row/s]Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.

Summaries saved to 'soap_generated_summaries.csv'
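
The tokenizer call in the cell above truncates the whole prompt to 1000 tokens, and truncation removes tokens from the right by default. For a long transcript this can cut off the trailing "Summary:" cue, which may explain generated text that reads like a continuation of the dialogue. One possible workaround, sketched here, is to trim only the case text to a token budget and then append the cue. `build_prompt_with_budget` is a hypothetical helper, and the whitespace `toy_encode`/`toy_decode` pair is a stand-in for the real tokenizer. With the notebook's tokenizer one could pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode` instead. Subword token counts can shift slightly after re-concatenation, so some headroom is advisable.

```python
PREFIX = (
    "Summarize the following case. "
    "Do not include any extra or verbatim text from the input. "
    "Case:\n"
)
SUFFIX = "\n\nSummary:"


def build_prompt_with_budget(input_text, encode, decode, max_prompt_tokens=1000):
    """Trim only the case text so the instruction and the cue always survive."""
    overhead = len(encode(PREFIX)) + len(encode(SUFFIX))
    budget = max_prompt_tokens - overhead
    if budget <= 0:
        raise ValueError("max_prompt_tokens is too small for the instructions")
    body_ids = encode(input_text)[:budget]  # keep the start of the transcript
    return PREFIX + decode(body_ids) + SUFFIX


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def toy_encode(text):
    return text.split()


def toy_decode(tokens):
    return " ".join(tokens)


long_case = " ".join(f"word{i}" for i in range(5000))
prompt = build_prompt_with_budget(long_case, toy_encode, toy_decode)
print(prompt.endswith("Summary:"), len(toy_encode(prompt)))
```

Keeping the start of the transcript is a design choice; for dialogues whose key findings come late, keeping the tail (`encode(input_text)[-budget:]`) may work better.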





In [4]:
df_sample.head(2)

                                               input                                             output
0  Good afternoon, champ, how you holding up? Goo...  Subjective:\n- Symptoms: Lower back pain, radi...
1  What brings you in here today? Hi, I'm um, I'm...  Subjective:\n- Presenting with dry cough for 1...


In [13]:
print(df_sample['input'].iloc[50])

Good morning, young lady, how old are you? Good morning, doctor. I'm thirteen. Good, and what seems to be the problem today? Mom, can you explain for me? Guest_family: Well, if you look, doctor, her back posture is very rounded. I think, it's rounding about the thoracic spine. Is there a family history of this problem? Guest_family: Yes, on my side, my aunt and grandfather had, um, kyphosis. Yes, that's what this is. This is thoracic kyphosis to be specific. Has she seen another doctor for this? Guest_family: Yes, we saw another orthopedist. What did they recommend? Guest_family: They recommended we come in for further observation, so we're here for a second opinion. Good, is there any back pain, numbness or tingling? No, I don't have any of that. Is there any weakness, numbness or tingling in your legs and arms, my dear? No, I'm very strong, especially for my age. Are you going to the bathroom with no problem? Yes, doctor, everything is regular there.


In [20]:
print(df_sample['output'].iloc[50])

Subjective:
- Patient is a 13-year-old girl.
- Complaints of rounded back posture (thoracic spine).
- Family history of kyphosis (aunt and grandfather).
- No back pain, numbness, or tingling reported.
- No weakness, numbness, or tingling in legs and arms.
- Patient states she feels very strong for her age.
- Regular bathroom habits reported.

Objective:
- Observed rounded back posture (thoracic kyphosis).

Assessment:
- Thoracic kyphosis.

Plan:
- Further observation as recommended by the previous orthopedist.
- Consideration for a second opinion on management or treatment options.


In [22]:
print(df_sample['generated_summary'].iloc[50])

A 13-year-old girl visits an orthopedic specialist because she has a rounded back posture (thoracic kyphosis) and her mother mentions that her aunt and grandfather had similar problems. The specialist recommends further observation and a second opinion, but the patient does not experience any symptoms such as back pain, numbness, or weakness.
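
The three printouts above allow a side-by-side reading of the transcript, the reference SOAP note, and the generated summary for row 50. As a rough numeric complement, a unigram-overlap F1 score can be computed in plain Python. This is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE implementation; `tokenize` and `unigram_f1` are hypothetical helpers, and the two short strings are illustrative snippets rather than full rows.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(reference, candidate):
    """Harmonic mean of unigram precision and recall (clipped counts)."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "Assessment: Thoracic kyphosis. Plan: Further observation."
candidate = "The specialist recommends further observation for thoracic kyphosis."
print(round(unigram_f1(reference, candidate), 3))  # 0.571
```

For the row shown above, the call would be `unigram_f1(df_sample["output"].iloc[50], df_sample["generated_summary"].iloc[50])`. Word overlap rewards shared vocabulary only, so it does not detect factual errors.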
