# 2023 Data4Good Case Competition: End-to-End Medical Information Extraction with Open-Source LLMs

This notebook shows how raw medical transcript text is converted into a structured medical information form using open-source Large Language Models from Hugging Face.


## 0. Table of Contents and Workflow Highlights

Use this section to navigate quickly through the notebook and understand which real-world problem each stage solves.

| Section | Practical Problem Solved | Workflow Outcome |
| --- | --- | --- |
| **[1. Project Context and Success Criteria](#sec-1)** | Clinical conversations are unstructured and difficult to convert into consistent records. | Defines a clear six-field output contract (patient name, age, condition, symptoms, advised precautions, prescribed drugs) for downstream documentation workflows. |
| **[2. Data Exploration and Data Quality Assessment](#sec-2)** | Raw transcripts include non-English text, noise, and missing details. | Identifies data risks early so preprocessing and extraction are designed for real input conditions. |
| **[3. LLM Pipeline Implementation](#sec-3)** | Manual extraction is slow, and naive LLM use is often inconsistent. | Adds translation, few-shot prompting, retries, and chunked processing for reliable large-batch extraction. |
| **[4. Post-processing and Output Normalization](#sec-4)** | Model outputs are often semantically correct but structurally unusable for downstream systems. | Enforces deterministic formatting with regular expressions and row-level alignment so outputs are system-consumable. |

### Quick Jump Links
- [2.1 Transcript Quality Checks](#sec-2-1)
- [3.1 End-to-End Workflow Overview](#sec-3-1)
- [3.2 Translation to English (`facebook/nllb-200-distilled-1.3B`)](#sec-3-2)
- [3.3 Information Extraction (`NousResearch/Nous-Hermes-Llama2-13b`)](#sec-3-3)
- [3.3.5 Batch Inference and Reliability Controls](#sec-3-3-5)
- [4.1 Cleaning Goals and Why They Matter](#sec-4-1)


<a id="sec-1"></a>
# 1. Project Context and Success Criteria


<a id="sec-1-1"></a>
## 1.1 Problem Statement, Inputs, Outputs, and Impact

- **Dataset and requirement context:** 2023 Purdue Data 4 Good competition (https://business.purdue.edu/events/data4good/).
- **Primary goal:** automate extraction of structured medical information from transcript text.
- **Input:** anonymized/synthesized doctor-patient conversations or medical dictations.
- **Output:** six form-ready fields (patient name, age, condition, symptoms, precautionary advice, medication).
- **Impact:** reduce manual documentation workload while improving workflow consistency.


<a id="sec-2"></a>
# 2. Data Exploration and Data Quality Assessment


In [None]:
# import necessary libraries
import json
import random
import pandas as pd
import numpy as np
import re
import time

In [None]:
# for formatting
dash_line = '-' * 99  # 99 dashes, used as a visual separator between printed samples
print(dash_line)

In [3]:
# read transcript data
with open('transcripts.json') as f:
    data = json.load(f)

In [5]:
# print random 10 transcripts to check the data
random_keys = random.sample(list(data.keys()), 10)
for key in random_keys:
    print(f'{key}:\n{data[key]}')
    print(dash_line)

3754:
D: Hi, Ms. Barber. I'm Dr. Smith. I understand that you have been diagnosed with hepatitis E. 

P: Yes, that's right. I have been feeling really bad for the past few days. 

D: Can you tell me about your symptoms? 

P: Sure. I have had a lot of joint pain and I've been vomiting a lot. My urine has also been really dark. 

D: I see. Based on your symptoms and the diagnosis, you have hepatitis E. 

P: Oh, I see. What can I do to feel better? 

D: Well, first, we need to stop any alcohol consumption as it can worsen the condition. You should also get plenty of rest and stay hydrated. 

P: Ok, I'll do that. And what about medication? 

D: Since it's hepatitis E, we won't prescribe any medication for it. But we will monitor your condition closely and make sure you are improving. 

P: Ok, thank you for your help, Dr. Smith. 

D: You're welcome, Ms. Barber. If you have any concerns or questions, please don't hesitate to call us.
----------------------------------------------------------

In [6]:
# create a transcript df
transcripts = pd.DataFrame.from_dict(data, orient='index', columns=['transcript'])
transcripts.reset_index(inplace=True)
transcripts.rename(columns={'index':'id'}, inplace=True)
transcripts.head()

Unnamed: 0,id,transcript
0,2055,"During the visit, I examined Mr. Don Hicks, wh..."
1,291,"During the visit, I examined Tina Will, a 69-y..."
2,102,"D: Good morning Tommie, how can I help you tod..."
3,2966,"D: Good morning, Chris. I understand you've be..."
4,2438,"D: Hi Ernest, I understand you're here for a c..."


In [255]:
# change data type of id column to int
transcripts['id'] = transcripts['id'].astype(int)

<a id="sec-2-1"></a>
## 2.1 Transcript Quality Checks: Noisy, Non-English, and Incomplete Notes

This subsection inspects representative transcript edge cases before model implementation.


In [28]:
# other languages examples
other_languages_id = [3538, 1732, 2340, 4580, 2098, 4159, 76, 1407, 3972]
for id in other_languages_id:
    transcript = transcripts[transcripts['id']==id]['transcript'].values[0]
    print(transcript)
    print(dash_line)

D: शुभ प्रभात, थॉमस जी। आज मुझे आपकी कैसे सहायता कर सकते हैं? 

P: शुभ प्रभात, डॉक्टर। मुझे अपने लम्बरों में कमजोरी, चक्कर आना और गले में दर्द के बारे में अनेक दिनों से तक समस्या है। 

D: महसूस हो रहा है, थॉमस जी। आपके लक्षणों और आपकी उम्र के आधार पर, मैं आशा कर रहा हूं कि आपको सर्विकल स्पॉन्डिलोसिस हो सकती है। 

P: यह क्या है, डॉक्टर? 

D: सर्विकल स्पॉन्डिलोसिस एक ऐसा स्थिति है जो आपके गले के हिस्से को प्रभावित करता है, जो आपके गले का हिस्सा है। यह 60 साल से ऊपर के लोगों में आम है और यह आपके गले में दर्द और स्टिफ़नस की वजह से हो सकता है, जैसे ही आपके लम्बरों में कमजोरी और नम्बर्स का महसूस हो सकता है। 

P: ओह, मैं समझ गया। क्या मुझे अच्छे से महसूस करने के लिए
---------------------------------------------------------------------------------------------------
D: ¡Buenos días, Erwin Thompson! Entiendo que has sido diagnosticado con asma bronquial. ¿Me puedes decir más sobre tus síntomas?

P: Hola, doctor. Sí, he estado experimentando fiebre alta y últimamente tengo problemas para respirar

In [29]:
# messy text data examples
messy_text_id = [608, 1848, 1221, 2287, 1766, 3281, 4494]
for id in messy_text_id:
    transcript = transcripts[transcripts['id']==id]['transcript'].values[0]
    print(transcript)
    print(dash_line)

D: Hi Mr. Adair, I understand that you have been diagnosed with hep@tit!s A. How have you been feeling lately?

P: Well, Doc, I've been feeling pretty bad- like, real bad. I've got n@u$ea all the time and my stom@ch hurts. I've also been havin' di@rrhea and my joints ache, you know?

D: I see. It's not surp@rising that you're feeling this way given your condition. Hep@titis A is a v!ral infection that affects the l!ver and can cause a range of sympt0ms, including the ones you mentioned.

P: Is there anything I can do to make myself feel better?

D: First and foremost, you should consult the nearest hospital and follow their advice. In addition, make sure to wash your hands th0roughly with soap and w@ter, avoid fatty and sp!cy foods, and take med!cation as prescribed by the hosp!tal.

P: Okay, I'll do that. Th@nk you for your help, Doctor.

D: You're welcome, Mr. Ad@ir. It's important that you take care of yourself and follow the doctor's advice. Please come back if you have any further

In [30]:
# incomplete data example
incomplete_id = [4401, 1848]
for id in incomplete_id:
    transcript = transcripts[transcripts['id']==id]['transcript'].values[0]
    print(transcript)
    print(dash_line)

During the visit, I, the doxctor, assessed Nancey Blas, who is 8
---------------------------------------------------------------------------------------------------
D: Good mornin', Mr. Mcclure. I understand that you're being treated for hep@titis B. Can you tell me how you're feeling?

P: Ye$s, Doctor. I've been feeling very weak and tired lately. My urine has also become dark in color.

D: I see. Have you noticed any other sy
---------------------------------------------------------------------------------------------------


<a id="sec-2-2"></a>
## 2.2 Data Risks and Modeling Implications

- Because final answers must be in English, non-English transcripts require translation first.
- Because noisy and incomplete notes appear in the dataset, prompting and post-processing must be robust to imperfect inputs.
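
One way to get a rough sense of how common these risks are is a pair of simple heuristics: flag text where a symbol sits inside a word, and flag text that ends without sentence-final punctuation. The sketch below is illustrative only; the function names and the regular expression are assumptions rather than part of the original pipeline, and the truncation check is meant for English (or already translated) text.

```python
import re

# Hypothetical pattern: a letter, then one of a few symbols, then a letter or
# digit, as in "hep@tit!s" or "Ye$s".
SYMBOL_IN_WORD = re.compile(r"[A-Za-z][@$!][A-Za-z0-9]")


def looks_noisy(text: str) -> bool:
    """Return True if any word contains an embedded symbol."""
    return SYMBOL_IN_WORD.search(text) is not None


def looks_truncated(text: str) -> bool:
    """Return True if the text does not end with sentence-final punctuation."""
    return not text.rstrip().endswith((".", "?", "!", '"'))


print(looks_noisy("diagnosed with hep@tit!s A"))          # True
print(looks_truncated("assessed Nancey Blas, who is 8"))  # True
```

Counts from checks like these are only indicative, since clean transcripts can also end abruptly and noise can take other forms (misspellings, stray hyphens).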


<a id="sec-2-3"></a>
## 2.3 Test Set Structure (`test.csv`)


In [197]:
# read test data
test = pd.read_csv('test.csv')
test.head(12)

Unnamed: 0,Id,Transcript,Question
0,587d0feb-5780-43e1-9595-e19d4b31dc07,2055,What is the patient's name?
1,263e8884-e8ba-4266-bb0c-85271419a0b3,2055,What is the patient's age?
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,2055,What is the patient's condition?
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,2055,What symptoms is the patient experiencing?
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,2055,What precautions did the doctor advise?
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,2055,What drug did the doctor prescribe?
6,1f0abde6-83ac-4e8f-b06f-9c9ff72beb88,291,What is the patient's name?
7,8ce89aa4-9d9c-4174-8e5e-957efbf41256,291,What is the patient's age?
8,08862b5d-dbc0-4ea1-884c-329eec085ff6,291,What is the patient's condition?
9,2f9bb4d6-8694-478e-a7f0-83ba534673de,291,What symptoms is the patient experiencing?


<a id="sec-2-3-1"></a>
### 2.3.1 Output Contract

- `test.csv` contains the questions that must be answered by the pipeline.
- Each transcript appears in six rows/questions, which defines the downstream six-line extraction contract.
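
The six-rows-per-transcript assumption can be checked explicitly before any merge. The sketch below counts questions per transcript ID and reports any transcript that deviates; the function name and the tuple layout `(row_id, transcript_id, question)` are assumptions chosen to mirror the columns of `test.csv`.

```python
from collections import Counter

EXPECTED_QUESTIONS = 6  # one row per extracted field


def transcripts_with_wrong_question_count(rows):
    """Return {transcript_id: count} for transcripts without exactly six rows.

    `rows` is an iterable of (row_id, transcript_id, question) tuples.
    """
    counts = Counter(transcript_id for _, transcript_id, _ in rows)
    return {tid: n for tid, n in counts.items() if n != EXPECTED_QUESTIONS}


sample_rows = [
    ("a", 2055, "What is the patient's name?"),
    ("b", 2055, "What is the patient's age?"),
]
print(transcripts_with_wrong_question_count(sample_rows))  # {2055: 2}
```

An empty dictionary indicates that every transcript has the expected six questions.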


<a id="sec-2-4"></a>
## 2.4 Exploration Highlights

- Multilingual input motivated a translation stage before extraction.
- Noisy and incomplete text motivated robust prompt examples and cleanup logic.
- Fixed six-question structure informed merge-key design and submission validation.


<a id="sec-3"></a>
# 3. LLM Pipeline Implementation


<a id="sec-3-1"></a>
## 3.1 End-to-End Workflow Overview

1. Detect source language ([langdetect](https://pypi.org/project/langdetect/)).
2. Translate non-English transcripts to English ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)).
3. Extract six required fields with few-shot prompting ([NousResearch/Nous-Hermes-Llama2-13b](https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b) via API).
4. Enforce output structure and clean generated text.
5. Merge with `test.csv` and export submission artifacts.
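
The first three steps can be sketched as a small orchestration function that receives each stage as a callable. This is a simplified illustration, not the notebook's actual control flow: `run_pipeline` and its parameter names are hypothetical, and steps 4 and 5 (cleaning and merging) are left out.

```python
def run_pipeline(transcripts, detect_language, translate, extract):
    """Run steps 1-3 for a dict of {transcript_id: text}.

    The three callables are stand-ins for the language detector, the
    translation model, and the extraction model.
    """
    results = {}
    for transcript_id, text in transcripts.items():
        lang = detect_language(text)                                # step 1
        english = text if lang == "en" else translate(text, lang)  # step 2
        results[transcript_id] = extract(english)                   # step 3
    return results
```

Passing the stages in as arguments keeps the control flow testable with lightweight stand-ins before any model is loaded.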


<a id="sec-3-2"></a>
## 3.2 Stage A: Translation to English (NLLB)

Model family: Meta NLLB (reference model page: https://huggingface.co/facebook/nllb-moe-54b).


In [None]:
# utilize hugging face transformers
%pip install transformers

<a id="sec-3-2-1"></a>
### 3.2.1 Environment Setup

`PyTorch` is required for local translation model inference.


In [None]:
%pip install torch

In [None]:
%pip install torchdata

<a id="sec-3-2-2"></a>
### 3.2.2 Language Detection Strategy

- Detect the source language before translation so that only non-English transcripts are routed to the translation model.
- `langdetect` is used as a lightweight, transcript-level language detector.
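
`langdetect` reports short ISO 639-1 style codes such as `hi` or `es`, whereas NLLB tokenizers identify languages with FLORES-200 codes such as `eng_Latn`. A minimal sketch of a lookup between the two follows, covering only the six codes that appear in this dataset's language counts. The dictionary name, the helper `to_flores`, and the English fallback are assumptions made for illustration.

```python
# Mapping from langdetect codes to FLORES-200 codes used by NLLB.
LANGDETECT_TO_FLORES = {
    "en": "eng_Latn",
    "ar": "arb_Arab",
    "hi": "hin_Deva",
    "es": "spa_Latn",
    "af": "afr_Latn",
    "cy": "cym_Latn",
}


def to_flores(code: str, default: str = "eng_Latn") -> str:
    """Return the FLORES-200 code for a langdetect code (hypothetical helper).

    Unknown codes fall back to `default`, which is an assumption of this sketch.
    """
    return LANGDETECT_TO_FLORES.get(code, default)


print(to_flores("hi"))  # hin_Deva
```

With the Hugging Face NLLB tokenizers, the source language is typically supplied through the tokenizer's `src_lang` setting rather than through a text prefix, so a lookup like this would feed that setting.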


In [None]:
%pip install langdetect

In [None]:
# import necessary libraries for translation
from langdetect import detect

In [None]:
# import necessary modules from transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

<a id="sec-3-2-3"></a>
### 3.2.3 Model Selection Under Compute Constraints

The distilled model `facebook/nllb-200-distilled-1.3B` is selected to balance quality and available compute.


In [12]:
# load the model and tokenizer
model_name='facebook/nllb-200-distilled-1.3B'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

In [14]:
# try translating some transcripts
other_languages_id_long = [3538, 2098, 76, 4916, 570, 4830, 410]
for id in other_languages_id_long:
    transcript = transcripts[transcripts['id']==id]['transcript'].values[0]
    source_lang = detect(transcript)
    # split transcript into chunks of 512 characters (character slices, not tokens)
    text_chunks = [transcript[i:i + 512] for i in range(0, len(transcript), 512)]
    translated_text_chunks = []
    for text_chunk in text_chunks:
        input_text = f"translate {source_lang} to en:\n{text_chunk}"
        inputs = tokenizer(input_text, return_tensors="pt", padding=True)
        translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"])
        decoded_output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
        translated_text_chunks.append(decoded_output)
    
    # Combine translated chunks
    translated_text = ' '.join(translated_text_chunks)
    
    # print prompt and translation
    print(id)
    print(transcript)
    print('Translation:')
    print(translated_text)
    print(dash_line)

3538
D: शुभ प्रभात, थॉमस जी। आज मुझे आपकी कैसे सहायता कर सकते हैं? 

P: शुभ प्रभात, डॉक्टर। मुझे अपने लम्बरों में कमजोरी, चक्कर आना और गले में दर्द के बारे में अनेक दिनों से तक समस्या है। 

D: महसूस हो रहा है, थॉमस जी। आपके लक्षणों और आपकी उम्र के आधार पर, मैं आशा कर रहा हूं कि आपको सर्विकल स्पॉन्डिलोसिस हो सकती है। 

P: यह क्या है, डॉक्टर? 

D: सर्विकल स्पॉन्डिलोसिस एक ऐसा स्थिति है जो आपके गले के हिस्से को प्रभावित करता है, जो आपके गले का हिस्सा है। यह 60 साल से ऊपर के लोगों में आम है और यह आपके गले में दर्द और स्टिफ़नस की वजह से हो सकता है, जैसे ही आपके लम्बरों में कमजोरी और नम्बर्स का महसूस हो सकता है। 

P: ओह, मैं समझ गया। क्या मुझे अच्छे से महसूस करने के लिए
Translation:
I have been having problems with my loins for several days with weakness, dizziness and neck pain. Based on your symptoms and your age, I hope that you may have cervical spondylosis. P: What is it, doctor? D: Cervical spondylosis is a condition that affects your neck, which is part of your neck. I'm sorry, but I'

In [33]:
# a function to translate transcript
def nllb_translate(transcript, source_lang):
    text_chunks = [transcript[i:i + 512] for i in range(0, len(transcript), 512)]
    translated_text_chunks = []
    
    for text_chunk in text_chunks:
        input_text = f"translate {source_lang} to en:\n{text_chunk}"
        inputs = tokenizer(input_text, return_tensors="pt", padding=True)
        translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"])
        decoded_output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
        translated_text_chunks.append(decoded_output)
        
    translated_text = ' '.join(translated_text_chunks)
    
    return translated_text

In [None]:
# detect language of each transcript
transcripts['source_lang'] = transcripts['transcript'].apply(detect)
transcripts.head()

Unnamed: 0,id,transcript,source_lang
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en
1,291,"During the visit, I examined Tina Will, a 69-y...",en
2,102,"D: Good morning Tommie, how can I help you tod...",en
3,2966,"D: Good morning, Chris. I understand you've be...",en
4,2438,"D: Hi Ernest, I understand you're here for a c...",en


In [24]:
# check the value counts of each language
transcripts['source_lang'].value_counts()

source_lang
en    1798
ar      74
hi      69
es      57
af       2
cy       1
Name: count, dtype: int64

In [None]:
# loop over each transcript in the dataset
en_transcripts = []
# set up a counter for informing the progress
count = 0

for index, row in transcripts.iterrows():
    transcript = row['transcript']
    if row['source_lang'] == 'en':
        en_transcripts.append(transcript)
    else:
        print(transcript)
        translated = nllb_translate(transcript, row['source_lang'])
        en_transcripts.append(translated)
        print('Translation:')
        print(translated)
        print(dash_line)
    
    count += 1
    print(count)

In [None]:
# create a new column in the transcript df
transcripts.loc[:, 'en_transcript'] = en_transcripts
transcripts.head()

Unnamed: 0,id,transcript,source_lang,en_transcript
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh..."
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod..."
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c..."


In [42]:
# filter transcripts where source_lang != en, and randomly select 10 translation samples
translated_transcripts = transcripts[transcripts['source_lang']!='en'].sample(10)['en_transcript'].values
for translated_transcript in translated_transcripts:
    print(translated_transcript)
    print(dash_line)

I understand you have been diagnosed with asthma. Can you tell me more about your symptoms? P: Hello, doctor. Yes, he's been experiencing high fever and lately I've been having trouble breathing. D: See. translate it to: precautions you can take to control your symptoms. Would you mind changing your clothes to avoid any discomfort? P: Sure, I can do that. D: Excellent. Also, try to take deep breaths and stay away from any triggers that may be causing your symptoms. If your symptoms worsen or you have difficulty breathing, seek help immediately. P: Thanks for the advice. Are you prescribing any medication for me? D: For now, I don't think the medication is necessary.  However, it is essential to continue monitoring your symptoms and keep me updated on your progress. If your symptoms worsen or persist, we can discuss more treatment options. P: Okay, I will do that. Thank you, doctor. D: Nothing, Erwin. Remember to follow the precautions we discussed and do not hesitate to contact me if y

<a id="sec-3-2-4"></a>
### 3.2.4 Translation Cleanup and Limitations

- Some translations echo the instruction prefix `translate xx to en:` that was prepended to each chunk.
- A cleanup step removes these artifacts before extraction.
- Distilled translation is practical for this pipeline but not perfect.
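
A minimal sketch of the cleanup on a single string is shown here. The `STRICT` pattern is the one this notebook applies; the `LOOSE` pattern is an assumption intended to also catch variants such as `translate it to:`, which remain visible in the sample translations printed earlier, and it may over-match on other corpora. The helper name `strip_prompt_echo` is hypothetical.

```python
import re

# Pattern used in this notebook's cleanup cell.
STRICT = re.compile(r"translate \w{2} to en:")
# Looser variant (assumption): tolerates a missing "en" and trailing spaces.
LOOSE = re.compile(r"translate \w{2,3} to(?: en)?:\s*", flags=re.IGNORECASE)


def strip_prompt_echo(text: str, pattern: re.Pattern = LOOSE) -> str:
    """Remove echoed prompt prefixes and collapse leftover double spaces."""
    without_echo = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", without_echo).strip()


print(strip_prompt_echo("translate es to en: Hello, doctor."))  # Hello, doctor.
```

Comparing the output of both patterns on a sample of translated transcripts is a quick way to judge whether the looser rule removes anything it should not.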


In [51]:
# use re to remove the pattern of 'translate xx to en:' in the en_transcript column
transcripts['cleaned_en_transcript'] = transcripts['en_transcript'].apply(lambda x: re.sub(r'translate \w{2} to en:', '', x))
transcripts.head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh..."
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod..."
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c..."


In [52]:
# print all the cleaned_en_transcript where source_lang != en to check the results
for index, row in transcripts.iterrows():
    if row['source_lang'] != 'en':
        print(row['cleaned_en_transcript'])
        print(dash_line)

I have been having problems with my loins for several days with weakness, dizziness and neck pain. Based on your symptoms and your age, I hope that you may have cervical spondylosis. P: What is it, doctor? D: Cervical spondylosis is a condition that affects your neck, which is part of your neck. I'm sorry, but I'm not sure if you're feeling well, but I'm sure you're feeling well.
---------------------------------------------------------------------------------------------------
I understand you are dealing with some symptoms. Can you tell me more about how you were feeling? P: Yes, doctor. I was really feeling sick. I started vomiting and it was hard to reach for breath. Then I felt a sharp pain in my chest. This is really scary. D: I see. Based on the symptoms you are feeling, it is better to take this seriously, and take you to the hospital immediately. P: Okay, doctor. Should I take any precautions before going? D: At this moment, there are no precautions you need to take. P: Okay, 

In [None]:
# save the cleaned transcripts to a csv file
transcripts.to_csv('translated_transcripts_cleaned.csv', index=False)

<a id="sec-3-2-5"></a>
### 3.2.5 Translation Implementation Notes

- Long transcripts are translated in 512-character chunks for stability.
- Chunk translations are concatenated into one English transcript.
- Results are saved to `translated_transcripts_cleaned.csv`.
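
Fixed 512-character slices can cut a word or sentence in half, which may contribute to broken translations at chunk boundaries. As an alternative sketch (not the approach used in this notebook), the function below packs whole sentences into chunks up to a character budget and only falls back to hard slicing when a single sentence exceeds that budget. The function name, the sentence-splitting regular expression, and the default budget are assumptions.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_chars: int = 512) -> List[str]:
    """Group whole sentences into chunks of at most `max_chars` characters."""
    # Split after sentence-final punctuation (including the Devanagari danda)
    # or at newlines; the whitespace at the split point is dropped.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+|\n+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is hard-sliced so no piece exceeds the budget.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_sentence("D: Hello. P: I feel weak. D: Rest well.", max_chars=20))
# ['D: Hello.', 'P: I feel weak.', 'D: Rest well.']
```

A character budget is still only a proxy for the model's token limit, so a token-based budget computed with the tokenizer would be a more precise variant of the same idea.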


<a id="sec-3-3"></a>
## 3.3 Stage B: Six-Field Information Extraction (Nous-Hermes-Llama2-13B)

Model reference: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b


<a id="sec-3-3-1"></a>
### 3.3.1 Hosted Inference Setup (PredictionGuard API)

PredictionGuard API is used for hosted inference to avoid local deployment of a 13B model.


In [None]:
# read the cleaned translated transcripts
transcripts = pd.read_csv('translated_transcripts_cleaned.csv')
transcripts.head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh..."
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod..."
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c..."


In [None]:
import os
import predictionguard as pg

os.environ['PREDICTIONGUARD_TOKEN'] = 'TOKEN'

<a id="sec-3-3-2"></a>
### 3.3.2 Prompt Design and Output Schema

The few-shot prompt enforces a strict six-line response format:
1. Patient name
2. Patient age (numeric only)
3. Condition
4. Symptoms
5. Precautionary advice
6. Medication

Downstream cleaning assumes this order and validates structure accordingly.
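
To illustrate what relying on this order means in practice, the sketch below maps a raw six-line response onto named fields. It keeps the first six non-empty lines and pads missing ones with `"None"`; these rules, the field names, and the function name are assumptions for illustration and not necessarily the cleanup applied later in the notebook.

```python
FIELDS = ["name", "age", "condition", "symptoms", "precautions", "medication"]


def parse_six_lines(response: str) -> dict:
    """Map a model response onto the six expected fields, in order."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    # Drop any extra trailing lines and pad short responses with "None".
    lines = (lines + ["None"] * len(FIELDS))[:len(FIELDS)]
    return dict(zip(FIELDS, lines))


record = parse_six_lines("Helen\n30\nbronchial asthma")
print(record["condition"], record["medication"])  # bronchial asthma None
```

Truncating to six lines also absorbs responses that carry an extra trailing line, which the model occasionally produces.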


<a id="sec-3-3-3"></a>
### 3.3.3 Few-Shot Prompt Template

The next code cell defines the full few-shot prompt with diverse examples, including noisy and incomplete transcript patterns.


In [None]:
# few-shot prompting for information extraction
prompt_generate_answer = """### Instruction:
The input is a transcription of medical conversations between a doctor and a patient ("D:" stands for "Doctor:" and "P:" for "Patient:") or a doctor's dictation of the examination of a patient. 
Please extract 6 categories of information including the patient's name, the patient's age, the patient's condition, the patient's symptoms, the doctor's precautionary advice and the prescribed medication from the input, and present the extracted information in a structured and concise manner in the response, following these guidelines:
    - The response should consist of exactly 6 lines, and the first line should be the patient's name, the second line should be the patient's age, the third line should be the patient's condition, the fourth line should be the patient's symptoms, the fifth line should be the doctor's precautions advice, and the sixth line should be the prescribed medication.
    - If the age is not mentioned in the input, please guess the age based on the input. If any of the other 5 categories of information are not mentioned in the input, please respond with "None" for the missing category in the specific lines.
    - The second line which is the patient's age should be a number without any trailing characters.

For example, if the input is:
During the visit, I assessed Carlos Murrieta, a 75-year-old patient who presented with symptoms of obesity, restlessness, and lethargy. After a thorough physical examination, I reviewed his medical history and discussed his lifestyle and diet with him. Upon further investigation, it was determined that Mr. Murrieta had diabetes. To help manage his diabetes, I advised Mr. Murrieta to take precautions such as following a balanced diet, exercising regularly, and scheduling follow-up appointments for ongoing monitoring. I also provided him with educational resources and discussed the importance of regular blood glucose monitoring. No medication was prescribed at this time. I emphasized the importance of lifestyle modifications to manage his diabetes and improve his overall health. I also encouraged him to quit smoking, if applicable, and to limit his alcohol consumption. I provided Mr. Murrieta with a list of healthy food choices and recommended that he consult with a nutritionist to create a meal plan that best suits his needs.

The response should be:
Carlos Murrieta
75
diabetes
obesity, restlessness, lethargy
following a balanced diet, exercising regularly, scheduling follow-up appointments for ongoing monitoring, regular blood glucose monitoring, lifestyle modifications, quit smoking, limit alcohol consumption, consult with a nutritionist
None


For example, if the input is:
D: Good morning, Helen. How are you feeling today?
P: Hi, Doctor. I'm feeling really tired and short of breath, and I have a high fever too.
D: I see. Based on your symptoms, I suspect you might have bronchial asthma. Is that correct?
P: Yes, that's what my previous doctor diagnosed me with.
D: Have you been taking any medication for it?
P: Yes, I've been using an inhaler, but it doesn't seem to be helping much.
D: Okay, let's try switching to loose clothing to avoid any tightness in your chest. Also, try taking deep breaths and getting away from whatever is triggering your asthma. If your symptoms don't improve after that, please come back for a follow-up appointment.
P: Okay, thank you.
D: No problem. And remember, if your symptoms worsen or if you have any difficulty breathing, please seek immediate medical help.
P: I will. Thanks again.

The response should be:
Helen
30
bronchial asthma
feeling really tired, short of breath, high fever
switching to loose clothing to avoid any tightness in chest, taking deep breaths, getting away from whatever is triggering asthma, come back for a follow-up appointment if symptoms don't improve
None


For example, if the input is:
D: Good mornin', Mr. Mcclure. I understand that you're being treated for hep@titis B. Can you tell me how you're feeling?

P: Ye$s, Doctor. I've been feeling very weak and tired lately. My urine has also become dark in color.

D: I see. Have you noticed any other sy 

The response should be:
Mr. Mcclure
30
hepatitis B
feeling very weak and tired, urine has also become dark in color
None
None


For example, if the input is:
D: Good mornin', Mr. Eugene. How are you feelin' today?

P: Good mornin', Doctor. I'm feelin' very weak and have a terr-ible headache.

D: I see. Can you tell me about your symptomms?

P: Yess, Doctor. I've been experiencin' chills, a very high fever, and a terrible headacee. I also have a toxic look on my face, which I beleive is a symptom of ty-phoid.

D: Y'are right, Mr. Eugene. Based on your symptomms, I beleive you have typhoid.

P: Oh, I see. What should I do to get better?

D: Well, I would advise you to eat high-calorie veggetables to help build up your stength. You should also start antibiotic therap-y to treat the infection. Additionally, I would prescribe some medication to help with your fever and headache.

P: Thannk you, Doctor. What is the medication you would prescribe?

D: Since you are allergic to penicillin, I would prescribe ceftriaxone. It is a safe and effec-tive alternative.

P: Alright, I'll take it as prescribed. Thannk you for your help, Doctor.

D: You're welcome, Mr. Eugene. Remember to eat well and take your medica-tion as directed, and you should start feeling better soon.

The response should be:
Mr. Eugene
30
typhoid
weak, terrible headache, chills, very high fever, toxic look on face
eat high-calorie vegetables to help build up strength, start antibiotic therapy to treat infection, take medication to help with fever and headache
ceftriaxone


For example, if the input is:
During Mary Ricker's visit, I, the doctor, determined that her condition was acn-e. Mary presented with symptoms such as pus-filled pimples, black-heads, and scarr-ing. I advised her to take prec@utions such as bathing twice a day, avoid-ing fatty and spicy foods, drinking plenty of w@ter, and avoiding too many pro-ducts. I prescribed Cleocin T as her medica-tion to help clear up her acn-e.

The response should be:
Mary Ricker
30
acne
pus-filled pimples, blackheads, scarring
bathing twice a day, avoiding fatty and spicy foods, drinking plenty of water, avoiding too many products
Cleocin T


For example, if the input is:
I have been having problems with my loins for several days with weakness, dizziness and neck pain. Based on your symptoms and your age, I hope that you may have cervical spondylosis. P: What is it, doctor? D: Cervical spondylosis is a condition that affects your neck, which is part of your neck. I'm sorry, but I'm not sure if you're feeling well, but I'm sure you're feeling well.

The response should be:
None
35
cervical spondylosis
problems with loins for several days with weakness, dizziness and neck pain
None
None


For example, if the input is:
D: Good morning, Mr. Roberts. How can I help you today?

P: Good morning, Doctor. I've been experiencing some skin problems recently.

D: I see. Can you tell me more about your symptoms?

P: Well, my nails have these small dents in them, my skin is peeling, and I have a rash on my arms and legs.

D: Based on the symptoms you've described, it appears that you have psoriasis.

P: Is that a serious condition?

D: Not necessarily, but it can be uncomfortable. To help manage your symptoms, I recommend washing your hands with warm soapy water regularly, stopping any bleeding using pressure, and taking salt baths. I'm also going to prescribe a medication called Halonate.

P: Thank you, Doctor. I appreciate your help.

D: You're welcome, Mr. Roberts. If your symptoms worsen or don't improve within a few weeks, please don't hesitate to come back for a follow-up appointment.

The response should be:
Mr. Roberts
30
psoriasis
nails have small dents in them, skin is peeling, rash on arms and legs
washing hands with warm soapy water regularly, stopping any bleeding using pressure, taking salt baths
Halonate




### Input:
{transcript}

### Response:

"""

In [251]:
# function to generate answers based on transcripts using the llm
def generate_answer(transcript, max_retries=100):
    """Query the hosted LLM, retrying on errors up to `max_retries` times."""
    answer = ''
    retries = 0
    while retries <= max_retries:
        try:
            result = pg.Completion.create(
                model='Nous-Hermes-Llama2-13B',
                prompt=prompt_generate_answer.format(transcript=str(transcript)),
            )
            answer = result['choices'][0]['text']
            break
        except Exception as e:
            print(f'An error occurred: {e}')
            retries += 1
            if retries > max_retries:
                answer = "Max retries reached."
                break
            time.sleep(0.01)  # wait for 0.01 seconds before retrying

    return answer
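
The retry loop above waits a fixed 0.01 seconds between attempts. A common alternative for hosted APIs is exponential backoff, where the wait doubles after each failure up to a cap. The sketch below shows that idea with a generic callable; `call_with_backoff`, its parameters, and the default delays are assumptions, not part of the PredictionGuard client.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `fn()` and retry on any exception with exponentially growing waits.

    `sleep` is injectable so the waiting behaviour can be tested without delays.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise  # give up and surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

In this notebook the wrapped callable would be the `pg.Completion.create(...)` request inside `generate_answer`. Re-raising after the final attempt, rather than returning a sentinel string, keeps failed rows from being mistaken for model output.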

<a id="sec-3-3-4"></a>
### 3.3.4 Robustness Check: Messy and Incomplete Inputs

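Spot checks suggest the extraction model is generally robust to noisy text: for example, `hep@titis B` is extracted as `hepatitis B`. The samples below also show failure modes, however. A truncated narrative was continued rather than extracted (Nancey Blas), ages appear that are not stated in the transcript (35 for Loretta Bentley, 30 for Mr. Mcclure), and some responses carry extra lines. Deterministic post-processing is therefore still required for format consistency.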
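
Spot checks like these can be partly automated. The sketch below is an illustrative addition, not part of the original pipeline: it flags answers whose age line contains a number that never occurs in the transcript. The function name `age_is_grounded` is hypothetical, and the check assumes ages are written as digits, so an age spelled out in words would be flagged incorrectly.

```python
import re

def age_is_grounded(answer, transcript):
    """Return True if the age line is 'None' or its value occurs in the transcript."""
    lines = answer.strip().split('\n')
    if len(lines) < 2:
        return False
    age = lines[1].strip()
    if age.lower() == 'none':
        return True
    # \b keeps '8' from matching inside '81'
    return re.search(rf'\b{re.escape(age)}\b', transcript) is not None

transcript = 'I examined Tina Will, a 69-year-old woman with chest pain.'
ok = age_is_grounded('Tina Will\n69\nheart attack', transcript)
bad = age_is_grounded('Tina Will\n35\nheart attack', transcript)
```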


In [253]:
# try 30 random sample transcripts
random_id = random.sample(list(transcripts['id'].values), 30)

for id in random_id:
    cleaned_en_transcript = transcripts[transcripts['id']==id]['cleaned_en_transcript'].values[0]
    print(cleaned_en_transcript)
    print('Response:')
    answer = generate_answer(cleaned_en_transcript)
    print(answer)
    print(dash_line)

During the visit, I, as the doctor, listened to Loretta Bentley's complaints of breathlessness, fatigue, and mucoid sputum. After examining her, I determined that her condition was bronchial asthma. I advised her to switch to loose clothing, take deep breaths, get away from the trigger, and seek help if necessary. Since her symptoms were not severe, I did not prescribe any medication. I suggested that she maintain a healthy lifestyle with regular exercise and a balanced diet to manage her condition effectively. Loretta seemed to understand and appreciate the advice, and she left the clinic with a better understanding of her condition.
Response:
Loretta Bentley
35
bronchial asthma
breathlessness, fatigue, mucoid sputum
switch to loose clothing, take deep breaths, get away from trigger, seek help if necessary
None
None
---------------------------------------------------------------------------------------------------
During the visit, I, as the doctor, determined that Verda Worm, a 93-ye

In [256]:
# see if the llm is robust to messy text input
messy_text_id = [608, 1848, 1221, 2287, 1766, 3281, 4494]
for id in messy_text_id:
    cleaned_en_transcript = transcripts[transcripts['id']==id]['cleaned_en_transcript'].values[0]
    print(cleaned_en_transcript)
    print('Response:')
    answer = generate_answer(cleaned_en_transcript)
    print(answer)
    print(dash_line)

D: Hi Mr. Adair, I understand that you have been diagnosed with hep@tit!s A. How have you been feeling lately?

P: Well, Doc, I've been feeling pretty bad- like, real bad. I've got n@u$ea all the time and my stom@ch hurts. I've also been havin' di@rrhea and my joints ache, you know?

D: I see. It's not surp@rising that you're feeling this way given your condition. Hep@titis A is a v!ral infection that affects the l!ver and can cause a range of sympt0ms, including the ones you mentioned.

P: Is there anything I can do to make myself feel better?

D: First and foremost, you should consult the nearest hospital and follow their advice. In addition, make sure to wash your hands th0roughly with soap and w@ter, avoid fatty and sp!cy foods, and take med!cation as prescribed by the hosp!tal.

P: Okay, I'll do that. Th@nk you for your help, Doctor.

D: You're welcome, Mr. Ad@ir. It's important that you take care of yourself and follow the doctor's advice. Please come back if you have any further

In [257]:
# see if the llm is robust to incomplete text input
incomplete_id = [4401, 1848]
for id in incomplete_id:
    cleaned_en_transcript = transcripts[transcripts['id']==id]['cleaned_en_transcript'].values[0]
    print(cleaned_en_transcript)
    print(' ')
    answer = generate_answer(cleaned_en_transcript)
    print(answer)
    print(dash_line)

During the visit, I, the doxctor, assessed Nancey Blas, who is 8
 
years old, and determined that her condition is anemia. Nancey presented with symptoms such as weakness, fatigue, and pale skin. To help manage her anemia, I advise her to take precautions such as eating a balanced diet, including vitamin C-rich foods, and drinking plenty of water. I prescribed ferrous sulfate as her medication to help increase her iron levels.
---------------------------------------------------------------------------------------------------
D: Good mornin', Mr. Mcclure. I understand that you're being treated for hep@titis B. Can you tell me how you're feeling?

P: Ye$s, Doctor. I've been feeling very weak and tired lately. My urine has also become dark in color.

D: I see. Have you noticed any other sy
 
Mr. Mcclure
30
hepatitis B
feeling very weak and tired, urine has also become dark in color
None
None

--------------------------------

<a id="sec-3-3-5"></a>
### 3.3.5 Batch Inference and Reliability Controls

- Split data into 20 chunks for recoverable long runs.
- Apply retry logic for transient API failures.
- Regenerate or pad/truncate outputs until each response has exactly six lines.
- Combine chunk-level files into `combined_df.csv`.
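
The pad/truncate rule in the third bullet can be sketched as a small pure function. This is a minimal illustration under stated assumptions rather than the notebook's exact logic: `to_six_lines` is a hypothetical helper, and it drops blank lines before counting, which assumes a blank line never stands for a legitimately empty field.

```python
def to_six_lines(answer, n_fields=6, filler='None'):
    """Pad or truncate a raw answer so it has exactly n_fields lines."""
    # drop surrounding whitespace and blank lines, which would
    # otherwise inflate the line count
    lines = [line.strip() for line in answer.strip().split('\n') if line.strip()]
    lines = lines[:n_fields]                      # truncate extra lines
    lines += [filler] * (n_fields - len(lines))   # pad missing fields
    return '\n'.join(lines)

padded = to_six_lines('Tommie\n50\nhypertension\n\n\n')
truncated = to_six_lines('1\n2\n3\n4\n5\n6\n7\n8')
```

Stripping trailing newlines first means a response that is correct apart from trailing blank lines does not trigger a regeneration.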


In [9]:
# split the transcripts into 20 chunks
transcripts_chunks_list = np.array_split(transcripts, 20)

In [10]:
# create a dictionary mapping chunk number to chunk
transcripts_chunks_dict = dict(enumerate(transcripts_chunks_list))

In [11]:
# check the first chunk
transcripts_chunks_dict[0].head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh..."
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod..."
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c..."


In [90]:
# generate answer for the first chunk
answers_1 = []
# set up a counter for informing the progress
count = 0

for index, row in transcripts_chunks_dict[0].iterrows():
    cleaned_en_transcript = row['cleaned_en_transcript']
    answer = generate_answer(cleaned_en_transcript)
    # if '\n' not in answer, keep generating answer until it contains '\n'
    while '\n' not in answer:
        answer = generate_answer(cleaned_en_transcript)
    
    # check whether the answer has exactly 6 lines
    # if not, regenerate it until it does or the regeneration limit is exceeded
    # once the limit is exceeded, truncate to the first 6 lines or pad with 'None'
    max_generate_times = 10
    generate_times = 0
    while len(answer.split('\n')) != 6:
        answer = generate_answer(cleaned_en_transcript)
        generate_times += 1
        print(f'generate times: {generate_times}')
        if generate_times > max_generate_times:
            # check how many lines the answer contains
            # if the answer contains more than 6 lines, return the first 6 lines
            # if the answer contains less than 6 lines, put 'None' in the missing lines to make it 6 lines
            if len(answer.split('\n')) >= 6:
                answer = '\n'.join(answer.split('\n')[:6])
            else:
                missing_lines = 6 - len(answer.split('\n'))
                for i in range(missing_lines):
                    answer += '\nNone'
            break

    answers_1.append(answer)
    print(answer)

    count += 1
    print(count)
    print(dash_line)

Mr. Don Hicks
81
fungal infection
dischromic patches, nodal skin eruptions, skin rash
bathing twice a day, using detol or neem in the bathing water, keeping the infected area dry, using clean cloths
None
1
---------------------------------------------------------------------------------------------------
generate times: 1
generate times: 2
generate times: 3
generate times: 4
generate times: 5
generate times: 6
generate times: 7
generate times: 8
generate times: 9
generate times: 10
generate times: 11
Tina Will
69
heart attack
chest pain, vomiting, breathlessness
seek immediate medical attention
None
2
---------------------------------------------------------------------------------------------------
generate times: 1
Tommie
50
hypertension
feeling dizzy and unsteady on feet, having headaches
practice meditation, take salt baths, reduce stress, proper sleep, monitor blood pressure regularly
None
3
------------------------------------------------------------------------------------------

In [None]:
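# create a new column in the chunk df to store the answers
chunk_df_0 = transcripts_chunks_dict[0].copy()
chunk_df_0.loc[:, 'answer'] = answers_1
# save it to a csv file so it can be combined with the other chunks later
chunk_df_0.to_csv('chunk_df_0.csv', index=False)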

In [13]:
# a function to generate answers for a chunk of transcripts
def complete_prediction(transcripts_df):
    answers = []
    # set up a counter for informing the progress
    count = 0
    
    for index, row in transcripts_df.iterrows():
        cleaned_en_transcript = row['cleaned_en_transcript']
        answer = generate_answer(cleaned_en_transcript)
        # if '\n' not in answer, keep generating answer until it contains '\n'
        while '\n' not in answer:
            answer = generate_answer(cleaned_en_transcript)
        
        # check whether the answer has exactly 6 lines
        # if not, regenerate it until it does or the regeneration limit is exceeded
        # once the limit is exceeded, truncate to the first 6 lines or pad with 'None'
        max_generate_times = 10
        generate_times = 0
        while len(answer.split('\n')) != 6:
            answer = generate_answer(cleaned_en_transcript)
            generate_times += 1
            print(f'generate times: {generate_times}')
            if generate_times > max_generate_times:
                # check how many lines the answer contains
                # if the answer contains more than 6 lines, return the first 6 lines
                # if the answer contains less than 6 lines, put 'None' in the missing lines to make it 6 lines
                if len(answer.split('\n')) >= 6:
                    answer = '\n'.join(answer.split('\n')[:6])
                else:
                    missing_lines = 6 - len(answer.split('\n'))
                    for i in range(missing_lines):
                        answer += '\nNone'
                break
            
        answers.append(answer)
        print(answer)
        
        count += 1
        print(count)
        print(dash_line)
        
    return answers

In [None]:
# generate answers for the rest of the chunks
for i in range(1, 20):
    chunk_df = transcripts_chunks_dict[i]
    answers = complete_prediction(chunk_df)
    # create a new column in the chunk df to store the answers
    chunk_df.loc[:, 'answer'] = answers
    # save it to a csv file
    chunk_df.to_csv(f'chunk_df_{i}.csv', index=False)
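
Because the loop above writes one CSV per chunk, a rerun after an interruption can skip chunks that are already on disk. The following is a hedged sketch of that pattern on toy data. `process_chunks` and `dummy_predict` are hypothetical names, and the dummy predictor merely stands in for `complete_prediction`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def process_chunks(df, predict_fn, out_dir, n_chunks=4):
    """Run predict_fn on each chunk, skipping chunks already saved to disk."""
    processed = []
    # split on row positions so each chunk is a plain DataFrame slice
    for i, idx in enumerate(np.array_split(np.arange(len(df)), n_chunks)):
        path = os.path.join(out_dir, f'chunk_df_{i}.csv')
        if os.path.exists(path):
            continue  # finished in an earlier run
        chunk = df.iloc[idx].copy()
        chunk['answer'] = predict_fn(chunk)
        chunk.to_csv(path, index=False)
        processed.append(i)
    return processed

# toy data and a dummy predictor (illustrative only)
toy = pd.DataFrame({'id': range(8), 'cleaned_en_transcript': list('abcdefgh')})

def dummy_predict(chunk):
    return [t.upper() for t in chunk['cleaned_en_transcript']]

with tempfile.TemporaryDirectory() as tmp:
    first_run = process_chunks(toy, dummy_predict, tmp)
    second_run = process_chunks(toy, dummy_predict, tmp)  # nothing left to do
    n_files = len(os.listdir(tmp))
```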

In [15]:
# read all the chunk dfs and combine them into a single df
combined_df = pd.concat([pd.read_csv(f'chunk_df_{i}.csv') for i in range(20)])

In [16]:
combined_df.head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod...",Tommie\n50\nhypertension\nfeeling dizzy and un...
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be...","Chris\n30\njaundice\nlosing weight, abdominal ..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c...",Ernest\n45\nurinary tract infection (UTI)\nbla...


In [24]:
# save combined_df to a csv file
combined_df.to_csv('combined_df.csv', index=False)

<a id="sec-4"></a>
# 4. Post-processing and Output Normalization


<a id="sec-4-1"></a>
## 4.1 Cleaning Goals and Why They Matter

LLM outputs can be semantically useful but structurally inconsistent (extra labels, bracketed text, non-numeric ages). This section:
- splits each response into six rows,
- aligns rows to `test.csv` order using merge keys,
- applies regex/text normalization,
- exports the final `submission.csv`.
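
The split-and-align steps in this list can be sketched on toy data as follows. The frames, IDs, and the `pos` column name are made up for illustration, and only three fields per transcript are used to keep the example short. Numbering both sides with `groupby(...).cumcount()` is an assumption of this sketch that avoids relying on the global row order of the question file.

```python
import pandas as pd

# toy stand-ins for combined_df.csv and test.csv (values are made up)
answers = pd.DataFrame({
    'id': [2055, 291],
    'answer': ['Don Hicks\n81\nfungal infection', 'Tina Will\n69\nheart attack'],
})
questions = pd.DataFrame({
    'Id': ['q1', 'q2', 'q3', 'q4', 'q5', 'q6'],
    'Transcript': [2055, 2055, 2055, 291, 291, 291],
})

# one row per answer line, numbered 0..n-1 within each transcript
expanded = (
    answers.assign(Text=answers['answer'].str.split('\n'))
    .explode('Text')
    .reset_index(drop=True)
    .rename(columns={'id': 'Transcript'})
)
expanded['pos'] = expanded.groupby('Transcript').cumcount()

# number the questions the same way, then join on (transcript id, position)
questions['pos'] = questions.groupby('Transcript').cumcount()
merged = questions.merge(
    expanded[['Transcript', 'pos', 'Text']],
    on=['Transcript', 'pos'],
    how='left',
)
print(merged[['Id', 'Text']])
```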


In [None]:
transcripts_answer = pd.read_csv('combined_df.csv')
transcripts_answer.head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod...",Tommie\n50\nhypertension\nfeeling dizzy and un...
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be...","Chris\n30\njaundice\nlosing weight, abdominal ..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c...",Ernest\n45\nurinary tract infection (UTI)\nbla...


In [188]:
# function to split answer into list
def split_answer(answer):
    return answer.split('\n')

In [None]:
# apply the function and create a new column
transcripts_answer.loc[:, 'split_answers'] = transcripts_answer['answer'].apply(split_answer)
transcripts_answer.head()

Unnamed: 0,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer,split_answers
0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"[Mr. Don Hicks, 81, fungal infection, dischrom..."
1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...","[Tina Will, 69, heart attack, chest pain, vomi..."
2,102,"D: Good morning Tommie, how can I help you tod...",en,"D: Good morning Tommie, how can I help you tod...","D: Good morning Tommie, how can I help you tod...",Tommie\n50\nhypertension\nfeeling dizzy and un...,"[Tommie, 50, hypertension, feeling dizzy and u..."
3,2966,"D: Good morning, Chris. I understand you've be...",en,"D: Good morning, Chris. I understand you've be...","D: Good morning, Chris. I understand you've be...","Chris\n30\njaundice\nlosing weight, abdominal ...","[Chris, 30, jaundice, losing weight, abdominal..."
4,2438,"D: Hi Ernest, I understand you're here for a c...",en,"D: Hi Ernest, I understand you're here for a c...","D: Hi Ernest, I understand you're here for a c...",Ernest\n45\nurinary tract infection (UTI)\nbla...,"[Ernest, 45, urinary tract infection (UTI), bl..."


In [None]:
# expand the df so that each information gets its own row
transcripts_expanded = transcripts_answer.explode('split_answers').reset_index()
transcripts_expanded.head(24)

Unnamed: 0,index,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer,split_answers
0,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,Mr. Don Hicks
1,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,81
2,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,fungal infection
3,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"dischromic patches, nodal skin eruptions, skin..."
4,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"bathing twice a day, using detol or neem in th..."
5,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,
6,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",Tina Will
7,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",69
8,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",heart attack
9,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...","chest pain, vomiting, breathlessness"


In [None]:
# change the index column values in the transcripts_expanded to be 0, 1, 2, 3, 4, 5 for each id value
transcripts_expanded.loc[:, 'index'] = transcripts_expanded.groupby('id').cumcount()
transcripts_expanded.head(18)

Unnamed: 0,index,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer,split_answers
0,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,Mr. Don Hicks
1,1,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,81
2,2,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,fungal infection
3,3,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"dischromic patches, nodal skin eruptions, skin..."
4,4,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"bathing twice a day, using detol or neem in th..."
5,5,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,
6,0,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",Tina Will
7,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",69
8,2,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",heart attack
9,3,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...","chest pain, vomiting, breathlessness"


In [198]:
test.head()

Unnamed: 0,Id,Transcript,Question
0,587d0feb-5780-43e1-9595-e19d4b31dc07,2055,What is the patient's name?
1,263e8884-e8ba-4266-bb0c-85271419a0b3,2055,What is the patient's age?
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,2055,What is the patient's condition?
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,2055,What symptoms is the patient experiencing?
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,2055,What precautions did the doctor advise?


In [None]:
# add a merge key (0-5) to the test dataset; this assumes the rows are ordered in blocks of six questions per transcript
test['merge_key'] = test.index % 6
test.head(12)

Unnamed: 0,Id,Transcript,Question,merge_key
0,587d0feb-5780-43e1-9595-e19d4b31dc07,2055,What is the patient's name?,0
1,263e8884-e8ba-4266-bb0c-85271419a0b3,2055,What is the patient's age?,1
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,2055,What is the patient's condition?,2
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,2055,What symptoms is the patient experiencing?,3
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,2055,What precautions did the doctor advise?,4
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,2055,What drug did the doctor prescribe?,5
6,1f0abde6-83ac-4e8f-b06f-9c9ff72beb88,291,What is the patient's name?,0
7,8ce89aa4-9d9c-4174-8e5e-957efbf41256,291,What is the patient's age?,1
8,08862b5d-dbc0-4ea1-884c-329eec085ff6,291,What is the patient's condition?,2
9,2f9bb4d6-8694-478e-a7f0-83ba534673de,291,What symptoms is the patient experiencing?,3


In [202]:
# change the Transcript column data type to int
test['Transcript'] = test['Transcript'].astype(int)

In [None]:
# merge the two dfs
result = pd.merge(test, transcripts_expanded, left_on=['Transcript', 'merge_key'], right_on=['id', 'index'])
result.head(24)

Unnamed: 0,Id,Transcript,Question,merge_key,index,id,transcript,source_lang,en_transcript,cleaned_en_transcript,answer,split_answers
0,587d0feb-5780-43e1-9595-e19d4b31dc07,2055,What is the patient's name?,0,0,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,Mr. Don Hicks
1,263e8884-e8ba-4266-bb0c-85271419a0b3,2055,What is the patient's age?,1,1,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,81
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,2055,What is the patient's condition?,2,2,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,fungal infection
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,2055,What symptoms is the patient experiencing?,3,3,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"dischromic patches, nodal skin eruptions, skin..."
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,2055,What precautions did the doctor advise?,4,4,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"bathing twice a day, using detol or neem in th..."
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,2055,What drug did the doctor prescribe?,5,5,2055,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,
6,1f0abde6-83ac-4e8f-b06f-9c9ff72beb88,291,What is the patient's name?,0,0,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",Tina Will
7,8ce89aa4-9d9c-4174-8e5e-957efbf41256,291,What is the patient's age?,1,1,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",69
8,08862b5d-dbc0-4ea1-884c-329eec085ff6,291,What is the patient's condition?,2,2,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...",heart attack
9,2f9bb4d6-8694-478e-a7f0-83ba534673de,291,What symptoms is the patient experiencing?,3,3,291,"During the visit, I examined Tina Will, a 69-y...",en,"During the visit, I examined Tina Will, a 69-y...","During the visit, I examined Tina Will, a 69-y...","Tina Will\n69\nheart attack\nchest pain, vomit...","chest pain, vomiting, breathlessness"
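
The merge above uses the default inner join, which silently drops any question row that has no matching answer line. As a hedged sketch on toy data (not the notebook's actual frames), pandas' `indicator` and `validate` options can surface such gaps before the submission is written.

```python
import pandas as pd

# toy frames: transcript 2 has a question but no extracted answer
left = pd.DataFrame({'Transcript': [1, 1, 2], 'merge_key': [0, 1, 0]})
right = pd.DataFrame({'id': [1, 1], 'index': [0, 1], 'split_answers': ['Ann', '40']})

checked = pd.merge(
    left, right,
    left_on=['Transcript', 'merge_key'], right_on=['id', 'index'],
    how='left',             # keep every question row
    indicator=True,         # adds a '_merge' column
    validate='one_to_one',  # raise if a key is duplicated on either side
)
unmatched = checked[checked['_merge'] == 'left_only']
print(f'{len(unmatched)} question rows have no answer')
```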


In [None]:
# drop helper columns
result.drop(columns=['id', 'index', 'merge_key'], inplace=True)
result.head(6)

Unnamed: 0,Id,Transcript,Question,transcript,source_lang,en_transcript,cleaned_en_transcript,answer,split_answers
0,587d0feb-5780-43e1-9595-e19d4b31dc07,2055,What is the patient's name?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,Mr. Don Hicks
1,263e8884-e8ba-4266-bb0c-85271419a0b3,2055,What is the patient's age?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,81
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,2055,What is the patient's condition?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,fungal infection
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,2055,What symptoms is the patient experiencing?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"dischromic patches, nodal skin eruptions, skin..."
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,2055,What precautions did the doctor advise?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,"bathing twice a day, using detol or neem in th..."
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,2055,What drug did the doctor prescribe?,"During the visit, I examined Mr. Don Hicks, wh...",en,"During the visit, I examined Mr. Don Hicks, wh...","During the visit, I examined Mr. Don Hicks, wh...",Mr. Don Hicks\n81\nfungal infection\ndischromi...,


In [None]:
# drop unnecessary columns for the submission file
submission = result.drop(columns=['Transcript', 'Question', 'transcript', 'source_lang', 'en_transcript', 'cleaned_en_transcript', 'answer'])
submission.head(12)

Unnamed: 0,Id,split_answers
0,587d0feb-5780-43e1-9595-e19d4b31dc07,Mr. Don Hicks
1,263e8884-e8ba-4266-bb0c-85271419a0b3,81
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,fungal infection
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,"dischromic patches, nodal skin eruptions, skin..."
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,"bathing twice a day, using detol or neem in th..."
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,
6,1f0abde6-83ac-4e8f-b06f-9c9ff72beb88,Tina Will
7,8ce89aa4-9d9c-4174-8e5e-957efbf41256,69
8,08862b5d-dbc0-4ea1-884c-329eec085ff6,heart attack
9,2f9bb4d6-8694-478e-a7f0-83ba534673de,"chest pain, vomiting, breathlessness"


In [None]:
# rename the column 'split_answers' to 'Text'
submission.rename(columns={'split_answers':'Text'}, inplace=True)
submission.head()

Unnamed: 0,Id,Text
0,587d0feb-5780-43e1-9595-e19d4b31dc07,Mr. Don Hicks
1,263e8884-e8ba-4266-bb0c-85271419a0b3,81
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,fungal infection
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,"dischromic patches, nodal skin eruptions, skin..."
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,"bathing twice a day, using detol or neem in th..."


In [238]:
# remove the pattern generated by the llm such as 'Patient's Name: ', 'Patient's Age: ', etc.
def remove_pre_colon(text):
    return re.sub(r'^[^:]*:', '', text)

submission['Text'] = submission['Text'].apply(remove_pre_colon)

In [240]:
# for the Text column in the df, remove everything in the brackets (including the brackets) and strip leading and trailing whitespace
submission['Text'] = submission['Text'].apply(lambda x: re.sub(r'\([^)]*\)', '', x).strip())

In [None]:
# for the Text column in the df, if the row is the second row in a set of six rows, extract the first number from the string and return it as the age
# if there is no number, return 'None'; for all other rows, return the text as it is
def clean_age(row):
    # Check if the row is the second row in a set of six rows
    if row.name % 6 == 1:
        # Extract numbers from the string, assuming age is a number
        age = re.findall(r'\d+', row['Text'])
        # If a number is found, return it, otherwise return 'None'
        return age[0] if age else "None"
    else:
        # For other rows, return the text as it is
        return row['Text']

# Apply the function to the DataFrame
submission['Text'] = submission.apply(clean_age, axis=1)
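
To make the effect of the three regex steps above concrete (label stripping, bracket removal, age extraction), the sketch below applies the same patterns to a few made-up strings. The helper names are hypothetical and the sample inputs are illustrative, not taken from the dataset.

```python
import re

def strip_label(text):
    # drop everything up to and including the first colon
    return re.sub(r'^[^:]*:', '', text)

def strip_brackets(text):
    # drop "(...)" groups, then surrounding whitespace
    return re.sub(r'\([^)]*\)', '', text).strip()

def first_number(text):
    # first run of digits, or 'None' if there is none
    match = re.search(r'\d+', text)
    return match.group(0) if match else 'None'

samples = ["Patient's Age: 45 years old", 'urinary tract infection (UTI)', 'not stated']
cleaned = [strip_brackets(strip_label(s)) for s in samples]
ages = [first_number(s) for s in cleaned]
```

One caveat of the colon rule: it also truncates legitimate values that contain a colon, so `strip_label('ratio 1:2')` returns `'2'`.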

In [None]:
# Standardize the Text column: map every variation of a missing value to the string '"none"'
# (exact, case-sensitive matches only)
missing_values = ['None', 'None.', '', 'not mentioned', 'Not Mentioned', 'Not mentioned',
                  'n/a', 'N/A', 'nan', 'NaN', 'Unknown']
submission['Text'] = submission['Text'].replace(missing_values, '"none"')
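
The replacements above match exact, case-sensitive strings, so a variant such as `UNKNOWN` or ` N/A ` would slip through. A possible alternative, sketched here under the assumption that case and surrounding whitespace carry no meaning, is to compare a normalized form against a set. `MISSING` and `normalize_missing` are hypothetical names, and the list of spellings is illustrative.

```python
# spellings treated as "no value", compared in lowercase (illustrative list)
MISSING = {'', 'none', 'none.', 'not mentioned', 'n/a', 'nan', 'unknown'}

def normalize_missing(text, token='"none"'):
    # compare case-insensitively and ignore surrounding whitespace
    return token if str(text).strip().lower() in MISSING else text

values = ['None', ' N/A ', 'Halonate', 'UNKNOWN', '']
normalized = [normalize_missing(v) for v in values]
```

On a DataFrame column, the same function could be applied with `submission['Text'].apply(normalize_missing)`.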

In [None]:
# check the final submission file
submission.head(24)

Unnamed: 0,Id,Text
0,587d0feb-5780-43e1-9595-e19d4b31dc07,Don Hicks
1,263e8884-e8ba-4266-bb0c-85271419a0b3,81
2,74c68eca-61b2-49d0-9b1c-0f6f886b04ff,fungal infection
3,8572ab5d-f20a-4de5-ab44-f42b07e45a00,"""dischromic patches, nodal skin eruptions, ski..."
4,f5c92075-ef05-4fbf-a7a0-aa86c586ff02,"""bathing twice a day, using detol or neem in t..."
5,03406fb0-e67d-4614-a745-ed02c7ac6c46,"""none"""
6,1f0abde6-83ac-4e8f-b06f-9c9ff72beb88,Tina Will
7,8ce89aa4-9d9c-4174-8e5e-957efbf41256,69
8,08862b5d-dbc0-4ea1-884c-329eec085ff6,heart attack
9,2f9bb4d6-8694-478e-a7f0-83ba534673de,"""chest pain, vomiting, breathlessness"""


In [None]:
# save the final submission file
submission.to_csv('submission.csv', index=False)