A Python script to clean and normalize prescription data from a CSV file.
It extracts active ingredients, dosage, and units from raw medication text, standardizes drug names, detects topical medications, and outputs a cleaned dataset.
- Reads prescriptions from
prescriptions.csv
- Standardizes drug names using a lookup mapping
- e.g.,
"ibu"
→"Ibuprofen"
- e.g.,
- Cleans packaging/form terms like
"filmtabletten"
,"caps"
, etc. - Detects topical medications (gel, cream, ointment, spray, etc.) and skips dosage parsing for them
- Extracts dosage and standardizes units (
mg
,g
,microgram
) - Normalizes prescription dates into YYYY-MM-DD format
- Saves processed output to
prescriptions_clean.csv
- Python 3.8+
- Dependencies:
Install dependencies with: pip install pandas
The script expects a CSV file named prescriptions.csv
in the same directory, containing (at minimum):
prescription_date
— date of prescription (any common format, e.g.12/03/2021
)medication_text
— raw string of the medication description
Example input (prescriptions.csv
):
prescription_date | medication_text |
---|---|
2021-03-12 | Ibuprofen 400mg Filmtabletten |
2021-04-01 | Amoxicillin 500 mg Capsules |
2021-04-10 | Diclofenac Gel 120 gm |
2021-05-05 | Metformin-ratiopharm 850mg |
The cleaned data will be written to prescriptions_clean.csv
, including structured fields:
active_ingredient
dosage
unit
Example output (prescriptions_clean.csv
):
prescription_date | medication_text | active_ingredient | dosage | unit |
---|---|---|---|---|
2021-03-12 | Ibuprofen 400mg Filmtabletten | Ibuprofen | 400 | mg |
2021-04-01 | Amoxicillin 500 mg Capsules | Amoxicillin | 500 | mg |
2021-04-10 | Diclofenac Gel Tube | Diclofenac | ||
2021-05-05 | Metformin-ratiopharm 850mg | Metformin | 850 | mg |
Run the script from the command line: python process_meds.py
This will:
- Load
prescriptions.csv
- Parse, clean, and structure the medication text
- Save results to
prescriptions_clean.csv
- Print a summary of processed rows
- Drug name normalization uses a dictionary to standardize names
- Regex parsing extracts numeric dosage and unit values
- Topical detection skips dosage parsing for creams, gels, sprays, etc.
- Date formatting ensures consistent YYYY-MM-DD format
Indeed the rule based system that i have designed may work efficiently with less or limited data but once the data starts to exceed human expectations (millions or billions of data points) the script will most likely break as python in itself have many limitations with large dataset. I may choose to switch over to spark for handling real time large dataset but when it comes to LLM’s like Gemini or ChatGPT leveraging their power will be a huge step as they are trained with unprecedented datasets that includes the knowledge about all the medical chemicals and names that even a doctor may not be well aware of and that too from all around the world. It can very easily handle the misspelled data (unlike this script) it can also help in define the medication based on past knowledge and current situation. And help in understanding patterns that are unseen or seen better.
But since any LLM with sufficient data can halucinate or simply make up answers it is important to keep it in check and to do so we would need to make a structured prompt or API interface where we can pass the dosage and it would identify ingredient, dosage units etc. We can easily leverage the power of transfer learning into this as the LLM is already trained in a general and larger dataset we can just train it into our smaller dataset. It will reduce the cost and time it would need to train a complete LLM and it will benefit us for the specific task we need. This might be done by providing a few annotated examples and asking the model to extract the relevant fields for each prescription entry. The LLM could be used in a batch processing pipeline, integrating with existing data cleaning steps or as an augmentation to flag cases with uncertainty for human review.
Main challenges: cost, latency, and reliability. API calls cost money and latency increases by caching every resolved medication string and batching many examples into one call. We can also use a lookup database (RxNorm / local pharmacopeia) to confirm or replace the LLM's candidate mapping. Another major challenge as already mentioned is hallucination: LLMs may invent doses or codes if not constrained. To overcome this i propose the following solutions
- forcing structured outputs with validation rules,
- falling back to rule-based parsing when the model's confidence is low, and
- adding a human-in-the-loop review for low-confidence or high-impact records. For highly sensitive clinical data, consider an on-prem or private LLM (or de-identified inputs) to satisfy privacy/regulatory requirements.
To implement it we need:
1. good prompt engineering with exact specifications Example final-instruction: “Return only JSON with keys: active_ingredient, dosage, unit, form, confidence (0–1). If no dosage, set dosage=null and unit=null.”
2. Validation so that it dosent make a mistake including a human if needed, so that the dosage remain in limit and it dosent go beyond the range checks.
3. And for precision a temperature of 0