Python Medicine Data Extractor
This is a simple Python script to read a complex CSV file of medicine data and extract specific columns into a new, clean CSV file.
It was written during an educational conversation about how to read CSV files whose encoding is unknown.
What This Project Does
The medicines.csv file (sourced from the EMA website) has 8 extra header rows and many columns. This script:
Skips the first 8 rows of the file.
Reads only the 'Name of medicine', 'Active substance', and 'Therapeutic area (MeSH)' columns.
Creates a new clean output file named extracted_medicines.csv using the standard 'utf-8' encoding.
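The three steps above can be sketched with the standard csv module. This is an illustration only, not the contents of process_medicines.py: the function name extract_columns, its parameters, and the choice of csv over a library such as pandas are all assumptions. It also assumes that the column header is the first row after the 8 skipped rows.

```python
import csv

# Column names taken from this README; everything else is hypothetical.
WANTED = ["Name of medicine", "Active substance", "Therapeutic area (MeSH)"]


def extract_columns(src_path, dst_path, encoding="cp1250", skip_rows=8):
    """Copy only the WANTED columns from src_path into a UTF-8 file.

    Returns the number of data rows written.
    """
    written = 0
    with open(src_path, newline="", encoding=encoding) as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        # Discard the extra rows that come before the real column header.
        for _ in range(skip_rows):
            next(reader, None)
        header = next(reader)
        # list.index raises ValueError if an expected column is missing.
        indexes = [header.index(name) for name in WANTED]
        writer = csv.writer(dst)
        writer.writerow(WANTED)
        for row in reader:
            # Skip blank or truncated rows that lack the wanted columns.
            if len(row) <= max(indexes):
                continue
            writer.writerow([row[i] for i in indexes])
            written += 1
    return written
```

With the file names used in this project, a call would look like extract_columns("medicines.csv", "extracted_medicines.csv"). Note that csv.reader counts CSV records rather than physical lines, so a quoted multi-line field in the first 8 rows would shift the skip.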
Files
process_medicines.py: The main Python script.
medicines.csv: The raw input data file (included in this repository for testing).
How to Use It
Make sure you have Python 3 installed on your system.
Download this repository (Download ZIP) or clone it.
Open your terminal (or Command Prompt) in the project folder.
Run the following command:
python process_medicines.py
(Depending on your system, you may need to use python3 instead of python.)
If the script succeeds, the output file extracted_medicines.csv is created in the same folder.
The Encoding Challenge
The main challenge of this project was finding the correct file encoding for medicines.csv. The following encodings were tested:
utf-8: Failed
windows-1252: Failed
latin-1: Failed
cp1250: (Currently testing)
The script in this repository currently uses cp1250. Whether this is the correct encoding has not yet been confirmed.
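The trial-and-error process listed above can be automated. The sketch below is a hypothetical helper, not part of process_medicines.py: the name first_working_encoding and the order of the candidates are assumptions. It returns the first encoding that decodes the raw bytes without raising an error.

```python
def first_working_encoding(raw, candidates=("utf-8", "windows-1252",
                                            "cp1250", "latin-1")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
        return name
    return None


# Hypothetical usage with this project's input file:
# with open("medicines.csv", "rb") as f:
#     print(first_working_encoding(f.read()))
```

Two caveats apply. Decoding without an error does not prove that an encoding is the right one, so the resulting text still needs a visual check for garbled characters. In addition, Python's latin-1 codec maps every possible byte to a character and therefore never raises a decode error, which is why it is placed last here. A latin-1 "failure" can only mean that the decoded text looked wrong or that a later step failed.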