This repository contains a modified version of the ETL pipeline from IBM’s Data Engineering labs. The code is adapted to work in a Jupyter Notebook environment and is tailored for scenarios where data is sourced from three different file types: CSV, JSON, and XML.
The pipeline extracts data from each format, applies a simple transformation, and consolidates the results into a single CSV file.
This ETL script is designed for demonstration or educational use where you are working with:
CSVfiles (tabular format)JSONfiles (line-delimited)XMLfiles (with a known structure)
It assumes that all three formats represent the same kind of data (e.g., car listings) but may come from different systems or exports.
- Extract: Reads all
.csv,.json, and.xmlfiles in the current directory (excluding the output file). - Transform: Rounds numeric values (like price) and ensures type consistency.
- Load: Writes the combined, cleaned data to a single CSV.
- Log: Appends ETL progress messages to a log file with timestamps.
-
The code is hardcoded to expect certain fields:
car_model,year_of_manufacture,price, andfuel. -
XML parsing depends on these tags existing in each
<record>element. -
The schema for XML must be manually defined in
dtype_map(you can preview XML fields using theget_xml_headers()helper function).
This means the pipeline is not fully general-purpose — it's suitable for controlled data ingestion workflows or learning environments where input formats are known.
- Place your
.csv,.json, and.xmlfiles in the same directory. - Run the notebook cells step-by-step:
- Load the helper functions
- Use
get_xml_headers("example.xml")to verify XML structure - Modify the
dtype_mapinextract_from_xml()if needed - Run the ETL steps
- The final output will be saved to
transformed_data.csv.
<root>
<record>
<car_model>Toyota</car_model>
<year_of_manufacture>2020</year_of_manufacture>
<price>15000</price>
<fuel>Petrol</fuel>
</record>
</root>This is a modified version of code provided as part of IBM’s Data Engineering course material. Use and adapt freely for learning purposes or internal projects.