# Pandas Medallion Architecture Learning Guide - Practice

Welcome to your pandas learning repository! This notebook guides you through the core concepts of data processing with pandas, organized around the Medallion Architecture (Bronze, Silver, and Gold layers). You'll learn how to ingest raw data, transform it, and aggregate it for analysis, which is a solid foundation before moving on to distributed processing with PySpark.

## What is the Medallion Architecture?

The Medallion Architecture is a data design pattern used to logically organize data in a data lake:
- **Bronze Layer**: Raw data as-is from source systems
- **Silver Layer**: Cleaned, validated, and enriched data
- **Gold Layer**: Business-level aggregations for analytics and reporting


## 1. Setting up your pandas environment

Let's start by importing the necessary libraries and checking our environment.

**Exercise 1.1**: Import pandas and other necessary libraries.
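As a rough sketch of what an environment check can look like (the exact set of libraries you need may differ), the snippet below imports the libraries named in this exercise and prints their versions along with the working directory:

```python
import os
from datetime import datetime

import numpy as np
import pandas as pd

# Report library versions so results can be reproduced later.
print(f"pandas version: {pd.__version__}")
print(f"numpy version:  {np.__version__}")

# Relative paths in this notebook resolve against the working directory.
print(f"working directory: {os.getcwd()}")
print(f"run started at: {datetime.now():%Y-%m-%d %H:%M:%S}")
```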


In [None]:
# TODO: Import pandas, numpy, os, datetime and other necessary libraries
# TODO: Print pandas and numpy versions
# TODO: Print current working directory


## 2. Bronze Layer: Raw Data Ingestion

The Bronze layer is where raw data is ingested as-is from source systems. We'll read the sample sales data from `data/raw/sales_data.csv` (referenced as `../data/raw/sales_data.csv` from this notebook) and save it unchanged to the Bronze layer.

**Exercise 2.1**: Read the raw CSV file and explore its structure.
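To get a feel for the exploration calls before touching the real file, here is a minimal sketch that reads a small in-memory CSV. The sample rows are hypothetical and the real file may contain other columns; only the column names that appear in the later exercises are taken from this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the raw CSV file.
raw_csv = io.StringIO(
    "transaction_date,product_name,city,quantity,price\n"
    "2024-01-05,Laptop,Berlin,1,999.90\n"
    "2024-01-06,Mouse,Paris,3,19.50\n"
    "2024-01-06,Keyboard,Berlin,2,49.00\n"
)

# Bronze keeps the data as-is: no renaming, casting, or filtering here.
bronze_df = pd.read_csv(raw_csv)

print(bronze_df.dtypes)                     # schema: one dtype per column
print(f"shape: {bronze_df.shape}, total records: {len(bronze_df)}")
print(bronze_df.head(5))                    # first rows
print(bronze_df.describe())                 # statistics for numeric columns
```

Note that `describe()` summarizes only numeric columns by default; passing `include="all"` also covers text columns.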


In [None]:
# TODO: Read raw data from '../data/raw/sales_data.csv'
# TODO: Print schema using dtypes
# TODO: Print dataset shape and total records
# TODO: Display first 5 rows
# TODO: Display basic statistics


**Exercise 2.2**: Save the bronze layer data in Parquet format, a columnar file format that is typically smaller on disk and faster to read than CSV and that preserves column dtypes.
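One possible shape for the save step is a small helper like the one below. The function name `save_layer` is hypothetical, and the sketch assumes a parquet engine (`pyarrow` or `fastparquet`) is installed, since pandas needs one for `to_parquet`.

```python
import os

import pandas as pd


def save_layer(df: pd.DataFrame, directory: str, filename: str) -> str:
    """Write df to <directory>/<filename> as parquet and return the path."""
    # exist_ok=True makes the call safe to re-run when the directory exists.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)

    # Assumption: pyarrow or fastparquet is installed as the parquet engine.
    df.to_parquet(path, index=False)

    size_kb = os.path.getsize(path) / 1024
    print(f"Saved {len(df)} rows to {path} ({size_kb:.1f} KB)")
    return path
```

A call might look like `save_layer(bronze_df, "../data/bronze", "sales_data.parquet")`, where both the directory and the file name are assumptions about your project layout.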


In [None]:
# TODO: Create bronze directory using os.makedirs
# TODO: Save bronze data as parquet file
# TODO: Print confirmation message and file size


## 3. Silver Layer: Cleaned and Conformed Data

The Silver layer contains cleaned, conformed, and enriched data. We'll perform data quality checks, calculate derived columns, and add metadata.

**Exercise 3.1**: Load bronze data and perform data quality checks.
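The checks in this exercise can be tried out on a toy DataFrame first. The data below is a hypothetical stand-in for the loaded bronze data, built to contain one duplicate row and two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bronze data loaded from parquet.
bronze_df = pd.DataFrame({
    "product_name": ["Laptop", "Mouse", "Mouse", "Keyboard"],
    "city": ["Berlin", "Paris", "Paris", None],
    "quantity": [1, 3, 3, 2],
    "price": [999.90, 19.50, 19.50, np.nan],
})

# Missing values per column (None and NaN both count as missing).
missing_per_column = bronze_df.isnull().sum()

# Rows that exactly repeat an earlier row.
duplicate_rows = bronze_df.duplicated().sum()

# Number of distinct non-missing values per column.
unique_per_column = bronze_df.nunique()

print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}")
print(unique_per_column)
```

`duplicated()` marks only the second and later occurrences, so a pair of identical rows counts as one duplicate.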


In [None]:
# TODO: Load bronze data from parquet file
# TODO: Check for missing values using isnull().sum()
# TODO: Check for duplicate rows
# TODO: Print the number of unique values for each column (nunique)


**Exercise 3.2**: Create the silver layer with enrichments and transformations.
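The enrichment steps could look roughly like the sketch below. The sample rows, the bin edges, the category labels, and the `data_source` value are all hypothetical; pick thresholds that make sense for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the bronze data.
bronze_df = pd.DataFrame({
    "transaction_date": ["2024-01-05", "2024-04-20", "2024-11-30"],
    "quantity": [1, 12, 4],
    "price": [999.90, 19.50, 120.00],
})

# Work on a copy so the bronze DataFrame stays untouched.
silver_df = bronze_df.copy()
silver_df["transaction_date"] = pd.to_datetime(silver_df["transaction_date"])
silver_df["total_price"] = silver_df["quantity"] * silver_df["price"]

# Metadata columns (the data_source value is an assumption).
silver_df["processing_timestamp"] = pd.Timestamp.now()
silver_df["data_source"] = "sales_data.csv"

# Date components via the .dt accessor.
silver_df["year"] = silver_df["transaction_date"].dt.year
silver_df["month"] = silver_df["transaction_date"].dt.month
silver_df["quarter"] = silver_df["transaction_date"].dt.quarter
silver_df["day_of_week"] = silver_df["transaction_date"].dt.day_name()

# pd.cut uses right-closed bins by default: (0, 50], (50, 500], (500, inf].
silver_df["price_category"] = pd.cut(
    silver_df["price"],
    bins=[0, 50, 500, float("inf")],
    labels=["low", "medium", "high"],
)
silver_df["order_size"] = pd.cut(
    silver_df["quantity"],
    bins=[0, 2, 10, float("inf")],
    labels=["small", "medium", "large"],
)

print(silver_df.dtypes)
print(silver_df.head())
```

Values that fall outside every bin (for example a price of 0 with these edges) become missing in the resulting column, which is worth checking before filtering.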


In [None]:
# TODO: Create a copy of bronze_df for silver layer
# TODO: Convert transaction_date to datetime
# TODO: Calculate total_price = quantity * price
# TODO: Add processing_timestamp and data_source columns
# TODO: Extract date components (year, month, quarter, day_of_week)
# TODO: Add business logic columns (price_category, order_size) using pd.cut
# TODO: Print schema and display sample data


**Exercise 3.3**: Apply data quality filters and save the silver layer.
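A boolean mask is one common way to express these quality rules. The toy data below is hypothetical and includes one row with zero quantity and one with a negative price:

```python
import pandas as pd

# Hypothetical silver data with two invalid rows.
silver_df = pd.DataFrame({
    "quantity": [1, 0, 3, 2],
    "price": [10.0, 5.0, -2.0, 4.0],
})
silver_df["total_price"] = silver_df["quantity"] * silver_df["price"]

records_before = len(silver_df)

# Combine the rules with &; each comparison needs its own parentheses.
is_valid = (
    (silver_df["quantity"] > 0)
    & (silver_df["price"] > 0)
    & (silver_df["total_price"] > 0)
)
silver_clean_df = silver_df[is_valid].reset_index(drop=True)

records_after = len(silver_clean_df)
print(f"records before: {records_before}, after: {records_after}")
```

Comparisons against missing values evaluate to `False`, so rows with a missing quantity or price are dropped by this mask as well.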


In [None]:
# TODO: Print records count before filtering
# TODO: Keep only valid rows (quantity > 0, price > 0, total_price > 0)
# TODO: Print records count after filtering
# TODO: Create silver directory and save filtered data as parquet


## 4. Gold Layer: Aggregated Data for Analytics

The Gold layer is optimized for analytics and reporting. We'll create various aggregations suitable for business intelligence and reporting.

**Exercise 4.1**: Create sales summary aggregation by product and city.
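Passing a dict of lists to `agg` produces two-level column names, which is why a flattening step is needed. The sketch below shows one way to do it on hypothetical sample rows; the flattened names such as `total_price_sum` are simply what joining the two levels yields, not a required convention.

```python
import pandas as pd

# Hypothetical silver data.
silver_df = pd.DataFrame({
    "product_name": ["Laptop", "Laptop", "Mouse", "Mouse"],
    "city": ["Berlin", "Berlin", "Paris", "Berlin"],
    "quantity": [1, 2, 3, 1],
    "total_price": [1000.0, 2000.0, 60.0, 20.0],
})

# A dict of lists yields MultiIndex columns like ("total_price", "sum").
gold_df = silver_df.groupby(["product_name", "city"]).agg({
    "total_price": ["count", "sum", "mean"],
    "quantity": ["sum", "mean"],
})

# Flatten ("total_price", "sum") into "total_price_sum".
gold_df.columns = ["_".join(col) for col in gold_df.columns]

# Turn the group keys back into ordinary columns.
gold_df = gold_df.reset_index()

print(gold_df.dtypes)
print(gold_df.head())
```

Here `total_price_count` is the number of transactions per product and city; renaming it to something like `transaction_count` may read better in reports.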


In [None]:
# TODO: Load silver data from parquet file
# TODO: Create gold layer aggregation by grouping by product_name and city
# TODO: Aggregate: count transactions, sum/mean total_price, sum/mean quantity
# TODO: Flatten column names and reset index
# TODO: Print schema and display sample data


**Exercise 4.2**: Perform analytics queries on the Gold table.
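Top-N questions like these map naturally onto `nlargest`. The sketch below runs on a hypothetical gold table and assumes the revenue column is named `total_price_sum`; adjust the name to whatever your aggregation produced.

```python
import pandas as pd

# Hypothetical gold table: one row per product and city.
gold_df = pd.DataFrame({
    "product_name": ["Laptop", "Laptop", "Mouse", "Mouse"],
    "city": ["Berlin", "Paris", "Berlin", "Paris"],
    "total_price_sum": [3000.0, 1000.0, 20.0, 60.0],
})

# Re-aggregate to product level, then keep the 5 largest totals.
top_products = gold_df.groupby("product_name")["total_price_sum"].sum().nlargest(5)

# Same idea at city level.
top_cities = gold_df.groupby("city")["total_price_sum"].sum().nlargest(5)

# The gold table is already at product-city grain, so no groupby is needed.
top_combinations = gold_df.nlargest(10, "total_price_sum")[
    ["product_name", "city", "total_price_sum"]
]

print(top_products)
print(top_cities)
print(top_combinations)
```

`nlargest(n)` returns fewer than `n` rows when the data has fewer groups, as it does with this small sample.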


In [None]:
# TODO: Find top 5 products by total revenue (group by product_name)
# TODO: Find top 5 cities by total revenue (group by city)
# TODO: Find top 10 product-city combinations by revenue
# TODO: Create gold directory and save aggregated data as parquet


## 5. Summary and Cleanup

Let's summarize what we've accomplished in our Medallion Architecture implementation.
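For the file listing, a helper along these lines is one option. The function name `list_layer_files` is hypothetical, and it assumes each layer lives in its own subdirectory (`bronze`, `silver`, `gold`) under a common base directory.

```python
from pathlib import Path


def list_layer_files(base_dir, layers=("bronze", "silver", "gold")):
    """Return {layer: [file names]} for each layer directory under base_dir."""
    summary = {}
    for layer in layers:
        layer_dir = Path(base_dir) / layer
        if layer_dir.is_dir():
            # Keep files only and sort them for stable output.
            summary[layer] = sorted(p.name for p in layer_dir.iterdir() if p.is_file())
        else:
            # A missing directory is reported as an empty layer.
            summary[layer] = []
    return summary
```

Calling `list_layer_files("../data")` and printing each entry would give a quick overview of what the pipeline wrote; the `../data` base path is an assumption about where your layers are stored.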


In [None]:
# TODO: Print completion message with pipeline summary
# TODO: List files created in each data directory (bronze, silver, gold)
# TODO: Print congratulations message and next steps


## Next Steps

Now that you've mastered the Medallion Architecture with pandas, here are some suggestions for further learning:

### Immediate Next Steps:
- **Try the PySpark version**: Move to `practice_pyspark_medallion_architecture_guide.ipynb` for distributed processing
- **Experiment with different data sources**: JSON, Excel, databases
- **Add more complex transformations**: Pivot tables, time series analysis, statistical functions

### Advanced Concepts to Explore:
- **Data Quality Monitoring**: Implement data validation rules and monitoring
- **Incremental Processing**: Handle new data arriving daily/hourly
- **Schema Evolution**: Handle changes in data structure over time
- **Performance Optimization**: Chunking, parallel processing, memory management

### Production Considerations:
- **Error Handling**: Robust error handling and logging
- **Configuration Management**: External config files for flexibility
- **Monitoring and Alerting**: Track pipeline health and performance
- **Testing**: Unit tests and data quality tests

Happy data engineering! 🚀
