# 08 - Practical Data Engineering Example

## Introduction

This notebook walks through a complete data engineering workflow that combines the concepts covered so far. We'll read sales and product data from two CSV files, clean and merge them, aggregate the results, and write a final report.

## Scenario

You work as a data engineer and need to:
1. Read sales data from multiple CSV files
2. Clean the data (handle missing values, duplicates)
3. Merge data from different sources
4. Calculate aggregations
5. Generate a summary report


In [1]:
import pandas as pd
import numpy as np

print("Starting data engineering pipeline...")
print("=" * 50)


Starting data engineering pipeline...
==================================================


## Step 1: Read Data from Multiple Sources


In [2]:
# Create sample sales data
sales_data = {
    'Date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'],
    'Product_ID': ['P001', 'P002', 'P001', 'P003', 'P002'],
    'Quantity': [10, 5, 8, 12, None],
    'Price': [100, 200, 100, 150, 200],
    'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob']
}

df_sales = pd.DataFrame(sales_data)
df_sales.to_csv('sales_data.csv', index=False)

# Create product master data
product_data = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Product_Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
}

df_products = pd.DataFrame(product_data)
df_products.to_csv('products.csv', index=False)

print("Sample data files created!")
print("\nSales Data:")
print(df_sales)
print("\nProduct Data:")
print(df_products)


Sample data files created!

Sales Data:
         Date Product_ID  Quantity  Price Salesperson
0  2024-01-15       P001      10.0    100       Alice
1  2024-01-16       P002       5.0    200         Bob
2  2024-01-17       P001       8.0    100       Alice
3  2024-01-18       P003      12.0    150     Charlie
4  2024-01-19       P002       NaN    200         Bob

Product Data:
  Product_ID Product_Name     Category
0       P001       Laptop  Electronics
1       P002        Mouse  Accessories
2       P003     Keyboard  Accessories
3       P004      Monitor  Electronics


In [3]:
# Read the data files
sales_df = pd.read_csv('sales_data.csv')
products_df = pd.read_csv('products.csv')

print("Data loaded successfully!")
print(f"Sales records: {len(sales_df)}")
print(f"Products: {len(products_df)}")


Data loaded successfully!
Sales records: 5
Products: 4
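
With default settings, `pd.read_csv` infers every dtype and leaves `Date` as a plain string, which is why Step 2 converts it afterwards. As a minimal sketch, assuming the same column names as the sample data, the types can instead be declared at read time with `parse_dates` and `dtype`. The in-memory `io.StringIO` buffer is only a stand-in for `sales_data.csv` so the snippet runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for sales_data.csv (two rows with the same columns as above)
csv_text = """Date,Product_ID,Quantity,Price,Salesperson
2024-01-15,P001,10.0,100,Alice
2024-01-19,P002,,200,Bob
"""

typed_sales = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],                             # parse Date while reading
    dtype={"Quantity": "float64", "Price": "int64"},  # fail early on unexpected values
)

print(typed_sales.dtypes)
```

Declaring dtypes up front makes a malformed file raise an error at load time rather than surfacing later as a wrong aggregate.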


## Step 2: Data Cleaning


In [4]:
# Check for missing values
print("Missing values in sales data:")
print(sales_df.isnull().sum())


Missing values in sales data:
Date           0
Product_ID     0
Quantity       1
Price          0
Salesperson    0
dtype: int64


In [5]:
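# Fill missing Quantity with 0. This treats the unrecorded sale as zero units sold;
# dropping the row with dropna(subset=['Quantity']) is the alternative.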
sales_df['Quantity'] = sales_df['Quantity'].fillna(0)

# Convert Date to datetime
sales_df['Date'] = pd.to_datetime(sales_df['Date'])

# Remove duplicates
sales_df = sales_df.drop_duplicates()

print("Data cleaning completed!")
print(f"Records after cleaning: {len(sales_df)}")


Data cleaning completed!
Records after cleaning: 5
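
The choice between filling and dropping changes downstream statistics. The sketch below uses the five `Quantity` values from the sample data to compare the two options, and also keeps a flag recording which rows were originally missing. The variable names are illustrative only.

```python
import pandas as pd

# The Quantity column from the sample sales data, including the one missing value
quantity = pd.Series([10, 5, 8, 12, None], name="Quantity")

was_missing = quantity.isna()   # remember which rows were imputed
filled = quantity.fillna(0)     # treats the unknown quantity as "nothing sold"
dropped = quantity.dropna()     # excludes the unknown quantity entirely

print(filled.mean())    # 7.0  (35 / 5)
print(dropped.mean())   # 8.75 (35 / 4)
```

Totals are identical either way, but means and counts differ, so the imputation rule should match what a missing value actually means in the source system.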


## Step 3: Data Transformation


In [6]:
# Calculate revenue per transaction (Quantity * Price)
sales_df['Revenue'] = sales_df['Quantity'] * sales_df['Price']

# Merge with product data
merged_df = pd.merge(sales_df, products_df, on='Product_ID', how='left')

print("Data transformation completed!")
print("\nMerged DataFrame:")
print(merged_df)


Data transformation completed!

Merged DataFrame:
        Date Product_ID  Quantity  Price Salesperson  Revenue Product_Name  \
0 2024-01-15       P001      10.0    100       Alice   1000.0       Laptop   
1 2024-01-16       P002       5.0    200         Bob   1000.0        Mouse   
2 2024-01-17       P001       8.0    100       Alice    800.0       Laptop   
3 2024-01-18       P003      12.0    150     Charlie   1800.0     Keyboard   
4 2024-01-19       P002       0.0    200         Bob      0.0        Mouse   

      Category  
0  Electronics  
1  Accessories  
2  Electronics  
3  Accessories  
4  Accessories  
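
A left merge silently produces `NaN` product details when a sale references a product that is not in the master data, and it duplicates sales rows if the master data repeats a `Product_ID`. As a sketch of one way to guard against both, `pd.merge` accepts `validate` and `indicator`. The `P999` row is a hypothetical unmatched sale added for illustration; it is not part of the sample data.

```python
import pandas as pd

sales = pd.DataFrame({
    "Product_ID": ["P001", "P002", "P999"],  # P999 is a hypothetical unknown product
    "Revenue": [1000.0, 1000.0, 50.0],
})
products = pd.DataFrame({
    "Product_ID": ["P001", "P002"],
    "Product_Name": ["Laptop", "Mouse"],
})

checked = pd.merge(
    sales,
    products,
    on="Product_ID",
    how="left",
    validate="many_to_one",  # raises MergeError if Product_ID repeats in products
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Sales rows that found no matching product
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Product_ID", "Revenue"]])
```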


## Step 4: Data Aggregation and Analysis


In [7]:
# Total revenue by product
revenue_by_product = merged_df.groupby('Product_Name')['Revenue'].sum().sort_values(ascending=False)
print("Total Revenue by Product:")
print(revenue_by_product)


Total Revenue by Product:
Product_Name
Keyboard    1800.0
Laptop      1800.0
Mouse       1000.0
Name: Revenue, dtype: float64


In [8]:
# Total revenue by salesperson
revenue_by_salesperson = merged_df.groupby('Salesperson')['Revenue'].sum().sort_values(ascending=False)
print("\nTotal Revenue by Salesperson:")
print(revenue_by_salesperson)



Total Revenue by Salesperson:
Salesperson
Alice      1800.0
Charlie    1800.0
Bob        1000.0
Name: Revenue, dtype: float64
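
Both rankings above contain a tie at 1800.0 (Keyboard and Laptop, Alice and Charlie), so the first row of the sorted result is only one of two equal leaders. A small sketch, using the product totals printed above, of how `Series.nlargest` with `keep="all"` returns every tied leader:

```python
import pandas as pd

# Product totals copied from the output above
revenue = pd.Series(
    {"Keyboard": 1800.0, "Laptop": 1800.0, "Mouse": 1000.0},
    name="Revenue",
)

# keep="all" returns every entry tied for the top value instead of just one
top_products = revenue.nlargest(1, keep="all")
print(sorted(top_products.index))
```

Whether a report should list all tied leaders or break the tie with a second criterion is a business decision; this only makes the tie visible.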


In [9]:
# Summary statistics
summary = merged_df.groupby('Category').agg({
    'Revenue': ['sum', 'mean', 'count'],
    'Quantity': 'sum'
})
print("\nSummary by Category:")
print(summary)



Summary by Category:
            Revenue                   Quantity
                sum        mean count      sum
Category                                      
Accessories  2800.0  933.333333     3     17.0
Electronics  1800.0  900.000000     2     18.0
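
Passing a dict of lists to `agg` yields two-level column headers, which are awkward to export to CSV. As an alternative sketch, named aggregation produces flat, self-describing column names. The output column names (`total_revenue` and so on) are illustrative choices, and the input rows are copied from the merged DataFrame above.

```python
import pandas as pd

# Category, Revenue, and Quantity columns copied from the merged DataFrame above
merged = pd.DataFrame({
    "Category": ["Electronics", "Accessories", "Electronics", "Accessories", "Accessories"],
    "Revenue": [1000.0, 1000.0, 800.0, 1800.0, 0.0],
    "Quantity": [10.0, 5.0, 8.0, 12.0, 0.0],
})

# Named aggregation: new_column=(source_column, aggregation)
flat_summary = (
    merged.groupby("Category")
    .agg(
        total_revenue=("Revenue", "sum"),
        mean_revenue=("Revenue", "mean"),
        transactions=("Revenue", "count"),
        total_quantity=("Quantity", "sum"),
    )
    .reset_index()
)

print(flat_summary)
```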


## Step 5: Generate Final Report


In [10]:
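# Create final report DataFrame
# Note: index[0] reports a single winner even when totals tie, as they do here
# (Keyboard/Laptop and Alice/Charlie each total 1800.0).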
report = pd.DataFrame({
    'Metric': [
        'Total Revenue',
        'Total Quantity Sold',
        'Number of Transactions',
        'Average Revenue per Transaction',
        'Top Product',
        'Top Salesperson'
    ],
    'Value': [
        merged_df['Revenue'].sum(),
        merged_df['Quantity'].sum(),
        len(merged_df),
        merged_df['Revenue'].mean(),
        revenue_by_product.index[0],
        revenue_by_salesperson.index[0]
    ]
})

print("=" * 50)
print("FINAL REPORT")
print("=" * 50)
print(report)


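==================================================
FINAL REPORT
==================================================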
                            Metric     Value
0                    Total Revenue    4600.0
1              Total Quantity Sold      35.0
2           Number of Transactions         5
3  Average Revenue per Transaction     920.0
4                      Top Product  Keyboard
5                  Top Salesperson     Alice


In [11]:
# Save the processed transactions and the final report
merged_df.to_csv('processed_sales_data.csv', index=False)
report.to_csv('sales_report.csv', index=False)
print("\nReports saved to CSV files!")



Reports saved to CSV files!
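
CSV stores no type information, so a file written this way comes back with `Date` as a string unless it is parsed again. A minimal round-trip check, assuming a two-row stand-in for the processed data and a temporary directory so nothing is written next to the notebook:

```python
import os
import tempfile
import pandas as pd

# Two-row stand-in for the processed sales data
processed = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    "Revenue": [1000.0, 1000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_sales_data.csv")
    processed.to_csv(path, index=False)

    # Re-read the file, restoring the datetime type that CSV does not preserve
    reloaded = pd.read_csv(path, parse_dates=["Date"])

# Basic sanity checks: same row count and same revenue total after the round trip
rows_match = len(reloaded) == len(processed)
totals_match = reloaded["Revenue"].sum() == processed["Revenue"].sum()
print(rows_match, totals_match)
```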


## Summary

This practical example demonstrated:
- ✅ Reading data from multiple sources
- ✅ Data cleaning (handling missing values, duplicates)
- ✅ Data transformation (calculations, merging)
- ✅ Data aggregation and analysis
- ✅ Generating and saving reports

**This is a typical data engineering workflow!**
