# 19 - Complete Data Processing Example

## Introduction

This notebook combines the concepts we've learned so far into one complete data processing example. It walks through typical data engineering tasks in Python: reading raw data, cleaning it, analyzing it, and writing the results back out.

## What This Example Demonstrates

- Reading data from CSV files
- Data cleaning and transformation
- Error handling
- Using list comprehensions and lambda functions
- Writing processed data to files
- Combining multiple Python concepts


## Step 1: Create Sample Data

First, let's create a sample CSV file with sales data.


In [1]:
import csv
from datetime import datetime

# Create sample sales data
sales_data = [
    ["Date", "Product", "Quantity", "Price", "Customer"],
    ["2024-01-15", "Laptop", "2", "999.99", "Alice"],
    ["2024-01-16", "Mouse", "5", "29.99", "Bob"],
    ["2024-01-17", "Keyboard", "3", "79.99", "Charlie"],
    ["2024-01-18", "Laptop", "1", "999.99", "Alice"],
    ["2024-01-19", "Monitor", "2", "299.99", "Diana"],
    ["2024-01-20", "Mouse", "10", "29.99", "Bob"],
    ["invalid-date", "Headphones", "1", "99.99", "Eve"],  # Invalid date for testing
    ["2024-01-22", "Keyboard", "2", "79.99", "Frank"]
]

# Write to CSV
with open("sales_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(sales_data)

print("Sample data file created!")


Sample data file created!
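
As a variation on the cell above, the same kind of file can be written from dictionaries with `csv.DictWriter`, which keeps each value attached to its column name. This is only a minimal sketch: the `sales_rows` name and the in-memory `io.StringIO` buffer (used here instead of a real file) are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# Column order for the output file
fieldnames = ["Date", "Product", "Quantity", "Price", "Customer"]

# Hypothetical sample rows, keyed by column name
sales_rows = [
    {"Date": "2024-01-15", "Product": "Laptop", "Quantity": "2",
     "Price": "999.99", "Customer": "Alice"},
    {"Date": "2024-01-16", "Product": "Mouse", "Quantity": "5",
     "Price": "29.99", "Customer": "Bob"},
]

# Write to an in-memory buffer so the sketch leaves no files behind
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(sales_rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for `open("sales_data.csv", "w", newline="")` would write the same content to disk.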


## Step 2: Read and Process Data

Now let's read the data, clean it, and calculate totals.


In [2]:
# Read and process data
processed_records = []
errors = []

with open("sales_data.csv", "r", newline="") as file:
    reader = csv.DictReader(file)
    for row in reader:
        try:
            # Parse date with error handling
            sale_date = datetime.strptime(row["Date"], "%Y-%m-%d").date()
            
            # Convert quantity and price to numbers
            quantity = int(row["Quantity"])
            price = float(row["Price"])
            
            # Calculate total
            total = quantity * price
            
            # Create processed record
            record = {
                "date": sale_date,
                "product": row["Product"],
                "quantity": quantity,
                "price": price,
                "total": total,
                "customer": row["Customer"]
            }
            processed_records.append(record)
            
        except (ValueError, KeyError) as e:
            errors.append(f"Error processing row {row}: {e}")

print(f"Successfully processed {len(processed_records)} records")
print(f"Found {len(errors)} errors")
if errors:
    print("Errors:", errors)


Successfully processed 7 records
Found 1 errors
Errors: ["Error processing row {'Date': 'invalid-date', 'Product': 'Headphones', 'Quantity': '1', 'Price': '99.99', 'Customer': 'Eve'}: time data 'invalid-date' does not match format '%Y-%m-%d'"]
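
The same try/except pattern can be packaged into a small function, which makes the conversion logic easier to reuse and test. The sketch below is one possible arrangement, not the notebook's own code: `parse_row`, `raw_csv`, and the extra row with a non-numeric quantity are hypothetical names and data added for illustration. It also records the file line number of each bad row, so errors are easier to trace back to the source.

```python
import csv
import io
from datetime import datetime


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into a typed record."""
    quantity = int(row["Quantity"])
    price = float(row["Price"])
    return {
        "date": datetime.strptime(row["Date"], "%Y-%m-%d").date(),
        "product": row["Product"],
        "quantity": quantity,
        "price": price,
        "total": quantity * price,
        "customer": row["Customer"],
    }


# Hypothetical input: one good row, one bad date, one bad quantity
raw_csv = """Date,Product,Quantity,Price,Customer
2024-01-15,Laptop,2,999.99,Alice
invalid-date,Headphones,1,99.99,Eve
2024-01-22,Keyboard,two,79.99,Frank
"""

records, errors = [], []
reader = csv.DictReader(io.StringIO(raw_csv))
# Data rows start on line 2 because line 1 is the header
for line_number, row in enumerate(reader, start=2):
    try:
        records.append(parse_row(row))
    except (ValueError, KeyError) as e:
        errors.append((line_number, str(e)))

print(f"Parsed {len(records)} record(s)")
for line_number, message in errors:
    print(f"  Line {line_number}: {message}")
```

Both the invalid date and the non-numeric quantity raise `ValueError`, so a single `except` clause covers them.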


## Step 3: Data Analysis

Let's perform some analysis using list comprehensions and lambda functions.


In [3]:
# Calculate total sales
total_sales = sum(record["total"] for record in processed_records)
print(f"Total Sales: ${total_sales:,.2f}")

# Find highest sale
highest_sale = max(processed_records, key=lambda x: x["total"])
print("\nHighest Sale:")
print(f"  Product: {highest_sale['product']}")
print(f"  Amount: ${highest_sale['total']:,.2f}")
print(f"  Customer: {highest_sale['customer']}")

# Get unique products (sorted, because set iteration order is not guaranteed)
unique_products = sorted(set(record["product"] for record in processed_records))
print(f"\nUnique Products: {unique_products}")

# Filter high-value sales (> $500)
high_value_sales = [r for r in processed_records if r["total"] > 500]
print(f"\nHigh-value sales (> $500): {len(high_value_sales)}")
for sale in high_value_sales:
    print(f"  {sale['product']}: ${sale['total']:,.2f}")


Total Sales: $4,449.75

Highest Sale:
  Product: Laptop
  Amount: $1,999.98
  Customer: Alice

Unique Products: ['Keyboard', 'Laptop', 'Monitor', 'Mouse']

High-value sales (> $500): 3
  Laptop: $1,999.98
  Laptop: $999.99
  Monitor: $599.98
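
The key function used with `max` works the same way with `sorted`, which lets us rank every sale instead of finding only the largest one. Here is a minimal sketch using a small, hand-written `records` list (an assumption for illustration, shaped like the notebook's processed records but with fewer fields):

```python
# Hypothetical records with only the fields this example needs
records = [
    {"product": "Laptop", "total": 1999.98},
    {"product": "Mouse", "total": 149.95},
    {"product": "Monitor", "total": 599.98},
    {"product": "Keyboard", "total": 239.97},
]

# Sort by total, largest first, then keep the top two
top_two = sorted(records, key=lambda r: r["total"], reverse=True)[:2]

# Pull out just the product names with a list comprehension
top_products = [r["product"] for r in top_two]
print(top_products)  # ['Laptop', 'Monitor']
```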


## Step 4: Group by Customer

Let's calculate total sales per customer.


In [4]:
# Group sales by customer
from collections import defaultdict

customer_totals = defaultdict(float)
for record in processed_records:
    customer_totals[record["customer"]] += record["total"]

# Sort by total (descending)
sorted_customers = sorted(customer_totals.items(), key=lambda x: x[1], reverse=True)

print("Sales by Customer:")
for customer, total in sorted_customers:
    print(f"  {customer}: ${total:,.2f}")


Sales by Customer:
  Alice: $2,999.97
  Diana: $599.98
  Bob: $449.85
  Charlie: $239.97
  Frank: $159.98
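
The same `defaultdict` approach extends to several aggregates per group. As a rough sketch, the cell below groups by product and tracks the order count, units sold, and revenue together. The `records` list and the `product_stats` name are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical records with only the fields this example needs
records = [
    {"product": "Laptop", "quantity": 2, "total": 1999.98},
    {"product": "Mouse", "quantity": 5, "total": 149.95},
    {"product": "Laptop", "quantity": 1, "total": 999.99},
    {"product": "Mouse", "quantity": 10, "total": 299.90},
]

# Each new product starts with its own zeroed stats dictionary
product_stats = defaultdict(lambda: {"orders": 0, "units": 0, "revenue": 0.0})

for record in records:
    stats = product_stats[record["product"]]
    stats["orders"] += 1
    stats["units"] += record["quantity"]
    stats["revenue"] += record["total"]

for product, stats in product_stats.items():
    print(f"  {product}: {stats['orders']} orders, "
          f"{stats['units']} units, ${stats['revenue']:,.2f}")
```

The lambda is needed because `defaultdict` calls its factory once per missing key, so every product gets a fresh dictionary rather than a shared one.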


## Step 5: Write Processed Data

Finally, let's write the processed data to a new CSV file.


In [5]:
# Write processed data to new file
with open("processed_sales.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Date", "Product", "Quantity", "Price", "Total", "Customer"])
    
    for record in processed_records:
        writer.writerow([
            record["date"],
            record["product"],
            record["quantity"],
            record["price"],
            record["total"],
            record["customer"]
        ])

print("Processed data written to processed_sales.csv")

# Verify by reading a few rows
with open("processed_sales.csv", "r") as file:
    reader = csv.reader(file)
    print("\nFirst 3 rows of processed data:")
    for i, row in enumerate(reader):
        if i >= 4:  # Stop after the header + 3 rows
            break
        print(row)


Processed data written to processed_sales.csv

First 3 rows of processed data:
['Date', 'Product', 'Quantity', 'Price', 'Total', 'Customer']
['2024-01-15', 'Laptop', '2', '999.99', '1999.98', 'Alice']
['2024-01-16', 'Mouse', '5', '29.99', '149.95', 'Bob']
['2024-01-17', 'Keyboard', '3', '79.99', '239.96999999999997', 'Charlie']
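
Notice the Keyboard total of `239.96999999999997` in the last row. This is ordinary binary floating-point behavior, not a bug in the CSV code: `79.99` has no exact `float` representation. Two common ways to deal with it are rounding before writing, or doing the arithmetic with `decimal.Decimal`. The sketch below shows both. It is one possible fix under those assumptions, not a change the notebook itself makes.

```python
from decimal import Decimal

quantity = 3

# Float arithmetic: can carry a tiny representation error
float_total = quantity * float("79.99")
print(float_total)

# Option 1: round to cents before writing the value out
rounded_total = round(float_total, 2)
print(rounded_total)

# Option 2: build the Decimal from the original string, so the value is exact
decimal_total = quantity * Decimal("79.99")
print(decimal_total)
```

For money values, constructing `Decimal` directly from the CSV string (rather than from a `float`) is what avoids the representation error in the first place.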


## Summary

This example demonstrated:
- ✅ Reading CSV files
- ✅ Error handling for invalid data
- ✅ Data transformation and calculations
- ✅ Using list comprehensions for filtering
- ✅ Using lambda functions as keys for `max` and `sorted`
- ✅ Working with dates
- ✅ Grouping and aggregating data
- ✅ Writing processed data to files

These are the same concepts you'll use in PySpark, but with DataFrames instead of lists and dictionaries!
