# Lesson 10: Working with Data

**Session:** Week 3, Thursday (2 hours)  
**Learning Objectives:**
- Apply Python fundamentals to real data analysis
- Clean and process messy data using core Python
- Perform basic statistical analysis without libraries
- Build data processing pipelines
- Prepare for advanced data science tools (pandas preview)
- Understand data formats and structures common in data science

## 🎉 From Fundamentals to Data Science!

Today we bridge the gap between **Python fundamentals** and **real data science work**!

### Your Journey So Far 🗺️
- **Week 1**: Variables, strings, lists, dictionaries (the building blocks)
- **Week 2**: Conditionals, loops, functions (the logic and organization)
- **Week 3 (Tuesday)**: File I/O and error handling (working with external data)
- **Today**: Putting it all together for **real data analysis**! 📊

### What Makes This Different? 🤔
Instead of clean, perfect data, we'll work with:
- **Messy** data that needs cleaning
- **Real-world** datasets with missing values
- **Complex** structures requiring careful processing
- **Large** amounts of information to summarize

**Today you become a data scientist!** 🔬✨

## The Reality of Data Science 📊

### The Data Science Pipeline
A commonly cited rule of thumb is that about 80% of a data scientist's time is spent on:
1. **Data Collection** - Getting data from various sources
2. **Data Cleaning** - Fixing inconsistencies and errors
3. **Data Transformation** - Converting to usable formats
4. **Exploratory Analysis** - Understanding patterns
5. **Visualization** - Creating insights people can understand

Only about 20% is left for the "sexy" machine learning part!

### Today's Focus: The Foundation Skills
We'll use **pure Python** (no pandas yet!) to master the fundamentals that make you a strong data scientist.
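
As a rough sketch of how the first three stages can fit together in pure Python, the snippet below chains three small functions. The function names (`collect`, `clean`, `transform`) and the sample rows are invented for illustration and are not part of the lesson's dataset.

```python
# Hypothetical stage functions; the names and sample rows are illustrative only.

def collect():
    """Data collection: here just a hard-coded list of raw text rows."""
    return ["apple, 3", "banana,", "  cherry ,5 "]

def clean(rows):
    """Data cleaning: strip whitespace and drop rows with a missing value."""
    cleaned = []
    for row in rows:
        name, count = [part.strip() for part in row.split(",")]
        if name and count:
            cleaned.append((name, count))
    return cleaned

def transform(rows):
    """Data transformation: convert the count from text to an integer."""
    return [{"name": name, "count": int(count)} for name, count in rows]

records = transform(clean(collect()))
print(records)
```

Each stage takes the previous stage's output, so a single stage can be tested or replaced without touching the others.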

## The Restaurant Kitchen Analogy 👨‍🍳

### Data Science is Like Running a Restaurant Kitchen

**Raw Data = Fresh Ingredients**
- Sometimes perfect (like clean CSV files)
- Often messy (missing values, wrong formats, duplicates)
- Needs inspection and preparation before use

**Data Cleaning = Prep Work**
- Wash vegetables (remove invalid data)
- Cut ingredients uniformly (standardize formats)
- Remove spoiled parts (handle missing values)
- Organize ingredients by type (group and sort)

**Data Analysis = Cooking Process**
- Combine ingredients thoughtfully (merge datasets)
- Apply heat and seasoning (mathematical operations)
- Taste and adjust (iterative analysis)
- Present beautifully (visualization and reporting)

**Good Data Scientists** = **Master Chefs**
- Know their ingredients (understand data)
- Have systematic processes (reproducible workflows)
- Handle problems gracefully (error handling)
- Create something valuable from raw materials

**Let's become data chefs!** 👨‍🍳📊

In [None]:
# Welcome to the Data Kitchen!
import json
import csv
from datetime import datetime
import random

print("👨‍🍳 Welcome to the Data Science Kitchen!")
print("Today's Menu:")
print("🥗 Data Cleaning Techniques")
print("🍲 Statistical Analysis Methods")
print("🧄 Data Processing Workflows")
print("🍰 Insight Generation")
print("\nLet's start cooking with data! 📊")

## Working with Real, Messy Data 🧹

Let's create some realistic, messy data to work with:

In [None]:
# Create realistic, messy sales data
print("=== Creating Realistic Sales Dataset ===")

# Simulate messy data like you'd find in the real world
messy_sales_data = [
    # Some good records
    "2024-01-15,John Smith,Laptop,1,999.99,Electronics",
    "2024-01-16,Sarah Johnson,Mouse,2,29.99,Electronics", 
    
    # Missing data
    "2024-01-17,,Keyboard,1,89.99,Electronics",  # Missing customer name
    "2024-01-18,Mike Brown,,1,149.99,Electronics",  # Missing product
    
    # Inconsistent formatting
    "2024-01-19,  Lisa Davis  ,smartphone,1,699.99,  ELECTRONICS  ",  # Extra spaces
    "01/20/2024,Bob Wilson,tablet,1,399.99,electronics",  # Different date format
    
    # Data quality issues
    "2024-01-21,Emma White,Monitor,0,299.99,Electronics",  # Zero quantity
    "2024-01-22,Tom Garcia,Headphones,-1,149.99,Electronics",  # Negative quantity
    "2024-01-23,Anna Miller,Printer,1,-199.99,Electronics",  # Negative price
    
    # More inconsistencies
    "2024-01-24,DAVID CLARK,webcam,2,79.99,Tech",  # Different category name
    "2024-01-25,mary.johnson@email.com,Speaker,1,89.99,Electronics",  # Email as name
    
    # Empty or malformed records
    "",  # Completely empty
    "2024-01-26",  # Incomplete record
    "2024-01-27,James Wilson,Gaming Mouse,1,59.99,Electronics,Extra Field",  # Extra field
]

# Save the messy data
with open('messy_sales.txt', 'w') as file:
    for record in messy_sales_data:
        file.write(record + '\n')

print(f"✅ Created messy dataset with {len(messy_sales_data)} records")
print("📊 This data has typical real-world problems:")
print("   • Missing values")
print("   • Inconsistent formatting")
print("   • Invalid data (negative quantities/prices)")
print("   • Extra whitespace")
print("   • Different date formats")
print("   • Inconsistent categories")
print("   • Malformed records")
print("\n🎯 Our mission: Clean and analyze this data!")
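
The dataset above mixes `YYYY-MM-DD` and `MM/DD/YYYY` dates. One possible way to bring both into a single format is to try a short list of known formats with `datetime.strptime`, which also rejects impossible dates. This is a minimal sketch: `normalize_date` and `KNOWN_FORMATS` are hypothetical names, and the list of formats is an assumption about what the data contains.

```python
from datetime import datetime

# Assumed list of formats; extend it if your data uses others.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def normalize_date(text):
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # Try the next format
    return None

print(normalize_date("2024-01-15"))
print(normalize_date("01/20/2024"))
print(normalize_date("2024-13-45"))  # Month 13 does not exist, so this is rejected
```

Returning `None` instead of raising lets the caller decide whether to log the problem or drop the record.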

## Data Cleaning: The Foundation of Good Analysis 🧹

In [None]:
# Data Cleaning Pipeline
print("=== Data Cleaning Pipeline ===")

def clean_sales_data(filename='messy_sales.txt'):
    """
    Clean messy sales data using core Python skills
    
    Returns: list of cleaned dictionaries
    """
    cleaned_records = []
    errors_log = []
    
    try:
        with open(filename, 'r') as file:
            for line_num, line in enumerate(file, 1):
                # Skip empty lines
                if not line.strip():
                    errors_log.append(f"Line {line_num}: Empty line skipped")
                    continue
                
                # Split by comma
                fields = [field.strip() for field in line.strip().split(',')]
                
                # Check if we have the expected number of fields
                if len(fields) < 6:
                    errors_log.append(f"Line {line_num}: Incomplete record - {len(fields)} fields")
                    continue
                elif len(fields) > 6:
                    errors_log.append(f"Line {line_num}: Extra fields found, keeping first 6")
                    fields = fields[:6]  # Keep only first 6 fields
                
                # Extract fields
                date_str, customer_name, product, quantity_str, price_str, category = fields
                
                # Clean and validate each field
                record = {}
                record_valid = True
                
                # 1. Clean date
                if date_str:
                    # Handle different date formats
                    if '/' in date_str:  # MM/DD/YYYY format
                        try:
                            month, day, year = date_str.split('/')
                            record['date'] = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
                        except ValueError:  # Raised when the date does not split into exactly 3 parts
                            errors_log.append(f"Line {line_num}: Invalid date format: {date_str}")
                            record_valid = False
                    else:  # Assume YYYY-MM-DD format
                        record['date'] = date_str
                else:
                    errors_log.append(f"Line {line_num}: Missing date")
                    record_valid = False
                
                # 2. Clean customer name
                if customer_name and '@' not in customer_name:  # Skip email addresses
                    # Standardize name format (Title Case)
                    record['customer_name'] = customer_name.title().strip()
                else:
                    errors_log.append(f"Line {line_num}: Invalid customer name: {customer_name}")
                    record_valid = False
                
                # 3. Clean product name
                if product:
                    record['product'] = product.title().strip()
                else:
                    errors_log.append(f"Line {line_num}: Missing product name")
                    record_valid = False
                
                # 4. Clean quantity (must be positive integer)
                try:
                    quantity = int(quantity_str)
                    if quantity > 0:
                        record['quantity'] = quantity
                    else:
                        errors_log.append(f"Line {line_num}: Invalid quantity: {quantity}")
                        record_valid = False
                except ValueError:
                    errors_log.append(f"Line {line_num}: Non-numeric quantity: {quantity_str}")
                    record_valid = False
                
                # 5. Clean price (must be positive float)
                try:
                    price = float(price_str)
                    if price > 0:
                        record['price'] = price
                    else:
                        errors_log.append(f"Line {line_num}: Invalid price: {price}")
                        record_valid = False
                except ValueError:
                    errors_log.append(f"Line {line_num}: Non-numeric price: {price_str}")
                    record_valid = False
                
                # 6. Clean category (standardize)
                if category:
                    # Standardize category names
                    clean_category = category.upper().strip()
                    if clean_category in ['ELECTRONICS', 'TECH']:
                        record['category'] = 'Electronics'
                    else:
                        record['category'] = clean_category.title()
                else:
                    record['category'] = 'Unknown'
                
                # Add calculated field
                if record_valid and 'quantity' in record and 'price' in record:
                    record['total'] = round(record['quantity'] * record['price'], 2)
                
                # Only add valid records
                if record_valid:
                    cleaned_records.append(record)
                
    except FileNotFoundError:
        print(f"❌ File '{filename}' not found!")
        return [], []
    except Exception as e:
        print(f"❌ Error cleaning data: {e}")
        return [], []
    
    return cleaned_records, errors_log

# Clean the data
clean_data, cleaning_errors = clean_sales_data()

print(f"\n📊 Data Cleaning Results:")
print(f"✅ Clean records: {len(clean_data)}")
print(f"⚠️ Issues found: {len(cleaning_errors)}")

print(f"\n🧹 Cleaning Issues Log:")
for error in cleaning_errors[:5]:  # Show first 5 errors
    print(f"   • {error}")
if len(cleaning_errors) > 5:
    print(f"   ... and {len(cleaning_errors) - 5} more issues")

print(f"\n✅ Sample Clean Records:")
for i, record in enumerate(clean_data[:3], 1):
    print(f"Record {i}: {record}")
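
The cleaning function splits each line with `split(',')`, which works for this dataset because no field contains a comma. As a hedged illustration of where that breaks down, the sketch below compares it with the standard library's `csv.reader` on an invented line whose customer name is quoted and contains a comma.

```python
import csv
import io

# Invented sample: the customer name contains a comma, so it is quoted.
sample = '2024-02-01,"Smith, John",Laptop,1,999.99,Electronics\n'

naive_fields = sample.strip().split(',')
csv_fields = next(csv.reader(io.StringIO(sample)))

print(len(naive_fields))  # 7 fields: the quoted name was split in two
print(len(csv_fields))    # 6 fields: the quoted name stays together
print(csv_fields[1])
```

For files that may contain quoted fields, reading with `csv.reader` and then applying the same validation steps is usually the safer choice.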

## Exploratory Data Analysis with Pure Python 🔍

In [None]:
# Basic Statistical Analysis Functions
print("=== Basic Statistical Analysis ===")

def calculate_statistics(numbers):
    """
    Calculate basic statistics for a list of numbers
    
    Returns: dict with mean, median, mode, std_dev, min, max
    """
    if not numbers:
        return None
    
    # Sort for easier calculations
    sorted_nums = sorted(numbers)
    n = len(sorted_nums)
    
    # Mean (average)
    mean = sum(sorted_nums) / n
    
    # Median (middle value)
    if n % 2 == 0:
        median = (sorted_nums[n//2 - 1] + sorted_nums[n//2]) / 2
    else:
        median = sorted_nums[n//2]
    
    # Mode (most frequent value)
    from collections import Counter
    counts = Counter(numbers)
    mode_count = max(counts.values())
    modes = [num for num, count in counts.items() if count == mode_count]
    mode = modes[0] if len(modes) == 1 else modes  # Single mode or multiple
    
    # Standard deviation (population form: divides by n, not n - 1)
    variance = sum((x - mean) ** 2 for x in numbers) / n
    std_dev = variance ** 0.5
    
    # Range
    min_val = min(sorted_nums)
    max_val = max(sorted_nums)
    
    return {
        'count': n,
        'mean': round(mean, 2),
        'median': median,
        'mode': mode,
        'std_dev': round(std_dev, 2),
        'min': min_val,
        'max': max_val,
        'range': max_val - min_val
    }

def analyze_sales_data(data):
    """
    Perform comprehensive analysis of sales data
    """
    if not data:
        print("❌ No data to analyze!")
        return
    
    print(f"📊 Sales Data Analysis ({len(data)} records)")
    print("=" * 50)
    
    # Extract numerical data
    quantities = [record['quantity'] for record in data]
    prices = [record['price'] for record in data]
    totals = [record['total'] for record in data]
    
    # 1. Overall Statistics
    print("\n💰 Revenue Analysis:")
    total_revenue = sum(totals)
    total_items = sum(quantities)
    avg_order_value = total_revenue / len(data)
    
    print(f"   Total Revenue: ${total_revenue:,.2f}")
    print(f"   Total Items Sold: {total_items:,}")
    print(f"   Number of Orders: {len(data):,}")
    print(f"   Average Order Value: ${avg_order_value:.2f}")
    
    # 2. Price Statistics
    print("\n💲 Price Analysis:")
    price_stats = calculate_statistics(prices)
    print(f"   Average Price: ${price_stats['mean']:.2f}")
    print(f"   Median Price: ${price_stats['median']:.2f}")
    print(f"   Price Range: ${price_stats['min']:.2f} - ${price_stats['max']:.2f}")
    print(f"   Price Std Dev: ${price_stats['std_dev']:.2f}")
    
    # 3. Quantity Statistics
    print("\n📦 Quantity Analysis:")
    qty_stats = calculate_statistics(quantities)
    print(f"   Average Quantity: {qty_stats['mean']:.1f} items")
    print(f"   Median Quantity: {qty_stats['median']:.1f} items")
    print(f"   Most Common Quantity: {qty_stats['mode']} items")
    
    # 4. Product Analysis
    print("\n🛍️ Product Analysis:")
    products = {}
    for record in data:
        product = record['product']
        if product not in products:
            products[product] = {'count': 0, 'revenue': 0, 'total_qty': 0}
        
        products[product]['count'] += 1
        products[product]['revenue'] += record['total']
        products[product]['total_qty'] += record['quantity']
    
    # Sort by revenue
    sorted_products = sorted(products.items(), 
                           key=lambda x: x[1]['revenue'], 
                           reverse=True)
    
    print(f"   Total Unique Products: {len(products)}")
    print(f"   Top 3 Products by Revenue:")
    for i, (product, stats) in enumerate(sorted_products[:3], 1):
        print(f"      {i}. {product}: ${stats['revenue']:.2f} ({stats['count']} orders)")
    
    # 5. Customer Analysis
    print("\n👥 Customer Analysis:")
    customers = {}
    for record in data:
        customer = record['customer_name']
        if customer not in customers:
            customers[customer] = {'orders': 0, 'revenue': 0, 'items': 0}
        
        customers[customer]['orders'] += 1
        customers[customer]['revenue'] += record['total']
        customers[customer]['items'] += record['quantity']
    
    # Calculate customer metrics
    customer_revenues = [stats['revenue'] for stats in customers.values()]
    customer_orders = [stats['orders'] for stats in customers.values()]
    
    print(f"   Total Unique Customers: {len(customers)}")
    print(f"   Average Revenue per Customer: ${sum(customer_revenues)/len(customer_revenues):.2f}")
    print(f"   Average Orders per Customer: {sum(customer_orders)/len(customer_orders):.1f}")
    
    # Find top customer
    top_customer = max(customers.items(), key=lambda x: x[1]['revenue'])
    print(f"   Top Customer: {top_customer[0]} (${top_customer[1]['revenue']:.2f})")
    
    # 6. Category Analysis
    print("\n🏷️ Category Analysis:")
    categories = {}
    for record in data:
        category = record['category']
        if category not in categories:
            categories[category] = {'count': 0, 'revenue': 0}
        
        categories[category]['count'] += 1
        categories[category]['revenue'] += record['total']
    
    for category, stats in categories.items():
        percentage = (stats['revenue'] / total_revenue) * 100
        print(f"   {category}: ${stats['revenue']:.2f} ({percentage:.1f}% of total)")
    
    return {
        'total_revenue': total_revenue,
        'total_orders': len(data),
        'avg_order_value': avg_order_value,
        'products': products,
        'customers': customers,
        'categories': categories
    }

# Perform the analysis
analysis_results = analyze_sales_data(clean_data)
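
Once the hand-written version makes sense, it can be checked against Python's built-in `statistics` module. The sketch below uses invented sample values; `statistics.pstdev` is the population standard deviation, which matches the divide-by-`n` formula used in `calculate_statistics`, and `statistics.multimode` (Python 3.8+) always returns a list of modes.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # Invented sample values

summary = {
    'mean': statistics.mean(sample),
    'median': statistics.median(sample),
    'modes': statistics.multimode(sample),  # Always a list, even for one mode
    'std_dev': statistics.pstdev(sample),   # Population std dev (divides by n)
}
print(summary)
```

Use `statistics.stdev` instead when the numbers are a sample from a larger population, since it divides by `n - 1`.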

## Advanced Data Grouping and Aggregation 📊

In [None]:
# Advanced Grouping and Aggregation
print("=== Advanced Data Grouping ===")

def group_by_field(data, field_name, aggregation_fields):
    """
    Group data by a field and calculate aggregations
    
    Parameters:
    - data: list of dictionaries
    - field_name: field to group by
    - aggregation_fields: dict {field: [operations]} e.g., {'total': ['sum', 'avg', 'count']}
    
    Returns: grouped results
    """
    groups = {}
    
    # Group the data
    for record in data:
        group_key = record[field_name]
        if group_key not in groups:
            groups[group_key] = []
        groups[group_key].append(record)
    
    # Calculate aggregations for each group
    results = {}
    for group_key, group_records in groups.items():
        results[group_key] = {}
        
        for field, operations in aggregation_fields.items():
            values = [record[field] for record in group_records if field in record]
            
            if not values:
                continue
                
            for operation in operations:
                if operation == 'sum':
                    results[group_key][f'{field}_sum'] = sum(values)
                elif operation == 'avg':
                    results[group_key][f'{field}_avg'] = sum(values) / len(values)
                elif operation == 'count':
                    results[group_key][f'{field}_count'] = len(values)
                elif operation == 'min':
                    results[group_key][f'{field}_min'] = min(values)
                elif operation == 'max':
                    results[group_key][f'{field}_max'] = max(values)
    
    return results

# Example 1: Sales by Product
print("\n🛍️ Sales Analysis by Product:")
product_analysis = group_by_field(
    clean_data, 
    'product', 
    {
        'total': ['sum', 'avg', 'count'],
        'quantity': ['sum', 'avg'],
        'price': ['avg', 'min', 'max']
    }
)

for product, metrics in sorted(product_analysis.items(), 
                             key=lambda x: x[1]['total_sum'], 
                             reverse=True)[:5]:
    print(f"\n{product}:")
    print(f"  Total Revenue: ${metrics['total_sum']:.2f}")
    print(f"  Orders: {metrics['total_count']}")
    print(f"  Avg Order Value: ${metrics['total_avg']:.2f}")
    print(f"  Total Quantity: {metrics['quantity_sum']}")
    print(f"  Price Range: ${metrics['price_min']:.2f} - ${metrics['price_max']:.2f}")

# Example 2: Sales by Customer
print("\n\n👥 Sales Analysis by Customer:")
customer_analysis = group_by_field(
    clean_data,
    'customer_name',
    {
        'total': ['sum', 'count', 'avg'],
        'quantity': ['sum']
    }
)

# Find VIP customers (high value)
vip_customers = [(customer, metrics) for customer, metrics in customer_analysis.items() 
                if metrics['total_sum'] > 500 or metrics['total_count'] > 2]

print(f"VIP Customers ({len(vip_customers)} found):")
for customer, metrics in sorted(vip_customers, key=lambda x: x[1]['total_sum'], reverse=True):
    print(f"  {customer}: ${metrics['total_sum']:.2f} ({metrics['total_count']} orders)")

# Example 3: Time-based Analysis (by date)
print("\n\n📅 Daily Sales Trends:")
daily_analysis = group_by_field(
    clean_data,
    'date',
    {
        'total': ['sum', 'count'],
        'quantity': ['sum']
    }
)

print("Daily Sales Summary:")
for date in sorted(daily_analysis.keys()):
    metrics = daily_analysis[date]
    print(f"  {date}: ${metrics['total_sum']:.2f} ({metrics['total_count']} orders, {metrics['quantity_sum']} items)")
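
The "check if the key exists, then append" step in `group_by_field` can also be written with `collections.defaultdict`. The sketch below shows the same group-then-aggregate idea on a few invented records shaped like the cleaned sales dictionaries; it is an alternative style, not a replacement for the function above.

```python
from collections import defaultdict

# Invented sample records, shaped like the cleaned sales dictionaries.
orders = [
    {'product': 'Laptop', 'total': 999.99},
    {'product': 'Mouse', 'total': 59.98},
    {'product': 'Laptop', 'total': 1999.98},
]

groups = defaultdict(list)  # Missing keys start as an empty list
for order in orders:
    groups[order['product']].append(order['total'])

summary = {
    product: {'sum': round(sum(totals), 2), 'count': len(totals)}
    for product, totals in groups.items()
}
print(summary)
```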

## Data Transformation and Enrichment 🔄

In [None]:
# Data Transformation and Enrichment
print("=== Data Transformation & Enrichment ===")

def enrich_sales_data(data):
    """
    Add calculated fields and business intelligence to sales data
    """
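    enriched_data = []
    
    # Nothing to enrich; returning early also avoids dividing by zero below
    if not data:
        return enriched_data
    
    # Calculate overall statistics for comparisons
    all_totals = [record['total'] for record in data]
    avg_order_value = sum(all_totals) / len(all_totals)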
    
    # Customer spending analysis
    customer_totals = {}
    for record in data:
        customer = record['customer_name']
        customer_totals[customer] = customer_totals.get(customer, 0) + record['total']
    
    # Product popularity analysis
    product_counts = {}
    for record in data:
        product = record['product']
        product_counts[product] = product_counts.get(product, 0) + 1
    
    for record in data:
        enriched_record = record.copy()  # Copy each record so the original stays unchanged
        
        # 1. Order Value Category
        if record['total'] >= avg_order_value * 1.5:
            enriched_record['order_value_category'] = 'High'
        elif record['total'] >= avg_order_value * 0.7:
            enriched_record['order_value_category'] = 'Medium'
        else:
            enriched_record['order_value_category'] = 'Low'
        
        # 2. Customer Category
        customer_total = customer_totals[record['customer_name']]
        if customer_total >= 1000:
            enriched_record['customer_category'] = 'VIP'
        elif customer_total >= 500:
            enriched_record['customer_category'] = 'Premium'
        else:
            enriched_record['customer_category'] = 'Standard'
        
        # 3. Product Popularity
        product_popularity = product_counts[record['product']]
        if product_popularity >= 3:
            enriched_record['product_popularity'] = 'Popular'
        elif product_popularity >= 2:
            enriched_record['product_popularity'] = 'Moderate'
        else:
            enriched_record['product_popularity'] = 'Niche'
        
        # 4. Profit Margin (simulate - assume 30% margin)
        enriched_record['estimated_profit'] = round(record['total'] * 0.30, 2)
        
        # 5. Day of Week (from date); datetime is imported in the first cell
        try:
            date_obj = datetime.strptime(record['date'], '%Y-%m-%d')
            enriched_record['day_of_week'] = date_obj.strftime('%A')
            enriched_record['month'] = date_obj.strftime('%B')
        except ValueError:  # The date string is not a valid YYYY-MM-DD date
            enriched_record['day_of_week'] = 'Unknown'
            enriched_record['month'] = 'Unknown'
        
        # 6. Price per Unit
        enriched_record['price_per_unit'] = round(record['total'] / record['quantity'], 2)
        
        enriched_data.append(enriched_record)
    
    return enriched_data

# Enrich the data
enriched_sales = enrich_sales_data(clean_data)

print(f"✅ Data enriched! Added {len(enriched_sales[0]) - len(clean_data[0])} new fields")

# Show sample enriched record
print("\n📊 Sample Enriched Record:")
sample_record = enriched_sales[0]
for key, value in sample_record.items():
    print(f"   {key}: {value}")

# Analyze enriched data
print("\n\n📈 Enriched Data Analysis:")

# Customer category distribution
customer_categories = {}
for record in enriched_sales:
    cat = record['customer_category']
    customer_categories[cat] = customer_categories.get(cat, 0) + 1

print("\n👥 Customer Category Distribution:")
for category, count in customer_categories.items():
    percentage = (count / len(enriched_sales)) * 100
    print(f"   {category}: {count} orders ({percentage:.1f}%)")

# Order value category analysis
order_value_categories = {}
total_revenue_by_category = {}
for record in enriched_sales:
    cat = record['order_value_category']
    order_value_categories[cat] = order_value_categories.get(cat, 0) + 1
    total_revenue_by_category[cat] = total_revenue_by_category.get(cat, 0) + record['total']

print("\n💰 Order Value Analysis:")
for category in ['High', 'Medium', 'Low']:
    if category in order_value_categories:
        count = order_value_categories[category]
        revenue = total_revenue_by_category[category]
        avg_order = revenue / count
        print(f"   {category} Value Orders: {count} (avg: ${avg_order:.2f})")

# Day of week analysis
day_analysis = {}
for record in enriched_sales:
    day = record['day_of_week']
    if day not in day_analysis:
        day_analysis[day] = {'count': 0, 'revenue': 0}
    day_analysis[day]['count'] += 1
    day_analysis[day]['revenue'] += record['total']

print("\n📅 Sales by Day of Week:")
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
for day in days_order:
    if day in day_analysis:
        stats = day_analysis[day]
        avg_per_day = stats['revenue'] / stats['count']
        print(f"   {day}: {stats['count']} orders, ${stats['revenue']:.2f} (avg: ${avg_per_day:.2f})")

# Product popularity insights
popularity_analysis = {}
for record in enriched_sales:
    pop = record['product_popularity']
    popularity_analysis[pop] = popularity_analysis.get(pop, 0) + 1

print("\n🏆 Product Popularity Distribution:")
for popularity, count in popularity_analysis.items():
    percentage = (count / len(enriched_sales)) * 100
    print(f"   {popularity}: {count} orders ({percentage:.1f}%)")
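
The distribution counts above all follow the same `dict.get(key, 0) + 1` pattern. As a small sketch of an alternative, `collections.Counter` does the counting in one step and can return the labels ordered by frequency. The labels below are invented sample values.

```python
from collections import Counter

# Invented category labels, one per order.
labels = ['Standard', 'Premium', 'Standard', 'VIP', 'Standard']

counts = Counter(labels)
for label, count in counts.most_common():  # Most frequent label first
    share = count / len(labels) * 100
    print(f"{label}: {count} orders ({share:.1f}%)")
```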

## Creating Data Reports and Exports 📄

In [None]:
# Data Reporting and Export Functions
print("=== Data Reporting & Export ===")

def generate_executive_summary(data):
    """
    Generate a high-level executive summary
    """
    if not data:
        return "No data available for analysis."
    
    # Calculate key metrics
    total_revenue = sum(record['total'] for record in data)
    total_orders = len(data)
    unique_customers = len(set(record['customer_name'] for record in data))
    unique_products = len(set(record['product'] for record in data))
    avg_order_value = total_revenue / total_orders
    
    # Find date range
    dates = [record['date'] for record in data]
    start_date = min(dates)
    end_date = max(dates)
    
    # Find top performers
    customer_totals = {}
    product_totals = {}
    
    for record in data:
        customer = record['customer_name']
        product = record['product']
        
        customer_totals[customer] = customer_totals.get(customer, 0) + record['total']
        product_totals[product] = product_totals.get(product, 0) + record['total']
    
    top_customer = max(customer_totals.items(), key=lambda x: x[1])
    top_product = max(product_totals.items(), key=lambda x: x[1])
    
    # Generate summary report
    summary = f"""
📊 EXECUTIVE SALES SUMMARY REPORT
{'='*50}

📅 REPORTING PERIOD: {start_date} to {end_date}

💰 KEY FINANCIAL METRICS:
   • Total Revenue: ${total_revenue:,.2f}
   • Number of Orders: {total_orders:,}
   • Average Order Value: ${avg_order_value:.2f}
   • Revenue per Day: ${total_revenue/len(set(dates)):.2f}

👥 CUSTOMER INSIGHTS:
   • Unique Customers: {unique_customers:,}
   • Revenue per Customer: ${total_revenue/unique_customers:.2f}
   • Top Customer: {top_customer[0]} (${top_customer[1]:.2f})
   • Average Orders per Customer: {(total_orders/unique_customers):.1f}

🛍️ PRODUCT PERFORMANCE:
   • Products Sold: {unique_products:,}
   • Revenue per Product: ${total_revenue/unique_products:.2f}
   • Top Product: {top_product[0]} (${top_product[1]:.2f})

📈 BUSINESS HEALTH INDICATORS:
   • Order Frequency: {total_orders/len(set(dates)):.1f} orders/day
   • Market Diversification: {unique_products} products across {len(set(record['category'] for record in data))} categories
   • Customer Concentration: Top customer = {(top_customer[1]/total_revenue)*100:.1f}% of revenue
"""
    
    return summary

def export_to_csv(data, filename='sales_analysis_export.csv'):
    """
    Export data to CSV with all fields
    """
    if not data:
        print("❌ No data to export")
        return False
    
    try:
        with open(filename, 'w', newline='') as file:
            # Get all possible fields
            all_fields = set()
            for record in data:
                all_fields.update(record.keys())
            
            fieldnames = sorted(all_fields)
            
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        
        print(f"✅ Data exported to '{filename}' ({len(data)} records)")
        return True
        
    except Exception as e:
        print(f"❌ Export failed: {e}")
        return False

def create_dashboard_data(data):
    """
    Create summary data suitable for dashboards
    """
    dashboard_data = {
        'summary_metrics': {},
        'trends': {},
        'top_performers': {},
        'categories': {}
    }
    
    # Summary metrics
    dashboard_data['summary_metrics'] = {
        'total_revenue': sum(record['total'] for record in data),
        'total_orders': len(data),
        'unique_customers': len(set(record['customer_name'] for record in data)),
        'unique_products': len(set(record['product'] for record in data)),
        'avg_order_value': sum(record['total'] for record in data) / len(data)
    }
    
    # Daily trends
    daily_data = {}
    for record in data:
        date = record['date']
        if date not in daily_data:
            daily_data[date] = {'revenue': 0, 'orders': 0}
        daily_data[date]['revenue'] += record['total']
        daily_data[date]['orders'] += 1
    
    dashboard_data['trends']['daily_sales'] = daily_data
    
    # Top performers
    customer_revenue = {}
    product_revenue = {}
    
    for record in data:
        customer = record['customer_name']
        product = record['product']
        
        customer_revenue[customer] = customer_revenue.get(customer, 0) + record['total']
        product_revenue[product] = product_revenue.get(product, 0) + record['total']
    
    # Get top 5 in each category
    dashboard_data['top_performers'] = {
        'customers': sorted(customer_revenue.items(), key=lambda x: x[1], reverse=True)[:5],
        'products': sorted(product_revenue.items(), key=lambda x: x[1], reverse=True)[:5]
    }
    
    # Category analysis
    category_data = {}
    for record in data:
        category = record['category']
        if category not in category_data:
            category_data[category] = {'revenue': 0, 'orders': 0}
        category_data[category]['revenue'] += record['total']
        category_data[category]['orders'] += 1
    
    dashboard_data['categories'] = category_data
    
    return dashboard_data

# Generate and display executive summary
executive_summary = generate_executive_summary(enriched_sales)
print(executive_summary)

# Export enriched data
export_success = export_to_csv(enriched_sales, 'enriched_sales_data.csv')

# Create dashboard data
dashboard_metrics = create_dashboard_data(enriched_sales)

# Save dashboard data as JSON
try:
    with open('sales_dashboard_data.json', 'w') as file:
        json.dump(dashboard_metrics, file, indent=2)
    print("✅ Dashboard data saved to 'sales_dashboard_data.json'")
except Exception as e:
    print(f"❌ Failed to save dashboard data: {e}")

# Display dashboard preview
print("\n📊 Dashboard Data Preview:")
print(f"📈 Summary Metrics:")
for metric, value in dashboard_metrics['summary_metrics'].items():
    if 'revenue' in metric or 'value' in metric:
        print(f"   {metric.replace('_', ' ').title()}: ${value:,.2f}")
    else:
        print(f"   {metric.replace('_', ' ').title()}: {value:,}")

print(f"\nüèÜ Top 3 Customers:")
for i, (customer, revenue) in enumerate(dashboard_metrics['top_performers']['customers'][:3], 1):
    print(f"   {i}. {customer}: ${revenue:.2f}")

print(f"\nüõçÔ∏è Top 3 Products:")
for i, (product, revenue) in enumerate(dashboard_metrics['top_performers']['products'][:3], 1):
    print(f"   {i}. {product}: ${revenue:.2f}")
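
One detail worth knowing about the JSON export above: `sorted(some_dict.items(), ...)` produces a list of `(name, revenue)` tuples, and JSON has no tuple type, so `json.dump` writes each pair as a two-element array that loads back as a list. The sketch below shows the round trip with made-up revenue figures (the names and numbers are illustrative assumptions, not values from the lesson's dataset).

```python
import json

# Hypothetical revenue totals, for illustration only
customer_revenue = {"Ana": 120.0, "Ben": 340.5, "Cy": 75.25}

# Same pattern as the dashboard code: sort by revenue, keep the top 2
top_customers = sorted(customer_revenue.items(), key=lambda x: x[1], reverse=True)[:2]

# Serialize to a JSON string and load it back
restored = json.loads(json.dumps({"customers": top_customers}))

print(top_customers)          # a list of tuples
print(restored["customers"])  # a list of lists after the round trip

# Unpacking in a loop still works, because lists unpack like tuples
for rank, (customer, revenue) in enumerate(restored["customers"], 1):
    print(f"{rank}. {customer}: ${revenue:.2f}")
```

Code that reads the saved file back should therefore compare against lists, or convert each pair with `tuple(...)` if tuples are needed.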

## 🏗️ Live Coding: E-commerce Analytics Pipeline

Let's build a complete data analysis pipeline from scratch!

In [None]:
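# Complete E-commerce Analytics Pipeline
import csv   # used by export_results for the CSV exports
import json  # used by export_results for the JSON export

print("🛒 E-COMMERCE ANALYTICS PIPELINE")
print("=" * 50)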

class EcommerceAnalytics:
    """
    Complete e-commerce data analysis pipeline
    """
    
    def __init__(self):
        self.raw_data = []
        self.clean_data = []
        self.enriched_data = []
        self.analysis_results = {}
        self.errors_log = []
    
    def load_data(self, filename):
        """
        Load raw data from file
        """
        print(f"📂 Loading data from '{filename}'...")
        
        try:
            with open(filename, 'r') as file:
                for line_num, line in enumerate(file, 1):
                    if line.strip():  # Skip empty lines
                        self.raw_data.append({
                            'line_number': line_num,
                            'raw_content': line.strip()
                        })
            
            print(f"✅ Loaded {len(self.raw_data)} raw records")
            return True
            
        except FileNotFoundError:
            print(f"❌ File '{filename}' not found!")
            return False
        except Exception as e:
            print(f"❌ Error loading data: {e}")
            return False
    
    def clean_and_validate(self):
        """
        Clean and validate the raw data
        """
        print("\n🧹 Cleaning and validating data...")
        
        cleaned_count = 0
        error_count = 0
        
        for raw_record in self.raw_data:
            line_num = raw_record['line_number']
            content = raw_record['raw_content']
            
            # Parse the line
            fields = [field.strip() for field in content.split(',')]
            
            # Validate field count
            if len(fields) < 6:
                self.errors_log.append(f"Line {line_num}: Too few fields ({len(fields)})")
                error_count += 1
                continue
            
            # Extract and clean fields
            try:
                date_str = fields[0]
                customer = fields[1].title().strip()
                product = fields[2].title().strip()
                quantity = int(fields[3])
                price = float(fields[4])
                category = fields[5].title().strip()
                
                # Validate business rules
                if not customer or '@' in customer:
                    raise ValueError("Invalid customer name")
                if not product:
                    raise ValueError("Missing product name")
                if quantity <= 0:
                    raise ValueError(f"Invalid quantity: {quantity}")
                if price <= 0:
                    raise ValueError(f"Invalid price: {price}")
                
                # Standardize date format
                if '/' in date_str:
                    month, day, year = date_str.split('/')
                    clean_date = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
                else:
                    clean_date = date_str
                
                # Create clean record
                clean_record = {
                    'date': clean_date,
                    'customer_name': customer,
                    'product': product,
                    'quantity': quantity,
                    'price': price,
                    'category': category,
                    'total': round(quantity * price, 2)
                }
                
                self.clean_data.append(clean_record)
                cleaned_count += 1
                
            except (ValueError, IndexError) as e:
                self.errors_log.append(f"Line {line_num}: {str(e)}")
                error_count += 1
        
        print(f"✅ Cleaned {cleaned_count} records")
        print(f"⚠️ Found {error_count} errors")
        
        return cleaned_count > 0
    
    def enrich_data(self):
        """
        Add calculated fields and business intelligence
        """
        print("\n🔬 Enriching data with business intelligence...")
        
        if not self.clean_data:
            print("❌ No clean data to enrich!")
            return False
        
        # Calculate benchmarks
        all_totals = [record['total'] for record in self.clean_data]
        avg_order_value = sum(all_totals) / len(all_totals)
        
        # Customer analysis
        customer_stats = {}
        for record in self.clean_data:
            customer = record['customer_name']
            if customer not in customer_stats:
                customer_stats[customer] = {'orders': 0, 'revenue': 0}
            customer_stats[customer]['orders'] += 1
            customer_stats[customer]['revenue'] += record['total']
        
        # Enrich each record
        for record in self.clean_data:
            enriched_record = record.copy()
            
            # Order value tier
            if record['total'] >= avg_order_value * 1.5:
                enriched_record['order_tier'] = 'High'
            elif record['total'] >= avg_order_value * 0.8:
                enriched_record['order_tier'] = 'Medium'
            else:
                enriched_record['order_tier'] = 'Low'
            
            # Customer tier
            customer_revenue = customer_stats[record['customer_name']]['revenue']
            if customer_revenue >= 1000:
                enriched_record['customer_tier'] = 'VIP'
            elif customer_revenue >= 500:
                enriched_record['customer_tier'] = 'Premium'
            else:
                enriched_record['customer_tier'] = 'Standard'
            
            # Profit estimation (assume 25% margin)
            enriched_record['estimated_profit'] = round(record['total'] * 0.25, 2)
            
            # Day of week
            try:
                from datetime import datetime
                date_obj = datetime.strptime(record['date'], '%Y-%m-%d')
                enriched_record['day_of_week'] = date_obj.strftime('%A')
            except ValueError:
                # The date is not in YYYY-MM-DD form, so the weekday is unknown
                enriched_record['day_of_week'] = 'Unknown'
            
            self.enriched_data.append(enriched_record)
        
        print(f"✅ Enriched {len(self.enriched_data)} records with business intelligence")
        return True
    
    def perform_analysis(self):
        """
        Perform comprehensive business analysis
        """
        print("\n📊 Performing comprehensive analysis...")
        
        if not self.enriched_data:
            print("❌ No enriched data to analyze!")
            return False
        
        data = self.enriched_data
        
        # 1. Overall metrics
        total_revenue = sum(record['total'] for record in data)
        total_profit = sum(record['estimated_profit'] for record in data)
        total_orders = len(data)
        unique_customers = len(set(record['customer_name'] for record in data))
        
        # 2. Customer analysis
        customer_tiers = {}
        for record in data:
            tier = record['customer_tier']
            if tier not in customer_tiers:
                customer_tiers[tier] = {'count': 0, 'revenue': 0}
            customer_tiers[tier]['count'] += 1
            customer_tiers[tier]['revenue'] += record['total']
        
        # 3. Product performance
        products = {}
        for record in data:
            product = record['product']
            if product not in products:
                products[product] = {'orders': 0, 'revenue': 0, 'profit': 0}
            products[product]['orders'] += 1
            products[product]['revenue'] += record['total']
            products[product]['profit'] += record['estimated_profit']
        
        # 4. Time analysis
        daily_sales = {}
        for record in data:
            date = record['date']
            if date not in daily_sales:
                daily_sales[date] = {'orders': 0, 'revenue': 0}
            daily_sales[date]['orders'] += 1
            daily_sales[date]['revenue'] += record['total']
        
        # Store analysis results
        self.analysis_results = {
            'summary': {
                'total_revenue': total_revenue,
                'total_profit': total_profit,
                'total_orders': total_orders,
                'unique_customers': unique_customers,
                'avg_order_value': total_revenue / total_orders,
                'profit_margin': (total_profit / total_revenue) * 100
            },
            'customer_tiers': customer_tiers,
            'product_performance': products,
            'daily_trends': daily_sales
        }
        
        print("✅ Analysis complete!")
        return True
    
    def generate_report(self):
        """
        Generate comprehensive business report
        """
        print("\n📄 Generating business report...")
        
        if not self.analysis_results:
            print("❌ No analysis results to report!")
            return None
        
        results = self.analysis_results
        
        report = f"""
🏪 E-COMMERCE BUSINESS INTELLIGENCE REPORT
{'='*60}

💰 FINANCIAL PERFORMANCE:
   Revenue: ${results['summary']['total_revenue']:,.2f}
   Profit: ${results['summary']['total_profit']:,.2f}
   Profit Margin: {results['summary']['profit_margin']:.1f}%
   
📊 OPERATIONAL METRICS:
   Total Orders: {results['summary']['total_orders']:,}
   Unique Customers: {results['summary']['unique_customers']:,}
   Average Order Value: ${results['summary']['avg_order_value']:,.2f}
   Orders per Customer: {results['summary']['total_orders']/results['summary']['unique_customers']:.1f}

👥 CUSTOMER SEGMENTATION:
"""
        
        for tier, stats in results['customer_tiers'].items():
            percentage = (stats['revenue'] / results['summary']['total_revenue']) * 100
            # 'count' is incremented once per record, so it counts orders, not customers
            report += f"   {tier}: {stats['count']} orders, ${stats['revenue']:,.2f} ({percentage:.1f}%)\n"
        
        # Top products
        top_products = sorted(results['product_performance'].items(), 
                            key=lambda x: x[1]['revenue'], reverse=True)[:3]
        
        report += "\n🏆 TOP PERFORMING PRODUCTS:\n"
        for i, (product, stats) in enumerate(top_products, 1):
            report += f"   {i}. {product}: ${stats['revenue']:,.2f} ({stats['orders']} orders)\n"
        
        # Daily trends
        avg_daily_revenue = sum(day['revenue'] for day in results['daily_trends'].values()) / len(results['daily_trends'])
        report += "\n📈 SALES TRENDS:\n"
        report += f"   Average Daily Revenue: ${avg_daily_revenue:,.2f}\n"
        report += f"   Active Sales Days: {len(results['daily_trends'])}\n"
        
        return report
    
    def export_results(self):
        """
        Export all results to files
        """
        print("\n💾 Exporting results...")
        
        exports_completed = 0
        
        # Export clean data
        if self.clean_data:
            try:
                with open('pipeline_clean_data.csv', 'w', newline='') as file:
                    fieldnames = list(self.clean_data[0].keys())
                    writer = csv.DictWriter(file, fieldnames=fieldnames)
                    writer.writeheader()
                    writer.writerows(self.clean_data)
                exports_completed += 1
                print("✅ Clean data exported to 'pipeline_clean_data.csv'")
            except Exception as e:
                print(f"❌ Failed to export clean data: {e}")
        
        # Export enriched data
        if self.enriched_data:
            try:
                with open('pipeline_enriched_data.csv', 'w', newline='') as file:
                    fieldnames = list(self.enriched_data[0].keys())
                    writer = csv.DictWriter(file, fieldnames=fieldnames)
                    writer.writeheader()
                    writer.writerows(self.enriched_data)
                exports_completed += 1
                print("✅ Enriched data exported to 'pipeline_enriched_data.csv'")
            except Exception as e:
                print(f"❌ Failed to export enriched data: {e}")
        
        # Export analysis results
        if self.analysis_results:
            try:
                with open('pipeline_analysis_results.json', 'w') as file:
                    json.dump(self.analysis_results, file, indent=2)
                exports_completed += 1
                print("✅ Analysis results exported to 'pipeline_analysis_results.json'")
            except Exception as e:
                print(f"❌ Failed to export analysis results: {e}")
        
        # Export errors log
        if self.errors_log:
            try:
                with open('pipeline_errors.log', 'w') as file:
                    for error in self.errors_log:
                        file.write(error + '\n')
                exports_completed += 1
                print("✅ Errors log exported to 'pipeline_errors.log'")
            except Exception as e:
                print(f"❌ Failed to export errors log: {e}")
        
        return exports_completed

# Run the complete pipeline
analytics = EcommerceAnalytics()

# Execute pipeline steps. Each step returns False on failure, so `and`
# short-circuits and the later steps are skipped once one fails.
pipeline_ok = (
    analytics.load_data('messy_sales.txt')
    and analytics.clean_and_validate()
    and analytics.enrich_data()
    and analytics.perform_analysis()
)

if pipeline_ok:
    # Generate and display report
    business_report = analytics.generate_report()
    if business_report:
        print(business_report)

    # Export all results
    exports = analytics.export_results()
    print(f"\n📁 Pipeline complete! {exports} files exported.")
    print("\n🎉 E-commerce Analytics Pipeline finished successfully!")
else:
    print("\n❌ E-commerce Analytics Pipeline stopped early. See the messages above.")
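
A note on the parsing step in `clean_and_validate`: splitting on `','` works for the simple lines used in this lesson, but it breaks as soon as a field contains a comma inside quotes. The sketch below compares the naive split with the standard library's `csv.reader` on a hypothetical input line (the quoted product name is an assumption made up for this example, not a row from `messy_sales.txt`).

```python
import csv

# Hypothetical line: the product name contains a comma, so it is quoted
line = '2024-01-15,John Smith,"Cable, HDMI",2,15.99,Accessories'

# Naive split treats the comma inside the quotes as a field separator
naive_fields = [field.strip() for field in line.split(',')]

# csv.reader accepts any iterable of lines and respects the quotes
parsed_fields = next(csv.reader([line]))

print(len(naive_fields))   # 7 fields: the product name was cut in two
print(len(parsed_fields))  # 6 fields, as expected
print(parsed_fields[2])    # Cable, HDMI
```

If the real input can contain quoted fields, swapping the `split(',')` call for `csv.reader` is one possible fix. The rest of the validation logic would stay the same.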

## Preview: What's Next with Pandas 🐼

You've mastered data analysis with pure Python! Here's a preview of how pandas will supercharge your workflow:

In [None]:
# Preview: The same analysis with pandas (for comparison)
print("🐼 PANDAS PREVIEW: What You'll Learn Next")
print("=" * 50)

print("\n📊 What you did with pure Python today:")
print("""
# Pure Python approach (what you mastered today):
def analyze_sales_data(data):
    total_revenue = 0
    customer_groups = {}
    
    for record in data:
        total_revenue += record['total']
        customer = record['customer_name']
        if customer not in customer_groups:
            customer_groups[customer] = []
        customer_groups[customer].append(record['total'])
    
    # Calculate averages manually
    customer_averages = {}
    for customer, orders in customer_groups.items():
        customer_averages[customer] = sum(orders) / len(orders)
    
    return total_revenue, customer_averages
""")

print("\n🚀 The same analysis with pandas (coming soon):")
print("""
# Pandas approach (what you'll learn next):
import pandas as pd

# Load data in one line; parse_dates turns 'date' into a datetime column,
# which the .dt accessor used below requires
df = pd.read_csv('sales_data.csv', parse_dates=['date'])

# Analysis in one line each
total_revenue = df['total'].sum()
customer_averages = df.groupby('customer_name')['total'].mean()
monthly_trends = df.groupby(df['date'].dt.month)['total'].sum()
top_products = df.groupby('product')['total'].sum().nlargest(5)

# Quick visualizations (plotting requires matplotlib to be installed)
df['total'].hist()  # Histogram
df.groupby('category')['total'].sum().plot(kind='bar')  # Bar chart
""")

print("\n💪 Why Your Python Fundamentals Matter:")
print("✅ You understand what pandas does 'under the hood'")
print("✅ You can debug pandas operations when they go wrong")
print("✅ You can build custom functions when pandas isn't enough")
print("✅ You appreciate pandas' power because you know the manual way")
print("✅ You can combine pandas with pure Python for complex tasks")

print("\n🎯 Key Pandas Advantages:")
print("• Built-in tools for missing data (isna, fillna, dropna)")
print("• Built-in statistical functions")
print("• Easy data visualization")
print("• Fast, vectorized operations on in-memory datasets")
print("• Integrates with machine learning libraries")

print("\n🔥 What You'll Build Next:")
print("• Interactive dashboards")
print("• Time series analysis")
print("• Data visualization")
print("• Machine learning pipelines")
print("• Real-time data processing")

print("\n🎉 You're ready for pandas because you mastered the fundamentals!")
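
To make the "under the hood" point concrete, here is a minimal pure-Python sketch of what a one-column group-by mean has to do: bucket the values by key, then average each bucket. The `group_mean` helper and the sample records are hypothetical and written only for this illustration. This is not how pandas is implemented internally, only the same idea at small scale.

```python
from collections import defaultdict

# Hypothetical records shaped like the lesson's clean data
records = [
    {"customer_name": "Ana", "total": 100.0},
    {"customer_name": "Ben", "total": 40.0},
    {"customer_name": "Ana", "total": 50.0},
]

def group_mean(rows, key, value):
    """Average `value` for each distinct `key`, like a one-column groupby mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: sum(values) / len(values) for name, values in groups.items()}

customer_averages = group_mean(records, "customer_name", "total")
print(customer_averages)  # {'Ana': 75.0, 'Ben': 40.0}
```

`defaultdict(list)` removes the `if customer not in customer_groups` check used in the longer version above, because a missing key starts out as an empty list.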

## 🎯 In-Class Exercise: Customer Behavior Analysis (25 minutes)

Apply everything you've learned to analyze customer purchase patterns!

In [None]:
# Customer Behavior Analysis Challenge
print("🎯 CUSTOMER BEHAVIOR ANALYSIS CHALLENGE")
print("=" * 50)

# Create more complex customer transaction data
customer_transactions = [
    "2024-01-15,John Smith,Laptop,1,999.99,Electronics,Online,Credit Card",
    "2024-01-16,John Smith,Mouse,1,29.99,Electronics,Online,Credit Card",
    "2024-01-20,John Smith,Keyboard,1,89.99,Electronics,Store,Cash",
    "2024-01-17,Sarah Johnson,Smartphone,1,699.99,Electronics,Online,Credit Card",
    "2024-01-22,Sarah Johnson,Phone Case,2,19.99,Accessories,Online,Credit Card",
    "2024-01-18,Mike Brown,Tablet,1,399.99,Electronics,Store,Debit Card",
    "2024-01-25,Mike Brown,Stylus,1,49.99,Accessories,Store,Cash",
    "2024-01-19,Lisa Davis,Monitor,2,299.99,Electronics,Online,Credit Card",
    "2024-01-26,Lisa Davis,HDMI Cable,3,15.99,Accessories,Online,Credit Card",
    "2024-01-27,Lisa Davis,Webcam,1,79.99,Electronics,Store,Debit Card",
]

# Save the transaction data
with open('customer_transactions.txt', 'w') as file:
    for transaction in customer_transactions:
        file.write(transaction + '\n')

print("✅ Customer transaction data created")
print("\n🎯 YOUR MISSION: Build a Customer Behavior Analysis System")
print("\nRequired Analysis:")
print("1. Customer Lifetime Value (CLV)")
print("2. Purchase frequency patterns")
print("3. Channel preferences (Online vs Store)")
print("4. Payment method analysis")
print("5. Cross-selling opportunities")
print("6. Customer segmentation")

print("\n📋 TODO: Complete these functions:")

def load_customer_data(filename='customer_transactions.txt'):
    """
    TODO: Load and parse customer transaction data
    
    Expected fields: date, customer, product, quantity, price, category, channel, payment
    
    Returns: list of dictionaries with clean data
    """
    # Your implementation here
    transactions = []
    
    # Add your data loading and cleaning logic
    # Remember to handle errors and validate data
    
    return transactions

def calculate_customer_lifetime_value(transactions):
    """
    TODO: Calculate CLV for each customer
    
    CLV = Total Revenue per Customer
    Also calculate:
    - Average order value per customer
    - Purchase frequency (orders per customer)
    - Days since last purchase
    
    Returns: dict {customer: clv_metrics}
    """
    # Your implementation here
    clv_data = {}
    
    return clv_data

def analyze_purchase_patterns(transactions):
    """
    TODO: Analyze customer purchase patterns
    
    Find:
    - Most popular product combinations (what's bought together)
    - Seasonal trends (if any)
    - Channel preferences by customer
    - Payment method preferences
    
    Returns: dict with pattern analysis
    """
    # Your implementation here
    patterns = {
        'product_combinations': {},
        'channel_preferences': {},
        'payment_preferences': {}
    }
    
    return patterns

def segment_customers(clv_data, patterns):
    """
    TODO: Segment customers into meaningful groups
    
    Segments:
    - High Value (top 20% by CLV)
    - Frequent Buyers (above average frequency)
    - Channel Loyal (strong preference for one channel)
    - At Risk (long time since last purchase)
    
    Returns: dict {segment: [customers]}
    """
    # Your implementation here
    segments = {
        'high_value': [],
        'frequent_buyers': [],
        'channel_loyal': [],
        'at_risk': []
    }
    
    return segments

def generate_customer_insights(transactions, clv_data, patterns, segments):
    """
    TODO: Generate actionable business insights
    
    Create recommendations for:
    - Marketing campaigns
    - Product recommendations
    - Customer retention strategies
    - Cross-selling opportunities
    
    Returns: formatted insights report
    """
    # Your implementation here
    insights = """
    CUSTOMER BEHAVIOR INSIGHTS REPORT
    ================================
    
    [Add your analysis here]
    """
    
    return insights

# TODO: Test your implementation
print("\n🧪 Test your functions:")
print("1. transactions = load_customer_data()")
print("2. clv_metrics = calculate_customer_lifetime_value(transactions)")
print("3. patterns = analyze_purchase_patterns(transactions)")
print("4. segments = segment_customers(clv_metrics, patterns)")
print("5. insights = generate_customer_insights(transactions, clv_metrics, patterns, segments)")

print("\n💡 HINTS:")
print("• Use datetime to calculate days between purchases")
print("• Group transactions by customer for CLV calculations")
print("• Use Counter for finding popular combinations")
print("• Set thresholds for customer segments based on your data")
print("• Focus on actionable insights that drive business decisions")

print("\n⏰ You have 25 minutes - good luck! 🚀")

# Uncomment and implement:
# transactions = load_customer_data()
# if transactions:
#     clv_metrics = calculate_customer_lifetime_value(transactions)
#     patterns = analyze_purchase_patterns(transactions)
#     segments = segment_customers(clv_metrics, patterns)
#     insights = generate_customer_insights(transactions, clv_metrics, patterns, segments)
#     print(insights)
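
As a nudge for the first hint, the sketch below shows one way to turn ISO date strings into a "days since last purchase" number with the standard `datetime` module. The customer names, dates, the fixed `as_of` reference date, and the `days_since_last_purchase` helper are all assumptions invented for this example. Adapt the idea to the exercise data rather than copying it.

```python
from datetime import date

# Hypothetical purchase dates per customer, in ISO format (YYYY-MM-DD)
purchases = {
    "Ana": ["2024-01-15", "2024-01-20"],
    "Ben": ["2024-01-18"],
}

# Fixed reference date so the result is reproducible (an assumption;
# a live report might use date.today() instead)
as_of = date(2024, 1, 31)

def days_since_last_purchase(date_strings, reference):
    """Days between the most recent purchase and the reference date."""
    last = max(date.fromisoformat(s) for s in date_strings)
    return (reference - last).days

recency = {name: days_since_last_purchase(dates, as_of)
           for name, dates in purchases.items()}
print(recency)  # {'Ana': 11, 'Ben': 13}
```

Subtracting two `date` objects gives a `timedelta`, and its `.days` attribute is the number you would compare against an "At Risk" threshold.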

## 📚 Session Summary

🎉 **Incredible!** You've transformed from Python beginner to data analyst in just 3 weeks!

### ✅ Data Science Skills Mastered
- **Data Cleaning**: Handling messy, real-world data like a pro
- **Statistical Analysis**: Computing means, medians, distributions with pure Python
- **Data Transformation**: Enriching data with calculated fields and business intelligence
- **Grouping & Aggregation**: Advanced data summarization techniques
- **Pipeline Development**: Building end-to-end data processing workflows
- **Business Intelligence**: Generating actionable insights from data
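
As a quick refresher on the statistical side, here is a minimal pure-Python sketch of the mean and median. The `mean` and `median` helpers and the sample order totals are hypothetical, chosen to show how a single large order pulls the mean up while the median barely moves.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical order totals with one outlier
order_totals = [20.0, 35.0, 50.0, 1000.0]

print(mean(order_totals))    # 276.25
print(median(order_totals))  # 42.5
```

The standard library's `statistics` module provides `mean` and `median` as well. Writing them by hand once makes it clearer what those functions compute.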

### 🔧 Technical Tools Applied
1. **File I/O**: Reading messy CSV data from files
2. **Error Handling**: Graceful handling of data quality issues
3. **Data Structures**: Using lists and dictionaries for complex data
4. **Functions**: Organizing analysis code into reusable components
5. **Loops & Conditionals**: Processing and filtering large datasets
6. **String Processing**: Cleaning and standardizing text data

### 👨‍🍳 Remember: The Data Chef Analogy
- **Raw data** = Fresh ingredients (often messy and inconsistent)
- **Data cleaning** = Prep work (washing, cutting, organizing)
- **Analysis** = Cooking process (combining, transforming, seasoning)
- **Reports** = Final presentation (making insights digestible)
- **Good data scientists** = Master chefs who create value from raw materials

### 💡 Key Insights About Data Science
- **Roughly 80% of the work is data cleaning and preparation**, by a common rule of thumb (not the glamorous part!)
- **Data quality determines analysis quality** (garbage in, garbage out)
- **Business context matters more than technical complexity**
- **Reproducible workflows prevent errors and save time**
- **Error handling is essential for production systems**

### 🏠 Homework Preview
This week's final homework will include:
1. Completing a data analysis project with a real dataset
2. Building an end-to-end data processing pipeline
3. Creating business intelligence dashboard data
4. Writing a professional data analysis report

### 🚀 Next Session Preview
On Saturday we'll have our **Mini-Project Workshop**, where you'll build a complete data science project from scratch!

### 🎯 What This Means for Your Career
- You can now handle real data science tasks
- You understand what pandas does "under the hood"
- You can build custom solutions when standard tools aren't enough
- You have the foundation for advanced data science libraries
- You can communicate with other data scientists and engineers

### 🏆 Professional Skills Gained
- **Problem Decomposition**: Breaking complex analysis into manageable steps
- **Data Quality Assessment**: Identifying and fixing data issues
- **Business Communication**: Translating technical analysis into actionable insights
- **Workflow Design**: Creating reproducible, maintainable analysis pipelines
- **Error Handling**: Building robust systems that handle real-world messiness

**Congratulations! You're now a data scientist who can work with real data!** 📊🎓✨

### 💭 Final Thought
*"Data science isn't about knowing every library or algorithm. It's about understanding data, asking the right questions, and communicating insights that drive decisions. You now have these fundamental skills!"*

**Ready to build something amazing on Saturday!** 🚀