# Week 4: Python Aggregations & Summary Statistics - Part 1
## From Excel SUMIF/COUNTIF to Pandas GroupBy

**Wednesday Python Class - September 3, 2025**  
**Business Context**: Sales Performance Analysis using Olist E-commerce Data  
**Excel Concepts**: SUMIF, COUNTIF, AVERAGEIF functions

---

### Learning Objectives:
1. Master basic pandas aggregation functions (sum, count, mean, min, max)
2. Understand value_counts() for frequency analysis
3. Use describe() for comprehensive statistical summaries
4. Bridge from Excel SUMIF/COUNTIF thinking to pandas GroupBy operations

### Today's Business Challenge:
As a data analyst for a Nigerian e-commerce startup, you need to analyze regional sales performance, similar to what companies like Jumia or Konga would do. We'll use real Brazilian marketplace data (Olist) as our case study.

## Section 1: Setup and Data Loading

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Load the Olist sample dataset
# In a real scenario, you'd download the full dataset from GitHub
data_path = '../datasets/olist_sample_data.csv'
df = pd.read_csv(data_path)

# Convert date column to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape} (rows, columns)")
print(f"\nColumn names: {list(df.columns)}")

In [None]:
# Explore the dataset structure
print("=== DATASET OVERVIEW ===")
print(df.head())
print("\n=== DATA TYPES ===")
print(df.dtypes)
print("\n=== BASIC STATISTICS ===")
print(df.describe())

## Section 2: From Excel Functions to Pandas - Conceptual Bridge

### Excel vs Pandas Comparison:

| Excel Function | Purpose | Pandas Equivalent |
|----------------|---------|-------------------|
| `SUMIF(range, criteria, sum_range)` | Sum with condition | `df[condition]['column'].sum()` |
| `COUNTIF(range, criteria)` | Count with condition | `df[condition].shape[0]` or `value_counts()` |
| `AVERAGEIF(range, criteria, avg_range)` | Average with condition | `df[condition]['column'].mean()` |
| Pivot Table | Summarize by categories | `df.groupby().agg()` |

Let's see these concepts in action!
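
As a quick warm-up before touching the Olist data, here is a minimal sketch of each row in the table above. The `toy` DataFrame and its values are hypothetical and exist only for illustration; it also uses `.loc[mask, 'column']`, which is an equivalent way to write `df[condition]['column']`.

```python
import pandas as pd

# Hypothetical toy data (not the Olist sample)
toy = pd.DataFrame({
    "state": ["SP", "RJ", "SP", "MG", "RJ"],
    "price": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# The Excel "criteria" becomes a boolean mask: True where the row matches
is_sp = toy["state"] == "SP"

sumif_sp = toy.loc[is_sp, "price"].sum()       # like SUMIF(state, "SP", price)
countif_sp = int(is_sp.sum())                  # like COUNTIF(state, "SP")
averageif_sp = toy.loc[is_sp, "price"].mean()  # like AVERAGEIF(state, "SP", price)

# Pivot-table style summary: one row per state
pivot_like = toy.groupby("state")["price"].agg(["sum", "count", "mean"])

print(f"SUMIF: {sumif_sp}, COUNTIF: {countif_sp}, AVERAGEIF: {averageif_sp}")
print(pivot_like)
```

The mask is computed once and reused, which mirrors pointing several Excel formulas at the same criteria range.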

## Section 3: Basic Aggregation Functions

### 3.1 Simple Aggregations - The Excel SUM/COUNT/AVERAGE Equivalent

In [None]:
# Business Question: What's our total revenue and order count?
# Excel equivalent: SUM(price_column) and COUNT(order_column)

total_revenue = df['price'].sum()
total_orders = df.shape[0]  # Number of rows
avg_order_value = df['price'].mean()

print("=== OVERALL BUSINESS METRICS ===")
print(f"Total Revenue: R$ {total_revenue:,.2f}")
print(f"Total Orders: {total_orders:,}")
print(f"Average Order Value: R$ {avg_order_value:.2f}")

# Additional insights
min_order = df['price'].min()
max_order = df['price'].max()
median_order = df['price'].median()

print(f"\nOrder Value Range:")
print(f"Minimum: R$ {min_order:.2f}")
print(f"Maximum: R$ {max_order:.2f}")
print(f"Median: R$ {median_order:.2f}")

### 3.2 Conditional Aggregations - Excel SUMIF/COUNTIF in Action

In [None]:
# Business Question: How much revenue comes from high-value orders (>= R$150)?
# Excel equivalent: SUMIF(price_column, ">=150", price_column)

# Method 1: Using boolean indexing (most common)
high_value_orders = df[df['price'] >= 150]
high_value_revenue = high_value_orders['price'].sum()
high_value_count = high_value_orders.shape[0]

print("=== HIGH-VALUE ORDERS ANALYSIS (>= R$150) ===")
print(f"High-value orders: {high_value_count} ({high_value_count/total_orders*100:.1f}% of total)")
print(f"High-value revenue: R$ {high_value_revenue:,.2f} ({high_value_revenue/total_revenue*100:.1f}% of total)")
print(f"Average high-value order: R$ {high_value_orders['price'].mean():.2f}")

# Method 2: Using query method (alternative approach)
high_value_query = df.query('price >= 150')
print(f"\nVerification using query method: {high_value_query.shape[0]} orders")
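
The count and percentage calculations above can also be read straight off the boolean mask, because `True` counts as 1 and `False` as 0. The sketch below uses a hypothetical `toy` DataFrame, not the Olist sample.

```python
import pandas as pd

# Hypothetical toy data (not the Olist sample)
toy = pd.DataFrame({"price": [50.0, 150.0, 200.0, 90.0]})

mask = toy["price"] >= 150           # True for high-value rows

high_value_count = int(mask.sum())   # number of True values
share_of_orders = mask.mean() * 100  # mean of a boolean mask is the fraction of True values
revenue_share = toy.loc[mask, "price"].sum() / toy["price"].sum() * 100

print(f"High-value orders: {high_value_count} ({share_of_orders:.1f}% of orders)")
print(f"High-value revenue share: {revenue_share:.1f}%")
```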

In [None]:
# Business Question: Compare performance across different price segments
# This is like having multiple SUMIF formulas for different conditions

def analyze_price_segment(df, min_price, max_price, segment_name):
    """Summarize orders whose price falls in [min_price, max_price)."""
    if max_price == float('inf'):
        segment_data = df[df['price'] >= min_price]
        condition_text = f">= R${min_price}"
    else:
        segment_data = df[(df['price'] >= min_price) & (df['price'] < max_price)]
        condition_text = f"R${min_price} - R${max_price - 0.01:.2f}"

    count = segment_data.shape[0]
    revenue = segment_data['price'].sum()
    avg_value = segment_data['price'].mean() if count > 0 else 0
    # Use the DataFrame passed in for the denominator instead of a global variable
    overall_revenue = df['price'].sum()

    return {
        'segment': segment_name,
        'price_range': condition_text,
        'orders': count,
        'revenue': revenue,
        'avg_order_value': avg_value,
        'revenue_share': revenue / overall_revenue * 100 if overall_revenue > 0 else 0
    }

# Define price segments (Nigerian context: Budget, Standard, Premium, Luxury)
segments = [
    analyze_price_segment(df, 0, 100, 'Budget'),
    analyze_price_segment(df, 100, 200, 'Standard'),
    analyze_price_segment(df, 200, 300, 'Premium'),
    analyze_price_segment(df, 300, float('inf'), 'Luxury')
]

# Display results
print("=== PRICE SEGMENT ANALYSIS ===")
for segment in segments:
    print(f"\n{segment['segment']} ({segment['price_range']}):")
    print(f"  Orders: {segment['orders']:,} ({segment['orders']/total_orders*100:.1f}%)")
    print(f"  Revenue: R$ {segment['revenue']:,.2f} ({segment['revenue_share']:.1f}%)")
    print(f"  Avg Order Value: R$ {segment['avg_order_value']:.2f}")
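
The same segmentation can be sketched without a helper function by labelling each row with `pd.cut` and then grouping. This is an alternative approach, not a replacement for the function above; the `toy` DataFrame is hypothetical, and the bin edges simply reuse the Budget/Standard/Premium/Luxury boundaries from this cell.

```python
import pandas as pd

# Hypothetical toy data (not the Olist sample)
toy = pd.DataFrame({"price": [25.0, 99.99, 100.0, 180.0, 250.0, 300.0, 420.0]})

bins = [0, 100, 200, 300, float("inf")]
labels = ["Budget", "Standard", "Premium", "Luxury"]

# right=False makes each bin [lower, upper), matching ">= min and < max"
toy["segment"] = pd.cut(toy["price"], bins=bins, labels=labels, right=False)

# observed=False keeps segments that happen to contain no orders
segment_summary = (
    toy.groupby("segment", observed=False)["price"]
       .agg(orders="count", revenue="sum", avg_order_value="mean")
)
segment_summary["revenue_share"] = segment_summary["revenue"] / toy["price"].sum() * 100

print(segment_summary)
```

One design note: with `pd.cut` every row lands in exactly one segment, so the segment counts always add up to the number of rows with a price inside the bin range.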

## Section 4: Value Counts - The Excel COUNTIF Superpower

In [None]:
# Business Question: How are our orders distributed across states?
# Excel equivalent: Multiple COUNTIF formulas for each state

print("=== STATE-WISE ORDER DISTRIBUTION ===")
state_counts = df['customer_state'].value_counts()
print(state_counts)

# Show as percentages
print("\n=== STATE-WISE PERCENTAGE DISTRIBUTION ===")
state_percentages = df['customer_state'].value_counts(normalize=True) * 100
print(state_percentages.round(1))

# Combine counts and percentages
print("\n=== COMBINED STATE ANALYSIS ===")
state_analysis = pd.DataFrame({
    'orders': state_counts,
    'percentage': state_percentages.round(1)
})
print(state_analysis)

In [None]:
# Business Question: What are our most popular product categories?
# Nigerian context: Similar to analyzing Jumia's category performance

print("=== PRODUCT CATEGORY POPULARITY ===")
category_counts = df['category'].value_counts()
print(category_counts.head(10))  # Top 10 categories

# Visualize category distribution
plt.figure(figsize=(12, 6))
category_counts.head(10).plot(kind='bar')
plt.title('Top 10 Product Categories by Order Count')
plt.xlabel('Product Category')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Business insight: Category performance
total_categories = df['category'].nunique()
top_5_share = category_counts.head(5).sum() / category_counts.sum() * 100
print(f"\nBusiness Insights:")
print(f"Total unique categories: {total_categories}")
print(f"Top 5 categories represent {top_5_share:.1f}% of all orders")
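
The "top 5 share" figure above is one point on a cumulative curve. As a sketch, chaining `value_counts(normalize=True)` with `cumsum()` gives the running share for every category at once. The `toy_categories` Series is hypothetical.

```python
import pandas as pd

# Hypothetical category labels (not the Olist sample): a x4, b x3, c x2, d x1
toy_categories = pd.Series(["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"])

# value_counts sorts from most to least frequent, so cumsum gives a running share
share = toy_categories.value_counts(normalize=True)
cum_share = share.cumsum()

# Share of all rows covered by the two most frequent categories
top_2_share = cum_share.iloc[1] * 100

print(cum_share)
print(f"Top 2 categories cover {top_2_share:.1f}% of rows")
```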

## Section 5: The describe() Method - Your Statistical Summary Assistant

In [None]:
# Business Question: What's the complete statistical profile of our key metrics?
# This gives you count, mean, std, min, 25%, 50%, 75%, max - all at once!

print("=== COMPREHENSIVE STATISTICAL SUMMARY ===")
numeric_summary = df[['price', 'freight_value', 'review_score']].describe()
print(numeric_summary)

# Custom percentiles for business analysis
print("\n=== BUSINESS-FOCUSED PERCENTILES ===")
business_percentiles = df[['price', 'freight_value']].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
print(business_percentiles)

# Interpretation for business stakeholders
price_stats = df['price'].describe()
print(f"\n=== BUSINESS INTERPRETATION ===")
print(f"Price Analysis:")
print(f"• 50% of orders are at or below R$ {price_stats['50%']:.2f} (median)")
print(f"• 75% of orders are at or below R$ {price_stats['75%']:.2f} (Q3)")
print(f"• Top 25% of orders start from R$ {price_stats['75%']:.2f}")
print(f"• Standard deviation: R$ {price_stats['std']:.2f} (price variability)")

In [None]:
# Describe by categories (like Excel pivot table summaries)
print("=== DESCRIBE BY CUSTOMER STATE ===")
state_price_summary = df.groupby('customer_state')['price'].describe()
print(state_price_summary)

# Focus on key states for business decisions
top_states = df['customer_state'].value_counts().head(3).index
print(f"\n=== TOP 3 STATES DETAILED ANALYSIS ===")
for state in top_states:
    state_data = df[df['customer_state'] == state]['price']
    print(f"\n{state} State:")
    print(f"  Average order value: R$ {state_data.mean():.2f}")
    print(f"  Median order value: R$ {state_data.median():.2f}")
    print(f"  Order count: {len(state_data):,}")
    print(f"  Revenue contribution: R$ {state_data.sum():,.2f}")
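
To make the rows of `describe()` concrete, the sketch below checks a few of them against the individual methods they correspond to. The `prices` Series is hypothetical and deliberately tiny so the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical prices (not the Olist sample)
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

stats = prices.describe()

# Each describe() row matches a standalone Series method
same_median = stats["50%"] == prices.median()
same_q1 = stats["25%"] == prices.quantile(0.25)
same_q3 = stats["75%"] == prices.quantile(0.75)
same_mean = stats["mean"] == prices.mean()

print(stats)
print(same_median, same_q1, same_q3, same_mean)
```

Note that the `std` row is the sample standard deviation, which is also what `Series.std()` returns by default.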

## Section 6: Introduction to GroupBy - Excel Pivot Table Power

In [None]:
# Business Question: How does performance vary by customer state?
# Excel equivalent: Pivot table with State as rows, sum/count/average as values

print("=== STATE PERFORMANCE ANALYSIS ===")
state_performance = df.groupby('customer_state').agg({
    'price': ['count', 'sum', 'mean', 'median'],
    'freight_value': 'mean',
    'review_score': 'mean'
}).round(2)

# Flatten column names for easier reading
state_performance.columns = ['Order_Count', 'Total_Revenue', 'Avg_Price', 'Median_Price', 'Avg_Freight', 'Avg_Rating']
state_performance = state_performance.sort_values('Total_Revenue', ascending=False)

print(state_performance)

# Business insights
print(f"\n=== BUSINESS INSIGHTS ===")
top_revenue_state = state_performance.index[0]
highest_aov_state = state_performance['Avg_Price'].idxmax()
best_rating_state = state_performance['Avg_Rating'].idxmax()

print(f"• Top revenue state: {top_revenue_state}")
print(f"• Highest average order value: {highest_aov_state} (R$ {state_performance.loc[highest_aov_state, 'Avg_Price']:.2f})")
print(f"• Best customer satisfaction: {best_rating_state} ({state_performance.loc[best_rating_state, 'Avg_Rating']:.2f}/5.0)")

In [None]:
# Business Question: Which product categories generate the most revenue?
# Excel equivalent: Pivot table with Category as rows

print("=== CATEGORY PERFORMANCE ANALYSIS ===")
category_performance = df.groupby('category').agg({
    'price': ['count', 'sum', 'mean'],
    'review_score': 'mean'
}).round(2)

category_performance.columns = ['Order_Count', 'Total_Revenue', 'Avg_Price', 'Avg_Rating']
category_performance = category_performance.sort_values('Total_Revenue', ascending=False)

# Show top 10 categories
print(category_performance.head(10))

# Calculate revenue share
category_performance['Revenue_Share_%'] = (category_performance['Total_Revenue'] / total_revenue * 100).round(1)

print(f"\n=== TOP 5 CATEGORIES BY REVENUE ===")
top_categories = category_performance.head(5)
for category, row in top_categories.iterrows():
    print(f"{category}:")
    print(f"  Revenue: R$ {row['Total_Revenue']:,.2f} ({row['Revenue_Share_%']}% of total)")
    print(f"  Orders: {int(row['Order_Count']):,}")
    print(f"  Avg Price: R$ {row['Avg_Price']:.2f}")
    print(f"  Rating: {row['Avg_Rating']:.1f}/5.0\n")
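
The two groupby cells above rename the columns by hand after aggregating. A possible alternative is pandas named aggregation, where each output column is declared as `name=(source_column, function)`, so the names cannot drift out of order. The `toy` DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not the Olist sample)
toy = pd.DataFrame({
    "category": ["toys", "toys", "books", "books", "books"],
    "price": [50.0, 70.0, 20.0, 30.0, 40.0],
    "review_score": [5, 4, 3, 4, 5],
})

# Named aggregation: output_column=(input_column, aggregation)
category_perf = (
    toy.groupby("category")
       .agg(
           Order_Count=("price", "count"),
           Total_Revenue=("price", "sum"),
           Avg_Price=("price", "mean"),
           Avg_Rating=("review_score", "mean"),
       )
       .sort_values("Total_Revenue", ascending=False)
)

print(category_perf)
```

This form also keeps `Order_Count` as an integer column, and the result has flat column names from the start.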

## Section 7: Practical Business Applications

In [None]:
# Business Challenge: Create an executive dashboard summary
# This combines multiple aggregation techniques

def create_business_summary(df):
    """Create comprehensive business summary"""
    
    summary = {}
    
    # Overall metrics
    summary['total_orders'] = df.shape[0]
    summary['total_revenue'] = df['price'].sum()
    summary['avg_order_value'] = df['price'].mean()
    summary['unique_orders'] = df['order_id'].nunique()  # Distinct order IDs; this is not a customer count
    
    # Regional insights
    summary['top_state'] = df['customer_state'].value_counts().index[0]
    summary['states_covered'] = df['customer_state'].nunique()
    
    # Product insights
    summary['top_category'] = df['category'].value_counts().index[0]
    summary['categories_sold'] = df['category'].nunique()
    
    # Customer satisfaction
    summary['avg_rating'] = df['review_score'].mean()
    summary['satisfaction_rate'] = (df['review_score'] >= 4).sum() / df['review_score'].notna().sum() * 100
    
    # Price segments
    summary['premium_orders_pct'] = (df['price'] >= 200).sum() / len(df) * 100
    
    return summary

# Generate executive summary
business_summary = create_business_summary(df)

print("=== EXECUTIVE BUSINESS SUMMARY ===")
print(f"📊 OVERALL PERFORMANCE")
print(f"   Total Orders: {business_summary['total_orders']:,}")
print(f"   Total Revenue: R$ {business_summary['total_revenue']:,.2f}")
print(f"   Average Order Value: R$ {business_summary['avg_order_value']:.2f}")

print(f"\n🗺️  MARKET COVERAGE")
print(f"   Top Performing State: {business_summary['top_state']}")
print(f"   States Covered: {business_summary['states_covered']}")

print(f"\n🛍️  PRODUCT PERFORMANCE")
print(f"   Top Category: {business_summary['top_category']}")
print(f"   Categories Available: {business_summary['categories_sold']}")

print(f"\n⭐ CUSTOMER SATISFACTION")
print(f"   Average Rating: {business_summary['avg_rating']:.1f}/5.0")
print(f"   Satisfaction Rate: {business_summary['satisfaction_rate']:.1f}% (4+ stars)")

print(f"\n💰 PREMIUM SEGMENT")
print(f"   Premium Orders (≥R$200): {business_summary['premium_orders_pct']:.1f}%")

## Section 8: Key Takeaways and Best Practices

### 🎯 Key Concepts Learned:

1. **Basic Aggregations**: `sum()`, `count()`, `mean()`, `min()`, `max()`, `median()`
2. **Conditional Aggregations**: Using boolean indexing for Excel SUMIF/COUNTIF equivalent
3. **Value Counts**: `value_counts()` for frequency analysis (Excel COUNTIF for categories)
4. **Statistical Summary**: `describe()` for comprehensive data profiling
5. **GroupBy Introduction**: Basic grouping for category-wise analysis

### 🔄 Excel to Pandas Translation:

- **Excel SUMIF** → `df[condition]['column'].sum()`
- **Excel COUNTIF** → `df[condition].shape[0]` or `value_counts()`
- **Excel AVERAGEIF** → `df[condition]['column'].mean()`
- **Excel Pivot Table** → `df.groupby().agg()`

### 💡 Business Applications:

- Regional performance analysis
- Customer segmentation by order value
- Product category optimization
- Customer satisfaction measurement
- Executive summary reporting

### 🚀 Next Class Preview:
Tomorrow in SQL class, we'll perform the same analyses using GROUP BY, HAVING, and window functions!

## Practice Exercises

Try these exercises to reinforce your learning:

1. **State Analysis**: Find the state with the highest average freight cost
2. **Category Insights**: Identify categories with average rating above 4.0
3. **Price Distribution**: Create 5 price segments (equal-width bins, or 5 equal-sized groups using quantiles) and analyze order distribution
4. **Monthly Trends**: Extract month from order_date and analyze monthly performance
5. **Customer Satisfaction**: Calculate satisfaction rate by customer state

In [None]:
# Space for practice exercises
# Students can work on the exercises here

# Example solution for Exercise 1:
print("Exercise 1 Solution - State with highest average freight cost:")
freight_by_state = df.groupby('customer_state')['freight_value'].mean().sort_values(ascending=False)
print(f"Highest avg freight: {freight_by_state.index[0]} (R$ {freight_by_state.iloc[0]:.2f})")
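
As a possible starting point for Exercise 4, the sketch below derives a month key from a datetime column and groups on it. It runs on a hypothetical `toy` DataFrame; on the class dataset you would apply the same idea to `df['order_date']`, which was converted to datetime in Section 1.

```python
import pandas as pd

# Hypothetical toy data (not the Olist sample)
toy = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-02-03", "2018-02-14", "2018-02-28"]
    ),
    "price": [100.0, 50.0, 80.0, 120.0, 40.0],
})

# to_period("M") keeps the year and month together, e.g. 2018-01
toy["order_month"] = toy["order_date"].dt.to_period("M")

monthly = toy.groupby("order_month")["price"].agg(
    orders="count", revenue="sum", avg_order_value="mean"
)

print(monthly)
```

Using a year-month period rather than `dt.month` alone avoids mixing, for example, January of two different years into one group.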