# Introduction to Pandas and DataFrames: Your Digital Spreadsheet

**Week 01 - Wednesday | Python Fundamentals**  
*From Excel Worksheets to Python DataFrames*

---

## The Big Moment: From Excel to Python Data Analysis

### What You've Learned So Far:
✅ Variables (named cells)  
✅ Lists (column ranges)  
✅ Dictionaries (lookup tables)  

### What We're Learning Now:
🎯 **DataFrames** = Complete Excel Worksheets in Python!

### Excel to Pandas Translation:
| Excel Concept | Pandas Equivalent |
|---------------|------------------|
| Worksheet | DataFrame |
| Column | Series |
| Row | Record/Index |
| File > Open | `pd.read_csv()` |
| Summary Statistics | `.describe()` |
| Column Headers | `.columns` |
| Number of Rows | `.shape[0]` |

## Setting Up: Import Pandas Library

Think of this like loading an Excel add-in. We need to import the pandas library before we can work with data:

In [None]:
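# Import pandas library (like loading an Excel add-in)
import pandas as pd
import numpy as np

# Also import some other useful libraries
from datetime import datetime
import warnings

# Hide warning messages to keep the lesson output tidy
# (in real projects, read warnings instead of silencing them)
warnings.filterwarnings('ignore')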

print("✅ Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print("Ready to work with data!")

## Creating Your First DataFrame (Like Making a New Worksheet)

### Method 1: From Lists (Like Typing Data in Excel)

In [None]:
# Business data for NaijaCommerce
customer_names = ["Adebayo Okonkwo", "Fatima Abdullahi", "Chinedu Okoro", "Amina Hassan", "Yemi Adeyemi"]
cities = ["Lagos", "Abuja", "Port Harcourt", "Kano", "Ibadan"]
order_values = [125000, 87500, 200000, 156000, 93000]
order_dates = ["2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19"]
is_vip = [True, False, True, False, False]

# Create DataFrame (like creating Excel worksheet)
orders_df = pd.DataFrame({
    'customer_name': customer_names,
    'city': cities,
    'order_value': order_values,
    'order_date': order_dates,
    'is_vip': is_vip
})

print("=== First DataFrame Created! ===")
print(orders_df)
print(f"\nDataFrame shape: {orders_df.shape} (rows, columns)")

### Understanding DataFrame Structure (Excel Worksheet Layout)

In [None]:
# Exploring DataFrame properties (like Excel worksheet properties)
print("=== DataFrame Information (Like Excel Sheet Properties) ===")
print(f"📏 Shape (rows, columns): {orders_df.shape}")
print(f"📊 Number of rows: {len(orders_df)}")
print(f"📈 Number of columns: {len(orders_df.columns)}")
print(f"📋 Column names: {list(orders_df.columns)}")
print(f"🗃️  Index (row numbers): {list(orders_df.index)}")

print("\n=== Data Types (Like Excel Column Formats) ===")
print(orders_df.dtypes)

print("\n=== Basic Info (Like Excel Data Tab Summary) ===")
orders_df.info()
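
The dtypes output above will typically list `order_date` as text rather than as a date, much like a date typed into a text-formatted Excel column. The sketch below shows one way to convert it with `pd.to_datetime`. The small `demo_df` frame is a hypothetical stand-in built only for this example, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical stand-in for orders_df with the same kind of columns
demo_df = pd.DataFrame({
    'order_value': [125000, 87500, 200000],
    'order_date': ["2024-01-15", "2024-01-16", "2024-01-17"],
})

# Before conversion: order_date is stored as text
print(demo_df.dtypes)

# Convert text to real dates (like changing an Excel column format to Date)
demo_df['order_date'] = pd.to_datetime(demo_df['order_date'], format='%Y-%m-%d')
print(demo_df.dtypes)

# Real dates unlock date parts such as the weekday name
demo_df['weekday'] = demo_df['order_date'].dt.day_name()
print(demo_df)
```

Once a column holds real dates, sorting, filtering by date range, and grouping by month tend to be more reliable than working with text.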

## Basic DataFrame Operations (Essential Excel Functions)

### Viewing Data (Like Looking at Your Spreadsheet)

In [None]:
# View first few rows (like looking at top of Excel sheet)
print("=== First 3 Rows (.head()) - Like Ctrl+Home ===")
print(orders_df.head(3))

print("\n=== Last 2 Rows (.tail()) - Like Ctrl+End ===")
print(orders_df.tail(2))

print("\n=== Random Sample (2 rows) ===")
print(orders_df.sample(2))

### Selecting Columns (Like Selecting Excel Columns)

In [None]:
# Select single column (like clicking column A)
print("=== Single Column: Customer Names ===")
customer_names_column = orders_df['customer_name']
print(customer_names_column)
print(f"Type: {type(customer_names_column)}")

print("\n=== Multiple Columns (Like Ctrl+Click in Excel) ===")
customer_order_info = orders_df[['customer_name', 'order_value', 'city']]
print(customer_order_info)

print("\n=== All Columns Except One ===")
without_date = orders_df.drop(columns='order_date')
print(without_date)
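
Columns are only half of a worksheet. Later cells in this lesson use `.iloc[0]` to pick a row, so here is a minimal sketch of row selection. It assumes a small hypothetical `demo_df` and the default 0, 1, 2... index.

```python
import pandas as pd

# Hypothetical mini worksheet for illustration
demo_df = pd.DataFrame({
    'customer_name': ["Adebayo Okonkwo", "Fatima Abdullahi", "Chinedu Okoro"],
    'city': ["Lagos", "Abuja", "Port Harcourt"],
    'order_value': [125000, 87500, 200000],
})

# .iloc selects by position (0 = first row), like counting rows from the top
first_row = demo_df.iloc[0]
print(first_row)

# .loc selects by index label and column name, like a cell reference
second_city = demo_df.loc[1, 'city']
print(second_city)

# A slice of positions returns a smaller DataFrame
first_two = demo_df.iloc[0:2]
print(first_two)
```

With the default index, positions and labels happen to match. They differ after sorting or filtering, which is when the choice between `.iloc` and `.loc` starts to matter.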

### Basic Statistics (Like Excel's Summary Functions)

In [None]:
# Summary statistics (like Excel's Data Analysis ToolPak)
print("=== Complete Statistical Summary (.describe()) ===")
print(orders_df.describe())

print("\n=== Individual Statistics (Like Excel Functions) ===")
print(f"📊 Total Order Value (SUM): ₦{orders_df['order_value'].sum():,}")
print(f"📈 Average Order Value (AVERAGE): ₦{orders_df['order_value'].mean():,.2f}")
print(f"🔺 Maximum Order (MAX): ₦{orders_df['order_value'].max():,}")
print(f"🔻 Minimum Order (MIN): ₦{orders_df['order_value'].min():,}")
print(f"📏 Standard Deviation (STDEV): ₦{orders_df['order_value'].std():,.2f}")
print(f"🧮 Order Count (COUNT): {orders_df['order_value'].count()}")

print("\n=== Business Insights ===")
vip_count = orders_df['is_vip'].sum()  # True counts as 1
vip_percentage = (vip_count / len(orders_df)) * 100
unique_cities = orders_df['city'].nunique()

print(f"👑 VIP Customers: {vip_count} out of {len(orders_df)} ({vip_percentage:.1f}%)")
print(f"📍 Cities Served: {unique_cities}")
print(f"🌆 Cities List: {', '.join(orders_df['city'].unique())}")
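
Excel users often reach for COUNTIF to see how many times each value appears. A rough pandas counterpart is `.value_counts()`, sketched here on a hypothetical list of cities.

```python
import pandas as pd

# Hypothetical city column with repeats
cities = pd.Series(["Lagos", "Abuja", "Lagos", "Kano", "Lagos", "Abuja"], name='city')

# Count how many times each city appears (most frequent first)
city_counts = cities.value_counts()
print(city_counts)

# Share of orders per city, as percentages
city_share = cities.value_counts(normalize=True) * 100
print(city_share.round(1))
```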

## Loading Real Data from CSV (Like Opening Excel Files)

### Creating Sample Data First

In [None]:
# Create a larger sample dataset (simulating real e-commerce data)
import random
from datetime import datetime, timedelta

# Nigerian cities for our e-commerce platform
nigerian_cities = ['Lagos', 'Abuja', 'Port Harcourt', 'Kano', 'Ibadan', 'Benin City', 'Jos', 'Ilorin']

# Nigerian names
first_names = ['Adebayo', 'Fatima', 'Chinedu', 'Amina', 'Yemi', 'Khadija', 'Emeka', 'Aisha', 'Tunde', 'Zainab']
last_names = ['Okonkwo', 'Abdullahi', 'Okoro', 'Hassan', 'Adeyemi', 'Ibrahim', 'Okafor', 'Mohammed', 'Ogbonna', 'Usman']

# Product categories
categories = ['Electronics', 'Fashion', 'Home & Garden', 'Books', 'Sports', 'Beauty']
products = {
    'Electronics': ['iPhone 15', 'Samsung Galaxy S24', 'MacBook Pro', 'iPad Air', 'AirPods'],
    'Fashion': ['Nike Sneakers', 'Adidas Shirt', 'Zara Dress', 'Polo Shirt', 'Ankara Fabric'],
    'Home & Garden': ['Sofa Set', 'Dining Table', 'Bed Frame', 'Kitchen Set', 'Garden Tools'],
    'Books': ['Business Strategy', 'Python Programming', 'Nigerian History', 'Recipe Book', 'Self Help'],
    'Sports': ['Football', 'Basketball', 'Tennis Racket', 'Running Shoes', 'Gym Equipment'],
    'Beauty': ['Skincare Set', 'Makeup Kit', 'Perfume', 'Hair Products', 'Nail Polish']
}

# Price range (min, max) in naira for each category
price_ranges = {
    'Electronics': (50000, 2500000),
    'Fashion': (5000, 150000),
    'Home & Garden': (20000, 800000),
    'Books': (2000, 15000),
    'Sports': (8000, 200000),
    'Beauty': (3000, 75000)
}

# Generate sample data
sample_data = []
for i in range(50):  # 50 sample orders
    category = random.choice(categories)
    product = random.choice(products[category])

    # Generate a realistic price based on category
    min_price, max_price = price_ranges[category]
    price = random.randint(min_price, max_price)
    
    # Generate order date (last 30 days)
    days_ago = random.randint(0, 30)
    order_date = (datetime.now() - timedelta(days=days_ago)).strftime('%Y-%m-%d')
    
    sample_data.append({
        'order_id': f'ORD-{1000 + i}',
        'customer_name': f"{random.choice(first_names)} {random.choice(last_names)}",
        'city': random.choice(nigerian_cities),
        'product_category': category,
        'product_name': product,
        'order_value': price,
        'order_date': order_date,
        'is_vip': random.choice([True, False]),
        'payment_method': random.choice(['Card', 'Transfer', 'Cash']),
        'order_status': random.choice(['Delivered', 'Shipped', 'Processing', 'Cancelled'])
    })

# Create DataFrame from sample data
ecommerce_df = pd.DataFrame(sample_data)

print("=== Sample E-commerce Dataset Created! ===")
print(f"📊 Dataset size: {ecommerce_df.shape[0]} orders, {ecommerce_df.shape[1]} columns")
print(f"📅 Date range: {ecommerce_df['order_date'].min()} to {ecommerce_df['order_date'].max()}")
print(f"💰 Total revenue: ₦{ecommerce_df['order_value'].sum():,}")

print("\n=== First 5 Orders ===")
print(ecommerce_df.head())
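
The dataset above was generated in memory, so no file has been opened yet. As a minimal sketch of the File > Open step, the example below writes a small hypothetical frame to a CSV file and reads it back with `pd.read_csv()`. The file name `demo_orders.csv` and the temporary folder are assumptions made for this illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mini dataset standing in for ecommerce_df
demo_df = pd.DataFrame({
    'order_id': ['ORD-1000', 'ORD-1001', 'ORD-1002'],
    'city': ['Lagos', 'Abuja', 'Kano'],
    'order_value': [125000, 87500, 200000],
})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'demo_orders.csv')  # hypothetical file name

    # Save without the index column (like Save As > CSV)
    demo_df.to_csv(csv_path, index=False)

    # Load it back (like File > Open)
    loaded_df = pd.read_csv(csv_path)

print(loaded_df)
print(loaded_df.dtypes)
```

With a real file, you would pass its path straight to `pd.read_csv()` and skip the temporary folder.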

### Data Exploration (Like Excel's Data Analysis)

Now let's explore this data like you would in Excel:

In [None]:
# Dataset overview (like Excel's worksheet properties)
print("=== DATASET OVERVIEW ===")
print(f"📏 Size: {ecommerce_df.shape[0]} rows × {ecommerce_df.shape[1]} columns")
print(f"📋 Columns: {', '.join(ecommerce_df.columns)}")

print("\n=== COLUMN DATA TYPES ===")
for column, dtype in ecommerce_df.dtypes.items():
    print(f"   {column}: {dtype}")

print("\n=== MISSING VALUES CHECK ===")
missing_data = ecommerce_df.isnull().sum()
if missing_data.sum() == 0:
    print("✅ No missing values found!")
else:
    print(missing_data[missing_data > 0])

print("\n=== BASIC STATISTICS ===")
print(ecommerce_df.describe())
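
The generated dataset has no blanks, so the missing values check above has nothing to report. To show what the check looks like on messier data, here is a sketch with a hypothetical frame that has two gaps, plus two common ways of handling them.

```python
import numpy as np
import pandas as pd

# Hypothetical orders with two gaps, to show what the check reports
demo_df = pd.DataFrame({
    'city': ['Lagos', None, 'Kano', 'Abuja'],
    'order_value': [125000, 87500, np.nan, 200000],
})

# Count missing values per column (like counting blank cells)
missing_counts = demo_df.isnull().sum()
print(missing_counts)

# Option 1: drop any row that has a blank
complete_rows = demo_df.dropna()
print(complete_rows)

# Option 2: fill blanks with a placeholder or a typical value
filled_df = demo_df.fillna({
    'city': 'Unknown',
    'order_value': demo_df['order_value'].median(),
})
print(filled_df)
```

Which option is appropriate depends on the business question. Dropping rows loses orders, while filling them in invents values.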

### Business Analysis (Like Excel Pivot Tables)

In [None]:
# Revenue analysis
print("=== REVENUE ANALYSIS ===")
total_revenue = ecommerce_df['order_value'].sum()
avg_order_value = ecommerce_df['order_value'].mean()
median_order_value = ecommerce_df['order_value'].median()
total_orders = len(ecommerce_df)

print(f"💰 Total Revenue: ₦{total_revenue:,}")
print(f"📊 Average Order Value: ₦{avg_order_value:,.2f}")
print(f"📈 Median Order Value: ₦{median_order_value:,.2f}")
print(f"📦 Total Orders: {total_orders:,}")

print("\n=== TOP PERFORMING CATEGORIES ===")
category_revenue = ecommerce_df.groupby('product_category')['order_value'].agg(['sum', 'count', 'mean'])
category_revenue.columns = ['Total_Revenue', 'Order_Count', 'Avg_Order_Value']
category_revenue = category_revenue.sort_values('Total_Revenue', ascending=False)

for category, row in category_revenue.iterrows():
    # iterrows() converts each row to floats here, so format totals and counts explicitly
    print(f"   {category}:")
    print(f"      Revenue: ₦{row['Total_Revenue']:,.0f}")
    print(f"      Orders: {int(row['Order_Count'])}")
    print(f"      Avg Order: ₦{row['Avg_Order_Value']:,.2f}")

print("\n=== CITY PERFORMANCE ===")
city_stats = ecommerce_df.groupby('city')['order_value'].agg(['sum', 'count']).sort_values('sum', ascending=False)
print("Top 5 cities by revenue:")
for city, (revenue, orders) in city_stats.head().iterrows():
    print(f"   {city}: ₦{revenue:,} ({orders} orders)")

print("\n=== VIP CUSTOMER ANALYSIS ===")
vip_stats = ecommerce_df.groupby('is_vip')['order_value'].agg(['count', 'sum', 'mean'])
for vip_status, (count, total, avg) in vip_stats.iterrows():
    status_text = "VIP Customers" if vip_status else "Regular Customers"
    print(f"   {status_text}:")
    print(f"      Count: {int(count)} orders")  # each row is one order; iterrows() yields floats here
    print(f"      Revenue: ₦{total:,.0f}")
    print(f"      Avg Order: ₦{avg:,.2f}")
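
The `.groupby()` summaries above group by one column at a time. For the classic two-way Excel pivot layout (rows by one field, columns by another), pandas also offers `pd.pivot_table()`. The sketch below uses a hypothetical five-order frame.

```python
import pandas as pd

# Hypothetical orders for illustration
demo_df = pd.DataFrame({
    'city': ['Lagos', 'Lagos', 'Abuja', 'Abuja', 'Kano'],
    'product_category': ['Electronics', 'Fashion', 'Electronics', 'Electronics', 'Fashion'],
    'order_value': [500000, 40000, 300000, 200000, 25000],
})

# Rows = city, columns = category, values = summed order value
revenue_pivot = pd.pivot_table(
    demo_df,
    index='city',
    columns='product_category',
    values='order_value',
    aggfunc='sum',
    fill_value=0,  # show 0 instead of NaN where a city has no orders in a category
)
print(revenue_pivot)
```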

### Data Filtering (Like Excel AutoFilter)

In [None]:
# Filter data like Excel AutoFilter
print("=== DATA FILTERING (Like Excel AutoFilter) ===")

# Filter 1: High value orders (like Excel: Order Value > 100000)
high_value_orders = ecommerce_df[ecommerce_df['order_value'] > 100000]
print(f"\n🔍 High Value Orders (>₦100,000):")
print(f"   Found: {len(high_value_orders)} orders")
print(f"   Total Value: ₦{high_value_orders['order_value'].sum():,}")
print("   Sample:")
print(high_value_orders[['customer_name', 'product_name', 'order_value', 'city']].head(3))

# Filter 2: Lagos customers only
lagos_customers = ecommerce_df[ecommerce_df['city'] == 'Lagos']
print(f"\n🏙️ Lagos Customers Only:")
print(f"   Orders: {len(lagos_customers)}")
print(f"   Revenue: ₦{lagos_customers['order_value'].sum():,}")
print(f"   Avg Order: ₦{lagos_customers['order_value'].mean():,.2f}")

# Filter 3: VIP Electronics customers
vip_electronics = ecommerce_df[ecommerce_df['is_vip'] &
                               (ecommerce_df['product_category'] == 'Electronics')]
print(f"\n👑 VIP Electronics Customers:")
print(f"   Orders: {len(vip_electronics)}")
if len(vip_electronics) > 0:
    print(f"   Revenue: ₦{vip_electronics['order_value'].sum():,}")
    print("   Products:")
    for _, order in vip_electronics.iterrows():
        print(f"      {order['customer_name']}: {order['product_name']} (₦{order['order_value']:,})")
else:
    print("   No VIP electronics customers found")

# Filter 4: Recent orders (last 7 days)
from datetime import datetime, timedelta
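# Dates are stored as 'YYYY-MM-DD' text, which sorts in date order, so comparing strings works here
recent_date = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')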
recent_orders = ecommerce_df[ecommerce_df['order_date'] >= recent_date]
print(f"\n📅 Recent Orders (Last 7 Days):")
print(f"   Orders: {len(recent_orders)}")
print(f"   Revenue: ₦{recent_orders['order_value'].sum():,}")
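
A few more filter patterns are worth knowing alongside the comparisons above. This sketch, on a hypothetical frame, shows `.isin()` for matching several values, `.between()` for ranges, and a date filter on a real datetime column. The fixed `reference_date` is an assumption chosen so the result is repeatable.

```python
import pandas as pd

# Hypothetical orders with a real datetime column
demo_df = pd.DataFrame({
    'city': ['Lagos', 'Abuja', 'Kano', 'Ibadan', 'Lagos'],
    'order_value': [125000, 87500, 200000, 45000, 600000],
    'order_date': pd.to_datetime(
        ['2024-01-10', '2024-01-14', '2024-01-15', '2024-01-18', '2024-01-19']
    ),
})

# Match any of several values (like ticking several boxes in AutoFilter)
lagos_or_abuja = demo_df[demo_df['city'].isin(['Lagos', 'Abuja'])]

# Keep values inside a range, both ends included by default
mid_range = demo_df[demo_df['order_value'].between(50000, 500000)]

# Compare real dates against a cutoff 7 days before a fixed reference date
reference_date = pd.Timestamp('2024-01-19')  # fixed so the result is repeatable
cutoff = reference_date - pd.Timedelta(days=7)
recent = demo_df[demo_df['order_date'] >= cutoff]

print(len(lagos_or_abuja), len(mid_range), len(recent))
```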

### Sorting Data (Like Excel Sort Feature)

In [None]:
# Sort data like Excel Sort feature
print("=== DATA SORTING (Like Excel Sort) ===")

# Sort by order value (highest to lowest)
print("\n💰 TOP 5 HIGHEST VALUE ORDERS:")
top_orders = ecommerce_df.sort_values('order_value', ascending=False).head()
for _, order in top_orders.iterrows():
    vip_status = "👑 VIP" if order['is_vip'] else "Regular"
    print(f"   {order['customer_name']} ({vip_status})")
    print(f"      Product: {order['product_name']}")
    print(f"      Value: ₦{order['order_value']:,}")
    print(f"      City: {order['city']}")
    print()

# Sort by multiple columns (like Excel multi-level sort)
print("\n📊 ORDERS BY CATEGORY AND VALUE:")
sorted_by_category = ecommerce_df.sort_values(['product_category', 'order_value'], 
                                              ascending=[True, False])
print("Top order from each category:")
for category in ecommerce_df['product_category'].unique():
    top_in_category = sorted_by_category[sorted_by_category['product_category'] == category].iloc[0]
    print(f"   {category}: {top_in_category['product_name']} - ₦{top_in_category['order_value']:,}")
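
The loop above finds the top order per category by sorting and then filtering once per category. An alternative sketch, assuming a hypothetical frame with the same column names, uses `.groupby()` with `.idxmax()` to get the same kind of answer without a loop.

```python
import pandas as pd

# Hypothetical orders for illustration
demo_df = pd.DataFrame({
    'product_category': ['Books', 'Electronics', 'Books', 'Electronics', 'Fashion'],
    'product_name': ['Recipe Book', 'iPad Air', 'Self Help', 'MacBook Pro', 'Zara Dress'],
    'order_value': [5000, 900000, 12000, 2100000, 80000],
})

# For each category, find the row label of its largest order_value
top_row_labels = demo_df.groupby('product_category')['order_value'].idxmax()

# Pull those rows out in one step
top_per_category = demo_df.loc[top_row_labels]
print(top_per_category)
```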

## Hands-On Exercise: Your Business Analysis

**Scenario**: You're the new Data Analyst at NaijaCommerce. Use the dataset to answer key business questions.

**Your Tasks**: Complete the analysis below

In [None]:
print("=== YOUR BUSINESS ANALYSIS CHALLENGE ===")
print("Complete the following analysis using the ecommerce_df dataset:\n")

# Task 1: Find the best sales day (highest revenue; the data has no cost or profit figures)
print("📅 TASK 1: Which date had the highest total sales?")
daily_sales = ecommerce_df.groupby('order_date')['order_value'].sum().sort_values(ascending=False)
best_day = daily_sales.index[0]
best_day_revenue = daily_sales.iloc[0]
print(f"Answer: {best_day} with ₦{best_day_revenue:,}")
print(f"Your task: Explain why this might have happened...")
print("# Your explanation here: ______________________\n")

# Task 2: Payment method analysis
print("💳 TASK 2: Which payment method generates the most revenue?")
payment_analysis = ecommerce_df.groupby('payment_method')['order_value'].agg(['sum', 'count', 'mean'])
print(payment_analysis)
print("Your task: Which payment method should we focus on promoting?")
print("# Your recommendation: ______________________\n")

# Task 3: Customer segmentation
print("👥 TASK 3: Create customer segments based on order value")
print("Define segments:")
print("- Premium: Orders > ₦500,000")
print("- Standard: Orders ₦50,000 - ₦500,000")
print("- Budget: Orders < ₦50,000")

# Your code here - create the segments
def categorize_order(value):
    if value > 500000:
        return 'Premium'
    elif value >= 50000:
        return 'Standard'
    else:
        return 'Budget'

ecommerce_df['customer_segment'] = ecommerce_df['order_value'].apply(categorize_order)
segment_analysis = ecommerce_df.groupby('customer_segment')['order_value'].agg(['count', 'sum', 'mean'])
print(segment_analysis)
print("Your task: What percentage of orders fall into each segment?")
segment_percentages = ecommerce_df['customer_segment'].value_counts(normalize=True) * 100
print(segment_percentages)
print("# Your insights: ______________________\n")

# Task 4: Geographic expansion
print("🗺️ TASK 4: Should we expand to new cities?")
city_performance = ecommerce_df.groupby('city')['order_value'].agg(['count', 'sum', 'mean']).sort_values('sum', ascending=False)
print("Current city performance:")
print(city_performance)
print("Your task: Rank cities by priority for marketing investment")
print("# Your ranking: 1.______ 2.______ 3.______")
print("# Reasoning: ______________________\n")

# Task 5: Product strategy
print("📦 TASK 5: Which product categories need attention?")
category_performance = ecommerce_df.groupby('product_category').agg({
    'order_value': ['sum', 'count', 'mean'],
    'is_vip': 'sum'  # Number of VIP orders (True counts as 1)
})
print(category_performance)
print("Your recommendations:")
print("# Categories to promote more: ______________________")
print("# Categories to investigate: ______________________")
print("# Why: ______________________")
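
Task 3 above builds segments with `.apply()` and a Python function, which is easy to read. On larger datasets a vectorized approach is often faster. The sketch below uses `np.select` as one possible alternative, with the same thresholds as Task 3 applied to a hypothetical column of order values.

```python
import numpy as np
import pandas as pd

# Hypothetical order values, including both boundary values from Task 3
demo_df = pd.DataFrame({'order_value': [30000, 50000, 250000, 500000, 750000]})

values = demo_df['order_value']

# Conditions are checked in order; the first one that is True wins
conditions = [values > 500000, values >= 50000]
labels = ['Premium', 'Standard']

# Anything matching no condition falls back to the default label
demo_df['customer_segment'] = np.select(conditions, labels, default='Budget')
print(demo_df)
```

The boundary values behave the same way as in `categorize_order`: exactly ₦50,000 and exactly ₦500,000 both land in Standard.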

## Saving Your Analysis (Like Saving Excel Files)

In [None]:
# Save DataFrame to CSV (like Save As in Excel)
# Note: In Google Colab, files are saved to temporary session storage, so download them before the session ends

# Create summary report DataFrame
summary_report = pd.DataFrame({
    'Metric': [
        'Total Orders',
        'Total Revenue',
        'Average Order Value',
        'VIP Customers',
        'Cities Served',
        'Product Categories'
    ],
    'Value': [
        len(ecommerce_df),
        f"₦{ecommerce_df['order_value'].sum():,}",
        f"₦{ecommerce_df['order_value'].mean():,.2f}",
        ecommerce_df['is_vip'].sum(),
        ecommerce_df['city'].nunique(),
        ecommerce_df['product_category'].nunique()
    ]
})

print("=== BUSINESS SUMMARY REPORT ===")
print(summary_report.to_string(index=False))

# Show how to save (commented out for demo purposes)
print("\n=== HOW TO SAVE YOUR ANALYSIS ===")
print("# Save main dataset:")
print("# ecommerce_df.to_csv('naija_commerce_orders.csv', index=False)")
print("# Save summary report:")
print("# summary_report.to_csv('business_summary.csv', index=False)")
print("\n✅ Analysis complete! Ready for Thursday's SQL session.")

## Key Takeaways: Pandas DataFrames

### What You Learned Today
✅ **DataFrames** = Excel worksheets in Python  
✅ **Loading data** with pd.read_csv() = Opening Excel files  
✅ **Exploring data** with .head(), .info(), .describe()  
✅ **Filtering data** with boolean conditions = AutoFilter  
✅ **Sorting data** with .sort_values() = Excel Sort  
✅ **Grouping data** with .groupby() = Pivot Tables  
✅ **Summary statistics** = Excel's Data Analysis functions  

### Python vs Excel: Data Operations
| Task | Excel | Pandas |
|------|-------|--------|
| Open file | File > Open | `pd.read_csv()` |
| View data | Scroll worksheet | `.head()`, `.tail()` |
| Summary stats | Data Analysis ToolPak | `.describe()` |
| Filter data | AutoFilter | `df[condition]` |
| Sort data | Data > Sort | `.sort_values()` |
| Group data | Pivot Table | `.groupby()` |
| Save file | Save As | `.to_csv()` |

### Most Important DataFrame Methods
- **`.head(n)`** - First n rows
- **`.info()`** - Dataset overview
- **`.describe()`** - Summary statistics
- **`.shape`** - Rows and columns count
- **`.columns`** - Column names
- **`.groupby()`** - Group data for analysis
- **`.sort_values()`** - Sort by column values

### Business Analysis Workflow
1. **Load** the data → `pd.read_csv()`
2. **Explore** structure → `.info()`, `.head()`
3. **Check** data quality → `.describe()`, `.isnull().sum()`
4. **Filter** relevant data → boolean conditions
5. **Group** for insights → `.groupby()`
6. **Sort** for rankings → `.sort_values()`
7. **Save** results → `.to_csv()`
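
The seven steps above can be sketched end to end in a few lines. This is a minimal illustration, not a template for every analysis. It reads a hypothetical CSV from an in-memory string (via `io.StringIO`) instead of a real file, and the ₦50,000 threshold is an arbitrary assumption.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file on disk
csv_text = """order_id,city,order_value
ORD-1,Lagos,125000
ORD-2,Abuja,87500
ORD-3,Lagos,200000
ORD-4,Kano,45000
"""

# 1. Load
orders = pd.read_csv(io.StringIO(csv_text))

# 2. Explore
orders.info()
print(orders.head())

# 3. Check data quality
print(orders.isnull().sum())

# 4. Filter: keep orders of at least 50,000
large_orders = orders[orders['order_value'] >= 50000]

# 5. Group: revenue per city
city_revenue = large_orders.groupby('city')['order_value'].sum()

# 6. Sort: highest revenue first
city_revenue = city_revenue.sort_values(ascending=False)
print(city_revenue)

# 7. Save: with no path given, to_csv returns the CSV text instead of writing a file
report_csv = city_revenue.to_csv()
print(report_csv)
```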

### Next Steps: From Python to SQL
**Tomorrow's SQL Session** will cover the same concepts using database queries:
- DataFrames → Database Tables
- Filtering → WHERE clauses
- Grouping → GROUP BY statements
- Sorting → ORDER BY clauses

**You're Ready!** You now understand data analysis thinking - just different tools!

---

### Congratulations! 
You've successfully completed your first day of Python programming for data analysis. You've learned to:
- Think in Python while leveraging your Excel knowledge
- Work with variables, lists, and dictionaries
- Load and analyze real business data with pandas
- Perform the same analysis you'd do in Excel, but with code!

**Assignment**: Complete the practice exercises in the `exercises/` folder before tomorrow's SQL session.