# TechFlow Data Analysis - Module 2
## Indexing, Grouping & Aggregation

**Your Role:** Data Analyst at TechFlow (B2B SaaS Company)

**Your Mission:** Control your data, don't fight it.

After this module, you will understand:
- Why indexing is about CONTROL over your data axes
- Why Group By is about THINKING in aggregated patterns
- How these skills separate junior analysts from senior ones

---

# SETUP - Run these first

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('../dataset/TechFlow.csv')

---
# PART 1: Indexing (Control the Axes)

In business, every row represents SOMETHING meaningful: a customer, a transaction, a product.

The index is YOUR LABEL for that something.

Default integer indexes (0, 1, 2, 3...) tell you NOTHING about the data. They're just positions.

**Bad indexing leads to:**
- Silent merge errors (rows that don't align correctly)
- Off-by-one mistakes when filtering
- Confusion when presenting results to stakeholders

**Good indexing gives you:**
- Instant lookup by business meaning
- Cleaner joins between datasets
- Self-documenting DataFrames
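
To make the "silent merge error" risk concrete, here is a minimal sketch using two invented tables (the IDs and values are illustrative, not taken from TechFlow.csv). It assumes the same customers appear in both tables but in a different row order, and compares a positional copy with a label-aligned one.

```python
import pandas as pd

# Two toy tables listing the same customers in a different row order.
# All IDs and values are invented for illustration.
revenue = pd.DataFrame({'CustomerID': [1001, 1002, 1003],
                        'MonthlyRevenue': [500, 1200, 300]})
seats = pd.DataFrame({'CustomerID': [1003, 1001, 1002],
                      'SeatCount': [3, 5, 12]})

# Positional copy: ignores CustomerID, so seat counts land on the wrong rows.
by_position = revenue.copy()
by_position['SeatCount'] = seats['SeatCount'].to_numpy()

# Label-based: with CustomerID as the index, pandas aligns rows by ID.
by_label = revenue.set_index('CustomerID')
by_label['SeatCount'] = seats.set_index('CustomerID')['SeatCount']

print(by_position)
print(by_label)
```

In the positional version customer 1001 ends up with 3 seats. In the label-aligned version it gets the correct 5. Neither version raises an error, which is why this kind of mistake stays silent.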

---
## 1.1 Creating an Index

**Default index - meaningless**

Look at the data with its default integer index. It tells you nothing about which customer you're looking at.

```python
df.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Set CustomerID as the index**

Why CustomerID? It should uniquely identify each row, which makes it the natural "primary key" of this dataset. Confirm this with `df['CustomerID'].is_unique` before relying on it.

```python
df_indexed = df.set_index('CustomerID')
df_indexed.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Reset the index**

`reset_index()` goes back to the default integer index, moving the current index back to a column.

```python
df_reset = df_indexed.reset_index()
df_reset.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 1.2 Multi-Level Indexing

Sometimes a single column isn't enough to organize your data.

A multi-level index creates a HIERARCHY.
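
As a rough illustration of what "hierarchy" means in practice, the sketch below builds a two-level index on a few invented rows (the column names follow this module, the values are made up). It then looks up data by the outer level alone and by both levels together.

```python
import pandas as pd

# Invented rows; only the column names come from this module.
toy = pd.DataFrame({
    'Industry': ['Healthcare', 'Healthcare', 'Retail', 'Retail'],
    'SubscriptionPlan': ['Basic', 'Enterprise', 'Basic', 'Enterprise'],
    'MonthlyRevenue': [200, 1500, 150, 1100],
})

toy_multi = toy.set_index(['Industry', 'SubscriptionPlan']).sort_index()

# Outer level only: every Healthcare row, whatever the plan.
healthcare = toy_multi.loc['Healthcare']

# Both levels as a tuple: one specific Industry/Plan combination.
hc_enterprise = toy_multi.loc[('Healthcare', 'Enterprise'), 'MonthlyRevenue']

print(healthcare)
print(hc_enterprise)
```

The outer level acts like a folder and the inner level like a file inside it: you can open the whole folder or go straight to one file.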

**When it's useful:**
- Analyzing data across multiple dimensions simultaneously
- Creating hierarchical reports (like Excel pivot tables)

**When it becomes dangerous:**
- Too many levels (three or more levels get confusing)
- When colleagues don't understand your structure

**Create a multi-level index**

Use Industry and SubscriptionPlan as a hierarchical index.

```python
df_multi = df.set_index(['Industry', 'SubscriptionPlan'])
df_multi.head(15)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**View the index structure**

```python
print(df_multi.index)
print('Names:', df_multi.index.names)
print('Levels:', df_multi.index.nlevels)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Sort multi-index for hierarchical view**

```python
df_multi_sorted = df_multi.sort_index()
df_multi_sorted.head(20)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 1.3 Filtering with loc and iloc

**THIS IS WHERE MOST ANALYSTS MAKE SILENT MISTAKES.**

- `loc` = Label-based (uses index LABELS and column NAMES)
- `iloc` = Integer-based (uses POSITIONS regardless of labels)

The mistake: Using iloc when you mean loc, or vice versa.
The result: Wrong data, wrong conclusions, wrong decisions.
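
A minimal sketch of how the two can disagree, using an invented Series whose integer labels do not start at 0 (an assumption chosen to make the difference visible):

```python
import pandas as pd

# Invented revenue figures, indexed by integer IDs that do NOT start at 0,
# so labels and positions disagree.
revenue = pd.Series([500, 1200, 300, 800], index=[2, 3, 4, 5])

by_label = revenue.loc[2]      # the row LABELLED 2, which is the first row
by_position = revenue.iloc[2]  # the row at POSITION 2, which is the third row

print(by_label, by_position)
```

Both calls succeed and both return a number, but they return different customers' revenue (500 and 300). Nothing warns you which one you got.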

**Setup: Create indexed DataFrame**

```python
df_work = df.set_index('CustomerID')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Single row with loc (by label)**

Get the customer with ID 1005.

```python
df_work.loc[1005]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Single row with iloc (by position)**

Get the 5th row (position 4, because Python counts from 0).

```python
df_work.iloc[4]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Range slicing with loc (INCLUSIVE on both ends)**

Get customers 1003 through 1007. Note: 1007 IS included.

```python
df_work.loc[1003:1007]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Range slicing with iloc (EXCLUSIVE on end)**

Get rows at positions 2, 3, 4, 5, 6. Note: Position 7 is NOT included.

```python
df_work.iloc[2:7]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**CRITICAL:**
- `loc` slices are INCLUSIVE on both ends: `1003:1007` includes 1007
- `iloc` slices are EXCLUSIVE on the end: `2:7` does NOT include position 7

This inconsistency causes countless bugs. Memorize it.
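
One way to convince yourself of the rule is to count rows on a small invented table. The sketch assumes consecutive IDs from 1001 to 1008 (made up for illustration), so label 1003 sits at position 2.

```python
import pandas as pd

# Invented table with consecutive IDs 1001..1008.
toy = pd.DataFrame(
    {'MonthlyRevenue': [100, 200, 300, 400, 500, 600, 700, 800]},
    index=pd.Index(range(1001, 1009), name='CustomerID'),
)

label_slice = toy.loc[1003:1007]   # labels 1003..1007, end label included
position_slice = toy.iloc[2:7]     # positions 2..6, position 7 excluded

print(len(label_slice), len(position_slice))
print(label_slice.index.tolist() == position_slice.index.tolist())
```

Both slices return the same five rows here, but only because `7 - 2` happens to equal the number of labels from 1003 through 1007. If you apply the `loc` rule to `iloc` (or the reverse), you gain or lose one row.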

**Row and column with loc**

Get the company name of customer 1005.

```python
df_work.loc[1005, 'CompanyName']
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Multiple rows and columns with loc**

```python
df_work.loc[1003:1006, ['CompanyName', 'MonthlyRevenue']]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Position-based with iloc**

First 3 rows, first 3 columns.

```python
df_work.iloc[0:3, 0:3]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Specific customers with loc**

Select non-contiguous customers by ID.

```python
df_work.loc[[1001, 1010, 1025], ['CompanyName', 'Industry']]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 1.4 Reordering and Sorting Indexes

Sorting is NOT cosmetic. Sorting affects:
- Performance of lookups (sorted indexes are faster)
- Correctness of slicing (on an unsorted index, a label slice follows row order rather than label order, and can come back empty or raise an error)
- Visual clarity for stakeholders
- Merge operations between DataFrames
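
The slicing point is easiest to see on a tiny example. This sketch uses three invented IDs stored out of order and compares the same `loc` slice before and after `sort_index()`.

```python
import pandas as pd

# Invented IDs, deliberately out of order.
toy = pd.DataFrame(
    {'MonthlyRevenue': [300, 100, 200]},
    index=pd.Index([1003, 1001, 1002], name='CustomerID'),
)

print(toy.index.is_monotonic_increasing)  # False: the index is unsorted

# On an unsorted (but unique) index, a label slice runs from the row
# position of the first label to the row position of the second label.
unsorted_slice = toy.loc[1001:1003]

# After sorting, the same slice means "all IDs from 1001 through 1003".
sorted_slice = toy.sort_index().loc[1001:1003]

print(len(unsorted_slice), len(sorted_slice))
```

In the unsorted table 1003 sits above 1001 in row order, so the first slice comes back empty, while the sorted version returns all three customers. Checking `index.is_monotonic_increasing` before slicing by label is a cheap safeguard.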

**Sort by index (ascending)**

```python
df_sorted_idx = df_work.sort_index()
df_sorted_idx.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Sort by index (descending)**

```python
df_sorted_idx_desc = df_work.sort_index(ascending=False)
df_sorted_idx_desc.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Sort by values (MonthlyRevenue - highest first)**

```python
df_sorted_val = df_work.sort_values('MonthlyRevenue', ascending=False)
df_sorted_val[['CompanyName', 'MonthlyRevenue']].head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Multi-column sort**

Sort by Industry (A-Z), then by MonthlyRevenue (highest first).

```python
df_sorted_multi = df_work.sort_values(['Industry', 'MonthlyRevenue'], ascending=[True, False])
df_sorted_multi[['CompanyName', 'Industry', 'MonthlyRevenue']].head(15)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Swap levels in multi-index**

```python
df_multi = df.set_index(['Industry', 'SubscriptionPlan']).sort_index()
df_swapped = df_multi.swaplevel()
df_swapped.sort_index().head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 1.5 Reshaping and Pivoting Indexes

Raw data is usually in LONG format: one row per observation.
Executives prefer WIDE format: easier to compare across categories.

Pivot transforms your index structure:
- Long → Wide: One row per transaction → One row per category
- Indexes become columns, or columns become indexes
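
Here is a small sketch of the long-to-wide move on invented data (five made-up customers, with column names borrowed from this module), so you can compare the shapes before and after.

```python
import pandas as pd

# Long format: one row per customer. Values are invented.
long_df = pd.DataFrame({
    'Industry': ['Healthcare', 'Healthcare', 'Healthcare', 'Retail', 'Retail'],
    'SubscriptionPlan': ['Basic', 'Basic', 'Enterprise', 'Basic', 'Enterprise'],
    'MonthlyRevenue': [100, 300, 1500, 150, 1100],
})

# Wide format: one row per industry, one column per plan.
wide = pd.pivot_table(long_df, values='MonthlyRevenue', index='Industry',
                      columns='SubscriptionPlan', aggfunc='mean')

print(long_df.shape)
print(wide.shape)
print(wide)
```

The two Healthcare/Basic customers (100 and 300) collapse into a single cell holding their mean, 200. That is the trade: fewer rows to read, less detail per row.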

**Basic pivot table**

Average revenue by Industry × SubscriptionPlan.

```python
pivot_revenue = pd.pivot_table(
    df,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='mean'
)
pivot_revenue.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**What changed:**
- Each INDUSTRY is now a ROW (the index)
- Each SUBSCRIPTION PLAN is now a COLUMN
- The VALUES are AVERAGES of MonthlyRevenue

**Pivot with multiple aggregations**

```python
pivot_multi_agg = pd.pivot_table(
    df,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc=['mean', 'sum', 'count']
)
pivot_multi_agg.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Pivot with multiple values**

```python
pivot_multi_val = pd.pivot_table(
    df,
    values=['MonthlyRevenue', 'SeatCount'],
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='mean'
)
pivot_multi_val.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Why executives prefer pivoted output:**
- One row per category = easy comparison
- Columns = natural left-to-right reading
- No scrolling through hundreds of rows
- Matches how they think about business segments

---
# PART 2: Group By and Aggregation (Pattern Recognition)

Individual rows rarely drive decisions on their own.

No executive asks: "What's the revenue for customer 1023?"
They ask: "What's our average revenue BY industry?"

Group By converts ACTIVITY into INSIGHT:
- Rows → Categories
- Transactions → Summaries
- Noise → Signal

This is where data becomes information.

---
## 2.1 Basics of Group By

Group By has three steps:
1. **SPLIT**: Divide data into groups by category
2. **APPLY**: Run a function on each group
3. **COMBINE**: Put results back together

Think of it as: "For each [category], calculate [metric]"
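
The three steps can be spelled out by hand to show what the one-line `groupby` call is doing for you. This is a teaching sketch on invented data, not how you would write it in practice.

```python
import pandas as pd

# Invented revenue figures.
toy = pd.DataFrame({
    'Industry': ['Healthcare', 'Retail', 'Healthcare', 'Retail'],
    'MonthlyRevenue': [1000, 400, 2000, 600],
})

# SPLIT + APPLY: walk over the groups and compute one number per group.
results = {}
for industry, group in toy.groupby('Industry'):
    results[industry] = group['MonthlyRevenue'].mean()

# COMBINE: collect the per-group results into a single Series.
manual = pd.Series(results, name='MonthlyRevenue')

# The one-liner performs the same three steps.
one_liner = toy.groupby('Industry')['MonthlyRevenue'].mean()

print(manual)
print(one_liner)
```

Both versions give 1500 for Healthcare and 500 for Retail. The loop is only there to make the mental model visible.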

**Count customers by Industry**

```python
df.groupby('Industry')['CustomerID'].count()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Mean revenue by Industry**

```python
df.groupby('Industry')['MonthlyRevenue'].mean().round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Sum of revenue by Industry**

```python
df.groupby('Industry')['MonthlyRevenue'].sum()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Mean revenue by SubscriptionPlan**

```python
df.groupby('SubscriptionPlan')['MonthlyRevenue'].mean().round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 2.2 Average Revenue Case

Business question: "What's our average revenue by customer segment?"

This informs:
- Where to invest sales resources
- Which segments are underperforming
- Pricing strategy validation

**Average revenue by Industry (sorted)**

```python
avg_rev_industry = df.groupby('Industry')['MonthlyRevenue'].mean().sort_values(ascending=False)
avg_rev_industry.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Average revenue by SubscriptionPlan (sorted)**

```python
avg_rev_plan = df.groupby('SubscriptionPlan')['MonthlyRevenue'].mean().sort_values(ascending=False)
avg_rev_plan.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Decision this informs:**
- Enterprise customers pay roughly 6x what Basic customers pay
- Should we focus sales on upgrading Basic → Standard → Enterprise?
- Are there underserved industries we should target?

---
## 2.3 Aggregations with Group By (.agg())

`.agg()` lets you run MULTIPLE aggregations at once.

This is how you build summary tables for reports.

**Multiple metrics on one column**

```python
df.groupby('Industry')['MonthlyRevenue'].agg(['count', 'mean', 'sum', 'min', 'max']).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Multiple metrics on multiple columns**

```python
df.groupby('Industry').agg({
    'MonthlyRevenue': ['count', 'mean', 'sum'],
    'SeatCount': ['mean', 'sum'],
    'NPS_Score': 'mean'
}).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Named aggregations (cleaner column names)**

```python
df.groupby('Industry').agg(
    customer_count=('CustomerID', 'count'),
    avg_revenue=('MonthlyRevenue', 'mean'),
    total_revenue=('MonthlyRevenue', 'sum'),
    avg_seats=('SeatCount', 'mean'),
    avg_nps=('NPS_Score', 'mean')
).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
## 2.4 Usage-Based Analysis (Proxy Metrics)

**ASSUMPTION:** AvgWeeklyLogins represents product engagement level. Higher logins = more engaged customer = healthier account.

**Analysts must STATE ASSUMPTIONS EXPLICITLY.**

Why? Because:
- Different stakeholders may interpret metrics differently
- Assumptions affect conclusions
- Documented assumptions can be validated or challenged

Always write: "This analysis assumes X represents Y."

**Average weekly logins by Industry**

```python
df.groupby('Industry')['AvgWeeklyLogins'].mean().sort_values(ascending=False).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Average weekly logins by SubscriptionPlan**

```python
df.groupby('SubscriptionPlan')['AvgWeeklyLogins'].mean().sort_values(ascending=False).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Logins per seat (normalized metric)**

This normalizes for company size. Create a new column first.

```python
df['LoginsPerSeat'] = df['AvgWeeklyLogins'] / df['SeatCount']
df.groupby('Industry')['LoginsPerSeat'].mean().sort_values(ascending=False).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Insight:**
Enterprise customers log in more, but is it because they have more seats?
Logins per seat shows engagement INTENSITY, not just volume.
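
The volume-versus-intensity point, sketched on three invented companies. The guard against a seat count of 0 is an assumption about what the data might contain, not something this module states about TechFlow.csv.

```python
import numpy as np
import pandas as pd

# Invented companies: one large, one small, one with no seats recorded.
toy = pd.DataFrame({
    'CompanyName': ['BigCo', 'SmallCo', 'EmptyCo'],
    'AvgWeeklyLogins': [200, 30, 0],
    'SeatCount': [100, 5, 0],
})

# Treat a seat count of 0 as missing so the division yields NaN, not inf.
seats = toy['SeatCount'].replace(0, np.nan)
toy['LoginsPerSeat'] = toy['AvgWeeklyLogins'] / seats

print(toy)
```

BigCo has far more logins in total (200 vs 30), but SmallCo is three times as intense per seat (6.0 vs 2.0). EmptyCo becomes NaN, which `mean()` skips by default instead of distorting the group average.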

---
## 2.5 Group By on Multiple Columns

Single-dimension grouping tells you: "What's different across X?"
Multi-dimension grouping tells you: "What's different across X AND Y?"

This creates **SEGMENTATION TREES**:
- First split by Industry
- Then split by SubscriptionPlan
- Now you see patterns at the INTERSECTION

Dimensions COMPOUND insight. Each dimension reveals new patterns.

**Two-dimensional grouping**

```python
multi_group = df.groupby(['Industry', 'SubscriptionPlan'])['MonthlyRevenue'].mean()
multi_group.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Unstack for table format**

```python
multi_group.unstack().round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Three-dimensional analysis**

```python
df.groupby(['Industry', 'SubscriptionPlan', 'Cancelled'])['CustomerID'].count()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Segmentation insight:**
Multi-dimensional grouping reveals INTERSECTIONAL patterns.
Example: Healthcare + Enterprise + Not Cancelled = your best segment?
Dimensions compound insight, but too many dimensions = confusion.
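
One practical check for "too many dimensions" is to look at how many customers sit behind each average. A minimal sketch on invented data follows. The threshold of 2 is arbitrary and only there to illustrate the idea.

```python
import pandas as pd

# Invented customers, unevenly spread across segments.
toy = pd.DataFrame({
    'Industry': ['Healthcare'] * 4 + ['Retail'] * 2,
    'SubscriptionPlan': ['Basic', 'Basic', 'Basic', 'Enterprise',
                         'Basic', 'Enterprise'],
    'MonthlyRevenue': [100, 120, 140, 1500, 150, 1100],
})

summary = toy.groupby(['Industry', 'SubscriptionPlan']).agg(
    customers=('MonthlyRevenue', 'count'),
    avg_revenue=('MonthlyRevenue', 'mean'),
)

# Flag segments whose average rests on very few customers.
# The cutoff (fewer than 2) is an arbitrary choice for this example.
summary['thin_segment'] = summary['customers'] < 2
print(summary)
```

Three of the four segments here contain a single customer, so their "average" is just one company's revenue. Each added dimension makes this more likely.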

---
## 2.6 Transform with Group By

Regular aggregation COLLAPSES rows into group summaries.
Transform PRESERVES rows but ADDS group-level context.

Use case: "For each customer, show their revenue AND the industry average."

This is an ADVANCED technique that enables:
- Comparing individuals to their group
- Calculating percentages of group totals
- Normalizing data within groups
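
The difference between collapsing and preserving rows shows up in the length of the result. A quick sketch on invented data:

```python
import pandas as pd

# Invented revenue figures.
toy = pd.DataFrame({
    'Industry': ['Healthcare', 'Healthcare', 'Retail', 'Retail', 'Retail'],
    'MonthlyRevenue': [1000, 3000, 300, 600, 900],
})

# Aggregation: one value per group.
collapsed = toy.groupby('Industry')['MonthlyRevenue'].mean()

# Transform: one value per ORIGINAL row, repeated within each group.
broadcast = toy.groupby('Industry')['MonthlyRevenue'].transform('mean')

print(len(toy), len(collapsed), len(broadcast))
print(broadcast.tolist())
```

The aggregated result has two entries (one per industry). The transformed result has five, aligned with the original rows, which is what lets you assign it straight back as a new column.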

**Add industry average to each row**

```python
df['IndustryAvgRevenue'] = df.groupby('Industry')['MonthlyRevenue'].transform('mean')
df[['CompanyName', 'Industry', 'MonthlyRevenue', 'IndustryAvgRevenue']].head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Calculate difference from industry average**

```python
df['RevenueVsIndustryAvg'] = df['MonthlyRevenue'] - df['IndustryAvgRevenue']
df[['CompanyName', 'Industry', 'MonthlyRevenue', 'IndustryAvgRevenue', 'RevenueVsIndustryAvg']].head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Percentage of industry total**

```python
df['IndustryTotalRevenue'] = df.groupby('Industry')['MonthlyRevenue'].transform('sum')
df['PctOfIndustryRevenue'] = (df['MonthlyRevenue'] / df['IndustryTotalRevenue']) * 100
df[['CompanyName', 'Industry', 'MonthlyRevenue', 'PctOfIndustryRevenue']].head(10).round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Rank within industry**

```python
df['IndustryRevenueRank'] = df.groupby('Industry')['MonthlyRevenue'].rank(ascending=False)
df[['CompanyName', 'Industry', 'MonthlyRevenue', 'IndustryRevenueRank']].sort_values(['Industry', 'IndustryRevenueRank']).head(20)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Why transform is powerful:**
- You keep individual rows (no collapse)
- Each row now has GROUP CONTEXT
- Compare any customer to their segment
- Find outliers: Who's way above/below their group average?
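
One possible way to act on the last bullet is a z-score within each group, built from two `transform` calls. This technique is not covered elsewhere in the module. The data is invented and the cutoff of 1.5 is an arbitrary assumption, so treat this as a sketch rather than a recommended rule.

```python
import pandas as pd

# Invented companies; HC-5 is deliberately far above its industry peers.
toy = pd.DataFrame({
    'CompanyName': ['HC-1', 'HC-2', 'HC-3', 'HC-4', 'HC-5',
                    'RT-1', 'RT-2', 'RT-3', 'RT-4', 'RT-5'],
    'Industry': ['Healthcare'] * 5 + ['Retail'] * 5,
    'MonthlyRevenue': [1000, 1100, 900, 1000, 3000,
                       200, 220, 210, 190, 230],
})

grouped = toy.groupby('Industry')['MonthlyRevenue']

# Distance from the industry mean, in units of the industry's std deviation.
toy['ZWithinIndustry'] = (
    (toy['MonthlyRevenue'] - grouped.transform('mean'))
    / grouped.transform('std')
)

# Arbitrary cutoff for illustration only.
outliers = toy[toy['ZWithinIndustry'].abs() > 1.5]
print(outliers[['CompanyName', 'Industry', 'MonthlyRevenue']])
```

Only HC-5 is flagged. A single global cutoff on raw revenue would have missed the point: 230 is high for this invented Retail group and tiny for Healthcare, so "unusual" only makes sense relative to the segment.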

---
# PRACTICE: Answer These Business Questions

### Q1: What is the average SeatCount by SubscriptionPlan?

In [None]:
# Your answer:


### Q2: Create a pivot table showing average NPS_Score by Industry and ContractType

In [None]:
# Your answer:


### Q3: Which industry has the highest total MonthlyRevenue?

In [None]:
# Your answer:


### Q4: Use loc to get the MonthlyRevenue of customer 1012

In [None]:
# Your answer:


### Q5: Group by SupportTier and calculate count, mean, and max of SupportTicketsRaised

In [None]:
# Your answer:


### Q6: Add a column showing each customer's TenureMonths compared to their industry average (use transform)

In [None]:
# Your answer:


---
# CHEAT SHEET

## Indexing

| What you want | Code |
|---------------|------|
| Set index | `df.set_index('col')` |
| Reset index | `df.reset_index()` |
| Multi-level index | `df.set_index(['col1', 'col2'])` |
| Row by label | `df.loc[label]` |
| Row by position | `df.iloc[pos]` |
| Row + column by label | `df.loc[label, 'col']` |
| Slice by label (inclusive) | `df.loc[start:end]` |
| Slice by position (exclusive) | `df.iloc[start:end]` |
| Sort by index | `df.sort_index()` |
| Sort by values | `df.sort_values('col')` |
| Swap index levels | `df.swaplevel()` |

## Pivoting

| What you want | Code |
|---------------|------|
| Pivot table | `pd.pivot_table(df, values='val', index='row', columns='col', aggfunc='mean')` |
| Unstack | `grouped_series.unstack()` |

## Group By

| What you want | Code |
|---------------|------|
| Group + count | `df.groupby('col')['other'].count()` |
| Group + mean | `df.groupby('col')['other'].mean()` |
| Group + sum | `df.groupby('col')['other'].sum()` |
| Multiple aggs | `df.groupby('col')['other'].agg(['count', 'mean', 'sum'])` |
| Multiple columns | `df.groupby(['col1', 'col2'])['other'].mean()` |
| Named aggs | `df.groupby('col').agg(name=('col', 'func'))` |
| Transform | `df.groupby('col')['other'].transform('mean')` |
| Rank within group | `df.groupby('col')['other'].rank()` |

---
# Module 2 Summary

## Indexing is about CONTROL
- The index is your handle on the data - make it meaningful
- `set_index()` / `reset_index()` - switch between views
- Multi-indexing - powerful but requires discipline
- `loc` vs `iloc` - labels vs positions (memorize the difference)
- Sorting indexes - not cosmetic, affects correctness
- Pivoting - reshape for human readability

## Group By is about THINKING
- Individual rows → Category summaries
- Split → Apply → Combine (the mental model)
- `.agg()` for multiple metrics in one pass
- Multi-column grouping = segmentation trees
- `transform()` = group context at row level
- State your assumptions explicitly

## What separates junior from senior analysts
- **Junior**: Gets data, makes a chart
- **Senior**: Controls data structure, asks "what question am I answering?"

- **Junior**: Uses integer indexes blindly
- **Senior**: Sets meaningful indexes, understands loc vs iloc

- **Junior**: Groups by one dimension
- **Senior**: Builds segmentation trees, uses transform for context

- **Junior**: Returns numbers
- **Senior**: Returns insight with stated assumptions

---

**You now understand Pandas well enough to CONTROL data, not fight it.**

Next: Module 3 - Data Cleaning & Transformation