# Data Analysis - Module 2
## Indexing, Filtering, Grouping & Aggregation

**Your Role:** Data Analyst at a B2B SaaS Company

**Your Mission:** Control your data, don't fight it.

**Why this matters:**
- Indexing: Instant lookup by business meaning (CustomerID, not row number)
- Filtering: Answer "which customers..." questions
- GroupBy: Answer "how does X vary by Y" questions
- These skills separate junior analysts from senior ones

**This module covers:**
- Setting and using indexes effectively
- Boolean filtering (single and multiple conditions)
- GroupBy operations and aggregations
- Multiple aggregations and named aggregations
- Pivot tables
- Working with multiple datasets

**Dataset files used:**
- `TechFlow.csv` - Full 50-customer dataset
- `customers_small.csv` - 10-customer learning dataset
- `support_tickets.tsv` - Support ticket data
- `nps_surveys.csv` - NPS survey responses

**Time to complete:** ~75 minutes

---

# SETUP: Import and Load Data

In [None]:
# Standard imports - run this cell first!
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 200)

# Load datasets
df = pd.read_csv('../dataset/TechFlow.csv')
df_small = pd.read_csv('../dataset/customers_small.csv')
tickets = pd.read_csv('../dataset/support_tickets.tsv', sep='\t')
surveys = pd.read_csv('../dataset/nps_surveys.csv')

print("Datasets loaded:")
print(f"  Main data: {df.shape}")
print(f"  Small data: {df_small.shape}")
print(f"  Tickets: {tickets.shape}")
print(f"  Surveys: {surveys.shape}")

---
# PART 1: Indexing - Control Your Data

The **index** holds your row labels. The default integer index (0, 1, 2, ...) only records row position and carries no business meaning.

## 1.1 Setting an Index

**Default index is meaningless**

```python
# Look at the data with default integer index
df_small.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Set a meaningful index**

CustomerID uniquely identifies each row - make it the index.

```python
# Set CustomerID as the index
df_indexed = df_small.set_index('CustomerID')

df_indexed
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
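
Before promoting a column to the index, it can help to confirm that it really is unique. The sketch below uses a small made-up frame (the company names are hypothetical; only the column names mirror `customers_small.csv`) and relies on `Series.is_unique` plus the `verify_integrity` flag of `set_index`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror customers_small.csv
customers = pd.DataFrame({
    'CustomerID': [1001, 1002, 1003],
    'CompanyName': ['Acme', 'Globex', 'Initech'],
})

# True only when no CustomerID value repeats
ids_are_unique = customers['CustomerID'].is_unique
print(f"CustomerID unique: {ids_are_unique}")

# verify_integrity=True raises ValueError if the new index has duplicates
indexed = customers.set_index('CustomerID', verify_integrity=True)

# A frame with a repeated ID is rejected
with_duplicate = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
try:
    with_duplicate.set_index('CustomerID', verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(f"Duplicate rejected: {duplicate_rejected}")
```

A duplicated index still works for lookups, but `.loc[label]` then returns several rows instead of one, so checking up front avoids surprises later.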


**Index during load**

Set the index when you read the file.

```python
# Set index while loading
df_loaded = pd.read_csv('../dataset/customers_small.csv', index_col='CustomerID')

df_loaded.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 1.2 Using the Index

**Lookup by index value with .loc**

```python
# Get customer 1001
df_indexed.loc[1001]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Get specific columns for a customer**

```python
# Customer 1004's name and revenue
df_indexed.loc[1004, ['CompanyName', 'MonthlyRevenue']]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Lookup multiple customers**

```python
# Get customers 1001, 1004, 1009
df_indexed.loc[[1001, 1004, 1009]]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Slice by index range**

```python
# Customers from 1003 to 1007 (inclusive!)
df_indexed.loc[1003:1007]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
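
Label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. A minimal sketch on made-up revenue figures (the values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame indexed by CustomerID
toy = pd.DataFrame(
    {'MonthlyRevenue': [120, 250, 90, 600, 310]},
    index=pd.Index([1001, 1002, 1003, 1004, 1005], name='CustomerID'),
)

# Label-based slice: the end label 1004 is included
by_label = toy.loc[1002:1004]

# Position-based slice: the end position 3 is excluded
by_position = toy.iloc[1:3]

print(list(by_label.index))     # [1002, 1003, 1004]
print(list(by_position.index))  # [1002, 1003]
```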


## 1.3 Reset Index

**Move index back to column**

```python
# Reset to default integer index
df_reset = df_indexed.reset_index()

df_reset.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Drop the index entirely**

```python
# Reset and discard the old index (CustomerID is not kept as a column)
df_dropped = df_indexed.reset_index(drop=True)

df_dropped.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 2: Boolean Filtering

Answer "which customers..." questions.

## 2.1 Single Condition Filters

**How filtering works**

1. Create a boolean mask (True/False for each row)
2. Use the mask to select rows

```python
# Step 1: Create boolean mask
mask = df_small['MonthlyRevenue'] >= 500
print("Boolean mask:")
print(mask)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Apply the mask**

```python
# Step 2: Apply mask to get matching rows
high_revenue = df_small[mask]

high_revenue
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**One-liner (common pattern)**

```python
# Combine both steps
df_small[df_small['MonthlyRevenue'] >= 500]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Filter by string equality**

```python
# Enterprise customers only
df_small[df_small['SubscriptionPlan'] == 'Enterprise']
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Filter by NOT equal**

```python
# All except Basic plan
df_small[df_small['SubscriptionPlan'] != 'Basic']
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 2.2 Multiple Condition Filters

**AND conditions (&)**

Both conditions must be True.

```python
# Enterprise AND revenue >= 500
df_small[
    (df_small['SubscriptionPlan'] == 'Enterprise') & 
    (df_small['MonthlyRevenue'] >= 500)
]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**OR conditions (|)**

At least one condition must be True.

```python
# Enterprise OR high revenue
df_small[
    (df_small['SubscriptionPlan'] == 'Enterprise') | 
    (df_small['MonthlyRevenue'] >= 500)
]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**NOT condition (~)**

```python
# NOT cancelled (same rows as Cancelled == 0 when the column holds only 0 and 1)
df_small[~(df_small['Cancelled'] == 1)]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


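**IMPORTANT:** Always wrap each condition in parentheses! In Python, `&` and `|` bind more tightly than comparison operators such as `==` and `>=`, so an unparenthesized filter is not parsed the way it reads.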
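
To see why the parentheses matter, here is a small sketch on made-up data (the values are hypothetical). Without parentheses, Python evaluates `0 & toy['MonthlyRevenue']` first and then builds a chained comparison, which fails because a whole Series has no single truth value.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the lesson
toy = pd.DataFrame({
    'Cancelled': [0, 0, 1, 0],
    'MonthlyRevenue': [600, 150, 800, 500],
})

# Missing parentheses: & runs before == and >=, so the filter errors out
try:
    toy[toy['Cancelled'] == 0 & toy['MonthlyRevenue'] >= 500]
    raised = False
except (ValueError, TypeError) as err:
    raised = True
    print(f"Unparenthesized filter failed: {type(err).__name__}")

# With parentheses, each comparison is evaluated first, then combined
active_high = toy[(toy['Cancelled'] == 0) & (toy['MonthlyRevenue'] >= 500)]
print(active_high)
```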

## 2.3 Advanced Filtering

**isin() - Match multiple values**

```python
# Technology or Finance industries
df_small[df_small['Industry'].isin(['Technology', 'Finance'])]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**between() - Range filter**

```python
# Revenue between 100 and 300 (both endpoints included by default)
df_small[df_small['MonthlyRevenue'].between(100, 300)]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
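
`between()` includes both endpoints unless told otherwise. The sketch below, which assumes pandas 1.3 or later for the string form of `inclusive`, compares the default with a strict range on made-up values.

```python
import pandas as pd

# Hypothetical revenue values chosen to sit on and around the boundaries
revenue = pd.Series([100, 150, 300, 301], name='MonthlyRevenue')

# Default behaviour: 100 and 300 both count as "between"
both_ends = revenue.between(100, 300)

# Strict range: boundary values are excluded
neither_end = revenue.between(100, 300, inclusive='neither')

print(both_ends.tolist())    # [True, True, True, False]
print(neither_end.tolist())  # [False, True, False, False]
```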


**str.contains() - Text search**

```python
# Companies with 'Care' in name
df_small[df_small['CompanyName'].str.contains('Care', case=False)]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
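
If the text column can contain missing values, the mask from `str.contains()` may contain missing values too, which can break the filter. One common way to handle this, sketched here on hypothetical company names, is to pass `na=False` so missing names count as "no match".

```python
import pandas as pd

# Hypothetical names; the middle entry is missing on purpose
names = pd.Series(['HealthCare Plus', None, 'DataFlow Inc'], name='CompanyName')

# na=False treats missing names as non-matching instead of leaving a gap
mask = names.str.contains('care', case=False, na=False)
matches = names[mask]

print(mask.tolist())     # [True, False, False]
print(matches.tolist())  # ['HealthCare Plus']
```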


**query() method - SQL-like syntax**

```python
# Same filter as the AND example in 2.2, written as a query string
df_small.query('MonthlyRevenue >= 500 and SubscriptionPlan == "Enterprise"')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
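
Thresholds often live in Python variables rather than in the query string. `query()` can read them with the `@` prefix; the variable names `min_revenue` and `plan` below are illustrative, and the data is made up.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names
toy = pd.DataFrame({
    'SubscriptionPlan': ['Enterprise', 'Basic', 'Enterprise'],
    'MonthlyRevenue': [700, 900, 200],
})

min_revenue = 500
plan = 'Enterprise'

# @name pulls the value of a Python variable into the query expression
result = toy.query('MonthlyRevenue >= @min_revenue and SubscriptionPlan == @plan')
print(result)
```

This keeps the filter text fixed while the thresholds change, which is handy when the same question is asked for several plans or cut-offs.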


---
# PART 3: GroupBy - Think in Aggregates

Answer "how does X vary by Y" questions.

## 3.1 Basic GroupBy

**GroupBy creates groups**

```python
# Group by SubscriptionPlan
grouped = df_small.groupby('SubscriptionPlan')

print(f"Type: {type(grouped)}")
print(f"Number of groups: {grouped.ngroups}")
print(f"Groups: {list(grouped.groups.keys())}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Aggregate a column**

```python
# Average revenue by plan
df_small.groupby('SubscriptionPlan')['MonthlyRevenue'].mean()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Common aggregation functions**

```python
# Count, sum, and max shown here; mean, median, min, and std work the same way
print("Count by plan:")
print(df_small.groupby('SubscriptionPlan')['CustomerID'].count())

print("\nTotal revenue by plan:")
print(df_small.groupby('SubscriptionPlan')['MonthlyRevenue'].sum())

print("\nMax revenue by plan:")
print(df_small.groupby('SubscriptionPlan')['MonthlyRevenue'].max())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
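
One detail worth knowing about counting: `count()` skips missing values, `size()` counts every row, and `nunique()` counts distinct non-missing values. A minimal sketch on made-up data (the `Pro` plan and the scores are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing NPS scores
toy = pd.DataFrame({
    'SubscriptionPlan': ['Basic', 'Basic', 'Pro', 'Pro', 'Pro'],
    'NPS_Score': [8, np.nan, 9, 9, np.nan],
})

by_plan = toy.groupby('SubscriptionPlan')['NPS_Score']

rows_per_plan = by_plan.size()       # every row, missing or not
answered_per_plan = by_plan.count()  # non-missing values only
distinct_scores = by_plan.nunique()  # distinct non-missing values

print(rows_per_plan.to_dict())      # {'Basic': 2, 'Pro': 3}
print(answered_per_plan.to_dict())  # {'Basic': 1, 'Pro': 2}
print(distinct_scores.to_dict())    # {'Basic': 1, 'Pro': 1}
```

Counting a column with no gaps, such as an ID, gives the same answer as `size()`; the difference only shows up when values are missing.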


## 3.2 Multiple Aggregations

**Aggregate multiple columns**

```python
# Multiple columns, single aggregation
df_small.groupby('SubscriptionPlan')[['MonthlyRevenue', 'SeatCount']].mean()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Multiple aggregations with agg()**

```python
# Single column, multiple aggregations
df_small.groupby('SubscriptionPlan')['MonthlyRevenue'].agg(['count', 'sum', 'mean', 'min', 'max'])
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Named aggregations (recommended)**

Clear, named output columns.

```python
# Named aggregations
df_small.groupby('SubscriptionPlan').agg(
    customer_count=('CustomerID', 'count'),
    total_revenue=('MonthlyRevenue', 'sum'),
    avg_revenue=('MonthlyRevenue', 'mean'),
    avg_seats=('SeatCount', 'mean'),
    avg_nps=('NPS_Score', 'mean')
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 3.3 GroupBy Multiple Columns

**Group by two columns**

```python
# Average revenue by Plan AND Cancelled status
df_small.groupby(['SubscriptionPlan', 'Cancelled'])['MonthlyRevenue'].mean()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Unstack for readability**

```python
# Convert to table format
df_small.groupby(['SubscriptionPlan', 'Cancelled'])['MonthlyRevenue'].mean().unstack()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
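
When a combination of groups never occurs, `unstack()` leaves a missing value in that cell. For counts, `fill_value=0` is usually the sensible replacement. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data: no Enterprise customer has cancelled
toy = pd.DataFrame({
    'SubscriptionPlan': ['Basic', 'Basic', 'Enterprise'],
    'Cancelled': [0, 1, 0],
})

# Rows per (plan, cancelled) pair, as a Series with a two-level index
counts = toy.groupby(['SubscriptionPlan', 'Cancelled']).size()

# Move Cancelled into the columns; absent pairs become 0 instead of NaN
table = counts.unstack(fill_value=0)
print(table)
```

For averages, filling with 0 would be misleading (no customers is not the same as zero revenue), so leaving the gap is often the better choice there.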


---
# PART 4: Working with the Full Dataset

Apply these skills to real business questions.

## 4.1 Business Analysis Examples

**Q: Which industries have the highest average revenue?**

```python
# Revenue by industry, sorted
df.groupby('Industry')['MonthlyRevenue'].mean().sort_values(ascending=False)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Q: Comprehensive breakdown by subscription plan?**

```python
# Full analysis by plan
plan_analysis = df.groupby('SubscriptionPlan').agg(
    customers=('CustomerID', 'count'),
    total_revenue=('MonthlyRevenue', 'sum'),
    avg_revenue=('MonthlyRevenue', 'mean'),
    avg_tenure=('TenureMonths', 'mean'),
    avg_nps=('NPS_Score', 'mean'),
    churn_rate=('Cancelled', 'mean')
)

plan_analysis
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Q: Find at-risk customers (low NPS, not cancelled)**

```python
# At-risk: NPS < 7 and not cancelled
at_risk = df[
    (df['NPS_Score'] < 7) & 
    (df['Cancelled'] == 0)
][['CompanyName', 'Industry', 'MonthlyRevenue', 'NPS_Score']]

print(f"At-risk customers: {len(at_risk)}")
at_risk
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Q: Top 10 customers by revenue?**

```python
# Sort and get top 10
df.nlargest(10, 'MonthlyRevenue')[['CompanyName', 'Industry', 'MonthlyRevenue', 'SubscriptionPlan']]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 4.2 Churn Analysis

**Churn rate by industry**

```python
# Churn rate per industry
churn_by_industry = df.groupby('Industry').agg(
    total_customers=('CustomerID', 'count'),
    churned=('Cancelled', 'sum'),
    churn_rate=('Cancelled', 'mean')
).sort_values('churn_rate', ascending=False)

churn_by_industry
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
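
Why does the mean of `Cancelled` give a churn rate? When the column holds only 0 and 1, its sum is the number of churned customers, so sum divided by count is the same as the mean. A sketch on made-up data (the industries and flags are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: Cancelled is 1 for churned, 0 for active
toy = pd.DataFrame({
    'Industry': ['Tech', 'Tech', 'Tech', 'Tech', 'Finance', 'Finance'],
    'Cancelled': [1, 0, 0, 0, 1, 0],
})

churn = toy.groupby('Industry').agg(
    total_customers=('Cancelled', 'count'),
    churned=('Cancelled', 'sum'),
    churn_rate=('Cancelled', 'mean'),
)

# Manual check: churned / total matches the mean of the 0/1 column
churn['manual_rate'] = churn['churned'] / churn['total_customers']
print(churn)
```

Here Tech has 1 churned customer out of 4 (0.25) and Finance has 1 out of 2 (0.5). The shortcut only holds for a clean 0/1 column; other encodings such as 'Yes'/'No' would need converting first.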


**Compare churned vs active customers**

```python
# Churned vs Active comparison
df.groupby('Cancelled').agg(
    count=('CustomerID', 'count'),
    avg_revenue=('MonthlyRevenue', 'mean'),
    avg_tenure=('TenureMonths', 'mean'),
    avg_nps=('NPS_Score', 'mean'),
    avg_logins=('AvgWeeklyLogins', 'mean')
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 5: Pivot Tables

GroupBy on steroids - create summary tables.

**Basic pivot table**

```python
# Average revenue by Industry (rows) and Plan (columns)
pd.pivot_table(
    df,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='mean'
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Pivot with margins (totals)**

```python
# Add row/column totals
pd.pivot_table(
    df,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
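
With `aggfunc='sum'` the margins are plain totals. With `aggfunc='mean'` they are worth a second look: the margin is the mean over the underlying rows, not the mean of the cells in the table. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data: Tech has one Basic and two Enterprise customers
toy = pd.DataFrame({
    'Industry': ['Tech', 'Tech', 'Tech', 'Finance', 'Finance'],
    'SubscriptionPlan': ['Basic', 'Enterprise', 'Enterprise', 'Basic', 'Enterprise'],
    'MonthlyRevenue': [100, 300, 500, 200, 600],
})

pivot = pd.pivot_table(
    toy,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='mean',
    margins=True,
    margins_name='Total',
)

# Margin for Tech: mean of its three customers (100, 300, 500)
tech_total = pivot.loc['Tech', 'Total']

# Mean of the two Tech cells (100 and 400) is a different number
tech_mean_of_cells = pivot.loc['Tech', ['Basic', 'Enterprise']].mean()

print(tech_total, tech_mean_of_cells)
```

The two numbers differ whenever the groups have different sizes, so the margin is the one to quote as "average revenue for Tech".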


**Pivot with multiple aggregations**

```python
# Count and average
pd.pivot_table(
    df,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc=['count', 'mean']
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
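
A basic pivot table can also be rebuilt from tools covered earlier: group by both columns, aggregate, then unstack. The sketch below checks this on made-up data; it is an illustration of the relationship, not a claim about how pandas implements `pivot_table` internally.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names
toy = pd.DataFrame({
    'Industry': ['Tech', 'Tech', 'Tech', 'Finance', 'Finance'],
    'SubscriptionPlan': ['Basic', 'Enterprise', 'Enterprise', 'Basic', 'Enterprise'],
    'MonthlyRevenue': [100, 300, 500, 200, 600],
})

pivot = pd.pivot_table(
    toy,
    values='MonthlyRevenue',
    index='Industry',
    columns='SubscriptionPlan',
    aggfunc='mean',
)

# Same table built from groupby + unstack
via_groupby = (
    toy.groupby(['Industry', 'SubscriptionPlan'])['MonthlyRevenue']
    .mean()
    .unstack()
)

print(pivot)
print(via_groupby)
```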


---
# PART 6: Working with Multiple Datasets

Combine filtering and grouping across datasets.

## 6.1 Analyze Support Tickets

**Explore tickets data**

```python
tickets
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Tickets by category**

```python
tickets.groupby('Category').agg(
    ticket_count=('TicketID', 'count'),
    avg_response=('ResponseMins', 'mean')
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Tickets by priority**

```python
tickets.groupby('Priority').agg(
    ticket_count=('TicketID', 'count'),
    avg_response=('ResponseMins', 'mean'),
    open_tickets=('Status', lambda x: (x == 'Open').sum())
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
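
The lambda that counts open tickets can also be written without a lambda: build a True/False column first, then sum it (True counts as 1). This is a sketch on made-up tickets; the helper column name `is_open` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical toy tickets using the lesson's column names
toy_tickets = pd.DataFrame({
    'TicketID': [1, 2, 3, 4],
    'Priority': ['High', 'High', 'Low', 'Low'],
    'Status': ['Open', 'Closed', 'Open', 'Open'],
})

summary = (
    toy_tickets
    # is_open is a helper column: True where the ticket is still open
    .assign(is_open=toy_tickets['Status'].eq('Open'))
    .groupby('Priority')
    .agg(
        ticket_count=('TicketID', 'count'),
        open_tickets=('is_open', 'sum'),
    )
)
print(summary)
```

Both forms give the same counts; the helper column tends to be easier to read and reuse once several conditions are involved.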


## 6.2 Analyze NPS Surveys

**Explore surveys data**

```python
surveys
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**NPS by customer (multiple surveys)**

```python
# Average NPS per customer
surveys.groupby('CustomerID').agg(
    survey_count=('SurveyID', 'count'),
    avg_nps=('NPS_Score', 'mean'),
    min_nps=('NPS_Score', 'min'),
    max_nps=('NPS_Score', 'max')
)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Classify NPS responses**

```python
# Add NPS category
def classify_nps(score):
    if score >= 9:
        return 'Promoter'
    elif score >= 7:
        return 'Passive'
    else:
        return 'Detractor'

surveys['NPS_Category'] = surveys['NPS_Score'].apply(classify_nps)

# Count by category
surveys['NPS_Category'].value_counts()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run
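
The same classification can be done on the whole column at once instead of row by row. The sketch below swaps in `np.select`, which checks conditions in order like `if`/`elif`; the scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores covering each category boundary
scores = pd.Series([10, 9, 8, 7, 6, 0], name='NPS_Score')

# Conditions are checked in order; the first match wins
categories = pd.Series(
    np.select(
        [scores >= 9, scores >= 7],
        ['Promoter', 'Passive'],
        default='Detractor',
    ),
    index=scores.index,
    name='NPS_Category',
)

print(categories.tolist())
```

On a 50-row dataset the speed difference is negligible, so `apply` with a named function remains a perfectly reasonable choice when readability matters more.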


---
# PRACTICE: Business Scenarios

### Q1: Find all Enterprise customers with revenue > 400

In [None]:
# Your answer:


### Q2: Calculate average NPS score by SubscriptionPlan

In [None]:
# Your answer:


### Q3: Find customer count and total revenue by Industry

In [None]:
# Your answer:


### Q4: Get Technology or Healthcare customers with NPS >= 8

In [None]:
# Your answer:


### Q5: Create pivot table of churn rate by Industry and Plan

In [None]:
# Your answer:


### Q6: Which customers have the most support tickets?

In [None]:
# Your answer:


### Q7: Find bottom 5 customers by average NPS from surveys

In [None]:
# Your answer:


---
# CHEAT SHEET

## Indexing
```python
# Set index
df.set_index('column')
pd.read_csv(file, index_col='column')

# Reset index
df.reset_index()
df.reset_index(drop=True)

# Lookup by index
df.loc[label]
df.loc[[label1, label2]]
df.loc[label, 'column']
```

## Filtering
```python
# Single condition
df[df['col'] > value]
df[df['col'] == 'text']

# Multiple conditions
df[(cond1) & (cond2)]  # AND
df[(cond1) | (cond2)]  # OR
df[~(condition)]       # NOT

# Advanced
df[df['col'].isin([a, b, c])]
df[df['col'].between(low, high)]
df[df['col'].str.contains('text')]
df.query('col > value')
```

## GroupBy
```python
# Basic
df.groupby('col')['values'].mean()
df.groupby('col')[['v1', 'v2']].sum()

# Multiple aggregations
df.groupby('col')['v'].agg(['count', 'mean', 'sum'])

# Named aggregations
df.groupby('col').agg(
    new_name=('column', 'function')
)

# Multiple grouping columns
df.groupby(['col1', 'col2'])['v'].mean()
```

## Pivot Tables
```python
pd.pivot_table(
    df,
    values='column',
    index='row_groups',
    columns='col_groups',
    aggfunc='mean',
    margins=True
)
```

## Sorting
```python
df.sort_values('col')
df.sort_values('col', ascending=False)
df.nlargest(n, 'col')
df.nsmallest(n, 'col')
```

---
## Module 2 Complete! 🎉

**You now know how to:**
- ✅ Set and use indexes for fast lookups
- ✅ Filter data with single and multiple conditions
- ✅ Use advanced filters (isin, between, str.contains)
- ✅ Group data with groupby()
- ✅ Apply multiple aggregations with agg()
- ✅ Create pivot tables for summary views
- ✅ Analyze multiple datasets

**Key Takeaways:**
1. Set meaningful indexes (CustomerID, not row numbers)
2. Always use parentheses in multi-condition filters
3. GroupBy → Aggregate → Sort is the analysis pattern
4. Named aggregations make results self-documenting
5. Pivot tables are GroupBy results laid out as a two-way table

**Next: Module 3 - Data Cleaning & Transformation**