# Data Analysis - Module 6
## Real-World Projects: Capstone

**Your Role:** Senior Data Analyst at TechFlow

**Your Mission:** Apply everything you've learned to solve real business problems.

**This module contains 3 progressively challenging projects:**

| Project | Difficulty | Departments Impacted |
|---------|------------|---------------------|
| 1. Customer Health Dashboard | Intermediate | Sales, CS, Finance |
| 2. Churn Prediction Analysis | Advanced | CS, Product, Executive |
| 3. Executive Business Review | Expert | All Departments |

**Skills Applied:**
- Data loading and cleaning (Modules 1 and 4)
- Filtering and grouping (Module 2)
- Joining multiple datasets (Module 3)
- Exploratory analysis (Module 4)
- Data visualization (Module 5)

---

# SETUP: Load All Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
%matplotlib inline

# Load all datasets
customers = pd.read_csv('../dataset/TechFlow.csv')
orders = pd.read_csv('../dataset/orders.csv')
products = pd.read_csv('../dataset/products.csv')
regions = pd.read_csv('../dataset/customer_regions.csv')
tickets = pd.read_csv('../dataset/support_tickets.tsv', sep='\t')
surveys = pd.read_csv('../dataset/nps_surveys.csv')
daily = pd.read_csv('../dataset/daily_metrics.csv')
daily['Date'] = pd.to_datetime(daily['Date'])

print("All datasets loaded successfully!")
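
Later cells read specific columns from `customers`, so a quick schema check right after loading can surface a renamed or missing column early. The sketch below is one possible way to do it. The helper name `missing_columns` and the tiny inline frame are hypothetical; the column list is taken from the columns this notebook references.

```python
import pandas as pd

# Columns that the cells in this notebook read from `customers`.
REQUIRED_COLUMNS = [
    'CustomerID', 'CompanyName', 'Industry', 'SubscriptionPlan', 'ContractType',
    'MonthlyRevenue', 'SeatCount', 'TenureMonths', 'AvgWeeklyLogins',
    'NPS_Score', 'SupportTicketsRaised', 'LastLoginDaysAgo', 'Cancelled',
]

def missing_columns(frame, required):
    """Return the required column names absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Tiny stand-in frame with made-up data; it lacks most of the required columns.
sample = pd.DataFrame({'CustomerID': [1, 2], 'MonthlyRevenue': [500, 1200]})
missing = missing_columns(sample, REQUIRED_COLUMNS)
print(f"Missing columns: {missing}")
```

In the notebook itself, the same call would be `missing_columns(customers, REQUIRED_COLUMNS)`, and an empty list means every later cell can find its columns.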

---
# PROJECT 1: Customer Health Dashboard
## For: Sales & Customer Success Teams

**Business Context:** The CS team needs a way to quickly identify which customers need attention.

**Deliverables:**
1. Customer Health Score (0-100)
2. Risk categorization (Low/Medium/High)
3. Dashboard visualization
4. Priority action list

## Step 1.1: Create Customer Health Score

In [None]:
# Create a copy for our analysis
df = customers.copy()

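# Health Score Components (each worth up to 25 points, total 100)
# 1. Engagement Score (based on logins)
# include_lowest=True keeps customers with 0 logins in the first bin;
# without it they become NaN and .astype(int) raises an error.
# Values above the last edge (300) would still be NaN.
df['EngagementScore'] = pd.cut(
    df['AvgWeeklyLogins'],
    bins=[0, 20, 50, 100, 300],
    labels=[5, 15, 20, 25],
    include_lowest=True
).astype(int)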

# 2. NPS Score (scaled to 25)
df['NPSPoints'] = (df['NPS_Score'] / 10 * 25).round(0)

# 3. Support Health (fewer tickets = better)
df['SupportScore'] = 25 - (df['SupportTicketsRaised'] * 2.5).clip(upper=25)

# 4. Tenure Score (longer = better)
# include_lowest=True keeps brand-new customers (0 months) in the first bin
df['TenureScore'] = pd.cut(
    df['TenureMonths'],
    bins=[0, 6, 12, 24, 100],
    labels=[10, 15, 20, 25],
    include_lowest=True
).astype(int)

# Total Health Score
df['HealthScore'] = df['EngagementScore'] + df['NPSPoints'] + df['SupportScore'] + df['TenureScore']

# Risk Category
df['RiskLevel'] = pd.cut(
    df['HealthScore'],
    bins=[0, 50, 70, 100],
    labels=['High Risk', 'Medium Risk', 'Low Risk']
)

df[['CompanyName', 'HealthScore', 'RiskLevel', 'EngagementScore', 'NPSPoints']].head(10)
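
One detail worth checking when scoring with `pd.cut`: its intervals are right-closed by default, so a value equal to the lowest bin edge (for example, 0 logins) is assigned to no bin unless `include_lowest=True` is passed. The sketch below illustrates the difference on made-up values. The `logins` series is hypothetical; only the bin edges and labels are taken from the cell above.

```python
import pandas as pd

# Made-up login counts, including the boundary value 0.
logins = pd.Series([0, 20, 21, 300])

# Default: intervals are right-closed, (0, 20], (20, 50], ..., so 0 matches no bin.
default_bins = pd.cut(logins, bins=[0, 20, 50, 100, 300], labels=[5, 15, 20, 25])

# include_lowest=True closes the first interval on the left, so 0 lands in the first bin.
inclusive_bins = pd.cut(
    logins, bins=[0, 20, 50, 100, 300], labels=[5, 15, 20, 25], include_lowest=True
)

print(default_bins.isna().tolist())         # [True, False, False, False]
print(inclusive_bins.astype(int).tolist())  # [5, 5, 15, 25]
```

A NaN in the binned column matters because `.astype(int)` cannot convert it, so a single customer on the boundary is enough to stop the cell.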

## Step 1.2: Risk Analysis Summary

In [None]:
# Risk summary (observed=False keeps every risk level in the table, even one with no customers)
risk_summary = df.groupby('RiskLevel', observed=False).agg(
    customer_count=('CustomerID', 'count'),
    total_revenue=('MonthlyRevenue', 'sum'),
    avg_revenue=('MonthlyRevenue', 'mean'),
    avg_health=('HealthScore', 'mean'),
    churn_rate=('Cancelled', 'mean')
).round(2)

risk_summary['revenue_at_risk'] = risk_summary['total_revenue']
print("Risk Level Summary:")
risk_summary

## Step 1.3: Customer Health Dashboard

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Health Score Distribution
sns.histplot(data=df, x='HealthScore', bins=15, ax=axes[0, 0], color='steelblue')
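axes[0, 0].axvline(50, color='red', linestyle='--', label='High/Medium threshold (50)')
axes[0, 0].axvline(70, color='orange', linestyle='--', label='Medium/Low threshold (70)')
axes[0, 0].set_title('Health Score Distribution')
axes[0, 0].legend()  # without this call the line labels are never drawn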

# 2. Risk Level Count
risk_counts = df['RiskLevel'].value_counts()
colors = {'High Risk': '#e74c3c', 'Medium Risk': '#f39c12', 'Low Risk': '#27ae60'}
axes[0, 1].bar(risk_counts.index, risk_counts.values, 
               color=[colors[x] for x in risk_counts.index])
axes[0, 1].set_title('Customers by Risk Level')

# 3. Revenue at Risk by Plan
risk_by_plan = df[df['RiskLevel'] == 'High Risk'].groupby('SubscriptionPlan')['MonthlyRevenue'].sum()
axes[1, 0].bar(risk_by_plan.index, risk_by_plan.values, color='#e74c3c')
axes[1, 0].set_title('Revenue at Risk by Plan ($)')

# 4. Health vs Revenue
colors_scatter = df['RiskLevel'].map(colors)
axes[1, 1].scatter(df['HealthScore'], df['MonthlyRevenue'], c=colors_scatter, alpha=0.6)
axes[1, 1].set_xlabel('Health Score')
axes[1, 1].set_ylabel('Monthly Revenue')
axes[1, 1].set_title('Health Score vs Revenue')

plt.suptitle('Customer Health Dashboard', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 1.4: Priority Action List

In [None]:
# Top 10 at-risk customers by revenue
priority_list = df[df['RiskLevel'] == 'High Risk'].nlargest(10, 'MonthlyRevenue')[
    ['CompanyName', 'Industry', 'MonthlyRevenue', 'HealthScore', 'NPS_Score', 'AvgWeeklyLogins']
]

print("TOP 10 HIGH-RISK CUSTOMERS - IMMEDIATE ACTION REQUIRED")
print("="*60)
priority_list

---
# PROJECT 2: Churn Prediction Analysis
## For: Product, CS, and Executive Teams

**Business Context:** Understand WHY customers churn to prevent future losses.

**Deliverables:**
1. Churn drivers identification
2. Segment-level churn analysis
3. Early warning indicators
4. Recommendations

## Step 2.1: Churn Overview

In [None]:
# Overall churn metrics
total_customers = len(customers)
churned = customers['Cancelled'].sum()
churn_rate = churned / total_customers * 100
revenue_lost = customers[customers['Cancelled'] == 1]['MonthlyRevenue'].sum()

print("CHURN OVERVIEW")
print("="*40)
print(f"Total Customers: {total_customers}")
print(f"Churned: {churned} ({churn_rate:.1f}%)")
print(f"Monthly Revenue Lost: ${revenue_lost:,}")
print(f"Annual Revenue Lost: ${revenue_lost * 12:,}")

## Step 2.2: Churned vs Active Comparison

In [None]:
# Compare churned vs active
comparison = customers.groupby('Cancelled').agg({
    'CustomerID': 'count',
    'MonthlyRevenue': 'mean',
    'TenureMonths': 'mean',
    'AvgWeeklyLogins': 'mean',
    'NPS_Score': 'mean',
    'SupportTicketsRaised': 'mean',
    'SeatCount': 'mean'
}).round(2)

comparison.index = ['Active', 'Churned']
comparison.columns = ['Count', 'Avg Revenue', 'Avg Tenure', 'Avg Logins', 'Avg NPS', 'Avg Tickets', 'Avg Seats']
comparison

## Step 2.3: Churn by Segment

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Churn by Plan
churn_by_plan = customers.groupby('SubscriptionPlan')['Cancelled'].mean() * 100
churn_by_plan.plot(kind='bar', ax=axes[0, 0], color=['#3498db', '#2ecc71', '#9b59b6'])
axes[0, 0].set_title('Churn Rate by Plan (%)')
axes[0, 0].set_ylabel('Churn Rate (%)')

# 2. Churn by Contract Type
churn_by_contract = customers.groupby('ContractType')['Cancelled'].mean() * 100
churn_by_contract.plot(kind='bar', ax=axes[0, 1], color=['#e74c3c', '#27ae60'])
axes[0, 1].set_title('Churn Rate by Contract Type (%)')

# 3. NPS Distribution by Churn
sns.boxplot(data=customers, x='Cancelled', y='NPS_Score', ax=axes[1, 0], palette=['#27ae60', '#e74c3c'])
axes[1, 0].set_xticklabels(['Active', 'Churned'])
axes[1, 0].set_title('NPS Score: Active vs Churned')

# 4. Tenure Distribution by Churn
sns.boxplot(data=customers, x='Cancelled', y='TenureMonths', ax=axes[1, 1], palette=['#27ae60', '#e74c3c'])
axes[1, 1].set_xticklabels(['Active', 'Churned'])
axes[1, 1].set_title('Tenure: Active vs Churned')

plt.suptitle('Churn Analysis by Segment', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 2.4: Churn Correlation Analysis

In [None]:
# Find what correlates with churn
numeric_cols = ['MonthlyRevenue', 'SeatCount', 'TenureMonths', 'AvgWeeklyLogins', 
                'NPS_Score', 'SupportTicketsRaised', 'LastLoginDaysAgo', 'Cancelled']

# Drop 'Cancelled' itself: its self-correlation is always 1.0 and is not a churn driver
churn_corr = customers[numeric_cols].corr()['Cancelled'].drop('Cancelled').sort_values(ascending=False)

plt.figure(figsize=(10, 6))
bar_colors = ['#e74c3c' if x > 0 else '#27ae60' for x in churn_corr.values]
plt.barh(churn_corr.index, churn_corr.values, color=bar_colors)
plt.xlabel('Correlation with Churn')
plt.title('Churn Drivers (Correlation Analysis)', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("- Positive correlation = increases churn risk")
print("- Negative correlation = reduces churn risk")
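
When ranking correlations against a target column, the target always correlates 1.0 with itself, so it sorts to the top of a descending list and should be removed by name, not by position. The following is a minimal sketch on made-up data. The `toy` frame is hypothetical; the column names mirror the ones used above.

```python
import pandas as pd

# Made-up toy data; column names mirror the ones used above.
toy = pd.DataFrame({
    'NPS_Score': [9, 8, 3, 2, 7, 4],
    'SupportTicketsRaised': [1, 0, 6, 7, 2, 5],
    'Cancelled': [0, 0, 1, 1, 0, 1],
})

corr_with_churn = toy.corr()['Cancelled']

# Remove the target by label, then rank the remaining columns.
drivers = corr_with_churn.drop('Cancelled').sort_values(ascending=False)
print(drivers)
```

Dropping by label keeps the result correct regardless of sort direction, whereas slicing off the first or last element silently depends on where the target happens to land.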

## Step 2.5: Early Warning Indicators

In [None]:
# Profile of churned customers, kept as a reference for sanity-checking the thresholds below
churned_profile = customers[customers['Cancelled'] == 1].describe()

# Create warning flags on df (the Project 1 copy); the cutoffs are fixed rules of thumb
df['Warning_LowNPS'] = df['NPS_Score'] < 6
df['Warning_LowEngagement'] = df['AvgWeeklyLogins'] < 20
df['Warning_HighTickets'] = df['SupportTicketsRaised'] > 5
df['Warning_ShortTenure'] = df['TenureMonths'] < 6
df['Warning_InactiveDays'] = df['LastLoginDaysAgo'] > 14

# Count warnings per customer
warning_cols = [col for col in df.columns if col.startswith('Warning_')]
df['WarningCount'] = df[warning_cols].sum(axis=1)

# Warning summary
print("EARLY WARNING INDICATORS")
print("="*50)
for col in warning_cols:
    count = df[col].sum()
    pct = count / len(df) * 100
    print(f"{col.replace('Warning_', '')}: {count} customers ({pct:.1f}%)")

print(f"\nCustomers with 3+ warnings: {(df['WarningCount'] >= 3).sum()}")
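
A natural follow-up is to check whether the warning count actually tracks churn. The sketch below shows one way to tabulate churn rate per warning count. It runs on made-up data, so the numbers say nothing about TechFlow; the `toy` frame is hypothetical and only the column names come from the cell above.

```python
import pandas as pd

# Made-up toy data: WarningCount and Cancelled mirror the columns used above.
toy = pd.DataFrame({
    'WarningCount': [0, 0, 1, 1, 3, 3, 4, 0],
    'Cancelled':    [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 flag is the churn rate; count gives the group size.
by_warnings = toy.groupby('WarningCount')['Cancelled'].agg(['mean', 'count'])
by_warnings['churn_rate_pct'] = by_warnings['mean'] * 100
print(by_warnings[['count', 'churn_rate_pct']])
```

On the real data, replacing `toy` with `df` gives the same table. If churn does not rise with the warning count, the thresholds are probably worth revisiting.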

---
# PROJECT 3: Executive Business Review
## For: All Departments & Leadership

**Business Context:** Quarterly executive report with company-wide insights.

**Deliverables:**
1. Executive KPI Summary
2. Department-specific insights
3. Trend analysis
4. Strategic recommendations

## Step 3.1: Executive KPIs

In [None]:
# Calculate key metrics
total_mrr = customers['MonthlyRevenue'].sum()
total_arr = total_mrr * 12
avg_revenue = customers['MonthlyRevenue'].mean()
total_customers = len(customers)
active_customers = (customers['Cancelled'] == 0).sum()
churn_rate = customers['Cancelled'].mean() * 100
avg_nps = customers['NPS_Score'].mean()
avg_tenure = customers['TenureMonths'].mean()

print("="*60)
print("           TECHFLOW EXECUTIVE SUMMARY")
print("="*60)
print(f"\n📊 REVENUE METRICS")
print(f"   Monthly Recurring Revenue (MRR): ${total_mrr:,}")
print(f"   Annual Recurring Revenue (ARR):  ${total_arr:,}")
print(f"   Average Revenue per Customer:    ${avg_revenue:,.0f}")

print(f"\n👥 CUSTOMER METRICS")
print(f"   Total Customers:    {total_customers}")
print(f"   Active Customers:   {active_customers}")
print(f"   Churn Rate:         {churn_rate:.1f}%")

print(f"\n⭐ SATISFACTION METRICS")
print(f"   Average NPS Score:  {avg_nps:.1f}")
print(f"   Average Tenure:     {avg_tenure:.1f} months")
print("="*60)

## Step 3.2: Department Insights

In [None]:
# SALES: Revenue by plan and industry
print("\n📈 SALES TEAM INSIGHTS")
print("-"*40)
sales_by_plan = customers.groupby('SubscriptionPlan').agg({
    'CustomerID': 'count',
    'MonthlyRevenue': ['sum', 'mean']
}).round(0)
sales_by_plan.columns = ['Customers', 'Total Revenue', 'Avg Revenue']
print(sales_by_plan)

print("\nTop Industries by Revenue:")
print(customers.groupby('Industry')['MonthlyRevenue'].sum().nlargest(5))

In [None]:
# CS TEAM: Support and satisfaction
print("\n🎯 CUSTOMER SUCCESS INSIGHTS")
print("-"*40)

# NPS breakdown
promoters = (customers['NPS_Score'] >= 9).sum()
passives = ((customers['NPS_Score'] >= 7) & (customers['NPS_Score'] < 9)).sum()
detractors = (customers['NPS_Score'] < 7).sum()
nps_score = (promoters - detractors) / len(customers) * 100

print(f"Net Promoter Score: {nps_score:.0f}")
print(f"  Promoters (9-10):   {promoters} ({promoters/len(customers)*100:.1f}%)")
print(f"  Passives (7-8):     {passives} ({passives/len(customers)*100:.1f}%)")
print(f"  Detractors (0-6):   {detractors} ({detractors/len(customers)*100:.1f}%)")

print(f"\nAvg Support Tickets: {customers['SupportTicketsRaised'].mean():.1f}")
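
The Net Promoter Score arithmetic above can be checked by hand on a small sample. This is a minimal sketch in plain Python. The function name `net_promoter_score` and the sample responses are hypothetical; the cutoffs (9-10 promoters, 7-8 passives, 0-6 detractors) match the cell above.

```python
def net_promoter_score(scores):
    """NPS = % promoters (9-10) minus % detractors (0-6).

    Passives (7-8) are not added or subtracted, but they still count in the total.
    """
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s < 7)
    return (promoters - detractors) / len(scores) * 100

# Made-up survey responses: 4 promoters, 3 passives, 3 detractors.
sample_scores = [10, 9, 9, 10, 8, 7, 7, 6, 3, 5]

# (4 - 3) / 10 * 100, i.e. an NPS of 10
print(f"NPS: {net_promoter_score(sample_scores):.0f}")
```

Note that the result ranges from -100 to 100, which is a different quantity from the average of the raw 0-10 scores.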

In [None]:
# FINANCE: Revenue analysis
print("\n💰 FINANCE INSIGHTS")
print("-"*40)

# Revenue concentration
top_10_revenue = customers.nlargest(10, 'MonthlyRevenue')['MonthlyRevenue'].sum()
top_10_pct = top_10_revenue / total_mrr * 100

print(f"Revenue Concentration:")
print(f"  Top 10 customers: ${top_10_revenue:,} ({top_10_pct:.1f}% of MRR)")

# By contract type
print("\nRevenue by Contract Type:")
print(customers.groupby('ContractType')['MonthlyRevenue'].agg(['sum', 'mean']).round(0))
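
Revenue concentration is the share of total revenue that comes from the largest accounts. The sketch below works through the arithmetic on made-up numbers. The function name `top_n_share` and the sample revenues are hypothetical.

```python
def top_n_share(revenues, n):
    """Percentage of total revenue contributed by the n largest values."""
    top = sorted(revenues, reverse=True)[:n]
    return sum(top) / sum(revenues) * 100

# Made-up monthly revenues for five customers (total = 10,000).
sample_revenue = [5000, 2500, 1500, 600, 400]

# Top 2 customers: (5000 + 2500) / 10000 = 75%
print(f"Top 2 share: {top_n_share(sample_revenue, 2):.1f}%")
```

A high share means a few cancellations could remove a large part of MRR, which is why this figure sits in the Finance section.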

## Step 3.3: Executive Dashboard

In [None]:
fig = plt.figure(figsize=(16, 12))

# KPI Cards Row
kpis = [
    ('MRR', f'${total_mrr:,}', '#3498db'),
    ('Customers', str(total_customers), '#2ecc71'),
    ('Churn Rate', f'{churn_rate:.1f}%', '#e74c3c'),
    ('Avg NPS', f'{avg_nps:.1f}', '#9b59b6')
]

for i, (title, value, color) in enumerate(kpis):
    ax = fig.add_subplot(3, 4, i+1)
    ax.text(0.5, 0.6, value, fontsize=24, fontweight='bold', ha='center', color=color)
    ax.text(0.5, 0.25, title, fontsize=12, ha='center')
    ax.axis('off')

# Revenue by Plan
ax1 = fig.add_subplot(3, 2, 3)
plan_revenue = customers.groupby('SubscriptionPlan')['MonthlyRevenue'].sum()
ax1.pie(plan_revenue, labels=plan_revenue.index, autopct='%1.1f%%', colors=['#3498db', '#2ecc71', '#9b59b6'])
ax1.set_title('Revenue by Plan')

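# Customers by Tenure
ax2 = fig.add_subplot(3, 2, 4)
# Open-ended last bin so every tenure above 24 months is counted under '24+mo'
tenure_bins = pd.cut(customers['TenureMonths'], bins=[0, 6, 12, 24, np.inf], include_lowest=True)
tenure_dist = tenure_bins.value_counts(sort=False)  # sort=False keeps bin order, including empty bins
ax2.bar(['0-6mo', '6-12mo', '12-24mo', '24+mo'], tenure_dist.values, color='steelblue')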
ax2.set_title('Customers by Tenure')

# Top Industries
ax3 = fig.add_subplot(3, 2, 5)
top_ind = customers.groupby('Industry')['MonthlyRevenue'].sum().nlargest(6)
ax3.barh(top_ind.index, top_ind.values, color='#27ae60')
ax3.set_title('Top Industries by Revenue')
ax3.invert_yaxis()

# NPS Distribution
ax4 = fig.add_subplot(3, 2, 6)
sns.histplot(customers['NPS_Score'], bins=10, ax=ax4, color='purple')
ax4.axvline(7, color='orange', linestyle='--')
ax4.axvline(9, color='green', linestyle='--')
ax4.set_title('NPS Score Distribution')

plt.suptitle('TECHFLOW EXECUTIVE DASHBOARD', fontsize=18, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()

## Step 3.4: Strategic Recommendations

In [None]:
print("="*70)
print("           STRATEGIC RECOMMENDATIONS")
print("="*70)

# Based on analysis
print("\n🎯 IMMEDIATE PRIORITIES (Next 30 Days)")
print("-"*50)
high_risk = df[df['RiskLevel'] == 'High Risk']
print(f"1. Contact {len(high_risk)} high-risk customers (${high_risk['MonthlyRevenue'].sum():,} MRR at risk)")
print(f"2. Address {detractors} NPS detractors with personalized outreach")
print(f"3. Review {(customers['SupportTicketsRaised'] > 5).sum()} accounts with excessive support tickets")

print("\n📈 GROWTH OPPORTUNITIES")
print("-"*50)
basic_customers = customers[customers['SubscriptionPlan'] == 'Basic']
print(f"1. {len(basic_customers)} Basic plan customers - target for upgrade")
print(f"2. Top industry ({customers.groupby('Industry')['MonthlyRevenue'].sum().idxmax()}) - expand presence")
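# Compare the two churn rates directly; the annual rate alone is not a "lower by" figure
annual_churn = customers[customers['ContractType'] == 'Annual']['Cancelled'].mean() * 100
monthly_churn = customers[customers['ContractType'] == 'Monthly']['Cancelled'].mean() * 100
print(f"3. Annual contracts churn at {annual_churn:.0f}% vs {monthly_churn:.0f}% for monthly - push annual")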

print("\n⚠️  RISK MITIGATION")
print("-"*50)
monthly = customers[customers['ContractType'] == 'Monthly']
print(f"1. {len(monthly)} monthly customers at higher churn risk")
print(f"2. Enterprise churn costing ${customers[(customers['SubscriptionPlan']=='Enterprise') & (customers['Cancelled']==1)]['MonthlyRevenue'].sum():,}/month")
print(f"3. Customers with <6 month tenure need extra attention ({(customers['TenureMonths'] < 6).sum()} accounts)")

print("\n" + "="*70)
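
When a recommendation compares churn between two segments, the gap can be stated in percentage points or as a relative reduction, and the two numbers differ a lot. The sketch below shows both on made-up rates. The function name `churn_gap` and the example rates are hypothetical, not TechFlow results.

```python
def churn_gap(monthly_rate, annual_rate):
    """Return (percentage-point gap, relative reduction in %) for two churn rates given in %."""
    point_gap = monthly_rate - annual_rate
    relative_reduction = point_gap / monthly_rate * 100
    return point_gap, relative_reduction

# Made-up rates: 20% churn on monthly contracts, 5% on annual contracts.
points, relative = churn_gap(20.0, 5.0)
print(f"{points:.0f} percentage points lower, a {relative:.0f}% relative reduction")
```

Stating which of the two is being reported avoids an executive reading "15% lower" when the relative drop is 75%, or the reverse.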

---
# CONGRATULATIONS! 🎉

You have completed the entire Pandas Training Series!

**Skills Mastered:**
- ✅ Python fundamentals (Modules 0.1-0.9)
- ✅ Pandas Series & DataFrames (Module 1)
- ✅ Filtering, Grouping, Aggregation (Module 2)
- ✅ Joining & Merging Data (Module 3)
- ✅ Data Cleaning & Exploration (Module 4)
- ✅ Data Visualization (Module 5)
- ✅ Real-World Business Projects (Module 6)

**You're ready to analyze data like a pro!**