# Hypothesis 2: Leadership Impact Analysis

**Goal:** Test whether "High Quality" leaders (consistent, experienced) generate higher user retention.

**Scope:** Events from 2023-01-01 onward only (to avoid COVID/Virtual bias).

---

## 1. Setup & Data Loading

In [1]:
import pandas as pd
import numpy as np
import sys
import os

# Add src to path to import data_loader
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from data_loader import FitFamDataLoader

# Initialize Loader
loader = FitFamDataLoader(data_dir=os.path.abspath(os.path.join('..', 'fitfam-json')))

print("Loading Data...")
events = loader.load_events()
event_user = loader.load_event_user()
print("Data Loaded.")

Loading Data...
Data Loaded.


## 2. Data Filtering (Post-2023)

In [2]:
print("--- Filtering Data (Post-2023) ---")
events['start_time'] = pd.to_datetime(events['start_time'])
events_filtered = events[events['start_time'] >= '2023-01-01'].copy()
print(f"Events (Total): {len(events)}")
print(f"Events (Post-2023): {len(events_filtered)}")

# Filter attendance to only include these events
event_user_filtered = event_user[event_user['event_id'].isin(events_filtered['id'])].copy()
print(f"Attendance (Total): {len(event_user)}")
print(f"Attendance (Post-2023): {len(event_user_filtered)}")

--- Filtering Data (Post-2023) ---
Events (Total): 37630
Events (Post-2023): 12141
Attendance (Total): 405861
Attendance (Post-2023): 94838
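
The two-step filter above (events by date, then attendance by event ID) can be checked on a small example. This is a minimal sketch on made-up rows; `events_toy` and `attendance_toy` are hypothetical stand-ins that only assume the `id`, `start_time`, and `event_id` columns used in the cell above.

```python
import pandas as pd

# Hypothetical toy data (not from the FitFam dataset).
events_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "start_time": ["2022-12-31 18:00", "2023-01-01 07:00", "2023-06-15 07:00"],
})
attendance_toy = pd.DataFrame({
    "event_id": [1, 1, 2, 3, 3],
    "user_id": [10, 11, 10, 11, 12],
})

# Step 1: keep events starting on or after 2023-01-01.
events_toy["start_time"] = pd.to_datetime(events_toy["start_time"])
recent = events_toy[events_toy["start_time"] >= "2023-01-01"].copy()

# Step 2: keep only attendance rows that point at a retained event.
recent_attendance = attendance_toy[attendance_toy["event_id"].isin(recent["id"])].copy()

print(recent["id"].tolist())
print(len(recent_attendance))
```

Note that the cutoff is inclusive: an event starting at any time on 2023-01-01 is kept.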


## 3. Step 1: Data Discovery (Leader Verification)

In [3]:
# Confirm that attendance records flag leaders, then count events led per leader
if 'is_leader' in event_user_filtered.columns:
    leaders_from_attendance = event_user_filtered[event_user_filtered['is_leader'] == 1]
    print(f"Found {len(leaders_from_attendance)} leader records.")
    
    if not leaders_from_attendance.empty:
        # One row per leader with the number of events they led
        leader_counts = leaders_from_attendance.groupby('user_id').size().reset_index(name='event_count')
        print(f"Total unique leaders: {len(leader_counts)}")
        
        active_leaders = leader_counts[leader_counts['event_count'] >= 5]
        print(f"Active Leaders (>= 5 events): {len(active_leaders)}")
        print(leader_counts['event_count'].describe())
else:
    print("ERROR: 'is_leader' column not found in attendance data.")

Found 10070 leader records.
Total unique leaders: 187
Active Leaders (>= 5 events): 160
count    187.000000
mean      53.850267
std       64.332462
min        1.000000
25%       10.000000
50%       30.000000
75%       74.500000
max      437.000000
Name: event_count, dtype: float64


## 4. Step 2: Feature Engineering (Leader Quality Metrics)

**Metrics:**
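*   **Consistency:** Standard deviation of the gaps (in days) between consecutive events led. Lower means more regular.
*   **Tenure:** Days between the first and last event led, within the filtered window.
*   **Frequency:** Events led per month of tenure.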
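
To make these definitions concrete, here is a minimal worked example for a single leader. The four dates are invented for illustration; the 30.44 days-per-month constant matches the one used in the cell below.

```python
import pandas as pd

# Hypothetical event dates for one leader (illustrative, not from the dataset).
dates = pd.Series(pd.to_datetime([
    "2023-01-02", "2023-01-09", "2023-01-16", "2023-01-30",
])).sort_values()

# Tenure: days between the first and last event led.
tenure_days = (dates.iloc[-1] - dates.iloc[0]).days

# Gaps between consecutive events, in days.
gaps = dates.diff().dropna().dt.total_seconds() / (24 * 3600)

# Consistency: sample standard deviation of the gaps (lower = more regular).
consistency_std = gaps.std()

# Frequency: events per month, with 30.44 as the average month length in days.
events_per_month = len(dates) / (tenure_days / 30.44)

print(tenure_days, gaps.tolist())
print(round(consistency_std, 3), round(events_per_month, 3))
```

Here the gaps are 7, 7, and 14 days over a 28-day tenure, so the single skipped week is what drives the standard deviation above zero.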

In [4]:
# Merge Leaders with Event Dates
leaders = event_user_filtered[event_user_filtered['is_leader'] == 1].copy()
leaders_with_dates = leaders.merge(events_filtered[['id', 'start_time']], left_on='event_id', right_on='id', how='left')
leaders_with_dates = leaders_with_dates.sort_values(['user_id', 'start_time'])

leader_stats = []

print("Calculating metrics...")

for user_id, group in leaders_with_dates.groupby('user_id'):
    dates = group['start_time'].sort_values()
    
    total_events = len(group)
    first_led = dates.iloc[0]
    last_led = dates.iloc[-1]
    
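    # Tenure in days, floored at 1 so a single-day leader does not divide by zero.
    # Note: this floor inflates events_per_month for very short tenures.
    tenure_days = max((last_led - first_led).days, 1)
    
    # 30.44 = average days per month (365.25 / 12)
    events_per_month = total_events / (tenure_days / 30.44)
    
    if total_events > 1:
        # Gaps between consecutive events, in days
        gaps = dates.diff().dt.total_seconds() / (24 * 3600)
        gaps = gaps.dropna()
        # Sample std is NaN when there is only one gap (two events)
        consistency_std = gaps.std()
        avg_gap = gaps.mean()
    else:
        consistency_std = np.nan
        avg_gap = np.nan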
        
    leader_stats.append({
        'leader_user_id': user_id,
        'total_events': total_events,
        'first_led_date': first_led,
        'last_led_date': last_led,
        'tenure_days': tenure_days,
        'events_per_month': events_per_month,
        'consistency_std': consistency_std,
        'avg_gap_days': avg_gap
    })

df_leaders = pd.DataFrame(leader_stats)

if not df_leaders.empty:
    df_leaders.set_index('leader_user_id', inplace=True)
    print(f"Computed stats for {len(df_leaders)} leaders.")
    
    # Save
    df_leaders.to_csv('leaders_quality_metrics_filtered.csv')
    print("Saved to leaders_quality_metrics_filtered.csv")
    
    # Inspect
    print("\nTop 5 Consistent Leaders (Min 10 events):")
    print(df_leaders[df_leaders['total_events'] >= 10].sort_values('consistency_std').head(5)[['total_events', 'consistency_std', 'events_per_month']])
else:
    print("No leaders found in filtered data.")

Calculating metrics...
Computed stats for 187 leaders.
Saved to leaders_quality_metrics_filtered.csv

Top 5 Consistent Leaders (Min 10 events):
                total_events  consistency_std  events_per_month
leader_user_id                                                 
102842                    19         0.004910          4.626880
226                       57         1.590454          4.201162
7844                      17         1.750000          4.348571
66                       224         2.509773          8.555282
32608                     12         2.662876          6.522857
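
The per-leader loop above could also be written with grouped aggregations. The following is a sketch of that alternative on toy data, not the notebook's own method; `toy` is a hypothetical stand-in for `leaders_with_dates` and assumes only the `user_id` and `start_time` columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for leaders_with_dates (hypothetical values).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "start_time": pd.to_datetime(
        ["2023-01-02", "2023-01-09", "2023-01-23", "2023-02-01"]
    ),
}).sort_values(["user_id", "start_time"])

# Gap in days to the same leader's previous event (NaN for their first event).
toy["gap_days"] = (
    toy.groupby("user_id")["start_time"].diff().dt.total_seconds() / 86400
)

stats = toy.groupby("user_id").agg(
    total_events=("start_time", "size"),
    first_led_date=("start_time", "min"),
    last_led_date=("start_time", "max"),
    consistency_std=("gap_days", "std"),
    avg_gap_days=("gap_days", "mean"),
)

# Same floor as the loop: a zero-day tenure is treated as one day.
stats["tenure_days"] = (
    (stats["last_led_date"] - stats["first_led_date"]).dt.days.clip(lower=1)
)
stats["events_per_month"] = stats["total_events"] / (stats["tenure_days"] / 30.44)

print(stats[["total_events", "tenure_days", "consistency_std", "events_per_month"]])
```

In this toy data, leader 2 led a single event, so the one-day tenure floor yields 30.44 events per month. That shows why the frequency metric is hard to interpret for leaders with very short tenures, and why a minimum-event filter like the one used in the table above is useful.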
