# Leader Capacity Dashboard - Data Engineering Notebook

## 📋 Table of Contents
1. [Cell 1-3] Project Overview & Setup
2. [Cell 4-10] Data Loading
3. [Cell 11-14] Data Processing
4. [Cell 15-16] Dashboard Structure
5. [Cell 17-18] Data Quality
6. [Cell 19-20] Export & Next Steps

## [Cell 1] Project Overview

This notebook implements the data engineering pipeline for recreating a leader capacity dashboard. The dashboard will display the current month plus the next three months of:
- Booked time as % of available working time
- Vacation/leave data
- Salesforce opportunity data with likelihood and dates
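
As a rough illustration of the first metric, booked time as a percentage of available working time could be computed per person and month roughly as follows. This is a minimal sketch with made-up numbers: the column names (`person`, `month`, `booked_hours`, `available_hours`) and the helper `booked_pct` are assumptions for illustration, not the notebook's actual schema.

```python
import pandas as pd

# Hypothetical monthly totals per leader; values are illustrative only.
capacity = pd.DataFrame({
    "person": ["A", "A", "B", "B"],
    "month": pd.to_datetime(["2025-07-01", "2025-08-01", "2025-07-01", "2025-08-01"]),
    "booked_hours": [120.0, 80.0, 168.0, 0.0],
    "available_hours": [160.0, 160.0, 168.0, 0.0],
})


def booked_pct(df: pd.DataFrame) -> pd.Series:
    """Booked hours as a percentage of available hours (NaN when nothing is available)."""
    # Mask zero availability so the division yields NaN instead of inf.
    available = df["available_hours"].where(df["available_hours"] > 0)
    return (df["booked_hours"] / available * 100).round(1)


capacity["booked_pct"] = booked_pct(capacity)

# Current month plus the next three months, as month-start timestamps.
# The reference date is fixed here so the example is reproducible.
reference = pd.Timestamp("2025-07-15")
window = pd.date_range(reference.normalize().replace(day=1), periods=4, freq="MS")

print(capacity)
print(window.strftime("%Y-%m").tolist())
```

Masking zero availability keeps months with no working time (for example, a full month of leave) from showing up as infinite utilization.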

### Data Sources
All data files are located in the `../data/` directory:
1. **10k Data for S3 (1).csv** - Time booking/allocation data
2. **10k Users.csv** - User roles and demographics (filtered to leadership roles)
3. **Namely Vacation and Leave Dataset.csv** - Employee vacation and leave records
4. **Salesforce Opportunity Data.csv** - Sales opportunities with probability and schedule
5. **Working Hours For US.csv** - US working hours and holidays
6. **UAE Working Hours.csv** - UAE working hours and holidays

In [1]:
# [Cell 2] Package Installation Check
# ✅ All packages have been installed in the virtual environment!

# If you ever need to reinstall packages:
# !pip install pandas numpy matplotlib seaborn openpyxl

print("✅ All required packages are installed!")
print("📍 Using virtual environment at: venv/")
print("🚀 You can proceed to Cell 3 to import the libraries")

✅ All required packages are installed!
📍 Using virtual environment at: venv/
🚀 You can proceed to Cell 3 to import the libraries


In [2]:
# [Cell 3] Import Required Libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import calendar
import warnings
warnings.filterwarnings('ignore')

# For visualization (optional)
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## [Cell 4] Data Loading Section

### 🔄 Load all data sources with error handling
We'll load each CSV file and explore its structure to understand what data we're working with.
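
The loading pattern used in the cells below (try to read, report the shape, fall back to `None` on failure) can be captured in one small helper. This is only a sketch: `load_csv` is a hypothetical name, and the notebook itself repeats the `try`/`except` inline rather than calling a helper.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv(path: str, label: str) -> Optional[pd.DataFrame]:
    """Read one CSV and report its shape; return None instead of raising on failure."""
    try:
        df = pd.read_csv(Path(path))
    except (OSError, UnicodeDecodeError,
            pd.errors.ParserError, pd.errors.EmptyDataError) as e:
        # Missing file, unreadable encoding, malformed or empty CSV.
        print(f"Error loading {label}: {e}")
        return None
    print(f"{label} loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
    return df


# A missing file yields None, so later cells can check `is not None` before use.
missing = load_csv("does_not_exist.csv", "demo file")
print(missing is None)
```

Catching specific exceptions rather than a bare `Exception` keeps genuine programming errors visible while still tolerating missing or malformed files.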

In [3]:
# [Cell 5] Load and Process 10k Data - FILTERED BY SELECTED ROLES ONLY

# Define the roles we want to include
selected_roles = [
    'Design',
    'Principal', 
    'Program Management',
    'Strategy',
    'Studio',
    'Tech'
]

print("📋 Selected roles to filter:")
for role in selected_roles:
    print(f"  ✓ {role}")
print(f"\n📊 Total roles selected: {len(selected_roles)}")
print()

# Load 10k Data
try:
    df_10k_data = pd.read_csv('../data/10k Data for S3 (1).csv')
    print(f"✅ 10k Data loaded successfully: {df_10k_data.shape}")
    print(f"   Columns: {df_10k_data.columns.tolist()[:10]}...")  # Show first 10 columns
except Exception as e:
    print(f"❌ Error loading 10k data: {e}")
    df_10k_data = None

# Load 10k Users
try:
    df_10k_users = pd.read_csv('../data/10k Users.csv')  # Note: capital 'U' in Users
    print(f"\n✅ 10k Users loaded successfully: {df_10k_users.shape}")
    print(f"   Columns: {df_10k_users.columns.tolist()}")
    
    # Check what roles are in the data (the column may be named 'role' or 'Role')
    role_preview_col = next((c for c in ('role', 'Role') if c in df_10k_users.columns), None)
    if role_preview_col:
        print(f"\n📊 Available roles in the data:")
        role_counts = df_10k_users[role_preview_col].value_counts()
        print(role_counts.head(20))
    else:
        print("\n⚠️  No 'role' column found. Available columns:")
        print(df_10k_users.columns.tolist())
        
except Exception as e:
    print(f"❌ Error loading 10k users: {e}")
    df_10k_users = None

# Filter users by selected roles
if df_10k_users is not None:
    # Find the correct role column name (could be 'role', 'Role', 'job_title', etc.)
    role_column = None
    for col in ['role', 'Role', 'job_title', 'Job_Title', 'position', 'Position']:
        if col in df_10k_users.columns:
            role_column = col
            break
    
    if role_column:
        print(f"\n🔍 Filtering by role column: '{role_column}'")
        
        # Filter users by selected roles
        df_filtered_users = df_10k_users[df_10k_users[role_column].isin(selected_roles)]
        print(f"✅ Filtered users: {df_filtered_users.shape[0]} out of {df_10k_users.shape[0]} total users")
        
        # Show role distribution after filtering
        print(f"\n📊 Role distribution after filtering:")
        print(df_filtered_users[role_column].value_counts())
    else:
        print("\n⚠️  Could not find role column. Please check column names.")
        df_filtered_users = df_10k_users
else:
    df_filtered_users = None

# Merge 10k data with filtered users
if df_10k_data is not None and df_filtered_users is not None:
    print("\n🔗 Merging datasets...")
    
    # Check for ID columns
    print(f"\n📋 ID columns check:")
    if 'user_id' in df_10k_data.columns:
        print(f"  ✓ Found 'user_id' in 10k data")
    if 'id' in df_filtered_users.columns:
        print(f"  ✓ Found 'id' in users data")
    
    # Perform the merge
    try:
        df_merged = pd.merge(
            df_10k_data,
            df_filtered_users,
            left_on='user_id',
            right_on='id',
            how='inner'  # Only keep records that match
        )
        print(f"\n✅ Merge successful! Result: {df_merged.shape}")
        print(f"   {df_merged.shape[0]} records for {df_merged['id'].nunique()} unique users")
        
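        # Save to use in later cells
        df_10k = df_merged
        
        # If the role column exists in both inputs, pandas renames the users' copy
        # to '<name>_y' on merge; point role_column at it so the checks below find it
        if role_column and role_column not in df_10k.columns and f"{role_column}_y" in df_10k.columns:
            role_column = f"{role_column}_y"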
        
    except Exception as e:
        print(f"\n❌ Error during merge: {e}")
        print("Please check that the ID columns exist and match correctly")
        df_10k = None
else:
    df_10k = None

# Display comprehensive sample of merged data
if df_10k is not None:
    print("\n" + "="*80)
    print("📊 COMPREHENSIVE SAMPLE OF MERGED DATA")
    print("="*80)
    
    # Show basic info about the merged dataset
    print(f"\n📋 Dataset Overview:")
    print(f"   • Total records: {df_10k.shape[0]:,}")
    print(f"   • Total columns: {df_10k.shape[1]}")
    print(f"   • Unique users: {df_10k['id'].nunique()}")
    print(f"   • Memory usage: {df_10k.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    # DEBUG: Show all available columns to help understand the structure
    print(f"\n🔍 DEBUG - All Available Columns in Merged Data:")
    if df_10k_data is not None:
        print("   📊 From 10k Data (first 10):", [col for col in df_10k.columns if col in df_10k_data.columns][:10])
    if df_filtered_users is not None:
        print("   👥 From Users Data (first 10):", [col for col in df_10k.columns if col in df_filtered_users.columns][:10])
    print("   📋 All columns:", df_10k.columns.tolist())
    
    # Show key columns from both datasets
    print(f"\n📊 Sample of Key Columns (First 10 rows):")
    key_cols = ['user_id', 'id', 'first_name', 'last_name', role_column if role_column else 'role', 
                'discipline', 'client', 'phase_name', 'incurred_hours', 'scheduled_hours', 'total_hours']
    available_key_cols = [col for col in key_cols if col in df_10k.columns]
    
    sample_data = df_10k[available_key_cols].head(10)
    print(sample_data.to_string(index=False))
    
    # Show data types for key columns
    print(f"\n📊 Data Types for Key Columns:")
    for col in available_key_cols:
        print(f"   • {col}: {df_10k[col].dtype}")
    
    # Show hours-related columns specifically
    hours_cols = [col for col in df_10k.columns if 'hours' in col.lower()]
    if hours_cols:
        print(f"\n⏰ Hours-Related Columns Summary:")
        for col in hours_cols:
            non_null_count = df_10k[col].notna().sum()
            # Sum any numeric dtype (int32, float32, nullable Int64, ...), not just int64/float64
            total_hours = df_10k[col].sum() if pd.api.types.is_numeric_dtype(df_10k[col]) else 'N/A'
            print(f"   • {col}: {non_null_count:,} non-null values, Total: {total_hours}")
    
    # Show sample of merged user info
    print(f"\n👥 Sample User Information (First 5 unique users):")
    # Build list of user columns that actually exist in the dataframe
    desired_user_cols = ['id', 'first_name', 'last_name', role_column if role_column else 'role', 
                         'discipline', 'location', 'email', 'hire_date']
    available_user_cols = [col for col in desired_user_cols if col in df_10k.columns]
    
    if available_user_cols:
        user_sample = df_10k[available_user_cols].drop_duplicates().head(5)
        print(user_sample.to_string(index=False))
    else:
        print("⚠️  No user information columns found in merged data")
    
    # VERIFICATION: Confirm that ONLY selected roles are in the final dataset
    print("\n" + "="*80)
    print("✅ VERIFICATION - Confirming filtered data contains ONLY selected roles:")
    print("="*80)
    
    # Check what roles are actually in the merged data
    if role_column and role_column in df_10k.columns:
        actual_roles = df_10k[role_column].unique()
        print(f"\n📋 Roles found in final dataset:")
        for role in sorted(actual_roles):
            if role in selected_roles:
                print(f"  ✓ {role} (expected)")
            else:
                print(f"  ❌ {role} (UNEXPECTED - should not be here!)")
        
        # Double-check: Are all roles in the data part of selected_roles?
        unexpected_roles = [r for r in actual_roles if r not in selected_roles]
        if unexpected_roles:
            print(f"\n⚠️  WARNING: Found {len(unexpected_roles)} unexpected roles in the data!")
            print(f"These roles should NOT be in the filtered data: {unexpected_roles}")
        else:
            print(f"\n✅ SUCCESS: All roles in the final dataset are from the selected list!")
            print(f"   - {len(actual_roles)} unique roles found")
            print(f"   - {df_10k.shape[0]:,} total records")
            print(f"   - {df_10k['id'].nunique()} unique users")
    
    # Show final role distribution
    print(f"\n📊 Final role distribution in filtered 10k data:")
    if role_column and role_column in df_10k.columns:
        role_dist = df_10k[role_column].value_counts()
        print(role_dist)
        
        # Show records per user by role
        print(f"\n📊 Average records per user by role:")
        user_counts = df_10k.groupby(role_column)['id'].nunique()
        record_counts = df_10k[role_column].value_counts()
        for role in sorted(actual_roles):
            if role in user_counts.index and role in record_counts.index:
                avg_records = record_counts[role] / user_counts[role]
                print(f"   • {role}: {user_counts[role]} users, {record_counts[role]:,} records, {avg_records:.1f} avg records/user")
    
    print("\n" + "="*80)
    print("✅ MERGE COMPLETE - Data ready for further processing!")
    print("="*80)

📋 Selected roles to filter:
  ✓ Design
  ✓ Principal
  ✓ Program Management
  ✓ Strategy
  ✓ Studio
  ✓ Tech

📊 Total roles selected: 6

✅ 10k Data loaded successfully: (173658, 19)
   Columns: ['role', 'discipline', 'client', 'phase_name', 'incurred_hours', 'scheduled_hours', 'difference_from_past_scheduled_hours', 'future_scheduled_hours', 'total_hours', 'RequestTodayDate']...

✅ 10k Users loaded successfully: (1045, 36)
   Columns: ['last_login_time', 'billrate', 'id', 'first_name', 'last_name', 'account_owner', 'archived', 'billability_target', 'billable', 'created_at', 'deleted', 'deleted_at', 'discipline', 'display_name', 'email', 'employee_number', 'guid', 'hire_date', 'invitation_pending', 'license_type', 'location', 'location_id', 'mobile_phone', 'office_phone', 'role', 'termination_date', 'type', 'updated_at', 'user_settings', 'user_type_id', 'thumbnail', 'has_login', 'login_type', 'archived_at', '_BATCH_ID_', '_BATCH_LAST_RUN_']

📊 Available roles in the data:
role
Design   

## [Cell 5a] Load and Integrate Employee Org Units (Dept, Div, Loc)

We'll load a new dataset `Employee Org Units (Dept, Div, Loc)` that contains authoritative org unit fields per employee. Using `email` as the unique identifier, we'll:
- Normalize column names and expected fields
- Drop the existing `location`, `department`, and `division` columns from the users and merged 10k datasets
- Replace them with `Org_Office_Location`, `Org_Department`, and `Org_Division` from this dataset

Expected file path: `../data/Employee Org Units (Dept, Div, Loc).csv`
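
The steps above can be sketched on toy data as follows. The rows and source column names (`Email`, `Department`, `Division`, `Office Location`) are made up for illustration, and `clean_email` is a hypothetical helper; the real file may use different headers.

```python
import pandas as pd

# Made-up rows; the real files will have more columns.
users = pd.DataFrame({
    "email": [" Ann@Example.com", "bob@example.com", None],
    "location": ["old-1", "old-2", "old-3"],
})
org_units = pd.DataFrame({
    "Email": ["ann@example.com ", "BOB@example.com", "bob@example.com"],
    "Department": ["Design", "Tech", "Strategy"],
    "Division": ["East", "West", "West"],
    "Office Location": ["NYC", "Dubai", "Dubai"],
})


def clean_email(s: pd.Series) -> pd.Series:
    """Trim and lowercase emails while leaving missing values as missing."""
    return s.str.strip().str.lower()


org = org_units.rename(columns={
    "Email": "email",
    "Department": "Org_Department",
    "Division": "Org_Division",
    "Office Location": "Org_Office_Location",
})
org["email"] = clean_email(org["email"])
# One row per email; keep="last" keeps the last row in file order.
org = org.dropna(subset=["email"]).drop_duplicates(subset=["email"], keep="last")

users["email"] = clean_email(users["email"])
merged = (
    users.drop(columns=["location"])
    # validate raises if the right side still has duplicate emails
    .merge(org, on="email", how="left", validate="many_to_one")
)
print(merged)
```

A left join keeps every user even when no org unit row matches, so unmatched users show up as missing `Org_*` values rather than disappearing.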


In [None]:
# [Cell 5b] Read and merge Org Units by email; replace dept/div/location

import os

org_units_path = '../data/Employee Org Units (Dept, Div, Loc).csv'

try:
    df_org_units_raw = pd.read_csv(org_units_path)
    print(f"✅ Org Units loaded: {df_org_units_raw.shape}")
    print("Columns:", df_org_units_raw.columns.tolist())
except Exception as e:
    print(f"❌ Error loading Org Units from {org_units_path}: {e}")
    df_org_units_raw = None

# Normalize column names and pick canonical fields
if df_org_units_raw is not None:
    df_org_units = df_org_units_raw.copy()
    # Lowercase and strip columns for robust matching
    df_org_units.columns = [c.strip().lower() for c in df_org_units.columns]

    # Return the first candidate name present in the lowercased columns
    def resolve(col_variants, fallback=None):
        for v in col_variants:
            if v.lower() in df_org_units.columns:
                return v.lower()
        return fallback

    email_col = resolve(['email', 'work email', 'employee email', 'e-mail'])
    dept_col = resolve(['department', 'dept'])
    div_col = resolve(['division', 'div'])
    loc_col = resolve(['office location', 'location', 'office', 'current office location'])

    required = [email_col, dept_col, div_col, loc_col]
    if any(col is None for col in required):
        print("⚠️ Missing required columns in Org Units. Found:", {
            'email': email_col, 'department': dept_col, 'division': div_col, 'location': loc_col
        })
    else:
        # Keep only necessary columns and rename to canonical schema
        df_org_units = df_org_units[[email_col, dept_col, div_col, loc_col]].rename(columns={
            email_col: 'email',
            dept_col: 'Org_Department',
            div_col: 'Org_Division',
            loc_col: 'Org_Office_Location',
        })

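        # Drop missing emails BEFORE casting to str (astype(str) would turn NaN into 'nan'),
        # then keep the last row per email in file order (no timestamp column is used)
        df_org_units = df_org_units.dropna(subset=['email'])
        df_org_units['email'] = df_org_units['email'].astype(str).str.strip().str.lower()
        df_org_units = df_org_units[df_org_units['email'] != ''].drop_duplicates(subset=['email'], keep='last')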

        print(f"✅ Org Units normalized: {df_org_units.shape}")
        print("Sample:")
        print(df_org_units.head(5).to_string(index=False))

        # Merge into filtered users
        if 'df_filtered_users' in globals() and df_filtered_users is not None:
            users = df_filtered_users.copy()
            if 'email' in users.columns:
                users['email'] = users['email'].astype(str).str.strip().str.lower()
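                # Drop existing org columns if present, including Org_* columns left by
                # a previous run of this cell, so re-running does not create _x/_y duplicates
                org_cols = ['location', 'department', 'division',
                            'Org_Department', 'Org_Division', 'Org_Office_Location']
                users = users.drop(columns=[c for c in org_cols if c in users.columns])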
                users = users.merge(df_org_units, on='email', how='left')
                print(f"✅ Users merged with Org Units: {users.shape}")
                missing = users['Org_Department'].isna().sum()
                print(f"   🔎 Users without Org Units match: {missing}")
                df_filtered_users = users
            else:
                print("⚠️ df_filtered_users has no 'email' column; cannot merge Org Units.")
        else:
            print("⚠️ df_filtered_users not available; run Cell 5 first.")

        # Merge into main 10k merged dataset
        if 'df_10k' in globals() and df_10k is not None:
            main = df_10k.copy()
            if 'email' in main.columns:
                main['email'] = main['email'].astype(str).str.strip().str.lower()
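                # Drop existing org columns, including Org_* columns left by a previous run
                org_cols = ['location', 'department', 'division',
                            'Org_Department', 'Org_Division', 'Org_Office_Location']
                main = main.drop(columns=[c for c in org_cols if c in main.columns])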
                main = main.merge(df_org_units, on='email', how='left')
                print(f"✅ Main df_10k merged with Org Units: {main.shape}")
                missing_main = main['Org_Department'].isna().sum()
                print(f"   🔎 df_10k rows without Org Units match: {missing_main}")
                df_10k = main
            else:
                print("⚠️ df_10k has no 'email' column; cannot merge Org Units.")
        else:
            print("⚠️ df_10k not available; run Cell 5 first.")


In [9]:
# [Cell 6] Debug - Check Available Variables
print("🔍 DEBUG: Checking available dataframes")
print("="*60)

# Check what variables are available in the current session
available_vars = []
missing_vars = []

# Check for df_10k (Cell 5 sets a dataframe to None when loading or merging fails,
# so test for None rather than only for the name being defined)
if globals().get('df_10k') is not None:
    print(f"✅ df_10k found (global): {df_10k.shape}")
    available_vars.append('df_10k')
else:
    print("❌ df_10k NOT found")
    missing_vars.append('df_10k')

# Check for df_filtered_users
if globals().get('df_filtered_users') is not None:
    print(f"✅ df_filtered_users found (global): {df_filtered_users.shape}")
    available_vars.append('df_filtered_users')
else:
    print("❌ df_filtered_users NOT found")
    missing_vars.append('df_filtered_users')

# Check for df_10k_data
if globals().get('df_10k_data') is not None:
    print(f"✅ df_10k_data found: {df_10k_data.shape}")
    available_vars.append('df_10k_data')
else:
    print("❌ df_10k_data NOT found")
    missing_vars.append('df_10k_data')

# Check for df_10k_users
if globals().get('df_10k_users') is not None:
    print(f"✅ df_10k_users found: {df_10k_users.shape}")
    available_vars.append('df_10k_users')
else:
    print("❌ df_10k_users NOT found")
    missing_vars.append('df_10k_users')

print(f"\n📊 Summary:")
print(f"   Available variables: {len(available_vars)}")
print(f"   Missing variables: {len(missing_vars)}")

if missing_vars:
    print(f"\n⚠️  To fix missing variables, please run:")
    # All four dataframes checked above are created in Cell 5
    print("   • Cell 5 - Load and Process 10k Data")
    
print("\n💡 Tip: Variables must be created in the same kernel session to be available.")
print("   If you restarted the kernel, you need to re-run all cells in order.")


🔍 DEBUG: Checking available dataframes
✅ df_10k found (global): (164349, 55)
✅ df_filtered_users found (global): (528, 36)
✅ df_10k_data found: (173658, 19)
✅ df_10k_users found: (1045, 36)

📊 Summary:
   Available variables: 4
   Missing variables: 0

💡 Tip: Variables must be created in the same kernel session to be available.
   If you restarted the kernel, you need to re-run all cells in order.


In [10]:
# [Cell 6a] Export DataFrames to CSV Review Folder

print("📊 EXPORTING DATAFRAMES TO CSV REVIEW FOLDER")
print("="*60)

import os
from datetime import datetime

# Create CSV review folder if it doesn't exist
csv_folder = '../CSV review'
os.makedirs(csv_folder, exist_ok=True)

# Generate timestamp for file naming
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Dictionary to track what we're exporting
export_summary = []

# Export df_10k if it exists
if 'df_10k' in globals() and df_10k is not None:
    filename = f'df_10k_merged_leadership_{timestamp}.csv'
    filepath = os.path.join(csv_folder, filename)
    df_10k.to_csv(filepath, index=False)
    export_summary.append({
        'DataFrame': 'df_10k',
        'Description': 'Merged 10k booking data with leadership users',
        'Shape': df_10k.shape,
        'Filename': filename
    })
    print(f"✅ Exported df_10k: {df_10k.shape[0]:,} rows, {df_10k.shape[1]} columns")
else:
    print("❌ df_10k not found - please run Cell 5 first")

# Export df_filtered_users if it exists
if 'df_filtered_users' in globals() and df_filtered_users is not None:
    filename = f'df_filtered_users_leadership_{timestamp}.csv'
    filepath = os.path.join(csv_folder, filename)
    df_filtered_users.to_csv(filepath, index=False)
    export_summary.append({
        'DataFrame': 'df_filtered_users',
        'Description': 'Leadership users only (filtered by role)',
        'Shape': df_filtered_users.shape,
        'Filename': filename
    })
    print(f"✅ Exported df_filtered_users: {df_filtered_users.shape[0]:,} rows, {df_filtered_users.shape[1]} columns")
else:
    print("❌ df_filtered_users not found")

# Export vacation data if it exists
if 'df_vacation' in globals() and df_vacation is not None:
    filename = f'df_vacation_detailed_{timestamp}.csv'
    filepath = os.path.join(csv_folder, filename)
    df_vacation.to_csv(filepath, index=False)
    export_summary.append({
        'DataFrame': 'df_vacation',
        'Description': 'Detailed vacation records for leadership',
        'Shape': df_vacation.shape,
        'Filename': filename
    })
    print(f"✅ Exported df_vacation: {df_vacation.shape[0]:,} rows, {df_vacation.shape[1]} columns")
else:
    print("❌ df_vacation not found - please run Cell 7 first")

# Export monthly vacation summary if it exists
if 'df_vacation_monthly' in globals() and df_vacation_monthly is not None:
    filename = f'df_vacation_monthly_summary_{timestamp}.csv'
    filepath = os.path.join(csv_folder, filename)
    df_vacation_monthly.to_csv(filepath, index=False)
    export_summary.append({
        'DataFrame': 'df_vacation_monthly',
        'Description': 'Monthly vacation summary for dashboard',
        'Shape': df_vacation_monthly.shape,
        'Filename': filename
    })
    print(f"✅ Exported df_vacation_monthly: {df_vacation_monthly.shape[0]:,} rows, {df_vacation_monthly.shape[1]} columns")
else:
    print("❌ df_vacation_monthly not found")

# Create a summary report
if export_summary:
    print(f"\n📋 EXPORT SUMMARY")
    print("="*60)
    print(f"📁 Files saved to: {os.path.abspath(csv_folder)}")
    print(f"🕐 Timestamp: {timestamp}")
    print(f"\n📊 Exported DataFrames:")
    
    for item in export_summary:
        print(f"\n• {item['DataFrame']}:")
        print(f"  - Description: {item['Description']}")
        print(f"  - Shape: {item['Shape'][0]:,} rows × {item['Shape'][1]} columns")
        print(f"  - File: {item['Filename']}")
    
    # Also create a summary CSV with metadata
    summary_df = pd.DataFrame(export_summary)
    summary_filename = f'export_summary_{timestamp}.csv'
    summary_filepath = os.path.join(csv_folder, summary_filename)
    summary_df.to_csv(summary_filepath, index=False)
    print(f"\n📄 Export summary saved to: {summary_filename}")
    
    # Create a README for the CSV review folder
    readme_content = f"""# CSV Review Folder
    
## Export Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

This folder contains exported DataFrames from the Leader Capacity Dashboard notebook.

## Files in this export:

"""
    for item in export_summary:
        readme_content += f"""
### {item['Filename']}
- **DataFrame**: {item['DataFrame']}
- **Description**: {item['Description']}
- **Shape**: {item['Shape'][0]:,} rows × {item['Shape'][1]} columns

"""
    
    readme_path = os.path.join(csv_folder, 'README.md')
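    # The README contains non-ASCII characters (e.g. '×'), so write it as UTF-8 explicitly
    with open(readme_path, 'w', encoding='utf-8') as f: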
        f.write(readme_content)
    
    print(f"\n📝 README created in CSV review folder")
    print(f"\n✅ Export complete! Check the '{csv_folder}' folder for your files.")
else:
    print("\n⚠️  No DataFrames were found to export.")
    print("   Please run the previous cells to create the DataFrames first.")


📊 EXPORTING DATAFRAMES TO CSV REVIEW FOLDER
✅ Exported df_10k: 164,349 rows, 55 columns
✅ Exported df_filtered_users: 528 rows, 36 columns
✅ Exported df_vacation: 933 rows, 21 columns
✅ Exported df_vacation_monthly: 258 rows, 8 columns

📋 EXPORT SUMMARY
📁 Files saved to: /Users/psweeney/leader-capacity-dashboard/CSV review
🕐 Timestamp: 20250711_113120

📊 Exported DataFrames:

• df_10k:
  - Description: Merged 10k booking data with leadership users
  - Shape: 164,349 rows × 55 columns
  - File: df_10k_merged_leadership_20250711_113120.csv

• df_filtered_users:
  - Description: Leadership users only (filtered by role)
  - Shape: 528 rows × 36 columns
  - File: df_filtered_users_leadership_20250711_113120.csv

• df_vacation:
  - Description: Detailed vacation records for leadership
  - Shape: 933 rows × 21 columns
  - File: df_vacation_detailed_20250711_113120.csv

• df_vacation_monthly:
  - Description: Monthly vacation summary for dashboard
  - Shape: 258 rows × 8 columns
  - File: df_v

In [6]:
# [Cell 7] Load, Process, and Integrate Vacation Data (Fixed Date Range)

print("🏖️ VACATION AND LEAVE DATA PROCESSING (ADJUSTED FOR AVAILABLE DATA)")
print("="*60)

# Load vacation data
try:
    df_vacation_raw = pd.read_csv('../data/Namely Vacation and Leave Dataset.csv')
    print(f"✅ Vacation data loaded successfully: {df_vacation_raw.shape}")
    print(f"   📊 Total records: {df_vacation_raw.shape[0]:,}")
    print(f"   📋 Total columns: {df_vacation_raw.shape[1]}")
    
    # Show vacation types available
    print(f"\n🏖️ Available vacation/leave types:")
    vacation_types = df_vacation_raw['Type'].value_counts()
    print(vacation_types.head(10))
    
except Exception as e:
    print(f"❌ Error loading vacation data: {e}")
    df_vacation_raw = None

if df_vacation_raw is not None:
    
    # STEP 1: Filter vacation data for leadership roles only
    print(f"\n" + "="*60)
    print("🔍 STEP 1: FILTERING VACATION DATA FOR LEADERSHIP ROLES")
    print("="*60)
    
    # We need to match vacation data with our leadership users
    # Check if we have the filtered users from Cell 5
    if 'df_filtered_users' in globals() and df_filtered_users is not None:
        print(f"✅ Found filtered leadership users: {df_filtered_users.shape[0]} users")
        
        # Build the lookup lists used to match vacation rows to leadership users
        leadership_names = (
            df_filtered_users['first_name'].astype(str) + ' ' +
            df_filtered_users['last_name'].astype(str)
        ).tolist()
        
        # Employee numbers are optional; skip missing values
        if 'employee_number' in df_filtered_users.columns:
            leadership_employee_numbers = df_filtered_users['employee_number'].dropna().tolist()
        else:
            leadership_employee_numbers = []
        
        print(f"   📋 Leadership names to match: {len(leadership_names)}")
        print(f"   📋 Leadership employee numbers: {len(leadership_employee_numbers)}")
        
        # Filter vacation data by matching names and employee numbers
        name_matches = df_vacation_raw['Full Name'].isin(leadership_names)
        emp_num_matches = df_vacation_raw['Employee Number'].isin(leadership_employee_numbers)
        vacation_filtered = df_vacation_raw[name_matches | emp_num_matches].copy()
        
        print(f"\n✅ Filtered vacation data: {vacation_filtered.shape[0]} records")
        print(f"   📊 From {vacation_filtered['Full Name'].nunique()} unique employees")
        
        # Show which leadership people have vacation data
        matched_names = vacation_filtered['Full Name'].unique()
        print(f"\n👥 Leadership employees with vacation data:")
        for name in sorted(matched_names)[:10]:  # Show first 10
            count = vacation_filtered[vacation_filtered['Full Name'] == name].shape[0]
            print(f"   • {name}: {count} vacation records")
        if len(matched_names) > 10:
            print(f"   ... and {len(matched_names) - 10} more employees")
            
    else:
        print("⚠️  No filtered users found from Cell 5. Using all vacation data.")
        print("   💡 Please run Cell 5 first to filter for leadership roles only.")
        vacation_filtered = df_vacation_raw.copy()
    
    # STEP 2: Process vacation data for dashboard use
    print(f"\n" + "="*60)
    print("🔄 STEP 2: PROCESSING VACATION DATA FOR DASHBOARD")
    print("="*60)
    
    # Clean and process the vacation data
    vacation_processed = vacation_filtered.copy()
    
    # Convert date columns
    date_columns = ['Start date', 'Departure date']
    for col in date_columns:
        if col in vacation_processed.columns:
            vacation_processed[col] = pd.to_datetime(vacation_processed[col], errors='coerce')
    
    # Focus on actual vacation/leave (not just allocations)
    vacation_actual = vacation_processed[
        (vacation_processed['Used'] > 0) | (vacation_processed['Scheduled'] > 0)
    ].copy()
    
    print(f"✅ Processed vacation data: {vacation_actual.shape[0]} records with actual time off")
    
    # Check date range of vacation data
    if not vacation_actual.empty and vacation_actual['Start date'].notna().any():
        date_min = vacation_actual['Start date'].min()
        date_max = vacation_actual['Start date'].max()
        print(f"\n📅 Vacation data date range:")
        print(f"   • Earliest: {date_min.strftime('%Y-%m-%d')}")
        print(f"   • Latest: {date_max.strftime('%Y-%m-%d')}")
    
    # Categorize vacation types for dashboard
    vacation_categories = {
        'Vacation': ['Vacation', 'UAE Vacation', 'Work From Anywhere'],
        'Sick Leave': ['Sick', 'UAE Sick Time'],
        'Parental Leave': ['Parental Leave (UAE)', 'Prenatal Leave'],
        'Family Leave': ['Family Caregiver Leave', 'Family Caregiver Leave (UAE)', 'Bereavement'],
        'Other': ['Jury Duty', 'UAE Study Leave']
    }
    
    def categorize_vacation_type(vacation_type):
        for category, types in vacation_categories.items():
            if vacation_type in types:
                return category
        return 'Other'
    
    vacation_actual = vacation_actual.copy()
    vacation_actual['Vacation_Category'] = vacation_actual['Type'].apply(categorize_vacation_type)
    
    print(f"\n📊 Vacation by category:")
    print(vacation_actual['Vacation_Category'].value_counts())
    
    # STEP 3: Create vacation summary for dashboard integration
    print(f"\n" + "="*60)
    print("📊 STEP 3: CREATING VACATION SUMMARY FOR DASHBOARD")
    print("="*60)
    
    # ADJUSTED: Use the last 4 months of available data instead of future months
    if not vacation_actual.empty and vacation_actual['Start date'].notna().any():
        latest_date = vacation_actual['Start date'].max()
        # Go back 3 months so the 4-month window ends in the month of the latest date
        start_date = latest_date - pd.DateOffset(months=3)
        start_date = start_date.replace(day=1)  # Start of month
        
        date_range = pd.date_range(
            start=start_date,
            periods=4,
            freq='MS'  # Month start
        )
        
        print(f"📅 Adjusted dashboard date range (based on available data):")
        print(f"   • From: {date_range[0].strftime('%Y-%m')}")
        print(f"   • To: {date_range[-1].strftime('%Y-%m')}")
    else:
        # Fallback to default range; normalize to midnight so month boundaries compare cleanly
        current_date = pd.Timestamp.now().normalize()
        date_range = pd.date_range(
            start=current_date.replace(day=1),
            periods=4,
            freq='MS'
        )
        print(f"📅 Using default date range: {date_range[0].strftime('%Y-%m')} to {date_range[-1].strftime('%Y-%m')}")
    
    # Create vacation summary by person and month
    vacation_summary = []
    
    for _, row in vacation_actual.iterrows():
        start_date = row['Start date']
        departure_date = row['Departure date']
        
        # Skip if no valid dates
        if pd.isna(start_date):
            continue
            
        # Use departure date if available, otherwise assume single day
        end_date = departure_date if pd.notna(departure_date) else start_date
        
        # Check if vacation overlaps with our dashboard period
        for month_start in date_range:
            month_end = month_start + pd.offsets.MonthEnd(0)
            
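            # Check if vacation overlaps with this month.
            # Note: the record's full Used/Scheduled values are attached to every
            # overlapping month, so leave spanning several months is counted in each.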
            if start_date <= month_end and end_date >= month_start:
                vacation_summary.append({
                    'Full_Name': row['Full Name'],
                    'First_Name': row['First Name'],
                    'Last_Name': row['Last Name'],
                    'Employee_Number': row['Employee Number'],
                    'Month': month_start,
                    'Vacation_Type': row['Type'],
                    'Vacation_Category': row['Vacation_Category'],
                    'Days_Used': row['Used'],
                    'Days_Scheduled': row['Scheduled'],
                    'Start_Date': start_date,
                    'End_Date': end_date,
                    'Job_Title': row['Job Title'],
                    'Office_Location': row['Office Location']
                })
    
    # Convert to DataFrame
    df_vacation_summary = pd.DataFrame(vacation_summary)
    
    if not df_vacation_summary.empty:
        print(f"\n✅ Vacation summary created: {df_vacation_summary.shape[0]} month-person records")
        print(f"   📊 Covering {df_vacation_summary['Full_Name'].nunique()} unique employees")
        print(f"   📅 Across {df_vacation_summary['Month'].nunique()} months")
        
        # Show sample of vacation summary
        print(f"\n📊 Sample vacation summary:")
        sample_cols = ['Full_Name', 'Month', 'Vacation_Category', 'Days_Used', 'Days_Scheduled']
        print(df_vacation_summary[sample_cols].head(10).to_string(index=False))
        
        # Show monthly vacation totals
        print(f"\n📊 Monthly vacation totals:")
        monthly_totals = df_vacation_summary.groupby('Month').agg({
            'Days_Used': 'sum',
            'Days_Scheduled': 'sum',
            'Full_Name': 'nunique'
        }).round(1)
        monthly_totals.columns = ['Total_Days_Used', 'Total_Days_Scheduled', 'Unique_Employees']
        print(monthly_totals)
        
    else:
        print("⚠️  No vacation data found for the selected period")
        df_vacation_summary = pd.DataFrame()
    
    # STEP 4: Prepare for integration with main dataset
    print(f"\n" + "="*60)
    print("🔗 STEP 4: PREPARING FOR INTEGRATION WITH MAIN DATASET")
    print("="*60)
    
    # Create a version that can be merged with the main 10k dataset
    if not df_vacation_summary.empty:
        # Check if df_10k exists
        if 'df_10k' in globals() and df_10k is not None:
            print("✅ Found df_10k dataset for integration")
            print(f"   📊 df_10k shape: {df_10k.shape}")
        else:
            print("⚠️  df_10k not found - please run Cell 5 first")
            print("   💡 The vacation data is still processed and ready for later use")
        
        # Group by person and month to avoid duplicates
        vacation_monthly = df_vacation_summary.groupby(['Full_Name', 'Month']).agg({
            'Days_Used': 'sum',
            'Days_Scheduled': 'sum',
            'Vacation_Category': lambda x: ', '.join(x.unique()),
            'First_Name': 'first',
            'Last_Name': 'first',
            'Employee_Number': 'first'
        }).reset_index()
        
        print(f"\n📊 Monthly vacation data ready for merge: {vacation_monthly.shape[0]} records")
        print(f"   👥 For {vacation_monthly['Full_Name'].nunique()} unique employees")
        
        # Store for use in later cells
        df_vacation = vacation_actual  # Full detailed data
        df_vacation_monthly = vacation_monthly  # Monthly summary for dashboard
        
        print(f"\n✅ Vacation data processing complete!")
        print(f"   📊 df_vacation: {df_vacation.shape[0]} detailed records")
        print(f"   📊 df_vacation_monthly: {df_vacation_monthly.shape[0]} monthly summaries")
        
    else:
        print("⚠️  No vacation summary to prepare")
        df_vacation = vacation_actual if 'vacation_actual' in locals() else pd.DataFrame()
        df_vacation_monthly = pd.DataFrame()
    
    print(f"\n" + "="*60)
    print("✅ VACATION DATA PROCESSING COMPLETE!")
    print("="*60)
    
else:
    print("❌ Cannot process vacation data - loading failed")
    df_vacation = None
    df_vacation_monthly = None


🏖️ VACATION AND LEAVE DATA PROCESSING (ADJUSTED FOR AVAILABLE DATA)
✅ Vacation data loaded successfully: (18190, 20)
   📊 Total records: 18,190
   📋 Total columns: 20

🏖️ Available vacation/leave types:
Type
Work From Anywhere              2171
Jury Duty                       2100
Family Caregiver Leave          2059
Sick                            1534
UAE Vacation                    1378
Vacation                        1371
Bereavement                     1361
Parental Leave (UAE)            1319
UAE Study Leave                 1313
Family Caregiver Leave (UAE)    1313
Name: count, dtype: int64

🔍 STEP 1: FILTERING VACATION DATA FOR LEADERSHIP ROLES
✅ Found filtered leadership users: 528 users
   📋 Leadership names to match: 528
   📋 Leadership employee numbers: 458

✅ Filtered vacation data: 8881 records
   📊 From 459 unique employees

👥 Leadership employees with vacation data:
   • Aaron Fallon: 15 vacation records
   • Abby  Brewster: 3 vacation records
   • Abby  Ciucias: 2 vacat
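
**Note on multi-month leave.** The overlap loop in this cell copies a record's full `Used` and `Scheduled` values into every month the leave touches, so a record that crosses a month boundary may be counted more than once in the monthly totals. One possible alternative, shown here only as a sketch, is to count the calendar days of the leave that fall inside each month. The helper name `overlap_days_by_month` and the sample dates are hypothetical and not part of the notebook; the sketch counts calendar days (weekends and holidays included), which may not match how `Used` is defined in the source data.

```python
import pandas as pd

def overlap_days_by_month(start, end, months):
    """Return {month_start: calendar days of [start, end] inside that month}.

    Hypothetical helper, not part of the notebook. Both endpoints are
    inclusive, and weekends/holidays are counted like any other day.
    """
    result = {}
    for month_start in months:
        month_end = month_start + pd.offsets.MonthEnd(0)
        # Clip the leave interval to the month boundaries
        clipped_start = max(start, month_start)
        clipped_end = min(end, month_end)
        if clipped_start <= clipped_end:
            result[month_start] = (clipped_end - clipped_start).days + 1
    return result

# Made-up example: leave from 29 Jan to 2 Feb 2024
months = pd.date_range("2024-01-01", periods=4, freq="MS")
days = overlap_days_by_month(
    pd.Timestamp("2024-01-29"), pd.Timestamp("2024-02-02"), months
)
for month, n in days.items():
    print(month.strftime("%Y-%m"), n)
```

With the made-up dates above, three days land in January and two in February, instead of the full record being attributed to both months.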

In [7]:
# [Cell 6] Load, Process, and Integrate Vacation and Leave Data

print("🏖️ VACATION AND LEAVE DATA PROCESSING")
print("="*60)

# Load vacation data
try:
    df_vacation_raw = pd.read_csv('../data/Namely Vacation and Leave Dataset.csv')
    print(f"✅ Vacation data loaded successfully: {df_vacation_raw.shape}")
    print(f"   📊 Total records: {df_vacation_raw.shape[0]:,}")
    print(f"   📋 Total columns: {df_vacation_raw.shape[1]}")
    
    # Show vacation types available
    print(f"\n🏖️ Available vacation/leave types:")
    vacation_types = df_vacation_raw['Type'].value_counts()
    print(vacation_types.head(10))
    
except Exception as e:
    print(f"❌ Error loading vacation data: {e}")
    df_vacation_raw = None

if df_vacation_raw is not None:
    
    # STEP 1: Filter vacation data for leadership roles only
    print(f"\n" + "="*60)
    print("🔍 STEP 1: FILTERING VACATION DATA FOR LEADERSHIP ROLES")
    print("="*60)
    
    # We need to match vacation data with our leadership users
    # Check if we have the filtered users from Cell 5
    if 'df_filtered_users' in locals() and df_filtered_users is not None:
        print(f"✅ Found filtered leadership users: {df_filtered_users.shape[0]} users")
        
        # Build the match keys: full names plus employee numbers where available
        leadership_names = (
            df_filtered_users['first_name'].astype(str)
            + ' '
            + df_filtered_users['last_name'].astype(str)
        ).tolist()
        
        # Employee numbers are optional; missing values are dropped
        if 'employee_number' in df_filtered_users.columns:
            leadership_employee_numbers = df_filtered_users['employee_number'].dropna().tolist()
        else:
            leadership_employee_numbers = []
        
        print(f"   📋 Leadership names to match: {len(leadership_names)}")
        print(f"   📋 Leadership employee numbers: {len(leadership_employee_numbers)}")
        
        # Filter vacation data by matching names and employee numbers
        name_matches = df_vacation_raw['Full Name'].isin(leadership_names)
        emp_num_matches = df_vacation_raw['Employee Number'].isin(leadership_employee_numbers)
        vacation_filtered = df_vacation_raw[name_matches | emp_num_matches].copy()
        
        print(f"\n✅ Filtered vacation data: {vacation_filtered.shape[0]} records")
        print(f"   📊 From {vacation_filtered['Full Name'].nunique()} unique employees")
        
        # Show which leadership people have vacation data
        matched_names = vacation_filtered['Full Name'].unique()
        print(f"\n👥 Leadership employees with vacation data:")
        for name in sorted(matched_names)[:10]:  # Show first 10
            count = vacation_filtered[vacation_filtered['Full Name'] == name].shape[0]
            print(f"   • {name}: {count} vacation records")
        if len(matched_names) > 10:
            print(f"   ... and {len(matched_names) - 10} more employees")
            
    else:
        print("⚠️  No filtered users found from Cell 5. Using all vacation data.")
        vacation_filtered = df_vacation_raw.copy()
    
    # STEP 2: Process vacation data for dashboard use
    print(f"\n" + "="*60)
    print("🔄 STEP 2: PROCESSING VACATION DATA FOR DASHBOARD")
    print("="*60)
    
    # Clean and process the vacation data
    vacation_processed = vacation_filtered.copy()
    
    # Convert date columns
    date_columns = ['Start date', 'Departure date']
    for col in date_columns:
        if col in vacation_processed.columns:
            vacation_processed[col] = pd.to_datetime(vacation_processed[col], errors='coerce')
    
    # Focus on actual vacation/leave (not just allocations)
    # Filter for records with actual used time or scheduled time
    vacation_actual = vacation_processed[
        (vacation_processed['Used'] > 0) | (vacation_processed['Scheduled'] > 0)
    ].copy()
    
    print(f"✅ Processed vacation data: {vacation_actual.shape[0]} records with actual time off")
    
    # Categorize vacation types for dashboard
    vacation_categories = {
        'Vacation': ['Vacation', 'UAE Vacation', 'Work From Anywhere'],
        'Sick Leave': ['Sick', 'UAE Sick Time'],
        'Parental Leave': ['Parental Leave (UAE)', 'Prenatal Leave'],
        'Family Leave': ['Family Caregiver Leave', 'Family Caregiver Leave (UAE)', 'Bereavement'],
        'Other': ['Jury Duty', 'UAE Study Leave']
    }
    
    def categorize_vacation_type(vacation_type):
        for category, types in vacation_categories.items():
            if vacation_type in types:
                return category
        return 'Other'
    
    vacation_actual = vacation_actual.copy()
    vacation_actual['Vacation_Category'] = vacation_actual['Type'].apply(categorize_vacation_type)
    
    print(f"\n📊 Vacation by category:")
    print(vacation_actual['Vacation_Category'].value_counts())
    
    # STEP 3: Create vacation summary for dashboard integration
    print(f"\n" + "="*60)
    print("📊 STEP 3: CREATING VACATION SUMMARY FOR DASHBOARD")
    print("="*60)
    
    # Create monthly vacation summary
    current_date = pd.Timestamp.now()
    
    # Create date range for next 4 months (current + 3 future)
    date_range = pd.date_range(
        start=current_date.replace(day=1),
        periods=4,
        freq='MS'  # Month start
    )
    
    print(f"📅 Dashboard date range: {date_range[0].strftime('%Y-%m')} to {date_range[-1].strftime('%Y-%m')}")
    
    # Create vacation summary by person and month
    vacation_summary = []
    
    for _, row in vacation_actual.iterrows():
        start_date = row['Start date']
        departure_date = row['Departure date']
        
        # Skip if no valid dates
        if pd.isna(start_date):
            continue
            
        # Use departure date if available, otherwise assume single day
        end_date = departure_date if pd.notna(departure_date) else start_date
        
        # Check if vacation overlaps with our dashboard period
        for month_start in date_range:
            month_end = month_start + pd.offsets.MonthEnd(0)
            
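            # Check if vacation overlaps with this month.
            # Note: the record's full Used/Scheduled values are attached to every
            # overlapping month, so leave spanning several months is counted in each.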
            if start_date <= month_end and end_date >= month_start:
                vacation_summary.append({
                    'Full_Name': row['Full Name'],
                    'First_Name': row['First Name'],
                    'Last_Name': row['Last Name'],
                    'Employee_Number': row['Employee Number'],
                    'Month': month_start,
                    'Vacation_Type': row['Type'],
                    'Vacation_Category': row['Vacation_Category'],
                    'Days_Used': row['Used'],
                    'Days_Scheduled': row['Scheduled'],
                    'Start_Date': start_date,
                    'End_Date': end_date,
                    'Job_Title': row['Job Title'],
                    'Office_Location': row['Office Location']
                })
    
    # Convert to DataFrame
    df_vacation_summary = pd.DataFrame(vacation_summary)
    
    if not df_vacation_summary.empty:
        print(f"✅ Vacation summary created: {df_vacation_summary.shape[0]} month-person records")
        print(f"   📊 Covering {df_vacation_summary['Full_Name'].nunique()} unique employees")
        print(f"   📅 Across {df_vacation_summary['Month'].nunique()} months")
        
        # Show sample of vacation summary
        print(f"\n📊 Sample vacation summary:")
        sample_cols = ['Full_Name', 'Month', 'Vacation_Category', 'Days_Used', 'Days_Scheduled']
        print(df_vacation_summary[sample_cols].head(10).to_string(index=False))
        
        # Show monthly vacation totals
        print(f"\n📊 Monthly vacation totals:")
        monthly_totals = df_vacation_summary.groupby('Month').agg({
            'Days_Used': 'sum',
            'Days_Scheduled': 'sum',
            'Full_Name': 'nunique'
        }).round(1)
        monthly_totals.columns = ['Total_Days_Used', 'Total_Days_Scheduled', 'Unique_Employees']
        print(monthly_totals)
        
    else:
        print("⚠️  No vacation data found for the dashboard period")
        df_vacation_summary = pd.DataFrame()
    
    # STEP 4: Prepare for integration with main dataset
    print(f"\n" + "="*60)
    print("🔗 STEP 4: PREPARING FOR INTEGRATION WITH MAIN DATASET")
    print("="*60)
    
    # Create a version that can be merged with the main 10k dataset
    if not df_vacation_summary.empty and 'df_10k' in locals() and df_10k is not None:
        print("✅ Attempting to merge vacation data with main 10k dataset...")
        
        # Work on a copy so df_vacation_summary itself is left untouched
        vacation_for_merge = df_vacation_summary.copy()
        
        # Group by person and month to avoid duplicates
        vacation_monthly = vacation_for_merge.groupby(['Full_Name', 'Month']).agg({
            'Days_Used': 'sum',
            'Days_Scheduled': 'sum',
            'Vacation_Category': lambda x: ', '.join(x.unique()),
            'First_Name': 'first',
            'Last_Name': 'first',
            'Employee_Number': 'first'
        }).reset_index()
        
        print(f"   📊 Monthly vacation data ready for merge: {vacation_monthly.shape[0]} records")
        print(f"   👥 For {vacation_monthly['Full_Name'].nunique()} unique employees")
        
        # Store for use in later cells
        df_vacation = vacation_actual  # Full detailed data
        df_vacation_monthly = vacation_monthly  # Monthly summary for dashboard
        
        print(f"\n✅ Vacation data processing complete!")
        print(f"   📊 df_vacation: {df_vacation.shape[0]} detailed records")
        print(f"   📊 df_vacation_monthly: {df_vacation_monthly.shape[0]} monthly summaries")
        
    else:
        print("⚠️  Skipping monthly aggregation - vacation summary is empty or df_10k is not available")
        df_vacation = vacation_actual
        df_vacation_monthly = df_vacation_summary  # Per-record rows (possibly empty), not aggregated by month
    
    print(f"\n" + "="*60)
    print("✅ VACATION DATA PROCESSING COMPLETE!")
    print("="*60)
    
else:
    print("❌ Cannot process vacation data - loading failed")
    df_vacation = None
    df_vacation_monthly = None

🏖️ VACATION AND LEAVE DATA PROCESSING
✅ Vacation data loaded successfully: (18190, 20)
   📊 Total records: 18,190
   📋 Total columns: 20

🏖️ Available vacation/leave types:
Type
Work From Anywhere              2171
Jury Duty                       2100
Family Caregiver Leave          2059
Sick                            1534
UAE Vacation                    1378
Vacation                        1371
Bereavement                     1361
Parental Leave (UAE)            1319
UAE Study Leave                 1313
Family Caregiver Leave (UAE)    1313
Name: count, dtype: int64

🔍 STEP 1: FILTERING VACATION DATA FOR LEADERSHIP ROLES
✅ Found filtered leadership users: 528 users
   📋 Leadership names to match: 528
   📋 Leadership employee numbers: 458

✅ Filtered vacation data: 8881 records
   📊 From 459 unique employees

👥 Leadership employees with vacation data:
   • Aaron Fallon: 15 vacation records
   • Abby  Brewster: 3 vacation records
   • Abby  Ciucias: 2 vacation records
   • Abby Brewster
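
**Note on name matching.** The output above lists both `Abby  Brewster` (two spaces) and `Abby Brewster`, which suggests the `Full Name` column contains whitespace variants of the same person. The filter in this cell compares names exactly, so such variants only match if the leadership list happens to contain the same spelling. A minimal sketch of normalizing names before comparing is shown below. The helper `normalize_name` and the toy frames are hypothetical, and lowercasing is an extra assumption that may or may not suit the real data.

```python
import pandas as pd

def normalize_name(names: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and lowercase a Series of names.

    Hypothetical helper, not part of the notebook.
    """
    return (
        names.astype(str)
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )

# Toy data standing in for df_vacation_raw and df_filtered_users
vacation = pd.DataFrame(
    {"Full Name": ["Abby  Brewster", "Abby Brewster", "Aaron Fallon", "Someone Else"]}
)
leaders = pd.DataFrame(
    {"first_name": ["Abby", "Aaron "], "last_name": ["Brewster", "Fallon"]}
)

# Normalize both sides, then compare on the normalized key
leader_keys = normalize_name(leaders["first_name"] + " " + leaders["last_name"])
mask = normalize_name(vacation["Full Name"]).isin(leader_keys)
matched = vacation[mask]
print(matched["Full Name"].tolist())
```

Matching on `Employee Number` where it is available remains the more reliable key; normalized names are only a fallback for rows without one.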

In [8]:
# [Cell 7] Load Salesforce Opportunity Data
try:
    df_salesforce = pd.read_csv('../data/Salesforce Opportunity Data.csv')
    print(f"✅ Salesforce data loaded successfully: {df_salesforce.shape}")
    print("\n📊 Column names:")
    print(df_salesforce.columns.tolist())
    print("\n📊 Data types:")
    print(df_salesforce.dtypes)
except Exception as e:
    print(f"❌ Error loading Salesforce data: {e}")
    df_salesforce = None

✅ Salesforce data loaded successfully: (1787, 19)

📊 Column names:
['Probability', 'Account Name', 'Engagement Name', 'Created Date', 'Schedule Month', 'Region', 'Schedule Amount', 'Intacct Project ID', 'Project Code', 'Engagement Launch Date', 'Primary Partner', 'Fee', 'Engagement ID', 'Expense Budget', 'Engagement End Date', 'Office', 'Industry', '_BATCH_ID_', '_BATCH_LAST_RUN_']

📊 Data types:
Probability                object
Account Name               object
Engagement Name            object
Created Date               object
Schedule Month             object
Region                     object
Schedule Amount           float64
Intacct Project ID         object
Project Code               object
Engagement Launch Date     object
Primary Partner           float64
Fee                       float64
Engagement ID              object
Expense Budget            float64
Engagement End Date        object
Office                     object
Industry                   object
_BATCH_ID_            
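
**Note on Salesforce dtypes.** The dtypes above show `Probability` and all date columns loaded as `object`, so they would need converting before any filtering by likelihood or schedule month. The sketch below shows one way this could be done. The function name `clean_opportunities` is hypothetical, and the assumed `Probability` format (text such as `"75%"` or `"75"`) has not been verified against the file; values that do not parse become `NaN`/`NaT` rather than raising.

```python
import pandas as pd

DATE_COLUMNS = [
    "Created Date",
    "Schedule Month",
    "Engagement Launch Date",
    "Engagement End Date",
]

def clean_opportunities(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Probability as a number and date columns as datetimes.

    Hypothetical helper, not part of the notebook.
    """
    out = df.copy()
    # Assumption: Probability is text like "75%" or "75"
    out["Probability"] = pd.to_numeric(
        out["Probability"].astype(str).str.replace("%", "", regex=False).str.strip(),
        errors="coerce",
    )
    for col in DATE_COLUMNS:
        if col in out.columns:
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "Probability": ["75%", "50", None, "n/a"],
        "Schedule Month": ["2025-01-01", "2025-02-01", None, "bad"],
        "Schedule Amount": [1000.0, 2000.0, 500.0, 0.0],
    }
)
cleaned = clean_opportunities(sample)
print(cleaned.dtypes)
```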

In [7]:
# [Cell 8] Load US Working Hours Data
try:
    df_us_hours = pd.read_csv('../data/Working Hours For US.csv')
    print(f"✅ US Working Hours data loaded successfully: {df_us_hours.shape}")
    print("\n📊 Column names:")
    print(df_us_hours.columns.tolist())
    # Convert Month column to datetime
    df_us_hours['Month'] = pd.to_datetime(df_us_hours['Month'])
    print("\n📅 Sample data:")
    print(df_us_hours[['Month', 'Net Working Hours', 'Billable Days']].head(10))
except Exception as e:
    print(f"❌ Error loading US working hours data: {e}")
    df_us_hours = None

✅ US Working Hours data loaded successfully: (145, 17)

📊 Column names:
['Month', 'Start Date', 'End Date', 'Holiday #1', 'Holiday #2', 'Holiday #3', 'Holiday #4', 'Net Working Hours', 'Column_9', 'First Closure Day', 'Billable Days', 'Total Closures', 'Column_13', 'Column_14', 'Column_15', 'Column_16', 'Column_17']

📅 Sample data:
       Month  Net Working Hours  Billable Days
0 2014-01-01                189            NaN
1 2014-02-01                180            NaN
2 2014-03-01                189            NaN
3 2014-04-01                198            NaN
4 2014-05-01                189            NaN
5 2014-06-01                189            NaN
6 2014-07-01                198            NaN
7 2014-08-01                189            NaN
8 2014-09-01                189            NaN
9 2014-10-01                207            NaN
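
**Note on using `Net Working Hours`.** The dashboard metric described in the overview is booked time as a percentage of available working time, and `Net Working Hours` is the natural denominator for US staff. A minimal sketch of that calculation follows. The two working-hours rows are copied from the sample output above; the `booked` frame, its values, and the column names `Booked_Hours` and `Booked_Pct` are hypothetical and only illustrate the join.

```python
import pandas as pd

# Two rows copied from the sample output above
us_hours = pd.DataFrame(
    {
        "Month": pd.to_datetime(["2014-01-01", "2014-02-01"]),
        "Net Working Hours": [189, 180],
    }
)

# Hypothetical booked hours per person and month (made-up values)
booked = pd.DataFrame(
    {
        "Full_Name": ["Person A", "Person A", "Person B"],
        "Month": pd.to_datetime(["2014-01-01", "2014-02-01", "2014-01-01"]),
        "Booked_Hours": [94.5, 180.0, 0.0],
    }
)

# validate= raises if a month appears more than once in the working-hours table
capacity = booked.merge(us_hours, on="Month", how="left", validate="many_to_one")
capacity["Booked_Pct"] = (
    capacity["Booked_Hours"] / capacity["Net Working Hours"] * 100
).round(1)
print(capacity)
```

UAE staff would need the same join against the UAE working-hours table, keyed on each person's office location.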


In [8]:
# [Cell 9] Load UAE Working Hours Data
try:
    df_uae_hours = pd.read_csv('../data/UAE Working Hours.csv')
    print(f"✅ UAE Working Hours data loaded successfully: {df_uae_hours.shape}")
    print("\n📊 Column names:")
    print(df_uae_hours.columns.tolist())
    print("\n🔍 First few rows:")
    print(df_uae_hours.head())
except Exception as e:
    print(f"❌ Error loading UAE working hours data: {e}")
    df_uae_hours = None

✅ UAE Working Hours data loaded successfully: (49, 16)

📊 Column names:
['Month', 'Start Date', 'End Date', 'Holiday #1', 'Holiday #2', 'Holiday #3', 'Holiday #4', 'Holiday #5', 'Net Working Hours', 'Working Days', 'Column_11', 'Column_12', 'First Closure Day', 'Billable Days', 'Total Closures', 'Column_16']

🔍 First few rows:
        Month  Start Date    End Date  Holiday #1  Holiday #2 Holiday #3  \
0  2022-01-01  2022-01-01  2022-01-31  2022-01-21  2022-01-24        NaN   
1  2022-02-01  2022-02-01  2022-02-28  2022-02-21  2022-02-18        NaN   
2  2022-03-01  2022-03-01  2022-03-31  2022-03-25         NaN        NaN   
3  2022-04-01  2022-04-01  2022-04-30  2022-04-29         NaN        NaN   
4  2022-05-01  2022-05-01  2022-05-31  2022-05-27  2022-05-30        NaN   

  Holiday #4 Holiday #5  Net Working Hours  Working Days Column_11 Column_12  \
0        NaN        NaN                171           NaN       NaN       NaN   
1        NaN        NaN                162           N