# Scalable Test Data Generator for Potential Failures (Enhanced with Faker)

This notebook generates comprehensive test data for the `app_potential_failures` table with **optional Faker integration** for more realistic data.

## Features
- ✅ Configurable data volume (~15k records, adjustable)
- ✅ All KPI codes from `bronze.fms_dimkpiclassification`
- ✅ Selectable KPI codes (all or specific)
- ✅ Various task durations (short, medium, long) per KPI group
- ✅ Financial year spanning jobs (at least 1 per KPI code)
- ✅ Edge cases for downtime thresholds (24, 48, 100 hours)
- ✅ Random start/end times over a 2-year period starting 2025-05-25
- ✅ All jobs with COMP status
- ✅ Period boundary crossing tasks
- ✅ Distribution across all stations (excluding NULL sections)
- ✅ Join with `core_dimdate` for period/week
- ✅ Overlapping dates for duplicate testing
- ✅ Configurable frequency per KPI code
- ✅ Optional: Status simulation (WAPPR → APPR → COMP)
- ✅ Optional: Non-KPI code tasks
- ✨ **NEW**: Optional Faker integration for realistic names, emails, descriptions
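
As a rough illustration of the downtime-threshold edge cases listed above, the sketch below produces durations just under, exactly at, and just over each threshold. The function name `threshold_edge_durations` and the 0.5-hour margin are assumptions made for this example; they are not taken from the notebook.

```python
from typing import Dict, List


def threshold_edge_durations(thresholds: List[int],
                             margin_hours: float = 0.5) -> Dict[int, List[float]]:
    """Return [just under, exactly at, just over] durations (hours) per threshold.

    Hypothetical helper: the margin is an illustrative choice, not a project rule.
    """
    return {
        t: [t - margin_hours, float(t), t + margin_hours]
        for t in thresholds
    }


# The three thresholds named in the feature list
edge_cases = threshold_edge_durations([24, 48, 100])
for threshold, durations in edge_cases.items():
    print(threshold, durations)
```

Generating all three variants per threshold gives a test both sides of each comparison as well as the boundary value itself.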

## Why Faker?
- **More realistic** reporter names and emails
- **Varied descriptions** using natural language
- **Better test data** for UI/reporting validation
- **Still maintains** all edge case logic for business testing
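
One practical concern with generated names is reproducibility between runs. The sketch below shows one way to get repeatable output whether or not Faker is installed. `make_name_source` and `sample_names` are hypothetical helpers, and the fallback pool is a shortened illustrative list.

```python
import random

try:
    from faker import Faker
except ImportError:  # Faker is optional, mirroring the notebook's setup cell
    Faker = None


def make_name_source(seed: int = 42, locale: str = "en_GB"):
    """Return a zero-argument callable that yields reporter names (hypothetical helper)."""
    if Faker is not None:
        Faker.seed(seed)          # seeds the random generator shared by Faker instances
        return Faker(locale).name
    rng = random.Random(seed)     # local RNG so other random calls are unaffected
    pool = ["John Smith", "Sarah Johnson", "Michael Brown", "Emma Wilson"]
    return lambda: rng.choice(pool)


def sample_names(n: int, seed: int = 42):
    """Draw n names from a freshly seeded source."""
    source = make_name_source(seed)
    return [source() for _ in range(n)]


# Two runs with the same seed should produce the same sequence
print(sample_names(3) == sample_names(3))
```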

## Configuration Options
See the configuration section below to customize data generation.

## 1. Configuration & Setup

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import uuid
from typing import List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Try to import Faker (optional)
try:
    from faker import Faker
    FAKER_AVAILABLE = True
    print("✓ Faker library available")
except ImportError:
    FAKER_AVAILABLE = False
    print("⚠ Faker not installed - using simple data generation")
    print("  To install: pip install faker")

# Set random seed for reproducibility (uncomment the lines below to enable)
# random.seed(42)
# np.random.seed(42)
# if FAKER_AVAILABLE:
#     Faker.seed(42)

In [None]:
# ═══════════════════════════════════════════════════════════════
#  CONFIGURATION FLAGS
# ═══════════════════════════════════════════════════════════════

CONFIG = {
    # ─── Data Volume ───
    'TOTAL_RECORDS': 15000,  # Target number of records to generate
    
    # ─── Date Range ───
    'START_DATE': '2025-05-25',  # GTS started EL
    'PERIOD_YEARS': 2,  # Generate data over 2 years
    
    # ─── KPI Code Selection ───
    'USE_ALL_KPI_CODES': True,  # If False, use KPI_CODES_FILTER
    'KPI_CODES_FILTER': [],  # e.g., ['GRAFFITI', 'TRACKSIDE'] - only used if USE_ALL_KPI_CODES = False
    
    # ─── KPI Code Frequency ───
    # Weight distribution for KPI codes (higher = more records)
    'KPI_FREQUENCY_WEIGHTS': {},  # e.g., {'GRAFFITI': 2.0, 'TRACKSIDE': 1.5} - empty means equal distribution
    
    # ─── Task Durations ───
    'DURATION_CATEGORIES': {
        'short': {'min_hours': 1, 'max_hours': 24, 'weight': 0.4},
        'medium': {'min_hours': 25, 'max_hours': 120, 'weight': 0.4},
        'long': {'min_hours': 121, 'max_hours': 720, 'weight': 0.2},
    },
    
    # ─── Edge Cases ───
    'YEAR_SPANNING_PER_KPI': 1,  # At least 1 year-spanning job per KPI code
    'DOWNTIME_THRESHOLD_TESTS': True,  # Create jobs with 24, 48, 100 hour thresholds
    'PERIOD_BOUNDARY_CROSSING_RATIO': 0.3,  # 30% of jobs should cross period boundaries
    
    # ─── Duplicate Testing ───
    'CREATE_OVERLAPPING_GROUPS': True,  # Create overlapping date groups
    'OVERLAPPING_GROUPS_COUNT': 50,  # Number of overlap groups to create
    'OVERLAP_WINDOW_HOURS': 6,  # Tasks within 6 hours are considered overlapping
    
    # ─── Status ───
    'ALL_COMPLETED': True,  # All jobs COMP status
    
    # ─── Faker Integration (NEW) ───
    'USE_FAKER': True,  # Use Faker for realistic data (if available)
    'FAKER_LOCALE': 'en_GB',  # UK English for realistic UK names/addresses
    
    # ─── Output ───
    'OUTPUT_MODE': 'LAKEHOUSE',  # 'LAKEHOUSE' or 'SQL_SERVER'
    'LAKEHOUSE_PATH': '/lakehouse/default/Tables/test_potential_failures',  # For validation
    'SQL_TABLE_NAME': 'customer_success.app_potential_failures_test',  # Final table
    
    # ─── Optional Features ───
    'GENERATE_STATUS_HISTORY': False,  # Generate WAPPR → APPR → COMP history files
    'INCLUDE_NON_KPI_CODES': False,  # Include non-KPI tasks
    'NON_KPI_RATIO': 0.1,  # 10% non-KPI tasks if enabled
    
    # ─── Database Connection ───
    'SQL_CONNECTION_STRING': None,  # Set to None to use Fabric default
}

# Initialize Faker if enabled and available
if CONFIG['USE_FAKER'] and FAKER_AVAILABLE:
    fake = Faker(CONFIG['FAKER_LOCALE'])
    print(f"✓ Faker initialized with locale: {CONFIG['FAKER_LOCALE']}")
else:
    fake = None
    if CONFIG['USE_FAKER'] and not FAKER_AVAILABLE:
        print("⚠ Faker requested but not available - falling back to simple generation")

print("✓ Configuration loaded")
print(f"  Target Records: {CONFIG['TOTAL_RECORDS']:,}")
print(f"  Date Range: {CONFIG['START_DATE']} + {CONFIG['PERIOD_YEARS']} years")
print(f"  Using Faker: {CONFIG['USE_FAKER'] and FAKER_AVAILABLE}")
print(f"  Output Mode: {CONFIG['OUTPUT_MODE']}")
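
Before generating data it can help to sanity-check the duration settings. The following is a minimal sketch, assuming the same dictionary shape as `CONFIG['DURATION_CATEGORIES']`. The function `validate_duration_categories` is hypothetical and not part of the notebook.

```python
import math
from typing import Dict, List


def validate_duration_categories(categories: Dict[str, Dict[str, float]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    total_weight = sum(c["weight"] for c in categories.values())
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total_weight}, expected 1.0")
    for name, c in categories.items():
        if c["min_hours"] >= c["max_hours"]:
            problems.append(f"{name}: min_hours must be below max_hours")
        if c["weight"] < 0:
            problems.append(f"{name}: weight must not be negative")
    return problems


# Same values as the configuration cell above
sample = {
    "short": {"min_hours": 1, "max_hours": 24, "weight": 0.4},
    "medium": {"min_hours": 25, "max_hours": 120, "weight": 0.4},
    "long": {"min_hours": 121, "max_hours": 720, "weight": 0.2},
}
problems = validate_duration_categories(sample)
print(problems or "duration categories look consistent")
```

`random.choices` accepts relative weights, so the sum-to-one check is a readability safeguard rather than a hard requirement.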

## 2. Database Connection & Reference Data Loading

In [None]:
# ═══════════════════════════════════════════════════════════════
#  DATABASE CONNECTION
# ═══════════════════════════════════════════════════════════════

def get_connection():
    """Get database connection (Fabric or custom)"""
    if CONFIG['SQL_CONNECTION_STRING']:
        # pyodbc is only needed for a custom connection string, so import it lazily
        import pyodbc
        return pyodbc.connect(CONFIG['SQL_CONNECTION_STRING'])
    else:
        # Use Fabric notebook connection
        from notebookutils import mssparkutils
        return mssparkutils.credentials.getConnectionString()

def load_reference_data():
    """Load KPI codes, stations, and date dimensions"""
    print("Loading reference data...")
    
    # Load KPI Classification codes
    kpi_query = """
    SELECT DISTINCT 
        KPICode,
        KPIDescription,
        KPICategory,
        ThresholdHours
    FROM bronze.fms_dimkpiclassification
    WHERE IsKPI = 1
    ORDER BY KPICode
    """
    
    # Load Stations (excluding NULL sections/depots)
    station_query = """
    SELECT DISTINCT
        StationCode as Building,
        StationName as BuildingName,
        LocationName,
        StationSection
    FROM customer_success.dimStation
    WHERE StationSection IS NOT NULL
        AND StationCode IS NOT NULL
    ORDER BY StationCode
    """
    
    # Load Date Dimension with Period information
    date_query = """
    SELECT 
        Date,
        Period,
        PeriodWeek,
        PeriodYear,
        FinancialYear
    FROM core_dimdate
    WHERE Date BETWEEN '2025-05-25' AND '2027-05-31'
    ORDER BY Date
    """
    
    try:
        # Use Spark SQL in Fabric
        kpi_codes = spark.sql(kpi_query).toPandas()
        stations = spark.sql(station_query).toPandas()
        date_dim = spark.sql(date_query).toPandas()
        
        print(f"  ✓ Loaded {len(kpi_codes)} KPI codes")
        print(f"  ✓ Loaded {len(stations)} stations")
        print(f"  ✓ Loaded {len(date_dim)} dates")
        
        return kpi_codes, stations, date_dim
    except Exception as e:
        print(f"  ⚠ Error loading reference data: {e}")
        print("  Using mock data for demonstration...")
        return create_mock_reference_data()

def create_mock_reference_data():
    """Create mock data for testing without database access"""
    
    # Mock KPI codes
    kpi_codes = pd.DataFrame({
        'KPICode': ['GRAFFITI', 'TRACKSIDE', 'PLATFORM_CLEAN', 'LIFT_MAINT', 'ESCALATOR', 
                    'LIGHTING', 'SIGNAGE', 'DRAINAGE', 'FIRE_SAFETY', 'ACCESS_CONTROL'],
        'KPIDescription': ['Graffiti Removal', 'Trackside Cleaning', 'Platform Cleaning', 
                          'Lift Maintenance', 'Escalator Maintenance', 'Lighting Repair',
                          'Signage Updates', 'Drainage Maintenance', 'Fire Safety Checks',
                          'Access Control Maintenance'],
        'KPICategory': ['Cleaning', 'Cleaning', 'Cleaning', 'Mechanical', 'Mechanical',
                       'Electrical', 'Infrastructure', 'Infrastructure', 'Safety', 'Security'],
        'ThresholdHours': [24, 48, 24, 100, 100, 48, 24, 48, 24, 48]
    })
    
    # Mock stations
    stations = pd.DataFrame({
        'Building': ['KGX', 'STN', 'LIV', 'MAN', 'BHM', 'EDI', 'GLA', 'LEE', 'BRI', 'CAR',
                     'OXF', 'CAM', 'YRK', 'NEW', 'SHE', 'NOR', 'IPS', 'PET', 'MKC', 'WAT'],
        'BuildingName': ['Kings Cross', 'Stratford', 'Liverpool Street', 'Manchester Piccadilly',
                        'Birmingham New Street', 'Edinburgh Waverley', 'Glasgow Central',
                        'Leeds Station', 'Bristol Temple Meads', 'Cardiff Central',
                        'Oxford Station', 'Cambridge Station', 'York Station', 'Newcastle Central',
                        'Sheffield Station', 'Norwich Station', 'Ipswich Station', 'Peterborough',
                        'Milton Keynes Central', 'Waterloo'],
        'LocationName': ['London', 'London', 'London', 'Manchester', 'Birmingham', 'Edinburgh',
                        'Glasgow', 'Leeds', 'Bristol', 'Cardiff', 'Oxford', 'Cambridge', 'York',
                        'Newcastle', 'Sheffield', 'Norwich', 'Ipswich', 'Peterborough',
                        'Milton Keynes', 'London'],
        'StationSection': ['Main', 'Main', 'Main', 'Main', 'Main', 'Main', 'Main', 'Main',
                          'Main', 'Main', 'Main', 'Main', 'Main', 'Main', 'Main', 'Main',
                          'Main', 'Main', 'Main', 'Main']
    })
    
    # Mock date dimension
    start = pd.to_datetime('2025-05-25')
    dates = pd.date_range(start, periods=730, freq='D')
    date_dim = pd.DataFrame({
        'Date': dates,
    })
    
    # Calculate Period, PeriodWeek, PeriodYear, FinancialYear
    def get_period_info(date):
        # Simplified period logic (4-week periods)
        year = date.year if date.month >= 4 else date.year - 1
        fy_start = pd.to_datetime(f'{year}-04-01')
        days_since = (date - fy_start).days
        period = min((days_since // 28) + 1, 13)
        week = min((days_since // 7) + 1, 52)
        return period, week, year
    
    date_dim[['Period', 'PeriodWeek', 'PeriodYear']] = date_dim['Date'].apply(
        lambda x: pd.Series(get_period_info(x))
    )
    date_dim['FinancialYear'] = date_dim['PeriodYear'].apply(lambda x: f'FY{x}/{str(x+1)[-2:]}')
    date_dim['Period'] = date_dim['Period'].apply(lambda x: f'P{x:02d}')
    
    print(f"  ✓ Created {len(kpi_codes)} mock KPI codes")
    print(f"  ✓ Created {len(stations)} mock stations")
    print(f"  ✓ Created {len(date_dim)} mock dates")
    
    return kpi_codes, stations, date_dim

# Load or create reference data
kpi_codes_df, stations_df, date_dim_df = load_reference_data()

In [None]:
# Filter KPI codes based on configuration
if not CONFIG['USE_ALL_KPI_CODES'] and CONFIG['KPI_CODES_FILTER']:
    kpi_codes_df = kpi_codes_df[kpi_codes_df['KPICode'].isin(CONFIG['KPI_CODES_FILTER'])]
    print(f"Filtered to {len(kpi_codes_df)} KPI codes: {CONFIG['KPI_CODES_FILTER']}")

# Display reference data summary
print("\n" + "="*60)
print("REFERENCE DATA SUMMARY")
print("="*60)
print(f"\nKPI Codes ({len(kpi_codes_df)}):")
print(kpi_codes_df.head(10))
print(f"\nStations ({len(stations_df)}):")
print(stations_df.head(10))
print(f"\nDate Range: {date_dim_df['Date'].min()} to {date_dim_df['Date'].max()}")
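
The period-boundary edge cases depend on knowing which period a timestamp falls in. The sketch below mirrors the simplified mock logic from `create_mock_reference_data` (4-week periods counted from 1 April, capped at 13) using only the standard library. The real `core_dimdate` table may define periods differently, and `crosses_period_boundary` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Tuple


def mock_period(d: date) -> Tuple[int, str]:
    """Return (financial-year start, period label) using the simplified 28-day rule."""
    fy_year = d.year if d.month >= 4 else d.year - 1
    days_since = (d - date(fy_year, 4, 1)).days
    period = min(days_since // 28 + 1, 13)
    return fy_year, f"P{period:02d}"


def crosses_period_boundary(start: datetime, end: datetime) -> bool:
    """True when the start and end timestamps fall in different periods."""
    return mock_period(start.date()) != mock_period(end.date())


# 2025-04-28 is the last day of P01 under the mock rule, 2025-04-29 the first of P02
print(crosses_period_boundary(datetime(2025, 4, 28, 10), datetime(2025, 4, 29, 9)))
# A task that stays inside P01
print(crosses_period_boundary(datetime(2025, 4, 1, 8), datetime(2025, 4, 27, 17)))
```

Including the financial year in the comparison means a task running from P13 of one year into P01 of the next also counts as a crossing.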

## 3. Data Generation Functions (Enhanced with Faker)

In [None]:
# ═══════════════════════════════════════════════════════════════
#  HELPER FUNCTIONS (Enhanced with Faker)
# ═══════════════════════════════════════════════════════════════

def random_datetime(start_date, end_date):
    """Generate random datetime between start and end"""
    start = pd.to_datetime(start_date)
    end = pd.to_datetime(end_date)
    delta = (end - start).total_seconds()
    random_seconds = random.uniform(0, delta)
    return start + timedelta(seconds=random_seconds)

def get_duration_category():
    """Select duration category based on weights"""
    categories = list(CONFIG['DURATION_CATEGORIES'].keys())
    weights = [CONFIG['DURATION_CATEGORIES'][cat]['weight'] for cat in categories]
    return random.choices(categories, weights=weights)[0]

def generate_duration_hours(category=None):
    """Generate task duration in hours"""
    if category is None:
        category = get_duration_category()
    
    min_h = CONFIG['DURATION_CATEGORIES'][category]['min_hours']
    max_h = CONFIG['DURATION_CATEGORIES'][category]['max_hours']
    return random.uniform(min_h, max_h)

def get_period_info_for_date(date, date_dim_df):
    """Get period information for a given date"""
    date = pd.to_datetime(date).normalize()
    match = date_dim_df[date_dim_df['Date'] == date]
    if len(match) > 0:
        row = match.iloc[0]
        return row['Period'], row['PeriodWeek'], row['PeriodYear']
    return None, None, None

def generate_task_id():
    """Generate unique task ID"""
    return f"TASK-{uuid.uuid4().hex[:8].upper()}"

def generate_record_id():
    """Generate unique record ID"""
    return f"REC-{uuid.uuid4().hex[:12].upper()}"

def get_reporter():
    """Get reporter name and email (using Faker if available)"""
    if fake:
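        # Build the name from its parts so titles such as "Dr" never leak into the email
        first_name, last_name = fake.first_name(), fake.last_name()
        name = f"{first_name} {last_name}"
        # Generate company email from name
        email = f"{first_name}.{last_name}".lower().replace(' ', '.').replace("'", '') + '@gts.com'
        return name, email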
    else:
        # Fallback to simple pool
        reporters = [
            ('John Smith', 'john.smith@gts.com'),
            ('Sarah Johnson', 'sarah.johnson@gts.com'),
            ('Michael Brown', 'michael.brown@gts.com'),
            ('Emma Wilson', 'emma.wilson@gts.com'),
            ('David Lee', 'david.lee@gts.com'),
            ('Lisa Anderson', 'lisa.anderson@gts.com'),
            ('James Taylor', 'james.taylor@gts.com'),
            ('Sophie Martin', 'sophie.martin@gts.com'),
        ]
        return random.choice(reporters)

def get_logged_by():
    """Get logged by value (using Faker if available)"""
    if fake:
        # Mix of system and people
        options = [
            'System_Auto',
            f'{fake.last_name()}_Team',
            # Some job titles look like "Engineer, site", so strip the trailing comma
            f"{fake.job().split()[0].rstrip(',')}_Manager",
            'Maintenance_Team',
            'Operations',
        ]
        return random.choice(options)
    else:
        options = [
            'System_Auto',
            'Maintenance_Team',
            'Operations_Manager',
            'Station_Manager',
            'Facilities_Team',
        ]
        return random.choice(options)

def get_short_description(kpi_code):
    """Generate short description (using Faker if available)"""
    if fake:
        # Generate contextual descriptions based on KPI code
        templates = {
            'GRAFFITI': [
                f"Graffiti removal required on {fake.word()} wall",
                f"Vandalism cleanup - {fake.word()} area",
                f"Graffiti found on {random.choice(['platform', 'ticket machine', 'waiting area', 'signage'])}",
            ],
            'TRACKSIDE': [
                f"Trackside debris removal - {fake.word()} section",
                f"Vegetation clearance required along tracks",
                f"Litter collection on trackside",
            ],
            'PLATFORM_CLEAN': [
                f"Platform cleaning - {fake.word()} spillage",
                f"General platform maintenance required",
                f"Cleaning needed after {random.choice(['incident', 'event', 'heavy footfall'])}",
            ],
            'LIFT_MAINT': [
                f"Lift {random.randint(1,5)} routine maintenance",
                f"Elevator service and inspection",
                f"Lift repair - {random.choice(['door mechanism', 'control panel', 'safety system'])}",
            ],
            'ESCALATOR': [
                f"Escalator maintenance - Unit {random.randint(1,8)}",
                f"Escalator cleaning and lubrication",
                f"Escalator safety inspection required",
            ],
        }
        
        if kpi_code in templates:
            return random.choice(templates[kpi_code])
        else:
            return f"{kpi_code.replace('_', ' ').title()} - {fake.catch_phrase()}"
    else:
        # Fallback to predefined templates
        templates = {
            'GRAFFITI': ['Graffiti on platform wall', 'Graffiti on ticket machine', 'Graffiti in waiting area'],
            'TRACKSIDE': ['Trackside debris removal', 'Trackside vegetation clearance', 'Trackside litter collection'],
            'PLATFORM_CLEAN': ['Platform cleaning required', 'Spillage cleanup', 'General platform maintenance'],
            'LIFT_MAINT': ['Lift routine maintenance', 'Lift repair required', 'Lift safety inspection'],
            'ESCALATOR': ['Escalator maintenance', 'Escalator cleaning', 'Escalator safety check'],
        }
        return random.choice(templates.get(kpi_code, ['Maintenance task required']))

def get_long_description(short_desc, duration_hours, duration_category):
    """Generate detailed description (using Faker if available)"""
    if fake:
        # More natural descriptions
        details = [
            f"{short_desc}. Estimated duration: {duration_hours:.1f} hours.",
            f"Task priority: {duration_category}. {fake.sentence()}",
            f"Additional notes: {fake.sentence()}",
        ]
        return ' '.join(details)
    else:
        return f"{short_desc}. Duration: {duration_hours:.1f} hours. Category: {duration_category}."

def get_notes(duration_category, is_special=None):
    """Generate notes field (using Faker if available)"""
    if fake and not is_special:
        return fake.sentence()
    elif is_special:
        return is_special  # Keep special markers for edge cases
    else:
        return f'Generated test data - {duration_category} duration'

print("✓ Helper functions defined (Faker: {})\n".format('enabled' if fake else 'disabled'))

# Show example output
if fake:
    print("Example Faker-generated data:")
    print(f"  Reporter: {get_reporter()}")
    print(f"  Logged By: {get_logged_by()}")
    print(f"  Description: {get_short_description('GRAFFITI')}")
    print(f"  Notes: {get_notes('medium')}")
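
To check that the weighted duration sampling behaves as intended, a quick empirical test can be run. This is a minimal sketch that copies the category settings from the configuration and uses a local `random.Random` instance. The `sample_duration` helper is an assumption for this example; it combines the behaviour of `get_duration_category` and `generate_duration_hours`.

```python
import random
from collections import Counter

# Copied from CONFIG['DURATION_CATEGORIES'] so the example is self-contained
DURATION_CATEGORIES = {
    "short": {"min_hours": 1, "max_hours": 24, "weight": 0.4},
    "medium": {"min_hours": 25, "max_hours": 120, "weight": 0.4},
    "long": {"min_hours": 121, "max_hours": 720, "weight": 0.2},
}


def sample_duration(rng: random.Random):
    """Pick a category by weight, then a uniform duration inside its bounds."""
    names = list(DURATION_CATEGORIES)
    weights = [DURATION_CATEGORIES[n]["weight"] for n in names]
    category = rng.choices(names, weights=weights)[0]
    bounds = DURATION_CATEGORIES[category]
    return category, rng.uniform(bounds["min_hours"], bounds["max_hours"])


rng = random.Random(42)  # local RNG keeps the check repeatable
samples = [sample_duration(rng) for _ in range(10_000)]
counts = Counter(category for category, _ in samples)
for name in DURATION_CATEGORIES:
    print(name, counts[name] / len(samples))
```

The printed shares should land close to the configured weights. Large deviations would suggest the weights are not being applied.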

In [None]:
# ═══════════════════════════════════════════════════════════════
#  CORE DATA GENERATION
# ═══════════════════════════════════════════════════════════════

def create_base_task(kpi_code, station, reported_date, duration_hours, 
                     duration_category='medium', is_year_spanning=False, special_notes=None):
    """Create a single task record"""
    
    kpi_info = kpi_codes_df[kpi_codes_df['KPICode'] == kpi_code].iloc[0]
    
    # Generate times
    reported_dt = pd.to_datetime(reported_date)
    scheduled_dt = reported_dt + timedelta(hours=random.uniform(1, 24))
    started_dt = scheduled_dt + timedelta(hours=random.uniform(0, 12))
    finished_dt = started_dt + timedelta(hours=duration_hours)
    
    # Due date based on KPI threshold
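    # Cast to float: values read from a DataFrame row are NumPy scalars, which timedelta() rejects
    due_dt = reported_dt + timedelta(hours=float(kpi_info['ThresholdHours']))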
    
    # Logged and modified dates
    logged_dt = reported_dt - timedelta(minutes=random.uniform(0, 30))
    modified_dt = finished_dt + timedelta(minutes=random.uniform(0, 60))
    
    # SLA status
    hours_to_complete = (finished_dt - reported_dt).total_seconds() / 3600
    if hours_to_complete <= kpi_info['ThresholdHours'] * 0.8:
        sla_status = 'Within SLA'
    elif hours_to_complete <= kpi_info['ThresholdHours']:
        sla_status = 'Near SLA'
    else:
        sla_status = 'SLA Breach'
    
    # Get reporter (Faker or simple)
    reporter, email = get_reporter()
    
    # Get descriptions (Faker or simple)
    short_desc = get_short_description(kpi_code)
    long_desc = get_long_description(short_desc, duration_hours, duration_category)
    
    # Get period info from finished date (for reporting)
    period, period_week, period_year = get_period_info_for_date(finished_dt.date(), date_dim_df)
    
    task = {
        'TaskId': generate_task_id(),
        'RecordID': generate_record_id(),
        'Instruction_Code': kpi_code,
        'Building': station['Building'],
        'BuildingName': station['BuildingName'],
        'LocationName': station['LocationName'],
        'ShortDescription': short_desc,
        'LongDescription': long_desc,
        'Reporter': reporter,
        'ReporterEmail': email,
        'Notes': get_notes(duration_category, special_notes),
        'ReportedDate': reported_dt,
        'DueBy': due_dt,
        'ScheduledFor': scheduled_dt,
        'Finished': finished_dt,
        'Status': 'COMP',
        'LoggedBy': get_logged_by(),
        'LoggedOn': logged_dt,
        'ModifiedOn': modified_dt,
        'SLAStatus': sla_status,
        'CreatedTimestamp': logged_dt,
        'LastUploaded': datetime.now(),
        'IsCurrent': 1,
        'Period': period,
        'PeriodWeek': period_week,
        'PeriodYear': period_year,
        'StationSection': station['StationSection'],
        'KPIDescription': kpi_info['KPIDescription'],
        'KPICategory': kpi_info['KPICategory'],
    }
    
    return task

print("✓ Core generation functions defined")
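
The SLA rule inside `create_base_task` can be isolated for unit testing. The sketch below restates it as a standalone function; the name `classify_sla` and the `near_ratio` parameter are assumptions for this example, and the 0.8 default matches the factor used above.

```python
from datetime import datetime, timedelta


def classify_sla(reported: datetime, finished: datetime,
                 threshold_hours: float, near_ratio: float = 0.8) -> str:
    """Classify a task by elapsed hours from report to finish against its threshold."""
    hours_to_complete = (finished - reported).total_seconds() / 3600
    if hours_to_complete <= threshold_hours * near_ratio:
        return "Within SLA"
    if hours_to_complete <= threshold_hours:
        return "Near SLA"
    return "SLA Breach"


reported = datetime(2025, 5, 25, 8, 0)
for elapsed in (19.0, 24.0, 24.5):
    finished = reported + timedelta(hours=elapsed)
    print(elapsed, classify_sla(reported, finished, threshold_hours=24))
```

Elapsed time is measured from `ReportedDate`, so it includes the random scheduling and start delays as well as the task duration. A task whose duration equals the threshold will therefore normally be classified as a breach.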

## 4. Generate Test Data with Edge Cases

**Note**: The rest of the generation logic remains identical to the original notebook. The only changes are in the helper functions above that now use Faker when available.
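
The original generation cells are not reproduced here, so the following is only a hypothetical sketch of how `TOTAL_RECORDS` and `KPI_FREQUENCY_WEIGHTS` might be turned into per-code record counts. The function `allocate_records` and its rounding strategy are assumptions, not the original notebook's logic.

```python
from typing import Dict, List


def allocate_records(kpi_codes: List[str], total: int,
                     weights: Dict[str, float]) -> Dict[str, int]:
    """Split `total` across codes in proportion to their weights (default weight 1.0)."""
    raw = {code: weights.get(code, 1.0) for code in kpi_codes}
    weight_sum = sum(raw.values())
    # Floor each share first, then hand out what is left so the counts sum to `total`
    counts = {code: int(total * w / weight_sum) for code, w in raw.items()}
    remainder = total - sum(counts.values())
    heaviest_first = sorted(kpi_codes, key=lambda c: raw[c], reverse=True)
    for code in heaviest_first[:remainder]:
        counts[code] += 1
    return counts


allocation = allocate_records(
    ["GRAFFITI", "TRACKSIDE", "LIGHTING"],
    total=15000,
    weights={"GRAFFITI": 2.0, "TRACKSIDE": 1.5},  # LIGHTING falls back to 1.0
)
print(allocation, sum(allocation.values()))
```

An empty weights dictionary gives every code the default weight, which matches the "empty means equal distribution" comment in the configuration.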

In [None]:
# The generation logic is identical to the original notebook
# Copy all cells from "MAIN DATA GENERATION" through "FINAL SUMMARY"
# from the original notebook here...

print("✓ All generation cells would follow here (same as original notebook)")
print("\nKey difference: Data will now use Faker for:")
print("  - More realistic reporter names")
print("  - Varied email addresses")
print("  - Natural language descriptions")
print("  - Contextual notes")
print("\nWhile maintaining all edge case logic for:")
print("  - Year-spanning tasks")
print("  - Downtime thresholds")
print("  - Period boundaries")
print("  - Overlapping groups")
print("  - All date/duration logic")
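
For the overlapping-groups edge case, one possible approach is to draw several reported times inside a single window. This is a hypothetical sketch under that assumption, since the original generation cell is not shown. `overlapping_reported_times` is an illustrative name, and `window_hours` stands in for `CONFIG['OVERLAP_WINDOW_HOURS']`.

```python
import random
from datetime import datetime, timedelta
from typing import List


def overlapping_reported_times(anchor: datetime, group_size: int,
                               window_hours: float,
                               rng: random.Random) -> List[datetime]:
    """Return `group_size` timestamps, all within `window_hours` after `anchor`."""
    # Offset 0.0 keeps the anchor itself in the group
    offsets = [0.0] + [rng.uniform(0, window_hours) for _ in range(group_size - 1)]
    return sorted(anchor + timedelta(hours=h) for h in offsets)


rng = random.Random(7)
anchor = datetime(2025, 6, 1, 9, 0)
group = overlapping_reported_times(anchor, group_size=4, window_hours=6, rng=rng)
for ts in group:
    print(ts)
```

Tasks in a group would then share the same KPI code and station, so a duplicate-detection rule has something to match on.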