# TransXChange XML Parser - Abellio London Ltd

This notebook parses all TransXChange XML files in the **Abellio London Ltd_27** folder and extracts their data into CSV files.

## What it does:
1. 🔍 Scans all XML files in the Abellio London Ltd_27 folder
2. 📊 Extracts 14 different data tables from each XML file
3. 🔄 Consolidates data from all XML files
4. 💾 Exports each table type to separate CSV files
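
Step 2 depends on namespace-qualified lookups, because every element in a TransXChange file sits in the `http://www.transxchange.org.uk/` namespace. The sketch below shows that pattern on a small, made-up XML snippet; `SAMPLE_XML` and its stop values are illustrative assumptions, not content from the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed TransXChange-style snippet (illustration only)
SAMPLE_XML = """<TransXChange xmlns="http://www.transxchange.org.uk/">
  <StopPoints>
    <AnnotatedStopPointRef>
      <StopPointRef>490002178Z</StopPointRef>
      <CommonName>Crawford Crescent</CommonName>
    </AnnotatedStopPointRef>
  </StopPoints>
</TransXChange>"""

ns = {"tx": "http://www.transxchange.org.uk/"}
root = ET.fromstring(SAMPLE_XML)

# Without the prefix nothing matches, because each tag carries the namespace
unqualified = root.findall(".//AnnotatedStopPointRef")
qualified = root.findall(".//tx:AnnotatedStopPointRef", ns)

# Build one dict per stop, using None when a child element is missing
stops = [
    {
        "stop_point_ref": stop.findtext("tx:StopPointRef", default=None, namespaces=ns),
        "common_name": stop.findtext("tx:CommonName", default=None, namespaces=ns),
    }
    for stop in qualified
]
print(len(unqualified), stops)
```

Forgetting the `tx:` prefix fails silently with an empty result rather than an error, which is why an unexpectedly empty table is worth checking against the namespace first.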

## Tables extracted:
- **stops** - Bus stop information
- **operators** - Operator details (Abellio London Limited)
- **services** - Service codes and routes
- **lines** - Line numbers and descriptions
- **routes** - Route definitions
- **route_links** - Links between stops on routes
- **journey_patterns** - Journey pattern definitions
- **timing_links** - Timing information between stops
- **vehicle_journeys** - Individual journey schedules
- **serviced_organisations** - Organizations served
- **serviced_org_working_days** - Working day definitions
- **operating_profiles** - Operating schedules
- **service_journey_patterns** - Service to journey pattern mappings
- **service_lines** - Service to line mappings
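
The **timing_links** table keeps `run_time` as the raw text of the `RunTime` element. A minimal sketch of converting that text to seconds follows, assuming the values are ISO 8601 durations limited to hours, minutes and seconds (for example `PT1M30S`); the helper name `run_time_to_seconds` is hypothetical and anything outside that pattern is returned as `None`.

```python
import re

# Assumed format: ISO 8601 time-only durations such as PT45S, PT2M or PT1H5M30S
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def run_time_to_seconds(value):
    """Convert a duration string like 'PT1M30S' to seconds, or None if unparseable."""
    if not value:
        return None
    match = _DURATION.match(value.strip())
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


print(run_time_to_seconds("PT1M30S"))
print(run_time_to_seconds("PT2M"))
```

Applied to a DataFrame, this could be used as `timing_links["run_time"].map(run_time_to_seconds)` once the format assumption has been checked against the actual data.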


In [10]:
# Import required libraries
import os
import sys
import glob
import pandas as pd
import re
import xml.etree.ElementTree as ET
from datetime import datetime

print("✓ All libraries imported successfully!")
print(f"✓ Python version: {sys.version}")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ Working directory: {os.getcwd()}")


✓ All libraries imported successfully!
✓ Python version: 3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, 11:09:21) [Clang 14.0.6 ]
✓ Pandas version: 2.2.3
✓ Working directory: /Users/shitalyadav/Desktop/untitled folder


In [11]:
# Configuration
ROOT_FOLDER = os.path.join(os.getcwd(), "timetable", "Abellio London Ltd_27")
OUTPUT_BASE = os.path.join(os.getcwd(), "timetable_parsed_data")

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_BASE, exist_ok=True)

# Scan for XML files
xml_files = sorted(glob.glob(os.path.join(ROOT_FOLDER, "*.xml")))
print(f"✓ Found {len(xml_files)} XML files in Abellio London Ltd_27 folder")
print(f"✓ Output directory: {OUTPUT_BASE}\n")

if xml_files:
    print("Files to process:")
    for i, xml_file in enumerate(xml_files, 1):
        print(f"  {i}. {os.path.basename(xml_file)}")
else:
    print("⚠️  No XML files found!")
    
print(f"\nFirst 5 files: {[os.path.basename(f) for f in xml_files[:5]]}")
xml_files[:5]


✓ Found 11 XML files in Abellio London Ltd_27 folder
✓ Output directory: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data

Files to process:
  1. 404-21-404-_-y05-2_1.xml
  2. tfl_21-404-_-y05-6_qyPH39c.xml
  3. tfl_21-404-_-y05-7.xml
  4. tfl_21-407-_-y05-11_HygcEjI.xml
  5. tfl_21-464-_-y05-55535.xml
  6. tfl_21-464-_-y05-55538.xml
  7. tfl_21-464-_-y05-55539.xml
  8. tfl_21-465-_-y05-7_b3z8CFQ.xml
  9. tfl_21-R68-_-y05-58885.xml
  10. tfl_21-R68-_-y05-58885_glMYlIz.xml
  11. tfl_21-S1-_-y05-59996.xml

First 5 files: ['404-21-404-_-y05-2_1.xml', 'tfl_21-404-_-y05-6_qyPH39c.xml', 'tfl_21-404-_-y05-7.xml', 'tfl_21-407-_-y05-11_HygcEjI.xml', 'tfl_21-464-_-y05-55535.xml']


['/Users/shitalyadav/Desktop/untitled folder/timetable/Abellio London Ltd_27/404-21-404-_-y05-2_1.xml',
 '/Users/shitalyadav/Desktop/untitled folder/timetable/Abellio London Ltd_27/tfl_21-404-_-y05-6_qyPH39c.xml',
 '/Users/shitalyadav/Desktop/untitled folder/timetable/Abellio London Ltd_27/tfl_21-404-_-y05-7.xml',
 '/Users/shitalyadav/Desktop/untitled folder/timetable/Abellio London Ltd_27/tfl_21-407-_-y05-11_HygcEjI.xml',
 '/Users/shitalyadav/Desktop/untitled folder/timetable/Abellio London Ltd_27/tfl_21-464-_-y05-55535.xml']

In [12]:
def parse_transxchange_file(xml_file):
    """
    Parse a TransXChange XML file and extract all tables into separate DataFrames
    Returns a dictionary with table names as keys and DataFrames as values
    """
    try:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        
        # Define namespace
        ns = {'tx': 'http://www.transxchange.org.uk/'}
        
        # Dictionary to hold all tables
        tables = {}
        
        # 1. Extract Stops (AnnotatedStopPointRef)
        stops = []
        for stop in root.findall('.//tx:AnnotatedStopPointRef', ns):
            stop_ref = stop.find('tx:StopPointRef', ns)
            common_name = stop.find('tx:CommonName', ns)
            stops.append({
                'stop_point_ref': stop_ref.text if stop_ref is not None else None,
                'common_name': common_name.text if common_name is not None else None
            })
        tables['stops'] = pd.DataFrame(stops)
        
        # 2. Extract Operators
        operators = []
        for operator in root.findall('.//tx:Operator', ns):
            operators.append({
                'operator_id': operator.get('id'),
                'national_operator_code': operator.findtext('tx:NationalOperatorCode', default=None, namespaces=ns),
                'operator_code': operator.findtext('tx:OperatorCode', default=None, namespaces=ns),
                'operator_short_name': operator.findtext('tx:OperatorShortName', default=None, namespaces=ns),
                'operator_name_on_licence': operator.findtext('tx:OperatorNameOnLicence', default=None, namespaces=ns),
                'trading_name': operator.findtext('tx:TradingName', default=None, namespaces=ns),
                'licence_number': operator.findtext('tx:LicenceNumber', default=None, namespaces=ns)
            })
        tables['operators'] = pd.DataFrame(operators)
        
        # 3. Extract Services
        services = []
        for service in root.findall('.//tx:Service', ns):
            services.append({
                'service_code': service.findtext('tx:ServiceCode', default=None, namespaces=ns),
                'private_code': service.findtext('tx:PrivateCode', default=None, namespaces=ns),
                'operator_ref': service.findtext('.//tx:RegisteredOperatorRef', default=None, namespaces=ns),
                'start_date': service.findtext('.//tx:StartDate', default=None, namespaces=ns),
                'end_date': service.findtext('.//tx:EndDate', default=None, namespaces=ns),
                'origin': service.findtext('.//tx:Origin', default=None, namespaces=ns),
                'destination': service.findtext('.//tx:Destination', default=None, namespaces=ns)
            })
        tables['services'] = pd.DataFrame(services)
        
        # 4. Extract Lines
        lines = []
        for line in root.findall('.//tx:Line', ns):
            lines.append({
                'line_id': line.get('id'),
                'line_name': line.findtext('tx:LineName', default=None, namespaces=ns),
                'outbound_origin': line.findtext('.//tx:OutboundDescription/tx:Origin', default=None, namespaces=ns),
                'outbound_destination': line.findtext('.//tx:OutboundDescription/tx:Destination', default=None, namespaces=ns),
                'inbound_origin': line.findtext('.//tx:InboundDescription/tx:Origin', default=None, namespaces=ns),
                'inbound_destination': line.findtext('.//tx:InboundDescription/tx:Destination', default=None, namespaces=ns)
            })
        tables['lines'] = pd.DataFrame(lines)
        
        # 5. Extract Routes
        routes = []
        for route in root.findall('.//tx:Route', ns):
            routes.append({
                'route_id': route.get('id'),
                'route_description': route.findtext('tx:Description', default=None, namespaces=ns),
                'route_section_ref': route.findtext('.//tx:RouteSectionRef', default=None, namespaces=ns)
            })
        tables['routes'] = pd.DataFrame(routes)
        
        # 6. Extract Route Links
        route_links = []
        for route_link in root.findall('.//tx:RouteLink', ns):
            route_links.append({
                'route_link_id': route_link.get('id'),
                'from_stop': route_link.findtext('.//tx:From/tx:StopPointRef', default=None, namespaces=ns),
                'to_stop': route_link.findtext('.//tx:To/tx:StopPointRef', default=None, namespaces=ns),
                'direction': route_link.findtext('tx:Direction', default=None, namespaces=ns),
                'distance': route_link.findtext('tx:Distance', default=None, namespaces=ns)
            })
        tables['route_links'] = pd.DataFrame(route_links)
        
        # 7. Extract Journey Patterns
        journey_patterns = []
        for jp in root.findall('.//tx:JourneyPattern', ns):
            journey_patterns.append({
                'journey_pattern_id': jp.get('id'),
                'destination_display': jp.findtext('tx:DestinationDisplay', default=None, namespaces=ns),
                'direction': jp.findtext('tx:Direction', default=None, namespaces=ns),
                'route_ref': jp.findtext('tx:RouteRef', default=None, namespaces=ns),
                'journey_pattern_section_refs': jp.findtext('tx:JourneyPatternSectionRefs', default=None, namespaces=ns)
            })
        tables['journey_patterns'] = pd.DataFrame(journey_patterns)
        
        # 8. Extract Timing Links (JourneyPatternTimingLink)
        timing_links = []
        for tl in root.findall('.//tx:JourneyPatternTimingLink', ns):
            from_elem = tl.find('tx:From', ns)
            to_elem = tl.find('tx:To', ns)
            timing_links.append({
                'timing_link_id': tl.get('id'),
                'from_stop_ref': from_elem.findtext('tx:StopPointRef', default=None, namespaces=ns) if from_elem is not None else None,
                'from_sequence': from_elem.get('SequenceNumber') if from_elem is not None else None,
                'from_timing_status': from_elem.findtext('tx:TimingStatus', default=None, namespaces=ns) if from_elem is not None else None,
                'to_stop_ref': to_elem.findtext('tx:StopPointRef', default=None, namespaces=ns) if to_elem is not None else None,
                'to_sequence': to_elem.get('SequenceNumber') if to_elem is not None else None,
                'to_timing_status': to_elem.findtext('tx:TimingStatus', default=None, namespaces=ns) if to_elem is not None else None,
                'route_link_ref': tl.findtext('tx:RouteLinkRef', default=None, namespaces=ns),
                'run_time': tl.findtext('tx:RunTime', default=None, namespaces=ns)
            })
        tables['timing_links'] = pd.DataFrame(timing_links)
        
        # 9. Extract Vehicle Journeys
        vehicle_journeys = []
        for vj in root.findall('.//tx:VehicleJourney', ns):
            # Extract operating days
            days_of_week = []
            for day_elem in vj.findall('.//tx:DaysOfWeek/*', ns):
                days_of_week.append(day_elem.tag.split('}')[-1])
            
            vehicle_journeys.append({
                'vehicle_journey_code': vj.findtext('tx:VehicleJourneyCode', default=None, namespaces=ns),
                'private_code': vj.findtext('tx:PrivateCode', default=None, namespaces=ns),
                'service_ref': vj.findtext('tx:ServiceRef', default=None, namespaces=ns),
                'line_ref': vj.findtext('tx:LineRef', default=None, namespaces=ns),
                'journey_pattern_ref': vj.findtext('tx:JourneyPatternRef', default=None, namespaces=ns),
                'departure_time': vj.findtext('tx:DepartureTime', default=None, namespaces=ns),
                'journey_code': vj.findtext('.//tx:JourneyCode', default=None, namespaces=ns),
                'days_of_week': ','.join(days_of_week) if days_of_week else None,
                'sequence_number': vj.get('SequenceNumber')
            })
        tables['vehicle_journeys'] = pd.DataFrame(vehicle_journeys)
        
        # 10. Extract Serviced Organisations
        serviced_orgs = []
        for org in root.findall('.//tx:ServicedOrganisation', ns):
            serviced_orgs.append({
                'organisation_code': org.findtext('tx:OrganisationCode', default=None, namespaces=ns),
                'name': org.findtext('tx:Name', default=None, namespaces=ns)
            })
        tables['serviced_organisations'] = pd.DataFrame(serviced_orgs)
        
        # 11. Extract Serviced Organisation Working Days
        working_days = []
        for org in root.findall('.//tx:ServicedOrganisation', ns):
            org_code = org.findtext('tx:OrganisationCode', default=None, namespaces=ns)
            for date_range in org.findall('.//tx:DateRange', ns):
                working_days.append({
                    'organisation_code': org_code,
                    'start_date': date_range.findtext('tx:StartDate', default=None, namespaces=ns),
                    'end_date': date_range.findtext('tx:EndDate', default=None, namespaces=ns),
                    'description': date_range.findtext('tx:Description', default=None, namespaces=ns)
                })
        tables['serviced_org_working_days'] = pd.DataFrame(working_days)
        
        # 12. Extract Operating Profiles
        operating_profiles = []
        for vj in root.findall('.//tx:VehicleJourney', ns):
            vj_code = vj.findtext('tx:VehicleJourneyCode', default=None, namespaces=ns)
            op_profile = vj.find('tx:OperatingProfile', ns)
            
            if op_profile is not None:
                # Regular days
                days_of_week = []
                for day_elem in op_profile.findall('.//tx:DaysOfWeek/*', ns):
                    days_of_week.append(day_elem.tag.split('}')[-1])
                
                # Bank holidays
                bank_holidays_operation = []
                for bh in op_profile.findall('.//tx:BankHolidayOperation/tx:DaysOfOperation/*', ns):
                    bank_holidays_operation.append(bh.tag.split('}')[-1])
                
                bank_holidays_non_operation = []
                for bh in op_profile.findall('.//tx:BankHolidayOperation/tx:DaysOfNonOperation/*', ns):
                    bank_holidays_non_operation.append(bh.tag.split('}')[-1])
                
                operating_profiles.append({
                    'vehicle_journey_code': vj_code,
                    'days_of_week': ','.join(days_of_week) if days_of_week else None,
                    'bank_holidays_operation': ','.join(bank_holidays_operation) if bank_holidays_operation else None,
                    'bank_holidays_non_operation': ','.join(bank_holidays_non_operation) if bank_holidays_non_operation else None
                })
        tables['operating_profiles'] = pd.DataFrame(operating_profiles)
        
        # 13. Extract Service Journey Patterns (mapping services to journey patterns)
        service_journey_patterns = []
        for service in root.findall('.//tx:Service', ns):
            service_code = service.findtext('tx:ServiceCode', default=None, namespaces=ns)
            for jp in service.findall('.//tx:JourneyPattern', ns):
                service_journey_patterns.append({
                    'service_code': service_code,
                    'journey_pattern_id': jp.get('id'),
                    'destination_display': jp.findtext('tx:DestinationDisplay', default=None, namespaces=ns)
                })
        tables['service_journey_patterns'] = pd.DataFrame(service_journey_patterns)
        
        # 14. Extract Service Lines (mapping services to lines)
        service_lines = []
        for service in root.findall('.//tx:Service', ns):
            service_code = service.findtext('tx:ServiceCode', default=None, namespaces=ns)
            for line in service.findall('.//tx:Line', ns):
                service_lines.append({
                    'service_code': service_code,
                    'line_id': line.get('id'),
                    'line_name': line.findtext('tx:LineName', default=None, namespaces=ns)
                })
        tables['service_lines'] = pd.DataFrame(service_lines)
        
        return tables
    
    except Exception as e:
        print(f"Error parsing {xml_file}: {str(e)}")
        return None
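
One detail of the function above: `findtext` returns only the first matching child, so a `JourneyPattern` with several `JourneyPatternSectionRefs` elements (or a `Route` with several `RouteSectionRef` elements) would keep just the first reference. A minimal sketch of collecting all of them is shown below; the helper `findall_text` and the sample element are hypothetical, and joining with commas simply mirrors how `days_of_week` is stored.

```python
import xml.etree.ElementTree as ET

NS = {"tx": "http://www.transxchange.org.uk/"}


def findall_text(element, path, ns=NS, sep=","):
    """Join the text of every element matching path, or return None if none match."""
    values = [child.text for child in element.findall(path, ns) if child.text]
    return sep.join(values) if values else None


# Made-up JourneyPattern with two section references (illustration only)
sample = ET.fromstring(
    '<JourneyPattern xmlns="http://www.transxchange.org.uk/" id="JP1">'
    "<JourneyPatternSectionRefs>JPS1</JourneyPatternSectionRefs>"
    "<JourneyPatternSectionRefs>JPS2</JourneyPatternSectionRefs>"
    "</JourneyPattern>"
)

first_only = sample.findtext("tx:JourneyPatternSectionRefs", default=None, namespaces=NS)
all_refs = findall_text(sample, "tx:JourneyPatternSectionRefs")
print(first_only, all_refs)
```

Whether this matters depends on the files: if every journey pattern references a single section, both approaches give the same result.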


In [13]:
# Test with the first XML file
if xml_files:
    test_file = xml_files[0]
    print(f"Testing with: {os.path.basename(test_file)}")
    print("=" * 80)
    
    tables = parse_transxchange_file(test_file)
    
    if tables:
        print(f"\n✓ Successfully parsed {os.path.basename(test_file)}")
        print(f"\nExtracted {len(tables)} tables:\n")

        for table_name, df in tables.items():
            print(f"  📊 {table_name:30s} - {len(df):,} rows")
        
        # Show sample data from key tables
        print(f"\n{'='*80}")
        print("Sample Data Preview:")
        print(f"{'='*80}\n")
        
        # Preview Stops
        if not tables['stops'].empty:
            print("🚏 STOPS (first 5):")
            print(tables['stops'].head())
            print()

        # Preview Operators
        if not tables['operators'].empty:
            print("🚌 OPERATORS:")
            print(tables['operators'])
            print()

        # Preview Lines
        if not tables['lines'].empty:
            print("🛤️  LINES:")
            print(tables['lines'])
            print()

        # Preview Vehicle Journeys
        if not tables['vehicle_journeys'].empty:
            print("🚐 VEHICLE JOURNEYS (first 5):")
            print(tables['vehicle_journeys'].head())
    else:
        print("❌ Failed to parse file")
else:
    print("⚠️  No XML files found to test")


Testing with: 404-21-404-_-y05-2_1.xml

✓ Successfully parsed 404-21-404-_-y05-2_1.xml

Extracted 14 tables:

  📊 stops                          - 77 rows
  📊 operators                      - 1 rows
  📊 services                       - 1 rows
  📊 lines                          - 1 rows
  📊 routes                         - 2 rows
  📊 route_links                    - 88 rows
  📊 journey_patterns               - 19 rows
  📊 timing_links                   - 836 rows
  📊 vehicle_journeys               - 204 rows
  📊 serviced_organisations         - 1 rows
  📊 serviced_org_working_days      - 6 rows
  📊 operating_profiles             - 204 rows
  📊 service_journey_patterns       - 19 rows
  📊 service_lines                  - 1 rows

Sample Data Preview:

🚏 STOPS (first 5):
  stop_point_ref                       common_name
0     490002178Z  Crawford Crescent (->S) Coulsdon
1     490002177Z   Shaftesbury Lane (->E) Coulsdon
2     490002176Z      

In [14]:
# Process ALL XML files and consolidate data
print("🚀 Processing ALL Abellio London Ltd XML files...")
print(f"Total files to process: {len(xml_files)}\n")

# Dictionary to accumulate all tables across all files
all_tables = {
    'stops': [],
    'operators': [],
    'services': [],
    'lines': [],
    'routes': [],
    'route_links': [],
    'journey_patterns': [],
    'timing_links': [],
    'vehicle_journeys': [],
    'serviced_organisations': [],
    'serviced_org_working_days': [],
    'operating_profiles': [],
    'service_journey_patterns': [],
    'service_lines': []
}

# Process each XML file
successful_files = 0
failed_files = 0

for i, xml_file in enumerate(xml_files, 1):
    filename = os.path.basename(xml_file)
    print(f"[{i}/{len(xml_files)}] Processing: {filename}...", end=" ")
    
    try:
        tables = parse_transxchange_file(xml_file)
        
        if tables:
            # Add source file column and accumulate data
            for table_name, df in tables.items():
                if not df.empty:
                    df['source_file'] = filename
                    all_tables[table_name].append(df)
            
            print(f"✓ Done ({sum(len(df) for df in tables.values()):,} records)")
            successful_files += 1
        else:
            print("❌ Failed")
            failed_files += 1

    except Exception as e:
        print(f"❌ Error: {str(e)}")
        failed_files += 1

print(f"\n{'='*80}")
print("Processing complete!")
print(f"  ✓ Successful: {successful_files}/{len(xml_files)}")
print(f"  ❌ Failed: {failed_files}/{len(xml_files)}")
print(f"{'='*80}\n")

# Consolidate all dataframes
consolidated_tables = {}
for table_name, df_list in all_tables.items():
    if df_list:
        consolidated_tables[table_name] = pd.concat(df_list, ignore_index=True)
    else:
        consolidated_tables[table_name] = pd.DataFrame()

# Display summary
print("📊 Consolidated Data Summary:\n")
for table_name, df in consolidated_tables.items():
    if not df.empty:
        print(f"  {table_name:30s} - {len(df):,} rows")
    else:
        print(f"  {table_name:30s} - Empty")


🚀 Processing ALL Abellio London Ltd XML files...
Total files to process: 11

[1/11] Processing: 404-21-404-_-y05-2_1.xml... ✓ Done (1,460 records)
[2/11] Processing: tfl_21-404-_-y05-6_qyPH39c.xml... ✓ Done (1,354 records)
[3/11] Processing: tfl_21-404-_-y05-7.xml... ✓ Done (1,352 records)
[4/11] Processing: tfl_21-407-_-y05-11_HygcEjI.xml... ✓ Done (22,136 records)
[5/11] Processing: tfl_21-464-_-y05-55535.xml... ✓ Done (921 records)
[6/11] Processing: tfl_21-464-_-y05-55538.xml... ✓ Done (1,099 records)
[7/11] Processing: tfl_21-464-_-y05-55539.xml... ✓ Done (1,221 records)
[8/11] Processing: tfl_21-465-_-y05-7_b3z8CFQ.xml... ✓ Done (8,225 records)
[9/11] Processing: tfl_21-R68-_-y05-58885.xml... ✓ Done (11,150 records)
[10/11] Processing: tfl_21-R68-_-y05-58885_glMYlIz.xml... ✓ Done (11,150 records)
[11/11] Processing: tfl_21-S1-_-y05-59996.xml... ✓ Done (8,806 records)

Processing complete!
  ✓ Successful: 11/11
  ❌ Failed: 0/11

📊 Consolidated Data Summary:

  stops                          - 931 rows
  operators                      - 11 rows
  services                       - 11 rows
  lines          
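
Because every file contributes its own copy of shared records, the consolidated tables contain repeats: `stops` has 931 rows but only 570 distinct `stop_point_ref` values. A minimal sketch of collapsing those repeats follows, assuming rows count as duplicates when they match on every column except `source_file`; the helper `unique_rows` and the demo frame are hypothetical.

```python
import pandas as pd


def unique_rows(df, ignore=("source_file",)):
    """Drop rows that are identical on every column except those in `ignore`."""
    cols = [c for c in df.columns if c not in ignore]
    # The first occurrence is kept, so source_file points at the first file seen
    return df.drop_duplicates(subset=cols).reset_index(drop=True)


# Made-up rows imitating the shape of the consolidated stops table
demo = pd.DataFrame({
    "stop_point_ref": ["490002178Z", "490002178Z", "490002177Z"],
    "common_name": ["Crawford Crescent", "Crawford Crescent", "Shaftesbury Lane"],
    "source_file": ["a.xml", "b.xml", "a.xml"],
})

deduped = unique_rows(demo)
print(deduped)
```

Keeping `source_file` in the raw tables and deduplicating only in a derived copy preserves the ability to trace each record back to its XML file.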

In [15]:
# Save all tables to CSV files
print("💾 Saving tables to CSV files...\n")

saved_count = 0
empty_count = 0

for table_name, df in consolidated_tables.items():
    # Create subdirectory for each table type
    output_dir = os.path.join(OUTPUT_BASE, table_name)
    os.makedirs(output_dir, exist_ok=True)
    
    if not df.empty:
        # Save to CSV under a fixed, table-specific filename
        csv_filename = f"{table_name}_abellio_london.csv"
        csv_path = os.path.join(output_dir, csv_filename)

        df.to_csv(csv_path, index=False, encoding='utf-8')
        print(f"  ✓ Saved: {csv_path}")
        print(f"    └─ {len(df):,} rows, {len(df.columns)} columns")
        saved_count += 1
    else:
        print(f"  ⊘ Skipped: {table_name} (empty)")
        empty_count += 1

print(f"\n{'='*80}")
print("Export complete!")
print(f"  ✓ Saved: {saved_count} tables")
print(f"  ⊘ Empty: {empty_count} tables")
print(f"  📁 Output location: {OUTPUT_BASE}")
print(f"{'='*80}")


💾 Saving tables to CSV files...

  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/stops/stops_abellio_london.csv
    └─ 931 rows, 3 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/operators/operators_abellio_london.csv
    └─ 11 rows, 8 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/services/services_abellio_london.csv
    └─ 11 rows, 8 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/lines/lines_abellio_london.csv
    └─ 11 rows, 7 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/routes/routes_abellio_london.csv
    └─ 27 rows, 4 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/route_links/route_links_abellio_london.csv
    └─ 958 rows, 6 columns
  ✓ Saved: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/journey_patterns/journ

In [16]:
# Verify the saved CSV files
print("🔍 Verifying saved CSV files...\n")

for table_name in consolidated_tables.keys():
    output_dir = os.path.join(OUTPUT_BASE, table_name)
    csv_path = os.path.join(output_dir, f"{table_name}_abellio_london.csv")
    
    if os.path.exists(csv_path):
        # Read back and verify
        df_verify = pd.read_csv(csv_path)
        file_size = os.path.getsize(csv_path) / 1024  # Size in KB
        
        print(f"✓ {table_name}")
        print(f"  └─ Path: {csv_path}")
        print(f"  └─ Rows: {len(df_verify):,} | Columns: {len(df_verify.columns)} | Size: {file_size:.1f} KB")

        # Show column names for important tables
        if table_name in ['stops', 'operators', 'vehicle_journeys', 'services']:
            print(f"  └─ Columns: {', '.join(df_verify.columns[:5])}{'...' if len(df_verify.columns) > 5 else ''}")
        print()

print(f"{'='*80}")
print("✅ All CSV files verified successfully!")
print(f"{'='*80}")


🔍 Verifying saved CSV files...

✓ stops
  └─ Path: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/stops/stops_abellio_london.csv
  └─ Rows: 931 | Columns: 3 | Size: 55.5 KB
  └─ Columns: stop_point_ref, common_name, source_file

✓ operators
  └─ Path: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/operators/operators_abellio_london.csv
  └─ Rows: 11 | Columns: 8 | Size: 1.1 KB
  └─ Columns: operator_id, national_operator_code, operator_code, operator_short_name, operator_name_on_licence...

✓ services
  └─ Path: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/services/services_abellio_london.csv
  └─ Rows: 11 | Columns: 8 | Size: 1.4 KB
  └─ Columns: service_code, private_code, operator_ref, start_date, end_date...

✓ lines
  └─ Path: /Users/shitalyadav/Desktop/untitled folder/timetable_parsed_data/lines/lines_abellio_london.csv
  └─ Rows: 11 | Columns: 7 | Size: 0.8 KB

✓ rout
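
The verification step reads each CSV back with `pd.read_csv` defaults, which lets pandas infer column types. For identifier columns made up entirely of digits, that inference can turn text into integers and drop leading zeros. The sketch below illustrates the effect with made-up values; the column name `code` is hypothetical, and whether any real column is affected depends on the data.

```python
import io

import pandas as pd

# Made-up CSV content with all-digit identifiers that have leading zeros
csv_text = "code,name\n00123,Example A\n04567,Example B\n"

# Default inference parses the identifiers as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading every column as text keeps the identifiers exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["code"].tolist())
print(as_text["code"].tolist())
```

Passing `dtype=str` (or a per-column mapping) when reading the exported files back is one way to keep codes comparable with the values in the original XML.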

In [17]:
# Display sample data from key tables
print("📋 Sample Data from Key Tables\n")
print(f"{'='*80}\n")

# 1. Operators
if not consolidated_tables['operators'].empty:
    print("🚌 OPERATORS:")
    print(consolidated_tables['operators'][['operator_id', 'operator_short_name', 'licence_number']].drop_duplicates())
    print()

# 2. Lines Summary
if not consolidated_tables['lines'].empty:
    print("🛤️  LINES SUMMARY:")
    lines_summary = consolidated_tables['lines'][['line_name', 'line_id']].drop_duplicates()
    print(f"Total unique lines: {len(lines_summary)}")
    print(lines_summary.head(10))
    print()

# 3. Stops Statistics
if not consolidated_tables['stops'].empty:
    print("🚏 STOPS STATISTICS:")
    print(f"Total unique stops: {consolidated_tables['stops']['stop_point_ref'].nunique():,}")
    print("\nSample stops:")
    print(consolidated_tables['stops'][['stop_point_ref', 'common_name']].drop_duplicates().head(10))
    print()

# 4. Vehicle Journeys Statistics
if not consolidated_tables['vehicle_journeys'].empty:
    print("🚐 VEHICLE JOURNEYS STATISTICS:")
    vj_df = consolidated_tables['vehicle_journeys']
    print(f"Total vehicle journeys: {len(vj_df):,}")
    print(f"Unique departure times: {vj_df['departure_time'].nunique():,}")
    print("\nJourneys by day of week:")
    if 'days_of_week' in vj_df.columns:
        days_counts = vj_df['days_of_week'].value_counts().head(10)
        for day, count in days_counts.items():
            print(f"  {day}: {count:,}")
    print()

# 5. Services Summary
if not consolidated_tables['services'].empty:
    print("📅 SERVICES SUMMARY:")
    services_df = consolidated_tables['services']
    print(f"Total services: {len(services_df):,}")
    print(f"Unique service codes: {services_df['service_code'].nunique()}")
    print("\nSample services:")
    print(services_df[['service_code', 'origin', 'destination']].head(5))
    print()

print(f"{'='*80}")
print("✅ Data extraction complete!")
print(f"{'='*80}")
print(f"\n📁 All CSV files saved in: {OUTPUT_BASE}")
print("\n📊 Summary of extracted tables:")
for table_name, df in consolidated_tables.items():
    if not df.empty:
        print(f"  • {table_name}: {len(df):,} rows")


📋 Sample Data from Key Tables


🚌 OPERATORS:
   operator_id     operator_short_name licence_number
0           CX  ABELLIO LONDON LIMITED      PK0003436
1       OId_CX            Transport UK      UZ000ABLO
3       OId_CX            Transport UK      PK0003436
7       OId_TE            Transport UK      PK0001815
10      OId_CX            Transport UK      PK0001816

🛤️  LINES SUMMARY:
Total unique lines: 7
   line_name                 line_id
0        404    ABLO:PK0003436:5:404
1        404  ABLO:UZ000ABLO:404:404
3        407    ABLO:PK0003436:2:407
4        464    ABLO:PK0003436:4:464
7        465   ABLO:PK0001815:90:465
8        R68   ABLO:PK0001815:15:R68
10        S1   ABLO:PK0001816:107:S1

🚏 STOPS STATISTICS:
Total unique stops: 570

Sample stops:
  stop_point_ref                             common_name
0     490002178Z        Crawford Crescent (->S) Coulsdon
1     490002177Z         Shaftesbury Lane (->E) Coulsdon
2     490002176Z             Salter Close (->E) 
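
The "Journeys by day of week" listing in the cell above counts each distinct comma-joined combination stored in `days_of_week`, not individual days. A minimal sketch of counting per day label instead is shown below, using a made-up frame in the same shape; it assumes the labels are the element names the parser stored, so a grouped tag (for example a `MondayToFriday` element, if the files use one) would be counted under its own label.

```python
import pandas as pd

# Made-up vehicle journeys in the same shape as the consolidated table
vj_demo = pd.DataFrame({
    "vehicle_journey_code": ["VJ1", "VJ2", "VJ3", "VJ4"],
    "days_of_week": ["Monday,Tuesday", "Monday", None, "Saturday"],
})

per_day = (
    vj_demo["days_of_week"]
    .dropna()           # journeys without a DaysOfWeek entry are skipped
    .str.split(",")     # "Monday,Tuesday" -> ["Monday", "Tuesday"]
    .explode()          # one row per (journey, day label)
    .value_counts()
)
print(per_day)
```

The same chain could be applied to `consolidated_tables['vehicle_journeys']['days_of_week']` to get one count per label.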