
# GTFS Batch Processing Pipeline — Ingestion Microservice

## 1. Mount Google Drive

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. List Available GTFS Files


In [6]:
import os

raw_data_path = '/content/drive/My Drive/GTFS_RAW'
files = os.listdir(raw_data_path)
print("GTFS files found:", files)


GTFS files found: ['calendar.txt', 'feed_info.txt', 'stops.txt', 'routes.txt', 'calendar_dates.txt', 'agency.txt', 'trips.txt', 'attributions.txt', 'stop_times.txt']
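`os.listdir` returns names in arbitrary order and includes anything else that happens to sit in the folder. A minimal sketch of a stricter listing step follows. The helper names `list_gtfs_files` and `missing_required` are hypothetical, and the `REQUIRED_FILES` set is an assumption about what this pipeline needs (the GTFS reference treats `stops.txt` as only conditionally required).

```python
import os

# Assumption: the files this pipeline cannot proceed without (illustrative set).
REQUIRED_FILES = {"agency.txt", "routes.txt", "trips.txt", "stop_times.txt", "stops.txt"}


def list_gtfs_files(folder):
    """Return the sorted names of regular .txt files in `folder`."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".txt") and os.path.isfile(os.path.join(folder, name))
    )


def missing_required(file_names, required=REQUIRED_FILES):
    """Return the sorted required file names that are absent from `file_names`."""
    return sorted(set(required) - set(file_names))
```

Calling `list_gtfs_files(raw_data_path)` and then `missing_required(...)` on the result would let the notebook stop early when a core table is absent, rather than failing later during validation.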


## 3. Read and Preview GTFS Tables


In [9]:
import pandas as pd

dfs = {}
for fname in files:
    if fname.endswith('.txt'):
        fpath = os.path.join(raw_data_path, fname)
        df = pd.read_csv(fpath)
        print(f"\nPreview of {fname}:")
        print(df.head())
        dfs[fname] = df



Preview of calendar.txt:
   monday  tuesday  wednesday  thursday  friday  saturday  sunday  start_date  \
0       0        0          0         0       0         0       1    20250615   
1       0        0          0         0       0         0       1    20250615   
2       0        0          0         0       0         0       1    20250615   
3       0        0          0         0       0         0       1    20250615   
4       0        0          0         0       0         0       1    20250615   

   end_date  service_id  
0  20250622        1692  
1  20250622        2322  
2  20250622        2631  
3  20250622        2738  
4  20250622        4936  

Preview of feed_info.txt:
                                 feed_publisher_name feed_publisher_url  \
0  gtfs.de - GTFS für Deutschland, Daten bereitge...     http://gtfs.de   

  feed_lang feed_version feed_contact_email feed_contact_url  
0        de  latest-free       info@gtfs.de  https://gtfs.de  

Preview of stops.txt:
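
The preview above shows `service_id` and the date columns parsed as integers, which is what `pd.read_csv` does by default. GTFS identifiers are text, so an ID with leading zeros would lose them under type inference. One possible alternative, sketched here with a hypothetical helper `read_gtfs_table` and made-up sample data, is to read every column as a string and convert types explicitly later:

```python
import io

import pandas as pd


def read_gtfs_table(source):
    """Read one GTFS table, keeping every column as text (no type inference)."""
    return pd.read_csv(source, dtype=str)


# Illustrative in-memory sample, not taken from the real feed.
sample_text = "stop_id,stop_name\n000123,Hauptbahnhof\n"

inferred = pd.read_csv(io.StringIO(sample_text))
as_text = read_gtfs_table(io.StringIO(sample_text))

print(inferred["stop_id"].iloc[0])  # numeric after inference, leading zeros dropped
print(as_text["stop_id"].iloc[0])   # original text preserved
```

Whether this matters depends on the feed: if every ID is a plain integer without padding, the default parsing used in this notebook loses nothing.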
    

## 4. Validate File Structure

In [11]:
# Expected columns for the nine GTFS tables in this feed
expected_columns = {
    'calendar.txt': ['service_id', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'start_date', 'end_date'],
    'feed_info.txt': ['feed_publisher_name', 'feed_publisher_url', 'feed_lang', 'feed_start_date', 'feed_end_date', 'feed_version', 'feed_contact_email', 'feed_contact_url'],
    'stops.txt': ['stop_id', 'stop_name', 'parent_station', 'stop_lat', 'stop_lon', 'location_type'],
    'routes.txt': ['route_id', 'route_short_name', 'agency_id', 'route_type'],
    'calendar_dates.txt': ['service_id', 'date', 'exception_type'],
    'agency.txt': ['agency_id', 'agency_name', 'agency_url', 'agency_timezone', 'agency_lang'],
    'trips.txt': ['route_id', 'service_id', 'trip_id'],
    'attributions.txt': ['attribution_id', 'organization_name', 'is_producer', 'is_operator', 'is_authority', 'attribution_url', 'attribution_email', 'attribution_phone'],
    'stop_times.txt': ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence', 'pickup_type', 'drop_off_type']
}

# Validate columns and print missing/extras
for fname, expected in expected_columns.items():
    if fname in dfs:
        actual = set(dfs[fname].columns)
        missing = set(expected) - actual
        extra = actual - set(expected)
        print(f"\nVALIDATION for {fname}:")
        print(" - Missing columns:", missing)
        print(" - Extra columns:", extra)
        if missing:
            print(f"WARNING: {fname} is missing columns {missing}")
        if extra:
            print(f"NOTE: {fname} has extra columns {extra}")

        # Add missing columns as empty (pd.NA) for schema consistency
        for col in missing:
            dfs[fname][col] = pd.NA
        if missing:
            print(f"Added missing columns as empty for {fname} (columns added for schema consistency).")


VALIDATION for calendar.txt:
 - Missing columns: set()
 - Extra columns: set()

VALIDATION for feed_info.txt:
 - Missing columns: {'feed_start_date', 'feed_end_date'}
 - Extra columns: set()
Added missing columns as empty for feed_info.txt (columns added for schema consistency).

VALIDATION for stops.txt:
 - Missing columns: set()
 - Extra columns: set()

VALIDATION for routes.txt:
 - Missing columns: set()
 - Extra columns: {'route_long_name'}
NOTE: routes.txt has extra columns {'route_long_name'}

VALIDATION for calendar_dates.txt:
 - Missing columns: set()
 - Extra columns: set()

VALIDATION for agency.txt:
 - Missing columns: set()
 - Extra columns: set()

VALIDATION for trips.txt:
 - Missing columns: set()
 - Extra columns: set()

VALIDATION for attributions.txt:
 - Missing columns: {'is_operator', 'attribution_phone', 'is_authority'}
 - Extra columns: set()
Added missing columns as empty for attributions.txt (columns added for schema consistency).

VALIDATION for stop_times.txt:
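
The validation cell mixes reporting with in-place modification of `dfs`. As a rough sketch of the same check split into two reusable steps, the hypothetical helpers below report missing and extra columns and then return an aligned copy, leaving the input untouched. Placing the expected columns first and the extras last is an assumption made here, not something the notebook does.

```python
import pandas as pd


def check_columns(df, expected):
    """Return (missing, extra) column lists for `df` against `expected`."""
    actual = list(df.columns)
    missing = [col for col in expected if col not in actual]
    extra = [col for col in actual if col not in expected]
    return missing, extra


def align_to_schema(df, expected):
    """Return a copy of `df` with missing columns added as empty (pd.NA).

    Expected columns come first, in schema order; extra columns are kept at the end.
    """
    missing, extra = check_columns(df, expected)
    aligned = df.copy()
    for col in missing:
        aligned[col] = pd.NA
    return aligned[list(expected) + extra]
```

For example, `align_to_schema(dfs['feed_info.txt'], expected_columns['feed_info.txt'])` would add `feed_start_date` and `feed_end_date` as empty columns without changing the original DataFrame.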

## 5. Save Validated Data
- Write each table to a new Drive folder, `/content/drive/My Drive/GTFS_VALIDATED/`.


In [12]:
validated_path = '/content/drive/My Drive/GTFS_VALIDATED'
os.makedirs(validated_path, exist_ok=True)

for fname, df in dfs.items():
    # Every loaded table is written; add a filter here to skip tables that should not be saved
    df.to_csv(os.path.join(validated_path, fname), index=False)
print("Validated files saved.")


Validated files saved.
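
The save cell reports success without checking what reached the disk. A small sketch of a save step that reads each header back and compares it with the in-memory columns is shown below. The helper `save_tables` is hypothetical, and the header comparison is only a light sanity check, not a full verification of the data.

```python
import os

import pandas as pd


def save_tables(tables, out_dir):
    """Write each DataFrame in `tables` to `out_dir` and return row counts per file.

    Raises ValueError if a saved file's header differs from the DataFrame's columns.
    """
    os.makedirs(out_dir, exist_ok=True)
    row_counts = {}
    for name, df in tables.items():
        path = os.path.join(out_dir, name)
        df.to_csv(path, index=False)
        # Read back only the header row to confirm the columns were written as expected.
        saved_columns = list(pd.read_csv(path, nrows=0).columns)
        if saved_columns != list(df.columns):
            raise ValueError(f"Column mismatch after saving {name}")
        row_counts[name] = len(df)
    return row_counts
```

In this notebook it could be called as `save_tables(dfs, validated_path)`, and the returned counts printed as a short summary.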
