<a href="https://colab.research.google.com/github/Ahsan97Javed/gtfs-batch-pipeline/blob/main/output_gtfs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GTFS Batch Processing Pipeline — Output Microservice

## 1. Mount Google Drive & Set Paths


In [1]:
from google.colab import drive
import os
import pandas as pd

drive.mount('/content/drive')

agg_path = '/content/drive/My Drive/GTFS_AGGREGATED'
# Outputs will remain here and can be copied elsewhere if needed


Mounted at /content/drive
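
If the Drive folder is missing or misspelled, the later cells fail with a less obvious error from `os.listdir`. A minimal sketch of an early check is shown here. The helper name `require_dir` is hypothetical and not part of the original notebook, and the demo uses a temporary folder because `/content/drive` only exists inside Colab.

```python
import os
import tempfile


def require_dir(path):
    """Return `path` unchanged if it is an existing directory, else raise.

    Hypothetical helper: fails early with a clear message instead of
    letting a later os.listdir() call raise on a missing folder.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Aggregated-output folder not found: {path}")
    return path


# Demo on a temporary folder (stand-in for the mounted Drive path)
with tempfile.TemporaryDirectory() as tmp:
    checked = require_dir(tmp)
    found = (checked == tmp)
```

In the notebook this could be used as `agg_path = require_dir('/content/drive/My Drive/GTFS_AGGREGATED')`, assuming the folder was created by the earlier pipeline stage.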


## 2. List All Aggregated Output Files


In [2]:
output_files = os.listdir(agg_path)
print("Aggregated outputs available:")
for f in output_files:
    print(f)


Aggregated outputs available:
trip_counts_per_route.csv
popular_stops.csv
trips_per_day.csv
avg_stops_per_trip.txt
routes_per_agency.csv
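
The listing above mixes CSV tables with one plain-text summary, and `os.listdir` returns names in arbitrary order. One possible way to get a stable, grouped view is sketched below. The function name `group_by_extension` is an illustrative assumption, and the file names are copied from the output above.

```python
import os
from collections import defaultdict


def group_by_extension(filenames):
    """Group file names by lowercase extension, sorted within each group."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        ext = os.path.splitext(name)[1].lower()
        groups[ext].append(name)
    return dict(groups)


# File names taken from the listing printed above
names = [
    "trip_counts_per_route.csv",
    "popular_stops.csv",
    "trips_per_day.csv",
    "avg_stops_per_trip.txt",
    "routes_per_agency.csv",
]
grouped = group_by_extension(names)
for ext, members in grouped.items():
    print(ext, members)
```

Grouping up front makes it straightforward to send only the `.csv` group to a tabular loader and handle the `.txt` summary separately.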


## 3. Load and Summarize Key Aggregated Results



In [3]:
for fname in output_files:
    if fname.endswith('.csv'):
        print(f"\n===== {fname} =====")
        df = pd.read_csv(os.path.join(agg_path, fname))
        print(df.head())
        print(f"(Rows: {df.shape[0]}, Columns: {df.shape[1]})")
    elif fname.endswith('.txt'):
        print(f"\n===== {fname} =====")
        with open(os.path.join(agg_path, fname), 'r') as f:
            print(f.read())



===== trip_counts_per_route.csv =====
   route_id  trip_count route_short_name
0     18162        2617               U5
1      6386        2454               M8
2      8443        2329              AST
3     24376        2133              302
4     10748        2054               18
(Rows: 25081, Columns: 3)

===== popular_stops.csv =====
   stop_id  num_stop_times             stop_name
0   172476            6624    Hamburg Bf. Altona
1   390511            5623            Hertzallee
2   449492            4515  F Willy-Brandt-Platz
3   252245            3906    F Eschenheimer Tor
4   175628            3906     F Schweizer Platz
(Rows: 429051, Columns: 3)

===== trips_per_day.csv =====
  day_of_week  num_trips
0      monday     982723
1     tuesday     984563
2   wednesday     969625
3    thursday     929649
4      friday     965953
(Rows: 7, Columns: 2)

===== avg_stops_per_trip.txt =====
Average stops per trip: 20.36

===== routes_per_agency.csv =====
   agency_id  num_routes         
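
The `.txt` summary is only printed above, so its value is not available as a number for later steps. A small sketch of extracting it follows. It assumes the file always contains a line of the form `Average stops per trip: <number>`, as in the output shown, and the function name `parse_avg_stops` is hypothetical.

```python
import re


def parse_avg_stops(text):
    """Extract the numeric value from an 'Average stops per trip: N' line.

    Assumes the line format seen in avg_stops_per_trip.txt; raises
    ValueError if no such line is present.
    """
    match = re.search(r"Average stops per trip:\s*([0-9]+(?:\.[0-9]+)?)", text)
    if match is None:
        raise ValueError("No 'Average stops per trip' line found")
    return float(match.group(1))


# Text copied from the printed file contents above
value = parse_avg_stops("Average stops per trip: 20.36\n")
print(value)  # 20.36
```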

## 4. Upload Key Aggregates to Google BigQuery

In [4]:
# Install and import the BigQuery client
!pip install --quiet google-cloud-bigquery
from google.api_core.exceptions import NotFound
from google.cloud import bigquery
from google.colab import auth

# Authenticate the Colab user so the client can act on their behalf
auth.authenticate_user()
project_id = "gtfs-batch-pipeline"

# Fully qualified BigQuery dataset ID (created below if it does not exist yet)
dataset_id = f"{project_id}.gtfs_batch"
client = bigquery.Client(project=project_id)

# Create the dataset only when it is missing
def create_dataset_if_not_exists(dataset_id):
    try:
        client.get_dataset(dataset_id)
        print(f"Dataset {dataset_id} already exists.")
    except NotFound:
        # Only a missing dataset triggers creation; other errors (e.g. auth) propagate
        dataset = bigquery.Dataset(dataset_id)
        dataset.location = "EU"
        client.create_dataset(dataset, exists_ok=True)
        print(f"Created dataset {dataset_id}.")

create_dataset_if_not_exists(dataset_id)

# CSV result files to upload
files_to_upload = [
    "trip_counts_per_route.csv",
    "popular_stops.csv",
    "trips_per_day.csv",
    "routes_per_agency.csv"
]
for fname in files_to_upload:
    csv_path = os.path.join(agg_path, fname)
    if os.path.exists(csv_path):
        # Table name is the file name without its .csv extension
        table_id = f"{dataset_id}.{fname.replace('.csv','')}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,  # skip the header row
            autodetect=True,      # infer the schema from the data
            # Load jobs append by default; truncate so re-running the cell does not duplicate rows
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        with open(csv_path, "rb") as source_file:
            job = client.load_table_from_file(source_file, table_id, job_config=job_config)
        job.result()  # block until the load job finishes (raises on failure)
        print(f"Uploaded {fname} to BigQuery as {table_id}.")
    else:
        print(f"{fname} not found, skipping.")

# Verify the upload by reading back each table's row count
for fname in files_to_upload:
    table_id = f"{dataset_id}.{fname.replace('.csv','')}"
    try:
        table = client.get_table(table_id)
        print(f"{table_id}: {table.num_rows} rows uploaded.")
    except Exception as e:
        print(f"{table_id} not found or error: {e}")


Created dataset gtfs-batch-pipeline.gtfs_batch.
Uploaded trip_counts_per_route.csv to BigQuery as gtfs-batch-pipeline.gtfs_batch.trip_counts_per_route.
Uploaded popular_stops.csv to BigQuery as gtfs-batch-pipeline.gtfs_batch.popular_stops.
Uploaded trips_per_day.csv to BigQuery as gtfs-batch-pipeline.gtfs_batch.trips_per_day.
Uploaded routes_per_agency.csv to BigQuery as gtfs-batch-pipeline.gtfs_batch.routes_per_agency.
gtfs-batch-pipeline.gtfs_batch.trip_counts_per_route: 25081 rows uploaded.
gtfs-batch-pipeline.gtfs_batch.popular_stops: 429051 rows uploaded.
gtfs-batch-pipeline.gtfs_batch.trips_per_day: 7 rows uploaded.
gtfs-batch-pipeline.gtfs_batch.routes_per_agency: 451 rows uploaded.
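
The row counts reported by BigQuery are only meaningful when compared with the source files. The sketch below counts data rows in a local CSV with the standard library so the result can be compared against `table.num_rows`. The helper name `count_csv_rows` is hypothetical, it assumes UTF-8 files with a single header row, and the demo writes a tiny temporary file using two rows from the `trips_per_day.csv` preview rather than reading from Drive.

```python
import csv
import os
import tempfile


def count_csv_rows(path, has_header=True):
    """Count non-empty CSV records, excluding the header when present.

    Uses csv.reader so quoted fields containing newlines count as one row.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        total = sum(1 for row in csv.reader(fh) if row)
    if has_header:
        return max(total - 1, 0)
    return total


# Demo on a small temporary file (stand-in for a file in agg_path)
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="", encoding="utf-8") as fh:
        fh.write("day_of_week,num_trips\nmonday,982723\ntuesday,984563\n")
    local_rows = count_csv_rows(sample)

print(local_rows)  # 2
```

In the verification loop, one could then compare `count_csv_rows(csv_path)` with `table.num_rows` and flag any mismatch, for example after a partial load or an accidental append.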
