# NYC Taxi Duration Prediction Pipeline
**Author:** Ali Ahmed  
**Role:** Associate ML/MLOps Engineer  
**Status:** Development / Production-Ready Simulation

---

## Project Overview
This project implements the data pipeline for an NYC taxi trip duration prediction model, built around three MLOps practices:
* **Automation Ready:** Modular code structure prepared for orchestration.
* **Data Versioning Support:** Tiered storage (raw/processed) for better data lineage.
* **Portability:** Environment-agnostic path management for seamless deployment.

## 1. Imports and Setup

In [29]:
import os

import pandas as pd
# Show every column when a DataFrame is displayed (no truncation)
pd.options.display.max_columns = None

## 2. Global Configuration
I establish a centralized configuration layer to manage environment-specific variables. 
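By defining a `base_dir` and building every path with `os.path.join`, I ensure the pipeline remains:
* **Environment Agnostic:** Path separators are handled by `os.path.join`, so the same code runs on Linux, macOS, and Windows. Paths resolve relative to the notebook's working directory.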
* **Scalable:** Easy to update parameters for different months/years in a single location.

In [30]:
def get_config(taxi_type: str = 'yellow', year: int = 2024, month: int = 1) -> dict:
    """
    Centralized configuration management. 
    Encapsulates all environment and resource parameters.
    """
    base_dir = '..'
    raw_data_dir = os.path.join(base_dir, 'data', 'raw')
    processed_data_dir = os.path.join(base_dir, 'data', 'processed')
    
    # Create the storage directories if they do not exist yet
    os.makedirs(raw_data_dir, exist_ok=True)
    os.makedirs(processed_data_dir, exist_ok=True)
    
    file_name = f'{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
    
    config = {
        'taxi_type': taxi_type,
        'year': year,
        'month': month,
        'data_url': f'https://d37ci6vzurychx.cloudfront.net/trip-data/{file_name}',
        'raw_path': os.path.join(raw_data_dir, file_name),
        'processed_path': os.path.join(processed_data_dir, file_name)
    }
    
    return config

# Initialize configuration
cfg = get_config(taxi_type='yellow', year=2025, month=10)
print(f"LOG: Pipeline configuration initialized for {cfg['taxi_type']} - {cfg['year']}/{cfg['month']}")

LOG: Pipeline configuration initialized for yellow - 2025/10
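
The "single location" idea can be sketched on its own. The helper below is hypothetical (`build_paths` is not part of the notebook) and assumes the same `data/raw` and `data/processed` layout as `get_config`. It only derives names and paths, without creating directories, so it can be called in a loop over several months.

```python
from pathlib import Path


def build_paths(base_dir, taxi_type: str, year: int, month: int) -> dict:
    """Hypothetical helper: derive the file name and storage paths without touching disk."""
    file_name = f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
    base = Path(base_dir)
    return {
        "file_name": file_name,
        "raw_path": base / "data" / "raw" / file_name,
        "processed_path": base / "data" / "processed" / file_name,
    }


# One loop covers a whole quarter; only the arguments change.
for month in (10, 11, 12):
    paths = build_paths("..", "yellow", 2025, month)
    print(paths["raw_path"].as_posix())
```

Passing `base_dir` as an argument, rather than hard-coding it, is one possible way to point the same code at a different storage root per environment.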


## 3. Data Ingestion Phase
In this stage, I implement an **idempotent ingestion step**: running it again yields the same local file without a second download. 
The goal is to fetch the NYC Taxi dataset from the official CloudFront repository and store it in the `data/raw` directory.

**Professional Standards Applied:**
* **Caching Mechanism:** Before downloading, the system checks if the file already exists locally to save bandwidth and time.
* **Encapsulation:** The logic is wrapped in a function that accepts our centralized `config` object.
* **Logging:** Status messages report whether the data was downloaded or loaded from the local cache.

In [31]:
def ingest_data(config: dict) -> pd.DataFrame:
    """
    Downloads the trip data from the configured URL and saves it locally,
    or loads the local copy if it already exists. Returns the data as a DataFrame.
    """
    url = config['data_url']
    save_path = config['raw_path']
    
    # Check whether the file already exists locally (caching)
    if not os.path.exists(save_path):
        print(f"LOG: Downloading from: {url}")
        data = pd.read_parquet(url)
        data.to_parquet(save_path, index=False)
        print(f"✅ Success: Data saved to {save_path}")
    else:
        print(f"LOG: Resource found in cache. Loading: {save_path}")
        data = pd.read_parquet(save_path)
    
    return data

# Run the ingestion
df = ingest_data(cfg)

# Check the size of the data
print(f"LOG: Total rows captured: {len(df):,}")
df.head()

LOG: Resource found in cache. Loading: ../data/raw/yellow_tripdata_2025-10.parquet
LOG: Total rows captured: 4,428,699


| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2025-10-01 00:15:32 | 2025-10-01 01:04:03 | 1.0 | 17.2 | 2.0 | N | 132 | 107 | 1 | 70.0 | 5.0 | 0.5 | 0.0 | 6.94 | 1.0 | 83.44 | 2.5 | 1.75 | 0.75 |
| 1 | 7 | 2025-10-01 00:00:08 | 2025-10-01 00:00:08 | 1.0 | 5.0 | 1.0 | N | 107 | 225 | 1 | 28.2 | 0.0 | 0.5 | 8.49 | 0.0 | 1.0 | 42.44 | 2.5 | 0.0 | 0.75 |
| 2 | 2 | 2025-10-01 00:08:54 | 2025-10-01 00:14:44 | 1.0 | 2.75 | 1.0 | N | 263 | 229 | 1 | 12.8 | 1.0 | 0.5 | 3.71 | 0.0 | 1.0 | 22.26 | 2.5 | 0.0 | 0.75 |
| 3 | 1 | 2025-10-01 00:58:48 | 2025-10-01 01:04:40 | 1.0 | 1.3 | 1.0 | N | 211 | 231 | 2 | 7.9 | 4.25 | 0.5 | 0.0 | 0.0 | 1.0 | 13.65 | 2.5 | 0.0 | 0.75 |
| 4 | 2 | 2025-10-01 00:39:51 | 2025-10-01 00:49:40 | 1.0 | 2.88 | 1.0 | N | 230 | 151 | 1 | 14.2 | 1.0 | 0.5 | 3.99 | 0.0 | 1.0 | 23.94 | 2.5 | 0.0 | 0.75 |
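
One caveat of an existence check as the cache test: if a download is interrupted while the file is being written, a partial file could be treated as a valid cache entry on the next run. The sketch below shows one possible way to guard against that, by writing to a temporary file and renaming it into place. `cached_fetch` and `fake_download` are hypothetical names for illustration, not part of the notebook, and the stand-in download writes a few bytes instead of real Parquet data.

```python
import os
import tempfile


def cached_fetch(save_path: str, download) -> str:
    """Hypothetical helper: call download(tmp_path) only when save_path is missing.

    The download writes to a temporary file in the same directory, which is
    then renamed into place, so an interrupted run leaves no partial cache file.
    """
    if os.path.exists(save_path):
        return "cache"
    target_dir = os.path.dirname(save_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        download(tmp_path)
        os.replace(tmp_path, save_path)  # atomic when both paths share a filesystem
    finally:
        # Clean up the temporary file if the download failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return "downloaded"


def fake_download(path: str) -> None:
    """Stand-in for the real download (assumption: writes demo bytes only)."""
    with open(path, "wb") as fh:
        fh.write(b"demo")


with tempfile.TemporaryDirectory() as cache_dir:
    target = os.path.join(cache_dir, "demo.parquet")
    first = cached_fetch(target, fake_download)   # file missing: download runs
    second = cached_fetch(target, fake_download)  # file present: download skipped
    print(first, second)
```

In this pipeline, the `download` argument could be a small function that reads `config['data_url']` with `pd.read_parquet` and writes the result to the temporary path with `to_parquet`.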
