## TEMPO NO2 Data Download and Processing for NYC
Downloads TEMPO satellite NO2 data and converts it to CSV for ML training.

### Data Specifications

#### TEMPO NO2 L3 Product Details

- Spatial Resolution: ~2 km at nadir
- Temporal Resolution: Hourly during daylight (typically 12-14 scans/day)
- Coverage: Continental US, Mexico, and most of Canada
- Key Variables:
  - Tropospheric NO2 column (primary for air quality)
  - Stratospheric NO2 column
  - Total column NO2
  - Quality flags and uncertainties

#### Expected Data Volume

- NYC area: ~200-300 grid cells
- 2 weeks of data: ~200-300 files
- Each file: ~5-20 MB
- Total download: ~1-5 GB
- Processed CSV: ~10-50 MB

### Next Steps for ML Training

After downloading the data, you'll have:

- Spatial CSV: Use for spatial modeling or as features
- Time Series CSV: Use for temporal forecasting
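
As one possible starting point for temporal forecasting, the sketch below builds lagged features from a target column of the time series CSV. The helper name `add_lag_features`, the choice of lags, and the synthetic demo values are assumptions for illustration; the column name follows the `<variable>_mean` pattern produced by the aggregation step.

```python
import numpy as np
import pandas as pd


def add_lag_features(ts, target, lags=(1, 2, 3)):
    """Return a sorted copy of ts with lagged copies of `target` as new columns."""
    out = ts.sort_values("timestamp").copy()
    for lag in lags:
        out[f"{target}_lag{lag}"] = out[target].shift(lag)
    # Rows without a full lag history cannot be used for training
    lag_cols = [f"{target}_lag{lag}" for lag in lags]
    return out.dropna(subset=lag_cols).reset_index(drop=True)


# Synthetic stand-in for the time series CSV (values are made up)
demo_ts = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 13:00", periods=6,
                               freq=pd.Timedelta(hours=1)),
    "vertical_column_troposphere_mean": np.arange(6, dtype=float),
})
demo_features = add_lag_features(demo_ts, "vertical_column_troposphere_mean")
```

Because TEMPO only observes during daylight, consecutive rows can span an overnight gap, so a row-based lag is not always a one-hour lag. When reading the real CSV, pass `parse_dates=["timestamp"]` to `pd.read_csv` so the sort is chronological.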




In [3]:
import earthaccess
import xarray as xr
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')




### Configuration 
Set the bounding box (latitude and longitude range) for New York City, choose the date range to download, and create the directories where the data will be saved.

In [4]:
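# Approximate bounding box for New York City (assumed values; adjust as needed)
NYC_BOUNDS = {
    "min_lon": -74.3,
    "max_lon": -73.6,
    "min_lat": 40.4,
    "max_lat": 41.0,
}

# Date range to download (YYYY-MM-DD). TEMPO launched in April 2023 and began
# collecting data in August 2023, so earlier ranges may return no granules.
START_DATE = "2023-04-01"
END_DATE = "2023-07-01"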

# Output directories
OUTPUT_DIR = Path("tempo_data")
NETCDF_DIR = OUTPUT_DIR / "netcdf"
CSV_DIR = OUTPUT_DIR / "csv"

# Create directories
OUTPUT_DIR.mkdir(exist_ok=True)
NETCDF_DIR.mkdir(exist_ok=True)
CSV_DIR.mkdir(exist_ok=True)
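
The functions in the later steps expect the bounding box as a dictionary with `min_lon`, `max_lon`, `min_lat`, and `max_lat` keys. A small check like the one below can catch swapped or missing values before any search is run. The helper names `validate_bounds` and `bounds_to_bbox` and the demo coordinates are hypothetical, not part of the notebook.

```python
def validate_bounds(bounds):
    """Raise ValueError if the bounding-box dictionary is incomplete or inverted."""
    required = ("min_lon", "max_lon", "min_lat", "max_lat")
    missing = [key for key in required if key not in bounds]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    if not (-180 <= bounds["min_lon"] < bounds["max_lon"] <= 180):
        raise ValueError("Longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= bounds["min_lat"] < bounds["max_lat"] <= 90):
        raise ValueError("Latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    return bounds


def bounds_to_bbox(bounds):
    """Return (min_lon, min_lat, max_lon, max_lat), the order used by the search step."""
    return (bounds["min_lon"], bounds["min_lat"],
            bounds["max_lon"], bounds["max_lat"])


# Approximate demo values for the New York City area (an assumption)
demo_bounds = {"min_lon": -74.3, "max_lon": -73.6, "min_lat": 40.4, "max_lat": 41.0}
demo_bbox = bounds_to_bbox(validate_bounds(demo_bounds))
```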

### STEP 1: EARTHDATA AUTHENTICATION SETUP

Set up authentication for NASA Earthdata.

First-time users need to:
1. Create an account at https://urs.earthdata.nasa.gov/
2. Run earthaccess.login(), which prompts for a username and password
3. Pass persist=True to earthaccess.login() so the credentials are saved for future use

In [11]:
def setup_earthdata_auth():
  
    print("=" * 70)
    print("EARTHDATA AUTHENTICATION")
    print("=" * 70)
    
    try:
        # Use saved credentials (environment variables or ~/.netrc) if available,
        # otherwise prompt; persist=True saves them to ~/.netrc for next time
        auth = earthaccess.login(persist=True)
        
        if auth.authenticated:
            print("✓ Successfully authenticated with NASA Earthdata!")
            return auth
        else:
            print("✗ Authentication failed. Please check your credentials.")
            print("\nTo create an account:")
            print("1. Visit: https://urs.earthdata.nasa.gov/")
            print("2. Click 'Register' and follow the steps")
            print("3. Run this script again")
            return None
            
    except Exception as e:
        print(f"✗ Error during authentication: {e}")
        print("\nFirst time setup:")
        print("1. Create account at: https://urs.earthdata.nasa.gov/")
        print("2. Run: earthaccess.login() in Python")
        print("3. Enter your username and password when prompted")
        return None
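
For scheduled or headless runs where no prompt is possible, earthaccess can also read credentials from the `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD` environment variables. The sketch below checks for them before attempting a login; the helper names `has_env_credentials` and `login_noninteractive` are hypothetical.

```python
import os


def has_env_credentials(env=None):
    """Return True if both Earthdata environment variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("EARTHDATA_USERNAME")) and bool(env.get("EARTHDATA_PASSWORD"))


def login_noninteractive():
    """Log in from environment variables only, without prompting."""
    # Imported here so the check above can be used without earthaccess installed
    import earthaccess

    if not has_env_credentials():
        raise RuntimeError(
            "Set EARTHDATA_USERNAME and EARTHDATA_PASSWORD before running"
        )
    return earthaccess.login(strategy="environment")
```

Keeping the check separate from the login call makes it easy to fail early with a clear message in a batch job.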

### STEP 2: SEARCH FOR TEMPO DATA
Search for TEMPO NO2 Level 3 data.

Args:
- start_date: Start date string (YYYY-MM-DD)
- end_date: End date string (YYYY-MM-DD)
- bounds: Dictionary with min_lon, max_lon, min_lat, max_lat

Returns:
- List of granules found

In [6]:
def search_tempo_data(start_date, end_date, bounds=None):
 
    print("\n" + "=" * 70)
    print("SEARCHING FOR TEMPO DATA")
    print("=" * 70)
    print(f"Date Range: {start_date} to {end_date}")
    if bounds:
        print(f"Spatial Bounds: {bounds}")
    
    try:
        # Search for TEMPO NO2 L3 data
        search_params = {
            'short_name': 'TEMPO_NO2_L3',
            'temporal': (start_date, end_date)
        }
        
        # Add spatial bounds if provided
        if bounds:
            search_params['bounding_box'] = (
                bounds['min_lon'],
                bounds['min_lat'],
                bounds['max_lon'],
                bounds['max_lat']
            )
        
        results = earthaccess.search_data(**search_params)
        
        print(f"\n✓ Found {len(results)} granules")
        
        if len(results) > 0:
            print(f"\nSample granule info:")
            print(f"  - Size: {results[0].size()} MB")
            print(f"  - Format: {results[0].data_links()[0].split('.')[-1]}")
        
        return results
        
    except Exception as e:
        print(f"✗ Error searching for data: {e}")
        return []
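
Before downloading, it can help to check how the search results are spread across days, since gaps often explain a smaller-than-expected dataset. The sketch below counts granules per day from their file names. It assumes the names contain a UTC timestamp of the form `YYYYMMDDTHHMMSSZ`; the helper names and the demo file names are made up for illustration and should be checked against real results.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed timestamp pattern inside granule file names, e.g. 20240101T130000Z
_STAMP = re.compile(r"(\d{8}T\d{6})Z")


def parse_granule_time(name):
    """Return the datetime embedded in a granule name or URL, or None if absent."""
    match = _STAMP.search(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")


def granules_per_day(links):
    """Count granules per calendar day (ISO date string -> count)."""
    counts = Counter()
    for link in links:
        stamp = parse_granule_time(link)
        if stamp is not None:
            counts[stamp.date().isoformat()] += 1
    return dict(sorted(counts.items()))


# Made-up file names used only to exercise the helpers
demo_links = [
    "TEMPO_NO2_L3_V03_20240101T130000Z_S003.nc",
    "TEMPO_NO2_L3_V03_20240101T140000Z_S004.nc",
    "TEMPO_NO2_L3_V03_20240102T130000Z_S003.nc",
]
demo_counts = granules_per_day(demo_links)
```

With real search results, the links could be collected with `[g.data_links()[0] for g in results]` and passed to `granules_per_day`.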

### STEP 3: DOWNLOAD DATA
Download TEMPO granules to a local directory.

Args:
- granules: List of granules from search
- output_dir: Directory to save files

Returns:
- List of downloaded file paths

In [7]:
def download_tempo_data(granules, output_dir):

    print("\n" + "=" * 70)
    print("DOWNLOADING DATA")
    print("=" * 70)
    
    if not granules:
        print("✗ No granules to download")
        return []
    
    try:
        # Download files
        print(f"Downloading {len(granules)} files to {output_dir}...")
        files = earthaccess.download(
            granules,
            local_path=str(output_dir)
        )
        
        print(f"✓ Successfully downloaded {len(files)} files")
        return files
        
    except Exception as e:
        print(f"✗ Error downloading data: {e}")
        return []
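
After a long download it is worth confirming what actually landed on disk. The sketch below summarizes the NetCDF files in a directory and lists zero-byte files, which usually indicate an interrupted transfer. The helper name `summarize_downloads` is hypothetical, and the demo writes small placeholder files to a temporary directory rather than real granules.

```python
import tempfile
from pathlib import Path


def summarize_downloads(directory, pattern="*.nc"):
    """Return the file count, total size in MB, and names of empty files."""
    files = sorted(Path(directory).glob(pattern))
    sizes = {f.name: f.stat().st_size for f in files}
    return {
        "count": len(files),
        "total_mb": sum(sizes.values()) / 1e6,
        "empty": [name for name, size in sizes.items() if size == 0],
    }


# Demo with placeholder files (not real TEMPO data)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.nc").write_bytes(b"x" * 2_000_000)
    (Path(tmp) / "b.nc").write_bytes(b"")
    demo_summary = summarize_downloads(tmp)
```

In the notebook this could be called as `summarize_downloads(NETCDF_DIR)` once the download step has finished.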

### STEP 4: PROCESS NETCDF TO CSV
Process TEMPO NetCDF files and create a CSV for ML training.

Args:
- netcdf_files: List of NetCDF file paths
- bounds: Dictionary with spatial bounds for NYC
- output_csv: Output CSV file path

Returns:
- DataFrame with processed data

In [8]:
def process_netcdf_to_csv(netcdf_files, bounds, output_csv):
 
    print("\n" + "=" * 70)
    print("PROCESSING NETCDF TO CSV")
    print("=" * 70)
    
    all_data = []
    
    for i, file in enumerate(netcdf_files, 1):
        print(f"\nProcessing file {i}/{len(netcdf_files)}: {Path(file).name}")
        
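        try:
            # Open the root group (coordinates), then merge in the groups that
            # are expected to hold the science variables; TEMPO L3 files store
            # most variables in groups such as 'product', not at the root
            ds = xr.open_dataset(file)
            for group in ('product', 'support_data', 'geolocation'):
                try:
                    grp = xr.open_dataset(file, group=group)
                    ds = xr.merge([ds, grp], combine_attrs='override')
                except (OSError, KeyError, ValueError):
                    pass  # Group missing or incompatible: keep what we have
            
            # Check what variables are available
            print(f"  Available variables: {list(ds.data_vars)}")
            
            # Subset to NYC area (assumes ascending latitude/longitude coordinates)
            ds_nyc = ds.sel(
                longitude=slice(bounds['min_lon'], bounds['max_lon']),
                latitude=slice(bounds['min_lat'], bounds['max_lat'])
            )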
            
            # Extract time
            if 'time' in ds_nyc.coords:
                time_val = pd.to_datetime(ds_nyc.time.values[0])
            else:
                # Parse from filename or use file attributes
                time_val = pd.to_datetime(ds.attrs.get('time_coverage_start', 'NaT'))
            
            # Extract NO2 data (adjust variable names based on what's available)
            variables_to_extract = [
                'vertical_column_troposphere',  # Tropospheric NO2
                'vertical_column_stratosphere',  # Stratospheric NO2
                'vertical_column_total',         # Total column NO2
                'column_amount',                 # Alternative name
                'tropopause_pressure',          # Useful ancillary data
                'surface_pressure',
                'cloud_fraction',
                'solar_zenith_angle',
                'weight'  # Area weight
            ]
            
            # Create dataframe for this file
            lons, lats = np.meshgrid(ds_nyc.longitude.values, ds_nyc.latitude.values)
            
            file_data = {
                'timestamp': time_val,
                'date': time_val.date() if not pd.isna(time_val) else None,
                'hour': time_val.hour if not pd.isna(time_val) else None,
                'latitude': lats.flatten(),
                'longitude': lons.flatten()
            }
            
            # Extract available variables
            for var in variables_to_extract:
                if var in ds_nyc.data_vars:
                    data = ds_nyc[var].values
                    if data.ndim == 3:  # (time, lat, lon)
                        data = data[0]  # Take first time step
                    file_data[var] = data.flatten()
                    print(f"  ✓ Extracted: {var}")
            
            # Convert to DataFrame
            df = pd.DataFrame(file_data)
            
            # Remove rows with all NaN values in data columns
            data_cols = [col for col in df.columns if col not in ['timestamp', 'date', 'hour', 'latitude', 'longitude']]
            df = df.dropna(subset=data_cols, how='all')
            
            if len(df) > 0:
                all_data.append(df)
                print(f"  ✓ Extracted {len(df)} valid data points")
            else:
                print(f"  ⚠ No valid data points found")
            
            ds.close()
            
        except Exception as e:
            print(f"  ✗ Error processing file: {e}")
            continue
    
    # Combine all data
    if all_data:
        print("\n" + "-" * 70)
        print("Combining all data...")
        final_df = pd.concat(all_data, ignore_index=True)
        
        # Sort by timestamp
        final_df = final_df.sort_values('timestamp')
        
        # Save to CSV
        final_df.to_csv(output_csv, index=False)
        print(f"✓ Saved combined data to: {output_csv}")
        print(f"  - Total rows: {len(final_df):,}")
        print(f"  - Columns: {list(final_df.columns)}")
        print(f"  - Date range: {final_df['timestamp'].min()} to {final_df['timestamp'].max()}")
        
        # Display summary statistics
        print("\n" + "-" * 70)
        print("DATA SUMMARY")
        print("-" * 70)
        print(final_df.describe())
        
        return final_df
    else:
        print("✗ No data to combine")
        return None
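
The product details list quality flags among the key variables, but the processing step keeps every pixel that has any data. A screening pass like the sketch below could be applied to the spatial DataFrame before training. The helper name `filter_valid_pixels`, the 0.2 cloud-fraction threshold, the `main_data_quality_flag` column name, and the convention that a flag of 0 means good quality are all assumptions to verify against the TEMPO user guide; the flag column would also need to be added to the extraction list.

```python
import numpy as np
import pandas as pd


def filter_valid_pixels(df, column="vertical_column_troposphere",
                        cloud_column="cloud_fraction", max_cloud_fraction=0.2,
                        flag_column="main_data_quality_flag"):
    """Keep rows with a valid value, low cloud cover, and a good quality flag."""
    mask = df[column].notna()
    # Optional screens are applied only when the column is present
    if cloud_column in df.columns:
        # NaN cloud fractions compare as False and are therefore dropped
        mask &= df[cloud_column] <= max_cloud_fraction
    if flag_column in df.columns:
        mask &= df[flag_column] == 0
    return df[mask].reset_index(drop=True)


# Made-up values used only to exercise the filter
demo_pixels = pd.DataFrame({
    "vertical_column_troposphere": [1.0e15, np.nan, 3.0e15, 4.0e15],
    "cloud_fraction": [0.10, 0.10, 0.50, 0.15],
    "main_data_quality_flag": [0, 0, 0, 1],
})
demo_clean = filter_valid_pixels(demo_pixels)
```

Only the first demo row survives: the second has no value, the third is too cloudy, and the fourth fails the quality flag.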


### STEP 5: CREATE AGGREGATED TIME SERIES
Create an aggregated time series (spatial statistics over the NYC area for each timestamp) for ML training.

Args:
- df: DataFrame with spatial data
- output_csv: Output CSV file path

Returns:
- Aggregated DataFrame

In [9]:
def create_aggregated_timeseries(df, output_csv):
 
    print("\n" + "=" * 70)
    print("CREATING AGGREGATED TIME SERIES")
    print("=" * 70)
    
    if df is None or len(df) == 0:
        print("✗ No data to aggregate")
        return None
    
    # Group by timestamp and calculate spatial averages over NYC
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    # Exclude coordinates and 'hour', which is re-derived from the timestamp below
    numeric_cols = [col for col in numeric_cols if col not in ['latitude', 'longitude', 'hour']]
    
    agg_dict = {col: ['mean', 'std', 'min', 'max', 'count'] for col in numeric_cols}
    
    agg_df = df.groupby('timestamp').agg(agg_dict)
    
    # Flatten column names
    agg_df.columns = ['_'.join(col).strip() for col in agg_df.columns.values]
    agg_df = agg_df.reset_index()
    
    # Add time features
    agg_df['date'] = agg_df['timestamp'].dt.date
    agg_df['hour'] = agg_df['timestamp'].dt.hour
    agg_df['day_of_week'] = agg_df['timestamp'].dt.dayofweek
    agg_df['is_weekend'] = agg_df['day_of_week'].isin([5, 6]).astype(int)
    
    # Save aggregated data
    agg_df.to_csv(output_csv, index=False)
    print(f"✓ Saved aggregated time series to: {output_csv}")
    print(f"  - Total time points: {len(agg_df)}")
    print(f"  - Columns: {len(agg_df.columns)}")
    
    print("\n" + "-" * 70)
    print("AGGREGATED DATA PREVIEW")
    print("-" * 70)
    print(agg_df.head(10))
    
    return agg_df
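
The aggregation above produces one row per timestamp. If daily values are also needed, they can be derived from that table by resampling, as in the sketch below. The helper name `to_daily` and the demo values are assumptions for illustration; the column name follows the `<variable>_mean` pattern created when the column names are flattened.

```python
import pandas as pd


def to_daily(agg_df, value_col):
    """Resample a per-timestamp column to daily mean, max, and sample count."""
    series = agg_df.set_index("timestamp")[value_col]
    daily = series.resample("D").agg(["mean", "max", "count"])
    # Drop days with no observations instead of keeping NaN rows
    daily = daily[daily["count"] > 0]
    return daily.add_prefix("daily_").reset_index()


# Made-up values: two scans on one day, none the next day, one the day after
demo_agg = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 13:00", "2024-01-01 14:00", "2024-01-03 13:00",
    ]),
    "vertical_column_troposphere_mean": [1.0, 3.0, 5.0],
})
demo_daily = to_daily(demo_agg, "vertical_column_troposphere_mean")
```

The `daily_count` column records how many scans contributed to each day, which is useful for discarding days with too few observations.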


### Main Execution


In [None]:


# ============================================================================
# MAIN EXECUTION
# ============================================================================

def main():
    """Main execution function."""
    
    print("\n" + "=" * 70)
    print("TEMPO NO2 DATA DOWNLOAD AND PROCESSING FOR NYC")
    print("=" * 70)
    print(f"Date Range: {START_DATE} to {END_DATE}")
    print(f"Location: NYC ({NYC_BOUNDS})")
    print(f"Output Directory: {OUTPUT_DIR}")
    
    # Step 1: Authenticate
    auth = setup_earthdata_auth()
    if not auth:
        print("\n✗ Cannot proceed without authentication")
        return
    
    # Step 2: Search for data
    granules = search_tempo_data(START_DATE, END_DATE, NYC_BOUNDS)
    if not granules:
        print("\n✗ No data found. Check date range and try again.")
        print("Note: TEMPO data is only available from 2023 onwards")
        return
    
    # Step 3: Download data
    downloaded_files = download_tempo_data(granules, NETCDF_DIR)
    if not downloaded_files:
        print("\n✗ No files downloaded")
        return
    
    # Step 4: Process to CSV
    spatial_csv = CSV_DIR / "tempo_no2_nyc_spatial.csv"
    df = process_netcdf_to_csv(downloaded_files, NYC_BOUNDS, spatial_csv)
    
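    # Step 5: Create aggregated time series (skipped if nothing was processed)
    if df is None:
        print("\n✗ No data could be processed; no CSV files were written")
        return
    aggregated_csv = CSV_DIR / "tempo_no2_nyc_timeseries.csv"
    agg_df = create_aggregated_timeseries(df, aggregated_csv)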
    
    print("\n" + "=" * 70)
    print("✓ PROCESSING COMPLETE!")
    print("=" * 70)
    print(f"\nOutput files:")
    print(f"  1. NetCDF files: {NETCDF_DIR}")
    print(f"  2. Spatial CSV: {spatial_csv}")
    print(f"  3. Time series CSV: {CSV_DIR / 'tempo_no2_nyc_timeseries.csv'}")
    print("\nYou can now use the CSV files for ML training!")

if __name__ == "__main__":
    main()

### Download Multiple Regions
Run the script multiple times with different bounds and output directories.
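
Instead of editing the configuration by hand for each run, the regions can be kept in a dictionary and looped over, with one output directory per region. The sketch below shows that structure. The names `REGIONS`, `region_dirs`, and `run_regions` are hypothetical, the coordinates are rough approximations, and the demo uses a stand-in pipeline that only returns a path.

```python
import tempfile
from pathlib import Path

# Approximate bounding boxes (assumed values; adjust before use)
REGIONS = {
    "nyc": {"min_lon": -74.3, "max_lon": -73.6, "min_lat": 40.4, "max_lat": 41.0},
    "philadelphia": {"min_lon": -75.3, "max_lon": -74.9, "min_lat": 39.8, "max_lat": 40.2},
}


def region_dirs(base_dir, region):
    """Create and return the netcdf/ and csv/ directories for one region."""
    root = Path(base_dir) / region
    netcdf_dir, csv_dir = root / "netcdf", root / "csv"
    netcdf_dir.mkdir(parents=True, exist_ok=True)
    csv_dir.mkdir(parents=True, exist_ok=True)
    return netcdf_dir, csv_dir


def run_regions(regions, base_dir, pipeline):
    """Call pipeline(name, bounds, netcdf_dir, csv_dir) once per region."""
    results = {}
    for name, bounds in regions.items():
        netcdf_dir, csv_dir = region_dirs(base_dir, name)
        results[name] = pipeline(name, bounds, netcdf_dir, csv_dir)
    return results


def _demo_pipeline(name, bounds, netcdf_dir, csv_dir):
    # Stand-in for the search, download, and processing steps
    return csv_dir / f"tempo_no2_{name}_spatial.csv"


with tempfile.TemporaryDirectory() as tmp:
    demo_outputs = run_regions(REGIONS, tmp, _demo_pipeline)
    demo_dirs_exist = all((Path(tmp) / name / "netcdf").is_dir() for name in REGIONS)
```

A real pipeline function would chain `search_tempo_data`, `download_tempo_data`, and `process_netcdf_to_csv` using the bounds and directories it receives.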

### Support and Resources

- TEMPO Documentation: https://tempo.si.edu/
- Earthdata Search: https://search.earthdata.nasa.gov/
- earthaccess Docs: https://earthaccess.readthedocs.io/
- TEMPO Data Guide: https://asdc.larc.nasa.gov/project/TEMPO

### Troubleshooting Checklist

- [ ] NASA Earthdata account created
- [ ] All packages installed (pip list | grep earthaccess)
- [ ] Authenticated successfully (earthaccess.login())
- [ ] Date range is valid (2023 onwards)
- [ ] Sufficient disk space (~5-10 GB)
- [ ] Internet connection stable
- [ ] NYC bounds configured correctly
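
Some of these checks can be automated. The sketch below reports whether the required packages can be imported, whether enough disk space is free, and whether a saved `~/.netrc` credentials file exists. The helper name `preflight` and the 10 GB default are assumptions; the account, network, and bounds items still need to be checked by hand.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(packages=("earthaccess", "xarray", "pandas", "numpy"),
              min_free_gb=10, path="."):
    """Return a dict mapping each automated check to True (pass) or False (fail)."""
    report = {
        f"package:{name}": importlib.util.find_spec(name) is not None
        for name in packages
    }
    free_gb = shutil.disk_usage(path).free / 1e9
    report["disk_space"] = free_gb >= min_free_gb
    # Credentials may live elsewhere (for example in environment variables),
    # so a missing file is a hint rather than a failure
    report["netrc_present"] = (Path.home() / ".netrc").exists()
    return report


# Demo using a standard-library module so it runs in any environment
demo_report = preflight(packages=("json",), min_free_gb=0)
```

Running `preflight()` with the defaults before the main script gives a quick overview of which checklist items need attention.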