# Overview

This example notebook shows how to start from a CSV file containing timestamps, latitudes, and longitudes, build a query for the corresponding TEMPO data, and then extract the TEMPO values matching those times and locations and write them out to a new CSV file.

**Notebook Author / Affiliation**

* Author: Carl Malings / NASA ARSET
* This notebook is based on examples from the [ASDC Data and User Services Github](https://github.com/nasa/ASDC_Data_and_User_Services).

## Package Installation and Setup

*Instructions*

* Run the cell below to install the non-standard package required for this exercise.

In [47]:
!pip install --quiet harmony-py

*Instructions*

* Run the code cell below to import the required packages.

In [48]:
# Downloading TEMPO data
import datetime as dt
import getpass
import os
from harmony import BBox, Client, Collection, Request
from harmony.config import Environment

# Opening TEMPO data files
import xarray as xr

# Working with data tables
import numpy as np
import pandas as pd

# Accessing Google Drive
from google.colab import drive

# Basic Settings

This cell defines the basic settings and variables used throughout the code; ideally, it is the only place where you need to make changes.

*Instructions*

* Change the basic settings in the code below if desired.
* Note that the CSV file you are using must be uploaded to your Google Drive to be accessible by Google Colab.

In [61]:
### DESIRED TEMPO DATA INFORMATION ###
TEMPO_Collection_ID = 'C2930763263-LARC_CLOUD' # Collection ID for the data to be retrieved, in this case, the TEMPO Level 3 NO2 data

# QA information for filtering the TEMPO data:
Max_Quality_Flag = 1
Max_Cloud_Fraction = 0.5
Max_Solar_Zenith_Angle = 80

# This Python dictionary maps the names of the variables in the TEMPO data files (keys, left of the colon) to the names those variables will have in the output CSV file (values, right of the colon)
TEMPO_variables_to_keep_and_rename = {'product/vertical_column_troposphere':'TEMPO_no2_vertical_column_troposphere',
                                      'product/vertical_column_stratosphere':'TEMPO_no2_vertical_column_stratosphere'}

### CSV FILE INFORMATION ###
CSV_file_path = '/content/drive/MyDrive/CSV_Test_File.csv' # This is the path to the CSV file in your Google Drive
CSV_file_name_append = '_with_TEMPO_data' # This text is appended to the input file name to form the output file name, distinguishing the output file from the input file

Column_Name_Timestamp = 'Time (EST)' # Name of the column in the CSV file with timestamps
Column_Name_Latitude = 'Latitude' # Name of the column in the CSV file with latitude coordinates
Column_Name_Longitude = 'Longitude' # Name of the column in the CSV file with longitude coordinates

Time_Zone = 'US/Eastern' # Time Zone used by the timestamps in the CSV File
# Suggested Time Zone Options are:
# 'UTC'             : Coordinated Universal Time
# 'US/Eastern'      : Eastern US Time
# 'US/Central'      : Central US Time
# 'US/Mountain'     : Mountain US Time
# 'US/Pacific'      : Pacific US Time
# 'America/Phoenix' : Arizona Time (no daylight saving time)
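
The `Time_Zone` setting controls how the CSV timestamps are shifted to UTC before they are matched against TEMPO data. The sketch below is a minimal illustration, assuming a hypothetical timestamp string in the same style as the example file. It compares a named zone, which follows daylight saving time, with a fixed-offset zone. If a file's timestamps were recorded in standard time year-round, as a column label such as 'Time (EST)' might suggest, a fixed offset may be the better choice; that is an assumption to check against how the file was produced.

```python
import pandas as pd

# Hypothetical timestamp string, in the same style as the example CSV file
local_time_string = '05/06/2025 09:00'

# Named zone: follows daylight saving time, so a May timestamp is treated as UTC-4
eastern = pd.Timestamp(local_time_string, tz='US/Eastern').tz_convert('UTC')

# Fixed-offset zone: always UTC-5 (the sign in the 'Etc/GMT+5' name is inverted by convention)
fixed_est = pd.Timestamp(local_time_string, tz='Etc/GMT+5').tz_convert('UTC')

print(eastern)
print(fixed_est)
```

The two results differ by one hour, which is comparable to the spacing between TEMPO scans, so the choice can change which scan a row is matched to.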

*Authenticate your Earthdata credentials*

In [None]:
username = input("Username:")
harmony_client = Client(env=Environment.PROD, auth=(username, getpass.getpass()))

# Main Workflow

## Access and read the CSV File to define the area of interest

You will need to authorize the notebook to connect to your Google Drive to find the file.

In [50]:
# Authorize access to Google Drive
drive.mount('/content/drive')

# Read the file:
f_data_csv = pd.read_csv(CSV_file_path)

# Parse the timestamps, convert them to UTC, and store them in a temporary column
v_times_csv = [pd.Timestamp(timestamp,tz=Time_Zone).tz_convert(tz='UTC') for timestamp in f_data_csv[Column_Name_Timestamp]]
f_data_csv['TEMPORARY_COLUMN_TIME_UTC'] = v_times_csv

# Find Min and Max Time, Latitude, and Longitude
t_min = np.min(v_times_csv)
t_max = np.max(v_times_csv)
lat_min = f_data_csv[Column_Name_Latitude].min()
lat_max = f_data_csv[Column_Name_Latitude].max()
lon_min = f_data_csv[Column_Name_Longitude].min()
lon_max = f_data_csv[Column_Name_Longitude].max()

# Define the Region of Interest, start, and stop time. Pad the values slightly to ensure coverage.
RoI = [lon_min-0.1, lat_min-0.1, lon_max+0.1, lat_max+0.1]
query_start = t_min - pd.Timedelta(hours=1)
query_stop = t_max + pd.Timedelta(hours=1)

# Sanity checks: warn if the region or the time range is very large
if ((lon_max-lon_min) > 10):
  print('WARNING: You have selected a very large longitude range (> 10 degrees)')
if ((lat_max-lat_min) > 10):
  print('WARNING: You have selected a very large latitude range (> 10 degrees)')
if ((t_max-t_min) > pd.Timedelta(days=10)):
  print('WARNING: You have selected a very large time range (> 10 days)')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
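
The same bounding box and time window can also be computed with vectorized pandas operations rather than a Python list comprehension. This is a minimal sketch on a hypothetical two-row table standing in for the CSV file; the column names and padding values mirror those used in this notebook, and the explicit `format` string is an assumption about how the timestamps are written.

```python
import pandas as pd

# Hypothetical stand-in for the CSV file; column names match the Basic Settings
example = pd.DataFrame({
    'Time (EST)': ['05/06/2025 09:00', '05/06/2025 14:47'],
    'Latitude': [38.95, 38.66],
    'Longitude': [-76.80, -76.85],
})

# Parse the whole column at once, attach the local zone, then convert to UTC
times_utc = (pd.to_datetime(example['Time (EST)'], format='%m/%d/%Y %H:%M')
             .dt.tz_localize('US/Eastern')
             .dt.tz_convert('UTC'))

pad_degrees = 0.1                 # same spatial padding as in this notebook
pad_time = pd.Timedelta(hours=1)  # same temporal padding as in this notebook

# Bounding box in [west, south, east, north] order
bbox = [example['Longitude'].min() - pad_degrees,
        example['Latitude'].min() - pad_degrees,
        example['Longitude'].max() + pad_degrees,
        example['Latitude'].max() + pad_degrees]

# Padded (start, stop) time window in UTC
window = (times_utc.min() - pad_time, times_utc.max() + pad_time)

print(bbox)
print(window)
```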


## Get TEMPO Data


*Instructions*

Run the cells in sequence to download and process the desired TEMPO data.

*Build the request*

In [51]:
# Structure the request:
request = Request(
    collection=Collection(id=TEMPO_Collection_ID),
    temporal={
        'start': dt.datetime(query_start.year, query_start.month, query_start.day, query_start.hour),
        'stop': dt.datetime(query_stop.year, query_stop.month, query_stop.day, query_stop.hour)
    },
    spatial=BBox(RoI[0], RoI[1], RoI[2], RoI[3]),
)

# Check the request is valid:
request.is_valid()

True
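
Building the temporal range from the year, month, day, and hour alone rounds both ends of the window down to the start of the hour. The one-hour padding applied earlier absorbs this, but if the padding were reduced, the end of the window could fall before the last sample. The sketch below shows one possible alternative that rounds the start down and the stop up; `hour_aligned_window` is a hypothetical helper, not part of `harmony-py`.

```python
import datetime as dt

def hour_aligned_window(start, stop):
    """Round start down and stop up to whole hours (hypothetical helper)."""
    floored_start = start.replace(minute=0, second=0, microsecond=0)
    floored_stop = stop.replace(minute=0, second=0, microsecond=0)
    # Only step forward when stop was not already exactly on the hour
    if floored_stop < stop:
        ceiled_stop = floored_stop + dt.timedelta(hours=1)
    else:
        ceiled_stop = floored_stop
    return floored_start, ceiled_stop

start, stop = hour_aligned_window(dt.datetime(2025, 5, 6, 12, 0),
                                  dt.datetime(2025, 5, 6, 19, 47))
print(start, stop)
```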

*Submit and monitor the request*

In [53]:
job_id = harmony_client.submit(request)
print(f"jobID = {job_id}")

harmony_client.wait_for_processing(job_id, show_progress=True)

 [ Processing:   0% ] |                                                   | [/]

jobID = f2a7f977-a514-414a-a780-a794d7667d35


 [ Processing: 100% ] |###################################################| [|]


*Download the results*

In [54]:
download_dir = os.path.expanduser("~/tempo_data_for_csv")
os.makedirs(download_dir, exist_ok=True)

results = harmony_client.download_all(job_id, directory=download_dir)
all_results_stored = [f.result() for f in results]

print(f"Number of files: {len(all_results_stored)}")

/root/tempo_data_for_csv/108126295_TEMPO_NO2_L3_V03_20250506T112109Z_S002_subsetted.nc4
/root/tempo_data_for_csv/108126297_TEMPO_NO2_L3_V03_20250506T124125Z_S004_subsetted.nc4
/root/tempo_data_for_csv/108126296_TEMPO_NO2_L3_V03_20250506T120117Z_S003_subsetted.nc4
/root/tempo_data_for_csv/108126298_TEMPO_NO2_L3_V03_20250506T132133Z_S005_subsetted.nc4
/root/tempo_data_for_csv/108126300_TEMPO_NO2_L3_V03_20250506T150141Z_S007_subsetted.nc4
/root/tempo_data_for_csv/108126299_TEMPO_NO2_L3_V03_20250506T140141Z_S006_subsetted.nc4
/root/tempo_data_for_csv/108126301_TEMPO_NO2_L3_V03_20250506T160141Z_S008_subsetted.nc4
/root/tempo_data_for_csv/108126303_TEMPO_NO2_L3_V03_20250506T180141Z_S010_subsetted.nc4
/root/tempo_data_for_csv/108126302_TEMPO_NO2_L3_V03_20250506T170141Z_S009_subsetted.nc4
Number of files: 9
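
The files are not necessarily listed in time order, and sorting by full path orders them by the numeric prefix rather than by time. As a minimal sketch, the timestamp embedded in each file name can be parsed and used as the sort key. This assumes that the `YYYYMMDDThhmmssZ` field in the name is the scan start time in UTC; `scan_start_time` is a hypothetical helper.

```python
import datetime as dt
import os
import re

# Pattern for the timestamp field embedded in the file names listed above
SCAN_TIME_PATTERN = re.compile(r'_(\d{8}T\d{6})Z_')

def scan_start_time(path):
    """Return the timestamp parsed from a file name (hypothetical helper)."""
    match = SCAN_TIME_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f'No timestamp found in {path}')
    return dt.datetime.strptime(match.group(1), '%Y%m%dT%H%M%S')

# Two of the file names from the listing above, deliberately out of order
example_files = [
    '/root/tempo_data_for_csv/108126297_TEMPO_NO2_L3_V03_20250506T124125Z_S004_subsetted.nc4',
    '/root/tempo_data_for_csv/108126295_TEMPO_NO2_L3_V03_20250506T112109Z_S002_subsetted.nc4',
]

files_by_time = sorted(example_files, key=scan_start_time)
print(files_by_time)
```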


*Open the files and combine into a single Dataset*

In [63]:
# Define which variables to keep and rename:
variables_to_keep_and_rename = TEMPO_variables_to_keep_and_rename.copy()
variables_to_keep_and_rename['product/main_data_quality_flag'] = 'qc_flag'
variables_to_keep_and_rename['geolocation/solar_zenith_angle'] = 'sza'
variables_to_keep_and_rename['support_data/eff_cloud_fraction'] = 'cloud_fraction'

# Create a dictionary to store the data:
data_dictionary = {variable:[] for variable in variables_to_keep_and_rename.keys()}

# Loop through the result files:
for result_file in sorted(all_results_stored):
    # Open each file once, rather than once per variable:
    datatree = xr.open_datatree(result_file)
    # Loop through the variables:
    for variable in variables_to_keep_and_rename.keys():
        # For each file and variable, add the data from that file to the appropriate list in the dictionary:
        data_dictionary[variable].append(datatree[variable])

# Concatenate each list into a single DataArray along the time dimension:
for variable in variables_to_keep_and_rename.keys():
    data_dictionary[variable] = xr.concat(data_dictionary[variable],dim='time')

# Merge the DataArrays together into a single Dataset:
tempo_data = xr.merge([data_dictionary[variable] for variable in variables_to_keep_and_rename.keys()])

# Rename the variables
tempo_data = tempo_data.rename({variable.split('/')[1]:variables_to_keep_and_rename[variable] for variable in variables_to_keep_and_rename.keys()})

# Examine the result:
tempo_data
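
The rename step relies on the fact that the settings dictionary is keyed by `group/variable` paths, while the merged data is keyed by the variable name alone. The minimal sketch below builds the same kind of mapping in plain Python. It takes the text after the last `/`, which is a small variation on the cell above (an assumption that deeper group paths might occur) and gives the same result for the two-level paths used here.

```python
# Same structure as the settings dictionary: 'group/variable' path -> output name
paths_to_new_names = {
    'product/vertical_column_troposphere': 'TEMPO_no2_vertical_column_troposphere',
    'product/main_data_quality_flag': 'qc_flag',
    'geolocation/solar_zenith_angle': 'sza',
}

# Take the text after the last '/', so the result does not depend on group depth
rename_map = {path.rsplit('/', 1)[-1]: new_name
              for path, new_name in paths_to_new_names.items()}

print(rename_map)
```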

*Apply quality control*

In [64]:
filter_qa = tempo_data['qc_flag'] <= Max_Quality_Flag
filter_sza = tempo_data['sza'] < Max_Solar_Zenith_Angle
filter_cf = tempo_data['cloud_fraction'] < Max_Cloud_Fraction

tempo_data_filtered = tempo_data.where(filter_qa & filter_sza & filter_cf).squeeze()
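
The `where` call masks rather than drops: values that fail any of the three checks are replaced with NaN, and the grid keeps its shape. The sketch below reproduces that behavior with NumPy on a handful of hypothetical pixel values (not real TEMPO data), using the same thresholds as the Basic Settings cell.

```python
import numpy as np

# Hypothetical values for four pixels (not real TEMPO data)
no2 = np.array([2.7e15, 1.0e15, 3.1e15, 2.5e15])
qc_flag = np.array([0, 2, 0, 1])
sza = np.array([45.0, 50.0, 85.0, 60.0])
cloud_fraction = np.array([0.1, 0.2, 0.1, 0.7])

# Same thresholds as the Basic Settings cell
keep = (qc_flag <= 1) & (sza < 80) & (cloud_fraction < 0.5)

# Keep values where the mask is True; replace the rest with NaN
no2_filtered = np.where(keep, no2, np.nan)

print(keep)
print(no2_filtered)
```

Only the first pixel passes all three checks here; each of the others fails exactly one of them. Masked pixels are one reason a matched row can end up with no TEMPO value.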

## Match TEMPO Data with CSV coordinates and timestamps and append

In [82]:
# Create new columns in the table to store the TEMPO data, initialized to NaN
for new_variable in TEMPO_variables_to_keep_and_rename.values():
  f_data_csv[new_variable] = np.nan

# Loop through the rows of the table:
for row in range(len(f_data_csv)):
  sample_latitude = f_data_csv[Column_Name_Latitude][row] # latitude of the monitor
  sample_longitude = f_data_csv[Column_Name_Longitude][row] # longitude of the monitor
  sample_time = np.datetime64(str(f_data_csv['TEMPORARY_COLUMN_TIME_UTC'][row])[0:19]) # timestamp of the row, converted to datetime64 type for compatibility

  # select the TEMPO data closest to the sample time and location:
  tempo_at_sample = tempo_data_filtered.sel(latitude=sample_latitude,longitude=sample_longitude,method='nearest',tolerance=0.02).sel(time=sample_time,method='nearest',tolerance=np.timedelta64(1,'h'))

  # Store the TEMPO data into the appropriate row and column of the table:
  for new_variable in TEMPO_variables_to_keep_and_rename.values():
    f_data_csv.loc[row,new_variable] = tempo_at_sample[new_variable].values

# Examine the resulting table:
f_data_csv

Unnamed: 0,Time (EST),Latitude,Longitude,NO2 Measurement,TEMPORARY_COLUMN_TIME_UTC,TEMPO_no2_vertical_column_troposphere,TEMPO_no2_vertical_column_stratosphere
0,05/06/2025 09:00,38.95,-76.8,10,2025-05-06 13:00:00+00:00,,
1,05/06/2025 09:30,38.93,-77.13,15,2025-05-06 13:30:00+00:00,,
2,05/06/2025 10:05,39.07,-76.97,18,2025-05-06 14:05:00+00:00,2733385000000000.0,2913120000000000.0
3,05/06/2025 10:45,38.61,-76.72,7,2025-05-06 14:45:00+00:00,1043817000000000.0,3033916000000000.0
4,05/06/2025 11:15,38.75,-77.25,12,2025-05-06 15:15:00+00:00,,
5,05/06/2025 11:17,39.01,-76.25,9,2025-05-06 15:17:00+00:00,-800721500000000.0,3033014000000000.0
6,05/06/2025 12:30,38.88,-76.59,11,2025-05-06 16:30:00+00:00,,
7,05/06/2025 13:00,39.02,-76.99,6,2025-05-06 17:00:00+00:00,2508684000000000.0,3385710000000000.0
8,05/06/2025 13:50,38.71,-77.08,13,2025-05-06 17:50:00+00:00,1.73538e+16,3519297000000000.0
9,05/06/2025 14:47,38.66,-76.85,14,2025-05-06 18:47:00+00:00,2479906000000000.0,3478177000000000.0
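
The matching step picks the nearest grid cell and the nearest scan time, each subject to a tolerance. The sketch below illustrates nearest-neighbor lookup with a tolerance in NumPy; `nearest_index` is a hypothetical helper and the latitude grid is made up, so this is not xarray's implementation. One difference to be aware of: as far as I know, xarray's `sel` with `method='nearest'` and a `tolerance` raises a `KeyError` when no coordinate lies within the tolerance, whereas this sketch returns `None`.

```python
import numpy as np

def nearest_index(grid, target, tolerance):
    """Index of the grid value closest to target, or None if farther than tolerance."""
    distances = np.abs(np.asarray(grid) - target)
    index = int(np.argmin(distances))
    return index if distances[index] <= tolerance else None

# Hypothetical 0.02-degree latitude grid (not real TEMPO coordinates)
latitudes = np.array([38.59, 38.61, 38.63, 38.65])

inside = nearest_index(latitudes, 38.612, tolerance=0.02)   # close to 38.61
outside = nearest_index(latitudes, 38.80, tolerance=0.02)   # beyond the grid edge

print(inside, outside)
```

A sample that falls outside the downloaded region or more than an hour from any scan would need to be handled explicitly, for example by leaving its row as NaN.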


## Write out the new CSV File

In [87]:
# Drop the temporary UTC time column:
f_data_csv = f_data_csv.drop(columns=['TEMPORARY_COLUMN_TIME_UTC'],errors='ignore')

# Create a new file name with the appropriate append (splitext removes only the final extension, so dots elsewhere in the path are preserved):
new_file_name = os.path.splitext(CSV_file_path)[0] + CSV_file_name_append + '.csv'

# Write out the new CSV File:
f_data_csv.to_csv(new_file_name,index=False)
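
The output name is the input name with extra text placed before the extension. As a minimal sketch of one way to build such a name, the example below uses `pathlib` to work from the file's stem and suffix, which leaves any dots in folder names untouched. `output_path` is a hypothetical helper, and the folder `v1.2` is made up for illustration.

```python
from pathlib import PurePosixPath

def output_path(input_path, append_text):
    """Insert append_text between the file stem and its extension (hypothetical helper)."""
    path = PurePosixPath(input_path)
    return str(path.with_name(path.stem + append_text + path.suffix))

# Hypothetical path with a dot in a folder name
new_name = output_path('/content/drive/MyDrive/v1.2/CSV_Test_File.csv', '_with_TEMPO_data')
print(new_name)
```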