### Overview

This notebook builds and downloads a climate dataset from NASA covering Brazil from January 2007 to July 2025. The dataset includes the following variables (columns):

- **T2M**: Air temperature at 2 meters above the surface
- **QV2M**: Specific humidity at 2 meters above the surface
- **PS**: Surface pressure
- **SWGDN**: Incoming shortwave (solar) radiation at the surface
- **PRECTOT**: Total precipitation at the surface

These variables come from three MERRA-2 monthly-mean collections: **M2TMNXSLV**, **M2TMNXRAD**, and **M2TMNXFLX**. All three are distributed by NASA GES DISC (Goddard Earth Sciences Data and Information Services Center).
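
The raw values are easier to read after a unit conversion. The sketch below assumes the units given in the MERRA-2 file specification (T2M in K, PS in Pa, PRECTOT in kg m-2 s-1); it is worth checking the `units` attribute of each variable in the opened files before relying on them. The helper names are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helpers; the input units are an assumption taken from the
# MERRA-2 file specification, not something this notebook verifies.

def kelvin_to_celsius(t2m_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return t2m_k - 273.15

def pa_to_hpa(ps_pa):
    """Convert a pressure from pascal to hectopascal."""
    return ps_pa / 100.0

def precip_flux_to_mm_per_day(prectot):
    """Convert a precipitation flux (kg m-2 s-1) to mm/day.

    1 kg of water spread over 1 m2 is a 1 mm layer, so only the
    time unit changes: 86400 seconds per day.
    """
    return prectot * 86400.0

print(kelvin_to_celsius(300.15))
print(pa_to_hpa(101325.0))
print(precip_flux_to_mm_per_day(1e-4))
```

The same arithmetic applies unchanged to whole pandas columns or xarray variables, since it only uses subtraction, division, and multiplication.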


### Library Imports and Authentication with NASA EarthData

This block of code imports the necessary libraries and performs authentication with NASA EarthData.

In [1]:
import earthaccess
import xarray as xr
import pandas as pd
from pathlib import Path
import psutil
import time

auth = earthaccess.login()


Enter your Earthdata Login username:  grazinha
Enter your Earthdata password:  ········


### Download Configuration

This code block configures the download: the bounding box for Brazil, the variables to keep from each collection, the time range, and the output directory. Restricting the variables and the period keeps the dataset light enough to download and process on a typical machine.

In [None]:
# Bounding box for Brazil: (west, south, east, north) in degrees
bbox = (-74.0, -33.74, -34.79, 5.27)

# Variables
vars_slv = ["T2M", "QV2M", "PS"]   # atmospheric
vars_rad = ["SWGDN"]               # radiation
vars_precip = ["PRECTOT"]          # total precipitation

# Period
start_year, end_year = 2007, 2025

# Output folder
output_dir = Path("clima_brasil_mensal")
output_dir.mkdir(exist_ok=True)

# List for the final consolidation
all_dfs = []
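
The `bbox` tuple is ordered as lower-left longitude, lower-left latitude, upper-right longitude, upper-right latitude, which is the order `earthaccess` expects for `bounding_box`. As a quick sanity check of that ordering, the sketch below tests whether a point falls inside the box. The `in_bbox` helper and the sample coordinates (approximate values for Brasília and Lisbon) are illustrative assumptions, not part of the notebook.

```python
# Same bounding box as in the configuration cell: (west, south, east, north)
bbox = (-74.0, -33.74, -34.79, 5.27)

def in_bbox(lon, lat, box):
    """Return True if the point (lon, lat) lies inside box (edges included)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

# Approximate coordinates, used only for illustration
brasilia = (-47.88, -15.79)
lisbon = (-9.14, 38.72)

print(in_bbox(*brasilia, bbox))
print(in_bbox(*lisbon, bbox))
```

Swapping latitude and longitude is the most common mistake with bounding boxes, and a check like this catches it before any download starts.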


### Download Loop

This code block runs the downloads. It loops over the years from 2007 to 2025, with 2025 limited to January through July, and searches each of the three collections for the current year. We use the monthly-mean version of each collection instead of the daily one for performance reasons.

Years for which any of the three searches returns no results are skipped with a warning. Otherwise the three datasets are opened, reduced to the requested variables, and merged into a single dataset. The result is converted to a DataFrame and saved as one Parquet file per year.

For each year, the loop reports the elapsed time and the resident memory of the Python process.


In [None]:
# Loop over years; the source data is already aggregated monthly
for year in range(start_year, end_year + 1):
    if year == 2025:
        months = range(1, 8)  # through July 2025
    else:
        months = range(1, 13)

    print(f"\n📥 Processing {year}...")

    start_time = time.time()

    # Search SLV data (monthly)
    results_slv = earthaccess.search_data(
        short_name="M2TMNXSLV",
        version="5.12.4",
        temporal=(f"{year}-01-01", f"{year}-12-31"),
        bounding_box=bbox
    )

    # Search RAD data (monthly)
    results_rad = earthaccess.search_data(
        short_name="M2TMNXRAD",
        version="5.12.4",
        temporal=(f"{year}-01-01", f"{year}-12-31"),
        bounding_box=bbox
    )

    # Search precipitation data (monthly)
    results_precip = earthaccess.search_data(
        short_name="M2TMNXFLX",
        version="5.12.4",
        temporal=(f"{year}-01-01", f"{year}-12-31"),
        bounding_box=bbox
    )

    if not results_slv or not results_rad or not results_precip:
        print(f"⚠️ No data found for {year}")
        continue

    # Open the datasets (already monthly)
    ds_slv = xr.open_mfdataset(
        earthaccess.open(results_slv),
        combine="by_coords", chunks={"time": 1}
    )[vars_slv]

    ds_rad = xr.open_mfdataset(
        earthaccess.open(results_rad),
        combine="by_coords", chunks={"time": 1}
    )[vars_rad]

    ds_precip = xr.open_mfdataset(
        earthaccess.open(results_precip),
        combine="by_coords", chunks={"time": 1}
    )[vars_precip]

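    # Merge the three collections into one dataset (adds precipitation)
    ds = xr.merge([ds_slv, ds_rad, ds_precip])

    # bounding_box in search_data only filters granules, and MERRA-2 granules
    # are global, so clip to Brazil here (assumes ascending lat/lon coordinates)
    ds = ds.sel(lon=slice(bbox[0], bbox[2]), lat=slice(bbox[1], bbox[3]))

    # Keep only the requested months (January through July for 2025)
    ds = ds.sel(time=ds["time"].dt.month.isin(list(months)))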

    # Convert to a DataFrame
    df = ds.to_dataframe().reset_index()

    # Save one Parquet file per year
    out_path = output_dir / f"clima_brasil_{year}.parquet"
    df.to_parquet(out_path, index=False)

    all_dfs.append(df)

    elapsed = time.time() - start_time
    mem = psutil.Process().memory_info().rss / (1024 ** 2)
    print(f"✅ {year} saved! Time: {elapsed:.1f}s | Memory used: {mem:.1f} MB")
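
The per-year search windows described for this loop (full calendar years, except 2025, which stops in July) can be sketched with the standard library alone. `year_window` is a hypothetical helper, not part of `earthaccess`; it returns dates in the same `YYYY-MM-DD` form that the loop passes to `temporal`.

```python
import calendar

def year_window(year, last_year=2025, last_month=7):
    """Return (start, end) ISO dates for one year's search window.

    Every year runs from January 1 to December 31, except `last_year`,
    which ends on the last day of `last_month`.
    """
    end_month = last_month if year == last_year else 12
    # monthrange returns (weekday of the 1st, number of days in the month)
    end_day = calendar.monthrange(year, end_month)[1]
    return f"{year}-01-01", f"{year}-{end_month:02d}-{end_day:02d}"

windows = {year: year_window(year) for year in range(2007, 2026)}
print(windows[2007])
print(windows[2025])
```

Computing the window up front keeps the partial final year in one place instead of spreading a special case across the three searches.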

### Consolidating Data

This code block consolidates all the downloaded data from each year into a single dataset. The previously collected DataFrames are concatenated into one final DataFrame using `pd.concat()`. The resulting dataset is then saved in two formats: Parquet and CSV. 

The Parquet file is saved for efficient storage and performance, while the CSV file is saved for easy readability and compatibility with other applications.

At the end, a success message is printed to confirm that the dataset has been successfully consolidated and saved.

In [None]:
# Consolidate everything
df_final = pd.concat(all_dfs, ignore_index=True)

# Name the files after the period actually downloaded
final_name = f"clima_brasil_{start_year}_{end_year}"
df_final.to_parquet(output_dir / f"{final_name}.parquet", index=False)
df_final.to_csv(output_dir / f"{final_name}.csv", index=False)

print("🎉 Consolidated dataset saved successfully!")
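
Because `all_dfs` lives only in memory, a kernel restart loses it even though the per-year Parquet files are still on disk. One possible way to recover, sketched below under that assumption, is to rediscover the yearly files by name. `yearly_files` is a hypothetical helper, and the demo runs against empty placeholder files in a temporary folder rather than real data.

```python
import re
import tempfile
from pathlib import Path

# Matches only the per-year files, e.g. clima_brasil_2007.parquet
YEARLY_NAME = re.compile(r"clima_brasil_(\d{4})\.parquet")

def yearly_files(folder):
    """Map year -> path for every per-year Parquet file in folder, sorted by year."""
    found = {}
    for path in Path(folder).glob("clima_brasil_*.parquet"):
        match = YEARLY_NAME.fullmatch(path.name)
        if match:  # skips the consolidated file, whose name holds two years
            found[int(match.group(1))] = path
    return dict(sorted(found.items()))

# Demo with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("clima_brasil_2008.parquet",
                 "clima_brasil_2007.parquet",
                 "clima_brasil_2000_2025.parquet"):
        (Path(tmp) / name).touch()
    years = list(yearly_files(tmp))

print(years)
```

With real files, the paths returned for `output_dir` could be read with `pd.read_parquet` and passed to `pd.concat` in place of `all_dfs`.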