# Data Collection and Integration of Meteostat Dataset

In this notebook, we extend our data collection process to include historical weather data from Meteostat, accessed through its Python library. This lets us obtain daily observations dating back to 1990, giving our predictive models a longer historical record to learn from.

We will:

- Set up Meteostat and install the necessary libraries.
- Download daily data for the variables: minimum temperature (`tmin`), maximum temperature (`tmax`), precipitation (`prcp`), and snow depth (`snow`).
- Process and save the data in the same format and structure as our existing datasets.
- Integrate the new data with our existing data cleaning pipeline.


## 2. Install the Meteostat Client

The Meteostat client is a Python-based library that allows us to access the dataset programmatically.

**Installation:**

- Run the following command in your terminal or use a code cell to install via `pip`:

In [32]:
%pip install meteostat

Note: you may need to restart the kernel to use updated packages.


In [33]:
import meteostat
print(meteostat.__version__)

1.6.8


## 3. Import Required Libraries

In [34]:
import os
import pandas as pd
import numpy as np
from meteostat import Stations, Daily
from datetime import datetime
import time

### 3(a) Define the Correct Data Paths

In [35]:

# Define the root data directory
data_root = '/workspace/SkiSnow/data'

# Define subdirectories for raw data
raw_data_root = os.path.join(data_root, 'raw', 'cds')

# Create the directories if they don't exist
os.makedirs(raw_data_root, exist_ok=True)

print(f"Raw data will be saved to: {raw_data_root}")

Raw data will be saved to: /workspace/SkiSnow/data/raw/cds


## 4. Define the List of Resorts and Their Coordinates

Specify each ski resort with its corresponding latitude and longitude. This information is crucial for fetching accurate weather data.

In [36]:
# Dictionary of resorts with their coordinates
resorts = {
    'french_alps/chamonix': {
        'latitude': 45.9237,
        'longitude': 6.8694,
        'months_open': ['12', '01', '02', '03', '04'], 
    },
    'french_alps/val_d_isere_tignes': {
        'latitude': 45.4969,
        'longitude': 7.0290,
        'months_open': ['11', '12', '01', '02', '03', '04', '05'],
    },
    'french_alps/les_trois_vallees': {
        'latitude': 45.4281,
        'longitude': 6.6874,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'austrian_alps/st_anton': {
        'latitude': 47.1787,
        'longitude': 10.3143,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'austrian_alps/kitzbuhel': {
        'latitude': 47.4967,
        'longitude': 12.4429,
        'months_open': ['11', '12', '01', '02', '03', '04'],
    },
    'austrian_alps/solden': {
        'latitude': 47.0190,
        'longitude': 11.0606,
        'months_open': ['10', '11', '12', '01', '02', '03', '04', '05'],
    },
    'swiss_alps/zermatt': {
        'latitude': 46.0707,
        'longitude': 7.7991,
        'months_open': ['11', '12', '01', '02', '03', '04', '05'],
    },
    'swiss_alps/st_moritz': {
        'latitude': 46.5407,
        'longitude': 9.8855,
        'months_open': ['11', '12', '01', '02', '03', '04'],
    },
    'swiss_alps/verbier': {
        'latitude': 46.1465,
        'longitude': 7.2769,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'italian_alps/cortina_d_ampezzo': {
        'latitude': 46.5905,
        'longitude': 12.1857,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'italian_alps/val_gardena': {
        'latitude': 46.6219,
        'longitude': 11.7673,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'italian_alps/sestriere': {
        'latitude': 45.0055,
        'longitude': 6.9335,
        'months_open': ['12', '01', '02', '03', '04'],
    },
    'slovenian_alps/kranjska_gora': {
        'latitude': 46.5347,
        'longitude': 13.8336,
        'months_open': ['12', '01', '02', '03'],
    },
    'slovenian_alps/mariborsko_pohorje': {
        'latitude': 46.5652,
        'longitude': 15.6431,
        'months_open': ['12', '01', '02', '03'],
    },
    'slovenian_alps/krvavec': {
        'latitude': 46.3471,
        'longitude': 14.5875,
        'months_open': ['12', '01', '02', '03', '04'],
    },
}
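Since every later step depends on these entries, a quick sanity check can catch typos before any download starts. The following is a minimal sketch, not part of the original pipeline: `validate_resorts` is a hypothetical helper, and the `sample` dictionary simply repeats one entry from above so that the block runs on its own.

```python
def validate_resorts(resorts):
    """Return a list of human-readable problems found in a resorts dictionary."""
    problems = []
    # Months are stored as zero-padded strings, e.g. '01' .. '12'
    valid_months = {f"{m:02d}" for m in range(1, 13)}
    for name, info in resorts.items():
        lat = info.get('latitude')
        lon = info.get('longitude')
        if lat is None or not -90 <= lat <= 90:
            problems.append(f"{name}: latitude {lat!r} out of range")
        if lon is None or not -180 <= lon <= 180:
            problems.append(f"{name}: longitude {lon!r} out of range")
        bad_months = [m for m in info.get('months_open', []) if m not in valid_months]
        if bad_months:
            problems.append(f"{name}: invalid months {bad_months}")
    return problems


# Hypothetical sample: one entry copied from the dictionary above
sample = {
    'french_alps/chamonix': {
        'latitude': 45.9237,
        'longitude': 6.8694,
        'months_open': ['12', '01', '02', '03', '04'],
    },
}
print(validate_resorts(sample))
```

An empty list means no problems were detected. In the notebook, the same call could be made on the full `resorts` dictionary.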


### 5. Function to Find the Nearest Weather Station

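Meteostat serves data per weather station rather than per coordinate. This function returns the ID of the closest station to a resort's coordinates that reports daily data for the requested period, or `None` if no such station is found.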

In [37]:
def get_nearest_station(latitude, longitude, start_year, end_year):
    """
    Find the nearest weather station to the given coordinates with available data.
    """
    # Define the time period
    start = datetime(start_year, 1, 1)
    end = datetime(end_year, 12, 31)
    
    # Search for stations nearby
    stations = Stations()
    stations = stations.nearby(latitude, longitude)
    
    # Filter stations with daily data availability
    stations = stations.inventory('daily', (start, end))
    
    # Fetch the nearest station
    station = stations.fetch(1)
    
    if not station.empty:
        return station.index[0]
    else:
        return None
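The nearest station with a suitable daily inventory is not necessarily close to the resort, so it can be worth checking how far apart the two are. The sketch below uses the standard haversine formula for that check. `haversine_km` is a hypothetical helper, not a Meteostat function, and it assumes a spherical Earth with a mean radius of 6371 km. The example measures the distance between two resorts from the dictionary above; in practice the second point would be the station's own coordinates.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # Assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates taken from the resorts dictionary
chamonix = (45.9237, 6.8694)
val_d_isere_tignes = (45.4969, 7.0290)
print(f"{haversine_km(*chamonix, *val_d_isere_tignes):.1f} km")
```

A large distance, or a large elevation difference, suggests that the station's readings may not represent conditions at the resort well.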


### 6. Function to Download Meteostat Data

This function downloads the required weather variables for a specific resort and time range.

In [38]:
def download_meteostat_data(resort_name, latitude, longitude, start_year, end_year, output_dir):
    """
    Download daily weather data for the specified resort using Meteostat.
    
    Parameters:
    - resort_name (str): Name of the resort.
    - latitude (float): Latitude of the resort.
    - longitude (float): Longitude of the resort.
    - start_year (int): Starting year for data retrieval.
    - end_year (int): Ending year for data retrieval.
    - output_dir (str): Directory to save the downloaded CSV file.
    """
    # Find the nearest station
    station_id = get_nearest_station(latitude, longitude, start_year, end_year)
    
    if not station_id:
        print(f"No station found for {resort_name} at ({latitude}, {longitude}).")
        return
    
    print(f"Nearest station for {resort_name}: {station_id}")
    
    # Define the time period
    start = datetime(start_year, 1, 1)
    end = datetime(end_year, 12, 31)
    
    # Fetch daily data
    data = Daily(station_id, start, end)
    data = data.fetch()
    
    if data.empty:
        print(f"No data available for {resort_name} from {start_year} to {end_year}.")
        return
    
    # Select required variables and rename them (Meteostat's 'snow' column is snow depth in mm)
    data = data[['tmin', 'tmax', 'prcp', 'snow']]
    data = data.rename(columns={
        'tmin': 'temperature_min',
        'tmax': 'temperature_max',
        'prcp': 'precipitation_sum',
        'snow': 'snow_depth'
    })
    
    # Reset index to have 'time' as a column
    data = data.reset_index()
    
    # Define the output file path
    os.makedirs(output_dir, exist_ok=True)
    file_name = f"{resort_name.replace('/', '_')}_1990_2023.csv"  # Years are hard-coded; keep in sync with start_year/end_year
    file_path = os.path.join(output_dir, file_name)
    
    # Save to CSV if not already present
    if not os.path.exists(file_path):
        data.to_csv(file_path, index=False)
        print(f"Data for {resort_name} saved to {file_path}.")
    else:
        print(f"Data for {resort_name} from 1990 to 2023 already exists at {file_path}.")
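The file name above hard-codes the 1990 to 2023 range, so calling the function with different years would still write to a file labelled `1990_2023`. One way to avoid this is to derive the name from the arguments. The sketch below shows the idea: `raw_file_name` is a hypothetical helper, not part of the original notebook, and any function that reads the file back would need to build the name the same way.

```python
def raw_file_name(resort_name, start_year, end_year):
    """Build the raw CSV file name for a resort and year range."""
    # Resort keys contain '/', which is not safe inside a file name
    safe_name = resort_name.replace('/', '_')
    return f"{safe_name}_{start_year}_{end_year}.csv"


print(raw_file_name('french_alps/chamonix', 1990, 2023))
```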


### 7. Function to Process Downloaded Data

Although the download step already saves the Meteostat data as CSV, this function enforces consistent column names and types and prepares the data for integration.

In [39]:
def process_meteostat_data(file_path):
    """
    Process the downloaded Meteostat CSV data.
    
    Parameters:
    - file_path (str): Path to the downloaded CSV file.
    
    Returns:
    - pd.DataFrame: Processed data.
    """
    # Read the CSV file
    df = pd.read_csv(file_path)
    
    # Convert 'time' to datetime
    df['time'] = pd.to_datetime(df['time'])
    
    # Keep only necessary columns
    columns_to_keep = ['time', 'temperature_min', 'temperature_max', 'precipitation_sum', 'snow_depth']
    df = df[columns_to_keep]
    
    # Rename 'time' to 'date'
    df = df.rename(columns={'time': 'date'})
    
    # Replace any -9999 sentinel values with NaN
    # (defensive: Meteostat normally reports gaps as NaN already)
    value_columns = ['temperature_min', 'temperature_max', 'precipitation_sum', 'snow_depth']
    df[value_columns] = df[value_columns].replace(-9999, np.nan)
    
    return df
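Station records often have gaps, so it helps to quantify them before integrating the data. The sketch below reports the share of missing values per column and any calendar days absent from the record. `summarize_gaps` is a hypothetical helper, and the `toy` frame is made-up data that only mimics the shape of the processed output.

```python
import numpy as np
import pandas as pd


def summarize_gaps(df, date_column='date'):
    """Return (share of NaN per value column, calendar days missing from the record)."""
    values = df.drop(columns=[date_column])
    missing_share = values.isna().mean()
    # Every day between the first and last observation should be present
    expected_days = pd.date_range(df[date_column].min(), df[date_column].max(), freq='D')
    absent_days = expected_days.difference(pd.DatetimeIndex(df[date_column]))
    return missing_share, absent_days


# Made-up example data, for illustration only
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']),
    'temperature_min': [-5.0, np.nan, -3.0],
    'snow_depth': [np.nan, np.nan, 120.0],
})
missing_share, absent_days = summarize_gaps(toy)
print(missing_share)
print(absent_days)
```

In the notebook, the same helper could be applied to the frame returned by `process_meteostat_data`.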



### 8. Compile and Save the Data

After processing, each resort's cleaned data is saved as a single compiled CSV file.

In [40]:
def compile_meteostat_data(resort_name, raw_data_dir, compiled_csv_path):
    """
    Compile the downloaded Meteostat CSV data for the resort.
    
    Parameters:
    - resort_name (str): Name of the resort.
    - raw_data_dir (str): Directory containing the downloaded CSV file.
    - compiled_csv_path (str): Path to save the compiled CSV file.
    """
    # Define the raw data file path
    file_name = f"{resort_name.replace('/', '_')}_1990_2023.csv"  # Years are hard-coded; must match the name used in download_meteostat_data
    file_path = os.path.join(raw_data_dir, file_name)
    
    if os.path.exists(file_path):
        print(f"Processing data for {resort_name}...")
        df = process_meteostat_data(file_path)
        
        # Save the processed data (overwrite the existing file for a single copy)
        df.to_csv(compiled_csv_path, index=False)
        print(f"Compiled data saved to {compiled_csv_path}.")
    else:
        print(f"File {file_name} not found for {resort_name}.")



### 9. Execute the Data Retrieval Workflow

With all functions defined, we can now orchestrate the data retrieval and processing for all resorts over the desired time frame.

In [45]:
def run_data_retrieval(resorts, start_year, end_year, raw_data_root):
    """
    Orchestrate the data retrieval and processing for all resorts.
    
    Parameters:
    - resorts (dict): Dictionary of resorts with coordinates and open months.
    - start_year (int): Starting year for data retrieval.
    - end_year (int): Ending year for data retrieval.
    - raw_data_root (str): Root directory for raw data.
    """
    for resort_name, resort_info in resorts.items():
        latitude = resort_info['latitude']
        longitude = resort_info['longitude']
        months_open = resort_info['months_open']  # Currently unused in this function
        
        # Define resort-specific raw data directory
        resort_raw_dir = os.path.join(raw_data_root, resort_name.replace('/', '_'))
        
        # Define compiled CSV path within raw_data_root
        compiled_csv_path = os.path.join(raw_data_root, f"{resort_name.replace('/', '_')}_meteostat.csv")
        
        # Download data
        download_meteostat_data(resort_name, latitude, longitude, start_year, end_year, resort_raw_dir)
        
        # Optional: Pause to respect any rate limits
        time.sleep(1)
        
        # Compile data
        compile_meteostat_data(resort_name, resort_raw_dir, compiled_csv_path)
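The workflow reads `months_open` for each resort but does not use it. If the compiled data should be restricted to the ski season, one option is to filter rows by month. The sketch below is one possible approach under that assumption. `filter_open_months` is a hypothetical helper and the `toy` frame is illustrative only.

```python
import pandas as pd


def filter_open_months(df, months_open, date_column='date'):
    """Keep only rows whose date falls in one of the resort's open months."""
    # months_open holds zero-padded strings such as '01'; compare as integers
    open_months = sorted({int(m) for m in months_open})
    return df[df[date_column].dt.month.isin(open_months)]


# Illustrative frame: the first day of each month from October to May
toy = pd.DataFrame({'date': pd.date_range('2022-10-01', '2023-05-01', freq='MS')})
season = filter_open_months(toy, ['12', '01', '02', '03', '04'])
print(season['date'].dt.strftime('%Y-%m').tolist())
```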



In [46]:
if __name__ == "__main__":
    # Define temporal range
    start_year = 1990
    end_year = 2023  # Adjust as needed
    
    # Define directories
    raw_data_root = '/workspace/SkiSnow/data/raw/cds'
    
    # Run the data retrieval process
    run_data_retrieval(resorts, start_year, end_year, raw_data_root)


Nearest station for french_alps/chamonix: 06717
Data for french_alps/chamonix from 1990 to 2023 already exists at /workspace/SkiSnow/data/raw/cds/french_alps_chamonix/french_alps_chamonix_1990_2023.csv.
Processing data for french_alps/chamonix...
Compiled data saved to /workspace/SkiSnow/data/raw/cds/french_alps_chamonix_meteostat.csv.
Nearest station for french_alps/val_d_isere_tignes: 06717
Data for french_alps/val_d_isere_tignes saved to /workspace/SkiSnow/data/raw/cds/french_alps_val_d_isere_tignes/french_alps_val_d_isere_tignes_1990_2023.csv.
Processing data for french_alps/val_d_isere_tignes...
Compiled data saved to /workspace/SkiSnow/data/raw/cds/french_alps_val_d_isere_tignes_meteostat.csv.
Nearest station for french_alps/les_trois_vallees: 06717
Data for french_alps/les_trois_vallees saved to /workspace/SkiSnow/data/raw/cds/french_alps_les_trois_vallees/french_alps_les_trois_vallees_1990_2023.csv.
Processing data for french_alps/les_trois_vallees...
Compiled data saved to /wo

# Setting Up NSIDC API Access for Further Data Collection

To access NSIDC data programmatically, we need to set up API authentication with NASA Earthdata Login credentials.

**Steps:**

1. **Create an Earthdata Account:**
   - Visit [Earthdata Login](https://urs.earthdata.nasa.gov/users/new) and register for an account.
   - Confirm your email address.

2. **Obtain an API Token:**
   - After logging in, navigate to your profile settings.
   - Generate a new API token under the **API Tokens** section.
   - Store this token securely as it will be required for API requests.

3. **Install Required Libraries:**
   - We'll use the `requests` library to interact with the NSIDC API.


In [None]:
%pip install requests

### 3.1. Define NSIDC Data Retrieval Functions

We'll create functions to:

1. **Authenticate** with the NSIDC API using your Earthdata token.
2. **Fetch** ice extent, snow cover, and snow melt data for each resort.
3. **Save** the retrieved data into the existing RAW resort folders.
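As a starting point for the authentication step, the sketch below builds request headers from a token stored outside the notebook. It rests on assumptions: the environment variable name `EARTHDATA_TOKEN` is hypothetical, both helper functions are illustrative, and it assumes the target endpoint accepts the token as a standard `Authorization: Bearer` header.

```python
import os


def load_token(env=os.environ, variable="EARTHDATA_TOKEN"):
    """Read the token from an environment mapping (variable name is an assumption)."""
    return env.get(variable, "")


def build_auth_headers(token):
    """Return HTTP headers carrying the token as a Bearer credential."""
    if not token or not token.strip():
        raise ValueError("Earthdata token is empty")
    return {"Authorization": f"Bearer {token.strip()}"}


# Demonstration with a placeholder token; never hard-code a real one
headers = build_auth_headers(load_token({"EARTHDATA_TOKEN": "example-token"}))
print(sorted(headers))
```

The resulting dictionary could then be passed as the `headers` argument of a `requests.get` call. Keeping the token in the environment avoids committing it to the notebook.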