# Data Cleaning and Preparation

In this notebook, we will perform data cleaning and preparation for all resorts across the Alps. This includes:

- Loading the raw data
- Handling missing values
- Correcting data types
- Normalizing resort names to handle special characters
- Filtering data based on resort operating dates
- Saving the cleaned data for further analysis

### 1. Import Libraries

In [4]:
import pandas as pd
import os
import sys
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### 2. Update pathing and Import custom modules

In [5]:
# Determine the project root directory
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Add the project root to sys.path
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print("Updated sys.path:")
for path in sys.path:
    print(path)

from src.data.cleaning import (
    get_all_csv_files_with_metadata,
    clean_and_filter_data,
    save_cleaned_data
)
from src.features.feature_engineering import (
    categorize_season,
    add_operating_season_indicator
)
from src.features.anomaly_detection import (
    detect_snow_depth_anomalies,
    handle_snow_depth_anomalies
)

Updated sys.path:
/workspace/SkiSnow
/home/gitpod/.pyenv/versions/3.12.6/lib/python312.zip
/home/gitpod/.pyenv/versions/3.12.6/lib/python3.12
/home/gitpod/.pyenv/versions/3.12.6/lib/python3.12/lib-dynload

/workspace/SkiSnow/venv/lib/python3.12/site-packages


### 3. Define Resort Operating Seasons

In [6]:
resort_seasons = {
    'austrian_alps/st_anton': {'open': '12-01', 'close': '04-30'},
    'austrian_alps/kitzbuhel': {'open': '10-15', 'close': '05-01'},
    'austrian_alps/solden': {'open': '11-01', 'close': '05-01'},
    'swiss_alps/st_moritz': {'open': '11-25', 'close': '05-01'},
    'swiss_alps/verbier': {'open': '12-01', 'close': '04-30'},
    'italian_alps/cortina_d_ampezzo': {'open': '11-25', 'close': '04-05'},
    'italian_alps/val_gardena': {'open': '12-01', 'close': '04-15'},
    'italian_alps/sestriere': {'open': '12-01', 'close': '04-15'},
    'slovenian_alps/kranjska_gora': {'open': '12-15', 'close': '04-15'},
    'slovenian_alps/mariborsko_pohorje': {'open': '12-01', 'close': '04-05'},
    'slovenian_alps/krvavec': {'open': '12-01', 'close': '04-30'},
}

### 4. Load and Clean Data

In [7]:
# Define the root directory
raw_data_root = '../data/raw/cds'
processed_data_root = '../data/processed/cds'

# Get list of all CSV files with dataset type
csv_files = get_all_csv_files_with_metadata(raw_data_root)
print(f"Found {len(csv_files)} CSV files after excluding specified resorts.")

data_frames = {}
for file_info in csv_files:
    if file_info['type'] == 'new':  # Only process 'new' datasets
        key, df = clean_and_filter_data(file_info)
        if key and df is not None:
            data_frames[key] = df
            print(f"Loaded and cleaned data for {key}: {df.shape[0]} rows.")
    else:
        print(f"Excluded 'old' dataset: {file_info['file_path']}")

Excluding resort due to insufficient data: swiss_alps/verbier
Found 10 CSV files after excluding specified resorts.
austrian_alps/st_anton: 'snow_depth' is assumed to be in centimeters. No conversion applied.
Loaded and cleaned data for austrian_alps/st_anton: 12418 rows.
austrian_alps/kitzbuhel: 'snow_depth' is assumed to be in centimeters. No conversion applied.
Loaded and cleaned data for austrian_alps/kitzbuhel: 11184 rows.
austrian_alps/solden: 'snow_depth' is assumed to be in centimeters. No conversion applied.
Loaded and cleaned data for austrian_alps/solden: 12418 rows.
italian_alps/sestriere: 'snow_depth' is assumed to be in centimeters. No conversion applied.
Loaded and cleaned data for italian_alps/sestriere: 12038 rows.
italian_alps/val_gardena: 'snow_depth' is assumed to be in centimeters. No conversion applied.
Loaded and cleaned data for italian_alps/val_gardena: 12015 rows.
italian_alps/cortina_d_ampezzo: 'snow_depth' is assumed to be in centimeters. No conversion appli

### 4. Feature Engineering: Season Categorisation

In [8]:
for key, df in data_frames.items():
    resort = key
    if resort in resort_seasons:
        season_info = resort_seasons[resort]
        
        # Categorize seasons
        df = categorize_season(df, season_info, resort)
        
        # Add operating season indicator
        df = add_operating_season_indicator(df)
        
        # Update the DataFrame in the dictionary
        data_frames[key] = df
        print(f"Season categorized and operating season indicator added for {resort}.")
    else:
        print(f"No season information for {resort}. Data not categorized.")

Season categorized and operating season indicator added for austrian_alps/st_anton.
Season categorized and operating season indicator added for austrian_alps/kitzbuhel.
Season categorized and operating season indicator added for austrian_alps/solden.
Season categorized and operating season indicator added for italian_alps/sestriere.
Season categorized and operating season indicator added for italian_alps/val_gardena.
Season categorized and operating season indicator added for italian_alps/cortina_d_ampezzo.
Season categorized and operating season indicator added for slovenian_alps/kranjska_gora.
Season categorized and operating season indicator added for slovenian_alps/krvavec.
Season categorized and operating season indicator added for slovenian_alps/mariborsko_pohorje.
Season categorized and operating season indicator added for swiss_alps/st_moritz.


### 5. Handle Missing Values and Anomalies

In [9]:
for key, df in data_frames.items():
    # Impute missing 'snow_depth' during operating season
    if 'snow_depth' in df.columns:
        # Example imputation logic can be modularized further if needed
        df['snow_depth'].fillna(method='ffill', inplace=True)
        print(f"{key}: Imputed missing 'snow_depth' values.")
    
    # Detect and handle anomalies
    df = detect_snow_depth_anomalies(df, threshold=20)
    df = handle_snow_depth_anomalies(df)
    
    # Update the DataFrame in the dictionary
    data_frames[key] = df
    print(f"Anomaly detection and handling completed for {key}.")

austrian_alps/st_anton: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for austrian_alps/st_anton.
austrian_alps/kitzbuhel: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for austrian_alps/kitzbuhel.
austrian_alps/solden: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for austrian_alps/solden.
italian_alps/sestriere: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for italian_alps/sestriere.
italian_alps/val_gardena: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for italian_alps/val_gardena.
italian_alps/cortina_d_ampezzo: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for italian_alps/cortina_d_ampezzo.
slovenian_alps/kranjska_gora: Imputed missing 'snow_depth' values.
Anomaly detection and handling completed for slovenian_alps/kranjska_gora.
slovenian_alps/krvavec: Imputed missing 'snow_depth' values.

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['snow_depth'].fillna(method='ffill', inplace=True)
  df['snow_depth'].fillna(method='ffill', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['snow_depth'].fillna(method='ffill', inplace=True)
  df['snow_depth'].fillna(method='ffill', inplace=True)
The behavior

### 6. Rounding and Unit Conversion

In [10]:
columns_to_round = {
    'snow_depth': 1,
    'precipitation_sum': 1,
    'temperature_min': 1,
    'temperature_max': 1,
}

for key, df in data_frames.items():
    # Round numerical columns
    for column, decimals in columns_to_round.items():
        if column in df.columns:
            df[column] = df[column].round(decimals)
            print(f"{key}: Rounded '{column}' to {decimals} decimal places.")
    
    data_frames[key] = df
    print(f"Processed numerical columns for {key}.")

austrian_alps/st_anton: Rounded 'snow_depth' to 1 decimal places.
austrian_alps/st_anton: Rounded 'precipitation_sum' to 1 decimal places.
austrian_alps/st_anton: Rounded 'temperature_min' to 1 decimal places.
austrian_alps/st_anton: Rounded 'temperature_max' to 1 decimal places.
Processed numerical columns for austrian_alps/st_anton.
austrian_alps/kitzbuhel: Rounded 'snow_depth' to 1 decimal places.
austrian_alps/kitzbuhel: Rounded 'precipitation_sum' to 1 decimal places.
austrian_alps/kitzbuhel: Rounded 'temperature_min' to 1 decimal places.
austrian_alps/kitzbuhel: Rounded 'temperature_max' to 1 decimal places.
Processed numerical columns for austrian_alps/kitzbuhel.
austrian_alps/solden: Rounded 'snow_depth' to 1 decimal places.
austrian_alps/solden: Rounded 'precipitation_sum' to 1 decimal places.
austrian_alps/solden: Rounded 'temperature_min' to 1 decimal places.
austrian_alps/solden: Rounded 'temperature_max' to 1 decimal places.
Processed numerical columns for austrian_alps/so

### 7. Save Cleaned Data

In [11]:
save_cleaned_data(data_frames, processed_data_root)

Saved cleaned data to ../data/processed/cds/austrian_alps/st_anton/st_anton_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/austrian_alps/kitzbuhel/kitzbuhel_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/austrian_alps/solden/solden_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/sestriere/sestriere_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/val_gardena/val_gardena_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/cortina_d_ampezzo/cortina_d_ampezzo_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/slovenian_alps/kranjska_gora/kranjska_gora_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/slovenian_alps/krvavec/krvavec_cleaned_2024-10-28_13-30-56.csv.
Saved cleaned data to ../data/processed/cds/slovenian_alps/mariborsko_pohorje/mariborsko_pohorje_