# Feature Engineering for Snow Depth Prediction

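In this section, we engineer features for snow depth prediction. The cleaned files for each resort are combined into a single DataFrame, and resort-dependent features such as lags are computed within each resort. Our goal is to prepare the data for modeling by creating new features and handling missing values appropriately.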

### Steps:
- **Load Data**: Each cleaned resort-specific CSV file is loaded from the processed directory, tagged with a `resort` column, and combined into one DataFrame.
- **Lag Features**: We create lagged features for `snow_depth` (one day and seven days prior) to capture temporal dependencies.
- **Temperature Features**: We calculate the average temperature (`temperature_avg`) and its square (`temperature_avg_squared`) to account for potential non-linear relationships with snow depth.
- **Missing Values Handling**: Rows with any missing data in the selected features and target variable are dropped to ensure data quality for modeling.
- **Saving Processed Data**: The combined, feature-engineered data for all resorts is saved as a single CSV file for streamlined access in the modeling stage.

Because lag features are grouped by the `resort` column, values never carry over from one resort to another, and the combined dataset can still be split or modeled per resort in subsequent analysis.

### 1. Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os
import glob

In [2]:
# Resolve the relative path against the current working directory to see where it points
processed_data_root = os.path.abspath(os.path.join('data', 'processed', 'cds'))

print("Processed data root absolute path:", processed_data_root)
print("Does the processed data root exist?", os.path.exists(processed_data_root))

Processed data root absolute path: /workspace/SkiSnow/notebooks/data/processed/cds
Does the processed data root exist? False
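
The check above prints `False` because a relative path is resolved against the current working directory, here the `notebooks` folder, while the data sits one level up. A minimal sketch of a more defensive lookup is shown below. The helper `find_data_root` and the list `candidate_roots` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from pathlib import Path


def find_data_root(candidates):
    """Return the first candidate directory that exists, as an absolute path."""
    for candidate in candidates:
        path = Path(candidate).resolve()
        if path.is_dir():
            return path
    raise FileNotFoundError(f"None of these directories exist: {list(candidates)}")


# Assumed layouts: the notebook may be run from the project root or from notebooks/.
candidate_roots = [
    Path("data") / "processed" / "cds",
    Path("..") / "data" / "processed" / "cds",
]
```

Calling `find_data_root(candidate_roots)` would then return whichever location exists, or fail loudly instead of silently matching zero files later on.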


### 2. Loading & combining required data

In [3]:
# The data directory sits one level above the notebooks folder
processed_data_root = '../data/processed/cds'

# Initialize a list to store DataFrames
combined_data = []

# Use glob to find all cleaned CSV files in the processed data directory (sorted for a reproducible order)
csv_files = sorted(glob.glob(os.path.join(processed_data_root, '**', '*_cleaned_*.csv'), recursive=True))

print(f"Found {len(csv_files)} files:")
for file_path in csv_files:
    print(file_path)

for file_path in csv_files:
    # Load each resort-specific DataFrame
    df = pd.read_csv(file_path)

    # Extract resort information from the file path
    resort_name = os.path.basename(os.path.dirname(file_path))
    df['resort'] = resort_name

    # Append to the list
    combined_data.append(df)

# Concatenating all DataFrames into one
model_data = pd.concat(combined_data, ignore_index=True)

# Displaying the shape to confirm combining
print(f"Combined data shape: {model_data.shape}")

# Ensuring 'date' column is in datetime format
model_data['date'] = pd.to_datetime(model_data['date'])

Found 10 files:
../data/processed/cds/austrian_alps/kitzbuhel/kitzbuhel_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/austrian_alps/solden/solden_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/austrian_alps/st_anton/st_anton_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/italian_alps/cortina_d_ampezzo/cortina_d_ampezzo_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/italian_alps/sestriere/sestriere_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/italian_alps/val_gardena/val_gardena_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/slovenian_alps/kranjska_gora/kranjska_gora_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/slovenian_alps/krvavec/krvavec_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/slovenian_alps/mariborsko_pohorje/mariborsko_pohorje_cleaned_2024-10-28_13-30-56.csv
../data/processed/cds/swiss_alps/st_moritz/st_moritz_cleaned_2024-10-28_13-30-56.csv
Combined data shape: (121351, 8)
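
The resort label comes from the folder that contains each file. As a small worked example, the snippet below applies the same `os.path` calls to one path from the listing above. The `region_name` variable is a hypothetical extra, shown only to illustrate that the parent folder could be extracted the same way; the notebook itself does not create it.

```python
import os

# Example path copied from the file listing above.
file_path = "../data/processed/cds/austrian_alps/kitzbuhel/kitzbuhel_cleaned_2024-10-28_13-30-56.csv"

# The folder that holds the file is the resort.
resort_dir = os.path.dirname(file_path)
resort_name = os.path.basename(resort_dir)

# Hypothetical extra: the parent of the resort folder is the region.
region_name = os.path.basename(os.path.dirname(resort_dir))

print(resort_name, region_name)
```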


### 3. Creating Lag Features

In [16]:
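# shift() is row-based, so sort by resort and date before creating the lags
model_data = model_data.sort_values(['resort', 'date']).reset_index(drop=True)
model_data['snow_depth_lag1'] = model_data.groupby('resort')['snow_depth'].shift(1)
model_data['snow_depth_lag7'] = model_data.groupby('resort')['snow_depth'].shift(7)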
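
`shift` works on row positions, so these lags only mean "one day prior" and "seven days prior" when rows are in date order within each resort and no dates are missing. The sketch below illustrates this on made-up toy data. The `days_since_prev` column is a hypothetical diagnostic, not part of the original notebook.

```python
import pandas as pd

# Toy data (made up for illustration): two resorts, rows deliberately out of order,
# and resort "b" has no row for 2024-01-02.
toy = pd.DataFrame({
    "resort": ["a", "b", "a", "b", "a"],
    "date": pd.to_datetime(
        ["2024-01-02", "2024-01-01", "2024-01-01", "2024-01-03", "2024-01-03"]
    ),
    "snow_depth": [12.0, 30.0, 10.0, 33.0, 15.0],
})

# Sort so that "previous row" means "previous observation" within each resort.
toy = toy.sort_values(["resort", "date"]).reset_index(drop=True)

# Row-based lag, computed separately for each resort.
toy["snow_depth_lag1"] = toy.groupby("resort")["snow_depth"].shift(1)

# Days elapsed since the previous row of the same resort. Any value other than 1
# means the row-based lag is not a true one-day lag for that row.
previous_date = toy.groupby("resort")["date"].shift(1)
toy["days_since_prev"] = (toy["date"] - previous_date).dt.days

print(toy)
```

The first row of each resort has no lag value, and the last row of resort "b" shows a gap of two days, so its `snow_depth_lag1` actually comes from two days earlier.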

### 4. Calculating Temperature Features

In [17]:
# Calculating the average temperature
model_data['temperature_avg'] = (model_data['temperature_min'] + model_data['temperature_max']) / 2

# Calculating the squared average temperature to capture non-linear effects
model_data['temperature_avg_squared'] = model_data['temperature_avg'] ** 2
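
A short worked example of the two temperature features, using made-up values. It assumes temperatures in degrees Celsius, which the notebook does not state explicitly.

```python
import pandas as pd

# Made-up temperatures (assumed to be in degrees Celsius), for illustration only.
toy = pd.DataFrame({
    "temperature_min": [-8.0, 2.0, -1.0],
    "temperature_max": [-2.0, 8.0, 1.0],
})

# Midpoint of the daily minimum and maximum.
toy["temperature_avg"] = (toy["temperature_min"] + toy["temperature_max"]) / 2

# Squared midpoint, which lets a linear model fit a curved relationship.
toy["temperature_avg_squared"] = toy["temperature_avg"] ** 2

print(toy)
```

The averages -5 and 5 produce the same squared value, so the squared term only adds curvature. A model still needs `temperature_avg` itself to tell cold days from warm ones.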

### 5. Handling Missing Values

In [18]:
# Defining the features
features = ['temperature_avg', 'temperature_avg_squared', 'precipitation_sum', 'snow_depth_lag1', 'snow_depth_lag7']

# Dropping rows with any missing values in the selected features or target variable
model_data = model_data.dropna(subset=features + ['snow_depth'])

# Resetting the index after dropping rows
model_data = model_data.reset_index(drop=True)

# Displaying the shape after dropping missing values
print(f"Data shape after dropping missing values: {model_data.shape}")

Data shape after dropping missing values: (57862, 12)
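
The drop removes more than half of the rows (from 121351 to 57862), so it can be worth checking which columns account for the loss before discarding anything. The sketch below shows one way to do that on made-up toy data. `missing_report` is a hypothetical helper and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(df, columns):
    """Count missing values per column and the rows a dropna on those columns would remove."""
    per_column = df[columns].isna().sum()
    rows_dropped = int(df[columns].isna().any(axis=1).sum())
    return per_column, rows_dropped


# Toy data (made up) standing in for model_data.
toy = pd.DataFrame({
    "snow_depth": [10.0, np.nan, 12.0, 15.0],
    "precipitation_sum": [0.0, 1.2, np.nan, 0.4],
    "snow_depth_lag1": [np.nan, 10.0, np.nan, 12.0],
})

per_column, rows_dropped = missing_report(
    toy, ["snow_depth", "precipitation_sum", "snow_depth_lag1"]
)
print(per_column)
print(f"Rows that dropna would remove: {rows_dropped}")
```

On the real data, the same call with `features + ['snow_depth']` as the column list would show whether the lost rows come mainly from the target, the weather variables, or the lags.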


### 6. Saving Processed Data

In [20]:
output_path = os.path.join('..', 'data', 'processed', 'processed_data_for_modeling.csv')

# Save the combined processed data
model_data.to_csv(output_path, index=False)
print(f"Saved combined processed data to {output_path}")

Saved combined processed data to ../data/processed/processed_data_for_modeling.csv
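
CSV files store dates as plain text, so the `date` column has to be parsed again when the file is loaded in the modeling stage. The sketch below demonstrates the round trip on a made-up toy frame written to a temporary directory. The toy values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame (made up) with the same kind of columns as model_data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "resort": ["kitzbuhel", "kitzbuhel"],
    "snow_depth": [10.0, 12.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "processed_data_for_modeling.csv")
    toy.to_csv(csv_path, index=False)

    # Ask pandas to convert the text dates back to datetimes on load.
    reloaded = pd.read_csv(csv_path, parse_dates=["date"])

print(reloaded.dtypes)
```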
