# Modeling for Snow Depth Prediction

This notebook prepares the data for modeling: it encodes the categorical `resort` feature, checks for multicollinearity, and creates training, validation, and test sets. The lag features and derived variables are already present in the processed data. We will proceed with the following steps:
- Loading the Processed Data
- Encoding the "Resort" Feature
- Preparing Features and Target Variable
- Handling Multicollinearity
- Splitting the Data into Training, Validation, and Testing Sets

### 1. Loading Libraries

In [2]:
import pandas as pd
import numpy as np
import os
from statsmodels.stats.outliers_influence import variance_inflation_factor

### 2. Load Processed Data

In [3]:
# Adjust the path based on your current working directory
data_path = os.path.join('..', 'data', 'processed', 'processed_data_for_modeling.csv')

# Load the combined processed data
model_data = pd.read_csv(data_path)

# Ensure 'date' column is in datetime format
model_data['date'] = pd.to_datetime(model_data['date'])

# Display the shape to confirm loading
print(f"Loaded combined data with shape {model_data.shape}")

Loaded combined data with shape (57862, 12)


### 3. Encoding the "Resort" Feature

The data now includes the `resort` feature, which we need to handle appropriately for modeling. Since `resort` is a categorical variable, we encode it numerically using one-hot encoding. To avoid perfect multicollinearity among the dummy variables (the "dummy variable trap"), we drop one of them.

In [4]:
# One-Hot Encode the 'resort' feature
model_data = pd.get_dummies(model_data, columns=['resort'], drop_first=True)

# Display the columns to confirm encoding
print(f"Columns after encoding: {model_data.columns.tolist()}")

Columns after encoding: ['date', 'temperature_min', 'temperature_max', 'precipitation_sum', 'snow_depth', 'season_id', 'is_operating_season', 'snow_depth_lag1', 'snow_depth_lag7', 'temperature_avg', 'temperature_avg_squared', 'resort_cortina_d_ampezzo', 'resort_kitzbuhel', 'resort_kranjska_gora', 'resort_krvavec', 'resort_les_trois_vallees', 'resort_mariborsko_pohorje', 'resort_sestriere', 'resort_solden', 'resort_st_anton', 'resort_st_moritz', 'resort_val_d_isere_tignes', 'resort_val_gardena', 'resort_verbier']
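
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three resort names are taken from the columns above, but the snow depth values are made up for illustration. The first category in sorted order is dropped and becomes the baseline, which is then represented by zeros in all remaining dummy columns.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "resort": ["kitzbuhel", "solden", "verbier", "kitzbuhel"],
    "snow_depth": [10.0, 25.0, 40.0, 12.0],
})

# drop_first=True removes the dummy for the first category (sorted order)
encoded = pd.get_dummies(toy, columns=["resort"], drop_first=True)
print(encoded.columns.tolist())

# Rows of the dropped (baseline) category have no active dummy column
active_dummies = encoded[["resort_solden", "resort_verbier"]].sum(axis=1).tolist()
print(active_dummies)
```

In the real data, the resort missing from the encoded column list plays this baseline role.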


### 4. Preparing Features and Target Variable

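The features of interest are:
- temperature_avg
- temperature_avg_squared
- precipitation_sum
- snow_depth_lag1
- snow_depth_lag7
- encoded resort

The target variable is `snow_depth`.

Note that the code below keeps every column except `date` and `snow_depth`, so `temperature_min`, `temperature_max`, `season_id`, and `is_operating_season` are also part of `X` at this stage.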

In [5]:
# Define the target variable
y = model_data['snow_depth']

# Exclude 'date' and 'snow_depth' from features
feature_columns = [col for col in model_data.columns if col not in ['date', 'snow_depth']]

# Create the features DataFrame
X = model_data[feature_columns]

### 5. Handling Multicollinearity

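Before splitting the data, we check for multicollinearity among the features, since strongly correlated predictors make coefficient estimates unstable and hard to interpret.

To do so, we calculate the Variance Inflation Factor (VIF) for each feature, which measures how well that feature is explained by all the other features.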
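
For reference, the VIF of feature $j$ is derived from the coefficient of determination $R_j^2$ obtained by regressing that feature on all the other features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A value of 1 means the feature is uncorrelated with the others. As $R_j^2$ approaches 1 the VIF grows without bound, so a feature that is an exact linear combination of other features gets an infinite VIF. Values above roughly 5 to 10 are commonly treated as a warning sign, although that threshold is a rule of thumb rather than a hard limit.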

In [13]:
# Calculate VIF for each feature
X_vif = X.copy()
# Add a constant column so each auxiliary regression includes an intercept
X_vif['Intercept'] = 1

# Exclude 'season_id' from the VIF calculation
X_vif = X_vif.drop(columns=['season_id'])

bool_columns = X_vif.select_dtypes(include=['bool']).columns.tolist()

# Convert bool columns to int
X_vif[bool_columns] = X_vif[bool_columns].astype(int)

vif_data = pd.DataFrame()
vif_data['feature'] = X_vif.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]

# Drop the intercept term from VIF data
vif_data = vif_data[vif_data['feature'] != 'Intercept']

print("Variance Inflation Factors:")
print(vif_data.sort_values('VIF', ascending=False))

  vif = 1. / (1. - r_squared_i)


Variance Inflation Factors:
                      feature       VIF
0             temperature_min       inf
1             temperature_max       inf
6             temperature_avg       inf
20             resort_verbier  7.483137
16            resort_st_anton  6.477424
9            resort_kitzbuhel  5.467867
13  resort_mariborsko_pohorje  3.483182
4             snow_depth_lag1  3.445501
5             snow_depth_lag7  3.087779
10       resort_kranjska_gora  3.034219
7     temperature_avg_squared  2.887472
15              resort_solden  2.832157
3         is_operating_season  2.698546
11             resort_krvavec  2.362995
14           resort_sestriere  2.220728
17           resort_st_moritz  2.095204
18  resort_val_d_isere_tignes  1.925096
12   resort_les_trois_vallees  1.701801
19         resort_val_gardena  1.479474
8    resort_cortina_d_ampezzo  1.471285
2           precipitation_sum  1.133011
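
The infinite VIFs for `temperature_min`, `temperature_max`, and `temperature_avg` indicate perfect collinearity among the three. That is what we would expect if `temperature_avg` was computed as the mean of the other two, which is an assumption here since the preprocessing step is not shown in this notebook. One way to handle it, sketched below on synthetic data, is to drop the redundant columns and recompute the VIFs. The helper name `compute_vif` and the synthetic values are illustrative and not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column, with an intercept in the auxiliary regressions."""
    design = features.astype(float).copy()
    design["Intercept"] = 1.0
    values = design.values
    vifs = {
        col: variance_inflation_factor(values, i)
        for i, col in enumerate(design.columns)
        if col != "Intercept"
    }
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)


# Synthetic stand-in for the real data (assumed relationship: avg = (min + max) / 2)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"temperature_min": rng.normal(-5, 4, 500)})
demo["temperature_max"] = demo["temperature_min"] + rng.uniform(2, 10, 500)
demo["temperature_avg"] = (demo["temperature_min"] + demo["temperature_max"]) / 2
demo["precipitation_sum"] = rng.gamma(2.0, 2.0, 500)

# Keep temperature_avg and drop the two columns it is derived from
reduced = demo.drop(columns=["temperature_min", "temperature_max"])
vif_reduced = compute_vif(reduced)
print(vif_reduced)
```

Applied to this notebook, the same idea would mean removing `temperature_min` and `temperature_max` from `feature_columns` before splitting, which would also bring `X` in line with the feature list in section 4.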


### 6. Splitting the Data

Since this is time-series data, it's important to split it in a way that respects the temporal order and thus avoids data leakage: the model should never be trained on observations that come after those it is evaluated on.

In [14]:
# Sort the data by date
model_data = model_data.sort_values('date').reset_index(drop=True)

# Re-prepare X and y after sorting
X = model_data[feature_columns]
y = model_data['snow_depth']

#### 6 (a) Split the data into training, validation, and test sets using time-based splits.

In [15]:
# Define the sizes for training, validation, and testing sets
train_size = int(0.7 * len(X))
val_size = int(0.15 * len(X))
test_size = len(X) - train_size - val_size

# Split the data
X_train = X.iloc[:train_size]
y_train = y.iloc[:train_size]

X_val = X.iloc[train_size:train_size + val_size]
y_val = y.iloc[train_size:train_size + val_size]

X_test = X.iloc[train_size + val_size:]
y_test = y.iloc[train_size + val_size:]
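
Because the data holds several resorts per date, a purely positional cut can place rows from the same date on both sides of a boundary. A possible alternative, sketched here on synthetic data, is to pick cutoff dates from the sorted unique dates and split on those. The function name `split_by_date` and the toy frame are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def split_by_date(df, date_col="date", train_frac=0.7, val_frac=0.15):
    """Split df chronologically so that no date is shared between subsets."""
    dates = np.sort(df[date_col].unique())
    train_end = dates[int(train_frac * len(dates))]
    val_end = dates[int((train_frac + val_frac) * len(dates))]

    train = df[df[date_col] < train_end]
    val = df[(df[date_col] >= train_end) & (df[date_col] < val_end)]
    test = df[df[date_col] >= val_end]
    return train, val, test


# Toy frame: 100 days with 3 hypothetical resorts per day
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D").repeat(3),
    "resort": ["a", "b", "c"] * 100,
    "snow_depth": np.arange(300, dtype=float),
})

train, val, test = split_by_date(toy)
print(len(train), len(val), len(test))

# Each subset ends strictly before the next one begins
print(train["date"].max() < val["date"].min())
print(val["date"].max() < test["date"].min())
```

The subset sizes will deviate slightly from an exact 70/15/15 split because the cut is made on whole dates rather than on row counts.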

### 7. Proceeding with Modeling

The data is now prepared.  We can proceed to build and evaluate the models.