# Data Cleaning and Preparation

In this notebook, we clean and prepare the raw data for 15 resorts across five regions of the Alps. This includes:

- Loading the raw data
- Handling missing values
- Correcting data types
- Normalizing resort names to handle special characters
- Filtering data based on resort operating dates
- Saving the cleaned data for further analysis

## 1. Import Libraries

In [13]:
import pandas as pd
import os
import unicodedata
import re
from dateutil.relativedelta import relativedelta

## 2. Handling Special Characters in File and Resort Names

To avoid issues with special characters (like accents and apostrophes) in file names and resort names, we'll define a normalization function. This function will:

- Convert names to lowercase
- Remove accents and diacritics
- Replace non-alphanumeric characters with underscores

In [14]:
def normalize_name(name):
    # Convert to lowercase
    name = name.lower()
    # Remove accents and diacritics
    name = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore').decode('utf-8')
    # Replace non-alphanumeric characters with underscores
    name = re.sub(r'[^a-z0-9]+', '_', name)
    # Remove leading/trailing underscores
    name = name.strip('_')
    return name
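
As a quick check of what the function produces, the sketch below applies the same steps to a few resort names used in this notebook. The function is repeated so that the snippet runs on its own; the `examples` list is illustrative input only.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip diacritics, and collapse non-alphanumerics to underscores."""
    name = name.lower()
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks
    name = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore').decode('utf-8')
    name = re.sub(r'[^a-z0-9]+', '_', name)
    return name.strip('_')


examples = ["Kitzbühel", "Val d'Isère & Tignes", "St. Anton", "Cortina d'Ampezzo"]
normalized = {name: normalize_name(name) for name in examples}
for original, result in normalized.items():
    print(f"{original!r} -> {result!r}")
```

These map to `kitzbuhel`, `val_d_isere_tignes`, `st_anton`, and `cortina_d_ampezzo`. One caveat: letters that have no Unicode decomposition, such as `ß` or `ø`, are dropped rather than transliterated, so `Straße` becomes `strae`.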

### 2.1 Set Paths for Raw and Processed Data

In [18]:
raw_data_root = '../data/raw/cds'
processed_data_root = '../data/processed/cds'

## 3. List All CSV Files in the Raw Data Directory

We'll create a function to traverse the directory structure and collect all CSV files. While doing so, we'll normalize the country and resort names to ensure consistency.

In [19]:
def get_all_csv_files(root_dir):
    csv_files = []
    for country in os.listdir(root_dir):
        country_path = os.path.join(root_dir, country)
        if os.path.isdir(country_path):
            normalized_country = normalize_name(country)
            for resort in os.listdir(country_path):
                resort_path = os.path.join(country_path, resort)
                if os.path.isdir(resort_path):
                    normalized_resort = normalize_name(resort)
                    for file in os.listdir(resort_path):
                        if file.endswith('.csv'):
                            file_path = os.path.join(resort_path, file)
                            csv_files.append({
                                'country': normalized_country,
                                'resort': normalized_resort,
                                'file_path': file_path
                            })
    return csv_files

# Get list of all CSV files
csv_files = get_all_csv_files(raw_data_root)
print(f"Found {len(csv_files)} CSV files.")

Found 15 CSV files.
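
The nested `os.listdir` loops can also be written as a single glob pattern. This is a minimal sketch of that alternative, assuming the same `<root>/<country>/<resort>/<file>.csv` layout; `find_csv_files` is a hypothetical name, and the country and resort names are returned as found on disk, so `normalize_name` would still need to be applied to them.

```python
from pathlib import Path


def find_csv_files(root_dir):
    """Return CSV files laid out as <root>/<country>/<resort>/<file>.csv."""
    root = Path(root_dir)
    found = []
    # The pattern matches files exactly two directory levels below the root.
    # Sorting makes the order reproducible; directory listings have no guaranteed order.
    for path in sorted(root.glob('*/*/*.csv')):
        found.append({
            'country': path.parent.parent.name,  # raw folder name, not normalized
            'resort': path.parent.name,          # raw folder name, not normalized
            'file_path': str(path),
        })
    return found
```

In this notebook it would be called as `find_csv_files(raw_data_root)`.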


## 4. Data Cleaning Steps

We will perform the following data cleaning steps for each resort:

1. Remove empty rows prior to `2021-03-23`
2. Handle missing values
3. Handle duplicates
4. Correct data types
5. Filter data based on each resort's opening and closing dates
6. Save cleaned data

### 4.1 Function to Clean and Filter a Single CSV File

In [24]:
def clean_and_filter_data(file_info):
    """Load one raw CSV and return (key, cleaned DataFrame)."""
    country = file_info['country']
    resort = file_info['resort']
    file_path = file_info['file_path']

    # Load the CSV file
    df = pd.read_csv(file_path)

    # Convert 'date' column to datetime format and remove timezone info
    df['date'] = pd.to_datetime(df['date'], errors='coerce').dt.tz_localize(None)

    # Drop rows whose date is missing or could not be parsed
    df = df.dropna(subset=['date'])

    # Remove rows prior to 2021-03-23
    df = df[df['date'] >= '2021-03-23']

    # Drop rows with missing values in any other column
    df = df.dropna()

    # Drop exact duplicate rows
    df = df.drop_duplicates()

    # Reset index after dropping rows
    df = df.reset_index(drop=True)

    # Build the dictionary key from the normalized country and resort names
    key = f"{country}/{resort}"

    return key, df
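
To see the effect of each cleaning step, the sketch below runs the steps listed above on a small in-memory CSV instead of a file on disk. The `snow_depth` column, the sample rows, and the `clean_frame` helper are hypothetical; the sketch also passes `utc=True` so that the date column is always timezone-aware before the timezone is removed, which is an assumption about the raw data rather than something the notebook does.

```python
import io

import pandas as pd


def clean_frame(df, start='2021-03-23'):
    """Apply the cleaning steps to a DataFrame that has a 'date' column."""
    df = df.copy()
    # Unparseable dates become NaT; the timezone is dropped after parsing
    df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True).dt.tz_localize(None)
    df = df.dropna(subset=['date'])             # rows without a usable date
    df = df[df['date'] >= pd.Timestamp(start)]  # rows before the cutoff
    df = df.dropna()                            # rows with any other missing value
    df = df.drop_duplicates()                   # exact duplicate rows
    return df.reset_index(drop=True)


raw_csv = io.StringIO(
    "date,snow_depth\n"
    "2021-03-22 00:00:00+00:00,1.25\n"  # before the cutoff
    "2021-03-23 00:00:00+00:00,1.0\n"
    "not-a-date,0.75\n"                 # unparseable date
    "2021-03-24 00:00:00+00:00,\n"      # missing measurement
    "2021-03-25 00:00:00+00:00,0.5\n"
    "2021-03-25 00:00:00+00:00,0.5\n"   # exact duplicate
)
cleaned = clean_frame(pd.read_csv(raw_csv))
print(cleaned)
```

Of the six sample rows, only the rows for `2021-03-23` and `2021-03-25` remain.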

### 4.2 Process All CSV Files

In [25]:
data_frames = {}

for file_info in csv_files:
    key, df = clean_and_filter_data(file_info)
    data_frames[key] = df
    print(f"Loaded and cleaned data for {key}: {df.shape[0]} rows.")

Loaded and cleaned data for austrian_alps/kitzbuhel: 1166 rows.
Loaded and cleaned data for austrian_alps/st_anton: 1166 rows.
Loaded and cleaned data for austrian_alps/solden: 1166 rows.
Loaded and cleaned data for french_alps/chamonix: 1166 rows.
Loaded and cleaned data for french_alps/les_trois_vallees: 1166 rows.
Loaded and cleaned data for french_alps/val_d_isere_tignes: 1166 rows.
Loaded and cleaned data for italian_alps/cortina_d_ampezzo: 1166 rows.
Loaded and cleaned data for italian_alps/sestriere: 1166 rows.
Loaded and cleaned data for italian_alps/val_gardena: 1166 rows.
Loaded and cleaned data for slovenian_alps/kranjska_gora: 1166 rows.
Loaded and cleaned data for slovenian_alps/krvavec: 1166 rows.
Loaded and cleaned data for slovenian_alps/mariborsko_pohorje: 1166 rows.
Loaded and cleaned data for swiss_alps/st_moritz: 1166 rows.
Loaded and cleaned data for swiss_alps/verbier: 1166 rows.
Loaded and cleaned data for swiss_alps/zermatt: 1166 rows.


### 4.3 Resort Opening and Closing Dates

Each resort operates during specific dates in the year. We'll filter the data to include only the dates when each resort is open.

Here are the approximate opening and closing dates for each resort:

- **French Alps:**
  - **Chamonix:** Opens mid-December (`12-15`), closes mid-May (`05-15`)
  - **Val d'Isère & Tignes:** Opens November 30 (`11-30`), closes May 5 (`05-05`)
  - **Les Trois Vallées:** Opens December 7 (`12-07`), closes mid-April (`04-15`)
  
- **Austrian Alps:**
  - **St. Anton:** Opens early December (`12-01`), closes late April (`04-30`)
  - **Kitzbühel:** Opens mid-October (`10-15`), closes early May (`05-01`)
  - **Sölden:** Opens early November (`11-01`), closes early May (`05-01`)
  
- **Swiss Alps:**
  - **Zermatt:** Opens mid-November (`11-15`), closes late April (`04-30`)
  - **St. Moritz:** Opens late November (`11-25`), closes early May (`05-01`)
  - **Verbier:** Opens early December (`12-01`), closes late April (`04-30`)
  
- **Italian Alps:**
  - **Cortina d'Ampezzo:** Opens late November (`11-25`), closes early April (`04-05`)
  - **Val Gardena:** Opens early December (`12-01`), closes mid-April (`04-15`)
  - **Sestriere:** Opens early December (`12-01`), closes mid-April (`04-15`)
  
- **Slovenian Alps:**
  - **Kranjska Gora:** Opens mid-December (`12-15`), closes mid-April (`04-15`)
  - **Mariborsko Pohorje:** Opens early December (`12-01`), closes early April (`04-05`)
  - **Krvavec:** Opens early December (`12-01`), closes late April (`04-30`)

We'll define the `resort_seasons` dictionary with normalized keys to match the keys in `data_frames`.


In [28]:
resort_seasons = {
    'french_alps/chamonix': {'open': '12-15', 'close': '05-15'},
    'french_alps/val_d_isere_tignes': {'open': '11-30', 'close': '05-05'},
    'french_alps/les_trois_vallees': {'open': '12-07', 'close': '04-15'},
    'austrian_alps/st_anton': {'open': '12-01', 'close': '04-30'},
    'austrian_alps/kitzbuhel': {'open': '10-15', 'close': '05-01'},
    'austrian_alps/solden': {'open': '11-01', 'close': '05-01'},
    'swiss_alps/zermatt': {'open': '11-15', 'close': '04-30'},
    'swiss_alps/st_moritz': {'open': '11-25', 'close': '05-01'},
    'swiss_alps/verbier': {'open': '12-01', 'close': '04-30'},
    'italian_alps/cortina_d_ampezzo': {'open': '11-25', 'close': '04-05'},
    'italian_alps/val_gardena': {'open': '12-01', 'close': '04-15'},
    'italian_alps/sestriere': {'open': '12-01', 'close': '04-15'},
    'slovenian_alps/kranjska_gora': {'open': '12-15', 'close': '04-15'},
    'slovenian_alps/mariborsko_pohorje': {'open': '12-01', 'close': '04-05'},
    'slovenian_alps/krvavec': {'open': '12-01', 'close': '04-30'},
}
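
Because the season table is typed by hand, a key that does not exactly match a normalized `country/resort` key would leave that resort unfiltered. A minimal sketch of a check for this, using a hypothetical helper `check_season_keys` and made-up example inputs:

```python
def check_season_keys(data_keys, seasons):
    """Return (data keys without season info, season entries without data)."""
    data_keys = set(data_keys)
    season_keys = set(seasons)
    return sorted(data_keys - season_keys), sorted(season_keys - data_keys)


# Hypothetical example: one resort has data but no season entry, and vice versa
example_data_keys = ['swiss_alps/zermatt', 'swiss_alps/verbier']
example_seasons = {
    'swiss_alps/zermatt': {'open': '11-15', 'close': '04-30'},
    'swiss_alps/st_moritz': {'open': '11-25', 'close': '05-01'},
}

missing, unused = check_season_keys(example_data_keys, example_seasons)
print("No season info for:", missing)
print("Season info without data:", unused)
```

In this notebook the call would be `check_season_keys(data_frames, resort_seasons)`; both returned lists should be empty.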


### 4.4 Filter Data Based on Resort Operating Dates

We'll filter each resort's data to include only dates within its operating season.

In [29]:
for key, df in data_frames.items():
    if key not in resort_seasons:
        print(f"No season information for {key}. Data not filtered.")
        continue

    season = resort_seasons[key]
    filtered_dfs = []

    # The data spans multiple years, so build one season window per calendar year
    for year in df['date'].dt.year.unique():
        open_date = pd.to_datetime(f"{year}-{season['open']}", errors='coerce')
        close_date = pd.to_datetime(f"{year}-{season['close']}", errors='coerce')

        # A close date earlier than the open date means the season runs into the next year
        if close_date < open_date:
            close_date += relativedelta(years=1)

        season_df = df[(df['date'] >= open_date) & (df['date'] <= close_date)]
        filtered_dfs.append(season_df)

    # Combine all seasons and update the DataFrame in the dictionary
    df_season = pd.concat(filtered_dfs).reset_index(drop=True)
    data_frames[key] = df_season

    print(f"Filtered data for {key}: {df_season.shape[0]} rows within operating dates.")

Filtered data for austrian_alps/kitzbuhel: 595 rows within operating dates.
Filtered data for austrian_alps/st_anton: 451 rows within operating dates.
Filtered data for austrian_alps/solden: 544 rows within operating dates.
Filtered data for french_alps/chamonix: 454 rows within operating dates.
Filtered data for french_alps/les_trois_vallees: 388 rows within operating dates.
Filtered data for french_alps/val_d_isere_tignes: 469 rows within operating dates.
Filtered data for italian_alps/cortina_d_ampezzo: 394 rows within operating dates.
Filtered data for italian_alps/sestriere: 406 rows within operating dates.
Filtered data for italian_alps/val_gardena: 406 rows within operating dates.
Filtered data for slovenian_alps/kranjska_gora: 364 rows within operating dates.
Filtered data for slovenian_alps/krvavec: 451 rows within operating dates.
Filtered data for slovenian_alps/mariborsko_pohorje: 376 rows within operating dates.
Filtered data for swiss_alps/st_moritz: 472 rows within operating dates.
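
The per-year loop above only opens a window in calendar years that appear in the data, so rows from the first spring in the data, which belong to a season that opened before the data starts, fall outside every window. It also compares against midnight on the closing day, which excludes later timestamps on that day if the `date` column carries a time of day. A minimal sketch of an alternative, assuming a season is fully described by its month-day bounds: compare only month and day, and treat a season whose close precedes its open as wrapping over the new year. `in_season` is a hypothetical helper, not part of the notebook, and using it would change the row counts printed above.

```python
import pandas as pd


def in_season(dates, open_md, close_md):
    """Boolean mask: True where the date's month-day lies within the season."""
    # Encode month and day as one integer, e.g. December 15 -> 1215
    month_day = dates.dt.month * 100 + dates.dt.day
    open_key = int(open_md.replace('-', ''))
    close_key = int(close_md.replace('-', ''))
    if open_key <= close_key:
        # Season contained in one calendar year
        return (month_day >= open_key) & (month_day <= close_key)
    # Season wraps over the new year: late in the year OR early in the year
    return (month_day >= open_key) | (month_day <= close_key)


dates = pd.Series(pd.to_datetime([
    '2021-04-01 12:00',  # spring tail of a season that opened the previous year
    '2021-06-01 00:00',  # summer, closed
    '2021-12-15 00:00',  # opening day
    '2022-05-15 12:00',  # closing day, midday timestamp
    '2022-05-16 00:00',  # day after closing
]))
mask = in_season(dates, '12-15', '05-15')
print(dates[mask])
```

With a mask like this, the filter for one resort reduces to `df[in_season(df['date'], season['open'], season['close'])]`.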

## 5. Save Cleaned Data

We'll save the cleaned and filtered DataFrames under `processed_data_root` (`../data/processed/cds`), mirroring the normalized `country/resort` folder structure.

In [30]:
for key, df in data_frames.items():
    # Split the key back into country and resort
    country, resort = key.split('/')
    # Build the processed data path
    processed_dir = os.path.join(processed_data_root, country, resort)
    os.makedirs(processed_dir, exist_ok=True)
    # Save the cleaned DataFrame
    processed_file_path = os.path.join(processed_dir, f"{resort}_cleaned.csv")
    df.to_csv(processed_file_path, index=False)
    print(f"Saved cleaned data to {processed_file_path}.")

Saved cleaned data to ../data/processed/cds/austrian_alps/kitzbuhel/kitzbuhel_cleaned.csv.
Saved cleaned data to ../data/processed/cds/austrian_alps/st_anton/st_anton_cleaned.csv.
Saved cleaned data to ../data/processed/cds/austrian_alps/solden/solden_cleaned.csv.
Saved cleaned data to ../data/processed/cds/french_alps/chamonix/chamonix_cleaned.csv.
Saved cleaned data to ../data/processed/cds/french_alps/les_trois_vallees/les_trois_vallees_cleaned.csv.
Saved cleaned data to ../data/processed/cds/french_alps/val_d_isere_tignes/val_d_isere_tignes_cleaned.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/cortina_d_ampezzo/cortina_d_ampezzo_cleaned.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/sestriere/sestriere_cleaned.csv.
Saved cleaned data to ../data/processed/cds/italian_alps/val_gardena/val_gardena_cleaned.csv.
Saved cleaned data to ../data/processed/cds/slovenian_alps/kranjska_gora/kranjska_gora_cleaned.csv.
Saved cleaned data to ../data/processed/cds/sl
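
CSV files store dates as text, so the `date` column has to be parsed again when a cleaned file is loaded later. The sketch below illustrates this with a temporary directory and a made-up two-row DataFrame; the `snow_depth` column and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-12-15', '2021-12-16']),
    'snow_depth': [0.5, 1.25],  # hypothetical measurement column
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_cleaned.csv')
    df.to_csv(path, index=False)

    plain = pd.read_csv(path)                         # 'date' comes back as text
    parsed = pd.read_csv(path, parse_dates=['date'])  # 'date' comes back as datetime

print(plain['date'].dtype)
print(parsed['date'].dtype)
```

Passing `parse_dates=['date']` when reading the cleaned files in later notebooks avoids repeating the `pd.to_datetime` conversion.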

## 6. Summary

- Loaded and cleaned data for all resorts.
- Normalized resort names to handle special characters.
- Filtered data based on each resort's operating dates.
- Saved cleaned data under `../data/processed/cds`.

The cleaned datasets are now ready for feature engineering and further analysis.