# Data Cleaning and Preparation

In this notebook, we will perform data cleaning and preparation for all resorts across the Alps. This includes loading the raw data, handling missing values, correcting data types, and saving the cleaned data for further analysis.

## 1. Import Libraries

In [5]:
import pandas as pd
import os
import unicodedata

## 2. Define Data Paths and Helper Functions

To efficiently load and clean data for all resorts, we'll define the root directories and create helper functions to process each file.

### 2(a). Handling Special Characters in File Names
Special characters in file and directory names, such as the accents in `kitzbühel`, can cause issues, so we normalise names to plain ASCII before processing.

In [6]:
# Normalise file names to remove accents
def normalize_name(file_name):
    return unicodedata.normalize('NFKD', file_name).encode('ascii', 'ignore').decode('utf-8')
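To make the effect of `normalize_name` concrete, the sketch below applies it to a few resort names that appear in this project's directory tree. The `examples` list is illustrative only. One caveat: characters with no Unicode decomposition (for example `ß`) are dropped entirely rather than transliterated, so normalised names are not guaranteed to stay readable or unique.

```python
import unicodedata

def normalize_name(file_name):
    # Same helper as above: decompose accented characters (NFKD),
    # then drop everything that is not plain ASCII
    return unicodedata.normalize('NFKD', file_name).encode('ascii', 'ignore').decode('utf-8')

# Illustrative inputs taken from directory names used in this notebook
examples = ["kitzbühel", "sölden", "les_trois_vallées", "val_d'isère_&_tignes"]
normalized_examples = [normalize_name(name) for name in examples]

for original, normalized in zip(examples, normalized_examples):
    print(f"{original} -> {normalized}")

# Characters without a decomposition are removed, not replaced
print(normalize_name("straße"))  # prints "strae"
```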

In [7]:
# Root directories for raw and processed data
raw_data_root = 'data/raw/cds'
processed_data_root = 'data/processed/cds'

In [9]:
# Function to get list of all CSV files in the raw data directory
def get_all_csv_files(root_dir):
    # Print the root directory path to verify
    print(f"Checking directory: {root_dir}")
    
    if not os.path.exists(root_dir):
        print(f"Directory {root_dir} does not exist.")
        return []
    
    print(f"Directory {root_dir} exists!")
    
    csv_files = []
    for country in os.listdir(root_dir):
        normalized_country = normalize_name(country)
        country_path = os.path.join(root_dir, country)
        print(f"Checking country path: {country_path}")
        
        if os.path.isdir(country_path):
            for resort in os.listdir(country_path):
                normalized_resort = normalize_name(resort)
                resort_path = os.path.join(country_path, resort)
                print(f"Checking resort path: {resort_path}")
                
                if os.path.isdir(resort_path):
                    for file in os.listdir(resort_path):
                        normalized_file = normalize_name(file)
                        print(f"Found file: {file}")
                        
                        if normalized_file.endswith('.csv'):
                            file_path = os.path.join(resort_path, file)
                            csv_files.append({
                                'country': normalized_country,
                                'resort': normalized_resort,
                                'file_path': file_path
                            })
    return csv_files

# Set the correct root directory
raw_data_root = '../data/raw/cds'

# Print current working directory
print(f"Current working directory: {os.getcwd()}")

# Get list of all CSV files
csv_files = get_all_csv_files(raw_data_root)
print(f"Found {len(csv_files)} CSV files.")

Current working directory: /workspace/SkiSnow/notebooks
Checking directory: ../data/raw/cds
Directory ../data/raw/cds exists!
Checking country path: ../data/raw/cds/austrian_alps
Checking resort path: ../data/raw/cds/austrian_alps/kitzbühel
Found file: kitzbühel.csv
Checking resort path: ../data/raw/cds/austrian_alps/st._anton
Found file: st._anton.csv
Checking resort path: ../data/raw/cds/austrian_alps/sölden
Found file: sölden.csv
Checking country path: ../data/raw/cds/french_alps
Checking resort path: ../data/raw/cds/french_alps/chamonix
Found file: chamonix.csv
Checking resort path: ../data/raw/cds/french_alps/les_trois_vallées
Found file: les_trois_vallées.csv
Checking resort path: ../data/raw/cds/french_alps/val_d'isère_&_tignes
Found file: val_d'isère_&_tignes.csv
Checking country path: ../data/raw/cds/italian_alps
Checking resort path: ../data/raw/cds/italian_alps/cortina_d'ampezzo
Found file: cortina_d'ampezzo.csv
Checking resort path: ../data/raw/cds/italian_alps/sestriere
Fo
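The nested loops in `get_all_csv_files` assume a fixed `country/resort/file.csv` layout. Under that same assumption, a more compact sketch can use `pathlib.Path.glob`. The name `find_csv_files` is hypothetical and not part of the original notebook; it returns the same list-of-dictionaries shape as the function above, without the debugging output.

```python
import unicodedata
from pathlib import Path

def normalize_name(file_name):
    # Same helper as in section 2(a)
    return unicodedata.normalize('NFKD', file_name).encode('ascii', 'ignore').decode('utf-8')

def find_csv_files(root_dir):
    """Hypothetical alternative to get_all_csv_files.

    Assumes the layout root_dir/<country>/<resort>/<file>.csv.
    """
    root = Path(root_dir)
    if not root.is_dir():
        return []

    csv_files = []
    # '*/*/*.csv' matches entries exactly two directory levels below the root
    for path in sorted(root.glob('*/*/*.csv')):
        if not path.is_file():
            continue  # skip directories that happen to end in .csv
        resort_dir = path.parent
        csv_files.append({
            'country': normalize_name(resort_dir.parent.name),
            'resort': normalize_name(resort_dir.name),
            'file_path': str(path),  # original (un-normalised) path, so the file can be opened
        })
    return csv_files
```

As in the original function, only the `country` and `resort` labels are normalised; `file_path` keeps the real name on disk.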

## 3. Handle Missing Values

Data from this source is only available from 2021-03-23 onward, so we drop rows with missing values and keep only rows dated on or after that date.

In [10]:
# Root directory where all CSV files are stored
raw_data_root = '../data/raw/cds'

# Function to clean and filter a single CSV file
def clean_and_filter_data(file_path):
    # Load the CSV file
    df = pd.read_csv(file_path)
    
    # Convert the date column to datetime format
    df['date'] = pd.to_datetime(df['date'])
    
    # Drop rows with missing values
    df = df.dropna()
    
    # Filter rows to only keep data from 2021-03-23 onward
    df = df[df['date'] >= '2021-03-23']
    
    return df
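To show what these steps do to a file, the sketch below applies the same logic to a small in-memory table. The `snow_depth` column and its values are made up for illustration (the real column names come from the raw CSV files), and `clean_and_filter_frame` is a hypothetical variant that takes a DataFrame instead of a file path.

```python
import pandas as pd

def clean_and_filter_frame(df, start_date='2021-03-23'):
    """Hypothetical DataFrame-based variant of clean_and_filter_data."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df['date'] = pd.to_datetime(df['date'])
    df = df.dropna()  # drops any row with at least one missing value
    df = df[df['date'] >= pd.Timestamp(start_date)]
    # Resetting the index is an addition for readability, not in the original function
    return df.reset_index(drop=True)

# Made-up sample: two empty rows before the start date, one gap after it
sample = pd.DataFrame({
    'date': ['2021-03-21', '2021-03-22', '2021-03-23', '2021-03-24', '2021-03-25'],
    'snow_depth': [None, None, 120.0, None, 118.5],
})

cleaned_sample = clean_and_filter_frame(sample)
print(cleaned_sample)
```

Note that `dropna()` also removes the 2021-03-24 row, which falls after the start date but still has a missing value, so only the 2021-03-23 and 2021-03-25 rows remain.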

### 3(a). Iterating Over All CSV Files to Clean and Filter Them

In [11]:
def process_all_csv_files(root_dir):
    all_dataframes = {}  # Dictionary to store cleaned dataframes for each resort
    
    for country in os.listdir(root_dir):
        country_path = os.path.join(root_dir, country)
        if os.path.isdir(country_path):
            for resort in os.listdir(country_path):
                resort_path = os.path.join(country_path, resort)
                if os.path.isdir(resort_path):
                    for file in os.listdir(resort_path):
                        if file.endswith('.csv'):
                            file_path = os.path.join(resort_path, file)
                            print(f"Processing {file_path}...")
                            
                            # Clean and filter the data
                            cleaned_df = clean_and_filter_data(file_path)
                            
                            # Store the cleaned dataframe using a key (country_resort) in a dictionary
                            resort_name = f"{country}_{resort}"
                            all_dataframes[resort_name] = cleaned_df
    
    return all_dataframes

# Process all CSV files and store the results in a dictionary
cleaned_dataframes = process_all_csv_files(raw_data_root)

# Example: Accessing cleaned data for a specific resort (e.g., 'french_alps_chamonix')
df_chamonix_cleaned = cleaned_dataframes['french_alps_chamonix']
print(df_chamonix_cleaned.head())


Processing ../data/raw/cds/austrian_alps/kitzbühel/kitzbühel.csv...
Processing ../data/raw/cds/austrian_alps/st._anton/st._anton.csv...
Processing ../data/raw/cds/austrian_alps/sölden/sölden.csv...
Processing ../data/raw/cds/french_alps/chamonix/chamonix.csv...
Processing ../data/raw/cds/french_alps/les_trois_vallées/les_trois_vallées.csv...
Processing ../data/raw/cds/french_alps/val_d'isère_&_tignes/val_d'isère_&_tignes.csv...
Processing ../data/raw/cds/italian_alps/cortina_d'ampezzo/cortina_d'ampezzo.csv...
Processing ../data/raw/cds/italian_alps/sestriere/sestriere.csv...
Processing ../data/raw/cds/italian_alps/val_gardena/val_gardena.csv...
Processing ../data/raw/cds/slovenian_alps/kranjska_gora/kranjska_gora.csv...
Processing ../data/raw/cds/slovenian_alps/krvavec/krvavec.csv...
Processing ../data/raw/cds/slovenian_alps/mariborsko_pohorje/mariborsko_pohorje.csv...
Processing ../data/raw/cds/swiss_alps/st._moritz/st._moritz.csv...
Processing ../data/raw/cds/swiss_alps/verbier/verbi