# Data Cleaning and Preparation

In this notebook, we will perform data cleaning and preparation for all resorts across the Alps. This includes loading the raw data, handling missing values, correcting data types, and saving the cleaned data for further analysis.

## 1. Import Libraries

In [5]:
import pandas as pd
import os
import unicodedata

## 2. Define Data Paths and Helper Functions

To efficiently load and clean data for all resorts, we'll define the root directories and create helper functions to process each file.

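### 2(a). Handling Special Characters in File Names
Several resort names contain accented characters (for example `kitzbühel` and `sölden`), which can cause problems when names are compared or reused as output paths. We therefore normalize file names before processing.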

In [12]:
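# Normalize file names to remove accents
def normalize_name(file_name):
    # NFKD splits each accented character into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining marks.
    return unicodedata.normalize('NFKD', file_name).encode('ascii', 'ignore').decode('ascii')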
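
To make the effect of the normalization concrete, the sketch below applies the same approach to a few resort names that appear in the directory listing further down. It redefines `normalize_name` so that it runs on its own; the `examples` list is only an illustration and not part of the pipeline.

```python
import unicodedata

def normalize_name(file_name):
    # Decompose accented characters (NFKD), then drop anything that is not ASCII
    return unicodedata.normalize('NFKD', file_name).encode('ascii', 'ignore').decode('ascii')

# Resort names taken from the directory listing in this notebook
examples = ['kitzbühel', 'sölden', "val_d'isère_&_tignes", 'les_trois_vallées']
normalized = [normalize_name(name) for name in examples]

for original, cleaned in zip(examples, normalized):
    print(f"{original} -> {cleaned}")
```

Only the accents are removed. Other special characters such as `'` and `&` are already ASCII and pass through unchanged, so they would need separate handling if they cause problems downstream.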

In [13]:
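# Root directories for raw and processed data, relative to the notebooks/ directory
# (the notebook's working directory, as printed in the output below)
raw_data_root = '../data/raw/cds'
processed_data_root = '../data/processed/cds'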

In [14]:
# Function to get list of all CSV files in the raw data directory
def get_all_csv_files(root_dir):
    # Print the root directory path to verify
    print(f"Checking directory: {root_dir}")
    
    if not os.path.exists(root_dir):
        print(f"Directory {root_dir} does not exist.")
        return []
    
    print(f"Directory {root_dir} exists!")
    
    csv_files = []
    for country in os.listdir(root_dir):
        normalized_country = normalize_name(country)
        country_path = os.path.join(root_dir, country)
        print(f"Checking country path: {country_path}")
        
        if os.path.isdir(country_path):
            for resort in os.listdir(country_path):
                normalized_resort = normalize_name(resort)
                resort_path = os.path.join(country_path, resort)
                print(f"Checking resort path: {resort_path}")
                
                if os.path.isdir(resort_path):
                    for file in os.listdir(resort_path):
                        normalized_file = normalize_name(file)
                        print(f"Found file: {file}")
                        
                        if normalized_file.endswith('.csv'):
                            file_path = os.path.join(resort_path, file)
                            csv_files.append({
                                'country': normalized_country,
                                'resort': normalized_resort,
                                'file_path': file_path
                            })
    return csv_files

# Set the correct root directory
raw_data_root = '../data/raw/cds'

# Print current working directory
print(f"Current working directory: {os.getcwd()}")

# Get list of all CSV files
csv_files = get_all_csv_files(raw_data_root)
print(f"Found {len(csv_files)} CSV files.")

Current working directory: /workspace/SkiSnow/notebooks
Checking directory: ../data/raw/cds
Directory ../data/raw/cds exists!
Checking country path: ../data/raw/cds/austrian_alps
Checking resort path: ../data/raw/cds/austrian_alps/st._anton
Found file: st._anton.csv
Checking resort path: ../data/raw/cds/austrian_alps/kitzbühel
Found file: kitzbühel.csv
Checking resort path: ../data/raw/cds/austrian_alps/sölden
Found file: sölden.csv
Checking country path: ../data/raw/cds/french_alps
Checking resort path: ../data/raw/cds/french_alps/chamonix
Found file: chamonix.csv
Checking resort path: ../data/raw/cds/french_alps/val_d'isère_&_tignes
Found file: val_d'isère_&_tignes.csv
Checking resort path: ../data/raw/cds/french_alps/les_trois_vallées
Found file: les_trois_vallées.csv
Checking country path: ../data/raw/cds/italian_alps
Checking resort path: ../data/raw/cds/italian_alps/sestriere
Found file: sestriere.csv
Checking resort path: ../data/raw/cds/italian_alps/val_gardena
Found file: val_

## 3. Handle Missing Values

The data from this source does not appear to be available until 2021-03-23, so we must remove all rows with missing values before that date.
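
A minimal sketch of this step is shown below. It makes several assumptions: each resort file has a date column (named `date` here) and at least one measurement column (the hypothetical `snow_depth`), and 2021-03-23 is the first date with data. The sample values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Hypothetical sample standing in for one resort's raw CSV.
# The column names 'date' and 'snow_depth' are assumptions, not taken from the data.
df = pd.DataFrame({
    'date': ['2021-03-21', '2021-03-22', '2021-03-23', '2021-03-24'],
    'snow_depth': [None, None, 120.0, 118.0],
})

# Parse the date column so rows can be compared chronologically
df['date'] = pd.to_datetime(df['date'])

# Keep only rows on or after the first date with available data
first_available = pd.Timestamp('2021-03-23')
cleaned = df[df['date'] >= first_available].reset_index(drop=True)

# Count any gaps that remain after the cut-off so they can be handled separately
remaining_missing = int(cleaned['snow_depth'].isna().sum())
print(f"Rows kept: {len(cleaned)}, remaining missing values: {remaining_missing}")
```

Filtering by date rather than calling `dropna()` on the whole frame keeps any isolated gaps after the cut-off visible, so they can be inspected instead of being silently discarded.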