# IREM data transformations

This notebook demonstrates how to download raw IREM data and transform it into formats suitable for further analysis (DataFrame, HDF5, CSV).

## Prerequisites

### Downloading IREM data (optional)

This section walks through downloading the IREM data. Skip it if you have already downloaded the data.

- This cell supports Linux-based operating systems. On other OSes, you can download the data manually.
- We will use the `wget` command to fetch the data from the official [IREM data repository](http://srem.psi.ch/datarepo/V0/irem/).
- The data will be organized into directories for the original raw data and the extracted CDF data, alongside output directories for HDF5 and CSV files. You can modify the `DATA_DIR` variable to change the location of the data (if you do, change it in the initialization cell as well).
- The `--timestamping` option of `wget` ensures that existing, up-to-date files are not downloaded again, which saves time and bandwidth.
- Running this cell for the first time can take a while, so grab a coffee.

In [25]:
%%sh
# Data directory to store the data
DATA_DIR="../data/irem"

# Creating data directories
mkdir -p ${DATA_DIR}
mkdir -p ${DATA_DIR}/extracted
mkdir -p ${DATA_DIR}/hdf5
mkdir -p ${DATA_DIR}/csv
# Create a symlink to the raw directory
DATA_RAW_DIR=${DATA_DIR}/raw
if [ ! -L "$DATA_RAW_DIR" ]; then
    ABS_DATA_RAW_DIR=$(readlink -f ${DATA_RAW_DIR})
    ABS_DATA_DIR=$(readlink -f ${DATA_DIR})
    ln -s ${ABS_DATA_DIR}/srem.psi.ch/datarepo/V0/irem ${ABS_DATA_RAW_DIR}
fi

# Get data recursively
wget \
    --timestamping \
    --recursive \
    --no-parent \
    --no-verbose \
    -A gz \
    http://srem.psi.ch/datarepo/V0/irem/ \
    -P ${DATA_DIR} \
    2> ${DATA_DIR}/wget.log # Redirect wget output to a log file to avoid cluttering the notebook

# Remove summary plots dir which we don't care about
rm -rf "${DATA_DIR}/raw/summaryplots"

### Initializing the notebook

In [2]:
import radem
import os
from pathlib import Path
import gzip
from datetime import date

DATA_DIR = Path("../data/irem")
DATA_RAW_DIR = DATA_DIR / "raw"
DATA_EXTRACTED_DIR = DATA_DIR / "extracted"
DATA_HDF5_DIR = DATA_DIR / "hdf5"
DATA_CSV_DIR = DATA_DIR / "csv"

### Extracting IREM data archives (optional)

In [None]:
def extract_and_process_files(data_raw_dir: Path, data_extracted_dir: Path) -> None:
    """Decompress every .cdf.gz file found one level below data_raw_dir."""
    # Make sure the output directory exists (e.g. if the download cell was skipped)
    data_extracted_dir.mkdir(parents=True, exist_ok=True)

    # Get sorted list of .cdf.gz files, ignoring stray files at the top level
    data_raw_filenames = sorted(
        data_raw_dir / dirname / filename
        for dirname in os.listdir(data_raw_dir)
        if (data_raw_dir / dirname).is_dir()
        for filename in os.listdir(data_raw_dir / dirname)
        if filename.endswith(".cdf.gz")
    )

    # Process each file
    for filename in data_raw_filenames:
        # Path.stem drops only the last suffix: "x.cdf.gz" -> "x.cdf"
        output_filename = data_extracted_dir / filename.stem
        if output_filename.exists():
            print(f"Overwriting {output_filename} - already exists.")
        print(f"Extracting {filename} to {output_filename}")

        # Decompress the archive and write the result
        with gzip.open(filename, "rb") as f_in, open(output_filename, "wb") as f_out:
            f_out.write(f_in.read())


extract_and_process_files(DATA_RAW_DIR, DATA_EXTRACTED_DIR)
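
For large archives, decompression can also be streamed in fixed-size chunks instead of reading each file into memory at once. The following is a minimal sketch using only the standard library; the helper name `gunzip_streaming`, the chunk size, and the demo file are assumptions made for illustration and are not part of `radem`.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_streaming(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> None:
    """Decompress src (.gz) into dst, copying chunk_size bytes at a time."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)


# Self-contained demo on a temporary file (made-up bytes, not real IREM data)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.cdf.gz"
    payload = b"not a real CDF, just demo bytes\n" * 1000
    with gzip.open(src, "wb") as f:
        f.write(payload)

    # Path.stem turns "sample.cdf.gz" into "sample.cdf"
    dst = Path(tmp) / src.stem
    gunzip_streaming(src, dst)
    restored = dst.read_bytes()
    output_name = dst.name

print(output_name, len(restored))
```

Memory use then stays bounded by the chunk size regardless of how large the archive is.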

## Data transformations

### Reading CDFs (option 1)

Read every CDF file found in the extracted data directory.

In [None]:
cdfs = radem.handlers.read_irem_cdfs(DATA_EXTRACTED_DIR)

print(len(cdfs))

### Reading CDFs (option 2)

Resolve the CDF paths for a date range first, then read only those files.

In [None]:
paths = radem.handlers.get_irem_cdf_paths(
    DATA_EXTRACTED_DIR,
    from_date=date(2011, 11, 11),
    to_date=date(2012, 12, 12))

cdfs = radem.handlers.read_irem_cdfs(paths)

print(len(cdfs))

### Reading CDFs (option 3)

Pass the date range directly to `read_irem_cdfs`.

In [None]:
cdfs = radem.handlers.read_irem_cdfs(
    DATA_EXTRACTED_DIR,
    from_date=date(2011, 11, 11),
    to_date=date(2012, 12, 12))

print(len(cdfs))

### Exploring CDF (optional)


In [None]:
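def print_cdf_report(cdf):
    """Print the variables of a CDF together with file-level and per-variable metadata."""
    print('Keys:')
    print(cdf)

    print('\nCDF meta:')
    print(cdf.meta)
    for key, val in cdf.items():
        print(f'\n{key} -> {val}')
        print(val.meta)

# Inspect one of the most recent files (requires at least 10 CDFs to be loaded)
print_cdf_report(cdfs[-10])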

### Fix and convert for further analysis

> 💡 This step merges the CDFs, removes duplicates, sorts the records, and converts the result to a pandas DataFrame, which simplifies further analysis and avoids low-level issues with CDF files.


In [None]:
df = radem.handlers.convert_irem_cdfs_to_df(cdfs)

print(df)
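
The internals of `convert_irem_cdfs_to_df` are not shown in this notebook. As a library-free illustration of what "merge, remove duplicates, sort" means for time-stamped records, here is a small sketch in plain Python. The record layout (timestamp, counter value), the sample values, and the helper name `merge_dedupe_sort` are all hypothetical.

```python
from datetime import datetime

# Two hypothetical batches of (timestamp, counter value) records, standing in
# for data read from two overlapping files. The values are made up.
batch_a = [
    (datetime(2011, 11, 11, 0, 2), 7),
    (datetime(2011, 11, 11, 0, 0), 5),
]
batch_b = [
    (datetime(2011, 11, 11, 0, 2), 7),  # exact duplicate of a record in batch_a
    (datetime(2011, 11, 11, 0, 1), 6),
]


def merge_dedupe_sort(*batches):
    """Concatenate batches, drop exact duplicates, and sort by timestamp."""
    merged = [record for batch in batches for record in batch]
    unique = set(merged)   # identical tuples collapse into one entry
    return sorted(unique)  # tuples compare by their first element, the timestamp


records = merge_dedupe_sort(batch_a, batch_b)
for timestamp, value in records:
    print(timestamp.isoformat(), value)
```

With a DataFrame the same three operations are typically expressed as concatenation, duplicate removal, and sorting by the time index.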

### Writing to HDF5

In [8]:
radem.handlers.write_hdf(df, DATA_HDF5_DIR / "example.hdf5")

### Reading from HDF5

In [None]:
df_hdf = radem.handlers.read_hdf(DATA_HDF5_DIR / "example.hdf5")

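# DataFrame.equals compares shape, dtypes, and values.
# (all(df_hdf == df) would only iterate over the column labels.)
print(df_hdf.equals(df))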

### Writing to CSV (not recommended)

> ⚠️ CSV files are not efficient for storing large datasets. Use the HDF5 format instead.

> ⚠️ Tiny floating-point errors may occur when writing to and reading from CSV, e.g. `1.4143094841930115` vs `1.4143094841930117`. If that matters to you, use the HDF5 format.

In [10]:
radem.handlers.write_csv(df, DATA_CSV_DIR / "example.csv")

### Reading from CSV (not recommended)

In [11]:
df_csv = radem.handlers.read_csv(DATA_CSV_DIR / "example.csv")
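
Because of the floating-point caveat above, values that went through a text format are better compared with a tolerance than with exact equality. The sketch below uses only the standard library to show how the number of written digits decides whether a float survives the round trip. It does not reproduce how `radem` or pandas write and parse CSV; the helper `roundtrip` and the sample values are assumptions for illustration.

```python
import csv
import io
import math

original = [1.4143094841930115, 0.1, 2.5]


def roundtrip(values, fmt):
    """Write values to an in-memory CSV row using fmt, then parse them back."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fmt(v) for v in values)
    buffer.seek(0)
    return [float(cell) for cell in next(csv.reader(buffer))]


# repr() emits the shortest string that parses back to the same float
exact = roundtrip(original, repr)
# 15 significant digits are not always enough to identify a double uniquely
lossy = roundtrip(original, lambda v: f"{v:.15g}")

print(exact == original)
print(lossy == original)
# A tolerant comparison accepts the tiny differences
print(all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(lossy, original)))
```

The same idea applies to `df_csv`: compare it with `df` using a tolerance rather than expecting bit-identical values.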