# Python Dataset Exploration Notebook
This notebook demonstrates how to install dependencies, import common data science libraries, and access the mounted dataset via the DATA_DIR environment variable or /data.

## Install Dependencies
Installs packages listed in requirements.txt. Add any additional libraries your analysis needs there.

In [None]:
# Install dependencies from requirements.txt into the running kernel's environment (quiet)
%pip install --quiet -r requirements.txt

In [None]:
# Example: install a package NOT listed in requirements.txt
# Useful for quick experiments; add it to requirements.txt if you keep using it
%pip install --quiet polars
import polars as pl
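
After installation, it can help to confirm that the packages are visible to the running kernel before importing them. The sketch below is one possible check using only the standard library. The helper name missing_modules is hypothetical, and the module list is an assumption based on the imports used in this notebook.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumption: these are the top-level modules this notebook imports
required = ["pandas", "seaborn", "matplotlib", "polars"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, rerunning the install cell (or adding the package to requirements.txt) is the usual next step.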

## Import Common Libraries
Example imports of widely used data science libraries.

In [1]:
import os
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
print('pandas version:', pd.__version__)
print('seaborn version:', sns.__version__)

pandas version: 2.2.1
seaborn version: 0.13.2


## Inspect Dataset Directory
DATA_DIR is an environment variable pointing to the mounted (read-only) dataset directory. You can also access it via the /data symlink. Always prefer DATA_DIR for portability.

In [2]:
# Prefer DATA_DIR; fall back to the /data symlink if the variable is unset
data_dir = Path(os.environ.get('DATA_DIR', '/data'))
print('DATA_DIR =', data_dir)
print('\nListing via DATA_DIR:')
for p in data_dir.iterdir():
    print(' -', p.name)

# Load and preview the first CSV found at the top level of DATA_DIR, if any
csvs = sorted(data_dir.glob('*.csv'))
df = None
if csvs:
    first = csvs[0]
    print(f"\nAttempting to load {first.name} ...")
    df = pd.read_csv(first)
    print('Shape =', df.shape)
    display(df.head())

DATA_DIR = /data

Listing via DATA_DIR:
 - act_atmos-2.2.1
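
In the listing above, the dataset directory contains only a subdirectory, so a top-level glob('*.csv') finds nothing and df stays None. A minimal sketch of a recursive search follows. The helper name find_csvs and the max_results limit are hypothetical, and it assumes the files of interest use a .csv extension.

```python
import os
from itertools import islice
from pathlib import Path


def find_csvs(root, max_results=5):
    """Return up to max_results CSV paths found anywhere under root.

    The search stops after max_results matches, so the result is the first
    few files encountered (then sorted), not the first few in sorted order.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(islice(root.rglob("*.csv"), max_results))


# Assumption: fall back to /data when DATA_DIR is not set
data_root = os.environ.get("DATA_DIR", "/data")
for path in find_csvs(data_root):
    print(" -", path)
```

Any of the returned paths can be passed to pd.read_csv in the same way as the top-level example.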


## Example library usage
Converts the pandas DataFrame (df) to a Polars DataFrame (pl_df), displays the first rows, and computes simple column means using Polars expressions. The cell does nothing if no CSV was loaded above.

In [None]:
# isinstance() is already False when df is None, so no separate None check is needed
if 'df' in locals() and isinstance(df, pd.DataFrame):
    pl_df = pl.from_pandas(df)
    print('Converted pandas DataFrame to Polars; shape =', pl_df.shape)
    print(pl_df.head())
    print('\nColumn means:')
    print(pl_df.select(pl.all().mean()))

## Access another dataset
Use os.environ['PUBLIC_DATA_DIR'] (preferred) or /public_datasets to access other datasets from this environment. PUBLIC_DATA_DIR is an environment variable pointing to the mounted (read-only) directory of public datasets that have file exploration enabled, organized by record ID. You can find the record ID of another dataset in the Details section on its landing page in MSD-LIVE.

In [None]:
import os
from pathlib import Path

# PUBLIC_DATA_DIR points to all public datasets with file exploration enabled
public_dir = Path(os.environ['PUBLIC_DATA_DIR'])
print("PUBLIC_DATA_DIR =", public_dir)

# Access another dataset by using its Record ID as the path under PUBLIC_DATA_DIR
other_dataset_id = "6yawb-zyx60"
other_data_path = public_dir / other_dataset_id

print(f"Files available in dataset {other_dataset_id}:")
for f in other_data_path.iterdir():
    print(" -", f.name)
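
If the record ID is mistyped, or the dataset is not present under the public datasets mount, iterdir() raises a bare FileNotFoundError. The sketch below wraps the lookup in a small helper that reports the problem more clearly. The name public_dataset_path and its base parameter are hypothetical, and the fallback to /public_datasets assumes the path described above.

```python
import os
from pathlib import Path


def public_dataset_path(record_id, base=None):
    """Return the directory for record_id under the public datasets mount.

    Raises FileNotFoundError if no such directory exists.
    """
    if base is None:
        # Assumption: /public_datasets is the fallback when the variable is unset
        base = os.environ.get("PUBLIC_DATA_DIR", "/public_datasets")
    path = Path(base) / record_id
    if not path.is_dir():
        raise FileNotFoundError(
            f"No dataset directory for record ID {record_id!r} under {base}"
        )
    return path


try:
    other = public_dataset_path("6yawb-zyx60")
    for f in sorted(other.iterdir()):
        print(" -", f.name)
except FileNotFoundError as err:
    print(err)
```

A missing directory may simply mean the record does not have file exploration enabled, so checking the dataset's landing page is a reasonable first step.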