# 01 - Data merging (simple, self-learning EDA)

This notebook is intentionally minimal and designed for **self-learning**. It demonstrates basic, memory-safe data loading and hourly aggregation using the included sample files `dataset/sampled_probe.csv` and `dataset/synthetic_traffic_counts.csv`. For basic EDA you only need a working Python installation and `pip install -r requirements.txt`. The kernel creation and registration steps are optional; skip them if you prefer to keep things simple.

In [None]:
## 1) Detect available Python interpreters and Jupyter kernels

import sys
print('sys.executable:', sys.executable)
print('sys.version:', sys.version)

# List kernels via jupyter_client
import jupyter_client
print('Available kernel specs:', jupyter_client.kernelspec.find_kernel_specs())

# Shell checks (Windows / POSIX)
# On Windows: `!where python`
# On POSIX: `!which python`
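
A portable alternative to `where`/`which` is the standard library's `shutil.which`, which searches `PATH` the same way on Windows and POSIX. This is a minimal sketch: the helper name `find_interpreters` and the candidate names (`python`, `python3`, and the Windows launcher `py`) are assumptions chosen for illustration.

```python
import shutil
import sys


def find_interpreters(names=("python", "python3", "py")):
    """Map each candidate command name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


found = find_interpreters()
for name, path in found.items():
    print(f"{name:<8} -> {path or 'not on PATH'}")

# The interpreter actually running this kernel may differ from all of the above.
print("kernel   ->", sys.executable)
```

If none of the paths match `sys.executable`, the notebook kernel is running from an environment that is not first on `PATH`, which is common when a registered kernel is selected.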

## 2) Create and activate the recommended virtual environment

# POSIX (bash) example (run in a terminal cell)
# %%bash
# python -m venv .venv
# source .venv/bin/activate

# Windows (PowerShell) example (run in a terminal cell)
# python -m venv .venv
# .\.venv\Scripts\Activate.ps1

# Note: Shell commands can be run from a notebook with %%bash or !, but each one runs in its own subprocess, so an environment activated that way does not persist across cells. Prefer creating the env in a terminal and then registering the kernel.

In [None]:
## 3) Install `ipykernel` and required packages in the environment

# Use the `%pip` magic inside the active kernel (ensures installation in kernel env)
# Example: run in a notebook cell after activating the env or switching kernel
%pip install --upgrade pip
%pip install ipykernel pandas matplotlib seaborn jupyterlab

# Optional: create a pinned requirements file
# (this overwrites any existing requirements.txt in the working directory)
%pip freeze > requirements.txt

In [None]:
## 4) Register a Jupyter kernel for the environment

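# Register the active environment as a kernel. Using sys.executable makes sure the
# command runs with this kernel's interpreter, not whichever `python` is first on PATH.
import sys
!"{sys.executable}" -m ipykernel install --user --name recommended-env --display-name "Recommended (recommended-env)"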

# Verify kernel is registered
import jupyter_client
print('Kernel specs after install:', jupyter_client.kernelspec.find_kernel_specs())

In [None]:
## 5) Configure notebook `kernelspec` metadata so VS Code recommends the kernel

import nbformat
nb_path = 'notebooks/01-data-merging-example.ipynb'
nb = nbformat.read(nb_path, as_version=4)
nb['metadata']['kernelspec'] = {
    'name':'recommended-env',
    'display_name':'Recommended (recommended-env)',
    'language':'python'
}
nbformat.write(nb, nb_path)
print(f"Wrote kernelspec metadata to {nb_path}. After running this cell, reopen the notebook in VS Code to see the recommended kernel.")
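
Because an `.ipynb` file is plain JSON, the same metadata edit can be sketched with the standard library alone. The helper name `set_kernelspec` and the throwaway `demo.ipynb` file are hypothetical, and unlike `nbformat` this approach performs no schema validation, so treat it as an illustration of what the cell above changes rather than a replacement for it.

```python
import json
import tempfile
from pathlib import Path


def set_kernelspec(nb_file, name, display_name):
    """Set metadata.kernelspec in a notebook file, leaving cells untouched."""
    path = Path(nb_file)
    nb = json.loads(path.read_text(encoding="utf-8"))
    nb.setdefault("metadata", {})["kernelspec"] = {
        "name": name,
        "display_name": display_name,
        "language": "python",
    }
    path.write_text(json.dumps(nb, indent=1) + "\n", encoding="utf-8")
    return nb["metadata"]["kernelspec"]


# Demonstrate on a minimal throwaway notebook, not on the real one.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.ipynb"
    empty_nb = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
    demo.write_text(json.dumps(empty_nb), encoding="utf-8")
    set_kernelspec(demo, "recommended-env", "Recommended (recommended-env)")
    result = json.loads(demo.read_text(encoding="utf-8"))

print(result["metadata"]["kernelspec"])
```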

In [None]:
## 6) Verify kernel selection and environment variables from inside the notebook

import sys, os
print('sys.executable ->', sys.executable)
print('sys.version ->', sys.version)
print('VIRTUAL_ENV ->', os.environ.get('VIRTUAL_ENV'))

# Check that the pip found on PATH belongs to the same env as sys.executable
!python -m pip --version
!which pip || where pip
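
`VIRTUAL_ENV` is set by the activation script, so it can be empty when a registered kernel is launched without activating the environment first. A check that does not depend on activation compares `sys.prefix` with `sys.base_prefix`. This is a small sketch; the helper name `in_virtualenv` is hypothetical, and the check applies to `venv`/`virtualenv` environments, not conda environments.

```python
import os
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv/virtualenv.

    Inside such an environment sys.prefix points at the env directory,
    while sys.base_prefix still points at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("in virtualenv:", in_virtualenv())
print("sys.prefix   :", sys.prefix)
print("VIRTUAL_ENV  :", os.environ.get("VIRTUAL_ENV"))
```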

In [None]:
## 7) Use `%pip` and `%conda` magics to manage packages reproducibly

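# Prefer `%pip` inside a notebook so the install goes into the active kernel's env
# Pin an exact version for reproducibility (pick one that supports your Python version)
%pip install numpy==1.24.0
# Show the first 20 installed packages (`head` requires a POSIX shell)
%pip freeze | head -n 20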

# If using a conda-managed kernel, `%conda` magics can be used (only in conda kernels)
# Example: %conda install -y pandas

In [None]:
## 8) Run sample code to confirm environment isolation and package versions

import numpy as np
import pandas as pd
print('numpy', np.__version__)
print('pandas', pd.__version__)

# If you have different kernels, switch kernels and re-run this cell to observe version differences.
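
Versions can also be read from package metadata without importing the packages, which helps when an import itself fails. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_versions` and the placeholder package `no-such-package-xyz` are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


report = installed_versions(["numpy", "pandas", "ipykernel", "no-such-package-xyz"])
for name, ver in report.items():
    print(f"{name:<22}{ver or 'not installed'}")
```

Running this under two different kernels gives a quick side-by-side view of what each environment has installed.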

In [None]:
## 9) Inspect and edit kernel spec files programmatically

from jupyter_client.kernelspec import KernelSpecManager, NoSuchKernel
ksm = KernelSpecManager()
# List installed kernels
print('installed kernels:', list(ksm.find_kernel_specs().keys()))
# Inspect kernel spec for 'recommended-env' (if present)
try:
    kspec = ksm.get_kernel_spec('recommended-env')
    print('recommended-env resources:', kspec.resource_dir)
except NoSuchKernel as e:
    # Raised when no kernel spec with that name is registered
    print('recommended-env kernel not found (install it first):', e)

# You can programmatically install or remove kernel specs with ksm.install_kernel_spec/remove_kernel_spec when needed.
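
Each kernel spec directory contains a `kernel.json` file whose `argv` list starts with the interpreter used to launch the kernel. The sketch below reads that file with the standard library; the helper name `kernel_interpreter` and the sample interpreter path are hypothetical, and the sample file is written to a temporary directory so the example runs without any kernel installed.

```python
import json
import tempfile
from pathlib import Path


def kernel_interpreter(resource_dir):
    """Return (interpreter path, display name) from a kernel spec directory."""
    spec_file = Path(resource_dir) / "kernel.json"
    spec = json.loads(spec_file.read_text(encoding="utf-8"))
    return spec["argv"][0], spec.get("display_name")


# Sample spec mirroring the layout that `ipykernel install` writes (path is illustrative).
sample_spec = {
    "argv": ["/path/to/.venv/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Recommended (recommended-env)",
    "language": "python",
}

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "kernel.json").write_text(json.dumps(sample_spec), encoding="utf-8")
    interpreter, display_name = kernel_interpreter(tmp)

print(interpreter, "|", display_name)
```

With a real kernel installed, the same helper can be pointed at `kspec.resource_dir` from the cell above to confirm which interpreter the kernel will start.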

## 10) Troubleshoot common kernel and environment issues

# Useful checks you can run from within a notebook cell or terminal
!jupyter kernelspec list
!python -m ipykernel install --user --name recommended-env --display-name "Recommended (recommended-env)" --force
!pip check || echo 'pip check returned non-zero (inspect environment)'

# If you see kernel startup failures, check the kernel logs (VS Code 'Jupyter' output panel) and ensure the kernel's python is executable and all required packages are installed.

In [None]:
## 11) Small demo: read sample probe and counts, aggregate hourly, and merge

import pandas as pd

# Load a small sample (fast) and counts
probe = pd.read_csv('dataset/sampled_probe.csv', parse_dates=['timestamp'])
counts = pd.read_csv('dataset/synthetic_traffic_counts.csv', parse_dates=['timestamp'])

# Aggregate probe to hourly averages ('h' is the hourly alias; uppercase 'H' is deprecated since pandas 2.2)
probe['timestamp_hour'] = probe['timestamp'].dt.floor('h')
probe_hour = probe.groupby(['station_id','timestamp_hour'], as_index=False)['speed_mph'].mean().rename(columns={'speed_mph':'avg_speed_mph'})

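# Prepare counts (assumes count timestamps already fall on the hour, so they align with the floored probe timestamps)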
counts = counts.rename(columns={'timestamp':'timestamp_hour'})

# Merge
merged = counts.merge(probe_hour, on=['station_id','timestamp_hour'], how='left')
merged.head()
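
To see what the left merge does without the sample files, here is a self-contained sketch on tiny in-memory frames. The values and the `vehicle_count` column are made up for illustration (the real column names are documented in `dataset/DATASET_DESCRIPTION.md`). It adds two optional `merge` arguments: `validate` to fail fast on duplicate keys and `indicator` to flag count rows with no probe data.

```python
import pandas as pd

# Hypothetical probe readings: several per station per hour.
probe = pd.DataFrame({
    "station_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 08:05", "2024-01-01 08:40",
        "2024-01-01 09:10", "2024-01-01 08:15",
    ]),
    "speed_mph": [30.0, 50.0, 45.0, 60.0],
})

# Hypothetical hourly counts; station 3 has no probe readings at all.
counts = pd.DataFrame({
    "station_id": [1, 1, 2, 3],
    "timestamp_hour": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 09:00",
        "2024-01-01 08:00", "2024-01-01 08:00",
    ]),
    "vehicle_count": [120, 95, 80, 40],
})

# Floor to the hour, then average speed per station and hour.
probe["timestamp_hour"] = probe["timestamp"].dt.floor("h")
probe_hour = (
    probe.groupby(["station_id", "timestamp_hour"], as_index=False)["speed_mph"]
    .mean()
    .rename(columns={"speed_mph": "avg_speed_mph"})
)

# validate raises if either side has duplicate keys; indicator adds a `_merge` column.
merged = counts.merge(
    probe_hour,
    on=["station_id", "timestamp_hour"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Count rows that found no matching probe hour keep NaN in avg_speed_mph.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("rows without probe data:", len(unmatched))
```

A left merge keeps every count row, so missing probe coverage shows up as `NaN` speeds instead of silently dropping stations.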

## 12) Notes & links

- Use `dataset/sampled_probe.csv` for quick iteration.  
- See `dataset/DATASET_DESCRIPTION.md` for column meanings and tips.  
- For reproducible sampling: `tools/sample_probe.py` (reservoir sampling script).  
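
For readers unfamiliar with reservoir sampling, the idea can be sketched in a few lines. This is a generic illustration of the classic single-pass algorithm (Algorithm R), not the contents of `tools/sample_probe.py`; the function name `reservoir_sample` and its parameters are assumptions.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k items from an iterable of unknown length.

    Only k items are held in memory at any time, so the input can be a
    file iterator over a CSV far larger than RAM.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(range(10_000), k=5, seed=42)
print(sample)
```

Passing an open file object instead of `range(...)` samples lines from a large CSV in one pass; the header line would need to be read separately first.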

---

*This notebook is intentionally output-stripped. Re-run cells locally after creating/activating your environment and registering the kernel.*