# Lab 8 Report: 
## FINAL PROJECT STRATEGIC PLANNING

### Lab 8 Instruction: 
https://canvas.uw.edu/files/106242897/download?download_frd=1

### Team Members:

### Dataset for your project:

---

## Understanding the Data

### **a. Data Size and Structure**

* **Dataset Size**:

  * The dataset includes thousands of labeled EEG samples, each 50 seconds long, sampled at **200 Hz** across multiple electrode channels.
  * Each sample is large: `200 Hz × 50 s = 10,000` timepoints per electrode channel.
  * EEG files are stored in `/train_eegs/` and `/test_eegs/`; spectrograms (generated from EEG data) are also available.

* **Structure**:

  * **`train.csv`** is the primary metadata file linking EEG and spectrogram samples to expert-labeled classifications.
  * Each row in `train.csv` corresponds to a **specific labeled segment** within a longer EEG recording.
  * Labels are provided by **multiple experts**. Each class has an associated **vote count**.
  * There are six target categories:

    * `seizure`, `lpd`, `gpd`, `lrda`, `grda`, `other`
  * Labeling is done on the **central 10 seconds** of each 50s EEG window (i.e., the labels refer to seconds 20–30).
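
As a quick check of the numbers above, the sketch below computes which sample indices the labeled central window covers. The constant and function names are illustrative and do not come from the dataset.

```python
SAMPLING_RATE_HZ = 200   # from the dataset description
WINDOW_SECONDS = 50      # length of each EEG sample
LABELED_SECONDS = 10     # experts label the central 10 s


def central_window_bounds(rate_hz=SAMPLING_RATE_HZ,
                          window_s=WINDOW_SECONDS,
                          labeled_s=LABELED_SECONDS):
    """Return (start, stop) sample indices of the labeled central segment."""
    total = rate_hz * window_s        # 10,000 samples per channel
    labeled = rate_hz * labeled_s     # 2,000 samples carry the label
    start = (total - labeled) // 2
    return start, start + labeled


start, stop = central_window_bounds()
print(start, stop)  # 4000 6000, i.e. seconds 20 to 30
```

Slicing a channel with `signal[start:stop]` would then isolate the part of the window the experts actually rated.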

* **Data Types**:

  * Raw EEG time series (per electrode channel)
  * Spectrograms (frequency-domain representation)
  * Metadata (e.g., offset, patient ID, annotator votes)

* **Overlap**:

  * Many EEG windows **overlap**: several rows in `train.csv` can refer to segments of the same longer recording, and the offset metadata identifies which segment each row covers.

* **File Format**:

  * Data is stored in `.parquet` format, with one column per electrode (e.g., `Fp1`, `F3`, `EKG`).

---

### **b. Data Cleanness**

* **Potential Noise**:

  * EEG data is inherently **noisy**, with possible contamination from muscle movement, eye blinks, and environmental interference.
  * An **EKG** channel is included, which can help identify cardiac artifacts (e.g., heartbeat interference picked up by the EEG electrodes).

* **Label Noise**:

  * Even trained annotators **disagree**, which is why vote counts are provided per class rather than a single deterministic label.
  * The column `expert_consensus` offers a simplified label but should be used cautiously due to possible inter-rater disagreement.
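
One way to keep the disagreement information instead of collapsing it into a single label is to normalize the vote counts into a probability distribution. The sketch below assumes the votes for one row are available as a dictionary keyed by class name; the actual column names in `train.csv` may differ, and the example row is made up.

```python
CLASSES = ["seizure", "lpd", "gpd", "lrda", "grda", "other"]


def votes_to_soft_label(votes):
    """Convert per-class vote counts into a probability distribution."""
    total = sum(votes[c] for c in CLASSES)
    if total == 0:
        raise ValueError("row has no expert votes")
    return {c: votes[c] / total for c in CLASSES}


# Hypothetical row: 3 experts voted seizure, 1 voted lpd.
example_votes = {"seizure": 3, "lpd": 1, "gpd": 0,
                 "lrda": 0, "grda": 0, "other": 0}
soft = votes_to_soft_label(example_votes)
print(soft["seizure"], soft["lpd"])  # 0.75 0.25
```

A row where all experts agree maps to a one-hot vector, so this representation reduces to a hard label whenever there is consensus.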

* **Simulator limitations**:

  * This dataset is **real-world**, not simulated. However, **sampling frequency and electrode coverage** may limit some types of fine-grained analysis.

---

### **c. Diversity of Features**

* **Feature Space**:

  * Each EEG sample consists of multiple channels (electrode signals), each with 10,000 data points (50s × 200Hz).
  * Spectrograms provide time-frequency representations, increasing feature diversity across frequency bins and brain regions.

* **Labels**:

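  * There are 6 target classes. Because expert votes can be split across classes, each sample can carry votes for **multiple** brain activity types, so the targets are effectively **multi-label** (a vote distribution over the six classes rather than a single hard label).
  * Label distribution is **highly imbalanced**: seizure events are **rare**, while `other` may be common.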

* **Patient Diversity**:

  * Samples come from different patients (`patient_id` is included), and cross-patient variation (e.g., age, pathology) could affect generalization.

* **Conclusion**:

  * The dataset is rich and complex, with both time-domain and frequency-domain inputs, weak supervision (due to annotator disagreement), and a multi-label output space.

---



## Understanding the Task

### a. **Problem type**

* **Binary classification** of EEG segments:

  * Predict whether a time window of EEG signals contains harmful brain activity.

---

### b. **Significance**

* Early detection of harmful brain activity is critical for timely medical intervention, for example in seizure detection and coma monitoring.
* A reliable classifier could assist medical professionals or automate continuous monitoring in ICUs.

---

### c. **Evaluation metric**

* The official Kaggle competition uses:

  * **Log loss** (binary cross-entropy, e.g., Keras `BinaryCrossentropy`)
  * **AUC-ROC** or **accuracy** may also be reported, but final rankings are based on log loss.

---

### d. **Good performance**

* A good model should achieve:

  * **Low log loss** (e.g., < 0.3 on validation)
  * Generalize across patients (i.e., work well on unseen EEG patterns)
  * Handle noisy signals robustly
* **Balanced sensitivity and specificity** are also desirable: high sensitivity avoids missing harmful events, and high specificity limits false alarms.
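
Sensitivity and specificity can be read directly off the confusion-matrix counts. The sketch below is a minimal plain-Python version; the label lists are made-up illustration data, and predictions are assumed to be already thresholded to 0 or 1.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Illustrative labels: 3 harmful windows, 5 non-harmful windows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3))  # 0.667 0.8
```

Here one of three harmful windows is missed (sensitivity 2/3) and one of five non-harmful windows raises a false alarm (specificity 4/5).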

---

### e. **Baseline methods**

* A basic **fully connected neural network (MLP)** on flattened EEG signals.
* Classical baselines could include:

  * **Logistic Regression** on frequency-domain features (FFT)
  * **Random Forests** on engineered statistical features
* More advanced baselines:

  * **CNNs** for spatial-temporal patterns
  * **RNNs** or **Transformers** for sequential dependencies
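
For the classical baselines, each channel first has to be reduced to a handful of numbers. The sketch below shows one possible set of engineered statistical features using only the standard library; the specific features (mean, standard deviation, peak-to-peak range, line length) are an assumption for illustration, not a fixed choice for the project.

```python
import statistics


def channel_features(signal):
    """Summary statistics for one channel's samples."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "peak_to_peak": max(signal) - min(signal),
        # Line length: total absolute sample-to-sample change.
        "line_length": sum(abs(b - a) for a, b in zip(signal, signal[1:])),
    }


def eeg_features(eeg):
    """Flatten per-channel statistics into one feature dict per sample."""
    features = {}
    for channel, signal in eeg.items():
        for name, value in channel_features(signal).items():
            features[f"{channel}_{name}"] = value
    return features


# Toy two-channel sample; real channels would hold 10,000 values each.
toy_eeg = {"Fp1": [0.0, 1.0, 0.0, -1.0], "F3": [2.0, 2.0, 2.0, 2.0]}
features = eeg_features(toy_eeg)
print(features["Fp1_peak_to_peak"], features["Fp1_line_length"])  # 2.0 3.0
```

The resulting flat dictionary (one row per EEG sample) is the kind of tabular input a random forest or logistic regression expects.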

---

## Developing an initial plan for your project

### a. **Inputs and Outputs**

* **Input**: tensor of shape `(6, 10000)` representing 50 s of EEG from 6 channels.

  * May be downsampled or windowed further.
  * Can also extract frequency features (STFT, spectrogram).
* **Output**: Scalar in `[0, 1]`, the predicted probability of harmful brain activity (thresholded to 0 or 1 when a hard decision is needed).

**Data preparation:**

* Normalize each EEG channel.
* Apply bandpass filter (e.g., 0.5–40 Hz).
* Handle missing or extreme values.
* Possibly segment data further (e.g., sliding window with overlap).

---

### b. **Model**

* Initial model: **1D CNN** (temporal convolution along each channel)
* Could try:

  * Multi-scale CNNs
  * CNN + GRU
  * Transformer for long-range dependencies
* Use dropout and batch normalization for regularization.
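
To make the temporal convolution concrete, the sketch below applies a single fixed kernel along the time axis of each channel in plain Python. This is only an illustration of the operation: in an actual 1D CNN the kernel weights are learned, many kernels are used per layer, and a framework layer typically also mixes information across channels. The kernel and toy signals are made up.

```python
def conv1d(signal, kernel, stride=1):
    """Valid-mode 1D cross-correlation, the operation deep-learning conv layers apply."""
    k = len(kernel)
    return [
        sum(w * signal[i + j] for j, w in enumerate(kernel))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def conv_output_length(n, kernel_size, stride=1):
    """Number of output samples for an unpadded 1D convolution."""
    return (n - kernel_size) // stride + 1


# Each channel is filtered independently along the time axis.
toy_eeg = {"Fp1": [0.0, 1.0, 2.0, 3.0, 4.0], "F3": [1.0, 1.0, 1.0, 1.0, 1.0]}
diff_kernel = [-1.0, 1.0]  # responds to sample-to-sample change
feature_maps = {ch: conv1d(sig, diff_kernel) for ch, sig in toy_eeg.items()}
print(feature_maps["Fp1"])  # [1.0, 1.0, 1.0, 1.0]
print(conv_output_length(10_000, kernel_size=7, stride=2))  # 4997
```

The output-length helper is useful when stacking layers, since each strided convolution shortens the 10,000-sample time axis.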

---

### c. **Loss function**

* **Binary Cross-Entropy Loss**, weighted if data is imbalanced.
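
A minimal sketch of a weighted binary cross-entropy is given below. The weighting scheme (scaling the positive-class term by the ratio of negatives to positives) is one common heuristic and an assumption here, as are the function names and toy labels.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def pos_weight_from_labels(y_true):
    """Ratio of negatives to positives in the training labels."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return negatives / positives


y_true = [1, 0, 0, 0]            # toy labels: 1 harmful, 3 not harmful
y_prob = [0.5, 0.5, 0.5, 0.5]    # an uninformative model
w = pos_weight_from_labels(y_true)
print(round(weighted_bce(y_true, y_prob), 4))      # 0.6931 (= ln 2)
print(round(weighted_bce(y_true, y_prob, w), 4))   # 1.0397
```

With `pos_weight=1.0` this reduces to the ordinary log loss, so an uninformative model that always predicts 0.5 scores ln 2 ≈ 0.693. The weighted value is not comparable to that reference and should only be used for training, not for reporting.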

---

### d. **Model evaluation**

* Validation using:

  * **Log loss**
  * **Accuracy**
  * **ROC AUC**
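* Use patient-grouped K-Fold cross-validation (all samples from a patient stay in the same fold), stratified by label where possible, to prevent leakage; scikit-learn's `StratifiedGroupKFold` is one option.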
* Track learning curves and confusion matrix.
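
The leakage concern can be illustrated with a small plain-Python split that keeps every sample from a patient in the same fold. This sketch only handles the grouping part (no stratification by label), and the function names and patient IDs are hypothetical.

```python
from collections import defaultdict


def patient_group_folds(patient_ids, n_folds=5):
    """Assign every sample of a patient to the same fold.

    Patients are placed largest-first onto the currently smallest fold
    so that fold sizes stay roughly balanced.
    """
    indices_by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        indices_by_patient[pid].append(idx)

    folds = [[] for _ in range(n_folds)]
    by_size = sorted(indices_by_patient, key=lambda p: -len(indices_by_patient[p]))
    for pid in by_size:
        smallest = min(folds, key=len)
        smallest.extend(indices_by_patient[pid])
    return folds


def train_val_split(folds, val_fold):
    """Return (train_indices, val_indices) using one fold for validation."""
    val = sorted(folds[val_fold])
    train = sorted(i for k, fold in enumerate(folds) if k != val_fold for i in fold)
    return train, val


patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]  # one entry per sample
folds = patient_group_folds(patient_ids, n_folds=2)
train_idx, val_idx = train_val_split(folds, val_fold=0)
train_patients = {patient_ids[i] for i in train_idx}
val_patients = {patient_ids[i] for i in val_idx}
print(train_patients & val_patients)  # set(): no patient appears on both sides
```

Because overlapping windows from the same recording are highly correlated, a random row-level split would let near-duplicates of validation samples into the training set and inflate the validation scores.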

---
