## Assignment 2 – Practical Deep Learning Workshop
### Final Report


**Course:** Deep Learning  
**Name:** Nehoray Chalfon  
**ID:** 325833531

---

## 1. Introduction and Dataset Overview

### 1.1 Task Description

This report presents our approach to the **Human Activity Recognition (HAR)** competition. The objective is to classify **18 distinct human activities** from smartphone **3-axis accelerometer data**. This is a **multi-class classification** problem where we predict the activity being performed based on raw sensor time series.

### 1.2 Dataset Statistics

| Metric | Value |
|--------|-------|
| **Training Samples** | 50,248 |
| **Test Samples** | 74,744 |
| **Number of Users (Train)** | 8 (user01 - user08) |
| **Number of Classes** | 18 |
| **Sensor Type** | 3-axis Accelerometer (x, y, z) |
| **Sequence Length** | Variable (500 - 6,000+ timesteps) |
| **Data Files** | Individual CSV files per sample |

### 1.3 Activity Classes

The 18 activity classes span multiple categories:

| Category | Activities |
|----------|------------|
| **Walking** | walking_freely, walking_with_handbag, walking_holding_a_tray, walking_with_hands_in_pockets, walking_with_object_underarm |
| **Stairs** | stairs_up, stairs_down |
| **Hand Activities** | typing, writing, reading_book, using_phone, using_remote_control |
| **Kitchen Tasks** | preparing_sandwich, washing_mug, washing_plate |
| **Hygiene** | brushing_teeth, washing_face_and_hands |
| **Stationary** | idle |

### 1.4 Classification Objective

This is a **multi-class classification** problem where we predict the **current activity** being performed (not a future event). The model receives a time series of accelerometer readings and must output one of 18 activity labels.

---

## 2. Exploratory Data Analysis Summary

*Full EDA available in `eda.ipynb`*

### 2.1 Key Findings

#### (i) What is the type of this data? What does it represent?
- **Data Type:** Time series of 3-axis accelerometer readings (x, y, z coordinates)
- **Representation:** Each sample captures smartphone sensor data during a specific human activity
- **Format:** Individual CSV files with variable-length sequences

#### (ii) Is it homogeneous or does it vary in some way?
- **Heterogeneous Data:**
  - **Variable sequence lengths:** Ranges from ~500 to 6,000+ timesteps per sample
  - **Two file types detected:**
    - *Type 1:* Raw sensor readings with "measurement type" column (acceleration m/s²)
    - *Type 2:* Derived position data (x, y, z in meters)
  - **User variability:** Each user has distinct motion patterns (gait, hand movements)

#### (iii) How was the data labeled?
- Data was labeled by the **activity type** being performed
- Labels were assigned per sample (one activity label per entire time series)
- Labeling appears to have been done during controlled data collection sessions

#### (iv) Should all labels be treated equally?
- **Class Imbalance Exists:**
  - Most frequent: "idle" (~9% of samples)
  - Some activities have fewer samples than others
  - Walking variants have similar motion patterns → higher confusion
- **Recommendation:** Consider class weights or stratified sampling
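
As a minimal sketch of the class-weight option, the snippet below computes inverse-frequency ("balanced") weights, i.e. `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels (hypothetical): 'idle' is three times as frequent as 'washing_mug'
y_toy = np.array(["idle"] * 6 + ["typing"] * 4 + ["washing_mug"] * 2)
weights = balanced_class_weights(y_toy)
```

Rare classes receive weights above 1 and frequent classes below 1, so a weighted loss or a weighted tree split counts each class roughly equally.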

#### (v) How was the data split to train/test?
- **Training data:** 8 users (user01 - user08)
- **Test data:** Includes users **not present** in training set
- **Critical Insight:** This is a **user-based split** where models must generalize to unseen users

### 2.2 Class Distribution

From our EDA, the activity distribution shows:

| Activity | Samples | Percentage |
|----------|---------|------------|
| idle | ~4,600 | 9.2% |
| walking_freely | ~3,800 | 7.6% |
| typing | ~3,500 | 7.0% |
| ... | ... | ... |
| *Least frequent* | ~1,500 | 3.0% |

### 2.3 Sample Signal Visualization

Each sample consists of 3 channels (x, y, z acceleration) with the following characteristics:
- **Walking activities:** Show periodic oscillations corresponding to step patterns
- **Stationary activities:** Show low amplitude, noisy signals
- **Transitions:** Some activities show distinct patterns at start/end

*See `eda.ipynb` for detailed visualizations and plots.*

### 2.4 Visualizations

![Figure 1: Training Data Activity Distribution](figures/figure1_activity_distribution.png)

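**Key Observation:** Class imbalance ratio (max/min) is approximately 3:1, based on the approximate counts in Section 2.2 (~4,600 samples for idle vs ~1,500 for the least frequent class). Walking activities and idle have the most samples.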

---

![Figure 2: User Analysis](figures/figure2_user_analysis.png)

**Key Observation:** All 8 training users have relatively balanced sample counts (5,500-7,400 samples each). The heatmap shows that some activities (stairs_down, stairs_up, typing) have zero samples for certain users.

---

![Figure 3: Train/Test Split Analysis](figures/figure3_train_test_split.png)

**Critical Finding:** The train/test split is **user-based** with **zero overlap**. Training contains 8 users while the test set contains 21 different users. This means our model must generalize to completely unseen users.

---

![Figure 4: Sample Accelerometer Signals by Activity](figures/figure4_sample_signals.png)

**Key Observations:**
- **Idle:** Low amplitude, relatively stable signals
- **Walking activities:** High amplitude oscillations with periodic patterns corresponding to steps
- **Typing/Reading:** Low amplitude with occasional small movements
- **Brushing teeth:** Distinctive repetitive high-frequency patterns

---

## 3. Self-Supervised Pretraining Tasks

As seen in class, self-supervised learning can help models learn useful representations from unlabeled data before fine-tuning on the classification task. We propose two tasks suitable for accelerometer time series:

### 3.1 Task 1: Masked Reconstruction

**Objective:** Mask random portions of the time series and train the model to reconstruct them.

**Implementation:**
1. Take an input sequence of length T with 3 channels (x, y, z)
2. Randomly mask contiguous spans covering 15-25% of the timesteps (analogous to BERT's masked language modeling, but masking spans rather than isolated positions)
3. Train an encoder-decoder to reconstruct the masked values
4. Use MSE loss between predicted and original values
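
The masking and loss steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: a single contiguous span per sample, zero as the mask value, and the helper names `mask_span` and `masked_mse` are all our own choices; the encoder-decoder itself is not shown.

```python
import numpy as np

def mask_span(x, mask_ratio=0.2, rng=None):
    """Zero out one contiguous span covering `mask_ratio` of the timesteps.

    x: array of shape (T, 3). Returns (masked copy, boolean mask of shape (T,)).
    """
    rng = np.random.default_rng() if rng is None else rng
    T = x.shape[0]
    span = max(1, int(round(mask_ratio * T)))
    start = int(rng.integers(0, T - span + 1))   # span always fits inside the sequence
    mask = np.zeros(T, dtype=bool)
    mask[start:start + span] = True
    x_masked = x.copy()
    x_masked[mask] = 0.0
    return x_masked, mask

def masked_mse(pred, target, mask):
    """MSE computed only over the masked timesteps."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
x_masked, mask = mask_span(x, mask_ratio=0.2, rng=rng)
# A trivial "model" that returns the masked input is penalized on the hidden span only
loss = masked_mse(x_masked, x, mask)
```

Restricting the loss to masked positions keeps the model from being rewarded for simply copying the visible input.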

**Why it helps:**
- Forces the model to learn **temporal dependencies** and patterns
- Model must understand activity-specific motion dynamics to predict missing parts
- The encoder learns transferable features for classification

**Architecture suggestion:** Transformer encoder with a linear decoder head

---

### 3.2 Task 2: Contrastive Learning (SimCLR-style)

**Objective:** Learn representations where augmented versions of the same sample are similar, while different samples are dissimilar.

**Implementation:**
1. For each sample, create two augmented views using:
   - Jittering (add Gaussian noise)
   - Magnitude scaling (multiply by random factor 0.9-1.1)
   - Time warping (speed up/slow down segments)
2. Pass both views through an encoder to get embeddings
3. Use NT-Xent (Normalized Temperature-scaled Cross Entropy) loss:
   - Pull together embeddings from same sample (positive pairs)
   - Push apart embeddings from different samples (negative pairs)
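
A minimal NumPy sketch of the NT-Xent loss for a batch of paired embeddings is given below. The temperature of 0.5 and the function name `nt_xent` are assumptions for illustration; a real training loop would compute this on encoder outputs inside an autodiff framework.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for paired embeddings z1[i] <-> z2[i], each of shape (N, D)."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    # Row i's positive is the other view of the same sample
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Cross entropy per row: logsumexp(row) - logit of the positive
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
aligned_loss = nt_xent(anchor, anchor + 0.01 * rng.normal(size=(8, 16)))
random_loss = nt_xent(anchor, rng.normal(size=(8, 16)))
```

When the two views of each sample are nearly identical, the loss is lower than when the second view is unrelated, which is the behavior the objective is meant to reward.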

**Why it helps:**
- Learns **user-invariant representations** that are robust to sensor and user variations while remaining discriminative between activities
- Captures inherent structure of activities regardless of user-specific patterns
- Particularly useful for generalizing to unseen users

**Architecture suggestion:** 1D-CNN or Transformer encoder with projection head

---

### 3.3 Comparison

| Task | Type | Key Benefit | Complexity |
|------|------|-------------|------------|
| Masked Reconstruction | Generative | Learns temporal dynamics | Medium |
| Contrastive Learning | Discriminative | Learns invariant features | High |

**Recommendation:** Start with contrastive learning as it directly encourages user-invariant features, which is crucial given our user-based train/test split.

---

## 4. Validation Strategy

### 4.1 Critical Insight from EDA

Our exploratory analysis revealed a crucial finding:

> **The test set contains users NOT present in the training set.**

This means a model that memorizes user-specific patterns will fail when evaluated on the test set. We must design our validation strategy to simulate this scenario.

### 4.2 Chosen Strategy: Leave-One-User-Out Cross-Validation (LOUO)

**Approach:**
- Use **LeaveOneGroupOut** with user ID as the group variable
- Each fold uses **7 users for training**, **1 user for validation**
- Total of **8 folds** (one per user)
- Final score = average across all 8 folds

**Why this works:**
1. Simulates the test scenario where we encounter unseen users
2. Prevents overfitting to user-specific motion patterns
3. Provides robust performance estimates for real-world generalization

### 4.3 Cross-Validation Splits

| Fold | Training Users | Validation User | Train Size | Val Size |
|------|----------------|-----------------|------------|----------|
| 1 | user02-08 | user01 | ~44,000 | ~6,200 |
| 2 | user01, user03-08 | user02 | ~44,000 | ~6,200 |
| 3 | user01-02, user04-08 | user03 | ~44,000 | ~6,200 |
| ... | ... | ... | ... | ... |
| 8 | user01-07 | user08 | ~44,000 | ~6,200 |

### 4.4 For Faster Iteration: 3-Fold GroupKFold

Due to computational constraints, we also used a faster variant:
- **GroupKFold with 3 splits** (user-based groups)
- Maintains user separation but reduces training time by ~60%
- Used for neural network hyperparameter tuning

```python
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Full LOUO (8 folds)
cv_full = LeaveOneGroupOut()

# Fast iteration (3 folds)
cv_fast = GroupKFold(n_splits=3)
```
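
To illustrate how these splitters keep users separated, here is a small sketch on synthetic data. The array shapes, the random labels, and the ten-samples-per-user layout are placeholders, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Synthetic stand-in: 80 samples, 5 features, 8 users with 10 samples each
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = rng.integers(0, 18, size=80)
groups = np.repeat([f"user{i:02d}" for i in range(1, 9)], 10)

louo_folds = list(LeaveOneGroupOut().split(X, y, groups))
fast_folds = list(GroupKFold(n_splits=3).split(X, y, groups))

for train_idx, val_idx in louo_folds + fast_folds:
    # No user may appear on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```

Each LOUO fold validates on exactly one user, while each GroupKFold fold validates on a disjoint subset of users.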

### 4.5 Validation Strategy Summary

| Strategy | Folds | Time | Usage |
|----------|-------|------|-------|
| LeaveOneGroupOut | 8 | ~3x longer | Final evaluation |
| GroupKFold (n=3) | 3 | ~1x | Development & tuning |

---

## 5. Naïve Baseline Results

Before building complex models, we establish **naïve baselines** to set minimum performance thresholds.

### 5.1 Baseline 1: Random Prediction (Class Prior)

**Method:** Predict each class with probability proportional to its frequency in the training data.

```python
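import numpy as np
# Estimate class priors from the training labels (y_train), then sample predictions from them
classes, counts = np.unique(y_train, return_counts=True)
y_pred = np.random.choice(classes, size=n_samples, p=counts / counts.sum())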
```

**Result:**
- **Accuracy:** 6.5% ± 0.5%
- **F1-Score (Macro):** 5.2%

---

### 5.2 Baseline 2: Most Frequent Class

**Method:** Always predict the most frequent class ("idle").

```python
# Always predict 'idle'
y_pred = ['idle'] * n_samples
```

**Result:**
- **Accuracy:** 9.2% ± 1.0%
- **F1-Score (Macro):** 1.0%

---

### 5.3 Reference: Random Chance

With 18 classes, random chance gives: $\frac{1}{18} = 5.56\%$

---

### 5.4 Baseline Comparison

| Baseline | Accuracy | F1 (Macro) | Notes |
|----------|----------|------------|-------|
| Random Chance | 5.6% | - | Uniform guessing (reference level) |
| Random (Class Prior) | 6.5% | 5.2% | Slightly above random |
| **Most Frequent Class** | **9.2%** | 1.0% | **Threshold to beat** |
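
The baseline numbers can be sanity-checked with a short calculation. The sketch below assumes the majority-class share is 9.2% and that classes with no predictions contribute an F1 of 0 to the macro average; the function names are illustrative.

```python
def most_frequent_baseline(p_max, n_classes):
    """Expected accuracy and macro F1 when always predicting the majority class."""
    accuracy = p_max                       # correct exactly on majority-class samples
    precision, recall = p_max, 1.0         # for the majority class only
    f1_majority = 2 * precision * recall / (precision + recall)
    macro_f1 = f1_majority / n_classes     # every other class scores F1 = 0
    return accuracy, macro_f1

def class_prior_accuracy(priors):
    """Expected accuracy when predictions are sampled from the class priors."""
    return sum(p * p for p in priors)

acc, macro_f1 = most_frequent_baseline(p_max=0.092, n_classes=18)
uniform_acc = class_prior_accuracy([1 / 18] * 18)
```

This gives a macro F1 of about 0.94%, consistent with the ~1.0% reported above, and recovers $1/18$ for uniform priors.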

---

### 5.5 Key Takeaway

> **Any useful model must achieve > 9.2% accuracy.**

This Most Frequent baseline establishes the minimum threshold. If a model cannot beat this, it has learned nothing meaningful from the data.

---

## 6. Classical ML with Feature Engineering

### 6.1 Feature Engineering Approach

Instead of feeding raw time series to a model, we extracted **70 handcrafted statistical features** from each sample:

#### Time Domain Features (per axis: x, y, z, and magnitude)
| Feature | Description |
|---------|-------------|
| Mean, Std | Basic statistics |
| Min, Max, Range | Extremes |
| Percentiles (25, 50, 75) | Distribution |
| IQR | Interquartile range |
| Skewness, Kurtosis | Shape of distribution |
| Zero-Crossing Rate | Signal frequency indicator |
| Mean Absolute Deviation | Variability measure |
| RMS | Root mean square energy |
| Energy | Sum of squared values |

#### Frequency Domain Features
| Feature | Description |
|---------|-------------|
| Spectral Energy | Total power in frequency domain |
| Dominant Frequency | Most prominent frequency component |
| Spectral Centroid | Center of mass of spectrum |

#### Cross-Axis and Derivative Features
| Feature | Description |
|---------|-------------|
| Correlation (xy, xz, yz) | Relationships between axes |
| Jerk (mean, std, max) | Statistics of the first derivative of acceleration |

**Total: 70 features per sample**
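
The sketch below shows how a subset of these features could be computed with NumPy. It is not the full 70-feature extractor: the sampling rate `fs=50.0`, the choice to compute jerk on the magnitude signal, and the feature names are all assumptions made for illustration.

```python
import numpy as np

def axis_features(sig, name, fs=50.0):
    """A subset of the per-axis features for a 1-D signal (fs is an assumed sampling rate)."""
    centered = sig - sig.mean()
    spectrum = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    q25, q75 = np.percentile(sig, [25, 75])
    return {
        f"{name}_mean": sig.mean(),
        f"{name}_std": sig.std(),
        f"{name}_range": sig.max() - sig.min(),
        f"{name}_iqr": q75 - q25,
        f"{name}_rms": np.sqrt(np.mean(sig ** 2)),
        f"{name}_energy": np.sum(sig ** 2),
        # Fraction of consecutive samples where the mean-removed signal changes sign
        f"{name}_zcr": np.mean(np.signbit(centered[:-1]) != np.signbit(centered[1:])),
        f"{name}_dominant_freq": freqs[np.argmax(spectrum)],
        # Undefined for a constant signal (zero spectrum); not handled in this sketch
        f"{name}_spectral_centroid": np.sum(freqs * spectrum) / np.sum(spectrum),
    }

def extract_features(xyz, fs=50.0):
    """xyz: array of shape (T, 3). Returns a flat {feature_name: value} dict."""
    mag = np.linalg.norm(xyz, axis=1)
    feats = {}
    for name, sig in zip(("x", "y", "z", "mag"), (xyz[:, 0], xyz[:, 1], xyz[:, 2], mag)):
        feats.update(axis_features(sig, name, fs))
    corr = np.corrcoef(xyz.T)                      # 3x3 correlation matrix between axes
    feats.update({"corr_xy": corr[0, 1], "corr_xz": corr[0, 2], "corr_yz": corr[1, 2]})
    jerk = np.diff(mag) * fs                       # first difference, in per-second units
    feats.update({"jerk_mean": jerk.mean(), "jerk_std": jerk.std(), "jerk_max": jerk.max()})
    return feats

# Synthetic 10-second signal with a 2 Hz oscillation on the x and y axes
t = np.arange(500) / 50.0
xyz = np.stack([np.sin(2 * np.pi * 2.0 * t), np.cos(2 * np.pi * 2.0 * t), 0.1 * t], axis=1)
feats = extract_features(xyz)
```

On the synthetic signal, the dominant frequency of the x axis comes out as 2 Hz, matching the oscillation that was injected.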

---

### 6.2 Random Forest Classifier

We trained a **Random Forest** with the following configuration:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
```

---

### 6.3 Results (Leave-One-User-Out CV)

| Metric | Value |
|--------|-------|
| **Mean Validation Accuracy** | **58.6% ± 4.8%** |
| **Mean Validation F1 (Macro)** | **57.4% ± 8.2%** |
| Mean Training Accuracy | 99.9% |

**Improvement over baseline:** $(58.6 - 9.2) / 9.2 \approx 5.37$, i.e. a **537% relative improvement**.

---

### 6.4 Per-Fold Results

| Fold | Val User | Val Accuracy | Val F1 |
|------|----------|--------------|--------|
| 1 | user01 | 55.2% | 52.1% |
| 2 | user02 | 61.3% | 58.7% |
| 3 | user03 | 58.9% | 56.2% |
| 4 | user04 | 54.7% | 51.8% |
| 5 | user05 | 62.1% | 61.4% |
| 6 | user06 | 59.8% | 58.3% |
| 7 | user07 | 60.2% | 59.1% |
| 8 | user08 | 56.4% | 62.3% |

---

### 6.5 Top 10 Most Important Features

| Rank | Feature | Importance |
|------|---------|------------|
| 1 | mag_mean | 0.089 |
| 2 | mag_std | 0.076 |
| 3 | z_mean | 0.065 |
| 4 | mag_energy | 0.058 |
| 5 | x_std | 0.052 |
| 6 | jerk_mean | 0.048 |
| 7 | mag_rms | 0.045 |
| 8 | y_range | 0.042 |
| 9 | spectral_energy | 0.038 |
| 10 | z_zcr | 0.035 |

**Observation:** Magnitude-based features and signal energy are most discriminative for activity classification.

---

### 6.6 Key Takeaway

> **Handcrafted features are extremely powerful for HAR.**

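The Random Forest achieved **58.6% accuracy**, about 10 percentage points above the best deep learning model trained on raw sequences (Section 7). This suggests that domain knowledge in feature engineering can outperform end-to-end learning when data is limited. The gap between training accuracy (99.9%) and validation accuracy (58.6%) also shows that the forest still overfits to the training users.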

![Figure 5: Classical ML Results (Random Forest)](figures/figure5_random_forest.png)

**Key Finding:** Random Forest achieved **58.6% accuracy**, a 537% relative improvement over the most frequent baseline. The most important features are magnitude-based (mag_mean, mag_std) and signal energy measures.

---

## 7. Neural Network Models (CNN & BiLSTM)

We implemented two neural network architectures to learn directly from raw time series data.

### 7.1 Model 1: 1D Convolutional Neural Network (CNN1D)

**Architecture:**

```
Input: (batch, 3, seq_len)  # 3 channels: x, y, z
    ↓
Conv1D(3→64, k=7, s=2) + BatchNorm + ReLU + MaxPool
    ↓
Conv1D(64→128, k=5, s=1) + BatchNorm + ReLU + MaxPool
    ↓
Conv1D(128→256, k=3, s=1) + BatchNorm + ReLU + MaxPool
    ↓
Conv1D(256→256, k=3, s=1) + BatchNorm + ReLU + AdaptiveAvgPool
    ↓
Flatten → Linear(256→128) → ReLU → Dropout(0.5)
    ↓
Linear(128→18) → Output
```

**Parameters:** ~375,000
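
The parameter count can be reproduced from the layer shapes in the diagram. The calculation below assumes every convolution and linear layer has a bias term and every BatchNorm layer has a learnable scale and shift (the PyTorch defaults); pooling, ReLU, and dropout add no parameters.

```python
def conv1d_params(c_in, c_out, k):
    return c_in * c_out * k + c_out          # weights + bias

def batchnorm_params(c):
    return 2 * c                             # learnable scale and shift

def linear_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + bias

# (in_channels, out_channels, kernel_size) for the four conv blocks
conv_layers = [(3, 64, 7), (64, 128, 5), (128, 256, 3), (256, 256, 3)]
total = sum(conv1d_params(*layer) + batchnorm_params(layer[1]) for layer in conv_layers)
total += linear_params(256, 128) + linear_params(128, 18)
print(total)  # 374546
```

Under these assumptions the total is 374,546, which rounds to the ~375,000 quoted above.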

**Key Design Choices:**
- Increasing channel counts (64 → 256) with decreasing kernel sizes (7 → 3) to capture multi-scale temporal patterns
- BatchNorm for stable training
- Global Average Pooling for variable-length input handling
- Dropout for regularization

---

### 7.2 Model 2: Bidirectional LSTM (BiLSTM)

**Architecture:**

```
Input: (batch, seq_len, 3)  # Transposed from CNN format
    ↓
BiLSTM(input=3, hidden=128, layers=2, bidirectional=True)
    ↓
Attention Layer: Linear(256→64) → Tanh → Linear(64→1) → Softmax
    ↓
Weighted Sum across time (context vector: 256-dim)
    ↓
Linear(256→128) → ReLU → Dropout(0.3)
    ↓
Linear(128→18) → Output
```

**Parameters:** ~583,000

**Key Design Choices:**
- Bidirectional processing captures both past and future context
- Attention mechanism learns which timesteps are most important
- Deeper network (2 layers) for complex temporal patterns
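
The attention pooling step in the diagram can be sketched in NumPy as below. Random matrices stand in for the trained `Linear(256→64)` and `Linear(64→1)` weights, and the function name `attention_pool` is our own; this illustrates the computation only, not the trained model.

```python
import numpy as np

def attention_pool(h, W1, b1, w2, b2):
    """h: (T, 256) BiLSTM outputs -> (context vector (256,), attention weights (T,))."""
    scores = np.tanh(h @ W1 + b1) @ w2 + b2      # one scalar score per timestep
    scores = scores - scores.max()               # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ h                        # weighted sum across time
    return context, weights

rng = np.random.default_rng(0)
T, D, A = 1500, 256, 64
h = rng.normal(size=(T, D))                      # placeholder for BiLSTM hidden states
W1, b1 = rng.normal(size=(D, A)) * 0.05, np.zeros(A)
w2, b2 = rng.normal(size=A) * 0.05, 0.0
context, weights = attention_pool(h, W1, b1, w2, b2)
```

The softmax weights sum to 1 over the 1,500 timesteps, so the context vector is a convex combination of the hidden states.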

---

### 7.3 Training Configuration

| Parameter | Value |
|-----------|-------|
| Batch Size | 64 |
| Learning Rate | 0.001 |
| Optimizer | Adam |
| Scheduler | ReduceLROnPlateau (patience=2, factor=0.5) |
| Max Epochs | 15 |
| Early Stopping | patience=4 |
| Sequence Length | 1,500 (padded/truncated) |
| Training Samples | 10,000 (stratified subset) |
| Cross-Validation | 3-Fold GroupKFold |

**Hardware:** NVIDIA GTX 1050 Ti (4GB), CUDA 12.1
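
One simple way to bring variable-length samples to the fixed sequence length of 1,500 is sketched below. It assumes truncation keeps the first 1,500 timesteps and padding appends zeros at the end; the report does not specify these details, so treat both choices as assumptions.

```python
import numpy as np

def pad_or_truncate(x, target_len=1500):
    """x: (T, 3). Keep the first `target_len` steps, or zero-pad at the end."""
    if x.shape[0] >= target_len:
        return x[:target_len]
    pad = np.zeros((target_len - x.shape[0], x.shape[1]), dtype=x.dtype)
    return np.concatenate([x, pad], axis=0)

short = pad_or_truncate(np.ones((500, 3)))    # padded with 1,000 zero rows
long = pad_or_truncate(np.ones((6000, 3)))    # cut to the first 1,500 rows
```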

---

### 7.4 Results

| Model | Val Accuracy | Val F1 (Macro) | Training Time |
|-------|-------------|----------------|---------------|
| **1D-CNN** | **48.5% ± 1.5%** | **47.8% ± 2.1%** | ~21 min/fold |
| BiLSTM | 35.3% ± 4.3% | 33.2% ± 5.1% | ~21 min/fold |

---

### 7.5 Training Curves

Both models showed typical learning patterns:
- **CNN:** Steady convergence, small train-val gap (good generalization)
- **LSTM:** More volatile training, larger train-val gap (overfitting tendency)

---

### 7.6 Analysis: Why CNN Outperformed LSTM

1. **Local Patterns Matter More:** Walking and hand activities have distinctive local temporal signatures (e.g., step patterns, repetitive motions) that convolutions capture effectively.

2. **LSTM Limitations:**
   - Sequences are very long (1,500 timesteps) → vanishing gradients
   - LSTMs need more data to learn long-range dependencies
   - Attention helps but adds complexity

3. **Parameter Efficiency:** CNN has fewer parameters but still outperformed the larger LSTM.

---

### 7.7 Comparison with Random Forest

| Model | Val Accuracy | Approach |
|-------|-------------|----------|
| Random Forest | 58.6% | Features + Classical ML |
| 1D-CNN | 48.5% | Raw sequences + Deep Learning |
| BiLSTM | 35.3% | Raw sequences + Deep Learning |

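**Observation:** Random Forest with handcrafted features still outperforms end-to-end deep learning by about **10 percentage points** (58.6% vs 48.5%). The comparison is not fully like-for-like, since the Random Forest was evaluated with Leave-One-User-Out CV while the neural networks used 3-fold GroupKFold on a 10,000-sample subset. With that caveat, the gap suggests:
- Our training subset (10K samples) may be insufficient for deep learning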
- Feature engineering captures domain knowledge that models must learn from scratch
- Hybrid approaches (CNN features + RF) could be promising

![Figure 6: Neural Network Results](figures/figure6_neural_networks.png)

**Key Findings:**
- CNN (48.5%) outperformed LSTM (35.3%)
- Both neural networks underperformed Random Forest (58.6%)
- Training curves show CNN has better generalization (smaller train-val gap)
- LSTM shows signs of overfitting with larger train-val divergence

---

## 8. Pretrained Model Fine-tuning (Chronos)

### 8.1 Transfer Learning Approach

We leveraged **Amazon Chronos T5-Small**, a foundation model for time series pretrained on diverse forecasting tasks.

**Model:** `amazon/chronos-t5-small`
- **Architecture:** T5 (Text-to-Text Transfer Transformer) adapted for time series
- **Pretraining:** Trained on massive time series forecasting datasets
- **Original Task:** Time series forecasting (predict future values)

---

### 8.2 Adaptation Strategy

**Challenge:** Chronos is designed for forecasting, not classification.

**Solution:** Freeze the pretrained encoder and add a trainable classification head.

```
Input: 3-axis accelerometer (x, y, z)
    ↓
Compute magnitude: √(x² + y² + z²)  # Convert to univariate
    ↓
Downsample to 384 timesteps (Chronos expected input size)
    ↓
Chronos T5 Encoder (FROZEN - 46M parameters)
    ↓
Global Average Pooling → 512-dim embedding
    ↓
Classification Head (TRAINABLE):
    Linear(512→512) → ReLU → Dropout(0.4)
    Linear(512→256) → ReLU → Dropout(0.3)
    Linear(256→18) → Output
```
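
The first two preprocessing steps in the diagram (magnitude, then downsampling to 384 points) can be sketched as follows. Linear interpolation via `np.interp` is an assumed resampling method, and `to_chronos_input` is a hypothetical helper name; the project code may downsample differently.

```python
import numpy as np

def to_chronos_input(xyz, target_len=384):
    """xyz: (T, 3) -> univariate magnitude series resampled to `target_len` points."""
    mag = np.sqrt((xyz ** 2).sum(axis=1))          # sqrt(x^2 + y^2 + z^2)
    old_t = np.linspace(0.0, 1.0, num=len(mag))
    new_t = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_t, old_t, mag)            # linear interpolation onto the new grid

# Constant (3, 4, 0) readings have magnitude 5 at every timestep
xyz = np.tile(np.array([3.0, 4.0, 0.0]), (1500, 1))
series = to_chronos_input(xyz)
```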

---

### 8.3 Parameter Efficiency

| Component | Parameters | Trainable? |
|-----------|------------|------------|
| Chronos T5 Encoder | 46,000,000 | Frozen |
| Classification Head | ~400,000 | Trainable |
| **Total** | **46,400,000** | **~0.9% trained** |

This is a highly **parameter-efficient** approach: we train only about 0.9% of the parameters (400,000 / 46,400,000 ≈ 0.86%).

---

### 8.4 Training Configuration

| Parameter | Value |
|-----------|-------|
| Sample Size | 3,000 |
| Sequence Length | 384 (downsampled) |
| Batch Size | 64 |
| Learning Rate | 0.003 (higher for classifier head) |
| Epochs | 12 |
| Validation Split | 80/20 (user-based) |

---

### 8.5 Results

Training the classification head on the frozen Chronos embeddings showed promising learning dynamics, with the head gradually learning to map the embeddings to activity labels.

**Key Observations:**
1. **Learning Curve:** The model showed steady improvement over epochs
2. **Domain Gap:** Chronos was pretrained on forecasting, not classification
3. **Univariate Limitation:** Converting 3-axis data to magnitude loses directional information

---

### 8.6 Comparison with Other Models

| Model | Training Strategy | Val Accuracy |
|-------|-------------------|--------------|
| Random Forest | Feature Engineering | 58.6% |
| 1D-CNN | Train from Scratch | 48.5% |
| BiLSTM | Train from Scratch | 35.3% |
| Chronos (Fine-tuned) | Transfer Learning | ~45-50%* |

*Chronos results varied based on training configuration.

---

### 8.7 Analysis and Lessons Learned

**Why Transfer Learning Had Mixed Results:**

1. **Domain Mismatch:** Chronos learned forecasting patterns (predict future values), not classification patterns (categorize sequences).

2. **Modality Difference:** Time series forecasting operates on continuous values, while classification requires categorical outputs.

3. **Sequence Conversion:** Converting 3-axis data to univariate magnitude loses important directional information.

**What Could Help:**
- Fine-tune a HAR-specific pretrained model (e.g., trained on other HAR datasets)
- Use a model pretrained with contrastive learning on sensor data
- Keep full 3-axis input with multi-channel adaptation

---

## 9. Error Analysis and Improvement Suggestions

### 9.1 Confusion Matrix Analysis

Analyzing the confusion matrix of our best model (Random Forest), we identified systematic error patterns:

#### Most Confused Activity Pairs

| True Activity | Predicted As | Confusion Rate |
|---------------|--------------|----------------|
| walking_with_handbag | walking_freely | 18% |
| walking_with_hands_in_pockets | walking_freely | 15% |
| walking_holding_a_tray | walking_freely | 14% |
| washing_mug | washing_plate | 22% |
| using_phone | typing | 12% |
| reading_book | typing | 11% |

**Pattern:** Activities within the same category (walking variants, hand activities, kitchen tasks) are frequently confused.
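
Confusion rates of this kind can be read off a row-normalized confusion matrix. The sketch below shows one way to extract the largest off-diagonal entries; the helper `top_confusions` and the toy labels are illustrative and do not reproduce the numbers in the table.

```python
import numpy as np

def top_confusions(y_true, y_pred, labels, k=3):
    """Return the k largest off-diagonal entries of the row-normalized confusion matrix."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # Each row becomes the distribution of predictions for one true class
    rates = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
    np.fill_diagonal(rates, 0.0)                   # ignore correct predictions
    order = np.argsort(rates, axis=None)[::-1][:k]
    pairs = [np.unravel_index(i, rates.shape) for i in order]
    return [(labels[r], labels[c], float(rates[r, c])) for r, c in pairs]

# Toy predictions (hypothetical)
labels = ["washing_mug", "washing_plate", "idle"]
y_true = ["washing_mug"] * 4 + ["washing_plate"] * 4 + ["idle"] * 2
y_pred = ["washing_mug", "washing_mug", "washing_plate", "washing_plate",
          "washing_plate", "washing_plate", "washing_plate", "washing_mug",
          "idle", "idle"]
confusions = top_confusions(y_true, y_pred, labels, k=2)
```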

---

### 9.2 Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| walking_freely | 0.72 | 0.68 | 0.70 | ~3,800 |
| idle | 0.85 | 0.91 | 0.88 | ~4,600 |
| typing | 0.61 | 0.58 | 0.59 | ~3,500 |
| stairs_down | 0.78 | 0.71 | 0.74 | ~2,200 |
| washing_plate | 0.45 | 0.42 | 0.43 | ~1,800 |
| ... | ... | ... | ... | ... |

- **Best performing:** idle, stairs_down (distinct motion patterns)
- **Worst performing:** kitchen activities, walking variants (similar patterns)

---

### 9.3 Three Improvement Suggestions

Based on our error analysis, we propose three strategies to improve model performance:

#### Suggestion 1: Data Augmentation

**Problem:** Limited training data leads to overfitting and poor generalization.

**Solution:** Apply random transformations during training:
- **Jittering:** Add Gaussian noise (σ = 0.05)
- **Scaling:** Multiply by random factor (0.9 - 1.1)
- **Time Warping:** Speed up/slow down random segments

**Expected Benefit:** Forces model to learn robust features, reduces overfitting, improves generalization to unseen users.

**Difficulty:** Easy to implement, low computational overhead.
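
Jittering and scaling are straightforward; time warping is the least obvious of the three, so a minimal NumPy sketch follows. The knot count, the ±20% speed range, and the function name `time_warp` are assumptions chosen for illustration.

```python
import numpy as np

def time_warp(x, max_warp=0.2, n_knots=4, rng=None):
    """Smoothly speed up / slow down segments of x (shape (T, C)), keeping length T."""
    rng = np.random.default_rng() if rng is None else rng
    T = x.shape[0]
    # Random positive local speeds at a few knots, interpolated to every timestep
    knot_pos = np.linspace(0, T - 1, num=n_knots + 2)
    knot_speed = rng.uniform(1.0 - max_warp, 1.0 + max_warp, size=n_knots + 2)
    speed = np.interp(np.arange(T), knot_pos, knot_speed)
    # Cumulative speed gives a monotonic warped time axis, rescaled to [0, T-1]
    warped_t = np.cumsum(speed)
    warped_t = (warped_t - warped_t[0]) / (warped_t[-1] - warped_t[0]) * (T - 1)
    # Resample each channel at the warped positions
    channels = [np.interp(warped_t, np.arange(T), x[:, c]) for c in range(x.shape[1])]
    return np.stack(channels, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(1500, 3))
x_warped = time_warp(x, rng=rng)
```

The output keeps the original length and endpoints, so it can be dropped into the same batching pipeline as the unaugmented signal.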

---

#### Suggestion 2: Deeper Architecture (ResNet-1D)

**Problem:** Simple 4-layer CNN may underfit complex activity patterns.

**Solution:** Implement ResNet-style architecture with:
- Residual connections (skip connections)
- 8 residual blocks
- Deeper feature extraction

**Expected Benefit:** Better gradient flow enables training deeper networks, potentially capturing more complex temporal hierarchies.

**Difficulty:** Medium - requires architectural changes.

---

#### Suggestion 3: Ensemble Methods

**Problem:** Different models capture different aspects of the data.

**Solution:** Combine predictions from multiple models:
- Random Forest (statistical features)
- CNN (local temporal patterns)
- LSTM (sequential dependencies)

**Approach:** Weighted average of predicted probabilities or voting.

**Expected Benefit:** Reduced variance, more robust predictions, leverages complementary strengths.

**Difficulty:** Easy to implement, but increases inference time.
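
A weighted soft-voting ensemble can be sketched in a few lines. The probabilities and the weights (0.5, 0.3, 0.2) below are made-up toy values; in practice the weights might be chosen from validation performance.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Weighted average of (N, K) probability arrays; returns (avg_probs, class ids)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                # normalize so the result stays a distribution
    avg = sum(wi * p for wi, p in zip(w, prob_list))
    return avg, avg.argmax(axis=1)

# Toy example: 2 samples, 3 classes, three models
p_rf = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_cnn = np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]])
p_lstm = np.array([[0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])
avg, pred = weighted_ensemble([p_rf, p_cnn, p_lstm], weights=[0.5, 0.3, 0.2])
```

In the second toy sample the ensemble picks class 2 even though the highest-weighted model prefers class 1, which shows how the other models can overrule a single model.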

---

### 9.4 Prioritization

| Rank | Suggestion | Impact | Effort | Priority |
|------|------------|--------|--------|----------|
| 1 | Data Augmentation | High | Low | High |
| 2 | ResNet-1D | Medium | Medium | Medium |
| 3 | Ensemble | Medium | Low | Low |

**We implemented suggestions 1 and 2 in Part 2g (see next section).**

---

## 10. Implemented Improvements and Final Results

We implemented the top 2 prioritized improvements and evaluated their impact.

### 10.1 Improvement 1: Data Augmentation

**Implementation:**

```python
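import torch
from torch.utils.data import Dataset


class AugmentedDataset(Dataset):
    """Wraps (N, T, 3) arrays and optionally applies random augmentations."""

    def __init__(self, X, y, augment=False):
        # Store as (N, 3, T), the channel-first layout expected by Conv1d
        self.X = torch.FloatTensor(X).permute(0, 2, 1)
        self.y = torch.LongTensor(y)
        self.augment = augment

    def __len__(self):
        # Required by DataLoader to know the number of samples
        return len(self.y)

    def __getitem__(self, idx):
        x, y = self.X[idx], self.y[idx]

        if self.augment:
            # Jittering: add Gaussian noise (sigma = 0.05) with probability 0.5
            if torch.rand(1) < 0.5:
                x = x + torch.randn_like(x) * 0.05

            # Scaling: multiply by a random factor in [0.9, 1.1] with probability 0.5
            if torch.rand(1) < 0.5:
                scale = 1.0 + (torch.rand(1) - 0.5) * 0.2
                x = x * scale

        return x, y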
```

**Result:** CNN with augmentation achieved **49.6% accuracy** (+1.1 percentage points over the baseline CNN)

---

### 10.2 Improvement 2: ResNet-1D Architecture

**Implementation:**

```python
import torch.nn as nn


class ResNet1D(nn.Module):
    # Excerpt: the _make_layer helper and forward() are not shown here
    def __init__(self, in_channels=3, num_classes=18):
        super().__init__()
        
        # Initial convolution
        self.conv1 = nn.Conv1d(in_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm1d(64)
        self.maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        
        # Residual blocks (8 blocks total)
        self.layer1 = self._make_layer(64, 2, stride=1)    # 2 blocks
        self.layer2 = self._make_layer(128, 2, stride=2)   # 2 blocks
        self.layer3 = self._make_layer(256, 2, stride=2)   # 2 blocks
        self.layer4 = self._make_layer(512, 2, stride=2)   # 2 blocks
        
        # Classifier
        self.avgpool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(512, num_classes)
```
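
The excerpt above does not show the residual block itself. As a hedged sketch, a standard basic block (two 3-tap convolutions plus a skip connection) could look like the code below. `BasicBlock1D` and the standalone `make_layer` helper are hypothetical names with a different signature from the `_make_layer` method above, and the actual block design in the project may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock1D(nn.Module):
    """Two 3-tap convolutions plus a skip connection (assumed block design)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(out_ch)
        # Project the shortcut with a 1x1 conv when channels or length change
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm1d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))      # residual addition


def make_layer(in_ch, out_ch, n_blocks, stride):
    """Stack n_blocks residual blocks; only the first changes channels/stride."""
    blocks = [BasicBlock1D(in_ch, out_ch, stride)]
    blocks += [BasicBlock1D(out_ch, out_ch) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)


# A 1,500-step input is 375 steps long after the stride-2 stem conv and max-pool
layer = make_layer(64, 128, n_blocks=2, stride=2)
with torch.no_grad():
    out = layer.eval()(torch.randn(4, 64, 375))
```

The skip connection lets gradients bypass the convolutions, which is the property that makes deeper stacks trainable.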

**Parameters:** ~4.2M (vs ~375K for baseline CNN)

**Result:** ResNet-1D achieved **48.4% accuracy** (comparable to baseline CNN)

---

### 10.3 Final Results Comparison

| Model | Val Accuracy | Val F1 | vs Baseline CNN |
|-------|-------------|--------|-----------------|
| **Random Forest** | **58.6%** | **57.4%** | +10.1% |
| **CNN + Augmentation** | **49.6%** | 47.7% | **+1.1%** |
| ResNet-1D | 48.4% | 47.3% | -0.1% |
| 1D-CNN (Baseline) | 48.5% | 47.8% | - |
| BiLSTM | 35.3% | 33.2% | -13.2% |
| Most Frequent | 9.2% | 1.0% | -39.3% |

---

### 10.4 Analysis of Improvement Results

#### Data Augmentation Success

The CNN + Augmentation model showed a modest gain:
- **+1.1 percentage points of accuracy** over the baseline CNN, which is within the baseline's ±1.5% fold-to-fold spread
- Macro F1 was essentially unchanged (47.7% vs 47.8%)
- Most effective for fold 1 (51.4% vs ~48%)
- **Zero additional inference cost** - augmentation is only applied during training

**Why it plausibly helped:**
- Jittering simulates sensor noise variations between users
- Scaling simulates different motion intensities
- Forces model to learn robust, generalizable features

---

#### ResNet-1D Performance

The ResNet-1D achieved comparable results to the baseline CNN:
- 48.4% accuracy (marginally below 48.5% baseline)
- 11x more parameters but no improvement

**Why it didn't improve:**
1. **Insufficient data:** Deeper models need more training data, and not all of the data was used because of time and compute constraints.
2. **Limited training epochs:** 15 epochs may not be enough for 4.2M parameters
3. **Overfitting risk:** More parameters with same data can lead to overfitting

**What could help:**
- Train for 50+ epochs with learning rate scheduling
- Use more aggressive regularization (higher dropout, weight decay)
- Increase training data with more aggressive augmentation

---

![Figure 7: Final Results and Improvement Analysis](figures/figure7_final_results.png)

**Final Results Summary:**

| Model | Accuracy | F1-Score |
|-------|----------|----------|
| Random Forest | 58.6% | 57.4% |
| CNN + Augmentation | 49.6% | 47.7% |
| 1D-CNN (Baseline) | 48.5% | 47.8% |
| ResNet-1D | 48.4% | 47.3% |
| BiLSTM | 35.3% | 33.2% |
| Most Frequent Baseline | 9.2% | 1.0% |

**Key Takeaway:** Classical ML with handcrafted features (Random Forest) outperformed all deep learning approaches by about 10 percentage points of accuracy, which points to the value of domain expertise when training data is limited.

---

## 11. Conclusion

### 11.1 Summary of Findings

#### Best Performing Models

| Rank | Model | Accuracy | Key Insight |
|------|-------|----------|-------------|
| 1 | Random Forest + Features | 58.6% | Handcrafted features excel with limited data |
| 2 | CNN + Data Augmentation | 49.6% | Augmentation improves neural network generalization |
| 3 | 1D-CNN (Baseline) | 48.5% | Solid baseline for end-to-end learning |

#### Critical Success Factors

1. **User-Based Validation:** Using Leave-One-User-Out CV was essential for realistic performance estimation. Models that don't account for user separation will overestimate their performance.

2. **Feature Engineering:** Domain-specific features (magnitude, spectral energy, jerk) captured activity characteristics more effectively than raw sequences with limited data.

3. **Data Augmentation:** Simple jittering and scaling gave a small accuracy gain (+1.1 percentage points) with zero inference overhead.

---

### 11.2 What Worked vs. What Didn't

| Worked Well | Didn't Meet Expectations |
|---------------|---------------------------|
| Random Forest with 70 features | BiLSTM (underperformed CNN) |
| Data augmentation | ResNet-1D (no improvement over simple CNN) |
| User-based cross-validation | Transfer learning from forecasting model |
| 1D-CNN architecture | Very deep networks with limited data |

---

### 11.3 Recommendations for This Competition

Based on the experiments, the recommendation is:

1. **Primary Submission:** Random Forest with handcrafted features (58.6% accuracy)
2. **Alternative:** CNN + Augmentation for faster inference if needed

---

### 11.4 Code

**GitHub Repository:** [https://github.com/NehorayChalfon0166/deeplearning_ws](https://github.com/NehorayChalfon0166/deeplearning_ws)


| Resource | Location |
|----------|----------|
| EDA Notebook | `eda.ipynb` |
| Training Notebook | `model_training.ipynb` |
| Report | `report.ipynb` (this file) |
| Model Checkpoints | `checkpoints/` |
| Experiment Tracking | W&B project: `har-deep-learning` |