# Exploratory Data Analysis (EDA)

**Purpose**: This notebook explores the self-driving vehicle dataset to understand:
- What the data looks like
- How many samples we have for each steering direction
- Whether there are any problems with the data
- How to split the data for training and testing

**Why EDA is important**: Before building any machine learning model, we need to understand our data. This helps us:
1. Choose the right models
2. Avoid common mistakes
3. Explain why our models work or fail

## 1. Setup: Import Libraries

**What are libraries?** Pre-written code that helps us do common tasks.

- `numpy`: Math operations on arrays (lists of numbers)
- `matplotlib`: Draw plots and charts
- `seaborn`: Make prettier charts
- `sklearn`: Machine learning tools

In [1]:
# Import libraries
import numpy as np                          # For working with arrays of numbers
import matplotlib.pyplot as plt             # For creating plots
import seaborn as sns                       # For creating nice-looking plots
from sklearn.decomposition import PCA       # For reducing dimensions (explained later)
from collections import Counter             # For counting things

# Settings to make plots look better
plt.style.use('default')                    # Use default plot style
sns.set_palette("husl")                     # Use colorful palette
# Show plots inside the notebook
# (keep this comment on its own line: text after a magic command is parsed as an argument)
%matplotlib inline


## 2. Load the Dataset

**What is our dataset?** A `.npy` file containing:
- Images: 64×64 grayscale pictures from the vehicle camera
- Labels: Steering direction (-1 = left, 0 = forward, 1 = right)

**Why .npy format?** It's a NumPy file format that stores arrays efficiently.
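
As a rough illustration of why `allow_pickle=True` comes up with this kind of file, the sketch below builds a tiny stand-in dataset and round-trips it through `np.save` and `np.load`. It assumes each sample is stored as an `[image, label]` pair inside an object array; the four random samples and the temporary file name are made up for the example and are not the real dataset.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: 4 samples, each an [image, label] pair.
toy = np.empty((4, 2), dtype=object)
for i in range(4):
    toy[i, 0] = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # fake image
    toy[i, 1] = int(rng.integers(-1, 2))                             # fake label in {-1, 0, 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_training_data.npy")
    np.save(path, toy)  # object arrays are pickled on save

    # With the default allow_pickle=False, NumPy refuses to load object arrays.
    try:
        np.load(path)
        load_failed = False
    except ValueError:
        load_failed = True

    loaded = np.load(path, allow_pickle=True)

print(f"Default load rejected: {load_failed}")
print(f"Loaded {len(loaded)} samples; first image shape = {loaded[0][0].shape}")
```

Unpickling can execute arbitrary code, so `allow_pickle=True` is only appropriate for files from a source you trust.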

In [None]:
# Load the dataset
# np.load() reads a .npy file
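# allow_pickle=True is required when the array holds Python objects, which appears to be
# the case here (each sample pairs an image with a label); only use it on files you trust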
data = np.load('../data/training_data-SIZE10000-TIME80557.npy', allow_pickle=True)

# Print basic information
print(f"Total number of samples: {len(data)}")
print(f"Type of data: {type(data)}")
print(f"First sample structure: image shape = {data[0][0].shape}, label = {data[0][1]}")

## 3. Separate Images (X) and Labels (y)

**Convention in machine learning**:
- `X` = features (input data) = images
- `y` = labels (what we want to predict) = steering directions

**Why separate them?** Most machine learning functions expect X and y as separate inputs.

In [None]:
# Extract images (X) and labels (y)
# sample[0] = image, sample[1] = label
X = np.array([sample[0] for sample in data])  # All images
y = np.array([sample[1] for sample in data])  # All labels

print(f"X shape: {X.shape}")  # Should be (9900, 64, 64)
print(f"y shape: {y.shape}")  # Should be (9900,)
print(f"Unique labels: {np.unique(y)}")  # Should be [-1, 0, 1]

## 4. Class Distribution

**Purpose**: Count how many samples we have for each steering direction.

**Why this matters**: If we have many more "forward" samples than "left" or "right", the model might:
- Always predict "forward" (lazy strategy)
- Perform poorly on turns

This is called **class imbalance**.
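
To make the "lazy strategy" concrete, here is a small sketch that scores a classifier which always predicts forward. The label counts are invented for illustration (they are not the real dataset). Plain accuracy looks respectable, while per-class recall exposes the failure on turns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 700 forward, 200 left, 100 right.
y_true = np.array([0] * 700 + [-1] * 200 + [1] * 100)

# The "lazy" model: predict forward (0) for every sample.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
per_class_recall = recall_score(y_true, y_pred, labels=[-1, 0, 1], average=None)

print(f"Accuracy:          {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
print(f"Recall (left, forward, right): {per_class_recall}")
```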

In [None]:
# Count samples for each class
class_counts = Counter(y)  # Counter counts occurrences

# Create a readable summary
label_names = {-1: 'Left', 0: 'Forward', 1: 'Right'}
total = len(y)

print("Class Distribution:")
print("-" * 50)
for label in [-1, 0, 1]:
    count = class_counts[label]
    percentage = (count / total) * 100
    print(f"{label_names[label]:8s} (label={label:2d}): {count:5d} samples ({percentage:5.1f}%)")
print("-" * 50)
print(f"Total: {total} samples")

In [None]:
# Visualize class distribution with bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
labels = ['Left', 'Forward', 'Right']
counts = [class_counts[-1], class_counts[0], class_counts[1]]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

axes[0].bar(labels, counts, color=colors)
axes[0].set_ylabel('Number of Samples', fontsize=12)
axes[0].set_title('Class Distribution (Counts)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, (label, count) in enumerate(zip(labels, counts)):
    axes[0].text(i, count + 100, str(count), ha='center', fontsize=11, fontweight='bold')

# Pie chart
axes[1].pie(counts, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Class Distribution (Percentages)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Key observation
print("\n⚠️ KEY OBSERVATION:")
print(f"Forward class has {class_counts[0] / class_counts[1]:.1f}x more samples than Right class!")
print("This severe imbalance will likely cause the model to bias toward 'Forward' predictions.")

## 5. Visualize Sample Images

**Purpose**: Look at actual images to understand:
- What does the camera see?
- Do different steering directions look visually different?
- Are labels correct?

**What to look for**:
- Track surface (light gray)
- Track edges (black)
- Vehicle position

In [None]:
# Show 5 random samples from each class
fig, axes = plt.subplots(3, 5, figsize=(15, 9))

# For each class
for row, label in enumerate([-1, 0, 1]):
    # Find indices where y equals this label
    indices = np.where(y == label)[0]  # np.where finds matching positions
    
    # Randomly select 5 samples
    selected = np.random.choice(indices, size=5, replace=False)
    
    # Display each sample
    for col, idx in enumerate(selected):
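        axes[row, col].imshow(X[idx], cmap='gray')  # cmap='gray' shows grayscale
        # Hide the ticks but keep the axis itself, so the row label below stays visible
        # (axis('off') would hide the ylabel as well)
        axes[row, col].set_xticks([])
        axes[row, col].set_yticks([])

        # Add the class name as a row label on the first column only
        if col == 0:
            axes[row, col].set_ylabel(label_names[label], fontsize=14, fontweight='bold')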

fig.suptitle('Random Samples from Each Class', fontsize=16, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()

## 6. Label Quality Analysis

**Important discovery**: Some images don't match their labels!

**Why does this happen?**
- Labels are **reactive steering commands**, not descriptions of what the image shows
- Example: Image shows vehicle drifting right → Label = "turn left" (to correct)
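- We refer to this as **temporal lag**: the recorded action is a response to the current state, so what the frame shows and what the label says can point in opposite directions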

**What this means for our project**:
- Single-frame prediction is inherently difficult
- Sequential models (using multiple frames) should work better
- We shouldn't expect 90%+ accuracy
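
One simple way to give a model multiple frames is to stack each frame with its predecessors. The helper below is a minimal sketch. The name `make_windows`, the window length of 3, and the choice to label each stack with its last frame's label are all assumptions for illustration, not something this notebook defines.

```python
import numpy as np

def make_windows(frames, labels, window=3):
    """Stack `window` consecutive frames; label each stack with its last frame's label."""
    n = len(frames)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the number of frames")
    stacks = np.stack([frames[i:i + window] for i in range(n - window + 1)])
    return stacks, labels[window - 1:]

# Toy sequence: 10 frames of 64x64 pixels (random stand-in data).
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 256, size=(10, 64, 64), dtype=np.uint8)
toy_y = rng.integers(-1, 2, size=10)

X_seq, y_seq = make_windows(toy_X, toy_y, window=3)
print(X_seq.shape)  # (8, 3, 64, 64)
print(y_seq.shape)  # (8,)
```

Windows like these should be built separately inside each split, so that no stack mixes frames from the training and test portions.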

In [None]:
# Manually examine some examples that might look confusing
# Let's look at a few left-turn examples (the 21st through 25th left-labeled frames)
left_indices = np.where(y == -1)[0]
sample_indices = left_indices[20:25]  # Pick a few examples

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for i, idx in enumerate(sample_indices):
    axes[i].imshow(X[idx], cmap='gray')
    axes[i].set_title(f"Label: {label_names[y[idx]]}\nIndex: {idx}", fontsize=10)
    axes[i].axis('off')

plt.suptitle('Example: Left Turn Labels (Notice some might not visually show left turns)', 
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 INSIGHT:")
print("If you see images labeled 'left' that don't look like they need to turn left,")
print("this is expected! The label is the CORRECTIVE ACTION, not a description of the image.")

## 7. Pixel Statistics

**Purpose**: Analyze pixel intensities to understand:
- Are different classes visually distinct?
- What does an "average" left/forward/right image look like?

**Pixel values**:
- Range from 0 (black) to 255 (white)
- Track surface: light gray (high values)
- Track edges: black (low values)

In [None]:
# Calculate mean (average) image for each class
# Mean image = average all pixels across all images of that class
mean_images = {}
for label in [-1, 0, 1]:
    # Get all images with this label
    class_images = X[y == label]
    # Calculate mean across all images (axis=0 means "across samples")
    mean_images[label] = np.mean(class_images, axis=0)

# Display mean images
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for i, label in enumerate([-1, 0, 1]):
    axes[i].imshow(mean_images[label], cmap='gray')
    axes[i].set_title(f'Mean {label_names[label]} Image', fontsize=12, fontweight='bold')
    axes[i].axis('off')

plt.suptitle('Average Image for Each Class', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("❓ QUESTION TO THINK ABOUT:")
print("Do the mean images look different from each other?")
print("If they look very similar, it means the classes are hard to distinguish visually.")

In [None]:
# Plot pixel intensity distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, label in enumerate([-1, 0, 1]):
    # Get all images with this label and flatten them
    # flatten() turns the (N, 64, 64) stack into one 1D array of N * 4096 pixel values
    class_images = X[y == label]
    all_pixels = class_images.flatten()  # Combine all pixels from all images
    
    # Create histogram (count how many pixels have each value)
    axes[i].hist(all_pixels, bins=50, color=['#FF6B6B', '#4ECDC4', '#45B7D1'][i], alpha=0.7)
    axes[i].set_xlabel('Pixel Intensity (0=black, 255=white)', fontsize=10)
    axes[i].set_ylabel('Frequency (count)', fontsize=10)
    axes[i].set_title(f'{label_names[label]} Pixel Distribution', fontsize=12, fontweight='bold')
    axes[i].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 INTERPRETATION:")
print("If all three histograms look similar, the classes are not easily separable by pixel values.")

## 8. Temporal Analysis (Very Important!)

**What is temporal correlation?**
- Our data is a video sequence (consecutive frames)
- Nearby frames look very similar (the vehicle doesn't teleport!)
- Correlation (the Pearson coefficient) = how similar two frames are (1.0 = identical up to brightness and contrast, 0.0 = no linear relationship, negative = inverted pattern)

**Why this matters**:
- If we do a random train/test split, test images might be nearly identical to training images
- The model might "cheat" by memorizing frames instead of actually learning
- We need a **temporal split** instead: train on the first part, test on the last part
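
The correlation between two frames can be computed for every pair at a given gap without a Python loop. The sketch below is one way to do that. `frame_correlations` is a hypothetical helper name, and the random-walk frames are synthetic stand-ins for real video.

```python
import numpy as np

def frame_correlations(frames, gap=1):
    """Pearson correlation between frame i and frame i + gap, for every valid i."""
    if gap < 1 or gap >= len(frames):
        raise ValueError("gap must be between 1 and len(frames) - 1")
    # astype makes a float copy, so the in-place centering leaves `frames` untouched.
    flat = frames.reshape(len(frames), -1).astype(np.float64)
    flat -= flat.mean(axis=1, keepdims=True)          # center each frame
    norms = np.linalg.norm(flat, axis=1)
    num = np.sum(flat[:-gap] * flat[gap:], axis=1)    # dot products of centered frames
    den = norms[:-gap] * norms[gap:]
    # Constant frames have zero norm; report NaN for them instead of dividing by zero.
    return np.divide(num, den, out=np.full_like(num, np.nan), where=den > 0)

# Synthetic "video": each frame is the previous one plus a little noise.
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.1, size=(50, 16, 16))
steps[0] = rng.normal(size=(16, 16))
video = np.cumsum(steps, axis=0)

for gap in (1, 5, 20):
    mean_corr = np.nanmean(frame_correlations(video, gap))
    print(f"gap={gap:2d}: mean correlation = {mean_corr:.3f}")
```

Because the helper holds a float copy of every flattened frame in memory, it trades memory for speed. For a few thousand 64×64 frames that is a few hundred megabytes.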

In [None]:
# Calculate correlation between consecutive frames
# Correlation measures how similar two images are
correlations = []

# Compare each frame with the next frame
for i in range(len(X) - 1):
    # Flatten images to 1D arrays
    img1 = X[i].flatten()
    img2 = X[i+1].flatten()
    
    # Calculate correlation coefficient
    # np.corrcoef returns a 2x2 matrix, we want the off-diagonal value
    corr = np.corrcoef(img1, img2)[0, 1]
    correlations.append(corr)

# Plot correlation over time
plt.figure(figsize=(14, 5))
plt.plot(correlations, alpha=0.5, linewidth=0.5)
plt.axhline(y=0.8, color='r', linestyle='--', label='High correlation threshold (0.8)')
plt.axhline(y=0.5, color='orange', linestyle='--', label='Medium correlation threshold (0.5)')
plt.xlabel('Frame Index', fontsize=12)
plt.ylabel('Correlation with Next Frame', fontsize=12)
plt.title('Temporal Correlation: How Similar Are Consecutive Frames?', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Statistics
print(f"Mean correlation: {np.mean(correlations):.3f}")
print(f"Median correlation: {np.median(correlations):.3f}")
print(f"Percentage of frames with correlation > 0.8: {(np.array(correlations) > 0.8).mean() * 100:.1f}%")

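print("\n⚠️ KEY FINDING:")
if np.mean(correlations) > 0.7:
    print("Consecutive frames are HIGHLY correlated!")
    print("→ Random train/test split will leak information")
    print("→ Must use temporal split instead")
else:
    print("Mean consecutive-frame correlation is not above 0.7.")
    print("→ Leakage from a random split is less certain, but a temporal split is still the safer choice")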

In [None]:
# Analyze correlation at different time gaps
# Does correlation decrease as frames get further apart?
gaps = [1, 5, 10, 20, 50, 100]
gap_correlations = []

for gap in gaps:
    corrs = []
    for i in range(len(X) - gap):
        img1 = X[i].flatten()
        img2 = X[i + gap].flatten()
        corr = np.corrcoef(img1, img2)[0, 1]
        corrs.append(corr)
    gap_correlations.append(np.mean(corrs))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(gaps, gap_correlations, marker='o', linewidth=2, markersize=8)
plt.xlabel('Frame Gap (how many frames apart)', fontsize=12)
plt.ylabel('Average Correlation', fontsize=12)
plt.title('How Does Correlation Decay with Time?', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

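# Print the mean correlation at each gap
for gap, corr in zip(gaps, gap_correlations):
    print(f"Gap = {gap:3d} frames: correlation = {corr:.3f}")

# Find the first tested gap where the mean correlation drops below 0.5
below = [gap for gap, corr in zip(gaps, gap_correlations) if corr < 0.5]

print("\n💡 INSIGHT:")
if below:
    print(f"Mean correlation first drops below 0.5 at a gap of {below[0]} frames.")
else:
    print("Mean correlation stays at or above 0.5 for every gap tested.")
print("This tells us how carefully we need to split our data.")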

## 9. Label Transition Analysis

**Purpose**: Understand the sequence of steering decisions.

**Questions**:
- After a left turn, what usually comes next?
- Do we see realistic sequences? (e.g., left → forward → right on a curve)
- Are there implausible patterns? (e.g., left always followed by left, with no return to forward)

**Transition matrix**: A table showing "if current label is X, next label is Y"
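
As a small worked example of such a table, the sketch below counts transitions for a made-up nine-label sequence and normalizes each row into probabilities. The sequence is invented for illustration. The `+ 1` offset maps labels -1, 0, 1 onto row and column indices 0, 1, 2.

```python
import numpy as np

# Hypothetical label sequence (-1 = left, 0 = forward, 1 = right).
toy_y = np.array([0, 0, 0, -1, -1, 0, 1, 1, 0])

counts = np.zeros((3, 3), dtype=int)
# np.add.at accumulates correctly even when the same (row, col) pair repeats.
np.add.at(counts, (toy_y[:-1] + 1, toy_y[1:] + 1), 1)

# Each row sums to 1: P(next label | current label).
probs = counts / counts.sum(axis=1, keepdims=True)

print(counts)
# [[1 1 0]
#  [1 2 1]
#  [0 1 1]]
print(probs[1])  # forward row: 0.25 left, 0.50 forward, 0.25 right
```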

In [None]:
# Build transition matrix
# transition[i, j] = count of times label i is followed by label j
transition_matrix = np.zeros((3, 3))

for i in range(len(y) - 1):
    current_label = y[i]
    next_label = y[i + 1]
    # Map labels: -1→0, 0→1, 1→2 for indexing
    transition_matrix[current_label + 1, next_label + 1] += 1

# Normalize to probabilities (each row sums to 1)
transition_probs = transition_matrix / transition_matrix.sum(axis=1, keepdims=True)

# Visualize as heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(transition_probs, annot=True, fmt='.2f', cmap='YlOrRd',
            xticklabels=['Left', 'Forward', 'Right'],
            yticklabels=['Left', 'Forward', 'Right'],
            cbar_kws={'label': 'Probability'})
plt.xlabel('Next Label', fontsize=12)
plt.ylabel('Current Label', fontsize=12)
plt.title('Label Transition Probabilities\n(What label usually follows what?)', 
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("📊 INTERPRETATION GUIDE:")
print("Each row shows: if current label is X, what's the probability next label is Y?")
print("Example: If current=Forward, what's P(next=Forward)?")
print(f"→ P(Forward → Forward) = {transition_probs[1, 1]:.2f}")
print("\nHigh diagonal values = labels tend to repeat (vehicle keeps same direction)")

## 10. Dimensionality Reduction (PCA)

**What is PCA (Principal Component Analysis)?**
- Each image has 64×64 = 4,096 pixels (dimensions)
- PCA finds the directions along which the data varies most (principal components); we keep the top 2
- We can plot data in 2D to see if classes are separable

**Why 2D?** So we can visualize it!

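**What to look for**:
- Do different colors (classes) form separate clusters?
- If yes: classes are likely easy to separate → even simple models may work well
- If no: classes overlap in this 2D view → likely a harder problem, although separation could still exist in dimensions the plot doesn't show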
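
For intuition about what PCA keeps and discards, the toy sketch below runs it on synthetic 10-dimensional data in which two axes carry far more variance than the rest. The data and scales are invented; nothing here uses the vehicle images.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 500 synthetic samples, 10 features: features 0 and 1 vary a lot, the rest barely.
scales = np.array([10.0, 5.0] + [0.5] * 8)
toy = rng.normal(size=(500, 10)) * scales

toy_pca = PCA(n_components=2)
toy_2d = toy_pca.fit_transform(toy)

ratios = toy_pca.explained_variance_ratio_
print(f"Projected shape: {toy_2d.shape}")
print(f"Variance kept by 2 components: {ratios.sum():.1%}")
# The first component should load most heavily on feature 0, the highest-variance axis.
print(f"Feature with largest weight in component 1: {np.abs(toy_pca.components_[0]).argmax()}")
```

When the kept variance is low, the 2D scatter shows only a small slice of the data, so the plot should be read with that figure in mind.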

In [None]:
# Prepare data for PCA
# PCA needs data as (samples, features)
# Our images are (9900, 64, 64), we need (9900, 4096)
X_flat = X.reshape(len(X), -1)  # -1 means "calculate this dimension automatically"

print(f"Original shape: {X.shape}")
print(f"Flattened shape: {X_flat.shape}")
print(f"Each image: {X.shape[1]}×{X.shape[2]} = {X.shape[1]*X.shape[2]} features (nothing is reduced yet)")

In [None]:
# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)  # Keep only 2 components
X_2d = pca.fit_transform(X_flat)  # Transform data to 2D

print(f"Reduced to: {X_2d.shape}")
print(f"\nVariance explained by 2 components: {pca.explained_variance_ratio_.sum():.1%}")
print("(This tells us how much information we kept)")
print(f"Component 1 explains: {pca.explained_variance_ratio_[0]:.1%}")
print(f"Component 2 explains: {pca.explained_variance_ratio_[1]:.1%}")

In [None]:
# Plot in 2D space
plt.figure(figsize=(12, 8))

# Plot each class with different color
colors = {-1: '#FF6B6B', 0: '#4ECDC4', 1: '#45B7D1'}
for label in [-1, 0, 1]:
    # Get points for this class
    mask = (y == label)  # Boolean array: True where y equals label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], 
                c=colors[label], label=label_names[label],
                alpha=0.6, s=20, edgecolors='none')

plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
plt.title('2D PCA Projection: Are Classes Separable?', fontsize=14, fontweight='bold')
plt.legend(fontsize=12, markerscale=2)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n❓ QUESTION:")
print("Do you see three separate clusters?")
print("- YES → Classes look separable even in 2D, simple models might work well")
print("- NO → Classes overlap in 2D, so more flexible models (like neural networks) may be needed")

## 11. Train/Validation/Test Splits

**Why split data?**
- **Train set**: Model learns from this
- **Validation set**: Tune hyperparameters (like learning rate)
- **Test set**: Final evaluation (model has never seen this)

**Two splitting strategies**:

### Strategy A: Random Split (Naive)
- Randomly shuffle and split
- Problem: Consecutive frames are similar, so near-duplicates of test frames end up in the train set (data leakage)

### Strategy B: Temporal Split (Proper)
- First 70% → train
- Next 15% → validation  
- Last 15% → test
- Advantage: Test set is truly unseen (future data)

**We'll create both and compare results later**
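
Even with a temporal split, the frames right at each boundary remain correlated with their neighbours on the other side. One common remedy, shown here as a sketch and not as part of this notebook's pipeline, is to drop a small buffer of frames at each boundary. The function name `temporal_split_indices` and the `gap` parameter are assumptions for illustration.

```python
import numpy as np

def temporal_split_indices(n, train_frac=0.70, val_frac=0.15, gap=0):
    """Return (train, val, test) index arrays in time order, skipping `gap` frames at each boundary."""
    # round() guards against floating-point products such as 699.999...
    train_end = int(round(train_frac * n))
    val_end = int(round((train_frac + val_frac) * n))
    train = np.arange(0, train_end)
    val = np.arange(train_end + gap, val_end)
    test = np.arange(val_end + gap, n)
    if len(val) == 0 or len(test) == 0:
        raise ValueError("gap is too large for this dataset size")
    return train, val, test

train_idx, val_idx, test_idx = temporal_split_indices(1000, gap=20)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 130 130
```

A sensible `gap` could be read off the correlation-versus-gap analysis in Section 8, for example the gap at which the mean correlation has clearly decayed.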

In [None]:
# Strategy A: Random Split
from sklearn.model_selection import train_test_split

# First split: 70% train, 30% temp
X_train_rand, X_temp, y_train_rand, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# stratify=y ensures each split has similar class distribution

# Second split: split temp into 50% validation, 50% test
X_val_rand, X_test_rand, y_val_rand, y_test_rand = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print("Random Split:")
print(f"Train: {len(X_train_rand)} samples")
print(f"Val:   {len(X_val_rand)} samples")
print(f"Test:  {len(X_test_rand)} samples")
print(f"Total: {len(X_train_rand) + len(X_val_rand) + len(X_test_rand)} samples")

In [None]:
# Strategy B: Temporal Split
n = len(X)
train_end = int(0.7 * n)      # 70% for training
val_end = int(0.85 * n)       # Next 15% for validation
# Remaining 15% for test

X_train_temp = X[:train_end]
y_train_temp = y[:train_end]

X_val_temp = X[train_end:val_end]
y_val_temp = y[train_end:val_end]

X_test_temp = X[val_end:]
y_test_temp = y[val_end:]

print("\nTemporal Split:")
print(f"Train: {len(X_train_temp)} samples (frames 0 to {train_end-1})")
print(f"Val:   {len(X_val_temp)} samples (frames {train_end} to {val_end-1})")
print(f"Test:  {len(X_test_temp)} samples (frames {val_end} to {n-1})")
print(f"Total: {len(X_train_temp) + len(X_val_temp) + len(X_test_temp)} samples")

In [None]:
# Compare class distributions in both strategies
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Random split distributions
for i, (y_split, title) in enumerate([
    (y_train_rand, 'Random: Train'),
    (y_val_rand, 'Random: Val'),
    (y_test_rand, 'Random: Test')
]):
    counts = [np.sum(y_split == -1), np.sum(y_split == 0), np.sum(y_split == 1)]
    axes[0, i].bar(['Left', 'Forward', 'Right'], counts, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
    axes[0, i].set_title(title, fontsize=12, fontweight='bold')
    axes[0, i].set_ylabel('Count')
    
# Temporal split distributions
for i, (y_split, title) in enumerate([
    (y_train_temp, 'Temporal: Train'),
    (y_val_temp, 'Temporal: Val'),
    (y_test_temp, 'Temporal: Test')
]):
    counts = [np.sum(y_split == -1), np.sum(y_split == 0), np.sum(y_split == 1)]
    axes[1, i].bar(['Left', 'Forward', 'Right'], counts, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
    axes[1, i].set_title(title, fontsize=12, fontweight='bold')
    axes[1, i].set_ylabel('Count')

plt.suptitle('Class Distribution in Different Splits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 OBSERVATION:")
print("Random split: All three sets have similar proportions (because of stratify=y)")
print("Temporal split: Proportions might differ (depends on where in the track we split)")

## 12. Save Splits for Later Use

**Purpose**: Save our train/val/test splits so:
- We use the same splits in all experiments (fair comparison)
- Other team members can use the same splits

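**File format**: `.npz` = an archive holding several named NumPy arrays in one file (`np.savez` stores them uncompressed; `np.savez_compressed` would compress them)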
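
Here is a minimal round-trip sketch of the `.npz` format, using small placeholder arrays and a temporary directory instead of the real splits. It also compares `np.savez` with `np.savez_compressed`, since only the latter applies compression.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays standing in for real splits.
X_demo = np.zeros((100, 64, 64), dtype=np.uint8)
y_demo = np.zeros(100, dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo_plain.npz")
    packed_path = os.path.join(tmp, "demo_packed.npz")

    np.savez(plain_path, X_train=X_demo, y_train=y_demo)              # no compression
    np.savez_compressed(packed_path, X_train=X_demo, y_train=y_demo)  # compressed archive

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

    # np.load on an .npz returns a lazy NpzFile; the context manager
    # closes the underlying file handle when we are done.
    with np.load(packed_path) as archive:
        keys = sorted(archive.files)
        X_back = archive["X_train"]

print(keys)
print(f"savez: {plain_size} bytes, savez_compressed: {packed_size} bytes")
```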

In [None]:
# Save random splits
np.savez('../data/splits_random.npz',
         X_train=X_train_rand, y_train=y_train_rand,
         X_val=X_val_rand, y_val=y_val_rand,
         X_test=X_test_rand, y_test=y_test_rand)

# Save temporal splits
np.savez('../data/splits_temporal.npz',
         X_train=X_train_temp, y_train=y_train_temp,
         X_val=X_val_temp, y_val=y_val_temp,
         X_test=X_test_temp, y_test=y_test_temp)

print("✅ Splits saved successfully!")
print("Files created:")
print("  - ../data/splits_random.npz")
print("  - ../data/splits_temporal.npz")
print("\nTo load later: data = np.load('../data/splits_random.npz')")
print("Then access: data['X_train'], data['y_train'], etc.")

## 13. Summary of Key Findings

**Based on our exploratory data analysis, here are the main findings:**

### 1. Severe Class Imbalance ⚠️
- Forward: 74.2% (dominant class)
- Left: 16.4%
- Right: 9.5% (minority class)

**Implication**: Models will likely bias toward predicting "Forward". Need to address with:
- Class weights in loss function
- Oversampling minority classes (SMOTE)
- Evaluation metrics beyond accuracy (F1-score per class)
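
As a sketch of the first option, scikit-learn's `compute_class_weight` with `"balanced"` assigns each class a weight of `n_samples / (n_classes * class_count)`. The label counts below are invented for illustration and are not the dataset's actual counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 700 forward, 200 left, 100 right.
y_demo = np.array([0] * 700 + [-1] * 200 + [1] * 100)
classes = np.array([-1, 0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Rarer classes receive larger weights.
for label, w in class_weight.items():
    print(f"label {label:2d}: weight = {w:.3f}")
```

A dictionary like this can be passed as `class_weight=` to many scikit-learn classifiers, such as `LogisticRegression` or `RandomForestClassifier`.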

### 2. High Temporal Correlation 📈
- Consecutive frames are highly correlated (>0.7)
- Frames within ~10 steps are almost identical

**Implication**: 
- Random train/test split is inappropriate (data leakage)
- Must use temporal split for realistic evaluation
- Sequential models (LSTM, temporal CNN) should outperform single-frame models

### 3. Label Noise (Temporal Lag) 🔄
- Labels are reactive control signals, not image descriptions
- Many images visually contradict their labels

**Implication**:
- Inherent difficulty in single-frame prediction
- Don't expect accuracy above roughly 80-90%
- Temporal context is crucial

### 4. Limited Visual Separability 👁️
- PCA shows significant class overlap
- Mean images look similar across classes

**Implication**:
- Linear models may struggle
- Need non-linear models (neural networks)
- Feature engineering might help

### 5. Small Dataset 📊
- Only 9,900 samples total
- Right class has only 937 samples

**Implication**:
- Deep CNNs may overfit
- Need regularization (dropout, weight decay)
- Data augmentation might help
- Simpler models might outperform complex ones
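
One augmentation often used for steering data is a horizontal flip, which mirrors the image and swaps the left and right labels. The sketch below assumes the camera is centered and the track has no asymmetric cues. Both are assumptions about this dataset, not facts established above, and `flip_augment` is a hypothetical helper.

```python
import numpy as np

def flip_augment(images, labels):
    """Mirror images left-to-right and negate labels (-1 <-> 1, 0 stays 0)."""
    flipped = images[:, :, ::-1].copy()  # reverse the width axis
    return flipped, -labels

# Toy batch: 6 random 64x64 frames with made-up labels.
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 256, size=(6, 64, 64), dtype=np.uint8)
toy_y = np.array([-1, 0, 1, 1, 0, -1])

X_flip, y_flip = flip_augment(toy_X, toy_y)
X_aug = np.concatenate([toy_X, X_flip])
y_aug = np.concatenate([toy_y, y_flip])
print(X_aug.shape, y_aug.shape)  # (12, 64, 64) (12,)
```

Augmentation like this belongs on the training split only, so that validation and test metrics still reflect unmodified frames.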

---

## Next Steps

1. **Baseline Models** (Notebook 02):
   - Majority class classifier
   - Logistic Regression
   - Random Forest

2. **CNN Models** (Notebook 03):
   - Simple CNN (2-3 layers)
   - Deeper architectures
   - Compare random vs temporal splits

3. **Advanced Methods** (Notebook 04):
   - LSTM for temporal sequences
   - 1D-CNN on frame sequences
   - Ensemble methods

4. **Analysis** (Notebook 05):
   - Error analysis
   - Confusion matrices
   - Statistical significance testing