# Deep Learning Fundamentals

Deep Learning is a subset of machine learning that uses artificial neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data. Unlike traditional machine learning algorithms that require manual feature engineering, deep learning models automatically learn relevant features from raw data.

<span style="color : red">Band 5 & 6 students should understand the difference between traditional machine learning and deep learning, and be able to identify appropriate use cases for deep learning.</span>

## What Makes Deep Learning Different?

### Traditional Machine Learning vs Deep Learning

```mermaid
graph LR
    A[Raw Data] --> B[Feature Engineering]
    B --> C[ML Algorithm]
    C --> D[Output]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#d4edda,color:#333
    style D fill:#f8d7da,color:#333
```

**Traditional ML:** Requires manual feature engineering

```mermaid
graph LR
    A[Raw Data] --> B[Deep Neural Network]
    B --> C[Learned Features]
    C --> D[Output]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#d4edda,color:#333
    style D fill:#f8d7da,color:#333
```

**Deep Learning:** Automatically learns features from data

## Key Concepts in Deep Learning

| Term | Definition |
| --- | --- |
| Deep Neural Network | A neural network with multiple hidden layers (commonly two or more; there is no strict threshold) |
| Feature Learning | Automatic extraction of useful features from raw data |
| Hierarchical Representation | Learning features at multiple levels of abstraction |
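| Backpropagation | Algorithm that computes the gradient of the loss with respect to every weight by propagating errors backward through the network |
| Activation Function | Non-linear function applied to each neuron's output so the network can model non-linear relationships (ReLU, Sigmoid, Tanh) |
| Dropout | Regularization technique that randomly disables a fraction of neurons during training to reduce overfitting |
| Batch Normalization | Technique that normalizes layer inputs over each mini-batch to stabilize and speed up training |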
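
To make the batch normalization entry concrete, here is a minimal NumPy sketch of the training-time computation. The function name `batch_norm` and the parameters `gamma`, `beta` and `eps` are illustrative choices, and the sketch leaves out the running averages that real frameworks keep for inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift (training mode only)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
# Hypothetical layer activations: 64 samples, 4 features, far from zero-centered
activations = rng.normal(loc=10.0, scale=5.0, size=(64, 4))
normalized = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))

print(normalized.mean(axis=0).round(3))  # each feature mean is close to 0
print(normalized.std(axis=0).round(3))   # each feature std is close to 1
```

With `gamma` set to ones and `beta` to zeros the layer only normalizes; during training both would be learned along with the weights.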

## Deep Neural Network Architecture

```mermaid
graph LR
    I1((Input 1)) --> H11((H1))
    I2((Input 2)) --> H11
    I3((Input 3)) --> H11
    I1 --> H12((H2))
    I2 --> H12
    I3 --> H12
    I1 --> H13((H3))
    I2 --> H13
    I3 --> H13
    I1 --> H14((H4))
    I2 --> H14
    I3 --> H14
    
    H11 --> H21((H1))
    H12 --> H21
    H13 --> H21
    H14 --> H21
    H11 --> H22((H2))
    H12 --> H22
    H13 --> H22
    H14 --> H22
    H11 --> H23((H3))
    H12 --> H23
    H13 --> H23
    H14 --> H23
    
    H21 --> H31((H1))
    H22 --> H31
    H23 --> H31
    H21 --> H32((H2))
    H22 --> H32
    H23 --> H32
    
    H31 --> O1((Output 1))
    H32 --> O1
    H31 --> O2((Output 2))
    H32 --> O2
    
    style I1 fill:#e1f5ff,color:#333
    style I2 fill:#e1f5ff,color:#333
    style I3 fill:#e1f5ff,color:#333
    style O1 fill:#f8d7da,color:#333
    style O2 fill:#f8d7da,color:#333
```

**Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output Layer**
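
The diagram can be sketched as a forward pass in NumPy. The layer sizes (3 inputs, hidden layers of 4, 3 and 2 units, 2 outputs) come from the diagram; the random weights, the ReLU hidden activations, the linear output layer and the helper name `forward` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes taken from the diagram: 3 inputs -> 4 -> 3 -> 2 -> 2 outputs
layer_sizes = [3, 4, 3, 2, 2]

# One weight matrix and one bias vector per pair of adjacent layers
# (random values here; a trained network would have learned them)
weights = [rng.normal(0, 0.5, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Propagate a batch of inputs through the network (hypothetical helper)."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        # ReLU on hidden layers; the output layer is left linear
        a = np.maximum(0, z) if i < len(weights) - 1 else z
    return a

batch = rng.normal(size=(5, 3))   # 5 samples with 3 features each
outputs = forward(batch)
print(outputs.shape)              # (5, 2)

# Total trainable parameters: (3*4+4) + (4*3+3) + (3*2+2) + (2*2+2)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(n_params)                   # 45
```

Even this tiny network has 45 parameters; the count grows quickly as layers get wider and deeper.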

```python
# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
```

## Activation Functions in Deep Learning

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

```python
# Common activation functions
x = np.linspace(-5, 5, 100)

# ReLU (Rectified Linear Unit) - most common in deep learning
relu = np.maximum(0, x)

# Sigmoid
sigmoid = 1 / (1 + np.exp(-x))

# Tanh
tanh = np.tanh(x)

# Leaky ReLU
leaky_relu = np.where(x > 0, x, x * 0.01)

# Plot all activation functions on a 2x2 grid
curves = [
    ('ReLU', relu, 'r-'),
    ('Sigmoid', sigmoid, 'b-'),
    ('Tanh', tanh, 'g-'),
    ('Leaky ReLU', leaky_relu, 'm-'),
]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, values, fmt) in zip(axes.flat, curves):
    ax.plot(x, values, fmt, linewidth=2)
    ax.set_title(f'{name} Activation')
    ax.set_xlabel('Input')
    ax.set_ylabel('Output')
    ax.grid(True)

plt.tight_layout()
plt.show()
```

### Choosing Activation Functions

| Activation | Use Case | Advantages | Disadvantages |
| --- | --- | --- | --- |
| ReLU | Hidden layers in most networks | Fast computation, no vanishing gradient for positive values | Dying ReLU: the gradient is 0 for negative inputs, so a neuron stuck at 0 output stops learning |
| Leaky ReLU | When ReLU causes dead neurons | Keeps a small gradient for negative inputs, which prevents the dying ReLU problem | Adds a slope hyperparameter (e.g. 0.01) to choose |
| Sigmoid | Binary classification output layer | Output between 0-1 (probability) | Vanishing gradient problem |
| Tanh | Hidden layers when zero-centered outputs are needed | Output between -1 and 1, zero-centered | Vanishing gradient problem |
| Softmax | Multi-class classification output | Outputs sum to 1 (probability distribution) | Generally used only at the output layer |
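
Softmax is the one activation in the table that the earlier plot does not cover, because it acts on a whole vector rather than a single value. A minimal sketch follows; the function name `softmax` and the example logits are illustrative, and subtracting the maximum is a common stability trick rather than part of the definition.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    # Subtracting the max does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)

print(probs.round(3))   # largest logit gets the largest probability
print(probs.sum())      # probabilities sum to 1 (up to floating-point error)
```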

## Why Deep Learning Requires More Data

Deeper networks have more parameters to fit, so they need more examples to learn features that generalize rather than memorizing the training set. The dataset sizes in the diagram are rough rules of thumb, not hard limits.

```mermaid
graph TD
    A[Small Dataset<br/>100-1000 samples] --> B[Traditional ML<br/>Decision Trees, KNN, Linear Models]
    C[Medium Dataset<br/>1000-100,000 samples] --> D[Shallow Neural Networks<br/>1-2 hidden layers]
    E[Large Dataset<br/>100,000+ samples] --> F[Deep Neural Networks<br/>Many hidden layers]
    G[Massive Dataset<br/>Millions of samples] --> H[Very Deep Networks<br/>Large Pre-trained Models, LLMs]
    
    style A fill:#fff3cd,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#fff3cd,color:#333
    style F fill:#d4edda,color:#333
    style G fill:#fff3cd,color:#333
    style H fill:#d4edda,color:#333
```

## Deep Learning Performance vs Traditional ML

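Deep learning models typically need more training data than traditional ML algorithms before they outperform them. The curves below are simulated to illustrate the trend; they are not measured results.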

```python
# Simulated (illustrative, not measured) performance curves
data_size = np.logspace(2, 6, 100)  # 100 to 1,000,000 samples

# Traditional ML plateaus early
traditional_ml = 70 + 25 * (1 - np.exp(-data_size / 10000))

# Deep learning continues to improve with more data
deep_learning = 50 + 48 * (1 - np.exp(-data_size / 100000))

# First sampled data size at which deep learning overtakes traditional ML
crossover = data_size[np.argmax(deep_learning > traditional_ml)]

plt.figure(figsize=(10, 6))
plt.semilogx(data_size, traditional_ml, 'b-', linewidth=2, label='Traditional ML')
plt.semilogx(data_size, deep_learning, 'r-', linewidth=2, label='Deep Learning')
plt.axvline(x=crossover, color='green', linestyle='--', alpha=0.5, label='Crossover Point')
plt.xlabel('Amount of Training Data (samples)')
plt.ylabel('Model Performance (%)')
plt.title('Performance vs Data Size: Traditional ML vs Deep Learning')
plt.legend()  # called after axvline so the crossover line appears in the legend
plt.grid(True, alpha=0.3)
plt.show()
```

## Common Deep Learning Challenges

### 1. Vanishing Gradient Problem

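In very deep networks, gradients can become extremely small as they are propagated backward, because backpropagation multiplies in one factor per layer and saturating activations such as sigmoid contribute factors well below 1. The early layers then receive almost no update and train very slowly.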

**Solutions:**
- Use ReLU activation functions
- Implement batch normalization
- Use residual connections (ResNet)
- Careful weight initialization
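
The effect can be seen with a deliberately simplified calculation: a chain of single-unit layers where only the activation derivative is multiplied at each step. Weights are ignored, and the pre-activation value of 1.0 and the helper names are assumptions for illustration.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return float(z > 0)

pre_activation = 1.0          # assumed identical at every layer, for simplicity
depths = [1, 5, 10, 20]

# Gradient factor reaching the first layer after passing through `d` activations
sigmoid_chain = {d: sigmoid_grad(pre_activation) ** d for d in depths}
relu_chain = {d: relu_grad(pre_activation) ** d for d in depths}

for d in depths:
    print(f"{d:>2} layers  sigmoid: {sigmoid_chain[d]:.2e}  ReLU: {relu_chain[d]:.2e}")
```

The sigmoid factor shrinks geometrically with depth while the ReLU factor stays at 1 for positive inputs, which is why ReLU appears first in the list of solutions.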

### 2. Overfitting

Deep networks often have millions of parameters and can easily memorize the training data instead of learning patterns that generalize.

**Solutions:**
- Dropout regularization
- Data augmentation
- Early stopping
- L1/L2 regularization
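
As an example of the first item, here is a minimal sketch of "inverted" dropout in NumPy. The function name `dropout`, the `rate` and `training` parameters and the all-ones input are illustrative assumptions; frameworks provide their own dropout layers.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate, training=True):
    """Inverted dropout: zero a random fraction `rate` of units during training."""
    if not training or rate == 0.0:
        return activations                     # no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

hidden = np.ones((1000, 100))                  # hypothetical hidden-layer output
dropped = dropout(hidden, rate=0.5)

print((dropped == 0).mean())   # fraction of units zeroed, close to 0.5
print(dropped.mean())          # average activation, close to 1.0
```

Scaling the surviving units during training means nothing has to change at inference time, where `training=False` returns the input untouched.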

### 3. Computational Requirements

Training deep networks requires significant computational resources.

**Solutions:**
- Use GPUs/TPUs
- Transfer learning (use pre-trained models)
- Model compression
- Efficient architectures (MobileNet, EfficientNet)

## Training Process Visualization

```mermaid
graph TD
    A[Initialize Network<br/>Random Weights] --> B[Forward Pass<br/>Make Predictions]
    B --> C[Calculate Loss<br/>Compare to True Values]
    C --> D[Backward Pass<br/>Compute Gradients]
    D --> E[Update Weights<br/>Using Optimizer]
    E --> F{Converged?}
    F -->|No| B
    F -->|Yes| G[Final Model]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#f8d7da,color:#333
    style D fill:#fff3cd,color:#333
    style E fill:#d4edda,color:#333
    style F fill:#ffeaa7,color:#333
    style G fill:#00b894
```
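
The loop in the diagram can be sketched end to end in NumPy. This is a toy example under stated assumptions: the data (`y = sin(x)`), the single hidden layer of 16 tanh units, plain gradient descent as the optimizer, and the values of `learning_rate` and `tolerance` are all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed): learn y = sin(x) on [-3, 3]
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X)

# Initialize network: random weights, zero biases
W1 = rng.normal(0, 0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

learning_rate = 0.01
tolerance = 1e-7
losses = []

for epoch in range(2000):
    # Forward pass: make predictions
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2

    # Calculate loss: mean squared error against the true values
    err = pred - y
    losses.append(np.mean(err ** 2))

    # Backward pass: compute gradients with the chain rule
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_z = grad_h * (1 - h ** 2)        # derivative of tanh
    grad_W1 = X.T @ grad_z
    grad_b1 = grad_z.sum(axis=0)

    # Update weights: plain gradient descent
    W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2

    # Converged? Stop when the loss has stopped changing
    if epoch > 0 and abs(losses[-2] - losses[-1]) < tolerance:
        break

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

In practice a framework computes the backward pass automatically and the convergence check is usually based on a validation set, but the order of steps is the same.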

## When to Use Deep Learning

### Good Use Cases:
- **Image Recognition**: Object detection, facial recognition, medical imaging
- **Natural Language Processing**: Translation, sentiment analysis, text generation
- **Speech Recognition**: Voice assistants, transcription services
- **Complex Pattern Recognition**: Where features are difficult to engineer manually
- **Large Datasets**: Problems where hundreds of thousands to millions of training examples are available

### When Traditional ML May Be Better:
- Small datasets (< 10,000 samples)
- Simple, well-defined problems
- Need for interpretability
- Limited computational resources
- Fast training time required
- Structured/tabular data

## Deep Learning Frameworks

| Framework | Organization | Strengths |
| --- | --- | --- |
| TensorFlow | Google | Production deployment, TensorFlow Lite for mobile |
| PyTorch | Meta (Facebook) | Research-friendly, dynamic computation graphs |
| Keras | Google | High-level API, beginner-friendly |
| JAX | Google | High-performance numerical computing |
| MXNet | Apache | Scalable, efficient; retired by Apache in 2023 and no longer actively developed |

**Note**: Keras ships with TensorFlow as `tf.keras` and is TensorFlow's recommended high-level API. Since Keras 3 it can also run on JAX and PyTorch backends.