

# What is Deep Learning?

Deep Learning:
- Is a **subset of Machine Learning**.
- Uses **neural networks with multiple layers** (deep architectures).
- Learns **features and decision boundaries directly from data**, rather than relying on hand-engineered features.

---

##  Why "Deep"?

- "Deep" = **many hidden layers** between input and output.
- More layers → ability to learn **complex patterns and representations**.

---

##  Key Components

###  Neural Networks:
- Composed of **neurons (nodes)** connected by **weights**.
- Layers:
  - **Input Layer:** Takes features.
  - **Hidden Layers:** Extract patterns.
  - **Output Layer:** Provides predictions.
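
To make the three layer roles concrete, here is a minimal sketch of a forward pass through a tiny network in plain Python. The weights, biases, and helper names (`dense`, `relu_layer`) are made up for illustration; a real network would learn its weights from data.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu_layer(values):
    """Element-wise ReLU, the non-linearity applied after the hidden layer."""
    return [max(0.0, v) for v in values]

# Hypothetical parameters: 2 input features -> 2 hidden neurons -> 1 output.
features = [1.0, 2.0]                  # input layer: the raw features
W_hidden = [[0.5, -1.0], [1.0, 1.0]]   # one row of weights per hidden neuron
b_hidden = [0.0, -1.0]
W_out = [[1.0, 2.0]]
b_out = [0.5]

hidden = relu_layer(dense(features, W_hidden, b_hidden))  # hidden layer
prediction = dense(hidden, W_out, b_out)                  # output layer
```

Each weight row corresponds to one neuron, so adding a hidden layer amounts to chaining another `dense` call followed by an activation.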

###  Activation Functions:
- Add **non-linearity**.
- Common:
  - `relu`
  - `sigmoid`
  - `tanh`
  - `softmax` (for multi-class classification)
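
The four activations listed above can be sketched in a few lines of plain Python. This is an illustrative version, not a library implementation; the branch in `sigmoid` and the max-subtraction in `softmax` are common numerical-stability tricks added here as assumptions.

```python
import math

def relu(x):
    """Passes positive values through and clamps negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into (0, 1); branches to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    """Squashes any real number into (-1, 1)."""
    return math.tanh(x)

def softmax(logits):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(logits)                       # subtracting the max avoids overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

`relu`, `sigmoid`, and `tanh` act on one value at a time, while `softmax` needs the whole vector of scores, which is why it is typically placed on the output layer of a multi-class classifier.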

###  Loss Functions:
- Measure **prediction error**.
- Examples:
  - `categorical_crossentropy` (multi-class)
  - `binary_crossentropy` (binary)
  - `mse` (regression)

###  Optimizers:
- Update weights to minimize loss.
- Examples:
  - `SGD`
  - `Adam`
  - `RMSprop`

---

# Loss Functions

Loss functions **measure how well your model predictions match true labels**.

---

### **Classification:**

- **Binary Classification:**
  - `binary_crossentropy`
  - Formula:
    $$
    L = - [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]
    $$
- **Multi-class Classification:**
  - `categorical_crossentropy` (one-hot labels)
  - `sparse_categorical_crossentropy` (integer labels)
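
As a rough sketch of how these classification losses differ, the functions below compute each one for a single example in plain Python. The function names mirror the Keras-style identifiers above but are standalone illustrations, and the `EPS` clipping constant is an assumption added to avoid `log(0)`.

```python
import math

EPS = 1e-12  # assumed clipping constant so log() never receives 0

def binary_crossentropy(y, y_hat):
    """L = -[y*log(y_hat) + (1-y)*log(1-y_hat)] for one example."""
    y_hat = min(max(y_hat, EPS), 1.0 - EPS)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def categorical_crossentropy(one_hot, probs):
    """Expects the label as a one-hot list, e.g. [0, 1, 0]."""
    return -sum(t * math.log(max(p, EPS)) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Expects the label as an integer class index, e.g. 1."""
    return -math.log(max(probs[label], EPS))
```

For the same prediction, `categorical_crossentropy([0, 1, 0], probs)` and `sparse_categorical_crossentropy(1, probs)` give the same value; the two differ only in how the label is encoded.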

---

### **Regression:**

- **Mean Squared Error (MSE):**
  - Measures average squared difference.
  - $$
    L = \frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2
    $$
- **Mean Absolute Error (MAE):**
  - Measures average absolute difference; less sensitive to outliers than MSE.
  - $$
    L = \frac{1}{n} \sum_{i} |y_i - \hat{y}_i|
    $$
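
A small worked example may help compare the two regression losses. The data below is invented for illustration; the last prediction is deliberately far off to show how squaring amplifies large errors.

```python
def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 6.0]   # errors: 1, 0, -4

mse_value = mse(y_true, y_pred)   # (1 + 0 + 16) / 3
mae_value = mae(y_true, y_pred)   # (1 + 0 + 4) / 3
```

The single error of 4 contributes 16 of the 17 units in the MSE numerator but only 4 of the 5 in the MAE numerator.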

---

##  Gradient Descent (GD)

Gradient Descent:
- An optimization algorithm to **minimize the loss**.
- Computes:
  $$
  w := w - \eta \frac{\partial L}{\partial w}
  $$
  where:
  - $w$ = parameters/weights
  - $\eta$ = learning rate
  - $\frac{\partial L}{\partial w}$ = gradient of loss

---

### Key Points:
* Takes **steps in the negative gradient direction**.  
* Repeats until convergence (loss stops decreasing).
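
The update rule and stopping condition can be sketched on a one-parameter toy problem. The loss $L(w) = (w - 3)^2$, the starting point, and the tolerance-based stopping test are all assumptions chosen for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss: dL/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, steps=200, tol=1e-10):
    """Repeat w := w - lr * dL/dw until the step becomes negligible."""
    w = w0
    for _ in range(steps):
        step = lr * grad(w)
        w -= step                 # move against the gradient
        if abs(step) < tol:       # assumed convergence test
            break
    return w

w_final = gradient_descent(w0=0.0)
```

Starting from `w0 = 0.0`, every step shrinks the distance to the minimum by a factor of `1 - 2 * lr`, so `w_final` ends up very close to 3.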

---

## Stochastic Gradient Descent (SGD)

SGD:
- Variant of GD.
- Updates weights using **one data sample at a time** (or small batches).

---

### **Difference from GD:**
| GD | SGD |
|----|-----|
| Uses **all data** per update | Uses **one sample** per update |
| Stable but slow | Faster updates, more noise |
| Needs large memory | Memory efficient |

---

### **Mini-batch SGD:**
Uses **small batches (e.g., 32 or 64 samples)** per update, balancing the stability of GD against the speed of SGD.
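
Here is a minimal sketch of mini-batch SGD fitting a line $y = wx + b$ with an MSE loss. The synthetic data, hyperparameters, and the function name `minibatch_sgd` are assumptions for illustration, not a reference implementation.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=300, seed=0):
    """Fit y = w*x + b by averaging MSE gradients over small random batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
            grad_w = sum(2.0 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2.0 * e for e, _ in errors) / len(batch)
            w -= lr * grad_w                       # one update per batch
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Setting `batch_size=1` gives one-sample SGD, and `batch_size=len(xs)` gives full-batch GD, so the same loop covers all three variants.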

---

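## Why is SGD used in Deep Learning?

1. Updates are cheap and frequent, which often means faster progress on large datasets.
2. Gradient noise can help escape shallow local minima and saddle points.
3. Scales to large models and data, since only a batch needs to be in memory at once.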




# Negative Log-Likelihood (NLL)



##  What is it?

**Negative Log-Likelihood (NLL)** is a **loss function** that measures how well predicted probabilities match true labels.

---

##  Formula

For one-hot labels:
$$
L = - \sum_{i} y_i \cdot \log(\hat{y}_i)
$$

where:
- $y_i$ = true label (1 for correct class, 0 otherwise),
- $\hat{y}_i$ = predicted probability for class $i$.

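**Same as `categorical_crossentropy`** in multi-class classification. With one-hot labels only the true-class term survives, so the loss reduces to $L = -\log(\hat{y}_c)$ for the true class $c$.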

---

##  Why use NLL?

 1. Penalizes **low probabilities for correct labels heavily**.  
 2. Encourages **high confidence for correct predictions**.  
 3. Smooth, differentiable, ideal for **gradient-based optimization**.
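
The heavy penalty in point 1 is easy to see numerically. The sketch below applies the one-hot formula to three made-up predictions for the same label; the probabilities are illustrative only.

```python
import math

def nll(one_hot, probs):
    """L = -sum_i y_i * log(y_hat_i); terms with y_i = 0 are skipped."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)

label = [0, 1, 0]                             # true class is index 1

confident = nll(label, [0.05, 0.90, 0.05])    # -log(0.90), about 0.105
unsure = nll(label, [0.30, 0.40, 0.30])       # -log(0.40), about 0.916
wrong = nll(label, [0.90, 0.01, 0.09])        # -log(0.01), about 4.605
```

Dropping the true-class probability from 0.90 to 0.01 raises the loss by a factor of more than 40, and the loss grows without bound as that probability approaches 0.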

---

##  Where used?

- **Classification tasks**
- **Language models**
- Any **probabilistic deep learning task requiring likelihood maximization**

---

> Minimize NLL → **maximize the probability your model assigns to the correct classes.**



# Learning Rate, Momentum, Dropout, and Regularization

---

##  Learning Rate (LR)

- Controls **step size** during optimization.
- Too high → diverges.
- Too low → slow convergence.
- Typical values: `0.1`, `0.01`, `0.001`.

Tune it with learning rate schedules, or use adaptive optimizers like `Adam`, which scale the step size per parameter (a base learning rate still has to be chosen).
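
The effect of the step size can be checked on a toy loss. This sketch assumes $L(w) = w^2$ (gradient $2w$) and compares three learning rates picked to show each regime; the specific values are illustrative.

```python
def run_gd(lr, w0=1.0, steps=20):
    """Run plain gradient descent on L(w) = w**2 and return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w     # each step multiplies w by (1 - 2 * lr)
    return abs(w)

too_high = run_gd(1.1)      # factor is -1.2: |w| grows every step (diverges)
reasonable = run_gd(0.1)    # factor is 0.8: steady progress toward 0
too_low = run_gd(0.001)     # factor is 0.998: barely moves in 20 steps
```

On this particular loss any learning rate above 1.0 diverges; the threshold depends on the curvature of the loss, which is why the right value has to be tuned per problem.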

---

##  Momentum

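- Helps **accelerate gradient descent in relevant directions** and **dampen oscillations**.
- Keeps a running average $v_t$ of past gradients and steps along it instead of the raw gradient:
  $$
  v_t = \beta v_{t-1} + (1 - \beta) \nabla L, \qquad w := w - \eta v_t
  $$
  where $\beta$ is the momentum coefficient and $\eta$ is the learning rate. Some libraries drop the $(1 - \beta)$ factor.
- Typical `momentum` values: `0.9`, `0.99`.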

 Useful with `SGD` to speed up convergence.
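
A minimal sketch of the momentum update, using the averaged-gradient form $v_t = \beta v_{t-1} + (1 - \beta) \nabla L$ and assuming the weights are then moved with $w := w - \eta v_t$. The toy loss $(w - 3)^2$ and the name `sgd_momentum` are illustrative; library implementations may scale the velocity differently.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent where each step follows a running average of gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1.0 - beta) * g   # blend previous velocity with new gradient
        w -= lr * v                       # step along the velocity, not the raw gradient
    return w

# Toy loss L(w) = (w - 3)**2 with gradient 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With `beta=0.0` the velocity equals the current gradient and the loop reduces to plain gradient descent.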

---

##  Dropout

- **Regularization technique** to prevent overfitting.
- Randomly “drops” a fraction of neurons (sets their outputs to zero) on each training step; all neurons are used at inference.
- Forces the network to **learn redundant, robust representations**.

✅ Typical dropout rates: `0.2`, `0.5`.
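
Dropout on a list of activations can be sketched as follows. This version uses the "inverted dropout" convention, where surviving activations are scaled up by `1 / (1 - rate)` during training so nothing needs rescaling at inference; that scaling choice and the function signature are assumptions, not something the notes above specify.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate`; scale survivors by 1/(1-rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)          # inference: pass everything through
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Example: with rate=0.5, each value is either dropped (0.0) or doubled.
sample = dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=random.Random(0))
```

Because a different random subset is dropped on every call, no single neuron can be relied on, which is what pushes the network toward redundant representations.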

---

##  Regularization

Adds a **penalty to the loss function** to prevent overfitting.

---

### **Types:**

- **L1 Regularization (Lasso):**
  Adds:
  $$
  \lambda \sum_{i} |w_i|
  $$
  Encourages sparsity (many weights → 0).

- **L2 Regularization (Ridge):**
  Adds:
  $$
  \lambda \sum_{i} w_i^2
  $$
  Encourages smaller weights, smooths the model.

- **Elastic Net:**
  Combines both penalties, each with its own strength: $\lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$.
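
The three penalties can be sketched as plain functions that are added to the data loss. The weights, the $\lambda$ values, and the `data_loss` number are made up for illustration.

```python
def l1_penalty(weights, lam):
    """lam * sum of absolute weights (Lasso-style penalty)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum of squared weights (Ridge-style penalty)."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam1, lam2):
    """Sum of an L1 term and an L2 term, each with its own strength."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

weights = [0.5, -2.0, 0.0]
data_loss = 1.0   # hypothetical loss from the prediction error alone

total_l1 = data_loss + l1_penalty(weights, 0.1)   # penalty: 0.1 * 2.5
total_l2 = data_loss + l2_penalty(weights, 0.1)   # penalty: 0.1 * 4.25
```

The weight of `-2.0` contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which is why L2 pushes hardest on large weights while L1 applies the same pressure to every non-zero weight.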

---


