# Introduction to Deep Learning

Deep Learning is a subset of machine learning that utilizes artificial neural networks with multiple layers to model complex patterns in data. It has achieved remarkable success in fields such as computer vision, natural language processing, and speech recognition.

Key characteristics of deep learning include:

- **Hierarchical Feature Learning:** Deep neural networks automatically learn representations from raw data through multiple layers of abstraction.
- **Large-Scale Data Utilization:** Deep learning models typically keep improving as the amount of training data grows, which is why they excel on large datasets.
- **End-to-End Learning:** These models can learn directly from input to output, reducing the need for manual feature engineering.

In this notebook, we will explore the fundamental concepts and practical applications of deep learning.

## Types of Deep Learning

Deep learning techniques can be categorized based on the type of data and supervision involved in the learning process:

### 1. Supervised Learning
- **Definition:** The model is trained on labeled data, where each input has a corresponding output label.
- **Examples:** Image classification, speech recognition, sentiment analysis.

### 2. Unsupervised Learning
- **Definition:** The model learns patterns from unlabeled data, discovering hidden structures without explicit output labels.
- **Examples:** Clustering, dimensionality reduction, anomaly detection.

### 3. Semi-supervised Learning
- **Definition:** Combines a small amount of labeled data with a large amount of unlabeled data during training.
- **Examples:** Text classification with limited labeled samples, image recognition with few annotated images.

These approaches enable deep learning models to tackle a wide range of real-world problems, even when labeled data is scarce.

## Core Concepts of Deep Learning

Deep learning is built upon several foundational concepts that enable neural networks to learn complex patterns from data:

- **Artificial Neural Networks (ANNs):** Computational models inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers.

- **Layers:**
    - **Input Layer:** Receives raw data.
    - **Hidden Layers:** Perform feature extraction and transformation through learned weights.
    - **Output Layer:** Produces the final prediction or classification.

- **Activation Functions:** Non-linear functions (e.g., ReLU, sigmoid, tanh) applied to neuron outputs, allowing networks to model complex relationships.

- **Forward Propagation:** The process of passing input data through the network to generate predictions.

- **Loss Function:** Measures the difference between predicted and actual values (e.g., mean squared error, cross-entropy).

- **Backpropagation:** An algorithm for computing gradients of the loss function with respect to network weights, enabling learning.

- **Optimization Algorithms:** Methods like stochastic gradient descent (SGD), Adam, or RMSprop that update weights to minimize the loss.

- **Overfitting and Regularization:** Overfitting occurs when a model memorizes its training data instead of learning patterns that generalize. Regularization techniques such as dropout, weight decay, and data augmentation counteract it.

- **Batch and Epoch:** Training is performed in batches (subsets of data), and one complete pass through the dataset is called an epoch.

Understanding these core concepts is essential for designing, training, and evaluating deep learning models effectively.

## Forward Propagation Example – Student Marks Prediction

We want to predict a student's **marks** based on two features:  

- `x1` = hours studied per day  
- `x2` = number of days studied  

The student scored **16 marks**.

---

### Step 1: Define weights and bias
Assume the model assigns the following importance to each feature:

- `w1 = 2` (weight for hours/day)  
- `w2 = 3` (weight for days)  
- `b = 0` (bias term)

Prediction rule:

ŷ = (w1 * x1) + (w2 * x2) + b


---

### Step 2: Feed-forward example
Suppose the student:  

- studied **2 hours/day** (`x1 = 2`)  
- studied for **4 days** (`x2 = 4`)  

Compute step by step:

- z = (w1 * x1) + (w2 * x2) + b = (2 * 2) + (3 * 4) + 0
- z = 4 + 12 = 16
- ŷ = z = 16 (no activation function is applied in this example)


✅ Predicted marks = **16**, matches actual marks.
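
The same forward pass can be sketched in a few lines of Python. The function name `predict` is an illustrative choice, and the default weights are the ones assumed in Step 1.

```python
def predict(x1, x2, w1=2.0, w2=3.0, b=0.0):
    """Forward pass of a single linear neuron: weighted sum of inputs plus bias."""
    return w1 * x1 + w2 * x2 + b


# Student studied 2 hours/day for 4 days.
y_hat = predict(x1=2, x2=4)
print(y_hat)  # 16.0
```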

---

### Step 3: Neural network interpretation
- **Inputs (`x1, x2`)** → student features  
- **Weights (`w1, w2`)** → importance of each feature  
- **Bias (`b`)** → extra adjustment  
- **Computation (forward propagation)** → weighted sum  
- **Output (`ŷ`)** → predicted marks  

---

⚡ Forward propagation = features → multiply by weights → add bias → output.

---

![Student Marks Prediction Neural Network](./Images/StudentMarks.png)

*Figure: Simple neural network diagram for student marks prediction using forward propagation.*

## Understanding the Loss Function in Neural Networks

In the student marks prediction example, the **loss function** measures how well the model's predictions match the actual marks. It quantifies the difference between the predicted output (`ŷ`) and the true value (`y`).

### Why is the Loss Function Important?
- It provides feedback to the model about its performance.
- During training, the model adjusts its weights and bias to minimize this loss, improving prediction accuracy.

### Common Loss Functions
- **Mean Squared Error (MSE):** Used for regression tasks. It calculates the average of the squared differences between predicted and actual values.
- **Cross-Entropy Loss:** Used for classification tasks.

### Mean Squared Error (MSE) Formula

The general formula for Mean Squared Error is:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
- $n$ = number of samples
- $y_i$ = actual value for sample $i$
- $\hat{y}_i$ = predicted value for sample $i$

---

### Example Calculations

**For one sample with actual marks = 16, predicted marks = 15:**

$$\text{MSE} = (16 - 15)^2 = 1$$

**If the prediction is perfect (predicted = 16):**

$$\text{MSE} = (16 - 16)^2 = 0$$

✅ **Key Insight:** Perfect predictions give zero loss!
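
As a minimal sketch, the MSE formula can be written as a small Python function. The name `mse` and the two-sample call at the end are illustrative assumptions, not part of the marks example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


print(mse([16], [15]))          # 1.0  (prediction off by one mark)
print(mse([16], [16]))          # 0.0  (perfect prediction)
print(mse([16, 10], [15, 12]))  # 2.5  (hypothetical second sample: (1 + 4) / 2)
```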

### In the Neural Network

- **Forward propagation** computes the prediction (`ŷ`).
- The **loss function** (e.g., MSE) measures the error.
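- **Backpropagation** computes the gradient of this error with respect to each weight and the bias; gradient descent then uses those gradients to update the parameters and reduce the loss in future predictions.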

---

**Summary:**  
The loss function is a critical component that guides the learning process in neural networks by quantifying prediction errors and enabling the model to improve through training.

## Understanding Backpropagation in Neural Networks

Backpropagation is the key algorithm that enables neural networks to learn from data. It is used to update the model's weights and bias based on the error (loss) between the predicted output and the actual value.

### How Backpropagation Works

1. **Forward Propagation:**  
    - Compute the output (`ŷ`) using the current weights and bias.
    - Calculate the loss (e.g., mean squared error).

2. **Backward Propagation:**  
    - Compute the gradient of the loss with respect to each parameter (weights and bias).
    - These gradients indicate how much each parameter contributed to the error.

3. **Parameter Update:**  
    - Adjust the weights and bias in the direction that reduces the loss, typically using gradient descent.

---

### Example: Backpropagation for Student Marks Prediction

Suppose we have:
- Inputs: `x1 = 2` (hours/day), `x2 = 4` (days)
- Weights: `w1 = 2`, `w2 = 3`
- Bias: `b = 0`
- Actual marks: `y = 16`
- Predicted marks: `ŷ = (w1 * x1) + (w2 * x2) + b = 16`

If the prediction is perfect, the loss is zero, and no update is needed.  
But if the prediction is not perfect (e.g., `ŷ = 15`), we need to update the weights and bias.

#### Steps:

1. **Compute the loss:**  
   $$\text{Loss} = (y - \hat{y})^2$$

2. **Compute gradients:**  
   - For each parameter (e.g., `w1`), calculate how the loss changes if we change that parameter:
     $$\frac{\partial \text{Loss}}{\partial w_1} = 2 \cdot (y - \hat{y}) \cdot (-x_1)$$
   - Similarly for `w2` and `b`.

3. **Update parameters:**  
   - Using a learning rate (`lr`), update each parameter:
     $$w_1 = w_1 - lr \cdot \frac{\partial \text{Loss}}{\partial w_1}$$
   - Repeat for `w2` and `b`.

---

#### Steps with Actual Values:

**Given:**
- Actual marks (y) = 16
- Predicted marks (ŷ) = 15
- Input values: x₁ = 2 (hours), x₂ = 4 (days)
- Current weights: w₁ = 1.5, w₂ = 3.0, with bias b = 0 (example values that give ŷ = 1.5 · 2 + 3.0 · 4 = 15; the bias is left out of the update below to keep the arithmetic short)

1. **Compute the loss:**  
   $$\text{Loss} = (y - \hat{y})^2 = (16 - 15)^2 = 1$$

2. **Compute gradients:**  
   - For w₁:
     $$\frac{\partial \text{Loss}}{\partial w_1} = 2 \cdot (y - \hat{y}) \cdot (-x_1) = 2 \cdot (16 - 15) \cdot (-2) = 2 \cdot 1 \cdot (-2) = -4$$

   - For w₂:
     $$\frac{\partial \text{Loss}}{\partial w_2} = 2 \cdot (y - \hat{y}) \cdot (-x_2) = 2 \cdot (16 - 15) \cdot (-4) = 2 \cdot 1 \cdot (-4) = -8$$

3. **Update parameters (assuming learning rate = 0.1):**  
   - Update w₁:
     $$w_1 = w_1 - lr \cdot \frac{\partial \text{Loss}}{\partial w_1} = w_1 - 0.1 \cdot (-4) = w_1 + 0.4 = 1.9$$

   - Update w₂:
     $$w_2 = w_2 - lr \cdot \frac{\partial \text{Loss}}{\partial w_2} = w_2 - 0.1 \cdot (-8) = w_2 + 0.8 = 3.8$$

4. **Revised prediction with new weights:**
   $$\hat{y}^{new} = w_1^{new} \cdot x_1 + w_2^{new} \cdot x_2 = 1.9 \cdot 2 + 3.8 \cdot 4 = 3.8 + 15.2 = 19$$


**Result:** Both weights increase, so the prediction increases, which is the right direction when ŷ < y. With a learning rate of 0.1, however, the step overshoots the target (19 instead of 16). A smaller learning rate such as 0.01 gives w₁ = 1.54, w₂ = 3.08 and ŷ = 3.08 + 12.32 = 15.4, which is closer to 16.

---

**Summary:**  
Backpropagation calculates how to adjust each weight and bias to reduce the error. By repeating this process over many examples, the neural network learns to make better predictions.
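
The loop below is a minimal sketch of repeated gradient-descent steps on the single student sample. The starting weights (`w1 = 1.5`, `w2 = 3.0`, chosen so the first prediction is 15), the learning rate of `0.01`, and the number of steps are assumptions made for illustration. The function name `train_step` is hypothetical.

```python
def train_step(w1, w2, b, x1, x2, y, lr):
    """One forward pass, gradient computation, and parameter update for a linear neuron."""
    y_hat = w1 * x1 + w2 * x2 + b      # forward propagation
    error = y - y_hat
    loss = error ** 2                  # squared error for this sample
    grad_w1 = 2 * error * (-x1)        # dLoss/dw1
    grad_w2 = 2 * error * (-x2)        # dLoss/dw2
    grad_b = 2 * error * (-1)          # dLoss/db
    return w1 - lr * grad_w1, w2 - lr * grad_w2, b - lr * grad_b, loss


w1, w2, b = 1.5, 3.0, 0.0              # assumed starting point (prediction = 15)
losses = []
for step in range(20):
    w1, w2, b, loss = train_step(w1, w2, b, x1=2, x2=4, y=16, lr=0.01)
    losses.append(loss)

print(f"first loss: {losses[0]}, last loss: {losses[-1]:.2e}")
print(f"final prediction: {w1 * 2 + w2 * 4 + b:.4f}")
```

With this small learning rate the loss shrinks at every step instead of overshooting the target.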

## Linearity vs. Non-Linearity in Neural Networks

### What is Linearity?

In neural networks, a linear model computes outputs as a weighted sum of inputs plus a bias:
$$
y = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b
$$

**Example:**  
A single-layer perceptron without activation functions is a linear model. It can only learn to separate data that is linearly separable (e.g., a straight line in 2D).

---

### What is Non-Linearity?

A function is **non-linear** if it cannot be written in the weighted-sum form above. Non-linear functions can model complex relationships and boundaries.

In neural networks, **non-linearity** is introduced using activation functions like ReLU, sigmoid, or tanh. These functions allow the network to learn and represent complex, non-linear patterns in data.

**Example:**  
The XOR problem cannot be solved by a linear model, but a neural network with non-linear activation functions can learn the XOR relationship.
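
As a small illustration, the network below computes XOR with two ReLU hidden units. The weights are set by hand for this sketch (an assumption, not the result of training), and the name `xor_net` is hypothetical.

```python
def relu(z):
    """ReLU activation: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


def xor_net(x1, x2):
    """Tiny 2-2-1 network with hand-picked weights that reproduces XOR."""
    h1 = relu(x1 + x2)          # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are 1
    return h1 - 2.0 * h2        # output layer: linear combination of hidden units


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

Removing the `relu` calls turns the output into `-(x1 + x2) + 2`, a purely linear expression that cannot match all four XOR cases.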

---

### Why is Non-Linearity Important?

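- **Expressive Power:** Non-linear activation functions let neural networks approximate a very wide class of functions (with enough hidden units, any continuous function on a bounded domain to arbitrary accuracy), not just straight lines or planes.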
- **Solving Complex Problems:** Many real-world problems (like image recognition, language understanding) require modeling non-linear relationships.

---

**Summary:**  
- **Linear models** are limited to simple, straight-line relationships.
- **Non-linear models** (with activation functions) can capture complex patterns, making deep learning powerful and flexible.

![Linearity vs Non-Linearity in Neural Networks](./Images/LinearityVsNonLinearity.png)

*Figure: Linear models can only separate data with straight lines, while non-linear models (with activation functions) can capture complex boundaries.*

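The table below makes the same point with illustrative numbers: a linear model that adds 10 marks per study hour keeps rising, while the assumed real relationship flattens and then drops.

| Study Hours | Linear Prediction | Reality (Non-Linear) | Difference (Reality − Linear) |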
|-------------|-------------------|---------------------|------------|
| 2 hours     | 40 marks          | 35 marks            | -5 (slow start) |
| 4 hours     | 60 marks          | 65 marks            | +5 (sweet spot) |
| 6 hours     | 80 marks          | 85 marks            | +5 (peak efficiency) |
| 8 hours     | 100 marks         | 90 marks            | -10 (diminishing returns) |
| 10 hours    | 120 marks         | 85 marks            | -35 (burnout effect) |

## Activation Functions in Neural Networks

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns and relationships in data. Without activation functions, a neural network would behave like a simple linear model, regardless of its depth.
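
The claim that depth alone does not help can be checked with a short sketch: two stacked linear layers without an activation give exactly the same output as one linear layer. The coefficient values below are arbitrary assumptions chosen for illustration.

```python
def two_linear_layers(x, a1=2.0, c1=1.0, a2=-3.0, c2=0.5):
    """Two stacked linear layers with no activation in between."""
    h = a1 * x + c1
    return a2 * h + c2


def collapsed_layer(x, a1=2.0, c1=1.0, a2=-3.0, c2=0.5):
    """Single linear layer whose weight and bias absorb both layers above."""
    return (a2 * a1) * x + (a2 * c1 + c2)


for x in [-2.0, 0.0, 1.5]:
    print(x, two_linear_layers(x), collapsed_layer(x))
```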

### Why Are Activation Functions Important?
- **Non-linearity:** Allow networks to approximate complex, non-linear functions.
- **Decision Boundaries:** Enable the network to learn intricate decision boundaries for classification and regression tasks.
- **Gradient Flow:** Affect how gradients are propagated during backpropagation, influencing learning efficiency.

### Common Activation Functions

| Name      | Formula                                  | Range         | Typical Use                |
|-----------|------------------------------------------|---------------|----------------------------|
| Sigmoid   | $\sigma(x) = \frac{1}{1 + e^{-x}}$       | (0, 1)        | Output layer (binary)      |
| Tanh      | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | (-1, 1)   | Hidden layers              |
| ReLU      | $f(x) = \max(0, x)$                      | [0, ∞)        | Hidden layers (default)    |
| Leaky ReLU| $f(x) = \max(\alpha x, x)$, $\alpha \ll 1$ | (-∞, ∞)    | Hidden layers (avoids dying ReLU) |
| Softmax   | $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | (0, 1), sum=1 | Output layer (multi-class) |

---

### Visual Comparison

- **Sigmoid:** S-shaped curve, squashes input to (0, 1). Good for probabilities, but can cause vanishing gradients.
- **Tanh:** Similar to sigmoid but outputs between -1 and 1. Zero-centered, often preferred over sigmoid in hidden layers.
- **ReLU (Rectified Linear Unit):** Outputs zero for negative inputs, linear for positive. Fast and effective, but can suffer from "dying ReLU" (neurons stuck at zero).
- **Leaky ReLU:** Like ReLU, but allows a small, non-zero gradient when input is negative.
- **Softmax:** Converts a vector of values into probabilities that sum to 1. Used for multi-class classification.
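
The functions in the table can be sketched in plain Python as follows. This is a minimal, unoptimized version for scalars and short lists. The numerically stable form of `sigmoid` and the max-subtraction in `softmax` are common implementation choices rather than part of the definitions.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def softmax(xs):
    """Softmax over a list; subtracting the max keeps exp() from overflowing."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


for x in [-2.0, 0.0, 2.0]:
    print(x, round(sigmoid(x), 2), round(math.tanh(x), 2), relu(x), leaky_relu(x))
print(softmax([2.0, 1.0, 0.0]))
```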


---

![Activation Functions Comparison](./Images/ActivationFunctions.png)

**Summary:**  
Activation functions are essential for deep learning. They enable neural networks to model complex, non-linear relationships and make deep architectures powerful and expressive. The choice of activation function can significantly impact model performance and training dynamics.



**Example Table:**

| Input ($x$) | Sigmoid($x$) | Tanh($x$) | ReLU($x$) | Leaky ReLU($x$) ($\alpha=0.01$) | Softmax($x$)\* |
|-------------|--------------|-----------|-----------|-------------------|--------------|
| -200        | 0.00         | -1.00     | 0         | -2.00             | ~0.00        |
| -100        | 0.00         | -1.00     | 0         | -1.00             | ~0.00        |
| -10         | 0.00         | -1.00     | 0         | -0.10             | ~0.00        |
| -5          | 0.01         | -1.00     | 0         | -0.05             | ~0.01        |
| -2          | 0.12         | -0.96     | 0         | -0.02             | ~0.12        |
| 0           | 0.50         | 0         | 0         | 0                 | 0.50         |
| 2           | 0.88         | 0.96      | 2         | 2                 | ~0.88        |
| 5           | 0.99         | 1.00      | 5         | 5                 | ~0.99        |
| 10          | 1.00         | 1.00      | 10        | 10                | ~1.00        |
| 100         | 1.00         | 1.00      | 100       | 100               | ~1.00        |
| 200         | 1.00         | 1.00      | 200       | 200               | ~1.00        |

\*Softmax($x$) shown as the probability for $x$ in the set [$x$, 0], i.e., $\text{softmax}(x) = \frac{e^{x}}{e^{x} + e^{0}}$. For this two-element set, softmax reduces to the sigmoid, so the last column matches the Sigmoid column. For large negative $x$, softmax approaches 0; for large positive $x$, it approaches 1.

These activation functions are essential for enabling deep neural networks to model complex, non-linear relationships in data.

### Guidelines for Selecting an Activation Function

Choosing the right activation function is crucial for neural network performance. Here are some practical guidelines:

- **Hidden Layers (Default):**
    - **ReLU (Rectified Linear Unit):**  
        - Most commonly used for hidden layers in deep networks.
        - Pros: Fast and simple to compute; mitigates the vanishing-gradient problem.
        - Cons: Can suffer from "dying ReLU" (neurons stuck at zero).
    - **Leaky ReLU / Parametric ReLU:**  
        - Use if you observe many dead neurons with standard ReLU.
        - Allows a small gradient when input is negative.

- **Output Layer:**
    - **Sigmoid:**  
        - Use for binary classification (output between 0 and 1).
    - **Softmax:**  
        - Use for multi-class classification (outputs probabilities that sum to 1).
    - **Linear:**  
        - Use for regression tasks (predicting continuous values).

- **Other Considerations:**
    - **Tanh:**  
        - Sometimes preferred over sigmoid in hidden layers (zero-centered output).
        - Can still suffer from vanishing gradients in deep networks.
    - **Swish, GELU, ELU:**  
        - Advanced activations that may offer improvements in some architectures.

- **Empirical Testing:**  
    - Try different activation functions and compare validation performance.
    - Monitor for issues like vanishing/exploding gradients or dead neurons.

**Summary Table:**

| Task Type                | Recommended Activation Function |
|--------------------------|-------------------------------|
| Hidden layers (default)  | ReLU, Leaky ReLU              |
| Binary classification    | Sigmoid (output layer)        |
| Multi-class classification | Softmax (output layer)      |
| Regression               | Linear (output layer)         |

> **Tip:** Start with ReLU for hidden layers and the appropriate output activation for your task. Experiment if you encounter training issues.

## Batch vs. Epoch in Neural Network Training

**Batch** and **epoch** are key terms in the training process of neural networks:

### Batch
- A **batch** is a subset of the training dataset used to compute one forward and backward pass.
- Instead of updating weights after every single sample (stochastic) or after the entire dataset (full batch), data is split into smaller batches.
- **Batch size** is the number of samples processed before the model's parameters are updated.

### Epoch
- An **epoch** is one complete pass through the entire training dataset.
- During an epoch, the model sees every sample in the training set once (typically, in shuffled order).
- Training usually involves multiple epochs.

---

### Key Differences

| Term   | Definition | Example (Dataset of 1000 samples, batch size = 100) |
|--------|------------|-----------------------------------------------------|
| Batch  | Subset of data used for one update | 10 batches per epoch (100 samples each) |
| Epoch  | One full pass through all data     | 1 epoch = 10 batches (all 1000 samples seen once) |

- **Batch size** controls how many samples are processed before updating the model.
- **Epoch** counts how many times the model has seen the entire dataset.

---

**Summary:**  
- **Batch:** Subset of samples processed before one weight update; the number of samples in it is the batch size.
- **Epoch:** One full pass through the entire training data.
- Multiple batches make up one epoch; multiple epochs are used to train the model for better performance.

### Example: Understanding Batch and Epoch in Neural Network Training

Suppose you have a dataset with **12 samples**:

| Sample | Data |
|--------|------|
| 1      | ...  |
| 2      | ...  |
| 3      | ...  |
| ...    | ...  |
| 12     | ...  |

Let's say you set **batch size = 4** and train for **3 epochs**.

### How Training Proceeds

- **Batch:** Each batch contains 4 samples.
- **Epoch:** One epoch means the model sees all 12 samples once.

#### Breakdown

- **Number of batches per epoch:**  
    $12 \text{ samples} \div 4 \text{ (batch size)} = 3 \text{ batches per epoch}$

- **Epoch 1:**  
    - Batch 1: samples 1–4  
    - Batch 2: samples 5–8  
    - Batch 3: samples 9–12

- **Epoch 2:**  
    - Batch 1: samples 1–4 (possibly shuffled)  
    - Batch 2: samples 5–8  
    - Batch 3: samples 9–12

- **Epoch 3:**  
    - Repeat as above

### Visualization

| Epoch | Batch 1      | Batch 2      | Batch 3      |
|-------|--------------|--------------|--------------|
| 1     | 1, 2, 3, 4   | 5, 6, 7, 8   | 9, 10, 11, 12|
| 2     | 1, 2, 3, 4   | 5, 6, 7, 8   | 9, 10, 11, 12|
| 3     | 1, 2, 3, 4   | 5, 6, 7, 8   | 9, 10, 11, 12|

- **After each batch:** Model updates its weights.
- **After each epoch:** Model has seen all data once.

---

**Summary:**  
- **Batch:** Subset of data used for one update (e.g., 4 samples).
- **Epoch:** One full pass through the entire dataset (e.g., all 12 samples).  
- Training for multiple epochs means the model sees the data multiple times, improving learning.
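
The batch and epoch bookkeeping for this 12-sample example can be sketched as follows. The helper name `iterate_batches` and the fixed random seed are assumptions for illustration; the weight update itself is only marked by a comment.

```python
import random


def iterate_batches(samples, batch_size, shuffle=True, rng=None):
    """Yield consecutive batches covering every sample exactly once (one epoch)."""
    order = list(samples)
    if shuffle:
        (rng or random).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


samples = list(range(1, 13))   # sample IDs 1..12
rng = random.Random(0)         # fixed seed so the shuffling is repeatable
updates = 0

for epoch in range(1, 4):      # 3 epochs
    for batch in iterate_batches(samples, batch_size=4, rng=rng):
        updates += 1           # one forward pass, one backward pass, one weight update
        print(f"epoch {epoch}: batch {batch}")

print("total weight updates:", updates)  # 3 epochs x 3 batches = 9
```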

In neural network training:

- **One batch** involves **one forward propagation** and **one backward propagation** (i.e., one update step) for that batch of samples.
- **One epoch** consists of as many forward and backward propagations as there are batches in the dataset.

**Example:**  
If your dataset has 1000 samples and batch size is 100:
- Number of batches per epoch = 1000 / 100 = **10**
- So, **one epoch = 10 forward + 10 backward propagations** (one per batch)

**Summary Table:**

| Term   | Forward Propagations | Backward Propagations |
|--------|----------------------|-----------------------|
| 1 Batch | 1                    | 1                     |
| 1 Epoch | Number of batches    | Number of batches     |

Thus, **one epoch = (number of batches) × (one forward + one backward propagation)**.

With **batch training**, all samples in a batch (for example, the 4 samples per batch in the example above) are processed **together** in a single forward and backward pass:

- **Forward pass:**  
    The model computes predictions for all 4 samples at once (using vectorized operations).  
    Example: If `X` is shape `(4, 2)`, the network computes outputs for all 4 rows in one go.

- **Loss calculation:**  
    The loss is computed for each sample, then averaged (or summed) across the batch.

- **Backward pass:**  
    Gradients are calculated for all 4 samples together (again, using vectorized math), and the average gradient is used to update the weights.

**Summary:**  
All samples in the batch are processed in parallel using matrix operations. The model updates its parameters once per batch, based on the average effect of all samples in that batch. This is much faster and more efficient than processing each sample one by one.
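
The sketch below performs one such batch update for a linear model with two inputs. The four samples, the starting weights, and the learning rate are hypothetical values chosen for illustration. Plain Python lists are used to keep the example dependency-free, whereas a framework would express the same arithmetic as matrix operations.

```python
# Hypothetical batch of 4 samples with 2 features each, i.e. shape (4, 2).
X = [[2, 4], [1, 3], [3, 5], [2, 2]]
y = [16, 11, 21, 10]


def batch_step(X, y, w, b, lr):
    """One forward/backward pass over a whole batch, followed by a single update."""
    n = len(X)
    preds = [w[0] * x1 + w[1] * x2 + b for x1, x2 in X]      # forward pass for all rows
    errors = [target - pred for target, pred in zip(y, preds)]
    loss = sum(e ** 2 for e in errors) / n                   # loss averaged over the batch
    # Gradients averaged over the batch.
    grad_w1 = sum(-2 * e * row[0] for e, row in zip(errors, X)) / n
    grad_w2 = sum(-2 * e * row[1] for e, row in zip(errors, X)) / n
    grad_b = sum(-2 * e for e in errors) / n
    new_w = [w[0] - lr * grad_w1, w[1] - lr * grad_w2]
    new_b = b - lr * grad_b
    return new_w, new_b, loss


w, b = [1.5, 3.0], 0.0                                       # assumed starting parameters
w, b, loss_before = batch_step(X, y, w, b, lr=0.01)
_, _, loss_after = batch_step(X, y, w, b, lr=0.01)
print(loss_before, loss_after)
```

The parameters change once per call, regardless of how many samples the batch contains.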

## Determining the Size of Input and Output Layers in Neural Networks

**Input Layer:**
- The number of neurons in the input layer equals the number of features in your data.
    - **Example:** For an image of size 28×28 pixels, the input layer has 784 neurons (one per pixel).
    - For tabular data with 2 features (like `x1`, `x2` in the student marks example), the input layer has 2 neurons.

**Output Layer:**
- The number of neurons in the output layer depends on the prediction task:
    - **Regression:** 1 neuron (predicts a continuous value).
    - **Binary Classification:** 1 neuron (outputs probability or class label).
    - **Multi-class Classification:** Number of neurons equals the number of classes (e.g., 10 for digit classification 0–9).

**Summary Table:**

| Task Type               | Input Layer Size         | Output Layer Size         |
|-------------------------|-------------------------|--------------------------|
| Regression (1 target)   | # features              | 1                        |
| Binary classification   | # features              | 1                        |
| Multi-class classification | # features           | # classes                |
| Image (28×28 pixels)    | 784                     | Depends on task          |

**Key Point:**  
- **Input layer:** Matches the number of input features.
- **Output layer:** Matches the number of prediction targets or classes.
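
The sizing rules above can be captured in a small helper. The function name `layer_sizes` and the task labels are hypothetical, introduced only for this sketch.

```python
def layer_sizes(n_features, task, n_classes=None):
    """Return (input_size, output_size) for a task, following the rules in the table."""
    if task in ("regression", "binary"):
        return n_features, 1
    if task == "multiclass":
        if n_classes is None or n_classes < 2:
            raise ValueError("multiclass tasks need n_classes >= 2")
        return n_features, n_classes
    raise ValueError(f"unknown task: {task!r}")


print(layer_sizes(28 * 28, "multiclass", n_classes=10))  # (784, 10)
print(layer_sizes(2, "regression"))                      # (2, 1)
```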

## Optimization Approaches in Neural Networks

**Optimization** in neural networks refers to the process of adjusting model parameters (weights and biases) to minimize the loss function and improve performance. The choice of optimization algorithm can significantly impact training speed, convergence, and final accuracy.

### Common Optimization Algorithms

- **Stochastic Gradient Descent (SGD):**
    - Updates parameters using the gradient of the loss with respect to a random subset (batch) of data.
    - Simple and widely used, but can be slow to converge and sensitive to learning rate.

- **Momentum:**
    - Enhances SGD by adding a fraction of the previous update to the current update.
    - Helps accelerate convergence and escape local minima.

- **RMSprop:**
    - Adapts the learning rate for each parameter by dividing the gradient by the square root of a running average of recent squared gradients.
    - Works well for recurrent neural networks and non-stationary objectives.

- **Adam (Adaptive Moment Estimation):**
    - Combines ideas from Momentum and RMSprop.
    - Maintains running averages of both gradients and their squares.
    - Generally provides fast convergence and is robust to hyperparameter choices.
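
The update rules of the four optimizers can be compared on a one-parameter toy problem. The sketch below minimizes the assumed loss `(w - 3)^2`. The hyperparameter values are common illustrative defaults rather than recommendations, and the momentum variant shown is the classic "velocity" formulation.

```python
import math


def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def run_sgd(w, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_momentum(w, lr=0.1, beta=0.9, steps=500):
    v = 0.0                                   # velocity: decayed sum of past updates
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w


def run_rmsprop(w, lr=0.05, rho=0.9, eps=1e-8, steps=500):
    s = 0.0                                   # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g
        w -= lr * g / (math.sqrt(s) + eps)
    return w


def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0                               # running averages of gradient and its square
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


for name, run in [("SGD", run_sgd), ("Momentum", run_momentum),
                  ("RMSprop", run_rmsprop), ("Adam", run_adam)]:
    print(f"{name:9s} final w = {run(0.0):.4f}")
```

All four start from `w = 0` and move toward the minimum at 3; they differ in how the step size is derived from past gradients.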

### Comparison Table

| Optimizer | Pros | Cons | Typical Use |
|-----------|------|------|-------------|
| SGD       | Simple, memory efficient | Slow, sensitive to learning rate | General use, baseline |
| Momentum  | Faster convergence | Adds extra parameter (momentum) | Deep networks, escaping local minima |
| RMSprop   | Adapts learning rate | May not generalize as well | RNNs, non-stationary problems |
| Adam      | Fast, adaptive, robust | Slightly more memory | Most deep learning tasks |

### Applicability

- **SGD/Momentum:** Good for large datasets and when you want more control over learning dynamics.
- **RMSprop:** Often used for recurrent networks and time-series data.
- **Adam:** Default choice for most deep learning tasks due to its speed and reliability.

**Summary:**  
Optimization algorithms are crucial for effective neural network training. Adam is often a safe and efficient default, but understanding and experimenting with different optimizers can lead to better results for specific problems.


## Setting Up a Virtual Environment with Python 3.12 for Deep Learning Experiments

To ensure a clean and manageable workspace for your deep learning projects, it is recommended to use a virtual environment. This allows you to isolate dependencies and avoid conflicts with other Python projects.

### Steps to Install Python 3.12

1. **Download Python 3.12:**
    - Visit the [official Python downloads page](https://www.python.org/downloads/) and select Python 3.12 for your operating system.

2. **Install Python 3.12:**
    - **Windows:** Run the installer and follow the prompts. Make sure to check "Add Python to PATH" during installation.
    - **macOS:** Download the macOS installer and run it, or use Homebrew:
      ```bash
      brew install python@3.12
      ```
    - **Linux (Ubuntu/Debian):**
      ```bash
      sudo apt update
      sudo apt install python3.12 python3.12-venv python3.12-dev
      ```

3. **If `python3.12` is not recognized:**
    - Check the installation path:
      ```bash
      which python3.12
      ```
    - If the path is not in your `PATH` environment variable, add it (replace `/opt/homebrew/bin` with your actual path if different):
      ```bash
      export PATH="/opt/homebrew/bin:$PATH"
      ```
    - On Windows, ensure the Python installation directory is added to your system's PATH environment variable (replace `C:\Python312;C:\Python312\Scripts` with your actual path if different). Run the command from an Administrator Command Prompt, and note that `setx` truncates values longer than 1024 characters:
      ```bash
      setx PATH "%PATH%;C:\Python312;C:\Python312\Scripts" /M
      ```

4. **Verify the installation:**
    ```bash
    python3.12 --version
    ```

### Steps to Create a Virtual Environment with Python 3.12

1. **Install `virtualenv` (if not already installed):**
    ```bash
    pip install virtualenv
    ```

2. **Create a new virtual environment using Python 3.12:**
    ```bash
    virtualenv -p python3.12 module1_dl_env
    ```

3. **Activate the virtual environment:**
    - On Windows:
      ```bash
      module1_dl_env\Scripts\activate
      ```
    - On macOS/Linux:
      ```bash
      source module1_dl_env/bin/activate
      ```

4. **Install required packages (TensorFlow, scikit-learn, Matplotlib, PyTorch):**
    ```bash
    pip install tensorflow==2.17.0 scikit-learn==1.5.0 matplotlib==3.9.0 torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
    ```
    
    Note: Keras is installed automatically as a dependency of TensorFlow 2.17.0 and is available as `tf.keras`, so no separate installation is needed.

5. **Deactivate the environment when you have finished the experiments below:**
    ```bash
    deactivate
    ```

Using a virtual environment with Python 3.12 helps you manage dependencies and ensures reproducibility for your deep learning experiments.
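
To confirm that the activated environment contains the expected packages, a short check like the one below can be run with the environment's Python. It is a sketch that relies only on the standard library. The helper name `installed_versions` is hypothetical, and the package list mirrors the `pip install` command above.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", sys.version.split()[0])
for name, version in installed_versions(
    ["tensorflow", "scikit-learn", "matplotlib", "torch"]
).items():
    print(f"{name}: {version or 'not installed'}")
```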

### Steps to Install Git and Download a Repository

1. **Install Git:**
  - **Windows:** Download and run the installer from [git-scm.com](https://git-scm.com/download/win).
  - **macOS:**  
    ```bash
    brew install git
    ```
  - **Linux (Ubuntu/Debian):**  
    ```bash
    sudo apt update
    sudo apt install git
    ```

2. **Verify Git Installation:**
  ```bash
  git --version
  ```

3. **Clone the Repository (including all subfolders):**
  ```bash
  git clone https://github.com/<reponame>/DeepLearningCourse.git
  ```

  Replace `<reponame>` with the GitHub account that hosts the repository. The command downloads the entire repository, including all subfolders, into a local directory named `DeepLearningCourse`.

4. **Navigate into the Downloaded Repository:**
  ```bash
  cd DeepLearningCourse
  ```

You can now access all files and subfolders from the repository on your local machine.

## Using Python Notebooks in VS Code with a Custom Virtual Environment

To work with Jupyter Notebooks in Visual Studio Code and use your custom Python 3.12 virtual environment, follow these steps:

### 1. Install VS Code and Extensions

- Download and install [Visual Studio Code](https://code.visualstudio.com/).
- Open VS Code and go to the Extensions view (`Ctrl+Shift+X`, or `Cmd+Shift+X` on macOS).
- Search for and install the **Python** extension (by Microsoft).
- Search for and install the **Jupyter** extension (by Microsoft).

### 2. Open Your Project Folder

- Open the folder containing your Jupyter Notebook (`.ipynb`) and virtual environment.

### 3. Select the Python Interpreter

- Press `Ctrl+Shift+P` (`Cmd+Shift+P` on macOS) to open the Command Palette.
- Type `Python: Select Interpreter` and select it.
- Choose the interpreter from your virtual environment (e.g., `./module1_dl_env/bin/python` or `.\module1_dl_env\Scripts\python.exe`).

### 4. Enable and Use Jupyter Notebooks

- Open or create a `.ipynb` notebook file in VS Code.
- At the top right of the notebook, click on the **kernel selector** (shows the current Python version).
- Select your virtual environment as the Jupyter kernel.

### 5. Install Jupyter in the Virtual Environment (if needed)

If you haven't already installed Jupyter in your virtual environment, activate the environment and run:

```bash
pip install jupyter
```

### 6. Start Coding

- You can now run notebook cells using your selected virtual environment and installed packages.

---

**Tip:**  
If your virtual environment does not appear in the kernel list, restart VS Code after activating the environment and installing the Jupyter extension. Make sure the environment is properly set up and recognized by VS Code.

## Popular Deep Learning Frameworks

Several open-source frameworks make it easier to build, train, and deploy deep learning models. Here are some of the most widely used:

- **TensorFlow:** Developed by Google, TensorFlow is a flexible and scalable framework for building deep learning models. It supports both high-level APIs (like Keras) and low-level operations for advanced research.

- **PyTorch:** Originally developed by Facebook's AI Research lab (now Meta AI), PyTorch is known for its dynamic computation graph and intuitive interface, making it popular in both academia and industry.

- **Keras:** Initially a standalone high-level API, Keras ships with TensorFlow as `tf.keras`, and since Keras 3 it can also run on JAX and PyTorch backends. It provides a user-friendly interface for quickly prototyping and building neural networks.

- **MXNet:** An Apache project designed for efficiency and scalability, supporting both symbolic and imperative programming. The project was retired to the Apache Attic in 2023, so it is mainly relevant for existing codebases.

- **JAX:** Developed by Google, JAX enables high-performance machine learning research with automatic differentiation and GPU/TPU acceleration.

These frameworks provide tools and abstractions to accelerate deep learning research and production deployment. Choosing the right framework depends on your project requirements, familiarity, and ecosystem support.