# Deep Learning with PyTorch

## 1. Introduction

**Instructor**: Jasmin, Senior Data Science Content Developer at DataCamp  
This course introduces deep learning using the PyTorch framework.

---

## 2. Deep Learning Is Everywhere

Deep learning powers many modern innovations:

- Language translation
- Self-driving cars
- Medical diagnostics
- Chatbots

---

## 3. What Is Deep Learning?

Deep learning is a subset of machine learning. Its core structure is a network composed of:

- Inputs
- Hidden layers (one or many)
- Outputs

These networks are inspired by the human brain, using interconnected units called **neurons**. Hence, the term **neural networks**.

Deep learning models require:

- Large datasets (hundreds of thousands or more)
- High computational resources
- Careful tuning and validation

---

## 4. PyTorch: A Deep Learning Framework

PyTorch is a widely-used deep learning framework:

- Originally developed by Meta AI (Facebook AI Research)
- Now maintained by the Linux Foundation
- Designed to be intuitive and user-friendly
- Shares syntax and behavior with NumPy

---

## 5. PyTorch Tensors

The fundamental data structure in PyTorch is a **tensor**, which is similar to an array or matrix. Tensors support many mathematical operations and form the building blocks of neural networks.

Tensors can be created from:

- Python lists
- NumPy arrays

They are converted using the `torch.tensor()` function into a format compatible with deep learning.
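
As a minimal sketch of the two creation routes listed above (the variable names and values are illustrative, not taken from the course):

```python
import numpy as np
import torch

# From a nested Python list
from_list = torch.tensor([[1, 2, 3], [4, 5, 6]])

# From a NumPy array: torch.from_numpy shares memory with the array,
# while torch.tensor(array) would copy the data instead
array = np.array([[1.0, 2.0], [3.0, 4.0]])
from_array = torch.from_numpy(array)

print(from_list.shape, from_list.dtype)    # torch.Size([2, 3]) torch.int64
print(from_array.shape, from_array.dtype)  # torch.Size([2, 2]) torch.float64
```

Note that the tensor keeps the dtype of its source: Python integers become `torch.int64`, and a float64 NumPy array becomes `torch.float64`.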

---

## 6. Tensor Attributes

Important tensor attributes include:

- `tensor.shape`: Displays the shape of the tensor
- `tensor.dtype`: Displays the data type (e.g., 64-bit integer)

Checking these attributes ensures tensors align correctly with the model and task, and helps with debugging.

---

## 7. Tensor Operations

Tensors support various operations:

- **Addition and subtraction**: Require tensors of the same shape (or shapes that can be broadcast together)
- **Element-wise multiplication**: Multiplies corresponding elements of tensors with the same shape
- **Matrix multiplication**: Combines rows of the first matrix with columns of the second matrix and sums the results
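
The three kinds of operation above can be sketched on two small tensors. The values of `a` and `b` are arbitrary choices made for this illustration:

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

added = a + b        # element-wise addition:       [[6, 8], [10, 12]]
subtracted = a - b   # element-wise subtraction:    [[-4, -4], [-4, -4]]
elementwise = a * b  # element-wise multiplication: [[5, 12], [21, 32]]
matmul = a @ b       # matrix multiplication:       [[19, 22], [43, 50]]

print(added)
print(subtracted)
print(elementwise)
print(matmul)
```

The `*` operator multiplies matching positions, while `@` combines rows of `a` with columns of `b`, which is why the two results differ.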

---

## 8. Behind the Scenes in Deep Learning

Deep learning models perform countless operations such as:

- Addition
- Multiplication
- Matrix transformations

These operations are used to process data and learn patterns.

---

## 9. Additional Concepts to Explore

To deepen your understanding of PyTorch and deep learning, explore:

- Autograd: Automatic differentiation for training
- `nn.Module`: Building neural network layers
- Optimizers: Algorithms like SGD, Adam
- Loss functions: Metrics to evaluate model performance
- DataLoaders: Efficient data batching and shuffling


In [1]:
import torch

# Define the temperatures tensor
temperatures = torch.tensor([[30, 32, 33], [29, 31, 34]])

# Define the adjustment tensor
adjustment = torch.tensor([[2, 2, 2], [2, 2, 2]])

# Display the shape and type of the adjustment tensor
print("Adjustment shape:", adjustment.shape)
print("Adjustment type:", adjustment.dtype)

# Display the shape and type of the temperatures tensor
print("Temperatures shape:", temperatures.shape)
print("Temperatures type:", temperatures.dtype)

# Apply the adjustment to the temperatures
adjusted_temperatures = temperatures + adjustment
print("Adjusted temperatures:", adjusted_temperatures)


Adjustment shape: torch.Size([2, 3])
Adjustment type: torch.int64
Temperatures shape: torch.Size([2, 3])
Temperatures type: torch.int64
Adjusted temperatures: tensor([[32, 34, 35],
        [31, 33, 36]])


## What the temperatures tensor represents:

* A 2D grid of temperature values (e.g., from sensors or locations).
* Each row could represent a day, and each column a time or region.
* It’s just numerical data structured like a matrix.

## What the adjustment tensor represents:

* A 2D grid of values to be added to the temperatures.
* Could represent calibration offsets, corrections, or environmental adjustments.
* Same shape as temperatures so they can be added element-wise.

## What the operation does:

* Adds each value in adjustment to the corresponding value in temperatures.
* This is element-wise addition, like basic matrix addition.
* Result is a new tensor with adjusted temperature values.

## Difference between tensor and matrix:

* Matrix: A 2D array of numbers (rows × columns).
* Tensor: A general term for multi-dimensional arrays (can be 0D, 1D, 2D, 3D, etc.).
  - A matrix is just a 2D tensor.
  - Tensors support more complex operations and higher dimensions (e.g., images, videos, batches of data).

## Why use tensors in deep learning:

* Tensors are optimized for GPU computation.
* They support automatic differentiation (for training models).
* They can represent complex data structures beyond simple matrices.



# Neural Networks and Layers

## 1. Building Our First Neural Network
Let's build our first neural network using PyTorch Tensors.

---

## 2. Neural Network Structure

A neural network consists of:

- **Input layer**: Contains the dataset features  
- **Hidden layers**: Lie between input and output  
- **Output layer**: Contains the predictions

---

## 3. Our First Neural Network

We begin with a network that has **no hidden layers**, where the output layer is a **linear layer**.  
In this setup:

- Every input neuron connects to every output neuron  
- This is called a **fully connected network**  
- It behaves like a **linear model**, useful for understanding fundamentals before adding complexity

---

## 4. Designing a Neural Network

We use the `torch.nn` module (imported as `nn`) to build networks.  
This module makes code more concise and flexible.

### Key Design Principles:

- **Input layer size** = number of features in the dataset  
- **Output layer size** = number of classes to predict  

Example:  
An `input_tensor` of shape `1 × 3` represents one row with three features or neurons.

---

## 5. Linear Layer and Prediction

We pass the `input_tensor` to a **linear layer**, which applies a linear function to generate predictions.

- `in_features`: Number of input features (e.g., 3)  
- `out_features`: Desired size of output (e.g., 2)

Correctly specifying `in_features` ensures the layer can process the input.

---

## 6. Output Interpretation

After passing the input through the linear layer:

- The output has two features or neurons  
- This matches the `out_features` specified

---

## 7. Weights and Biases

When the input is passed to the linear layer:

- A **linear operation** is performed using weights and biases  
- Each neuron has:
  - **Weights**: Indicate the importance of each input feature  
  - **Bias**: A baseline value independent of the weights

Initially, weights and biases are **randomly assigned** and later **tuned during training**.
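
To make the linear operation concrete, the sketch below inspects a layer's weights and bias and reproduces its output by hand. The seed is only there to make the random initialization repeatable, and `manual_output` is a name introduced for this illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initialization repeatable

input_tensor = torch.tensor([[0.3471, 0.4547, -0.2356]])
linear_layer = nn.Linear(in_features=3, out_features=2)

# One row of weights per output neuron, one bias per output neuron
print(linear_layer.weight.shape)  # torch.Size([2, 3])
print(linear_layer.bias.shape)    # torch.Size([2])

# The layer computes: input @ weight.T + bias
manual_output = input_tensor @ linear_layer.weight.T + linear_layer.bias
layer_output = linear_layer(input_tensor)

print(torch.allclose(manual_output, layer_output))  # True
```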

---

## 8. Fully Connected Network in Action

Imagine a weather dataset with three features:

- Temperature  
- Humidity  
- Wind

We want to predict:

- Rain  
- Cloudy conditions

### Feature Importance:

- **Humidity** has a higher weight due to its strong predictive power  
- A **bias** is added to reflect the tropical region’s baseline likelihood of rain

With this setup, the model makes a prediction.



_If a neural network has no hidden layers and only a linear output layer, it behaves exactly like a linear model._

In [6]:
import torch
import torch.nn as nn 
input_tensor = torch.tensor([[0.3471, 0.4547,-0.2356]])
linear_layer = nn.Linear(in_features=3, out_features=2)

output=linear_layer(input_tensor)
print(output)
                            

tensor([[-0.5433,  0.0326]], grad_fn=<AddmmBackward0>)


# Hidden Layers and Parameters in Neural Networks

## 1. Expanding Beyond a Single Layer
So far, we've used one input layer and one linear layer. Now, we'll add more layers to help the network learn complex patterns.

---

## 2. Stacking Layers with `nn.Sequential()`

We can stack multiple linear layers using `nn.Sequential()`, a PyTorch container that applies layers in sequence:

- Input is passed through each layer one after another
- The intermediate layers are considered **hidden layers**
- This structure allows the network to learn hierarchical representations

---

## 3. Defining Input and Output Dimensions

- The **first layer's input** corresponds to the number of features in the dataset (`n_features`)
- The **final layer's output** corresponds to the number of prediction classes (`n_classes`)

---

## 4. Adding Hidden Layers

We can add:

- As many hidden layers as needed  
- As long as each layer’s input dimension matches the previous layer’s output dimension

### Example:
- First layer: input = 10, output = 18  
- Second layer: input = 18, output = 20  
- Third layer: input = 20, output = 5
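
The example dimensions above can be sketched with `nn.Sequential`; the random input is a placeholder, not real data:

```python
import torch
import torch.nn as nn

# Each layer's in_features equals the previous layer's out_features
model = nn.Sequential(
    nn.Linear(10, 18),
    nn.Linear(18, 20),
    nn.Linear(20, 5),
)

input_tensor = torch.randn(1, 10)  # one sample with 10 features
output = model(input_tensor)
print(output.shape)  # torch.Size([1, 5])
```

If two neighboring layers disagree on size, the forward pass raises a shape-mismatch error at the first incompatible layer.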

---

## 5. Layers Are Made of Neurons

- A layer is **fully connected** when each neuron links to all neurons in the previous layer
- Each neuron performs a **linear operation** using all inputs from the previous layer

### Learnable Parameters:
- Each neuron has \( N + 1 \) parameters:
  - \( N \): weights (one per input)
  - \( +1 \): bias term

---

## 6. Parameters and Model Capacity

- More hidden layers -> more parameters -> higher **model capacity**
- High-capacity models can learn complex patterns but:
  - May take longer to train
  - Risk overfitting if not regularized

### Example Breakdown:
- **First layer**: 4 neurons, each with 8 weights + 1 bias -> \( 4 \times 9 = 36 \) parameters  
- **Second layer**: 2 neurons, each with 4 weights + 1 bias -> \( 2 \times 5 = 10 \) parameters  
- **Total**: 46 learnable parameters
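
The breakdown above can be checked in PyTorch by listing each parameter tensor and its element count. This is a small sketch with layer sizes chosen to match the example:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))

# Each Linear layer holds a weight matrix and a bias vector
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.numel())

total_params = sum(p.numel() for p in model.parameters())
print(total_params)  # 36 + 10 = 46
```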

---

## 7. Calculating Parameters in PyTorch

Use `.numel()` to count elements in each parameter tensor:

```python
total_params = sum(p.numel() for p in model.parameters())
```


 ![image.png](attachment:f820b4e6-ccf6-478b-8fab-d630374f85b7.png)


In [8]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[2, 3, 6, 7, 9, 3, 2, 1]])

# Create a container for stacking linear layers
model = nn.Sequential(nn.Linear(8, 4),
                nn.Linear(4, 1)
                )

output = model(input_tensor)
print(output)

tensor([[0.6709]], grad_fn=<AddmmBackward0>)


In [9]:
model = nn.Sequential(nn.Linear(92, 4),
                      nn.Linear(4, 69),
                      nn.Linear(69, 1))  # in_features must match the previous layer's out_features

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total learnable parameters: {total_params}")

Total learnable parameters: 787


In [11]:
# Another way: count the parameters with an explicit loop

import torch.nn as nn
model = nn.Sequential(nn.Linear(92, 4),
                      nn.Linear(4, 69),
                      nn.Linear(69, 1))  # in_features must match the previous layer's out_features

total = 0

# Calculate the number of parameters in the model
for p in model.parameters():
    total += p.numel()

print(f"The number of parameters in the model is {total}")

The number of parameters in the model is 787


# Activation Functions in Neural Networks

## 1. Why Activation Functions Matter

Neural networks made only of linear layers can learn only straight-line relationships.  
To model complex patterns and interactions between inputs and outputs, we introduce **non-linearity** using activation functions.

---

## 2. What Is a Pre-Activation Output?

The output of the final linear layer before applying any activation is called the **pre-activation output**.  
This raw value is not yet interpretable. We pass it through an activation function to transform it into a meaningful output (e.g., probability or class score).

---

## 3. Sigmoid Activation Function (Binary Classification)

- **Use case**: Binary classification (e.g., mammal vs not mammal)
- **Input features**: number of limbs, lays eggs (1/0), has hair (1/0)
- **Model output**: raw score (e.g., 6) -> not interpretable
- **Sigmoid transforms** this score into a value between 0 and 1:



![image.png](attachment:af1141c7-2b83-4ba0-9a2a-09f0996c3a22.png)


![image.png](attachment:a6a20905-f68e-4749-b251-8af0954fed5f.png)


- **Interpretation**:
  - Output > 0.5 -> class 1 (mammal)
  - Output < 0.5 -> class 0 (not mammal)

This transformation allows us to treat the output as a **probability**.
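
For reference, the standard definition of the sigmoid function (stated here as background, since the figures are not reproduced in the text), applied to the raw score of 6 from the example:

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(6) = \frac{1}{1 + e^{-6}} \approx 0.9975
\]

Because 0.9975 is above the 0.5 threshold, this animal would be assigned to class 1 (mammal).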

---

## 4. Sigmoid in Neural Network Architecture

- Typically added as the **final layer** in `nn.Sequential()`
- Automatically transforms the output of the last linear layer
- A network with only linear layers + sigmoid behaves like **logistic regression**
- Adding hidden layers and nonlinear activations unlocks the full power of deep learning

---

![image.png](attachment:d0914759-cb05-4cf4-b8ac-726648742af9.png)



## 5. Softmax Activation Function (Multi-Class Classification)

- **Use case**: Multi-class classification (e.g., bird, mammal, reptile)
- **Class labels**:
  - bird -> 0  
  - mammal -> 1  
  - reptile -> 2

- **Softmax transforms** a vector of raw scores into a **probability distribution**:


![image.png](attachment:37b4aecb-0367-4aed-8ac5-46e1fc516386.png)

- Each output:
  - Is between 0 and 1
  - All outputs **sum to 1**
  - Highest value -> predicted class

### Example:
- Pre-activation output: `[1.2, 3.5, 0.8]`
- Softmax output (rounded): `[0.086, 0.857, 0.058]`
- Prediction: class 1 (mammal) with probability of about 0.86
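
A short sketch of how softmax is computed for the pre-activation scores in this example, first by hand and then with the built-in `torch.softmax` function. The names `manual` and `builtin` are introduced here for illustration:

```python
import torch

pre_activation = torch.tensor([[1.2, 3.5, 0.8]])

# Softmax by hand: exponentiate, then normalize so each row sums to 1
exp_scores = torch.exp(pre_activation)
manual = exp_scores / exp_scores.sum(dim=-1, keepdim=True)

# The same computation with the built-in function
builtin = torch.softmax(pre_activation, dim=-1)

print(manual)
print(manual.sum(dim=-1))      # each row sums to 1
print(builtin.argmax(dim=-1))  # index of the most probable class
```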

---

## 6. Softmax in Neural Network Architecture

- Added as the **final layer** in `nn.Sequential()` for multi-class tasks
- `dim=-1` ensures softmax is applied across the last dimension of the tensor
- Like sigmoid, it transforms raw scores into interpretable probabilities

---
![image.png](attachment:2bbb46bf-dfac-4cfd-a309-245c328710cb.png)

![image.png](attachment:8883c5dd-6169-4a74-ab27-9ffa0d25564e.png)



## Reinforcement Insights

- **Activation functions** are essential for learning complex, nonlinear patterns.
- **Sigmoid** is ideal for binary classification - output is a single probability.
- **Softmax** is ideal for multi-class classification - output is a probability vector.
- Without activation functions, neural networks are just stacks of linear equations.
- Activation functions are what make deep learning **flexible, expressive, and powerful**.

---

## Summary Table

| Activation Function | Use Case               | Output Range | Behavior                                |
|---------------------|------------------------|--------------|-----------------------------------------|
| Sigmoid             | Binary classification  | 0 to 1       | Converts raw score to probability       |
| Softmax             | Multi-class classification | 0 to 1 (sums to 1) | Converts scores to probability distribution |



In [15]:
input_tensor = torch.tensor([[2.4]])

# Create a sigmoid function and apply it on input_tensor
sigmoid = nn.Sigmoid()
probability = sigmoid(input_tensor)
print(probability)

tensor([[0.9168]])


In [23]:
input_tensor = torch.tensor([[1.0, -6.0, 2.5, -0.3, 1.2, 0.8]])

# Create a softmax function and apply it on input_tensor
softmax = nn.Softmax(dim=-1)  # specify the dimension explicitly to avoid the implicit-dim warning
probabilities = softmax(input_tensor)
print(probabilities)

tensor([[1.2828e-01, 1.1698e-04, 5.7492e-01, 3.4961e-02, 1.5669e-01, 1.0503e-01]])


# Forward Pass in Neural Networks

## 1. What Is a Forward Pass?

A **forward pass** is the process of sending input data through a neural network to generate predictions.  
It flows **forward** from the input layer, through the hidden layers, to the output layer.

At each layer:
- The data is transformed using weights, biases, and activation functions.
- These transformations create new representations of the input.
- The final output is the model’s prediction.

This process is used during both:
- **Training**: to compute predictions and compare them with actual labels.
- **Inference**: to make predictions on new, unseen data.

---

## 2. Types of Outputs from Forward Pass

Depending on the task, the final output can be:
- **Binary classification**: e.g., mammal vs not mammal
- **Multi-class classification**: e.g., mammal, bird, reptile
- **Regression**: predicting continuous values like weight or price

---

## 3. Binary Classification Example

- **Input**: Data for 5 animals, each with 6 features (e.g., limbs, lays eggs, has hair)

![image.png](attachment:b4f7b47a-7a0b-4bba-889f-599df8b30ea5.png)
![image.png](attachment:d63c0ecf-189b-4a86-9a08-da201a4187e9.png)

- **Model**: Two linear layers + sigmoid activation
  - First layer: 6 inputs -> 4 outputs
  - Second layer: 4 inputs -> 1 output (probability)

### Output:
- A single probability between 0 and 1 for each animal
- Use a threshold (commonly 0.5) to convert probability into class label:
  - > 0.5 -> class 1 (mammal)
  - < 0.5 -> class 0 (not mammal)

**Note**: These predictions are based on current weights and biases.  
To improve accuracy, we use **backpropagation** to update these parameters.
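
The binary model described above can be sketched as follows. The input is random placeholder data standing in for the 5 animals, so the probabilities themselves are meaningless until the model is trained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for repeatability of this sketch

animals = torch.randn(5, 6)  # placeholder: 5 animals x 6 features

model = nn.Sequential(
    nn.Linear(6, 4),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)

probabilities = model(animals)        # shape (5, 1), values between 0 and 1
labels = (probabilities > 0.5).int()  # apply the 0.5 threshold

print(probabilities)
print(labels)
```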

---

## 4. Multi-Class Classification Example

- **Goal**: Predict one of three classes: mammal, bird, reptile
  ![image.png](attachment:2127e2d1-9340-4842-a158-edad7f7cad1e.png)

  ![image.png](attachment:51b2528e-a2f8-4ffd-a906-7f0f74b1e5a7.png)
- **Model**:
  - Last linear layer outputs 3 values (one per class)
  - Use **softmax** activation to convert raw scores into probabilities
  - `dim = -1` ensures softmax is applied across the last dimension

### Output:
- Shape: 5 × 3 (5 animals × 3 class probabilities)
- Each row sums to 1
- Predicted label = class with highest probability

### Example:
- Row 1: `[0.07, 0.84, 0.09]` -> predicted class = mammal
- Row 2: `[0.10, 0.85, 0.05]` -> predicted class = mammal
- Row 3: `[0.12, 0.08, 0.80]` -> predicted class = reptile
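
A sketch of the multi-class version, again on random placeholder data. The hidden size of 4 is an assumption carried over from the binary example:

```python
import torch
import torch.nn as nn

animals = torch.randn(5, 6)  # placeholder: 5 animals x 6 features

model = nn.Sequential(
    nn.Linear(6, 4),
    nn.Linear(4, 3),      # one output per class
    nn.Softmax(dim=-1),   # normalize across the class dimension
)

probabilities = model(animals)                  # shape (5, 3)
predicted_classes = probabilities.argmax(dim=-1)  # highest-probability class per row

print(probabilities.sum(dim=-1))  # each row sums to 1
print(predicted_classes)
```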

---

## 5. Regression Example

- **Goal**: Predict continuous values (e.g., animal weight)
- **Model**:
  - Same input data (5 animals × 6 features)
  - Last linear layer outputs 1 value per row
  - **No activation function** at the end

### Output:
- Shape: 5 × 1
- Each value is a continuous prediction (e.g., weight in kg)
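
For regression the only structural change is dropping the final activation, so the raw output of the last linear layer is the prediction. A minimal sketch on placeholder data:

```python
import torch
import torch.nn as nn

animals = torch.randn(5, 6)  # placeholder: 5 animals x 6 features

# No activation at the end: outputs are unbounded real numbers
model = nn.Sequential(
    nn.Linear(6, 4),
    nn.Linear(4, 1),
)

predictions = model(animals)
print(predictions.shape)  # torch.Size([5, 1])
```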

---
![image.png](attachment:2726239a-1f82-4c70-bae9-dc7316ea5e54.png)

## Reinforcement Insights

- A forward pass is the **core prediction mechanism** in neural networks.
- It uses the model’s current parameters (weights and biases) to generate outputs.
- Different tasks (binary, multi-class, regression) require different output shapes and activation functions.
- The forward pass is only half the story: learning happens when we **compare predictions to true labels** and **update parameters** using backpropagation.

---

## Summary Table

| Task Type              | Output Shape | Final Activation | Output Meaning                     |
|------------------------|--------------|------------------|------------------------------------|
| Binary Classification  | (batch_size, 1) | Sigmoid          | Probability of class 1             |
| Multi-Class Classification | (batch_size, num_classes) | Softmax          | Probability distribution across classes |
| Regression             | (batch_size, 1) | None             | Continuous value prediction        |


In [24]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[3, 4, 6, 2, 3, 6, 8, 9]])

model = nn.Sequential(
  nn.Linear(8,1),
  nn.Sigmoid()
)

output = model(input_tensor)
print(output)

tensor([[0.2081]], grad_fn=<SigmoidBackward0>)


In [25]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[3, 4, 6, 7, 10, 12, 2, 3, 6, 8, 9]])

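# Implement a neural network with exactly four linear layers
# Note: the last layer has a single output, so softmax over it always returns 1.0;
# a multi-class model needs out_features equal to the number of classes
model = nn.Sequential(
  nn.Linear(11,64),
  nn.Linear(64,32),
  nn.Linear(32,16),
  nn.Linear(16,1),
  nn.Softmax(dim=-1)
)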

output = model(input_tensor)
print(output)

tensor([[1.]], grad_fn=<SoftmaxBackward0>)


# Loss Functions in Neural Networks

## 1. Why We Use a Loss Function

After making predictions with a forward pass, we need to check how close those predictions are to the actual answers.  
That’s what a loss function does. It compares the predicted value (y_hat) with the true label (y) and gives us a number.  
This number is called the loss. Lower loss means better predictions. Our goal is to reduce this loss during training.

![image.png](attachment:76b33ddb-bcd1-4d39-b5ea-ca6ade9dbd3c.png)

---

## 2. Example: Multi-Class Classification

Let’s say we’re predicting whether an animal is a mammal (0), bird (1), or reptile (2).  
If the true label is 0 and the model also predicts 0, the loss is low.  
If the model predicts 1 or 2, the loss is high.  
The loss function helps us measure how wrong the prediction is.

---

## 3. What the Loss Function Takes In

The loss function takes two things:
- y: the true label (like 0, 1, or 2)
- y_hat: the model’s raw prediction before softmax

If there are 3 classes, y_hat is a tensor with 3 values.  
Softmax will turn these into probabilities, but the loss function works with the raw scores.

---

## 4. One-Hot Encoding

To compare y with y_hat properly, we use one-hot encoding.  
This turns the label into a vector of zeros and ones.

Examples:
- y = 0 -> [1, 0, 0]
- y = 1 -> [0, 1, 0]
- y = 2 -> [0, 0, 1]

This helps match the shape of the prediction and makes comparison easier.


![image.png](attachment:56902421-4f2e-458f-85f3-8d842cf8de17.png)

---

## 5. Automating One-Hot Encoding

Instead of writing one-hot vectors manually, we use the `one_hot()` function from `torch.nn.functional` (conventionally imported as `F`).  
It automatically converts an integer label into a one-hot vector, given the number of classes.

- y = 0 -> one at first position
- y = 1 -> one at second position
- y = 2 -> one at third position

---

![image.png](attachment:329f44e2-8a72-471e-820c-bd1ce50965ac.png)


![image.png](attachment:1b8f0ed1-f42e-4e69-b908-6347dd89ff57.png)

## 6. Cross-Entropy Loss

Once we have the encoded label and the prediction scores, we pass them to a loss function.  
The most common one for classification is cross-entropy loss.  
It compares the predicted scores with the true label and gives a single float value.  
This value tells us how far off the prediction was.
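
As a sketch of what cross-entropy computes, the example below uses the same scores and label as the notebook cell at the end of this section and shows three equivalent routes. The names `loss_from_index`, `manual_loss`, and `loss_from_one_hot` are introduced here, and passing a one-hot (probability) target assumes a PyTorch version that supports it (1.10 or later):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])  # raw scores for 4 classes
target = torch.tensor([2])                      # true class index

# 1. Built-in loss with a class-index target
loss_from_index = F.cross_entropy(scores, target)

# 2. By hand: negative log of the softmax probability of the true class
log_probs = F.log_softmax(scores, dim=-1)
manual_loss = -log_probs[0, target[0]]

# 3. Built-in loss with a one-hot target
one_hot = F.one_hot(target, num_classes=4).float()
loss_from_one_hot = F.cross_entropy(scores, one_hot)

print(loss_from_index, manual_loss, loss_from_one_hot)  # all approximately 8.0619
```

The loss is large here because the model puts almost all of its probability on class 1, while the true class is 2.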

---

## 7. Summary

- The loss function compares predictions with actual labels
- It gives a number that shows how wrong the prediction is
- Lower loss means better performance
- We use one-hot encoding to match label format
- Cross-entropy is the go-to loss function for classification
- The goal is to minimize this loss during training



In [31]:
import numpy as np
import torch  
import torch.nn.functional as F  
y = 1
num_classes = 3

# Create the one-hot encoded vector using NumPy
one_hot_numpy = np.array([0, 1, 0])

# Create the one-hot encoded vector using PyTorch
one_hot_pytorch = F.one_hot(torch.tensor(y), num_classes=num_classes)

print("One-hot vector using NumPy:", one_hot_numpy)
print("One-hot vector using PyTorch:", one_hot_pytorch)

One-hot vector using NumPy: [0 1 0]
One-hot vector using PyTorch: tensor([0, 1, 0])


In [35]:
import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

y = [2]
scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])

# Create a one-hot encoded vector of the label y
one_hot_label = F.one_hot(torch.tensor(y), num_classes=4)

# Create the cross entropy loss function
criterion = CrossEntropyLoss()

# Calculate the cross entropy loss
loss = criterion(scores.double(), one_hot_label.double())
print(loss)

tensor(8.0619, dtype=torch.float64)


# Day 7

# Using Derivatives to Update Model Parameters

## 1. Why We Need Derivatives

Once we calculate the loss, we need a way to reduce it.  
Derivatives (also called gradients in deep learning) help us figure out how to adjust the model’s weights and biases to make better predictions.  
They tell us the direction and steepness of change - like a slope on a hill.

---

## 2. Visual Analogy: Loss as a Valley

Imagine the loss function as a valley:

- The height of the valley represents the loss value
- The slope tells us how steep the curve is at a point
- A steep slope means the loss is changing quickly
- A flat slope means the loss is stable

We want to reach the **lowest point** in the valley - this is where the loss is minimal.


   ![image.png](attachment:dd9e84d1-a421-4d14-b2ef-2896c90995f2.png)

- Red arrows show steep slopes -> large gradient -> big update steps
- Green arrows show gentle slopes -> small gradient -> small update steps
- Blue arrow at the valley floor -> slope is zero -> gradient is zero -> we’ve reached the minimum

---

## 3. Convex vs Non-Convex Functions
![image.png](attachment:14d49cf8-aa8a-4c70-9bfb-d2ce9cdadb85.png)

- **Convex function**: has one global minimum. Easy to find and optimize.
- **Non-convex function**: has many local minima. These are low points, but not the lowest possible.

Loss functions in deep learning are usually **non-convex** because of complex layer interactions.  
Our goal is to get as close as possible to the **global minimum**, even if we pass through local dips along the way.

---

## 4. How Derivatives Connect to Training

During training:

![image.png](attachment:352342a7-8872-4e52-957c-b317903ade13.png)

- We run a **forward pass** to get predictions
- We calculate the **loss** by comparing predictions to actual labels
- We then run a **backward pass** to compute gradients

These gradients tell us how each weight and bias contributed to the error.  
We use them to adjust the parameters so the model improves over time.

---

## 5. What Are Gradients?

In deep learning, derivatives are called **gradients**.  
They measure how much the loss changes when we tweak a specific parameter.

- A large gradient means the parameter has a big impact on the loss
- A small gradient means the parameter has little effect

We use gradients to update weights and biases in the direction that reduces the loss.

---

## 6. Backpropagation: The Core Mechanism

Backpropagation is the process of computing gradients layer by layer, starting from the output and moving backward.

If we have a network with three layers:

![image.png](attachment:8bf3246a-7e4c-4da6-a39d-b1752cccf748.png)

- First, we compute gradients for the last layer (L2)
- Then for the middle layer (L1)
- Finally for the first layer (L0)

Each layer’s gradients depend on the layers that come after it.  
This chain of calculations is what makes deep learning powerful and efficient.

---

## 7. How Parameters Are Updated

Once we have gradients, we update each parameter like this:

- Multiply the gradient by a **learning rate**
- Subtract that value from the current weight or bias

![image.png](attachment:79889b85-59fe-42e6-8f7c-189d357dacfd.png)


This moves the parameter in the direction that reduces the loss.  
The learning rate controls how big each step is: too big and we overshoot; too small and we move too slowly.
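
Written out, the two steps above correspond to the standard gradient descent update. The symbols \( \theta \) (a weight or bias), \( \eta \) (the learning rate), and \( L \) (the loss) are notation introduced here:

\[
\theta_{\text{new}} = \theta_{\text{old}} - \eta \, \frac{\partial L}{\partial \theta}
\]

For example, with a made-up weight of 0.5, a gradient of 2.0, and a learning rate of 0.01, the updated weight is \( 0.5 - 0.01 \times 2.0 = 0.48 \).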

---

## 8. Gradient Descent: The Optimization Strategy

Gradient descent is the method we use to find the minimum of the loss function.

Steps:
- Calculate gradients
- Move parameters in the direction that reduces loss
- Repeat until the loss is low enough

There are many versions of gradient descent:
- **Batch gradient descent**: uses all data at once
- **Stochastic gradient descent (SGD)**: uses one sample at a time
- **Mini-batch gradient descent**: uses small groups of samples

Most deep learning frameworks use **SGD with mini-batches** for speed and stability.
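
The loop below is a minimal sketch of gradient descent on a toy one-parameter loss, \( (w - 3)^2 \), chosen only because its minimum at \( w = 3 \) is known in advance. It is plain gradient descent on a single function, not mini-batch SGD on real data:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # start far from the minimum
lr = 0.1                                   # learning rate (illustrative value)

for step in range(50):
    loss = (w - 3) ** 2       # toy loss with its minimum at w = 3
    loss.backward()           # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad      # step against the gradient
    w.grad.zero_()            # reset the gradient before the next step

print(w.item())  # close to 3.0
```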

---

## 9. Optimizers: Automating the Updates

Instead of manually updating parameters, we use **optimizers**.  
They apply the parameter updates for us, using the gradients computed by `loss.backward()`.

Popular optimizers:
- **SGD**: basic and reliable
- **Adam**: adaptive learning rates, faster convergence
- **RMSprop**: good for noisy data

Optimizers use the gradients and learning rate to update all model parameters automatically.  
This is what makes training scalable and efficient.

---

## Summary

- Derivatives (gradients) tell us how to reduce loss
- Loss is like a valley - we want to reach the lowest point
- Backpropagation computes gradients layer by layer
- Gradients are used to update weights and biases
- Gradient descent is the strategy to minimize loss
- Optimizers automate the update process

This is the heart of how neural networks learn. Next step: training loops and learning rate tuning.


In [10]:
model = nn.Sequential(nn.Linear(16, 8),
                      nn.Linear(8, 2)
                     )

# Access the weight of the first linear layer
weight_0 = model[0].weight
print("Weight of the first layer:", weight_0)

# Access the bias of the second linear layer
bias_1 = model[1].bias
print("Bias of the second layer:", bias_1)

Weight of the first layer: Parameter containing:
tensor([[ 0.0233,  0.1375, -0.2388,  0.1920,  0.1582, -0.0953,  0.2423, -0.2101,
         -0.0451,  0.1237,  0.0517,  0.0876,  0.0803,  0.2458,  0.0549, -0.0127],
        [-0.1490,  0.0814, -0.1160,  0.2146,  0.0144, -0.1342, -0.1187,  0.1851,
         -0.1039, -0.0703,  0.1267,  0.1178,  0.1707,  0.1404, -0.0788, -0.2171],
        [ 0.0903, -0.0357,  0.0834, -0.0754, -0.2117, -0.2359,  0.2317,  0.1958,
         -0.1110,  0.0280,  0.2210, -0.1352,  0.1396, -0.0213,  0.0492,  0.1608],
        [ 0.1706,  0.1428, -0.2215,  0.1833,  0.2402,  0.0170,  0.0064,  0.1672,
         -0.1529,  0.2206,  0.2257, -0.1901,  0.1042,  0.0337,  0.1202, -0.2291],
        [ 0.2436,  0.1577,  0.1970,  0.2074,  0.1439,  0.1070,  0.0152,  0.1157,
          0.0889,  0.1591,  0.0353,  0.1520,  0.1205, -0.0905,  0.0356, -0.0421],
        [-0.0725, -0.1265, -0.0948,  0.1989, -0.1358,  0.0280, -0.0401, -0.1572,
          0.2461, -0.0019,  0.0350, -0.1192, -0.1581,  

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample input and target
sample = torch.randn(1, 16)  # Batch size = 1, input features = 16
target = torch.tensor([1]) # Target class index (for CrossEntropyLoss)

# Define the model
model = nn.Sequential(
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
    nn.ReLU(),
    nn.Linear(4, 2) # Output layer with 2 classes
)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.006)

# Forward pass
prediction = model(sample)

# Compute loss
loss = criterion(prediction, target)

# Backward pass: clear any stale gradients, then compute new ones
optimizer.zero_grad()
loss.backward()

# Update weights
optimizer.step()

# Print loss for verification
print("Loss:", loss.item())


Loss: 0.7817837595939636


In [25]:
import torch
import torch.nn as nn

# Sample input and target
sample = torch.randn(1, 16)  # Input: batch size 1, 16 features
target = torch.tensor([1])  # Target class index (for CrossEntropyLoss)

# Define model
model = nn.Sequential(
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
    nn.ReLU(),
    nn.Linear(4, 2)  # Output layer: 2 classes
)

# Define loss function
criterion = nn.CrossEntropyLoss()

# Forward pass
prediction = model(sample)

# Compute loss
loss = criterion(prediction, target)

# Backward pass
loss.backward()

# Learning rate
lr = 0.06

# Manual weight update (in-place)
with torch.no_grad():
    for layer in [0, 2, 4]:  # Only Linear layers (ReLU has no weights)
        model[layer].weight -= lr * model[layer].weight.grad
        model[layer].bias -= lr * model[layer].bias.grad


# Print loss
print("Loss:", loss.item())


Loss: 0.9721090197563171


Updating weights and biases by hand like this is neither feasible nor scalable: real networks can have millions of parameters.  
_In short, there is no need to write the update manually; an optimizer automates it._

The optimizer applies the same update rule to every parameter automatically:

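```python
import torch.optim as optim

# Create the optimizer (model, criterion, sample, and target as defined in the cells above)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Forward pass and loss
pred = model(sample)
loss = criterion(pred, target)

# Backward pass: clear old gradients, then compute new ones
optimizer.zero_grad()
loss.backward()

# Update the model's parameters using the optimizer
optimizer.step()
```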

# Day 8

# Loading and Preparing Data for Deep Learning in PyTorch

## 1. Why Data Loading Matters

Before training a neural network, we need to load and organize our data efficiently.  
Good data handling ensures smooth training, faster computation, and better generalization.

---

## 2. The Animals Dataset

We’re working with a CSV file containing animal classification data.  
Each row represents an animal, with features like:

- hair  
- feathers  
- eggs  
- milk  
- predator  
- legs  
- tail  

The goal is to predict the **type** of animal:

- bird -> 0  
- mammal -> 1  
- reptile -> 2

We ignore the `animal_name` column since names don’t help with classification.

---

## 3. Defining Input Features (X)

We use `.iloc` to select all columns **except** the first (`animal_name`) and the last (`type`).  
This gives us the input features, which we convert into a **NumPy array** called `X`.  
NumPy arrays convert directly to PyTorch tensors and allow fast numerical operations.

---

## 4. Defining Target Labels (y)

We extract the last column (`type`) to get the class labels.  
This becomes our target array `y`, which tells the model what each animal actually is.
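
A minimal sketch of the two selections, using a tiny hand-made DataFrame in place of the real CSV. The rows and the reduced set of columns are hypothetical; only the `.iloc` slicing pattern is the point.

```python
import pandas as pd

# Hypothetical rows; the real file has more columns and more animals
animals = pd.DataFrame({
    "animal_name": ["sparrow", "bear", "gecko"],
    "hair": [0, 1, 0],
    "feathers": [1, 0, 0],
    "legs": [2, 4, 4],
    "type": [0, 1, 2],
})

X = animals.iloc[:, 1:-1].to_numpy()  # every column except the first and the last
y = animals.iloc[:, -1].to_numpy()    # the last column only

print(X.shape)  # (3, 3)
print(y)        # [0 1 2]
```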

---

## 5. Using TensorDataset

To prepare data for PyTorch:

- Import `TensorDataset` from `torch.utils.data`
- Convert `X` and `y` into PyTorch tensors
- Pass them into `TensorDataset`

This wraps the data into a format that PyTorch can use during training.

### Accessing Samples:
- `dataset[0]` returns a tuple: `(input_features, label)`
- You can unpack it like:  
  `input_sample, label_sample = dataset[0]`

This makes it easy to inspect or debug individual samples.

---

## 6. Using DataLoader

`DataLoader` helps manage data during training:

- **batch_size**: number of samples per training step  
- **shuffle**: randomizes data order each epoch to improve generalization

### Why batching matters:
- Speeds up training by processing multiple samples at once
- Reduces memory usage, since only one batch is processed at a time
- Helps the model learn more stable patterns

### Why shuffling matters:
- Prevents the model from memorizing the order of data
- Improves performance on unseen data

We create a `DataLoader` instance with these parameters to iterate through the dataset in batches.

---

## 7. Iterating Through DataLoader

Each item from the DataLoader is a tuple:

- `batch_inputs`: a batch of input features  
- `batch_labels`: the corresponding labels

Example with 5 animals and batch size of 2:

- First batch -> 2 animals  
- Second batch -> 2 animals  
- Last batch -> 1 animal (since 5 is odd)

In real-world deep learning, datasets are much larger.  
Typical batch sizes are 32, 64, or 128 - chosen based on memory and performance.
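
The 5-animal example can be reproduced with made-up tensors. The feature values and labels below are placeholders; the batch sizes are what matters.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 5 hypothetical samples with 2 features each
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
labels = torch.tensor([0, 1, 0, 2, 1])

dataset = TensorDataset(features, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

batch_sizes = [len(batch_labels) for batch_inputs, batch_labels in dataloader]
print(batch_sizes)  # [2, 2, 1]
```

Shuffling changes which animals land in each batch, not the batch sizes. If a short final batch is unwanted, `DataLoader` accepts `drop_last=True` to discard it.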

---

## Reinforcement Insights

- Always preprocess and clean your data before loading
- Use NumPy for fast array manipulation
- Convert to tensors for PyTorch compatibility
- Use `TensorDataset` to bundle features and labels
- Use `DataLoader` to batch, shuffle, and iterate efficiently
- Batching and shuffling are key for scalable and generalizable training

---

## Summary

| Component       | Purpose                                      |
|----------------|----------------------------------------------|
| CSV file        | Raw data source                              |
| pd.read_csv()   | Load data into a DataFrame                   |
| .iloc           | Select relevant columns                      |
| NumPy array     | Fast numerical format for features and labels|
| TensorDataset   | Wraps features and labels into PyTorch format|
| DataLoader      | Manages batching and shuffling during training|



In [122]:
import torch
from torch.utils.data import TensorDataset
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Step 1: Load the dataset
animals = pd.read_csv('animal.csv')

# Step 2: One-hot encode categorical features (excluding first column and label column)
X = pd.get_dummies(animals.iloc[:, 1:-1])  # Features

In [123]:
# Step 3: Encode labels using LabelEncoder
le = LabelEncoder()
y = le.fit_transform(animals.iloc[:, -1])  # Labels

# Step 4: Convert to PyTorch tensors with correct types
X_tensor = torch.tensor(X.values)
y_tensor = torch.tensor(y)

# Step 5: Create TensorDataset
dataset = TensorDataset(X_tensor, y_tensor)

# Step 6: Print the first sample
input_sample, label_sample = dataset[0]
print('Input sample:', input_sample)
print('Label sample:', label_sample)

Input sample: tensor([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1])
Label sample: tensor(0)


In [26]:
from torch.utils.data import DataLoader

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate over the dataloader
for batch_inputs, batch_labels in dataloader:
    print('batch_inputs:', batch_inputs)
    print('batch_labels:', batch_labels)

batch_inputs: tensor([[1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 0, 1],
        [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0, 0]])
batch_labels: tensor([0, 1])
batch_inputs: tensor([[1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 1, 0, 1],
        [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 1, 0, 1]])
batch_labels: tensor([0, 0])
batch_inputs: tensor([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1],
        [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0, 1]])
batch_labels: tensor([0, 0])
batch_inputs: tensor([[0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0],
        [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1]])
batch_labels: tensor([1, 0])
batch_inputs: tensor([[1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1],
        [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0]])
batch_labels: tensor([0, 1])
batch_inputs: tensor([[0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0],
        [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1]])
batch_labels: tensor([1, 0])
batch_inputs: tensor([[0, 0, 0, 1, 0, 1,

# Training Loops in PyTorch: Regression with Salary Data

## 1. Overview: Writing Our First Training Loop

We now have all the core components needed to train a PyTorch deep learning model.  
Training involves repeatedly passing data through the model, computing loss, calculating gradients, and updating parameters.  
This process is called the **training loop**.

---

## 2. Core Components of Model Training

To train a neural network, we need:

- A defined model architecture
- A suitable loss function
- A dataset of input features and target labels
- An optimizer to update model parameters

Once these are set up, we loop through the dataset, compute loss, calculate gradients, and update weights.  
This loop runs for multiple **epochs**, where one epoch is a full pass through the dataset.

---

## 3. Dataset: Data Science Salary Prediction

We use a dataset containing salaries of data scientists.  
Key characteristics:

- **Features**: Categorical variables (e.g., job title, experience level, location), already encoded
- **Target**: Salary in US dollars, normalized to a continuous range

Since the target is a continuous value, this is a **regression problem**.  
For regression:

- The final layer of the model is a **linear layer** (no activation like softmax or sigmoid)
- The loss function must be suitable for continuous outputs

---

## 4. Loss Function: Mean Squared Error (MSE)

For regression tasks, we use **Mean Squared Error (MSE)** loss.  
MSE measures the average squared difference between predicted and actual values.

Mathematically:

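$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the actual value.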


Key points:

- Predictions and targets must be **float tensors**
- MSE penalizes larger errors more heavily
- Lower MSE indicates better model performance

---

## 5. Preparing Data Before Training

We start with two NumPy arrays:

- `features`: input data (already encoded)
- `target`: normalized salary values

Steps:

- Convert both arrays to **float tensors**
- Wrap them in a **TensorDataset** to pair inputs with targets
- Use **DataLoader** to enable batching and shuffling

Batching improves training efficiency and generalization.  
We use a batch size of 4 for this example, but it can be adjusted based on dataset size and hardware.

Model setup:

- Input layer: 4 features
- Output layer: 1 target value
- No one-hot encoding of the target is needed since this is regression

Optimizer setup:

- Use **SGD** or another optimizer
- Set a learning rate (e.g., 0.001 is a good default)

---

## 6. The Training Loop

The training loop runs for multiple epochs.  
Each epoch involves:

1. **Iterating through the DataLoader**  
   Each batch contains a subset of the dataset

2. **Zeroing gradients**  
   `optimizer.zero_grad()` clears previous gradients to avoid accumulation

3. **Forward pass**  
   Pass batch features through the model to get predictions

4. **Loss calculation**  
   Compare predictions with actual targets using the loss function

5. **Backward pass**  
   `loss.backward()` computes gradients of the loss with respect to model parameters

6. **Parameter update**  
   `optimizer.step()` adjusts weights and biases using the gradients

This process repeats for each batch in every epoch.  
Over time, the model learns to minimize the loss and improve predictions.
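
The six steps can be sketched end to end on synthetic data. Everything below (the random features, the made-up `true_weights`, the layer sizes, and the learning rate) is an assumption chosen so the example runs on its own; it is not the salary dataset.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: 32 samples, 4 features, 1 continuous target
features = torch.randn(32, 4)
true_weights = torch.tensor([[0.5], [-1.0], [2.0], [0.0]])
targets = features @ true_weights

dataloader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for batch_features, batch_targets in dataloader:   # 1. iterate over batches
        optimizer.zero_grad()                          # 2. clear old gradients
        predictions = model(batch_features)            # 3. forward pass
        loss = criterion(predictions, batch_targets)   # 4. loss calculation
        loss.backward()                                # 5. backward pass
        optimizer.step()                               # 6. parameter update
        running_loss += loss.item()
    # Average over all batches gives a smoother signal than the last batch alone
    epoch_losses.append(running_loss / len(dataloader))

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Averaging the batch losses per epoch is a design choice: a single batch loss is noisy, while the epoch average shows whether training is actually converging.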

---

## Summary

- A training loop is the backbone of model learning in PyTorch
- Regression tasks use linear output layers and MSE loss
- Data must be properly formatted and batched
- Gradients are computed and used to update parameters for every batch in each epoch
- The loop repeats until the model converges or reaches a stopping criterion

This structure gives full control over training and is the foundation for building more advanced workflows like validation, early stopping, and learning rate scheduling.


In [3]:
import numpy as np
import torch
import pandas as pd
from sklearn.preprocessing import LabelEncoder


y_pred = np.array([3, 5.0, 2.5, 7.0])  
y = np.array([3.0, 4.5, 2.0, 8.0])     

# Calculate MSE using NumPy
mse_numpy = np.mean((y_pred - y) ** 2)

# Create the MSELoss function in PyTorch
criterion = torch.nn.MSELoss()

# Calculate MSE using PyTorch
mse_pytorch = criterion(torch.tensor(y_pred, dtype=torch.float32), torch.tensor(y, dtype=torch.float32))

print("MSE (NumPy):", mse_numpy)
print("MSE (PyTorch):", mse_pytorch)


MSE (NumPy): 0.375
MSE (PyTorch): tensor(0.3750)


## Writing a training loop

In scikit-learn, the training loop is wrapped in the .fit() method, while in PyTorch, it's set up manually. While this adds flexibility, it requires a custom implementation.


The show_results() function is provided to help you visualize some sample predictions.

The package imports provided are: pandas as pd, torch, torch.nn as nn, torch.optim as optim, as well as DataLoader and TensorDataset from torch.utils.data.

The following variables have been created: num_epochs, containing the number of epochs (set to 5); dataloader, containing the dataloader; model, containing the neural network; criterion, containing the loss function, nn.MSELoss(); optimizer, containing the SGD optimizer.

In [None]:
# Loop over the number of epochs and the dataloader
for i in range(num_epochs):
  for data in dataloader:
    # Set the gradients to zero
    optimizer.zero_grad()
    # Run a forward pass
    feature, target = data
    prediction = model(feature)    
    # Compute the loss
    loss = criterion(prediction, target)    
    # Compute the gradients
    loss.backward()
    # Update the model's parameters
    optimizer.step()
show_results(model, dataloader)


## **Full implementation of regression prediction with a PyTorch neural network**

In [100]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import LabelEncoder

**Step 1: Load the data (the learning material)**

In [101]:
df = pd.read_csv("datascience_salary.csv")


**Step 2: Encode categorical columns as integers (the model cannot take text)**

In [102]:
label_cols = ["experience_level", "employment_type", "company_size",
              "job_title", "salary_currency", "employee_residence", "company_location"]
for col in label_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    # Label encoding maps each category (e.g. experience level) to an arbitrary integer.
    # This implies an order that may not exist, so it is not ideal for ML, but it keeps the example simple.

**Step 3: Drop rows with a missing target**

In [103]:
df = df[df["salary_in_usd"].notna()]

**Step 4: Define features and target (what to learn from and what to predict), then scale both**

In [104]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
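# Extract features and target
# Caution: judging by the df.head() output further down, the features still include
# `salary` and `salary_currency`, which together reveal the target, plus an index-like
# `Unnamed: 0` column. Consider dropping them to avoid target leakage.
X_raw = df.drop(columns=["salary_in_usd"]).values
y_raw = df["salary_in_usd"].values.reshape(-1, 1)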

# Scale both
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X = scaler_X.fit_transform(X_raw)
y = scaler_y.fit_transform(y_raw)

**Step 5: Convert everything to PyTorch tensors**

In [105]:
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

**Step 6: Wrap the tensors in a dataset and a data loader**

In [106]:
dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

**Step 7: Define the model (the path from inputs to output)**

In [116]:
model = nn.Sequential(
    nn.Linear(X.shape[1], 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1)
)

**Step 8: Define the loss function and optimizer (the self-test and the improvement rule)**

In [117]:
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)


**Step 9: Run the training loop (learn, self-test, improve)**

In [119]:
num_epochs = 13
for epoch in range(num_epochs):
    for batch_features, batch_targets in dataloader:
        predictions = model(batch_features)
        loss = criterion(predictions, batch_targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # This is the loss of the last batch only, so it fluctuates from epoch to epoch
    print("Epoch", epoch + 1, "Loss:", loss.item())

Epoch 1 Loss: 0.00026889806031249464
Epoch 2 Loss: 0.004797717090696096
Epoch 3 Loss: 0.0008252369589172304
Epoch 4 Loss: 0.003989673685282469
Epoch 5 Loss: 0.008695780299603939
Epoch 6 Loss: 0.0004931185976602137
Epoch 7 Loss: 0.10007531940937042
Epoch 8 Loss: 0.00035373959690332413
Epoch 9 Loss: 0.005076317116618156
Epoch 10 Loss: 0.0023416171316057444
Epoch 11 Loss: 0.009227569214999676
Epoch 12 Loss: 0.013410422019660473
Epoch 13 Loss: 0.0009954485576599836


**Step 10: Evaluate (the exam): predict and convert back to dollars**

In [120]:
preds_scaled = model(X_tensor).detach().numpy()
preds_actual = scaler_y.inverse_transform(preds_scaled)
print("Predicted salaries:", preds_actual[:5].squeeze())
# actual predicted values

Predicted salaries: [107249.44  127464.266 100694.266 129942.04  113806.57 ]


In [124]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2021,2,2,89,30400000,4,40038,15,100,14,0
1,2021,2,2,21,11000000,9,36259,38,50,74,0
2,2020,2,2,89,11000000,9,35735,38,50,34,0
3,2021,2,2,114,8500000,12,77364,47,50,42,2
4,2022,3,2,113,7500000,11,95386,42,50,38,0


# Day 9

# Activation Functions in Neural Networks

## 1. Purpose of Activation Functions

Activation functions introduce **non-linearity** into neural networks, allowing them to learn complex patterns.  
They play a crucial role in the **training loop**, especially during **backpropagation**, where gradients guide weight updates.

---

## 2. Sigmoid and Softmax Functions

These are commonly used in the **final layer** of a neural network for classification tasks.

- **Sigmoid**: Maps input to a range between 0 and 1.
- **Softmax**: Converts a vector of values into probabilities that sum to 1.
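
A short sketch of both functions on made-up values. The logits are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
softmax = nn.Softmax(dim=-1)

# Sigmoid squashes a single score into (0, 1); an input of 0 maps to 0.5
print(sigmoid(torch.tensor([[0.0]])))  # tensor([[0.5000]])

# Softmax turns a vector of scores into probabilities that sum to 1
logits = torch.tensor([[2.0, -1.0, 0.5]])
probabilities = softmax(logits)
print(probabilities)
print(probabilities.sum(dim=-1))
```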


![image.png](attachment:f955ccf0-07a3-488a-ab2b-081574319fb9.png)

---

## 3. Limitations of Sigmoid and Softmax

![image.png](attachment:7f29a9e7-3237-4b6f-b114-7965b80bb2a4.png)

### Saturation and Vanishing Gradients

- **Sigmoid outputs** are bounded between 0 and 1.
- For very large positive or very large negative input values, the **gradient becomes very small**.
- This leads to **saturation**, where neurons stop learning effectively.
- During backpropagation, small gradients compound and result in **vanishing gradients**.
- **Softmax** suffers from similar saturation issues.
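
Saturation is easy to observe with autograd. This sketch evaluates the sigmoid gradient at three arbitrary inputs; the gradient is 0.25 at 0 and drops to roughly 4.5e-5 at 10.

```python
import torch

inputs = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
outputs = torch.sigmoid(inputs)

# Summing lets a single backward() call produce d(sigmoid)/dx for every element
outputs.sum().backward()

print(inputs.grad)
```

Multiplying several such small gradients together across layers is what makes them vanish during backpropagation.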

### Best Use Case

- These functions are **not ideal for hidden layers**.
- They are best used in the **output layer** for classification.

---

## 4. ReLU (Rectified Linear Unit)

![image.png](attachment:47bf0a43-9477-46b6-84ac-70b99adca849.png)


### Definition

- Outputs the **maximum of 0 and the input**:
  - Positive input -> output equals input
  - Negative input -> output is 0

### Advantages

- **No upper bound** on output
- **Gradient stays at 1** for positive inputs, so it does not shrink as inputs grow
- Helps overcome the **vanishing gradients problem**
- Efficient and widely used in **hidden layers**

### PyTorch Implementation

- Available via `torch.nn.ReLU`
- Default choice for many deep learning architectures

---

## 5. Leaky ReLU


![image.png](attachment:395309ac-0a13-41db-a4d4-10a208f30fd4.png)


### Definition

- Variation of ReLU:
  - Positive input -> behaves like ReLU
  - Negative input -> output is a small negative value (e.g., 0.01 × input)

### Advantages

- Prevents neurons from **dying** (i.e., outputting zero permanently)
- Maintains **non-zero gradients** for negative inputs

### PyTorch Implementation

- Available via `torch.nn.LeakyReLU`
- `negative_slope` parameter controls the leak factor (default is 0.01)

---

## 6. Summary

| Activation Function | Output Range     | Gradient Behavior                     | Best Used In  |
|---------------------|------------------|---------------------------------------|---------------|
| Sigmoid             | 0 to 1           | Vanishes at extremes                  | Output layer  |
| Softmax             | 0 to 1 (sum = 1) | Saturates like sigmoid                | Output layer  |
| ReLU                | 0 to ∞           | 1 for positive inputs, 0 for negative | Hidden layers |
| Leaky ReLU          | -∞ to ∞          | Non-zero for all inputs               | Hidden layers |

---

## Key Takeaway

Understanding activation functions - especially their gradient behavior - is essential for designing effective neural networks.  
**ReLU and Leaky ReLU** are preferred for hidden layers due to their ability to maintain strong gradients and avoid saturation.



In [4]:
# regular ReLU

import torch
from torch import nn

# Create a ReLU function with PyTorch
relu_pytorch = nn.ReLU()

x_pos = torch.tensor(2.0)
x_neg = torch.tensor(-3.0)

# Apply the ReLU function to the tensors
output_pos = relu_pytorch(x_pos)
output_neg = relu_pytorch(x_neg)

print("ReLU applied to positive value:", output_pos)
print("ReLU applied to negative value:", output_neg)

ReLU applied to positive value: tensor(2.)
ReLU applied to negative value: tensor(0.)


In [5]:
# Leaky ReLU
# Create a leaky relu function in PyTorch
leaky_relu_pytorch = nn.LeakyReLU(negative_slope=.05)

x = torch.tensor(-2.0)
# Call the above function on the tensor x
output = leaky_relu_pytorch(x)
print(output)

tensor(-0.1000)


# Optimizer Hyperparameters: Learning Rate and Momentum

Training a neural network involves minimizing a **loss function** by adjusting model parameters (weights and biases).  
This is done using optimization algorithms like **Stochastic Gradient Descent (SGD)** or its variants.  
Two key hyperparameters that control this process are:

- **Learning Rate (lr)**: controls the step size of each update.
  Each step equals gradient × learning rate, so with a well-chosen learning rate the steps shrink naturally as the gradient flattens near the minimum.

![image.png](attachment:642f337e-369c-41be-b90c-db01ef672d1d.png)

With a learning rate that is too small, the steps are tiny: progress is slow and the optimizer may not reach the minimum in the available steps.

![image.png](attachment:432316e4-0699-4644-92c7-f940a3e35366.png)


With a learning rate that is too high, the optimizer oscillates back and forth across the minimum and training becomes unstable.

![image.png](attachment:eb2f8bc8-be20-405e-af5d-2a1a58029ce9.png)


- **Momentum**: adds inertia to the updates, which helps the optimizer avoid getting stuck.

Understanding these is essential for stable, efficient, and successful training.

---

##  Learning Rate

### What It Is
The **learning rate** determines the size of the steps taken toward the minimum of the loss function during training.

### Formula
`new_weight = old_weight - (learning_rate × gradient)`
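
As a worked example of this formula, the sketch below takes one step on the toy loss `weight ** 2`. The starting weight of 3.0 and the learning rate of 0.1 are arbitrary choices for illustration.

```python
import torch

learning_rate = 0.1
weight = torch.tensor(3.0, requires_grad=True)

loss = weight ** 2   # toy loss with its minimum at weight = 0
loss.backward()      # gradient is 2 * weight = 6.0

# new_weight = old_weight - (learning_rate * gradient) = 3.0 - 0.1 * 6.0
with torch.no_grad():
    new_weight = weight - learning_rate * weight.grad

print(new_weight)  # tensor(2.4000)
```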


### Effects of Learning Rate

| Learning Rate | Behavior |
|---------------|----------|
| Too High      | Overshoots minimum, oscillates or diverges |
| Too Low       | Slow convergence, may get stuck |
| Just Right    | Smooth descent toward minimum |

### Typical Ranges
- Common values: `0.01`, `0.001`, `0.0001`
- Depends on model architecture, dataset size, and optimizer


### Best Practices
- Start with `0.001` for Adam, `0.01` for SGD
- Use **learning rate scheduling** to reduce lr over time
- Use **learning rate finder** tools to empirically choose lr

---

##  Momentum

### What It Is
Momentum helps the optimizer **accelerate in consistent directions** and dampen oscillations.  
It adds a fraction of the previous update to the current one.

### Formula
`velocity = momentum × previous_velocity + learning_rate × gradient`

`new_weight = old_weight - velocity`
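
The two formulas can be traced in plain Python on a toy loss `f(w) = w ** 2`. The function names `gradient` and `descend`, the starting point, and the step count are all hypothetical choices for this sketch.

```python
def gradient(weight):
    # Gradient of the toy loss f(w) = w ** 2
    return 2.0 * weight


def descend(momentum, learning_rate=0.01, steps=20, start=3.0):
    weight, velocity = start, 0.0
    for _ in range(steps):
        # velocity = momentum * previous_velocity + learning_rate * gradient
        velocity = momentum * velocity + learning_rate * gradient(weight)
        # new_weight = old_weight - velocity
        weight = weight - velocity
    return weight


plain = descend(momentum=0.0)
with_momentum = descend(momentum=0.9)
print(f"no momentum: {plain:.3f}, momentum 0.9: {with_momentum:.3f}")
```

With the same small learning rate, the momentum run ends closer to the minimum at 0 after 20 steps because consecutive updates in the same direction accumulate; it can also carry the weight slightly past the minimum. Note that `torch.optim.SGD` uses a slightly different but closely related form, in which the velocity accumulates raw gradients and the learning rate is applied when the weight is updated.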

Without momentum, the optimizer only reaches a local minimum:
![image.png](attachment:75b187ac-651c-44ea-ab11-308413209557.png)

With momentum, it reaches the global minimum, i.e. the lowest achievable loss:
![image.png](attachment:f027738e-9ed8-4c2d-aafc-f79a3ba62052.png)

### Effects of Momentum

| Momentum Value | Behavior |
|----------------|----------|
| 0              | Pure SGD, may get stuck in local minima |
| 0.9            | Smooth updates, escapes shallow dips |
| >0.99          | May overshoot or oscillate |


- `0`: no memory. Every step is independent (like walking on ice).
- `0.9`: good memory. A smooth, steady roll that ignores small bumps.
- `>0.99`: too much memory. Hard to stop or turn; might zoom past the goal.

### Typical Range
- Usually between `0.85` and `0.99`

### Benefits
- Helps escape **local minima**
- Speeds up convergence
- Reduces zig-zagging in steep valleys

---

##  Combined Impact

| Scenario                        | Result |
|---------------------------------|--------|
| High lr, no momentum            | Unstable, diverges |
| Low lr, no momentum             | Very slow training |
| Moderate lr, high momentum      | Fast and stable convergence |
| High lr, high momentum          | Risk of overshooting |

---

##  Convex vs Non-Convex Loss Functions

- **Convex functions** have one global minimum — easy to optimize
- **Non-convex functions** (common in deep learning) have many local minima and saddle points

Momentum helps **push through local dips** and reach better minima.

---

##  Practical Tips

- Use **Adam optimizer** for most tasks — it combines momentum and adaptive learning rates
- Monitor **loss curves** to detect instability or slow learning
- Use **gradient clipping** if gradients explode
- Try **cyclical learning rates** or **warm restarts** for better generalization

---

Learning rate intuition, picturing training as walking down into a valley:

- Too big: you overshoot the valley and bounce around wildly (training explodes).
- Too small: you crawl very slowly (training takes forever).
- Just right: you reach the bottom efficiently.

---

##  Summary Table

| Hyperparameter | Controls         | Too Low Behavior         | Too High Behavior         | Typical Range  |
|----------------|------------------|--------------------------|---------------------------|----------------|
| Learning Rate  | Step size        | Slow convergence         | Divergence, instability   | 0.0001 to 0.01 |
| Momentum       | Update smoothing | Gets stuck in local dips | Overshooting, oscillation | 0.85 to 0.99   |

---

##  Key Takeaways

- **Learning rate** controls how fast the model learns  
- **Momentum** helps the optimizer move smoothly and escape traps  
- Tuning both is critical for efficient and stable training  
- Use visualizations (loss curves) and experimentation to find optimal values

---

##  Other Techniques

- **Nesterov Momentum**: Looks ahead before updating — often more stable than classic momentum
- **Adaptive Optimizers**: Adam, RMSprop, Adagrad — adjust learning rate per parameter
- **Learning Rate Scheduling**: Reduce lr over time (step decay, exponential decay, cosine annealing)
- **Warmup**: Start with small lr and gradually increase — helps stabilize early training

Use the learning rate finder trick:

1. Start with a very small LR.
2. Gradually increase it each batch.
3. Plot loss vs. LR.
4. Pick an LR from the steepest part of the drop, just before the loss starts rising.
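
A rough sketch of a learning rate sweep on synthetic data. The data, the linear model, the sweep range from `1e-5` to `1.0`, and the 50 steps are all assumptions for illustration; dedicated tools add smoothing and early stopping that are left out here.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression data standing in for a real dataset
features = torch.randn(256, 4)
targets = features.sum(dim=1, keepdim=True)

model = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-5)

num_steps, start_lr, end_lr = 50, 1e-5, 1.0
growth = (end_lr / start_lr) ** (1 / (num_steps - 1))  # multiplicative increase per batch

learning_rates, losses = [], []
lr = start_lr
for step in range(num_steps):
    # Set the learning rate for this batch
    for group in optimizer.param_groups:
        group["lr"] = lr

    batch = torch.randint(0, len(features), (16,))  # random mini-batch indices
    optimizer.zero_grad()
    loss = criterion(model(features[batch]), targets[batch])
    loss.backward()
    optimizer.step()

    learning_rates.append(lr)
    losses.append(loss.item())
    lr *= growth

print(f"swept {len(losses)} learning rates from {learning_rates[0]:.0e} to {learning_rates[-1]:.0e}")
```

Plotting `losses` against `learning_rates` on a logarithmic x-axis then shows where the loss falls fastest.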


     
Combined settings at a glance:

- High LR + no momentum: wild swings; training fails.
- Low LR + no momentum: turtle speed; wastes time.
- Medium LR + high momentum (0.9): smooth and fast; the Goldilocks zone.
- High LR + high momentum: a rocket with no brakes; overshoots.





# Experimenting with learning rate

Find a learning rate that lets the optimizer reach the minimum of the non-convex function in ten steps.

Experiment with three different learning rate values between 0.001 and 0.1.

The `optimize_and_plot()` function takes the learning rate as its first argument. It runs 10 steps of the SGD optimizer and displays the results.

---

# Tweaking only the learning rate to see which value works (no momentum, 10 steps)

lr = 0.001

![image.png](attachment:84619264-56ff-4ad2-9476-7d825e1bbe57.png)

lr = 0.01

![image.png](attachment:36d837b4-de89-419e-b28b-14d209e33781.png)

lr = 0.1

![image.png](attachment:a2650603-0da7-42be-b9ef-008982af524b.png)

lr = 0.09

![image.png](attachment:059c5d0c-a38e-4818-8a04-11cc9c985679.png)

---

# Tweaking momentum with the learning rate fixed at 0.01 (20 steps)

momentum = 0.1

![image.png](attachment:6f19c9be-0106-4bd8-babe-ceb552130e57.png)

momentum = 0.918

![image.png](attachment:9127120a-5235-4c43-9cff-38a379457a82.png)

Momentum and learning rate are critical to the training of your neural network. A good rule of thumb is to start with a learning rate of 0.001 and a momentum of 0.95.






# Layer Initialization, Transfer Learning, and Fine-Tuning in PyTorch

---

## 1. Overview

Neural networks learn by updating weights during training.  
To improve performance and efficiency, we use advanced techniques like:

- **Layer Initialization**
- **Transfer Learning**
- **Fine-Tuning**

These methods help stabilize training, reuse learned knowledge, and adapt models to new tasks.

---

## 2. Layer Initialization

### What It Is
Layer initialization refers to how weights of neural network layers are set **before training begins**.

### Why It Matters
- Prevents unstable outputs and exploding activations
- Ensures gradients flow properly during backpropagation
- Improves convergence speed and model performance

### Typical Initialization Ranges
- Default PyTorch linear layers: weights are drawn uniformly from `-1/sqrt(in_features)` to `+1/sqrt(in_features)`, e.g. between `-0.125` and `+0.125` for a layer with 64 inputs
- Custom initialization (e.g. uniform from `0` to `1`) can be done using `torch.nn.init`
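
The default range can be checked directly. This sketch assumes a layer with 64 inputs, which is where the ±0.125 bound comes from in current PyTorch releases, and then switches to a custom uniform initialization.

```python
import math

import torch.nn as nn

layer = nn.Linear(64, 128)
bound = 1 / math.sqrt(layer.in_features)  # 0.125 for 64 inputs

default_min = layer.weight.min().item()
default_max = layer.weight.max().item()
print(f"default range: [{default_min:.4f}, {default_max:.4f}], bound: {bound}")

# Custom initialization: resample the weights in place from U(0, 1)
nn.init.uniform_(layer.weight)
custom_min = layer.weight.min().item()
custom_max = layer.weight.max().item()
print(f"custom range:  [{custom_min:.4f}, {custom_max:.4f}]")
```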

### Common Initialization Methods
| Method            | Description                                      | Use Case                        |
|-------------------|--------------------------------------------------|----------------------------------|
| Uniform           | Random values in a fixed range                   | Simple models                    |
| Normal            | Values from a Gaussian distribution              | General-purpose                  |
| Xavier (Glorot)   | Scales based on input/output size                | Sigmoid/tanh activations         |
| He Initialization | Scales based on input size only                  | ReLU activations                 |
| Constant/Zero     | Fixed values (rarely used)                       | Debugging or freezing layers     |
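
Several of these methods map onto functions in `torch.nn.init`. In the sketch below the layer sizes and the pairing of layers with activations are assumptions; the functions ending in `_` modify the tensor in place.

```python
import torch.nn as nn

tanh_layer = nn.Linear(16, 32)  # imagined to feed a tanh activation
relu_layer = nn.Linear(32, 64)  # imagined to feed a ReLU activation

# Xavier (Glorot): scale depends on both input and output size
nn.init.xavier_uniform_(tanh_layer.weight)

# He (Kaiming): scale depends on input size, suited to ReLU
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")

# Constant/zero initialization, here applied to the biases
nn.init.zeros_(tanh_layer.bias)
nn.init.zeros_(relu_layer.bias)

xavier_bound = (6 / (16 + 32)) ** 0.5
print("largest Xavier weight:", tanh_layer.weight.abs().max().item(), "bound:", xavier_bound)
```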

### Best Practices
- Match initialization method to activation function
- Normalize input data to complement weight scale
- Monitor early training behavior for instability

---

## 3. Transfer Learning

### What It Is
Transfer learning reuses a **pretrained model** on a new but related task.

### Why It’s Useful
- Saves time and compute
- Leverages learned features from large datasets
- Improves performance on small or domain-specific datasets

### Workflow
1. Train model on Task A (e.g. US salary prediction)
2. Save weights using `torch.save(model.state_dict(), path)`
3. Load weights for Task B (e.g. EU salary prediction) using `model.load_state_dict(torch.load(path))`
4. Continue training on new data
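
A minimal sketch of the save and load steps. The helper `build_model`, the architecture, and the file name are hypothetical; an untrained model stands in for the one trained on Task A, and a temporary directory keeps the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Both tasks must use the same architecture for the weights to fit
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


model_a = build_model()  # stands in for the model trained on Task A

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "task_a_weights.pth")  # hypothetical file name
    torch.save(model_a.state_dict(), path)

    model_b = build_model()                    # fresh model for Task B
    model_b.load_state_dict(torch.load(path))  # start from Task A's weights

weights_match = all(
    torch.equal(p_a, p_b)
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters())
)
print("Weights transferred:", weights_match)
```

From here, `model_b` would be trained on the Task B data with the usual training loop.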

### Key Concepts
- **Feature reuse**: Early layers capture general patterns (e.g. edges, shapes)
- **Task adaptation**: Later layers are retrained to fit the new task

---

## 4. Fine-Tuning

### What It Is
Fine-tuning is a **specific type of transfer learning** where:
- You load pretrained weights
- You train the model further on a new dataset 
- You use a **smaller learning rate** to preserve learned features

### Freezing Layers
- You can **freeze** parts of the network to avoid retraining them
- Typically, **early layers** are frozen and **later layers** are fine-tuned

### How to Freeze
- Set `requires_grad = False` for selected parameters
- Use `model.named_parameters()` to access and control layer-wise training
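
A sketch of freezing plus a smaller learning rate, assuming a hypothetical three-layer model in which only the last linear layer (index 4 in this `nn.Sequential`) stays trainable.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

# Freeze every parameter that does not belong to the last linear layer
for name, param in model.named_parameters():
    if not name.startswith("4."):
        param.requires_grad = False

trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(trainable)  # ['4.weight', '4.bias']

# Hand only the trainable parameters to the optimizer, with a small learning rate
optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.0001)
```

Passing only the trainable parameters to the optimizer is optional, since frozen parameters receive no gradients, but it makes the intent explicit.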

### Benefits
- Faster training
- Prevents overfitting
- Preserves useful representations

---

## 5. Summary Table

| Technique         | Purpose                          | Key Benefit                     | PyTorch Tooling                  |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| Layer Initialization | Stabilize training from the start | Prevents exploding/vanishing gradients | `torch.nn.init`                  |
| Transfer Learning | Reuse pretrained model weights   | Saves time, boosts performance  | `torch.save`, `torch.load`       |
| Fine-Tuning       | Adapt model to similar task      | Efficient learning, avoids overfitting | `requires_grad`, `named_parameters()` |

---

## 6. Key Takeaways

- Proper **initialization** ensures stable and efficient training
- **Transfer learning** allows you to build on existing models instead of starting from scratch
- **Fine-tuning** lets you selectively adapt models to new tasks while preserving useful knowledge
- These techniques are essential for modern deep learning workflows, especially when working with limited data or compute

## 7. Mnemonic
**FLFTA**

Forever Laughing Flying Through Adventures

Find Load Freeze Train Adjust


![image.png](attachment:d3d405d0-cf00-491b-a831-db9dc434a78a.png)

# Freeze layers of a model

The task is to fine-tune a model on a new task after loading pre-trained weights. The model contains three linear layers. However, because the dataset is small, only the last linear layer should be trained, and the first two linear layers should be frozen.

The model has already been created and exists under the variable model. You will be using the named_parameters method of the model to list the parameters of the model. Each parameter is described by a name. This name is a string with the following naming convention: x.name where x is the index of the layer.

Remember that a linear layer has two parameters: the weight and the bias.

```
for name, param in model.named_parameters():

    # Freeze the first layer's weight and bias
    if name in ('0.weight', '0.bias'):
        param.requires_grad = False

    # Freeze the second layer's weight and bias
    if name in ('1.weight', '1.bias'):
        param.requires_grad = False
```

--- 

# Layer Initialization with initial weight
The initialization of the weights of a neural network has been the focus of researchers for many years. When training a network, the method used to initialize the weights has a direct impact on the final performance of the network.

As a machine learning practitioner, you should be able to experiment with different initialization strategies. In this exercise, you are creating a small neural network made of two layers and you are deciding to initialize each layer's weights with the uniform method.


```
import torch.nn as nn

layer0 = nn.Linear(16, 32)
layer1 = nn.Linear(32, 64)

# Use uniform initialization for layer0 and layer1 weights
nn.init.uniform_(layer0.weight)
nn.init.uniform_(layer1.weight)

model = nn.Sequential(layer0, layer1)
```

