# <h1 align="center">ML Modeling</h1>
<h3 align="center">Creating Models with Deep Learning</h3>

# Types of Neural Networks
- Recurrent Neural Networks (RNNs) – Used for sequential data such as time series and NLP.
- Convolutional Neural Networks (CNNs) – Used for image processing and computer vision.

### Non-Linear Activation Functions
These functions define the output of a node / neuron given its input signals. Linear activation functions are generally not used in hidden layers.

Best practice is to choose a non-linear activation function, because non-linear functions:
- Can create complex mappings between inputs and outputs
- Allow backpropagation (because they have a useful derivative)
- Allow for multiple layers (stacked linear layers collapse into a single linear layer)

 **Sigmoid/Logistic Function**
- Scales everything from 0 to 1 based on neuron input.  

 **Tanh / Hyperbolic Tangent Function**
- Scales everything from -1 to 1 based on neuron input.  
- Generally preferred over sigmoid because its output is centered around 0.  

 **Cons for Both:**
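- Suffers from the **vanishing gradient problem**.  
- The output saturates for large positive or negative inputs, so it changes very slowly there and the gradient is close to 0.  
- More expensive to compute than ReLU (both need an exponential).  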
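
The saturation behind the vanishing gradient problem can be seen numerically. The following is a minimal sketch using only the standard library; the function names are illustrative and not from any particular framework.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, at most 1.0 (at x = 0)."""
    return 1.0 - math.tanh(x) ** 2

# The gradients shrink quickly as |x| grows (saturation).
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

Because backpropagation multiplies these derivatives layer by layer, many small factors in a row drive the overall gradient toward 0.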

---

 **Rectified Linear Unit (ReLU)**
- Most commonly used, very popular choice.  
- Easy & fast to compute because it is piecewise linear (`max(0, x)`).  

 **Cons:**
- When inputs are zero or negative, the output and the gradient are both 0, so a neuron can get stuck and stop learning (**Dying ReLU problem**).  
- Solution: **Leaky ReLU**.  

---

 **Leaky ReLU**
- Solves **Dying ReLU** by introducing a negative slope below 0.  
- The negative slope is a fixed hyperparameter chosen in advance (commonly a small value such as 0.01).  
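
As a rough illustration of the difference between ReLU and Leaky ReLU, here is a small standard-library sketch. The slope `alpha=0.01` is an assumed, commonly used default, and the function names are illustrative.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through, clamps the rest to 0."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU: 0 for x <= 0, which is what lets neurons 'die'."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: keeps a small slope `alpha` for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient of Leaky ReLU: never exactly 0, so learning can continue."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, 0.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, ReLU passes back a gradient of 0 while Leaky ReLU still passes back `alpha`, so the weights feeding that neuron keep receiving updates.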

---

**Parametric ReLU (PReLU)**
- ReLU, but the slope in the negative part is **learned via backpropagation**.  
- More flexible, but adds learnable parameters and some extra **computational cost**.  

---

**Other ReLU Variants**

**Swish**
- Developed by **Google**; defined as `x * sigmoid(x)`, a smooth alternative to ReLU.  
- Mostly beneficial for **very deep networks** (40+ layers).  

---

<table style="width: 100%; text-align: center;">
    <tr>
        <td><img src="../Figures/Modeling/ActivationFuns.png" style="width: 600px;"></td>
        <td><img src="../Figures/Modeling/Relu.png" style="width: 600px;"></td>
    </tr>
</table>


**Softmax** is commonly used as the activation of the final output layer.
- Used on the final output layer of a multi-class classification problem.
- Converts outputs to a probability for each class.
- Can't produce more than one label per input: the probabilities sum to 1, so the classes compete and the highest-probability class is chosen.
- For multi-label problems, use a sigmoid per output instead (typically with Binary Cross-Entropy (BCE) loss), since each sigmoid output is independent of the others.
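
The contrast between softmax and per-output sigmoid can be sketched as follows. The logits are made-up example values, and the 0.5 threshold for sigmoid is an assumed convention.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single output."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes

probs = softmax(logits)               # classes compete: exactly one winner
multi = [sigmoid(z) for z in logits]  # each label decided on its own

print("softmax:", [round(p, 3) for p in probs], "sum =", round(sum(probs), 3))
print("sigmoid:", [round(p, 3) for p in multi])
print("labels above 0.5:", [i for i, p in enumerate(multi) if p > 0.5])
```

With softmax, only the first class wins. With sigmoid, all three outputs exceed 0.5 for these logits, so all three labels would be predicted.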

## CNNs
CNNs excel at spatially structured (grid-like) data, which is typically unstructured data such as:
- ✅ Images & Videos – Object detection, segmentation, recognition.
- ✅ Audio & Spectrograms – Converting speech/audio into an image-like format for classification.
- ✅ Medical Imaging – X-rays, MRIs, CT scans for disease detection.

### **How CNNs Work**  
CNNs consist of several key layers that transform raw images into meaningful patterns:

#### **1️⃣ Convolutional Layer**
- Applies **filters (kernels)** that scan the image.
- Detects **edges, textures, and shapes**.
- Each filter learns different features automatically.
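
To make the scanning idea concrete, here is a minimal pure-Python sketch of a 2D convolution (no padding, stride 1). The kernel values are hand-picked for illustration; in a real CNN they would be learned during training.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A 4x4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0.0, 2.0, 0.0]: the edge sits in the middle column
```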

#### **2️⃣ Pooling Layer**
- **Reduces image size** while keeping essential features.
- Helps make computations **faster and more efficient**.
- Common types: **Max Pooling, Average Pooling**.
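
A small sketch of pooling, assuming a 2x2 window with stride 2 (a common but not mandatory choice). The helper name `pool2d` and the example values are illustrative.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Reduce each size x size window to a single value using `op`."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(op(window))
        out.append(row)
    return out

def average(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

max_pooled = pool2d(feature_map)              # [[4, 2], [2, 8]]
avg_pooled = pool2d(feature_map, op=average)  # [[2.5, 1.0], [1.25, 6.5]]
print(max_pooled)
print(avg_pooled)
```

The 4x4 input shrinks to 2x2, which is where the reduction in computation for later layers comes from.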

#### **3️⃣ Fully Connected Layer**
- Takes extracted features and makes final predictions.
- Works like a traditional neural network to classify data.

CNNs are very resource intensive and have lots of hyper-parameters (e.g. kernel sizes, number of filters, number of layers). The main difficulty is usually getting enough training data.

## RNNs
Recurrent Neural Networks (RNNs) are designed for sequential data processing, making them ideal for tasks like time-series forecasting, natural language processing (NLP), and speech recognition. Unlike traditional neural networks, RNNs have memory, allowing them to capture dependencies over time.

#### **1️⃣ Recurrent Connections**
- Each neuron receives the current input plus its own **output from the previous timestep**.
- Helps the network **remember past information** while processing new data.
- Key feature: **Hidden states**, which store context across timesteps.
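
The hidden state idea can be sketched with a single-unit RNN. This is a toy with scalar weights chosen arbitrarily for illustration (`w_x`, `w_h`, `b` are assumed names); real RNNs use weight matrices and learn them.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One timestep: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run the cell over a sequence, returning the hidden state at each step."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Same inputs at steps 2 and 3, but a different input at step 1.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

print([round(h, 3) for h in states_a])
print([round(h, 3) for h in states_b])
```

The last hidden state of `states_a` is still non-zero even though the last two inputs are 0: the first input is "remembered" through the recurrent connection.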

#### **2️⃣ Backpropagation Through Time (BPTT)**
- Special version of **backpropagation** used to update RNN weights.
- Unrolls the network over time and **computes gradients across timesteps**.
- **Problem:** Can lead to **vanishing gradients**, making training difficult.
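
To see why gradients vanish over many timesteps, consider a scalar toy RNN `h_t = tanh(w_h * h_{t-1})`. By the chain rule, the gradient of the last state with respect to the first is a product of one factor `w_h * (1 - h_t^2)` per step. The sketch below multiplies those factors; the weight and starting state are arbitrary illustrative values.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return d h_steps / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - tanh(...)^2) = w_h * (1 - h^2)
        grad *= w_h * (1.0 - h * h)
    return grad

for steps in (1, 5, 10, 20):
    print(steps, gradient_through_time(w_h=0.5, steps=steps))
```

With `w_h = 0.5` every factor is at most 0.5, so after 20 steps the gradient is below `0.5 ** 20`, which is far too small to carry a useful learning signal back to early timesteps.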

#### **3️⃣ Variants of RNNs**
To overcome RNN limitations, several advanced architectures have been developed:
- **LSTMs (Long Short-Term Memory)** → Adds **gates (forget, input, output)** to control memory.
- **GRUs (Gated Recurrent Units)** → Similar to LSTMs but with fewer parameters, making them more efficient.
- **Bidirectional RNNs** → Processes data **both forward and backward**, capturing more context.

---

### **Challenges of RNNs**
🔹 **Vanishing & Exploding Gradients** – Makes long-range dependencies difficult to learn.  
🔹 **Computationally Expensive** – Due to sequential processing, harder to parallelize than CNNs.  
🔹 **Lots of Hyperparameters** – Requires tuning learning rates, hidden units, sequence lengths.  
🔹 **Data Dependency** – Needs **large labeled datasets** to generalize well.  

---

### **When to Use RNNs?**
✅ **Text Processing** – Language modeling, chatbots, machine translation.  
✅ **Time-Series Analysis** – Stock market predictions, sensor data analysis.  
✅ **Speech & Audio** – Speech recognition, music generation.  

However, **Transformers (BERT, GPT, ViTs) are replacing RNNs** in most NLP tasks (and competing with CNNs in vision) because they parallelize well and scale to large datasets.


## Tuning Neural Networks
Tuning Hyper-parameters of a Neural Network

### Learning Rate

- Neural networks are trained using gradient descent (or similar methods).
- Training involves starting at a random point and sampling different solutions (weights) to minimize a cost function over many epochs.
- The learning rate determines how far apart these samples are.

### Effect of Learning Rate

- **Too high**: May overshoot the optimal solution.
- **Too low**: Training takes too long to converge.
- Learning rate is a **hyperparameter** that requires tuning.
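
The effect of the learning rate can be sketched on a toy cost function `f(w) = w^2`, whose minimum is at `w = 0` and whose gradient is `2w`. The specific rates below are arbitrary values chosen to show the three regimes.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w^2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w^2
        w = w - lr * grad  # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")
```

With `lr=0.1` the weight ends close to the optimum. With `lr=0.001` it has barely moved after 20 steps (too low), and with `lr=1.1` each step overshoots 0 by more than it started with, so the weight grows in magnitude instead of converging (too high).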

### Batch Size

- Defines the number of training samples used for each weight update (one batch) within an epoch.
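
As a rough sketch of how an epoch is split into batches, the generator below shuffles the sample indices and yields them in chunks. The helper name `iterate_minibatches` and the fixed seed are assumptions for illustration; frameworks provide their own data loaders.

```python
import random

def iterate_minibatches(samples, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover every sample once."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(10))  # stand-in for 10 training samples
batches = list(iterate_minibatches(data, batch_size=4))

for b in batches:
    print(b)  # one weight update would be computed per batch
```

With 10 samples and a batch size of 4, one epoch consists of three updates (batches of 4, 4, and 2 samples). Changing the seed changes which samples land together, which is one reason runs with small batches can differ from each other.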

### Effects of Batch Size

- **Smaller batch sizes**:
  - Help escape local minima.
  - Can make training appear inconsistent across runs due to random shuffling.
- **Larger batch sizes**:
  - Can get stuck in a suboptimal solution.

### Recap (Important!)

- **Small batch sizes** help avoid local minima.
- **Large batch sizes** risk converging on the wrong solution.
- **Large learning rates** may overshoot the correct solution.
- **Small learning rates** slow down training.

---

_Source: Sundog Education, DataCumulus (© 2022)_


## Transfer Learning
Also known as Fine-Tuning Pre-Trained Models

Transfer learning is a powerful technique where a pre-trained model is used as a starting point for a new task, saving time and computational resources. A well-known model zoo (model collection) is Hugging Face, a platform that gives users access to numerous open-source machine learning models.

---

#### **Approaches to Transfer Learning**
##### **1️⃣ Fine-Tuning a Pre-Trained Model**
✅ Continue training a **pre-trained model** on new data.  
✅ Ideal when the model has been trained on **far more data** than you have.  
✅ **Use a low learning rate** to **incrementally** improve performance.  
✅ **Freeze lower layers** and add **new trainable layers** to adapt to new tasks.  
✅ The model **learns to repurpose old features** for new predictions.  
✅ Hybrid approach: **First freeze layers → Then fine-tune everything**.
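
The freeze-then-fine-tune idea can be sketched without any framework. The toy below treats a model as a dictionary of named parameters and only updates those marked trainable. All names (`base.w`, `head.w`), values, and gradients are made up for illustration; a real framework would compute the gradients and handle freezing through its own API.

```python
def sgd_step(params, grads, trainable, lr):
    """Apply one SGD update, skipping parameters that are frozen."""
    return {
        name: (value - lr * grads[name]) if trainable[name] else value
        for name, value in params.items()
    }

# Hypothetical model: a pre-trained base layer plus a newly added head.
params = {"base.w": 0.9, "head.w": 0.1}
grads = {"base.w": 0.5, "head.w": -2.0}  # made-up gradients

# Phase 1: freeze the pre-trained base, train only the new head.
trainable = {"base.w": False, "head.w": True}
phase1 = sgd_step(params, grads, trainable, lr=0.01)

# Phase 2: unfreeze everything and fine-tune with a lower learning rate.
trainable = {name: True for name in phase1}
phase2 = sgd_step(phase1, grads, trainable, lr=0.001)

print(phase1)
print(phase2)
```

In phase 1 the base weight is untouched, so the pre-trained features are preserved while the head adapts. In phase 2 the small learning rate nudges all weights without erasing what the base already learned.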

---

##### **2️⃣ Retraining from Scratch**
✅ **Only do this if you have LOTS of training data**.  
✅ Data must be **very different** from the original pre-trained dataset.  
✅ Requires **high computing power** (e.g., TPUs, GPUs).  
✅ Used in cases where pre-trained knowledge **isn’t relevant**.

---

##### **3️⃣ Using a Pre-Trained Model "As-Is"**
✅ If the model was trained on **exactly the data you need**, no extra training is required.  
✅ Example: Using a pre-trained **ResNet** for generic **image classification**.  

---

#### **🚀 Choosing the Right Transfer Learning Strategy**
| **Scenario** | **Best Approach** |
|-------------|------------------|
| Small dataset, similar to pre-trained model | ✅ Fine-tune existing model |
| Small dataset, very different from pre-trained model | ✅ Add new layers, then fine-tune |
| Large dataset, different from pre-trained model | 🔥 Retrain from scratch |
| Model already fits your needs | ✅ Use pre-trained model as-is |

---

### **💡 TL;DR**
- **Fine-tuning**: Best for leveraging pre-trained knowledge while adapting to new tasks.  
- **Retrain from scratch**: Only if you have **tons of data** and **high computational power**.  
- **Use as-is**: If the model already fits your requirements.  

🚀 **Transfer learning saves time, improves performance, and reduces training costs!**  


## Neural Network Regularization Techniques
Regularization is a technique used to prevent overfitting, which occurs when a model performs well on training data but struggles with new, unseen data.

 Overfitting happens when the model learns patterns that exist only in the training set rather than general trends applicable to real-world data. A key indicator of overfitting is high accuracy on the training set but lower accuracy on the test or evaluation set. Regularization helps address this by discouraging the model from relying too heavily on specific details in the training data.

In a neural network, overfitting might be due to too many layers and neurons, so one option is to prune the network and use a simpler model.
 
Another technique is a dropout layer. By dropping out neurons chosen at random at each training step, we force the learning to spread itself out across the network. This prevents any individual neuron from overfitting to a specific data point, and the network can actually train better as a result.
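
A minimal sketch of dropout on a list of activations, using the common "inverted dropout" variant (an assumption here; the notes do not specify a variant). Kept activations are scaled up during training so that nothing needs to change at inference time.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` (training only)."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass everything through
    keep = 1.0 - rate
    # Scale survivors by 1/keep so the expected activation stays the same.
    return [(a / keep) if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for repeatability
layer_output = [1.0] * 10

print(dropout(layer_output, rate=0.5, rng=rng))         # about half are zeroed
print(dropout(layer_output, rate=0.5, training=False))  # unchanged
```

A different random subset is dropped on every call, so no single neuron can be relied on for a particular training example.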

Another technique is early stopping, which stops training once overfitting is detected, i.e. when training accuracy keeps improving but validation accuracy stops improving (or starts to drop).
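
Early stopping is often implemented with a "patience" counter on a validation metric. The sketch below uses validation loss (lower is better) and a patience of 2 epochs; both choices, the function name, and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # validation improved: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1  # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs

# Made-up validation losses: improving until epoch 2, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop = early_stopping_epoch(val_losses)
print("stop at epoch", stop)  # stop at epoch 4
```

In practice the weights from the best epoch (epoch 2 here) are usually the ones kept.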

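- **L1 Regularization**: adds a penalty proportional to the sum of the absolute values of the weights; tends to push some weights to exactly 0.
- **L2 Regularization**: adds a penalty proportional to the sum of the squared weights; tends to keep all weights small.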
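
L1 and L2 regularization both work by adding a weight-size penalty to the training loss. A minimal sketch of the two penalty terms follows; the strength `lam`, the weights, and the base loss are made-up example values.

```python
def l1_penalty(weights, lam):
    """L1: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lam * sum of squared weight values."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.5, 2.0]  # hypothetical model weights
data_loss = 0.30            # hypothetical loss on the training batch
lam = 0.1                   # regularization strength (a hyperparameter)

print("L1 total loss:", data_loss + l1_penalty(weights, lam))
print("L2 total loss:", data_loss + l2_penalty(weights, lam))
```

Because the optimizer minimizes the total, large weights are only kept when they reduce the data loss by more than they add in penalty.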