# ----  Supervised Learning  ----
 
1. **Supervised Learning Fundamentals**  
   - Definition: _Supervised learning algorithms_ are trained on **labeled examples**, i.e., for each input the desired output is known (input-output pairs).  
   - Example: Email spam detection (spam vs. legitimate) or movie review sentiment (positive vs. negative).  
   - Key Idea: Historical labeled data is used to train a **model (neural network or other ML algorithm)** so that it can predict outputs for future unlabeled data.  
   

2. **How it works: Neural Networks & Model Training**  
   - Process:  
     - The network receives **_input data + correct outputs_** → the model compares its predictions with the actual correct outputs (labels).  
     - Based on this comparison, the model **_adjusts_** its weights/biases to minimize the error (through a process such as backpropagation with gradient descent).  
   - Key Terms: *Weights*, *bias values* (we'll discuss in neural network theory).  
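
The compare-and-adjust loop described above can be sketched with a single weight and bias. This is a minimal illustration, not a neural network: it assumes a linear model `y = w*x + b`, mean squared error as the error measure, and plain gradient descent. The toy data, `train_linear`, `learning_rate`, and `epochs` are all made up for this example.

```python
# Toy labeled data generated from y = 2x + 1 (an assumption for this sketch).
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [1.0, 3.0, 5.0, 7.0, 9.0]

def train_linear(inputs, labels, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # weight and bias start untrained
    n = len(inputs)
    for _ in range(epochs):
        # Forward pass: predictions with the current weight and bias.
        preds = [w * x + b for x in inputs]
        # Compare predictions with the correct outputs (labels).
        errors = [p - y for p, y in zip(preds, labels)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the weight and bias in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = train_linear(inputs, labels)
print(f"learned w={w:.3f}, b={b:.3f}")
```

In a real network the same idea applies to many weights and biases at once, with backpropagation supplying the gradients.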

### Supervised Learning Pipeline
Supervised learning is often used in applications where patterns in past data help forecast likely future events.


# 📊 Data Science Notes  

---

## **📌 Data Pipeline for Supervised Learning**

1. **Data Acquisition**  
   - Decide where the data will come from.  
   - **Sources:** Customer data, online databases, physical sensors, etc.

2. **Data Cleaning & Formatting**  
   - Clean and format data so that the model (e.g., Neural Network) can process it.  
   - **Tools:** Libraries like Pandas for preprocessing.

3. **Train-Test Split**  
   - Typical ratio: **70% training / 30% testing**.

4. **Model Training**  
   - Use the **training set** to fit the model/network's parameters.

5. **Model Testing**  
   - Evaluate performance using the **test set**.  
   - Compare **model output/predictions** with **actual labels**.  
   - If performance is unsatisfactory, go back to **step 4** and adjust hyperparameters (e.g., add more NN layers).

6. **Deployment**  
   - If the model performs well, deploy it to production or a real-world application.
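
Steps 1 to 5 of the pipeline can be sketched end to end using only the Python standard library. Everything here is an assumption made for illustration: the synthetic rows stand in for acquired and cleaned data, and `train_test_split`, `fit_threshold`, and `predict` are hypothetical helpers (a one-parameter threshold model, not a neural network).

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Step 4: 'train' a one-parameter model, a threshold midway between class means."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    """Predict class 1 for features above the learned threshold, else class 0."""
    return 1 if x > threshold else 0

# Steps 1-2: synthetic, already-clean (feature, label) rows stand in for real data.
rows = [(float(i), 0) for i in range(10)] + [(float(i), 1) for i in range(100, 110)]

# Step 3: 70% training / 30% testing.
train, test = train_test_split(rows, test_fraction=0.3)

# Step 4: fit on the training set only.
threshold = fit_threshold(train)

# Step 5: compare predictions with the actual labels of the test set.
correct = sum(predict(threshold, x) == label for x, label in test)
test_accuracy = correct / len(test)
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

In practice, libraries such as Pandas and scikit-learn provide tested versions of the loading and splitting steps; the hand-written helpers here only keep the sketch self-contained.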

---

## ⚠️ Important Note  

The pipeline above is a **simplified version** — it has a flaw.

**A single train-test split isn't ideal** for evaluating our model's performance because:
- After building the model on the training data, we evaluate it on the test data and get a **performance metric**.
- If we then tune the model based on that test-set performance, the test set is no longer "unseen".
- This leads to biased (over-optimistic) performance metrics.

---

It’s not fair to use the accuracy from the test data as the final performance metric for our model.  
After all, if we **_repeatedly use the test data_** to evaluate and adjust the model's hyperparameters, it stops being an unbiased, unseen dataset.   

**To address this issue, we split the dataset into three sets:**  
- **Training Set**  
- **Validation Set**  
- **Test Set**

This 3-way split is a standard approach in machine learning and deep learning tasks to ensure reliable and fair model evaluation.

  
## ✅ **Model Evaluation**: A Better Approach (3-Way Data Split)  

   - **Problem:** Using the same test set repeatedly to tweak models → "cheating" (data leakage).  
   - **Solution:** Introduce a **validation set** (3-way split: train/validate/test).  

**Split into:**
- **Training Set:** Used to fit/train the model parameters. The model sees the features and the correct outputs and is fit on this training data.
- **Validation Set:**  
  - It's a kind of test data that is used during development.  
  - Used to tune hyperparameters (e.g., number of NN layers, learning rate, number of neurons, or other changes to the NN architecture).  
  - We check the model's performance on this set using a **_performance metric_** and adjust the hyperparameters.  
  - We repeat this process until the model's performance on the **_validation data_** reaches a satisfactory level.  
- **Test Set:**  
  - Final, unseen data used to estimate the model's real-world performance.  
  - Gives the final performance metric.  
  - No tweaking allowed (no updating weights, parameters, or hyperparameters) after test-set evaluation.  

**Why?**  
Tuning on the validation set keeps the test set out of model development, which prevents **data leakage** from the test set and lets the final test score reflect real-world performance.
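
The train/validate/test workflow can be sketched as follows. This is an illustrative example under stated assumptions: the 70/15/15 ratio is not prescribed by these notes, the data is synthetic, and `three_way_split`, `knn_predict`, and `accuracy` are hypothetical helpers. A tiny k-nearest-neighbours classifier is swapped in for a neural network so that there is one obvious hyperparameter (`k`) to tune.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of `rows` and split it into (train, validation, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, rows, k):
    """Fraction of `rows` whose label the k-NN model predicts correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in rows)
    return correct / len(rows)

# Synthetic noisy data: the label tends to be 1 when the feature exceeds 5.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0, 10)
    rows.append((x, 1 if x + rng.gauss(0, 1.5) > 5 else 0))

train, val, test = three_way_split(rows)

# Tune the hyperparameter k using the validation set only.
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))

# Touch the test set exactly once, after all tuning is finished.
test_accuracy = accuracy(train, test, best_k)
print(f"best k={best_k}, test accuracy={test_accuracy:.2f}")
```

The key design point is that `test` never appears in the tuning loop, so the final score is not influenced by the choice of `k`.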

---

## **Model Evaluation Summary**

- **Problem:** Repeatedly tweaking the model based on the test set → cheating.
- **Solution:** Use a **validation set** during model development.

| Set Type        | Purpose                                 |
|:----------------|:-----------------------------------------|
| **Training Set**  | Fit model parameters.                    |
| **Validation Set**| Tune hyperparameters, adjust architecture.|
| **Test Set**      | Final unbiased performance evaluation.   |

**Critical Point:**  
- No further tweaks after evaluating on the **test set**.  
- This final score reflects the model's true generalization ability.

---

## 📌 Course Simplification  
- The course uses **train-test split only** for exercises.
- In real projects, **train-validate-test** is essential.

---

## **Performance Metrics**
- **For Regression:** Root Mean Squared Error (RMSE)
- **For Classification:** Accuracy, Precision, Recall, F1 Score, etc.
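
For reference, the metrics listed above can be computed by hand as in the sketch below. The function names `rmse` and `classification_metrics` and the example values are made up for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Regression example: errors are 0, 0, -3, 4, so RMSE = sqrt(25 / 4) = 2.5.
regression_error = rmse([10, 20, 30, 40], [10, 20, 33, 36])

# Classification example: 3 true positives, 1 false positive, 1 false negative.
metrics = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                                 [1, 0, 0, 1, 0, 1, 1, 0])
print(regression_error, metrics)
```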

---

## **Hyperparameters vs. Parameters**

| Type             | Description                                | Example                 |
|:----------------|:--------------------------------------------|:------------------------|
| **Parameters**    | Learned by the model during training        | Weights, Biases          |
| **Hyperparameters**| Set before training (manual or automated) | Learning rate, layers, batch size |
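
The distinction in the table can be made concrete with a small sketch. It assumes a one-weight model `y = w*x` fitted with an L2 penalty; the penalty strength `l2_strength` is used here as the hyperparameter (it is not one of the examples in the table, but it plays the same role as learning rate or batch size), and `fit_weight` is a hypothetical helper.

```python
def fit_weight(xs, ys, l2_strength):
    """Learn the parameter w for y = w*x, minimizing squared error + l2_strength * w**2.

    `l2_strength` is a hyperparameter: it is chosen before training.
    The returned `w` is a parameter: it is computed from the data.
    """
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + l2_strength)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Same data, different hyperparameter values, different learned parameters.
w_no_penalty = fit_weight(xs, ys, l2_strength=0.0)       # 28 / 14 = 2.0
w_strong_penalty = fit_weight(xs, ys, l2_strength=14.0)  # 28 / 28 = 1.0
print(w_no_penalty, w_strong_penalty)
```

The training code never changes `l2_strength`; choosing a good value is the job of the tuning loop that runs on the validation set.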

---
