# Lab06: Introduction to Machine Learning

Overview of the lab:
1. Machine Learning Concepts 
2. Classical Machine Learning 
3. Supervised Machine Learning 
4. Unsupervised Machine Learning

----

## 1. Machine Learning Concepts

### 1.1 The Foundation: AI and Machine Learning 
* **Artificial Intelligence (AI):** The capability of machines to perform tasks requiring human intelligence, such as understanding language, recognizing images, and making decisions. Not all AI is machine learning (e.g., rule-based systems or typical chess engines).

* **Machine Learning (ML):** A branch of AI that enables computers to learn from data and improve performance over time without explicit programming. It identifies patterns to make predictions or decisions on new, unseen information (e.g., a spam filter).

### 1.2 Core Data and Model Concepts 
* **Data:** Information collected, analyzed, and used to make decisions. In ML, this can be numbers, text, images (pixel intensities), or any form of input.

* **Model:** A mathematical representation trained to recognize patterns in data and make predictions. It is often a mapping function from an input to an output (e.g., the line equation in linear regression).

* **Model Fitting / Training / Learning**: The process of adjusting a model's internal parameters to find the best match between its predictions and the actual data.

* **Training Data:** A subset of data used to teach the model, consisting of input examples paired with their correct outputs (labels).

* **Test Data (Test Set)**: A completely separate collection of data used only to evaluate how well the trained model performs on examples it hasn't seen.
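
The separation between training data and test data can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the 20% test fraction, and the toy dataset are all assumptions made for the example.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled examples are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: (input, label) pairs where label = 2 * input.
data = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The model would be fitted on `train` only; `test` stays untouched until the final evaluation.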

### 1.3 Machine Learning Paradigms
* **Supervised Learning**: Models learn from labeled examples, where the correct outcomes or targets are known and provided. This is the most common type of ML in practice (by a rough estimate, around 70% of applications).

* **Unsupervised Learning**: Models find patterns and structure in data without labeled examples or correct answers. It discovers natural groupings or relationships on its own (e.g., customer segmentation).

* **Reinforcement Learning**: An agent learns through interaction and feedback (trial and error), receiving rewards for good decisions and penalties for poor ones. It's powerful for sequential decision-making tasks like game playing.

### 1.4 Features and Data Pre-processing
* **Feature:** A specific piece of information or characteristic used as input for a model (e.g., square footage, number of bedrooms).

* **Target (Dependent Variable / Label):** The value the model is trying to predict based on the features (e.g., the house price or 'spam'/'not spam').

* **Feature Engineering:** The process of creating new, more informative features from existing raw data to improve model performance (e.g., transforming a raw date into 'is holiday').

* **Feature Scaling / Normalization / Standardization:** Transforming numeric features to a similar scale to prevent features with larger ranges from dominating the learning process.

* **Dimensionality:** The number of features in a dataset.
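
Two common ways to do the feature scaling described above are min-max normalization (rescale to the range 0 to 1) and standardization (zero mean, unit standard deviation). The sketch below assumes a single numeric feature with at least two distinct values; the `square_feet` numbers are made up for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to have mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature: house sizes in square feet.
square_feet = [800, 1200, 1500, 2000, 3000]
scaled = min_max_scale(square_feet)
standardized = standardize(square_feet)
print(scaled)
print(standardized)
```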

### 1.5 Model Performance and Trade-offs
* **Model Complexity:** How sophisticated a model is in capturing patterns, determined by the number of its parameters (e.g., number of layers in a neural network).

* **Bias:** The inability of a machine learning method to capture the true underlying relationship in the data. It is the error introduced by approximating a real-world problem (which may be very complex) with a much simpler model, and it measures how far the model's average prediction is from the correct value.
![bias](images/bias.jpg)

* **Variance:** How much a model's predictions would change if it were trained on different subsets of the data. High variance indicates high sensitivity to the training data.

* **Bias-Variance Trade-off:** 

*The tradeoff is that decreasing bias typically increases variance, and decreasing variance typically increases bias.*

bias = incapability to capture the underlying relationship in the data.

```capable (Low Bias) < (-1)------(0)------(1) > incapable (High Bias)```
                
variance = sensitivity to different data; difference in model performance across different data sets

```consistent (Low Variance) < (-1)------(0)------(1) > very different (High Variance)```
                

The **ideal algorithm** must have **low bias** and **low variance**. This is achieved by finding the sweet spot: a model complexity that is flexible enough to capture the true relationship but simple enough to produce consistent predictions across different datasets.

* **Noise:** Random variations or errors in data that don't represent true underlying patterns.

* **Overfitting:** Occurs when a model learns the random noise and fluctuations in the training data instead of the true patterns, resulting in good training performance but poor generalization. The model has high variance.

* **Underfitting:** Occurs when a model is too simple to capture the important patterns, resulting in poor performance on both training and test data. The model has *high bias*.

* Low Complexity Model => High Bias, Low Variance (Underfitting)
* High Complexity Model => Low Bias, High Variance (Overfitting)
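
The two extremes can be illustrated with a toy experiment. The sketch below assumes a made-up relationship `y = 2x` plus random noise, and compares a deliberately too-simple model (always predict the mean) with a model that simply memorizes the training set. Both models and the data are hypothetical and chosen only to make the effect visible.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Noisy samples of the hypothetical relationship y = 2x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(30), make_data(30)

# Underfitting: ignore x entirely and always predict the mean training target.
mean_y = sum(y for _, y in train) / len(train)
constant_model = lambda x: mean_y

# Overfitting: memorize the training set and return the target of the closest training x.
memorizer = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("constant", constant_model), ("memorizer", memorizer)]:
    print(name, "train:", round(mse(model, train), 2), "test:", round(mse(model, test), 2))
```

The constant model has a large error on both sets (high bias). The memorizer has zero training error because it has also memorized the noise, but its error on the unseen test set is no longer zero.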

### 1.6 Training and Optimization Techniques
* **Validation:** Evaluating a model's performance on a held-out portion of the training data (the validation set) during the development process.

* **Cross Validation:** Extends validation by repeatedly training and validating the model on different, multiple splits of the data for a more robust performance estimate.

* **Batch:** A subset of training data processed together in a single step of training. The batch size is an important hyperparameter.

* **Iteration:** A single pass through one batch of data, leading to an update of the model's parameters.

* **Epoch:** A complete pass through the entire training data set. 

* **Parameter (Weight / Model Parameter):** A value that the model learns from the data during training (e.g., the slope and intercept in linear regression). Finding these is the goal of training.

* **Hyperparameter:** A configuration setting set before training begins to control the learning process (e.g., learning rate, batch size, number of epochs).

* **Learning Rate:** A crucial hyperparameter that determines how much the model adjusts its parameters in response to errors.

* **Cost Function (Loss/Objective Function):** A measure of how wrong a model's predictions are compared to the true values. The goal of training is to minimize this function.
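
Several of these terms (batch, iteration, epoch, parameter, hyperparameter, learning rate, cost function) fit together in one training loop. The following is a minimal sketch of mini-batch gradient descent for a one-feature linear model, assuming mean squared error as the cost function. The data and the hyperparameter values are illustrative choices, not recommendations.

```python
import random

def train_linear(data, learning_rate=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # parameters: learned from the data
    data = list(data)
    for _ in range(epochs):              # one epoch = one full pass over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the batch cost (MSE) with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # One iteration = one parameter update computed from one batch.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
points = [(i / 10, 3 * (i / 10) + 1) for i in range(20)]
w, b = train_linear(points)
print(round(w, 3), round(b, 3))
```

Here `learning_rate`, `batch_size`, and `epochs` are hyperparameters set before training, while `w` and `b` are the parameters the loop learns.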

## 2. Machine Learning Algorithms

![ml](images/ml.jpg)

### 2.1 Supervised Learning: Regression vs Classification

| Category       | Description                                                     | Example                                                                                      |
| ---            | ---                                                             | ---                                                                                          |
| Regression     | Predicting a continuous numeric target variable.                | Predicting the price of a house or a person's weight.                                        |
| Classification | Assigning a discrete categorical label (class) to a data point. | Classifying an email as "spam" or "not spam," or classifying an image as a "cat" or a "dog." |

### 2.2 Linear and Logistic Regression

* **Linear Regression**:

   * Goal: Determine a linear relationship between input and output variables. Predicts numerical values.

   * Mechanics: Fits a straight line to the data by minimizing the sum of the squared vertical distances (residuals) between the data points and the regression line, thus minimizing prediction errors.
![linear_reg](images/linear_regression.jpg)

* **Logistic Regression:**

   * Goal: Despite its name, logistic regression is a classification algorithm, and one of the most basic ones. It predicts the probability that a data point belongs to a particular class (e.g., 'yes' or 'no', '1' or '0').

   * Mechanics: Fits a sigmoid function (S-curve) to the data, which conveniently outputs the probability of a data point belonging to a certain class (e.g., the likelihood of an email being spam).

   * $S(x) = \frac{1}{1 + e^{-x}}$

![logistic_reg](images/logistic_reg.jpg)
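
To make the sigmoid concrete, the sketch below turns a linear score into a probability and then into a class label. The weight `w`, the bias `b`, and the 0.5 decision threshold are assumptions picked for illustration. In practice the parameters would be learned from data.

```python
import math

def sigmoid(z):
    """S(z) = 1 / (1 + e^(-z)), squashing any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(w * x + b)

# Hypothetical parameters; a real model would learn these during training.
w, b = 1.5, -3.0
for x in [0, 2, 4]:
    p = predict_proba(x, w, b)
    label = 1 if p >= 0.5 else 0
    print(x, round(p, 3), label)
```

With these parameters the decision boundary sits at `x = 2`, where the score is 0 and the probability is exactly 0.5.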


### 2.3 K-Nearest Neighbors (KNN)
* Goal: Used for both regression and classification. It is non-parametric (no equation or parameters are explicitly fitted).

* Mechanics: For a new data point, the target is predicted to be the average (regression) or the majority class (classification) of its K closest neighbors in the training data.

* Note: The value of K is a hyperparameter and must be tuned (e.g., by using cross-validation) to avoid overfitting or underfitting.

![knn](images/knn.jpg)
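
A minimal sketch of KNN in plain Python, assuming Euclidean distance and a small made-up dataset. The function names and the choice `k=3` are illustrative.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training items whose points are closest to the query."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority class among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Average target value among the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)

# Hypothetical labeled points: (features, label).
animals = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
           ((5, 5), "dog"), ((5, 6), "dog"), ((6, 5), "dog")]
print(knn_classify(animals, (1.5, 1.5)))
print(knn_classify(animals, (5.5, 5.0)))
```

Note that there is no training step: the "model" is the stored training data itself, and all the work happens at prediction time.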

### 2.4 Support Vector Machine (SVM)
* Goal: Primarily for classification, but can also be used for regression.

* Mechanics: Draws a decision boundary (or hyperplane in higher dimensions) that separates the classes with the largest margin possible. The data points sitting on the edge of this margin are called support vectors.

![svm1](images/svm1.jpg)
![svm2](images/svm2.jpg)
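
The sketch below does not train an SVM. It only illustrates what a trained linear SVM provides, assuming a hyperplane `w · x + b = 0` that has already been found: points are classified by the sign of the score, the margin width is `2 / ||w||`, and the support vectors are the points whose score is exactly +1 or -1. The values of `w`, `b`, and the points are hypothetical.

```python
import math

def decision_score(x, w, b):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    return 1 if decision_score(x, w, b) >= 0 else -1

# Hypothetical hyperplane, as if produced by a trained linear SVM.
w, b = (1.0, 1.0), -3.0
margin_width = 2 / math.hypot(*w)

points = [(0, 0), (1, 1), (2, 2), (3, 3)]
support_vectors = [p for p in points
                   if math.isclose(abs(decision_score(p, w, b)), 1.0)]

for p in points:
    print(p, classify(p, w, b))
print("margin width:", round(margin_width, 3))
print("support vectors:", support_vectors)
```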

### 2.5 Naive Bayes Classifier
* Goal: A classification algorithm often used for text-based tasks like spam filtering.

* Mechanics: Uses Bayes' theorem to calculate the probability of a class (e.g., spam) from the combined probabilities of all the words a message contains. It is called "naive" because it assumes that the words appear independently of each other, which is rarely true in real text.

$P(A \mid B)=\frac{P(B \mid A) \cdot P(A)}{P(B)}$

$P(\text{spam} \mid \text{FMI}, \text{AI}) = ?$
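
One way to answer the question above is to plug numbers into Bayes' theorem. All probabilities in the sketch below are invented for illustration. In a real spam filter they would be estimated by counting words in labeled emails.

```python
import math

# Hypothetical probabilities (not taken from any real dataset).
prior = {"spam": 0.4, "not spam": 0.6}
word_given_class = {
    "spam":     {"FMI": 0.1, "AI": 0.6},
    "not spam": {"FMI": 0.5, "AI": 0.2},
}

def posterior(words):
    """P(class | words) under the naive assumption that words are independent."""
    scores = {
        cls: prior[cls] * math.prod(word_given_class[cls][w] for w in words)
        for cls in prior
    }
    total = sum(scores.values())   # P(words), the denominator in Bayes' theorem
    return {cls: score / total for cls, score in scores.items()}

result = posterior(["FMI", "AI"])
print(round(result["spam"], 3))   # 0.286
```

With these made-up numbers, the spam score is 0.4 · 0.1 · 0.6 = 0.024 and the not-spam score is 0.6 · 0.5 · 0.2 = 0.06, so the message is classified as not spam.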

### 2.6 Decision Trees

* Goal: Used for both classification and regression.

* Mechanics: Creates a series of yes/no questions to partition the dataset. The goal is to create leaf nodes (the final outcomes) that are as "pure" as possible (meaning minimal misclassification).

![tree](images/decision_tree.jpg)
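
"Purity" needs a numeric measure so that candidate questions can be compared. The sketch below assumes Gini impurity, one common choice, and searches for the best threshold question on a single numeric feature. The data and function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for more mixed groups."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows):
    """rows: list of (feature_value, label).
    Return (threshold, weighted_gini) for the best 'value <= threshold?' question."""
    best = None
    values = sorted({v for v, _ in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2            # candidate: midpoint between neighbors
        left = [label for v, label in rows if v <= threshold]
        right = [label for v, label in rows if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: (hours studied, passed the exam?).
rows = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
print(best_split(rows))
```

A full decision tree would apply `best_split` recursively to each resulting group until the leaves are pure enough or a depth limit is reached.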

### 2.7 Ensemble Methods

Combine many simple models (often decision trees) into a single, more powerful model.
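
The simplest way to combine models is a majority vote. The sketch below assumes three hand-written one-rule classifiers (decision "stumps") for a spam task. The rules, thresholds, and feature names are hypothetical, and real ensemble methods would train their simple models on data rather than hard-code them.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three hypothetical weak classifiers, each looking at a single feature.
stumps = [
    lambda email: "spam" if email["num_links"] > 3 else "not spam",
    lambda email: "spam" if email["has_offer"] else "not spam",
    lambda email: "spam" if email["num_caps_words"] > 10 else "not spam",
]

email = {"num_links": 5, "has_offer": True, "num_caps_words": 2}
print(majority_vote(stumps, email))   # spam (two of the three stumps vote spam)
```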

### 2.8 Deep Learning

Artificial Neural Networks (ANN) / Deep Learning:

* Goal: The current reigning king of AI, used for complex classification and regression (e.g., image recognition).

* Mechanics: Adds hidden layers of learned units (neurons) between the input features and the output variable. These layers automatically learn complex intermediate features from the raw data (e.g., identifying a "horizontal line" or a "face" in a picture without being explicitly programmed to do so).
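
The effect of a hidden layer can be seen in a tiny forward pass. The network below has two inputs, one hidden layer with two ReLU units, and one sigmoid output. Its weights are hand-picked for illustration (a real network would learn them during training) so that it computes XOR, a function that no single straight-line decision boundary can represent.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: each unit outputs activation(weighted sum of inputs + bias)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked (not learned) weights: hidden units compute relu(x1 + x2) and relu(x1 + x2 - 1).
W1, B1 = [[1, 1], [1, 1]], [0, -1]
W2, B2 = [[10, -20]], [-5]

def forward(x):
    hidden = dense(x, W1, B1, relu)           # intermediate features
    return dense(hidden, W2, B2, sigmoid)[0]  # probability of class 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(x), 3))
```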

### 2.9 Unsupervised Learning: K-Means
Goal: To find unknown clusters or underlying groupings in unlabeled data. This is conceptually different from classification, where the groups are already known.

**K-Means Clustering:**

* Goal: The most famous clustering algorithm. K is the hyperparameter representing the number of clusters to find.

* Mechanics: It starts by selecting K random cluster centers. It then repeats two steps until the centers stabilize: assign every data point to its closest center, and recalculate each center as the mean of the points assigned to it.
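
The procedure can be sketched in plain Python as follows. This version assumes Euclidean distance, picks the initial centers from the data points themselves, and uses a small made-up dataset with two obvious groups. The function name and defaults are illustrative.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Return (centers, clusters) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers: k random data points
    for _ in range(max_iters):
        # Assignment step: put every point in the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]       # keep the old center if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:           # centers have stabilized
            break
        centers = new_centers
    return centers, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

Because the starting centers are random, K-Means can converge to different results on harder datasets, which is why it is often run several times with different seeds.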

Pictures taken from: 
* https://www.youtube.com/watch?v=EuBBz3bI-aA&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=6 - Stats Quest
* https://www.youtube.com/watch?v=E0Hmnixke2g&t=182s - Infinite Codes