## How to Move from Data to (Machine Learning) Models to Decisions?

No prior knowledge needed: Learn how to move from data to (machine learning) models to decisions. Learn what terms such as (machine learning) models, supervised learning, unsupervised learning, regression, classification, clustering and more mean.

``` directory
Artificial Intelligence (AI)
└── Machine Learning (ML)
    ├── Supervised Learning
    ├── Unsupervised Learning
    ├── Semi-Supervised Learning
    ├── Reinforcement Learning
    ├── Deep Learning
    └── Generative AI
```

- Artificial Intelligence (AI): Building machines that perform tasks that typically require human intelligence, or that replicate human behavior.
- Machine Learning (ML): A subset of AI where machines learn relationships hidden in data - to model real systems.
- Deep Learning: A branch of ML that uses neural networks with many layers and neurons - to model real systems.
- Generative AI: Models, often deep neural networks with hundreds of billions of parameters, that create new content such as text, images, or music by learning patterns from existing data.

### Supervised: regression
<img src="images/01_regression_income_2.PNG" alt="Regression Income" width="1200"/>

- `age` alone is unlikely to provide an accurate prediction of a particular man’s `wage`.
- The number of features can be quite large, on the order of thousands or even millions.
- Regression: predicting a continuous output value.

### Supervised: classification

<img src="images/02_classification_churn.PNG" alt="Classification Churn" width="1200"/>

### Unsupervised: clustering

![Clustering Gene](images/03_clustering_gene.PNG)
- $Z_1$ and $Z_2$ are the first two principal components of the data, which summarize the data down from 64 features to two numbers or dimensions.
- Dimension reduction has likely resulted in some loss of information, but it is now possible to visually examine the data for evidence of clustering.
- Cell lines with same cancer type tend to be located near each other in this two-dimensional representation; even though the cancer information was not used to produce the left-hand panel, the clustering obtained does bear some resemblance to some of the actual cancer types observed in the right-hand panel.
- Clustering: grouping according to observed characteristics (no labels).

### Supervised learning: regression - in depth

#### Observed data
| $(i)$ | Education $({x}_i)$ | Income $({y}_i)$ |
|-------|---------------------|------------------|
| 1 | 10          | 26.6588387834389 |
| 2 | 10.40133779 | 27.3064353457772 |
| 3 | 10.84280936 | 22.1324101716143 |
| 4 | 11.24414716 | 21.1698405046065 |
| 5 | 11.64548495 | 15.1926335164307 |
| 6 | 12.08695652 | 26.3989510407284 |
| 7 | 12.48829431 | 17.435306578572  |
| 8 | 12.88963211 | 25.5078852305278 |
| 9 | 13.2909699  | 36.884594694235  |
| 10| 13.73244147 | 39.666108747637  |
| 11| 14.13377926 | 34.3962805641312 |
| 12| 14.53511706 | 41.4979935356871 |
| 13| 14.97658863 | 44.9815748660704 |
| 14| 15.37792642 | 47.039595257834  |
| 15| 15.77926421 | 48.2525782901863 |
| 16| 16.22073579 | 57.0342513373801 |
| 17| 16.62207358 | 51.4909192102538 |
| 18| 17.02341137 | 61.3366205527288 |
| 19| 17.46488294 | 57.581988179306  |
| 20| 17.86622074 | 68.5537140185881 |
| 21| 18.26755853 | 64.310925303692  |
| 22| 18.7090301  | 68.9590086393083 |
| 23| 19.11036789 | 74.6146392793647 |
| 24| 19.51170569 | 71.8671953042483 |
| 25| 19.91304348 | 76.098135379724  |
| 26| 20.35451505 | 75.77521802986   |
| 27| 20.75585284 | 72.4860553152424 |
| 28| 21.15719064 | 77.3550205741877 |
| 29| 21.59866221 | 72.1187904524136 |
| 30| 22          | 80.2605705009016 |

<br>
<img src="images/04_regression_education_only_data_points.PNG" alt="Regression Education Only Data" width="600"/>

Where:
- $ x_i $ is the observed value (actual value) for the $ i $-th observation.
- $ x_i $ is the independent variable (input / feature / predictor) for the $ i $-th observation.
- $ y_i $ is the observed value (actual value) for the $ i $-th observation.
- $ y_i $ is the dependent variable (output / label / response) for the $ i $-th observation.



### 1. **Simple Linear Regression Model**

<img src="images/04_regression_education_with_two_models.PNG" alt="Regression Education With Models" width="600"/>

Problem:
- The true relationship between $ x_i $ and $ y_i $ across all observations is typically complex and influenced by numerous factors. Therefore, an exact model that describes the relationship between $\mathbf{x}$ and $\mathbf{y}$ is typically not attainable.

Solution:
- Try to find a model that approximates $ y $.
- The model $$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i $$ provides a simplified approximation: $ \hat{y} \approx y $.

Where:
- $ \hat{\beta}_0 $ is the estimated intercept (the estimated value of $ \hat{y} $ when $ x = 0 $).
- $ \hat{\beta}_1 $ is the estimated slope (the estimated change in $ \hat{y} $ for a one-unit change in $ x $).
- $ \hat{\mathbf{\beta}} $ is the vector that represents the estimated coefficients / parameters / effects.
- $ \hat{y}_i $ is the predicted value of $ y $ for the $ i $-th observation, calculated using the estimated model.
- $ \hat{y}_i $ represents the part of $ y_i $ that is explained by the linear relationship with $ x_i $ according to the estimated model.
- Estimated model equation represents the relationship between $ x_i $ and $ \hat{y}_i $ for the $ i $-th observation, based on the estimated coefficients.
- Estimated model equation describes how the dependent variable ( $ \hat{y} $ ) depends on one or more independent variables ( $ x $ ) based on the data.
- Estimated model is the equation: $ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i $.
- Estimated model expresses that each predicted $ \hat{y}_i $ is composed of the estimated systematic part ($ \hat{\beta}_0 + \hat{\beta}_1 x_i $), which represents the linear relationship with $ x_i $ as derived from the observed data.





### 2. **Mean Squared Error (MSE)**

<img src="images/05_regression_education_residuals.PNG" alt="Regression Education With Residuals" width="600"/>

The Mean Squared Error (MSE) is a measure of how well the model's predictions $ \hat{y}_i $ match the actual observations $ y_i $. It is defined as:

$$
\text{MSE}(\hat{\beta}_0, \hat{\beta}_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Where the predicted value $ \hat{y}_i $ is given by:

$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
$$

Substituting $ \hat{y}_i $ into the MSE formula:

$$
\text{MSE}(\hat{\beta}_0, \hat{\beta}_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2
$$
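
As a minimal sketch, the MSE formula above can be written in a few lines of Python with NumPy. The function name `mse` and the three toy data points are assumptions chosen for easy hand-checking; they are not the observations from the table.

```python
import numpy as np

def mse(beta0, beta1, x, y):
    """Mean squared error of the line beta0 + beta1 * x against the observations y."""
    y_hat = beta0 + beta1 * x          # predicted values
    return np.mean((y - y_hat) ** 2)   # average of the squared residuals

# Hypothetical toy data (not the education/income table).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

# Candidate line y_hat = 0 + 2x: residuals are 0, 0, 1, so the MSE is 1/3.
print(mse(0.0, 2.0, x, y))
# Candidate line y_hat = 1 + 2x: residuals are -1, -1, 0, so the MSE is 2/3.
print(mse(1.0, 2.0, x, y))
```

Comparing the two candidate lines shows how the MSE ranks models: the line with the smaller value fits these points better.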

### 3. **Gradient Descent**

To minimize the MSE, we use the Gradient Descent algorithm. This involves updating the parameters $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ iteratively in the direction of the negative gradient of the MSE.

#### 3.1. **Compute the Gradients**

The partial derivatives of the MSE with respect to $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ are:

1. **Gradient with respect to $ \hat{\beta}_0 $** (Intercept):

   $$
   \frac{\partial \text{MSE}}{\partial \hat{\beta}_0} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)
   $$

2. **Gradient with respect to $ \hat{\beta}_1 $** (Slope):

   $$
   \frac{\partial \text{MSE}}{\partial \hat{\beta}_1} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) x_i
   $$

#### 3.2. **Update Rules**

Using the gradients, we update $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ using the following rules:

$$
\hat{\beta}_0^{(k+1)} = \hat{\beta}_0^{(k)} - \alpha \frac{\partial \text{MSE}}{\partial \hat{\beta}_0}
$$

$$
\hat{\beta}_1^{(k+1)} = \hat{\beta}_1^{(k)} - \alpha \frac{\partial \text{MSE}}{\partial \hat{\beta}_1}
$$

Where:
- $ \hat{\beta}_0^{(k)} $ and $ \hat{\beta}_1^{(k)} $ are the estimates of the parameters at the $ k $-th iteration.
- $ \alpha $ is the learning rate, which controls the step size of each update.
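
The gradients and update rules above can be sketched in Python as follows. The data are the education/income observations from the table, rounded to two decimals. The function name `gradient_descent`, the starting values of zero, the learning rate and the iteration count are assumptions for this sketch. Because the feature is not rescaled, a small learning rate and many iterations are needed.

```python
import numpy as np

# Education (x) and income (y) from the table above, rounded to two decimals.
x = np.array([10.00, 10.40, 10.84, 11.24, 11.65, 12.09, 12.49, 12.89, 13.29, 13.73,
              14.13, 14.54, 14.98, 15.38, 15.78, 16.22, 16.62, 17.02, 17.46, 17.87,
              18.27, 18.71, 19.11, 19.51, 19.91, 20.35, 20.76, 21.16, 21.60, 22.00])
y = np.array([26.66, 27.31, 22.13, 21.17, 15.19, 26.40, 17.44, 25.51, 36.88, 39.67,
              34.40, 41.50, 44.98, 47.04, 48.25, 57.03, 51.49, 61.34, 57.58, 68.55,
              64.31, 68.96, 74.61, 71.87, 76.10, 75.78, 72.49, 77.36, 72.12, 80.26])

def gradient_descent(x, y, alpha=0.002, n_iter=100_000):
    """Minimize the MSE of y_hat = beta0 + beta1 * x by gradient descent."""
    beta0, beta1 = 0.0, 0.0                      # assumed starting values
    for _ in range(n_iter):
        residuals = y - (beta0 + beta1 * x)
        grad0 = -2.0 * residuals.mean()          # dMSE / dbeta0
        grad1 = -2.0 * (residuals * x).mean()    # dMSE / dbeta1
        # Both gradients are computed before either parameter is updated.
        beta0 -= alpha * grad0
        beta1 -= alpha * grad1
    return beta0, beta1

beta0, beta1 = gradient_descent(x, y)
print(f"y_hat = {beta0:.1f} + {beta1:.1f} x")
```

A learning rate that is too large makes the updates diverge on this data, while a much smaller one would need even more iterations; rescaling the feature is a common way to ease this trade-off.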

### 4. **Convergence and Final Model**

With a suitable learning rate and enough iterations, the algorithm converges to the final estimates $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $, which minimize the MSE and represent the best-fit line for the data.

The final model can then be used for making predictions:

$$
\hat{y} = -39.5 + 5.6 x
$$
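
Making a prediction with this final model is a single arithmetic step. A small sketch, where the function name `predict_income` and the input value 16 are assumptions for illustration:

```python
def predict_income(education):
    """Predicted income for a given education value, using the fitted line above."""
    return -39.5 + 5.6 * education

# -39.5 + 5.6 * 16 = 50.1
print(f"{predict_income(16.0):.1f}")  # 50.1
```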

<img src="images/06_regression_education_residuals_mse_surface.PNG" alt="MSE Surface" width="800"/>
<br>
<br>
<img src="images/06_1_convex_and_non_convex_surface.PNG" alt="MSE Surface"/>
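Left: a convex function, where gradient descent reaches the single global minimum. Right: a non-convex function, where gradient descent can get stuck in a local minimum.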
<br>
<br>
<img src="images/06_2_non_convex_surface.PNG" alt="MSE Surface" width="1200"/>


### 5. **Beyond Simple Linear Regression Models**

<img src="images/07_regression_education_seniority_eg_multiple.PNG" alt="2 Independent Variables" width="800"/> <br>

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} $$


<img src="images/08_regression_education_seniority_eg_polynomial.PNG" alt="2 Independent Variables" width="800"/>


$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{1i}^2 + \hat{\beta}_4 x_{2i}^2 + \hat{\beta}_5 x_{1i}x_{2i} $$
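
The polynomial model is still linear in its coefficients, so it can be fitted like a linear regression once the extra columns ($ x_1^2 $, $ x_2^2 $, $ x_1 x_2 $) are added to the inputs. The sketch below illustrates this with NumPy. Note the assumptions: the data are synthetic (standardized, noise-free, generated from made-up coefficients `true_beta`), and instead of gradient descent it uses `np.linalg.lstsq`, which solves the same MSE minimization directly.

```python
import numpy as np

def design_matrix(x1, x2):
    """Columns: 1, x1, x2, x1^2, x2^2, x1*x2 (one column per coefficient)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical standardized features (not the education/seniority data).
rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=50)
x2 = rng.uniform(-1.0, 1.0, size=50)

# Made-up coefficients beta_0 ... beta_5, used only to generate noise-free targets.
true_beta = np.array([1.0, 2.0, -1.0, 0.5, 0.25, -0.75])
X = design_matrix(x1, x2)
y = X @ true_beta

# Least-squares fit: finds the coefficients that minimize the MSE.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 3))
```

Because the targets contain no noise, the fit recovers the made-up coefficients; with real data the estimates would only approximate the underlying relationship.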

- Linear Regression
- Polynomial Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
- Neural Network Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Regression
- K-Nearest Neighbors Regression
- Combining Regression models
- ...

### Decision Tree (Regression) & Random Forest (Regression)
<img src="images/08_01_regression_decision_tree.PNG" alt="Regression Decision Tree" width="800"/>

### Supervised learning: classification - in depth

#### Observed data
| Default | Balance  |
|---------|----------|
| No      | 729.526  |
| No      | 817.180  |
| No      | 1073.549 |
| Yes     | 1838.871 |
| Yes     | 1893.023 |
| Yes     | 1605.214 |
| ...     | ...      |

<br>
<img src="images/09_classification.PNG" alt="Classification"/>
<br>
<br>
<br>
<img src="images/10_linear_regression_and_logistic_regression.PNG" alt="Linear Regression and Logistic Regression"/>

Linear regression on the left, logistic regression on the right.

For a classification problem with one independent variable, we typically use **logistic regression** rather than linear regression. Logistic regression is used to model the probability that a given observation belongs to a particular class (usually binary, such as 0 or 1).

### 1. **Logistic Regression Model**

In logistic regression, the relationship between the dependent variable $ y $ (which is binary) and the independent variable $ x $ is modeled using the logistic function (also called the sigmoid function):

$$
\hat{p}(x) = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 x)}}
$$

Where:
- $ \hat{p}(x) $ is the estimated probability that the dependent variable $ y = 1 $ given the independent variable $ x $.
- $ \hat{\beta}_0 $ is the estimated intercept.
- $ \hat{\beta}_1 $ is the estimated coefficient (slope) for the independent variable $ x $.

The estimated probability $ \hat{p}(x) $ is then used to classify the observation:
- If $ \hat{p}(x) > 0.5 $, predict $ y = 1 $.
- If $ \hat{p}(x) \leq 0.5 $, predict $ y = 0 $.


### 2. **Log-Loss (Binary Cross-Entropy Loss)**
#### 2.1. **Likelihood Function**

In maximum likelihood estimation (MLE), we aim to find the parameters $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ that maximize the likelihood of observing the given data.

The likelihood for a single observation $ (x_i, y_i) $ is:

$$
\text{Likelihood} = \hat{p}(y_i | x_i) = \hat{p}(x_i)^{y_i} \cdot (1 - \hat{p}(x_i))^{1 - y_i}
$$

This likelihood function works for both cases:
- If $ y_i = 1 $, then the likelihood is $ \hat{p}(x_i) $.
- If $ y_i = 0 $, then the likelihood is $ 1 - \hat{p}(x_i) $.

#### 2.2. **Log-Likelihood Function**

To make the optimization easier, we typically work with the **log-likelihood** instead of the likelihood. The log-likelihood for a single observation is the natural logarithm of the likelihood:

$$
\text{Log-Likelihood} = \log(\text{Likelihood}) = \log\left( \hat{p}(x_i)^{y_i} \cdot (1 - \hat{p}(x_i))^{1 - y_i} \right)
$$

Using the logarithm properties $ \log(a \cdot b) = \log(a) + \log(b) $ and $ \log(a^c) = c \log(a) $, we can rewrite the log-likelihood as:

$$
\text{Log-Likelihood} = y_i \log(\hat{p}(x_i)) + (1 - y_i) \log(1 - \hat{p}(x_i))
$$

#### 2.3. **(Total) Log-Likelihood for All Observations**

For a dataset with $ n $ observations, the total log-likelihood is the sum of the log-likelihoods for all individual observations:

$$
\text{(Total) Log-Likelihood} = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}(x_i)) + (1 - y_i) \log(1 - \hat{p}(x_i)) \right]
$$

#### 2.4. **Negative Log-Likelihood (Log-Loss)**
$$
\text{Negative Log-Likelihood} = -\sum_{i=1}^{n} \left[ y_i \log(\hat{p}(x_i)) + (1 - y_i) \log(1 - \hat{p}(x_i)) \right]
$$

Where:
- $ y_i $ is the actual class label (0 or 1) for the $ i $-th observation.
- $ \hat{p}(x_i) $ is the estimated probability for the $ i $-th observation.

Further:
- The **Log-Loss** is the negative log-likelihood divided by the number of data points, $ n $, making it the average loss per observation.
- It is commonly used in machine learning because it provides a normalized metric, allowing comparisons across datasets of different sizes.
- Log-Loss is also known as **Cross-Entropy Loss** in the context of classification problems. When applied to binary classification, it is sometimes referred to as **Binary Cross-Entropy Loss** or **Binary Entropy Loss**.
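
A minimal sketch of the log-loss (the negative log-likelihood divided by $ n $) in Python. The function name `log_loss`, the toy labels and probabilities, and the small clipping constant `eps` (which guards against `log(0)`) are assumptions for this sketch.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Average negative log-likelihood for labels y (0 or 1) and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)   # assumption: avoid log(0) for extreme probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and estimated probabilities.
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])

print(log_loss(y, p))                  # confident, mostly correct predictions: low loss
print(log_loss(y, np.full(4, 0.5)))    # always predicting 0.5 gives log(2)
```

Predicting 0.5 for every observation is a useful baseline: a model whose log-loss is not below $ \log(2) \approx 0.693 $ is no better than a coin flip on balanced data.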

### 3. **Gradient Descent for Logistic Regression**

Gradient Descent is used to minimize the log-loss and find the best estimated parameters $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $.

#### 3.1. **Compute the Gradients**

The gradients of the log-loss with respect to $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ are:

1. **Gradient with respect to $ \hat{\beta}_0 $**:

   $$
   \frac{\partial \text{Log-Loss}}{\partial \hat{\beta}_0} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{p}(x_i) - y_i \right)
   $$

2. **Gradient with respect to $ \hat{\beta}_1 $**:

   $$
   \frac{\partial \text{Log-Loss}}{\partial \hat{\beta}_1} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{p}(x_i) - y_i \right) x_i
   $$

#### 3.2. **Update Rules**

Using the gradients, we update $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ using the following rules:

$$
\hat{\beta}_0^{(k+1)} = \hat{\beta}_0^{(k)} - \alpha \frac{\partial \text{Log-Loss}}{\partial \hat{\beta}_0}
$$

$$
\hat{\beta}_1^{(k+1)} = \hat{\beta}_1^{(k)} - \alpha \frac{\partial \text{Log-Loss}}{\partial \hat{\beta}_1}
$$

Where:
- $ \hat{\beta}_0^{(k)} $ and $ \hat{\beta}_1^{(k)} $ are the estimates at the $ k $-th iteration.
- $ \alpha $ is the learning rate.
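
The gradients and update rules above can be sketched as follows. The eight toy observations are hypothetical (they are not the Default/Balance data) and were chosen so that the two classes overlap, which keeps the estimates finite. The function names, starting values, learning rate and iteration count are likewise assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(beta0, beta1, x, y):
    """Partial derivatives of the log-loss with respect to beta0 and beta1."""
    error = sigmoid(beta0 + beta1 * x) - y    # p_hat(x_i) - y_i
    return error.mean(), (error * x).mean()

def fit_logistic(x, y, alpha=0.5, n_iter=20_000):
    """Minimize the log-loss by gradient descent."""
    beta0, beta1 = 0.0, 0.0                   # assumed starting values
    for _ in range(n_iter):
        grad0, grad1 = gradients(beta0, beta1, x, y)
        beta0 -= alpha * grad0
        beta1 -= alpha * grad1
    return beta0, beta1

# Hypothetical toy data with overlapping classes.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

beta0, beta1 = fit_logistic(x, y)
print(beta0, beta1)
print(gradients(beta0, beta1, x, y))   # both close to zero at the minimum
print(-beta0 / beta1)                  # x value where p_hat(x) = 0.5
```

For a feature on a much larger scale, such as an account balance in the thousands, the same code would typically need a rescaled feature or a far smaller learning rate.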

### 4. **Convergence and Final Model**

The final model can then be used for making predictions:

$$
\hat{p}(x) = \frac{1}{1 + e^{10.65132823650608 - 0.005498915547165412 x}}
$$
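
As a small sketch, the final model and the 0.5 classification rule can be evaluated in Python. The coefficients are taken from the equation above; the function names `default_probability` and `predict_default` and the example balances of 1900 and 2000 are assumptions for illustration.

```python
import math

BETA0 = -10.65132823650608      # estimated intercept (from the final model above)
BETA1 = 0.005498915547165412    # estimated coefficient for balance

def default_probability(balance):
    """Estimated probability of default for a given balance."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + BETA1 * balance)))

def predict_default(balance, threshold=0.5):
    """1 if the estimated probability exceeds the threshold, else 0."""
    return int(default_probability(balance) > threshold)

# Balance at which the estimated probability is exactly 0.5.
boundary = -BETA0 / BETA1
print(round(boundary, 1))

for balance in (1900.0, 2000.0):
    print(balance, round(default_probability(balance), 3), predict_default(balance))
```

With a threshold of 0.5, only balances above the printed boundary are classified as default. A lower threshold would flag more observations, which can matter when missing a default is costly.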

<br>

<img src="images/11_logistic_regression_log_loss.PNG" alt="Log-Loss"/>

### 5. **Beyond Simple Logistic Regression Models**

<img src="images/12_classification_two_independent_variables.PNG" alt="2 Independent Variables" width="1200"/> <br>

$$ \hat{p}(x_{1i}, x_{2i}) = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i})}} $$

### Interpretability vs. Evaluation Metrics

<img src="images/17_interpretability_vs_evaluation_metrics.PNG" alt="Interpretability VS Evaluation Metrics" width="1200"/> <br>
<br>
<br>

#### Neural Net
<img src="images/18_ann.PNG" alt="ANN" width="1200"/> <br>

### Unsupervised learning: clustering - in depth

#### Observed data

| $(i)$ | Education $(x_{1i})$ | Seniority $(x_{2i})$ |
|-------|----------------------|----------------------|
| 1     | 19.9310344827586     | 168.965517241379     |
| 2     | 20.3448275862069     | 187.586206896552     |
| ...   | ...                  | ...                  |
| n     | $(x_{1n})$           | $(x_{2n})$           |


<br>
<img src="images/13_unsupervised_data.PNG" alt="K-Means Clustering Education Seniority Only Data" width="600"/>

### 1. **K-Means Clustering Model**

The K-means clustering model aims to partition a dataset into $ k $ clusters, where each data point $ \mathbf{x}_i $ is assigned to the cluster with the nearest centroid $ \boldsymbol{\mu}_j $. The model can be described by the following assignment rule:

$$
c_i = \arg\min_{j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2
$$

Where:
- $ c_i $ is the index of the cluster assigned to data point $ \mathbf{x}_i $.
- $ \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 $ represents the squared Euclidean distance between $ \mathbf{x}_i $ and centroid $ \boldsymbol{\mu}_j $.

### 2. **Objective Function (Loss Function)**

The **objective function** in K-means clustering is the **Within-Cluster Sum of Squares (WCSS)**, which measures the total variance within clusters. It quantifies how compact the clusters are by summing the squared Euclidean distances between each data point and its assigned cluster centroid. The objective function is defined as:

$$
\text{WCSS} = \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{I}(c_i = j) \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2
$$

where:
- $ n $ is the number of data points.
- $ k $ is the number of clusters.
- $ \mathbb{I}(c_i = j) $ is an indicator function that equals 1 if data point $ \mathbf{x}_i $ is assigned to cluster $ j $ and 0 otherwise.
- $ \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 $ is the squared Euclidean distance between data point $ \mathbf{x}_i $ and centroid $ \boldsymbol{\mu}_j $.
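
A minimal sketch of the WCSS computation in Python. The function name `wcss` and the four toy points are assumptions. Indexing `centroids[labels]` picks each point's own centroid, which plays the role of the indicator function in the formula.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances between each point and its assigned centroid."""
    diffs = X - centroids[labels]   # centroids[labels] has one row per data point
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two clusters of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point lies exactly 1 unit from its centroid, so WCSS = 4 * 1^2 = 4.0.
print(wcss(X, labels, centroids))  # 4.0
```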

### 3. **Steps to Minimize the Objective Function**

The K-means algorithm follows these steps:

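1. **Initialization**: Randomly initialize $ k $ centroids $ \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k $, for example by picking $ k $ observations at random. (An alternative is to randomly assign each observation to a cluster and begin with the Update Step.)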

2. **Assignment Step**: Assign each data point $ \mathbf{x}_i $ to the nearest centroid:

   $$
   c_i = \arg\min_{j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2
   $$

3. **Update Step**: Recalculate the centroids as the mean of all data points assigned to each cluster:

   $$
   \boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i
   $$

   where $ C_j $ is the set of data points assigned to cluster $ j $, and $ |C_j| $ is the number of points in cluster $ j $.

4. **Repeat**: Continue iterating through the Assignment and Update steps until the centroids stabilize (i.e., their positions change very little) or until a maximum number of iterations is reached.
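
The four steps above can be sketched in Python as follows. This is one possible implementation under stated assumptions: the starting centroids are `k` randomly chosen observations, an empty cluster keeps its previous centroid, and the function names, the seed and the six toy points are made up for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Assignment step: index of the nearest centroid for every data point."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    return d2.argmin(axis=1)

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct observations as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Assignment step.
        labels = assign(X, centroids)
        # 3. Update step (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids), centroids

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary: which group is called 0 and which is called 1 depends on the random starting centroids.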

<img src="images/14_K_means_clustering.PNG" alt="K-Means Clustering Iteration" width="1200"/>
<br>
<br>

### 4. **Convergence and Final Model**

The final model (the converged centroids) can then be used for making predictions by assigning a new observation to the cluster with the nearest centroid:

<img src="images/14_01_K_means_clustering_final_result.PNG" alt="K-Means Clustering Final Result" width="600"/>
<br>
<br>
<img src="images/15_K_means_clustering_different_starting_values.PNG" alt="K-Means Clustering Different Starting Values" width="1200"/>
<br>
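WCSS values for different random starting values and $ k = 3 $. K-means only finds a local minimum, so the algorithm is run several times and the run with the lowest WCSS is kept; here the best solution is 235.8.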
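
Running K-means from several starting values and keeping the best run can be sketched as below. Everything here is an assumption for illustration: the synthetic three-group data, the seeds, the choice of 10 runs, and the compact helper `k_means_once`. The WCSS values it prints will not match the 235.8 from the figure.

```python
import numpy as np

def k_means_once(X, k, rng, n_iter=100):
    """One K-means run from random starting centroids; returns (wcss, labels, centroids)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment and WCSS for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum()), d2.argmin(axis=1), centroids

# Hypothetical data: three noisy groups of 30 points each.
data_rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + data_rng.normal(scale=1.0, size=(30, 2)) for c in centers])

# Ten runs, each with different random starting centroids.
run_rng = np.random.default_rng(0)
runs = [k_means_once(X, k=3, rng=run_rng) for _ in range(10)]
wcss_values = [run[0] for run in runs]

# Keep the run with the lowest WCSS.
best_wcss, best_labels, best_centroids = min(runs, key=lambda run: run[0])
print(sorted(round(v, 1) for v in wcss_values))
print(round(best_wcss, 1))
```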

<br>
<br>
<img src="images/16_K_means_clustering_elbow_plot.PNG" alt="K-Means Clustering Elbow Plot" width="1200"/>


### Succeeding with Machine Learning goes beyond 'data is the new oil'

- Business opportunity: 
    - Impact on P&L (revenue, costs).
    - Customer-centric.
    - Reach wide audience.
    - Enhance user experience.
- People: 
    - Right skills.
    - Clear purpose / desire.
    - Aligned with organization's goals.
- Foundation model: 
    - Use it as-is.
    - Fine-tune.
- Data:
    - Proprietary/unique data.
    - Quality, up-to-date, ...
- Act:
    - MVP: Build > train > improve; iterate based on feedback.
    - Measure impact (increased conversion rate by 20%, reduced customer churn by 15%, ...).
- Resource commitment: Allocate sufficient resources, including time, budget, and personnel, to support ML projects.
- Risks:
    - Uncertainty: Significant upfront investment required, with no guarantee of success.
    - Legal.
    - Ethical.
    - ...
- ...