In [22]:
import random 
import math
import scipy
import numpy as np
import matplotlib.pyplot as plt
import library_data_science as lds

# Introduction to Multi-Dimensional Data


Multi-dimensional data can be represented as an array of tuples, where each tuple consists of one or more elements. In this structure, the first element of each tuple is treated as independent, while the other elements typically depend on the first, reflecting the relationship between the variables. \
Note that `D.size = n` and $\forall{m < n}($ `D[m].size = k`$)$, so the dataset holds $n$ tuples of $k$ elements each.

$$D =  \bigg< (d_{00}, d_{01}, ..., d_{0(k-1)}), (d_{10}, d_{11}, ..., d_{1(k-1)}), (d_{20}, d_{21}, ..., d_{2(k-1)}), ... , (d_{(n-1)0}, d_{(n-1)1}, ..., d_{(n-1)(k-1)}) \bigg>,$$
$$D = \bigg< (D[0][0], D[0][1], ..., D[0][k-1]), (D[1][0], D[1][1], ..., D[1][k-1]), ..., (D[n-1][0], D[n-1][1], ..., D[n-1][k-1]) \bigg>$$

A multi-dimensional dataset can be unzipped into $k$ separate sets, one per dimension.

$$D_0 = \big< d_{00}, d_{10}, ... , d_{(n-1)0} \big> = \big< D[0][0], D[1][0], ... , D[n-1][0] \big> $$
$$D_1 = \big< d_{01}, d_{11}, ... , d_{(n-1)1} \big> = \big< D[0][1], D[1][1], ... , D[n-1][1] \big>$$
$$...$$
$$D_{k-1} = \big< d_{0(k-1)}, d_{1(k-1)}, ... , d_{(n-1)(k-1)} \big> = \big< D[0][k-1], D[1][k-1], ... , D[n-1][k-1] \big>$$

To illustrate, I will reuse the examples from the presentation that explain what multidimensional data are.

#### Example 1: Runners' Performance

When examining the distances covered by runners during a 5-minute run, along with their heart rates and oxygen consumption, our data will no longer be one-dimensional. Instead, it will consist of multiple attributes for each runner.

$$
Distance\_Heart\_Oxygen = \big< (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1), (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0) \big>
$$

Here, each tuple represents **(distance in meters, heart rate in bpm, oxygen consumption in L/min)**, making it a **three-dimensional dataset**.

$$
Distance = \big<1078, 896, 1196, 1009, 1078, 1096, 923\big> \\
Heart = \big<145, 152, 138, 149, 143, 141, 155\big> \\
Oxygen = \big<3.2, 2.9, 3.5, 3.1, 3.3, 3.4, 3.0\big>
$$
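The unzipping step can be sketched in plain Python. This is a minimal illustration using the runner data from Example 1; the variable names are chosen here for readability and are not part of any library.

```python
# Runner data from Example 1: (distance in m, heart rate in bpm, oxygen in L/min).
distance_heart_oxygen = [
    (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1),
    (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0),
]

# zip(*D) transposes the list of tuples: one sequence per dimension.
distance, heart, oxygen = (list(column) for column in zip(*distance_heart_oxygen))

# Zipping the columns back together restores the original dataset.
rebuilt = list(zip(distance, heart, oxygen))

print(distance)
print(heart)
print(oxygen)
```

Because `zip` is its own inverse here, `rebuilt` equals the original list of tuples.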

#### Example 2: Physiological and Lifestyle Factors

When studying multiple physiological and lifestyle factors influencing body weight, a more complex dataset may include body weight, height, age, and daily caloric intake.

$$
Weights\_Heights\_Ages\_Calories = \big< (60, 177, 25, 2200), (76, 189, 30, 2500), (99, 197, 28, 2700), (48, 165, 22, 1800) \big>
$$

Each entry now contains four attributes, making it a **four-dimensional dataset**.

$$
Weights = \big< 60, 76, 99, 48 \big> \\
Heights = \big< 177, 189, 197, 165 \big> \\
Ages = \big< 25, 30, 28, 22 \big> \\
Calories = \big< 2200, 2500, 2700, 1800 \big>
$$

#### Key Takeaways

The more attributes we include in our dataset, the higher its dimensionality. Multidimensional data allow for deeper analysis, such as finding correlations between different factors, but they also introduce challenges like increased complexity in visualization and computational processing. 

**Machine learning techniques, such as Principal Component Analysis (PCA), can help reduce dimensionality while preserving essential information.**


# Machine Learning

**Machine Learning (ML)** is a branch of artificial intelligence (AI) that enables computers to recognize patterns and make decisions based on data—without the need for explicitly programmed rules.

![Traditional Programming versus Machine Learning](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef5f4e24edde6ef055c3b2_traditional%20programming%20vs%20machine%20learning.jpg)

### How does it work?

1. **Input Data** – The ML model receives a large amount of data, such as images, text, numbers, or sounds.

2. **Training** – The algorithm analyzes the data and "learns" relationships between them, adjusting its parameters.

3. **Prediction** – After training, the model can process new data and make decisions based on it.

### Examples of ML applications:

- **Recommendations** (Netflix, Spotify suggesting movies/music)

- **Speech and image recognition** (Siri, Google Lens)

- **Spam filters** (detecting spam emails)

- **Predictive systems** (weather forecasts, financial analysis)

### Types of Machine Learning:

1. **Supervised Learning** – The model learns from labeled data (e.g., images of cats and dogs, where it knows what is what).

2. **Unsupervised Learning** – The model searches for patterns in data without labels (e.g., clustering customers based on similar behavior).

![Supervised versus Unsupervised Learning](https://www.mathworks.com/discovery/machine-learning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns/a32c7d5d-8012-4de1-bc76-8bd092f97db8/image_2128876021_cop.adapt.full.medium.svg/1741205964325.svg)

### Basic Paradigm of Machine Learning

1. Observe set of examples: **training data**.

2. Infer something about process that generated that data.

3. Use inference to make predictions about previously unseen data: **test data**.
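The three steps of the paradigm can be sketched with NumPy. The data-generating process below (a noisy straight line) and the 80/20 split are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical process generating the data: y = 2x + 1 plus small noise.
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

# 1. Observe training data (first 80 examples); hold out the rest as test data.
x_train, y_train = x[:80], y[:80]
x_test, y_test = x[80:], y[80:]

# 2. Infer something about the process: least-squares fit of a straight line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# 3. Use the inference to predict previously unseen data and measure the error.
y_pred = slope * x_test + intercept
test_mse = float(np.mean((y_pred - y_test) ** 2))

print(slope, intercept, test_mse)
```

The fitted slope and intercept should land close to the values used to generate the data, and the test error should be on the order of the noise variance.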

# Supervised versus Unsupervised Learning

Machine learning can be broadly categorized into two main types: **Supervised Learning** and **Unsupervised Learning**. The key difference lies in whether the data used for training includes labeled outputs.

![Classification versus Clustering versus Regression](https://lh6.googleusercontent.com/proxy/b9cTY0TniOxMDzL0UEDPN9WdCMqxJ0ETnubKDQ37IIubX6NK1l_iGMkRZTzAdC-Xi3G2V9_jX9PlAQzsUd2g-LLxU7q0qM_KgzKiOeuIodms5uNEVQoy0xEw93U75fZPVT-R_-XN7D4h5L6E)


## Supervised Learning

Supervised learning is a type of machine learning where the model learns from **labeled data**. Each training example consists of an input and a corresponding correct output.

### How it works:

- The algorithm is trained on a dataset containing **inputs (X)** and **expected outputs (Y)**.

- The model makes predictions and adjusts itself based on the difference between its predictions and the actual labels.

- Once trained, the model can make accurate predictions on new, unseen data.
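The "predict, compare with the label, adjust" loop can be sketched as gradient descent on a one-variable linear model. The data, learning rate, and iteration count below are assumptions for illustration, not a recommended configuration.

```python
import numpy as np

# Hypothetical labeled data generated by y = 3x + 2 (inputs X, expected outputs Y).
X = np.linspace(0.0, 1.0, 50)
Y = 3.0 * X + 2.0

w, b = 0.0, 0.0          # model parameters, starting from an arbitrary guess
learning_rate = 0.1      # assumed step size

for _ in range(2000):
    error = (w * X + b) - Y                  # prediction minus label
    # Gradient of the mean squared error with respect to w and b.
    w -= learning_rate * 2.0 * np.mean(error * X)
    b -= learning_rate * 2.0 * np.mean(error)

prediction = w * 0.5 + b   # prediction for a new input x = 0.5
print(w, b, prediction)
```

Each iteration moves the parameters in the direction that reduces the difference between predictions and labels, which is the adjustment step described above.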

### Examples:

- **Email Spam Detection** – Given labeled emails ("spam" or "not spam"), the model learns to classify new emails.

- **Image Classification** – Identifying whether an image contains a cat or a dog.

- **Stock Price Prediction** – Predicting future stock prices based on historical labeled data.


## Unsupervised Learning

Unsupervised learning deals with **unlabeled data**. The model finds patterns and structures in the data without predefined labels.

### How it works:

- The algorithm analyzes input data **without any associated outputs**.

- It groups similar data points or identifies hidden structures.

- Often used for **clustering**, **anomaly detection**, and **pattern recognition**.
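Grouping similar points without labels can be sketched with a minimal k-means loop. The function `k_means`, the synthetic "customer" data, and the deterministic centroid initialization are all assumptions for this illustration; practical implementations usually use randomized or k-means++ initialization.

```python
import numpy as np

def k_means(points, k, n_iter=20):
    """Minimal k-means sketch: returns (centroids, cluster index per point)."""
    # Simplifying assumption: pick initial centroids evenly spaced in input order.
    centroids = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(0)
# Two hypothetical customer groups, e.g. (visits per month, average basket value).
group_a = rng.normal(loc=[2.0, 20.0], scale=1.0, size=(50, 2))
group_b = rng.normal(loc=[10.0, 80.0], scale=1.0, size=(50, 2))
points = np.vstack([group_a, group_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

No labels are passed in; the two groups are recovered only from the distances between points.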

### Examples:

- **Customer Segmentation** – Grouping customers by purchasing behavior.

- **Anomaly Detection** – Identifying fraudulent transactions in banking.

- **Topic Modeling** – Discovering topics in a collection of documents.


### Key Differences:

| Feature              | Supervised Learning | Unsupervised Learning |
|----------------------|--------------------|----------------------|
| **Data Type**        | Labeled data (X, Y) | Unlabeled data (X) |
| **Main Goal**        | Learn a mapping from inputs to outputs | Find hidden structures and patterns |
| **Typical Use Cases** | Classification, Regression | Clustering, Anomaly Detection |

![Classification versus Clustering](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef769f6a877d715fa75008_supervised%20vs%20Unsupervised%20learning.jpg)

Both types of learning have unique applications and are used depending on the problem at hand.

# Feature Engineering

Feature engineering is the process of creating, transforming, and selecting the features used as input for machine learning models. The goal is to improve the quality of the data so the model can learn and make predictions more effectively.

### 1. Understanding the Problem and Data  

- **Domain Knowledge** – Understand which features might influence the outcome.  

- **Exploratory Data Analysis (EDA)** – Identify missing values, outliers, and relationships.  

- **Understanding Variable Types** – Identify numerical, categorical, and time-based data.  


### 2. Feature Selection and Transformation  

- **Removing Irrelevant Features** – Drop low-variance or unrelated columns.  

- **Handling Missing Values** – Impute (mean, median, mode), remove, or replace with special values.  

- **Normalization & Standardization** – Use min-max scaling or standard scaling for models sensitive to scale.  
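The two scaling methods named above can be sketched directly in NumPy. The example reuses the heights from Example 2; the variable names are illustrative.

```python
import numpy as np

heights = np.array([177.0, 189.0, 197.0, 165.0])  # heights from Example 2

# Min-max scaling maps the values onto the range [0, 1].
min_max = (heights - heights.min()) / (heights.max() - heights.min())

# Standard scaling (z-score) yields zero mean and unit standard deviation.
standardized = (heights - heights.mean()) / heights.std()

print(min_max)
print(standardized)
```

Min-max scaling preserves the original shape of the distribution within fixed bounds, while standard scaling expresses each value as a number of standard deviations from the mean.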

### 3. Creating New Features  

- **Feature Extraction** – Generate new variables from existing ones (e.g., date differences, ratios).  

- **Binning Values** – Categorize numerical values into discrete bins (e.g., age groups).  

- **Feature Interactions** – Create new features by multiplying or dividing existing ones (e.g., price per unit).  
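A short sketch of binning and of a ratio feature, using the columns from Example 2. The units (kilograms, centimeters), the age-bin edges, and the body mass index as a derived feature are assumptions chosen for illustration.

```python
import numpy as np

# Columns from Example 2 (units assumed: kg and cm).
weights = np.array([60.0, 76.0, 99.0, 48.0])
heights = np.array([177.0, 189.0, 197.0, 165.0])
ages = np.array([25, 30, 28, 22])

# Binning: 0 = under 25, 1 = 25 to 29, 2 = 30 and over (assumed bin edges).
age_group = np.digitize(ages, bins=[25, 30])

# Feature interaction: body mass index as weight divided by squared height in meters.
bmi = weights / (heights / 100.0) ** 2

print(age_group)
print(bmi.round(1))
```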

### 4. Transforming Categorical Variables  

- **One-Hot Encoding** – Convert categories into binary vectors.  

- **Label Encoding** – Assign integer values to categories.  

- **Target Encoding** – Replace categories with the mean target value (use with caution to avoid overfitting).  
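The three encodings can be sketched in plain Python. The `colors` feature and the `target` values are hypothetical, and the target encoding here is computed on the full data for brevity; in practice it would be fitted on training folds only to limit overfitting.

```python
colors = ["red", "green", "blue", "green"]   # hypothetical categorical feature
target = [1.0, 0.0, 1.0, 1.0]                # hypothetical target values

categories = sorted(set(colors))             # fixed category order

# Label encoding: one integer per category.
label_index = {category: i for i, category in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One-hot encoding: one binary vector per value.
one_hot = [[int(c == category) for category in categories] for c in colors]

# Target encoding: mean target value per category.
target_mean = {
    category: sum(t for c, t in zip(colors, target) if c == category) / colors.count(category)
    for category in categories
}
target_encoded = [target_mean[c] for c in colors]

print(label_encoded)
print(one_hot)
print(target_encoded)
```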

### 5. Dimensionality Reduction and Redundancy Removal  

- **Removing Highly Correlated Features** – Avoid collinearity.  

- **PCA (Principal Component Analysis)** – Reduce dimensionality when dealing with too many features.  

- **Feature Selection (e.g., Lasso, ANOVA, statistical tests)** – Choose the most relevant features.  
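As a rough sketch of how PCA reduces dimensionality, the function below projects the four-dimensional data from Example 2 onto two components using the singular value decomposition. The helper name `pca` is hypothetical, and the features are deliberately left unscaled to keep the example short; in practice they would usually be standardized first, since the calorie column otherwise dominates the variance.

```python
import numpy as np

def pca(data, n_components):
    """Project data onto its first n_components principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(data) - 1)
    return centered @ vt[:n_components].T, explained_variance

# Data from Example 2: (weight, height, age, calories).
data = np.array([(60, 177, 25, 2200), (76, 189, 30, 2500),
                 (99, 197, 28, 2700), (48, 165, 22, 1800)], dtype=float)

reduced, explained_variance = pca(data, n_components=2)
print(reduced.shape)
print(explained_variance / explained_variance.sum())
```

The ratio printed on the last line shows how much of the total variance each component carries, which is the usual basis for choosing how many components to keep.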

### 6. Validation and Testing  

- **Measuring Feature Impact** – Compare model metrics before and after adding features.  

- **Avoiding Data Leakage** – Ensure features are created only from available information (no future data usage).  

- **Using Cross-Validation** – Verify the stability of new features.  
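Cross-validation relies on splitting the data into folds so that every sample is used for validation exactly once. The generator below is a minimal sketch of that split; the name `k_fold_indices` is hypothetical, and shuffling is omitted for clarity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print(train, validation)
```

A feature is considered stable when the model metric improves consistently across the folds rather than in only one of them.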

### Overfitting

Overfitting occurs when a machine learning model learns the training data too well, capturing noise and random fluctuations instead of general patterns. As a result, the model performs exceptionally well on the training data but fails to generalize to new, unseen data.
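A small experiment can make this visible. The sketch below fits a straight line and a degree-9 polynomial to the same 12 noisy points; the data-generating process, sample sizes, and polynomial degrees are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(n):
    """Hypothetical process: y = 2x plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    return x, 2 * x + rng.normal(0, 0.3, size=n)

x_train, y_train = noisy_line(12)
x_test, y_test = noisy_line(200)

def mse(coefficients, x, y):
    return float(np.mean((np.polyval(coefficients, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)     # matches the true process
flexible = np.polyfit(x_train, y_train, deg=9)   # enough freedom to fit the noise

print("train:", mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
print("test: ", mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
```

The flexible model cannot have a higher training error than the simple one, because it contains the straight line as a special case. Its test error is typically much larger, which is the signature of overfitting.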

# Statistical Metrics in Machine Learning

When evaluating machine learning models, we use various statistical measures to assess their performance.

* **Confusion Matrix**: A table that summarizes the performance of a classification model by comparing actual vs. predicted values.

  | Observed / Predicted | $c_1$            | $c_2$            | $c_3$            | ... | $c_k$            |
  |----------------------|------------------|------------------|------------------|-----|------------------|
  | $c_1$                | $Tc_1$           | $Fc_2$ for $c_1$ | $Fc_3$ for $c_1$ | ... | $Fc_k$ for $c_1$ |
  | $c_2$                | $Fc_1$ for $c_2$ | $Tc_2$           | $Fc_3$ for $c_2$ | ... | $Fc_k$ for $c_2$ |
  | $c_3$                | $Fc_1$ for $c_3$ | $Fc_2$ for $c_3$ | $Tc_3$           | ... | $Fc_k$ for $c_3$ |
  | ...                  | ...              | ...              | ...              | ... | ...              |
  | $c_k$                | $Fc_1$ for $c_k$ | $Fc_2$ for $c_k$ | $Fc_3$ for $c_k$ | ... | $Tc_k$           |

* **Accuracy**: Measures the overall correctness of the model. It is the ratio of correctly classified instances to the total instances. However, it can be misleading if the dataset is imbalanced.  

$$
Accuracy = \frac{\sum_{i=1}^{k}{Tc_i}}{\sum_{i=1}^{k}{Tc_i} + \sum_{i=1}^{k}{\sum_{j=1, j \neq i}^{k}{\big(Fc_{i}\text{ for } c_{j}\big)}}}
$$

* **Sensitivity (Recall or True Positive Rate)**: Measures how well the model identifies the instances of class $c_i$: the share of observed $c_i$ instances that were predicted as $c_i$. Higher sensitivity means fewer false negatives.

$$
Recall(c_{i}) = \frac{Tc_{i}}{Tc_{i} + \sum_{j=1, j \neq i}^{k}{\big(Fc_{j} \text{ for } c_{i}\big)}}
$$

* **Positive Predictive Value (Precision)**: Represents the proportion of correct predictions among all instances predicted as class $c_i$. A higher precision means fewer false positives.

$$
Precision(c_{i}) = \frac{Tc_{i}}{Tc_{i} + \sum_{j=1, j \neq i}^{k}{\big(Fc_{i} \text{ for } c_{j}\big)}}
$$

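* **F1 Score**: The harmonic mean of precision and recall. It is high only when both precision and recall are high.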

$$
F1(c_{i}) = \frac{2 \times Precision(c_i) \times Recall(c_i)}{Precision(c_i)+ Recall(c_i)}
$$
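The multi-class formulas can be sketched as a small function operating on a confusion matrix laid out as in the table above (rows are observed classes, columns are predicted classes). The function name `per_class_metrics`, the example matrix, and the convention of returning 0.0 for an undefined ratio are assumptions for this illustration.

```python
def per_class_metrics(matrix):
    """Return accuracy and a list of (precision, recall, F1) per class.

    Assumes matrix[i][j] counts observations of class i predicted as class j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(k)) / total
    metrics = []
    for i in range(k):
        true_i = matrix[i][i]
        observed_i = sum(matrix[i])                        # row sum
        predicted_i = sum(matrix[j][i] for j in range(k))  # column sum
        recall = true_i / observed_i if observed_i else 0.0
        precision = true_i / predicted_i if predicted_i else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return accuracy, metrics

# Hypothetical confusion matrix for three classes.
matrix = [[5, 1, 0],
          [2, 6, 2],
          [0, 1, 3]]
accuracy, metrics = per_class_metrics(matrix)
print(accuracy)
print(metrics)
```

The row sum of class $c_i$ equals $Tc_i$ plus its false negatives, and the column sum equals $Tc_i$ plus its false positives, which is why the recall and precision denominators reduce to these two sums.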

### Binary-Class Classification Metrics


* **Confusion Matrix**: A table that summarizes the performance of a classification model by comparing actual vs. predicted values. It consists of four components:

  - **True Positives (TP)**: Correctly predicted positive cases.

  - **False Positives (FP)**: Negative cases incorrectly predicted as positive.

  - **False Negatives (FN)**: Positive cases incorrectly predicted as negative.

  - **True Negatives (TN)**: Correctly predicted negative cases.

  | Observed / Predicted | Positive | Negative |
  |---|---|---|
  | **Positive** | $TP$ | $FN$ |
  | **Negative** | $FP$ | $TN$ |

  $$
  Confusion\_Matrix = \left[\begin{array}{cc} TP & FN \\ FP & TN \end{array}\right]
  $$
  
* **Accuracy**: Measures the overall correctness of the model. It is the ratio of correctly classified instances to the total instances. However, it can be misleading if the dataset is imbalanced.  
  $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
  
* **Sensitivity (Recall or True Positive Rate)**: Measures how well the model identifies positive instances. Higher sensitivity means fewer false negatives.  
  $$Sensitivity = \frac{TP}{TP + FN}$$
  
* **Specificity (True Negative Rate)**: Measures how well the model identifies negative instances. Higher specificity means fewer false positives.  
  $$Specificity = \frac{TN}{TN + FP}$$
  
* **Positive Predictive Value (Precision)**: Represents the proportion of correctly predicted positive cases among all predicted positive cases. A higher precision means fewer false positives.  
  $$PPV = \frac{TP}{TP + FP}$$
  
* **Negative Predictive Value (NPV)**: Represents the proportion of correctly predicted negative cases among all predicted negative cases.  
  $$NPV = \frac{TN}{TN + FN}$$
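The binary formulas translate directly into code. The function name `binary_metrics` and the counts below are hypothetical; the counts describe a deliberately imbalanced dataset to show how accuracy can mislead.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the binary classification metrics defined above."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical imbalanced example: 10 positive and 990 negative cases.
metrics = binary_metrics(tp=5, fn=5, fp=10, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.985 even though the model finds only half of the positive cases (sensitivity 0.5) and only one in three positive predictions is correct.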

In [None]:
def count_classes(C_observed: list, C_predicted: list) -> list:
    classes = []

    for c in C_observed:
        if c not in classes:
            classes.append(c)

    for c in C_predicted:
        if c not in classes:
            classes.append(c)
    
    return classes

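def confusion_matrix(C_observed: list, C_predicted: list) -> list:
    classes = count_classes(C_observed, C_predicted)

    # Map each class label to its row/column index in the matrix.
    indexes = {}

    for i in range(len(classes)):
        indexes[classes[i]] = i

    # Assumed completion of the unfinished cell:
    # rows are observed classes, columns are predicted classes.
    matrix = [[0] * len(classes) for _ in classes]

    for observed, predicted in zip(C_observed, C_predicted):
        matrix[indexes[observed]][indexes[predicted]] += 1

    return matrix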