In [1]:
import random 
import math
import scipy
import numpy as np
import matplotlib.pyplot as plt
import library_data_science as lds

# Introduction to Multi-Dimensional Data


Multi-dimensional data can be represented as an array of tuples, where each tuple consists of one or more elements. In this structure, the first element of each tuple is usually treated as the independent variable, while the other elements typically depend on it, reflecting the relationship between the variables. \
Remember that `D.size = n` and $\forall{m < n}($ `D[m].size = k`$)$.

$$D =  \bigg< (d_{00}, d_{01}, ..., d_{0(k-1)}), (d_{10}, d_{11}, ..., d_{1(k-1)}), (d_{20}, d_{21}, ..., d_{2(k-1)}), ... , (d_{(n-1)0}, d_{(n-1)1}, ..., d_{(n-1)(k-1)}) \bigg>,$$
$$D = \bigg< (D[0][0], D[0][1], ..., D[0][k-1]), (D[1][0], D[1][1], ..., D[1][k-1]), ..., (D[n-1][0], D[n-1][1], ..., D[n-1][k-1]) \bigg>$$

A multi-dimensional dataset can be unzipped into $k$ separate sets, one per attribute.

$$D_0 = \big< d_{00}, d_{10}, ... , d_{(n-1)0} \big> = \big< D[0][0], D[1][0], ... , D[n-1][0] \big> $$
$$D_1 = \big< d_{01}, d_{11}, ... , d_{(n-1)1} \big> = \big< D[0][1], D[1][1], ... , D[n-1][1] \big>$$
$$...$$
$$D_{k-1} = \big< d_{0(k-1)}, d_{1(k-1)}, ... , d_{(n-1)(k-1)} \big> = \big< D[0][k-1], D[1][k-1], ... , D[n-1][k-1] \big>$$

To illustrate the difference between one-dimensional and multidimensional data, I will reuse the example from the presentation.

#### Example 1: Runners' Performance

When examining the distances covered by runners during a 5-minute run, along with their heart rates and oxygen consumption, our data will no longer be one-dimensional. Instead, it will consist of multiple attributes for each runner.

$$
Distance\_Heart\_Oxygen = \big< (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1), (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0) \big>
$$

Here, each tuple represents **(distance in meters, heart rate in bpm, oxygen consumption in L/min)**, making it a **three-dimensional dataset**.

$$
Distance = \big<1078, 896, 1196, 1009, 1078, 1096, 923\big>
$$

$$
Heart = \big<145, 152, 138, 149, 143, 141, 155\big>
$$

$$
Oxygen = \big<3.2, 2.9, 3.5, 3.1, 3.3, 3.4, 3.0\big>
$$
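The unzipping step can be sketched in plain Python with `zip(*D)`, which transposes a list of tuples into one sequence per attribute. The variable names below are illustrative and not part of the notebook's own code.

```python
# Runner data from Example 1: (distance in m, heart rate in bpm, oxygen in L/min)
distance_heart_oxygen = [
    (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1),
    (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0),
]

# zip(*D) transposes the list of tuples: one output sequence per attribute
distance, heart, oxygen = (list(column) for column in zip(*distance_heart_oxygen))

print(distance)  # [1078, 896, 1196, 1009, 1078, 1096, 923]
print(heart)     # [145, 152, 138, 149, 143, 141, 155]
print(oxygen)    # [3.2, 2.9, 3.5, 3.1, 3.3, 3.4, 3.0]

# Zipping the columns back together restores the original tuples
restored = list(zip(distance, heart, oxygen))
```

Because `zip` is its own inverse here, `restored` equals the original list of tuples.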

#### Example 2: Physiological and Lifestyle Factors

When studying multiple physiological and lifestyle factors influencing body weight, a more complex dataset may include body weight, height, age, and daily caloric intake.

$$
Weights\_Heights\_Ages\_Calories = \big< (60, 177, 25, 2200), (76, 189, 30, 2500), (99, 197, 28, 2700), (48, 165, 22, 1800) \big>
$$

Each entry now contains four attributes, making it a **four-dimensional dataset**.

$$
Weights = \big< 60, 76, 99, 48 \big>
$$

$$
Heights = \big< 177, 189, 197, 165 \big> 
$$

$$
Ages = \big< 25, 30, 28, 22 \big>
$$

$$
Calories = \big< 2200, 2500, 2700, 1800 \big>
$$

#### Key Takeaways

The more attributes we include in our dataset, the higher its dimensionality. Multidimensional data allow for deeper analysis, such as finding correlations between different factors, but they also introduce challenges like increased complexity in visualization and computational processing. 

**Machine learning techniques, such as Principal Component Analysis (PCA), can help reduce dimensionality while preserving essential information.**
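As a small illustration of the correlations mentioned in the takeaways, the sketch below computes Pearson correlation coefficients between the three runner attributes from Example 1. Using `np.corrcoef` for this is a choice made here, not something the notebook prescribes.

```python
import numpy as np

# Attribute columns from Example 1
distance = [1078, 896, 1196, 1009, 1078, 1096, 923]   # meters
heart = [145, 152, 138, 149, 143, 141, 155]           # bpm
oxygen = [3.2, 2.9, 3.5, 3.1, 3.3, 3.4, 3.0]          # L/min

# np.corrcoef treats each row as one variable and returns the matrix
# of Pearson correlation coefficients between all pairs of variables
correlations = np.corrcoef([distance, heart, oxygen])

print(np.round(correlations, 3))
```

In this sample, distance and heart rate are negatively correlated, while distance and oxygen consumption are positively correlated. With only seven runners, this is an illustration rather than evidence of a general relationship.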


# Machine Learning

**Machine Learning (ML)** is a branch of artificial intelligence (AI) that enables computers to recognize patterns and make decisions based on data—without the need for explicitly programmed rules.

![Traditional Programming versus Machine Learning](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef5f4e24edde6ef055c3b2_traditional%20programming%20vs%20machine%20learning.jpg)

### How does it work?

1. **Input Data** – The ML model receives a large amount of data, such as images, text, numbers, or sounds.

2. **Training** – The algorithm analyzes the data and "learns" relationships between them, adjusting its parameters.

3. **Prediction** – After training, the model can process new data and make decisions based on it.

### Examples of ML applications:

- **Recommendations** (Netflix, Spotify suggesting movies/music)

- **Speech and image recognition** (Siri, Google Lens)

- **Spam filters** (detecting spam emails)

- **Predictive systems** (weather forecasts, financial analysis)

### Types of Machine Learning:

1. **Supervised Learning** – The model learns from labeled data (e.g., images of cats and dogs, where it knows what is what).

2. **Unsupervised Learning** – The model searches for patterns in data without labels (e.g., clustering customers based on similar behavior).

![Supervised versus Unsupervised Learning](https://www.mathworks.com/discovery/machine-learning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns/a32c7d5d-8012-4de1-bc76-8bd092f97db8/image_2128876021_cop.adapt.full.medium.svg/1741205964325.svg)

### Basic Paradigm of Machine Learning

1. Observe a set of examples: the **training data**.

2. Infer something about the process that generated that data.

3. Use the inference to make predictions about previously unseen data: the **test data**.
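The three steps above can be sketched with a straight-line fit. The data-generating process (`y = 2x + 1` plus noise), the 20/10 split, and the random seed are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical process generating the data: y = 2x + 1 plus small noise
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# 1. Observe a set of examples: 20 shuffled points for training, 10 held out
order = rng.permutation(x.size)
train, test = order[:20], order[20:]

# 2. Infer something about the process: fit a straight line to the training data
slope, intercept = np.polyfit(x[train], y[train], deg=1)

# 3. Predict on previously unseen test data and measure the error
y_pred = slope * x[test] + intercept
test_mse = np.mean((y_pred - y[test]) ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MSE={test_mse:.3f}")
```

The fitted slope and intercept should land close to the assumed values 2 and 1, and the test error indicates how well the inference carries over to data the model has not seen.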

# Supervised versus Unsupervised Learning

Machine learning can be broadly categorized into two main types: **Supervised Learning** and **Unsupervised Learning**. The key difference lies in whether the data used for training includes labeled outputs.

![Classification versus Clustering versus Regression](https://lh6.googleusercontent.com/proxy/b9cTY0TniOxMDzL0UEDPN9WdCMqxJ0ETnubKDQ37IIubX6NK1l_iGMkRZTzAdC-Xi3G2V9_jX9PlAQzsUd2g-LLxU7q0qM_KgzKiOeuIodms5uNEVQoy0xEw93U75fZPVT-R_-XN7D4h5L6E)


## Supervised Learning

Supervised learning is a type of machine learning where the model learns from **labeled data**. Each training example consists of an input and a corresponding correct output.

### How it works:

- The algorithm is trained on a dataset containing **inputs (X)** and **expected outputs (Y)**.

- The model makes predictions and adjusts itself based on the difference between its predictions and the actual labels.

- Once trained, the model can make predictions on new, unseen data; how accurate they are depends on how well it generalizes.
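A minimal sketch of this loop, assuming a one-parameter model `y = w * x` trained by gradient descent on the mean squared error. The data, learning rate, and iteration count are hypothetical values chosen for illustration.

```python
import numpy as np

# Labeled training data: inputs X and expected outputs Y (here Y = 3 * X exactly)
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 6.0, 9.0, 12.0])

w = 0.0               # single model parameter, starts untrained
learning_rate = 0.01

for _ in range(500):
    predictions = w * X
    errors = predictions - Y              # difference between predictions and labels
    gradient = 2 * np.mean(errors * X)    # derivative of the mean squared error w.r.t. w
    w -= learning_rate * gradient         # adjust the parameter to reduce the error

# Prediction for a new, unseen input
new_prediction = w * 5.0
print(round(w, 3), round(new_prediction, 3))
```

After training, `w` converges to approximately 3, so the prediction for the unseen input 5 is approximately 15.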

### Examples:

- **Email Spam Detection** – Given labeled emails ("spam" or "not spam"), the model learns to classify new emails.

- **Image Classification** – Identifying whether an image contains a cat or a dog.

- **Stock Price Prediction** – Predicting future stock prices based on historical labeled data.


## Unsupervised Learning

Unsupervised learning deals with **unlabeled data**. The model finds patterns and structures in the data without predefined labels.

### How it works:

- The algorithm analyzes input data **without any associated outputs**.

- It groups similar data points or identifies hidden structures.

- Often used for **clustering**, **anomaly detection**, and **pattern recognition**.
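Clustering can be illustrated with k-means, one common algorithm that the text above does not name specifically. The sketch below is a simplified version under stated assumptions: the points are hypothetical, and the first `k` points serve as initial centroids to keep the run deterministic.

```python
import numpy as np

def k_means(points, k, iterations=10):
    """Simplified k-means: returns (labels, centroids)."""
    # Start from the first k points as centroids (a simple, deterministic choice)
    centroids = points[:k].copy()
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical unlabeled data forming two visibly separate groups
points = np.array([[1.0, 1.0], [8.0, 8.0], [1.5, 2.0],
                   [9.0, 8.5], [1.0, 1.5], [8.5, 9.0]])

labels, centroids = k_means(points, k=2)
print(labels.tolist())  # [0, 1, 0, 1, 0, 1]
```

No labels are supplied. The grouping emerges only from the distances between points.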

### Examples:

- **Customer Segmentation** – Grouping customers by purchasing behavior.

- **Anomaly Detection** – Identifying fraudulent transactions in banking.

- **Topic Modeling** – Discovering topics in a collection of documents.


### Key Differences:

| Feature              | Supervised Learning | Unsupervised Learning |
|----------------------|--------------------|----------------------|
| **Data Type**        | Labeled data (X, Y) | Unlabeled data (X) |
| **Main Goal**        | Learn a mapping from inputs to outputs | Find hidden structures and patterns |
| **Typical Use Cases** | Classification, Regression | Clustering, Anomaly Detection |

![Classification versus Clustering](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef769f6a877d715fa75008_supervised%20vs%20Unsupervised%20learning.jpg)

Both types of learning have unique applications and are used depending on the problem at hand.

# Feature Engineering

Feature engineering is the process of creating, transforming, and selecting the features used as input for machine learning models. The goal is to improve the quality of the data so the model can learn and make predictions more effectively.

### 1. Understanding the Problem and Data  

- **Domain Knowledge** – Understand which features might influence the outcome.  

- **Exploratory Data Analysis (EDA)** – Identify missing values, outliers, and relationships.  

- **Understanding Variable Types** – Identify numerical, categorical, and time-based data.  


### 2. Feature Selection and Transformation  

- **Removing Irrelevant Features** – Drop low-variance or unrelated columns.  

- **Handling Missing Values** – Impute (mean, median, mode), remove, or replace with special values.  

- **Normalization & Standardization** – Use min-max scaling or standard scaling for models sensitive to scale.  
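The two scaling methods named above can be sketched with NumPy, here applied to the heights from Example 2. The variable names are illustrative.

```python
import numpy as np

heights = np.array([177.0, 189.0, 197.0, 165.0])  # heights from Example 2

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (heights - heights.min()) / (heights.max() - heights.min())

# Standard scaling: subtract the mean and divide by the standard deviation,
# giving a column with mean 0 and standard deviation 1
standard = (heights - heights.mean()) / heights.std()

print(min_max.tolist())  # [0.375, 0.75, 1.0, 0.0]
print(np.round(standard, 3))
```

Min-max scaling preserves the shape of the distribution within a fixed range, while standard scaling centers it, which suits models that assume roughly zero-centered inputs.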

### 3. Creating New Features  

- **Feature Extraction** – Generate new variables from existing ones (e.g., date differences, ratios).  

- **Binning Values** – Categorize numerical values into discrete bins (e.g., age groups).  

- **Feature Interactions** – Create new features by multiplying or dividing existing ones (e.g., price per unit).  
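As a sketch of binning and of a ratio feature, the example below reuses the values from Example 2. The bin edges, group names, and units (kilograms and centimeters) are assumptions made for illustration.

```python
import numpy as np

ages = np.array([25, 30, 28, 22])
weights_kg = np.array([60, 76, 99, 48])      # assumed to be kilograms
heights_cm = np.array([177, 189, 197, 165])  # assumed to be centimeters

# Binning: np.digitize returns, for each age, the index of the bin it falls into
bin_edges = [25, 30]                          # hypothetical age-group boundaries
group_names = ["under 25", "25-29", "30 and over"]
age_groups = [group_names[i] for i in np.digitize(ages, bin_edges)]

# Feature interaction: body mass index combines weight and height in one ratio
bmi = weights_kg / (heights_cm / 100) ** 2

print(age_groups)  # ['25-29', '30 and over', '25-29', 'under 25']
print(np.round(bmi, 2))
```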

### 4. Transforming Categorical Variables  

- **One-Hot Encoding** – Convert categories into binary vectors.  

- **Label Encoding** – Assign integer values to categories.  

- **Target Encoding** – Replace categories with the mean target value (use with caution to avoid overfitting).  
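The three encodings can be sketched in plain Python. The `colors` feature and the binary `target` are hypothetical data, and categories are ordered alphabetically only to make the result reproducible.

```python
colors = ["red", "green", "blue", "green"]   # hypothetical categorical feature
target = [1, 0, 1, 1]                        # hypothetical binary target

categories = sorted(set(colors))             # ['blue', 'green', 'red']

# Label encoding: one integer per category
label_index = {category: i for i, category in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One-hot encoding: one binary column per category
one_hot = [[int(c == category) for category in categories] for c in colors]

# Target encoding: replace each category with the mean target of its rows
target_mean = {
    category: sum(t for c, t in zip(colors, target) if c == category) / colors.count(category)
    for category in categories
}
target_encoded = [target_mean[c] for c in colors]

print(label_encoded)   # [2, 1, 0, 1]
print(one_hot)         # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(target_encoded)  # [1.0, 0.5, 1.0, 0.5]
```

The target means here are computed from the same rows they encode, which is exactly the leakage risk noted above. In practice they would be computed on training folds only.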

### 5. Dimensionality Reduction and Redundancy Removal  

- **Removing Highly Correlated Features** – Avoid collinearity.  

- **PCA (Principal Component Analysis)** – Reduce dimensionality when dealing with too many features.  

- **Feature Selection (e.g., Lasso, ANOVA, statistical tests)** – Choose the most relevant features.  
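A minimal PCA sketch using the singular value decomposition, applied to the four-attribute data from Example 2. Standardizing the columns first and keeping two components are choices made for this illustration.

```python
import numpy as np

# Rows from Example 2: (weight, height, age, daily calories)
data = np.array([
    [60, 177, 25, 2200],
    [76, 189, 30, 2500],
    [99, 197, 28, 2700],
    [48, 165, 22, 1800],
], dtype=float)

# Standardize so that no attribute dominates because of its units
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Rows of vt are the principal directions; s holds the singular values
u, s, vt = np.linalg.svd(standardized, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)   # share of total variance per component
reduced = standardized @ vt[:2].T     # project 4 attributes onto 2 components

print(np.round(explained, 3))
print(reduced.shape)  # (4, 2)
```

Because the attributes in this small sample are strongly correlated, the first component carries most of the variance, which is the situation in which dimensionality reduction loses little information.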

### 6. Validation and Testing  

- **Measuring Feature Impact** – Compare model metrics before and after adding features.  

- **Avoiding Data Leakage** – Ensure features are created only from available information (no future data usage).  

- **Using Cross-Validation** – Verify the stability of new features.  
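Cross-validation rests on splitting the samples into folds so that each sample is used for validation exactly once. The helper below, `k_fold_indices`, is a hypothetical function written for this sketch; it does not shuffle, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print(train, validation)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```

A feature is considered stable if the model metric improves consistently across the folds rather than in just one of them.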

### Overfitting

Overfitting occurs when a machine learning model learns the training data too well, capturing noise and random fluctuations instead of general patterns. As a result, the model performs exceptionally well on the training data but fails to generalize to new, unseen data.
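The effect can be sketched by fitting polynomials of different degrees to a few noisy points. The underlying straight line, the noise level, the seed, and the degrees 1 and 5 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def noisy_line(x):
    # Hypothetical ground truth: a straight line plus random noise
    return 2 * x + rng.normal(scale=0.3, size=x.size)

x_train = np.linspace(0, 1, 8)
y_train = noisy_line(x_train)
x_test = np.linspace(0.05, 0.95, 8)
y_test = noisy_line(x_test)

def mse(coefficients, x, y):
    # Mean squared error of the polynomial on the given points
    return np.mean((np.polyval(coefficients, x) - y) ** 2)

errors = {}
for degree in (1, 5):
    coefficients = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coefficients, x_train, y_train),
                      mse(coefficients, x_test, y_test))
    print(degree, errors[degree])
```

The degree-5 fit always reaches a training error at least as low as the degree-1 fit, because it has more freedom to follow the noise. A gap between its training error and its test error is the typical sign of overfitting.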

# Statistical Metrics in Machine Learning

When evaluating machine learning models, we use various statistical measures to assess their performance.

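* **Confusion Matrix**: A table that summarizes the performance of a classification model by counting correct and incorrect predictions for every class. Rows correspond to observed classes and columns to predicted classes: $Tc_i$ counts instances of class $c_i$ that were predicted correctly, and "$Fc_j$ for $c_i$" counts instances of class $c_i$ that were wrongly predicted as $c_j$.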

$$
\begin{array}{c|ccccc}
\text{Observed / Predicted} & c_0 & c_1 & c_2 & \dots & c_{k-1} \\
\hline
c_0 & Tc_0 & Fc_1 \text{ for } c_0 & Fc_2 \text{ for } c_0 & \dots & Fc_{k-1} \text{ for } c_0 \\
c_1 & Fc_0 \text{ for } c_1 & Tc_1 & Fc_2 \text{ for } c_1 & \dots & Fc_{k-1} \text{ for } c_1 \\
c_2 & Fc_0 \text{ for } c_2 & Fc_1 \text{ for } c_2 & Tc_2 & \dots & Fc_{k-1} \text{ for } c_2 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
c_{k-1} & Fc_0 \text{ for } c_{k-1} & Fc_1 \text{ for } c_{k-1} & Fc_2 \text{ for } c_{k-1} & \dots & Tc_{k-1}
\end{array}
$$


* **Accuracy**: The ratio of correctly predicted instances to the total instances, measuring overall correctness but potentially misleading for imbalanced datasets.

$$
Accuracy = \frac{\sum_{i=0}^{k-1}{Tc_i}}{\sum_{i=0}^{k-1}{Tc_i} + \sum_{i=0}^{k-1}{\sum_{j=0,\, j \neq i}^{k-1}{\big(Fc_{i}\text{ for } c_{j}\big)}}}
$$

* **Precision (Positive Predictive Value)**: The proportion of correctly predicted positive instances out of all instances predicted as positive, important when false positives are costly. 

$$
Precision(c_{i}) = \frac{Tc_{i}}{Tc_{i} + \sum_{j=0,\, j \neq i}^{k-1}{\big(Fc_{i} \text{ for } c_{j}\big)}}
$$

* **Recall (Sensitivity or True Positive Rate)**: The proportion of correctly predicted positive instances out of all actual positive instances, crucial when false negatives are costly.

$$
Recall(c_{i}) = \frac{Tc_{i}}{Tc_{i} + \sum_{j=0,\, j \neq i}^{k-1}{\big(Fc_{j} \text{ for } c_{i}\big)}}
$$

* **F1-Score**: The harmonic mean of precision and recall, balancing both metrics and useful when dealing with class imbalance.

$$
F1\_Score(c_{i}) = \frac{2 \times Precision(c_i) \times Recall(c_i)}{Precision(c_i)+ Recall(c_i)}
$$

In [2]:
def get_classes(C_observed: list, C_predicted: list) -> list:
    """
    Returns a list of unique classes found in both observed and predicted labels.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    
    Returns:
    list: A list of unique class labels.
    """

    classes = []

    for c in C_observed:
        if c not in classes:
            classes.append(c)

    for c in C_predicted:
        if c not in classes:
            classes.append(c)
    
    return classes


def print_matrix(matrix):
    """
    Prints a given matrix row by row.
    
    Parameters:
    matrix (list of lists): The matrix to be printed.
    """

    for row in matrix:
        print(row)


def confusion_matrix(C_observed: list, C_predicted: list) -> tuple:
    """
    Computes the confusion matrix for the given observed and predicted labels.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    
    Returns:
    tuple: A tuple containing the confusion matrix (list of lists) and a dictionary 
           mapping class labels to matrix indices.
    """

    classes = get_classes(C_observed, C_predicted)
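    if len(C_observed) != len(C_predicted):
        raise ValueError("C_observed and C_predicted must have the same length")

    # Map every class label to its row/column index in the matrix
    indexes = {c: i for i, c in enumerate(classes)}

    # Rows are observed classes, columns are predicted classes
    cm = [[0] * len(classes) for _ in classes]

    for observed, predicted in zip(C_observed, C_predicted):
        cm[indexes[observed]][indexes[predicted]] += 1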

    return (cm, indexes)


def accuracy(C_observed: list, C_predicted: list) -> float:
    """
    Calculates the accuracy of the classification.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    
    Returns:
    float: Accuracy of the classification.
    """

    cm = confusion_matrix(C_observed, C_predicted)[0]

    T = 0

    for i in range(len(cm)):
        T = T + cm[i][i]
    
    F = 0

    for i in range(len(cm)):
        for j in range(len(cm)):
            if not i == j:
                F = F + cm[i][j]
    
    return (T) / (T + F)


def precision(C_observed: list, C_predicted: list, c) -> float:
    """
    Computes the precision for a specific class.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    c: The class for which precision is calculated.
    
    Returns:
    float: Precision score for the specified class.
    """

    cm, indexes = confusion_matrix(C_observed, C_predicted)

    i = indexes[c]

    T = cm[i][i]

    F = 0

    for j in range(len(cm)):
        if not i == j:
            F = F + cm[j][i]
    
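    # A class that was never predicted has an undefined precision; return 0.0 by convention
    return T / (T + F) if (T + F) > 0 else 0.0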


def recall(C_observed: list, C_predicted: list, c) -> float:
    """
    Computes the recall for a specific class.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    c: The class for which recall is calculated.
    
    Returns:
    float: Recall score for the specified class.
    """
    cm, indexes = confusion_matrix(C_observed, C_predicted)

    i = indexes[c]

    T = cm[i][i]

    F = 0

    for j in range(len(cm)):
        if not i == j:
            F = F + cm[i][j]
    
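    # A class that was never observed has an undefined recall; return 0.0 by convention
    return T / (T + F) if (T + F) > 0 else 0.0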


def f1_score(C_observed: list, C_predicted: list, c) -> float:
    """
    Computes the F1-score for a specific class.
    
    Parameters:
    C_observed (list): List of observed class labels.
    C_predicted (list): List of predicted class labels.
    c: The class for which the F1-score is calculated.
    
    Returns:
    float: F1-score for the specified class.
    """

    prec = precision(C_observed, C_predicted, c)
    rec = recall(C_observed, C_predicted, c)

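    # If precision and recall are both zero the harmonic mean is undefined; return 0.0
    return (2 * prec * rec) / (prec + rec) if (prec + rec) > 0 else 0.0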

In [3]:
C_obse = ['A', 'B', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'A', 'C', 'B']
C_pred = ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'A', 'C', 'A', 'B', 'B']

cm = confusion_matrix(C_obse, C_pred)

print_matrix(cm[0])

print('Accuracy =', np.round(accuracy(C_obse, C_pred), 3))
print('Precision(A) =', np.round(precision(C_obse, C_pred, 'A'), 3))
print('Precision(B) =', np.round(precision(C_obse, C_pred, 'B'), 3))
print('Precision(C) =', np.round(precision(C_obse, C_pred, 'C'), 3))
print('Recall(A) =', np.round(recall(C_obse, C_pred, 'A'), 3))
print('Recall(B) =', np.round(recall(C_obse, C_pred, 'B'), 3))
print('Recall(C) =', np.round(recall(C_obse, C_pred, 'C'), 3))
print('F1_Score(A) =', np.round(f1_score(C_obse, C_pred, 'A'), 3))
print('F1_Score(B) =', np.round(f1_score(C_obse, C_pred, 'B'), 3))
print('F1_Score(C) =', np.round(f1_score(C_obse, C_pred, 'C'), 3))

[4, 0, 0]
[1, 2, 1]
[1, 1, 2]
Accuracy = 0.667
Precision(A) = 0.667
Precision(B) = 0.667
Precision(C) = 0.667
Recall(A) = 1.0
Recall(B) = 0.5
Recall(C) = 0.5
F1_Score(A) = 0.8
F1_Score(B) = 0.571
F1_Score(C) = 0.571


### Binary-Class Classification Metrics

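Consider a binary example, where each instance is either positive or negative. Suppose a study tests whether a patient is infected. Writing $TP$, $FN$, $FP$, and $TN$ for the counts of true positives, false negatives, false positives, and true negatives, and keeping rows for observed classes and columns for predicted classes, the metrics take the following form.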

$$
Confusion\_Matrix = \left[\begin{array}{cc} TP & FN \\ FP & TN \end{array}\right] \quad Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$  
$$Precision(\text{Positive}) = \frac{TP}{TP + FP} \quad Precision(\text{Negative}) = \frac{TN}{TN + FN}$$
$$Recall(\text{Positive}) = \frac{TP}{TP + FN} \quad Recall(\text{Negative}) = \frac{TN}{TN + FP}$$
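A short worked example of these formulas, using hypothetical counts for an infection test on 100 patients. The numbers are invented for illustration and do not come from a real study.

```python
# Hypothetical counts for an infection test on 100 patients
TP, FN = 40, 10   # infected patients: correctly detected / missed
FP, TN = 5, 45    # healthy patients: wrongly flagged / correctly cleared

binary_accuracy = (TP + TN) / (TP + TN + FP + FN)

precision_positive = TP / (TP + FP)
precision_negative = TN / (TN + FN)

recall_positive = TP / (TP + FN)
recall_negative = TN / (TN + FP)

print('Accuracy =', round(binary_accuracy, 3))                # 0.85
print('Precision(Positive) =', round(precision_positive, 3))  # 0.889
print('Precision(Negative) =', round(precision_negative, 3))  # 0.818
print('Recall(Positive) =', round(recall_positive, 3))        # 0.8
print('Recall(Negative) =', round(recall_negative, 3))        # 0.9
```

Here the test misses 10 of 50 infected patients, so recall for the positive class (0.8) is lower than its precision (0.889). In a medical setting the missed infections are usually the more costly error.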