import random 
import math
import scipy
import numpy as np
import matplotlib.pyplot as plt
import library_data_science as lds

# Introduction to Multi-Dimensional Data


Multi-dimensional data can be represented as an array of tuples, where each tuple consists of one, two, or more elements. In this structure, the first element of each tuple is treated as the independent variable, while the other elements typically depend on it, reflecting the relationship between the variables. \
Remember that `D.size = n` and $\forall{m < n}($ `D[m].size = k`$)$.

$$D =  \bigg< (d_{00}, d_{01}, ..., d_{0(k-1)}), (d_{10}, d_{11}, ..., d_{1(k-1)}), (d_{20}, d_{21}, ..., d_{2(k-1)}), ... , (d_{(n-1)0}, d_{(n-1)1}, ..., d_{(n-1)(k-1)}) \bigg>,$$
$$D = \bigg< (D[0][0], D[0][1], ..., D[0][k-1]), (D[1][0], D[1][1], ..., D[1][k-1]), ..., (D[n-1][0], D[n-1][1], ..., D[n-1][k-1]) \bigg>$$

A multi-dimensional dataset can be unzipped into $k$ separate sets.

$$D_0 = \big< d_{00}, d_{10}, ... , d_{(n-1)0} \big> = \big< D[0][0], D[1][0], ... , D[n-1][0] \big> $$
$$D_1 = \big< d_{01}, d_{11}, ... , d_{(n-1)1} \big> = \big< D[0][1], D[1][1], ... , D[n-1][1] \big>$$
$$...$$
$$D_{k-1} = \big< d_{0(k-1)}, d_{1(k-1)}, ... , d_{(n-1)(k-1)} \big> = \big< D[0][k-1], D[1][k-1], ... , D[n-1][k-1] \big>$$

To illustrate how multidimensional data differ from one-dimensional data, I will reuse the examples from the presentation.

#### Example 1: Runners' Performance

When examining the distances covered by runners during a 5-minute run, along with their heart rates and oxygen consumption, our data will no longer be one-dimensional. Instead, it will consist of multiple attributes for each runner.

$$
Distance\_Heart\_Oxygen = \big< (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1), (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0) \big>
$$

Here, each tuple represents **(distance in meters, heart rate in bpm, oxygen consumption in L/min)**, making it a **three-dimensional dataset**.

$$
Distance = \big< 1078, 896, 1196, 1009, 1078, 1096, 923 \big> \\
Heart = \big< 145, 152, 138, 149, 143, 141, 155 \big> \\
Oxygen = \big< 3.2, 2.9, 3.5, 3.1, 3.3, 3.4, 3.0 \big>
$$
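
The unzipping step can be sketched in plain Python with the built-in `zip`. The variable names below are chosen for this illustration only; the values are the runner data from the example.

```python
# Runner data from Example 1: (distance in m, heart rate in bpm, oxygen in L/min).
distance_heart_oxygen = [
    (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1),
    (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0),
]

# zip(*D) transposes n tuples of size k into k tuples of size n.
distance, heart, oxygen = zip(*distance_heart_oxygen)

print(distance)  # (1078, 896, 1196, 1009, 1078, 1096, 923)
print(heart)
print(oxygen)

# Zipping the columns back together restores the original dataset.
rebuilt = list(zip(distance, heart, oxygen))
print(rebuilt == distance_heart_oxygen)  # True
```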

#### Example 2: Physiological and Lifestyle Factors

When studying multiple physiological and lifestyle factors influencing body weight, a more complex dataset may include height, body weight, age, and daily caloric intake.

$$
Weights\_Heights\_Ages\_Calories = \big< (60, 177, 25, 2200), (76, 189, 30, 2500), (99, 197, 28, 2700), (48, 165, 22, 1800) \big>
$$

Each entry now contains four attributes, making it a **four-dimensional dataset**.

$$
Weights = \big< 60, 76, 99, 48 \big> \\
Heights = \big< 177, 189, 197, 165 \big> \\
Ages = \big< 25, 30, 28, 22 \big> \\
Calories = \big< 2200, 2500, 2700, 1800 \big>
$$

#### Key Takeaways

The more attributes we include in our dataset, the higher its dimensionality. Multidimensional data allow for deeper analysis, such as finding correlations between different factors, but they also introduce challenges like increased complexity in visualization and computational processing. 
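
As a small illustration of finding correlations between attributes, the sketch below computes Pearson correlation coefficients for the four-dimensional dataset from Example 2 with `np.corrcoef`. With only four samples the numbers are illustrative, not statistically meaningful, and the `names` list is just a labeling choice for this sketch.

```python
import numpy as np

# Example 2 data: (weight, height, age, daily calories) per person.
weights_heights_ages_calories = np.array([
    (60, 177, 25, 2200),
    (76, 189, 30, 2500),
    (99, 197, 28, 2700),
    (48, 165, 22, 1800),
], dtype=float)

# rowvar=False tells NumPy that each column (not each row) is one variable.
correlation = np.corrcoef(weights_heights_ages_calories, rowvar=False)

names = ["weight", "height", "age", "calories"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]:>8} vs {names[j]:<8}: r = {correlation[i, j]:+.2f}")
```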

**Machine learning techniques, such as Principal Component Analysis (PCA), can help reduce dimensionality while preserving essential information.**
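
A minimal sketch of PCA, assuming NumPy only and using the singular value decomposition of the standardized runner data from Example 1. Standardizing the columns is a choice made for this sketch because the three attributes have different units; it is not something the text above prescribes.

```python
import numpy as np

# Runner data: (distance in m, heart rate in bpm, oxygen in L/min).
X = np.array([
    (1078, 145, 3.2), (896, 152, 2.9), (1196, 138, 3.5), (1009, 149, 3.1),
    (1078, 143, 3.3), (1096, 141, 3.4), (923, 155, 3.0),
], dtype=float)

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Share of the total variance captured by each principal component.
explained = S**2 / np.sum(S**2)

# Project the 3-dimensional data onto the first two principal components.
X_reduced = X_std @ Vt[:2].T

print("Explained variance ratio:", np.round(explained, 3))
print("Reduced shape:", X_reduced.shape)
```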


# Machine Learning

**Machine Learning (ML)** is a branch of artificial intelligence (AI) that enables computers to recognize patterns and make decisions based on data—without the need for explicitly programmed rules.

![Traditional Programming versus Machine Learning](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef5f4e24edde6ef055c3b2_traditional%20programming%20vs%20machine%20learning.jpg)

### How does it work?

1. **Input Data** – The ML model receives a large amount of data, such as images, text, numbers, or sounds.

2. **Training** – The algorithm analyzes the data and "learns" relationships between them, adjusting its parameters.

3. **Prediction** – After training, the model can process new data and make decisions based on it.

### Examples of ML applications:

- **Recommendations** (Netflix, Spotify suggesting movies/music)

- **Speech and image recognition** (Siri, Google Lens)

- **Spam filters** (detecting spam emails)

- **Predictive systems** (weather forecasts, financial analysis)

### Types of Machine Learning:

1. **Supervised Learning** – The model learns from labeled data (e.g., images of cats and dogs, each tagged with the correct animal).

2. **Unsupervised Learning** – The model searches for patterns in data without labels (e.g., clustering customers based on similar behavior).

![Supervised versus Unsupervised Learning](https://www.mathworks.com/discovery/machine-learning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns/a32c7d5d-8012-4de1-bc76-8bd092f97db8/image_2128876021_cop.adapt.full.medium.svg/1741205964325.svg)

### Basic Paradigm of Machine Learning

1. Observe a set of examples: **training data**.

2. Infer something about the process that generated that data.

3. Use inference to make predictions about previously unseen data: **test data**.
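
The three steps can be sketched with a straight-line fit. The data-generating process (`y = 2x + 1` plus noise), the 40/10 split, and the use of `np.polyfit` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical process that generated the data: y = 2x + 1 plus noise.
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)

# 1. Observe a set of examples: the first 40 points are the training data.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

# 2. Infer something about the process: fit a straight line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# 3. Predict previously unseen data (the test data) and measure the error.
y_pred = slope * x_test + intercept
test_mse = np.mean((y_pred - y_test) ** 2)

print(f"Inferred line: y = {slope:.2f} x + {intercept:.2f}")
print(f"Mean squared error on test data: {test_mse:.3f}")
```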

# Supervised versus Unsupervised Learning

Machine learning can be broadly categorized into two main types: **Supervised Learning** and **Unsupervised Learning**. The key difference lies in whether the data used for training includes labeled outputs.

![Classification versus Clustering versus Regression](https://lh6.googleusercontent.com/proxy/b9cTY0TniOxMDzL0UEDPN9WdCMqxJ0ETnubKDQ37IIubX6NK1l_iGMkRZTzAdC-Xi3G2V9_jX9PlAQzsUd2g-LLxU7q0qM_KgzKiOeuIodms5uNEVQoy0xEw93U75fZPVT-R_-XN7D4h5L6E)

---

## Supervised Learning

Supervised learning is a type of machine learning where the model learns from **labeled data**. Each training example consists of an input and a corresponding correct output.

### How it works:

- The algorithm is trained on a dataset containing **inputs (X)** and **expected outputs (Y)**.

- The model makes predictions and adjusts itself based on the difference between its predictions and the actual labels.

- Once trained, the model can make predictions on new, unseen data; how accurate they are depends on how well the training data represent the new cases.
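
One common way a model "adjusts itself" is gradient descent on the prediction error. The sketch below is a minimal example under stated assumptions: a one-feature linear model, noise-free labels generated from `y = 3x - 2`, and a hand-picked learning rate.

```python
import numpy as np

# Labeled training data: inputs X and expected outputs Y (hypothetical).
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x - 2.0

# Model: prediction = w * x + b, starting from arbitrary parameters.
w, b = 0.0, 0.0
learning_rate = 0.1

for step in range(2000):
    # Difference between the model's predictions and the actual labels.
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    w -= learning_rate * 2.0 * np.mean(error * x)
    b -= learning_rate * 2.0 * np.mean(error)

print(f"Learned parameters: w = {w:.3f}, b = {b:.3f}")

# Prediction for a new, unseen input.
x_new = 0.25
print(f"Prediction for x = {x_new}: {w * x_new + b:.3f}")
```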

### Examples:

- **Email Spam Detection** – Given labeled emails ("spam" or "not spam"), the model learns to classify new emails.

- **Image Classification** – Identifying whether an image contains a cat or a dog.

- **Stock Price Prediction** – Predicting future stock prices based on historical labeled data.

---

## Unsupervised Learning

Unsupervised learning deals with **unlabeled data**. The model finds patterns and structures in the data without predefined labels.

### How it works:

- The algorithm analyzes input data **without any associated outputs**.

- It groups similar data points or identifies hidden structures.

- Often used for **clustering**, **anomaly detection**, and **pattern recognition**.
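
Grouping similar points without labels can be sketched with a small k-means implementation. This is a simplified version written for illustration (random initial centroids, fixed number of iterations), and the six sample points are hypothetical.

```python
import numpy as np

def k_means(points, k, n_iter=20, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two visibly separated groups of unlabeled points.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.1], [7.9, 8.3], [8.2, 7.8],
])

labels, centroids = k_means(points, k=2)
print("Cluster labels:", labels)
print("Centroids:\n", np.round(centroids, 2))
```

Which group receives label 0 and which receives label 1 depends on the random initialization; only the grouping itself is meaningful.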

### Examples:

- **Customer Segmentation** – Grouping customers by purchasing behavior.

- **Anomaly Detection** – Identifying fraudulent transactions in banking.

- **Topic Modeling** – Discovering topics in a collection of documents.

---

### Key Differences:

| Feature              | Supervised Learning | Unsupervised Learning |
|----------------------|--------------------|----------------------|
| **Data Type**        | Labeled data (X, Y) | Unlabeled data (X) |
| **Main Goal**        | Learn a mapping from inputs to outputs | Find hidden structures and patterns |
| **Typical Use Cases** | Classification, Regression | Clustering, Anomaly Detection |

![Classification versus Clustering](https://cdn.prod.website-files.com/614c82ed388d53640613982e/63ef769f6a877d715fa75008_supervised%20vs%20Unsupervised%20learning.jpg)

Both types of learning have unique applications and are used depending on the problem at hand.

# Feature Engineering

Feature engineering is the process of creating, transforming, and selecting the features used as input for machine learning models. The goal is to improve the quality of the data so the model can learn and make predictions more effectively.

---

### 🧬 **Biology: Diagnosing Diseases**

**Problem:** Predicting whether a patient has diabetes based on medical data.

- **Feature Creation:** Instead of using just raw glucose levels, we create a feature: 

  **"Glucose Level Change Rate"** = (Glucose after meal - Fasting Glucose) / Time  

  This captures how fast glucose levels rise, which can be more informative than a single reading.
  
- **Binning (Discretization):** Categorizing blood pressure as:
  - Normal (BP < 120)

  - Pre-Hypertension (120 ≤ BP < 140)

  - Hypertension (BP ≥ 140)
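
Both features can be sketched with NumPy. The patient values below are hypothetical, and the units (mg/dL, hours) are assumptions of this example; the bin edges follow the categories listed above.

```python
import numpy as np

# Hypothetical readings for four patients.
fasting_glucose = np.array([85.0, 95.0, 110.0, 130.0])      # mg/dL (assumed unit)
post_meal_glucose = np.array([115.0, 145.0, 190.0, 250.0])  # mg/dL (assumed unit)
elapsed_hours = np.array([2.0, 2.0, 2.0, 2.0])

# Feature creation: glucose level change rate.
change_rate = (post_meal_glucose - fasting_glucose) / elapsed_hours
print(change_rate)  # [15. 25. 40. 60.]

# Binning: Normal (BP < 120), Pre-Hypertension (120 <= BP < 140), Hypertension (BP >= 140).
blood_pressure = np.array([112, 120, 139, 140])
bin_edges = [120, 140]
bin_names = np.array(["Normal", "Pre-Hypertension", "Hypertension"])

# np.digitize returns 0 below the first edge, 1 between the edges, 2 from the last edge up.
categories = bin_names[np.digitize(blood_pressure, bin_edges)]
print(categories)
```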

---

### ⚡ **Physics: Predicting Energy Consumption**

**Problem:** Predicting electricity usage in a city.

- **Feature Extraction:** Instead of using just temperature data, extract:

  - **"Cooling Demand"** = Max(0, Temperature - 22°C)  

  - **"Heating Demand"** = Max(0, 18°C - Temperature)  

  This helps separate heating vs. cooling needs.

- **Time-Based Features:**  

  - **"Peak Hours"**: 6 AM – 9 AM, 5 PM – 9 PM (Higher usage during these hours)

  - **"Weekend vs. Weekday"**: Energy usage patterns differ.
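
These features translate almost directly into array operations. The sample values are hypothetical, and two details are assumptions of this sketch: the peak windows are treated as half-open hour ranges (6 to 9 and 17 to 21), and days are numbered Monday = 0 through Sunday = 6.

```python
import numpy as np

# Hypothetical hourly observations.
temperature_c = np.array([-5.0, 10.0, 20.0, 25.0, 31.0])
hour_of_day = np.array([3, 7, 12, 18, 21])
day_of_week = np.array([0, 2, 4, 5, 6])  # Monday = 0 ... Sunday = 6

# Feature extraction: separate cooling and heating needs.
cooling_demand = np.maximum(0.0, temperature_c - 22.0)
heating_demand = np.maximum(0.0, 18.0 - temperature_c)

# Time-based features: peak hours (6 AM to 9 AM, 5 PM to 9 PM) and weekends.
is_peak_hour = ((hour_of_day >= 6) & (hour_of_day < 9)) | (
    (hour_of_day >= 17) & (hour_of_day < 21)
)
is_weekend = day_of_week >= 5

print("Cooling demand:", cooling_demand)  # [0. 0. 0. 3. 9.]
print("Heating demand:", heating_demand)  # [23.  8.  0.  0.  0.]
print("Peak hour:", is_peak_hour)
print("Weekend:", is_weekend)
```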

---

### 💰 **Economics: Stock Market Prediction**

**Problem:** Predicting future stock prices.

- **Rolling Averages:** Instead of raw stock prices, calculate:

  - **"7-day Moving Average"** = Average closing price over the past 7 days.

  - **"Volatility"** = Standard deviation of stock prices over 30 days.
  
- **Sentiment Analysis:** Create a **"Market Sentiment Score"** from news headlines:

  - Positive words like “growth” → +1  

  - Negative words like “crisis” → -1  

- **Event-Based Features:**  

  - **"Earnings Report Released?"** (Yes/No) → Stocks react to earnings reports.
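
A minimal sketch of the rolling features and the word-based sentiment score. The price series is synthetic, the helper names are hypothetical, and the word lists contain only the two example words from the text, so this is far simpler than a real sentiment model.

```python
import numpy as np

def rolling_mean(values, window):
    """Mean over each run of `window` consecutive values."""
    return np.array([values[i:i + window].mean()
                     for i in range(len(values) - window + 1)])

def rolling_std(values, window):
    """Standard deviation over each run of `window` consecutive values."""
    return np.array([values[i:i + window].std()
                     for i in range(len(values) - window + 1)])

# Synthetic closing prices for 40 days (not real market data).
days = np.arange(40)
prices = 100.0 + 0.5 * days + 2.0 * np.sin(days)

moving_average_7 = rolling_mean(prices, window=7)
volatility_30 = rolling_std(prices, window=30)

POSITIVE_WORDS = {"growth"}  # +1 each
NEGATIVE_WORDS = {"crisis"}  # -1 each

def sentiment_score(headline):
    """Sum of +1 for each positive word and -1 for each negative word."""
    words = [w.strip(".,!?:;").lower() for w in headline.split()]
    return sum((w in POSITIVE_WORDS) - (w in NEGATIVE_WORDS) for w in words)

print(len(moving_average_7), len(volatility_30))   # 34 11
print(sentiment_score("Record growth reported"))   # 1
print(sentiment_score("Banking crisis deepens"))   # -1
```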

---

## Avoiding Overfitting

### **Biology: Disease Diagnosis**

- **Overfit Risk:** The model memorizes specific patients' data instead of generalizing.

- **Solution:** Use **cross-validation**, ensuring the model is tested on unseen patients.
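
A minimal sketch of k-fold cross-validation, assuming NumPy only. The helper `k_fold_indices`, the synthetic data, and the straight-line model are all hypothetical stand-ins for real patient data and a real diagnostic model.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Hypothetical data: one feature x and a noisy linear target y.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=30)
y = 0.5 * x + 3.0 + rng.normal(0.0, 0.3, size=30)

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(x), n_folds=5):
    # Fit only on the training part of this fold.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # Evaluate only on the samples the model has not seen.
    predictions = slope * x[test_idx] + intercept
    fold_errors.append(np.mean((predictions - y[test_idx]) ** 2))

print("MSE per fold:", np.round(fold_errors, 3))
print("Mean MSE:", round(float(np.mean(fold_errors)), 3))
```

A large gap between the error on the training part and the error on the held-out folds is a typical sign of overfitting.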

### **Physics: Energy Prediction**

- **Overfit Risk:** The model learns short-term weather noise instead of long-term trends.

- **Solution:** Apply **regularization (L1/L2)**, which penalizes large model weights so the model is less sensitive to short-term fluctuations.
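
The effect of L2 regularization can be sketched with the closed-form ridge regression solution $w = (X^T X + \alpha I)^{-1} X^T y$. The data are synthetic, `ridge_fit` is a hypothetical helper written for this example, and L1 regularization (which has no closed form) is not shown.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Least squares with an L2 penalty alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + rng.normal(0.0, 1.0, size=20)

w_unregularized = ridge_fit(X, y, alpha=0.0)
w_ridge = ridge_fit(X, y, alpha=10.0)

print("Without penalty:", np.round(w_unregularized, 2))
print("With L2 penalty:", np.round(w_ridge, 2))
print("Weight norms:", np.linalg.norm(w_unregularized), np.linalg.norm(w_ridge))
```

The penalty shrinks the weights toward zero, trading a little bias for lower sensitivity to noise in the training data.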

### **Economics: Stock Prediction**

- **Overfit Risk:** The model relies too much on short-term stock price fluctuations.

- **Solution:** Use **feature selection** to remove unnecessary variables.
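
One simple form of feature selection keeps only the features whose correlation with the target exceeds a threshold. The sketch below uses synthetic data; the helper `select_by_correlation` and the threshold of 0.5 are arbitrary choices for this illustration, and correlation filtering is only one of several selection strategies.

```python
import numpy as np

def select_by_correlation(X, y, threshold):
    """Return the column indices whose |Pearson r| with y is at least threshold."""
    selected = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            selected.append(j)
    return selected

# Synthetic data: column 0 drives the target, columns 1 and 2 are pure noise.
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([informative, noise_a, noise_b])
y = 2.0 * informative + rng.normal(size=n)

selected = select_by_correlation(X, y, threshold=0.5)
X_selected = X[:, selected]

print("Selected feature indices:", selected)
print("Shape after selection:", X_selected.shape)
```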

# Statistical Measures of Machine Learning Models