# DTE-2501 Week 2 Lecture Notes

## Basic Terminology in Supervised Learning

### Supervised Learning
- **Definition**: Trains a model on known input-output pairs to make future predictions.
- **Example**: Training a spam filter using emails labeled as spam or not spam.

### Notations
- **$X$**: Set of inputs
- **$Y$**: Set of outputs
- **$y: X \to Y$**: Target function, the ideal function we aim to approximate.
- **$a: X \to Y$**: Algorithm or decision function, our model's approximation of $y$.
- **$l$**: Cardinality of the data set, i.e., the number of samples in the training data.
- **$y_i = y(x_i)$**: Known responses or outputs for the training samples $x_i$, $i = 1, \dots, l$.

### Features
- **Feature (attribute)**: A characteristic used to describe an input, denoted as a mapping $f: X \to D_f$, where $D_f$ is the set of values the feature can take.
- **Feature Description**: The vector of feature values $(f_1(x), \dots, f_n(x))$ for a specific input $x$.
- **Feature Data Matrix**: A table containing feature descriptions for all inputs in the training data.

$$ 
F = \left( f_j(x_i) \right)_{l \times n} = 
\begin{pmatrix}
f_1(x_1) & \cdots & f_n(x_1) \\
\vdots & \ddots & \vdots \\
f_1(x_l) & \cdots & f_n(x_l)
\end{pmatrix}
$$ 
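
As a rough illustration of the feature data matrix, the sketch below builds $F$ for a toy data set. The inputs (small dictionaries describing houses) and the two feature functions are hypothetical and were chosen only to make the rows and columns concrete.

```python
# Hypothetical inputs: each object x_i is a dict describing a house.
houses = [
    {"rooms": 3, "area_m2": 70.0},
    {"rooms": 5, "area_m2": 120.0},
    {"rooms": 2, "area_m2": 45.5},
]

# Features are functions f_j: X -> D_f that map an input to a value.
features = [
    lambda x: x["rooms"],    # f_1: a discrete feature
    lambda x: x["area_m2"],  # f_2: a continuous feature
]

# F[i][j] = f_j(x_i): one row per input, one column per feature (l x n).
F = [[f(x) for f in features] for x in houses]

l, n = len(F), len(F[0])
print(F)     # [[3, 70.0], [5, 120.0], [2, 45.5]]
print(l, n)  # 3 2
```

Each row of `F` is the feature description of one input, which is the form most library estimators expect as their `X` argument.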

# Main Differences Between Unsupervised and Supervised Learning

### Supervision
- **Supervised Learning**: Utilizes labeled data for training.
- **Unsupervised Learning**: Utilizes unlabeled data for training.

### Objective
- **Supervised Learning**: Aims to learn a mapping from inputs to outputs.
- **Unsupervised Learning**: Aims to find hidden patterns or structures in the data.

## Methods Used

### Methods for Supervised Learning
1. **Linear Regression**
2. **Logistic Regression**
3. **Decision Trees**
4. **Random Forests**
5. **Support Vector Machines**
6. **Neural Networks**

### Methods for Unsupervised Learning
1. **Clustering** (e.g., K-means, Hierarchical clustering)
2. **Dimensionality Reduction** (e.g., PCA, t-SNE)
3. **Anomaly Detection** (e.g., Isolation Forest)
4. **Association Rule Mining** (e.g., Apriori)
5. **Autoencoders**


# Feature types

#### Quantitative
1. **Discrete**: Numeric features that take on specific, often integer, values.  
   *Example*: Number of rooms in a house.
   
2. **Continuous**: Numeric features that can take on any value within a range.  
   *Example*: Temperature, height.

#### Qualitative
1. **Nominal**: Categorical features that do not have an intrinsic order.  
   *Example*: Color, country names.
   
2. **Ordinal**: Categorical features that have a meaningful order, but intervals between values are not consistent.  
   *Example*: Star ratings.

3. **Binary**: Features that can take only two values.  
   *Example*: True/False, 0/1.
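
The feature type affects how values can be turned into numbers before modelling. The following is a minimal sketch with made-up values, not a prescribed method: a nominal feature is one-hot encoded so that no order is implied, while an ordinal feature is mapped to integer ranks so that the order is kept (the spacing between ranks is arbitrary).

```python
# Made-up sample values for illustration.
colors = ["red", "green", "blue", "green"]   # nominal feature
ratings = ["low", "high", "medium", "low"]   # ordinal feature

# Nominal: one-hot encoding, one 0/1 column per category.
categories = sorted(set(colors))             # ['blue', 'green', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Ordinal: map each level to its rank; only the order is meaningful.
rank = {"low": 0, "medium": 1, "high": 2}
ordinal = [rank[r] for r in ratings]

print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(ordinal)  # [0, 2, 1, 0]
```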


# Machine Learning Methodology with Iris Dataset

## Data Preprocessing

### 1. Acquire the Relevant Dataset

The Iris dataset is commonly used for machine learning experimentation and is readily available in many libraries like scikit-learn. Let's start by importing the dataset.


In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']

df.head()

### 2. Identifying the Missing Values

Although the Iris dataset is usually clean, in practice, datasets may contain missing values. We can identify these using pandas.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values

### 3. Splitting the Data Set into Two Separate Sets: Training Set and Test Set

Splitting the dataset into training and test sets is crucial for assessing the model's performance.

In [None]:
from sklearn.model_selection import train_test_split

# Features and Labels
X = df.iloc[:, :-1]
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4. Feature Scaling: Standardization and Normalization

Feature scaling is an important step, especially for algorithms that rely on the magnitude of the features. We will look into two methods: standardization and normalization.

#### Standardization

The formula for standardization, where $\sigma$ is the standard deviation of the feature $x$, is:

$$
x' = \frac{x - \text{mean}(x)}{\sigma}
$$

Here's how to perform standardization:

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on training set and transform both training and test set
X_train_standardized = scaler.fit_transform(X_train)
X_test_standardized = scaler.transform(X_test)

#### Normalization

The formula for normalization is:

$$
x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}
$$

Here's how to perform normalization:

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit on training set and transform both training and test set
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

### ELI5: Standardization and Normalization

#### What is Standardization?

Imagine you have a basket of apples and oranges, and you are trying to compare their weights. But here's the catch: apples are weighed in grams, and oranges are weighed in pounds. It would be like comparing apples to oranges, literally! Standardization is like converting all the weights to a common scale where the average is zero and the spread (the standard deviation) is one.

In math terms, you take each weight, subtract the average weight, and then divide by the standard deviation (a measure of how spread out the weights are).

$$
x' = \frac{x - \text{mean}(x)}{\sigma}
$$
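
To make the formula concrete, here is a small worked sketch using only the standard library. The weights are invented, and the population standard deviation (`statistics.pstdev`) is assumed, which matches what scikit-learn's `StandardScaler` computes.

```python
import statistics

weights = [150.0, 160.0, 170.0, 180.0, 190.0]  # invented apple weights in grams

mean = statistics.mean(weights)     # average weight
sigma = statistics.pstdev(weights)  # population standard deviation

# Subtract the mean, then divide by the standard deviation.
standardized = [(w - mean) / sigma for w in weights]

print(standardized)
# After standardization the mean is (approximately) 0 and the spread is 1.
print(statistics.mean(standardized), statistics.pstdev(standardized))
```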

#### What is Normalization?

Now, let's say you have a race between a snail and a rabbit. The snail covers a distance of 2 meters, while the rabbit covers 200 meters. Normalization rescales their distances so that the shortest becomes 0, the longest becomes 1, and everything else lands in between, much like a percentage.

In math terms, you take each distance, subtract the smallest distance, and then divide by the range of distances (max distance - min distance).

$$
x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}
$$
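
The same calculation can be sketched by hand. The distances are invented, with two extra runners added between the snail and the rabbit so that intermediate values are visible.

```python
distances = [2.0, 50.0, 125.0, 200.0]  # invented distances in meters

lo, hi = min(distances), max(distances)

# Subtract the minimum, then divide by the range.
# Note: this would divide by zero if all values were identical.
normalized = [(d - lo) / (hi - lo) for d in distances]

print(normalized)  # the first value is 0.0 and the last is 1.0
```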

#### Why Are They Important?

Both standardization and normalization are methods to scale your features, so that no particular feature has undue influence on the model's performance. This is crucial for machine learning algorithms that use distance metrics (like k-NN) or gradient descent (like neural networks) because these algorithms are sensitive to the magnitude of the features. By transforming the features, you make it easier for the algorithm to learn from the data and often improve the model's performance.

So, whether you're comparing apples to oranges, or racing snails against rabbits, standardization and normalization help put things on a level playing field.
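
One way to see this sensitivity is to compute Euclidean distances, as k-NN does, before and after scaling. The features and values below are invented (annual income in dollars, age in years), and min-max scaling is done by hand so that the sketch stays self-contained.

```python
import math

# Invented samples: (annual income in dollars, age in years).
samples = [
    (50_000.0, 25.0),  # query point
    (51_000.0, 60.0),  # a: similar income, very different age
    (58_000.0, 26.0),  # b: similar age, somewhat different income
    (90_000.0, 30.0),  # another sample that widens the income range
]
query, a, b, _ = samples

def min_max_scale(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

q_s, a_s, b_s, _ = min_max_scale(samples)

# Raw distances are dominated by income because of its larger magnitude.
raw_nearest = "a" if math.dist(query, a) < math.dist(query, b) else "b"
# After scaling, the 35-year age gap to `a` counts as much as income does.
scaled_nearest = "a" if math.dist(q_s, a_s) < math.dist(q_s, b_s) else "b"

print(raw_nearest, scaled_nearest)  # a b
```

The nearest neighbour of the query changes from `a` to `b` once both features are on the same scale, which is exactly the kind of effect scaling is meant to control.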


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style of seaborn for better visualization
sns.set(style="whitegrid")

# Create a pairplot to visualize the relationships between features
sns.pairplot(df, hue='target', vars=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])

# Show the plot
plt.show()


### ELI5: What is Feature Generation?

#### Simple Explanation

Imagine you're trying to predict how well a student will do in a math test. You have their grades in English and Science, but you know that math involves both language skills and logical reasoning. Feature generation is like creating a new 'subject' that combines both English and Science to give you a better idea of how the student might perform in math.

#### How It Works in Regression

In a regression problem, you're trying to predict a number (like the math grade) based on other numbers (like English and Science grades). Feature generation is about creating new numbers (features) that could help you make a better prediction.

---

### ELI Somewhat More Advanced: What is Feature Generation?

#### Detailed Explanation

In machine learning, especially in regression problems, the quality of your features can often determine the quality of your model. Feature generation involves creating new features from the existing ones, or even bringing in entirely new data that could be relevant to the problem. For example, in predicting house prices, if you have the width and length of the house, you might generate a new feature called 'Area' by multiplying them.

#### Mathematical Perspective

Mathematically, feature generation can involve various operations such as:

- **Linear Combinations**: $\text{NewFeature} = a \times \text{Feature1} + b \times \text{Feature2}$
- **Polynomial Features**: $\text{NewFeature} = \text{Feature1}^{2}$ or $\text{NewFeature} = \text{Feature1} \times \text{Feature2}$
- **Logarithmic or Exponential Transformations**: $\text{NewFeature} = \log(\text{Feature1})$
- **Interaction Terms**: $\text{NewFeature} = \text{Feature1} \times \text{Feature2} \times \text{Feature3}$
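
The operations in this list can be sketched in a few lines. The house measurements, the coefficients `a` and `b`, and the generated feature names are hypothetical and serve only to show the arithmetic.

```python
import math

# Hypothetical raw features for one house.
width, length, floors = 8.0, 12.5, 2.0
a, b = 0.5, 2.0  # arbitrary coefficients for the linear combination

generated = {
    "linear_combination": a * width + b * length,  # a * Feature1 + b * Feature2
    "width_squared": width ** 2,                   # polynomial term
    "area": width * length,                        # polynomial cross term
    "log_length": math.log(length),                # logarithmic transformation
    "floor_area": width * length * floors,         # three-way interaction term
}

for name, value in generated.items():
    print(f"{name}: {value:.3f}")
```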

#### Importance in Regression

In regression, feature generation can provide the model with more information, capture hidden relationships in the data, and often improve the model's performance. However, it's crucial to be cautious, as adding too many features can lead to overfitting, where the model becomes too complex and performs poorly on new data.

So, feature generation is like giving your model extra tools to solve the problem, but you have to make sure it doesn't get overwhelmed with too many tools.



# Loss Function in Machine Learning

## Overview

In machine learning, the primary goal is to find an optimal algorithm that performs well on a given dataset. To quantify how well an algorithm performs, we use a loss function $\epsilon(a, x)$. The loss function measures the discrepancy between the predicted output $a(x)$ and the true output $y(x)$ for a training sample $x$.

The loss function $\epsilon(a, x)$ depends on the type of problem we are trying to solve.

## Types of Loss Functions

### Classification

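In classification problems, the loss function is usually an error indicator: it equals 1 if the algorithm misclassifies a sample and 0 otherwise. The square brackets in the formula are Iverson brackets, where $[P]$ is 1 if the statement $P$ is true and 0 if it is false.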

$$
\epsilon(a, x) = [a(x) \neq y(x)]
$$
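
A minimal sketch of the error indicator in code follows. The function name `error_indicator` and the spam/ham labels are made up for illustration.

```python
def error_indicator(prediction, target):
    """Return 1 if the prediction a(x) differs from the true label y(x), else 0."""
    return int(prediction != target)

predictions = ["spam", "ham", "spam", "ham"]   # a(x_i), made up
targets     = ["spam", "spam", "spam", "ham"]  # y(x_i), made up

losses = [error_indicator(p, t) for p, t in zip(predictions, targets)]
print(losses)  # [0, 1, 0, 0]
```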

### Regression

In regression problems, common loss functions include:

1. **Absolute Error**: Measures the absolute difference between the predicted and true outputs.

$$
\epsilon(a, x) = |a(x) - y(x)|
$$

2. **Squared Error**: Measures the squared difference between the predicted and true outputs.

$$
\epsilon(a, x) = (a(x) - y(x))^2
$$
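
The two regression losses can be compared on a pair of made-up predictions. The sketch is only meant to show that the squared error penalizes large deviations much more heavily than the absolute error does.

```python
def absolute_error(prediction, target):
    """|a(x) - y(x)|"""
    return abs(prediction - target)

def squared_error(prediction, target):
    """(a(x) - y(x))^2"""
    return (prediction - target) ** 2

# Made-up (prediction, target) pairs: one small miss and one large miss.
pairs = [(2.5, 3.0), (10.0, 7.0)]

for p, t in pairs:
    print(absolute_error(p, t), squared_error(p, t))
# 0.5 0.25
# 3.0 9.0
```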

## Empirical Risk

To evaluate the performance of an algorithm on the entire dataset, we introduce the concept of *empirical risk*. It is essentially the average loss over all training samples.

$$
Q(a, X_l) = \frac{1}{l} \sum_{i=1}^{l} \epsilon(a, x_i)
$$

Here, $X_l$ is the training dataset containing $l$ samples. The objective in machine learning is often to minimize this empirical risk to find the best algorithm $a$.
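
The empirical risk is simply an average, which a short sketch can make explicit. The data (generated from $y = 2x + 1$) and the deliberately imperfect model `candidate` are assumptions made for this example.

```python
def squared_error(prediction, target):
    return (prediction - target) ** 2

def empirical_risk(model, xs, ys, loss):
    """Q(a, X_l): average loss of `model` over the samples (x_i, y_i)."""
    return sum(loss(model(x), y) for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

def candidate(x):
    """Hypothetical algorithm a(x) = 2x, which ignores the offset."""
    return 2.0 * x

print(empirical_risk(candidate, xs, ys, squared_error))  # 1.0
```

Every prediction is off by exactly 1, so the average squared error is 1.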

In summary, the loss function serves as the cornerstone in optimizing machine learning algorithms. Different types of problems require different types of loss functions, and the empirical risk gives us a single value that summarizes the algorithm's performance on the training dataset.


# ELI5: Minimization of the Empirical Risk 

Think of empirical risk minimization like trying to get the best grades in your class. You have various subjects (features), like Math, Science, and English. Your grade (the output) depends on how well you perform in each of these subjects. 

Imagine you're trying different study methods to improve your grades. The "best method" is the one that gives you the highest grades while putting in the least amount of study time.

In the given example, think of:
- $X_l$ as different combinations of time you spend studying different subjects.
- $a$ as a study method you're trying out.
- $Q(a, X_l)$ as how good or bad your grades are with a given study method and time combination.
- $\arg\min_a$ means you're looking for the study method $a$ that makes the "badness" $Q$ as small as possible.

So, $\mu(X_l) = \arg\min_a Q(a, X_l)$ means you're trying to find the best study method that gives you the best grades for the time you put in for each subject.

---

# ELI More Advanced: Empirical Risk Minimization

Empirical Risk Minimization (ERM) is a principle used in machine learning for finding a model that best fits the available data. Here, $\mu$ represents a learning method that takes a dataset $X_l$ and returns the model $a$ that minimizes the empirical risk $Q(a, X_l)$, i.e. the average loss over the training samples.

- $\mu$: Learning method
- $X_l$: Dataset
- $Q(a, X_l)$: The "risk" or error of using model $a$ on dataset $X_l$
- $\arg\min_a Q(a, X_l)$: Find the model $a$ that minimizes this risk
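
When the set of candidate models is small and finite, the $\arg\min$ can be taken literally by trying each one. The sketch below does this with three hypothetical candidates and made-up data; `learn` plays the role of $\mu$, and real learning methods search much larger model families with optimization rather than enumeration.

```python
def empirical_risk(model, xs, ys):
    """Q(a, X_l): mean squared error of `model` on the training set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learn(candidates, xs, ys):
    """Toy learning method mu: return the name of the candidate with the smallest risk."""
    return min(candidates, key=lambda name: empirical_risk(candidates[name], xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data, generated from y = 2x + 1

# Hypothetical candidate models a(x).
candidates = {
    "x": lambda x: x,
    "2x": lambda x: 2.0 * x,
    "2x + 1": lambda x: 2.0 * x + 1.0,
}

best = learn(candidates, xs, ys)
print(best, empirical_risk(candidates[best], xs, ys))  # 2x + 1 0.0
```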

## Linear Regression as an Example

In the case of linear regression, we have:
- $Y = \mathbb{R}$: Real-valued outputs
- $n$ features $f_j: X \to \mathbb{R}$, for $j = 1, \dots, n$
- Linear model: $g(x, \theta) = \sum_{j=1}^{n} \theta_j f_j(x)$, with $\theta \in \mathbb{R}^n$
- Squared error: $\epsilon(a, x) = (a(x) - y(x))^2$

For linear regression, ERM becomes a least squares optimization problem:

$$
\mu(X_l) = \arg\min_\theta \sum_{i=1}^{l} (g(x_i, \theta) - y_i)^2
$$

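This finds the values of $\theta$ that minimize the sum of the squared differences between the predicted and actual outputs, effectively "fitting" the model to the data. The factor $\frac{1}{l}$ from the empirical risk is omitted because it does not change which $\theta$ attains the minimum.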
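
As a small worked sketch of this least squares problem, the code below fits a model with two features, $f_1(x) = 1$ and $f_2(x) = x$, to made-up data. Solving the normal equations $(F^\top F)\,\theta = F^\top y$ by hand for the two-feature case is a choice made here for self-containment, not something these notes prescribe; in practice a library routine such as `numpy.linalg.lstsq` or scikit-learn's `LinearRegression` would typically be used.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data, generated from y = 1 + 2x

# Feature matrix with two features: f_1(x) = 1 (constant) and f_2(x) = x.
F = [[1.0, x] for x in xs]

# Entries of F^T F (a symmetric 2x2 matrix) and of F^T y.
s11 = sum(row[0] * row[0] for row in F)
s12 = sum(row[0] * row[1] for row in F)
s22 = sum(row[1] * row[1] for row in F)
b1 = sum(row[0] * y for row, y in zip(F, ys))
b2 = sum(row[1] * y for row, y in zip(F, ys))

# Solve the 2x2 linear system with Cramer's rule.
det = s11 * s22 - s12 * s12
theta = ((b1 * s22 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)

def g(x, theta):
    """Linear model g(x, theta) = theta_1 * f_1(x) + theta_2 * f_2(x)."""
    return theta[0] * 1.0 + theta[1] * x

residual = sum((g(x, theta) - y) ** 2 for x, y in zip(xs, ys))
print(theta, residual)  # (1.0, 2.0) 0.0
```

Because the data lie exactly on a line, the fitted parameters recover the intercept 1 and slope 2, and the sum of squared errors is zero.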

