<a href="https://colab.research.google.com/github/MrSimple07/MachineLearning_ITMO/blob/main/machine_learning_exam_1st_semester.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Technologies (September 2023) EXAM


1. Supervised learning (an overview of the tasks and algorithms).
2. Unsupervised learning (an overview of the tasks and algorithms).
3. Machine learning and Bayes theorem. Prior and posterior distribution.
4. Error decomposition. Bias and variance tradeoff.
5. Linear regression model and Logistic regression.
6. The method of the k-nearest neighbors. Support vector machines. Kernel trick.
7. Decision trees and principles of their construction.
8. Types of features. Feature selection approaches. One-hot-encoding.
9. Data visualization and dimension-reduction algorithms: PCA and t-SNE.
10. Classification metrics. Accuracy, precision, recall, F1-score, log-loss, ROC-AUC. Types of errors, confusion matrix. Metrics of accuracy for regression models.
11. Construction of ensembles of algorithms. Random Forest.
12. Clustering algorithms. K-means and DBSCAN. Estimation of clustering quality.
13. Multilayer perceptron. Activation functions and loss functions in neural networks. Parameters and hyperparameters.
14. Training of the deep neural network as an optimization problem. Gradient descent, stochastic gradient descent, Momentum, RMSProp and Adam algorithms.
15. Deep multi-layer neural networks. Backpropagation algorithm. The problem of vanishing and exploding gradients and methods for solving it.
16. Datasets: train, test, validation (dev) sets. Cross-validation. Monitoring the learning process. Overfitting.
17. Convolutional neural networks (CNNs): convolution, pooling, padding, feature maps, low-level and high-level features.
18. Transfer learning approach. An overview of modern CNN architectures and open-source datasets. Advantages and disadvantages of modern CNNs.
19. Natural language processing. Bag of words approach. TF-IDF method. Stemming and lemmatization. Stop words.
20. Word embeddings. Skip-gram model. Word2vec, Glove, BERT.
21. Sequence analysis tasks. Simple recurrent neural network architecture.
22. LSTM and GRU cells. Memory in neural networks.


# 1 Supervised Learning

Supervised learning is a type of machine learning where the model learns from **examples with correct answers** (called labels). It’s like learning with a teacher: you see input data and are told the correct output.

---

## 🔍 How It Works

1. **Input Data (X)**: Features (like size, color, age, etc.)
2. **Output Labels (Y)**: Correct answers (like price, category, etc.)
3. **Model**: A mathematical function learns the relationship between X and Y.
4. **Training**: The model adjusts itself to reduce mistakes.
5. **Prediction**: After training, it can guess Y for new X.

---

## 🎯 Goals (Tasks)

### 1. Classification
- Predicts a **category or class** (discrete value).
- Example: Email → Spam or Not Spam

### 2. Regression
- Predicts a **number** (continuous value).
- Example: House Features → Price

---

## ⚙️ Common Algorithms

| Type          | Algorithms                            |
|---------------|----------------------------------------|
| Classification| Logistic Regression                    |
| Regression    | Linear Regression                      |
| Both          | k-NN, SVM, Decision Trees, Random Forest, XGBoost, Neural Networks |

---

## 📈 Process

1. **Collect Data** (with labels)
2. **Split** into training/test sets
3. **Train** the model on training set
4. **Test** on unseen data
5. **Evaluate** using metrics (accuracy, MAE, etc.)
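
The five steps above can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the toy dataset, the 80/20 split, and the nearest-centroid classifier are all assumptions chosen to keep the example short.

```python
import random

# 1. Collect data: hypothetical toy dataset with one feature x,
#    labeled 1 if x > 5 and 0 otherwise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 1 if x > 5 else 0) for x in xs]

# 2. Split into training and test sets (80% / 20%, an assumed ratio).
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 3. Train: a nearest-centroid classifier stores the mean feature value per class.
def fit(rows):
    centroids = {}
    for label in {y for _, y in rows}:
        values = [x for x, y in rows if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

# Prediction: pick the class whose centroid is closest to x.
def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

model = fit(train)

# 4-5. Test on unseen data and evaluate with accuracy.
correct = sum(predict(model, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The key point is that `accuracy` is computed only on `test`, which the model never saw during `fit`.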

---

## ✅ Pros
- Performance is directly measurable against known labels
- Works well with enough labeled data

## ❌ Cons
- Needs labeled data (can be expensive to get)
- May not work well with noisy or biased data

---

## 📌 Summary
> Supervised learning teaches machines to map inputs to outputs using example data. It's used for tasks like spam detection, price prediction, medical diagnosis, and more.



# 2 🤖 Unsupervised Learning — Explained Simply

Unsupervised learning is a type of machine learning where the model **finds patterns** in data **without any labels**. It's like exploring a new place with no guide — the model figures out the structure on its own.

---

## 🔍 How It Works

1. **Input Data (X)**: Only features, no correct answers
2. **Model**: Learns hidden patterns, groupings, or structures in the data
3. **Goal**: Discover insights or simplify the data

---

## 🎯 Goals (Tasks)

### 1. Clustering
- Group similar items together
- Example: Group customers by shopping behavior

### 2. Dimensionality Reduction
- Reduce the number of features while keeping key information
- Example: Visualizing high-dimensional data in 2D

---

## ⚙️ Common Algorithms

| Task                   | Algorithms                            |
|------------------------|----------------------------------------|
| Clustering             | K-Means, DBSCAN, Hierarchical Clustering |
| Dimensionality Reduction | PCA, t-SNE, Autoencoders               |
| Association Rule Mining | Apriori, Eclat                        |

---

## 📈 Process

1. **Collect Data** (no labels needed)
2. **Preprocess** (clean, scale, etc.)
3. **Train** the model to find patterns
4. **Analyze** output (e.g. clusters, components)
5. **Interpret** results for decision-making
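
As a rough illustration of the "train" and "analyze" steps, here is a minimal sketch of k-means (one of the clustering algorithms listed above) on 1-D data. The data points, the fixed starting centroids, and the iteration count are assumptions made so the example is reproducible; real implementations usually pick starting centroids randomly.

```python
# Hypothetical 1-D data with two visible groups (e.g., customer spend).
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]

def kmeans_1d(points, centroids, n_iter=10):
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(centroids)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between points.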

---

## ✅ Pros
- No need for labeled data
- Good for exploring unknown patterns

## ❌ Cons
- Harder to evaluate results
- Can find meaningless patterns if not used carefully

---

## 📌 Summary
> Unsupervised learning helps machines understand the hidden structure of data — great for clustering, visualization, and data exploration when you don't have labels.


# 3 🧠 Machine Learning & Bayes' Theorem

Bayes’ Theorem is a way to **update our beliefs** (probabilities) based on **new evidence**. It’s used in **probabilistic models** in Machine Learning.

---

## 📘 Formula

**Bayes' Theorem:**
$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

---

## 🧩 Key Terms

- **Prior (P(A))**: What we believe before seeing data
- **Likelihood (P(B|A))**: Probability of seeing data if A is true
- **Posterior (P(A|B))**: Updated belief after seeing data
- **Evidence (P(B))**: Total probability of the data

---

## 🤖 In ML

Used in:
- **Naive Bayes classifier**
- **Bayesian inference**
- **Bayesian neural networks**

---

## 🛠️ Example (Spam Detection)

- **Prior**: Chance an email is spam (e.g., 20%)
- **Likelihood**: If an email is spam, how likely is it to contain "free"?
- **Posterior**: New spam probability after seeing "free"
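
The spam example can be worked through numerically. The 20% prior comes from the example above; the two likelihood values are made-up numbers used only for illustration.

```python
# Prior from the example above; the two likelihoods are assumed values.
p_spam = 0.20                 # P(spam)
p_free_given_spam = 0.60      # P("free" | spam), assumed
p_free_given_ham = 0.05       # P("free" | not spam), assumed

# Evidence: total probability of seeing "free" in any email.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior: P(spam | "free") by Bayes' theorem.
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.75
```

Under these assumed numbers, seeing "free" raises the spam probability from 20% to 75%.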

---

## 🎯 Summary

> Bayes’ theorem lets ML models **learn from evidence** by updating probabilities. It’s the core of **Bayesian thinking** in AI.


# 4 🎯 Error Decomposition: Bias-Variance Tradeoff

In Machine Learning, total prediction error can be split into three parts: **bias**, **variance**, and **irreducible error**.

---

## 📊 Total Error = Bias² + Variance + Irreducible Error
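
For squared-error loss this decomposition can be written out explicitly. The notation is introduced here and is not from the notes: the data are assumed to follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
$$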

---

### 📌 Bias
- **What**: Error from wrong assumptions in the model.
- **High Bias**: Model is too simple → underfitting.
- **Example**: Linear model for complex patterns.

---

### 📌 Variance
- **What**: Error from model sensitivity to training data.
- **High Variance**: Model is too complex → overfitting.
- **Example**: Model learns noise in training data.

---

### ⚖️ Tradeoff
- **Goal**: Find the balance between bias and variance.
- **Simple model** → low variance, high bias.
- **Complex model** → low bias, high variance.

---

## 📈 Visualization

- Underfitting: 🎯 consistently misses the center in the same way → high bias
- Overfitting: 🎯 hits different spots every time → high variance
- Good fit: 🎯 hits near the center consistently

---

## 🧠 Summary

> The bias-variance tradeoff explains **why models make errors** and helps us choose the **right model complexity**.


# 5 📉 Linear Regression vs 🔐 Logistic Regression

---

## 🔢 Linear Regression

- **Goal**: Predict a continuous value (e.g., price, weight).
- **Formula**:  
  `y = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ`  
  (a straight line or hyperplane)

- **How it works**:  
  Finds the best line that fits the training data by minimizing the difference between predicted and actual values (using **Mean Squared Error**).

- **Example**:  
  Predict house price based on area and number of rooms.
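
For a single feature, the line that minimizes Mean Squared Error has a closed-form solution. The sketch below assumes made-up data generated from `y = 2x + 1` with no noise, so the recovered weights should match exactly.

```python
# Hypothetical data generated from y = 2x + 1 (no noise), so the fit is exact.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x (minimizes MSE)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov_xy / var_x
    w0 = mean_y - w1 * mean_x
    return w0, w1

w0, w1 = fit_line(xs, ys)
mse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w0, w1, mse)
```

With several features the same idea applies, but the solution is usually computed with matrix algebra or gradient descent rather than by hand.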

---

## 🚦 Logistic Regression

- **Goal**: Predict probability for **classification** (e.g., spam or not spam).
- **Formula**:  
  `P(y=1|x) = 1 / (1 + e^-(w₀ + w₁x₁ + ... + wₙxₙ))`  
  (sigmoid function)

- **How it works**:  
  Outputs a number between 0 and 1 → interprets as probability → sets threshold (e.g., 0.5) for classification.

- **Example**:  
  Predict if a student passes (yes/no) based on study hours.
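
The sigmoid-then-threshold idea can be sketched as follows. The weights `w0` and `w1` are hand-picked, hypothetical values for the study-hours example; in practice they would be learned by minimizing log loss.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, hand-picked weights for "pass probability vs. study hours".
w0, w1 = -4.0, 1.0

def predict_proba(hours):
    """P(pass | hours) under the assumed weights."""
    return sigmoid(w0 + w1 * hours)

def predict(hours, threshold=0.5):
    """Turn the probability into a class label using the threshold."""
    return 1 if predict_proba(hours) >= threshold else 0

for hours in (2, 4, 6):
    print(hours, round(predict_proba(hours), 3), predict(hours))
```

With these weights the decision boundary sits at 4 hours, where the linear part equals zero and the sigmoid returns exactly 0.5.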

---

## 🧠 Summary Table

| Feature               | Linear Regression         | Logistic Regression          |
|----------------------|---------------------------|------------------------------|
| Output Type          | Continuous number          | Probability / Class (0 or 1) |
| Used For             | Regression problems        | Binary classification        |
| Activation Function  | None (just linear)         | Sigmoid                      |
| Loss Function        | Mean Squared Error (MSE)   | Log Loss (Cross Entropy)     |


# 6 🤖 k-NN, SVM, and Kernel Trick

---

## 🧭 k-Nearest Neighbors (k-NN)

- **What it is**: A lazy, simple algorithm that classifies a point by looking at its **k closest neighbors**.
- **How it works**:
  1. Choose `k` (e.g., 3).
  2. Measure distance (usually Euclidean) to all points in training set.
  3. Find the `k` nearest ones.
  4. Assign the most common label among them (for regression, average their values).

- **Use**: Classification or regression.

- **Example**: To classify a fruit by shape/size, check the 3 most similar fruits already labeled.
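
The four steps above fit in a few lines of Python. This is a minimal sketch for classification only; the fruit measurements are invented for illustration, and `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Steps 2-3: sort training points by Euclidean distance, keep the k nearest.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Step 4: majority vote among the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical fruit data: (weight in grams, diameter in cm) -> label.
fruits = [
    ((150, 7.0), "apple"),
    ((160, 7.5), "apple"),
    ((170, 7.2), "apple"),
    ((10, 2.0), "grape"),
    ((12, 2.2), "grape"),
    ((9, 1.8), "grape"),
]

print(knn_predict(fruits, (155, 7.1)))  # apple
print(knn_predict(fruits, (11, 2.1)))   # grape
```

There is no training step at all, which is why k-NN is called lazy: all the work (computing distances to every stored point) happens at prediction time.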

---

## ⚖️ Support Vector Machines (SVM)

- **What it is**: A powerful classifier that finds the **best boundary (hyperplane)** that separates classes.
- **Goal**: Maximize the **margin** between two classes.

- **How it works**:
  - Finds a hyperplane that best separates the data.
  - Only the closest points (called **support vectors**) affect the boundary.

- **Use**: Works well for high-dimensional data and text classification.

- **Example**: Email spam detection, image classification.

---

## 🎯 Kernel Trick

- **Problem**: Some data is **not linearly separable** (can’t draw a straight line to split).
- **Solution**: Use a **kernel** function to map data into a **higher dimension** where it becomes separable.

- **Popular Kernels**:
  - Polynomial
  - Radial Basis Function (RBF)

- **Key idea**: Do math to simulate higher dimensions **without actually computing them** (saves time and memory).
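
A small numeric check makes the key idea concrete. The sketch below assumes the degree-2 homogeneous polynomial kernel and 2-D inputs; `phi` is the explicit feature map that this particular kernel corresponds to.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (x1, x2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(a), phi(b))   # inner product computed in 3-D feature space
implicit = poly_kernel(a, b)     # same value, computed in the original 2-D space
print(explicit, implicit)
```

Both numbers agree (up to floating-point rounding), but `poly_kernel` never builds the higher-dimensional vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the shortcut is the only practical option.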

---

## 🧠 Summary

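| Method       | Type                      | Strength                                    | Weakness                                  |
|--------------|---------------------------|---------------------------------------------|-------------------------------------------|
| k-NN         | Lazy learner              | Easy to understand and use                  | Slow for large datasets                   |
| SVM          | Maximum-margin classifier | Works well with high-dimensional data       | Not good for very large datasets          |
| Kernel Trick | Technique used with SVM   | Handles data that is not linearly separable | Kernel and its parameters must be chosen carefully |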


# 7 🌳 Decision Trees and How They Work

---

## ✅ What is a Decision Tree?

A **decision tree** is a flowchart-like model used for **classification** or **regression**.  
It splits data into **branches** based on features, leading to a final **decision (leaf)**.

---

## 🧱 How It Works

1. **Start** at the root (all data).
2. **Choose the best feature** to split the data (based on criteria like Gini or Entropy).
3. **Split** the data into groups.
4. Repeat for each branch until:
   - All data in a node is pure (same class), or
   - Max depth or min samples is reached.

---

## 🔍 Splitting Criteria

- **Gini Impurity** (used in CART): Measures how mixed the labels are.
- **Entropy & Information Gain** (used in ID3/C4.5):
  - Entropy: Disorder in the data.
  - Info Gain: How much disorder is reduced by the split.
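
These criteria are easy to compute by hand. The following sketch uses the standard definitions of Gini impurity, Shannon entropy (in bits), and information gain; the "buy" / "no" labels are an invented example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical node with 4 "buy" and 4 "no" labels, split into two pure children.
parent = ["buy"] * 4 + ["no"] * 4
left, right = ["buy"] * 4, ["no"] * 4

print(gini(parent))                             # 0.5
print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # 1.0
```

A 50/50 node is maximally mixed for two classes, and a split into two pure children removes all of that disorder, so the information gain equals the parent's full entropy.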

---

## 🧠 Example

If predicting if a person will buy a phone:
- Root: Age
  - Age < 30 → Student?
    - Yes → Buys
    - No → Doesn’t buy
  - Age ≥ 30 → Income?
    - High → Buys
    - Low → Doesn’t buy

---

## ⚖️ Pros and Cons

| Pros                        | Cons                             |
|-----------------------------|----------------------------------|
| Easy to understand and use  | Can overfit (too specific)       |
| Works with numeric/categorical data | Not great with noisy data   |
| No need to normalize data   | Unstable under small data changes |

---

## 🌲 Final Tip

For better performance, use **Random Forests** (many trees combined) to reduce overfitting.


# 8 🔢 Features in Machine Learning

---

## 🧱 Types of Features

1. **Numerical (Continuous)**  
   - Example: Age, Salary  
   - Can be any number.

2. **Categorical (Nominal)**  
   - Example: Gender, Country  
   - Unordered categories, stored as labels or strings.

3. **Ordinal**  
   - Ordered categories.  
   - Example: Education level (High School < Bachelor < Master).

4. **Boolean/Binary**  
   - True/False, 0/1  
   - Example: IsStudent = Yes/No

5. **Text / Time / Image / Audio**  
   - Require special preprocessing.

---

## 🎯 Feature Selection Approaches

Used to pick only the **most useful features** (avoid noise and reduce overfitting):

1. **Filter Methods**  
   - Use statistics like correlation, chi-squared.  
   - Fast but ignore model performance.

2. **Wrapper Methods**  
   - Use model performance to evaluate combinations (e.g., forward/backward selection).  
   - More accurate but slower.

3. **Embedded Methods**  
   - Feature selection built into the model (e.g., Lasso Regression, Decision Trees).

---

## 📦 One-Hot Encoding

A method to convert **categorical features** into numbers:

- Creates a new column for each unique value.
- Puts `1` where it matches, `0` elsewhere.

### Example:

| Country   | → One-Hot |
|-----------|-----------|
| France    | [1, 0, 0] |
| Germany   | [0, 1, 0] |
| Spain     | [0, 0, 1] |

Used to make **categorical data usable** by machine learning models.
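
A hand-rolled version shows what happens under the hood. This is a sketch only: the function name `one_hot_encode` is made up for this example, and sorting the categories is an assumption to keep the column order reproducible.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per unique category."""
    categories = sorted(set(values))           # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                      # 1 in the matching column only
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["France", "Germany", "Spain", "France"])
print(categories)  # ['France', 'Germany', 'Spain']
print(rows)        # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row contains exactly one `1`, so the encoding does not impose any artificial order on the categories.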

---

## 🧠 Summary

- Different feature types need different handling.
- Feature selection improves performance.
- One-hot encoding transforms categories into machine-friendly format.


# 9 📊 Data Visualization & Dimensionality Reduction

---

## 📈 Data Visualization

**Data visualization** is the process of representing data graphically. It helps in understanding the structure, patterns, and relationships in the data. Common visualizations:

- **Histograms**: Distribution of a single variable.
- **Scatter Plots**: Relationship between two continuous variables.
- **Bar Charts**: Comparison of categorical data.
- **Heatmaps**: Correlation or intensity of data across dimensions.

---

## 🔽 Principal Component Analysis (PCA)

PCA is a **dimensionality reduction** technique that transforms data into fewer dimensions while retaining as much information as possible.

### How It Works:
- Identifies the **principal components** (directions in which data varies the most).
- Projects the data onto these components.
- Reduces dimensions by keeping only the most important components.

### Example:
For a dataset with many variables (features), PCA reduces it to 2-3 main features that still represent the data well, making it easier to visualize or analyze.
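
To make the steps concrete, here is a minimal sketch of PCA restricted to 2-D data, where the first principal component can be found in closed form from the 2x2 covariance matrix. The points are invented and lie exactly on the line `y = x`; for more dimensions one would normally use an eigendecomposition or SVD routine from a numerical library.

```python
import math

def first_principal_component(points):
    """Direction of maximum variance for 2-D points, as a unit vector."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # For a symmetric 2x2 matrix, the top eigenvector lies at this angle.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

# Hypothetical points lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
mean, pc1 = first_principal_component(points)

# Project each centered point onto the component: 2-D -> 1-D coordinates.
projected = [(x - mean[0]) * pc1[0] + (y - mean[1]) * pc1[1] for x, y in points]
print(pc1)
print(projected)
```

Because all the variation lies along the diagonal, a single coordinate per point is enough to describe this data without losing information.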

---

## 🔍 t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a **non-linear dimensionality reduction** technique designed for **visualizing high-dimensional data** in 2 or 3 dimensions. It focuses on preserving local structures.

### How It Works:
- It minimizes the mismatch (KL divergence) between the pairwise similarities of points in the high-dimensional space and in the low-dimensional space.
- Helps reveal patterns like clusters or groups in data.

### When to Use:
- Best used for visualizing data like images, text embeddings, or anything with high-dimensional features.

### Example:
It’s often used to visualize the clustering of data, like when you apply it to a neural network's activations or a word embedding.

---

## 📉 Summary

- **Data Visualization** helps to understand data visually.
- **PCA** is a linear method that reduces dimensionality while retaining as much variance as possible.
- **t-SNE** is a non-linear method focused on preserving local relationships for better visualization.



# 10 🧮 Classification Metrics & Evaluation

---

## 🔢 Key Classification Metrics

### 1. **Accuracy**
- **Definition**: The proportion of correctly predicted instances out of the total instances.
- **Formula**:
  $$
  \text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Instances}}
  $$
- **Use Case**: Good for balanced datasets but not ideal for imbalanced datasets.

### 2. **Precision**
- **Definition**: The proportion of true positive predictions out of all positive predictions made.
- **Formula**:
  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}
  $$
- **Use Case**: Useful when false positives are more costly (e.g., email spam detection).

### 3. **Recall (Sensitivity or True Positive Rate)**
- **Definition**: The proportion of true positive predictions out of all actual positives.
- **Formula**:
  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
  $$
- **Use Case**: Useful when missing a positive instance is costly (e.g., detecting diseases).

### 4. **F1-Score**
- **Definition**: The harmonic mean of precision and recall, balancing both metrics.
- **Formula**:
  $$
  \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}
  $$
- **Use Case**: Useful when both false positives and false negatives are important and you need a balanced metric.

### 5. **Log-Loss (Logarithmic Loss)**
- **Definition**: Measures how well the predicted probabilities match the true labels. Confident wrong predictions are penalized most heavily.
- **Formula**:
  $$
  \text{Log-Loss} = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
  $$
  where $y_i$ is the true label (0 or 1) and $p_i$ is the predicted probability of the positive class.
- **Use Case**: Especially useful for probabilistic classifiers (e.g., logistic regression, neural networks).
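
A direct implementation of binary log-loss shows how strongly confident mistakes are punished. The labels and probabilities are invented, and the clipping constant `eps` is an assumption added to avoid taking `log(0)`.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over N samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

print(round(log_loss(y_true, confident_right), 4))  # 0.1054
print(round(log_loss(y_true, confident_wrong), 4))  # 2.3026
```

Flipping the same probabilities from right to wrong increases the loss by more than a factor of 20.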

### 6. **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**
- **Definition**: A performance measurement for classification problems across all threshold settings.
- **Use Case**: Evaluates the trade-off between true positive rate and false positive rate. The higher the AUC, the better the model.
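
AUC has a handy interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by counting pairs directly, which is simple but quadratic in the number of samples; the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties in score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, regardless of which threshold is later chosen.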

---

## 📊 Confusion Matrix

A **Confusion Matrix** is a table that describes the performance of a classification model by comparing predicted labels with true labels. For binary classification it contains four values:

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| **Actual Positive**   | True Positive (TP)  | False Negative (FN)  |
| **Actual Negative**   | False Positive (FP) | True Negative (TN)  |

- **True Positive (TP)**: Correctly predicted positive class.
- **False Positive (FP)**: Incorrectly predicted as positive.
- **True Negative (TN)**: Correctly predicted negative class.
- **False Negative (FN)**: Incorrectly predicted as negative.

### **Types of Errors**
- **False Positive (Type I error)**: Incorrectly predicting a positive when it’s actually negative.
- **False Negative (Type II error)**: Incorrectly predicting a negative when it’s actually positive.
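
The four confusion-matrix counts are enough to compute accuracy, precision, recall, and F1-score. The sketch below uses invented labels and predictions, and it assumes at least one predicted positive and one actual positive (otherwise precision or recall would divide by zero).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, tn, fn)                   # 3 1 5 1
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

Here the single false positive is a Type I error and the single false negative is a Type II error.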

---

## 📏 Metrics for Regression Models

### 1. **Mean Absolute Error (MAE)**
- **Definition**: The average of the absolute errors between predicted and actual values.
- **Formula**:
  $$
  \text{MAE} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|
  $$

### 2. **Mean Squared Error (MSE)**
- **Definition**: The average of the squared differences between predicted and actual values. More sensitive to large errors.
- **Formula**:
  $$
  \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
  $$

### 3. **Root Mean Squared Error (RMSE)**
- **Definition**: The square root of MSE. Gives an idea of the magnitude of error in the same units as the original data.
- **Formula**:
  $$
  \text{RMSE} = \sqrt{\text{MSE}}
  $$

### 4. **R-squared (R²)**
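- **Definition**: Measures how well the model explains the variance of the target variable. Values closer to 1 indicate a better fit; 0 means the model does no better than always predicting the mean, and the value can be negative for a model that does worse than that.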
- **Formula**:
  $$
  R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
  $$
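
All four regression metrics can be computed from the same list of errors. The values below are invented for illustration, and the helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for two equal-length lists."""
    n = len(y_true)
    errors = [y - yhat for y, yhat in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)     # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(regression_metrics(y_true, y_pred))
```

For this data the errors are 1, 0, -1, 0, which gives MAE = 0.5, MSE = 0.5, RMSE ≈ 0.707, and R² = 0.9.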

---

## 📉 Summary

- **Classification Metrics**: Help assess the performance of classification models with metrics like accuracy, precision, recall, and F1-score.
- **Confusion Matrix**: A useful table to visualize the performance of a classification model and the types of errors made.
- **Regression Metrics**: Used to evaluate regression models with metrics like MAE, MSE, RMSE, and R².


