<a href="https://colab.research.google.com/github/MrSimple07/MachineLearning_ITMO/blob/main/machine_learning_exam_1st_semester.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Technologies (September 2023) EXAM


1. Supervised learning (an overview of the tasks and algorithms).
2. Unsupervised learning (an overview of the tasks and algorithms).
3. Machine learning and Bayes theorem. Prior and posterior distribution.
4. Error decomposition. Bias and variance tradeoff.
5. Linear regression model and Logistic regression.
6. The method of the k-nearest neighbors. Support vector machines. Kernel trick.
7. Decision trees and principles of their construction.
8. Types of features. Feature selection approaches. One-hot-encoding.
9. Data visualization and dimension-reduction algorithms: PCA and t-SNE.
10. Classification metrics. Accuracy, precision, recall, F1-score, log-loss, ROC-AUC. Types of errors, confusion matrix. Metrics of accuracy for regression models.
11. Construction of Ensembles of algorithms. Random Forest.
12. Clustering algorithms. K-means and DBSCAN. Estimation of clustering quality.
13. Multilayer perceptron. Activation functions and loss functions in neural networks. Parameters and hyperparameters.
14. Training of the deep neural network as an optimization problem. Gradient descent, stochastic gradient descent, Momentum, RMSProp and Adam algorithms.
15. Deep multi-layer neural networks. Backpropagation algorithm. The problem of vanishing and exploding gradients and the methods of its solution.
16. Datasets: train, test, validation (dev) sets. Cross-validation. Monitoring the learning process. Overfitting.
17. Convolutional neural networks (CNNs): convolution, pooling, padding, feature maps, low-level and high-level features.
18. Transfer learning approach. An overview of modern CNN architectures and open-source datasets. Advantages and disadvantages of modern CNNs.
19. Natural language processing. Bag of words approach. TF-IDF method. Stemming and lemmatization. Stop words.
20. Word embeddings. Skip-gram model. Word2vec, Glove, BERT.
21. Sequence analysis tasks. Simple recurrent neural network architecture.
22. LSTM and GRU cells. Memory in neural networks.


# 1 Supervised Learning

Supervised learning is a type of machine learning where the model learns from **examples with correct answers** (called labels). It’s like learning with a teacher: you see input data and are told the correct output.

---

## 🔍 How It Works

1. **Input Data (X)**: Features (like size, color, age, etc.)
2. **Output Labels (Y)**: Correct answers (like price, category, etc.)
3. **Model**: A mathematical function learns the relationship between X and Y.
4. **Training**: The model adjusts itself to reduce mistakes.
5. **Prediction**: After training, it can guess Y for new X.

---

## 🎯 Goals (Tasks)

### 1. Classification
- Predicts a **category or class** (discrete value).
- Example: Email → Spam or Not Spam

### 2. Regression
- Predicts a **number** (continuous value).
- Example: House Features → Price

---

## ⚙️ Common Algorithms

| Type          | Algorithms                            |
|---------------|----------------------------------------|
| Classification| Logistic Regression, SVM, KNN, Trees   |
| Regression    | Linear Regression, Decision Trees      |
| Both          | Neural Networks, Random Forest, XGBoost|

---

## 📈 Process

1. **Collect Data** (with labels)
2. **Split** into training/test sets
3. **Train** the model on training set
4. **Test** on unseen data
5. **Evaluate** using metrics (accuracy, MAE, etc.)

---

## ✅ Pros
- Predictable and measurable
- Works well with enough labeled data

## ❌ Cons
- Needs labeled data (can be expensive to get)
- May not work well with noisy or biased data

---

## 📌 Summary
> Supervised learning teaches machines to map inputs to outputs using example data. It's used for tasks like spam detection, price prediction, medical diagnosis, and more.



# 2 🤖 Unsupervised Learning — Explained Simply

Unsupervised learning is a type of machine learning where the model **finds patterns** in data **without any labels**. It's like exploring a new place with no guide — the model figures out the structure on its own.

---

## 🔍 How It Works

1. **Input Data (X)**: Only features, no correct answers
2. **Model**: Learns hidden patterns, groupings, or structures in the data
3. **Goal**: Discover insights or simplify the data

---

## 🎯 Goals (Tasks)

### 1. Clustering
- Group similar items together
- Example: Group customers by shopping behavior

### 2. Dimensionality Reduction
- Reduce the number of features while keeping key information
- Example: Visualizing high-dimensional data in 2D

---

## ⚙️ Common Algorithms

| Task                   | Algorithms                            |
|------------------------|----------------------------------------|
| Clustering             | K-Means, DBSCAN, Hierarchical Clustering |
| Dimensionality Reduction | PCA, t-SNE, Autoencoders               |
| Association Rule Mining | Apriori, Eclat                        |

---

## 📈 Process

1. **Collect Data** (no labels needed)
2. **Preprocess** (clean, scale, etc.)
3. **Train** the model to find patterns
4. **Analyze** output (e.g. clusters, components)
5. **Interpret** results for decision-making

---

## ✅ Pros
- No need for labeled data
- Good for exploring unknown patterns

## ❌ Cons
- Harder to evaluate results
- Can find meaningless patterns if not used carefully

---

## 📌 Summary
> Unsupervised learning helps machines understand the hidden structure of data — great for clustering, visualization, and data exploration when you don't have labels.


# 3 🧠 Machine Learning & Bayes' Theorem

Bayes’ Theorem is a way to **update our beliefs** (probabilities) based on **new evidence**. It’s used in **probabilistic models** in Machine Learning.

---

## 📘 Formula

**Bayes' Theorem:**
$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

---

## 🧩 Key Terms

- **Prior (P(A))**: What we believe before seeing data
- **Likelihood (P(B|A))**: Probability of seeing data if A is true
- **Posterior (P(A|B))**: Updated belief after seeing data
- **Evidence (P(B))**: Total probability of the data

---

## 🤖 In ML

Used in:
- **Naive Bayes classifier**
- **Bayesian inference**
- **Bayesian neural networks**

---

## 🛠️ Example (Spam Detection)

- **Prior**: Chance an email is spam (e.g., 20%)
- **Likelihood**: If "free" appears, how likely is spam?
- **Posterior**: New spam probability after seeing "free"
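
To make the update concrete, here is a minimal sketch of the spam example. The 20% prior comes from the list above; the two likelihood values are assumptions chosen purely for illustration, and the names `posterior`, `p_free_given_spam`, and `p_free_given_ham` are hypothetical.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) via Bayes' theorem."""
    # Evidence P(B), expanded with the law of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

p_spam = 0.20             # prior P(spam), from the example above
p_free_given_spam = 0.60  # P("free" | spam), assumed value
p_free_given_ham = 0.05   # P("free" | not spam), assumed value

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(round(p_spam_given_free, 2))  # 0.75
```

Under these assumed numbers, seeing the word "free" raises the spam probability from 20% to 75%.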

---

## 🎯 Summary

> Bayes’ theorem lets ML models **learn from evidence** by updating probabilities. It’s the core of **Bayesian thinking** in AI.


# 4 🎯 Error Decomposition: Bias-Variance Tradeoff

In Machine Learning, total prediction error can be split into three parts: **bias**, **variance**, and **irreducible error**.

---

## 📊 Total Error = Bias² + Variance + Irreducible Error

---

### 📌 Bias
- **What**: Error from wrong assumptions in the model.
- **High Bias**: Model is too simple → underfitting.
- **Example**: Linear model for complex patterns.

---

### 📌 Variance
- **What**: Error from model sensitivity to training data.
- **High Variance**: Model is too complex → overfitting.
- **Example**: Model learns noise in training data.

---

### ⚖️ Tradeoff
- **Goal**: Find the balance between bias and variance.
- **Simple model** → low variance, high bias.
- **Complex model** → low bias, high variance.

---

## 📈 Visualization

- Underfitting: 🎯 misses target completely → high bias
- Overfitting: 🎯 hits different spots every time → high variance
- Good fit: 🎯 hits near the center consistently

---

## 🧠 Summary

> The bias-variance tradeoff explains **why models make errors** and helps us choose the **right model complexity**.


# 5 📉 Linear Regression vs 🔐 Logistic Regression

---

## 🔢 Linear Regression

- **Goal**: Predict a continuous value (e.g., price, weight).
- **Formula**:  
  `y = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ`  
  (a straight line or hyperplane)

- **How it works**:  
  Finds the best line that fits the training data by minimizing the difference between predicted and actual values (using **Mean Squared Error**).

- **Example**:  
  Predict house price based on area and number of rooms.
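
As a rough sketch of "finding the best line", the single-feature case has a closed-form least-squares solution. The helper name `fit_line` and the data below are made up for illustration; real problems with several features would normally use a library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); this choice minimizes the MSE.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov / var
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Made-up data: area vs. price, generated exactly as price = 10 + 20 * area.
areas = [3.0, 5.0, 7.0, 9.0]
prices = [70.0, 110.0, 150.0, 190.0]

w0, w1 = fit_line(areas, prices)
predicted = w0 + w1 * 6.0  # predicted price for a new area of 6
```

Because the toy data lies exactly on a line, the fit recovers `w0 = 10` and `w1 = 20`.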

---

## 🚦 Logistic Regression

- **Goal**: Predict probability for **classification** (e.g., spam or not spam).
- **Formula**:  
  `P(y=1|x) = 1 / (1 + e^-(w₀ + w₁x₁ + ... + wₙxₙ))`  
  (sigmoid function)

- **How it works**:  
  Outputs a number between 0 and 1 → interprets as probability → sets threshold (e.g., 0.5) for classification.

- **Example**:  
  Predict if a student passes (yes/no) based on study hours.
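
The sigmoid-then-threshold step can be sketched as follows. The weights `w0 = -4` and `w1 = 1` are illustrative assumptions, not fitted values, and `predict_pass` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_pass(hours, w0=-4.0, w1=1.0, threshold=0.5):
    """Return (probability, class); the default weights are illustrative only."""
    p = sigmoid(w0 + w1 * hours)
    return p, int(p >= threshold)

p_low, label_low = predict_pass(2.0)    # linear score z = -2
p_high, label_high = predict_pass(6.0)  # linear score z = +2
```

With these assumed weights, 2 study hours gives a probability below 0.5 (class 0) and 6 hours gives a probability above 0.5 (class 1).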

---

## 🧠 Summary Table

| Feature               | Linear Regression         | Logistic Regression          |
|----------------------|---------------------------|------------------------------|
| Output Type          | Continuous number          | Probability / Class (0 or 1) |
| Used For             | Regression problems        | Binary classification        |
| Activation Function  | None (just linear)         | Sigmoid                      |
| Loss Function        | Mean Squared Error (MSE)   | Log Loss (Cross Entropy)     |


# 6 🤖 k-NN, SVM, and Kernel Trick

---

## 🧭 k-Nearest Neighbors (k-NN)

- **What it is**: A lazy, simple algorithm that classifies a point by looking at its **k closest neighbors**.
- **How it works**:
  1. Choose `k` (e.g., 3).
  2. Measure distance (usually Euclidean) to all points in training set.
  3. Find the `k` nearest ones.
  4. Assign the most common label among them (for regression, average their values instead).

- **Use**: Classification or regression.

- **Example**: To classify a fruit by shape/size, check the 3 most similar fruits already labeled.
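
The four steps above can be sketched in a few lines of plain Python. The fruit measurements are invented for illustration, `knn_predict` is a hypothetical name, and `math.dist` assumes Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up fruit data: (width in cm, height in cm).
fruits = [(7.0, 7.5), (7.2, 7.0), (6.8, 7.1), (3.0, 12.0), (3.2, 13.0), (2.8, 11.5)]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]

print(knn_predict(fruits, labels, (7.1, 7.2)))   # apple
print(knn_predict(fruits, labels, (3.1, 12.2)))  # banana
```

Note that there is no training step: all the work happens at prediction time, which is why k-NN is called a lazy learner.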

---

## ⚖️ Support Vector Machines (SVM)

- **What it is**: A powerful classifier that finds the **best boundary (hyperplane)** that separates classes.
- **Goal**: Maximize the **margin** between two classes.

- **How it works**:
  - Finds a hyperplane that best separates the data.
  - Only the closest points (called **support vectors**) affect the boundary.

- **Use**: Works well for high-dimensional data and text classification.

- **Example**: Email spam detection, image classification.

---

## 🎯 Kernel Trick

- **Problem**: Some data is **not linearly separable** (can’t draw a straight line to split).
- **Solution**: Use a **kernel** function to map data into a **higher dimension** where it becomes separable.

- **Popular Kernels**:
  - Polynomial
  - Radial Basis Function (RBF)

- **Key idea**: Do math to simulate higher dimensions **without actually computing them** (saves time and memory).
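
A small numerical check can make the key idea tangible. The sketch below uses the degree-2 polynomial kernel `k(x, y) = (x . y)^2` on 2-D inputs (one specific choice among the polynomial kernels) and compares it with the dot product after an explicit 3-D feature map. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in 2-D."""
    return dot(x, y) ** 2

def explicit_map(x):
    """The 3-D feature map that this kernel corresponds to."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, y)                       # never leaves 2-D
via_mapping = dot(explicit_map(x), explicit_map(y))  # goes through 3-D
```

Both routes give the same value (121 here, up to floating-point rounding), but the kernel never constructs the higher-dimensional vectors.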

---

## 🧠 Summary

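| Method       | Type                      | Strength                                    | Weakness                                    |
|--------------|---------------------------|---------------------------------------------|---------------------------------------------|
| k-NN         | Lazy learner              | Easy to understand and use                  | Slow for large datasets                     |
| SVM          | Maximum-margin classifier | Works well with high-dimensional data       | Not good for very large datasets            |
| Kernel Trick | Technique used with SVMs  | Handles data that is not linearly separable | Kernel and its parameters must be chosen carefully |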


# 7 🌳 Decision Trees and How They Work

---

## ✅ What is a Decision Tree?

A **decision tree** is a flowchart-like model used for **classification** or **regression**.  
It splits data into **branches** based on features, leading to a final **decision (leaf)**.

---

## 🧱 How It Works

1. **Start** at the root (all data).
2. **Choose the best feature** to split the data (based on criteria like Gini or Entropy).
3. **Split** the data into groups.
4. Repeat for each branch until:
   - All data in a node is pure (same class), or
   - Max depth or min samples is reached.

---

## 🔍 Splitting Criteria

- **Gini Impurity** (used in CART): Measures how mixed the labels are.
- **Entropy & Information Gain** (used in ID3/C4.5):
  - Entropy: Disorder in the data.
  - Info Gain: How much disorder is reduced by the split.
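
As an illustration of these criteria, the sketch below computes Gini impurity, entropy (in bits), and information gain for a tiny label set. The function names and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter

def class_probabilities(labels):
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["buy", "buy", "no", "no"]
perfect_split = [["buy", "buy"], ["no", "no"]]
useless_split = [["buy", "no"], ["buy", "no"]]
```

A 50/50 node has Gini impurity 0.5 and entropy 1 bit. The perfect split removes all of that disorder (information gain 1), while the useless split removes none (information gain 0), so a tree builder would prefer the first.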

---

## 🧠 Example

If predicting if a person will buy a phone:
- Root: Age
  - Age < 30 → Student?
    - Yes → Buys
    - No → Doesn’t buy
  - Age ≥ 30 → Income?
    - High → Buys
    - Low → Doesn’t buy

---

## ⚖️ Pros and Cons

| Pros                        | Cons                             |
|-----------------------------|----------------------------------|
| Easy to understand and use  | Can overfit (too specific)       |
| Works with numeric/categorical data | Not great with noisy data   |
| No need to normalize data   | Unstable: small data changes can alter the tree |

---

## 🌲 Final Tip

For better performance, use **Random Forests** (many trees combined) to reduce overfitting.


# 8 🔢 Features in Machine Learning

---

## 🧱 Types of Features

1. **Numerical (Continuous)**  
   - Example: Age, Salary  
   - Can be any number.

2. **Categorical (Nominal)**  
   - Example: Gender, Country  
   - Unordered categories, stored as labels or strings.

3. **Ordinal**  
   - Ordered categories.  
   - Example: Education level (High School < Bachelor < Master).

4. **Boolean/Binary**  
   - True/False, 0/1  
   - Example: IsStudent = Yes/No

5. **Text / Time / Image / Audio**  
   - Require special preprocessing.

---

## 🎯 Feature Selection Approaches

Used to pick only the **most useful features** (avoid noise and reduce overfitting):

1. **Filter Methods**  
   - Use statistics like correlation, chi-squared.  
   - Fast but ignore model performance.

2. **Wrapper Methods**  
   - Use model performance to evaluate combinations (e.g., forward/backward selection).  
   - More accurate but slower.

3. **Embedded Methods**  
   - Feature selection built into the model (e.g., Lasso Regression, Decision Trees).

---

## 📦 One-Hot Encoding

A method to convert **categorical features** into numbers:

- Creates a new column for each unique value.
- Puts `1` where it matches, `0` elsewhere.

### Example:

| Country   | → One-Hot |
|-----------|-----------|
| France    | [1, 0, 0] |
| Germany   | [0, 1, 0] |
| Spain     | [0, 0, 1] |

Used to make **categorical data usable** by machine learning models.
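
A minimal encoder might look like the sketch below. It assumes the column order is the sorted list of unique values, which happens to reproduce the table above; `one_hot_encode` is a hypothetical helper, not a library function.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per unique value."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

categories, encoded = one_hot_encode(["France", "Germany", "Spain", "France"])
print(categories)  # ['France', 'Germany', 'Spain']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In practice the category list should be learned from the training set only and reused for new data, so that the columns stay aligned.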

---

## 🧠 Summary

- Different feature types need different handling.
- Feature selection improves performance.
- One-hot encoding transforms categories into machine-friendly format.


# 9 📊 Data Visualization & Dimensionality Reduction

---

## 📈 Data Visualization

**Data visualization** is the process of representing data graphically. It helps in understanding the structure, patterns, and relationships in the data. Common visualizations:

- **Histograms**: Distribution of a single variable.
- **Scatter Plots**: Relationship between two continuous variables.
- **Bar Charts**: Comparison of categorical data.
- **Heatmaps**: Correlation or intensity of data across dimensions.

---

## 🔽 Principal Component Analysis (PCA)

PCA is a **dimensionality reduction** technique that transforms data into fewer dimensions while retaining as much information as possible.

### How It Works:
- Identifies the **principal components** (directions in which data varies the most).
- Projects the data onto these components.
- Reduces dimensions by keeping only the most important components.

### Example:
For a dataset with many variables (features), PCA reduces it to 2-3 main features that still represent the data well, making it easier to visualize or analyze.

---

## 🔍 t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a **non-linear dimensionality reduction** technique designed for **visualizing high-dimensional data** in 2 or 3 dimensions. It focuses on preserving local structures.

### How It Works:
- It minimizes the difference between pairwise similarities in high and low-dimensional space.
- Helps reveal patterns like clusters or groups in data.

### When to Use:
- Best used for visualizing data like images, text embeddings, or anything with high-dimensional features.

### Example:
It’s often used to visualize the clustering of data, like when you apply it to a neural network's activations or a word embedding.

---

## 📉 Summary

- **Data Visualization** helps to understand data visually.
- **PCA** is a linear method that reduces the dimensionality while retaining the variance.
- **t-SNE** is a non-linear method focused on preserving local relationships for better visualization.



# 10 🧮 Classification Metrics & Evaluation

---

## 🔢 Key Classification Metrics

### 1. **Accuracy**
- **Definition**: The proportion of correctly predicted instances out of the total instances.
- **Formula**:
  $$
  \text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Instances}}
  $$
- **Use Case**: Good for balanced datasets but not ideal for imbalanced datasets.

### 2. **Precision**
- **Definition**: The proportion of true positive predictions out of all positive predictions made.
- **Formula**:
  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}
  $$
- **Use Case**: Useful when false positives are more costly (e.g., email spam detection).

### 3. **Recall (Sensitivity or True Positive Rate)**
- **Definition**: The proportion of true positive predictions out of all actual positives.
- **Formula**:
  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
  $$
- **Use Case**: Useful when missing a positive instance is costly (e.g., detecting diseases).

### 4. **F1-Score**
- **Definition**: The harmonic mean of precision and recall, balancing both metrics.
- **Formula**:
  $$
  \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}
  $$
- **Use Case**: Useful when both false positives and false negatives are important and you need a balanced metric.

### 5. **Log-Loss (Logarithmic Loss)**
- **Definition**: Measures how good the predicted probabilities are; confident but wrong predictions are penalized most heavily.
- **Formula**:
  $$
  \text{Log-Loss} = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
  $$
  where \( y_i \) is the true label and \( p_i \) is the predicted probability.
- **Use Case**: Especially useful for probabilistic classifiers (e.g., logistic regression, neural networks).

### 6. **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**
- **Definition**: The area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds.
- **Use Case**: Evaluates how well the model ranks positives above negatives, independent of any single threshold. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.

---

## 📊 Confusion Matrix

A **Confusion Matrix** is a table that describes the performance of a classification model by comparing predicted labels with true labels. It contains four values:

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| **Actual Positive**   | True Positive (TP)  | False Negative (FN)  |
| **Actual Negative**   | False Positive (FP) | True Negative (TN)  |

- **True Positive (TP)**: Correctly predicted positive class.
- **False Positive (FP)**: Incorrectly predicted as positive.
- **True Negative (TN)**: Correctly predicted negative class.
- **False Negative (FN)**: Incorrectly predicted as negative.

### **Types of Errors**
- **False Positive (Type I error)**: Incorrectly predicting a positive when it’s actually negative.
- **False Negative (Type II error)**: Incorrectly predicting a negative when it’s actually positive.
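
To tie the confusion matrix to the metric formulas, the sketch below counts the four cells for a made-up set of binary predictions and derives accuracy, precision, recall, and F1 from them. The labels and the helper name `confusion_counts` are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, tn, fn)  # 3 1 3 1
```

Here one false positive (a Type I error) and one false negative (a Type II error) give all four metrics the same value of 0.75. The sketch does not guard against division by zero, which occurs when there are no predicted or no actual positives.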

---

## 📏 Metrics for Regression Models

### 1. **Mean Absolute Error (MAE)**
- **Definition**: The average of the absolute errors between predicted and actual values.
- **Formula**:
  $$
  \text{MAE} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|
  $$

### 2. **Mean Squared Error (MSE)**
- **Definition**: The average of the squared differences between predicted and actual values. More sensitive to large errors.
- **Formula**:
  $$
  \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
  $$

### 3. **Root Mean Squared Error (RMSE)**
- **Definition**: The square root of MSE. Gives an idea of the magnitude of error in the same units as the original data.
- **Formula**:
  $$
  \text{RMSE} = \sqrt{\text{MSE}}
  $$

### 4. **R-squared (R²)**
- **Definition**: Measures how much of the variance of the target variable the model explains. The maximum is 1 (perfect fit); 0 means no better than always predicting the mean, and the value can be negative for a model that does worse than that.
- **Formula**:
  $$
  R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
  $$
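
The four regression metrics can be computed directly from their formulas, as in this sketch. The target and prediction values are made up, and `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

For this toy data the errors are 1, 0, and -2, so MAE is 1, MSE is 5/3, and R² is 1 - 5/8 = 0.375. The single larger error dominates MSE much more than MAE.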

---

## 📉 Summary

- **Classification Metrics**: Help assess the performance of classification models with metrics like accuracy, precision, recall, and F1-score.
- **Confusion Matrix**: A useful table to visualize the performance of a classification model and the types of errors made.
- **Regression Metrics**: Used to evaluate regression models with metrics like MAE, MSE, RMSE, and R².


# 11 🌲 Ensemble Methods & Random Forest

---

## 🔁 What is an Ensemble?

An **ensemble** combines multiple models (called **base learners**) to improve prediction performance.

### 🎯 Why use ensembles?
- Reduce **variance** (bagging)
- Reduce **bias** (boosting)
- Improve **generalization** over single models

---

## 👨‍👩‍👧‍👦 Types of Ensembles

1. **Bagging (Bootstrap Aggregating)**
   - Trains models on different random subsets of data (with replacement).
   - Final prediction: average (regression) or majority vote (classification).
   - Example: **Random Forest**.

2. **Boosting**
   - Models are trained sequentially.
   - Each new model corrects errors made by the previous one.
   - Examples: AdaBoost, Gradient Boosting.

3. **Stacking**
   - Combines different types of models.
   - A **meta-model** is trained to combine outputs of base models.

---

## 🌳 Random Forest

Random Forest is an ensemble of many **Decision Trees** trained using **bagging**.

### 🔧 How it works:
1. Draw **bootstrapped samples** from training data.
2. Train a decision tree on each sample.
3. At each split in the tree, select the **best feature** from a random subset of features.
4. Aggregate predictions:
   - **Classification**: majority vote.
   - **Regression**: average.

### 📈 Why it works:
- Reduces **overfitting** of individual trees.
- Decorrelates trees using feature randomness.

---

## 🔢 Random Forest Prediction

For classification:
$$
\hat{y} = \text{majority\_vote}(T_1(x), T_2(x), ..., T_k(x))
$$

For regression:
$$
\hat{y} = \frac{1}{k} \sum_{i=1}^{k} T_i(x)
$$

Where \( T_i(x) \) is the prediction of the \( i \)-th tree.
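
The aggregation step alone can be sketched as below. The "trees" are hypothetical stand-ins (plain functions that return a fixed prediction), since training real trees is outside the scope of this illustration.

```python
from collections import Counter

def forest_classify(trees, x):
    """Majority vote over the class predicted by each tree."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, x):
    """Average of the numeric predictions of all trees."""
    predictions = [tree(x) for tree in trees]
    return sum(predictions) / len(predictions)

# Hypothetical stand-ins for trained trees: each one is just a function of x.
class_trees = [lambda x: "spam", lambda x: "ham", lambda x: "spam"]
value_trees = [lambda x: 10.0, lambda x: 12.0, lambda x: 14.0]

print(forest_classify(class_trees, x=None))  # spam
print(forest_regress(value_trees, x=None))   # 12.0
```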

---

## ✅ Advantages
- High accuracy
- Works well with large datasets
- Can handle missing data (depending on the implementation)
- Less need for parameter tuning

## ⚠️ Disadvantages
- Slower for real-time predictions
- Less interpretable than a single tree

---

## Summary

- **Ensemble learning** = many models combined for better performance.
- **Random Forest** = many decision trees + randomness + aggregation.
- Useful in both classification and regression tasks.


# 12 🔗 Clustering Algorithms: K-Means, DBSCAN & Quality Estimation

---

## 📊 What is Clustering?

**Clustering** is **unsupervised learning** that groups similar data points based on distance or density — without labeled outputs.

---

## ⚙️ K-Means

- Objective: Minimize within-cluster variance.
- Steps:
  1. Choose \( k \) cluster centers (centroids).
  2. Assign points to the nearest centroid.
  3. Update centroids as mean of assigned points.
  4. Repeat until convergence.

### Formula (objective):
$$
\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
$$

✅ Fast, scalable  
❌ Needs \( k \), fails on non-spherical clusters
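
The assign/update loop can be sketched in plain Python as follows. To keep the run deterministic, this version takes the initial centroids as an argument instead of picking them at random; that choice, the function name `kmeans`, and the toy points are assumptions for illustration.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm; `centroids` holds the initial cluster centers."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the center of an empty cluster
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the loop converges after two passes, with three points in each cluster and centroids near (0.33, 0.33) and (10.33, 10.33).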

---

## 🧱 DBSCAN (Density-Based Spatial Clustering)

- Groups dense regions, separates noise.
- Parameters:
  - \( \varepsilon \): neighborhood radius
  - `minPts`: minimum points to form a dense region

✅ Finds arbitrary shapes, handles noise  
❌ Struggles with varying density

---

## 📏 Clustering Quality Estimation

### 1. **Silhouette Score**  
How close a point is to its cluster vs. other clusters:
$$
s = \frac{b - a}{\max(a, b)}
$$
- \( a \) = mean distance to the other points in the same cluster  
- \( b \) = mean distance to the points in the nearest other cluster  
- \( s \in [-1, 1] \), higher = better
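
For a single point, the score can be computed as in the sketch below. It takes `a` as the mean distance to the other points in the point's own cluster and `b` as the smallest mean distance to any other cluster; the function name and the 1-D toy data are assumptions for illustration.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette of one point; `own_cluster` must not contain the point itself."""
    def mean_distance(cluster):
        return sum(math.dist(point, q) for q in cluster) / len(cluster)

    a = mean_distance(own_cluster)                     # cohesion
    b = min(mean_distance(c) for c in other_clusters)  # separation
    return (b - a) / max(a, b)

# 1-D toy data written as 1-tuples so that math.dist applies.
s = silhouette((0.0,), own_cluster=[(1.0,), (-1.0,)], other_clusters=[[(4.0,), (6.0,)]])
```

Here `a = 1` and `b = 5`, so `s = 0.8`: the point sits much closer to its own cluster than to the nearest other one. The overall score of a clustering is the average over all points.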

### 2. **Davies-Bouldin Index**  
Lower is better — measures average "similarity" between clusters.

---

✅ Clustering helps find hidden structures in data without labels.


# 13 🧠 Multilayer Perceptron (MLP), Activation & Loss Functions

---

## 🔗 What is an MLP?

**Multilayer Perceptron** is a type of **feedforward neural network** with:
- Input layer → Hidden layer(s) → Output layer
- Each neuron computes:
$$
z = w^T x + b, \quad a = \sigma(z)
$$

It learns by **backpropagation**, adjusting weights using gradients from a loss function.

---

## ⚡ Activation Functions in Neural Networks

Activation functions introduce **non-linearity**, allowing neural networks to learn complex patterns. Without them, a neural network would be just a linear model, regardless of how many layers it has.

---

| Function      | Formula                                                 | Notes                                                                 |
|---------------|---------------------------------------------------------|-----------------------------------------------------------------------|
| **Sigmoid**   | $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$                  | Maps any input to range (0, 1). Often used in binary classification. |
| **Tanh**      | $$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$      | Output is in range (-1, 1), centered at 0. Often better than sigmoid. |
| **ReLU**      | $$ \text{ReLU}(x) = \max(0, x) $$                       | Very fast to compute. Common in hidden layers. Can lead to "dead" neurons. |
| **Leaky ReLU**| $$ \text{LeakyReLU}(x) = \max(0.01x, x) $$              | Fixes dying ReLU by allowing a small slope for negative inputs.      |
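
The four functions in the table translate almost line for line into code. This is a scalar sketch using only the standard library; the `slope` parameter name for Leaky ReLU is an assumption, with 0.01 taken from the formula above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

# Compare the outputs for a negative, a zero and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

For the negative input, ReLU returns exactly 0 while Leaky ReLU returns a small negative value, which is what keeps its gradient from being zero there.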

---

## 🧠 How Activation Functions Work:

1. **Placed after each linear transformation** (e.g., weighted sum + bias).
2. **Adds non-linearity**, which allows stacking layers to model complex data (like images or language).
3. **Enables gradient-based learning** (via backpropagation).

### 💡 Example:

In a single neuron:
$$
z = w^T x + b,\quad a = \text{Activation}(z)
$$

Where:
- \( z \) is the linear output
- \( a \) is the activated output (e.g., via sigmoid or ReLU)

---

## ✅ Summary:

- Use **ReLU** for hidden layers (fast and effective).
- Use **Sigmoid** or **Softmax** in the output for classification.
- Pick activation based on your task, and always monitor performance and gradients.

---

## 💥 Loss Functions

Measure the model's prediction error:

- **Regression**:
  - MSE:  
    $$
    \text{MSE} = \frac{1}{n} \sum (y_i - \hat{y}_i)^2
    $$

- **Classification**:
  - Binary Cross-Entropy:  
    $$
    -[y \log \hat{y} + (1 - y)\log(1 - \hat{y})]
    $$
  - Categorical Cross-Entropy for multi-class.

---

## 🔧 Parameters vs Hyperparameters

- **Parameters**: Learned during training (weights \( w \), biases \( b \))
- **Hyperparameters**: Set before training (e.g. learning rate, layers, batch size)

---

✅ MLPs are the base of deep learning — combining linear algebra, activation, and optimization to learn complex patterns.


# 14 🧠 Training Deep Neural Networks as an Optimization Problem

Training a neural network means finding parameters (weights and biases) that minimize a **loss function** (how wrong predictions are). This is framed as an **optimization problem**.

## 🎯 Objective:
Minimize loss \( L(\theta) \), where \( \theta \) are model parameters.

---

## 🔽 Gradient Descent (GD)

Basic algorithm that updates parameters in the direction of negative gradient:

$$
\theta := \theta - \eta \cdot \nabla_\theta L(\theta)
$$

- \( \eta \): learning rate
- \( \nabla_\theta L \): gradient of the loss
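
A minimal sketch of the update rule on a one-parameter toy loss, `L(theta) = (theta - 3)^2`, whose gradient is `2 * (theta - 3)`. The loss, the learning rate of 0.1, and the step count are assumptions chosen so the behavior is easy to follow.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeat the update theta := theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2 with its minimum at theta = 3.
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Each step shrinks the distance to the minimum by a factor of 0.8 here, so `theta_star` ends up very close to 3. A learning rate above 1.0 would make this particular loss diverge instead.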

---

## 🔁 Stochastic Gradient Descent (SGD)

Instead of using the full dataset, updates are made per sample or small batches:

$$
\theta := \theta - \eta \cdot \nabla_\theta L(\theta; x_i, y_i)
$$

✅ Faster updates  
❌ Noisy but helps escape local minima

---

## 🌀 Momentum

Adds velocity to updates to smooth them:

$$
\begin{aligned}
v &:= \beta v - \eta \nabla_\theta L(\theta) \\
\theta &:= \theta + v
\end{aligned}
$$

- \( \beta \in [0, 1) \): momentum factor (e.g., 0.9)
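
The two update lines map directly onto code. This sketch uses a toy quadratic loss, `L(theta) = (theta - 3)^2`; the loss, hyperparameters, and function name are illustrative assumptions.

```python
def momentum_descent(grad, theta, lr=0.1, beta=0.9, steps=300):
    """Apply v := beta * v - lr * grad(theta), then theta := theta + v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(theta)
        theta = theta + v
    return theta

# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_m = momentum_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

On this loss the iterate overshoots the minimum and oscillates around it with shrinking amplitude before settling near 3, which is typical of momentum.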

---

## 📉 RMSProp

Adapts learning rate based on recent gradient magnitudes:

$$
\begin{aligned}
s &:= \rho s + (1 - \rho) \cdot (\nabla_\theta L(\theta))^2 \\
\theta &:= \theta - \frac{\eta}{\sqrt{s + \epsilon}} \cdot \nabla_\theta L(\theta)
\end{aligned}
$$

- \( \rho \): decay rate (e.g., 0.9)
- Helps deal with varying gradients

---

## 🧬 Adam (Adaptive Moment Estimation)

Combines **Momentum** and **RMSProp**:

$$
\begin{aligned}
m_t &:= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta) \\
v_t &:= \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta L(\theta))^2 \\
\hat{m}_t &:= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t := \frac{v_t}{1 - \beta_2^t} \\
\theta &:= \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\end{aligned}
$$

- \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \): common defaults  
- A strong default choice for many deep learning tasks
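
The four Adam equations can be sketched for a single scalar parameter as follows. The toy loss `L(theta) = (theta - 3)^2`, the learning rate of 0.1, and the step count are assumptions for illustration; the beta values are the defaults quoted above.

```python
import math

def adam(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Scalar Adam: momentum-style first moment plus RMSProp-style scaling."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1.0 - beta2) * g * g  # second moment (mean of squares)
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_adam = adam(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Because the step is scaled by the gradient's recent magnitude, the first updates move by roughly `lr` regardless of how large the raw gradient is.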

---

## ✅ Summary

| Algorithm | Pros | Cons |
|----------|------|------|
| GD | Stable, precise | Slow for big data |
| SGD | Fast updates | Noisy |
| Momentum | Faster convergence | Needs tuning |
| RMSProp | Adapts LR | May overshoot |
| Adam | Momentum plus adaptive LR; works well with default settings | Slightly more complex |



# 15 🤖 Deep Multi-Layer Neural Networks

A **deep neural network (DNN)** is a model with multiple layers of neurons between input and output. Each layer learns features from the previous one.

## 🔁 Backpropagation Algorithm

Backpropagation is used to compute gradients of the loss function w.r.t. each parameter by applying the chain rule from output to input layer.

For a given layer \( l \):

$$
\delta^l = \frac{\partial L}{\partial z^l} = \left( (W^{l+1})^T \delta^{l+1} \right) \odot f'(z^l)
$$

- \( \delta^l \): error at layer \( l \)  
- \( W^{l+1} \): weights of next layer  
- \( f'(z^l) \): derivative of activation function  
- \( \odot \): element-wise multiplication

Weights are updated as:

$$
W^l := W^l - \eta \cdot \frac{\partial L}{\partial W^l}
$$
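
To see the chain rule at work, the sketch below runs backpropagation on the smallest possible network: one input, one tanh hidden unit, one linear output, and a squared-error loss. Every quantity is a scalar, so the error recursion reduces to ordinary multiplication. The network shape, the parameter values, and the finite-difference check are assumptions made for illustration.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    a1 = math.tanh(z1)   # hidden activation f(z) = tanh(z)
    z2 = w2 * a1 + b2    # linear output
    return z1, a1, z2

def loss(params, x, y):
    return 0.5 * (forward(params, x)[2] - y) ** 2

def backprop(params, x, y):
    """Gradients of the loss w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    z1, a1, z2 = forward(params, x)
    delta2 = z2 - y                         # dL/dz2 for the squared error
    delta1 = w2 * delta2 * (1.0 - a1 ** 2)  # back through tanh: f'(z) = 1 - tanh(z)^2
    return [delta1 * x, delta1, delta2 * a1, delta2]

def numeric_grad(params, x, y, h=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += h
        down = list(params)
        down[i] -= h
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grads

params = [0.5, -0.1, 1.5, 0.2]
analytic = backprop(params, x=0.8, y=1.0)
numeric = numeric_grad(params, x=0.8, y=1.0)
```

The analytic and numeric gradients agree to several decimal places, which is the usual sanity check ("gradient checking") for a hand-written backward pass.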

---

## ⚠️ Vanishing and Exploding Gradients

In deep networks, gradients can:

- **Vanishing**: become very small → slow learning
- **Exploding**: become very large → unstable updates

### Why?

The chain rule multiplies many derivatives together: if their magnitudes are below 1, the product shrinks toward zero (vanishes); if they are above 1, it grows rapidly (explodes).

---

## 🛠️ Solutions

| Problem   | Solution |
|-----------|----------|
| Vanishing | ✅ ReLU/Leaky ReLU activations (gradient of 1 for positive inputs) |
| Vanishing | ✅ Batch normalization |
| Vanishing | ✅ Proper weight initialization (e.g. He, Xavier) |
| Vanishing | ✅ Residual connections (ResNets) |
| Exploding | ✅ Gradient clipping |
| Exploding | ✅ Lower learning rates |

---

## ✅ Summary

- Deep networks extract complex features.
- Backprop is key for training.
- Vanishing/exploding gradients slow or destabilize training.
- Use good activations, normalization, and initialization to fix.


# 16 📊 Datasets and Overfitting in Machine Learning

## 🧩 Types of Datasets

| Type       | Purpose                              |
|------------|--------------------------------------|
| **Train set**   | Used to train the model (learn patterns) |
| **Validation (Dev) set** | Used to tune hyperparameters and monitor performance |
| **Test set**    | Used to evaluate final model performance on unseen data |

---

## 🔁 Cross-Validation

A method to reliably evaluate model performance:

**K-Fold Cross-Validation:**
- Data is split into \( k \) parts.
- Train on \( k-1 \), validate on 1 fold.
- Repeat \( k \) times and average the results.

### Formula for cross-validated score:

$$
\text{CV Score} = \frac{1}{k} \sum_{i=1}^{k} \text{score}_i
$$
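
The splitting and averaging can be sketched without any library. This version splits indices in order without shuffling, which is an assumption made to keep the output predictable; `score_fn` is a hypothetical callback that would train on one index list and score on the other.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_val_score(score_fn, n_samples, k):
    """Average score_fn(train, val) over the k folds."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / k

folds = list(k_fold_indices(10, 3))
print([len(val) for _, val in folds])  # [4, 3, 3]
```

Every sample lands in exactly one validation fold, so each data point is used for validation once and for training `k - 1` times.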

---

## 📈 Monitoring the Learning Process

- **Training loss ↓**: model learns on training data  
- **Validation loss**: helps detect overfitting  
  - If it goes **↑ while train loss ↓**, overfitting is happening

Use **learning curves** (train vs. validation loss) to visualize.

---

## ⚠️ Overfitting

When a model learns the training data too well, including noise or irrelevant details, and performs poorly on new data.

| Symptom             | Cause                          | Solution                  |
|---------------------|--------------------------------|---------------------------|
| High train accuracy, low test accuracy | Model memorizes data     | Use simpler model, more data, regularization |

---

### Techniques to Reduce Overfitting

- Cross-validation (detects overfitting and guides model selection rather than preventing it directly)  
- Dropout (for neural networks)  
- Regularization (L1/L2)  
- Early stopping  
- Data augmentation
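Early stopping from the list above can be sketched as a rule over recorded validation losses. The function name and the `patience` parameter are assumptions for illustration; deep learning frameworks ship their own callbacks with more options.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1      # never triggered: ran all epochs


val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
print(early_stopping_epoch(val_losses, patience=2))  # 4
```

In practice the weights from the best epoch (here epoch 2) are restored after stopping.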

---

## ✅ Summary

- Always split your data into **train/dev/test**.
- Use **cross-validation** for stable results.
- Monitor loss curves to spot **overfitting** early.


# 17 # 🧠 Convolutional Neural Networks (CNNs)

CNNs are powerful for processing **grid-like data**, such as **images**. They automatically learn features like edges, shapes, and patterns.

---

## 🔍 Key Components

### 1. **Convolution**

A mathematical operation to extract features using filters (kernels):

$$
S(i, j) = (X * K)(i, j) = \sum_m \sum_n X(i+m, j+n) \cdot K(m, n)
$$

- \( X \): input image  
- \( K \): kernel (filter)  
- \( S \): feature map (result)  
- Learns **edges**, **textures**, etc.
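The formula can be implemented directly. This is a minimal sketch with stride 1 and no padding, using nested lists instead of a tensor library; the kernel is a hand-picked vertical-edge detector, whereas a CNN would learn its kernel values.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    Computes S(i, j) = sum_m sum_n X(i+m, j+n) * K(m, n).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1],
                 [-1, 1]]
print(conv2d_valid(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the dark-to-bright transition sits, which is what "the filter detects an edge" means concretely.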

---

### 2. **Feature Maps**

- The output after convolution.
- Highlight specific patterns.
- Each filter creates one feature map.

---

### 3. **Padding**

Adds a border (usually zeros) around the input to control the output size:

- **Same Padding**: output size = input size  
- **Valid Padding**: no padding, output shrinks
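The two padding modes can be checked with the standard output-size formula, out = floor((n + 2p - k) / s) + 1. The formula is not stated in these notes and is added here as a well-known result; the function names are hypothetical.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one axis for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1


def same_padding(k):
    """Padding per side that preserves the size (assumes stride 1 and odd k)."""
    return (k - 1) // 2


print(conv_output_size(28, 3))                           # valid padding: 26
print(conv_output_size(28, 3, padding=same_padding(3)))  # same padding: 28
```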

---

### 4. **Pooling**

Reduces size of feature maps (downsampling):

#### Max Pooling Example:
$$
\text{MaxPool}_{2 \times 2} = \max \left\{ x_1, x_2, x_3, x_4 \right\}
$$

- Helps with **translation invariance**
- Makes the network faster and less likely to overfit
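A minimal sketch of 2x2 max pooling with stride 2 on a nested-list feature map. It assumes even height and width; the function name is hypothetical.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]), 2)
        ]
        for i in range(0, len(feature_map), 2)
    ]


fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(fm))  # [[4, 2], [2, 8]]
```

The 4x4 map becomes 2x2, and shifting a strong activation by one pixel inside a window leaves the output unchanged.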

---

### 5. **Low-Level vs High-Level Features**

| Layer | Features Learned        |
|-------|--------------------------|
| Early (1st-2nd) | Edges, corners, colors         |
| Middle          | Textures, patterns             |
| Deep            | Shapes, objects, semantics     |

---

## 🧠 CNN Architecture (Simplified)
[Input Image]

↓

[Convolution + ReLU]

↓

[Pooling]

↓

[Convolution + ReLU]

↓

[Pooling]

↓

[Fully Connected Layer]

↓

[Output (e.g., class)]


---

## ✅ Summary

CNNs detect **spatial patterns** in data using:
- **Convolution** to extract features
- **Pooling** to reduce size
- **Padding** to control output shape
- Gradually transform raw pixels → meaningful objects


# 18 # 🔄 Transfer Learning & CNN Architectures

## 🔁 What is Transfer Learning?

Transfer learning is using a **pre-trained model** (trained on a large dataset like ImageNet) and **fine-tuning** it for a **new, smaller task**.

### Why?
- Saves time and compute
- Helps when you don’t have much data
- Leverages **learned low/high-level features**

---

## 🧠 How It Works

1. **Load pre-trained model** (e.g., ResNet trained on ImageNet)
2. **Freeze early layers** (generic features like edges)
3. **Replace last layers** with your custom task (e.g., 10-class classifier)
4. **Train** the new layers; optionally unfreeze and **fine-tune** the entire model with a small learning rate

---

## 📚 Modern CNN Architectures (Overview)

| Model        | Key Features                                       |
|--------------|----------------------------------------------------|
| LeNet-5      | One of the first CNNs (used for digit recognition) |
| AlexNet      | Revived deep CNNs; popularized ReLU and dropout    |
| VGG16/19     | Deep, simple; only 3x3 convolutions                |
| GoogLeNet    | Inception modules for multi-scale learning         |
| ResNet       | **Residual blocks** (skip connections) to mitigate vanishing gradients |
| DenseNet     | Connects each layer to all later layers in a dense block (feature reuse) |
| EfficientNet | Balances depth, width, and resolution              |

---

## 📂 Open-Source Datasets for CNNs

| Dataset       | Description                               |
|---------------|-------------------------------------------|
| **MNIST**     | Handwritten digits (28x28)                |
| **CIFAR-10/100** | Tiny images of objects                  |
| **ImageNet**  | ~14M images in total; the standard ILSVRC benchmark subset has 1000 classes |
| **COCO**      | Object detection, segmentation            |
| **CelebA**    | Celebrity faces with attributes           |
| **Fashion-MNIST** | Clothing items (alternative to MNIST) |

---

## ✅ Advantages of Modern CNNs

- Extract **hierarchical features** automatically
- **Transferable** to other tasks via pretraining
- Works well with large datasets and complex inputs (e.g., images, video)

---

## ❌ Disadvantages

- **Data-hungry** (without pretraining)
- High **computational cost** for training
- **Hard to interpret** internal decisions (black box)
- May **overfit** on small datasets

---

## 📝 Summary

Transfer learning + modern CNNs = powerful, efficient image models.  
Choose an architecture based on task size, resources, and accuracy needs.


# 19 # 🗣 Natural Language Processing (NLP)

## 🧰 Basic Concepts

NLP = teaching machines to understand human language (text/speech).

---

## 🧱 1. Bag of Words (BoW)

- Represents text as **word frequency vectors** (ignores grammar & order).
- Example:

| Sentence      | Word Vector  |
|---------------|--------------|
| "I love cats" | [1, 1, 1, 0] |
| "I love dogs" | [1, 1, 0, 1] |

Vocabulary (vector order): I, love, cats, dogs

✅ Simple  
❌ Doesn’t consider importance or context.
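The example can be reproduced with a few lines of Python. This sketch assumes whitespace tokenization and a fixed, hand-written vocabulary; the function name is hypothetical.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


vocabulary = ["I", "love", "cats", "dogs"]
print(bag_of_words("I love cats", vocabulary))  # [1, 1, 1, 0]
print(bag_of_words("I love dogs", vocabulary))  # [1, 1, 0, 1]
```

Note that "cats love I" would produce the same vector as "I love cats", which is exactly the loss of word order mentioned above.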

---

## 📊 2. TF-IDF (Term Frequency – Inverse Document Frequency)

Weights words by **how important** they are in a document **relative to a corpus**.

- **TF** = how often word appears in a document  
  $$ TF(t, d) = \frac{f_{t,d}}{\sum_k f_{k,d}} $$

- **IDF** = how rare the word is across all documents  
  $$ IDF(t) = \log \left( \frac{N}{df_t} \right) $$

- **TF-IDF Score** =  
  $$ TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t) $$

✅ Highlights rare but important words  
❌ Still loses word order & meaning
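A minimal sketch of the three formulas above on a toy corpus of tokenized documents. It uses the unsmoothed definitions exactly as written; library implementations often apply smoothed variants, so their numbers may differ. The function names are hypothetical.

```python
import math


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency (assumes the term occurs in at least one doc)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = [
    "i love cats".split(),
    "i love dogs".split(),
    "dogs chase cats".split(),
]
doc = corpus[2]
for term in doc:
    print(term, round(tf_idf(term, doc, corpus), 3))
```

In the third document "chase" gets the highest score because it appears in no other document, while "dogs" and "cats" are each shared with one other document.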

---

## ✂️ 3. Stemming vs Lemmatization

- **Stemming** = strips suffixes with heuristic rules; the result need not be a real word (e.g., “running” → “run”, “studies” → “studi”)
- **Lemmatization** = uses vocabulary and part of speech to return the dictionary form, or lemma (e.g., “was” → “be”, “better” → “good”)

✅ Helps group similar words  
❌ May distort meaning if done poorly

---

## 🛑 4. Stop Words

- Common words (e.g., “the”, “is”, “and”) that are **often removed** in preprocessing.

✅ Removes noise  
❌ Might need careful tuning depending on the task
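Stop-word removal is a simple filter. The sketch below uses a tiny hand-written list as an assumption; real projects usually take a curated list from an NLP library and adjust it for the task.

```python
STOP_WORDS = {"the", "is", "and", "a", "on"}   # hypothetical, deliberately tiny


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The cat is on the mat"))  # ['cat', 'mat']
```

For tasks such as sentiment analysis, words like "not" should stay out of the stop list because removing them flips the meaning.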

---

## ✅ Summary

| Concept          | Purpose                         |
|------------------|----------------------------------|
| BoW              | Basic text to vector             |
| TF-IDF           | Word importance in corpus        |
| Stemming/Lemmat. | Normalize words                  |
| Stop Words       | Remove frequent non-useful words |

NLP starts with **cleaning and converting** text into numbers — then models can work!


# 20 # 🧠 Word Embeddings & Language Models

Word embeddings are **dense vector representations** of words — capturing **meaning** based on context.

---

## 📌 Why Not Use One-Hot?

One-hot encoding → High-dimensional, no relation between words  
E.g., "king" and "queen" get orthogonal vectors, so their similarity in meaning is not encoded

✅ Solution: Use **word embeddings** — words close in meaning are close in space.

---

## 🔁 Skip-gram Model (Word2Vec)

**Goal**: Predict **context words** given a **center word**.

- Example:  
  Sentence: "The cat sat on the mat"  
  Center: "cat", window size 2 → Predict: ["The", "sat", "on"]

- Objective: Maximize probability of context given the center word.

$$
\max \prod_{t=1}^{T} \prod_{-c \le j \le c, j \ne 0} P(w_{t+j} | w_t)
$$

- Uses **neural network** to learn embeddings.
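The training data for skip-gram is a list of (center, context) pairs. A minimal sketch of how they are generated, assuming whitespace tokenization and a symmetric window of size `window` (the c in the objective); the function name is hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


tokens = "The cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
print([context for center, context in pairs if center == "cat"])  # ['The', 'sat', 'on']
```

The network is then trained to assign high probability to each context word given its center word.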

---

## 🧰 Word2Vec (Mikolov, 2013)

- **Skip-gram** or **CBOW** (predict center from context)
- Learns ~300-dim vectors
- Fast, simple, but static (same vector for all contexts)

---

## 🤝 GloVe (Global Vectors)

- Uses **word co-occurrence** statistics
- Embeddings trained from **global matrix of word counts**
- Objective: words that co-occur frequently → similar vectors

Equation (simplified):

$$
J = \sum_{i,j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

✅ Captures global relationships  
❌ Still static

---

## 🧠 BERT (Bidirectional Encoder Representations from Transformers)

- Uses **Transformer** model
- Learns embeddings from **left and right context**
- Word meaning changes **based on sentence**

**Example**:
- "bank" in “river bank” ≠ "bank" in “money bank”

✅ Contextualized  
✅ Pre-trained on huge corpora  
❌ Heavy, slower to train/use

---

## ✅ Summary Table

| Model     | Type        | Context-Aware | Notes                      |
|-----------|-------------|----------------|-----------------------------|
| Word2Vec  | Local       | ❌              | Fast, static                |
| GloVe     | Global      | ❌              | Uses word co-occurrence     |
| BERT      | Deep/Context| ✅              | Deep, bidirectional, slow   |

Word embeddings help models understand **semantics** — a key part of modern NLP. 🚀


# 21 # 🔁 Sequence Analysis Tasks & Simple RNN

## 📌 What is Sequence Analysis?

Sequence analysis deals with **ordered data**, where **order matters** (unlike regular data).

### 🔍 Common Tasks:
- 📜 **Text generation** (e.g., next word prediction)
- 🎶 **Music modeling**
- 🗣️ **Speech recognition**
- 📈 **Time-series forecasting**
- 👀 **Video frame prediction**

---

## 🧠 Simple RNN Architecture

A **Recurrent Neural Network (RNN)** processes **one step at a time**, remembering past info using a **hidden state**.

### 🧩 Main Idea:
- For each time step \( t \):
  - Input: \( x_t \)
  - Hidden state: \( h_t \)
  - Output: \( y_t \)

### 🧮 Equations:
$$
h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h)
$$

$$
y_t = W_{hy}h_t + b_y
$$

- \( W_{xh}, W_{hh}, W_{hy} \): weight matrices  
- \( b_h, b_y \): biases  
- \( h_{t-1} \): memory from previous step  
- \( \tanh \): activation function
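The two equations translate directly into code. This is a sketch in plain Python with matrices stored as lists of rows; the sizes (1 input feature, 2 hidden units, 1 output) and all weight values are arbitrary assumptions for illustration.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One time step: returns (h_t, y_t)."""
    pre_activation = [
        a + b + c
        for a, b, c in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev), b_h)
    ]
    h_t = [math.tanh(p) for p in pre_activation]
    y_t = [a + b for a, b in zip(matvec(W_hy, h_t), b_y)]
    return h_t, y_t


W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
W_hy = [[1.0, -1.0]]
b_h, b_y = [0.0, 0.0], [0.0]

h = [0.0, 0.0]                       # initial hidden state h_0
for x in ([1.0], [0.5], [-1.0]):     # sequence x_1, x_2, x_3
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
    print(h, y)
```

The same weights are reused at every step; only the hidden state `h` carries information forward.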

---

## 🔄 How It Works:
- Input sequence: [x₁, x₂, x₃, ..., xₙ]
- RNN processes one by one, updating hidden state \( h_t \)
- Output can be produced at each time step (e.g., part-of-speech tagging) or only after the final step (e.g., sentiment classification)

---

## ⚠️ Limitations:
- Struggles with **long-term dependencies** (earlier info gets lost)
- Suffers from **vanishing gradients**

✅ Gated variants like **LSTM** and **GRU** largely mitigate this.

---

## ✅ Summary

| Feature            | RNN                       |
|--------------------|---------------------------|
| Input              | Sequential                |
| Memory             | Yes (hidden state)        |
| Used for           | Text, audio, time-series  |
| Problem            | Vanishing gradient        |


# 22 # 🧠 Memory in Neural Networks: LSTM & GRU

## 🧩 Why Memory Matters
- In tasks like translation or time-series, models must **"remember"** earlier information.
- **RNNs** struggle with long sequences (vanishing gradients).
- 🛠️ **LSTM** and **GRU** are improved RNNs with **memory cells** to handle long-term dependencies.

---

## 📦 LSTM: Long Short-Term Memory

LSTM adds **gates** to control what to remember, forget, and output.

### 🔑 Gates:
1. **Forget Gate**: What to discard from memory  
   $$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) $$
2. **Input Gate**: What new info to store  
   $$ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) $$
3. **Candidate**: New memory candidate  
   $$ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) $$
4. **Cell State Update**: Combine old & new  
   $$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$
5. **Output Gate**: What to output  
   $$ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) $$
6. **Hidden State**: Final output  
   $$ h_t = o_t \odot \tanh(C_t) $$

- \( \odot \) means element-wise multiplication  
- \( \sigma \) is sigmoid function
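The six equations can be sketched for the scalar case (hidden size 1), where element-wise multiplication reduces to ordinary multiplication. The parameter dictionary `p` and its keys are hypothetical names that mirror the symbols above.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for scalar input and state; returns (h_t, c_t)."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])        # input gate
    c_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde                                  # cell state
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])        # output gate
    h_t = o_t * math.tanh(c_t)                                          # hidden state
    return h_t, c_t


keys = ["W_f", "U_f", "b_f", "W_i", "U_i", "b_i",
        "W_c", "U_c", "b_c", "W_o", "U_o", "b_o"]
params = {k: 0.0 for k in keys}
params["b_f"] = 100.0    # forget gate saturated open: keep the old memory
params["b_i"] = -100.0   # input gate closed: ignore the new candidate

h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(c)
```

With the forget gate open and the input gate closed, the cell state passes through unchanged, which illustrates how an LSTM can carry information across many steps.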

---

## ⚡ GRU: Gated Recurrent Unit

Simpler than LSTM: it merges the forget and input gates into a single update gate and has no separate cell state.

### 🔑 Gates:
1. **Update Gate**: Mix of old and new memory  
   $$ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) $$
2. **Reset Gate**: How much past to forget  
   $$ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) $$
3. **Candidate Memory**: New info  
   $$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) $$
4. **Final Output**: Combined memory  
   $$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$
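The GRU equations can be sketched the same way for the scalar case (hidden size 1). The parameter dictionary `p` and its keys are hypothetical names mirroring the symbols above.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One GRU step for scalar input and state; returns h_t."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])    # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])    # reset gate
    h_tilde = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde                     # blend old and new


keys = ["W_z", "U_z", "b_z", "W_r", "U_r", "b_r", "W_h", "U_h", "b_h"]
params = {k: 0.0 for k in keys}
params["b_z"] = -100.0   # update gate closed: keep the previous state

print(gru_step(x_t=1.0, h_prev=0.7, p=params))
```

With the update gate closed the previous hidden state is copied forward, playing the role that the cell state plays in an LSTM.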

---

## 📊 LSTM vs GRU

| Feature        | LSTM                  | GRU               |
|----------------|------------------------|-------------------|
| Gates          | 3 (input, forget, out) | 2 (update, reset) |
| Memory Cell    | Yes                    | No                |
| Complexity     | Higher                 | Lower             |
| Training Speed | Slower                 | Faster            |

---

## ✅ Summary

- Both LSTM and GRU mitigate the **loss of long-range information** (vanishing gradients) in simple RNNs.
- GRU is **simpler** and works well in many tasks.
- Used in **NLP**, **speech**, **stock prediction**, etc.


# 23 # 🔄 Autoencoders and Representation Learning

## What is an Autoencoder?
- A type of neural network used to **learn efficient data encoding**.
- It **compresses** input data into a smaller representation and then **reconstructs** it back.
- Consists of two parts:
  - **Encoder:** Compresses input \( x \) into a latent code \( z \)
  - **Decoder:** Reconstructs input from \( z \)

$$
\begin{aligned}
\text{Encoder: } & z = f(x) \\
\text{Decoder: } & \hat{x} = g(z)
\end{aligned}
$$

The network is trained to minimize the difference between \( x \) and \( \hat{x} \).
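A toy sketch of the encoder, decoder, and reconstruction error. The weights here are fixed by hand purely for illustration; in a real autoencoder \( f \) and \( g \) are neural networks whose weights are learned by minimizing this error. Mean squared error is assumed as the measure of difference.

```python
def encode(x):
    """f: compress a 2-D point into a 1-D latent code z."""
    return (x[0] + x[1]) / 2.0


def decode(z):
    """g: reconstruct a 2-D point from the latent code."""
    return [z, z]


def reconstruction_error(x):
    """Mean squared error between x and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


print(reconstruction_error([3.0, 3.0]))  # 0.0
print(reconstruction_error([1.0, 3.0]))  # 1.0
```

Points on the line \( x_1 = x_2 \) are reconstructed perfectly from a single number, while points off that line lose information; a trained autoencoder finds such a low-dimensional structure from data.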

---

## Latent Space
- The **compressed representation** \( z \) lives in a lower-dimensional space called **latent space**.
- Captures the most important features of the input data.
- Enables tasks like **data compression**, **denoising**, and **feature extraction**.

---

## Representation Learning
- Autoencoders learn **useful representations** of data automatically.
- These representations can improve performance in other tasks like classification or clustering.
- Helps models understand **underlying structure** without manual feature engineering.

---

## Summary
- Autoencoders learn to **compress and reconstruct** data.
- Latent space is the **compact feature space** learned by the encoder.
- Used for **dimensionality reduction**, **anomaly detection**, and (with variants such as variational autoencoders) **generative modeling**.


