# Machine Learning Revision Notebook

## 1. Basic Machine Learning Concepts

### 1.1 Supervised vs. Unsupervised Learning

#### Supervised Learning
Supervised learning refers to a class of algorithms that learn from *labeled data*, where each input is paired with an output label. The goal is to learn a function that maps inputs to outputs based on example input-output pairs. The algorithm tries to minimize the error between its predictions and the actual labels.

- **Examples**:
  - **Regression**: Predicting continuous values (e.g., house prices, stock prices).
  - **Classification**: Predicting discrete class labels (e.g., spam detection, image classification).

- **Key characteristics**:
  - Requires labeled data.
  - Has clear goals defined by the labels.
  - Can be used for both classification and regression tasks.

- **Common Algorithms**:
  - Linear Regression
  - Logistic Regression
  - Support Vector Machines (SVM)
  - Decision Trees
  - Neural Networks

#### Unsupervised Learning
Unsupervised learning deals with *unlabeled data*. The goal is to identify hidden patterns or intrinsic structures within the data without predefined labels.

- **Examples**:
  - **Clustering**: Grouping similar items together (e.g., customer segmentation).
  - **Association Rule Learning**: Discovering relationships between variables in large datasets (e.g., market basket analysis).

- **Key characteristics**:
  - No labeled output.
  - Learns the underlying structure from the data itself.
  - Often used for exploratory data analysis.

- **Common Algorithms**:
  - K-Means Clustering
  - Hierarchical Clustering
  - Principal Component Analysis (PCA)
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
  - Association Rule Mining (e.g., Apriori algorithm)

---

### 1.2 Common Machine Learning Algorithms

#### Linear Regression
Linear regression is a simple, interpretable model used for predicting continuous outcomes based on linear relationships between the input features and the target variable. The model assumes a linear dependency, represented as:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon $$

where:
- $ y $ is the target variable.
- $ x_1, x_2, \ldots, x_n $ are the input features.
- $ \beta_0 $ is the intercept.
- $ \beta_1, \beta_2, \ldots, \beta_n $ are the coefficients.
- $ \epsilon $ is the error term.

- **Advantages**:
  - Simple and interpretable.
  - Computationally efficient.
  - Works well when there is a linear relationship between input features and the target.

- **Limitations**:
  - Poor performance when there are complex nonlinear relationships.
  - Sensitive to outliers.
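
As a rough illustration of the linear model above, the coefficients can be estimated by ordinary least squares with NumPy. This is a minimal sketch, not a full regression workflow: the synthetic data and the "true" coefficients (2, 3, -1) are assumptions made up for the example.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration): y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions are the linear combination beta_0 + beta_1*x1 + beta_2*x2
y_pred = X_design @ beta
print(beta)
```

Because the data here contain no noise term, the recovered coefficients match the ones used to generate `y` up to floating-point error; with real data they would only approximate the underlying relationship.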

#### Logistic Regression
Logistic regression is used for binary classification tasks. It estimates the probability of a binary outcome by using the logistic (sigmoid) function to map predicted values to probabilities:

$$ P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}} $$

- **Advantages**:
  - Good for binary classification with linear boundaries.
  - Can handle imbalanced datasets using techniques like class weights.

- **Limitations**:
  - Assumes a linear decision boundary, so it cannot capture complex nonlinear relationships without feature engineering.
  - Can overfit with high-dimensional data if regularization is not used.
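
The probability formula above can be evaluated directly. The sketch below uses hand-picked coefficients (`beta0`, `beta`) and inputs that are assumptions for illustration only; in practice the coefficients are learned from data. The 0.5 decision threshold is also just a common default.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """P(y=1 | x) for each row of X, given an intercept and coefficients."""
    return sigmoid(beta0 + X @ beta)

# Hypothetical, hand-picked parameters (not fitted to any data)
beta0 = -1.0
beta = np.array([2.0, 0.5])

X = np.array([[0.0, 0.0],
              [0.5, 0.0],
              [2.0, 1.0]])

probs = predict_proba(X, beta0, beta)
labels = (probs >= 0.5).astype(int)  # threshold the probabilities at 0.5
print(probs)
print(labels)  # [0 1 1]
```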

#### Decision Trees and Random Forests
- **Decision Trees**:
  - A tree-based model that splits the data recursively based on feature values, creating branches that lead to decisions or predictions at the leaves.
  - The splits are based on criteria like Information Gain, Gini Impurity, or Chi-Square statistics.

- **Random Forests**:
  - An ensemble method that builds multiple decision trees and combines their results to improve prediction accuracy and control overfitting.
  - It reduces variance by averaging (or majority-voting) many trees, each trained on a bootstrap sample of the data and typically restricted to a random subset of features at each split.

- **Advantages**:
  - Can handle both numerical and categorical features.
  - Less preprocessing required (e.g., no need for feature scaling).

- **Limitations**:
  - Can be prone to overfitting (especially decision trees).
  - High computational cost for large forests.
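
To make the split criteria more concrete, here is a small sketch of how Gini impurity could be used to score a candidate split on a single numeric feature. The helper names (`gini`, `split_impurity`) and the toy data are assumptions for illustration; real libraries search over many features and thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: one feature, two classes
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]

print(gini(labels))                         # 0.5 (two equally frequent classes)
print(split_impurity(values, labels, 2.0))  # 0.0 (both children are pure)
print(split_impurity(values, labels, 1.0))  # higher impurity: a worse split
```

A tree-building algorithm would pick the threshold with the lowest weighted impurity (here 2.0) and then recurse on each child.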

#### K-Means Clustering
K-means clustering is an unsupervised algorithm that partitions data into $ K $ clusters based on feature similarity. The algorithm works by:

1. Initializing $ K $ cluster centroids.
2. Assigning each data point to the nearest centroid.
3. Updating each centroid to the mean of the points assigned to it.
4. Repeating steps 2 and 3 until convergence, i.e. until the assignments stop changing.

- **Advantages**:
  - Simple and easy to implement.
  - Scales well with a large number of samples.

- **Limitations**:
  - Requires pre-specifying the number of clusters $ K $.
  - Sensitive to outliers and initial centroid selection.
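
The four steps of the algorithm can be sketched in a few lines of NumPy. This is a simplified version under stated assumptions: centroids are initialized by sampling data points at random, distance is Euclidean, and the function name `kmeans` and the toy data are made up for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, which is one face of the sensitivity to initial centroids noted above.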

---

### 1.3 Evaluation Metrics

#### Classification Metrics
Evaluating classification models involves assessing how well they distinguish between classes. Common metrics include:

- **Accuracy**: Proportion of correctly predicted labels out of the total predictions.

$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}} $$

- **Precision**: Proportion of correctly predicted positive observations out of all predicted positives.

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$

- **Recall**: Proportion of correctly predicted positive observations out of all actual positives.

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$

- **F1-Score**: Harmonic mean of precision and recall. It balances the two and is more informative than accuracy on imbalanced datasets.

$$ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

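- **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**: The ROC curve plots the true positive rate against the false positive rate across all classification thresholds; the AUC is the area under that curve. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation of the classes.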
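
The threshold-based metrics above (accuracy, precision, recall, F1) can be computed from the four confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1; the function names and the toy labels are made up for illustration, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator would be zero (a convention, not a rule)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.8; precision, recall and F1 all 0.75
```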

#### Regression Metrics
Regression models predict continuous values, so their performance is evaluated using metrics that assess prediction error:

- **Mean Squared Error (MSE)**: Average squared difference between actual and predicted values. Sensitive to outliers.

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$

- **Mean Absolute Error (MAE)**: Average absolute difference between actual and predicted values. It is expressed in the same units as the target and is less sensitive to outliers than MSE.

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y_i}| $$

- **R-squared ($ R^2 $)**: Proportion of the variance in the target explained by the model. It is at most 1, with higher values indicating a better fit; a model that always predicts the mean scores 0, and a model that does worse than that scores below 0.

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

where $ y_i $ is the actual value, $ \hat{y_i} $ is the predicted value, and $ \bar{y} $ is the mean of the actual values.
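
The three regression formulas translate almost line for line into NumPy. The function name `regression_metrics` and the toy values are assumptions for this sketch.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for arrays of actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)  # 0.5 0.5 0.9
```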


## 2. Neural Networks and Deep Learning

### 2.1 Neural Network Architectures

A neural network is a computational model inspired by the structure and function of the brain. It consists of multiple layers of interconnected nodes (neurons) that transform input data through a series of mathematical operations to produce predictions or classifications.

**Components of a Neural Network:**
1. **Input Layer**: 
   - The first layer of the network that takes in the input features (e.g., pixel values of an image, sensor readings, etc.).
   - Each neuron in this layer represents one feature of the input data.

2. **Hidden Layers**: 
   - One or more layers between the input and output layers.
   - Each hidden layer is composed of neurons that apply linear transformations (e.g., matrix multiplication) followed by non-linear activation functions.
   - The more hidden layers (depth) and neurons (width) a network has, the more complex the relationships it can model, at the cost of more parameters to train and a higher risk of overfitting.

3. **Output Layer**: 
   - The final layer of the network that outputs predictions or classifications.
   - For regression tasks, the output might be a single neuron representing a continuous value.
   - For classification tasks, the output might contain multiple neurons with probabilities representing different classes.
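
A forward pass through the three kinds of layers can be sketched with plain NumPy. Everything specific here is an assumption for illustration: the layer sizes, the random (untrained) weights, the choice of tanh as the hidden activation, and a single linear output neuron as in a regression task.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 8  # layer sizes chosen only for illustration

# Randomly initialized weights and zero biases (an untrained network)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass: input -> hidden layer (tanh) -> single linear output."""
    hidden = np.tanh(X @ W1 + b1)  # linear transformation, then non-linearity
    return hidden @ W2 + b2        # one output neuron producing a continuous value

# Input layer: a batch of 5 samples, one column per input feature
X = rng.normal(size=(5, n_features))
y_hat = forward(X)
print(y_hat.shape)  # (5, 1)
```

Training would adjust `W1`, `b1`, `W2` and `b2` to reduce a loss; this sketch only shows how data flows through the layers.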

---

### 2.2 Activation Functions

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. Different activation functions are used depending on the problem and network architecture.

- **ReLU (Rectified Linear Unit)**: 
  $$ f(x) = \max(0, x) $$
  - ReLU is widely used in hidden layers because it is computationally efficient and helps mitigate the vanishing gradient problem.
  - It outputs zero for negative inputs and is linear for positive inputs.

- **Sigmoid**: 
  $$ f(x) = \frac{1}{1 + e^{-x}} $$
  - Maps the input to a range between 0 and 1, making it useful for binary classification problems.
  - However, it can suffer from vanishing gradients, making training deep networks difficult.

- **Tanh (Hyperbolic Tangent)**: 
  $$ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
  - Maps the input to a range between -1 and 1.
  - Its output is zero-centered, which often makes optimization in hidden layers easier than with sigmoid, although it also saturates for inputs of large magnitude.

- **Softmax**: 
  $$ f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$
  - Used in the output layer for multi-class classification tasks to convert raw scores (logits) into probabilities that sum to 1.
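
The four activation functions can be written directly from their formulas. This is a minimal NumPy sketch; subtracting the maximum inside `softmax` is a common numerical-stability trick added here and does not change the result.

```python
import numpy as np

def relu(x):
    """max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """1 / (1 + e^{-x}); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """(e^x - e^{-x}) / (e^x + e^{-x}); outputs lie in (-1, 1)."""
    return np.tanh(x)

def softmax(x):
    """Convert a 1-D vector of logits into probabilities that sum to 1."""
    shifted = x - np.max(x)  # stability trick; softmax is shift-invariant
    exp = np.exp(shifted)
    return exp / exp.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(sigmoid(x))
print(tanh(x))
print(softmax(x))
```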

---

### 2.3 Loss Functions

Loss functions quantify the difference between the predicted output and the actual target values. They guide the optimization process by providing a measure to minimize.

- **Cross-Entropy Loss**: 
  - Used for classification tasks, especially for multi-class classification problems.
  - Formula for binary classification:
  $$ \text{Cross-Entropy} = -[y \cdot \log(p) + (1-y) \cdot \log(1-p)] $$
  - Formula for multi-class classification:
  $$ \text{Cross-Entropy} = -\sum_{i=1}^{C} y_i \cdot \log(p_i) $$
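  where $ y $ is the true label (in the multi-class case, $ y_i $ is 1 for the correct class and 0 otherwise), $ p $ is the predicted probability ($ p_i $ for class $ i $), and $ C $ is the number of classes.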

- **Mean Squared Error (MSE)**: 
  - Commonly used for regression tasks.
  - Measures the average squared difference between the actual and predicted values.
  $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$
  where $ y_i $ is the actual value, $ \hat{y_i} $ is the predicted value, and $ n $ is the number of observations.
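
The two cross-entropy formulas can be checked numerically with a short NumPy sketch. The function names and the example probabilities are assumptions for illustration, and the clipping by a small `eps` is an added safeguard against `log(0)` rather than part of the formulas.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y*log(p) + (1-y)*log(1-p)] for a true label y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """-sum_i y_i*log(p_i) for a one-hot label vector and class probabilities."""
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    return -np.sum(y_onehot * np.log(p))

# Binary case: true label 1, predicted probability 0.5 -> loss equals log(2)
bce = binary_cross_entropy(1, 0.5)

# Multi-class case: the true class is the second of three
y_onehot = np.array([0.0, 1.0, 0.0])
p = np.array([0.2, 0.7, 0.1])
cce = categorical_cross_entropy(y_onehot, p)  # equals -log(0.7)

print(bce, cce)
```

Only the probability assigned to the correct class contributes to the multi-class loss, so a confident wrong prediction is penalized heavily.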

---

### 2.4 Optimizers

Optimizers are algorithms that adjust the model parameters (weights and biases) during training to minimize the loss function.

- **SGD (Stochastic Gradient Descent)**: 
  - Updates model parameters using the gradient of the loss function with respect to the parameters.
  - Estimates the gradient from a single sample or, more commonly, a small mini-batch for each update, making each step cheaper but noisier than full-batch gradient descent.
  - Formula:
  $$ \theta = \theta - \eta \cdot \nabla_\theta J(\theta) $$
  where $ \theta $ are the parameters, $ \eta $ is the learning rate, and $ \nabla_\theta J(\theta) $ is the gradient of the loss function $ J $.

- **Adam (Adaptive Moment Estimation)**:
  - Combines the advantages of both AdaGrad and RMSProp optimizers.
  - Maintains exponentially decaying averages of past gradients (first moment, $ m_t $) and past squared gradients (second moment, $ v_t $), which lets it adapt the step size for each parameter.
  - Formula:
  $$ m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla_\theta J(\theta) $$
  $$ v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot (\nabla_\theta J(\theta))^2 $$
  $$ \theta = \theta - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon} $$
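  where $ m_t $ and $ v_t $ are the first and second moment estimates, $ \beta_1 $ and $ \beta_2 $ are exponential decay rates, and $ \epsilon $ is a small constant to prevent division by zero. The formulas above omit the bias-correction step of the original algorithm, which replaces $ m_t $ and $ v_t $ in the update with $ \hat{m}_t = m_t / (1 - \beta_1^t) $ and $ \hat{v}_t = v_t / (1 - \beta_2^t) $.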

- **Other Optimizers**:
  - **RMSProp**: Adapts the learning rate based on a moving average of squared gradients.
  - **AdaGrad**: Adjusts learning rate based on the sum of all previous squared gradients.
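
To see the update rules in action, the sketch below minimizes a toy one-dimensional loss $ J(\theta) = (\theta - 3)^2 $ with both rules. The loss, starting point, learning rates and step counts are assumptions chosen for illustration. Because the gradient here is exact rather than estimated from a mini-batch, the first function is really plain gradient descent using the SGD update formula. The Adam function includes the bias-correction step from the original algorithm.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def sgd(theta, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * gradient."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam update with bias-corrected first and second moment estimates."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_sgd = sgd(0.0)
theta_adam = adam(0.0)
print(theta_sgd, theta_adam)  # both should end up close to the minimum at 3
```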


## 3. TensorFlow Setup and Basic Usage

# Install TensorFlow (if needed)
# !pip install tensorflow

# Import necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# Check TensorFlow version
print(f"TensorFlow Version: {tf.__version__}")

# Define a simple sequential model in TensorFlow
model = keras.Sequential([
    keras.Input(shape=(784,)),  # Example input shape for flattened MNIST images (28 * 28 pixels)
    layers.Dense(128, activation='relu'),  # Hidden layer
    layers.Dense(10, activation='softmax')  # Output layer for 10 classes
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()  # Display the model's architecture


## 4. Practical Examples and Coding Challenges

### Example 1: Classification with TensorFlow (MNIST Dataset)
We'll build a simple neural network to classify images from the MNIST dataset, which consists of handwritten digits.

### Example 2: Regression Analysis with TensorFlow
Implement a neural network to predict housing prices based on features such as number of rooms, square footage, etc.

### Example 3: Introduction to Reinforcement Learning (Optional)
Cover basic reinforcement learning concepts, such as the Q-learning algorithm, and implement a simple agent using TensorFlow.


# Example 1: Classification with TensorFlow (MNIST)

# Import required libraries
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
import matplotlib.pyplot as plt

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values

# Build the model
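model = Sequential([
    tf.keras.Input(shape=(28, 28)),  # 28x28 grayscale images
    Flatten(),                       # Flatten each image into a 784-element vector
    Dense(128, activation='relu'),   # Hidden layer
    Dense(10, activation='softmax')  # One probability per digit class
])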

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

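# Train the model, holding out 10% of the training data for validation
# so that the test set stays unseen until the final evaluation
history = model.fit(x_train, y_train, validation_split=0.1, epochs=5)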

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')

# Plot training history
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1])
plt.legend(loc='lower right')
plt.show()
