<img src='https://static.wixstatic.com/media/e9b721_888e9daa53994dfd9d2214ad1a823fcc~mv2.png/v1/fit/w_1000%2Ch_205%2Cal_c/file.png'/>

# Classification
##### *Def:* Classification in machine learning and neural networks is a type of supervised learning where the goal is to predict the categorical class labels of new instances, based on past observations. 

## Types of Classification in Machine Learning

### 1. Binary Classification

- **Description**: In binary classification, there are `only two classes` to predict. It's like answering a yes/no question.
- **Example**: Determining if a tumor is malignant (cancerous) or benign (non-cancerous).
- **Techniques**: Logistic regression, support vector machines, simple neural networks.

### 2. Multi-class Classification

- **Description**: Multi-class classification involves `categorizing data into three or more classes`.
- **Example**: Classifying types of fruits in an image.
- **Techniques**: Decision trees, naive Bayes classifiers, neural networks.

### 3. Multi-label Classification

- **Description**: In multi-label classification, `each instance can belong to multiple classes simultaneously.`
- **Example**: Movie categorization into genres like action, adventure, and fantasy.
- **Techniques**: Binary relevance, classifier chains, neural networks.
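
The first three task types differ mainly in how the target is encoded. The following is a minimal sketch in plain Python; the `GENRES` list and the helper names `one_hot` and `multi_hot` are illustrative assumptions, not part of any library.

```python
# Hypothetical label vocabulary, used only for illustration.
GENRES = ["action", "adventure", "fantasy"]

def one_hot(label, classes):
    """Multi-class target: exactly one class is active."""
    return [1 if c == label else 0 for c in classes]

def multi_hot(labels, classes):
    """Multi-label target: any number of classes can be active."""
    return [1 if c in labels else 0 for c in classes]

binary_target = 1                                              # binary: a single yes/no value
multiclass_target = one_hot("adventure", GENRES)               # one active class
multilabel_target = multi_hot({"action", "fantasy"}, GENRES)   # several active classes

print(binary_target, multiclass_target, multilabel_target)
```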

### 4. Imbalanced Classification

- **Description**: This refers to a challenge in datasets where some classes are much more frequent than others `(imbalanced dataset)`.
- **Example**: Fraud detection in banking.
- **Techniques**: Resampling techniques, synthetic data generation, anomaly detection methods.
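
One simple way to counter imbalance is to weight each class inversely to its frequency. The sketch below uses the common heuristic `n_samples / (n_classes * class_count)`; the function name and the toy fraud data are assumptions made for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up data: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
weights = inverse_frequency_weights(labels)
print(weights)  # the rare class receives a much larger weight
```

Weights like these can then be passed to a weighted loss function so that mistakes on the rare class count for more.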

### 5. Hierarchical Classification

- **Description**: Hierarchical classification deals with `classifying data into a hierarchy of classes.`
- **Example**: Classifying a set of products into categories and subcategories.
- **Techniques**: Local classifiers per node or per level, global hierarchical classifiers, recursive neural networks.

### 6. Cost-sensitive Classification

- **Description**: This type involves `factoring in the different costs of misclassification.`
- **Example**: Credit scoring.
- **Techniques**: Weighted loss functions, class-weighted decision trees, decision-threshold adjustment.
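
When misclassification costs differ, the decision threshold on the predicted probability can be shifted away from 0.5. A minimal sketch, assuming correct predictions cost nothing: predicting positive has expected cost `(1 - p) * cost_fp`, predicting negative has expected cost `p * cost_fn`, so positive is the cheaper choice when `p > cost_fp / (cost_fp + cost_fn)`. The function names and cost values are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def decide(p_positive, cost_fp, cost_fn):
    """Return 1 (positive) or 0 (negative) for a predicted probability."""
    return int(p_positive > cost_sensitive_threshold(cost_fp, cost_fn))

# Hypothetical credit-scoring costs: a missed defaulter (false negative)
# is assumed to be 9 times as costly as a wrongly rejected applicant.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0)
print(threshold, decide(0.2, 1.0, 9.0), decide(0.05, 1.0, 9.0))
```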

### 7. Ensemble Methods

- **Description**: `Combining the predictions from multiple models` to improve overall performance.
- **Example**: Random forests.
- **Techniques**: Bagging, boosting, stacking.
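
The simplest way to combine classifiers is a majority vote over their predictions. The sketch below is plain Python with made-up predictions from three hypothetical models; it illustrates the voting idea only, not bagging or boosting themselves.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model prediction lists (all the same length) by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Made-up predictions from three models on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

Each model is wrong on a different sample here, so the vote can be right where any single model is not.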


# <center>Architecture of Classification Model</center>

<table>
    <tr>
        <th>Component</th>
        <th>Description</th>
        <th>Description in PyTorch Context</th>
    </tr>
    <tr>
        <td><strong>1. Input Layer</strong></td>
        <td>Starting point for raw data input. Involves preprocessing.</td>  
        <td>Designed for specific input data format (e.g., reshaping tensors for images).</td>
    </tr>
    <tr>
        <td><strong>2. Feature Extraction</strong></td>
        <td>Extracts and selects relevant features from the input data.</td>
        <td>CNNs for image tasks, using <code>torch.nn.Conv2d</code>, <code>torch.nn.MaxPool2d</code>, etc.</td>
    </tr>
    <tr>
        <td><strong>3. Hidden Layers</strong></td>
        <td>Comprises layers of neurons or equivalent structures.</td>
        <td>Implemented using <code>torch.nn.Module</code>, layers stacked (e.g., <code>torch.nn.Linear</code>).</td>
    </tr>
    <tr>
        <td><strong>4. Activation Functions</strong></td>
        <td>Introduce non-linearity, enabling complex pattern learning.</td>
        <td>Includes ReLU (<code>torch.nn.ReLU</code>), Sigmoid (<code>torch.nn.Sigmoid</code>), etc.</td>
    </tr>
    <tr>
        <td><strong>5. Dropout &amp; Regularization</strong></td>
        <td>Techniques to prevent overfitting.</td>
        <td><code>torch.nn.Dropout</code> for dropout, L2 regularization via the optimizer's <code>weight_decay</code> argument, L1 as an extra penalty term added to the loss.</td>
    </tr>
    <tr>
        <td><strong>6. Output Layer</strong></td>
        <td>Designed for the specific type of classification task.</td>
        <td>Softmax (<code>torch.nn.Softmax</code>) for multi-class, Sigmoid for binary.</td>
    </tr>
    <tr>
        <td><strong>7. Loss Function</strong></td>
        <td>Measures the difference between predictions and actual labels.</td>
        <td>Cross-Entropy Loss (<code>torch.nn.CrossEntropyLoss</code>) for classification.</td>
    </tr>
    <tr>
        <td><strong>8. Optimization Algorithm</strong></td>
        <td>Adjusts model parameters to minimize loss.</td>
        <td>SGD (<code>torch.optim.SGD</code>), Adam (<code>torch.optim.Adam</code>), etc.</td>
    </tr>
    <tr>
        <td><strong>9. Backpropagation</strong></td>
        <td>Method for model learning, adjusting weights.</td>
        <td>Implemented using the <code>.backward()</code> method on loss object.</td>
    </tr>
    <tr>
        <td><strong>10. Evaluation Metrics</strong></td>
        <td>Measures model performance.</td>
        <td>Integration with libraries like <code>scikit-learn</code> for metrics.</td>
    </tr>
    <tr>
        <td><strong>11. Hyperparameter Tuning</strong></td>
        <td>Searches for the hyperparameter values (e.g., learning rate, batch size, layer sizes) that give the best performance.</td>
        <td>External libraries like Ray Tune or Optuna for tuning.</td>
    </tr>
    <tr>
        <td><strong>12. Model Validation &amp; Testing</strong></td>
        <td>Uses separate datasets for tuning and performance assessment.</td>
        <td>Evaluating performance on validation and test datasets.</td>
    </tr>
    <tr>
        <td><strong>13. Fine-Tuning &amp; Transfer Learning</strong></td>
        <td>Adjusting pre-trained models for specific tasks.</td>
        <td>Loading and modifying pre-trained models (<code>torchvision.models</code>).</td>
    </tr>
</table>


# 1. Neural Network Basics

- **Neurons and Layers:** Neurons are fundamental units that receive inputs, process them, and produce outputs. Layers of neurons, including input, hidden, and output layers, form a neural network.
<center>
    <img src='https://th.bing.com/th/id/OIP.uX2jV8NWOSLHvj-qNKJDRQHaE7?rs=1&pid=ImgDetMain'/>
</center>

- **Feedforward Process:** In a feedforward network, information moves forward from the input layer, through hidden layers, to the output layer, without any loops or cycles.
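
The feedforward pass can be sketched in a few lines of plain Python. The weights, biases, and layer sizes below are made up for illustration; each neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(w . x + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# Made-up parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 1.0]
hidden = dense(x, hidden_w, hidden_b, sigmoid)       # input layer -> hidden layer
output = dense(hidden, output_w, output_b, sigmoid)  # hidden layer -> output layer
print(hidden, output)
```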

# 2. Activation Functions

- **Purpose:** Introduce non-linearity, allowing the network to learn complex patterns.

## Types of Activation Function
### 1. Step Function
The step function is one of the simplest kinds of activation function. We choose a threshold value, and if the net input y is greater than the threshold, the neuron is activated.

Mathematically,
<img src='https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-5aea0186c8fb5d688db0e2bf6f2cc6c2_l3.svg'/>
<img src='https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-6f157897aa125b3805fa70ad0be07d71_l3.svg'/>

Graphically,

<img src='https://media.geeksforgeeks.org/wp-content/uploads/1-106.png'/>

### 2. Sigmoid Function:

   <img width=100 src='https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-1dea50a81ef2fff81ba63d65e858aa3b_l3.svg'/>
   <img src='https://media.geeksforgeeks.org/wp-content/uploads/5-31.png'/>

### 3. ReLU: 
<img width=200 src='https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-d3454b0c1549818eb14efa16eb232600_l3.svg'/>
   <img src='https://media.geeksforgeeks.org/wp-content/uploads/2-69.png'/>

`The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time: neurons with a negative input output exactly zero, so the activations are sparse.`
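
The three functions above can be written directly in plain Python. This is a sketch for intuition only; the default threshold of 0 and the strict "greater than" comparison in `step` are assumptions taken from the wording of the text.

```python
import math

def step(x, threshold=0.0):
    """Fires (1) only when the input is strictly greater than the threshold."""
    return 1 if x > threshold else 0

def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Passes positive inputs through and outputs zero otherwise."""
    return max(0.0, x)

inputs = [-2.0, 0.5, 3.0]
step_out = [step(x) for x in inputs]
relu_out = [relu(x) for x in inputs]
sigmoid_out = [sigmoid(x) for x in inputs]
print(step_out, relu_out, sigmoid_out)
```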

# 3. Loss Functions
- **Role:** Quantify the difference between the predicted outputs and actual labels.
- **Classification Specific:** Cross-entropy loss is commonly used in classification tasks, as it measures the performance of a classification model whose output is a probability value between 0 and 1.
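
For a single sample, cross-entropy reduces to the negative log of the probability the model assigned to the true class. The plain-Python sketch below uses made-up probability vectors to show that a confident wrong prediction is penalized far more than a confident right one.

```python
import math

def cross_entropy(probs, true_index):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[true_index])

# Made-up predictions for a 3-class problem where the true class is index 1.
confident_right = cross_entropy([0.05, 0.90, 0.05], true_index=1)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_index=1)
print(confident_right, confident_wrong)
```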

# 4. Backpropagation
- **Mechanism:** The process through which the neural network learns. Backpropagation applies the chain rule to compute the gradient of the loss function with respect to every weight, working backward from the output layer to the input layer.
- **Gradient Descent:** The optimization algorithm that uses those gradients to adjust weights and biases, moving each parameter a small step in the direction that reduces the loss.
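
In PyTorch, one learning step usually consists of a forward pass, a call to `.backward()` to compute gradients, and an optimizer step that applies them. The sketch below is one possible arrangement on made-up toy data; the layer sizes, learning rate, number of steps, and the choice of `nn.BCEWithLogitsLoss` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data (made up): the label is 1 when the two features sum to more than 1.
X = torch.rand(64, 2)
y = (X.sum(dim=1) > 1.0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()          # sigmoid + binary cross-entropy on raw logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

losses = []
for _ in range(200):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = criterion(model(X), y)           # forward pass and loss
    loss.backward()                         # backpropagation: compute gradients
    optimizer.step()                        # gradient descent: update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```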

# 5. Optimization Algorithms
- **Purpose:** Optimize the network’s weights and biases.
- **Examples:** Stochastic Gradient Descent (SGD), Adam, RMSprop. Adam is widely used due to its adaptive learning rate capabilities.

# 6. Overfitting and Regularization
- **Overfitting:** Occurs when a model learns the training data too well, including its noise and outliers, leading to poor generalization.
- **Regularization Techniques:** Dropout, L1/L2 regularization, and early stopping are used to prevent overfitting.
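
Early stopping can be sketched without any framework: track the best validation loss and stop once it has failed to improve for a set number of epochs. The function name, the `patience` parameter, and the loss values below are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Made-up validation losses: improvement stops after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```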

# 7. Data Preprocessing
- **Importance:** Involves scaling, normalizing, or standardizing input data to improve model performance.
- **In PyTorch:** Transformations can be applied to data using torchvision’s transforms module.
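
Standardization rescales a feature to zero mean and unit variance. The plain-Python sketch below shows the arithmetic on made-up values; the function name is an assumption, and for images the same idea is typically applied per channel.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
scaled = standardize(raw)
print(scaled)
```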

# 8. Batch Processing
- **Concept:** Training the network using batches of data rather than the entire dataset at once.
- **Benefits:** More efficient memory usage and often faster convergence.
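
Mini-batching amounts to shuffling the sample indices and slicing them into fixed-size chunks. The generator below is a hypothetical plain-Python helper for illustration; in PyTorch this job is normally done by `torch.utils.data.DataLoader`.

```python
import random

def iterate_minibatches(samples, batch_size, shuffle=True, seed=0):
    """Yield successive batches; the last batch may be smaller."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))
batches = list(iterate_minibatches(samples, batch_size=4))
print([len(b) for b in batches])
```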

# 9. Epochs and Learning Rate
- **Epochs:** One epoch represents one complete pass of the training dataset through the algorithm.
- **Learning Rate:** Determines the step size at each iteration while moving toward a minimum of the loss function.
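
The effect of the learning rate is easy to see on a one-parameter problem. The sketch below minimizes the made-up function f(w) = w², whose gradient is 2w, with three assumed learning rates.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = gradient_descent(lr=0.01)  # step too small: slow progress toward 0
good = gradient_descent(lr=0.3)    # converges quickly
large = gradient_descent(lr=1.1)   # step too large: overshoots and diverges
print(small, good, large)
```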

# 10. Evaluation Metrics
- **For Classification:** Accuracy, precision, recall, F1 score.
- **In PyTorch:** These metrics aren’t directly part of PyTorch but can be calculated using additional libraries or custom functions.
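
A custom function for these metrics needs only the four confusion counts. The following plain-Python sketch assumes binary labels with 1 as the positive class and does not guard against division by zero; the function names and the sample labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_report(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
report = classification_report(y_true, y_pred)
print(report)
```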

# 11. Transfer Learning
- **Utilization:** Involves taking a pre-trained model and fine-tuning it for a specific task.
- **In PyTorch:** Easily implemented using models from torchvision.models.
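
The usual fine-tuning pattern is to freeze the pre-trained layers and train only a new output head. The sketch below uses a small stand-in `backbone` so that it runs without downloading anything; in practice the backbone would be a model loaded from `torchvision.models`. All sizes and names here are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained feature extractor (hypothetical).
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())

# Freeze the backbone so its weights are not updated during fine-tuning.
for param in backbone.parameters():
    param.requires_grad = False

# New task-specific head for an assumed 3-class problem.
head = nn.Linear(32, 3)
model = nn.Sequential(backbone, head)

# Only the parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

logits = model(torch.randn(4, 16))
print(len(trainable), tuple(logits.shape))
```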

# 12. PyTorch Specifics
- **Dynamic Computation Graph:** PyTorch uses dynamic computation graphs (Dynamic Neural Networks), which provides flexibility in building complex architectures.
- **Eager Execution:** PyTorch operations are executed as they are defined, making debugging and working with complex models more intuitive.
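
Because the graph is built while the code runs, ordinary Python control flow can decide which operations are recorded. A minimal sketch with a made-up `forward` function that takes a different branch depending on the input:

```python
import torch

def forward(x):
    # Plain Python branching; only the branch that runs is recorded for autograd.
    if x.sum() > 0:
        return (x ** 2).sum()
    return (-x).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x).backward()
grad_positive = x.grad.clone()   # gradient of sum(x^2) is 2x

z = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(z).backward()
grad_negative = z.grad.clone()   # gradient of sum(-z) is -1 for every element

print(grad_positive, grad_negative)
```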

# Specific Activation Functions and Optimizers to Problem Types

| Problem Type      | Activation Function | Optimizer   | Code Example (PyTorch)  |
|-------------------|---------------------|-------------|-------------------------|
| Binary Classification | Sigmoid            | Adam        | `model.add_module('sigmoid', nn.Sigmoid())`<br>`optimizer = torch.optim.Adam(model.parameters())`<br>`criterion = nn.BCELoss()` |
| Multi-class Classification | Softmax            | SGD         | `# no nn.Softmax layer: nn.CrossEntropyLoss applies log-softmax to raw logits`<br>`optimizer = torch.optim.SGD(model.parameters(), lr=0.01)`<br>`criterion = nn.CrossEntropyLoss()` |
| Regression    | ReLU (or Linear for output) | RMSprop     | `model.add_module('relu', nn.ReLU())`<br>`optimizer = torch.optim.RMSprop(model.parameters())`<br>`criterion = nn.MSELoss()` |
| Time Series Prediction | Tanh (or LSTM/GRU activations) | Adam | `model.add_module('lstm', nn.LSTM(input_size, hidden_size))`<br>`optimizer = torch.optim.Adam(model.parameters())`<br>`criterion = nn.MSELoss()` |

**Criterion is another name for the loss or cost function.**
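
How an output activation pairs with a criterion matters in practice: `nn.CrossEntropyLoss` applies log-softmax internally and therefore expects raw logits, while `nn.BCELoss` expects probabilities that have already passed through a sigmoid. The sketch below checks both equivalences numerically on made-up tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Multi-class: CrossEntropyLoss on raw logits equals NLL on log-softmax outputs.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])
ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Binary: BCELoss on sigmoid outputs equals BCEWithLogitsLoss on raw logits.
binary_logits = torch.tensor([0.8, -1.2])
binary_targets = torch.tensor([1.0, 0.0])
bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)
bce_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)

print(ce.item(), manual.item(), bce.item(), bce_logits.item())
```

Adding an explicit softmax layer before `nn.CrossEntropyLoss` would apply softmax twice and is a common source of slow or poor training.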

# Non-linear Classification

<img src='https://th.bing.com/th/id/OIP.1FSUhGWGbOXYyicjR2E56QHaD6?rs=1&pid=ImgDetMain'/>

- **Non-linear classification** refers to a classification task where the **relationship between the input features and the class labels is non-linear**, meaning it *cannot be represented with a straight line* (or a hyperplane in higher dimensions).
- This kind of classification is common in real-world scenarios where the **data patterns are complex and intricate.**
- Non-linear classifiers can capture these complex patterns and are often implemented using machine learning algorithms such as neural networks, SVMs with non-linear kernels, decision trees, and GBMs.


## Non-linear Classification Algorithms

Non-linear classification is essential in machine learning for dealing with complex data patterns. Several algorithms can model non-linear relationships effectively. Here are some notable ones:

### 1. Neural Networks (Multi-layer Perceptron)

- **Description**: Neural networks consist of layers of interconnected nodes (neurons). Each connection has a weight that is adjusted during training. Non-linear activation functions introduce non-linearity.
- **PyTorch Implementation**: 
  - Layers: `nn.Linear(input_size, hidden_size)`
  - Activation: `nn.ReLU()` or other non-linear functions like `nn.Sigmoid()`, `nn.Tanh()`
  - Example: 
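    ```python
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self, input_size, hidden_size, num_classes):
            super().__init__()
            self.fc1 = nn.Linear(input_size, hidden_size)   # input -> hidden
            self.relu = nn.ReLU()                           # non-linearity
            self.fc2 = nn.Linear(hidden_size, num_classes)  # hidden -> class logits

        def forward(self, x):
            # Returns raw logits, one score per class.
            return self.fc2(self.relu(self.fc1(x)))
    ```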
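
A possible way to use such a network is sketched below. It assumes a version of `Net` whose sizes are passed to the constructor, and the sizes and random input batch are made up; softmax is applied only at prediction time to turn logits into probabilities.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
model = Net(input_size=4, hidden_size=8, num_classes=3)

batch = torch.randn(5, 4)              # 5 samples with 4 features each
logits = model(batch)                  # raw scores, shape (5, 3)
probs = torch.softmax(logits, dim=1)   # each row sums to 1
predicted = probs.argmax(dim=1)        # index of the most likely class per sample
print(predicted)
```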

### 2. Support Vector Machines (SVM) with Non-linear Kernels

- **Description**: SVMs can be extended to non-linear classification using kernels like Radial Basis Function (RBF). The kernel transforms data into a higher-dimensional space where it becomes linearly separable.
- **PyTorch Implementation**: PyTorch does not natively support SVMs with kernels. One would typically use libraries like scikit-learn for SVMs. However, custom implementations or extensions might be available in third-party libraries.
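
The idea of mapping data into a space where it becomes linearly separable can be shown with an explicit feature map. Note that this sketch swaps in a simple polynomial map `x -> (x, x²)` instead of the RBF kernel, which performs a comparable mapping implicitly; the data and helper names are made up.

```python
# 1-D points: class 1 lies on both sides of class 0, so no single threshold works.
points = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def separable_by_threshold(values, labels):
    """True if one threshold puts each class entirely on its own side."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return max(neg) < min(pos) or max(pos) < min(neg)

def feature_map(x):
    """Explicit map into 2-D: (x, x squared)."""
    return (x, x * x)

original_ok = separable_by_threshold(points, labels)

# In the mapped space, the second coordinate alone separates the classes.
squared = [feature_map(x)[1] for x in points]
mapped_ok = separable_by_threshold(squared, labels)

print(original_ok, mapped_ok)
```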

### 3. Decision Trees and Random Forests

- **Description**: Decision trees split the data based on feature values. They are inherently non-linear. Random forests combine multiple decision trees to improve performance and reduce overfitting.
- **PyTorch Implementation**: Decision trees and random forests are not typically implemented with PyTorch, as it specializes in neural networks. Libraries like scikit-learn are more suitable for these algorithms.
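
The core operation of a decision tree, choosing a split on a feature value, can be sketched in plain Python. This example uses Gini impurity as the split criterion on a single made-up feature with binary labels; it is one split (a "stump"), not a full tree.

```python
def gini(labels):
    """Gini impurity for binary labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold with the lowest weighted impurity, and that impurity."""
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```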

### 4. Gradient Boosting Machines (GBM)

- **Description**: GBMs are ensemble learning methods that build trees sequentially, with each tree trying to correct the errors of the previous one.
- **PyTorch Implementation**: Like decision trees, GBMs are not commonly implemented in PyTorch. Libraries like XGBoost or LightGBM are preferred for these tasks.


### Click Here for all Non-Linear Activation Functions
<a href='https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity'>
<img src='https://th.bing.com/th/id/OIP.8AaAYxLb-VOgGUW8V8JXQAAAAA?rs=1&pid=ImgDetMain' width=100/>
</a>


<table>
    <tr>
        <th>Name</th>
        <th>What it Does</th>
        <th>When to Use It</th>
        <th>Mathematical Expression</th>
        <th>Graph</th>
    </tr>
    <tr>
        <td>ReLU (Rectified Linear Unit)</td>
        <td>Allows only positive values to pass through, effectively turning off negative values.</td>
        <td>Default choice for many types of neural networks.</td>
        <td>\( f(x) = \max(0, x) \)</td>
        <td><img src="https://upload.wikimedia.org/wikipedia/commons/6/6c/Rectifier_and_softplus_functions.svg" alt="ReLU Function" style="width:100px;height:auto;"></td>
    </tr>
    <tr>
        <td>Sigmoid</td>
        <td>Squashes the input values into a range between 0 and 1.</td>
        <td>Commonly used for binary classification in the output layer.</td>
        <td>\( f(x) = \frac{1}{1 + \exp(-x)} \)</td>
        <td><img src="https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg" alt="Sigmoid Function" style="width:100px;height:auto;"></td>
    </tr>
    <tr>
        <td>Tanh (Hyperbolic Tangent)</td>
        <td>Similar to sigmoid but squashes values between -1 and 1.</td>
        <td>Useful when zero-centered outputs help, such as in recurrent networks; less common than ReLU in deep feedforward hidden layers.</td>
        <td>\( f(x) = \tanh(x) = \frac{2}{1 + \exp(-2x)} - 1 \)</td>
        <td><img src="https://th.bing.com/th/id/OIP.rg81f4U5WRMYksIpZdTHQAHaHa?rs=1&pid=ImgDetMain" alt="Tanh Function" style="width:100px;height:auto;"></td>
    </tr>
    <tr>
        <td>Leaky ReLU</td>
        <td>A variant of ReLU, it allows a small, non-zero gradient when the unit is not active.</td>
        <td>Useful to fix the “dying ReLU” problem.</td>
        <td>\( f(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{if } x \leq 0 \end{cases} \)</td>
        <td><img src="https://pytorch.org/docs/stable/_images/LeakyReLU.png" alt="Leaky ReLU Function" style="width:100px;height:auto;"></td>
    </tr>
    <tr>
        <td>Softmax</td>
        <td>Converts a vector of values into a probability distribution.</td>
        <td>Used in the output layer of a multi-class classification neural network.</td>
        <td>\( f(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \) for \( j = 1, 2, \ldots, J \)</td>
        <td>Varies based on input vector</td>
    </tr>
</table>
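
The remaining expressions in the table translate directly into plain Python. This sketch uses the table's 0.01 slope for Leaky ReLU, checks the tanh identity against `math.tanh`, and subtracts the maximum before exponentiating in softmax, a common trick for numerical stability that does not change the result. The input values are made up.

```python
import math

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of becoming 0."""
    return x if x > 0 else negative_slope * x

def tanh_via_exp(x):
    """tanh written as 2 / (1 + exp(-2x)) - 1, as in the table."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def softmax(xs):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(xs)                              # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(leaky_relu(-2.0), tanh_via_exp(0.7), probs)
```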


# Classification metrics

<table>
    <tr>
        <th>Metric</th>
        <th>Meaning</th>
        <th>What it Measures</th>
        <th>When to Use</th>
        <th>Formula</th>
    </tr>
    <tr>
        <td>Accuracy</td>
        <td>Ratio of correctly predicted observations to total observations.</td>
        <td>Overall effectiveness of the classifier.</td>
        <td>When class distribution is even.</td>
        <td>\( \frac{\text{Correct Predictions}}{\text{Total Predictions}} \)</td>
    </tr>
    <tr>
        <td>Precision</td>
        <td>Ratio of correctly predicted positive observations to total predicted positives.</td>
        <td>Accuracy of positive predictions.</td>
        <td>When cost of false positives is high.</td>
        <td>\( \frac{\text{TP}}{\text{TP + FP}} \)</td>
    </tr>
    <tr>
        <td>Recall (Sensitivity)</td>
        <td>Ratio of correctly predicted positive observations to all actual positive observations.</td>
        <td>Classifier's ability to find all positive samples.</td>
        <td>When cost of false negatives is high.</td>
        <td>\( \frac{\text{TP}}{\text{TP + FN}} \)</td>
    </tr>
    <tr>
        <td>F1 Score</td>
        <td>Harmonic mean of Precision and Recall.</td>
        <td>Balance between Precision and Recall.</td>
        <td>Uneven class distribution or different costs for FN and FP.</td>
        <td>\( 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}} \)</td>
    </tr>
    <tr>
        <td>Confusion Matrix</td>
        <td>Table layout showing performance of algorithm.</td>
        <td>Correct and incorrect predictions by class.</td>
        <td>Detailed performance analysis.</td>
        <td>N/A</td>
    </tr>
    <tr>
        <td>ROC Curve and AUC</td>
        <td>Graph showing performance at various thresholds.</td>
        <td>Trade-off between TPR and FPR.</td>
        <td>Comparing models; balance between sensitivity and specificity.</td>
        <td>N/A</td>
    </tr>
    <tr>
        <td>Specificity</td>
        <td>Ratio of true negatives to actual negatives.</td>
        <td>Model's ability to identify true negatives.</td>
        <td>Cost of false positives is significant.</td>
        <td>\( \frac{\text{TN}}{\text{TN + FP}} \)</td>
    </tr>
    <tr>
        <td>Log Loss</td>
        <td>Performance of model where prediction is a probability.</td>
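        <td>How far predicted probabilities are from the true labels; confident wrong predictions are penalized most.</td>
        <td>When the quality of the predicted probabilities matters, not just the predicted class.</td>
        <td>\( -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] \) (binary case)</td>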
    </tr>
    <tr>
        <td>Matthews Correlation Coefficient (MCC)</td>
        <td>Balanced measure that considers TP, FP, TN, FN.</td>
        <td>Quality of binary classifications in imbalanced datasets.</td>
        <td>Dealing with imbalanced datasets.</td>
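        <td>\( \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP + FP})(\text{TP + FN})(\text{TN + FP})(\text{TN + FN})}} \)</td>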
    </tr>
    <tr>
        <td>Balanced Accuracy</td>
        <td>Average of recall obtained on each class.</td>
        <td>Accuracy in imbalanced datasets.</td>
        <td>Imbalanced classes; considering balance between classes.</td>
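        <td>\( \frac{1}{2} \left( \frac{\text{TP}}{\text{TP + FN}} + \frac{\text{TN}}{\text{TN + FP}} \right) \) (binary case)</td>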
    </tr>
</table>
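
The metrics suited to imbalanced data can all be computed from the four confusion counts. The plain-Python sketch below covers specificity, balanced accuracy (binary case), and MCC; the counts are made up, and returning 0 for MCC when the denominator is zero is a convention assumed here.

```python
import math

def specificity(tn, fp):
    return tn / (tn + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the recall on the positive class and on the negative class."""
    return (recall(tp, fn) + specificity(tn, fp)) / 2

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Made-up confusion counts.
tp, fp, fn, tn = 40, 5, 10, 45
print(specificity(tn, fp), balanced_accuracy(tp, fp, fn, tn), mcc(tp, fp, fn, tn))
```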


# False Positive and False Negative Learning Hack!!

![image.png](attachment:efc052cb-862b-4b8c-8d38-e08bf534e066.png)