# **Classification in Machine Learning: Detailed Notes**

---

### **What is Classification?**

**Classification** is a supervised learning technique in which a model learns from labeled training data to predict the class label of new data points. It applies when the output variable is categorical, so the goal is to predict a discrete label rather than a continuous value.

For instance, the task could be to classify emails as either "Spam" or "Not Spam," or to classify an image of a fruit as "Apple," "Banana," or "Orange."

---

## **Classification in Machine Learning - Key Concepts**

### **Supervised Learning**
In classification, we use **supervised learning** where the model is trained on a labeled dataset, meaning that each training example is paired with a label. The model learns to map input features to their respective class labels.

### **Classes and Labels**
The output or **label** is a discrete value, meaning the possible outcomes are **finite and predefined**. Classification tasks are commonly grouped by how labels are assigned:
- **Binary Classification**: Two classes (e.g., "Yes" or "No").
- **Multi-Class Classification**: More than two classes (e.g., "Cat," "Dog," "Bird").
- **Multi-Label Classification**: An instance can belong to multiple classes simultaneously (e.g., tags in a blog post).
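
As a rough illustration of how the three settings differ, the sketch below shows one possible way to represent the targets in each case. The sample labels and the `to_indicator` helper are hypothetical and only serve the example.

```python
# Hypothetical targets for the three classification settings.
binary_labels = ["Yes", "No", "Yes"]                 # one of two classes per instance
multiclass_labels = ["Cat", "Dog", "Bird"]           # one of several classes per instance
multilabel_tags = [{"python", "ml"}, {"ml"}, set()]  # zero or more classes per instance


def to_indicator(tag_sets, classes):
    """Encode multi-label targets as one 0/1 row per instance."""
    return [[1 if c in tags else 0 for c in classes] for tags in tag_sets]


classes = ["ml", "python"]
indicator_rows = to_indicator(multilabel_tags, classes)
print(indicator_rows)  # [[1, 1], [1, 0], [0, 0]]
```

In the multi-label case each instance gets one 0/1 entry per class, so several entries in a row can be 1 at the same time.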

### **Training a Classification Model**
In training a classification model:
1. The model learns the relationship between **input features** and **class labels** in the training data.
2. Once trained, the model can predict the class labels for **unseen data**.
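
The two steps above can be sketched with a deliberately simple stand-in model. The example below uses a nearest-centroid rule (not an algorithm covered in these notes) purely to show a training step followed by prediction on unseen data; the fruit measurements and the function names are hypothetical.

```python
from collections import defaultdict


def fit_centroids(X, y):
    """Training step: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, defaultdict(int)
    for features, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Prediction step: return the class whose centroid is closest."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


# Hypothetical labeled training data: [weight_g, diameter_cm] -> fruit.
X_train = [[150, 7.0], [170, 7.5], [120, 3.5], [110, 3.0]]
y_train = ["Apple", "Apple", "Banana", "Banana"]

model = fit_centroids(X_train, y_train)
print(predict(model, [160, 7.2]))  # Apple
print(predict(model, [115, 3.2]))  # Banana
```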

---

### **Common Classification Algorithms**

---

#### **1. Logistic Regression (Binary Classification)**

**Logistic Regression** is a statistical method used for binary classification. Despite its name, it is a **classification algorithm**, not a regression one.

- **How it works**: Logistic regression uses the **logistic (sigmoid)** function to model the probability that a data point belongs to one of the two classes.

- **Sigmoid Function**:
  $$
  \hat{y} = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
  $$
  where $ z $ is a linear combination of the input features $ x_1, x_2, \dots, x_n $.

- **Outcome**: The output of logistic regression is a probability (between 0 and 1). A threshold (usually 0.5) is used to assign a class label (e.g., if the output is >0.5, classify as "1" or "True").

- **Use Cases**: 
  - Email spam classification.
  - Medical diagnosis (e.g., predicting whether a tumor is malignant or benign).

- **Advantages**:
  - Simple and easy to implement.
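  - Interpretability: Each coefficient gives the direction and strength of a feature's effect on the log-odds of the positive class (magnitudes are comparable only when features share a scale).
  - Works well when the classes are approximately linearly separable.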

- **Disadvantages**:
  - Assumes a linear decision boundary, which may not work well for complex data.
  - Regularized variants and gradient-based solvers are sensitive to feature scaling (standardizing the features usually helps).
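
The sigmoid formula and the thresholding step described above can be sketched as follows. The coefficients are hypothetical values picked by hand, not the result of fitting a model, and the function names are chosen for this example only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, beta0, betas):
    """Probability of class 1: sigmoid of z = beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


def predict_label(x, beta0, betas, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return 1 if predict_proba(x, beta0, betas) > threshold else 0


# Hypothetical coefficients, chosen by hand rather than learned from data.
beta0, betas = -1.0, [2.0, -0.5]

print(sigmoid(0.0))                             # 0.5
print(predict_label([1.0, 1.0], beta0, betas))  # z = 0.5  -> 1
print(predict_label([0.0, 2.0], beta0, betas))  # z = -2.0 -> 0
```

Raising or lowering `threshold` trades false positives against false negatives without changing the fitted coefficients.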

---

#### **2. K-Nearest Neighbors (KNN)**

**K-Nearest Neighbors (KNN)** is a simple, instance-based learning algorithm. It does not make assumptions about the data distribution and works by finding the most similar data points to a new data point.

- **How it works**: Given a data point, KNN looks at the **K nearest neighbors** (using a distance metric like Euclidean distance) and assigns the class label based on the majority vote among those neighbors.

- **Use Cases**:
  - Image classification.
  - Recommender systems.
  - Handwritten digit recognition.

- **Advantages**:
  - Simple and intuitive.
  - Non-parametric (no assumptions about the form of the underlying data distribution).
  - Performs well for small to medium datasets.

- **Disadvantages**:
  - Computationally expensive for large datasets (high prediction time).
  - Sensitive to irrelevant or redundant features.
  - Performance depends heavily on the choice of $ K $ and distance metric.
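
A minimal sketch of the majority-vote rule described above, assuming Euclidean distance and a small hypothetical 2-D dataset. It scans every training point for each query, which is the source of the high prediction cost noted in the disadvantages.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(X_train, y_train)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training points with two classes.
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.0], [6.5, 5.5]]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, [2.0, 2.0], k=3))  # A
print(knn_predict(X_train, y_train, [6.0, 5.0], k=3))  # B
```

With two classes, an odd `k` avoids tied votes.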

---

#### **3. Decision Tree Classifier**

**Decision Trees** create a model based on a series of decisions. Each internal node of the tree represents a decision on a feature, and each leaf node represents a class label.

- **How it works**: A decision tree splits the data based on the feature that provides the **best split** (using metrics like Gini Impurity or Entropy).

- **Use Cases**:
  - Customer segmentation.
  - Predicting loan approval.

- **Advantages**:
  - Easy to interpret and visualize.
  - Handles both numerical and categorical data.
  - Some implementations can handle missing data (e.g., through surrogate splits).

- **Disadvantages**:
  - Prone to overfitting, especially with deep trees.
  - Unstable: small changes in the data can cause large changes in the tree structure.
  - Split criteria tend to favor categorical features with many levels.
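
To make the idea of a "best split" concrete, the sketch below computes Gini impurity and searches one numeric feature for the threshold with the lowest weighted impurity. It covers a single split only, not a full tree, and the loan data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    n = len(labels)
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the max leaves one side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical loan data: applicant income (in thousands) and approval label.
income = [20, 25, 30, 45, 50, 60]
approved = ["no", "no", "no", "yes", "yes", "yes"]

print(gini(approved))                    # 0.5
print(best_threshold(income, approved))  # (30, 0.0)
```

A tree-building procedure would apply this search to every feature, keep the best one, and recurse on the two resulting subsets.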

---

#### **4. Random Forest Classifier**

**Random Forest** is an ensemble learning method that uses multiple decision trees to improve classification accuracy. It uses **bagging** (Bootstrap Aggregating) to build several trees and combines their outputs.

- **How it works**: It trains each decision tree on a bootstrap sample of the data, typically considering only a random subset of features at each split, and combines the trees' predictions by majority vote.

- **Use Cases**:
  - Stock market prediction.
  - Fraud detection.

- **Advantages**:
  - Reduces overfitting compared to a single decision tree.
  - Can handle high-dimensional datasets.
  - Robust to outliers.

- **Disadvantages**:
  - Less interpretable than individual decision trees.
  - Computationally expensive (especially for large datasets).
  - Can overfit if not tuned properly.
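
The bagging and voting steps can be sketched in a heavily simplified form. The example below is an assumption-laden toy: it uses one-split trees (stumps) and a single random feature per tree, whereas a real random forest grows deeper trees and samples features at every split. The transaction data and all function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y, feature):
    """Fit a one-split tree on one feature: pick the threshold with fewest errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, left_label, right_label)
    return best[1:]


def train_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        feature = rng.randrange(len(X[0]))                     # random feature choice
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical transactions: [amount, hour_of_day] -> label.
X = [[10, 9], [20, 14], [15, 11], [900, 3], [750, 2], [820, 4]]
y = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]

forest = train_forest(X, y)
print(forest_predict(forest, [8, 12]))   # ok
print(forest_predict(forest, [880, 1]))  # fraud
```

Individual stumps can be wrong (for example, when a bootstrap sample happens to contain only one class), but the vote across many of them smooths this out.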

---

#### **5. Support Vector Machine (SVM)**

**Support Vector Machine (SVM)** aims to find the **optimal hyperplane** that separates the data into different classes, maximizing the margin between the classes.

- **How it works**: SVM tries to find the hyperplane that best separates data points from different classes. It can be used for both **linear** and **non-linear** data, with the help of **kernel functions**.

- **Use Cases**:
  - Text classification (e.g., spam detection).
  - Image classification (e.g., facial recognition).

- **Advantages**:
  - Effective in high-dimensional spaces.
  - Works well when there is a clear margin of separation between classes.

- **Disadvantages**:
  - Scales poorly to large datasets: training a kernel SVM typically takes time that grows between quadratically and cubically with the number of samples.
  - Difficult to tune the parameters (e.g., $ C $, kernel function).
  - Sensitive to the choice of kernel.
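
The sketch below illustrates the pieces mentioned above for a linear SVM: the hyperplane's decision value, the margin width $2 / \lVert w \rVert$, and an RBF kernel as one example of a kernel function. It also shows the hinge loss, which is the standard soft-margin penalty but is not named in these notes. The weights are hypothetical and set by hand; no training is performed.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Hinge loss for a label y in {-1, +1}: zero once the point clears the margin."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def rbf_kernel(a, c, gamma=0.5):
    """RBF kernel exp(-gamma * ||a - c||^2), a common choice for non-linear data."""
    return math.exp(-gamma * sum((ai - ci) ** 2 for ai, ci in zip(a, c)))


# Hypothetical hyperplane parameters, chosen by hand rather than learned.
w, b = [1.0, 1.0], -3.0
margin_width = 2.0 / math.hypot(*w)

print(decision_value(w, b, [3.0, 2.0]))    # 2.0 -> positive side
print(hinge_loss(w, b, [3.0, 2.0], +1))    # 0.0 -> outside the margin
print(hinge_loss(w, b, [1.5, 1.0], +1))    # 1.5 -> on the wrong side
print(round(margin_width, 3))              # 1.414
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0
```

The parameter $C$ controls how strongly hinge-loss violations are penalized relative to keeping the margin wide.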

---

#### **6. Naive Bayes Classifier**

**Naive Bayes** is a probabilistic classifier based on **Bayes’ Theorem**, assuming that the features are **conditionally independent given the class** (which is often not true in practice).

- **How it works**: The algorithm calculates the probability of each class given the features and assigns the class with the highest probability.

- **Use Cases**:
  - Document classification (e.g., spam detection).
  - Sentiment analysis.

- **Advantages**:
  - Simple and fast.
  - Works well with high-dimensional data, especially text data.
  - Often classifies reasonably well even when the independence assumption is mildly violated.

- **Disadvantages**:
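  - Assumes that features are conditionally independent, which is often unrealistic and can make the predicted probabilities poorly calibrated.
  - Performance degrades when features are highly correlated, because correlated evidence is counted more than once.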
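
A minimal sketch of the "highest probability wins" rule for word-count features. It assumes a multinomial-style model with Laplace (add-one) smoothing, which is one common variant rather than the only one, and the tiny message corpus is hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for words in docs for w in words}
    return class_counts, word_counts, vocab


def predict_nb(model, words):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            # Add-one smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical tokenized messages.
docs = [["win", "cash", "now"], ["cheap", "cash", "offer"],
        ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["cash", "offer"]))     # spam
print(predict_nb(model, ["lunch", "meeting"]))  # ham
```

Summing log-probabilities per word is exactly where the independence assumption enters: each word contributes as if the others were absent.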

---

#### **7. Gradient Boosting Classifier (e.g., XGBoost, LightGBM)**

**Gradient Boosting** is an ensemble learning technique that builds a model sequentially: each new learner is fit to the negative gradient of the loss function with respect to the current ensemble's predictions, which amounts to gradient descent in function space.

- **How it works**: It builds an ensemble of weak learners (typically shallow decision trees), where each new tree focuses on the errors still made by the trees before it.

- **Use Cases**:
  - Fraud detection.
  - Classification problems with complex relationships.

- **Advantages**:
  - High accuracy and performance.
  - Can handle different types of data; implementations such as XGBoost and LightGBM also handle missing values natively.
  - Works well on complex datasets.

- **Disadvantages**:
  - Computationally expensive.
  - Prone to overfitting if not tuned properly.
  - Can take a long time to train.
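
The sequential error-correcting idea can be sketched for binary classification with log loss, where the negative gradient with respect to the raw score is simply `y - p`. This is a bare-bones illustration under several assumptions (one feature, one-split regression stumps, a fixed learning rate, hypothetical data); libraries such as XGBoost and LightGBM use considerably more refined procedures.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Regression stump: threshold minimizing squared error against the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]


def fit_boosting(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to y - p, the negative gradient of the log loss."""
    scores = [0.0] * len(x)  # raw scores (log-odds), starting from p = 0.5
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        t, lval, rval = fit_stump(x, residuals)
        stumps.append((t, lval, rval))
        scores = [s + learning_rate * (lval if xi <= t else rval)
                  for xi, s in zip(x, scores)]
    return stumps


def predict_proba(stumps, xi, learning_rate=0.5):
    """Sum the scaled stump outputs and map the raw score to a probability."""
    score = sum(learning_rate * (lval if xi <= t else rval) for t, lval, rval in stumps)
    return sigmoid(score)


# Hypothetical 1-D data: a middle band of positives that no single threshold isolates.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 1, 0, 0]

stumps = fit_boosting(x, y)
predictions = [1 if predict_proba(stumps, xi) > 0.5 else 0 for xi in x]
print(predictions)  # compare against y
```

No single stump can separate this data, but the sum of several stumps with different thresholds can, which is the point of boosting weak learners.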

---

#### **8. Neural Networks (Multi-Layer Perceptron - MLP)**

**Neural Networks** are inspired by the human brain, consisting of layers of neurons that process and classify data through activation functions.

- **How it works**: The network learns through layers of weighted connections. Each layer's output is passed as input to the next layer until the final output layer generates the prediction.

- **Use Cases**:
  - Image classification (e.g., CNN for visual tasks).
  - Speech recognition.

- **Advantages**:
  - Can model complex relationships and decision boundaries.
  - Works well for large datasets.

- **Disadvantages**:
  - Difficult to interpret ("black-box" model).
  - Requires large datasets and computational resources.
  - Prone to overfitting without proper regularization.
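
The layer-by-layer flow described above can be sketched as a forward pass through a tiny network with one hidden layer. The weights below are hand-picked (not learned) so that the network computes XOR, a standard example of a decision boundary that a single linear model cannot represent; the function names are chosen for this example only.

```python
import math

# Hand-picked (not learned) weights: 2 inputs -> 2 hidden units -> 1 output.
W1, B1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, B2 = [[4.0, -8.0]], [-2.0]


def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x):
    """Forward pass: input -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = relu(dense(x, W1, B1))
    (z,) = dense(hidden, W2, B2)
    return 1.0 / (1.0 + math.exp(-z))


for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x), 3))
# [0, 0] 0.119
# [0, 1] 0.881
# [1, 0] 0.881
# [1, 1] 0.119
```

Training would adjust `W1`, `B1`, `W2`, and `B2` by backpropagation instead of setting them by hand.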

---

## **Classification Evaluation Metrics**

---

### **1. Accuracy**

- **Definition**: The proportion of instances that are classified correctly.

  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  $$

- **Limitations**: Not reliable for imbalanced datasets (where one class significantly outnumbers the other).
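
A small worked example of the limitation above, using hypothetical counts: with 5 positives and 95 negatives, a classifier that always predicts the negative class still scores 95% accuracy while finding none of the positives.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced test set: 5 positives, 95 negatives.
# An "always negative" classifier gets every negative right and every positive wrong.
tp, tn, fp, fn = 0, 95, 0, 5

print(accuracy(tp, tn, fp, fn))  # 0.95
print(tp / (tp + fn))            # 0.0 -> no positive case was detected
```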

---

### **2. Precision**

- **Definition**: The proportion of positive predictions that are actually correct.

  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$

- **Use Case**: Important when the cost of false positives is high (e.g., spam filtering, where a legitimate email flagged as spam may never be read).

---

### **3. Recall (Sensitivity)**

- **Definition**: The proportion of actual positives that are correctly identified.

  $$
  \text{Recall} = \frac{TP}{TP + FN}
  $$

- **Use Case**: Important when the cost of false negatives is high (e.g., in fraud detection).

---

### **4. F1 Score**

- **Definition**: The harmonic mean of precision and recall, useful when both metrics need to be balanced.

  $$
  F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$
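
As a worked example of the precision, recall, and F1 formulas, the sketch below uses hypothetical counts (8 true positives, 2 false positives, 8 false negatives). The zero-division guards are a convention chosen for this example.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts.
tp, fp, fn = 8, 2, 8
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)

print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```

The harmonic mean (about 0.615) sits below the arithmetic mean (0.65), so F1 is pulled toward the weaker of the two metrics.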

---

### **5. Confusion Matrix**

A **confusion matrix** breaks predictions down by actual and predicted class:

|                     | **Predicted Positive** | **Predicted Negative** |
|---------------------|------------------------|------------------------|
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |

- **True Positive (TP)**: The number of instances where the model correctly predicted the positive class.
- **True Negative (TN)**: The number of instances where the model correctly predicted the negative class.
- **False Positive (FP)**: The number of instances where the model incorrectly predicted the positive class (Type I error).
- **False Negative (FN)**: The number of instances where the model incorrectly predicted the negative class (Type II error).

These four counts are the basis for metrics such as accuracy, precision, recall, and F1 score.
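
The four cells can be tallied directly from paired lists of actual and predicted labels. The sketch below assumes binary labels with `1` as the positive class; the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1
    return tp, tn, fp, fn


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```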

---

### **6. ROC Curve and AUC**

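- **ROC Curve (Receiver Operating Characteristic Curve)**: A plot of **True Positive Rate (Recall)** vs. **False Positive Rate**, $FP / (FP + TN)$, as the classification threshold is varied.
- **AUC (Area Under the Curve)**: The area under the ROC curve, ranging from 0 to 1. A value of 0.5 corresponds to random guessing, and a higher AUC means the model ranks positives above negatives more reliably.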
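
A minimal sketch of how the curve and its area can be computed: sweep the threshold over the predicted scores, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. The labels and scores are hypothetical, and the input is assumed to contain at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over each distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(round(auc(points), 3))  # 0.889
```

Here 8 of the 9 positive-negative pairs are ranked correctly, which matches the AUC of 8/9.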

---

### **Conclusion**

Classification is a core concept in machine learning that applies to a wide range of real-world problems. Each algorithm trades off complexity, accuracy, and interpretability differently, so choosing one depends on the dataset, the problem type, and the evaluation metrics that matter for the task.

---

