# 🧠 Classification – Basic Notes

## 📌 What is Classification?

**Classification** is a supervised machine learning technique used to predict categorical labels (discrete values) based on input features (independent variables).

**Example use cases:**
- Email spam detection (Spam vs. Not Spam)
- Disease diagnosis (Healthy vs. Sick)
- Image recognition (Cat vs. Dog)

---

## 🔧 Common Classification Models

| Model                     | Description |
|---------------------------|-------------|
| **Logistic Regression**    | A linear model that estimates class probabilities with the logistic (sigmoid) function. Most often used for binary classification, though it extends to multiple classes. |
| **K-Nearest Neighbors (KNN)** | Classifies a data point based on the majority label of its nearest neighbors. |
| **Support Vector Machine (SVM)** | Finds the hyperplane that separates the classes with the largest margin. With a kernel function (e.g. RBF) it can also learn non-linear decision boundaries. |
| **Decision Tree**          | Splits data into branches based on features to classify. Easy to interpret but may overfit. |
| **Random Forest**          | An ensemble method using multiple decision trees to improve accuracy and reduce overfitting. |
| **Naive Bayes**            | Based on Bayes' Theorem, it assumes features are conditionally independent given the class. Fast to train, even on large datasets. |
| **Gradient Boosting**      | An ensemble technique that builds weak learners (usually shallow decision trees) sequentially, each one correcting the errors of those before it. |

---

## 📥 Importing Models

These models live in different `sklearn` submodules (for example `sklearn.linear_model` and `sklearn.ensemble`), so each class is imported from its own module:
- `LogisticRegression`
- `KNeighborsClassifier`
- `SVC` (Support Vector Classifier)
- `DecisionTreeClassifier`
- `RandomForestClassifier`
- `GaussianNB` (Naive Bayes)
- `GradientBoostingClassifier`
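
As a minimal sketch, the imports for the classes listed above could look like the following. The `models` dictionary and its keys are illustrative names, not something these notes define, and every model is created with its default hyperparameters.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative registry: one default-configured instance per model
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

for name, model in models.items():
    print(f"{name}: {type(model).__module__}")
```

All of these classes share the same `fit` / `predict` interface, which is why they can be swapped in and out of the same evaluation code.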

---

## 🔁 Training, Testing, and Predicting

1. **Training a model** involves fitting it to training data using known inputs and corresponding labels.
2. **Testing** involves predicting on a separate test set and comparing predictions to actual labels.
3. **Prediction** is the process of using a trained model to classify new, unseen data.
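
The three steps above can be sketched roughly as follows. The small synthetic dataset, the choice of `LogisticRegression`, and the `new_sample` values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (illustrative sizes)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Training: fit the model on known inputs and labels
model = LogisticRegression()
model.fit(X_train, y_train)

# 2. Testing: predict on held-out data and compare with the actual labels
test_pred = model.predict(X_test)
test_accuracy = (test_pred == y_test).mean()
print(f"Test accuracy: {test_accuracy:.2f}")

# 3. Prediction: classify a new, unseen sample (hypothetical feature values)
new_sample = np.array([[0.5, -1.0, 0.3, 1.2]])
new_label = model.predict(new_sample)
print(f"Predicted class for the new sample: {new_label[0]}")
```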

---

## 📏 Evaluation Metrics for Classification

### 1. **Accuracy**

- **Definition:** The proportion of correct predictions to the total number of predictions.
- **Best Case:** Accuracy = 1 (100% correct predictions).
- **Worst Case:** Accuracy = 0, indicating the model predicts all labels incorrectly.
- **Parameter:** Takes the true labels (`y_true`) and predicted labels (`y_pred`).
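
A small worked example may make the definition concrete. The labels below are made up for illustration; the manual calculation and `accuracy_score` should agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels invented for this example (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy by hand: share of positions where prediction equals truth (7 of 10)
manual_accuracy = np.mean(y_true == y_pred)

# Same value from scikit-learn
sk_accuracy = accuracy_score(y_true, y_pred)

print(f"manual: {manual_accuracy:.2f}, sklearn: {sk_accuracy:.2f}")
```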

---

### 2. **Precision**

- **Definition:** The ratio of correctly predicted positive observations to the total predicted positives: TP / (TP + FP). It is more informative than accuracy on imbalanced classes and matters most when false positives are costly.
- **Best Case:** Precision = 1 (all predicted positives are true positives).
- **Worst Case:** Precision = 0 (no predicted positives are true positives).
- **Parameter:** Takes the true labels (`y_true`) and predicted labels (`y_pred`); the true positive and false positive counts are derived from them.
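
To illustrate, precision can be computed by hand from the true positive and false positive counts and checked against `precision_score`. The toy labels are invented for this sketch.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels invented for this example (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count true positives and false positives among the predicted positives
tp = np.sum((y_true == 1) & (y_pred == 1))  # predicted 1, actually 1
fp = np.sum((y_true == 0) & (y_pred == 1))  # predicted 1, actually 0

manual_precision = tp / (tp + fp)
sk_precision = precision_score(y_true, y_pred)

print(f"TP={tp}, FP={fp}")
print(f"manual: {manual_precision:.2f}, sklearn: {sk_precision:.2f}")
```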

---

### 3. **Recall (Sensitivity)**

- **Definition:** The ratio of correctly predicted positive observations to all the actual positives: TP / (TP + FN). It shows how well the model detects positive instances.
- **Best Case:** Recall = 1 (all positive instances are correctly detected).
- **Worst Case:** Recall = 0 (none of the positive instances are detected).
- **Parameter:** Takes the true labels (`y_true`) and predicted labels (`y_pred`); the true positive and false negative counts are derived from them.
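
As a sketch, recall can be computed by hand from the true positive and false negative counts and compared with `recall_score`. The toy labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels invented for this example (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count detected and missed positives among the actual positives
tp = np.sum((y_true == 1) & (y_pred == 1))  # actual 1, detected
fn = np.sum((y_true == 1) & (y_pred == 0))  # actual 1, missed

manual_recall = tp / (tp + fn)
sk_recall = recall_score(y_true, y_pred)

print(f"TP={tp}, FN={fn}")
print(f"manual: {manual_recall:.2f}, sklearn: {sk_recall:.2f}")
```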

---

### 4. **F1 Score**

- **Definition:** The harmonic mean of Precision and Recall, providing a balance between them.
- **Best Case:** F1 = 1 (precision and recall are both 1).
- **Worst Case:** F1 = 0 (either precision or recall is zero).
- **Parameter:** Takes the true labels (`y_true`) and predicted labels (`y_pred`); precision and recall are computed from them internally.
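
The harmonic mean can be checked by hand against `f1_score`. The toy labels and the lopsided precision/recall pair (0.9 and 0.1) below are made up to show how the harmonic mean penalizes an imbalance between the two.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels invented for this example (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 3 / 5
recall = recall_score(y_true, y_pred)        # 3 / 4

# Harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
sk_f1 = f1_score(y_true, y_pred)
print(f"manual: {manual_f1:.3f}, sklearn: {sk_f1:.3f}")

# Hypothetical lopsided case: the arithmetic mean would be 0.5,
# but the harmonic mean stays close to the weaker value
lopsided_f1 = 2 * 0.9 * 0.1 / (0.9 + 0.1)
print(f"F1 for precision=0.9, recall=0.1: {lopsided_f1:.2f}")
```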

---

### 5. **Confusion Matrix**

- **Definition:** A table used to evaluate the performance of a classification algorithm. It shows the true positives, false positives, true negatives, and false negatives. In scikit-learn's binary layout, rows are actual classes and columns are predicted classes: `[[TN, FP], [FN, TP]]`.
- **Best Case:** All counts lie on the main diagonal (only true positives and true negatives), indicating perfect classification.
- **Worst Case:** All counts lie off the diagonal (only false positives and false negatives), meaning every prediction is wrong.
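
A short sketch of reading the four counts out of a binary confusion matrix. The toy labels are invented; `ravel()` flattens the 2x2 matrix row by row, which for labels 0 and 1 gives the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for this example (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Unpack the four counts from the flattened matrix
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```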

---

## 📌 Summary of Best and Worst Case Evaluations

- **Best Case:** For **accuracy**, **precision**, **recall**, and **F1 score**, the best possible value is **1**. Values close to 1 mean the model makes few errors.
  
- **Worst Case:** A **low accuracy**, **precision**, **recall**, or **F1 score** indicates poor model performance, with many misclassifications. Which metric matters most depends on the task: **recall** is critical in cases like disease detection, where missing a positive case is costly.
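
To see why recall deserves separate attention in a setting like disease detection, consider a hypothetical imbalanced dataset (95 healthy, 5 sick; the numbers are invented) and a model that always predicts "healthy". Accuracy looks high while recall is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 95 healthy (0) and 5 sick (1) patients
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that labels every patient as healthy
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # 95 of 100 labels match
recall = recall_score(y_true, y_pred)      # 0 of 5 sick patients detected

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```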

---

This covers the basics of **classification**, common models, and how to evaluate them using popular metrics like **accuracy**, **precision**, **recall**, **F1 score**, and the **confusion matrix**.


In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42)
X

array([[ 1.12510039,  1.17812384,  0.49351604, ...,  1.35732466,
         0.9660408 , -1.98113862],
       [-0.56464086,  3.6386291 , -1.52241469, ..., -0.89025442,
         1.43882638, -3.82874758],
       [ 0.51631285,  2.16542633, -0.62848571, ..., -1.95817543,
        -0.34880315, -1.8041241 ],
       ...,
       [ 1.65015307, -0.69216458, -2.04920577, ..., -1.30257748,
        -1.28550452,  3.32856934],
       [-1.18660302, -1.41459786, -0.12151968, ..., -1.42146469,
        -0.02833985,  3.41393228],
       [ 0.78867591, -0.22254747,  0.32856985, ..., -1.29103957,
        -2.33817245,  2.03602059]])

## ========== train_test_split ==========

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Evaluation

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# ========== Logistic Regression ==========


In [5]:
from sklearn.linear_model import LogisticRegression

log_reg_model = LogisticRegression()
log_reg_model.fit(X_train, y_train)
log_reg_pred = log_reg_model.predict(X_test)

log_reg_accuracy = accuracy_score(y_test, log_reg_pred)
log_reg_precision = precision_score(y_test, log_reg_pred)
log_reg_recall = recall_score(y_test, log_reg_pred)
log_reg_f1 = f1_score(y_test, log_reg_pred)
log_reg_conf_matrix = confusion_matrix(y_test, log_reg_pred)

print("Logistic Regression")
print(f"  Accuracy: {log_reg_accuracy:.2f}")
print(f"  Precision: {log_reg_precision:.2f}")
print(f"  Recall: {log_reg_recall:.2f}")
print(f"  F1 Score: {log_reg_f1:.2f}")
print(f"  Confusion Matrix:\n{log_reg_conf_matrix}")
print("-" * 40)


Logistic Regression
  Accuracy: 0.83
  Precision: 0.81
  Recall: 0.82
  F1 Score: 0.81
  Confusion Matrix:
[[95 17]
 [16 72]]
----------------------------------------


# ========== KNN ==========


In [6]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

knn_accuracy = accuracy_score(y_test, knn_pred)
knn_precision = precision_score(y_test, knn_pred)
knn_recall = recall_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred)
knn_conf_matrix = confusion_matrix(y_test, knn_pred)

print("KNN Classification")
print(f"  Accuracy: {knn_accuracy:.2f}")
print(f"  Precision: {knn_precision:.2f}")
print(f"  Recall: {knn_recall:.2f}")
print(f"  F1 Score: {knn_f1:.2f}")
print(f"  Confusion Matrix:\n{knn_conf_matrix}")

KNN Classification
  Accuracy: 0.92
  Precision: 0.95
  Recall: 0.85
  F1 Score: 0.90
  Confusion Matrix:
[[108   4]
 [ 13  75]]


# ========== Support Vector Machine ==========


In [7]:
from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)

svm_accuracy = accuracy_score(y_test, svm_pred)
svm_precision = precision_score(y_test, svm_pred)
svm_recall = recall_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)
svm_conf_matrix = confusion_matrix(y_test, svm_pred)

print("Support Vector Machine")
print(f"  Accuracy: {svm_accuracy:.2f}")
print(f"  Precision: {svm_precision:.2f}")
print(f"  Recall: {svm_recall:.2f}")
print(f"  F1 Score: {svm_f1:.2f}")
print(f"  Confusion Matrix:\n{svm_conf_matrix}")

Support Vector Machine
  Accuracy: 0.92
  Precision: 0.91
  Recall: 0.91
  F1 Score: 0.91
  Confusion Matrix:
[[104   8]
 [  8  80]]


# ========== Decision Tree ==========


In [8]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()  # no random_state set, so results may differ slightly between runs
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_pred)
dt_precision = precision_score(y_test, dt_pred)
dt_recall = recall_score(y_test, dt_pred)
dt_f1 = f1_score(y_test, dt_pred)
dt_conf_matrix = confusion_matrix(y_test, dt_pred)

print("Decision Tree")
print(f"  Accuracy: {dt_accuracy:.2f}")
print(f"  Precision: {dt_precision:.2f}")
print(f"  Recall: {dt_recall:.2f}")
print(f"  F1 Score: {dt_f1:.2f}")
print(f"  Confusion Matrix:\n{dt_conf_matrix}")
print("-" * 40)

Decision Tree
  Accuracy: 0.89
  Precision: 0.87
  Recall: 0.86
  F1 Score: 0.87
  Confusion Matrix:
[[101  11]
 [ 12  76]]
----------------------------------------


# ========== Random Forest ==========


In [9]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()  # no random_state set, so results may differ slightly between runs
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_pred)
rf_precision = precision_score(y_test, rf_pred)
rf_recall = recall_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred)
rf_conf_matrix = confusion_matrix(y_test, rf_pred)

print("Random Forest")
print(f"  Accuracy: {rf_accuracy:.2f}")
print(f"  Precision: {rf_precision:.2f}")
print(f"  Recall: {rf_recall:.2f}")
print(f"  F1 Score: {rf_f1:.2f}")
print(f"  Confusion Matrix:\n{rf_conf_matrix}")

Random Forest
  Accuracy: 0.96
  Precision: 0.94
  Recall: 0.97
  F1 Score: 0.96
  Confusion Matrix:
[[107   5]
 [  3  85]]
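
The two remaining models from the table, Naive Bayes and Gradient Boosting, are not run above. A minimal sketch of evaluating them on the same synthetic data might look like this. The `evaluate` helper, the `results` dictionary, and `random_state=42` for `GradientBoostingClassifier` are assumptions added for illustration; no output is shown because these cells were not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same data and split parameters as the cells above
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def evaluate(model):
    """Fit the model, predict on the test set, and return the five metrics."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }


candidates = {
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
results = {name: evaluate(model) for name, model in candidates.items()}

for name, metrics in results.items():
    print(name)
    print(f"  Accuracy: {metrics['accuracy']:.2f}")
    print(f"  Precision: {metrics['precision']:.2f}")
    print(f"  Recall: {metrics['recall']:.2f}")
    print(f"  F1 Score: {metrics['f1']:.2f}")
    print(f"  Confusion Matrix:\n{metrics['confusion_matrix']}")
```

Wrapping the fit, predict, and score steps in one helper avoids repeating the same block for every model and makes it easier to compare results side by side.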
