
# Classification in Machine Learning

## Introduction to Classification

Classification is a supervised learning technique that predicts the category (class label) of new observations using a model trained on labeled past observations. Classification problems can be categorized as:

- **Binary Classification**: The output variable has two possible classes (e.g., spam/not spam).
- **Multi-Class Classification**: The output variable has more than two classes (e.g., classifying types of flowers).
- **Multi-Label Classification**: Each observation can be assigned multiple labels (e.g., tagging images with multiple objects).
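
The three problem types differ mainly in how the target is represented. The sketch below uses made-up labels (the variable names and values are illustrative, not taken from a real dataset) to show one common representation for each case, including the binary indicator matrix often used for multi-label targets.

```python
# Illustrative targets only; all label values are invented for this sketch.

# Binary: one of two classes per observation.
y_binary = ["spam", "not spam", "spam"]

# Multi-class: exactly one of several classes per observation.
y_multiclass = ["setosa", "versicolor", "virginica", "setosa"]

# Multi-label: any number of labels (including none) per observation.
y_multilabel = [{"cat", "dog"}, {"car"}, set()]

# Multi-label targets are often stored as a binary indicator matrix:
# one column per label, with 1 where the label applies.
labels = sorted(set().union(*y_multilabel))
indicator = [[int(label in tags) for label in labels] for tags in y_multilabel]

print(labels)     # column order of the indicator matrix
print(indicator)  # one row per observation
```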

### Popular Classification Algorithms

- **Logistic Regression**: A linear model that estimates the probability of a binary outcome.
- **k-Nearest Neighbors (k-NN)**: Classifies a new data point by the majority class among its k nearest neighbors in the training data.
- **Support Vector Machine (SVM)**: Finds the hyperplane that separates the classes in the feature space with the largest margin.
- **Decision Trees**: Classify data by learning a hierarchy of decision rules inferred from the features.
- **Random Forest**: An ensemble method that builds multiple decision trees on random subsets of the data and features, then combines their predictions to improve accuracy and reduce overfitting.
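
To make one of these algorithms concrete, here is a minimal from-scratch sketch of the k-NN majority vote. The function name `knn_predict` and the toy data are assumptions made for illustration; in practice scikit-learn's `KNeighborsClassifier` would normally be used instead.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point, paired with that point's label.
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: two well-separated clusters.
X_toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_toy = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_toy, y_toy, (1.1, 1.0)))
print(knn_predict(X_toy, y_toy, (5.1, 5.0)))
```

An odd `k` avoids tied votes in binary problems; this sketch does not handle ties in any principled way.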

## Data Preparation for Classification

### Data Preprocessing Steps

1. **Handling Missing Data**: Fill in missing values by imputation (e.g., with the mean, median, or most frequent value), or remove the affected rows or columns.
2. **Encoding Categorical Variables**: Convert categorical variables into numerical format using techniques like one-hot encoding or label encoding.
3. **Feature Scaling**: Standardize or normalize features to ensure that they are on the same scale.
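
The steps above can be sketched as a single scikit-learn preprocessing object. The DataFrame and its column names (`age`, `income`, `city`) are hypothetical, and the choice of median imputation and standardization is just one reasonable configuration, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: "age" and "income" are numeric, "city" is categorical.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "income": [40000, 52000, np.nan, 61000],
    "city": ["Paris", "Lyon", "Lyon", "Paris"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: impute missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps and one-hot encoding to their respective columns.
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)  # rows x (numeric columns + one-hot columns)
```

In a real workflow the preprocessor would be fitted on the training set only and then applied to the test set with `transform`, so that no information from the test data leaks into the imputation or scaling statistics.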

### Splitting Data into Training and Testing Sets

To evaluate the performance of a model, data is split into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.

```python
from sklearn.model_selection import train_test_split

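# X (feature matrix) and y (target labels) are assumed to be defined already.
# Hold out 30% of the data for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)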
```

## Model Training and Evaluation

### Logistic Regression Example

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Model training
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log_reg = log_reg.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred_log_reg)
print(f"Accuracy: {accuracy}")
```

### Random Forest Example

```python
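from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Model training (random_state fixed so results are reproducible)
rf_clf = RandomForestClassifier(random_state=42)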
rf_clf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_clf.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy}")
```

### Support Vector Machine (SVM) Example

```python
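from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Model training (probability=True enables predict_proba, used for the ROC curve)
svm_clf = SVC(probability=True)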
svm_clf.fit(X_train, y_train)

# Predictions
y_pred_svm = svm_clf.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred_svm)
print(f"Accuracy: {accuracy}")
```

## Model Evaluation

### Evaluation Metrics

Common metrics for evaluating classification models include:

- **Accuracy**: The proportion of correctly classified instances.
- **Precision**: The proportion of true positives among all instances predicted as positive, TP / (TP + FP).
- **Recall**: The proportion of true positives among all instances that are actually positive, TP / (TP + FN).
- **F1 Score**: The harmonic mean of precision and recall.
- **ROC-AUC**: Area Under the Receiver Operating Characteristic Curve, which measures the model's ability to distinguish between classes.
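
As a worked example of these definitions, the sketch below computes accuracy, precision, recall, and F1 by hand from a small set of invented binary labels (1 is the positive class). scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
# Hypothetical binary labels (1 = positive class), invented for this sketch.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision (2/3) is higher than recall (1/2): the classifier is fairly reliable when it predicts the positive class but misses half of the actual positives, which accuracy alone (5/8) does not reveal.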

### Confusion Matrix

A confusion matrix is a table used to describe the performance of a classification model. It compares the actual target values with those predicted by the model.

```python
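import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes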

conf_matrix = confusion_matrix(y_test, y_pred_log_reg)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()
```

### ROC Curve for SVM

The ROC curve is a graphical representation of the true positive rate (TPR) versus the false positive rate (FPR) at various threshold settings. The area under the curve (AUC) provides an aggregate measure of performance.

```python
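import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Class probabilities from the SVM trained above (requires probability=True).
# This assumes a binary problem whose positive class is labeled 1, so that
# column 1 of predict_proba holds the positive-class probability.
y_score_svm = svm_clf.predict_proba(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score_svm[:, 1], pos_label=1)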
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic - SVM')
plt.legend(loc="lower right")
plt.show()
```
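
To show what "various threshold settings" means, the following sketch computes individual ROC points by hand. The labels, scores, and the helper `roc_point` are invented for illustration; `roc_curve` performs an equivalent sweep over the thresholds it selects.

```python
# Hypothetical true labels and positive-class scores, invented for this sketch.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = sum(s >= threshold and t == 1 for t, s in zip(y_true, scores))
    fp = sum(s >= threshold and t == 0 for t, s in zip(y_true, scores))
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Lowering the threshold moves along the curve from bottom-left to top-right.
for threshold in sorted(set(scores), reverse=True):
    fpr, tpr = roc_point(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold trades false positives against true positives; the ROC curve is the set of all such (FPR, TPR) points.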

## Conclusion

In this notebook, we've covered the basics of classification in machine learning, explored different classification algorithms, and demonstrated how to implement and evaluate them using Python and scikit-learn.
