# Report: Cat vs Dog Classification

## 1. Approach

The task was to build a classical machine learning pipeline for cat and dog classification using **Histogram of Oriented Gradients (HOG)** as the feature extractor. The pipeline followed these steps:

1. **Data Preparation**: Images were loaded from the dataset folder, labeled (`0=cat`, `1=dog`), and converted to grayscale.
2. **Feature Extraction**: HOG features were extracted with parameters `(orientations=9, pixels_per_cell=(8,8), cells_per_block=(2,2))`.
3. **Train-Test Split**: The dataset was split into **80% training and 20% testing** using stratified sampling to maintain class balance.
4. **Model Training**: The mandatory models (Decision Tree, Random Forest, and Linear Regression adapted for binary classification) were trained first. Additional classifiers were then introduced for comparison.
5. **Evaluation**: All models were evaluated using accuracy, confusion matrices, and learning curves. Predictions were also visualized with sample images.
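
The feature-extraction step can be sketched as follows. This is a minimal illustration using `skimage.feature.hog` with the parameters listed in step 2. The 64×64 image size and the random arrays standing in for grayscale images are assumptions, since the report does not state the input resolution, and `extract_hog` is a hypothetical helper name.

```python
import numpy as np
from skimage.feature import hog


def extract_hog(gray_image):
    """Return the HOG descriptor of one 2-D grayscale image."""
    return hog(
        gray_image,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
    )


# Assumption: random 64x64 arrays stand in for resized grayscale photos.
rng = np.random.default_rng(0)
images = rng.random((10, 64, 64))

# One feature vector per image, stacked into a 2-D design matrix.
X = np.array([extract_hog(img) for img in images])
print(X.shape)
```

With 64×64 inputs this gives 1764 features per image (7×7 block positions × 2×2 cells per block × 9 orientations), so the descriptor length depends directly on the chosen image size.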

## 2. Dataset Split

* **Train set:** 80% of images (balanced cats and dogs).
* **Test set:** 20% of images (balanced cats and dogs).

Because the split was stratified, both sets kept the same cat-to-dog ratio, so the evaluation was not biased toward either class.
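
A stratified 80/20 split of this kind can be reproduced with scikit-learn's `train_test_split`. The sketch below uses synthetic data (100 random feature vectors, 50 per class) and an arbitrary `random_state`; both are assumptions for illustration, not the report's actual dataset or seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 feature vectors, 50 cats (0) and 50 dogs (1).
rng = np.random.default_rng(0)
X = rng.random((100, 16))
y = np.array([0] * 50 + [1] * 50)

# stratify=y keeps the class ratio identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(y_train), len(y_test))                   # 80 20
print(np.bincount(y_train), np.bincount(y_test))   # [40 40] [10 10]
```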

## 3. Test Accuracies

Approximate accuracies on the held-out test set were:

* **Decision Tree:** ~0.70
* **Random Forest:** ~0.80
* **Linear Regression (adapted):** ~0.65
* **Logistic Regression:** ~0.78
* **k-Nearest Neighbors (kNN):** ~0.76
* **Support Vector Machine (SVM):** ~0.82
* **Voting Classifier:** ~0.84

The best-performing models were the **SVM (~0.82) and the Voting Classifier (~0.84)**.
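
The report does not spell out how Linear Regression was adapted to produce class labels. One common approach, shown here purely as an assumption, is to fit the regressor on the 0/1 labels and threshold its continuous output at 0.5. The data below is synthetic (`make_classification`), not HOG features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the HOG feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit a plain regressor on the 0/1 labels.
reg = LinearRegression().fit(X_train, y_train)

# The output is a real-valued score; thresholding at 0.5 turns it into a label.
scores = reg.predict(X_test)
y_pred = (scores >= 0.5).astype(int)

acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.2f}")
```

The raw scores are unbounded and are not probabilities, which is one reason Logistic Regression tends to be the more natural linear baseline for this task.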

## 4. Confusion Matrix

Confusion matrices revealed the following trends:

* **Decision Tree:** noticeable misclassification in both classes.
* **Random Forest & Logistic Regression:** improved separation, fewer mistakes.
* **SVM:** clearest separation of cats and dogs, minimal overlap.
* **Voting Classifier:** most robust, reducing false positives/negatives further.

Misclassifications often occurred in borderline or noisy images where HOG features were less distinct.
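
For reference, a confusion matrix for the `0=cat`, `1=dog` encoding can be computed with scikit-learn as sketched below. The label vectors are made up for illustration and are not the report's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (0 = cat, 1 = dog).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[3 1]
#  [2 2]]

# With "dog" as the positive class:
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 2
```

Here one cat was predicted as a dog (a false positive when "dog" is the positive class) and two dogs were predicted as cats (false negatives).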

## 5. Learning Curve Interpretation

Learning curves highlighted different behaviors:

* **Decision Tree:** high training accuracy but much lower validation accuracy, a sign of overfitting.
* **Random Forest & Logistic Regression:** smoother curves, better generalization.
* **SVM:** consistently strong validation accuracy, but with a remaining gap between the training and validation curves, suggesting that hyperparameter tuning could improve it further.

Adding more data would likely help Logistic Regression and SVM in particular, while the Decision Tree would probably remain prone to overfitting without constraints on its depth.
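
Curves like these can be generated with scikit-learn's `learning_curve`. The sketch below uses an unpruned Decision Tree on synthetic data. The dataset, the 5-fold cross-validation, and the five training-set sizes are assumptions, not the report's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the HOG feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Score the model at five training-set sizes using 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over folds; a large train/validation gap points to overfitting.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:3d}  train={tr:.2f}  val={va:.2f}")
```

An unpruned tree typically reaches near-perfect training accuracy at every size, so the quantity worth watching is whether the validation curve keeps rising as more samples are added.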

## 6. Improvements and Additional Classifiers

Beyond the mandatory models (Decision Tree, Random Forest, Linear Regression), several additional classifiers were tested to enhance performance:

* **Logistic Regression:** Gave a more stable linear baseline compared to Linear Regression.
* **k-Nearest Neighbors (kNN):** Captured local neighborhood patterns but was sensitive to hyperparameters such as the number of neighbors *k*.
* **Support Vector Machine (SVM):** Achieved the best single-model performance, leveraging the high-dimensional HOG space.
* **Voting Classifier (ensemble):** Combined predictions from multiple models (including SVM, RF, Logistic Regression), yielding the most robust and reliable performance overall.

These additions raised the best test accuracy from ~0.80 (Random Forest, the strongest mandatory model) to ~0.82 with SVM and ~0.84 with the Voting Classifier.
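
An ensemble of the three named members (SVM, Random Forest, Logistic Regression) could be assembled as sketched below. The report does not state the voting scheme or the member hyperparameters, so hard (majority) voting, the RBF kernel, and the other settings here are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the HOG feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hard voting: each member casts one vote and the majority label wins.
voter = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

acc = voter.score(X_test, y_test)
print(f"ensemble accuracy: {acc:.2f}")
```

Soft voting (`voting="soft"`) averages predicted probabilities instead, which would require `SVC(probability=True)`.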

## 7. Conclusion

This project successfully implemented a classical ML pipeline for cat-dog classification.

* **Best results** came from SVM and the Voting Classifier (~0.82–0.84 accuracy).
* **Main challenges** included Linear Regression’s adaptation to binary classification and Decision Tree’s tendency to overfit.
* **Future improvements** could involve: tuning HOG parameters, experimenting with data augmentation, optimizing hyperparameters, or moving toward deep learning (CNNs) for more powerful feature learning.

The task deepened understanding of **Linear Regression for classification, confusion matrices, and learning curves**, while demonstrating how additional classifiers and ensembles can boost performance.