# CSCA 5632 Final Project - Unsupervised and Supervised Learning on Animal Faces (AFHQ Dataset)
### By Moshiur Howlader
##### GitHub Link: https://github.com/Mosh333/csca5632-final-project

### 1. Introduction

In today’s data-driven world, the ability to **uncover structure and meaning from unlabeled data** is one of the most valuable capabilities in machine learning.
While supervised learning depends on extensive labeled datasets, many real-world domains contain **vast quantities of raw, unannotated information**—such as images, text, medical scans, or sensor data—where manual labeling is costly or infeasible.  
Here, [**unsupervised learning**](https://biztechmagazine.com/article/2025/05/what-are-benefits-unsupervised-machine-learning-and-clustering-perfcon) plays a pivotal role: it enables algorithms to reveal hidden patterns, latent representations, and natural groupings within data without external supervision.

Unsupervised learning drives [innovation across diverse domains](https://pmc.ncbi.nlm.nih.gov/articles/PMC7983091/):

- [**Data exploration and pattern discovery:**](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/mas.21602) Enables open-ended analysis of large, high-dimensional datasets to uncover hidden structures, correlations, and trends—reducing dimensionality and aiding human interpretation, such as exploring mass spectrometry data across large experimental datasets.  
- [**Computer vision:**](https://viso.ai/deep-learning/supervised-vs-unsupervised-learning/) Groups unlabeled images by similarity, compresses data via [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), or learns visual embeddings through self-supervised methods like [SimCLR](https://arxiv.org/abs/2002.05709).  
- [**Natural language processing:**](https://milvus.io/ai-quick-reference/what-is-the-role-of-unsupervised-learning-in-nlp) Learns semantic relationships in text through [Word2Vec](https://arxiv.org/abs/1301.3781) or discovers latent topics using [Latent Dirichlet Allocation (LDA)](https://jmlr.org/papers/v3/blei03a.html).  
- [**Healthcare and biomedical research:**](https://pubmed.ncbi.nlm.nih.gov/31891765/) Facilitates the discovery of hidden disease patterns, comorbidity clusters, and patient subgroups from large-scale electronic health records—enabling better understanding of latent traits, risk domains, and disease progression, such as identifying novel comorbidity patterns in aging cohorts.  
- [**Autonomous systems and robotics:**](https://fiveable.me/introduction-autonomous-robots/unit-7/unsupervised-learning/study-guide/rNorV1tsC0TeCPOO) Maps environments, groups sensor inputs, and learns spatial representations without labeled supervision.  
- [**Recommender and personalization systems:**](https://www.mdpi.com/2073-8994/12/2/185) Clusters users or content to generate recommendations when explicit ratings are unavailable.

Together, these examples highlight how unsupervised learning forms the foundation of **exploratory data analysis** and **representation learning**, allowing models to extract structure from raw data before labels exist.

![Illustration of Unsupervised Learning Process - https://uk.mathworks.com/discovery/unsupervised-learning.html](../images/1-intro-pic.png)

*Figure: Conceptual illustration of unsupervised learning — an algorithm groups unlabeled data points (shapes) based on similarity, forming meaningful clusters.*

---

#### 1.1 Project Overview and Objectives
Here we discuss the selected data source and the unsupervised learning problem we aim to solve.

#### 1.2 Gather Data, Determine the Method of Data Collection and Provenance
This project uses the **[Animal Faces-HQ (AFHQ) dataset](https://www.kaggle.com/datasets/andrewmvd/animal-faces)**, a publicly available image dataset published on Kaggle by **Andrew Mvd** under a **CC BY-NC license**; AFHQ itself was introduced with the StarGAN v2 paper (Choi et al., 2020).  
AFHQ contains **over 16,000 high-quality animal face images** across three balanced categories: **cats, dogs, and wildlife**.


According to the Kaggle description:
> “This dataset, also known as Animal Faces-HQ (AFHQ), consists of 16,130 high-quality images at 512×512 resolution.  
> There are three domains of classes, each providing about 5000 images.  
> By having multiple (three) domains and diverse images of various breeds per each domain, AFHQ sets a challenging image-to-image translation problem.  
> The classes are: Cat, Dog, and Wildlife.”

For this project, images are **resized to XxY pixels**, normalized to a [0, 1] range, and converted to **RGB tensors** (three-channel numerical arrays representing red, green, and blue intensities).  
These preprocessing steps prepare the data for feature extraction, dimensionality reduction, and clustering.  
The dataset’s high resolution, balance across categories, and visual diversity make it well-suited for evaluating **unsupervised image representation learning** and **clustering performance**.
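
The preprocessing steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess_image` is hypothetical, Pillow and NumPy are assumed to be available, and the 64×64 target size is only a placeholder because the final resolution is not fixed in this section.

```python
import numpy as np
from PIL import Image


def preprocess_image(img: Image.Image, size: tuple[int, int]) -> np.ndarray:
    """Convert to RGB, resize to `size` (width, height), and scale pixels to [0, 1]."""
    rgb = img.convert("RGB")      # force three channels (drops alpha, expands grayscale)
    resized = rgb.resize(size)    # uses Pillow's default resampling filter
    # uint8 values in 0..255 become float32 values in 0.0..1.0
    return np.asarray(resized, dtype=np.float32) / 255.0


# Demo on a synthetic 512x512 image so the sketch runs without the dataset.
demo = Image.new("RGB", (512, 512), color=(255, 128, 0))
x = preprocess_image(demo, size=(64, 64))  # placeholder size, an assumption
print(x.shape, x.dtype)
```

The returned array has shape `(height, width, 3)`; flattening it with `x.reshape(-1)` yields one feature vector per image for the later dimensionality-reduction and clustering steps.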

![Sample Image Data found in this dataset](../images/2-intro-pic.png)   
*Figure: Preview of the dataset used to perform this project.*

---

#### 1.3 Identify an Unsupervised Learning Problem
The goal of this project is to test whether **unsupervised learning algorithms** can **group animal face images into their correct categories** — cats, dogs, and wildlife — **based only on visual similarity**, without using any labels.  
In other words, can the models automatically recognize and cluster similar-looking animals together?

The analysis includes **exploratory data analysis (EDA)**, visualization of image features using **[t-SNE](https://lvdmaaten.github.io/tsne/)** and **[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**, and the application of clustering algorithms such as **[K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**, **[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**, and **[Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)**.  
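
As a rough sketch of how these pieces fit together in scikit-learn, the example below reduces features with PCA and then runs the three clustering algorithms. It uses synthetic blobs from `make_blobs` as a stand-in for flattened image features, and every hyperparameter shown (`n_components=10`, `eps=4.0`, `min_samples=5`) is an illustrative assumption rather than a value used in this project.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for image features: 300 samples, 50 dimensions, 3 groups.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Reduce dimensionality before clustering.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# K-Means and Agglomerative Clustering need the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_reduced)

# DBSCAN infers the number of clusters from density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=4.0, min_samples=5).fit_predict(X_reduced)

n_dbscan = len(set(dbscan_labels.tolist()) - {-1})
print("DBSCAN clusters found (excluding noise):", n_dbscan)
```

On real image data, `eps` in particular usually needs tuning, since a value that is too small marks most points as noise and one that is too large merges everything into a single cluster.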

Clustering performance is evaluated using the **[Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)**, **[Adjusted Rand Index (ARI)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html)**, and **[Normalized Mutual Information (NMI)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html)**.  
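
The sketch below shows how these three metrics might be computed with scikit-learn, again on synthetic data rather than AFHQ features. The Silhouette Score needs only the features and the predicted cluster labels, while ARI and NMI compare the predicted labels against ground-truth classes, which are used here for evaluation only and never for fitting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Synthetic data with known group membership (y_true), standing in for image features.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)                   # internal metric, range [-1, 1]
ari = adjusted_rand_score(y_true, labels)           # external metric, 1.0 = perfect match
nmi = normalized_mutual_info_score(y_true, labels)  # external metric, range [0, 1]

print(f"silhouette={sil:.3f}  ARI={ari:.3f}  NMI={nmi:.3f}")
```

Both ARI and NMI ignore how cluster IDs are numbered, so a clustering that assigns cats to cluster 2 and dogs to cluster 0 is scored the same as one that uses the opposite numbering.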

By comparing the results of multiple unsupervised models — and contrasting them with a small supervised baseline classifier — this project shows both the **strengths and limitations** of unsupervised learning for basic image classification tasks.  
The findings demonstrate how clustering can uncover **hidden visual structure** in data and serve as a foundation for more advanced **self-supervised** or **semi-supervised** learning approaches.

### 2. Dataset Overview and Preprocessing

#### 2.1 Fetching the Dataset

To begin, download the dataset (GitHub does not allow large data files to be stored in the repository):

Git Bash / Linux / WSL:
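```bash
# Create the target directory first; curl -o does not create missing parent directories
mkdir -p "$(pwd)/data"
curl -L -o "$(pwd)/data/animal-faces.zip" https://www.kaggle.com/api/v1/datasets/download/andrewmvd/animal-faces
```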

After downloading, extract the dataset:
```bash
unzip "$(pwd)/data/animal-faces.zip" -d "$(pwd)/data/animal-faces"
```

To confirm successful extraction, verify that the dataset contains 16,130 images:
```bash
find "$(pwd)/data/animal-faces" -type f | wc -l
```

**Expected output**:
```bash
16130
```

Alternatively, download the zip archive manually from https://www.kaggle.com/datasets/andrewmvd/animal-faces, store it under `~/data`, and extract it to `~/data/animal-faces`.

With the dataset successfully extracted and verified, the next step is to explore its structure and visual characteristics through exploratory data analysis (EDA).

### 3. Exploratory Data Analysis (EDA)

#### 3.1 Initial Inspection

This section inspects and visualizes the **Animal Faces-HQ (AFHQ)** dataset to understand its structure, quality, and key characteristics before model building.  
The analysis focuses on data composition, visual patterns, feature correlations, preprocessing, and the main insights that will guide the subsequent unsupervised learning experiments.

Before applying clustering or dimensionality reduction, it is essential to perform an initial visual inspection of the dataset to gain intuition about its organization and diversity.

The dataset is organized into three main categories (**cats**, **dogs**, and **wildlife**), each containing roughly 5,000 high-quality 512×512 images.  
The images are split into training and validation subsets, each with one folder per category, stored under the following structure:


```text
data/
├── animal-faces/                # Extracted dataset
│   └── afhq/
│       ├── train/
│       │   ├── cat/             # 5,153 images
│       │   ├── dog/             # 4,739 images
│       │   └── wild/            # 4,738 images
│       └── val/
│           ├── cat/             # 500 images
│           ├── dog/             # 500 images
│           └── wild/            # 500 images
└── animal-faces.zip             # Original downloaded dataset archive
```
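
The per-folder counts in the tree above can be checked with a short script. The following is a minimal sketch that assumes the `data/animal-faces/afhq/<split>/<class>/` layout shown above and that images use `.jpg`, `.jpeg`, or `.png` extensions; the helper name `count_images` is hypothetical.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def count_images(root: Path) -> Counter:
    """Count image files under root/<split>/<class>/, keyed by (split, class)."""
    counts: Counter = Counter()
    for path in root.glob("*/*/*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            split, cls = path.parent.parent.name, path.parent.name
            counts[(split, cls)] += 1
    return counts


if __name__ == "__main__":
    root = Path("data/animal-faces/afhq")
    counts = count_images(root)
    for (split, cls), n in sorted(counts.items()):
        print(f"{split}/{cls}: {n}")
    print("total:", sum(counts.values()))
```

If extraction succeeded, the printed total should match the 16,130 files reported by the `find` command earlier.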

#### 3.2 Visual Inspection
A few random samples from each class are shown below to demonstrate image quality and diversity.

Observations:
- The classes are **roughly balanced** (≈5,000 images each).  
- Each image is **centered and cropped** to focus on the animal’s face.  
- There is noticeable variation in lighting, background, and species within each class. This exposes the algorithms to a richer set of visual features, although it may also make clean separation on raw pixel values harder.

![Exploring Cat Pic](../images/3-pic.png)   
![Exploring Dog Pic](../images/4-pic.png)   
![Exploring Wild Pic](../images/5-pic.png)   
![Exploring Cat Pic](../images/6-pic.png)   
![Exploring Dog Pic](../images/7-pic.png)   
![Exploring Wild Pic](../images/8-pic.png)
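
Random samples like the ones above could be drawn reproducibly with a small helper. This is only a sketch: the function name `sample_image_paths`, the fixed seed, and the accepted file extensions are assumptions, and plotting is left out to keep the example dependency-free.

```python
import random
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def sample_image_paths(class_dir: Path, k: int, seed: int = 0) -> list[Path]:
    """Return up to k randomly chosen image paths from one class folder."""
    # Sort first so the result depends only on the seed, not on filesystem order.
    paths = sorted(p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))


if __name__ == "__main__":
    train_root = Path("data/animal-faces/afhq/train")
    for cls in ("cat", "dog", "wild"):
        class_dir = train_root / cls
        if class_dir.is_dir():  # skip quietly if the dataset is not extracted yet
            for path in sample_image_paths(class_dir, k=5):
                print(cls, path.name)
```

Using a seeded `random.Random` instance rather than the module-level functions keeps the selection repeatable across notebook runs without affecting other random state.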


#### 3.3 Dataset Composition and Descriptive Summary

#### 3.4 Feature Correlations and Visual Patterns

#### 3.5 Data Quality and Cleaning

#### 3.6 Transformations and Normalization

#### 3.7 Feature Importance and Hypothesis

#### 3.8 Summary of EDA Findings

### 4. Unsupervised Learning Models and Analysis

### 5. Supervised Baseline (Comparative Analysis)

### 6. Discussion and Conclusions

### 7. Future Improvements and Areas to Explore

### 8. Summary of Results

### 9. References and Acknowledgments

1. **Animal Faces-HQ (AFHQ) Dataset (Kaggle):**  
   https://www.kaggle.com/datasets/andrewmvd/animal-faces

2. **Unsupervised Learning Overview:**  
   https://biztechmagazine.com/article/2025/05/what-are-benefits-unsupervised-machine-learning-and-clustering-perfcon

3. **Applications in Diverse Domains:**  
   https://pmc.ncbi.nlm.nih.gov/articles/PMC7983091/

4. **Data Exploration and Pattern Discovery:**  
   https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/mas.21602

5. **Computer Vision Overview:**  
   https://viso.ai/deep-learning/supervised-vs-unsupervised-learning/

6. **Principal Component Analysis (PCA):**  
   https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

7. **SimCLR Paper (Self-Supervised Learning):**  
   https://arxiv.org/abs/2002.05709

8. **Unsupervised Learning in NLP:**  
   https://milvus.io/ai-quick-reference/what-is-the-role-of-unsupervised-learning-in-nlp

9. **Word2Vec Paper:**  
   https://arxiv.org/abs/1301.3781

10. **Latent Dirichlet Allocation (LDA) Paper:**  
    https://jmlr.org/papers/v3/blei03a.html

11. **Healthcare and Biomedical Applications:**  
    https://pubmed.ncbi.nlm.nih.gov/31891765/

12. **Autonomous Systems and Robotics:**  
    https://fiveable.me/introduction-autonomous-robots/unit-7/unsupervised-learning/study-guide/rNorV1tsC0TeCPOO

13. **Recommender and Personalization Systems:**  
    https://www.mdpi.com/2073-8994/12/2/185

14. **t-SNE Algorithm:**  
    https://lvdmaaten.github.io/tsne/

15. **PCA (Scikit-learn Implementation):**  
    https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

16. **K-Means Clustering:**  
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

17. **DBSCAN Clustering:**  
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

18. **Agglomerative Clustering:**  
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

19. **Silhouette Score:**  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

20. **Adjusted Rand Index (ARI):**  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html

21. **Normalized Mutual Information (NMI):**  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html