# CSCA 5632 Final Project - Unsupervised and Supervised Learning on Animal Faces (AFHQ Dataset)
### By Moshiur Howlader
##### GitHub Link: https://github.com/Mosh333/csca5632-final-project

### 1. Introduction

The ability to **discover structure and meaning in unlabeled data** is one of the central problems in machine learning.  
Supervised learning relies on extensive labeled datasets, yet many real-world domains contain **vast quantities of raw, unannotated data** (images, text, medical scans, or sensor streams) where manual labeling is costly or infeasible.  
Here, [**unsupervised learning**](https://biztechmagazine.com/article/2025/05/what-are-benefits-unsupervised-machine-learning-and-clustering-perfcon) plays a vital role: it enables algorithms to uncover hidden patterns, latent representations, and natural groupings within data without labels.

Unsupervised learning drives [innovation across numerous fields](https://pmc.ncbi.nlm.nih.gov/articles/PMC7983091/):

- [**Data exploration and pattern discovery:**](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/mas.21602) open-ended analysis of large, high-dimensional datasets to uncover hidden structures, correlations, and trends without predefined labels or targets, reducing dimensionality and aiding human interpretation, for example when exploring mass spectrometry data across large experimental datasets.  
- [**Computer vision:**](https://viso.ai/deep-learning/supervised-vs-unsupervised-learning/) grouping unlabeled images by similarity, compressing data via [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), or learning visual embeddings with self-supervised methods like [SimCLR](https://arxiv.org/abs/2002.05709).  
- [**Natural language processing:**](https://milvus.io/ai-quick-reference/what-is-the-role-of-unsupervised-learning-in-nlp) learning semantic relationships in text through [Word2Vec](https://arxiv.org/abs/1301.3781) or topic discovery with [Latent Dirichlet Allocation (LDA)](https://jmlr.org/papers/v3/blei03a.html).
- [**Healthcare and biomedical research:**](https://pubmed.ncbi.nlm.nih.gov/39667278/) discovering hidden disease patterns, comorbidity clusters, and patient subgroups in large-scale electronic health records to better understand latent traits, risk domains, and disease progression, for example identifying novel comorbidity patterns in aging cohorts from population-based EHR data.  
- [**Autonomous systems and robotics:**](https://fiveable.me/introduction-autonomous-robots/unit-7/unsupervised-learning/study-guide/rNorV1tsC0TeCPOO) mapping environments, grouping sensor inputs, or learning spatial representations without labeled supervision.  
- [**Recommender and personalization systems:**](https://www.mdpi.com/2073-8994/12/2/185) clustering users or content to generate recommendations when explicit ratings are unavailable.

These examples show that unsupervised learning provides many of the tools used for exploratory data analysis and representation learning, helping models extract structure from raw data before any labels are available.

In this project, I apply unsupervised learning techniques to the **[Animal Faces-HQ (AFHQ) dataset](https://www.kaggle.com/datasets/andrewmvd/animal-faces)** — a collection of over 16,000 high-quality animal face images spanning three categories: **cats, dogs, and wildlife**.  
The goal is to examine whether unsupervised algorithms can **discover meaningful groupings** among these images based purely on visual similarity and to what extent these clusters correspond to true species categories.  
I perform **exploratory data analysis (EDA)**, visualize image embeddings using [t-SNE](https://lvdmaaten.github.io/tsne/) and [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and apply clustering algorithms including [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), and [Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html).  
Clustering performance is evaluated with the **[Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)**, an internal measure that requires no labels, and with the **[Adjusted Rand Index (ARI)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html)** and **[Normalized Mutual Information (NMI)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html)**, which compare cluster assignments against the true species labels.
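
As a rough illustration of the workflow described above, the following sketch runs PCA, a t-SNE projection, and the three clustering algorithms on synthetic data, then computes the three metrics. The `make_blobs` data, the 64-dimensional feature size, the 10 PCA components, and the DBSCAN `eps` value are all assumptions chosen for a toy example; they are not the settings or results of this project.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# ASSUMPTION: synthetic stand-in for image feature vectors (not AFHQ data).
# Three well-separated groups play the role of cat / dog / wildlife.
X, y_true = make_blobs(n_samples=300, centers=3, n_features=64, random_state=0)

# Reduce dimensionality before clustering; 10 components is an arbitrary choice.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# 2-D t-SNE projection, intended for plotting only (not used for clustering here).
embedding_2d = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    # ASSUMPTION: eps is an untuned value for this toy data and would need
    # tuning on real features. DBSCAN labels noise points as -1.
    "dbscan": DBSCAN(eps=5.0, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_reduced)
    n_labels = len(set(labels))
    # Silhouette is only defined for 2 <= n_labels <= n_samples - 1.
    if 1 < n_labels < len(labels):
        sil = silhouette_score(X_reduced, labels)
    else:
        sil = float("nan")
    results[name] = {
        "silhouette": sil,                                     # internal: no labels needed
        "ari": adjusted_rand_score(y_true, labels),            # external: uses true labels
        "nmi": normalized_mutual_info_score(y_true, labels),   # external: uses true labels
    }
    print(name, results[name])
```

Note that in this sketch any DBSCAN noise points (label `-1`) are simply treated as one extra group when the metrics are computed; a fuller analysis might report the noise fraction separately.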

By comparing these unsupervised models, and contrasting them with a small supervised baseline classifier, this project illustrates both the **potential and limitations** of unsupervised learning for image categorization.  
The work also shows how clustering can reveal **latent visual structure** and serve as a foundation for more advanced **self-supervised** or **semi-supervised** systems.

### 2. Dataset Overview and Preprocessing

### 3. Exploratory Data Analysis (EDA)

### 4. Unsupervised Learning Models and Analysis

### 5. Supervised Baseline (Comparative Analysis)

### 6. Discussion and Conclusions

### 7. Future Improvements and Areas to Explore

### 8. Summary of Results

### 9. References and Acknowledgments