A PyTorch implementation of Masked Autoencoders (MAE) (He et al., 2022) applied to the STL-10 dataset. This project demonstrates how self-supervised learning can leverage large-scale unlabeled data (100k images) to learn robust visual representations, significantly outperforming supervised training from scratch when labeled data is limited.
- Asymmetric Autoencoder: Vision Transformer (ViT-Base) encoder with a lightweight decoder.
- 75% Masking Ratio: Encourages global context understanding instead of local pixel interpolation.
- Robust Loss Function: Uses L1 Loss for sharper reconstructions compared to MSE.
- Data Efficient: Pre-trained encoder converges 4× faster (≈5 epochs vs 20) on downstream tasks.
- Latent Space Analysis: Includes t-SNE visualization of learned feature manifolds.
The core idea of MAE is to mask a large portion of the image and reconstruct the missing pixels.
Left: Original Image
Middle: Masked Input (model input)
Right: Reconstruction (model output)
Figure 1: Reconstruction results after 50 epochs of self-supervised pre-training using L1 Loss. The model successfully hallucinates missing structures such as object bodies and geometric components.
This project implements the core asymmetric encoder-decoder architecture of Masked Autoencoders:
- Patchify & Mask: The input image is divided into patches, and 75% of them are randomly masked and discarded.
- Encoder: A heavyweight Vision Transformer (ViT) processes only the remaining 25% of visible patches.
- Decoder: A lightweight ViT takes the encoded visible patches, appends learnable mask tokens, unshuffles them to their original positions, and reconstructs the missing pixels.
- Loss: L1 error is computed strictly on the masked patches.
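The four steps above can be sketched with a minimal NumPy example (illustrative only: the project's actual implementation in `models.py` uses PyTorch ViT modules, and the helper names here are hypothetical):

```python
import numpy as np

def patchify(img, p=8):
    """Split an (H, W, C) image into N = (H/p)*(W/p) flattened patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    x = img.reshape(h, p, w, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(h * w, p * p * C)

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random 25% of patches; also return the inverse permutation
    that the decoder uses to unshuffle tokens back to image order."""
    rng = np.random.default_rng(seed)
    N = patches.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = rng.permutation(N)        # random patch order
    ids_restore = np.argsort(ids_shuffle)   # inverse permutation (unshuffle)
    mask = np.ones(N)                       # 1 = masked, 0 = visible
    mask[ids_shuffle[:n_keep]] = 0
    return patches[ids_shuffle[:n_keep]], mask, ids_restore

def masked_l1(pred, target, mask):
    """L1 reconstruction error averaged over masked patches only."""
    per_patch = np.abs(pred - target).mean(axis=1)
    return (per_patch * mask).sum() / mask.sum()

img = np.random.default_rng(1).random((96, 96, 3)).astype(np.float32)  # STL-10 size
patches = patchify(img)                      # (144, 192): 12x12 patches of 8x8x3
visible, mask, ids_restore = random_masking(patches)
print(visible.shape, int(mask.sum()))        # 36 visible patches, 108 masked
```

Computing the loss only over masked patches prevents the model from being rewarded for trivially copying the visible input.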
Does the model learn semantic structure without labels?
Figure 2: MAE pre-trained model (left) forms dense, semantically meaningful clusters before fine-tuning. The baseline model (right) trained from scratch exhibits a scattered and brittle feature space.
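A visualization like Figure 2 can be produced by running t-SNE over encoder outputs. The sketch below uses random placeholder features; in the actual notebook the features would be the encoder embeddings of validation images, and the variable names here are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use("Agg")  # headless backend for saving figures
import matplotlib.pyplot as plt

# Placeholder features: in practice these would be the embeddings produced
# by the (pre-trained or baseline) ViT encoder on validation images.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 768))   # 200 images, ViT-Base embedding dim
labels = rng.integers(0, 10, size=200)   # STL-10 has 10 classes

# Project the 768-d features down to 2-d for plotting.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)   # shape (200, 2)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of encoder features")
plt.savefig("tsne_features.png", dpi=150)
```

Points colored by class that form tight, separated clusters indicate that the encoder has learned semantic structure without ever seeing labels.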
The pre-trained encoder is fine-tuned on a 5k labeled subset of STL-10 and compared with a ViT trained from scratch.
| Validation Accuracy | Validation F1-Score |
|---|---|
| ![Validation accuracy comparison](images/comparison_accuracy.png) | ![Validation F1-score comparison](images/comparison_f1.png) |
| Figure 3a: MAE pre-training (blue) significantly outperforms the baseline (orange). | Figure 3b: F1-score confirms robustness across all classes. |
Figure 4: The MAE pre-trained model reaches optimal performance in ~5 epochs, while the baseline requires substantially more compute to reach lower final accuracy.
- Clone the repository:

```bash
git clone https://github.com/Alpsource/Visual-Representation-Learning-MAE.git
cd Visual-Representation-Learning-MAE
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Train the Masked Autoencoder on the 100k unlabeled images of STL-10:

```bash
jupyter notebook Self_Supervised_Learning.ipynb
```

- Input: STL-10 Unlabeled Split
- Output: Encoder weights saved as `mae_encoder_final.pth`

Fine-tune the pre-trained encoder on the 5k labeled images and evaluate performance:

```bash
jupyter notebook Fine_Tuning.ipynb
```

- Input: STL-10 Train/Test Split + `mae_encoder_final.pth`
- Output: Classification metrics, confusion matrix, and comparison plots
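Conceptually, fine-tuning amounts to loading the pre-trained encoder weights and attaching a linear classification head. The sketch below uses a tiny stand-in encoder so it stays self-contained; the real encoder lives in `models.py`, and `TinyEncoder` is hypothetical:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the project's ViT encoder (hypothetical)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(3 * 96 * 96, dim)

    def forward(self, x):
        return self.proj(x.flatten(1))       # (B, dim) global feature

encoder = TinyEncoder()
# In the actual project, the pre-trained weights would be restored with:
# encoder.load_state_dict(torch.load("mae_encoder_final.pth"))
head = nn.Linear(768, 10)                    # STL-10 has 10 classes
model = nn.Sequential(encoder, head)

x = torch.randn(4, 3, 96, 96)                # batch of STL-10-sized images
logits = model(x)
print(logits.shape)                          # torch.Size([4, 10])
```

The whole network (encoder plus head) is then trained on the 5k labeled images with a standard cross-entropy objective.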
.
├── models.py # MAE model (ViT encoder + lightweight decoder)
├── Self_Supervised_Learning.ipynb # Pre-training & reconstruction visualization
├── Fine_Tuning.ipynb # Downstream classification & evaluation
├── requirements.txt # Python dependencies
├── images/ # Saved plots and figures
│ ├── comparison_accuracy.png
│ ├── comparison_f1.png
│ ├── comparison_loss.png
│ ├── Epoch_50_ssl_L1.png
│ └── tsne_features.png
└── README.md # Project documentation
This project is based on the following work:
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
CVPR 2022
arXiv: 2111.06377