A PyTorch implementation of Masked Autoencoders (MAE) (He et al., 2022) applied to the STL-10 dataset. This project demonstrates how self-supervised learning can leverage large-scale unlabeled data (100k images) to learn robust visual representations, significantly outperforming supervised training from scratch when labeled data is limited.
- Asymmetric Autoencoder: Vision Transformer (ViT-Base) encoder with a lightweight decoder.
- 75% Masking Ratio: Encourages global context understanding instead of local pixel interpolation.
- Robust Loss Function: Uses L1 Loss for sharper reconstructions compared to MSE.
- Data Efficient: Pre-trained encoder converges 4× faster (≈5 epochs vs 20) on downstream tasks.
- Latent Space Analysis: Includes t-SNE visualization of learned feature manifolds.
The core idea of MAE is to mask a large portion of the image and reconstruct the missing pixels.
Left: Original Image
Middle: Masked Input (model input)
Right: Reconstruction (model output)
Figure 1: Reconstruction results after 50 epochs of self-supervised pre-training using L1 Loss. The model successfully hallucinates missing structures such as object bodies and geometric components.
This project implements the core asymmetric encoder-decoder architecture of Masked Autoencoders:
- Patchify & Mask: The input image is divided into patches, and 75% of them are randomly masked and discarded.
- Encoder: A heavyweight Vision Transformer (ViT) processes only the remaining 25% of visible patches.
- Decoder: A lightweight ViT takes the encoded visible patches, appends learnable mask tokens, unshuffles them to their original positions, and reconstructs the missing pixels.
- Loss: L1 error is computed strictly on the masked patches.
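The four steps above can be sketched with a minimal NumPy example (illustrative only: the project's actual implementation in `models.py` uses PyTorch ViT modules, and the helper names here are hypothetical):

```python
import numpy as np

def patchify(img, p=8):
    """Split an (H, W, C) image into N = (H/p)*(W/p) flattened patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    x = img.reshape(h, p, w, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(h * w, p * p * C)

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random 25% of patches; also return the inverse permutation
    that the decoder uses to unshuffle tokens back to image order."""
    rng = np.random.default_rng(seed)
    N = patches.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = rng.permutation(N)        # random patch order
    ids_restore = np.argsort(ids_shuffle)   # inverse permutation (unshuffle)
    mask = np.ones(N)                       # 1 = masked, 0 = visible
    mask[ids_shuffle[:n_keep]] = 0
    return patches[ids_shuffle[:n_keep]], mask, ids_restore

def masked_l1(pred, target, mask):
    """L1 reconstruction error averaged over masked patches only."""
    per_patch = np.abs(pred - target).mean(axis=1)
    return (per_patch * mask).sum() / mask.sum()

img = np.random.default_rng(1).random((96, 96, 3)).astype(np.float32)  # STL-10 size
patches = patchify(img)                      # (144, 192): 12x12 patches of 8x8x3
visible, mask, ids_restore = random_masking(patches)
print(visible.shape, int(mask.sum()))        # 36 visible patches, 108 masked
```

Computing the loss only over masked patches prevents the model from being rewarded for trivially copying the visible input.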
Does the model learn semantic structure without labels?
Figure 2: MAE pre-trained model (left) forms dense, semantically meaningful clusters before fine-tuning. The baseline model (right) trained from scratch exhibits a scattered and brittle feature space.
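A visualization like Figure 2 can be produced by running t-SNE over encoder outputs. The sketch below uses random placeholder features; in the actual notebook the features would be the encoder embeddings of validation images, and the variable names here are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use("Agg")  # headless backend for saving figures
import matplotlib.pyplot as plt

# Placeholder features: in practice these would be the embeddings produced
# by the (pre-trained or baseline) ViT encoder on validation images.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 768))   # 200 images, ViT-Base embedding dim
labels = rng.integers(0, 10, size=200)   # STL-10 has 10 classes

# Project the 768-d features down to 2-d for plotting.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)   # shape (200, 2)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of encoder features")
plt.savefig("tsne_features.png", dpi=150)
```

Points colored by class that form tight, separated clusters indicate that the encoder has learned semantic structure without ever seeing labels.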
The pre-trained encoder is fine-tuned on a 5k labeled subset of STL-10 and compared with a ViT trained from scratch.
| Validation Accuracy | Validation F1-Score |
|---|---|
| ![Validation accuracy comparison](images/comparison_accuracy.png) | ![Validation F1-score comparison](images/comparison_f1.png) |
| Figure 3a: MAE pre-training (blue) significantly outperforms the baseline (orange). | Figure 3b: F1-score confirms robustness across all classes. |
Figure 4: The MAE pre-trained model reaches optimal performance in ~5 epochs, while the baseline requires substantially more compute to reach lower final accuracy.
- Clone the repository:

```bash
git clone https://github.com/Alpsource/Visual-Representation-Learning-MAE.git
cd Visual-Representation-Learning-MAE
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Train the Masked Autoencoder on the 100k unlabeled images of STL-10:

```bash
jupyter notebook Self_Supervised_Learning.ipynb
```

- Input: STL-10 Unlabeled Split
- Output: Encoder weights saved as `mae_encoder_final.pth`

Fine-tune the pre-trained encoder on the 5k labeled images and evaluate performance:

```bash
jupyter notebook Fine_Tuning.ipynb
```

- Input: STL-10 Train/Test Split + `mae_encoder_final.pth`
- Output: Classification metrics, confusion matrix, and comparison plots
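Conceptually, fine-tuning amounts to loading the pre-trained encoder weights and attaching a linear classification head. The sketch below uses a tiny stand-in encoder so it stays self-contained; the real encoder lives in `models.py`, and `TinyEncoder` is hypothetical:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the project's ViT encoder (hypothetical)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(3 * 96 * 96, dim)

    def forward(self, x):
        return self.proj(x.flatten(1))       # (B, dim) global feature

encoder = TinyEncoder()
# In the actual project, the pre-trained weights would be restored with:
# encoder.load_state_dict(torch.load("mae_encoder_final.pth"))
head = nn.Linear(768, 10)                    # STL-10 has 10 classes
model = nn.Sequential(encoder, head)

x = torch.randn(4, 3, 96, 96)                # batch of STL-10-sized images
logits = model(x)
print(logits.shape)                          # torch.Size([4, 10])
```

The whole network (encoder plus head) is then trained on the 5k labeled images with a standard cross-entropy objective.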
.
├── models.py # MAE model (ViT encoder + lightweight decoder)
├── Self_Supervised_Learning.ipynb # Pre-training & reconstruction visualization
├── Fine_Tuning.ipynb # Downstream classification & evaluation
├── requirements.txt # Python dependencies
├── images/ # Saved plots and figures
│ ├── comparison_accuracy.png
│ ├── comparison_f1.png
│ ├── comparison_loss.png
│ ├── Epoch_50_ssl_L1.png
│ └── tsne_features.png
└── README.md # Project documentation
This project is based on the following work:
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
CVPR 2022
arXiv: 2111.06377