Extremely simple one-shot learning in Python
MLWave added note
Latest commit c606f67 Mar 17, 2016

An embarrassingly simple approach to one-shot learning

This is an experiment in Python using approaches from the ICML '15 paper “An embarrassingly simple approach to zero-shot learning” by Bernardino Romera-Paredes and Philip H. S. Torr.

An embarrassingly simple approach to zero-shot learning

With matrix factorization you can decompose an n*m matrix into an n*a matrix and an a*m matrix, where a is the number of latent features.
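As a small illustration of that decomposition, the sketch below factorizes a matrix with a truncated SVD. The choice of SVD, the sizes n=6, m=5, a=2 and the random data are assumptions made for the example; the text does not prescribe a particular factorization method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, a = 6, 5, 2  # hypothetical sizes

# Build an n*m matrix of rank a, so that an exact n*a by a*m factorization exists.
M = rng.normal(size=(n, a)) @ rng.normal(size=(a, m))

# Truncated SVD: keep only the a largest singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
left = U[:, :a] * s[:a]   # n*a matrix
right = Vt[:a, :]         # a*m matrix

# Largest absolute difference between M and the product of its two factors.
reconstruction_error = np.abs(left @ right - M).max()
```

For a matrix whose rank is larger than a, the same code gives an approximation rather than an exact reconstruction.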

The paper uses this idea to do zero-shot learning.

During the training stage an n*m weight/coefficient matrix is trained, where n is the number of features and m is the number of classes, such that np.argmax( np.dot( x.T, weight_matrix ) ) = predicted class.
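The prediction rule can be sketched with a tiny example. The weights and the sample below are made-up numbers (n=3 features, m=2 classes), used only to show the shapes involved.

```python
import numpy as np

# Hypothetical weight matrix: one row per feature, one column per class.
weight_matrix = np.array([[ 1.0, -1.0],
                          [ 0.5,  0.0],
                          [-2.0,  2.0]])

# Hypothetical sample with n=3 features.
x = np.array([0.2, 1.0, 0.9])

# One score per class; the predicted class is the index of the largest score.
scores = np.dot(x.T, weight_matrix)
predicted_class = int(np.argmax(scores))
```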

They also train an a*m Signature matrix in an unsupervised manner. a is the number of (binary or soft) class attributes, which can come from the dataset, from external data, or from unsupervised learning.

For instance, when the classes are bear and horse, and the attributes are [brown, can_ride, domesticated], the Signature matrix may look like:

bear horse
[ 1, 1 ] brown
[ 0, 1 ] can_ride
[ 0, 1 ] domesticated

Using the Signature matrix S and the Weight matrix W we calculate an n*a matrix V, such that np.dot( V, S ) = ~W.
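A minimal sketch of this step, using the bear/horse Signature matrix from above and solving for V with ordinary least squares. The weight matrix W here is random stand-in data with a hypothetical n=4 features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Signature matrix S (a=3 attributes, m=2 classes): columns are bear, horse.
S = np.array([[1, 1],   # brown
              [0, 1],   # can_ride
              [0, 1]],  # domesticated
             dtype=float)

# Stand-in weight matrix W (n=4 features, m=2 classes); random for illustration.
W = rng.normal(size=(4, 2))

# Solve np.dot(V, S) = W for V in the least-squares sense.
# lstsq solves A @ X = B, so we pass the transposed system S.T @ V.T = W.T.
V_T, *_ = np.linalg.lstsq(S.T, W.T, rcond=None)
V = V_T.T                    # n*a matrix
W_approx = np.dot(V, S)      # ~W
```

With only two classes and three attributes the system is underdetermined, so `lstsq` returns the minimum-norm V and ~W matches W exactly. With more classes than attributes, ~W is only an approximation of W.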


When we want to predict new classes we create a new signature matrix S'.

moose donkey tiger
[ 1, 1, 0 ] brown
[ 0, 1, 0 ] can_ride
[ 0, 1, 0 ] domesticated

We use np.dot( V, S' ) to obtain ~W'. For new test samples we compute np.argmax( np.dot( x.T, ~W' ) ) to get our class prediction.
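The step for new classes can be sketched as follows. S' is the moose/donkey/tiger matrix from above; the matrix V and the sample x are hypothetical numbers (n=2 features) chosen only to make the example small.

```python
import numpy as np

new_classes = ["moose", "donkey", "tiger"]

# Signature matrix S' for the new classes (rows: brown, can_ride, domesticated).
S_new = np.array([[1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 0]], dtype=float)

# Hypothetical n*a matrix V (n=2 features, a=3 attributes).
V = np.array([[ 1.0, 0.5, 0.5],
              [-1.0, 2.0, 1.0]])

# Weight matrix for the new classes: one column per new class.
W_new = np.dot(V, S_new)

# Hypothetical test sample with n=2 features.
x = np.array([1.0, 0.2])
predicted = new_classes[int(np.argmax(np.dot(x.T, W_new)))]
```

Note that tiger has an all-zero signature in this example, so its column in ~W' is all zeros and its score is always 0.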



We create the Weight matrix using logistic regression.

We create attributes with unsupervised learning.

We take the first 2 components of:

  • PCA
  • Locally linear embedding (LLE)

to create 4 class attributes. For each class we average the 4-dimensional PCA+LLE vectors of all train samples belonging to that class to get our Signature matrix. For instance, with 2 classes digit1 and digit3:

digit1 digit3
[ 0.05, 0.06 ] PCA1
[ 0.11, 0.96 ] PCA2
[ 0.45, 0.11 ] LLE1
[ 0.95, 0.13 ] LLE2
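The per-class averaging can be sketched in a few lines. The embedded vectors in Z are hypothetical values standing in for the output of the fitted PCA and LLE models; they are not taken from the experiment.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (PCA1, PCA2, LLE1, LLE2),
# one row per training sample.
Z = np.array([[0.0, 0.1, 0.4, 1.0],
              [0.1, 0.1, 0.5, 0.9],
              [0.1, 1.0, 0.1, 0.1],
              [0.0, 0.9, 0.1, 0.2]])
labels = np.array(["digit1", "digit1", "digit3", "digit3"])
classes = ["digit1", "digit3"]

# Average the embedded samples of each class; each class becomes one column.
S = np.column_stack([Z[labels == c].mean(axis=0) for c in classes])
```

The same averaging applies to new classes: with a single labeled sample per class the "average" is just that sample's embedded vector.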

When we want to predict new classes we take at least 1 sample and use the fitted PCA and LLE models to get a 4-dimensional vector. Taking the average of more samples per class improves performance.

digit7 digit9
[ 0.04, 0.19 ] PCA1
[ 0.12, 0.76 ] PCA2
[ 0.49, 0.14 ] LLE1
[ 0.85, 0.11 ] LLE2


We use a 10-class digit dataset with 64 features. Digits 0, 1, 2, 7, 8, 9 are the seen classes; digits 3, 4, 5, 6 are the unseen classes.

After fitting logistic regression (0.911 accuracy) on the 6 seen classes our Weight coefficient matrix is 64*6. Our Signature matrix is 4*6. We calculate a 64*4 matrix V.

To create predictions for the unseen classes we take 1 sample per new class and transform them with PCA and LLE to get our Signature matrix S'.

We use V and S' to calculate ~W' with size 64*4. Now for all test samples we calculate np.argmax( np.dot( x.T, ~W' ) ) to get a class prediction.

We obtain a multi-class accuracy of 0.846 with 50 labeled samples per class, 0.759 with 10 labeled samples per class, and a more variable accuracy of 0.609 with 1 labeled sample per class. By comparison, random guessing among the 4 unseen classes gives 0.25 accuracy.
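The whole experiment can be sketched end to end as below. This is a reconstruction under assumptions, not the repository's script: it assumes scikit-learn's `load_digits` as the 64-feature digit dataset, scales pixels to [0, 1], ignores the logistic regression intercept, and picks `n_neighbors=10`, `k=10` labeled samples per unseen class and a fixed random seed arbitrarily. It is not expected to reproduce the accuracies quoted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import LocallyLinearEmbedding

seen, unseen = [0, 1, 2, 7, 8, 9], [3, 4, 5, 6]
X, y = load_digits(return_X_y=True)
X = X / 16.0  # assumption: scale pixel values to [0, 1]

seen_mask = np.isin(y, seen)
X_seen, y_seen = X[seen_mask], y[seen_mask]
X_unseen, y_unseen = X[~seen_mask], y[~seen_mask]

# Weight matrix W (64*6) from logistic regression on the seen classes.
# The intercept is ignored here for simplicity.
clf = LogisticRegression(max_iter=2000).fit(X_seen, y_seen)
W = clf.coef_.T

# Unsupervised attributes: first 2 components of PCA and of LLE.
pca = PCA(n_components=2).fit(X_seen)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                             eigen_solver="dense").fit(X_seen)

def signature(samples, labels, classes):
    """Average the 4-dimensional PCA+LLE vectors per class (one column each)."""
    Z = np.hstack([pca.transform(samples), lle.transform(samples)])
    return np.column_stack([Z[labels == c].mean(axis=0) for c in classes])

S = signature(X_seen, y_seen, seen)              # 4*6
V = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T   # 64*4, least squares

# Pick k labeled samples per unseen class to build S'.
k = 10
rng = np.random.default_rng(0)
support_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_unseen == c), size=k, replace=False)
    for c in unseen
])
S_new = signature(X_unseen[support_idx], y_unseen[support_idx], unseen)  # 4*4
W_new = np.dot(V, S_new)                                                 # 64*4

# Evaluate on the unseen-class samples that were not used to build S'.
test_mask = np.ones(len(y_unseen), dtype=bool)
test_mask[support_idx] = False
pred = np.array(unseen)[np.argmax(np.dot(X_unseen[test_mask], W_new), axis=1)]
accuracy = float((pred == y_unseen[test_mask]).mean())
```

Changing `k` to 1 or 50 corresponds to the other settings reported above.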


We used a very simple toy dataset and a non-rigorous method of evaluation. The goal was to replicate the basic idea with an extremely simple baseline, not to obtain (or claim) state-of-the-art performance.

One-shot learning is a less strict setting than zero-shot learning: we need at least one labeled sample per new class (or another model that communicates the class as a vector). In return, we can use this approach when no class attributes are available.

We completely gloss over one of the main contributions in the paper: Regularization of the V matrix. We calculate V from W and S with least-squares. The paper includes a regularizer with more favourable properties.
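To give a feel for what regularizing V means, the sketch below adds a plain ridge (L2) penalty to the least-squares problem. This is a generic technique chosen for illustration and is not claimed to be the regularizer from the paper; the function name `fit_v_ridge`, the parameter `lam` and the random W and S are all hypothetical.

```python
import numpy as np

def fit_v_ridge(W, S, lam=1.0):
    """Minimise ||np.dot(V, S) - W||^2 + lam * ||V||^2 (Frobenius norms).

    Closed form: V = W S^T (S S^T + lam * I)^-1, computed with a linear solve.
    """
    a = S.shape[0]
    return np.linalg.solve(S @ S.T + lam * np.eye(a), S @ W.T).T

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 6))  # stand-in weight matrix (n=64, m=6)
S = rng.normal(size=(4, 6))   # stand-in Signature matrix (a=4, m=6)

# Unregularized least-squares solution, as used in this experiment.
V_ls = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T

V_tiny = fit_v_ridge(W, S, lam=1e-10)  # almost no penalty
V_big = fit_v_ridge(W, S, lam=10.0)    # strong penalty shrinks V
```

With a tiny `lam` the result matches the least-squares V; a larger `lam` shrinks the entries of V.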

See the original Matlab code here: and a repository with the code for the real data experiments in the paper here:
