In [1]:
import pickle

from product2vec import BasketGenerator, Product2Vec, EpochLogger

### Main idea

Original paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3519358

The Product2Vec model can find complements (complementary goods) and substitutes (interchangeable goods). Complementary products bring more utility when used together (e.g. cereal and milk). Interchangeable products have similar properties and offer a buyer almost the same utility (e.g. coffee and tea, to some extent).

Product2Vec uses Word2Vec under the hood to build an embedding for each product. The embeddings alone don't distinguish complements from substitutes, so the model computes two dedicated scores (an exchangeability score and a complementarity score) to make the decision. A higher score means a product is more likely to be a substitute or a complement, respectively.

The only data the model needs is shopping baskets of purchased goods. Product order within a basket doesn't matter, so repeated labels add no value. Every basket should contain at least two unique products.
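
As a minimal sketch of preparing real data to meet these requirements, the snippet below deduplicates labels within each basket and drops baskets with fewer than two unique products. It assumes baskets are lists of string labels (as the `"0"` lookups later in this notebook suggest); `clean_baskets` is a hypothetical helper, not part of product2vec.

```python
def clean_baskets(raw_baskets):
    """Deduplicate labels in each basket and drop baskets with < 2 unique products."""
    cleaned = []
    for basket in raw_baskets:
        # dict.fromkeys removes repeats while keeping first-seen order
        unique = list(dict.fromkeys(str(label) for label in basket))
        if len(unique) >= 2:
            cleaned.append(unique)
    return cleaned


raw = [["milk", "cereal", "milk"], ["coffee"], ["tea", "tea"], ["bread", "butter"]]
print(clean_baskets(raw))  # [['milk', 'cereal'], ['bread', 'butter']]
```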

Put simply, Product2Vec treats two products as complements if they frequently occur in the same basket, and as substitutes if they are frequently bought together with similar products.
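
The intuition can be illustrated with plain counting on a toy dataset. This is only a crude heuristic under made-up data, not the scores Product2Vec computes from embeddings: pair counts stand in for the complement signal, and the overlap of co-purchase contexts stands in for the substitute signal.

```python
from collections import Counter
from itertools import combinations

baskets = [
    ["cereal", "milk"],
    ["cereal", "milk", "banana"],
    ["coffee", "sugar", "biscuits"],
    ["tea", "sugar", "biscuits"],
    ["coffee", "sugar"],
    ["tea", "sugar"],
]

# Complement signal: how often each unordered pair shares a basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        pair_counts[(a, b)] += 1


def context(product):
    """Products that appear in the same baskets as `product`."""
    return {p for basket in baskets if product in basket for p in basket if p != product}


def context_overlap(a, b):
    """Substitute signal: Jaccard overlap of the two products' contexts."""
    ca, cb = context(a) - {b}, context(b) - {a}
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


print(pair_counts[("cereal", "milk")])   # 2: bought together, complement-like
print(pair_counts[("coffee", "tea")])    # 0: never bought together
print(context_overlap("coffee", "tea"))  # 1.0: same companions, substitute-like
```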

### Generating data

Let's generate synthetic baskets with product labels from '0' to '1000', randomly distributed across 100000 baskets. To be precise, basket generation is based on a randomly generated copurchase matrix, which assigns the probability of two products occurring in the same basket. Basket size varies within the specified bounds. Refer to the source code for more comments on the implementation.
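
To make the idea of copurchase-matrix-based generation concrete, here is a small standard-library sketch. It is an assumption about how such a generator could work, not the actual `BasketGenerator` algorithm (for instance, it does not model the `extreme` parameter); `make_copurchase_matrix` and `sample_basket` are hypothetical names.

```python
import random


def make_copurchase_matrix(n_products, rng):
    """Random symmetric matrix of positive pairwise copurchase affinities."""
    matrix = [[0.0] * n_products for _ in range(n_products)]
    for i in range(n_products):
        for j in range(i + 1, n_products):
            # small offset keeps every weight strictly positive
            matrix[i][j] = matrix[j][i] = rng.random() + 1e-9
    return matrix


def sample_basket(matrix, min_size, max_size, rng):
    """Grow a basket of unique products, favouring high-affinity additions."""
    n_products = len(matrix)
    size = rng.randint(min_size, max_size)
    basket = [rng.randrange(n_products)]
    while len(basket) < size:
        candidates = [p for p in range(n_products) if p not in basket]
        # weight of a candidate = total affinity with products already chosen
        weights = [sum(matrix[p][q] for q in basket) for p in candidates]
        basket.append(rng.choices(candidates, weights=weights, k=1)[0])
    return [str(p) for p in basket]


rng = random.Random(1)
matrix = make_copurchase_matrix(20, rng)
baskets = [sample_basket(matrix, 2, 5, rng) for _ in range(3)]
print(baskets)
```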

In [2]:
generator = BasketGenerator(
    n_jobs=-1,  # number of workers
    verbose=1,  # verbosity level
    seed=1,  # random seed
    extreme=10,  # how extreme copurchase probabilities can be
)
data = generator(
    n_baskets=100000,  # number of baskets
    n_products=1000,  # number of unique products
    min_size=2,  # minimum number of unique products in the same basket, should be >= 2
    max_size=10,  # maximum number of unique products in the same basket
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 1420 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 31500 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 76300 tasks      | elapsed:   12.7s
[Parallel(n_jobs=-1)]: Done 100000 out of 100000 | elapsed:   15.8s finished


### Fitting model and logging progress

EpochLogger prints the current epoch and a linear estimate of the time left. The Product2Vec model accepts all gensim Word2Vec parameters except:
- sentences (passed to the `fit` method instead)
- window (set to 1000, so the whole basket is used as context)
- sg (set to 1, i.e. skip-gram)
- hs (set to 0, i.e. negative sampling instead of hierarchical softmax)
- shrink_windows (set to False, i.e. fixed window size)
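
One way to picture the list above is a wrapper that merges user-supplied keyword arguments with the fixed values. The helper below is hypothetical and only mirrors that list. Whether the real `Product2Vec` raises on a clash or silently overrides the value is not stated here, so the error is an assumption.

```python
# Hypothetical helper, not part of product2vec.
FIXED_PARAMS = {"window": 1000, "sg": 1, "hs": 0, "shrink_windows": False}
RESERVED = set(FIXED_PARAMS) | {"sentences"}


def build_word2vec_kwargs(**user_kwargs):
    """Combine user kwargs with the fixed ones, rejecting reserved names."""
    clashes = RESERVED & set(user_kwargs)
    if clashes:
        raise TypeError(f"parameters fixed by the wrapper: {sorted(clashes)}")
    return {**user_kwargs, **FIXED_PARAMS}


kwargs = build_word2vec_kwargs(vector_size=10, epochs=10, seed=1, workers=4)
print(kwargs)
```

With a recent gensim 4.x installed, a dictionary like this could presumably be passed on as `Word2Vec(sentences=baskets, **kwargs)`.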

In [3]:
logger = EpochLogger(n_latest=5)  # how many latest epochs to use to estimate time left

#  refer to the original paper for optimal learning parameters
prod2vec = Product2Vec(
    vector_size=10, epochs=10, callbacks=[logger], seed=1, workers=4
)
_ = prod2vec.fit(data)

Epoch #1. Estimated time left - To be estimated
Epoch #2. Estimated time left - 00:10
Epoch #3. Estimated time left - 00:09
Epoch #4. Estimated time left - 00:08
Epoch #5. Estimated time left - 00:07
Epoch #6. Estimated time left - 00:06
Epoch #7. Estimated time left - 00:04
Epoch #8. Estimated time left - 00:03
Epoch #9. Estimated time left - 00:02
Epoch #10. Estimated time left - 00:01
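
A linear estimate of this kind can be sketched as the mean duration of the latest epochs multiplied by the number of epochs remaining. The class below is a hypothetical stand-in, not `EpochLogger` itself, and assumes `n_latest` bounds how many recent epochs enter the average.

```python
from collections import deque


class TimeLeftEstimator:
    """Hypothetical stand-in for the time-left estimate printed above."""

    def __init__(self, total_epochs, n_latest=5):
        self.total_epochs = total_epochs
        self.durations = deque(maxlen=n_latest)  # keeps only the latest epochs
        self.finished = 0

    def epoch_finished(self, seconds):
        self.durations.append(seconds)
        self.finished += 1

    def time_left(self):
        if not self.durations:
            return None  # nothing to extrapolate from yet
        mean = sum(self.durations) / len(self.durations)
        return mean * (self.total_epochs - self.finished)


est = TimeLeftEstimator(total_epochs=10, n_latest=5)
for seconds in [1.2, 1.0, 1.1]:
    est.epoch_finished(seconds)
print(f"{est.time_left():.1f}s")  # 7.7s
```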


After fitting, you can access the underlying gensim model via the `.model_` attribute of the Product2Vec instance. All of its methods and attributes are available.

In [4]:
prod2vec.model_.wv["0"]

array([ 0.21477434,  0.03511613, -0.439563  , -0.37581408,  0.1593026 ,
        0.6893432 ,  0.06630325, -0.30754682, -0.6433185 , -0.743252  ],
      dtype=float32)

### Making inference

Complementarity and exchangeability scores are not computed until you call the `show_complements` or `show_substitutes` method. This might take some time if the number of unique products found in the baskets during fitting is large.
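
The compute-on-first-use behaviour, and why it scales poorly with the number of products, can be illustrated with a generic caching pattern. This is a hypothetical sketch, not the library's code: it assumes scores are needed for every ordered pair of products and uses a placeholder value instead of a real score.

```python
from functools import cached_property


class LazyScores:
    """Hypothetical illustration of compute-on-first-use; not the library's class."""

    def __init__(self, products):
        self.products = list(products)
        self.n_computations = 0  # counts how often the expensive step runs

    @cached_property
    def pair_scores(self):
        # Placeholder for an expensive computation over all ordered pairs.
        self.n_computations += 1
        return {
            (a, b): 0.0
            for a in self.products
            for b in self.products
            if a != b
        }


lazy = LazyScores(str(i) for i in range(1000))
print(lazy.n_computations)    # 0: nothing computed at construction
print(len(lazy.pair_scores))  # 999000 ordered pairs for 1000 products
_ = lazy.pair_scores
print(lazy.n_computations)    # 1: the second access reused the cached result
```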

In [5]:
prod2vec.show_substitutes(
    product="0",  # focal product label
    topn=10,  # top N substitutes
    penalize=True,  # penalization flag, setting to True is highly recommended
    # guess=1,  # tweak between -10 and 10 if you get OptimizationWarning
)

[('786', -0.6277325),
 ('832', -0.6468932),
 ('760', -0.6767897),
 ('991', -0.68854976),
 ('83', -0.6892245),
 ('815', -0.7061493),
 ('978', -0.725211),
 ('673', -0.7273524),
 ('825', -0.73150086),
 ('661', -0.7356517)]

In [6]:
prod2vec.show_complements(
    product="0",  # focal product label
    topn=10,  # top N complements
)

[('867', 0.26756227),
 ('673', 0.26403713),
 ('835', 0.2512554),
 ('165', 0.24923748),
 ('871', 0.24331644),
 ('978', 0.23695599),
 ('800', 0.23505902),
 ('716', 0.23462863),
 ('825', 0.23426346),
 ('403', 0.23408276)]

### Model persistence

You can save and load the model with pickle.

In [7]:
with open('fitted_model.pkl', 'wb') as file:
    pickle.dump(prod2vec, file)

In [8]:
with open('fitted_model.pkl', 'rb') as file:
    pickled_model = pickle.load(file)

In [9]:
pickled_model.show_complements('0')

[('867', 0.26756227),
 ('673', 0.26403713),
 ('835', 0.2512554),
 ('165', 0.24923748),
 ('871', 0.24331644)]