Gensim 4.x and sklearn
- 1. Introduction
- 2. General information
- 3. Methods Description
- 4. GridSearch and BayesSearch Usage
- 5. BayesSearch Example
1. Introduction

Prod2Vec (also called Item2Vec) produces embeddings for items in a latent space. The method is capable of inferring item-item relations even when user information is not available. It is based on the Word2Vec NLP model.

This project provides a class that encapsulates an Item2Vec model (a gensim Word2Vec model) as a sklearn estimator.
It allows simple and efficient use of the Item2Vec model by providing:
- a metric to measure the performance of the model (Precision@K)
- compatibility with GridSearchCV and BayesSearchCV to find the optimal hyperparameters

!! Warning: the sklearn estimator template is not fully respected, since X does not have the usual (n_samples, n_features) shape.
2. General information

2.1 Input data format
X: a list of lists of strings. Each string is an item; each inner list is the sequence of products purchased by a customer or within a session.
X = [['prod_1', ..., 'prod_n'], ..., ['prod_1', ..., 'prod_n']]
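For example, such sequences can be built from a raw transaction log by grouping purchases per customer. A minimal sketch, assuming a hypothetical pandas DataFrame with customer_id, timestamp and product_id columns:

import pandas as pd

# Hypothetical transaction log; the column names are assumptions.
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'timestamp':   [1, 2, 3, 1, 2],
    'product_id':  ['prod_1', 'prod_7', 'prod_3', 'prod_2', 'prod_7'],
})

# One chronologically ordered item sequence per customer.
X = (df.sort_values('timestamp')
       .groupby('customer_id')['product_id']
       .apply(list)
       .tolist())
# X == [['prod_1', 'prod_7', 'prod_3'], ['prod_2', 'prod_7']]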
2.2 Train/Test split
The train/test split is managed within the class. It is not necessary to split the data into train and test before fitting the model.
2.3 Pipeline performance measurement
- Train on a subset of X
- Randomly sample ((n-1)-th, n-th) item pairs, disjoint from the training set
- Evaluate performance on the NEP (Next Event Prediction) task, i.e. find the top 10 items most similar to the (n-1)-th item and check whether the n-th item is in this top 10 (see the sketch below)
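This Precision@K check can be sketched on a plain gensim model; Item2VecWrapped performs an equivalent evaluation internally (a sketch, not the exact implementation):

from gensim.models import Word2Vec

def precision_at_k(model, pairs, k=10):
    # Share of (query, target) pairs whose target appears among the
    # k items most similar to the query.
    hits = 0
    for query, target in pairs:
        top_k = [item for item, _ in model.wv.most_similar(query, topn=k)]
        hits += target in top_k
    return 100 * hits / len(pairs)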
3. Methods Description

3.1 Instantiation and init parameters
Instantiation:
Item2VecWrapped(alpha=0.025, cbow_mean=1, epochs=5, hs=0, min_alpha=0.0001, min_count=1, negative=5, ns_exponent=-0.5,
                sample=0.001, seed=1, sg=0, vector_size=100, window=3, shrink_windows=True, topK=10, split_strategy="timeseries")
Word2Vec parameters (forwarded to Gensim 4.x), with the wrapper's defaults:
alpha=0.025, cbow_mean=1, epochs=5, hs=0, min_alpha=0.0001, min_count=1, negative=5,
ns_exponent=-0.5, sample=0.001, seed=1, sg=0, vector_size=100, window=3, shrink_windows=True
Added parameters:
topK=int, split_strategy=string
topK: number of most similar items returned for a given item (10 by default)
split_strategy: "timeseries" or "train_test_split"
"timeseries": Training set -> (item_1, ..., item_N-1) of each sequence
              Test set -> the (item_N-1, item_N) pairs
"train_test_split": Training set, Test set = train_test_split(X, test_size=0.05, random_state=42)
                    The (item_N-1, item_N) pairs are then built from the test set
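For example, to train a skip-gram model with a chronological split (the parameter values here are purely illustrative):

my_model = Item2VecWrapped(sg=1, window=3, vector_size=128,
                           topK=10, split_strategy="timeseries")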
3.2 Fit method
fit(X)
- Builds X_train (according to the chosen split strategy)
- Trains the gensim Word2Vec model on X_train
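With the "timeseries" strategy, this roughly amounts to the following (a sketch of the internal logic, assuming every sequence holds at least two items; not the exact implementation):

from gensim.models import Word2Vec

# Train on all but the last item of each sequence; keep the last two
# items of each sequence as a held-out (query, target) test pair.
X_train = [seq[:-1] for seq in X]
test_pairs = [(seq[-2], seq[-1]) for seq in X]

model = Word2Vec(sentences=X_train, vector_size=100, window=3,
                 min_count=1, sg=0, epochs=5)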
3.3 Predict method
predict(X)
X is an item (a word) or a list of items. Predicts the topK most similar items using cosine similarity.
Return: a list of lists of the topK predictions, given as vocabulary indices.
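Typical usage (item names are illustrative):

my_model = Item2VecWrapped()
my_model.fit(X)

# topK most similar items (as vocabulary indices) for each query item.
predictions = my_model.predict(['prod_1', 'prod_7'])

# Map the indices for the first query back to item names (see 3.6).
vocabulary = my_model.get_vocabulary()
similar_items = [vocabulary[idx] for idx in predictions[0]]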
3.4 Score method (not meant to be used outside the class)
score(X)
X must be the same data as the one provided to fit().
Designed for GridSearchCV and BayesSearchCV; use score_Precision_at_K(X_test, Y_test) instead.
Evaluates performance on Next Event Prediction using Precision@K.
Return: the score, as the percentage of correct predictions.
3.5 Score_Precision_at_K method
score_Precision_at_K(X_test, Y_test)
Evaluates performance on Next Event Prediction using Precision@K.
X_test: list of items (the queries)
Y_test: list of items; the ground truth, i.e. the item actually purchased right after the corresponding item in X_test
Return: the score, as the percentage of correct predictions.
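For instance, with evaluation pairs built as in section 2.3 (a sketch; held_out_sequences is a hypothetical list of item sequences of length >= 2):

# Build ((n-1)-th, n-th) evaluation pairs from held-out sequences.
X_test = [seq[-2] for seq in held_out_sequences]
Y_test = [seq[-1] for seq in held_out_sequences]

precision = my_model.score_Precision_at_K(X_test, Y_test)
print(f"Precision@10: {precision:.1f}%")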
3.6 Get_vocabulary method
get_vocabulary()
Return: the list of vocabulary items learned during training.
Item2VecWrapped().fit(X).get_vocabulary()[idx] will return the word at index idx.
3.7 Get_index_word method
get_index_word(word)
Return: the index of the given word in the vocabulary.
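The two methods are inverse mappings of each other, e.g.:

vocabulary = my_model.get_vocabulary()
idx = my_model.get_index_word('prod_1')
assert vocabulary[idx] == 'prod_1'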
4. GridSearch and BayesSearch Usage

Model instantiation:
my_model = Item2VecWrapped()
Hyperparameter grid definition:
parameters = {'ns_exponent': [1, 0.5, -0.5, -1], 'alpha': [0.1, 0.3, 0.6, 0.9]}
Define the train and test indices for the split. !! Train and test indices must be identical !! The split is managed internally:
train_indices = [i for i in range(len(X))]
test_indices = [i for i in range(len(X))]
cv = [(train_indices, test_indices)]
Instantiate GridSearchCV:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(my_model, parameters, cv=cv)
Fit the model and get the best parameters and the best score:
clf.fit(X)
clf.best_params_
clf.best_score_
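Since GridSearchCV refits the best configuration on the data by default (refit=True), the tuned model can then be used directly:

best_model = clf.best_estimator_
predictions = best_model.predict(['prod_1'])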
5. BayesSearch Example

!pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer, Real

# Search space for the Word2Vec hyperparameters.
search_space = list()
search_space.append(Integer(3, 100, name='epochs', prior='log-uniform', base=2))
search_space.append(Integer(10, 500, name='vector_size', prior='log-uniform', base=2))
search_space.append(Real(0.01, 1, name='alpha', prior='uniform'))
search_space.append(Real(-1, 1, name='ns_exponent', prior='uniform'))
search_space.append(Integer(5, 50, name='negative', prior='uniform'))
search_space.append(Categorical([0, 1], name='sg'))
search_space.append(Real(0.00001, 0.01, name='sample', prior='uniform'))
search_space.append(Categorical([0, 1], name='cbow_mean'))
search_space.append(Integer(1, 3, name='window', prior='uniform'))  # mean basket length is 1.54
search_space.append(Categorical([True, False], name='shrink_windows'))
params = {dim.name: dim for dim in search_space}

# Same indices for train and test: the split is managed internally.
train_indices = [i for i in range(len(X))]  # indices for training
test_indices = [i for i in range(len(X))]  # indices for testing
cv = [(train_indices, test_indices)]

clf = BayesSearchCV(estimator=Item2VecWrapped(), search_spaces=params, n_jobs=-1, cv=cv)
clf.fit(X)
clf.best_params_
clf.best_score_
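By default, BayesSearchCV evaluates n_iter=50 parameter settings; this budget can be lowered for a faster search, and the refit best model is again available afterwards, e.g.:

clf = BayesSearchCV(estimator=Item2VecWrapped(), search_spaces=params,
                    n_iter=25, n_jobs=-1, cv=cv)
clf.fit(X)
best_model = clf.best_estimator_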