### Introduction
I'll build a recommender system using Item2Vec ([1]), a modified version of Word2Vec. Item2Vec is a collaborative filtering algorithm that applies Word2Vec's Skip-Gram architecture to a non-natural-language dataset.

Word2Vec is a fully connected neural network with a single hidden layer that learns vector embeddings of words in a latent space. These embeddings preserve word semantics based on their context (defined by a window size): words that appear in similar contexts tend to have similar vector representations.
Word2Vec comes with 2 architecture options:
1. Continuous bag of words (CBOW): tries to predict a word based on its context.
2. Skip Gram (SG): tries to predict the context of a word that is given as an input. 

Since Item2Vec uses Skip-Gram, I'll present only that architecture.

The image below ([4]) shows the neural network architecture.
![skip_gram_image_placeholder](./../images/skip_gram_net_arch.png)

The input layer has the size of the vocabulary (10000 in the example above) and receives the one-hot vector representation of the word used for prediction.

The hidden layer size is equal to the dimension of the learned vector embeddings (300 in the example above) and is a hyperparameter. The weight matrix will have 300 columns (the number of features) and 10000 rows (one for each word in the vocabulary).

The output layer produces a probability distribution: for each word in the vocabulary, it gives the probability of that word being part of the input word's context.
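
The three layers described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes (6 words, 3 dimensions), the random weights, and the names `skip_gram_forward`, `w_in` and `w_out` are hypothetical, and the full softmax shown here is the unoptimized variant, not what gensim runs internally.

```python
import math
import random


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skip_gram_forward(word_index, w_in, w_out):
    # Multiplying a one-hot vector by w_in simply selects one row:
    # that row is the embedding of the input word.
    hidden = w_in[word_index]
    # One score per vocabulary word: dot product of the hidden vector
    # with the corresponding column of w_out.
    vocab_size = len(w_out[0])
    scores = [sum(h * w_out[d][j] for d, h in enumerate(hidden))
              for j in range(vocab_size)]
    return hidden, softmax(scores)


# Hypothetical toy sizes instead of 10000 words x 300 features.
rng = random.Random(0)
vocab_size, embedding_dim = 6, 3
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for _ in range(vocab_size)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(embedding_dim)]

hidden, probs = skip_gram_forward(2, w_in, w_out)
print(len(hidden), len(probs), round(sum(probs), 6))
```

After training, the rows of `w_in` are the embeddings that are kept; the output weights are only needed during training.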

There are some additional techniques that the authors of Word2Vec proposed in their second paper ([5]) in order to optimize the training of the neural network.

**Subsampling**\
It reduces the number of training examples by randomly discarding occurrences of the most frequent words. These are usually stop words that carry little semantic meaning, since they appear in all sorts of contexts. The probability of a word being discarded is given by the following formula:

![subsampling_image_placeholder](./../images/subsampling.png)

where f(w) is the frequency of the given word in the corpus and t is a threshold hyperparameter, usually chosen around 10<sup>-5</sup>.
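
As a rough numeric illustration, the sketch below assumes the formula is the one published in [5], P(w) = 1 - sqrt(t / f(w)), with f(w) taken as a relative frequency. The function name `discard_probability` and the example frequencies are hypothetical, and library implementations may use a slightly different variant.

```python
import math


def discard_probability(relative_frequency, threshold=1e-5):
    # Assumed formula from [5]: P(discard) = 1 - sqrt(t / f(w)).
    # Words rarer than the threshold would get a negative value,
    # so the result is clipped at 0 (they are never discarded).
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


# Hypothetical relative frequencies: a very common word, a word sitting
# exactly at the threshold, and a rare word.
for frequency in (1e-3, 1e-5, 1e-6):
    print(frequency, round(discard_probability(frequency), 3))
```

With t = 10<sup>-5</sup>, a word that makes up 0.1% of the corpus is dropped about 90% of the time, while words at or below the threshold are always kept.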

**Negative sampling**\
Training this network takes a long time, since every step updates a weight matrix which, for a vocabulary of 10000 words and a hidden layer of 300 neurons, has **3 million** weights. That's why a technique called negative sampling was introduced. With this optimization, each training example modifies only a very small sample of the weights:
- the weights of the correct output neuron, so that it yields a high probability of being part of the input word's context
- the weights of a few words selected from the vocabulary and treated as negative examples, so that their probability of being part of the input word's context decreases. According to [5], 5-20 negative samples are useful for small training datasets, whereas 2-5 samples are sufficient for large ones.

The probability of a word being picked as a negative example is given by the following formula:

![negative_sampling_image_placeholder](./../images/negative_sampling.png)

where f(w) is the frequency of the given word in the corpus. Therefore, the most frequent words have a higher probability of being picked as negative examples.
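
The sketch below illustrates one way such a sampling distribution can be built. It assumes the formula is the one from [5], where counts are raised to the power 3/4 before normalizing; the function names and the toy counts are hypothetical.

```python
import random


def negative_sampling_distribution(counts, power=0.75):
    # Assumed formula from [5]: P(w) = f(w)^0.75 / sum over v of f(v)^0.75.
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def draw_negatives(distribution, k, exclude, rng):
    # Draw k negative examples, skipping the words in `exclude`
    # (for instance the input word and its true context word).
    candidates = [word for word in distribution if word not in exclude]
    weights = [distribution[word] for word in candidates]
    return rng.choices(candidates, weights=weights, k=k)


# Hypothetical counts: one very frequent item and two rarer ones.
counts = {"a": 16, "b": 1, "c": 1}
distribution = negative_sampling_distribution(counts)
negatives = draw_negatives(distribution, k=5, exclude={"b"}, rng=random.Random(0))
print(distribution)
print(negatives)
```

Frequent words are still favoured, but the 3/4 exponent flattens the distribution: here `a` gets 80% of the probability mass instead of its raw share of about 89%.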


**Item2Vec** applies Word2Vec's Skip-Gram with negative sampling architecture to sequences of objects from non-linguistic datasets. It can be applied to lists of movie reviews, shopping cart items, lists of liked songs, etc. in order to build a recommender system based on the similarities between the items a user enjoys. Another characteristic of Item2Vec is that the window size is considered infinite, because all the items in a given sequence of objects are considered part of the same context.
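
To make the "infinite window" idea concrete, here is a minimal sketch of the (target, context) pairs it implies: every item is paired with every other item of the same sequence. The helper name `skip_gram_pairs` and the item ids are hypothetical; real implementations generate these pairs on the fly rather than materializing them.

```python
from itertools import permutations


def skip_gram_pairs(sequence):
    # With an unbounded window, the context of an item is every other
    # item in the same sequence, regardless of position.
    return list(permutations(sequence, 2))


pairs = skip_gram_pairs(["m1", "m2", "m3"])
print(pairs)
```

A sequence of n items therefore yields n * (n - 1) ordered pairs, which is why long user histories dominate the training corpus.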

In this script I'll train this model on MovieLens 1M ([6]), a dataset of 1 million movie ratings received from ~6000 users on a total of 3883 movies.

The data has been explored and cleaned in separate scripts of this project. See the `data/initial_exploration` notebook and `data/make_dataset.py`.

In [1]:
import pandas as pd
import numpy as np
import logging

from random import shuffle
from sklearn.model_selection import KFold
from gensim.models import Word2Vec
logging.basicConfig(level=logging.INFO)



Both training and testing entries are grouped by userId. Doing so, I can extract the list of movies that were rated by a specific user. After that, I split each list of movies into 'liked' and 'disliked' sublists. The first one contains all the movies that the user rated with >= 3 stars, whereas the second contains all the movies that got a rating of 1 or 2 stars. These lists of movie ids serve as the so-called 'sentences' that will later be used to train the **Word2Vec** model. The movie ids represent the 'words' in our context.
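
The grouping described above can be illustrated on a handful of made-up ratings. This is a dependency-free sketch, not the notebook's pandas implementation; the function name `build_sentences` and the `(user_id, movie_id, rating)` tuples are hypothetical.

```python
from collections import defaultdict


def build_sentences(ratings, like_threshold=3):
    # One "liked" and one "disliked" sentence per user; movie ids are
    # stored as strings because they play the role of words.
    liked, disliked = defaultdict(list), defaultdict(list)
    for user_id, movie_id, rating in ratings:
        target = liked if rating >= like_threshold else disliked
        target[user_id].append(str(movie_id))
    return list(liked.values()) + list(disliked.values())


# Hypothetical (user_id, movie_id, rating) rows.
toy_ratings = [(1, 10, 5), (1, 20, 3), (1, 30, 1), (2, 10, 2), (2, 40, 4)]
sentences = build_sentences(toy_ratings)
print(sentences)  # [['10', '20'], ['40'], ['30'], ['10']]
```

Note that movie `10` ends up in one user's liked sentence and in another user's disliked sentence, so the same 'word' can appear in both kinds of context.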

The **Word2Vec** model that I'm going to train will try to learn embeddings of the movie ids in a latent space so that similar movies (movies that appear together in the liked/disliked sublists) have close vector representations. The distance between the learned embeddings of movies that usually belong to the same sentences should be small, whereas the distance between movies that belong to different lists (one in a user's liked list and the other in the same user's disliked list) should be large.

Word2Vec hyperparameters were picked taking into account various values used in literature and based on personal experiments.

| Parameter name | Description | Value | Comments on the chosen values |
| --- | ----------- | ------- | ----- |
| size | Size of hidden layer (dimension of learned vectors) | 64 | Multiple of 4 is chosen for greater performance (according to gensim documentation). Increasing the embedding size didn't have a major impact on the F1 score. |
| window | The size of window defining the context of the word | 999 | A big value, so all the movies in a user's liked or disliked list define the context used in learning |
| min_count | Minimum number of appearances for a word to be kept in the vocabulary | 5 | A movie must appear at least 5 times in the whole corpus in order to be kept in the vocabulary |
| iter | Number of epochs | 30 | Tried: [5, 10, 15, 20, 25, **30**, 35]. Highest score reached at 30 |
| sg | 1 for Skip-Gram strategy, 0 for CBOW | 1 | Skip-Gram technique works best for Item2Vec. According to Mikolov, Skip-Gram is a good solution for small datasets and is able to create a good representation for rare words. |
| hs | 1 for hierarchical softmax; 0 (together with negative > 0) for negative sampling | 0 | Chose 0 because applying negative sampling gives better results |
| sample | Threshold that controls how aggressively the most frequent words are downsampled | 5e-4 | Tried [1e-4, **5e-4**, 1e-5]. |
| negative | how many random words are marked as negative examples | 30 | Tried: [5, 10, 20, **30**, 35]. Because the dataset is small, a higher value was needed.  |

After training the model, I evaluate it in the following manner:

For each user in the test set, I extract the user's list of liked movies (from the test data) and try to predict them. There are several ways of making these predictions: you can use only the list of movies that the user liked in the training set, or use both the liked and disliked lists. I tried both methods, and the one that also uses the disliked lists brings a small improvement in the overall F1 score.
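
To give some intuition for why the disliked list can help, here is a minimal sketch of a multiplicative similarity score in the spirit of gensim's `most_similar_cosmul` (3CosMul): cosine similarities are shifted to [0, 1], multiplied over the liked items, and divided by the product over the disliked items. The helper names and the 2-dimensional vectors are hypothetical, and the real method scores the whole vocabulary at once on normalized vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def product(values):
    result = 1.0
    for value in values:
        result *= value
    return result


def cosmul_score(candidate, positives, negatives, eps=1e-6):
    def shifted(other):
        # Map cosine similarity from [-1, 1] to [0, 1].
        return (1.0 + cosine(candidate, other)) / 2.0

    numerator = product(shifted(p) for p in positives)
    # eps avoids division by zero; with no negatives the product is 1.
    denominator = product(shifted(n) for n in negatives) + eps
    return numerator / denominator


# Hypothetical embeddings: one liked movie, one disliked movie, and two
# candidates (one close to the liked movie, one halfway between both).
liked, disliked = [1.0, 0.0], [0.0, 1.0]
close_to_liked, in_between = [0.9, 0.1], [0.6, 0.6]

for name, candidate in (("close_to_liked", close_to_liked), ("in_between", in_between)):
    print(name,
          round(cosmul_score(candidate, [liked], []), 3),
          round(cosmul_score(candidate, [liked], [disliked]), 3))
```

In this toy setup both candidates score well when only the liked movie is used, but adding the disliked movie widens the gap in favour of the candidate that is far from it.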

The number of predictions also impacts the overall results. I made 3 sets of predictions: for each user, I predicted a list of movies of length test_liked_list_length * K, where K = 1, 2, 3.

Three evaluation metrics were used:
- Precision: the ratio between the number of correctly predicted movies and the total number of predictions; it decreases as the number of predictions increases, i.e. as K increases
- Recall: the ratio between the number of correctly predicted movies and the total number of liked movies in the test set; it increases as the number of predictions increases
- F1-score: the harmonic mean of precision and recall; the best results were achieved for K = 2
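
The relationship between K and the three metrics can be checked with a small worked example. The hit counts below are made up for illustration (they are not results from this notebook), and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(correct, predicted, liked):
    # precision = correct / number of predictions
    # recall    = correct / number of liked movies in the test set
    precision = correct / predicted if predicted else 0.0
    recall = correct / liked if liked else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical user with 100 liked movies in the test set; the number of
# predictions is K * 100 and the hit counts are invented.
liked_in_test = 100
hypothetical_hits = {1: 18, 2: 29, 3: 37}
results = {k: precision_recall_f1(hits, k * liked_in_test, liked_in_test)
           for k, hits in hypothetical_hits.items()}

for k, (precision, recall, f1) in results.items():
    print("K={}: precision={:.2%} recall={:.2%} f1={:.2%}".format(k, precision, recall, f1))
```

For K = 1 the number of predictions equals the number of liked test movies, so precision, recall and F1 coincide, which is the pattern visible in the K = 1 columns of the results.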


5-fold cross-validation was used to compute the mean score of the model. The following scores were obtained:
<table>
    <thead>
        <tr>
          <th></th>
            <th colspan="3">Using only positive examples</th>
            <th colspan="3">Using both positive and negative examples</th>
        </tr>
    </thead>
    <tbody>
        <tr>
          	<td></td>
            <td> K = 1</td>
            <td> K = 2</td>
            <td> K = 3</td>
            <td> K = 1</td>
            <td> K = 2</td>
            <td> K = 3</td>
        </tr>
      <tr>
        <td>Precision</td>
        <td>17.65%</td>
        <td>14.25%</td>
        <td>12.13%</td>
          <td><b>18.39%</b></td>
        <td>14.70%</td>
        <td>12.43%</td>
      </tr>
      <tr>
        <td>Recall</td>
        <td>17.65%</td>
        <td>28.50%</td>
        <td>36.40%</td>
        <td>18.39%</td>
        <td>29.40%</td>
          <td><b>37.30%</b></td>
      </tr>
      <tr>
        <td>F1</td>
        <td>17.65%</td>
        <td>19.00%</td>
        <td>18.20%</td>
        <td>18.39%</td>
          <td><b>19.60%</b></td>
        <td>18.65%</td>
      </tr>
    </tbody>
</table>


In [2]:
class Movies2Vec:
    def __init__(self):
        self.ratings = pd.read_csv('data_1m/ratings.csv', header=0)
        self.ratings['movieId'] = self.ratings['movieId'].astype(str)

    def create_and_evaluate(self):
        self.only_pos_scores = []
        self.pos_neg_scores = []
        self.k_values = [1, 2, 3]
        for i in range(len(self.k_values)):
            self.only_pos_scores.append({'precision':[], 'recall':[], 'f1':[]})
            self.pos_neg_scores.append({'precision':[], 'recall':[], 'f1':[]})

        # Use 5-Fold Cross-validation for model evaluation
        kf = KFold(n_splits=5, shuffle=True)
        # Fold counter (named `fold` to avoid shadowing the built-in `iter`)
        fold = 1
        for train, test in kf.split(self.ratings):
            self.train_data = self.ratings.iloc[train]
            self.test_data = self.ratings.iloc[test]
            
            # Extract so-called "sentences" from the plain list of reviews by grouping them by userId and separating them into liked and disliked
            # All movies liked by a particular user form one sentence. The same applies to the list of movies disliked by a user
            self.train_sentences = self.get_sentences(self.train_data)

            self.create_model(fold)
            self.evaluate_model(fold)

            fold += 1
            
        # For an overall evaluation, the mean score of the 5 Folds is computed for each of the 3 evaluation metrics.    
        for i in range(len(self.k_values)):
            print("Mean scores (using only positive examples at testing) at K@{}".format(self.k_values[i]))
            print("Precision: {:.2%}".format(np.mean(self.only_pos_scores[i]['precision'])))
            print("Recall: {:.2%}".format(np.mean(self.only_pos_scores[i]['recall'])))
            print("F1: {:.2%}\n".format(np.mean(self.only_pos_scores[i]['f1'])))
            
            print("Mean scores (using both positive and negative examples at testing) at K@{}".format(self.k_values[i]))
            print("Precision: {:.2%}".format(np.mean(self.pos_neg_scores[i]['precision'])))
            print("Recall: {:.2%}".format(np.mean(self.pos_neg_scores[i]['recall'])))
            print("F1: {:.2%}\n".format(np.mean(self.pos_neg_scores[i]['f1'])))

    def get_sentences(self, ratings):
        # Mark as liked all ratings >= 3. 
        positive_ratings = ratings[ratings['rating'] >= 3]
        negative_ratings = ratings[ratings['rating'] < 3]
        
        # Group ratings by user id in order to extract the list of movies the user liked or disliked. 
        # These lists will represent the so-called "sentences" used in training the Word2Vec model
        positive_ratings_by_user = positive_ratings.groupby(['userId'])
        negative_ratings_by_user = negative_ratings.groupby(['userId'])

        positive_sentences = [positive_ratings_by_user.get_group(positive_group)['movieId'].tolist() 
                 for positive_group in positive_ratings_by_user.groups]
        negative_sentences = [negative_ratings_by_user.get_group(negative_group)['movieId'].tolist() 
                 for negative_group in negative_ratings_by_user.groups]
        
        return (positive_sentences + negative_sentences)

    def create_model(self, iter):
        self.model = Word2Vec(self.train_sentences,
                              size=64,
                              window=999,
                              min_count=5,
                              workers=8,
                              iter=30,
                              sg=1,
                              hs=0,
                              sample=5e-4,
                              negative=30)

        # Save the model for each try
        model_name = "model_" + str(iter)
        self.model.save(model_name)

    def evaluate_model(self, iter):
        # In order to evaluate the model I need the lists of liked(1) and disliked(2) movies grouped by userId used in training 
        # and the list of liked movies from the test data(3). 
        # The model will try to predict the movies in the last set of lists (3) based on the first 2 sets of lists(1,2).
        liked_train = self.get_movies_list_by_user(self.train_data, is_liked=True)
        disliked_train = self.get_movies_list_by_user(self.train_data, is_liked=False)

        liked_test = self.get_movies_list_by_user(self.test_data, is_liked=True)

        self.get_scores(liked_train, disliked_train, liked_test, iter)

    def get_movies_list_by_user(self, data, is_liked):
        # Mark as liked all ratings >= 3. 
        positive_ratings = data[data['rating'] >= 3]
        negative_ratings = data[data['rating'] < 3]
        ratings = positive_ratings if is_liked else negative_ratings
        
        # Pick only the movies found in the model's vocabulary, since these are the only terms that can be used in prediction
        ratings = ratings[ratings['movieId'].isin(self.model.wv.vocab.keys())]

        movies_by_user = {k: list(v) for k, v in ratings.groupby('userId')['movieId']}

        return movies_by_user

    def get_scores(self, liked_train, disliked_train, liked_test, iter):
        # As explained before, I'll use two options when predicting
        # 1. one that only uses the user's liked movies list used in training
        # 2. another one that uses both user's liked and disliked movies list from the training set
        
        # The number of predictions will be proportional to the size of the test movies list 
        # I'll make 3 sets of predictions (based on the K values). Each prediction will recommend k * len(test_movies) where k = 1,2,3
        # The following 4 variables will be used for computing Precision, Recall and F1 score
        
        # Number of positive ratings from the test set
        total_liked = np.zeros(len(self.k_values))
        
        # Number of correctly predicted liked movies when option 1. was used for predicting
        total_correct = np.zeros(len(self.k_values))
        
        # Number of correctly predicted liked movies when option 2. was used for predicting
        total_correct_with_negative = np.zeros(len(self.k_values))
        
        # Total number of predictions
        total_no_of_predictions = np.zeros(len(self.k_values))
        
        # Select only the users who offered ratings in the training set
        common_users = set(liked_test.keys()).intersection(set(liked_train.keys()))

        # Iterate through these users and try to predict their unseen liked movies
        for user_id in common_users:
            # List the model will try to predict
            test_movies = liked_test[user_id]

            disliked_movies = []
            if user_id in disliked_train:
                disliked_movies = disliked_train[user_id]

            for i in range(len(self.k_values)):
                topn = self.k_values[i] * len(test_movies)
                # Option 1.
                predictions_with_pos = self.model.wv.most_similar_cosmul(positive=liked_train[user_id],
                                                                         topn=topn)
                # Option 2.
                predictions_with_pos_neg = self.model.wv.most_similar_cosmul(positive=liked_train[user_id],
                                                                              negative=disliked_movies,
                                                                              topn=topn)

                for predicted_movie, score in predictions_with_pos_neg:
                    if predicted_movie in test_movies:
                        total_correct_with_negative[i] += 1.0

                for predicted_movie, score in predictions_with_pos:
                    if predicted_movie in test_movies:
                        total_correct[i] += 1.0
                        
                total_liked[i] += len(test_movies)
                total_no_of_predictions[i] += topn

        
        # For each K value and prediction option compute the evaluation metrics:
        # Precision: ratio between the number of correctly made predictions and the total number of predictions. 
        #            It will decrease as K increases, since with higher values of K, more predictions are being made
        # Recall: ratio between the number of correctly made predictions and the actual number of liked movies in the test dataset
        #         It will increase as K increases, since the probability of predicting the correct movies increases when the number of predictions increases
        # F1: harmonic mean between the two
        for i in range(len(self.k_values)):
            self.only_pos_scores[i]['precision'].append(total_correct[i] / total_no_of_predictions[i])
            self.only_pos_scores[i]['recall'].append(total_correct[i] / total_liked[i])
            self.only_pos_scores[i]['f1'].append(2/((total_no_of_predictions[i] / total_correct[i]) + (total_liked[i] / total_correct[i])))
            
            self.pos_neg_scores[i]['precision'].append(total_correct_with_negative[i] / total_no_of_predictions[i])
            self.pos_neg_scores[i]['recall'].append(total_correct_with_negative[i] / total_liked[i])
            self.pos_neg_scores[i]['f1'].append(2/((total_no_of_predictions[i] / total_correct_with_negative[i]) + (total_liked[i] / total_correct_with_negative[i])))
            
            print("Scores (using only positive examples at prediction) at Fold {}, K @{}"
                  .format(iter, self.k_values[i]))
            print("Precision: {:.2%}; Recall: {:.2%}, F1: {:.2%}\n"
                  .format(self.only_pos_scores[i]['precision'][-1], self.only_pos_scores[i]['recall'][-1],
                         self.only_pos_scores[i]['f1'][-1]))
            
            print("Scores (using both positive and negative examples at prediction) at Fold {}, K @{}"
                  .format(iter, self.k_values[i]))
            print("Precision: {:.2%}; Recall: {:.2%}, F1: {:.2%}\n"
                  .format(self.pos_neg_scores[i]['precision'][-1], self.pos_neg_scores[i]['recall'][-1],
                         self.pos_neg_scores[i]['f1'][-1]))


model = Movies2Vec()
model.create_and_evaluate()

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 762735 words, keeping 3666 word types
INFO:gensim.models.word2vec:collected 3675 word types from a corpus of 800167 raw words and 11735 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3369 unique words (91% of original 3675, drops 306)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 799525 word corpus (99% of original 800167, drops 642)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3675 items
INFO:gensim.models.word2vec:sample=0.0005 downsamples 107 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 768768 word corpus (96.2% of prior 799525)
INFO:gensim.models.base_any2vec:estimated required memory for 3369 words and 64 dimensions: 340

Scores (using only positive examples at prediction) at Fold 1, K @1
Precision: 17.69%; Recall: 17.69%, F1: 17.69%

Scores (using both positive and negative examples at prediction) at Fold 1, K @1
Precision: 18.33%; Recall: 18.33%, F1: 18.33%

Scores (using only positive examples at prediction) at Fold 1, K @2
Precision: 14.22%; Recall: 28.45%, F1: 18.96%

Scores (using both positive and negative examples at prediction) at Fold 1, K @2
Precision: 14.69%; Recall: 29.37%, F1: 19.58%

Scores (using only positive examples at prediction) at Fold 1, K @3
Precision: 12.14%; Recall: 36.41%, F1: 18.21%

Scores (using both positive and negative examples at prediction) at Fold 1, K @3
Precision: 12.43%; Recall: 37.29%, F1: 18.64%



INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 762839 words, keeping 3668 word types
INFO:gensim.models.word2vec:collected 3682 word types from a corpus of 800167 raw words and 11723 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3369 unique words (91% of original 3682, drops 313)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 799515 word corpus (99% of original 800167, drops 652)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3682 items
INFO:gensim.models.word2vec:sample=0.0005 downsamples 109 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 768700 word corpus (96.1% of prior 799515)
INFO:gensim.models.base_any2vec:estimated required memory for 3369 words and 64 dimensions: 340

Scores (using only positive examples at prediction) at Fold 2, K @1
Precision: 17.75%; Recall: 17.75%, F1: 17.75%

Scores (using both positive and negative examples at prediction) at Fold 2, K @1
Precision: 18.42%; Recall: 18.42%, F1: 18.42%

Scores (using only positive examples at prediction) at Fold 2, K @2
Precision: 14.29%; Recall: 28.58%, F1: 19.05%

Scores (using both positive and negative examples at prediction) at Fold 2, K @2
Precision: 14.74%; Recall: 29.48%, F1: 19.65%

Scores (using only positive examples at prediction) at Fold 2, K @3
Precision: 12.22%; Recall: 36.67%, F1: 18.33%

Scores (using both positive and negative examples at prediction) at Fold 2, K @3
Precision: 12.49%; Recall: 37.47%, F1: 18.73%



INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 763011 words, keeping 3677 word types
INFO:gensim.models.word2vec:collected 3690 word types from a corpus of 800167 raw words and 11728 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3369 unique words (91% of original 3690, drops 321)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 799514 word corpus (99% of original 800167, drops 653)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3690 items
INFO:gensim.models.word2vec:sample=0.0005 downsamples 108 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 768835 word corpus (96.2% of prior 799514)
INFO:gensim.models.base_any2vec:estimated required memory for 3369 words and 64 dimensions: 340

Scores (using only positive examples at prediction) at Fold 3, K @1
Precision: 17.42%; Recall: 17.42%, F1: 17.42%

Scores (using both positive and negative examples at prediction) at Fold 3, K @1
Precision: 18.21%; Recall: 18.21%, F1: 18.21%

Scores (using only positive examples at prediction) at Fold 3, K @2
Precision: 14.14%; Recall: 28.29%, F1: 18.86%

Scores (using both positive and negative examples at prediction) at Fold 3, K @2
Precision: 14.55%; Recall: 29.09%, F1: 19.40%

Scores (using only positive examples at prediction) at Fold 3, K @3
Precision: 12.04%; Recall: 36.13%, F1: 18.06%

Scores (using both positive and negative examples at prediction) at Fold 3, K @3
Precision: 12.36%; Recall: 37.07%, F1: 18.54%



INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 763043 words, keeping 3671 word types
INFO:gensim.models.word2vec:collected 3679 word types from a corpus of 800167 raw words and 11725 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3369 unique words (91% of original 3679, drops 310)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 799512 word corpus (99% of original 800167, drops 655)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3679 items
INFO:gensim.models.word2vec:sample=0.0005 downsamples 107 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 768939 word corpus (96.2% of prior 799512)
INFO:gensim.models.base_any2vec:estimated required memory for 3369 words and 64 dimensions: 340

Scores (using only positive examples at prediction) at Fold 4, K @1
Precision: 17.73%; Recall: 17.73%, F1: 17.73%

Scores (using both positive and negative examples at prediction) at Fold 4, K @1
Precision: 18.62%; Recall: 18.62%, F1: 18.62%

Scores (using only positive examples at prediction) at Fold 4, K @2
Precision: 14.37%; Recall: 28.75%, F1: 19.17%

Scores (using both positive and negative examples at prediction) at Fold 4, K @2
Precision: 14.83%; Recall: 29.66%, F1: 19.77%

Scores (using only positive examples at prediction) at Fold 4, K @3
Precision: 12.20%; Recall: 36.59%, F1: 18.30%

Scores (using both positive and negative examples at prediction) at Fold 4, K @3
Precision: 12.51%; Recall: 37.53%, F1: 18.76%



INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 762953 words, keeping 3667 word types
INFO:gensim.models.word2vec:collected 3679 word types from a corpus of 800168 raw words and 11740 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 3383 unique words (91% of original 3679, drops 296)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 799571 word corpus (99% of original 800168, drops 597)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 3679 items
INFO:gensim.models.word2vec:sample=0.0005 downsamples 106 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 768998 word corpus (96.2% of prior 799571)
INFO:gensim.models.base_any2vec:estimated required memory for 3383 words and 64 dimensions: 342

Scores (using only positive examples at prediction) at Fold 5, K @1
Precision: 17.66%; Recall: 17.66%, F1: 17.66%

Scores (using both positive and negative examples at prediction) at Fold 5, K @1
Precision: 18.39%; Recall: 18.39%, F1: 18.39%

Scores (using only positive examples at prediction) at Fold 5, K @2
Precision: 14.21%; Recall: 28.42%, F1: 18.95%

Scores (using both positive and negative examples at prediction) at Fold 5, K @2
Precision: 14.69%; Recall: 29.38%, F1: 19.58%

Scores (using only positive examples at prediction) at Fold 5, K @3
Precision: 12.07%; Recall: 36.20%, F1: 18.10%

Scores (using both positive and negative examples at prediction) at Fold 5, K @3
Precision: 12.39%; Recall: 37.16%, F1: 18.58%

Mean scores (using only positive examples at testing) at K@1
Precision: 17.65%
Recall: 17.65%
F1: 17.65%

Mean scores (using both positive and negative examples at testing) at K@1
Precision: 18.39%
Recall: 18.39%
F1: 18.39%

Mean scores (using only positive examples at t

## References
\[1\] Barkan, O., & Koenigstein, N. (2016). [ITEM2VEC: Neural item embedding for collaborative filtering](https://arxiv.org/abs/1603.04259). 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 1-6.

\[2\] Hugo Caselles-Dupre, Florian Lesaint, and Jimena Royo-Letelier. [Word2vec applied to recommendation: Hyperparameters matter](https://arxiv.org/abs/1804.04212). CoRR, abs/1804.04212, 2018.

\[3\] Makbule Gulcin Ozsoy. 2016. [From word embeddings to item recommendation](https://arxiv.org/abs/1601.01356). arXiv preprint arXiv:1601.01356 (2016).

\[4\] Chris McCormick. 2016. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).

\[5\] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. [Distributed representations of words and phrases and their compositionality](https://arxiv.org/abs/1310.4546). In NIPS, pages 3111–3119.

\[6\] [Movielens 1M dataset](https://grouplens.org/datasets/movielens/1m/)