File: experiments.ipynb <br>
Author: Jan Koci <br>
Date: 05/2019 <br>

# Tutorial of performed experiments

This IPython notebook shows all experiments performed in the course of this thesis. It briefly describes the implemented models, their usage and the process of their evaluation.

First of all we will import modules that contain useful functions.

In [1]:
import data_utils, helpers

To be able to work with the recommender models, we have to first load the dataset containing user-item interactions. The dataset that will be used here is stored in the '/data/processed/' directory. To load it we can simply use the __load_df_pickle__ function from __data_utils__.

In [2]:
df = data_utils.load_df_pickle('../data/processed/dataset2.pkl')
df.shape

(264468, 4)
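The implementation of __load_df_pickle__ is not shown in this notebook. A minimal sketch, assuming it is a thin wrapper around `pandas.read_pickle`, could look as follows. The name `load_df_pickle_sketch` and the toy data are hypothetical and serve only as an illustration.

```python
import os
import tempfile

import pandas as pd


def load_df_pickle_sketch(path):
    """Hypothetical stand-in for data_utils.load_df_pickle: read a pickled DataFrame."""
    return pd.read_pickle(path)


# Round trip on a tiny frame with the same columns as the dataset (toy values).
demo = pd.DataFrame({
    "page_url": ["https://example.org/a", "https://example.org/b"],
    "page_name": ["a", "b"],
    "time": [579, 2038],
    "uid": [0, 0],
})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    demo.to_pickle(demo_path)
    loaded = load_df_pickle_sketch(demo_path)

print(loaded.shape)
```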

In [3]:
df.head()

Unnamed: 0,page_url,page_name,time,uid
0,https://developers.redhat.com/blog/2017/10/04/...,rhd|blog|2017|10|4|red-hat-adds-go-clangllvm-r...,579,0
1,https://developers.redhat.com/blog/2017/11/01/...,rhd|blog|2017|11|1|getting-started-llvm-toolset,2038,0
2,https://developers.redhat.com/blog/2018/07/07/...,rhd|blog|2018|7|7|yum-install-gcc7-clang,1650,0
3,https://developers.redhat.com/blog/2017/10/04/...,rhd|blog|2017|10|4|red-hat-updates-python-php-...,1033,1
4,https://developers.redhat.com/blog/2015/06/01/...,rhd|blog|2015|6|1|five-different-ways-handle-l...,433,2


As we can see the dataset consists of interactions that contain the following values:
- page_url: URL of the page
- page_name: string containing info about the page
- time: time in seconds the user spent on the page
- uid: user identifier

Now we will also load separate sets for training, testing and validation.

In [4]:
train_df = data_utils.load_df_pickle('../data/processed/train_df.pkl')
test_df = data_utils.load_df_pickle('../data/processed/test_df.pkl')
validation_df = data_utils.load_df_pickle('../data/processed/validation_df.pkl')

In [5]:
print("train_df = {0}\ntest_df = {1}\nvalidation_df = {2}".format(train_df.shape, test_df.shape, validation_df.shape))

train_df = (250663, 4)
test_df = (13805, 4)
validation_df = (17658, 4)


## 1. SVD recommender

The first model that will be shown is the model using __singular value decomposition__ (SVD). 

This model is implemented in file __svd_model.py__ as a class called __SVDModel__.

In [6]:
from svd_model import SVDModel

In [7]:
svd_recommender = SVDModel(df)

The whole dataset is passed during initialization, but the model uses it only to create the user and item mappings, not for training. <br>
To evaluate this model we did not use the whole __df__ dataset, because performing the SVD on it is computationally expensive. Instead we used a reduced version of the dataset that contains only users who interacted with at least 3 different pages. This dataset is called __df_3__. We also need to create new training and testing sets from the reduced dataset, for which the __train_test_df__ function from __helpers__ can be used.

In [8]:
df_3 = data_utils.load_df_pickle('../data/processed/df_3.pkl')
train_df3, test_df3 = helpers.train_test_df(df_3)

In [9]:
print("df_3 = {0}\ntrain_df3 = {1}\ntest_df3 = {2}".format(df_3.shape, train_df3.shape, test_df3.shape))

df_3 = (69353, 4)
train_df3 = (55548, 4)
test_df3 = (13805, 4)
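The reduced dataset is loaded from a prepared pickle file, so the filtering step itself is not visible here. A minimal sketch of how such a reduction could be done with pandas is given below. The function name `keep_active_users` and the toy data are hypothetical, and the sketch assumes that "different pages" means distinct __page_url__ values.

```python
import pandas as pd


def keep_active_users(df, min_pages=3):
    """Keep interactions of users who visited at least `min_pages` distinct pages."""
    pages_per_user = df.groupby("uid")["page_url"].nunique()
    active_uids = pages_per_user[pages_per_user >= min_pages].index
    return df[df["uid"].isin(active_uids)]


# Toy interactions: only user 0 visited three distinct pages.
interactions = pd.DataFrame({
    "uid": [0, 0, 0, 1, 2, 2],
    "page_url": ["a", "b", "c", "a", "a", "a"],
    "time": [10, 20, 30, 40, 50, 60],
})
reduced = keep_active_users(interactions)
print(reduced["uid"].unique())
```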


After that we are ready to train the model. We will train it using the default values, which means it will use the binary metric for computing the confidence of interactions. Bear in mind that this command will take a while to execute.

In [10]:
svd_recommender.train(train_df3)

##################### SVD model #####################
[1]  Making sparse interaction matrix
[2]  Performing SVD on interaction matrix
[DONE]  SVD successfull


When the model is trained we can simply call the __predict__ and __recommend__ functions to predict a score a user would give an item or create a list of top n recommendations for a user.

In [11]:
svd_recommender.predict(uid=19, url='https://developers.redhat.com/blog/2016/04/26/fedora-media-writer-the-fastest-way-to-create-live-usb-boot-media')

3.4861567944205734e-16

In [12]:
svd_recommender.recommend(uid=19, n=5)

[('https://developers.redhat.com/blog/2018/03/13/eclipse-vertx-first-application',
  0.02870943859519231),
 ('https://developers.redhat.com/blog/2016/10/10/business-process-management-in-a-microservices-world',
  0.023943767806097478),
 ('https://developers.redhat.com/blog/2013/01/21/welcome-to-the-red-hat-developer-blog',
  0.021294564339902657),
 ('https://developers.redhat.com/blog/2013/08/01/php-5-4-on-rhel-6-using-rhscl',
  0.02102011260995658),
 ('https://developers.redhat.com/blog/2018/08/22/reducing-data-inconsistencies-with-red-hat-process-automation-manager',
  0.02079838576723706)]
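The two steps logged during training (building a sparse interaction matrix and performing SVD on it) can be illustrated on toy data. The following is only a sketch that assumes a truncated SVD computed with `scipy.sparse.linalg.svds` and scores taken from the low-rank reconstruction; the actual __SVDModel__ may differ. All variable names, the toy interactions and the exclusion of already seen items are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy interactions as (user index, item index) pairs; binary confidence = 1.
user_idx = np.array([0, 0, 1, 1, 2, 2, 3])
item_idx = np.array([0, 1, 1, 2, 0, 3, 3])
values = np.ones(len(user_idx))

# [1] Sparse user-by-item interaction matrix.
interactions = csr_matrix((values, (user_idx, item_idx)), shape=(4, 4))

# [2] Truncated SVD keeping k latent factors (k must be smaller than min(shape)).
k = 2
U, s, Vt = svds(interactions, k=k)

# Predicted scores are the low-rank reconstruction U * diag(s) * Vt.
scores = U @ np.diag(s) @ Vt


def recommend(uid, n):
    """Top-n unseen items for one user, ranked by reconstructed score."""
    seen = set(interactions[uid].indices.tolist())
    ranked = np.argsort(-scores[uid])
    unseen = [int(i) for i in ranked if int(i) not in seen]
    return [(i, float(scores[uid, i])) for i in unseen[:n]]


print(recommend(0, 2))
```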

## 2. ALS recommender

Let us now have a look at another created model. This model is an example of a traditional collaborative filtering technique using matrix factorization. In particular it uses the __Alternating Least Squares__ (ALS) method to decompose the interaction matrix into two matrices, one containing the user factors and the other item factors.
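To make the alternating scheme concrete, here is a small dense sketch of ALS for implicit feedback. It is not the code from __als_model.py__: the function `implicit_als`, its parameters and the confidence formula `1 + alpha * log(1 + r)` are assumptions chosen for illustration.

```python
import numpy as np


def implicit_als(R, factors=2, reg=0.1, alpha=10.0, iterations=10, seed=0):
    """Dense toy ALS for implicit feedback (weighted regularized least squares).

    R holds raw interaction strengths (0 = no interaction). Preferences are
    binary and the confidence grows logarithmically with the strength.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)          # binary preference
    C = 1.0 + alpha * np.log1p(R)      # log confidence (an assumed form)
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    eye = reg * np.eye(factors)

    def solve(fixed, C_rows, P_rows):
        # For each row: (F^T C F + reg I)^-1 F^T C p, with the other side fixed.
        out = np.empty((C_rows.shape[0], factors))
        for i in range(C_rows.shape[0]):
            Cw = fixed * C_rows[i][:, None]
            out[i] = np.linalg.solve(fixed.T @ Cw + eye, Cw.T @ P_rows[i])
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)        # update user factors with item factors fixed
        Y = solve(X, C.T, P.T)    # update item factors with user factors fixed
    return X, Y


# Toy matrix of interaction strengths: 3 users, 4 items.
R = np.array([[5.0, 3.0, 0.0, 0.0],
              [4.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 6.0, 2.0]])
X, Y = implicit_als(R)
scores = X @ Y.T
print(np.round(scores, 2))
```

Each half-step is an ordinary regularized least-squares problem, which is why the two factor matrices can be updated in closed form in turn.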

To construct this model we need to import the __ImplicitALS__ class implemented in __als_model.py__.

In [13]:
from als_model import ImplicitALS

In [14]:
als_recommender = ImplicitALS(df)

We will train the model using the log confidence metric. That means we need to provide the train method with the required values of the model's hyperparameters. For that we use the __optimal_parameters.py__ file, which contains the optimal parameters of our models.

In [15]:
from optimal_parameters import als_rank_log

This time the model can be trained on the whole dataset. Therefore we will use the __train_df__.

In [17]:
als_recommender.train(train_df, **als_rank_log)

##################### ALS model #####################
[1]  Creating sparse interaction matrix


  2%|▏         | 1.5/76 [00:00<00:06, 12.35it/s]


[2]  Fitting the matrix to the model


100%|██████████| 76.0/76 [00:05<00:00, 13.51it/s]


[DONE]  ALS successfull





After the training we can once again call the __predict__ and __recommend__ methods.

In [18]:
als_recommender.recommend(uid=19, n=5)

[('https://developers.redhat.com/blog/2016/03/31/no-cost-rhel-developer-subscription-now-available',
  0.99473566),
 ('https://developers.redhat.com/blog/2018/03/19/sso-made-easy-keycloak-rhsso',
  0.8615376),
 ('https://developers.redhat.com/blog/2018/05/07/announcing-amq-streams-apache-kafka-on-openshift',
  0.8342068),
 ('https://developers.redhat.com/blog/2016/04/26/fedora-media-writer-the-fastest-way-to-create-live-usb-boot-media',
  0.78171504),
 ('https://developers.redhat.com/videos/youtube/hy9aVrTNufQ', 0.7493656)]

## 3. Doc2Vec recommender

Moving on to another model, this time an example of a content-based model. Content-based models create recommendations based on the content of their items. Our model uses the __Doc2Vec__ method to create document vectors of our articles from their __metadata__. First let's have a look at what metadata we are talking about.

In [19]:
metadata = data_utils.load_metadata_json('../data/json/metadata_07_03_blogs.json')

In [20]:
metadata['https://developers.redhat.com/blog/2018/12/07/kubernetes-application-server']

{'title': ['Kubernetes: Your Next Application Server'],
 'type': ['blogpost'],
 'tags': ['Containers',
  'devnation',
  'feed_group_name_nonmiddleware',
  'feed_name_redhat_developer_blog',
  'Java',
  'Kubernetes',
  'microservices'],
 'date': ['2018-12-07T12:56:02.000Z'],
 'description': ['Watch Burr Sutter in this week’s DevNation change how you think about application servers in today’s world of containers. Get the slides:\xa0bit.ly/kubeappserver In the Java ecosystem, we have historically been enamored with the concept of the “application server,” the runtime engine that not only gave us portable APIs such as JMS, JAX-RS, JSF, and EJB but also gave us critical runtime infrastructure...']}

The metadata contain the following information:
- __title__ of the page
- its __type__ (can be webpage, blogpost or video)
- __tags__ that were assigned to it
- the __date__ it was published
- a __description__ containing the first paragraph of its actual content

The __Doc2Vec__ recommender takes the __title__ and the __description__ of each item and concatenates them together. It then uses the concatenated text to create its document representation. 

To do so it first needs to transform the text representing an item into an object of the __TaggedDocument__ class, defined in the __gensim__ library. This object stores each item as a tuple, where the first value is a list containing the original text split into words and the second value is an integer serving as an identifier of the item. To simplify this process we created a class called __Doc2VecInput__, which takes in the dictionary containing metadata of all items and transforms it into a list of __TaggedDocument__ objects.

In [21]:
from doc2vec_class import Doc2VecInput

In [22]:
doc_input = Doc2VecInput(metadata)

In [23]:
doc_input.fit_transform()

In [24]:
doc_input.input[0]

TaggedDocument(words=['openshift', '4.0', 'developer', 'preview', 'on', 'aws', 'is', 'up', 'and', 'running', 'the', 'openshift', '4.0', 'developer', 'preview', 'is', 'available', 'for', 'amazon', 'web', 'services', '(', 'aws', ')', ',', 'and', 'if', 'you', '’', 're', 'anything', 'like', 'me', ',', 'you', 'want', 'to', 'be', 'among', 'the', 'first', 'to', 'get', 'your', 'hands', 'on', 'it', '.', 'the', 'starting', 'point', 'is', 'try.openshift.com', ',', 'where', 'you', '’', 'll', 'find', 'overview', 'information', 'and', 'that', 'important', '“', 'get', 'started', '”', 'button', '.', 'click', 'it', 'and', 'you', '’', 're', 'off', 'to', 'the', 'big', 'show', '.', '(', 'if', 'you', 'aren', '’', 't', 'a', 'red', 'hat', 'developer', 'member', ',', 'this', 'is', 'your', 'reason', 'to', 'sign'], tags=[0])
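The transformation from the metadata dictionary to tagged documents can be sketched as follows. To keep the example runnable without gensim, it uses a `namedtuple` stand-in with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`. The function `build_tagged_documents`, the whitespace tokenizer and the toy metadata are assumptions; the output above suggests the real __Doc2VecInput__ uses a more careful tokenizer.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def build_tagged_documents(metadata):
    """Turn a {url: {'title': [...], 'description': [...]}} dict into tagged documents."""
    url_2_id = {}
    documents = []
    for item_id, (url, fields) in enumerate(metadata.items()):
        # Concatenate title and description, as described in the text.
        text = " ".join(fields.get("title", []) + fields.get("description", []))
        words = text.lower().split()  # naive whitespace tokenizer (assumption)
        url_2_id[url] = item_id
        documents.append(TaggedDocument(words=words, tags=[item_id]))
    return documents, url_2_id


# Toy metadata with hypothetical URLs.
toy_metadata = {
    "https://example.org/kafka": {"title": ["Intro to Kafka"],
                                  "description": ["Streams and topics"]},
    "https://example.org/vertx": {"title": ["Eclipse Vert.x"],
                                  "description": ["Reactive apps"]},
}
documents, url_2_id = build_tagged_documents(toy_metadata)
print(documents[0])
```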

The recommender itself is implemented in the __Doc2VecModel__ class. One problem that occurs here is that we do not have the metadata of every single page from the dataset. Similarly, there are some items that are in the metadata dictionary but not in the dataset. To deal with this problem we created a class called __RecommenderDataFrame__. 

In [25]:
from data_utils import RecommenderDataFrame

We construct an object of this class and pass it the item mappings created by the __Doc2VecInput__ class. This class then filters all interactions to contain only items that occur in the __url_2_id__ mappings created from the metadata.

In [26]:
dataframe = RecommenderDataFrame(df, url_2_id=doc_input.url_2_id)
print('original = {0}\nfiltered = {1}'.format(df.shape, dataframe.df.shape))

original = (264468, 4)
filtered = (229229, 4)
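The filtering step can be illustrated with plain pandas. This is only a sketch of the idea, not the __RecommenderDataFrame__ implementation; the helper `filter_known_items` and the toy data are hypothetical.

```python
import pandas as pd


def filter_known_items(df, url_2_id):
    """Keep only interactions whose page_url has an entry in url_2_id."""
    known_urls = list(url_2_id.keys())
    return df[df["page_url"].isin(known_urls)].reset_index(drop=True)


# Toy interactions: page "b" has no metadata and is therefore dropped.
toy_df = pd.DataFrame({
    "page_url": ["a", "b", "c", "a"],
    "uid": [0, 0, 1, 2],
})
toy_url_2_id = {"a": 0, "c": 1}
filtered = filter_known_items(toy_df, toy_url_2_id)
print("original = {0}\nfiltered = {1}".format(toy_df.shape, filtered.shape))
```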


This whole process is, however, hidden inside the __Doc2VecModel__ class. Therefore we only pass it our data set and the __Doc2VecInput__ object and the class will filter the interactions itself.

In [27]:
from doc2vec_model import Doc2VecModel

In [28]:
doc2vec_recommender = Doc2VecModel(train_df, doc_input)

Now the recommender is ready to create the document vectors from our items. To do so we can choose from two possible options. We can either train the __Doc2Vec__ model locally, only on our articles, or we can use a pre-trained model, created on English Wikipedia pages. That means we can choose between two methods:
- __train__: to train the model locally
- __load_pretrained__: to use the pre-trained model

In this example we will show the load_pretrained method.

In [29]:
doc2vec_recommender.load_pretrained()

In [30]:
doc2vec_recommender.doc_vectors[0].shape

(300,)

In [31]:
doc2vec_recommender.user_vectors[0].shape

(300,)

As you can see this model creates 300-dimensional document and user vectors. The __load_pretrained__ method provides only the binary confidence metric and therefore the user vectors are computed as a sum of their item vectors. After that we can once again call the __predict__ and __recommend__ methods.

In [32]:
doc2vec_recommender.recommend(uid=19, n=5)

[('https://developers.redhat.com/blog/2018/06/04/red-hat-fuse-7-is-now-available',
  21.578247927886967),
 ('https://developers.redhat.com/blog/2018/02/27/red-hat-jboss-fuse-7-tech-preview',
  21.48103429673055),
 ('https://developers.redhat.com/videos/youtube/g5xeKuPo8Uw',
  20.90573511479384),
 ('https://developers.redhat.com/videos/vimeo/33538130', 20.391794040539455),
 ('https://developers.redhat.com/videos/youtube/a0DXIspd1Zs',
  20.333339194239716)]
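The way user vectors follow from item vectors under the binary confidence metric can be sketched with NumPy. The random vectors, the `user_items` mapping, the dot-product scoring and the exclusion of already seen items are assumptions made for illustration; the notebook does not show how __Doc2VecModel__ scores items internally.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 300))   # one 300-dimensional vector per item (toy data)
user_items = {0: [0, 2], 1: [1, 3, 4]}    # item ids each user interacted with

# Binary confidence: every visited item contributes with weight 1,
# so a user vector is simply the sum of the vectors of visited items.
user_vectors = {uid: doc_vectors[items].sum(axis=0)
                for uid, items in user_items.items()}


def recommend(uid, n):
    """Rank unseen items by the dot product with the user's vector."""
    scores = doc_vectors @ user_vectors[uid]
    seen = set(user_items[uid])
    ranked = [int(i) for i in np.argsort(-scores) if int(i) not in seen]
    return [(i, float(scores[i])) for i in ranked[:n]]


print(user_vectors[0].shape)
print(recommend(0, 3))
```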

It is also possible to use the document vectors to find the most similar items using the __nearest_items__ method.

In [33]:
nearest = doc2vec_recommender.nearest_items('https://developers.redhat.com/blog/2018/05/31/introducing-the-kafka-cdi-library', 8)
nearest

[('https://developers.redhat.com/blog/2018/05/31/introducing-the-kafka-cdi-library',
  9.623635029993466),
 ('https://developers.redhat.com/videos/youtube/QYbXDp4Vu-8',
  8.260617221657947),
 ('https://developers.redhat.com/videos/youtube/mcbdnMDERX0',
  8.13403926267652),
 ('https://developers.redhat.com/videos/vimeo/44390131', 7.807490549702573),
 ('https://developers.redhat.com/videos/youtube/F2lYSF25-Ek',
  7.744083812481144),
 ('https://developers.redhat.com/videos/youtube/a0DXIspd1Zs',
  7.694184206997353),
 ('https://developers.redhat.com/videos/youtube/6ZwL2sgKR3w',
  7.671404735998803),
 ('https://developers.redhat.com/videos/youtube/x3QCrb6zCKA',
  7.642057166797322)]
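A nearest-item lookup over document vectors can be sketched in a few lines. The example below ranks items by their dot product with the query item, which is an assumption suggested by the unnormalized scores above; cosine similarity would be a common alternative. The function signature and the toy vectors are hypothetical.

```python
import numpy as np


def nearest_items(doc_vectors, item_id, n):
    """Return the n items with the largest dot product to item `item_id`."""
    scores = doc_vectors @ doc_vectors[item_id]
    top = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in top]


# Toy 2-dimensional vectors: items 0 and 1 point in almost the same direction.
toy_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0]])
print(nearest_items(toy_vectors, 0, 2))
```

As in the output above, the query item itself appears in the result, because it is not filtered out.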

We can have a look at the titles of the nearest items. That should be more informative than their URLs.

In [34]:
metadata['https://developers.redhat.com/blog/2018/05/31/introducing-the-kafka-cdi-library']['title']

['Introducing the Kafka-CDI Library']

In [35]:
[metadata[url]['title'] for url, _ in nearest]

[['Introducing the Kafka-CDI Library'],
 ['Kafka and Debezium | DevNation Live'],
 ['Reactive systems with Eclipse Vert.x and Red Hat OpenShift'],
 ['Getting Started with EAP6 on OpenShift using JBoss Developer Studio'],
 ['An open platform to support digital transformation'],
 ['Developing cloud-ready Camel microservice'],
 ['2012 Red Hat Summit: Tuning & Benchmarking JBoss Enterprise Application Platform Session Clustering'],
 ['Kafka Streams for Event Driven Microservices | DevNation Live']]

We can compare them to the nearest items found by the __ALS__ recommender.

In [36]:
[metadata[url]['title'] for url, _ in als_recommender.nearest_items('https://developers.redhat.com/blog/2018/05/31/introducing-the-kafka-cdi-library', 8)]

[['Announcing AMQ Streams: Apache Kafka on OpenShift'],
 ['Why Kubernetes Is the New Application Server'],
 ['How to run Kafka on Openshift, the enterprise Kubernetes, with AMQ Streams'],
 ['Getting started with OpenShift Java S2I'],
 ['Deploying a Spring Boot App with MySQL on OpenShift'],
 ['Welcome Apache Kafka to the Kubernetes Era!'],
 ['Patterns for distributed transactions within a microservices architecture'],
 ['Configuring Spring Boot on Kubernetes with ConfigMap']]

## 4. SkipGram recommender

The last implemented model is inspired by the Skip-gram method with negative sampling. This method was originally used to create word vectors in the __Word2Vec__ method. We tried to apply this approach for the recommendation problem to see if it can deal with it.
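For reference, the negative-sampling objective behind Skip-gram can be written down for a single training pair. The sketch below computes the loss for one center vector, one positive vector and a few sampled negative vectors. How the __SkipGramRecommender__ maps users, items and tags onto these roles is not shown here, so the function and the toy vectors are purely illustrative.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def sgns_loss(center, positive, negatives):
    """Negative-sampling loss for one (center, positive) pair and k sampled negatives."""
    pos_term = log_sigmoid(center @ positive)                 # pull the positive pair together
    neg_term = log_sigmoid(-(negatives @ center)).sum()       # push sampled negatives apart
    return float(-(pos_term + neg_term))


center = np.array([1.0, 0.0])
aligned = sgns_loss(center, np.array([2.0, 0.0]),
                    np.array([[-2.0, 0.0], [-1.0, 0.0]]))
misaligned = sgns_loss(center, np.array([-2.0, 0.0]),
                       np.array([[2.0, 0.0], [1.0, 0.0]]))
print(aligned, misaligned)
```

The loss is small when the positive vector points in the same direction as the center vector and the negatives point away from it, and large in the opposite case.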

The model is implemented in the __SkipGramRecommender__ class. To use the model we have to prepare some data. First we need document vectors of our articles. For this we can use the vectors created with __Doc2VecModel__. We also need a dataframe object that tells the recommender which tags are assigned to each item. We already prepared this dataframe and saved it into the __url_tags.pkl__ file.

In [37]:
url_tags = data_utils.load_df_pickle('../data/processed/url_tags.pkl')

In [38]:
url_tags.head()

| | url | tag |
|---|---|---|
| 0 | https://developers.redhat.com/blog/2019/03/07/... | amazon web services |
| 1 | https://developers.redhat.com/blog/2019/03/07/... | aws |
| 2 | https://developers.redhat.com/blog/2019/03/07/... | cloud |
| 3 | https://developers.redhat.com/blog/2019/03/07/... | containers |
| 4 | https://developers.redhat.com/blog/2019/03/07/... | developer preview |


The __url_tags__ contains an item url and one tag assigned to the item in each row. We then pass this dataframe together with the document vectors to the __SkipGram__ recommender.

In [39]:
from skip_gram_recommender import SkipGramRecommender

The __SkipGram__ recommender needs to receive a __RecommenderDataFrame__ object. Once again we will use only the reduced __df_3__ dataset, as the __SkipGram__ recommender is very memory-intensive.

In [40]:
train_dataframe3 = data_utils.RecommenderDataFrame(train_df3, url_2_id=doc2vec_recommender.url_2_id)
print("train_df3 = {0}\ntrain_dataframe3 = {1}".format(train_df3.shape, train_dataframe3.df.shape))

train_df3 = (55548, 4)
train_dataframe3 = (47859, 4)


In [41]:
skip_gram_recommender = SkipGramRecommender(train_dataframe3, doc2vec_recommender.doc_vectors, url_tags)

We will load a pretrained model using the __load_model__ method. This model was originally trained on _Google Colaboratory_ using a _Tesla T4_ GPU. 

In [42]:
from optimal_parameters import skip_gram_optimum
skip_gram_recommender.load_model('../data/processed/skip_gram_learned.pt', **skip_gram_optimum)

The model has now mapped all the parameters to the CPU. After that we can call its __predict__ and __recommend__ methods. 

In [43]:
skip_gram_recommender.recommend(19, n=5)

[('https://developers.redhat.com/videos/youtube/8TX1OGHd1M0', 8.239341),
 ('https://developers.redhat.com/videos/youtube/8mlXKzBF2qA', 6.3137283),
 ('https://developers.redhat.com/videos/youtube/WYUsuHHDn2w', 5.905406),
 ('https://developers.redhat.com/videos/youtube/Z08FEd2r458', 5.6563663),
 ('https://developers.redhat.com/videos/youtube/Zdlyhhm-DdE', 4.857724)]

## 5. Evaluation

Once we have all recommenders ready, we can compare them using the __Evaluator__ class. This class provides three evaluation metrics that can be used for this task: 
- RANK evaluation metric
- Recall at k
- Precision at k
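Precision and recall at k can be illustrated for a single user. This is a generic sketch of the two metrics, not the code of the __Evaluator__ class; the function name and the toy lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids (best first)
    relevant:    set of item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "x"}
precision, recall = precision_recall_at_k(recommended, relevant, k=2)
print(precision, recall)
```

Averaging these values over all users of the evaluation set gives a single number per model.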

The usage of this class is very simple. One only passes the recommender to its init method and calls one of the evaluation metrics with a dataset that will be used for the evaluation.

In [44]:
from evaluation import Evaluator

In [45]:
evaluator = Evaluator(als_recommender)

In [46]:
evaluator.rank_evaluation(validation_df)

13.907955282425101

In [47]:
evaluator.recall_at_k(validation_df, k=10)

0.16737376577257218

In [48]:
evaluator.precision_at_k(validation_df, k=10)

0.02641586309927941
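The RANK metric is not defined in this notebook. The sketch below assumes it is the expected percentile ranking commonly used for implicit-feedback recommenders, where 0 means that every held-out item was ranked first and values around 50 correspond to a random ordering. The function name, the equal weighting of interactions and the toy data are assumptions.

```python
import numpy as np


def expected_percentile_rank(scores, held_out):
    """Mean percentile position (0 = top, 100 = bottom) of held-out items.

    scores:   array of shape (n_users, n_items) with predicted scores
    held_out: dict mapping a user index to a list of held-out item indices
    """
    n_items = scores.shape[1]
    percentiles = []
    for uid, items in held_out.items():
        order = np.argsort(-scores[uid])           # best item first
        position = np.empty(n_items, dtype=int)
        position[order] = np.arange(n_items)       # position of each item in the ranking
        for item in items:
            percentiles.append(100.0 * position[item] / (n_items - 1))
    return float(np.mean(percentiles))


toy_scores = np.array([[0.9, 0.1, 0.5]])
best_case = expected_percentile_rank(toy_scores, {0: [0]})
worst_case = expected_percentile_rank(toy_scores, {0: [1]})
print(best_case, worst_case)
```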

To evaluate the __Doc2Vec__ and __SkipGram__ recommenders we need to create __RecommenderDataFrame__ objects from the test sets.

In [49]:
test_dataframe = data_utils.RecommenderDataFrame(test_df, url_2_id=doc2vec_recommender.url_2_id, uid_2_id=doc2vec_recommender.uid_2_id)
test_dataframe3 = data_utils.RecommenderDataFrame(test_df3, url_2_id=skip_gram_recommender.url_2_id, uid_2_id=skip_gram_recommender.uid_2_id)

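Let us now evaluate all our models and show the results in a table. First we will evaluate them on the test sets: __test_df__ for the __ALS__ recommender, __test_df3__ for the __SVD__ recommender, and the filtered __RecommenderDataFrame__ versions of these sets for the __Doc2Vec__ and __SkipGram__ recommenders.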

In [50]:
evaluator = Evaluator(als_recommender)
als_rank = evaluator.rank_evaluation(test_df)
als_recall = evaluator.recall_at_k(test_df, k=10)
als_precision = evaluator.precision_at_k(test_df, k=10)

evaluator = Evaluator(svd_recommender)
svd_rank = evaluator.rank_evaluation(test_df3)
svd_recall = evaluator.recall_at_k(test_df3, k=10)
svd_precision = evaluator.precision_at_k(test_df3, k=10)

evaluator = Evaluator(doc2vec_recommender)
d2v_rank = evaluator.rank_evaluation(test_dataframe.df)
d2v_recall = evaluator.recall_at_k(test_dataframe.df, k=10)
d2v_precision = evaluator.precision_at_k(test_dataframe.df, k=10)

evaluator = Evaluator(skip_gram_recommender)
sg_rank = evaluator.rank_evaluation(test_dataframe3.df)
sg_recall = evaluator.recall_at_k(test_dataframe3.df, k=10)
sg_precision = evaluator.precision_at_k(test_dataframe3.df, k=10)

results = [('ALS', als_rank, als_recall, als_precision), 
           ('SVD', svd_rank, svd_recall, svd_precision),
           ('Doc2Vec', d2v_rank, d2v_recall, d2v_precision),
           ('SkipGram', sg_rank, sg_recall, sg_precision)]

### TEST SET EVALUATION RESULTS

In [51]:
from IPython.display import HTML, display
html = "<table><tr><th>Recommender</th><th>RANK</th><th>Recall</th><th>Precision</th></tr>"
for model,rank,recall,prec in results:
    html += "<tr><td>{0}</td><td>{1:.2f}</td><td>{2:.4f}</td><td>{3:.4f}</td></tr>".format(model, rank, recall, prec)
html += '</table>'
display(HTML(html))

| Recommender | RANK | Recall | Precision |
|---|---|---|---|
| ALS | 11.17 | 0.1316 | 0.0564 |
| SVD | 49.91 | 0.0073 | 0.0035 |
| Doc2Vec | 35.88 | 0.0219 | 0.008 |
| SkipGram | 37.66 | 0.0034 | 0.0014 |


Now we will evaluate all recommenders on the __validation_df__. Note that we have to transform it to __RecommenderDataFrame__ class to be able to use it with our __Doc2Vec__ and __SkipGram__ recommenders.

In [52]:
validation_dataframe = data_utils.RecommenderDataFrame(validation_df, url_2_id=doc2vec_recommender.url_2_id, uid_2_id=doc2vec_recommender.uid_2_id)
print("validation_df = {0}\nvalidation_dataframe = {1}".format(validation_df.shape, validation_dataframe.df.shape))

validation_df = (17658, 4)
validation_dataframe = (10500, 4)


In [53]:
validation_dataframe_sg = data_utils.RecommenderDataFrame(validation_df, url_2_id=skip_gram_recommender.url_2_id, uid_2_id=skip_gram_recommender.uid_2_id)
print("validation_df = {0}\nvalidation_dataframe_sg = {1}".format(validation_df.shape, validation_dataframe_sg.df.shape))

validation_df = (17658, 4)
validation_dataframe_sg = (5050, 4)


In [54]:
evaluator = Evaluator(als_recommender)
als_rank = evaluator.rank_evaluation(validation_df)
als_recall = evaluator.recall_at_k(validation_df, k=10)
als_precision = evaluator.precision_at_k(validation_df, k=10)

evaluator = Evaluator(svd_recommender)
svd_rank = evaluator.rank_evaluation(validation_df)
svd_recall = evaluator.recall_at_k(validation_df, k=10)
svd_precision = evaluator.precision_at_k(validation_df, k=10)

evaluator = Evaluator(doc2vec_recommender)
d2v_rank = evaluator.rank_evaluation(validation_dataframe.df)
d2v_recall = evaluator.recall_at_k(validation_dataframe.df, k=10)
d2v_precision = evaluator.precision_at_k(validation_dataframe.df, k=10)

evaluator = Evaluator(skip_gram_recommender)
sg_rank = evaluator.rank_evaluation(validation_dataframe_sg.df)
sg_recall = evaluator.recall_at_k(validation_dataframe_sg.df, k=10)
sg_precision = evaluator.precision_at_k(validation_dataframe_sg.df, k=10)

results_validation = [('ALS', als_rank, als_recall, als_precision), 
                      ('SVD', svd_rank, svd_recall, svd_precision),
                      ('Doc2Vec', d2v_rank, d2v_recall, d2v_precision),
                      ('SkipGram', sg_rank, sg_recall, sg_precision)]

### VALIDATION SET EVALUATION RESULTS

In [55]:
html = "<table><tr><th>Recommender</th><th>RANK</th><th>Recall</th><th>Precision</th></tr>"
for model,rank,recall,prec in results_validation:
    html += "<tr><td>{0}</td><td>{1:.2f}</td><td>{2:.4f}</td><td>{3:.4f}</td></tr>".format(model, rank, recall, prec)
html += '</table>'
display(HTML(html))

| Recommender | RANK | Recall | Precision |
|---|---|---|---|
| ALS | 13.91 | 0.1674 | 0.0264 |
| SVD | 42.7 | 0.0482 | 0.0071 |
| Doc2Vec | 31.89 | 0.1512 | 0.0202 |
| SkipGram | 36.31 | 0.015 | 0.0028 |
