# Subreddit recommender
This notebook walks through several experiments for suggesting similar subreddits based on a user query.

In [2]:
import sys
import os
import gzip
import gensim
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import glob
from collections import namedtuple
import ml_metrics
import pickle

When running this notebook, modify `base_path` to the location of `safe_links_imgposts.gz` from the preprocessing notebook.

In [9]:
import ast

base_path = "/mnt/marcel/"
corpus_path = base_path + "subreddits/"
row = namedtuple('row_raw', ['subreddit', 'submission_title', 'submitted_link', 'comments_link', 'short_name', 'imgurlhash'])


def read_corpus_postquery():
    """Yield one TaggedDocument per post title; every 100th post gets its own unique evaluation tag."""
    with gzip.open(base_path+'safe_links_imgposts.gz', "rt") as f:
        for i, line in enumerate(f):
            # Each line is the str() of a list; literal_eval parses it without executing arbitrary code.
            line_list = ast.literal_eval(line)
            r = row._make(line_list)
            subreddit = r.subreddit
            if i%100==0:
                subreddit += "_eval_%s" % i
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(r.submission_title), [subreddit])

def read_corpus_subredditquery():
    """Yield one TaggedDocument per post title; every 10th post is tagged `<subreddit>_eval`."""
    with gzip.open(base_path+'safe_links_imgposts.gz', "rt") as f:
        for i, line in enumerate(f):
            # Each line is the str() of a list; literal_eval parses it without executing arbitrary code.
            line_list = ast.literal_eval(line)
            r = row._make(line_list)
            subreddit = r.subreddit
            if i%10==0:
                subreddit += "_eval"
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(r.submission_title), [subreddit])
            


CPU times: user 1h 38min 34s, sys: 47min 41s, total: 2h 26min 15s
Wall time: 1h 30min 5s


map@1 0.00044827362286782005
map@2 0.0006315630714703166
map@3 0.0007286191985398288
map@4 0.0008103139242026559
map@5 0.0008639391800223578
map@6 0.0009061830338907428
map@7 0.0009414943072541626
map@8 0.0009729153555860192
map@9 0.0009950264636713994
map@10 0.0009950264636713994


## Query by subreddit
This model uses gensim's doc2vec implementation to compute one vector per subreddit. To evaluate the quality of the similarities, 10% of all posts (every tenth line of the corpus) are assigned to a separate evaluation tag for their subreddit. We then expect the highest similarity score for an evaluation tag to be with its original subreddit. For example, 10% of the posts of `funny` go into `funny_eval`, so querying `funny_eval` should retrieve `funny`.
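
The hold-out tagging can be illustrated on a toy corpus. This is a minimal sketch, not the notebook's reader: `tag_for_post` and `holdout_every` are hypothetical names, and the toy data contains a single subreddit. In the real corpus the line index runs over all subreddits, so the 10% share per subreddit is only approximate.

```python
def tag_for_post(subreddit, line_index, holdout_every=10):
    """Return the tag a post would get; every `holdout_every`-th line is held out."""
    if line_index % holdout_every == 0:
        return subreddit + "_eval"
    return subreddit


# Toy corpus: 20 posts, all from the same subreddit.
tags = [tag_for_post("funny", i) for i in range(20)]
eval_share = tags.count("funny_eval") / len(tags)
print(tags[:3], eval_share)
```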

In [3]:
model_filename = "model_subredditquery.p"
if os.path.isfile(model_filename):
    with open(model_filename, "rb") as f:
        model_subredditquery = pickle.load(f)
else:
    model_subredditquery = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=10, workers=16, window=5)
    model_subredditquery.build_vocab(read_corpus_subredditquery())
    %time model_subredditquery.train(read_corpus_subredditquery(), total_examples=model_subredditquery.corpus_count, epochs=10)
    with open(model_filename, "wb") as f:
        pickle.dump(model_subredditquery, f)

In [114]:
actual = []
predicted = []
for k in model_subredditquery.docvecs.doctags.keys():
    if "_eval" in k:
        preds = model_subredditquery.docvecs.most_similar(k, topn=1000)
        actual.append([k[0:k.index("_eval")]])
        predicted.append([p[0] for p in preds if "_eval" not in p[0]][0:10])
        
for i in range(1,11):
    print("map@%s" % i, ml_metrics.mapk(actual, predicted, i))

map@1 0.03901969205940156
map@2 0.04378977782983874
map@3 0.045303884181710115
map@4 0.046133605785708395
map@5 0.04671408552748602
map@6 0.04706766623716372
map@7 0.047311237173015316
map@8 0.04754063079294425
map@9 0.047728628716290265
map@10 0.04787439941377703


As we can see, only about 4% of all evaluation subreddits are most similar to their original counterpart. This evaluation is somewhat unfair: most subreddits are very small, so they are rarely relevant for a user query (assuming users are mostly interested in large subreddits) and may not have enough representative data to learn a good vector. To check this, the next evaluation only considers subreddits whose evaluation tag covers more than 100 posts (`doc_count > 100`), which corresponds to roughly 1,000 posts or more in the full subreddit.
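
To make the metric concrete: each query here has exactly one correct subreddit, so average precision at k reduces to the reciprocal rank of the correct tag within the top k, or 0 if it is missing. map@1 is therefore the share of queries whose top result is correct. The sketch below is a simplified re-implementation for this single-label case, not the `ml_metrics` code; the function names and toy data are made up for illustration.

```python
def reciprocal_rank_at_k(true_tag, predicted, k):
    """Return 1/rank of `true_tag` within the top-k predictions, or 0.0 if absent."""
    for rank, tag in enumerate(predicted[:k], start=1):
        if tag == true_tag:
            return 1.0 / rank
    return 0.0


def map_at_k(true_tags, predictions, k):
    """Mean of the per-query scores (single relevant item per query)."""
    scores = [reciprocal_rank_at_k(t, p, k) for t, p in zip(true_tags, predictions)]
    return sum(scores) / len(scores)


# Toy data: first query hits at rank 1, second at rank 2, third misses.
true_tags = ["funny", "pics", "aww"]
predictions = [["funny", "memes"], ["memes", "pics"], ["cats", "dogs"]]
for k in (1, 2):
    print("map@%s" % k, map_at_k(true_tags, predictions, k))
```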

In [117]:
actual = []
predicted = []
for k, v in model_subredditquery.docvecs.doctags.items():
    if "_eval" in k and v.doc_count > 100:
        preds = model_subredditquery.docvecs.most_similar(k, topn=1000)
        true_tag = k[0:k.index("_eval")]
        actual.append([true_tag])
        predictions = [p[0] for p in preds if "_eval" not in p[0]][0:10]
        predicted.append(predictions)
        if v.doc_count > 100000 and true_tag not in predictions:
            print(true_tag, predictions)
        
for i in range(1,11):
    print("map@%s" % i, ml_metrics.mapk(actual, predicted, i))

map@1 0.5740922473012757
map@2 0.6303565587176971
map@3 0.6421328099443899
map@4 0.647693817468106
map@5 0.6511612692181877
map@6 0.6536691745720205
map@7 0.6554916896428182
map@8 0.6566366029565245
map@9 0.6575816107710121
map@10 0.6582358469502728


This demonstrates that the method is suitable for retrieving similar subreddits: in about 57% of the cases (map@1), the correct subreddit is the top result. In the following cell you can query the model to retrieve similar subreddits:

In [5]:
model_subredditquery.docvecs.most_similar("doctorwho", topn=10)

[('doctorwho_eval', 0.9095908999443054),
 ('DoctorWhumour', 0.8420202732086182),
 ('DoctorWhumour_eval', 0.797204852104187),
 ('gallifrey', 0.796032726764679),
 ('drwho', 0.7927993535995483),
 ('gallopfrey', 0.7127442359924316),
 ('doctorwhocirclejerk', 0.7006417512893677),
 ('wholock', 0.6978670954704285),
 ('Sherlock', 0.6900186538696289),
 ('Torchwood', 0.6789624691009521)]
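
The raw result list still contains the held-out `_eval` tags, which are an artifact of the evaluation setup rather than real subreddits. A small post-processing step can drop them before showing results to a user. This is a sketch with a hypothetical helper `drop_eval_tags`; the sample list is an abbreviated, rounded copy of the output above. It matches on the suffix only, on the assumption that no real subreddit name ends in `_eval`.

```python
def drop_eval_tags(results, suffix="_eval"):
    """Keep only (tag, score) pairs whose tag is not a held-out evaluation tag."""
    return [(tag, score) for tag, score in results if not tag.endswith(suffix)]


# Abbreviated, rounded sample of a most_similar() result list.
results = [
    ("doctorwho_eval", 0.9096),
    ("DoctorWhumour", 0.8420),
    ("DoctorWhumour_eval", 0.7972),
    ("gallifrey", 0.7960),
]
clean = drop_eval_tags(results)
print(clean)
```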

## Query by post (or query string/keywords)
We again use the doc2vec algorithm to compute vectors for documents. In doc2vec each document gets a tag, and as shown above one can retrieve similar tags. To query by a post, which can be seen as a query string or list of keywords, each evaluation post needs its own individual tag; here every 100th post is tagged `<subreddit>_eval_<line index>`. Further down we demonstrate how to query by a list of keywords.
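
For the evaluation, the true subreddit has to be recovered from such a per-post tag. A minimal sketch of that step, assuming tags of the form `<subreddit>_eval_<line index>`; `split_post_tag` is a hypothetical helper, not part of the notebook or of gensim:

```python
def split_post_tag(tag, marker="_eval_"):
    """Split a per-post evaluation tag into (subreddit, line_index).

    Tags without a numeric evaluation suffix are returned unchanged with None.
    """
    if marker in tag:
        subreddit, index = tag.rsplit(marker, 1)
        if index.isdigit():
            return subreddit, int(index)
    return tag, None


print(split_post_tag("funny_eval_300"))
print(split_post_tag("funny"))
```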

In [11]:
model_filename = "model_postquery.p"
if os.path.isfile(model_filename):
    with open(model_filename, "rb") as f:
        model_postquery = pickle.load(f)
else:
    model_postquery = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=10, workers=16, window=5)
    model_postquery.build_vocab(read_corpus_postquery())
    %time model_postquery.train(read_corpus_postquery(), total_examples=model_postquery.corpus_count, epochs=10)
    with open(model_filename, "wb") as f:
        pickle.dump(model_postquery, f)

CPU times: user 1h 52min 38s, sys: 54min 25s, total: 2h 47min 4s
Wall time: 1h 43min 21s


In [None]:
actual = []
predicted = []

for td in read_corpus_postquery():
    if "_eval_" in td.tags[0]:
        true_tag = td.tags[0][0:td.tags[0].index("_eval_")]
        actual.append([true_tag])
        preds = model_postquery.docvecs.most_similar(td.tags[0], topn=5000)
        predicted.append([p[0] for p in preds if "_eval_" not in p[0]][0:10])


In [14]:
for i in range(1,11):
    print("map@%s" % i, ml_metrics.mapk(actual, predicted, i))

map@1 0.002722063406548934
map@2 0.0036861275297016814
map@3 0.004159567219413949
map@4 0.004417338375337143
map@5 0.004594684930612301
map@6 0.004715837373896203
map@7 0.004800901855350857
map@8 0.00486244471882752
map@9 0.004912280475639336
map@10 0.004964350249135822


In [21]:
inferred_vector = model_postquery.infer_vector(['sonic', 'screwdriver'])
model_postquery.docvecs.most_similar([inferred_vector], topn=10)

[('420vendorshop', 0.8758198022842407),
 ('gtaonlinecrews_eval_38377400', 0.8756065368652344),
 ('InfinitySuns', 0.8755130767822266),
 ('nobleboners', 0.8748517036437988),
 ('DowniesAndDogshit', 0.8743728399276733),
 ('AdrianaLimaArmpits', 0.8740894794464111),
 ('WE_BUY_DVDs', 0.8730192184448242),
 ('Beefeaters', 0.8719935417175293),
 ('conspire', 0.8703101873397827),
 ('Anxietyquotes', 0.8697609305381775)]

Both the MAP scores and the results for the inferred vector show that this approach does not work at all. The fastText model yields significantly better results for keyword queries.