# Indexing Reddit comments into Solr

* 53 million comments are available at /home/avjves/reddit/reddit_comments.gz on the course server

In [4]:
zcat reddit_comments.gz | head -n 2

{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
{"distinguished":null,"id":"cnas8zw","archived":false,"author":"RedCoatsForever","score":3,"created_utc":"1420070400","downs":0,"body":"But Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.","link_id":"t3_2qv6c6","name":"t1_cnas8zw","score_hidden":false,"controversiality":0,"subreddit_id":"t5_2s4gt","edited":false,"retrieved_on":1425124282,"ups":3,"author_flair_css_class":"on","gilded":0,"author_flair_text":"Ontario","subreddit":"CanadaPolitics","parent_id":"t1_cnas2

* The format is identical to the tweets in the previous assignment
* Create a new core in your Solr instance, decide which fields you want to index, modify the schema accordingly, and write another script that indexes the data into the new core
* At least the fields "body", "author", "subreddit", "gilded", "ups" and "downs" are useful
* If you also want the URL to the actual comment, you need to form it from the subreddit, link_id, and id values
* <strike>https://www.reddit.com/r/"subreddit"/comments/"link_id"/"id"/"id" </strike>
* <b>EDIT:</b>
    * The "t3_" prefix has to be stripped from the beginning of <i>link_id</i>
    * Assuming that the comment JSON has been loaded into a dictionary called comment:
    * ```url = "https://www.reddit.com/r/%s/comments/%s/%s/%s" % (comment["subreddit"], comment["link_id"].split("_")[1], comment["id"], comment["id"])```
* The URL is a bit weird, but it is really nice to be able to see the actual context instead of just the comment
* There is no need to index all 53 million comments (that would take too many resources and a lot of time)
* Index only the first 100 thousand comments or so
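
The field selection and URL construction described in the list above could be sketched roughly like this. The helper names `comment_url` and `comment_to_doc` and the extra `url` field are illustrative assumptions, not part of the assignment, and the schema of your core would need matching fields. The sample line is a shortened version of the first comment in the file.

```python
import json

# Fields copied as-is from the comment JSON (the ones listed as useful above).
FIELDS = ("body", "author", "subreddit", "gilded", "ups", "downs")

def comment_url(comment):
    """Build the comment URL, stripping the "t3_" prefix from link_id."""
    link_id = comment["link_id"].split("_")[1]
    return "https://www.reddit.com/r/%s/comments/%s/%s/%s" % (
        comment["subreddit"], link_id, comment["id"], comment["id"])

def comment_to_doc(comment):
    """Turn one parsed comment into a flat dictionary ready for indexing."""
    doc = {field: comment[field] for field in FIELDS}
    doc["id"] = comment["id"]          # assumes the core uses "id" as its unique key
    doc["url"] = comment_url(comment)  # hypothetical extra field
    return doc

# Shortened version of the first comment in the file (body text cut for brevity).
line = ('{"link_id":"t3_2qyr1a","body":"Most of us ...","downs":0,'
        '"author":"YoungModern","id":"cnas8zv","subreddit":"exmormon",'
        '"gilded":0,"ups":14}')
doc = comment_to_doc(json.loads(line))
print(doc["url"])
# https://www.reddit.com/r/exmormon/comments/2qyr1a/cnas8zv/cnas8zv
```

A function like `comment_to_doc` could be called where the skeleton script says to gather the fields you want.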

A skeleton for the script to read the data in:

In [1]:
import pysolr, gzip, json

def get_comments(filename):
    with gzip.open(filename, "rt") as gzip_file:
        for line in gzip_file:
            comment = json.loads(line)
            doc = {}
            ## Gather the stuff you want here
            yield doc


## Get comments, index them to Solr...
for doc in get_comments("/home/avjves/reddit/reddit_comments.gz"):
    pass  ## Index doc into your core here
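
One possible way to fill in the indexing step is to send the documents in batches and stop after the first 100 thousand. This is only a sketch: `index_in_batches` is a hypothetical helper, the batch size is arbitrary, and the core name and URL in the usage comment are assumptions that depend on your own Solr setup. The `always_commit` argument is available in recent pysolr versions.

```python
from itertools import islice

def index_in_batches(solr, docs, batch_size=1000, limit=100000):
    """Send at most `limit` documents to Solr in batches; return how many were sent.

    `solr` is anything with an add(list_of_docs) method, e.g. a pysolr.Solr object.
    """
    docs = islice(docs, limit)  # stop after the first `limit` documents
    sent = 0
    while True:
        batch = list(islice(docs, batch_size))  # next chunk from the same iterator
        if not batch:
            break
        solr.add(batch)
        sent += len(batch)
    return sent

# Usage (assumes a core called "reddit" on a local Solr; adjust the URL):
# import pysolr
# solr = pysolr.Solr("http://localhost:8983/solr/reddit", always_commit=True)
# index_in_batches(solr, get_comments("/home/avjves/reddit/reddit_comments.gz"))
```

Batching keeps the number of HTTP requests low compared to adding one comment at a time, while the generator keeps memory use bounded.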


## Querying the data

* So now we have indexed our reddit comments
* Let's play around with them
* Reddit comments have a couple of interesting fields:
    - ups (the number of upvotes the comment has received)
    - gilded (how many times the comment has been gilded. Gilding = another user buying Reddit gold for the author. It is bought with real money, so gilded comments are usually either funny or helpful :) )

Let's play with these a bit.
 
I suggest you use the query tab from the admin panel to query the data.


A few useful tools first:
* Sorting - In the query panel there is an input field called sort, where you can specify which field to sort your results by. Give the field name followed by asc or desc, depending on whether you want ascending or descending order. For example, <i><b>ups desc</b></i> sorts by upvotes in descending order.
* Range - You can query for values in a range. You can, for example, look for comments that have 50 - 100 upvotes. Specify the field and then the range inside square brackets (both ends are included): <i><b>ups:[50 TO 100]</b></i>
* Facets - There is a checkbox called "facet" in the query panel. Ticking it lets you specify a field to use as a facet. Faceting groups the matching documents by the values of the field you choose and counts them. We can use this to see automatically how many of the matching comments come from each subreddit. Set facet.field to <i>subreddit</i>. The facets are shown at the end of the resulting JSON, so scroll down the page a bit to see them.
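
If you would rather query from Python than from the admin panel, the three tools above map onto ordinary request parameters. The sketch below only builds the query string and parameters and shows how the flat facet list in the response can be paired up. The helper names are made up for illustration, and the pysolr call in the comments assumes a core called "reddit" on a local Solr.

```python
def build_params(sort=None, facet_field=None, rows=10):
    """Collect the extra request parameters the query tab would send."""
    params = {"rows": rows}
    if sort:
        params["sort"] = sort
    if facet_field:
        params["facet"] = "true"
        params["facet.field"] = facet_field
    return params

def range_query(field, low, high):
    """Inclusive range query, e.g. ups:[50 TO 100]."""
    return "%s:[%s TO %s]" % (field, low, high)

def facet_counts(flat):
    """Pair up Solr's flat facet list [value, count, value, count, ...]."""
    return list(zip(flat[::2], flat[1::2]))

query = range_query("ups", 50, 100)
params = build_params(sort="ups desc", facet_field="subreddit")
print(query, params)

# With pysolr this could be run as (core name and URL are assumptions):
# import pysolr
# solr = pysolr.Solr("http://localhost:8983/solr/reddit")
# results = solr.search(query, **params)
# print(results.hits)
# print(facet_counts(results.facets["facet_fields"]["subreddit"]))
```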

Now we can try to look for something interesting. Try to look for:
* Which comments have the most upvotes?
* Which gilded comments have the most upvotes?
* Which subreddit has the most gilded comments?
* Pick an interesting word or phrase (a meme, for example) and find the same kind of information about it (most upvotes, how many gilded, which subreddits it appears in most, etc.)
* Whatever else you can think of :) Sentiment, for example: find comments that contain a sentiment word and a topic of your choice
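
As a starting point, the questions above could be expressed as query strings plus parameters along these lines. This is a sketch under a few assumptions: "ups" and "gilded" are indexed as numeric fields, "body" is a tokenized text field, and the topic and sentiment words are arbitrary placeholders. The helper names are hypothetical, and `any_word` expects plain words without special query characters.

```python
def phrase(field, text):
    """Exact-phrase query on one field; escapes backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)

def any_word(field, words):
    """Match documents that contain at least one of the words in the field."""
    return "%s:(%s)" % (field, " OR ".join(words))

GILDED = "gilded:[1 TO *]"  # gilded at least once

# Each entry is (query string, extra parameters).
queries = {
    "most upvoted": ("*:*", {"sort": "ups desc"}),
    "most upvoted gilded": (GILDED, {"sort": "ups desc"}),
    "gilded per subreddit": (
        GILDED, {"rows": 0, "facet": "true", "facet.field": "subreddit"}),
    "sentiment about a topic": (
        "%s AND %s" % (phrase("body", "star wars"),
                       any_word("body", ["love", "hate"])),
        {"sort": "ups desc"}),
}

for name, (q, params) in queries.items():
    print(name, "->", q, params)
```

Setting rows to 0 in the facet query skips the documents themselves, since only the per-subreddit counts are of interest there.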