In [1]:
# load the data
import tarfile
import os.path
import json
import re
from bz2 import BZ2File
from urllib import request
from io import BytesIO

import numpy as np


fname = "cmv.tar.bz2"
url = "https://chenhaot.com/data/cmv/" + fname

# download if not exists
if not os.path.isfile(fname):
    f = BytesIO()
    with request.urlopen(url) as resp, open(fname, 'wb') as f_disk:
        data = resp.read()
        f_disk.write(data)  # save to disk too
        f.write(data)
        f.seek(0)
else:
    f = open(fname, 'rb')


tar = tarfile.open(fileobj=f, mode="r")

# Extract the file we are interested in

train_fname = "pair_task/train_pair_data.jsonlist.bz2"
test_fname = "pair_task/heldout_pair_data.jsonlist.bz2"

train_bzlist = tar.extractfile(train_fname)

# Deserialize the JSON list
pair_train = [
    json.loads(line.decode('utf-8'))
    for line in BZ2File(train_bzlist)
]

test_bzlist = tar.extractfile(test_fname)

pair_test = [
    json.loads(line.decode('utf-8'))
    for line in BZ2File(test_bzlist)
]
f.close()
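
The two list comprehensions above repeat the same steps: pull a member out of the tar archive, decompress it with bz2, and parse one JSON object per line. A minimal sketch of a shared helper follows. The name `read_jsonlist_bz2`, the demo records, and the in-memory demo archive are assumptions made for illustration, so the block runs without the real download.

```python
import bz2
import io
import json
import tarfile


def read_jsonlist_bz2(tar, member_name):
    """Return the JSON records stored in a bz2-compressed member of `tar`."""
    member = tar.extractfile(member_name)
    with bz2.BZ2File(member) as lines:
        return [json.loads(line.decode("utf-8")) for line in lines]


# Build a tiny in-memory archive (hypothetical data) to exercise the helper.
records = [{"op_name": "t3_demo1"}, {"op_name": "t3_demo2"}]
payload = bz2.compress(
    "\n".join(json.dumps(r) for r in records).encode("utf-8")
)

buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as demo_tar:
    info = tarfile.TarInfo(name="pair_task/demo.jsonlist.bz2")
    info.size = len(payload)
    demo_tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

# Read the member back the same way the notebook reads the real files.
with tarfile.open(fileobj=buffer, mode="r") as demo_tar:
    loaded = read_jsonlist_bz2(demo_tar, "pair_task/demo.jsonlist.bz2")
```

With the real archive, the same helper could be called once with `train_fname` and once with `test_fname`.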

In [4]:
# Display the first training pair (the output below is the full record, not just its keys)
pair_train[0]

{'op_author': 'seanyowens',
 'op_text': 'I can\'t remember the topic that spurred this discussion, but a friend and I were debating whether man-made things were natural. He took the position that they are unnatural. \n\nHe cited this definition by Merriam-Webster:  existing in nature and not made or caused by people : coming from nature (http://www.merriam-webster.com/dictionary/natural) as his basis for the distinction for natural vs. unnatural.\n\nHowever, I respectfully disagree with his position and furthermore that definition of natural. People arise from nature. Humankind\'s capacity to create, problem-solve, analyze, rationalize, and build also come from natural processes. How are the things we create unnatural? It is only through natural occurrences that we have this ability, why is it that we would give the credit of these things solely to man, as opposed to nature? We are not separate from nature, thus, how can any of our actions or creations be unnatural? If we were somehow 
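
Printing a whole record floods the output with post text. One possible alternative is to summarise each field by type and size. The sketch below is only an illustration: `summarize` and `demo_record` are hypothetical names, and the demo record mimics just the fields visible in this notebook.

```python
def summarize(record, max_chars=40):
    """Map each key to a short description of its value instead of the full text."""
    summary = {}
    for key, value in record.items():
        if isinstance(value, str):
            shown = value[:max_chars] + ("..." if len(value) > max_chars else "")
            summary[key] = f"str[{len(value)}]: {shown!r}"
        elif isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = repr(value)
    return summary


# Hypothetical record shaped like the fields used elsewhere in this notebook.
demo_record = {
    "op_author": "demo_user",
    "op_text": "x" * 500,
    "positive": {"comments": [{"body": "reply"}]},
}
demo_summary = summarize(demo_record)
```

In the notebook, `summarize(pair_train[0])` would give a compact view of the same record.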

In [8]:
op_name = [line["op_name"] for line in pair_train]
op_name

['t3_2ro9ux',
 't3_2ro0ti',
 't3_2rnr30',
 't3_2rnfn0',
 't3_2rnfn0',
 't3_2rmy6e',
 't3_2rmy6e',
 't3_2rmwcd',
 't3_2rkzen',
 't3_2rkzen',
 't3_2rkzen',
 't3_2rkjr3',
 't3_2rk7my',
 't3_2rk7my',
 't3_2rgs7o',
 't3_2rexe3',
 't3_2rekjo',
 't3_2rd7mg',
 't3_2rd7mg',
 't3_2rcsdh',
 't3_2rcsdh',
 't3_2rcsdh',
 't3_2rcrkd',
 't3_2rcpqy',
 't3_2rc3p3',
 't3_2rawgb',
 't3_2r9169',
 't3_2r7f0g',
 't3_2r6fw9',
 't3_2r6a86',
 't3_2r6a86',
 't3_2r6a86',
 't3_2r5o9o',
 't3_2r5o9o',
 't3_2r5o9o',
 't3_2r4u3k',
 't3_2r4b8f',
 't3_2r3fa4',
 't3_2r39y4',
 't3_2r39y4',
 't3_2r1qfw',
 't3_2r1omi',
 't3_2r1omi',
 't3_2r17g1',
 't3_2r0eyf',
 't3_2r07cv',
 't3_2r07cv',
 't3_2r07cv',
 't3_2r07cv',
 't3_2qzudy',
 't3_2qy21w',
 't3_2qvzyh',
 't3_2quzrj',
 't3_2qu6fg',
 't3_2qtorn',
 't3_2qslmz',
 't3_2qrlmm',
 't3_2qrlmm',
 't3_2qrbaf',
 't3_2qr8gn',
 't3_2qr8gn',
 't3_2qqxoy',
 't3_2qqa7z',
 't3_2qnzap',
 't3_2qnzap',
 't3_2qnu4z',
 't3_2qnu4z',
 't3_2qncj3',
 't3_2qmstv',
 't3_2qmhw7',
 't3_2qmhw7',
 't3_2
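
The list above contains repeated post IDs, which indicates that one original post can contribute several pairs. A short sketch of how the repeats could be counted with `collections.Counter` follows. The IDs in `demo_op_names` are made up for illustration.

```python
from collections import Counter

# Hypothetical post IDs; in the notebook this would be the `op_name` list.
demo_op_names = ["t3_a", "t3_b", "t3_b", "t3_c", "t3_c", "t3_c"]

pairs_per_post = Counter(demo_op_names)       # post ID -> number of pairs
n_unique = len(pairs_per_post)                # number of distinct posts
most_common = pairs_per_post.most_common(1)   # post with the most pairs
```

Knowing how many distinct posts there are matters if pairs are later split into folds, since pairs from the same post share their `op_text`.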

In [3]:
pair_train[256]['negative']['comments'][0]['body']

'I\'d like to look at your assertion that you would be able to teach yourself or learn what you need to know for free. The problem is, the most valuable things you learn from college are going to be the things that you probably wouldn\'t be able to teach yourself. An example: when I was a tutor at college, I heard complaints about the liberal arts education all the time. It was usually from math, science, computer science, etc, students who were really good at one thing: their major subject. They resented the fact that they had to take one or two writing classes in order to get their degree. The problem? The STEM folks who complained about liberal arts education were all terrible at writing. So bad, in fact, that they wouldn\'t be able to function in the workplace because their writing was incomprehensible. I had several math and science professors complain to me that their students didn\'t understand that there are so many other skills they need to learn in order to be successful: wri

In [4]:
# Mean character count of the original post plus its first positive reply
sum(len(line['op_text']) + len(line['positive']['comments'][0]['body'])
    for line in pair_train) / len(pair_train)

3869.021122685185
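
A mean alone can hide a skewed length distribution, so it may help to look at the median and maximum as well. The sketch below assumes records shaped like the ones in this notebook. `demo_pairs` is invented data, not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical records with the same nesting as the pair data.
demo_pairs = [
    {"op_text": "a" * 4, "positive": {"comments": [{"body": "b" * 2}]}},
    {"op_text": "c" * 6, "positive": {"comments": [{"body": "d" * 4}]}},
    {"op_text": "e" * 30, "positive": {"comments": [{"body": "f" * 2}]}},
]

# Combined character length of each post and its first positive reply.
lengths = [
    len(p["op_text"]) + len(p["positive"]["comments"][0]["body"])
    for p in demo_pairs
]

mean_len = mean(lengths)
median_len = median(lengths)
max_len = max(lengths)
```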

In [8]:
# Concatenate each original post with its first positive and first negative reply
pair_train_text = [
    'original posts: ' + pair['op_text'] + '\n\n'
    + 'reply 1: ' + pair['positive']['comments'][0]['body'] + '\n\n'
    + 'reply 2: ' + pair['negative']['comments'][0]['body']
    for pair in pair_train
]
print(pair_train_text[8])

original posts: Okay, I'm talking about making the human race smarter, forever.
Intelligence is at least partially genetic and therefore passed down by parents, yes? [Yes.](http://www.the-scientist.com/?articles.view/articleNo/40459/title/Inherited-Intelligence/)

So, what if, instead of killing off the less-intelligent people (I'm against killing. Of most things.) we just limit offspring?

For example, we could use the IQ scale (for want of a better intelligence measure) to determine the number of offspring a person should be able to genetically contribute to.
Like, round the IQ to the nearest multiple of 50, then divide by 50, and that's the number of offspring you're allowed to create.

So someone with near average intelligence (near 100 IQ, 75-124) would have their IQ rounded to 100 and then divided by 50 to make 2 offspring. 
The total offspring is presumably equal to the number of people who contributed to it. A man and a woman with average IQ can have two children (not each.) an

In [50]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

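# Use a distinct name so the imported `stopwords` module is not shadowed
# (shadowing it makes this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))

pair_train_content = [' '.join(word for word in word_tokenize(line) if word not in stop_words)
                      for line in pair_train_text]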
pair_train_content[6]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


"original posts : As independent form study , philosophy n't seem practical applications . What value philosophy modern age , right , aside contemplating things . Is truly worth invest significant amount time money studying field ? There seem tangible applications appreciable benefits studying philosophy aside personal growth expansion one 's intellectual perspective , I argue gained without studying philosophy rigorously academic manner . I often read argument impossible argue philosophy useless without using philosophy , something along lines . I acknowledge . Yes , I engaging use philosophy right , moment . However , provide argument would worthwhile STUDY philosophy . What gain studying philosophy could gained thoughtful introspection ? Certainly , important tools originated philosophical study , scientific method , science could described subset philosophy n't argument lack tangible benefits gained studying philosophy . You n't need study philosophy become capable scientist . You 

In [51]:
# Mean character count after stopword removal
sum(len(line) for line in pair_train_content) / len(pair_train_content)

3789.7106481481483
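
The filtered text above still contains capitalised words such as "What", "Is" and "You", which suggests the membership test is case-sensitive against a lowercase stopword list. That would also explain why the mean length drops only slightly. A minimal sketch of a case-insensitive filter follows. `DEMO_STOP_WORDS` is a tiny made-up set and `str.split` stands in for `word_tokenize`, so the block runs without NLTK.

```python
# Tiny hypothetical stopword set; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"what", "is", "the", "of", "you", "a"}


def drop_stop_words(text, stop_words):
    """Remove tokens whose lowercase form is in `stop_words`, keeping original case."""
    return " ".join(
        tok for tok in text.split() if tok.lower() not in stop_words
    )


filtered = drop_stop_words("What is the value of philosophy", DEMO_STOP_WORDS)
```

In the notebook the equivalent change would be comparing `word.lower()` rather than `word` against the stopword set.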

In [4]:
from IPython.display import Markdown
def show_post(cmv_post):
    md_format = "**{title}** \n\n {selftext}".format(**cmv_post)
    md_format = "\n".join(["> " + line for line in md_format.splitlines()])
    return Markdown(md_format)
show_post(pair_train[250])

KeyError: 'title'
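
The `KeyError` arises because `format(**cmv_post)` looks up `title` and `selftext`, while the pair records shown earlier store the post body under `op_text`. A hedged sketch of a formatter adapted to that layout follows. The key `op_title` is an assumption not confirmed by the output above, so it is read with `.get` and a placeholder. `format_post` is a hypothetical name, and it returns a plain string so it can run outside a notebook.

```python
def format_post(cmv_post):
    """Build a Markdown blockquote from a pair record's title (if any) and text."""
    # 'op_title' is an assumed key; fall back to a placeholder when it is absent.
    title = cmv_post.get("op_title", "(no title)")
    body = cmv_post["op_text"]
    md_format = f"**{title}** \n\n {body}"
    # Prefix every line with "> " so the post renders as a quote.
    return "\n".join("> " + line for line in md_format.splitlines())


# Hypothetical record containing only the field confirmed in this notebook.
demo_md = format_post({"op_text": "First line.\nSecond line."})
```

Inside the notebook, the result could be wrapped as `Markdown(format_post(pair_train[250]))` to render it.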