## Load Reddit Comments Data into Parquet <a class="tocSkip">
This notebook loads the raw [Reddit comments dataset](http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b) into Parquet format. It augments the data with several derived time columns (`created_date`, `year`, `month`, `day`) and then partitions the data by year/month/day. Modify the file paths in this notebook to match your system.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window as W

import pandas as pd

# Show full comment bodies; None means "no limit" (-1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)

spark = SparkSession\
        .builder\
        .appName("RedditLoadToParquet")\
        .getOrCreate()

In [2]:
reddit_raw = spark.read.json('qfs:///data/reddit/raw/*.bz2')
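
By default, `spark.read.json` treats each input line as one JSON record, and the `.bz2` files are decompressed transparently on read. To peek at a single raw file without a cluster, a standard-library sketch along these lines may help. The helper name `read_json_lines` and the sample records are hypothetical and not part of the original notebook.

```python
import bz2
import json
import os
import tempfile


def read_json_lines(path, limit=None):
    """Yield one dict per non-empty line of a bz2-compressed JSON Lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file (made-up records) so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    handle.write('{"author": "a", "created_utc": "1246406400"}\n')
    handle.write('{"author": "b", "created_utc": "1246406401"}\n')

records = list(read_json_lines(sample_path))
print(records[0]["created_utc"])
```

Note that `created_utc` arrives as a string here, which matches the `string` type Spark infers for that field in the schema.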

In [3]:
reddit_raw.printSchema()

root
 |-- approved_by: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- banned_by: string (nullable = true)
 |-- body: string (nullable = true)
 |-- body_html: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created: long (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- downs: long (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- mod_reports: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)
 |-- num_reports: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- removal_reason: string (nullable = true)
 |-- r

In [4]:
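# Pin the session time zone so from_unixtime derives dates in UTC,
# independent of the cluster's local time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")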

reddit_augmented = (
    reddit_raw
    .withColumn('created_utc', F.col('created_utc').cast(T.LongType()))
    .withColumn('created_date', F.from_unixtime('created_utc', 'yyyy-MM-dd'))
    .withColumn('year', F.from_unixtime('created_utc', 'yyyy'))
    .withColumn('month', F.from_unixtime('created_utc', 'MM'))
    .withColumn('day', F.from_unixtime('created_utc', 'dd'))
)
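
`F.from_unixtime` formats epoch seconds as a string in the session time zone, so the derived columns are zero-padded strings rather than integers. As a rough plain-Python equivalent for one value (the function name `time_columns` is hypothetical, and `1246406400` is simply midnight UTC on 2009-07-01):

```python
from datetime import datetime, timezone


def time_columns(created_utc):
    """Mirror the derived columns for one epoch-seconds value, in UTC."""
    seconds = int(created_utc)  # the raw field is a string, hence the cast
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "created_date": moment.strftime("%Y-%m-%d"),
        "year": moment.strftime("%Y"),
        "month": moment.strftime("%m"),
        "day": moment.strftime("%d"),
    }


example = time_columns("1246406400")
print(example)
# {'created_date': '2009-07-01', 'year': '2009', 'month': '07', 'day': '01'}
```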

In [5]:
reddit_augmented.write.partitionBy(
    'year', 'month', 'day'
).parquet(
    'qfs:///data/reddit/processed',
    mode='overwrite'
)
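
`partitionBy` writes one Hive-style directory level per partition column, with each directory named `column=value`. The sketch below only illustrates the resulting layout for a single day; the helper `partition_path` is hypothetical and does not touch the file system.

```python
import posixpath


def partition_path(root, year, month, day):
    """Build the directory a given year/month/day partition would land in."""
    return posixpath.join(root, f"year={year}", f"month={month}", f"day={day}")


path = partition_path("qfs:///data/reddit/processed", "2009", "07", "01")
print(path)
# qfs:///data/reddit/processed/year=2009/month=07/day=01
```

Filtering on `year`, `month`, or `day` when reading the data back lets Spark skip directories that cannot match. When the data is re-read, Spark may infer these partition columns as integers (so `07` becomes `7`) unless `spark.sql.sources.partitionColumnTypeInference.enabled` is set to `false`.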

In [6]:
reddit_augmented.limit(10).toPandas()

,approved_by,archived,author,author_flair_css_class,author_flair_text,banned_by,body,body_html,controversiality,created,...,score,score_hidden,subreddit,subreddit_id,ups,user_reports,created_date,year,month,day
0,,True,adrianmonk,,,,There's no way this guy is *really* Korean. He hasn't got the pathological self-deprecation right.,,0,,...,25,False,funny,t5_2qh33,25,,2009-07-01,2009,7,1
1,,True,Nesemulator,,,,Lord of the Rings Online: has an absolutely awesome Sountrack.,,0,,...,2,False,gaming,t5_2qh03,2,,2009-07-01,2009,7,1
2,,True,adolfojp,,,,The fuck it song in Spanish: http://www.imeem.com/people/4fSiML/music/QhKQd-up/la-banda-algarete-que-se-joda/,,0,,...,1,False,funny,t5_2qh33,1,,2009-07-01,2009,7,1
3,,True,thezoner,,,,Ha! Same as hockey players. They must enjoy bloody knuckles.,,0,,...,1,False,videos,t5_2qh1e,1,,2009-07-01,2009,7,1
4,,True,ggrayskies,,,,*You* are clever my friend.,,0,,...,2,False,pics,t5_2qh0u,2,,2009-07-01,2009,7,1
5,,True,nikniuq,,,,Does it *have* to be human?,,0,,...,1,False,reddit.com,t5_6,1,,2009-07-01,2009,7,1
6,,True,[deleted],,,,"I've been lurking reddit for 3 years. Didn't register because I figured an e-mail address was required. I didn't even bother to check. \n\nAfter seeing this, I have an account! Woo hoo!",,0,,...,3,False,technology,t5_2qh16,3,,2009-07-01,2009,7,1
7,,True,Redmoons,,,,"This one time, my friends and I got some m80's and lit them off.\r\n\r\nMy step-mom's lap dog thought the sparkling end of the fuse to one of them was really interesting, and picked up the m80 like it was a stick.\r\n\r\nThe m80 blew it's lower jaw off. \r\n\r\nI hated that dog.\r\n\r\nlol.",,0,,...,-1,False,AskReddit,t5_2qh1i,-1,,2009-07-01,2009,7,1
8,,True,[deleted],,,,The Chrono Cross soundtrack is my all time favorite.,,0,,...,1,False,gaming,t5_2qh03,1,,2009-07-01,2009,7,1
9,,True,larva,,,,"Well, it's pretty good, but the famous [Epic Thread](http://www.reddit.com/comments/6nz1k/got_six_weeks_try_the_hundred_push_ups_training/c04ehte), is way smarter and funnier, in my opinion.",,0,,...,9,False,reddit.com,t5_6,9,,2009-07-01,2009,7,1
