# Warm-up
How many unique subreddits are there?

Pick a subreddit. Which user wrote the most comments in January of 2012? What were that user's three most-upvoted comments? Filter out bots and other types of automated posts.

Choose a day of significance to you (e.g., your birthday), and retrieve a 5% sample of the comments posted on this particular day across all 5 years of the dataset.

The number of comments posted per year will likely trend upward over time as more users join Reddit. However, the popularity of individual subreddits may increase or decrease over time. Find an example of each.

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

# Read through the SparkSession (`spark`), the entry point the later cells use
spark = SparkSession.builder.getOrCreate()

df = spark.read.json("hdfs://orion11:32001/sampled_reddit/*")
# Fields present in each comment record (kept for reference)
columns = [
    "distinguished",
    "downs",
    "created_utc",
    "controversiality",
    "edited",
    "gilded",
    "author_flair_css_class",
    "id",
    "author",
    "retrieved_on",
    "score_hidden",
    "subreddit_id",
    "score",
    "name",
    "author_flair_text",
    "link_id",
    "archived",
    "ups",
    "parent_id",
    "subreddit",
    "body"]

df.show(n=4)

In [None]:
df.printSchema()

## 1
How many unique subreddits are there?
#### Answer: 253336

In [None]:
df.select("subreddit").distinct().count()

In [None]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

df = spark.read.json("hdfs://orion11:32001/reddit/2012/RC_2012-01.bz2")
columns = [
    "distinguished",
    "downs",
    "created_utc",
    "controversiality",
    "edited",
    "gilded",
    "author_flair_css_class",
    "id",
    "author",
    "retrieved_on",
    "score_hidden",
    "subreddit_id",
    "score",
    "name",
    "author_flair_text",
    "link_id",
    "archived",
    "ups",
    "parent_id",
    "subreddit",
    "body"]

df.show(n=4)

In [None]:
df = df.withColumn("created_utc", df["created_utc"].cast(LongType()))

## 2
Pick a subreddit. What user wrote the most comments in January of 2012?
#### Answer: ('Corrupted_Planet', 287)

In [None]:
# January 1st 2012 00:00:00 UTC  -> 1325376000 (inclusive lower bound)
# February 1st 2012 00:00:00 UTC -> 1328054400 (exclusive upper bound, so January 31st is included)
from collections import Counter

df.createOrReplaceTempView("TEMP_DF")

sample_pd = spark.sql("""select * from TEMP_DF where temp_df.subreddit = 'runescape'
and temp_df.created_utc >= 1325376000
and temp_df.created_utc < 1328054400
and temp_df.author != '[deleted]'""").toPandas()

# Count comments per author and show the five most active authors
counter = Counter(sample_pd.author)
counter.most_common(5)
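
The epoch constants in the query can be derived instead of typed by hand. The following is a minimal sketch using only the standard library; the helper name `month_bounds_utc` is hypothetical and not part of the original notebook. It returns a half-open range, so the upper bound is the first second of the following month.

```python
from datetime import datetime, timezone


def month_bounds_utc(year, month):
    """Return (start, end) epoch seconds for a calendar month in UTC.

    The range is half-open: start <= t < end.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # Roll over to January of the next year after December
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


start, end = month_bounds_utc(2012, 1)
print(start, end)
```

The two values could then be interpolated into the `where` clause as `created_utc >= start and created_utc < end`.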

What were the user's three most-upvoted comments?
Filter out bots or other types of automated posts.
#### Answer:

In [None]:
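# Three highest-scoring January 2012 comments by the most active author found above
# (1328054400 is February 1st 2012 00:00:00 UTC, used as an exclusive upper bound)
sample_pd_2 = spark.sql("""select * from TEMP_DF
where temp_df.subreddit = 'runescape'
and temp_df.created_utc >= 1325376000
and temp_df.created_utc < 1328054400
and temp_df.author = 'Corrupted_Planet'
order by temp_df.score desc
limit 3""").toPandas()
sample_pd_2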
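
The only automated-post filtering in this section is the exclusion of `[deleted]` authors. One rough way to screen out bot accounts as well is a name-based heuristic. The sketch below is an assumption, not the notebook's method: the function name `looks_automated` and the name patterns are hypothetical, and any such rule would need to be checked against the actual author list (a name like "Abbot" would be a false positive).

```python
import re

# Hypothetical rule: account names that end in "bot", ignoring case
_BOT_SUFFIX = re.compile(r"bot$", re.IGNORECASE)
# Accounts treated as automated regardless of the suffix rule (assumed list)
_KNOWN_AUTOMATED = {"automoderator", "[deleted]"}


def looks_automated(author):
    """Return True if the author name matches the heuristic above."""
    name = author.strip().lower()
    return name in _KNOWN_AUTOMATED or bool(_BOT_SUFFIX.search(name))


# Illustrative author names only; "some_bot" is made up
authors = ["Corrupted_Planet", "AutoModerator", "some_bot", "[deleted]"]
humans = [a for a in authors if not looks_automated(a)]
print(humans)
```

The same predicate could be applied to the `author` column of the pandas frame before counting or ranking comments.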


## 3
Choose a day of significance to you (e.g., your birthday), and retrieve a 5% sample of the comments posted on this particular day across all 5 years of the dataset.

In [None]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

df = spark.read.json("hdfs://orion11:32001/sampled_reddit/*")
columns = [
    "distinguished",
    "downs",
    "created_utc",
    "controversiality",
    "edited",
    "gilded",
    "author_flair_css_class",
    "id",
    "author",
    "retrieved_on",
    "score_hidden",
    "subreddit_id",
    "score",
    "name",
    "author_flair_text",
    "link_id",
    "archived",
    "ups",
    "parent_id",
    "subreddit",
    "body"]

df.show(n=2)

In [None]:
df = df.withColumn("created_utc", df["created_utc"].cast(LongType()))

In [None]:
df.createOrReplaceTempView("TEMP_DF")
# December 11th 00:00:00 UTC (inclusive) to December 12th 00:00:00 UTC (exclusive),
# for 2017, 2016, 2015, 2014, and 2013
pd_3 = spark.sql("""select temp_df.body from TEMP_DF
where (temp_df.created_utc >= 1512950400 and temp_df.created_utc < 1513036800)
or (temp_df.created_utc >= 1481414400 and temp_df.created_utc < 1481500800)
or (temp_df.created_utc >= 1449792000 and temp_df.created_utc < 1449878400)
or (temp_df.created_utc >= 1418256000 and temp_df.created_utc < 1418342400)
or (temp_df.created_utc >= 1386720000 and temp_df.created_utc < 1386806400)""")

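# Keep roughly 5% of the matching comments (sampling without replacement)
samp = pd_3.sample(withReplacement=False, fraction=0.05)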
samp.write.format('csv').save('hdfs://orion11:32001/sampled_birthday_answer')
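
The five day ranges in the query correspond to December 11th (UTC) of 2013 through 2017. The following is a minimal sketch of generating them rather than hard-coding each constant; `day_bounds_utc` is a hypothetical helper, not part of the original notebook, and the ranges it returns are half-open.

```python
from datetime import datetime, timedelta, timezone


def day_bounds_utc(year, month, day):
    """Return (start, end) epoch seconds for one UTC day: start <= t < end."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())


# December 11th for each of the five years covered by the query
ranges = [day_bounds_utc(year, 12, 11) for year in range(2013, 2018)]

# Build the body of a SQL where clause from the ranges
where_clause = "\nor ".join(
    f"(created_utc >= {lo} and created_utc < {hi})" for lo, hi in ranges
)
print(where_clause)
```

Generating the bounds this way keeps every range exactly 86400 seconds long, which is easy to get wrong when the constants are typed by hand.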

## 4
The number of comments posted per year will likely trend upward over time as more users join Reddit. However, the popularity of individual subreddits may increase or decrease over time. Find an example of each.

In [None]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

df = spark.read.json("hdfs://orion11:32001/reddit/2016/*")
columns = [
    "distinguished",
    "downs",
    "created_utc",
    "controversiality",
    "edited",
    "gilded",
    "author_flair_css_class",
    "id",
    "author",
    "retrieved_on",
    "score_hidden",
    "subreddit_id",
    "score",
    "name",
    "author_flair_text",
    "link_id",
    "archived",
    "ups",
    "parent_id",
    "subreddit",
    "body"]

df.show(n=2)

In [None]:
df.createOrReplaceTempView("TEMP_DF")
pd_4 = spark.sql("""select temp_df.subreddit, MONTH(FROM_UNIXTIME(temp_df.created_utc)) month, 
count(temp_df.body) comments
from TEMP_DF
GROUP BY 
MONTH(FROM_UNIXTIME(temp_df.created_utc)), temp_df.subreddit""").toPandas()

pd_5 = pd_4.pivot_table(index=['subreddit'],
                   columns='month',
                   values='comments')
pd_5.iloc[:100]
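
Scanning the pivoted table by eye is one way to find a growing and a shrinking subreddit. Another option is to rank subreddits by the slope of a least-squares line fitted to their monthly counts. The sketch below is an illustration under stated assumptions, not the notebook's method: `trend_slope` is a hypothetical helper, and the subreddit names and counts are made up.

```python
def trend_slope(counts):
    """Least-squares slope of counts against their index (0, 1, 2, ...)."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Made-up monthly comment counts, for illustration only
monthly = {
    "growing_sub": [100, 120, 150, 170, 200, 240],
    "shrinking_sub": [300, 260, 250, 200, 180, 150],
}

slopes = {name: trend_slope(counts) for name, counts in monthly.items()}
rising = max(slopes, key=slopes.get)    # most positive slope
falling = min(slopes, key=slopes.get)   # most negative slope
print(rising, falling)
```

Each row of `pd_5` could supply one `counts` list. Months with no comments appear as missing values in the pivot and would need to be filled (for example with 0) first.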