# Load data

In [1]:
reddit = spark.read.parquet("/var/reddit-parquet") # these appear to be Subreddit Submissions; Comments (children of Submissions) do not seem to be included

In [2]:
type(reddit)

pyspark.sql.dataframe.DataFrame

In [3]:
len(reddit.columns)

70

In [4]:
reddit.printSchema()

root
 |-- _corrupt_record: string (nullable = true)
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- approved_by: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- banned_by: string (nullable = true)
 |-- body: string (nullable = true)
 |-- body_html: string (nullable = true)
 |-- clicked: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created: long (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- disable_comments: boolean (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- downs: long (nullable = true)
 |-- edited: string (nullable = true)
 |-- from: string (nullable = true)
 |-- from_id: string (nullable = true)
 |-- from_kind: string (nullable = true)
 |-- gilded: long 

In [5]:
record_count = reddit.count()

In [6]:
record_count # 1% of this is +28 million

2859977347

> ### The DataFrame we created has a fairly large number of columns (70), is deeply nested in several instances (up to 8 layers deep), and contains a significant number of records (+2.8 billion).

In [7]:
# select a subset of columns for EDA

these_cols = [
    "id",
    "parent_id",
    "subreddit",
    "author",
    "created",
    "body",
    "num_comments",
    "score"
]

> ### For our initial EDA, we primarily care about the columns listed above, so we'll subset the data accordingly.

In [10]:
cols_df = reddit.select(these_cols)
cols_df.dtypes

[('id', 'string'),
 ('parent_id', 'string'),
 ('subreddit', 'string'),
 ('author', 'string'),
 ('created', 'bigint'),
 ('body', 'string'),
 ('num_comments', 'bigint'),
 ('score', 'bigint')]

> ### It's worthwhile to check the data types of the columns. You may notice that "created" is a bigint, with the created date represented in Unix (Epoch) time: the number of seconds since January 1, 1970 (UTC). We'll want to convert that to a more human-readable format.
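As a quick sanity check outside of Spark, the same conversion can be sketched with Python's standard library. The helper name `epoch_to_year` is hypothetical, and the sketch assumes the timestamps are in seconds and are read as UTC. Spark's `from_unixtime` uses the session time zone instead, so values close to a year boundary may land in a different year.

```python
from datetime import datetime, timezone


def epoch_to_year(epoch_seconds: int) -> str:
    """Return the four-digit UTC year for a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y")


# The Unix epoch itself falls in 1970.
print(epoch_to_year(0))
# 1435708800 is 2015-07-01 00:00:00 UTC.
print(epoch_to_year(1435708800))
```

Returning the year as a string mirrors the `"yyyy"` format used with `from_unixtime`, which also produces a string column.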

# Created date

In [11]:
from pyspark.sql.functions import from_unixtime

In [12]:
# convert "created" from bigint to string
cols_df = cols_df.withColumn(
    "created", 
    cols_df["created"].cast("string")
)

# add new col showing just created year
cols_df = cols_df.withColumn(
    "created_year", 
    from_unixtime(
        cols_df["created"], 
        "yyyy" # full timestamp: yyyy-MM-dd HH:mm:ss.SS
    )
)

In [13]:
years_df = cols_df.groupby("created_year").count()
years_df.orderBy("created_year").show()

+------------+----------+
|created_year|     count|
+------------+----------+
|        null|2694014185|
|        2006|      1817|
|        2007|    279724|
|        2008|   2527732|
|        2009|   4854283|
|        2010|   7064885|
|        2011|  15047383|
|        2012|  18504969|
|        2013|     24296|
|        2014|  52893169|
|        2015|  64764904|
+------------+----------+



> ### The data covers roughly 9 years, from 2006 to 2015, although a significant portion of the records is missing a created year. Some years also have several orders of magnitude fewer records than others (e.g. 2006 and 2013).

In [15]:
null_count = cols_df[cols_df["created_year"].isNull()].count()

In [16]:
null_count

2694014185

In [17]:
round((null_count/record_count)*100,2)

94.2

> ### Overall, 94% of the records in cols_df have a null value for "created_year". While it may seem like our data is useless, it isn't. We should be able to work with the remaining ~6%, which represents +165 million records, although this number will decrease further when we account for nulls in "body" and other columns.
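The arithmetic behind those figures can be double-checked without Spark. This is a minimal sketch that hard-codes the two counts printed by the cells above; the variable names `usable_count`, `null_pct` and `usable_pct` are introduced here for illustration only.

```python
record_count = 2_859_977_347  # reddit.count() from the cell above
null_count = 2_694_014_185    # rows where created_year is null

# Rows that still carry a created year.
usable_count = record_count - null_count

# Share of rows with and without a created year, as percentages.
null_pct = round(null_count / record_count * 100, 2)
usable_pct = round(100 - null_pct, 2)

print(usable_count)
print(null_pct, usable_pct)
```

The usable count equals the sum of the per-year counts in the table above, which is a useful cross-check that no rows were lost in the groupby.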

In [24]:
cols_df[cols_df["body"].isNull()].count()

279383793

> ### Unfortunately, there are a significant number of records with null values in "body" too. I wasn't expecting this much data to be missing. Regardless, we'll continue with our EDA to see what value, if any, there may be in using this data source for additional tasks.

In [22]:
clean_df = cols_df[
    (cols_df["created_year"].isNotNull())&
    (cols_df["body"].isNotNull())
]

In [23]:
clean_df.count()

615962

# Subreddits

In [42]:
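from pyspark.sql import functions as F  # avoids shadowing Python's built-in round()

In [45]:
total_days = 9*365  # approximate span of the data (2006 to 2015), ignoring leap days

subreddit_df = clean_df.groupby("subreddit").count()
subreddit_df = subreddit_df.withColumn("posts_per_day", F.round(subreddit_df["count"]/total_days, 2))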
subreddit_df.orderBy("count", ascending = False).show(10)

+---------------+-----+-------------+
|      subreddit|count|posts_per_day|
+---------------+-----+-------------+
|      AskReddit|45486|        13.85|
|leagueoflegends|11784|         3.59|
|          funny| 8968|         2.73|
|      worldnews| 7410|         2.26|
| DestinyTheGame| 7209|         2.19|
|           pics| 6870|         2.09|
|  AdviceAnimals| 6642|         2.02|
|            nfl| 6611|         2.01|
|  todayilearned| 5554|         1.69|
|          DotA2| 4925|          1.5|
+---------------+-----+-------------+
only showing top 10 rows



> ### The amount of usable data continues to shrink as we dig deeper. The Subreddit with the most records with non-null values in "body" is AskReddit, which amounts to approximately 5,000 records per year, or about 14 records per day.

> ### Is an average of about 14 records (with actual content in the body) per day representative of what's posted to Reddit?

> ### While I could not find official statistics from Reddit itself, I did find a third-party site that purports to track statistics on Subreddits. For instance, it shows that in a recent 24-hour period AskReddit received +12,000 Posts, funny received +1,300, worldnews received 657, and nfl received ~20.

> ### A simple "gut check" also indicates these numbers are not accurate and the data set is unreliable. While several of the Subreddits are defaults for new users (e.g. AskReddit, funny), AdviceAnimals and nfl are not. 

> ### The AskReddit Subreddit has +28 million members. It appears this data set is nowhere near representative of the number of Posts made to the Subreddit.
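To put a rough number on the gap, the `posts_per_day` values from the table above can be compared with the third-party 24-hour figures quoted in the text. This is only a sketch: the third-party numbers are approximate, and they describe a single recent day while the data set averages over roughly 9 years, so the ratios indicate scale rather than a like-for-like measurement.

```python
# posts_per_day values taken from the subreddit table above.
dataset_posts_per_day = {
    "AskReddit": 13.85,
    "funny": 2.73,
    "worldnews": 2.26,
    "nfl": 2.01,
}

# Approximate Posts in a recent 24-hour period, as quoted from the third-party site.
third_party_posts_per_day = {
    "AskReddit": 12000,
    "funny": 1300,
    "worldnews": 657,
    "nfl": 20,
}

# How many times larger the third-party figure is than the data set's average.
ratios = {
    name: third_party_posts_per_day[name] / dataset_posts_per_day[name]
    for name in dataset_posts_per_day
}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: roughly {ratio:.0f}x more Posts per day than the data set suggests")
```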

### stratify by created year

# cols_df.sampleBy("created_year", {1: 0.01}).count()
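A note on the commented-out call: `DataFrame.sampleBy` expects the `fractions` dict to be keyed by the actual values of the stratification column, and any stratum missing from the dict is sampled at a fraction of zero. Since "created_year" holds strings such as "2014", a key of `1` would match nothing. Below is a minimal sketch of building the dict, assuming a uniform 1% fraction per year. The helper `build_year_fractions` is hypothetical, and the Spark call is shown only as a comment because it needs a live session.

```python
def build_year_fractions(first_year: int, last_year: int, fraction: float) -> dict:
    """Map each year (as a string, matching created_year) to a sampling fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return {str(year): fraction for year in range(first_year, last_year + 1)}


fractions = build_year_fractions(2006, 2015, 0.01)
print(fractions)

# Assumed usage against the DataFrame from this notebook (not run here):
# sample_df = cols_df.sampleBy("created_year", fractions, seed=42)
# sample_df.groupby("created_year").count().show()
```

Rows with a null "created_year" have no matching key, so they would be excluded from the sample under this approach.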

# Can still evaluate distributions of scores & body_len for certain (common) Subreddits