# <font color='magenta'>EDA on Reddit 1% sample </font>

In [2]:
one_perct_df = spark.read.parquet("one_perct_sample.parquet")

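> The data is a one percent sample from the Reddit parquet file, which contains more than 2.8 billion records.
> The records are treated here as posts/Submissions (in PRAW terms), because of attributes such as 'num_comments', 'name' and 'locked' that are common to posts/Submissions. Note, however, that the 'parent_id' and 'body' columns, and the `t1_`-prefixed 'parent_id' values in the sample rows shown near the end, are characteristic of comments.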

In [3]:
one_perct_df.columns

['id',
 'parent_id',
 'subreddit',
 'author',
 'created_utc',
 'body',
 'num_comments',
 'score',
 'created_utc_year',
 'created_utc_yearMonthDay',
 'body_len']

In [4]:
one_perct_df.count() #~300+ million

309811130

# <font color='magenta'>Deleted authors and bots</font>

In [5]:
one_perct_df.groupby("author").count().orderBy("count", ascending = False).show()

+--------------------+--------+
|              author|   count|
+--------------------+--------+
|           [deleted]|35125988|
|       AutoModerator| 1181108|
|         conspirobot|   68608|
|       ModerationLog|   67404|
|         TweetPoster|   66676|
|         autowikibot|   50636|
|    imgurtranscriber|   48604|
|      MTGCardFetcher|   48432|
|          PoliticBot|   47124|
|               RPBot|   45961|
|          dogetipbot|   44796|
| Late_Night_Grumbler|   38604|
|throwthrowawaytothee|   37615|
|    qkme_transcriber|   36277|
| TweetsInCommentsBot|   32089|
|     ImagesOfNetwork|   25972|
|        Franciscouzo|   24230|
|      TotesMessenger|   23998|
|           havoc_bot|   23671|
|        morbiusgreen|   23352|
+--------------------+--------+
only showing top 20 rows



> The top "author", `[deleted]`, is a placeholder that groups all deleted accounts, and most of the other top authors appear to be bots.

In [6]:
deleted_authors_df = one_perct_df.filter(one_perct_df.author == "[deleted]")
deleted_authors_df.groupby("body").count().orderBy("count", ascending = False).show()

+----------+--------+
|      body|   count|
+----------+--------+
| [deleted]|19819730|
| [removed]| 1711094|
|   Thanks!|    4624|
|      Yes.|    4170|
|       No.|    3775|
|Thank you!|    3130|
|       lol|    3035|
|         .|    2747|
|       Yes|    2408|
|        :)|    2324|
|      Why?|    1973|
|        no|    1951|
|       yes|    1925|
|     What?|    1920|
|     &lt;3|    1860|
|        :(|    1825|
|        No|    1819|
|Thank you.|    1782|
|       wat|    1709|
|     Nope.|    1429|
+----------+--------+
only showing top 20 rows



> Most of the posts/Submissions by deleted authors have a body of `[deleted]` or `[removed]`, so they won't be useful in our subsequent analysis. Let's filter out both the deleted users and the bots, so that the data is smaller and more representative of content posted directly by human end users.
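
The share of deleted-author records with no usable body can be worked out directly from the counts printed above. The snippet below is only arithmetic on those printed numbers; the variable names are illustrative.

```python
# Counts copied from the two outputs above.
deleted_author_rows = 35_125_988   # rows with author == "[deleted]"
body_deleted = 19_819_730          # of those, rows with body == "[deleted]"
body_removed = 1_711_094           # of those, rows with body == "[removed]"

unusable = body_deleted + body_removed
share = unusable / deleted_author_rows

print(f"{unusable:,} of {deleted_author_rows:,} rows ({share:.1%}) have no usable body")
```

The remaining rows by deleted authors still carry text, but they cannot be attributed to an account.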

In [7]:
authors_to_remove = [
    "[deleted]",
    "AutoModerator", #bot that enforces Subreddit rules (?)
    "autotldr",
    "photography_bot",
#     "conspirobot", #not a bot
    "ModerationLog", #suspended, unclear if bot
    "TweetPoster",
    "tweet_poster",
    "autowikibot",
    "imgurtranscriber",
    "MTGCardFetcher",
    "PoliticBot", #suspended, unclear if bot
    "RPBot",
    "dogetipbot",
    "qkme_transcriber", #mostly bot
    "TweetsInCommentsBot",
    "ImagesOfNetwork",
    "TotesMessenger",
    "havoc_bot",
    "User_Simulator",
    "PornOverlord",
    "PriceZombie",
    "CaptionBot",
    "WritingPromptsRobot",
#     "raddit-bot", #suspended, unclear if bot
#     "gifv-bot", #suspended, unclear if bot
    "MovieGuide"
]

In [31]:
# one_perct_df.filter(one_perct_df.author.isin(authors_to_remove)).count() # +36,000,000

> This step removes more than 36 million records.

In [9]:
one_perct_df = one_perct_df.filter(~one_perct_df.author.isin(authors_to_remove))

In [30]:
# one_perct_df.count() # +272,000,000

In [11]:
one_perct_df.groupby("author").count().orderBy("count", ascending = False).show()

+--------------------+-----+
|              author|count|
+--------------------+-----+
|         conspirobot|68608|
| Late_Night_Grumbler|38604|
|throwthrowawaytothee|37615|
|        Franciscouzo|24230|
|        morbiusgreen|23352|
|              Lots42|22807|
|             hit_bot|21160|
|                -rix|20748|
|          pixis-4950|20559|
|         UnluckyLuke|20541|
|          amici_ursi|19565|
|              matts2|18704|
|           TrollaBot|17049|
|     NoMoreNicksLeft|16304|
|            iam4real|15904|
|          raddit-bot|15773|
|            gifv-bot|15439|
|      atomicimploder|15284|
|            G_Morgan|15229|
|        noeatnosleep|14783|
+--------------------+-----+
only showing top 20 rows



> New list of top authors by number of posts/Submissions. The top 20 were checked manually for bots.
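
A simple name-based heuristic can help shortlist accounts for this kind of manual check. The sketch below assumes that names ending in "bot" are worth a look; this is a guess for illustration, not a rule used in this notebook, and it yields candidates only.

```python
import re

# Top authors after filtering, copied from the output above.
top_authors = [
    "conspirobot", "Late_Night_Grumbler", "throwthrowawaytothee",
    "Franciscouzo", "morbiusgreen", "Lots42", "hit_bot", "-rix",
    "pixis-4950", "UnluckyLuke", "amici_ursi", "matts2", "TrollaBot",
    "NoMoreNicksLeft", "iam4real", "raddit-bot", "gifv-bot",
    "atomicimploder", "G_Morgan", "noeatnosleep",
]

# Assumption: a case-insensitive "bot" suffix marks a candidate for review.
BOT_SUFFIX = re.compile(r"bot$", re.IGNORECASE)

candidates = [name for name in top_authors if BOT_SUFFIX.search(name)]
print(candidates)
```

The heuristic is noisy in both directions. The comments in `authors_to_remove` mark `conspirobot` as "not a bot", and bots such as `AutoModerator` or `TweetPoster` would not match the suffix at all.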

## Counts of posts/Submissions per year

In [12]:
%%time
one_perct_df.groupby("created_utc_year").count().orderBy("created_utc_year").show()

+----------------+--------+
|created_utc_year|   count|
+----------------+--------+
|            2005|     101|
|            2006|   36084|
|            2007|  197737|
|            2008|  585008|
|            2009| 1632953|
|            2010| 4540038|
|            2011|11862253|
|            2012|25744042|
|            2013|40892843|
|            2014|56059011|
|            2015|74291952|
|            2016|56973060|
+----------------+--------+

CPU times: user 11.2 ms, sys: 158 µs, total: 11.4 ms
Wall time: 4.87 s


> This nominal 1% stratified sample is closer to a ~10% sample: the yearly counts are roughly an order of magnitude (about 10x) greater than the counts expected from a true 1% sample of the ~2.8 billion records.
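
As a rough check of the sampling rate, the row count from `one_perct_df.count()` can be compared with the size of the full dataset. The sketch below assumes the full dataset holds about 2.8 billion records. That figure is quoted at the top of this notebook as a lower bound, so the resulting fraction is approximate.

```python
# Rough check of the effective sampling fraction.
sample_rows = 309_811_130          # one_perct_df.count() before filtering
full_rows_approx = 2_800_000_000   # assumption: "+2.8 billion" taken as 2.8e9

fraction = sample_rows / full_rows_approx
expected_rows_at_1pct = full_rows_approx // 100

print(f"effective sampling fraction: {fraction:.1%}")
print(f"rows expected for a true 1% sample: {expected_rows_at_1pct:,}")
print(f"observed / expected: {sample_rows / expected_rows_at_1pct:.1f}x")
```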

# <font color='magenta'>Analyze Subreddits</font>

In [13]:
from pyspark.sql.functions import from_unixtime

In [14]:
one_perct_df = one_perct_df.withColumn(
    "created_utc_yearMonthDay", 
    from_unixtime(
        one_perct_df["created_utc"], 
        "yyyy-MM-dd" # full timestamp: yyyy-MM-dd HH:mm:ss.SS
    )
)
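
`from_unixtime` formats the epoch seconds in the Spark session (or system) time zone, so the derived date matches the UTC date only when that time zone is UTC. As an independent check, the sketch below converts one `created_utc` value, taken from the sample rows shown later in this notebook, using only the Python standard library. The variable names are illustrative.

```python
from datetime import datetime, timezone

created_utc = 1470142873  # value taken from a sample row in this notebook

# Interpret the epoch seconds explicitly as UTC, then format like "yyyy-MM-dd".
as_utc = datetime.fromtimestamp(created_utc, tz=timezone.utc)
year_month_day = as_utc.strftime("%Y-%m-%d")

print(year_month_day)
```

If the Spark-derived `created_utc_yearMonthDay` for the same row differs from this value, the session time zone is the likely cause.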

In [15]:
from pyspark.sql.functions import length

In [16]:
one_perct_df = one_perct_df.withColumn(
    "body_len",
    length(one_perct_df["body"])
)

In [17]:
%%time
yearsMonthDay_df = one_perct_df.groupby(["subreddit","created_utc_yearMonthDay"]).count()

avg_post_df = yearsMonthDay_df.groupby("subreddit").avg()
avg_post_df.orderBy("avg(count)", ascending = False).show(10)

+---------------+------------------+
|      subreddit|        avg(count)|
+---------------+------------------+
|      AskReddit|        8224.86176|
|leagueoflegends|2027.9855195911414|
|          funny|1969.6582075168646|
|           pics| 1801.580203691916|
|  AdviceAnimals|1558.2507163323783|
|     The_Donald| 1517.428947368421|
|            nfl|  1289.54451901566|
|   pcmasterrace|1247.8179775280898|
|         gaming|1198.7863483318029|
|       politics| 1142.973445986723|
+---------------+------------------+
only showing top 10 rows

CPU times: user 19.3 ms, sys: 1.25 ms, total: 20.6 ms
Wall time: 18.6 s


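> Average number of posts/Submissions per day for each Subreddit, taken over the days on which the Subreddit has at least one record in the sample (days with no records are not counted as zeros).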

In [18]:
%%time
max_post_df = yearsMonthDay_df.groupby("subreddit").max()
max_post_df.orderBy("max(count)", ascending = False).show(10)

+-----------------+----------+
|        subreddit|max(count)|
+-----------------+----------+
|        AskReddit|     23594|
|         politics|     16580|
|              nfl|     13888|
|     pcmasterrace|     12497|
|           gaming|     11901|
|millionairemakers|     11696|
|    SquaredCircle|     11266|
|              nba|     11156|
|              CFB|     10630|
|        pokemongo|     10504|
+-----------------+----------+
only showing top 10 rows

CPU times: user 16.7 ms, sys: 905 µs, total: 17.6 ms
Wall time: 15.6 s


> Top 10 Subreddits by the maximum number of posts/Submissions recorded on a single day.

In [19]:
these_subreddits = [
    "worldnews", "news", "AskCulinary",
    "AskHistorians", "howto", "todayilearned",
    "conspiracy", "MilitaryConspiracy", "PedoGate",
    "FalseFlagWatch", "skeptic", "politicalfactchecking",
    "MensRights", "MRActivism", "glasgow",
    "melbourne", "travel", "photography"
]

In [20]:
subreddit_created_df = yearsMonthDay_df.filter(yearsMonthDay_df.subreddit.isin(these_subreddits))

In [21]:
subreddit_created_df.show(3) #df shows the number of posts/Submissions created in a Subreddit on a given day

+-----------+------------------------+-----+
|  subreddit|created_utc_yearMonthDay|count|
+-----------+------------------------+-----+
|AskCulinary|              2016-06-20|   33|
|  worldnews|              2014-06-12| 1556|
|  worldnews|              2015-01-05| 2159|
+-----------+------------------------+-----+
only showing top 3 rows



## Calculate submission count IQRs for select Subreddits

In [23]:
IQR_dict = {}
for entry in these_subreddits:
    temp_df = subreddit_created_df.filter(subreddit_created_df.subreddit == entry)
    IQR_result = temp_df.approxQuantile(col = "count", probabilities = [0.25, 0.5, 0.75], relativeError = 0)
    print(entry, IQR_result)
    IQR_dict[entry] = IQR_result

('worldnews', [196.0, 595.0, 1565.0])
('news', [31.0, 132.0, 1096.0])
('AskCulinary', [20.0, 27.0, 35.0])
('AskHistorians', [48.0, 62.0, 81.0])
('howto', [2.0, 5.0, 9.0])
('todayilearned', [182.0, 1084.0, 1620.0])
('conspiracy', [23.0, 113.0, 234.0])
('MilitaryConspiracy', [1.0, 1.0, 1.0])
('PedoGate', [])
('FalseFlagWatch', [1.0, 1.0, 1.0])
('skeptic', [16.0, 28.0, 41.0])
('politicalfactchecking', [1.0, 2.0, 3.0])
('MensRights', [28.0, 108.0, 153.0])
('MRActivism', [1.0, 1.0, 2.0])
('glasgow', [3.0, 7.0, 12.0])
('melbourne', [13.0, 34.0, 86.0])
('travel', [23.0, 49.0, 76.0])
('photography', [32.0, 83.0, 124.0])
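
Each list printed above holds the three quartiles (Q1, median, Q3) of the daily counts; the interquartile range itself is Q3 - Q1. The sketch below derives it for a few of the printed entries. The helper name `iqr` is hypothetical, and the empty list for `PedoGate` is assumed to mean that no rows matched.

```python
# Quartiles copied from the output above: [Q1, median, Q3] of daily counts.
quartiles = {
    "worldnews": [196.0, 595.0, 1565.0],
    "AskHistorians": [48.0, 62.0, 81.0],
    "PedoGate": [],  # empty result, presumably no rows for this subreddit
}


def iqr(q):
    """Return Q3 - Q1, or None when no quartiles are available."""
    if len(q) != 3:
        return None
    q1, _median, q3 = q
    return q3 - q1


iqr_by_subreddit = {name: iqr(q) for name, q in quartiles.items()}
print(iqr_by_subreddit)
```

The same helper can be applied to every entry of `IQR_dict`.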


In [24]:
# IQR_dict

## Calculate submission score IQRs for select Subreddits

In [25]:
subreddit_score_df = one_perct_df.groupby(["subreddit","score"]).count() #df shows the counts of scores awarded to posts/Submissions in a Subreddit

In [26]:
score_dict = {}
for entry in these_subreddits:
    temp_df = subreddit_score_df.filter(subreddit_score_df.subreddit == entry)
    score_result = temp_df.approxQuantile(col = "count", probabilities = [0.25, 0.5, 0.75], relativeError = 0)
    print(entry, score_result)
    score_dict[entry] = score_result

('worldnews', [1.0, 3.0, 12.0])
('news', [1.0, 3.0, 10.0])
('AskCulinary', [1.0, 4.0, 27.0])
('AskHistorians', [1.0, 2.0, 10.0])
('howto', [1.0, 3.0, 20.0])
('todayilearned', [1.0, 3.0, 14.0])
('conspiracy', [1.0, 3.0, 17.0])
('MilitaryConspiracy', [2.0, 3.0, 14.0])
('PedoGate', [])
('FalseFlagWatch', [1.0, 5.0, 33.0])
('skeptic', [1.0, 5.0, 30.0])
('politicalfactchecking', [1.0, 3.0, 16.0])
('MensRights', [1.0, 4.0, 24.0])
('MRActivism', [2.0, 3.0, 16.0])
('glasgow', [1.0, 3.0, 32.0])
('melbourne', [1.0, 5.0, 44.0])
('travel', [1.0, 3.0, 16.0])
('photography', [1.0, 4.0, 27.0])
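
Note that the quartiles above are taken over the `count` column of `subreddit_score_df`, that is, over how often each distinct score value occurs, not over the scores themselves. If quartiles of the scores are what is wanted, one possible sketch is shown below. It assumes `score` is a numeric column, and the function name `score_quartiles` is hypothetical.

```python
def score_quartiles(df, subreddits, rel_error=0.0):
    """Quartiles [Q1, median, Q3] of the raw `score` column per subreddit.

    `df` is assumed to be a Spark DataFrame with `subreddit` and a numeric
    `score` column, for example `one_perct_df`.
    """
    result = {}
    for name in subreddits:
        per_sub = df.filter(df["subreddit"] == name)
        # relativeError=0.0 asks for exact quantiles, which can be
        # expensive on large data; a small positive value trades accuracy
        # for speed.
        result[name] = per_sub.approxQuantile(
            "score", [0.25, 0.5, 0.75], rel_error
        )
    return result
```

A call such as `score_quartiles(one_perct_df, these_subreddits)` would return a dictionary shaped like `score_dict`, with an empty list for any subreddit that has no rows.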


In [27]:
# score_dict

# Export of 1% sample test

In [33]:
one_perct_df.show(3) # id, subreddit, author, created_utc_yearMonthDay, body

+-------+----------+---------+--------------+-----------+--------------------+------------+-----+----------------+------------------------+--------+
|     id| parent_id|subreddit|        author|created_utc|                body|num_comments|score|created_utc_year|created_utc_yearMonthDay|body_len|
+-------+----------+---------+--------------+-----------+--------------------+------------+-----+----------------+------------------------+--------+
|d60y38i|t1_d60quyy|AskReddit|michaelochurch| 1470142873|No one knows for ...|        null|   87|            2016|              2016-08-02|    1306|
|d60y38p|t1_d60xwhq| buildapc|   Liquidretro| 1470142873|Oh I don't see th...|        null|    1|            2016|              2016-08-02|     631|
|d60y38r|t1_d60xkom|pokemongo|     Chocobean| 1470142873|" Quality but neg...|        null|    0|            2016|              2016-08-02|     225|
+-------+----------+---------+--------------+-----------+--------------------+------------+-----+----------------+------------------------+--------+

In [40]:
one_perct_sub_df = one_perct_df.filter(one_perct_df.subreddit.isin(these_subreddits)).select(['id', 'subreddit', 'author', 'created_utc_yearMonthDay', 'created_utc_year', 'body'])

In [41]:
one_perct_sub_df.count()

9033070

In [42]:
one_perct_sub_df.groupby("created_utc_year").count().orderBy("created_utc_year").show()

+----------------+-------+
|created_utc_year|  count|
+----------------+-------+
|            2008|  30244|
|            2009|  78171|
|            2010| 163316|
|            2011| 400668|
|            2012| 798415|
|            2013|1424538|
|            2014|1926398|
|            2015|2496165|
|            2016|1715155|
+----------------+-------+



In [43]:
one_perct_sub_df.show(3)

+-------+-------------+------------------+------------------------+----------------+--------------------+
|     id|    subreddit|            author|created_utc_yearMonthDay|created_utc_year|                body|
+-------+-------------+------------------+------------------------+----------------+--------------------+
|d60y42k|todayilearned|         squeamish|              2016-08-02|            2016|I'm pretty sure m...|
|d60y4t0|         news|      wrathofoprah|              2016-08-02|            2016|&gt; Police spoke...|
|d60y4x5|    worldnews|horrorshowmalchick|              2016-08-02|            2016|Nah man. His lawy...|
+-------+-------------+------------------+------------------------+----------------+--------------------+
only showing top 3 rows



In [64]:
for year in [2008, 2009]: # [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]:
    temp_df = one_perct_sub_df.filter(one_perct_sub_df.created_utc_year == year)
    fname = "{this_year}_one_perct_sample.csv".format(this_year = year)
#     temp_df.write.parquet(fname)
#     temp_df.write.csv(fname)
#     temp_df.coalesce(1).write.csv(fname)
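
The loop above leaves every write call commented out. Spark's `write.csv` and `write.parquet` each create a directory of part files rather than a single file, so a name ending in `.csv` would become a directory. One alternative to filtering year by year is a single partitioned write, sketched below. The function name `export_by_year` and the output path are hypothetical, and overwriting existing output is an assumption about the desired behaviour.

```python
def export_by_year(df, out_dir, year_col="created_utc_year"):
    """Write `df` as Parquet, with one sub-directory per value of `year_col`.

    `df` is assumed to be a Spark DataFrame such as `one_perct_sub_df`;
    `out_dir` is an output path chosen by the caller.
    """
    (
        df.write
        .mode("overwrite")        # replace any previous export at out_dir
        .partitionBy(year_col)    # one directory per year, e.g. created_utc_year=2008
        .parquet(out_dir)
    )
```

A call such as `export_by_year(one_perct_sub_df, "one_perct_sample_by_year")` would write all years in one job, and a single year could later be read back by filtering on `created_utc_year`.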