# <font color='magenta'>EDA on Reddit 1% sample </font>

In [70]:
one_perct_df = spark.read.parquet("one_perct_sample.parquet")

> The data is a one percent sample from the Reddit parquet file, which contains more than 2.8 billion records.
> The records are posts/Submissions (in PRAW terms). This is evident from attributes such as 'num_comments', 'name' and 'locked', which are common to posts/Submissions.

In [71]:
one_perct_df.columns

['id',
 'parent_id',
 'subreddit',
 'author',
 'created_utc',
 'body',
 'num_comments',
 'score',
 'created_utc_year',
 'created_utc_yearMonthDay',
 'body_len']

In [72]:
one_perct_df.count() #~300+ million

309811130

# <font color='magenta'>Deleted authors and bots</font>

In [81]:
one_perct_df.groupby("author").count().orderBy("count", ascending = False).show()

+--------------------+-----+
|              author|count|
+--------------------+-----+
|         conspirobot|68608|
| Late_Night_Grumbler|38604|
|throwthrowawaytothee|37615|
|        Franciscouzo|24230|
|        morbiusgreen|23352|
|              Lots42|22807|
|             hit_bot|21160|
|                -rix|20748|
|          pixis-4950|20559|
|         UnluckyLuke|20541|
|          amici_ursi|19565|
|              matts2|18704|
|           TrollaBot|17049|
|     NoMoreNicksLeft|16304|
|            iam4real|15904|
|          raddit-bot|15773|
|            gifv-bot|15439|
|      atomicimploder|15284|
|            G_Morgan|15229|
|        noeatnosleep|14783|
+--------------------+-----+
only showing top 20 rows



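> The top author is a collection of deleted authors ('[deleted]') and most of the others appear to be bots. The output above appears to come from a later run (In [81], after the filter in In [77] was applied), so '[deleted]' no longer shows up in it.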

In [74]:
deleted_authors_df = one_perct_df.filter(one_perct_df.author == "[deleted]")
deleted_authors_df.groupby("body").count().orderBy("count", ascending = False).show()

+----------+--------+
|      body|   count|
+----------+--------+
| [deleted]|19819730|
| [removed]| 1711094|
|   Thanks!|    4624|
|      Yes.|    4170|
|       No.|    3775|
|Thank you!|    3130|
|       lol|    3035|
|         .|    2747|
|       Yes|    2408|
|        :)|    2324|
|      Why?|    1973|
|        no|    1951|
|       yes|    1925|
|     What?|    1920|
|     &lt;3|    1860|
|        :(|    1825|
|        No|    1819|
|Thank you.|    1782|
|       wat|    1709|
|     Nope.|    1429|
+----------+--------+
only showing top 20 rows



> The clear majority of the posts/Submissions by deleted authors have also been deleted or removed (about 19.8 million '[deleted]' and 1.7 million '[removed]' bodies) and won't be useful in our subsequent analysis. Let's filter out both the deleted users and the bots so that the data is smaller and more representative of content posted directly by human users.

In [75]:
authors_to_remove = [
    "[deleted]",
    "AutoModerator", #bot that enforces Subreddit rules (?)
    "autotldr",
    "photography_bot",
#     "conspirobot", #not a bot
    "ModerationLog", #suspended, unclear if bot
    "TweetPoster",
    "tweet_poster",
    "autowikibot",
    "imgurtranscriber",
    "MTGCardFetcher",
    "PoliticBot", #suspended, unclear if bot
    "RPBot",
    "dogetipbot",
    "qkme_transcriber", #mostly bot
    "TweetsInCommentsBot",
    "ImagesOfNetwork",
    "TotesMessenger",
    "havoc_bot",
    "User_Simulator",
    "PornOverlord",
    "PriceZombie",
    "CaptionBot",
    "WritingPromptsRobot",
#     "raddit-bot", #suspended, unclear if bot
#     "gifv-bot", #suspended, unclear if bot
    "MovieGuide"
]

In [76]:
one_perct_df.filter(one_perct_df.author.isin(authors_to_remove)).count() # ~37 million

36996048

> This step will remove more than 36 million records.

In [77]:
one_perct_df = one_perct_df.filter(~one_perct_df.author.isin(authors_to_remove))

In [78]:
one_perct_df.count()

272815082
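
> As a quick sanity check, the three counts reported above should reconcile: the rows before filtering minus the rows left afterwards should equal the rows matched by `authors_to_remove`. A minimal sketch in plain Python, using only the numbers printed in the cell outputs (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the cell outputs above.
total_before = 309_811_130   # one_perct_df.count() before filtering
matched = 36_996_048         # rows whose author is in authors_to_remove
total_after = 272_815_082    # one_perct_df.count() after filtering

removed = total_before - total_after
share_removed = removed / total_before

print(removed == matched)       # do the three counts reconcile?
print(f"{share_removed:.1%}")   # share of the sample that was dropped
```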

In [85]:
one_perct_df.groupby("author").count().orderBy("count", ascending = False).show()

+--------------------+-----+
|              author|count|
+--------------------+-----+
|         conspirobot|68608|
| Late_Night_Grumbler|38604|
|throwthrowawaytothee|37615|
|        Franciscouzo|24230|
|        morbiusgreen|23352|
|              Lots42|22807|
|             hit_bot|21160|
|                -rix|20748|
|          pixis-4950|20559|
|         UnluckyLuke|20541|
|          amici_ursi|19565|
|              matts2|18704|
|           TrollaBot|17049|
|     NoMoreNicksLeft|16304|
|            iam4real|15904|
|          raddit-bot|15773|
|            gifv-bot|15439|
|      atomicimploder|15284|
|            G_Morgan|15229|
|        noeatnosleep|14783|
+--------------------+-----+
only showing top 20 rows



> New list of top authors by number of posts/Submissions. The top 20 were checked manually for bots.
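
> A manual check can be narrowed down with a simple name-based shortlist. The sketch below is only a heuristic and not something the notebook ran: the `bot_candidates` helper and the "ends in bot" rule are assumptions made for illustration, applied to the 20 names printed above.

```python
import re

# Top-20 author names copied from the output above.
top_authors = [
    "conspirobot", "Late_Night_Grumbler", "throwthrowawaytothee",
    "Franciscouzo", "morbiusgreen", "Lots42", "hit_bot", "-rix",
    "pixis-4950", "UnluckyLuke", "amici_ursi", "matts2", "TrollaBot",
    "NoMoreNicksLeft", "iam4real", "raddit-bot", "gifv-bot",
    "atomicimploder", "G_Morgan", "noeatnosleep",
]

# Hypothetical heuristic: a name ending in "bot" (any case) is a candidate.
BOT_SUFFIX = re.compile(r"bot$", re.IGNORECASE)

def bot_candidates(names):
    """Return the names that match the suffix heuristic, in input order."""
    return [name for name in names if BOT_SUFFIX.search(name)]

candidates = bot_candidates(top_authors)
print(candidates)
```

> The heuristic only shortlists names. `conspirobot` matches even though the removal list above marks it as not a bot, so each candidate still needs a manual look.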

## Counts of posts/Submissions per year

In [10]:
%%time
one_perct_df.groupby("created_utc_year").count().orderBy("created_utc_year").show()

+----------------+--------+
|created_utc_year|   count|
+----------------+--------+
|            2005|     101|
|            2006|   36084|
|            2007|  197737|
|            2008|  585008|
|            2009| 1632953|
|            2010| 4540038|
|            2011|11862253|
|            2012|25746724|
|            2013|40910031|
|            2014|56079460|
|            2015|74347450|
|            2016|57035162|
+----------------+--------+

CPU times: user 15.5 ms, sys: 122 µs, total: 15.6 ms
Wall time: 4.82 s


> This nominal 1% stratified sample is off by roughly a factor of 10: the actual yearly counts in the sample are greater than the expected yearly counts by about an order of magnitude.
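
> The size of the discrepancy can be estimated from the two totals quoted earlier in the notebook. A rough sketch, assuming the full dataset holds exactly 2.8 billion records (the notebook gives this figure only as a lower bound, so the true fraction may be somewhat smaller):

```python
full_dataset_rows = 2_800_000_000  # lower bound quoted earlier ("+2.8 billion")
sample_rows = 309_811_130          # one_perct_df.count() before filtering
nominal_fraction = 0.01            # the intended 1% sample

actual_fraction = sample_rows / full_dataset_rows
oversampling = actual_fraction / nominal_fraction

print(f"actual fraction: {actual_fraction:.2%}")
print(f"ratio to nominal: {oversampling:.1f}x")
```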

# <font color='magenta'>Analyze Subreddits</font>

In [11]:
from pyspark.sql.functions import from_unixtime

In [12]:
one_perct_df = one_perct_df.withColumn(
    "created_utc_yearMonthDay", 
    from_unixtime(
        one_perct_df["created_utc"], 
        "yyyy-MM-dd" # full timestamp: yyyy-MM-dd HH:mm:ss.SS
    )
)
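
> For reference, the same epoch-to-day conversion can be sketched in plain Python. One caveat, stated as far as I know: Spark's `from_unixtime` renders the timestamp in the session (or system) time zone, so the daily buckets line up with UTC days only if that time zone is UTC. The helper name `epoch_to_day` and the sample value are illustrative and not taken from the notebook.

```python
from datetime import datetime, timezone

def epoch_to_day(epoch_seconds):
    """Format Unix epoch seconds as 'YYYY-MM-DD', interpreted in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")

# 1398988800 is midnight UTC on 2 May 2014 (an illustrative value).
print(epoch_to_day(1398988800))
print(epoch_to_day(1398988800 - 1))  # one second earlier is the previous day
```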

In [13]:
from pyspark.sql.functions import length

In [14]:
one_perct_df = one_perct_df.withColumn(
    "body_len",
    length(one_perct_df["body"])
)

In [15]:
%%time
yearsMonthDay_df = one_perct_df.groupby(["subreddit","created_utc_yearMonthDay"]).count()

avg_post_df = yearsMonthDay_df.groupby("subreddit").avg()
avg_post_df.orderBy("avg(count)", ascending = False).show(10)

+---------------+------------------+
|      subreddit|        avg(count)|
+---------------+------------------+
|      AskReddit|        8224.86656|
|leagueoflegends|2028.0008517887563|
|          funny|1969.6778027626085|
|           pics|1801.6187141947803|
|  AdviceAnimals| 1565.850047755492|
|     The_Donald| 1517.828947368421|
|            nfl|1289.5458612975392|
|   pcmasterrace|1249.5161048689138|
|         gaming|1198.7903275176002|
|       politics|1142.9749547374774|
+---------------+------------------+
only showing top 10 rows

CPU times: user 34.5 ms, sys: 129 µs, total: 34.6 ms
Wall time: 17.2 s


> Top 10 Subreddits by average number of posts/Submissions per day.
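
> The `avg(count)` column is the result of a two-stage aggregation: first count posts per (Subreddit, day), then average those daily counts per Subreddit. The plain-Python sketch below mirrors that logic on a handful of made-up rows (the `posts` list is hypothetical and not drawn from the dataset):

```python
from collections import Counter, defaultdict

# Hypothetical (subreddit, day) pairs, one per post; not from the dataset.
posts = [
    ("howto", "2014-05-01"), ("howto", "2014-05-01"), ("howto", "2014-05-02"),
    ("skeptic", "2016-07-30"), ("skeptic", "2016-07-30"),
    ("skeptic", "2016-07-30"), ("skeptic", "2016-07-31"),
]

# Stage 1: posts per (subreddit, day), like groupby([...]).count().
daily_counts = Counter(posts)

# Stage 2: gather the daily counts per subreddit, then summarise them.
per_subreddit = defaultdict(list)
for (subreddit, _day), count in daily_counts.items():
    per_subreddit[subreddit].append(count)

avg_per_day = {s: sum(c) / len(c) for s, c in per_subreddit.items()}
max_per_day = {s: max(c) for s, c in per_subreddit.items()}

print(avg_per_day)
print(max_per_day)
```

> Days on which a Subreddit has no posts never appear in the grouped data, so the average is taken over active days only, not over every calendar day.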

In [16]:
%%time
max_post_df = yearsMonthDay_df.groupby("subreddit").max()
max_post_df.orderBy("max(count)", ascending = False).show(10)

+-----------------+----------+
|        subreddit|max(count)|
+-----------------+----------+
|        AskReddit|     23594|
|         politics|     16580|
|              nfl|     13888|
|     pcmasterrace|     12503|
|           gaming|     11901|
|millionairemakers|     11696|
|    SquaredCircle|     11266|
|              nba|     11156|
|              CFB|     10631|
|        pokemongo|     10509|
+-----------------+----------+
only showing top 10 rows

CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 13.6 s


> Top 10 Subreddits by maximum number of posts/Submissions in a single day.

In [17]:
these_subreddits = [
    "worldnews", "news", "AskCulinary",
    "AskHistorians", "howto", "todayilearned",
    "conspiracy", "MilitaryConspiracy", "PedoGate",
    "FalseFlagWatch", "skeptic", "politicalfactchecking",
    "MensRights", "MRActivism", "glasgow",
    "melbourne", "travel", "photography"
]

In [18]:
subreddit_created_df = yearsMonthDay_df.filter(yearsMonthDay_df.subreddit.isin(these_subreddits))

In [19]:
subreddit_created_df.show(3) #df shows the number of posts/Submissions created in a Subreddit on a given day

+-----------+------------------------+-----+
|  subreddit|created_utc_yearMonthDay|count|
+-----------+------------------------+-----+
|      howto|              2014-05-02|   10|
|AskCulinary|              2016-06-20|   33|
|    skeptic|              2016-07-30|   37|
+-----------+------------------------+-----+
only showing top 3 rows



## Calculate submission count IQRs for select Subreddits

In [20]:
# import numpy as np

In [21]:
IQR_dict = {}
for entry in these_subreddits:
    temp_df = subreddit_created_df.filter(subreddit_created_df.subreddit == entry)
    IQR_result = temp_df.approxQuantile(col = "count", probabilities = [0.25, 0.5, 0.75], relativeError = 0)
    print(entry, IQR_result)
    IQR_dict[entry] = IQR_result

worldnews [196.0, 595.0, 1565.0]
news [31.0, 132.0, 1096.0]
AskCulinary [20.0, 27.0, 35.0]
AskHistorians [48.0, 62.0, 81.0]
howto [2.0, 5.0, 9.0]
todayilearned [182.0, 1084.0, 1622.0]
conspiracy [23.0, 113.0, 234.0]
MilitaryConspiracy [1.0, 1.0, 1.0]
PedoGate []
FalseFlagWatch [1.0, 1.0, 1.0]
skeptic [16.0, 28.0, 41.0]
politicalfactchecking [1.0, 2.0, 3.0]
MensRights [28.0, 108.0, 153.0]
MRActivism [1.0, 1.0, 2.0]
glasgow [3.0, 7.0, 12.0]
melbourne [13.0, 34.0, 86.0]
travel [23.0, 49.0, 76.0]
photography [32.0, 83.0, 124.0]
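
> Each list above holds the quartiles [Q1, median, Q3] of the daily counts, not the IQR itself. A small sketch of deriving the IQR (Q3 minus Q1) from those lists, using a subset of the printed values; the `iqr` helper is illustrative and not part of the notebook, and it treats an empty result such as PedoGate's as "no data":

```python
# Quartiles [Q1, median, Q3] copied from the output above (a subset).
quartiles = {
    "worldnews": [196.0, 595.0, 1565.0],
    "news": [31.0, 132.0, 1096.0],
    "AskCulinary": [20.0, 27.0, 35.0],
    "PedoGate": [],  # empty result in the output above
}

def iqr(values):
    """Return Q3 - Q1, or None when no quartiles are available."""
    if len(values) != 3:
        return None
    q1, _median, q3 = values
    return q3 - q1

iqr_by_subreddit = {name: iqr(values) for name, values in quartiles.items()}
print(iqr_by_subreddit)
```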


In [22]:
# IQR_dict

## Calculate submission score IQRs for select Subreddits

In [88]:
subreddit_score_df = one_perct_df.groupby(["subreddit","score"]).count() #df shows the counts of scores awarded to posts/Submissions in a Subreddit
subreddit_score_df.show(3)

+---------------+-----+-----+
|      subreddit|score|count|
+---------------+-----+-----+
|          texas|    6| 1113|
|leagueoflegends|   17|11783|
|          vinyl|    2|32547|
+---------------+-----+-----+
only showing top 3 rows



In [89]:
score_dict = {}
for entry in these_subreddits:
    temp_df = subreddit_score_df.filter(subreddit_score_df.subreddit == entry)
    score_result = temp_df.approxQuantile(col = "count", probabilities = [0.25, 0.5, 0.75], relativeError = 0)
    print(entry, score_result)
    score_dict[entry] = score_result

worldnews [1.0, 3.0, 12.0]
news [1.0, 3.0, 10.0]
AskCulinary [1.0, 4.0, 27.0]
AskHistorians [1.0, 2.0, 10.0]
howto [1.0, 3.0, 20.0]
todayilearned [1.0, 3.0, 14.0]
conspiracy [1.0, 3.0, 17.0]
MilitaryConspiracy [2.0, 3.0, 14.0]
PedoGate []
FalseFlagWatch [1.0, 5.0, 33.0]
skeptic [1.0, 5.0, 30.0]
politicalfactchecking [1.0, 3.0, 16.0]
MensRights [1.0, 4.0, 24.0]
MRActivism [2.0, 3.0, 16.0]
glasgow [1.0, 3.0, 32.0]
melbourne [1.0, 5.0, 44.0]
travel [1.0, 3.0, 16.0]
photography [1.0, 4.0, 27.0]


In [91]:
# score_dict