# <font color='magenta'>EDA on Reddit 1% sample </font>

In [1]:
one_perct_df = spark.read.parquet("one_perct_sample.parquet")

> The data is a one percent sample from the Reddit parquet file, which contains more than 2.8 billion records.
> The records are posts/Submissions (in PRAW terms). This is evident from attributes such as 'num_comments', 'name', and 'locked', which are common to posts/Submissions.

In [33]:
one_perct_df.columns

['id',
 'parent_id',
 'subreddit',
 'author',
 'created_utc',
 'body',
 'num_comments',
 'score',
 'created_utc_year',
 'created_utc_yearMonthDay',
 'body_len']

In [2]:
one_perct_df.count() #~300+ million

309811130

# <font color='magenta'>Deleted authors and bots</font>

In [51]:
one_perct_df.groupby("author").count().orderBy("count", ascending = False).show()

+--------------------+--------+
|              author|   count|
+--------------------+--------+
|           [deleted]|35125988|
|       AutoModerator| 1181108|
|         conspirobot|   68608|
|       ModerationLog|   67404|
|         TweetPoster|   66676|
|         autowikibot|   50636|
|    imgurtranscriber|   48604|
|      MTGCardFetcher|   48432|
|          PoliticBot|   47124|
|               RPBot|   45961|
|          dogetipbot|   44796|
| Late_Night_Grumbler|   38604|
|throwthrowawaytothee|   37615|
|    qkme_transcriber|   36277|
| TweetsInCommentsBot|   32089|
|     ImagesOfNetwork|   25972|
|        Franciscouzo|   24230|
|      TotesMessenger|   23998|
|           havoc_bot|   23671|
|        morbiusgreen|   23352|
+--------------------+--------+
only showing top 20 rows



> The top author is a collection of deleted authors and most of the others appear to be bots.

In [69]:
deleted_authors_df = one_perct_df.filter(one_perct_df.author == "[deleted]") #is body content also missing for deleted authors?
deleted_authors_df.groupby("body").count().orderBy("count", ascending = False).show()

+----------+--------+
|      body|   count|
+----------+--------+
| [deleted]|19819730|
| [removed]| 1711094|
|   Thanks!|    4624|
|      Yes.|    4170|
|       No.|    3775|
|Thank you!|    3130|
|       lol|    3035|
|         .|    2747|
|       Yes|    2408|
|        :)|    2324|
|      Why?|    1973|
|        no|    1951|
|       yes|    1925|
|     What?|    1920|
|     &lt;3|    1860|
|        :(|    1825|
|        No|    1819|
|Thank you.|    1782|
|       wat|    1709|
|     Nope.|    1429|
+----------+--------+
only showing top 20 rows



> The clear majority of the posts/Submissions by deleted authors have also been deleted and won't be useful in our subsequent analysis. Let's filter both them and the bots out so our data is both smaller and more representative of the content posted directly by human end users.
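
> As a quick check on the "clear majority" claim, the share can be worked out from the counts printed above. This is only arithmetic on those printed numbers; the variable names are illustrative and nothing here touches the Spark DataFrame.

```python
# Counts copied from the two outputs above.
deleted_author_rows = 35_125_988  # rows whose author is "[deleted]"
body_deleted = 19_819_730         # of those, body == "[deleted]"
body_removed = 1_711_094          # of those, body == "[removed]"

# Rows by deleted authors that carry no usable body text.
missing_body = body_deleted + body_removed
share_missing = missing_body / deleted_author_rows

print(f"{missing_body:,} of {deleted_author_rows:,} rows "
      f"({share_missing:.1%}) have a deleted or removed body")
```

> The remaining rows by deleted authors still have body text, but they are dropped here along with the rest because the author can no longer be identified.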

In [60]:
authors_to_remove = [
    "[deleted]",
    "AutoModerator",
    "autotldr",
    "photography_bot",
    "conspirobot",
    "ModerationLog",
    "TweetPoster",
    "autowikibot",
    "imgurtranscriber",
    "MTGCardFetcher",
    "PoliticBot",
    "RPBot",
    "dogetipbot",
    "qkme_transcriber",
    "TweetsInCommentsBot",
    "havoc_bot"
]

In [72]:
# one_perct_df.filter(one_perct_df.author.isin(authors_to_remove)).count() #36902074

> This step will remove more than 36 million records (36,902,074).

In [73]:
one_perct_df = one_perct_df.filter(~one_perct_df.author.isin(authors_to_remove))

In [74]:
one_perct_df.count()

272909056

## Counts of posts/Submissions per year

In [75]:
%%time
one_perct_df.groupby("created_utc_year").count().orderBy("created_utc_year").show()

+----------------+--------+
|created_utc_year|   count|
+----------------+--------+
|            2005|     101|
|            2006|   36084|
|            2007|  197737|
|            2008|  585008|
|            2009| 1632953|
|            2010| 4540038|
|            2011|11863319|
|            2012|25750321|
|            2013|40857100|
|            2014|56063783|
|            2015|74347450|
|            2016|57035162|
+----------------+--------+

CPU times: user 14.5 ms, sys: 233 µs, total: 14.8 ms
Wall time: 4.83 s


> This nominal 1% stratified sample is off by roughly a factor of ten: the actual yearly counts in the sample are greater than the expected yearly counts by about an order of magnitude, so the sample is closer to 10% of the full data.
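
> A rough sanity check of the sample size, using only the totals quoted in this notebook. It assumes the full dataset holds about 2.8 billion records (the notebook says "+2.8 billion", so this is a lower bound) and uses the row count taken before the deleted authors and bots were filtered out. The per-year expected counts are not available here, so only the overall ratio is computed.

```python
full_dataset_rows = 2_800_000_000  # assumption: lower bound quoted at the top of the notebook
sample_rows = 309_811_130          # one_perct_df.count() before filtering
nominal_fraction = 0.01            # what a true 1% sample would keep

expected_rows = full_dataset_rows * nominal_fraction
actual_fraction = sample_rows / full_dataset_rows
ratio_to_expected = sample_rows / expected_rows

print(f"expected rows at 1%: {expected_rows:,.0f}")
print(f"actual fraction:     {actual_fraction:.1%}")
print(f"actual / expected:   {ratio_to_expected:.1f}x")
```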

# <font color='magenta'>Analyze Subreddits</font>

In [76]:
from pyspark.sql.functions import from_unixtime

In [77]:
one_perct_df = one_perct_df.withColumn(
    "created_utc_yearMonthDay", 
    from_unixtime(
        one_perct_df["created_utc"], 
        "yyyy-MM-dd" # full timestamp: yyyy-MM-dd HH:mm:ss.SS
    )
)

In [78]:
from pyspark.sql.functions import length

In [79]:
one_perct_df = one_perct_df.withColumn(
    "body_len",
    length(one_perct_df["body"])
)

In [80]:
%%time
yearsMonthDay_df = one_perct_df.groupby(["subreddit","created_utc_yearMonthDay"]).count()

avg_post_df = yearsMonthDay_df.groupby("subreddit").avg()
avg_post_df.orderBy("avg(count)", ascending = False).show(10)

+---------------+------------------+
|      subreddit|        avg(count)|
+---------------+------------------+
|      AskReddit|        8224.86656|
|leagueoflegends|2028.0238500851788|
|          funny|1969.7301638291037|
|           pics|1801.6553150859324|
|  AdviceAnimals|1565.8538681948423|
|     The_Donald| 1517.828947368421|
|            nfl|1289.6326621923938|
|   pcmasterrace|1249.5161048689138|
|         gaming|1198.8200183654728|
|       politics|1142.9788774894387|
+---------------+------------------+
only showing top 10 rows

CPU times: user 53.8 ms, sys: 745 µs, total: 54.6 ms
Wall time: 29.2 s


> Top 10 Subreddits by average number of posts/Submissions per day, averaged over the days on which the Subreddit has at least one post in the sample.
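
> The two-level grouping above (first by Subreddit and day, then by Subreddit) can be sketched in plain Python on a few made-up rows. The toy data and variable names are hypothetical; the point is that a day with no posts never produces a row, so the average is taken over active days only.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows, one per post: (subreddit, created_utc_yearMonthDay).
posts = [
    ("AskReddit", "2015-01-05"), ("AskReddit", "2015-01-05"),
    ("AskReddit", "2015-01-05"), ("AskReddit", "2015-01-06"),
    ("travel", "2015-01-05"),
]

# Stage 1: posts per (subreddit, day), analogous to groupby([...]).count().
daily_counts = Counter(posts)

# Stage 2: gather each subreddit's daily counts, then average them,
# analogous to groupby("subreddit").avg() on the count column.
counts_by_subreddit = defaultdict(list)
for (subreddit, _day), n in daily_counts.items():
    counts_by_subreddit[subreddit].append(n)

avg_per_active_day = {
    subreddit: sum(counts) / len(counts)
    for subreddit, counts in counts_by_subreddit.items()
}

print(avg_per_active_day)
```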

In [81]:
%%time
max_post_df = yearsMonthDay_df.groupby("subreddit").max()
max_post_df.orderBy("max(count)", ascending = False).show(10)

+-----------------+----------+
|        subreddit|max(count)|
+-----------------+----------+
|        AskReddit|     23594|
|         politics|     16580|
|              nfl|     13888|
|     pcmasterrace|     12503|
|           gaming|     11901|
|millionairemakers|     11696|
|    SquaredCircle|     11266|
|              nba|     11156|
|              CFB|     10631|
|        pokemongo|     10509|
+-----------------+----------+
only showing top 10 rows

CPU times: user 23.8 ms, sys: 193 µs, total: 24 ms
Wall time: 18.1 s


> Top 10 Subreddits by the maximum number of posts/Submissions made on a single day.

In [82]:
these_subreddits = [
    "worldnews", "news", "AskCulinary",
    "AskHistorians", "howto", "todayilearned",
    "conspiracy", "MilitaryConspiracy", "PedoGate",
    "FalseFlagWatch", "skeptic", "politicalfactchecking",
    "MensRights", "MRActivism", "glasgow",
    "melbourne", "travel", "photography"
]

In [83]:
subreddit_created_df = yearsMonthDay_df.filter(yearsMonthDay_df.subreddit.isin(these_subreddits))

In [84]:
subreddit_created_df.show(3) #df shows the number of posts/Submissions created in a Subreddit on a given day

+----------+------------------------+-----+
| subreddit|created_utc_yearMonthDay|count|
+----------+------------------------+-----+
| worldnews|              2015-01-05| 2159|
|    travel|              2014-09-05|   60|
|MensRights|              2013-11-25|  158|
+----------+------------------------+-----+
only showing top 3 rows



## Calculate submission count IQRs for select Subreddits

In [85]:
import numpy as np

In [86]:
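# Map each Subreddit to the [Q1, median, Q3] of its daily submission counts.
IQR_dict = {}
for entry in these_subreddits:
    temp_df = subreddit_created_df.filter(subreddit_created_df.subreddit == entry)
    # relativeError = 0 requests exact quantiles, which can be expensive on large data.
    IQR_result = temp_df.approxQuantile(col = "count", probabilities = [0.25, 0.5, 0.75], relativeError = 0)
    print(entry, IQR_result)
    IQR_dict[entry] = IQR_result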

worldnews [196.0, 595.0, 1565.0]
news [31.0, 132.0, 1096.0]
AskCulinary [20.0, 27.0, 35.0]
AskHistorians [48.0, 62.0, 81.0]
howto [2.0, 5.0, 9.0]
todayilearned [182.0, 1084.0, 1622.0]
conspiracy [23.0, 113.0, 234.0]
MilitaryConspiracy [1.0, 1.0, 1.0]
PedoGate []
FalseFlagWatch [1.0, 1.0, 1.0]
skeptic [16.0, 28.0, 41.0]
politicalfactchecking [1.0, 2.0, 3.0]
MensRights [28.0, 108.0, 153.0]
MRActivism [1.0, 1.0, 2.0]
glasgow [3.0, 7.0, 12.0]
melbourne [13.0, 34.0, 86.0]
travel [23.0, 49.0, 76.0]
photography [32.0, 83.0, 124.0]
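
> The lists above hold the quartiles [Q1, median, Q3], not the IQR itself. A minimal sketch of the last step, IQR = Q3 - Q1, using a few of the values printed above. The `iqr` helper is hypothetical, and the empty list for PedoGate is presumably because that Subreddit has no rows in the filtered sample.

```python
# Quartiles [Q1, median, Q3] copied from the output above (a subset, for brevity).
quartiles = {
    "worldnews": [196.0, 595.0, 1565.0],
    "AskHistorians": [48.0, 62.0, 81.0],
    "todayilearned": [182.0, 1084.0, 1622.0],
    "PedoGate": [],  # empty result: no quartiles were returned
}

def iqr(q):
    """Return Q3 - Q1, or None when the quartile list is empty or incomplete."""
    if len(q) != 3:
        return None
    q1, _median, q3 = q
    return q3 - q1

iqr_by_subreddit = {name: iqr(q) for name, q in quartiles.items()}

for name, value in iqr_by_subreddit.items():
    print(name, value)
```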


In [87]:
# IQR_dict

## Calculate submission score IQRs for select Subreddits

In [88]:
one_perct_df.columns

['id',
 'parent_id',
 'subreddit',
 'author',
 'created_utc',
 'body',
 'num_comments',
 'score',
 'created_utc_year',
 'created_utc_yearMonthDay',
 'body_len']

In [89]:
one_perct_df.groupby(["subreddit","score"]).count().show(3)

+-------------+-----+-----+
|    subreddit|score|count|
+-------------+-----+-----+
|DirtySnapchat|    1|10701|
|      teenmom|   18|  178|
| iamverysmart|    2| 9843|
+-------------+-----+-----+
only showing top 3 rows

