# Screamers
Screamers: It is well known that WRITING IN ALL CAPS ONLINE IS A SUBSTITUTE FOR SCREAMING… OR YELLING. *cough!*. (Or some might say it’s simply cruise control for cooooool). Write a job to find users that scream a lot, and provide a screamer score (a highly-technical metric that you will invent).
* For future reference (when we really want to get something off our chest), what are the top 5 subreddits for scream-y comments?

In [9]:
import re
import pandas as pd
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col, desc
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType, BooleanType

sqlContext = SQLContext(sc)

df = sqlContext.read.json("hdfs://orion11:15001/sampled_reddit/*")
# All fields available in the dataset, kept for reference; only three are selected below
columns = [
    "distinguished",
    "downs",
    "created_utc",
    "controversiality",
    "edited",
    "gilded",
    "author_flair_css_class",
    "id",
    "author",
    "retrieved_on",
    "score_hidden",
    "subreddit_id",
    "score",
    "name",
    "author_flair_text",
    "link_id",
    "archived",
    "ups",
    "parent_id",
    "subreddit",
    "body"]

df = df.select("author", "body", "subreddit")

In [10]:
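# TODO: make this filter more robust.
# A comment passes only if every character is an uppercase letter or one of the
# listed symbols. Note that `=-_` inside the character class is parsed as a range
# (`=` through `_`), so a literal hyphen is rejected, as are spaces and digits;
# in practice only single-word comments pass.
def filter_func(val):
    res = re.search(r"[^A-Z!@#$%^&*()>?\"\'=-_+{}]+", val)
    return res is None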

filter_udf = udf(filter_func, BooleanType())

df_filtered = df.filter(filter_udf(df["body"]))
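
One possible way to make the filter more robust, as the comment above asks, is to look only at the letters in a comment: require a minimum number of them and reject the comment if any is lowercase. This is a minimal sketch, not the filter used to produce the results in this notebook; the name `is_scream` and the `min_letters` threshold are assumptions chosen for illustration.

```python
def is_scream(body, min_letters=2):
    """Return True when a comment contains letters and none are lowercase.

    Hypothetical alternative to filter_func: spaces, digits, and punctuation
    are ignored, so multi-word comments can qualify, while punctuation-only
    comments such as "?" are rejected.
    """
    if body is None:
        # Missing bodies are never screams
        return False
    letters = [ch for ch in body if ch.isalpha()]
    if len(letters) < min_letters:
        return False
    return all(ch.isupper() for ch in letters)


examples = ["NOOOOOOOOOOOOOOOOO", "WHY ARE YOU YELLING?", "?", "Hello", None]
flags = [is_scream(text) for text in examples]
```

Because it is a plain Python function, it could be wrapped with `udf(is_scream, BooleanType())` in the same way as `filter_func`. Doing so would change every count reported below.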

In [11]:
df_filtered.show()

+--------------------+------------------+-------------------+
|              author|              body|          subreddit|
+--------------------+------------------+-------------------+
|          hubilation|NOOOOOOOOOOOOOOOOO|         reddit.com|
|         tvreference|           PENCIL?|             gaming|
|         frostysauce|          WINNING!|cripplingalcoholism|
|    repulsethemonkey|         DAAAAANG!|               keto|
|         sushiaddict|               O_O|               pics|
|           Karl-Marx|            REPOST|             wowbro|
|    HowardDeanScream|            BYEAH!|               pics|
|             Nefandi|               LOL|           politics|
|            fatmarik|       HYPERTONIC!|      todayilearned|
|              eyyyyy|              BAMP|               pics|
|          bizzybinnc|          FLAWLESS|           gonewild|
|Ceci_Nest_Pas_Sparta|                 ?|               IAmA|
|            lachiemx|            HAHAHA|          worldnews|
+--------------------+------------------+-------------------+

In [12]:
screamers_count = df_filtered.groupBy('author').count()
total_count = df.groupBy('author').count()

# Rename cols, in prep for join
screamers_count = screamers_count.withColumnRenamed("count", "scream_count")
total_count = total_count.withColumnRenamed("count", "total_count")

In [15]:
# Join the two DataFrames and create the screamer score,
# which for now is just the ratio of screaming comments to total comments


joined_df = screamers_count.join(total_count, "author")
joined_df = (joined_df
    .withColumn("screamer_score", col("scream_count") / col("total_count"))
    .orderBy(["scream_count", "screamer_score"], ascending=[False, False]))

# Throw out users whose comments are all screams (likely bots),
# as well as comments from [deleted] authors
joined_df = joined_df.filter("scream_count != total_count")
joined_df = joined_df.filter("author != '[deleted]'")
joined_df.show(n=10)

+---------------+------------+-----------+--------------------+
|         author|scream_count|total_count|      screamer_score|
+---------------+------------+-----------+--------------------+
| atomicimploder|        1320|      17074| 0.07731053063136933|
|yes_it_is_weird|         910|       1143|  0.7961504811898513|
|    Sir_toolman|         748|        857|  0.8728121353558926|
|    UnluckyLuke|         655|      17090| 0.03832650672908133|
|   KingCaspianX|         494|       7511| 0.06577020370123818|
|  TheNitromeFan|         491|      12853|  0.0382011981638528|
|     bowloclock|         447|        514|  0.8696498054474708|
|   redditmortis|         436|       2458|   0.177379983726607|
|      Maniac_34|         385|       1807| 0.21306032097399003|
|     davidjl123|         381|      17677|0.021553431012049557|
+---------------+------------+-----------+--------------------+
only showing top 10 rows
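
To make the score concrete, here is a small plain-Python sketch that recomputes it for three rows copied from the output above. The helper name `screamer_score` is only illustrative; the Spark job computes the same ratio with a column expression.

```python
# Rows copied from the output above: (author, scream_count, total_count)
rows = [
    ("atomicimploder", 1320, 17074),
    ("yes_it_is_weird", 910, 1143),
    ("Sir_toolman", 748, 857),
]


def screamer_score(scream_count, total_count):
    """Fraction of a user's comments that passed the scream filter."""
    return scream_count / total_count


scores = {author: screamer_score(s, t) for author, s, t in rows}

# Ranking by score alone reorders the users: the author with the most
# screams has the lowest ratio of the three.
by_score = sorted(scores, key=scores.get, reverse=True)
```

Sorting by `scream_count` first, as the cell above does, favors prolific users, while sorting by the ratio favors users who scream in most of their comments.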



## Top Screamers
The table above shows the top 10 screamers, sorted by total number of screams and then by the screamer score that we assign them. Next we will show the subreddits with the most scream-y comments.

In [14]:
subreddits_count = df_filtered.groupBy('subreddit').count()
subreddits_count.orderBy("count", ascending=False).show(n=6)

+---------------+-----+
|      subreddit|count|
+---------------+-----+
|      AskReddit|69437|
|       AskOuija|32103|
|          funny|23764|
|           pics|20616|
|            nfl|18312|
|leagueoflegends|18312|
+---------------+-----+
only showing top 6 rows



## Most Angry Subreddit
The table above shows the subreddits with the most scream-y comments. The one subreddit that I would ignore from this table is "AskOuija", which is a subreddit where users work together to spell out words one letter at a time. Because these comments are only one letter long and don't seem to convey much anger, I did not think it belonged here, which is why six rows are shown. That leaves AskReddit, funny, pics, nfl, and leagueoflegends as the top 5.
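
The exclusion described above can be sketched in plain Python using the counts copied from the table. The names `EXCLUDED` and `top_subreddits` are assumptions for illustration and are not part of the original job.

```python
# Counts copied from the table above
counts = {
    "AskReddit": 69437,
    "AskOuija": 32103,
    "funny": 23764,
    "pics": 20616,
    "nfl": 18312,
    "leagueoflegends": 18312,
}

# Subreddits whose all-caps comments are not really screams
EXCLUDED = {"AskOuija"}


def top_subreddits(counts, excluded, n=5):
    """Return the n highest-count (subreddit, count) pairs, skipping exclusions."""
    kept = [(name, c) for name, c in counts.items() if name not in excluded]
    # sorted() is stable, so subreddits with equal counts keep their input order
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:n]


top5 = top_subreddits(counts, EXCLUDED)
```

In the Spark job itself, the same effect could probably be achieved by filtering `subreddits_count` with `col("subreddit") != "AskOuija"` before ordering and calling `show(n=5)`.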