In [1]:
import polars as pl
## Make polars output wide enough for links and more text
pl.Config.set_fmt_str_lengths(100)

polars.config.Config

In [2]:
## Load the data
df = pl.read_parquet("../reddit_data/reddit.parquet")

In [3]:
df.shape

(5528298, 14)

# Exploring the data
## Non-Unique Values
Some observations about the values in the `reddit_text` column of the dataset:

* There are 38,924 non-unique values that appear in a total of 306,242 rows
* There are 5,222,056 rows with unique values (~94.46\% of all rows)
* Emojis appear frequently, and BERT models treat emojis as "unknown" tokens
* Many non-unique entries appear to be some flavor of "Yes", "No", or "Thank you".

Of the non-unique values:

* The values `""`, `"[deleted]"`, and `"[removed]"` account for a total of 77,914 rows (a little over 25\% of all rows with non-unique values)
    * My initial impression is that rows with values of `"[deleted]"` and `"[removed]"` can safely be removed from the dataset.
    * The value `""` only appears as a *comment* 6 times. In all 42,340 other instances it is a *submission*.
        * Manual inspection of the 6 *comment* instances seems to indicate that these posts were edited to be empty by the author, but not deleted. They can be safely dropped.
        * A cursory inspection of a sample of the *submission* instances seems to indicate that many of these fall into one of these situations:
            * The submission title is the question or statement to which the comments that follow are responding
            * The submission is an image with no text (often still with some information in the title).
        * My initial impression was that we could either just delete these posts, or replace their `reddit_text` value with the value in `reddit_title`
        * These values only make up ~0.77\% of all values in the `reddit_text` column, but in a few subreddits they make up a larger proportion:
            * In `FedEmployees`: 107 of the total 280 posts (~38\%)
            * In `disney`: 2555 of 43954 (~5.8\%)
            * In `RiteAid`: 170 of 3970 (~4.3\%)
            * In `PaneraEmployees`: 115 of 2694 (~4.3\%)
            * In `WalmartEmployees`: 285 of 10752 (~2.7\%)
            * In all others, <2\%.
* Roughly 10,000 additional rows (10,346 by the length-and-frequency heuristic at the end of this section) appear to come from bots or other templated posts and can likely be safely removed.

In [4]:
## Count the frequency of values in the 'reddit_text' column.
value_counts = df["reddit_text"].value_counts().sort(
    by="count", descending=True)

In [5]:
## Observing how many times each non-unique value appears
max_rows = 10
pl.Config.set_tbl_rows(max_rows)
start_at=0
print(value_counts[start_at:start_at+max_rows])
pl.Config.set_tbl_rows(10)

shape: (10, 2)
┌─────────────┬───────┐
│ reddit_text ┆ count │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│             ┆ 42346 │
│ [deleted]   ┆ 23244 │
│ [removed]   ┆ 12324 │
│ Yes         ┆ 3611  │
│ Thank you!  ┆ 3564  │
│ No          ┆ 2617  │
│ Thank you   ┆ 1701  │
│ Thanks!     ┆ 1698  │
│ Lol         ┆ 1672  │
│ Same        ┆ 1444  │
└─────────────┴───────┘


polars.config.Config

In [6]:
## Count the frequency of the frequency of values in the 'reddit_text' column.
## The "count" Series is renamed first so it does not clash with the
## "count" column that value_counts() creates.
value_value_counts = value_counts["count"].alias("val_counts").value_counts().sort(
    by="count", descending=False)

pl.Config.set_tbl_rows(6)
print(value_value_counts)
pl.Config.set_tbl_rows(10)

shape: (240, 2)
┌────────────┬─────────┐
│ val_counts ┆ count   │
│ ---        ┆ ---     │
│ u32        ┆ u32     │
╞════════════╪═════════╡
│ 42346      ┆ 1       │
│ 23244      ┆ 1       │
│ 12324      ┆ 1       │
│ …          ┆ …       │
│ 3          ┆ 5842    │
│ 2          ┆ 23330   │
│ 1          ┆ 5222056 │
└────────────┴─────────┘


polars.config.Config

In [7]:
## Create a df with only the values that are not unique (count > 1)
non_uniuqes = value_counts.filter(pl.col("count") > 1)
non_uniuqes.write_parquet("../temp_data/non_unique_text_values.parquet")

In [8]:
## Check out how many `""` values appear as submissions
df.filter(pl.col("reddit_text")=="").filter(pl.col("aware_post_type")=="submission").shape

(42340, 14)

In [9]:
## Check out how many `""` values appear as comments
df.filter(pl.col("reddit_text")=="").filter(pl.col("aware_post_type")!="submission").shape

(6, 14)

In [10]:
## View the comments with `""` as their 'reddit_text'
df.filter(pl.col("reddit_text")=="").filter(pl.col("aware_post_type")!="submission")

aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission
str,str,str,str,i64,str,str,str,str,str,str,str,str,str
"""comment""","""2023-05-25T05:27:01""","""jljk39m""","""t1_jljk39m""",1685006821,"""Designer-Seesaw1381""","""""","""/r/nursing/comments/13r42vt/not_like_other_nurses/jljk39m/""",,,"""nursing""","""t3_13r42vt""","""t1_jlirq3l""","""13r42vt"""
"""comment""","""2023-05-25T05:29:13""","""jljk8zx""","""t1_jljk8zx""",1685006953,"""Designer-Seesaw1381""","""""","""/r/nursing/comments/13r42vt/not_like_other_nurses/jljk8zx/""",,,"""nursing""","""t3_13r42vt""","""t1_jlis8vs""","""13r42vt"""
"""comment""","""2023-07-30T18:54:45""","""ju4rte7""","""t1_ju4rte7""",1690757685,"""malphasia""","""""","""/r/Target/comments/15dttfc/i_hate_how_our_discount_works_its_absolute_garbage/ju4rte7/""",,,"""Target""","""t3_15dttfc""","""t1_ju4oi61""","""15dttfc"""
"""comment""","""2023-08-30T17:29:12""","""jyfmhiz""","""t1_jyfmhiz""",1693430952,"""IntentOnChivalry""","""""","""/r/fidelityinvestments/comments/165nt3o/fidelity_service_is_unbelievable/jyfmhiz/""",,,"""fidelityinvestments""","""t3_165nt3o""","""t1_jyf1mxb""","""165nt3o"""
"""comment""","""2023-10-12T16:31:37""","""k4m2ci4""","""t1_k4m2ci4""",1697142697,"""JustHereToLearn13""","""""","""/r/sysadmin/comments/176ew3e/tag_printer_issue_need_help/k4m2ci4/""",,,"""sysadmin""","""t3_176ew3e""","""t1_k4lu39g""","""176ew3e"""
"""comment""","""2024-01-15T11:29:57""","""khzfp4f""","""t1_khzfp4f""",1705336197,"""ainyg6767""","""""","""/r/WaltDisneyWorld/comments/1970i8l/weekly_faqs_general_discussion_thread/khzfp4f/""",,,"""WaltDisneyWorld""","""t3_1970i8l""","""t3_1970i8l""","""1970i8l"""


In [11]:
## Look at how many rows are associated with each subreddit
pl.Config.set_tbl_rows(34)
print(df["reddit_subreddit"].value_counts().sort(
    by="count", descending=True))
pl.Config.set_tbl_rows(10)

shape: (34, 2)
┌─────────────────────┬────────┐
│ reddit_subreddit    ┆ count  │
│ ---                 ┆ ---    │
│ str                 ┆ u32    │
╞═════════════════════╪════════╡
│ nursing             ┆ 789499 │
│ walmart             ┆ 630962 │
│ sysadmin            ┆ 557558 │
│ starbucks           ┆ 393597 │
│ WaltDisneyWorld     ┆ 373138 │
│ Target              ┆ 340401 │
│ UPSers              ┆ 262483 │
│ Disneyland          ┆ 231981 │
│ Lowes               ┆ 198805 │
│ CVS                 ┆ 179598 │
│ McDonaldsEmployees  ┆ 174679 │
│ cybersecurity       ┆ 161868 │
│ Fedexers            ┆ 154572 │
│ GameStop            ┆ 137071 │
│ starbucksbaristas   ┆ 132019 │
│ fidelityinvestments ┆ 129423 │
│ Bestbuy             ┆ 121077 │
│ wholefoods          ┆ 82052  │
│ Panera              ┆ 79436  │
│ DisneyWorld         ┆ 65549  │
│ DollarTree          ┆ 59745  │
│ TjMaxx              ┆ 46286  │
│ disney              ┆ 43954  │
│ McLounge            ┆ 38627  │
│ GeneralMotors       ┆ 3727

polars.config.Config

In [12]:
## Create a dataframe with just the submissions with '""' as their 'reddit_text'
blanks = df.filter(
    pl.col("reddit_text")=="").filter(
    pl.col("aware_post_type")=="submission")[['reddit_subreddit',
                                              'reddit_name',
                                              'reddit_created_utc',
                                              'reddit_permalink',
                                              'reddit_text',
                                              'reddit_title']]

## Count how many blank texts are in each subreddit
blank_count = blanks["reddit_subreddit"].value_counts()
## Count how many total texts are in each subreddit
total_count = df["reddit_subreddit"].value_counts().rename({"count": "total_count"})

## Join the total text count onto the blank text count by subreddit, so the
## two counts stay aligned even if a subreddit has no blank submissions
blank_count = blank_count.join(total_count, on="reddit_subreddit", how="inner")

## Calculate the fraction of texts in each subreddit that are blank
## and print sorted by that percentage
pl.Config.set_tbl_rows(34)
print(blank_count.with_columns(
    pct=blank_count["count"]/blank_count["total_count"]
    ).sort(by="pct", descending=True))
pl.Config.set_tbl_rows(10)

shape: (34, 4)
┌─────────────────────┬───────┬─────────────┬──────────┐
│ reddit_subreddit    ┆ count ┆ total_count ┆ pct      │
│ ---                 ┆ ---   ┆ ---         ┆ ---      │
│ str                 ┆ u32   ┆ u32         ┆ f64      │
╞═════════════════════╪═══════╪═════════════╪══════════╡
│ FedEmployees        ┆ 107   ┆ 280         ┆ 0.382143 │
│ disney              ┆ 2555  ┆ 43954       ┆ 0.058129 │
│ RiteAid             ┆ 170   ┆ 3970        ┆ 0.042821 │
│ PaneraEmployees     ┆ 115   ┆ 2694        ┆ 0.042687 │
│ WalmartEmployees    ┆ 285   ┆ 10752       ┆ 0.026507 │
│ BestBuyWorkers      ┆ 104   ┆ 5629        ┆ 0.018476 │
│ McLounge            ┆ 650   ┆ 38627       ┆ 0.016828 │
│ DisneyWorld         ┆ 971   ┆ 65549       ┆ 0.014813 │
│ walmart             ┆ 8078  ┆ 630962      ┆ 0.012803 │
│ TjMaxx              ┆ 530   ┆ 46286       ┆ 0.011451 │
│ CVS                 ┆ 2045  ┆ 179598      ┆ 0.011387 │
│ cybersecurity       ┆ 1837  ┆ 161868      ┆ 0.011349 │
│ Panera        

polars.config.Config

In [13]:
## Add a column that holds the length in bytes (UTF-8) of the non-unique texts
non_uniuqes_lens = non_uniuqes.with_columns(
    text_len = non_uniuqes["reddit_text"].str.len_bytes()
)

In [14]:
## Trying to identify likely bot posts: texts longer than 33 bytes
## that appear more than 20 times
likely_bots = non_uniuqes_lens.filter(
    (pl.col("text_len") > 33) & (pl.col("count") > 20)
).sort(by="text_len", descending=True)

pl.Config.set_fmt_str_lengths(5000)
pl.Config.set_tbl_rows(75)
likely_bots

reddit_text,count,text_len
str,u32,u32
"""I'm going to point you to the usual resources I use for newer folks: 1. [The forum FAQ](https://reddit.com/r/cybersecurity/w/faq) 2. [This blog post on getting started](https://bytebreach.com/?p=72) 3. [This blog post on other/alternative resources](https://bytebreach.com/?p=95) 4. [These links to career roadmaps](https://www.reddit.com/r/cybersecurity/comments/smbnzt/mentorship_monday/hw8mw4k/) 5. [These training/certification roadmaps](https://www.reddit.com/r/cybersecurity/comments/sgmqxv/mentorship_monday/hv7ixno/) 6. [These links on learning about the industry](https://www.reddit.com/r/cybersecurity/comments/sb7ugv/mentorship_monday/hux2869/) 7. [This list of InfoSec projects to pad an entry-level resume](https://www.reddit.com/r/cybersecurity/comments/sxir9c/as_a_entry_level_professional_trying_to_get_into/hxsm5qn/) 8. [This extended mentorship FAQ](https://bytebreach.com/mentorship/) 9. [These links for interview prep](https://old.reddit.com/r/cybersecurity/comments/ybwsz9/mentorship_monday_post_all_career_education_and/itqbzq4/) Early on, you're going to want to learn more about the industry in order to help inform your decision about whether or not InfoSec is for you; such knowledge will also help guide your initial career trajectory based on what roles/responsibilities look attractive. (see links 3, 4, and 6). If you think that you do want to pursue a career, then you'll want to buoy your knowledge base with understanding IT/CS fundamentals more broadly. Some people pursue degrees, as an example ([although this is certainly not the only approach worth considering](https://bytebreach.com/?p=142)). (see links 1, 2, and 5). Eventually you'll need to work on improving your employability. This manifests in a variety of ways, but the most notable is probably accumulating [relevant industry-recognized certifications](https://bytebreach.com/?p=152). 
(see links 5 and 7) Other actions to improve your employability may include: * [Continue to leverage free resources to hone your craft or acquire new skills](https://bytebreach.com/hacking-helpers-learn-cybersecurity/). * [Pursue in-demand certifications to improve your employability](https://www.reddit.com/r/cybersecurity/comments/sgmqxv/mentorship_monday/hv7ixno/). * [Vie for top placement in competitive CTF competitions](https://ctftime.org/). * Foster a professional network via [jobs listings sites](https://www.weidert.com/blog/best-ways-to-gain-more-connections-on-linkedin) and [in-person conferences](https://infosec-conferences.com/). * Continue the job hunt for relevant experience and [take note of the feedback you receive in interviews](https://www.reddit.com/r/cybersecurity/comments/vg864z/mentorship_monday_post_all_career_education_and/id2tsr3/); consider expanding the aperture of jobs considered to include cyber-adjacent lines of work (software dev, systems administration, etc.) - this is a channel for you to build relevant years of experience. * Consider pursuing a degree-granting program (and internship experience while holding a student status). * [Post your resume to this thread for constructive feedback](https://bytebreach.com/how-to-write-an-infosec-resume/). * [Apply your skills into some projects in order to demonstrate your expertise](https://www.reddit.com/r/cybersecurity/comments/sxir9c/as_a_entry_level_professional_trying_to_get_into/hxsm5qn/).""",24,3380
"""Please post all your **general** WDW comments and **FAQs** here, as well as any **reopening-related** questions, discussion, etc. **Examples might include things like:** * What should I do to **prepare** for **the weather (heat, rain, tropical storm, etc.)** during my upcoming trip? * How do I get **tickets** for an **after-hours event,** such as **Mickey’s Not-So-Scary Halloween Party (MNSSHP), Jollywood Nights,** or **Mickey’s Very Merry Christmas Party (MVMCP)**? What happens if they’re **sold out** on the night we want to attend? * How does the **TRON/Guardians of the Galaxy virtual queue** work? Will I have issues **fitting in the ride vehicle**? Will I experience **motion sickness**? * I'm thinking about taking a **solo trip**. **Should I** do it? Any **tips** or **advice**? * What type of **shoes/backpacks** do you recommend for the parks? * How/when can I purchase/upgrade an **Annual Pass (AP)**? * When will my **MagicBand+ order ship/arrive?** * How do I use the **park reservation system**? Do you think **more reservations** will **open up** for HS/MK/AK/Epcot? * How does **park hopping** work now? What happens if the **park** I want to **hop to** is **at capacity**? * How does the **application/approval** process work for **Disability Access Services (DAS)**? * Is the **""magic"" gone**? Is a trip to WDW still **worth it** right now? * How does **Genie+** and/or **Lightning Lane** work? Are they **worth the price**? * Has \[x\] **reopened** yet? * What's the best way to get a **dining reservation (ADR)** for a certain restaurant? What if an **ADR** isn't available to accommodate the **size of my party**? * Do you feel **safe traveling to WDW** right now? Should I **cancel my upcoming trip**? * Do you think **park hours** will be extended for my upcoming trip? * When do you think **dining plans** will return? * What are the **crowds** and **wait-times** like at the parks right now? 
If you submit a FAQ or reopening-related post and it's removed from the sub, please feel free to resubmit it in this thread. If you'd like to chat about reopening procedures or other FAQs in real-time, come visit us on our [Discord server](https://discord.com/invite/reddit-waltdisneyworld)!""",23,2223
"""This post links to The Hacker News (THN). The moderators of r/cybersecurity strive to maintain a professional subreddit which will often discuss news, and further acknowledge that THN is a popular source of news within the cybersecurity community at large. We always wish to act in the best interests of the community and will not restrict news content which is accurate and valuable. However, it has come to our attention that THN has been accused of plagiarism since at least 2012 (ref: [attrition.org](https://attrition.org/errata/plagiarism/thehackernews/)), allegedly copying article contents from original authors and modifying them without appropriately crediting the original source. Their behavior has been met with repeated criticism, including making false statements (ref: [@thegrugq](https://twitter.com/thegrugq/status/902600568262107136)) and renewed claims of plagiarism (refs: [news.ycombinator.com](https://news.ycombinator.com/item?id=18783493) c. 2018, [reddit.com](https://reddit.com/r/privacy/comments/mczutz/the_hacker_news_profiting_off_extensive/) c. 2021). Due to these incidents, THN links have been banned from several subreddits including r/privacy, r/technology, and r/hacking. We would hope that THN is now appropriately crediting sources of its content or writing its own original content, however we are unable to police each and every article. Please ensure that the information in this article is factual, and where possible, please choose to support high-quality ethical journalism directly. If the community feels this warning is no longer relevant, we will remove this AutoModerator action. Thank you. *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/cybersecurity) if you have any questions or concerns.*""",36,1830
"""Welcome to [/r/Disneyland](https://www.reddit.com/r/Disneyland)! This thread is here to help you plan your trip and get as much advice as possible straight from our Reddit community. We know you've probably got a million questions for us, so we'd like to take a moment to remind you to check out the [FAQ](https://www.reddit.com/r/Disneyland/wiki/index), where you can find many pages about various topics here to help you with your vacation from start to finish! Individual posts dedicated to trip planning are not allowed except on rare occasions, DM the mod team for permission or make a post over at /r/DisneyPlanning. If you have a question that you'd like answered ASAP, visit our Discord server and navigate to the #park-questions channel. [https://discord.gg/rdisneyland](https://discord.gg/rdisneyland) Any questions about reopening procedures can also check out our [Explain Like I'm Goofy thread](https://old.reddit.com/r/Disneyland/comments/mvmji2/explain_like_im_goofy_disneyland_resort_reopening/), which includes an in-depth guide to the parks. Happy planning, and we'll see you real soon!""",54,1114
"""**ATTENTION BEST BUY EMPLOYEES** [As per Rule #4](https://www.reddit.com/r/Bestbuy/about/rules/), all Best Buy cost information *(real or fake)* is confidential. Therefore, you may not disclose the Employee Discount purchase prices to anyone other than current Best Buy employees. In order to enforce this (since this is a public forum), all submissions and comments disclosing such information will be removed by the moderators of this subreddit. Current employees can view the Employee Discount Policy by clicking [here](https://hr.bestbuy.com/documents/10180/24173/Employee+Discount+Policy/f3dfd9f2-9151-4f7a-94dd-0c6a229546f1). In order to view the link, you may have to login to the HR website and then click the link a second time. The policy can also be found by logging into the HR website and navigating to `Policies > Workplace Expectations > Employee Discount Policy`. *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Bestbuy) if you have any questions or concerns.*""",132,1064
"""There is a great deal of user-generated content out there, from scripts and software to tutorials and videos, but we've generally tried to keep that off of the front page due to the volume and as a result of community feedback. There's also a great deal of content out there that violates our advertising/promotion rule, from scripts and software to tutorials and videos. We have received a number of requests for exemptions to the rule, and rather than allowing the front page to get consumed, we thought we'd try a weekly thread that allows for that kind of content. We don't have a catchy name for it yet, so please let us know if you have any ideas! In this thread, feel free to show us your pet project, YouTube videos, blog posts, or whatever else you may have and share it with the community. Commercial advertisements, affiliate links, or links that appear to be monetization-grabs will still be removed.""",32,915
"""**Welcome to the Wholefoods weekly discussion thread.** 🤙 It is best that new Team Members contact their own leadership before asking work questions in r/wholefoods. Whole Foods is made up of a dozen different regions, each with its own interpretation of company policies. Most work questions should be directed to your team leadership or team member services representative. If you need to post work questions, post them in this weekly thread. Do not make separate posts for company policy questions. In an effort to keep this subs **quality** up this is a reminder to please read and **follow the rules** listed in the sidebar. Please report any offenses to the mods. Thanks 🏄🏖️""",61,694
"""Welcome to Target!! You might be interested in our [Guide to Store Roles](https://www.reddit.com/r/Target/wiki/storeroleindex) - an index which answers to ""What's it like to be a ____?"" for every job inside a Target store, written by Target employees. Also, be sure to check out our [Frequently Asked Questions](https://www.reddit.com/r/Target/wiki/futureemployeefaq) to see if your question is already answered. We hope you find the answer your looking for! Good luck at Target and on r/Target! *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Target) if you have any questions or concerns.*""",1147,679
"""Hello, everyone. Please keep all discussions focused on *cybersecurity*. We are implementing a *zero tolerance policy* on any political discussions or anything that even looks like baiting. This subreddit also does not support hacktivism of any kind. Any political discussions, any baiting, any conversations getting out of hand will be met by a swift ban. This is a trying time for many people all over the world, so please try to be civil. Remember, attack the argument, not the person. *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/cybersecurity) if you have any questions or concerns.*""",76,677
"""This post was automatically removed because your account is less than 24 hours old. Don't panic though! If you have read our FAQ below as well our subreddit's rules, you may message our moderators about getting your comment restored. In the meantime, check out our FAQ for Future/Potential Employees [here](https://www.reddit.com/r/Target/wiki/futureemployeefaq) and our FAQ for Current Team Members [here](https://www.reddit.com/r/Target/wiki/futureemployeefaq). *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Target) if you have any questions or concerns.*""",80,652


In [15]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
likely_bots["count"].sum()

10346