# Classifying NSFW content

We want a way to classify whether Bluesky posts contain NSFW content. We generally know NSFW content when we see it, but I'd like to define it more precisely.

According to [Merriam Webster](https://www.merriam-webster.com/dictionary/NSFW), NSFW means "not safe for work; not suitable for work —used to warn someone that a website, email attachment, etc., is not suitable for viewing at most places of employment". The [Wikipedia page](https://en.wikipedia.org/wiki/Not_safe_for_work) is slightly more descriptive, defining NSFW as:

```plaintext
Not safe for work (NSFW) is Internet slang or shorthand used to mark links to content, videos, or website pages the viewer may not wish to be seen viewing in a public, formal or controlled environment. The marked content may contain graphic violence, pornography, profanity, nudity, slurs or any other potentially disturbing subject matter. This may also include illegal activity such as piracy or inappropriate topic searches such as how to grow plant medicines, instructions for hacking or home-made explosives. Environments that may be problematic include workplaces, schools, and family settings. NSFW has particular relevance for people trying to make personal use of the Internet at workplaces or schools which have policies prohibiting access to sexual and graphic subject matter.
```

## Our operational definition of NSFW content

We'll keep our definition of NSFW content brief but specific. We're not trying to solve content moderation in general; we just need something that is good enough for our use case.

Let's say that for our use case, NSFW content includes content such as:
- Graphic violence
- Pornography / sexual content and materials
- Slurs

In particular, we will most likely encounter sexual materials, so we want our NSFW classifier to be most robust to that type of content. Thus, we'll focus our work, as a first pass, on filtering out sexual content.

### Method 1: Using existing Bluesky content moderation features

Bluesky already provides a few tools that will help us with our task, as part of their [moderation](https://docs.bsky.app/docs/advanced-guides/moderation) features.

#### Labels

Bluesky uses labels, which users or labelers can apply to posts. The labels most relevant to our use case include `porn`, `nudity`, and `sexual`; these are added to posts (often by the post authors themselves) to flag sexual content.


In [2]:
from services.filter_raw_data.classify_nsfw_content.constants import LABELS_TO_FILTER

In [3]:
LABELS_TO_FILTER

['porn', 'furry', 'sexual', 'nudity', 'gore', 'graphic-media']

Let's grab a few posts and see what their labels are.

In [4]:
from services.sync.stream.database import FirehosePost

In [35]:
labels: list[dict] = [
    label.__data__ for label in FirehosePost.select(FirehosePost.labels)
]

In [36]:
distinct_labels: list[str] = []
for label in labels:
    post_label = label["labels"]
    if post_label:
        distinct_labels.append(post_label)

In [37]:
distinct_labels: set[str] = set(distinct_labels)

In [38]:
distinct_labels

{'nudity', 'porn', 'sexual'}

We can easily filter out these posts by checking whether `post["labels"] in ["nudity", "porn", "sexual"]`. We define the list of labels to filter as a constant (`LABELS_TO_FILTER`) and check whether a post's label is any of those.
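Here is a minimal sketch of that label check. In the sample above, `labels` is either `None` or a single string; the handling of comma-separated strings and lists below is an assumption in case the field looks different elsewhere. `normalize_labels` and `post_has_nsfw_label` are hypothetical helper names, not functions from the codebase.

```python
from typing import Iterable, Union

# Same values as the LABELS_TO_FILTER constant printed earlier.
LABELS_TO_FILTER = ["porn", "furry", "sexual", "nudity", "gore", "graphic-media"]


def normalize_labels(raw: Union[str, Iterable[str], None]) -> set[str]:
    """Return the post's labels as a set of lowercase strings.

    Assumption: `raw` is None, a single string (possibly comma-separated),
    or an iterable of strings.
    """
    if not raw:
        return set()
    if isinstance(raw, str):
        raw = raw.split(",")
    return {label.strip().lower() for label in raw if label and label.strip()}


def post_has_nsfw_label(post: dict) -> bool:
    """Return True if any of the post's labels is in LABELS_TO_FILTER."""
    return not normalize_labels(post.get("labels")).isdisjoint(LABELS_TO_FILTER)
```

Filtering is then a list comprehension, e.g. `[post for post in posts if not post_has_nsfw_label(post)]`.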

### Method 2: AI-powered NSFW filter

We could use an AI model to power an NSFW filter. Before doing so, there are a few things to consider:

- The community on Bluesky is more inclusive of niche interests and less spammy. Sexual content is unlikely to come from OnlyFans or spam accounts; it is more likely to come from sex-positive users who want to build communities with other sex-positive users. Generally, I've observed that these groups enjoy the communities they create but understand that other users may not want that content in their own feeds, so they label their content so that others can filter it out.
- User-powered labels should be able to account for most NSFW images.
- Bluesky is much smaller than Twitter, and the proliferation of NSFW content seen on Twitter doesn't really happen on Bluesky.
- We can likely use a simple classifier or language model to detect if text has NSFW content (which should be better than a simple keyword-based approach, though it's likely that a keyword-based approach would get a lot of the way there).

#### Trying out ML filtering

##### Hugging Face models
We can try something off-the-shelf to see how well it works. There's [this model](https://huggingface.co/michellejieli/NSFW_text_classifier) on Hugging Face, fine-tuned on 14,000 Reddit posts, that seems to work quite well. It appears to be the [most downloaded](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=nsfw) NSFW text classifier on Hugging Face. [This](https://www.linkedin.com/in/michellejieli/) looks like the author's LinkedIn profile; she looks like a good engineer, so I'm inclined to trust the model. I can use it for now, and as long as the NSFW classifications make sense, I think we can just roll with it. Let me see how it does.

In [1]:
from transformers import pipeline

In [2]:
model_name = "michellejieli/NSFW_text_classifier"
classifier = pipeline("sentiment-analysis", model=model_name)

In [3]:
classifier("I see you've set aside this special time to humiliate yourself in public.")

[{'label': 'NSFW', 'score': 0.9677894115447998}]

In [4]:
def text_is_nsfw_ml(text: str) -> bool:
    """Uses the classifier to check if the text is NSFW or not."""
    result = classifier(text)
    return result[0]["label"] == "NSFW"

Let's see how this does on a dataset. We don't have any tuned examples, and I'd rather not manually curate examples from Bluesky, so we'll use some training data as a proxy.

Let's also see how the classifier does at scale.

In [5]:
from ml_tooling.inference_helpers import classify_posts, generate_batches_of_posts
from services.sync.stream.helper import get_posts_as_list_dicts

In [6]:
num_posts = 50000
posts: list[dict] = get_posts_as_list_dicts(k=num_posts)

In [39]:
def preprocess_post(post: dict) -> dict:
    """Do any preprocessing needed for the posts.

    We only need the following fields:
    - id
    - uri
    - text
    - labels

    We also want to remove any weird spacing or any newline characters.
    """
    return {
        "id": post["id"],
        "uri": post["uri"],
        # cleaning needed for language detection
        "text": post["text"].replace("\n", " ").strip(),
        # field needed for NSFW classification
        "labels": post["labels"]
    }

In [8]:
preprocessed_posts: list[dict] = [preprocess_post(post) for post in posts]

In [9]:
BATCH_SIZE = 1000

In [10]:
batches: list[list[dict]] = generate_batches_of_posts(
    posts=preprocessed_posts, batch_size=BATCH_SIZE
)

In [11]:
def clf_post_nsfw(post: dict) -> dict:
    """Classify the post as NSFW or not."""
    post_text = post["text"]
    post["is_nsfw"] = text_is_nsfw_ml(post_text)
    return post

In [None]:
# hf_nsfw_labels: list[dict] = classify_posts(
#     posts=preprocessed_posts, clf_func=clf_post_nsfw,
#     batch_size=BATCH_SIZE, rate_limit_per_minute=None
# )

This takes a long time to run (>8 minutes). Let's just QA a few results manually.

In [12]:
clf_post_nsfw(preprocessed_posts[2])

{'id': 63378,
 'uri': 'at://did:plc:m6cii5ipk35g32e4kz6lzklt/app.bsky.feed.post/3kpen5qta6z2p',
 'text': '魚拓  なんで地味に伸びたんだ',
 'is_nsfw': True}

Let's now classify a subset of the posts.

In [12]:
subset_preprocessed_posts = preprocessed_posts[:100]

In [13]:
hf_nsfw_labels: list[dict] = classify_posts(
    posts=subset_preprocessed_posts, clf_func=clf_post_nsfw,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 0 minutes, 3 seconds
Memory usage for classify_posts: 104.4375 MB


Let's take a look at some of the classifications

In [15]:
nsfw_posts = [post for post in hf_nsfw_labels if post["is_nsfw"]]
num_nsfw_posts = len(nsfw_posts)

In [16]:
num_nsfw_posts

70

In [17]:
nsfw_posts[0:3]

[{'id': 63378,
  'uri': 'at://did:plc:m6cii5ipk35g32e4kz6lzklt/app.bsky.feed.post/3kpen5qta6z2p',
  'text': '魚拓  なんで地味に伸びたんだ',
  'is_nsfw': True},
 {'id': 63380,
  'uri': 'at://did:plc:5alnjqwjwsli7lgc4updk5a6/app.bsky.feed.post/3kpen5qzg722j',
  'text': '多分おれが天底信者すぎるのもよくない 天底抜く選択肢は普通にあるのにもはや固定枠みたいな扱いしてる 絶対間違ってる',
  'is_nsfw': True},
 {'id': 63381,
  'uri': 'at://did:plc:7dgqrqv5w4jzaolkze2qgb4c/app.bsky.feed.post/3kpen5qvvbe2s',
  'text': 'Los co-alcaldes municipales del Partido DEM entran en los Ayuntamientos ganados democráticamente acompañados de cientos de personas y ponen fin al gobierno fiduciario.   anfespanol.com/elecciones-t...',
  'is_nsfw': True}]

Looks like the classifier gets confused if the text is not English. Let's remove non-English posts.

In [13]:
from services.filter_raw_data.classify_language.helper import classify_language_of_posts



Our English classifier worked well in our previous testing. Let's grab only the posts that are written in English.

In [14]:
posts_with_language_labels: list[dict] = classify_language_of_posts(
    posts=preprocessed_posts
)

Execution time for classify_posts: 0 minutes, 2 seconds
Memory usage for classify_posts: 8.578125 MB


In [15]:
posts_with_language_labels = [
    {**post, **label}
    for post, label in zip(preprocessed_posts, posts_with_language_labels)
]

In [16]:
posts_with_language_labels[0]

{'id': 63376,
 'uri': 'at://did:plc:sb6fu4sinwphqpvoznvz7efo/app.bsky.feed.post/3kpen5qxtnc2c',
 'text': "I'm having a lot of fun with this photo box",
 'is_english': True}

In [17]:
english_posts = [post for post in posts_with_language_labels if post["is_english"]]

In [19]:
len(english_posts)

12028

So ~12,000 of our 50,000 posts (about 24%) have been classified as English.

Let's grab a subset of the English posts and classify them.

In [20]:
subset_english_posts = english_posts[:100]

In [21]:
hf_nsfw_labels: list[dict] = classify_posts(
    posts=subset_english_posts, clf_func=clf_post_nsfw,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 0 minutes, 3 seconds
Memory usage for classify_posts: 83.21875 MB


So, it took 3 seconds and 83 MB to classify 100 posts. That's not very efficient. Still, let's take a look at a few samples.
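For a sense of scale, here is a rough extrapolation from that 100-post run. It assumes throughput stays constant at the measured rate, which ignores fixed overhead and variation in text length, so treat the numbers as order-of-magnitude only. `estimate_runtime_minutes` is a hypothetical helper.

```python
def estimate_runtime_minutes(
    n_posts: int, sample_size: int, sample_seconds: float
) -> float:
    """Linearly extrapolate runtime from a timed sample of posts."""
    seconds_per_post = sample_seconds / sample_size
    return n_posts * seconds_per_post / 60


# 100 posts took about 3 seconds in the run above.
print(round(estimate_runtime_minutes(12_028, 100, 3), 1))  # 6.0 (English posts only)
print(round(estimate_runtime_minutes(50_000, 100, 3), 1))  # 25.0 (all posts)
```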

In [22]:
nsfw_posts = [post for post in hf_nsfw_labels if post["is_nsfw"]]

In [23]:
len(nsfw_posts) # 56/100 seems awfully high

56

In [25]:
nsfw_posts[0:5]

[{'id': 63390,
  'uri': 'at://did:plc:6sfrfhw2vlakpteolhj75stl/app.bsky.feed.post/3kpen5tl3s42a',
  'text': 'This sort of lines up with the top 10 consumption choices to radically reduce personal emissions, at least as far as mobility is concerned. 🤔  iopscience.iop.org/article/10.1...',
  'is_english': True,
  'is_nsfw': True},
 {'id': 63416,
  'uri': 'at://did:plc:4rfp3cq3ycsfg6owkqrex5ls/app.bsky.feed.post/3kpen5x76kp2j',
  'text': 'Agreed! 🤢',
  'is_english': True,
  'is_nsfw': True},
 {'id': 63418,
  'uri': 'at://did:plc:gyjeilekf6276652rhhvjs5c/app.bsky.feed.post/3kpen5xlwsl2l',
  'text': '',
  'is_english': True,
  'is_nsfw': True},
 {'id': 63422,
  'uri': 'at://did:plc:ev4c5s5yuffsikvdhcp4riri/app.bsky.feed.post/3kpen5ydus32n',
  'text': "protip: you won't make friends by following them around everywhere online and invading their personal space.  ass kissing does NOT help either. people pleasing and acting all clingy n shit is a huge turn off.   pls stop.",
  'is_english': True

OK, these just seem like nonsense results. Plus, classifying these posts is heavy on both time and resources. It may be impractical to use ML unless we find other ways to speed up our inference.
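If we do revisit the model, two cheap changes might help. One is to skip empty text entirely (one of the flagged posts above has `'text': ''`). The other is to call the classifier once per batch instead of once per post. The sketch below assumes `classifier` accepts a list of strings and returns one `{"label": ..., "score": ...}` dict per string, which is how a Hugging Face pipeline should behave when given a list. `min_score` is an arbitrary placeholder, and `chunk` and `classify_posts_batched` are hypothetical names. A score threshold won't fix confident misclassifications like the 0.97 example earlier, and I haven't measured whether batching actually speeds things up.

```python
from typing import Callable, Iterator


def chunk(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_posts_batched(
    posts: list[dict],
    classifier: Callable[[list[str]], list[dict]],
    batch_size: int = 32,
    min_score: float = 0.9,  # placeholder threshold, not tuned
) -> list[dict]:
    """Return copies of the posts with an `is_nsfw` field added.

    Posts with empty or whitespace-only text are never sent to the
    classifier and are marked as not NSFW.
    """
    labeled: list[dict] = []
    for batch in chunk(posts, batch_size):
        texts = [post["text"] for post in batch if post["text"].strip()]
        # One classifier call per batch; skip the call if nothing to score.
        results = iter(classifier(texts)) if texts else iter(())
        for post in batch:
            if post["text"].strip():
                result = next(results)
                is_nsfw = result["label"] == "NSFW" and result["score"] >= min_score
            else:
                is_nsfw = False
            labeled.append({**post, "is_nsfw": is_nsfw})
    return labeled
```

With the pipeline defined earlier, this would be called as `classify_posts_batched(english_posts, classifier)`.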

### Takeaways and next steps

#### Current approach
For now, we can rely on a keyword- and label-based approach. My rough estimate is that this gets us about 80% of the way there in filtering NSFW content.
- The communities that post NSFW content kindly label their content so that we can filter it accordingly.
- Otherwise, we can also apply simple filters on the text for particular keywords.

Later, we can experiment with language models.

Let's set up what these steps would look like:

In [26]:
from services.sync.stream.helper import get_posts_as_list_dicts

In [27]:
posts = get_posts_as_list_dicts(k=50000)

In [40]:
preprocessed_posts = [preprocess_post(post) for post in posts]

In [41]:
posts_are_english_labels = classify_language_of_posts(preprocessed_posts)

Execution time for classify_posts: 0 minutes, 2 seconds
Memory usage for classify_posts: 0.078125 MB


In [42]:
posts_with_labels = [
    {**post, **label}
    for post, label in zip(preprocessed_posts, posts_are_english_labels)
]

In [43]:
english_posts = [post for post in posts_with_labels if post["is_english"]]

In [44]:
from pprint import pprint
print(len(english_posts))
pprint(english_posts[0])

12028
{'id': 63376,
 'is_english': True,
 'labels': None,
 'text': "I'm having a lot of fun with this photo box",
 'uri': 'at://did:plc:sb6fu4sinwphqpvoznvz7efo/app.bsky.feed.post/3kpen5qxtnc2c'}


In [None]:
def classify_post_nsfw(post: dict) -> dict:
    pass
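The stub above is left unimplemented. One possible shape for it, combining the label check with a simple keyword match, is sketched below. The keyword list is a hypothetical placeholder (the real list would need to be curated), `labels` is assumed to be `None` or a single string as in the sample above, and `classify_post_nsfw_sketch` and the `nsfw_reason` field are names made up for illustration.

```python
import re

# Same values as the LABELS_TO_FILTER constant printed earlier.
LABELS_TO_FILTER = ["porn", "furry", "sexual", "nudity", "gore", "graphic-media"]

# Hypothetical placeholder keywords, for illustration only.
NSFW_KEYWORDS = ["nsfw", "porn", "onlyfans"]

# Match whole words only, case-insensitively.
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(keyword) for keyword in NSFW_KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def classify_post_nsfw_sketch(post: dict) -> dict:
    """Return a copy of the post with `is_nsfw` and `nsfw_reason` fields.

    The label is checked first, since author-provided labels are the
    stronger signal; the keyword match is a fallback.
    """
    if post.get("labels") in LABELS_TO_FILTER:
        reason = "label"
    elif KEYWORD_PATTERN.search(post.get("text") or ""):
        reason = "keyword"
    else:
        reason = None
    return {**post, "is_nsfw": reason is not None, "nsfw_reason": reason}
```

Recording the reason makes it easier to QA later which rule flagged a post.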

#### Things to consider including later on

##### 1. Other features of Bluesky's moderation services
Bluesky uses ["stackable moderation"](https://bsky.social/about/blog/03-12-2024-stackable-moderation), which lets users define what moderation looks like for their own feeds by subscribing to various lists and accounts that have their own rules for which posts to filter and remove. Bluesky also has a tool called [Ozone](https://github.com/bluesky-social/ozone), which powers their crowdsourced content moderation. We can consider adopting some of these features:
- [Lists](https://docs.bsky.app/docs/tutorials/user-lists), which would allow us to subscribe to lists of users. Some communities, such as #furry, have kindly added their users to these lists so that their content can be easily filtered out.
- More [labels](https://docs.bsky.app/docs/advanced-guides/moderation) added to the set we filter on.
- Other moderation tools that Bluesky has.

Bluesky's moderation architecture is described in more detail [here](https://docs.bsky.app/blog/blueskys-moderation-architecture).
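As a rough sketch of how list-based filtering could work on our side: the post URIs we already store begin with the author's DID (e.g. `at://did:plc:.../app.bsky.feed.post/...`), so if we had the set of DIDs on a moderation list, we could drop posts by those authors. How we would fetch the list members is not covered here; `listed_dids` is assumed to be available, and both function names are hypothetical.

```python
from typing import Optional

AT_URI_PREFIX = "at://"


def author_did_from_uri(uri: str) -> Optional[str]:
    """Extract the author DID from a post URI.

    Returns None if the URI doesn't have the form at://did:...
    """
    if not uri.startswith(AT_URI_PREFIX):
        return None
    authority = uri[len(AT_URI_PREFIX):].split("/", 1)[0]
    return authority if authority.startswith("did:") else None


def filter_out_listed_authors(posts: list[dict], listed_dids: set[str]) -> list[dict]:
    """Keep only posts whose author DID is not in `listed_dids`."""
    return [
        post for post in posts
        if author_did_from_uri(post["uri"]) not in listed_dids
    ]
```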

##### 2. Using a language model for NSFW classification
A language model would likely do pretty well on NSFW classification, even in a zero-shot context. I'll have to build out more of the LLM infrastructure (and also just learn more about LLM engineering in the first place) in order to do it efficiently.


##### 3. Using images for NSFW classification
We could eventually use images as a feature for NSFW classification. Doing so requires that we:
- Store the images (there are space constraints, and we'd have to build the architecture for this, since the firehose provides no way of obtaining them).
- Efficiently classify the images (most images are OK).

Since the communities on Bluesky that post NSFW content normally add labels accordingly and since there's not really much spam content (yet?) on Bluesky, we likely don't need to classify NSFW content ourselves.

##### 4. How does our approach handle indirect speech (e.g., figures of speech, sarcasm)?
How will our model (and more generally, our filtering steps) handle sarcasm, irony, and other figurative language?