SpaCy TextCat Model to predict trigger warning labels of scraped Reddit posts

Statisfied/reddit-textcat

⚠️ Trigger Warning Prediction: Text Classification of Posts scraped from Reddit ⚠️

Overview

Social media platforms such as Instagram, Twitter and Reddit provide invaluable spaces for building communities and seeking peer support with mental health issues. However, adequately moderating content on these platforms and their sub-communities can be time-intensive and emotionally draining. This project demonstrates how to (i) scrape data from Reddit in Python; (ii) clean and format this data; and (iii) build a SpaCy textcat model that can predict the label (trigger warning) of potentially sensitive content.

More details of the project can be found in the slides included in this repo.

Dataset

The scraped dataset comprises around 142,000 documents after cleaning. There is a slight class imbalance: ED and OCD have the most observations, while bipolar and ADHD have the fewest.
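A quick way to inspect this kind of imbalance is to count documents per label. The sketch below is illustrative only: the function name `class_distribution` and the toy label list are assumptions, and only the four label names mentioned above come from this README.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in counts.most_common()
    }


# Hypothetical toy labels; the real dataset has around 142,000 documents.
toy_labels = ["ED", "ED", "ED", "OCD", "OCD", "OCD", "bipolar", "ADHD"]
for label, (count, share) in class_distribution(toy_labels).items():
    print(f"{label:<8} {count:>3} {share:.1%}")
```

Running the same function over the label column of the cleaned dataset would show how far the smallest classes fall behind the largest ones.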

SpaCy Model

⏯ Commands

The following commands are defined in this project. They can be executed using spacy project run [command].

| Command | Description |
| --- | --- |
| convert | Convert the data to spaCy's binary format |
| train | Train the textcat model |
| evaluate | Evaluate the model and export metrics |
| package | Package the trained model as a pip package |
| visualize-model | Visualize the model's output interactively using Streamlit |
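For orientation, the convert step could look roughly like the sketch below. It is not the project's actual script. It assumes each JSONL line has hypothetical `text` and `label` fields and one label per post, and the helper names `make_cats` and `convert` are invented for illustration. The spaCy calls used (`spacy.blank`, `nlp.make_doc`, `doc.cats`, `DocBin.add`, `DocBin.to_disk`) are part of spaCy v3.

```python
import json
from pathlib import Path


def make_cats(label, all_labels):
    """Build a one-hot category dict in the shape spaCy stores in doc.cats."""
    if label not in all_labels:
        raise ValueError(f"unknown label: {label!r}")
    return {name: 1.0 if name == label else 0.0 for name in all_labels}


def convert(jsonl_path, output_path, all_labels, lang="en"):
    """Write a .spacy file from JSONL lines like {"text": ..., "label": ...}."""
    # Imported here so make_cats stays usable without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    with Path(jsonl_path).open(encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            doc = nlp.make_doc(record["text"])
            doc.cats = make_cats(record["label"], all_labels)
            doc_bin.add(doc)
    doc_bin.to_disk(output_path)
```

A call such as `convert("assets/reddit-train.jsonl", "corpus/reddit-train.spacy", labels)` would produce the binary training file. The `corpus/` output path and the contents of `labels` are assumptions, since this README names only four of the classes.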

⏭ Workflows

The following workflow is defined by the project. It can be executed using spacy project run all and runs the specified commands in order.

| Workflow | Steps |
| --- | --- |
| all | convert → train → evaluate → package |

🗂 Assets

The following assets are necessary to run the project. They can be generated by running the Jupyter notebook scraping_reddit.ipynb found in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/reddit-train.jsonl | Local | Training data scraped from Reddit |
| assets/reddit-dev.jsonl | Local | Development data scraped from Reddit |

🗂 Other Data

The following additional files can be generated by running the Jupyter notebook scraping_reddit.ipynb found in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/raw_reddit_dataset.csv | Local | Whole raw dataset scraped from Reddit, exported as CSV |
| assets/cleaned_reddit_dataset.csv | Local | Whole cleaned dataset from Reddit, exported as CSV |
| assets/reddit-train.csv | Local | Training data exported as CSV |
| assets/reddit-dev.csv | Local | Dev data exported as CSV |
| assets/reddit-test.csv | Local | Test data exported as CSV |
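This README does not say how the train, dev and test files were produced. One common approach with imbalanced classes is a stratified split, sketched below. The function name `stratified_split`, the 80/10/10 fractions, and the `label` field are all assumptions for illustration, not the notebook's actual procedure.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/dev/test, keeping label shares roughly equal."""
    # Group records by label so each class is split separately.
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_dev = int(len(group) * fractions[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test
```

Splitting each class separately means the smaller classes keep roughly the same share in all three files.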

Result

The final model performed well on unseen data, with a macro F1 score of 80.79 and an average ROC-AUC score of 0.96. Platform moderators could use this model to streamline the process of sifting through content and reduce their workload. Alternatively, it could be applied automatically to posts so that users can see what a post is about before they read it and act accordingly.
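Macro F1 averages the per-class F1 scores with equal weight, so the smaller classes count as much as the larger ones. The sketch below shows the calculation on toy predictions. The function name `macro_f1` and the example labels are illustrative, and the evaluate command may compute its metrics differently. It returns a value between 0 and 1, whereas the 80.79 above is presumably reported on a 0 to 100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy example with two of the labels named in this README.
y_true = ["ED", "ED", "OCD", "OCD"]
y_pred = ["ED", "OCD", "OCD", "OCD"]
print(round(macro_f1(y_true, y_pred), 4))
```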

A potential limitation of this project is that we did not collect any information on user demographics, so it is difficult to say how well the model developed here would generalise to other (non-Reddit) data.

Future work could also focus on adding an NER or some other sort of keyword extraction component to the SpaCy pipeline in order to further assist moderators in processing content.

📚 References

[1] "Classification of 'Triggering' Content on Social Media" by Keelin Sekerka-Bajbus. Available: https://github.com/ksek87/trigger-warning-classification.

[2] "spaCy Project: Demo Multilabel Textcat (Text Classification)" by ExplosionAI. Available: https://github.com/explosion/projects/tree/v3/pipelines/textcat_multilabel_demo

[3] "Reddit," reddit. [Online]. Available: https://www.reddit.com/.

[4] Pushshift.io. (2019). Pushshift.io. Available: https://pushshift.io/.
