Reddit Flair Detector

A web app that predicts the flair of a Reddit post using NLP-based algorithms. The website can be accessed at Flair Detector.

Section	Description
Directory Structure	View the file structure of the repository
Online Demo	Experiment with the model's prediction capabilities
Automated Testing	Automatically predict the flairs for a collection of URLs
Installation	How to install the package
Model Architectures	Architectures (with pretrained weights)
Data	Collection and cleaning of data
Performance	Classification performance of different models used

Directory Structure


├── notebooks
|   ├── EDA_cleaned.ipynb
|   ├── base_models.ipynb
|   ├── bert+fastai.ipynb
|   ├── lstm_keras.ipynb
|   ├── ulmfit.ipynb
├── auto_test
|   ├── automated_testing.py
|   ├── autotest_file.txt
├── templates
|   ├── index.html
|   ├── predict.html
├── app.py
├── get_data.ipynb
├── prediction.py
├── requirements.txt

Online

The flair predictor can be accessed at Link. Just enter the URL of any post from r/india and click on predict.

The resulting page will show the predicted flair of the post along with the actual flair from the page.

The models have only been trained on the following flairs: AskIndia, Business/Finance, Food, Non-Political, Photography, Policy/Economy, Politics, Science/Technology, Sports. Posts with any other flair will therefore not be predicted accurately.

Automated Testing

To predict flairs for multiple URLs without entering each one on the page, use the web app's automated testing endpoint: send a POST request to https://flairr.herokuapp.com/automated_testing. In Python, you can do this with the requests library (see automated_testing.py).

Alternatively, if you have cloned the repository, put a text file of line-separated URLs named file.txt in the auto_test folder and then run:

python auto_test/automated_testing.py

You will get a JSON response with the URLs as keys and the predicted flairs as values.
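As a rough illustration of the POST request described above, the sketch below uploads a file of line-separated URLs with the requests library and returns the decoded JSON. The form field name `upload_file` and the helper names `read_urls` and `predict_flairs` are assumptions made for this example, not taken from the repository; check automated_testing.py for the field the endpoint actually expects.

```python
from pathlib import Path

ENDPOINT = "https://flairr.herokuapp.com/automated_testing"


def read_urls(path):
    """Return the non-empty lines of a line-separated URL file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def predict_flairs(path, endpoint=ENDPOINT, field_name="upload_file"):
    """POST the URL file to the endpoint and return its JSON response.

    field_name is an assumption: the multipart form field is not
    documented in this README, so adjust it to match the server.
    """
    import requests  # third-party: pip install requests

    with open(path, "rb") as handle:
        response = requests.post(endpoint, files={field_name: handle}, timeout=60)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()  # described above as {url: predicted_flair, ...}
```

Calling `predict_flairs("auto_test/file.txt")` would then return the URL-to-flair mapping, provided the field name matches; `read_urls` is only a convenience for checking the file before sending it.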

Installation

This repo has been tested with Python 3.6+ on Unix/Linux systems. You should use this repository in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

Run the following command to clone the repository:

git clone https://github.com/Kartikaggarwal98/Reddit_Flair_Detection

Move into the cloned repo:

cd Reddit_Flair_Detection/

To create a Python 3 virtual environment, run:

virtualenv -p python3 env

Activate the environment by running:

source env/bin/activate

Now install the libraries:

pip install -r requirements.txt

To use the app, run python app.py and open the localhost URL that is displayed.

Model Architectures

Several architectures were trained on the data: classical models such as SVM, Logistic Regression and Naive Bayes, and recent language-model-based architectures such as ULMFit and BERT.

The following notebooks can be used to train the models:

base_models.ipynb: Linear SVM, Random Forest, Naive Bayes, Logistic Regression and SVM, using various count- and frequency-based vectors.
lstm_keras.ipynb: Various types of LSTM models trained using the Keras library.
ulmfit.ipynb: ULMFit (Paper Link) language model and classifier trained on our collected data.
bert+fastai.ipynb: BERT (Paper Link) Transformer model trained using the fastai library.
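To make the idea of a count-based baseline concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes over raw word counts with add-one smoothing. It is not the code from base_models.ipynb; the class name `CountNaiveBayes`, the whitespace tokenizer and the toy posts are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


class CountNaiveBayes:
    """Multinomial Naive Bayes over raw word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # label -> number of posts
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy posts invented for this sketch, not taken from the collected data.
texts = [
    "india wins the cricket match",
    "new budget raises income tax",
    "best biryani recipe at home",
    "cricket team selected for world cup",
]
labels = ["Sports", "Policy/Economy", "Food", "Sports"]
model = CountNaiveBayes().fit(texts, labels)
print(model.predict("cricket world cup final"))
```

A frequency-based variant would replace the raw counts with TF-IDF weights before fitting a linear classifier.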

Data

The data used for training the models was collected from the subreddit r/india. To keep the classes balanced, 200 posts for each flair (AskIndia, Business/Finance, Food, Non-Political, Photography, Policy/Economy, Politics, Science/Technology, Sports) were extracted using the Reddit API.

Details of the data collection can be found in get_data.ipynb. The following attributes were taken for each post: title, score, id, body, author, flair, url, number of comments, creation date and top 10 comments. The data was then saved to a CSV file using the pandas library.
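The collection step could look roughly like the sketch below, which assumes the PRAW client for the Reddit API. The function names, the `flair:"..."` search query, the `|||` comment separator and the use of the standard csv module (instead of pandas, to keep the example dependency-free) are assumptions for illustration; get_data.ipynb remains the authoritative version. "Top 10 comments" is interpreted here as the first 10 top-level comments in the order PRAW returns them.

```python
import csv

FIELDS = ["title", "score", "id", "body", "author", "flair", "url",
          "num_comments", "created_utc", "top_comments"]


def submission_to_row(submission, max_comments=10):
    """Flatten one PRAW-style submission object into a plain dict."""
    comments = [c.body for c in list(submission.comments)[:max_comments]]
    return {
        "title": submission.title,
        "score": submission.score,
        "id": submission.id,
        "body": submission.selftext,
        "author": str(submission.author),  # "None" for deleted accounts
        "flair": submission.link_flair_text,
        "url": submission.url,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc,
        "top_comments": " ||| ".join(comments),
    }


def fetch_flair(reddit, flair, limit=200):
    """Yield rows for up to `limit` r/india posts carrying the given flair."""
    query = f'flair:"{flair}"'  # Reddit search syntax; an assumption here
    for submission in reddit.subreddit("india").search(query, limit=limit):
        submission.comments.replace_more(limit=0)  # drop "load more" stubs
        yield submission_to_row(submission)


def save_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Here `reddit` is expected to be a `praw.Reddit` instance created with your own API credentials; calling `fetch_flair` once per flair and passing the combined rows to `save_rows` would give one CSV with up to 200 posts per class.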

After collecting the data, Exploratory Data Analysis (EDA) was carried out to get a clear understanding of the data. The complete analysis is shown in EDA_cleaned.ipynb.

Performance

The performance of all the models was measured using the weighted F1 score, as accuracy can be a misleading metric on unbalanced datasets. All models were trained using three types of text features:

Title
Title + body
Title + body + top 10 comments
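The three feature sets listed above amount to concatenating progressively more text per post. A minimal sketch, assuming each post is a dict with the hypothetical keys `title`, `body` and `top_comments` (the actual column names in the collected CSV may differ):

```python
def _join(*parts):
    """Join the non-empty parts with single spaces."""
    return " ".join(p.strip() for p in parts if p and p.strip())


def build_text_features(post):
    """Return the three input texts for one post, keyed by feature set."""
    title = post.get("title")
    body = post.get("body")
    comments = post.get("top_comments")
    return {
        "title": _join(title),
        "title_body": _join(title, body),
        "title_body_comments": _join(title, body, comments),
    }
```

Link posts with an empty body simply yield the same text for the first two feature sets.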

The following table lists the performance of all the models:

Model	Title	Title + Body	Title + Body + Comments
Naive Bayes	0.51	0.49	0.47
Logistic Regression	0.53	0.55	*0.63*
Random Forest	0.49	0.37	0.43
SVM	0.54	0.55	0.54
LSTM	0.23	0.31	0.29
Bi-LSTM	0.22	0.24	0.30
ULMFit	0.45	0.41	0.61
BERT	0.59	-	-
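For reference, the weighted F1 score used above is the mean of the per-class F1 scores, each weighted by the number of true posts in that class. The pure-Python sketch below shows the computation; the function name `weighted_f1` is chosen for this example, and in practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` would normally be used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # class -> number of true examples
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true  # weight each class by its support
    return total / len(y_true)


# Toy labels invented for this sketch.
y_true = ["Food", "Food", "Sports", "Politics"]
y_pred = ["Food", "Sports", "Sports", "Politics"]
print(round(weighted_f1(y_true, y_pred), 2))  # 0.75
```

In the toy example, Food and Sports each score an F1 of 2/3 and Politics scores 1, so the weighted mean is (2 × 2/3 + 2/3 + 1) / 4 = 0.75.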
