Kartikaggarwal98/Reddit_Flair_Detection

Reddit Flair Detector

A web app that predicts the flair of a Reddit post using NLP-based algorithms. The website can be accessed at Flair Detector.

| Section | Description |
|---|---|
| Directory Structure | View the file structure of the repository |
| Online Demo | Experiment with the model's prediction capabilities |
| Automated Testing | Automatically predict the flairs from a collection of URLs |
| Installation | How to install the package |
| Model Architectures | Architectures (with pretrained weights) |
| Data | Collection and cleaning of data |
| Performance | Classification performance of the different models used |

Directory Structure


├── notebooks
|   ├── EDA_cleaned.ipynb
|   ├── base_models.ipynb
|   ├── bert+fastai.ipynb
|   ├── lstm_keras.ipynb
|   ├── ulmfit.ipynb
├── auto_test
|   ├── automated_testing.py
|   ├── autotest_file.txt
├── templates
|   ├── index.html
|   ├── predict.html
├── app.py
├── get_data.py
├── prediction.py
├── requirements.txt

Online Demo

The flair predictor can be accessed at Link. Just enter the URL of any post from r/india and click on predict.

The resulting page will show the predicted flair of the post along with the actual flair from the page.

The models have only been trained on the following flairs: AskIndia, Business/Finance, Food, Non-Political, Photography, Policy/Economy, Politics, Science/Technology, Sports. Posts with any other flair will therefore not be predicted accurately.

Automated Testing

To predict flairs for multiple URLs without entering each one on the page, you can use the web app's automated testing endpoint. Simply send a POST request to https://flairr.herokuapp.com/automated_testing. In Python, you can use the requests library to do so (see automated_testing.py).

Alternatively, if you have cloned the repository, you can place a text file named file.txt containing line-separated URLs in the auto_test folder and then run:

python auto_test/automated_testing.py

You will get a JSON response with URLs as keys and predicted flairs as values.
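As a rough illustration of the request described above, the sketch below reads line-separated URLs and posts them to the endpoint as an uploaded text file. The form field name `upload_file`, the upload format, and the helper names `load_urls` and `request_flairs` are assumptions made for this example; auto_test/automated_testing.py is the authoritative reference for what the endpoint actually expects.

```python
import io

# Endpoint quoted in the section above.
ENDPOINT = "https://flairr.herokuapp.com/automated_testing"


def load_urls(text):
    """Return the non-empty, whitespace-stripped lines of a URL list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def request_flairs(urls, endpoint=ENDPOINT, field_name="upload_file"):
    """POST the URLs as an uploaded text file and return the decoded JSON.

    ASSUMPTION: the form field name and the file-upload format are guesses;
    check auto_test/automated_testing.py for the real ones.
    """
    import requests  # third-party: pip install requests

    payload = "\n".join(urls).encode("utf-8")
    files = {field_name: ("file.txt", io.BytesIO(payload), "text/plain")}
    response = requests.post(endpoint, files=files, timeout=30)
    response.raise_for_status()
    return response.json()  # described above as {url: predicted_flair, ...}
```

A call would then look like `request_flairs(load_urls(open("auto_test/file.txt").read()))`, with the result being a dictionary you can iterate over to print each URL next to its predicted flair.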

Installation

This repo has been tested with Python 3.6+ on Unix/Linux systems. You should use this repository in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

  1. Run the following command to clone the repository:

git clone https://github.com/Kartikaggarwal98/Reddit_Flair_Detection

  2. Move into the cloned repo using:

cd Reddit_Flair_Detection/

  3. To create a python3 virtual environment, execute:

virtualenv -p python3 env

  4. Activate the environment by running:

source env/bin/activate

  5. Now install the libraries:

pip install -r requirements.txt

  6. To use the app, run python app.py and open the localhost URL that is displayed.

Model Architectures

The data was used to train various architectures, such as SVM, Logistic Regression, and Naive Bayes, as well as recent language-model-based architectures such as ULMFit and BERT.

The following notebooks can be used to train the models:

  1. base_models.ipynb: Linear SVM, Random Forest, Naive Bayes, Logistic Regression, and SVM, trained on various count-based and frequency-based vectors.

  2. lstm_keras.ipynb: Various types of LSTM models trained using the Keras library.

  3. ulmfit.ipynb: ULMFit (Paper Link) language model and classifier trained using our collected data.

  4. bert+fastai.ipynb: BERT (Paper Link) Transformer model trained using the fastai library.
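To make the "count and frequency based vectors" used by the base models more concrete, here is a standard-library sketch of the two ideas: raw term counts (bag of words) and TF-IDF reweighting with smoothed inverse document frequency. The tokenizer, the function names, and the example titles are illustrative assumptions; the notebook's actual vectorizers and settings are not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def count_vectors(docs):
    """Bag of words: one Counter of raw term counts per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tfidf_vectors(docs):
    """Reweight term counts by smoothed inverse document frequency.

    idf(term) = ln((1 + n_docs) / (1 + doc_freq)) + 1, no normalization.
    """
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document,
    # so this counts how many documents contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        for c in counts
    ]


# Hypothetical post titles, for illustration only.
titles = ["cricket match today", "budget and tax today"]
counts = count_vectors(titles)
tfidf = tfidf_vectors(titles)
```

Terms that appear in every document (here "today") keep a weight equal to their raw count, while rarer terms such as "cricket" are weighted up, which is what lets a linear classifier focus on flair-specific vocabulary.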

Data

The data used for training the models was collected from the subreddit r/india. To maintain balance among classes, 200 posts for each flair (AskIndia, Business/Finance, Food, Non-Political, Photography, Policy/Economy, Politics, Science/Technology, Sports) were extracted using the Reddit API.

The details of the data collection can be found in get_data.ipynb. The following attributes were recorded for each post: title, score, id, body, author, flair, url, number of comments, creation date, and top 10 comments. The data was then saved to a CSV file using the pandas library.
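The attribute list above maps naturally onto one CSV row per post. The sketch below shows that flattening step only, using the standard-library csv module in place of pandas so that it runs without extra dependencies. The column names, the dict-shaped post, and the choice to join comments into a single cell are assumptions for illustration, not the layout used by the actual collection script.

```python
import csv
import io

# Column names mirror the attributes listed above; the exact names used in
# the real CSV file are an assumption.
COLUMNS = ["title", "score", "id", "body", "author", "flair", "url",
           "num_comments", "created", "comments"]


def post_to_row(post, max_comments=10):
    """Flatten one post (a plain dict here) into a CSV-ready row."""
    row = {name: post.get(name, "") for name in COLUMNS}
    # Assumes the comment list is already ordered with top comments first;
    # keep the first `max_comments` and join them into one cell.
    row["comments"] = " ".join(post.get("comments", [])[:max_comments])
    return row


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With pandas, the equivalent of `rows_to_csv` would be building a `DataFrame` from the list of rows and calling `to_csv`.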

After collecting the data, Exploratory Data Analysis (EDA) was carried out to obtain a clear understanding of the data. The complete analysis is shown in EDA_cleaned.ipynb.

Performance

The performance of all the models was measured using the weighted F1 score, as accuracy can be a misleading metric on unbalanced datasets. All models were trained using 3 types of text features:

  1. Title
  2. Title + body
  3. Title + body + top 10 comments
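The three feature sets above amount to concatenating progressively more text per post. A minimal sketch, assuming each post is a dict with hypothetical keys `title`, `body`, and `comments` (a list ordered with the top comments first); the key names and the space separator are illustrative and not taken from the notebooks:

```python
def build_text_features(post, max_comments=10):
    """Return the three text inputs: title, title+body, title+body+comments."""
    title = post.get("title", "")
    body = post.get("body", "")
    comments = " ".join(post.get("comments", [])[:max_comments])
    return {
        "title": title,
        # Skip empty parts so link-only posts do not get stray spaces.
        "title_body": " ".join(p for p in (title, body) if p),
        "title_body_comments": " ".join(
            p for p in (title, body, comments) if p
        ),
    }


# Hypothetical post, for illustration only.
example = build_text_features(
    {"title": "Best biryani in town?", "body": "", "comments": ["Try A", "Try B"]}
)
```

Each model can then be trained three times, once per key, which is how one row in a results table ends up with three scores.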

The following table lists the performance of all the models:

| Model | Title | Title + Body | Title + Body + Comments |
|---|---|---|---|
| Naive Bayes | 0.51 | 0.49 | 0.47 |
| Logistic Regression | 0.53 | 0.55 | 0.63 |
| Random Forest | 0.49 | 0.37 | 0.43 |
| SVM | 0.54 | 0.55 | 0.54 |
| LSTM | 0.23 | 0.31 | 0.29 |
| Bi-LSTM | 0.22 | 0.24 | 0.30 |
| ULMFit | 0.45 | 0.41 | 0.61 |
| BERT | 0.59 | - | - |
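
For reference, the weighted F1 score used for the scores above can be computed as follows: F1 is calculated per class and the per-class values are averaged with weights proportional to each class's number of true examples. This is a standard-library sketch of that definition (the same one scikit-learn uses for `f1_score(..., average="weighted")`); the labels in the example are made up and are not taken from the project's data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n_true / total
    return score


# Made-up labels, for illustration only.
y_true = ["Food", "Food", "Food", "Sports"]
y_pred = ["Food", "Food", "Sports", "Sports"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.767
```

In this toy example, accuracy is 0.75 while weighted F1 is about 0.767, because the per-class F1 values (0.8 for Food, about 0.667 for Sports) are weighted by class size.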