Reddit-Flair-Detector

A web application that predicts the flair (tag) of any post on the India subreddit using machine learning algorithms.

Execution Instructions

Download Git Large File Storage (LFS) from https://git-lfs.github.com/ if you don't have it already.
Open the Terminal.
Use git lfs install to set up Git LFS for your user account.
Clone the repository by typing git clone https://github.com/Gunnika/Reddit-Flair-Detector.git.
Ensure that Python3 and pip are installed on the device.
Change to the cloned directory by entering cd Reddit-Flair-Detector.
Run pip install -r requirements.txt.
Enter the python shell and import nltk.
Execute nltk.download('stopwords') and nltk.download('punkt'). Exit the shell.
Run python app.py to start the application on localhost.
Go to http://0.0.0.0:5000/ in a web browser to use the application (try http://localhost:5000/ if that address does not load).

Work Flow

The whole process is explained with code in the Jupyter Notebook included in the repository (see the Jupyter Notebooks folder).

Data Acquisition

PRAW (the Python Reddit API Wrapper) was used to extract the data. Several Reddit datasets are also available on BigQuery and Kaggle, but I chose PRAW in order to build my own dataset rather than rely on those ready-made alternatives.

The following attributes of a post were the most useful indicators of its flair:

title
url
text
comments
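
As a rough illustration of how these four attributes could be gathered, here is a minimal sketch. It assumes a PRAW-style submission object; the function name extract_features and the max_comments cap are hypothetical and are not taken from the project's code.

```python
from typing import Any, Dict


def extract_features(submission: Any, max_comments: int = 10) -> Dict[str, str]:
    """Collect title, url, text and comments from a PRAW-style submission.

    max_comments is a hypothetical cap chosen for this sketch.
    """
    # Drop the "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    comments = submission.comments.list()[:max_comments]
    return {
        "title": submission.title,
        "url": submission.url,
        "text": submission.selftext,  # body of a self post; empty for link posts
        "comments": " ".join(comment.body for comment in comments),
    }
```

With PRAW installed and API credentials configured, a submission could be obtained with reddit.submission(url=post_url) on a praw.Reddit instance and passed to this function.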

Exploratory Data Analysis

The initial investigation looked at how the data was distributed across the classes and found it to be imbalanced. The [R]eddiquette class had far fewer posts than the other classes, which can lead a model to treat the minority class as an outlier and ignore it.

The imbalance was traced to the [R]eddiquette flair having been discontinued 7 months ago (at the time of writing), so the class was dropped from the dataset.
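
The check and the clean-up described above can be sketched with the standard library alone. This is only an illustration: the (flair, text) tuple layout and every flair name other than [R]eddiquette are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Post = Tuple[str, str]  # (flair, text); a hypothetical layout for this sketch


def flair_distribution(posts: List[Post]) -> Dict[str, float]:
    """Return the share of posts that carries each flair."""
    counts = Counter(flair for flair, _ in posts)
    total = sum(counts.values())
    return {flair: n / total for flair, n in counts.items()}


def drop_flair(posts: List[Post], flair: str) -> List[Post]:
    """Return the posts without the given flair class."""
    return [post for post in posts if post[0] != flair]


# Toy data: "Politics" is a made-up flair name used only for illustration.
posts = [
    ("Politics", "post one"),
    ("Politics", "post two"),
    ("Politics", "post three"),
    ("[R]eddiquette", "post four"),
]
print(flair_distribution(posts))
posts = drop_flair(posts, "[R]eddiquette")
```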

Data Pre-Processing

The data pre-processing step cleaned the text to make it easier to represent and use. In this step:

Stop words were removed
Text was tokenized into words
Words were converted to lowercase
The remaining useful words were joined back into a single sentence
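
A minimal sketch of these cleaning steps follows. The execution instructions download NLTK's stopwords and punkt data, so the project presumably relies on NLTK; this sketch deliberately swaps in a regular-expression tokenizer and a tiny hand-written stop list so that it runs without any downloads. The name clean_text, the stop list, and the order of the steps are assumptions of the sketch.

```python
import re
from typing import FrozenSet

# A tiny stand-in stop list; NLTK's English stop word list is much longer.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}
)
_TOKEN = re.compile(r"[A-Za-z0-9']+")


def clean_text(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> str:
    """Lowercase, tokenize, drop stop words, and rejoin the remaining words."""
    tokens = _TOKEN.findall(text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


print(clean_text("The Budget is a Disaster for the Middle Class!"))
```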

Building a Flair Detector

The following models were evaluated:

Logistic Regression
Linear Support Vector Machine
Naive Bayes Classifier
Decision Trees
Random Forest

The best results were obtained with Random Forest (62.67% accuracy). Deep learning techniques could improve the accuracy further. For example, BERT (Bidirectional Encoder Representations from Transformers) can be used to generate text embeddings, which may yield better accuracy.
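
To make the comparison concrete, here is a small standard-library sketch of one of the listed model families, a multinomial Naive Bayes classifier with add-one smoothing, together with the accuracy measure used to rank models. It is not the author's implementation (the README does not say which library was used), and the training texts and flair names are toy values invented for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Optional, Sequence


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated tokens."""

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> Optional[str]:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model: NaiveBayes, texts: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy data; the flair names are hypothetical.
train_texts = ["election vote parliament", "minister election policy",
               "cricket match score", "cricket team win"]
train_labels = ["Politics", "Politics", "Sports", "Sports"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("election policy debate"))
```

In practice the accuracy would be measured on held-out posts rather than on the training data.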

Building a Flask Application

A Flask application was developed around the trained model. It also exposes an automated_testing endpoint that returns predictions for an uploaded text file of URLs.
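
The core of such an endpoint can be sketched independently of Flask. The sketch below assumes the uploaded file holds one URL per line and takes the prediction step as a callable. The names predictions_for_upload and predict_flair are hypothetical stand-ins for whatever the application actually uses.

```python
import json
from typing import Callable, Dict


def predictions_for_upload(file_bytes: bytes, predict_flair: Callable[[str], str]) -> str:
    """Turn the bytes of an uploaded URL list into a JSON string of predictions.

    Assumes one URL per line; blank lines are skipped.
    """
    urls = [line.strip() for line in file_bytes.decode("utf-8").splitlines()]
    results: Dict[str, str] = {url: predict_flair(url) for url in urls if url}
    return json.dumps(results)


# Example with a dummy predictor that always answers "Politics".
print(predictions_for_upload(b"https://example.com/post1\n", lambda url: "Politics"))
```

Inside a Flask view, the bytes would typically come from request.files["upload_file"].read(), and predict_flair would fetch the post and run the trained model on it.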

Deploying as a Web Service

The application was then deployed to Heroku.

Automated Testing

Send a POST request to https://redditflair-detector.herokuapp.com/automated_testing with upload_file as the key and a text file of URLs as the value.

The response is a JSON object that maps each URL to its predicted flair.

Note that, because of PRAW's limitations, only about 50 URLs can be processed per request. Larger files may cause Heroku to return a timeout error.
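
One way to stay within the limit of about 50 URLs is to split a long list into batches on the client side and send one request per batch. The sketch below assumes the third-party requests library, one URL per line in the uploaded file, and a 60-second timeout. The helper names, the file name urls.txt, and the timeout are illustrative choices, not part of the project.

```python
from typing import Dict, Iterator, List

ENDPOINT = "https://redditflair-detector.herokuapp.com/automated_testing"
BATCH_SIZE = 50  # the approximate per-request limit stated above


def batches(urls: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def predict_all(urls: List[str]) -> Dict[str, str]:
    """Send each batch as its own upload and merge the JSON responses."""
    import requests  # third-party; imported lazily so batches() works without it

    predictions: Dict[str, str] = {}
    for batch in batches(urls):
        payload = "\n".join(batch).encode("utf-8")
        response = requests.post(
            ENDPOINT,
            files={"upload_file": ("urls.txt", payload)},
            timeout=60,  # assumed value for this sketch
        )
        response.raise_for_status()
        predictions.update(response.json())
    return predictions
```

Calling predict_all(urls) with a list of post URLs would then return a single dictionary covering every batch, provided the service is reachable.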
