Reddit_Flair_Detector

A web application that detects the flair of r/india subreddit posts using machine learning algorithms. The application can be found live at Reddit Flair Detector.

About

This repository illustrates the process of scraping Reddit posts from the subreddit r/india, preprocessing and cleaning the text, building a classifier that assigns each post to one of 7 flairs, and deploying a suitable machine learning model as a web application.

Dependencies

The following dependencies can be found in requirements.txt:

beautifulsoup4
bs4
Flask
gunicorn
html5lib
json5
nltk
numpy
numpydoc
pandas
path
pathlib2
praw
prawcore
py
requests
scikit-learn
scipy
unicodecsv
win-unicode-console

Approach

After going through the available literature on text processing and on machine learning algorithms suited to text classification, I based my approach on several models, with accompanying code snippets: Naive Bayes, Linear SVM, Random Forest, Multi-Layer Perceptron and Logistic Regression. The test accuracies obtained are listed in the results table below.

Approach taken for the task:

Collected r/india posts using the Pushshift Reddit API.
The data includes the flair, score, URL, title and creation time of each post.
Only the title was used for the detection task.
The following ML algorithms were then applied to the dataset:

Naive Bayes
Linear Support Vector Machine
Logistic Regression
Random Forest
MLP
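
The comparison of these five algorithms can be sketched with scikit-learn roughly as follows. This is a minimal illustration rather than the notebook code from this repository: the toy titles and flair labels are made up, and the TF-IDF features, the train/test split and all hyperparameters are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration; the real dataset holds r/india titles and flairs.
titles = [
    "Parliament passes new budget bill",
    "Election results announced in the state",
    "Minister addresses parliament on policy",
    "Opposition party protests new bill",
    "State election campaign begins today",
    "Government announces new policy for farmers",
    "Parliament session ends after budget debate",
    "Party leader wins state election",
    "Cricket team wins the test match",
    "Hockey team qualifies for the final",
    "Batsman scores century in cricket match",
    "Football league match ends in a draw",
    "Cricket team announces squad for series",
    "Bowler takes five wickets in the match",
    "Hockey match final scheduled for Sunday",
    "Football team wins league title",
]
flairs = ["Politics"] * 8 + ["Sports"] * 8


def build_models():
    """Return the five classifiers named in the list above (hyperparameters assumed)."""
    return {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42),
    }


def compare_models(titles, flairs, test_size=0.25, seed=42):
    """Train each model on TF-IDF title features and return its test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        titles, flairs, test_size=test_size, random_state=seed, stratify=flairs
    )
    scores = {}
    for name, classifier in build_models().items():
        # Vectorizer and classifier are fitted together on the training titles only.
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        pipeline.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


scores = compare_models(titles, flairs)
for name, accuracy in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy:.3%}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the test titles out of the vocabulary fit, so the reported accuracy is not inflated by leakage.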

Training and testing on the dataset showed that Random Forest achieved the best test accuracy, 92.617%, when the title is used as the feature.
The best model is saved and used to predict the flair from the URL of a post.
The model was deployed in a web application built with Flask that predicts the flair of an r/india post from its URL.
Random Forest had the best accuracy, but because models larger than 500 MB cannot be deployed to Heroku, the model based on Linear SVM is the one deployed there.
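
Saving a trained model and checking its size before deployment might look like the sketch below. It assumes the model is a scikit-learn pipeline serialized with `pickle`; the file name follows `finalized_model.sav` from this repository, while the helper names (`save_model`, `load_model`, `fits_size_limit`), the toy training data and the reading of 500 MB as 500 * 1024 * 1024 bytes are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Size limit mentioned above; treating "500 MB" as 500 * 1024 * 1024 bytes is an assumption.
SIZE_LIMIT_BYTES = 500 * 1024 * 1024


def save_model(model, path):
    """Pickle the model to `path` and return the file size in bytes."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return os.path.getsize(path)


def load_model(path):
    """Load a pickled model; only unpickle files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def fits_size_limit(size_bytes, limit=SIZE_LIMIT_BYTES):
    """Return True when the serialized model is within the deployment limit."""
    return size_bytes <= limit


# Toy training data, invented for illustration.
titles = [
    "Parliament passes new budget bill",
    "Election results announced in the state",
    "Cricket team wins the test match",
    "Hockey team qualifies for the final",
]
flairs = ["Politics", "Politics", "Sports", "Sports"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(titles, flairs)

model_path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")
size = save_model(model, model_path)
restored = load_model(model_path)

print(f"Model size: {size} bytes, deployable: {fits_size_limit(size)}")
print("Predicted flair:", restored.predict(["Cricket match postponed due to rain"])[0])
```

Tree ensembles such as Random Forest store every tree, so their pickled size grows with the number and depth of trees, whereas a linear model stores a single weight matrix. That difference is one plausible reason the linear model fits under the limit.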

Results for each algorithm, using the title as the feature

Algorithm	Accuracy
Naive Bayes	73.154%
Linear SVM	81.879%
Logistic Regression	89.932%
Random Forest	92.617%
MLP	87.248%
