Heisenberg-737/Reddit_Flair_Detector

Reddit_Flair_Detector

A web application that detects the flair of posts from the India subreddit (r/india) using machine learning algorithms. The application can be found live at Reddit Flair Detector.

About

This repository illustrates the process of scraping Reddit posts from the subreddit r/india, preprocessing and cleaning the text, building a classifier that sorts the posts into 7 different flairs, and deploying a suitable machine learning model as a web application.
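
The cleaning step is not spelled out here, so the following is only a minimal sketch of what title preprocessing might look like. The function name `clean_title` and the tiny `STOP_WORDS` set are assumptions made for illustration; the repository lists nltk as a dependency, whose stop-word corpus would be the more likely choice in practice.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
# nltk's English stop-word corpus would be a fuller alternative.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "on"}


def clean_title(title: str) -> str:
    """Lowercase a post title, strip URLs and punctuation, and drop stop words."""
    text = title.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    # Keep only ASCII letters, digits and whitespace. Note that this also
    # drops non-Latin scripts, which may not be desirable for r/india titles.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_title("The Budget 2020: Impact on farmers in Punjab! https://example.com/x"))
# prints: budget 2020 impact farmers punjab
```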

Dependencies

The following dependencies can be found in requirements.txt:

  1. beautifulsoup4
  2. bs4
  3. Flask
  4. gunicorn
  5. html5lib
  6. json5
  7. nltk
  8. numpy
  9. numpydoc
  10. pandas
  11. path
  12. pathlib2
  13. praw
  14. prawcore
  15. py
  16. requests
  17. scikit-learn
  18. scipy
  19. unicodecsv
  20. win-unicode-console

Approach

After reviewing the available literature on text processing and on machine learning algorithms suited to text classification, I built my approach, with code snippets, around several models: Naive Bayes, Linear SVM, Random Forest, Multi-Layer Perceptron and Logistic Regression. The test accuracies obtained in various scenarios can be found in the next section.

Approach taken for the task:

  1. Collected posts from the India subreddit using the Pushshift Reddit API.
  2. The data includes flair, score, url, title and time-created.
  3. Only the title was used as the feature for the detection task.
  4. The following ML algorithms were then applied to the dataset:
  • Naive Bayes
  • Linear Support Vector Machine
  • Logistic Regression
  • Random Forest
  • MLP
  5. In training and testing on the dataset, Random Forest gave the best test accuracy, 92.617%, when the title is used as the feature.
  6. The best model is saved and used to predict the flair from the URL of a post.
  7. The model was deployed in a web application built on Flask, which predicts the flair of an India subreddit post from its URL.
  8. Random Forest had the best accuracy, but because models larger than 500 MB cannot be deployed to Heroku, a model based on Linear SVM is deployed there instead.
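
Predicting from a URL first requires resolving that URL to a post. The sketch below shows one way this might be done with the standard library only, relying on the usual Reddit permalink shape `/r/<subreddit>/comments/<id>/<slug>/`. The function name `extract_submission_id` is hypothetical and is not taken from this repository.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches the leading part of a Reddit permalink path:
#   /r/<subreddit>/comments/<post_id>/...
_COMMENTS_PATH = re.compile(
    r"^/r/(?P<subreddit>[^/]+)/comments/(?P<post_id>[a-z0-9]+)",
    re.IGNORECASE,
)


def extract_submission_id(url: str, subreddit: str = "india") -> Optional[str]:
    """Return the post ID from a Reddit permalink, or None if the URL is not
    a post in the expected subreddit."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if not (host == "reddit.com" or host.endswith(".reddit.com")):
        return None
    match = _COMMENTS_PATH.match(parsed.path)
    if match is None or match.group("subreddit").lower() != subreddit.lower():
        return None
    return match.group("post_id").lower()


print(extract_submission_id("https://www.reddit.com/r/india/comments/abc123/some_title/"))
# prints: abc123
```

With the ID in hand, the title could be fetched (for example through PRAW, which is among the dependencies) and passed to the saved model. Rejecting URLs outside r/india early keeps the classifier from being asked about posts it was never trained on.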

Results from algorithms used, using title as the feature

Algorithm              Accuracy
Naive Bayes            73.154%
Linear SVM             81.879%
Logistic Regression    89.932%
Random Forest          92.617%
MLP                    87.248%
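
A comparison of this kind could be produced along the following lines with scikit-learn. This is a sketch under stated assumptions, not the code used in this repository: the TF-IDF features, the hyperparameters, the `compare_models` helper, and the toy titles and flair names are all made up for illustration, so the accuracies it prints will not match the numbers above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical titles and flair names, not the real dataset).
TITLES = [
    "Parliament passes new budget bill",
    "Election results announced in state assembly",
    "Minister addresses parliament on election reform",
    "Opposition party protests budget in parliament",
    "State election campaign begins with party rally",
    "Government minister defends new bill",
    "India wins cricket match against Australia",
    "Cricket team announces squad for world cup",
    "Hockey team wins gold in final match",
    "Captain scores century in cricket final",
    "Football league match ends in draw",
    "Badminton star wins world tournament",
    "Best biryani recipe for the weekend",
    "Street food guide to chaat and samosa",
    "How to cook dal and rice at home",
    "Favourite masala dosa restaurant in the city",
    "Homemade paneer curry recipe",
    "Mango lassi recipe for summer",
]
FLAIRS = ["Politics"] * 6 + ["Sports"] * 6 + ["Food"] * 6


def compare_models(titles, flairs, test_size=6, seed=0):
    """Train each classifier on TF-IDF title features and return test accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        titles, flairs, test_size=test_size, stratify=flairs, random_state=seed
    )
    classifiers = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
    }
    results = {}
    for name, clf in classifiers.items():
        # Each model gets its own vectorizer, fitted on the training split only.
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(x_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(x_test))
    return results


if __name__ == "__main__":
    for name, accuracy in compare_models(TITLES, FLAIRS).items():
        print(f"{name}: {accuracy:.3%}")
```

Wrapping the vectorizer and classifier in one pipeline means the saved artifact can take raw titles at prediction time, which keeps the web application's prediction path simple.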

References

  1. Using Pushshift Reddit API
  2. Converting json to Python
  3. Scraping Reddit data with Python
  4. Making Graphs with Python
  5. Multi-Class Text Classification Model Comparison and Selection
  6. Preprocessing of text
  7. Cleaning and preprocessing of text
  8. Deploying ML model using flask
  9. Deploying a Python Flask app on Heroku
