A Reddit Flair Detector web application that predicts the flair of posts from the r/india subreddit using machine learning algorithms. The application is live at Reddit Flair Detector.
This repository illustrates the process of scraping Reddit posts from the subreddit r/india, preprocessing and cleaning the text, building a classifier that assigns each post to one of 7 flairs, and deploying a suitable machine learning model as a web application.
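As a rough illustration of the scraping and cleaning steps, the sketch below builds a Pushshift query for r/india, keeps only flaired posts, and normalizes titles. The endpoint URL, the Reddit field names (`link_flair_text`, `created_utc`) and the cleaning rules are assumptions made for this example, not necessarily what the repository does, and Pushshift availability has changed over time.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; access to Pushshift may be restricted.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

# Fields kept per post; "link_flair_text" is assumed to hold the flair.
FIELDS = ("link_flair_text", "score", "url", "title", "created_utc")


def build_query(subreddit="india", size=100, before=None):
    """Return the request URL for one page of submissions."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # epoch seconds, used for paging
    return PUSHSHIFT_URL + "?" + urlencode(params)


def parse_posts(payload):
    """Keep only flaired posts and only the fields of interest."""
    posts = []
    for item in payload.get("data", []):
        if item.get("link_flair_text"):
            posts.append({key: item.get(key) for key in FIELDS})
    return posts


def clean_title(title):
    """Lowercase, replace non-letter characters, and collapse whitespace."""
    title = title.lower()
    title = re.sub(r"[^a-z\s]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def fetch_posts(**query):
    """Download and parse one page of posts (needs network access)."""
    with urlopen(build_query(**query)) as response:
        return parse_posts(json.load(response))
```

Keeping the parsing and cleaning functions separate from the network call makes them easy to check against a saved JSON payload.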
The following dependencies can be found in requirements.txt:
- beautifulsoup4
- bs4
- Flask
- gunicorn
- html5lib
- json5
- nltk
- numpy
- numpydoc
- pandas
- path
- pathlib2
- praw
- prawcore
- py
- requests
- scikit-learn
- scipy
- unicodecsv
- win-unicode-console
After reviewing the literature on text processing and on machine learning algorithms suited to text classification, including examples with code snippets, I based my approach on several models: Naive Bayes, Linear SVM, Random Forest, Multi-Layer Perceptron and Logistic Regression. The test accuracies obtained are reported in the results table below.
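A minimal sketch of how such a comparison could be set up with scikit-learn is shown here. The TF-IDF features, the hyperparameters, the train/test split and the toy titles are all assumptions for illustration; the repository's actual feature extraction and settings may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the real dataset holds scraped r/india titles.
TOY_TITLES = [
    "parliament passes new budget bill", "election results announced today",
    "minister resigns after cabinet vote", "opposition party stages protest",
    "cricket team wins test series", "hockey league final this weekend",
    "batsman scores double century", "football club signs new striker",
]
TOY_FLAIRS = ["Politics"] * 4 + ["Sports"] * 4


def compare_models(titles, flairs, test_size=0.25, seed=42):
    """Fit each classifier on TF-IDF title features; return test accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        titles, flairs, test_size=test_size, random_state=seed, stratify=flairs
    )
    models = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed),
    }
    scores = {}
    for name, classifier in models.items():
        # One pipeline per model so the vectorizer is fit on training data only.
        pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])
        pipeline.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


if __name__ == "__main__":
    for name, accuracy in compare_models(TOY_TITLES, TOY_FLAIRS).items():
        print(f"{name}: {accuracy:.3f}")
```

Accuracies on the toy data are meaningless; the point is only the shared pipeline structure that lets the five models be compared on equal terms.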
Approach taken for the task:
- Collected posts from the India subreddit using the Pushshift Reddit API.
- The data includes the flair, score, URL, title and creation time of each post.
- Only the title was used as the feature for the detection task.
- The following ML algorithms were then applied to the dataset:
  - Naive Bayes
  - Linear Support Vector Machine
  - Logistic Regression
  - Random Forest
  - MLP
- Training and testing on the dataset showed that Random Forest achieved the best test accuracy, 92.617%, when the title is used as the feature.
- The best model is saved and used to predict the flair from the URL of a post.
- The model was deployed as a web application built with Flask that predicts the flair of a post from its India subreddit URL.
- Random Forest had the best accuracy, but since models larger than 500 MB cannot be deployed to Heroku, the Linear SVM model is deployed there instead.
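The last three steps (saving a model, checking it against the 500 MB limit, and serving predictions through Flask) might look roughly like the sketch below. The `/predict` route, the `url` form field, and the `fetch_title` callable are hypothetical names introduced for this example; in practice `fetch_title` would wrap a Reddit client such as praw to look up a post's title from its URL.

```python
import pickle

from flask import Flask, jsonify, request

# Size limit quoted in the text above.
MAX_MODEL_BYTES = 500 * 1024 * 1024


def serialize_model(model):
    """Pickle a model and report whether it fits under the size limit."""
    blob = pickle.dumps(model)
    return blob, len(blob) <= MAX_MODEL_BYTES


def create_app(model, fetch_title):
    """Build a Flask app around a fitted model.

    fetch_title is a hypothetical callable mapping a post URL to its title.
    """
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        url = request.form.get("url", "").strip()
        if not url:
            return jsonify(error="missing url"), 400
        title = fetch_title(url)
        # The model is expected to accept a list of raw title strings,
        # e.g. a scikit-learn pipeline that includes its own vectorizer.
        flair = model.predict([title])[0]
        return jsonify(url=url, flair=str(flair))

    return app
```

Passing the model and the title lookup into `create_app` keeps the route testable with Flask's test client and a stub model, without Reddit credentials.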
Test accuracy of each algorithm, using the title as the feature:
Algorithm | Accuracy |
---|---|
Naive Bayes | 73.154% |
Linear SVM | 81.879% |
Logistic Regression | 89.932% |
Random Forest | 92.617% |
MLP | 87.248% |