Flare-Detector-for-Reddit-Data

Flair detector is a web application, deployed using Heroku. The application can be seen here. It scrapes posts using the URL and then uses a Random Forest model to predict the flair of that post

This repository contains the code which performs the following functionalities:

Scrapping posts from Indian Subreddit according to two methods: hottest posts and distributed posts in accordance with flairs
Exploratory Data Analysis: contains bar graphs and pie-charts to analyse data distributions for the attributes collected in step 1.
Textual Pre-Processing: pre-processes textual data for attributes like Title of the post and Content of the posts.
Building and contrasting different models: Builds, trains and validates four ML models, Naive Bayes, Random Forest, Logistic Regression and Multi-layer Perceptron using different features and then selects the feature-model pair performing the best.
Building Web Application: Contains the code to build a basic web application that takes as input a post url and displays the predicted flair of the post.
Hosts the app on Heroku.

Organisation of the Repository:

Jupyter Notebooks: contains data extraction, exploration, pre-processing and model building ipynb (or Jupyter) notebooks.
Dataset: contains different datasets obtained by following different extraction methods. dataset.csv was the first draft created by extracting hottest posts. newdataset.csv was the second draft created by distributed flairs method. morefeaturesDataset.csv was the third draft which contains comments as well. refinedmorefeaturesDataset.csv is the dataset obtained after pre-processing morefeaturesDataset.csv and is used for training.
templates: contains the front end of the web app in index.html file
Procfile: contains the command which are required to run the flask based app on Heroku.
app.py: The Main Python file for Flask Based Web App.
model.py: final model that uses title of the post and random forest to predict its flair.
model.pkl: pickled-version of the model which is used to make predictions.
requirments.txt: txt file for Heroku to download the required requirements to run the web application.
Experimental log.ipynb: Documents everything that worked or failed on a day per day basis. (for personal use)

How to run this code in your local machine

Go to the directory in which you want to download the repository using:

cd ~/target directory

Clone this repository in your local machine using the following command:

git clone https://github.com/ShreyaGupta08/Flare-Detector-for-Reddit-Data.git

Enter this directory using the following command:

cd Flare-Detector-for-Reddit-Data

- (Optional) Create Virtual environment (good coding practice):

virtualenv env

If you want your virtualenv to also inherit globally installed packages run:

virtualenv venv --system-site-packages

Activate your virtual env

source venv/bin/activate

(If you're a first-time user) Download the requirements from requirements.txt:

For Python 2:

pip install -r requirements.txt

For Python 3:

pip3 install -r requirements.txt

Copy paste the jupyter notebooks from Jupyter Notebooks folder in the Dataset folder using the following command:

cp -r Jupyter\ Notebooks/ Dataset/

run the main flask API file, app.py using the following command:

python app.py

Copy Paste the server address (generally http://127.0.0.1:5000/)

Voila!

(If you face any errors, feel free to issue a pull request)

P.S. I've also added my experiment log to document everything that worked or failed.

Things it is good at:

Detailed Documentation
prediction for Photography posts (generally)

Things it is bad at:

Prediction. I realised it a little late that the 71% accuracy in using title as feature (and Random Forest as learning algorithm) is achieved because the title contained the flair in most cases. This was weird and has been mentioned in better detail in the Experimental log.ipynb file. Time remaining, I would have liked to find a way to:
- incorporate comments, content and title with the title (and dealing with NaNs appropriately)
- Used better learning algorithms
- collected more data.

Things I would like to incorporate:

In data exploration:

finding the number of posts in each flair for which content != None
- finding the correlation between individual flair confusion matrix obtained from using content only with the number of samples obtained above
- verifying if the unequal distribution is one of the reasons behind the low flair accuracy. if yes, checking if increasing the number of sample distribution had any effect on prediction scores. Then maybe, content would not have been as useless after all.
contrasting performance with unsupervised algorithms (like K-Means)
Find flair-wise accuracy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset

Dataset

Jupyter Notebooks

Jupyter Notebooks

static/css

static/css

templates

templates

Experimental log.ipynb

Experimental log.ipynb

LICENSE

LICENSE

Procfile

Procfile

README.md

README.md

app.py

app.py

model.pkl

model.pkl

model.py

model.py

requirements.txt

requirements.txt

Repository files navigation

Flare-Detector-for-Reddit-Data

Organisation of the Repository:

How to run this code in your local machine

Things it is good at:

Things it is bad at:

Things I would like to incorporate:

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
Dataset		Dataset
Jupyter Notebooks		Jupyter Notebooks
static/css		static/css
templates		templates
Experimental log.ipynb		Experimental log.ipynb
LICENSE		LICENSE
Procfile		Procfile
README.md		README.md
app.py		app.py
model.pkl		model.pkl
model.py		model.py
requirements.txt		requirements.txt

License

ShreyaGupta08/Flare-Detector-for-Reddit-Data

Folders and files

Latest commit

History

Repository files navigation

Flare-Detector-for-Reddit-Data

Organisation of the Repository:

How to run this code in your local machine

Things it is good at:

Things it is bad at:

Things I would like to incorporate:

About

Resources

License

Stars

Watchers

Forks

Languages