Reddit Gild Predictor

Why?

This project is primarily exploratory: it checks whether ML can capture complex real-world relationships.

Is it possible to capture general opinions within a subreddit? If so, which opinions are more likely to get gilded? Is it possible to capture such relationships without relying on textual data?

NLP techniques such as text abstraction could be used to 'label' a comment's text (or even to summarize the content of a Reddit thread).

Can we extract sentiments associated with topics? For instance, microtransactions are not looked upon favorably by the gaming community, and the same is true on r/Gaming.

Can we capture such sentiments without manually labeling the data? And how do these sentiments relate to the gild status of a comment?

Objectives:

First stage: Given a Reddit comment and its attributes (such as number of upvotes and comment author data), build a model to classify whether the comment is gilded.

Optional: Repeat the first stage as a regression problem to predict the number of gildings received. The same could be done for the number of upvotes.

Second stage: Repeat, but for multiclass classification (gold, silver, or no gildings).

Third stage: Repeat the first and second stages, but use NLP techniques (using the comment body for classification).

Optional: Use a model to combine the results of the first/second and third stages.
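The three prediction targets named in the objectives (gilded or not, number of gildings, and gold/silver/none) can all be derived from the same per-comment gild counts. The snippet below is a minimal sketch of that idea, not the project's actual code: the `gold` and `silver` keys, the `gild_targets` helper, and the rule that gold takes priority when a comment has both awards are assumptions made for illustration.

```python
from typing import Dict


def gild_targets(comment: Dict[str, int]) -> Dict[str, object]:
    """Derive the stage targets from hypothetical 'gold' and 'silver' count fields."""
    gold = comment.get("gold", 0)      # assumed field name
    silver = comment.get("silver", 0)  # assumed field name
    total = gold + silver

    # Assumption: a comment holding both awards is labeled by the higher one.
    if gold > 0:
        gild_class = "gold"
    elif silver > 0:
        gild_class = "silver"
    else:
        gild_class = "none"

    return {
        "is_gilded": total > 0,    # first stage: binary classification
        "gild_count": total,       # optional: regression target
        "gild_class": gild_class,  # second stage: multiclass classification
    }
```

Keeping the three targets in one place makes it easier to rerun the same feature set against each stage.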

Current Progress:

  • 05/20-21: Experimented with Logistic Regression Model (with Resampling).

  • 05/22-23: Experimented with SVM Model (Using Gradient Descent due to high volume of data).

  • 05/23-25: Experimented with Decision Trees, Random Forests and Ensemble Models (in progress)

  • 05/25-27: Refactoring into classes and utility functions

  • 06/01-02: Second stage refactoring (dividing classes into even more meaningful chunks), cleaning up code.

  • 06/04-05: Completed Decision Trees

  • 06/06-07: More feature engineering/experimentation (Note: Modified comment age to be calculated in terms of the specific thread it belongs to). Reran all models. Logistic Regression performed best.

  • 06/08: Started literature review for outlier detection approach.
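Two of the steps in the log above, the thread-relative comment age and resampling, can be sketched in plain Python. This is a hedged illustration only. It assumes "age in terms of the thread" means seconds elapsed since the thread was created, and it uses random oversampling of the minority class as one possible resampling strategy; the log does not say which one was used. The `thread_id`, `age_in_thread`, and `is_gilded` keys and both function names are hypothetical.

```python
import random
from typing import Dict, List


def add_thread_relative_age(comments: List[dict], thread_created: Dict[str, float]) -> List[dict]:
    """Return copies of the comments with 'age_in_thread' added.

    Assumption: age is measured in seconds from the creation time of the
    thread the comment belongs to.
    """
    rows = []
    for comment in comments:
        row = dict(comment)  # copy so the input is not mutated
        row["age_in_thread"] = comment["created_utc"] - thread_created[comment["thread_id"]]
        rows.append(row)
    return rows


def oversample_minority(rows: List[dict], label_key: str = "is_gilded", seed: int = 0) -> List[dict]:
    """Balance a binary dataset by randomly duplicating minority-class rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return list(rows)  # nothing to duplicate
    # Draw with replacement until both classes have the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced
```

If oversampling is used, it should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.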

Plan:

  • Plan of action is available under docs/Plan.md

Visualize

  • Refer to docs/Visualization.md

Milestones:

  • Refer to docs/Milestones.md

About

Text Analysis on Reddit Data; Using ML to classify gild status of a reddit comment.
