Textual analysis of Investment themed subreddits to predict future stock returns
Latest commit 61e9a26 Feb 6, 2019



The goal of our project is to understand the impact of social media posts on the future prices of individual stocks. Using textual analysis techniques, we examined posts on the social media platform reddit.com within investment-focused subreddits (forums).

The files found here can be used to clean reddit posts and comments, apply sentiment scores to them, and build regression models from the resulting data. We also supply the code for the interactive web application that visualizes real-time sentiment and predictions for specific stock tickers.

Data and Analysis

All analysis and data collection code is found within CODE/Analysis.

Obtain and Clean Data

Obtain reddit comments and posts from Google BigQuery: https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts?pli=1

Example queries for May 2018 (standard SQL). Each table holds one month of data, so repeat the queries for every month needed.

SELECT * FROM `fh-bigquery.reddit_posts.2018_05` WHERE subreddit = 'wallstreetbets';
SELECT * FROM `fh-bigquery.reddit_comments.2018_05` WHERE subreddit = 'wallstreetbets';
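Because each table covers a single month, the queries can be generated rather than typed by hand. The following is a minimal sketch, not part of the project code: the helper names `month_tables` and `build_queries` are hypothetical, and it assumes standard SQL with backtick-quoted table names.

```python
from datetime import date


def month_tables(start: date, end: date):
    """Yield 'YYYY_MM' table suffixes for every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year}_{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_queries(subreddit: str, start: date, end: date):
    """Return one posts query and one comments query per month (hypothetical helper)."""
    queries = []
    for suffix in month_tables(start, end):
        for dataset in ("reddit_posts", "reddit_comments"):
            queries.append(
                f"SELECT * FROM `fh-bigquery.{dataset}.{suffix}` "
                f"WHERE subreddit = '{subreddit}';"
            )
    return queries


if __name__ == "__main__":
    # Print the queries for May through July 2018.
    for query in build_queries("wallstreetbets", date(2018, 5, 1), date(2018, 7, 1)):
        print(query)
```

Each printed query can be pasted into the BigQuery console and its result exported as a .csv file.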

Place the .csv files exported from BigQuery into their respective folders.

Run the following files in order.

CleanData.rmd > RedditSentiment.rmd > getStockData.rmd > Python Sentiment/vaderSentiment.py > FinalModel.Rmd > FinalModel_2.Rmd

MongoDB Cluster:


  1. Export the comments collection (before sentiment was calculated):
     mongoexport --db liztd -c comments --out comments.csv --type csv --fields "author_flair_css_class,distinguished,ups,subreddit,body,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id"

  2. Export the submissions collection (before sentiment was calculated):
     mongoexport --db liztd -c submissions --out submissions.csv --type csv --fields "created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext,saved,id,from_kind,gilded,from,stickied,retrieved_on,over_18,thumbnail,subreddit_id,hide_score,link_flair_css_class,author_flair_css_class,archived,is_self,from_id,permalink,name,author_flair_text,quarantine,link_flair_text,distinguished"

  3. Export the aggregated sentiments collection:
     mongoexport --db liztd -c reddit_sentiments --out sentiments.csv --type csv --fields "date,ticker,sumCompound,count,close,pct2,pred"

  4. Importing it into local:
     mongoimport --host mongodb://://dvafinalproject-anotq.mongodb.net/liztd -c reddit_submissions --type csv --headerline --file submissions_with_sentiments.csv
     mongoimport --db liztd -c reddit_comments --type csv --headerline --file comments_with_sentiments.csv

  5. Import to the cloud.mongodb.com shard:
     mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection reddit_submissions --type csv --file submissions_with_sentiments.csv --headerline
     mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection reddit_comments --type csv --file comments_with_sentiments.csv --headerline
     mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection sentiments --type csv --file sentiments.csv --headerline
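The three cloud imports in step 5 differ only in collection and file name, so the argument lists can be built from a small table. This is an illustrative sketch only: `mongoimport_args` and `all_import_commands` are hypothetical helpers, the username is passed in by the caller, and `--password` is deliberately left out on the assumption that mongoimport then prompts for it.

```python
import shlex

# Replica-set host string copied from the step 5 commands.
SHARD_HOST = (
    "dvafinalproject-shard-0/"
    "dvafinalproject-shard-00-00-anotq.mongodb.net:27017,"
    "dvafinalproject-shard-00-01-anotq.mongodb.net:27017,"
    "dvafinalproject-shard-00-02-anotq.mongodb.net:27017"
)

# Target collection -> CSV file, as listed in step 5.
IMPORTS = {
    "reddit_submissions": "submissions_with_sentiments.csv",
    "reddit_comments": "comments_with_sentiments.csv",
    "sentiments": "sentiments.csv",
}


def mongoimport_args(collection, csv_file, username, host=SHARD_HOST, db="liztd"):
    """Build the mongoimport argument list for one collection (hypothetical helper)."""
    return [
        "mongoimport",
        "--host", host,
        "--ssl",
        "--username", username,  # no --password here: assumed to trigger a prompt
        "--authenticationDatabase", "admin",
        "--db", db,
        "--collection", collection,
        "--type", "csv",
        "--file", csv_file,
        "--headerline",
    ]


def all_import_commands(username):
    """Return one argument list per collection in IMPORTS."""
    return [mongoimport_args(c, f, username) for c, f in IMPORTS.items()]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for args in all_import_commands("your-username"):
        print(shlex.join(args))
```

Returning argument lists rather than one shell string means they can be handed to `subprocess.run` without extra quoting.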

Database connection and details

  1. Connect to the cloud.mongodb.com (arvnan52/hp..)
  2. The connection is enabled from only two IP addresses: (a) my laptop and (b) the DigitalOcean server.

mongo "mongodb+srv://dvafinalproject-anotq.mongodb.net/liztd" --username --password

3. Collections:

a. reddit_submissions
    Indexes: db.reddit_submissions.createIndex({title: "text", selftext: "text", id: 1, created_utc: 1})
b. reddit_comments
    Indexes: db.reddit_comments.createIndex({parent_id: 'text', body: 'text', created_utc: 1, id: 1})
c. sentiments
    This collection aggregates stock price with reddit sentiment analysis and final prediction.
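To show how rows in the sentiments collection might be produced, here is a rough sketch that groups scored comments by date and ticker. It is an assumption about the aggregation, not the project's actual code: the output names `sumCompound` and `count` come from the export field list, while the input layout (`date`, `ticker`, `compound`) and the function name `aggregate_sentiment` are hypothetical.

```python
from collections import defaultdict


def aggregate_sentiment(rows):
    """Group scored comments by (date, ticker) and sum their compound scores.

    Each input row is assumed to be a dict with 'date', 'ticker' and
    'compound' keys; this layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: {"sumCompound": 0.0, "count": 0})
    for row in rows:
        key = (row["date"], row["ticker"])
        totals[key]["sumCompound"] += row["compound"]
        totals[key]["count"] += 1
    # One output record per (date, ticker), sorted for stable output.
    return [
        {"date": day, "ticker": ticker, **agg}
        for (day, ticker), agg in sorted(totals.items())
    ]


if __name__ == "__main__":
    sample = [
        {"date": "2018-05-01", "ticker": "AAPL", "compound": 0.5},
        {"date": "2018-05-01", "ticker": "AAPL", "compound": 0.25},
        {"date": "2018-05-01", "ticker": "TSLA", "compound": -0.5},
    ]
    for record in aggregate_sentiment(sample):
        print(record)
```

The remaining fields (close, pct2, pred) would then be joined onto these records from the stock data and the model output.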

Digital Ocean droplet:

hostname: ubuntu-s-1vcpu-1gb-nyc1-01



The following functionality was hosted on a single Ubuntu server provided by DigitalOcean.

Daily Load:

The Python script under CODE/liztd_python_load is set up as a cron job that runs every night.

Reddit Stream:

This is handled by the Python script inside the CODE/liztd_python_stream folder. The script keeps an open connection to the reddit 'wallstreetbets' stream and uploads each new item into the MongoDB database.
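The general shape of such a streaming loop can be sketched as follows. This is not the script in CODE/liztd_python_stream: `to_document` and `run_stream` are hypothetical names, the stream is assumed to be any iterable of dicts, the database write is passed in as a callable, and the retry policy is invented for illustration.

```python
import time


def to_document(item):
    """Keep a few of the comment fields named in the export step."""
    keep = ("id", "author", "body", "created_utc", "subreddit")
    return {key: item.get(key) for key in keep}


def run_stream(stream, insert, max_retries=3, delay=1.0, sleep=time.sleep):
    """Pass every streamed item to `insert`, retrying failed writes.

    `stream` is any iterable of dicts and `insert` is any callable that
    stores one document; both are stand-ins for the real reddit stream
    and the MongoDB collection. Returns the number of stored documents.
    """
    stored = 0
    for item in stream:
        doc = to_document(item)
        for attempt in range(1, max_retries + 1):
            try:
                insert(doc)
                stored += 1
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the final attempt
                sleep(delay * attempt)  # wait a little longer each time
    return stored


if __name__ == "__main__":
    fake_stream = [
        {"id": "a1", "author": "someone", "body": "example text",
         "created_utc": 1525132800, "subreddit": "wallstreetbets"},
    ]
    store = []
    print(run_stream(fake_stream, store.append), store)
```

Passing the write function in as an argument keeps the loop testable without a live database or reddit connection.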


PM2: a process management tool, set up to keep the data collection jobs running on their schedule.

Web Application:

    A Python web server based on bottlepy serves as the API between the database and the frontend. The project is in CODE/liztd_python_api.

    The UI is built with ReactJS, using the evergreen library for UI components and Recharts for charting components. The scripts necessary to run the web UI are described in CODE/liztd_ui/readme.md.