# RGR Stock Price Forecasting Project

Author: Jack Wang

---

## Problem Statement

Stock prices are hard to predict because they are affected not only by the performance of the underlying companies but also by the expectations of the general public. The stock prices of firearm companies are known to be highly correlated with public opinion on gun bans. My model aims to predict the stock price of one of the largest firearm companies in the United States, RGR (Sturm, Ruger & Co.), using its historical stock price and public opinion on gun bans.

## Executive Summary

The goal of my project is to build a **time series regression model** that predicts the stock price of RGR. The data I am using are historical stock prices from Yahoo Finance, Twitter posts scraped from [Twitter](https://twitter.com/), and news articles from major news websites. I will perform NLP on the text data and time series modeling on the historical stock price data. The model will be evaluated using the R^2 score.
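
For reference, the R^2 score compares a model's squared error against the error of always predicting the mean. The sketch below computes it in plain Python. The function name `r2_score_manual` and the sample prices are hypothetical and are not taken from this project's data.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the actual values
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up closing prices, for illustration only
actual = [50.0, 52.0, 54.0, 56.0]

# A perfect prediction versus a baseline that always predicts the mean
perfect = r2_score_manual(actual, [50.0, 52.0, 54.0, 56.0])
baseline = r2_score_manual(actual, [53.0] * 4)
```

A score of 1 means the predictions match the actual values exactly, while a score of 0 means the model does no better than predicting the mean price.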

## Content

This project consists of the following Jupyter notebooks:
- Part-1-stock-price-data
- Part-2-twitter-scraper
- Part-3-twitter-data-cleaning
- ***Part-4-reddit-data-scraper***
- Part-5-reddit-data-cleaning
- Part-6-combined-data-and-EDA
- Part-5-modeling
    - [Example](#Most-Frequent-Words-in-Title-and-Content)
- Part-6-Conclusion-and-Discussion


---


In [1]:
# imports
import pandas as pd
import requests
import json
import csv
import time
import datetime

Since the Reddit API does not allow us to retrieve historical data by date, I requested the historical data from `pushshift`, an API that archives historical Reddit data. Most of the code below is adapted from [here](https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563).
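
As a side note, the request URL can also be assembled with the standard library so that keywords containing spaces are encoded properly. This is only a sketch: the helper name `build_pushshift_url` is hypothetical, and the parameter names (`title`, `size`, `after`, `before`, `subreddit`) are simply the ones that appear in the URL used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_pushshift_url(query, after, before, sub, size=1000):
    """Build the search URL with URL-encoded query parameters."""
    params = {
        "title": query,      # keyword to search for in submission titles
        "size": size,        # maximum number of results per request
        "after": after,      # start of the time range (epoch seconds)
        "before": before,    # end of the time range (epoch seconds)
        "subreddit": sub,    # subreddit to search in
    }
    return BASE_URL + "?" + urlencode(params)

example_url = build_pushshift_url("gun control", 1451606400, 1569888000, "guns")
```

Here `urlencode` turns the space in `gun control` into `+`, which avoids sending a raw space in the URL.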

In [92]:
# Create a function to specify subreddit, keywords, time range

def getPushshiftData(query, after, before, sub):
    url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

In [2]:
# Create a function to request data including subreddit post score, author id, title, time, etc...

def collectSubData(subm):
    subData = list() # list to store data points
    title = subm['title']
    url = subm['url']
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"    
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc'])
    numComms = subm['num_comments']
    permalink = subm['permalink']
    
    subData.append((sub_id,title,url,author,score,created,numComms,permalink,flair))
    subStats[sub_id] = subData
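
Note that `datetime.datetime.fromtimestamp` without a timezone argument returns local time, so the `created` values depend on the machine that runs the notebook. If UTC timestamps are preferred, one possible alternative is sketched below. The helper name `created_utc_to_datetime` is hypothetical.

```python
import datetime

def created_utc_to_datetime(created_utc):
    """Convert epoch seconds to a timezone-aware datetime in UTC."""
    return datetime.datetime.fromtimestamp(created_utc, tz=datetime.timezone.utc)

# 1451606400 is the `after` value used in this notebook
example = created_utc_to_datetime(1451606400)
```

A timezone-aware value makes it easier to align posts with daily stock prices without an implicit local-time offset.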

In [99]:
# Subreddit to query
sub='guns'

# Query time range
before = "1569888000" #ends by 10/1/2019
after = "1451606400"  #starts at 1/1/2016
query = "gun control" #key word
subCount = 0
subStats = {}
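
The epoch values above can be derived from calendar dates rather than typed by hand. The sketch below assumes the range boundaries are meant as midnight UTC. The helper name `to_epoch` is hypothetical.

```python
import datetime

def to_epoch(year, month, day):
    """Return epoch seconds for midnight UTC on the given date."""
    dt = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())

after_ts = to_epoch(2016, 1, 1)    # start of the query range
before_ts = to_epoch(2019, 10, 1)  # end of the query range
```

Under that assumption, `after_ts` and `before_ts` reproduce the two values used in this cell, 1451606400 and 1569888000.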

In [100]:
data = getPushshiftData(query, after, before, sub)

# Will gather data until the end date
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount+=1
        
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = getPushshiftData(query, after, before, sub)
    
print(len(data))

https://api.pushshift.io/reddit/search/submission/?title=gun control&size=1000&after=1451606400&before=1569888000&subreddit=guns
383
2019-09-29 00:10:02
https://api.pushshift.io/reddit/search/submission/?title=gun control&size=1000&after=1569741002&before=1569888000&subreddit=guns
0
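
The loop above pages through the results by moving `after` to the timestamp of the last submission received, and it stops when a request returns an empty list. The same pattern can be sketched offline with a stand-in for the API. Everything named here (`paginate`, `fake_fetch`, `FAKE_POSTS`) is hypothetical, and the sketch assumes results arrive sorted by `created_utc` in ascending order with `after` treated as exclusive.

```python
def paginate(fetch_page, after, before):
    """Collect all items by advancing `after` to the last item's timestamp."""
    collected = []
    page = fetch_page(after, before)
    while len(page) > 0:
        collected.extend(page)
        # The next request starts right after the last item already seen
        after = page[-1]["created_utc"]
        page = fetch_page(after, before)
    return collected

# In-memory stand-in for the API, with made-up posts
FAKE_POSTS = [{"id": str(i), "created_utc": 100 + 10 * i} for i in range(5)]

def fake_fetch(after, before, page_size=2):
    """Return at most `page_size` posts created strictly between the bounds."""
    matches = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return matches[:page_size]

all_posts = paginate(fake_fetch, after=99, before=1000)
```

One limitation of this pattern is that posts sharing the exact timestamp of the last item on a page would be skipped by the next request.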


In [101]:
# Print out result

print(str(len(subStats)) + " submissions have been added to the list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][5]))

383 submissions have been added to the list
1st entry is:
Obama to impose new gun control curbs next week created: 2016-01-01 08:00:52
Last entry is:
Tennessee Man Faces Citation For His Fuck Gun Control Bumper Sticker created: 2019-09-29 00:10:02


In [102]:
# Export file to csv

def updateSubs_file():
    upload_count = 0
    location = "../data/reddit_"
    print("input filename of submission file, please add .csv")
    filename = input()
    file = location + filename
    with open(file, 'w', newline='', encoding='utf-8') as file: 
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title","Url","Author","Score","Publish Date","Total No. of Comments","Permalink","Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count+=1
            
        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

input filename of submission file, please add .csv
guns_2016_to_2019.csv
383 submissions have been uploaded
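
For non-interactive runs, the export step can take the output path as an argument instead of calling `input()`. The sketch below is one possible variant. The function name `write_submissions` and the sample record are hypothetical, and it writes to a temporary directory rather than `../data/`.

```python
import csv
import os
import tempfile

HEADERS = ["Post ID", "Title", "Url", "Author", "Score", "Publish Date",
           "Total No. of Comments", "Permalink", "Flair"]

def write_submissions(path, sub_stats):
    """Write one CSV row per submission and return the number of rows written."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerow(HEADERS)
        for rows in sub_stats.values():
            # Each value is a one-element list holding the submission tuple
            writer.writerow(rows[0])
    return len(sub_stats)

# Made-up record in the same shape as the entries of subStats
sample = {
    "abc123": [("abc123", "Example title", "https://example.com", "someone",
                1, "2016-01-01 00:00:00", 0, "/r/guns/comments/abc123/", "NaN")]
}

out_path = os.path.join(tempfile.mkdtemp(), "reddit_sample.csv")
written = write_submissions(out_path, sample)

# Read the file back to confirm the header and the row were written
with open(out_path, newline="", encoding="utf-8") as f:
    rows_back = list(csv.reader(f))
```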
