# RGR Stock Price Forecasting Project

Author: Jack Wang

---

## Problem Statement

Stock prices are hard to predict because they are affected not only by the performance of the underlying companies but also by the expectations of the general public. It is well known that the stock prices of firearm companies are highly correlated with public opinion on gun bans. My model aims to predict the stock price of RGR (Sturm, Ruger & Co.), one of the largest firearm companies in the United States, using its historical stock price and public opinion on gun bans.

## Executive Summary

The goal of my project is to build a **time series regression model** that predicts the stock price of RGR. The data I am using are historical stock prices from Yahoo Finance, Twitter posts scraped from [Twitter](https://twitter.com/), and news articles from major news websites. I will perform NLP on the text data and time series modeling on the historical stock price data. The model will be evaluated using the R^2 score.
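
For reference, the R^2 score compares the model's squared error with that of a baseline that always predicts the mean of the observed values. The following is a minimal pure-Python sketch of that definition; the function name `r_squared` and the sample prices are illustrative assumptions and do not come from the project's notebooks.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical closing prices and predictions, for illustration only
actual = [50.0, 52.0, 54.0, 56.0]
predicted = [51.0, 52.0, 53.0, 57.0]
print(r_squared(actual, predicted))
```

A score of 1 means the predictions match the observations exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse.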

## Content

This project consists of the following Jupyter notebooks:
- Part-1-stock-price-data
- Part-2-twitter-scraper
- ***Part-3-twitter-data-cleaning***
- Part-4-reddit-data-scraper
- Part-5-reddit-data-cleaning
- Part-4-combined-data-and-EDA
- Part-5-modeling
    - [Example](#Most-Frequent-Words-in-Title-and-Content)
- Part-6-Conclusion-and-Discussion


---


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import re
from functools import reduce

from datetime import datetime
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## 2016 Twitter Data Combining

In [105]:
df=[]
for i in range(10, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2016_{i}_{i}.csv"))
    
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_8.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_15.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_8.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_10.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_15.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_30.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_12_16.csv"))

In [106]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)
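
To illustrate what the `reduce` call does: `pd.merge` without an `on` argument joins on every column the two frames share, so an outer merge of same-schema frames keeps one copy of any row that appears in both. The sketch below uses two tiny hypothetical frames (the real scrape files may contain other columns) and compares the result with stacking the frames and dropping duplicates.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature scrape files with one overlapping row
a = pd.DataFrame({
    "time_stamp": ["2016-10-01", "2016-10-02"],
    "tweet_word_count": [12, 20],
    "compound": [-0.5, 0.1],
})
b = pd.DataFrame({
    "time_stamp": ["2016-10-02", "2016-10-03"],
    "tweet_word_count": [20, 7],
    "compound": [0.1, -0.3],
})

# Same pattern as the notebook: chain outer merges over the list of frames
merged = reduce(lambda left, right: pd.merge(left, right, how="outer"), [a, b])

# Alternative: stack the frames, then remove exact duplicate rows
stacked = pd.concat([a, b], ignore_index=True).drop_duplicates().reset_index(drop=True)

print(len(merged), len(stacked))
```

On this toy input both approaches yield the same three distinct rows, although the row order may differ between them.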

In [107]:
df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

In [108]:
df_final = df_final.drop_duplicates()

In [109]:
df_final = df_final.reset_index(drop=True)

In [110]:
# Aggregate per day: sums, row counts, and means of the per-tweet columns
grouped = df_final.groupby(by='time_stamp')
counts_and_means = pd.merge(grouped.count(), grouped.mean(), left_index=True, right_index=True)
df_final = pd.merge(grouped.sum(), counts_and_means, left_index=True, right_index=True)

In [111]:
df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

In [112]:
df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']
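
The merge of sums, counts, and means followed by the column drop and rename could also be written as a single named aggregation. This is only a sketch on hypothetical per-tweet rows: it assumes the cleaned data has `tweet_word_count` and `compound` columns, and it assumes that counting non-null `compound` values matches the per-day tweet count used in the notebook.

```python
import pandas as pd

# Hypothetical per-tweet rows, for illustration only
tweets = pd.DataFrame({
    "time_stamp": ["2016-10-01", "2016-10-01", "2016-10-02"],
    "tweet_word_count": [10, 20, 16],
    "compound": [-0.5, -0.25, 0.0],
})

# One pass over the groups; each keyword names an output column
daily = tweets.groupby("time_stamp").agg(
    tweet_word_count_sum=("tweet_word_count", "sum"),
    tweet_compound_score_sum=("compound", "sum"),
    tweets_sum=("compound", "count"),  # assumption: one non-null score per tweet
    tweet_word_count_mean=("tweet_word_count", "mean"),
    tweet_compound_score_mean=("compound", "mean"),
)
daily["date"] = daily.index
print(daily)
```

Naming the output columns inside `agg` avoids relying on the `_x`/`_y` suffixes that `pd.merge` generates and on the positional order of the columns.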

In [113]:
df_final['date'] = df_final.index

In [114]:
df_final

| time_stamp | tweet_word_count_sum | tweet_compound_score_sum | tweets_sum | tweet_word_count_mean | tweet_compound_score_mean | date |
|---|---|---|---|---|---|---|
| 2016-10-01 | 9314 | -223.6686 | 593 | 15.706577 | -0.377181 | 2016-10-01 |
| 2016-10-02 | 10535 | -225.3380 | 644 | 16.358696 | -0.349904 | 2016-10-02 |
| 2016-10-03 | 18548 | -401.5402 | 1133 | 16.370697 | -0.354404 | 2016-10-03 |
| 2016-10-04 | 16554 | -356.2059 | 1017 | 16.277286 | -0.350252 | 2016-10-04 |
| 2016-10-05 | 19727 | -338.7394 | 1194 | 16.521776 | -0.283701 | 2016-10-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 2016-12-27 | 11293 | -270.1177 | 719 | 15.706537 | -0.375685 | 2016-12-27 |
| 2016-12-28 | 11157 | -232.3530 | 705 | 15.825532 | -0.329579 | 2016-12-28 |
| 2016-12-29 | 8275 | -196.1012 | 526 | 15.731939 | -0.372816 | 2016-12-29 |
| 2016-12-30 | 7970 | -167.9614 | 492 | 16.199187 | -0.341385 | 2016-12-30 |


In [115]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2016.csv", index= False)

## 2017 Twitter Data Combining

In [48]:
df=[]
for i in range(1, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2017_{i}_{i}.csv"))
    
df.append(pd.read_csv(f"../data/twitter/twitter_2017_10_9.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2017_11_1.csv"))
#df.append(pd.read_csv(f"../data/twitter/twitter_2018_3_22.csv"))

In [49]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

In [50]:
df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

In [51]:
df_final = df_final.drop_duplicates()

In [52]:
df_final = df_final.reset_index(drop=True)

In [53]:
# Aggregate per day: sums, row counts, and means of the per-tweet columns
grouped = df_final.groupby(by='time_stamp')
counts_and_means = pd.merge(grouped.count(), grouped.mean(), left_index=True, right_index=True)
df_final = pd.merge(grouped.sum(), counts_and_means, left_index=True, right_index=True)

In [54]:
df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

In [55]:
df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

In [56]:
df_final['date'] = df_final.index

In [57]:
df_final

| time_stamp | tweet_word_count_sum | tweet_compound_score_sum | tweets_sum | tweet_word_count_mean | tweet_compound_score_mean | date |
|---|---|---|---|---|---|---|
| 2017-01-01 | 7190 | -170.7131 | 471 | 15.265393 | -0.362448 | 2017-01-01 |
| 2017-01-02 | 18204 | -364.8603 | 1179 | 15.440204 | -0.309466 | 2017-01-02 |
| 2017-01-03 | 13661 | -269.7893 | 823 | 16.599028 | -0.327812 | 2017-01-03 |
| 2017-01-04 | 16702 | -388.7950 | 1021 | 16.358472 | -0.380798 | 2017-01-04 |
| 2017-01-05 | 11556 | -252.5025 | 722 | 16.005540 | -0.349726 | 2017-01-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 12771 | -180.3469 | 504 | 25.339286 | -0.357831 | 2017-12-27 |
| 2017-12-28 | 15430 | -200.0833 | 626 | 24.648562 | -0.319622 | 2017-12-28 |
| 2017-12-29 | 16841 | -247.6907 | 672 | 25.061012 | -0.368587 | 2017-12-29 |
| 2017-12-30 | 14819 | -190.0229 | 581 | 25.506024 | -0.327062 | 2017-12-30 |


In [58]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2017.csv", index= False)

## 2018 Twitter Data Combining

In [2]:
df=[]
for i in range(1, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2018_{i}_{i}.csv"))
    
df.append(pd.read_csv(f"../data/twitter/twitter_2018_9_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2018_3_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2018_3_22.csv"))

In [3]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

In [4]:
df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

In [5]:
df_final = df_final.drop_duplicates()

In [6]:
df_final = df_final.reset_index(drop=True)

In [7]:
# Aggregate per day: sums, row counts, and means of the per-tweet columns
grouped = df_final.groupby(by='time_stamp')
counts_and_means = pd.merge(grouped.count(), grouped.mean(), left_index=True, right_index=True)
df_final = pd.merge(grouped.sum(), counts_and_means, left_index=True, right_index=True)

In [8]:
df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

In [9]:
df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

In [10]:
df_final['date'] = df_final.index

In [11]:
df_final

| time_stamp | tweet_word_count_sum | tweet_compound_score_sum | tweets_sum | tweet_word_count_mean | tweet_compound_score_mean | date |
|---|---|---|---|---|---|---|
| 2018-01-01 | 22940 | -296.1463 | 893 | 25.688690 | -0.331631 | 2018-01-01 |
| 2018-01-02 | 23911 | -302.8459 | 918 | 26.046841 | -0.329897 | 2018-01-02 |
| 2018-01-03 | 18520 | -228.6339 | 730 | 25.369863 | -0.313197 | 2018-01-03 |
| 2018-01-04 | 17315 | -218.5435 | 677 | 25.576071 | -0.322812 | 2018-01-04 |
| 2018-01-05 | 15669 | -195.2983 | 593 | 26.423272 | -0.329339 | 2018-01-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 2018-12-27 | 41232 | -475.3081 | 1420 | 29.036620 | -0.334724 | 2018-12-27 |
| 2018-12-28 | 54610 | -712.9452 | 1925 | 28.368831 | -0.370361 | 2018-12-28 |
| 2018-12-29 | 41195 | -488.9873 | 1397 | 29.488189 | -0.350027 | 2018-12-29 |
| 2018-12-30 | 46359 | -520.1260 | 1552 | 29.870490 | -0.335133 | 2018-12-30 |


In [12]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2018.csv", index= False)

## 2019 Twitter Data Combining

In [124]:
df=[]
for i in range(1, 10, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2019_{i}_{i}.csv"))

In [125]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

In [126]:
df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

In [127]:
df_final = df_final.drop_duplicates()

In [128]:
df_final = df_final.reset_index(drop=True)

In [129]:
# Aggregate per day: sums, row counts, and means of the per-tweet columns
grouped = df_final.groupby(by='time_stamp')
counts_and_means = pd.merge(grouped.count(), grouped.mean(), left_index=True, right_index=True)
df_final = pd.merge(grouped.sum(), counts_and_means, left_index=True, right_index=True)

In [130]:
df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

In [131]:
df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

In [132]:
df_final['date'] = df_final.index

In [133]:
df_final

| time_stamp | tweet_word_count_sum | tweet_compound_score_sum | tweets_sum | tweet_word_count_mean | tweet_compound_score_mean | date |
|---|---|---|---|---|---|---|
| 2019-01-01 | 37704 | -340.3238 | 1324 | 28.477341 | -0.257042 | 2019-01-01 |
| 2019-01-02 | 46399 | -498.6116 | 1604 | 28.927057 | -0.310855 | 2019-01-02 |
| 2019-01-03 | 41221 | -447.3883 | 1415 | 29.131449 | -0.316175 | 2019-01-03 |
| 2019-01-04 | 45529 | -479.4148 | 1608 | 28.314055 | -0.298144 | 2019-01-04 |
| 2019-01-05 | 60166 | -757.9812 | 2155 | 27.919258 | -0.351731 | 2019-01-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 2019-09-26 | 100093 | -1024.4333 | 3372 | 29.683571 | -0.303806 | 2019-09-26 |
| 2019-09-27 | 108463 | -1056.7549 | 3650 | 29.715890 | -0.289522 | 2019-09-27 |
| 2019-09-28 | 89754 | -817.2610 | 2910 | 30.843299 | -0.280846 | 2019-09-28 |
| 2019-09-29 | 66931 | -679.0312 | 2166 | 30.900739 | -0.313495 | 2019-09-29 |


In [134]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2019.csv", index= False)

### Combining all Twitter data

In [135]:
df=[]
for i in range(2016, 2020, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_{i}.csv"))

In [136]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

In [137]:
df_final

| | tweet_word_count_sum | tweet_compound_score_sum | tweets_sum | tweet_word_count_mean | tweet_compound_score_mean | date |
|---|---|---|---|---|---|---|
| 0 | 9314 | -223.6686 | 593 | 15.706577 | -0.377181 | 2016-10-01 |
| 1 | 10535 | -225.3380 | 644 | 16.358696 | -0.349904 | 2016-10-02 |
| 2 | 18548 | -401.5402 | 1133 | 16.370697 | -0.354404 | 2016-10-03 |
| 3 | 16554 | -356.2059 | 1017 | 16.277286 | -0.350252 | 2016-10-04 |
| 4 | 19727 | -338.7394 | 1194 | 16.521776 | -0.283701 | 2016-10-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 1090 | 100093 | -1024.4333 | 3372 | 29.683571 | -0.303806 | 2019-09-26 |
| 1091 | 108463 | -1056.7549 | 3650 | 29.715890 | -0.289522 | 2019-09-27 |
| 1092 | 89754 | -817.2610 | 2910 | 30.843299 | -0.280846 | 2019-09-28 |
| 1093 | 66931 | -679.0312 | 2166 | 30.900739 | -0.313495 | 2019-09-29 |


In [138]:
df_final.to_csv("../data/twitter/twitter.csv",index= False)