# RGR Stock Price Forecasting Project - Part 3

Author: Jack Wang

---

## Problem Statement

Stock prices are hard to predict because they are affected not only by the performance of the underlying companies but also by the expectations of the general public. The stock prices of firearm companies are known to be highly correlated with public opinion on gun control. My model aims to predict the stock price of one of the largest firearm companies in the United States, RGR (Sturm, Ruger & Co.), using its historical stock price, public opinion on gun control, and its financial reports to the SEC.

## Executive Summary

The goal of my project is to build a **time series regression model** that predicts the stock price of RGR. The data I am using are historical stock prices from [Yahoo Finance](https://finance.yahoo.com/quote/RGR/history?p=RGR), tweets scraped from [Twitter](https://twitter.com/), subreddit posts that mention gun control, and the company's financial reports filed with the [SEC](https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000095029&type=&dateb=&owner=exclude&count=100). I will do sentiment analysis on the text data and time series modeling on the historical stock price data. The model will be evaluated using mean squared error (MSE).
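
For reference, MSE is the average of the squared differences between actual and predicted prices. The snippet below is a minimal sketch of that calculation. The helper `mean_squared_error` is written here for illustration, and the prices are made-up numbers, not RGR data.

```python
import numpy as np

def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Illustrative prices only, not real RGR data
actual_prices = [50.0, 52.0, 51.0]
predicted_prices = [49.0, 52.0, 53.0]

# Squared errors are 1, 0 and 4, so the mean is 5/3
mse = mean_squared_error(actual_prices, predicted_prices)
```

A lower MSE means the predictions sit closer to the observed prices. Because errors are squared, a few large misses weigh more heavily than many small ones.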

## Content

This project consists of 7 Jupyter notebooks:
- Part-1-stock-price-data
- Part-2-twitter-scraper
- ***Part-3-twitter-data-cleaning***
    - [2016 Twitter Data](#2016-Twitter-Data)
    - [2017 Twitter Data](#2017-Twitter-Data)
    - [2018 Twitter Data](#2018-Twitter-Data)
    - [2019 Twitter Data](#2019-Twitter-Data)
    - [All Twitter Data](#All-Twitter-Data)
- Part-4-reddit-data-scraper
- Part-5-reddit-data-cleaning
- Part-6-sec-data-cleaning
- Part-7-modeling-and-evaluation


---


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import re
from functools import reduce

from datetime import datetime
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

### 2016 Twitter Data

In [105]:
# Importing data
df=[]
for i in range(10, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2016_{i}_{i}.csv"))    
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_8.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_10_15.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_8.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_10.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_15.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_11_30.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2016_12_16.csv"))
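
The repeated `read_csv` calls above could also be collected into one helper that takes a list of sources and stacks the rows. This is only a sketch under assumptions: `load_tweet_files` is a hypothetical name, `tweet` is an assumed name for the text column, and the rows are made up. It uses `pd.concat` rather than the chained outer merge in the next cell; once duplicates are dropped, both approaches should leave the same set of unique rows.

```python
import io
import pandas as pd

def load_tweet_files(sources):
    """Read each CSV source (path or file-like object) and stack the rows."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Two tiny in-memory "files" standing in for the scraped CSVs (made-up rows)
file_a = io.StringIO(
    "time_stamp,tweet,tweet_word_count,compound\n"
    "2016-10-01 08:00:00,first tweet,2,0.5\n"
)
file_b = io.StringIO(
    "time_stamp,tweet,tweet_word_count,compound\n"
    "2016-10-01 08:00:00,first tweet,2,0.5\n"
    "2016-10-02 09:30:00,second tweet,2,-0.25\n"
)

raw = load_tweet_files([file_a, file_b])

# Overlapping scrapes contain the same tweet twice, so drop exact duplicates
deduped = raw.drop_duplicates().reset_index(drop=True)
```

With files on disk, `sources` would simply be the list of CSV paths used in the cell above.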

In [106]:
# Merging, dropping duplicates, resetting index, and adding features by using groupby

df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

df_final = df_final.drop_duplicates()

df_final = df_final.reset_index(drop=True)

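# Aggregate per day: sums, counts, and means, joined on the date index.
# numeric_only=True keeps the text column out of sum() and mean() on recent pandas versions.
daily = df_final.groupby(by='time_stamp')
df_final = pd.merge(daily.sum(numeric_only=True),
                    pd.merge(daily.count(), daily.mean(numeric_only=True), left_index=True, right_index=True),
                    left_index=True, right_index=True)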

df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

df_final['date'] = df_final.index

df_final
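
The merge of three groupby results, followed by the drop and rename, can also be written as a single named aggregation. The sketch below is an alternative, not the notebook's method. It assumes pandas 0.25 or later, uses `tweet` as an assumed name for the text column, and runs on made-up rows.

```python
import pandas as pd

# Made-up rows; `tweet` is an assumed name for the text column
tweets = pd.DataFrame({
    "time_stamp": ["2016-10-01 08:00:00", "2016-10-01 17:45:00", "2016-10-02 09:30:00"],
    "tweet": ["a b c", "d e", "f g h i"],
    "tweet_word_count": [3, 2, 4],
    "compound": [0.5, -0.1, 0.2],
})

# Keep only the calendar date so that tweets group by day
tweets["time_stamp"] = pd.to_datetime(tweets["time_stamp"]).dt.date

# One pass that produces the five feature columns directly
daily = tweets.groupby("time_stamp").agg(
    tweet_word_count_sum=("tweet_word_count", "sum"),
    tweet_compound_score_sum=("compound", "sum"),
    tweets_sum=("tweet", "count"),
    tweet_word_count_mean=("tweet_word_count", "mean"),
    tweet_compound_score_mean=("compound", "mean"),
)
daily["date"] = daily.index
```

Naming each output column next to its source column and function avoids the `_x`/`_y` suffixes and the positional rename, so the result does not depend on column order.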

In [115]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2016.csv", index= False)

### 2017 Twitter Data

In [48]:
# Import data
df=[]
for i in range(1, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2017_{i}_{i}.csv"))
    
df.append(pd.read_csv(f"../data/twitter/twitter_2017_10_9.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2017_11_1.csv"))

In [49]:
# Merging, dropping duplicates, resetting index, and adding features by using groupby

df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

df_final = df_final.drop_duplicates()

df_final = df_final.reset_index(drop=True)

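# Aggregate per day: sums, counts, and means, joined on the date index.
# numeric_only=True keeps the text column out of sum() and mean() on recent pandas versions.
daily = df_final.groupby(by='time_stamp')
df_final = pd.merge(daily.sum(numeric_only=True),
                    pd.merge(daily.count(), daily.mean(numeric_only=True), left_index=True, right_index=True),
                    left_index=True, right_index=True)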

df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

df_final['date'] = df_final.index

df_final

In [58]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2017.csv", index= False)

### 2018 Twitter Data

In [2]:
# Import data

df=[]
for i in range(1, 13, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2018_{i}_{i}.csv"))
    
df.append(pd.read_csv(f"../data/twitter/twitter_2018_9_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2018_3_16.csv"))
df.append(pd.read_csv(f"../data/twitter/twitter_2018_3_22.csv"))

In [3]:
# Merging, dropping duplicates, resetting index, and adding features by using groupby

df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

df_final = df_final.drop_duplicates()

df_final = df_final.reset_index(drop=True)

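# Aggregate per day: sums, counts, and means, joined on the date index.
# numeric_only=True keeps the text column out of sum() and mean() on recent pandas versions.
daily = df_final.groupby(by='time_stamp')
df_final = pd.merge(daily.sum(numeric_only=True),
                    pd.merge(daily.count(), daily.mean(numeric_only=True), left_index=True, right_index=True),
                    left_index=True, right_index=True)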

df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

df_final['date'] = df_final.index

df_final

In [12]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2018.csv", index= False)

### 2019 Twitter Data

In [124]:
# Import data

df=[]
for i in range(1, 10, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_2019_{i}_{i}.csv"))

In [125]:
# Merging, dropping duplicates, resetting index, and adding features by using groupby

df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

df_final['time_stamp'] = pd.to_datetime(df_final['time_stamp']).dt.date

df_final = df_final.drop_duplicates()

df_final = df_final.reset_index(drop=True)

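# Aggregate per day: sums, counts, and means, joined on the date index.
# numeric_only=True keeps the text column out of sum() and mean() on recent pandas versions.
daily = df_final.groupby(by='time_stamp')
df_final = pd.merge(daily.sum(numeric_only=True),
                    pd.merge(daily.count(), daily.mean(numeric_only=True), left_index=True, right_index=True),
                    left_index=True, right_index=True)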

df_final = df_final.drop(columns=['tweet_word_count_x', 'compound_x'])

df_final.columns=['tweet_word_count_sum', 'tweet_compound_score_sum', 'tweets_sum', 'tweet_word_count_mean', 'tweet_compound_score_mean']

df_final['date'] = df_final.index

df_final

In [134]:
# export to csv file
df_final.to_csv("../data/twitter/twitter_2019.csv", index= False)

### All Twitter Data

In [135]:
df=[]
for i in range(2016, 2020, 1):
    df.append(pd.read_csv(f"../data/twitter/twitter_{i}.csv"))

In [136]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer'), df)

In [138]:
df_final.to_csv("../data/twitter/twitter.csv",index= False)
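
Before the combined file feeds the modeling notebook, it may be worth checking that each date appears only once and that rows are in date order. The following is a minimal sketch of such a check on made-up yearly summaries. It stacks the frames with `pd.concat` instead of the outer merge above, which is an assumption about what is wanted here, and shows only two of the summary columns.

```python
import pandas as pd

# Made-up yearly summaries with two of the columns written by the cells above
year_a = pd.DataFrame({"tweets_sum": [5, 3], "date": ["2016-12-31", "2016-12-30"]})
year_b = pd.DataFrame({"tweets_sum": [4], "date": ["2017-01-01"]})

# Stack the yearly frames, parse the dates, and sort chronologically
combined = pd.concat([year_a, year_b], ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"])
combined = combined.sort_values("date").reset_index(drop=True)

# True if any calendar date shows up more than once
has_duplicate_dates = bool(combined["date"].duplicated().any())
```

If `has_duplicate_dates` comes back `True`, two yearly files cover the same day and their rows would need to be reconciled before modeling.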