
<span style="color:red"># **Project In Progress: Updated 9/18/2022**</span>

# Introduction


## Motivation:

Social media subgroups that focus on trading tend to be filled with poor analysis and a lack of due diligence. More sophisticated investors and traders tend to form groups on forums and websites with a barrier to entry in order to ensure quality discussion and analysis. Even so, large subgroups on social media sites, such as Reddit's Wall Street Bets (r/wallstreetbets), have in the past been able to move prices in the equity markets through the sheer number of participants engaged in groupthink. If we can use natural language processing to gauge sentiment on forums like this one, we may be able to identify mispriced derivatives by combining our price and volume projections with the implied and historical volatility of the contracts themselves. If we believe that sentiment will push future prices in a particular direction, we may be able to momentum trade the derivatives contracts, opening positions in contracts that we believe to be mispriced based on our assumptions and projections of future volatility.

This introduction will be updated as the project continues. Planned sections:

1. Sentiment analysis of stock market related Reddit forums
2. Derivatives pricing math (framework for pricing) and assumptions needed to determine 'mis-pricing'
3. APIs used 

# Part 1: Sentiment Analysis 

## Using the Reddit API to develop a sentiment analysis dashboard - Tickers mentioned and pos / neg sentiment

In [1]:
# Load libraries and packages

import numpy as np
import pandas as pd
import requests
import json
import os
import dotenv
import sys
from IPython import display
import math
from pprint import pprint
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Dark2')
sys.tracebacklimit = 0 # turn off the error tracebacks


## Part 1a:
### Connect to reddit API

In [2]:
# Information needed for the API request

# Create variables for keys and access token from .env file

import dotenv

dotenv.load_dotenv()

secret = os.getenv('client_secret')

client = os.getenv('client_id')

user = os.getenv('user_agent')
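
If one of the three variables is missing from the `.env` file, `os.getenv` returns `None`, and the problem may only surface later when the Reddit client is used. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of `dotenv` or PRAW):

```python
import os

def require_env(*names):
    """Return the values of the named environment variables, in order.

    Hypothetical helper: raises KeyError listing every variable
    that is unset or empty, instead of silently returning None.
    """
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return [values[name] for name in names]
```

After `dotenv.load_dotenv()`, the three assignments above could then be written as `client, secret, user = require_env('client_id', 'client_secret', 'user_agent')`.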


In [3]:
# Request reddit API using PRAW 

import praw

reddit = praw.Reddit(client_id = client,
                     client_secret = secret,
                     user_agent = user)


## Part 1b:
### Headlines: Gather headlines from stock market related subreddits - 'Wallstreetbets', 'TheWallStreet', 'TradeVol'

In [4]:
# Create empty set to hold post headlines, so as not to create duplicates

headlines_wsb = set()

# Iterate through Wall Street Bets for headlines 

for submission in reddit.subreddit('wallstreetbets').new(limit=None):
    headlines_wsb.add(submission.title)
    display.clear_output()
    print(len(headlines_wsb),'headlines found from wallstreetbets')

923 headlines found from wallstreetbets


In [5]:
# Repeat for r/TradeVol

headlines_vol = set()

# Iterate through TradeVol for headlines

for submission in reddit.subreddit('TradeVol').new(limit=None):
    headlines_vol.add(submission.title)
    display.clear_output()
    print(len(headlines_vol),'headlines found from TradeVol')

440 headlines found from TradeVol


In [6]:
# Repeat for r/TheWallStreet

headlines_wstreet = set()

# Iterate through TheWallStreet for headlines

for submission in reddit.subreddit('TheWallStreet').new(limit=None):
    headlines_wstreet.add(submission.title)
    display.clear_output()
    print(len(headlines_wstreet),'headlines found from TheWallStreet')

920 headlines found from TheWallStreet
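
The three cells above differ only in the subreddit name, so the loop can be factored into one function. A minimal sketch, assuming a hypothetical helper named `collect_headlines`; it relies only on the `subreddit(...).new(limit=...)` calls and the `title` attribute already used above:

```python
def collect_headlines(reddit, subreddit_name, limit=None):
    """Return the set of unique post titles from a subreddit's newest posts.

    `reddit` is any object exposing `subreddit(name).new(limit=...)`,
    such as the `praw.Reddit` instance created in Part 1a.
    Hypothetical helper, not part of PRAW.
    """
    headlines = set()  # a set drops duplicate titles automatically
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        headlines.add(submission.title)
    return headlines
```

With this in place, each cell reduces to a single call such as `headlines_wsb = collect_headlines(reddit, 'wallstreetbets')`, followed by one `print(len(headlines_wsb), 'headlines found')` after the loop has finished rather than on every iteration.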


In [7]:
# Download vader lexicon package from natural language toolkit used below

#import nltk
#nltk.download('vader_lexicon')


In [8]:
# Update vader lexicon dictionary with new words we want to look for
# plus their sentiment scores

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

new_words = {
    'call': 3.0,
    'put': -3.0,                                                      # this step is important, more to come
}

# Build one analyzer and extend its lexicon; the scoring loop below reuses this instance
sia = SIA()

sia.lexicon.update(new_words)

In [9]:
# Sentiment Intensity Analyzer (SIA) - WSB

# Reuse `sia` from the previous cell: creating a new SIA() here would discard the custom lexicon entries
results_WSB = []

for line in headlines_wsb:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results_WSB.append(pol_score)

pprint(results_WSB[:3], width=100)

[{'compound': 0.4588,
  'headline': '*insert favorite stock here*',
  'neg': 0.0,
  'neu': 0.5,
  'pos': 0.5},
 {'compound': 0.0,
  'headline': 'ES Technical Analysis by Adam Mancini',
  'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0},
 {'compound': 0.0, 'headline': 'Live look at my portfolio', 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}]


In [10]:
# Sentiment Intensity Analyzer (SIA) - TradeVOl

# Reuse the `sia` analyzer created above so the same lexicon is applied to every subreddit
results_VOL = []

for line in headlines_vol:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results_VOL.append(pol_score)

# pprint(results_VOL[:3], width=100)

In [11]:
# Sentiment Intensity Analyzer (SIA) - TheWallStreet

# Reuse the `sia` analyzer created above so the same lexicon is applied to every subreddit
results_WST = []

for line in headlines_wstreet:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results_WST.append(pol_score)

# pprint(results_WST[:3], width=100)
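
Each record produced above is a dict with a `compound` score and the `headline` text. To get the positive / negative split for the dashboard, the compound score can be bucketed into labels. A minimal sketch, assuming the commonly used cutoff of ±0.05 (an assumption to tune, not a value from this project) and hypothetical helper names `label_sentiment` and `summarize`:

```python
from collections import Counter

def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'pos', 'neg', or 'neu'.

    The 0.05 threshold is an assumed default, not a calibrated value.
    """
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

def summarize(results, threshold=0.05):
    """Count how many records fall into each sentiment label."""
    return Counter(label_sentiment(r['compound'], threshold) for r in results)

# The first three WSB records shown earlier, reduced to the fields used here
sample = [
    {'compound': 0.4588, 'headline': '*insert favorite stock here*'},
    {'compound': 0.0, 'headline': 'ES Technical Analysis by Adam Mancini'},
    {'compound': 0.0, 'headline': 'Live look at my portfolio'},
]
print(summarize(sample))
```

The same call works on `results_WSB`, `results_VOL`, and `results_WST`, which makes the three subreddits directly comparable.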

In [12]:
# Create a data frame from the headlines results
# The compound score ranges from very negative (-1) to very positive (1)
df_wsb = pd.DataFrame.from_records(results_WSB)
df_wsb.head()

|   | neg | neu | pos | compound | headline |
|---|-----|-----|-----|----------|----------|
| 0 | 0.0 | 0.5 | 0.5 | 0.4588 | \*insert favorite stock here\* |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | ES Technical Analysis by Adam Mancini |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | Live look at my portfolio |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | Hedgies rn |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | Thing to Know Today: What CPI revealed about o... |


In [13]:
# Create a data frame from the headlines results

df_wst = pd.DataFrame.from_records(results_WST)
df_wst.head()

|   | neg | neu | pos | compound | headline |
|---|-----|-----|-----|----------|----------|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | Daily Discussion - (November 30, 2021) |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | Post Market Discussion - (August 31, 2022) |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | Deviations for Wednesday, January 5, 2021 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | Nightly Discussion - (January 16, 2022) |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | Daily Discussion - (November 25, 2021) |


In [14]:
# Create a data frame from the headlines results
df_vol = pd.DataFrame.from_records(results_VOL)
df_vol.head()

|   | neg | neu | pos | compound | headline |
|---|-----|-----|-----|----------|----------|
| 0 | 0.0 | 0.625 | 0.375 | 0.6369 | Best broker for shorting UVXY, VXX, and similar? |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | Volatility Trading Weekly Discussion - March 2... |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | Link- VIX The Virus |
| 3 | 0.164 | 0.472 | 0.363 | 0.399 | Most Overlooked Opportunity of 2022 \| $VXX |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | Volatility Trading Weekly Discussion - April 2... |

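The dashboard also needs the tickers mentioned, and some headlines mark them with a leading dollar sign (for example `$VXX`). A minimal sketch of counting those cashtags, assuming tickers are one to five uppercase letters; the helper name `count_cashtags` is hypothetical. Bare symbols such as `UVXY` are deliberately not matched, because a plain uppercase-word rule would also pick up terms like `CPI` or `ES`:

```python
import re
from collections import Counter

# Assumed pattern: a dollar sign followed by 1 to 5 uppercase letters
CASHTAG = re.compile(r'\$([A-Z]{1,5})\b')

def count_cashtags(headlines):
    """Count the number of headlines that mention each cashtag.

    A ticker repeated inside one headline is counted once for that headline.
    """
    counts = Counter()
    for headline in headlines:
        counts.update(set(CASHTAG.findall(headline)))
    return counts
```

Calling `count_cashtags(headlines_vol).most_common(10)` would then list the most frequently tagged tickers in r/TradeVol.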

We can see that the SIA is not performing optimally: most of the headlines shown above receive a neutral compound score of 0.0. To properly identify community slang and trading terms as either positive or negative, we need to parse the headlines and add words to the VADER lexicon dictionary.
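
One way to keep those additions manageable is to normalize and validate them before they reach the analyzer. A minimal sketch, assuming that VADER scores words on roughly a -4 to +4 valence scale and looks up single lowercase tokens; the helper `build_lexicon_updates` is hypothetical, and the scores for `moon` and `bagholder` are illustrative placeholders rather than calibrated values:

```python
def build_lexicon_updates(candidates, low=-4.0, high=4.0):
    """Normalize and validate custom lexicon entries.

    Keys are lowercased and must be single tokens; scores must fall
    inside the assumed valence range [low, high].
    """
    updates = {}
    for word, score in candidates.items():
        token = word.strip().lower()
        if not token or ' ' in token:
            raise ValueError(f"Expected a single token, got {word!r}")
        if not low <= score <= high:
            raise ValueError(f"Score for {word!r} is outside [{low}, {high}]")
        updates[token] = float(score)
    return updates

# 'call' and 'put' come from the cell above; the other two are placeholders
slang = {'call': 3.0, 'put': -3.0, 'Moon': 3.0, 'bagholder': -2.5}
updates = build_lexicon_updates(slang)
```

The result can be passed to the analyzer's `lexicon.update(...)` call in the same way as `new_words`.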

## Part 1c:
### Gather comments from certain posts and analyze sentiment