# Scraping Financial Data To Predict Returns

Have you ever wanted to build an app that tells you **when to buy and sell a stock** based on the general public's opinion of it?
We are going to use Twitter to mine public opinion on **NVIDIA** and build a sentiment-based strategy that helps us predict whether we should buy or sell this stock.

To scrape data from Twitter we are going to use the **snscrape library** by [JustAnotherArchivist](https://github.com/JustAnotherArchivist/snscrape), which allows one to scrape tweets without the restrictions of Tweepy. Snscrape requires Python 3.8 or higher. With snscrape we can pull data into a data frame with filters of our choosing.

**Attributes** available through the snscrape Tweet object and their datatypes:

***url:*** str, permalink pointing to url location 

***date:*** datetime.datetime, date tweet was created 

***content:*** str, text content of the tweet 

***renderedContent:*** str, text content of the tweet as rendered for display 

***id:*** int, id of tweet 

***user:*** 'User' object containing username, displayname, id, description, descriptionUrls, verified, created, followersCount, friendsCount, statusesCount, favouritesCount, listedCount, mediaCount, location, protected, linkUrl, profileImageUrl, profileBannerUrl 

***outlinks:*** list 

***tcooutlinks:*** list 

***replyCount:*** int, count of replies 

***retweetCount:*** int, count of retweets 

***likeCount:*** int, count of likes

***quoteCount:*** int, count of users that quoted the tweet and replied 

***conversationId:*** int, id of the conversation the tweet belongs to (same as the tweet id when the tweet starts the conversation) 

***lang:*** str, machine generated, assumed language of the tweet 

***source:*** str, where the tweet was posted from (e.g., iPhone, Android) 

***media:*** typing.Optional[typing.List['Medium']] = None, 'Media' object, containing previewUrl, fullUrl, type 

***retweetedTweet:*** typing.Optional['Tweet'] = None, if the tweet is a retweet, the original tweet 

***quotedTweet:*** typing.Optional['Tweet'] = None, if the tweet is a quote tweet, the quoted tweet 

***mentionedUsers:*** typing.Optional[typing.List['User']] = None, 'User' object of any mentioned users in tweet
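
As a rough illustration of how these attributes end up in a data frame, the sketch below flattens tweet-like objects into rows. The `tweet_to_row` helper and the `SimpleNamespace` stand-ins are hypothetical (they are not part of snscrape); real objects yielded by the scraper should work the same way as long as they expose the attributes listed above.

```python
from types import SimpleNamespace

import pandas as pd


def tweet_to_row(tweet):
    """Pick a few attributes off a tweet-like object (hypothetical helper)."""
    return {
        'id': tweet.id,
        'date': tweet.date,
        'content': tweet.content,
        'likeCount': tweet.likeCount,
        'username': tweet.user.username,
    }


# Made-up stand-ins for Tweet objects, used only so the example runs offline
fake_tweets = [
    SimpleNamespace(id=1, date='2022-08-07', content='NVIDIA is great', likeCount=3,
                    user=SimpleNamespace(username='alice')),
    SimpleNamespace(id=2, date='2022-08-07', content='NVIDIA drivers again', likeCount=0,
                    user=SimpleNamespace(username='bob')),
]

tweets_df = pd.DataFrame([tweet_to_row(t) for t in fake_tweets])
# A filter of our choosing: keep only tweets with at least one like
liked = tweets_df[tweets_df['likeCount'] > 0]
print(liked)
```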


In [2]:
!pip install snscrape

Defaulting to user installation because normal site-packages is not writeable
Collecting snscrape
  Downloading snscrape-0.4.3.20220106-py3-none-any.whl (59 kB)
Installing collected packages: snscrape
Successfully installed snscrape-0.4.3.20220106




In [12]:
import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import csv
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
# Set maximum number of tweets to pull
maxTweets = 20000
# Set the keyword the scraper searches for
keyword = 'NVIDIA'
# Search query: English tweets in the date range, excluding links and replies
query = keyword + ' lang:en since:2021-08-08 until:2022-08-08 -filter:links -filter:replies'

# To scrape a single user's tweets instead, use a query such as 'from:jack'

# Create the CSV file ('w' overwrites, so re-running does not duplicate the header row)
with open('nvidia_tweets_result.csv', 'w', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['id', 'date', 'tweet'])
    # Write tweets into the CSV file, stopping after maxTweets rows
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= maxTweets:
            break
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])
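
The search string in the cell above is just a keyword followed by search operators. A minimal sketch of assembling it from parts is shown below; `build_query` and its parameter names are hypothetical helpers introduced here for illustration, while the operators themselves are the ones used in the cell above.

```python
def build_query(keyword, since, until, lang='en',
                exclude_links=True, exclude_replies=True):
    """Assemble a search string from a keyword and optional filters (hypothetical helper)."""
    parts = [keyword, f'lang:{lang}', f'since:{since}', f'until:{until}']
    if exclude_links:
        parts.append('-filter:links')
    if exclude_replies:
        parts.append('-filter:replies')
    return ' '.join(parts)


query = build_query('NVIDIA', '2021-08-08', '2022-08-08')
print(query)
```

Keeping the filters as parameters makes it easier to rerun the scrape for another ticker or date range without editing the string by hand.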

For analysing the sentiment of our scraped data we will use **VADER** (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in **social media**. VADER relies on a sentiment lexicon, a list of lexical features (e.g., words) that are labelled according to their semantic orientation as either **positive** or **negative**, combined with a set of rules for adjusting intensity.

VADER analyses sentiments primarily based on certain **key points:**

-	***Punctuation:*** The use of an exclamation mark (!) increases the magnitude of the intensity without modifying the semantic orientation. For example, “The food here is good!” is more intense than “The food here is good.”, and increasing the number of exclamation marks increases the magnitude accordingly.
-	***Capitalization:*** Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words increases the magnitude of the sentiment intensity. For example, “The food here is GREAT!” conveys more intensity than “The food here is great!”
-	***Degree modifiers:*** Also called intensifiers, they impact the sentiment intensity by either increasing or decreasing the intensity. For example, “The service here is extremely good” is more intense than “The service here is good”, whereas “The service here is marginally good” reduces the intensity.
-	***Conjunctions:*** Use of conjunctions like “but” signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The food here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating.
-	***Preceding Tri-gram:*** By examining the tri-gram preceding a sentiment-laden lexical feature, we catch nearly 90% of cases where negation flips the polarity of the text. A negated sentence would be “The food here isn’t really all that great”.

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Trying the analyzer
analyzer = SentimentIntensityAnalyzer()
def sentiment_analyzer_scores(sentence):
    score = analyzer.polarity_scores(sentence)
    return("{:-<40} {}".format(sentence, str(score)))
print(sentiment_analyzer_scores("NVIDIA sucks!"))

NVIDIA sucks!--------------------------- {'neg': 0.736, 'neu': 0.264, 'pos': 0.0, 'compound': -0.4199}


In [4]:
df = pd.read_csv('nvidia_tweets_result.csv')
df = df.set_index('date')
df.head()

| date | id | tweet |
|---|---|---|
| 2022-08-07 23:39:18+00:00 | 1556424776658214912 | Transaction reports filed by Pelosi, a multi-m... |
| 2022-08-07 23:17:36+00:00 | 1556419317415157761 | It's been about 4 months now and Bungie still ... |
| 2022-08-07 22:55:45+00:00 | 1556413818821033985 | absolutely love how every single time I go to ... |
| 2022-08-07 22:31:04+00:00 | 1556407607002284054 | Netflix app keeps interacting badly with my nV... |
| 2022-08-07 22:18:19+00:00 | 1556404396468359168 | Powering .@MercedesBenz's Drive Pilot, Nvidia... |


In [5]:
# Creating a column for each sentiment score of each individual tweet
# (each tweet is scored once, then the four values are split into columns)
scores = [analyzer.polarity_scores(x) for x in df['tweet']]
for col in ['compound', 'neg', 'neu', 'pos']:
    df[col] = [s[col] for s in scores]

In [7]:
# formatting date column (utc=True parses the '+00:00' offset in the scraped timestamps)
df.index = pd.to_datetime(df.index, errors='coerce', utc=True)
# resampling data to have daily sentiment values (numeric columns only; tweet text cannot be averaged)
df = df.resample('D').mean(numeric_only=True)
# removing weekends (Monday=0 ... Friday=4)
df = df.loc[df.index.weekday < 5]
# writing data to new CSV file so that we can change some formatting in Excel
df.to_csv('tweets_resampled_mean_no_weekends.csv')
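
To see what the daily resampling and the weekend filter do, here is a small sketch on made-up compound scores (the values and the `toy` frame are invented for illustration and are not taken from the scraped data).

```python
import pandas as pd

# Made-up compound scores: two tweets on Friday, one on Saturday, one on Monday
toy = pd.DataFrame(
    {'compound': [0.2, 0.4, -0.5, 0.1]},
    index=pd.to_datetime([
        '2022-08-05 09:00', '2022-08-05 17:30',  # Friday
        '2022-08-06 12:00',                      # Saturday
        '2022-08-08 08:15',                      # Monday
    ]),
)

# One row per calendar day; days without tweets become NaN
daily = toy.resample('D').mean()
# Keep Monday (0) through Friday (4) only
weekdays = daily[daily.index.weekday < 5]
print(weekdays)
```

The two Friday tweets collapse into a single averaged row, and the Saturday and Sunday rows are dropped.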

We are going to use **Excel** to quickly remove the hours, minutes and seconds from our date column and rename it to **‘Date’** so that it matches the Date column in our ticker data. To remove the h/m/s data, navigate to the ‘Data’ tab, highlight the ‘date’ column, open Text to Columns, select ‘Space’ as the delimiter, click ‘Next’, then select ‘YMD’ as the date format. We save the result as `tweets_resampled_mean_no_weekends_corrected.xlsx`.
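
The same cleanup could probably be done without leaving pandas. The sketch below is an alternative under that assumption, not the workflow used in this notebook; `to_daily_date_index` is a hypothetical helper and the `example` frame is made up.

```python
import pandas as pd


def to_daily_date_index(frame):
    """Drop time-of-day and timezone from a datetime index and name it 'Date' (hypothetical helper)."""
    out = frame.copy()
    idx = pd.to_datetime(out.index, utc=True)
    # Remove the timezone, then set every timestamp to midnight
    out.index = idx.tz_localize(None).normalize()
    out.index.name = 'Date'
    return out


# Tiny made-up frame standing in for the resampled sentiment data
example = pd.DataFrame(
    {'compound': [0.1, 0.2]},
    index=pd.to_datetime(['2022-01-27 00:00:00+00:00', '2022-01-28 00:00:00+00:00']),
)
example.index.name = 'date'

cleaned = to_daily_date_index(example)
print(cleaned)
```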

In [3]:
corrected_df = pd.read_excel("tweets_resampled_mean_no_weekends_corrected.xlsx", parse_dates=True, index_col=0)

In order to build a trading strategy around this sentiment data, we need to download the ticker data for NVIDIA (NVDA) from **Yahoo Finance.**

In [4]:
import yfinance as yf
nvidia = yf.download("NVDA", start="2022-01-27", end="2022-08-08")

combined_df = corrected_df.merge(nvidia, on='Date',how='outer').dropna()

[*********************100%***********************]  1 of 1 completed
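
An outer merge followed by `dropna()` keeps only the dates present in both frames, which is the same set of rows an inner merge would return. The sketch below shows this on two tiny frames; the `pos` values are made up, and the closing prices are rounded from the table further down.

```python
import pandas as pd

# Made-up sentiment values; one date (2022-01-26) has no matching price row
sentiment = pd.DataFrame(
    {'pos': [0.10, 0.08, 0.09]},
    index=pd.to_datetime(['2022-01-26', '2022-01-27', '2022-01-28']),
)
sentiment.index.name = 'Date'

# One date (2022-01-31) has no matching sentiment row
prices = pd.DataFrame(
    {'Close': [219.44, 228.40, 244.86]},
    index=pd.to_datetime(['2022-01-27', '2022-01-28', '2022-01-31']),
)
prices.index.name = 'Date'

outer = sentiment.merge(prices, on='Date', how='outer').dropna()
inner = sentiment.merge(prices, on='Date', how='inner')
print(outer)
```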


In [10]:
combined_df.head()

| Date | compound | neg | neu | pos | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 2022-01-27 | 0.155158 | 0.060992 | 0.831168 | 0.107847 | 235.679993 | 239.949997 | 216.75 | 219.440002 | 219.356247 | 57335300.0 |
| 2022-01-28 | 0.100486 | 0.045239 | 0.884761 | 0.070022 | 220.119995 | 228.580002 | 212.960007 | 228.399994 | 228.31282 | 54377400.0 |
| 2022-01-31 | 0.149381 | 0.045764 | 0.861438 | 0.092787 | 231.820007 | 245.089996 | 230.520004 | 244.860001 | 244.766541 | 56468000.0 |
| 2022-02-01 | 0.062527 | 0.055355 | 0.869075 | 0.07557 | 251.039993 | 251.449997 | 238.899994 | 246.380005 | 246.285965 | 51892500.0 |
| 2022-02-02 | 0.125991 | 0.055548 | 0.842992 | 0.101468 | 257.940002 | 258.170013 | 245.529999 | 252.419998 | 252.323669 | 54341900.0 |


**Trading strategy logic**: We go long (buy) the stock when the positive sentiment is greater than the negative sentiment, and short it otherwise. We then create a strategy returns column that multiplies our position by the log returns. The position is shifted with shift(1) so that each day's return is paired with the previous day's sentiment rather than the current day's, which prevents hindsight (look-ahead) bias.

In [5]:
# Calculating Log Returns Column
combined_df['returns'] = np.log(combined_df['Close'] / combined_df['Close'].shift(1))
# Long when the pos sentiment > neg sentiment and short otherwise
combined_df['position'] = np.where(combined_df['pos'] > combined_df['neg'], 1, -1)
# Create Strategy column by multiplying SHIFTED position (to avoid hindsight bias) and returns
combined_df['strategy'] = combined_df['position'].shift(1) * combined_df['returns']
combined_df.dropna(inplace=True)
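
To make the effect of `shift(1)` concrete, here is the same logic on a four-day toy frame. All prices and sentiment values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices and sentiment for four consecutive trading days
toy = pd.DataFrame(
    {
        'Close': [100.0, 110.0, 99.0, 108.9],
        'pos': [0.10, 0.02, 0.09, 0.05],
        'neg': [0.05, 0.06, 0.03, 0.04],
    },
    index=pd.to_datetime(['2022-02-01', '2022-02-02', '2022-02-03', '2022-02-04']),
)

toy['returns'] = np.log(toy['Close'] / toy['Close'].shift(1))
toy['position'] = np.where(toy['pos'] > toy['neg'], 1, -1)
# Each day's return is multiplied by the position decided on the previous day
toy['strategy'] = toy['position'].shift(1) * toy['returns']
print(toy[['returns', 'position', 'strategy']])
```

On the third day the price falls, but the position carried over from the second day is short, so the strategy return for that day is positive.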

In [12]:
combined_df.head()

| Date | compound | neg | neu | pos | Open | High | Low | Close | Adj Close | Volume | returns | position | strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-01-28 | 0.100486 | 0.045239 | 0.884761 | 0.070022 | 220.119995 | 228.580002 | 212.960007 | 228.399994 | 228.31282 | 54377400.0 | 0.04002 | 1 | 0.04002 |
| 2022-01-31 | 0.149381 | 0.045764 | 0.861438 | 0.092787 | 231.820007 | 245.089996 | 230.520004 | 244.860001 | 244.766541 | 56468000.0 | 0.069588 | 1 | 0.069588 |
| 2022-02-01 | 0.062527 | 0.055355 | 0.869075 | 0.07557 | 251.039993 | 251.449997 | 238.899994 | 246.380005 | 246.285965 | 51892500.0 | 0.006188 | 1 | 0.006188 |
| 2022-02-02 | 0.125991 | 0.055548 | 0.842992 | 0.101468 | 257.940002 | 258.170013 | 245.529999 | 252.419998 | 252.323669 | 54341900.0 | 0.024219 | 1 | 0.024219 |
| 2022-02-03 | 0.13559 | 0.062278 | 0.843247 | 0.094495 | 244.580002 | 250.770004 | 237.800003 | 239.479996 | 239.388596 | 41017800.0 | -0.052624 | 1 | -0.052624 |


In [6]:
np.exp(combined_df[['returns','strategy']].sum())

returns     0.865339
strategy    1.174834
dtype: float64
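
Summing log returns and then exponentiating gives the gross return over the whole period, because the logs telescope: the sum of ln(P_t / P_{t-1}) equals ln(P_last / P_first). A quick check on made-up prices:

```python
import numpy as np

# Made-up closing prices
prices = np.array([100.0, 110.0, 99.0, 108.9])
log_returns = np.log(prices[1:] / prices[:-1])

# Gross return recovered from the summed log returns
gross_from_logs = np.exp(log_returns.sum())
# Gross return computed directly from the first and last price
gross_direct = prices[-1] / prices[0]
print(gross_from_logs, gross_direct)
```

A value above 1 means the position gained over the period, and a value below 1 means it lost.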

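The output shows that our strategy outperforms the benchmark: buy-and-hold ends at about 0.87 times the starting value (a loss of roughly 13%), while the strategy ends at about 1.17 times (a gain of roughly 17%), a difference of about 0.31. Then we can visualize the performance of our strategy against the benchmark: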

In [23]:
cum_df = np.exp(combined_df[["returns", "strategy"]].cumsum())

In [25]:
fig = px.line(cum_df, x=combined_df.index, y=["returns","strategy"])
fig.update_layout(legend_title_text="")
fig.update_yaxes(title_text='Cumulative Returns')
fig.show()