# Stock news sentiment analysis end-to-end
1. Data collection from Alpaca for NVDA
2. Data preprocessing and qualitative analysis: drop duplicates, ensure relevance, remove similar articles using a cosine similarity score, etc.
3. Use VADER (rule-based) and Google Bard (ML-based; the code below calls the PaLM API model `text-bison-001`) for sentiment analysis and compare the results

In [1]:
from alpaca_trade_api import REST
import pandas as pd

In [35]:
# get the API info from a file
key = None
secret = None
# each credential line is assumed to look like "<name> = <value>", so the value is the third token
with open('../Alpaca') as alpacaInfo:
    for line in alpacaInfo:
        if "Key" in line:
            key = line.split()[2].strip()
        elif "Secret" in line:
            secret = line.split()[2].strip()
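
The same parsing can be wrapped in a small reusable function. This is only a sketch: it assumes each credential line has the form `Key = <value>` or `Secret = <value>` (inferred from the `split()[2]` indexing above), and `parse_credentials` and the sample lines are hypothetical names and values, not part of the Alpaca library.

```python
def parse_credentials(lines):
    """Return (key, secret) parsed from lines shaped like 'Key = <value>'."""
    key = secret = None
    for line in lines:
        tokens = line.split()
        if len(tokens) < 3:
            continue  # skip blank or malformed lines instead of raising IndexError
        if "Key" in line:
            key = tokens[2]
        elif "Secret" in line:
            secret = tokens[2]
    return key, secret


# hypothetical file contents, for illustration only
sample_lines = ["Key = demo-key-id\n", "\n", "Secret = demo-secret\n"]
print(parse_credentials(sample_lines))
```

Passing an open file object works the same way, since iterating a file yields its lines.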

# initialize the api object

In [3]:
news_api = REST(key_id=key, secret_key=secret)
symbol = 'NVDA'
start = '2001-01-01'
end = '2023-10-28'
limit = 500000
news_list = news_api.get_news(symbol=symbol, start=start, end=end, limit=limit, include_content=True, exclude_contentless=True)

In [4]:
len(news_list)

4907

In [5]:
help(news_list)

Help on NewsListV2 in module alpaca_trade_api.entity_v2 object:

class NewsListV2(builtins.list)
 |  NewsListV2(raw)
 |  
 |  Method resolution order:
 |      NewsListV2
 |      builtins.list
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, raw)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from builtins.list:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(se

In [6]:
news_list[0]

NewsV2({   'author': 'Benzinga Insights',
    'content': '<p>This whale alert can help traders discover the next big '
               'trading opportunities.</p>\n'
               '<p>Whales are entities with large sums of money and we track '
               'their transactions here at Benzinga on our options activity '
               'scanner.</p>\n'
               '<p>Traders often look for circumstances when the market '
               'estimation of an option diverges away from its normal worth. '
               'Abnormal amounts of trading activity could push option prices '
               'to hyperbolic or underperforming levels. </p>\n'
               "<p>Here's the list of options activity happening in today's "
               'session:  <table>\n'
               '<thead>\n'
               '<tr>\n'
               '<th><strong>Symbol</strong></th>\n'
               '<th><strong>PUT/CALL</strong></th>\n'
               '<th><strong>Trade Type</strong></th>\n'
               '<th>

In [7]:
# use a dataframe to store the news
columns = ['id', 'created_at', 'updated_at', 'headline', 'content', 'source', 'url']

# traverse the NewsV2 objects and collect one record per article;
# building the frame once avoids re-copying it on every pd.concat call
records = [{col: getattr(article, col) for col in columns} for article in news_list]
news = pd.DataFrame(records, columns=columns)
news.head()

Unnamed: 0,id,created_at,updated_at,headline,content,source,url
0,35470224,2023-10-27 17:35:13+00:00,2023-10-27 17:35:13+00:00,10 Information Technology Stocks Whale Activit...,<p>This whale alert can help traders discover ...,benzinga,https://www.benzinga.com/markets/options/23/10...
1,35467939,2023-10-27 15:58:09+00:00,2023-10-27 16:00:06+00:00,"Amazon To Make 'Tens Of Billions' From AI, Con...","<p><em>To gain an edge, this is what you need ...",benzinga,https://www.benzinga.com/economics/23/10/35467...
2,35466905,2023-10-27 15:46:14+00:00,2023-10-27 15:46:14+00:00,Intel Analysts Aren't Totally Impressed By Q3 ...,<p>Shares of <strong>Intel Corporation </stron...,benzinga,https://www.benzinga.com/analyst-ratings/analy...
3,35466443,2023-10-27 15:04:13+00:00,2023-10-27 15:07:02+00:00,Break Down: What Does The Recent Selling Of Th...,<h3>The Markets</h3>\r\n\r\n<p>The market rema...,benzinga,https://www.benzinga.com/markets/23/10/3546644...
4,35462944,2023-10-27 13:36:57+00:00,2023-10-27 13:36:57+00:00,What's Going On With Nvidia Stock Friday?,<p><strong>Nvidia Corp</strong>&nbsp;(NASDAQ:<...,benzinga,https://www.benzinga.com/news/23/10/35462944/w...


In [8]:
news.tail()

Unnamed: 0,id,created_at,updated_at,headline,content,source,url
4902,5230170,2015-02-11 09:29:29+00:00,2015-02-11 09:30:02+00:00,"10 Must Watch Stocks for February 11, 2015",Some of the stocks that may grab investor focu...,,https://www.benzinga.com/node/5230170
4903,5223302,2015-02-09 16:10:15+00:00,2015-02-09 16:10:15+00:00,"Wedbush Previews NVIDIA's Q4 Results, Expects ...",Betsy Van Hees of Wedbush on Monday previewed ...,,https://www.benzinga.com/node/5223302
4904,5173181,2015-01-23 15:20:03+00:00,2015-01-23 15:20:04+00:00,Morgan Stanley comments on Mobileye N.V.,Analysts at Morgan Stanley issued a report say...,,https://www.benzinga.com/node/5173181
4905,5134338,2015-01-09 14:35:01+00:00,2015-01-09 14:35:02+00:00,Barclays: Connected Cars Take Center Stage At ...,Brian Johnson of Barclays believes that “The C...,,https://www.benzinga.com/node/5134338
4906,5120637,2015-01-06 14:29:02+00:00,2015-01-06 14:29:02+00:00,"Will NVIDIA's Tegra X1 Dethrone PlayStation 4,...","In 2014, <strong>NVIDIA Corporation</strong> (...",,https://www.benzinga.com/node/5120637


In [32]:
# save the data
news.to_pickle("../DataModules/NVDA_news_2001Jan_2023Oct.bz2")

# Qualitative analysis

In [9]:
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")

In [10]:
news.iloc[0:10, :]

Unnamed: 0,id,created_at,updated_at,headline,content,source,url
0,35470224,2023-10-27 17:35:13+00:00,2023-10-27 17:35:13+00:00,10 Information Technology Stocks Whale Activit...,<p>This whale alert can help traders discover ...,benzinga,https://www.benzinga.com/markets/options/23/10...
1,35467939,2023-10-27 15:58:09+00:00,2023-10-27 16:00:06+00:00,"Amazon To Make 'Tens Of Billions' From AI, Con...","<p><em>To gain an edge, this is what you need ...",benzinga,https://www.benzinga.com/economics/23/10/35467...
2,35466905,2023-10-27 15:46:14+00:00,2023-10-27 15:46:14+00:00,Intel Analysts Aren't Totally Impressed By Q3 ...,<p>Shares of <strong>Intel Corporation </stron...,benzinga,https://www.benzinga.com/analyst-ratings/analy...
3,35466443,2023-10-27 15:04:13+00:00,2023-10-27 15:07:02+00:00,Break Down: What Does The Recent Selling Of Th...,<h3>The Markets</h3>\r\n\r\n<p>The market rema...,benzinga,https://www.benzinga.com/markets/23/10/3546644...
4,35462944,2023-10-27 13:36:57+00:00,2023-10-27 13:36:57+00:00,What's Going On With Nvidia Stock Friday?,<p><strong>Nvidia Corp</strong>&nbsp;(NASDAQ:<...,benzinga,https://www.benzinga.com/news/23/10/35462944/w...
5,35445781,2023-10-26 18:11:05+00:00,2023-10-26 18:11:05+00:00,Amazon's New AI-Powered Ad Imagery Boosts Clic...,"<p><strong>Amazon.Com, Inc</strong>&nbsp;(NASD...",benzinga,https://www.benzinga.com/news/23/10/35445781/a...
6,35447241,2023-10-26 17:35:16+00:00,2023-10-26 17:35:16+00:00,10 Information Technology Stocks Whale Activit...,<p>This whale alert can help traders discover ...,benzinga,https://www.benzinga.com/markets/options/23/10...
7,35442276,2023-10-26 16:18:52+00:00,2023-10-26 16:18:53+00:00,What's Going On With Nvidia Stock Thursday?,<p><strong>Nvidia Corp</strong>&nbsp;(NASDAQ:<...,benzinga,https://www.benzinga.com/news/23/10/35442276/w...
8,35444153,2023-10-26 15:32:28+00:00,2023-10-26 15:32:29+00:00,"Sizzling Q3 GDP Defies Recession Forecasts, Bu...",<p>Despite earlier concerns of a recession in ...,benzinga,https://www.benzinga.com/analyst-ratings/analy...
9,35443351,2023-10-26 15:20:10+00:00,2023-10-26 15:24:58+00:00,Strongest Earnings From Meta But The Stock Fal...,"<p><em>To gain an edge, this is what you need ...",benzinga,https://www.benzinga.com/markets/23/10/3544335...


### remove the HTML tags

In [11]:
def html_to_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    clean_text = text.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    return clean_text

news['content'] = news['content'].apply(html_to_text)
news['headline'] = news['headline'].apply(html_to_text)
news.iloc[0:10, :]

Unnamed: 0,id,created_at,updated_at,headline,content,source,url
0,35470224,2023-10-27 17:35:13+00:00,2023-10-27 17:35:13+00:00,10 Information Technology Stocks Whale Activit...,This whale alert can help traders discover the...,benzinga,https://www.benzinga.com/markets/options/23/10...
1,35467939,2023-10-27 15:58:09+00:00,2023-10-27 16:00:06+00:00,"Amazon To Make 'Tens Of Billions' From AI, Con...","To gain an edge, this is what you need to know...",benzinga,https://www.benzinga.com/economics/23/10/35467...
2,35466905,2023-10-27 15:46:14+00:00,2023-10-27 15:46:14+00:00,Intel Analysts Aren't Totally Impressed By Q3 ...,Shares of Intel Corporation (NASDAQ:INTC) cont...,benzinga,https://www.benzinga.com/analyst-ratings/analy...
3,35466443,2023-10-27 15:04:13+00:00,2023-10-27 15:07:02+00:00,Break Down: What Does The Recent Selling Of Th...,The Markets The market remains anxious over al...,benzinga,https://www.benzinga.com/markets/23/10/3546644...
4,35462944,2023-10-27 13:36:57+00:00,2023-10-27 13:36:57+00:00,What's Going On With Nvidia Stock Friday?,Nvidia Corp (NASDAQ:NVDA) stock is trading hig...,benzinga,https://www.benzinga.com/news/23/10/35462944/w...
5,35445781,2023-10-26 18:11:05+00:00,2023-10-26 18:11:05+00:00,Amazon's New AI-Powered Ad Imagery Boosts Clic...,"Amazon.Com, Inc (NASDAQ:AMZN) launched an arti...",benzinga,https://www.benzinga.com/news/23/10/35445781/a...
6,35447241,2023-10-26 17:35:16+00:00,2023-10-26 17:35:16+00:00,10 Information Technology Stocks Whale Activit...,This whale alert can help traders discover the...,benzinga,https://www.benzinga.com/markets/options/23/10...
7,35442276,2023-10-26 16:18:52+00:00,2023-10-26 16:18:53+00:00,What's Going On With Nvidia Stock Thursday?,Nvidia Corp (NASDAQ:NVDA) stock continues to w...,benzinga,https://www.benzinga.com/news/23/10/35442276/w...
8,35444153,2023-10-26 15:32:28+00:00,2023-10-26 15:32:29+00:00,"Sizzling Q3 GDP Defies Recession Forecasts, Bu...",Despite earlier concerns of a recession in the...,benzinga,https://www.benzinga.com/analyst-ratings/analy...
9,35443351,2023-10-26 15:20:10+00:00,2023-10-26 15:24:58+00:00,Strongest Earnings From Meta But The Stock Fal...,"To gain an edge, this is what you need to know...",benzinga,https://www.benzinga.com/markets/23/10/3544335...
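
The cleaned text may still contain runs of spaces and non-breaking spaces (from `&nbsp;` entities, visible in the raw content above). The sketch below shows one way to strip tags and also collapse all whitespace. It swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies; `html_to_plain_text` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_plain_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # str.split() with no arguments splits on any Unicode whitespace,
    # including newlines, tabs and non-breaking spaces
    return " ".join("".join(parser.chunks).split())


sample = "<p><strong>Nvidia Corp</strong>&nbsp;(NASDAQ:NVDA)</p>\n<p>stock  is trading</p>"
print(html_to_plain_text(sample))
```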


In [12]:
len(news)

4907

### check relevance

In [13]:
# find the count of the times mentioned
article1 = news['content'][0]
nmentions = article1.count('NVDA') + article1.count('Nvidia') # can we just use all lower case?
nmentions

3

In [14]:
# count the mentions for every article in one vectorized step;
# the match is case-sensitive, so the all-caps spelling "NVIDIA" is not counted
news['count_Nvidia'] = news['content'].str.count('NVDA') + news['content'].str.count('Nvidia')
news.iloc[0:10, ]

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia
0,35470224,2023-10-27 17:35:13+00:00,2023-10-27 17:35:13+00:00,10 Information Technology Stocks Whale Activit...,This whale alert can help traders discover the...,benzinga,https://www.benzinga.com/markets/options/23/10...,3
1,35467939,2023-10-27 15:58:09+00:00,2023-10-27 16:00:06+00:00,"Amazon To Make 'Tens Of Billions' From AI, Con...","To gain an edge, this is what you need to know...",benzinga,https://www.benzinga.com/economics/23/10/35467...,1
2,35466905,2023-10-27 15:46:14+00:00,2023-10-27 15:46:14+00:00,Intel Analysts Aren't Totally Impressed By Q3 ...,Shares of Intel Corporation (NASDAQ:INTC) cont...,benzinga,https://www.benzinga.com/analyst-ratings/analy...,2
3,35466443,2023-10-27 15:04:13+00:00,2023-10-27 15:07:02+00:00,Break Down: What Does The Recent Selling Of Th...,The Markets The market remains anxious over al...,benzinga,https://www.benzinga.com/markets/23/10/3546644...,2
4,35462944,2023-10-27 13:36:57+00:00,2023-10-27 13:36:57+00:00,What's Going On With Nvidia Stock Friday?,Nvidia Corp (NASDAQ:NVDA) stock is trading hig...,benzinga,https://www.benzinga.com/news/23/10/35462944/w...,8
5,35445781,2023-10-26 18:11:05+00:00,2023-10-26 18:11:05+00:00,Amazon's New AI-Powered Ad Imagery Boosts Clic...,"Amazon.Com, Inc (NASDAQ:AMZN) launched an arti...",benzinga,https://www.benzinga.com/news/23/10/35445781/a...,2
6,35447241,2023-10-26 17:35:16+00:00,2023-10-26 17:35:16+00:00,10 Information Technology Stocks Whale Activit...,This whale alert can help traders discover the...,benzinga,https://www.benzinga.com/markets/options/23/10...,3
7,35442276,2023-10-26 16:18:52+00:00,2023-10-26 16:18:53+00:00,What's Going On With Nvidia Stock Thursday?,Nvidia Corp (NASDAQ:NVDA) stock continues to w...,benzinga,https://www.benzinga.com/news/23/10/35442276/w...,6
8,35444153,2023-10-26 15:32:28+00:00,2023-10-26 15:32:29+00:00,"Sizzling Q3 GDP Defies Recession Forecasts, Bu...",Despite earlier concerns of a recession in the...,benzinga,https://www.benzinga.com/analyst-ratings/analy...,1
9,35443351,2023-10-26 15:20:10+00:00,2023-10-26 15:24:58+00:00,Strongest Earnings From Meta But The Stock Fal...,"To gain an edge, this is what you need to know...",benzinga,https://www.benzinga.com/markets/23/10/3544335...,1
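
On the question raised in the comment above about using lower case: `str.count('Nvidia')` is case-sensitive, so the spelling "NVIDIA" used in many of the older articles is not counted. A minimal sketch of a case-insensitive count with the standard-library `re` module follows; `count_mentions` and `MENTION_PATTERN` are hypothetical names, and the word-boundary rule is an assumption about what should count as a mention.

```python
import re

# match "nvda" or "nvidia" as whole words, in any letter case
MENTION_PATTERN = re.compile(r"\b(?:nvda|nvidia)\b", re.IGNORECASE)


def count_mentions(text):
    return len(MENTION_PATTERN.findall(text))


sample = "NVIDIA Corporation (NASDAQ: NVDA) said Nvidia's GPUs are in demand"
print(count_mentions(sample))                          # case-insensitive count
print(sample.count("NVDA") + sample.count("Nvidia"))   # the notebook's count
```

Switching to this count would change `count_Nvidia` and therefore which articles pass the relevance filter, so the numbers reported in the rest of the notebook would no longer apply as-is.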


In [15]:
article1

"This whale alert can help traders discover the next big trading opportunities. Whales are entities with large sums of money and we track their transactions here at Benzinga on our options activity scanner. Traders often look for circumstances when the market estimation of an option diverges away from its normal worth. Abnormal amounts of trading activity could push option prices to hyperbolic or underperforming levels.  Here's the list of options activity happening in today's session:     Symbol PUT/CALL Trade Type Sentiment Exp. Date Strike Price Total Trade Price Open Interest Volume     NVDA PUT SWEEP BEARISH 10/27/23 $405.00 $105.5K 8.9K 94.0K   AAPL PUT SWEEP BEARISH 11/03/23 $165.00 $86.7K 14.3K 10.8K   INTC CALL SWEEP BULLISH 11/17/23 $36.00 $42.4K 15.6K 5.8K   MSFT PUT TRADE BULLISH 12/15/23 $335.00 $32.7K 3.0K 2.9K   ARM CALL SWEEP BULLISH 12/01/23 $51.50 $40.7K 6 1.0K   AMD CALL SWEEP BULLISH 11/17/23 $120.00 $25.1K 68.7K 864   SMCI PUT SWEEP BULLISH 10/27/23 $245.00 $32.0K 

In [16]:
news.loc[:, 'count_Nvidia'].max()

31

In [17]:
news.loc[:, 'count_Nvidia'].min()

0

# filter articles that mention NVDA more than 5 times

In [18]:
news_filtered = news.loc[news['count_Nvidia'] > 5, :]
news_filtered.head()

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia
4,35462944,2023-10-27 13:36:57+00:00,2023-10-27 13:36:57+00:00,What's Going On With Nvidia Stock Friday?,Nvidia Corp (NASDAQ:NVDA) stock is trading hig...,benzinga,https://www.benzinga.com/news/23/10/35462944/w...,8
7,35442276,2023-10-26 16:18:52+00:00,2023-10-26 16:18:53+00:00,What's Going On With Nvidia Stock Thursday?,Nvidia Corp (NASDAQ:NVDA) stock continues to w...,benzinga,https://www.benzinga.com/news/23/10/35442276/w...,6
10,35440412,2023-10-26 13:46:10+00:00,2023-10-26 13:46:11+00:00,Looking At NVIDIA's Recent Unusual Options Act...,A whale with a lot of money to spend has taken...,benzinga,https://www.benzinga.com/markets/options/23/10...,7
17,35416909,2023-10-25 16:02:48+00:00,2023-10-25 16:02:48+00:00,What's Going On With Nvidia Stock Wednesday?,Nvidia Corp (NASDAQ:NVDA) stock is trading low...,benzinga,https://www.benzinga.com/government/23/10/3541...,7
22,35402235,2023-10-24 19:47:06+00:00,2023-10-24 19:47:07+00:00,The Secret To Starting Your Own Business Is Tr...,Nvidia Corp (NASDAQ:NVDA) officially joined th...,benzinga,https://www.benzinga.com/news/23/10/35402235/t...,7


In [19]:
len(news_filtered)

891

### Inspect novelty (remove duplicates)

In [20]:
# sort the row based on article creation time, from oldest to newest
news_filtered = news_filtered.sort_values(by='created_at', ascending=True)
news_filtered = news_filtered.drop_duplicates(subset='content', keep='first')
news_filtered = news_filtered.drop_duplicates(subset='headline', keep='first')
news_filtered.reset_index(drop=True, inplace=True)
print(len(news_filtered))
news_filtered.head()

749


Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 14:54:12+00:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 15:34:22+00:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 19:25:14+00:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 14:02:55+00:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 12:31:49+00:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6


# Ensuring novelty

The previous step only used the `drop_duplicates()` method, which catches exact matches but not near-duplicate text. Here we use a better approach: compare the articles' TF-IDF vectors with `cosine_similarity()`.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### step 1: convert the textual data into numerical vectors for similarity analysis

In [22]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(news_filtered['content'])

In [23]:
tfidf_matrix

<749x11912 sparse matrix of type '<class 'numpy.float64'>'
	with 164506 stored elements in Compressed Sparse Row format>

In [24]:
print(tfidf_matrix[0,0:10])

  (0, 1)	0.034925884908276035


In [25]:
tfidf_matrix[0,0:10]

<1x10 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

### step 2: calculate the similarity score

In [26]:
similarity_mat = cosine_similarity(tfidf_matrix, tfidf_matrix)
similarity_df = pd.DataFrame(similarity_mat, columns=news_filtered.index, index=news_filtered.index)
similarity_df.iloc[0:10, :]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,739,740,741,742,743,744,745,746,747,748
0,1.0,0.20845,0.273858,0.144089,0.122469,0.174209,0.151572,0.21955,0.253414,0.139827,...,0.12215,0.141835,0.128738,0.14148,0.126938,0.100306,0.159277,0.119649,0.166679,0.116107
1,0.20845,1.0,0.137669,0.276045,0.210732,0.266598,0.232522,0.335825,0.18084,0.210815,...,0.160627,0.228667,0.151154,0.191779,0.147724,0.142195,0.276983,0.168734,0.243409,0.169628
2,0.273858,0.137669,1.0,0.113512,0.08583,0.133124,0.133525,0.141517,0.225764,0.13176,...,0.100041,0.116913,0.123013,0.109643,0.099432,0.086559,0.13585,0.095074,0.138326,0.085817
3,0.144089,0.276045,0.113512,1.0,0.131632,0.192877,0.135594,0.155785,0.141657,0.154352,...,0.123997,0.176442,0.142807,0.139835,0.127135,0.109387,0.197435,0.128563,0.192092,0.112324
4,0.122469,0.210732,0.08583,0.131632,1.0,0.172612,0.116288,0.204474,0.105785,0.150849,...,0.09058,0.160606,0.111131,0.126322,0.104887,0.094489,0.191303,0.129689,0.152117,0.076236
5,0.174209,0.266598,0.133124,0.192877,0.172612,1.0,0.167382,0.246309,0.145477,0.196553,...,0.154157,0.178479,0.110795,0.171081,0.119655,0.118071,0.226494,0.143872,0.208085,0.136322
6,0.151572,0.232522,0.133525,0.135594,0.116288,0.167382,1.0,0.242471,0.132463,0.177966,...,0.112937,0.138807,0.096039,0.134542,0.100116,0.096956,0.183242,0.120817,0.171732,0.094688
7,0.21955,0.335825,0.141517,0.155785,0.204474,0.246309,0.242471,1.0,0.187043,0.1979,...,0.125823,0.189536,0.111875,0.170293,0.105223,0.115272,0.217446,0.140279,0.190548,0.133987
8,0.253414,0.18084,0.225764,0.141657,0.105785,0.145477,0.132463,0.187043,1.0,0.139727,...,0.101717,0.153733,0.154654,0.144084,0.111918,0.104546,0.167695,0.136071,0.168851,0.129262
9,0.139827,0.210815,0.13176,0.154352,0.150849,0.196553,0.177966,0.1979,0.139727,1.0,...,0.109378,0.153987,0.120604,0.133834,0.13111,0.110535,0.188818,0.125478,0.152575,0.09097
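
To make the two steps concrete, here is a simplified pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not scikit-learn's implementation: it tokenizes on whitespace only, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization is used on the assumption that it mirrors the `TfidfVectorizer` defaults. `tfidf_vectors` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    # the vectors are already unit length, so the dot product is the cosine
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


docs = ["nvidia stock rises", "nvidia stock rises", "intel shares fall"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Identical documents score (numerically) 1 and documents with no shared terms score 0, which is why a threshold close to 1 targets near-duplicates.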


### step 3: identify and remove similar articles

In [27]:
threshold = 0.8

articles_to_remove = []

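# iterate through each article's row to find near-duplicates (a row always matches itself);
# keep the earliest matching article and mark the rest for removal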
for i, row in similarity_df.iterrows():
    duplicate_indices = row[row >= threshold].index.tolist()
    # print(duplicate_indices)
    if len(duplicate_indices) > 1:
        articles_to_remove.extend(duplicate_indices[1:])
articles_to_remove_unique = list(set(articles_to_remove))
# print(articles_to_remove_unique)
news_filtered_novel = news_filtered.drop(articles_to_remove_unique)
news_filtered_novel.head()

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 14:54:12+00:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 15:34:22+00:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 19:25:14+00:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 14:02:55+00:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 12:31:49+00:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6


In [28]:
print(len(news_filtered))
print(len(news_filtered_novel))

749
710
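
For comparison, a greedy variant of the removal step is sketched below. Unlike the loop above, it compares each article only against articles that have already been kept, so an article is not dropped merely for resembling another article that was itself removed. `select_novel` is a hypothetical name and the small matrix is made-up illustration data.

```python
def select_novel(similarity, threshold=0.8):
    """Return the positions of the articles to keep, in their original order."""
    kept = []
    for i in range(len(similarity)):
        # keep article i only if it is below the threshold against every kept article
        if all(similarity[i][j] < threshold for j in kept):
            kept.append(i)
    return kept


# toy matrix: article 1 resembles both 0 and 2, but 0 and 2 are unrelated
toy_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.85],
    [0.1, 0.85, 1.0],
]
print(select_novel(toy_similarity))  # → [0, 2]
```

On the same toy matrix the notebook's rule would also drop article 2, because it shares a row with the earlier articles 0 and 1; which behavior is preferable depends on how aggressively near-duplicates should be pruned.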


# Calculate Sentiment Scores of News Headlines

In [29]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

new_words_dict = {"bear":-2, "bull":2}
analyzer.lexicon.update(new_words_dict)

In [30]:
news_filtered_novel['compound_score'] = news_filtered_novel['headline'].apply(lambda t: analyzer.polarity_scores(t)['compound'])
news_filtered_novel.head(10)

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia,compound_score
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 14:54:12+00:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8,0.0
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 15:34:22+00:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20,0.0
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 19:25:14+00:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7,0.0
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 14:02:55+00:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8,0.0
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 12:31:49+00:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6,0.4215
5,6143216,2016-01-14 15:35:54+00:00,2016-01-14 15:35:54+00:00,"Barclays Downgrades Nvidia, Likes Skyworks, Ci...",NVIDIA Corporation (NASDAQ: NVDA) shares hav...,,https://www.benzinga.com/node/6143216,7,0.4215
6,7978075,2016-05-13 12:10:46+00:00,2016-05-13 12:10:46+00:00,"Roth Upgrades NVIDIA To Buy, Encouraged By 'De...",NVIDIA Corporation (NASDAQ: NVDA) reported ex...,,https://www.benzinga.com/node/7978075,6,0.6249
7,7980481,2016-05-13 18:50:36+00:00,2016-05-13 18:50:36+00:00,Jefferies' Secular Trends Thesis In NVIDIA Jus...,NVIDIA Corporation (NASDAQ: NVDA) remains a to...,,https://www.benzinga.com/node/7980481,7,0.2023
8,8373693,2016-08-19 19:22:30+00:00,2016-08-19 19:22:32+00:00,Intel To Develop Autonomous Vehicle Solution T...,CLSA's Christopher Caso commented on Intel Cor...,,https://www.benzinga.com/node/8373693,6,-0.0258
9,8685747,2016-11-11 12:51:01+00:00,2016-11-11 12:51:01+00:00,NVIDIA's Blowout Quarter: 'Machine Learning Is...,NVIDIA Corporation (NASDAQ: NVDA) delivered a...,,https://www.benzinga.com/node/8685747,6,0.3612
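
When comparing the two methods it can help to turn the compound score into a label. The sketch below uses cutoffs of ±0.05, which are the conventional values suggested in the VADER documentation and are an assumption here, since this notebook does not set any cutoff; `label_compound` is a hypothetical helper.

```python
def label_compound(score, cutoff=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if score >= cutoff:
        return "positive"
    if score <= -cutoff:
        return "negative"
    return "neutral"


# compound scores taken from the table above
for score in (0.4215, 0.0, -0.0258):
    print(score, label_compound(score))
```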


# Sentiment Analysis Using Google Bard

In [31]:
import google.generativeai as palm
from datetime import datetime
import pytz

# convert timezone

In [32]:
def convert_to_us_time(timestamp_utc):
    us_timezone = pytz.timezone('US/Eastern')
    us_datetime = timestamp_utc.astimezone(us_timezone)
    return us_datetime

news_filtered_novel['updated_at'] = news_filtered_novel['updated_at'].apply(convert_to_us_time)
news_filtered_novel.head(10)

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia,compound_score
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 10:54:12-04:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8,0.0
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 11:34:22-04:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20,0.0
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 15:25:14-04:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7,0.0
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 10:02:55-04:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8,0.0
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 08:31:49-04:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6,0.4215
5,6143216,2016-01-14 15:35:54+00:00,2016-01-14 10:35:54-05:00,"Barclays Downgrades Nvidia, Likes Skyworks, Ci...",NVIDIA Corporation (NASDAQ: NVDA) shares hav...,,https://www.benzinga.com/node/6143216,7,0.4215
6,7978075,2016-05-13 12:10:46+00:00,2016-05-13 08:10:46-04:00,"Roth Upgrades NVIDIA To Buy, Encouraged By 'De...",NVIDIA Corporation (NASDAQ: NVDA) reported ex...,,https://www.benzinga.com/node/7978075,6,0.6249
7,7980481,2016-05-13 18:50:36+00:00,2016-05-13 14:50:36-04:00,Jefferies' Secular Trends Thesis In NVIDIA Jus...,NVIDIA Corporation (NASDAQ: NVDA) remains a to...,,https://www.benzinga.com/node/7980481,7,0.2023
8,8373693,2016-08-19 19:22:30+00:00,2016-08-19 15:22:32-04:00,Intel To Develop Autonomous Vehicle Solution T...,CLSA's Christopher Caso commented on Intel Cor...,,https://www.benzinga.com/node/8373693,6,-0.0258
9,8685747,2016-11-11 12:51:01+00:00,2016-11-11 07:51:01-05:00,NVIDIA's Blowout Quarter: 'Machine Learning Is...,NVIDIA Corporation (NASDAQ: NVDA) delivered a...,,https://www.benzinga.com/node/8685747,6,0.3612


In [34]:
# get the Google PaLM API key from a file
palmKey = None
with open('../GooglePlam') as palmInfo:
    for line in palmInfo:
        if "APIkey" in line:
            palmKey = line.split()[2].strip()

In [68]:
import time
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=10, period=1)
def generate_sentiment_score(headline):
    palm.configure(api_key=palmKey)
    # model params
    defaults = {
        'model': 'models/text-bison-001',
        'temperature': 0.5,
        'candidate_count': 1,
        'top_k': 40,
        'top_p': 0.95,
        'max_output_tokens': 1024,
        'stop_sequences': [],
    }
    prompt = f"""Analyze the sentiment of the news headline and generate sentiment score based on the rules below:
1. if the sentiment is positive, give score between 0 and 1, where higher values indicate a more positive sentiment.
2. if the sentiment is negative, give score between 0 and -1, where lower values indicate a more negative sentiment.
3. if the sentiment is neutral, give a score 0.
input: {headline}
output 2:"""
    # call the API and parse the returned text as a float
    try:
        response = palm.generate_text(**defaults, prompt=prompt)
        print(response)
        score = float(response.result)
        print(response.result)
    except Exception as e:
        # blocked or non-numeric responses end up here;
        # 100 is an out-of-range sentinel that marks a missing score
        print(e)
        return 100
    return score
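
The `ratelimit` decorators above cap how often the API is called. The underlying idea can be sketched with the standard library alone. Note that this is a simpler technique than `ratelimit` uses: it enforces a minimum interval between consecutive calls rather than a calls-per-period window. `throttle` and `score_stub` are hypothetical names.

```python
import functools
import time


def throttle(min_interval):
    """Decorator: sleep as needed so calls are at least min_interval seconds apart."""
    def decorator(func):
        last_call = [None]  # mutable cell holding the time of the previous call

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = min_interval - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)

        return wrapper
    return decorator


@throttle(min_interval=0.1)  # roughly 10 calls per second at most
def score_stub(headline):
    # stand-in for the real API call
    return 0.0


print([score_stub(h) for h in ("headline one", "headline two")])
```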

In [62]:
calculated_values = []

for index, row in news_filtered_novel.iterrows():
    print(index)
    response = generate_sentiment_score(row['headline'])
    calculated_values.append(float(response))
    
news_filtered_novel['sentiment_score_bard'] = calculated_values
news_filtered_novel.head(10)

0
Completion(candidates=[...],
           result='-1',
           filters=[],
           safety_feedback=[])
1
Completion(candidates=[...],
           result='-0.8',
           filters=[],
           safety_feedback=[])
2
Completion(candidates=[...],
           result='0.5',
           filters=[],
           safety_feedback=[])
3
float() argument must be a string or a real number, not 'NoneType'
4
Completion(candidates=[...],
           result='0.5',
           filters=[],
           safety_feedback=[])
5
Completion(candidates=[...],
           result='-0.5',
           filters=[],
           safety_feedback=[])
6
float() argument must be a string or a real number, not 'NoneType'
7
Completion(candidates=[...],
           result='0.65',
           filters=[],
           safety_feedback=[])
8
Completion(candidates=[...],
           result='0',
           filters=[],
           safety_feedback=[])
9
Completion(candidates=[...],
           result='0.5',
           filters=[],
           sa

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia,compound_score,sentiment_score_bard
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 10:54:12-04:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8,0.0,-1.0
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 11:34:22-04:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20,0.0,-0.8
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 15:25:14-04:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7,0.0,0.5
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 10:02:55-04:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8,0.0,100.0
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 08:31:49-04:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6,0.4215,0.5
5,6143216,2016-01-14 15:35:54+00:00,2016-01-14 10:35:54-05:00,"Barclays Downgrades Nvidia, Likes Skyworks, Ci...",NVIDIA Corporation (NASDAQ: NVDA) shares hav...,,https://www.benzinga.com/node/6143216,7,0.4215,-0.5
6,7978075,2016-05-13 12:10:46+00:00,2016-05-13 08:10:46-04:00,"Roth Upgrades NVIDIA To Buy, Encouraged By 'De...",NVIDIA Corporation (NASDAQ: NVDA) reported ex...,,https://www.benzinga.com/node/7978075,6,0.6249,100.0
7,7980481,2016-05-13 18:50:36+00:00,2016-05-13 14:50:36-04:00,Jefferies' Secular Trends Thesis In NVIDIA Jus...,NVIDIA Corporation (NASDAQ: NVDA) remains a to...,,https://www.benzinga.com/node/7980481,7,0.2023,0.65
8,8373693,2016-08-19 19:22:30+00:00,2016-08-19 15:22:32-04:00,Intel To Develop Autonomous Vehicle Solution T...,CLSA's Christopher Caso commented on Intel Cor...,,https://www.benzinga.com/node/8373693,6,-0.0258,0.0
9,8685747,2016-11-11 12:51:01+00:00,2016-11-11 07:51:01-05:00,NVIDIA's Blowout Quarter: 'Machine Learning Is...,NVIDIA Corporation (NASDAQ: NVDA) delivered a...,,https://www.benzinga.com/node/8685747,6,0.3612,0.5


In [65]:
generate_sentiment_score(news_filtered_novel.loc[742, 'headline'])

float() argument must be a string or a real number, not 'NoneType'


100

In [69]:
headline = news_filtered_novel.loc[742, 'headline']
generate_sentiment_score(headline)

Completion(candidates=[],
           result=None,
           filters=[{'reason': <BlockedReason.OTHER: 2>}],
           safety_feedback=[])
float() argument must be a string or a real number, not 'NoneType'


100

In [41]:
generate_sentiment_score(row['headline'])

Completion(candidates=[],
           result=None,
           filters=[{'reason': <BlockedReason.OTHER: 2>}],
           safety_feedback=[])
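
The blocked completions above come back with `result=None`, which is what makes `float()` fail. A minimal sketch of a more defensive parsing step follows. The name `parse_score` is hypothetical and not part of the notebook, and the valid range of [-1, 1] is an assumption based on the scores seen so far. It returns NaN instead of a numeric placeholder, so failed rows cannot be mistaken for real scores.

```python
import math

FAILED = math.nan  # hypothetical marker for "no usable score"


def parse_score(result):
    """Convert the text of a model reply to a float in [-1, 1], or NaN."""
    if result is None:            # blocked or empty completion
        return FAILED
    try:
        score = float(result.strip())
    except ValueError:            # reply was not a bare number
        return FAILED
    if not -1.0 <= score <= 1.0:  # outside the assumed valid range
        return FAILED
    return score


print(parse_score("0.65"), parse_score(None), parse_score("-2"))
```

With NaN as the marker, pandas methods such as `isna()` and `mean()` treat failed rows as missing without extra filtering.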

In [70]:
news_filtered_novel.head(10)

Unnamed: 0,id,created_at,updated_at,headline,content,source,url,count_Nvidia,compound_score,sentiment_score_bard
0,5348412,2015-03-23 14:53:09+00:00,2015-03-23 10:54:12-04:00,Why Goldman Is Downgrading Nvidia,Goldman Sachs downgraded NVIDIA Corporation (N...,,https://www.benzinga.com/node/5348412,8,0.0,-1.0
1,5497748,2015-05-11 15:34:21+00:00,2015-05-11 11:34:22-04:00,Nvidia Falls Short...And Wall Street Reacts,Technology company NVIDIA Corporation (NASD...,,https://www.benzinga.com/node/5497748,20,0.0,-0.8
2,5548840,2015-05-28 16:40:25+00:00,2015-05-28 15:25:14-04:00,Goldman Sachs Met With Semiconductor Giants; H...,"In a report published Thursday, Goldman Sachs ...",,https://www.benzinga.com/node/5548840,7,0.0,0.5
3,5590773,2015-06-12 14:02:55+00:00,2015-06-12 10:02:55-04:00,Wedbush Met With Nvidia's CFO; Here's What Hap...,"In a report published Friday, Wedbush analyst ...",,https://www.benzinga.com/node/5590773,8,0.0,100.0
4,5755007,2015-08-11 12:31:49+00:00,2015-08-11 08:31:49-04:00,Unusual Option Opportunity Nvidia,"According to Options and Volatility, shares of...",,https://www.benzinga.com/node/5755007,6,0.4215,0.5
5,6143216,2016-01-14 15:35:54+00:00,2016-01-14 10:35:54-05:00,"Barclays Downgrades Nvidia, Likes Skyworks, Ci...",NVIDIA Corporation (NASDAQ: NVDA) shares hav...,,https://www.benzinga.com/node/6143216,7,0.4215,-0.5
6,7978075,2016-05-13 12:10:46+00:00,2016-05-13 08:10:46-04:00,"Roth Upgrades NVIDIA To Buy, Encouraged By 'De...",NVIDIA Corporation (NASDAQ: NVDA) reported ex...,,https://www.benzinga.com/node/7978075,6,0.6249,100.0
7,7980481,2016-05-13 18:50:36+00:00,2016-05-13 14:50:36-04:00,Jefferies' Secular Trends Thesis In NVIDIA Jus...,NVIDIA Corporation (NASDAQ: NVDA) remains a to...,,https://www.benzinga.com/node/7980481,7,0.2023,0.65
8,8373693,2016-08-19 19:22:30+00:00,2016-08-19 15:22:32-04:00,Intel To Develop Autonomous Vehicle Solution T...,CLSA's Christopher Caso commented on Intel Cor...,,https://www.benzinga.com/node/8373693,6,-0.0258,0.0
9,8685747,2016-11-11 12:51:01+00:00,2016-11-11 07:51:01-05:00,NVIDIA's Blowout Quarter: 'Machine Learning Is...,NVIDIA Corporation (NASDAQ: NVDA) delivered a...,,https://www.benzinga.com/node/8685747,6,0.3612,0.5
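
The ten rows above hold both a VADER `compound_score` and a `sentiment_score_bard`, so a rough agreement check is possible. The sketch below is illustrative only: it uses the ±0.05 cutoff commonly applied to VADER's compound score, applying the same cutoff to the Bard score is an assumption, and rows whose Bard score falls outside [-1, 1] are skipped.

```python
# (vader_compound, bard_score) pairs copied from the ten rows shown above.
pairs = [
    (0.0, -1.0), (0.0, -0.8), (0.0, 0.5), (0.0, 100.0), (0.4215, 0.5),
    (0.4215, -0.5), (0.6249, 100.0), (0.2023, 0.65), (-0.0258, 0.0), (0.3612, 0.5),
]


def label(score, threshold=0.05):
    """Map a score to 'pos', 'neg' or 'neu' using a symmetric threshold."""
    if score >= threshold:
        return "pos"
    if score <= -threshold:
        return "neg"
    return "neu"


# Drop rows without a usable Bard score (assumed valid range: [-1, 1]).
valid = [(v, b) for v, b in pairs if -1.0 <= b <= 1.0]
agree = sum(label(v) == label(b) for v, b in valid)
agreement_rate = agree / len(valid)
print(f"{agree} of {len(valid)} labels agree ({agreement_rate:.0%})")
```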


In [73]:
news_filtered_novel.loc[:, ['headline', 'sentiment_score_bard']]

Unnamed: 0,headline,sentiment_score_bard
0,Why Goldman Is Downgrading Nvidia,-1.00
1,Nvidia Falls Short...And Wall Street Reacts,-0.80
2,Goldman Sachs Met With Semiconductor Giants; H...,0.50
3,Wedbush Met With Nvidia's CFO; Here's What Hap...,100.00
4,Unusual Option Opportunity Nvidia,0.50
...,...,...
744,Forget Gigafactories: Nvidia Teams Up With Big...,100.00
745,Will Nvidia Shares Weather Market Storm? Analy...,100.00
746,Is Nvidia Benefiting From US Sanctions On China?,0.50
747,Weekly Points – 5 Things To Know In Investing ...,0.00


In [54]:
sum(news_filtered_novel.loc[:, 'sentiment_score_bard'] == 100)


[]

In [74]:
news_filtered_novel.loc[:, 'sentiment_score_bard'].value_counts()

 100.000000    316
 0.500000      114
 0.000000       90
-0.500000       71
 0.750000       30
-1.000000       22
 0.800000       18
 0.600000       14
 1.000000        6
 0.700000        5
 0.666667        3
 0.400000        2
-0.750000        2
 0.300000        1
-0.400000        1
-0.333333        1
-0.200000        1
 0.625000        1
 0.830000        1
 0.900000        1
-0.700000        1
 0.840000        1
 0.250000        1
-0.120000        1
-0.800000        1
-2.000000        1
 0.650000        1
-0.600000        1
 0.670000        1
 0.330000        1
Name: sentiment_score_bard, dtype: int64
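
The counts above mix real scores with the value 100 and at least one out-of-range value (-2). One way to separate them before any further analysis is sketched below on a toy Series. The data is made up for illustration, and the valid range of [-1, 1] is an assumption.

```python
import pandas as pd

# Toy scores standing in for news_filtered_novel['sentiment_score_bard'].
scores = pd.Series([-1.0, -0.8, 0.5, 100.0, 0.5, -2.0, 100.0])

# Values inside the assumed valid range [-1, 1] are kept; the rest become NaN.
in_range = scores.between(-1.0, 1.0)
clean = scores.where(in_range)

n_failed = int((scores == 100).sum())                     # placeholder rows
n_out_of_range = int(((scores != 100) & ~in_range).sum()) # e.g. -2
n_usable = int(clean.notna().sum())
print(n_failed, n_out_of_range, n_usable)
```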

The failed calls above return `result=None` together with this filter:

filters=[{'reason': <BlockedReason.OTHER: 2>}]

Here is the response from Bard: 

This output is typically generated by the Google Cloud Natural Language API when it encounters a word or phrase that is not in its vocabulary or is on its blocklist.

Here are some examples of things that might trigger the OTHER reason:

- A new word or phrase that has not yet been added to the API's vocabulary.
- A word or phrase that is on the API's blocklist, but the specific reason for the block is unknown.
- A word or phrase that is used in a context that is not understood by the API.

If you see this output in your application, it is important to investigate the reason for the block. You can do this by checking the API documentation for the specific reason code. If you are unsure why a word or phrase is being blocked, you can contact Google Cloud support for assistance.

Here are some things you can do to avoid the OTHER reason:

- Use a recent version of the Google Cloud Natural Language API client library.
- Make sure that you are using the correct model for your task.
- Avoid using words or phrases that are not in the API's vocabulary or are on its blocklist.
- Use words or phrases in a context that is understood by the API.

# Conclusion
So far, neither VADER nor Google PaLM is working well enough.
For VADER, the first two headlines are clearly negative, but VADER scored them as neutral (compound score 0.0).
For Google PaLM, 316 of the 710 scored rows hold the fallback value 100, meaning they were not classified. Potential problems: (1) a rate limit; (2) Bard is not good enough. This needs further investigation.
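
One way to test the rate-limit hypothesis is to retry failed calls with a growing delay: if the same headline still returns nothing after several spaced attempts, a content filter is the more likely cause. The sketch below is generic. `call_with_retry` is a hypothetical helper, and `fn` stands for any function that returns `None` on failure. It does not call the real API.

```python
import time


def call_with_retry(fn, arg, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg); if it returns None, wait and retry with exponential backoff."""
    for attempt in range(retries):
        result = fn(arg)
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None  # still failing after all attempts


# Usage with a stand-in function that fails twice, then succeeds.
attempts = []


def flaky(headline):
    attempts.append(headline)
    return None if len(attempts) < 3 else "0.5"


print(call_with_retry(flaky, "some headline", sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the example runs instantly; in real use the default `time.sleep` applies.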

This is a work in progress.