<a href="https://colab.research.google.com/github/Aashi-sharma/Automate-stocks-and-crypto-research-with-python-and-deep-learning/blob/main/Automate_stocks_and_crypto_research_with_python_and_deep_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install and import baseline dependencies**

In [1]:
!pip install transformers sentencepiece beautifulsoup4 requests



In [2]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from bs4 import BeautifulSoup
import requests

In [3]:
!pip install sentencepiece



**Set up the summarization model**

In [4]:
model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


**Summarize a single article**

In [5]:
url = "https://finance.yahoo.com/news/israel-regulator-awards-licence-investors-113726999.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.find_all('p')

In [6]:
paragraphs[0].text

"JERUSALEM (Reuters) - Israel's banking regulator on Sunday approved a conditional licence and control permit to a group of entrepreneurs to establish a new online bank, the second addition to the highly concentrated banking sector in three years."

In [7]:
text = [paragraph.text for paragraph in paragraphs]
words = ' '.join(text).split(' ')[:400]
ARTICLE = ' '.join(words)

In [8]:
ARTICLE

'JERUSALEM (Reuters) - Israel\'s banking regulator on Sunday approved a conditional licence and control permit to a group of entrepreneurs to establish a new online bank, the second addition to the highly concentrated banking sector in three years. The Bank of Israel said its banking supervision department had completed the inspection process for the new institution named Esh Bank Israel. The approvals, it said, will allow the founders to move forward and complete the mechanical, operational and regulatory preparations required for the start of the bank\'s activities. These include completing the development and testing phases of new technology and hiring a management team and bank staff. It will take about a year and a half to get the bank up and running, the central bank said. "We have a long way to go," said Shmuel Hauser, the chairman of Esh, adding the bank would offer attractive interest rates and banking services without commissions. Last January, One Zero Digital Bank received 
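In [7] caps the article at its first 400 space-separated words before it is passed to the model. The same idea can be sketched as a small reusable helper. `truncate_words` is a hypothetical name that is not part of the notebook, and it splits on any run of whitespace rather than on single spaces, so it is a slight variation on `split(' ')`.

```python
def truncate_words(text, limit):
    """Return the first `limit` whitespace-separated words of `text`."""
    # str.split() with no argument collapses runs of whitespace,
    # so repeated spaces or newlines do not count as empty "words".
    words = text.split()
    return ' '.join(words[:limit])


# Example with a short string and a small limit
sample_text = "Israel's banking  regulator on Sunday approved a conditional licence"
short = truncate_words(sample_text, 4)
print(short)
```

A word cap is only a rough proxy for the model's token limit, since one word can map to several tokens.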

In [9]:
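# truncation=True clips the input to the tokenizer's configured maximum length
input_ids = tokenizer.encode(ARTICLE, return_tensors='pt', truncation=True)
# Beam search with 5 beams; the generated summary is capped at 55 tokens
output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
summary = tokenizer.decode(output[0], skip_special_tokens=True)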

In [10]:
summary

'Esh Bank Israel will offer online banking without commissions. Last January, One Zero Digital Bank received final regulatory approval'

**Building a news and sentiment pipeline**

In [11]:
monitored_tickers = ['GME', 'TSLA', 'BTC', 'ETH']


**Search for stock news using Google and Yahoo Finance**

In [12]:
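def search_for_stock_news_urls(ticker):
    # Query Google News for Yahoo Finance coverage of the ticker
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # href=True skips anchor tags without an href, which would otherwise raise KeyError
    atags = soup.find_all('a', href=True)
    hrefs = [link['href'] for link in atags]
    return hrefs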

In [13]:
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls

{'GME': ['/?sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQOwgC',
  '/search?q=yahoo+finance+GME&tbm=nws&ie=UTF-8&gbv=1&sei=FzipY-eQDsW_kPIP75uwqAU',
  '/search?q=yahoo+finance+GME&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUIBSgA',
  '/search?q=yahoo+finance+GME&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUIBygC',
  '/search?q=yahoo+finance+GME&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUICCgD',
  'https://maps.google.com/maps?q=yahoo+finance+GME&um=1&ie=UTF-8&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUICSgE',
  '/search?q=yahoo+finance+GME&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUICigF',
  '/search?q=yahoo+finance+GME&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQ_AUICygG',
  '/advanced_search',
  '/search?q=yahoo+finance+GME&ie=UTF-8&tbm=nws&source=lnt&tbs=qdr:h&sa=X&ved=0ahUKEwin-IGhzZb8AhXFH0QIHe8NDFUQpwUIDQ',
  '/search?q=yahoo+finance+GME&ie=U

**Strip out unwanted URLs**

In [14]:
import re

In [17]:
exclude_list = ['maps', 'policies', 'preferences', 'accounts', 'support']

In [18]:
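def strip_unwanted_urls(urls, exclude_list):
    val = []
    for url in urls:
        # Keep only links that embed an https URL and contain none of the excluded words
        if 'https://' in url and not any(exclude_word in url for exclude_word in exclude_list):
            # Extract the embedded URL and drop everything after the first '&'
            res = re.findall(r'(https?://\S+)', url)[0].split('&')[0]
            val.append(res)
    # Remove duplicates (set() does not preserve the original order)
    return list(set(val))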

In [19]:
cleaned_urls = {ticker:strip_unwanted_urls(raw_urls[ticker], exclude_list) for ticker in monitored_tickers}
cleaned_urls

{'GME': ['https://finance.yahoo.com/news/my-meme-stock-fiasco-153417561.html',
  'https://finance.yahoo.com/video/meme-stocks-gain-steam-amc-164335352.html',
  'https://ca.finance.yahoo.com/news/after-hours-stock-movers-game-stop-rent-the-runway-c-3-ai-and-more-232228102.html',
  'https://finance.yahoo.com/news/the-game-stop-turnaround-promise-is-failing-111117027.html',
  'https://finance.yahoo.com/news/gamestop-reports-third-quarter-fiscal-210500302.html',
  'https://finance.yahoo.com/news/gme-resources-limited-asx-gme-223611770.html',
  'https://finance.yahoo.com/news/after-hours-stock-movers-game-stop-rent-the-runway-c-3-ai-and-more-232228102.html',
  'https://finance.yahoo.com/news/morningstar-ceos-message-to-meme-stock-investors-162841912.html',
  'https://finance.yahoo.com/news/gamestop-corp-nyse-gme-q3-183439060.html',
  'https://finance.yahoo.com/news/mark-cuban-stock-portfolio-10-211950824.html'],
 'TSLA': ['https://finance.yahoo.com/news/tesla-stock-bull-elon-musk-asleep-at-
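The regex in `strip_unwanted_urls` pulls an embedded `https://` URL out of each result link. As an alternative sketch, the standard library's `urllib.parse` can do the same job, assuming the result links take the form `/url?q=<target>&...` (an assumption, since the raw output above is truncated before any such link appears). `extract_target_url` and the sample href are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the destination of a '/url?q=...' redirect link, or None."""
    parsed = urlparse(href)
    # Only redirect-style links carry a target; '/search?...' and others are skipped
    if parsed.path != '/url':
        return None
    targets = parse_qs(parsed.query).get('q', [])
    if targets and targets[0].startswith(('http://', 'https://')):
        return targets[0]
    return None


# Hypothetical href shaped like a search-result redirect link
sample_href = '/url?q=https://finance.yahoo.com/news/example-123.html&sa=U&ved=abc'
print(extract_target_url(sample_href))
print(extract_target_url('/advanced_search'))
```

Parsing the query string avoids relying on `&` never appearing inside the target URL, though it only helps if the links really follow the assumed shape.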

**Search and scrape cleaned URLs**

In [20]:
def scrape_and_process(URLs):
    ARTICLES = []
    for url in URLs:
        # Fail fast instead of hanging indefinitely on an unresponsive page
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, 'html.parser')
        paragraphs = soup.find_all('p')
        text = [paragraph.text for paragraph in paragraphs]
        # Keep roughly the first 350 words of each article
        words = ' '.join(text).split(' ')[:350]
        ARTICLE = ' '.join(words)
        ARTICLES.append(ARTICLE)
    return ARTICLES

In [21]:
articles = {ticker:scrape_and_process(cleaned_urls[ticker]) for ticker in monitored_tickers}
articles

{'GME': ["I like money as much as anybody, and I’m not too pure to chase quick cash, if the opportunity seems plausible. So in 2021, I decided to dabble in the meme-stock craze that was upending Wall Street and making some gutsy day-traders rich. Gaming retailer GameStop (GME) was the first meme stock, with trader Keith Gill first making the case for why the stock could skyrocket in August of 2020, on the Reddit channel WallStreetBets. Almost nobody noticed Gill’s pitch until the stock did, in fact, blast off five months later. Part of Gill’s strategy was looking for beaten-down stocks with high levels of short interest and bet on a “short squeeze” that could trigger an exponential rise in the stock price. It happened. Before Gill’s pitch, GME traded at around $1. It drifted up slowly toward the end of 2020 and then went crazy, peaking at a closing price of $87 on January 27, 2021. More remarkable than the gain was the fact that it appeared to have nothing to do with GME’s financial pe

**Summarize all articles**

In [22]:
def summarize(articles):
    summaries = []
    for article in articles:
        # truncation=True clips the input to the tokenizer's configured maximum length
        input_ids = tokenizer.encode(article, return_tensors='pt', truncation=True)
        output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

In [23]:
summaries = {ticker:summarize(articles[ticker]) for ticker in monitored_tickers}
summaries

{'GME': ['I followed the meme-stock craze in 2021 but didn’t make a lot of money.',
  'We are aware of the issue and are working to resolve it.',
  'Rent the Runway, C3.ai report better-than-expected results. Oil at US$72 after dropping roughly 10% this week to lowest level since January',
  "Cohen and Furlong's experiment is proving to be a failure. Third quarter sales down 8.5% year over year",
  'Third quarter sales were $1.186 billion, compared to $1.297 billion in the prior year. SG&A as a percentage of sales was down on a sequential basis from 34.1%',
  'GME Resources insiders bought AU$100k worth of shares over the last year.',
  'Rent the Runway, C3.ai, Duckhorn report better-than-expected results. Here’s a round-up of other tech earnings reports:',
  'Retail investors are more likely to focus on value stocks. AMC, Bed Bath & Beyond among worst-performing meme stocks this year',
  'Reported EPS is $-0.31, expectations were $-0.28.',
  '10 stocks to buy now in the Mark Cuban sto

In [24]:
summaries['BTC']

['Over 90% of Valkyrie’s assets are tied to Justin Sun. U.S.-based firm also manages ‘Justin Tron’ tokens',
 'Core Scientific is one of the largest publicly traded crypto mining firms. Bitcoin is down 75% from its all-time high in November 2021',
 'Bitcoin mining companies have been hit hard by falling prices and surging energy prices.',
 'Bitcoin and USD currency pair ranked among top 10 tickers for 2022.',
 'Senate weighs in on digital asset anti-money laundering bill. Warren says bill doesn’t go far enough',
 'It was a year of highs and lows for crypto. Retail investors took a beating when FTX collapsed',
 'Top companies operating in the blockchain space include Mastercard, Visa.',
 'Bitcoin is eighth on the year’s top trending tickers list. Interest in the ticker surged early in 2021 after record highs',
 'We are aware of the issue and are working to resolve it.',
 'Ex-FTX CEO faces U.S. charges of fraud, money laundering. Bankman-Fried extradited from Bahamas after extradition cle
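Some summaries above read 'We are aware of the issue and are working to resolve it.', which looks like scraped error-page text rather than article content (an assumption from the output; the GME instance corresponds to a `/video/` URL). A minimal sketch of filtering such entries out before scoring follows. `drop_placeholder_summaries` and `PLACEHOLDER` are hypothetical names, not part of the notebook.

```python
PLACEHOLDER = 'We are aware of the issue and are working to resolve it.'


def drop_placeholder_summaries(summaries, urls, placeholder=PLACEHOLDER):
    """Remove summaries equal to the placeholder text, keeping URLs aligned."""
    kept = [(s, u) for s, u in zip(summaries, urls) if s != placeholder]
    kept_summaries = [s for s, _ in kept]
    kept_urls = [u for _, u in kept]
    return kept_summaries, kept_urls


# Two GME entries taken from the outputs above
sample_summaries = [
    'GME Resources insiders bought AU$100k worth of shares over the last year.',
    'We are aware of the issue and are working to resolve it.',
]
sample_urls = [
    'https://finance.yahoo.com/news/gme-resources-limited-asx-gme-223611770.html',
    'https://finance.yahoo.com/video/meme-stocks-gain-steam-amc-164335352.html',
]
clean_summaries, clean_urls = drop_placeholder_summaries(sample_summaries, sample_urls)
print(clean_summaries)
print(clean_urls)
```

Filtering summaries and URLs together keeps the two lists index-aligned, which the later export step relies on.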

**Adding sentiment analysis** 

In [25]:
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [26]:
sentiment(summaries['BTC'])

[{'label': 'POSITIVE', 'score': 0.8940140604972839},
 {'label': 'NEGATIVE', 'score': 0.9986491799354553},
 {'label': 'NEGATIVE', 'score': 0.9979472756385803},
 {'label': 'POSITIVE', 'score': 0.9927968382835388},
 {'label': 'NEGATIVE', 'score': 0.9945371747016907},
 {'label': 'NEGATIVE', 'score': 0.9984845519065857},
 {'label': 'POSITIVE', 'score': 0.6261817812919617},
 {'label': 'POSITIVE', 'score': 0.9809703230857849},
 {'label': 'POSITIVE', 'score': 0.9979088306427002},
 {'label': 'NEGATIVE', 'score': 0.9893856048583984}]

In [27]:
scores = {ticker:sentiment(summaries[ticker]) for ticker in monitored_tickers}
scores

{'GME': [{'label': 'NEGATIVE', 'score': 0.9995830655097961},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'NEGATIVE', 'score': 0.9902663826942444},
  {'label': 'NEGATIVE', 'score': 0.9997759461402893},
  {'label': 'NEGATIVE', 'score': 0.9996192455291748},
  {'label': 'NEGATIVE', 'score': 0.96634441614151},
  {'label': 'NEGATIVE', 'score': 0.9853072166442871},
  {'label': 'NEGATIVE', 'score': 0.9997033476829529},
  {'label': 'NEGATIVE', 'score': 0.9898213148117065},
  {'label': 'POSITIVE', 'score': 0.8015463948249817}],
 'TSLA': [{'label': 'NEGATIVE', 'score': 0.9959589838981628},
  {'label': 'NEGATIVE', 'score': 0.9996263980865479},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'NEGATIVE', 'score': 0.9997126460075378},
  {'label': 'NEGATIVE', 'score': 0.9957268238067627},
  {'label': 'NEGATIVE', 'score': 0.6679105758666992},
  {'label': 'NEGATIVE', 'score': 0.9976456761360168},
  {'label': 'NEGATIVE', 'score': 0.978071928024292},
  {'label': 'PO

In [28]:
print(summaries['GME'][3], scores['GME'][3]['label'], scores['GME'][3]['score'])

Cohen and Furlong's experiment is proving to be a failure. Third quarter sales down 8.5% year over year NEGATIVE 0.9997759461402893


In [29]:
scores['BTC'][0]['score']

0.8940140604972839
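Each result pairs a `label` with a confidence `score`. One way to compare tickers is to fold both into a single signed number and average it, sketched below. The helper names are hypothetical, and treating the confidence as a signed magnitude is an illustrative choice, not something the notebook does.

```python
def signed_score(result):
    """Map a sentiment result to [-1, 1]: positive labels stay positive, others flip sign."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']


def mean_signed_score(results):
    """Average the signed scores of a list of results; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(signed_score(r) for r in results) / len(results)


# First two BTC results from the output above
sample_results = [
    {'label': 'POSITIVE', 'score': 0.8940140604972839},
    {'label': 'NEGATIVE', 'score': 0.9986491799354553},
]
print(mean_signed_score(sample_results))
```

With the notebook's data this could be applied per ticker, for example `{ticker: mean_signed_score(scores[ticker]) for ticker in scores}`.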

**Exporting results to CSV**



In [33]:
range(len(summaries['GME']))

range(0, 10)

In [34]:
summaries['GME'][3]

"Cohen and Furlong's experiment is proving to be a failure. Third quarter sales down 8.5% year over year"

In [35]:
def create_output_array(summaries, scores, urls):
    output = []
    # Iterate over the tickers passed in rather than the global monitored_tickers
    for ticker in summaries:
        # Entries at the same index in each list refer to the same article
        for counter in range(len(summaries[ticker])):
            output_this = [
                ticker,
                summaries[ticker][counter],
                scores[ticker][counter]['label'],
                scores[ticker][counter]['score'],
                urls[ticker][counter]
            ]
            output.append(output_this)
    return output

In [36]:
final_output = create_output_array(summaries, scores, cleaned_urls)
final_output

[['GME',
  'I followed the meme-stock craze in 2021 but didn’t make a lot of money.',
  'NEGATIVE',
  0.9995830655097961,
  'https://finance.yahoo.com/news/my-meme-stock-fiasco-153417561.html'],
 ['GME',
  'We are aware of the issue and are working to resolve it.',
  'POSITIVE',
  0.9979088306427002,
  'https://finance.yahoo.com/video/meme-stocks-gain-steam-amc-164335352.html'],
 ['GME',
  'Rent the Runway, C3.ai report better-than-expected results. Oil at US$72 after dropping roughly 10% this week to lowest level since January',
  'NEGATIVE',
  0.9902663826942444,
  'https://ca.finance.yahoo.com/news/after-hours-stock-movers-game-stop-rent-the-runway-c-3-ai-and-more-232228102.html'],
 ['GME',
  "Cohen and Furlong's experiment is proving to be a failure. Third quarter sales down 8.5% year over year",
  'NEGATIVE',
  0.9997759461402893,
  'https://finance.yahoo.com/news/the-game-stop-turnaround-promise-is-failing-111117027.html'],
 ['GME',
  'Third quarter sales were $1.186 billion, com

In [37]:
final_output.insert(0, ['Ticker', 'Summary', 'Label', 'Confidence', 'URL'])

In [38]:
final_output

[['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
 ['GME',
  'I followed the meme-stock craze in 2021 but didn’t make a lot of money.',
  'NEGATIVE',
  0.9995830655097961,
  'https://finance.yahoo.com/news/my-meme-stock-fiasco-153417561.html'],
 ['GME',
  'We are aware of the issue and are working to resolve it.',
  'POSITIVE',
  0.9979088306427002,
  'https://finance.yahoo.com/video/meme-stocks-gain-steam-amc-164335352.html'],
 ['GME',
  'Rent the Runway, C3.ai report better-than-expected results. Oil at US$72 after dropping roughly 10% this week to lowest level since January',
  'NEGATIVE',
  0.9902663826942444,
  'https://ca.finance.yahoo.com/news/after-hours-stock-movers-game-stop-rent-the-runway-c-3-ai-and-more-232228102.html'],
 ['GME',
  "Cohen and Furlong's experiment is proving to be a failure. Third quarter sales down 8.5% year over year",
  'NEGATIVE',
  0.9997759461402893,
  'https://finance.yahoo.com/news/the-game-stop-turnaround-promise-is-failing-111117027.html'],
 [

In [39]:
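import csv
# UTF-8 keeps non-ASCII characters in the summaries (such as curly quotes) intact on any platform
with open('cryptosummaries.csv', mode='w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(final_output)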
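To check the export, the file can be read back with `csv.DictReader`, which uses the header row as keys. The sketch below is self-contained: it writes one row taken from the outputs above to a temporary directory (an assumption made so the example does not overwrite the real file) and reads it back.

```python
import csv
import os
import tempfile

# Header plus one GME row from the outputs above; the summary contains a comma,
# so QUOTE_MINIMAL will wrap that field in quotes.
rows = [
    ['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
    ['GME',
     'Reported EPS is $-0.31, expectations were $-0.28.',
     'NEGATIVE',
     0.9898213148117065,
     'https://finance.yahoo.com/news/gamestop-corp-nyse-gme-q3-183439060.html'],
]

path = os.path.join(tempfile.mkdtemp(), 'cryptosummaries.csv')
with open(path, mode='w', newline='', encoding='utf-8') as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Read the file back; every value comes back as a string
with open(path, newline='', encoding='utf-8') as f:
    records = list(csv.DictReader(f))

print(records[0]['Ticker'], records[0]['Label'], float(records[0]['Confidence']))
```

Because the CSV stores everything as text, the confidence has to be converted with `float()` after reading.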