<a href="https://colab.research.google.com/github/Michaelzats/Automate-Stocks-and-Crypto-Research/blob/main/Automate_Stocks_and_Crypto_Research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automate Stocks and Crypto Research

This tool helps to analyse the performance of, and current opinion on, a stock or cryptocurrency by scraping news articles from the internet with BeautifulSoup. After the articles are collected, a machine learning model generates a summary of each one, and a sentiment classifier labels each summary as positive or negative with a confidence score. The output can therefore help to find stocks or cryptocurrencies with favourable coverage to consider investing in.

# 1. Install and Import Baseline Dependencies

In [107]:
# !pip install transformers
# !pip install sentencepiece


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [108]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from bs4 import BeautifulSoup
import requests

# 2. Setup Summarization Model

In [109]:
model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# 3. Summarize a Single Article

In [110]:
url = "https://au.finance.yahoo.com/news/china-restricting-tesla-use-uncovers-a-significant-challenge-for-elon-musk-expert-161921664.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.find_all('p')

In [111]:
paragraphs[0].text

'Thank you for your patience.'

In [112]:
text = [paragraph.text for paragraph in paragraphs]
words = ' '.join(text).split(' ')[:400]
ARTICLE = ' '.join(words)

In [113]:
ARTICLE

'Thank you for your patience. Our engineers are working quickly to resolve the issue.'

In [114]:
input_ids = tokenizer.encode(ARTICLE, return_tensors='pt')
output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
summary = tokenizer.decode(output[0], skip_special_tokens=True)

In [115]:
summary

'We are aware of the issue and are working to resolve it.'
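
The output above shows what happens when the request returns a placeholder page instead of the article: the model simply summarizes the error message. A minimal guard could be sketched as follows, assuming that a real article contains at least some minimum number of words. The names `first_words` and `looks_like_article` and the threshold of 50 words are illustrative assumptions, not part of the original notebook.

```python
def first_words(text, limit=400):
    """Keep at most `limit` space-separated words, as the cells above do."""
    return ' '.join(text.split(' ')[:limit])


def looks_like_article(text, min_words=50):
    """Heuristic: treat very short scraped text as a placeholder or error page."""
    return len(text.split()) >= min_words


# The placeholder text returned by the request in this section.
placeholder = ('Thank you for your patience. '
               'Our engineers are working quickly to resolve the issue.')

print(looks_like_article(placeholder))  # False: the text has only 14 words
```

A check like this could run before `tokenizer.encode`, so that placeholder pages are skipped rather than summarized. The right threshold depends on the site being scraped.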

# 4. Building a News and Sentiment Pipeline

In [116]:
monitored_tickers = ['MSFT','AAPL','SPCE']

## 4.1. Search for Stock News using Google and Yahoo Finance

In [117]:
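def search_for_stock_news_urls(ticker):
    # Query Google News for Yahoo Finance coverage of the ticker
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # href=True skips anchor tags without an href, which would otherwise raise a KeyError
    atags = soup.find_all('a', href=True)
    hrefs = [link['href'] for link in atags]
    return hrefs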

In [118]:
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls

{'MSFT': ['/?sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQOwgC',
  '/?output=search&ie=UTF-8&tbm=nws&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQPAgE',
  '/search?q=yahoo+finance+MSFT&tbm=nws&ie=UTF-8&gbv=1&sei=_weZYqTwK9zEmAWGsqzgCw',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIBygA',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICSgC',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICigD',
  'https://maps.google.com/maps?q=yahoo+finance+MSFT&um=1&ie=UTF-8&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICygE',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIDCgF',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIDSgG',
  '/advanced_search',
  '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=nws&source=lnt&tbs

In [119]:
raw_urls['MSFT']

['/?sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQOwgC',
 '/?output=search&ie=UTF-8&tbm=nws&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQPAgE',
 '/search?q=yahoo+finance+MSFT&tbm=nws&ie=UTF-8&gbv=1&sei=_weZYqTwK9zEmAWGsqzgCw',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIBygA',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICSgC',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICigD',
 'https://maps.google.com/maps?q=yahoo+finance+MSFT&um=1&ie=UTF-8&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUICygE',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIDCgF',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwjk67aSuY_4AhVcIqYKHQYZC7wQ_AUIDSgG',
 '/advanced_search',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=nws&source=lnt&tbs=lr:lang_1zh-CN%7Cl

## 4.2. Strip out unwanted URLs

In [120]:
import re


In [121]:
exclude_list = ['maps', 'policies', 'preferences', 'accounts', 'support']

In [122]:
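def strip_unwanted_urls(urls, exclude_list):
    val = []
    for url in urls:
        # Keep only hrefs that embed an absolute https link and contain no excluded word
        if 'https://' in url and not any(exclude_word in url for exclude_word in exclude_list):
            # Extract the embedded link and drop anything after the first '&'
            res = re.findall(r'(https?://\S+)', url)[0].split('&')[0]
            val.append(res)
    # De-duplicate; note that set() does not preserve the original order
    return list(set(val))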
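
The regex-and-split approach in `strip_unwanted_urls` suggests that the article links arrive wrapped in a redirect href of the form `/url?q=https://...&sa=...`. That form is an assumption here, since the raw output shown in this notebook is truncated before any such link appears. Under that assumption, an alternative sketch using only the standard library parses the query string instead. The helper names `extract_target_url` and `dedupe_keep_order` are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the destination URL from a redirect-style href, or None."""
    query = parse_qs(urlparse(href).query)
    targets = query.get('q', [])
    # Plain search links also carry a 'q' parameter, but it holds the search
    # terms rather than a URL, so only values that look like URLs are accepted.
    if targets and targets[0].startswith(('http://', 'https://')):
        return targets[0]
    return None


def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order (unlike set())."""
    return list(dict.fromkeys(items))


hrefs = [
    '/search?q=yahoo+finance+MSFT&tbm=nws',                           # search link
    '/url?q=https://finance.yahoo.com/news/example-1.html&sa=U',      # made-up article link
    '/url?q=https://finance.yahoo.com/news/example-1.html&sa=U&x=1',  # duplicate target
]
targets = [extract_target_url(h) for h in hrefs]
print(dedupe_keep_order([t for t in targets if t is not None]))
```

Because `parse_qs` percent-decodes parameter values, this variant would also undo one level of URL encoding. One entry in the cleaned output (the `hk.finance.yahoo.com` link) contains `%25` sequences, which may indicate a URL that is still encoded once.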

In [123]:
cleaned_urls = {ticker:strip_unwanted_urls(raw_urls[ticker], exclude_list) for ticker in monitored_tickers}
cleaned_urls

{'MSFT': ['https://ca.finance.yahoo.com/news/microsoft-launches-surface-laptop-go-2-599-130049287.html',
  'https://finance.yahoo.com/news/duck-creek-technologies-joins-microsoft-123000047.html',
  'https://finance.yahoo.com/news/microsofts-surface-laptop-go-2-offers-more-speed-and-storage-capacity-130028787.html',
  'https://finance.yahoo.com/news/heres-why-analyst-remained-bullish-182226629.html',
  'https://ca.finance.yahoo.com/news/microsoft-lowers-revenue-profit-forecasts-131435503.html',
  'https://finance.yahoo.com/news/microsoft-slow-hiring-windows-office-153055705.html',
  'https://finance.yahoo.com/news/china-backed-hackers-exploiting-unpatched-111004168.html',
  'https://finance.yahoo.com/news/accenture-microsoft-avanade-expand-partnership-115900867.html',
  'https://finance.yahoo.com/news/microsoft-lowers-revenue-profit-forecasts-131435503.html',
  'https://hk.finance.yahoo.com/news/%25E7%25BE%258E%25E8%2582%25A1%25E5%2586%258D%25E6%259A%25B4%25E6%258C%25AB-%25E5%25BE%25AE%

## 4.3. Search and Scrape Cleaned URLs

In [124]:
def scrape_and_process(URLs):
    ARTICLES = []
    for url in URLs: 
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        paragraphs = soup.find_all('p')
        text = [paragraph.text for paragraph in paragraphs]
        words = ' '.join(text).split(' ')[:350]
        ARTICLE = ' '.join(words)
        ARTICLES.append(ARTICLE)
    return ARTICLES

In [125]:
articles = {ticker:scrape_and_process(cleaned_urls[ticker]) for ticker in monitored_tickers}
articles

{'MSFT': ['Sentiment dented after JPMorgan CEO warned the U.S. economy is facing a "hurricane" Microsoft (MSFT) on Wednesday debuted its latest low-cost laptop for students and workers looking for a compact, Windows 11-powered system that won’t break the bank. Starting at $599, the Surface Laptop Go 2 is the follow-up to 2020’s Surface Laptop Go and gets faster performance and new replaceable components that Microsoft says will extend the usefulness of the Go 2. Microsoft’s latest laptop isn’t going to knock your socks off in the performance department, especially compared to the likes of the mighty Surface Laptop Studio. But its new 11th-generation Intel (INTC) Core i5 chip should provide more than enough power for web browsing, streaming movies, and video chatting. Think of the Surface Laptop Go 2 as more of a Chromebook (GOOG, GOOGL) competitor than a rival to, say, Apple’s (AAPL) MacBook Air. Its 12.4-inch touchscreen PixelSense display is on the smaller side, but ensures that, com

In [127]:
articles['MSFT']

['Sentiment dented after JPMorgan CEO warned the U.S. economy is facing a "hurricane" Microsoft (MSFT) on Wednesday debuted its latest low-cost laptop for students and workers looking for a compact, Windows 11-powered system that won’t break the bank. Starting at $599, the Surface Laptop Go 2 is the follow-up to 2020’s Surface Laptop Go and gets faster performance and new replaceable components that Microsoft says will extend the usefulness of the Go 2. Microsoft’s latest laptop isn’t going to knock your socks off in the performance department, especially compared to the likes of the mighty Surface Laptop Studio. But its new 11th-generation Intel (INTC) Core i5 chip should provide more than enough power for web browsing, streaming movies, and video chatting. Think of the Surface Laptop Go 2 as more of a Chromebook (GOOG, GOOGL) competitor than a rival to, say, Apple’s (AAPL) MacBook Air. Its 12.4-inch touchscreen PixelSense display is on the smaller side, but ensures that, combined wit

## 4.4. Summarise all Articles

In [128]:
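def summarize(articles):
    summaries = []
    for article in articles:
        # truncation=True cuts long articles to the tokenizer's configured maximum input length
        input_ids = tokenizer.encode(article, return_tensors='pt', truncation=True)
        # Beam search with 5 beams; summaries are capped at 55 tokens
        output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries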

In [129]:
summaries = {ticker:summarize(articles[ticker]) for ticker in monitored_tickers}
summaries

{'MSFT': ['Surface Laptop Go 2 gets faster performance, replaceable components. Smaller, lighter than 2020’s Surface Go',
  'Software-as-service () solution available on the Microsoft Azure. P&C insurance companies turn to Duck Creek to optimize end-to-end insurance lifecycle',
  'New Surface Laptop Go 2 comes with an 11th-gen Intel CPU. Battery life up a touch to 13.5 hours under normal usage',
  'Recent pullback in price is a significant buying opportunity.',
  'Tech giant is latest U.S. company to warn of a stronger dollar.',
  'Company cites need to realign staffing priorities. Tech firms have been slowing or freezing hiring in recent months',
  'Security firm Proofpoint says Chinese hackers are exploiting the flaw.',
  'Companies join forces to help accelerate transition to net zero. Co-development of innovative solutions to help businesses reduce carbon emissions',
  'Tech giant cuts fourth-quarter profit and revenue forecast. Tesla CEO calls factory workers demanding schedules',

In [None]:
summaries['AAPL']

# 5. Adding Sentiment Analysis

In [131]:
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [130]:
sentiment(summaries['MSFT'])

[{'label': 'NEGATIVE', 'score': 0.6890648603439331},
 {'label': 'NEGATIVE', 'score': 0.7326964735984802},
 {'label': 'NEGATIVE', 'score': 0.9706153273582458},
 {'label': 'POSITIVE', 'score': 0.9970439076423645},
 {'label': 'NEGATIVE', 'score': 0.9806005358695984},
 {'label': 'NEGATIVE', 'score': 0.9971606731414795},
 {'label': 'NEGATIVE', 'score': 0.9979946613311768},
 {'label': 'POSITIVE', 'score': 0.9985938668251038},
 {'label': 'NEGATIVE', 'score': 0.9985505938529968},
 {'label': 'POSITIVE', 'score': 0.9979088306427002}]

In [132]:
scores = {ticker:sentiment(summaries[ticker]) for ticker in monitored_tickers}
scores

{'MSFT': [{'label': 'NEGATIVE', 'score': 0.6890648603439331},
  {'label': 'NEGATIVE', 'score': 0.7326964735984802},
  {'label': 'NEGATIVE', 'score': 0.9706153273582458},
  {'label': 'POSITIVE', 'score': 0.9970439076423645},
  {'label': 'NEGATIVE', 'score': 0.9806005358695984},
  {'label': 'NEGATIVE', 'score': 0.9971606731414795},
  {'label': 'NEGATIVE', 'score': 0.9979946613311768},
  {'label': 'POSITIVE', 'score': 0.9985938668251038},
  {'label': 'NEGATIVE', 'score': 0.9985505938529968},
  {'label': 'POSITIVE', 'score': 0.9979088306427002}]}

In [133]:
print(summaries['MSFT'][2], scores['MSFT'][2]['label'], scores['MSFT'][2]['score'])

New Surface Laptop Go 2 comes with an 11th-gen Intel CPU. Battery life up a touch to 13.5 hours under normal usage NEGATIVE 0.9706153273582458


In [134]:
scores['MSFT'][0]['score']

0.6890648603439331
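
The per-article labels can be condensed into a single figure per ticker, in line with the notebook's aim of showing how much of the coverage is positive or negative. The following is a minimal sketch with two illustrative helpers, `positive_share` and `signed_score`, neither of which is part of the original notebook. The sample data are the MSFT labels from the output above, with scores rounded to three decimals.

```python
def positive_share(results):
    """Fraction of results labelled POSITIVE, or None for an empty list."""
    if not results:
        return None
    positives = sum(1 for r in results if r['label'] == 'POSITIVE')
    return positives / len(results)


def signed_score(result):
    """Map a result to [-1, 1]: NEGATIVE confidence becomes a negative number."""
    sign = 1 if result['label'] == 'POSITIVE' else -1
    return sign * result['score']


# Labels copied from the MSFT output above; scores rounded to three decimals.
msft = [
    {'label': 'NEGATIVE', 'score': 0.689},
    {'label': 'NEGATIVE', 'score': 0.733},
    {'label': 'NEGATIVE', 'score': 0.971},
    {'label': 'POSITIVE', 'score': 0.997},
    {'label': 'NEGATIVE', 'score': 0.981},
    {'label': 'NEGATIVE', 'score': 0.997},
    {'label': 'NEGATIVE', 'score': 0.998},
    {'label': 'POSITIVE', 'score': 0.999},
    {'label': 'NEGATIVE', 'score': 0.999},
    {'label': 'POSITIVE', 'score': 0.998},
]

print(positive_share(msft))  # 0.3, i.e. 3 of the 10 summaries are POSITIVE
print(sum(signed_score(r) for r in msft) / len(msft))  # mean signed score
```

These aggregates are only as reliable as the underlying labels. The default model reported above is fine-tuned on SST-2, a movie-review dataset, so its judgement of financial headlines may be off, as the NEGATIVE label on the neutral-sounding Surface Laptop summary suggests.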

# 6. Exporting Results to CSV

In [135]:
summaries

{'MSFT': ['Surface Laptop Go 2 gets faster performance, replaceable components. Smaller, lighter than 2020’s Surface Go',
  'Software-as-service () solution available on the Microsoft Azure. P&C insurance companies turn to Duck Creek to optimize end-to-end insurance lifecycle',
  'New Surface Laptop Go 2 comes with an 11th-gen Intel CPU. Battery life up a touch to 13.5 hours under normal usage',
  'Recent pullback in price is a significant buying opportunity.',
  'Tech giant is latest U.S. company to warn of a stronger dollar.',
  'Company cites need to realign staffing priorities. Tech firms have been slowing or freezing hiring in recent months',
  'Security firm Proofpoint says Chinese hackers are exploiting the flaw.',
  'Companies join forces to help accelerate transition to net zero. Co-development of innovative solutions to help businesses reduce carbon emissions',
  'Tech giant cuts fourth-quarter profit and revenue forecast. Tesla CEO calls factory workers demanding schedules',

In [136]:
scores

{'MSFT': [{'label': 'NEGATIVE', 'score': 0.6890648603439331},
  {'label': 'NEGATIVE', 'score': 0.7326964735984802},
  {'label': 'NEGATIVE', 'score': 0.9706153273582458},
  {'label': 'POSITIVE', 'score': 0.9970439076423645},
  {'label': 'NEGATIVE', 'score': 0.9806005358695984},
  {'label': 'NEGATIVE', 'score': 0.9971606731414795},
  {'label': 'NEGATIVE', 'score': 0.9979946613311768},
  {'label': 'POSITIVE', 'score': 0.9985938668251038},
  {'label': 'NEGATIVE', 'score': 0.9985505938529968},
  {'label': 'POSITIVE', 'score': 0.9979088306427002}]}

In [137]:
cleaned_urls

{'MSFT': ['https://ca.finance.yahoo.com/news/microsoft-launches-surface-laptop-go-2-599-130049287.html',
  'https://finance.yahoo.com/news/duck-creek-technologies-joins-microsoft-123000047.html',
  'https://finance.yahoo.com/news/microsofts-surface-laptop-go-2-offers-more-speed-and-storage-capacity-130028787.html',
  'https://finance.yahoo.com/news/heres-why-analyst-remained-bullish-182226629.html',
  'https://ca.finance.yahoo.com/news/microsoft-lowers-revenue-profit-forecasts-131435503.html',
  'https://finance.yahoo.com/news/microsoft-slow-hiring-windows-office-153055705.html',
  'https://finance.yahoo.com/news/china-backed-hackers-exploiting-unpatched-111004168.html',
  'https://finance.yahoo.com/news/accenture-microsoft-avanade-expand-partnership-115900867.html',
  'https://finance.yahoo.com/news/microsoft-lowers-revenue-profit-forecasts-131435503.html',
  'https://hk.finance.yahoo.com/news/%25E7%25BE%258E%25E8%2582%25A1%25E5%2586%258D%25E6%259A%25B4%25E6%258C%25AB-%25E5%25BE%25AE%

In [138]:
range(len(summaries['MSFT']))

range(0, 10)

In [139]:
summaries['MSFT'][3]

'Recent pullback in price is a significant buying opportunity.'

In [140]:
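def create_output_array(summaries, scores, urls):
    output = []
    # Iterate over the tickers in summaries instead of relying on the global monitored_tickers
    for ticker in summaries:
        for counter in range(len(summaries[ticker])):
            # One row per article: ticker, summary, sentiment label, confidence, source URL
            output_this = [
                ticker,
                summaries[ticker][counter],
                scores[ticker][counter]['label'],
                scores[ticker][counter]['score'],
                urls[ticker][counter]
            ]
            output.append(output_this)
    return output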

In [141]:
final_output = create_output_array(summaries, scores, cleaned_urls)
final_output

[['MSFT',
  'Surface Laptop Go 2 gets faster performance, replaceable components. Smaller, lighter than 2020’s Surface Go',
  'NEGATIVE',
  0.6890648603439331,
  'https://ca.finance.yahoo.com/news/microsoft-launches-surface-laptop-go-2-599-130049287.html'],
 ['MSFT',
  'Software-as-service () solution available on the Microsoft Azure. P&C insurance companies turn to Duck Creek to optimize end-to-end insurance lifecycle',
  'NEGATIVE',
  0.7326964735984802,
  'https://finance.yahoo.com/news/duck-creek-technologies-joins-microsoft-123000047.html'],
 ['MSFT',
  'New Surface Laptop Go 2 comes with an 11th-gen Intel CPU. Battery life up a touch to 13.5 hours under normal usage',
  'NEGATIVE',
  0.9706153273582458,
  'https://finance.yahoo.com/news/microsofts-surface-laptop-go-2-offers-more-speed-and-storage-capacity-130028787.html'],
 ['MSFT',
  'Recent pullback in price is a significant buying opportunity.',
  'POSITIVE',
  0.9970439076423645,
  'https://finance.yahoo.com/news/heres-why-an

In [142]:
final_output.insert(0, ['Ticker', 'Summary', 'Label', 'Confidence', 'URL'])

In [143]:
final_output

[['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
 ['MSFT',
  'Surface Laptop Go 2 gets faster performance, replaceable components. Smaller, lighter than 2020’s Surface Go',
  'NEGATIVE',
  0.6890648603439331,
  'https://ca.finance.yahoo.com/news/microsoft-launches-surface-laptop-go-2-599-130049287.html'],
 ['MSFT',
  'Software-as-service () solution available on the Microsoft Azure. P&C insurance companies turn to Duck Creek to optimize end-to-end insurance lifecycle',
  'NEGATIVE',
  0.7326964735984802,
  'https://finance.yahoo.com/news/duck-creek-technologies-joins-microsoft-123000047.html'],
 ['MSFT',
  'New Surface Laptop Go 2 comes with an 11th-gen Intel CPU. Battery life up a touch to 13.5 hours under normal usage',
  'NEGATIVE',
  0.9706153273582458,
  'https://finance.yahoo.com/news/microsofts-surface-laptop-go-2-offers-more-speed-and-storage-capacity-130028787.html'],
 ['MSFT',
  'Recent pullback in price is a significant buying opportunity.',
  'POSITIVE',
  0.9970439076

In [144]:
import csv
with open('assetsummaries.csv', mode='w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(final_output)
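
As an alternative to indexing three parallel lists by position, the rows can be assembled with `zip`. This is a sketch rather than the notebook's method: `build_rows` and `rows_to_csv_text` are hypothetical names, and the sample data below are made up for illustration. It writes to an in-memory buffer so the result can be inspected without touching the file system.

```python
import csv
import io

HEADER = ['Ticker', 'Summary', 'Label', 'Confidence', 'URL']


def build_rows(summaries, scores, urls):
    """Build CSV rows, one per article, for every ticker in `summaries`."""
    rows = [HEADER]
    for ticker, ticker_summaries in summaries.items():
        # zip stops at the shortest list, so a length mismatch cannot raise
        # IndexError; unmatched trailing items are silently dropped.
        for summary, score, url in zip(ticker_summaries, scores[ticker], urls[ticker]):
            rows.append([ticker, summary, score['label'], score['score'], url])
    return rows


def rows_to_csv_text(rows):
    """Serialize rows to CSV text using the csv module's default dialect."""
    buffer = io.StringIO(newline='')
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample data, for illustration only.
sample_summaries = {'XYZ': ['Shares rise after earnings.', 'Outlook cut.']}
sample_scores = {'XYZ': [{'label': 'POSITIVE', 'score': 0.9},
                         {'label': 'NEGATIVE', 'score': 0.8}]}
sample_urls = {'XYZ': ['https://example.com/1', 'https://example.com/2']}

print(rows_to_csv_text(build_rows(sample_summaries, sample_scores, sample_urls)))
```

Whether silently dropping unmatched items is acceptable is a design choice. If the three lists should always have equal length, an explicit length check before zipping would surface scraping failures earlier.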