# Install and Import Baseline dependencies

In [16]:
!pip install transformers beautifulsoup4 requests selenium webdriver-manager

Collecting selenium
  Downloading selenium-4.8.2-py3-none-any.whl (6.9 MB)
Collecting webdriver_manager
  Downloading webdriver_manager-3.8.5-py2.py3-none-any.whl (27 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.0-py3-none-any.whl (14 kB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting async-generator>=1.9

The following code imports several Python modules and libraries that are necessary for implementing human-centered financial summarization using the Pegasus model. Here is an explanation of each import statement:
- **transformers**: This imports the PegasusTokenizer and PegasusForConditionalGeneration classes and the pipeline function from the Hugging Face transformers library. Pegasus is a pre-trained sequence-to-sequence model for abstractive text summarization developed by Google Research; the transformers library provides an implementation of it. The PegasusTokenizer class converts input text into numerical tokens that the model can process, the PegasusForConditionalGeneration class generates summaries from the tokenized input, and pipeline is used later to build a sentiment-analysis pipeline.
- **beautifulsoup4**: This imports the BeautifulSoup class from the bs4 library. BeautifulSoup is a Python library for web scraping and parsing HTML and XML documents. It provides a simple and efficient way to extract data from web pages.
- **requests**: This imports the requests library, which is used for sending HTTP requests to web servers and receiving responses. In this context, it is used for downloading web pages that will be used as input for summarization.
- **selenium**: This imports the webdriver module from the selenium library. Selenium is a popular tool for automating web browsers, and it is commonly used for web scraping and testing. In this context, it is used to programmatically control a web browser and interact with web pages in order to scrape data.
- **webdriver-manager**: This imports the ChromeDriverManager class from the webdriver_manager.chrome module. ChromeDriverManager downloads and manages ChromeDriver executables, which Selenium needs in order to control Chrome for automated tests or web scraping.
- **time**: This imports the time module, which provides various time-related functions. In this context, it is used to add delays in the program to give time for web pages to load and to avoid overloading web servers with too many requests.
- **By**: This imports the By class from the selenium.webdriver.common.by module. The By class is used to specify the method of locating elements on a web page. For example, it can be used to find elements by their ID, class, name, tag name, or XPath.
- **re**: This imports the re module, which provides support for regular expressions. Regular expressions are a powerful tool for manipulating and searching text data. In this context, it is used to extract the URL portion of each search-result link when the scraped URLs are cleaned.


In [17]:
# human-centered-summarization/financial-summarization-pegasus
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, pipeline
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By
import re

# Setup Summarization model
The code snippet below loads a pre-trained Pegasus model and tokenizer from the Hugging Face model hub.
`model_name = "human-centered-summarization/financial-summarization-pegasus"` sets the name of the pre-trained model to be loaded. This specific model is a Pegasus model fine-tuned for summarizing financial text.

`tokenizer = PegasusTokenizer.from_pretrained(model_name)` loads the pre-trained tokenizer associated with the Pegasus model specified in `model_name`. The tokenizer is responsible for encoding and decoding text inputs and outputs into a numerical representation that can be processed by the model.

`model = PegasusForConditionalGeneration.from_pretrained(model_name)` loads the pre-trained Pegasus model specified in `model_name`. The model is an implementation of a transformer-based sequence-to-sequence model, which means it can take a sequence of text as input and produce a summary of that input as output. The `PegasusForConditionalGeneration` class specifically is a subclass of the `PreTrainedModel` class provided by the Transformers library, which provides additional functionality for training and inference.

Once loaded, the tokenizer and model objects can be used to generate summaries for our texts.

In [18]:
# Let's load the model and the tokenizer from the Hugging Face model hub
model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name) # PyTorch model; use TFPegasusForConditionalGeneration for the TensorFlow version

# Summarize a single article

The code is an example of how to scrape the text content from a news article on Yahoo Finance using the Python requests library and the BeautifulSoup package.

The first line initializes the URL of the webpage that is to be scraped, which in this case is an article on Coinbase rolling out a new layer 2 blockchain on Yahoo Finance.

The second line sends a GET request to the webpage using the requests library, and the response is stored in the variable r. The get method retrieves the content of the specified URL.

The third line initializes a BeautifulSoup object by passing the text content of the response object to the constructor of BeautifulSoup. The 'html.parser' argument specifies the parser that BeautifulSoup should use to parse the HTML content.

The fourth line finds all the HTML paragraphs on the page and stores them in the variable paragraphs. This is done by calling the find_all method of the soup object with the argument 'p', which specifies that the method should search for all <p> tags on the page.

After executing these lines of code, the variable paragraphs contains a list of the <p> elements found on the page. These are BeautifulSoup Tag objects that still include nested markup such as links, not yet plain text.

In [19]:
url = "https://finance.yahoo.com/news/coinbase-rolls-out-new-layer-2-blockchain-base-140026538.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.find_all('p')
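To make concrete what collecting the text of every `<p>` element involves, here is a minimal offline sketch. It uses the standard library's `html.parser` instead of BeautifulSoup, a swapped-in technique chosen only so that the example runs without a network connection or extra packages. The class name `ParagraphCollector` and the sample HTML are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._buffer = None   # text pieces of the <p> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is still delivered here
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


# Assumed sample markup, shaped like the article paragraphs in this notebook
sample_html = (
    "<div>Menu</div>"
    '<p>Coinbase Global (<a href="https://finance.yahoo.com/quote/COIN/">COIN</a>) '
    "on Thursday announced a new Layer 2 blockchain called Base.</p>"
    "<p>The company has no plans for issuing a token.</p>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
print(collector.paragraphs)
```

The link text `COIN` ends up inline in the first paragraph, which is the same flattening that reading `.text` on a BeautifulSoup paragraph performs.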

In [20]:
paragraphs

[<p>Coinbase Global (<a class="link" data-i13n="cpos:1;pos:1" data-ylk="slk:COIN;cpos:1;pos:1" href="https://finance.yahoo.com/quote/COIN/">COIN</a>) on Thursday announced a new Layer 2 blockchain called Base.</p>,
 <p>Initially incubated inside the company, the protocol is intended to improve data processing and transaction costs across Ethereum and other blockchains, enabling software developers to more readily build other crypto applications.</p>,
 <p>Building new and better applications is the crucial next step for attracting crypto's next wave of users, Coinbase said.</p>,
 <p>"We've done a really great job doing that on trading and speculation," Jesse Pollak, the project lead and Coinbase's senior director of engineering, told Yahoo Finance. "But we haven't seen the really useful applications emerge yet."</p>,
 <p>The Base protocol has been built using open source software created by the Layer 2 blockchain Optimism. The company has no plans for issuing a token related to the new 

This code snippet is extracting the text content of a web page by using Python's requests library to send an HTTP GET request to the URL of the webpage. The web page in this example is a news article about Coinbase, a cryptocurrency exchange. Once the response is received, BeautifulSoup, a Python library for parsing HTML and XML documents, is used to extract all the paragraphs in the HTML code of the page.

The extracted paragraphs are then stored in a list variable called paragraphs. The code then uses a list comprehension to extract the text content of each paragraph by calling the text attribute of each BeautifulSoup object. This produces a list of strings, where each string is the text content of a single paragraph.

Next, the code joins all the strings in the list using the join() method with a space as a separator. This produces a single long string that contains all the text from the extracted paragraphs. The long string is then split into individual words using split(' '), and the [:400] slice keeps only the first 400 of them.

Finally, the retained words are joined back into a single string, which is stored in a variable called ARTICLE. This variable contains the opening of the original article, truncated to at most 400 words; it is not yet a summary. The purpose of this code is to keep the input short enough for the summarization model.

In [21]:
text = [ paragraph.text for paragraph in paragraphs]
words = ' '.join(text).split(' ')[:400]
ARTICLE = ' '.join(words)
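The truncation step can be wrapped in a small function, sketched below. `truncate_words` mirrors the notebook's `split(' ')` logic, while `truncate_words_normalized` is a hypothetical variant (not in the notebook) that uses `split()` so that runs of whitespace do not count as empty "words". Both function names are assumptions made for this illustration.

```python
def truncate_words(paragraph_texts, max_words=400):
    """Join paragraph strings and keep the first max_words space-separated items."""
    words = " ".join(paragraph_texts).split(" ")[:max_words]
    return " ".join(words)


def truncate_words_normalized(paragraph_texts, max_words=400):
    """Variant: split on any whitespace, so empty strings never use up the budget."""
    words = " ".join(paragraph_texts).split()[:max_words]
    return " ".join(words)


sample = ["a  b", "c"]  # note the double space inside the first string

print(repr(truncate_words(sample, 3)))             # 'a  b'
print(repr(truncate_words_normalized(sample, 3)))  # 'a b c'
```

With `split(' ')`, the double space produces an empty string that counts toward the limit, so the first version returns fewer real words than requested.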

In [22]:
ARTICLE

'Coinbase Global (COIN) on Thursday announced a new Layer 2 blockchain called Base. Initially incubated inside the company, the protocol is intended to improve data processing and transaction costs across Ethereum and other blockchains, enabling software developers to more readily build other crypto applications. Building new and better applications is the crucial next step for attracting crypto\'s next wave of users, Coinbase said. "We\'ve done a really great job doing that on trading and speculation," Jesse Pollak, the project lead and Coinbase\'s senior director of engineering, told Yahoo Finance. "But we haven\'t seen the really useful applications emerge yet." The Base protocol has been built using open source software created by the Layer 2 blockchain Optimism. The company has no plans for issuing a token related to the new protocol. Within the hour following the announcement, the native token for Optimism - OP - jumped 7%. In recent weeks Coinbase opened its blockchain to 48 oth

This code performs the task of generating a summary for a given input text using the Pegasus model and tokenizer that have been previously loaded in the code.

Here is the breakdown of the code:

`input_ids = tokenizer.encode(ARTICLE, return_tensors='pt')`: This line encodes the input text using the tokenizer that has been previously loaded. The encode function of the tokenizer tokenizes the input text and converts it into a sequence of integer IDs, which are used as input to the model. The return_tensors='pt' argument specifies that the function should return a PyTorch tensor.

`output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)`: This line generates the summary for the input text using the model that has been previously loaded. The generate function of the model takes in the tensor of input IDs and generates a summary using a beam search algorithm. The max_length argument caps the length of the generated summary in tokens, and the num_beams argument sets the number of beams to use during beam search. Setting early_stopping=True ends the beam search as soon as num_beams complete candidate summaries have been found, rather than continuing to look for better ones.

`summary = tokenizer.decode(output[0], skip_special_tokens=True)`: This line decodes the generated summary using the tokenizer that has been previously loaded. The decode function of the tokenizer converts the sequence of integer IDs generated by the model back into a human-readable summary text. The skip_special_tokens=True argument specifies that any special tokens (such as start-of-sequence or end-of-sequence tokens) should be removed from the generated summary text. The summary is then assigned to the summary variable for further use.

In [23]:
input_ids = tokenizer.encode(ARTICLE, return_tensors='pt')
output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
summary = tokenizer.decode(output[0], skip_special_tokens=True)

# Search for stock news using Google and Yahoo Finance

This code defines a function search_for_stocks_news_urls that takes a single parameter, ticker, which is a string representing the stock ticker to search for. The function searches for news articles related to the stock ticker on Yahoo Finance using Google search.

First, the function creates an instance of the Chrome web driver, using ChromeDriverManager to install the appropriate driver. It also defines a headers dictionary with a browser-like User-Agent string; note that this dictionary is never passed to the driver, so it has no effect on the browser session.

Then, the function navigates to Google's news search (the tbm=nws parameter) with a query made of "yahoo finance" plus the stock ticker. It waits for the page to load using time.sleep and clicks the "I Agree" button to dismiss the consent prompt. The button is located with an absolute XPath, which is brittle and will need updating if Google changes the page layout. After waiting for the consent prompt to disappear, the function collects every a tag on the page with find_elements and reads each tag's href with get_attribute. It drops missing hrefs and keeps only the links whose href contains 'finance.yahoo.com'.

Finally, the function quits the Chrome driver and returns a list of URLs to news articles related to the stock ticker.

In [24]:
tickers = ['BTC', 'ETH', 'SOL']

In [87]:
def search_for_stocks_news_urls(ticker):
    """Search for stock news using Google and Yahoo Finance

    :param ticker: stock ticker
    :return: list of Yahoo Finance URLs found on the results page
    """
    # Imported here so that this cell is self-contained
    from selenium.webdriver.chrome.service import Service
    from selenium.common.exceptions import NoSuchElementException

    chrome_options = webdriver.ChromeOptions()
    # Selenium 4 expects the driver path in a Service object and the options in `options`
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

    # Note: this dictionary is defined but never passed to the driver
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

    # Navigate to the search page
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    driver.get(search_url)

    # Wait for the page to load and for the consent prompt to appear
    time.sleep(10)

    # Click the "I Agree" button if the consent prompt is shown
    try:
        consent_button = driver.find_element(By.XPATH, '//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button')
        consent_button.click()
        # Wait for the consent prompt to disappear
        time.sleep(5)
    except NoSuchElementException:
        # No consent prompt was found, so continue with the results page
        pass

    # Extract the links from the search results page
    atags = driver.find_elements(By.TAG_NAME, 'a')
    hrefs = [atag.get_attribute('href') for atag in atags]
    # Drop missing hrefs and links that do not point to Yahoo Finance
    hrefs = [href for href in hrefs if href is not None and 'finance.yahoo.com' in href]
    # Quit the driver
    driver.quit()

    # Return the links
    return hrefs
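The two string-handling steps in the function, building the search URL and filtering the links, can be tried without a browser. The sketch below uses only the standard library; `build_search_url` and `is_yahoo_finance_link` are hypothetical helper names, and the hostname comparison is a stricter variant of the notebook's substring test rather than the same check.

```python
from urllib.parse import urlencode, urlparse


def build_search_url(ticker):
    """Build the Google news search URL, percent-encoding the ticker."""
    query = urlencode({"q": f"yahoo finance {ticker}", "tbm": "nws"})
    return f"https://www.google.com/search?{query}"


def is_yahoo_finance_link(href):
    """Return True only when the link's host is exactly finance.yahoo.com."""
    if not href:
        return False
    return urlparse(href).hostname == "finance.yahoo.com"


print(build_search_url("BTC"))
print(build_search_url("^GSPC"))  # special characters are escaped

links = [
    "https://finance.yahoo.com/news/coinbase-rolls-out-new-layer-2-blockchain-base-140026538.html",
    "https://www.google.com/search?q=finance.yahoo.com",
    None,
]
print([link for link in links if is_yahoo_finance_link(link)])
```

The second link contains the text `finance.yahoo.com` but is hosted on Google, so a substring test would keep it while the hostname comparison drops it.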

This code defines a dictionary called raw_urls, which contains news article URLs for a given set of stock tickers.

The dictionary is created using a dictionary comprehension, which iterates through the tickers list and calls the search_for_stocks_news_urls function for each ticker. The resulting list of URLs is then assigned as the value for the corresponding ticker key in the raw_urls dictionary.

So, for example, if tickers contains ['BTC', 'ETH', 'SOL'], the resulting raw_urls dictionary will contain three keys ('BTC', 'ETH', and 'SOL') and their respective values will be lists of URLs containing news articles related to each ticker.

In [88]:
raw_urls = {ticker:search_for_stocks_news_urls(ticker) for ticker in tickers}



# Strip out unwanted URLs
This code defines a function strip_out_unwanted_urls that takes two parameters, urls and exclude_list. The function iterates over each URL in the urls list and keeps it only if it contains https:// and contains none of the words in exclude_list, which is checked with the any function and a generator expression. For each URL that passes, it extracts the URL with a regular expression, cuts it at the first & to drop trailing query parameters, and appends the result to a list val. Finally, the function returns the unique URLs as a list; because a set is used for deduplication, their order is not guaranteed.

The code then uses a dictionary comprehension to apply the strip_out_unwanted_urls function to each list of URLs in the raw_urls dictionary and creates a new dictionary cleaned_urls with the ticker symbols as keys and the cleaned URL lists as values. The exclude_list contains words that may appear in unwanted URLs such as maps, policies, preferences, accounts, and support. The purpose of this code is to clean up the URLs obtained from Google search results and filter out any URLs that are not relevant or may contain unnecessary parameters.

In [90]:
exclude_list = ['maps', 'policies', 'preferences', 'accounts', 'support']

In [91]:
def strip_out_unwanted_urls(urls, exclude_list):
    """Strip out unwanted URLs

    :param urls: list of urls
    :param exclude_list: list of words to exclude
    :return: list of unique cleaned urls
    """
    val = []
    for url in urls:
        if 'https://' in url and not any(exclude_word in url for exclude_word in exclude_list):
            res = re.findall(r'(https?://\S+)', url)[0].split('&')[0]
            val.append(res)
    return list(set(val))

In [93]:
cleaned_urls = {ticker:strip_out_unwanted_urls(urls, exclude_list) for ticker, urls in raw_urls.items()}
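A small worked example helps show what the filter keeps and drops. The sketch below applies the same test and the same regular expression to a handful of made-up links. The function name `strip_out_unwanted_urls_ordered` and the `ved` parameters in the sample are assumptions for illustration, and `dict.fromkeys` is used in place of `set` so that the result keeps first-seen order.

```python
import re


def strip_out_unwanted_urls_ordered(urls, exclude_list):
    """Same filtering as the notebook, but deduplicated in first-seen order."""
    kept = []
    for url in urls:
        if "https://" in url and not any(word in url for word in exclude_list):
            # Take the URL and cut it at the first '&'
            kept.append(re.findall(r"(https?://\S+)", url)[0].split("&")[0])
    return list(dict.fromkeys(kept))


sample_urls = [
    "https://finance.yahoo.com/news/block-q4-bitcoin-revenue-fell-213625764.html&ved=123",
    "https://finance.yahoo.com/news/block-q4-bitcoin-revenue-fell-213625764.html&ved=456",
    "https://support.google.com/websearch",
    "https://finance.yahoo.com/news/bitcoin-closes-out-best-january-since-2013-232215270.html",
]
exclude_words = ["maps", "policies", "preferences", "accounts", "support"]

cleaned = strip_out_unwanted_urls_ordered(sample_urls, exclude_words)
print(cleaned)
```

The first two links collapse into one entry once the text after `&` is removed, and the `support.google.com` link is dropped because it contains an excluded word.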

In [95]:
cleaned_urls

{'BTC': ['https://finance.yahoo.com/news/microstrategy-bought-more-than-8800-bitcoins-during-2022s-plunge-223922592.html',
  'https://finance.yahoo.com/news/coinbase-rolls-out-new-layer-2-blockchain-base-140026538.html',
  'https://finance.yahoo.com/news/block-q4-bitcoin-revenue-fell-213625764.html',
  'https://finance.yahoo.com/news/bitcoin-future-hinges-donations-got-194653381.html',
  'https://finance.yahoo.com/news/bitcoin-futures-cme-outpace-those-122730357.html',
  'https://finance.yahoo.com/news/economic-landings-economic-signals-bitcoin-25000-3-top-stories-from-this-week-120058611.html',
  'https://finance.yahoo.com/news/bitcoin-closes-out-best-january-since-2013-232215270.html',
  'https://finance.yahoo.com/news/wild-theory-price-bitcoin-being-110000608.html',
  'https://finance.yahoo.com/news/bit-brother-generated-over-15-133000172.html',
  'https://finance.yahoo.com/video/11-301-btc-added-global-152401432.html'],
 'ETH': ['https://finance.yahoo.com/news/sec-crackdown-ethereu

# Search and scrape cleaned URLs
This code defines a function scrap_and_process that takes a list of URLs, downloads the page behind each URL, extracts its paragraph text, and returns a list of articles.

The processing step keeps only the first 350 words of each article and joins them back into a single string. This string is then appended to the list of articles.

The code then applies this function to the list of cleaned URLs of each ticker using a dictionary comprehension, resulting in a dictionary whose keys are tickers and whose values are lists of article strings. The resulting dictionary is assigned to the variable articles.

In [100]:
def scrap_and_process(URLs):
    """Scrap and process URLs
    :param URLs: list of URLs
    :return: list of articles
    """
    ARTICLES = []
    for url in URLs:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        paragraphs = soup.find_all('p')
        text = [ paragraph.text for paragraph in paragraphs]
        words = ' '.join(text).split(' ')[:350]
        ARTICLE = ' '.join(words)
        ARTICLES.append(ARTICLE)
    return ARTICLES

In [101]:
articles = {ticker:scrap_and_process(urls) for ticker, urls in cleaned_urls.items()}
articles

{'BTC': ['MicroStrategy (MSTR) bought 8,813 bitcoins during 2022\'s major crypto dip, taking a $1.28 billion impairment loss for the year, the company said in its fourth quarter report on Thursday. As of December 31, the company had spent a total $3.9 billion to acquire 132,500 bitcoins purchased at an average price of $30,137. The company\'s carrying value of its bitcoin holdings at quarter\'s end was $1.8 billion. As of Thursday afternoon\'s bitcoin price, MicroStrategy\'s stash was worth $3.16\xa0billion. Shares of MicroStrategy were down as much as 3% in after hours trade on Thursday. Shares had gained more than 9% during normal trading hours on Thursday, and the stock has more than doubled so far this year. In the fourth quarter, the business intelligence software company reported revenue of $132.6 million, just beating the $131 million expected by analysts, according to data from Bloomberg. MicroStrategy reported a net loss for the fourth quarter of $249.7 million. "Our corporate

# Summarize all articles
This code defines a function named summarize that takes in a list of articles and returns a list of their corresponding summaries.

The function first initializes an empty list summaries to store the generated summaries. It then iterates over each article in the input articles. For each article, it encodes it into input ids using the encode method of the tokenizer object. It then generates a summary using the generate method of the model object with the input ids as the input. The generate method performs beam search with a beam width of 5 and a maximum length of 55 tokens. The early_stopping parameter is set to True so that the search ends as soon as 5 complete candidate summaries have been found.

The generated summary is then decoded using the decode method of the tokenizer object with skip_special_tokens=True, which removes special tokens such as the padding token <pad> and the end-of-sequence token </s> from the output. The decoded summary is appended to the summaries list.

Finally, the code defines a dictionary named summaries whose keys are the tickers and values are the corresponding summaries of their articles, obtained by calling the summarize function on the articles of each ticker using a dictionary comprehension.

In [102]:
def summarize(articles):
    """Summarize articles
    :param articles: list of articles
    :return: list of summaries
    """
    summaries = []
    for article in articles:
        input_ids = tokenizer.encode(article, return_tensors='pt')
        output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

In [103]:
summaries = {ticker:summarize(articles) for ticker, articles in articles.items()}
summaries

{'BTC': ['Business intelligence software company reported fourth quarter revenue. MicroStrategy sold bitcoin for first time in fourth quarter',
  'Base aims to improve data processing and transaction costs. Initial projects will focus on scaling and security',
  'Fintech firm reported fourth-quarter revenue of $1.83 billion. Cash App generated $35 million in bitcoin gross profit in the quarter',
  '‘This could get bad,’ developer warns of lack of funding. Lee says crypto asset prices are in middle of ‘bear market’',
  'Bitcoin has surged over 45% so far this year, outperforming traditional risk assets.',
  'Retail sales rise 3% in January, inflation eases. Weekly take on events in the stock and bond markets',
  'Largest cryptocurrency is closing out its best January in a decade. Speculators have returned to the market, driving gains',
  'University of Texas McCombs professor is ‘moral sleuth’. He and his doctoral candidate uncovered abuses in the crypto market',
  "BTB's Texas mining f

# Adding sentiment analysis
This code uses the pipeline function from the Hugging Face Transformers library to create a sentiment analysis pipeline. The pipeline is initialized with the sentiment-analysis task, which allows it to classify the sentiment of a given piece of text as positive or negative.

The pipeline function returns a callable object that can be used to perform the specified task on a given input. In this case, the sentiment object is used to analyze the summaries for each ticker. The sentiment function is called on each summary using a list comprehension, resulting in a list of sentiment analysis scores for each ticker.

The scores variable is a dictionary that stores the sentiment analysis scores for each ticker. The keys in the dictionary are the ticker symbols, and the values are lists of sentiment analysis scores. Each score is represented as a dictionary with two keys: label and score. The label key indicates whether the sentiment is positive or negative, and the score key indicates the confidence level of the prediction, with values ranging from 0 to 1.

In [104]:
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [106]:
scores = {ticker:sentiment(summaries[ticker]) for ticker in tickers}
scores

{'BTC': [{'label': 'NEGATIVE', 'score': 0.9180805087089539},
  {'label': 'POSITIVE', 'score': 0.8656954765319824},
  {'label': 'NEGATIVE', 'score': 0.9643605351448059},
  {'label': 'NEGATIVE', 'score': 0.9995821118354797},
  {'label': 'POSITIVE', 'score': 0.9835985898971558},
  {'label': 'POSITIVE', 'score': 0.9602307081222534},
  {'label': 'POSITIVE', 'score': 0.9985652565956116},
  {'label': 'POSITIVE', 'score': 0.5332544445991516},
  {'label': 'NEGATIVE', 'score': 0.9601682424545288},
  {'label': 'POSITIVE', 'score': 0.9979088306427002}],
 'ETH': [{'label': 'NEGATIVE', 'score': 0.9855942130088806},
  {'label': 'NEGATIVE', 'score': 0.9954777359962463},
  {'label': 'NEGATIVE', 'score': 0.9901455044746399},
  {'label': 'NEGATIVE', 'score': 0.9986133575439453},
  {'label': 'NEGATIVE', 'score': 0.9849787354469299},
  {'label': 'POSITIVE', 'score': 0.9824223518371582},
  {'label': 'NEGATIVE', 'score': 0.9562161564826965},
  {'label': 'POSITIVE', 'score': 0.6378508806228638},
  {'label': '

In [107]:
print(summaries['BTC'][0], scores['BTC'][0]['label'], scores['BTC'][0]['score'])

Business intelligence software company reported fourth quarter revenue. MicroStrategy sold bitcoin for first time in fourth quarter NEGATIVE 0.9180805087089539
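Because each result is a dictionary with a label and a confidence score, the two can be folded into a single signed number per summary. The sketch below is one possible way to do that and is not part of the notebook; the helper names `signed_score` and `mean_signed_score` are assumptions, and the sample values are the first three BTC results shown above.

```python
def signed_score(result):
    """Map a pipeline result to a signed confidence: positive above zero, negative below."""
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]


def mean_signed_score(results):
    """Average signed confidence over a list of results (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(signed_score(result) for result in results) / len(results)


btc_sample = [
    {"label": "NEGATIVE", "score": 0.9180805087089539},
    {"label": "POSITIVE", "score": 0.8656954765319824},
    {"label": "NEGATIVE", "score": 0.9643605351448059},
]

print([signed_score(result) for result in btc_sample])
print(mean_signed_score(btc_sample))
```

A mean below zero suggests that the sampled summaries lean negative, although averaging classifier confidences is only a rough heuristic.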


# Exporting results to CSV
This code defines a function create_output_array that takes three dictionaries as input: summaries, scores, and urls. It iterates over the tickers in the global tickers list and, for each ticker, walks through that ticker's summaries, scores, and URLs by index. For each position it builds a list containing the ticker, the summary, the sentiment label (POSITIVE or NEGATIVE), the sentiment score (between 0 and 1), and the URL, and appends that list to an output list, which the function returns at the end.

The result is the final output of the program: a list of lists with one row per article. The pairing relies on the three lists of a ticker having the same length and order.

In [108]:
def create_output_array(summaries, scores, urls):
    """Create output array
    :param summaries: dictionary of summaries
    :param scores: dictionary of scores
    :param urls: dictionary of urls
    :return: output array
    """
    output = []
    for ticker in tickers:
        for counter in range(len(summaries[ticker])):
            output_this = [
                ticker,
                summaries[ticker][counter],
                scores[ticker][counter]['label'],
                scores[ticker][counter]['score'],
                urls[ticker][counter]
            ]
            output.append(output_this)
    return output
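The index-based pairing can also be written with `zip`, with an explicit check that the three lists of a ticker line up. This is a sketch of an alternative, not the notebook's implementation. `create_output_rows` is a hypothetical name, and unlike the original it iterates over the keys of `summaries` instead of the global `tickers` list.

```python
def create_output_rows(summaries, scores, urls):
    """Pair summaries, scores and URLs per ticker, failing loudly on length mismatches."""
    rows = []
    for ticker in summaries:
        lengths = {len(summaries[ticker]), len(scores[ticker]), len(urls[ticker])}
        if len(lengths) != 1:
            raise ValueError(f"Mismatched list lengths for {ticker}")
        for summary, score, url in zip(summaries[ticker], scores[ticker], urls[ticker]):
            rows.append([ticker, summary, score["label"], score["score"], url])
    return rows


# One row taken from the notebook's own output
demo_rows = create_output_rows(
    {"BTC": ["Base aims to improve data processing and transaction costs. Initial projects will focus on scaling and security"]},
    {"BTC": [{"label": "POSITIVE", "score": 0.8656954765319824}]},
    {"BTC": ["https://finance.yahoo.com/news/coinbase-rolls-out-new-layer-2-blockchain-base-140026538.html"]},
)
print(demo_rows)
```

Raising on a mismatch avoids silently pairing a summary with the wrong URL if one article failed to download or summarize.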

In [109]:
final_output = create_output_array(summaries, scores, cleaned_urls)
final_output

[['BTC',
  'Business intelligence software company reported fourth quarter revenue. MicroStrategy sold bitcoin for first time in fourth quarter',
  'NEGATIVE',
  0.9180805087089539,
  'https://finance.yahoo.com/news/microstrategy-bought-more-than-8800-bitcoins-during-2022s-plunge-223922592.html'],
 ['BTC',
  'Base aims to improve data processing and transaction costs. Initial projects will focus on scaling and security',
  'POSITIVE',
  0.8656954765319824,
  'https://finance.yahoo.com/news/coinbase-rolls-out-new-layer-2-blockchain-base-140026538.html'],
 ['BTC',
  'Fintech firm reported fourth-quarter revenue of $1.83 billion. Cash App generated $35 million in bitcoin gross profit in the quarter',
  'NEGATIVE',
  0.9643605351448059,
  'https://finance.yahoo.com/news/block-q4-bitcoin-revenue-fell-213625764.html'],
 ['BTC',
  '‘This could get bad,’ developer warns of lack of funding. Lee says crypto asset prices are in middle of ‘bear market’',
  'NEGATIVE',
  0.9995821118354797,
  'http

In [112]:
final_output.insert(0, ['Ticker', 'Summary', 'Sentiment', 'Score', 'URL'])

This code uses the csv module from the Python standard library to write the final output, including the header row inserted above, to a CSV file named output.csv.

In [113]:
import csv
with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerows(final_output)
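Summaries often contain commas, so it is worth checking how `csv.QUOTE_MINIMAL` handles them. The sketch below writes one row from the notebook's output to an in-memory buffer instead of a file (an assumption made so that the example leaves nothing on disk) and reads it back.

```python
import csv
import io

rows = [
    ["Ticker", "Summary", "Sentiment", "Score", "URL"],
    [
        "BTC",
        "Retail sales rise 3% in January, inflation eases. Weekly take on events in the stock and bond markets",
        "POSITIVE",
        0.9602307081222534,
        "https://finance.yahoo.com/news/economic-landings-economic-signals-bitcoin-25000-3-top-stories-from-this-week-120058611.html",
    ],
]

# Write to an in-memory text buffer with the same writer settings as the notebook
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to confirm that the comma inside the summary survived
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(read_back[1][1])
print(type(read_back[1][3]).__name__)
```

Only the summary field is wrapped in quotes because it contains the delimiter. When the file is read back, every field, including the score, comes back as a string and has to be converted explicitly.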