# Extracting News Article Descriptions from Yahoo Finance

**\[Last Updated: Sep 13, 2024]**

Stay informed about the latest Bitcoin news on **Yahoo Finance** by visiting this link: https://finance.yahoo.com/quote/BTC-USD/.

Our goal is to gather and structure the descriptions of [Yahoo Finance](https://finance.yahoo.com/quote/BTC-USD/) news articles. Since the publication dates aren’t immediately visible, we first scrape each article's URL and description, then visit each URL to retrieve the publication date from within the article. Finally, we match each extracted date to its article description, so that every description stays associated with the correct date and the articles can be ordered chronologically.
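
One way to keep each description tied to its own date is to carry both in a single record keyed by URL, rather than in separate parallel arrays. The sketch below only illustrates that idea: the `Article` class, the `attach_dates` helper, and the sample URLs are hypothetical and are not part of this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """One scraped news item (hypothetical container, not part of the notebook)."""
    url: str
    description: str
    date: Optional[str] = None  # filled in later, after visiting the article page


def attach_dates(articles, dates_by_url):
    """Set each article's date by looking it up under the article's own URL."""
    for article in articles:
        article.date = dates_by_url.get(article.url)
    return articles


# Made-up sample data standing in for the scraped values
articles = [
    Article("https://example.com/a", "First description"),
    Article("https://example.com/b", "Second description"),
]
dates_by_url = {
    "https://example.com/b": "2024-09-11",
    "https://example.com/a": "2024-09-12",
}

attach_dates(articles, dates_by_url)
# Sorting whole records keeps every description next to its own date
newest_first = sorted(articles, key=lambda a: a.date or "", reverse=True)
```

Because the date is looked up by URL, the order in which the article pages are visited does not affect which date ends up on which description.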

After obtaining the correct mapping between the article descriptions and their publication dates, we will generate a summary of each article using a [pre-trained transformer model](https://huggingface.co/Mr-Vicky-01/Bart-Finetuned-conversational-summarization). This model is designed to distill the key points of the articles, providing concise and informative summaries for easy reference.

In [1]:
%load_ext jupyternotify
%config IPCompleter.greedy=True


In [None]:
import time
import numpy as np
from selenium import webdriver
from chromedriver_py import binary_path
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

## Part One: Extracting News Descriptions and Their Publication Dates
In this section, we use Selenium, a browser automation tool commonly used for web scraping, to extract news descriptions and their publication dates. Selenium drives a real browser, so it can load dynamic pages, simulate user interactions such as scrolling, and read data from the rendered page.

In [3]:
options = Options()

svc = webdriver.ChromeService(executable_path=binary_path)

# Pass the options object so that any flags added to it take effect
driver = webdriver.Chrome(service=svc, options=options)
driver.get("https://finance.yahoo.com/quote/BTC-USD/news")

driver.execute_script("window.stop();")

scroll_pause_time = 2  # Adjust as needed
index_page = 0

previous_height = driver.execute_script('return document.body.scrollHeight')

while index_page < 100: # Adjust as needed

    print(f"index scrolling: {index_page}")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(scroll_pause_time)
    new_height = driver.execute_script("return document.body.scrollHeight")
    index_page += 1
    
    if new_height == previous_height:
        print("Reached the end of the page.")
        break

    previous_height = new_height

index scrolling: 0
index scrolling: 1
index scrolling: 2
index scrolling: 3
index scrolling: 4
index scrolling: 5
index scrolling: 6
index scrolling: 7
index scrolling: 8
index scrolling: 9
Reached the end of the page.
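
The stopping rule of the scrolling loop above can be isolated and checked without a browser. The following is a minimal sketch under that assumption: `scroll_until_stable` and its parameters are hypothetical names, and the two callables stand in for the `driver.execute_script` calls used in the cell.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=2.0, max_rounds=100):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    previous_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_to_bottom()
        time.sleep(pause)  # give lazily loaded items time to appear
        rounds += 1
        new_height = get_height()
        if new_height == previous_height:
            break  # nothing new was loaded, so we are at the end
        previous_height = new_height
    return rounds


# Stand-in for a browser: the height grows twice, then stays constant
heights = iter([1000, 2000, 3000, 3000])
rounds = scroll_until_stable(lambda: next(heights), lambda: None, pause=0)
```

With Selenium, `get_height` would wrap `driver.execute_script("return document.body.scrollHeight")` and `scroll_to_bottom` would wrap `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`.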


### Note: Keep in mind that the XPaths of HTML elements below may change over time, so it's important to update them as needed.
Feel free to replace the selectors with their latest versions when necessary.
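
Since the selectors are likely to drift, one option is to keep several candidate XPaths per field and take the first that matches. This is only a sketch: `find_first`, `DESCRIPTION_XPATHS`, and the looser fallback selector are assumptions for illustration, and `FakeElement` exists solely so the helper can be exercised without a browser.

```python
def find_first(element, xpaths):
    """Return the first match for any XPath in xpaths, or None if none match.

    find_elements (plural) returns an empty list when nothing matches,
    so no exception handling is needed here.
    """
    for xpath in xpaths:
        matches = element.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
        if matches:
            return matches[0]
    return None


# Candidate selectors, most specific first; the second entry is a hypothetical fallback
DESCRIPTION_XPATHS = [
    ".//p[contains(@class,'clamp  yf-1e4au4k')]",
    ".//p[contains(@class,'clamp')]",
]


class FakeElement:
    """Minimal stand-in for a Selenium WebElement, used only to exercise find_first."""

    def __init__(self, known):
        self.known = known  # maps an XPath string to a list of results

    def find_elements(self, by, value):
        return self.known.get(value, [])


item = FakeElement({".//p[contains(@class,'clamp')]": ["fallback paragraph"]})
match = find_first(item, DESCRIPTION_XPATHS)
missing = find_first(FakeElement({}), DESCRIPTION_XPATHS)
```

Returning `None` instead of raising lets the caller decide whether to skip an item or record an empty description, which keeps the parallel arrays the same length.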

In [4]:
titles = np.array([])
descriptions = np.array([])
urls = np.array([])
dates = np.array([])

elements = driver.find_elements(By.XPATH, "//li[@class='stream-item  yf-7rcxn']") # Verify and update the XPath if this selector becomes outdated

for element in elements:
    description = element.find_element(By.XPATH, ".//p[contains(@class,'clamp  yf-1e4au4k')]").text # Verify and update the XPath if this selector becomes outdated
    descriptions = np.append(descriptions, description)

    url = element.find_element(By.XPATH, ".//a[contains(@class,'subtle-link fin-size-small titles noUnderline yf-13p9sh2')]").get_attribute('href') # Verify and update the XPath if this selector becomes outdated
    urls = np.append(urls, url)   

In [5]:
print(descriptions.shape, urls.shape)

(200,) (200,)


### In the following cells, we visit each URL to extract the publication date of the corresponding article, then convert the collected dates to datetime objects.

In [None]:
for index_url, url in enumerate(urls):
    ## Uncomment the line below to track the progress of accessed URLs
    # print(f"URL N°{index_url} : {url}")
    driver.get(url)
    date = driver.find_element(By.XPATH, "//time").text
    dates = np.append(dates, date)
    ## Uncomment the line below to check the extracted dates
    # print(date)

#### The extracted dates come in several different formats, so we standardize them into a single format (`YYYY-MM-DD`) to keep all entries consistent.

In [None]:
from datetime import datetime

def format_dates(dates):
    formatted_dates = np.array([])
    date_formats = [
        "%a, %B %d, %Y at %I:%M %p GMT+1",   # Full month name, time, and a literal "GMT+1" suffix
        "%a, %B %d, %Y, %I:%M %p",           # Full month name and time, no GMT suffix
        "%a, %b %d, %Y, %I:%M %p",           # Short month name and time
        "%a, %b %d, %Y",                     # Short month name, no time
        "%B %d, %Y"                          # Full month name, no day of the week or time
    ]

    for date_str in dates:
        parsed = False
        for date_format in date_formats:
            try:
                date_object = datetime.strptime(date_str, date_format)
                formatted_dates = np.append(formatted_dates, date_object.strftime("%Y-%m-%d"))
                parsed = True
                break
            except ValueError:
                continue
        
        if not parsed:
            print(f"Error parsing date '{date_str}'")
            formatted_dates = np.append(formatted_dates, 'Invalid Date')

    return formatted_dates
    
formatted_dates_array = format_dates(dates)
## Uncomment the line below to view the array of the formatted dates
# print(formatted_dates_array)
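
The first format above matches only a literal `GMT+1` suffix, so the same page scraped from another time zone would fail to parse. A possible workaround, sketched here under the assumption that the suffix always has the form `GMT+N` or `GMT-N` and that month and weekday names are in English, is to strip the offset before parsing. `GMT_SUFFIX` and `parse_with_any_offset` are hypothetical names.

```python
import re
from datetime import datetime

# Matches a trailing offset such as " GMT+1" or " GMT-5"
GMT_SUFFIX = re.compile(r"\s*GMT[+-]\d{1,2}$")


def parse_with_any_offset(date_str):
    """Drop a trailing GMT offset, then parse the remaining text.

    The offset is discarded, not applied, so the result is naive local time.
    """
    cleaned = GMT_SUFFIX.sub("", date_str.strip())
    return datetime.strptime(cleaned, "%a, %B %d, %Y at %I:%M %p")


parsed = parse_with_any_offset("Thu, September 12, 2024 at 3:45 PM GMT+2")
iso_date = parsed.strftime("%Y-%m-%d")
```

Discarding the offset is acceptable here only because the notebook keeps the calendar date and drops the time of day.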

In [8]:
formatted_dates_array.shape

(200,)

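#### Convert the dates to datetime objects and sort them from newest to oldest.

In [9]:
# Note: np.datetime64 raises ValueError on any 'Invalid Date' placeholder left by format_dates
dates_datetime = np.array([np.datetime64(date) for date in formatted_dates_array])

# sorted_indices reorders the dates only; descriptions and urls stay in scrape order
sorted_indices = np.argsort(dates_datetime)[::-1]
sorted_dates = np.array(formatted_dates_array)[sorted_indices]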
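
When dates are reordered with `np.argsort`, any array that runs parallel to them has to be reordered with the same index array, otherwise rows no longer line up. The sketch below shows this on made-up sample data; the `sample_*` names are illustrative and not part of the notebook.

```python
import numpy as np

# Made-up sample data in scrape order
sample_dates = np.array(["2024-09-10", "2024-09-12", "2024-09-11"], dtype="datetime64[D]")
sample_descriptions = np.array(["oldest", "newest", "middle"])

# One index array, applied to every parallel array
order = np.argsort(sample_dates)[::-1]
dates_sorted = sample_dates[order]
descriptions_sorted = sample_descriptions[order]
```

After indexing both arrays with `order`, position `i` in `dates_sorted` and position `i` in `descriptions_sorted` still refer to the same article.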

## Part Two: Summarizing the Descriptions

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Mr-Vicky-01/Bart-Finetuned-conversational-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("Mr-Vicky-01/Bart-Finetuned-conversational-summarization")

In [None]:
def generate_summary(text):
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs['input_ids'], max_new_tokens=80, do_sample=False)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

short_descriptions = np.array([])

for i, description in enumerate(descriptions):
    short_description = generate_summary(description)
    ## Uncomment the line below to track the progress of summarized descriptions
    # print(f'Description N°{i+1}')
    short_descriptions = np.append(short_descriptions, short_description)

In [13]:
print(short_descriptions.shape)

(200,)


In [14]:
import pandas as pd

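df = pd.DataFrame({
    # Use the dates in scrape order (not sorted_dates): they line up with the
    # descriptions, so each row keeps the date of its own article.
    # Unparseable entries such as 'Invalid Date' become NaT.
    'Date' : pd.to_datetime(formatted_dates_array, errors='coerce'),
    'Description' : descriptions,
    'Short Description' : short_descriptions 
})

# Sort whole rows by date, newest first, so dates and descriptions move together
df.set_index('Date', inplace=True)
df = df.sort_index(ascending=False)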

In [15]:
df

| Date | Description | Short Description |
|---|---|---|
| 2024-09-12 | Digital-trading platform eToro USA agreed to p... | eToro USA has agreed to limit its crypto offe... |
| 2024-09-12 | High-leverage liquidity in bitcoin is concentr... | in bitcoin is concentrated at around $58,500,... |
| 2024-09-11 | The Kamala Harris-Donald Trump debate sent rip... | The Kamala Harris-Donald Trump debate sent rip... |
| 2024-09-11 | US stocks (^DJI, ^IXIC, ^GSPC) were trading lo... | US stocks are trading lower on Wednesday morni... |
| 2024-09-11 | Trump trades slumped following Tuesday's presi... | Trump trades slumped following Tuesday's presi... |
| ... | ... | ... |
| 2024-07-22 | Crypto traders are once again betting on the v... | KAMA hit an all-time high of 2.4 cents in the ... |
| 2024-07-22 | The spot ether ETFs are set to launch as soon ... | The spot ether ETFs are set to launch as soon ... |
| 2024-07-21 | Trump's social media platform company isn’t th... | stock has risen higher as investors have rais... |
| 2024-07-19 | Hugh Hendry, famed former global macro hedge f... | Hugh Hendry is a former global macro hedge fun... |


In [16]:
df.to_csv("../data/yahoo_bitcoin_news.csv")

In [17]:
driver.quit()