# Extracting News Article Descriptions from Binance

**\[Last Updated: Sep 12, 2024]**

Stay informed about the latest Bitcoin news on **Binance** by visiting this link: https://www.binance.com/en/square/news/bitcoin+news.

Our objective is to extract and organize descriptions of Bitcoin-related news articles from [Binance](https://www.binance.com/en/square/news/bitcoin+news). Each article's publication date can be found within its URL. To achieve this, we will scrape both the article URLs and their descriptions. Using the dates embedded in the URLs, we will map each date to its corresponding article description, ensuring they remain in the correct order for accurate tracking.

After obtaining the correct mapping between the article descriptions and their publication dates, we will generate a summary of each article using a [pre-trained transformer model](https://huggingface.co/Mr-Vicky-01/Bart-Finetuned-conversational-summarization) for conversational summarization. This model is designed to distill the key points of the articles, providing concise and informative summaries for easy reference.

In [1]:
%load_ext jupyternotify
%config IPCompleter.greedy=True


In [2]:
import time
import numpy as np
from selenium import webdriver
from chromedriver_py import binary_path
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

## Part One: Extracting News Descriptions and Their Publication Dates
In this section, we use Selenium, a browser automation tool, to extract news descriptions and their publication dates. Selenium drives a real browser, so we can simulate user interactions such as scrolling and then read the data from the rendered page.

In [3]:
options = Options()
options.add_argument('--disable-dev-shm-usage')
# options.add_argument("--headless")
svc = webdriver.ChromeService(executable_path=binary_path)
driver = webdriver.Chrome(service=svc, options = options)

driver.get("https://www.binance.com/en/square/news/bitcoin+news")


scroll_pause_time = 2  # Adjust as needed
index_page = 0

time.sleep(scroll_pause_time)
previous_height = driver.execute_script('return document.body.scrollHeight')

while index_page < 20: # Adjust as needed

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(scroll_pause_time)
    new_height = driver.execute_script("return document.body.scrollHeight")

    index_page += 1
    
    if new_height == previous_height:
        print("Reached the end of the page.")
        break

    previous_height = new_height

print("End of scrolling")

Reached the end of the page.
End of scrolling
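The scrolling loop above can be wrapped in a small reusable function. The sketch below is one possible way to do it, not the notebook's actual code: `scroll_until_stable` and `FakeDriver` are hypothetical names introduced for illustration, and the fake driver only mimics the `execute_script` calls used above so the logic can be checked without opening a browser.

```python
import time


def scroll_until_stable(driver, max_scrolls=20, pause=2.0, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    previous_height = driver.execute_script("return document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded articles time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == previous_height:
            return scrolls  # nothing new was loaded
        previous_height = new_height
    return max_scrolls


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: the page grows, then stops."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._position]
        # A scroll request: advance to the next height if there is one
        self._position = min(self._position + 1, len(self._heights) - 1)
        return None


fake = FakeDriver([1000, 2000, 3000])
print(scroll_until_stable(fake, pause=0, sleep=lambda _: None))  # prints 3
```

With a real session, the same function would be called as `scroll_until_stable(driver)`. A fixed pause is a simple heuristic: if the page loads more slowly than `pause` seconds, the loop may stop early.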


In [None]:
from selenium.common.exceptions import NoSuchElementException

elements = driver.find_elements(By.XPATH, "//div[@class='css-vurnku']//a")
urls = np.array([])
descriptions = np.array([])

for element in elements:
    try:
        url = element.get_attribute('href')
        urls = np.append(urls, url)
    except NoSuchElementException:
        print("Unable to find URL for element")

    try:
        description_element = element.find_element(By.XPATH, ".//div[@class='css-10lrpzu']")
        description = description_element.text
        descriptions = np.append(descriptions, description)
    except NoSuchElementException:
        print("Unable to find description for element")

In [5]:
print(urls.shape, descriptions.shape)

(520,) (269,)
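The two arrays have different lengths because every link contributes a URL, while only links that contain a description block contribute a description. One way to avoid relying on positions matching after the fact is to collect each URL together with its description in a single pass. The sketch below assumes that approach; `collect_posts` and `FakeNode` are hypothetical names, and the string `"xpath"` is the value of Selenium's `By.XPATH`, used directly so the example runs without a browser.

```python
def collect_posts(anchors, post_prefix, description_xpath):
    """Return (url, description) pairs for post links that have a description."""
    pairs = []
    for anchor in anchors:
        url = anchor.get_attribute("href")
        if not url or not url.startswith(post_prefix):
            continue  # skip links that are not posts
        # find_elements returns an empty list instead of raising when nothing matches
        matches = anchor.find_elements("xpath", description_xpath)
        if not matches:
            continue  # a post link without a description block
        pairs.append((url, matches[0].text))
    return pairs


class FakeNode:
    """Hypothetical stand-in for a Selenium WebElement."""

    def __init__(self, href=None, text="", children=()):
        self._href = href
        self.text = text
        self._children = list(children)

    def get_attribute(self, name):
        return self._href if name == "href" else None

    def find_elements(self, by, value):
        return self._children


prefix = "https://www.binance.com/en/square/post"
anchors = [
    FakeNode(href=prefix + "/2024-09-12-a-1", children=[FakeNode(text="First story")]),
    FakeNode(href="https://www.binance.com/en/square/profile/x"),
    FakeNode(href=prefix + "/2024-09-12-b-2"),  # post link with no description
]
pairs = collect_posts(anchors, prefix, ".//div[@class='css-10lrpzu']")
print(pairs)
```

Because each pair is built from the same element, a URL can never be matched with a description that belongs to a different link.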


### Removing unnecessary URLs

In [6]:
urls = np.array([url for url in urls if url.startswith("https://www.binance.com/en/square/post")])
print(urls.shape)

(269,)


In [7]:
for index in range(urls.shape[0]):
    if index % 200 == 0:
        print(urls[index])
        print(descriptions[index])

https://www.binance.com/en/square/post/2024-09-12-el-salvador-s-bitcoin-adoption-seen-as-pr-move-13456394650178
According to Cointelegraph, the TIME Magazine reporter who conducted one of the first foreign correspondent interviews with El Salvador President Nayib Bukele in three years has suggested that his push for Bitcoin as legal tender was more about image than substance. In an interview with Crooked Media’s Pod Save the World released on Sept. 11, Vera Bergengruen reported that Bukele’s advisers referred to Bitcoin (BTC) adoption in El Salvador as a “great rebranding” and “complete PR [public relations] move.” Bukele briefly led his family’s PR firm before moving into politics, becoming the Mayor of Nuevo Cuscatlán, Mayor of San Salvador, and president of El Salvador. “I think the most important thing [...] is his past as a publicist,” said Bergengruen, referring to Bukele. “It’s important to understand from Bitcoin to the war on the gangs, everything he does he’s kind of image fi

### As the sample above shows, the filtered URLs line up with their respective descriptions.

In [8]:
def extract_title_and_date_from_url(url):
    parts = url.split("/")
    
    title_part = parts[-1]
    
    title_with_date = ' '.join(title_part.split('-')).title()
    
    date = ' '.join(title_with_date.split()[:3])
    
    title_words = title_with_date.split()
    if title_words and title_words[-1].isdigit():
        title = ' '.join(title_words[:-1])
    else:
        title = title_with_date
    
    title = ' '.join(title.split()[3:])
    
    return title.strip(), date
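As an illustration of the same idea, the sketch below parses the slug into a `datetime.date` instead of a string, so an unexpected URL fails loudly rather than producing a wrong date. It is an alternative under stated assumptions, not the function used in this notebook: `parse_post_url` is a hypothetical name, and it assumes the slug always starts with `YYYY-MM-DD` and may end with a numeric post ID.

```python
from datetime import date


def parse_post_url(url):
    """Split a post URL into (publication date, title words from the slug)."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    year, month, day, *words = slug.split("-")
    # Drop the trailing numeric post ID, if present
    if words and words[-1].isdigit():
        words = words[:-1]
    # date() raises ValueError if the first three parts are not a valid date
    return date(int(year), int(month), int(day)), " ".join(words)


sample_url = (
    "https://www.binance.com/en/square/post/"
    "2024-09-12-el-salvador-s-bitcoin-adoption-seen-as-pr-move-13456394650178"
)
published, title = parse_post_url(sample_url)
print(published.isoformat())  # prints 2024-09-12
print(title)  # prints el salvador s bitcoin adoption seen as pr move
```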

In [None]:
dates = np.array([])

for url in urls:
    title, date = extract_title_and_date_from_url(url)
    dates = np.append(dates, date)

# Uncomment the lines below to view the dates
# print("Extracted Date:")
# print(dates)

## Part Two: Summarizing the Descriptions

In [11]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Mr-Vicky-01/Bart-Finetuned-conversational-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("Mr-Vicky-01/Bart-Finetuned-conversational-summarization")

In [None]:
def generate_summary(text):
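    # Tokenize the text, truncating anything beyond 1024 tokens
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    # Pass the attention mask along with the input IDs; decode greedily, at most 80 new tokens
    summary_ids = model.generate(**inputs, max_new_tokens=80, do_sample=False)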
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

short_descriptions = np.array([])

for i, description in enumerate(descriptions):
    short_description = generate_summary(description)
    ## Uncomment the line below to track the progress of summarized descriptions
    # print(f'Description N°{i+1}')
    short_descriptions = np.append(short_descriptions, short_description)
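The loop above summarizes one description at a time and grows a NumPy array with `np.append`, which copies the array on every call. A possible alternative is to collect results in a plain list and process the texts in batches. The sketch below shows only that bookkeeping: `batched`, `summarize_all`, and `first_sentence` are hypothetical names, and `first_sentence` is a toy stand-in for a real batch summarizer (for example, a function wrapping the tokenizer and model loaded above).

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_all(texts, summarize_batch, batch_size=8):
    """Apply `summarize_batch` (a list -> list callable) batch by batch."""
    summaries = []
    for batch in batched(list(texts), batch_size):
        result = list(summarize_batch(batch))
        if len(result) != len(batch):
            raise ValueError("summarizer returned a different number of items")
        summaries.extend(result)
    return summaries


# Toy stand-in summarizer: keeps the text before the first ". " of each input
def first_sentence(batch):
    return [text.split(". ")[0] for text in batch]


demo = [
    "Bitcoin fell. Traders reacted.",
    "ETF inflows rose. Volume doubled.",
    "Fees dropped.",
]
print(summarize_all(demo, first_sentence, batch_size=2))
```

The length check keeps the summaries aligned with their inputs, which matters here because the summaries are later placed next to the dates and descriptions by position.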

In [19]:
print(short_descriptions.shape)

(269,)


In [14]:
import pandas as pd

df = pd.DataFrame({
    'Date': dates,
    'Description' : descriptions,
    'Short Description' : short_descriptions 
})

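df.set_index('Date', inplace=True)
# Parse the "YYYY MM DD" strings first so that sorting is chronological rather than lexicographic
df.index = pd.to_datetime(df.index, format='%Y %m %d')
df = df.sort_index(ascending=False)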

In [15]:
df

| Date | Description | Short Description |
|---|---|---|
| 2024-09-12 | According to Cointelegraph, the TIME Magazine ... | Time Magazine reporter Vera Bergengruen believ... |
| 2024-09-12 | On Sep 12, 2024, 18:53 PM(UTC). According to B... | Bitcoin has dropped below 58,000 USDT and is n... |
| 2024-09-12 | According to Odaily, data from mempool.space i... | According to data from mempool.space, transact... |
| 2024-09-12 | According to BlockBeats, on September 12, Arkh... | Grayscale addresses transferred 763.785 BTC wo... |
| 2024-09-12 | According to BlockBeats, on September 12, QCP ... | The recently announced U.S. Consumer Price Ind... |
| ... | ... | ... |
| 2024-08-28 | According to Odaily, the Bitcoin spot ETF in t... | The Bitcoin spot ETF in the United States expe... |
| 2024-08-28 | According to Odaily, monitoring by Trader T re... | BlackRock's IBIT experienced no fund inflows o... |
| 2024-08-28 | According to BlockBeats, on August 28, Osprey ... | The sponsor of Osprey Bitcoin Trust has reache... |
| 2024-08-28 | According to BlockBeats, on August 28, Bitcoin... | Marathon Digital sold $300 million in converti... |


In [16]:
df.to_csv("../data/binance_bitcoin_news.csv")

In [17]:
driver.quit()