# Creating and Preparing Datasets for Training


**[Last Updated: Sep 16, 2024]**

This notebook is designed to build/prepare datasets for predicting Bitcoin price and movement. It is divided into three key parts:

- I. This part continues [our other GitHub repository](https://github.com/Bitcoin-Price-Prediction-Experiments/Bitcoin-News-Scraper), where we built our datasets from scratch by extracting them from different sources. The goal now is to combine the two resulting datasets into a single one and assign a sentiment score to each article using a fine-tuned sentiment analysis model. The timeline of this dataset will span from **2023-05-17** to **2024-05-08**.

- II. In the second part, we will prepare a pre-built dataset from [Oliviervha](https://www.kaggle.com/datasets/oliviervha/crypto-news) which is similar to ours and already contains sentiment scores for news articles from **2021-10-12** to **2023-12-19**. First, we will filter it to include only articles related to Bitcoin, then apply a more accurate sentiment analysis model.
**Note**: This dataset will be merged with the one from Part 1, resulting in a larger dataset spanning from **2021-10-12** to **2024-05-08**.

- III. Finally, we will build a dataset from the files we [downloaded](https://github.com/Bitcoin-Price-Prediction-Experiments/Bitcoin-Transaction-History-Downloader) from Bitget. The data will be organized into 5-hour intervals to align with market movement analysis.
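
As a rough illustration of the 5-hour grouping mentioned in Part III, the sketch below buckets a trade history into 5-hour bars with `resample`. The column names `price` and `size`, the helper `to_five_hour_bars`, and the sample values are assumptions made for illustration; the actual Bitget files may be laid out differently.

```python
import pandas as pd


def to_five_hour_bars(trades: pd.DataFrame) -> pd.DataFrame:
    """Aggregate trades (hypothetical 'price' and 'size' columns) into 5-hour bars."""
    # Open/high/low/close of the traded price inside each 5-hour bucket
    ohlc = trades["price"].resample("5h").ohlc()
    # Total traded size inside each bucket
    volume = trades["size"].resample("5h").sum().rename("volume")
    # Drop buckets that contain no trades at all
    return pd.concat([ohlc, volume], axis=1).dropna(subset=["open"])


# Made-up trades, indexed by timestamp
trades = pd.DataFrame(
    {"price": [100.0, 110.0, 90.0, 105.0], "size": [1.0, 2.0, 0.5, 1.5]},
    index=pd.to_datetime(
        [
            "2024-01-01 00:10",
            "2024-01-01 03:00",
            "2024-01-01 04:59",
            "2024-01-01 06:00",
        ]
    ),
)

bars = to_five_hour_bars(trades)
print(bars)
```

Because 24 is not a multiple of 5, the buckets drift relative to midnight from one day to the next; by default `resample` anchors them at midnight of the first day in the data.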

In [1]:
%reload_ext jupyternotify
%config IPCompleter.greedy=True


In [2]:
import pandas as pd
import numpy as np
import os

In [3]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

In [4]:
from transformers import pipeline

## Part I - Building Our Dataset from Scratch: The Next Steps

### I.1 - Combining the Two Datasets

In [5]:
binance_df = pd.read_csv('../lib/Bitcoin-News-Scraper/data/binance_bitcoin_news.csv', index_col='Date', parse_dates = True)
yahoo_df = pd.read_csv('../lib/Bitcoin-News-Scraper/data/yahoo_bitcoin_news.csv', index_col='Date', parse_dates = True)

In [6]:
articles_23_24 = pd.concat([binance_df, yahoo_df]).sort_index(ascending=False)

In [7]:
articles_23_24

Date,Description,Short Description
2024-09-12,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...
2024-09-12,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...
2024-09-12,"On Sep 12, 2024, 18:53 PM(UTC). According to B...","Bitcoin has dropped below 58,000 USDT and is n..."
2024-09-12,Digital-trading platform eToro USA agreed to p...,eToro USA has agreed to limit its crypto offe...
2024-09-12,"On Sep 12, 2024, 02:00 AM (UTC), according to ...","Bitcoin has crossed the 58,000 USDT benchmark ..."
...,...,...
2024-07-22,Traders could be forgiven for wanting to cash ...,Bitcoin has risen more than 20% to the current...
2024-07-22,Bitcoin financial services firm Swan Bitcoin p...,Swan Bitcoin has discontinued its managed mini...
2024-07-21,Trump's social media platform company isn’t th...,stock has risen higher as investors have rais...
2024-07-19,"Hugh Hendry, famed former global macro hedge f...",Hugh Hendry is a former global macro hedge fun...


### I.2 - Determining Sentiment Scores for Articles

For this section, we use a sentiment analysis pipeline with the [model from Manuel Romero](https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis), a DistilRoBERTa model fine-tuned on financial news. It classifies each article as positive, neutral, or negative and returns a confidence score for that label.

**Note**: It’s crucial to use the appropriate fine-tuned model for your specific needs. In our case, we have chosen a model fine-tuned for financial news to ensure accurate sentiment classification.
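
To make the scoring convention concrete, here is a minimal sketch of how a label and its confidence can be folded into a single signed number. The helper name `signed_score` and the example results are illustrative assumptions; only the `label`/`score` dictionary shape comes from the pipeline output.

```python
# Sign attached to each label; unknown labels fall back to 0
LABEL_SIGN = {"negative": -1, "neutral": 0, "positive": 1}


def signed_score(result: dict) -> float:
    """Multiply the model's confidence by the sign of its predicted label."""
    return result["score"] * LABEL_SIGN.get(result["label"], 0)


# Made-up results in the shape a text-classification pipeline returns
examples = [
    {"label": "negative", "score": 0.99},
    {"label": "neutral", "score": 0.98},
    {"label": "positive", "score": 0.97},
]

scores = [signed_score(r) for r in examples]
print(scores)
```

One consequence of this convention is that every neutral article maps to exactly 0, however confident the model was, so the confidence of neutral predictions is discarded.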

In [8]:
sentiments_pipe = pipeline("text-classification", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")

In [9]:
label_sign_map = {
    'negative': -1,
    'positive': 1,
    'neutral': 0
}

# Collect scores in a list; np.append would copy the whole array on every iteration
sentiment_scores = []

for index, text in enumerate(articles_23_24['Short Description']):
    data = sentiments_pipe(text)
    label = data[0]['label']
    score = data[0]['score']

    # Sign the model's confidence: negative -> -1, neutral -> 0, positive -> +1
    sign = label_sign_map.get(label, 0)
    sentiment_scores.append(score * sign)

    # Uncomment the lines below to track the analysis
    # if index % 100 == 0:
    #     print(f"index N° {index}")

sentiments_array = np.array(sentiment_scores)

In [10]:
articles_23_24['Accurate Sentiments'] = sentiments_array
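
Scoring one article per call can be slow on larger datasets. A possible alternative, sketched below, passes all texts to the classifier in a single call so it can batch them internally. The helper `score_texts` and the stand-in `fake_classifier` are hypothetical; `batch_size` and `truncation` are keyword arguments accepted by the `transformers` text-classification pipeline, and the batch size of 32 is an arbitrary choice.

```python
from typing import Callable, Iterable, List

LABEL_SIGN = {"negative": -1, "neutral": 0, "positive": 1}


def score_texts(
    classifier: Callable, texts: Iterable[str], batch_size: int = 32
) -> List[float]:
    """Score all texts in one classifier call and return signed confidences."""
    # A list input lets the pipeline batch internally; truncation=True shortens
    # inputs that exceed the model's maximum sequence length.
    results = classifier(list(texts), batch_size=batch_size, truncation=True)
    return [r["score"] * LABEL_SIGN.get(r["label"], 0) for r in results]


# Stand-in classifier so the sketch runs without downloading a model
def fake_classifier(texts, **kwargs):
    return [
        {"label": "positive" if "rises" in t else "negative", "score": 0.9}
        for t in texts
    ]


scores = score_texts(fake_classifier, ["Bitcoin rises", "Bitcoin falls"])
print(scores)
```

With the real pipeline, the call would look like `score_texts(sentiments_pipe, articles_23_24['Short Description'])`, assuming every entry in that column is a string.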

In [11]:
articles_23_24.head(5)

Date,Description,Short Description,Accurate Sentiments
2024-09-12,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...,0.0
2024-09-12,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...,0.0
2024-09-12,"On Sep 12, 2024, 18:53 PM(UTC). According to B...","Bitcoin has dropped below 58,000 USDT and is n...",-0.994299
2024-09-12,Digital-trading platform eToro USA agreed to p...,eToro USA has agreed to limit its crypto offe...,0.0
2024-09-12,"On Sep 12, 2024, 02:00 AM (UTC), according to ...","Bitcoin has crossed the 58,000 USDT benchmark ...",0.99964


In [12]:
articles_23_24.to_csv("../data/final_data/bitcoin_news_sentiments.csv")

**This dataset can be found here: [Kaggle](https://www.kaggle.com/datasets/imadallal/sentiment-analysis-of-bitcoin-news/data)**

## Part II - Preparing the Pre-Built Dataset from **Oliviervha**

In [None]:
!pip install kaggle

In [12]:
!kaggle datasets download -d oliviervha/crypto-news

Dataset URL: https://www.kaggle.com/datasets/oliviervha/crypto-news
License(s): unknown
Downloading crypto-news.zip to /home/not1txf/Dataset-Viewer-and-Preparer/notebooks
100%|██████████████████████████████████████| 3.99M/3.99M [00:02<00:00, 1.53MB/s]
100%|██████████████████████████████████████| 3.99M/3.99M [00:02<00:00, 1.48MB/s]


In [13]:
!unzip crypto-news.zip

Archive:  crypto-news.zip
  inflating: cryptonews.csv          


In [14]:
articles_21_23 = pd.read_csv('cryptonews.csv')
articles_21_23

Unnamed: 0,date,sentiment,source,subject,text,title,url
0,2023-12-19 06:40:41,"{'class': 'negative', 'polarity': -0.1, 'subje...",CryptoNews,altcoin,Grayscale CEO Michael Sonnenshein believes the...,Grayscale CEO Calls for Simultaneous Approval ...,https://cryptonews.comhttps://cryptonews.com/n...
1,2023-12-19 06:03:24,"{'class': 'neutral', 'polarity': 0.0, 'subject...",CryptoNews,blockchain,"In an exclusive interview with CryptoNews, Man...",Indian Government is Actively Collaborating Wi...,https://cryptonews.comhttps://cryptonews.com/n...
2,2023-12-19 05:55:14,"{'class': 'positive', 'polarity': 0.05, 'subje...",CryptoNews,blockchain,According to the Federal Court ruling on Decem...,Judge Approves Settlement: Binance to Pay $1.5...,https://cryptonews.comhttps://cryptonews.com/n...
3,2023-12-19 05:35:26,"{'class': 'positive', 'polarity': 0.5, 'subjec...",CoinTelegraph,blockchain,Some suggest EVM inscriptions are the latest w...,Why a gold rush for inscriptions has broken ha...,https://cointelegraph.com/news/inscriptions-ev...
4,2023-12-19 05:31:08,"{'class': 'neutral', 'polarity': 0.0, 'subject...",CoinTelegraph,ethereum,A decision by bloXroute Labs to start censorin...,‘Concerning precedent’ — bloXroute Labs' MEV r...,https://cointelegraph.com/news/concerning-prec...
...,...,...,...,...,...,...,...
31032,2021-10-27 15:17:00,"{'class': 'neutral', 'polarity': 0.0, 'subject...",CryptoNews,defi,Cream Finance (CREAM) suffered another flash l...,Cream Finance Suffers Another Exploit as Attac...,https://cryptonews.com/news/cream-finance-suff...
31033,2021-10-19 13:39:00,"{'class': 'positive', 'polarity': 0.1, 'subjec...",CryptoNews,blockchain,Banque de France disclosed the results of its ...,French Central Bank's Blockchain Bond Trial Br...,https://cryptonews.com/news/french-central-ban...
31034,2021-10-18 13:58:00,"{'class': 'positive', 'polarity': 0.14, 'subje...",CryptoNews,blockchain,Advancing its project to become \x9caÂ\xa0meta...,"Facebook To Add 10,000 Jobs In EU For Metavers...",https://cryptonews.com/news/facebook-to-add-10...
31035,2021-10-15 00:00:00,"{'class': 'neutral', 'polarity': 0.0, 'subject...",CryptoNews,blockchain,Chinese companies are still topping the blockc...,Tech Crackdown Hasn't Halted Chinese Firms' Bl...,https://cryptonews.com/news/tech-crackdown-has...


### II.1 - Filtering the Articles for Bitcoin-Related Content and Removing Duplicates

In [15]:
articles_21_23 = articles_21_23[articles_21_23['subject']=='bitcoin'].sort_index(ascending=False).drop_duplicates(subset=['title'])
articles_21_23.shape

(9956, 7)

#### The code below identifies any missing dates in the dataset.

In [16]:
date_range = pd.date_range(start='2021-11-10', end='2023-12-19', freq='D').date

articles_21_23.index = pd.to_datetime(articles_21_23.date)
# A set gives fast membership tests for the dates that have at least one article
df_index_dates = set(articles_21_23.index.date)

# Days in the range with no article at all
missing_dates = [date for date in date_range if date not in df_index_dates]

if not missing_dates:
    print("All dates from 2021-11-10 to 2023-12-19 are present.")
else:
    print(f"Number of Missing dates from 2021-11-10 to 2023-12-19: -------- {len(missing_dates)} --------")
    # Uncomment the lines below to see the missing dates
    # for date in missing_dates:
    #     print(date.strftime('%Y-%m-%d'))

Number of Missing dates from 2021-11-10 to 2023-12-19: -------- 18 --------


With 18 of the 770 days in this range missing (about 2.3%), the gaps are small enough to accept.
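
The same check can also be written without an explicit loop. The sketch below is one possible vectorized variant, using `DatetimeIndex.difference`; the helper name `find_missing_days` and the sample timestamps are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd


def find_missing_days(timestamps, start: str, end: str) -> pd.DatetimeIndex:
    """Return the calendar days in [start, end] that have no timestamp."""
    expected = pd.date_range(start=start, end=end, freq="D")
    # normalize() sets every timestamp to midnight so it compares by day
    present = pd.DatetimeIndex(timestamps).normalize().unique()
    return expected.difference(present)


# Made-up article timestamps
stamps = pd.to_datetime(
    ["2021-11-10 04:58", "2021-11-10 13:10", "2021-11-12 09:00"]
)

missing = find_missing_days(stamps, "2021-11-10", "2021-11-13")
print(list(missing.strftime("%Y-%m-%d")))
```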

### II.2 - Assigning Sentiment Scores to the Articles

In [17]:
sentiments_pipe = pipeline("text-classification", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")

#### Comparison between ``TextBlob`` and ``Transformers``

In [20]:
text = articles_21_23.loc['2021-11-10 13:10:00']['text']
print(text)

Bitcoin price is struggling to gain momentum for a move to USD 70,000.


**``TextBlob``**

In [21]:
sentiments = articles_21_23.loc['2021-11-10 13:10:00']['sentiment']
print(sentiments)

{'class': 'neutral', 'polarity': 0.0, 'subjectivity': 0.0}


**``Transformers``**

In [22]:
sentiments_pipe(text)

[{'label': 'negative', 'score': 0.997956395149231}]

#### The cells below update the sentiment scores from ``TextBlob`` with those generated by ``Transformers``

In [23]:
label_sign_map = {
    'negative': -1,
    'positive': 1,
    'neutral': 0
}

# Collect scores in a list; np.append would copy the whole array on every iteration
sentiment_scores = []

for index, text in enumerate(articles_21_23['text']):
    data = sentiments_pipe(text)
    label = data[0]['label']
    score = data[0]['score']

    # Sign the model's confidence: negative -> -1, neutral -> 0, positive -> +1
    sign = label_sign_map.get(label, 0)
    sentiment_scores.append(score * sign)

    # Uncomment the lines below to track the analysis
    # if index % 200 == 0:
    #     print(f"index N° {index}")

sentiments_array = np.array(sentiment_scores)

In [24]:
articles_21_23['Accurate Sentiments'] = sentiments_array
articles_21_23.drop(columns=['date', 'sentiment', 'source', 'subject', 'title', 'url'], inplace=True)

In [25]:
articles_21_23.head(5)

date,text,Accurate Sentiments
2021-11-10 04:58:00,Bitcoin price is correcting gains below USD 67...,-0.998189
2021-11-10 11:09:00,The much-awaited wallet is scheduled to be lau...,0.0
2021-11-10 13:10:00,Bitcoin price is struggling to gain momentum f...,-0.997956
2021-11-10 15:37:00,Bitcoin (BTC) hit yet another all-time high on...,0.999423
2021-11-11 09:07:00,The latest announcement appears part of a coor...,0.0


In [26]:
articles_21_23.to_csv("../data/final_data/bitcoin_news_sentiments_21_23.csv")

A public link for the updated dataset is not available yet.

### II.3 - Merging Datasets from Part I and Part II

In [27]:
# reset_index() returns a new frame for display; both originals keep their date index
articles_21_23.reset_index()
articles_23_24.reset_index()

Unnamed: 0,Date,Description,Short Description,Accurate Sentiments
0,2024-09-12,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...,0.000000
1,2024-09-12,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...,0.000000
2,2024-09-12,"On Sep 12, 2024, 18:53 PM(UTC). According to B...","Bitcoin has dropped below 58,000 USDT and is n...",-0.994299
3,2024-09-12,Digital-trading platform eToro USA agreed to p...,eToro USA has agreed to limit its crypto offe...,0.000000
4,2024-09-12,"On Sep 12, 2024, 02:00 AM (UTC), according to ...","Bitcoin has crossed the 58,000 USDT benchmark ...",0.999640
...,...,...,...,...
464,2024-07-22,Traders could be forgiven for wanting to cash ...,Bitcoin has risen more than 20% to the current...,0.999660
465,2024-07-22,Bitcoin financial services firm Swan Bitcoin p...,Swan Bitcoin has discontinued its managed mini...,0.000000
466,2024-07-21,Trump's social media platform company isn’t th...,stock has risen higher as investors have rais...,0.999581
467,2024-07-19,"Hugh Hendry, famed former global macro hedge f...",Hugh Hendry is a former global macro hedge fun...,0.996655


In [34]:
articles_21_24 = pd.concat([articles_21_23, articles_23_24]).sort_index()

In [35]:
articles_21_24

Unnamed: 0,text,Accurate Sentiments,Description,Short Description
2021-11-05 04:42:00,Bitcoin price is consolidating near the USD 62...,0.998558,,
2021-11-05 08:15:00,Congress could finally approve or reject the m...,0.000000,,
2021-11-05 10:24:00,Bitcoin increasingly becoming a political inst...,0.000000,,
2021-11-05 16:58:00,There is still potential for the price of bitc...,0.999458,,
2021-11-05 21:00:00,'Several companies' are looking to Latin Ameri...,0.000000,,
...,...,...,...,...
2024-09-12 00:00:00,,0.997017,"According to Odaily, data from mempool.space i...","According to data from mempool.space, transact..."
2024-09-12 00:00:00,,0.000000,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...
2024-09-12 00:00:00,,0.000000,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...
2024-09-12 00:00:00,,0.999714,"This week, our desk noticed that BTC market vo...","This week, the US labour market and CPI data r..."
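
The empty cells in the merged frame arise because the two sources store article text under different column names. One possible way to avoid them, assuming the scored `Short Description` column is the counterpart of `text`, is to rename it before concatenating. The frames below are made-up stand-ins that only mirror the two column layouts.

```python
import pandas as pd

# Stand-in for the 2021-2023 frame (values are illustrative)
older = pd.DataFrame(
    {"text": ["older article"], "Accurate Sentiments": [0.5]},
    index=pd.to_datetime(["2021-11-05 04:42"]),
)

# Stand-in for the 2023-2024 frame (values are illustrative)
newer = pd.DataFrame(
    {
        "Description": ["long form"],
        "Short Description": ["newer article"],
        "Accurate Sentiments": [-0.5],
    },
    index=pd.to_datetime(["2024-09-12"]),
)

# Keep one shared text column so the concatenated frame has no NaN gaps
newer_aligned = newer.rename(columns={"Short Description": "text"})[
    ["text", "Accurate Sentiments"]
]

merged = pd.concat([older, newer_aligned]).sort_index()
print(merged)
```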


In [36]:
articles_21_24.drop(columns=['text', 'Description', 'Short Description'])

Unnamed: 0,Accurate Sentiments
2021-11-05 04:42:00,0.998558
2021-11-05 08:15:00,0.000000
2021-11-05 10:24:00,0.000000
2021-11-05 16:58:00,0.999458
2021-11-05 21:00:00,0.000000
...,...
2024-09-12 00:00:00,0.997017
2024-09-12 00:00:00,0.000000
2024-09-12 00:00:00,0.000000
2024-09-12 00:00:00,0.999714
