# Preparing the datasets that will be used in the training process
**[Last Updated: Sep 15, 2024]**

This notebook is designed to prepare datasets for predicting Bitcoin price and movement. It is divided into three key parts:

- Part 1: Preparing a dataset from our [scraped news articles](https://github.com/Bitcoin-Price-Prediction-Experiments/Bitcoin-News-Scraper). The goal is to merge the Binance and Yahoo article datasets into a single one and add a sentiment column reflecting the sentiment of each article.

- Part 2: Preparing a similar dataset from [Kaggle](https://www.kaggle.com/datasets/oliviervha/crypto-news), which already contains sentiment data. Because those sentiment scores are inaccurate, we drop the unnecessary columns and apply a more precise sentiment analysis model. This dataset is then merged with the one from Part 1.

- Part 3: Preparing a dataset from the files we [downloaded](https://github.com/Bitcoin-Price-Prediction-Experiments/Bitcoin-Transaction-History-Downloader) from Bitget. The data will be organized into 5-hour intervals to align with market movement analysis.
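
As a rough illustration of the 5-hour grouping mentioned for Part 3, the sketch below resamples a few made-up trade records with pandas. The `price` and `volume` column names and the sample values are assumptions for illustration only; the actual Bitget files may be structured differently.

```python
import pandas as pd

# Hypothetical trade records; column names and values are illustrative only.
trades = pd.DataFrame(
    {
        "price": [100.0, 102.0, 101.0, 105.0],
        "volume": [1.0, 2.0, 1.5, 0.5],
    },
    index=pd.to_datetime(
        [
            "2024-09-01 00:30",
            "2024-09-01 04:45",
            "2024-09-01 05:10",
            "2024-09-01 09:59",
        ]
    ),
)

# Group rows into consecutive 5-hour bins (by default the bins start at
# midnight of the first day in the index).
bins = trades.resample(pd.Timedelta(hours=5))

# Summarize each bin: open/high/low/close of the price, plus total volume.
five_hour = bins["price"].ohlc()
five_hour["volume"] = bins["volume"].sum()

print(five_hour)
```

Each output row then describes one 5-hour window, which is a convenient shape for joining against per-interval sentiment later on.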

In [1]:
%reload_ext jupyternotify
%config IPCompleter.greedy=True


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Part 1: Preparing a dataset from our scraped news articles

In [3]:
binance_df = pd.read_csv('binance_bitcoin_news.csv', index_col='Date', parse_dates=True)
yahoo_df = pd.read_csv('yahoo_bitcoin_news.csv', index_col='Date', parse_dates=True)

In [4]:
news_dataset = pd.concat([binance_df, yahoo_df]).sort_index(ascending=False)

In [5]:
news_dataset

Date,Description,Short Description
2024-09-12,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...
2024-09-12,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...
2024-09-12,"On Sep 12, 2024, 18:53 PM(UTC). According to B...","Bitcoin has dropped below 58,000 USDT and is n..."
2024-09-12,Digital-trading platform eToro USA agreed to p...,eToro USA has agreed to limit its crypto offe...
2024-09-12,"On Sep 12, 2024, 02:00 AM (UTC), according to ...","Bitcoin has crossed the 58,000 USDT benchmark ..."
...,...,...
2024-07-22,Traders could be forgiven for wanting to cash ...,Bitcoin has risen more than 20% to the current...
2024-07-22,Bitcoin financial services firm Swan Bitcoin p...,Swan Bitcoin has discontinued its managed mini...
2024-07-21,Trump's social media platform company isn’t th...,stock has risen higher as investors have rais...
2024-07-19,"Hugh Hendry, famed former global macro hedge f...",Hugh Hendry is a former global macro hedge fun...


In [6]:
from transformers import pipeline

sentiments_pipe = pipeline("text-classification", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")



In [7]:
label_sign_map = {
    'negative': -1,
    'positive': 1,
    'neutral': 0
}

# Collect scores in a list: np.append copies the whole array on every call
sentiment_scores = []

for index, text in enumerate(news_dataset['Short Description']):
    # Uncomment the lines below to track the analysis
    # if index % 100 == 0:
    #     print(f"index N°={index}")

    # The pipeline returns a list with one {'label', 'score'} dict per input text
    data = sentiments_pipe(text)
    label = data[0]['label']
    score = data[0]['score']

    # Sign the confidence score by its label; unknown labels count as neutral (0)
    sign = label_sign_map.get(label, 0)

    sentiment_scores.append(score * sign)

sentiments_array = np.array(sentiment_scores)
print("END of analysis")

END of analysis
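
To make the scoring rule easier to check without loading the model, the following sketch applies the same label-to-sign mapping to a few hand-written results. The helper name `to_signed_scores` and the sample values are hypothetical; only the `label`/`score` dictionary shape is taken from the cell above.

```python
import numpy as np

# Same mapping as the cell above: label -> sign applied to the confidence score.
LABEL_SIGN = {"negative": -1, "positive": 1, "neutral": 0}


def to_signed_scores(results):
    """Convert classifier results into signed confidence scores.

    `results` is a list of dicts with 'label' and 'score' keys, one per text.
    Unknown labels are treated as neutral (sign 0).
    """
    return np.array([r["score"] * LABEL_SIGN.get(r["label"], 0) for r in results])


# Hand-written results standing in for real model output.
fake_results = [
    {"label": "positive", "score": 0.99},
    {"label": "negative", "score": 0.75},
    {"label": "neutral", "score": 0.90},
]

signed = to_signed_scores(fake_results)
print(signed)
```

Note that a neutral prediction always maps to 0 regardless of its confidence, which is why neutral articles appear as `0.000000` in the resulting column.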


In [8]:
news_dataset['Accurate Sentiments'] = sentiments_array

In [9]:
news_dataset

Date,Description,Short Description,Accurate Sentiments
2024-09-12,"According to Cointelegraph, the TIME Magazine ...",Time Magazine reporter Vera Bergengruen believ...,0.000000
2024-09-12,"According to Foresight News, Bitcoin staking p...",Solv has integrated Chainlink's Cross-Chain In...,0.000000
2024-09-12,"On Sep 12, 2024, 18:53 PM(UTC). According to B...","Bitcoin has dropped below 58,000 USDT and is n...",-0.994299
2024-09-12,Digital-trading platform eToro USA agreed to p...,eToro USA has agreed to limit its crypto offe...,0.000000
2024-09-12,"On Sep 12, 2024, 02:00 AM (UTC), according to ...","Bitcoin has crossed the 58,000 USDT benchmark ...",0.999640
...,...,...,...
2024-07-22,Traders could be forgiven for wanting to cash ...,Bitcoin has risen more than 20% to the current...,0.999660
2024-07-22,Bitcoin financial services firm Swan Bitcoin p...,Swan Bitcoin has discontinued its managed mini...,0.000000
2024-07-21,Trump's social media platform company isn’t th...,stock has risen higher as investors have rais...,0.999581
2024-07-19,"Hugh Hendry, famed former global macro hedge f...",Hugh Hendry is a former global macro hedge fun...,0.996655


In [10]:
news_dataset.to_csv("./data/final_data/bitcoin_news_sentiments.csv")
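
`DataFrame.to_csv` raises an error when the target directory does not exist, so on a fresh checkout `./data/final_data/` has to be created first. A minimal sketch of one way to handle this is shown below; the `save_csv` helper and the demo frame are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv(df, path):
    """Write df to path, creating any missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path)
    return path


# Tiny stand-in frame with the same index name as the news dataset.
demo = pd.DataFrame(
    {"Accurate Sentiments": [0.5, -0.25]},
    index=pd.to_datetime(["2024-09-12", "2024-09-11"]).rename("Date"),
)

# Write into a throwaway directory, then read the file back to verify it.
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(demo, Path(tmp) / "data" / "final_data" / "demo.csv")
    reloaded = pd.read_csv(out, index_col="Date", parse_dates=True)

print(reloaded)
```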

#### This dataset can be found here: [Kaggle](https://www.kaggle.com/datasets/imadallal/sentiment-analysis-of-bitcoin-news/data)

## Part 2: Preparing the dataset from Kaggle and merging it with the one from Part 1