## Project description

On the bitcointalk.org web forum, users post new discussion topics every day in the "Bitcoin Discussion" section. The assumption is that the number of topics posted per day, together with the sentiment of their text, reflects public interest in the Bitcoin cryptocurrency. The text and posting date of each topic are scraped from the forum with the BeautifulSoup and requests libraries.
Sentiment analysis of the topic text is performed with the Transformers library, more precisely with its default deep learning text-classification pipeline, "sentiment-analysis".
The sentiment results serve as part of the input data set, on the assumption that they improve a model's success in predicting the daily rise or fall of the Bitcoin price.

In [None]:
# !pip install transformers
# !pip install pandas
# !pip install requests
# !pip install bs4
# datetime and re are part of the standard library and need no installation

In [223]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import re

In [285]:
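def scrape_url(board_num,top_page_list_el,end_date):
    """Return the board offset of a page that contains a post dated end_date.

    board_num - page offset used in f'https://bitcointalk.org/index.php?board=1.{board_num}'
    top_page_list_el - index of the 'navPages' link to follow, i.e. top_pages_list[top_page_list_el:top_page_list_el+1]
    end_date - target date as mm-dd-yyyy, e.g. "10-09-2023"
    """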
    #board_num=27440
    # Scrape which page is last, after ...page_number
    url = f'https://bitcointalk.org/index.php?board=1.{board_num}'
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    top_pages_list = soup.find_all(class_='navPages')
    a_element = str(top_pages_list[top_page_list_el:top_page_list_el+1])
    # group(1)- for numbers
    page_num = re.search(r'>(\d+)<',a_element).group(1)
    # <a class="navPages" href="https://bitcointalk.org/index.php?board=1.27560">690</a>
    # Regex pattern to extract URL
    pattern = r'href="([^"]*)"'

    # Find all matches
    matches = re.findall(pattern, a_element)

    # Extract the URL from the first match
    if matches:
        scraped_url = matches[0]
        print(f"Extracted URL for page {page_num}:", scraped_url)
    else:
        print("No URL found in the given string.")
        return

    # Visit the scraped URL
    page = requests.get(scraped_url).text
    soup = BeautifulSoup(page, 'html.parser')

    # Find date elements
    date_elements = soup.find_all(class_='windowbg2 lastpostcol')
    
    
    # Parse the target date (mm-dd-yyyy) into a datetime for comparison
    target_date = datetime.datetime.strptime(end_date, '%m-%d-%Y')

    # Iterate over the date cells in reverse order:
    # start at len(date_elements)-1, stop before -1, step -1
    for i in range(len(date_elements)-1,-1,-1):
        date = date_elements[i]
        # Convert the date element to string and extract the date
        date_str = str(list(date.descendants)[5:7])

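        # If the date string has more than 20 characters, it contains the full date and time
        if len(date_str) > 20:
            date_str = date_str[20:]
            scraped_date = re.search(r'^[A-Z][a-z]{2,8}\s\d{2},\s\d{4}', date_str).group(0)
            scraped_date = scraped_date.replace(",", "")
            parsed_date = datetime.datetime.strptime(scraped_date, "%B %d %Y")
        else:
            # Shorter strings are the "Today" marker, so use today's date at midnight
            parsed_date = datetime.datetime.combine(datetime.date.today(), datetime.time.min)

        # Date is later than the target: continue with the next page link
        if parsed_date > target_date:
            return scrape_url(board_num,top_page_list_el+1,end_date)
        # Reached the first cell and it is still earlier than the target: go back one page link
        elif i == 0 and parsed_date < target_date:
            return scrape_url(board_num,top_page_list_el-1,end_date)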
        # If the date matches, scrape the URL
        elif parsed_date == target_date:
            print("Date matches. Scraping URL...")
            print(f"URL scraped from {target_date.day}/{target_date.month}/{target_date.year}:", scraped_url)
            board_num = int(re.search(r'board=1\.(\d+)',scraped_url).group(1))
            return board_num

    return None

Extracted URL for page 690: https://bitcointalk.org/index.php?board=1.27560
Date matches. Scraping URL...
URL scraped from 9/10/2017: https://bitcointalk.org/index.php?board=1.27560


27560
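
The three regular expressions used in `scrape_url` can be tried offline on the sample anchor quoted in the function's comment. This is a minimal sketch; it assumes the forum keeps emitting `navPages` links in exactly that shape, and the variable `board_offset` is introduced here only for illustration.

```python
import re

# Sample anchor in the same shape as the one quoted in the scrape_url comment
a_element = '<a class="navPages" href="https://bitcointalk.org/index.php?board=1.27560">690</a>'

# Page number: the digits between ">" and "<"
page_num = int(re.search(r'>(\d+)<', a_element).group(1))

# Target URL: everything inside href="..."
scraped_url = re.search(r'href="([^"]*)"', a_element).group(1)

# Board offset: the digits after "board=1." (the dot is escaped on purpose)
board_offset = int(re.search(r'board=1\.(\d+)', scraped_url).group(1))

print(page_num, scraped_url, board_offset)
```

Page 690 maps to offset 27560 because each page advances the offset by 40 and page 1 starts at 0.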

In [288]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import re

subject=[]
dat=[]
first_board = scrape_url(3160,49,"01-23-2023")
last_board = scrape_url(27440,54,"10-09-2017")

for i in range(first_board,last_board,40):
    # each page URL is offset by 40 in the number after ...board=1.
    # page 1 is ...board=1.0, page 2 is ...board=1.40 ...
    url = f'https://bitcointalk.org/index.php?board=1.{i}'
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    
    date_element= soup.find_all(class_='windowbg2 lastpostcol')
    posts = soup.find_all(class_='windowbg') 
    # find_all returns a list of elements from which the date is extracted
    for date in date_element:
        # two list elements stored in a single str variable
        date=str(list(date.descendants)[5:7])
        # if the string has more than 20 characters it holds the full date and time
        if len(date)>20:
            date=date[20:]
            # search for the date (e.g. February 22, 2023)
            # REGEX:
            # capital first letter: [A-Z]
            # 2-8 lowercase letters: [a-z]{2,8}
            # \s: whitespace
            # day: \d{2}
            # comma: ,
            # year: \d{4}
            date=re.search(r'^[A-Z][a-z]{2,8}\s\d{2},\s\d{4}',date).group(0)
            # the date contains a comma (,) that has to be removed with re.sub
            date=re.sub(",", "", date)
        # otherwise it is the marker for today's date (...<b>Today</b>)
        else:
            # in that case the datetime library supplies
            # today's date in the same format
            date=datetime.datetime.now().strftime("%B %d %Y")
        dat.append(date)
    # loop over every element of class 'windowbg'
    for post in posts:
        text=post.find('a')
        # some elements have no link, so find returns None
        try:
            text=text.string
        # if text is None, skip the element
        except AttributeError:
            continue
        # append the text to the subject list
        subject.append(text)
        
# the dat and subject lists are converted to Series columns
# so they can be joined into a table with concat
dat=pd.Series(dat)
# the first page yields one extra None element, so the list starts at its 2nd item
# to keep the dates aligned with the topics posted on that date
subject=subject[1:]
subject=pd.Series(subject)

df=pd.concat({"Date": dat ,"subject": subject}, axis=1)
df=df.iloc[3:]

Extracted URL for page 77: https://bitcointalk.org/index.php?board=1.3040
Extracted URL for page 78: https://bitcointalk.org/index.php?board=1.3080
Date matches. Scraping URL...
URL scraped from 23/1/2023: https://bitcointalk.org/index.php?board=1.3080
Extracted URL for page 690: https://bitcointalk.org/index.php?board=1.27560
Date matches. Scraping URL...
URL scraped from 9/10/2017: https://bitcointalk.org/index.php?board=1.27560
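
The date handling in the loop above can be isolated into a small function and checked without any network access. This is a minimal sketch: `parse_forum_date` is a hypothetical helper that does not exist in the notebook, the sample strings only assume the "Month dd, yyyy, time" shape described in the comments, and `%B` assumes English month names.

```python
import datetime
import re

def parse_forum_date(cell_text, today=None):
    """Hypothetical helper: return the date shown in a 'last post' cell."""
    match = re.search(r'[A-Z][a-z]{2,8}\s\d{2},\s\d{4}', cell_text)
    if match is None:
        # No full date in the text: treated as the forum's "Today" marker
        return today or datetime.date.today()
    return datetime.datetime.strptime(match.group(0), "%B %d, %Y").date()

full = parse_forum_date("February 22, 2023, 10:15:42 AM")
marker = parse_forum_date("<b>Today</b> at 09:00:00 AM",
                          today=datetime.date(2023, 1, 24))
print(full.isoformat(), marker.isoformat())
```

Passing `today` explicitly keeps the result reproducible when the scrape is rerun on a later day.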


In [10]:
# # check that all dates were extracted correctly
# for i in range(2300,500,-150):
#     print(df.Date[i])

In [289]:
# convert the "January 23 2023"-style strings to the %Y-%m-%d format in one
# vectorized step, which avoids chained assignment such as df.Date[i]=...
df=df.assign(Date=pd.to_datetime(df["Date"],format="%B %d %Y").dt.strftime("%Y-%m-%d"))

In [290]:
df

Unnamed: 0,Date,subject
3,2023-01-24,investment in bitcoin should be short term or ...
4,2023-01-23,Bitcoin increased my mental awareness
5,2023-01-23,Bitcoin's freedom is Absolute!
6,2023-01-23,Banks do not want bitcoin to become popular
7,2023-01-23,Giving bitcoin awareness for easy adoption in ...
...,...,...
13355,2017-10-11,Any App that generates bitcoin
13356,2017-10-11,Bitcoin-endures-instantaneous-flash-crash-on-m...
13357,2017-10-11,Did hear that news about Russia way to block c...
13358,2017-10-11,State Melbourne University plans to inegrate a...


In [291]:
df=df.set_index('Date')

In [292]:
df=df.iloc[::-1]

In [293]:
# a numeric index has to be restored: while the dates are the index
# they are not available as a column, so reset_index()
# moves them back into a separate column
df=df.reset_index()

In [294]:
# convert the resulting table to a list of records
df_list = df.to_dict('records')

#### Using the date as the main key that holds the list of topics for that day

In [295]:
teme={}
for dic in df_list:
    datum=dic["Date"]
    if datum not in teme:
        teme[datum]=dict(teme_list=list())
    tema=dic["subject"]
    teme[datum]["teme_list"].append(tema)
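
The same grouping can be written with `collections.defaultdict`, which removes the explicit "is the key already there" check. This is only a sketch of an alternative: the records are a few sample rows in the shape produced by `df.to_dict('records')`, and the flat `topics_by_date` mapping is a simplification of the nested `teme` structure used in the notebook.

```python
from collections import defaultdict

# Sample records in the same shape as df.to_dict('records')
records = [
    {"Date": "2017-10-11", "subject": "S2X vs Core"},
    {"Date": "2017-10-11", "subject": "quick question on SegWit2X"},
    {"Date": "2017-10-12", "subject": "Cash is King? I don't think so."},
]

# Each new date starts with an empty list automatically
topics_by_date = defaultdict(list)
for record in records:
    topics_by_date[record["Date"]].append(record["subject"])

for date, topics in topics_by_date.items():
    print(date, len(topics))
```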

In [303]:
teme

{'2017-10-11': {'teme_list': [nan,
   'State Melbourne University plans to inegrate a blockchain',
   'Did hear that news about Russia way to block cryptocurrency exchange?? ',
   'Bitcoin-endures-instantaneous-flash-crash-on-major-price-index',
   'Any App that generates bitcoin',
   'quick question on SegWit2X',
   'Why hackers want to get paid/ransom in bitcoins?',
   'Is it a good idea to store Bitcoin in exchange websites before the hard fork?',
   "How is the Bitcoin's Market Cap determined?",
   'Putin Tells Central Bank Not to Create Unnecessary Barriers to Cryptocurrencies',
   'S2X vs Core',
   'Unite with likeminded individuals',
   'with bitcoin can I get an island?']},
 '2017-10-12': {'teme_list': ["Any experienced traders here? what're the best tips you can give to newbies?",
   'MOVED: Will BCH kill BTCSegWit while reinstating BTCSatoshi?',
   'Just got my first bitcoins a few months ago, how lucky, and it was luck  for me.',
   "Cash is King? I don't think so.",
   'The

### Importing the Transformers library
* the `pipeline` function gives access to a default deep learning model for text analysis
* the `sentiment-analysis` pipeline analyzes text and returns the result in numeric form
* the returned dict has the keys `label`, whose value is the polarity (`POSITIVE`/`NEGATIVE`), and `score`, a value between 0 and 1

In [296]:
from transformers import pipeline
sent_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



#### Building the function for sentiment analysis
* If the input text is positively oriented, for example "like", "great" or "rise", `label` is `POSITIVE`
* Text such as "hate", "collapse" or "fallout" returns `NEGATIVE` as the value of the `label` key
* Based on these values, the function returns the score with a negative sign when `label` is `NEGATIVE`
* only the first 100 characters of the input are analyzed (`tekst[:100]`) so that the function runs as quickly as possible, which speeds up the rest of the code
* the index `[0]` is needed because `sent_pipeline` returns a dictionary inside a list, and 0 is the first element of that list

In [301]:
def analiza_sent(tekst):
    tema=sent_pipeline(tekst[:100])[0]
    rezultat=tema["score"]
    polaritet=tema["label"]
    if polaritet=="NEGATIVE":
        rezultat*=-1
    return rezultat
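
The sign convention in `analiza_sent` can be checked without downloading a model. The sketch below is an illustration only: `signed_score` is a hypothetical helper, and `fake_outputs` holds hand-written stand-ins with made-up scores in the list-of-dict shape that the pipeline returns.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} dict to a value between -1 and 1."""
    score = prediction["score"]
    return -score if prediction["label"] == "NEGATIVE" else score

# Hand-written stand-ins for pipeline output; the scores are made up
fake_outputs = [
    [{"label": "POSITIVE", "score": 0.98}],
    [{"label": "NEGATIVE", "score": 0.75}],
]

# [0] selects the single dict inside each returned list
signed = [signed_score(out[0]) for out in fake_outputs]
print(signed)
```

Because the score is the confidence in the predicted label, a signed value near 0 would mean an uncertain prediction rather than neutral text; the default model has no neutral class.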

In [305]:
rez_transf={}
for dat in teme:
    lista_tema = teme[dat]["teme_list"]
    rez_transf[dat]=dict(broj_tema=len(lista_tema), rezultati=list())
    for tema in lista_tema:
        rez_transf[dat]["rezultati"].append(analiza_sent(str(tema)))

In [316]:
# functions for separating positive from negative results
def pozitivni(lista):
    return [br for br in lista if br >= 0]

def negativni(lista):
    return [br for br in lista if br<0]

In [317]:
end_transf={}

for k in rez_transf:
    rez=rez_transf[k]["rezultati"]
    # number of topics, mean score and the shares of positive/negative scores per day
    end_transf[k]=dict(broj_tema=rez_transf[k]["broj_tema"],
                       avg_rezultata=sum(rez)/len(rez),
                       posto_poz=len(pozitivni(rez))/len(rez),
                       posto_neg=len(negativni(rez))/len(rez))
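
A worked example with four made-up scores shows what the daily aggregation produces. `summarize` is a hypothetical helper written for this sketch; it follows the same rules as `pozitivni` and `negativni`, so a score of exactly 0 counts as positive.

```python
def summarize(scores):
    """Return count, mean and positive/negative shares for one day's scores."""
    n = len(scores)
    positive = sum(1 for s in scores if s >= 0)
    return {
        "broj_tema": n,
        "avg_rezultata": sum(scores) / n,
        "posto_poz": positive / n,
        "posto_neg": (n - positive) / n,
    }

# Made-up signed scores for a single day
day = summarize([0.5, -0.5, -1.0, 1.0])
print(day)
```

Despite the `posto_` prefix, the shares are fractions between 0 and 1 rather than percentages.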

In [333]:
len(end_transf)

1410

## Moving the results into a pandas DataFrame

In [318]:
# convert the end_transf dictionary to a table
df_transf=pd.DataFrame.from_dict(end_transf,orient="index")

In [319]:
# set a datetime index
df_transf.index=pd.to_datetime(df_transf.index)

In [326]:
df_transf

Unnamed: 0,broj_tema,avg_rezultata,posto_poz,posto_neg
2017-10-11,13,-0.534152,0.230769,0.769231
2017-10-12,27,-0.539031,0.222222,0.777778
2017-10-16,25,-0.622153,0.160000,0.840000
2017-10-17,15,-0.626765,0.200000,0.800000
2017-10-18,1,-0.998893,0.000000,1.000000
...,...,...,...,...
2023-01-19,4,-0.009539,0.500000,0.500000
2023-01-20,4,-0.487088,0.250000,0.750000
2023-01-21,2,0.002150,0.500000,0.500000
2023-01-23,6,0.323947,0.666667,0.333333


In [327]:
datumi=pd.date_range(start="10-09-2017",end="01-23-2023")

In [328]:
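# define the variable that will become the new index
# by setting the start and end dates
# (last_year_date and today_date are not defined in this notebook,
# so the fixed range from the previous cell is kept)
# datumi=pd.date_range(start=last_year_date,end=today_date)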

In [322]:
# reindex and fill the missing dates with the value 0
df_reindex=df_transf.reindex(datumi,fill_value=0)
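
What the reindexing step does can be sketched in plain Python: every calendar day in the range gets a row, and days without any scraped topic receive the fill value. `fill_missing_days` is a hypothetical helper and not a pandas function; the two topic counts are taken from the table above.

```python
import datetime

def fill_missing_days(values_by_day, start, end, fill_value=0):
    """Return one entry per calendar day from start to end (inclusive)."""
    filled = {}
    day = start
    while day <= end:
        filled[day] = values_by_day.get(day, fill_value)
        day += datetime.timedelta(days=1)
    return filled

# Topic counts for two days that have data; the days in between have none
observed = {
    datetime.date(2017, 10, 12): 27,
    datetime.date(2017, 10, 16): 25,
}
filled = fill_missing_days(observed,
                           datetime.date(2017, 10, 12),
                           datetime.date(2017, 10, 16))
print(list(filled.values()))
```

One consequence of filling with 0 is that a day without topics gets an average score of 0, which reads the same as a day with perfectly balanced sentiment.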

In [334]:
df_reindex

367

In [338]:
# use the rolling function to compute the mean of the last 7 days
rolling_transf=df_reindex.rolling(7).mean()

In [339]:
# drop the NaN values created by the rolling function
rolling_transf=rolling_transf.dropna()

In [340]:
# write the final table to a .csv file
rolling_transf.to_csv("Transf_roll7_new.csv")

### Using the rolling function
* `rolling` computes the mean of the data over the last 7 days: the mean of rows 1 to 7 is written to row 7, row 8 holds the mean of rows 2 to 8, and so on to the end of the table
* the first 6 rows are NaN because they do not have enough preceding data for the calculation, so they have to be removed
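
The windowing can be reproduced in plain Python to see which rows stay empty. This is a sketch of the idea, not the pandas implementation: `rolling_mean` is a hypothetical helper that returns `None` where pandas would show NaN.

```python
def rolling_mean(values, window):
    """Trailing mean; positions without a full window yield None."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # Fewer than `window` values are available so far
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
means = rolling_mean(values, 7)
print(means)
```

With a window of 7 the first six positions are empty, position 7 holds the mean of values 1 to 7, and each later position shifts the window forward by one row.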