## Functions to extract data by coin

Use these functions to extract the entries relevant to a given coin. First import the cleaned data from either Twitter or Reddit, then choose the keywords to filter on. The extracted entries can then be passed to sentiment analysis models.

In [2]:
import pandas as pd

In [3]:
def open_data(data_path):
    return pd.read_parquet(data_path)

In [4]:
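def select_tweets_keywords(df, keywords):
    # Keywords are joined into one case-insensitive regex alternation
    pattern = '|'.join(keywords)
    # na=False treats missing text/hashtags as non-matching instead of NaN
    mask = (df['text'].str.contains(pattern, case=False, na=False) |
            df['hashtags'].str.contains(pattern, case=False, na=False))
    return df[mask]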

In [5]:
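def select_reddit_keywords(df, keywords):
    # Keywords are joined into one case-insensitive regex alternation
    pattern = '|'.join(keywords)
    # na=False treats missing values (e.g. posts with empty text) as non-matching
    mask = (df['keyword'].str.contains(pattern, case=False, na=False) |
            df['title'].str.contains(pattern, case=False, na=False) |
            df['text'].str.contains(pattern, case=False, na=False))
    return df[mask]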
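
Because the keywords are joined into a regular expression and matched as substrings, a short ticker can also match inside unrelated words (for example `ETH` inside `method`), and a keyword containing regex metacharacters such as `$` would be interpreted as a pattern rather than literally. Below is a minimal sketch of one way to tighten this, assuming whole-word matching is what you want. The helper `build_keyword_pattern` and the sample frame are hypothetical and not part of the original notebook.

```python
import re
import pandas as pd

def build_keyword_pattern(keywords):
    # Hypothetical helper: escape each keyword so it is matched literally,
    # then wrap the alternation in lookarounds so a keyword only matches
    # when it is not surrounded by other word characters.
    escaped = "|".join(re.escape(k) for k in keywords)
    return rf"(?<!\w)(?:{escaped})(?!\w)"

# Toy data for illustration only
sample = pd.DataFrame({
    "text": ["ETH is up today", "A new method for staking", None, "Paid in $ETH"],
})

pattern = build_keyword_pattern(["ETH"])
mask = sample["text"].str.contains(pattern, case=False, na=False)
matched = sample[mask]  # keeps the rows mentioning ETH as a standalone token
```

Lookarounds are used instead of `\b` so that keywords beginning or ending with a non-word character (such as `$BTC`) still match as expected. The resulting pattern could be passed to `str.contains` in place of `'|'.join(keywords)`.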

## Twitter Example

In [6]:
df_tweets = open_data("data/tweets_clean.parquet")
df_tweets_btc = select_tweets_keywords(df_tweets, ["BTC", "Bitcoin"])
df_tweets_btc.head()

| Unnamed: 0 | user_created | user_followers | user_friends | user_favourites | user_verified | date | text | hashtags | is_retweet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2009-04-26 20:05:09 | 8534.0 | 7605.0 | 4838.0 | True | 2021-02-10 23:59:04 | Blue Ridge Bank shares halted by NYSE after #b... | ['bitcoin'] | False |
| 1 | 2019-10-17 20:12:10 | 6769.0 | 1532.0 | 25483.0 | True | 2021-02-10 23:58:48 | 😎 Today, that's this #Thursday, we will do a "... | ['Thursday', 'Btc', 'wallet', 'security'] | False |
| 2 | 2014-11-10 10:50:37 | 128.0 | 332.0 | 924.0 | True | 2021-02-10 23:54:48 | Guys evening, I have read this article about B... |  | False |
| 3 | 2019-09-28 16:48:12 | 625.0 | 129.0 | 14.0 | True | 2021-02-10 23:54:33 | $BTC A big chance in a billion! Price: \487264... | ['Bitcoin', 'FX', 'BTC', 'crypto'] | False |
| 4 | 2016-02-03 13:15:55 | 1249.0 | 1472.0 | 10482.0 | True | 2021-02-10 23:54:06 | This network is secured by 9 508 nodes as of t... | ['BTC'] | False |


## Reddit Example

In [7]:
df_reddit = open_data("data/reddit_clean.parquet")
df_reddit_btc = select_reddit_keywords(df_reddit, ["BTC", "Bitcoin"])
df_reddit_btc.head()

| Unnamed: 0 | subreddit | keyword | title | text | time_posted | number_of_comments | score | author | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CryptoCurrency | Bitcoin | Bitcoin Set to Become More Dominant Even as BT... |  | 2024-04-30 12:30:17 | 2 | 4 | kirtash93 | 2024-04-30 12:30:17 |
| 1 | CryptoCurrency | Bitcoin | Hong Kong Welcomes Spot Bitcoin and Ethereum E... |  | 2024-04-30 11:51:32 | 4 | 10 | asso | 2024-04-30 11:51:32 |
| 2 | CryptoCurrency | Bitcoin | Except solely HODLING BTC, diversification is ... | TLDR: Buying bitcoin is the always the best op... | 2024-04-30 11:42:26 | 4 | 4 | DecentralizeCosmos | 2024-04-30 11:42:26 |
| 3 | CryptoCurrency | Bitcoin | Bitcoin, Ethereum spot ETFs start trading in H... |  | 2024-04-30 09:42:01 | 11 | 14 | 0xJonnyDee | 2024-04-30 09:42:01 |
| 4 | CryptoCurrency | Bitcoin | MicroStrategy Adds 122 BTC for $7.8M, Now Hold... |  | 2024-04-30 08:08:48 | 37 | 81 | OcelotWarm8822 | 2024-04-30 08:08:48 |
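
Both examples extract a single coin. To extract several coins in one pass, the same keyword selection can be applied once per coin from a keyword mapping. This is a rough sketch under stated assumptions: `COIN_KEYWORDS`, `split_by_coin`, and the toy frame are illustrative names and data, not part of the original notebook, and the `ETH` keywords are only an example.

```python
import pandas as pd

# Hypothetical mapping from coin label to the keywords used to find it
COIN_KEYWORDS = {
    "BTC": ["BTC", "Bitcoin"],
    "ETH": ["ETH", "Ethereum"],
}

def split_by_coin(df, coin_keywords, columns):
    # Return one filtered frame per coin. A row is kept when any of the
    # given columns contains any of that coin's keywords (case-insensitive).
    result = {}
    for coin, keywords in coin_keywords.items():
        pattern = "|".join(keywords)
        mask = pd.Series(False, index=df.index)
        for col in columns:
            # na=False makes missing values count as non-matching
            mask |= df[col].str.contains(pattern, case=False, na=False)
        result[coin] = df[mask]
    return result

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["Bitcoin hits new high", "Ethereum upgrade shipped", "BTC vs ETH"],
    "text": [None, "Thoughts on the merge?", None],
})

by_coin = split_by_coin(toy, COIN_KEYWORDS, columns=["title", "text"])
```

Note that a row mentioning several coins appears in each matching frame, so the per-coin subsets can overlap.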
