# Project 3: NLP Subreddits: Bitcoin vs Ethereum 

## Problem Statement 

As an operative for Dogecoin, I'm trying to build a model that can tell the difference between Bitcoin and Ethereum investors. To do so, I'll be scraping data from the Bitcoin and Ethereum subreddits on Reddit. With this model, I'll be able to identify users from each crypto community so Dogecoin can send them targeted ads that will inflame tensions between the two groups. Our hope is that in the midst of all this chaos, users will flock to Dogecoin and help its value increase. However, our budget to serve these ads is modest, so it's important that the ads only go to the correct users. To optimize our budget, the models we test will ultimately be evaluated on their precision scores, with consideration for their ROC AUC scores as well. Precision measures how many of the users we target actually belong to the intended community, which is our end goal. As we all know, when you're trying to create chaos, it's important to maximize your results when you only have limited resources at your disposal.
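Since precision drives the evaluation, a small worked example may help. Precision is the share of positive predictions that are correct, TP / (TP + FP). The helper name and the counts below are made up purely for illustration; they are not results from this project.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        # No positive predictions at all; returning 0.0 is a convention assumed here.
        return 0.0
    return true_positives / predicted_positive

# Hypothetical counts: 80 ads reach the intended community, 20 reach the wrong one.
ad_precision = precision(true_positives=80, false_positives=20)
print(ad_precision)
```

With a fixed ad budget, every false positive is an ad spent on the wrong user, which is why precision is weighted ahead of other metrics here.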

## Import Libraries

In [1]:
import pandas as pd 
import datetime as dt 
import time 
import requests 

# I followed the steps of the web scraping lesson for this notebook, so it will look very similar.
# I rewrote the code as practice to develop some muscle memory for this process, but there
# is not a lot of novel work here. I was nervous about scraping the data correctly and
# didn't want to be working with bad data. Credit for the work below goes to the authors
# of that lesson and to Gwen for the pull request function.

## Scrape subreddit urls 

Use the Pushshift API to target the Bitcoin and Ethereum subreddits.

In [2]:
bitcoin_url = "https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin"

In [3]:
ethereum_url = "https://api.pushshift.io/reddit/search/submission?subreddit=ethereum"
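The two URLs above differ only in the `subreddit` query parameter, so they could also be built from one base URL. The sketch below uses the standard library's `urlencode`; `build_search_url` and `PUSHSHIFT_SEARCH` are hypothetical names introduced for illustration, not part of the original lesson code.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URLs used in this notebook.
PUSHSHIFT_SEARCH = "https://api.pushshift.io/reddit/search/submission"

def build_search_url(subreddit, **params):
    """Build a submission-search URL for one subreddit (hypothetical helper)."""
    query = {"subreddit": subreddit, **params}
    return f"{PUSHSHIFT_SEARCH}?{urlencode(query)}"

bitcoin_url = build_search_url("Bitcoin")
ethereum_url = build_search_url("ethereum")
print(bitcoin_url)
print(ethereum_url)
```

Extra parameters such as `size=100` or `after="1d"`, which appear later in this notebook, can be passed as keyword arguments.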

### Make requests 

In [4]:
bitcoin_res = requests.get(bitcoin_url)
ethereum_res = requests.get(ethereum_url)

In [5]:
assert bitcoin_res.status_code == 200

In [6]:
assert ethereum_res.status_code == 200

## Extract the Data 

In [7]:
json_bitcoin = bitcoin_res.json()
json_bitcoin

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'TimelyVermicelli8642',
   'author_flair_css_class': 'noob',
   'author_flair_richtext': [{'e': 'text', 't': 'redditor for 2 weeks'}],
   'author_flair_template_id': '2ec8e69e-6c36-11e9-a04b-0afb553d4ea6',
   'author_flair_text': 'redditor for 2 weeks',
   'author_flair_text_color': 'dark',
   'author_flair_type': 'richtext',
   'author_fullname': 't2_bhvkupjw',
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1619824558,
   'domain': 'self.Bitcoin',
   'full_link': 'https://www.reddit.com/r/Bitcoin/comments/n25x6o/futuristic/',
   'gildings': {},
   'id': 'n25x6o',
   'is_crosspostable': True,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': True,
   'is_self': True,
   'is_video': False,
   'link_flair_background_color': '',
   'link_flair_ri

In [8]:
json_ether = ethereum_res.json()
json_ether

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'angyts',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_ypydid5',
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1619824889,
   'domain': 'self.ethereum',
   'full_link': 'https://www.reddit.com/r/ethereum/comments/n260sg/is_there_a_link_to_donate_eth_for_india/',
   'gildings': {},
   'id': 'n260sg',
   'is_crosspostable': True,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': True,
   'is_self': True,
   'is_video': False,
   'link_flair_background_color': '',
   'link_flair_richtext': [],
   'link_flair_text_color': 'dark',
   'link_flair_type': 'text',
   'locked': False,
   'media_only': False,
   'no_follow': True,
   'num_commen

In [9]:
len(json_bitcoin['data'])

25

In [10]:
len(json_ether['data'])

25

## Convert JSON data to Pandas Dataframes

In [11]:
pd.set_option('display.max_columns', 100)

In [12]:
bc_df = pd.DataFrame(json_bitcoin['data'])
bc_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,removed_by_category,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,is_gallery,media_metadata
0,[],False,TimelyVermicelli8642,noob,"[{'e': 'text', 't': 'redditor for 2 weeks'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for 2 weeks,dark,richtext,t2_bhvkupjw,False,False,[],False,False,1619824558,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/n25x...,{},n25x6o,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/Bitcoin/comments/n25x6o/futuristic/,False,6,1619824569,1,Do you guys think central banks will use Bitco...,True,False,False,Bitcoin,t5_2s3qj,2791934,public,self,Futuristic.,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/n25x...,all_ads,6,,,,,,,,,,,,
1,[],False,throwaway46476136,noob,"[{'e': 'text', 't': 'redditor for 7 weeks'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for 7 weeks,dark,richtext,t2_aurmkdd7,False,False,[],False,False,1619824548,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/n25x...,{},n25x36,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/n25x36/rtruthoffmychest_rc...,False,6,1619824559,1,Fell for a send bitcoin and get double back sc...,True,False,False,Bitcoin,t5_2s3qj,2791933,public,self,r/Truthoffmychest r/confession,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/n25x...,all_ads,6,,,,,,,,,,,,
2,[],False,miningelectroab,,[],,,,text,t2_bv36sepi,False,False,[],False,False,1619824429,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/n25v...,{},n25vx0,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/n25vx0/best_bitcoin_mining...,False,6,1619824440,1,[removed],True,False,False,Bitcoin,t5_2s3qj,2791923,public,self,Best bitcoin mining machine,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/n25v...,all_ads,6,moderator,,,,,,,,,,,
3,[],False,Island14,,[],,,,text,t2_izm6h,False,False,[],False,False,1619823845,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/n25p...,{},n25pn1,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/n25pn1/calculator_for_how_...,False,6,1619823857,1,Does anyone know if this is a thing? I'd like...,True,False,False,Bitcoin,t5_2s3qj,2791886,public,self,Calculator for how much my stash of BTC would ...,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/n25p...,all_ads,6,,,,,,,,,,,,
4,[],False,Oathkeeper856,,[],,,,text,t2_6733mzv9,False,False,[],False,False,1619823834,i.redd.it,https://www.reddit.com/r/Bitcoin/comments/n25p...,{},n25pi8,False,False,False,True,False,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/n25pi8/saw_this_sign_at_my...,False,6,1619823845,1,,True,False,False,Bitcoin,t5_2s3qj,2791885,public,https://a.thumbs.redditmedia.com/_t2T2EHQNzC_7...,Saw this sign at my local deli!,0,[],1.0,https://i.redd.it/heygodod6ew61.jpg,all_ads,6,automod_filtered,image,"{'enabled': True, 'images': [{'id': 'jmHYjfnds...",140.0,140.0,https://i.redd.it/heygodod6ew61.jpg,,,,,,


In [13]:
eth_df = pd.DataFrame(json_ether['data'])
eth_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,removed_by_category,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,is_gallery,crosspost_parent,crosspost_parent_list
0,[],False,angyts,,[],,text,t2_ypydid5,False,False,[],False,False,1619824889,self.ethereum,https://www.reddit.com/r/ethereum/comments/n26...,{},n260sg,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/ethereum/comments/n260sg/is_there_a_link_to...,False,6,1619824901,0,I saw one to donate sats for India here https:...,True,False,False,ethereum,t5_2zf9m,829157,public,top,self,Is there a link to donate ETH for India?,0,[],0.5,https://www.reddit.com/r/ethereum/comments/n26...,all_ads,6,,,,,,,,,,,,,
1,[],False,Shinyaku88,,[],,text,t2_9umr8hr9,False,False,[],False,False,1619824459,self.ethereum,https://www.reddit.com/r/ethereum/comments/n25...,{},n25w7z,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/ethereum/comments/n25w7z/worth_to_buy_now/,False,6,1619824470,1,I want to buy some ETH but i dont know when......,True,False,False,ethereum,t5_2zf9m,829142,public,top,self,Worth to buy NOW?,0,[],1.0,https://www.reddit.com/r/ethereum/comments/n25...,all_ads,6,,,,,,,,,,,,,
2,[],False,Dr_Viv,,[],,text,t2_4hk1kv6m,False,False,[],False,False,1619823466,self.ethereum,https://www.reddit.com/r/ethereum/comments/n25...,{},n25lgc,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/ethereum/comments/n25lgc/memento_coins/,False,6,1619823477,1,So I’ve decided that each time I accumulate on...,True,False,False,ethereum,t5_2zf9m,829108,public,top,self,Memento Coins?,0,[],1.0,https://www.reddit.com/r/ethereum/comments/n25...,all_ads,6,,,,,,,,,,,,,
3,[],False,Pickinanameainteasy,,[],,text,t2_39zf3a42,False,False,[],False,False,1619823336,self.ethereum,https://www.reddit.com/r/ethereum/comments/n25...,{},n25k34,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/ethereum/comments/n25k34/is_it_possible_to_...,False,6,1619823347,1,[removed],True,False,False,ethereum,t5_2zf9m,829104,public,top,self,Is it possible to yield farm without switching...,0,[],1.0,https://www.reddit.com/r/ethereum/comments/n25...,all_ads,6,reddit,,,,,,,,,,,,
4,[],False,CrabCoin99,,[],,text,t2_bsycf3ut,False,False,[],False,False,1619822519,i.redd.it,https://www.reddit.com/r/ethereum/comments/n25...,{},n25b1g,False,False,False,True,False,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/ethereum/comments/n25b1g/criptokrown/,False,6,1619822531,1,,True,False,False,ethereum,t5_2zf9m,829082,public,top,https://a.thumbs.redditmedia.com/BHLL0rkK-zsFJ...,*** ✅✅ CriptoKrown ✅✅ ***,0,[],1.0,https://i.redd.it/4ace3myg2ew61.jpg,all_ads,6,reddit,image,"{'enabled': True, 'images': [{'id': 'Ez-cefLeu...",140.0,140.0,https://i.redd.it/4ace3myg2ew61.jpg,,,,,,,


In [14]:
subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'is_self' ]

In [15]:
bc_df[subfields].head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,Futuristic.,Do you guys think central banks will use Bitco...,Bitcoin,1619824558,TimelyVermicelli8642,0,True
1,r/Truthoffmychest r/confession,Fell for a send bitcoin and get double back sc...,Bitcoin,1619824548,throwaway46476136,0,True
2,Best bitcoin mining machine,[removed],Bitcoin,1619824429,miningelectroab,0,True
3,Calculator for how much my stash of BTC would ...,Does anyone know if this is a thing? I'd like...,Bitcoin,1619823845,Island14,0,True
4,Saw this sign at my local deli!,,Bitcoin,1619823834,Oathkeeper856,0,False


In [16]:
eth_df[subfields].head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,Is there a link to donate ETH for India?,I saw one to donate sats for India here https:...,ethereum,1619824889,angyts,0,True
1,Worth to buy NOW?,I want to buy some ETH but i dont know when......,ethereum,1619824459,Shinyaku88,0,True
2,Memento Coins?,So I’ve decided that each time I accumulate on...,ethereum,1619823466,Dr_Viv,0,True
3,Is it possible to yield farm without switching...,[removed],ethereum,1619823336,Pickinanameainteasy,0,True
4,*** ✅✅ CriptoKrown ✅✅ ***,,ethereum,1619822519,CrabCoin99,0,False


### Clean up data 

Clean up Bitcoin dataframe

In [17]:
bc_df = bc_df[subfields]
bc_df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,Futuristic.,Do you guys think central banks will use Bitco...,Bitcoin,1619824558,TimelyVermicelli8642,0,True
1,r/Truthoffmychest r/confession,Fell for a send bitcoin and get double back sc...,Bitcoin,1619824548,throwaway46476136,0,True
2,Best bitcoin mining machine,[removed],Bitcoin,1619824429,miningelectroab,0,True
3,Calculator for how much my stash of BTC would ...,Does anyone know if this is a thing? I'd like...,Bitcoin,1619823845,Island14,0,True
4,Saw this sign at my local deli!,,Bitcoin,1619823834,Oathkeeper856,0,False


In [18]:
bc_df.shape

(25, 7)

In [19]:
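# Keep self (text) posts only; .copy() avoids a SettingWithCopyWarning when the timestamp column is added.
bc_df = bc_df[bc_df['is_self'] == True].copy()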

In [20]:
bc_df['timestamp'] = bc_df['created_utc'].map(dt.date.fromtimestamp)
bc_df['timestamp'].head()

0    2021-04-30
1    2021-04-30
2    2021-04-30
3    2021-04-30
7    2021-04-30
Name: timestamp, dtype: object

Clean up Ethereum dataframe

In [21]:
eth_df = eth_df[subfields]
eth_df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,Is there a link to donate ETH for India?,I saw one to donate sats for India here https:...,ethereum,1619824889,angyts,0,True
1,Worth to buy NOW?,I want to buy some ETH but i dont know when......,ethereum,1619824459,Shinyaku88,0,True
2,Memento Coins?,So I’ve decided that each time I accumulate on...,ethereum,1619823466,Dr_Viv,0,True
3,Is it possible to yield farm without switching...,[removed],ethereum,1619823336,Pickinanameainteasy,0,True
4,*** ✅✅ CriptoKrown ✅✅ ***,,ethereum,1619822519,CrabCoin99,0,False


In [22]:
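# Keep self (text) posts only; .copy() avoids a SettingWithCopyWarning when the timestamp column is added.
eth_df = eth_df[eth_df['is_self'] == True].copy()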

In [23]:
eth_df['timestamp'] = eth_df['created_utc'].map(dt.date.fromtimestamp)
eth_df['timestamp'].head()

0    2021-04-30
1    2021-04-30
2    2021-04-30
3    2021-04-30
5    2021-04-30
Name: timestamp, dtype: object
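One caveat on the timestamp step: `dt.date.fromtimestamp` converts using the local timezone of the machine running the notebook, so a post made close to midnight can land on a different date on another machine. A minimal sketch of a timezone-independent alternative follows; `utc_date` is a hypothetical helper, not part of the original code.

```python
import datetime as dt

def utc_date(created_utc):
    """Convert a Unix timestamp in seconds to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc).date()

# created_utc of the first Bitcoin post shown earlier in this notebook.
print(utc_date(1619824558))
```

It could be swapped into the cleanup cells as `bc_df['created_utc'].map(utc_date)` if reproducible dates matter.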

## Bitcoin

In [24]:
base_bitcoin_url = 'https://api.pushshift.io/reddit/search/submission'

subreddit = 'Bitcoin'
size = 100

bc_stem = f'{base_bitcoin_url}?subreddit={subreddit}&size={size}'

In [25]:
bc_stem

'https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin&size=100'

In [26]:
bc_res = requests.get(bc_stem)
assert bc_res.status_code == 200
json_bitcoin = bc_res.json()
len(json_bitcoin['data'])

100

In [27]:
days = 1
bc_url = f'{bc_stem}&after={days}d'

In [28]:
bc_url

'https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin&size=100&after=1d'

In [29]:
bc_res = requests.get(bc_url)
assert bc_res.status_code == 200
json_bitcoin = bc_res.json()
bc_df = pd.DataFrame(json_bitcoin['data'])[subfields]
bc_df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,“Bitcoin to the moon” Glass Pendants. Hope you...,,Bitcoin,1619738802,Henderson710,0,False
1,How do I transfer Bitcoin to PayPal?,I want to use the Bitcoin I have obtained from...,Bitcoin,1619738843,AdmiralArchie1,13,True
2,BITCOIN IS KING!!!🌟,All hail the King of all cryptos from the land...,Bitcoin,1619739018,Juderedd,5,True
3,Car and Driver magazine references Bitcoin in ...,,Bitcoin,1619739097,Lost-Explorer,1,False
4,BTC dominance （why this year is different?）,"Hi\nI read online, as well as youtube, that ma...",Bitcoin,1619739110,AllenDo,14,True


In [30]:
bc_df['created_utc'].map(dt.date.fromtimestamp).head()

0    2021-04-29
1    2021-04-29
2    2021-04-29
3    2021-04-29
4    2021-04-29
Name: created_utc, dtype: object

## Ethereum

In [31]:
base_ether_url = 'https://api.pushshift.io/reddit/search/submission'

subreddit = 'ethereum'
size = 100

eth_stem = f'{base_ether_url}?subreddit={subreddit}&size={size}'

In [32]:
eth_stem

'https://api.pushshift.io/reddit/search/submission?subreddit=ethereum&size=100'

In [33]:
eth_res = requests.get(eth_stem)
assert eth_res.status_code == 200
json_ether = eth_res.json()
eth_df = pd.DataFrame(json_ether['data'])[subfields]
eth_df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,Is there a link to donate ETH for India?,I saw one to donate sats for India here https:...,ethereum,1619824889,angyts,0,True
1,Worth to buy NOW?,I want to buy some ETH but i dont know when......,ethereum,1619824459,Shinyaku88,0,True
2,Memento Coins?,So I’ve decided that each time I accumulate on...,ethereum,1619823466,Dr_Viv,0,True
3,Is it possible to yield farm without switching...,[removed],ethereum,1619823336,Pickinanameainteasy,0,True
4,*** ✅✅ CriptoKrown ✅✅ ***,,ethereum,1619822519,CrabCoin99,0,False


In [34]:
eth_df['created_utc'].map(dt.date.fromtimestamp).head()

0    2021-04-30
1    2021-04-30
2    2021-04-30
3    2021-04-30
4    2021-04-30
Name: created_utc, dtype: object

## Automating Pull Requests

Put all of the above steps together in a function. To make sure this function doesn't hammer the Pushshift API with a ton of requests all at once, a 2-second delay is added after each successful request, in line with best practices.

In [35]:
def query_pushshift(subreddit, kind='submission', day_window=1, n=250):
    # Fields kept for submissions ('score' is kept here in addition to the earlier subfields).
    COLUMNS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']

    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}"
    stem = f"{BASE_URL}?subreddit={subreddit}&size=100"

    posts_list = []

    for i in range(1, n + 1):
        # Each request asks for up to 100 posts made after (day_window * i) days ago.
        URL = "{}&after={}d".format(stem, day_window * i)
        print("Querying from: " + URL)
        try:
            response = requests.get(URL)
            assert response.status_code == 200
        except (requests.RequestException, AssertionError):
            # Skip this window if the request fails or returns a non-200 status.
            continue
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        posts_list.append(df)
        # Pause between requests so the API is not flooded.
        time.sleep(2)

    full = pd.concat(posts_list, sort=False)

    if kind == "submission":
        full = full[COLUMNS]
        full = full.drop_duplicates()
        # Keep self (text) posts only; .copy() avoids a SettingWithCopyWarning below.
        full = full.loc[full['is_self'] == True].copy()

    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)

    print("Query Complete!")
    return full

# Used Gwen's function to scrape data with a few tweaks of my own.
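To see what the loop in `query_pushshift` requests, the URL pattern can be generated on its own without touching the network. This is a sketch only; `window_urls` is a hypothetical helper that mirrors the `after={day_window * i}d` pattern used in the function.

```python
def window_urls(stem, day_window=1, n=250):
    """List the URLs the loop would request, one per look-back window."""
    return [f"{stem}&after={day_window * i}d" for i in range(1, n + 1)]

stem = "https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin&size=100"
urls = window_urls(stem, day_window=1, n=3)
for url in urls:
    print(url)
```

With the defaults (`n = 250`, `size=100`), the function makes up to 250 requests and collects at most 25,000 rows before duplicates and non-self posts are dropped. The 2-second pauses alone add up to about 500 seconds if every request succeeds.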

In [40]:
bc_df.shape

(100, 7)

In [41]:
bc_df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,is_self
0,“Bitcoin to the moon” Glass Pendants. Hope you...,,Bitcoin,1619738802,Henderson710,0,False
1,How do I transfer Bitcoin to PayPal?,I want to use the Bitcoin I have obtained from...,Bitcoin,1619738843,AdmiralArchie1,13,True
2,BITCOIN IS KING!!!🌟,All hail the King of all cryptos from the land...,Bitcoin,1619739018,Juderedd,5,True
3,Car and Driver magazine references Bitcoin in ...,,Bitcoin,1619739097,Lost-Explorer,1,False
4,BTC dominance （why this year is different?）,"Hi\nI read online, as well as youtube, that ma...",Bitcoin,1619739110,AllenDo,14,True


In [39]:
bc_df.to_csv('./data/bitcoin.csv', index=False)

In [None]:
eth_results = query_pushshift('ethereum')

In [None]:
eth_results.shape

In [None]:
eth_results.to_csv('./data/ethereum.csv', index=False)

In [None]:
eth_results.head()

After pulling data from these two subreddits, we will have roughly 14k posts in total, about 7k from each subreddit.