<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Part-2:-Data-Cleaning" data-toc-modified-id="Part-2:-Data-Cleaning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 2: Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import Libraries</a></span></li><li><span><a href="#Loading-the-data" data-toc-modified-id="Loading-the-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Loading the data</a></span></li><li><span><a href="#Selecting-Columns" data-toc-modified-id="Selecting-Columns-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Selecting Columns</a></span></li><li><span><a href="#Removed,-Deleted-or-Missing-Selftexts" data-toc-modified-id="Removed,-Deleted-or-Missing-Selftexts-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Removed, Deleted or Missing Selftexts</a></span></li><li><span><a href="#Checking-for-duplicates" data-toc-modified-id="Checking-for-duplicates-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Checking for duplicates</a></span></li><li><span><a href="#Low-Word-Counts" data-toc-modified-id="Low-Word-Counts-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Low Word Counts</a></span></li><li><span><a href="#Moderator-Posts" data-toc-modified-id="Moderator-Posts-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Moderator Posts</a></span></li><li><span><a href="#Bot-Posts" data-toc-modified-id="Bot-Posts-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Bot Posts</a></span></li><li><span><a href="#Clean-up-Text" data-toc-modified-id="Clean-up-Text-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Clean up Text</a></span></li><li><span><a href="#Checking-Class-Balance" data-toc-modified-id="Checking-Class-Balance-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Checking Class Balance</a></span></li><li><span><a href="#Writing-to-file" 
data-toc-modified-id="Writing-to-file-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Writing to file</a></span></li></ul></li></ul></div>

# Part 2: Data Cleaning

## Import Libraries

In [1]:
# import packages
import pandas as pd
import numpy as np
import missingno as msno
from pathlib import Path
from pprint import pprint

pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100) # adjust number of rows visible 

## Loading the data

In [2]:
# get data folder path
p = Path.cwd() / "data" / "submissions"

# collect all CSV files in the folder
files = [f for f in p.glob('*') if f.suffix.lower() == '.csv']

# generate dataframe with all 3 years combined
raw = pd.concat([pd.read_csv(f) for f in files], axis='rows', ignore_index=True)

# check nulls, data types
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 89 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  10000 non-null  object 
 1   allow_live_comments            10000 non-null  bool   
 2   author                         10000 non-null  object 
 3   author_flair_css_class         1091 non-null   object 
 4   author_flair_richtext          9832 non-null   object 
 5   author_flair_template_id       1086 non-null   object 
 6   author_flair_text              1130 non-null   object 
 7   author_flair_text_color        1562 non-null   object 
 8   author_flair_type              9832 non-null   object 
 9   author_fullname                9832 non-null   object 
 10  author_is_blocked              208 non-null    object 
 11  author_patreon_flair           9832 non-null   object 
 12  author_premium                 9832 non-null   

In [3]:
# Preview data
raw.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,distinguished,suggested_sort,link_flair_text,gallery_data,is_gallery,media_metadata,author_flair_background_color,banned_by,author_cakeday,crosspost_parent,crosspost_parent_list,link_flair_template_id,poll_data,event_end,event_is_live,event_start,edited,gilded
0,[],False,theremnanthodl,noob,"[{'e': 'text', 't': 'redditor for a day'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for a day,dark,richtext,t2_dg8srid3,False,False,False,[],False,False,1626939006,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/op915k/bitcoin_town_a_fiction_novel_about_using_bitcoin/,{},op915k,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/op915k/bitcoin_town_a_fiction_novel_about_using_bitcoin/,False,6,reddit,1626939018,1,[removed],True,False,False,Bitcoin,t5_2s3qj,3206851,public,self,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/op915k/bitcoin_town_a_fiction_novel_about_using_bitcoin/,all_ads,6,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,[],False,theremnanthodl,,[],,,,text,t2_dg8srid3,False,False,False,[],False,False,1626938084,self.Bitcoin,https://www.reddit.com/r/Bitcoin/comments/op8uoi/bitcoin_town_a_fiction_novel_about_using_bitcoin/,{},op8uoi,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Bitcoin/comments/op8uoi/bitcoin_town_a_fiction_novel_about_using_bitcoin/,False,6,moderator,1626938095,1,[removed],True,False,False,Bitcoin,t5_2s3qj,3206830,public,self,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,0,[],1.0,https://www.reddit.com/r/Bitcoin/comments/op8uoi/bitcoin_town_a_fiction_novel_about_using_bitcoin/,all_ads,6,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,[],False,ReadDailyCoin,noob,"[{'e': 'text', 't': 'redditor for 3 months'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for 3 months,dark,richtext,t2_bmm97n7n,False,False,False,[],False,False,1626937970,dailycoin.com,https://www.reddit.com/r/Bitcoin/comments/op8tvn/crypto_influencers_dorsey_woods_and_musk_faceoff/,{},op8tvn,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Bitcoin/comments/op8tvn/crypto_influencers_dorsey_woods_and_musk_faceoff/,False,6,,1626937981,1,,True,False,False,Bitcoin,t5_2s3qj,3206827,public,https://b.thumbs.redditmedia.com/JRvcyvkzAnY7NmTPnBFJiso0LwaRPVEZJSluXm0opHE.jpg,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",0,[],1.0,https://dailycoin.com/crypto-influencers-dorsey-woods-and-musk-face-off-during-b-word-conference/,all_ads,6,link,"{'enabled': False, 'images': [{'id': 'rYyDll03SLRxY1iij5Ydst_cOFjkGdfSF2xRhMIVeoQ', 'resolutions': [{'height': 56, 'url': 'https://external-preview.redd.it/XzSXc3nM6iRJfDIDw4-OGx1wafXeW43THH4S0txyeFs.jpg?width=108&amp;crop=smart&amp;auto=webp&amp;s=4eadd70c3ee9662f9e951d3132a71efc2d31357d', 'width': 108}, {'height': 113, 'url': 'https://external-preview.redd.it/XzSXc3nM6iRJfDIDw4-OGx1wafXeW43THH4S0txyeFs.jpg?width=216&amp;crop=smart&amp;auto=webp&amp;s=32b9ba18ce6246a7cf0bf7d170b85085543ac8b8', 'width': 216}, {'height': 168, 'url': 'https://external-preview.redd.it/XzSXc3nM6iRJfDIDw4-OGx1wafXeW43THH4S0txyeFs.jpg?width=320&amp;crop=smart&amp;auto=webp&amp;s=b414137e068d115a2045e228f007f23fba472d6b', 'width': 320}, {'height': 336, 'url': 'https://external-preview.redd.it/XzSXc3nM6iRJfDIDw4-OGx1wafXeW43THH4S0txyeFs.jpg?width=640&amp;crop=smart&amp;auto=webp&amp;s=e1bba2ee4e1724320834991c82282615d1f58885', 'width': 640}, {'height': 504, 'url': 'https://external-preview.redd.it/XzSXc3nM...",73.0,140.0,https://dailycoin.com/crypto-influencers-dorsey-woods-and-musk-face-off-during-b-word-conference/,,,,,,,,,,,,,,,,,,,,,,
3,[],False,theloiteringlinguist,,[],,,,text,t2_7em1h7ph,False,False,False,[],False,False,1626937137,youtu.be,https://www.reddit.com/r/Bitcoin/comments/op8nl2/elon_musks_view_on_bitcoin_july_21_2021/,{},op8nl2,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,2,0,False,all_ads,/r/Bitcoin/comments/op8nl2/elon_musks_view_on_bitcoin_july_21_2021/,False,6,,1626937148,1,,True,False,False,Bitcoin,t5_2s3qj,3206812,public,https://b.thumbs.redditmedia.com/oJQ44gvkWlTFIa_XCgPa9vbg_-PjLRjovYhyDS9RP-I.jpg,Elon Musk’s View on Bitcoin (July 21 2021),0,[],1.0,https://youtu.be/7pLusWKO86Y,all_ads,6,rich:video,"{'enabled': False, 'images': [{'id': '6cZo55d1-fJw2eTIsZTSiQVPTUO2BbJOZ9AqfssCUVo', 'resolutions': [{'height': 81, 'url': 'https://external-preview.redd.it/LrHYJCIfFe0L9won-9MiddJQ4ngw0Xsp9l87hah9DFo.jpg?width=108&amp;crop=smart&amp;auto=webp&amp;s=83e22c60f6bcc5c16f5efb258067512dc57a7e31', 'width': 108}, {'height': 162, 'url': 'https://external-preview.redd.it/LrHYJCIfFe0L9won-9MiddJQ4ngw0Xsp9l87hah9DFo.jpg?width=216&amp;crop=smart&amp;auto=webp&amp;s=ca87c58fbf800b24419dfb38ee24913f13411cf9', 'width': 216}, {'height': 240, 'url': 'https://external-preview.redd.it/LrHYJCIfFe0L9won-9MiddJQ4ngw0Xsp9l87hah9DFo.jpg?width=320&amp;crop=smart&amp;auto=webp&amp;s=2eadf963a177da7bdf9a755422bb13c46e9bc610', 'width': 320}], 'source': {'height': 360, 'url': 'https://external-preview.redd.it/LrHYJCIfFe0L9won-9MiddJQ4ngw0Xsp9l87hah9DFo.jpg?auto=webp&amp;s=0b60c9e380c2083902e2955ef37255c8534f41d9', 'width': 480}, 'variants': {}}]}",105.0,140.0,https://youtu.be/7pLusWKO86Y,"{'oembed': {'author_name': 'The Valuable Investors', 'author_url': 'https://www.youtube.com/channel/UCcdigZ5bdD_FGyeHnfW1Tbg', 'height': 200, 'html': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" 
allowfullscreen&gt;&lt;/iframe&gt;', 'provider_name': 'YouTube', 'provider_url': 'https://www.youtube.com/', 'thumbnail_height': 360, 'thumbnail_url': 'https://i.ytimg.com/vi/7pLusWKO86Y/hqdefault.jpg', 'thumbnail_width': 480, 'title': 'Elon Musk’s View on Bitcoin (July 21 2021)', 'type': 'video', 'version': '1.0', 'width': 356}, 'type': 'youtube.com'}","{'content': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" allowfullscreen&gt;&lt;/iframe&gt;', 'height': 200, 'scrolling': False, 'width': 356}","{'oembed': {'author_name': 'The Valuable Investors', 'author_url': 'https://www.youtube.com/channel/UCcdigZ5bdD_FGyeHnfW1Tbg', 'height': 200, 'html': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" allowfullscreen&gt;&lt;/iframe&gt;', 'provider_name': 'YouTube', 'provider_url': 'https://www.youtube.com/', 'thumbnail_height': 360, 'thumbnail_url': 'https://i.ytimg.com/vi/7pLusWKO86Y/hqdefault.jpg', 'thumbnail_width': 480, 'title': 'Elon Musk’s View on Bitcoin (July 21 2021)', 'type': 'video', 'version': '1.0', 'width': 356}, 'type': 'youtube.com'}","{'content': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" allowfullscreen&gt;&lt;/iframe&gt;', 'height': 200, 'media_domain_url': 'https://www.redditmedia.com/mediaembed/op8nl2', 'scrolling': False, 'width': 356}",,,,,,,,,,,,,,,,,,
4,[],False,Electronic_Chard1987,,[],,,,text,t2_994q7jme,,False,False,[],False,False,1626936557,i.redd.it,https://www.reddit.com/r/Bitcoin/comments/op8je8/youve_undoubtedly_heard_about_crypto_currencies/,{},op8je8,False,False,False,False,True,False,False,False,,[],dark,text,False,False,True,20,0,False,all_ads,/r/Bitcoin/comments/op8je8/youve_undoubtedly_heard_about_crypto_currencies/,False,6,automod_filtered,1626936567,1,,True,False,False,Bitcoin,t5_2s3qj,3206797,public,https://b.thumbs.redditmedia.com/E7_zEgQGgPc92UnBkbsqv2MLGVHnHIUHCaixtkFkIdc.jpg,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",0,[],1.0,https://i.redd.it/h0e4u1t5opc71.jpg,all_ads,6,image,"{'enabled': True, 'images': [{'id': 'UEDRUWrx-DCh0-QXIFkiRV0JB0n61MSDq-aIJH6M1oc', 'resolutions': [{'height': 64, 'url': 'https://preview.redd.it/h0e4u1t5opc71.jpg?width=108&amp;crop=smart&amp;auto=webp&amp;s=625eef1fc64031f5b77bc535fd90db3931b7165d', 'width': 108}, {'height': 129, 'url': 'https://preview.redd.it/h0e4u1t5opc71.jpg?width=216&amp;crop=smart&amp;auto=webp&amp;s=d529e17d105138596cf63a733f0510f53801c852', 'width': 216}, {'height': 191, 'url': 'https://preview.redd.it/h0e4u1t5opc71.jpg?width=320&amp;crop=smart&amp;auto=webp&amp;s=7fb8653d38101a9e0eb79217c34c8215ba42c077', 'width': 320}, {'height': 383, 'url': 'https://preview.redd.it/h0e4u1t5opc71.jpg?width=640&amp;crop=smart&amp;auto=webp&amp;s=f159c8e798331b7a89a92d8a04f659a576e3ffb3', 'width': 640}, {'height': 575, 'url': 'https://preview.redd.it/h0e4u1t5opc71.jpg?width=960&amp;crop=smart&amp;auto=webp&amp;s=1183ffa3d22e5a712b68fcb58f7f873c13155886', 'width': 960}, {'height': 646, 'url': 'https://preview.redd.it/h0e4u...",83.0,140.0,https://i.redd.it/h0e4u1t5opc71.jpg,,,,,,,,,,,,,,,,,,,,,,


## Selecting Columns

In [4]:
# Unique values for each column, to get a sense of what we have and which columns to keep
for item in raw.columns:
    num_unique = raw[item].nunique()
    item_unique = raw[item].unique()[:20]  # show at most 20 unique values
    print(item, ": ", num_unique, "values")
    print(item_unique, "\n")

all_awardings :  20 values
['[]'
 "[{'award_sub_type': 'GLOBAL', 'award_type': 'global', 'awardings_required_to_grant_benefits': None, 'coin_price': 125, 'coin_reward': 0, 'count': 1, 'days_of_drip_extension': 0, 'days_of_premium': 0, 'description': 'When you come across a feel-good thing.', 'end_date': None, 'giver_coin_reward': None, 'icon_format': None, 'icon_height': 2048, 'icon_url': 'https://i.redd.it/award_images/t5_22cerq/5izbv4fn0md41_Wholesome.png', 'icon_width': 2048, 'id': 'award_5f123e3d-4f48-42f4-9c11-e98b566d5897', 'is_enabled': True, 'is_new': False, 'name': 'Wholesome', 'penny_donate': None, 'penny_price': None, 'resized_icons': [{'height': 16, 'url': 'https://preview.redd.it/award_images/t5_22cerq/5izbv4fn0md41_Wholesome.png?width=16&amp;height=16&amp;auto=webp&amp;s=92932f465d58e4c16b12b6eac4ca07d27e3d11c0', 'width': 16}, {'height': 32, 'url': 'https://preview.redd.it/award_images/t5_22cerq/5izbv4fn0md41_Wholesome.png?width=32&amp;height=32&amp;auto=webp&amp;s=d11484

is_created_from_ads_ui :  1 values
[False] 

is_crosspostable :  2 values
[False  True] 

is_meta :  1 values
[False] 

is_original_content :  1 values
[False] 

is_reddit_media_domain :  2 values
[False  True] 

is_robot_indexable :  2 values
[False  True] 

is_self :  2 values
[ True False] 

is_video :  2 values
[False  True] 

link_flair_background_color :  0 values
[nan] 

link_flair_richtext :  3 values
['[]' "[{'e': 'text', 't': 'comment as submission'}]"
 "[{'e': 'text', 't': 'off topic'}]"] 

link_flair_text_color :  1 values
['dark'] 

link_flair_type :  2 values
['text' 'richtext'] 

locked :  1 values
[False] 

media_only :  1 values
[False] 

no_follow :  2 values
[ True False] 

num_comments :  254 values
[  0   1   2  20   7   5   4  41   3 715  19  33   8  16  27  30  17  22
  47  32] 

num_crossposts :  4 values
[0 1 2 5] 

over_18 :  2 values
[False  True] 

parent_whitelist_status :  1 values
['all_ads'] 

permalink :  10000 values
['/r/Bitcoin/comments/op915k/bitcoi

title :  9125 values
['Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset'
 'Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference'
 'Elon Musk’s View on Bitcoin (July 21 2021)'
 'You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.'
 'best crypto video ive ever watched'
 'what moves crypto market apart from the speculators'
 'This is the newest project of him?'
 'Is it advisable to use P2P when buying bitcoin ?'
 'Only morons post about Elon Musk, SpaceX, or Tesla'
 'Help starting crypto business'
 'Did Jack Dorsey confirm or deflect on taking BTC for advertising ?'
 'Air Drop 2 | Free Crypto earning |'
 '⚡ Lightning Thursday! July 22, 2021: Explore the Lightning Network!⚡'
 'Tesla and SpaceX own B

media_embed :  730 values
[nan
 '{\'content\': \'&lt;iframe width="356" height="200" src="https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;\', \'height\': 200, \'scrolling\': False, \'width\': 356}'
 '{\'content\': \'&lt;iframe width="356" height="200" src="https://www.youtube.com/embed/WgyM0tQ0Hfs?start=30&amp;feature=oembed&amp;enablejsapi=1" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;\', \'height\': 200, \'scrolling\': False, \'width\': 356}'
 '{\'content\': \'&lt;iframe width="356" height="200" src="https://www.youtube.com/embed/b8Yqc91H2LI?feature=oembed&amp;enablejsapi=1" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/

In [5]:
# selecting columns (copy so that later column assignments do not trigger SettingWithCopyWarning)
raw = raw[['author', 'created_utc', 'title', 'selftext', 'subreddit']].copy()

In [6]:
# checking variation within the columns
for item in raw.columns:
    print(item, "-"*100)
    print(raw[item].value_counts(normalize=True), '\n\n') 

author ----------------------------------------------------------------------------------------------------
[deleted]               0.0168
twigwam                 0.0137
adminalex360            0.0064
simplelifestyle         0.0051
Zalkifl_Savage          0.0042
                         ...  
kapilan410              0.0001
alanwatts1              0.0001
sudev29                 0.0001
Shoddy_Wrangler8144     0.0001
Environmental-Ad6193    0.0001
Name: author, Length: 5626, dtype: float64 


created_utc ----------------------------------------------------------------------------------------------------
1626753353    0.0002
1626707141    0.0002
1626640338    0.0002
1626115668    0.0002
1626257770    0.0002
               ...  
1626109284    0.0001
1625990954    0.0001
1623654092    0.0001
1626287463    0.0001
1626703871    0.0001
Name: created_utc, Length: 9976, dtype: float64 


title ----------------------------------------------------------------------------------------------------
Tit

In [7]:
# check nulls and data types after change
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       10000 non-null  object
 1   created_utc  10000 non-null  int64 
 2   title        10000 non-null  object
 3   selftext     5155 non-null   object
 4   subreddit    10000 non-null  object
dtypes: int64(1), object(4)
memory usage: 390.8+ KB


## Removed, Deleted or Missing Selftexts

In [8]:
display(raw[raw['selftext'] == '[removed]' ].head())
print(raw[raw['selftext'] == '[removed]' ].shape)

Unnamed: 0,author,created_utc,title,selftext,subreddit
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],Bitcoin
1,theremnanthodl,1626938084,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],Bitcoin
9,monoslim,1626935783,"Only morons post about Elon Musk, SpaceX, or Tesla",[removed],Bitcoin
16,Chickfizz-eats-memes,1626933429,Will eth2 affect bitcoin/bitcoin mining?,[removed],Bitcoin
17,orchidkart,1626933358,Create your Token in 3 easy steps with SuperToken,[removed],Bitcoin


(3061, 5)


In [9]:
display(raw[raw['selftext'] == '[deleted]' ].head())
print(raw[raw['selftext'] == '[deleted]' ].shape)

Unnamed: 0,author,created_utc,title,selftext,subreddit
1499,[deleted],1626542162,US city plans to accept Bitcoin for tax payments. The young mayor wants to enable Jackson residents pay their property tax with Bitcoin.,[deleted],Bitcoin
2189,[deleted],1626349499,Robert Kiyosaki: This is When To Buy Bitcoin,[deleted],Bitcoin
3382,[deleted],1626032791,International auctioneer Sotheby’s will accept Bitcoin as payment for the sale of a 101.38-carat diamond,[deleted],Bitcoin
3531,[deleted],1625990271,“Bitcoin is a Miracle”,[deleted],Bitcoin
3910,[deleted],1625861161,Comedian Interviews a citizen of Liberland - The first country built on the block chain - Thoughts?,[deleted],Bitcoin


(140, 5)


In [10]:
display(raw[raw['selftext'].isnull()].head())
print(raw[raw['selftext'].isnull() ].shape)

Unnamed: 0,author,created_utc,title,selftext,subreddit
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin
5,FarEnergy3518,1626936554,best crypto video ive ever watched,,Bitcoin
7,jamesonisraela,1626936215,This is the newest project of him?,,Bitcoin


(4845, 5)
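
The three checks above can also be summarised in a single table. The sketch below is one way to do it, not the code used in this notebook: `selftext_status` is a hypothetical helper and the toy Series stands in for `raw['selftext']`. With the counts shown above (3061 removed, 140 deleted, 4845 missing), 1954 of the 10000 posts would fall in the remaining "present" group.

```python
import numpy as np
import pandas as pd


def selftext_status(selftext: pd.Series) -> pd.Series:
    """Label each selftext as 'removed', 'deleted', 'missing' or 'present'.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    status = pd.Series("present", index=selftext.index)
    status[selftext == "[removed]"] = "removed"
    status[selftext == "[deleted]"] = "deleted"
    status[selftext.isnull()] = "missing"
    return status


# toy values standing in for raw['selftext'] (made up for this example)
toy = pd.Series(["[removed]", np.nan, "[deleted]", "Some actual post body", np.nan])
status_counts = selftext_status(toy).value_counts()
print(status_counts)
```

On the real data this would be called as `selftext_status(raw['selftext']).value_counts()`.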


In [11]:
raw.head()

Unnamed: 0,author,created_utc,title,selftext,subreddit
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],Bitcoin
1,theremnanthodl,1626938084,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],Bitcoin
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin


Although the body of these posts is missing, the title is still useful data for learning about the two topics. \
Hence, instead of dropping these rows, let's convert the missing or placeholder selftexts to empty strings and combine the title and selftext into one column.

In [12]:
# blank out the [removed] / [deleted] placeholders and nulls, then combine title and selftext
raw['selftext'] = raw['selftext'].str.replace('[removed]', "", regex=False)
raw['selftext'] = raw['selftext'].str.replace('[deleted]', "", regex=False)
raw['selftext'] = raw['selftext'].fillna("")
raw['text'] = raw['title'] + " " + raw['selftext']       
print(raw.shape)
display(raw.head(10))

(10000, 6)


Unnamed: 0,author,created_utc,title,selftext,subreddit,text
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset
1,theremnanthodl,1626938084,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference"
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021)
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem."
5,FarEnergy3518,1626936554,best crypto video ive ever watched,,Bitcoin,best crypto video ive ever watched
6,hawk-fe,1626936533,what moves crypto market apart from the speculators,"I would like to know if there is anything that moves crypto market apart from the speculators, or are cryptocurrencies and their prices absolutely speculative?",Bitcoin,"what moves crypto market apart from the speculators I would like to know if there is anything that moves crypto market apart from the speculators, or are cryptocurrencies and their prices absolutely speculative?"
7,jamesonisraela,1626936215,This is the newest project of him?,,Bitcoin,This is the newest project of him?
8,Idontknow881,1626936148,Is it advisable to use P2P when buying bitcoin ?,,Bitcoin,Is it advisable to use P2P when buying bitcoin ?
9,monoslim,1626935783,"Only morons post about Elon Musk, SpaceX, or Tesla",,Bitcoin,"Only morons post about Elon Musk, SpaceX, or Tesla"
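
The placeholder handling above can be wrapped in a small function so that it is easy to test on toy data. This is a sketch under assumptions rather than the notebook's own method: `combine_title_and_selftext` and `PLACEHOLDERS` are hypothetical names. Unlike the cell above, it only blanks cells that consist entirely of a placeholder, and it strips the trailing space left on title-only posts. Both are choices of this sketch.

```python
import numpy as np
import pandas as pd

PLACEHOLDERS = ("[removed]", "[deleted]")  # markers seen in the selftext column


def combine_title_and_selftext(df: pd.DataFrame) -> pd.Series:
    """Return title + selftext, treating placeholders and NaN as empty text.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    selftext = df["selftext"].fillna("")
    # blank only cells that are exactly a placeholder; real bodies are untouched
    selftext = selftext.where(~selftext.isin(PLACEHOLDERS), "")
    # strip so that title-only posts carry no trailing space
    return (df["title"] + " " + selftext).str.strip()


# toy frame with one row per case (made-up values)
toy = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C", "Post D"],
    "selftext": ["[removed]", "[deleted]", np.nan, "body text"],
})
combined = combine_title_and_selftext(toy)
print(combined.tolist())
```

Stripping matters for the duplicate check that follows, because two otherwise identical texts that differ only in trailing whitespace would not be treated as duplicates.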


## Checking for duplicates

In [13]:
raw[raw['text'].duplicated()]

Unnamed: 0,author,created_utc,title,selftext,subreddit,text
1,theremnanthodl,1626938084,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset
45,Heavy_Ad_5725,1626922939,Someone knows idc global station?,,Bitcoin,Someone knows idc global station?
50,GimmieLu,1626920574,The next generation of rewarding buy-back tokens Launch: July 22nd | 6PM UTC,,Bitcoin,The next generation of rewarding buy-back tokens Launch: July 22nd | 6PM UTC
79,Lumpy_Brilliant9252,1626913081,Roadmap blockchain,,Bitcoin,Roadmap blockchain
82,MotherPop9,1626912063,Roadmap blockchain,,Bitcoin,Roadmap blockchain
...,...,...,...,...,...,...
9972,HenryWalker4358,1623417612,😍😍,,ethereum,😍😍
9980,AyuChuya,1623414166,marketing,,ethereum,marketing
9981,AyuChuya,1623414143,marketing,,ethereum,marketing
9983,AyuChuya,1623413883,marketing,,ethereum,marketing


In [14]:
# Drop duplicate texts, keeping the first occurrence
raw = raw.drop_duplicates(subset=['text'])
print(raw.shape)

(9207, 6)
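
Going from 10000 to 9207 rows means 793 repeated texts were dropped. As a minimal illustration of how `duplicated` and `drop_duplicates` treat repeats, here is a sketch on a made-up frame (the rows are invented for the example):

```python
import pandas as pd

# toy frame with one repeated text (made-up rows)
toy = pd.DataFrame({
    "author": ["a1", "a2", "a3", "a4"],
    "text": ["Roadmap blockchain", "marketing", "Roadmap blockchain", "How to buy Bitcoin"],
})

# keep=False flags every member of a duplicated group, not just the repeats
all_copies = toy[toy["text"].duplicated(keep=False)]

# the default keep='first' flags only the later repeats
n_repeats = int(toy["text"].duplicated().sum())

# drop_duplicates also defaults to keep='first', so the earliest row survives
deduped = toy.drop_duplicates(subset=["text"])
print(len(toy), "->", len(deduped), f"({n_repeats} repeats dropped)")
```

Using `keep=False` is handy for inspecting whole groups of reposts before deciding to drop them, since the default view hides the first copy of each group.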


## Low Word Counts

In [15]:
# number of whitespace-separated tokens in each text
raw['word_count'] = raw['text'].str.split().str.len()
raw.head()

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,13
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",10
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),8
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",54
5,FarEnergy3518,1626936554,best crypto video ive ever watched,,Bitcoin,best crypto video ive ever watched,6


In [16]:
# One word texts
raw[raw['word_count'] == 1].head(20)

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
41,Reasonable_Concern66,1626923623,W3,,Bitcoin,W3,1
92,noahbrown20020,1626909863,JEPINVESTMENT,,Bitcoin,JEPINVESTMENT,1
193,moty_k6,1626892262,Live,,Bitcoin,Live,1
257,TheGanjaman1966,1626880429,Pi,,Bitcoin,Pi,1
364,cryptoragsdesign,1626862041,BTD!,,Bitcoin,BTD!,1
410,flamemeifyoucan,1626846104,CAGR,,Bitcoin,CAGR,1
414,Valuable-Pepper6582,1626844207,CRYPTO,,Bitcoin,CRYPTO,1
425,BigScuter,1626841296,Crazy,,Bitcoin,Crazy,1
469,Subject_Advertising7,1626829777,Earn,,Bitcoin,Earn,1
527,TWBBBB,1626816470,Clusters,,Bitcoin,Clusters,1


In [17]:
# Two word texts
raw[raw['word_count'] == 2].head(20)

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
54,DonLambrezi,1626919874,Bitcoin family,,Bitcoin,Bitcoin family,2
69,Alternative-Reason13,1626915998,Roadmap blockchain,,Bitcoin,Roadmap blockchain,2
102,Igorglavatskiy,1626907027,Shiba 📈🚀,,Bitcoin,Shiba 📈🚀,2
107,bandg1987,1626905830,Crypto website,,Bitcoin,Crypto website,2
152,TheLuckyLeandro,1626898506,Bitcoin Update,,Bitcoin,Bitcoin Update,2
215,Gnxzz,1626887587,Triangle? 👁👁,,Bitcoin,Triangle? 👁👁,2
250,dariodelasvegas,1626882029,KEEP HODLING,,Bitcoin,KEEP HODLING,2
251,dariodelasvegas,1626881960,Keep HODLING,,Bitcoin,Keep HODLING,2
278,Major_Bandicoot_3239,1626877292,Simply beautiful.,,Bitcoin,Simply beautiful.,2
297,desatur,1626875033,AngryB Global,,Bitcoin,AngryB Global,2


In [18]:
# Three word texts
raw[raw['word_count'] == 3].head(20)

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
65,serajeas,1626917199,"Nice try, Elon!",,Bitcoin,"Nice try, Elon!",3
118,rlg626,1626904169,Crypto Scholarships (Faucet),,Bitcoin,Crypto Scholarships (Faucet),3
188,stavinlawrence,1626892776,Choose your fighter,,Bitcoin,Choose your fighter,3
236,Trader1234picks,1626884053,Satoshi Nakamoto's Identity,,Bitcoin,Satoshi Nakamoto's Identity,3
246,Savinox,1626882628,Relating? Be honest.,,Bitcoin,Relating? Be honest.,3
255,Instawalletpay,1626881193,Insta wallet pay,,Bitcoin,Insta wallet pay,3
408,gagaw1010,1626846653,FullSend BTC: 3GURNyz9bhd34VSGGYKStPno39BZg2Df7x,,Bitcoin,FullSend BTC: 3GURNyz9bhd34VSGGYKStPno39BZg2Df7x,3
449,Waxytallk,1626833466,Bitcoin &lt; Zedrun,,Bitcoin,Bitcoin &lt; Zedrun,3
462,Howie_sheila_,1626830904,Earn from home,,Bitcoin,Earn from home,3
470,Alternative-Reason13,1626829612,Rastreabilidade no mar,,Bitcoin,Rastreabilidade no mar,3


In [19]:
# Four word texts
raw[raw['word_count'] == 4].head(20)

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
29,Marshall_Matherz,1626927049,Crypto in the UAE!,,Bitcoin,Crypto in the UAE!,4
40,fox69r,1626923780,Scammer on the prowl,,Bitcoin,Scammer on the prowl,4
109,BTC-Code,1626905678,Tesla accepting BTC Again,,Bitcoin,Tesla accepting BTC Again,4
138,TheInsidiousOutfield,1626901345,How to buy Bitcoin,,Bitcoin,How to buy Bitcoin,4
150,BlockHiveIO,1626899019,Blockhive Market Update 7/21/21,,Bitcoin,Blockhive Market Update 7/21/21,4
158,MQplaya,1626897589,Full B Word Conference,,Bitcoin,Full B Word Conference,4
165,Ecstatic-Size1450,1626895835,finding nike European supplier,,Bitcoin,finding nike European supplier,4
173,Suitable_Appeal_3859,1626894718,Let’s to the moon,,Bitcoin,Let’s to the moon,4
181,Wild_Gazelle2159,1626893720,Hahaha made for fun,,Bitcoin,Hahaha made for fun,4
191,Academic_Ad3146,1626892329,Purchasing Bitcoin in Ireland,,Bitcoin,Purchasing Bitcoin in Ireland,4


In [35]:
# four word texts
raw[raw['word_count'] == 4].tail(20)

Unnamed: 0,author,subreddit,text,word_count,date
9751,Xx2hotReallyxX,ethereum,Idiot in a bus,4,2021-06-13 12:29:17
9761,Allenmc_,ethereum,How I feel too😁😔,4,2021-06-13 09:59:57
9788,[deleted],ethereum,Becareful of Ethereum Scam,4,2021-06-13 02:10:54
9790,Professional_Ad_3601,ethereum,Bitcoin taproot vs eth,4,2021-06-13 02:07:22
9823,NutSpreadMan,ethereum,Thats a lot man,4,2021-06-12 19:25:37
9837,Spectre_GR,ethereum,"Spotted in Toronto, Canada",4,2021-06-12 16:56:56
9855,According_Western914,ethereum,Who buys AMP ?,4,2021-06-12 14:36:47
9859,Beytrix,ethereum,Smart contracts and GDPR,4,2021-06-12 14:12:10
9860,Beytrix,ethereum,GDPR and smart contracts,4,2021-06-12 14:07:38
9863,yeshopkb,ethereum,Best Online Shopping Company,4,2021-06-12 12:31:29


With four-word texts, more topic-specific posts start to appear. \
Hence, we drop texts of three words or fewer, as they carry too little context to be useful.

In [20]:
raw = raw[raw['word_count'] > 3]
print(raw.shape)


(8230, 7)
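The threshold above can be sketched on a toy frame. This is a minimal illustration, assuming `word_count` is the number of whitespace-separated tokens in `text`; the rows in `toy` and the name `MIN_WORDS` are made up for the example and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `raw` frame (hypothetical rows).
toy = pd.DataFrame({"text": [
    "Clusters",
    "KEEP HODLING",
    "Nice try, Elon!",
    "How to buy Bitcoin",
    "Is it advisable to use P2P when buying bitcoin ?",
]})

# Assumed definition of word_count: number of whitespace-separated tokens.
toy["word_count"] = toy["text"].str.split().str.len()

MIN_WORDS = 4  # hypothetical name; equivalent to word_count > 3
kept = toy[toy["word_count"] >= MIN_WORDS]
print(kept)
```

Keeping the threshold in a named constant makes it easy to re-run the cleaning with a stricter or looser cut-off later.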


## Moderator Posts

In [21]:

def check_author_comments(author):
    """Print how many remaining rows belong to `author` and their unique texts."""
    posts = raw[raw['author'] == author]
    print("\nModerator: {}".format(author))
    print('-' * 50)
    print(posts.shape, '\n')  # .shape to see impact on remaining data
    print(posts['text'].unique())  # see the author's posts

# creating lists to loop over and scan more efficiently    
eth_moderators = ['vbuterin', 'heliumcraft', 'insomniasexx', 'publicmodlogs', 'Souptacular', 'EvanVanNess', 'ligi', 'twigwam', 'JBSchweitzer', 'edmundedgar']  
btc_moderators = ['theymos', 'BashCo', 'frankenmint', 'rbitcoin-bot', 'Aussiehash', 'ThePiachu', 'Avatar-X', 'DigitalGoose', 'theiflar', 'rBitcoinMod']
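As a complement to printing each moderator's posts, the counts can be gathered in one table. This is only a sketch on a hypothetical `toy` frame; the moderator names come from the lists above, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the author names are taken from the lists above.
toy = pd.DataFrame({
    "author": ["BashCo", "BashCo", "vbuterin", "some_user"],
    "text": ["podcast one", "podcast two", "technical AMA", "a regular post"],
})
moderators = ["theymos", "BashCo", "vbuterin"]

# Count posts per moderator; reindex so moderators with no posts show up as 0.
counts = (
    toy[toy["author"].isin(moderators)]["author"]
    .value_counts()
    .reindex(moderators, fill_value=0)
)
print(counts)
```

A table like this shows at a glance which accounts are worth inspecting before deciding what to drop.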

In [22]:
for mod in eth_moderators:
    check_author_comments(mod) 


Moderator: vbuterin
--------------------------------------------------
(1, 7) 

["Impromptu technical AMA on statelessness and Verkle trees and state expiry If anyone is interested in learning more about the details of the tech and the likely consequences to Ethereum, I'm happy to answer people's questions!\n\nLinks to get acquainted with the tech:\n\n* [https://notes.ethereum.org/@vbuterin/verkle_and_state_expiry_proposal](https://notes.ethereum.org/@vbuterin/verkle_and_state_expiry_proposal)\n* [https://notes.ethereum.org/@vbuterin/verkle\\_tree\\_eip](https://notes.ethereum.org/@vbuterin/verkle_tree_eip)\n* [https://notes.ethereum.org/@vbuterin/state_expiry_eip](https://notes.ethereum.org/@vbuterin/state_expiry_eip)"]

Moderator: heliumcraft
--------------------------------------------------
(0, 7) 

[]

Moderator: insomniasexx
--------------------------------------------------
(0, 7) 

[]

Moderator: publicmodlogs
--------------------------------------------------
(0, 7) 

[]

Mod

In [23]:
for mod in btc_moderators:
    check_author_comments(mod) 


Moderator: theymos
--------------------------------------------------
(0, 7) 

[]

Moderator: BashCo
--------------------------------------------------
(14, 7) 

['Bitcoin Rapid-Fire: Ray Youssef, CEO of Paxful - A True "Hero\'s Journey", That Brought Bitcoin to Millions '
 'Tales from the Crypt: Citadel Dispatch e0.3.1 - getting started with bitcoin mining with @diverter_nokyc, @econoalchemist, and @roninminer '
 'Tales from the Crypt: Rabbit Hole Recap: Bitcoin Week of 2021.07.12 '
 'Lightning Junkies: Exploring Worlds of Lightning Development - LNJ046 '
 'Bitcoin Rapid-Fire: Bitcoin Information Theory w/ Aaron Segal '
 'The Unhashed Podcast: Proof-of-Mom '
 'Tales from the Crypt: Citadel Dispatch e0.3.0 - bitcoin privacy and the danger of KYC with @samouraiwallet and @openoms '
 "Citizen Bitcoin: Tomer Strolight: Don't Tell Me There Are No Heroes in Bitcoin - E122 "
 "Tales from the Crypt: #262: Clean water, a prodigal son's journey, and the Bitcoin Water Trust with Scott Harrison 

In [24]:
print(raw[raw['author'] == 'AutoModerator']['text'].unique())
print(raw[raw['author'] == 'AutoModerator'].shape)

['Weekly Discussion Thread **Welcome to the Weekly Discussion. Please read the disclaimer, guidelines, and rules before participating.**\n\nDisclaimer:\n\nThough karma rules still apply, moderation is less active on this thread than on the rest of the sub. Therefore, consider all information posted here with several liberal heaps of salt, and always cross check any information you may read on this thread with known sources.\n\n## Rules:\n\n* All [sub rules](https://www.reddit.com/r/Ethereum/about/rules/) apply in this thread.\n* Discussion topics must be related to Ethereum.\n* Behave with civility and politeness. Do not use offensive, racist or homophobic language.\n* Comments will be sorted by newest first.\n\nUseful Links:\n\n* [Ethereum.org](https://ethereum.org)\n* [ETHHub](https://ethhub.io/)\n* [ETHMerge.com](https://ethmerge.com/)\n\n**Reminder**\n\n/r/ethereum is a community for discussing the technology, news, applications and community of Ethereum. **Discussion of the Ether 

In [25]:
# remove posts by moderator accounts that generally only post warnings / maintenance notices
raw = raw[~raw['author'].isin(['rBitcoinMod', 'AutoModerator', 'JBSchweitzer'])]
print(raw.shape)

(8204, 7)


## Bot Posts

In [26]:
print(raw[raw['author'] == 'crypto_bot']['text'].unique())
print(raw[raw['author'] == 'crypto_bot'].shape)

['Bitcoin Network Status Update Wednesday, July 21, 2021 ###Status of the Bitcoin network as of Wednesday, July 21, 2021 at 12:00:01 EST:\n\n**Total bitcoins:** 18,762,639.794971\n\n**Height:** 692,035\n\n**Difficulty:** 13,672,594,272,814.140625\n\n######Statistics for the past 24 hours:\n\n**Number of blocks mined:** 146\n\n**Total bitcoins output (amount sent):** 1,711,896.605606\n\n**Total fees:** 15.993174\n\n**Average time until block found:** 9 minutes, 51 seconds\n\n**Estimated hashrate:** 99,231,578,697.368088 gh/s\n\n**Current price:** US$31,959.95\n\n*Data provided by [Smartbit.com.au](https://www.smartbit.com.au). Price data provided by [Coinbase.com](https://www.coinbase.com).*\n\n***\n\n^^I ^^am ^^a ^^bot. **[^^My ^^commands](https://www.reddit.com/r/Bitcoin/comments/3an2c4/ive_been_working_on_a_bot_for_crypto_subs_like/)** ^^| ^^*/r/crypto_bot* ^^| [^^Message ^^my ^^creator](https://www.reddit.com/message/compose?to=busterroni) ^^| [^^Source ^^code](https://github.com/bu

In [27]:
raw = raw[raw['author'] != 'crypto_bot']
print(raw.shape)

(8188, 7)


## Clean up Text

In [28]:
# strip HTML entities, newlines and Reddit-specific markers (sarcasm tag, tldr, poll links)
raw['text'] = raw['text'].str.replace('&gt;', "", regex=False)
raw['text'] = raw['text'].str.replace('\n', " ", regex=False)
raw['text'] = raw['text'].str.replace('&amp;', "&", regex=False)
raw['text'] = raw['text'].str.replace('/s', "", regex=False)
raw['text'] = raw['text'].str.replace('tldr;', "", regex=False)
raw['text'] = raw['text'].str.replace('&#x200B;', "", regex=False)
raw['text'] = raw['text'].str.replace('[View Poll]', "", regex=False)

# check changes
display(raw.head(10))
display(raw.tail(10))

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,13
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",10
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),8
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",54
5,FarEnergy3518,1626936554,best crypto video ive ever watched,,Bitcoin,best crypto video ive ever watched,6
6,hawk-fe,1626936533,what moves crypto market apart from the speculators,"I would like to know if there is anything that moves crypto market apart from the speculators, or are cryptocurrencies and their prices absolutely speculative?",Bitcoin,"what moves crypto market apart from the speculators I would like to know if there is anything that moves crypto market apart from the speculators, or are cryptocurrencies and their prices absolutely speculative?",33
7,jamesonisraela,1626936215,This is the newest project of him?,,Bitcoin,This is the newest project of him?,7
8,Idontknow881,1626936148,Is it advisable to use P2P when buying bitcoin ?,,Bitcoin,Is it advisable to use P2P when buying bitcoin ?,10
9,monoslim,1626935783,"Only morons post about Elon Musk, SpaceX, or Tesla",,Bitcoin,"Only morons post about Elon Musk, SpaceX, or Tesla",9
10,TheWanderer09,1626935613,Help starting crypto business,Hi guys.\n\nI'm interested in starting a crypto business / app www.hashtaghodl.com that would take small amounts of money from your account over time and then invest the money into your favourite crypto assets \n\nSimilar to PLUM in the UK but a crypto version if anyone is familiar with it\n\nI'm looking for people who have coding knowledge in the crypto finance space or any one who could help me get started. If you're interested please reach out via DM.\n\ncheers,Bitcoin,Help starting crypto business Hi guys. I'm interested in starting a crypto business / app www.hashtaghodl.com that would take small amounts of money from your account over time and then invest the money into your favourite crypto assets Similar to PLUM in the UK but a crypto version if anyone is familiar with it I'm looking for people who have coding knowledge in the crypto finance space or any one who could help me get started. If you're interested please reach out via DM. cheers,85


Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count
9990,Alexsyo,1623409026,"The first DeFi NFT token will be tradeable on Eporio, find out how farming and NFTs can be combined",,ethereum,"The first DeFi NFT token will be tradeable on Eporio, find out how farming and NFTs can be combined",19
9991,Jimbley_Neutralon,1623408610,News updates for the top DeFi protocols,,ethereum,News updates for the top DeFi protocols,7
9992,Starlight-786,1623408561,Content of Formative brain science,,ethereum,Content of Formative brain science,5
9993,GamesInfluencer,1623406193,"TA: Ethereum Revisits $2,400, Here’s What Could Trigger More Downsides",,ethereum,"TA: Ethereum Revisits $2,400, Here’s What Could Trigger More Downsides",10
9994,ARONBOSS,1623405418,David Guetta Will Accept Bitcoin (BTC) and Ethereum (ETH) For $14 Million Flat – AronBoss,,ethereum,David Guetta Will Accept Bitcoin (BTC) and Ethereum (ETH) For $14 Million Flat – AronBoss,15
9995,rollingincrypto,1623403968,He knows whats the future.,,ethereum,He knows whats the future.,5
9996,jehleungvi,1623403940,What are the best (and safest) ways to earn passive income with Ether?,,ethereum,What are the best (and safest) ways to earn passive income with Ether?,13
9997,Dangerous_Try8644,1623402757,Why doesnt the ethereum team create an L2 solution?,"by outsourcing it to other companies, wont there be just a fragmentation?\n\nif everyone is on different l2 solutions, there wont be a unified experience. right now, you can use uniswap and aave as an example together. but if uniswap is on arbitrum and aave is on optimism, this wont be possible right?",ethereum,"Why doesnt the ethereum team create an L2 solution? by outsourcing it to other companies, wont there be just a fragmentation? if everyone is on different l2 solutions, there wont be a unified experience. right now, you can use uniswap and aave as an example together. but if uniswap is on arbitrum and aave is on optimism, this wont be possible right?",62
9998,NFTNewsToday,1623401298,Cometh⚗️ feat. APY Vision,,ethereum,Cometh⚗️ feat. APY Vision,4
9999,Melodic-Magazine-519,1623400734,Nanopool downfor anyone else?,\n\n[View Poll](https://www.reddit.com/poll/nxb3xk),ethereum,Nanopool downfor anyone else? (https://www.reddit.com/poll/nxb3xk),6
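The literal replacements above have two side effects worth noting: a plain `'/s'` replacement also touches URLs that contain `/s`, and entities such as `&lt;` are left untouched. One possible alternative is sketched below, assuming the standard-library `html.unescape` is acceptable. The helper name `clean_text` and the sample strings are hypothetical, and unlike the notebook cell this version turns `&gt;` into `>` rather than deleting it.

```python
import html
import re

import pandas as pd


def clean_text(text: str) -> str:
    """Hypothetical helper: decode HTML entities, then strip Reddit markers."""
    text = html.unescape(text)                   # &amp; -> &, &lt; -> <, &#x200B; -> zero-width space
    text = text.replace("\u200b", "")            # drop the zero-width space
    text = text.replace("[View Poll]", "")       # drop the poll label, keep the link
    text = re.sub(r"(?<!\S)/s(?!\S)", "", text)  # remove "/s" only when it is a standalone token
    text = re.sub(r"\s+", " ", text)             # newlines and runs of whitespace -> one space
    return text.strip()


toy = pd.Series([
    "Bitcoin &lt; Zedrun",
    "Great news /s\n\nsee https://example.com/stats &amp; more",
    "&#x200B;\n\n[View Poll](https://www.reddit.com/poll/nxb3xk)",
])
cleaned = toy.map(clean_text)
print(cleaned.tolist())
```

The lookaround pattern keeps `/stats` in the URL intact while still removing the standalone sarcasm tag.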


## Checking Class Balance

In [29]:
# check that the two classes are not badly imbalanced
raw['subreddit'].value_counts(normalize = True)

Bitcoin     0.523205
ethereum    0.476795
Name: subreddit, dtype: float64
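The class shares also give a reference point for later modelling: a classifier that always predicts the larger class would score the majority share. A small sketch, using hypothetical labels that roughly mirror the proportions above:

```python
import pandas as pd

# Hypothetical labels with roughly the proportions reported above.
labels = pd.Series(["Bitcoin"] * 523 + ["ethereum"] * 477, name="subreddit")

shares = labels.value_counts(normalize=True)
baseline_accuracy = shares.max()  # accuracy of always predicting the majority class

print(shares)
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
```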

In [30]:
raw['date'] = pd.to_datetime(raw['created_utc'],unit='s')
raw.head()    

Unnamed: 0,author,created_utc,title,selftext,subreddit,text,word_count,date
0,theremnanthodl,1626939006,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,13,2021-07-22 07:30:06
2,ReadDailyCoin,1626937970,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",10,2021-07-22 07:12:50
3,theloiteringlinguist,1626937137,Elon Musk’s View on Bitcoin (July 21 2021),,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),8,2021-07-22 06:58:57
4,Electronic_Chard1987,1626936557,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",54,2021-07-22 06:49:17
5,FarEnergy3518,1626936554,best crypto video ive ever watched,,Bitcoin,best crypto video ive ever watched,6,2021-07-22 06:49:14
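For reference, the conversion above treats `created_utc` as seconds since the Unix epoch and returns timezone-naive timestamps. The sketch below shows this for one value from the table, together with the optional `utc=True` variant; whether timezone-aware dates are wanted here is an assumption, not something the notebook states.

```python
import pandas as pd

created_utc = pd.Series([1626939006])  # first row of the table above

naive = pd.to_datetime(created_utc, unit="s")            # timezone-naive, as in the notebook
aware = pd.to_datetime(created_utc, unit="s", utc=True)  # same instant, explicitly tagged as UTC

print(naive[0])  # 2021-07-22 07:30:06
print(aware.dt.tz)
```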


In [31]:
raw = raw.drop(['title', 'selftext', 'created_utc'], axis='columns')
raw.head()

Unnamed: 0,author,subreddit,text,word_count,date
0,theremnanthodl,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,13,2021-07-22 07:30:06
2,ReadDailyCoin,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",10,2021-07-22 07:12:50
3,theloiteringlinguist,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),8,2021-07-22 06:58:57
4,Electronic_Chard1987,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",54,2021-07-22 06:49:17
5,FarEnergy3518,Bitcoin,best crypto video ive ever watched,6,2021-07-22 06:49:14


## Writing to file

In [32]:
raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8188 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   author      8188 non-null   object        
 1   subreddit   8188 non-null   object        
 2   text        8188 non-null   object        
 3   word_count  8188 non-null   int64         
 4   date        8188 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 383.8+ KB


In [33]:
raw.head()

Unnamed: 0,author,subreddit,text,word_count,date
0,theremnanthodl,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,13,2021-07-22 07:30:06
2,ReadDailyCoin,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",10,2021-07-22 07:12:50
3,theloiteringlinguist,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),8,2021-07-22 06:58:57
4,Electronic_Chard1987,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is here to give you answers to all those questions. We are your guide to navigate the crypto ecosystem.",54,2021-07-22 06:49:17
5,FarEnergy3518,Bitcoin,best crypto video ive ever watched,6,2021-07-22 06:49:14


In [34]:
print(raw.shape)
# write to file
# this will be useful as a starting point for experimentation
raw.to_csv('data/cleaned_data.csv', index=False)

(8188, 5)
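One thing to keep in mind when reloading the file: CSV stores the `date` column as text, so the datetime dtype is lost unless it is parsed again. A minimal round-trip sketch, using an in-memory buffer and a hypothetical `toy` frame instead of `data/cleaned_data.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same kind of `date` column.
toy = pd.DataFrame({
    "author": ["a", "b"],
    "text": ["How to buy Bitcoin", "Smart contracts and GDPR"],
    "date": pd.to_datetime([1626939006, 1623400734], unit="s"),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                         # date comes back as plain strings

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])  # date restored as datetime64

print(plain.dtypes)
print(parsed.dtypes)
```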
