<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Part-2:-Data-Cleaning" data-toc-modified-id="Part-2:-Data-Cleaning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 2: Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Loading-the-data" data-toc-modified-id="Loading-the-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading the data</a></span></li><li><span><a href="#Selecting-Columns" data-toc-modified-id="Selecting-Columns-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Selecting Columns</a></span></li><li><span><a href="#Removed-and-Deleted-Comments" data-toc-modified-id="Removed-and-Deleted-Comments-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Removed and Deleted Comments</a></span></li><li><span><a href="#Low-Character-Count-Comments" data-toc-modified-id="Low-Character-Count-Comments-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Low Character Count Comments</a></span></li><li><span><a href="#Moderator-Maintenance/Warning-Comments" data-toc-modified-id="Moderator-Maintenance/Warning-Comments-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Moderator Maintenance/Warning Comments</a></span></li><li><span><a href="#Bot-Comments" data-toc-modified-id="Bot-Comments-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Bot Comments</a></span></li><li><span><a href="#Rogue-Comments" data-toc-modified-id="Rogue-Comments-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Rogue Comments</a></span></li><li><span><a href="#Writing-to-file" data-toc-modified-id="Writing-to-file-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Writing to file</a></span></li></ul></li></ul></div>

# Part 2: Data Cleaning

In [1]:
# import packages
import pandas as pd
import numpy as np
import missingno as msno
from pathlib import Path
from pprint import pprint

pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100) # adjust number of rows visible 

## Loading the data

In [2]:
# get data folder path
p = Path.cwd() / 'data' / 'comments'

# collect every CSV file in the folder
files = [f for f in p.glob('*') if f.suffix.lower() == '.csv']

# combine all files (all 3 years) into a single dataframe
raw = pd.concat([pd.read_csv(f) for f in files], axis='rows', ignore_index=True)

# check nulls, data types
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 39 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   all_awardings                    10000 non-null  object 
 1   associated_award                 0 non-null      float64
 2   author                           10000 non-null  object 
 3   author_flair_background_color    0 non-null      float64
 4   author_flair_css_class           536 non-null    object 
 5   author_flair_richtext            7803 non-null   object 
 6   author_flair_template_id         533 non-null    object 
 7   author_flair_text                596 non-null    object 
 8   author_flair_text_color          3281 non-null   object 
 9   author_flair_type                7803 non-null   object 
 10  author_fullname                  7803 non-null   object 
 11  author_is_blocked                6405 non-null   object 
 12  author_patreon_flai
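The loading step above can be wrapped in a small helper. This is only a sketch: the function name `load_comment_csvs` and the two throwaway files are hypothetical, and the guard for an empty folder and the sorted file order are additions that the original cell does not have.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_comment_csvs(folder):
    """Concatenate every .csv file in `folder` into one DataFrame (hypothetical helper)."""
    # Sort so the row order is reproducible across runs and machines.
    files = sorted(f for f in Path(folder).glob('*') if f.suffix.lower() == '.csv')
    if not files:
        # pd.concat raises a less helpful error on an empty list, so fail early.
        raise FileNotFoundError(f"no .csv files found in {folder}")
    return pd.concat([pd.read_csv(f) for f in files], axis=0, ignore_index=True)


# Toy demonstration with two throwaway files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'body': ['a', 'b']}).to_csv(Path(tmp) / 'part1.csv', index=False)
    pd.DataFrame({'body': ['c']}).to_csv(Path(tmp) / 'part2.CSV', index=False)
    demo = load_comment_csvs(tmp)

print(demo.shape)
```

Failing early on an empty folder makes a wrong working directory easier to spot than the error `pd.concat` would raise on its own.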

In [3]:
# Preview data
raw.head(10)

Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,body,collapsed_because_crowd_control,collapsed_reason_code,comment_type,created_utc,gildings,id,is_submitter,link_id,locked,no_follow,parent_id,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,distinguished,author_cakeday
0,[],,crimeo,,,[],,,,text,t2_jwuya,False,False,False,[],"There's a very easy solution that is fiscally responsible and doesn't require any crazy revolution or new technology. And does a significantly better job of solving than bitcoin.\n\nIt's called ""raising taxes to pay for programs when you want to help the poor out.""\n\nI don't endorse printing money instead. I merely said ""It is good for the poor [so long as the economy doesn't completely collapse]"" which it is. It's not the best solution overall. Simple balanced budgets are.",,,,1626936455,{},h63m8hw,False,t3_op1972,False,False,t1_h63m62g,/r/Bitcoin/comments/op1972/with_current_fud_dont_panic_sell_bitcoins_value/h63m8hw/,1627102288,0,True,False,Bitcoin,t5_2s3qj,,0,[],,
1,[],,pablotorresv,,,[],,,,text,t2_aq2010sk,False,False,False,[],I did a swift transfer to their USD account in BVI no fees,,,,1626936437,{},h63m7qp,False,t3_ooysif,False,True,t1_h628xzl,/r/Bitcoin/comments/ooysif/i_have_high_amount_in_a_us_bank_and_i_want_buy/h63m7qp/,1627102275,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
2,[],,DankFo3ta5,,,[],,,dark,text,t2_aaotslua,False,False,False,[],He's a fuckwit,,,,1626936414,{},h63m6qs,False,t3_op5w9o,False,False,t3_op5w9o,/r/Bitcoin/comments/op5w9o/elon_musk_im_a_bitcoin_supporter_i_own_bitcoin/h63m6qs/,1627102262,6,True,False,Bitcoin,t5_2s3qj,,0,[],,
3,[],,evDev84,,,[],,,,text,t2_3l6ijbd8,False,False,False,[],&gt;Tesla was warned beforehand.\n\nDid you see that 4chan screenshot of someone calling the 30k bounce down to the minute of the day?,,,,1626936413,{},h63m6pq,False,t3_oowzq7,False,True,t1_h62gazl,/r/Bitcoin/comments/oowzq7/elon_musk_says_tesla_will_likely_start_accepting/h63m6pq/,1627102262,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
4,[],,arnaudmrtn,,,[],,,,text,t2_jj2t2tf,False,False,True,[],As I mentionned I don't see any mention of sustainability problem in your comment so I would assume you are not looking for a solution. I am!,,,,1626936398,{},h63m62g,True,t3_op1972,False,True,t1_h63ltdv,/r/Bitcoin/comments/op1972/with_current_fud_dont_panic_sell_bitcoins_value/h63m62g/,1627102252,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
5,[],,crimeo,,,[],,,,text,t2_jwuya,False,False,False,[],"&gt; Yes, and there were bankruptcies when banks failed\n\nWhat does that have anything to do with the fact you are objectively wrong about loans not existing in a world with trustless currency? Nobody ever disagreed about historical bankruptcies or even mentioned it, random off topic junk.\n\nPeople also used to wear pointy shoes. May as well bring that up.",,,,1626936364,{},h63m4pf,False,t3_ooa9hv,False,True,t1_h63llzz,/r/Bitcoin/comments/ooa9hv/buying_the_dip/h63m4pf/,1627102234,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
6,[],,Tron_Passant,,,[],,,dark,text,t2_65aqswhp,False,False,False,[],Dude that was the real CEO of Huboi...,,,,1626936334,{},h63m3h6,False,t3_oolqdb,False,True,t3_oolqdb,/r/Bitcoin/comments/oolqdb/today_i_was_able_to_scam_the_scammer_lol/h63m3h6/,1627102217,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
7,[],,smartorgs,,,[],,,dark,text,t2_afg9dc84,False,False,False,[],I don't date nocoiners,,,,1626936255,{},h63m04u,False,t3_ooa9hv,False,True,t1_h63li7f,/r/Bitcoin/comments/ooa9hv/buying_the_dip/h63m04u/,1627102170,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
8,[],,[deleted],,,,,,dark,,,False,,,[],[removed],,,,1626936225,{},h63lyxr,False,t3_op77th,False,True,t3_op77th,/r/Bitcoin/comments/op77th/daily_discussion_july_22_2021/h63lyxr/,1627102154,1,True,False,Bitcoin,t5_2s3qj,,0,[],,
9,[],,mjgill89,,,[],,,,text,t2_gnxrmsb,False,False,False,[],"this is the debug log which isn't helping:\n\n2021-07-22T03:34:46Z UpdateTip: new best=00000000000000001192dae5aea9abaac21033$\n2021-07-22T03:35:03Z UpdateTip: new best=0000000000000000103578745018fba03524a1$\n2021-07-22T03:35:04Z Imported mempool transactions from disk: 0 succeeded, 0 fa$\n2021-07-22T03:35:04Z loadblk thread exit\n2021-07-22T03:35:07Z Synchronizing blockheaders, height: 692090 (~100.00%)\n2021-07-22T03:35:10Z New outbound peer connected: version: 70016, blocks=692090$\n2021-07-22T03:35:10Z Pre-allocating up to position 0x8000000 in blk00296.dat",,,,1626936208,{},h63ly9c,True,t3_oozjjg,False,True,t3_oozjjg,/r/Bitcoin/comments/oozjjg/bitcoind_is_failing/h63ly9c/,1627102143,1,True,False,Bitcoin,t5_2s3qj,,0,[],,


In [4]:
# Drop exact duplicate rows, then confirm the row count
raw = raw.drop_duplicates()
raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 39 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   all_awardings                    10000 non-null  object 
 1   associated_award                 0 non-null      float64
 2   author                           10000 non-null  object 
 3   author_flair_background_color    0 non-null      float64
 4   author_flair_css_class           536 non-null    object 
 5   author_flair_richtext            7803 non-null   object 
 6   author_flair_template_id         533 non-null    object 
 7   author_flair_text                596 non-null    object 
 8   author_flair_text_color          3281 non-null   object 
 9   author_flair_type                7803 non-null   object 
 10  author_fullname                  7803 non-null   object 
 11  author_is_blocked                6405 non-null   object 
 12  author_patreon_flai
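Since the row count stays at 10000, the call above removed nothing. To make that explicit, duplicates can be counted before they are dropped. The sketch below uses a toy frame and assumes the `id` column identifies a comment uniquely, which the notebook does not verify.

```python
import pandas as pd

# Toy frame standing in for the raw comments (values are made up).
toy = pd.DataFrame({
    'id': ['a1', 'a1', 'b2', 'c3'],
    'body': ['hello', 'hello', 'world', 'hello'],
})

n_exact = int(toy.duplicated().sum())                # rows identical in every column
n_same_id = int(toy.duplicated(subset='id').sum())   # rows that repeat a comment id

# Keep the first row for each id; assumes `id` is a reliable key.
deduped = toy.drop_duplicates(subset='id', ignore_index=True)
print(n_exact, n_same_id, deduped.shape)
```

Deduplicating on `id` would also catch a comment that was scraped twice with a different `score` or `retrieved_on`, which a full-row comparison misses.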

## Selecting Columns

In [191]:
# Unique values per column, to get a sense of the data and decide which columns to keep
for item in raw.columns:
    num_unique = raw[item].nunique()
    item_unique = raw[item].unique()[:20]  # show at most 20 distinct values
    print(item, ": ", num_unique, "values")
    print(item_unique, "\n")

all_awardings :  23 values
['[]'
 "[{'award_sub_type': 'GLOBAL', 'award_type': 'global', 'awardings_required_to_grant_benefits': None, 'coin_price': 300, 'coin_reward': 250, 'count': 1, 'days_of_drip_extension': 0, 'days_of_premium': 0, 'description': 'Give the gift of %{coin_symbol}250 Reddit Coins.', 'end_date': None, 'giver_coin_reward': None, 'icon_format': None, 'icon_height': 2048, 'icon_url': 'https://i.redd.it/award_images/t5_22cerq/cr1mq4yysv541_CoinGift.png', 'icon_width': 2048, 'id': 'award_3dd248bc-3438-4c5b-98d4-24421fd6d670', 'is_enabled': True, 'is_new': False, 'name': 'Coin Gift', 'penny_donate': None, 'penny_price': None, 'resized_icons': [{'height': 16, 'url': 'https://preview.redd.it/award_images/t5_22cerq/cr1mq4yysv541_CoinGift.png?width=16&amp;height=16&amp;auto=webp&amp;s=7bc7d3a9d7950d9b8bfd3fe1da96c06dbd3012c4', 'width': 16}, {'height': 32, 'url': 'https://preview.redd.it/award_images/t5_22cerq/cr1mq4yysv541_CoinGift.png?width=32&amp;height=32&amp;auto=webp&amp;

In [192]:
# keep permalink so the original comment can be looked up if needed
raw = raw[['author', 'link_id', 'parent_id','total_awards_received', 'score', 'permalink','subreddit', 'body']]

In [193]:
# checking variation within the columns
for item in raw.columns:
    print(item, "-"*100)
    print(raw[item].value_counts(normalize=True), '\n\n') 

author ----------------------------------------------------------------------------------------------------
[deleted]            0.2197
coinfeeds-bot        0.0117
Perleflamme          0.0055
BigDaddyDallas       0.0045
sweetsimplecode      0.0043
                      ...  
AnonAmishGnome       0.0001
new_start_2020       0.0001
pcaversaccio         0.0001
Krative_Lifestyle    0.0001
YaBoyLaKroy          0.0001
Name: author, Length: 3623, dtype: float64 


link_id ----------------------------------------------------------------------------------------------------
t3_ooj8au    0.0947
t3_op1pqc    0.0765
t3_onina3    0.0443
t3_oowzq7    0.0301
t3_ol7z7s    0.0281
              ...  
t3_mf31ia    0.0001
t3_omb3mr    0.0001
t3_nqditu    0.0001
t3_ongabe    0.0001
t3_lan9o3    0.0001
Name: link_id, Length: 668, dtype: float64 


parent_id ----------------------------------------------------------------------------------------------------
t3_op1pqc     0.0764
t3_ooj8au     0.0249
t3_oowzq7 

In [194]:
# check nulls and data types after change
raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   author                 10000 non-null  object
 1   link_id                10000 non-null  object
 2   parent_id              10000 non-null  object
 3   total_awards_received  10000 non-null  int64 
 4   score                  10000 non-null  int64 
 5   permalink              10000 non-null  object
 6   subreddit              10000 non-null  object
 7   body                   10000 non-null  object
dtypes: int64(2), object(6)
memory usage: 703.1+ KB
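The column review that led to this selection could also be condensed into one summary table. The helper name `profile_columns` is hypothetical, and the toy frame only imitates two of the raw columns.

```python
import pandas as pd


def profile_columns(df):
    """Return dtype, non-null count and distinct-value count per column (hypothetical helper)."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'n_unique': df.nunique(),
    })


# Toy frame: one informative column and one entirely empty column.
toy = pd.DataFrame({
    'author': ['x', 'y', 'x'],
    'associated_award': [float('nan')] * 3,
})

profile = profile_columns(toy)
print(profile)
```

Columns with a `non_null` of 0 or an `n_unique` of 1 carry no information and are natural candidates to drop.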


## Removed and Deleted Comments

In [195]:
# '[removed]' and '[deleted]' bodies are placeholders with no usable text
# .info() shows how many rows they account for
lst = ['[removed]', '[deleted]']
raw[raw['body'].isin(lst)].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2194 entries, 8 to 9996
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   author                 2194 non-null   object
 1   link_id                2194 non-null   object
 2   parent_id              2194 non-null   object
 3   total_awards_received  2194 non-null   int64 
 4   score                  2194 non-null   int64 
 5   permalink              2194 non-null   object
 6   subreddit              2194 non-null   object
 7   body                   2194 non-null   object
dtypes: int64(2), object(6)
memory usage: 154.3+ KB


In [196]:
# Removing rows with '[removed]', '[deleted]' as text
raw = raw[~raw['body'].isin(lst)]

# Have about 7.8k rows after change
raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7806 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   author                 7806 non-null   object
 1   link_id                7806 non-null   object
 2   parent_id              7806 non-null   object
 3   total_awards_received  7806 non-null   int64 
 4   score                  7806 non-null   int64 
 5   permalink              7806 non-null   object
 6   subreddit              7806 non-null   object
 7   body                   7806 non-null   object
dtypes: int64(2), object(6)
memory usage: 548.9+ KB
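The placeholder filter above can be illustrated on a toy frame. The author share of `[deleted]` reported earlier (0.2197 of 10000 rows) is slightly larger than the 2194 placeholder bodies, which suggests that a few comments from deleted accounts still carry text. That reading is an inference from the counts, not something the notebook checks, and it is why the sketch filters on `body` rather than `author`.

```python
import pandas as pd

PLACEHOLDERS = ['[removed]', '[deleted]']

# Toy rows (made up): the last one has a deleted author but a surviving body.
toy = pd.DataFrame({
    'author': ['[deleted]', 'alice', 'bob', '[deleted]'],
    'body': ['[removed]', 'Real comment', '[deleted]', 'Text that outlived its account'],
})

is_placeholder = toy['body'].isin(PLACEHOLDERS)
kept = toy[~is_placeholder]
print(f"dropped {int(is_placeholder.sum())} of {len(toy)} rows")
```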


In [197]:
# check that the two subreddit classes are still roughly balanced
raw['subreddit'].value_counts(normalize = True)

ethereum    0.507302
Bitcoin     0.492698
Name: subreddit, dtype: float64

In [198]:
# Check for outliers
raw.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_awards_received,7806.0,0.007302,0.124296,0.0,0.0,0.0,0.0,5.0
score,7806.0,4.486421,19.806288,-126.0,1.0,2.0,3.0,918.0


In [199]:
# Investigate negative scores (min score = -126)
# Although unpopular, these comments still contribute to the context, so they are kept
raw[raw['score'] < 0].shape
raw[raw['score'] < 0].sort_values('score').head(10)

Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
3029,starwarsfan99201,t3_onud68,t1_h614s4c,0,-126,/r/Bitcoin/comments/onud68/just_admit_it_how_many_of_you_are_losing_money/h61uevi/,Bitcoin,"S&amp;P 500 has grown more in 10 years than BTC has in 1 year. Which is what you were challenging people on. \n\nYou said BTC's 1 year gain vs the OP's 10 year gains. \n\nAre you drunk? Because you don't even remember what you wrote, you dumb fuck."
2703,starwarsfan99201,t3_onud68,t1_h61zucl,0,-123,/r/Bitcoin/comments/onud68/just_admit_it_how_many_of_you_are_losing_money/h620ecf/,Bitcoin,You've got to check your stats again my guy. Keep getting things wrong haha. \n\nBTC hasn't even outperformed most growth stocks in the last 5 years haha. \n\nIt's a pathetic investment since it's fall in February.
2627,starwarsfan99201,t3_onud68,t1_h621a0e,0,-85,/r/Bitcoin/comments/onud68/just_admit_it_how_many_of_you_are_losing_money/h621tai/,Bitcoin,"I'm beginning to think you haven't even heard of the equity markets bahahaha \n\nYou're a classic mate. \n\nHere you go. Here's one. \n\nIt's not 5 years, but 4, because it IPO in 2017. \n\nAfterpay has outperformed BTC by 3 times over it's lifetime compared to BTC over the same timeframe. \n\nThere's plenty more. Just go and have a look. There's a big bright world out there beyond BTC and a more stable world!"
5832,suburez,t3_ooaelc,t1_h5xuski,0,-63,/r/ethereum/comments/ooaelc/only_100000_blocks_left_until_london_activates_on/h5xw0dc/,ethereum,"Aww, are you the butthurt crybaby?"
5827,suburez,t3_ooaelc,t1_h5xwdpu,0,-53,/r/ethereum/comments/ooaelc/only_100000_blocks_left_until_london_activates_on/h5xwj64/,ethereum,"Oh, well, I guess I'll point out that you are an asshole. Do what you will with that information. Jerk."
5874,suburez,t3_ooaelc,t1_h5x6u9w,0,-50,/r/ethereum/comments/ooaelc/only_100000_blocks_left_until_london_activates_on/h5xo8hj/,ethereum,"Holy crap, ETH is currently at 1776! Amazing! Wish I could buy one. Freaking cool, just the symbolism of it all. What if ETH is priced at 1776 when the fork happens?! Would that be cool, or would it be super duper freaking cool."
2619,starwarsfan99201,t3_onud68,t1_h621kls,0,-32,/r/Bitcoin/comments/onud68/just_admit_it_how_many_of_you_are_losing_money/h621wta/,Bitcoin,Holy shit you have Reddit friends!!!!! The epitome of being a sad case haha
8497,graph_marine,t3_oloceg,t1_h5gcnac,0,-28,/r/ethereum/comments/oloceg/messari_ethereum_is_poised_to_settle_8_trillion/h5gdbnr/,ethereum,"Name one dApp that was not only functional, but had tens-hundreds of thousands of active daily users, before The Graph existed."
2743,starwarsfan99201,t3_onud68,t1_h61zfwo,0,-25,/r/Bitcoin/comments/onud68/just_admit_it_how_many_of_you_are_losing_money/h61zqfz/,Bitcoin,S&amp;P 500 has increased 300% in 10 years and BTC has increased 217% in 1 year. \n\nKeep replying like a retard though. I'm enjoying it haha
6659,supadave24,t3_onina3,t1_h5sbawv,0,-25,/r/ethereum/comments/onina3/gassed/h5stekn/,ethereum,Try iota


## Low Character Count Comments

In [200]:
# Check rows with low character counts (odd lengths from 1 to 99 only, as a sample)
for i in range(1,100,2):
    print('\nCHAR_COUNT: {}    {}'.format(i, raw[raw['body'].str.len() == i].shape))
    print('-' * 100, '\n')
    print(raw[raw['body'].str.len() == i]['body'].unique())


CHAR_COUNT: 1    (25, 8)
---------------------------------------------------------------------------------------------------- 

['😉' '🗿' '😂' '🤔' '🤣' '👌' 'k' '🚀' '😆' '0' 'L' '👏' 'f' '💋' '🥶' '🤡' '💪']

CHAR_COUNT: 3    (39, 8)
---------------------------------------------------------------------------------------------------- 

['Wow' 'Yes' 'wym' 'ggs' 'No.' 'lol' '20♾' 'LOL' 'yes' '🤔🤔🤔' 'Duh' 'Rip'
 'Lol' '😂😂😂' '😳😳😳' 'Yup' 'How' 'Why' 'Yep' 'Xrp' 'YES' 'Yee']

CHAR_COUNT: 5    (41, 8)
---------------------------------------------------------------------------------------------------- 

['Whut?' '😂😂😂😂😂' 'Cheap' 'Phæg.' 'What?' 'IDGAF' 'ban 😁' 'I do!' 'who ?'
 'Elons' 'Agree' 'Maybe' 'None.' 'Bitch' 'Bingo' 'Fiat.' 'LOL !' 'he is'
 'Link?' 'lmfao' 'Why ?' 'Jack?' 'Lolol' '32 hi' 'wrong' 'Tesla' 'Hahah'
 'what?' '666 ?' 'Dolt.' 'pussy' 'Wow 🤯' 'Beast' 'Ty OP' 'facts' 'Gross'
 'It is' 'Nice!' 'Dying' 'R.I.P']

CHAR_COUNT: 7    (64, 8)
--------------------------------------------------------

In [201]:
# Most short comments are generic reactions rather than substantive discussion,
# especially at the low end of the character count, so keep only bodies of 100+ characters
print(raw[raw['body'].str.len() < 100 ].shape)
raw = raw[raw['body'].str.len() >= 100 ]
print(raw.shape)

(4443, 8)
(3363, 8)
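As an alternative to printing one length at a time, the lengths can be grouped into buckets before applying the cutoff. The bucket edges below are illustrative assumptions; only the 100-character cutoff comes from the cell above.

```python
import pandas as pd

# Toy bodies of known length (made up).
toy = pd.DataFrame({'body': ['lol', 'What?', 'x' * 99, 'y' * 100, 'z' * 250]})

MIN_CHARS = 100  # same cutoff as the cell above
lengths = toy['body'].str.len()

# Right-closed bins: (0, 10], (10, 50], (50, 99], (99, inf)
buckets = pd.cut(lengths, bins=[0, 10, 50, 99, float('inf')],
                 labels=['1-10', '11-50', '51-99', '100+'])
print(buckets.value_counts(sort=False))

kept = toy[lengths >= MIN_CHARS]
print(kept.shape)
```

Note that `str.len()` counts Unicode code points, so an emoji built from several code points is counted as more than one character.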


## Moderator Maintenance/Warning Comments

In [202]:

def check_author_comments(author):
    """Print the number of remaining rows and the unique comment bodies for one author."""
    author_rows = raw[raw['author'] == author]
    print("\nModerator: {}".format(author))
    print('-' * 50)
    print(author_rows.shape, '\n') # .shape to see impact on remaining data
    print(author_rows['body'].unique()) # see comments

# moderator usernames per subreddit, so each can be checked in a loop
eth_moderators = ['vbuterin', 'heliumcraft', 'insomniasexx', 'publicmodlogs', 'Souptacular', 'EvanVanNess', 'ligi', 'twigwam', 'JBSchweitzer', 'edmundedgar']  
btc_moderators = ['theymos', 'BashCo', 'frankenmint', 'rbitcoin-bot', 'Aussiehash', 'ThePiachu', 'Avatar-X', 'DigitalGoose', 'theiflar', 'rBitcoinMod']

In [203]:
for mod in eth_moderators:
    check_author_comments(mod) 


Moderator: vbuterin
--------------------------------------------------
(1, 8) 

["It's deprioritized because it's at best a 1.5-2x improvement at a high cost of dev time whereas the other things on the roadmap are 3-100x improvements with lower cost in dev time."]

Moderator: heliumcraft
--------------------------------------------------
(0, 8) 

[]

Moderator: insomniasexx
--------------------------------------------------
(0, 8) 

[]

Moderator: publicmodlogs
--------------------------------------------------
(0, 8) 

[]

Moderator: Souptacular
--------------------------------------------------
(0, 8) 

[]

Moderator: EvanVanNess
--------------------------------------------------
(0, 8) 

[]

Moderator: ligi
--------------------------------------------------
(24, 8) 

['I think you can just open an issue on the org github repo - so this is basically like writing a short forum post ;-)\n\nIf you can create an SVG you can also do this - I believe in you!'
 'Yea - I approved the post. 

In [204]:
# remove comments from accounts that generally post only warning / maintenance comments
raw = raw[~raw['author'].isin(['twigwam', 'JBSchweitzer', 'ligi', 'abcoathup'])]

In [205]:
for mod in btc_moderators:
    check_author_comments(mod) 


Moderator: theymos
--------------------------------------------------
(0, 8) 

[]

Moderator: BashCo
--------------------------------------------------
(1, 8) 

["Elon is about 5-7 years behind. He even promoted the old idea of using bitcoin mining machines to generate heat, which is an idea that's been around for years. As for his scaling rational, that was all heavily debated between 2015 and 2017, and it's very clear that increasing the block size is a decentralization tradeoff and that layered protocol scaling is far better."]

Moderator: frankenmint
--------------------------------------------------
(0, 8) 

[]

Moderator: rbitcoin-bot
--------------------------------------------------
(0, 8) 

[]

Moderator: Aussiehash
--------------------------------------------------
(0, 8) 

[]

Moderator: ThePiachu
--------------------------------------------------
(0, 8) 

[]

Moderator: Avatar-X
--------------------------------------------------
(0, 8) 

[]

Moderator: DigitalGoose
-------

In [206]:
print(raw[raw['author'] == 'AutoModerator']['body'].unique())
print(raw[raw['author'] == 'AutoModerator'].shape)

['The bitcoin (dot) com domain is owned by a convicted felon who describes himself as "Bitcoin Jesus" and has a long history of unscrupulous behavior. From [vouching for MtGox solvency before it collapsed](https://www.reddit.com/r/Bitcoin/comments/77vrek/roger_ver_on_mtgox_bitcoin_exchange_for_the/), to promoting Craig Wright [as if he were Satoshi Nakamoto](https://www.reddit.com/r/Bitcoin/comments/776fim/funny_how_all_the_criminals_and_fraudsters_in/), the owner acquired the "r / btc" subreddit and packed its mod team with paid employees to spread divisive misinformation about the bitcoin protocol and various individuals in the bitcoin space, including this subreddit as a whole. He has also leveraged the domain and subreddit to promote an impostor altcoin to unsuspecting newcomers as if it were actually Bitcoin. As such, the domain is considered malicious, and the r/Bitcoin mod team kindly asks that readers seek out credible sources to post instead.\n\n*I am a bot, and this action wa

In [207]:
# remove comments from moderator accounts that generally post only warning / maintenance comments
raw = raw[(raw['author'] != 'rBitcoinMod') & (raw['author'] != 'AutoModerator')]
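The author-based removals in this section follow one pattern, so they could share a helper that reports what it drops. The name `drop_authors` is hypothetical and the toy rows are made up; only the usernames are taken from the cells above.

```python
import pandas as pd


def drop_authors(df, authors):
    """Return `df` without rows written by any of `authors` (hypothetical helper)."""
    mask = df['author'].isin(authors)
    dropped = sorted(set(df.loc[mask, 'author']))
    print(f"dropping {int(mask.sum())} rows from {dropped}")
    return df[~mask]


toy = pd.DataFrame({
    'author': ['AutoModerator', 'alice', 'rBitcoinMod', 'AutoModerator', 'bob'],
    'body': ['warning', 'real comment', 'maintenance', 'warning', 'another comment'],
})

cleaned = drop_authors(toy, ['rBitcoinMod', 'AutoModerator', 'ligi'])
print(cleaned.shape)
```

Usernames that are absent from the data, such as `ligi` in the toy frame, are ignored by `isin`, so one list can be reused across both subreddits.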

## Bot Comments

In [208]:
# Further checks for bot comments
check_df = raw[raw['body'].str.lower().str.contains('i am a bot')] 
print(check_df.shape) # .shape to gauge impact on remaining rows
display(check_df.head(11)) # .head to preview comments

(6, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
6339,ectbot,t3_onina3,t1_h5uh6j3,0,0,/r/ethereum/comments/onina3/gassed/h5uh7he/,ethereum,"\nHello! You have made the mistake of writing ""ect"" instead of ""etc.""\n\n""Ect"" is a common misspelling of ""etc,"" an abbreviated form of the Latin phrase ""et cetera."" Other abbreviated forms are **etc.**, **&amp;c.**, **&amp;c**, and **et cet.** The Latin translates as ""et"" to ""and"" + ""cetera"" to ""the rest;"" a literal translation to ""and the rest"" is the easiest way to remember how to use the phrase. \n\n[Check out the wikipedia entry if you want to learn more.](https://en.wikipedia.org/wiki/Et_cetera)\n\n^(I am a bot, and this action was performed automatically. Comments with a score less than zero will be automatically removed. If I commented on your post and you don't like it, reply with ""!delete"" and I will remove the post, regardless of score. Message me for bug reports.)"
7250,Shakespeare-Bot,t3_omxgln,t1_h5pvqb6,0,0,/r/ethereum/comments/omxgln/aave_plans_to_build_twitter_on_ethereum/h5pvrae/,ethereum,"Aave dev maketh an gross in sense gleek on twitter. \n \n\nnews: aave plan to buildeth ""twitter on ethereum""\n\n***\n\n\n\n^(I am a bot and I swapp'd some of thy words with Shakespeare words.)\n\nCommands: `!ShakespeareInsult`, `!fordo`, `!optout`"
7636,Shakespeare-Bot,t3_omm22u,t1_h5lvms1,0,-2,/r/ethereum/comments/omm22u/i_sold_most_of_my_alts_and_all_my_bitcoin_to_buy/h5lvnd4/,ethereum,"i didst this and did get hack'd. hath lost 2 million usd\n\n***\n\n\n\n^(I am a bot and I swapp'd some of thy words with Shakespeare words.)\n\nCommands: `!ShakespeareInsult`, `!fordo`, `!optout`"
8095,Shakespeare-Bot,t3_om0e68,t1_h5inrwm,0,-3,/r/ethereum/comments/om0e68/its_called_ethereum/h5insyd/,ethereum,"lol not decentralized, nay did fix supply. Dy'r \n \nhas't excit'ment staying poor, i pity naïve people like thee\n\n***\n\n\n\n^(I am a bot and I swapp'd some of thy words with Shakespeare words.)\n\nCommands: `!ShakespeareInsult`, `!fordo`, `!optout`"
8582,Shakespeare-Bot,t3_olnodn,t1_h5fvwk7,0,2,/r/ethereum/comments/olnodn/tired_of_losing_all_my_money_on_wsb_so_i_just/h5fvxzt/,ethereum,"Eth and gme art on the same team. We knoweth this. A house did divide cannot standeth. Thee knoweth this\n\n***\n\n\n\n^(I am a bot and I swapp'd some of thy words with Shakespeare words.)\n\nCommands: `!ShakespeareInsult`, `!fordo`, `!optout`"
9517,Shakespeare-Bot,t3_okfppd,t1_h57kf15,1,1,/r/ethereum/comments/okfppd/bitcoin_eth_lovers_5_reasons_to_buy_hut_8_stock/h57kg42/,ethereum,"Hut looks gooood and very undervalu'd imo if 't be true you’re interest'd in stocks as well as crypto\n\n***\n\n\n\n^(I am a bot and I swapp'd some of thy words with Shakespeare words.)\n\nCommands: `!ShakespeareInsult`, `!fordo`, `!optout`"


In [209]:
# Check for comment bodies that end with 'bot'
check_df = raw[raw['body'].str.lower().str.contains('bot$')]
print(check_df.shape)
display(check_df.head(11))

(0, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body


In [210]:
# Check for comments that mention coinfeeds-bot
check_df = raw[raw['body'].str.lower().str.contains('coinfeeds-bot')]
print(check_df.shape)
display(check_df.head(11))

(4, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
7596,B0tRank,t3_om7z61,t1_h5m1tkr,0,1,/r/ethereum/comments/om7z61/defi_sandwich_attack_explain/h5m1u5s/,ethereum,"Thank you, tehsisiewdai, for voting on coinfeeds-bot.\n\nThis bot wants to find the best and worst bots on Reddit. [You can view results here](https://botrank.pastimes.eu/).\n\n***\n\n^(Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!)"
8270,B0tRank,t3_oloceg,t1_h5htu4e,0,4,/r/ethereum/comments/oloceg/messari_ethereum_is_poised_to_settle_8_trillion/h5htur8/,ethereum,"Thank you, HDPunks, for voting on coinfeeds-bot.\n\nThis bot wants to find the best and worst bots on Reddit. [You can view results here](https://botrank.pastimes.eu/).\n\n***\n\n^(Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!)"
8917,B0tRank,t3_oky03k,t1_h5d6pbi,0,0,/r/ethereum/comments/oky03k/eip1559_is_set_to_go_live_with_london_upgrade_in/h5d6q9k/,ethereum,"Thank you, WhiteCoco4u, for voting on coinfeeds-bot.\n\nThis bot wants to find the best and worst bots on Reddit. [You can view results here](https://botrank.pastimes.eu/).\n\n***\n\n^(Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!)"
9316,B0tRank,t3_okhvut,t1_h59ayiw,0,1,/r/ethereum/comments/okhvut/introduction_to_the_diamond_standard_eip2535/h59az3c/,ethereum,"Thank you, Captainbananapants7, for voting on coinfeeds-bot.\n\nThis bot wants to find the best and worst bots on Reddit. [You can view results here](https://botrank.pastimes.eu/).\n\n***\n\n^(Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!)"
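The separate bot checks above can be combined into a single flag. This is a sketch under assumptions: the phrase list is hypothetical, and matching `bot$` against the author name rather than the body is a deliberate change from the second check above. Flagged rows should be reviewed before removal, because a human comment can also contain a phrase like "this bot".

```python
import re

import pandas as pd

# Hypothetical phrase list; extend after inspecting the flagged rows.
BOT_PHRASES = ['i am a bot', "i'm a bot", 'this bot']
pattern = '|'.join(re.escape(p) for p in BOT_PHRASES)

# Toy rows modelled on the bot accounts found above (bodies shortened).
toy = pd.DataFrame({
    'author': ['Shakespeare-Bot', 'B0tRank', 'alice', 'ectbot'],
    'body': [
        "^(I am a bot and I swapp'd some of thy words with Shakespeare words.)",
        'This bot wants to find the best and worst bots on Reddit.',
        'Layered protocol scaling is a decentralization tradeoff.',
        'Hello! You have made the mistake of writing ect instead of etc.',
    ],
})

by_text = toy['body'].str.contains(pattern, case=False, regex=True, na=False)
by_name = toy['author'].str.contains(r'bot$', case=False, regex=True, na=False)
flagged = by_text | by_name

print(toy.loc[flagged, 'author'].tolist())
```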


## Rogue Comments

In [211]:
# Check for moderator-style redirect comments that mention other subreddits
check_df = raw[raw['body'].str.lower().str.contains('subreddits')]
print(check_df.shape)
display(check_df.head(11))

(12, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
1354,HOOOODL,t3_op1jsa,t3_op1jsa,0,5,/r/Bitcoin/comments/op1jsa/i_thought_this_community_didnt_need_elon_musk/h62gtnu/,Bitcoin,"This ""community"", like many other subreddits, is looking for a person who can represent perfectly each and every one of their opinions and lead them to a proper way of thinking without having to think for themselves. You see it everywhere and in all sorts of different groups. It's why people care so much about celebrities and their opinions. It's weak and shouldn't be encouraged."
4758,Seeders,t3_ooj8au,t1_h60s5bj,0,14,/r/Bitcoin/comments/ooj8au/daily_discussion_july_21_2021/h60sxkh/,Bitcoin,It's just different people with different opinions. Stop looking at subreddits like individual personalities.
5803,Always_Question,t3_ooejy1,t3_ooejy1,0,1,/r/ethereum/comments/ooejy1/ethereum_1000_possible_weekly_forecast/h5y29kx/,ethereum,"Please keep price discussion, market talk, memes, and exchanges to subreddits such as r/ethfinance or r/ethtrader"
6008,trent_vanepps,t3_oo71nc,t3_oo71nc,0,1,/r/ethereum/comments/oo71nc/dont_be_afraid_of_dips_its_opportunity_to_buy_it/h5wnyd9/,ethereum,"Please keep price discussion, market talk, memes, and exchanges to subreddits such as r/ethfinance or r/ethtrader"
7018,NeoCornelius,t3_on9sf7,t1_h5rec36,0,2,/r/ethereum/comments/on9sf7/for_everyone_talking_about_aave_building_twitter/h5rig1x/,ethereum,I agree. I don't think much needs to be changed about Reddit except decentralize the subreddits and let them work on a federated basis. Diaspora tries to do something like this but it doesn't have the network effect.
7143,LavoP,t3_omxgln,t1_h5qfy8h,0,2,/r/ethereum/comments/omxgln/aave_plans_to_build_twitter_on_ethereum/h5qoi3e/,ethereum,Subreddits and active moderation make it literally worlds different. Have you even used both platforms? Aside from the obvious completely different UXs they have extremely different userbases and content.
7568,sworlly,t3_om5zsc,t1_h5lylo5,0,1,/r/ethereum/comments/om5zsc/aave_may_build_twitter_on_ethereum/h5mekak/,ethereum,"&gt;You've never been on 4chsn and it shows. 4chan is a board with a shit ton of different forums. /b/ and /Pol/ are the bad ones. 10% of the whole site is ""bad"" \n&gt; \n&gt;Don't talk if you don't know what you are on about. Reddit has subreddits that are way more racist and Twitter speaks for itself\n\nlol\n\n*""Don't judge us on the unmoderated boards, instead judge us on the moderated ones which are better""*\n\nThanks for making my point. \n\nDon't talk to me period, Bro."
7594,dmihal,t3_ommiu2,t3_ommiu2,0,1,/r/ethereum/comments/ommiu2/candlestick_charts/h5m24ck/,ethereum,"Please keep price discussion, market talk, memes, and exchanges to subreddits such as r/ethfinance or r/ethtrader"
7604,bro-guy,t3_om5zsc,t1_h5leqtg,0,2,/r/ethereum/comments/om5zsc/aave_may_build_twitter_on_ethereum/h5lylo5/,ethereum,"You've never been on 4chsn and it shows. 4chan is a board with a shit ton of different forums. /b/ and /Pol/ are the bad ones. 10% of the whole site is ""bad""\n\nDon't talk if you don't know what you are on about. Reddit has subreddits that are way more racist and Twitter speaks for itself"
8166,makedd,t3_om0e68,t1_h5i7bef,0,5,/r/ethereum/comments/om0e68/its_called_ethereum/h5ib1p8/,ethereum,"Agreed, all the coin subreddits are echo chambers where facts don't matter. As long as you are bullish on the coin and trash everyone else, people are excited."


In [212]:
# Check whether comments containing 'debug' are bot output or pasted log dumps
check_df = raw[raw['body'].str.lower().str.contains('debug')]
print(check_df.shape)
display(check_df.head(11))

(2, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
9,mjgill89,t3_oozjjg,t3_oozjjg,0,1,/r/Bitcoin/comments/oozjjg/bitcoind_is_failing/h63ly9c/,Bitcoin,"this is the debug log which isn't helping:\n\n2021-07-22T03:34:46Z UpdateTip: new best=00000000000000001192dae5aea9abaac21033$\n2021-07-22T03:35:03Z UpdateTip: new best=0000000000000000103578745018fba03524a1$\n2021-07-22T03:35:04Z Imported mempool transactions from disk: 0 succeeded, 0 fa$\n2021-07-22T03:35:04Z loadblk thread exit\n2021-07-22T03:35:07Z Synchronizing blockheaders, height: 692090 (~100.00%)\n2021-07-22T03:35:10Z New outbound peer connected: version: 70016, blocks=692090$\n2021-07-22T03:35:10Z Pre-allocating up to position 0x8000000 in blk00296.dat"
2787,MorrisSchaefer,t3_oozjjg,t3_oozjjg,0,2,/r/Bitcoin/comments/oozjjg/bitcoind_is_failing/h61z3fx/,Bitcoin,You can specify the log path with -debuglogfile\n\n\nExample :\n./bitcoind -daemon -debuglogfile=/data/logfile.log
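
The keyword checks in this section repeat the same three lines with a different search term each time. A minimal sketch of a reusable helper follows. `check_keyword` and the `demo` frame are hypothetical names introduced here for illustration, not part of the original notebook. The match is literal (`regex=False`), so a keyword such as `bot$` would not be treated as a pattern.

```python
import pandas as pd


def check_keyword(df, keyword, column="body"):
    """Return rows whose text column contains `keyword` (case-insensitive, literal match)."""
    # na=False treats missing text as "no match" instead of propagating NaN into the mask
    mask = df[column].str.lower().str.contains(keyword.lower(), regex=False, na=False)
    return df[mask]


# Toy frame standing in for `raw`; the rows are invented for illustration
demo = pd.DataFrame({
    "author": ["a", "b", "c"],
    "body": ["this is the DEBUG log", "use -debuglogfile", "nothing to see here"],
})

hits = check_keyword(demo, "debug")
print(hits.shape)
```

With the real data, the call would look like `display(check_keyword(raw, 'debug').head(11))`.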


In [213]:
# Check for repeated spam-like comments containing 'salesman'
check_df = raw[raw['body'].str.lower().str.contains('salesman')]
print(check_df.shape)
display(check_df.head(11))

(7, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
1272,Bitcoin1776,t3_oowqar,t1_h61vf8h,0,11,/r/Bitcoin/comments/oowqar/bword_conference_72121_exclusive_elon_musk/h62irsu/,Bitcoin,"To get specific - he said one comment that was sort of interesting... either shrink the base chain, or expand it.\n\nThis is where a lack of history with BTC comes into play **no one** really wants to expand the base chain, period.\n\nThe whole idea is to do 'second layer' transactions, and also **expanding** the base chain is one of the things requiring a hard fork (shrinking it does not).\n\nSo by default - expanding the base chain and DOING LITERALLY ANYTHING ELSE YOU WANT TO - are equally impossible with Bitcoin. It is not meant to be done.\n\nIt was done purely as a stop-gap measure, while technology improved (and frankly I think Lightning + 'services' 'paypal, robin hood, cash app' - are expanding sufficiently that chain will never be expanded again).\n\nSo YES we would LOVE to shrink the BTC block chain! It **won't be necessary**, due to his other comment on how 'dial up' was common when BTC began, and adding 50 GB / yr to the blockchain likely won't prove detrimental... BUT..."
2982,gardener1111,t3_oowyic,t3_oowyic,0,1,/r/Bitcoin/comments/oowyic/elon_musk_still_a_bitcoin_supporter_i_own_bitcoin/h61ve17/,Bitcoin,Who gives a f\*\*k about some car salesman thinks about BTC ?\n\nDon't post shit here about shitty car salesman.
2999,gardener1111,t3_oox1il,t3_oox1il,0,2,/r/Bitcoin/comments/oox1il/elon_musk_says_spacex_owns_bitcoin_in_jack_dorsey/h61v258/,Bitcoin,Who gives a f\*\*k about what some car salesman and his ass-lickers think about BTC ?\n\nDon't post shit here about shitty poeple
3039,gardener1111,t3_ooxq69,t3_ooxq69,0,-4,/r/Bitcoin/comments/ooxq69/full_b_word_conference/h61u7m9/,Bitcoin,Who gives a shit about some car salesman and his ass-lickers think about BTC ?\n\nDon't post about shitty conferences here.
3075,gardener1111,t3_ooo74h,t3_ooo74h,0,1,/r/Bitcoin/comments/ooo74h/how_to_watch_the_bword_conference_with_elon_musk/h61tkn7/,Bitcoin,Why do you even watch some car salesman and his ass-lickers about BTC.\n\nWho cares some scumbags talk about BTC ?\n\nDon't post shit here about shitty people talk about BTC.
3120,gardener1111,t3_oova2b,t3_oova2b,0,-5,/r/Bitcoin/comments/oova2b/the_b_word_live_stream_link_on_youtube/h61svv7/,Bitcoin,Don't post shit here about shitty people having shit talk about BTC.\n\nBTC is bigger than any car salesman and his ass-lickers
3168,MikeIsSmart,t3_ooj8au,t1_h61br3o,0,2,/r/Bitcoin/comments/ooj8au/daily_discussion_july_21_2021/h61rwya/,Bitcoin,"I agree. I also think the Asperger's plays into this a lot. Many people find him awkward and cringey when he speaks in long form unscripted conversations. It's true he's not an engaging speaker in the same way as a Jobs or Branson or Mark Cuban or many other CEO's, but when you understand this as a symptom/characteristic of Aspy-spectrum people you can see that Elon is an innovator/engineer rather than a salesman/influencer like those other people. \n\nHe doesn't have the charisma but he has the vision and the drive to get things done. For one I think it's refreshing to have someone with less of a filter, more mission-focused rather than image-focused. Some times it bites him in the ass when he tweets without having things checked over by a corporate pr person, but you know you at least are getting what appears to be unfiltered honest Elon."


In [214]:
check_mod_comments('gardener1111') 


Moderator: gardener1111
--------------------------------------------------
(5, 8) 

["Who gives a f\\*\\*k about some car salesman thinks about BTC ?\n\nDon't post shit here about shitty  car salesman."
 "Who gives a f\\*\\*k about what some car salesman and his ass-lickers think about BTC ?\n\nDon't post shit here about shitty poeple"
 "Who gives a shit about some car salesman and his ass-lickers think about BTC ?\n\nDon't post about shitty conferences here."
 "Why do you even watch some car salesman and his ass-lickers about BTC.\n\nWho cares some scumbags talk about BTC ?\n\nDon't post shit here about shitty people talk about  BTC."
 "Don't post shit here about shitty people having shit talk about BTC.\n\nBTC is bigger than any car salesman and his ass-lickers"]


In [215]:
check_mod_comments('Bosphoramus')


Moderator: Bosphoramus
--------------------------------------------------
(7, 8) 

['I would like to believe that Vitalik is having a laugh knowing he is about to wipe out billions or trillions of dollars worth of financial assets; but the reality is he (or his benefactors) might actually believe in this shit, or worse, it might all be a big shill plan to destroy the credibility of cryptocurrency.'
 'I wanted to see if you were human or not. Your wording made me believe you might be a NN which is becoming increasingly common. Everything in that post is accurate, according to AI at least.\n\n&amp;#x200B;\n\nI\'ll merit you with a real response:\n\nYou could just do a generic nudity/NSFW detection algorithm then apply facial age in order to flag images that are NSFW and underage. There are state of the art systems that are open source and very accurate.\n\nYou obviously do not need a database of illegal content to do this.\n\nIf somehow a false-negative passed through and the government

In [216]:
# Check for off-topic comments containing 'porn'
check_df = raw[raw['body'].str.lower().str.contains('porn')]
print(check_df.shape)
display(check_df.head())

(40, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
3159,slugur,t3_oov7mw,t3_oov7mw,0,1,/r/Bitcoin/comments/oov7mw/bitpay/h61s63x/,Bitcoin,"I am not here to answer your question, but just to wish you good luck with your porn business. You get an upvote from me. Cheers."
5796,officerkondo,t3_omxgln,t1_h5twlrc,0,0,/r/ethereum/comments/omxgln/aave_plans_to_build_twitter_on_ethereum/h5y43iy/,ethereum,"You don't prevent it. Like almost all crimes, you respond to it reactively. People get busted for trading CP over BitTorrent all the time. See, e.g., Josh Duggar.\n\nThe ""child porn"" retort is so played out. It is no different than Law By Grieving Parent where some kid dies so their parents testify in front of Congress so This Will Never Happen Again. See also, ""won't someone please think about the children?!""\n\nIt's so transparent. People want to shut down social, political, and economic opinions they don't like but they know they will be seen for the fascists they are if they say that so they pretend that it's all about the children. Lulz ok. What have you ever proposed to keep children from accessing online pornography? Show me where you have ever posted about your concerns that children can access pornography on reddit."
5912,david-song,t3_omg95e,t1_h5ve0vj,0,1,/r/ethereum/comments/omg95e/the_most_beautiful_part_of_ethereum_is_the/h5xgu1p/,ethereum,"&gt;You could just do a generic nudity/NSFW detection algorithm then apply facial age in order to flag images that are NSFW and underage. There are state of the art systems that are open source and very accurate.\n\nNot all porn has faces in it, and it'd be trivial to crop out a face even if it was. I remember some dickhead uploading a picture of a man taking a shit on a baby to a forum i used to read. Pretty sure he wasn't aroused by it and posted it to upset everyone, and it was likely not even produced as pornography, but I'm also pretty sure it'd be highly illegal and one of your users uploaded it and then Google indexed it you'd be reported and your servers would be at risk.\n\n&gt;If somehow a false-negative passed through and the government insisted on charging you with a crime despite taking steps to filter that content then you probably need a new government.\n\nThey don't need to charge you, they just confiscate your kit for months while they investigate whether you're al..."
6125,sam_weiss,t3_omxgln,t1_h5vqwo2,0,1,/r/ethereum/comments/omxgln/aave_plans_to_build_twitter_on_ethereum/h5vrr2r/,ethereum,"It’s pretty clear that moderated content and the threat of having their own content moderated is a compromise people are willing to accept to avoid a barrage of kiddy porn, nazis and scam posts.\n\nEvery single “free speech” platform has devolved into a cesspool of scum and horror. Which means most people won’t use it, which means it provides no value. Which means it has no reason to exist.\n\nYour idealism is worth zero in the marketplace of ideas and is just you tilting at windmills. Delusional and pointless."
6167,Bosphoramus,t3_omg95e,t1_h5v7a2a,0,2,/r/ethereum/comments/omg95e/the_most_beautiful_part_of_ethereum_is_the/h5ve0vj/,ethereum,"I wanted to see if you were human or not. Your wording made me believe you might be a NN which is becoming increasingly common. Everything in that post is accurate, according to AI at least.\n\n&amp;#x200B;\n\nI'll merit you with a real response:\n\nYou could just do a generic nudity/NSFW detection algorithm then apply facial age in order to flag images that are NSFW and underage. There are state of the art systems that are open source and very accurate.\n\nYou obviously do not need a database of illegal content to do this.\n\nIf somehow a false-negative passed through and the government insisted on charging you with a crime despite taking steps to filter that content then you probably need a new government.\n\n&amp;#x200B;\n\nI can understand where you're coming from if Australia's priorities are so out of place that they would waste resources on something that's obviously a parody.\n\nLike actually investing money towards putting all pornographic content behind the .XXX tld and r..."


In [217]:
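# remove bot comments, moderator boilerplate, and other unwanted comments
raw = raw[~raw['body'].str.lower().str.contains('i am a bot')]     # self-identified bots
raw = raw[~raw['body'].str.lower().str.contains('bot$')]           # regex: body ends with 'bot'
raw = raw[~raw['body'].str.lower().str.contains('coinfeeds-bot')]
raw = raw[~raw['body'].str.lower().str.contains('subreddits')]     # moderator redirect notices; also drops a few genuine comments
raw = raw[~raw['body'].str.lower().str.contains('porn')]           # comments mentioning porn
raw = raw[~raw['body'].str.lower().str.contains('debug')]          # debug-log discussions
raw = raw[(raw['author'] != 'gardener1111')]                       # near-duplicate rants from a single author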
raw.shape

(3223, 8)
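
The chain of filters above lowercases `body` and rescans it once per pattern. As a sketch, the same rules could be folded into one boolean mask. `build_drop_mask`, the two pattern lists, and the `demo` frame are hypothetical names, not from the original notebook. Literal substrings are escaped with `re.escape`, and `bot$` is kept as a real regular expression. The sketch assumes rows with a missing `body` should be kept (`na=False`).

```python
import re

import pandas as pd

# Substrings matched literally, mirroring the cell above
LITERAL_PATTERNS = ["i am a bot", "coinfeeds-bot", "subreddits", "porn", "debug"]
# Kept as a regular expression: body ends with 'bot'
REGEX_PATTERNS = ["bot$"]


def build_drop_mask(df, authors=("gardener1111",)):
    """Return True for every row that the filter chain would remove."""
    body = df["body"].str.lower()
    pattern = "|".join([re.escape(p) for p in LITERAL_PATTERNS] + REGEX_PATTERNS)
    text_hit = body.str.contains(pattern, regex=True, na=False)
    author_hit = df["author"].isin(authors)
    return text_hit | author_hit


# Toy frame standing in for `raw`; the rows are invented for illustration
demo = pd.DataFrame({
    "author": ["bot_like", "bob", "gardener1111", "carol", "dave"],
    "body": [
        "I am a bot, beep boop",
        "Great point about fees",
        "Don't post this here",
        "Please take this to other subreddits",
        "Built my own trading bot",
    ],
})

cleaned = demo[~build_drop_mask(demo)]
print(cleaned["author"].tolist())
```

Keeping the patterns in a list also makes it easier to review which rules remove genuine comments, as `subreddits` does in the check further up.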

In [218]:
print(raw[raw['author'] == 'coinfeeds-bot']['body'].unique())
print(raw[raw['author'] == 'coinfeeds-bot'].shape)

["tldr; Ukraine's Security Service of Ukraine has revealed that the 3,800 PlayStation 4 (PS4) consoles found at a Bitcoin mining farm were actually programmed to play the FIFA video game automatically. The goal was to earn FIFA Ultimate Team Coins and Cards and then resell them to passionate gamers. Most of the equipment belonged to the economic version of the PlayStation 4, with power of only 1.84 teraflops.\n\n*This summary is auto generated by a bot and not meant to replace reading the original article. As always, DYOR.*"
 'tldr; Bitcoin is a rejection of fiat economics, instead representing a non-government, non-business controlled form of money. This view aligns more with an Austrian economist’s view of money than Keynesian and Monetarist ideas. Bitcoin is still inflationary until it hits its eventual total, just under 21 million coins around the year 2140.\n\n*This summary is auto generated by a bot and not meant to replace reading the original article. As always, DYOR.*'
 'tldr;

In [219]:
#raw['body'] = raw['body'].str.replace("*This summary is auto generated by a bot and not meant to replace reading the original article. As always, DYOR.*", "", regex=False)
raw = raw[(raw['author'] != 'coinfeeds-bot')]

In [220]:
# Further checks for bot comments
check_df = raw[raw['body'].str.lower().str.contains('subreddit')]
print(check_df.shape)
display(check_df.head(11))

(10, 8)


Unnamed: 0,author,link_id,parent_id,total_awards_received,score,permalink,subreddit,body
2751,mmafan666,t3_ooj8au,t1_h60sxkh,0,2,/r/Bitcoin/comments/ooj8au/daily_discussion_july_21_2021/h61zntp/,Bitcoin,It's also apparent that many seem to thing this subreddit IS Bitcoin. That everyone who holds ANY comes here.
2843,quietlydesperate90,t3_oonhte,t1_h61xl42,0,0,/r/Bitcoin/comments/oonhte/joe_rogan_defending_bitcoin_to_peter_schiff_in/h61y4lt/,Bitcoin,"I never claimed to be perfect, but I think my intentions were a bit nobler at least :) I wasn't trying to scold, I was trying to improve the subreddit, I guess in a similar way to you. You dislike Joe Rogan, I dislike all the rudeness and complaining. Somehow I think you will get your way in the end. I cede this battle to you."
3167,quietlydesperate90,t3_oonhte,t1_h61rhlg,0,0,/r/Bitcoin/comments/oonhte/joe_rogan_defending_bitcoin_to_peter_schiff_in/h61ryij/,Bitcoin,So why did you click on this instead of scrolling past? This isn't a private subreddit only for things you are interested in. What is the benefit from clicking on something you aren't interested in and then complaining about it?
3363,GabeE3e,t3_oovy9k,t3_oovy9k,0,0,/r/Bitcoin/comments/oovy9k/elon_musk_said/h61nybe/,Bitcoin,Elon Did way more for BTC than the haters in this Subreddit to be honest. \n\n\nand this f\*ckers posting JP Morgan articles if that positive and saying Poggers. shameless...
6999,Pezotecom,t3_on9sf7,t1_h5rnxyt,0,2,/r/ethereum/comments/on9sf7/for_everyone_talking_about_aave_building_twitter/h5rpyne/,ethereum,"hey, guys, have you read Bitcoin's whitepaper? or the mails on the mailing list? \n\nWhy are you surprised when, in a subreddit of the 2nd biggest cryptocurrency, somebody speaks about the very same foundations that gave birth to this technology? Can we please stop the cynism lol"
7099,WikiSummarizerBot,t3_omy2f7,t1_h5r1vgo,0,2,/r/ethereum/comments/omy2f7/small_node_runners_shall_we_join_hands/h5r1wwx/,ethereum,"**[Iron_law_of_oligarchy](https://en.m.wikipedia.org/wiki/Iron_law_of_oligarchy)** \n \n &gt;The iron law of oligarchy is a political theory first developed by the German-born Italian sociologist Robert Michels in his 1911 book, Political Parties. It asserts that rule by an elite, or oligarchy, is inevitable as an ""iron law"" within any democratic organization as part of the ""tactical and technical necessities"" of organization. Michels's theory states that all complex organizations, regardless of how democratic they are when started, eventually develop into oligarchies.\n \n^([ )[^(F.A.Q)](https://www.reddit.com/r/WikiSummarizer/wiki/index#wiki_f.a.q)^( | )[^(Opt Out)](https://reddit.com/message/compose?to=WikiSummarizerBot&amp;message=OptOut&amp;subject=OptOut)^( | )[^(Opt Out Of Subreddit)](https://np.reddit.com/r/ethereum/about/banned)^( | )[^(GitHub)](https://github.com/Sujal-7/WikiSummarizerBot)^( ] Downvote to remove | v1.5)"
7269,pend-bungley,t3_omxgln,t3_omxgln,0,12,/r/ethereum/comments/omxgln/aave_plans_to_build_twitter_on_ethereum/h5potmp/,ethereum,"There are already dapps like this like Peepeth that never took off. There are also sites like 8chan that, although not decentralized, have similar problems. Then there are platforms like Noise that have barely any users despite giving out tons of bch every day to get people to use it.\n\nAside from the obvious issues like illegal content and scaring normies away with edgy content, there are less obvious problems such as bad actors wreaking all kinds of havoc that make a platform unusable (mainly spam) if they aren't aggressively banned. (And you can't address this by simply charging people to post without making it prohibitively expensive to use the site honestly).\n\nIn the face of censorship by companies like Twitter and FB, it's understandable that the reaction would be calling for completely unregulated platforms, but that overly-simplistic approach doesn't work either.\n\nWhat we really need is something like Reddit before Yishan was replaced by Pao, where they had a good bala..."
7326,martelaxe,t3_om5zsc,t1_h5ldzse,0,1,/r/ethereum/comments/om5zsc/aave_may_build_twitter_on_ethereum/h5p7nlr/,ethereum,"Those huge networks have mods deleting logical comments and thats why you feel like theres more morons than smart people. Also if the network / forum is called ""VAX = AUTISTM"" do you think smart people are going to join there? \n\n\nPretty much if a huge forum where there's smart people they will explain to the people spaming non sense why are they are wrong.. Imagine if the the donald subreddit didn't have mods 24/7 deleting / banning the guys that were explaining why Trump was a moron.... it is just biased networks with biased mods that make you feel that way.\n\nFreedom of speech is the idea that in the end the BS will die because truth is easier to explain and understand (I'm talking about the majority). I trully don't know any philosopher that has said say that we need a centralized goverment controlling the information because there's a lot of morons saying / reading BS, that would end the civilization"
7769,terp_studios,t3_omf8mk,t3_omf8mk,0,3,/r/ethereum/comments/omf8mk/i_still_love_you/h5kngxk/,ethereum,It’s literally the name of the subreddit. Just read what you’re about to post at least once...come on man.
9622,timee_bot,t3_ok6ade,t3_ok6ade,0,1,/r/ethereum/comments/ok6ade/join_me_on_the_truebit_subreddit_tomorrow_at_12pm/h55u6ag/,ethereum,View in your timezone: \n[tomorrow at 12PM EDT][0] \n\n[0]: https://timee.io/20210715T1600?tl=Join%20me%20on%20the%20Truebit%20subreddit%20tomorrow%20at%2012PM%20EST%20for%20my%20AMA%20on%20the%20Verifier's%20Dilemma%20Paper!\n\n\n^(_*Assumed EDT instead of EST because DST is observed_)
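
The check above still surfaces comments from authors such as `WikiSummarizerBot` and `timee_bot`, which the body-based filters did not catch. One possible follow-up, shown here only as a sketch, is to filter on the author name. `is_bot_author` and the `demo` frame are hypothetical names. The rule that a username ending in "bot" belongs to a bot is a naming heuristic assumed for illustration, and it would also flag any human whose username happens to end that way.

```python
import pandas as pd


def is_bot_author(authors):
    """Flag usernames ending in 'bot' (case-insensitive); a naming heuristic only."""
    return authors.str.contains(r"bot$", case=False, regex=True, na=False)


# Toy frame standing in for `raw`; usernames are taken from the outputs above
demo = pd.DataFrame({
    "author": ["WikiSummarizerBot", "timee_bot", "coinfeeds-bot", "vbuterin", "bro-guy"],
    "body": ["summary", "View in your timezone", "tldr; ...", "a reply", "another reply"],
})

humans = demo[~is_bot_author(demo["author"])]
print(humans["author"].tolist())
```

Listing `raw.loc[is_bot_author(raw['author']), 'author'].unique()` before dropping anything would help confirm that no human account is caught.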


In [221]:
raw['total_awards_received'].unique()

array([0, 1], dtype=int64)

In [222]:
# rename column
raw = raw.rename(columns={'total_awards_received': 'num_awards'})

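# strip HTML entities ('&gt;', '&amp;', '&#x200B;'), newlines, and Reddit markers ('/s', 'tldr;')
# '&amp;' must be decoded before '&#x200B;' is removed: the raw text stores it as '&amp;#x200B;'
# caveat: '/s' is removed wherever it occurs, not only as a standalone sarcasm tag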
raw['body'] = raw['body'].str.replace('&gt;', "", regex=False)
raw['body'] = raw['body'].str.replace('\n', " ", regex=False)
raw['body'] = raw['body'].str.replace('&amp;', "&", regex=False)
raw['body'] = raw['body'].str.replace('/s', "", regex=False)
raw['body'] = raw['body'].str.replace('tldr;', "", regex=False)
raw['body'] = raw['body'].str.replace('&#x200B;', "", regex=False)

# check changes
display(raw.tail(100))

Unnamed: 0,author,link_id,parent_id,num_awards,score,permalink,subreddit,body
9655,Setvin,t3_og8xc2,t1_h4idbd7,0,1,/r/ethereum/comments/og8xc2/250000_bitcoin_has_now_been_wrapped_onto_ethereum/h55fwyv/,ethereum,"I think you are being a bit optimistic believing in a flipping, but either way ETH has taken a big ugly bite out of bitcoin and isn't letting up. The real significance for me is that ETH is swatting away and stomping on all of the other ""potential"" eth killers. It's a 4d chess move,"
9658,vbuterin,t3_ojzez5,t1_h54yqar,0,15,/r/ethereum/comments/ojzez5/more_people_should_be_aware_of_ewasm_an_upgrade/h55ee0m/,ethereum,It's deprioritized because it's at best a 1.5-2x improvement at a high cost of dev time whereas the other things on the roadmap are 3-100x improvements with lower cost in dev time.
9659,anongirl905,t3_ojznh6,t1_h5511pg,0,2,/r/ethereum/comments/ojznh6/brazil_becomes_the_first_country_in_latin_america/h55e98y/,ethereum,One use case is that you can invest it as part of your rrsp (retirement fund) which comes off your taxes or as part of a tax free savings account. Also no need to worry about managing a crypto account.
9663,Perleflamme,t3_ok2u7r,t1_h55bh4u,0,2,/r/ethereum/comments/ok2u7r/eth_2022456_gains_in_5_years_just_hodl/h55d8cs/,ethereum,"No one thought it would go beyond $400 when it was only priced at a few ones. People who were claiming it would were labeled the crazies. It seems like the crazies are right, sometimes. The whole DeFi is here and is heading towards building the path towards a new and way better stockmarket. Billions of USD isn't the territory of stockmarket-like networks. It's way, way more than that."
9670,CryptoRoast_,t3_ok2u7r,t1_h55br1c,0,1,/r/ethereum/comments/ok2u7r/eth_2022456_gains_in_5_years_just_hodl/h55byg6/,ethereum,"That's fine and I totally understand. Be patient. I'm not thousands of % up overnight. I've held for years. My lowest eth buy was around $30, there were dips back then too 🤷‍♂️"
9673,CalyShadezz,t3_ojql3v,t1_h53s1ar,0,1,/r/ethereum/comments/ojql3v/ethereums_top_10_largest_whale_addresses_now/h55bhf5/,ethereum,"It's to show that whales are accumulating not selling which reflect (1) we are at a price point that whales consider ""buy"" worthy and (2) profits are not worth taking at this point. My confusion has always been in Defi why do we cheer when there is a consolidation of currency? It's literally the reason fiat is screwed (accumulation of wealth into the top 1%)."
9674,davidhepworth_,t3_ok2u7r,t3_ok2u7r,0,-3,/r/ethereum/comments/ok2u7r/eth_2022456_gains_in_5_years_just_hodl/h55bh4u/,ethereum,"That’s an average of 4,000% per year which means if ETH is $2K now, this time next year it would be worth $82,000. I don’t think it will go that high but I think $50K is possible in the next 12 months."
9675,CryptoRoast_,t3_ok2u7r,t3_ok2u7r,0,0,/r/ethereum/comments/ok2u7r/eth_2022456_gains_in_5_years_just_hodl/h55behz/,ethereum,People keep talking about a crash but I'm thousands of % up so I got no idea what they're talking about.
9677,chedebarna,t3_ojznh6,t1_h5511pg,0,-2,/r/ethereum/comments/ojznh6/brazil_becomes_the_first_country_in_latin_america/h55aond/,ethereum,"That's how the legacy bankster system coopts crypto. As long as we understand what they're doing and don't let them trickcare us into selling, it's a good thing for us hodlers though."
9678,erjo5055,t3_ojznh6,t1_h5511pg,0,4,/r/ethereum/comments/ojznh6/brazil_becomes_the_first_country_in_latin_america/h55ab89/,ethereum,"I think one answer, I could be wrong, is SPIC insurance. My understanding is that as an ETF it would qualify for SPIC insurance. So if the brokerage went under you'd be covered. Unlike keeping your ETH in an exchange, where you're screwed if the exchange goes under. Yes, yes I know, you can keep it on your own hard wallet."


## Writing to file

In [223]:
print(raw.shape)
# write to file
# this will be useful as a starting point for experimentation
raw.to_csv('data/cleaned_data.csv', index=False)

(3106, 8)
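
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below uses a temporary directory and an invented `demo` frame so that it runs on its own; with the real data the comparison would be between `raw` and `pd.read_csv('data/cleaned_data.csv')`. Note that `pd.read_csv` reads empty fields as `NaN` by default, so a body that became an empty string during cleaning would not round-trip unchanged.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned `raw`; the rows are invented for illustration
demo = pd.DataFrame({
    "author": ["a", "b"],
    "num_awards": [0, 1],
    "body": ["plain text", 'text with "quotes", commas'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "cleaned_data.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```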
