## Exploratory Analysis of Cryptocurrency Subreddit Data

### 1. Set Up
#### 1.1 Load Libraries

In [1]:
# Import libraries
import os
import pandas as pd
import polars as pl
import pyarrow as pa
import seaborn as sns
import plotly.express as px
from tqdm import tqdm
from datetime import datetime
from elasticsearch_dsl import (
    Search,
    Q
)
from elasticsearch_dsl.response import Response
import nltk
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS
import gensim

# Change dir
os.getcwd()
os.chdir("..")  # Change to root dir to detect other local libs

# Import Local libs
from es.manager import ESManager
from es.utils import es_reddit_to_df 
from data.schema.es_mappings import (
    REDDIT_CRYPTO_INDEX_NAME,
    REDDIT_CRYPTO_CUSTOM_INDEX_NAME,
    reddit_crypto_mapping,
    reddit_crypto_custom_mapping
) 

### 2. Connect to ES

In [2]:
es_conn = ESManager()
es_client = es_conn.es_client
es_conn.get_status()



True

In [3]:
# Get aliases for easy reference
es_conn.get_aliases()

('.kibana                     .kibana_7.16.3_001              - - - -\n'
 '.kibana_7.16.3              .kibana_7.16.3_001              - - - -\n'
 'raw-data                    reddit-crypto                   - - - -\n'
 '.kibana-event-log-7.16.3    .kibana-event-log-7.16.3-000001 - - - true\n'
 '.kibana_task_manager        .kibana_task_manager_7.16.3_001 - - - -\n'
 '.kibana_task_manager_7.16.3 .kibana_task_manager_7.16.3_001 - - - -\n')


### 3. Pull Data from Raw Subreddit ES Index
#### 3.1 Test Search

In [4]:
test_search = (
    Search(index="raw-data")
    .using(es_client)
    .query("match", full_text = "whale")
)

test_resp = test_search.execute()

# Get results
test_res_df = es_reddit_to_df(test_resp)
test_res_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               10 non-null     object        
 1   create_datetime  10 non-null     datetime64[ns]
 2   subreddit        10 non-null     object        
 3   full_text        10 non-null     object        
dtypes: datetime64[ns](1), object(3)
memory usage: 448.0+ bytes


#### 3.2 Pull 10,000 Random Sample of Reddit Raw Data
* Random sample ES query reference [here](http://richardhallett.com/posts/random-sampling-elasticsearch/).


In [5]:
# Random Sample query
SAMPLE_SIZE = 10_000
RANDOM_SEED = 42

random_sample_query = {
    "size": SAMPLE_SIZE,
    "query": {
        "function_score" : {
            "random_score": {
                "seed": RANDOM_SEED,
                "field": "id"
            }
        }
    }
}
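
The query body above can be wrapped in a small helper so that the sample size and seed become explicit parameters. This is only a sketch: `build_random_sample_query` is a hypothetical name and is not part of `es.manager`. The upper bound assumes the index keeps Elasticsearch's default `index.max_result_window` of 10,000, which is why a single request cannot return more than `SAMPLE_SIZE` documents.

```python
def build_random_sample_query(size: int, seed: int, field: str = "id") -> dict:
    """Build a function_score query that returns `size` pseudo-randomly ordered docs."""
    # Assumption: the index uses the default max_result_window (10,000).
    if not 0 < size <= 10_000:
        raise ValueError("size must be between 1 and 10,000 for a single request")
    return {
        "size": size,
        "query": {
            "function_score": {
                # The same seed and field give a reproducible ordering.
                "random_score": {"seed": seed, "field": field}
            }
        },
    }


query = build_random_sample_query(size=10_000, seed=42)
```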

In [6]:
random_sample_res = es_conn.run_match_query(index='raw-data', query=random_sample_query)
print(f"Sampled {random_sample_res} documents from the database with random seed {RANDOM_SEED}")



10000

In [7]:
random_sample_df = es_reddit_to_df(random_sample_res, input_type='es', output_type='pandas')
random_sample_df.describe()



| | id | create_datetime | subreddit | full_text |
|---|---|---|---|---|
| count | 10000 | 10000 | 10000 | 10000 |
| unique | 9921 | 9921 | 6 | 9073 |
| top | hk8k5aa | 2021-11-12 03:12:52 | Bitcoin | [removed] |
| freq | 3 | 3 | 4347 | 497 |
| first | | 2014-01-01 11:24:01 | | |
| last | | 2022-01-01 07:26:10 | | |


In [8]:
random_sample_df.head()

| | id | create_datetime | subreddit | full_text |
|---|---|---|---|---|
| 0 | djt6ym7 | 2017-07-05 23:12:37 | ethtrader | Writing my Thesis on Blockchain activities in ... |
| 1 | d2fn2te | 2016-04-25 03:17:13 | BitcoinMarkets | Wooo! |
| 2 | f1ne8ha | 2019-09-28 04:13:30 | CryptoMarkets | /r/angryupvote |
| 3 | co5mg1z | 2015-01-30 19:25:17 | Bitcoin | &gt; it would take an infeasible amount of enc... |
| 4 | dcmui6n | 2017-01-20 01:13:10 | Bitcoin | 0 fees is not a problem. Dubious leverage, poo... |


#### 3.3 [KIV] Pull Larger Amounts of Data

In [9]:
sample_data_search = (
    Search(index="raw-data")
    .using(es_client)
    [:10000]  # Look at the first 10k documents
)

sample_data_resp = sample_data_search.execute()
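
A single `execute()` call is capped by `index.max_result_window`, so going beyond 10,000 documents needs a cursor-style API. In `elasticsearch_dsl`, `Search.scan()` wraps the scroll API and yields hits lazily. The sketch below shows one way to cap and batch such an iterator. `batched_hits` is a hypothetical helper, and a plain generator stands in for `sample_data_search.scan()` so that the block runs without a cluster.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_hits(hits: Iterable, batch_size: int, limit: int) -> Iterator[List]:
    """Yield lists of up to `batch_size` hits, stopping after `limit` hits in total."""
    capped = islice(hits, limit)
    while True:
        batch = list(islice(capped, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for `sample_data_search.scan()`, which yields one hit at a time.
fake_hits = ({"id": f"doc{i}"} for i in range(25))

batches = list(batched_hits(fake_hits, batch_size=10, limit=22))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be converted to a DataFrame and appended, which keeps memory use bounded while the larger pull runs.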



In [10]:
all_data = es_reddit_to_df(sample_data_resp, input_type="dsl", output_type="polars")
all_data.head()

| id | create_datetime | subreddit | full_text |
|---|---|---|---|
| str | datetime | str | str |
| "cg1hkmu" | 2014-03-13 17:35:30 | "Bitcoin" | "SAP is a mega software development framework for business systems, it links everything together. Production, inventory, transport, billing. It is the database of industrial corporations. It has a monopoly as the SAP systems of each corporation communicate through proprietary APIs. Integrating Bitcoin means they want an API to do transactions with the SAP system of their suppliers and customers. This is great as it means other corporations will integrate it." |
| "cg1hkkr" | 2014-03-13 17:35:19 | "Bitcoin" | "This is a good point, and it's been acknowledged before. There have been a couple of threads where people have documented their attempts to move a few thousand dollars around the world as they move house, and it tends to turn out to be quite expensive because of the fees, assuming you start with fiat and want to end as fiat. Actually if you tally up all the fees involves in the system you describe, assuming you want timely transactions, you have to pay whatever fees your bank charges to make a purchase or transfer to the exchange, you then have to pay the fee the exchange charges on a purchase of BTC, you then have to pay the bitcoin transaction fee to send the bitcoins to your friend, your friend then has to pay another fee to the exchange on the other side to buy fiat, and then maybe even a final fee to get it into fiat in the end users bank account (either from the bank or exchange). But that's all a given, because with that usage, you're using Bitcoin as a 3rd party or a proxy for fiat, rather than as a currency in itself HOWEVER, you're still thinking from the perspective of someone who predominantly uses fiat. This all falls apart when you think about it from the perspective of BTC as the main currency: * the sender already has BTC, because they got paid in BTC (and possibly paid more than they would in fiat as you wont have the bank charging for the transaction, and the bitcoin transaction fee would be minimal if the employer paid all employers with a single transaction and multiple outputs) * the receiver spends the BTC by making a purchase from a vendor who directly accepts BTC, instead of converting into fiat first. That way the receiver doesn't have to pay exchange fees, and the vendor doesn't have to charge 3rd party processing fees In that scenario, the only actual fee is the (optional) transaction fee between the sender and receiver, and then you could include a transaction fee between the receiver and the final vendor (although a good vendor would pickup the fee for you in future)" |
| "cg1hjhi" | 2014-03-13 17:31:41 | "Bitcoin" | "Or maybe that's what he wanted you to think." |
| "cg1hip0" | 2014-03-13 17:29:01 | "Bitcoin" | "[deleted]" |
| "cg1hhzi" | 2014-03-13 17:26:36 | "Bitcoin" | "Also slippage will decrease with liquidity. " |


### 4. Exploratory Analysis on Sample Data
#### 4.1 Distribution
**By Month & Year**

In [11]:
# Format each timestamp as "<year>, <month name>" with the vectorised datetime accessor
random_sample_df['month_year'] = (
    random_sample_df['create_datetime']
    .dt.strftime("%Y, %B")
)

random_sample_df.head()

| | id | create_datetime | subreddit | full_text | month_year |
|---|---|---|---|---|---|
| 0 | djt6ym7 | 2017-07-05 23:12:37 | ethtrader | Writing my Thesis on Blockchain activities in ... | 2017, July |
| 1 | d2fn2te | 2016-04-25 03:17:13 | BitcoinMarkets | Wooo! | 2016, April |
| 2 | f1ne8ha | 2019-09-28 04:13:30 | CryptoMarkets | /r/angryupvote | 2019, September |
| 3 | co5mg1z | 2015-01-30 19:25:17 | Bitcoin | &gt; it would take an infeasible amount of enc... | 2015, January |
| 4 | dcmui6n | 2017-01-20 01:13:10 | Bitcoin | 0 fees is not a problem. Dubious leverage, poo... | 2017, January |


In [12]:
count_by_ym = (
    random_sample_df.groupby(['month_year'])
    .agg(unique_count=('id', 'nunique'))
    .reset_index()
)

In [13]:
px.bar(
    count_by_ym,
    title="Volume (Sampled) by Month & Year",
    labels={
        "month_year": "Year-Month",
        "unique_count": "No. of Docs"
    },
    x="month_year",
    y="unique_count",
    color="unique_count"
)
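
One thing to watch in the chart above: `groupby` sorts the `month_year` labels as strings, so "2017, April" lands before "2017, January" on the x-axis. A sortable key such as `%Y-%m` avoids this. The sketch below uses plain `datetime` objects with illustrative dates (not rows from the sample) so that it runs on its own. In the notebook, the same format string could be applied to `create_datetime` instead.

```python
from collections import Counter
from datetime import datetime

# Illustrative timestamps only, not rows from the sample.
stamps = [
    datetime(2017, 7, 5),
    datetime(2017, 1, 20),
    datetime(2017, 4, 2),
    datetime(2017, 4, 18),
]

# Month names sort alphabetically: April, January, July.
by_name = sorted(Counter(d.strftime("%Y, %B") for d in stamps).items())

# Zero-padded month numbers sort chronologically: 01, 04, 07.
by_number = sorted(Counter(d.strftime("%Y-%m") for d in stamps).items())
```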

**By Subreddit**

In [14]:
count_by_sr = (
    random_sample_df.groupby(["subreddit"])
    .agg(unique_count=('id', 'nunique'))
    .reset_index()
)

In [15]:
px.pie(
    count_by_sr,
    values="unique_count",
    names="subreddit",
    hole=.3,
    title="Breakdown of Volume (Sampled) by Subreddit"
)

**Comparison against Full Data**
* The distribution of the sample across subreddits closely approximates that of the full database, as the chart below shows.

![](../images/breakdown_by_subreddit.png)
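
To put a number on "closely approximates", one option is the total variation distance between the two subreddit share distributions: 0 means identical shares, 1 means no overlap. This is a sketch of that idea, not the comparison used to produce the chart. The counts below are made-up placeholders, since the full-index counts are not reproduced in this notebook.

```python
def shares(counts: dict) -> dict:
    """Convert raw counts per subreddit into proportions that sum to 1."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def total_variation_distance(sample_counts: dict, full_counts: dict) -> float:
    """Half the sum of absolute differences between the two share distributions."""
    p, q = shares(sample_counts), shares(full_counts)
    names = set(p) | set(q)
    return 0.5 * sum(abs(p.get(n, 0.0) - q.get(n, 0.0)) for n in names)


# Placeholder counts for illustration only; not taken from the index.
sample_counts = {"Bitcoin": 45, "ethtrader": 30, "CryptoMarkets": 25}
full_counts = {"Bitcoin": 500, "ethtrader": 300, "CryptoMarkets": 200}

tvd = total_variation_distance(sample_counts, full_counts)
```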

#### 4.2 Reddit Full Text Analysis
#### 4.2.1 Text Statistics
**Sentence Length Analysis**

In [16]:
# Length here means whitespace-delimited tokens per document, not per sentence
sent_length_df = (
    random_sample_df
    .full_text
    .str.split()
    .str.len()
)

px.histogram(
    sent_length_df,
    title="Reddit Sentence Length Histogram",
    nbins=100
)

In [17]:
sent_length_df.describe()

count    10000.000000
mean        28.736600
std         51.790084
min          1.000000
25%          6.000000
50%         14.000000
75%         32.000000
max       2146.000000
Name: full_text, dtype: float64

**Stop Word Analysis**

Analyse the frequency of stop words in the Reddit data using four stop word lists:
1. Elasticsearch stopword list
2. NLTK stopword list 
3. Gensim stopword list
4. SpaCy stopword list

In [18]:
# Elasticsearch Stop Word List

ES_STOPWORD_LIST = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is",
    "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there",
    "these", "they", "this", "to", "was", "will", "with"
]
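
As a starting point for the frequency analysis, the sketch below counts how often each stop word occurs and what share of all tokens are stop words. It assumes a simple lowercase regex tokeniser, which is an illustrative choice and not how Elasticsearch analyses text. `stopword_stats` is a hypothetical helper, and the stop word list is repeated so that the block runs on its own. The two texts are taken from the output in 3.3.

```python
import re
from collections import Counter

# Same words as ES_STOPWORD_LIST, repeated here to keep the block self-contained.
ES_STOPWORDS = frozenset([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
])


def stopword_stats(texts, stopwords):
    """Return (per-stop-word counts, share of all tokens that are stop words)."""
    counts = Counter()
    total_tokens = 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total_tokens += len(tokens)
        counts.update(t for t in tokens if t in stopwords)
    share = sum(counts.values()) / total_tokens if total_tokens else 0.0
    return counts, share


texts = [
    "Or maybe that's what he wanted you to think.",
    "Also slippage will decrease with liquidity.",
]
counts, share = stopword_stats(texts, ES_STOPWORDS)
```

In the notebook, `texts` could be replaced with `random_sample_df["full_text"]`, and the same function could be called once per stop word list to compare coverage.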

In [19]:
# NLTK Stop Words


In [20]:
# spaCy Stop Words
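
One way to load the three library lists imported in 1.1 is sketched below. It assumes the NLTK `stopwords` corpus has already been fetched with `nltk.download("stopwords")`, and it skips any library that is not available instead of failing. `load_stopword_lists` is a hypothetical helper name.

```python
def load_stopword_lists() -> dict:
    """Return whichever library stop word lists can be loaded in this environment."""
    lists = {}
    try:
        from nltk.corpus import stopwords
        lists["nltk"] = set(stopwords.words("english"))
    except (ImportError, LookupError):
        pass  # NLTK not installed, or the stopwords corpus has not been downloaded
    try:
        from spacy.lang.en.stop_words import STOP_WORDS
        lists["spacy"] = set(STOP_WORDS)
    except ImportError:
        pass  # spaCy not installed
    try:
        from gensim.parsing.preprocessing import STOPWORDS
        lists["gensim"] = set(STOPWORDS)
    except ImportError:
        pass  # Gensim not installed
    return lists


stopword_lists = load_stopword_lists()
list_sizes = {name: len(words) for name, words in stopword_lists.items()}
```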

**Top Words by Frequency**
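
A minimal sketch of a top-words count, assuming the same simple lowercase regex tokeniser and a caller-supplied stop word set. `top_words` is a hypothetical helper, and the texts and the three-word stop list are illustrative placeholders, not rows from the sample.

```python
import re
from collections import Counter


def top_words(texts, stopwords, n=10):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Illustrative texts only.
texts = [
    "Bitcoin fees are low",
    "Fees on bitcoin exchanges",
    "The whale moved bitcoin",
]
top_two = top_words(texts, stopwords={"are", "on", "the"}, n=2)
```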

**Top N-Grams by Frequency**
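
A minimal sketch of an n-gram count, assuming whitespace tokenisation and n-grams that never span two documents. `ngrams` and `top_ngrams` are hypothetical helpers, and the texts are illustrative placeholders.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n=2, k=10):
    """Return the k most frequent n-grams across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)


# Illustrative texts only.
texts = ["to the moon", "going to the moon soon", "the moon is far"]
top_bigrams = top_ngrams(texts, n=2, k=2)
```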

#### 4.2.2 Identifying Missing, Unknown or Unusual Data within the `full_text` field
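
The `describe()` output in section 3.2 already points at two issues: `[removed]` is the most frequent `full_text` value (497 rows), and only 9,921 of the 10,000 ids are unique. A first audit could count placeholder texts, blank texts and repeated ids, as sketched below. `audit_rows` is a hypothetical helper, the rows are illustrative, and treating `[removed]` and `[deleted]` as the only placeholders is an assumption based on the values visible in this notebook.

```python
from collections import Counter

# Assumption: these are the only placeholder values worth flagging.
PLACEHOLDERS = {"[removed]", "[deleted]"}


def audit_rows(rows):
    """Count placeholder and blank texts, and list ids that occur more than once."""
    id_counts = Counter(row["id"] for row in rows)
    return {
        "placeholder": sum(1 for r in rows if r["full_text"].strip() in PLACEHOLDERS),
        "empty": sum(1 for r in rows if not r["full_text"].strip()),
        "duplicate_ids": sorted(i for i, c in id_counts.items() if c > 1),
    }


# Illustrative rows only.
rows = [
    {"id": "a1", "full_text": "[removed]"},
    {"id": "a2", "full_text": "Wooo!"},
    {"id": "a2", "full_text": "Wooo!"},
    {"id": "a3", "full_text": "  "},
    {"id": "a4", "full_text": "[deleted]"},
]
report = audit_rows(rows)
```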

#### 4.2.3 Preliminary Wrangling

### 5. Topic Modelling

### 6. POS Tagging and Analysis