## Exploratory Analysis of Cryptocurrency Subreddit Data

### 1. Set Up
#### 1.1 Load Libraries

In [1]:
# Import libraries
import os
import re
from typing import Dict, List, Optional
import pandas as pd
from collections import defaultdict, Counter
import plotly.express as px
from tqdm import tqdm
from datetime import datetime
from elasticsearch_dsl import Search, Q
import emot
import contractions
import nltk
import spacy
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Change dir
os.getcwd()
os.chdir("..")  # Change to root dir to detect other local libs

# Import Local libs
from es.manager import ESManager
from es.utils import es_reddit_to_df
from etl.schema.es_mappings import (
    REDDIT_CRYPTO_INDEX_NAME,
    REDDIT_CRYPTO_CUSTOM_INDEX_NAME,
    reddit_crypto_mapping,
    reddit_crypto_custom_mapping,
)

### 2. Connect to ES

In [2]:
es_conn = ESManager()
es_client = es_conn.es_client
es_conn.get_status()



True

In [3]:
# Get aliases for easy reference
es_conn.get_aliases()

('.kibana_task_manager        .kibana_task_manager_7.16.3_001 - - - -\n'
 '.kibana_task_manager_7.16.3 .kibana_task_manager_7.16.3_001 - - - -\n'
 '.kibana                     .kibana_7.16.3_001              - - - -\n'
 '.kibana_7.16.3              .kibana_7.16.3_001              - - - -\n'
 'raw-data                    reddit-crypto                   - - - -\n'
 '.kibana-event-log-7.16.3    .kibana-event-log-7.16.3-000001 - - - true\n')


### 3. Pull Data from Raw Subreddit ES Index
#### 3.1 Test Search

In [4]:
test_search = (
    Search(index="raw-data").using(es_client).query("match", full_text="whale")
)

test_resp = test_search.execute()

# Get results
test_res_df = es_reddit_to_df(test_resp)
test_res_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               10 non-null     object        
 1   create_datetime  10 non-null     datetime64[ns]
 2   subreddit        10 non-null     object        
 3   full_text        10 non-null     object        
dtypes: datetime64[ns](1), object(3)
memory usage: 448.0+ bytes


#### 3.2 Pull a Random Sample of 10,000 Raw Reddit Documents
* Random sample ES query reference [here](http://richardhallett.com/posts/random-sampling-elasticsearch/).


In [5]:
# Random Sample query
SAMPLE_SIZE = 10_000
RANDOM_SEED = 42

random_sample_query = {
    "size": SAMPLE_SIZE,
    "query": {"function_score": {"random_score": {"seed": RANDOM_SEED, "field": "id"}}},
}

In [6]:
random_sample_res = es_conn.run_match_query(index="raw-data", query=random_sample_query)





In [7]:
random_sample_df = es_reddit_to_df(
    random_sample_res, input_type="es", output_type="pandas"
)
random_sample_df.describe()



Unnamed: 0,id,create_datetime,subreddit,full_text
count,10000,10000,10000,10000
unique,9921,9921,6,9073
top,hk8k5aa,2021-11-12 03:12:52,Bitcoin,[removed]
freq,3,3,4347,497
first,,2014-01-01 11:24:01,,
last,,2022-01-01 07:26:10,,


In [8]:
random_sample_df.head()

Unnamed: 0,id,create_datetime,subreddit,full_text
0,djt6ym7,2017-07-05 23:12:37,ethtrader,Writing my Thesis on Blockchain activities in ...
1,d2fn2te,2016-04-25 03:17:13,BitcoinMarkets,Wooo!
2,f1ne8ha,2019-09-28 04:13:30,CryptoMarkets,/r/angryupvote
3,co5mg1z,2015-01-30 19:25:17,Bitcoin,&gt; it would take an infeasible amount of enc...
4,dcmui6n,2017-01-20 01:13:10,Bitcoin,"0 fees is not a problem. Dubious leverage, poo..."


#### 3.3 [KIV] Pull Larger Amounts of Data

In [9]:
sample_data_search = Search(index="raw-data").using(es_client)[
    :10000
]  # Look at the first 10k documents

sample_data_resp = sample_data_search.execute()



In [10]:
all_data = es_reddit_to_df(sample_data_resp, input_type="dsl", output_type="polars")
all_data.head()

id,create_datetime,subreddit,full_text
str,datetime,str,str
"""cg1hkmu""",2014-03-13 17:35:30,"""Bitcoin""","""SAP is a mega software development framework for business systems, it links everything together. Production, inventory, transport, billing. It is the database of industrial corporations. It has a monopoly as the SAP systems of each corporation communicate through proprietary APIs. Integrating Bitcoin means they want an API to do transactions with the SAP system of their suppliers and customers. This is great as it means other corporations will integrate it."""
"""cg1hkkr""",2014-03-13 17:35:19,"""Bitcoin""","""This is a good point, and it's been acknowledged before. There have been a couple of threads where people have documented their attempts to move a few thousand dollars around the world as they move house, and it tends to turn out to be quite expensive because of the fees, assuming you start with fiat and want to end as fiat. Actually if you tally up all the fees involves in the system you describe, assuming you want timely transactions, you have to pay whatever fees your bank charges to make a purchase or transfer to the exchange, you then have to pay the fee the exchange charges on a purchase of BTC, you then have to pay the bitcoin transaction fee to send the bitcoins to your friend, your friend then has to pay another fee to the exchange on the other side to buy fiat, and then maybe even a final fee to get it into fiat in the end users bank account (either from the bank or exchange). But that's all a given, because with that usage, you're using Bitcoin as a 3rd party or a proxy for fiat, rather than as a currency in itself HOWEVER, you're still thinking from the perspective of someone who predominantly uses fiat. This all falls apart when you think about it from the perspective of BTC as the main currency: * the sender already has BTC, because they got paid in BTC (and possibly paid more than they would in fiat as you wont have the bank charging for the transaction, and the bitcoin transaction fee would be minimal if the employer paid all employers with a single transaction and multiple outputs) * the receiver spends the BTC by making a purchase from a vendor who directly accepts BTC, instead of converting into fiat first. 
That way the receiver doesn't have to pay exchange fees, and the vendor doesn't have to charge 3rd party processing fees In that scenario, the only actual fee is the (optional) transaction fee between the sender and receiver, and then you could include a transaction fee between the receiver and the final vendor (although a good vendor would pickup the fee for you in future)"""
"""cg1hjhi""",2014-03-13 17:31:41,"""Bitcoin""","""Or maybe that's what he wanted you to think."""
"""cg1hip0""",2014-03-13 17:29:01,"""Bitcoin""","""[deleted]"""
"""cg1hhzi""",2014-03-13 17:26:36,"""Bitcoin""","""Also slippage will decrease with liquidity. """


### 4. Exploratory Analysis on Sample Data
#### 4.1 Distribution
**By Month & Year**

In [11]:
# Format each timestamp as "YYYY, MonthName" using the vectorised datetime accessor
random_sample_df["month_year"] = random_sample_df["create_datetime"].dt.strftime(
    "%Y, %B"
)

random_sample_df.head()

Unnamed: 0,id,create_datetime,subreddit,full_text,month_year
0,djt6ym7,2017-07-05 23:12:37,ethtrader,Writing my Thesis on Blockchain activities in ...,"2017, July"
1,d2fn2te,2016-04-25 03:17:13,BitcoinMarkets,Wooo!,"2016, April"
2,f1ne8ha,2019-09-28 04:13:30,CryptoMarkets,/r/angryupvote,"2019, September"
3,co5mg1z,2015-01-30 19:25:17,Bitcoin,&gt; it would take an infeasible amount of enc...,"2015, January"
4,dcmui6n,2017-01-20 01:13:10,Bitcoin,"0 fees is not a problem. Dubious leverage, poo...","2017, January"


In [12]:
count_by_ym = (
    random_sample_df.groupby(["month_year"])
    .agg(unique_count=("id", "nunique"))
    .reset_index()
)

In [13]:
px.bar(
    count_by_ym,
    title="Volume (Sampled) by Month & Year",
    labels={"month_year": "Year-Month", "unique_count": "No. of Docs"},
    x="month_year",
    y="unique_count",
    color="unique_count",
)

**By Subreddit**

In [14]:
count_by_sr = (
    random_sample_df.groupby(["subreddit"])
    .agg(unique_count=("id", "nunique"))
    .reset_index()
)

In [15]:
px.pie(
    count_by_sr,
    values="unique_count",
    names="subreddit",
    hole=0.3,
    title="Breakdown of Volume (Sampled) by Subreddit",
)

**Comparison against Full Data**
* Comparing the sample against the full database shows that the sample's distribution across subreddits closely approximates the actual distribution.

![](../images/breakdown_by_subreddit.png)
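
One way to put a number on "closely approximates" is the total variation distance between the two subreddit share distributions (0 means identical, 1 means disjoint). The sketch below is only an illustration: the label lists are made up and the helper names are hypothetical, not part of this project.

```python
from collections import Counter


def subreddit_shares(labels):
    """Return each subreddit's share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two share dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels for illustration only (NOT the real counts)
full_labels = ["Bitcoin"] * 50 + ["ethtrader"] * 30 + ["CryptoMarkets"] * 20
sample_labels = ["Bitcoin"] * 6 + ["ethtrader"] * 3 + ["CryptoMarkets"] * 1

tvd = total_variation_distance(
    subreddit_shares(sample_labels), subreddit_shares(full_labels)
)
print(f"Total variation distance: {tvd:.3f}")
```

In practice the two label lists would come from the sampled DataFrame and from an aggregation over the full index.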

### 4.2 Reddit Full Text Analysis
#### 4.2.1 Text Statistics
**Sentence Length Analysis**

In [16]:
# Number of whitespace-delimited tokens per document
sent_length_df = random_sample_df.full_text.str.split().str.len()

px.histogram(sent_length_df, title="Reddit Sentence Length Histogram", nbins=100)

In [17]:
sent_length_df.describe()

count    10000.000000
mean        28.736600
std         51.790084
min          1.000000
25%          6.000000
50%         14.000000
75%         32.000000
max       2146.000000
Name: full_text, dtype: float64

**Stop Word Analysis**

Analyse the frequency of stop words in the Reddit data using:
1. Elasticsearch stopword list
2. NLTK stopword list 
3. SpaCy stopword list

Load Stopwords

In [18]:
# Elasticsearch Stop Word List
ES_STOPWORD_LIST = {
    "a",
    "an",
    "and",
    "are",
    "as",
    "at",
    "be",
    "but",
    "by",
    "for",
    "if",
    "in",
    "into",
    "is",
    "it",
    "no",
    "not",
    "of",
    "on",
    "or",
    "such",
    "that",
    "the",
    "their",
    "then",
    "there",
    "these",
    "they",
    "this",
    "to",
    "was",
    "will",
    "with",
}  # 33

In [19]:
# NLTK Stop Words
nltk.download("stopwords")
from nltk.corpus import stopwords

NLTK_STOPWORD_LIST = set(stopwords.words("english"))  # 179

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/christopherliew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
# spaCy Stop Words
from spacy.lang.en.stop_words import STOP_WORDS

SPACY_STOPWORD_LIST = set(STOP_WORDS)  # 326

In [21]:
# Helper to track stopword frequency for various nlp libs


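def get_stopword_freq(corpus: List[str], stopword_set: set) -> Dict[str, int]:
    """Count how often each word in `stopword_set` occurs in the tokenised corpus."""
    # Parameter renamed so it no longer shadows nltk.corpus.stopwords
    tracker = defaultdict(int)
    for word in corpus:
        if word in stopword_set:
            tracker[word] += 1
    return dict(tracker)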

Construct a Corpus


In [22]:
# Get sampled reddit text and split by whitespace

reddit_sample_text = random_sample_df["full_text"].str.split().values.tolist()

# Construct sample corpus
sample_corpus = [word.lower() for i in tqdm(reddit_sample_text) for word in i]

100%|██████████| 10000/10000 [00:00<00:00, 331995.95it/s]


1. ES Stopword Analysis

In [23]:
es_stopword_freq = get_stopword_freq(sample_corpus, ES_STOPWORD_LIST)

es_stopword_freq_df = (
    pd.DataFrame.from_records(
        [es_stopword_freq],
    )
    .T.sort_values(by=[0], ascending=False)
    .reset_index()
    .rename(columns={"index": "Stop Word", 0: "Frequency"})
)

es_stopword_freq_df.head(20)

Unnamed: 0,Stop Word,Frequency
0,the,10944
1,to,8050
2,a,6309
3,and,5562
4,is,4871
5,of,4778
6,in,3442
7,that,3337
8,it,3270
9,for,2828


In [24]:
px.bar(
    es_stopword_freq_df,
    x="Stop Word",
    y="Frequency",
    color="Frequency",
    title="Frequency Plot of ES Stopwords in Sampled Corpus",
)

2. NLTK Stopword Analysis

In [25]:
nltk_stopword_freq = get_stopword_freq(sample_corpus, NLTK_STOPWORD_LIST)

nltk_stopword_freq_df = (
    pd.DataFrame.from_records(
        [nltk_stopword_freq],
    )
    .T.sort_values(by=[0], ascending=False)
    .reset_index()
    .rename(columns={"index": "Stop Word", 0: "Frequency"})
)

nltk_stopword_freq_df.head(10)

Unnamed: 0,Stop Word,Frequency
0,the,10944
1,to,8050
2,a,6309
3,and,5562
4,is,4871
5,of,4778
6,i,4369
7,you,3717
8,in,3442
9,that,3337


In [26]:
px.bar(
    nltk_stopword_freq_df,
    x="Stop Word",
    y="Frequency",
    color="Frequency",
    title="Frequency Plot of NLTK Stopwords in Sampled Corpus",
)

3. spaCy Stopword Analysis

In [27]:
spacy_stopword_freq = get_stopword_freq(sample_corpus, SPACY_STOPWORD_LIST)

spacy_stopword_freq_df = (
    pd.DataFrame.from_records(
        [spacy_stopword_freq],
    )
    .T.sort_values(by=[0], ascending=False)
    .reset_index()
    .rename(columns={"index": "Stop Word", 0: "Frequency"})
)

spacy_stopword_freq_df.head(10)

Unnamed: 0,Stop Word,Frequency
0,the,10944
1,to,8050
2,a,6309
3,and,5562
4,is,4871
5,of,4778
6,i,4369
7,you,3717
8,in,3442
9,that,3337


In [28]:
px.bar(
    spacy_stopword_freq_df,
    x="Stop Word",
    y="Frequency",
    color="Frequency",
    title="Frequency Plot of spaCy Stopwords in Sampled Corpus",
)

**Compare Stop Word Frequencies Across Stopword Lists**
* Here we want to identify any other stopwords from the more aggressive / comprehensive NLTK and spaCy libraries and add them to our baseline ES stop words list.

In [29]:
# ES vs NLTK
es_nltk_stopword_diff = nltk_stopword_freq_df[
    ~nltk_stopword_freq_df["Stop Word"].isin(es_stopword_freq_df["Stop Word"])
]

# Top 10 Differences by Frequency
(es_nltk_stopword_diff.sort_values(by=["Frequency"], ascending=False).head(10))

Unnamed: 0,Stop Word,Frequency
6,i,4369
7,you,3717
18,have,1680
24,your,1324
26,can,1184
28,just,1150
29,so,1067
30,what,1038
31,my,1018
33,we,956


In [30]:
# ES vs spaCy
es_spacy_stopword_diff = spacy_stopword_freq_df[
    ~spacy_stopword_freq_df["Stop Word"].isin(es_stopword_freq_df["Stop Word"])
]

# Top 10 Differences by Frequency
(es_spacy_stopword_diff.sort_values(by=["Frequency"], ascending=False).head(10))

Unnamed: 0,Stop Word,Frequency
6,i,4369
7,you,3717
18,have,1680
24,your,1324
26,can,1184
28,just,1150
29,so,1067
30,what,1038
31,my,1018
33,we,956


**Stopword Analysis Conclusion**
* Taking the intersection of the two sets of stopwords that appear in spaCy and NLTK but not in ES gives a list of candidate additions to our stop word list.
* To evaluate whether a candidate should be added, we compare its ```inverse document frequency``` (IDF) against the ```idfs``` of the terms already in the ES stopword list.
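
The comparison can be turned into an explicit rule. The sketch below is a minimal, standard-library illustration that assumes one possible rule: accept a candidate when its IDF is no higher than the highest IDF among the baseline stopwords. The smoothed formula `ln((1 + n) / (1 + df)) + 1` is the default form used by scikit-learn's `TfidfVectorizer`; the toy documents, the word sets and the threshold rule itself are assumptions made for illustration.

```python
import math
import re


def smoothed_idf(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term at most once per document
        for term in set(re.findall(r"\b\w\w+\b", doc.lower())):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


# Toy corpus and word sets, for illustration only
toy_docs = [
    "the whale moved the coins to you",
    "you can hold the coins",
    "the fees to pay are high",
    "can you send the coins",
]
baseline = {"the", "to"}
candidates = {"you", "can", "whale"}

idf = smoothed_idf(toy_docs)
# Assumed rule: a candidate qualifies if it is at least as common as the rarest baseline word
threshold = max(idf[w] for w in baseline if w in idf)
accepted = sorted(w for w in candidates if w in idf and idf[w] <= threshold)
print(accepted)
```

A lower IDF means the term appears in more documents, so candidates that clear the threshold behave like the existing stopwords, while rare terms such as "whale" are kept.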

In [31]:
# Compare Differences between NLTK-ES and spaCy-ES
complete_stopword_diff = (
    es_nltk_stopword_diff[es_nltk_stopword_diff.isin(es_spacy_stopword_diff)]
    .dropna()
    .reset_index(drop=True)
)

print(len(complete_stopword_diff))

# Top 20 Differences by Frequency
(complete_stopword_diff.sort_values(by=["Frequency"], ascending=False))

17


Unnamed: 0,Stop Word,Frequency
0,i,4369.0
1,you,3717.0
2,have,1680.0
3,your,1324.0
4,can,1184.0
5,just,1150.0
6,so,1067.0
7,what,1038.0
8,my,1018.0
9,we,956.0
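
When `DataFrame.isin` is given another DataFrame, pandas matches on index and column labels as well as on values, so the cell above keeps only rows that sit at the same index in both frames and may under-count the overlap. A label-based intersection avoids that dependence on row position. The sketch below shows the idea in plain Python with made-up counts; in pandas the equivalent would be filtering on `df["Stop Word"].isin(other["Stop Word"])`.

```python
# Hypothetical frequency tables keyed by stop word (illustrative counts only)
nltk_not_in_es = {"i": 40, "you": 30, "have": 12, "ours": 2}
spacy_not_in_es = {"i": 40, "you": 30, "have": 12, "whereas": 1}

# Intersect on the word itself rather than on row position
shared_words = nltk_not_in_es.keys() & spacy_not_in_es.keys()
shared_by_freq = sorted(shared_words, key=lambda w: nltk_not_in_es[w], reverse=True)
print(shared_by_freq)
```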


In [32]:
# Compute IDF


tfidf_vec = TfidfVectorizer()

tfm_sampled_corpus = tfidf_vec.fit_transform(random_sample_df.full_text.tolist())

In [33]:
stop_word_idfs = pd.DataFrame(
    {"Word": tfidf_vec.get_feature_names_out(), "IDF": tfidf_vec.idf_}
).sort_values(by=["IDF"])

In [34]:
# ES Stop Word IDFS
es_stopword_idfs = stop_word_idfs[
    stop_word_idfs["Word"].isin(ES_STOPWORD_LIST)
].reset_index()
es_stopword_idfs

Unnamed: 0,index,Word,IDF
0,16547,the,1.866649
1,16769,to,1.998415
2,9742,is,2.232472
3,2183,and,2.235908
4,9774,it,2.262055
5,11963,of,2.356836
6,16543,that,2.438952
7,9279,in,2.509693
8,7787,for,2.610538
9,16626,this,2.72551


In [35]:
# Candidate Additions to the Stopword List from NLTK and spaCy
additional_stopword_idfs = stop_word_idfs[
    stop_word_idfs["Word"].isin(complete_stopword_diff["Stop Word"])
].reset_index()

additional_stopword_idfs

Unnamed: 0,index,Word,IDF
0,18514,you,2.422643
1,8668,have,3.025053
2,3822,can,3.198325
3,9939,just,3.313746
4,18110,what,3.31577
5,15433,so,3.337277
6,18520,your,3.391516
7,2045,all,3.486608
8,11394,more,3.551146
9,6202,do,3.56405


In [36]:
COMPLETE_STOPWORD_LIST = list(ES_STOPWORD_LIST)
COMPLETE_STOPWORD_LIST.extend(complete_stopword_diff["Stop Word"].tolist())

**Top Words by Frequency**

In [37]:
# Take the 50 most common words, then drop those in the STOP WORD LIST
sample_corpus_counter = Counter(sample_corpus)
common_words = sample_corpus_counter.most_common(50)
common_words_df = (
    pd.DataFrame.from_records(
        [
            {
                word: count
                for word, count in common_words
                if word not in COMPLETE_STOPWORD_LIST
            }
        ]
    )
    .T.reset_index()
    .rename(columns={"index": "Word", 0: "Frequency"})
)

In [38]:
# After removing stop words we are left with 17 from the top 50
common_words_df.head(20)

Unnamed: 0,Word,Frequency
0,bitcoin,1381
1,like,984
2,it's,909
3,would,905
4,people,775


**Observations**
1. First-person terms such as ```I```, which mostly signal that an opinion is being expressed, could be removed if the corpus is used for downstream modelling such as topic modelling (but not for language modelling).
2. Other terms such as ```[deleted]``` or ```[removed]``` should be handled as well.
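
A minimal sketch of how such placeholder documents could be filtered out before modelling. The placeholder set and the helper name are assumptions for illustration, not part of the existing pipeline.

```python
# Assumed set of Reddit placeholder bodies
PLACEHOLDER_TEXTS = {"[deleted]", "[removed]"}


def is_placeholder(text: str) -> bool:
    """True when the whole document is just a placeholder marker."""
    return text.strip().lower() in PLACEHOLDER_TEXTS


docs = [
    "[removed]",
    "Also slippage will decrease with liquidity. ",
    " [deleted] ",
    "Wooo!",
]
kept = [d for d in docs if not is_placeholder(d)]
print(kept)
```

With a DataFrame, the same predicate could be applied to the full text column to build a boolean mask.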

**Top N-grams by Frequency**

In [39]:
# Helper to get top n grams


def get_top_ngram(corpus: List[str], n: int = 1, top: int = 50):
    """Return the `top` most frequent n-grams of size `n` as (ngram, count) tuples."""
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:top]

In [40]:
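# Using our raw sample full text data get top 100 bigrams (rows 49 onwards are shown)
# NOTE: sample_corpus holds single whitespace-delimited tokens, so bigrams can only
# form inside tokens that CountVectorizer splits further (URLs, contractions, numbers)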

top_100_bigrams = pd.DataFrame(
    get_top_ngram(corpus=sample_corpus, n=2, top=100), columns=["Bigram", "Frequency"]
)
top_100_bigrams.iloc[49:]

Unnamed: 0,Bigram,Frequency
49,amp message,25
50,com btc,24
51,https imgur,24
52,org wiki,24
53,finance tip,23
54,tip contentid,23
55,wikipedia org,23
56,they ve,23
57,https exchange,23
58,exchange pancakeswap,23


**Observations**
1. A lot of HTML (E.g. ```amp x200b```) / URL garble that needs to be cleaned.
2. Contractions like ```we ve``` should be expanded to ```we have```.
3. The only interesting terms are:
   * ***pancakeswap finance***
   * ***finance swap***
   * ***donut finance*** (Service to convert USD to stablecoins and lend off to partners for interest)
   * ***exchange pancakeswap***
   * ***abbreviations*** (E.g. ```btc -> Bitcoin```)
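
The first two observations can be sketched with the standard library alone. This is an illustration under stated assumptions: the contraction mapping is deliberately tiny and hypothetical (a dedicated library such as `contractions` covers far more cases), and unescaping twice is an assumed way to handle double-escaped entities such as `&amp;#x200B;`.

```python
import html
import re

# Tiny illustrative mapping, not an exhaustive list
CONTRACTION_MAP = {"we've": "we have", "they've": "they have", "it's": "it is"}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    re.IGNORECASE,
)


def clean_markup(text: str) -> str:
    # Second pass handles double-escaped entities like "&amp;#x200B;"
    text = html.unescape(html.unescape(text))
    text = text.replace("\u200b", "")  # drop zero-width spaces
    text = CONTRACTION_RE.sub(lambda m: CONTRACTION_MAP[m.group(1).lower()], text)
    return re.sub(r"\s+", " ", text).strip()


cleaned_example = clean_markup("&amp;#x200B;they've said &gt; we've won")
print(cleaned_example)
```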

In [52]:
# Trigrams

top_100_trigrams = pd.DataFrame(
    get_top_ngram(corpus=sample_corpus, n=3, top=100), columns=["Trigram", "Frequency"]
)

# Remove common URL and HTML attributes
# (non-capturing groups avoid the pandas "match groups" warning)
top_100_trigrams[
    top_100_trigrams.Trigram.str.contains(r"^(?!.*(?:amp|https?|www|com)).*$")
]



Unnamed: 0,Trigram,Frequency
13,000 000 000,45
16,pancakeswap finance swap,38
21,finance swap outputcurrency,33
27,donut finance tip,23
28,finance tip contentid,23
30,exchange pancakeswap finance,23
32,wikipedia org wiki,22
33,en wikipedia org,21
41,poa xdai tx,19
47,to 2fr 2fbitcoin,16


**Non-Alphanumeric Words & Expressions**

In [53]:
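# .match anchors at the start, so this keeps tokens that BEGIN with a non-alphanumeric char
non_alphanum_pattern = re.compile(r"[^0-9a-zA-Z\s]+")
non_alphanum_words = set(
    filter(non_alphanum_pattern.match, sample_corpus)
)  # De-duplicate with a set: ~10k matches as a list, ~4.5k unique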

In [77]:
non_alphanum_words

{'#legal',
 '❤',
 '[changetip',
 '#disrupt',
 '(cashing',
 '(read',
 '(he',
 '**what',
 '/r/silkroad',
 '.1%',
 '(8094493-8094503)',
 '**address',
 '•_•)&gt;⌐■-■',
 '^bot.',
 '**yesterday',
 '"this',
 '*ze',
 '🤷🏽\u200d♀️',
 '"hacking"',
 '**flippening**',
 '|',
 '"empty',
 '["mucked',
 '*used',
 '**alot**',
 '🍕🌭',
 '‘cup',
 '(1',
 '(except',
 '(unfoundated)',
 '"average',
 '@1000',
 '[bitmart',
 '$59,442.15',
 '^^*[blockonomics.co](https://www.blockonomics.co/#/search?q=23e405646a00064d93e3f24261ce753a2a9b89a8cb9c4dbf4d05cbbeb6dbbdc0)*',
 '&amp;quot;potential',
 '"myspace',
 '$0.01',
 '[www.synchrobit.io](https://www.synchrobit.io/)',
 '"outside',
 '*oooo',
 "'confirmed",
 "'manipulation'.",
 '“adoption”',
 '(bike',
 '#adopted',
 '$55.',
 '[ios]',
 '"noise"',
 '(create2',
 '"criminal"',
 '**regulations',
 '(eventually)',
 '[tampermonkey](https://chrome.google.com/webstore/detail/tampermonkey/dhdgffkkebhmkfjojejmpbldmpobfkfo),',
 '"careless".',
 '^(**summon**:',
 '🌑moon',
 '(again',
 '[

**1. Emojis**
* We want to look at the most frequent emojis and emoticons and what they mean semantically, so we normalise them before exploring their frequency plots.
* Emojis and emoticons can appear in textual or unicode form, so we standardise them by translating as many as possible to unicode for consistent text analysis and information retrieval.
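
As a rough illustration of the normalisation idea, the sketch below maps a few textual emoticons onto their unicode emoji before counting, so that both forms contribute to the same frequency. The mapping is hypothetical and deliberately tiny; it is not the lookup table used by `emot`.

```python
from collections import Counter

# Hypothetical, deliberately tiny mapping from text emoticons to emoji
EMOTICON_TO_EMOJI = {":)": "🙂", ":-)": "🙂", ":(": "🙁", "<3": "❤"}
KNOWN_EMOJI = set(EMOTICON_TO_EMOJI.values())


def normalise_token(token: str) -> str:
    """Translate a known emoticon to its emoji; leave other tokens unchanged."""
    return EMOTICON_TO_EMOJI.get(token, token)


tokens = ["to", "the", "moon", ":)", "🙂", "<3", "❤", ":-)"]
normalised = [normalise_token(t) for t in tokens]
emoji_counts = Counter(t for t in normalised if t in KNOWN_EMOJI)
print(emoji_counts.most_common())
```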

A. Emojis

In [73]:
emoter = emot.core.emot()
emoji_res = emoter.bulk_emoji(sample_corpus)

In [74]:
identified_emojis = [(x["value"], x["mean"]) for x in emoji_res if x["flag"]]

In [97]:
# Plot Emoji Frequency Lists
identified_emojis_values = [i[0] for i in identified_emojis]
identified_emojis_values_unnest = [
    i for sublist in identified_emojis_values for i in sublist
]
identified_emojis_freq = Counter(identified_emojis_values_unnest)

In [110]:
# Build the frequency table directly from the Counter, sorted by descending frequency
emoji_freq_df = pd.DataFrame(
    identified_emojis_freq.most_common(), columns=["Emoji", "Frequency"]
)

px.bar(
    emoji_freq_df.iloc[:30],
    y="Frequency",
    x="Emoji",
    color="Frequency",
    title="Top 30 Emoji Frequency Plot",
)

B. Emoticons

In [64]:
emoticon_res = emoter.bulk_emoticons(non_alphanum_words)
identified_emoticons = [(x["value"], x["mean"]) for x in emoticon_res if x["flag"]]

In [120]:
# Plot Emoticon Frequency Lists
identified_emoticon_values = [i[0] for i in identified_emoticons]
identified_emoticon_values_unnest = [
    i for sublist in identified_emoticon_values for i in sublist
]
identified_emoticon_freq = Counter(identified_emoticon_values_unnest)

In [108]:
# Build the frequency table directly from the Counter, sorted by descending frequency
emoticon_freq_df = pd.DataFrame(
    identified_emoticon_freq.most_common(), columns=["Emoticon", "Frequency"]
)

px.bar(
    emoticon_freq_df.iloc[:30],
    y="Frequency",
    x="Emoticon",
    color="Frequency",
    title="Top 30 Emoticon Frequency Plot",
)

**2. URLs, HTML and User Handles**
* Here we will analyse common URL and HTML elements to see which sites Reddit users link to most often and whether they provide any insight.
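
A minimal sketch of how linked domains could be tallied, using only the standard library. The URL pattern is a simplification and the example documents are made up; real Reddit markdown may need a more careful extractor.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified URL pattern: stops at whitespace or a closing bracket/parenthesis
URL_RE = re.compile(r"https?://[^\s)\]]+")


def extract_domains(text: str):
    """Return the host of every URL in the text, without a leading 'www.'."""
    domains = []
    for url in URL_RE.findall(text):
        netloc = urlparse(url).netloc.lower()
        domains.append(netloc[4:] if netloc.startswith("www.") else netloc)
    return domains


# Made-up documents for illustration
docs = [
    "see [wiki](https://en.wikipedia.org/wiki/Bitcoin) for details",
    "chart: https://imgur.com/abc and https://www.imgur.com/def",
]
domain_counts = Counter(d for doc in docs for d in extract_domains(doc))
print(domain_counts.most_common())
```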

#### 4.2.2 Preprocessing and Handling Missing / Unusual Data within the ```Full Text``` field

Here, we will validate some predefined preprocessing steps inspired by ```Bernie & Yilmaz (2019)``` and ```Cheng et al (2019)``` who also leverage reddit data for lexical and text analysis.

We will also validate our preprocessing using ```Elasticsearch's``` ```Analyze API``` to ensure that it is consistent with the python libraries utilised below.

**Preprocessing Steps**
1. Replace common abbreviations
2. Remove very long words
3. Remove URLs, HTML Tags, New Line Chars, Twitter/Reddit Handles
4. Drop documents with only URLs,..., Handles as text
5. Remove deleted documents
6. Replace dollar symbols
7. Expand common contractions
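
The steps above could be sketched as a single function along the following lines. This is an illustration under stated assumptions, not the validated pipeline: the abbreviation map, the 30-character cut-off, the regular expressions and the `usd` replacement token are all hypothetical choices, and step 7 is omitted here (a library such as `contractions` could handle it).

```python
import re
from typing import Optional

ABBREVIATIONS = {"btc": "bitcoin", "eth": "ethereum"}  # illustrative only
MAX_WORD_LEN = 30  # assumed cut-off for "very long" words
DELETED = {"[deleted]", "[removed]"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
HANDLE_RE = re.compile(r"(?<!\w)(?:/?u/|@)\w+")  # /u/name, u/name, @name
DOLLAR_RE = re.compile(r"\$(?=\d)")


def preprocess(text: str) -> Optional[str]:
    """Return cleaned text, or None when the document should be dropped."""
    if text.strip().lower() in DELETED:  # step 5
        return None
    text = URL_RE.sub(" ", text)  # step 3
    text = HTML_TAG_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = DOLLAR_RE.sub("usd ", text)  # step 6
    words = []
    for word in text.split():  # split() also removes new line characters
        if len(word) > MAX_WORD_LEN:  # step 2
            continue
        words.append(ABBREVIATIONS.get(word.lower(), word))  # step 1
    return " ".join(words) or None  # step 4: nothing left means drop


docs = ["[deleted]", "https://imgur.com/abc", "@bob BTC hit $50 today\n<b>wow</b>"]
cleaned = [c for c in map(preprocess, docs) if c is not None]
print(cleaned)
```

Returning `None` for dropped documents keeps the decision to discard separate from the cleaning itself, which makes it easy to count how many documents each rule removes.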

### 5. Simple Topic Modelling

### 6. POS Tagging and Analysis