## Import Libraries

In [1]:
import requests
import pandas as pd

## Background

We work for a cryptocurrency trading platform startup. In recent months, the customer service team has received an increasing number of enquiries about the cryptocurrencies available on our platform. On closer inspection, they found that a large proportion of these enquiries concern what those cryptocurrencies are and what they are used for. Faced with an increasing workload and resource constraints, the head of customer service has engaged our team to develop a real-time chatbot for the company website to automate responses to such simple enquiries. A real-time chatbot will not only free the customer service team to focus on complex enquiries and feedback, it can also help educate users about our products in a more timely and accurate manner, and hence enhance their user experience.

## Objectives

In this project, our team will be using data from two subreddits to achieve the following:

1. Understand what type of data is used in the respective posts and use natural language processing (NLP) to analyze the text data in each of the subreddits.
2. Use machine learning (ML) classifiers such as Multinomial Naive Bayes (NB) and Linear Support Vector Machine (SVM) to identify the subreddit a submission is likely to originate from.
3. Evaluate the ML classifiers against our baseline model using accuracy as the key metric. An ML classifier is said to outperform the baseline model if it has a higher accuracy score. Our target is a classifier with at least 80% accuracy.
4. Propose the most suitable ML classifier for developing a minimum viable product (MVP) of the chatbot, and make other recommendations.
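
The baseline in objective 3 is not defined at this point. One common reading, assumed here, is a model that always predicts the most frequent class. The sketch below computes that baseline for an illustrative label list with 3,000 posts per subreddit (the sample size stated in the Data Acquisition section); `baseline_accuracy` is a hypothetical helper name, not part of the project code.

```python
import pandas as pd

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return pd.Series(labels).value_counts(normalize=True).max()

# Illustrative labels only: 3,000 posts from each subreddit
labels = ['Bitcoin'] * 3000 + ['Ethereum'] * 3000
baseline = baseline_accuracy(labels)
print(f'Baseline accuracy: {baseline:.2f}')
```

With balanced classes this baseline sits at 50%, so the 80% target would be a clear improvement over it.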

## 1. Data Acquisition

In this project, we will collect posts through the Pushshift API, a third-party interface to Reddit submission data. Reddit is an American social news aggregation, web content rating, and discussion website. It describes itself as the 'front page of the Internet' and, according to Alexa, is the seventh most popular site in the U.S. and the 19th worldwide. We will be collecting submission data from two subreddits, r/Bitcoin and r/Ethereum. Bitcoin and Ethereum were chosen because they dominate, by market capitalization, a market of over 7,000 cryptocurrencies: the former has a market capitalization of USD 766B and the latter USD 266B (source: https://www.livecoinwatch.com/).
Both subreddits have a large community base (r/Bitcoin has 3.2 million members while r/Ethereum has over 1 million builders) and are highly active. This leads us to believe that data from the two subreddits will give us a fair representation of the population's views on trends, ideas and opinions, and of how they are expressed.

As Pushshift now limits each request to 100 posts, we automate the process as much as possible to make data acquisition reproducible and scalable. A total of 6,000 submissions were collected: 3,000 from each subreddit, scraped separately.

### 1.1 Study JSON structure of r/Bitcoin subreddit API response

In [2]:
# Define url for r/Bitcoin subreddit API
url_btc = 'https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin'

In [3]:
# Submit request and check that the status code is 200 (OK)
res = requests.get(url_btc)
res.status_code

200

In [4]:
# Preview what the data looks like
btc_preview = res.json()
btc_preview

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'clientgenoa',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_a0c300ss',
   'author_is_blocked': False,
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1627578293,
   'domain': 'businessinsider.com',
   'full_link': 'https://www.reddit.com/r/Bitcoin/comments/ou1k7p/bitcoin_price_prediction_expert_breaks_down_60000/',
   'gildings': {},
   'id': 'ou1k7p',
   'is_created_from_ads_ui': False,
   'is_crosspostable': True,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': True,
   'is_self': False,
   'is_video': False,
   'link_flair_background_color': '',
   'link_flair_richtext': [],
   'link_flair_text_color': 'dark',
   'link_flair_type':

At first glance, the API response for r/Bitcoin is structured like a dictionary: a single 'data' key holds a list of posts, each of which is itself a dictionary of fields. Judging by the length of the values, the response seems to contain little text. Most of the fields are also not relevant to the content of the post and can be removed during data cleansing.
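
As a rough illustration of dropping irrelevant fields, the sketch below keeps a shortlist of columns from a single sample post built from a few values in the preview. The shortlist is hypothetical, and `title` and `selftext` are assumed field names that do not appear in the truncated preview, so they are filled with NaN here rather than real data.

```python
import pandas as pd

# Sample post using a few fields and values from the preview output
sample_post = {
    'author': 'clientgenoa',
    'created_utc': 1627578293,
    'domain': 'businessinsider.com',
    'id': 'ou1k7p',
    'is_self': False,
    'is_video': False,
    'link_flair_text_color': 'dark',
}

# Hypothetical shortlist; 'title' and 'selftext' are assumed field names
keep_columns = ['id', 'created_utc', 'author', 'title', 'selftext']

df_preview = pd.DataFrame([sample_post])
# reindex keeps the shortlisted columns and fills absent ones with NaN instead of raising
df_trimmed = df_preview.reindex(columns=keep_columns)
print(df_trimmed.columns.tolist())
```

Using `reindex` rather than plain column selection is a design choice for this sketch: it tolerates posts where an expected field is missing.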

### 1.2 Create a function to scrape subreddit and store data as a dataframe

In [5]:
# Define 'reddit_scrape' function which takes subreddit_string, number_of_scrapes and utc_marker as inputs
# Each scrape extracts up to 100 posts from the given subreddit that were created BEFORE the utc_marker
def reddit_scrape(subreddit_string, number_of_scrapes, utc_marker):
    url = 'https://api.pushshift.io/reddit/search/submission' # Define url for Pushshift API
    frames = [] # Collect one data frame per request

    for i in range(number_of_scrapes):
        params = {
            'subreddit': subreddit_string,
            'size': 100,
            'before': utc_marker
        }
        req = requests.get(url, params=params) # Submit request
        req.raise_for_status() # Stop early if the request failed
        posts = req.json()['data']
        if not posts: # No older posts left to collect
            break
        utc_marker = posts[-1]['created_utc'] # New Unix timestamp marker from last post in this batch
        frames.append(pd.DataFrame(posts)) # Store up to 100 posts in a data frame

    # Concatenate all batches into a single data frame with a fresh index
    df_all = pd.concat(frames, axis=0, ignore_index=True) if frames else pd.DataFrame()

    return df_all
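
The marker-based pagination used by `reddit_scrape` can be illustrated without network access. In the sketch below, `fake_posts`, `fake_fetch` and `paginate` are hypothetical stand-ins, and the fake API is assumed to return posts newest first and strictly before the marker; the real service may behave differently.

```python
# Fake archive of posts with descending timestamps, standing in for the API
fake_posts = [{'id': f'p{t}', 'created_utc': t} for t in range(1000, 0, -1)]

def fake_fetch(before, size=100):
    """Return up to `size` posts created strictly before `before`, newest first."""
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:size]

def paginate(fetch, number_of_scrapes, utc_marker):
    """Collect batches by moving the marker to the last post of each batch."""
    collected = []
    for _ in range(number_of_scrapes):
        batch = fetch(utc_marker)
        if not batch:  # nothing older remains
            break
        utc_marker = batch[-1]['created_utc']
        collected.extend(batch)
    return collected

posts = paginate(fake_fetch, 3, 901)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

Because each request starts from the timestamp of the last post already collected, consecutive batches do not overlap under the strict-before assumption.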

#### 1.2.1 From r/Bitcoin subreddit

In [6]:
# Call reddit_scrape function to scrape and store 3,000 posts from r/Bitcoin subreddit
df_btc = reddit_scrape('Bitcoin', 30, 1626939127)

In [7]:
# Display first 5 rows of df_btc
df_btc.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,suggested_sort,link_flair_text,gallery_data,is_gallery,media_metadata,author_flair_background_color,banned_by,author_cakeday,crosspost_parent,crosspost_parent_list
0,[],False,theremnanthodl,noob,"[{'e': 'text', 't': 'redditor for a day'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for a day,dark,richtext,t2_dg8srid3,...,,,,,,,,,,
1,[],False,theremnanthodl,,[],,,,text,t2_dg8srid3,...,,,,,,,,,,
2,[],False,ReadDailyCoin,noob,"[{'e': 'text', 't': 'redditor for 3 months'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for 3 months,dark,richtext,t2_bmm97n7n,...,,,,,,,,,,
3,[],False,theloiteringlinguist,,[],,,,text,t2_7em1h7ph,...,,,,,,,,,,
4,[],False,Electronic_Chard1987,,[],,,,text,t2_994q7jme,...,,,,,,,,,,


In [8]:
# Check number of rows and columns in df_btc 
df_btc.shape

(3000, 82)

In [9]:
# Check number of unique values in 'id' column
df_btc[['id']].nunique()

id    3000
dtype: int64
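
Here the number of unique ids equals the number of rows, so no post was collected twice. Had the counts differed, duplicates could be counted and dropped along the following lines; `df_demo` is a toy frame with made-up values, not scraped data.

```python
import pandas as pd

# Toy frame with one repeated id, standing in for scraped posts
df_demo = pd.DataFrame({'id': ['a1', 'b2', 'b2', 'c3'],
                        'title': ['first', 'second', 'second', 'third']})

n_duplicates = df_demo['id'].duplicated().sum()  # rows whose id already appeared earlier
df_unique = df_demo.drop_duplicates(subset='id').reset_index(drop=True)

print(n_duplicates, df_unique.shape)
```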

In [10]:
# Export df_btc data frame as csv
# Code commented to ensure csv file does not get overwritten by mistake
# df_btc.to_csv('/Users/Ash/Desktop/Project-3/data/bitcoin_dataset.csv', index=False)
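
An alternative to commenting out the export is to write the file only when it does not exist yet. The sketch below shows one possible way to do this; `export_csv_once` is a hypothetical helper, and the demonstration writes to a temporary folder rather than the project's data directory.

```python
from pathlib import Path
import tempfile
import pandas as pd

def export_csv_once(df, path):
    """Write df to path only if the file does not exist yet; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    df.to_csv(path, index=False)
    return True

# Demonstration in a temporary folder so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'bitcoin_dataset.csv'
    df_demo = pd.DataFrame({'id': ['a1', 'b2']})
    first_write = export_csv_once(df_demo, target)   # file is created
    second_write = export_csv_once(df_demo, target)  # existing file is left alone

print(first_write, second_write)
```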

#### 1.2.2 From r/Ethereum subreddit

In [11]:
# Call reddit_scrape function to scrape and store 3,000 posts from r/Ethereum subreddit
df_eth = reddit_scrape('Ethereum', 30, 1626939643)

In [12]:
# Display first 5 rows of df_eth
df_eth.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,crosspost_parent,crosspost_parent_list,author_flair_background_color,author_flair_text_color,banned_by,author_cakeday,is_gallery,gallery_data,media_metadata,poll_data
0,[],False,wuzzgucci,,[],,text,t2_53ftzlxj,False,False,...,,,,,,,,,,
1,[],False,Needle_NFT,,[],,text,t2_bamj1t7z,False,False,...,,,,,,,,,,
2,[],False,AOFEX__Official,,[],,text,t2_b00ercqa,False,False,...,,,,,,,,,,
3,[],False,FarEnergy3518,,[],,text,t2_bonkrdx5,False,False,...,,,,,,,,,,
4,[],False,excusemealot,,[],,text,t2_b4v6m29o,False,False,...,,,,,,,,,,


In [13]:
# Check number of rows and columns in df_eth
df_eth.shape

(3000, 80)
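
The two data frames differ in width (82 columns for r/Bitcoin, 80 for r/Ethereum), so some fields appear in only one of them. A quick way to see which ones is a set comparison of the column names. The sketch below uses toy frames with illustrative columns in place of `df_btc` and `df_eth`.

```python
import pandas as pd

# Toy frames standing in for df_btc and df_eth; the columns are illustrative
df_a = pd.DataFrame({'id': ['a1'], 'author': ['x'], 'suggested_sort': [None]})
df_b = pd.DataFrame({'id': ['b1'], 'author': ['y'], 'poll_data': [None]})

only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
only_in_b = sorted(set(df_b.columns) - set(df_a.columns))
shared = sorted(set(df_a.columns) & set(df_b.columns))

print(only_in_a, only_in_b, shared)
```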

In [14]:
# Check number of unique values in 'id' column
df_eth[['id']].nunique()

id    3000
dtype: int64

In [15]:
# Export df_eth data frame as csv
# Code commented to ensure csv file does not get overwritten by mistake
# df_eth.to_csv('/Users/Ash/Desktop/Project-3/data/ethereum_dataset.csv', index=False)