<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Executive Summary</a></span></li></ul></li><li><span><a href="#Part-0:-Overview-&amp;-Problem-Statement" data-toc-modified-id="Part-0:-Overview-&amp;-Problem-Statement-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 0: Overview &amp; Problem Statement</a></span><ul class="toc-item"><li><span><a href="#Background-/-Overview" data-toc-modified-id="Background-/-Overview-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Background / Overview</a></span></li><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Problem Statement</a></span></li></ul></li><li><span><a href="#Part-1:-Webscraping-with-PushShift-API" data-toc-modified-id="Part-1:-Webscraping-with-PushShift-API-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Part 1: Webscraping with PushShift API</a></span><ul class="toc-item"><li><span><a href="#The-Function" data-toc-modified-id="The-Function-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The Function</a></span></li><li><span><a href="#The-Execution" data-toc-modified-id="The-Execution-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>The Execution</a></span></li></ul></li></ul></div>

## Executive Summary

A cryptocurrency is a digital currency secured by cryptography. At the time of writing, Bitcoin and Ethereum are the two largest cryptocurrencies by market capitalization. While the two coins share some similarities, they also differ in many ways. Investors and traders who wish to learn more about them may find it difficult to grasp the terminology and jargon used in this field.

A natural language processing classifier will be trained on posts from the two respective subreddits. It will learn to identify keywords that are more commonly associated with, or unique to, each coin. The classifier can then help businesses and enterprises develop an efficient and sophisticated query-answering and routing algorithm for their online chatbot to handle the large number of enquiries.

A total of 12 vectorizer-model combinations were evaluated. The vectorizers considered were Count Vectorizer with binary=True, Count Vectorizer with binary=False, and Tfidf Vectorizer. The models considered were Bernoulli Naive Bayes, Multinomial Naive Bayes, Support Vector Machine, and Logistic Regression. The combinations were evaluated on two metrics: accuracy and area under the receiver operating characteristic curve (ROC AUC).

For each combination, the text feature was first transformed by one of the text vectorizers and then passed into GridSearchCV to find the optimal set of hyperparameters. The optimized model was then cross-validated on the training dataset and subsequently tested on the test dataset.
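
The procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the toy corpus, the labels, the single-parameter grid, and `cv=2` are all hypothetical, and only ROC AUC is scored here.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in corpus (hypothetical); the real project uses scraped subreddit posts.
texts = [
    "bitcoin lightning network fees", "bitcoin halving this year",
    "buy bitcoin and hold", "bitcoin miners and hash rate",
    "ethereum gas fees are high", "eth staking rewards",
    "mint an nft on ethereum", "eth smart contract bug",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = Bitcoin, 1 = Ethereum

vectorizers = {
    "count_binary": CountVectorizer(binary=True),
    "count": CountVectorizer(binary=False),
    "tfidf": TfidfVectorizer(),
}
models = {
    "bernoulli_nb": BernoulliNB(),
    "multinomial_nb": MultinomialNB(),
    "svm": SVC(),
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for (vec_name, vec), (model_name, model) in product(vectorizers.items(), models.items()):
    pipe = Pipeline([("vec", vec), ("model", model)])
    # Illustrative grid: only the n-gram range is tuned in this sketch.
    grid = {"vec__ngram_range": [(1, 1), (1, 2)]}
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=2)
    search.fit(texts, labels)
    results[(vec_name, model_name)] = search.best_score_

best_combo = max(results, key=results.get)
print(len(results), "combinations evaluated; best:", best_combo)
```

Wrapping the vectorizer and the model in one `Pipeline` means the vectorizer is refitted on each training fold, so vocabulary from the held-out fold does not leak into training.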

All in all, Tfidf Vectorizer with Logistic Regression and an n-gram range of (1, 2) is the combination of choice. It had the best test ROC AUC score at 0.9204, and while its test accuracy is ~0.4% lower than the best test accuracy score of 83.97% (achieved with an n-gram range of (1, 3)), the tradeoff in test accuracy was worth it for the lower computational cost.
The top 3 words that predicted a post to be from the Bitcoin subreddit were 'bitcoin', 'lightning', and 'year', while the top 3 words that predicted a post to be from the Ethereum subreddit were 'ethereum', 'eth', and 'nft'.

With the top words that the classifier has found for each subreddit, a minimum viable product can be designed. In its implementation, this chatbot will pick up on the keywords in a user-submitted message and try to identify whether the query pertains to Bitcoin or Ethereum. It will then give a suitable answer or route the message to the right customer service representative for further management.
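
The keyword-matching idea can be sketched in a few lines. This is a minimal illustration rather than the trained classifier: the keyword sets are the top words reported in this summary, while `route_query`, the tie-breaking rule, and the `"unknown"` fallback are assumptions made for the example.

```python
import re

# Top predictive words reported in the summary for each subreddit.
KEYWORDS = {
    "Bitcoin": {"bitcoin", "lightning", "year"},
    "Ethereum": {"ethereum", "eth", "nft"},
}


def route_query(message: str) -> str:
    """Return the coin whose keywords appear most often, or 'unknown' on a tie or no match."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    scores = {
        coin: sum(token in words for token in tokens)
        for coin, words in KEYWORDS.items()
    }
    best = max(scores.values())
    winners = [coin for coin, score in scores.items() if score == best]
    # Hypothetical policy: hand over to a human when the keywords are inconclusive.
    if best == 0 or len(winners) > 1:
        return "unknown"
    return winners[0]


print(route_query("How do I mint an NFT with ETH?"))
print(route_query("Is the Lightning network safe?"))
```

A message that returns `"unknown"` would be the natural candidate for routing to a customer service representative.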

# Part 0: Overview & Problem Statement

## Background / Overview

We work for a cryptocurrency trading platform startup company. \
In recent months, the customer service team has received an increasing number of enquiries about the cryptocurrencies available on our platform. On closer inspection, they found that a large proportion of these enquiries concern what those cryptocurrencies are and their applications. \
Faced with an increasing workload and resource constraints, the head of customer service has engaged our team to develop a real-time chatbot for the company website to automate responses to such simple enquiries. A real-time chatbot will not only free the customer service team to focus on complex enquiries and feedback, it can also educate users about our products in a more timely and accurate manner and hence enhance their user experience.

## Problem Statement
This project aims to help businesses and enterprises in the cryptocurrency space (e.g. news providers, brokerage platforms, coin vaults, mining pools) develop an efficient and sophisticated query-answering and routing algorithm for their online chatbot. The chatbot should help handle the large number of enquiries received from site visitors on a daily basis and reduce the burden on customer service operatives.

The goal is to enable the chatbot to accurately determine the nature of an enquiry and return an appropriate answer; where it is unable to do so, it will categorise the enquiry and route it to the relevant operative. For this to happen, the chatbot needs (for a start) to recognise keywords associated with two well-known cryptocurrencies (Bitcoin and Ethereum), with the help of a natural language processing classifier trained on posts from the two respective subreddits.

# Part 1: Webscraping with PushShift API

In [75]:
# import packages
import pandas as pd
import random
import time
import requests
from pprint import pprint

# from bs4 import BeautifulSoup
pd.set_option('display.width', 1000)

## The Function

In [109]:
def scrape_data(sub:str, after: int=None, before: int=None, num_posts:int = 25):
    """Scrapes submissions from Reddit based on desired subreddit, utc_time and number of posts
       Returns a tuple: (dataframe with all scraped data, utc_before value to resume from)"""
    
    # initialise variables
    data = pd.DataFrame() 
    utc_before = before 
    utc_after = after
    informed = False 
    sub = sub.lower()
    row_count = 0
    
    
    while row_count < num_posts:
        # Doing a try / except loop just in case of rejection, we still retain what we have scraped in memory
        try:
            # Scraping reddit data. 100 posts per trigger should be relatively light on the server
            with requests.Session() as s:
                BASE_URL = "https://api.pushshift.io/reddit/search/submission?subreddit=" + sub
                params = {'before': utc_before, 'after': utc_after, 'size': 100}
                r = s.get(BASE_URL, params = params)

            # User Feedback----------------------------------------------------------
            if not informed:
                print('Scraping data from {}'.format(BASE_URL)) 
                informed = True
            print("UTC_Before: {}, Status: {}({})".format(utc_before, r.reason, r.status_code))

            # Loading data into dataframe---------------------------------------------
            df = pd.DataFrame(r.json()['data'])
            data = pd.concat([data, df], axis='rows').reset_index(drop=True)
            
        # if error, save current data and utc time    
        except Exception as e:
            print(e)
            return data, utc_before
        else:
            # Stop early if the API returned no posts; an empty frame has no 'created_utc' column
            if df.empty:
                print("No more posts returned. Stopping early.")
                break

            # Update variables for the next loop
            row_count = data.shape[0]
            utc_before = df['created_utc'].min() if utc_before is not None else None
            utc_after = df['created_utc'].max() if utc_after is not None else None
            print("Scraped {} rows. {} rows remaining".format(row_count, max(num_posts - row_count, 0)))
            
            # wait some secs before triggering next scrape, unless enough rows have been collected
            if row_count < num_posts:
                wait_time = random.randint(5, 20)
                print('Waiting {} secs before next scrape'.format(wait_time))
                print("-"*100)
                time.sleep(wait_time)
            else:
                break
    
    print("Scraping Completed. Collected {} records.".format(row_count))
    print("_"*100)
    return data, utc_before
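
The function pages backwards through a subreddit by reusing the smallest `created_utc` of each batch as the next `before` value. The sketch below replays that cursor logic against an in-memory list instead of the live API. `fake_fetch`, `collect`, and the sample timestamps are hypothetical stand-ins, and the sketch assumes `before` excludes the boundary value itself.

```python
def fake_fetch(posts, before, size):
    """Hypothetical stand-in for the API: newest timestamps strictly older than `before`."""
    older = sorted((p for p in posts if p < before), reverse=True)
    return older[:size]


def collect(posts, before, num_posts, size=3):
    """Page backwards in time until num_posts are collected or nothing is left."""
    collected = []
    while len(collected) < num_posts:
        batch = fake_fetch(posts, before, min(size, num_posts - len(collected)))
        if not batch:  # no older posts: stop instead of looping forever
            break
        collected.extend(batch)
        before = min(batch)  # next request asks for posts older than this one
    return collected, before


timestamps = list(range(100, 110))  # ten fake created_utc values: 100..109
got, cursor = collect(timestamps, before=110, num_posts=7)
print(got, cursor)
```

Returning the cursor alongside the data is what lets a later call resume where the previous one stopped.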


## The Execution
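
The `before` values used in this section are Unix epoch timestamps in seconds (UTC), the same unit as Reddit's `created_utc` field. A minimal sketch of converting between such a timestamp and a readable date, using only the standard library; the helper names `to_epoch` and `from_epoch` are introduced here for illustration.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC calendar time to a Unix timestamp in whole seconds."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def from_epoch(ts):
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


print(from_epoch(1626939127))  # BTC_UTC -> 2021-07-22T07:32:07+00:00
print(from_epoch(1626939643))  # ETH_UTC -> 2021-07-22T07:40:43+00:00
print(to_epoch(2021, 7, 22))   # -> 1626912000
```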

In [110]:
# setup variables
BTC_UTC = 1626939127
ETH_UTC = 1626939643
NUM_POSTS = 1000


# Scraping 1000 posts at a time and then writing to file to prevent data loss
# Also alternating between subreddits to hopefully minimize the risk of getting banned
df, btc_before1 = scrape_data(sub='Bitcoin', before=BTC_UTC, num_posts=NUM_POSTS)
print('BTC_UTC: {}'.format(btc_before1))
df.to_csv('bitcoin01.csv', index=False)
df, eth_before1 = scrape_data(sub='ethereum', before=ETH_UTC, num_posts=NUM_POSTS)
df.to_csv('ethereum01.csv', index=False)
print('ETH_UTC: {}'.format(eth_before1))

df, btc_before2 = scrape_data(sub='Bitcoin', before=btc_before1, num_posts=NUM_POSTS)
print('BTC_UTC: {}'.format(btc_before2))
df.to_csv('bitcoin02.csv', index=False)
df, eth_before2 = scrape_data(sub='ethereum', before=eth_before1, num_posts=NUM_POSTS)
df.to_csv('ethereum02.csv', index=False)
print('ETH_UTC: {}'.format(eth_before2))

df, btc_before3 = scrape_data(sub='Bitcoin', before=btc_before2, num_posts=NUM_POSTS)
print('BTC_UTC: {}'.format(btc_before3))
df.to_csv('bitcoin03.csv', index=False)
df, eth_before3 = scrape_data(sub='ethereum', before=eth_before2, num_posts=NUM_POSTS)
df.to_csv('ethereum03.csv', index=False)
print('ETH_UTC: {}'.format(eth_before3))

df, btc_before4 = scrape_data(sub='Bitcoin', before=btc_before3, num_posts=NUM_POSTS)
print('BTC_UTC: {}'.format(btc_before4))
df.to_csv('bitcoin04.csv', index=False)
df, eth_before4 = scrape_data(sub='ethereum', before=eth_before3, num_posts=NUM_POSTS)
df.to_csv('ethereum04.csv', index=False)
print('ETH_UTC: {}'.format(eth_before4))

df, btc_before5 = scrape_data(sub='Bitcoin', before=btc_before4, num_posts=NUM_POSTS)
print('BTC_UTC: {}'.format(btc_before5))
df.to_csv('bitcoin05.csv', index=False)
df, eth_before5 = scrape_data(sub='ethereum', before=eth_before4, num_posts=NUM_POSTS)
df.to_csv('ethereum05.csv', index=False)
print('ETH_UTC: {}'.format(eth_before5))

Scraping data from https://api.pushshift.io/reddit/search/submission?subreddit=bitcoin
UTC_Before: 1626939127, Status: OK(200)
Scraped 100 rows. 900 rows remaining
Waiting 14 secs before next scrape
----------------------------------------------------------------------------------------------------
UTC_Before: 1626907956, Status: OK(200)
Scraped 200 rows. 800 rows remaining
Waiting 17 secs before next scrape
----------------------------------------------------------------------------------------------------
UTC_Before: 1626891406, Status: OK(200)
Scraped 300 rows. 700 rows remaining
Waiting 14 secs before next scrape
----------------------------------------------------------------------------------------------------
UTC_Before: 1626874469, Status: OK(200)
Scraped 400 rows. 600 rows remaining
Waiting 11 secs before next scrape
----------------------------------------------------------------------------------------------------
UTC_Before: 1626850198, Status: OK(200)
Scraped 500 rows. 500