# Problem Statement 

As graduating data science students, we were approached by a group of investors (X) who noticed that subreddit forums could offer an alternative way to analyze fluctuations in their GameStop (GME) and Dogecoin investments by understanding user behaviour there.

Their interest in this alternative stems largely from the recent trend of investors trading on comments made by social media users and on the influence of prominent figures in the market.

We were tasked with identifying words in a post's body or title that are likely to signal an increase in investment value over a given period, and with classifying posts by the subreddit they came from.

### Background information

GME and Dogecoin have recently made big strides in the market and are changing how it works. Both assets saw their prices spike from low to high within a short timeframe and were trending among different audiences.

GME's price soared from USD 39 to USD 348, an increase of roughly 790%, and Dogecoin rose from USD 0.01 to USD 0.80, an increase of roughly 7,900%. Prices have fluctuated since, with large percentage swings continuing to date.
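
The percentage increases can be checked directly from the start and end prices quoted above. The helper name `percent_increase` is ours and is only a small worked calculation, not part of the original analysis.

```python
def percent_increase(start, end):
    """Return the percentage change from start to end."""
    return (end - start) / start * 100


# Start and end prices as quoted in the background section
gme = percent_increase(39, 348)
doge = percent_increase(0.01, 0.8)

print(f"GME: {gme:.0f}%  Dogecoin: {doge:.0f}%")
```

Note that an 80-fold rise in price corresponds to a 7,900% increase, because the starting value itself counts as 100%.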

### Project approach

Because the source data is unstructured, we will use Natural Language Processing (NLP) to classify posts from the different subreddits based on their content.

The approach is as follows:
1. Data acquisition using Pushshift's API to collect posts from the GME and Dogecoin subreddits
2. Preliminary exploratory data analysis to gain insight into the raw data (e.g. top words, frequency & distribution of words)
3. Data cleaning (e.g. null values, duplicate rows)
4. Data pre-processing (e.g. removal of markup elements, emojis, numerals)
5. Vectorizing words into numerical data using CountVectorizer & TF-IDF
6. Basic modelling using classification models (Logistic Regression, Naive Bayes, Random Forest)
7. Picking the best model and further tuning its hyperparameters
8. Evaluating the model using metrics on test data
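
As a rough illustration of step 4, the sketch below strips URLs, common markdown markup, emojis and numerals using only the standard library. The function name `clean_text` and the exact patterns are our assumptions; the pre-processing actually used later in the project may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Common reddit/markdown markup characters (an assumed, non-exhaustive set)
MARKUP_RE = re.compile(r"[*_~`>#\[\]()]")
# Crude emoji removal: drops every non-ASCII character, which also
# removes curly quotes and accented letters
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
DIGIT_RE = re.compile(r"\d+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Return a lowercased copy of text without URLs, markup, emojis or digits."""
    text = URL_RE.sub(" ", text)
    text = MARKUP_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    # collapse runs of whitespace (including newlines) into single spaces
    return WHITESPACE_RE.sub(" ", text).strip().lower()


print(clean_text("**Lesson Learned from RH**\n\nI bought 120 shares \U0001F680"))
```

Dropping all non-ASCII characters is a blunt choice; a dedicated emoji pattern would be a gentler alternative if words with accents matter.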

# Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import time

# Data Acquisition using Pushshift's API

We use the Pushshift API to extract data from the two subreddits (GME and Dogecoin), with filters set to select specific posts.

The API returns a maximum of 100 posts per request, so the code pulls data repeatedly, advancing the start time on each loop so that posts are not duplicated.

Subreddit topics:
- gme 
- dogecoin  
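
The time-cursor idea can be tried offline before touching the real API. In the sketch below, `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical stand-ins of ours, and we assume the `after` bound is exclusive, so each request only returns posts newer than the last one seen.

```python
def paginate(fetch_page, start, end, page_size=100):
    """Collect all posts with start < created_utc < end, one page at a time."""
    posts = []
    while end > start:
        page = fetch_page(after=start, before=end, size=page_size)
        if not page:
            break
        posts.extend(page)
        # advance the cursor to the newest timestamp seen so the next
        # request only returns later posts
        start = max(post["created_utc"] for post in page)
    return posts


# Hypothetical stand-in for the API: 250 posts, one per second
FAKE_POSTS = [{"id": i, "created_utc": 1000 + i} for i in range(250)]


def fake_fetch(after, before, size):
    """Return up to `size` posts strictly between after and before."""
    matches = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return matches[:size]


result = paginate(fake_fetch, start=999, end=2000)
print(len(result), len({post["id"] for post in result}))
```

One caveat of this design: if several posts share the exact timestamp at a page boundary, an exclusive cursor would skip the ones that did not fit on the page.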

In [3]:
def get_submissions(subreddit, start_datetime, end_datetime):
    """
    Get reddit submissions from the pushshift api between specified date range.
    Read more: https://github.com/pushshift/api
   
    Args:
    subreddit(string): Subreddit name.
    start_datetime(int): Start of the range as a UTC epoch timestamp in seconds.
    end_datetime(int): End of the range as a UTC epoch timestamp in seconds.
    
    Returns:
    A dataframe of text-only posts from subreddit
    
    Raises:
    HTTPError when request status is not a 200.
    """
    # base url
    url = "https://api.pushshift.io/reddit/search/submission"
    
    # create an empty list to hold the dataframes
    df_list = []
    
    # page forward through the date range, up to 100 posts per request
    while end_datetime > start_datetime:
        res = requests.get(url, 
                           # query parameters 
                           params={"subreddit": subreddit, 
                                        "size": 100,
                                        "after": start_datetime,
                                        "before": end_datetime,
                                       # return text-only posts
                                        "is_self": True})
        try:
            # if the response was successful, no exception will be raised
            res.raise_for_status()
        except requests.exceptions.HTTPError as e:
            # not a 200
            print("Error: " + str(e))
            raise
        else:
            # run the following code if no exception was raised
            print('Fetching data from {}'.format(res.url))
            # named payload to avoid shadowing the standard json module
            payload = res.json()
            # flatten the nested dictionary
            df = pd.json_normalize(payload['data'])
            if len(df) > 0:
                # select the required columns (copy so the assignment below does not warn)
                df = df[['id', 'author', 'created_utc', 'subreddit', 'selftext', 'title']].copy()
                # update start_datetime to loop forward 
                start_datetime = df['created_utc'].max()
                # convert epoch time to readable time
                df['created_utc'] = pd.to_datetime(df['created_utc'],unit='s')
                # append to list
                df_list.append(df)
                # pause for 3 seconds before the next pull of 100 posts
                time.sleep(3)
                print ('successful retrieval')
            else:
                break
    print('---')
    print('Task Completed')
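    # pd.concat raises ValueError on an empty list, so return an empty frame
    # when no posts were retrieved
    if not df_list:
        return pd.DataFrame()
    return pd.concat(df_list, axis=0)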
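
The calls that follow pass their start and end times as UTC epoch seconds (for example `1611187200`). A small helper can derive those values from readable date strings. The name `to_epoch` and its default format are our assumptions and are not part of the original notebook.

```python
from datetime import datetime, timezone


def to_epoch(dt_string, fmt="%d/%m/%Y %H:%M:%S"):
    """Convert a UTC datetime string into an epoch timestamp in seconds."""
    # strptime returns a naive datetime, so mark it explicitly as UTC
    dt = datetime.strptime(dt_string, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


start = to_epoch("21/01/2021 00:00:00")
end = to_epoch("02/02/2021 00:00:00")
print(start, end)
```

Attaching `timezone.utc` explicitly keeps the result independent of the local timezone of the machine running the notebook.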

### Retrieve posts from r/GME

In [4]:
# Retrieve posts from r/GME from 21 Jan 2021 to 2 Feb 2021 (UTC epoch seconds)
df_gme = get_submissions('GME', 1611187200, 1612224000)

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611187200&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611393367&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611576757&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611595388&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611620490&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1611672634&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pus

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612166709&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612187288&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612191848&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612193995&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612197611&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=GME&size=100&after=1612204434&before=1612224000&is_self=True
successful retrieval
Fetching data from https://api.pus

In [5]:
df_gme.head()

| | id | author | created_utc | subreddit | selftext | title |
|---|---|---|---|---|---|---|
| 0 | l1n3vd | Jeffamazon | 2021-01-21 00:52:21 | GME | Just stumbled upon this sub. Didn't know it ex... | Greetings GME Gang |
| 1 | l1o60a | stoney-the-tiger | 2021-01-21 01:48:42 | GME | [removed] | Help Make Q4 Great |
| 2 | l1wi6q | B1ake1 | 2021-01-21 11:13:54 | GME | HOLD THE LINES \n\n\n120 shares @27 | Remember lads, scared money don't make money. |
| 3 | l22dsa | Dustin_James_Kid | 2021-01-21 16:55:19 | GME | I’m new and trying to learn. This stock scares... | How do we know when the squeeze has happened? |
| 4 | l22r5n | MailNurse | 2021-01-21 17:11:48 | GME | Price is sub 40 now. :(. | WHERE ARE THE FUCKING REINFORCEMENTS |


In [6]:
# Preview df_gme without rows whose selftext duplicates an earlier row
# (the result is not assigned, so df_gme itself is unchanged)
df_gme[~df_gme['selftext'].duplicated()]

| | id | author | created_utc | subreddit | selftext | title |
|---|---|---|---|---|---|---|
| 0 | l1n3vd | Jeffamazon | 2021-01-21 00:52:21 | GME | Just stumbled upon this sub. Didn't know it ex... | Greetings GME Gang |
| 1 | l1o60a | stoney-the-tiger | 2021-01-21 01:48:42 | GME | [removed] | Help Make Q4 Great |
| 2 | l1wi6q | B1ake1 | 2021-01-21 11:13:54 | GME | HOLD THE LINES \n\n\n120 shares @27 | Remember lads, scared money don't make money. |
| 3 | l22dsa | Dustin_James_Kid | 2021-01-21 16:55:19 | GME | I’m new and trying to learn. This stock scares... | How do we know when the squeeze has happened? |
| 4 | l22r5n | MailNurse | 2021-01-21 17:11:48 | GME | Price is sub 40 now. :(. | WHERE ARE THE FUCKING REINFORCEMENTS |
| ... | ... | ... | ... | ... | ... | ... |
| 25 | lahsve | Traderparkboy | 2021-02-01 23:42:20 | GME | Today my smooth brained father joined the elit... | My fathers brain is smooth |
| 28 | lai1du | StockRecon | 2021-02-01 23:53:15 | GME | **Lesson Learned from RH**\n\nI recently read ... | Lesson learned from RH |
| 29 | lai1m0 | Several_Meet_8653 | 2021-02-01 23:53:37 | GME | \nWHAT DO YOU GUYS THINK IS GOING TO HAPPEN WI... | GME |
| 30 | lai1r4 | BertNErnieSanders | 2021-02-01 23:53:46 | GME | Disclaimer: This isn’t meant to judge. \n\nLoo... | Let’s be honest. How many of are you are nervo... |
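
The boolean filter above returns a new frame and leaves `df_gme` untouched. If the de-duplicated rows should be kept, the result has to be assigned. The sketch below shows one way to do that on toy data that we made up for illustration; `drop_duplicates` with `subset="selftext"` selects the same rows as the `duplicated()` filter.

```python
import pandas as pd

# Toy data (made up for illustration), including repeated placeholder text
df = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "selftext": ["to the moon", "[removed]", "to the moon", "[removed]"],
    "title": ["t1", "t2", "t3", "t4"],
})

# keep the first occurrence of each selftext and renumber the index
deduped = df.drop_duplicates(subset="selftext", keep="first").reset_index(drop=True)

print(len(df), len(deduped))
```

Placeholder bodies such as `[removed]` all count as duplicates of each other, so only the first such post survives even when the titles differ.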


In [7]:
df_gme.to_csv('./datasets/gme_unclean.csv',index=False)

### Retrieve posts from r/dogecoin

In [8]:
# Retrieve posts from r/dogecoin for 5 May 2021 (UTC epoch seconds)
df_dogecoin = get_submissions('dogecoin', 1620172800, 1620259200)

Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620172800&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620174293&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620175666&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620177049&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620177977&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620178841&before=1620259200&is_self=True
successful retrieval
Fetc

successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620212649&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620213858&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620214955&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620215979&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620217084&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620218199&before=1620259200&is_self=True
succ

successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620250382&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620251442&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620252890&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620254454&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620255670&before=1620259200&is_self=True
successful retrieval
Fetching data from https://api.pushshift.io/reddit/search/submission?subreddit=dogecoin&size=100&after=1620256548&before=1620259200&is_self=True
succ

In [9]:
df_dogecoin.head()

| | id | author | created_utc | subreddit | selftext | title |
|---|---|---|---|---|---|---|
| 0 | n525ha | Ok_Salt_7206 | 2021-05-05 00:00:38 | dogecoin | Only serious guys please!\n\nPlease pm\n\nP.S.... | 6000 dogecoins for $83. Pm Fast who need |
| 1 | n5262o | Ruskgodkrewdoge | 2021-05-05 00:01:26 | dogecoin | | When doge passes eth the coin will be worth 4 ... |
| 2 | n5263b | PingPing01 | 2021-05-05 00:01:27 | dogecoin | [removed] | Buy HODL this is what we need |
| 3 | n526uy | T1DLiving | 2021-05-05 00:02:29 | dogecoin | Hey fellow shibes, my birthday is in a couple ... | Birthday in a couple days |
| 4 | n526w3 | Malbec177 | 2021-05-05 00:02:31 | dogecoin | Everyone should wish Elon Musks son Little X a... | Happy birthday to Elons son Little X! |


In [10]:
# Preview df_dogecoin without rows whose selftext duplicates an earlier row
# (the result is not assigned, so df_dogecoin itself is unchanged)
df_dogecoin[~df_dogecoin['selftext'].duplicated()]

| | id | author | created_utc | subreddit | selftext | title |
|---|---|---|---|---|---|---|
| 0 | n525ha | Ok_Salt_7206 | 2021-05-05 00:00:38 | dogecoin | Only serious guys please!\n\nPlease pm\n\nP.S.... | 6000 dogecoins for $83. Pm Fast who need |
| 1 | n5262o | Ruskgodkrewdoge | 2021-05-05 00:01:26 | dogecoin | | When doge passes eth the coin will be worth 4 ... |
| 2 | n5263b | PingPing01 | 2021-05-05 00:01:27 | dogecoin | [removed] | Buy HODL this is what we need |
| 3 | n526uy | T1DLiving | 2021-05-05 00:02:29 | dogecoin | Hey fellow shibes, my birthday is in a couple ... | Birthday in a couple days |
| 4 | n526w3 | Malbec177 | 2021-05-05 00:02:31 | dogecoin | Everyone should wish Elon Musks son Little X a... | Happy birthday to Elons son Little X! |
| ... | ... | ... | ... | ... | ... | ... |
| 42 | n5uaf7 | Master_Miles | 2021-05-05 23:58:26 | dogecoin | This time last week people were thinking of se... | 👍🏼💎🌙Love the change |
| 43 | n5uary | benjahmyn1 | 2021-05-05 23:58:54 | dogecoin | Just dumped another $5k on it at .90 last nigh... | It's all a dip! |
| 44 | n5uaz1 | Pdt0928 | 2021-05-05 23:59:09 | dogecoin | Just smoke a fat joint, forget and relax | Relax and HODL |
| 45 | n5ub3y | Scared-Ad-453 | 2021-05-05 23:59:20 | dogecoin | A co-worker of mine refused to invest a little... | Don't be a fool! |


In [11]:
df_dogecoin.to_csv('./datasets/dogecoin_unclean.csv',index=False)