# Cleaning the scraped dataset

In [1]:
import pandas as pd
import numpy as np
import pickle
import re
import pandas_datareader.data as web


Load the raw scraped datasets

In [53]:
penny_stock = pd.read_csv("data/penny_stocks.csv", parse_dates=True)
penny_stock.head(5)

Unnamed: 0,com_author,com_date,com_text,post_author,post_date,post_text,post_title,post_url
0,Tuftybigfoot,2018-10-23T00:00:33+00:00,What about Canadian traders?,Al1Ge,2018-10-22T22:47:57+00:00,"If so, what platform do you trade through?",Any UK based traders?,https://old.reddit.com/r/pennystocks/comments/...
1,xTheHolyGhostx,2018-10-22T13:57:59+00:00,Very nice. I plan to hold onto my shares for a...,CaptainWeee,2018-10-22T13:34:50+00:00,Wow another article dropped today about us som...,$HIPH Another article released today with us i...,https://old.reddit.com/r/pennystocks/comments/...
2,Mojaverae,2018-10-22T15:44:45+00:00,"Big investors Sold on the news, took profits",vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...
3,smooferated,2018-10-23T03:26:20+00:00,Always sell and take profits. Especially when...,vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...
4,jarsofmarsbarsincars,2018-10-22T16:29:45+00:00,People who invested at low prices pulled their...,vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...


In [54]:
robin = pd.read_csv("data/robin.csv", parse_dates=True)
robin.head(5)

Unnamed: 0,com_author,com_date,com_text,post_author,post_date,post_text,post_title,post_url
0,weredagabagool,2018-10-23T14:59:06+00:00,This sub is kryptonite to every single stock m...,Vigamoxx,2018-10-23T14:01:50+00:00,,IFMK shooting up!!,https://old.reddit.com/r/RobinHoodPennyStocks/...
1,julbjulb,2018-10-23T14:05:06+00:00,You had to jinx it lol,Vigamoxx,2018-10-23T14:01:50+00:00,,IFMK shooting up!!,https://old.reddit.com/r/RobinHoodPennyStocks/...
2,Vigamoxx,2018-10-23T14:06:36+00:00,"Yeah I bought at $2.01 at 9:58, at 10:00 it wa...",Vigamoxx,2018-10-23T14:01:50+00:00,,IFMK shooting up!!,https://old.reddit.com/r/RobinHoodPennyStocks/...
3,julbjulb,2018-10-23T14:54:27+00:00,📈 gl,Vigamoxx,2018-10-23T14:01:50+00:00,,IFMK shooting up!!,https://old.reddit.com/r/RobinHoodPennyStocks/...
4,Joesmithers,2018-10-22T23:23:10+00:00,"My immediate goal is to get above $25,000 so I...",Joesmithers,2018-10-22T23:17:17+00:00,,Rate my meme portfolio,https://old.reddit.com/r/RobinHoodPennyStocks/...


Combine them in one dataframe

In [55]:
# Set flags for the subreddit
penny_stock['subreddit'] = '/r/pennystocks'
robin['subreddit'] = '/r/RobinHoodPennyStocks'

# Combine two datasets
penny = penny_stock.append(robin, ignore_index=True)
penny.head(5)

Unnamed: 0,com_author,com_date,com_text,post_author,post_date,post_text,post_title,post_url,subreddit
0,Tuftybigfoot,2018-10-23T00:00:33+00:00,What about Canadian traders?,Al1Ge,2018-10-22T22:47:57+00:00,"If so, what platform do you trade through?",Any UK based traders?,https://old.reddit.com/r/pennystocks/comments/...,/r/pennystocks
1,xTheHolyGhostx,2018-10-22T13:57:59+00:00,Very nice. I plan to hold onto my shares for a...,CaptainWeee,2018-10-22T13:34:50+00:00,Wow another article dropped today about us som...,$HIPH Another article released today with us i...,https://old.reddit.com/r/pennystocks/comments/...,/r/pennystocks
2,Mojaverae,2018-10-22T15:44:45+00:00,"Big investors Sold on the news, took profits",vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...,/r/pennystocks
3,smooferated,2018-10-23T03:26:20+00:00,Always sell and take profits. Especially when...,vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...,/r/pennystocks
4,jarsofmarsbarsincars,2018-10-22T16:29:45+00:00,People who invested at low prices pulled their...,vertical006,2018-10-22T15:43:23+00:00,I picked up a few stocks months before Canada ...,What's going on with Canadian marijuana stocks?,https://old.reddit.com/r/pennystocks/comments/...,/r/pennystocks


## Step 1 Find unique tickers mentioned in posts

Find unique company tickers. The result is pickled for future use. 

In [56]:
# Combine all textual info
all_text = penny['post_text'] + penny['com_text'] + penny['post_title'] 
all_text = all_text.dropna().astype(str)
all_text = " ".join(all_text)

# Get unique stocks in a form $TCKR
set1 = re.findall(pattern='\$[A-Z]{2,4}', string = all_text)
set1 = list(map(lambda x: x.replace("$",""), set1))
set1 = set(set1)

# Get unique stocks in the form NASDAQ:TCKR
set2 = re.findall(pattern='NASDA.:[A-Z]{2,4}', string = all_text)
set2 = list(map(lambda x: x.replace("NASDAQ:",""), set1))
set2 = set(set2)

# Combine both
tickers = [*set1] + [*set2]

# Write tickers to a file
with open('./tmp/tickers', 'wb') as fp:
    pickle.dump(tickers, fp)

## Step 2. Find unique tickers for penny stocks and download quotes

We use 'pickled' tickers extracted from the scraped dataset and try downloading stock prices data from Yahoo Finance.

Some of the stocks that are traded on non-US exchanges may have additional suffixes to the ticker on Yahoo Finance. In this case this code will not obtain the stock quotes and we drop observations. It can be manually or semi-atudomatically fixed in the future research.

In [34]:
with open('./tmp/tickers', 'rb') as fp:
    tickers = pickle.load(fp)


**This code takes a while to run. You can use the csv file below instead of downloading quotes**

In [15]:
# Set parameters for quotes download
data_source = 'yahoo'
start = "2018-01-01"
end = "2018-12-23"

# Initialize counters
fail_count = 0
ok_count = 0
closing_prices = pd.DataFrame()

#D ownload closing prices and stack them together
for ticker in tickers:
    try:
        result = web.DataReader(ticker, data_source, start, end)
        tmp_df = result[['Adj Close']]
        tmp_df.columns = [ticker]
        closing_prices = pd.concat([closing_prices,tmp_df], axis=1)
        print ("obtained " + ticker + " " + repr(result.shape))
        ok_count += 1
    except:
        print (ticker + " is missing")
        fail_count += 1


# Write to csv
closing_prices.to_csv('./tmp/raw_stock_quotes.csv')

TRLY is missing
obtained CCCL (207, 6)
obtained HEME (207, 6)
obtained SCYX (207, 6)
obtained GALT (207, 6)
obtained MO (207, 6)
BRLX is missing
obtained WM (207, 6)
obtained SGYP (207, 6)
obtained RSHN (205, 6)
obtained OGEN (207, 6)
obtained TSLA (207, 6)
obtained DTEA (207, 6)
obtained ISBG (207, 6)
obtained LCLP (207, 6)
obtained SPRO (207, 6)
RQFT is missing
obtained BE (106, 6)
obtained BLPG (207, 6)
BIOA is missing
obtained CFGX (207, 6)
obtained CGC (199, 6)
obtained KO (207, 6)
obtained TLRY (68, 6)
obtained ETST (207, 6)
obtained PKG (207, 6)
obtained VTVT (207, 6)
JUGR is missing
BRVR is missing
obtained QQQ (207, 6)
obtained BB (207, 6)
obtained BLNK (207, 6)
obtained BLPH (207, 6)
obtained CRMD (207, 6)
BEER is missing
obtained GLOW (198, 6)
obtained STNN (207, 6)
RNKL is missing
obtained PACB (207, 6)
obtained JCP (207, 6)
obtained APRN (207, 6)
obtained EVSV (207, 6)
obtained ITRO (207, 6)
SPLI is missing
ROPE is missing
obtained MOSY (207, 6)
obtained ACRX (207, 6)
obta

obtained RKDA (207, 6)
VTLW is missing
obtained MLHC (206, 6)
obtained DCIX (207, 6)
obtained PLX (207, 6)
obtained GSS (207, 6)
obtained RGSE (207, 6)
obtained TNTY (207, 6)
obtained NHPI (207, 6)
obtained AVEO (207, 6)
obtained SECI (206, 6)
obtained SIPC (207, 6)
obtained CWBR (207, 6)
obtained ATNM (207, 6)
AMDE is missing
obtained AKER (207, 6)
obtained AMFE (207, 6)
obtained PED (207, 6)
obtained SSKN (207, 6)
obtained BVTK (207, 6)
obtained BYOC (207, 6)
obtained XXII (207, 6)
obtained ASM (207, 6)
obtained ACAD (207, 6)
obtained SSOF (207, 6)
BLON is missing
obtained MU (207, 6)
obtained MCIG (207, 6)
obtained FIZZ (207, 6)
obtained INND (207, 6)
obtained EOMN (207, 6)
obtained CLDX (207, 6)
obtained ACB (24, 6)
INMG is missing
obtained FCEL (207, 6)
obtained GRDO (207, 6)
obtained NPHC (207, 6)
obtained NBRV (207, 6)
obtained COLL (207, 6)
obtained CCIH (207, 6)
obtained MCOA (207, 6)
ALN is missing
obtained CHEK (207, 6)
obtained NVCN (207, 6)
obtained INTV (207, 6)
obtained 

Retrieve the saved file

In [57]:
quotes = pd.read_csv('./tmp/raw_stock_quotes.csv')

Traditionally, 'penny stocks' are more likely to be the object of a pump-and-dump scheme due to low liquidity and lack of analytical coverage. Penny stocks are often defined as stocks with a price less than $5.

Also, we need to have at least a couple of months of data to make meaningful conclusions about the stock dynamic. 

Therefore, we keep only stocks that had mean price of $5 or less and that have at least 50 days of observations. 

In [58]:
# Find stocks with mean price less or equal to $5
mean_price = quotes.mean()
mean_price = mean_price[mean_price <= 5]

# Keep only those stocks in the database
quotes = quotes.loc[:,mean_price.index]

# Find stocks with more than 50 observations
obs_count = quotes.count()
obs_count = obs_count[obs_count >= 50]

# Keep only those stocks in the database
quotes = quotes.loc[:,obs_count.index]

quotes.shape

(207, 366)

In [59]:
# Get column values
tickers_clean = quotes.columns.values

# Remove '.1 artifact'
tickers_clean = list(map(lambda x: x.replace(".1",""), tickers_clean))
tickers_clean[0:10]


['CCCL',
 'HEME',
 'SCYX',
 'SGYP',
 'RSHN',
 'OGEN',
 'DTEA',
 'ISBG',
 'LCLP',
 'BLPG']

## Step 3 Identify what stocks are mentioned in each comment/post

We also load a dataset scraped from /r/stocks. Although the moderators of /r/stocks do not support posting related to penny stocks, such an activity still may happen.

In [60]:
stocks = pd.read_csv("data/stocks.csv", parse_dates=True)

#Add a flag of the subreddit
stocks['subreddit'] = '/r/stocks'
stocks.head(5)

Unnamed: 0,com_author,com_date,com_text,post_author,post_date,post_text,post_title,post_url,subreddit
0,cracklinrosi,2018-10-24T02:23:06+00:00,,vovr,2018-10-23T22:20:35+00:00,I never bought bonds before. I need some tips ...,How can I get started with bonds?,https://old.reddit.com/r/stocks/comments/9qtr4...,/r/stocks
1,brrr69,2018-10-23T17:34:16+00:00,Apple is such a great company I would never se...,linux_rich87,2018-10-23T17:30:35+00:00,I have two Fidelity mutual funds (FSPTX and OP...,Should I sell or hold? New to the world of inv...,https://old.reddit.com/r/stocks/comments/9qr6n...,/r/stocks
2,linux_rich87,2018-10-23T17:39:41+00:00,Okay so you invest in individual stocks. Wonde...,linux_rich87,2018-10-23T17:30:35+00:00,I have two Fidelity mutual funds (FSPTX and OP...,Should I sell or hold? New to the world of inv...,https://old.reddit.com/r/stocks/comments/9qr6n...,/r/stocks
3,brrr69,2018-10-23T17:44:28+00:00,Only companies I research and I’m confident in...,linux_rich87,2018-10-23T17:30:35+00:00,I have two Fidelity mutual funds (FSPTX and OP...,Should I sell or hold? New to the world of inv...,https://old.reddit.com/r/stocks/comments/9qr6n...,/r/stocks
4,linux_rich87,2018-10-23T17:48:56+00:00,gotcha. I'll do some reading into that.,linux_rich87,2018-10-23T17:30:35+00:00,I have two Fidelity mutual funds (FSPTX and OP...,Should I sell or hold? New to the world of inv...,https://old.reddit.com/r/stocks/comments/9qr6n...,/r/stocks


We combine datasets from all three subreddits: stocks, pennystocks, and RobinHoodPennyStocks

In [61]:
penny = penny.append(stocks, ignore_index=True)

Below is the shape of the resulting raw database

In [62]:
penny[['post_url', 'subreddit']].groupby('subreddit').count()

Unnamed: 0_level_0,post_url
subreddit,Unnamed: 1_level_1
/r/RobinHoodPennyStocks,9477
/r/pennystocks,6390
/r/stocks,11056


In [64]:
# Combine post title, post text and comment text to identify the context
penny['all_text'] = penny['post_title'] + penny['post_text'] + penny['com_text']

# Load clean ticker names
with open("./tmp/tickers_clean", "rb") as fb:
    tickers = pickle.load(fb)
    
# Remove $ and NASDAQ: prefixes
penny['all_text'] = penny['all_text'].str.replace("$"," ")
penny['all_text'] = penny['all_text'].str.replace("NASDAQ:"," ")

# Create a column that flags what tickers are mentioned in the context of the thread
penny['context'] = penny.all_text.apply(lambda x: list(set(
                                        [value for value in str(x).split() if value in tickers]
                                        )))


In order to avoid double accounting in future NLP analysis, after identifying the context we treat original post as a comment.

In [65]:
# Create a dataframe that treats each initial post as a comment. This way we adoid double accounting for posts
# and comments
penny['post_text'] = penny['post_title'] + penny['post_text']  #Combine original post text and title
org_post = penny.loc[:,["post_author", "post_date", "post_text", "context", "post_url", "subreddit"]]
org_post.columns = ["com_author", "com_date", "com_text", "context", "post_url", "subreddit"]
org_post['top_post'] = 1

# Remove observations that do not have any comments (they will be replaced by original post)
penny = penny.loc[penny.com_text.notna(),\
                  ["com_author", "com_date", "com_text", "context", "post_url", "subreddit"]]
penny['top_post'] = 0

#Combine both
df = penny.append(other=org_post, ignore_index=True)
df.drop_duplicates(subset = ["com_author", "com_date", "com_text", "post_url", "subreddit"],inplace=True)


We keep only observations that contain some context regarding target stocks. 

We extract date from datetime, and save the file for future analysis.

In [66]:
# Drop observations without context
df = df.loc[df['context'].str.len()!=0, :]

#Reset index
df.reset_index(inplace=True)

# Exctract date
df['com_date'] = df['com_date'].apply(lambda x: pd.to_datetime(str(x)[0:10]))

df.to_csv("./data/scrape_clean.csv")

Below is the the count of posts in the final database

In [67]:
# Calculate the count of unique posts
df[['com_text', 'subreddit']].groupby('subreddit').count()

Unnamed: 0_level_0,com_text
subreddit,Unnamed: 1_level_1
/r/RobinHoodPennyStocks,3543
/r/pennystocks,2166
/r/stocks,171


**The analysis of the data is performed in [Analysis.ipynb](Analysis.ipynb)**