In [3]:
from skimage.io import collection, imread
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import re
import glob

from datetime import datetime
import nltk

# How is public opinion about a company correlated with its market value?

A company's market value is variable. The share price reflects the company's perceived value: what the public is willing to pay for a piece of the company. It rises and falls with a variety of factors, both in the global landscape and within the company itself. One of these factors is becoming more influential than ever: people's opinions on social media.

To analyze this correlation we'll look at two datasets. The first contains over 3 million unique tweets, each with its tweet id, author, post date, text body, and number of comments, likes, and retweets, matched with the company it refers to.

The second contains daily stock price records (from the Forbes2000) to use as a reference.

### 1. Data Acquisition 

First, let's read the two tables of the tweets dataset into pandas and inspect a small sample from each dataframe.

In [4]:
tweets = pd.read_csv('./top-companies-tweets/Tweet.csv')
tweets.sample(5)

Unnamed: 0,tweet_id,writer,post_date,body,comment_num,retweet_num,like_num
2335486,979745363824439296,TickerReport,1522424472,Verde Servicos Internacionais S.A. Has $5.59 M...,0,0,0
359947,611969714353479680,johnoduk,1434739924,$GOOG - EU Demands Major Changes to Google's S...,0,0,0
2304664,973944831062966273,PortfolioBuzz,1521041517,Highest scoring stories for #SP500 under one w...,0,0,0
2854137,1060225248522461184,RaindropsOhMy,1541612371,$AAPL big quick rip here to $208.7 would be ni...,0,0,0
3276902,1125418097332985857,SagarNandi,1557155558,(1) $AAPL is overvalued in CUE scorecard and (...,0,0,0


In [5]:
tweets_company = pd.read_csv('./top-companies-tweets/Company_Tweet.csv')
tweets_company.sample(5)

Unnamed: 0,tweet_id,ticker_symbol
2052008,1019341037465001984,GOOGL
3071012,870363870002630656,MSFT
3855525,1045672382466334726,TSLA
2186065,608130665914056705,AMZN
3285037,598526844032299009,TSLA


Next, let's read the dataset for each of the stocks we are monitoring. We will save them in a dictionary where the key is the company's ticker symbol and the value is the dataframe of its stock prices over time.

In [6]:
stocks_df = {}
for name in glob.glob('./stocks/*'):
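    # Use the file name without its extension as the ticker symbol.
    # os.path handles both "/" and "\" separators, so this is not Windows-only.
    ticker = os.path.splitext(os.path.basename(name))[0]
    stocks_df[ticker] = pd.read_csv(name)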
stocks = pd.concat(stocks_df)
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Low,Open,Volume,High,Close,Adjusted Close
MSFT,3565,19-04-2000,39.0625,40.71875,53715400,40.75,39.34375,24.931646
MSFT,2669,01-10-1996,8.179688,8.234375,69124800,8.367188,8.257813,5.232873
AMZN,1413,27-12-2002,18.43,19.969999,21972800,20.1,18.860001,18.860001
AAPL,7988,10-08-2012,22.09643,22.096786,194938800,22.205713,22.203571,19.174887
MSFT,7446,24-09-2015,43.27,43.450001,27905600,44.130001,43.91,39.618057
MSFT,7781,24-01-2017,62.939999,63.200001,24672900,63.740002,63.52,59.28083
AAPL,9832,10-12-2019,66.464996,67.150002,90420400,67.517502,67.120003,66.333351
AMZN,2673,31-12-2007,92.449997,93.809998,5755200,94.370003,92.639999,92.639999
MSFT,8033,24-01-2018,91.580002,92.550003,33277500,93.43,91.82,87.605545
TSLA,586,23-10-2012,5.474,5.476,3745000,5.712,5.678,5.678
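
The two-level index in the table above comes from passing a dictionary to `pd.concat`. The following is a minimal sketch of that behaviour on made-up toy frames (the names `frames`, `combined` and `flat` and all values are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the per-company CSV files (values are made up).
frames = {
    "AAPL": pd.DataFrame({"Date": ["02-01-2015", "05-01-2015"], "Close": [1.0, 2.0]}),
    "MSFT": pd.DataFrame({"Date": ["02-01-2015"], "Close": [3.0]}),
}

# Passing a dict to pd.concat uses its keys as the outer index level.
combined = pd.concat(frames)
print(combined.index.nlevels)   # 2 levels: (dict key, original row number)
print(combined.loc["AAPL"])     # all rows that came from the AAPL frame

# The outer level can be moved into an ordinary column when needed.
flat = combined.reset_index(level=0)
print(flat.columns.tolist())
```

Keeping the ticker in the index makes per-company selection with `.loc` convenient, at the cost of having to reset the index before column-based operations.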


### 2. Data Tidying and Cleaning

First, let's combine the two tables from the Twitter dataset, convert the post dates to datetime objects, and rename the column accordingly.

In [7]:
tweets = pd.merge(tweets, tweets_company, on = "tweet_id")
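
One consequence of this merge is worth keeping in mind: a tweet that is linked to several companies appears once per company in the result. A small sketch with made-up toy tables (the names `toy_tweets` and `toy_links` are hypothetical) illustrates this:

```python
import pandas as pd

# Toy data (made up): tweet 1 mentions two companies, tweet 2 mentions one.
toy_tweets = pd.DataFrame({"tweet_id": [1, 2],
                           "body": ["$AAPL $TSLA up", "$MSFT flat"]})
toy_links = pd.DataFrame({"tweet_id": [1, 1, 2],
                          "ticker_symbol": ["AAPL", "TSLA", "MSFT"]})

# pd.merge performs an inner join on the shared key by default.
merged = pd.merge(toy_tweets, toy_links, on="tweet_id")
print(merged)
print(len(toy_tweets), "tweets ->", len(merged), "rows after the merge")
```

This is why the merged dataframe can have more rows than there are unique tweets.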

In [8]:
tweets["date"] = pd.to_datetime(tweets.post_date, unit='s')
tweets = tweets.drop(columns="post_date")
tweets.sample(5)


Unnamed: 0,tweet_id,writer,body,comment_num,retweet_num,like_num,ticker_symbol,date
113378,568068696410730497,laurenholmesNYC,Top 10 holdings $AAPL $MSFT $GOOG $FB $AMZN $I...,0,0,1,AMZN,2015-02-18 15:25:06
3552538,1095663088060239873,AznOptions,Will take $NFLX profits and roll them into mor...,1,0,1,AAPL,2019-02-13 12:36:51
1380328,759270364228780032,PortfolioBuzz,Highest scoring stories for #SP500 under one w...,0,0,0,AAPL,2016-07-30 06:12:16
2363755,916399180876099585,MacHashNews,iPhone X TrueDepth supply issues likely to cle...,0,0,0,AAPL,2017-10-06 20:26:05
4020200,1159447012846329859,bs_marker,The first informative #App on Pivot Points.Sto...,0,0,0,AMZN,2019-08-08 12:51:24
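
The `post_date` values are Unix timestamps, that is, seconds elapsed since 1970-01-01 00:00:00 UTC, which is what `unit='s'` tells pandas. The stock files, in contrast, store dates as day-month-year strings. As a rough standard-library sketch of both interpretations, using one value from each of the samples shown earlier:

```python
from datetime import datetime, timezone

# A post_date value taken from the tweets sample: seconds since the Unix epoch.
post_date = 1522424472
as_utc = datetime.fromtimestamp(post_date, tz=timezone.utc)
print(as_utc.isoformat())

# A Date value taken from the stocks sample: day-month-year, so the
# format has to be spelled out to avoid a month/day mix-up.
stock_date = datetime.strptime("19-04-2000", "%d-%m-%Y")
print(stock_date.date())
```

Note that `pd.to_datetime(..., unit='s')` returns timezone-naive timestamps that correspond to UTC, so tweet times are not in the exchange's local time.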


Let's see what timeframe our dataset covers by getting the dates of the earliest and latest tweets.

In [9]:
tweets.date.min(), tweets.date.max()

(Timestamp('2015-01-01 00:00:57'), Timestamp('2019-12-31 23:55:53'))

So it has data from 01.01.2015 to 31.12.2019, that is, five full years. Knowing this, we can restrict the stock prices to the same period. But first we have to convert the "Date" column to datetime.

In [10]:
def string_to_date(date_string):
    return datetime.strptime(date_string, "%d-%m-%Y")
stocks.Date = pd.to_datetime(stocks.Date.apply(string_to_date))

In [11]:
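# ISO dates avoid any day/month ambiguity; .copy() lets us add columns later without a SettingWithCopyWarning.
stocks = stocks[(stocks.Date >= '2015-01-01') & (stocks.Date < '2020-01-01')].copy()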
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Low,Open,Volume,High,Close,Adjusted Close
MSFT,7292,2015-02-13,43.150002,43.380001,40264900,43.869999,43.869999,38.792679
MSFT,8326,2019-03-26,116.849998,118.620003,26097700,118.709999,117.910004,114.934296
AMZN,4665,2015-11-27,672.099976,680.799988,1966800,680.98999,673.26001,673.26001
MSFT,8386,2019-06-20,135.720001,137.449997,33042600,137.660004,136.949997,133.987885
GOOG,3692,2019-04-22,1228.310059,1235.98999,807300,1249.089966,1248.839966,1248.839966
AMZN,4693,2016-01-08,606.0,619.659973,5512900,624.140015,607.049988,607.049988
AMZN,5004,2017-04-04,890.280029,891.5,4984700,908.539978,906.830017,906.830017
MSFT,7384,2015-06-26,45.029999,45.650002,49835300,46.279999,45.259998,40.568584
AAPL,9010,2016-09-02,26.705,26.924999,107210000,27.0,26.932501,25.296234
TSLA,1656,2017-01-26,50.150002,50.858002,15760500,51.147999,50.501999,50.501999


As we can see, we don't have data for every day, because the stock market operates only on workdays, unlike Twitter. We will find a way to work around this later.
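
For orientation only, one common way to handle such gaps is to reindex the prices onto a full daily calendar and carry the last known price forward. The sketch below uses made-up prices and is not necessarily the approach this notebook takes later:

```python
import pandas as pd

# Toy prices (made up) for a Friday and the following Monday.
prices = pd.Series(
    [10.0, 12.0],
    index=pd.to_datetime(["2015-01-02", "2015-01-05"]),
    name="value",
)

# Build a calendar with every day in the range, then carry the last
# known price forward over the weekend.
all_days = pd.date_range(prices.index.min(), prices.index.max(), freq="D")
daily = prices.reindex(all_days).ffill()
print(daily)
```

Forward filling assumes the price is unchanged while the market is closed. Aggregating tweets to the next trading day would be an alternative with different trade-offs.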

Now let's strip the data down to just one price column, the value, which we will calculate as the mean of the Open and Close prices.

In [12]:
stocks["Value"] = (stocks.Open + stocks.Close) / 2
stocks = stocks.drop(columns=['Low', 'Open', 'High', 'Close', 'Adjusted Close'])
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Volume,Value
TSLA,1660,2017-02-01,19794000,50.229
GOOG,3357,2017-12-18,1554600,1071.609985
TSLA,1138,2015-01-06,31309500,42.134001
MSFT,8416,2019-08-02,30791600,137.494995
MSFT,8392,2019-06-28,30043000,134.265007
MSFT,8175,2018-08-16,21384300,107.970001
AMZN,4904,2016-11-08,3412600,786.359985
AAPL,8657,2015-04-13,145460400,31.902499
TSLA,1678,2017-02-28,30390500,49.418001
AAPL,9809,2019-11-06,75864400,64.251247


It is a little inconvenient to have the stock name as an index level instead of a regular column. We will fix that and also change the column names to match the Twitter dataset.

In [13]:
stocks = stocks.reset_index(level=0)

In [14]:
stocks.columns = ["ticker_symbol", "date", "volume", "value"]
stocks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6290 entries, 8589 to 2393
Data columns (total 4 columns):
ticker_symbol    6290 non-null object
date             6290 non-null datetime64[ns]
volume           6290 non-null int64
value            6290 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 245.7+ KB


Next, let's take a look at the datatypes and null values for the Twitter dataset.

In [15]:
# In pandas >= 1.2 the argument is show_counts; older versions used null_counts.
tweets.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4336445 entries, 0 to 4336444
Data columns (total 8 columns):
tweet_id         4336445 non-null int64
writer           4280526 non-null object
body             4336445 non-null object
comment_num      4336445 non-null int64
retweet_num      4336445 non-null int64
like_num         4336445 non-null int64
ticker_symbol    4336445 non-null object
date             4336445 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 297.8+ MB


Everything looks good, except ticker_symbol, which should be a category. The writer column also has quite a few missing values, but we won't be using it for our model and analysis, so we can discard it altogether.

In [16]:
tweets.ticker_symbol = tweets.ticker_symbol.astype('category')
tweets = tweets.drop(columns=["writer"])
tweets.head(7)

Unnamed: 0,tweet_id,body,comment_num,retweet_num,like_num,ticker_symbol,date
0,550441509175443456,"lx21 made $10,008 on $AAPL -Check it out! htt...",0,0,1,AAPL,2015-01-01 00:00:57
1,550441672312512512,Insanity of today weirdo massive selling. $aap...,0,0,0,AAPL,2015-01-01 00:01:36
2,550441732014223360,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,0,0,0,AMZN,2015-01-01 00:01:50
3,550442977802207232,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,0,0,1,TSLA,2015-01-01 00:06:47
4,550443807834402816,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1,AAPL,2015-01-01 00:10:05
5,550443807834402816,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1,TSLA,2015-01-01 00:10:05
6,550443808606126081,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1,AAPL,2015-01-01 00:10:05


As we can see, many tweet bodies appear more than once in our dataset. The following code removes the duplicates, keeping only the first row for each distinct body.

In [17]:
tweets = tweets.drop_duplicates(subset=["body"])
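
Note that deduplicating on `body` alone also removes the extra rows of a tweet that is linked to several companies, so such a tweet stays attached only to the first company listed. The sketch below shows this on made-up rows and, purely as a possible alternative rather than what this notebook does, a variant that keeps one row per body and company:

```python
import pandas as pd

# Toy rows (made up) mirroring the pattern in the table above.
toy = pd.DataFrame({
    "tweet_id": [1, 1, 2, 3],
    "body": ["swing trading", "swing trading", "swing trading", "aapl up"],
    "ticker_symbol": ["AAPL", "TSLA", "AAPL", "AAPL"],
})

# Keeps only the first row for each distinct body (keep="first" is the default).
by_body = toy.drop_duplicates(subset=["body"])
print(by_body)

# Alternative: keep one row per (body, company) pair.
by_body_and_ticker = toy.drop_duplicates(subset=["body", "ticker_symbol"])
print(by_body_and_ticker)
```

Which variant is appropriate depends on whether a tweet should count towards every company it mentions.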

### 3. Text Preparation and Exploration

Before we start working with the text, we have to prepare it and take a quick look at some statistics about it. First let's convert all the tweets' bodies into lowercase.

In [18]:
tweets.body = tweets.body.str.lower()

NLTK provides a small corpus of stop words. We will load it into a list and later use it to filter those words out of the tweets.

In [19]:
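# The corpus must be available locally; if it is not, run nltk.download("stopwords") once.
stopwords = nltk.corpus.stopwords.words("english")
# Splitting on non-word characters can yield empty strings, so filter those out too.
stopwords.append("")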

Now let's split the text into single words and remove all the stopwords from it.

In [20]:
def string_into_words(text):
    # Raw string avoids the invalid-escape warning; `text` avoids shadowing the built-in `str`.
    return [w for w in re.split(r"\W+", text) if w not in stopwords]
tweets.body = tweets.body.apply(string_into_words)
tweets.sample(5)

Unnamed: 0,tweet_id,body,comment_num,retweet_num,like_num,ticker_symbol,date
1334484,753537489902538752,"[prime, day, sets, sales, record, amazon, read...",0,0,0,AMZN,2016-07-14 10:31:52
2578911,959170311810879490,"[googl, misses]",0,0,0,GOOGL,2018-02-01 21:03:17
825433,679694716099805185,"[swhc, fb, amzn, nke, panw, atvi, ea, fuked, t...",0,0,0,AMZN,2015-12-23 16:07:03
3637513,1106576874661203971,"[tsla, model3, vins, per, model3vins, 3, 15, 2...",0,0,0,TSLA,2019-03-15 15:24:20
736607,660255379851313152,"[lqd, ishares, iboxx, investment, grade, corpo...",0,0,0,AMZN,2015-10-31 00:42:04
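
To make the splitting step concrete, here is a small self-contained sketch of the same idea. It uses a tiny hand-picked stop-word set (`toy_stopwords`, an assumption for illustration) instead of NLTK's list, so it runs without downloading the corpus:

```python
import re

# A tiny hand-picked stop-word set; the notebook uses NLTK's English list instead.
toy_stopwords = {"", "the", "to", "a", "is", "on"}

def split_into_words(text):
    """Split on runs of non-word characters and drop stop words."""
    return [w for w in re.split(r"\W+", text) if w not in toy_stopwords]

print(split_into_words("$aapl is going to the moon!"))
print(split_into_words("read more: http://t.co/x"))
```

Because `\W+` treats every punctuation character as a separator, the `$` prefix of cashtags is dropped and URLs are broken into fragments such as `http`, `t` and `co`.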


Now we can look at the frequency distribution of the words (how many times each word appears in the tweets). The dataset is too large to analyze every observation, so we will take a smaller sample of the data.

In [133]:
tweets_sample = tweets.sample(10000, random_state=10)
tweets_sample.head(3)

Unnamed: 0,tweet_id,body,comment_num,retweet_num,like_num,ticker_symbol,date
1362011,757926087368269824,"[mobileye, drops, 10, ends, tesla, relationshi...",0,2,0,TSLA,2016-07-26 13:10:35
3439937,1080839775161077760,"[50, dma, resistance, today, kiq, vips, ntes, ...",0,0,0,GOOG,2019-01-03 14:54:18
2974882,1022534616014643200,"[active, traders, try, one, free, trading, gui...",0,0,0,GOOGL,2018-07-26 17:30:24


In [124]:
all_words = tweets_sample.body.sum()

In [125]:
fd = nltk.FreqDist(all_words)

In [131]:
fd.most_common(30)

[('aapl', 4144),
 ('http', 3732),
 ('tsla', 3160),
 ('amzn', 2310),
 ('com', 2291),
 ('apple', 1650),
 ('read', 1280),
 ('us', 1280),
 ('owler', 1204),
 ('https', 1139),
 ('goog', 1135),
 ('msft', 1099),
 ('googl', 1000),
 ('tesla', 840),
 ('fb', 823),
 ('stock', 821),
 ('amazon', 670),
 ('dlvr', 621),
 ('google', 603),
 ('stocks', 595),
 ('microsoft', 567),
 ('inc', 555),
 ('ly', 550),
 ('new', 538),
 ('nflx', 536),
 ('1', 497),
 ('news', 484),
 ('spy', 477),
 ('like', 462),
 ('market', 454)]
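
The counting step above can also be sketched with the standard library alone. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the idea is the same. The toy token lists below are made up, and `chain.from_iterable` is shown as an alternative way to flatten the lists that avoids repeatedly concatenating them:

```python
from collections import Counter
from itertools import chain

# Toy tokenised tweets (made up), in the same list-of-words shape as tweets.body.
bodies = [["aapl", "like", "stock"], ["tsla", "stock"], ["aapl", "news"]]

# chain.from_iterable flattens the lists lazily instead of concatenating them.
counts = Counter(chain.from_iterable(bodies))
print(counts.most_common(2))
```

For a sample of 10,000 tweets either approach is fast enough. The lazy flattening mainly matters when counting over the full dataset.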

As expected, the 30 most common words consist mainly of company names, ticker symbols, and URL fragments. Near the bottom of the list, however, we can see the word "like". This is important, because it expresses some sort of sentiment.