In [1]:
from skimage.io import collection, imread
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import re
import glob

from datetime import datetime

# How is public opinion about a company correlated with its market value?

A company's market value is variable and depends on many factors. The share price reflects the company's perceived value: what the public is willing to pay for a piece of the company. It rises and falls in response to events in the global landscape and within the company itself. One of these factors is becoming more influential than ever: public opinion on social media.

To analyze this correlation we'll look at two datasets. The first contains over 3 million unique tweets, each with its tweet ID, author, post date, text body, and number of comments, likes, and retweets, matched to the related company.

The second contains daily stock price records (from the Forbes2000), which we will use as the reference.

### 1. Data Acquisition 

First, let's read the tweets dataset into pandas and inspect a small sample from each of its two dataframes.

In [2]:
tweets = pd.read_csv('./top-companies-tweets/Tweet.csv')
tweets.sample(5)

Unnamed: 0,tweet_id,writer,post_date,body,comment_num,retweet_num,like_num
1741149,848667612305387525,OptionsProOI,1491173100,$GOOG #Options OI chart. Free stocks app https...,0,0,0
1072960,737982654088351744,wealthindiv,1464783750,Wall Street Breakfast: Abenomics In Jeopardy? ...,0,0,0
2022723,907611242046607360,whotrades,1505226357,Netflix Is a Joke -- and the Joke Is on You ht...,0,0,0
2708561,1039963393883615232,Chapter11Cases,1536781569,"""The electric-car maker’s finance department h...",0,2,4
733321,684234767680704512,IHNewsDesk,1451969255,"$WMB Short Sales Updated Monday, January 4, 20...",0,0,0


In [4]:
tweets_company = pd.read_csv('./top-companies-tweets/Company_Tweet.csv')
tweets_company.sample(5)

Unnamed: 0,tweet_id,ticker_symbol
1304428,1126125822694309889,AAPL
1514272,643439441592385538,GOOG
3604666,945997421540147202,TSLA
1862961,623593865916870656,GOOGL
3246528,560501848147128320,TSLA


Next, let's read the dataset for each of the stocks we are monitoring. We will store them in a dictionary whose keys are the companies' ticker symbols and whose values are the corresponding stock price histories.

In [5]:
stocks_df = {}
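for name in glob.glob('./stocks/*'):
    # Use the file name without its extension (e.g. "AAPL") as the key.
    # os.path handles both "/" and "\\" separators, so this works on any OS.
    ticker = os.path.splitext(os.path.basename(name))[0]
    stocks_df[ticker] = pd.read_csv(name)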
stocks = pd.concat(stocks_df)
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Low,Open,Volume,High,Close,Adjusted Close
AMZN,4943,05-01-2017,760.26001,761.549988,5830100,782.400024,780.450012,780.450012
TSLA,1170,23-02-2015,41.265999,43.132,42499000,43.639999,41.467999,41.467999
MSFT,4574,28-04-2004,26.469999,27.01,72842200,27.049999,26.559999,16.979584
MSFT,5510,16-01-2008,32.509998,33.419998,120778500,33.650002,33.23,24.715776
AAPL,2206,05-09-1989,0.397321,0.397321,114822400,0.405134,0.399554,0.320225
AMZN,2576,13-08-2007,74.699997,76.089996,6068600,76.32,74.870003,74.870003
MSFT,6436,19-09-2011,26.6,26.799999,52324900,27.309999,27.209999,21.941296
MSFT,6626,20-06-2012,30.639999,30.93,36257100,31.049999,30.93,25.460203
AAPL,4379,09-04-1998,0.223214,0.223772,170307200,0.231027,0.228795,0.196741
MSFT,3972,05-12-2001,32.599998,33.244999,74243000,34.084999,34.049999,21.577059
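
Passing a dictionary to `pd.concat` uses its keys as the outer level of the resulting index, which is why the ticker appears as the first index column above. A minimal sketch with made-up toy frames (the level names `ticker` and `row` are illustrative, not from the notebook) shows how to select one company or turn the ticker into an ordinary column:

```python
import pandas as pd

# Toy stand-ins for the per-company CSV files (values are made up).
frames = {
    "AAPL": pd.DataFrame({"Date": ["02-01-2015", "05-01-2015"], "Close": [1.0, 2.0]}),
    "TSLA": pd.DataFrame({"Date": ["02-01-2015", "05-01-2015"], "Close": [3.0, 4.0]}),
}

# The dict keys become the outer index level; `names` labels both levels.
combined = pd.concat(frames, names=["ticker", "row"])

# Select all rows of one company by its outer-level key.
aapl = combined.loc["AAPL"]

# Or move the ticker level into a regular column.
flat = combined.reset_index(level="ticker")
```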


### 2. Data Tidying and Cleaning

First, let's combine the two tables from the Twitter dataset. After that we will convert the dates to datetime objects.

In [6]:
tweets = pd.merge(tweets, tweets_company, on = "tweet_id")
tweets.sample(5)

Unnamed: 0,tweet_id,writer,post_date,body,comment_num,retweet_num,like_num,ticker_symbol
384160,607913233501057024,ShiningShadow,1433772784,.#Walmart Is Finally Ready to Take On #Amazon ...,0,0,0,AMZN
184784,577777733264154624,corrbheinn,1426587920,$AAPL - MARKET SNAPSHOT: U.S. Stocks: Futures ...,0,0,0,AAPL
645202,646620787546308608,leahanneta,1443001384,"MYEC MyECheck, Inc. Investor Opinionshttp://dl...",0,0,1,AMZN
1151772,727508013863546881,Nasdaq,1462286401,Tesla earnings are expected tomorrow! Here's w...,0,8,5,TSLA
2122184,874983297977397248,TiernanRayTech,1497447249,Apple: Production Estimates Going Higher for i...,0,3,3,AAPL
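
An inner merge on `tweet_id` can change the row count: a tweet linked to several tickers is repeated once per ticker, and a tweet with no ticker is dropped. Whether either case occurs in the real data is not checked here; the sketch below uses made-up toy tables only to illustrate the behavior, with `validate` guarding the assumption that `tweet_id` is unique on the tweet side.

```python
import pandas as pd

# Toy tables shaped like Tweet.csv and Company_Tweet.csv (values are made up).
toy_tweets = pd.DataFrame({"tweet_id": [1, 2, 3], "body": ["a", "b", "c"]})
toy_links = pd.DataFrame(
    {"tweet_id": [1, 1, 2], "ticker_symbol": ["GOOG", "GOOGL", "TSLA"]}
)

# validate="one_to_many" raises MergeError if tweet_id repeats on the left side.
merged = pd.merge(
    toy_tweets, toy_links, on="tweet_id", how="inner", validate="one_to_many"
)

# Tweet 1 appears twice (two tickers); tweet 3 is gone (no ticker).
rows_per_tweet = merged.groupby("tweet_id").size()
```

Comparing `len(tweets)` before and after the merge is a quick way to see whether rows were duplicated or lost.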


Let's see what timeframe our dataset covers by getting the dates of the earliest and latest tweets.

In [7]:
tweets.post_date = pd.to_datetime(tweets.post_date, unit='s')
tweets.post_date.min(), tweets.post_date.max()

(Timestamp('2015-01-01 00:00:57'), Timestamp('2019-12-31 23:55:53'))
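
The `unit='s'` argument tells pandas that `post_date` holds Unix timestamps, i.e. seconds elapsed since 1970-01-01 00:00:00 UTC. A small sketch using the `post_date` of one of the tweets sampled earlier:

```python
import pandas as pd

# 1491173100 is the post_date of one of the sampled tweets shown earlier.
ts = pd.to_datetime(1491173100, unit="s")
print(ts)  # 2017-04-02 22:45:00

# The result is timezone-naive. Unix time is counted from a UTC epoch,
# so the timestamp can be labeled as UTC explicitly if needed.
ts_utc = ts.tz_localize("UTC")
```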

The tweets span 01.01.2015 to 31.12.2019, that is, five full calendar years. Knowing this, we can restrict the stock prices to the same period. But first we have to convert the "Date" column to datetime.

In [8]:
def string_to_date(date_string):
    # Dates are stored day-first, e.g. "05-01-2017" is 5 January 2017.
    return datetime.strptime(date_string, "%d-%m-%Y")
stocks.Date = stocks.Date.apply(string_to_date)
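
Spelling out the day-first format matters because a string such as `05-01-2017` is ambiguous on its own. The sketch below uses two date strings taken from the stock sample above and shows a vectorized alternative with `pd.to_datetime`. It is an optional variant, not the notebook's method.

```python
from datetime import datetime

import pandas as pd

# Two date strings taken from the stock sample above.
raw = pd.Series(["05-01-2017", "23-02-2015"])

# Vectorized parsing with an explicit day-first format.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

# A month-first format cannot parse "23-02-2015" because there is no month 23.
try:
    datetime.strptime("23-02-2015", "%m-%d-%Y")
    month_first_ok = True
except ValueError:
    month_first_ok = False
```

Here `parsed[0]` is 5 January 2017 rather than 1 May 2017, and `month_first_ok` ends up `False`.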

In [9]:
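# ISO dates avoid day/month ambiguity; .copy() lets us add columns later
# without a SettingWithCopyWarning.
stocks = stocks[(stocks.Date >= '2015-01-01') & (stocks.Date < '2020-01-01')].copy()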
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Low,Open,Volume,High,Close,Adjusted Close
AAPL,9206,2017-06-15,35.552502,35.830002,128661600,36.119999,36.072498,34.343712
GOOG,3781,2019-08-27,1161.449951,1180.530029,1077200,1182.400024,1167.839966,1167.839966
GOOG,2946,2016-05-03,692.0,696.869995,1543800,697.840027,692.359985,692.359985
TSLA,2271,2019-07-09,45.456001,45.793999,30954000,46.200001,46.012001,46.012001
AMZN,5226,2018-02-21,1478.920044,1485.0,6304400,1503.48999,1482.920044,1482.920044
MSFT,8512,2019-12-18,154.179993,154.300003,24129200,155.479996,154.369995,152.049698
AAPL,8640,2015-03-18,31.592501,31.75,261083600,32.290001,32.1175,29.279362
MSFT,8395,2019-07-03,136.289993,136.800003,13629300,137.740005,137.460007,134.486893
AMZN,5591,2019-08-05,1748.780029,1770.219971,6058200,1788.670044,1765.130005,1765.130005
AAPL,9396,2018-03-19,43.415001,44.330002,133787200,44.3675,43.825001,42.210503


As we can see, we don't have data for every day, because the stock market operates only on workdays, unlike Twitter. We will find a way to work around this later.
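
One common way to bridge such gaps, not necessarily the approach taken later in this notebook, is to reindex the prices onto a full daily calendar and carry the last known price forward. A minimal sketch with made-up prices:

```python
import pandas as pd

# Toy prices for Friday 2 Jan and Monday 5 Jan 2015 (values are made up).
prices = pd.Series(
    [10.0, 12.0], index=pd.to_datetime(["2015-01-02", "2015-01-05"])
)

# Build a calendar containing every day in the range,
# then carry the last known price forward over the weekend.
all_days = pd.date_range(prices.index.min(), prices.index.max(), freq="D")
daily = prices.reindex(all_days).ffill()
```

Forward filling assumes the price stays constant while the market is closed. Aggregating tweets to trading days instead is an equally reasonable alternative.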

Now let's strip the data down to a single column, the value, which we will calculate as the mean of the Open and Close prices.

In [10]:
stocks["Value"] = (stocks.Open + stocks.Close) / 2
stocks = stocks.drop(columns=['Low', 'Open', 'Volume', 'High', 'Close', 'Adjusted Close'])
stocks.sample(10)

Unnamed: 0,Unnamed: 1,Date,Value
AMZN,4994,2017-03-21,851.02002
AMZN,4467,2015-02-17,376.574997
MSFT,7630,2016-06-17,50.27
GOOG,3777,2019-08-21,1192.200012
GOOG,3617,2019-01-02,1031.209991
GOOG,3219,2017-06-02,972.529999
AAPL,9463,2018-06-22,46.379999
AMZN,4759,2016-04-14,617.910004
MSFT,7778,2017-01-19,62.27
TSLA,1167,2015-02-18,40.862999
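
As a quick check of the formula, the sketch below recomputes `Value` for two rows copied from the earlier stock sample (AMZN on 05-01-2017 and TSLA on 23-02-2015), keeping only their Open and Close prices:

```python
import pandas as pd

# Open and Close prices copied from two rows of the earlier stock sample.
sample = pd.DataFrame(
    {"Open": [761.549988, 43.132], "Close": [780.450012, 41.467999]},
    index=["AMZN", "TSLA"],
)

# Same formula as in the notebook: midpoint of the opening and closing price.
sample["Value"] = (sample.Open + sample.Close) / 2
```

The midpoint ignores the intraday High and Low, so it is a rough daily summary rather than a volume-weighted average.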
