# Stocks - Data Aggregation

## Objective

This notebook aggregates data for the following nine stocks:
*['Walmart Inc.', 'Microsoft Corporation', 'The Home Depot', 'Alphabet Inc.', 'Apple Inc.', 'Wells Fargo', 'Chevron Corporation', 'The Coca-Cola Co', 'Exxon Mobil Corporation']*

The data is aggregated from the following sources:

1. OHLCV Data
2. General News
3. Financial News
4. Reddit Information
5. Twitter Information

The data covers the period from 1 January 2018 to 27 February 2019 (about 14 months) for all 9 stocks.

### OHLCV Data - Stocks

In [28]:
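import pandas as pd

# Load the hourly OHLCV bars and align column names with the other sources.
ohlcv_data = pd.read_json('Hourly-Processed-Data/stocks.json')
ohlcv_data = ohlcv_data.rename({'created_time': 'created_utc', 'asset_name': 'symbol'}, axis=1)
# Drop timezone information so timestamps compare cleanly across sources.
ohlcv_data['created_utc'] = pd.to_datetime(ohlcv_data['created_utc']).dt.tz_localize(None)
ohlcv_data.head()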

Unnamed: 0,symbol,close,created_utc,high,low,name,open,volume
0,WMT,88.89,2018-03-12 09:00:00,89.43,88.81,Walmart Inc.,88.81,82783
1,WMT,88.43,2018-03-12 10:00:00,88.99,88.37,Walmart Inc.,88.89,123830
2,WMT,88.34,2018-03-12 11:00:00,88.55,88.14,Walmart Inc.,88.4,86675
3,WMT,88.07,2018-03-12 12:00:00,88.38,88.0,Walmart Inc.,88.35,62987
4,WMT,88.04,2018-03-12 13:00:00,88.26,88.04,Walmart Inc.,88.08,34598
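
Each source in this notebook passes its timestamp column through `pd.to_datetime(...).dt.tz_localize(None)`. The likely intent is that all frames end up with timezone-naive timestamps, since pandas generally refuses to merge a timezone-aware column with a naive one. The snippet below is a minimal sketch with made-up timestamps, assuming the raw values are UTC strings; it is not taken from the actual data files.

```python
import pandas as pd

# Hypothetical raw timestamps, as they might appear in a JSON export.
raw = pd.Series(["2018-03-12T09:00:00Z", "2018-03-12T10:00:00Z"])

aware = pd.to_datetime(raw, utc=True)  # timezone-aware (UTC)
naive = aware.dt.tz_localize(None)     # same wall-clock time, timezone removed

print(aware.dt.tz)
print(naive.dt.tz)
```

Because `tz_localize(None)` keeps the wall-clock time, the hourly buckets stay the same; only the timezone label is removed.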


### Reddit Data - Stocks

In [29]:
reddit_stock_data = pd.read_json('Hourly-Processed-Data/reddit_processed_stocks.json')
reddit_stock_data['created_utc'] = pd.to_datetime(reddit_stock_data['created_utc']).dt.tz_localize(None)
# Prefix the sentiment columns with 'reddit_' so they stay distinct after merging.
reddit_stock_data = reddit_stock_data.rename({
    'compound': 'reddit_compound', 'domain': 'reddit_domain', 'neg': 'reddit_neg', 'neu': 'reddit_neu',
    'num_comments': 'reddit_num_comments', 'pos': 'reddit_pos', 'score': 'reddit_score',
    'title': 'reddit_title', 'stock': 'name'}, axis=1)

reddit_stock_data.head(3)

Unnamed: 0,reddit_compound,created_utc,reddit_domain,reddit_neg,reddit_neu,reddit_num_comments,reddit_pos,reddit_score,name,reddit_title
0,0.24627,2018-01-01 00:00:00,"[v.redd.it, self.apple, self.apple, self.apple...",0.0512,0.6993,10,0.2494,8.2,Apple,"[Don’t get me wrong I love my MacBook, but the..."
1,0.002618,2018-01-01 01:00:00,"[self.apple, self.apple, siguza.github.io, sel...",0.019,0.966455,11,0.014545,46.727273,Apple,"[Okay, it's 2018! Let's get that announcement ..."
2,0.0851,2018-01-01 02:00:00,"[self.apple, self.apple, self.apple, self.appl...",0.0,0.933667,6,0.066333,1.5,Apple,"[I left Apple Music for Spotify!, New iPhone X..."


In [30]:
new_df = pd.merge(ohlcv_data,reddit_stock_data,  how='left', on = ['name', 'created_utc'])
new_df.head(2)

Unnamed: 0,symbol,close,created_utc,high,low,name,open,volume,reddit_compound,reddit_domain,reddit_neg,reddit_neu,reddit_num_comments,reddit_pos,reddit_score,reddit_title
0,WMT,88.89,2018-03-12 09:00:00,89.43,88.81,Walmart Inc.,88.81,82783,,,,,,,,
1,WMT,88.43,2018-03-12 10:00:00,88.99,88.37,Walmart Inc.,88.89,123830,,,,,,,,
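
The Reddit columns in the two rows above are all `NaN`. That may simply mean there were no Reddit posts for those hours, but it is also worth checking that the `name` keys agree: the Reddit preview shows `Apple`, while the OHLCV preview uses full company names such as `Walmart Inc.`. The sketch below uses toy frames with made-up values to show how `indicator=True` could reveal such a mismatch; the name mapping is an assumption for illustration.

```python
import pandas as pd

ts = pd.Timestamp("2018-03-12 09:00:00")

# Toy frames with made-up values; only the key spelling matters here.
prices = pd.DataFrame({
    "name": ["Apple Inc.", "Walmart Inc."],
    "created_utc": [ts, ts],
    "close": [100.0, 88.89],
})
reddit = pd.DataFrame({
    "name": ["Apple"],  # short name, as in the Reddit preview
    "created_utc": [ts],
    "reddit_compound": [0.25],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
checked = pd.merge(prices, reddit, how="left",
                   on=["name", "created_utc"], indicator=True)
print(checked["_merge"].value_counts())

# Hypothetical fix: map short names to the full company names before merging.
reddit_fixed = reddit.assign(name=reddit["name"].replace({"Apple": "Apple Inc."}))
rechecked = pd.merge(prices, reddit_fixed, how="left",
                     on=["name", "created_utc"], indicator=True)
print(rechecked["_merge"].value_counts())
```

If every row comes back as `left_only`, the keys never match and the Reddit features would be empty for the whole dataset.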


### General News - Stocks

In [32]:
gen_news = pd.read_json('Hourly-Processed-Data/processed_general_news.json')
gen_news['time'] = pd.to_datetime(gen_news['time']).dt.tz_localize(None)
gen_news = gen_news.rename({
    'compound': 'news_compound', 'kids': 'news_kids', 'neg': 'news_neg', 'neu': 'news_neu',
    'url': 'news_url', 'pos': 'news_pos', 'score': 'news_score',
    'title': 'news_title', 'time': 'created_utc'}, axis=1)
gen_news.head(3)

Unnamed: 0,news_compound,news_kids,news_neg,news_neu,news_pos,news_score,created_utc,news_title,news_url
0,0.128304,57,0.03,0.850982,0.119018,6.754386,2017-09-27 20:00:00,"[Hacktoberfest 2017, 18 things only an Indie d...","[hacktoberfest.digitalocean.com, www.buildbox...."
1,0.060505,58,0.060103,0.852379,0.087517,4.689655,2017-09-27 21:00:00,[Introducing Akaunting: Free Accounting Softwa...,"[akaunting.com, futurism.com, www.bbc.co.uk, l..."
2,0.103068,47,0.056213,0.826766,0.117021,3.957447,2017-09-27 22:00:00,[US Senator sees Reddit as potential target fo...,"[thehill.com, www.facebook.com, www.npmjs.com,..."


In [34]:
result_1 = pd.merge(new_df,gen_news,on=['created_utc'],how='left')
result_1 = result_1[result_1.created_utc < '2019-02-19 23:00:00']
result_1.columns.values

array(['symbol', 'close', 'created_utc', 'high', 'low', 'name', 'open',
       'volume', 'reddit_compound', 'reddit_domain', 'reddit_neg',
       'reddit_neu', 'reddit_num_comments', 'reddit_pos', 'reddit_score',
       'reddit_title', 'news_compound', 'news_kids', 'news_neg',
       'news_neu', 'news_pos', 'news_score', 'news_title', 'news_url'],
      dtype=object)
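
General news carries no stock identifier, so the merge above joins on `created_utc` alone and every stock trading in a given hour receives the same news row. The sketch below illustrates this with toy frames and made-up values. The `validate="many_to_one"` argument is an optional guard, added here on the assumption that the news frame holds at most one row per hour.

```python
import pandas as pd

hour = pd.Timestamp("2018-03-12 09:00:00")

# Two stocks trading in the same hour (toy data).
prices = pd.DataFrame({"symbol": ["WMT", "AAPL"], "created_utc": [hour, hour]})
# One market-wide news row for that hour (made-up score).
news = pd.DataFrame({"created_utc": [hour], "news_compound": [0.1]})

# The single news row is attached to both stocks.
merged = pd.merge(prices, news, on="created_utc", how="left",
                  validate="many_to_one")
print(merged)

# If the news frame had duplicate hours, validate would raise instead of
# silently multiplying the price rows.
dup_news = pd.concat([news, news], ignore_index=True)
try:
    pd.merge(prices, dup_news, on="created_utc", how="left",
             validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```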

### Financial News - Stocks

In [36]:
fin_news = pd.read_json('Hourly-Processed-Data/processed_financial_news.json')
fin_news['created_utc'] = pd.to_datetime(fin_news['created_utc']).dt.tz_localize(None)
fin_news = fin_news.rename({
    'compound': 'fin_compound', 'subheading': 'fin_subheading', 'neg': 'fin_neg', 'neu': 'fin_neu',
    'pos': 'fin_pos', 'title': 'fin_title'}, axis=1)
fin_news.head(3)

Unnamed: 0,fin_compound,created_utc,fin_neg,fin_neu,fin_pos,fin_subheading,fin_title
0,0.0,2017-02-16 22:00:00,0.0,1.0,0.0,[0],"[Fast Asia Open: Singapore GDP, Thailand forex..."
1,0.0,2017-02-16 23:00:00,0.0,0.0,0.0,[],[]
2,0.368767,2017-02-17 00:00:00,0.035,0.775,0.19,"[0, 0, Wall Street broke its longest winning s...","[Sterling's puzzling purple patch, Singapore Q..."


In [37]:
result_2 = pd.merge(result_1,fin_news,on=['created_utc'],how='left')
result_2.columns.values

array(['symbol', 'close', 'created_utc', 'high', 'low', 'name', 'open',
       'volume', 'reddit_compound', 'reddit_domain', 'reddit_neg',
       'reddit_neu', 'reddit_num_comments', 'reddit_pos', 'reddit_score',
       'reddit_title', 'news_compound', 'news_kids', 'news_neg',
       'news_neu', 'news_pos', 'news_score', 'news_title', 'news_url',
       'fin_compound', 'fin_neg', 'fin_neu', 'fin_pos', 'fin_subheading',
       'fin_title'], dtype=object)

In [49]:
stocks = [{'asset_name':'walmart', 'name':'Walmart Inc.'},
{'asset_name':'Microsoft', 'name': 'Microsoft Corporation'},
{'asset_name':'Home Depot', 'name': 'The Home Depot'},
{'asset_name':'goldman sachs', 'name': 'Goldman Sachs Group'},
{'asset_name':'google', 'name': 'Alphabet Inc.'},
{'asset_name':'Apple', 'name': 'Apple Inc.'},
{'asset_name':'Wells Fargo', 'name': 'Wells Fargo'},
{'asset_name':'Chevron', 'name': 'Chevron Corporation'},
{'asset_name':'coca cola', 'name': 'The Coca-Cola Co'},
{'asset_name':'exxon mobil', 'name': 'Exxon Mobil Corporation'}]
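df_st = pd.DataFrame.from_records(stocks)
# Attach the short name to each row; the inner join drops companies missing from the lookup.
stock_df = pd.merge(result_2,df_st,on=['name'],how='inner')
# Swap the labels: 'asset_name' now holds the full company name (the Twitter merge key), 'name' the short one.
stock_df = stock_df.rename({'asset_name':'name','name':'asset_name'},axis=1)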
stock_df.head()

Unnamed: 0,symbol,close,created_utc,high,low,asset_name,open,volume,reddit_compound,reddit_domain,...,news_score,news_title,news_url,fin_compound,fin_neg,fin_neu,fin_pos,fin_subheading,fin_title,name
0,WMT,88.89,2018-03-12 09:00:00,89.43,88.81,Walmart Inc.,88.81,82783,,,...,21.825,"[Python, Go or Haskell?, Using the Singleton p...","[, fullstack-developer.academy, betanews.com, ...",0.0,0.0,1.0,0.0,[Regulator compels groups to unwind investment...,[China takes aim at debt-funded bank stakes],walmart
1,WMT,88.43,2018-03-12 10:00:00,88.99,88.37,Walmart Inc.,88.89,123830,,,...,16.97619,[How to avoid pattern matching with List in Sc...,"[functional.works-hub.com, www.speakingtree.in...",0.0,0.0,0.0,0.0,[],[],walmart
2,WMT,88.34,2018-03-12 11:00:00,88.55,88.14,Walmart Inc.,88.4,86675,,,...,5.26087,[Paper Windmill Super Easy Instruction for Kid...,"[youtu.be, healthiercentral.com, , blog.proces...",0.025733,0.048,0.898333,0.053333,"[0, Luck begins to wane once electricity and h...",[BoE to begin £18.3bn gilt reinvestment progra...,walmart
3,WMT,88.07,2018-03-12 12:00:00,88.38,88.0,Walmart Inc.,88.35,62987,,,...,4.073171,[New UC Research May Provide Clues to How the ...,"[www.healthnews.uc.edu, www.eno8.com, techcrun...",0.04646,0.1228,0.7678,0.1094,[The world’s two largest economies are sliding...,[America v China: How trade wars become real w...,walmart
4,WMT,88.04,2018-03-12 13:00:00,88.26,88.04,Walmart Inc.,88.08,34598,,,...,6.676923,"[Employee Satisfaction: Make or Break, Tesla h...","[www.playbuzz.com, www.teslarati.com, blog.sic...",-0.197133,0.103667,0.896333,0.0,"[0, Decision to pursue direct listing will be ...",[Grain and soyabean prices drop on US tariff c...,walmart


### Twitter - Stocks

In [48]:
twitter_stock_data = pd.read_json('twitter_stocks.json')
twitter_stock_data['created_utc'] = pd.to_datetime(twitter_stock_data['created_utc']).dt.tz_localize(None)
twitter_stock_data = twitter_stock_data.rename({
    'compound': 'tweet_compound', 'favorites': 'tweet_favorites', 'neg': 'tweet_neg', 'neu': 'tweet_neu',
    'pos': 'tweet_pos', 'retweets': 'tweet_retweets', 'text': 'tweet_text', 'hashtags': 'tweet_hashtags'}, axis=1)

twitter_stock_data.head(5)

Unnamed: 0,asset_name,tweet_compound,created_utc,tweet_favorites,tweet_hashtags,tweet_neg,tweet_neu,tweet_pos,tweet_retweets,tweet_text
0,Alphabet Inc.,0.124112,2017-12-31 16:00:00,60,"[#NASDAQ #NYSE, #NASDAQ #NYSE #DayTrading #day...",0.049767,0.863883,0.08635,60,[Check out the #NASDAQ or #NYSE trade of the w...
1,Alphabet Inc.,0.373762,2017-12-31 17:00:00,29,"[None, None, None, #bizhour #SEOChat, None, No...",0.016172,0.834414,0.149448,29,"[As per the usual, February was another month ..."
2,Alphabet Inc.,0.241308,2017-12-31 18:00:00,24,"[#energy, None, None, None, None, None, None, ...",0.065958,0.780875,0.153292,24,[Re-shared from 12/04/2017║Google now runs 3.0...
3,Alphabet Inc.,0.201534,2017-12-31 19:00:00,32,[#CustomerService #CX #Yelp #Facebook #Google ...,0.055719,0.79375,0.1505,32,[5-Star Methods For Increasing Positive Custom...
4,Alphabet Inc.,0.245195,2017-12-31 20:00:00,22,"[None, #javascript #HappyNewYear2018, None, No...",0.0475,0.820455,0.132091,22,"[With past performance like this, how can you ..."


### Final Aggregation

In [54]:
aggerageted_df = pd.merge(stock_df,twitter_stock_data,  how='left', on = ['asset_name', 'created_utc'])
# Goldman Sachs Group is not among the nine target stocks, so exclude it.
aggerageted_df = aggerageted_df[aggerageted_df.asset_name != 'Goldman Sachs Group']
aggerageted_df.head(5)

Unnamed: 0,symbol,close,created_utc,high,low,asset_name,open,volume,reddit_compound,reddit_domain,...,fin_title,name,tweet_compound,tweet_favorites,tweet_hashtags,tweet_neg,tweet_neu,tweet_pos,tweet_retweets,tweet_text
0,WMT,88.89,2018-03-12 09:00:00,89.43,88.81,Walmart Inc.,88.81,82783,,,...,[China takes aim at debt-funded bank stakes],walmart,0.29011,10.0,[#WTFDidYouSayBabe #TeamWork #portersworld #th...,0.0192,0.887,0.0938,10.0,[Today on #WTFDidYouSayBabe it’s about #TeamWo...
1,WMT,88.43,2018-03-12 10:00:00,88.99,88.37,Walmart Inc.,88.89,123830,,,...,[],walmart,0.35446,10.0,"[#ad #VicksHolidayFix, None, None, None, None,...",0.034,0.8167,0.1492,10.0,"[Got a cold, cough, or the flu? It's just the ..."
2,WMT,88.34,2018-03-12 11:00:00,88.55,88.14,Walmart Inc.,88.4,86675,,,...,[BoE to begin £18.3bn gilt reinvestment progra...,walmart,0.35135,10.0,"[#WMT, None, None, None, #Fortnite #GiveAwayEv...",0.0203,0.8476,0.132,10.0,[When you where your #WMT sweatshirt to a car ...
3,WMT,88.07,2018-03-12 12:00:00,88.38,88.0,Walmart Inc.,88.35,62987,,,...,[America v China: How trade wars become real w...,walmart,0.293258,12.0,"[None, #WalkAway, None, None, None, #Fortnite,...",0.044417,0.83325,0.12225,12.0,[Missed out on the Walmart fortnite spray? Tun...
4,WMT,88.04,2018-03-12 13:00:00,88.26,88.04,Walmart Inc.,88.08,34598,,,...,[Grain and soyabean prices drop on US tariff c...,walmart,-0.047029,7.0,"[None, #MadeIt, #ad #PayPalCanDoThat, None, No...",0.082429,0.841143,0.076429,7.0,[The world's highest-paid YouTube star: Ryan T...
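
Because every source is attached with a left join, rows without a match keep `NaN` in that source's columns, as the Reddit columns in the preview show. One way to quantify this is to measure, per source prefix, the share of rows with at least one non-missing value. The sketch below uses a toy frame with made-up numbers; the prefix list mirrors the column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the aggregated frame (values are made up).
toy = pd.DataFrame({
    "close": [88.89, 88.43, 88.34],
    "reddit_compound": [np.nan, np.nan, 0.1],
    "news_compound": [0.1, 0.2, 0.3],
    "fin_compound": [0.0, np.nan, 0.02],
    "tweet_compound": [0.29, 0.35, 0.35],
})

prefixes = ["reddit_", "news_", "fin_", "tweet_"]

# For each source: fraction of rows where any of its columns is populated.
coverage = {
    prefix.rstrip("_"): toy.filter(regex=f"^{prefix}").notna().any(axis=1).mean()
    for prefix in prefixes
}
print(coverage)
```

A source with coverage near zero usually points to a key mismatch rather than genuinely missing data.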


In [55]:
#aggerageted_df.to_json('processed_stocks.json',orient='records',date_format='iso')
aggerageted_df.columns

Index(['symbol', 'close', 'created_utc', 'high', 'low', 'asset_name', 'open',
       'volume', 'reddit_compound', 'reddit_domain', 'reddit_neg',
       'reddit_neu', 'reddit_num_comments', 'reddit_pos', 'reddit_score',
       'reddit_title', 'news_compound', 'news_kids', 'news_neg', 'news_neu',
       'news_pos', 'news_score', 'news_title', 'news_url', 'fin_compound',
       'fin_neg', 'fin_neu', 'fin_pos', 'fin_subheading', 'fin_title', 'name',
       'tweet_compound', 'tweet_favorites', 'tweet_hashtags', 'tweet_neg',
       'tweet_neu', 'tweet_pos', 'tweet_retweets', 'tweet_text'],
      dtype='object')
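
The commented-out export above writes the frame with `orient='records'` and `date_format='iso'`. The sketch below round-trips a one-row toy frame through that format in memory to show what comes back. The frame contents are made up, and parsing with `json.loads` is just one way to read the records.

```python
import json

import pandas as pd

# One-row toy frame with a timestamp and a list-valued cell (made-up values).
toy = pd.DataFrame({
    "symbol": ["WMT"],
    "created_utc": [pd.Timestamp("2018-03-12 09:00:00")],
    "tweet_hashtags": [["#WMT", None]],
})

# With no path argument, to_json returns the JSON text instead of writing a file.
payload = toy.to_json(orient="records", date_format="iso")

# Timestamps come back as ISO strings, so they are parsed again and made
# timezone-naive, mirroring the loading cells earlier in the notebook.
restored = pd.DataFrame(json.loads(payload))
restored["created_utc"] = pd.to_datetime(restored["created_utc"]).dt.tz_localize(None)
print(restored.dtypes)
```

List-valued cells such as `tweet_hashtags` survive the round trip as Python lists, which is one reason JSON records suit this dataset better than CSV.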