# Cleaning up market data stock splits
---

This notebook builds on [cleaning up the market data](https://www.kaggle.com/danielson/cleaning-up-market-data-errors-and-stock-splits) by [Danielson](https://www.kaggle.com/danielson/) and tries to find all the stock splits. The goal is to have the original open and close prices adjusted for splits, so that derived features can be built from them.

We find:
- 447 split announcements
- of which 224 were for stocks in the market_train data
- of which 126 were in the market_train trading window (the relevant set)
- of which I could find what I think are the **actual stock split dates** for **105**

Now, if only you could find the exact split date for the **remaining 21** for me, or somehow determine that the announcement was not a real one or that the split never took place … :-)


In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

In [None]:
(market_train_df, news_train_df) = env.get_training_data()
market_train_df['date'] = pd.to_datetime(market_train_df.time.dt.date)

In [None]:
apple = market_train_df[market_train_df['assetCode'] == 'AAPL.O']
ax = apple.plot(kind='line',x='date',y='open',figsize=(12,6))

From the graph it looks like Apple's stock plummeted in June 2014, but actually the stock just split. There is commentary about this event in the news data...

In [None]:
appleNews = news_train_df[news_train_df['assetName'] == 'Apple Inc']
list(appleNews[(appleNews['headline'].str.contains('stock split')) & (appleNews['relevance'] >= 0.6)].head()['headline'])

Apple's 7-to-1 stock split occurred on 2014-06-09.  It's worth noting that while the `open` and `close` columns don't take into account the stock split, the returns columns do (see excerpt below).

In [None]:
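# show the raw prices next to a returns column: the price drops by a factor of 7, the return does not
apple[(apple['time'] > '2014-06-01') & (apple['time'] < '2014-06-16')][['time','open','close','returnsOpenPrevRaw1']]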

Since there appear to be no issues with the returns columns, you shouldn't need to worry about stock splits unless you plan to calculate new features from the `open` and `close` columns.

However, I am interested in trying to create some new features from these columns (e.g. moving averages), so I'm looking at adjusting historical stock prices like Apple's to account for splits.  Below is a graph showing an adjusted view of Apple's `open` price along with some handy moving averages.

In [None]:
apple['adjOpen'] = np.where(apple['time'] < '2014-06-09',apple['open']/7.0,apple['open'])
apple['MA10'] = apple['adjOpen'].rolling(window=10).mean()
apple['MA50'] = apple['adjOpen'].rolling(window=50).mean()
apple['MA200'] = apple['adjOpen'].rolling(window=200).mean()
ax = apple.plot(kind='line',x='date',y=['adjOpen','MA10','MA50','MA200'], figsize=(16,6))
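
The Apple adjustment above hard-codes one split date and one ratio. As a minimal sketch of how the same idea might generalize to several splits, the hypothetical helper below divides every price quoted before a split by that split's ratio. The function name, the `(split_date, ratio)` input format and the sample prices are assumptions made for illustration; they are not part of the competition data.

```python
from datetime import date

def adjust_for_splits(dates, prices, splits):
    """Back-adjust unadjusted prices so they line up with post-split prices.

    dates  : list of datetime.date, aligned with `prices`
    prices : list of float, unadjusted prices
    splits : list of (split_date, ratio) pairs, where ratio is the number of
             new shares per old share (7.0 for a 7-for-1 split)
    """
    adjusted = []
    for d, p in zip(dates, prices):
        factor = 1.0
        for split_date, ratio in splits:
            # Every price quoted before a split is divided by that split's ratio.
            if d < split_date:
                factor *= ratio
        adjusted.append(p / factor)
    return adjusted

# Illustrative numbers only, not real quotes.
dates = [date(2014, 6, 5), date(2014, 6, 6), date(2014, 6, 9), date(2014, 6, 10)]
prices = [644.0, 651.0, 92.0, 94.0]
adj = adjust_for_splits(dates, prices, [(date(2014, 6, 9), 7.0)])
print(adj)
```

With the columns built later in this notebook, the ratio for a row would presumably be `split_from / split_to`.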

So let's find all the places in the news where a stock split was announced. To be more certain that an announcement is real, let's keep only the headlines that state the actual split ratio (e.g. two-for-one).

In [None]:
split_news = news_train_df[(news_train_df['headline'].str.contains('stock split'))  & \
                           (news_train_df['headline'].str.contains('-for-')) & \
                           (news_train_df['relevance'] >= 0.6)]
split_news.shape

In [None]:
split_news['date'] = pd.to_datetime(split_news.time.dt.date)

In [None]:
split_news[['date','headline']].head()

Each news item lists a set of asset codes in its `assetCodes` column. Let's reduce that set to a single asset code per announcement.

In [None]:
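import ast  # literal_eval parses the set literal without executing arbitrary code
assetCodes = split_news.assetCodes.astype(str).apply(ast.literal_eval).apply(sorted)  # sorted: deterministic order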

In [None]:
# eliminate irrelevant exchanges
exchanges = {'N', 'O', 'A'} # , 'OQ', 'P', 'OB'} # take only the top exchanges
assetCodes = assetCodes.apply(lambda assetCode: [c for c in assetCode if '.' in c and c.split('.')[1] in exchanges])

In [None]:
assetCodes.apply(len).value_counts()

In [None]:
# just take the first one
assetCodes = assetCodes.apply(lambda x: x[0])
assetCodes.head()
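
The parse, filter and pick-first steps above could also be folded into one function. The sketch below is one possible way to do that, assuming the raw value is a string holding a Python set literal. `pick_asset_code` and `TOP_EXCHANGES` are hypothetical names. Unlike `x[0]`, the function returns `None` instead of raising when no code survives the exchange filter.

```python
import ast

TOP_EXCHANGES = {'N', 'O', 'A'}  # same suffixes as the `exchanges` set above

def pick_asset_code(raw, exchanges=TOP_EXCHANGES):
    """Return one asset code from a set-literal string, or None if none qualifies."""
    codes = ast.literal_eval(raw)  # safe parsing, no arbitrary code execution
    keep = sorted(c for c in codes if '.' in c and c.split('.')[1] in exchanges)
    # Sorting makes the choice reproducible; sets have no stable order.
    return keep[0] if keep else None

print(pick_asset_code("{'AAPL.OQ', 'AAPL.O', 'AAPL.DE'}"))
print(pick_asset_code("{'XYZ.L'}"))
```

It could then be applied with `split_news.assetCodes.astype(str).apply(pick_asset_code)`, dropping the rows where the result is `None`.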

In [None]:
split_news.assetCodes = assetCodes

In [None]:
# remove duplicate news announcements 
split_news.drop_duplicates(subset=['date','assetCodes'], inplace=True)
split_news.shape

In [None]:
split_news[['date','assetCodes','headline']].head(10)

In [None]:
split_news[split_news.assetCodes=='AAPL.O']['headline'].iloc[0]

In [None]:
# make sure the word2number package is installed
# !pip install word2number

In [None]:
from word2number import w2n
split_from = split_news.headline.apply(lambda x: x.split('-for-')[0].rsplit(' ')[-1])
split_from = split_from.apply(w2n.word_to_num)

In [None]:
split_from.value_counts()

In [None]:
split_to = split_news.headline.apply(lambda x: x.split('-for-')[1].rsplit(' ')[0])
split_to = split_to.apply(w2n.word_to_num)
split_to.value_counts()
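
As an alternative to splitting the headline on `'-for-'` and on spaces, the ratio could be pulled out with a single regular expression. This is only a sketch: the `WORDS` table covers one to ten and stands in for `word2number`, and `parse_split_ratio` is a hypothetical helper. It matches case-insensitively, which the `str.contains('-for-')` filter above does not, and it returns `None` instead of raising when a token is not a number.

```python
import re

# Small lookup table standing in for word2number (assumption: ratios up to ten).
WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
         'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}
RATIO_RE = re.compile(r'([A-Za-z]+|\d+)-for-([A-Za-z]+|\d+)', re.IGNORECASE)

def parse_split_ratio(headline):
    """Return (split_from, split_to) as ints, or None if no ratio is found."""
    m = RATIO_RE.search(headline)
    if m is None:
        return None
    parts = []
    for token in m.groups():
        token = token.lower()
        if token.isdigit():
            parts.append(int(token))
        elif token in WORDS:
            parts.append(WORDS[token])
        else:
            return None  # e.g. "buy-for-less" is not a ratio
    return tuple(parts)

print(parse_split_ratio('APPLE ANNOUNCES SEVEN-FOR-ONE STOCK SPLIT'))
print(parse_split_ratio('Company X declares 3-for-2 stock split'))
```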

In [None]:
split_news['split_from'] = split_from
split_news['split_to'] = split_to

In [None]:
split_news[['date','assetCodes','split_from','split_to']].head(10)

In [None]:
# market_train_df[market_train_df.assetCode == 'ZOLL.O']
market_splits = list(set(split_news.assetCodes.unique()).intersection(set(market_train_df.assetCode.unique())))

In [None]:
split_news = split_news[split_news.assetCodes.isin(market_splits)]
split_news.shape

In [None]:
%%time
# flag the announcements whose asset has a market_train row on the announcement date
for idx, r in split_news.iterrows():
    if market_train_df[(market_train_df.assetCode == r.assetCodes) & (market_train_df.date == r.date)].shape[0] > 0:
        split_news.loc[idx,'traded'] = True
    else:
        split_news.loc[idx,'traded'] = False
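
The loop above filters `market_train_df` once per announcement, which is slow on a frame of this size. One possible speed-up, sketched here on plain tuples so that it runs on its own, is to build a set of `(assetCode, date)` keys once and test membership against it. `flag_traded` is a hypothetical helper and the sample rows are made up.

```python
def flag_traded(announcements, market_rows):
    """Return one boolean per announcement: was the asset traded on that date?

    Both arguments are iterables of (asset_code, date) pairs.
    """
    traded_keys = set(market_rows)  # built once, O(1) lookups afterwards
    return [key in traded_keys for key in announcements]

# Made-up sample data for illustration.
market_rows = [('AAPL.O', '2014-04-23'), ('AAPL.O', '2014-04-24'),
               ('MSFT.O', '2014-04-23')]
announcements = [('AAPL.O', '2014-04-23'), ('ZOLL.O', '2014-04-23')]
print(flag_traded(announcements, market_rows))
```

On the real frames the pairs could come from `zip(market_train_df.assetCode, market_train_df.date)` and `zip(split_news.assetCodes, split_news.date)`, with the resulting list assigned to `split_news['traded']`.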

In [None]:
split_news[split_news.traded == True].shape

In [None]:
split_news[['time','assetCodes','traded']].head()

In [None]:
market_split_news = split_news[split_news.traded == True]

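Find the date of the split. The returns columns are split-adjusted while `open` is not, so within the three months after each announcement we look for the first day on which the reported `returnsOpenPrevRaw1` is much smaller in magnitude than the return computed from the raw `open` prices.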

In [None]:
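threshold = 0.2 # a split is flagged when |adjusted return / raw open-to-open return| falls below this
for idx, r in market_split_news.iterrows():
    # look at the three months of trading that follow the announcement
    s = market_train_df[(market_train_df.assetCode == r.assetCodes) & 
                    (market_train_df.date > r.date) &
                    (market_train_df.date < r.date+pd.DateOffset(months=3))]
    s = s.set_index('date')
    found_it = False
    prevday = s.index[0] if len(s) > 0 else pd.NaT  # guard against assets with no later trading days
    for day, r2 in s.iloc[1:].iterrows():  # minus a day
        prevopen = s.iloc[s.index.get_loc(day)-1].open
        rawreturn = round((r2.open-prevopen)/prevopen,2)  # return implied by the unadjusted open prices
        # skip flat days (avoids a division by zero), then compare adjusted and raw returns
        if rawreturn != 0 and abs(round(r2.returnsOpenPrevRaw1,2) / rawreturn) < threshold :
            found_it = True
            break
        prevday = day
    if found_it:
        # prevday is the last trading day before the open price jumps
        market_split_news.loc[idx,'split_day'] = prevday
    else:
        market_split_news.loc[idx,'split_day'] = pd.NaT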

print('found the date for',market_split_news.split_day.notnull().sum(),'of',market_split_news.shape[0],'splits.')
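
For the announcements where no date is found, a stricter variant of this check might help. Instead of asking only whether the adjusted return is small relative to the raw one, the sketch below computes the split factor implied by the two and compares it with the announced ratio. Everything here is an assumption made for illustration: the function name, the `tolerance` parameter (a relative deviation, not the `threshold` used above) and the sample numbers. It returns the first session whose open reflects the split, whereas the loop above stores the preceding session in `split_day`.

```python
def find_split_session(opens, adj_returns, ratio, tolerance=0.2):
    """Return the index of the first session whose open reflects the split, or None.

    opens       : unadjusted open prices, oldest first
    adj_returns : split-adjusted open-to-open returns; adj_returns[i] is the
                  return from session i-1 to session i
    ratio       : announced new-shares-per-old-share ratio (7.0 for 7-for-1)
    """
    for i in range(1, len(opens)):
        if opens[i - 1] <= 0 or opens[i] <= 0:
            continue  # skip unusable prices
        raw_growth = opens[i] / opens[i - 1]   # what the raw prices say
        adj_growth = 1.0 + adj_returns[i]      # what the adjusted return says
        implied = adj_growth / raw_growth      # about 1 normally, about `ratio` on a split
        if abs(implied / ratio - 1.0) <= tolerance:
            return i
    return None

# Made-up prices: a 7-for-1 split takes effect at index 3.
opens = [640.0, 644.0, 651.0, 92.0, 94.0]
adj_returns = [0.0, 644 / 640 - 1, 651 / 644 - 1, (92 * 7) / 651 - 1, 94 / 92 - 1]
print(find_split_session(opens, adj_returns, ratio=7.0))
```

Matching on the announced ratio would also reject large price moves that are unrelated to the split, at the cost of missing splits whose announced ratio was parsed incorrectly.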

Visualize the date of the announcement (in green) and the date of the split (in red)

In [None]:
for idx, r in market_split_news.iterrows():
    print(r.headline, flush=True)
    s = market_train_df[(market_train_df.assetCode == r.assetCodes) & 
                    (market_train_df.date > r.date-pd.DateOffset(months=1)) &
                    (market_train_df.date < r.date+pd.DateOffset(months=3))]
    s = s.set_index('date')
    s.close.plot(figsize=(12,3))
    plt.title(r.assetCodes+' '+str(r.split_from)+'-for-'+str(r.split_to))
    plt.axvline(r.date,c='g')
    if pd.notnull(r.split_day):  # only draw the split line when a date was found
        plt.axvline(r.split_day,c='r')
    plt.show()

## Done 