In [3]:
!python -V
!pip install poetry && python -m poetry install --no-root


Python 3.8.3


# Stock market price prediction
## What is the stock market
The stock market is a place where people and companies can buy and sell shares of companies and commodities.
People go to the stock market hoping to invest their money where it will give them the best ROI (return on investment).
## How the stock market works
In the market, each item has a price, which is determined by supply and demand.
If more people want to buy a share of AAPL (Apple Inc.) than want to sell it, the price goes up; otherwise, the price goes down.
Someone who places a buy order hopes that the price will rise in the future, so that they can sell the stock and profit from the difference.
If Alice wants to sell an AAPL share for, say, at least 40\\$, she can only sell if there is a buyer, say Bob, who agrees to pay 40\\$ or more.
This typically happens when Alice believes that AAPL is **overpriced** and Bob believes that AAPL is **underpriced**.

## What is the problem?
If one could know for sure, at all times, whether a share is underpriced or overpriced, one could always profit in the stock market.
The stock price aims to reflect the value of the company: if AAPL has 4 million shares at 100 \\$ each,
then AAPL is worth 400 \\$ million.
The value of the stock is then theorised to be "the wisdom of the crowd", composed of the aggregated knowledge of all the shareholders and investors.
This knowledge may include hidden (inside knowledge) and public (cash flow, debts, yearly profits) parameters that can affect the value of the company.
One company can also be influenced by another: if Samsung's price goes up because of a new phone release, Apple's price might go down.
Knowing and taking all these parameters into account is not an easy task.
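
The valuation arithmetic from the example above can be written out explicitly. This is only a sketch using the illustrative figures from the text (4 million shares at 100 \\$ each), not real AAPL numbers; the variable names are made up for the illustration.

```python
# Figures from the illustrative example above, not real market data.
shares_outstanding = 4_000_000
price_per_share = 100.0

# The implied company value is the share count times the share price.
market_cap = shares_outstanding * price_per_share
print(f"Implied company value: {market_cap:,.0f} $")
```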

## Solutions
Naturally, when a lot of money is involved, a lot of people try to make sense of the stock market, and they have developed many ways to try to learn something about a stock's value before other investors do.
* **technical indicators** - computing indicators that should hint at where the stock is headed (moving averages, high-volume thresholds, etc.)
* **technical analysis** - looking for price patterns that are said to correlate with upward or downward movement in the price.
* **fundamental analysis** - analysing the company's quarterly reports and trying to make sense of its profits, debts, etc.
* **social media** - analysing social media sentiment about the company (Twitter, Facebook, news reports).
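
To make the first bullet concrete, here is a minimal sketch of two such indicators: a moving average and a high-volume flag. The prices, the 3-day window and the "twice the median" threshold are arbitrary assumptions chosen for the illustration, not values taken from the dataset or from any trading practice.

```python
import pandas as pd

# Toy closing prices and volumes (made-up numbers).
toy = pd.DataFrame({
    "close":  [10.0, 11.0, 12.0, 11.0, 13.0, 14.0],
    "volume": [100, 120, 90, 300, 110, 95],
})

# Simple moving average over the last 3 days (NaN until 3 values are available).
toy["ma_3"] = toy["close"].rolling(3).mean()

# Flag days whose volume exceeds twice the median volume (arbitrary threshold).
toy["high_volume"] = toy["volume"] > 2 * toy["volume"].median()

# A naive signal: the close is above its moving average.
toy["above_ma"] = toy["close"] > toy["ma_3"]
print(toy)
```

Comparisons against the initial NaN values evaluate to False, so the first two days never raise the `above_ma` signal.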





# Let's take a look at the data
The data is provided by https://www.kaggle.com/dgawlik/nyse?select=prices-split-adjusted.csv 
The dataset consists of the following files:

* **prices.csv**: raw, as-is daily prices. Most of the data spans from 2010 to the end of 2016; for companies that are newer on the stock market, the date range is shorter. There were approximately 140 stock splits in that period, and this file does not account for them.
* **prices-split-adjusted.csv**: the same as prices.csv, but adjusted for stock splits.
* **securities.csv**: a general description of each company, including its sector.
* **fundamentals.csv**: metrics extracted from annual SEC 10-K filings (2012-2016), which should be enough to derive most of the popular fundamental indicators.
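
To illustrate why split adjustment matters, the sketch below applies a hypothetical 2-for-1 split to made-up prices: closes before the split date are divided by the ratio and volumes are multiplied by it, so the series stays continuous. The dates, ratio and numbers are assumptions for the illustration; how the Kaggle file was actually adjusted is not documented here.

```python
import pandas as pd

# Made-up raw prices around a hypothetical 2-for-1 split on 2015-03-04.
raw = pd.DataFrame({
    "date": pd.to_datetime(["2015-03-02", "2015-03-03", "2015-03-04", "2015-03-05"]),
    "close": [200.0, 204.0, 103.0, 104.0],
    "volume": [10.0, 12.0, 25.0, 22.0],
})
split_date = pd.Timestamp("2015-03-04")
split_ratio = 2.0

# Rescale everything that happened before the split to post-split units.
before_split = raw["date"] < split_date
adjusted = raw.copy()
adjusted.loc[before_split, "close"] = raw.loc[before_split, "close"] / split_ratio
adjusted.loc[before_split, "volume"] = raw.loc[before_split, "volume"] * split_ratio
print(adjusted)
```

Without the adjustment, the drop from 204 to 103 would look like a crash even though nothing about the company's value changed.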

In [329]:
import pandas as pd
import plotly.express as px 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from IPython.display import display
init_notebook_mode(connected=True)

In [133]:
prices = pd.read_csv("./prices-split-adjusted.csv", parse_dates=["date"]) 
prices.head()

Unnamed: 0,date,symbol,open,close,low,high,volume
0,2016-01-05,WLTW,123.43,125.839996,122.309998,126.25,2163600.0
1,2016-01-06,WLTW,125.239998,119.980003,119.940002,125.540001,2386400.0
2,2016-01-07,WLTW,116.379997,114.949997,114.93,119.739998,2489500.0
3,2016-01-08,WLTW,115.480003,116.620003,113.5,117.440002,2006300.0
4,2016-01-11,WLTW,117.010002,114.970001,114.089996,117.330002,1408600.0


We see that we have multiple columns here:
* date - a single trading day
* symbol - the ticker, a shortened name for the company
* open - the opening price of the day
* close - the closing price of the day
* low - the lowest price of the day
* high - the highest price of the day
* volume - the number of shares traded during the day

Let's focus on a single symbol: AAPL
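
Given these column meanings, a quick sanity check is possible: on every row, low should not exceed open or close, and high should not be below them. The sketch below runs that check on the first two WLTW rows shown above; the helper names (`sample`, `is_consistent`) are made up for the illustration.

```python
import pandas as pd

# First two rows of the table above.
sample = pd.DataFrame({
    "open":  [123.43, 125.239998],
    "close": [125.839996, 119.980003],
    "low":   [122.309998, 119.940002],
    "high":  [126.25, 125.540001],
})

# The open/close pair must lie inside the [low, high] range of the day.
body_low = sample[["open", "close"]].min(axis=1)
body_high = sample[["open", "close"]].max(axis=1)
is_consistent = (sample["low"] <= body_low) & (body_high <= sample["high"])
print(is_consistent)
```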

In [134]:
df = prices[prices['symbol']=="AAPL"].drop(columns="symbol")
# Let's see how that looks now
df.head()

Unnamed: 0,date,open,close,low,high,volume
254,2010-01-04,30.49,30.572857,30.34,30.642857,123432400.0
721,2010-01-05,30.657143,30.625713,30.464285,30.798571,150476200.0
1189,2010-01-06,30.625713,30.138571,30.107143,30.747143,138040000.0
1657,2010-01-07,30.25,30.082857,29.864286,30.285715,119282800.0
2125,2010-01-08,30.042856,30.282858,29.865715,30.285715,111902700.0


# Stock movement plotting
In the stock market it is customary to plot prices using candlestick charts.

The candlestick is described in detail here: https://en.wikipedia.org/wiki/Candlestick_chart. In short:
* if close < open, the candlestick is red
* if close >= open, the candlestick is green
* the thick body spans the range between open and close
* the thin wicks extend from min(close, open) down to low and from max(close, open) up to high
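
The rules above can be sketched as a small function. `describe_candle` is a hypothetical helper written for this illustration (it is not part of plotly); the example values are the first AAPL row of the dataset.

```python
def describe_candle(open_, close, low, high):
    """Return the color, body and wicks of one candlestick, following the rules above."""
    color = "green" if close >= open_ else "red"
    body = (min(open_, close), max(open_, close))
    return {
        "color": color,
        "body": body,                   # thick part: between open and close
        "lower_wick": (low, body[0]),   # thin part below the body
        "upper_wick": (body[1], high),  # thin part above the body
    }

# AAPL on 2010-01-04: the close is above the open, so the candle is green.
candle = describe_candle(open_=30.49, close=30.572857, low=30.34, high=30.642857)
print(candle)
```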


In [348]:
def get_candlestick(df):
    return go.Candlestick(
        open=df.open,
        close=df.close,
        low=df.low,
        high=df.high,
        x=df.date,
    )
def configure_fig(fig):
    fig.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
    return fig

fig = make_subplots(cols=2, rows=2)
df = prices[prices.symbol == "AAPL"]
fig.add_trace(get_candlestick(df), row=1, col=1)
fig.add_trace(go.Scatter(x=df.date, y=df.volume), row=2, col=1)
df = prices[prices.symbol == "GOOGL"]
fig.add_trace(get_candlestick(df), row=1, col=2)
fig.add_trace(go.Scatter(x=df.date, y=df.volume, opacity=1), row=2, col=2)
configure_fig(fig)

In [202]:
# Let's plot GOOGL and AAPL together, but this time using only the close price
fig = px.line(data_frame=prices[prices.symbol.str.match('GOOGL|AAPL')],
             y='close',
             line_group='symbol',
              color="symbol",
             x='date')
fig

In [203]:
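# Let's show some statistics
def print_stats(df):
    # Note: max - min is the monthly price range, which is never negative,
    # so every month is counted as an "increase" in the output below.
    month_change = df.groupby(df.date.dt.to_period("M")).close.agg(lambda i: i.max()-i.min())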
    percent = lambda pred, total: round(100*len(total[pred])/len(total), 2)
    total_profit = df.close.iloc[-1] - df.close.iloc[0]
    print(f"""
    On {len(df[df.close > df.open])}/{len(df)} ({percent(df.close > df.open, df)}%) days, the price increased.
    On {len(month_change[month_change>0])}/{len(month_change)} ({percent(month_change>0, month_change)}%) month the prices increased
    if you would buy 100 shares at {df.date.min().date()}, you would have made {100*round(total_profit, 2)}$ 
    If you only had 2000$ to invest, you would profit since {df.date.min().date()} around {round((2000//df.close.iloc[0])*total_profit, 2)} $ 
    """)
print("AAPL")    
print_stats(prices[prices.symbol == "AAPL"])
print("GOOG")
print_stats(prices[prices.symbol == "GOOGL"])


AAPL

    On 894/1762 (50.74%) days, the price increased.
    On 84/84 (100.0%) month the prices increased
    if you would buy 100 shares at 2010-01-04, you would have made 8525.0$ 
    If you only had 2000$ to invest, you would profit since 2010-01-04 around 5541.06 $ 
    
GOOG

    On 861/1762 (48.86%) days, the price increased.
    On 84/84 (100.0%) month the prices increased
    if you would buy 100 shares at 2010-01-04, you would have made 47876.0$ 
    If you only had 2000$ to invest, you would profit since 2010-01-04 around 2872.57 $ 
    


##### As we can see, by comparing these two industry giants, it is very important to know where to invest our money.
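
The month-level figure above is computed as the maximum minus the minimum close within each month, which cannot be negative, so every month counts as an increase. One possible alternative, sketched here on made-up prices, compares the first and last close of each month instead. It assumes the rows are sorted by date.

```python
import pandas as pd

# Made-up closes: January ends higher than it starts, February ends lower.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-04", "2010-01-15", "2010-01-29",
                            "2010-02-01", "2010-02-12", "2010-02-26"]),
    "close": [30.0, 32.0, 31.0, 31.0, 29.0, 28.0],
})
by_month = toy.groupby(toy["date"].dt.to_period("M"))["close"]

# Range within the month: always >= 0.
monthly_range = by_month.agg(lambda c: c.max() - c.min())
# Signed change over the month: last close minus first close.
monthly_change = by_month.agg(lambda c: c.iloc[-1] - c.iloc[0])

print(monthly_range)
print(monthly_change)
print(f"{(monthly_change > 0).sum()}/{len(monthly_change)} months closed higher")
```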

In [224]:
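# Describe (select the numeric columns explicitly so describe() and corr() skip date and symbol)
numeric_cols = ["open", "close", "low", "high", "volume"]
print("AAPL:")
df = prices.loc[prices.symbol == "AAPL", numeric_cols]
display(df.describe().astype(int))
display(df.corr())
print("GOOGL")
df = prices.loc[prices.symbol == "GOOGL", numeric_cols]
display(df.describe().astype(int))
display(df.corr())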




AAPL:


Unnamed: 0,open,close,low,high,volume
count,1762,1762,1762,1762,1762
mean,79,79,78,80,94225775
std,28,28,28,28,60205187
min,27,27,27,28,11475900
25%,55,55,54,55,49174775
50%,78,78,77,79,80503850
75%,102,103,102,104,121081625
max,134,133,131,134,470249500


Unnamed: 0,open,close,low,high,volume
open,1.0,0.999254,0.999605,0.999673,-0.582824
close,0.999254,1.0,0.999657,0.99966,-0.585669
low,0.999605,0.999657,1.0,0.999511,-0.591664
high,0.999673,0.99966,0.999511,1.0,-0.578681
volume,-0.582824,-0.585669,-0.591664,-0.578681,1.0


GOOGL


Unnamed: 0,open,close,low,high,volume
count,1762,1762,1762,1762,1762
mean,467,467,463,471,4096042
std,181,181,179,182,2884423
min,219,218,217,221,520600
25%,299,299,297,302,2004075
50%,438,438,436,440,3670550
75%,587,587,583,591,5171750
max,838,835,829,839,29619900


Unnamed: 0,open,close,low,high,volume
open,1.0,0.999521,0.999711,0.99982,-0.566256
close,0.999521,1.0,0.999811,0.999757,-0.568819
low,0.999711,0.999811,1.0,0.999708,-0.572099
high,0.99982,0.999757,0.999708,1.0,-0.563879
volume,-0.566256,-0.568819,-0.572099,-0.563879,1.0


We can see that volume has a negative correlation with the prices (open, low, high, close).

This might make sense: the more expensive a share is, the fewer people can afford to trade it.
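
For reference, the coefficients in the tables above are Pearson correlations, which is the default method of pandas' `corr`. The sketch below recomputes one by hand on made-up numbers to show what the value measures; the data is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up series in which volume tends to fall as the price rises.
toy = pd.DataFrame({
    "close":  [10.0, 12.0, 14.0, 16.0, 18.0],
    "volume": [500.0, 420.0, 450.0, 300.0, 280.0],
})

# pandas computes the Pearson coefficient by default.
r_pandas = toy["close"].corr(toy["volume"])

# The same value by hand: covariance divided by the product of the spreads.
x = toy["close"] - toy["close"].mean()
y = toy["volume"] - toy["volume"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```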



In [None]:
# Let's look at the volume*price product. It might tell us roughly how much money changes hands in a day.

In [321]:
df = prices[prices.symbol.str.match("AAPL|GOOGL|FB$")]
display(px.line(data_frame=df,x="date", y=df.close*df.volume, 
       line_group="symbol", color="symbol", title="Close*Volume value"))



We can see that the volume is very different between stocks and tends to decline over time.
We can also see that close*volume is fairly similar across the different stocks, although here we are only looking at large, well-known companies.

# Pre Processing
We see from the plots that the value ranges of the different stocks are very different, so we want to scale them so that we can work with them together.

**note**: This is only preprocessing for initial research. When training a model, we need to fit the scaler on the training set only, and then apply it to the test set.
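
A minimal sketch of what the note means in practice, using scikit-learn's `MinMaxScaler` on made-up closing prices. The split position and the numbers are assumptions for the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up closing prices, in time order, as a single feature column.
closes = np.array([[10.0], [12.0], [11.0], [15.0], [20.0], [18.0]])
split = 4  # the first 4 days are "the past" (train), the rest is "the future" (test)
train, test = closes[:split], closes[split:]

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min and max are learned from the training data only
test_scaled = scaler.transform(test)        # the test data reuses those values

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled test values can fall outside [0, 1] because the test prices exceed the training maximum. That is expected: it reflects information the model could not have had at training time.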

In [468]:
from sklearn.preprocessing import minmax_scale
from datetime import timedelta
import numpy as np


In [470]:
data = prices.set_index(["symbol", "date"])
# Scale each column to [0, 1] separately for every symbol
data = data.groupby("symbol").transform(minmax_scale)
data.reset_index(inplace=True)

In [325]:
df = data[data.symbol.str.match("AAPL|GOOGL|FB$")]

display(px.line(data_frame=df,x="date", y="close", 
       line_group="symbol", color="symbol", title="Close price"))
timerange = (data.iloc[-1].date.date()-timedelta(days=60), data.iloc[-1].date.date())  # (start, end) of the last 60 days
display(px.bar(df,x="date", y="volume", color="symbol", range_x=timerange,barmode='overlay', title="Volumes"))
display(px.line(data_frame=df,x="date", y=df.close*df.volume, 
       line_group="symbol", color="symbol", title="Close*Volume value"))

In [326]:
timerange = (data.iloc[-1].date.date()-timedelta(days=60), data.iloc[-1].date.date())  # (start, end) of the last 60 days
px.bar(df,x="date", y="volume", color="symbol", range_x=timerange,barmode='overlay')

In [327]:
px.line(data_frame=df,x="date", y="close", 
       line_group="symbol", color="symbol")


We can see that we managed to normalize the stocks: they are all roughly on the same scale now.
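
What the per-symbol scaling does can be spelled out with the min-max formula, (x - min) / (max - min), applied within each symbol. This is a sketch on made-up data with hypothetical tickers `AAA` and `BBB`; `min_max` is a helper defined only for this illustration.

```python
import pandas as pd

# Two made-up symbols trading at very different price levels.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "close":  [10.0, 15.0, 20.0, 400.0, 500.0, 450.0],
})

def min_max(series):
    """Map a series linearly onto [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# transform() applies the function per group and keeps the original row order.
toy["close_scaled"] = toy.groupby("symbol")["close"].transform(min_max)
print(toy)
```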

In [400]:
# Let's see how the stocks are correlated with each other.
# We'll take a subsample since there are over 500 stocks.
### TODO: UGLY PANDAS CODE
df = data[data.symbol.str.match('GOOGL|AAPL|FB$|TWTR|MSFT|AMZN|JNJ|JPM')]
close_diff = df.set_index(["symbol", "date"]).close.transform(lambda f: np.append(0,f.values[1:]-f.values[:-1]))
corr_matrix = df.assign(close_diff=close_diff)[["symbol","date", "close_diff"]].pivot(index="date", columns="symbol", values="close_diff").corr()
go.Figure(go.Heatmap(z=corr_matrix, x=corr_matrix.keys(), y=corr_matrix.keys(),colorscale='Reds'))



We can see, for example, that Amazon (AMZN) and Apple (AAPL) have a negative Pearson correlation of -0.82.
Facebook (FB) and Google (GOOGL) are also negatively correlated (-0.7). However, Microsoft (MSFT) and Facebook (FB) have a positive correlation of 0.5.
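
One thing to watch for in the cell above: the differences are taken between consecutive rows of the filtered frame. If the rows are ordered by date rather than by symbol (as the AAPL row indices 254, 721, 1189 suggest), consecutive rows belong to different companies. A per-symbol day-over-day change may be closer to the intent. The sketch below shows that variant on made-up data with hypothetical tickers; it is an assumption about the intent, not a re-run of the analysis.

```python
import pandas as pd

# Made-up data ordered by date, then symbol (the same layout as the prices file).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04"] * 2 + ["2016-01-05"] * 2 +
                           ["2016-01-06"] * 2 + ["2016-01-07"] * 2),
    "symbol": ["AAA", "BBB"] * 4,
    "close": [1.0, 5.0, 2.0, 7.0, 1.5, 6.0, 3.0, 9.0],
})

# Day-over-day change computed within each symbol.
toy = toy.sort_values(["symbol", "date"])
toy["close_diff"] = toy.groupby("symbol")["close"].diff()

# One column per symbol, one row per date, then pairwise Pearson correlation.
wide = toy.pivot(index="date", columns="symbol", values="close_diff")
corr_matrix = wide.corr()
print(corr_matrix)
```

In this toy data the daily moves of `BBB` are exactly twice those of `AAA`, so the off-diagonal correlation comes out as 1.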

# Base model
Let's define the task better.
We want to take the last few days (a constant window) and predict the price for the next day (or the next few days).
We should define a baseline (to measure our improvement against) and a loss function.
I chose these arbitrarily:
1. Baseline: a linear regression model.
2. Loss function: mean squared error.
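
The task definition can be sketched as follows: slice the series into fixed-length windows, use each window as the input, and use the value that follows it as the target. `make_windows` and the window length of 3 are assumptions for the illustration. The prediction used here (the mean of the window) is only a placeholder for computing a mean squared error, not the model used in this notebook.

```python
import numpy as np

def make_windows(series, window):
    """Return (X, y): each row of X holds `window` consecutive values, y is the value that follows."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Made-up scaled closing prices.
closes = np.array([0.50, 0.52, 0.51, 0.55, 0.58, 0.57])
X, y = make_windows(closes, window=3)

# Placeholder prediction: the mean of each window.
y_pred = X.mean(axis=1)

# Mean squared error: the average of the squared differences.
mse = float(np.mean((y - y_pred) ** 2))
print(X.shape, y, mse)
```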

In [475]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.metrics import mean_squared_error

In [481]:
df = data[data.symbol == "AAPL"].reset_index(drop=True).drop(columns=["symbol", "date"])
## Add some features
df = df.assign(
    # ma = moving average
    ma_5 = df.close.rolling(5).mean(),
    ma_15 = df.close.rolling(15).mean(),
    ma_30 = df.close.rolling(30).mean(),
    std = df.close.rolling(30).std(),
    var = df.close.rolling(30).var(),
    top_diff_5d = df.close - df.close.rolling(5).max(),
    bottom_diff_5d = df.close - df.close.rolling(5).min(),
    # Note: pandas aligns the two slices on their index labels, so this column is
    # 0 wherever both slices share a label and NaN at the two ends.
    diff_1d = df.close[1:] - df.close[:-1]
)
X, y = df[30:-1], df[31:].close
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)  # Don't shuffle, don't mix past and future.

model = LinearRegression()
model.fit(X_train, y_train)
# model.fit(X, y)
# model.coef_
model.score(X_test, y_test)  # R^2 on the held-out (later) part of the series

0.9736809709519347
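
A note on the `diff_1d` feature in the cell above: subtracting two slices of the same pandas Series aligns them on index labels before subtracting, so each value is subtracted from itself. The sketch below shows this on made-up numbers and contrasts it with `Series.diff()`, which may be what was intended.

```python
import pandas as pd

close = pd.Series([10.0, 12.0, 11.0, 15.0])

# Label alignment: rows 1 and 2 appear in both slices and are subtracted from themselves.
aligned = close.iloc[1:] - close.iloc[:-1]
print(aligned.tolist())

# Day-over-day change: each value minus the previous one.
daily_change = close.diff()
print(daily_change.tolist())
```

The first print shows zeros with NaN at both ends, while the second shows the actual daily changes with a single leading NaN.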

In [487]:
go.Figure([go.Scatter(x=X_test.index, y=y_test),
          go.Scatter(x=X_test.index, y=model.predict(X_test))])
# If we zoom in a bit, we can see that the predictions lag behind the actual values (they look "delayed").
# Maybe the loss function is not a good fit for this case.

0.00026398068279580723
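
One way to check the "delay" impression numerically is to compare the predictions both with the actual values of the same day and with the actual values of the previous day. If the error against the previous day is much smaller, the model is mostly echoing yesterday's price. The arrays below are made up to show the extreme case; `mse` is a small helper defined for the sketch.

```python
import numpy as np

# Made-up actual values and predictions that simply repeat the previous actual value.
y_true = np.array([0.50, 0.52, 0.51, 0.55, 0.58, 0.57])
y_pred = np.array([0.49, 0.50, 0.52, 0.51, 0.55, 0.58])

def mse(a, b):
    """Mean squared error between two equally long arrays."""
    return float(np.mean((a - b) ** 2))

# Error against the value the model was supposed to predict.
mse_vs_today = mse(y_pred[1:], y_true[1:])
# Error against the previous day's actual value.
mse_vs_yesterday = mse(y_pred[1:], y_true[:-1])

print(mse_vs_today, mse_vs_yesterday)
```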