In [18]:
!python -V
# !pip install poetry && python -m poetry install --no-root

Python 3.8.3
Using version ^0.12.0 for statsmodels

Updating dependencies
Resolving dependencies... (0.2s)

Writing lock file

No dependencies to install or update



# Stock market price prediction
## What is the stock market
The stock market is a place where people and companies can buy and sell shares of companies and commodities.
People come to the stock market hoping to invest their money where it will give them the best ROI (return on investment).
## How the stock market works
In the market, each item has a price, which is set by supply and demand.
If more people want to buy a share of AAPL (Apple Inc.) than want to sell it, the price goes up; otherwise, the price goes down.
A buyer (placing a buy order) hopes that the price will go up in the future, so that they can sell the stock and profit from the difference.
If Alice wants to sell an AAPL share for at least \$40, she can only sell if there is a buyer, let's say Bob, who agrees to pay \$40 or more.
This typically happens when Alice believes that AAPL is **overpriced** and Bob believes that AAPL is **underpriced**.

## What is the problem?
If one could know for sure, at all times, whether a share is underpriced or overpriced, one could always profit in the stock market.
The stock price aims to reflect the value of the company: if AAPL has 4 million shares at \$100 each,
then AAPL is worth \$400 million.
The value of the stock is therefore theorized to be "the wisdom of the crowd", composed of the aggregated knowledge of all the shareholders and investors.
This knowledge may include hidden (inside knowledge) and public (cash flow, debts, yearly profits) parameters that can affect the value of the company.
One company can also be influenced by another: if Samsung's price goes up because of a new phone release, Apple's price might go down.
Knowing all these parameters and taking them into account is not an easy task.

## Solutions
Naturally, when a lot of money is involved, a lot of people try to make sense of the stock market, and they have developed many ways to gain a little knowledge about a stock's value before all the other investors do.
* **Technical indicators** - creating indicators that should hint at where the stock is going (moving averages, high-volume thresholds, etc.).
* **Technical analysis** - trying to find patterns that are said to correlate with up/down movement in the price.
* **Fundamental analysis** - analysing the companies' quarterly reports, trying to make sense of the profits, debts, etc.
* **Social media** - analysing social media sentiment about the company (Twitter, Facebook, news reports).
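
As a rough illustration of the first bullet, the sketch below computes a moving average and a high-volume flag with pandas. The toy numbers, the 3-day window and the "twice the median" threshold are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Toy closing prices and volumes; the values are made up for illustration.
toy = pd.DataFrame({
    "close":  [10.0, 11.0, 12.0, 11.0, 13.0, 14.0],
    "volume": [100, 120, 90, 400, 110, 105],
})

# 3-day simple moving average of the close price (NaN until 3 values exist).
toy["ma_3"] = toy["close"].rolling(3).mean()

# Flag days whose volume is more than twice the median volume.
toy["high_volume"] = toy["volume"] > 2 * toy["volume"].median()

print(toy)
```

Only the fourth day is flagged here, since its volume (400) is the only one above twice the median.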





# Contents
1. [Data set](#Data-set)
  1. [Close price and volume](#Close-price-and-volume)
  2. [Close-Volume correlation](#Close-Volume-correlation)
  3. [Profits stats](#Profits-stats)
  4. [Summary](#Summary)
2. [Pre processing](#Pre-Processing)
  1. [Scaling the data](#Scaling-the-data)
  2. [Stationary time series](#Stationary-time-series)


# Data set 
The data is provided by https://www.kaggle.com/dgawlik/nyse?select=prices-split-adjusted.csv 
The dataset consists of the following files:

1. **prices.csv**: raw, as-is daily prices. Most of the data spans from 2010 to the end of 2016; for companies that are newer on the stock market the date range is shorter. There were approximately 140 stock splits in that time, and this set doesn't account for them.
2. **prices-split-adjusted.csv**: same as prices, but adjusted for splits.
3. **securities.csv**: general description of each company, with a division into sectors.
4. **fundamentals.csv**: metrics extracted from annual SEC 10-K filings (2012-2016), which should be enough to derive most of the popular fundamental indicators.

In [10]:
import pandas as pd
import plotly.express as px 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from IPython.display import display
import seaborn as sns
import numpy as np
init_notebook_mode(connected=True)

In [3]:
prices = pd.read_csv("./prices-split-adjusted.csv", parse_dates=["date"],) 
prices.head()

Unnamed: 0,date,symbol,open,close,low,high,volume
0,2016-01-05,WLTW,123.43,125.839996,122.309998,126.25,2163600.0
1,2016-01-06,WLTW,125.239998,119.980003,119.940002,125.540001,2386400.0
2,2016-01-07,WLTW,116.379997,114.949997,114.93,119.739998,2489500.0
3,2016-01-08,WLTW,115.480003,116.620003,113.5,117.440002,2006300.0
4,2016-01-11,WLTW,117.010002,114.970001,114.089996,117.330002,1408600.0


We see that we have multiple columns here:
* date - a single trading day
* symbol - the ticker, the shortened name of the company
* open - the opening (first) price of the day
* close - the closing (last) price of the day
* low - the lowest price of the day
* high - the highest price of the day
* volume - the number of shares traded in a day

Let's focus on a single symbol: AAPL

In [4]:
df = prices[prices['symbol']=="AAPL"].drop(columns="symbol")
# Let's see how that looks now
df.head()

Unnamed: 0,date,open,close,low,high,volume
254,2010-01-04,30.49,30.572857,30.34,30.642857,123432400.0
721,2010-01-05,30.657143,30.625713,30.464285,30.798571,150476200.0
1189,2010-01-06,30.625713,30.138571,30.107143,30.747143,138040000.0
1657,2010-01-07,30.25,30.082857,29.864286,30.285715,119282800.0
2125,2010-01-08,30.042856,30.282858,29.865715,30.285715,111902700.0



We want to extract extra information from each ticker's data.

We'll focus on the volume and on the `close` price, since the latter is usually close to the `open`/`low`/`high` prices.


In [5]:
# Basic utilities to get the data for certain symbols and plot it
def prices_df(symbol):
    """Return the rows whose symbol matches the regular expression `symbol` (anchored at the start)."""
    return prices[prices.symbol.str.match(symbol)]

def plot_close(df):
    return px.line(data_frame=df,
                   x='date', y='close',
                   line_group='symbol', color="symbol",
                   hover_data=["open", "high", "low", "volume"])

def plot_volume(df):
    return px.bar(df, x="date", y="volume", color="symbol",
                  barmode='group')
    

## Close price and volume
We'll plot the close price and volume for a sample of symbols.
Volume is summed per quarter to make the graph easier to read.

In [6]:
df = prices_df('GOOGL|AAPL|^FB$|MSFT')
plot_close(df).show()
df = df.set_index("date")

# Group by calendar quarter (1-4) within each year, and by symbol
df = df.groupby([df.index.year.astype(str)+'_'+df.index.quarter.astype(str), df.symbol]).sum()
plot_volume(df.reset_index())
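
Quarterly buckets can also be expressed with pandas periods. The following is a minimal sketch on made-up data (the `toy` frame and the `quarter` column name are illustrative assumptions), showing that each date falls into a calendar quarter such as `2016Q1`:

```python
import pandas as pd

# Toy data: four trading days for one symbol (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-15", "2016-03-31", "2016-04-01", "2016-12-30"]),
    "symbol": ["AAPL"] * 4,
    "volume": [10, 20, 30, 40],
})

# Map every date to its calendar quarter, e.g. "2016Q1".
quarter = toy["date"].dt.to_period("Q").astype(str).rename("quarter")

# Sum the volume per (quarter, symbol).
quarterly = toy.groupby([quarter, "symbol"])["volume"].sum().reset_index()
print(quarterly)
```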

## Close-Volume correlation
We can see that the volume tends to decrease over time, while the price goes up.
Let's see if the price is correlated with the volume.


In [7]:
# Correlation between close and volume, computed separately for each symbol
corr = pd.pivot_table(prices_df(".*").groupby("symbol").corr()['close'].reset_index(),
               values="close",
               columns=["level_1"], 
               index="symbol").volume
px.histogram(x=corr.index, y=corr)

We can see that for most of the stocks, close has a negative correlation with volume.
This might make sense: the more expensive the share, the fewer people can afford to trade it.
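
The per-symbol correlation can be illustrated on a tiny example. This is only a sketch with made-up numbers: symbol `X` is constructed so that volume falls as the price rises, and symbol `Y` so that both rise together.

```python
import pandas as pd

# Toy data for two hypothetical symbols (values are made up).
toy = pd.DataFrame({
    "symbol": ["X"] * 4 + ["Y"] * 4,
    "close":  [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "volume": [40.0, 30.0, 20.0, 10.0, 10.0, 20.0, 30.0, 40.0],
})

# Pearson correlation between close and volume, one value per symbol.
close_volume_corr = {
    symbol: group["close"].corr(group["volume"])
    for symbol, group in toy.groupby("symbol")
}
print(close_volume_corr)
```

A negative value (as for `X`) is the pattern described above for most of the real stocks.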

### Close*Volume product
We saw earlier that close and volume are negatively correlated.
Their product might tell us the amount of money changing hands in a day.

In [8]:
df = prices_df("AAPL|GOOGL|FB$")
px.line(data_frame=df,x="date", y=df.close*df.volume, 
        line_group="symbol", color="symbol", title="Close*Volume value")


We can see that the volume is very different between stocks, and that it tends to decline.   
We would expect the volume*close product to tell us something about the value of the company,  
but we can see that it changes by *billions* of dollars. There are some "anomalies",  
extremely high values of volume*close, which might hint at something happening in the company.

## Cross company correlation
We want to see if some companies are correlated with each other.  
Do Apple's prices influence Google's price?  
We'll check that by taking the difference between each day's close and the previous day's,  
and then finding the correlation between these values for different companies.

In [11]:
df = prices_df('GOOGL|AAPL|FB$|TWTR|MSFT|AMZN|JNJ|JPM')

close_diff = df.set_index(["symbol", "date"]).close.transform(
    lambda f: np.append(0, f.values[1:]-f.values[:-1])
)
corr_matrix = df.assign(close_diff=close_diff)[["symbol","date", "close_diff"]].pivot(index="date", columns="symbol", values="close_diff").corr()
go.Figure(go.Heatmap(z=corr_matrix, x=corr_matrix.keys(), y=corr_matrix.keys(),colorscale='Reds'))

We notice a few strong (Pearson) correlations:
1. Amazon (`AMZN`) and Apple (`AAPL`) have a negative correlation of -0.82.  
2. Facebook (`FB`) and Google (`GOOGL`) have a negative correlation of -0.7.
3. Microsoft (`MSFT`) and Facebook (`FB`) have a positive correlation of 0.5.  

It seems that if we want to predict one stock's value, we might need to use data from multiple companies.
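
One way to make sure a day-over-day difference never spans two tickers is to take it inside a `groupby`. The sketch below uses made-up prices for two hypothetical symbols and is an alternative formulation, not the code that produced the numbers above.

```python
import pandas as pd

# Toy data: two hypothetical symbols over four days (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07"] * 2),
    "symbol": ["X"] * 4 + ["Y"] * 4,
    "close": [10.0, 11.0, 10.5, 12.0, 50.0, 49.0, 49.5, 48.0],
})

# Daily change within each symbol; the first day of every symbol is NaN.
toy = toy.sort_values(["symbol", "date"])
toy["close_diff"] = toy.groupby("symbol")["close"].diff()

# One column per symbol, one row per date, then pairwise Pearson correlation.
wide = toy.pivot(index="date", columns="symbol", values="close_diff")
corr_matrix = wide.corr()
print(corr_matrix)
```

In this toy data `Y` always moves opposite to `X`, so their correlation is -1.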

## Profits stats
I want to check how much stocks tend to go up, and how many positive (value increased) days and months they have.

In [12]:
def get_stats(df):
    month_change = df.groupby(df.date.dt.to_period("M")).close.agg(lambda i: i.values[-1]-i.values[0])
    percent = lambda pred, total: round(100*len(total[pred])/len(total), 2)
    profit_per_share = df.close.iloc[-1] - df.close.iloc[0]
    return pd.DataFrame.from_records([{
        "trading_days": len(df),
        "positive_days": len(df[df.close > df.open]),
        "positive_days_perc": percent(df.close > df.open, df),
        "trading_month": len(month_change),
        "positive_month": len(month_change[month_change>0]),
        "positive_month_percent": percent(month_change>0, month_change),
        "profit_100_shares": 100*profit_per_share,
        "profit_2000_usd": (2000//df.close.iloc[0])*profit_per_share
    }])

stats = prices_df(".*").groupby("symbol").apply(get_stats)

display(stats)
display(stats.describe())

symbol,,trading_days,positive_days,positive_days_perc,trading_month,positive_month,positive_month_percent,profit_100_shares,profit_2000_usd
A,0,1762,917,52.04,84,46,54.76,2317.087296,2062.207693
AAL,0,1762,851,48.30,84,47,55.95,4191.999900,17564.479581
AAP,0,1762,904,51.31,84,49,58.33,12873.999400,6308.259706
AAPL,0,1762,894,50.74,84,53,63.10,8524.714314,5541.064304
ABBV,0,1008,547,54.27,48,28,58.33,2750.000000,1540.000000
...,...,...,...,...,...,...,...,...,...
YHOO,0,1762,843,47.84,84,46,54.76,2156.999800,2502.119768
YUM,0,1762,907,51.48,84,50,59.52,3810.354549,3010.180094
ZBH,0,1762,941,53.41,84,47,55.95,4317.999700,1424.939901
ZION,0,1762,905,51.36,84,49,58.33,2971.000100,4456.500150


Unnamed: 0,trading_days,positive_days,positive_days_perc,trading_month,positive_month,positive_month_percent,profit_100_shares,profit_2000_usd
count,501.0,501.0,501.0,501.0,501.0,501.0,501.0,501.0
mean,1699.129741,872.01996,51.289222,81.011976,46.327345,56.970818,5005.252023,2898.077214
std,253.731238,134.939867,1.741511,12.06714,8.356113,5.630239,8564.185986,3404.724083
min,126.0,63.0,43.65,6.0,2.0,27.91,-10337.0007,-1656.919977
25%,1762.0,882.0,50.28,84.0,45.0,53.57,1608.0001,1019.159829
50%,1762.0,904.0,51.42,84.0,48.0,57.14,3343.9998,2216.180174
75%,1762.0,924.0,52.5,84.0,51.0,60.71,5972.9999,3691.12
max,1762.0,981.0,55.68,84.0,61.0,72.62,124210.0052,30317.76082


## Stats points
We can see that there is a big difference between different stocks.
1. For the median stock, about 51% of the trading days are positive.
2. For the median stock, about 57% of the months are positive.
3. If one had invested \$2K in a single stock at the start of its data (2010 for most stocks), by the end of 2016 one would have made, on average, about \$2.9K profit.
    1. However, depending on the choice of stock, one could either lose money (up to about \$1.7K) or make a huge profit of up to about \$30K!
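
To make the `profit_2000_usd` column concrete, here is the arithmetic for AAPL, using the first close from the `df.head()` output and the `profit_100_shares` value from the stats table. It follows the same formula as `get_stats`: buy as many whole shares as \$2000 allows at the first close, and hold them until the last day.

```python
first_close = 30.572857            # AAPL close on 2010-01-04 (first row of the AAPL data)
profit_100_shares = 8524.714314    # AAPL row of the stats table

profit_per_share = profit_100_shares / 100   # last close minus first close
shares_bought = 2000 // first_close          # whole shares affordable with $2000
profit_2000_usd = shares_bought * profit_per_share

print(shares_bought, round(profit_2000_usd, 2))  # 65.0 5541.06
```

The result agrees with the `profit_2000_usd` value in the AAPL row of the table.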

## Summary
1. Prices and volumes are very different between stocks.
2. Most of the time, most of the stocks increase in value.
3. Volume is **negatively** correlated with price for most stocks: when volume is higher, the price tends to be lower.
4. The volume*close product (the money traded in a day) swings by billions of dollars and shows occasional extreme values.
5. Every 3 months (quarter) the companies report on their business. Usually the price is more volatile on these days.   
6. Some companies are correlated with each other.


# Pre Processing
Each stock looks different from the other stocks. If we want to build a unified model, we need to scale the values to a similar range.
To do that, we will start with min-max scaling for each symbol. It's the simplest transformation that doesn't change the shape of the data.

1. Scale within each symbol using min-max scaling.
2. Replace the `date` field with `day` = days since the first data point (global).


**Note**: This is only preprocessing for initial research; when training a model we need to fit everything on the training set only.
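
A minimal sketch of what fitting the scaling on the earlier (training) part only could look like, using plain NumPy and made-up numbers. The split point and the `scale` helper are illustrative assumptions:

```python
import numpy as np

# Toy price series, split chronologically (values are made up).
series = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 20.0, 19.0])
split = 6
train, test = series[:split], series[split:]

# The min and max come from the training part only.
lo, hi = train.min(), train.max()

def scale(x):
    """Min-max scale x with the range learned from the training data."""
    return (x - lo) / (hi - lo)

train_scaled = scale(train)
test_scaled = scale(test)
print(test_scaled)
```

Later values can fall outside [0, 1] (here both do), which is expected when new prices exceed the range seen during training.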

In [14]:
from sklearn.preprocessing import minmax_scale
from datetime import timedelta
import numpy as np


## Scaling the data

In [124]:
def minmax(df):
    """MinMax scaling the dataframe"""
    df = df.set_index(["symbol", "date"])
    
    # Scale by symbol
    df = df.groupby("symbol").transform(minmax_scale)
    df = df.reset_index()
    return df

def reset_days(df):
    """Convert date to days since the (global) first date; easier to use than actual dates."""
    days_since_start = (df.date - df.date.min()).dt.days
    df = df.assign(day=days_since_start).drop(columns="date")
    df = df.sort_values("day")
    return df

pre_processed_1 = reset_days(minmax(prices))

def stock_df(symbol):
    return pre_processed_1[pre_processed_1.symbol.str.match(symbol)]

# Redefine plot_close and plot_volume, since we are not using the raw data anymore.
# Overriding
def plot_close(df):
    return px.line(data_frame=df,
             y='close',
             line_group='symbol', color="symbol",
            hover_data=["open", "high","low", "volume"],
             x='day', title="Close price")

# Overriding
def plot_volume(df, days=90):
    df = df.drop(columns="day").groupby([df.day // days, "symbol"]).sum().reset_index()
    return px.bar(df, x="day", y="volume", color="symbol", 
                  barmode='group', title="Volume")

Plot the close prices and Volume to see that prices are in the same range.

In [125]:
df = stock_df("AAPL|GOOGL|FB$")
plot_close(df).show()
plot_volume(df).show()

We can see that the values are much more similar between the different companies.

## Stationary time series
Naturally, stock data has both **trend** (the company's value increases/decreases) and **seasonality** (Apple releases a new iPhone every year).  
We will have to remove both in order to continue.  
The point of doing that is to make the data "stationary".

Steps:
1. Detrend the prices per company (using order 1 or 2)
2. Try to deseasonalize with:
  1. Weekly period
  2. 90-day period (quarter)
  3. Yearly period
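
The two steps can be sketched with plain NumPy on a synthetic series. This is an illustration under assumptions (a linear trend, a made-up repeating pattern with a period of 4), not the decomposition used in the cells below:

```python
import numpy as np

# Synthetic series: linear trend plus a repeating pattern of length 4.
period = 4
t = np.arange(24, dtype=float)
pattern = np.array([1.0, -1.0, 2.0, -2.0])
series = 0.5 * t + 3.0 + np.tile(pattern, len(t) // period)

# Step 1: remove a first-order (linear) trend fitted by least squares.
slope, intercept = np.polyfit(t, series, deg=1)
trend = slope * t + intercept
detrended = series - trend

# Step 2: remove seasonality by subtracting the mean of each position in the cycle.
seasonal_means = np.array([detrended[i::period].mean() for i in range(period)])
seasonal = np.tile(seasonal_means, len(t) // period)
residual = detrended - seasonal

# The transformation can be undone by adding the components back.
reconstructed = trend + seasonal + residual
print(np.allclose(reconstructed, series))
```

Because the components are kept, the original series can always be rebuilt from the residual.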


Terms and definitions are taken from these links:   
* [Stationarity in time series analysis](https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322)   
* [Trend, Seasonality, Moving Average, Auto Regressive Model : My Journey to Time Series Data with Interactive Code](https://towardsdatascience.com/trend-seasonality-moving-average-auto-regressive-model-my-journey-to-time-series-data-with-edc4c0c8284b)

In [126]:
from statsmodels.tsa.tsatools import detrend
from statsmodels.tsa.seasonal import seasonal_decompose

In [190]:
def remove_trend_and_seasonality(df):
    # seasonal_decompose needs at least two full periods (2 * 90 observations)
    if len(df)<180:
        return pd.DataFrame()
    day = df.reset_index().day
    df = df.reset_index(drop=True)
    data = {}
    for col in df.keys():
        # Additive decomposition: column = trend + seasonal + residual
        decomp = seasonal_decompose(df[col], period=90, two_sided=True, extrapolate_trend=1)
        data[f"{col}_trend"] = decomp.trend
        data[f"{col}_seasonal"] = decomp.seasonal
        data[col] = decomp.resid  # Residual will override the original column
    data["day"] = day  # Keep the matching "day" value for every row
    return pd.DataFrame.from_dict(data)

pre_processed_2 = (pre_processed_1
                   .set_index(["symbol", "day"])
                   .groupby("symbol")
                   .apply(remove_trend_and_seasonality)
                   .reset_index(1, drop=True)  # Remove "day" index
                   .reset_index())

# Override 
def stock_df(symbol):
    return pre_processed_2[pre_processed_2.symbol.str.match(symbol)]

# Overriding
def plot_close(df, resid_only=True):
    if not resid_only:
        df = df.assign(close = df.close_trend+df.close_seasonal+df.close)
        
    return px.line(data_frame=df,
                   y='close',
                   line_group='symbol', color="symbol",
                   hover_data=["open", "high","low", "volume"],
                   x='day', title="Close price")

# Overriding
def plot_volume(df, days=90, resid_only=True):
    if not resid_only:
        df = df.assign(volume = df.volume_trend + df.volume_seasonal + df.volume)
        
    df = df.drop(columns="day").groupby([df.day // days, "symbol"]).sum().reset_index()
    return px.bar(df, x="day", y="volume", color="symbol", 
                  barmode='group', title="Volume")



In [191]:

df = stock_df("AAPL|GOOGL|FB$|MSFT")
fig = make_subplots(2,2, column_titles=["residual only", "original"], row_titles=["close price", "volume"])
fig.add_traces(plot_close(df).data, 1,1)
fig.add_traces(plot_close(df, resid_only=False).data, 1, 2,)
fig.add_traces(plot_volume(df).data, 2, 1)
fig.add_traces(plot_volume(df, resid_only=False).data, 2, 2)
fig


## Summary
We can see we achieved what we needed.
1. Our data is in the same range across symbols.
2. We have an invertible transformation that removes seasonality and trend.


# Base model
Let's define the task better.
We want to take the last few days (a constant number) and predict the price for the next day.
We should define a baseline (to measure our improvement) and a loss function.
I chose these arbitrarily:
1. Baseline - RandomForest regression model.
2. Loss function - mean absolute scaled error (MASE) - https://en.wikipedia.org/wiki/Mean_absolute_scaled_error
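
For reference, here is a minimal sketch of the non-seasonal MASE: the forecast's mean absolute error divided by the in-sample mean absolute error of the one-step naive forecast. The `mase` helper and the toy numbers are illustrative assumptions, not the sktime implementation.

```python
import numpy as np

def mase(y_train, y_test, y_pred):
    """Mean absolute scaled error, scaled by the in-sample naive (previous value) forecast."""
    y_train, y_test, y_pred = (np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred))
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y_test - y_pred)) / naive_mae

# Toy example (values are made up).
y_train = [10.0, 12.0, 11.0, 13.0]   # day-to-day changes: 2, 1, 2
y_test = [14.0, 15.0]
y_pred = [13.0, 14.0]                # off by 1 on both days

score = mase(y_train, y_test, y_pred)
print(score)
```

A value below 1 means the forecast is, on average, better than the naive forecast was on the training data.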

In [None]:
import sktime  # Scikit learn like package designed for time series analysis.
from sktime.forecasting.model_selection import temporal_train_test_split, SlidingWindowSplitter

from sktime.forecasting.naive import  NaiveForecaster
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.performance_metrics.forecasting import mase_loss, smape_loss
from sktime.utils.plotting.forecasting import plot_ys
from sktime.transformers.single_series.detrend import Detrender


In [None]:
# Our naive model: each day, just predict the last known price!
# It makes the baseline look very good, but it doesn't tell us anything new.
# In the stock market we have access to real-time data, so the next-day price is usually
# very similar to today's price.

def naive_pred(df):
    train, test = temporal_train_test_split(df)
    model = NaiveForecaster('last').fit(train.close)
    cv = SlidingWindowSplitter(fh=1, window_length=30*3)

    y_pred = model.update_predict(test.close, cv=cv)
    print("Smape loss:", smape_loss(y_test=test.close, y_pred=y_pred))
    return go.Figure([go.Scatter(x=train.index, y=train.close, mode="lines+markers", name="train"),
               go.Scatter(x=test.index, y=test.close, mode="lines+markers", name="test"),
               go.Scatter(x=y_pred.index, y=y_pred, mode="lines+markers", name="pred")
              ])

naive_pred(stock_df("AAPL").set_index("day"))
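
The idea behind the last-value baseline can also be written without any forecasting library. The sketch below uses made-up prices; `last_value_forecast` and `smape` are hypothetical helpers, and the sMAPE formula shown is one common definition (values between 0 and 2).

```python
import numpy as np

def last_value_forecast(series):
    """Predict each day with the previous day's value; returns (y_pred, y_true)."""
    series = np.asarray(series, dtype=float)
    return series[:-1], series[1:]

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error."""
    return np.mean(2.0 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

toy_close = [100.0, 101.0, 99.0, 102.0]   # made-up closing prices
y_pred, y_true = last_value_forecast(toy_close)
loss = smape(y_true, y_pred)
print(loss)
```

Because consecutive prices are close to each other, the loss is small even though the forecast carries no new information.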



In [None]:
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import PassiveAggressiveRegressor
df = stock_df("AAPL").set_index("day")
train, test = temporal_train_test_split(df)
# The reduction needs a regressor: the price is a continuous target, so a classifier would not fit here.
model = ReducedRegressionForecaster(regressor=PassiveAggressiveRegressor(), window_length=30*3)
# model.fit(train.close, fh=1)
PassiveAggressiveRegressor??
# model.fit(train.close)

In [None]:
def detrend(series):
    # Fit a linear (degree 1) trend and subtract it from the series
    forecaster = PolynomialTrendForecaster(degree=1)
    transformer = Detrender(forecaster)
    return transformer.fit_transform(series)


df = stock_df("GOOGL|AAPL|FB$").set_index("day")
df=df.assign(notrend=df.groupby("symbol").close.transform(detrend))
px.line(df, x=df.index, y="notrend", line_group="symbol", color="symbol")


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Raw (unscaled) AAPL prices
df = prices[prices.symbol == "AAPL"].reset_index(drop=True).drop(columns=["symbol", "date"])
# Add some features
df = df.assign(
    # ma for moving-average
    ma_5 = df.close.rolling(5).mean(),
    ma_15 = df.close.rolling(15).mean(),
    ma_30 = df.close.rolling(30).mean(),
    std = df.close.rolling(30).std(),
    var = df.close.rolling(30).var(),
    top_diff_5d = df.close - df.close.rolling(5).max(),
    bottom_diff_5d = df.close - df.close.rolling(5).min(),
    diff_1d = df.close.diff()  # change from the previous day
)
# Features of day t predict the close of day t+1; the first rows lack full 30-day rolling windows.
X, y = df[30:-1], df[31:].close
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)  # Don't shuffle, don't mix past and future.

model = LinearRegression()
model.fit(X_train, y_train)
# model.fit(X, y)
# model.coef_
smape_loss(y_test.reset_index(drop=True), pd.Series(model.predict(X_test)))

In [None]:
go.Figure([go.Scatter(x=X_test.index, y=y_test),
          go.Scatter(x=X_test.index, y=model.predict(X_test))])
# If we zoom in a bit, we can see that the predictions lag behind the actual values (they are "delayed").
# Maybe the loss function is not a good fit for this case.
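
One way to check for such a delay is to shift the prediction against the actual series and see which shift gives the smallest error. The sketch below is a hypothetical diagnostic on made-up data; `best_lag` is not part of any library used here.

```python
import numpy as np

def best_lag(y_true, y_pred, max_lag=5):
    """Return the shift (in days) at which y_pred best matches y_true, by mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = {}
    for lag in range(max_lag + 1):
        # Compare the prediction for day t with the actual value `lag` days earlier.
        shifted_true = y_true[: len(y_true) - lag]
        errors[lag] = np.mean(np.abs(y_pred[lag:] - shifted_true))
    return min(errors, key=errors.get)

# Made-up series and a "prediction" that simply repeats yesterday's value.
actual = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0])
delayed = np.concatenate([[1.0], actual[:-1]])

print(best_lag(actual, delayed))
```

A best lag of 1 or more suggests the model is mostly echoing recent prices rather than anticipating the next one.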

In [None]:
# TODO: 
# Look at the fundamentals data
# Use deep learning (RNNs, LSTMs)
# Find anomalies in data
# Find connections between different stocks and use that data
# Reinforcement learning - automatic strategy.