# Algorithmic Trading
## Introduction
Technology has become an asset in finance: financial institutions are evolving into technology companies rather than focusing solely on the financial side of the business. Technology drives innovation and can provide a competitive advantage, and the speed and frequency of financial transactions, together with the large data volumes involved, have increased financial institutions’ attention to technology over the years. Technology has indeed become a main enabler in finance.

This notebook introduces how to implement some algorithmic trading strategies in Python.

### Stocks & Trading
When a company wants to grow and undertake new projects or expand, it can issue stocks to raise capital. A stock represents a share in the ownership of a company and is issued in return for money. Stocks are bought and sold: buyers and sellers trade existing, previously issued shares. The price at which stocks are sold can move independently of the company’s success: prices instead reflect supply and demand. This means that whenever a stock is considered ‘desirable’, for example due to success or popularity, its price will go up.

Note that stocks are not the same as bonds: with bonds, companies raise money by borrowing, either as a loan from a bank or by issuing debt.

As you just read, buying and selling, or trading, is essential when you’re talking about stocks, but it is certainly not limited to them: trading is the act of buying or selling an asset, which could be a financial security, like a stock or a bond, or a tangible product, such as gold or oil.

Stock trading is then the process in which the cash paid for stocks is converted into a share in the ownership of a company, which can be converted back to cash by selling, hopefully at a profit. To achieve a profitable return, you either go long or short in markets: you either buy shares expecting the stock price to go up so that you can sell at a higher price in the future, or you sell your stock, expecting that you can buy it back at a lower price and realize a profit. When you follow a fixed plan to go long or short in markets, you have a trading strategy.
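The difference between going long and going short can be made concrete with a small calculation. The sketch below uses made-up prices and two hypothetical helper functions (`long_profit` and `short_profit`), and it ignores fees, dividends and borrowing costs:

```python
def long_profit(buy_price, sell_price, shares):
    """Profit from buying first and selling later (going long)."""
    return (sell_price - buy_price) * shares


def short_profit(sell_price, buyback_price, shares):
    """Profit from selling first and buying back later (going short)."""
    return (sell_price - buyback_price) * shares


# Hypothetical prices, chosen only for illustration
print(long_profit(buy_price=100.0, sell_price=110.0, shares=10))       # 100.0
print(short_profit(sell_price=100.0, buyback_price=90.0, shares=10))   # 100.0
print(short_profit(sell_price=100.0, buyback_price=110.0, shares=10))  # -100.0
```

The last line shows the risk of a short position: if the price rises instead of falling, buying the stock back costs more than the sale brought in.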

Developing a trading strategy is something that goes through a couple of phases, just like when you, for example, build machine learning models: you formulate a strategy and specify it in a form that you can test on your computer, you do some preliminary testing or backtesting, you optimize your strategy and lastly, you evaluate the performance and robustness of your strategy.

Trading strategies are usually verified by backtesting: you reconstruct, with historical data, trades that would have occurred in the past using the rules that are defined with the strategy that you have developed. This way, you can get an idea of the effectiveness of your strategy and you can use it as a starting point to optimize and improve your strategy before applying it to real markets. Of course, this all relies heavily on the underlying theory or belief that any strategy that has worked out well in the past will likely also work out well in the future, and, that any strategy that has performed poorly in the past will likely also do badly in the future.
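To give a feel for what reconstructing trades with historical data means, here is a minimal backtest sketch. The prices are made up, and the rule (be long whenever the previous close was above its 3-day moving average) is only an assumption chosen for illustration, not a recommended strategy:

```python
import pandas as pd

# Hypothetical closing prices, made up for illustration
prices = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 99.0, 98.0, 100.0, 104.0, 107.0, 105.0],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

# Rule (an assumption): signal is 1 when the close is above its 3-day mean
signal = (prices > prices.rolling(3).mean()).astype(int)

# Act on yesterday's signal only, so the backtest never looks ahead
position = signal.shift(1).fillna(0)

daily_returns = prices.pct_change().fillna(0)
strategy_returns = position * daily_returns

# Growth of 1 unit of capital for the strategy and for buy-and-hold
strategy_equity = (1 + strategy_returns).cumprod()
buy_and_hold_equity = (1 + daily_returns).cumprod()

print(strategy_equity.iloc[-1], buy_and_hold_equity.iloc[-1])
```

Shifting the signal by one day is the important design choice: without it, the strategy would trade on information that was not yet available at the time.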

### Time Series Data
A time series is a sequence of numerical data points taken at successive equally spaced points in time. In investing, a time series tracks the movement of the chosen data points, such as the stock price, over a specified period of time with data points recorded at regular intervals.

However, what you’ll often see when you’re working with stock data is not just two columns containing period and price observations: most of the time, you’ll have five columns that contain the period and the opening, high, low and closing prices of that period. This means that, if your period is set at a daily level, the observations for that day give you the opening and closing price for that day and the highest and lowest price of a particular stock during that day.
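As a rough illustration of where the open, high, low and close columns come from, the sketch below collapses a handful of made-up intraday prices into daily bars with the pandas `resample().ohlc()` method. The timestamps and prices are hypothetical:

```python
import pandas as pd

# Hypothetical intraday prices over two days (made-up numbers)
ticks = pd.Series(
    [10.0, 10.5, 9.8, 10.2, 10.3, 11.0, 10.1, 10.6],
    index=pd.to_datetime([
        "2020-01-02 09:30", "2020-01-02 11:00",
        "2020-01-02 14:00", "2020-01-02 16:00",
        "2020-01-03 09:30", "2020-01-03 11:00",
        "2020-01-03 14:00", "2020-01-03 16:00",
    ]),
)

# Collapse each day into open, high, low and close columns
daily_ohlc = ticks.resample("D").ohlc()
print(daily_ohlc)
```

Each row of `daily_ohlc` summarises one day: the first and last price of the day plus the two extremes in between.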

In [19]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt

## Python Basics For Finance: Pandas

### Importing Data
The `pandas-datareader` package allows for reading in data from sources such as Google, Yahoo! Finance, World Bank,…

Here I am pulling Apple stock data from IEX, which provides historical prices for up to 5 years:

In [20]:
from pandas_datareader import data as pdr

start = datetime.datetime(2014, 10, 1)
end = datetime.datetime(2018, 1, 1)

# Pull daily AAPL prices from IEX for the chosen date range
appl = pdr.DataReader('AAPL', 'iex', start, end)
appl.head()



| date | open | high | low | close | volume |
|------------|---------|---------|---------|---------|----------|
| 2014-10-01 | 94.4247 | 94.5186 | 92.6506 | 93.1011 | 51491286 |
| 2014-10-02 | 93.1856 | 94.0774 | 92.031 | 93.777 | 47757828 |
| 2014-10-03 | 93.3452 | 94.068 | 92.9697 | 93.5142 | 43469585 |
| 2014-10-06 | 93.8239 | 94.481 | 93.3264 | 93.5142 | 37051182 |
| 2014-10-07 | 93.3358 | 93.9835 | 92.6787 | 92.6975 | 42094183 |


An alternative to `pandas_datareader` is Quandl:

In [21]:
import quandl
aapl = quandl.get("WIKI/AAPL", start_date="2006-10-01", end_date="2012-01-01")
aapl.head()

LimitExceededError: (Status 429) (Quandl Error QELx01) You have exceeded the anonymous user limit of 50 calls per day. To make more calls today, please register for a free Quandl account and then include your API key with your requests.

### Working With Time Series Data
The data was read into a pandas DataFrame, so all the usual DataFrame functions are available.

In [None]:
# Inspect the index 
aapl.index

# Inspect the columns
aapl.columns

# Select only the last 10 observations of `Close`
ts = aapl['Close'][-10:]

# Check the type of `ts` 
type(ts)

In [None]:
# Inspect the first rows of November-December 2006
print(aapl.loc[pd.Timestamp('2006-11-01'):pd.Timestamp('2006-12-31')].head())

# Inspect the first rows of 2007 
print(aapl.loc['2007'].head())

# Inspect November 2006
print(aapl.iloc[22:43])

# Inspect the 'Open' and 'Close' values at 2006-11-01 and 2006-12-01
print(aapl.iloc[[22,43], [0, 3]])

In [None]:
# Sample 20 rows
sample = aapl.sample(20)

# Print `sample`
print(sample)

# Resample to monthly level, taking the mean of each month
monthly_aapl = aapl.resample('M').mean()

# Print `monthly_aapl`
print(monthly_aapl)

In [None]:
# Add a column `diff` to `aapl` 
aapl['diff'] = aapl.Open - aapl.Close

# Delete the new `diff` column
del aapl['diff']

In [None]:
# Import Matplotlib's `pyplot` module as `plt`
import matplotlib.pyplot as plt

# Plot the closing prices for `aapl`
aapl['Close'].plot(grid=True)

# Show the plot
plt.show()

## Common Financial Analysis
In the rest of this section, I will explore returns, moving windows, volatility calculation and Ordinary Least-Squares Regression (OLS).

### Returns
The simple daily percentage change doesn't take dividends and other factors into account; it represents the percentage change in the value of a stock over a single day of trading.

Note that I am also calculating the log returns to get better insight into the growth of the returns over the time period.
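One reason log returns are convenient is that they add up over time, whereas simple returns have to be compounded. The sketch below checks both properties on a short series of made-up closing prices:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices (made-up numbers)
close = pd.Series([100.0, 102.0, 99.0, 105.0])

simple_returns = close.pct_change().dropna()
log_returns = np.log(close / close.shift(1)).dropna()

# Log returns add up: their sum is the log of the total growth
total_log_return = log_returns.sum()
print(np.isclose(total_log_return, np.log(close.iloc[-1] / close.iloc[0])))

# Simple returns compound: multiply (1 + r) to get the total growth
total_simple_return = (1 + simple_returns).prod() - 1
print(np.isclose(total_simple_return, close.iloc[-1] / close.iloc[0] - 1))
```

Both checks print `True`, and the two kinds of return are linked by `log_return = log(1 + simple_return)`.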

In [None]:
# Assign `Adj. Close` to `daily_close`
daily_close = aapl[['Adj. Close']]

# Daily returns
daily_pct_c = daily_close.pct_change()

# Replace NA values with 0
daily_pct_c.fillna(0, inplace=True)

# Inspect daily returns
print(daily_pct_c)

# Daily log returns
daily_log_returns = np.log(daily_close.pct_change()+1)

# Print daily log returns
print(daily_log_returns)

In [None]:
# Resample `aapl` to business months, take last observation as value
monthly = aapl.resample('BM').apply(lambda x: x.iloc[-1])

# Calculate the monthly percentage change
print(monthly.pct_change())

# Resample `aapl` to quarters, take the mean as value per quarter
quarter = aapl.resample('Q').mean()

# Calculate the quarterly percentage change
print(quarter.pct_change())

Using `pct_change()` is quite convenient, but it also obscures how exactly the daily percentages are calculated. That’s why you can alternatively use pandas’ `shift()` function: divide the `daily_close` values by `daily_close.shift(1)` and subtract 1. With this approach, however, you will be left with NA values at the beginning of the resulting DataFrame.

In [None]:
# Daily returns
daily_pct_c = daily_close / daily_close.shift(1) - 1

# Print `daily_pct_c`
print(daily_pct_c)

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Plot the distribution of `daily_pct_c`
daily_pct_c.hist(bins=50)

# Show the plot
plt.show()

# Pull up summary statistics
print(daily_pct_c.describe())

The distribution looks very symmetrical and roughly normal: the daily changes centre approximately around 0.0. Using `.describe()` we can see that the mean is actually 0.001567 and the standard deviation is 0.024.

The **cumulative daily rate of return** is useful to determine the value of an investment at regular intervals. It can be calculated by adding 1 to the daily percentage change values and taking the cumulative product of the results:

In [None]:
# Cumulative daily returns
cum_daily_return = (1 + daily_pct_c).cumprod()

print(cum_daily_return)
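The cumulative product can be checked by hand on a tiny series. The daily changes below are made up; the result tracks the value of 1 unit invested at the start:

```python
import pandas as pd

# Hypothetical daily percentage changes (made-up numbers)
daily_changes = pd.Series([0.0, 0.10, -0.05, 0.02])

# Value of 1 unit invested at the start, tracked day by day
cumulative = (1 + daily_changes).cumprod()
print(cumulative.round(4).tolist())
```

After a 10% gain, a 5% loss and a 2% gain, the investment is worth 1.1 × 0.95 × 1.02 ≈ 1.0659 units, which is the last value of `cumulative`.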

In [None]:
import matplotlib.pyplot as plt 

cum_daily_return.plot(figsize=(12,8))

plt.show()

In [None]:
# Resample the cumulative daily return to a cumulative monthly return
cum_monthly_return = cum_daily_return.resample("M").mean()

print(cum_monthly_return)

### Gather More Companies Data
This will be done by writing a function that takes the ticker symbols of the stocks, a start date and an end date. The nested function `data()` takes a ticker symbol, fetches the data from the start date to the end date and returns it so that the `get()` function can continue.

In [None]:
import quandl

def get(tickers, startdate, enddate):
    def data(ticker):
        return quandl.get(ticker, start_date=startdate, end_date=enddate)
    datas = map(data, tickers)
    return pd.concat(datas, keys=tickers, names=['Ticker', 'Date'])
tickers = ['WIKI/AAPL']
#tickers = ['WIKI/AAPL', 'WIKI/MSFT', 'WIKI/IBM', 'WIKI/GOOG']
all_data = get(tickers, datetime.datetime(2006, 10, 1), datetime.datetime(2012, 1, 1))
all_data.sample(20)

The function above only works for premium Quandl subscribers, as free users are not able to make concurrent calls. Therefore, a modified function is given below.
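The way `pd.concat` stacks per-ticker frames under a `Ticker`/`Date` index can also be tried without any download. In the sketch below, the tickers `AAA` and `BBB` and their prices are hypothetical stand-ins for the downloaded data:

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=3, freq="D", name="Date")

# Hypothetical per-ticker frames standing in for downloaded data
frames = {
    "AAA": pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0]}, index=dates),
    "BBB": pd.DataFrame({"Adj. Close": [20.0, 19.0, 21.0]}, index=dates),
}

# Stack the frames; `keys` becomes the outer index level
stacked = pd.concat(list(frames.values()), keys=list(frames.keys()),
                    names=["Ticker", "Date"])

# One column per ticker, one row per date
wide = stacked["Adj. Close"].unstack("Ticker")
print(wide)
```

The stacked form keeps all tickers in one long DataFrame, while the unstacked form is handier for comparing tickers side by side.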

In [None]:
# import quandl
def data(ticker, startdate, enddate):
    return (quandl.get(ticker, start_date=startdate, end_date=enddate))


def get(tickers, startdate, enddate):
    datas = []
    for ticker in tickers:
        dat = data(ticker, startdate, enddate)
        datas.append(dat) 
    return(pd.concat(datas, keys=tickers, names=['Ticker', 'Date']))

tickers = ['WIKI/AAPL', 'WIKI/MSFT', 'WIKI/IBM', 'WIKI/GOOG']
all_data = get(tickers, datetime.datetime(2006, 10, 1), datetime.datetime(2012, 1, 1))
all_data.head(5)

In [None]:
import matplotlib.pyplot as plt

# Isolate the `Adj. Close` values and transform the DataFrame
daily_close_px = all_data[['Adj. Close']].reset_index().pivot(index='Date',
                                                              columns='Ticker',
                                                              values='Adj. Close')

# Calculate the daily percentage change for `daily_close_px`
daily_pct_change = daily_close_px.pct_change()

daily_pct_change.hist(bins=50, sharex=True, figsize=(12,8))

plt.show()

A useful plot is the scatter matrix. This can be done easily by using the `pandas` library and the `scatter_matrix()` function. As arguments, I am passing the `daily_pct_change` data and, for the diagonal, I am requesting a Kernel Density Estimate (KDE) plot. The Kernel Density Estimate plot estimates the probability density function of a random variable.
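To show what a kernel density estimate computes, here is a hand-rolled sketch in NumPy. It is not how pandas draws the diagonal (pandas relies on SciPy for that); the function name `gaussian_kde_sketch`, the fixed bandwidth and the simulated returns are all assumptions made for illustration:

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth


# Hypothetical daily returns: simulated, seeded noise (not market data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.02, size=500)

grid = np.linspace(-0.15, 0.15, 601)
density = gaussian_kde_sketch(returns, grid, bandwidth=0.005)

# A density should be non-negative and integrate to roughly 1
step = grid[1] - grid[0]
print(round(float(density.sum() * step), 3))
```

A smaller bandwidth follows the individual observations more closely, while a larger one gives a smoother curve.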

In [None]:
import matplotlib.pyplot as plt

# plot a scatter matrix with the `daily_pct_change` data
pd.plotting.scatter_matrix(daily_pct_change, diagonal='kde', alpha=0.1, 
                           figsize=(12,12))

plt.show()

### Moving Windows
A moving window is what you get when you compute a statistic on a window of data covering a particular period of time and then slide that window across the data by a specified interval.

Pandas used to offer dedicated functions for moving windows, such as `rolling_mean()` and `rolling_std()`.

However, these have been deprecated (and removed in recent pandas versions), so instead use the `rolling()` method in combination with `mean()` or `std()`.

But what does a moving window exactly mean?

The exact meaning depends on the statistic that you're applying to the data. For example, a rolling mean smooths out short-term fluctuations and highlights longer-term trends in the data.
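A tiny example makes the sliding window visible. The values below are made up, and the window of 3 observations is chosen only to keep the arithmetic easy to follow:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean over a sliding window of 3 observations; the first two windows are
# incomplete, so they yield NaN
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]

# `min_periods` lets incomplete windows produce a value as well
partial_mean = values.rolling(window=3, min_periods=1).mean()
print(partial_mean.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

The third value, 2.0, is the mean of 1, 2 and 3; the window then drops the oldest observation and picks up the next one at every step.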

In [None]:
adj_close_px = aapl['Adj. Close']

moving_avg = adj_close_px.rolling(window=40).mean()

print(moving_avg[-10:])

In [None]:
import matplotlib.pyplot as plt

# short moving window rolling mean
aapl['42'] = adj_close_px.rolling(window=42).mean()

# long moving window rolling mean
aapl['252'] = adj_close_px.rolling(window=252).mean()

# plot the adjusted closing price and the short and long windows of rolling
# means
aapl[['Adj. Close', '42', '252']].plot()

# show plot
plt.show()

### Volatility Calculation
The volatility of a stock is a measurement of the variation in the returns of a stock over a specific period of time. It is common to compare the volatility of a stock with that of another stock to get a feel for which may carry less risk, or with a market index to examine the stock's volatility relative to the overall market. Generally, the higher the volatility, the riskier the investment in that stock, which can lead you to choose one stock over another.

The moving historical standard deviation of the returns, i.e. the moving historical volatility, might be of more interest:

In [None]:
import matplotlib.pyplot as plt

# define the minimum number of periods to consider
min_periods = 75

# calculate volatility
vol = daily_pct_change.rolling(min_periods).std() * np.sqrt(min_periods)

vol.plot(figsize=(10,8))

plt.show()

The volatility is calculated by taking a rolling-window standard deviation of the percentage change in a stock.

Note that the size of the window can and will change the overall result: if you widen the window by making `min_periods` larger, your result will become less representative. If you make it smaller and narrow the window, the result will come closer to the standard deviation.

Considering all of this, you see that it's definitely a skill to get the right window size based upon the data sampling frequency.
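The effect of the window size can be explored on simulated data. The sketch below follows the same convention as the cell above (rolling standard deviation scaled by the square root of the window); the helper `rolling_volatility`, the window sizes and the seeded random returns are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns: simulated, seeded noise (not market data)
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(loc=0.0, scale=0.01, size=500))


def rolling_volatility(returns, window):
    """Rolling standard deviation scaled by the square root of the window."""
    return returns.rolling(window).std() * np.sqrt(window)


short_vol = rolling_volatility(returns, window=20)
long_vol = rolling_volatility(returns, window=100)

# Each series starts with `window - 1` missing values
print(short_vol.isna().sum(), long_vol.isna().sum())  # 19 99

# Compare how much the two unscaled estimates move around over time
print(returns.rolling(20).std().describe())
print(returns.rolling(100).std().describe())
```

A short window typically reacts faster but is noisier, while a long window is smoother and discards more observations at the start of the series.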

### Ordinary Least-Squares Regression (OLS)
After all of these calculations, I want to perform a more statistical analysis of the financial data, using a traditional regression technique such as Ordinary Least-Squares Regression (OLS).

To do this, I am going to make use of the `statsmodels` library, which provides not only the ability to estimate statistical models but also to conduct statistical tests and perform data exploration.

In [None]:
import statsmodels.api as sm

# isolate the adjusted closing price
all_adj_close = all_data[['Adj. Close']]

# calculate the log returns
all_returns = np.log(all_adj_close / all_adj_close.shift(1))

# isolate the aapl returns (index keys carry the `WIKI/` prefix used above)
aapl_returns = all_returns.iloc[all_returns.index.get_level_values('Ticker') == 'WIKI/AAPL']
aapl_returns.index = aapl_returns.index.droplevel('Ticker')

# isolate the msft returns
msft_returns = all_returns.iloc[all_returns.index.get_level_values('Ticker') == 'WIKI/MSFT']
msft_returns.index = msft_returns.index.droplevel('Ticker')

# build up a new dataframe with AAPL and MSFT
return_data = pd.concat([aapl_returns, msft_returns], axis=1)[1:]
return_data.columns = ['AAPL', 'MSFT']

# add a constant
X = sm.add_constant(return_data['AAPL'])

# Construct the model
model = sm.OLS(return_data['MSFT'],X).fit()

# print the summary
print(model.summary())

Things to look out for when you're studying the result of the model summary are the following:

* The ```Dep. Variable```, which indicates which variable is the response in the model.
* The ```Model```, in this case ```OLS```: it's the model you're using in the fit.
* Additionally, you also have the ```Method```, which indicates how the parameters of the model were calculated. In this case, you see that this is set to ```Least Squares```.

A few other things that could be interesting:

* The number of observations (```No. Observations```). Note that you could also derive this with pandas by using the ```info()``` function: run ```return_data.info()``` to confirm this.

* The degrees of freedom of the residuals (```Df Residuals```).

* The number of parameters in the model, indicated by ```Df Model```. Note that this number doesn't include the constant term that was added to ```X``` in the code above.

* ```R-squared```, which is the coefficient of determination. This score indicates how well the regression line approximates the real data points. In this case, the result is 0.280, or 28% when expressed as a percentage. When the score is 0%, it indicates that the model explains none of the variability of the response data around its mean. Of course, a score of 100% indicates the opposite.

* The ```F-statistic``` measures how significant the fit is. It is calculated by dividing the mean squared error of the model by the mean squared error of the residuals. The F-statistic for this model is 514.2.

* Next, there's also the ```Prob (F-statistic)```, which indicates the probability that you would get an ```F-statistic``` at least this large, given the null hypothesis that the variables are unrelated.

* The ```Log-Likelihood``` indicates the log of the likelihood function, which is, in this case, 3513.2.

* The ```AIC``` is the Akaike Information Criterion: this metric adjusts the log-likelihood for the complexity of the model, that is, the number of parameters. The AIC of this model is -7022.

* Lastly, the ```BIC``` or Bayesian Information Criterion is similar to the AIC mentioned above, but it penalizes models with more parameters more severely, because its penalty also grows with the number of observations. Because this model has only a constant and one slope, both penalties are small relative to the log-likelihood term, so the BIC score will be close to the AIC score, although not identical.

Below the first part of the model summary, we get reports of each of the model's coefficients:

* The estimated value of the coefficient is registered at ```coef```.

* ```std err``` is the standard error of the estimate of the coefficient.

* There's also the t-statistic value, which you'll find under ```t```. This metric is used to measure how statistically significant a coefficient is.

* ```P>|t|``` is the p-value for the null hypothesis that the coefficient equals 0. If it is less than the significance level, often 0.05, it indicates that there is a statistically significant relationship between the term and the response. In this case, you see that the constant has a value of 0.198, while ```AAPL``` is at 0.000.

Lastly, there is the final part of the model summary, in which you see other statistical tests to assess the distribution of the residuals:

* ```Omnibus```, which is D’Agostino’s omnibus test: it provides a combined statistical test for the presence of skewness and kurtosis.

* The ```Prob(Omnibus)``` is the ```Omnibus``` metric turned into a probability.

* Next, the ```Skew``` or Skewness measures the symmetry of the data about the mean.

* The ```Kurtosis``` gives an indication of the shape of the distribution, as it compares the amount of data close to the mean with those far away from the mean (in the tails).

* ```Durbin-Watson``` is a test for the presence of autocorrelation, and the ```Jarque-Bera``` is another test of the skewness and kurtosis. You can also turn the result of this test into a probability, as you can see in ```Prob (JB)```.

* Lastly, you have the ```Cond. No```, which tests the multicollinearity.
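Several of the summary figures described above can be reproduced by hand, which helps in reading them. The sketch below fits a one-variable OLS with NumPy on simulated, seeded data (not the AAPL/MSFT returns) and uses the usual textbook formulas for R-squared, the F-statistic, the Gaussian log-likelihood, AIC and BIC; I am assuming these match what `statsmodels` reports:

```python
import numpy as np

# Hypothetical return series: y depends linearly on x plus seeded noise
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.02, size=250)
y = 0.5 * x + rng.normal(0.0, 0.01, size=250)

# Design matrix with a constant column, similar to what `sm.add_constant` builds
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

residuals = y - X @ coef
ss_res = float(residuals @ residuals)        # residual sum of squares
ss_tot = float(((y - y.mean()) ** 2).sum())  # total sum of squares

n, k = X.shape        # observations, parameters (including the constant)
df_model = k - 1      # excludes the constant
df_resid = n - k

r_squared = 1 - ss_res / ss_tot
f_statistic = ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)

# Gaussian log-likelihood and the two information criteria
log_likelihood = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1)
aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood

print(f"slope={slope:.3f} R2={r_squared:.3f} F={f_statistic:.1f}")
print(f"AIC={aic:.1f} BIC={bic:.1f}")
```

With these definitions the AIC and BIC differ only in the penalty term (`2 * k` versus `k * log(n)`), so the gap between them depends on the number of observations.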

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Plot returns of AAPL and MSFT
plt.plot(return_data['AAPL'], return_data['MSFT'], 'r.')

# Get the current axis limits of the plot
ax = plt.axis()

# Initialize `x`
x = np.linspace(ax[0], ax[1] + 0.01)

# Plot the regression line (intercept + slope * x)
plt.plot(x, model.params.iloc[0] + model.params.iloc[1] * x, 'b', lw=2)

# Customize the plot
plt.grid(True)
plt.axis('tight')
plt.xlabel('Apple Returns')
plt.ylabel('Microsoft Returns')

# Show the plot
plt.show()

Note that you can also use the rolling correlation of returns as a way to cross-check your results. You can make use of the Matplotlib integration in pandas to call the `plot()` function on the results of the rolling correlation:

In [None]:
# Import matplotlib 
import matplotlib.pyplot as plt

# Plot the rolling correlation
return_data['MSFT'].rolling(window=252).corr(return_data['AAPL']).plot()

# Show the plot
plt.show()
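The rolling correlation itself can be inspected without plotting. The sketch below uses two simulated, seeded return series that share a common component; the column names `A` and `B` and the window of 60 observations are hypothetical choices:

```python
import numpy as np
import pandas as pd

# Two hypothetical, seeded return series that share a common component
rng = np.random.default_rng(7)
common = rng.normal(0.0, 0.01, size=300)
returns = pd.DataFrame({
    "A": common + rng.normal(0.0, 0.01, size=300),
    "B": common + rng.normal(0.0, 0.01, size=300),
})

# Correlation of A and B over a sliding window of 60 observations
rolling_corr = returns["A"].rolling(window=60).corr(returns["B"])

# The first 59 windows are incomplete; afterwards values lie in [-1, 1]
print(rolling_corr.dropna().describe())
```

If the correlation stays in a similar range across windows, the relationship estimated by the regression is reasonably stable over time; large swings suggest it is not.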

## Building A Trading Strategy With Python