We decided to use an alternate options database for the following reasons:

- We wanted to use currencies as an input in the model, which weren't provided in the data set. Both Daniel and Eduardo have experience trading currencies such as the Mexican Peso and we have seen the strong effects of Trump's tweets on currencies.

- While the dataset has daily options data for agricultural products, the data for indices and rates is semi-monthly. The trading model we are implementing makes daily decisions and we wanted to have the same granularity in the data set.

- We wanted to standardize the volatility features so they were comparable across days. We chose to look at 1 month and 2 month volatility. To be able to obtain these volatilities, you need to interpolate between two datasets, and we don't always have this information in the dataset.

- For example, for file '5_8_2019_eonly_settlements.txt' the dataset provides preliminary CME settlement prices as of 3:30 CST. That day, the S&P500 e-mini went up over 3% from settle to settle. The data contains updated settlement options prices for IMM expirations (EZ contracts), though not for the EOM options (EW contracts). For EOM options, the settlement prices show as unchanged from the prior day, even if volume traded on different options.

- Consider the situation when futures have just rolled, the closest IMM expiration is a little under 3 months away, and you have no reliable EOM options settlements. It's impossible to generate a 1 month and 2 month volatility curve without making strong assumptions, which we thought could reduce the quality of the features in the model.



# Futures

A **futures contract** is a legal agreement to buy or sell something at a predetermined price at a specified time in the future. A futures contract is traded through an exchange, such as the Chicago Mercantile Exchange. The asset transacted is usually a commodity or financial instrument, and is also called the **underlying**. The predetermined price the parties agree to buy and sell the asset for is known as the **futures price**. The specified time in the future when delivery or payment occurs is known as the **delivery date**.

Exchanges act as a marketplace between anonymous buyers and sellers. The buyer of a contract is said to be **long** the contract, and the selling party is said to be **short** the contract.

To minimize credit risk to the exchange, traders must post a **margin**, which depends on the volatility of the underlying, typically 5%-15% of the contract's value. To minimize counterparty risk to traders, trades executed on regulated futures exchanges are guaranteed by a clearing house.

We will be using futures contracts traded through the CME Group (https://www.cmegroup.com/). All contracts are front month, unless specified.

Indices:

- CME S&P 500 Index E-Mini Futures (ES) 
- CME NASDAQ 100 Index Mini Futures (NQ) 

Currencies:

- CME Euro FX Futures (EC)
- CME Canadian Dollar CAD Futures (CD) 
- CME Japanese Yen JPY Futures (JY) 
- CME Mexican Peso Futures (MP) 

Rates:

- CBOT 30-year US Treasury Bond Futures (US) 
- CBOT 10-year US Treasury Note Futures (TY)

Commodities:

- CBOT Wheat Futures, Second Month (W)
- CBOT Soybeans Futures, Second Month (S) 
- CBOT Corn Futures, Second Month (C)
- NYMEX Gold Futures (GC) 
- NYMEX WTI Crude Oil Futures (CL) 

**Futures Roll**: Unlike stocks, futures contracts have expiration dates. Most of the volume is concentrated in the closest month to expiration (though not necessarily for agricultural futures, which trade based on demand and seasonality). Positions are rolled over to a later expiration month to avoid the costs and obligations associated with settlement of the contracts. Futures contracts are most often settled by physical settlement or cash settlement. If you don't close your long wheat contract, you'll get a call to specify delivery information for 5,000 bushels of wheat!

The difference in futures prices with different expirations is known as the **spread**. Spreads vary in value depending on a multitude of factors, from interest rates and dividends to costs of storage for commodities. For our purposes, since we want a time series that has no jumps between expirations, we use a method called **Backwards Panama Adjusted Prices**. This method takes the latest contract and cumulatively subtracts the historical spread from past expired futures contracts, as if we had always been trading the contract currently trading.
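As an illustration, the back-adjustment can be sketched in a few lines of Python. The data layout here is hypothetical (a list of segments, oldest first, each holding the prices of the contract we were trading plus the price of the next contract on the roll day), and the spread is assumed to be measured as the expiring contract minus the next contract at the roll. It is a sketch of the idea, not the procedure used to build our dataset.

```python
def panama_back_adjust(segments):
    """Stitch per-contract price segments into one series with no roll jumps.

    `segments` is ordered oldest to newest. Each segment is a dict with:
      - "prices": settlement prices of the contract while it was held
      - "next_at_roll": price of the NEXT contract on the last day of this
        segment (None for the contract currently trading)
    """
    adjusted = []
    cumulative_spread = 0.0
    # Walk backwards in time, starting from the contract currently trading.
    for seg in reversed(segments):
        if seg["next_at_roll"] is not None:
            # Spread on the roll day: expiring contract minus next contract.
            cumulative_spread += seg["prices"][-1] - seg["next_at_roll"]
        # Subtract the accumulated spread from every older price.
        shifted = [p - cumulative_spread for p in seg["prices"]]
        adjusted = shifted + adjusted
    return adjusted


# Made-up prices for three consecutive contracts.
example_segments = [
    {"prices": [100.0, 101.0, 102.0], "next_at_roll": 105.0},
    {"prices": [106.0, 107.0], "next_at_roll": 106.5},
    {"prices": [107.5, 108.0], "next_at_roll": None},
]
adjusted_series = panama_back_adjust(example_segments)
```

The latest contract is left untouched, and each older segment is shifted so that its last adjusted price equals the adjusted price of the next contract on the roll day. Daily price changes within a segment are preserved, but adjusted levels far in the past can differ substantially from the prices that actually traded.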

For modeling financial data, this will be very convenient since we will not have jumps at expiration dates (similar to adjusting stock prices for dividends and splits). It also avoids the problem of having to price daily carry for currencies. The futures price already takes into account carried interest in both the dollar and in the foreign currency.



# Options


Let's assume asset **returns** follow a normal distribution.

More specifically, let's assume prices follow a log-normal distribution, which means that **log-prices follow a normal distribution**. We can describe this evolution of prices as a Geometric Brownian Motion:

$$\frac{dS_t}{S_t} = \mu dt + \sigma dW_t$$ 
You can prove that the solution to this equation is:
$$S_t=S_0\exp \left(\left(\mu -{\frac {\sigma ^{2}}{2}}\right)t+\sigma W_{t}\right) \Rightarrow \ln {\frac {S_{t}}{S_{0}}}=\left(\mu -{\frac {\sigma ^{2}}{2}}\,\right)t+\sigma W_{t}, \;\;\; W_t \sim N(0,t)$$

So log-returns from time 0 to time t are normally distributed with variance $\sigma^2 t$. This is the **Black-Scholes** model. We call $\sigma$ the **volatility**.
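A quick simulation can be used to sanity-check this statement. The sketch below draws log-returns directly from the closed-form solution and compares their sample variance to $\sigma^2 t$. The parameter values are arbitrary choices for illustration.

```python
import math
import random


def simulate_log_returns(mu, sigma, t, n_paths, seed=0):
    """Draw ln(S_t / S_0) = (mu - sigma^2 / 2) t + sigma W_t, with W_t ~ N(0, t)."""
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * t
    scale = sigma * math.sqrt(t)
    return [drift + scale * rng.gauss(0.0, 1.0) for _ in range(n_paths)]


mu, sigma, t = 0.05, 0.20, 0.25  # illustrative annualized drift, volatility, horizon in years
log_returns = simulate_log_returns(mu, sigma, t, n_paths=100_000)

sample_mean = sum(log_returns) / len(log_returns)
sample_var = sum((x - sample_mean) ** 2 for x in log_returns) / (len(log_returns) - 1)
theoretical_var = sigma ** 2 * t  # 0.01 for these parameters
```

With this many draws, `sample_var` should land very close to `theoretical_var`, and `sample_mean` close to $(\mu - \sigma^2/2)t$.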

Consider a (European) **call option**. A call option is a financial contract between a buyer and a seller. The buyer of the call option has the right, but not the obligation, to *buy* a financial instrument (the **underlying**) from the seller of the option at a specific time (the **expiration date**) for a certain price (the **strike price**). The seller is obligated to sell the financial instrument to the buyer if the buyer so decides. The buyer pays a fee (called a **premium**) for this right. You can think of this contract as an insurance contract, or a lottery ticket.

You can write the payoff of a call option with respect to the underlying as $(S_T - K)^+$, where $S_T$ is the underlying price at expiration and $K$ is the strike price. Graphically:

A **put option** is similar to a call option, though the buyer of the put option has the right, but not the obligation, to *sell* a financial instrument.

If you assume that stock prices follow the process above, you can prove that there's a closed form solution for a call option fair price:

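$$c(S_t) = e^{-rt}\,E[(S_T - K)^+]=f(S_t, \sigma, t, K, r)$$

Where $S_t$ is the current price of the underlying, $S_T$ is its price at expiration, $\sigma$ is the volatility, $K$ is the option's strike, $t$ is the time to expiration, and $r$ is the risk-free interest rate. The expectation is taken under the risk-neutral measure.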

Given everything else fixed, we have a one-to-one relationship between an option's price and its volatility $\sigma$.
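The one-to-one relationship can be made concrete with a short sketch. The code below uses the standard Black-Scholes call formula for a non-dividend-paying underlying and inverts it by bisection. Function names and parameter values are illustrative. Note the assumption: for options on futures, the Black (1976) variant of the formula would be the usual choice, so this is an illustration of the inversion, not our pricing code.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S, K, t, r, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return S * norm_cdf(d1) - K * math.exp(-r * t) * norm_cdf(d2)


def implied_vol(price, S, K, t, r, lo=1e-6, hi=5.0, tol=1e-10):
    """Invert bs_call by bisection; works because the price is increasing in sigma."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, t, r, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


call_price = bs_call(S=100.0, K=100.0, t=1.0, r=0.05, sigma=0.20)
recovered_sigma = implied_vol(call_price, S=100.0, K=100.0, t=1.0, r=0.05)
```

Because the call price is strictly increasing in $\sigma$, bisection recovers the volatility that was used to generate the price.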

So, what does a "fair" option price mean? Consider $\Delta = \frac{d C}{d S_t}$, the derivative of the call price with respect to the underlying price, or the **delta**. You can build a hedged portfolio: $C(S_t) - \Delta S_t$

Graphic

Options prices are a convex function of $S_t$. This means that, all else equal, our hedged portfolio will always make money, whether the underlying goes up or down!
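A small numerical check of this claim, holding time and volatility fixed: the sketch below prices a call with the standard Black-Scholes formula, hedges it with its delta, and revalues the hedged portfolio after an instantaneous move up or down. All parameter values are made up for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_and_delta(S, K, t, r, sigma):
    """Black-Scholes call price and its delta, N(d1)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    price = S * norm_cdf(d1) - K * math.exp(-r * t) * norm_cdf(d2)
    return price, norm_cdf(d1)


S0, K, t, r, sigma = 100.0, 100.0, 0.25, 0.0, 0.20
c0, delta = bs_call_and_delta(S0, K, t, r, sigma)


def hedged_pnl(S1):
    """P&L of long one call, short `delta` units of underlying, after a jump to S1."""
    c1, _ = bs_call_and_delta(S1, K, t, r, sigma)
    return (c1 - c0) - delta * (S1 - S0)


pnl_up = hedged_pnl(105.0)
pnl_down = hedged_pnl(95.0)
```

Both `pnl_up` and `pnl_down` come out positive, which is the convexity effect described above. The calculation deliberately ignores the passage of time, which works against the option holder.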

Graphic

In reality, time moves as the underlying's price moves, and time pushes the option price towards its terminal value, so the graphic should really look like this:

You can prove that, if the underlying follows a geometric Brownian motion as above, with a fixed volatility $\sigma$, and if you hedge at every instant as above, updating your delta, you will obtain the option price above with probability one! In other words, the price of the call is fair, as you can replicate it by dynamically trading the underlying. You are capturing the convexity of the call price, or $\Gamma = \frac{d^2 C}{d S_t^2}$, the **gamma**.

In reality, asset returns don't follow a normal distribution. However, there is still a one-to-one relationship between an option price and the $\sigma$ implied by the Black-Scholes model (the **implied volatility**). What happens if you calculate the implied volatility from every option price for a specific expiration? It looks like this:

Graphic

What is this graphic telling us? It's giving us an indication of how far from a normal distribution options prices think the underlying's returns are. It's telling us that, if the market goes down, the market's realized volatility will most likely increase, and the option price must be higher to compensate the seller, who will lose through more volatility while delta hedging. 

We want to capture this information and reduce it to features that we can use in a model. We consider the following features in our analysis:

**At-the-Money** implied volatility: the weighted average of the implied volatilities of the nearest put and call on either side of the current underlying price.

**Risk Reversal**: the implied volatility of an out-of-the-money call minus the implied volatility of an out-of-the-money put with similar deltas. It measures the slope of the volatility curve.

**Butterfly**: the average of out-of-the-money call and put volatilities with similar deltas, minus the at-the-money implied volatility. It measures the convexity of the volatility curve.
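The three definitions above reduce to simple arithmetic once the relevant implied volatilities are known. The sketch below uses made-up volatilities and strikes. The at-the-money weighting by distance to the two strikes is an assumption, since the text does not specify the weights, and the function names are hypothetical.

```python
def interpolate_atm_vol(underlying, strike_below, vol_below, strike_above, vol_above):
    """Weighted average of the two nearest implied vols.

    Assumption: weights are linear in the distance from the underlying
    price to each strike.
    """
    weight_above = (underlying - strike_below) / (strike_above - strike_below)
    return (1.0 - weight_above) * vol_below + weight_above * vol_above


def smile_features(atm_vol, otm_call_vol, otm_put_vol):
    """Risk reversal and butterfly from OTM call/put vols with similar deltas."""
    risk_reversal = otm_call_vol - otm_put_vol               # slope of the curve
    butterfly = 0.5 * (otm_call_vol + otm_put_vol) - atm_vol  # convexity of the curve
    return {"atm": atm_vol, "risk_reversal": risk_reversal, "butterfly": butterfly}


# Illustrative numbers only (vols as decimals).
atm = interpolate_atm_vol(2860.0, 2850.0, 0.125, 2875.0, 0.115)
features = smile_features(atm_vol=0.12, otm_call_vol=0.11, otm_put_vol=0.15)
```

In this example the risk reversal is negative (about -0.04) because the out-of-the-money put trades at a higher implied volatility than the call, and the butterfly is positive (about 0.01) because the wings sit above the at-the-money volatility.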

Graphically:



# Trading Considerations

https://www.cmegroup.com/market-data/settlements/settlements-details.html

# Trump/Word2Vec

# Word2Vec Algorithm

This is an implementation of the [Word2Vec algorithm](https://en.wikipedia.org/wiki/Word2vec) using the skip-gram architecture. 
I'm adapting code from a Udacity course to our problem. The original code is here: https://github.com/udacity/deep-learning-v2-pytorch/tree/master/word2vec-embeddings

# Readings and Videos

* An easy introduction to Recurrent Neural Networks: https://www.youtube.com/watch?v=UNmqTiOnRfg
* A nice lecture on LSTM, which is the type of RNN used in this algorithm: https://www.youtube.com/watch?v=iX5V1WpxxkY&t=1s

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of Word2Vec from Chris McCormick 
* [First Word2Vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [Neural Information Processing Systems paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for Word2Vec, also from Mikolov et al. This explains their choice of subsampling and negative sampling, which make the algorithm run much faster.
* Visualizing the dataset: https://towardsdatascience.com/google-news-and-leo-tolstoy-visualizing-word2vec-word-embeddings-with-t-sne-11558d8bd4d

---

## Subsampling

- This part is a suggestion in [Google's paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) to speed up and improve the algorithm.
- Some words such as "the", "or", etc. appear very often and don't provide much context for neighboring words. By discarding some of them, we can train our network faster and get better results. At the same time, you don't want to discard all of them, as they do provide information about the syntax of tweets. For each word $w_i$ in the training set, we'll discard it with probability given by:

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

Here, $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset. Note that $P(w_i)$ is the probability that a word is discarded. 
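A minimal sketch of the subsampling step follows, using only the standard library. The function name, the toy corpus and the threshold value are illustrative choices. One detail not spelled out in the formula: for words with $f(w_i) < t$ the expression is negative, so the sketch clips the discard probability at zero.

```python
import math
import random
from collections import Counter


def subsample(words, threshold=1e-5, seed=0):
    """Randomly drop frequent words with probability 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    counts = Counter(words)
    total = len(words)
    # Discard probability per word; clipped at 0 for words rarer than the threshold.
    p_drop = {
        w: max(0.0, 1.0 - math.sqrt(threshold / (c / total)))
        for w, c in counts.items()
    }
    kept = [w for w in words if rng.random() >= p_drop[w]]
    return kept, p_drop


# Toy corpus: one very frequent word and one rarer word.
toy_words = ["the"] * 900 + ["tariff"] * 100
kept_words, drop_probs = subsample(toy_words, threshold=0.2)
```

With this toy corpus and threshold, "the" is discarded a little over half of the time, while "tariff" is never discarded.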

For each word in the text, we want to grab all the words in a window around that word, with size $C$. 

## Batches

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $[ 1: C ]$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."
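The sampling scheme in the quote can be sketched as follows. The function name and the example sentence are hypothetical; the logic draws $R$ uniformly from $1$ to $C$ and returns up to $R$ words on each side of the target.

```python
import random


def get_context(words, idx, max_window=5, rng=random):
    """Return the context words around words[idx] for a randomly sized window."""
    r = rng.randint(1, max_window)           # R drawn uniformly from [1, C]
    start = max(0, idx - r)                  # do not run past the start of the text
    history = words[start:idx]               # up to R words before the target
    future = words[idx + 1: idx + r + 1]     # up to R words after the target
    return history + future


tokens = ["we", "will", "make", "america", "great", "again", "soon"]
context = get_context(tokens, idx=3, max_window=5, rng=random.Random(0))
narrow_context = get_context(tokens, idx=3, max_window=1)
```

Because $R$ is often smaller than $C$, words far from the target appear in fewer training pairs than adjacent words, which is the down-weighting the quote describes.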


## Log-loss function through negative sampling

For every sample we give the network, we train it using the output from the final softmax layer. That means for each sample, we're making very small changes to thousands of weights in the network, which makes training it very inefficient. We can approximate the loss from the softmax layer by only updating a small subset of all the weights at once. We'll update the weights for the correct example, but only a small number of incorrect, or noise, examples. This is called ["negative sampling"](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). 

There are two modifications we need to make. First, since we're not taking the softmax output over all the words, we're really only concerned with one output word at a time. Similar to how we use an embedding table to map the input word to the hidden layer, we can now use another embedding table to map the hidden layer to the output word. Now we have two embedding layers, one for input words and one for output words. Secondly, we use a modified loss function where we only care about the true example and a small subset of noise examples.

$$
- \log{\sigma\left(u_{w_O}\hspace{0.001em}^\top v_{w_I}\right)} -
\sum_i^N \mathbb{E}_{w_i \sim P_n(w)}\log{\sigma\left(-u_{w_i}\hspace{0.001em}^\top v_{w_I}\right)}
$$

The first term says we take the log-sigmoid of the inner product of the output word vector and the input word vector.

The second term says we're going to take a sum over words $w_i$ drawn from a noise distribution $w_i \sim P_n(w)$. The noise distribution is all those words in Trump's vocabulary that aren't in the context of the input word. We can randomly sample words from our vocabulary to get these. We can choose $P_n(w)$ to be any distribution, such as a uniform distribution. We could also pick it according to the frequency with which each word shows up in our vocabulary. This is called the unigram distribution $U(w)$. The authors found the best distribution to be $U(w)^{3/4}$, empirically.

Finally, we take the log-sigmoid of the negated inner product of a noise vector with the input vector.

The first term in the loss function pushes the probability that our network will predict the correct word $w_O$ towards 1. In the second term, we're pushing the probabilities of the noise words towards 0.
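To make the loss concrete, here is a small standard-library sketch for a single training pair. It is not the PyTorch implementation used in the notebook: vectors are plain lists, the function names are hypothetical, and the noise vectors are passed in directly instead of being sampled. The helper `noise_distribution` shows the $U(w)^{3/4}$ weighting, renormalized so the probabilities sum to one.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_input, u_output, u_noise):
    """Loss for one (input, output) pair and a list of noise word vectors."""
    loss = -log_sigmoid(dot(u_output, v_input))   # push the true pair's probability toward 1
    for u in u_noise:
        loss -= log_sigmoid(-dot(u, v_input))     # push each noise word's probability toward 0
    return loss


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, normalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


v_in = [1.0, 0.5]
good_loss = negative_sampling_loss(v_in, u_output=[2.0, 1.0], u_noise=[[-2.0, -1.0]])
bad_loss = negative_sampling_loss(v_in, u_output=[-2.0, -1.0], u_noise=[[2.0, 1.0]])
probs = noise_distribution({"the": 900, "tariff": 100})
```

The loss is small when the output vector points in the same direction as the input vector and the noise vectors point away from it (`good_loss`), and large in the opposite configuration (`bad_loss`). Raising counts to the 3/4 power gives rare words a somewhat larger share of the noise samples than their raw frequency would.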