## Joint distribution of (moneyness, time_to_maturity)

Generating labeled data is expensive and time-consuming. Since we want our neural network to be as accurate as possible in the parameter region of greatest liquidity, it makes sense to provide more training data in that region by sampling more labeled data from it. Specifically, we compute an empirical joint probability distribution of (moneyness, time_to_maturity) from SPX data, with liquidity proxied by *open interest*.

First, we extract the contract information from the ticker symbol and add it to the dataframe.

Next, we restrict the data to European options and SPX Weeklys, which expire every Monday, Wednesday and Friday.

In [1]:
df.head(5)

NameError: name 'df' is not defined

Optionally, the dataframe can be exported to a CSV file.

### KDE Estimation of (Moneyness, Time to Maturity)

As discussed, our goal is to sample pairs (moneyness, time to maturity) and, more crucially, to sample more heavily from parameter regions with higher liquidity, as proxied by higher open interest. The idea is to fit a kernel density estimate (KDE) to our data using the scikit-learn package and then generate new samples from it.

In [None]:
# Reset the index so that it runs from 0 to len(df) - 1
df = df.reset_index(drop=True)

# Initiating a new df of appropriate size
total_interest = df['Open Int'].sum()
kde_df = pd.DataFrame(index = np.arange(0, total_interest), 
                      columns=['time to maturity (years)', 'moneyness'], 
                      dtype='float64')

# Filling the new df with entries according to their multiplicities 
# proxied by open interest
kde_index = 0

for i in df.index:
    
    mult = df.loc[i, 'Open Int']
    values = [df.loc[i, ['time to maturity (years)', 'moneyness']]]*mult
    kde_df.loc[kde_index:kde_index + mult-1] = values
    
    kde_index += mult  
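The row-by-row loop above can become slow when the total open interest is large. A minimal vectorized sketch of the same replication follows, assuming that 'Open Int' holds non-negative integers. The names `toy_df` and `toy_kde_df` are hypothetical stand-ins for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the option dataframe; column names follow the notebook.
toy_df = pd.DataFrame({
    'moneyness': [0.95, 1.00, 1.05],
    'time to maturity (years)': [0.02, 0.10, 0.20],
    'Open Int': [3, 0, 2],
})

cols = ['time to maturity (years)', 'moneyness']

# Repeat each row position 'Open Int' times, then select those rows at once.
repeats = toy_df['Open Int'].to_numpy(dtype=int)
row_positions = np.repeat(np.arange(len(toy_df)), repeats)
toy_kde_df = toy_df[cols].iloc[row_positions].reset_index(drop=True)

print(toy_kde_df)
```

Rows with zero open interest simply drop out, which matches the behaviour of the loop.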

#### Approach I : Seaborn visualization
Seaborn is a visualization library and includes a KDE Plot. 

In [None]:
fig, ax = plt.subplots()
ax.set_title('Bivariate KDE of moneyness and time to maturity')
x = kde_df['moneyness']
y = kde_df['time to maturity (years)']
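# Recent Seaborn releases (>= 0.11) expect keyword arguments for x and y;
# `fill` replaces the deprecated `shade`, and the lowest level is left unfilled by default.
ax = sns.kdeplot(x=x, y=y, cbar=True, fill=True, cmap='BuPu', ax=ax)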
fig.savefig("sns_kde_money_maturities_V3.pdf", dpi=400)

The KDE plot shows what one might expect from the earlier heatmap. Unfortunately, Seaborn does not return the KDE model it computes internally, so we cannot sample from it. Digging into the source code reveals that, at least in the Seaborn version this notebook was written against, it uses the statsmodels KDE estimator if statsmodels is installed and otherwise defaults to the SciPy KDE estimator. Our next attempt is therefore the statsmodels implementation.

#### Approach II: Statsmodels

In [None]:
from statsmodels.nonparametric.kernel_density import KDEMultivariate

In [None]:
X = kde_df[['moneyness', 'time to maturity (years)']]
kde = KDEMultivariate(X, 'cc', bw='normal_reference')
kde

This estimator is very fast, but it lacks two important features:
1. There is no method to sample from the fitted distribution.
2. There is no way to impose finite bounds on $(K,T)$. This is especially problematic for $T$, since the KDE extends into the negative domain and thus places probability mass on a meaningless region.
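To get a feeling for the second point, the following sketch computes how much mass a single one-dimensional Gaussian kernel places on $T<0$. It assumes a Gaussian kernel, and the maturity and bandwidth values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_mass_below_zero(t0: float, bandwidth: float) -> float:
    """Mass that a 1-D Gaussian kernel centred at t0 places on (-inf, 0)."""
    # Standard normal CDF evaluated at (0 - t0) / bandwidth.
    z = -t0 / bandwidth
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical values: an option five trading days from expiry
# and a bandwidth of 0.02 years.
leak = gaussian_mass_below_zero(t0=5 / 252, bandwidth=0.02)
print(f"share of kernel mass at T < 0: {leak:.3f}")
```

The closer a data point lies to $T=0$ relative to the bandwidth, the larger this leaked share becomes, up to one half for a point exactly at zero.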

#### Approach III: Scikit-Learn KDE
Unlike statsmodels, this KDE estimator provides a method to draw samples, but just like statsmodels it offers no way to impose bounds.

In [None]:
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
import statsmodels.nonparametric.api as smnp

data = kde_df[['moneyness', 'time to maturity (years)']].values

There are two options for bandwidth selection: Scott's rule or cross-validation. Run only one of the following two cells.

In [None]:
# Scott's rule (rule of thumb) for bandwidth selection
bw_x = smnp.bandwidths.bw_scott(df['moneyness'])
bw_y = smnp.bandwidths.bw_scott(df['time to maturity (years)'])

print('bw_x: {}, bw_y: {}'.format(bw_x, bw_y))

kde = KernelDensity(bandwidth=max(bw_x, bw_y))
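For reference, Scott's rule of thumb for a one-dimensional sample can be sketched in plain NumPy as below. This follows my understanding of what `bw_scott` computes (a spread estimate times $n^{-1/5}$), so treat the exact constants as an assumption rather than a specification. The helper `scott_bandwidth` and the synthetic sample are hypothetical.

```python
import numpy as np

def scott_bandwidth(x) -> float:
    """Scott's rule of thumb for a 1-D sample: 1.059 * A * n**(-1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    std = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    iqr_scaled = (q75 - q25) / 1.349
    # A is the smaller of the two spread estimates; use std if the IQR is zero.
    spread = min(std, iqr_scaled) if iqr_scaled > 0 else std
    return 1.059 * spread * n ** (-0.2)

# Synthetic moneyness-like sample, for illustration only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=0.05, size=1000)
bw = scott_bandwidth(sample)
print(f"Scott bandwidth: {bw:.4f}")
```

Since `KernelDensity` accepts a single scalar bandwidth, taking `max(bw_x, bw_y)` applies the coarser of the two per-dimension bandwidths to both coordinates.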

In [None]:
import sys

# Use grid-search cross-validation to optimize the bandwidth
params = {'bandwidth': np.logspace(-3, -1, 5)}
grid = GridSearchCV(KernelDensity(), params, verbose=sys.maxsize, n_jobs=-1)
grid.fit(data)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))

kde = grid.best_estimator_

Fit the KDE to the data.

In [None]:
kde.fit(data)
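As a possible alternative to replicating rows by open interest, `KernelDensity.fit` accepts a `sample_weight` argument. The sketch below, on hypothetical toy data with a fixed and arbitrarily chosen bandwidth, suggests that a weighted fit on the unique rows yields the same density as a fit on the replicated rows. The weights are assumed to be strictly positive, so rows with zero open interest would have to be dropped first.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy stand-ins for (moneyness, time to maturity) and open interest.
points = np.array([[0.95, 0.02], [1.00, 0.10], [1.05, 0.20]])
open_interest = np.array([3, 1, 2])

# Weighted fit on the unique rows.
kde_weighted = KernelDensity(bandwidth=0.05).fit(points, sample_weight=open_interest)

# Fit on rows replicated by open interest, as done in this notebook.
replicated = np.repeat(points, open_interest, axis=0)
kde_replicated = KernelDensity(bandwidth=0.05).fit(replicated)

# Compare the log-densities at two arbitrary evaluation points.
eval_points = np.array([[0.97, 0.05], [1.02, 0.15]])
log_density_weighted = kde_weighted.score_samples(eval_points)
log_density_replicated = kde_replicated.score_samples(eval_points)
print(log_density_weighted, log_density_replicated)
```

The weighted variant keeps memory usage proportional to the number of distinct contracts rather than to the total open interest.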

Naturally, the KDE extends beyond the input region we considered previously. We therefore use rejection sampling below to draw the new samples: draws that fall outside the region are discarded and redrawn.
Recall that we want $0.75 \leq \text{moneyness} \leq 1.2$ and $0 \leq \text{time to maturity} \leq 0.25$.

In [None]:
nb_samples = 10**6
new_data = pd.DataFrame(index=np.arange(nb_samples),
                        columns=['moneyness', 'time to maturity (years)'],
                        dtype='float64')

valid_counter = 0

while valid_counter < nb_samples:
    
    rem = nb_samples - valid_counter
    
    # Generate new samples from estimated probability density.
    raw = kde.sample(rem)
    
    # Identify valid samples
    is_valid = (raw[:,0]>0.75) & (raw[:,0]<1.2) & (raw[:,1]>0) & (raw[:,1]<0.25)
    valid_samples = raw[is_valid]
    nb_valid_samples = valid_samples.shape[0]
    
    # Writing to df
    new_data.loc[valid_counter: valid_counter + nb_valid_samples - 1 ,:] = valid_samples

    valid_counter += nb_valid_samples
    
# Save (moneyness,time to maturity (years)) to disk
filepath = 'raw_data/money_maturities.csv'
new_data.to_csv(filepath)