# CFM Volatility Prediction in Financial Markets

This notebook implements a very basic linear model for the CFM Volatility Prediction Challenge (https://challengedata.ens.fr/en/challenge/34/volatility_prediction_in_financial_markets.html). The public score of this model is **23.3842**.

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline

## Load the data

In [2]:
train_X = pd.read_csv('training_input.csv', sep=';')
train_y = pd.read_csv('training_output.csv', sep=';')
test_X  = pd.read_csv('testing_input.csv', sep=';')

## Impute missing values

Since this simple model does not use the returns, we simply drop those columns:

In [3]:
train_X = train_X.drop(train_X.columns[57:], axis=1)
test_X  = test_X.drop(test_X.columns[57:], axis=1)

We linearly interpolate the NaNs in the volatility columns, row by row:

In [4]:
train_X.iloc[:,3:] = train_X.iloc[:,3:].interpolate(axis=1)
test_X.iloc[:,3:]  = test_X.iloc[:,3:].interpolate(axis=1)

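Gaps at the start of a row (from 09:30 onward) have no earlier value to interpolate from and remain NaN, whereas pandas' default forward-direction interpolation already fills trailing gaps (up to 13:55) with the last valid value. We replace whatever NaNs remain by 0: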

In [5]:
train_X.fillna(0, inplace=True) 
test_X.fillna(0, inplace=True)
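
The toy frame below (made-up labels and values) walks through both imputation steps and shows which gaps each one fills. Note in particular that pandas' default `interpolate` leaves a leading NaN in place but carries the last valid value into a trailing NaN.

```python
import numpy as np
import pandas as pd

# Toy volatility rows; the time labels and values are illustrative only.
toy = pd.DataFrame(
    [[np.nan, 1.0, np.nan, 3.0, np.nan],
     [2.0, np.nan, np.nan, 5.0, 6.0]],
    columns=['09:30', '09:35', '09:40', '09:45', '09:50'],
)

# Step 1: linear interpolation along each row (axis=1).
# Interior gaps are interpolated; the trailing gap in row 0 takes the
# last valid value; the leading gap in row 0 stays NaN.
filled = toy.interpolate(axis=1)

# Step 2: whatever is still NaN (here, the leading gap) becomes 0.
filled = filled.fillna(0)

print(filled)
```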

## Feature engineering

We create a few very basic summary features (minimum, maximum, standard deviation, and median):

In [6]:
train_X['min_vol']    = np.min(train_X.iloc[:,3:], axis=1)
train_X['max_vol']    = np.max(train_X.iloc[:,3:], axis=1)
train_X['std_vol']    = np.std(train_X.iloc[:,3:], axis=1)
train_X['median_vol'] = np.median(train_X.iloc[:,3:], axis=1)

In [7]:
test_X['min_vol']    = np.min(test_X.iloc[:,3:], axis=1)
test_X['max_vol']    = np.max(test_X.iloc[:,3:], axis=1)
test_X['std_vol']    = np.std(test_X.iloc[:,3:], axis=1)
test_X['median_vol'] = np.median(test_X.iloc[:,3:], axis=1)
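
One subtlety in the two cells above: each `iloc[:, 3:]` slice is re-evaluated after the previous feature column has been appended, so `std_vol` and `median_vol` are computed over the volatilities plus the features added before them. The scores reported in this notebook were obtained with that behavior. If the intent is to summarize only the volatility columns, one possible variant takes a fixed snapshot first. The helper name `add_vol_features` and its `first_vol_col` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_vol_features(df, first_vol_col=3):
    """Append row-wise summary features computed from a fixed set of columns.

    Hypothetical helper: assumes every column from `first_vol_col` onward
    is a volatility column at the time of the call.
    """
    # Snapshot the volatility values before any feature column is added.
    vols = df.iloc[:, first_vol_col:].to_numpy(dtype=float)
    out = df.copy()
    out['min_vol'] = vols.min(axis=1)
    out['max_vol'] = vols.max(axis=1)
    out['std_vol'] = vols.std(axis=1)            # population std (ddof=0), NumPy's default
    out['median_vol'] = np.median(vols, axis=1)
    return out

# Toy example: three metadata columns followed by three volatilities.
toy = pd.DataFrame({'ID': [1], 'date': [1], 'product_id': [7],
                    'v1': [1.0], 'v2': [2.0], 'v3': [6.0]})
features = add_vol_features(toy)
print(features)
```

Switching to such a helper would change the feature values, so the validation errors and the public score would need to be recomputed.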

## Validation

In [8]:
def MAPE(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
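
As a quick sanity check of the metric, the sketch below evaluates the same formula on made-up numbers. Because each error is divided by `y_true`, a target of exactly zero would cause a division by zero, so the metric presumes nonzero targets.

```python
import numpy as np

def MAPE(y_true, y_pred):
    # Mean absolute percentage error, in percent (same formula as above).
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_true = np.array([100.0, 200.0, 50.0])   # illustrative values only
y_pred = np.array([110.0, 180.0, 50.0])

# Relative errors are 10%, 10% and 0%, so the mean is about 6.67%.
print(round(MAPE(y_true, y_pred), 2))
```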

In [9]:
from sklearn.model_selection import train_test_split

train_X = train_X.merge(train_y, on='ID')
train_X.drop(['ID', 'date', 'product_id'], axis=1, inplace=True)

train_X_, test_X_, train_y_, test_y_ = train_test_split(train_X.iloc[:,:-1], train_X['TARGET'], test_size=0.2, random_state=42)

### Weighted Linear Regression

In [10]:
from sklearn.linear_model import LinearRegression

regrLinWeighted = LinearRegression()
regrLinWeighted.fit(train_X_, train_y_, sample_weight=(1./train_y_))

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
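
Passing `sample_weight` makes the fit minimize a weighted sum of squared residuals, so weights of `1/TARGET` reduce the influence of high-volatility rows, which is broadly in the spirit of a relative-error metric such as MAPE. The NumPy-only sketch below illustrates that objective on synthetic data. It is an illustration of weighted least squares in general, not a reproduction of scikit-learn's internals, and all names and values in it are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, strictly positive targets (illustrative only).
x = rng.uniform(1.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)
w = 1.0 / y                      # same weighting scheme as in the cell above

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(x), x])

# Weighted least squares: minimize sum_i w_i * (y_i - A_i @ beta)^2.
# Scaling each row by sqrt(w_i) turns this into an ordinary least-squares problem.
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# Cross-check against the weighted normal equations (A^T W A) beta = A^T W y.
beta_ne = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (w * y))

print(beta_wls, beta_ne)
```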

In [11]:
print('Train error =', round(MAPE(train_y_, regrLinWeighted.predict(train_X_)), 4), '%')
print('Test error  =', round(MAPE(test_y_, regrLinWeighted.predict(test_X_)), 4), '%')

Train error = 23.577 %
Test error  = 23.8831 %


## Prediction

In [12]:
from sklearn.linear_model import LinearRegression

regrLinWeighted = LinearRegression()
regrLinWeighted.fit(train_X.iloc[:,:-1], train_X['TARGET'], sample_weight=(1./train_X['TARGET']))

test_X['PREDICT'] = regrLinWeighted.predict(test_X.iloc[:,3:])

In [13]:
# Keep only the ID and the prediction, renamed to the column name used in training_output.csv
predictions = test_X[['ID', 'PREDICT']].rename(columns={'PREDICT': 'TARGET'})

In [14]:
predictions.to_csv('predictions.csv', sep=';', index=False)