# KENDAXA ASSIGNMENT - MACHINE LEARNING POSITION
## S&P500 Regression - *by Jan Kořínek*

### Deliverable goals
#### Regression task

Your goal is to perform exploratory data analysis (EDA)
and to train and compare a few models on a regression task.
Your task is to predict day-ahead daily volumes of the S&P
500 index using any available information from the past;
i.e., you are going to predict the volume $v_{t+1}$ using the
information available on days $t, t-1, \dots$.
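
The day-ahead setup above can be sketched as a supervised dataset in which each row holds the volumes from days $t, t-1, \dots$ and the target is $v_{t+1}$. This is only an illustrative sketch: the helper name `make_supervised`, the number of lags, and the toy values are assumptions, not part of the assignment.

```python
import pandas as pd


def make_supervised(df: pd.DataFrame, target: str = "Volume", n_lags: int = 3) -> pd.DataFrame:
    """Build features from days t, t-1, ... and the target v_{t+1} (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    for lag in range(n_lags):
        # lag 0 is the value on day t, lag 1 the value on day t-1, and so on
        out[f"{target}_lag{lag}"] = df[target].shift(lag)
    # shift(-1) places tomorrow's volume on today's row, so it is the prediction target
    out["target_next"] = df[target].shift(-1)
    # Rows without a full history or without a next-day value are dropped
    return out.dropna()


# Toy data; the values are illustrative only
toy = pd.DataFrame(
    {"Volume": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]},
    index=pd.bdate_range("2016-01-04", periods=6),
)
supervised = make_supervised(toy, n_lags=2)
print(supervised)
```

Because every feature column is shifted backwards in time and only the target is shifted forwards, no row uses information from after day $t$.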

Evaluate the models' performance on out-of-sample data
using data from 2017 and 2018 (i.e., January 1st, 2017 –
December 31st, 2018).
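
One way to hold out the 2017–2018 period is a purely date-based split, so that the test rows never influence training. The sketch below assumes a DataFrame with a timezone-naive `DatetimeIndex`; the function name `split_by_date` and the toy frame are hypothetical.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, test_start: str = "2017-01-01", test_end: str = "2018-12-31"):
    """Return (train, test): train is strictly before test_start, test is the closed date range."""
    start, end = pd.Timestamp(test_start), pd.Timestamp(test_end)
    train = df.loc[df.index < start]
    test = df.loc[(df.index >= start) & (df.index <= end)]
    return train, test


# Toy business-day index spanning the train/test boundary
idx = pd.bdate_range("2016-12-26", "2017-01-06")
toy = pd.DataFrame({"Volume": range(len(idx))}, index=idx)
train, test = split_by_date(toy)
print(train.index.max(), test.index.min())
```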

Do not forget that you can (and should) use data from outside
the series itself, for example a calendar of known events.

Since your goal is to evaluate and compare several
models and to find the best one, you have to use some
kind of cross-validation, as the dataset is quite small (which
is very common for real-world datasets).
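
For time-ordered data, ordinary shuffled k-fold would let a model train on the future, so an expanding-window scheme is a common choice. The generator below is a minimal sketch of that idea in plain NumPy; the function name and fold sizing are assumptions. scikit-learn's `TimeSeriesSplit` offers a similar scheme off the shelf.

```python
import numpy as np


def expanding_window_splits(n_samples: int, n_splits: int = 3):
    """Yield (train_idx, val_idx) pairs in which validation always follows training in time."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of splits")
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_idx = np.arange(0, val_start)  # everything before the validation block
        val_idx = np.arange(val_start, val_start + fold_size)
        yield train_idx, val_idx


splits = list(expanding_window_splits(10, n_splits=3))
for train_idx, val_idx in splits:
    print(train_idx.tolist(), val_idx.tolist())
```

Each successive fold trains on a longer history, which mirrors how the model would be retrained as new days arrive.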

If you find it applicable, use statistical tests in the EDA
and comparison to distinguish.

#### Report and scope

You are required to write a brief report in PDF format
(use of LaTeX is recommended) summarizing the approaches
and presenting the results for all three subtasks. It is
recommended to use figures and plots where they help you
make your point. The report should contain all the necessary
details to understand what approach you have undertaken,
what were the results and how you interpret them.

Your report should summarize the main results of your EDA,
but it is sufficient to have the details of the EDA only in the
Jupyter notebook. Briefly (very briefly) introduce the models
used. Compare the models with respect to more than one
metric, explaining when each metric is preferable.
You should also state your trust in the
individual models — e.g., that even if some model gives you
very good results, you still might not trust it because it is
sensitive to the data changes. Compare the models also with
respect to their robustness and interpretability. Interpret the
few models you select as your top-ranking candidates and
show which features they rely on the most. Where
applicable, perform formal statistical tests to support your
results.
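
As an illustration of comparing models on more than one metric, the sketch below computes MAE, RMSE, and MAPE with NumPy. The choice of these three metrics and the helper name `regression_metrics` are assumptions; the assignment does not prescribe specific metrics. MAE is less sensitive to occasional large errors, RMSE penalizes them more heavily, and MAPE is scale-free but undefined when a true value is zero.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE and MAPE (as a fraction) for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # Assumes no true value is zero, which holds for traded volumes on open days
        "MAPE": float(np.mean(np.abs(err / y_true))),
    }


# Toy numbers, illustrative only
scores = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(scores)
```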

Please also state the limitations of your work and the directions
in which it can be expanded; it is expected that you
will not be able to exhaust all possible approaches in the
limited time. Please state which of the possible expansions
are most promising and why.

The report is expected to have about 5–12 pages when using
two-column format with figures but there are no hard limits
as the completeness of the presented information is the
goal (as long as there are no empty sentences or fillers, the
length will not be evaluated).

### Content

### 1. S&P500 dataset preparation and merging with calendar events

In [4]:
# Load and extract data from raw CSVs into dataframes for S&P500 and relevant events
%run lib/prepare_dataset.py

# Show processed df
sp500_calendar

Processing raw data...


Dataset processing finished in: 0:00:20


| Date | Open | High | Low | Close | Volume | ADP Nonfarm Employment Chang | Building Permit | CB Consumer Confidenc | Core CP | Core Durable Goods Order | ... | Initial Jobless Claim | JOLTs Job Opening | New Home Sale | Nonfarm Payroll | PP | Pending Home Sale | Philadelphia Fed Manufacturing Inde | Retail Sale | US Federal Budge | Unemployment Rat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1990-01-02 | 353.40 | 359.69 | 351.98 | 359.69 | 1.620700e+08 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-03 | 359.69 | 360.59 | 357.89 | 358.76 | 1.923300e+08 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-04 | 358.76 | 358.76 | 352.89 | 355.67 | 1.770000e+08 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-05 | 355.67 | 355.67 | 351.35 | 352.20 | 1.585300e+08 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 96000.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-08 | 352.20 | 354.24 | 350.54 | 353.79 | 1.401100e+08 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 96000.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2012-07-01 | 4699.26 | 4718.50 | 4681.32 | 4697.53 | 3.491150e+09 | 0.0 | 1589000.0 | 90.6 | 0.2 | 0.4 | ... | 349100.0 | 3790000.0 | 619000.0 | 162000.0 | 0.5 | -2.4 | -5.5 | -0.7 | -1.837000e+09 | 9.7 |
| 2012-09-01 | 4699.26 | 4718.50 | 4681.32 | 4697.53 | 3.491150e+09 | 0.0 | 1589000.0 | 90.6 | 0.2 | 0.4 | ... | 349100.0 | 3640000.0 | 619000.0 | 162000.0 | 0.5 | -2.4 | -5.5 | -0.7 | -1.837000e+09 | 9.7 |
| 2012-12-01 | 4699.26 | 4718.50 | 4681.32 | 4697.53 | 3.491150e+09 | 0.0 | 1589000.0 | 90.6 | 0.2 | 0.4 | ... | 349100.0 | 3740000.0 | 619000.0 | 162000.0 | 0.5 | -2.4 | -5.5 | -0.7 | -1.837000e+09 | 9.7 |
| 2013-06-01 | 4699.26 | 4718.50 | 4681.32 | 4697.53 | 3.491150e+09 | 0.0 | 1589000.0 | 90.6 | 0.2 | 0.4 | ... | 349100.0 | 3830000.0 | 619000.0 | 162000.0 | 0.5 | -2.4 | -5.5 | -0.7 | -1.837000e+09 | 9.7 |
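
The contents of `lib/prepare_dataset.py` are not shown here, so the following is only a sketch of how a merge of this shape could be produced. It assumes the event values are left-joined onto the trading days and then forward-filled; that assumption is consistent with `Nonfarm Payroll` appearing on 1990-01-05 and persisting on 1990-01-08 in the output above. The `Volume` and `Nonfarm Payroll` values are copied from that output; everything else is hypothetical.

```python
import pandas as pd

# Daily volumes for the first five trading days shown in the output above
prices = pd.DataFrame(
    {"Volume": [1.6207e8, 1.9233e8, 1.77e8, 1.5853e8, 1.4011e8]},
    index=pd.to_datetime(
        ["1990-01-02", "1990-01-03", "1990-01-04", "1990-01-05", "1990-01-08"]
    ),
)

# A single calendar event, indexed by its release date
events = pd.DataFrame(
    {"Nonfarm Payroll": [96000.0]},
    index=pd.to_datetime(["1990-01-05"]),
)

# Left join keeps every trading day; days without a release get NaN
merged = prices.join(events, how="left")

# Forward fill carries the last known release forward and never looks ahead
merged["Nonfarm Payroll"] = merged["Nonfarm Payroll"].ffill()
print(merged)
```

Forward filling uses only values already published on or before each day, so it does not leak future information into the features.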
