<div class="page-header"><h1 class="alert alert-info">Data Camp: Stock Prediction<br/>
<small>Gustavo Castro, Lucas Furquim, Francisco Ribeiro, Alvaro Serra</small></h1></div>

<h1 class="alert alert-success">Introduction</h1>

Building and managing good <a href = "http://www.investopedia.com/terms/p/portfoliomanagement.asp"> portfolios </a> is a central task in the financial world, especially in the Quantitative Asset Management sector. In this context, one needs to decide which portfolio allocation is likely to give the best future return.

<img src="Image/SP500.jpg">

To do so, it is vital to be able to forecast the behavior of stocks and their variances. For more detailed explanations, please refer to <a href = "http://pubsonline.informs.org/doi/abs/10.1287/mnsc.2013.1838" > Risk Premium forecast </a> and <a href ="http://cims.nyu.edu/~almgren/timeseries/Vol_Forecast1.pdf"> GARCH Model </a>.

The goal of this challenge is to predict the <a href = "http://www.investopedia.com/terms/s/sp500.asp"> SP500 index </a>  behavior using some market data and to mine the different interactions this index might have with the proposed features.

Other sources of information and data include <a href = "https://www.bloomberg.com/markets/stocks"> Bloomberg </a> and <a href = "http://finance.yahoo.com/"> Yahoo Finance </a>.


<h1 class="alert alert-success"> Prediction task</h1>

The goal is to predict the index values for the year 2015. To do so, more than a hundred years of historical data are provided.

<img src="Image/indXtime.png">

The student is completely free to define the prediction model and to choose the time interval used to calibrate it.

As always in machine learning, we stress the importance of a proper analysis of the features: their relevance, their meaning, and their impact on the prediction. To encourage this work, we propose some new features in the Data section of this notebook, and we strongly advise selecting features carefully according to their relevance and creating any others that the student may find useful.
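One simple way to start such an analysis is to rank candidate features by their linear correlation with the target. The sketch below is only an illustration on synthetic data: the column names `Index` and `Dividend` and the derived `Index_roll12` feature are hypothetical, and correlation is a crude relevance measure that misses non-linear effects.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series; 'Index' and 'Dividend' are hypothetical column names.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "Index": 100 + np.cumsum(rng.normal(0, 1, n)),
    "Dividend": rng.normal(2, 0.1, n),
})

# Derived candidate built only from past values, to avoid look-ahead:
# the mean of the previous 12 months.
df["Index_roll12"] = df["Index"].shift(1).rolling(12).mean()

# Rank candidates by absolute Pearson correlation with the target.
# Rows where a feature is NaN are ignored for that feature.
features = ["Dividend", "Index_roll12"]
relevance = df[features].corrwith(df["Index"]).abs().sort_values(ascending=False)
print(relevance)
```

Features that score poorly here are not necessarily useless, since tree-based models can exploit interactions that a linear correlation does not show.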



<h1 class="alert alert-success"> Imports and Initial Setup </h1>

## Tools & Setup

- *The simple way*: Install the Anaconda python distribution https://www.continuum.io/downloads
- *The fine-grained way:* Install each of the following tools
    - Python
    - Jupyter
    - Scikit-learn
    - Pandas

## Imports


In [None]:
import numpy  as np
import pandas as pd

<h1 class="alert alert-success"> Data </h1>

A dataset of the monthly values of the SP500 index from January 1871 to December 2015 is provided. A detailed description is given below.

It is also important to be aware of NaN values in the dataset, especially in older periods (before 1900). We thus strongly suggest that the student initially ignore data older than January 1950 to avoid early problems with empty and NaN cells.
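A minimal sketch of that filtering step follows. It assumes that the `yyyymm` column (the one used by the feature extractor later in this notebook) stores dates as integers such as `195001`. The toy frame and the `Index` column name are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; 'Index' is a hypothetical column name.
toy = pd.DataFrame({
    "yyyymm": [187101, 194912, 195001, 195002, 201512],
    "Index": [np.nan, 16.8, 17.0, np.nan, 2043.9],
})

# Keep only rows from January 1950 onwards
# (assumes yyyymm is an integer of the form YYYYMM).
recent = toy[toy["yyyymm"] >= 195001].reset_index(drop=True)

# Count the remaining NaN cells per column before deciding how to handle them.
nan_counts = recent.isna().sum()
print(recent)
print(nan_counts)
```

The same two lines can be applied to `train_brute` or `train_treated` once they are loaded, provided the column has the assumed format.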


<h1 class="alert alert-success">Data description</h1>

The following table describes the fields of the dataset.

In [None]:
meta_brute = pd.read_csv('Data/BruteMetaData.csv')
meta_brute

In [None]:
train_brute = pd.read_csv('Data/BruteTrainData.csv')
train_brute.head(10)

In [None]:
train_brute.tail(10)

In [None]:
train_brute.describe()

We found that the following treated dataset can also be very useful.

The following table describes the columns of the treated dataset.

In [None]:
meta_treated = pd.read_csv('Data/TreatedMetaData.csv')
meta_treated

In [None]:
train_treated = pd.read_csv('Data/TreatedTrainData.csv')
train_treated.head(10)

In [None]:
train_treated.tail(10)

In [None]:
train_treated.describe()

<h1 class="alert alert-success">The prediction model</h1>

We are going to follow the scikit-learn API specs in order to define a `FeatureExtractor` and a `Regressor`.

## The feature extractor

In <code>feature_extractor.py</code> you will define a class called <code>FeatureExtractor</code>. Its main <code>transform</code> method takes a pandas <b>DataFrame</b> and outputs a <b>numpy array</b>.

- The `FeatureExtractor` inherits from `TransformerMixin`.
- It implements a `fit` (optional) and a `transform` method. 

In [None]:
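import pandas as pd
from sklearn.base import TransformerMixin

class FeatureExtractor(TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X_df, y=None):
        # Nothing to learn: the transformation is stateless.
        return self

    def transform(self, X_df):
        # Work on a copy so the caller's DataFrame is not modified.
        X_df = X_df.copy()
        # Rescale yyyymm to [0, 1]: shift so the first row is 0,
        # then divide by the shifted value of the last row.
        yyyymm = X_df['yyyymm'].astype(float)
        yyyymm = yyyymm - yyyymm.iloc[0]
        X_df['yyyymm'] = yyyymm / yyyymm.iloc[-1]
        return X_df.values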
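As a hedged illustration of how this interface can be extended, the sketch below defines a variant that appends the previous month's value of one column as an extra feature. The class name `LagFeatureExtractor`, its `column` parameter, and the `Index` column are hypothetical. Whether the real input DataFrame contains such a column is an assumption.

```python
import pandas as pd
from sklearn.base import TransformerMixin


class LagFeatureExtractor(TransformerMixin):
    """Hypothetical variant: appends the previous row's value of one column."""

    def __init__(self, column="Index"):
        self.column = column  # hypothetical column name

    def fit(self, X_df, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X_df):
        X_new = X_df.copy()  # leave the caller's DataFrame untouched
        # The first row has no predecessor, so its lag is NaN.
        X_new[self.column + "_lag1"] = X_new[self.column].shift(1)
        return X_new.values


toy = pd.DataFrame({"yyyymm": [201501, 201502, 201503],
                    "Index": [2000.0, 2100.0, 2050.0]})
X = LagFeatureExtractor().fit(toy).transform(toy)
print(X.shape)
```

The NaN produced in the first row is left in place here. A regressor that imputes missing values, as the one defined in the next section does, can handle it downstream.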

## The regressor

- The `Regressor` inherits from `BaseEstimator`.
- The `__init__()` function initializes all of the arguments and configurations.
- The regressor must implement a `fit()` and a `predict()` function.

In [None]:
from sklearn.base import BaseEstimator
from sklearn.pipeline import make_pipeline
# SimpleImputer (scikit-learn >= 0.20) replaces the older preprocessing.Imputer.
from sklearn.impute import SimpleImputer
from sklearn.ensemble import ExtraTreesRegressor


class Regressor(BaseEstimator):
    def __init__(self):
        # Fill missing values with the column median, then fit the trees.
        self.reg = make_pipeline(
            SimpleImputer(strategy='median'),
            ExtraTreesRegressor(n_estimators=10))

    def fit(self, X, y):
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        return self.reg.predict(X)

## Unit testing

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. For this we
provide a unit test. Place the Python file <code>regressor.py</code>, the data <code>public_train.csv</code>, and
<code>user_test_submission.py</code> in a directory and run

<code>python user_test_submission.py</code>

If it runs and prints 
<code>
rmse =  [some_number]
rmse =  [some_number]
</code>
you can submit the code.

In [None]:
!python user_test_submission.py
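The contents of `user_test_submission.py` are not shown in this notebook. As a rough idea of the kind of check that prints an RMSE, the sketch below fits a model on the first part of a time-ordered dataset and evaluates it on the held-out tail. The synthetic data and the 80/20 chronological split are assumptions made for illustration, not the actual test procedure.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: 200 time-ordered rows, 3 features (purely illustrative).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Chronological split: fit on the first 80%, evaluate on the last 20%.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = ExtraTreesRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Root mean squared error on the held-out tail.
rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
print("rmse = ", rmse)
```

Splitting by position rather than at random keeps future observations out of the training set, which matters when the goal is to forecast a later period.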