## 1. Define the Problem Appropriately

#### 1.1 What is the main objective? What are we trying to predict?

The main objectives of this project are to:
- Achieve returns greater than the Brazilian market benchmark, the Ibovespa, over long periods (>10 years)
- Predict the values of a company's financial indicators for the following quarter
- Predict the stock price at the opening of the month following the quarterly release

#### 1.2 What are the target features?

Company financial indicators (LPA, VPA, EBIT, ...)

#### 1.3 What is the input data? Is it available?

All input data is numeric and is available from Economatica.

#### 1.4 What kind of problem are we facing? Binary classification? Clustering?

Regression: predicting a continuous value inside an open interval.

#### 1.5 What is the expected improvement?
#### 1.6 What is the current status of the target feature?
#### 1.7 How is the target feature going to be measured?


*Note: Not every problem can be solved. Until we have a working model, we can only state certain hypotheses:*
*- Our outputs can be predicted given the inputs.*
*- Our available data is sufficiently informative to learn the relationship between the inputs and the outputs.*
*It is crucial to keep in mind that machine learning can only memorize patterns that are present in the training data, so it can only recognize what it has seen before. When using machine learning, we assume that the future will behave like the past, and this isn’t always true.*

## 2. Collect Data

Collecting data is the first real step towards developing a machine learning model. It is a critical step that determines how good the model can be: the more and better data we collect, the better the model is likely to perform.

**This project's data was gathered from Economatica's website, which in turn collects it from the financial statements that each company releases every quarter.**

## 3. Choose a Measure of Success

Peter Drucker, management consultant and author of The Effective Executive and Managing Oneself, is often credited with a famous saying:

*“If you can’t measure it you can’t improve it”.*

If you want to control something, it must be observable, and in order to achieve success it is essential to define what counts as success: precision? Accuracy? Customer-retention rate?

This measure should be directly aligned with the higher-level goals of the business at hand. It is also directly related to the kind of problem we are facing:

* Regression problems use certain evaluation metrics such as mean squared error (MSE).
* Classification problems use evaluation metrics such as precision, accuracy and recall.

---
**This project aims to predict a company's stock price as closely as possible, so a good measure of success could be MSE or a similar error metric.**
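
As a minimal sketch of what an MSE-style metric computes, the snippet below evaluates the MSE and its square root (RMSE) with NumPy. The price values are made up for illustration and are not project data.

```python
import numpy as np

# Hypothetical actual and predicted stock prices (illustrative values only)
actual = np.array([10.0, 12.0, 11.0, 13.0])
predicted = np.array([11.0, 12.0, 9.0, 13.0])

# Mean squared error: the average of the squared differences
mse = np.mean((actual - predicted) ** 2)

# RMSE is expressed in the same unit as the price, which makes it easier to read
rmse = np.sqrt(mse)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Because the errors are squared, one large miss weighs more than several small ones, which may or may not be desirable for price prediction.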

## 4. Setting an Evaluation Protocol

Once the goal is clear, we should decide how progress towards it is going to be measured. The most common evaluation protocols are:

#### 4.1 Maintaining a Hold Out Validation Set
This method consists of setting apart one portion of the data as the test set and another as the validation set.
The process is to train the model on the remaining fraction of the data, tune its hyperparameters with the validation set, and finally evaluate its performance on the test set.
The reason for splitting the data into three parts is to avoid information leaks: every tuning decision made with the validation set leaks some information about it into the model, so only a test set that was never used gives an unbiased estimate. The main drawback of this method is that, if little data is available, the validation and test sets will contain so few samples that tuning and evaluating the model will not be effective.
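
A three-way hold-out split could be sketched as follows, assuming scikit-learn is available. The data is synthetic and the 60/20/20 proportions are an arbitrary choice for illustration. Note that `train_test_split` shuffles by default, which suits non-temporal data only; passing `shuffle=False` keeps the original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 3 features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# First set aside 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into training (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```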

#### 4.2 K-Fold Validation
K-Fold consists of splitting the data into K partitions of equal size. For each partition i, the model is trained on the remaining K-1 partitions and evaluated on partition i.
The final score is the average of the K scores obtained. This technique is especially helpful when the performance of the model varies significantly depending on the train-test split.
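
The procedure can be sketched with scikit-learn's `KFold`. The linear model and the synthetic data below are placeholders chosen for illustration, not the model used in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

kfold = KFold(n_splits=5)  # K = 5 partitions
scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on K-1 partitions, evaluate on the partition left out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    scores.append(mean_squared_error(y[test_idx], predictions))

# The final score is the average of the K scores
final_score = float(np.mean(scores))
print(f"Mean MSE over {len(scores)} folds: {final_score:.4f}")
```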

#### 4.3 Iterated K-Fold Validation with Shuffling
This technique is especially relevant when little data is available and the model needs to be evaluated as precisely as possible (it is the standard approach in Kaggle competitions).
It consists of applying K-Fold validation several times, shuffling the data every time before splitting it into K partitions. The final score is the average of the scores obtained at the end of each run of K-Fold validation.
This method can be very computationally expensive, as I x K models are trained and evaluated, where I is the number of iterations and K is the number of partitions.
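
scikit-learn offers `RepeatedKFold` for this protocol. The sketch below only counts the train/evaluate rounds on a toy array, to make the I x K cost concrete. The values of I and K are arbitrary.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data: 10 samples with 2 features (illustrative only)
X = np.arange(20).reshape(10, 2)

n_iterations, n_partitions = 3, 5  # I and K in the text above
rkf = RepeatedKFold(
    n_splits=n_partitions, n_repeats=n_iterations, random_state=0
)

# Each element is a (train_indices, test_indices) pair;
# the data is reshuffled before every repetition
splits = list(rkf.split(X))
print(len(splits))  # one model would be trained per split
```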

*Note: It is crucial to keep in mind the following points when choosing an evaluation protocol:*

* In classification problems, both training and testing data should be representative of the whole dataset, so we should shuffle the data before splitting it to make sure that each set covers the full spectrum of the dataset.
* When trying to predict the future given the past (weather prediction, stock price prediction…), the data should not be shuffled: the order of the data is a crucial feature, and shuffling would create a temporal leak.
* We should always check for duplicates in our data and remove them. Otherwise the redundant data may appear in both the training and testing sets and cause inaccurate learning in our model.
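
The duplicate check in the last point can be sketched with pandas. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical quarterly records; the 2020Q2 row appears twice
df = pd.DataFrame({
    "quarter": ["2020Q1", "2020Q2", "2020Q2", "2020Q3"],
    "ebit": [100.0, 120.0, 120.0, 90.0],
})

# Count rows that are exact copies of an earlier row
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and rebuild the index
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```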

---

**As this project uses time series, the order of the values is important, so the best evaluation protocol is *4.1 Maintaining a Hold Out Validation Set*, splitting the data 80/20 chronologically: the first 80% for training and the last 20% for testing.**
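
A chronological 80/20 split needs no shuffling, only slicing by position. This is a minimal sketch assuming the data is already sorted by date. The quarterly index and the `price` column are hypothetical.

```python
import pandas as pd

# Hypothetical quarterly series, already sorted from oldest to newest
dates = pd.period_range("2015Q1", periods=20, freq="Q")
dataset = pd.DataFrame({"price": range(20)}, index=dates)

# The first 80% of the rows are used for training, the last 20% for testing
split_point = int(len(dataset) * 0.8)
train = dataset.iloc[:split_point]
test = dataset.iloc[split_point:]

print(len(train), len(test))
```

Every test observation is later than every training observation, so the model is never evaluated on data that precedes what it was trained on.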

## 5. Preparing The Data
Before beginning to train models, we should transform our data into a form that can be fed into a machine learning model. The most common techniques are:

### 5.1 Dealing with missing data
It is quite common in real-world problems for some values in our data samples to be missing. This may be due to errors in data collection, blank fields in surveys, measurements that are not applicable, etc.
Missing values are typically represented with the “NaN” or “Null” indicators. The problem is that most algorithms can’t handle missing values, so we need to take care of them before feeding data to our models. Once they are identified, there are several ways to deal with them:
* Eliminating the samples or features with missing values (at the risk of deleting relevant information or too many samples).
* Imputing the missing values with a pre-built estimator such as the `SimpleImputer` class from scikit-learn (which replaced the older `Imputer` class). We fit it on our data and then transform the data to fill in the estimates. One common approach is to replace each missing value with the mean of the observed values of that feature.
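
Mean imputation could look like the sketch below, which uses `SimpleImputer` from `sklearn.impute` on a small made-up array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one missing value per column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```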

---
For this project, *imputation* was chosen: missing values that lie between known values are filled in by linear interpolation.
> `dataset.interpolate(method='linear', axis=0, inplace=True)`
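
To illustrate what this call does, here is a small sketch on a made-up column. It uses the assignment form rather than `inplace=True`, and the `ebit` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator with two consecutive missing quarters
dataset = pd.DataFrame({"ebit": [100.0, np.nan, np.nan, 130.0, 140.0]})

# Fill each gap along the rows with evenly spaced values
# between the surrounding known points
dataset = dataset.interpolate(method="linear", axis=0)

print(dataset["ebit"].tolist())
```

With the default settings, missing values at the very start of a column are left as NaN, since there is no earlier point to interpolate from.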

### 5.2 Handling Categorical Data

This project has no categorical data.

### 5.3 Feature Scaling

This is a crucial step in the preprocessing phase, as the majority of machine learning algorithms perform much better when the features are on the same scale. The most common techniques are:
* Normalization: rescaling the features to the range [0, 1], which is a special case of min-max scaling. To normalize our data, we simply apply min-max scaling to each feature column.
* Standardization: centering each feature column at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance). This makes it much easier for the learning algorithms to learn the weights of the parameters. In addition, compared with min-max scaling, it keeps useful information about outliers and makes the algorithms less sensitive to them.
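
Both techniques reduce to one line of NumPy each, as sketched below on made-up values. scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same formulas.

```python
import numpy as np

# Made-up features on very different scales
X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 600.0],
])

# Normalization (min-max scaling): each column is mapped to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each column gets mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm)
print(X_std)
```

In practice the minimum, maximum, mean and standard deviation should be computed on the training data only and then reused to transform the test data, so that no information from the test set leaks into training.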

---
This project will test which feature scaling method best predicts the outcome. Before testing, I believe the best-suited technique will be *Standardization*, as this type of data can have outliers and standardization is less sensitive to them.

### 5.4 Selecting Meaningful Features
As we will see later, one of the main causes of overfitting in machine learning models is redundancy in the data, which makes the model too complex for the given training data and unable to generalize well to unseen data.

One of the most common ways to avoid overfitting is to reduce the dimensionality of the data. This is frequently done by reducing the number of features of the dataset via Principal Component Analysis (PCA), which is a type of unsupervised machine learning algorithm.

PCA identifies patterns in the data based on the correlations between the features. Such correlation implies that there is redundancy in the data, in other words, that some part of the data can be explained by other parts of it.

This correlated data is not essential for the model to learn its weights appropriately, so it can be removed. It may be removed by directly eliminating certain columns (features), or by combining a number of them into new ones that hold most of the information. We will dig deeper into this technique in future articles.
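
As a rough sketch of the idea, the snippet below applies scikit-learn's `PCA` to synthetic data in which one feature is almost a copy of another. The data and the choice of two components are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Three features; the second is nearly a scaled copy of the first (redundant)
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.01, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Combine the three correlated columns into two new ones
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Because the dropped direction carries almost no variance, the two components retain nearly all of the information in the original three features.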

> See how to do it

### 5.5 Splitting Data Into Subsets
In general, we will split our data into three parts: training, validation and test sets. We train our model on the training data, evaluate it on the validation data and finally, once it is ready to use, test it one last time on the test data.

Now, it is reasonable to ask the following question: why not have only two sets, training and testing? That way the process would be much simpler: just train the model on the training data and test it on the testing data.

The answer is that developing a model involves tuning its configuration, in other words, choosing values for its hyperparameters (which are different from the parameters of the model, such as a network’s weights). This tuning is done with the feedback received from the validation set and is, in essence, a form of learning. As a result, the validation score becomes optimistic, and an untouched test set is needed for the final evaluation.

The ultimate goal is for the model to generalize well to unseen data, in other words, to predict accurate results from new data based on the internal parameters adjusted while it was trained and validated.

#### a) Learning Process

> Using NN to train the model

#### b) Overfitting and Underfitting
One of the most important problems when considering the training of models is the tension between optimization and generalization.
* Optimization is the process of adjusting a model to get the best performance possible on training data (the learning process).
* Generalization is how well the model performs on unseen data. The goal is to obtain the best generalization ability.

At the beginning of training, these two issues are correlated: the lower the loss on training data, the lower the loss on test data. This happens while the model is still underfitted: there is still learning to be done, as the model has not yet captured all the relevant patterns in the training data.

But after a number of iterations on the training data, generalization stops improving: the validation metrics stall first and then start to degrade. The model is starting to overfit: it has learned the training data so well that it has picked up patterns that are too specific to the training data and irrelevant to new data.
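
One simple way to spot this turning point is to record the training and validation loss after every epoch and look for the epoch where the validation loss is lowest. The loss values below are invented to show the typical shape of the curves.

```python
# Hypothetical losses recorded once per epoch (illustrative values only)
train_loss = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
val_loss = [0.95, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]

# Epoch (0-indexed) at which the validation loss is lowest
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# After that epoch the training loss keeps falling while the
# validation loss rises, which is the signature of overfitting
print(f"Lowest validation loss at epoch {best_epoch}")
```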

There are two ways to avoid this overfitting: getting more data and regularization.
* Getting more data is usually the best solution; a model trained on more data will naturally generalize better.
* Regularization is used when getting more data is not possible. It is the process of limiting the quantity of information that the model can store, or adding constraints on what information it is allowed to keep. If the model can only memorize a small number of patterns, the optimization will make it focus on the most relevant ones, improving the chance of generalizing well.

> This project must use regularization: there is no way to get more data on a specific company than what exists over its lifetime.

Regularization is done mainly by the following techniques:

1) Reducing the model’s size: Reducing the number of learnable parameters in the model, and with them its learning capacity. The goal is to find a sweet spot between too much and not enough learning capacity. Unfortunately, there is no magical formula to determine this balance; it must be found by trying different numbers of parameters and observing the resulting performance.

2) Adding weight regularization: In general, the simpler the model the better. As long as it can learn well, a simpler model is much less likely to overfit. A common way to achieve this is to constrain the complexity of the network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is done by adding to the loss function of the network a cost associated with having large weights. The cost comes in two forms:

    2.1) L1 regularization: The cost is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).

    2.2) L2 regularization: The cost is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights).

    To decide which of them to apply to our model, it is recommended to keep the following information in mind and take into account the nature of our problem:
![Regularization opts](L1-L2.png "Title")
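
The two penalty terms can be written out in a few lines of NumPy. The weights, the data loss and the regularization strength `lam` are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np

weights = np.array([0.5, -1.5, 2.0])  # hypothetical model weights
data_loss = 0.30                      # hypothetical loss on the training data
lam = 0.01                            # hypothetical regularization strength

# L1: cost proportional to the sum of the absolute values of the weights
l1_penalty = lam * np.sum(np.abs(weights))

# L2: cost proportional to the sum of the squared weights
l2_penalty = lam * np.sum(weights ** 2)

# The penalty is added to the loss that the optimizer minimizes
loss_with_l1 = data_loss + l1_penalty
loss_with_l2 = data_loss + l2_penalty

print(f"L1: {loss_with_l1:.3f}, L2: {loss_with_l2:.3f}")
```

Note how the largest weight dominates the L2 penalty, since squaring penalizes large values more heavily than small ones.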


## 6. Developing a Benchmark model
The goal in this step of the process is to develop a benchmark model that serves as a baseline against which we’ll measure the performance of a better and more finely tuned algorithm.

Benchmarking requires experiments to be comparable, measurable, and reproducible. It is important to emphasize the reproducible part of this statement. Today's data science libraries perform random splits of data, and this randomness must be consistent across all runs. Most random generators support setting a seed for this purpose. In Python we will use the random.seed method from the random package. Note that random.seed only affects Python's built-in generator; libraries such as NumPy keep their own random state, which has to be seeded separately.
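
A minimal sketch of seeding for reproducibility follows. The seed value and the `sample_run` helper are hypothetical; the helper stands in for any step that relies on random numbers.

```python
import random

import numpy as np

SEED = 42  # arbitrary value; what matters is that it stays fixed


def sample_run(seed):
    """Hypothetical stand-in for a pipeline step that uses randomness."""
    random.seed(seed)                  # seeds Python's built-in generator
    rng = np.random.default_rng(seed)  # NumPy generator seeded separately
    return random.random(), rng.permutation(5).tolist()


first = sample_run(SEED)
second = sample_run(SEED)

# Both runs produce exactly the same "random" results
print(first == second)
```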

As found on “https://blog.dominodatalab.com/benchmarking-predictive-models/”

>“It is often valuable to compare model improvement over a simplified baseline model such as a kNN or Naive Bayes for categorical data, or the EWMA of a value in time series data. These baselines provide an understanding of the possible predictive power of a dataset.

>The models often require far less time and compute power to train and predict, making them a useful cross-check as to the viability of an answer. Neither kNN nor Naive Bayes models are likely to capture complex interactions. They will, however, provide a reasonable estimate of the minimum bound of predictive capabilities of a benchmarked model.

>Additionally, this exercise provides the opportunity to test the benchmarking pipeline. It is important that benchmark pipelines provide stable results for a model with understood performance characteristics. A kNN or a Naive Bayes on the raw dataset, or minimally manipulated with column centering or scaling, will often provide a weak, but adequate learner, with characteristics that are useful for the purposes of comparison. The characteristics of more complex models may be less understood and prove challenging.”
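
For time series data such as this project's, the EWMA baseline mentioned in the quote could be sketched with pandas as follows. The prices and the `span` value are invented for illustration, and using the previous EWMA value as the forecast is one possible reading of that baseline.

```python
import pandas as pd

# Hypothetical price series (illustrative values only)
prices = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0])

# EWMA of the series; shift(1) makes each forecast use only earlier values
ewma_forecast = prices.ewm(span=3, adjust=False).mean().shift(1)

# MSE of the baseline; the first point has no forecast and is skipped
baseline_mse = ((prices - ewma_forecast) ** 2).mean()

print(f"Baseline MSE: {baseline_mse:.4f}")
```

Any more complex model would then be expected to beat this baseline error on the same test period.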