# BFF Machine Learning
In this notebook, I'll test some machine learning (ML) algorithms on Brazilian Financial Funds. The goal is to build a model able to forecast fund returns so investors can plan and adjust their strategies.

THIS IS NOT FINANCIAL ADVICE, just an experiment to test different ML algorithms and see which one performs best. It is also possible that none will be good, which is fine! That would still give enthusiasts a reason to keep looking for other variables that may influence fund returns.

I'll test the following algorithms:
- Linear Regression
- Non-linear Regression: 
    1. Support Vector Machine (SVM)
    2. XGBoost
- Neural Networks:
    1. Artificial Neural Networks (ANN)
    2. Recurrent Neural Networks (RNN)
    3. Long Short Term Memory (LSTM) - *a type of RNN*


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

## Prepping data
Before I start modeling, I need to make sure my data fits the requirements of my algorithms. Since I'm dealing with features on very different scales (monetary values, rates, dummies), my models would have a hard time weighing them against each other.

To fix that, I'll use StandardScaler from sklearn, which centers each feature on its mean and divides it by its standard deviation. This puts all features on a comparable scale and makes them easier for the models to process.
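
To make the scaling concrete, here is a minimal sketch on a made-up feature matrix (the values and the `X_toy` name are illustrative assumptions, not taken from the fund dataset). It compares StandardScaler with the same computation done by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a monetary column, a rate column and a dummy column
X_toy = np.array([
    [1_000_000.0, 0.02, 0.0],
    [2_500_000.0, 0.05, 1.0],
    [4_000_000.0, 0.11, 1.0],
    [500_000.0, 0.01, 0.0],
])

# StandardScaler standardizes each column independently
scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0, which is NumPy's default)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so the monetary column no longer dominates the others purely because of its units.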

Since my models will use train, validation and test sets, it's important not to let the test set influence the preprocessing. So first, I'll:
1. Drop the rows with issues identified in my 'Cleaning Datasets' phase: all rows with errors in the registration data
2. Split my dataset in 3: train, validation and test
3. Fit a scaler on the training set and apply it to the validation set: later, I'll use the **same** scaler on the test set
4. Run my models

In [None]:
# Import financial fund df
fund_df = pd.read_csv('fund_df.csv')

fund_df.info()

In [None]:
# Drop observations with issues in the registration data
print('Fund DF complete: ', fund_df.shape)
# drop() expects index labels, so select the index of the flagged rows first
fund_df = fund_df.drop(fund_df[fund_df['correct_name'] == False].index)
fund_df = fund_df.drop(columns=['DENOM_SOCIAL', 'DT_REG', 'manager_name', 'isuer_name', 'big4_name'])

# Check the final shape after drops
print('Fund DF after drop: ', fund_df.shape)
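
As a quick sanity check on the row filter, here is a small sketch with a made-up frame (the `demo` frame and its values are assumptions for illustration only). It shows two equivalent ways to remove rows flagged as `False` in a boolean column:

```python
import pandas as pd

# Hypothetical frame with a boolean flag column, mimicking 'correct_name'
demo = pd.DataFrame({
    "correct_name": [True, False, True],
    "quota_return": [0.01, 0.02, -0.03],
})

# Option 1: keep only the rows where the flag is True (boolean mask)
kept_mask = demo[demo["correct_name"]]

# Option 2: drop the index labels of the rows where the flag is False
kept_drop = demo.drop(demo[~demo["correct_name"]].index)

print(kept_mask)
print(kept_mask.equals(kept_drop))
```

Both options keep the original index labels of the surviving rows, which is worth remembering if the index is later used to join other data.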

In [None]:
# Split fund df into train, validation and test
## Split features (X) and dependent variable (y)
X = fund_df.drop('quota_return', axis=1)
y = fund_df['quota_return']

# Create the training subset (70%) and a held-out subset (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Now split the held-out subset into validation and test. I'm setting it to 50% so we have an even split between validation and test
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=101)
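
The two-step split should leave roughly 70% of the rows for training and 15% each for validation and test. The sketch below checks this on synthetic arrays (`X_demo`, `y_demo` and the 1000-row size are assumptions, not the real fund data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the target
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

# First split: 70% train, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# Second split: the held-out part is halved into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=101
)

sizes = {"train": len(X_tr), "validation": len(X_va), "test": len(X_te)}
print(sizes)

# The three subsets should not share any rows
all_ids = np.concatenate([y_tr, y_va, y_te])
print(len(np.unique(all_ids)) == len(y_demo))
```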

In [None]:
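# Create scaler
scaler = StandardScaler()

# Fit on X_train only, then keep the transformed result
X_train = scaler.fit_transform(X_train)

# Only transform X_val, reusing the statistics learned from X_train.
# The target (y_train, y_val) is left unscaled: transform() takes a single
# feature array, and this scaler was fitted on the features, not on y
X_val = scaler.transform(X_val)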
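
To illustrate why the scaler is fitted on the training set only and then reused, here is a sketch on random data (all names ending in `_demo` are hypothetical, and the data is generated, not taken from the funds). The test rows are transformed with the training statistics, so they never influence the scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test_demo = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

# Learn mean and standard deviation from the training rows only
scaler_demo = StandardScaler().fit(X_train_demo)

# Reuse the fitted scaler on both sets; never refit on the test set
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_)            # statistics learned from training data
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly 0
```

The scaled test set does not have a mean of exactly zero, and that is expected: it was scaled with the training statistics, the same way unseen data would be at prediction time.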

## Linear Regression
First, let's try a classic linear regression. The target I'm trying to predict is the quota's return (y).

In [None]:
# Create an instance for linear regression model
lm = LinearRegression()

# Fit the model on my training set (X and y)
lm.fit(X_train, y_train)
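
Once a model is fitted, one possible way to judge it is to predict on held-out data and compute a few error metrics with `sklearn.metrics`. The sketch below uses synthetic data (`X_fit`, `X_holdout`, `true_coef` and the noise level are all assumptions for illustration), so the numbers say nothing about the fund dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

rng = np.random.default_rng(42)

# Synthetic linear relationship with a little noise (hypothetical data)
true_coef = np.array([0.5, -1.0, 0.0, 2.0])
X_fit = rng.normal(size=(300, 4))
y_fit = X_fit @ true_coef + rng.normal(scale=0.1, size=300)
X_holdout = rng.normal(size=(100, 4))
y_holdout = X_holdout @ true_coef + rng.normal(scale=0.1, size=100)

# Fit on the training part, predict on the held-out part
model = LinearRegression().fit(X_fit, y_fit)
pred = model.predict(X_holdout)

# Common regression metrics
mae = metrics.mean_absolute_error(y_holdout, pred)
rmse = np.sqrt(metrics.mean_squared_error(y_holdout, pred))
r2 = metrics.r2_score(y_holdout, pred)

print(f"MAE: {mae:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```

MAE and RMSE are expressed in the units of the target, while R2 is unitless, so looking at them together gives a fuller picture than any single number.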