### Notebook summary
This notebook trains a model for the AAPL ticker with the SPO framework, after removing features that have a correlation of 70% or above.
The following points were observed:
- Even with the reduced feature set, the loss remains large in magnitude.
- The variance of the loss is also high.

Next steps:
- Reduce features based on feature importance from sklearn
- Reduce features based on domain knowledge

### Import libraries

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import os
import itertools
import math
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import random
import gurobipy as gp
from gurobipy import GRB
import tensorflow as tf
from tensorflow.keras import initializers
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
import yaml
from pathlib import Path

### Import modules

In [None]:
import sys
sys.path.append("../src")

import data_exploration as de
import model_training as mt

### Load necessary directories

In [None]:
current_dir = Path(os.getcwd())
root_dir = current_dir
while 'Portfolio Optimization using SPO' in root_dir.parts:
    root_dir = root_dir.parent
    if root_dir == Path(root_dir.root):
        print("Root directory not found.")
        break
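
The upward search above can also be written as a small helper that fails loudly when the project folder is missing. This is only a sketch: `find_project_parent` is a hypothetical name, not part of the project's `src` modules, and it stops at the innermost matching folder.

```python
from pathlib import PurePosixPath


def find_project_parent(start, project_name):
    """Return the directory that contains `project_name`, searching upward from `start`."""
    # Check `start` itself, then each of its ancestors in turn.
    for candidate in (start, *start.parents):
        if candidate.name == project_name:
            return candidate.parent
    raise FileNotFoundError(f"'{project_name}' not found above {start}")


# Illustrative path only; in the notebook `start` would be Path(os.getcwd()).
example_start = PurePosixPath("/home/user/Portfolio Optimization using SPO/notebooks")
example_root = find_project_parent(example_start, "Portfolio Optimization using SPO")
```

Raising an exception instead of printing a message makes a wrong working directory visible before any data path is built.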

In [None]:
config_path = root_dir / "Portfolio Optimization using SPO" / "config" / "config.yml"
complete_data_path = root_dir / "Portfolio Optimization using SPO" / "data" / "dat_518_companies.csv"
data_path = root_dir / "Portfolio Optimization using SPO" / "data" / "AAPL_df.csv"
cost_mat_path = root_dir / "Portfolio Optimization using SPO" / "data" / "cost_mat.csv"
sigma_path = root_dir / "Portfolio Optimization using SPO" / "data" / "sigma_df.csv"

In [None]:
with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

### Import data

In [None]:
# import data
df_AAPL_train_test = pd.read_csv(data_path)
df_final_returns = pd.read_csv(cost_mat_path)
sigma_df = pd.read_csv(sigma_path)

In [None]:
gamma = config["gamma"]
sigma = sigma_df.values

### Drop highly correlated features
Features with a correlation above 70% are dropped. The names of the columns to drop are read from `config["to_drop"]`.

In [None]:
df_AAPL_train_test_red = df_AAPL_train_test.drop(config["to_drop"], axis=1)
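
The notebook reads the drop list from the config file and does not show how it was built. One common way to derive such a list is sketched below. It assumes absolute Pearson correlation and a strict 0.7 threshold. The function name `correlated_columns` and the toy data are hypothetical, and in practice the target column should be excluded before calling it.

```python
import numpy as np
import pandas as pd


def correlated_columns(features, threshold=0.7):
    """List columns whose absolute correlation with an earlier column exceeds `threshold`."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the first column of every correlated pair is retained.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data: "b" is almost a multiple of "a", "c" is only weakly related to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
to_drop_example = correlated_columns(toy, threshold=0.7)
toy_reduced = toy.drop(to_drop_example, axis=1)
```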

### Split data into train and test

In [None]:
# features: split into train and test; shuffle=False keeps the original row order
df_AAPL_redu_train, df_AAPL_redu_test = train_test_split(df_AAPL_train_test_red, test_size=0.2,
                                                         random_state=42, shuffle=False)

# cost vectors: split the same way so that rows stay aligned with the features
df_final_returns_train, df_final_returns_test = train_test_split(df_final_returns, test_size=0.2, random_state=42,
                                                                 shuffle=False)
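
With `shuffle=False`, `train_test_split` cuts the data at a single position: the first 80% of rows become the training set and the last 20% the test set, and `random_state` has no effect. The toy frames below are hypothetical stand-ins for the feature and return tables. They illustrate that two tables with the same number of rows stay aligned when split this way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature table and the returns table.
toy_features = pd.DataFrame({"x": range(10)})
toy_returns = pd.DataFrame({"r": range(100, 110)})

feat_train, feat_test = train_test_split(toy_features, test_size=0.2, shuffle=False)
ret_train, ret_test = train_test_split(toy_returns, test_size=0.2, shuffle=False)

# Both splits use the same row positions, so features and returns still match.
train_rows = list(feat_train.index)
test_rows = list(feat_test.index)
```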

### Initialize the model

In [None]:
redu_n_rows, redu_n_cols = df_AAPL_redu_train.shape
redu_n_feats = redu_n_cols-1

# Instantiate the model
model_redu_data = mt.get_model(n_feats = redu_n_feats)
model_redu_data.summary()

### Train the model
We first train the model with arbitrarily chosen hyper-parameters as a sanity check that the pipeline works end to end.

In [None]:
%%time
trained_redu_model, epoch_redu_loss_list = mt.SGD_regressor(
    df_AAPL_redu_train,
    model_redu_data,
    df_final_returns_train,
    sigma,
    gamma,
    learning_rate=0.001,
    decay_rate=1.02,
    n_epochs=200,
    batch_size=512,
)

### Plot loss progression with every epoch

In [None]:
fig_redu = px.line(epoch_redu_loss_list).update_layout(
    title="Training loss progression",
    xaxis_title="epochs",
    yaxis_title="SPO+ loss",
)
fig_redu.show()
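
For readers unfamiliar with the quantity on the y-axis, here is a minimal sketch of the standard SPO+ surrogate loss for a minimisation problem, $\ell_{SPO+}(\hat{c}, c) = \max_{w \in S}\{(c - 2\hat{c})^T w\} + 2\hat{c}^T w^*(c) - z^*(c)$. To stay solver-free it assumes the feasible set $S$ is the unit simplex, which is not the notebook's actual problem (that one involves `sigma` and `gamma` and is solved inside `mt`). The function name `spo_plus_loss_simplex` is hypothetical.

```python
import numpy as np


def spo_plus_loss_simplex(c_hat, c):
    """SPO+ loss for min_w c^T w over the unit simplex (illustrative assumption)."""
    c_hat = np.asarray(c_hat, dtype=float)
    c = np.asarray(c, dtype=float)
    # On the simplex a linear objective is optimised at a vertex, so the
    # true optimum is one-hot at argmin(c) with value z*(c) = min(c).
    best = int(np.argmin(c))
    z_star = c[best]
    # max over the simplex of (c - 2 c_hat)^T w is the largest entry.
    worst_case = np.max(c - 2.0 * c_hat)
    return worst_case + 2.0 * c_hat[best] - z_star


loss_perfect = spo_plus_loss_simplex([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_reversed = spo_plus_loss_simplex([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A perfect prediction gives a loss of zero, while a prediction that reverses the ranking of the costs is penalised.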

### Testing the model on test data

In [None]:
y_pred_redu = trained_redu_model(df_AAPL_redu_test.iloc[:,0:redu_n_feats].values)
# Evaluate on the test split; the original call passed the training dataframe here
redu_spo_test_loss = mt.get_SPO_plus_testing_loss(df_AAPL_redu_test, df_final_returns_test, y_pred_redu, sigma=sigma, gamma=gamma)

print(f'The SPO+ loss on testing data is {redu_spo_test_loss}')

Both the per-epoch training loss and the loss on the test data are very high in magnitude and highly variable. We will therefore apply feature selection with sklearn to see whether the loss can be reduced.
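
As a rough preview of that next step, the sketch below uses `SelectKBest` with `mutual_info_regression`, both already imported at the top of this notebook. The synthetic data, the choice of `k=2`, and the fixed `random_state` are assumptions for illustration only, not the settings that will be used on the AAPL features.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: only the first of five features drives the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

# Fix random_state so the mutual information estimates are reproducible.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=2)
X_toy_selected = selector.fit_transform(X_toy, y_toy)

# Column positions of the features that were kept.
selected_idx = selector.get_support(indices=True)
```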