# Set-up

First things first: let's import the Python libraries.

## Import libraries

In [None]:
import json
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf


import sys
sys.path.append('..')  # add the parent directory to sys.path so FiDaL can be imported
from FiDaL import (data as dta, plot, utils)

In [None]:
# Read the credentials from credentials.json
with open('../config/credentials.json') as f:
    credentials = json.load(f)

# Read the config from config.json
with open('../config/config.json') as f:
    config = json.load(f)

## Download and load the data

Selecting the right data period is crucial for the analysis. The following factors should be considered when selecting the data period:
- **Market changes**: Financial markets undergo structural changes over time. Regulations, economic conditions, technological advancements, and other factors can alter market dynamics. It's crucial to ensure that the data used is still representative of current conditions.
- **More Recent Data**: Some analysts prefer using more recent data (e.g., 3-5 years) on the premise that it better reflects the current market dynamics. Financial markets evolve, and the factors that influenced stock performance a decade ago may not be as relevant today.
- **Specific Asset Characteristics**: Different assets may require different look-back periods based on their volatility, liquidity, and the sectors they represent. For instance, technology stocks may behave differently compared to utility stocks over the same period.
- **Investment Horizon**: Align the data period with your investment horizon. If you are a long-term investor, using a longer historical period may be more appropriate. For shorter-term investments, consider using a shorter data period.
- **Statistical Significance**: Ensure that the data set is large enough to be statistically significant, reducing the risk of anomalies skewing the results.
- **Regime Changes**: Be aware of regime changes (significant shifts in market trends or economic conditions) within your data period. These can significantly impact the relevance of historical data.
- **Rolling Windows**: Consider using rolling windows for your analysis. This technique involves continuously updating the time frame of the data used for the analysis (e.g., always using the most recent five years of data). This can provide a more dynamic view of how optimal weights change over time.
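
The rolling-window idea in the last point can be sketched as follows. This is only a minimal illustration on synthetic prices, not part of the FiDaL workflow: the generated series, the 252-day window, and all variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (assumption: stands in for a real downloaded series)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2018-01-01", periods=1000)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))), index=dates
)

log_returns = np.log(prices).diff().dropna()

# Rolling window of 252 observations (roughly one trading year, an assumed convention)
window = 252
rolling_mean = log_returns.rolling(window).mean()
rolling_vol = log_returns.rolling(window).std() * np.sqrt(252)

# Each value only uses the most recent `window` observations
print(rolling_vol.dropna().tail(3))
```

Because every estimate is recomputed from the latest window only, older observations drop out automatically as new ones arrive.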

In [None]:
# Create an instance of the YFDataDownloader class
downloader = dta.make_data.YFDataDownloader(config, credentials=credentials)

# Get the data for the tickers to analyze using downloader.get_data()
print(config["data_source_params"])
data_downloaded = downloader.get_data(**config["data_source_params"])
data = data_downloaded[["Adj Close", "Volume"]]
del data_downloaded

# Data Exploration

In [None]:
data.head(3)

In [None]:
data = (data.dropna(thresh=data.shape[1]//2)
        .rename(columns={"Adj Close": "price", "Volume": "volume"})
        .sort_index(axis=1))

data = dta.process.FinancialDataProcessor.compute_returns(data, column='price', log=True, apply_smoothing=True, smoothing_factor=1)

In [None]:
daily_returns = data["log_returns"]

expected_returns = daily_returns.mean()  # Calculate expected returns (mean of logarithmic returns)

print("\nExpected Returns:\n", expected_returns)

The most important data cleaning steps for us are:
- **Missing Values**: Check for missing values and handle them appropriately. Missing values can cause issues with the analysis and may lead to incorrect conclusions. Common approaches for handling missing values include removing them, imputing them with a value (e.g., mean, median), or using a forward or backward fill.
- **Outliers**: Check for outliers and handle them appropriately. Outliers can skew the analysis and lead to incorrect conclusions. Common approaches for handling outliers include removing them or capping them at a certain value.
- **Data Types**: Ensure that the data types are correct. For instance, numerical values should be represented as floats or integers, and dates should be represented as date objects.
- **Duplicates**: Check for duplicate values and handle them appropriately. Duplicates can cause issues with the analysis and may lead to incorrect conclusions. Common approaches for handling duplicates include removing them or aggregating them.
- **Data Integrity**: Ensure that the data is correct and consistent. For instance, check that the data is in the expected range, and that the values are consistent with other data sources.
- **Data Format**: Ensure that the data is in the expected format. For instance, check that the data is in the expected units (e.g., dollars vs. cents), and that the values are consistent with other data sources.
- **Data Range**: Ensure that the data is within the expected range. For instance, check that the data is within the expected time period, and that the values are consistent with other data sources.
- **Data Granularity**: Ensure that the data is at the expected level of granularity. For instance, check that the data is at the expected frequency (e.g., daily, monthly, quarterly), and that the values are consistent with other data sources.

Let's apply all of these except outlier handling and granularity. Outliers can be valid observations, as we will see, so they mostly need to be checked manually, and the granularity is already set to daily.
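
As a rough sketch of what some of these checks can look like in pandas, the example below handles duplicates, data types, and missing values on a small made-up frame. The data and the chosen strategies (keep the first duplicate, forward fill) are assumptions for illustration, not the cleaning that FiDaL performs.

```python
import numpy as np
import pandas as pd

# Toy frame with a duplicated date, a missing price, and string-typed volume (all made up)
raw = pd.DataFrame(
    {
        "price": [10.0, 10.0, np.nan, 10.4, 10.3],
        "volume": ["100", "100", "120", "90", "110"],
    },
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
    ),
)

clean = raw[~raw.index.duplicated(keep="first")]  # duplicates: keep first row per date
clean = clean.astype({"volume": "int64"})         # data types: volume as integer
clean = clean.sort_index().ffill()                # missing values: forward fill

assert clean.index.is_monotonic_increasing        # integrity: dates are in order
assert (clean["price"] > 0).all()                 # range: prices must be positive

print(clean)
```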

## Visual analysis

### Volume vs price

In [None]:
plot.adj_close_volume(data, columns=["price", "volume"], y_log=False)

### Moving Average

In [None]:
plot.moving_average(adj_close:=data["price"],
                    volume:=data["volume"],
                    window_sizes=[10, 20], include_stats=True, log_scale=False)

In [None]:
# Calculating volatility (annualized standard deviation of daily returns)
volatility = daily_returns.std() * (252**0.5)  # 252 trading days per year
correlation = daily_returns.dropna().corr(other=data["volume"].dropna())  # Calculate correlation between logarithmic returns and volume
daily_returns.describe(), volatility, correlation

# 1. Component decomposition

## 1.1 Seasonal decomposition

In [None]:
decomposed = seasonal_decompose(data["price"].interpolate(),
                                model='additive', period=len(data) // 2)
fig = plot.decomposed_time_series(decomposed, ticker:="APPL")
plt.show()

## 1.2 Autocorrelation

In [None]:
plot.autocorrelation(data["price"], title=ticker, lags=30, figsize=(7, 3))
plt.show()

## 1.3 Partial Autocorrelation

In [None]:
plot.partial_autocorrelation(data["price"].diff(), title=ticker, lags=30)
plt.show()

## 1.4 Test stationarity

In [None]:
print('\n',ticker, 'Dickey-Fuller Test for stationarity:')
_ = utils.test_stationarity(timeseries=data["price"]
                        .interpolate(),
                        verbose=True)

A common technique for making a time series stationary is differencing: computing the differences between consecutive observations. First differencing is particularly effective at removing trends, while seasonal patterns are usually removed by differencing at the seasonal lag instead. Trends and seasonality are common causes of non-stationarity.

In [None]:
print('\n',ticker, 'Dickey-Fuller Test for stationarity:')
_ = utils.test_stationarity(timeseries=data["price"]
                        .interpolate()
                        .diff().dropna(),
                        verbose=True)

# 2. Traditional Analysis

In [None]:
# 1. Determine Optimal Number of Lags
lag_acf = acf(timeseries:=data["price"].diff().dropna(), nlags=40)
lag_acf  # Inspect the lag_acf to choose short-term and long-term lags

In [None]:
# 2. Plot SMA and EMAs
plot.analyze_moving_averages(timeseries,
                        short_term_lag=10,
                        long_term_lag=20)

In [None]:
from statsmodels.tsa.ar_model import AutoReg
import numpy as np


# Split the data into train and test sets
data_diff = data["price"].diff().dropna()
train_size = int(len(data_diff) * 0.95)

# Use differenced data for the autoregressive model
train_diff, test_diff = data_diff[:train_size], data_diff[train_size:]
# Align the undifferenced split with the differenced one (diff().dropna() drops rows)
train, test = data.loc[train_diff.index], data.loc[test_diff.index]

In [None]:
# Specify the optimal lags based on PACF analysis
optimal_lags = 20

# Fit AutoReg model
model = AutoReg(train_diff, lags=optimal_lags)
model_fit = model.fit()

# Make predictions on the test data
predictions = model_fit.predict(start=len(train_diff), end=len(train_diff) + len(test_diff) - 1)
predictions.index = test_diff.index

In [None]:
plt.figure(figsize=(10,6))
plt.plot(test_diff, label='Actual daily price change', c='b')  # differenced series, not price levels
plt.plot(predictions, c='r', label='Prediction')
plt.title(f'Predicted daily price change - {ticker}')
plt.xlabel('Date')
plt.ylabel('$')
plt.legend()
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(test_diff, predictions))
print(f'RMSE for {ticker}: {rmse}')

In [None]:
from statsmodels.tsa.arima.model import ARIMA

# Specify the order of the ARIMA model
order = (7,1,7)

# Fit the ARIMA model on the training prices only, so no test data leaks into the fit
model = ARIMA(train["price"], order=order)
model_fit = model.fit()

# Forecast price levels over the test period
predictions = model_fit.predict(start=len(train), end=len(train) + len(test) - 1)
predictions.index = test.index

# Calculate RMSE against actual prices: with d=1 the model predicts levels, not differences
rmse = np.sqrt(mean_squared_error(test["price"], predictions))

In [None]:
print(f'RMSE for ARIMA: {rmse}')

In [None]:
plt.figure(figsize=(10,6))
plt.plot(test['price'], label='Actual Stock Price', c='b')
plt.plot(predictions, c='r', label='Prediction')
plt.title(f'Predicted Stock Price - {ticker}')
plt.xlabel('Date')
plt.ylabel('$')
plt.legend()
plt.show()

# 3. Deep learning

## Data preparations

In [None]:
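# Build (samples, n_steps) input windows and next-step targets from the differenced series;
# pass NumPy arrays for both splits so they are handled consistently
X_diff_train, y_diff_train = dta.process.split_sequence(train_diff.values, n_steps:=10)
# Reshape to (samples, time steps, features), the layout expected by sequence models
X_diff_train = X_diff_train.reshape((X_diff_train.shape[0],
                                     X_diff_train.shape[1], n_features:=1))

X_diff_test, y_diff_test = dta.process.split_sequence(test_diff.values, n_steps)
X_diff_test = X_diff_test.reshape((X_diff_test.shape[0],
                                   X_diff_test.shape[1], n_features))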