# Introduction

As part of this course, students will participate in an in-class real-time financial forecasting competition similar to the [M6](https://m6competition.com/). The competition will require students to augment open-source financial data with external open-source datasets of their choice. They will then present their findings to the class by the end of the course.

The focus of the forecasting competition will be on several key areas, including the ability to estimate future returns and uncertainty, combining estimates into an investment decision, developing a consistent investment strategy, utilizing alternative datasets effectively, and learning from mistakes through teamwork and transparency.

The winning students of the competition will be guaranteed an A+ and will receive a special prize as recognition for their achievement.


# Schedule

The competition will take place in real-time during the semester.


# Evaluation

The competition will consist of two distinct challenges: Forecasting, which will be evaluated using the ranked probability score, and Investment decisions, which will be evaluated using the information ratio.


# Data

The competition's investment universe will consist of three asset classes: 50 stocks from the S&P 500 index, 50 international ETFs, and 10 cryptocurrencies. These assets have been selected to provide a broad representation of the overall market.

# Submission format

The competition will have 10 submission points, plus an additional test point, with a deadline of 6 PM ET on the Sunday before the start of the corresponding investment period. Participants are required to submit their forecasts and investment decisions at each point, outlining their predictions and strategy for the upcoming week. The forecast horizon is one week, typically five trading days, and there will be no overlapping evaluation periods.

**Example**: The deadline for the first submission point is 6 PM ET on September 17th, 2023 (Sunday). Participants are required to submit forecasts and investment decisions reflecting the closing value of the last trading day of the following week, which is September 22nd, 2023 (Friday).


At each submission point, students may submit a single **csv file** (the file extension must be `.csv`) consisting of seven columns of 110 values each (one per asset):

* The first column must indicate the asset to which the forecasts and the respective row's investment decisions refer. The acronym of each asset will serve as an identifier.

* The second to sixth columns must contain non-negative values summing horizontally to **unity** that refer to the probabilities of the ranks of the forecasted percentage return for each asset (stock, ETF, or cryptocurrency); rank 1 is the lowest forecasted percentage return, and rank 5 is the highest forecasted percentage return.

* The seventh column must contain numerical values corresponding to the weights for investing in each asset. These values must be positive for long positions, negative for short positions, or zero for no position.

**For example**, if three assets are assigned weights 0.5, 0.3, and -0.2, respectively, and all other assets weights of 0, this means that the participant wishes to invest in only three assets with positions long, long, and short and with a budget allocation of 50%, 30%, and 20% respectively.

The submission will be considered invalid if the **sum of the absolute weights** exceeds 1.

If the **sum of the absolute weights** is less than 1 (less than 100%), then the remainder is assumed to be assigned to an asset with zero return and zero risk (i.e., no investment). However, if the sum of the absolute weights is below 0.25 (25%) the submission will be considered invalid (i.e., some investment must be made and some risk must be taken).
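
The weight rules above can be checked before submitting. The following is a minimal sketch rather than official competition tooling: the helper name `check_weights` and the tolerance `eps` are assumptions introduced for illustration, while the 0.25 and 1 bounds come from the rules stated above.

```python
import numpy as np

def check_weights(weights, eps=1e-9):
    """Return (is_valid, gross_exposure, cash_share) for a vector of weights.

    Hypothetical helper: the 0.25 and 1.0 bounds come from the rules above,
    while the tolerance eps is an assumption to absorb rounding error.
    """
    w = np.asarray(weights, dtype=float)
    gross = np.abs(w).sum()              # sum of absolute weights
    is_valid = (gross >= 0.25 - eps) and (gross <= 1.0 + eps)
    cash = max(1.0 - gross, 0.0)         # remainder sits in the zero-return asset
    return bool(is_valid), float(gross), float(cash)

print(check_weights([0.5, 0.3, -0.2]))   # fully invested
print(check_weights([0.2, -0.2]))        # part of the budget left uninvested
print(check_weights([0.1, 0.1]))         # too little exposure
```

Short positions count toward the budget through their absolute value, which is why the check works on `np.abs(w)` rather than on the signed sum.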

**Example**: The following is an example of the first 8 rows of a submission file. In this case, the participant decides to invest in three assets (3rd, 6th, and 7th) with weights of 50%, 30%, and 20% (or 0.5, 0.3, and 0.2) and positions long, long, and short, respectively. Additionally, the participant forecasts that there is a probability of 0.1, 0.2, 0.5, and 0.2 that the first asset (MMM) will be ranked 2nd, 3rd, 4th, and 5th, respectively, with regard to the expected percentage return. Similarly, the participant forecasts that the second asset (ATVI) will be ranked 3rd with probability 1.

In [None]:
import pandas as pd
import numpy as np
cols = ['id', 'rank1', 'rank2', 'rank3', 'rank4', 'rank5', 'decision']
assets = ['mmm', 'atvi', 'googl', 'aph', 'bmy', 'cb', 'exr', 'msi']
# Each row holds five rank probabilities followed by the investment decision
mat = np.array([[0, 0.1, 0.2, 0.5, 0.2, 0],
                [0, 0, 1, 0, 0, 0],
                [.1, .1, .1, .1, .6, .5],
                [.5, .4, .05, .05, 0, 0],
                [.2, .2, .2, .2, .2, 0],
                [0, 0, .1, .4, .5, .3],
                [.7, .3, 0, 0, 0, -.2],
                [0, 0, 1, 0, 0, 0]])
print(mat.shape)
df = pd.DataFrame(mat, columns=cols[1:])
df.insert(0, 'id', assets)

print('sum of abs decisions: ', df.iloc[:, -1].abs().sum())
# Compare with a tolerance rather than exact float equality
print('sum of ranks equals 1: ', np.all(np.isclose(df.iloc[:, 1:-1].sum(axis=1), 1)))
print('all ranks are non negative: ', np.all(df.iloc[:, 1:-1] >= 0))

df

(8, 6)
sum of abs decisions:  1.0
sum of ranks equals 1:  True
all ranks are non negative:  True


Unnamed: 0,id,rank1,rank2,rank3,rank4,rank5,decision
0,mmm,0.0,0.1,0.2,0.5,0.2,0.0
1,atvi,0.0,0.0,1.0,0.0,0.0,0.0
2,googl,0.1,0.1,0.1,0.1,0.6,0.5
3,aph,0.5,0.4,0.05,0.05,0.0,0.0
4,bmy,0.2,0.2,0.2,0.2,0.2,0.0
5,cb,0.0,0.0,0.1,0.4,0.5,0.3
6,exr,0.7,0.3,0.0,0.0,0.0,-0.2
7,msi,0.0,0.0,1.0,0.0,0.0,0.0


# Evaluating Forecast Performance: A Guide to Using Ranked Probability Score (RPS)

To evaluate the forecast performance for a particular submission point, the **Ranked Probability Score (RPS)** is employed. Assets' realized percentage total returns over a specific period are sorted into quintiles, with rankings from 1 (lowest performance) to 5 (highest performance). In a portfolio of 110 assets, each quintile will contain 22 assets. In the event of a tie at the boundary between quintiles, all tied assets are assigned the average rank of the boundary ranks.

### Example: Handling Tied Ranks
For instance, with 22 assets per quintile, if four assets tie for the 20th to 23rd best positions, three of those positions fall in the top quintile and one falls in the next quintile. Each tied asset therefore receives the average rank:

$$
\text{Average rank} = \frac{(5+5+5+4)}{4} = 4.75,
$$

where the three "5s" are the ranks of the positions in the top quintile, and the "4" is the rank of the position in the second-highest quintile.

### Vector Representation of Asset Ranks
The ranking of each asset is represented by a vector $q_{i,t}$ of dimension 5. If asset $i$ ranks in quintile 3 at time $t$, then $q_{i,t} = (0, 0, 1, 0, 0)$. An asset with a rank of 4.75 at time $t$ would be represented as $q_{j,t} = (0, 0, 0, 0.25, 0.75)$.
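
As a rough illustration of this representation, the sketch below builds $q$ vectors in NumPy. Both function names are hypothetical. The linear split of a fractional rank between two adjacent quintiles is an assumption chosen because it reproduces the 4.75 example above, and the quintile assignment ignores ties and assumes the number of assets is divisible by 5.

```python
import numpy as np

def rank_to_q(rank):
    """Turn a (possibly fractional) rank in [1, 5] into a 5-element vector q.

    Assumption: a fractional rank is split linearly between the two adjacent
    quintiles, which reproduces the 4.75 -> (0, 0, 0, 0.25, 0.75) example.
    """
    q = np.zeros(5)
    lower = int(np.floor(rank))          # quintile just below the rank
    frac = rank - lower                  # share assigned to the next quintile
    q[lower - 1] = 1.0 - frac
    if frac > 0:
        q[lower] = frac
    return q

def returns_to_quintiles(returns):
    """Assign quintile ranks 1 (lowest) to 5 (highest) to realized returns.

    Ties are ignored here for brevity; the number of assets is assumed to be
    divisible by 5.
    """
    returns = np.asarray(returns, dtype=float)
    order = returns.argsort().argsort()  # position 0 = lowest return
    return order // (len(returns) // 5) + 1

print(rank_to_q(3))
print(rank_to_q(4.75))
print(returns_to_quintiles([0.02, -0.01, 0.05, 0.00, -0.03]))
```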

### Calculating RPS for an Individual Asset
A forecast vector $f_{i,t}$ signifies the predicted probabilities for each rank of a particular asset, as provided by a participant. The RPS for asset $i$ at time $t$ is calculated using the following formula:

$$
RPS_{i,t} = \frac{1}{5} \sum_{j=1}^{5} \left( \sum_{k=1}^{j} q_{i,t,k} - \sum_{k=1}^{j} f_{i,t,k} \right)^2.
$$

### Example: Calculating RPS
Suppose we want to determine the RPS for asset $i$ at submission point $t$. If the submitted probabilities for ranks are $f_{i,t} = (0, 0.2, 0.3, 0.4, 0.1)$ and the actual rank of the asset is 4, $q_{i,t} = (0, 0, 0, 1, 0)$, the RPS is calculated as:

$$
RPS_{i,t} = \frac{1}{5} \left( (0 - 0)^2 + (0 - 0.2)^2 + (0 - 0.5)^2 + (1 - 0.9)^2 + (1 - 1)^2 \right) = 0.06.
$$

### Portfolio RPS
The portfolio RPS at time $t$ is the average of the RPS for all $N$ assets, given by:

$$
RPS_t = \frac{1}{N} \sum_{i=1}^{N} RPS_{i,t},
$$

where $N$ is the total number of assets, for example, $N = 110$.

### Overall RPS for Multiple Submission Points
For multiple submission points ranging from $t_1$ to $t_2$, the overall RPS is calculated as:

$$
RPS_{t_1-t_2} = \frac{1}{N(t_2 - t_1 + 1)} \sum_{t=t_1}^{t_2} \sum_{i=1}^{N} RPS_{i,t}.
$$
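
The double sum can be sketched directly with arrays of shape `(T, N, 5)`. The function name `overall_rps` and the input numbers are illustrative assumptions, not part of the course code.

```python
import numpy as np

def overall_rps(f, q):
    """Average RPS over submission points and assets.

    f, q: arrays of shape (T, N, 5) holding forecast probabilities and
    realized rank vectors for T submission points and N assets.
    """
    f = np.asarray(f, dtype=float)
    q = np.asarray(q, dtype=float)
    # Cumulative sums along the rank axis, as in the per-asset formula
    diff = np.cumsum(q, axis=-1) - np.cumsum(f, axis=-1)
    rps_it = (diff ** 2).sum(axis=-1) / f.shape[-1]   # shape (T, N)
    return rps_it.mean()                              # average over t and i

# Two submission points, two assets (illustrative numbers only)
f = np.array([[[0, 0.2, 0.3, 0.4, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]],
              [[0, 0, 0, 0.25, 0.75],   [0.5, 0, 0, 0, 0.5]]])
q = np.array([[[0, 0, 0, 1, 0], [0, 0, 0, 1, 0]],
              [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0]]])
print(overall_rps(f, q))
```

Because every submission point covers the same $N$ assets, taking the mean over both axes is equivalent to dividing the double sum by $N(t_2 - t_1 + 1)$.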

In [None]:
#Example
import pandas as pd
import numpy as np

f = np.cumsum(np.array([[0, 0.2, 0.3, 0.4, 0.1]]), axis=1)
q = np.cumsum(np.array([[0,0,0,1,0]]), axis=1)
np.sum((q-f)**2,axis=1)/f.shape[1]

array([0.06])

In [None]:
#Example

#forecasting
f = np.array([[0,0,0,0.25,0.75],[0.2,0.2,0.2,0.2,0.2],[0,0,0.25,0.5,0.25],[0,0.25,0.5,0.25,0],[0.5,0,0,0,0.5],\
              [0,0,0,0.25,0.75],[0.2,0.2,0.2,0.2,0.2],[0,0,0.25,0.5,0.25],[0,0.25,0.5,0.25,0],[0.5,0,0,0,0.5]])
pd.DataFrame(data = f)

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,0.0,0.25,0.75
1,0.2,0.2,0.2,0.2,0.2
2,0.0,0.0,0.25,0.5,0.25
3,0.0,0.25,0.5,0.25,0.0
4,0.5,0.0,0.0,0.0,0.5
5,0.0,0.0,0.0,0.25,0.75
6,0.2,0.2,0.2,0.2,0.2
7,0.0,0.0,0.25,0.5,0.25
8,0.0,0.25,0.5,0.25,0.0
9,0.5,0.0,0.0,0.0,0.5


In [None]:
#Example

#actual
q = np.array([[0,0,0,0,1],[0,0,0,1,0],[0,0,1,0,0],[0,1,0,0,0],[1,0,0,0,0],\
              [0,0,0,0,1],[0,0,0,1,0],[0,0,1,0,0],[0,1,0,0,0],[1,0,0,0,0]])
pd.DataFrame(data = q)

Unnamed: 0,0,1,2,3,4
0,0,0,0,0,1
1,0,0,0,1,0
2,0,0,1,0,0
3,0,1,0,0,0
4,1,0,0,0,0
5,0,0,0,0,1
6,0,0,0,1,0
7,0,0,1,0,0
8,0,1,0,0,0
9,1,0,0,0,0


In [None]:
def forecast_performance(f,q):

  eps = 1e-3
  assert np.all(q >= 0) and np.all(f >= 0) and np.all(np.abs(np.sum(f,axis=1)-1)<eps) and np.all(np.abs(np.sum(q,axis=1)-1)<eps), \
        "f or q are not conditioned well"

  q = np.cumsum(q, axis=1)
  f = np.cumsum(f, axis=1)
  fp = np.sum((q-f)**2, axis=1)/f.shape[1] #forecast performance
  return np.mean(fp) #mean forecast performance

forecast_performance(f,q)

0.1165

# Evaluating Investment Performance: An Overview of the Information Ratio (IR)

The Information Ratio (IR) serves as the key metric for assessing the performance of investment decisions. It is calculated as the ratio of the portfolio return ($\text{ret}$) to the standard deviation of that return ($\text{sdp}$):

$$
\text{IR} = \frac{\text{ret}}{\text{sdp}}
$$

Here, $\text{ret}$ refers to the continuously compounded portfolio returns, while $\text{sdp}$ denotes the standard deviation of these returns, computed daily. Note that the IR values presented are annualized. This version of the IR employs a benchmark return of 0, making it conceptually akin to the Sharpe Ratio with a risk-free rate of 0.

## Calculating Portfolio Returns

The daily portfolio holding period return, $\text{RET}_t$, is computed using the following formula:

$$
\text{RET}_t = \sum_{i=1}^{N} w_i \left( \frac{S_{i,t}}{S_{i,t-1}} - 1 \right)
$$

where $N$ is the total number of assets, $w_i$ is the weight of the $i^{th}$ asset in the portfolio, and $S_{i,t}$ is the adjusted closing price of the $i^{th}$ asset at the end of trading day $t$. The term $t-1$ refers to the preceding trading day.

The continuously compounded daily portfolio return, $\text{ret}_t$, is then obtained as:

$$
\text{ret}_t = \log(1 + \text{RET}_t)
$$

The value $\text{RET}_t$ is calculated for a single day, $t$, and represents the weighted average return of the selected assets. For holding periods longer than one day, $\text{ret}_{t_1:t_2}$ is the sum of the daily returns:

$$
\text{ret}_{t_1:t_2} = \sum_{t=t_1}^{t_2} \text{ret}_t
$$

## Calculating Portfolio Risk (Standard Deviation)

The standard deviation of portfolio returns, $\text{sdp}_{t_1:t_2}$, is calculated using the same daily return values $\text{ret}_t$ that were used to compute $\text{ret}_{t_1:t_2}$. It is calculated as follows:

$$
\text{varp}_{t_1:t_2} = \frac{1}{T-1} \sum_{t=t_1}^{t_2} \left( \text{ret}_t - \frac{\text{ret}_{t_1:t_2}}{T} \right)^2
$$

$$
\text{sdp}_{t_1:t_2} = \sqrt{\text{varp}_{t_1:t_2}}
$$

Here, $T$ is the length of the holding period and is defined as $T = t_2 - t_1 + 1$.
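
The formulas for the portfolio return and its standard deviation can be traced step by step. This is a small sketch with made-up prices and weights (assumptions, not competition data); it follows the definitions of $\text{RET}_t$, $\text{ret}_t$, $\text{ret}_{t_1:t_2}$, and $\text{sdp}_{t_1:t_2}$ given above.

```python
import numpy as np

# Illustrative inputs (assumed, not competition data): 2 assets, 6 closes
S = np.array([[100.0, 101.0, 102.0, 101.0, 103.0, 104.0],
              [ 50.0,  49.5,  50.5,  50.0,  49.0,  49.5]])
w = np.array([0.6, -0.4])                 # long asset 1, short asset 2

RET = w @ (S[:, 1:] / S[:, :-1] - 1)      # daily holding period returns
ret = np.log1p(RET)                       # continuously compounded daily returns
ret_total = ret.sum()                     # ret over the whole holding period

T = len(ret)                              # number of daily returns
varp = ((ret - ret_total / T) ** 2).sum() / (T - 1)
sdp = np.sqrt(varp)

print(ret_total, sdp)
```

The variance above uses the $T-1$ denominator, so `sdp` matches `np.std(ret, ddof=1)`.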

## Interpretation of the Information Ratio (IR)

A higher value of the Information Ratio ($\text{IR}$)—which is the ratio of the portfolio return ($\text{ret}$) to its standard deviation ($\text{sdp}$)—indicates superior investment performance.


**Example:**

To calculate the Information Ratio (IR) for a one-week investment decision, we assume a 5-day assessment period. First, we determine the continuously compounded daily returns, which gives us 5 $ret_t$ observations. Summing these observations gives $ret_{1:5} = 0.01$. Next, we compute their standard deviation, $sdp_{1:5} = 0.01$.

Given that this is a 5-day period, the annualized return ($ret^a$) and the annualized standard deviation ($sdp^a$) can be calculated as:

$$
ret^a = ret_{1:5} \times \frac{252}{5} = 0.01 \times \frac{252}{5} = 0.504
$$

$$
sdp^a = sdp_{1:5} \times \sqrt{252} = 0.01 \times \sqrt{252} \approx 0.159
$$

Thus, the Information Ratio for this period ($IR_{t_1:t_2}$) becomes:

$$
IR_{t_1:t_2} = \frac{ret^a}{sdp^a} = \frac{0.01 \times \frac{252}{5}}{\sqrt{252} \times 0.01} \approx 3.17
$$

Note that, as with all our investment performance assessments, we utilize daily returns. This affords us more degrees of freedom when calculating the standard deviation, yielding a more accurate depiction of the investment's performance over the given period.

## A side note on annualization

### Annualizing Returns

For returns, we are interested in the total return over some period. In this example, we sum up the daily returns over a 5-day period, getting $ret_{1:5} = 0.01$. To annualize it, we multiply it by $\frac{252}{5}$ since there are approximately 252 trading days in a year and we have 5 days of returns. Because continuously compounded (log) returns add across time, this linear scaling extends the 5-day return to a full year.

For simple returns, by contrast, annualization uses compounding:

$$
\text{Annualized return} = \left(1 + \text{Average daily return} \right)^{252} - 1
$$

Here we take the average of the 5 daily simple returns, add 1, raise the result to the 252nd power, and subtract 1 to get back to a return rate.

### Annualizing Standard Deviation

Standard deviation is a measure of risk or volatility, and it scales differently. For daily data, we can annualize the standard deviation by multiplying it by $\sqrt{252}$ because standard deviation scales with the square root of time. Here, 252 comes from the approximate number of trading days in a year.

Why don't we divide by 5 (or $\sqrt{5}$)? Because we're not trying to reduce the annual standard deviation to a 5-day period; we're trying to take a 5-day standard deviation and scale it up to what it would be over a year. The $\sqrt{252}$ helps project the standard deviation from the daily scale to an annual scale, given the square-root-of-time rule.
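
The two annualization rules can be compared in a few lines. This is a hedged sketch: the function names and the constant `TRADING_DAYS` are introduced here for illustration, and the last line only re-evaluates the summary statistics of the example (total return 0.01 over 5 days, daily standard deviation 0.01).

```python
import numpy as np

TRADING_DAYS = 252  # approximate trading days per year, as in the text

def annualized_ir(daily_log_returns):
    """Annualized information ratio from daily continuously compounded returns."""
    r = np.asarray(daily_log_returns, dtype=float)
    ret_a = r.sum() * TRADING_DAYS / len(r)           # log returns add over time
    sdp_a = r.std(ddof=1) * np.sqrt(TRADING_DAYS)     # square-root-of-time rule
    return ret_a / sdp_a

def annualized_simple_return(daily_simple_returns):
    """Compound the average daily simple return over a year."""
    r = np.asarray(daily_simple_returns, dtype=float)
    return (1 + r.mean()) ** TRADING_DAYS - 1

# Summary statistics of the example: total 0.01 over 5 days, daily std 0.01
print((0.01 * TRADING_DAYS / 5) / (0.01 * np.sqrt(TRADING_DAYS)))
```

Note that the return is scaled by the number of periods per year, whereas the standard deviation is scaled by its square root, so the ratio grows with $\sqrt{252}$ rather than with 252.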

In [None]:
# Example

import numpy as np

# Define weights
w = np.array([.2,.3,-.4])

# Generate some synthetic stock prices, end of trading day, Fri -> Fri
np.random.seed(10)
S = np.random.normal(25, 6, size=(3, 6))

def decision_performance(S, w):
    eps = 1e-3
    # Validate the sum of absolute weights
    assert 1+eps > np.sum(np.abs(w)) and np.sum(np.abs(w)) >= 0.25, "w is not conditioned well"

    # Handle cash: any unallocated budget goes to a zero-return asset
    w_cash = max(1 - np.sum(np.abs(w)), 0)
    w = np.append(w, w_cash)

    # Add cash to assets as a constant price series
    cash = np.full(S.shape[1], 100.0)
    S = np.vstack((S, cash))

    # Calculate holding period return (weights broadcast across trading days)
    RET = np.nansum(w[:, None] * (S[:, 1:] / S[:, :-1] - 1), axis=0)

    # Calculate continuously compounded returns
    ret = np.log(1 + RET)

    # Calculate standard deviation and IR
    std = np.nanstd(ret, ddof=1)
    ret = np.nansum(ret)
    ir = (252 / 5) * ret / (np.sqrt(252) * std)

    return ir

# Execute the function
decision_performance(S, w)


-4.3500470193669045

# Measuring the Combined Performance of Forecasts and Investment Decisions

The combined performance of forecasting and investment decisions is assessed using the arithmetic mean of the ranks for the Ranked Probability Score (RPS) and the Information Ratio (IR). Both forecasting and investment tasks are considered equally important. The Overall Rank (OR) for a particular submission at time $t$ is computed as follows:

$$
OR = \frac{{\text{rank}(RPS) + \text{rank}(IR)}}{2}
$$

Here, $\text{rank}(\cdot)$ denotes the ranking of a participant's performance relative to all other participants for the given metric, either RPS or IR.
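
A possible way to compute the Overall Rank with pandas is sketched below. The team names and scores are invented for illustration, and the convention that rank 1 is best (lowest RPS, highest IR) with ties receiving average ranks is an assumption.

```python
import pandas as pd

# Hypothetical scores for four teams (illustrative values only)
scores = pd.DataFrame(
    {"RPS": [0.150, 0.162, 0.158, 0.171],
     "IR":  [1.20, 2.50, -0.30, 0.80]},
    index=["team_a", "team_b", "team_c", "team_d"],
)

# Assumption: rank 1 is best, so lower RPS and higher IR both rank first
scores["rank_RPS"] = scores["RPS"].rank(ascending=True, method="average")
scores["rank_IR"] = scores["IR"].rank(ascending=False, method="average")
scores["OR"] = (scores["rank_RPS"] + scores["rank_IR"]) / 2

print(scores.sort_values("OR"))
```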

For an aggregated view of forecasting performance across all submission points, the arithmetic mean of the weekly RPS scores is used.

# Submission example

# Submission Guidelines Summary

* File Format: Submit your data in a CSV file. Google Sheets and XML formats are not accepted.
* Filename: Name your file as `GROUPNAME__DATETIME.csv`, making sure to include double underscores.
* Column Names: The CSV should have columns labeled "id", "rank1", "rank2", "rank3", "rank4", "rank5", and "decision". Do not include columns named "symbol" or "name."
* File Structure: Your file should comprise 7 columns and 110 rows, excluding the header row.
* Unique IDs: Ensure that all "id" names are unique and correspond to the asset universe.
* Rank Values: Entries in the "rank" columns should fall between 0 and 1.
* Rank Sum: For each row, the sum of the values in the "rank" columns must be 1, or 0 if you are not investing in that asset.
* Decision Column: The sum of the absolute values in the "decision" column must be less than or equal to 1 and at least 0.25.
* Notebook Usage: Use a copy of the provided notebooks for your work. Only edit the original notebooks if absolutely necessary.
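
The checklist above can be turned into a pre-submission check. The sketch below is not the course's grading code: `validate_submission`, `EXPECTED_COLS`, the tolerance `eps`, and the three-asset universe are assumptions used for illustration only.

```python
import re
import numpy as np
import pandas as pd

EXPECTED_COLS = ["id", "rank1", "rank2", "rank3", "rank4", "rank5", "decision"]

def validate_submission(df, universe_ids, filename=None, eps=1e-6):
    """Return a list of problems found in a submission DataFrame.

    Hypothetical helper: universe_ids is the list of valid asset ids and
    eps is an assumed tolerance for floating-point sums.
    """
    problems = []
    if filename is not None and not re.fullmatch(r".+__.+\.csv", filename):
        problems.append("filename should look like GROUPNAME__DATETIME.csv")
    if list(df.columns) != EXPECTED_COLS:
        problems.append("unexpected column names")
        return problems                      # later checks need these columns
    if len(df) != len(universe_ids):
        problems.append("wrong number of rows")
    if df["id"].duplicated().any() or set(df["id"]) != set(universe_ids):
        problems.append("ids must be unique and match the universe")
    ranks = df[EXPECTED_COLS[1:6]].to_numpy(dtype=float)
    if (ranks < 0).any() or (ranks > 1).any():
        problems.append("rank values must lie between 0 and 1")
    row_sums = ranks.sum(axis=1)
    if not np.all((np.abs(row_sums - 1) < eps) | (np.abs(row_sums) < eps)):
        problems.append("rank values must sum to 1 (or 0) in every row")
    gross = df["decision"].abs().sum()
    if gross > 1 + eps or gross < 0.25 - eps:
        problems.append("sum of absolute decisions must be in [0.25, 1]")
    return problems

# Tiny three-asset universe, for illustration only
universe = ["MMM", "ATVI", "GOOGL"]
sub = pd.DataFrame(
    [["MMM", 0, 0.1, 0.2, 0.5, 0.2, 0.0],
     ["ATVI", 0, 0, 1, 0, 0, 0.5],
     ["GOOGL", 0.1, 0.1, 0.1, 0.1, 0.6, -0.3]],
    columns=EXPECTED_COLS,
)
print(validate_submission(sub, universe, "TEAM1__20231001.csv"))
```

An empty list means no problem was detected; for the real competition the universe would contain all 110 asset identifiers.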

In [None]:
!pip install yfinance
!pip install pytz

import os
import sys
from google.colab import drive
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from datetime import datetime
import pytz

import shutil
import yfinance as yf
yf.pdr_override()

root_dir = '/content/drive/'
drive.mount(root_dir)

main_dir = root_dir+'MyDrive/ddmif_fall_2023/'
data_dir = main_dir+'data/'
sys.path.append(main_dir)
sys.path.append(data_dir)
os.chdir(main_dir)

# Remove any previous clone so that git clone starts from a clean directory
ddmif_dir = main_dir+'ddmif'
shutil.rmtree(ddmif_dir, ignore_errors=True)

!git clone https://github.com/naftalic/ddmif.git
import ddmif.ddmif_functions as ddmif
#pip freeze > requirements.txt


Mounted at /content/drive/
fatal: destination path 'ddmif' already exists and is not an empty directory.


In [None]:
first_friday = "2023-10-06"

first_friday   = pd.to_datetime(first_friday)  # ISO date string (YYYY-MM-DD)
first_saturday = first_friday + pd.Timedelta(days=1)
first_sunday   = first_friday + pd.Timedelta(days=2)
last_saturday  = first_friday + pd.Timedelta(days=51)

#get universe
universe_file = 'universe.csv'
universe_df   = ddmif.get_universe(data_dir, universe_file)

#get price data from yahoo
assets_df  = ddmif.get_data(data_dir, universe_df, first_friday.strftime('%Y-%m-%d'), \
             first_saturday.strftime('%Y-%m-%d'), first_sunday.strftime('%Y-%m-%d'), last_saturday.strftime('%Y-%m-%d'))
assets_df


[*********************100%%**********************]  1 of 1 completed


AttributeError: 'DataFrame' object has no attribute 'append'

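1. Compute a rate of change for each asset k: `asset_k_roc = (weighted_ma - last_day) / last_day`.
2. Map the rate of change to rank probabilities (for example, if `roc < -5%`, set `rank1 = 0.8` and `rank2 = 0.2`, with all other ranks 0), focusing on the top 10 and last 10 assets.
3. Map the ranks to investment decisions (roc => ranking: k 1st, m 2nd).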

In [None]:
def weighted_MA_roc(dataframe):
    """Return a linearly weighted average of daily rates of change per asset."""
    weighted_ma_roc = []
    asset_symbols = dataframe.columns  # asset symbols (column names)

    # Linear weight function y = 0.2 x, one weight per daily change
    weight = np.array([0.2 * i for i in range(dataframe.shape[0] - 1)])

    for i in range(dataframe.shape[1]):
        item = dataframe.iloc[:, i]  # Select the i-th column (asset)

        # Daily rate of change, measured relative to the current day's price
        roc = ((item - item.shift(1)) / item).dropna()

        # Drop days with no price change
        roc = roc[roc != 0]

        # Weighted average of the remaining changes, scaled by the full weight sum
        weighted_roc = weight[:len(roc)] * roc
        weighted_ma_i = np.sum(weighted_roc / np.sum(weight))
        weighted_ma_roc.append(weighted_ma_i)

    # Convert the list of ROC values to a Pandas Series with asset symbols as the index
    roc_series = pd.Series(weighted_ma_roc, index=asset_symbols, name='ROC')

    return roc_series

# Calculate the ROC values
roc_series = weighted_MA_roc(assets_df)

# Print the ROC values
print(roc_series[50:])



In [None]:
assets_df.shape

In [None]:
import pandas as pd
import numpy as np

# Sample DataFrame for illustration
# Replace this with your actual dataframe of asset prices

def bollinger_strategy(dataframe, window=20, k=2):
    asset_returns = {}
    for asset in dataframe.columns:
        prices = dataframe[asset]

        # Calculate moving average and standard deviation
        sma = prices.rolling(window=window).mean()
        rolling_std = prices.rolling(window=window).std()

        # Calculate upper and lower Bollinger Bands
        upper_band = sma + (rolling_std * k)
        lower_band = sma - (rolling_std * k)

        # Create signals: Buy (1), Hold (0), Sell (-1)
        signals = pd.Series(index=prices.index, dtype=float)
        signals[prices < lower_band] = 1
        signals[prices > upper_band] = -1
        signals = signals.ffill().fillna(0)

        # Calculate returns
        daily_returns = prices.pct_change()
        strategy_returns = signals.shift() * daily_returns
        total_return = (strategy_returns + 1).prod() - 1
        asset_returns[asset] = total_return

    # Rank assets based on returns
    asset_ranking = pd.Series(asset_returns).sort_values(ascending=False)
    return asset_ranking

# Apply the Bollinger Band strategy and rank the assets
ranking = bollinger_strategy(assets_df)
print("Asset Ranking based on Bollinger Band Strategy:")
print(ranking)


## **MLP Strategy**


## Model Architecture

In [None]:
import pandas as pd

In [None]:
# data cleaning
#get price data from yahoo
start_date = "2022-11-03"
end_date = "2023-11-05"
first_friday = "2022-11-07"

first_friday   = pd.to_datetime(first_friday)  # ISO date string (YYYY-MM-DD)
first_saturday = first_friday + pd.Timedelta(days=6)
first_sunday   = first_friday + pd.Timedelta(days=7)
first_monday = first_friday + pd.Timedelta(days=1)

# One year and 19 days after first_monday; DateOffset handles month boundaries
one_year_later = first_monday + pd.DateOffset(years=1, days=19)

inputs_df  = ddmif.get_data(data_dir, universe_df, first_friday.strftime('%Y-%m-%d'), \
             first_saturday.strftime('%Y-%m-%d'), first_sunday.strftime('%Y-%m-%d'), one_year_later.strftime('%Y-%m-%d'))


NameError: name 'universe_df' is not defined

In [None]:
inputs_df.head()

In [None]:
date = inputs_df.index[3]
date2 = inputs_df.index[322]
date3 = inputs_df.index[336]
date4 = inputs_df.index[334]
date5 = inputs_df.index[362]
print(date, date2, date3, date4, date5)

In [None]:
print(inputs_df.shape)

In [None]:
inputs_df

eq_col = []
non_eq = []

for col in inputs_df.columns:
    if inputs_df[col].iloc[0] == inputs_df[col].iloc[1]:
      eq_col.append(col)
    else:
      non_eq.append(col)

print(eq_col,non_eq)

eq_df = inputs_df #[eq_col]
non_df = inputs_df[non_eq]

eq_df1 = eq_df

eq_df = ( eq_df - eq_df.shift(1)) / eq_df
non_df = (non_df - non_df.shift(1)) / non_df

# eq_df = eq_df.shift(1)
# non_df = non_df.shift(-1)


print(eq_df.shape)


In [None]:
print(eq_df)

In [None]:
# df_rate_of_change = inputs_df.pct_change()
# df_rate_of_change
# df_rate_of_change = df_rate_of_change.iloc[1:-1]
# cryptos = universe_df[universe_df['class'] == 'Crypto']
# crypto_names = cryptos['symbol'].tolist()
# for column in df_rate_of_change.columns:
#     if column not in crypto_names:
#         df_rate_of_change[column] = df_rate_of_change[column].interpolate(method='linear')

# eq_df = eq_df.iloc[3:,:]
print(eq_df.shape)

xs = []  # List to store all x values
ys = []  # List to store all y values

xt = []  # List to store all x values
yt = []  # List to store all y values

for i in range(3, 370 - 48, 7):  # Sliding window logic
    # For x values: Use the rate of change dataframe (roc_df)
    for j in range(eq_df.shape[1]):
      y_window = eq_df1.iloc[i:i+34,j]
      if y_window.shape[0] > 32:
        fourth_friday_price = y_window.iloc[25]  # positional lookup within the window

        fifth_friday_price = y_window.iloc[32]
        y = (fifth_friday_price - fourth_friday_price) / fourth_friday_price

        ys.append(y)

        x_window = eq_df.iloc[i:i+28,j]
        x = x_window.values.flatten()
        xs.append(x)


for i in range(370 - 48, 370 - 34, 7):  # Sliding window logic
    # For x values: Use the rate of change dataframe (roc_df)
    for j in range(eq_df.shape[1]):
      y_window = eq_df1.iloc[i:i+34,j]
      if y_window.shape[0] > 32:
        fourth_friday_price = y_window.iloc[25]  # positional lookup within the window

        fifth_friday_price = y_window.iloc[32]
        y = (fifth_friday_price - fourth_friday_price) / fourth_friday_price

        yt.append(y)

        x_window = eq_df.iloc[i:i+28,j]
        x = x_window.values.flatten()
        xt.append(x)

      # # For y values: Use the original price dataframe (price_df)
      # y_window = eq_df.iloc[j,i:i+34]
      # fourth_friday_price = y_window[25]
      # print(y_window.shape)
      # fifth_friday_price = y_window[32]
      # y = (fifth_friday_price - fourth_friday_price) / fourth_friday_price

      # ys.append(y)


xtest = []
ytest = []

for j in range(eq_df.shape[1]):

  x_window = eq_df.iloc[362 - 28: 362 ,j]
  x = x_window.values.flatten()
  xtest.append(x)






# Convert xs and ys to numpy arrays for further processing
xs = np.array(xs)
ys = np.array([ys]).T

xt = np.array(xt)
yt = np.array([yt]).T

xtest = np.array(xtest)

print(xs.shape)  # (number of training samples, 28)
print(ys.shape)  # (number of training samples, 1)

print(xt.shape)  # (number of validation samples, 28)
print(yt.shape)  # (number of validation samples, 1)

print(xtest.shape)


In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset



# Your data (dummy example)
# xs = np.random.rand(48, 3080)
# ys = np.random.rand(48, 110)

# Convert to PyTorch tensors
X_tensor = torch.FloatTensor(xs)
Y_tensor = torch.FloatTensor(ys)

X_ = torch.FloatTensor(xt)
Y_= torch.FloatTensor(yt)

print(X_tensor.shape)

# Create a DataLoader for your training and validation sets
train_dataset = TensorDataset(X_tensor, Y_tensor)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(X_, Y_)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=True)


In [None]:
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(28, 64)  # 28 input features (one per day in the input window)
        self.fc2 = nn.Linear(64, 16)
        self.fc3 = nn.Linear(16, 1)  # single output: the predicted return for the following week

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


# Instantiate the model
model = MLP()


### Train Loop

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()  # Mean Squared Error for regression tasks

# Training Loop
for epoch in range(150):  # Number of epochs
    model.train()
    for batch_X, batch_Y in train_loader:
        optimizer.zero_grad()
        output = model(batch_X)
        train_loss = criterion(output, batch_Y)
        train_loss.backward()
        optimizer.step()

    # Validation: average loss over the held-out windows
    model.eval()
    total_loss = 0.0
    total_samples = 0

    with torch.no_grad():
        for batch_X, batch_Y in test_loader:
            output = model(batch_X)
            loss = criterion(output, batch_Y)
            total_loss += loss.item() * batch_X.size(0)
            total_samples += batch_X.size(0)

    average_loss = total_loss / total_samples

    # Report the last training batch loss and the average validation loss
    print(f"Epoch {epoch+1}, Loss: {train_loss.item()}, Eval: {average_loss}")


In [None]:
print(np.isnan(xs).any(), np.isnan(ys).any())
print(np.isinf(xs).any(), np.isinf(ys).any())


In [None]:
mlpout = []
X = torch.FloatTensor(xtest)
for i in range(xtest.shape[0]):
  pred = model(X[i].reshape(-1))
  mlpout.append(pred.detach().cpu().numpy()[0])
print(mlpout[2])



In [None]:
mlpout = pd.Series(mlpout)

from scipy.stats import zscore
print(zscore(mlpout))

## **Final Combination**

In [None]:
from scipy.stats import zscore

def combined_ranking(dataframe, mlp):
    # Calculate ROC score per asset
    roc_series = weighted_MA_roc(dataframe)
    index = roc_series.index

    # Calculate Bollinger Band strategy returns, realigned to the asset order of
    # roc_series (bollinger_strategy returns them sorted by return)
    bollinger_ranking = bollinger_strategy(dataframe).reindex(index)

    # Normalize each signal with a Z-score, keeping the asset index explicit
    zscore_roc = pd.Series(zscore(roc_series.to_numpy()), index=index)
    zscore_bollinger = pd.Series(zscore(bollinger_ranking.to_numpy()), index=index)
    zscore_mlp = pd.Series(zscore(np.asarray(mlp, dtype=float)), index=index)

    # Assign weights
    w1, w2, w3 = 0.3, 0.01, 0.2

    # Combine the Z-scores
    combined_zscore = w1*zscore_roc + w2*zscore_bollinger + w3*zscore_mlp

    # Create final ranking (left unsorted so positions match the asset order)
    combined_ranking_series = pd.Series(combined_zscore, index=index)

    return combined_ranking_series

# Apply the combined ranking
final_ranking = combined_ranking(assets_df, mlpout)

print("Final Combined Asset Ranking:")
print(final_ranking[50:])


In [None]:
rank = final_ranking.sort_values(ascending=False)
print(rank[:20],rank[100:])

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Plot a histogram of the combined scores using Matplotlib
plt.hist(final_ranking, bins=50, edgecolor='black')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Series')
plt.show()

In [None]:
print(final_ranking.iloc[1])  # combined score of the second asset



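Idea for mapping a score to rank probabilities: an asset such as CZR with a combined score of 0.795 should receive rank probabilities similar to [0.01, 0.09, 0.1, 0.2, 0.6]. Place the five ranks at reference points [r1, r2, r3, r4, r5] = [-2, -1.2, -0.4, 0.4, 1.2], evaluate the density of N(0.795, sigma) at each point, for example pdf(r1, r2, r3, r4, r5) = [0.01, 0.015, 0.02, 0.04, 0.99], and then apply a softmax to obtain the normalized scores for r1 through r5.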

In [None]:
import scipy.stats as stats

def get_cdf(x):
  std = 1
  seg = np.array([-1000, -1.0 , -0.3 , 0.3, 1.0, 1000] )#segment
  cdf = stats.norm.cdf(seg, loc=x, scale=std)
  cdf = np.diff(cdf)
  return cdf # normalized ranking score

asset_names = assets_df.columns.tolist()  # Extract column names from assets_df
rank_cols = [f'rank{i+1}' for i in range(5)]
df = pd.DataFrame(0.0, index=asset_names, columns=rank_cols)

# Convert each combined score into five rank probabilities via the normal CDF
for i, asset in enumerate(asset_names):
    df.loc[asset, rank_cols] = get_cdf(final_ranking.iloc[i])




# # Map to brackets based on probabilities
# # We use digitize to get bin numbers, starting from 0, so we add 1 to start from 1.
# brackets = np.digitize(probabilities, [0.2, 0.4, 0.6, 0.8]) + 1
# print(brackets)

# # Create DataFrame
# asset_names = assets_df.columns.tolist()  # Extract column names from assets_df
# df = pd.DataFrame(0, index=asset_names, columns=[f'rank{i+1}' for i in range(5)])

# # Populate DataFrame based on brackets
# for asset, rank in zip(asset_names, brackets):
#     df.loc[asset, f'rank{rank}'] = 1.0

print("Do all rows sum to 1?:", np.isclose(df.sum(axis=1), 1.0).all())
print(df.iloc[:40])


In [None]:
# Create a DataFrame containing the names and final_ranking values
ranking_df = pd.DataFrame({'Stock Name': final_ranking.index, 'Final Ranking': final_ranking.values})

# Sort by 'Final Ranking' in descending order and keep up to 30 top stocks above the long threshold
top_20_stocks = ranking_df.sort_values(by='Final Ranking', ascending=False).head(30)
top_20_stocks = top_20_stocks[top_20_stocks['Final Ranking'] > 0.25]

# Sort by 'Final Ranking' in ascending order and keep up to 30 worst stocks below the short threshold
worst_20_stocks = ranking_df.sort_values(by='Final Ranking', ascending=True).head(30)
worst_20_stocks = worst_20_stocks[worst_20_stocks['Final Ranking'] < -0.3]

# Concatenate the filtered top and worst stocks
selected_stocks_df = pd.concat([top_20_stocks, worst_20_stocks])

# Reset the index of the selected_stocks_df
selected_stocks_df.reset_index(drop=True, inplace=True)

# Create a new DataFrame with absolute values of 'Final Ranking'
selected_stocks_abs_df = selected_stocks_df.copy()
selected_stocks_abs_df['Final Ranking'] = selected_stocks_df['Final Ranking'].abs()

from scipy.special import softmax

# print(selected_stocks_df, selected_stocks_abs_df)

# Extract the absolute values of 'Final Ranking' column
abs_rankings = selected_stocks_abs_df['Final Ranking'].values

# Apply softmax to the absolute values
softmax_values = softmax(abs_rankings)



# Create a new DataFrame with softmax values and stock names
softmax_df = pd.DataFrame({'Stock Name': selected_stocks_abs_df['Stock Name'], 'Softmax Value': softmax_values})

merged_df = selected_stocks_df.merge(softmax_df, on='Stock Name', how='inner')


# Filter stocks with positive and negative 'Final Ranking' values
long_stocks = merged_df[merged_df['Final Ranking'] > 0]
short_stocks = merged_df[merged_df['Final Ranking'] < 0]

# Calculate the weights for long and short positions based on softmax values
long_weights = long_stocks['Softmax Value']
short_weights = -short_stocks['Softmax Value']

# Create portfolios for long and short positions
long_portfolio = pd.DataFrame({'Stock Name': long_stocks['Stock Name'], 'Weight': long_weights})
short_portfolio = pd.DataFrame({'Stock Name': short_stocks['Stock Name'], 'Weight': short_weights})

# Combine the long and short portfolios to create the final portfolio
portfolio = pd.concat([long_portfolio, short_portfolio])

# Print the final portfolio
print("Final Portfolio:")
print(portfolio)

sum_absolute_weights = abs(portfolio['Weight']).sum()


print("Sum of Absolute Weights in Portfolio:", sum_absolute_weights)

portfolio.set_index('Stock Name', inplace=True)

df_final = df.join(portfolio)

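# Assets outside the portfolio get a zero weight (assign back instead of chained inplace fillna)
df_final['Weight'] = df_final['Weight'].fillna(0)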

df_final.rename(columns={'Weight': 'decision'}, inplace=True)

print(df_final)


In [None]:
print(df_final)

df_final['i'] = df_final.index


df_final.reset_index(drop=True, inplace=True)


df_final.index = [f'{i}' for i in range(len(df_final))]

print(df_final)


In [None]:
last_column = df_final.iloc[:, -1]

df_final.insert(0, 'id', last_column)

df_final.drop(df_final.columns[-1], axis=1, inplace=True)

In [None]:
print(df_final)

In [None]:
# # Assuming you have 'roc_series', 'portfolio', and 'merged_df' DataFrames already defined

# # Convert 'roc_series' to a DataFrame and reset its index
# roc_df = roc_series.reset_index()

# # Rename the columns to match your requirements
# roc_df.columns = ['id', 'rank1']

# # Merge 'roc_df' with 'portfolio' on 'Stock Name' to get ranks
# merged_ranks_df = roc_df.merge(portfolio, left_on='id', right_on='Stock Name', how='left')

# # Create the new DataFrame with the specified columns
# new_df = merged_ranks_df[['id', 'rank1', 'rank2', 'rank3', 'rank4', 'rank5', 'decision']].copy()

# # Fill NaN values in 'decision' column with 0
# new_df['decision'].fillna(0, inplace=True)

# # Print the new DataFrame
# print(new_df)


Top 20 scores [x1, ..., x20] and bottom 20 scores [y1, ..., y20] (make sure the top 20 are positive and the bottom 20 are negative) => [s1, ..., s40] => softmax => normalized scores (40 assets) * scaling factor (0.9, 0.95, or 1) => decisions (top 20 positive, bottom 20 negative).
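The recipe in the note above can be sketched as follows. This is one reading of the note, not the notebook's exact code: softmax is applied to the absolute scores, each weight takes the sign of its score, and the result is multiplied by a scaling factor. The function names, the `gross_exposure` parameter, and the four ticker labels are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def decisions_from_scores(scores, gross_exposure=1.0):
    """Map signed scores to signed weights whose absolute values sum to gross_exposure."""
    names = list(scores)
    magnitudes = softmax([abs(scores[n]) for n in names])
    return {
        n: math.copysign(m * gross_exposure, scores[n])
        for n, m in zip(names, magnitudes)
    }

# Hypothetical scores: positive = long candidates, negative = short candidates
example_scores = {"AAA": 0.8, "BBB": 0.4, "CCC": -0.5, "DDD": -0.9}
weights = decisions_from_scores(example_scores, gross_exposure=0.95)
print(weights)
print("sum of |weights|:", sum(abs(w) for w in weights.values()))
```

Because softmax always sums to 1, the sum of absolute weights equals the scaling factor, so a factor of 1 or less keeps the portfolio within a unit gross-exposure limit.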

In [None]:
# def filter_non_zero_decisions_and_check_sum(ranks_df):
#     # Filter out rows where 'decision' is not 0
#     non_zero_decisions = ranks_df[ranks_df['decision'] != 0]['decision']

#     # Calculate the sum of the absolute values
#     abs_sum = non_zero_decisions.abs().sum()

#     print("Sum of absolute values of non-zero decisions:", abs_sum)

#     # Check the sum against your criteria
#     if abs_sum <= 1 and abs_sum > 0.25:
#         print("The sum of absolute values is within the specified criteria.")
#     else:
#         print("The sum of absolute values is NOT within the specified criteria.")

#     return non_zero_decisions

# # Assuming 'ranks' DataFrame has a 'decision' column
# # Run this function to get the non-zero decisions and check their absolute sum
# non_zero_decisions = filter_non_zero_decisions_and_check_sum(ranks)

# print("Non-zero decisions:")
# print(non_zero_decisions)



In [None]:

# Rename the sorted DataFrame to submission_file
submission_file = df_final

print("submission_file DataFrame:")
print(submission_file)


In [None]:
# Alias df_final as submission_file (same object; the 'id' column was already created above)
submission_file = df_final

# submission_file['id'] = submission_file.index

# submission_file.reset_index(drop=True, inplace=True)

# Rename the columns to match the desired format
# submission_file.columns = ['id', 'rank1', 'rank2', 'rank3', 'rank4', 'rank5', 'decision']

# Display the modified DataFrame
print("Modified submission_file DataFrame:")
print(df_final)


In [None]:
# Submission file example:
import pytz
from datetime import datetime

N = 110
# num_groups = 5

# for i in range(num_groups):

est = pytz.timezone('US/Eastern')  # US Eastern time zone (EST or EDT, depending on the date)
current_time = datetime.now(est)  # Current time in US Eastern time
submission_time = current_time.strftime('%Y-%m-%d_%H:%M:%S')


submission_dir = main_dir+'submissions/dueNov26/'
filename = 'DYF_'+'__'+submission_time+ '.csv'
df_final.to_csv(submission_dir+filename)
print(submission_dir+filename)
print(submission_file)
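Before uploading the CSV, it can help to re-check the two constraints this notebook tests elsewhere: each row of rank probabilities sums to 1, and the sum of absolute decisions lies in (0.25, 1]. The second bound is taken from the commented-out check earlier in the notebook. The sketch below works on plain dictionaries so it has no dependencies; `validate_submission` and `RANK_COLS` are hypothetical names, and the sample rows are made up.

```python
import math

RANK_COLS = ["rank1", "rank2", "rank3", "rank4", "rank5"]

def validate_submission(rows, tol=1e-6):
    """Return a list of problems found in a submission; an empty list means it passed."""
    problems = []
    for row in rows:
        probs = [row[c] for c in RANK_COLS]
        if any(p < 0 or p > 1 for p in probs):
            problems.append(f"{row['id']}: rank probability outside [0, 1]")
        if not math.isclose(sum(probs), 1.0, abs_tol=tol):
            problems.append(f"{row['id']}: rank probabilities sum to {sum(probs):.6f}")
    # Gross exposure bound assumed from the notebook's commented-out check
    gross = sum(abs(row["decision"]) for row in rows)
    if not (0.25 < gross <= 1.0 + tol):
        problems.append(f"sum of |decision| is {gross:.4f}, expected in (0.25, 1]")
    return problems

# Made-up rows for illustration
good_rows = [
    {"id": "AAA", "rank1": 0.2, "rank2": 0.2, "rank3": 0.2, "rank4": 0.2, "rank5": 0.2, "decision": 0.5},
    {"id": "BBB", "rank1": 0.2, "rank2": 0.2, "rank3": 0.2, "rank4": 0.2, "rank5": 0.2, "decision": -0.4},
]
bad_rows = [
    {"id": "CCC", "rank1": 0.3, "rank2": 0.2, "rank3": 0.2, "rank4": 0.2, "rank5": 0.2, "decision": 0.1},
]
print(validate_submission(good_rows))
print(validate_submission(bad_rows))
```

With a DataFrame laid out like `df_final` (columns `id`, `rank1` to `rank5`, `decision`), the rows could be passed as `validate_submission(df_final.to_dict("records"))`.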

In [None]:
# #submission file example:

# N = 110
# # num_groups = 5

# # for i in range(num_groups):

# est = pytz.timezone('US/Eastern')  # Get the US Eastern Time Zone
# current_time = datetime.now(est)  # Get the current time in EST
# submission_time = current_time.strftime('%Y-%m-%d_%H:%M:%S')

# feature_list          = ['id', 'rank1', 'rank2', 'rank3', 'rank4','rank5', 'decision']
# submission_file       = pd.DataFrame(0, index=np.arange(N), columns = feature_list)
# submission_file['id'] = universe_df['symbol'].copy()

# x = np.random.uniform(0, 1, size=(N, 5))
# submission_file[['rank1', 'rank2', 'rank3', 'rank4','rank5']] = x / np.reshape(np.sum(x,axis=1),(-1,1))

# x = np.reshape(np.random.normal(0, .05, size=N), (-1,1))
# submission_file[['decision']] = x/np.sum(np.abs(x))

# submission_dir = main_dir+'submissions/dueSep17/'
# filename = 'XYZ_'+'__'+submission_time
# submission_file.to_csv(submission_dir+filename)
# print(submission_dir+filename)
# print(submission_file)

In [None]:
# # 'fake' real data
# mean_price = 100
# std_price  = 20
# np.random.seed(10)
# S = np.random.normal(mean_price, std_price, size=(110, 6))

# vector = S[:,-1]/S[:,0]

# def quintile_rank_with_average_tie(vector):
#   data = pd.DataFrame({'value':vector, 'dense_rank':np.argsort(np.argsort(vector))})
#   data['quintile'] = pd.qcut(data['dense_rank'], q=5, labels=range(1, 6)).astype('float32')
#   data['quintile_rank'] = data.groupby('value')['quintile'].transform(lambda x: x.mean())
#   return data['quintile_rank'].values

# def map_to_vector(vector):
#     quintile_vector = np.zeros((len(vector), 5))
#     for i in range(len(vector)):
#         quintile_vector[i, int(vector[i])-1] = 1
#         if np.abs(int(vector[i])-vector[i])>0:
#           quintile_vector[i, int(np.ceil(vector[i]))-1]  = vector[i]-np.floor(vector[i])
#           quintile_vector[i, int(np.floor(vector[i]))-1] = np.ceil(vector[i])-vector[i]
#     return quintile_vector

# vector = S[:,-1]/S[:,0]
# quintile_rank = quintile_rank_with_average_tie(vector)
# q = map_to_vector(quintile_rank)

# print(np.sum(q,axis=0))

In [None]:
# submission_dir = main_dir+'submissions/dueOct1/'
# symbols = universe_df.sort_values(by=['symbol'])['symbol']
# eps = 1e-3
# from glob import glob
# lst = glob(submission_dir+'/*')

# stats = pd.DataFrame(columns=['group_name','forecast_performance','decision_performance'])
# group_name = []

# for i in range(len(lst)):

#   try:
#     ddmif.eval_submission_file(lst[i], universe_df)

#     df = pd.read_csv(lst[i],sep=",", header=0,index_col=0).sort_values(by=['id'])
#     f = df[['rank1', 'rank2', 'rank3', 'rank4','rank5']].to_numpy()
#     w = np.reshape(df['decision'].to_numpy(),(-1,1))
#     stats.loc[i,:] = lst[i].split('/')[-1].split('__')[0], forecast_performance(f, q), decision_performance(S, w)
#   except:
#     print('******************')
#     print("**Problem** with file:", i, lst[i].split('/')[-1])
#     stats.loc[i,:] = lst[i].split('/')[-1].split('__')[0], 1e6, 1e6

# stats['overall_rank'] = (pd.Series(stats['forecast_performance']).rank(method='dense') +\
#                                      pd.Series(stats['decision_performance']).rank(method='dense'))/2

# scoring_dir = main_dir+'scoring/'
# stats.to_csv(scoring_dir+'dueOct1.csv')

In [None]:
# lst2 = glob(scoring_dir+'/*')

# for i in range(len(lst2)):
#   df = pd.read_csv(lst2[i],sep=",", header=0,index_col=0)
#   if i==0:
#     final_stats = pd.DataFrame()
#     final_stats['group_name'] = df['group_name']
#   final_stats['OR_'+str(i)] = df['overall_rank']

# cols = [i for i in final_stats.columns if 'OR' in i]
# final_stats['mean_OR'] = final_stats[cols].mean(axis=1)

# final_stats


# Backtesting

In [None]:
import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta

# Assuming universe_df is your DataFrame containing asset symbols
tickers = universe_df['symbol']

# Define the start and end dates for backtesting
end_date = datetime.today().strftime('%Y-%m-%d')
start_date = '2017-01-01'  # Adjust as per your backtesting period

print(f"Start Date: {start_date}, End Date: {end_date}")
print('*' * 80)

# Function to check if the stock has at least two years of price history
def validate_stock_history(stock):
    history = stock.history(period="max")
    if history.empty:
        return False  # no data returned (e.g., delisted or unknown symbol)
    listing_date = history.index[0].to_pydatetime().date()
    required_history_date = (datetime.today() - timedelta(days=2*365)).date()
    return listing_date <= required_history_date

# DataFrame to hold all historical data
all_data = pd.DataFrame()

# Download historical data for each ticker
for ticker in tickers:
    stock = yf.Ticker(ticker)

    # Validate stock history
    if validate_stock_history(stock):
        df = stock.history(start=start_date, end=end_date, auto_adjust=True)
        df['ticker'] = ticker
        df['name'] = stock.info.get('longName', ticker)  # fall back to the ticker if no long name is available
        df = df.rename(columns={"Close": "close"})
        df = df[['name', 'ticker', 'close']]
        all_data = pd.concat([all_data, df], axis=0)
    else:
        print(f"Skipping {ticker} as it does not have sufficient history")

# Reset index and rename 'Date' column
all_data.reset_index(inplace=True)
all_data = all_data.rename(columns={"Date": "date"})

# Optionally, save the data to a CSV file
# all_data.to_csv('historical_stock_data.csv', index=False)

all_data


In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Load your prepared data
# Assuming all_data is already loaded from your previous steps
# all_data = pd.read_csv('path_to_your_prepared_data.csv')  # Uncomment and set the path if needed

# Function to calculate WMA and future weekly returns
def calculate_wma_and_weekly_return(df, window=20):
    # Calculate the weights for WMA
    weights = np.arange(1, window + 1)

    # Function to calculate WMA for a given series
    def wma(series):
        return np.sum(weights * series) / np.sum(weights)

    # Apply WMA calculation to each stock's data
    df['WMA'] = df.groupby('ticker')['close'].transform(lambda x: x.rolling(window=window).apply(wma, raw=True))

    # Calculate future weekly return
    df['Future_Weekly_Return'] = df.groupby('ticker')['close'].shift(-5) / df['close'] - 1

    return df

# Apply the function to the all_data DataFrame
enhanced_data = calculate_wma_and_weekly_return(all_data)

# Displaying the first few rows of the enhanced data
print(enhanced_data.head())

# Optionally, you can save this data to a CSV file
# enhanced_data.to_csv('enhanced_stock_data.csv', index=False)


In [None]:
def weighted_MA_roc(df, window=20):
    weighted_ma_roc = pd.DataFrame()

    # Weight function: y = 0.2 * x, with x = 0 for the oldest value in the window
    weights = 0.2 * np.arange(window)

    for ticker in df['ticker'].unique():
        single_stock_data = df[df['ticker'] == ticker].sort_values(by='date').copy()

        # Daily rate of change (ROC) relative to the previous close; the first row has no ROC
        roc = single_stock_data['close'].pct_change().dropna()

        # Need at least `window` ROC values to form one weighted average
        if len(roc) >= window:
            # np.convolve flips its kernel, so reverse the weights to keep the
            # largest weight on the most recent observation
            weighted_roc = np.convolve(roc, weights[::-1], mode='valid') / np.sum(weights)

            # The first full window of ROC values ends at row `window`, so drop the
            # first `window` rows to align each average with the last date it uses
            single_stock_data = single_stock_data.iloc[window:].copy()
            single_stock_data['Weighted_MA_ROC'] = weighted_roc
            weighted_ma_roc = pd.concat([weighted_ma_roc, single_stock_data])

    return weighted_ma_roc


In [None]:
# Calculate weighted moving average ROC
roc_data = weighted_MA_roc(all_data)

# Merge only the new ROC column into enhanced_data to avoid duplicated (_x/_y) columns
enhanced_data_with_roc = enhanced_data.merge(roc_data[['date', 'ticker', 'Weighted_MA_ROC']], on=['date', 'ticker'])

# Proceed with your existing backtesting code, replacing WMA with Weighted_MA_ROC
# ...


In [None]:
print(enhanced_data.tail())

In [None]:
%pip install pandas numpy statsmodels plotly


In [None]:
enhanced_data_cleaned = enhanced_data.dropna(subset=['WMA', 'Future_Weekly_Return'])

In [None]:
# Assuming enhanced_data is already loaded

# Convert 'date' column to datetime format
enhanced_data['date'] = pd.to_datetime(enhanced_data['date'], errors='coerce', utc=True)

# Remove timezone information
enhanced_data['date'] = enhanced_data['date'].dt.tz_localize(None)



In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly.express as px

# Assuming enhanced_data is already loaded with WMA and Future_Weekly_Return
# ... other parts of the code ...

# Ensure the 'date' column is in datetime format
enhanced_data['date'] = pd.to_datetime(enhanced_data['date'])

# Perform the regression and subsequent analysis
results = []

train_end = pd.to_datetime('2020-12-31')
val_end = pd.to_datetime('2022-12-31')
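train_window_size = 52  # offset between the first and last date of each training window, in unique dates (daily rows, not weeks)
val_window_size = 1  # number of dates in each evaluation step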

dates = sorted(enhanced_data['date'].unique())

for start in range(0, len(dates) - train_window_size - val_window_size, val_window_size):
    end = start + train_window_size
    window_data = enhanced_data[(enhanced_data['date'] >= dates[start]) & (enhanced_data['date'] <= dates[end])]

    t0 = window_data['date'].max()  # latest date in the window (rows are grouped by ticker, not sorted by date)
    segment = 'oos' if t0 > val_end else 'val' if t0 > train_end else 'train'

    next_window_data = enhanced_data[(enhanced_data['date'] > dates[end]) & (enhanced_data['date'] <= dates[end + val_window_size])]

    regression_result = perform_regression(window_data, next_window_data, segment)
    if regression_result:
        end_date = window_data['date'].max()
        full_result = regression_result + (end_date,)
        results.append(full_result)

results_df = pd.DataFrame(results, columns=['intercept', 'slope', 'segment', 'x_train', 'y_train', 'x_test', 'y_test', 'y_pred', 'end_date'])

# Plotting
# Scatter plot for the last training segment
last_train_segment = results_df[results_df['segment'] == 'train'].iloc[-1]
fig = px.scatter(x=last_train_segment['x_train'], y=last_train_segment['y_train'], labels={'x':'WMA', 'y':'Future Weekly Return'}, title='WMA vs Future Weekly Return')
fig.show()

# Plot for slope over time
fig = px.scatter(results_df, x='end_date', y='slope', color='segment', title='Slope of Regression Over Time')
fig.update_layout(xaxis_title='Date', yaxis_title='Slope')
fig.show()
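The rolling loop above calls `perform_regression`, which is not defined in the cells shown here. A minimal sketch that matches the tuple layout expected by `results_df` might look like the following. It is an assumption, not the notebook's original helper: it fits a one-variable least-squares line with `np.polyfit` instead of `statsmodels`, and the `x_col` and `y_col` parameters are hypothetical additions.

```python
import numpy as np
import pandas as pd

def perform_regression(window_data, next_window_data, segment,
                       x_col="WMA", y_col="Future_Weekly_Return"):
    """Fit y = intercept + slope * x on the training window and predict the next window.

    Returns None when either window has too few complete rows to fit or evaluate.
    """
    train = window_data.dropna(subset=[x_col, y_col])
    test = next_window_data.dropna(subset=[x_col, y_col])
    if len(train) < 2 or test.empty or train[x_col].nunique() < 2:
        return None

    x_train = train[x_col].to_numpy(dtype=float)
    y_train = train[y_col].to_numpy(dtype=float)
    x_test = test[x_col].to_numpy(dtype=float)
    y_test = test[y_col].to_numpy(dtype=float)

    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    y_pred = intercept + slope * x_test

    # Same order as the first eight columns of results_df
    return (intercept, slope, segment, x_train, y_train, x_test, y_test, y_pred)
```

The returned tuple is non-empty, so the `if regression_result:` check in the loop passes whenever a fit was possible, and the loop appends `end_date` as the ninth element. Note that `Future_Weekly_Return` looks five rows ahead, so the last rows of each training window use prices from after the window ends; dropping those rows before fitting would give a stricter out-of-sample test.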
