# Elastic Net Regression - BSQF Assessment

This project is a sample prepared by Bocconi Students Quantitative Finance (BSQF) for the Fall 2025 recruitment process.
In this project we are going to:
- Give a theoretical introduction to the topic
- Download weekly data for 20-30 stocks
- Build features such as momentum, rolling volatility and size
- Train an Elastic Net with cross-validation to choose alpha and l1_ratio
- Compare its ability to predict next week's return with that of other regressions (OLS, Lasso, Ridge)

### Part 1: Introduction

One of the main risks when running a linear regression is overfitting: the model adapts too closely to the training data and becomes unable to generalize.
Lasso and Ridge regressions tackle this problem by exploiting the bias-variance trade-off: they introduce some bias but decrease variance, which can result in a lower MSE.  
These methods are mainly useful with large datasets where the explanatory variables may be collinear and not all informative. In that setting, attributing large coefficients to such variables can lead to a misinterpretation of their impact on the dependent variable, so these models shrink the coefficient values and, in the case of Lasso, can set them exactly to zero.  
The Elastic Net regression is a further refinement that combines the Lasso and Ridge penalties; we will now see precisely how it works.
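To make the difference between shrinking and eliminating coefficients concrete, here is a minimal sketch for a single coefficient. It assumes orthonormal predictors and a least-squares loss of the form ½·(squared error) with penalties λ1·|β| (Lasso part) and (λ2/2)·β² (Ridge part); under that assumption each coefficient has a closed-form solution. The function names and the example OLS coefficients are hypothetical and chosen only for illustration.

```python
def soft_threshold(b, lam):
    """Move b toward zero by lam; return 0.0 when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_coef(b_ols, lam2):
    # Ridge rescales the OLS coefficient but never sets it to zero
    return b_ols / (1.0 + lam2)


def lasso_coef(b_ols, lam1):
    # Lasso subtracts a fixed amount and zeroes out small coefficients
    return soft_threshold(b_ols, lam1)


def elastic_net_coef(b_ols, lam1, lam2):
    # Elastic Net applies the Lasso threshold first, then the Ridge rescaling
    return soft_threshold(b_ols, lam1) / (1.0 + lam2)


# Hypothetical OLS coefficients: one large, one small, one negative
ols = [2.0, 0.3, -1.0]

ridge = [ridge_coef(b, 1.0) for b in ols]            # about [1.0, 0.15, -0.5]
lasso = [lasso_coef(b, 0.5) for b in ols]            # about [1.5, 0.0, -0.5]
enet = [elastic_net_coef(b, 0.5, 1.0) for b in ols]  # about [0.75, 0.0, -0.25]

print("OLS        :", ols)
print("Ridge      :", ridge)
print("Lasso      :", lasso)
print("Elastic Net:", enet)
```

In this toy setting the small coefficient (0.3) survives under Ridge but is removed by Lasso and Elastic Net, while the larger ones are only shrunk. With correlated real-world features there is no such closed form, and the solution is found numerically.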

### Part 2: Gather market data

In [27]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [32]:
# We start by downloading real stocks' data
tickers = [
    "AAPL", "MSFT", "GOOGL", "AMZN", "TSLA",
    "JPM", "NVDA", "XOM", "KO", "BA",
    "PFE", "CVX", "DIS", "NFLX", "INTC",
    "NKE", "MCD", "COST", "AMD", "GS"
]
tickers.sort()
start = "2020-01-01" 
end = "2025-10-06"
data = yf.download(tickers, start=start, end=end)

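# Shares outstanding per ticker: a single current snapshot from yfinance,
# repeated on every date (it is not a historical series)
shares_table = pd.DataFrame(index=data.index)
for ticker in tickers:
    shares_table[ticker] = yf.Ticker(ticker).info.get("sharesOutstanding", None)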


In [33]:
# Let's see what our data looks like
data.head()

Price,Close,Close,Close,Close,Close,Close,Close,Close,Close,Close,...,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume
Ticker,AAPL,AMD,AMZN,BA,COST,CVX,DIS,GOOGL,GS,INTC,...,JPM,KO,MCD,MSFT,NFLX,NKE,NVDA,PFE,TSLA,XOM
Date
2020-01-02,72.538506,49.099998,94.900497,331.348572,266.874512,93.955627,145.769882,67.965225,204.18988,53.666466,...,10803700,11867700,3554200,22622100,4485800,5644100,237536000,16514072,142981500,12456400
2020-01-03,71.833282,48.599998,93.748497,330.791901,267.094269,93.630661,144.097794,67.60968,201.802216,53.013714,...,10386800,11354500,2767600,21116200,3806900,4541800,205384000,14922848,266677500,17386900
2020-01-06,72.405693,48.389999,95.143997,331.766083,267.167542,93.313423,143.261703,69.411758,203.867432,52.863766,...,10259000,14698300,4660400,20813700,5663100,4612400,262636000,15771951,151995000,20081900
2020-01-07,72.065125,48.25,95.343002,335.285156,266.746368,92.121872,143.310898,69.277687,205.209473,51.981667,...,10531300,9973900,4047400,21634100,4703200,6719900,314856000,20108107,268231500,17387700
2020-01-08,73.224411,47.830002,94.598503,329.410095,269.804382,91.06955,143.015808,69.770782,207.187561,52.016956,...,9695300,10676000,5284200,27746500,7104500,4942200,277108000,16403507,467164500,15137700


#### Value selection

The yfinance download method supplies five different values: Open, Close, High, Low and Volume. As we want to predict price, not peaks or traded quantity, we avoid using High, Low and Volume.  
Choosing between Open and Close, we prefer Close, as it is the settlement price for a given security. Specifically, we are going to predict the adjusted Close price, as it is adjusted for dividends and stock splits.

In [59]:
target = np.log(data["Close"]/data["Close"].shift(1)).iloc[-1]
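Log returns are used for the target because they add up over time, which matters when moving from daily observations to a weekly horizon. The sketch below is a small arithmetic check on hypothetical closing prices (they are not taken from the downloaded data): the sum of the daily log returns equals the log return over the whole period.

```python
import math

# Hypothetical closing prices for six consecutive trading days
closes = [100.0, 102.0, 101.0, 105.0, 103.0, 104.0]

# Daily log returns: log(P_t / P_{t-1})
daily_log_returns = [math.log(curr / prev) for prev, curr in zip(closes, closes[1:])]

# Log return over the whole period: log(P_last / P_first)
period_log_return = math.log(closes[-1] / closes[0])

# The two quantities agree up to floating-point error
print(sum(daily_log_returns), period_log_return)
```

Simple percentage returns do not have this property: they compound multiplicatively rather than add.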

In [58]:
# We use the adjusted close: recent yfinance versions default to auto_adjust=True,
# so "Close" is already adjusted and there is no separate Adj Close column
close = data["Close"].copy()  # explicit copy, so adding columns does not raise SettingWithCopyWarning

# Market capitalisation = price * shares outstanding
cap_table = pd.DataFrame(index=data.index)
for ticker in tickers:
    cap_table[ticker] = close[ticker] * shares_table[ticker]

for ticker in tickers:
    work_data = close[ticker]
    # Daily log return
    close[f"{ticker}_log_return"] = np.log(work_data / work_data.shift(1))
    # Momentum: return from 252 to 21 trading days ago
    close[f"{ticker}_momentum"] = work_data.shift(21) / work_data.shift(252) - 1
    # Rolling 84-day standard deviation of the price
    close[f"{ticker}_roll_vol"] = work_data.rolling(window=84).std()
    # Size (market capitalisation)
    close[f"{ticker}_cap"] = cap_table[ticker]
# Drop the warm-up rows that lack a full look-back window, then the last date (used for the target)
close = close.dropna()
close = close.drop("2025-10-03")
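The feature definitions in the cell above can be checked on a single synthetic price series. The sketch below is a minimal illustration under stated assumptions: the random-walk prices, the seed and the names `prices`, `features` and `clean` are hypothetical and are not part of the project data. It uses the same windows as the cell (21 and 252 days for momentum, 84 days for the rolling standard deviation of the price).

```python
import numpy as np
import pandas as pd

# Hypothetical random-walk prices on 400 business days (not market data)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=400)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=400))),
    index=dates,
    name="price",
)

features = pd.DataFrame({
    # Daily log return
    "log_return": np.log(prices / prices.shift(1)),
    # Return from 252 to 21 trading days ago
    "momentum": prices.shift(21) / prices.shift(252) - 1,
    # Rolling 84-day standard deviation of the price
    "roll_vol": prices.rolling(window=84).std(),
})

# Rows without a full look-back window contain NaN and are dropped
clean = features.dropna()
print(len(features), len(clean))
print(clean.head())
```

With 400 observations, the 252-day lag in the momentum term is the binding constraint, so `dropna()` leaves 148 rows. The same warm-up loss applies to every ticker in the real dataset.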


### Please Note:

Now we have our target variable, but we are going to use the entire dataset nevertheless: more data can supply more information for predicting the target and, additionally, our model is designed to handle large datasets (certainly even larger than the one we are going to use).