# Individual Regression Project

<i>This project is an opportunity for you to put your expanded set of Python skills to use in an interesting new way. Your goal is to combine daily return data from yfinance with one or more additional data sources with the goal of estimating a regression specification that allows you to test an interesting null hypothesis. The hypothesis should be one that genuinely interests you. You will be graded on execution but not on whether any of the variables in your model are statistically significant.<i>

You have a lot of discretion. In the past students have asked:

- Whether/how bank stocks respond to expected and unexpected changes in interest rates by the Federal Reserve
- Whether retail stocks have more or less volatile returns during the holiday season than non-retail stocks
- Whether the timing/number of 8-K filings by firms predict lower or more volatile returns
- Whether the sentiment of 8-K filings for a sample of stocks can be used to predict the level or volatility of returns
- Whether the volume/sentiment of tweets about stocks predict future stock returns
- Whether the excess returns of high beta stocks are higher on average than the excess returns over low beta stocks
- Whether stocks with higher relative returns last year are more likely to have higher relative returns this year (i.e., testing for momentum)
- Whether their clever new trading strategy generates a positive and statistically significant alpha

<i>I am not looking for a long write up. I primarily want to know what regression you ran and why, what your regression output looked like, and how you interpreted the output in the context of your research question. Part 2 is intended to help me figure out if you ran the regression that you intended to run.</i>

In [1]:
name = 'Scott Lowder'

### Part 1: Hypothesis (20%)

In the markdown cell(s) below, please explain what hypothesis you are testing in this project and <b>why you believe that it is an interesting hypothesis to test.</b> This section should include a description of your main regression specification (and any other specifications that you believe are essential for answering your research question) and your main null hypothesis. This section should also include a description of any concerns that you might have about how to interpret the results of your regressions.

Research Question:
Is influence of individual Robinhood investors on a stock's risk level different for small companies versus large companies?

Hypothesis and Regression Specification:
Yes, a positive change in Robinhood holdings will have a more pronounced positive effect on volatility for smaller firms.

Null Hypothesis ($H_0$): The influence of Robinhood accounts changes on volatility is the same for all firms, regardless of their market size. (Key coefficient is zero.)

Main Regression Specification: Pooled OLS Regression model. Main objective is to estimate the coefficient $\beta_4$, which captures the interaction effect between Robinhood attention and firm size. Use squared returns ($R_{i,t}^2$) as the measure for next-day volatility. $$R_{i,t}^2 = \alpha + \beta_1 R_{i,t-1}^2 + \beta_2 \Delta \text{RH}_{i,t-1} + \beta_3 \text{SIZE}_i + \beta_4 (\Delta \text{RH}_{i,t-1} \times \text{SIZE}_i) + \eta_{i,t}$$
- $\mathbf{R_{i,t}^2}$: Next-Day Volatility. The squared daily return, which is our proxy for how risky or volatile the stock was.
- $\mathbf{R_{i,t-1}^2}$: Lagged Volatility. Yesterday's volatility. This controls for the tendency of high-volatility days to follow each other.
- $\mathbf{\Delta \text{RH}_{i,t-1}}$: Robinhood Attention. The percentage change in Robinhood account holders, measured on the previous day.
- $\mathbf{\text{SIZE}_i}$: Firm Size. A rank based on market capitalization, where a higher value means a relatively smaller company.
- $\mathbf{\beta_4 (\Delta \text{RH}_{i,t-1} \times \text{SIZE}_i)}$: Key Test. This is the combined effect of Robinhood attention and small size. If $\beta_4$ is positive and significant, hypothesis is supported.
- $\eta_{i,t}$: Error Term.

### Part 2: Data (20%)

In the coding cell(s) below, please load and process the data that you use to estimate your main regression. You do not need to include code that extracts data from yfinance (for example), but you should include enough code for me to determine what data you are working with and how you are combining it. I want to make sure that your regression is doing what you want it to be doing, but I don't expect to be able to run the cells in this section. To help with grading, please add lots of comments to your code.

In [3]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta
import statsmodels.api as sm
import os 

# Sample period
START_DATE_RH = '2018-05-01'
END_DATE_RH = '2020-08-31'

# Tickers for Analysis (Are local files)
TICKERS = ['AAPL', 'PFE', 'JPM', 'NKE'] 

# Determine full period needed for yfinance data
START_DATE_YF = (pd.to_datetime(START_DATE_RH) - timedelta(days=10)).strftime('%Y-%m-%d')
END_DATE_YF = (pd.to_datetime(END_DATE_RH)).strftime('%Y-%m-%d')

print(f"Working with {len(TICKERS)} local tickers: {TICKERS}")
print(f"Fetching daily data from {START_DATE_YF} to {END_DATE_YF}...")

# Function to Read and Process LOCAL Robinhood Data
def process_robinhood_data(ticker, start_date, end_date):
    file_path = f'{ticker}.csv'
    
    if not os.path.exists(file_path):
        print(f"ERROR: Robinhood data file not found locally: {file_path}. Halting.")
        raise FileNotFoundError(f"Missing required file: {file_path}")
        
    try:
        # READ FROM LOCAL FILE
        df_rh = pd.read_csv(file_path)
        
        # Correct timestamp is datetime and clean data
        df_rh['timestamp'] = pd.to_datetime(df_rh['timestamp'])
        df_rh = df_rh.sort_values('timestamp')

        # Find last observed users_holding for each day (EOD proxy)
        eod_rh = df_rh.set_index('timestamp')['users_holding'].resample('D').last().ffill().dropna()
        
        # Calculate daily percentage change in users holding: Delta RH (Î”RH_t)
        delta_rh = eod_rh.pct_change() * 100
        
        delta_rh = delta_rh.reset_index()
        delta_rh.columns = ['Date', 'Delta_RH']
        delta_rh['Ticker'] = ticker
        
        # Filter dates to project window
        delta_rh = delta_rh[(delta_rh['Date'] >= start_date) & (delta_rh['Date'] <= end_date)]
        
        return delta_rh
    except Exception as e:
        print(f"Could not process local RH data for {ticker}. Skipping. Error: {e}")
        return None

# Process All Robinhood Tickers (LOCAL)
print(f"\nProcessing LOCAL RH data for {len(TICKERS)} tickers...")
all_rh_data = []
for ticker in TICKERS:
    rh_df = process_robinhood_data(ticker, START_DATE_RH, END_DATE_RH)
    if rh_df is not None and not rh_df.empty:
        all_rh_data.append(rh_df)

# Concatenate all individual ticker dataframes
RH_data_panel = pd.concat(all_rh_data, ignore_index=True)
RH_data_panel['Date'] = pd.to_datetime(RH_data_panel['Date']).dt.normalize()
RH_data_panel = RH_data_panel.dropna(subset=['Delta_RH']).reset_index(drop=True)
print(f"Loaded {len(RH_data_panel)} daily Robinhood observations from local files.")


# Download and Process Daily Stock Price Data (Y Variable and Lagged Control)
print(f"\nDownloading daily price data...")
prices_raw = yf.download(TICKERS, start=START_DATE_YF, end=END_DATE_YF, 
                         progress=False, auto_adjust=True)

# Extract adjusted close prices and standardize format
prices = prices_raw['Close'] if isinstance(prices_raw.columns, pd.MultiIndex) else prices_raw['Close'].to_frame(name=TICKERS[0])

# Calculate daily returns (R_i,t) and squared returns (R^2_i,t)
daily_returns = prices.pct_change().stack().reset_index()
daily_returns.columns = ['Date', 'Ticker', 'R_t'] 
daily_returns['Date'] = pd.to_datetime(daily_returns['Date']).dt.normalize()
daily_returns['R2_t'] = daily_returns['R_t']**2 

# Calculate lagged squared return (R^2_i,t-1)
daily_returns['R2_t_minus_1'] = daily_returns.groupby('Ticker')['R2_t'].shift(1)


# Create SIZE Proxy (Market Capitalization Rank)
print("\nFetching Market Cap data for SIZE Proxy...")
market_caps = {}
for ticker in TICKERS:
    try:
        info = yf.Ticker(ticker).info
        market_cap = info.get('marketCap', np.nan) 
        market_caps[ticker] = market_cap
    except Exception as e:
        print(f"Market cap fetch failed for {ticker}. Skipping from SIZE calculation.")
        market_caps[ticker] = np.nan

size_df = pd.Series(market_caps).dropna().to_frame(name='MarketCap_USD')

size_df = size_df.reset_index(names=['Ticker'])

# RANKING: Assign a higher SIZE_Proxy number to a smaller firm.
size_df['SIZE_Proxy'] = size_df['MarketCap_USD'].rank(ascending=True, method='dense') 
# Normalize SIZE proxy
max_rank = size_df['SIZE_Proxy'].max()
size_df['SIZE_Proxy'] = size_df['SIZE_Proxy'] / max_rank 

print(f"SIZE Proxy calculated for {len(size_df)} firms.")


# Merge 1: Returns and RH Data (on Date and Ticker)
regression_data = pd.merge(daily_returns, RH_data_panel, on=['Date', 'Ticker'], how='inner')

# RH variable needs to be lagged by one day: Delta RH_t-1
regression_data['Delta_RH_t_minus_1'] = regression_data.groupby('Ticker')['Delta_RH'].shift(1)

# Merge 2: Add SIZE Proxy (on Ticker)
regression_data = pd.merge(regression_data, size_df[['SIZE_Proxy', 'Ticker']], on='Ticker', how='left') 

# Create Interaction Term: (Delta RH_t-1 * SIZE_i)
regression_data['Delta_RH_x_SIZE'] = regression_data['Delta_RH_t_minus_1'] * regression_data['SIZE_Proxy']

# Final Cleanup
regression_data = regression_data.dropna().reset_index(drop=True)

print(f"\nFinal regression sample size (rows): {len(regression_data)}")
print(regression_data[['Date', 'Ticker', 'R2_t', 'R2_t_minus_1', 'Delta_RH_t_minus_1', 'SIZE_Proxy', 'Delta_RH_x_SIZE']].head())

Working with 4 local tickers: ['AAPL', 'PFE', 'JPM', 'NKE']
Fetching daily data from 2018-04-21 to 2020-08-31...

Processing LOCAL RH data for 4 tickers...
Loaded 3336 daily Robinhood observations from local files.

Downloading daily price data...

Fetching Market Cap data for SIZE Proxy...
SIZE Proxy calculated for 4 firms.

Final regression sample size (rows): 2296
--- Sample of Final Regression Data (Aligned with Hypothesis) ---
        Date Ticker      R2_t  R2_t_minus_1  Delta_RH_t_minus_1  SIZE_Proxy  \
0 2018-05-04   AAPL  0.001539      0.000003           -4.667573        1.00   
1 2018-05-04    JPM  0.000123      0.000040            0.558230        0.75   
2 2018-05-04    NKE  0.000322      0.000397            0.221298        0.25   
3 2018-05-04    PFE  0.000005      0.000014            0.500825        0.50   
4 2018-05-07   AAPL  0.000052      0.001539           -1.616034        1.00   

   Delta_RH_x_SIZE  
0        -4.667573  
1         0.418673  
2         0.055325  
3    

### Part 3: Regression Specification(s) (30%)

In the coding cell(s) below, please estimate and report your main regression specification using the .summary() option (adding useful variable names). To help me replicate your main regression specification, I will need your column of y values and your matrix of X values. Please save those objects as 'my_y.npy' and 'my_X.npy' and upload them to Canvas along with this notebook. If you are analyzing time-series data (and I expect that virtually all of you are) I would also like the corresponding list of dates stored as 'my_dates.npy'.

Note: If you choose to estimate one or more additional regression specifications, please clearly state what each of these specifications allows you to learn that your main specification does not. 

In [5]:
# Define Y (Dependent Variable: R^2_i,t - Daily Squared Return)
Y = regression_data['R2_t']

# Define X (Independent Variables from hypothesis)
# X_vars: R2_t-1, Delta_RH_t-1, SIZE_Proxy, Delta_RH_x_SIZE
X_vars = ['R2_t_minus_1', 'Delta_RH_t_minus_1', 'SIZE_Proxy', 'Delta_RH_x_SIZE']
X = regression_data[X_vars]
X = sm.add_constant(X)

my_y = Y.to_numpy()
my_X = X.to_numpy()

# Create date array
my_dates = (
    regression_data[['Date', 'Ticker']]
    .drop_duplicates()
    .sort_values(['Ticker', 'Date'])
    ['Date']
    .dt.strftime('%Y-%m-%d')
    .to_numpy()
)

# Save files
np.save('my_y.npy', my_y)
np.save('my_X.npy', my_X)
np.save('my_dates.npy', my_dates)

# Estimate Pooled OLS Regression
print("\nEstimating Pooled OLS Regression...")
model = sm.OLS(Y, X)
main_regression_result = model.fit()

# Output
print("\n" + "="*80)
print("MAIN REGRESSION: RH Attention Predicting Volatility with Size Interaction")
print("Dependent Variable: Daily Squared Return (R2_t, Proxy for Volatility)")
print("Null Hypothesis (H0): Coefficient on Delta_RH_x_SIZE is zero (Beta_4 = 0)")
print("SIZE_Proxy: Higher value means relatively smaller firm.")
print("="*80)
print(main_regression_result.summary(
    xname=['Intercept', 'R2_t-1', 'Delta_RH_t-1', 'SIZE_Proxy', 'Delta_RH_x_SIZE']
))


Saved required .npy files: my_y.npy, my_X.npy, my_dates.npy

Estimating Pooled OLS Regression...

MAIN REGRESSION: RH Attention Predicting Volatility with Size Interaction
Dependent Variable: Daily Squared Return (R2_t, Proxy for Volatility)
Null Hypothesis (H0): Coefficient on Delta_RH_x_SIZE is zero (Beta_4 = 0)
SIZE_Proxy: Higher value means relatively smaller firm.
                            OLS Regression Results                            
Dep. Variable:                   R2_t   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.194
Method:                 Least Squares   F-statistic:                     138.8
Date:                Sat, 13 Dec 2025   Prob (F-statistic):          2.51e-106
Time:                        17:20:19   Log-Likelihood:                 11833.
No. Observations:                2296   AIC:                        -2.366e+04
Df Residuals:                    2291   BIC:                        -2.363

### Part 4: Interpretation and Conclusion (30%)

In the markdown cell(s) below, please explain whether your regression caused you to accept or reject your null hypothesis, and the confidence level used to assess statistical significant. Please discuss what you learned from running this regression and what additional regressions you might want to run in the future (potentially using additional data) to learn more about your research question. <i>I will be happy with one very good paragraph.</i>

Based on the results from the regression, reject the null hypothesis that the coefficient on the interaction term ($\Delta \text{RH}_{i,t-1} \times \text{SIZE}_i$) is zero, using the standard 99% confidence level (since the P-value was $0.000$). The positive and highly significant coefficient on the interaction term ($\beta_4=0.0004$) says that the predictive power of increased Robinhood attention on next-day stock volatility is significantly greater for smaller firms (those with a higher $\text{SIZE}$ proxy value) than for larger firms, validating the research hypothesis that retail trading disproportionately impacts the market behavior of smaller stocks. Moving forward, to further explore this phenomenon, future research should incorporate firm fixed effects to control for unobserved, time-invariant differences between companies (like brand value or risk tolerance) and could also be expanded to include data on short interest as a control variable, testing whether the "Robinhood effect" is distinct from general speculative short squeezes.