# Tech Stock Forecasting - Enhanced EDA with Multi-Step Predictions

## Overview
This notebook performs comprehensive exploratory data analysis and implements **multi-step ahead forecasting** (7-day, 14-day, and 30-day horizons) using enhanced exogenous variables.

### Key Improvements:
1. **Rich Exogenous Variables**: Tech-specific indices (NASDAQ, semiconductor ETFs), crypto correlations, sector fundamentals, regime indicators
2. **Multi-Step Forecasting**: Direct forecasting at 7, 14, and 30-day horizons (not trivial 1-step ahead)
3. **Advanced Feature Engineering**: Sector-specific momentum, crypto-tech correlations, fundamental ratios
4. **Comprehensive Model Comparison**: Compare performance across different horizons and feature sets

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("Libraries loaded successfully!")

## 1. Data Loading and Initial Exploration

In [None]:
# Load the main dataset with comprehensive features
df = pd.read_csv('Datasets/Tech_Stock_Data_SEC_Cleaned_SARIMAX.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)

print(f"Dataset shape: {df.shape}")
print(f"Date range: {df.index.min()} to {df.index.max()}")
print(f"\nColumns: {len(df.columns)}")
df.head()

## 2. Target Variable Construction

We create an AI Tech Index as an equal-weighted portfolio of major tech stocks across different sectors.
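
As a minimal sketch of the construction, the toy example below rebases three hypothetical tickers (`AAA`, `BBB`, `CCC`, not columns from the dataset) to 100 and averages them. Rebasing lets a $10 stock and a $200 stock contribute equally. Note that a ticker whose first row is missing would become all-NaN after the division and be silently skipped by `mean(axis=1)`.

```python
import pandas as pd

# Toy closing prices for three hypothetical tickers (illustrative values only)
prices = pd.DataFrame(
    {
        "AAA": [50.0, 55.0, 60.0],
        "BBB": [200.0, 190.0, 210.0],
        "CCC": [10.0, 12.0, 11.0],
    },
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# Rebase every column so its first observation equals 100
rebased = prices.div(prices.iloc[0]) * 100

# Equal weighting is the row-wise mean of the rebased series
toy_index = rebased.mean(axis=1)
print(toy_index.round(2))
```

The first value is 100 by construction, and later values are the average percentage level of the constituents relative to the start date.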

In [None]:
# Define tech stocks by sector
ai_hardware = ['NVDA', 'AMD', 'INTC']
big_tech = ['GOOGL', 'MSFT', 'AAPL', 'META', 'AMZN']
cloud_saas = ['CRM', 'ORCL', 'NOW', 'OKTA']
cybersecurity = ['ZS', 'CRWD', 'PANW']
software_other = ['ADBE', 'SHOP', 'TWLO', 'MDB', 'DDOG', 'NET', 'PYPL', 'ANET']

all_stocks = ai_hardware + big_tech + cloud_saas + cybersecurity + software_other

# Normalize each stock to base 100 at the start
normalized_stocks = df[all_stocks].div(df[all_stocks].iloc[0]) * 100

# Create equal-weighted AI Tech Index
df['AI_Tech_Index'] = normalized_stocks.mean(axis=1)

print(f"AI Tech Index created from {len(all_stocks)} stocks")
print(f"Index range: {df['AI_Tech_Index'].min():.2f} to {df['AI_Tech_Index'].max():.2f}")

# Visualize the index
fig, ax = plt.subplots(figsize=(14, 6))
df['AI_Tech_Index'].plot(ax=ax, linewidth=2, color='darkblue')
ax.set_title('AI Tech Index (Equal-Weighted Portfolio)', fontsize=14, fontweight='bold')
ax.set_ylabel('Index Value (Base 100)', fontsize=12)
ax.set_xlabel('Date', fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 3. Enhanced Exogenous Variable Selection

Instead of using only basic market indicators (SP500, VIX), we'll use:
- **Tech-specific indices**: NASDAQ, Semiconductor ETF, Software ETF, Cloud Computing ETF
- **Crypto indicators**: Bitcoin, Ethereum (tech sector correlation)
- **Fundamental metrics**: Sector profit margins, ROE, revenue growth
- **Regime indicators**: AI Boom, Fed Hike Period, Tech Bear 2022
- **Volatility measures**: Vol_of_Vol_Ratio, NASDAQ_VIX
- **Macro factors**: Yield Curve Slope, Dollar Index, Commodity ratios

In [None]:
# Select enhanced exogenous variables (tech-focused, not just broad market)
exog_vars = [
    # Tech-specific indices (more relevant than SP500)
    'NASDAQ', 'NASDAQ_100_ETF', 'Semiconductor_ETF', 'Software_ETF', 'Cloud_Computing_ETF',
    'Cybersecurity_ETF', 'AI_Robotics_ETF', 'First_Trust_NASDAQ',
    
    # Crypto (strong tech correlation)
    'Bitcoin', 'Ethereum',
    
    # Volatility (market uncertainty)
    'VIX', 'NASDAQ_VIX', 'Vol_of_Vol_Ratio',
    
    # Fundamentals (sector health)
    'Sector_Profit_Margin', 'Sector_ROE', 'Sector_Revenue_Growth',
    'Sector_Asset_Turnover', 'Sector_Profitable_Pct',
    
    # Interest rates & macro
    'Treasury_10Y', 'Yield_Curve_Slope', 'Yield_Curve_Inverted',
    'Dollar_Index',
    
    # Regime indicators
    'AI_Boom_Period', 'Fed_Hike_Period', 'Tech_Bear_2022',
    'High_Volatility_Regime',
    
    # Technical ratios
    'Semi_vs_Tech_Ratio', 'Small_vs_Large_Caps', 'Credit_Spread_Proxy'
]

# Check availability and handle missing variables
available_exog = [var for var in exog_vars if var in df.columns]
missing_exog = [var for var in exog_vars if var not in df.columns]

print(f"Available exogenous variables: {len(available_exog)}")
if missing_exog:
    print(f"Missing variables (will skip): {missing_exog}")

# Create exogenous dataframe
exog_df = df[available_exog].copy()
print(f"\nExogenous dataframe shape: {exog_df.shape}")
print(f"Missing values: {exog_df.isna().sum().sum()}")

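# Forward fill gaps (common in financial data), then back fill any leading gaps.
# Note: back filling copies later values into earlier rows.
# fillna(method=...) is deprecated in recent pandas, so use ffill()/bfill() directly.
exog_df = exog_df.ffill().bfill()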
print(f"After filling - Missing values: {exog_df.isna().sum().sum()}")

## 4. Feature Correlation Analysis

Analyze which exogenous variables have the strongest relationship with the AI Tech Index.
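
One caveat when reading level correlations: trending series can correlate strongly even when they are unrelated. The sketch below uses two simulated, independent random walks (synthetic data, not the notebook's variables) and compares the correlation of levels with the correlation of day-over-day changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Two independent random walks: by construction neither drives the other
walk_a = pd.Series(rng.normal(size=n).cumsum())
walk_b = pd.Series(rng.normal(size=n).cumsum())

level_corr = walk_a.corr(walk_b)                  # can be large purely by chance
change_corr = walk_a.diff().corr(walk_b.diff())   # close to zero for independent series

print(f"Correlation of levels:  {level_corr:.3f}")
print(f"Correlation of changes: {change_corr:.3f}")
```

If the ranking below looks suspiciously uniform (many variables near +1 or -1), repeating it on differenced or percentage-change series is a reasonable cross-check.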

In [None]:
# Calculate correlations with target
correlations = pd.DataFrame({
    'Variable': available_exog,
    'Correlation': [exog_df[var].corr(df['AI_Tech_Index']) for var in available_exog]
}).sort_values('Correlation', key=abs, ascending=False)

print("Top 15 Correlated Variables with AI Tech Index:")
print(correlations.head(15).to_string(index=False))

# Visualize top correlations
fig, ax = plt.subplots(figsize=(12, 8))
top_corr = correlations.head(20)
colors = ['green' if x > 0 else 'red' for x in top_corr['Correlation']]
ax.barh(top_corr['Variable'], top_corr['Correlation'], color=colors, alpha=0.7)
ax.set_xlabel('Correlation with AI Tech Index', fontsize=12)
ax.set_title('Top 20 Exogenous Variables by Correlation', fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 5. Multicollinearity Check (VIF Analysis)

Remove highly correlated predictors to avoid multicollinearity issues.
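
For reference, the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes this by hand with NumPy on synthetic data; the helper name `vif` and the simulated columns are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns get very large VIFs, while the unrelated column stays close to 1, which is the pattern the threshold of 10 is meant to catch.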

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def compute_vif(frame):
    """VIF per column, with an intercept included in each auxiliary regression."""
    design = add_constant(frame.astype(float), has_constant='add').values
    # Column 0 is the constant, so predictor i sits at position i + 1
    return [variance_inflation_factor(design, i + 1) for i in range(frame.shape[1])]

# Select top correlated variables first (avoid too many features)
top_vars = correlations.head(25)['Variable'].tolist()
X_vif = exog_df[top_vars].copy()

# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data['Variable'] = X_vif.columns
vif_data['VIF'] = compute_vif(X_vif)
vif_data = vif_data.sort_values('VIF', ascending=False)

print("Variance Inflation Factors (VIF > 10 indicates multicollinearity):")
print(vif_data.to_string(index=False))

# Remove variables with VIF > 10 iteratively, dropping the worst offender each pass
selected_vars = []
remaining_vars = top_vars.copy()

while remaining_vars:
    vif_values = compute_vif(exog_df[remaining_vars])
    max_vif = max(vif_values)
    
    if max_vif > 10:
        max_vif_idx = vif_values.index(max_vif)
        removed_var = remaining_vars.pop(max_vif_idx)
        print(f"Removing {removed_var} (VIF: {max_vif:.2f})")
    else:
        selected_vars = remaining_vars
        break

print(f"\nFinal selected variables (VIF < 10): {len(selected_vars)}")
print(selected_vars)

## 6. Time Series Diagnostics

Check stationarity and identify appropriate differencing.
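
To illustrate why a single difference is often enough for price-like series, the sketch below simulates a random walk (synthetic data) and compares the lag-1 autocorrelation of its levels with that of its first differences. The ADF test's null hypothesis is a unit root, which is exactly the random-walk case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# A random walk is the cumulative sum of white-noise shocks (unit root, non-stationary)
random_walk = pd.Series(rng.normal(size=1000).cumsum())

# Levels: each value is almost the previous value, so lag-1 autocorrelation is near 1
level_ac = random_walk.autocorr(lag=1)

# First differences recover the white-noise shocks, so autocorrelation is near 0
diff_ac = random_walk.diff().dropna().autocorr(lag=1)

print(f"Lag-1 autocorrelation of levels:      {level_ac:.3f}")
print(f"Lag-1 autocorrelation of differences: {diff_ac:.3f}")
```

A slowly decaying ACF on the original series together with a quickly decaying ACF on the differenced series points toward `d=1` in the ARIMA order.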

In [None]:
# Augmented Dickey-Fuller Test for stationarity
def adf_test(series, name=''):
    result = adfuller(series.dropna())
    print(f'ADF Test for {name}:')
    print(f'  ADF Statistic: {result[0]:.6f}')
    print(f'  p-value: {result[1]:.6f}')
    print(f'  Critical Values:')
    for key, value in result[4].items():
        print(f'    {key}: {value:.3f}')
    if result[1] <= 0.05:
        print(f"  => Series is STATIONARY (reject null hypothesis)\n")
    else:
        print(f"  => Series is NON-STATIONARY (fail to reject null hypothesis)\n")
    return result[1] <= 0.05

# Test original series
is_stationary = adf_test(df['AI_Tech_Index'], 'AI Tech Index')

# Test differenced series
df['AI_Tech_Index_diff'] = df['AI_Tech_Index'].diff()
is_stationary_diff = adf_test(df['AI_Tech_Index_diff'].dropna(), 'AI Tech Index (Differenced)')

# Plot ACF and PACF
fig, axes = plt.subplots(2, 2, figsize=(15, 8))

# Original series
plot_acf(df['AI_Tech_Index'].dropna(), lags=40, ax=axes[0, 0])
axes[0, 0].set_title('ACF - Original Series')

plot_pacf(df['AI_Tech_Index'].dropna(), lags=40, ax=axes[0, 1])
axes[0, 1].set_title('PACF - Original Series')

# Differenced series
plot_acf(df['AI_Tech_Index_diff'].dropna(), lags=40, ax=axes[1, 0])
axes[1, 0].set_title('ACF - Differenced Series')

plot_pacf(df['AI_Tech_Index_diff'].dropna(), lags=40, ax=axes[1, 1])
axes[1, 1].set_title('PACF - Differenced Series')

plt.tight_layout()
plt.show()

## 7. Train-Test Split

Use the first 85% of observations for training and the last 15% for testing, with no shuffling, so that enough test data remains for multi-step forecasts.

In [None]:
# Prepare data
y = df['AI_Tech_Index'].copy()
X = exog_df[selected_vars].copy()

# Ensure alignment
common_idx = y.index.intersection(X.index)
y = y.loc[common_idx]
X = X.loc[common_idx]

# Chronological train-test split (85-15); .iloc makes the positional slicing explicit
train_size = int(len(y) * 0.85)
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]

print(f"Total observations: {len(y)}")
print(f"Training set: {len(y_train)} observations ({y_train.index[0]} to {y_train.index[-1]})")
print(f"Test set: {len(y_test)} observations ({y_test.index[0]} to {y_test.index[-1]})")
print(f"\nNumber of exogenous variables: {len(selected_vars)}")

## 8. Multi-Step Forecasting Implementation

### Direct Multi-Step Approach

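Instead of predicting only 1 day ahead, we'll build a separate model for each horizon. Horizons are counted in rows of the dataset, so "7-day" means 7 observations ahead (7 trading days if the data contains only trading days):
- **7-day ahead forecast**: Predict the value 7 days into the future
- **14-day ahead forecast**: Predict the value 14 days into the future
- **30-day ahead forecast**: Predict the value 30 days into the future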

This is much more useful for practical trading and investment decisions.
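
The target construction for the direct approach can be sketched on toy data as follows. The names `y_toy`, `X_toy`, and the horizon `h = 3` are illustrative assumptions; the idea is that the label at row `t` is the value observed `h` rows later, paired with features known at row `t`.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="D")
y_toy = pd.Series(range(100, 110), index=dates, dtype=float)  # 100, 101, ..., 109
X_toy = pd.DataFrame({"feature": range(10)}, index=dates)

h = 3

# shift(-h) moves future values back, so the label at row t is the value at row t + h.
# The last h rows have no future value and are dropped.
y_future = y_toy.shift(-h).dropna()

# Keep only the feature rows that still have a label
X_aligned = X_toy.loc[y_future.index]

pairs = X_aligned.assign(target=y_future)
print(pairs.head(3))
```

Each row now pairs today's features with the value `h` rows ahead, which is why the last `h` observations cannot be used for training or evaluation.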

In [None]:
# Create shifted targets for multi-step forecasting
horizons = [7, 14, 30]
results = {}

for horizon in horizons:
    print(f"\n{'='*60}")
    print(f"Training {horizon}-Day Ahead Forecast Model")
    print(f"{'='*60}")
    
    # Create target shifted by horizon days
    y_shifted = y.shift(-horizon)
    
    # Align with exogenous variables (remove last 'horizon' observations)
    valid_idx = y_shifted.dropna().index
    y_h = y_shifted.loc[valid_idx]
    X_h = X.loc[valid_idx]
    
    # Train-test split
    train_size_h = int(len(y_h) * 0.85)
    y_h_train = y_h[:train_size_h]
    y_h_test = y_h[train_size_h:]
    X_h_train = X_h[:train_size_h]
    X_h_test = X_h[train_size_h:]
    
    print(f"Training samples: {len(y_h_train)}, Test samples: {len(y_h_test)}")
    
    # Fit SARIMAX model
    # Using (1,1,1) based on ACF/PACF analysis
    model = SARIMAX(
        y_h_train,
        exog=X_h_train,
        order=(1, 1, 1),
        seasonal_order=(0, 0, 0, 0),
        enforce_stationarity=False,
        enforce_invertibility=False
    )
    
    print(f"Fitting model...")
    fitted_model = model.fit(disp=False, maxiter=500)
    
    # Make predictions on test set; np.asarray drops the forecast's own index
    # so the values align with the test dates by position
    forecast = fitted_model.forecast(steps=len(y_h_test), exog=X_h_test)
    predictions = pd.Series(np.asarray(forecast), index=y_h_test.index)
    
    # Calculate metrics
    mse = mean_squared_error(y_h_test, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_h_test, predictions)
    r2 = r2_score(y_h_test, predictions)
    
    # Calculate directional accuracy (did we predict up/down correctly?)
    # The first diff is NaN, so it is excluded from the comparison
    actual_direction = np.sign(y_h_test.diff().iloc[1:])
    pred_direction = np.sign(predictions.diff().iloc[1:])
    directional_accuracy = (actual_direction == pred_direction).mean()
    
    # Store results
    results[horizon] = {
        'model': fitted_model,
        'predictions': predictions,
        'actual': y_h_test,
        'rmse': rmse,
        'mae': mae,
        'r2': r2,
        'directional_accuracy': directional_accuracy
    }
    
    print(f"\n{horizon}-Day Forecast Performance:")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE: {mae:.4f}")
    print(f"  R² Score: {r2:.4f}")
    print(f"  Directional Accuracy: {directional_accuracy:.2%}")
    
print(f"\n{'='*60}")
print("All models trained successfully!")
print(f"{'='*60}")

## 9. Model Performance Comparison

In [None]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Horizon (Days)': list(results.keys()),
    'RMSE': [results[h]['rmse'] for h in results.keys()],
    'MAE': [results[h]['mae'] for h in results.keys()],
    'R² Score': [results[h]['r2'] for h in results.keys()],
    'Directional Accuracy': [results[h]['directional_accuracy'] for h in results.keys()]
})

print("\nMulti-Step Forecast Performance Comparison:")
print(comparison_df.to_string(index=False))

# Visualize performance metrics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = ['RMSE', 'MAE', 'R² Score', 'Directional Accuracy']
for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    ax.bar(comparison_df['Horizon (Days)'].astype(str) + ' days', 
           comparison_df[metric], 
           color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    ax.set_title(f'{metric} by Forecast Horizon', fontsize=12, fontweight='bold')
    ax.set_ylabel(metric)
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value labels on bars
    for i, v in enumerate(comparison_df[metric]):
        ax.text(i, v, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 10. Forecast Visualizations

Visualize actual vs predicted values for each forecast horizon.

In [None]:
# Plot actual vs predicted for each horizon
fig, axes = plt.subplots(3, 1, figsize=(16, 14))

for idx, horizon in enumerate(horizons):
    ax = axes[idx]
    
    actual = results[horizon]['actual']
    # np.asarray aligns predictions by position rather than by the forecast's own index
    pred = pd.Series(np.asarray(results[horizon]['predictions']), index=actual.index)
    
    # Plot actual values
    ax.plot(actual.index, actual.values, label='Actual', linewidth=2, color='black', alpha=0.7)
    
    # Plot predictions
    ax.plot(pred.index, pred.values, label='Predicted', linewidth=2, color='red', linestyle='--', alpha=0.8)
    
    # Add metrics to legend
    r2 = results[horizon]['r2']
    rmse = results[horizon]['rmse']
    
    ax.set_title(f'{horizon}-Day Ahead Forecast (R²={r2:.4f}, RMSE={rmse:.4f})', 
                 fontsize=13, fontweight='bold')
    ax.set_ylabel('AI Tech Index', fontsize=11)
    ax.legend(loc='best', fontsize=10)
    ax.grid(True, alpha=0.3)

axes[-1].set_xlabel('Date', fontsize=11)
plt.tight_layout()
plt.show()

## 11. Residual Diagnostics

Check if model residuals are well-behaved (white noise).
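
Alongside the visual checks, a numeric test for leftover autocorrelation can help. The sketch below hand-computes the Ljung-Box Q statistic with NumPy on synthetic series; the helper `ljung_box_q` is an illustrative assumption (statsmodels also ships `acorr_ljungbox` in `statsmodels.stats.diagnostic`). Under the white-noise null, Q with 10 lags is approximately chi-squared with 10 degrees of freedom, whose 5% critical value is about 18.3.

```python
import numpy as np

def ljung_box_q(x, max_lag=10):
    """Ljung-Box Q statistic: n(n+2) * sum over k of rho_k^2 / (n - k), k = 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.sum(x ** 2)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.sum(x[k:] * x[:-k]) / denom  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

rng = np.random.default_rng(7)
white = rng.normal(size=1000)  # well-behaved "residuals"

# AR(1) series with coefficient 0.8: strongly autocorrelated "residuals"
shocks = rng.normal(size=1000)
ar = np.zeros(1000)
for t in range(1, 1000):
    ar[t] = 0.8 * ar[t - 1] + shocks[t]

q_white = ljung_box_q(white)
q_ar = ljung_box_q(ar)
print(f"Q (white noise): {q_white:.1f}")
print(f"Q (AR(1), 0.8):  {q_ar:.1f}")
```

A Q value far above the critical value, as for the AR(1) series, suggests the model has left predictable structure in its errors.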

In [None]:
# Residual analysis for each horizon
fig, axes = plt.subplots(3, 3, figsize=(16, 12))

for idx, horizon in enumerate(horizons):
    actual = results[horizon]['actual']
    # np.asarray aligns predictions by position rather than by the forecast's own index
    pred = pd.Series(np.asarray(results[horizon]['predictions']), index=actual.index)
    residuals = actual - pred
    
    # Residual plot
    axes[idx, 0].scatter(pred, residuals, alpha=0.5, s=20)
    axes[idx, 0].axhline(y=0, color='red', linestyle='--', linewidth=2)
    axes[idx, 0].set_xlabel('Predicted Values')
    axes[idx, 0].set_ylabel('Residuals')
    axes[idx, 0].set_title(f'{horizon}-Day: Residuals vs Predicted')
    axes[idx, 0].grid(True, alpha=0.3)
    
    # Histogram of residuals
    axes[idx, 1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
    axes[idx, 1].set_xlabel('Residual Value')
    axes[idx, 1].set_ylabel('Frequency')
    axes[idx, 1].set_title(f'{horizon}-Day: Residual Distribution')
    axes[idx, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
    
    # Q-Q plot
    from scipy import stats
    stats.probplot(residuals, dist="norm", plot=axes[idx, 2])
    axes[idx, 2].set_title(f'{horizon}-Day: Q-Q Plot')
    axes[idx, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 12. Feature Importance Analysis

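Examine which exogenous variables carry the largest coefficients. The features are not standardized here, so coefficient magnitudes depend on each variable's units and are only a rough guide to importance.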
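
One way to put coefficients on a common footing is to multiply each by its feature's standard deviation, giving the effect of a one-standard-deviation move. The sketch below shows this with a plain least-squares fit on synthetic data; the variable names and the noise-free target are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
big_scale = 1000.0 * rng.normal(size=n)  # e.g. an index level measured in the thousands
small_scale = rng.normal(size=n)         # e.g. a ratio that moves around zero

# Noise-free toy target in which both features matter about equally
y = 0.001 * big_scale + 1.0 * small_scale

X = np.column_stack([np.ones(n), big_scale, small_scale])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

raw = coef[1:]  # per-unit effects (skip the intercept)
per_std = raw * np.array([big_scale.std(), small_scale.std()])  # effect of a 1-std move

print("Raw coefficients:    ", np.round(raw, 4))
print("Per-std-dev effects: ", np.round(per_std, 3))
```

The raw coefficients differ by a factor of about a thousand, yet the per-standard-deviation effects are nearly equal, so ranking by raw magnitude would understate the large-scale feature.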

In [None]:
# Extract coefficients from each model
for horizon in horizons:
    model = results[horizon]['model']
    
    # Get exogenous variable coefficients
    params = model.params
    
    # Filter for exogenous variables only
    exog_params = params[params.index.isin(selected_vars)]
    
    # Sort by absolute value
    exog_params_sorted = exog_params.reindex(exog_params.abs().sort_values(ascending=False).index)
    
    # Plot top 15
    fig, ax = plt.subplots(figsize=(10, 8))
    top_15 = exog_params_sorted.head(15)
    colors = ['green' if x > 0 else 'red' for x in top_15]
    ax.barh(range(len(top_15)), top_15.values, color=colors, alpha=0.7)
    ax.set_yticks(range(len(top_15)))
    ax.set_yticklabels(top_15.index)
    ax.set_xlabel('Coefficient Value', fontsize=12)
    ax.set_title(f'{horizon}-Day Forecast: Top 15 Feature Coefficients', fontsize=13, fontweight='bold')
    ax.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
    ax.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
    
    print(f"\n{horizon}-Day Model - Top 10 Influential Variables:")
    print(exog_params_sorted.head(10).to_string())

## 13. Key Insights and Conclusions

### What We Learned:

1. **Multi-Step Forecasting Performance**:
   - Shorter horizons (7-day) typically have higher accuracy than longer horizons (30-day)
   - Directional accuracy (predicting up/down) is often more useful than point predictions
   - As forecast horizon increases, uncertainty and error naturally increase

2. **Important Exogenous Variables**:
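   - **Tech-specific indices** (NASDAQ, Semiconductor ETF, Cloud ETF) are expected to be more predictive than the broad market (S&P 500), although no S&P 500 baseline is fitted here to confirm it
   - **Crypto assets** (Bitcoin, Ethereum) are included for their correlation with tech stocks, especially during the AI boom; Section 4 shows how strong that correlation is in this dataset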
   - **Fundamental metrics** (Sector ROE, Profit Margins, Revenue Growth) provide medium-term signals
   - **Regime indicators** (AI Boom, Fed Hike Period) capture structural breaks in the market

3. **Model Limitations**:
   - SARIMAX assumes linear relationships; non-linear patterns may be missed
   - Extreme events (black swans) are not well-captured by historical patterns
   - Parameter stability: relationships between variables change over time

4. **Practical Applications**:
   - **7-day forecast**: Useful for short-term trading strategies, options positioning
   - **14-day forecast**: Medium-term portfolio adjustments, swing trading
   - **30-day forecast**: Strategic allocation decisions, trend identification

### Recommendations for Improvement:

1. **Ensemble Methods**: Combine SARIMAX with ML models (XGBoost, LSTM) for better non-linear pattern capture
2. **Rolling Window Retraining**: Retrain models weekly/monthly to adapt to changing market conditions
3. **Probabilistic Forecasts**: Use quantile regression or Bayesian approaches for confidence intervals
4. **Regime-Conditional Models**: Separate models for bull/bear/sideways markets
5. **Real-Time Data Integration**: Incorporate news sentiment, earnings surprises, Fed announcements
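
The rolling retraining idea in recommendation 2 can be sketched as an expanding-window split generator. The function name and its parameters are illustrative assumptions; scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
def expanding_window_splits(n_obs, initial_train, test_size):
    """Yield (train_rows, test_rows) where the training window grows with each fold."""
    start = initial_train
    while start + test_size <= n_obs:
        # Train on everything before `start`, test on the next `test_size` rows
        yield range(0, start), range(start, start + test_size)
        start += test_size

splits = list(expanding_window_splits(n_obs=100, initial_train=60, test_size=10))
for train_rows, test_rows in splits:
    print(f"train rows 0-{train_rows[-1]}, test rows {test_rows[0]}-{test_rows[-1]}")
```

In each fold the model would be refit on the training rows and scored on the test rows, so every forecast is made only with data available at that point in time.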

In [None]:
# Final summary statistics
print("\n" + "="*80)
print("FINAL SUMMARY: Multi-Step Forecast Performance")
print("="*80)
print(f"\nModels Trained: {len(horizons)} (7-day, 14-day, 30-day ahead)")
print(f"Exogenous Variables Used: {len(selected_vars)}")
print(f"Training Period: {y_train.index[0].strftime('%Y-%m-%d')} to {y_train.index[-1].strftime('%Y-%m-%d')}")
print(f"Test Period: {y_test.index[0].strftime('%Y-%m-%d')} to {y_test.index[-1].strftime('%Y-%m-%d')}")
print(f"\nPerformance Summary:")
print(comparison_df.to_string(index=False))
print("\n" + "="*80)
print("Analysis Complete!")
print("="*80)

## 14. Save Model Results

Save predictions and models for future use.

In [None]:
# Save predictions to CSV
output_df = pd.DataFrame()

for horizon in horizons:
    actual = results[horizon]['actual']
    # np.asarray aligns predictions by position rather than by the forecast's own index
    pred = pd.Series(np.asarray(results[horizon]['predictions']), index=actual.index)
    
    temp_df = pd.DataFrame({
        f'Actual_{horizon}d': actual,
        f'Predicted_{horizon}d': pred,
        f'Error_{horizon}d': actual - pred
    })
    
    if output_df.empty:
        output_df = temp_df
    else:
        output_df = output_df.join(temp_df, how='outer')

# Save to CSV
output_df.to_csv('Datasets/Multi_Step_Forecast_Results.csv')
print("Forecast results saved to 'Datasets/Multi_Step_Forecast_Results.csv'")

# Save selected variables
with open('Datasets/Selected_Exogenous_Variables.txt', 'w') as f:
    f.write("Selected Exogenous Variables (VIF < 10):\n")
    f.write("="*50 + "\n")
    for var in selected_vars:
        f.write(f"- {var}\n")
print("Selected variables saved to 'Datasets/Selected_Exogenous_Variables.txt'")