# **Chapter 23: Linear Models for Time‑Series**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the fundamentals of linear regression and its assumptions
- Apply ordinary least squares (OLS) to time‑series data and interpret coefficients
- Recognize the limitations of OLS in the presence of multicollinearity and high dimensionality
- Implement Ridge (L2) and Lasso (L1) regularization to improve generalization
- Use Elastic Net to combine the benefits of Ridge and Lasso
- Build polynomial regression models to capture non‑linear trends
- Apply generalized linear models (GLMs) for non‑normal targets (e.g., binary direction)
- Use regularization for feature selection in the NEPSE prediction system
- Diagnose model fit through residual analysis and hypothesis tests
- Understand the role of linear models as interpretable baselines in time‑series forecasting

---

## **23.1 Linear Regression Fundamentals**

Linear regression is one of the simplest and most interpretable models in machine learning. It assumes a linear relationship between the input features `X` and the target `y`:

`y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε`

where `ε` is the error term. The coefficients `β` are estimated by minimizing the sum of squared residuals (ordinary least squares, OLS).
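To make the estimation step concrete, here is a minimal sketch that solves the least‑squares problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data is synthetic (an assumption for illustration, not NEPSE data), and names such as `X_demo` and `beta_hat` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 1.0 + 2.0*x1 - 0.5*x2 + noise
n = 200
X_demo = rng.normal(size=(n, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

# Add a column of ones so the first coefficient plays the role of the intercept β₀
X_design = np.column_stack([np.ones(n), X_demo])

# Solve min ||y - Xβ||² directly
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# scikit-learn solves the same least-squares problem
lr_demo = LinearRegression().fit(X_demo, y_demo)
beta_sklearn = np.r_[lr_demo.intercept_, lr_demo.coef_]

print("lstsq:  ", np.round(beta_hat, 3))
print("sklearn:", np.round(beta_sklearn, 3))
```

Both routes should recover coefficients close to the values used to generate the data, since they minimize the same sum of squared residuals.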

In the context of the NEPSE prediction system, we might try to predict tomorrow's return using today's features (lagged returns, volume, technical indicators) with a linear model. While the true relationship is unlikely to be perfectly linear, linear models provide a simple, interpretable baseline.

### **23.1.1 Assumptions of Linear Regression**

1. **Linearity:** The relationship between features and target is linear.
2. **Independence:** Observations are independent (often violated in time‑series).
3. **Homoscedasticity:** Constant variance of errors.
4. **Normality:** Errors are normally distributed (for inference, not required for prediction).
5. **No perfect multicollinearity:** Features are not perfectly correlated.

Time‑series data typically violates the independence assumption (autocorrelation) and may exhibit heteroscedasticity (volatility clustering). Despite these violations, linear models can still produce reasonable forecasts, but we must be cautious with inference.

### **23.1.2 Applying OLS to NEPSE Data**

We'll use the same feature set as in Chapter 22 (lagged returns, moving averages, RSI, etc.) to predict next day's return.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm

# Load and prepare data (same as Chapter 22)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Use a single symbol
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Create features (simplified set for clarity)
df_stock['Return'] = df_stock['Close'].pct_change() * 100
df_stock['Return_Lag1'] = df_stock['Return'].shift(1)
df_stock['Return_Lag2'] = df_stock['Return'].shift(2)
df_stock['Volume_Lag1'] = df_stock['Vol'].shift(1)
df_stock['MA_5'] = df_stock['Close'].rolling(5).mean()
df_stock['Volatility_5'] = df_stock['Return'].rolling(5).std()
# 14-day RSI from simple moving averages of gains and losses
delta = df_stock['Close'].diff()
avg_gain = delta.where(delta > 0, 0).rolling(14).mean()
avg_loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
df_stock['RSI'] = 100 - 100 / (1 + avg_gain / avg_loss)

# Target: next day's return
df_stock['Target'] = df_stock['Return'].shift(-1)

# Drop NaN
df_stock = df_stock.dropna()

# Features and target
feature_cols = ['Return_Lag1', 'Return_Lag2', 'Volume_Lag1', 'MA_5', 'Volatility_5', 'RSI']
X = df_stock[feature_cols]
y = df_stock['Target']

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Fit OLS using scikit-learn
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train)):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")

# Coefficients
coef_df = pd.DataFrame({'feature': feature_cols, 'coefficient': lr.coef_})
print(coef_df)
```

**Explanation:**

- We use a simple set of features to demonstrate linear regression. The model is trained on 80% of the data (temporally split) and evaluated on the most recent 20%.
- The RMSE gives a baseline. Later we'll compare with regularized versions.
- The coefficients show the estimated effect of each feature. For example, because the target is the next day's return, a positive coefficient on `Return_Lag1` would indicate that a higher return yesterday tends to be followed by a higher return tomorrow (momentum), while a negative coefficient would suggest mean reversion.
- Note that these coefficients are conditional on the other features in the model.

### **23.1.3 Statistical Inference with OLS**

For inference (confidence intervals, p‑values), we can use `statsmodels` which provides detailed regression output.

```python
# Add constant for intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

# Fit OLS using statsmodels
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())
```

**Explanation:**

- The summary includes R², adjusted R², F‑statistic, and coefficients with standard errors, t‑stats, and p‑values.
- A low p‑value (<0.05) suggests the feature is statistically significant. However, in time‑series the standard errors behind these p‑values can be biased by autocorrelation and heteroscedasticity, so treat them as indicative only.
- A Durbin‑Watson statistic near 2 indicates no first‑order autocorrelation in the residuals; values well below 2 suggest positive autocorrelation and values well above 2 suggest negative autocorrelation, either of which violates the independence assumption.
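The Durbin‑Watson statistic is simple enough to compute by hand. The sketch below does so on two synthetic residual series (white noise and an AR(1) process, both assumptions for illustration) and compares the result with `durbin_watson` from `statsmodels`. The helper `dw_manual` is hypothetical.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 500

# White-noise "residuals": no autocorrelation by construction
white = rng.normal(size=n)

# AR(1) "residuals" with positive autocorrelation (phi = 0.7)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + rng.normal()

def dw_manual(e):
    """Sum of squared first differences divided by the sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)

print(f"White noise: DW = {dw_white:.2f} (manual {dw_manual(white):.2f})")
print(f"AR(1):       DW = {dw_ar1:.2f} (manual {dw_manual(ar1):.2f})")
```

The statistic is approximately `2(1 - ρ)`, where `ρ` is the lag‑1 autocorrelation of the residuals, which is why positively autocorrelated residuals push it below 2.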

### **23.1.4 Limitations of OLS in Time‑Series**

- **Multicollinearity:** Features like `MA_5` and `RSI` may be correlated, inflating standard errors.
- **Non‑stationarity:** If the target or features are non‑stationary, coefficients may be unstable.
- **Autocorrelation:** Residuals may be correlated over time, which makes the estimates inefficient and the usual standard errors unreliable.
- **Overfitting:** With many features, OLS can overfit, especially if the number of features is large relative to samples.

These limitations motivate regularized regression methods.

---

## **23.2 Ridge Regression (L2)**

Ridge regression adds a penalty term to the OLS objective: it minimizes the sum of squared residuals plus `α` times the sum of squared coefficients (L2 norm). This shrinks coefficients toward zero but does not set them exactly to zero. Ridge is effective when there are many correlated features.

**Objective:** minimize `∑(yᵢ - ŷᵢ)² + α ∑βⱼ²`

The hyperparameter `α` controls the strength of regularization. As `α` increases, coefficients shrink more.
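For intuition, the ridge objective has a closed‑form solution, `β = (XᵀX + αI)⁻¹ Xᵀy`, when no intercept is fitted. The sketch below checks this against `Ridge(fit_intercept=False)` on synthetic data (an assumption for illustration) and compares the coefficient norm with plain OLS.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p = 100, 4

# Synthetic data with known coefficients (illustrative only)
X_demo = rng.normal(size=(n, p))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=n)

alpha = 10.0

# Closed-form ridge solution without intercept: β = (XᵀX + αI)⁻¹ Xᵀy
beta_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(p), X_demo.T @ y_demo)

# The same problem solved by scikit-learn
ridge_demo = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)

# Plain OLS for comparison (alpha = 0)
beta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("closed form:", np.round(beta_closed, 4))
print("sklearn:    ", np.round(ridge_demo.coef_, 4))
print(f"||β_ridge|| = {np.linalg.norm(beta_closed):.4f}, ||β_ols|| = {np.linalg.norm(beta_ols):.4f}")
```

Adding `αI` to `XᵀX` is also what makes the system well conditioned when features are highly correlated.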

### **23.2.1 Implementing Ridge with scikit‑learn**

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Ridge regression: candidate regularization strengths
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# Use time-series cross-validation so each validation fold follows its training data
tscv = TimeSeriesSplit(n_splits=3)

grid_ridge = GridSearchCV(ridge, param_grid, cv=tscv, scoring='neg_mean_squared_error')
grid_ridge.fit(X_train, y_train)

print(f"Best alpha: {grid_ridge.best_params_['alpha']:.4f}")
print(f"Best CV RMSE: {np.sqrt(-grid_ridge.best_score_):.4f}")

# Evaluate on test
y_pred_ridge = grid_ridge.predict(X_test)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print(f"Test RMSE: {rmse_ridge:.4f}")

# Coefficients
ridge_best = grid_ridge.best_estimator_
coef_ridge = pd.DataFrame({'feature': feature_cols, 'coefficient': ridge_best.coef_})
print(coef_ridge)
```

**Explanation:**

- We use `GridSearchCV` with `TimeSeriesSplit` to tune `α`. The CV folds respect temporal order.
- The best `α` is chosen based on average negative MSE (converted to RMSE).
- Ridge shrinks the coefficient vector as a whole (its L2 norm is smaller than that of OLS); individual coefficients may end up nearly zero but not exactly zero.
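To see what "respecting temporal order" means in practice, this small sketch prints the folds that `TimeSeriesSplit(n_splits=3)` produces for 12 time‑ordered observations. The toy array `X_toy` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv_demo.split(X_toy)
]
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} validate={val_idx}")

# Fold 1: train=[0, 1, 2] validate=[3, 4, 5]
# Fold 2: train=[0, 1, 2, 3, 4, 5] validate=[6, 7, 8]
# Fold 3: train=[0, 1, 2, 3, 4, 5, 6, 7, 8] validate=[9, 10, 11]
```

Each validation block lies strictly after its training block, so the model is never tuned on data from the future.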

### **23.2.2 Effect of Regularization**

As `α` increases, coefficients shrink. This reduces variance but may increase bias. The optimal `α` balances bias and variance, leading to better out‑of‑sample performance.

We can visualize the coefficient paths:

```python
alphas = np.logspace(-2, 2, 50)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    coefs.append(ridge.coef_)

plt.figure(figsize=(10,6))
for i in range(len(feature_cols)):
    plt.plot(alphas, [c[i] for c in coefs], label=feature_cols[i])
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficient')
plt.title('Ridge coefficient paths')
plt.legend()
plt.show()
```

**Explanation:**

- As `α` increases, coefficients smoothly shrink toward zero. This plot helps see which features are most affected by regularization.

---

## **23.3 Lasso Regression (L1)**

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty: `α ∑|βⱼ|`. This penalty can shrink some coefficients exactly to zero, effectively performing feature selection. Lasso is useful when we suspect only a subset of features are relevant.

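**Objective:** minimize `∑(yᵢ - ŷᵢ)² + α ∑|βⱼ|`

Note that scikit‑learn's `Lasso` divides the squared‑error term by `2n` (with `n` the number of samples), so its `α` values are not on the same scale as those of `Ridge`.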
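Why does the L1 penalty produce exact zeros? In the special case of orthogonal features, the Lasso solution is the OLS solution passed through a soft‑thresholding operator. The sketch below illustrates this on synthetic data with an artificially orthogonalized design (both assumptions for illustration). It relies on scikit‑learn's `Lasso` minimizing `(1/(2n))·∑(yᵢ - ŷᵢ)² + α ∑|βⱼ|`; the helper `soft_threshold` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 200, 5

# Build a design with orthogonal columns, scaled so that XᵀX = n·I
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X_orth = Q * np.sqrt(n)
y_demo = X_orth @ np.array([2.0, -1.0, 0.3, 0.0, 0.0]) + rng.normal(scale=0.5, size=n)

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alpha = 0.5
beta_ols = X_orth.T @ y_demo / n            # OLS solution for this orthogonal design
beta_soft = soft_threshold(beta_ols, alpha)  # analytical Lasso solution

lasso_demo = Lasso(alpha=alpha, fit_intercept=False).fit(X_orth, y_demo)

print("OLS:            ", np.round(beta_ols, 3))
print("soft-threshold: ", np.round(beta_soft, 3))
print("sklearn Lasso:  ", np.round(lasso_demo.coef_, 3))
```

Any coefficient whose OLS value is smaller than `α` in absolute terms is set exactly to zero, while larger ones are shrunk by `α`. With correlated features the solution no longer has this simple form, but the thresholding intuition carries over.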

### **23.3.1 Implementing Lasso with scikit‑learn**

```python
from sklearn.linear_model import Lasso

lasso = Lasso(max_iter=10000)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}

grid_lasso = GridSearchCV(lasso, param_grid, cv=tscv, scoring='neg_mean_squared_error')
grid_lasso.fit(X_train, y_train)

print(f"Best alpha: {grid_lasso.best_params_['alpha']:.4f}")
print(f"Best CV RMSE: {np.sqrt(-grid_lasso.best_score_):.4f}")

y_pred_lasso = grid_lasso.predict(X_test)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print(f"Test RMSE: {rmse_lasso:.4f}")

# Coefficients (many will be zero)
lasso_best = grid_lasso.best_estimator_
coef_lasso = pd.DataFrame({'feature': feature_cols, 'coefficient': lasso_best.coef_})
print(coef_lasso[coef_lasso['coefficient'] != 0])
```

**Explanation:**

- Lasso may set some coefficients to exactly zero, indicating those features are not selected. This provides a form of automatic feature selection.
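- The remaining non‑zero coefficients mark the features Lasso finds most useful at the chosen `α`; when features are correlated, which one survives can change from sample to sample.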
- For the NEPSE data, Lasso might retain only a few features like `Return_Lag1` and `Volatility_5`, discarding others.

### **23.3.2 Lasso Path**

We can visualize how coefficients change with `α`:

```python
from sklearn.linear_model import lasso_path

alphas_lasso, coefs_lasso, _ = lasso_path(X_train, y_train, alphas=np.logspace(-3, 1, 50))
plt.figure(figsize=(10,6))
for i in range(len(feature_cols)):
    plt.plot(alphas_lasso, coefs_lasso[i, :], label=feature_cols[i])
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficient')
plt.title('Lasso path')
plt.legend()
plt.show()
```

**Explanation:**

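- As `α` increases, coefficients drop to zero at different rates. Features whose coefficients survive to larger `α` are, roughly, the stronger predictors, but this ranking is only meaningful when the features are on comparable scales.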

---

## **23.4 Elastic Net**

Elastic Net combines L1 and L2 penalties, controlled by a mixing parameter `l1_ratio`. It can select groups of correlated features (unlike Lasso, which picks one from a correlated group) and still perform feature selection.

**Objective:** minimize `(1/(2n)) ∑(yᵢ - ŷᵢ)² + α * (l1_ratio * ∑|βⱼ| + (1-l1_ratio)/2 * ∑βⱼ²)` (the form used by scikit‑learn's `ElasticNet`, where `n` is the number of samples)

When `l1_ratio = 1`, it's Lasso; when `l1_ratio = 0`, it's Ridge. Values in between give a mix.
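The role of `l1_ratio` can be checked directly. In this sketch, on synthetic data with two nearly identical features (an assumption for illustration), `ElasticNet` with `l1_ratio=1.0` reproduces the `Lasso` coefficients, and a mostly‑L2 mix is fitted alongside for comparison.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(5)
n = 300

# Two nearly identical features plus one independent feature (synthetic)
z = rng.normal(size=n)
X_demo = np.column_stack([
    z + rng.normal(scale=0.05, size=n),
    z + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])
y_demo = X_demo[:, 0] + X_demo[:, 1] + 0.5 * X_demo[:, 2] + rng.normal(scale=0.5, size=n)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X_demo, y_demo)

print("Lasso:               ", np.round(lasso_demo.coef_, 3))
print("ElasticNet (l1=1.0): ", np.round(enet_as_lasso.coef_, 3))
print("ElasticNet (l1=0.2): ", np.round(enet_mixed.coef_, 3))
```

With a larger L2 share, the weight on the two correlated features typically ends up more evenly split, though the exact pattern depends on the data and on `α`.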

### **23.4.1 Implementing Elastic Net**

```python
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(max_iter=10000)
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
}

grid_elastic = GridSearchCV(elastic, param_grid, cv=tscv, scoring='neg_mean_squared_error')
grid_elastic.fit(X_train, y_train)

print(f"Best params: {grid_elastic.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_elastic.best_score_):.4f}")

y_pred_elastic = grid_elastic.predict(X_test)
rmse_elastic = np.sqrt(mean_squared_error(y_test, y_pred_elastic))
print(f"Test RMSE: {rmse_elastic:.4f}")

elastic_best = grid_elastic.best_estimator_
coef_elastic = pd.DataFrame({'feature': feature_cols, 'coefficient': elastic_best.coef_})
print(coef_elastic[coef_elastic['coefficient'] != 0])
```

**Explanation:**

- Elastic Net often performs well when there are groups of correlated features (e.g., multiple lagged returns). It can include all of them with shrunk coefficients rather than picking one.
- The `l1_ratio` balances between Ridge and Lasso.

---

## **23.5 Polynomial Regression**

Linear models can capture non‑linear relationships by including polynomial terms (e.g., `x²`, `x³`) or interactions. This is still linear in the parameters, so we can use the same estimation methods.

For time‑series, polynomial terms of time can model trends. For example, we might include `day²` to capture acceleration in prices. However, extrapolating polynomials can be dangerous.

### **23.5.1 Creating Polynomial Features**

```python
from sklearn.preprocessing import PolynomialFeatures

# Create a time index (days since start)
df_stock['Day'] = (df_stock['Date'] - df_stock['Date'].min()).dt.days

# Select base features (include time)
base_features = ['Day', 'Return_Lag1', 'Volume_Lag1']
X_base = df_stock[base_features].dropna()  # ensure alignment

# Generate polynomial features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_base)
feature_names_poly = poly.get_feature_names_out(base_features)

# Convert to DataFrame
X_poly_df = pd.DataFrame(X_poly, columns=feature_names_poly, index=X_base.index)

# Merge back with target (aligning indices)
df_model = df_stock.loc[X_poly_df.index].copy()
y_poly = df_model['Target']

# Train/test split (temporal)
split_idx_poly = int(len(X_poly_df) * 0.8)
X_train_poly, X_test_poly = X_poly_df.iloc[:split_idx_poly], X_poly_df.iloc[split_idx_poly:]
y_train_poly, y_test_poly = y_poly.iloc[:split_idx_poly], y_poly.iloc[split_idx_poly:]

# Fit linear regression on polynomial features
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train_poly)

y_pred_poly = lr_poly.predict(X_test_poly)
rmse_poly = np.sqrt(mean_squared_error(y_test_poly, y_pred_poly))
print(f"Polynomial regression test RMSE: {rmse_poly:.4f}")
```

**Explanation:**

- We include a time trend `Day` and its square to capture non‑linear trends. The square term allows the model to fit a parabolic trend.
- Interaction terms like `Day * Return_Lag1` are also created, which could capture how the effect of past returns changes over time.
- Polynomial regression can easily overfit, especially with high degrees. Regularization is recommended.
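As a sketch of what "regularization is recommended" can look like, the following pipeline expands a time index into polynomial terms, standardizes them, and fits either OLS or Ridge. The series is synthetic (a mild quadratic trend plus noise), and the degree and `alpha` are arbitrary choices for illustration rather than tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(6)

# Synthetic series (illustrative only): mild quadratic trend plus noise
n = 150
day = np.arange(n, dtype=float).reshape(-1, 1)
y_trend = 0.0004 * (day[:, 0] - 60) ** 2 + rng.normal(scale=0.5, size=n)

# Temporal split: first 80% for training, last 20% for testing
split = int(n * 0.8)
day_train, day_test = day[:split], day[split:]
y_tr, y_te = y_trend[:split], y_trend[split:]

def poly_model(estimator, degree=6):
    """Polynomial expansion, then scaling, then a linear estimator."""
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        estimator,
    )

def rmse(model, X, y):
    return float(np.sqrt(np.mean((y - model.predict(X)) ** 2)))

ols_poly = poly_model(LinearRegression()).fit(day_train, y_tr)
ridge_poly = poly_model(Ridge(alpha=1.0)).fit(day_train, y_tr)

for name, model in [("OLS", ols_poly), ("Ridge", ridge_poly)]:
    coef_norm = np.linalg.norm(model[-1].coef_)
    print(f"{name:5s} coef norm={coef_norm:.2f}  "
          f"train RMSE={rmse(model, day_train, y_tr):.3f}  "
          f"test RMSE={rmse(model, day_test, y_te):.3f}")
```

Because the test period lies beyond the training range, this setup also exposes how high‑degree polynomial terms behave under extrapolation; comparing the two test errors shows whether shrinkage helps for a given series.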

---

## **23.6 Generalized Linear Models (GLMs)**

GLMs extend linear regression to targets with non‑normal distributions (e.g., binary, count, positive‑valued). They consist of three components:

- **Random component:** distribution of the target (e.g., Binomial for classification, Poisson for counts).
- **Systematic component:** linear predictor `η = Xβ`.
- **Link function:** connects the mean of the target to the linear predictor, e.g., logit for binary, log for counts.

For the NEPSE system, we might use a GLM with Binomial distribution and logit link to predict the probability of an up move (logistic regression). This is a special case of GLM.

### **23.6.1 Logistic Regression for Direction Prediction**

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepare binary target
y_binary = (df_stock['Target'] > 0).astype(int)

# Align with features (ensure same index as X)
# Use same feature set as before (without polynomial)
X_binary = X  # from earlier
y_binary = y_binary.loc[X_binary.index]

# Temporal split
X_train_bin, X_test_bin = X_binary.iloc[:split_idx], X_binary.iloc[split_idx:]
y_train_bin, y_test_bin = y_binary.iloc[:split_idx], y_binary.iloc[split_idx:]

# Logistic regression with L2 regularization (default)
logreg = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000)
logreg.fit(X_train_bin, y_train_bin)

y_pred_bin = logreg.predict(X_test_bin)
accuracy = accuracy_score(y_test_bin, y_pred_bin)
print(f"Logistic regression accuracy: {accuracy:.4f}")

# Coefficients
coef_log = pd.DataFrame({'feature': feature_cols, 'coefficient': logreg.coef_[0]})
print(coef_log)
```

**Explanation:**

- Logistic regression models the log‑odds of an up move as a linear function of the features.
- The coefficients indicate how a one‑unit change in a feature affects the log‑odds (and thus the probability).
- Regularization parameter `C` is the inverse of `α`; smaller `C` means stronger regularization. We could tune it via cross‑validation.
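The link between coefficients, log‑odds, and probabilities can be verified by hand. This sketch fits a logistic regression on synthetic "up move" labels (an assumption for illustration), rebuilds the predicted probabilities from the linear predictor with the inverse logit, and converts coefficients to odds ratios.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 400

# Synthetic features and "up move" labels generated from a known logit model
X_demo = rng.normal(size=(n, 2))
true_logit = 0.3 + 1.2 * X_demo[:, 0] - 0.8 * X_demo[:, 1]
y_up = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(int)

logreg_demo = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_up)

# Linear predictor η = Xβ + β₀, then the inverse logit link maps it to a probability
eta = X_demo @ logreg_demo.coef_[0] + logreg_demo.intercept_[0]
p_manual = 1 / (1 + np.exp(-eta))
p_sklearn = logreg_demo.predict_proba(X_demo)[:, 1]

# exp(β_j) is the multiplicative change in the odds per one-unit increase in x_j
odds_ratios = np.exp(logreg_demo.coef_[0])

print("max |p_manual - p_sklearn|:", np.max(np.abs(p_manual - p_sklearn)))
print("odds ratios:", np.round(odds_ratios, 3))
```

An odds ratio above 1 means the feature raises the odds of an up move, and a value below 1 means it lowers them, holding the other features fixed.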

### **23.6.2 Other GLMs**

For count data (e.g., number of trades), we might use Poisson regression. For positive‑valued targets (e.g., volatility), Gamma regression with log link could be appropriate. `statsmodels` provides extensive GLM capabilities.

```python
import statsmodels.api as sm

# Example: Poisson regression on a count target (for illustration only)
# Assumes the data has a transactions column named 'Trans.'
df_stock['Trans'] = df_stock['Trans.'].fillna(0).astype(int)
# The Poisson family uses the log link by default
poisson_model = sm.GLM(df_stock['Trans'], sm.add_constant(df_stock[['Return_Lag1']]), 
                        family=sm.families.Poisson()).fit()
print(poisson_model.summary())
```

---

## **23.7 Regularization Strategies**

Regularization is essential when using many features, especially in time‑series where the risk of overfitting is high. The main strategies are:

- **Ridge (L2):** Shrinks coefficients, good for many small/medium effects.
- **Lasso (L1):** Performs feature selection, good when only a few features matter.
- **Elastic Net:** Compromise, handles correlated features well.

### **23.7.1 Choosing the Regularization Strength**

We use time‑series cross‑validation to select `α` (and `l1_ratio` for Elastic Net). The example above with `GridSearchCV` and `TimeSeriesSplit` demonstrates this.

### **23.7.2 Scaling Features**

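Regularization methods are sensitive to the scale of features. Always standardize features (center and scale) before applying Ridge, Lasso, or Elastic Net. Note that the examples in Sections 23.2 to 23.4 were fitted on unscaled features, so their coefficients and chosen `α` values should be re‑estimated after scaling.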

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Now use scaled data in regularized models
ridge_scaled = Ridge(alpha=1.0)
ridge_scaled.fit(X_train_scaled, y_train)
# ... etc.
```

**Explanation:**

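- Without scaling, the penalty depends on each feature's units: a feature measured on a small scale needs a large coefficient and is penalized heavily, while a large‑magnitude feature such as volume gets a tiny coefficient and is barely penalized.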
- The scaler is fitted on the training set only and applied to the test set to avoid leakage.
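When regularization strength is tuned by cross‑validation, the scaler should be refitted inside each fold as well. One way to arrange this, sketched below on synthetic data with features on very different scales (an assumption for illustration), is to wrap the scaler and the model in a `Pipeline` and pass that to `GridSearchCV`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n = 300

# Synthetic features on very different scales (think returns vs. volume)
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=n),
    rng.normal(loc=1e5, scale=2e4, size=n),
])
y_demo = 0.5 * X_demo[:, 0] + 1e-5 * (X_demo[:, 1] - 1e5) + rng.normal(scale=0.5, size=n)

# The scaler is a pipeline step, so it is refitted on the training part of every fold
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
alpha_grid = [0.01, 0.1, 1.0, 10.0, 100.0]

grid = GridSearchCV(
    pipe,
    {"ridge__alpha": alpha_grid},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
grid.fit(X_demo, y_demo)

best_alpha = grid.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}")
print(f"Best CV RMSE: {np.sqrt(-grid.best_score_):.4f}")
print("Coefficients (per standard deviation):",
      np.round(grid.best_estimator_.named_steps["ridge"].coef_, 3))
```

After scaling, each coefficient describes the effect of a one‑standard‑deviation change in its feature, which also makes the coefficients comparable with one another.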

---

## **23.8 Feature Selection with Linear Models**

Linear models with L1 regularization (Lasso) provide an embedded feature selection method. The features with non‑zero coefficients are the selected ones. This can be used to reduce dimensionality before applying other models.

```python
# After fitting Lasso with optimal alpha
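# feature_cols is a plain list, so filter it by pairing names with coefficients
selected_features = [f for f, c in zip(feature_cols, lasso_best.coef_) if c != 0]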
print(f"Selected features: {selected_features}")

# Now train a model (e.g., Random Forest) on only these features
X_train_sel = X_train[selected_features]
X_test_sel = X_test[selected_features]
# ... train another model
```

**Explanation:**

- This two‑step approach (Lasso for selection, then another model) can be effective, but care must be taken to avoid selection bias. The selection should be done within cross‑validation loops to get unbiased performance estimates.
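One way to keep the selection step inside the cross‑validation loop is to make it part of a `Pipeline`, for example with `SelectFromModel`. The sketch below uses synthetic data in which only the first two of six features carry signal (an assumption for illustration); `LinearRegression` stands in for whatever downstream model is used, and `alpha=0.1` is an arbitrary choice.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
n, p = 400, 6

# Synthetic data: only features 0 and 1 are informative
X_demo = rng.normal(size=(n, p))
y_demo = 1.0 * X_demo[:, 0] - 0.8 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.1))),  # keeps features with non-zero Lasso coefficients
    ("model", LinearRegression()),                  # stand-in for any downstream model
])

# Scaling, selection, and fitting are all redone on the training part of each fold
scores = cross_val_score(
    pipe, X_demo, y_demo,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
cv_rmse = np.sqrt(-scores)
print("CV RMSE per fold:", np.round(cv_rmse, 4))

# Refit on all data to inspect which features were kept
pipe.fit(X_demo, y_demo)
selected_mask = pipe.named_steps["select"].get_support()
print("Selected feature indices:", np.flatnonzero(selected_mask).tolist())
```

Because selection happens separately in every fold, the reported error reflects the whole procedure rather than a feature set chosen with knowledge of the validation data.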

---

## **23.9 Time‑Series Linear Models**

Some linear models are specifically designed for time‑series:

- **Autoregressive (AR) models:** Essentially linear regression on lagged values of the target. We covered this in Chapter 21.
- **ARIMAX / SARIMAX:** ARIMA with exogenous variables (features). These can be seen as linear models with time‑series errors.

### **23.9.1 ARIMAX with Exogenous Features**

We can include our engineered features as exogenous variables in an ARIMA model. This combines the time‑series dynamics with external predictors.

```python
from statsmodels.tsa.arima.model import ARIMA

# Use the same features (scaled) as exogenous
exog_train = X_train_scaled
exog_test = X_test_scaled

# Fit ARIMAX: order (p,d,q) on the target (returns) with exog
# We'll use a simple AR(1) as example, but could tune p,d,q
arimax_model = ARIMA(y_train, order=(1,0,0), exog=exog_train)
arimax_fit = arimax_model.fit()
print(arimax_fit.summary())

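# Forecast the whole test period at once (multi-step: the AR component is
# extrapolated from the training data rather than updated with observed returns)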
forecast = arimax_fit.forecast(steps=len(y_test), exog=exog_test)
rmse_arimax = np.sqrt(mean_squared_error(y_test, forecast))
print(f"ARIMAX test RMSE: {rmse_arimax:.4f}")
```

**Explanation:**

- As implemented in `statsmodels`, `ARIMA` with `exog` fits a regression on the exogenous variables whose errors follow an ARMA process, so the target depends linearly on the features plus autocorrelated noise.
- This can capture autocorrelation that a pure linear regression on features misses.
- However, it is more complex to estimate and requires careful selection of ARIMA order.

---

## **23.10 Interpretation and Diagnostics**

Linear models are prized for interpretability. We can examine coefficients, confidence intervals, and residuals.

### **23.10.1 Coefficient Interpretation**

In a linear regression, a coefficient `βⱼ` represents the expected change in the target for a one‑unit change in feature `xⱼ`, holding all other features constant. For example, since returns are measured in percent and the target is the next day's return, a coefficient of 0.05 on `Return_Lag1` means that a one‑percentage‑point higher return yesterday is associated with a 0.05‑percentage‑point higher next‑day return, on average.

In logistic regression, the coefficient represents the change in log‑odds of the event (up move) per unit change in the feature.

### **23.10.2 Residual Diagnostics**

We should check residuals for patterns that indicate model inadequacy.

```python
# Residuals from best linear model (e.g., Ridge)
residuals = y_test - y_pred_ridge

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(residuals)
plt.title('Residuals over time')

plt.subplot(1,2,2)
plt.scatter(y_pred_ridge, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
plt.tight_layout()
plt.show()

# ACF of residuals
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(residuals, lags=20)
plt.show()
```

**Explanation:**

- Residuals should have no trend, constant variance (homoscedasticity), and no autocorrelation.
- If residuals show autocorrelation, the model may benefit from including lagged target terms (AR terms) or using ARIMAX.
- If variance changes over time (volatility clustering), consider models that account for heteroscedasticity (e.g., GARCH).

### **23.10.3 Hypothesis Tests (Cautions)**

In time‑series, standard errors from OLS are often biased due to autocorrelation and heteroscedasticity. Use heteroscedasticity‑ and autocorrelation‑consistent (HAC) standard errors if inference is important.

```python
# Using statsmodels with robust standard errors
model_hac = sm.OLS(y_train, X_train_const).fit(cov_type='HAC', cov_kwds={'maxlags': 5})
print(model_hac.summary())
```

**Explanation:**

- The `cov_type='HAC'` option uses Newey‑West standard errors, which are robust to autocorrelation and heteroscedasticity.

---

## **23.11 Chapter Summary**

In this chapter, we explored linear models for time‑series forecasting, with applications to the NEPSE dataset.

- **Ordinary least squares (OLS)** provides a simple, interpretable baseline, but suffers from multicollinearity and overfitting with many features.
- **Ridge regression (L2)** shrinks coefficients to reduce variance, useful when many features have small effects.
- **Lasso (L1)** performs feature selection by setting some coefficients to zero, helpful when only a few features matter.
- **Elastic Net** combines L1 and L2 penalties, handling correlated features well.
- **Polynomial regression** extends linear models to capture non‑linear trends and interactions.
- **Generalized linear models (GLMs)** like logistic regression handle non‑normal targets (e.g., binary direction).
- Regularization strength must be tuned via time‑series cross‑validation.
- Features should be standardized before applying regularized methods.
- Linear models can be used for feature selection before feeding into more complex models.
- **ARIMAX** incorporates exogenous variables into a time‑series model, capturing autocorrelation.
- Interpretation and residual diagnostics are essential to validate model assumptions.

### **Practical Takeaways for the NEPSE System:**

- Start with a simple linear model as a baseline; its RMSE or accuracy sets the benchmark that more complex models must beat.
- Use Lasso to identify the most predictive features – this can reveal market drivers.
- Logistic regression for direction prediction gives interpretable probabilities of up/down moves.
- Always scale features and use time‑series CV to tune regularization.
- Combine linear models with time‑series components (ARIMAX) if residuals show autocorrelation.
- Linear models are not the most accurate, but their transparency makes them valuable for understanding and as a benchmark.

In the next chapter, **Chapter 24: Support Vector Machines**, we will explore how SVMs can capture non‑linear patterns through kernel tricks, and apply them to the NEPSE prediction problem.

---

**End of Chapter 23**