# Enhancing Predictive Efficiency of Time Series Models in Stock Price Prediction

Self Project | May 2023 - Jul 2023

This notebook reproduces the pipeline: ARIMA/SARIMA forecasts, directional metric, feature engineering (RSI, ADX), systematic time-based dataset segregation, and Naive Bayes for direction prediction.

**Note:** Run this notebook in an environment with internet access to download price data, and install the required packages if missing.

In [None]:
!pip install yfinance pandas numpy scipy scikit-learn statsmodels ta matplotlib seaborn joblib --quiet

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import ta

sns.set(style='whitegrid')

In [None]:
# Download S&P 500 index data via yfinance
ticker = '^GSPC'  # S&P 500 index symbol
start = '2018-01-01'
end = '2023-12-31'

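# auto_adjust=False keeps the 'Adj Close' column; newer yfinance releases default to auto_adjust=True and omit it
df = yf.download(ticker, start=start, end=end, progress=False, auto_adjust=False)
# Newer yfinance releases may return MultiIndex columns (field, ticker); flatten to plain field names
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
df = df[['Open','High','Low','Close','Adj Close','Volume']]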
df.head()

In [None]:
plt.figure(figsize=(12,4))
plt.plot(df.index, df['Adj Close'])
plt.title('S&P 500 Adjusted Close')
plt.ylabel('Price')
plt.show()

## ARIMA / SARIMA Forecasting Examples

We show short-horizon forecasts using ARIMA and SARIMA. These are illustrative; for production, tune the orders and validate with walk-forward evaluation.

In [None]:
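# Reindex to business days and forward-fill market holidays so the series has a regular frequency
# (fillna(method='ffill') is deprecated in recent pandas; .ffill() is the supported form)
series = df['Adj Close'].asfreq('B').ffill()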

arima_order = (5,1,0)
model_arima = ARIMA(series, order=arima_order)
res_arima = model_arima.fit()
print(res_arima.summary().tables[1])

pred_arima = res_arima.get_forecast(steps=5).predicted_mean
pred_arima

In [None]:
sarima_order = (1,1,1)
# Seasonal period of 5 corresponds to one trading week on the business-day index
seasonal_order = (1,0,1,5)
model_sarima = SARIMAX(series, order=sarima_order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
res_sarima = model_sarima.fit(disp=False)
print(res_sarima.summary().tables[1])
res_sarima.get_forecast(steps=5).predicted_mean
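
The walk-forward validation mentioned in this section can be sketched without any forecasting library. The example below is a minimal, dependency-free illustration: `drift_forecast` is a hypothetical placeholder standing in for a fitted ARIMA/SARIMA model, and `directional_accuracy` is one plausible reading of a directional metric (the share of steps where the predicted move has the same sign as the realized move). The price list is made up for demonstration only.

```python
from typing import Callable, List, Sequence, Tuple

# Each result is (last_known_value, predicted_value, actual_value)
Result = Tuple[float, float, float]


def drift_forecast(history: Sequence[float]) -> float:
    """Placeholder model: last value plus the average step seen so far."""
    avg_step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + avg_step


def walk_forward(
    series: Sequence[float],
    forecast_fn: Callable[[Sequence[float]], float],
    min_train: int,
) -> List[Result]:
    """Forecast one step ahead at each time t using only series[:t]."""
    results: List[Result] = []
    for t in range(min_train, len(series)):
        history = series[:t]  # expanding window, never includes series[t]
        results.append((history[-1], forecast_fn(history), series[t]))
    return results


def mean_absolute_error(results: Sequence[Result]) -> float:
    return sum(abs(pred - actual) for _, pred, actual in results) / len(results)


def directional_accuracy(results: Sequence[Result]) -> float:
    """Fraction of steps where predicted and actual moves are both up or both not up."""
    hits = sum(
        1 for last, pred, actual in results if (pred > last) == (actual > last)
    )
    return hits / len(results)


# Made-up prices, for illustration only
prices = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0, 108.0]
results = walk_forward(prices, drift_forecast, min_train=3)
print(f"MAE: {mean_absolute_error(results):.3f}")
print(f"Directional accuracy: {directional_accuracy(results):.2f}")
```

To evaluate a real model this way, `forecast_fn` would refit (or update) the model on `history` and return its one-step forecast; refitting at every step is slow, so refitting every few steps is a common compromise.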

## Feature engineering: RSI, ADX, Directional Metric

We compute technical indicators (RSI, ADX) with the `ta` library, then add daily returns and a binary label for next-day direction (1 if the next day's return is positive, 0 otherwise).

In [None]:
feature_df = df.copy()
feature_df = feature_df.dropna()

# RSI (14)
feature_df['rsi_14'] = ta.momentum.RSIIndicator(feature_df['Adj Close'], window=14).rsi()
# ADX (14)
feature_df['adx_14'] = ta.trend.ADXIndicator(feature_df['High'], feature_df['Low'], feature_df['Adj Close'], window=14).adx()

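feature_df['ret'] = feature_df['Adj Close'].pct_change()
# Next-day return is the target; shift(-1) aligns tomorrow's return with today's features
feature_df['ret_next'] = feature_df['ret'].shift(-1)
feature_df['direction'] = (feature_df['ret_next'] > 0).astype(int)

# Drop indicator warm-up rows and the final row, whose next-day return is unknown
feature_df = feature_df.dropna()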
feature_df[['Adj Close','rsi_14','adx_14','ret','direction']].tail()
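
For readers who want to see what the RSI feature measures, here is a minimal pure-Python sketch of RSI with Wilder's smoothing. It is an illustration, not the `ta` implementation: the function name `wilder_rsi` is hypothetical, the averages are seeded with a simple mean over the first `window` changes, and the library's seeding may differ, so early values need not match `RSIIndicator`.

```python
from typing import List, Optional, Sequence


def wilder_rsi(closes: Sequence[float], window: int = 14) -> List[Optional[float]]:
    """RSI with Wilder's smoothing; the first `window` entries are None."""
    if len(closes) <= window:
        return [None] * len(closes)

    changes = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]

    # Seed the averages with a simple mean over the first `window` changes
    avg_gain = sum(gains[:window]) / window
    avg_loss = sum(losses[:window]) / window

    def to_rsi(gain: float, loss: float) -> float:
        if loss == 0.0:
            return 100.0  # convention used here when the window has no losses
        return 100.0 - 100.0 / (1.0 + gain / loss)

    rsi: List[Optional[float]] = [None] * window
    rsi.append(to_rsi(avg_gain, avg_loss))
    for i in range(window, len(changes)):
        # Wilder's smoothing: weight the previous average by (window - 1) / window
        avg_gain = (avg_gain * (window - 1) + gains[i]) / window
        avg_loss = (avg_loss * (window - 1) + losses[i]) / window
        rsi.append(to_rsi(avg_gain, avg_loss))
    return rsi


# Made-up closes with a short window so the arithmetic is easy to follow
demo = wilder_rsi([10.0, 11.0, 12.0, 11.0, 12.0, 13.0], window=3)
print(demo)
```

Values near 100 indicate that recent gains dominate recent losses, and values near 0 indicate the opposite, which is why RSI is a natural candidate feature for a direction classifier.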

## Systematic time-based segregation and supervised dataset

We create lag features and split data chronologically into train / validation / test sets to avoid lookahead bias.

In [None]:
lags = [1,2,3,5,10]
for l in lags:
    feature_df[f'ret_lag_{l}'] = feature_df['ret'].shift(l)
feature_df = feature_df.dropna()

features = ['rsi_14','adx_14'] + [f'ret_lag_{l}' for l in lags]
X = feature_df[features]
y = feature_df['direction']

N = len(X)
train_end = int(0.7 * N)
val_end = train_end + int(0.15 * N)

X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]

X_train.shape, X_val.shape, X_test.shape
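
The split arithmetic above can be wrapped in a small helper that also checks the property that matters: every training row precedes every validation row, which in turn precedes every test row. This is a sketch under the assumption that rows are already sorted by date; `chronological_split` is a hypothetical name, and the row count of 1205 is an arbitrary stand-in rather than the notebook's actual dataset size.

```python
from typing import Tuple


def chronological_split(
    n_rows: int, train_frac: float = 0.70, val_frac: float = 0.15
) -> Tuple[slice, slice, slice]:
    """Return contiguous train/validation/test slices in time order."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    train_end = int(train_frac * n_rows)
    val_end = train_end + int(val_frac * n_rows)
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, n_rows)


# Stand-in for date-sorted rows: position i is the i-th trading day
rows = list(range(1205))
train_idx, val_idx, test_idx = chronological_split(len(rows))
train, val, test = rows[train_idx], rows[val_idx], rows[test_idx]

# No overlap and no reordering: the past never sees the future
assert max(train) < min(val) <= max(val) < min(test)
print(len(train), len(val), len(test))
```

The returned slices can be passed directly to `.iloc[...]` on a date-sorted DataFrame, which mirrors the manual indexing used in the cell above.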

In [None]:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', GaussianNB())])
pipe.fit(X_train, y_train)

for name, (X_, y_) in [('Validation', (X_val, y_val)), ('Test', (X_test, y_test))]:
    ypred = pipe.predict(X_)
    acc = accuracy_score(y_, ypred)
    print(f"{name} accuracy: {acc:.4f}")
    print(classification_report(y_, ypred))

# Baseline: always predict up
baseline_pred = np.ones_like(y_test)
baseline_acc = accuracy_score(y_test, baseline_pred)
print('Baseline (always up) acc on test:', baseline_acc)

In [None]:
ypred_test = pipe.predict(X_test)
acc_model = accuracy_score(y_test, ypred_test)
acc_baseline = baseline_acc
improvement = (acc_model - acc_baseline) * 100
print(f'Model acc: {acc_model:.4f}, Baseline acc: {acc_baseline:.4f}, Improvement: {improvement:.2f} percentage points')
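
The comparison above fixes the baseline to "always up". A slightly stricter variant, sketched below, picks the baseline class from the training labels only, so nothing about the test period informs the baseline. This is an optional illustration with made-up labels; `majority_class` and `accuracy` are hypothetical helper names, not part of scikit-learn.

```python
from collections import Counter
from typing import Sequence


def majority_class(labels: Sequence[int]) -> int:
    """Most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 1 = next-day up, 0 = next-day flat or down
y_train_demo = [1, 1, 0, 1, 0, 1, 1]
y_test_demo = [0, 1, 1, 0, 1]
model_pred_demo = [0, 1, 0, 0, 1]  # stand-in for classifier predictions on the test set

baseline_label = majority_class(y_train_demo)
baseline_pred_demo = [baseline_label] * len(y_test_demo)

acc_model_demo = accuracy(y_test_demo, model_pred_demo)
acc_baseline_demo = accuracy(y_test_demo, baseline_pred_demo)
improvement_pp = (acc_model_demo - acc_baseline_demo) * 100
print(f"Model: {acc_model_demo:.2f}, baseline: {acc_baseline_demo:.2f}, "
      f"improvement: {improvement_pp:.1f} percentage points")
```

For an equity index over a long window the majority class is usually "up", so the two baselines often coincide; the difference matters when the training period is dominated by a drawdown.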

In [None]:
import joblib
joblib.dump(pipe, 'gnb_direction_pipe.joblib')
print('Saved pipeline to gnb_direction_pipe.joblib')

## Notes and next steps

- Tune ARIMA/SARIMA orders using AIC/BIC or grid search, and validate with walk-forward evaluation.
- Try more powerful classifiers (RandomForest, XGBoost) and sequence models (LSTM, Transformer) for direction prediction.
- Consider adding more features (volume-based indicators, macro signals).
- Be careful with lookahead bias; always split chronologically.

---

**End of notebook**