# Task 3: Forecast Future Market Trends

## Objective
Use the best-performing forecasting model from Task 2 to predict Tesla's future stock prices, visualize forecasts with confidence intervals, and extract actionable insights about trends, opportunities, and risks.

> Prerequisite: Run Task 1 and Task 2 first so that processed data and model comparison results exist in `../data/processed/`.

In [5]:
# 1. Imports and configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
import warnings
import os

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully for Task 3.")

Libraries imported successfully for Task 3.


In [6]:
# 2. Load processed TSLA data and model comparison results

# Load TSLA processed data created in Task 1
tsla_data = pd.read_csv('../data/processed/TSLA_processed.csv', index_col='Date', parse_dates=True)

# Use Adjusted Close price for forecasting (fallback to Close if Adj Close not available)
if 'Adj Close' in tsla_data.columns:
    tsla_close = tsla_data['Adj Close'].dropna()
    print("Using 'Adj Close' column")
else:
    tsla_close = tsla_data['Close'].dropna()
    print("Using 'Close' column (Adj Close not found)")

# Chronological split (same as Task 2), compared in UTC
split_date = '2025-01-01'
# Normalize the index to a UTC-aware DatetimeIndex so mixed offsets (EST/EDT) compare cleanly
idx_utc = pd.to_datetime(tsla_close.index, utc=True)
# The split date string is timezone-naive, so interpret it as midnight UTC
split_ts_utc = pd.Timestamp(split_date, tz='UTC')
# Boolean mask and slicing (preserve original series/index types)
mask = idx_utc < split_ts_utc
train_data = tsla_close.iloc[mask]
test_data = tsla_close.iloc[~mask]

print(f"Training set: {train_data.index.min()} -> {train_data.index.max()} ({len(train_data)} points)")
print(f"Test set: {test_data.index.min()} -> {test_data.index.max()} ({len(test_data)} points)")

# Try loading model comparison results from Task 2; if missing, infer from cached artifacts
comparison_csv = '../data/processed/model_comparison_results.csv'
best_model_name = None
if os.path.exists(comparison_csv):
    comparison_df = pd.read_csv(comparison_csv)
    print("\nModel performance from Task 2:")
    print(comparison_df.to_string(index=False))
    # Select best model based on RMSE
    best_row = comparison_df.loc[comparison_df['RMSE'].idxmin()]
    best_model_name = best_row['Model']
else:
    print("Model comparison CSV not found; inferring best model from cached artifacts...")
    sarima_path = '../data/processed/sarima_model.pkl'
    lstm_path = '../data/processed/lstm_model.h5'
    if os.path.exists(sarima_path) and os.path.exists(lstm_path):
        print("Both SARIMA and LSTM artifacts found. Defaulting to 'SARIMA'.")
        best_model_name = 'SARIMA'
    elif os.path.exists(sarima_path):
        print("Found SARIMA model artifact. Selecting 'SARIMA'.")
        best_model_name = 'SARIMA'
    elif os.path.exists(lstm_path):
        print("Found LSTM model artifact. Selecting 'LSTM'.")
        best_model_name = 'LSTM'
    else:
        print("No model artifacts found. Defaulting to 'SARIMA'.")
        best_model_name = 'SARIMA'

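# Report how the model was chosen: by RMSE when the comparison CSV exists, otherwise by artifacts on disk
selection_basis = 'lowest RMSE in Task 2' if os.path.exists(comparison_csv) else 'available artifacts'
print(f"\nBest-performing model based on {selection_basis}: {best_model_name}")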

Using 'Close' column (Adj Close not found)
Training set: 2015-01-02 00:00:00-05:00 -> 2024-12-31 00:00:00-05:00 (2516 points)
Test set: 2025-01-02 00:00:00-05:00 -> 2026-01-14 00:00:00-05:00 (259 points)
Model comparison CSV not found; inferring best model from cached artifacts...
Found SARIMA model artifact. Selecting 'SARIMA'.

Best-performing model based on available artifacts: SARIMA
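
The UTC normalization used for the split can be checked on a toy series. The sketch below is illustrative only: the timestamps and prices are synthetic, and the variable names are placeholders rather than names taken from Task 2.

```python
import pandas as pd

# Synthetic index with mixed UTC offsets (winter -05:00, summer -04:00)
raw_index = [
    "2024-12-30 00:00:00-05:00",
    "2024-12-31 00:00:00-05:00",
    "2025-01-02 00:00:00-05:00",
    "2025-06-02 00:00:00-04:00",
]
prices = pd.Series([100.0, 101.0, 102.0, 103.0], index=raw_index, name="Close")

# Convert every label to a single UTC-aware DatetimeIndex
idx_utc = pd.to_datetime(prices.index, utc=True)

# Interpret the split date as midnight UTC and build a boolean mask
split_ts_utc = pd.Timestamp("2025-01-01", tz="UTC")
mask = idx_utc < split_ts_utc

train = prices[mask]    # rows strictly before the split
test = prices[~mask]    # rows on or after the split

print(len(train), len(test))
```

Comparing in UTC avoids errors that can arise when timezone-aware and timezone-naive timestamps are compared directly.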


## 3. Generate 6–12 Month Ahead Forecasts

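We:
- Refit the selected model before forecasting: SARIMA is refit on the full TSLA series (train + test), while the LSTM cell retrains on the pre-2025 training portion only.
- Generate multi-step forecasts for 252 trading days (≈12 months).
- For SARIMA, take 95% confidence intervals directly from the model.
- For LSTM, forecast iteratively: the last 60 days of the full series seed the input window, and each prediction is fed back in as the newest input.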
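
The iterative forecasting scheme can be sketched without a neural network. This is a minimal illustration, assuming a hypothetical `predict_next` callable that stands in for the trained model; here it is simply the window mean, which is not what an LSTM would predict.

```python
import numpy as np

def recursive_forecast(history, predict_next, window=60, steps=5):
    """Roll a fixed-length window forward, feeding each prediction back in."""
    buf = np.asarray(history[-window:], dtype=float)
    out = []
    for _ in range(steps):
        nxt = float(predict_next(buf))   # one-step-ahead prediction
        out.append(nxt)
        buf = np.append(buf[1:], nxt)    # drop the oldest value, append the prediction
    return np.array(out)

# Synthetic history and a stand-in predictor (window mean), for illustration only
history = np.arange(100, dtype=float)
preds = recursive_forecast(history, predict_next=np.mean, window=60, steps=5)
print(preds)
```

Because each step consumes earlier predictions rather than observed prices, errors can compound over the horizon, which is one reason long-range iterative forecasts deserve caution.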

In [None]:
# 3.1 SARIMA-based future forecasts (if SARIMA is best)
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pmdarima as pm

forecast_steps = 252  # ~12 months of trading days

sarima_future_df = None

if best_model_name == 'SARIMA':
    print("Best model is SARIMA – refitting on full TSLA series...")

    auto_arima_model = pm.auto_arima(
        tsla_close,
        start_p=0, start_q=0,
        max_p=5, max_q=5,
        start_P=0, start_Q=0,
        max_P=2, max_Q=2,
        m=12,                 # seasonal period of 12 observations; on daily data that is 12 trading days, not 12 months
        seasonal=True,
        stepwise=True,
        suppress_warnings=True,
        error_action='ignore',
        trace=False
    )

    arima_order = auto_arima_model.order
    seasonal_order = auto_arima_model.seasonal_order
    print(f"Selected SARIMA order: {arima_order}, seasonal order: {seasonal_order}")

    sarima_model = SARIMAX(
        tsla_close,
        order=arima_order,
        seasonal_order=seasonal_order,
        enforce_stationarity=False,
        enforce_invertibility=False
    )

    fitted_sarima = sarima_model.fit(disp=False)

    sarima_future = fitted_sarima.get_forecast(steps=forecast_steps)
    future_mean = sarima_future.predicted_mean
    future_ci = sarima_future.conf_int()

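    # Build a business-day index for the horizon (weekends are skipped, market holidays are not)
    last_date = pd.Timestamp(tsla_close.index[-1])
    future_index = pd.bdate_range(start=last_date + timedelta(days=1), periods=forecast_steps)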

    sarima_future_df = pd.DataFrame({
        'Forecast': future_mean.values,
        'Lower_CI': future_ci.iloc[:, 0].values,
        'Upper_CI': future_ci.iloc[:, 1].values
    }, index=future_index)

    print("Generated SARIMA future forecasts:")
    display(sarima_future_df.head())
else:
    print("Best model is not SARIMA – this cell will be skipped.")
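
`pd.bdate_range` skips weekends but not exchange holidays, so some forecast dates can fall on days when the market is closed. A hedged sketch of a holiday-aware alternative follows. It assumes US federal holidays as a rough proxy; the NYSE calendar differs (for example, it closes on Good Friday and trades on Veterans Day), so treat this as an approximation rather than an exchange calendar.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

start = "2025-07-01"   # illustrative start date (a Tuesday)
periods = 5

# Plain business days: Monday to Friday, holidays included
plain_idx = pd.bdate_range(start=start, periods=periods)

# Business days that also skip US federal holidays (approximation of market closures)
us_bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())
holiday_idx = pd.date_range(start=start, periods=periods, freq=us_bday)

print(plain_idx.strftime("%Y-%m-%d").tolist())
print(holiday_idx.strftime("%Y-%m-%d").tolist())
```

Both indexes contain the same number of periods; the holiday-aware one simply extends further in calendar time to make up for skipped days.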

In [1]:
# 3.2 LSTM-based future forecasts (if LSTM is best)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.preprocessing import MinMaxScaler

lstm_future_df = None

if best_model_name == 'LSTM':
    print("Best model is LSTM – preparing data and refitting...")

    # Fit the scaler on the full price series (Adj Close, or Close as a fallback)
    scaler = MinMaxScaler(feature_range=(0, 1))
    tsla_scaled = scaler.fit_transform(tsla_close.values.reshape(-1, 1))

    seq_length = 60

    # Use only the training portion to build sequences for training
    train_scaled = scaler.transform(train_data.values.reshape(-1, 1))

    def create_sequences(data, seq_len=60):
        X, y = [], []
        for i in range(seq_len, len(data)):
            X.append(data[i-seq_len:i, 0])
            y.append(data[i, 0])
        return np.array(X), np.array(y)

    X_train, y_train = create_sequences(train_scaled, seq_length)
    X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))

    # Build LSTM architecture (same style as Task 2)
    lstm_model = Sequential([
        LSTM(50, return_sequences=True, input_shape=(seq_length, 1)),
        Dropout(0.2),
        LSTM(50, return_sequences=True),
        Dropout(0.2),
        LSTM(50),
        Dropout(0.2),
        Dense(1)
    ])

    lstm_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

    history = lstm_model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_split=0.2,
        callbacks=[early_stopping],
        verbose=1
    )

    # Iterative multi-step forecast
    last_seq = tsla_scaled[-seq_length:].reshape(1, seq_length, 1)
    future_scaled = []

    for _ in range(forecast_steps):
        next_pred = lstm_model.predict(last_seq, verbose=0)[0, 0]
        future_scaled.append(next_pred)
        # slide window
        last_seq = np.append(last_seq[:, 1:, :], [[[next_pred]]], axis=1)

    future_scaled = np.array(future_scaled).reshape(-1, 1)
    future_prices = scaler.inverse_transform(future_scaled).flatten()

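    # Build a business-day index for the horizon (weekends are skipped, market holidays are not)
    last_date = pd.Timestamp(tsla_close.index[-1])
    future_index = pd.bdate_range(start=last_date + timedelta(days=1), periods=forecast_steps)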

    lstm_future_df = pd.DataFrame({'Forecast': future_prices}, index=future_index)

    print("Generated LSTM future forecasts:")
    display(lstm_future_df.head())
else:
    print("Best model is not LSTM – this cell will be skipped.")


## 4. Visualize Historical Data, Test Set, and Future Forecasts

We now:
- Plot the full historical price series (Adjusted Close, or Close as a fallback).
- Highlight the test period used in Task 2.
- Overlay the future forecasts from the selected best model.
- Show confidence intervals when available (SARIMA).

In [1]:
# 4.1 Plot historical data, test period, and future forecasts

plt.figure(figsize=(16, 8))

# Historical TSLA prices; the label reflects the column actually loaded ('Adj Close' or 'Close')
plt.plot(tsla_close.index, tsla_close.values, label=f'Historical TSLA ({tsla_close.name})', color='blue', linewidth=1.5)

# Test period highlight
plt.plot(test_data.index, test_data.values, label='Test Period (Task 2)', color='green', linewidth=2)

# Future forecasts from selected best model
if best_model_name == 'SARIMA' and sarima_future_df is not None:
    plt.plot(sarima_future_df.index, sarima_future_df['Forecast'],
             label='SARIMA Future Forecast', color='red', linestyle='--', linewidth=2)
    plt.fill_between(
        sarima_future_df.index,
        sarima_future_df['Lower_CI'],
        sarima_future_df['Upper_CI'],
        color='red', alpha=0.2, label='95% Confidence Interval'
    )
elif best_model_name == 'LSTM' and lstm_future_df is not None:
    plt.plot(lstm_future_df.index, lstm_future_df['Forecast'],
             label='LSTM Future Forecast', color='red', linestyle='--', linewidth=2)

plt.axvline(test_data.index[0], color='black', linestyle='--', linewidth=1, label='Train/Test Split (2025-01-01)')

plt.title('TSLA – Historical Prices, Test Period, and 12-Month Forecast', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Price (USD)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../data/processed/task3_tsla_future_forecast.png', dpi=300, bbox_inches='tight')
plt.show()

NameError: name 'plt' is not defined

## 5. Task Complete

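This notebook builds the 12-month forecast for the selected model and saves the chart to `../data/processed/task3_tsla_future_forecast.png`. Run the cells in order from the imports cell, since Section 4 relies on names defined in the earlier cells.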

**Note:** Trend analysis, opportunities, risks, and conclusions are documented in `final_report.ipynb` (Section 4.3).
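
As a starting point for that trend analysis, a forecast frame with `Forecast`, `Lower_CI`, and `Upper_CI` columns can be reduced to a few headline numbers. The sketch below is an assumption about how such a summary might look, not the method used in the final report; `summarize_forecast` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a forecast frame (values are illustrative, not TSLA forecasts)
steps = 252
idx = pd.bdate_range("2030-01-01", periods=steps)
center = np.linspace(100.0, 110.0, steps)
half_width = 2.0 * np.sqrt(np.arange(1, steps + 1))   # band that widens with the horizon
forecast_df = pd.DataFrame(
    {"Forecast": center, "Lower_CI": center - half_width, "Upper_CI": center + half_width},
    index=idx,
)

def summarize_forecast(df):
    """Hypothetical helper: direction of the point forecast and growth of the interval."""
    first, last = df["Forecast"].iloc[0], df["Forecast"].iloc[-1]
    width = df["Upper_CI"] - df["Lower_CI"]
    return {
        "expected_change_pct": (last / first - 1.0) * 100.0,
        "direction": "up" if last > first else "down" if last < first else "flat",
        "ci_width_start": width.iloc[0],
        "ci_width_end": width.iloc[-1],
        "ci_width_growth": width.iloc[-1] / width.iloc[0],
    }

summary = summarize_forecast(forecast_df)
print(summary)
```

A large `ci_width_growth` indicates that uncertainty dominates the far end of the horizon, so the later months of the forecast carry much less information than the first few weeks.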