# Python Assignment: LSTM for Stock Price Prediction

This assignment challenges you to build and evaluate a Long Short-Term Memory (LSTM) neural network for time-series forecasting, specifically applied to **stock price prediction**. LSTMs, a type of Recurrent Neural Network (RNN), are well-suited for sequence prediction due to their ability to learn long-term dependencies in data. You will acquire real-world stock data, preprocess it, construct an LSTM model, train it, and assess its forecasting accuracy for future stock prices.

## Part 1: Data Acquisition and Preprocessing (35 points)

You'll acquire historical stock data, handle potential issues, normalize the data, and transform it into the sequential format required by LSTMs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
import tensorflow as tf  # needed for tf.random.set_seed below
from tensorflow import keras
from tensorflow.keras import layers
import yfinance as yf # A library to download financial data
import warnings

warnings.filterwarnings('ignore') # Suppress warnings for cleaner output
np.random.seed(42) # for reproducibility
tf.random.set_seed(42)

# 1.1 Acquire Stock Data
#    Use the `yfinance` library to download historical stock data for a chosen ticker symbol (e.g., 'AAPL' for Apple, 'GOOG' for Google, 'RELIANCE.NS' for Reliance Industries on NSE).
#    Download at least 3 years of daily data (5 or more is preferable).
#    Focus on the 'Close' price as your target variable.

ticker_symbol = 'AAPL' # You can change this to another stock
start_date = '2019-01-01'
end_date = '2024-12-31'

print(f"\n--- Acquiring Stock Data for {ticker_symbol} ---")
# TODO: Download stock data using yfinance
# stock_data = yf.download(ticker_symbol, start=start_date, end=end_date)

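# Select the target variable
# target_series = stock_data['Close'].copy()
# Note: depending on your yfinance version, `stock_data` may have MultiIndex columns,
# in which case `stock_data['Close']` is a one-column DataFrame; `.squeeze()` turns it into a Series.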

print("Raw Data Head:\n", stock_data.head())
print("\nTarget Series Info:")
target_series.info()  # info() prints its report directly and returns None

# 1.2 Handle Missing Values (if any)
#    While yfinance usually returns clean data, check for any NaNs and handle them (e.g., with forward fill `ffill()`).

print("\n--- Handling Missing Values ---")
# TODO: Fill NaNs if present
# target_series_cleaned = target_series.ffill().bfill() # ffill then bfill for leading NaNs
print(f"NaNs after cleaning: {target_series_cleaned.isnull().sum()}")

# 1.3 Data Normalization
#    Normalize the `target_series_cleaned` data using `MinMaxScaler` to values between 0 and 1.
#    **Store the scaler object**; you'll need it to inverse transform predictions back to the original price scale.

print("\n--- Normalizing Data ---")
# TODO: Initialize and fit MinMaxScaler
# scaler = MinMaxScaler(feature_range=(0, 1))
# scaled_data = scaler.fit_transform(target_series_cleaned.values.reshape(-1, 1))
print(f"Scaled data shape: {scaled_data.shape}")

# 1.4 Create Sequences for LSTM
#    Implement a function to create input-output sequences for the LSTM.
#    - `n_steps_in`: Number of past daily closing prices to use as input features (e.g., 60 days).
#    - `n_steps_out`: Number of future daily closing prices to predict (e.g., 1 for next day).
#    The input X should be `(samples, n_steps_in, 1)` and output y should be `(samples, n_steps_out)`.

def create_sequences(data, n_steps_in, n_steps_out):
    X, y = [], []
    for i in range(len(data) - n_steps_in - n_steps_out + 1):
        # TODO: Define input sequence (X_seq)
        # X_seq = data[i:(i + n_steps_in), 0]
        # TODO: Define output sequence (y_seq)
        # y_seq = data[(i + n_steps_in):(i + n_steps_in + n_steps_out), 0]

        # X.append(X_seq)
        # y.append(y_seq)
    return np.array(X), np.array(y)

n_steps_in = 60 # Use 60 past days to predict
n_steps_out = 1 # Predict next 1 day

X_seq, y_seq = create_sequences(scaled_data, n_steps_in, n_steps_out)

print(f"Input sequences (X_seq) shape: {X_seq.shape}")
print(f"Output sequences (y_seq) shape: {y_seq.shape}")

# Reshape X for LSTM input: (samples, timesteps, features)
X_seq = X_seq.reshape(X_seq.shape[0], X_seq.shape[1], 1)
print(f"Reshaped X_seq for LSTM: {X_seq.shape}")

# 1.5 Train-Test Split
#    Split the sequential data into training and testing sets. **Crucially, maintain temporal order.**
#    (e.g., 80% train, 20% test, no shuffling).

train_size = int(len(X_seq) * 0.8)
X_train, X_test = X_seq[:train_size], X_seq[train_size:]
y_train, y_test = y_seq[:train_size], y_seq[train_size:]

print(f"Train set: {X_train.shape}, {y_train.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")
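
To make the shapes from steps 1.4 and 1.5 concrete, the sketch below runs the same sliding-window idea on a toy series of ten values. The helper name `make_windows` and the toy numbers are illustrative assumptions, not part of the assignment's required code.

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slide a window over a 1-D series (hypothetical helper for illustration)."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                  # n_in past values
        y.append(series[i + n_in:i + n_in + n_out])   # the next n_out values
    return np.array(X), np.array(y)

toy = np.arange(10, dtype=float)          # stand-in for scaled closing prices
X, y = make_windows(toy, n_in=3, n_out=1)
X = X.reshape(X.shape[0], X.shape[1], 1)  # (samples, timesteps, features)

split = int(len(X) * 0.8)                 # chronological split, no shuffling
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X.shape, y.shape)                   # (7, 3, 1) (7, 1)
print(X_train.shape, X_test.shape)        # (5, 3, 1) (2, 3, 1)
```

Because the split is a plain slice, every test target comes later in time than every training target, which is the property step 1.5 asks you to preserve.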


## Part 2: Building the LSTM Model (25 points)

You'll design and compile a simple LSTM neural network using Keras.

In [None]:
# 2.1 Define the LSTM Architecture
#    Create a `tf.keras.Sequential` model with:
#    - One or more `LSTM` layers. The first LSTM layer needs `input_shape=(n_steps_in, 1)`.
#      Consider using `return_sequences=True` for stacked LSTMs and `return_sequences=False` for the last LSTM layer before Dense.
#    - `Dropout` layers to prevent overfitting (e.g., 0.2-0.3).
#    - One or more `Dense` layers, with the final output layer having `n_steps_out` neurons (no activation for regression).

print("\n--- Building LSTM Model ---")
# TODO: Build the Sequential LSTM model
# model = keras.Sequential([
#     layers.LSTM(50, input_shape=(n_steps_in, 1)), # Example: 50 LSTM units; the default tanh activation is usually more stable than 'relu' here
#     layers.Dropout(0.2),
#     layers.Dense(n_steps_out) # Output layer for n_steps_out predictions
# ])

# 2.2 Compile the Model
#    Configure the model for training:
#    - `optimizer`: Choose `'adam'`.
#    - `loss`: Use `'mse'` (Mean Squared Error) for regression.
#    - `metrics`: Monitor `['mae']` (Mean Absolute Error) for interpretability.

# TODO: Compile the model
# model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# 2.3 Display Model Summary
#    Print the model summary to see the layers, output shapes, and parameter counts.

model.summary()
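
When reading the summary, it can help to check the parameter counts by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, so its count is 4 × (units × (input_dim + units) + units). The sketch below applies that to the example architecture from step 2.1 (50 units, one input feature, one Dense output); the function names are illustrative, and `Dropout` contributes no parameters.

```python
def lstm_param_count(units, input_dim):
    """Trainable parameters in one standard LSTM layer (4 gates, with bias)."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=50, input_dim=1)
dense_params = dense_param_count(inputs=50, outputs=1)
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)  # 10400 51 10451
```

If your summary shows different numbers, the most likely cause is a different number of units, stacked layers, or `n_steps_out` greater than 1.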


## Part 3: Training and Evaluation (30 points)

Train your LSTM model and evaluate its performance on the test set, visualizing the predictions against actual values.

In [None]:
# 3.1 Train the Model
#    Use `model.fit()` to train the LSTM.
#    - `epochs`: Choose a sufficient number (e.g., 50-100, or more if needed).
#    - `batch_size`: Choose a common batch size (e.g., 32, 64).
#    - `validation_split`: Use a portion of the training data for validation (e.g., 0.1 or 0.2).
#    - (Optional) Add `EarlyStopping` and `ModelCheckpoint` callbacks to prevent overfitting and save the best model.
#    Store the returned `history` object.

epochs = 50
batch_size = 64

# Optional Callbacks
# early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
# model_checkpoint = keras.callbacks.ModelCheckpoint('best_lstm_model.h5', monitor='val_loss', save_best_only=True)

print(f"\n--- Training LSTM Model for {epochs} epochs with batch size {batch_size} ---")
# TODO: Train the model
# history = model.fit(X_train, y_train,
#                     epochs=epochs,
#                     batch_size=batch_size,
#                     validation_split=0.1, # Using a split from training data for simplicity
#                     # callbacks=[early_stopping, model_checkpoint], # Uncomment if using callbacks
#                     verbose=1)

print("Training complete.")

# 3.2 Plot Training History
#    Plot the training and validation loss (MSE) and MAE over epochs.

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
mae_values = history_dict['mae']
val_mae_values = history_dict['val_mae']

epochs_trained = len(loss_values)
epochs_range = range(1, epochs_trained + 1)

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
# TODO: Plot training and validation loss
# plt.plot(epochs_range, loss_values, 'bo', label='Training Loss (MSE)')
# plt.plot(epochs_range, val_loss_values, 'b', label='Validation Loss (MSE)')
# plt.title('Training and Validation Loss')
# plt.xlabel('Epochs')
# plt.ylabel('Loss')
# plt.legend()

plt.subplot(1, 2, 2)
# TODO: Plot training and validation MAE
# plt.plot(epochs_range, mae_values, 'bo', label='Training MAE')
# plt.plot(epochs_range, val_mae_values, 'b', label='Validation MAE')
# plt.title('Training and Validation MAE')
# plt.xlabel('Epochs')
# plt.ylabel('MAE')
# plt.legend()

plt.tight_layout()
plt.show()

# 3.3 Evaluate on Test Data
#    Use `model.evaluate()` on your `X_test` and `y_test`.

print("\n--- Evaluating Model on Test Data ---")
# TODO: Evaluate the model
# test_loss, test_mae = model.evaluate(X_test, y_test, verbose=2)

print(f"Test Loss (MSE): {test_loss:.4f}")
print(f"Test MAE: {test_mae:.4f}")

# 3.4 Make Predictions and Inverse Transform
#    Generate predictions on the `X_test` set.
#    Inverse transform both the predictions and the actual `y_test` back to their original scale using your `scaler`.

print("\n--- Making Predictions and Inverse Transforming ---")
# TODO: Make predictions
# y_pred_scaled = model.predict(X_test)

# TODO: Inverse transform predictions and actuals
# y_pred = scaler.inverse_transform(y_pred_scaled)
# y_actual = scaler.inverse_transform(y_test)

# 3.5 Visualize Actual vs. Predicted Values
#    Plot a segment of the actual stock prices from the test set against the model's predictions.
#    (e.g., the last 100-200 predictions for clarity).

plot_start_idx = max(0, len(y_actual) - 200) # Last 200 test days, or the whole test set if it is shorter
plot_end_idx = len(y_actual)

plt.figure(figsize=(15, 7))
# TODO: Plot actuals and predictions
# plt.plot(y_actual[plot_start_idx:plot_end_idx], label='Actual Stock Price', color='blue')
# plt.plot(y_pred[plot_start_idx:plot_end_idx], label='Predicted Stock Price', color='red', linestyle='--')
# plt.title(f'Actual vs. Predicted Stock Price for {ticker_symbol} (Test Set)')
# plt.xlabel('Time Step (Days)')
# plt.ylabel('Stock Price ($)')
# plt.legend()
# plt.grid(True)
# plt.show()

# 3.6 Calculate RMSE and MAE on Original Scale
#    Compute RMSE and MAE using the inverse-transformed actuals and predictions to get interpretable errors in currency units.

rmse_orig_scale = np.sqrt(mean_squared_error(y_actual, y_pred))
mae_orig_scale = mean_absolute_error(y_actual, y_pred)

print(f"\nRMSE on original scale: {rmse_orig_scale:.4f}")  # In the stock's currency unit (e.g., USD, INR)
print(f"MAE on original scale: {mae_orig_scale:.4f}")  # In the stock's currency unit (e.g., USD, INR)
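
Min-max scaling maps x to (x - min) / (max - min), so the inverse is x_scaled × (max - min) + min. This is why the MAE reported by `model.evaluate()` on scaled data looks much smaller than the MAE in price units: the two differ by the factor (max - min). The sketch below works through this with made-up prices and predictions (all numbers and helper names are assumptions for illustration), using plain NumPy rather than the `scaler` object.

```python
import numpy as np

prices = np.array([100.0, 120.0, 150.0, 200.0])  # made-up prices that define the scaling
p_min, p_max = prices.min(), prices.max()

def scale(x):
    """Map prices into [0, 1] using the fitted min and max."""
    return (x - p_min) / (p_max - p_min)

def inverse_scale(x_scaled):
    """Map scaled values back to price units."""
    return x_scaled * (p_max - p_min) + p_min

# Round trip: scaling then inverting recovers the original prices
assert np.allclose(inverse_scale(scale(prices)), prices)

y_actual = np.array([120.0, 150.0, 200.0])
y_pred = np.array([110.0, 160.0, 180.0])         # made-up predictions

# Errors in price units
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
mae = np.mean(np.abs(y_actual - y_pred))

# The same MAE measured on the scaled data
mae_scaled = np.mean(np.abs(scale(y_actual) - scale(y_pred)))

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")       # RMSE: 14.1421, MAE: 13.3333
print(f"Scaled MAE: {mae_scaled:.4f}")           # Scaled MAE: 0.1333
```

RMSE is never smaller than MAE on the same errors; a large gap between them suggests a few predictions are far off.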


## Part 4: Reflection and Further Exploration (10 points)

Answer the following questions based on your understanding and observations from this assignment.

### Your Answers to Reflection Questions:

1.  **Explain why stock price prediction is considered a particularly challenging time-series forecasting problem.** What factors make it harder than, say, predicting power consumption or temperature?

    _(Your answer here)_

2.  **What is the significance of maintaining temporal order when splitting time-series data into training and test sets? Why can't you just use `train_test_split` with `shuffle=True`?**

    _(Your answer here)_

3.  **How would you justify your choice of `n_steps_in` for stock price prediction? What are the trade-offs of using a very short vs. a very long input sequence length?**

    _(Your answer here)_

4.  **Suggest two ways to potentially improve the performance of this LSTM model for stock price prediction.** Consider additional data sources (beyond just 'Close' price) or different model complexities.

    * **Improvement Idea 1:** _(Name and brief explanation)_
    * **Improvement Idea 2:** _(Name and brief explanation)_

5.  **In a real-world financial context, what are the ethical considerations or risks of relying solely on an LSTM model for making trading decisions?**

    _(Your answer here)_


## Deliverables:

1.  This completed Jupyter Notebook (`lstm_stock_prediction_assignment.ipynb`) with all code cells executed and reflection questions answered.
2.  Ensure all plots are clearly visible and well-labeled within the notebook.