In [11]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import joblib

In [2]:
# Load raw stock data
stock_data = pd.read_csv('../data/raw/AAPL_stock_data.csv', index_col='Date', parse_dates=True)

In [3]:
# Feature selection: Using 'Close' price for prediction
data = stock_data[['Close']].copy()

In [4]:
data.head()

Date,Close
2018-01-02,43.064999
2018-01-03,43.057499
2018-01-04,43.2575
2018-01-05,43.75
2018-01-08,43.587502


In [5]:
# Data normalization (Min-Max scaling)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)

In [6]:
# Creating sequences for time series
def create_sequences(data, seq_length):
    sequences = []
    labels = []
    for i in range(seq_length, len(data)):
        sequences.append(data[i-seq_length:i])
        labels.append(data[i])
    return np.array(sequences), np.array(labels)

In [7]:
SEQ_LENGTH = 50    # Considering the last 50 days to predict the next day's price
X, y = create_sequences(scaled_data, SEQ_LENGTH)
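
To make the windowing concrete, here is a small sketch on a toy series. The 6-value series and the window length of 3 are hypothetical, chosen only for readability; the function body repeats the logic of `create_sequences` above.

```python
import numpy as np

def create_sequences(data, seq_length):
    """Slide a window of seq_length rows; the row right after each window is its label."""
    sequences, labels = [], []
    for i in range(seq_length, len(data)):
        sequences.append(data[i - seq_length:i])
        labels.append(data[i])
    return np.array(sequences), np.array(labels)

# Toy stand-in for scaled_data: one column holding the values 0..5
toy = np.arange(6, dtype=float).reshape(-1, 1)   # shape (6, 1)
X_toy, y_toy = create_sequences(toy, seq_length=3)

print(X_toy.shape, y_toy.shape)     # (3, 3, 1) (3, 1)
print(X_toy[0].ravel(), y_toy[0])   # [0. 1. 2.] [3.]
```

A series of `n` rows yields `n - seq_length` sequences, because the first `seq_length` rows have no complete window before them.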

In [9]:
# Splitting the data into training, validation, and test sets
train_size = int(0.7 * len(X))
val_size = int(0.15 * len(X))
test_size = len(X) - train_size - val_size

X_train, X_val, X_test = X[:train_size], X[train_size:train_size + val_size], X[train_size + val_size:]
y_train, y_val, y_test = y[:train_size], y[train_size:train_size + val_size], y[train_size + val_size:]

# Summary of splits
print(f"Training data: {X_train.shape}, Validation data: {X_val.shape}, Test data: {X_test.shape}")

Training data: (846, 50, 1), Validation data: (181, 50, 1), Test data: (182, 50, 1)
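
The printed shapes add up to 1,209 sequences (846 + 181 + 182). As a quick check of the split arithmetic, the sketch below recomputes the three sizes from that total; the total itself is read off the output above rather than from the data file.

```python
n_sequences = 846 + 181 + 182            # total taken from the printed shapes

train_size = int(0.7 * n_sequences)      # int() truncates 846.3 down to 846
val_size = int(0.15 * n_sequences)       # int() truncates 181.35 down to 181
test_size = n_sequences - train_size - val_size   # the remainder goes to the test set

print(n_sequences, train_size, val_size, test_size)   # 1209 846 181 182
```

Because both `int()` calls truncate, the test set absorbs the leftover sequence, which is why it is one larger than the validation set.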


In [12]:
# Save the MinMaxScaler as a serialized object
scaler_path = "../data/processed/scaler.pkl"
joblib.dump(scaler, scaler_path)

['../data/processed/scaler.pkl']

In [13]:
# Save scaled data (for future predictions) and training, validation, test sets
np.save("../data/processed/scaled_data.npy", scaled_data)
np.save("../data/processed/X_train.npy", X_train)
np.save("../data/processed/y_train.npy", y_train)
np.save("../data/processed/X_val.npy", X_val)
np.save("../data/processed/y_val.npy", y_val)
np.save("../data/processed/X_test.npy", X_test)
np.save("../data/processed/y_test.npy", y_test)

## Data Preprocessing Summary
In this phase, the stock price data was prepared for time series modeling. The key steps were:

1. **Feature Selection:**
    - The `Close` price was selected as the only feature; it serves as both the model input and the prediction target.

2. **Data Normalization:**
    - Using `MinMaxScaler`, the data was normalized to the range [0, 1] so that model inputs stay on a consistent scale regardless of the absolute price level.

3. **Sequence Creation:**
    - Sequences of the past 50 days' closing prices (`SEQ_LENGTH = 50`) were generated as input features for the model, with the next day's closing price as the target label.

4. **Data Splitting:**
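    - The data was split into **training**, **validation**, and **test** sets (70%, 15%, and 15%, respectively) in chronological order, with no shuffling. Because the scaler in step 2 was fit on the full series before the split, the validation and test price ranges still influence the scaled training values.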

5. **Saving Preprocessed Data:**
    - All preprocessed data (the scaled series plus the training, validation, and test sets) was saved as `.npy` files in the `../data/processed/` directory.
    - The fitted `MinMaxScaler` was also saved to ensure consistent scaling is applied when making predictions or reversing the scaling.
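
Since `scaler.fit_transform(data)` runs before the split, the scaling range includes validation and test prices. A minimal sketch of one alternative is to take the minimum and maximum from the training rows only and apply them to the whole series. The values in `raw` and the cutoff `n_train` are hypothetical, and the scaling is written in plain NumPy so the arithmetic is visible; a `MinMaxScaler` fit on the training rows alone would apply the same formula.

```python
import numpy as np

# Synthetic stand-in for the Close column (hypothetical values, not AAPL data)
raw = np.array([10.0, 12.0, 11.0, 14.0, 13.0,
                16.0, 18.0, 17.0, 20.0, 19.0]).reshape(-1, 1)

n_train = 7                              # assumed cutoff: the first 7 of 10 rows, in time order
train_min = raw[:n_train].min(axis=0)    # statistics come from training rows only
train_max = raw[:n_train].max(axis=0)

def scale(x):
    """Min-max scale using the training range."""
    return (x - train_min) / (train_max - train_min)

def unscale(x_scaled):
    """Map scaled values back to the original price units."""
    return x_scaled * (train_max - train_min) + train_min

scaled = scale(raw)
print(scaled[:n_train].min(), scaled[:n_train].max())   # 0.0 1.0
print(scaled[n_train:].max() > 1.0)                     # True
```

With this approach, later prices can fall outside [0, 1] when they exceed the training range, which is expected. `unscale` plays the same role as reversing the scaling with the saved scaler.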

This phase sets the foundation for building, training, and evaluating RNN, LSTM, and GRU models in the upcoming steps.