Stock price prediction is a challenging task in finance, with applications ranging from personal investment strategies to algorithmic trading. In this article we will explore how to build a stock price prediction model using TensorFlow and Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN) that is well suited to time-series data like stock prices.
#####################################################################################################
What is LSTM?
LSTM stands for Long Short-Term Memory. It is a type of recurrent neural network (RNN) designed to handle sequence data (time series, text, speech), and it addresses the vanishing gradient problem of vanilla RNNs.
Problem with Vanilla RNN
RNNs pass information through hidden states step by step. But when sequences are long, gradients shrink during backpropagation (the vanishing gradient problem), so the model forgets long-term dependencies. Example sentence: "I grew up in France … I speak fluent ___." To predict "French", the model must remember "France" from many words ago. A vanilla RNN struggles with this.
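A quick numeric illustration of why gradients vanish: if each backpropagation step scales the gradient by a factor smaller than 1, the signal from early timesteps decays exponentially. This is a toy calculation, not a full RNN backward pass:

```python
# Toy illustration: gradient contribution after T steps when each step
# multiplies the gradient by 0.9 (a typical |derivative| < 1 in vanilla RNNs)
factor = 0.9
for T in (10, 50, 100):
    print(T, factor ** T)
# 10  -> ~0.35, 50 -> ~0.005, 100 -> ~0.00003: early steps barely contribute
```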
LSTM to the Rescue
LSTMs introduce a cell state plus gates that control what information to keep or forget. The key idea: instead of blindly passing hidden states along, an LSTM has a memory cell that can store information over long periods, and gates decide what to write, read, and forget.
The LSTM Cell
An LSTM cell has three gates:
- Forget gate $f_t$: decides what information to discard from the cell state. Formula: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
- Input gate $i_t$ with candidate memory $\tilde{C}_t$: decides what new information to store in the cell state. Formulas: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
- Output gate $o_t$: decides what to output as the hidden state. Formula: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
Updating Memory
Cell state update: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (keep some old information, add some new information)
Hidden state update: $h_t = o_t * \tanh(C_t)$
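To make these equations concrete, here is a minimal NumPy sketch of a single LSTM step. This is not TensorFlow's implementation: the weights and inputs are random placeholders, and the helper names `sigmoid` and `lstm_step` are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step, following the gate equations above."""
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    C_tilde = np.tanh(W_C @ z + b_C)     # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde   # cell state update
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    h_t = o_t * np.tanh(C_t)             # hidden state update
    return h_t, C_t

# Toy dimensions: 1 input feature, 4 hidden units, random placeholder weights
n_in, n_h = 1, 4
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(n_h, n_h + n_in)) * 0.1
b = lambda: np.zeros(n_h)
h, C = np.zeros(n_h), np.zeros(n_h)
h, C = lstm_step(rng.normal(size=n_in), h, C, W(), W(), W(), W(), b(), b(), b(), b())
print(h.shape, C.shape)  # (4,) (4,)
```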
Intuition with an Example
Imagine reading a paragraph:
- Forget gate: drops irrelevant information (e.g., "I went to the store").
- Input gate: stores useful information (e.g., "France").
- Cell state: carries memory across the paragraph.
- Output gate: emits what is relevant for the prediction (e.g., language = "French").
Applications of LSTMs
- Natural language processing (NLP): text generation, translation, sentiment analysis
- Time series forecasting: stock prices, weather
- Speech recognition
- Music generation
Why Important?
Before Transformers (like GPT), LSTMs were state-of-the-art for sequential tasks. They are still used in resource-constrained systems, being smaller and more efficient than Transformers.
#####################################################################################################
STEP 1. Importing Libraries
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
import seaborn as sns
import os
from datetime import datetime
import warnings

warnings.filterwarnings("ignore")
```

- pandas (pd): the primary library for data manipulation and analysis. Used here to read stock price data from CSV files or APIs, clean and preprocess it, organize it in DataFrames, handle missing values and date indexing, and filter, sort, and transform it. Example: `df = pd.read_csv('stock_data.csv')`
- matplotlib.pyplot (plt): the primary plotting library for Python. Used to plot stock price trends over time, visualize training vs. validation loss, create prediction-vs-actual price charts, and plot model performance metrics. Example: `plt.plot(dates, prices)`, `plt.show()`
- numpy (np): the fundamental library for numerical computing. Used for array operations on time-series data, mathematical transformations, reshaping data for neural network input, and statistics (mean, std, etc.). Example: `np.array(stock_prices)`, `np.reshape(data, (samples, timesteps, features))`
- tensorflow (tf): an open-source machine learning platform. Used to build and train the LSTM network, with GPU acceleration for faster training plus model compilation and optimization.
- keras: TensorFlow's high-level API for building neural networks. Used to create LSTM layers (`keras.layers.LSTM()`), build the architecture (`keras.Sequential()`), add Dense and Dropout layers, and compile the model with an optimizer and loss function. Example: `model = keras.Sequential([keras.layers.LSTM(50, return_sequences=True)])`
- seaborn (sns): a statistical data visualization library. Useful for correlation heatmaps between stock features, enhanced statistical plots, better-looking default styles for matplotlib, and distribution plots of prices. Example: `sns.heatmap(correlation_matrix)`
- os: the operating system interface, used for file path operations, checking whether data files exist, creating directories for saved models, and accessing environment variables. Example: `os.path.exists('data.csv')`, `os.mkdir('models')`
- datetime: date and time manipulation, used to convert date strings to datetime objects, do date arithmetic on the time series, generate timestamps for logging, and format dates for plots. Example: `datetime.strptime('2023-01-01', '%Y-%m-%d')`
- warnings.filterwarnings("ignore"): suppresses warning messages (e.g., TensorFlow version compatibility and pandas future warnings) to reduce console clutter during training and keep the output readable.
#####################################################################################################
STEP 2. Loading the Dataset
We will load the dataset containing stock prices over a 5-year period. The read_csv function loads the dataset into a pandas DataFrame for further analysis.
- delimiter=',': specifies that columns in the CSV file are separated by commas.
- on_bad_lines='skip': skips any corrupted or malformed rows instead of crashing.
```python
data = pd.read_csv('all_stocks_5yr.csv', delimiter=',', on_bad_lines='skip')
print(data.shape)
print(data.sample(7))
```

- data.shape: the dimensions of the dataset as (rows, columns), e.g., (125000, 7); the first number is the total number of rows (data points) and the second is the number of columns (features).
- data.sample(7): randomly selects 7 rows from the DataFrame, which is useful for seeing what the actual data looks like, understanding its structure and format, and checking column names and sample values.
Since the data includes a date feature, that column is likely to be loaded as the 'object' data type.
```python
data.info()
```
Whenever we deal with a date or time feature, it should be in the datetime data type. The pandas library lets us convert the object-typed date column to datetime.
```python
data['date'] = pd.to_datetime(data['date'])
data.info()
```

- pd.to_datetime() converts string dates to datetime objects, and the assignment replaces the original date column with the converted version.
- Why this is important: before, dates might be stored as strings like '2013-02-08' or '02/08/2013'; after conversion they are proper datetime objects that pandas can sort, filter, and plot correctly.
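As a quick standalone illustration (with made-up dates, separate from the stock data), the dtype changes like this:

```python
import pandas as pd

s = pd.Series(['2013-02-08', '2013-02-11'])
print(s.dtype)                  # object (plain strings)
print(pd.to_datetime(s).dtype)  # datetime64[ns]
```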
##############################################################################################################
STEP 3. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a technique used to analyze data through visualization and manipulation. For this project, let us visualize the data of well-known companies such as Nvidia, Google, Apple, and Facebook. First, we consider a few companies and visualize the distribution of their open and close stock prices over the five years.
```python
companies = ['AAPL', 'AMD', 'FB', 'GOOGL', 'AMZN', 'NVDA', 'EBAY', 'CSCO', 'IBM']

plt.figure(figsize=(15, 8))
for index, company in enumerate(companies, 1):
    plt.subplot(3, 3, index)
    c = data[data['Name'] == company]
    plt.plot(c['date'], c['close'], c="r", label="close", marker="+")
    plt.plot(c['date'], c['open'], c="g", label="open", marker="^")
    plt.title(company)
    plt.legend()
plt.tight_layout()
```

- plt.figure(figsize=(15, 8)): creates a figure 15 inches wide by 8 inches tall, large enough to accommodate a 3×3 grid of subplots.
- enumerate(companies, 1): yields (index, company) pairs starting from 1, e.g., (1, 'AAPL'), (2, 'AMD'), (3, 'FB'); index positions the subplot (1-9) and company is the current ticker symbol.
- plt.subplot(3, 3, index): places the current plot in a grid of 3 rows and 3 columns.
- data[data['Name'] == company]: the comparison creates a boolean mask (True/False for each row) and indexing with it keeps only the rows for the current company.
Now let's plot the trading volume for these nine stocks as a function of time.
```python
plt.figure(figsize=(15, 8))
for index, company in enumerate(companies, 1):
    plt.subplot(3, 3, index)
    c = data[data['Name'] == company]
    plt.plot(c['date'], c['volume'], c='purple', marker='*')
    plt.title(f"{company} Volume")
plt.tight_layout()
```

- This second figure, also 15×8 inches, plots trading volume over time: c['volume'] on the y-axis is the number of shares traded each day, drawn in purple with star ('*') markers.
First figure: nine subplots of price trends; red lines with + markers are closing prices, green lines with ^ markers are opening prices, and each subplot shows one company's price movement over time.
Second figure: nine subplots of volume trends; purple lines with * markers show how much trading activity occurred each day.
Purpose: explore the data to understand price patterns and trading activity, compare different stocks visually, spot trends, volatility, and correlations, and decide which stocks to focus on for prediction. This visualization helps you understand your data before building the LSTM model.
Now let's analyze Apple's stock data from 2013 to 2018.
```python
apple = data[data['Name'] == 'AAPL']
prediction_range = apple.loc[(apple['date'] > datetime(2013, 1, 1))
                             & (apple['date'] < datetime(2018, 1, 1))]

plt.plot(apple['date'], apple['close'])
plt.xlabel("Date")
plt.ylabel("Close")
plt.title("Apple Stock Prices")
plt.show()
```

- apple: a DataFrame containing only Apple's records (dates, prices, volume, etc.); data['Name'] == 'AAPL' creates a boolean mask and indexing with it keeps only the matching rows.
- prediction_range: Apple data restricted to the five-year period between January 1, 2013 and January 1, 2018. datetime(2013, 1, 1) and datetime(2018, 1, 1) create the boundary dates, & is the logical AND (both conditions must hold), and apple.loc[...] uses label-based indexing to filter the rows.
Now let's select a subset of the data for training, leaving the remaining portion for validation.
```python
close_data = apple.filter(['close'])
dataset = close_data.values
training = int(np.ceil(len(dataset) * .95))
print(training)
```

- apple.filter(['close']): creates a new DataFrame with only the closing prices. We use only the close because LSTM models here focus on one target variable, the closing price is the most important price for prediction, and a single feature keeps the model simple.
- close_data.values: extracts the underlying NumPy array from the DataFrame.
- training: the number of data points used for training, i.e., 95% of the dataset, rounded up with np.ceil and cast to int. For example, if the dataset had 1,259 trading days, training = ceil(1259 × 0.95) = 1197.
Now that we have the training data length, we apply scaling and prepare the features and labels: x_train and y_train.
```python
from sklearn.preprocessing import MinMaxScaler
```

- MinMaxScaler (from scikit-learn) normalizes data to a specific range, usually 0-1. Neural networks train better on normalized data.
```python
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
```

- feature_range=(0, 1): the target range (minimum 0, maximum 1) for the scaler object.
- fit_transform(): fit learns the min/max values from the data and transform applies the scaling, so scaled_data holds all stock prices mapped to values between 0 and 1.
- One caveat: fitting the scaler on the full dataset lets information from the test period leak into training; strictly speaking, the scaler should be fit on the training split only.
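As a quick standalone illustration (toy numbers, not the stock data), scaling maps the minimum to 0 and the maximum to 1, and inverse_transform undoes it:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[10.0], [15.0], [20.0]])
sc = MinMaxScaler(feature_range=(0, 1))
scaled = sc.fit_transform(toy)
print(scaled.ravel())                        # [0.  0.5 1. ]
print(sc.inverse_transform(scaled).ravel())  # [10. 15. 20.]
```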
```python
train_data = scaled_data[0:int(training), :]
```

- 0:int(training): rows from index 0 up to the training size (95% of the data); the lone : keeps all columns (there is only one here). train_data now contains the scaled training portion.
```python
x_train = []
y_train = []
```

- Empty lists for the training features and labels: x_train will hold the input sequences (60 days of prices each), and y_train will hold the target values (the next day's price).
```python
for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])
```

- range(60, len(train_data)): the loop starts at index 60 because we need 60 previous days to predict the next day.
- train_data[i-60:i, 0]: takes 60 consecutive days of closing prices, from 60 days ago up to (but not including) day i, in the first (and only) column. Example: if i=100, days 40-99 form the input used to predict day 100.
- train_data[i, 0]: the price on day i, which is what we want to predict; if i=100, the target is the price on day 100.
```python
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
```

- np.array(...): converts the Python lists of sequences and targets into NumPy arrays, which is the input format the LSTM requires.
- np.reshape(...): turns the 2D array of shape (samples, 60) into the 3D shape (samples, 60, 1) that LSTM layers expect: samples is the number of sequences, 60 is the timesteps (days) per sequence, and 1 is the number of features (the closing price).
Data structure example:
Before reshaping: x_train.shape = (890, 60) - 890 sequences, each with 60 days.
After reshaping: x_train.shape = (890, 60, 1) - 890 sequences, 60 timesteps, 1 feature.

Sequence creation example, if you have 1000 days of data:
Days 1-60 → input sequence, day 61 → target
Days 2-61 → input sequence, day 62 → target
Days 3-62 → input sequence, day 63 → target
...and so on.

This creates a sliding-window approach in which the LSTM learns to predict the next day's price from the previous 60 days. Why this approach? LSTMs need sequential data to learn patterns; 60 days is enough history to capture trends; normalization helps the LSTM train faster and more accurately; and the sliding window maximizes training data by creating overlapping sequences. This is the standard preparation for LSTM time-series prediction.
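To see the sliding window in miniature, here is a toy example with a window of 3 instead of 60 (made-up prices):

```python
import numpy as np

prices = np.array([1., 2., 3., 4., 5., 6.])
window = 3
X, y = [], []
for i in range(window, len(prices)):
    X.append(prices[i-window:i])  # previous `window` days as input
    y.append(prices[i])           # the next day as target
X, y = np.array(X), np.array(y)
print(X)  # [[1. 2. 3.] [2. 3. 4.] [3. 4. 5.]]
print(y)  # [4. 5. 6.]
```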
###############################################################################################
STEP 4. Building the LSTM Network Using TensorFlow
Using TensorFlow, we can easily create an LSTM model. LSTMs are used in recurrent neural networks for sequence models and time-series data, and they avoid the vanishing gradient issue that commonly occurs when training plain RNNs. To stack multiple LSTM layers in TensorFlow, you must set return_sequences=True on every LSTM layer except the last. Since we are predicting a single continuous value, the output layer has one node and no activation function.
```python
model = keras.models.Sequential()
model.add(keras.layers.LSTM(units=64,
                            return_sequences=True,
                            input_shape=(x_train.shape[1], 1)))
model.add(keras.layers.LSTM(units=64))
model.add(keras.layers.Dense(32))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(1))
model.summary()
```

- keras.models.Sequential(): creates a sequential model in which layers are stacked one after another in a linear flow.
- First LSTM layer: units=64 gives 64 memory units (more units mean more learning capacity but slower training). return_sequences=True makes it return an output at every timestep, which is required because the next layer is also an LSTM; with False it would return only the final output. input_shape=(x_train.shape[1], 1) declares the input dimensions: x_train.shape[1] is 60 timesteps (days) and 1 is the single feature (closing price), giving shape (60, 1).
- Second LSTM layer: also 64 units; return_sequences defaults to False, so it returns only the final output, which is what the following Dense layer needs. This layer learns higher-level patterns from the first LSTM's output.
- Dense(32): a fully connected layer with 32 neurons that combines and processes the LSTM outputs; the default activation is linear (none).
- Dropout(0.5): regularization that randomly sets 50% of the units to 0 during training, preventing the model from memorizing the training data and forcing it to learn more robust patterns.
- Dense(1): the final layer; a single neuron with linear activation outputs one number, the predicted next-day stock price.
- model.summary(): prints the architecture and parameter counts.
Network architecture flow:
Input (60, 1) → LSTM(64) → LSTM(64) → Dense(32) → Dropout(0.5) → Dense(1) → Output

Layer details: the input is 60 days of stock prices; LSTM layer 1 has 64 units and returns sequences; LSTM layer 2 has 64 units and returns the final output; the Dense layer has 32 units for feature combination; Dropout applies 50% dropout for regularization; and the output layer has 1 unit for the price prediction.

Why this architecture? Two LSTM layers let the first learn basic patterns and the second learn more complex ones; 64 units is a good balance between capacity and training speed; the Dense layer combines the LSTM outputs into the final prediction; Dropout prevents overfitting on the training data; and a single output predicts one value, the next day's price.
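As a sanity check on model.summary(), the parameter counts can be worked out by hand: a standard Keras LSTM layer has 4 × units × (input_dim + units + 1) parameters (four gates, each with input weights, recurrent weights, and a bias). A small sketch of the arithmetic:

```python
# Hand-computed parameter counts for the architecture above
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return input_dim * units + units

total = (lstm_params(1, 64)      # first LSTM:  16,896
         + lstm_params(64, 64)   # second LSTM: 33,024
         + dense_params(64, 32)  # Dense(32):    2,080
         + dense_params(32, 1))  # Dense(1):        33
print(total)                     # 52,033 trainable parameters
```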
###############################################################################################
STEP 5. Model Compilation and Training
When compiling a model we provide these essential parameters:
- optimizer: the method that minimizes the loss function using gradient descent.
- loss: the loss function we monitor to see whether the model improves with training.
- metrics: optional measures used to evaluate the model on the training and validation data (not used in this example).
```python
model.compile(optimizer='adam', loss='mean_squared_error')
```

- optimizer='adam': Adam is an adaptive-learning-rate optimizer that works well for most problems and adjusts the learning rate automatically; alternatives include 'sgd' and 'rmsprop'.
- loss='mean_squared_error': MSE measures the average squared difference between predicted and actual prices, which suits regression problems (predicting continuous values like stock prices). Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
```python
history = model.fit(x_train, y_train, epochs=10)
```

- x_train: the input sequences (60 days of prices per sample); y_train: the actual next-day price for each sequence.
- epochs=10: the number of complete passes through the training data; in each epoch the model sees all training samples once, and this repeats 10 times.
- history: stores the training metrics (loss values over time).
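Because history records the loss for each epoch, you can inspect the training curve; a minimal sketch using the matplotlib import from Step 1 (history.history['loss'] is the standard Keras key):

```python
# Plot the training loss recorded by model.fit()
plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Training loss (MSE)')
plt.title('Training Loss per Epoch')
plt.show()
```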
To make predictions we need testing data, so we first create the test set and then run the model on it.
```python
test_data = scaled_data[training - 60:, :]
```

- training - 60: the slice starts 60 days before the training cutoff because predicting the first test day requires the 60 preceding days as context; the trailing : takes all rows from that point to the end, so test_data holds the scaled test data plus the 60 days of context before it.
- Example: if training = 950, this takes data from index 890 to the end.
```python
x_test = []                     # will hold the 60-day input sequences
y_test = dataset[training:, :]  # actual (unscaled) prices from the training cutoff onward
```

- y_test comes from the original, unscaled dataset because we need the real prices to compare against the predictions, not scaled values.
```python
for i in range(60, len(test_data)):
    x_test.append(test_data[i-60:i, 0])
```

- The same logic as for the training data: loop through the test data from index 60, taking 60 consecutive days of scaled prices as the input for predicting the next day.
```python
x_test = np.array(x_test)
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
```

- Converts the list of sequences to a NumPy array (the LSTM requires NumPy input) and reshapes it into the 3D format (samples, 60, 1), the same format as the training data.
```python
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)
```

- model.predict(): runs a forward pass through the trained network; the output has the same shape as y_test but is in scaled units.
- scaler.inverse_transform(): undoes the MinMax scaling so the predictions are back in original price units (dollars) and can be compared with the actual prices.
```python
mse = np.mean((predictions - y_test) ** 2)
rmse = np.sqrt(mse)
print("MSE", mse)
print("RMSE", rmse)
```

- MSE: the average of the squared differences between the predicted and actual prices.
- RMSE: the square root of MSE; because it is in the same units as the stock prices (dollars), it is easier to interpret.
Now that we have predictions for the test data, let us visualize the final results.
```python
train = apple[:training]
test = apple[training:].copy()   # .copy() avoids pandas' SettingWithCopyWarning
test['Predictions'] = predictions
```
```python
plt.figure(figsize=(10, 8))
plt.plot(train['date'], train['close'])
plt.plot(test['date'], test[['close', 'Predictions']])
plt.title('Apple Stock Close Price')
plt.xlabel('Date')
plt.ylabel("Close")
plt.legend(['Train', 'Test', 'Predictions'])
plt.show()
```
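As a usage example, you could also forecast one step beyond the dataset by feeding the model the most recent 60 days. This is a hedged sketch reusing the scaler and model defined above; the variable names last_60 and next_day are illustrative:

```python
# Predict the day after the last date in the dataset from the final 60 closes
last_60 = scaled_data[-60:].reshape(1, 60, 1)         # (1, timesteps, features)
next_day_scaled = model.predict(last_60)
next_day = scaler.inverse_transform(next_day_scaled)  # back to dollars
print("Predicted next close:", next_day[0, 0])
```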
###############################################################################################
The chart shows Apple's stock closing price over time, with the "Train" data representing the historical prices used for model training, the "Test" data used for evaluation, and "Predictions" showing the model's forecasted values. It visually demonstrates how well the model's predictions align with the actual stock prices, highlighting regions of accurate forecasting and of divergence.