# 3. Data Preprocessing

<a id="contents"></a>
# Table of Contents  
3.1. [Introduction](#introduction) <br>
3.2. [Imports](#imports)  <br>
3.3. [Data Preprocessing](#process)<br>
3.4. [Scale the Data](#data)<br>
3.5. [Create LSTM Sequences](#create)<br>
3.6. [Data Splitting](#split)<br>
3.7. [Save Updated Dataframe](#save)

## 3.1 Introduction<a id="introduction"></a>

This notebook builds a cleaned, scaled dataset for the modeling step of the project. It selects the price and volume columns, standardizes them per stock, builds LSTM input sequences, and saves the training and test splits.

## 3.2 Imports<a id="imports"></a>

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.dates as mdates
import matplotlib.pyplot as plt 
import seaborn as sns 
import os
import csv
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import pickle

In [2]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/Concated_Dataframe.csv')
df = df[df['stock_symbol'].isin(['EL','ULTA','COTY','ELF'])]
formulas_to_keep = ['stock_symbol','Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'EMA_10', 'PSARl_0.02_0.2', 'PSARs_0.02_0.2', 'BBL_5_2.0', 'BBM_5_2.0', 'BBU_5_2.0', 'ISA_9', 'ISB_26', 'ITS_9', 'IKS_26', 'ICS_26']
df = df[formulas_to_keep]
df.head()
# Load previously saved scalers (expects scalers.pkl in the working directory)
with open('scalers.pkl', 'rb') as f:
    scalers = pickle.load(f)

In [3]:
df.Date = pd.to_datetime(df.Date)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15880 entries, 0 to 15879
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   stock_symbol    15880 non-null  object        
 1   Date            15880 non-null  datetime64[ns]
 2   Open            15880 non-null  float64       
 3   High            15880 non-null  float64       
 4   Low             15880 non-null  float64       
 5   Close           15880 non-null  float64       
 6   Volume          15880 non-null  int64         
 7   Dividends       15880 non-null  float64       
 8   Stock Splits    15880 non-null  float64       
 9   EMA_10          15880 non-null  float64       
 10  PSARl_0.02_0.2  15880 non-null  float64       
 11  PSARs_0.02_0.2  15880 non-null  float64       
 12  BBL_5_2.0       15880 non-null  float64       
 13  BBM_5_2.0       15880 non-null  float64       
 14  BBU_5_2.0       15880 non-null  float64       
 15  IS

## 3.3 Data Preprocessing<a id="process"></a>

In [5]:
# Function to preprocess data
def preprocess_data(df):
    df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'stock_symbol']].copy()
    df.set_index('Date', inplace=True)
    df = df.sort_index()
    return df

# Preprocess the data
df_1 = preprocess_data(df)
print(df_1.head())

                Open      High       Low     Close    Volume stock_symbol
Date                                                                     
1995-11-17  6.086093  6.530846  6.062685  6.460622  35659200           EL
1995-11-20  6.437214  6.601070  6.132909  6.226542   8434000           EL
1995-11-21  6.179726  6.203134  5.945645  6.109501   6440000           EL
1995-11-22  6.086094  6.390399  6.086094  6.296767   3480800           EL
1995-11-24  6.343582  6.530847  6.320174  6.530847   1279200           EL


In [6]:
predictor_df = df_1.drop('Close', axis=1)
target_df = df_1[['Close']]

## 3.4 Scale the Data<a id="data"></a>

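Standardization (z-score normalization) rescales each feature to have a mean of 0 and a standard deviation of 1. Here it is applied separately to each stock, so that stocks trading at very different price levels end up on a comparable scale, which generally helps neural networks such as LSTMs train more stably.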
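
As a minimal sketch of the transformation itself, the snippet below standardizes a small array by hand. The prices are hypothetical values chosen for illustration, not rows from the dataset. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical closing prices for one stock (illustrative values only).
close = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

mean = close.mean()
std = close.std()  # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
z = (close - mean) / std

# Reversing the transform recovers the original prices.
restored = z * std + mean
```

After the transform, `z` has a mean of 0 and a standard deviation of 1, and `restored` matches `close`.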

In [7]:
target_df

Date,Close
1995-11-17,6.460622
1995-11-20,6.226542
1995-11-21,6.109501
1995-11-22,6.296767
1995-11-24,6.530847
...,...
2024-03-27,513.520020
2024-03-28,196.029999
2024-03-28,154.149994
2024-03-28,522.880005


In [8]:
# Separate Close from the df:
predictor_df = df_1.drop('Close', axis=1)
target_df = df_1[['Close', 'stock_symbol']]

In [9]:
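def scale_data(df_1):
    """Standardize each stock's numeric columns separately.

    Returns the scaled DataFrame and a dict mapping stock symbol to its fitted scaler.
    """
    scalers = {}
    scaled_frames = []

    for stock in df_1['stock_symbol'].unique():
        stock_data = df_1[df_1['stock_symbol'] == stock]
        features = stock_data.drop(columns='stock_symbol')
        scaler = StandardScaler()
        scaled_values = scaler.fit_transform(features)
        # Take column names from `features` so the result does not depend on
        # 'stock_symbol' being the last column.
        scaled_df = pd.DataFrame(scaled_values, columns=features.columns, index=features.index)
        scaled_df['stock_symbol'] = stock
        scalers[stock] = scaler
        scaled_frames.append(scaled_df)

    # Concatenate once at the end instead of growing a DataFrame inside the loop
    scaled_data = pd.concat(scaled_frames)
    return scaled_data, scalers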


# Scale the data for our predictor variables:
predictor_scaled_df, predictor_scalers = scale_data(predictor_df)
# Scale the data for our target variable:
target_scaled_df, target_scaler = scale_data(target_df) 


# Save the scalers:
with open('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/predictor_scalers.pkl', 'wb') as f:
    pickle.dump(predictor_scalers, f)
    
with open('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/target_scaler.pkl', 'wb') as f:
    pickle.dump(target_scaler, f)
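
One likely use of the saved scalers is converting scaled values back to prices later on. The sketch below shows that round trip under stated assumptions: the symbols `AAA` and `BBB`, their prices, and the file name `demo_scalers.pkl` are all hypothetical, and a temporary directory stands in for the project folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical per-stock closing prices (illustrative values only).
prices = {
    "AAA": np.array([[10.0], [11.0], [12.0]]),
    "BBB": np.array([[100.0], [110.0], [130.0]]),
}

# One fitted scaler per symbol, mirroring the dictionary returned by scale_data.
scalers = {symbol: StandardScaler().fit(values) for symbol, values in prices.items()}

# Round-trip the dictionary through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_scalers.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(scalers, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Scale with the reloaded scaler, then invert to get the prices back.
scaled = loaded["BBB"].transform(prices["BBB"])
restored = loaded["BBB"].inverse_transform(scaled)
```

Because each stock has its own scaler, the inverse transform must use the scaler stored under the same symbol that produced the scaled values.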

In [10]:
target_scaled_df

Date,Close,stock_symbol
1995-11-17,-0.781448,EL
1995-11-20,-0.784352,EL
1995-11-21,-0.785804,EL
1995-11-22,-0.783481,EL
1995-11-24,-0.780577,EL
...,...,...
2024-03-22,4.332822,ELF
2024-03-25,4.189170,ELF
2024-03-26,4.131813,ELF
2024-03-27,4.091249,ELF


In [11]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import pickle

# Function to create sequences for LSTM
def create_sequences(predictor_data, target_data, seq_length, stock_symbol):
    sequences = []
    labels = []
    stock_symbols = []
    
    for i in range(len(predictor_data) - seq_length):
        seq = predictor_data[i:i + seq_length]
        label = target_data[i + seq_length]  # Use 'Close' as target variable
        sequences.append(seq)
        labels.append(label)
        stock_symbols.append(stock_symbol)
        
    return np.array(sequences), np.array(labels), np.array(stock_symbols)

# Define the sequence length
SEQ_LENGTH = 50

# Assuming predictor_scaled_df and target_scaled_df are already scaled
# and both have a 'stock_symbol' column.

# Filter by stock symbol and create sequences
all_sequences = []
all_labels = []
all_stock_symbols = []

for stock in predictor_scaled_df['stock_symbol'].unique():
    predictor_stock_data = predictor_scaled_df[predictor_scaled_df['stock_symbol'] == stock].drop(columns='stock_symbol').values
    target_stock_data = target_scaled_df[target_scaled_df['stock_symbol'] == stock]['Close'].values
    sequences, labels, stock_symbols = create_sequences(predictor_stock_data, target_stock_data, SEQ_LENGTH, stock)
    all_sequences.extend(sequences)
    all_labels.extend(labels)
    all_stock_symbols.extend(stock_symbols)

all_sequences = np.array(all_sequences)
all_labels = np.array(all_labels)
all_stock_symbols = np.array(all_stock_symbols)


## 3.5 Create LSTM Sequences<a id="create"></a>

Long Short-Term Memory (LSTM) models are designed to work with sequential data, so the scaled data must be arranged into sequences before training. Each sequence holds a fixed window of consecutive trading days (50 in this notebook), and its label is the scaled closing price of the day that follows. Creating these sequences is important for:

1) Understanding Trends: Stock prices are influenced by their historical values. By creating sequences, we allow the model to look at a series of past prices and learn how these prices evolve over time.

2) Capturing Patterns: Financial data often shows specific patterns, like trends or cycles. Sequences help the model recognize these patterns by providing context from previous days or weeks.

3) Improving Predictions: A single day's data is rarely enough to make accurate predictions. Sequences give the model a broader view, allowing it to make better-informed predictions about future prices.

4) Handling Time Dependency: Stock prices are inherently time-dependent; today's price is related to yesterday's price. Sequences ensure that this time dependency is captured in the model, which is crucial for making accurate forecasts.
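
To make the windowing concrete, here is a minimal sketch on toy data. The helper `make_windows` and the toy arrays are hypothetical and only illustrate the same sliding-window idea on a small scale: each window of `seq_length` consecutive rows is paired with the target value of the row that follows it.

```python
import numpy as np

def make_windows(features, target, seq_length):
    """Pair each run of `seq_length` consecutive rows with the next target value."""
    windows, labels = [], []
    for start in range(len(features) - seq_length):
        windows.append(features[start:start + seq_length])
        labels.append(target[start + seq_length])
    return np.array(windows), np.array(labels)

# Hypothetical toy data: 6 days, 2 features per day.
features = np.arange(12).reshape(6, 2)
target = np.array([100, 101, 102, 103, 104, 105])

X, y = make_windows(features, target, seq_length=3)
print(X.shape)  # (3, 3, 2)
print(y)        # [103 104 105]
```

Six days with a window of three yield three samples, each shaped `(seq_length, n_features)`, which is the three-dimensional input layout an LSTM layer expects.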

In [12]:
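# Ensure the collected sequences, labels, and symbols are NumPy arrays.
# They were already converted at the end of the previous cell, so this leaves them unchanged.
all_sequences = np.array(all_sequences)
all_labels = np.array(all_labels)
all_stock_symbols = np.array(all_stock_symbols)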

## 3.6 Data Splitting<a id="split"></a>

In [13]:
# Split into training and test sets. With shuffle=False the last 20% of rows
# (in the order the stocks were concatenated) form the test set.
X_train, X_test, y_train, y_test, stock_symbols_train, stock_symbols_test = train_test_split(
    all_sequences, all_labels, all_stock_symbols, test_size=0.2, shuffle=False
)
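
Because the sequence arrays are concatenated one stock after another, a positional split with `shuffle=False` holds out the tail of the combined array rather than the most recent 20% of each stock. If the intent is to test on every stock's most recent data, one possible alternative is to split within each stock. The helper `split_per_stock` and the toy arrays below are hypothetical, offered as a sketch rather than as the method used in this notebook.

```python
import numpy as np

def split_per_stock(sequences, labels, symbols, test_size=0.2):
    """Hold out the last `test_size` fraction of each stock's sequences."""
    train_idx, test_idx = [], []
    for symbol in np.unique(symbols):
        idx = np.flatnonzero(symbols == symbol)  # positions stay in time order
        n_test = max(1, round(len(idx) * test_size))
        train_idx.extend(idx[:-n_test])
        test_idx.extend(idx[-n_test:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return (sequences[train_idx], sequences[test_idx],
            labels[train_idx], labels[test_idx],
            symbols[train_idx], symbols[test_idx])

# Hypothetical toy data: 10 sequences for "AAA" followed by 5 for "BBB".
symbols = np.array(["AAA"] * 10 + ["BBB"] * 5)
sequences = np.arange(15 * 3 * 2).reshape(15, 3, 2)
labels = np.arange(15)

X_tr, X_te, y_tr, y_te, s_tr, s_te = split_per_stock(sequences, labels, symbols)
```

In this toy case the test set contains the last two `AAA` sequences and the last `BBB` sequence, so both symbols appear in training and in testing.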

## 3.7 Save Updated Dataframe<a id="save"></a>

The training and test sets are saved as NumPy `.npy` files so they can be loaded in the Modeling notebook. The cleaned dataframe is also written to CSV.

In [14]:
# Save the updated dataframe
df_1.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/Updated_df.csv')

print("Data saved successfully.")

Data saved successfully.


In [15]:
# Save the data to .npy files
np.save('X_train.npy', X_train, allow_pickle=True)
np.save('X_test.npy', X_test, allow_pickle=True)
np.save('y_train.npy', y_train, allow_pickle=True)
np.save('y_test.npy', y_test, allow_pickle=True)
np.save('stock_symbols_train.npy', stock_symbols_train, allow_pickle=True)
np.save('stock_symbols_test.npy', stock_symbols_test, allow_pickle=True)

print("Data saved successfully.")

Data saved successfully.
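
As a quick check that saved arrays can be read back, the sketch below round-trips two stand-in arrays through `np.save` and `np.load`. The array contents and file names are hypothetical. The shape `(4, 50, 4)` mirrors the layout used here: samples, a sequence length of 50, and four predictor columns (Open, High, Low, Volume).

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins with the same layout as X_train and the symbol array.
X_demo = np.zeros((4, 50, 4))
symbols_demo = np.array(["EL", "EL", "ULTA", "ELF"])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_demo.npy")
    s_path = os.path.join(tmp_dir, "symbols_demo.npy")
    np.save(x_path, X_demo)
    np.save(s_path, symbols_demo)

    # Numeric and fixed-width string arrays load without enabling pickle.
    X_loaded = np.load(x_path)
    symbols_loaded = np.load(s_path)
```

In the Modeling notebook, the equivalent step would be a `np.load` call per saved file, followed by a shape check before training.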
