# 3. Data Pre-processing

<a id="contents"></a>
# Table of Contents  
3.1. [Introduction](#introduction)  
3.2. [Imports](#imports)  
3.3. [Data Pre-processing](#process)  
3.4. [Scale the Data](#data)  
3.5. [Create LSTM Sequences](#create)  
3.6. [Data Splitting](#split)  
3.7. [Save Updated Dataframe](#save)

## 3.1 Introduction<a id="introduction"></a>

The goal of this notebook is to create a cleaned, scaled dataset, arranged into sequences and split into training and test sets, for use in the modeling step of my project.

## 3.2 Imports<a id="imports"></a>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.dates as mdates
import matplotlib.pyplot as plt 
import seaborn as sns 
import os
import csv
from tqdm.notebook import tqdm
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
import pickle

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
import math
from sklearn.metrics import mean_squared_error

In [None]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/Concated_Dataframe.csv')
df = df[df['stock_symbol'].isin(['EL','ULTA','COTY','ELF'])]
formulas_to_keep = ['stock_symbol','Date', 'Open', 'High', 'Low', 'Close','Volume', 'Dividends', 'Stock Splits', 'EMA_10', 'PSARl_0.02_0.2', 'PSARs_0.02_0.2', 'BBL_5_2.0', 'BBM_5_2.0', 'BBU_5_2.0', 'ISA_9', 'ISB_26', 'ITS_9', 'IKS_26', 'ICS_26']
df = df[formulas_to_keep].copy()  # explicit copy avoids pandas' SettingWithCopyWarning on later column assignments
df.head()

In [None]:
df.Date = pd.to_datetime(df.Date)

In [None]:
df.info()

## 3.3 Data Pre-processing<a id="process"></a>

In [None]:
# Function to preprocess data
def preprocess_data(df):
    df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'stock_symbol']].copy()
    
    # Set Date as index and sort
    df.set_index('Date', inplace=True)
    df = df.sort_index()
    
    return df

# Preprocess the data
df_1 = preprocess_data(df)
print(df_1.head())

## 3.4 Scale the Data<a id="data"></a>

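Standardization (z-score normalization) transforms each feature to have a mean of 0 and a standard deviation of 1. Each stock is scaled separately with its own `StandardScaler`, so that features on very different scales (for example, prices and trading volume) contribute comparably during training, which generally helps a neural network train more stably. The fitted scalers are saved so the same transformation can be reused or inverted later.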
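To make the transformation concrete, here is a minimal NumPy sketch of the z-score calculation on a single column. The price values are made up for illustration; `StandardScaler` applies the same formula per column using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical closing prices for one stock (illustrative values only)
close = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Z-score: subtract the mean, then divide by the population standard
# deviation (ddof=0), which is what StandardScaler uses
mean = close.mean()
std = close.std(ddof=0)
scaled = (close - mean) / std

# Reverse the transform to get back to the original price units
recovered = scaled * std + mean

print(scaled.mean(), scaled.std())
print(np.allclose(recovered, close))
```

The reverse step is the reason for keeping the fitted scalers: a model trained on scaled data predicts scaled values, which have to be mapped back to prices.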

In [None]:
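def scale_data(df):
    """Standardize each stock's numeric columns with its own StandardScaler."""
    scalers = {}
    scaled_frames = []

    for stock in df['stock_symbol'].unique():
        # Select this stock's rows and keep only the numeric feature columns
        features = df[df['stock_symbol'] == stock].drop(columns='stock_symbol')
        scaler = StandardScaler()
        scaled_values = scaler.fit_transform(features)
        # Take column names from the feature frame itself rather than by position
        scaled_df = pd.DataFrame(scaled_values, columns=features.columns, index=features.index)
        scaled_df['stock_symbol'] = stock
        scalers[stock] = scaler
        scaled_frames.append(scaled_df)

    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(scaled_frames), scalers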

scaled_df, scalers = scale_data(df_1)
print(scaled_df.head())

# Save the scalers
with open('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/scalers.pkl', 'wb') as f:
    pickle.dump(scalers, f)

## 3.5 Create LSTM Sequences<a id="create"></a>

Long Short-Term Memory (LSTM) models are designed to work with sequential data, so the data must be arranged into fixed-length sequences for training. Each sequence holds `SEQ_LENGTH` consecutive days of features, and its label is the scaled Close price of the day that follows. Creating sequences is a crucial step in preparing data for stock market prediction because it lets the model learn patterns and trends over time. Specifically, sequences are important for:

1) Understanding Trends: Stock prices are influenced by their historical values. By creating sequences, we allow the model to look at a series of past prices and learn how these prices evolve over time.

2) Capturing Patterns: Financial data often shows specific patterns, like trends or cycles. Sequences help the model recognize these patterns by providing context from previous days or weeks.

3) Improving Predictions: Just looking at a single day's data isn't enough to make accurate predictions. Sequences give the model a broader view, allowing it to make better-informed predictions about future prices.

4) Handling Time Dependency: Stock prices are inherently time-dependent; today's price is related to yesterday's price. Sequences ensure that this time dependency is captured in the model, which is crucial for making accurate forecasts.
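The windowing idea described above can be illustrated on a toy array before applying it to the stock data. This is only a sketch: the array, the window length of 3, and the choice of column 1 as the target are assumptions made for the example.

```python
import numpy as np

# Toy data: 6 time steps, 2 features (column 1 plays the role of the target)
data = np.arange(12, dtype=float).reshape(6, 2)
seq_length = 3  # hypothetical window size for illustration

n_windows = len(data) - seq_length

# Each window holds seq_length consecutive rows; its label is the target
# column of the row immediately after the window
X = np.array([data[i:i + seq_length] for i in range(n_windows)])
y = np.array([data[i + seq_length, 1] for i in range(n_windows)])

print(X.shape)  # (3, 3, 2): windows, time steps per window, features
print(y)        # labels 7, 9 and 11
```

The resulting three-dimensional shape (samples, time steps, features) is the input layout that Keras LSTM layers expect.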

In [None]:
# Function to create sequences for LSTM
def create_sequences(data, seq_length):
    sequences = []
    labels = []
    
    for i in range(len(data) - seq_length):
        seq = data[i:i+seq_length]
        label = data[i+seq_length][3]  # 'Close' is at index 3 (4th column: Open, High, Low, Close, Volume)
        sequences.append(seq)
        labels.append(label)
        
    return np.array(sequences), np.array(labels)

# Define the sequence length
SEQ_LENGTH = 50

# Filter by stock symbol and create sequences
all_sequences = []
all_labels = []

for stock in scaled_df['stock_symbol'].unique():
    stock_data = scaled_df[scaled_df['stock_symbol'] == stock].drop(columns='stock_symbol').values
    sequences, labels = create_sequences(stock_data, SEQ_LENGTH)
    all_sequences.extend(sequences)
    all_labels.extend(labels)

all_sequences = np.array(all_sequences)
all_labels = np.array(all_labels)

## 3.6 Data Splitting<a id="split"></a>

In [None]:
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(all_sequences, all_labels, test_size=0.2, shuffle=False)
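
Note that `all_sequences` is built by concatenating the stocks one after another, so with `shuffle=False` the test set is the last 20% of the concatenated array, not the most recent 20% of each stock. If a chronological split within every stock is preferred, one possible approach is sketched below. The helper `split_per_stock` and the toy inputs are hypothetical and are not part of the pipeline above.

```python
import numpy as np

def split_per_stock(seqs_by_stock, test_size=0.2):
    """Chronological split applied to each stock separately (hypothetical helper)."""
    X_tr, X_te, y_tr, y_te = [], [], [], []
    for X, y in seqs_by_stock.values():
        cut = int(len(X) * (1 - test_size))  # earliest windows train, latest windows test
        X_tr.append(X[:cut])
        X_te.append(X[cut:])
        y_tr.append(y[:cut])
        y_te.append(y[cut:])
    return (np.concatenate(X_tr), np.concatenate(X_te),
            np.concatenate(y_tr), np.concatenate(y_te))

# Toy input: two made-up stocks with 10 windows each, window shape (3, 2)
toy = {
    "AAA": (np.zeros((10, 3, 2)), np.zeros(10)),
    "BBB": (np.ones((10, 3, 2)), np.ones(10)),
}
Xtr, Xte, ytr, yte = split_per_stock(toy)
print(Xtr.shape, Xte.shape)  # (16, 3, 2) (4, 3, 2)
```

With this variant, every stock contributes its most recent windows to the test set.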

## 3.7 Save Updated Dataframe<a id="save"></a>

The filtered dataframe is saved as a CSV file, and the training and test sets are saved as NumPy arrays (`.npy` files) so they can be loaded in the Modeling notebook.

In [None]:
# save updated dataframe
df.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/Updated_df.csv')

print("Data saved successfully.")

In [None]:
# Save the data to .npy files
np.save('X_train.npy', X_train)
np.save('X_test.npy', X_test)
np.save('y_train.npy', y_train)
np.save('y_test.npy', y_test)

print("Data saved successfully.")
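
As a quick check that the saved arrays can be read back unchanged, the following sketch round-trips a toy array through `np.save` and `np.load`. The array and the temporary file path are placeholders; in the Modeling notebook the same `np.load` call would point at the `.npy` files written above.

```python
import os
import tempfile

import numpy as np

# Toy array standing in for X_train (hypothetical shape)
X_demo = np.random.rand(4, 3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_demo.npy")
    np.save(path, X_demo)     # write the array to disk
    X_loaded = np.load(path)  # read it back into memory

print(X_loaded.shape, np.array_equal(X_demo, X_loaded))
```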