# <center><font color = '#DF9166' size = 20 center> **Data Preprocessing**</font></center>



## <font color = '#DF9166' size=6>**Table of contents**<font/><a class = 'anchor' id = 'introduction'/>

1. [**Import Libraries**](#import)
2. [**Data Loading**](#data_loading)
3. [**Data Inspection**](#data_inspection)
4. [**Data Preprocessing**](#data_preprocessing)

## <font color = '#DF9166' size=6>**Import Libraries**<font/><a class = 'anchor' id = 'import'/>


In [1]:
import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
from IPython.display import display

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [3]:
sns.set_style("whitegrid")
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

In [4]:
sys.path.append(os.path.abspath(os.path.pardir))
from scripts.preprocess_data import FinancialDataProcessor

## <font color = '#DF9166' size=6>**Data Loading**<font/><a class = 'anchor' id = 'data_loading'/>


In [27]:
# Initialize the processor
processor = FinancialDataProcessor(["TSLA", "BND", "SPY"], "2015-01-01", "2025-01-31")

Initialized processor for TSLA, BND, SPY from 2015-01-01 to 2025-01-31.


In [28]:
# Fetch data
processor.fetch_data()

Fetching data from Yahoo Finance...
Downloading data for TSLA...
Downloading data for BND...
Downloading data for SPY...
Data fetching complete.

In [29]:
for ticker, df in processor.data.items():
    print(f"\n{ticker} data:")
    display(df.head(2))


TSLA data:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | TSLA | TSLA | TSLA | TSLA | TSLA |
| Date | | | | | |
| 2015-01-02 | 14.620667 | 14.883333 | 14.217333 | 14.858 | 71466000 |
| 2015-01-05 | 14.006 | 14.433333 | 13.810667 | 14.303333 | 80527500 |



BND data:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | BND | BND | BND | BND | BND |
| Date | | | | | |
| 2015-01-02 | 62.57312 | 62.603404 | 62.398988 | 62.40656 | 2218800 |
| 2015-01-05 | 62.754833 | 62.777545 | 62.610985 | 62.641269 | 5820100 |



SPY data:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | SPY | SPY | SPY | SPY | SPY |
| Date | | | | | |
| 2015-01-02 | 172.592865 | 173.811099 | 171.542672 | 173.391022 | 121465900 |
| 2015-01-05 | 169.475876 | 171.702279 | 169.165023 | 171.534251 | 169632600 |
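
Each frame above carries two column levels, `Price` and `Ticker`, so a column is addressed by a tuple such as `("Close", "TSLA")`. Because every frame holds a single ticker, the second level is redundant. The following is a minimal sketch of collapsing it, using a small hand-built frame (values copied from the TSLA preview) as a stand-in for an entry of `processor.data`; whether `FinancialDataProcessor` does anything similar internally is not shown in this notebook.

```python
import pandas as pd

# Hand-built stand-in for one per-ticker frame (an assumption for illustration;
# the real frames live in processor.data).
columns = pd.MultiIndex.from_product(
    [["Close", "High", "Low", "Open", "Volume"], ["TSLA"]],
    names=["Price", "Ticker"],
)
sample = pd.DataFrame(
    [
        [14.620667, 14.883333, 14.217333, 14.858, 71466000],
        [14.006, 14.433333, 13.810667, 14.303333, 80527500],
    ],
    index=pd.DatetimeIndex(["2015-01-02", "2015-01-05"], name="Date"),
    columns=columns,
)

# Drop the "Ticker" level so columns can be addressed as sample_flat["Close"].
sample_flat = sample.droplevel("Ticker", axis=1)
print(sample_flat.columns.tolist())
```

With the original two-level frame, the same column is reachable as `sample[("Close", "TSLA")]`.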


## <font color = '#DF9166' size=6>**Data Inspection**<font/><a class = 'anchor' id = 'data_inspection'/>

In [30]:
for ticker, df in processor.data.items():
    print(f"\n{ticker} data:")
    df.info()  # info() prints its report and returns None, so display() is not needed


TSLA data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2535 entries, 2015-01-02 to 2025-01-30
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   (Close, TSLA)   2535 non-null   float64
 1   (High, TSLA)    2535 non-null   float64
 2   (Low, TSLA)     2535 non-null   float64
 3   (Open, TSLA)    2535 non-null   float64
 4   (Volume, TSLA)  2535 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 118.8 KB


BND data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2535 entries, 2015-01-02 to 2025-01-30
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (Close, BND)   2535 non-null   float64
 1   (High, BND)    2535 non-null   float64
 2   (Low, BND)     2535 non-null   float64
 3   (Open, BND)    2535 non-null   float64
 4   (Volume, BND)  2535 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 118.8 KB


SPY data:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2535 entries, 2015-01-02 to 2025-01-30
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (Close, SPY)   2535 non-null   float64
 1   (High, SPY)    2535 non-null   float64
 2   (Low, SPY)     2535 non-null   float64
 3   (Open, SPY)    2535 non-null   float64
 4   (Volume, SPY)  2535 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 118.8 KB

In [31]:
processor.basic_statistics()

Generating basic statistics...

Statistics for TSLA:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | TSLA | TSLA | TSLA | TSLA | TSLA |
| count | 2535.0 | 2535.0 | 2535.0 | 2535.0 | 2535.0 |
| mean | 117.848209 | 120.474827 | 115.097514 | 117.877662 | 112030800.0 |
| std | 116.508288 | 119.236025 | 113.69893 | 116.611575 | 73875090.0 |
| min | 9.578 | 10.331333 | 9.403333 | 9.488 | 10620000.0 |
| 25% | 17.228 | 17.527667 | 16.942 | 17.259334 | 66802950.0 |
| 50% | 30.298 | 32.329334 | 29.76 | 31.299999 | 92641800.0 |
| 75% | 221.525002 | 226.696671 | 217.061661 | 222.653336 | 129428300.0 |
| max | 479.859985 | 488.540009 | 457.51001 | 475.899994 | 914082000.0 |



Statistics for BND:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | BND | BND | BND | BND | BND |
| count | 2535.0 | 2535.0 | 2535.0 | 2535.0 | 2535.0 |
| mean | 69.289452 | 69.391261 | 69.186012 | 69.293225 | 4233200.0 |
| std | 4.792145 | 4.800408 | 4.782141 | 4.793199 | 2796083.0 |
| min | 61.860889 | 61.937361 | 61.822651 | 61.891484 | 0.0 |
| 25% | 65.565277 | 65.630169 | 65.475488 | 65.552735 | 2057700.0 |
| 50% | 68.329132 | 68.457854 | 68.169747 | 68.310226 | 3805800.0 |
| 75% | 72.87994 | 72.946428 | 72.728798 | 72.845716 | 5626700.0 |
| max | 78.82328 | 78.920142 | 78.788043 | 78.884912 | 31937200.0 |



Statistics for SPY:


| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | SPY | SPY | SPY | SPY | SPY |
| count | 2535.0 | 2535.0 | 2535.0 | 2535.0 | 2535.0 |
| mean | 316.067402 | 317.72566 | 314.157889 | 316.021385 | 87146550.0 |
| std | 117.92691 | 118.48641 | 117.269902 | 117.921618 | 44872530.0 |
| min | 156.800873 | 157.864167 | 154.676912 | 156.354974 | 20270000.0 |
| 25% | 214.841972 | 215.702253 | 214.206923 | 214.972477 | 58620050.0 |
| 50% | 277.11792 | 277.919519 | 276.073612 | 277.23075 | 76428700.0 |
| 75% | 405.869156 | 409.134294 | 402.830858 | 406.100271 | 101886600.0 |
| max | 609.75 | 610.780029 | 606.799988 | 609.809998 | 507244300.0 |
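
One detail in the BND summary above is a minimum `Volume` of 0, which suggests at least one session with no recorded trades. A minimal sketch for listing such rows follows. `zero_volume_days` is a hypothetical helper, not part of `FinancialDataProcessor`, and the demo frame uses made-up values rather than fetched data.

```python
import pandas as pd


def zero_volume_days(df, volume_col="Volume"):
    """Return the rows whose volume column equals zero (hypothetical helper)."""
    return df.loc[df[volume_col] == 0]


# Hand-built demo frame with made-up values, used only to exercise the helper.
demo = pd.DataFrame(
    {"Close": [62.5, 62.7, 62.6], "Volume": [2218800, 0, 5820100]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)

suspect_days = zero_volume_days(demo)
print(suspect_days)
```

For the two-level columns used in this notebook, the column key would be a tuple such as `("Volume", "BND")`.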


In [32]:
# Check each ticker's data for missing values
processor.check_missing_values()

Checking for missing values...
TSLA has no missing values.
BND has no missing values.
SPY has no missing values.
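
The implementation of `check_missing_values` is not shown in this notebook. As a rough equivalent in plain pandas, assuming a dict of ticker to DataFrame like `processor.data`, the check could be sketched as below; `count_missing` and the demo tickers are hypothetical.

```python
import numpy as np
import pandas as pd


def count_missing(frames):
    """Map each ticker to its total number of NaN cells (hypothetical helper)."""
    return {ticker: int(df.isna().sum().sum()) for ticker, df in frames.items()}


# Made-up frames: "AAA" is complete, "BBB" has one missing close price.
frames = {
    "AAA": pd.DataFrame({"Close": [1.0, 2.0], "Volume": [10, 20]}),
    "BBB": pd.DataFrame({"Close": [1.0, np.nan], "Volume": [10, 20]}),
}

missing = count_missing(frames)
for ticker, n_missing in missing.items():
    status = "no missing values" if n_missing == 0 else f"{n_missing} missing value(s)"
    print(f"{ticker} has {status}.")
```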


## <font color = '#DF9166' size=6>**Data Preprocessing**<font/><a class = 'anchor' id = 'data_preprocessing'/>

In [None]:
# Not needed for this run: the check above found no missing values.
# processor.handle_missing_values(method='interpolate')
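
How `handle_missing_values(method='interpolate')` fills gaps is not shown here. For orientation, this is a small sketch of what interpolation looks like in plain pandas on a date-indexed series with made-up prices. It contrasts `method="linear"`, which treats rows as evenly spaced, with `method="time"`, which weights by elapsed time and so accounts for the weekend gap.

```python
import numpy as np
import pandas as pd

# Made-up prices: Friday, Monday (missing), Tuesday.
prices = pd.Series(
    [10.0, np.nan, 14.0],
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)

# "linear" ignores the index, so the gap is filled with the midpoint.
linear_fill = prices.interpolate(method="linear")

# "time" uses the DatetimeIndex: Monday is 3 of the 4 elapsed days,
# so the filled value sits three quarters of the way from 10 to 14.
time_fill = prices.interpolate(method="time")

print(linear_fill.iloc[1], time_fill.iloc[1])
```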

In [34]:
# Save the raw data (before normalization)
processor.save_cleaned_data("../data/raw/")

Saving data to files...
Data for TSLA saved to ../data/raw/TSLA.csv.
Data for BND saved to ../data/raw/BND.csv.
Data for SPY saved to ../data/raw/SPY.csv.


In [35]:
processor.normalize_data()

Normalizing data using Min-Max scaling...
Data normalization complete.
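
Min-Max scaling maps each value to `(x - min) / (max - min)`, so every column ends up in the range [0, 1]. The sketch below applies that formula per column in plain pandas. It is an illustration only: `min_max_scale` is a hypothetical helper, the values are made up, and whether `normalize_data` scales per column or treats `Volume` the same way is an assumption.

```python
import pandas as pd


def min_max_scale(df):
    """Scale each column to [0, 1] using its own min and max (hypothetical helper)."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has zero range; divide by 1 instead so it maps to 0.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range


# Made-up values, including a constant column to exercise the zero-range guard.
raw = pd.DataFrame(
    {
        "Close": [10.0, 15.0, 20.0],
        "Volume": [100.0, 300.0, 200.0],
        "Flat": [5.0, 5.0, 5.0],
    }
)

scaled = min_max_scale(raw)
print(scaled)
```

To map scaled values back to prices later, the per-column minimum and range need to be kept alongside the scaled data.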


In [36]:
# Save processed data
processor.save_cleaned_data("../data/processed/")

Saving data to files...
Data for TSLA saved to ../data/processed/TSLA.csv.
Data for BND saved to ../data/processed/BND.csv.
Data for SPY saved to ../data/processed/SPY.csv.
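
If the files are written with pandas' default `to_csv` settings (an assumption, since `save_cleaned_data` is not shown), the two column levels become two header rows followed by a row holding the index name. A sketch of reading such a file back, using an in-memory buffer in place of a real path:

```python
import io

import numpy as np
import pandas as pd

# Hand-built stand-in for one saved frame (values from the TSLA preview).
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["TSLA"]], names=["Price", "Ticker"]
)
sample = pd.DataFrame(
    [[14.620667, 71466000], [14.006, 80527500]],
    index=pd.DatetimeIndex(["2015-01-02", "2015-01-05"], name="Date"),
    columns=columns,
)

# Write to an in-memory buffer instead of ../data/processed/TSLA.csv.
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] rebuilds the two column levels; index_col=0 restores the date index.
loaded = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)
print(loaded)
```

Reading the same file without `header=[0, 1]` would treat the `Ticker` and `Date` rows as data.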
