# Data Cleaning and Understanding for Financial Time Series Data
This notebook covers basic data cleaning steps, including:
1. Calculating summary statistics for understanding data distribution
2. Checking data types and handling missing values
3. Normalizing or scaling the data for further analysis or machine learning models


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler, StandardScaler


## Step 1: Load Data for Tesla, BND, and SPY


In [3]:
# Define tickers and download data
tickers = ['TSLA', 'BND', 'SPY']
start_date = '2015-01-01'
end_date = '2024-10-31'  # yfinance treats the end date as exclusive

data = yf.download(tickers, start=start_date, end=end_date, group_by='ticker')

# Separate data for each asset
tesla_data = data['TSLA']
bond_data = data['BND']
spy_data = data['SPY']

# Display first few rows of Tesla data
tesla_data.head()

[*********************100%***********************]  3 of 3 completed


Date,Open,High,Low,Close,Adj Close,Volume
2015-01-02 00:00:00+00:00,14.858,14.883333,14.217333,14.620667,14.620667,71466000
2015-01-05 00:00:00+00:00,14.303333,14.433333,13.810667,14.006,14.006,80527500
2015-01-06 00:00:00+00:00,14.004,14.28,13.614,14.085333,14.085333,93928500
2015-01-07 00:00:00+00:00,14.223333,14.318667,13.985333,14.063333,14.063333,44526000
2015-01-08 00:00:00+00:00,14.187333,14.253333,14.000667,14.041333,14.041333,51637500
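
The per-ticker selection above relies on the two-level column layout that `group_by='ticker'` produces. The sketch below rebuilds that layout on a tiny synthetic frame (the numbers are made up, not market data) and shows one possible way to take an independent copy of each ticker's columns, so that later modifications do not touch the combined frame. The helper name `split_by_ticker` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded frame: columns are (ticker, field) pairs.
dates = pd.date_range("2015-01-02", periods=3, freq="B")
columns = pd.MultiIndex.from_product([["BND", "TSLA"], ["Close", "Volume"]])
values = np.arange(1, 13, dtype=float).reshape(3, 4)
data = pd.DataFrame(values, index=dates, columns=columns)

def split_by_ticker(frame):
    """Return {ticker: independent copy of that ticker's columns} (hypothetical helper)."""
    tickers = frame.columns.get_level_values(0).unique()
    return {ticker: frame[ticker].copy() for ticker in tickers}

frames = split_by_ticker(data)

# Editing the copy leaves the combined frame untouched.
frames["TSLA"].loc[:, "Close"] = 0.0
print(frames["TSLA"])
print(data["TSLA"])
```

Working on explicit copies is a design choice: it costs some memory but makes it unambiguous which object an assignment modifies.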


## Step 2: Calculate Summary Statistics


In [4]:
# Summary statistics for each asset
print("Tesla Summary Statistics:")
print(tesla_data.describe())

print("\nBond ETF (BND) Summary Statistics:")
print(bond_data.describe())

print("\nS&P 500 ETF (SPY) Summary Statistics:")
print(spy_data.describe())

Tesla Summary Statistics:
Price         Open         High          Low        Close    Adj Close  \
count  2474.000000  2474.000000  2474.000000  2474.000000  2474.000000   
mean    111.461872   113.895836   108.869421   111.438965   111.438965   
std     110.208156   112.643277   107.541830   110.120450   110.120450   
min       9.488000    10.331333     9.403333     9.578000     9.578000   
25%      17.058499    17.368167    16.790167    17.066167    17.066167   
50%      24.986667    25.279000    24.462334    25.043000    25.043000   
75%     217.264999   221.910004   212.084999   216.865002   216.865002   
max     411.470001   414.496674   405.666656   409.970001   409.970001   

Price        Volume  
count  2.474000e+03  
mean   1.125745e+08  
std    7.449619e+07  
min    1.062000e+07  
25%    6.682590e+07  
50%    9.289395e+07  
75%    1.301899e+08  
max    9.140820e+08  

Bond ETF (BND) Summary Statistics:
Price         Open         High          Low        Close    Adj Close  \

## Step 3: Data Types and Missing Values


In [6]:
# Check data types for each asset
print("Tesla Data Types:")
print(tesla_data.dtypes)

print("\nBond ETF (BND) Data Types:")
print(bond_data.dtypes)

print("\nS&P 500 ETF (SPY) Data Types:")
print(spy_data.dtypes)

# Check for missing values
print("\nMissing Values in Tesla Data:")
print(tesla_data.isnull().sum())

print("\nMissing Values in Bond ETF (BND) Data:")
print(bond_data.isnull().sum())

print("\nMissing Values in S&P 500 ETF (SPY) Data:")
print(spy_data.isnull().sum())


Tesla Data Types:
Price
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

Bond ETF (BND) Data Types:
Price
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

S&P 500 ETF (SPY) Data Types:
Price
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

Missing Values in Tesla Data:
Price
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

Missing Values in Bond ETF (BND) Data:
Price
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

Missing Values in S&P 500 ETF (SPY) Data:
Price
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64
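
The three assets are checked with identical calls, so the same report could also be produced in a loop. The sketch below uses a hypothetical helper, `quality_report`, on small synthetic frames (not the downloaded data) to illustrate the idea.

```python
import numpy as np
import pandas as pd

def quality_report(frames):
    """Summarize dtype and missing-value count per column for each named frame."""
    rows = []
    for name, frame in frames.items():
        for column in frame.columns:
            rows.append({
                "asset": name,
                "column": column,
                "dtype": str(frame[column].dtype),
                "missing": int(frame[column].isnull().sum()),
            })
    return pd.DataFrame(rows)

# Synthetic example: one clean frame and one with a single gap in 'Close'.
clean = pd.DataFrame({"Close": [1.0, 2.0, 3.0], "Volume": [10, 20, 30]})
gappy = pd.DataFrame({"Close": [1.0, np.nan, 3.0], "Volume": [10, 20, 30]})

report = quality_report({"clean": clean, "gappy": gappy})
print(report)
```

With the notebook's data, the call would presumably look like `quality_report({"TSLA": tesla_data, "BND": bond_data, "SPY": spy_data})`.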


### Handling Missing Values
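The checks above show that none of the three assets has missing values. For completeness, we still apply a forward fill, which replaces any gap with the last observed value; on this data it changes nothing.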
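
To make the behavior of forward fill concrete, here is a minimal sketch on a synthetic price series (the values are invented for illustration). It also shows one limitation: a gap at the very start of a series has no earlier value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with gaps in the middle and at the end.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
prices = pd.Series([10.0, np.nan, np.nan, 12.0, np.nan], index=dates, name="Close")

# Each NaN takes the most recent non-missing value before it.
filled = prices.ffill()
print(filled)

# A leading NaN has no earlier value to copy, so ffill leaves it missing.
leading_gap = pd.Series([np.nan, 5.0, np.nan]).ffill()
print(leading_gap)
```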

In [7]:
# Fill missing values using forward fill.
# Reassigning the result avoids modifying a slice of `data` in place
# (which triggers SettingWithCopyWarning) and replaces the deprecated
# fillna(method='ffill') call.
tesla_data = tesla_data.ffill()
bond_data = bond_data.ffill()
spy_data = spy_data.ffill()


## Step 4: Data Normalization and Scaling


In [8]:
# Initialize the scaler; it is refit for each asset, so every series
# is scaled to its own [0, 1] range
scaler = MinMaxScaler()

# Normalize the 'Close' prices for each asset.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
# when the frames are slices of `data`.
tesla_data = tesla_data.assign(Close=scaler.fit_transform(tesla_data[['Close']]).ravel())
bond_data = bond_data.assign(Close=scaler.fit_transform(bond_data[['Close']]).ravel())
spy_data = spy_data.assign(Close=scaler.fit_transform(spy_data[['Close']]).ravel())

# Display normalized Tesla 'Close' prices
tesla_data[['Close']].head()


Date,Close
2015-01-02 00:00:00+00:00,0.012594
2015-01-05 00:00:00+00:00,0.011059
2015-01-06 00:00:00+00:00,0.011257
2015-01-07 00:00:00+00:00,0.011202
2015-01-08 00:00:00+00:00,0.011147
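
The normalized values above follow the min-max formula `(x - min) / (max - min)`. The sketch below reproduces them by hand from the first five Tesla closes and the minimum and maximum reported in the summary statistics. It also includes a z-score function as a manual counterpart to `StandardScaler`, which the notebook imports but does not use. The function names `min_max_scale` and `z_score` are illustrative and not part of the notebook.

```python
import numpy as np

def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1] using (x - lo) / (hi - lo)."""
    values = np.asarray(values, dtype=float)
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    return (values - lo) / (hi - lo)

def z_score(values):
    """Center to mean 0 and scale to unit population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# First five Tesla closes, scaled with the full-sample min and max
# taken from the summary statistics.
closes = [14.620667, 14.006, 14.085333, 14.063333, 14.041333]
scaled = min_max_scale(closes, lo=9.578, hi=409.970001)
print(np.round(scaled, 6))

# z-scores of a toy series have mean 0 and standard deviation 1.
standardized = z_score([1.0, 2.0, 3.0])
print(standardized)
```

Min-max scaling is sensitive to the extremes of the sample: a new price above the historical maximum would map to a value greater than 1.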


### Summary
We loaded the data for TSLA, BND, and SPY, calculated basic summary statistics, confirmed that there are no missing values (forward fill was applied as a safeguard), and min-max normalized the 'Close' prices of each asset to prepare the data for further analysis and modeling.