# S&P500 Dataset Cleaning



Link to dataset: https://www.kaggle.com/datasets/awadhi123/finance-data-sp-500

Description: The dataset contains daily candlestick data (open, high, low, close, adjusted close, and volume) for the S&P 500 index from 2 Jan 2000 to 10 Jul 2020.

__Note:__
* I kept the "Volume" column since it might be a relevant feature.
* 'up/down' column: 1 means the day's closing price is higher than its opening price; -1 means it is lower; 0 means the two are equal.
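
The 'up/down' rule described in the note can also be expressed without a row-by-row loop. The following is a minimal sketch on a toy frame; the prices are made-up values, not rows from the dataset, and the use of `np.sign` is one possible approach rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0],
    "Close": [102.0, 101.0, 101.0],
})

# sign(Close - Open): 1 if the close is above the open, -1 if below, 0 if equal
toy["up/down"] = np.sign(toy["Close"] - toy["Open"]).astype(int)
print(toy["up/down"].tolist())  # [1, -1, 0]
```

Because the subtraction and `np.sign` operate on whole columns, no per-row assignment is needed.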

In [13]:
import pandas as pd
df = pd.read_csv("SP500.csv")   #import the data and read as df

In [14]:
df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-03 | 1469.25 | 1478.0 | 1438.359985 | 1455.219971 | 1455.219971 | 931800000 |
| 1 | 2000-01-04 | 1455.219971 | 1455.219971 | 1397.430054 | 1399.420044 | 1399.420044 | 1009000000 |
| 2 | 2000-01-05 | 1399.420044 | 1413.27002 | 1377.680054 | 1402.109985 | 1402.109985 | 1085500000 |
| 3 | 2000-01-06 | 1402.109985 | 1411.900024 | 1392.099976 | 1403.449951 | 1403.449951 | 1092300000 |
| 4 | 2000-01-07 | 1403.449951 | 1441.469971 | 1400.72998 | 1441.469971 | 1441.469971 | 1225200000 |


The adjusted close price is not relevant here, so I dropped it:

In [15]:
df = df.drop(['Adj Close'], axis = 1)

In [16]:
print("The first workday of 2018 is " + df["Date"][4528])
print("The last workday of 2018 is " + df["Date"][4778])

The first workday of 2018 is 2018-01-02
The last workday of 2018 is 2018-12-31


I dropped all observations outside this range and reset the index.

In [17]:
# keep only the 2018 observations (rows 4528 to 4778 inclusive);
# reset_index() also stores the old row labels in a new 'index' column
df = df[4528:4779].reset_index()

In [18]:
#check the range of dates
df["Date"]

0      2018-01-02
1      2018-01-03
2      2018-01-04
3      2018-01-05
4      2018-01-08
          ...    
246    2018-12-24
247    2018-12-26
248    2018-12-27
249    2018-12-28
250    2018-12-31
Name: Date, Length: 251, dtype: object
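
The 2018 rows can also be selected by date instead of by hard-coded positions, which avoids looking up the boundary row numbers by hand. This is a minimal sketch on a toy frame with made-up prices; `toy` and `toy_2018` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame with a few dates around the 2018 boundaries (hypothetical prices)
toy = pd.DataFrame({
    "Date": ["2017-12-29", "2018-01-02", "2018-12-31", "2019-01-02"],
    "Close": [1.0, 2.0, 3.0, 4.0],
})

# Parse the strings as dates, then keep only rows whose year is 2018
dates = pd.to_datetime(toy["Date"])
toy_2018 = toy[dates.dt.year == 2018].reset_index(drop=True)
print(toy_2018["Date"].tolist())  # ['2018-01-02', '2018-12-31']
```

Passing `drop=True` to `reset_index` discards the old row labels rather than keeping them as an extra `index` column.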

In [19]:
# new column 'up/down': 1 if the closing price is higher than the opening
# price, -1 if it is lower, 0 otherwise
df['up/down'] = 0
for i in range(df.shape[0]):
  # assign with .loc to avoid chained indexing (SettingWithCopyWarning)
  if df['Open'][i] > df['Close'][i]:
    df.loc[i, 'up/down'] = -1
  elif df['Open'][i] < df['Close'][i]:
    df.loc[i, 'up/down'] = 1


In [20]:
# new column for the true values: 1 if the next day's closing price
# is higher than today's closing price, -1 if it is lower, 0 otherwise
# (the last row has no next day and stays 0)
# 1 = buy, -1 = sell
df['true_value'] = 0
for i in range(df.shape[0] - 1):
  # assign with .loc to avoid chained indexing (SettingWithCopyWarning)
  if df['Close'][i] < df['Close'][i+1]:
    df.loc[i, 'true_value'] = 1
  elif df['Close'][i] > df['Close'][i+1]:
    df.loc[i, 'true_value'] = -1
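
The loop above compares each day's close with the next day's close. The same label can be sketched with `Series.shift`, which lines up each row with the following one. This is an illustrative alternative on made-up prices, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy closing prices (hypothetical values)
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 11.0]})

# shift(-1) aligns each row with the next day's close; the last row gets NaN
next_close = toy["Close"].shift(-1)

# sign(next - today): 1 if the next close is higher, -1 if lower, 0 if equal;
# the NaN in the last row is filled with 0, matching the loop's default value
toy["true_value"] = np.sign(next_close - toy["Close"]).fillna(0).astype(int)
print(toy["true_value"].tolist())  # [1, -1, 0, 0]
```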


In [21]:
df['Close'][3]

2743.149902

In [22]:
from sklearn.preprocessing import StandardScaler

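# standardize every non-date column to zero mean and unit variance using sklearn
# (this also rescales the 'index', 'up/down' and 'true_value' columns)
normalizer = StandardScaler()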
df_dropped = df.drop('Date', axis = 1)
normalized_df = pd.DataFrame(normalizer.fit_transform(df_dropped), columns = df_dropped.columns)
normalized_df.insert(loc = 0, column = 'Date', value = df['Date'])
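
Since the scaler in this cell is fitted on every non-date column, the label columns are rescaled along with the prices. If the labels are meant to keep their -1/0/1 values, one option is to scale only the feature columns. The sketch below assumes that goal and uses plain pandas with the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows; the toy data and the `feature_cols` list are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Open":       [1.0, 2.0, 3.0],
    "Close":      [2.0, 4.0, 6.0],
    "true_value": [1, -1, 0],
})

feature_cols = ["Open", "Close"]  # hypothetical choice of columns to scale

# z-score each feature column; the label column is left untouched
scaled = toy.copy()
features = toy[feature_cols]
scaled[feature_cols] = (features - features.mean()) / features.std(ddof=0)
print(scaled)
```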

In [23]:
normalized_df

| | Date | index | Open | High | Low | Close | Volume | up/down | true_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-02 | -1.725164 | -0.648749 | -0.717987 | -0.450101 | -0.502990 | -0.356569 | 0.996024 | 0.955190 |
| 1 | 2018-01-03 | -1.711363 | -0.506020 | -0.519223 | -0.305372 | -0.330849 | -0.107277 | 0.996024 | 0.955190 |
| 2 | 2018-01-04 | -1.697561 | -0.289100 | -0.358751 | -0.105322 | -0.221778 | 0.120476 | 0.996024 | 0.955190 |
| 3 | 2018-01-05 | -1.683760 | -0.167600 | -0.206454 | -0.022204 | -0.030579 | -0.546551 | 0.996024 | 0.955190 |
| 4 | 2018-01-08 | -1.669959 | -0.052975 | -0.152031 | 0.068712 | 0.014927 | -0.537782 | 0.996024 | 0.955190 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 246 | 2018-12-24 | 1.669959 | -3.511073 | -3.789219 | -3.561289 | -3.942897 | -1.452166 | -1.003992 | 0.955190 |
| 247 | 2018-12-26 | 1.683760 | -3.889522 | -3.171639 | -3.603741 | -2.779331 | 0.903983 | 0.996024 | 0.955190 |
| 248 | 2018-12-27 | 1.697561 | -3.087138 | -2.942115 | -3.121370 | -2.568471 | 0.704183 | 0.996024 | -1.051109 |
| 249 | 2018-12-28 | 1.711363 | -2.518352 | -2.606867 | -2.417442 | -2.599307 | 0.131180 | -1.003992 | 0.955190 |


In [24]:
# save file as csv
normalized_df.to_csv('SP500_processed.csv')
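
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference with `index=False` on a toy frame, using an in-memory buffer instead of a file; whether to drop the index here depends on how the processed file is consumed downstream.

```python
import io
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Date": ["2018-01-02"], "Close": [1.0]})

# Default: the row index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False: only the actual columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'Date', 'Close']
print(list(without_index.columns))  # ['Date', 'Close']
```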