## Data Collection

To cover the period since the onset of COVID-19, we collected data from January 1, 2020 through April 2024.

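The independent variables fall into the following categories.
1. Samsung Electronics' basic open-high-low-close (OHLC) and volume data, plus the foreign ownership limit exhaustion rate
2. Market index data
    - S&P 500: OHLC and volume
    - VIX: OHLC and volume
    - SOX: OHLC and volume
3. Technical indicators
    - Moving average (5-day, 10-day)
    - Moving standard deviation (5-day, 10-day)
    - Log return
    - Bollinger Bands (lines two standard deviations above and below the moving average)
    - ATR (Average True Range): a measure of price volatility
    - 1-month momentum: the difference between the current value and the value one month earlier
    - CCI (Commodity Channel Index): a cyclical trend oscillator
    - 3-month momentum
    - MACD: a momentum and trend indicator
    - Williams %R (Williams Percent Range): a measure of buying/selling pressure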
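
Several of the indicators listed above can be sketched with plain pandas. The block below is a minimal illustration, not the notebook's actual implementation: the helper name `add_indicators`, the window lengths (20, 14, 21, 63 trading days), and the synthetic prices are all assumptions, and the ATR here uses a simple rolling mean rather than any particular library's smoothing.

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Append a few of the listed indicators (hypothetical helper, pandas only)."""
    out = df.copy()
    close, high, low = out["Close"], out["High"], out["Low"]

    # Bollinger Bands: 20-day moving average +/- 2 rolling standard deviations
    mid = close.rolling(20).mean()
    sd = close.rolling(20).std(ddof=0)
    out["BB_upper"] = mid + 2 * sd
    out["BB_lower"] = mid - 2 * sd

    # ATR: 14-day simple average of the true range
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    out["ATR14"] = true_range.rolling(14).mean()

    # Momentum: difference from roughly 1 and 3 months ago (21 / 63 trading days)
    out["MOM1M"] = close - close.shift(21)
    out["MOM3M"] = close - close.shift(63)

    # Williams %R over 14 days, ranging from -100 to 0
    highest = high.rolling(14).max()
    lowest = low.rolling(14).min()
    out["WILLR14"] = -100 * (highest - close) / (highest - lowest)
    return out

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
close = pd.Series(50000 + rng.normal(0, 500, 100).cumsum())
demo = pd.DataFrame({"High": close + 300, "Low": close - 300, "Close": close})
features = add_indicators(demo)
```

Each indicator leaves NaN values at the start of the series until its window is full, so rows are usually dropped up to the longest window before training.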

Dependent variable to predict
- Next-day adjusted close
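
A next-day target of this kind is typically built by shifting the adjusted close up by one row. The sketch below uses made-up prices and a hypothetical `prices` frame to show the effect, including the missing target on the final row.

```python
import pandas as pd

# Made-up prices, indexed by yyyymmdd strings
prices = pd.DataFrame(
    {"Adj Close": [100.0, 102.0, 101.0, 105.0]},
    index=["20200102", "20200103", "20200106", "20200107"],
)

# Row t receives the adjusted close of row t+1
prices["next_price"] = prices["Adj Close"].shift(-1)

# The last row has no following day, so its target is NaN and is dropped
train = prices.dropna(subset=["next_price"])
print(train)
```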



In [67]:
import requests
import pandas as pd
import numpy as np

# Date range passed to the Naver API as yyyymmddHHMM strings
start = "202001010000"
end = "202404301044"

# S&P 500 daily OHLC and volume from the Naver stock API
sp_data = pd.DataFrame(requests.get(f'https://api.stock.naver.com/chart/foreign/index/.INX/day?startDateTime={start}&endDateTime={end}').json())

# Adjusted close for the same period from the Yahoo Finance chart API
snp_url = 'https://query1.finance.yahoo.com/v8/finance/chart/%5EGSPC?events=capitalGain%7Cdiv%7Csplit&formatted=true&includeAdjustedClose=true&interval=1d&period1=1577836800&period2=1714521600&symbol=%5EGSPC&userYfid=true&lang=en-US&region=US'
header = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"}
snp_adc = requests.get(snp_url, headers=header).json()['chart']['result'][0]['indicators']['adjclose'][0]['adjclose']
# Assumes both sources return the same number of rows in the same order
sp_data['adj_close'] = np.array(snp_adc)

samsung = pd.read_csv("./data/samsung.csv")
vix = pd.read_csv('./data/vix.csv')

# Unify the date format (yyyymmdd strings) and the column names
samsung['Date'] = samsung['Date'].str.replace('-', '')
vix['Date'] = vix['Date'].str.replace('-', '')
sp_data.columns = ['Date', 'Close', 'Open', 'High', 'Low', 'Volume', 'Adj Close']
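
The `period1` and `period2` query parameters in the Yahoo Finance URL above are Unix timestamps in seconds. As a small sketch, the hypothetical helper `to_epoch` below shows how the two values in the URL correspond to midnight UTC on 2020-01-01 and 2024-05-01.

```python
from datetime import datetime, timezone

def to_epoch(year: int, month: int, day: int) -> int:
    """Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

period1 = to_epoch(2020, 1, 1)
period2 = to_epoch(2024, 5, 1)
print(period1, period2)  # 1577836800 1714521600
```

Deriving the timestamps this way keeps the Yahoo range in step with the `start` and `end` strings if the collection window changes.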


In [68]:
print(f"Samsung Electronics rows: {len(samsung)}")
print(f"VIX rows: {len(vix)}")
print(f"S&P 500 rows: {len(sp_data)}")

Samsung Electronics rows: 1067
VIX rows: 1129
S&P 500 rows: 1089


The datasets cover the same date range but have different lengths.
- **Each dataset is indexed by date, so some dates appear in one dataset but not in the others.**
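
One way to resolve such a mismatch is to keep only the dates that every dataset shares. The sketch below uses two tiny made-up frames (`kr` and `us` are hypothetical names and the values are not real prices) to show the idea.

```python
import pandas as pd

# Made-up data: each market is missing one date the other has
kr = pd.DataFrame({"Date": ["20200102", "20200103", "20200106"], "Close": [1.0, 2.0, 3.0]})
us = pd.DataFrame({"Date": ["20200102", "20200103", "20200107"], "Close": [10.0, 20.0, 30.0]})

# Dates present in both frames
common = set(kr["Date"]) & set(us["Date"])

# Filter each frame to the shared dates and index by date
kr_aligned = kr[kr["Date"].isin(common)].set_index("Date")
us_aligned = us[us["Date"].isin(common)].set_index("Date")
print(kr_aligned.index.equals(us_aligned.index))
```

After this step the frames have identical date indexes, so their columns can be combined row by row.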

In [69]:
# Keep only the dates present in all three datasets
idx = np.intersect1d(np.intersect1d(sp_data["Date"], samsung['Date']), vix['Date'])
samsung = samsung[samsung['Date'].isin(idx)].reset_index(drop=True).set_index("Date")
vix = vix[vix['Date'].isin(idx)].reset_index(drop=True).set_index("Date")
sp_data = sp_data[sp_data['Date'].isin(idx)].reset_index(drop=True).set_index("Date")

In [65]:
import talib

# Dependent variable: next-day adjusted close
samsung["next_price"] = samsung["Adj Close"].shift(-1)
# Same-day open-to-close return (Close / Open - 1)
samsung["next_rtn"] = samsung["Close"]/samsung["Open"] - 1
# Log return of the adjusted close
samsung["log_return"] = np.log(1 + samsung["Adj Close"].pct_change())
# CCI (Commodity Channel Index), 14-day
samsung["CCI"] = talib.CCI(samsung['High'],samsung["Low"],samsung["Adj Close"], timeperiod=14)
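
For readers without TA-Lib, the textbook CCI formula can be sketched in pandas. This is an illustration only: `cci_sketch` is a hypothetical helper, it assumes the usual typical price (H + L + C) / 3 and the conventional 0.015 constant, and it is not claimed to reproduce `talib.CCI` digit for digit.

```python
import numpy as np
import pandas as pd

def cci_sketch(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Textbook CCI: (TP - SMA(TP)) / (0.015 * mean absolute deviation of TP)."""
    tp = (high + low + close) / 3
    sma = tp.rolling(period).mean()
    # Mean absolute deviation of the typical price within each window
    mad = tp.rolling(period).apply(lambda w: np.abs(w - w.mean()).mean(), raw=True)
    return (tp - sma) / (0.015 * mad)

# Made-up linear prices: typical price equals Close because High/Low are symmetric
close = pd.Series(np.arange(1.0, 11.0))
values = cci_sketch(close + 1, close - 1, close, period=3)
print(values)
```

The first `period - 1` rows are NaN because the window is not yet full, which is why the early rows of a 14-day CCI column are empty.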

In [87]:
# Moving averages (5-day, 10-day)
samsung["MA5"] = talib.SMA(samsung['Close'], timeperiod=5)
samsung["MA10"] = talib.SMA(samsung['Close'], timeperiod=10)

In [90]:
# 5-day and 10-day rolling averages of the 5-day moving standard deviation
samsung["RASD5"] = talib.SMA(talib.STDDEV(samsung['Close'],timeperiod=5, nbdev=1), timeperiod=5)
samsung["RASD10"] = talib.SMA(talib.STDDEV(samsung['Close'], timeperiod=5, nbdev=1), timeperiod=10)
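
The two-step computation above (a 5-day standard deviation, then a 5-day or 10-day average of it) can be sketched in pandas as follows. The prices are made up, and the use of a population standard deviation (`ddof=0`) is an assumption about how TA-Lib's `STDDEV` behaves rather than something the notebook states.

```python
import pandas as pd

# Made-up closing prices
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 13.0, 15.0,
                   16.0, 15.0, 17.0, 18.0, 17.0, 19.0, 20.0])

# Step 1: 5-day moving standard deviation (population form, assumed)
std5 = close.rolling(5).std(ddof=0)

# Step 2: rolling averages of that series over 5 and 10 days
rasd5 = std5.rolling(5).mean()
rasd10 = std5.rolling(10).mean()
print(rasd5.isna().sum(), rasd10.isna().sum())  # 8 13
```

Because the windows are stacked, the warm-up period is the sum of both windows minus two: 8 rows for the 5-day average and 13 rows for the 10-day average.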

In [66]:
samsung

| Date | Open | High | Low | Close | Adj Close | Volume | next_price | next_rtn | log_return | CCI |
|---|---|---|---|---|---|---|---|---|---|---|
| 20200102 | 55500.0 | 56000.0 | 55000.0 | 55200.0 | 49542.542969 | 12993228 | 49811.792969 | -0.005405 | NaN | NaN |
| 20200103 | 56000.0 | 56600.0 | 54900.0 | 55500.0 | 49811.792969 | 15422255 | 49811.792969 | -0.008929 | 0.005420 | NaN |
| 20200106 | 54900.0 | 55600.0 | 54600.0 | 55500.0 | 49811.792969 | 10278951 | 50081.042969 | 0.010929 | 0.000000 | NaN |
| 20200107 | 55700.0 | 56400.0 | 55600.0 | 55800.0 | 50081.042969 | 10009778 | 50978.546875 | 0.001795 | 0.005391 | NaN |
| 20200108 | 56200.0 | 57400.0 | 55900.0 | 56800.0 | 50978.546875 | 23501171 | 52594.066406 | 0.010676 | 0.017762 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20240424 | 77500.0 | 78800.0 | 77200.0 | 78600.0 | 78600.000000 | 22166150 | 76300.000000 | 0.014194 | 0.040239 | -65.373215 |
| 20240425 | 77300.0 | 77500.0 | 76300.0 | 76300.0 | 76300.000000 | 15549134 | 76700.000000 | -0.012937 | -0.029699 | -87.438250 |
| 20240426 | 77800.0 | 77900.0 | 76500.0 | 76700.0 | 76700.000000 | 12755629 | 76700.000000 | -0.014139 | 0.005229 | -71.185461 |
| 20240429 | 77400.0 | 77600.0 | 76200.0 | 76700.0 | 76700.000000 | 14664474 | 77500.000000 | -0.009044 | 0.000000 | -68.930609 |


In [64]:
samsung

| Date | Close | Open | High | Low | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 20200102 | 3257.85 | 3244.67 | 3258.14 | 3235.53 | 1827686 | 3257.850098 |
| 20200103 | 3234.85 | 3226.36 | 3246.15 | 3222.34 | 1733948 | 3234.850098 |
| 20200106 | 3246.28 | 3217.55 | 3246.84 | 3214.64 | 1872803 | 3246.280029 |
| 20200107 | 3237.18 | 3241.86 | 3244.91 | 3232.43 | 1892856 | 3237.179932 |
| 20200108 | 3253.05 | 3238.59 | 3267.07 | 3236.67 | 1956337 | 3253.050049 |
| ... | ... | ... | ... | ... | ... | ... |
| 20240424 | 5071.63 | 5084.86 | 5089.48 | 5047.02 | 2523336 | 5071.629883 |
| 20240425 | 5048.42 | 5019.88 | 5057.75 | 4990.58 | 2691434 | 5048.419922 |
| 20240426 | 5099.96 | 5084.65 | 5114.62 | 5073.14 | 2401044 | 5099.959961 |
| 20240429 | 5116.17 | 5114.13 | 5123.49 | 5088.65 | 2337163 | 5116.169922 |


In [56]:
sp_data

| | localDate | closePrice | openPrice | highPrice | lowPrice | accumulatedTradingVolume | adj_close |
|---|---|---|---|---|---|---|---|
| 0 | 20200102 | 3257.85 | 3244.67 | 3258.14 | 3235.53 | 1827686 | 3257.850098 |
| 1 | 20200103 | 3234.85 | 3226.36 | 3246.15 | 3222.34 | 1733948 | 3234.850098 |
| 2 | 20200106 | 3246.28 | 3217.55 | 3246.84 | 3214.64 | 1872803 | 3246.280029 |
| 3 | 20200107 | 3237.18 | 3241.86 | 3244.91 | 3232.43 | 1892856 | 3237.179932 |
| 4 | 20200108 | 3253.05 | 3238.59 | 3267.07 | 3236.67 | 1956337 | 3253.050049 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1084 | 20240424 | 5071.63 | 5084.86 | 5089.48 | 5047.02 | 2523336 | 5071.629883 |
| 1085 | 20240425 | 5048.42 | 5019.88 | 5057.75 | 4990.58 | 2691434 | 5048.419922 |
| 1086 | 20240426 | 5099.96 | 5084.65 | 5114.62 | 5073.14 | 2401044 | 5099.959961 |
| 1087 | 20240429 | 5116.17 | 5114.13 | 5123.49 | 5088.65 | 2337163 | 5116.169922 |



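In [None]:
# Same features for a generic OHLC DataFrame `df`
# (assumes `df` is defined with the same columns as `samsung`, and that numpy and talib are imported)
df['next_price'] = df['Adj Close'].shift(-1)
df['next_rtn'] = df['Close'] / df['Open'] - 1
df['log_return'] = np.log(1 + df['Adj Close'].pct_change())
df['CCI'] = talib.CCI(df['High'], df['Low'], df['Adj Close'], timeperiod=14)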