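### Goal
1. Build a training DataFrame from the TA-Lib indicator data prepared earlier (MACD, Bollinger Bands) and save it as a CSV file
    - For now, use the Bollinger PCA, macd, macdsignal, and macd_hist columns as features, and simplify the answer into a single Target column for a first test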

### 1. Load the base data and build the Target value

In [1]:
import numpy as np
import pandas as pd
import FinanceDataReader as fdr
import talib as ta

In [97]:
# Stock code and date range to load
org = fdr.DataReader('035720','2010-01-01', '2020-10-28')

# Lowercase all column names for use with TA-Lib (its abstract API looks columns up by lowercase names such as 'close')
org.columns = list(map(lambda x : x.lower(), org.columns))
org

Date,open,high,low,close,volume,change
2010-01-04,70300,74200,69100,73400,158976,0.044097
2010-01-05,73000,75300,72200,74000,124156,0.008174
2010-01-06,74600,75200,73000,74300,72453,0.004054
2010-01-07,74300,74800,72100,73400,99241,-0.012113
2010-01-08,73500,73500,70100,72900,114818,-0.006812
...,...,...,...,...,...,...
2020-10-22,351500,352000,345500,349000,376380,-0.012730
2020-10-23,348500,350000,340000,340000,622592,-0.025788
2020-10-26,337000,341000,328000,329500,714722,-0.030882
2020-10-27,324500,343500,324500,334000,812610,0.013657


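- The target label (the answer) is arbitrarily defined as the change (%) in the closing price about one month later.
- The number of trading days differs from month to month, so the shift is arbitrarily fixed at 25 rows.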

In [98]:
# Add a column with the closing price about one month later
org['1month_close'] = org['close'].shift(-25)
# Add a column with the change relative to the closing price one month later
org['1month_change'] = (org['1month_close']-org['close'])/org['close']
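
To make the effect of the negative shift concrete, here is a minimal sketch on a toy series. The prices and the `horizon` of 2 are made-up values chosen for readability (the notebook uses 25); the point is only that row i ends up holding the return from row i to row i + horizon, and that the last `horizon` rows have no future price and become NaN.

```python
import pandas as pd

# Toy closing prices on consecutive rows (hypothetical values, not real quotes).
close = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])

horizon = 2  # the notebook uses 25; a small value keeps the toy readable

# After shift(-horizon), row i holds the close from row i + horizon.
future_close = close.shift(-horizon)
forward_change = (future_close - close) / close

print(forward_change)
```

The trailing NaN rows matter later: any labeling step should decide explicitly what to do with them.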

In [99]:
org.sort_values(by='1month_change', ascending=False)[:60]

Date,open,high,low,close,volume,change,1month_close,1month_change
2020-04-14,159500,160500,158000,159500,430759,0.012698,268000.0,0.680251
2014-05-20,72200,72200,70400,71700,57662,-0.001393,118300.0,0.64993
2020-04-16,159000,167000,158500,165500,1005533,0.037618,270000.0,0.63142
2014-05-19,72800,73000,71500,71800,46998,-0.009655,116400.0,0.62117
2014-05-21,71400,72600,71400,71600,30891,-0.001395,115800.0,0.617318
2014-05-22,71700,73400,71600,73200,59556,0.022346,117000.0,0.598361
2014-05-15,73000,73400,72300,73000,59422,0.002747,116400.0,0.594521
2014-05-16,72700,73100,71800,72500,87198,-0.006849,114200.0,0.575172
2020-04-13,161000,161000,157000,157500,470251,-0.021739,247000.0,0.568254
2014-05-23,75400,79500,75300,78100,467873,0.06694,120500.0,0.542894


In [100]:
# Allocate storage for the labels
label = np.zeros(org.shape[0])
# Convert to a Series aligned with the DataFrame index
label_ser = pd.Series(label, index=org.index)

In [101]:
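# Map one-month return ranges to category scores
# Note: rows whose 1month_change is NaN (the last 25 rows) match no condition and keep the default 0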
label_ser[org[org['1month_change']>0.2].index] = 5
label_ser[org[(org['1month_change']>0.1)&(org['1month_change']<=0.2)].index] = 3
label_ser[org[(org['1month_change']>0.03)&(org['1month_change']<=0.1)].index] = 1
label_ser[org[(org['1month_change']>=0.01)&(org['1month_change']<=0.03)].index] = 0
label_ser[org[(org['1month_change']<0.01)&(org['1month_change']>= -0.05)].index] = -3
label_ser[org[(org['1month_change']< -0.05)&(org['1month_change']>= -0.1)].index] = -5
label_ser[org[org['1month_change']< -0.1].index] = -10
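
The same thresholds can also be written as a single function. The sketch below uses `np.select` with the bin edges from the cell above; the helper name `label_forward_return` is hypothetical, and unlike the cell above it deliberately leaves NaN returns as NaN instead of 0, which is an assumption about the desired behavior rather than what the notebook does.

```python
import numpy as np
import pandas as pd

def label_forward_return(change: pd.Series) -> pd.Series:
    """Hypothetical helper: map forward returns to the score bins used above."""
    conditions = [
        change > 0.2,
        (change > 0.1) & (change <= 0.2),
        (change > 0.03) & (change <= 0.1),
        (change >= 0.01) & (change <= 0.03),
        (change < 0.01) & (change >= -0.05),
        (change < -0.05) & (change >= -0.1),
        change < -0.1,
    ]
    scores = [5.0, 3.0, 1.0, 0.0, -3.0, -5.0, -10.0]
    # Rows matching no condition (NaN returns) fall through to the default.
    labels = np.select(conditions, scores, default=np.nan)
    return pd.Series(labels, index=change.index)

# Toy returns, one per bin, plus a missing value.
toy = pd.Series([0.25, 0.15, 0.05, 0.02, 0.0, -0.07, -0.2, np.nan])
toy_labels = label_forward_return(toy)
print(toy_labels.tolist())
```

Keeping the unlabeled rows as NaN makes it possible to drop them before training, so they are not counted in the 0 class.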

In [102]:
# Attach the labels to the DataFrame
org['TARGET'] = label_ser

In [103]:
label_ser.value_counts()

-3.0     602
 1.0     576
-5.0     408
-10.0    333
 3.0     308
 0.0     224
 5.0     219
dtype: int64

In [104]:
# Practice: split the returns into 7 equal-frequency bins with qcut
org['_qcut']=pd.qcut(org['1month_change'], q=7, labels=[-10, -5, -3, 0, 1,3,5])
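
Unlike the fixed thresholds used for TARGET, `pd.qcut` picks the bin edges from the data so that each bin holds roughly the same number of rows. A minimal sketch on toy values (the numbers and the four labels are made up for illustration):

```python
import numpy as np
import pandas as pd

# Eight distinct toy values plus one missing value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, np.nan])

# Four quantile-based bins, each given a score label.
quartile = pd.qcut(values, q=4, labels=[-2, -1, 1, 2])

# Each label should appear equally often; the NaN row stays unlabeled.
counts = quartile.value_counts()
print(counts)
```

This is why the `_qcut` column and the `TARGET` column disagree on some rows in the output below: the two columns use different bin edges.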

In [105]:
org[:30]

Date,open,high,low,close,volume,change,1month_close,1month_change,TARGET,_qcut
2010-01-04,70300,74200,69100,73400,158976,0.044097,66500.0,-0.094005,-5.0,-10
2010-01-05,73000,75300,72200,74000,124156,0.008174,67100.0,-0.093243,-5.0,-10
2010-01-06,74600,75200,73000,74300,72453,0.004054,69200.0,-0.068641,-5.0,-5
2010-01-07,74300,74800,72100,73400,99241,-0.012113,71500.0,-0.025886,-3.0,-3
2010-01-08,73500,73500,70100,72900,114818,-0.006812,71900.0,-0.013717,-3.0,-3
2010-01-11,73200,73400,70900,72000,125740,-0.012346,71800.0,-0.002778,-3.0,0
2010-01-12,71300,72500,70500,72500,98142,0.006944,72900.0,0.005517,-3.0,0
2010-01-13,72000,72300,70600,71800,70397,-0.009655,72500.0,0.009749,-3.0,0
2010-01-14,72000,73900,70700,71100,132968,-0.009749,70800.0,-0.004219,-3.0,0
2010-01-15,72000,78500,71700,77500,292527,0.090014,71600.0,-0.076129,-5.0,-5


### 2. Add TA-Lib indicator features

- 1. MACD
    - Default MACD call: macd, macdsignal, macdhist = ta.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)

In [106]:
# MACD with the default parameters
macd, macdsignal, macdhist = ta.MACD(org['close'], fastperiod=12, slowperiod=26, signalperiod=9)
org['macd'] = macd
org['macd_signal']=macdsignal
org['macd_hist']=macdhist
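
For readers who want to see what the three MACD outputs represent, here is a rough pandas-only sketch. It is not TA-Lib's implementation: TA-Lib leaves the initial warm-up rows as NaN and seeds its moving averages differently, so the numbers will not match exactly. The function name `macd_sketch` and the toy series are assumptions for illustration.

```python
import pandas as pd

def macd_sketch(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """Approximate MACD with pandas exponential moving averages (illustrative only)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow                              # fast EMA minus slow EMA
    macd_signal = macd.ewm(span=signal, adjust=False).mean()  # EMA of the MACD line
    macd_hist = macd - macd_signal                          # gap between line and signal
    return macd, macd_signal, macd_hist

# A steadily rising toy price series: the fast EMA stays above the slow EMA.
rising = pd.Series(range(100, 160), dtype=float)
macd, macd_signal, macd_hist = macd_sketch(rising)

# A flat toy series: all three outputs stay at zero.
flat = pd.Series([50.0] * 40)
flat_macd, flat_signal, flat_hist = macd_sketch(flat)

print(macd.tail(3))
```

A positive MACD line therefore indicates that recent prices are above the longer-run average, and the histogram shows whether that gap is widening or narrowing.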

- 2. BBANDS (Bollinger Bands)

In [107]:
# Bollinger Bands (20-day window)
org['bol_upperband'], org['bol_middleband'], org['bol_lowerband'] = \
ta.BBANDS(org['close'], timeperiod=20)
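
As a rough picture of what the three band columns contain, the sketch below rebuilds them with pandas rolling statistics: a simple moving average as the middle band and two standard deviations on either side. The use of the population standard deviation (`ddof=0`) and the width of 2 are assumptions about TA-Lib's defaults, and `bollinger_sketch` and the random toy prices are made up for illustration.

```python
import numpy as np
import pandas as pd

def bollinger_sketch(close: pd.Series, period: int = 20, num_std: float = 2.0):
    """Approximate Bollinger Bands with rolling mean and std (illustrative only)."""
    middle = close.rolling(period).mean()
    # ddof=0 (population std) is an assumption about how TA-Lib computes the width.
    std = close.rolling(period).std(ddof=0)
    upper = middle + num_std * std
    lower = middle - num_std * std
    return upper, middle, lower

# Toy random-walk prices (seeded so the run is repeatable).
rng = np.random.default_rng(0)
close = pd.Series(70000 + np.cumsum(rng.normal(0, 800, size=60)))
upper, middle, lower = bollinger_sketch(close)

print(upper.iloc[18:21])
```

The first `period - 1` rows are NaN because the window is not yet full, which is why the later cells start from 2010-01-29.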

In [108]:
# Reduce the three Bollinger Band columns to a single column with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=1, random_state=0)

In [109]:
# PCA cannot be fit on NaN values, so start at 2010-01-29, where the 20-day bands are available.
pca.fit(org.loc['2010-01-29':,:][['bol_upperband','bol_middleband','bol_lowerband']])

PCA(n_components=1, random_state=0)
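
Hard-coding the start date works for this ticker and window, but it has to be updated whenever either changes. One possible alternative, sketched here on a toy frame, is to select the usable rows with `dropna()` and reuse their index. The column names and the stand-in feature are hypothetical; in the notebook the feature would come from `pca.transform`.

```python
import numpy as np
import pandas as pd

# Toy frame with a warm-up period of NaN, like the first rows of the band columns.
frame = pd.DataFrame({"close": np.arange(10, dtype=float)})
frame["band"] = frame["close"].rolling(4).mean()

# Rows usable for fitting: everything without NaN in the selected columns.
valid = frame[["band"]].dropna()

# Stand-in for pca.transform(valid); any array with one value per valid row.
feature = valid["band"].to_numpy() * 2.0

# Work on an explicit copy and reuse valid.index so the rows stay aligned.
result = frame.loc[valid.index].copy()
result["feature"] = feature

print(result.head())
```

Because `result` is an explicit copy, adding the new column does not touch `frame` and does not trigger a chained-assignment warning.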

In [110]:
# Create the new feature bol_pca by compressing the three columns into one
bol_pca = pca.transform(org.loc['2010-01-29':,:][['bol_upperband','bol_middleband','bol_lowerband']])
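
To show what the single PCA component captures, here is a NumPy-only sketch that does the same kind of projection via SVD on synthetic bands. It is not scikit-learn's code path, and the sign of the component may differ from what `PCA` returns. All values are synthetic; the sketch assumes the middle band is the average of the upper and lower bands, which holds for the first output row shown further down.

```python
import numpy as np

# Synthetic bands: a random-walk middle band with a varying width around it.
rng = np.random.default_rng(0)
middle = 70000 + np.cumsum(rng.normal(0, 500, size=200))
width = 3000 + 500 * np.abs(rng.normal(size=200))
bands = np.column_stack([middle + width, middle, middle - width])

# PCA by hand: center the columns, then take the SVD.
centered = bands - bands.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

first_component = vt[0]                   # direction of largest variance
projected = centered @ first_component    # one value per row, like n_components=1

# Share of total variance carried by each component.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
print(explained_ratio)
```

Since the three columns are linearly dependent under that assumption, the third component carries essentially no variance. Checking `pca.explained_variance_ratio_` on the real data would show how much information is lost by keeping only one component.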

In [114]:
# Declare a new df to hold the bolinger_pca column (the dates must line up with bol_pca because of the NaN rows).
# .copy() makes df independent of org, so the assignment below does not raise SettingWithCopyWarning.
df = org.loc['2010-01-29':,:].copy()

In [115]:
df['bolinger_pca']=bol_pca


In [116]:
# df now contains the MACD and Bollinger Band indicators.
df[:30]

Date,open,high,low,close,volume,change,1month_close,1month_change,TARGET,_qcut,macd,macd_signal,macd_hist,bol_upperband,bol_middleband,bol_lowerband,bolinger_pca
2010-01-29,70600,73000,70000,70600,133663,-0.024862,67200.0,-0.048159,-3.0,-5,,,,77026.653596,73095.0,69163.346404,-77990.386024
2010-02-01,70600,72000,70000,71800,74728,0.016997,68400.0,-0.047354,-3.0,-3,,,,76983.513576,73015.0,69046.486424,-78123.365474
2010-02-02,72600,73000,71700,72800,116512,0.013928,68800.0,-0.054945,-5.0,-5,,,,76898.336151,72955.0,69011.663849,-78230.203372
2010-02-03,73600,73800,70200,72200,162244,-0.008242,67700.0,-0.062327,-5.0,-5,,,,76756.148999,72850.0,68943.851001,-78416.245398
2010-02-04,71900,71900,69600,70000,108450,-0.030471,67500.0,-0.035714,-3.0,-3,,,,76767.346327,72680.0,68592.653673,-78684.993273
2010-02-05,67100,67800,66200,67800,121509,-0.031429,67000.0,-0.011799,-3.0,-3,,,,77029.291476,72425.0,67820.708524,-79055.148577
2010-02-08,67500,67600,66400,66500,63183,-0.019174,68700.0,0.033083,1.0,1,,,,77430.340898,72150.0,66869.659102,-79438.392338
2010-02-09,66100,67300,66100,67100,82177,0.009023,69100.0,0.029806,0.0,1,,,,77595.452738,71880.0,66164.547262,-79845.41524
2010-02-10,67200,69700,67200,69200,113336,0.031297,68700.0,-0.007225,-3.0,0,,,,77583.866642,71750.0,65916.133358,-80053.637832
2010-02-11,69600,71800,69600,71500,110516,0.033237,69000.0,-0.034965,-3.0,-3,,,,77597.555234,71770.0,65942.444766,-80020.002508


In [117]:
# Save the DataFrame above as a CSV file
# !! Take care not to reuse an existing file name, or that file will be overwritten !!
df.to_csv('./data/kakao_df_201028.csv', sep=',', encoding='utf-8')
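
One way to guard against the accidental overwrite mentioned in the comment is to check for the file before writing. The helper below is a hypothetical sketch, not part of the notebook; `save_csv_no_overwrite` and the toy file name are made up, and the example writes into a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_csv_no_overwrite(frame: pd.DataFrame, path) -> Path:
    """Hypothetical helper: write frame to CSV, refusing to replace an existing file."""
    path = Path(path)
    if path.exists():
        raise FileExistsError(f"{path} already exists; choose a different file name")
    path.parent.mkdir(parents=True, exist_ok=True)  # create ./data if it is missing
    frame.to_csv(path, sep=",", encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "toy_df.csv"
    save_csv_no_overwrite(pd.DataFrame({"a": [1, 2]}), target)
    try:
        # A second write to the same path should be rejected.
        save_csv_no_overwrite(pd.DataFrame({"a": [3]}), target)
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True

print(second_write_blocked)
```

In the notebook this would be called as `save_csv_no_overwrite(df, './data/kakao_df_201028.csv')`.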