## **Preparing to Implement Machine Learning**
1) A look at the scikit-learn library
- Provides built-in datasets
- Provides utilities for splitting data
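
As a small illustration of the "provides datasets" point, the sketch below loads scikit-learn's built-in diabetes dataset with `load_diabetes`. This is only an example of the dataset API. The notebook itself reads a CSV file, and the names `bunch`, `features`, and `target` are chosen here for illustration.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays.
bunch = load_diabetes(as_frame=True)

features = bunch.data    # DataFrame holding the input columns
target = bunch.target    # Series holding the values to predict

print(features.shape)
print(list(features.columns))
print(target.shape)
```

The data-splitting utility, `train_test_split`, is used later in Step 3.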

2) Extracting the key attributes (Target/Features)
- Features: assigned to X (the inputs)
- Target: assigned to y (the label, i.e. the correct answer)
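
The X/y convention can be sketched on a tiny table. The three rows below reuse a few values from the output shown in Step 2, restricted to two feature columns to keep the example short. The name `toy` is hypothetical.

```python
import pandas as pd

# Three sample rows: two feature columns plus the target column.
toy = pd.DataFrame({
    "bmi": [0.061696, -0.051474, 0.044451],
    "bp": [0.021872, -0.026328, -0.005671],
    "blood_glucose": [151.0, 75.0, 141.0],
})

# Features (X): every column except the target. drop() returns a new frame.
X = toy.drop(columns=["blood_glucose"])

# Target (y): the single column the model should learn to predict.
y = toy["blood_glucose"]

print(list(X.columns))
print(y.name)
```

Because `drop` returns a copy, `toy` still contains all three columns afterwards.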

3) Splitting the data into training and test sets

### **Step 1: Import the libraries**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

### **Step 2: Read the data**

#### 1) Read the file and store it

In [2]:
# Task: read diabetes.csv from the data folder and store it in df.
# The data is diabetes-related.
df = pd.read_csv("./data/diabetes.csv")

df.head(3)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,blood_glucose
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
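
Before choosing features and a target, it can help to confirm that the table loaded as expected. The sketch below is one possible check, not part of the original notebook. To stay self-contained it reads the three rows shown above from an in-memory string (`sample_csv`, a hypothetical name) rather than from `./data/diabetes.csv`.

```python
import io

import pandas as pd

# The three rows displayed above, without the index column.
sample_csv = """age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,blood_glucose
0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # every column should be numeric

# Total number of missing cells across the whole table.
missing_cells = int(sample_df.isna().sum().sum())
print(missing_cells)
```

The same three calls (`shape`, `dtypes`, `isna`) work unchanged on the full `df`.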


### **Step 3: Prepare the data for machine learning**

1) Assign the features (X) and the target (y)

In [4]:
# Task: assign every column except 'blood_glucose' to X, and the 'blood_glucose' column to y.
X = df.drop(columns=["blood_glucose"])
y = df["blood_glucose"]

# Display the first five target values.
y.head()

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: blood_glucose, dtype: float64

In [5]:
X

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068330,-0.092204
2,0.085299,0.050680,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018118,0.044485
439,0.041708,0.050680,-0.015906,0.017282,-0.037344,-0.013840,-0.024993,-0.011080,-0.046879,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044528,-0.025930


In [6]:
y

0      151.0
1       75.0
2      141.0
3      206.0
4      135.0
       ...  
437    178.0
438    104.0
439    132.0
440    220.0
441     57.0
Name: blood_glucose, Length: 442, dtype: float64

2) Split the data into training and test sets

In [7]:
# Task: split X and y with train_test_split(),
#       using test_size=0.3 and random_state=42,
#       and store the result in X_train, X_test, y_train, y_test.

# test_size    > fraction of the data reserved for the test set
# random_state > seed for the random shuffle applied before splitting

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(309, 10)
(133, 10)
(309,)
(133,)
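
The shapes above follow from the split ratio: 0.3 × 442 = 132.6, which is consistent with 133 test rows and 309 training rows. The sketch below reproduces those sizes on stand-in arrays with the same number of rows, and also illustrates that a fixed `random_state` yields the same split on every call. `X_demo` and `y_demo` are hypothetical placeholders, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the diabetes table (442).
X_demo = np.arange(442 * 2).reshape(442, 2)
y_demo = np.arange(442)

# Two independent calls with identical arguments.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # sizes of the training and test partitions

# Same seed, same shuffle: the test labels match element by element.
same_split = np.array_equal(split_a[3], split_b[3])
print(same_split)
```

Changing or omitting `random_state` would generally produce a different shuffle, which is why a fixed seed is convenient when results need to be repeatable.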
