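# Decision Tree
1. A decision tree is a type of supervised learning model.
2. Supervised learning requires two things:
    1) feature_names (independent variables)
    2) label_name (dependent variable): here, "count"
3. datetime → split into year, month, day, hour, minute, and second
4. The remaining 8 variables are used as-is.
5. Supervised learning needs three objects:
    1) x: features of the train data (for training)
    2) y: label of the train data (for training)
    3) x_test: features of the test data (for prediction)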

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train=pd.read_csv("train_bsd.csv")
test=pd.read_csv("test_bsd.csv")

print(train.shape)
print(test.shape)

# Keep both frames in one list so the same preprocessing can be applied to each
datasets=[train,test]

(10886, 12)
(6493, 9)


In [4]:
for dataset in datasets:
    # Parse the datetime strings so the .dt accessor can be used below
    dataset["datetime"]=pd.to_datetime(dataset["datetime"])

In [5]:
for dataset in datasets:
    dataset["datetime-year"]=dataset["datetime"].dt.year
    dataset["datetime-month"]=dataset["datetime"].dt.month
    dataset["datetime-day"]=dataset["datetime"].dt.day
    dataset["datetime-hour"]=dataset["datetime"].dt.hour
    dataset["datetime-minute"]=dataset["datetime"].dt.minute
    dataset["datetime-second"]=dataset["datetime"].dt.second
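
The loops in cells 4 and 5 change `train` and `test` themselves because the `datasets` list holds references to the two DataFrames, not copies. A minimal sketch of the same pattern on a two-row frame (the name `toy` and its timestamps are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row frame standing in for train/test.
toy = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-20 13:00:00"]})

for dataset in [toy]:
    # Parse the strings into datetime64 values so the .dt accessor is available.
    dataset["datetime"] = pd.to_datetime(dataset["datetime"])
    dataset["datetime-day"] = dataset["datetime"].dt.day
    dataset["datetime-hour"] = dataset["datetime"].dt.hour

# The new columns were added to `toy` itself, not to a copy.
print(toy.columns.tolist())
print(toy["datetime-hour"].tolist())
```

Calling `.dt` on a column that still holds plain strings raises an error, which is why the `pd.to_datetime` conversion comes first.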

In [6]:
print(train.shape)
print(test.shape)
train.head()

(10886, 18)
(6493, 15)


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,datetime-year,datetime-month,datetime-day,datetime-hour,datetime-minute,datetime-second
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,2011,1,1,0,0,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,2011,1,1,1,0,0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2011,1,1,2,0,0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,2011,1,1,3,0,0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,2011,1,1,4,0,0


In [7]:
feature_names=["season","holiday","workingday","weather",
               "temp","atemp","humidity","windspeed",
               "datetime-year","datetime-month","datetime-day",
               "datetime-hour","datetime-minute","datetime-second"]
feature_names

['season',
 'holiday',
 'workingday',
 'weather',
 'temp',
 'atemp',
 'humidity',
 'windspeed',
 'datetime-year',
 'datetime-month',
 'datetime-day',
 'datetime-hour',
 'datetime-minute',
 'datetime-second']

In [8]:
label_name="count"
label_name

'count'

In [9]:
x=train[feature_names]
print(x.shape)
x.head()

(10886, 14)


Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,datetime-year,datetime-month,datetime-day,datetime-hour,datetime-minute,datetime-second
0,1,0,0,1,9.84,14.395,81,0.0,2011,1,1,0,0,0
1,1,0,0,1,9.02,13.635,80,0.0,2011,1,1,1,0,0
2,1,0,0,1,9.02,13.635,80,0.0,2011,1,1,2,0,0
3,1,0,0,1,9.84,14.395,75,0.0,2011,1,1,3,0,0
4,1,0,0,1,9.84,14.395,75,0.0,2011,1,1,4,0,0


In [10]:
y=train[label_name]
print(y.shape)
y.head()

(10886,)


0    16
1    40
2    32
3    13
4     1
Name: count, dtype: int64

In [11]:
x_test=test[feature_names]
print(x_test.shape)
x_test.head()

(6493, 14)


Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,datetime-year,datetime-month,datetime-day,datetime-hour,datetime-minute,datetime-second
0,1,0,1,1,10.66,11.365,56,26.0027,2011,1,20,0,0,0
1,1,0,1,1,10.66,13.635,56,0.0,2011,1,20,1,0,0
2,1,0,1,1,10.66,13.635,56,0.0,2011,1,20,2,0,0
3,1,0,1,1,10.66,12.88,56,11.0014,2011,1,20,3,0,0
4,1,0,1,1,10.66,12.88,56,11.0014,2011,1,20,4,0,0


In [12]:
from sklearn.tree import DecisionTreeRegressor

In [22]:
model=DecisionTreeRegressor()
model

In [23]:
model.fit(x,y)

In [24]:
prediction_list=model.predict(x_test)
print(prediction_list.shape)
prediction_list

(6493,)


array([ 17.,   6.,   3., ..., 100., 106.,  54.])
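
The model above is created without a `random_state`. scikit-learn's tree can break ties between equally good splits at random, so repeated runs may give slightly different predictions. The sketch below, on made-up data (`X_toy`, `y_toy`, `X_new` and the seed `37` are arbitrary choices for illustration), shows that fixing the seed makes two fits agree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: 200 rows, 3 integer features, integer target.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(200, 3))
y_toy = rng.integers(0, 100, size=200)
X_new = rng.integers(0, 5, size=(20, 3))

# Same seed and same data: the two fitted trees give the same predictions.
pred_a = DecisionTreeRegressor(random_state=37).fit(X_toy, y_toy).predict(X_new)
pred_b = DecisionTreeRegressor(random_state=37).fit(X_toy, y_toy).predict(X_new)
print(np.array_equal(pred_a, pred_b))
```

In the notebook this would amount to writing `DecisionTreeRegressor(random_state=37)` in the cell that builds `model`, if reproducible submissions are wanted.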

In [16]:
submit=pd.read_csv("sampleSubmission.csv")
print(submit.shape)
submit.head()

(6493, 2)


Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0


In [17]:
submit["count"]=prediction_list

In [18]:
submit.to_csv("baseline-regress.csv",index=False)
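
`index=False` keeps the row index out of the file, so the CSV has only the two columns seen in `sampleSubmission.csv`. A small sketch of the difference, using a hypothetical two-row frame and in-memory buffers instead of real files:

```python
import io
import pandas as pd

# Hypothetical two-row submission frame.
toy_submit = pd.DataFrame({
    "datetime": ["2011-01-20 00:00:00", "2011-01-20 01:00:00"],
    "count": [17.0, 6.0],
})

# Write to in-memory buffers so the sketch leaves nothing on disk.
with_index = io.StringIO()
without_index = io.StringIO()
toy_submit.to_csv(with_index)                  # default: index=True
toy_submit.to_csv(without_index, index=False)

# First line of each CSV is its header row.
header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))     # begins with an empty name for the index column
print(repr(header_without))  # only the two data columns
```

Reading the first variant back with `pd.read_csv` would produce an extra column named `Unnamed: 0`.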

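# Classification vs Regression
## 1. Classification problem → use "DecisionTreeClassifier"
    a. Definition: the answer to predict (label, y) is categorical (one of a fixed set of classes)
    b. Range of y: usually a binary choice (0 or 1)
    c. Examples:
        i. Cancer diagnosis prediction (positive/negative)
        ii. Spam filtering (ham/spam)
        iii. Promotional post detection (ad/not an ad)
## 2. Regression problem → use "DecisionTreeRegressor"
    a. Definition: the answer to predict (label, y) is continuous (a number that can be compared as higher or lower)
    b. Range of y: unbounded, 0 ~ ∞ or -∞ ~ +∞
    c. Examples:
        i. Real estate price prediction
        ii. Samsung Electronics stock price prediction
        iii. Bitcoin price prediction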
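
The contrast between the two estimators can be sketched on a tiny made-up dataset. All names and values below (`X_toy`, `y_class`, `y_value`, `X_new`) are invented for illustration and are not part of the bike-sharing data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up single-feature data with six distinct inputs.
X_toy = np.array([[1], [2], [3], [4], [5], [6]])
y_class = np.array([0, 0, 0, 1, 1, 1])                      # categorical label: 0 or 1
y_value = np.array([10.5, 12.0, 15.25, 30.0, 42.5, 55.75])  # continuous label

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_class)
reg = DecisionTreeRegressor(random_state=0).fit(X_toy, y_value)

X_new = np.array([[2.4], [5.6]])
class_pred = clf.predict(X_new)  # each prediction is one of the class labels seen in training
value_pred = reg.predict(X_new)  # each prediction is a number (the mean target of a leaf)
print(class_pred)
print(value_pred)
```

The bike-sharing label `count` is a quantity rather than a class, which is why the notebook uses `DecisionTreeRegressor`.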