This section introduces the example datasets that scikit-learn provides for regression analysis. First, the Boston housing price dataset is organized as follows.

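Target data
Boston housing prices, 1978
Median home value in each of 506 towns (in units of $1,000)
Feature data
CRIM: per capita crime rate
INDUS: proportion of non-retail business acres
NOX: nitric oxides concentration
RM: average number of rooms per dwelling
LSTAT: proportion of lower-status population
B: measure derived from the proportion of Black residents, 1000(Bk - 0.63)^2
PTRATIO: pupil-teacher ratio
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
CHAS: 1 if the tract bounds the Charles River, 0 otherwise
AGE: proportion of owner-occupied units built before 1940
RAD: index of accessibility to radial highways
DIS: weighted distances to five Boston employment centres
TAX: full-value property-tax rate per $10,000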

Loading the data

In [4]:
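# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version.
from sklearn.datasets import load_boston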

datasets = load_boston()

print(dir(datasets))

['DESCR', 'data', 'feature_names', 'filename', 'target']


In [5]:
print(datasets.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

Converting to a pandas DataFrame

In [6]:
import pandas as pd

df = pd.DataFrame(datasets.data)
df.columns = datasets.feature_names
df['price'] = datasets.target
df.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  price
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2    4.09  1.0  296.0     15.3   396.9   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8   396.9   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7   396.9   5.33   36.2


Preprocessing and data split (X = RM; standardization only)

In [7]:
from sklearn.model_selection import train_test_split

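# Use only RM (average number of rooms) as the single input feature.
# X = datasets.data
X = df['RM']
y = datasets.target

# Hold out 33% of the samples for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)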

print(X.shape, X_train.shape, X_test.shape)

(506,) (339,) (167,)
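
The printed shapes are consistent with the test count being rounded up from test_size * n_samples. The arithmetic below is a small sketch of that, assuming the rounding rule stated in the comment; it does not call scikit-learn.

```python
import math

n_samples = 506
test_size = 0.33

# Assumption: for a fractional test_size, the test count is rounded up
# and the remaining samples form the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```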


In [8]:
import numpy as np

X_train = np.array(X_train).reshape(-1, 1)
X_test = np.array(X_test).reshape(-1, 1)

y_train = np.array(y_train).reshape(-1, 1)
y_test = np.array(y_test).reshape(-1, 1)
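
The reshape is needed because a single column taken from a DataFrame is one-dimensional, while the scaler and the model expect a 2-D array of shape (n_samples, n_features). A minimal illustration, using three RM values from the preview above:

```python
import numpy as np

# Three RM values from the first rows of the DataFrame preview.
x = np.array([6.575, 6.421, 7.185])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column.
x_col = x.reshape(-1, 1)

print(x.shape, x_col.shape)
```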

Standardization

In [9]:
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()

standardScaler.fit(X_train)

X_scaling_train = standardScaler.transform(X_train)
X_scaling_test = standardScaler.transform(X_test)
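
What the scaler does can be sketched by hand: subtract the training mean and divide by the training standard deviation, then reuse those same statistics for the test split. The values below are made up for illustration, and the sketch assumes the population standard deviation (ddof=0).

```python
import numpy as np

# Made-up single-feature data, shaped (n_samples, 1).
train = np.array([[6.0], [7.0], [8.0]])
test = np.array([[7.0], [9.0]])

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # test data reuses the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training split alone keeps information from the test split out of the preprocessing step.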

Converting to tensors

In [10]:
import torch

X_train_tensor = torch.tensor(X_scaling_train, dtype=torch.float)
y_train_tensor = torch.tensor(y_train, dtype=torch.float).view(-1, 1)

X_test_tensor = torch.tensor(X_scaling_test, dtype=torch.float)
y_test_tensor = torch.tensor(y_test, dtype=torch.float).view(-1, 1)

Defining the model

In [11]:
import torch
import torch.nn as nn

class LinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # A single fully connected layer: y = X W^T + b
        self.linear1 = nn.Linear(input_size, output_size)

    def forward(self, X):
        y = self.linear1(X)
        return y

model = LinearRegression(1, 1)
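
With input_size and output_size both equal to 1, the layer reduces to a line y = w * x + b. The NumPy sketch below mirrors that forward pass; the weight and bias values are hypothetical stand-ins for the parameters the layer would learn.

```python
import numpy as np

# Hypothetical parameters, shaped like those of a 1-in, 1-out linear layer.
W = np.array([[2.0]])  # shape (output_size, input_size)
b = np.array([0.5])    # shape (output_size,)

# Three standardized inputs, shaped (n_samples, 1).
X = np.array([[-1.0], [0.0], [1.0]])

# Affine map applied row by row.
y_hat = X @ W.T + b

print(y_hat.ravel())
```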

Defining the loss function and optimizer

In [12]:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

Training

In [13]:
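epochs = 10000

loss_train = []  # training loss recorded at every epoch
model.train()
for epoch in range(epochs):
    # Forward pass and mean squared error on the full training set.
    prediction = model(X_train_tensor)
    loss = criterion(input=prediction, target=y_train_tensor)

    # Backward pass and parameter update.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    loss_train.append(loss.item())
    if epoch % 1000 == 0:
        print(loss.item())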

658.1212158203125
265.1588439941406
106.9111557006836
55.20410919189453
46.302608489990234
45.832332611083984
45.82902145385742
45.82902526855469
45.829017639160156
45.82902526855469
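
The loss levelling off near 45.83 is what one would expect once training reaches the least-squares minimum for a single standardized feature. The sketch below illustrates that idea on synthetic data (not the Boston data), and it uses plain gradient descent rather than Adam; the learning rate and step count are arbitrary choices for the illustration.

```python
import numpy as np

# Synthetic data: one standardized feature and a noisy linear target.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
y = 22.0 + 6.0 * x + rng.normal(scale=3.0, size=200)

# Closed-form least-squares line for one feature.
w_closed = np.mean((x - x.mean()) * (y - y.mean())) / x.var()
b_closed = y.mean() - w_closed * x.mean()

# Plain gradient descent on the mean squared error.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y
    w -= lr * 2.0 * np.mean(err * x)
    b -= lr * 2.0 * np.mean(err)

mse_gd = np.mean((w * x + b - y) ** 2)
mse_closed = np.mean((w_closed * x + b_closed - y) ** 2)
print(w, w_closed)
print(mse_gd, mse_closed)
```

Once the two sets of parameters agree, further epochs cannot lower the training loss, which matches the plateau in the printed values.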


What if we use classical machine learning instead?

In [14]:
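# Note: this import rebinds the name LinearRegression, shadowing the
# PyTorch class defined earlier in the notebook.
from sklearn.linear_model import LinearRegression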
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [15]:
# Same scaling as before: fit on the training split only, then apply to both.
standardScaler = StandardScaler()

standardScaler.fit(X_train)

X_scaling_train = standardScaler.transform(X_train)
X_scaling_test = standardScaler.transform(X_test)

In [16]:
linear_regression = LinearRegression()

linear_regression.fit(X_scaling_train, y_train)

LinearRegression()

In [17]:
prediction = linear_regression.predict(X_scaling_test)

In [18]:
print(f'mse:{mean_squared_error(y_test, prediction)}, r2:{r2_score(y_test, prediction)}')

mse:39.091051114869956, r2:0.4834590168919487
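
Both metrics can be computed by hand, which makes their meaning concrete: MSE is the average squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

Note that the value printed above is measured on the test split, whereas the losses printed during PyTorch training were measured on the training split, so the two numbers are not directly comparable.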
