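#### Boston Housing Price Data Exercise
- Ways to improve the generalization performance of linear regression
    - Feature expansion (e.g., polynomial regression)
    - Regularization
- Scaling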
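
As a rough illustration of the first two ideas, the sketch below compares a plain linear model with a polynomial-feature pipeline that ends in a ridge (L2-regularized) regressor. The data is synthetic and the names `X_demo`, `y_demo`, and `alpha=1.0` are assumptions chosen for the example, not part of this notebook's Boston workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption for illustration): the target depends on x quadratically.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Baseline: a straight line cannot capture the curvature.
linear = LinearRegression().fit(X_tr, y_tr)

# Feature expansion (adds x^2), scaling, then a ridge model with an L2 penalty.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X_tr, y_tr)

# R^2 on the held-out split for both models.
r2_linear = linear.score(X_te, y_te)
r2_poly_ridge = poly_ridge.score(X_te, y_te)
print(f"linear R^2: {r2_linear:.3f}, polynomial + ridge R^2: {r2_poly_ridge:.3f}")
```

On real data the best polynomial degree and `alpha` are not known in advance and would normally be chosen with cross-validation.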

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [3]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)  # raw string avoids an invalid-escape warning
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
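
The slicing above works because each record in the raw file spans two physical lines: the even-indexed rows hold the first block of features, and the odd-indexed rows hold the remaining two features followed by the target. The toy array below (made-up values, with a narrower layout than the real file) sketches the same reassembly.

```python
import numpy as np

# Toy layout (assumption): each record is split across two rows.
# Even rows: first 4 features. Odd rows: 2 more features, the target, then padding.
raw = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, np.nan],
        [10.0, 20.0, 30.0, 40.0],
        [50.0, 60.0, 70.0, np.nan],
    ]
)

# Join the even rows with the first two columns of the odd rows to form the features.
features = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third column of the odd rows is the target.
labels = raw[1::2, 2]

print(features)
print(labels)
```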

In [7]:
boston = pd.DataFrame(data, columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'])

In [8]:
boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


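| Column  | Description                                                              |
|---------|--------------------------------------------------------------------------|
| CRIM    | Per capita crime rate by town                                            |
| ZN      | Proportion of residential land zoned for lots over 25,000 sq. ft.        |
| INDUS   | Proportion of non-retail business acres per town                         |
| CHAS    | Charles River dummy variable (1: tract bounds the river, 0: otherwise)   |
| NOX     | Nitric oxides concentration (parts per 10 million)                       |
| RM      | Average number of rooms per dwelling                                     |
| AGE     | Proportion of owner-occupied units built before 1940                     |
| DIS     | Weighted distance to five Boston employment centers                      |
| RAD     | Index of accessibility to radial highways                                |
| TAX     | Full-value property tax rate per $10,000                                 |
| PTRATIO | Pupil-teacher ratio by town                                              |
| B       | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| LSTAT   | Percentage of lower-status population                                    |
| MEDV    | Median value of owner-occupied homes in $1000s (the target, stored separately in `target`) |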

In [9]:
boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB


In [11]:
boston.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CRIM,506.0,3.613524,8.601545,0.00632,0.082045,0.25651,3.677083,88.9762
ZN,506.0,11.363636,23.322453,0.0,0.0,0.0,12.5,100.0
INDUS,506.0,11.136779,6.860353,0.46,5.19,9.69,18.1,27.74
CHAS,506.0,0.06917,0.253994,0.0,0.0,0.0,0.0,1.0
NOX,506.0,0.554695,0.115878,0.385,0.449,0.538,0.624,0.871
RM,506.0,6.284634,0.702617,3.561,5.8855,6.2085,6.6235,8.78
AGE,506.0,68.574901,28.148861,2.9,45.025,77.5,94.075,100.0
DIS,506.0,3.795043,2.10571,1.1296,2.100175,3.20745,5.188425,12.1265
RAD,506.0,9.549407,8.707259,1.0,4.0,5.0,24.0,24.0
TAX,506.0,408.237154,168.537116,187.0,279.0,330.0,666.0,711.0


### Data Scaling
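- Balances the influence that each feature has on the model.
    - Prevents the model from depending too heavily on one feature over the others
- Standard Scaler
    - Transforms each feature to have mean 0 and variance 1
- MinMax Scaler
    - Transforms each feature to a fixed range (0 to 1 by default)
    - Very sensitive to outliers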
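
A minimal sketch of the difference, using a made-up single-feature column whose last value is an outlier (the numbers are illustrative, not taken from the Boston data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column (assumption): four ordinary values and one outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)    # subtract the mean, divide by the std
x_minmax = MinMaxScaler().fit_transform(x)   # (x - min) / (max - min)

print("standard:", x_std.ravel())
print("min-max: ", x_minmax.ravel())
```

With min-max scaling the outlier becomes 1 and the four ordinary values are squeezed into a narrow band near 0, which is why the method is described as outlier-sensitive.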

In [12]:
from sklearn.preprocessing import StandardScaler  # scaling utility

In [13]:
std_scaler = StandardScaler()

X_scaled=std_scaler.fit_transform(boston)
X_scaled

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.66905833]])

In [16]:
# fit_transform returns an ndarray, so convert it back to a DataFrame
X_scaled = pd.DataFrame(X_scaled, columns=boston.columns)

In [18]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, target, test_size=0.3, random_state=2024
)

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((354, 13), (152, 13), (354,), (152,))
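
The cells above fit the scaler on all 506 rows before splitting, so the test rows contribute to the scaling statistics. One common alternative, sketched here on synthetic data so that it runs without downloading the dataset, is to split first and fit the scaler on the training split only. The names `X_demo`, `y_demo`, and the `make_regression` settings are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston features (assumption).
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=2024)

# ... then learn the mean and std from the training rows only,
# and reuse those statistics to transform the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The test split is then only approximately centered, which is expected: it is scaled with statistics it did not help compute.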

In [25]:
# 1. Start by making predictions with a plain linear model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [26]:
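# Check performance with 5-fold cross-validation
result = cross_val_score(LinearRegression(), X_train, y_train, cv=5)  # store the scores so they can be averaged
result.mean()
# R² score: how much of the target's variance the model explains. It is at most 1 (closer to 1 means a better fit)
# and can be negative when the model does worse than always predicting the mean.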
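
For reference, the R² score that `cross_val_score` reports for a regressor by default can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values (`y_true`, `y_pred`, `y_bad` are illustrative, not model output from this notebook) and checks the manual result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions (assumption for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Predictions that are worse than always predicting the mean give a negative score.
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)

print(r2_manual, r2_score(y_true, y_pred), r2_bad)
```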