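# Basic Supervised Learning Algorithms

## 3. Polynomial Regression

### 01. Polynomial Regression

Polynomial regression uses a hypothesis function that is a polynomial, so the fitted model is a curve rather than a straight line.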

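### 03. Single-Feature Polynomial Regression

Polynomial regression with a single feature is called single-feature polynomial regression. As with linear regression, the goal is to find the values of $\theta$ that best fit the training data.

For example, a cubic hypothesis function for a single input variable $x$ is: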
$$
h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}x^{2} + \theta_{3}x^{3} 
$$

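This has almost the same form as the hypothesis for multiple linear regression with three input variables, shown below, so each term can be handled exactly as in linear regression. That is, square and cube the input variable, then treat each result as an input variable of its own ($x_{1} = x$, $x_{2} = x^{2}$, $x_{3} = x^{3}$).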

$$
h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3}
$$
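As a minimal sketch of this idea, the snippet below builds $x$, $x^{2}$, and $x^{3}$ as three separate columns and fits an ordinary `LinearRegression` on them. The data is hypothetical and noise-free, generated from a cubic whose coefficients were chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from a known cubic.
x = np.linspace(-2.0, 2.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.5 * x**3

# Treat x, x^2, and x^3 as three separate input variables.
X_poly = np.column_stack([x, x**2, x**3])

# Plain linear regression on the expanded inputs.
model = LinearRegression()
model.fit(X_poly, y)

theta_0 = model.intercept_
theta_1, theta_2, theta_3 = model.coef_
print(theta_0, theta_1, theta_2, theta_3)
```

Because the data contains no noise, the fitted values of $\theta$ should match the coefficients used to generate it, up to floating-point error.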

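### 05. Multiple Polynomial Regression

Polynomial regression with several features is called multiple polynomial regression. For a degree-2 model, for example, both the product of any two input variables and the square of a single input variable are second-degree terms. Compute all of these values, treat each one as a new input variable, and then solve the problem as multiple linear regression.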

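### 07. The Power of Polynomial Regression

Polynomial regression can improve a model's performance. For example, when predicting house prices with plain linear regression, each input variable contributes to the prediction independently, so the model only learns the separate effects of width and height on the price. With polynomial regression, the product of width and height, which reflects the size of the house, becomes an input variable in its own right, so the model can capture how the two jointly affect the price.

Turning a linear regression problem into a polynomial regression problem therefore lets the model express higher-order relationships between features, which can improve its performance.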
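The following is a small illustration of that point on synthetic data. It assumes, purely for the sake of the example, that the price is exactly proportional to width times height. The variable names and constants are hypothetical, and the error is measured on the training data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical houses: price depends only on width * height.
rng = np.random.default_rng(0)
width = rng.uniform(5.0, 20.0, size=200)
height = rng.uniform(5.0, 20.0, size=200)
X = np.column_stack([width, height])
price = 3.0 * width * height

# Plain linear regression sees width and height separately.
linear_model = LinearRegression().fit(X, price)
linear_rmse = mean_squared_error(price, linear_model.predict(X)) ** 0.5

# Degree-2 features add width^2, width * height, and height^2.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
poly_model = LinearRegression().fit(X_poly, price)
poly_rmse = mean_squared_error(price, poly_model.predict(X_poly)) ** 0.5

print(linear_rmse, poly_rmse)
```

In this constructed case the polynomial model can represent the target exactly, while the linear model cannot. On real data the gain depends on whether such interactions are actually present.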


### 10. Predicting Diabetes with Polynomial Regression I: Building the Problem

```Python
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures

import pandas as pd

# Load the diabetes dataset (10 input features).
diabetes_dataset = datasets.load_diabetes()

# Expand the features into every term of degree 2 or lower.
polynomial_transformer = PolynomialFeatures(2)
polynomial_data = polynomial_transformer.fit_transform(diabetes_dataset.data)

# Label each generated column with its term name, e.g. "age^2" or "age sex".
polynomial_feature_names = polynomial_transformer.get_feature_names_out(diabetes_dataset.feature_names)
X = pd.DataFrame(polynomial_data, columns=polynomial_feature_names)

X.head()
```

### 11. Predicting Diabetes with Polynomial Regression II: Training the Model

```Python
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import pandas as pd

# Load the diabetes dataset.
diabetes_dataset = datasets.load_diabetes()

# Build the degree-2 polynomial features and the target column.
polynomial_transformer = PolynomialFeatures(2)
polynomial_data = polynomial_transformer.fit_transform(diabetes_dataset.data)
polynomial_feature_names = polynomial_transformer.get_feature_names_out(diabetes_dataset.feature_names)
X = pd.DataFrame(polynomial_data, columns=polynomial_feature_names)
Y = pd.DataFrame(diabetes_dataset.target, columns=["diabetes"])

# Hold out 20% of the data as a test set.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

# Fit linear regression on the polynomial features.
model = LinearRegression()
model.fit(X_train, Y_train)

# Root mean squared error (RMSE) on the test set.
Y_predict = model.predict(X_test)
mean_squared_error(Y_test, Y_predict) ** 0.5
```


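The notebook cells below apply the same steps to the Boston housing dataset. (`load_boston` was removed in scikit-learn 1.2, so these cells need an older version.)

```Python
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures


boston_dataset = load_boston()
boston_dataset.data.shape
```

```
(506, 13)
```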

```Python
polynomial_transformer = PolynomialFeatures(2)
polynomial_data = polynomial_transformer.fit_transform(boston_dataset.data)
polynomial_data.shape
```

```
(506, 105)
```
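The column count of 105 can be checked by hand: the number of polynomial terms of degree at most $d$ in $n$ variables is $\binom{n+d}{d}$. The short sketch below verifies this for 13 features and degree 2; the variable names are chosen only for illustration.

```python
from math import comb

n_features = 13  # columns in the Boston data
degree = 2       # PolynomialFeatures(2)

# Total number of terms of degree <= 2 in 13 variables.
n_terms = comb(n_features + degree, degree)

# The same count, broken down by kind of term:
# constant + linear terms + squares + pairwise products.
breakdown = 1 + n_features + n_features + comb(n_features, 2)

print(n_terms, breakdown)  # 105 105
```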

```Python
# get_feature_names was removed in scikit-learn 1.2; newer versions provide get_feature_names_out.
polynomial_feature_names = polynomial_transformer.get_feature_names(boston_dataset.feature_names)
polynomial_feature_names
```

```
['1',
 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT',
 'CRIM^2', 'CRIM ZN', 'CRIM INDUS', 'CRIM CHAS', 'CRIM NOX', 'CRIM RM', 'CRIM AGE', 'CRIM DIS', 'CRIM RAD', 'CRIM TAX', 'CRIM PTRATIO', 'CRIM B', 'CRIM LSTAT',
 'ZN^2', 'ZN INDUS', 'ZN CHAS', 'ZN NOX', 'ZN RM', 'ZN AGE', 'ZN DIS', 'ZN RAD', 'ZN TAX', 'ZN PTRATIO', 'ZN B', 'ZN LSTAT',
 'INDUS^2', 'INDUS CHAS', 'INDUS NOX', 'INDUS RM', 'INDUS AGE', 'INDUS DIS', 'INDUS RAD', 'INDUS TAX', 'INDUS PTRATIO', 'INDUS B', 'INDUS LSTAT',
 'CHAS^2', 'CHAS NOX', 'CHAS RM', 'CHAS AGE', 'CHAS DIS', 'CHAS RAD', 'CHAS TAX', 'CHAS PTRATIO', 'CHAS B', 'CHAS LSTAT',
 'NOX^2', 'NOX RM', 'NOX AGE', 'NOX DIS', 'NOX RAD', 'NOX TAX', 'NOX PTRATIO', 'NOX B', 'NOX LSTAT',
 'RM^2', 'RM AGE', 'RM DIS', 'RM RAD', 'RM TAX', 'RM PTRATIO', 'RM B', 'RM LSTAT',
 'AGE^2', 'AGE DIS', 'AGE RAD', 'AGE TAX', 'AGE PTRATIO', 'AGE B', 'AGE LSTAT',
 'DI
```

```Python
import pandas as pd


X = pd.DataFrame(polynomial_data, columns=polynomial_feature_names)
X
```

| | 1 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | ... | TAX^2 | TAX PTRATIO | TAX B | TAX LSTAT | PTRATIO^2 | PTRATIO B | PTRATIO LSTAT | B^2 | B LSTAT | LSTAT^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | ... | 87616.0 | 4528.8 | 117482.40 | 1474.08 | 234.09 | 6072.570 | 76.194 | 157529.6100 | 1976.5620 | 24.8004 |
| 1 | 1.0 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | ... | 58564.0 | 4307.6 | 96049.80 | 2211.88 | 316.84 | 7064.820 | 162.692 | 157529.6100 | 3627.6660 | 83.5396 |
| 2 | 1.0 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | ... | 58564.0 | 4307.6 | 95064.86 | 975.26 | 316.84 | 6992.374 | 71.734 | 154315.4089 | 1583.1049 | 16.2409 |
| 3 | 1.0 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | ... | 49284.0 | 4151.4 | 87607.86 | 652.68 | 349.69 | 7379.581 | 54.978 | 155732.8369 | 1160.2122 | 8.6436 |
| 4 | 1.0 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | ... | 49284.0 | 4151.4 | 88111.80 | 1183.26 | 349.69 | 7422.030 | 99.671 | 157529.6100 | 2115.4770 | 28.4089 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 1.0 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | ... | 74529.0 | 5733.0 | 107013.27 | 2639.91 | 441.00 | 8231.790 | 203.070 | 153656.1601 | 3790.5433 | 93.5089 |
| 502 | 1.0 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | ... | 74529.0 | 5733.0 | 108353.70 | 2478.84 | 441.00 | 8334.900 | 190.680 | 157529.6100 | 3603.8520 | 82.4464 |
| 503 | 1.0 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | ... | 74529.0 | 5733.0 | 108353.70 | 1539.72 | 441.00 | 8334.900 | 118.440 | 157529.6100 | 2238.5160 | 31.8096 |
| 504 | 1.0 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | ... | 74529.0 | 5733.0 | 107411.85 | 1769.04 | 441.00 | 8262.450 | 136.080 | 154802.9025 | 2549.5560 | 41.9904 |


```Python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


Y = pd.DataFrame(boston_dataset.target, columns=["MEDV"])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

model = LinearRegression()
model.fit(X_train, Y_train)

Y_predict = model.predict(X_test)
mean_squared_error(Y_test, Y_predict) ** 0.5
```

```
3.1965276513493723
```

Output of `X.head()` from the section 10 code (diabetes dataset):

| | 1 | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | ... | s3^2 | s3 s4 | s3 s5 | s3 s6 | s4^2 | s4 s5 | s4 s6 | s5^2 | s5 s6 | s6^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | ... | 0.001884 | 0.000113 | -0.000864 | 0.000766 | 7e-06 | -5.2e-05 | 4.6e-05 | 0.000396 | -0.000351 | 0.000311 |
| 1 | 1.0 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | ... | 0.005537 | -0.002939 | -0.005085 | -0.006861 | 0.00156 | 0.002699 | 0.003641 | 0.004669 | 0.0063 | 0.008502 |
| 2 | 1.0 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | ... | 0.001047 | 8.4e-05 | -9.3e-05 | 0.000839 | 7e-06 | -7e-06 | 6.7e-05 | 8e-06 | -7.4e-05 | 0.000672 |
| 3 | 1.0 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | ... | 0.001299 | -0.001236 | -0.000818 | 0.000337 | 0.001177 | 0.000778 | -0.000321 | 0.000515 | -0.000212 | 8.8e-05 |
| 4 | 1.0 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | ... | 6.6e-05 | -2.1e-05 | -0.00026 | -0.00038 | 7e-06 | 8.3e-05 | 0.000121 | 0.001023 | 0.001492 | 0.002175 |


Output of the section 11 code (RMSE on the diabetes test set):

```
57.88957704461995
```