# Chapter 1: How Is Regression Done?



<a href="https://www.docswell.com/s/MasahiroAraki/Z98N1V-2023-08-14-151405">Slides: Chapter 1</a>


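The Boston housing prices dataset that the book uses for regression has ethical problems and was removed from scikit-learn in version 1.2. This notebook therefore uses the California housing dataset, which is now the recommended replacement. Code that uses the Boston housing prices dataset is included as an appendix at the end of this notebook.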

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso

housing = fetch_california_housing()
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Set `X` to the feature vectors and `y` to the target (median house value in units of $100,000).

In [7]:
X = housing.data
y = housing.target

Simple linear regression

In [8]:
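lr1 = LinearRegression()
lr1.fit(X, y)
print("Linear Regression")
# Print the learned weight for each feature.
for f, w in zip(housing.feature_names, lr1.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
# "coef" is the sum of the squared weights (the squared L2 norm).
print("coef = {0:4.2f}".format(sum(lr1.coef_**2)))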

Linear Regression
MedInc :   0.44
HouseAge:   0.01
AveRooms:  -0.11
AveBedrms:   0.65
Population:  -0.00
AveOccup:  -0.00
Latitude:  -0.42
Longitude:  -0.43
coef = 0.98
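
To make the printed quantities concrete, the following is a minimal NumPy sketch of an ordinary least-squares fit, followed by the same sum of squared weights that the cell above labels `coef`. The synthetic data, `true_w`, and the intercept value 3.0 are assumptions made for illustration. It does not reproduce the housing results and is not a description of scikit-learn's internals.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the intercept is fitted together with the weights.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = solution[:-1], solution[-1]

# Same quantity the notebook prints as "coef": the sum of squared weights.
sq_norm = float(np.sum(w ** 2))
print("weights:", np.round(w, 2))
print("intercept:", round(float(b), 2))
print("sum of squared weights:", round(sq_norm, 2))
```

Because the intercept is fitted separately, it is not part of the sum of squared weights.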


Ridge regression

In [17]:
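# alpha is the strength of the L2 penalty on the weights.
lr2 = Ridge(alpha=100.0)
lr2.fit(X, y)
print("\n Ridge")
for f, w in zip(housing.feature_names, lr2.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
# Sum of the squared weights; compare with the linear regression value.
print("coef = {0:4.2f}".format(sum(lr2.coef_**2)))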


 Ridge
MedInc :   0.43
HouseAge:   0.01
AveRooms:  -0.09
AveBedrms:   0.56
Population:  -0.00
AveOccup:  -0.00
Latitude:  -0.42
Longitude:  -0.43
coef = 0.86
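
The drop in the sum of squared weights relative to plain linear regression can be illustrated with a closed-form sketch. It assumes the objective `||y - Xw - b||^2 + alpha * ||w||^2` with an unpenalized intercept. The helper `ridge_fit` and the synthetic data are hypothetical names and values introduced here, not code from the book.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights with an unpenalized intercept (illustrative helper)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering handles the intercept
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - X_mean @ w
    return w, b

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + 0.1 * rng.normal(size=100)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
sq_norms = [float(np.sum(ridge_fit(X, y, a)[0] ** 2)) for a in alphas]
for a, s in zip(alphas, sq_norms):
    print(f"alpha = {a:7.1f}  sum of squared weights = {s:.4f}")
```

With `alpha = 0.0` this reduces to ordinary least squares; the sum of squared weights then shrinks as `alpha` grows.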


Lasso regression

In [22]:
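# alpha is the strength of the L1 penalty, which can set weights exactly to zero.
lr3 = Lasso(alpha=0.05)
lr3.fit(X, y)
print("\n Lasso")
for f, w in zip(housing.feature_names, lr3.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
# Sum of the squared weights; compare with the values above.
print("coef = {0:4.2f}".format(sum(lr3.coef_**2)))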


 Lasso
MedInc :   0.38
HouseAge:   0.01
AveRooms:   0.00
AveBedrms:   0.00
Population:   0.00
AveOccup:  -0.00
Latitude:  -0.28
Longitude:  -0.28
coef = 0.30
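
Several weights in the output above are zero. One way to see how an L1 penalty produces exact zeros is cyclic coordinate descent with soft-thresholding, sketched below. The sketch assumes the objective `(1/(2n)) * ||y - Xw - b||^2 + alpha * ||w||_1`. The helpers `soft_threshold` and `lasso_cd`, the iteration count, and the synthetic data are all hypothetical choices for illustration, not the library's solver.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the L1-penalized least-squares objective."""
    n, d = X.shape
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering handles the intercept
    w = np.zeros(d)
    z = (Xc ** 2).sum(axis=0) / n            # assumes no constant feature (z > 0)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / z[j]
    b = y_mean - X_mean @ w
    return w, b

# Synthetic data (an assumption): only features 0 and 3 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=200)

w, b = lasso_cd(X, y, alpha=0.2)
print(np.round(w, 2))
```

Any weight whose correlation with the residual stays within `alpha` is set to exactly zero, which is why the penalty acts as a form of feature selection.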


## Appendix: Regression with the Boston data (not recommended)

Use this data only to demonstrate what is problematic about it.

[Description of the data and features](http://lib.stat.cmu.edu/datasets/boston)


In [27]:
import pandas as pd
import numpy as np

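data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Raw string avoids an invalid-escape warning for "\s"; the first 22 lines are a text header.
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical rows: join the even rows with the first two columns of the odd rows.
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The target is the third column of the odd rows.
y = raw_df.values[1::2, 2]
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]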
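
The slicing in the cell above can be hard to read, so here is a toy sketch of the same pattern on a small hand-made array. The values and the names `raw`, `X_toy`, and `y_toy` are hypothetical. The sketch assumes that each record occupies two consecutive rows and that the shorter second row is padded with NaN after parsing.

```python
import numpy as np

# Toy stand-in (hypothetical values) for the parsed table: each record spans two rows.
# Row 2k holds the first 4 features; row 2k+1 holds 2 more features, then the target,
# followed by NaN padding.
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 99.0, np.nan],
    [10.0, 20.0, 30.0, 40.0],
    [50.0, 60.0, 990.0, np.nan],
])

X_toy = np.hstack([raw[::2, :], raw[1::2, :2]])  # even rows + first 2 columns of odd rows
y_toy = raw[1::2, 2]                             # third column of odd rows
print(X_toy)
print(y_toy)
```

Each row of `X_toy` now holds all six features of one record, and `y_toy` holds the matching targets.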

In [28]:
# Ordinary least squares (no regularization).
lr1 = LinearRegression()
lr1.fit(X, y)
print("Linear Regression")
for f, w in zip(feature_names, lr1.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
# "coef" is the sum of the squared weights (the squared L2 norm).
print("coef = {0:4.2f}".format(sum(lr1.coef_**2)))

# Ridge: L2 penalty with strength alpha.
lr2 = Ridge(alpha=10.0)
lr2.fit(X, y)
print("\n Ridge")
for f, w in zip(feature_names, lr2.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
print("coef = {0:4.2f}".format(sum(lr2.coef_**2)))

# Lasso: L1 penalty with strength alpha.
lr3 = Lasso(alpha=2.0)
lr3.fit(X, y)
print("\n Lasso")
for f, w in zip(feature_names, lr3.coef_):
    print("{0:7s}: {1:6.2f}".format(f, w))
print("coef = {0:4.2f}".format(sum(lr3.coef_**2)))


Linear Regression
CRIM   :  -0.11
ZN     :   0.05
INDUS  :   0.02
CHAS   :   2.69
NOX    : -17.77
RM     :   3.81
AGE    :   0.00
DIS    :  -1.48
RAD    :   0.31
TAX    :  -0.01
PTRATIO:  -0.95
B      :   0.01
LSTAT  :  -0.52
coef = 340.85

 Ridge
CRIM   :  -0.10
ZN     :   0.05
INDUS  :  -0.04
CHAS   :   1.95
NOX    :  -2.37
RM     :   3.70
AGE    :  -0.01
DIS    :  -1.25
RAD    :   0.28
TAX    :  -0.01
PTRATIO:  -0.80
B      :   0.01
LSTAT  :  -0.56
coef = 25.74

 Lasso
CRIM   :  -0.02
ZN     :   0.04
INDUS  :  -0.00
CHAS   :   0.00
NOX    :  -0.00
RM     :   0.00
AGE    :   0.04
DIS    :  -0.07
RAD    :   0.17
TAX    :  -0.01
PTRATIO:  -0.56
B      :   0.01
LSTAT  :  -0.82
coef = 1.02
