# Exercise 5-3

This exercise classifies the breast_cancer dataset with scikit-learn's GaussianNB and LogisticRegression.

## Setup

Load the required libraries.

In [1]:
import numpy as np
import sklearn.datasets as ds
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

## Loading the data

In [2]:
bc = ds.load_breast_cancer()
print(bc.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.
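
Before fitting anything, it can help to confirm the shape of the data and how the classes are encoded. The following is a minimal sketch that uses only attributes of the object returned by `load_breast_cancer()`; the variable name `counts` is introduced here for illustration.

```python
import numpy as np
import sklearn.datasets as ds

bc = ds.load_breast_cancer()

# Feature matrix: one row per sample, one column per feature.
print(bc.data.shape)

# Class names listed in label order (label 0 first, then label 1).
print(list(bc.target_names))

# Number of samples carrying each class label.
counts = np.bincount(bc.target)
for name, n in zip(bc.target_names, counts):
    print("{}: {}".format(name, n))
```

Label 0 is malignant and label 1 is benign, so the first row of each per-class array later in this exercise refers to the malignant class.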

        

## Naive Bayes

In [3]:
clf1 = GaussianNB()
clf1.fit(bc.data, bc.target)

GaussianNB(priors=None)

In [4]:
scores = cross_val_score(clf1, bc.data, bc.target, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(), scores.std()),
      end="\n\n")

mean: 0.939 (std: 0.030)
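
For a classifier and an integer `cv`, `cross_val_score` splits the data with stratified k-fold and reports the estimator's default score, which is accuracy for classifiers. The loop below is a sketch of that procedure written out by hand, assuming default settings; `manual_scores` and `library_scores` are names introduced here.

```python
import numpy as np
import sklearn.datasets as ds
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

bc = ds.load_breast_cancer()
X, y = bc.data, bc.target

# Stratified 10-fold split without shuffling (assumed to match cv=10).
skf = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = clone(GaussianNB())  # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])
    # score() returns the accuracy on the held-out fold.
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(GaussianNB(), X, y, cv=10)
print("manual : {:.3f} (std: {:.3f})".format(manual_scores.mean(),
                                             manual_scores.std()))
print("library: {:.3f} (std: {:.3f})".format(library_scores.mean(),
                                             library_scores.std()))
```

Each fold keeps roughly the same malignant-to-benign ratio as the full dataset, which makes the per-fold accuracies more comparable than a plain sequential split would.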



### Displaying the fitted parameters

Means (theta_) and variances (sigma_), one row per class

In [5]:
clf1.theta_

array([[  1.74628302e+01,   2.16049057e+01,   1.15365377e+02,
          9.78376415e+02,   1.02898491e-01,   1.45187783e-01,
          1.60774717e-01,   8.79900000e-02,   1.92908962e-01,
          6.26800943e-02,   6.09082547e-01,   1.21091462e+00,
          4.32392925e+00,   7.26724057e+01,   6.78009434e-03,
          3.22811651e-02,   4.18240094e-02,   1.50604717e-02,
          2.04724009e-02,   4.06240566e-03,   2.11348113e+01,
          2.93182075e+01,   1.41370330e+02,   1.42228632e+03,
          1.44845236e-01,   3.74824104e-01,   4.50605566e-01,
          1.82237311e-01,   3.23467925e-01,   9.15299528e-02],
       [  1.21465238e+01,   1.79147619e+01,   7.80754062e+01,
          4.62790196e+02,   9.24776471e-02,   8.00846218e-02,
          4.60576210e-02,   2.57174062e-02,   1.74185994e-01,
          6.28673950e-02,   2.84082353e-01,   1.22038011e+00,
          2.00032129e+00,   2.11351485e+01,   7.19590196e-03,
          2.14382465e-02,   2.59967356e-02,   9.85765266e-03,
       

In [6]:
clf1.sigma_

array([[  1.02173326e+01,   1.42173373e+01,   4.75373242e+02,
          1.34739779e+05,   4.81815426e-04,   3.22449895e-03,
          5.92495053e-03,   1.49958986e-03,   1.08385868e-03,
          3.80682228e-04,   1.18813656e-01,   2.32683420e-01,
          6.56663044e+00,   3.74671236e+03,   3.31912848e-04,
          6.60091666e-04,   7.88104335e-04,   3.53895368e-04,
          4.24421811e-04,   3.27745726e-04,   1.82627391e+01,
          2.93980930e+01,   8.63625413e+02,   3.55878793e+05,
          7.99631244e-04,   2.92132929e-02,   3.31128885e-02,
          2.45789394e-03,   5.87512976e-03,   7.85933509e-04],
       [  3.16166515e+00,   1.59166354e+01,   1.39025386e+02,
          1.79825177e+04,   5.03888232e-04,   1.45946646e-03,
          2.20553183e-03,   5.75977967e-04,   9.37249183e-04,
          3.68996780e-04,   1.29600188e-02,   3.46483988e-01,
          5.93359705e-01,   7.79882542e+01,   3.32938763e-04,
          5.90220654e-04,   1.40417263e-03,   3.56094783e-04,
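
The two arrays above have one row per class and one column per feature. As a sketch of where the numbers come from: each row of `theta_` is the feature-wise mean of one class's samples, and each row of the variance array is the feature-wise variance plus a small smoothing term. The code assumes a scikit-learn version that exposes `epsilon_`; the variance attribute is named `var_` in recent releases and `sigma_` in older ones, so it is looked up under either name.

```python
import numpy as np
import sklearn.datasets as ds
from sklearn.naive_bayes import GaussianNB

bc = ds.load_breast_cancer()
X, y = bc.data, bc.target
clf = GaussianNB().fit(X, y)

# Variance attribute: var_ in recent releases, sigma_ in older ones.
fitted_var = clf.var_ if hasattr(clf, "var_") else clf.sigma_

for k, label in enumerate(clf.classes_):
    X_k = X[y == label]                          # samples of this class
    manual_mean = X_k.mean(axis=0)               # per-feature mean
    manual_var = X_k.var(axis=0) + clf.epsilon_  # per-feature variance + smoothing
    print(label,
          np.allclose(manual_mean, clf.theta_[k]),
          np.allclose(manual_var, fitted_var[k]))
```

With default settings the smoothing term is about 1e-9 times the largest feature variance in the data, which would explain why several of the very small variances printed above sit near 3.3e-04 rather than closer to zero.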
       

## Logistic regression

In [7]:
clf2 = LogisticRegression()
clf2.fit(bc.data, bc.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [8]:
scores = cross_val_score(clf2, bc.data, bc.target, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(), scores.std()),
      end="\n\n")

mean: 0.951 (std: 0.019)
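
The estimator printed above shows `solver='liblinear'`, which was the default at the time. Since scikit-learn 0.22 the default solver is `lbfgs`, which may emit convergence warnings on these unscaled features with `max_iter=100`. The sketch below requests the solver explicitly as one way to stay close to the setup shown; exact scores can still differ between versions.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

bc = ds.load_breast_cancer()

# Request liblinear explicitly instead of relying on the version default.
clf = LogisticRegression(solver="liblinear")
scores = cross_val_score(clf, bc.data, bc.target, cv=10)
print("mean: {:.3f} (std: {:.3f})".format(scores.mean(), scores.std()))
```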



### Displaying the fitted parameters

Weights (coef_) and intercept (intercept_)

In [9]:
clf2.coef_

array([[  2.10133328e+00,   1.22048269e-01,  -5.73563909e-02,
         -3.59426002e-03,  -1.53787056e-01,  -3.99851347e-01,
         -6.43594083e-01,  -3.40198652e-01,  -2.25912127e-01,
         -2.60041112e-02,  -2.59521806e-02,   1.24814938e+00,
          1.61584272e-03,  -9.46636080e-02,  -1.68457243e-02,
          6.96298555e-03,  -4.64501120e-02,  -3.99883360e-02,
         -4.22211679e-02,   6.40657161e-03,   1.24531358e+00,
         -3.46269485e-01,  -1.25741558e-01,  -2.39928721e-02,
         -2.84647551e-01,  -1.11868895e+00,  -1.57192780e+00,
         -6.53832308e-01,  -6.90301939e-01,  -1.12870122e-01]])

In [10]:
clf2.intercept_

array([ 0.39033943])
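
To make the role of these parameters concrete: for a binary problem the model computes a linear score from `coef_` and `intercept_`, then passes it through the logistic sigmoid to obtain the probability of class 1 (benign here). The following is a minimal sketch, assuming the `liblinear` solver as in the output above; `z`, `manual_proba`, and `manual_pred` are names introduced for illustration.

```python
import numpy as np
import sklearn.datasets as ds
from scipy.special import expit  # logistic sigmoid, 1 / (1 + exp(-z))
from sklearn.linear_model import LogisticRegression

bc = ds.load_breast_cancer()
X, y = bc.data, bc.target
clf = LogisticRegression(solver="liblinear").fit(X, y)

# Linear score: one value per sample.
z = X @ clf.coef_.ravel() + clf.intercept_[0]

# Probability of class 1 and the resulting hard prediction.
manual_proba = expit(z)
manual_pred = (z > 0).astype(int)

print(np.allclose(z, clf.decision_function(X)))
print(np.allclose(manual_proba, clf.predict_proba(X)[:, 1]))
print(np.array_equal(manual_pred, clf.predict(X)))
```

Because the features are on very different scales, the magnitude of an individual weight says little about that feature's importance unless the inputs are standardized first.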