<a href="https://colab.research.google.com/github/MasahiroAraki/MLCourse/blob/master/Python/answer/14a_semi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 14: Semi-supervised Learning

## Exercise 1

Change the dataset to breast_cancer and carry out semi-supervised learning with the same procedure. Keep the following points in mind.

* If an error occurs during training, look at both the iris data and the breast_cancer data and consider how they differ.
* If performance is low, try tuning the hyperparameters of LabelPropagation.
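
One difference worth checking is the scale of the features. The following is a minimal sketch, not part of the original exercise, that compares the widest and narrowest feature ranges in each dataset. The helper name `feature_ranges` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_iris

def feature_ranges(X):
    """Return max - min for each feature (column)."""
    return X.max(axis=0) - X.min(axis=0)

iris_ranges = feature_ranges(load_iris().data)
bc_ranges = feature_ranges(load_breast_cancer().data)

# Ratio between the widest and the narrowest feature range in each dataset
iris_spread = iris_ranges.max() / iris_ranges.min()
bc_spread = bc_ranges.max() / bc_ranges.min()

print("iris: widest/narrowest range = {:.1f}".format(iris_spread))
print("breast_cancer: widest/narrowest range = {:.1f}".format(bc_spread))
```

A large ratio means that a few features dominate any Euclidean distance computed on the raw data.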


Load the libraries.

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.semi_supervised import LabelPropagation
from sklearn.preprocessing import normalize

In [2]:
bc = load_breast_cancer()
X = bc.data
y = bc.target
print(bc.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

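The features of the breast_cancer data have very different value ranges, so the data must be rescaled before using a distance-based algorithm. The cell below uses `normalize`, which scales each sample (row) to unit L2 norm; note that this is not the same as standardizing each feature.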
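
Row-wise normalization and per-feature standardization are easy to confuse. The sketch below contrasts the two on a toy array; `X_toy` and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy data: two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0]])

# normalize: each row is divided by its own L2 norm
X_rows = normalize(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler: each column is shifted to mean 0 and scaled to variance 1
X_cols = StandardScaler().fit_transform(X_toy)
col_means = X_cols.mean(axis=0)
col_stds = X_cols.std(axis=0)

print("row norms after normalize:", row_norms)
print("column means after StandardScaler:", col_means)
print("column stds after StandardScaler:", col_stds)
```

Either rescaling changes the pairwise distances, so a suitable `gamma` for one is unlikely to suit the other.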

In [3]:
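X = normalize(X)  # scale each sample (row) to unit L2 norm
# Hide the labels of a random 70% of the samples; -1 marks a sample as unlabeled
unlabeled_points = np.random.choice(np.arange(y.size), int(y.size * 0.7), replace=False)
labels = np.copy(y)
labels[unlabeled_points] = -1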

In [4]:
lp = LabelPropagation()
lp.fit(X, labels)

LabelPropagation(gamma=20, kernel='rbf', max_iter=1000, n_jobs=None,
                 n_neighbors=7, tol=0.001)

In [5]:
lp.score(X[unlabeled_points], y[unlabeled_points])

0.628140703517588

Even classifying every sample as the majority class (benign) would give an accuracy of $\frac{357}{212+357} \approx 0.63$, so the model has not learned anything useful.
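
The majority-class baseline can be checked with a few lines of plain Python. This sketch rebuilds the label list from the class counts in the formula above (212 malignant, encoded as 0, and 357 benign, encoded as 1) rather than loading the dataset.

```python
from collections import Counter

# Class counts taken from the formula above: 212 malignant (0), 357 benign (1)
y_all = [0] * 212 + [1] * 357

counts = Counter(y_all)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y_all)

print("majority class:", majority_class)
print("baseline accuracy: {:.3f}".format(baseline_accuracy))
```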

Let us adjust gamma, the width parameter of the RBF kernel. The larger it is, the more each point is influenced only by data close to it.
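
The scikit-learn documentation gives the RBF kernel used by LabelPropagation as $\exp(-\gamma \|x - y\|^2)$. The sketch below evaluates that expression for two gamma values; the distances are hypothetical and chosen only to illustrate small distances such as those between unit-norm samples.

```python
import math

def rbf_weight(distance, gamma):
    """RBF kernel value exp(-gamma * d^2) for two points at Euclidean distance d."""
    return math.exp(-gamma * distance ** 2)

# Hypothetical distances, chosen only for illustration
for d in (0.01, 0.05):
    for gamma in (20, 2000):
        print("d={:.2f}, gamma={:4d}: weight={:.4f}".format(d, gamma, rbf_weight(d, gamma)))
```

With gamma=20 both weights stay close to 1, so nearer and farther points have almost the same influence. With gamma=2000 the weight at the larger distance falls below 0.01. This may be why the default setting performed no better than the majority-class baseline here, although that explanation is an assumption rather than something the notebook verifies.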

In [6]:
lp = LabelPropagation(gamma=2000)
lp.fit(X, labels)

LabelPropagation(gamma=2000, kernel='rbf', max_iter=1000, n_jobs=None,
                 n_neighbors=7, tol=0.001)

In [7]:
lp.score(X[unlabeled_points], y[unlabeled_points])

0.9120603015075377

The result looks good, so we repeat the experiment many times to examine how the fraction of labeled data affects performance.

In [8]:
labeled_percent = [0.05, 0.1, 0.2, 0.3, 0.5]
num = y.size
for labeled in labeled_percent:
    score = 0
    # Average over 100 random choices of which samples keep their labels
    for i in range(100):
        unlabeled_points = np.random.choice(np.arange(num), int(num - num*labeled), replace=False)
        labels = np.copy(y)
        labels[unlabeled_points] = -1  # -1 marks a sample as unlabeled
        lp.fit(X, labels)
        # Evaluate only on the samples whose labels were hidden
        score += lp.score(X[unlabeled_points], y[unlabeled_points])
    print("{0}{1:4.1f}{2}{3:6.3f}".format("labeled:", labeled*100, "%, score=", score/100))



labeled: 5.0%, score= 0.849
labeled:10.0%, score= 0.891
labeled:20.0%, score= 0.907
labeled:30.0%, score= 0.909
labeled:50.0%, score= 0.917


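On the breast_cancer data, once the LabelPropagation hyperparameters are tuned well, about 10% labeled data appears to be enough to reach reasonable performance (an accuracy of about 0.89 in the run above).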
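
The manual adjustment of gamma could also be done as a small sweep. The following is a sketch under stated assumptions, not the notebook's procedure: the seed, the candidate gamma values, and the variable names (`X_bc`, `hidden`, `gamma_scores`) are all introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import normalize
from sklearn.semi_supervised import LabelPropagation

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc = normalize(X_bc)  # same row-wise scaling as in the notebook

# Fixed seed (an assumption for reproducibility; the notebook does not set one)
rng = np.random.default_rng(0)
hidden = rng.choice(y_bc.size, int(y_bc.size * 0.7), replace=False)
partial_labels = np.copy(y_bc)
partial_labels[hidden] = -1  # -1 marks a sample as unlabeled

# Candidate values chosen for illustration only
gamma_scores = {}
for gamma in (20, 200, 2000):
    model = LabelPropagation(gamma=gamma)
    model.fit(X_bc, partial_labels)
    gamma_scores[gamma] = model.score(X_bc[hidden], y_bc[hidden])

for gamma, acc in gamma_scores.items():
    print("gamma={:5d}: accuracy on hidden labels = {:.3f}".format(gamma, acc))
```

Scoring against the hidden labels mirrors the notebook, but in a real semi-supervised setting those labels would not be available. Gamma would then have to be chosen on a held-out part of the labeled data.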

## Exercise 2

For CIFAR-10 classification with a CNN, change the parameters and methods of data augmentation and evaluate the resulting performance.

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [10]:
cf10 = keras.datasets.cifar10
(X_train, y_train), (X_test, y_test) = cf10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [11]:
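data_augmentation = keras.Sequential(
    [
        # Note: newer TensorFlow releases also expose these layers directly
        # as layers.RandomFlip, layers.RandomRotation, and layers.RandomZoom.
        layers.experimental.preprocessing.RandomFlip("horizontal"),
        # Rotate by a random angle within +/- 0.2 of a full turn
        layers.experimental.preprocessing.RandomRotation(0.2),
        # Zoom in or out by up to 20%
        layers.experimental.preprocessing.RandomZoom(0.2),
    ]
)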

In [12]:
# Scale pixel values from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0
# One-hot encode the 10 class labels
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
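
For integer class labels, the one-hot encoding can be reproduced in NumPy. This sketch is an illustration of the resulting layout, not Keras's implementation; `y_small` is a hypothetical label array in the same `(N, 1)` shape that the CIFAR-10 loader returns.

```python
import numpy as np

# Hypothetical labels in the same (N, 1) integer layout that CIFAR-10 uses
y_small = np.array([[3], [0], [9]])

# Row k of the 10x10 identity matrix is the one-hot vector for class k
one_hot = np.eye(10)[y_small.ravel()]

print(one_hot.shape)
print(one_hot.argmax(axis=1))
```

Each row contains a single 1 at the index of its class, which is the target format that `categorical_crossentropy` expects.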

In [13]:
model3 = keras.Sequential([
    data_augmentation,
    layers.Conv2D(32, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.BatchNormalization(),
    #layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model3.build(input_shape=(None, 32, 32, 3))
model3.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential (Sequential)      (None, 32, 32, 3)         0         
_________________________________________________________________
conv2d (Conv2D)              (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0         
_________________________________________________________________
batch_normalization (BatchNo (None, 6, 6, 32)          128       
_________________________________________________________________
flatten (Flatten)            (None, 1152)             
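
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model above: convolutions with no padding and stride 1, and max pooling with stride equal to the pool size. The helper names are introduced here for illustration.

```python
def conv_out(size, kernel):
    """Output size of an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size = 32
size = conv_out(size, 3)   # first Conv2D
size = pool_out(size, 2)   # first MaxPooling2D
size = conv_out(size, 3)   # second Conv2D
size = pool_out(size, 2)   # second MaxPooling2D
flattened = size * size * 32

print("final spatial size:", size, "flattened length:", flattened)
print("parameters:", conv_params(3, 3, 32), conv_params(3, 32, 32))
```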

In [14]:
model3.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(X_train, y_train, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb1c04e17d0>

In [15]:
test_loss, test_acc = model3.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

Test accuracy: 0.5947999954223633


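Applying extreme transformations in data augmentation lowers performance; with the settings above, test accuracy is only about 0.59. Data augmentation should be designed carefully, after examining the properties of the target data.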
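
To judge how extreme a rotation setting is, it helps to convert it to degrees. The Keras documentation describes a float `RandomRotation` factor as a fraction of $2\pi$. The sketch below applies that conversion; the helper name and the milder value 0.05 are hypothetical, shown only for comparison.

```python
def rotation_range_degrees(factor):
    """Maximum rotation in degrees for a factor given as a fraction of a full turn."""
    return factor * 360.0

for factor in (0.2, 0.05):
    print("factor={:.2f}: up to +/- {:.0f} degrees".format(factor, rotation_range_degrees(factor)))
```

A rotation of up to 72 degrees in either direction is large for 32x32 images of objects that usually appear upright, so a smaller factor would be a natural first variation to try in this exercise.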