Reviewing some past code, gradually organizing my own codebase, and building the habit of writing data science code every day!

## Goal: document and modularize past code so that it can be reused.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import warnings
import jieba

warnings.filterwarnings("ignore")
%matplotlib inline

## Day27 - Designing ML System(ch4), handling the lack of labels
- **Weak Supervision**: Use labeling functions (LFs) that encode heuristics to generate labeled data.
- Semi-supervision
- Transfer Learning
- Active Learning

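> Build labeling functions (programmatic labeling) to create labeled data! The idea is that, compared with hand labeling, the rules still capture expert knowledge, and they scale.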

In [15]:
"""
    Suppose we are building a malicious-comment detection model.
"""

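def lf_has_fuck(data):
    # Labeling function: return 1 (malicious) if the keyword appears, ignoring case.
    if 'fuck' in data.lower():
        return 1
    return 0

def lf_has_bastard(data):
    # Labeling function: return 1 (malicious) if the keyword appears, ignoring case.
    if 'bastard' in data.lower():
        return 1
    return 0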

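class ProgrammaticLabel():
    def __init__(self, lfs):
        """
            A simple implementation of programmatic labeling.
        """
        self.lfs = lfs

    def _label_one(self, row_data):
        # Any-match: return the first non-zero label; 0 if no LF fires
        # (this also covers an empty list of LFs).
        for lf in self.lfs:
            label = lf(row_data)
            if label:
                return label
        return 0

    def label(self, data):
        """
            Not yet complete: labels could be combined by voting or by any-match.
            For now, any-match is used.
        """
        if isinstance(data, list):
            return [self._label_one(row_data) for row_data in data]
        # A single sample still returns a one-element list.
        return [self._label_one(data)]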

In [16]:
text = [
    "You are such a bastard. I don't wanna talk to you anymore.",
    "It's a beautiful day. I wanna go outside and have fun.",
    "Fuck!!!!!!!!!! Today is fucking crazy!!!!!"
]
        
labeler = ProgrammaticLabel([lf_has_bastard, lf_has_fuck])

labeler.label(text)

[1, 0, 1]

In [17]:
text = "Chill day!"
labeler.label(text)

[0]
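
The `label` docstring mentions voting as an alternative to any-match. Below is a minimal sketch of majority voting, not part of the original notebook. It assumes a hypothetical convention where an LF returns `-1` (`ABSTAIN`) when it has no opinion. The three LFs and the tie-breaking rule (fall back to `default`) are illustrative choices only.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical convention: an LF returns -1 when it has no opinion

def lf_has_insult(text):
    # Votes "malicious" (1) if a known insult appears, otherwise abstains.
    return 1 if any(w in text.lower() for w in ("fuck", "bastard")) else ABSTAIN

def lf_is_polite(text):
    # Votes "benign" (0) if a polite word appears, otherwise abstains.
    return 0 if any(w in text.lower() for w in ("please", "thank")) else ABSTAIN

def lf_many_exclamations(text):
    # Votes "malicious" (1) if there are at least five exclamation marks.
    return 1 if text.count("!") >= 5 else ABSTAIN

def majority_vote(text, lfs, default=0):
    # Collect the non-abstaining votes from every LF.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return default
    ranked = Counter(votes).most_common()
    # On a tie between the two most common labels, fall back to the default.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]

lfs = [lf_has_insult, lf_is_polite, lf_many_exclamations]
samples = ["Thank you, you bastard!!!!!", "Chill day!", "Please stop, bastard."]
labels = [majority_vote(s, lfs) for s in samples]
print(labels)
```

Compared with any-match, voting lets a "benign" LF outweigh a single noisy "malicious" LF, at the cost of needing an explicit rule for ties and for samples where every LF abstains.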

## Day28 - Designing ML System(ch4), handling the lack of labels
- Weak Supervision
- **Semi-supervision**: Start from a small, limited set of labeled data to train a model, then predict on the unlabeled data and treat those predictions as new labels.
- Transfer Learning
- Active Learning

1. Use the full training data (50000)
2. Use part of the training data (25000) + semi-supervision (25000)
3. Use part of the training data (25000)

In [31]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.model_selection import train_test_split

In [29]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

In [30]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((50000, 32, 32, 3), (50000, 1), (10000, 32, 32, 3), (10000, 1))

In [32]:
# Split the data: half stays labeled (x_train), the other half (x_val) acts as the unlabeled pool

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.5, random_state=222)

In [33]:
x_train.shape, x_val.shape, x_test.shape

((25000, 32, 32, 3), (25000, 32, 32, 3), (10000, 32, 32, 3))

In [35]:
# Scale pixel values to [0, 1]

x_train = x_train / 255.0
x_val = x_val / 255.0
x_test = x_test / 255.0

In [55]:
# Define the model

def get_cnn_model(name):
    """
        A small CNN, for quick verification only.
    """
    inputs = keras.Input(shape=(32, 32, 3))
    x = layers.Conv2D(filters=16, kernel_size=2, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters=16, kernel_size=2, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(units=10, activation='softmax')(x)
    
    return keras.Model(inputs, outputs, name=name)

In [56]:
batch_size = 32
epochs = 10

In [57]:
# 1. Full data

full_ds_model = get_cnn_model('full_ds')
full_ds_model.summary()
full_ds_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
full_ds_model.fit(
    np.concatenate([x_train, x_val], axis=0), 
    np.concatenate([y_train, y_val], axis=0), 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Model: "full_ds"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_9 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_16 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_16 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_17 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_17 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_8 (Flatten)         (None, 1024)              0   

<keras.callbacks.History at 0x23abba21748>

In [64]:
# 2. Partial data + semi-supervised (the process could be refined, e.g. by generating new labels iteratively; this is the crudest version)


semi_model = get_cnn_model('semi-supervised')
semi_model.summary()
semi_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
semi_model.fit(
    x_train, 
    y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)


Model: "semi-supervised"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_12 (InputLayer)       [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_22 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_22 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_23 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_23 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_11 (Flatten)        (None, 1024)              0

<keras.callbacks.History at 0x23abdf875f8>

In [65]:
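# Predict pseudo-labels for the unlabeled half (x_val), then retrain on labeled + pseudo-labeled data
y_val_semi = np.argmax(semi_model.predict(x_val), axis=1)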


semi_model.fit(
    np.concatenate([x_train, x_val], axis=0), 
    np.concatenate([y_train, y_val_semi.reshape(-1, 1)], axis=0),
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x23abe056080>
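
One possible refinement of the pseudo-labeling step above is to keep only the predictions the model is confident about, rather than all 25000. The sketch below is an assumption-laden illustration, not the notebook's method: `select_confident` and the `0.9` threshold are hypothetical, and the toy probabilities stand in for the output of `semi_model.predict(x_val)`.

```python
import numpy as np

def select_confident(probs, threshold=0.9):
    """Return indices of confident rows and their pseudo-labels as a column vector."""
    probs = np.asarray(probs)
    # Confidence is the highest class probability in each row.
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    # Shape (n_kept, 1) matches the (N, 1) label layout of CIFAR-10 in Keras.
    pseudo_labels = probs[keep].argmax(axis=1).reshape(-1, 1)
    return keep, pseudo_labels

# Toy stand-in for model.predict(...) output: 3 samples, 2 classes.
toy_probs = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
keep, pseudo_labels = select_confident(toy_probs, threshold=0.9)
print(keep, pseudo_labels.ravel())
```

With the notebook's variables, the kept subset could then be appended to the labeled data, for example `np.concatenate([x_train, x_val[keep]], axis=0)` together with `np.concatenate([y_train, pseudo_labels], axis=0)`. Repeating predict, select, and retrain for a few rounds would give the iterative variant mentioned in the cell comment.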

In [60]:
# 3. Partial data


part_of_ds_model = get_cnn_model('part_of_ds')
part_of_ds_model.summary()
part_of_ds_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
part_of_ds_model.fit(
    x_train, 
    y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Model: "part_of_ds"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_20 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_20 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_21 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_21 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_10 (Flatten)        (None, 1024)              0

<keras.callbacks.History at 0x23abca883c8>
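
To compare the three setups side by side, a small helper can compute test accuracy from each model's predicted probabilities. This is a sketch under assumptions: `compare_accuracy` is a hypothetical helper that is not in the original notebook, and the two dummy predictors below only stand in for the `predict` methods of the trained Keras models.

```python
import numpy as np

def compare_accuracy(predict_fns, x, y):
    """Map each name to the accuracy of its predictor on (x, y).

    predict_fns: dict of name -> callable returning class probabilities (N, C).
    y: integer labels, shape (N,) or (N, 1).
    """
    y = np.asarray(y).reshape(-1)
    results = {}
    for name, predict in predict_fns.items():
        pred = np.argmax(predict(x), axis=1)
        results[name] = float(np.mean(pred == y))
    return results

# Toy data: 4 samples, 2 classes.
toy_x = np.zeros((4, 2))
toy_y = np.array([[0], [1], [1], [0]])

def always_zero(x):
    # Always predicts class 0 with full confidence.
    return np.tile([1.0, 0.0], (len(x), 1))

def perfect(x):
    # One-hot rows that match toy_y exactly.
    return np.eye(2)[toy_y.reshape(-1)]

scores = compare_accuracy({"always_zero": always_zero, "perfect": perfect}, toy_x, toy_y)
print(scores)
```

With the models above, the call would look like `compare_accuracy({"full_ds": full_ds_model.predict, "semi": semi_model.predict, "part_of_ds": part_of_ds_model.predict}, x_test, y_test)`.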