Reviewing some past code and gradually organizing my own codebase, while building the habit of writing data science code every day!

## Goal: document and modularize past code so that it can be reused.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import warnings
import jieba

warnings.filterwarnings("ignore")
%matplotlib inline

## Day27 - Designing ML System(ch4), handling the lack of labels
- **Weak Supervision**: use label functions to generate labeled data heuristically.
- Semi-supervision
- Transfer Learning
- Active Learning

> Build label functions (programmatic labeling) to create labeled data! The idea is that, compared with hand labeling, they still encode expert knowledge, and they scale!

In [15]:
"""
    Assume this is a malicious-comment detection model.
"""

def lf_has_fuck(data):
    if 'fuck' in data:
        return 1
    return 0
    
def lf_has_bastard(data):
    if 'bastard' in data:
        return 1
    return 0

class ProgrammaticLabel():
    def __init__(self, lfs):
        """
            A minimal implementation of programmatic labeling.
        """
        self.lfs = lfs

    def _label_one(self, row_data):
        # "Any match" rule: a row is positive if at least one label function fires.
        return int(any(lf(row_data) for lf in self.lfs))

    def label(self, data):
        """
            Not finished yet: labels could be combined by voting or by an
            "any match" rule. This version uses "any match" and always
            returns a list, for a single string as well as a list of strings.
        """
        if isinstance(data, list):
            return [self._label_one(row_data) for row_data in data]
        return [self._label_one(data)]

In [16]:
text = [
    "You are such a bastard. I don't wanna talk to you anymore.",
    "It's a beautiful day. I wanna go outside and have fun.",
    "Fuck!!!!!!!!!! Today is fucking crazy!!!!!"
]
        
labeler = ProgrammaticLabel([lf_has_bastard, lf_has_fuck])

labeler.label(text)

[1, 0, 1]

In [17]:
text = "Chill day!"
labeler.label(text)

[0]
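
The `label` docstring above mentions voting as an alternative to the "any match" rule. Below is a minimal sketch of a majority vote, assuming each label function may also abstain. The `ABSTAIN` sentinel, the three label functions and the tie-breaking rule are all hypothetical and only serve the illustration.

```python
from collections import Counter

ABSTAIN = -1  # hypothetical sentinel: the label function has no opinion


def lf_has_insult(text):
    # Hypothetical label function: votes "malicious" (1) on a keyword hit.
    return 1 if "bastard" in text.lower() else ABSTAIN


def lf_is_friendly(text):
    # Hypothetical label function: votes "benign" (0) on a friendly phrase.
    return 0 if "have fun" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    # Hypothetical heuristic: heavy use of "!" hints at an angry comment.
    return 1 if text.count("!") >= 5 else ABSTAIN


def majority_vote(text, lfs, default=0):
    """Combine label functions by majority vote, ignoring abstentions."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return default  # no label function fired
    counts = Counter(votes)
    top = max(counts.values())
    # Assumption: ties go to the larger label (the positive class here).
    return max(label for label, c in counts.items() if c == top)


lfs = [lf_has_insult, lf_is_friendly, lf_many_exclamations]
comments = [
    "You are such a bastard.",
    "It's a beautiful day. I wanna go outside and have fun.",
    "Chill day!",
]
labels = [majority_vote(c, lfs) for c in comments]
print(labels)
```

Letting a label function abstain keeps a weak heuristic from voting on text it knows nothing about, which the 0/1-only functions above cannot express.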

## Day28 - Designing ML System(ch4), handling the lack of labels
- Weak Supervision
- **Semi-supervision**: 透過有限、少量的標籤資料當作初始，進而訓練模型，再在無標籤資料上進行預測，當做新的標籤資料。
- Transfer Learning
- Active Learning

1. Use the full training data (50000)
2. Use part of the training data (25000) + semi-supervision (25000)
3. Use part of the training data only (25000)
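
One common refinement of the pseudo-labeling idea is to keep only the predictions the model is confident about, instead of accepting every predicted label. The sketch below assumes softmax outputs of shape `(n_samples, n_classes)`; the function name, the `0.9` threshold and the toy numbers are made up for illustration.

```python
import numpy as np


def select_confident_pseudo_labels(probs, threshold=0.9):
    """Return (indices, labels) of rows whose top class probability >= threshold."""
    probs = np.asarray(probs)
    confidence = probs.max(axis=1)           # top class probability per row
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)  # pseudo-labels for the kept rows


# Toy softmax outputs for 4 unlabeled samples and 3 classes (made-up numbers).
toy_probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
    [0.50, 0.49, 0.01],
])
idx, pseudo = select_confident_pseudo_labels(toy_probs, threshold=0.9)
print(idx, pseudo)
```

In an iterative version, the selected rows would be moved into the training set, the model retrained, and the remaining unlabeled rows scored again until few confident predictions are left.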

In [31]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.model_selection import train_test_split

In [29]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

In [30]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((50000, 32, 32, 3), (50000, 1), (10000, 32, 32, 3), (10000, 1))

In [32]:
# Split the data

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.5, random_state=222)

In [33]:
x_train.shape, x_val.shape, x_test.shape

((25000, 32, 32, 3), (25000, 32, 32, 3), (10000, 32, 32, 3))

In [35]:
# Normalize the pixel values to [0, 1]

x_train = x_train / 255.0
x_val = x_val / 255.0
x_test = x_test / 255.0

In [55]:
# Define the model

def get_cnn_model(name):
    """
        A small model, for quick validation only.
    """
    inputs = keras.Input(shape=(32, 32, 3))
    x = layers.Conv2D(filters=16, kernel_size=2, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters=16, kernel_size=2, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(units=10, activation='softmax')(x)
    
    return keras.Model(inputs, outputs, name=name)

In [56]:
batch_size = 32
epochs = 10

In [57]:
# 1. Full dataset

full_ds_model = get_cnn_model('full_ds')
full_ds_model.summary()
full_ds_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
full_ds_model.fit(
    np.concatenate([x_train, x_val], axis=0), 
    np.concatenate([y_train, y_val], axis=0), 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Model: "full_ds"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_9 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_16 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_16 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_17 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_17 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_8 (Flatten)         (None, 1024)              0   

<keras.callbacks.History at 0x23abba21748>

In [64]:
# 2. Partial data + semi-supervised (the process could be more refined, e.g. generating new labels iteratively; this is the crudest version)


semi_model = get_cnn_model('semi-supervised')
semi_model.summary()
semi_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
semi_model.fit(
    x_train, 
    y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)


Model: "semi-supervised"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_12 (InputLayer)       [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_22 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_22 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_23 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_23 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_11 (Flatten)        (None, 1024)          

<keras.callbacks.History at 0x23abdf875f8>

In [65]:
y_val_semi = np.argmax(semi_model.predict(x_val), axis=1)


semi_model.fit(
    np.concatenate([x_train, x_val], axis=0), 
    np.concatenate([y_train, y_val_semi.reshape(-1, 1)], axis=0),
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x23abe056080>

In [66]:
# 3. Partial data only


part_of_ds_model = get_cnn_model('part_of_ds')
part_of_ds_model.summary()
part_of_ds_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
part_of_ds_model.fit(
    x_train, 
    y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(x_test, y_test)
)

Model: "part_of_ds"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_13 (InputLayer)       [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_24 (Conv2D)          (None, 32, 32, 16)        208       
                                                                 
 max_pooling2d_24 (MaxPoolin  (None, 16, 16, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_25 (Conv2D)          (None, 16, 16, 16)        1040      
                                                                 
 max_pooling2d_25 (MaxPoolin  (None, 8, 8, 16)         0         
 g2D)                                                            
                                                                 
 flatten_12 (Flatten)        (None, 1024)              0

<keras.callbacks.History at 0x23ac129aa20>

## Day29 - Designing ML System(ch4), handling the lack of labels
- Weak Supervision
- Semi-supervision
- Transfer Learning
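- **Active Learning**: reduce the amount of data needed by labeling more efficiently, e.g. by labeling the samples the classifier is least certain about, or the samples that several models with different parameters tend to get wrong.

1. Split the data into a training set and a held-out set.
2. Train the model, predict on the held-out set, and pick the samples with the lowest prediction confidence (those whose highest class probability is still low) as the samples to label for active learning.
3. Label them and add them to the training data for retraining, then compare against random selection.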

In [70]:
# 1. Split into training and held-out data

x_train.shape, x_val.shape

((25000, 32, 32, 3), (25000, 32, 32, 3))

In [73]:
# 2. Reuse the model from Day28 directly

y_val_predicted = part_of_ds_model.predict(x_val)



In [101]:
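def get_the_most_uncertain(y, top_k=10000):
    # np.argsort sorts ascending, so the last top_k indices are the rows whose
    # smallest class probability is largest. This is used here as a proxy for
    # a flat (uncertain) softmax output.
    return np.argsort(y.min(axis=1))[-top_k:]


# Take the 10,000 most uncertain samples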
val_indices = get_the_most_uncertain(y_val_predicted)
x_new_label_by_active_learning = x_val[val_indices]
y_new_label_by_active_learning = y_val[val_indices]

print(x_new_label_by_active_learning.shape, y_new_label_by_active_learning.shape)

(10000, 32, 32, 3) (10000, 1)
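
For comparison, two widely used uncertainty scores are least confidence (one minus the top class probability) and the entropy of the predicted distribution. The sketch below is not the selection rule used in the cell above. The function names and the toy probabilities are assumptions for illustration.

```python
import numpy as np


def least_confidence(probs):
    """1 - max class probability; higher means more uncertain."""
    return 1.0 - probs.max(axis=1)


def entropy(probs, eps=1e-12):
    """Shannon entropy of each predicted distribution; higher means more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)


def most_uncertain(probs, top_k, score=least_confidence):
    """Indices of the top_k rows with the highest uncertainty score."""
    return np.argsort(score(probs))[::-1][:top_k]


# Made-up softmax outputs for 3 samples and 3 classes.
toy_probs = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # almost uniform
    [0.60, 0.30, 0.10],
])
picked = most_uncertain(toy_probs, top_k=2)
picked_by_entropy = most_uncertain(toy_probs, top_k=1, score=entropy)
print(picked, picked_by_entropy)
```

With the notebook's variables, `most_uncertain(y_val_predicted, top_k=10000)` would return indices into `x_val` in the same way as `get_the_most_uncertain`, so the two rules could be compared under the same labeling budget.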


In [110]:
# 3. Label them and add them to the training data for retraining, vs. random selection


x_train_active_learning = np.concatenate([x_train, x_new_label_by_active_learning])
y_train_active_learning = np.concatenate([y_train, y_new_label_by_active_learning])


active_model = get_cnn_model('active_model')
# active_model.summary()
active_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
active_model.fit(
    x_train_active_learning, 
    y_train_active_learning, 
    batch_size=batch_size, 
    epochs=epochs*2,
    validation_data=(x_test, y_test)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x23ac5fc3e48>

In [111]:
# Random selection
_, x_val_random, _, y_val_random = train_test_split(x_val, y_val, test_size=0.5, random_state=222)

x_val_random = x_val_random[:10000]
y_val_random = y_val_random[:10000]

x_train_random = np.concatenate([x_train, x_val_random])
y_train_random = np.concatenate([y_train, y_val_random])


random_model = get_cnn_model('random_model')
random_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
random_model.fit(
    x_train_random, 
    y_train_random, 
    batch_size=batch_size, 
    epochs=epochs*2,
    validation_data=(x_test, y_test)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x23ac610a400>

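This is only a single run and not very rigorous, but it roughly shows that choosing which samples to label through active learning lets the model perform better with the same number of labels.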

## Day30 - Designing ML System(ch4), class imbalance
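1. Use the right evaluation metrics, such as precision, recall and F1-score, instead of relying on accuracy alone.
2. Data-level methods: resampling such as SMOTE, and other data augmentation methods.
3. Algorithm-level methods: adjust the loss function, e.g. weight the classes so that errors on the minority class cost more.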
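
A small sketch of points 1 and 3, under made-up data: a classifier that always predicts the majority class looks good on accuracy but scores zero on precision, recall and F1, and inverse-frequency class weights are one simple way to make minority errors cost more. The helper names and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def inverse_frequency_weights(y):
    """Class weights n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Made-up imbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prf = precision_recall_f1(y_true, y_pred)
weights = inverse_frequency_weights(y_true)
print(accuracy, prf, weights)
```

A dictionary like `weights` has the shape Keras expects for the `class_weight` argument of `Model.fit`, which is one way to apply the algorithm-level idea to the models used earlier in this notebook.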

## Day31 - Designing ML System(ch4), Data-augmentation
1. **Simple label-preserving transformations**:
    - CV: flipping and similar operations
    - NLP: replacing words with synonyms
2. Perturbation
3. Data Synthesis
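
As a sketch of the CV case, a random horizontal flip on a batch of images shaped `(N, H, W, C)`, the layout of the CIFAR-10 arrays used earlier. The function name and the tiny fake batch are assumptions. Whether a flip preserves the label depends on the task (it would not for digits or text, for example).

```python
import numpy as np


def random_horizontal_flip(images, rng, p=0.5):
    """Flip each image left-right with probability p. images: (N, H, W, C)."""
    images = np.asarray(images)
    flip = rng.random(len(images)) < p       # one coin toss per image
    out = images.copy()
    out[flip] = images[flip][:, :, ::-1, :]  # reverse the width axis
    return out, flip


rng = np.random.default_rng(0)
batch = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)  # 2 tiny fake images
augmented, flipped = random_horizontal_flip(batch, rng, p=1.0)
print(batch[0, 0, :, 0], augmented[0, 0, :, 0])
```

The labels are reused unchanged for the flipped images, which is what makes the transformation label-preserving.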

In [5]:
# 1. Simple label-preserving transformations: NLP
import nltk


# First, you're going to need to import wordnet:
from nltk.corpus import wordnet
  
# Then, we're going to use the term "program" to find synsets like so:
syns = wordnet.synsets("program")
  
# An example of a synset:
print(syns[0].name())
  
# Just the word:
print(syns[0].lemmas()[0].name())
  
# Definition of that first synset:
print(syns[0].definition())
  
# Examples of the word in use in sentences:
print(syns[0].examples())

plan.n.01
plan
a series of steps to be carried out or goals to be accomplished
['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [16]:
syns[0].lemmas()[0].name()

'plan'
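
Building on the lookup above, synonym replacement could be sketched as follows. To keep the example self-contained it uses a small hand-written synonym table (hypothetical) rather than WordNet. In practice the candidates could come from the lemma names of `wordnet.synsets(word)`.

```python
import random

# Hypothetical, hand-written synonym table, used only for illustration.
SYNONYMS = {
    "program": ["plan", "programme"],
    "big": ["large", "huge"],
}


def replace_synonyms(sentence, synonyms, rng, p=0.5):
    """Replace each word that has synonyms with a random synonym, with probability p."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)  # no synonym known, or not selected this time
    return " ".join(out)


rng = random.Random(0)
augmented = replace_synonyms("the program is big", SYNONYMS, rng, p=1.0)
print(augmented)
```

A synonym can still change the meaning in context, so it is worth spot-checking a sample of the augmented sentences before training on them.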

## Day32 - Designing ML System(ch4), Data-augmentation
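1. Simple label-preserving transformations:
    - CV: flipping and similar operations
    - NLP: replacing words with synonyms
2. **Perturbation**: add noise to (perturb) the input to create new samples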
3. Data Synthesis

In [18]:
# Use BERT's masked language model (MLM) for this

from transformers import TFBertForMaskedLM
from transformers import BertTokenizer, TFBertModel, TFBertForSequenceClassification

In [26]:
mlm = TFBertForMaskedLM.from_pretrained('bert-base-chinese')

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
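# Original sentence (roughly: "Data science involves programming, math, statistics,
# machine learning and domain knowledge, and is a very hot field right now!"):
# 資料科學涉及程式開發、數學、統計、機器學習以及領域知識，是目前很火紅的領域！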
inputs = tokenizer("[MASK]料科學涉及程式開發、數學、統計、[MASK]器學習以及領域知識，是目前很火紅的領域！", return_tensors="tf")
inputs["label"] = tokenizer("資料科學涉及程式開發、數學、統計、機器學習以及領域知識，是目前很火紅的領域！", return_tensors="tf")["input_ids"]
outputs = mlm(inputs)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


In [27]:
print(inputs)
print()
print(outputs)

{'input_ids': <tf.Tensor: shape=(1, 40), dtype=int32, numpy=
array([[ 101,  103, 3160, 4906, 2119, 3868, 1350, 4923, 2466, 7274, 4634,
         510, 3149, 2119,  510, 5186, 6243,  510,  103, 1690, 2119, 5424,
         809, 1350, 7526, 1818, 4761, 6352, 8024, 3221, 4680, 1184, 2523,
        4125, 5148, 4638, 7526, 1818, 8013,  102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 40), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(1, 40), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>, 'label': <tf.Tensor: shape=(1, 40), dtype=int32, numpy=
array([[ 101, 6536, 3160, 4906, 2119, 3868, 1350, 4923, 2466, 7274, 4634,
         510, 3149, 2119,  510, 5186, 6243,  510, 3582, 1690, 2119, 5424,
         809, 1350, 7526, 1818, 4761, 6352, 8024,

In [50]:
# Pick the most likely token at each position
import numpy as np

for id_ in np.argmax(outputs['logits'], axis=-1)[0]:
    #print(id_)
    print(tokenizer.decode(token_ids=int(id_)), end='/')

，/資/料/科/學/涉/及/程/式/開/發/、/數/學/、/統/計/、/機/器/學/習/以/及/領/域/知/識/，/是/目/前/很/火/紅/的/領/域/！/料/

In [67]:
# Randomly perturb one character
import random

sentence = '資料科學涉及程式開發、數學、統計、機器學習以及領域知識，是目前很火紅的領域！'


def get_random_mlm_sentence(sentence):
    length = len(sentence)
    # random.randint(0, length) is inclusive and could index one past the end,
    # so draw the position from range(length) instead.
    idx = random.randrange(length)
    sent = list(sentence)
    sent[idx] = '[MASK]'
    sent = ''.join(sent)
    print(sent)
    
    inputs = tokenizer(sent, return_tensors='tf')
    inputs['label'] = tokenizer(sentence, return_tensors='tf')['input_ids']
    outputs = mlm(inputs)
    
    # Decode the most likely token at every position, one token at a time.
    out_sentence = []
    for id_ in np.argmax(outputs['logits'], axis=-1)[0]:
        #print(id_)
        word = tokenizer.decode(token_ids=int(id_))
        out_sentence.append(word)
    
    return ''.join(out_sentence)

In [81]:
get_random_mlm_sentence(sentence)

資料科學涉及程式開發、數學、統計、機器學習以及領域知識，是目前很[MASK]紅的領域！


'，資料科學涉及程式開發、數學、統計、機器學習以及領域知識，是目前很火紅的領域！的'

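The head and tail of the output look wrong: the extra characters are the model's predictions at the [CLS] and [SEP] positions, which get decoded along with everything else. Need to think about how to fix this.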
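
One possible fix, sketched on plain id arrays so it runs without the model: take the prediction only where the input was `[MASK]`, keep the original ids elsewhere, and drop the `[CLS]` and `[SEP]` positions. The ids 101, 102 and 103 are the ones visible in the printed `input_ids` above. The function name and the toy arrays are assumptions for illustration.

```python
import numpy as np

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # ids seen in the printed input_ids


def fill_masks_only(input_ids, predicted_ids, mask_id=MASK_ID,
                    special_ids=(CLS_ID, SEP_ID)):
    """Use predictions only at masked positions and drop [CLS]/[SEP]."""
    input_ids = np.asarray(input_ids)
    predicted_ids = np.asarray(predicted_ids)
    # Keep the original token everywhere except where the input was masked.
    filled = np.where(input_ids == mask_id, predicted_ids, input_ids)
    # Remove the special-token positions at the head and tail.
    keep = ~np.isin(input_ids, special_ids)
    return filled[keep]


# Toy example: [CLS] a [MASK] c [SEP]; the non-special ids are arbitrary.
toy_input = np.array([101, 3160, 103, 2119, 102])
toy_pred = np.array([8024, 3160, 4906, 2119, 3160])  # argmax of the logits per position
result = fill_masks_only(toy_input, toy_pred)
print(result)
```

In the notebook, `input_ids` would be `inputs['input_ids'][0].numpy()` and `predicted_ids` would be `np.argmax(outputs['logits'], axis=-1)[0]`, and the resulting ids could then be decoded with the tokenizer. Keeping the unmasked tokens also prevents the model from silently rewriting characters that were never perturbed.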