## Data Shift
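- A data shift means the distribution of the dataset has changed. This is very common in real-world data, so it is usually monitored with automatic detection and real-time notifications, which reduces the cost it causes in practice.
- Types
    - Covariate shift
    - Prior probability shift
    - Concept drift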

### Covariate shift
Formal definition
$$P(Y|X)_{train} = P(Y|X)_{test}, \quad \text{but} \quad P(X)_{train} \neq P(X)_{test}$$
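Why this matters can be read off the factorization of the joint distribution (a standard identity, written here in the notation of the definition): the conditional term is shared, so any difference between the training and test joint distributions comes entirely from $P(X)$.

$$P(X, Y)_{train} = P(Y|X)_{train} \, P(X)_{train}, \qquad P(X, Y)_{test} = P(Y|X)_{test} \, P(X)_{test}$$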
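Steps
1. Add a new variable ['is_train'] = 1 / 0 to each dataset (1 for training rows, 0 for test rows).
2. Merge the training and test sets, then shuffle.
3. Split the merged data into train/test.
4. Loop over the features, using each one alone as the input of a classifier that predicts is_train.
    - 4-1. Train the model.
    - 4-2. Compute the AUC.
    - 4-3. Compare the AUC against a threshold such as 0.8; if the AUC exceeds it, that feature shows a data shift.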
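The steps above can be sketched on synthetic data where the answer is known in advance. This is a minimal illustration, not the notebook's own experiment: the two features, the shift of +2, and the `min_samples_leaf=20` setting are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: feature 0 has the same distribution in both sets,
# feature 1 is shifted by +2 in the "test" set.
train = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1, n)])
test = np.column_stack([rng.normal(0, 1, n), rng.normal(2, 1, n)])

# Steps 1-2: tag each row with its origin, merge, and shuffle
X = np.vstack([train, test])
is_train = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
order = rng.permutation(len(X))
X, is_train = X[order], is_train[order]

# Step 3: hold out part of the merged data for evaluation
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X, is_train, test_size=0.2, random_state=0, stratify=is_train
)

# Step 4: one classifier per feature, scored by AUC on the held-out part
threshold = 0.8
aucs = {}
for j in range(X.shape[1]):
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0)
    clf.fit(X_fit[:, [j]], y_fit)  # [j] keeps the input 2-D
    proba = clf.predict_proba(X_eval[:, [j]])[:, 1]
    aucs[j] = roc_auc_score(y_eval, proba)

shifted = [j for j, a in aucs.items() if a > threshold]
print(aucs, shifted)
```

An AUC near 0.5 means the classifier cannot tell training rows from test rows using that feature, so only the shifted feature should be flagged.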
 

In [1]:
import numpy as np
import pandas as pd
# load_boston was removed in scikit-learn 1.2; this notebook needs an older release
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, accuracy_score, roc_auc_score, roc_curve
from pprint import pprint

In [4]:
# Use a random forest: it is simple and needs no feature scaling, which is convenient


data = load_boston(return_X_y=False)
# print(x.shape, y.shape)
print(data['data'].shape)
print(data['target'].shape)
print(data.keys())

df = pd.DataFrame(data=np.concatenate((data['data'], data['target'].reshape(-1,  1)), axis=1), columns=data['feature_names'].tolist()+['target'])


# 1. Add the new variable is_train. In this example the rows are shuffled once, then the first 80% are labeled train and the last 20% test
num = int(len(df) * 0.8)

df = df.sample(frac=1)
label = [1 for _ in range(num)] + [0 for _ in range(len(df)-num)]
df['is_train'] = label
print(df)


# 2. shuffle
df = df.sample(frac=1)
print(df)


# 3. Split into train/test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.2, random_state=42)

(506, 13)
(506,)
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  \
22    1.23247   0.0   8.14   0.0  0.538  6.142   91.7  3.9769   4.0  307.0   
213   0.14052   0.0  10.59   0.0  0.489  6.375   32.3  3.9454   4.0  277.0   
456   4.66883   0.0  18.10   0.0  0.713  5.976   87.9  2.5806  24.0  666.0   
400  25.04610   0.0  18.10   0.0  0.693  5.987  100.0  1.5888  24.0  666.0   
39    0.02763  75.0   2.95   0.0  0.428  6.595   21.8  5.4011   3.0  252.0   
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...   
115   0.17134   0.0  10.01   0.0  0.547  5.928   88.2  2.4631   6.0  432.0   
185   0.06047   0.0   2.46   0.0  0.488  6.153   68.8  3.2797   3.0  193.0   
153   2.14918   0.0  19.58   0.0  0.871  5.709   98.5  1.6232   5.0  403.0   
137   0.35233   0.0  21.89   0.0  0.624  6.454   98.4  1.8498   4.0  437.0   
290   0.03502  80.0   4.95   0.0  0.411  6.861   27.9  5.

In [5]:
# 4. Predict is_train from each feature, one at a time

d = {}

for i in range(len(X_train.columns)):
    x = X_train.iloc[:, i:i+1] # slicing with i:i+1 keeps the result 2-D
    rf = RandomForestClassifier()
    rf.fit(x, y_train)
    y_pred = rf.predict(x)     # predicted class labels
    y_test_pred = rf.predict(X_test.iloc[:, i:i+1])
        
    acc_train = accuracy_score(y_train, y_pred)
    auc_train = roc_auc_score(y_train, rf.predict_proba(x)[:, 1])
    
    acc_test = accuracy_score(y_test, y_test_pred)
    auc_test = roc_auc_score(y_test, rf.predict_proba(X_test.iloc[:, i:i+1])[:, 1])
    d[X_train.columns[i]] = {
        #'acc_train': acc_train,
        'acc_test': acc_test,
        #'auc_train': auc_train,
        'auc_test': auc_test
    }

pprint(d)

{'AGE': {'acc_test': 0.6764705882352942, 'auc_test': 0.2883874518436984},
 'B': {'acc_test': 0.7450980392156863, 'auc_test': 0.6576774903687397},
 'CHAS': {'acc_test': 0.7745098039215687, 'auc_test': 0.48183819482663737},
 'CRIM': {'acc_test': 0.696078431372549, 'auc_test': 0.5476059438635112},
 'DIS': {'acc_test': 0.696078431372549, 'auc_test': 0.5465052283984589},
 'INDUS': {'acc_test': 0.7647058823529411, 'auc_test': 0.4479911942762796},
 'LSTAT': {'acc_test': 0.696078431372549, 'auc_test': 0.5701706108970831},
 'NOX': {'acc_test': 0.7745098039215687, 'auc_test': 0.5253164556962026},
 'PTRATIO': {'acc_test': 0.7941176470588235, 'auc_test': 0.5140341221794167},
 'RAD': {'acc_test': 0.7745098039215687, 'auc_test': 0.4722069345074299},
 'RM': {'acc_test': 0.6764705882352942, 'auc_test': 0.5324711062190424},
 'TAX': {'acc_test': 0.7450980392156863, 'auc_test': 0.538800220143093},
 'ZN': {'acc_test': 0.7745098039215687, 'auc_test': 0.4625756741882223},
 'target': {'acc_test': 0.627450980

In [None]:
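def check_data_shift(x_train, y_train, x_test, y_test, s_type='covariate shift') -> dict:
    """
        Wrap the procedure above into a reusable function.
        TODO: decide on the inputs and outputs first; it may be easier to
        write the other shift types before finishing this one.
    """
    if s_type == 'covariate shift':
        # Tag each row with its origin without mutating the caller's frames
        x_train = x_train.assign(is_train=1)
        x_test = x_test.assign(is_train=0)
        # DataFrame.append / Series.append were removed in pandas 2.0; use pd.concat
        X = pd.concat([x_train, x_test], ignore_index=True)
        y = pd.concat([y_train, y_test], ignore_index=True)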
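One possible way to finish the covariate-shift branch is sketched below. The helper name `covariate_shift_report`, its parameters, and the use of cross-validated predictions instead of a single train/test split are all assumptions, not the notebook's design. Note that this sketch only needs the feature frames: the original targets play no role in predicting is_train.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def covariate_shift_report(x_train, x_test, threshold=0.8, cv=5, random_state=0):
    """Return {feature: {'auc': float, 'shifted': bool}} (hypothetical helper)."""
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    X = pd.concat([x_train, x_test], ignore_index=True)
    is_train = np.r_[np.ones(len(x_train), dtype=int), np.zeros(len(x_test), dtype=int)]

    report = {}
    for col in X.columns:
        clf = RandomForestClassifier(
            n_estimators=100, min_samples_leaf=20, random_state=random_state
        )
        # Out-of-fold probabilities, so no row is scored by a model that saw it
        proba = cross_val_predict(
            clf, X[[col]], is_train, cv=cv, method="predict_proba"
        )[:, 1]
        auc = roc_auc_score(is_train, proba)
        report[col] = {"auc": float(auc), "shifted": bool(auc > threshold)}
    return report


# Usage on made-up data: column "b" is shifted in the test frame
rng = np.random.default_rng(1)
demo_train = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(0, 1, 500)})
demo_test = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(3, 1, 500)})
report = covariate_shift_report(demo_train, demo_test)
print(report)
```

Scoring out-of-fold predictions uses every row for evaluation, which tends to give a steadier AUC than a single small held-out split.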
        

### Prior probability shift
Formal definition
$$P(X|Y)_{train} = P(X|Y)_{test}, \quad \text{but} \quad P(Y)_{train} \neq P(Y)_{test}$$
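By Bayes' rule, the posterior that a classifier tries to learn depends on the class prior, so a change in $P(Y)$ changes $P(Y|X)$ even when $P(X|Y)$ stays the same:

$$P(Y|X) = \frac{P(X|Y) \, P(Y)}{\sum_{y'} P(X|y') \, P(y')}$$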
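Steps
1. Compute the class probabilities of the training set and of the test set separately.
2. Use ANOVA or a t-test to check whether the difference is significant, with $H_{0}$: the class probabilities are the same in both sets.
    - Yes -> reject $H_{0}$: there is a prior probability shift.
    - No  -> fail to reject $H_{0}$: there is no evidence of a prior probability shift.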
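For labels with more than two classes, the same two steps can be sketched with a chi-square test of homogeneity on the class counts. This swaps in a different test from the ANOVA / t-test named above; the helper name `prior_shift_test` and the synthetic labels are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency


def prior_shift_test(y_train, y_test, alpha=0.05):
    """Chi-square test of homogeneity on class counts (hypothetical helper)."""
    classes = np.union1d(y_train, y_test)
    # 2 x K contingency table: rows are train/test, columns are classes
    table = np.array([
        [np.sum(y_train == c) for c in classes],
        [np.sum(y_test == c) for c in classes],
    ])
    # Step 1: class probabilities of each set
    priors = table / table.sum(axis=1, keepdims=True)
    # Step 2: test H0 "both sets share the same class probabilities"
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {
        "priors_train": priors[0],
        "priors_test": priors[1],
        "p_value": float(p_value),
        "shift": bool(p_value < alpha),
    }


rng = np.random.default_rng(0)
y_a = rng.choice([0, 1, 2], size=2000, p=[0.5, 0.3, 0.2])
y_b = rng.choice([0, 1, 2], size=2000, p=[0.2, 0.3, 0.5])  # different priors
y_c = rng.choice([0, 1, 2], size=2000, p=[0.5, 0.3, 0.2])  # same priors as y_a

print(prior_shift_test(y_a, y_b)["shift"])
print(prior_shift_test(y_a, y_c)["shift"])
```

The test works on counts rather than raw proportions, so the sample sizes of the two sets are taken into account automatically.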

In [65]:
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# print(X)
# print(y)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6)
print(y_train.shape, y_test.shape)

(426,) (143,)


In [66]:
# 1. Compute P(Y)

py_train = (y_train == 1).sum() / len(y_train)
py_test = (y_test == 1).sum() / len(y_test)

print(py_train, py_test)

0.6572769953051644 0.5384615384615384


In [67]:
from scipy.stats import ttest_ind

# 2. t-test

result = ttest_ind(y_train, y_test)
print(result.pvalue)
print(result.statistic)

0.010946905364189472
2.5527999897365663


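> By choosing random_state, we obtain a split in which P(Y) differs between train and test. The p-value is below 0.05, so $H_{0}$ is rejected: a data shift (here, a prior probability shift) is present.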
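As a cross-check, a t-test on 0/1 labels is close to a two-proportion z-test, which can be computed by hand. The sketch below assumes the positive counts 280 and 77; these are inferred from the printed proportions and set sizes, not taken from notebook output.

```python
import math
from scipy.stats import norm

# Counts recovered from the printed proportions (an inference, not notebook output):
# 0.6573 * 426 is about 280 positives in train, 0.5385 * 143 is about 77 in test
n_train, n_test = 426, 143
pos_train, pos_test = 280, 77

p_train = pos_train / n_train
p_test = pos_test / n_test

# Pooled proportion under H0: P(Y=1) is the same in both sets
p_pool = (pos_train + pos_test) / (n_train + n_test)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_train + 1 / n_test))

z = (p_train - p_test) / se
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(z, p_value)
```

The statistic and p-value should land close to the t-test values printed above, which supports reading the t-test result as a test on the difference in P(Y).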

In [68]:
# Now fit a model and see how it predicts

rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_train)
acc_train = accuracy_score(y_train, y_pred)

y_pred_test = rf.predict(x_test)
acc_test = accuracy_score(y_test, y_pred_test)

print(acc_train, acc_test)

1.0 0.958041958041958
