
Let us build a proportional model ($\mathbb{P}(Y=1 \mid X) = G(\beta_0+\beta \cdot X)$, where $G$ is the logistic function) for the spam vs. not-spam data. We assume that each feature records whether a given word is present: let $X_1,X_2,X_3$ denote the presence (1) or absence (0) of the words "free", "prize", and "win", respectively.

1. [2p] Load the file `data/spam.csv` and create two numpy arrays: `problem2_X`, which has shape (n_emails,3) and whose columns correspond to $X_1,X_2,X_3$ from above, and `problem2_Y`, which has shape **(n_emails,)** and contains a $1$ if the email is spam and a $0$ if it is not. Split this data into train, calibration, and test sets with proportions $40\%$, $20\%$, $40\%$, and put the splits in the designated variables in the code cell.

2. [4p] Follow the derivation of logistic regression in the lecture notes and implement the final loss function inside the class `ProportionalSpam`. You can use the `Test` cell to check that it gives the correct value for a test point.

3. [4p] Train the model `problem2_ps` on the training data. The goal is to calibrate the probabilities output from the model. Start by creating a new variable `problem2_X_pred` (shape `(n_samples,1)`) which consists of the predictions of `problem2_ps` on the calibration dataset. Then train a calibration model using `sklearn.tree.DecisionTreeRegressor`, store this trained model in `problem2_calibrator`.

4. [3p] Use the trained model `problem2_ps` and the calibrator `problem2_calibrator` to make final predictions on the test data, and store the predictions in `problem2_final_predictions`. Compute the $0$-$1$ test loss, store it in `problem2_01_loss`, and provide a $99\%$ confidence interval for it in the variable `problem2_interval`; this should again be a tuple, as in **problem1**.
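
For item 2, the loss in question is the negative log-likelihood of the model. The following is a sketch of that derivation, not a copy of the lecture notes: it assumes i.i.d. samples $(x_i, y_i)$, $i = 1,\dots,n$, with $y_i \in \{0,1\}$, and introduces the shorthand $z_i = \beta_0 + \beta \cdot x_i$. The lecture notes may scale the loss differently, for example by $1/n$.

$$
\begin{aligned}
L(\beta_0, \beta) &= -\sum_{i=1}^{n} \Big[ y_i \log G(z_i) + (1 - y_i) \log\big(1 - G(z_i)\big) \Big] \\
&= \sum_{i=1}^{n} \Big[ \log\big(1 + e^{z_i}\big) - y_i z_i \Big].
\end{aligned}
$$

The second line uses $G(z) = 1/(1+e^{-z})$, which gives $\log G(z) = z - \log(1+e^{z})$ and $\log(1 - G(z)) = -\log(1+e^{z})$. The second form never takes the logarithm of a probability that has rounded to $0$ or $1$, so it is the safer one to evaluate numerically.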

In [1]:
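# Load the data and split it into train, calibration, and test sets

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('data/spam.csv')

# Build the feature matrix and the label vector
problem2_X = data[['free', 'prize', 'win']].values
problem2_Y = data['spam'].values

# Split: 40% train, then the remaining 60% into 20% calibration and 40% test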
X_train, X_temp, Y_train, Y_temp = train_test_split(problem2_X, problem2_Y, test_size=0.6, random_state=0)
X_calib, X_test, Y_calib, Y_test = train_test_split(X_temp, Y_temp, test_size=2/3, random_state=0)

print("Train set size:", X_train.shape)
print("Calibration set size:", X_calib.shape)
print("Test set size:", X_test.shape)


FileNotFoundError: [Errno 2] No such file or directory: 'data/spam.csv'
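
Since `data/spam.csv` was not found in this run, the split itself can still be checked on stand-in data. The sketch below is only an illustration: the sample size, the coefficients used to simulate labels, and all `*_demo` names are hypothetical and say nothing about the real dataset. It applies the same two-stage `train_test_split` call and prints the resulting set sizes, which should be close to a 40/20/40 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_demo = 1000  # hypothetical number of emails

# Hypothetical binary word-presence features for "free", "prize", "win"
X_demo = rng.integers(0, 2, size=(n_demo, 3))

# Simulate labels from a logistic model with made-up coefficients
logits_demo = -1.0 + X_demo @ np.array([1.5, 1.0, 0.5])
Y_demo = (rng.random(n_demo) < 1.0 / (1.0 + np.exp(-logits_demo))).astype(int)

# Stage 1: keep 40% for training, hold out 60%
X_tr, X_tmp, Y_tr, Y_tmp = train_test_split(
    X_demo, Y_demo, test_size=0.6, random_state=0)
# Stage 2: split the held-out 60% into 1/3 calibration and 2/3 test
X_cal, X_te, Y_cal, Y_te = train_test_split(
    X_tmp, Y_tmp, test_size=2/3, random_state=0)

print(len(X_tr), len(X_cal), len(X_te))
```

The second call uses `test_size=2/3` because two thirds of the held-out 60% is 40% of the full data.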

In [None]:
# Implement the logistic regression model

from scipy.special import expit  # expit is the logistic (sigmoid) function
from scipy.optimize import minimize

class ProportionalSpam:
    def __init__(self):
        self.beta = None  # [beta_0, beta_1, beta_2, beta_3] once fitted

    def fit(self, X, y):
        X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend a column of ones for the intercept beta_0
        n, d = X.shape

        # Loss: negative log-likelihood of the logistic model
        def loss(beta):
            linear_term = np.dot(X, beta)
            # log(1 + exp(z)) - y*z equals -[y*log(G(z)) + (1-y)*log(1-G(z))],
            # but avoids log(0) when G(z) rounds to 0 or 1
            return np.sum(np.logaddexp(0, linear_term) - y * linear_term)

        # Start the optimization from beta = 0
        beta_init = np.zeros(d)
        result = minimize(loss, beta_init, method='BFGS')
        self.beta = result.x

    def predict_proba(self, X):
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        linear_term = np.dot(X, self.beta)
        return expit(linear_term)

    def predict(self, X):
        return self.predict_proba(X) >= 0.5

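# Test: fit on the training data and inspect the learned coefficients
ps = ProportionalSpam()
ps.fit(X_train, Y_train)
problem2_ps = ps  # the task statement asks for the trained model under this name
print("Learned coefficients:", ps.beta)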
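
The loss can also be checked at a test point without the real data. The sketch below uses a hypothetical standalone function `logistic_nll` and made-up `*_demo` arrays; it is not the `Test` cell from the assignment. It relies on two facts: at $\beta = 0$ every predicted probability is $1/2$, so the loss must equal $n \log 2$, and the direct formula and the `logaddexp` formula must agree at any moderate $\beta$.

```python
import numpy as np
from scipy.special import expit

def logistic_nll(beta, X, y):
    """Negative log-likelihood of the logistic model; beta[0] is the intercept."""
    z = beta[0] + X @ beta[1:]
    return np.sum(np.logaddexp(0.0, z) - y * z)

# Hypothetical test points: four emails, three word indicators each
X_demo = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]])
y_demo = np.array([1, 0, 1, 0])

# Check 1: at beta = 0 the loss is n * log(2)
loss_at_zero = logistic_nll(np.zeros(4), X_demo, y_demo)
print(loss_at_zero, len(y_demo) * np.log(2))

# Check 2: the direct formula and the stable formula agree
beta_demo = np.array([-0.5, 1.0, 0.3, -0.2])  # made-up coefficients
p_demo = expit(beta_demo[0] + X_demo @ beta_demo[1:])
naive = -np.sum(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo))
stable = logistic_nll(beta_demo, X_demo, y_demo)
print(naive, stable)
```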


In [None]:
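# Train the calibration model

from sklearn.tree import DecisionTreeRegressor

# Predict probabilities on the calibration set, as a column of shape (n_samples, 1)
problem2_X_pred = ps.predict_proba(X_calib).reshape(-1, 1)

# Fit the calibrator: regress the true labels on the predicted probabilities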
problem2_calibrator = DecisionTreeRegressor(random_state=0)
problem2_calibrator.fit(problem2_X_pred, Y_calib)
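
To see what the tree calibrator does, consider a toy case with hypothetical scores and labels (all `*_demo` names are made up). With three binary features the logistic model can output at most eight distinct probabilities, so the calibrator only sees a handful of distinct inputs. In this toy example the tree maps each distinct score to the fraction of spam among the calibration emails that received that score.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical model scores on a tiny calibration set, as a column vector
scores_demo = np.array([0.2, 0.2, 0.8, 0.8, 0.8]).reshape(-1, 1)
# Hypothetical true labels for the same emails
labels_demo = np.array([0, 1, 1, 1, 0])

calibrator_demo = DecisionTreeRegressor(random_state=0)
calibrator_demo.fit(scores_demo, labels_demo)

# Score 0.2 was spam in 1 of 2 cases, score 0.8 in 2 of 3 cases
calibrated_demo = calibrator_demo.predict(np.array([[0.2], [0.8]]))
print(calibrated_demo)
```

With many distinct scores an unrestricted tree would memorize the calibration set, so limiting `max_depth` or `min_samples_leaf` would be a reasonable safeguard in that setting.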


In [None]:
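# Make final predictions on the test data, then compute the 0-1 loss and its confidence interval

# Final predictions: model probability -> calibrated probability -> class label
test_proba = ps.predict_proba(X_test).reshape(-1, 1)
calibrated_proba = problem2_calibrator.predict(test_proba)
problem2_final_predictions = (calibrated_proba >= 0.5).astype(int)

# 0-1 loss: fraction of misclassified test emails
problem2_01_loss = np.mean(problem2_final_predictions != Y_test)

# 99% confidence interval from Hoeffding's inequality (the loss lies in [0, 1])
n_test = len(Y_test)
epsilon = np.sqrt(np.log(2 / 0.01) / (2 * n_test))
# Clip to [0, 1], since the true loss cannot lie outside that range
problem2_interval = (max(0.0, problem2_01_loss - epsilon), min(1.0, problem2_01_loss + epsilon))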

print(f"0-1 loss: {problem2_01_loss}")
print(f"99% confidence interval: {problem2_interval}")
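
The interval above has half-width $\epsilon = \sqrt{\log(2/\alpha)/(2n)}$, which is the bound given by Hoeffding's inequality for a loss taking values in $[0,1]$. As a rough sketch of how the width depends on the test-set size, the hypothetical helper `hoeffding_interval` below evaluates it for a made-up loss of $0.10$ and two made-up sample sizes.

```python
import numpy as np

def hoeffding_interval(empirical_loss, n, alpha=0.01):
    """Two-sided (1 - alpha) Hoeffding interval for a loss bounded in [0, 1]."""
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    # Clip to [0, 1] because the true loss cannot lie outside that range
    return (max(0.0, empirical_loss - eps), min(1.0, empirical_loss + eps))

# Made-up numbers: same empirical loss, two different test-set sizes
interval_small = hoeffding_interval(0.10, n=400)
interval_large = hoeffding_interval(0.10, n=4000)
print(interval_small)
print(interval_large)
```

Because $\epsilon$ shrinks like $1/\sqrt{n}$, a test set ten times larger narrows the interval by a factor of about $\sqrt{10} \approx 3.2$.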
