# Assignment 4: Distributed Algorithms for Linear Models

### Problem 1

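First generate the simulated data with the code below and write it to a file. The last column of the data is the response $Y$; the remaining columns are the predictors $X$.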

In [1]:
import numpy as np
np.set_printoptions(linewidth=100)

np.random.seed(123)
n = 100000
p = 100
x = np.random.normal(size=(n, p))
beta = np.random.normal(size=p)
y = 1.23 + x.dot(beta) + np.random.normal(scale=2.0, size=n)
dat = np.hstack((x, y.reshape(n, 1)))
np.savetxt("reg_data.txt", dat, fmt="%.8f", delimiter=";")

Start PySpark in local (single-machine) mode with 4 CPU cores, and write distributed programs to carry out the following computations:

In [2]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
            master("local[4]").\
            appName("PySpark RDD").\
            getOrCreate()
sc = spark.sparkContext
print(spark)
print(sc)

<pyspark.sql.session.SparkSession object at 0x000001F5427A3430>
<SparkContext master=local[4] appName=PySpark RDD>


1. Print the first 5 rows of the data, truncating each row's string to 80 characters:

In [4]:
file = sc.textFile("reg_data.txt")
demo = file.map(lambda x: x[:80] + "...").take(5)
print(*demo, sep="\n")

-1.08563060;0.99734545;0.28297850;-1.50629471;-0.57860025;1.65143654;-2.42667924...
0.64205469;-1.97788793;0.71226464;2.59830393;-0.02462598;0.03414213;0.17954948;-...
0.70331012;-0.59810533;2.20070210;0.68829693;-0.00630725;-0.20666230;-0.08652229...
0.76505485;-0.82898883;-0.65915131;0.61112355;-0.14401335;1.31660560;-0.70434215...
1.53409029;-0.52991410;-0.49097228;-1.30916531;-0.00866047;0.97681298;-1.7510703...


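2. Convert the RDD obtained from reading the data into matrices, one per partition. Use the default number of partitions; no repartitioning is needed. Print the data contained in the first non-empty partition after conversion.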

In [5]:
def str_to_vec(line):
    str_vec = line.split(";")
    num_vec = map(lambda s:float(s), str_vec)
    return np.fromiter(num_vec, dtype=float)
def par_to_mat(iterator):
    iter_arr = map(str_to_vec, iterator)
    arr_list = list(iter_arr)
    if len(arr_list) >= 1:
        mat = np.vstack(arr_list)
    else:
        mat = np.array([])
    yield mat
dat = file.mapPartitions(par_to_mat).filter(lambda x: x.shape[0] > 0 )
print(dat.first())

[[ -1.0856306    0.99734545   0.2829785  ...   0.37940061  -0.37917643   3.72488966]
 [  0.64205469  -1.97788793   0.71226464 ...  -0.34126172  -0.21794626  10.98088055]
 [  0.70331012  -0.59810533   2.2007021  ...   0.16054442   0.81976061 -12.63028846]
 ...
 [ -0.30751248   0.1323937    2.33256448 ...   0.37475498  -1.37608098 -13.52353737]
 [ -0.02266014  -0.3014796    2.34502536 ...  -2.06082696  -1.20995417 -10.00714174]
 [  0.02415432  -0.3896902   -0.07492828 ...  -0.41935638  -1.68496516   8.33748658]]
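The per-partition conversion can be tried without a Spark session. The sketch below is only an illustration: the two Python lists are hypothetical stand-ins for RDD partitions (one of them empty), and the `sep` argument is an addition not present in the original functions.

```python
import numpy as np

def str_to_vec(line, sep=";"):
    # Parse one delimited text line into a 1-D float array.
    return np.fromiter(map(float, line.split(sep)), dtype=float)

def par_to_mat(iterator, sep=";"):
    # Stack all rows of one partition; an empty partition yields an empty array.
    rows = [str_to_vec(line, sep) for line in iterator]
    yield np.vstack(rows) if rows else np.array([])

# Hypothetical stand-in for two partitions, the second one empty.
partitions = [["1.0;2.0;3.0", "4.0;5.0;6.0"], []]
mats = [m for part in partitions for m in par_to_mat(iter(part))]

# Same filter as in the Spark pipeline: keep only non-empty matrices.
nonempty = [m for m in mats if m.shape[0] > 0]
print(len(mats), len(nonempty), nonempty[0].shape)
```

Because `par_to_mat` is a generator that yields exactly one matrix per partition, it has the shape `mapPartitions` expects: an iterator in, an iterator out.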


3. Estimate the regression coefficients of the linear model $Y=X\beta+\varepsilon$, **including an intercept**. Use `reduce()` **only once**.

In [6]:
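# OLS estimate: beta_hat = (X'X)^{-1} X'y
# Prepend a column of ones to each partition matrix for the intercept.
dat_1 = dat.map(lambda m: np.c_[np.ones(m.shape[0]), m])
# Each partition contributes (X_k'X_k, X_k'y_k); a single reduce() sums them.
XtX, Xty = dat_1.map(lambda m: (m[:, :-1].T.dot(m[:, :-1]), m[:, :-1].T.dot(m[:, -1]))).\
    reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
betah = np.linalg.solve(XtX, Xty)
print(betah)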


[ 1.22841355 -0.58056172 -1.12947488  1.16031679  0.68276231  0.64063205 -1.69803101  0.87295008
 -0.6827681   1.21323821 -0.18532546 -0.60313748  0.45016343  1.54732259  0.93536575  0.33661885
 -0.62839196 -0.18223468  1.04004336  0.99530527 -0.22421889  0.26910036 -1.95584105  0.93200566
 -0.46663344 -1.30308226 -1.07451859 -0.9200001  -0.4751849  -0.41498631  0.0893936   0.74250157
  0.44142653  0.78310696  0.0968675  -0.20661749  1.36408459 -0.84452182 -1.56303708 -0.03391736
  0.05672465 -0.01335776 -0.31919022 -1.7366497  -1.35682179 -1.60938262 -1.28888311  0.92820726
  0.9148462  -0.87189391 -1.11327839 -0.65324334 -1.54752238 -1.48016168 -1.40044728  0.06124555
 -2.06832355  0.23966887 -1.45310857 -0.4958114  -1.0917562   1.22608413  0.71866161  0.46548143
 -0.21573557  1.19919219 -0.18470024  0.41716831  0.48748654 -0.28702665 -0.92945413 -2.54835305
  1.21073672 -0.41380347  0.40696645  0.74054168  1.59228068 -0.35873326  0.41181034 -1.44030368
 -0.47743396 -0.27652029 -1.65
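The single `reduce()` works because the normal equations only need $X'X=\sum_k X_k'X_k$ and $X'y=\sum_k X_k'y_k$, where $k$ indexes partitions. A minimal sketch of that identity, assuming NumPy row blocks as stand-ins for RDD partitions and small made-up data (the names `suff_stats` and `parts` are illustrative only):

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
n, p = 1000, 5
x = rng.normal(size=(n, p))
y = 1.23 + x @ rng.normal(size=p) + rng.normal(size=n)
full = np.c_[np.ones(n), x, y]      # intercept column, predictors, response last

# Hypothetical partitioning: four row blocks play the role of RDD partitions.
parts = np.array_split(full, 4)

def suff_stats(m):
    # Per-partition sufficient statistics (X_k'X_k, X_k'y_k).
    xk, yk = m[:, :-1], m[:, -1]
    return xk.T @ xk, xk.T @ yk

# One reduction adds the statistics of all partitions.
xtx, xty = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), map(suff_stats, parts))
beta_dist = np.linalg.solve(xtx, xty)

# Reference: least squares on the full matrix.
beta_ref = np.linalg.lstsq(full[:, :-1], full[:, -1], rcond=None)[0]
print(np.allclose(beta_dist, beta_ref))
```

Only a $(p+1)\times(p+1)$ matrix and a vector of length $p+1$ travel from each partition to the driver, regardless of how many rows the partition holds.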

4. Design a distributed algorithm to compute the $R^2$ of the regression model.
$$SSR = \sum(y_i -\hat y_i)^2 = ||Y - X\hat\beta||_2^2$$
$$SST = \sum(y_i - \bar y)^2 = \sum y_i^2 - n(\bar y)^2$$
$$R^2 = 1 - \frac{SSR}{SST}$$


In [7]:
lol = [1,2,3]
print(lol[:-1])

[1, 2]


In [8]:
# One pass collects four sums per partition: squared residuals, y^2, y, and the row count.
ssr, ysq, yid, num = dat_1.map(lambda m: (np.sum((m[:, -1] - m[:, :-1].dot(betah))**2),
                                          np.sum(m[:, -1] * m[:, -1]),
                                          np.sum(m[:, -1]),
                                          m.shape[0])).\
    reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]))
# SST = sum(y_i^2) - n * ybar^2 = sum(y_i^2) - (sum y_i)^2 / n
sst = ysq - 1 / num * yid * yid
R2 = 1 - ssr / sst
print(ssr, ysq, yid, sst)
print("R^2 = ", R2)


397451.80241834675 11636386.644065393 116691.99105594002 11500216.436299397
R^2 =  0.9654396241479573
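The four running sums are enough because both $SSR$ and $SST$ decompose over rows. The sketch below checks this on made-up data, again with NumPy row blocks as hypothetical partitions; `r2_stats` is an illustrative name, not part of the assignment.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
n, p = 1000, 3
x = np.c_[np.ones(n), rng.normal(size=(n, p))]
y = x @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)
beta = np.linalg.lstsq(x, y, rcond=None)[0]

# Hypothetical partitions: row blocks with the response in the last column.
parts = np.array_split(np.c_[x, y], 4)

def r2_stats(m, beta):
    # Returns (sum of squared residuals, sum y^2, sum y, row count) for one block.
    xk, yk = m[:, :-1], m[:, -1]
    return np.array([np.sum((yk - xk @ beta) ** 2), np.sum(yk ** 2), np.sum(yk), len(yk)])

ssr, ysq, ysum, cnt = reduce(np.add, (r2_stats(m, beta) for m in parts))
sst = ysq - ysum ** 2 / cnt
r2_dist = 1.0 - ssr / sst

# Reference computed directly on the full data.
r2_ref = 1.0 - np.sum((y - x @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.isclose(r2_dist, r2_ref))
```

One caveat: the shortcut $\sum y_i^2 - n\bar y^2$ can lose precision when the mean is large relative to the spread of $y$; for data like the simulated set here that is unlikely to matter.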


### Problem 2

(a) Consider the Softplus function $$\mathrm{softplus}(x)=\log(1+e^x)$$

Use NumPy to write a function `softplus(x)` that accepts a vector or matrix `x` and returns the value of the Softplus function at each element of `x`.

In [9]:
import numpy as np

def softplus(x):
    return np.log(1 + np.exp(x))

A simple test:

In [10]:
x = np.array([-1000.0, -100.0, -10.0, 0.0, 1.0, 10.0, 100.0, 1000.0])

# the function written above
print(softplus(x))

[0.00000000e+00 0.00000000e+00 4.53988992e-05 6.93147181e-01 1.31326169e+00 1.00000454e+01
 1.00000000e+02            inf]


  return np.log(1 + np.exp(x))


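(b) Are the results above as expected? If any abnormal values appear, think about the likely cause, then follow the course slides and write the Softplus function again. Use NumPy's vectorized functions wherever possible and avoid loops. The function must support both vector and matrix arguments. If everything looks fine, you may skip this part.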

In [11]:
def softplus(x):
    case1 = np.log(1 + np.exp(x))
    case2 = x + np.log(1 + np.exp(-x))
    return np.where(x >=0, case2, case1)
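# np.where only selects among three fully evaluated arrays (condition, case2, case1),
# so the overflow warning here comes from evaluating the branch that is not selected.
# Alternative: np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))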

Test again:

In [12]:
print(softplus(x))
print()
print(softplus(x.reshape(4, 2)))

[0.00000000e+00 0.00000000e+00 4.53988992e-05 6.93147181e-01 1.31326169e+00 1.00000454e+01
 1.00000000e+02 1.00000000e+03]

[[0.00000000e+00 0.00000000e+00]
 [4.53988992e-05 6.93147181e-01]
 [1.31326169e+00 1.00000454e+01]
 [1.00000000e+02 1.00000000e+03]]


  case1 = np.log(1 + np.exp(x))
  case2 = x + np.log(1 + np.exp(-x))
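The `np.where` version returns correct values but still evaluates `np.exp(x)` on large positive inputs, which is where the overflow warnings originate. One way to avoid evaluating the overflowing branch at all is the identity $\log(1+e^x)=\max(x,0)+\log(1+e^{-|x|})$. The sketch below is an alternative, not the course's reference solution; the name `softplus_stable` is made up for this example.

```python
import warnings
import numpy as np

def softplus_stable(x):
    # exp() only ever sees a non-positive argument, so it cannot overflow.
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-1000.0, -100.0, -10.0, 0.0, 1.0, 10.0, 100.0, 1000.0])
with warnings.catch_warnings():
    warnings.simplefilter("error")      # turn any RuntimeWarning into an exception
    out = softplus_stable(x)
    out_mat = softplus_stable(x.reshape(4, 2))

print(out)
# Cross-check against NumPy's built-in log(exp(a) + exp(b)) with a = 0.
print(np.allclose(out, np.logaddexp(0.0, x)))
```

`np.logaddexp(0.0, x)` computes the same quantity and can serve as a one-line replacement when a hand-written formula is not required.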


### Problem 3

Generate simulated data with the code below. The first column is the binary (0-1) response $Y$; the remaining columns are the predictors $X$.

In [13]:
import numpy as np
from scipy.special import expit

np.random.seed(123)
n = 100000
p = 100
x = np.random.normal(size=(n, p))
beta = np.random.normal(size=p)
prob = expit(-0.5 + x.dot(beta))  # p = 1 / (1 + exp(-x * beta))
y = np.random.binomial(1, prob, size=n)
dat = np.hstack((y.reshape(n, 1), x))
np.savetxt("logistic_data.txt", dat, fmt="%.8f", delimiter="\t")

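1. Fit a logistic regression model to the data above. Choose any algorithm to estimate the logistic regression coefficients, **including an intercept**. Use the Softplus function written in Problem 2 to write **numerically stable** functions that compute the logistic regression objective function and gradient.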

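Objective: minimize the cross-entropy loss
$$L_{CE} = -\sum_{i=1}^N \left[ y_i\log\sigma(x_i'\beta) + (1-y_i)\log\left(1-\sigma(x_i'\beta)\right) \right]$$
$$= -\sum_{i=1}^N \left[ y_i\left(x_i'\beta - \log(1+e^{x_i'\beta})\right) + (1-y_i)\left(-\log(1+e^{x_i'\beta})\right) \right]$$
$$= \sum_{i=1}^N \left[ \mathrm{softplus}(x_i'\beta) - y_i x_i'\beta \right]$$
Because the loss is a sum over observations, it can be computed separately on each partition matrix, and the partial results can then be added to obtain the total loss.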
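The question asks for a stable objective and gradient, so a direct sketch may help. It assumes the simplified form $\sum_i[\mathrm{softplus}(x_i'\beta)-y_ix_i'\beta]$ with gradient $X'(\sigma(X\beta)-y)$, toy data in the same column layout as `logistic_data.txt`, and NumPy row blocks standing in for partitions. `obj_grad` is a hypothetical helper name.

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), written so that exp() never receives a positive argument
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid(x):
    # 1 / (1 + e^{-x}) = exp(-softplus(-x))
    return np.exp(-softplus(-x))

def obj_grad(part_mat, beta):
    # part_mat: first column is y, remaining columns are x (no intercept column yet)
    y = part_mat[:, 0]
    x = np.c_[np.ones(part_mat.shape[0]), part_mat[:, 1:]]
    xb = x @ beta
    obj = np.sum(softplus(xb) - y * xb)
    grad = x.T @ (sigmoid(xb) - y)
    return obj, grad

# Hypothetical toy data: 60 rows, binary response, 3 predictors
rng = np.random.default_rng(0)
toy = np.c_[rng.integers(0, 2, size=60), rng.normal(size=(60, 3))]
beta0 = rng.normal(size=4)

# Summing per-partition results reproduces the full-data objective and gradient
pieces = [obj_grad(m, beta0) for m in np.array_split(toy, 4)]
obj_sum = sum(o for o, _ in pieces)
grad_sum = sum(g for _, g in pieces)
obj_full, grad_full = obj_grad(toy, beta0)
print(np.isclose(obj_sum, obj_full), np.allclose(grad_sum, grad_full))
```

With an RDD of partition matrices, the same function could be applied inside `map()` and the pairs added in `reduce()`, which would support gradient-based solvers as an alternative to the Newton-type iteration.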

In [14]:
def str_to_vec(line):
    str_vec = line.split("\t")
    num_vec = map(lambda s:float(s), str_vec)
    return np.fromiter(num_vec, dtype=float)
def par_to_mat(iterator):
    iter_arr = map(str_to_vec, iterator)
    arr_list = list(iter_arr)
    if len(arr_list) >= 1:
        mat = np.vstack(arr_list)
    else:
        mat = np.array([])
    yield mat
file = sc.textFile("logistic_data.txt")
print(file.count())
print()
text = file.map(lambda x: x[:70] + "...").take(5)
print(text)
dat = file.mapPartitions(par_to_mat).filter(lambda x: x.shape[0] > 0)
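# Each partition of the training data is now a matrix, ready for the next step.
print(dat.count())
# Cache the RDD so that later iterations do not re-read and re-parse the file.
dat.cache()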

100000

['0.00000000\t-1.08563060\t0.99734545\t0.28297850\t-1.50629471\t-0.57860025\t1...', '1.00000000\t0.64205469\t-1.97788793\t0.71226464\t2.59830393\t-0.02462598\t0....', '0.00000000\t0.70331012\t-0.59810533\t2.20070210\t0.68829693\t-0.00630725\t-0...', '1.00000000\t0.76505485\t-0.82898883\t-0.65915131\t0.61112355\t-0.14401335\t1...', '0.00000000\t1.53409029\t-0.52991410\t-0.49097228\t-1.30916531\t-0.00866047\t...']
4


PythonRDD[14] at RDD at PythonRDD.scala:53

In [15]:
# Helper functions used below
def softplus(x):
    case1 = np.log(1 + np.exp(x))
    case2 = x + np.log(1 + np.exp(-x))
    return np.where(x >=0, case2, case1)  

def sigmoid(x):
    case1 = 1/(1 + np.exp(-x))
    case2 = np.exp(x)/(1 + np.exp(x))
    return np.where(x >= 0, case1, case2)
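As with Softplus, the `np.where` form of `sigmoid` gives correct values but still evaluates both branches, so large inputs can trigger overflow warnings in the branch that is discarded. A possible warning-free variant, assuming only NumPy, computes $e^{-|x|}$ once and reuses it; `sigmoid_stable` is an illustrative name.

```python
import warnings
import numpy as np

def sigmoid_stable(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))      # always in [0, 1], so no overflow
    # x >= 0: 1 / (1 + e^{-x});  x < 0: e^{x} / (1 + e^{x})
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

grid = np.linspace(-20.0, 20.0, 81)
with warnings.catch_warnings():
    warnings.simplefilter("error")      # any RuntimeWarning would raise here
    extreme = sigmoid_stable(np.array([-1000.0, 0.0, 1000.0]))
    on_grid = sigmoid_stable(grid)

print(extreme)
print(np.allclose(on_grid, 1.0 / (1.0 + np.exp(-grid))))
```

`scipy.special.expit`, already imported in the data-generation cell, provides the same function and may be the simpler choice when SciPy is available on the workers.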
    

In [16]:
def compute_stats(part_mat, beta_old):
    c = np.ones(part_mat.shape[0])
    y = part_mat[: , 0]
    x = np.c_[c, part_mat[: , 1:]]
    xb = x.dot(beta_old)
    prob = sigmoid(xb)
    w = prob * (1.0 - prob) + 1e-6  # 1e-6 keeps w away from zero for the division below (open question: how to avoid this term)
    xtw = x.transpose() * w
    xtwx = xtw.dot(x)
    z = xb + (y - prob) / w
    xtwz = xtw.dot(z)
    objfn = -np.sum(y * (xb - softplus(xb)) + (1.0 - y) *(- softplus(xb)))
    return xtwx, xtwz, objfn

In [17]:
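# Initialize beta; its length must include the intercept.
# Each partition has 1 + p columns (y plus the predictors), which equals the number of coefficients.
p = dat.first().shape[1]
beta_hat = np.zeros(p)
# Objective function value at each iteration
objvals = []
# Maximum number of iterations
maxit = 30
# Convergence tolerance
eps = 1e-6

for i in range(maxit):
    xtwx, xtwz, objfn = dat.map(lambda part: compute_stats(part, beta_hat)).\
        reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
    beta_new = np.linalg.solve(xtwx, xtwz)
    resid = np.linalg.norm(beta_new - beta_hat)
    # Progress output, useful for debugging
    print(f"Iteration {i}, objfn = {objfn}, resid = {resid}")
    objvals.append(objfn)
    # Keep the latest estimate before testing for convergence
    beta_hat = beta_new
    if resid < eps:
        break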

Iteration 0, objfn = 69314.71805599453, resid = 1.569852188525537
Iteration 1, objfn = 32646.979227911244, resid = 1.3901997462268307
Iteration 2, objfn = 21647.76931828451, resid = 1.7368788815875644
Iteration 3, objfn = 16036.81715822243, resid = 2.077661047626655
Iteration 4, objfn = 13369.971015111047, resid = 2.0554425789459017
Iteration 5, objfn = 12424.605050958291, resid = 1.3106763115298692
Iteration 6, objfn = 12255.539828228108, resid = 0.3468415339074449
Iteration 7, objfn = 12248.21498992548, resid = 0.01834434661832474
Iteration 8, objfn = 12248.196924487687, resid = 6.370796261384574e-05
Iteration 9, objfn = 12248.19692427128, resid = 6.05647308167563e-08
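The loop above appears to follow iteratively reweighted least squares (IRLS), which is Newton's method applied to the logistic log-likelihood. Written out under that reading, with $k$ indexing partitions and $t$ indexing iterations, each step computes

$$p_i = \sigma(x_i'\beta^{(t)}),\qquad w_i = p_i(1-p_i),\qquad z_i = x_i'\beta^{(t)} + \frac{y_i-p_i}{w_i}$$

$$\beta^{(t+1)} = (X'WX)^{-1}X'Wz = \left(\sum_k X_k'W_kX_k\right)^{-1}\sum_k X_k'W_kz_k$$

where $W=\mathrm{diag}(w_1,\dots,w_n)$. The two sums over $k$ correspond to `xtwx` and `xtwz` in `compute_stats`, which is why one `map()` followed by one `reduce()` per iteration suffices. The code additionally adds `1e-6` to each $w_i$, so its iterates differ very slightly from the exact Newton step.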


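2. Use the estimated $\hat{\beta}$ to predict on the original data. Let $\hat{\rho}_i$ denote the estimated probability that observation $Y_i$ equals 1. Compute a predicted 0-1 label $\hat{l}_i$ for each observation by the following rule: $\hat{l}_i=1$ if $\hat{\rho}_i\ge 0.5$, and $\hat{l}_i=0$ otherwise. Use a distributed algorithm to compute the prediction accuracy of the model, $n^{-1}\sum_{i=1}^n I(Y_i=\hat{l}_i)$, where $I(Y_i=\hat{l}_i)$ equals 1 for a correct prediction and 0 for an incorrect one.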

In [1]:
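import numpy as np  # needed again if the kernel has been restarted

# Requires `dat`, `sigmoid`, and `beta_hat` from part 1 of this problem.
def count_correct(part_mat, beta):
    y = part_mat[:, 0]
    x = np.c_[np.ones(part_mat.shape[0]), part_mat[:, 1:]]
    label = np.where(sigmoid(x.dot(beta)) >= 0.5, 1.0, 0.0)
    return np.sum(label == y), part_mat.shape[0]

# Each partition reports (number correct, number of rows); one reduce() sums both.
right_sum, count = dat.map(lambda part: count_correct(part, beta_hat)).\
    reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(right_sum / count)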
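Accuracy is a ratio of two sums, so each partition only needs to report a pair (number of correct predictions, number of rows). The sketch below illustrates that idea without a Spark session: NumPy row blocks stand in for partitions, and the data and coefficients are made up for the example. It also uses the fact that $\sigma(t)\ge 0.5$ exactly when $t\ge 0$, so the sigmoid need not be evaluated for labeling.

```python
import numpy as np
from functools import reduce

def count_correct(part_mat, beta):
    # part_mat: first column is y, remaining columns are the predictors
    y = part_mat[:, 0]
    x = np.c_[np.ones(part_mat.shape[0]), part_mat[:, 1:]]
    label = (x @ beta >= 0.0).astype(float)     # sigma(t) >= 0.5 iff t >= 0
    return np.sum(label == y), part_mat.shape[0]

# Hypothetical data and coefficients (intercept first)
rng = np.random.default_rng(2)
n, p = 2000, 4
beta = rng.normal(size=p + 1)
x = rng.normal(size=(n, p))
prob = 1.0 / (1.0 + np.exp(-(beta[0] + x @ beta[1:])))
y = rng.binomial(1, prob).astype(float)

# Four row blocks play the role of RDD partitions
parts = np.array_split(np.c_[y, x], 4)
right, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                      (count_correct(m, beta) for m in parts))
acc_dist = right / total

# Reference computed on the full data in one piece
acc_ref = np.mean((np.c_[np.ones(n), x] @ beta >= 0.0) == (y == 1.0))
print(np.isclose(acc_dist, acc_ref))
```

On the cached RDD of partition matrices, the same `count_correct` function could be passed to `map()` and the pairs summed with `reduce()`.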