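# Assignment 4: Distributed Algorithms for Linear Models

### Problem 1

First, use the code below to generate simulated data and write it to a file. The last column of the data is the response $Y$; the remaining columns are the predictors $X$.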

In [1]:
import numpy as np
np.set_printoptions(linewidth=100)
np.random.seed(123)
n = 100000
p = 100
x = np.random.normal(size=(n, p))
beta = np.random.normal(size=p)
y = 1.23 + x.dot(beta) + np.random.normal(scale=2.0, size=n)
dat = np.hstack((x, y.reshape(n, 1)))
np.savetxt("reg_data.txt", dat, fmt="%.8f", delimiter=";")

Start PySpark in local mode with 4 CPU cores, then write distributed programs for the following tasks:

In [2]:
import findspark
findspark.init("")
from pyspark.sql import SparkSession
# local mode
spark = SparkSession.builder.\
    master("local[4]").\
    appName("PySpark RDD").\
    getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
print(spark)
print(sc)

<pyspark.sql.session.SparkSession object at 0x000002280DDF40A0>
<SparkContext master=local[4] appName=PySpark RDD>


1. Print the first 5 rows of the data, truncating each row's string to 80 characters:

In [9]:
file = sc.textFile("reg_data.txt")
text = file.map(lambda x: x[:80]).take(5)
print(*text,sep='\n')

-1.08563060;0.99734545;0.28297850;-1.50629471;-0.57860025;1.65143654;-2.42667924
0.64205469;-1.97788793;0.71226464;2.59830393;-0.02462598;0.03414213;0.17954948;-
0.70331012;-0.59810533;2.20070210;0.68829693;-0.00630725;-0.20666230;-0.08652229
0.76505485;-0.82898883;-0.65915131;0.61112355;-0.14401335;1.31660560;-0.70434215
1.53409029;-0.52991410;-0.49097228;-1.30916531;-0.00866047;0.97681298;-1.7510703


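2. Convert the RDD obtained from reading the data into matrices, one per partition. Use the default number of partitions; there is no need to repartition. Print the data contained in the first non-empty partition after conversion.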

In [10]:
def string_to_vector(line):
    vector = line.split(";")
    return np.array(vector, dtype=float)

def partition_to_matrix (iterator):
    iterator_vec = map(string_to_vector, iterator)
    data = list(iterator_vec)
    if len(data) < 1:
        matrix = np.array([])
    else:
        matrix = np.vstack(data)
    yield matrix

data_partition = file.mapPartitions(partition_to_matrix)
data_partition_nonempty = data_partition.filter(lambda x: x.shape[0] > 0)
print(data_partition_nonempty.first())
print(data_partition_nonempty.getNumPartitions())


[[ -1.0856306    0.99734545   0.2829785  ...   0.37940061  -0.37917643   3.72488966]
 [  0.64205469  -1.97788793   0.71226464 ...  -0.34126172  -0.21794626  10.98088055]
 [  0.70331012  -0.59810533   2.2007021  ...   0.16054442   0.81976061 -12.63028846]
 ...
 [ -0.30751248   0.1323937    2.33256448 ...   0.37475498  -1.37608098 -13.52353737]
 [ -0.02266014  -0.3014796    2.34502536 ...  -2.06082696  -1.20995417 -10.00714174]
 [  0.02415432  -0.3896902   -0.07492828 ...  -0.41935638  -1.68496516   8.33748658]]
4


3. Estimate the regression coefficients of the linear model $Y=X\beta+\varepsilon$, **including an intercept term**. Use `reduce()` **only once**.

$$X^*=[\mathbf{1}, X]$$
$$\hat \beta = ({X^*}^TX^*)^{-1}{X^*}^TY$$
$${X^*}^T[X^*,Y]=[{X^*}^TX^*,\ {X^*}^TY]$$

In [11]:
xt_xy = data_partition_nonempty.\
    map(lambda x: np.hstack((np.ones((np.shape(x)[0],1)),x))).\
    map(lambda x: x[:,:-1].transpose().dot(x) ).\
    reduce (lambda x,y: x+y)

xt_x = xt_xy[:,:-1]
xt_y = xt_xy[:,-1]

hat_beta = np.linalg.solve(xt_x,xt_y)
print(hat_beta)

[ 1.22841355 -0.58056172 -1.12947488  1.16031679  0.68276231  0.64063205 -1.69803101  0.87295008
 -0.6827681   1.21323821 -0.18532546 -0.60313748  0.45016343  1.54732259  0.93536575  0.33661885
 -0.62839196 -0.18223468  1.04004336  0.99530527 -0.22421889  0.26910036 -1.95584105  0.93200566
 -0.46663344 -1.30308226 -1.07451859 -0.9200001  -0.4751849  -0.41498631  0.0893936   0.74250157
  0.44142653  0.78310696  0.0968675  -0.20661749  1.36408459 -0.84452182 -1.56303708 -0.03391736
  0.05672465 -0.01335776 -0.31919022 -1.7366497  -1.35682179 -1.60938262 -1.28888311  0.92820726
  0.9148462  -0.87189391 -1.11327839 -0.65324334 -1.54752238 -1.48016168 -1.40044728  0.06124555
 -2.06832355  0.23966887 -1.45310857 -0.4958114  -1.0917562   1.22608413  0.71866161  0.46548143
 -0.21573557  1.19919219 -0.18470024  0.41716831  0.48748654 -0.28702665 -0.92945413 -2.54835305
  1.21073672 -0.41380347  0.40696645  0.74054168  1.59228068 -0.35873326  0.41181034 -1.44030368
 -0.47743396 -0.27652029 -1.65
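
The single-`reduce()` approach relies on ${X^*}^T[X^*,Y]$ being a sum of per-partition products. The sketch below checks this on a small synthetic data set without Spark: `np.array_split` stands in for the RDD partitions and Python's `sum` for `reduce()`. The names `block_stats` and `blocks` and the data sizes are illustrative assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
x = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = 1.23 + x @ beta_true + rng.normal(scale=0.5, size=n)
data = np.hstack((x, y.reshape(n, 1)))      # same layout as reg_data.txt: [X, Y]

# Stand-in for the RDD partitions: four row blocks of the full matrix.
blocks = np.array_split(data, 4)

def block_stats(block):
    """Return X*^T [X*, Y] for one block, where X* = [1, X]."""
    aug = np.hstack((np.ones((block.shape[0], 1)), block))   # columns: [1, X, Y]
    return aug[:, :-1].T @ aug

# Summing the per-block products plays the role of the single reduce().
xt_xy = sum(block_stats(b) for b in blocks)
beta_blocks = np.linalg.solve(xt_xy[:, :-1], xt_xy[:, -1])

# Reference: ordinary least squares on the full design matrix in one piece.
design = np.hstack((np.ones((n, 1)), x))
beta_full, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.allclose(beta_blocks, beta_full))
```

Only a $(p+1)\times(p+2)$ matrix per partition has to travel to the driver, regardless of how many rows the partition holds.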

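4. Design a distributed algorithm to compute the $R^2$ of the regression model.

Relevant formulas:
$$SSR = \sum (y_i-\hat y_i)^2 = \|Y-X\hat\beta\|_2^2$$
$$SST = \sum (y_i - \bar y)^2 = \sum y_i^2+n\bar y^2-2\bar y \sum y_i $$
$$ R^2 = 1 - SSR/SST$$

Assumptions:
1. $\hat\beta$ and $[X,Y]$ are known.
2. The data are stored as several matrices, one per partition, produced by `mapPartitions`.

Procedure:
1. Augment $X := [\mathbf{1},X]$.
2. Compute $Y-X\hat\beta$ on each partition, then square and sum its entries.
3. Compute $\sum y_i$ and $\sum y_i^2$ on each partition.
4. Combine the per-partition results with `reduce`, then plug them into the $R^2$ formula above.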

In [17]:
sum_y, sum_y_sq, ssr, num = data_partition_nonempty.\
    map(lambda x: np.hstack((np.ones((np.shape(x)[0],1)),x))).\
    map(lambda x: (np.sum(x[:,-1]),np.sum(x[:,-1]**2),np.sum( ( x[:,-1] -  x[:,:-1].dot(hat_beta)  )**2,axis=0),np.shape(x)[0])).\
    reduce(lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]+y[2],x[3]+y[3]) )
print(sum_y,sum_y_sq, ssr, num)
y_bar = sum_y/num
sst = sum_y_sq + num*y_bar**2 - 2*y_bar*sum_y  # use the row count returned by reduce, not the global n
R_sq = 1 - ssr/sst
print(f"R^2: {R_sq}")


116691.99105594002 11636386.644065393 397451.80241834675 100000
R^2: 0.9654396241479573
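
As a sanity check on the procedure, the following sketch computes $R^2$ from per-block sums on a small synthetic data set and compares it with the direct formula. It runs without Spark; `r2_stats` and the four row blocks are assumed stand-ins for the per-partition map and the RDD. It also uses the simplification $SST=\sum y_i^2-(\sum y_i)^2/n$, which follows from the expression above after substituting $\bar y=\sum y_i/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 3
x = rng.normal(size=(n, p))
y = 0.5 + x @ np.array([1.0, -2.0, 0.3]) + rng.normal(size=n)
design = np.hstack((np.ones((n, 1)), x))             # X* = [1, X]
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

def r2_stats(xb, yb, beta):
    """Per-block sufficient statistics: sum y, sum y^2, SSR, row count."""
    resid = yb - xb @ beta
    return np.array([yb.sum(), (yb ** 2).sum(), (resid ** 2).sum(), yb.size])

# Four row blocks stand in for the partitions; sum() stands in for reduce().
parts = zip(np.array_split(design, 4), np.array_split(y, 4))
sum_y, sum_y_sq, ssr, num = sum(r2_stats(xb, yb, beta_hat) for xb, yb in parts)

sst = sum_y_sq - sum_y ** 2 / num                    # simplified SST
r2_blocks = 1 - ssr / sst

# Direct computation on the full data for comparison.
r2_direct = 1 - np.sum((y - design @ beta_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.isclose(r2_blocks, r2_direct))
```

The one-pass SST formula can lose precision when $\bar y$ is large relative to the spread of $y$; for data like the simulated set here that is unlikely to matter.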


### Problem 2

(a) Consider the Softplus function $$\mathrm{softplus}(x)=\log(1+e^x)$$

Use NumPy to write a function `softplus(x)` that accepts a vector or matrix `x` and returns the value of the Softplus function at each element of `x`.

In [10]:
import numpy as np

def softplus(x):
    # insert code here
    print("softplus 1")
    return np.log(1+np.exp(x))

A simple test:

In [12]:
x = np.array([-1000.0, -100.0, -10.0, 0.0, 1.0, 10.0, 100.0, 1000.0])

# the function written above
print(softplus(x))

softplus 1
[0.00000000e+00 0.00000000e+00 4.53988992e-05 6.93147181e-01 1.31326169e+00 1.00000454e+01
 1.00000000e+02            inf]




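(b) Is the result above as expected? If there are abnormal values, think about the likely cause and, following the lecture notes, try writing the Softplus function again. Use NumPy's vectorized functions wherever possible and avoid loops. The function must support both vector and matrix arguments. If everything looks fine, this question can be skipped.

- *For the test value $x=1000.0$, an overflow warning is raised and the returned value is `inf`.*

- *The likely cause is that $e^x$ exceeds the floating-point range when $x$ is large, so the computation overflows.*

- *To fix this, for $x\ge0$ use the equivalent form $\log(1+e^x) = \log[e^x(e^{-x}+1)]=x+\log(1+e^{-x})$, which is numerically stable.*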



In [13]:
def softplus(x):
    # insert code here
    print("softplus 2")
    ans = np.where(x>=0,np.log(1+np.exp(-x))+x,np.log(1+np.exp(x)))
    return ans

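The code above still has a problem, though: `np.where` evaluates both branches on the whole array and only then selects the numerically stable value, so the overflow still occurs in the discarded branch. Using the absolute value removes it:

$x\ge 0:\ \mathrm{softplus}(x)=x+\log(1+e^{-x})$

$x<0:\ \mathrm{softplus}(x)=\log(1+e^x)$

So the log term can be written uniformly as $\log(1+e^{-|x|})$.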

In [14]:
def softplus(x):
    log_num = np.log(1+np.exp(-np.abs(x)))
    ans = np.where(x>=0, x+log_num, log_num)
    return ans

Test again:

In [15]:
print(softplus(x))
print()
print(softplus(x.reshape(4, 2)))

[0.00000000e+00 0.00000000e+00 4.53988992e-05 6.93147181e-01 1.31326169e+00 1.00000454e+01
 1.00000000e+02 1.00000000e+03]

[[0.00000000e+00 0.00000000e+00]
 [4.53988992e-05 6.93147181e-01]
 [1.31326169e+00 1.00000454e+01]
 [1.00000000e+02 1.00000000e+03]]
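
To make the difference between the two rewrites concrete, the sketch below turns overflow into an exception with `np.errstate(over="raise")`. The `np.where` version trips it, the absolute-value version does not, and the latter agrees with NumPy's built-in `np.logaddexp(0, x)`, which equals $\log(e^0+e^x)$. The names `softplus_where` and `softplus_abs` are illustrative; `softplus_abs` is a slight variant of the function above that uses `np.maximum` and `np.log1p`.

```python
import numpy as np

def softplus_where(x):
    # Both branches are evaluated on the whole array before np.where selects.
    return np.where(x >= 0, np.log(1 + np.exp(-x)) + x, np.log(1 + np.exp(x)))

def softplus_abs(x):
    # max(x, 0) + log(1 + exp(-|x|)); the exponent is never positive.
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

with np.errstate(over="raise"):          # make overflow an error instead of a warning
    try:
        softplus_where(x)
        where_overflowed = False
    except FloatingPointError:
        where_overflowed = True
    stable = softplus_abs(x)             # completes without raising

print(where_overflowed)
print(np.allclose(stable, np.logaddexp(0.0, x)))
```

Using `np.log1p` instead of `np.log(1 + ...)` also keeps a small positive result for moderately negative inputs such as $x=-100$, where `1 + np.exp(x)` rounds to exactly 1.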


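### Problem 3

Use the code below to generate simulated data. The first column of the data is the 0-1 response $Y$; the remaining columns are the predictors $X$.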

In [16]:
import numpy as np
from numpy import exp
from scipy.special import expit

np.random.seed(123)
n = 100000
p = 100
x = np.random.normal(size=(n, p))
beta = np.random.normal(size=p)
prob = expit(-0.5 + x.dot(beta))  # p = 1 / (1 + exp(-x * beta))
y = np.random.binomial(1, prob, size=n)
dat = np.hstack((y.reshape(n, 1), x))
np.savetxt("logistic_data.txt", dat, fmt="%.8f", delimiter="\t")

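1. Fit a logistic regression model to the data above. Choose any algorithm to estimate the logistic regression coefficients, **including an intercept term**. Use the Softplus function from Problem 2 to write **numerically stable** functions that compute the objective function and gradient of logistic regression.

$$\begin{aligned}
f_{obj}&=-\sum_i[ y_i\log p_i+(1-y_i)\log (1-p_i)]\\&=-\sum_i [y_i(x_i^T\beta-\log(1+e^{x_i^T\beta}))+(1-y_i)(-\log (1+e^{x_i^T\beta}))] \\&=-\sum_i[ y_i(x_i^T\beta-s(x_i^T\beta))+(y_i-1)s(x_i^T\beta) ]
\end{aligned}$$

where $s(x) = \log(1+e^x)$ is the Softplus function.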
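
The last line of the derivation simplifies to $f_{obj}=\sum_i[s(x_i^T\beta)-y_i x_i^T\beta]$, and differentiating gives the gradient $\sum_i(\sigma(x_i^T\beta)-y_i)x_i$, where $\sigma$ is the sigmoid. The sketch below is one possible implementation of both, checked against a central finite difference. The function name `objective_and_gradient`, the synthetic data, and the tolerance are assumptions made for illustration.

```python
import numpy as np

def softplus(z):
    # Stable form: max(z, 0) + log(1 + exp(-|z|)).
    return np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z)))

def sigmoid(z):
    # exp(-|z|) lies in (0, 1], so neither branch can overflow.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

def objective_and_gradient(x, y, beta):
    """Negative log-likelihood sum(s(z) - y*z) and its gradient X^T (sigmoid(z) - y)."""
    z = x @ beta
    obj = np.sum(softplus(z) - y * z)
    grad = x.T @ (sigmoid(z) - y)
    return obj, grad

# Small synthetic problem; the last column of ones carries the intercept.
rng = np.random.default_rng(2)
x = np.hstack((rng.normal(size=(200, 3)), np.ones((200, 1))))
y = rng.integers(0, 2, size=200).astype(float)
beta = np.array([0.5, -1.0, 2.0, 0.1])

obj, grad = objective_and_gradient(x, y, beta)

# Central finite differences along each coordinate direction.
eps = 1e-6
num_grad = np.array([
    (objective_and_gradient(x, y, beta + eps * d)[0]
     - objective_and_gradient(x, y, beta - eps * d)[0]) / (2 * eps)
    for d in np.eye(beta.size)
])
print(np.allclose(grad, num_grad, rtol=1e-4, atol=1e-4))
```

Both quantities are sums over rows, so in a distributed setting each partition can return its own `(obj, grad)` pair and a single `reduce` can add them.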

In [17]:
# load softplus
test = False
def softplus(x):
    # insert code here
    if (test): 
        print(f"call func: softplus")
    # stable form from Problem 2: the exponent -|x| is never positive
    log_num = np.log(1+np.exp(-np.abs(x)))
    ans = np.where(x>=0, x+log_num, log_num)
    if (test): 
        print(f"return.shape:{ans.shape}")
        print(f"end func: softplus \n")
    return ans

def sigmoid(x):
    # exp(-|x|) cannot overflow, unlike exp(-x) or exp(x) evaluated on the whole array
    e = np.exp(-np.abs(x))
    ans = np.where(x>=0, 1/(1+e), e/(1+e))
    return ans

# load data to rdd
def string_to_vector(line):
    vector = line.split("\t")
    vector = np.append(vector,1.0)
    return np.array(vector, dtype=float)

def partition_to_matrix (iterator):
    iterator_vec = map(string_to_vector, iterator)
    data = list(iterator_vec)
    if len(data) < 1:
        matrix = np.array([])
    else:
        matrix = np.vstack(data)
    yield matrix

file = sc.textFile("logistic_data.txt")

data_partition = file.mapPartitions(partition_to_matrix)
data_partition_nonempty = data_partition.filter(lambda x: x.shape[0] > 0)
data_partition_nonempty.cache()
data_partition_nonempty.count()



4

In [18]:
# compute beta_hat fcn
def compute_betahat(mat,beta_old):

    if (test):
        print("call fnc: comp_bhat\n")

    y = mat[:,0]
    x = mat[:,1:]
    xbeta = x.dot(beta_old)
    prob = sigmoid(xbeta) # Could sigmoid be avoided here, since objfn only uses softplus? But then how would W be computed?
    w = prob * (1.0 - prob) + 1e-6
    xtw = x.transpose() * w
    xtwx = xtw.dot(x)
    z = xbeta + (y - prob) / w
    xtwz = xtw.dot(z)
    objfn = -np.sum( y * (xbeta-softplus(xbeta) ) + (y-1) * softplus(xbeta) )
    return xtwx, xtwz, objfn

# iter computation

p = data_partition_nonempty.first().shape[1]-1 #subtract y
beta_hat = np.zeros(p)#initialization
object_values = [] #init

MaxIteration = 100 #iter settings
epsilon = 1e-6 #iter settings

for i in range(MaxIteration):
    if (test):
        print(f"start iter:{i}")
    # X'WX and X'Wz for the full data are the sums over partitions
    xtwx, xtwz, objfn = data_partition_nonempty.\
        map(lambda part: compute_betahat(part, beta_hat)).\
        reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
    # compute the new beta
    beta_new = np.linalg.solve(xtwx, xtwz)
    if (test):
        print(f"bn{beta_new.shape}")
    # compute the change in beta
    resid = np.linalg.norm(beta_new - beta_hat)
    print(f"Iteration {i}, objfn = {objfn}, resid = {resid}")
    object_values.append(objfn)
    # exit the loop once beta has essentially stopped changing
    if resid < epsilon:
        print(f"Accomplish!\nFinal Iteration {i}, objfn = {objfn}, resid = {resid}")
        break
    # update beta
    beta_hat = beta_new

Iteration 0, objfn = 69314.71805599453, resid = 1.5698521885255372
Iteration 1, objfn = 32646.979227911244, resid = 1.3901997462268305
Iteration 2, objfn = 21647.769318284514, resid = 1.7368788815875644
Iteration 3, objfn = 16036.81715822243, resid = 2.0776610476266577
Iteration 4, objfn = 13369.971015111047, resid = 2.0554425789459
Iteration 5, objfn = 12424.605050958291, resid = 1.3106763115298687
Iteration 6, objfn = 12255.539828228106, resid = 0.34684153390743866
Iteration 7, objfn = 12248.214989925482, resid = 0.018344346618324184
Iteration 8, objfn = 12248.196924487687, resid = 6.370796263421582e-05
Iteration 9, objfn = 12248.196924271282, resid = 6.056472389209912e-08
Accomplish!
Final Iteration 9, objfn = 12248.196924271282, resid = 6.056472389209912e-08
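
Read as iteratively reweighted least squares, which is what `compute_betahat` appears to implement, each pass of the loop above performs the following update. Here $\sigma$ is the sigmoid, $k$ indexes partitions, and the constant $10^{-6}$ is the stabilizer added to the weights in the code:

$$p_i=\sigma(x_i^T\beta^{(t)}),\qquad w_i=p_i(1-p_i)+10^{-6},\qquad z_i=x_i^T\beta^{(t)}+\frac{y_i-p_i}{w_i}$$
$$X^TWX=\sum_k X_k^TW_kX_k,\qquad X^TWz=\sum_k X_k^TW_kz_k$$
$$\beta^{(t+1)}=(X^TWX)^{-1}X^TWz$$

With $W=\mathrm{diag}(w_i)$, only the two partition sums cross the network per iteration, and the linear solve happens on the driver.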


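2. Use the estimated $\hat{\beta}$ to predict on the original data. Let $\hat{\rho}_i$ denote the estimated probability that observation $Y_i$ equals 1. Compute a predicted 0-1 label $\hat{l}_i$ for each observation by the rule: $\hat{l}_i=1$ if $\hat{\rho}_i\ge 0.5$, and $\hat{l}_i=0$ otherwise. Use a distributed algorithm to compute the prediction accuracy of the model, i.e. $n^{-1}\sum_{i=1}^n I(Y_i=\hat{l}_i)$, where $I(Y_i=\hat{l}_i)$ is 1 for a correct prediction and 0 for an incorrect one.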

In [19]:
one = np.ones((x.shape[0],1))
X = np.hstack((x,one))
print(X)

[[-1.0856306   0.99734545  0.2829785  ...  0.37940061 -0.37917643  1.        ]
 [ 0.64205469 -1.97788793  0.71226464 ... -0.34126172 -0.21794626  1.        ]
 [ 0.70331012 -0.59810533  2.2007021  ...  0.16054442  0.81976061  1.        ]
 ...
 [ 0.14100959  0.80978972 -0.42440731 ...  2.24800309 -0.74050246  1.        ]
 [ 0.83784344 -0.61011528  1.25735545 ... -0.2700087  -1.25482477  1.        ]
 [ 0.34676545 -0.52206363 -0.04829659 ...  0.08482555  0.9228148   1.        ]]


In [20]:
probhat = sigmoid(X.dot(beta_hat))
result = np.where(probhat>=0.5,1,0)
if_right  = np.where(result ==y,1,0)
right_num = sum(if_right)
acc = right_num/len(result)
print(f"accuracy:{acc*100}%")

accuracy:94.73700000000001%
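
The cells above compute the accuracy on the driver from the in-memory arrays `x` and `y`. A partition-wise variant, closer to what the question asks for, could count correct labels per block and add the counts. The sketch below shows the idea on synthetic data, with a list of NumPy blocks and `functools.reduce` standing in for the RDD; `count_correct` is a hypothetical helper, and the block layout is assumed to match `In [17]` (column 0 is $Y$, then $X$ with a trailing column of ones).

```python
from functools import reduce
import numpy as np

def count_correct(block, beta):
    """Return (number of correctly predicted rows, number of rows) for one block."""
    y = block[:, 0]
    xbeta = block[:, 1:] @ beta
    # Mathematically sigmoid(t) >= 0.5 exactly when t >= 0, so no exp is needed.
    label = (xbeta >= 0).astype(int)
    return np.sum(label == y), block.shape[0]

# Synthetic stand-in for the partitioned data.
rng = np.random.default_rng(3)
n, p = 1000, 4
x = np.hstack((rng.normal(size=(n, p)), np.ones((n, 1))))
beta = np.array([1.0, -1.0, 0.5, 2.0, -0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-(x @ beta)))).astype(float)
blocks = np.array_split(np.hstack((y.reshape(n, 1), x)), 4)

# map + reduce over blocks, mirroring rdd.map(...).reduce(...).
correct, total = reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    map(lambda blk: count_correct(blk, beta), blocks),
)
acc = correct / total

# Same quantity computed on the full arrays for comparison.
direct = np.mean((x @ beta >= 0) == (y == 1))
print(np.isclose(acc, direct))
```

With Spark, the same helper would presumably plug in as `data_partition_nonempty.map(lambda part: count_correct(part, beta_hat)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))`, so that only two integers per partition reach the driver.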
