# Homework 5

### Problem 1

Use PySpark to implement a distributed program that estimates $\pi$. The principle is as follows:

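Generate $N$ independent, uniformly distributed random points $(X_i,Y_i)$ in the square $\{(x,y):-1\le x \le 1, -1\le y \le 1\}$. Each point $(X_i,Y_i)$ falls inside the disk $R=\{(x,y): x^2+y^2\le 1\}$ with probability $\pi/4$. Therefore, if $n$ of the $N$ generated points fall inside $R$, the estimate of $\pi$ is $4n/N$.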

![](https://media.geeksforgeeks.org/wp-content/uploads/MonteCarlo.png)
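
A short derivation, using only the definitions above, shows why $4n/N$ is a sensible estimate and how accurate it can be expected to be. The probability $p$ of landing in $R$ is the ratio of the two areas, and the count $n$ follows a $\mathrm{Binomial}(N,p)$ distribution, so

$$
p=\frac{\pi\cdot 1^2}{2\cdot 2}=\frac{\pi}{4},\qquad
\mathbb{E}\left[\frac{4n}{N}\right]=4p=\pi,\qquad
\mathrm{sd}\left(\frac{4n}{N}\right)=4\sqrt{\frac{p(1-p)}{N}}=\sqrt{\frac{\pi(4-\pi)}{N}}.
$$

For example, with $N=10^6$ points the standard deviation is roughly $0.0016$, so an estimate that agrees with $\pi$ to two or three decimal places is what one should expect.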

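Now we simulate a large number of random points in parallel using a distributed approach. Split the points into 100 groups of 10000 points each. Each group independently generates its random numbers and counts how many points fall inside the disk; the results of all 100 groups are then combined to obtain the final estimate of $\pi$. To make the result reproducible, group $i$ uses $i$ as its random seed. Run PySpark in local mode with 8 CPU cores.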

**Hint**: After starting PySpark in the standard way, you can use `sc.parallelize()` to create an RDD from an iterable or a list, e.g. `sc.parallelize(range(10))` and `sc.parallelize([1, 2, 3])`.
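
The grouping scheme can be tried out without a Spark installation. The sketch below is only an illustration: it uses Python's built-in `map` and `functools.reduce` as stand-ins for the RDD operations, and the standard-library `random` module instead of NumPy, so its numbers will differ from a NumPy-based run. The names `NUM_GROUPS`, `POINTS_PER_GROUP` and `count_in_circle` are made up for this example.

```python
import math
import random
from functools import reduce

NUM_GROUPS = 100          # number of independent groups
POINTS_PER_GROUP = 10000  # points simulated by each group


def count_in_circle(group_id):
    """Count the points of one group that fall inside the unit disk."""
    rng = random.Random(group_id)  # the group index serves as the seed
    count = 0
    for _ in range(POINTS_PER_GROUP):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            count += 1
    return count


# "map" step: one count per group; "reduce" step: add the counts together
counts = map(count_in_circle, range(NUM_GROUPS))
total = reduce(lambda a, b: a + b, counts)

pi_hat = 4.0 * total / (NUM_GROUPS * POINTS_PER_GROUP)
print(f"pi estimate: {pi_hat:.5f}, abs. error: {abs(pi_hat - math.pi):.5f}")
```

Because every group owns its seed, each group's count is the same no matter which worker processes it or in what order, which is what makes the distributed result reproducible.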

In [2]:
# Initialize PySpark
import findspark
findspark.init("/Users/xinby/Library/Spark")

from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    master("local[8]").\
    appName("PySpark RDD").\
    getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
print(spark)
print(sc)

# Generate the random points with NumPy
import numpy as np

def gen_point(grp_num):
    """Generate a point set with 10000 points;
    grp_num: (int), used to define random seed;
    return: (np.array), a 10000*2 array, each row is a point(x,y);
    """
    np.random.seed(grp_num)
    PointSet = np.random.uniform(low=-1.00,high = 1.00,size=(10000,2))
    return PointSet


def is_in_circle(point):
    """Check if a point is in the circle;
    point: (np.array), a 1*2 array, the point to be checked;
    return: (int), 1 if in, 0 if not;
    """
    if ( point[0]**2 + point[1]**2 <= 1):
        return 1
    else:
        return 0
    
def num_in_circle(PointSet):
    """Count the number of points in the circle;
    PointSet: (np.array), a 10000*2 array, each row is a point(x,y);
    return: (int), number of points in the circle;
    """
    is_in_circle_list = np.apply_along_axis(is_in_circle,1,PointSet)
    num = np.sum(is_in_circle_list)
    return num

# RDD
NUM = sc.parallelize(np.array(range(100))).\
    map(lambda x: gen_point(x)).\
    map(lambda x: num_in_circle(x)).\
    reduce(lambda x,y : x+y)

print(f"pi:{NUM*4/(100*10000)}")

23/06/02 21:03:39 WARN Utils: Your hostname, XinBys-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 172.17.1.210 instead (on interface en0)
23/06/02 21:03:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/06/02 21:03:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/02 21:03:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
<pyspark.sql.session.SparkSession object at 0x7fe8097f38e0>
<SparkContext master=local[8] appName=PySpark RDD>


pi:3.14186

### Problem 2

Building on `lec12-admm-lasso.ipynb`, use the ADMM algorithm to solve the Lasso problem

$$\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda \Vert \beta\Vert_1,$$

and wrap it in a function:

```python
admm_lasso(X, y, lam, rho=1.0, maxit=10000, eps=1e-3, verbose=0)
```

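1. Here `X` is the $n\times p$ matrix of predictors, `y` is the $n\times 1$ response vector, `lam` is the penalty parameter $\lambda$, `rho` is the ADMM parameter $\rho$, `maxit` is the maximum number of iterations, `eps` is the residual tolerance for ADMM convergence, and `verbose` controls whether iteration information is printed: if it is $>0$, the two current residuals are printed every 1000 iterations; if it is $\le 0$, nothing is printed.
2. Following the Cholesky decomposition approach in `lec12-admm-lad.ipynb`, factorize the matrix only once so that the linear system can be solved efficiently in every iteration.
3. The function must return two values: the first is the number of iterations actually used, and the second is the estimated regression coefficients.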
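
For reference, here is a sketch of the standard scaled-form ADMM updates for this objective, obtained by splitting the variable as $\beta=z$. This assumes the lecture notebook uses the same splitting; $z$ and $u$ are the auxiliary and scaled dual variables, and $S_\kappa$ denotes soft-thresholding.

$$
\begin{aligned}
\beta^{k+1}&=(X'X+\rho I)^{-1}\left(X'y+\rho(z^k-u^k)\right),\\
z^{k+1}&=S_{\lambda/\rho}\left(\beta^{k+1}+u^k\right),\qquad S_\kappa(a)=\mathrm{sign}(a)\max(0,|a|-\kappa),\\
u^{k+1}&=u^k+\beta^{k+1}-z^{k+1}.
\end{aligned}
$$

The primal and dual residuals are $r^{k+1}=\beta^{k+1}-z^{k+1}$ and $s^{k+1}=-\rho(z^{k+1}-z^k)$. Since $X'X+\rho I$ does not change across iterations, it only needs to be factorized once.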

In [3]:
# Insert code here
from scipy.linalg import cho_factor, cho_solve

def soft_thresholding(a, k):
    return np.sign(a) * np.maximum(0.0, np.abs(a) - k)

def admm_lasso(X, y, lam, rho=1.0, maxit=10000, eps=1e-3, verbose=0):
    # Insert code here

    M = X
    b = y
    
    p = M.shape[1]


    MtM  = M.transpose().dot(M)
    Mtb = M.transpose().dot(b)
    I = np.eye(p)

    c, lower = cho_factor(MtM+rho*I)

    x = np.zeros(p)
    z = np.zeros(p)
    u = np.zeros(p)


    kappa = lam / rho

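    # ADMM iterations
    resid_r = -999
    resid_s = -999

    for iter in range(maxit):
        # x-update: solve (X'X + rho*I) x = X'y + rho*(z - u) with the cached factor
        xnew = cho_solve((c, lower), Mtb + rho * (z - u))
        # z-update: soft-thresholding with threshold lam / rho
        znew = soft_thresholding(xnew + u, kappa)
        # scaled dual update
        unew = u + xnew - znew

        # norms of the primal and dual residuals
        resid_r = np.linalg.norm(xnew - znew)
        resid_s = np.linalg.norm(-rho * (znew - z))

        x = xnew
        z = znew
        u = unew

        # Print the residuals every 1000 iterations (only if verbose > 0),
        # then check for convergence
        if verbose > 0 and iter % 1000 == 0:
            print(f"Iteration {iter}, ||r|| = {resid_r:.6f}, ||s|| = {resid_s:.6f}")
        if resid_r <= eps and resid_s <= eps:
            if verbose > 0:
                print(f"Iteration {iter}, ||r|| = {resid_r:.6f}, ||s|| = {resid_s:.6f}")
            break

    # Note: `iter` is the 0-based index of the last iteration that was run
    return iter, x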
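
The "factor once, reuse in every iteration" idea behind `cho_factor` / `cho_solve` can be seen in isolation with a small hand-written sketch. The helpers `cholesky` and `solve_with_factor` below are hypothetical pure-Python stand-ins for the SciPy routines, meant for illustration rather than performance, and the $2\times 2$ matrix is an arbitrary symmetric positive definite example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_with_factor(low, b):
    """Solve (L L^T) x = b by forward substitution followed by back substitution."""
    n = len(b)
    w = [0.0] * n
    for i in range(n):  # forward: L w = b
        w[i] = (b[i] - sum(low[i][k] * w[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward: L^T x = w
        x[i] = (w[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


A = [[4.0, 2.0], [2.0, 3.0]]
low = cholesky(A)  # the expensive step, done a single time

# The cheap triangular solves are repeated for each new right-hand side
x1 = solve_with_factor(low, [2.0, 1.0])
x2 = solve_with_factor(low, [0.0, 4.0])
print(x1, x2)
```

In the ADMM loop only the right-hand side changes between iterations, so each iteration costs two triangular solves instead of a fresh factorization.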


Test the function above on simulated training data:

In [4]:
np.random.seed(123)
n = 1000
p = 30
nz = 20
Xtrain = np.random.normal(size=(n, p))
# Only the first 20 elements of the true beta are nonzero; the rest are 0
beta = np.random.normal(size=nz)
beta = np.concatenate((beta, np.zeros(p - nz)))
ytrain = Xtrain.dot(beta) + np.random.normal(size=n)
beta

array([-1.05417044, -0.78301134,  1.82790084,  1.7468072 ,  1.3282585 ,
       -0.43277314, -0.6686141 , -0.47208845,  1.05554064,  0.67905585,
        0.14814832,  1.04294573,  0.28718991,  1.55577283,  0.97031604,
        0.39737593,  1.15394013, -0.00333042,  1.30948521, -0.90230241,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])

In [5]:
admm_lasso(Xtrain, ytrain, lam=0.1 * n, maxit=10000, eps=1e-3, verbose=1)

Iteration 0, ||r|| = 4.571809, ||s|| = 0.000000
Iteration 1000, ||r|| = 0.067950, ||s|| = 0.000009
Iteration 2000, ||r|| = 0.019016, ||s|| = 0.000003
Iteration 3000, ||r|| = 0.007483, ||s|| = 0.000001
Iteration 4000, ||r|| = 0.002978, ||s|| = 0.000000
Iteration 5000, ||r|| = 0.001199, ||s|| = 0.000000
Iteration 5201, ||r|| = 0.001000, ||s|| = 0.000000


(5201,
 array([-9.88446192e-01, -7.29951991e-01,  1.72843395e+00,  1.66188615e+00,
         1.18779108e+00, -1.94466297e-01, -5.94711189e-01, -3.91430856e-01,
         1.01063023e+00,  5.73786673e-01,  3.36364138e-02,  9.31135970e-01,
         2.21897026e-01,  1.51032137e+00,  9.07779872e-01,  2.93449914e-01,
         1.08151311e+00, -2.97145094e-04,  1.17431918e+00, -7.88572868e-01,
        -1.56776389e-04,  3.81700889e-04, -4.71138798e-04, -2.07577854e-04,
        -3.67392622e-04, -3.15013324e-04, -8.35414719e-06,  5.02959595e-06,
         4.88460731e-04, -5.39329955e-05]))

In [6]:
admm_lasso(Xtrain, ytrain, lam=0.01 * n, maxit=10000, eps=1e-3, verbose=0)

(1286,
 array([-1.07555904e+00, -8.14460224e-01,  1.79118556e+00,  1.72909346e+00,
         1.27448621e+00, -3.06897473e-01, -6.69469287e-01, -4.73021701e-01,
         1.09124222e+00,  6.69340756e-01,  1.24876014e-01,  1.02527211e+00,
         3.02106479e-01,  1.58722372e+00,  9.68663224e-01,  3.84937457e-01,
         1.15919477e+00, -3.76698667e-02,  1.27397237e+00, -9.01267833e-01,
        -7.42409908e-03,  1.91910835e-02, -6.06421740e-02, -2.57202269e-02,
        -1.93040114e-02, -3.33947154e-02,  9.99301503e-04, -1.52979627e-02,
         2.22508621e-02, -2.01260868e-02]))
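
One way to sanity-check ADMM updates of this kind is the single-predictor case, where the Lasso solution has a closed form: soft-threshold $x'y$ by $\lambda$ and divide by $x'x$. The sketch below is a scalar, pure-Python illustration; `admm_lasso_1d` and the toy data are made up for this example, and unlike the function above it returns the iteration count as a 1-based number.

```python
def soft_threshold(a, k):
    """Scalar soft-thresholding: sign(a) * max(0, |a| - k)."""
    if a > k:
        return a - k
    if a < -k:
        return a + k
    return 0.0


def admm_lasso_1d(x, y, lam, rho=1.0, maxit=10000, eps=1e-10):
    """ADMM for 0.5 * sum((y_i - x_i * b)^2) + lam * |b| with one coefficient b."""
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    b = z = u = 0.0
    niter = 0
    for niter in range(1, maxit + 1):
        b = (xty + rho * (z - u)) / (xtx + rho)   # b-update (scalar linear solve)
        z_new = soft_threshold(b + u, lam / rho)  # z-update
        u = u + b - z_new                         # scaled dual update
        resid_r = abs(b - z_new)                  # primal residual
        resid_s = abs(rho * (z_new - z))          # dual residual
        z = z_new
        if resid_r <= eps and resid_s <= eps:
            break
    return niter, b


x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
lam = 2.0

niter, b_admm = admm_lasso_1d(x, y, lam)
# Closed form for one predictor: soft-threshold x'y by lam, then divide by x'x
b_exact = soft_threshold(sum(a * c for a, c in zip(x, y)), lam) / sum(a * a for a in x)
print(niter, b_admm, b_exact)
```

If the two printed coefficients agree to within the tolerance, the update order and the threshold $\lambda/\rho$ are wired up consistently.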

### Problem 3

Use the function written in Problem 2 to make predictions on a new test set. First generate the simulated data:

In [7]:
np.random.seed(123)
ntest = 500
p = 30
Xtest = np.random.normal(size=(ntest, p))
ytest = Xtest.dot(beta) + np.random.normal(size=ntest)

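Take $\lambda=0.1 n$, estimate the regression coefficients on the training set, then predict the response on the test set and compute the mean squared error of the predictions, i.e.
$$
MSE=\frac{1}{n_{test}}\sum_{i=1}^{n_{test}}(\hat{y}_i-y_i)^2,
$$
where $y_i$ is the response value of the $i$-th test observation and $\hat{y}_i=x_i'\hat{\beta}$ is its predicted value.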
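
As a tiny worked instance of the formula, the following sketch computes the MSE for three made-up observations using plain Python lists. The function name `mean_squared_error` is chosen for this example only.

```python
def mean_squared_error(y_hat, y):
    """Average of the squared differences between predictions and observations."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]

# Only the third prediction is off (by 2), so the MSE is (0 + 0 + 4) / 3
mse = mean_squared_error(y_pred, y_true)
print(mse)
```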

In [8]:
# Insert code here
num_train = Xtrain.shape[0]
num_test = Xtest.shape[0]
fit = admm_lasso(X = Xtrain, y = ytrain, lam = 0.1*num_train)
betahat = fit[1]

yhat =  Xtest.dot(betahat)
MSE = np.sum((yhat - ytest)**2)/num_test
print(f"MSE:{MSE}")


MSE:1.178361311126027


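Use PySpark to tune the Lasso parameter $\lambda$ in parallel and examine how $\rho$ affects the convergence speed of the algorithm. Take $\rho=0.1,0.2,\ldots,1.0$ and $\lambda=0.1n,0.01n,0.001n$. For each of these 30 combinations of $\rho$ and $\lambda$, fit the Lasso model on the training set, return the number of iterations, and compute the prediction MSE on the test set. The final output should look like this: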

```
rho = 0.1, lambda/n = 0.1, niter = ..., mse = ...
rho = 0.1, lambda/n = 0.01, niter = ..., mse = ...
...
```

**Hint**: First build the list of all combinations of $\rho$ and $\lambda$, similar to `params = [(0.1, 0.1), (0.1, 0.01), (0.1, 0.001), (0.2, 0.1), ...]`, then create an RDD with `sc.parallelize(params)`, and finally apply `map()` and `collect()` to that RDD.
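
One possible way to build the parameter grid from the hint is `itertools.product`, sketched below without Spark. The `describe` function is a hypothetical placeholder for the real per-combination work (fitting the model and computing the MSE), and the built-in `map` stands in for the RDD `map()`.

```python
from itertools import product

rhos = [k / 10 for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
lams_over_n = [0.1, 0.01, 0.001]       # values of lambda / n

# rho varies slowest, which matches the order shown in the hint
params = list(product(rhos, lams_over_n))


def describe(param):
    """Placeholder for the real work done on each (rho, lambda / n) pair."""
    rho, lam = param
    return f"rho = {rho}, lambda/n = {lam}"


for line in map(describe, params[:4]):
    print(line)
print(f"{len(params)} combinations in total")
```

Dividing integers by 10 avoids the accumulated rounding error that repeated addition of `0.1` would introduce.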

In [9]:
# Insert code here
# define func MSE
def cal_MSE(y_hat, y_real):
    MSE = np.sum((y_hat-y_real)**2)/(y_hat.shape[0])
    return MSE

# set Params
rho_list = np.round(np.linspace(start = 0.1, stop = 1, num = 10),decimals = 3)
lam_list = np.array([0.1,0.01,0.001])
params = [(rho,lam) for rho in rho_list for lam in lam_list ]
print(params)

# transform params into rdd
result = sc.parallelize(params).\
    map(lambda p: (p[0],p[1], admm_lasso(X = Xtrain, y = ytrain, lam = p[1]*num_train, rho = p[0]))).\
    map(lambda fit: (fit[0],fit[1],fit[2][0],cal_MSE (y_hat = Xtest.dot(fit[2][1]), y_real = ytest))).\
    collect()

for trial in range(len(result)):
    fit_result = result[trial]
    print(f"{trial+1}: rho:{fit_result[0]}, lambda/n:{fit_result[1]}, niter:{fit_result[2]+1}, MSE:{fit_result[3]}")

[(0.1, 0.1), (0.1, 0.01), (0.1, 0.001), (0.2, 0.1), (0.2, 0.01), (0.2, 0.001), (0.3, 0.1), (0.3, 0.01), (0.3, 0.001), (0.4, 0.1), (0.4, 0.01), (0.4, 0.001), (0.5, 0.1), (0.5, 0.01), (0.5, 0.001), (0.6, 0.1), (0.6, 0.01), (0.6, 0.001), (0.7, 0.1), (0.7, 0.01), (0.7, 0.001), (0.8, 0.1), (0.8, 0.01), (0.8, 0.001), (0.9, 0.1), (0.9, 0.01), (0.9, 0.001), (1.0, 0.1), (1.0, 0.01), (1.0, 0.001)]




1: rho:0.1, lambda/n:0.1, niter:10000, MSE:1.1776481573538624
2: rho:0.1, lambda/n:0.01, niter:10000, MSE:1.0461628597163128
3: rho:0.1, lambda/n:0.001, niter:2469, MSE:1.0544803270461345
4: rho:0.2, lambda/n:0.1, niter:10000, MSE:1.178891757864105
5: rho:0.2, lambda/n:0.01, niter:6429, MSE:1.0461758946479949
6: rho:0.2, lambda/n:0.001, niter:1235, MSE:1.0544803365447828
7: rho:0.3, lambda/n:0.1, niter:10000, MSE:1.1784775331915271
8: rho:0.3, lambda/n:0.01, niter:4286, MSE:1.0461758914463362
9: rho:0.3, lambda/n:0.001, niter:824, MSE:1.0544803449550013
10: rho:0.4, lambda/n:0.1, niter:10000, MSE:1.1783893073944776
11: rho:0.4, lambda/n:0.01, niter:3215, MSE:1.0461758960880838
12: rho:0.4, lambda/n:0.001, niter:619, MSE:1.0544803244140812
13: rho:0.5, lambda/n:0.1, niter:10000, MSE:1.1783639995979076
14: rho:0.5, lambda/n:0.01, niter:2572, MSE:1.0461758928138751
15: rho:0.5, lambda/n:0.001, niter:496, MSE:1.0544803174526547
16: rho:0.6, lambda/n:0.1, niter:8668, MSE:1.1783613194710008