# Homework 3

### Problem 1: Statistical Computing Exercise

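(a) Generate a $10000\times 1000$ matrix `X` whose entries are drawn from the standard normal distribution. Generate a vector `y` of length 10000 whose entries are drawn from a normal distribution with mean 0 and variance 2. Generate a vector `w` of length 10000 whose entries are drawn from the uniform distribution on $(1,5)$.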

In [1]:
# initialize ray
import ray
ray.init(num_cpus=8)

2024-05-05 15:03:51,330	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Python version: 3.10.13
Ray version: 2.9.3
Dashboard: http://127.0.0.1:8265


In [2]:
# load required libraries
import numpy as np
import more_itertools as mit
import functools
import time

In [3]:
n = 10000
p = 1000
X = np.random.normal(size=(n, p))
y = np.random.normal(loc=0, scale=np.sqrt(2), size=n)
w = np.random.uniform(low=1.0, high=5.0, size=n)

(b) Compute $\hat{y}=X(X^{T}WX)^{-1}X^{T}Wy$ in an appropriate way, where `W` is the diagonal matrix with `w` on its diagonal, and print the time the computation takes.

In [4]:
start = time.time()
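# X^T W via broadcasting: column i of X^T is scaled by w[i],
# so the n x n diagonal matrix W is never built
XTw = X.T * w
# solve (X^T W X) b = X^T W y rather than forming the inverse explicitly
y_hat = X.dot(np.linalg.solve(XTw.dot(X), XTw.dot(y.reshape(n, 1))))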
end = time.time()
print("Time taken for calculation: ", end - start)
print("Result: ", y_hat)

Time taken for calculation:  0.22847342491149902
Result:  [[ 0.03412735]
 [-0.37166228]
 [-0.46364038]
 ...
 [-0.43557756]
 [ 0.67465689]
 [-0.26852423]]
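
The broadcasting step `X.T * w` stands in for the product $X^{T}W$. The following is a minimal sketch that checks this equivalence against an explicitly built diagonal matrix. The sizes `n_small` and `p_small`, the seed, and the `_s` variable names are assumptions chosen for illustration so that $W$ fits in memory; they are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_small, p_small = 200, 5       # hypothetical small sizes so that W fits in memory

X_s = rng.normal(size=(n_small, p_small))
y_s = rng.normal(loc=0.0, scale=np.sqrt(2), size=n_small)
w_s = rng.uniform(low=1.0, high=5.0, size=n_small)

# Broadcasting version: column i of X^T is scaled by w[i], which equals X^T W.
XTw_s = X_s.T * w_s
y_hat_fast = X_s @ np.linalg.solve(XTw_s @ X_s, XTw_s @ y_s)

# Explicit version: build the n x n diagonal matrix W.
W_s = np.diag(w_s)
y_hat_naive = X_s @ np.linalg.solve(X_s.T @ W_s @ X_s, X_s.T @ W_s @ y_s)

agree = np.allclose(y_hat_fast, y_hat_naive)
print("Broadcast and explicit-W results agree:", agree)
```

With $n = 10000$ the explicit `W` would hold $10^8$ entries, which is why the broadcasting form is the practical choice at full size.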


### Problem 2: Processing Large Data with Ray

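Given the data file `sim_data.txt`, use a suitable method to split the data into chunks of 12345 rows each, and use Ray to compute the mean and standard deviation of every column in parallel. Annotate or comment each step as much as possible.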

In [5]:
# read the first 5 lines
with open("sim_data.txt", encoding="utf-8") as file:
    for _ in range(5):
        print(next(file), end="")  # each line already ends with "\n"

# The first line is a comment. You should skip it.
# The second line is another comment.
0.696469,0.286139,0.226851,0.551315,0.719469,0.423106,0.980764,0.684830,0.480932,0.392118
0.343178,0.729050,0.438572,0.059678,0.398044,0.737995,0.182492,0.175452,0.531551,0.531828
0.634401,0.849432,0.724455,0.611024,0.722443,0.322959,0.361789,0.228263,0.293714,0.630976


In [6]:
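def txt_to_mat(lines):
    # parse a chunk of CSV lines into a 2-D NumPy array
    # (ndmin=2 keeps a one-row final chunk two-dimensional)
    return np.loadtxt(lines, delimiter=",", ndmin=2)

def mat_to_obj(mat):
    return ray.put(mat)  # store the array in the Ray object store

batch_size = 12345
with open("sim_data.txt", encoding="utf-8") as file:
    next(file)
    next(file)  # skip the two comment lines at the top of the file
    it_chunk = mit.chunked(file, batch_size)  # lazily group lines into lists of up to 12345
    it_mat = map(txt_to_mat, it_chunk)        # each chunk -> NumPy array
    it_obj = map(mat_to_obj, it_mat)          # each array -> object reference
    batches = list(it_obj)                    # consume the iterators while the file is open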

In [7]:
@ray.remote
def batch_summation(batch) -> tuple:
    """
    batch: a batch of data, each row is a vector
    return:
        linear_sum: np.ndarray
        quadratic_sum: np.ndarray
        row_count: int
    """
    linear_sum = np.sum(batch, axis=0)
    quadratic_sum = np.sum(batch**2, axis=0)
    row_count = batch.shape[0]
    return (linear_sum, quadratic_sum, row_count)

@ray.remote
def summation(a, b):
    """
    calculate the summation of two batches
    """
    linear_sum = a[0] + b[0]
    quadratic_sum = a[1] + b[1]
    row_count = a[2] + b[2]
    return (linear_sum, quadratic_sum, row_count)

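# submit one batch_summation task per batch; each call returns an object reference
# (separate names keep the remote functions from being overwritten, so the cell can be re-run)
partial_refs = [batch_summation.remote(batch) for batch in batches]

# combine the partial sums pairwise; Ray resolves references passed as arguments
total_ref = functools.reduce(summation.remote, partial_refs)
linear_sum, quadratic_sum, row_count = ray.get(total_ref)

mean = linear_sum / row_count
# population variance (divisor n): E[x^2] - (E[x])^2
var = quadratic_sum / row_count - mean**2
std = np.sqrt(var)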

print("Mean: ", mean)
print("Standard Deviation: ", std)

Mean:  [0.4998118  0.50005466 0.49987793 0.50002139 0.50005731 0.50007364
 0.49995952 0.50009029 0.49997664 0.50001398]
Standard Deviation:  [0.28861135 0.28863712 0.2887348  0.28869777 0.28871596 0.28872578
 0.28873098 0.28869691 0.28867944 0.28868873]
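
The reduction above relies on the fact that the per-column sum, sum of squares, and row count can be accumulated chunk by chunk and then combined into the mean and the population standard deviation. The following is a minimal sketch in plain NumPy, without Ray, that checks this against `np.mean` and `np.std`. The array `data`, the chunk size `chunk_rows`, and the seed are hypothetical stand-ins for illustration, not the contents of `sim_data.txt`.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed, chosen only for reproducibility
data = rng.uniform(size=(1000, 3))   # hypothetical stand-in for the real data
chunk_rows = 123                     # hypothetical chunk size; the last chunk is shorter

linear_sum = np.zeros(3)
quadratic_sum = np.zeros(3)
row_count = 0
for start in range(0, data.shape[0], chunk_rows):
    chunk = data[start:start + chunk_rows]
    linear_sum += chunk.sum(axis=0)           # per-column sum of x
    quadratic_sum += (chunk**2).sum(axis=0)   # per-column sum of x^2
    row_count += chunk.shape[0]

mean_chunked = linear_sum / row_count
std_chunked = np.sqrt(quadratic_sum / row_count - mean_chunked**2)

mean_ok = np.allclose(mean_chunked, data.mean(axis=0))
std_ok = np.allclose(std_chunked, data.std(axis=0))  # np.std uses divisor n by default
print("Mean matches:", mean_ok, "| Std matches:", std_ok)
```

If the sample standard deviation (divisor $n-1$) is wanted instead, the variance can be multiplied by $n/(n-1)$ before taking the square root.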


### 第3题：分布式矩阵乘法

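Suppose the data from Problem 2 is represented as an $n\times d$ matrix $X$. Randomly generate a $d\times 1$ vector $v$ whose entries are independent standard normal, then choose a suitable method to compute $n^{-1}X^T Xv$ in a distributed way.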

In [8]:
d = ray.get(batches[0]).shape[1]
v = np.random.normal(size=d)

@ray.remote
def prod(batch, v):
    xv = batch.dot(v)
    xt_xv = batch.T.dot(xv)
    row_count = batch.shape[0]
    return (row_count, xt_xv)

@ray.remote
def add(x, y):
    return (x[0] + y[0], x[1] + y[1])

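# one task per batch; each returns (row count, X_k^T X_k v)
# (a separate name keeps the remote function prod from being overwritten)
prod_refs = [prod.remote(batch, v) for batch in batches]
row_and_prod = functools.reduce(add.remote, prod_refs)
row_count, xt_xv = ray.get(row_and_prod)
result = xt_xv / row_count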
print("Result: ", result)

Result:  [-0.53283082 -0.46108077 -0.41977192 -0.48558497 -0.46904915 -0.48546932
 -0.52603428 -0.517377   -0.66091507 -0.46562161]
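
The distributed computation rests on the row-block identity $X^{T}Xv=\sum_k X_k^{T}(X_k v)$, where $X_k$ are the row blocks of $X$. Each block only needs two matrix-vector products, so the $d\times d$ matrix $X^{T}X$ is never formed. The following is a minimal sketch in plain NumPy, without Ray, that checks the identity. The names `X_demo` and `v_demo`, the sizes, and the number of blocks are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)          # fixed seed, chosen only for reproducibility
X_demo = rng.uniform(size=(1000, 4))    # hypothetical stand-in for the n x d data matrix
v_demo = rng.normal(size=4)

blocks = np.array_split(X_demo, 7)      # 7 row blocks of unequal size

acc = np.zeros(4)
n_rows = 0
for block in blocks:
    acc += block.T @ (block @ v_demo)   # X_k^T (X_k v): two matrix-vector products
    n_rows += block.shape[0]

result_blocked = acc / n_rows
result_direct = X_demo.T @ X_demo @ v_demo / X_demo.shape[0]

blocked_ok = np.allclose(result_blocked, result_direct)
print("Blocked and direct results agree:", blocked_ok)
```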


In [9]:
ray.shutdown()