# Principal Component Analysis (PCA)

## Model

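Combine several indicators into a smaller number of new indicators (**dimensionality reduction**) such that:

1. The new indicators summarize the original indicators more **concisely**.

2. The new indicators **overlap less** with one another; ideally they are orthogonal, which PCA achieves by construction.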
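
One common way to make these two goals precise is through an eigendecomposition. The notation here is an assumption and does not appear in the original: $X$ is the standardized $n \times p$ data matrix, $R$ is its covariance matrix, and $V_k$ holds the first $k$ columns of $V$.

$$
R = \frac{1}{n-1} X^\top X = V \Lambda V^\top, \qquad \Lambda = \operatorname{diag}(\lambda_1 \ge \dots \ge \lambda_p \ge 0), \qquad V^\top V = I
$$

$$
Z = X V_k, \qquad \operatorname{Cov}(Z) = \operatorname{diag}(\lambda_1, \dots, \lambda_k), \qquad \text{contribution rate of component } i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}
$$

The diagonal covariance of $Z$ is the "low overlap" property, and a cumulative contribution rate close to 1 for small $k$ is the "concise summary" property.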

## Implementation

In [95]:
import numpy as np
import scipy.linalg as linalg
from sklearn.decomposition import PCA

#! Edit the input data here (rows are samples, columns are indicators)
data = np.array([
    [1, 2, 3, 1, 2, ],
    [4, 5, 6, 2, 0, ],
    [7, 8, 9, 1, 2, ],
    [7, 8, 10, 1, 2, ],
])

In [96]:
# Standardize each column (zero mean, unit sample standard deviation)
X = (data - np.mean(data, axis=0)) / np.std(data, ddof=1, axis=0)

print("Standardized data: \n", X)

Standardized data:
 [[-1.30558242 -1.30558242 -1.26491106 -0.5         0.5       ]
 [-0.26111648 -0.26111648 -0.31622777  1.5        -1.5       ]
 [ 0.78334945  0.78334945  0.63245553 -0.5         0.5       ]
 [ 0.78334945  0.78334945  0.9486833  -0.5         0.5       ]]


In [97]:
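# Compute the covariance matrix of the standardized data
# (np.cov expects variables in rows, hence the transpose)
R = np.cov(X.T)

print("Covariance matrix of the data: \n", R)

Covariance matrix of the data: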
 [[ 1.          1.          0.99086739 -0.17407766  0.17407766]
 [ 1.          1.          0.99086739 -0.17407766  0.17407766]
 [ 0.99086739  0.99086739  1.         -0.21081851  0.21081851]
 [-0.17407766 -0.17407766 -0.21081851  1.         -1.        ]
 [ 0.17407766  0.17407766  0.21081851 -1.          1.        ]]
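
Because `X` is standardized with `ddof=1`, its covariance matrix should coincide with the correlation matrix of the raw data. A minimal check, assuming the same `data` array as above; `np.corrcoef` and the name `R_corr` are not used in the original:

```python
import numpy as np

data = np.array([
    [1, 2, 3, 1, 2],
    [4, 5, 6, 2, 0],
    [7, 8, 9, 1, 2],
    [7, 8, 10, 1, 2],
], dtype=float)

# Standardize exactly as in the cell above
X = (data - np.mean(data, axis=0)) / np.std(data, ddof=1, axis=0)

R = np.cov(X.T)                # covariance of the standardized data
R_corr = np.corrcoef(data.T)   # correlation of the raw data

print("Same matrix:", np.allclose(R, R_corr))
```

This is why the diagonal of `R` is all ones: every standardized column has unit variance.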


In [98]:
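# Compute eigenvalues and eigenvectors
eigen_val, eigen_vec = linalg.eigh(R) # eigh returns eigenvalues in ascending order
# Reorder to descending (eigenvectors are the columns of eigen_vec)
eigen_val = eigen_val[::-1]
eigen_vec = eigen_vec[:, ::-1]

print("Eigenvalues of the covariance matrix:\n", eigen_val)
print("Eigenvectors of the covariance matrix:\n", eigen_vec)

Eigenvalues of the covariance matrix:
 [ 3.16644839e+00  1.82234844e+00  1.12031705e-02 -2.60862082e-17
 -2.09414480e-16]
Eigenvectors of the covariance matrix: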
 [[ 5.36147002e-01 -2.19689029e-01 -4.05318544e-01  0.00000000e+00
   7.07106781e-01]
 [ 5.36147002e-01 -2.19689029e-01 -4.05318544e-01 -4.84741921e-15
  -7.07106781e-01]
 [ 5.40594477e-01 -1.91905612e-01  8.19103075e-01  4.87406071e-15
   1.33536447e-14]
 [-2.57730863e-01 -6.58272625e-01  1.58730574e-02  7.07106781e-01
  -2.23782600e-15]
 [ 2.57730863e-01  6.58272625e-01 -1.58730574e-02  7.07106781e-01
  -2.63327752e-15]]


In [99]:
# Compute the contribution rate (explained variance ratio) and its cumulative sum
contribution_rate = eigen_val / sum(eigen_val)
cum_contribution_rate = np.cumsum(contribution_rate)

print("Contribution rate:\n", contribution_rate)
print("Cumulative contribution rate:\n", cum_contribution_rate)

Contribution rate:
 [ 6.33289678e-01  3.64469688e-01  2.24063410e-03 -5.21724164e-18
 -4.18828960e-17]
Cumulative contribution rate:
 [0.63328968 0.99775937 1.         1.         1.        ]
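
The cumulative contribution rate can be turned into a component count directly. The sketch below starts from the eigenvalues printed above; the names `threshold` and `k` are illustrative and not part of the original notebook. The tiny negative eigenvalues are treated as floating-point round-off and clipped to zero, which is an assumption about their origin (with 4 samples, the centered data has rank at most 3).

```python
import numpy as np

# Eigenvalues copied from the output above, in descending order
eigen_val = np.array([3.16644839e+00, 1.82234844e+00, 1.12031705e-02,
                      -2.60862082e-17, -2.09414480e-16])

# Clip round-off negatives so the ratios are valid proportions
eigen_val = np.clip(eigen_val, 0.0, None)

cum_contribution_rate = np.cumsum(eigen_val / eigen_val.sum())

# Smallest number of components whose cumulative rate reaches the threshold
threshold = 0.95
k = int(np.searchsorted(cum_contribution_rate, threshold) + 1)

print("Components needed:", k)
```

With these eigenvalues the first component alone covers about 63% of the variance, so the threshold is only reached once the second component is included.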


In [None]:
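# PCA

#! Set the number of principal components based on the cumulative contribution rate
#  an int is the number of components to keep
#  a float in (0, 1) is the cumulative contribution rate threshold
MAIN_COMPONENT = 0.95

# Apply PCA
PCAX = PCA(n_components=MAIN_COMPONENT).fit_transform(X)

print("Data after PCA dimensionality reduction:\n", PCAX)

Data after PCA dimensionality reduction: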
 [[-1.82604127 -1.47466043]
 [-1.22413721  1.79940314]
 [ 1.43961375 -0.1927143 ]
 [ 1.61056473 -0.13202841]]
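
The reduced data can also be reproduced by hand from the eigenvectors, which ties the manual steps to the `PCA` result. This is a NumPy-only sketch, not the notebook's own method: `np.linalg.eigh` stands in for `scipy.linalg.eigh`, and `k`, `scores`, and `reported` are illustrative names. Each component is only determined up to sign, so the comparison uses absolute values.

```python
import numpy as np

data = np.array([
    [1, 2, 3, 1, 2],
    [4, 5, 6, 2, 0],
    [7, 8, 9, 1, 2],
    [7, 8, 10, 1, 2],
], dtype=float)

# Standardize, then eigendecompose the covariance matrix (descending order)
X = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
eigen_val, eigen_vec = np.linalg.eigh(np.cov(X.T))
eigen_val, eigen_vec = eigen_val[::-1], eigen_vec[:, ::-1]

# Project onto the first k eigenvectors
k = 2
scores = X @ eigen_vec[:, :k]

# Values printed by the PCA cell above
reported = np.array([
    [-1.82604127, -1.47466043],
    [-1.22413721,  1.79940314],
    [ 1.43961375, -0.1927143 ],
    [ 1.61056473, -0.13202841],
])

print("Matches PCA output up to sign:",
      np.allclose(np.abs(scores), np.abs(reported), atol=1e-6))
print("Score variances equal the eigenvalues:",
      np.allclose(scores.var(axis=0, ddof=1), eigen_val[:k]))
```

The second check reflects that the sample variance of each projected column equals the corresponding eigenvalue of the covariance matrix.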
