# PCA Analysis

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of large data sets. PCA creates a new set of variables that are linear combinations of the original variables. The new variables, called principal components (PCs), are ranked in order of how much of the variation in the original data they explain.

Here are some steps to analyze the results of a PCA:

1. Check the amount of variation explained by each principal component. In scikit-learn this is the `explained_variance_ratio_` attribute of the fitted PCA object. The ratios sum to 1 only when all components are retained; with fewer components, the sum is the share of total variance that is kept.
2. Look at the loadings of the original variables on each principal component. The rows of the `components_` attribute of the PCA object hold the unit-length weights of each variable on a component. When the data are standardized, scaling these weights by the square root of the component's variance gives the correlation between each original variable and the component.
3. Plot the scores for each principal component to see how the observations relate to each other based on the principal components.
4. Check the correlation between the original variables to see if there are any patterns in the data that are not captured by the principal components.

Finally, the principal components can serve as input variables for subsequent analysis, such as regression or clustering.

Overall, PCA can be a useful tool for data analysis and visualization, but it is important to interpret the results carefully and consider the limitations of the technique.
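
The quantities named in the steps above can be computed by hand to see what they contain. The following is a minimal sketch on made-up data, using only NumPy rather than a PCA library; the variable names (`explained_ratio`, `components`, `loadings`, `scores`) are illustrative and not taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 50 observations, 4 variables, two of them strongly correlated.
X = rng.normal(size=(50, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)

# Standardize so every variable has zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
eigenvalues = S**2 / (len(Z) - 1)       # variance along each component
explained_ratio = eigenvalues / eigenvalues.sum()
components = Vt                         # rows are unit-length eigenvectors
scores = Z @ Vt.T                       # observations in PC coordinates

# Loadings: eigenvectors scaled by each component's standard deviation.
# For standardized inputs they equal the variable-component correlations.
loadings = components.T * np.sqrt(eigenvalues)

print(explained_ratio.round(3), explained_ratio.sum())
```

Here `loadings` has one row per variable and one column per component, and the ratios sum to 1 because all four components are kept.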


Bayani (2021) showed that combining PCA with the synthetic control method (SC) can improve the robustness of its causal inference @bayani2021.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
from mksci_font import config_font, mksci_font
import pandas as pd
from core.src import VARS_EN2CH

from src import fit_pca

config_font()

Data processing:

Average the data within each province and drop any province that has missing values.
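
To illustrate what this step does to missing values, here is a small sketch on a hypothetical two-province panel (the column names `gdp` and `rain` are invented for the example). It assumes the pandas default of skipping `NaN` when averaging, so a province is only dropped if some variable is missing in every one of its years.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: province B never has a value for "rain".
panel = pd.DataFrame(
    {
        "Province": ["A", "A", "B", "B"],
        "Year": [2000, 2001, 2000, 2001],
        "gdp": [1.0, 3.0, 2.0, 4.0],
        "rain": [10.0, 20.0, np.nan, np.nan],
    }
)

# Average over years within each province; mean() skips NaN by default,
# so a province only keeps a NaN if a variable is missing in every year.
province_mean = panel.groupby("Province").mean()

# Drop provinces that still have a missing value in any column.
complete = province_mean.dropna(how="any")
print(complete)
```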

In [None]:
# Load the data

X_inputs = [var for _, var in VARS_EN2CH.items()]


merged_data = pd.read_csv(r"../data/processed/merged_data.csv", index_col=0).rename(
    VARS_EN2CH, axis=1
)
merged_mean = merged_data.groupby("Province").mean().dropna(how="any")
merged_mean.head()

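The data set covers $30$ provinces, autonomous regions, municipalities, and regions of China (Taiwan, Hong Kong, and Macao are excluded; Tianjin and Hebei are treated jointly in the water allocation policy, so their data are merged as "Jinji").
In time, the data set spans the ten years before and after each of the two policies, i.e. $1979$ to $2008$, and no feature has missing values within this period.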

In [None]:
# Provinces considered
merged_data.Province.unique()

# Time range
merged_data.Year.unique()

subset = merged_data[(merged_data.Year > 1978) & (merged_data.Year < 2009)]
# subset[features].isna().sum().plot.bar(figsize=(6, 2))

In [None]:
# Check missing values; the natural-condition data for jinji are missing, which does not affect the later analysis
# merged_data.groupby('Year')[features].count()

## Feature selection and fit

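PCA cannot be applied directly to panel data. In this chapter, all preprocessed data are averaged over the years along the time axis, and PCA is applied to these means to reduce dimensionality. The resulting principal components are ranked by how much of the variance in the original data they explain, and the elbow method is used to choose the number of components, which reduces the degrees of freedom of the synthetic control model used for counterfactual inference.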
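
As a complement to reading the elbow plot by eye, the number of components can also be picked from a cumulative-variance target. The sketch below assumes that a fractional `n_components` such as `0.85` is meant as such a target; the helper `n_components_for` and the ratios are hypothetical and are not the values from this data set.

```python
import numpy as np


def n_components_for(explained_ratio, target=0.85):
    """Smallest number of leading PCs whose cumulative ratio reaches `target`."""
    cumulative = np.cumsum(np.asarray(explained_ratio, dtype=float))
    # First index whose cumulative share is at least `target`.
    idx = int(np.searchsorted(cumulative, target, side="left"))
    # Clamp in case round-off leaves the total just below the target.
    return min(idx + 1, cumulative.size)


# Illustrative ratios only.
ratios = np.array([0.50, 0.20, 0.10, 0.08, 0.07, 0.05])
k = n_components_for(ratios)
print(k, np.cumsum(ratios)[k - 1])
```

With these made-up ratios the first three components cover about 80% of the variance, so a fourth is needed to pass the 85% target.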


In [None]:
Y_input = "Total water use"

not_features = ["Province", "Year", "Total water use"]

# 5 principal components, 89.63% of variance explained
features = [f for f in X_inputs if f not in not_features]

model, results = fit_pca(merged_mean, features=features, n_components=0.85)
fig, ax = model.plot(figsize=(6, 3))


@mksci_font(xlabel="主成分", ylabel="方差解释率")
def better_elbow_plot(ax):
    ax.set_title("")
    ax.grid(False)
    return ax


ax = better_elbow_plot(ax)
fig.savefig("../../PhD_Thesis/img/ch5/ch5_elbow.png", dpi=300)

In [None]:
model.results["explained_var"]

The first principal component explains $51.6\%$ of the variance and the second explains $16.9\%$; these are the two axes with the greatest influence on regional water use, and each of the remaining components explains less than $10\%$.

## Biplot of the Components

In [None]:
fontdict = {"weight": "normal", "size": 9, "ha": "center", "va": "center", "c": "black"}
fig, ax = model.biplot(
    figsize=(5, 4),
    s=0,  # merged_mean[Y_input].values
    n_feat=10,
    jitter=0.01,
    legend=False,
    label=False,
    SPE=True,
    fontdict=fontdict,
    # alpha_transparency=0.6,
    hotellingt2=True,
    title="",
)

In [None]:
from mksci_font import mksci_font
from pca.pca import _get_coordinates


@mksci_font(xlabel="主成分1", ylabel="主成分2")
def better_biplot(fig, ax):
    xs, ys, zs, ax = _get_coordinates(model.results["PC"], [0, 1], fig, ax, False)
    ax.scatter(
        xs,
        ys,
        s=merged_mean[Y_input].values * 30,
        alpha=0.4,
        edgecolors="white",
        color="white",
    )
    ax.grid(False)
    return ax


better_biplot(fig, ax)

# Save to the thesis figure directory
fig.savefig(r"../../PhD_Thesis/img/ch5/ch5_biplot.png", dpi=300)

description = """
这里的
"""
fig

## Find Significant Features

In order to test the **significance of the PCA loadings**, we used a combination of three methods:
1) the bootstrapped eigenvector method;
2) the threshold method, in which loadings are significant when their absolute value and contribution are larger than a specific threshold that depends on the number of dimensions (ndim, i.e. variables); and
3) a fixed threshold set according to Richman et al.

In practice the loadings are significant, and considered as “high relevance”, if
1) the p-value from method 1 is below 0.01;
2) their contribution is above 1/ndim (i.e. above 8.3%);
3) the absolute value of the loadings is above 0.34.


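Loadings represent the correlation between each original variable and a principal component. The loadings of the original variables on each component show which set of features is associated with the components that contribute most to regional water use, and clarify how the features that influence water use relate to the components.
Following Migliavacca et al. (2021), there are three main ways to decide whether a feature contributes significantly to a given principal component: the eigenvector method, the loading threshold method, and the fixed threshold method @migliavacca2021.
This chapter uses the loading threshold method: a feature is considered to contribute significantly to a principal component when the absolute value of its loading exceeds a threshold tied to the number of dimensions (i.e. variables), that is, $|x_{loading}| > 1/N_{dims}$.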
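
The two threshold rules can be sketched for a single component as follows. This is an illustration under stated assumptions, not the procedure of the cited study: "contribution" is taken to be the squared loading normalized to sum to 1 over the variables, the bootstrap test is left out, and the helper `relevant_loadings` and the example vector are hypothetical.

```python
import numpy as np


def relevant_loadings(eigvec, fixed_threshold=0.34):
    """Flag entries of one component that pass both threshold rules."""
    eigvec = np.asarray(eigvec, dtype=float)
    ndim = eigvec.size
    # Assumption: contribution = squared loading, normalized over variables.
    contribution = eigvec**2 / np.sum(eigvec**2)
    passes_contribution = contribution > 1.0 / ndim
    passes_fixed = np.abs(eigvec) > fixed_threshold
    return passes_contribution & passes_fixed


# Made-up loadings for 12 variables, so 1 / ndim is about 8.3%.
v = np.array([0.7, -0.5, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
flags = relevant_loadings(v)
print(flags)
```

In this example the third variable passes the contribution rule but not the fixed threshold of 0.34, so only the first two are flagged.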

In [None]:
from matplotlib import pyplot as plt


def sig_loadings(
    model,
    pc=1,
    method="contribution",
    color="#0889a6",
    threshold=0.3,
    ax=None,
    ticklabels=True,
):
    loadings = model.results["loadings"]
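    if ax is None:
        _, ax = plt.subplots(figsize=(2.5, 6))
    if method == "contribution":
        # Rows of `loadings` are PCs and columns are variables, so the
        # 1 / ndim threshold uses the column count, not len(loadings).
        threshold = 1 / loadings.shape[1]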
    data = loadings.loc[f"PC{pc}"]
    if isinstance(color, str):
        colors = ["lightgray" if abs(da) < threshold else color for da in data]
    elif hasattr(color, "__iter__"):
        colors = [
            "lightgray" if abs(da) < threshold else color[i]
            for i, da in enumerate(data)
        ]
    ax.barh(width=data.values, y=data.index, color=colors)

    # Styling
    if ticklabels:
        ax.set_yticklabels(data.index)
    else:
        ax.set_yticklabels([])
    ax.spines[["top", "left", "right"]].set_visible(False)
    ax.set_xlabel(f"PC{pc}")
    ax.axvline(0, ls=":", color="black")
    ax.tick_params(axis="y", length=1.5, direction="in")
    return ax


sig_loadings(model)

In [None]:
model.results.keys()

In [None]:
fig, axs = plt.subplots(1, 5, figsize=(10, 6))

for i, ax in enumerate(axs):
    # Only the leftmost panel shows the variable names.
    sig_loadings(model, i + 1, ax=ax, ticklabels=(i == 0))
    ax.set_xlabel(f"主成分{i+1}")
    ax.spines[["left"]].set_visible(True)

fig.savefig("../../PhD_Thesis/img/ch5/ch5_variables.png")
plt.show();

In [None]:
from src.filter_features import transform_features

results = transform_features(
    transform_data=merged_data, features=features, fitted_model=model
)
print(results.shape)
results.to_csv("../data/processed/pca_transformed_0.85.csv")