Principal Component Analysis (PCA)
================
Overview
-----------------
### Dimensionality Reduction
>> A method that reduces the number of variables in a dataset while preserving its overall characteristics

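|Method|Description|
|--|--|
|Feature selection|Keep only the most important original features and represent the data with them|
|Feature extraction|Build new features from combinations of the original features|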
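
The difference between the two rows of the table can be sketched on synthetic data. This is a minimal illustration, not the notebook's method: the variance-based selection rule and the SVD-based projection are assumptions chosen only to show that selection keeps original columns while extraction builds new ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # synthetic data: 100 samples, 4 features
X[:, 3] = X[:, 0] + 0.1 * X[:, 3]      # make feature 3 nearly redundant with feature 0

# Feature selection: keep a subset of the ORIGINAL columns
# (assumed criterion for this sketch: the two columns with the largest variance).
variances = X.var(axis=0)
keep = np.sort(np.argsort(variances)[-2:])
X_selected = X[:, keep]

# Feature extraction: build NEW columns as linear combinations of all original columns
# (here: projection onto the first two right singular vectors of the centered data).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T

print(X_selected.shape, X_extracted.shape)
```

Both results have two columns, but only `X_selected` can still be read in terms of the original variables.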

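### Principal Component Analysis
>> Reduces the complexity of the data by replacing correlated variables with a smaller set of uncorrelated principal components, ordered by how much variance each one explains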
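
As a rough sketch of what PCA computes, the projection can be written directly with NumPy by eigen-decomposing the covariance matrix. The function name `pca_eig` and the synthetic data are hypothetical and exist only for this illustration; library implementations typically use an SVD instead.

```python
import numpy as np

def pca_eig(X, n_components):
    """Project X onto its top principal components via the covariance matrix."""
    Xc = X - X.mean(axis=0)                      # center each column
    cov = np.cov(Xc, rowvar=False)               # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # one column per principal axis
    scores = Xc @ components                     # coordinates in the new basis
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained_ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated synthetic features
scores, components, ratio = pca_eig(X, 2)
print(scores.shape, ratio)
```

The resulting score columns are uncorrelated with each other, which is what lets a few of them stand in for many correlated inputs.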

In [21]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sb
import sklearn.preprocessing as skp
import statsmodels.stats.anova as anova
import statsmodels.formula.api as smf
import statsmodels

In [10]:
# read_excel already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
data=pd.read_excel("https://data.hossam.kr/E04/boston.xlsx")

In [11]:
# drop the last column of the sheet
data=data[data.columns[:-1]]

In [14]:
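# predictors: every column except the target MEDV (columns.difference also sorts the names)
x=data[data.columns.difference(["MEDV"])]
# target: median home value
y=data["MEDV"]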

In [22]:
# standardize each predictor to zero mean and unit variance
scale=skp.StandardScaler()
std_x=scale.fit_transform(x)
std_x

Data Standardization
-----

In [31]:
std_x_df=pd.DataFrame(std_x,columns=x.columns)
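
For reference, standardization amounts to subtracting each column's mean and dividing by its standard deviation. The sketch below uses a small made-up DataFrame (the column names `a` and `b` are hypothetical). `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to `ddof=1`, so `ddof=0` is passed explicitly here.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# z-score: (value - column mean) / column population standard deviation
z = (df - df.mean()) / df.std(ddof=0)

print(np.allclose(z.mean(), 0.0), np.allclose(z.std(ddof=0), 1.0))
```

Because PCA looks for directions of maximum variance, columns measured on large scales would otherwise dominate the components.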

Running PCA
----

In [32]:
import sklearn.decomposition as sd

In [33]:
pca=sd.PCA(n_components=6)
pca_result=pca.fit_transform(std_x_df)

In [35]:
pd.DataFrame(pca_result)

| |0|1|2|3|4|5|
|--|--|--|--|--|--|--|
|0|-2.098297|0.773113|0.342943|-0.891774|0.423070|-0.315338|
|1|-1.457252|0.591985|-0.695199|-0.487459|-0.195876|0.264223|
|2|-2.074598|0.599639|0.167122|-0.739204|-0.934534|0.448095|
|3|-2.611504|-0.006871|-0.100284|-0.343721|-1.104956|0.664649|
|4|-2.458185|0.097712|-0.075348|-0.427907|-1.065924|0.617047|
|...|...|...|...|...|...|...|
|501|-0.314968|0.724285|-0.860896|-0.434740|-1.121040|0.508064|
|502|-0.110513|0.759308|-1.255979|-0.309376|-0.891542|0.408208|
|503|-0.312360|1.155246|-0.408598|-0.786304|-1.595185|0.467947|
|504|-0.270519|1.041362|-0.585454|-0.678134|-1.416024|0.482259|
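
The notebook fixes `n_components=6` without showing how that number was chosen. One common way to pick it is to look at the cumulative explained variance ratio. The sketch below runs on synthetic data, and the 90% threshold is an arbitrary assumption, not a value taken from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))                                        # 3 hidden factors
X = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(300, 8))    # 8 observed features

X_std = StandardScaler().fit_transform(X)
full_pca = PCA().fit(X_std)                           # keep every component

cum_ratio = np.cumsum(full_pca.explained_variance_ratio_)
# smallest number of components whose cumulative ratio reaches the threshold
n_for_90 = int(np.searchsorted(cum_ratio, 0.90) + 1)

print(cum_ratio.round(3), n_for_90)
```

Applied to this notebook, the same check would use `std_x_df` in place of `X_std`.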


Selecting Independent Variables with PCA
---

In [44]:
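# third-party `pca` package; note this rebinds the name `pca` used above for the sklearn object
import pca
model=pca.pca(n_components=std_x_df.shape[1])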

In [45]:
A=model.fit_transform(std_x_df)

[pca] >Extracting column labels from dataframe.
[pca] >Extracting row labels from dataframe.
[pca] >The PCA reduction is performed on the [13] columns of the input dataframe.
[pca] >Fit using PCA.
[pca] >Compute loadings and PCs.
[pca] >Compute explained variance.
[pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[13]
[pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
[pca] >Outlier detection using SPE/DmodX with n_std=[3]


In [63]:
result=pd.DataFrame(A["topfeat"])

In [64]:
result

| |PC|feature|loading|type|
|--|--|--|--|--|
|0|PC1|INDUS|0.346672|best|
|1|PC2|CHAS|0.454829|best|
|2|PC3|RM|0.593961|best|
|3|PC4|CHAS|0.815941|best|
|4|PC5|PTRATIO|-0.584002|best|
|5|PC6|B|-0.803455|best|
|6|PC7|CRIM|0.777607|best|
|7|PC8|AGE|-0.600823|best|
|8|PC9|INDUS|0.644416|best|
|9|PC10|LSTAT|-0.600711|best|


In [66]:
set(result[result["type"]=='best']["feature"])

{'AGE',
 'B',
 'CHAS',
 'CRIM',
 'DIS',
 'INDUS',
 'LSTAT',
 'NOX',
 'PTRATIO',
 'RM',
 'TAX'}
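
The selection step above can be approximated with scikit-learn alone. This sketch assumes that a `best` feature is the one with the largest absolute loading on each component, and it treats the rows of `components_` as the loadings; that is a reading of the table above, not a statement of how the `pca` package is implemented. The feature names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2", "f3"]                      # hypothetical feature names
X = pd.DataFrame(rng.normal(size=(200, 4)) * [3.0, 2.0, 1.0, 0.5], columns=cols)

pca_model = PCA(n_components=len(cols)).fit(X)
loadings = pd.DataFrame(
    pca_model.components_,                           # one row per component
    columns=cols,
    index=[f"PC{i + 1}" for i in range(len(cols))],
)

# For every component, pick the feature whose loading has the largest magnitude
best_feature = loadings.abs().idxmax(axis=1)
top = pd.DataFrame({
    "PC": loadings.index,
    "feature": best_feature.values,
    "loading": [loadings.loc[pc, feat] for pc, feat in best_feature.items()],
})

selected = sorted(set(top["feature"]))               # a feature can win on several PCs
print(top)
print(selected)
```

Since one feature can be the top loading on more than one component, the resulting set can be smaller than the number of components, as with `CHAS` and `INDUS` in the table above.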

In [74]:
pre_x_df=data[list(set(result[result["type"]=='best']["feature"]))]

In [92]:
cols="+".join(pre_x_df.columns)
formula=f'MEDV~{cols}'

In [86]:
formula

'MEDV~AGE+INDUS+DIS+NOX+B+CRIM+CHAS+RM+PTRATIO+TAX+LSTAT'

In [71]:
import statsmodels.api as sa

In [88]:
model=smf.ols(formula,data=data)

In [90]:
final=model.fit()

In [91]:
final.summary()

| | | | |
|--|--|--|--|
|Dep. Variable:|MEDV|R-squared:|0.725|
|Model:|OLS|Adj. R-squared:|0.719|
|Method:|Least Squares|F-statistic:|118.4|
|Date:|Wed, 26 Jul 2023|Prob (F-statistic):|9.42e-131|
|Time:|13:53:52|Log-Likelihood:|-1513.7|
|No. Observations:|506|AIC:|3051.0|
|Df Residuals:|494|BIC:|3102.0|
|Df Model:|11| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|--|--|--|--|--|--|--|
|Intercept|30.5585|5.020|6.087|0.000|20.695|40.422|
|AGE|-0.0083|0.013|-0.616|0.538|-0.035|0.018|
|INDUS|-0.0722|0.061|-1.194|0.233|-0.191|0.047|
|DIS|-1.2542|0.187|-6.698|0.000|-1.622|-0.886|
|NOX|-15.8300|3.880|-4.079|0.000|-23.454|-8.206|
|B|0.0086|0.003|3.112|0.002|0.003|0.014|
|CRIM|-0.0619|0.032|-1.905|0.057|-0.126|0.002|
|CHAS|3.1229|0.880|3.548|0.000|1.393|4.852|
|RM|4.2847|0.420|10.206|0.000|3.460|5.110|

| | | | |
|--|--|--|--|
|Omnibus:|192.416|Durbin-Watson:|1.043|
|Prob(Omnibus):|0.0|Jarque-Bera (JB):|944.673|
|Skew:|1.617|Prob(JB):|7.36e-206|
|Kurtosis:|8.861|Cond. No.|14500.0|
