### Dimension Reduction (차원 축소)
- 우리가 다루는 데이터들은 보통 3차원 공간에서는 표현하기 힘든 고차원의 데이터인 경우가 많다.
- 차원이 커질 수록 데이터 간 거리가 크게 늘어나며, 데이터가 희소화된다.
- 고차원을 이루는 feature 중 상대적으로 중요도가 떨어지는 feature가 존재할 수 있기 때문에  
  계산 비용이 많이 들고 분석에 필요한 시각화가 어렵다.
- 머신러닝에서는 고차원 데이터를 다루는 경우가 많으며, 희소 데이터를 학습 시 예측 성능이 좋지 않다.
- 차원 축소를 통해 Spares Data를 Dense하게 만들 필요가 있다.
- feature가 많을 경우 독립변수 간 상관관계가 높아질 가능성이 높고, 이로 인해 다중 공선성문제가 발생할 수 있다.
- 차원 축소로 인해 표현력이 일부 손실되지만, 손실을 감수하더라도 계산 효율을 얻기 위해 사용한다.

---
#### PCA (Principal Component Analysis), 주성분 분석
- 고차원의 데이터를 저차원으로 압축하는 대표적인 차원 축소방법이다.
- 데이터의 특성을 눈으로 쉽게 파악할 수 있도록 하며, 연산 속도에 큰 이점을 얻을 수 있다.
- 고차원 데이터를 저차원 데이터로 압축하기 위해서는 먼저, 데이터를 가장 잘 표현하는 축을 설정해야 한다.
- 2차원 공간에서 1차원 공간으로 차원 축소를 진행하면, 1차원 공간상에서 데이터 분포가 가장 넓게 퍼지게 만드는 고유 벡터를 찾아야 한다.
- 고유 벡터를 찾았다면 feature 데이터들을 고유 벡터 축에 투영시킴으로써 주성분을 찾아낼 수 있게 된다.

<div style="display: flex">
    <div>
        <img src="./images/pca01.gif" style="margin-left: -200px">
    </div>
    <div>
        <img src="./images/pca02.gif" width="700" style="margin-top:50px; margin-left: -350px">
    </div>
</div>


#### LDA (Linear Discriminant Analysis)
- PCA와 유사하지만, 분류에서 사용하기 쉽도록 개별 클래스를 분별할 수 있는 기준을 최대한 유지하면서 차원을 축소한다.
- PCA는 가장 큰 분산을 가지는 축을 찾았지만, LDA는 입력 데이터의 클래스를 최대한 분리할 수 있는 축을 찾는다.
- 클래스를 최대한 분리하기 위해서 클래스 간 분산을 최대화하고 클래스 내부 분산을 최소화 하는 방식으로 차원을 축소한다.

<div style="display: flex">
    <div>
        <img src="./images/lda01.png" width="650" style="margin:20px; margin-left: -20px">
    </div>
    <div>
        <img src="./images/lda02.png" width="650" style="margin:20px; margin-left: 0">
    </div>
</div># LDA ()

In [15]:
import pandas as pd

c_df = pd.read_csv('./datasets/company.csv')
c_df

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.405750,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.827890,0.290202,0.026601,0.564050,1,0.016469
1,1,0.464291,0.538214,0.516730,0.610235,0.610235,0.998946,0.797380,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.601450,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.774670,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.998700,0.796967,0.808966,0.303350,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.035490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6814,0,0.493687,0.539468,0.543230,0.604455,0.604462,0.998992,0.797409,0.809331,0.303510,...,0.799927,0.000466,0.623620,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.029890
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.303520,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.002840,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.607850,0.607850,0.999074,0.797500,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009


In [16]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE(random_state=124)
features, targets = c_df.iloc[:, 1:], c_df.iloc[:, 0]

X_trina, X_test, y_train, y_test = train_test_split(features, targets, stratify=targets, test_size=0.2, random_state=124)

over_features, over_targets = smote.fit_resample(features, targets)


In [17]:
train_df = pd.concat([over_features, over_targets], axis=1).reset_index(drop=True)
test_df = pd.concat([X_test, y_test], axis=1).reset_index(drop=True)

In [21]:
from sklearn.decomposition import PCA

pca = PCA(n_components= 2)

pca_train = pca.fit_transform(train_df.iloc[:, :-1])
pca_test =pca.fit_transform(test_df.iloc[:, :-1])



In [22]:
pca_columns = [f'pca{i+1}' for i in range(pca_train.shape[1])]
pca_train_df = pd.DataFrame(pca_train, columns=pca_columns)
pca_train_df.loc[:,'target'] = train_df['Bankrupt?']


In [23]:
pca_columns = [f'pca{i+1}' for i in range(pca_test.shape[1])]
pca_test_df = pd.DataFrame(pca_test, columns=pca_columns)
pca_test_df.loc[:,'target'] = test_df['Bankrupt?']


In [25]:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

[0.21421998 0.16121997]
0.3754399445273024


In [41]:
from sklearn.decomposition import PCA

pca = PCA(n_components= 5)

pca_train = pca.fit_transform(train_df.iloc[:, :-1])
pca_test =pca.fit_transform(test_df.iloc[:, :-1])

In [42]:
pca_columns = [f'pca{i+1}' for i in range(pca_train.shape[1])]
pca_train_df = pd.DataFrame(pca_train, columns=pca_columns)
pca_train_df.loc[:,'target'] = train_df['Bankrupt?']
pca_columns = [f'pca{i+1}' for i in range(pca_test.shape[1])]
pca_test_df = pd.DataFrame(pca_test, columns=pca_columns)
pca_test_df.loc[:,'target'] = test_df['Bankrupt?']


In [43]:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

[0.21421998 0.16121997 0.13168396 0.1242189  0.11207448]
0.7434172923920851


In [44]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

features, targets = pca_train_df.iloc[:, :-1], pca_train_df.iloc[:, -1]
parameters = {'max_depth': [5, 10, 20], 'min_samples_split': [10, 50, 100]}

rfc = RandomForestClassifier()
# import sklearn
# sklearn.metrics.SCORERS.keys()
g_rfc = GridSearchCV(rfc, param_grid=parameters, cv=5, return_train_score=True, scoring='accuracy')
g_rfc.fit(features, targets)

In [45]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


lda = LinearDiscriminantAnalysis(n_components= 1)

lda_train = pca.fit_transform(train_df.iloc[:, :-1],train_df.iloc[:, -1])
lda_test =pca.fit_transform(test_df.iloc[:, :-1], test_df.iloc[:, -1])