This project demonstrates how to use scikit-learn's Principal Component Analysis (PCA) to reduce the dimensionality of a data set.

The data set contains information related to customer churn at a bank.

I found this data set on Kaggle using the following link: https://www.kaggle.com/santoshd3/bank-customers.

The first section imports the necessary libraries, loads the data with its column labels, and performs brief exploratory data analysis to get a sense of the data.

In [1]:
#https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols_to_import = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 
                  'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']
data = pd.read_csv('./Churn Modeling.csv', usecols=cols_to_import)

print(data.describe())
data.info()  # info() prints its summary directly and returns None
print(data.shape)

        CreditScore           Age        Tenure        Balance  NumOfProducts  \
count  10000.000000  10000.000000  10000.000000   10000.000000   10000.000000   
mean     650.528800     38.921800      5.012800   76485.889288       1.530200   
std       96.653299     10.487806      2.892174   62397.405202       0.581654   
min      350.000000     18.000000      0.000000       0.000000       1.000000   
25%      584.000000     32.000000      3.000000       0.000000       1.000000   
50%      652.000000     37.000000      5.000000   97198.540000       1.000000   
75%      718.000000     44.000000      7.000000  127644.240000       2.000000   
max      850.000000     92.000000     10.000000  250898.090000       4.000000   

         HasCrCard  IsActiveMember  EstimatedSalary        Exited  
count  10000.00000    10000.000000     10000.000000  10000.000000  
mean       0.70550        0.515100    100090.239881      0.203700  
std        0.45584        0.499797     57510.492818      0.402769 

The first example reduces the data to three dimensions (principal components). Afterwards, I inspect several attributes and methods of the fitted model to learn more about the result.

In [2]:
pca = PCA(n_components=3, whiten=True).fit(data)
X_pca = pca.transform(data)

In [3]:
print(pca.singular_values_)

[6242294.24576723 5747650.93431437    9664.64753335]


In [4]:
print(pca.explained_variance_ratio_)

[5.41184038e-01 4.58814649e-01 1.29726454e-06]
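The ratios above are dominated by two components, which is consistent with PCA being fit on unscaled columns: `Balance` and `EstimatedSalary` have standard deviations in the tens of thousands, while the other columns vary by about a hundred at most. The following is a minimal sketch of how standardizing can change the picture. It uses synthetic data, and both the column scales and the use of `StandardScaler` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three independent columns on very different scales.
X_demo = np.column_stack([
    rng.normal(0, 60000, 1000),  # large scale, like a balance column
    rng.normal(0, 100, 1000),    # medium scale, like a credit score
    rng.normal(0, 1, 1000),      # small scale
])

# PCA on the raw columns: the large-scale column dominates.
raw_ratio = PCA(n_components=3).fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)     # first component takes nearly all of the variance
print(scaled_ratio)  # variance is spread far more evenly
```

Whether to standardize depends on whether the original units are meaningful for the analysis; the sketch only shows the effect of scale on the ratios.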


In [5]:
print(pca.explained_variance_)

[3.89701345e+09 3.30387951e+09 9.34147534e+03]
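The singular values and the explained variances describe the same quantity: for a PCA fit on n samples, each explained variance equals the squared singular value divided by n - 1. For the first component here, 6242294.25² / 9999 ≈ 3.897e9, which matches the output above. The check below runs on synthetic data; `X_demo` and `pca_demo` are made-up names for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # hypothetical data: 200 samples, 5 features

pca_demo = PCA(n_components=3).fit(X_demo)
n_samples = X_demo.shape[0]

# explained_variance_ is singular_values_ squared, divided by (n_samples - 1)
from_singular = pca_demo.singular_values_ ** 2 / (n_samples - 1)
matches = np.allclose(from_singular, pca_demo.explained_variance_)
print(matches)
```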


In [6]:
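# Note: singular values are not coordinates in component space, so the
# output below does not reconstruct any actual customer record.
print(pca.inverse_transform(pca.singular_values_))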

[ 3.83629033e+06  2.39044713e+06 -3.56671634e+05  4.14161815e+11
 -1.23008086e+06 -2.14509000e+04 -3.73845086e+03 -2.99110729e+11
  2.94462068e+05]
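`inverse_transform` is normally applied to data that has already been projected (such as `X_pca`), mapping it back to the original feature space. The following is a minimal sketch of that round trip on synthetic data; `X_demo` and the other names are made up for illustration. With fewer components than features, the reconstruction is only approximate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))  # hypothetical data: 100 samples, 4 features

pca_demo = PCA(n_components=2, whiten=True).fit(X_demo)
X_proj = pca_demo.transform(X_demo)          # shape (100, 2)
X_back = pca_demo.inverse_transform(X_proj)  # shape (100, 4)

# Mean squared reconstruction error; nonzero because two components were dropped
mse = np.mean((X_demo - X_back) ** 2)
print(X_back.shape, mse)

# Keeping every component makes the round trip lossless (up to rounding)
pca_full = PCA(n_components=4, whiten=True).fit(X_demo)
X_exact = pca_full.inverse_transform(pca_full.transform(X_demo))
print(np.allclose(X_demo, X_exact))
```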


In [7]:
print(pca.score(data))

-69.35820238091982
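`score` returns the average log-likelihood of the samples under the probabilistic PCA model. It is not an accuracy, and it can be negative, as it is here. The sketch below illustrates the relationship between `score` and `score_samples` on synthetic data; `X_demo` and `pca_demo` are made-up names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))  # hypothetical data

pca_demo = PCA(n_components=2).fit(X_demo)

# score_samples gives one log-likelihood per row; score is their mean
per_sample = pca_demo.score_samples(X_demo)
average = pca_demo.score(X_demo)
print(per_sample.shape, average)
```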


The second example lets PCA choose the number of components needed to explain 90% of the variance in the data. Afterwards, I inspect the same attributes to learn more about the result.

In [8]:
pca2 = PCA(n_components=0.9, whiten=True).fit(data)
X_pca2 = pca2.transform(data)

In [9]:
print(pca2.singular_values_)

[6242294.24576723 5747650.93431437]


In [10]:
print(pca2.explained_variance_ratio_)

[0.54118404 0.45881465]
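A float `n_components` between 0 and 1 asks scikit-learn to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. Here the first component alone explains about 54%, so a second is needed, and together they cover more than 99.99%. The following sketch reproduces the selection rule by hand on synthetic data. The array `X_demo`, its column scales, and the explicit `svd_solver="full"` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: five independent features with decreasing spread
X_demo = rng.normal(size=(500, 5)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])

# Fit with every component to see the full variance spectrum
full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Index of the first cumulative ratio above 0.9, plus one to get a count
k_manual = int(np.searchsorted(cumulative, 0.9, side="right")) + 1

# Let scikit-learn pick the count from the 0.9 threshold
auto = PCA(n_components=0.9, svd_solver="full").fit(X_demo)
print(k_manual, auto.n_components_)
```

The fitted `n_components_` attribute reports how many components were actually kept, which is often more convenient than counting the entries of `explained_variance_ratio_`.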
