# GENERAL ROUTINE FOR COMBINING PCA WITH K-MEANS CLUSTERING

## Why combine PCA with K-means Clustering?
PCA reduces the number of features by combining them into a smaller set of more meaningful components. Clustering on those components instead of the raw features can both improve the performance of K-means and reduce noise. This notebook walks through how to combine PCA and K-means clustering step by step.

## Table of Contents
- [Preprocessing Data](#preprocessingdata)
- [Performing Dimensionality Reduction with PCA](#performingdimensionalityreductionwithpca)
- [Determine the number of K-Means clusters](#determinenumberofclusters)
- [Combine PCA and K-means Cluster](#combinepcaandkmeanscluster)
- [Analyze the results](#analysistheresults)
- [Visualize Clusters by Components](#visualizeclustersbycomponents)

In [3]:
# Essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

<a id='preprocessingdata'></a>
## Preprocessing Data

In [None]:
# Standardization

## If the variables have very different ranges, standardize them first.
## Otherwise, distance-based methods such as PCA and K-means are dominated by the large-range variables and effectively ignore the small-range ones.
## To treat all features equally, transform them so their values fall within the same numerical range.

scaler = StandardScaler()
data_std = scaler.fit_transform(data)
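
To see what standardization does, the sketch below applies `StandardScaler` to a small made-up array. The values and the name `toy` are illustrative only and are not taken from the notebook's data. After scaling, each column should have mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

toy_std = StandardScaler().fit_transform(toy)

# Each column now has mean 0 and (population) standard deviation 1
col_means = toy_std.mean(axis=0)
col_stds = toy_std.std(axis=0)
print(col_means, col_stds)
```

Both columns end up on the same scale, so neither dominates the distance computations used later by PCA and K-means.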

<a id='performingdimensionalityreductionwithpca'></a>
## Performing Dimensionality Reduction with PCA

In [None]:
# Perform dimensionality reduction with PCA-1

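## (1) fit the standardized data (data_std) using PCA
## n_components is left at its default here so that all components are kept for the variance plots in the next step
pca = PCA()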
pca.fit(data_std)

In [None]:
# Perform dimensionality reduction with PCA-2

## (2) decide how many features we'd like to keep
## 2.1 method based on cumulative variance plot
pca.explained_variance_ratio_

plt.figure(figsize=(12, 8))
plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('-------')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

# The figure shows how much variance is captured as the number of components grows.
# A rule of thumb is to preserve about 80% of the variance (0.8 on the y-axis). We temporarily use @ to denote the chosen number of components.

## 2.2 method based on the relative ratio of each component
plt.figure(figsize=(12, 8))
plt.bar(range(1, pca.n_components_ + 1), pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.ylabel('Explained Variance Ratio')
plt.show()
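
The 80% rule of thumb can also be read off numerically instead of from the plot. The sketch below is one way to do that, assuming synthetic data from `make_blobs` as a stand-in for the real dataset; the names `X`, `cum_var`, `threshold` and `n_keep` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption, for illustration only)
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=42)
X_std = StandardScaler().fit_transform(X)

pca_full = PCA().fit(X_std)  # default n_components keeps all components
cum_var = pca_full.explained_variance_ratio_.cumsum()

threshold = 0.80  # the rule-of-thumb share of variance to preserve
# argmax returns the first index where the condition is True; +1 turns it into a count
n_keep = int(np.argmax(cum_var >= threshold)) + 1
print(n_keep, cum_var[n_keep - 1])
```

`n_keep` is then the smallest number of components whose cumulative explained variance reaches the threshold, which is the value that `@` stands for.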



In [None]:
# Perform dimensionality reduction with PCA-3

## (3) perform PCA with the chosen number of components (replace @ with that number)
pca = PCA(n_components = @)
pca_scores = pca.fit_transform(data_std)

# The component scores are stored in pca_scores.

<a id='determinenumberofclusters'></a>
## Determine the number of K-Means clusters

In [None]:
# Determine the number of clusters in K-means algorithm

## To determine the number of clusters, we run K-means with different numbers of clusters and record the within-cluster sum of squares (SS) for each run.
## We then choose the number of clusters from these SS values using the elbow method.
## How many cluster counts to test depends on the data.

max_clusters = 10  # assumed upper bound on the number of clusters to test; adjust to the data
ss = []
for i in range(1, max_clusters + 1):
    kmeans_pca = KMeans(n_clusters=i, random_state=42)
    kmeans_pca.fit(pca_scores)
    ss.append(kmeans_pca.inertia_)
## inertia_: Sum of squared distances of samples to their closest cluster center.

plt.figure(figsize=(10, 8))
plt.plot(range(1, max_clusters + 1), ss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('SS')
plt.title('K-means with PCA Cluster')

## Use the elbow method to determine the number of clusters: pick the point where the curve stops dropping sharply and flattens out.
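
The elbow can be made more concrete by looking at how much the inertia drops each time a cluster is added. The following is a minimal sketch on synthetic data with three well-separated groups (an assumption chosen so the elbow is obvious); `inertias` and `drops` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumption, for illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Drop in inertia gained by going from k to k + 1 clusters
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print([round(d, 1) for d in drops])
```

On data like this the drops are large up to three clusters and small afterwards, which is the bend the elbow method looks for. On real data the bend is often less sharp, so the choice remains a judgment call.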

<a id='combinepcaandkmeanscluster'></a>
## Combine PCA and K-means Cluster

In [None]:
# Combine K-means with PCA (replace @ with the chosen number of clusters)
kmeans_pca = KMeans(n_clusters = @, random_state = 42)
kmeans_pca.fit(pca_scores)
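
For reference, the whole routine up to this point can be run end to end on synthetic data. This is a sketch under assumptions: the data comes from `make_blobs`, and the choices of 2 components and 3 clusters are picked for the example rather than derived from the plots.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption, for illustration only)
X, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=0)

X_std = StandardScaler().fit_transform(X)           # 1. standardize
scores = PCA(n_components=2).fit_transform(X_std)   # 2. reduce to 2 components
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scores)  # 3. cluster the scores

print(scores.shape)
print(km.labels_[:10])
```

K-means is fitted on the component scores rather than on the standardized features, which is the point of combining the two methods.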

<a id='analysistheresults'></a>
## Analyze the results

In [None]:
# K-Means clustering with PCA results
## Create a new dataframe with the original features, the PCA scores and the assigned clusters.
## The component columns are named 'Component 1', 'Component 2', ... when the scores dataframe is built.
component_names = [f'Component {i + 1}' for i in range(pca_scores.shape[1])]
data_pca_kmeans = pd.concat([data.reset_index(drop=True), pd.DataFrame(pca_scores, columns=component_names)], axis=1)

data_pca_kmeans['labels'] = kmeans_pca.labels_

In [None]:
# Add the names of the segments to the labels (one entry per cluster)
data_pca_kmeans['labels'] = data_pca_kmeans['labels'].map({0: 'first', 1: 'second', 2: 'third'})  # ...extend for further clusters
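
Once the labels are named, a simple way to start analyzing the result is to look at the size of each cluster and the mean of every column per cluster. The sketch below uses a hypothetical miniature of `data_pca_kmeans`; the column `feature_a` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of data_pca_kmeans: one original feature, one component, and labels
df = pd.DataFrame({
    'feature_a': [1.0, 1.2, 5.0, 5.4, 9.0],
    'Component 1': [-2.0, -1.8, 0.1, 0.3, 2.5],
    'labels': [0, 0, 1, 1, 2],
})
df['labels'] = df['labels'].map({0: 'first', 1: 'second', 2: 'third'})

cluster_sizes = df['labels'].value_counts()    # number of rows per cluster
cluster_means = df.groupby('labels').mean()    # per-cluster mean of each column
print(cluster_sizes)
print(cluster_means)
```

Comparing the per-cluster means of the original features is usually what gives each segment an interpretable name.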

<a id='visualizeclustersbycomponents'></a>
## Visualize Clusters by Components

In [None]:
# To visualize the clusters on a 2D plane, we need to choose two components. PCA already orders the components by importance, so we simply take the first two.
x_axis = data_pca_kmeans['Component 2']
y_axis = data_pca_kmeans['Component 1']

plt.figure(figsize = (10,8))
# The palette needs one color per cluster; four are listed here.
sns.scatterplot(x=x_axis, y=y_axis, hue=data_pca_kmeans['labels'], palette = ['g','r','c','m'])
plt.title('Clusters by PCA Components')