# Assignment 04.2 - PCA for Reduced Dimensionality in Clustering
## For this problem you will use an image segmentation data set for clustering. You will experiment with using PCA as an approach to reduce dimensionality and noise in the data. You will compare the results of clustering the data with and without PCA using the provided image class assignments as the ground truth. The data set is divided into three files. The file "segmentation_data.txt" contains data about images, with each line corresponding to one image. Each image is represented by 19 features (these are the columns in the data and correspond to the feature names in the file "segmentation_names.txt"). The file "segmentation_classes.txt" contains the class labels (the type of image) and a numeric class label for each of the corresponding images in the data file. After clustering the image data, you will use the class labels to measure completeness and homogeneity of the generated clusters. The data set used in this problem is based on the Image Segmentation data set at the UCI Machine Learning Repository.

## Your tasks in this problem are the following:

In [1]:
# import statements
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score, homogeneity_score


In [2]:
# Assign %pwd to working_dir
working_dir = %pwd

## 4.2.a

## Load in the image data matrix (with rows as images and columns as features). Also load in the numeric class labels from the segmentation class file. Using your favorite method (e.g., sklearn's min-max scaler), perform min-max normalization on the data matrix so that each feature is scaled to [0,1] range.

In [3]:
segmentation_data = pd.read_csv(working_dir + "/Data/segmentation_data/segmentation_data.txt", header=None)
#segmentation_data.head(10)
scaler = MinMaxScaler(feature_range=(0,1))
segmentation_data_normalized = scaler.fit_transform(segmentation_data)
segmentation_data_normalized = pd.DataFrame(segmentation_data_normalized)
segmentation_data_normalized

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.430830,0.741667,0.0,0.000000,0.0,0.034221,0.000672,0.027329,0.000856,0.090111,0.079417,0.061119,0.130943,0.731343,0.014118,0.872865,0.123711,0.508139,0.831849
1,0.335968,0.733333,0.0,0.000000,0.0,0.038023,0.000726,0.032298,0.000541,0.095791,0.085089,0.068483,0.134840,0.729478,0.023529,0.859583,0.127393,0.463329,0.836986
2,0.885375,0.970833,0.0,0.000000,0.0,0.115970,0.002213,0.067081,0.001097,0.085463,0.075365,0.061856,0.120031,0.736940,0.038824,0.827324,0.113402,0.480149,0.844782
3,0.181818,0.920833,0.0,0.000000,0.0,0.043726,0.001265,0.022360,0.000645,0.088562,0.080227,0.059647,0.127046,0.748134,0.014118,0.855787,0.120029,0.500966,0.825889
4,0.379447,0.729167,0.0,0.000000,0.0,0.039924,0.000697,0.026087,0.000725,0.108701,0.101297,0.078056,0.148090,0.748134,0.010588,0.861480,0.139912,0.442661,0.823924
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2095,0.122530,0.612500,0.0,0.000000,0.0,0.032319,0.000870,0.018634,0.000441,0.055513,0.046191,0.078792,0.039751,0.751866,0.256471,0.461101,0.078792,0.520578,0.178177
2096,0.027668,0.629167,0.0,0.333333,0.0,0.055133,0.002080,0.007453,0.000096,0.058353,0.048622,0.079529,0.045207,0.748134,0.247059,0.480076,0.079529,0.484805,0.167750
2097,0.501976,0.625000,0.0,0.000000,0.0,0.019011,0.000254,0.017391,0.000118,0.049832,0.040519,0.072165,0.035074,0.753731,0.250588,0.468691,0.072165,0.540918,0.175915
2098,0.588933,0.612500,0.0,0.000000,0.0,0.074145,0.001647,0.031056,0.000302,0.058869,0.051053,0.081001,0.042868,0.761194,0.251765,0.459203,0.081001,0.503086,0.184789
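As a cross-check on what the scaler does, min-max normalization maps each column with (x - min) / (max - min). The sketch below is a minimal illustration on a small hypothetical matrix (not the segmentation data), comparing the manual formula with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [5.0, 1000.0]])

# Manual min-max: (x - min) / (max - min), computed per column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same scaling via scikit-learn.
X_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

A constant column would make the manual denominator zero; `MinMaxScaler` guards against that case, which is one reason to prefer it over the hand-written formula.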


In [4]:
segmentation_classes = pd.read_csv(working_dir + "/Data/segmentation_data/segmentation_classes.txt", header=None, sep="\t").iloc[:, 1]
segmentation_classes.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: 1, dtype: int64

In [5]:
segmentation_names = pd.read_csv(working_dir + "/Data/segmentation_data/segmentation_names.txt", header=None, sep="\t")
segmentation_names.head()

Unnamed: 0,0
0,REGION-CENTROID-COL
1,REGION-CENTROID-ROW
2,REGION-PIXEL-COUNT
3,SHORT-LINE-DENSITY-5
4,SHORT-LINE-DENSITY-2


## 4.2.b
## Next, perform K-means clustering (for this problem, use the KMeans implementation in scikit-learn) on the image data (since there are a total of 7 pre-assigned image classes, you should use K = 7 in your clustering). Use Euclidean distance as your distance measure for the clustering.

In [6]:
kmeans = KMeans(n_clusters=7,max_iter=1000,verbose=0)
kmeans.fit(segmentation_data_normalized)

KMeans(max_iter=1000, n_clusters=7)
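scikit-learn's `KMeans` minimizes within-cluster squared Euclidean distance and starts from randomly chosen centroids, so two fits on the same data can give slightly different clusters and scores. The sketch below uses hypothetical synthetic blobs (not the segmentation data) to show how fixing `random_state` makes a run repeatable. The values 42 and `n_init=10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# random_state fixes the random initialization; n_init is the number of restarts tried.
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# With the same seed, both fits assign every point to the same cluster.
same_labels = np.array_equal(km_a.labels_, km_b.labels_)
print(same_labels)
```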

## Print the cluster centroids (use some formatting so that they are visually understandable). 

In [7]:
pd.options.display.float_format='{:,.2f}'.format
# segmentation_names is a one-column DataFrame; pass its column 0 as a list so the labels are plain strings
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=segmentation_names[0].tolist())
centroids

Unnamed: 0,REGION-CENTROID-COL,REGION-CENTROID-ROW,REGION-PIXEL-COUNT,SHORT-LINE-DENSITY-5,SHORT-LINE-DENSITY-2,VEDGE-MEAN,VEDGE-SD,HEDGE-MEAN,HEDGE-SD,INTENSITY-MEAN,RAWRED-MEAN,RAWBLUE-MEAN,RAWGREEN-MEAN,EXRED-MEAN,EXBLUE-MEAN,EXGREEN-MEAN,VALUE-MEAN,SATURATION-MEAN,HUE-MEAN
0,0.51,0.81,0.0,0.08,0.01,0.05,0.0,0.05,0.0,0.11,0.09,0.09,0.14,0.68,0.08,0.82,0.13,0.41,0.89
1,0.3,0.53,0.0,0.05,0.05,0.1,0.01,0.08,0.01,0.4,0.37,0.47,0.35,0.5,0.57,0.21,0.47,0.3,0.16
2,0.77,0.43,0.0,0.01,0.02,0.04,0.0,0.02,0.0,0.04,0.04,0.06,0.03,0.78,0.22,0.49,0.06,0.54,0.24
3,0.54,0.15,0.0,0.03,0.0,0.03,0.0,0.03,0.0,0.82,0.78,0.89,0.79,0.27,0.67,0.29,0.89,0.21,0.13
4,0.26,0.39,0.0,0.07,0.02,0.08,0.0,0.06,0.0,0.15,0.14,0.19,0.12,0.72,0.34,0.36,0.19,0.41,0.2
5,0.75,0.53,0.0,0.04,0.04,0.11,0.02,0.11,0.02,0.3,0.28,0.35,0.27,0.59,0.45,0.31,0.35,0.3,0.16
6,0.25,0.46,0.0,0.03,0.01,0.04,0.0,0.03,0.0,0.03,0.02,0.04,0.02,0.77,0.22,0.51,0.04,0.8,0.18


## Compare your 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.

In [8]:
clusters = kmeans.predict(segmentation_data_normalized)
print('Completeness Score:  ',completeness_score(segmentation_classes,clusters))
print('Homogeneity Score:   ',homogeneity_score(segmentation_classes,clusters))

Completeness Score:   0.6131870124853009
Homogeneity Score:    0.6115021163370862
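To make the two scores concrete: homogeneity is 1.0 when every cluster contains members of a single class, and completeness is 1.0 when all members of a class land in the same cluster. The sketch below is a minimal illustration on hypothetical four-point labelings, showing the two extreme cases.

```python
from sklearn.metrics import completeness_score, homogeneity_score

true_classes = [0, 0, 1, 1]

# Every point in its own cluster: each cluster is pure, but each class is split.
over_split = [0, 1, 2, 3]
h_split = homogeneity_score(true_classes, over_split)
c_split = completeness_score(true_classes, over_split)

# Everything in one cluster: classes stay together, but the cluster is mixed.
one_cluster = [0, 0, 0, 0]
h_one = homogeneity_score(true_classes, one_cluster)
c_one = completeness_score(true_classes, one_cluster)

print('over-split :', h_split, c_split)
print('one cluster:', h_one, c_one)
```

The over-split labeling is fully homogeneous but only half complete, while the single-cluster labeling is fully complete but not homogeneous at all. Reporting both scores, as this notebook does, keeps either degenerate solution from looking good.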


## 4.2.c
## Perform PCA on the normalized image data matrix. You may use the linear algebra package in Numpy or the Decomposition module in scikit-learn (the latter is much more efficient). 
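For reference, the NumPy route mentioned above amounts to an eigendecomposition of the covariance matrix of the centered data. The sketch below is a minimal illustration on hypothetical synthetic data (not the segmentation data), checking that the eigenvalue ratios agree with scikit-learn's `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 200 samples, 5 correlated features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))

# Center the data, then eigendecompose the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA lists them in descending order.
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]

# Fraction of total variance carried by each component.
ratios_numpy = eigen_values / eigen_values.sum()
ratios_sklearn = PCA().fit(X).explained_variance_ratio_

print(np.allclose(ratios_numpy, ratios_sklearn))
```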

In [9]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt  # used by screePlot

# Function: principleComponentAnalysis
# Accepts:  dataset
# Returns:  top ratios and corresponding feature 
#           names for Principal Component Analysis
def principleComponentAnalysis(data):
    number_of_columns = data.shape[1]
    initial_feature_names = data.columns
        
    # Start at 1: a PCA with zero components explains nothing.
    for x in range(1, number_of_columns + 1):
        pca = PCA(n_components = x)
        pca.fit(data)
        pca_ratios = pca.explained_variance_ratio_
        
        # https://stackoverflow.com/questions/22984335/recovering-features-names-of-explained-variance-ratio-in-pca-with-sklearn
        # For each component, take the original feature with the largest absolute loading.
        most_important = [np.abs(pca.components_[i]).argmax() for i in range(x)]
        pca_names = [initial_feature_names[most_important[i]] for i in range(x)]
        
        # Stop at the smallest number of components capturing at least 95% of the variance.
        if sum(pca_ratios) >= .95:
            break
    return pca_ratios, pca_names

# Function: screePlot
# Accepts:  dataset
# Returns:  Scree Plot
def screePlot(data):
    # Calculate Eigen Values
    # Fit once with all components; explained_variance_ is already sorted in descending order.
    pca = PCA()
    pca.fit(data)
    eigen_values = pca.explained_variance_

    total = sum(eigen_values)
    var_exp = [(i/total) for i in eigen_values]
    cum_var_exp = np.cumsum(var_exp)
    
    # Plot Graph
    # https://medium.com/@sercandogan/why-scree-plot-is-important-in-pca-a66cd7dcd624
    plt.plot(range(1, len(cum_var_exp) + 1), cum_var_exp)
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')

## Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of variance in the data. 

In [10]:
pca_ratios, pca_names = principleComponentAnalysis(segmentation_data_normalized)

In [17]:
print('Principal Components From Highest Ratio To Lowest\n')
for i in range(len(pca_ratios)):
    # pca_names[i] is the original feature with the largest loading on component i + 1
    print('Principal Component {} (dominant feature {}):  {}'.format(i + 1, pca_names[i], pca_ratios[i]))
print('\nRatio Sum:', sum(pca_ratios))

Principal Components From Highest Ratio To Lowest

Principal Component 1 (dominant feature 11):  0.6071423396853326
Principal Component 2 (dominant feature 18):  0.1319697923315601
Principal Component 3 (dominant feature 0):  0.10123772940872644
Principal Component 4 (dominant feature 1):  0.04543539200763941
Principal Component 5 (dominant feature 17):  0.03547361137769836
Principal Component 6 (dominant feature 3):  0.01988035498510543
Principal Component 7 (dominant feature 3):  0.0189197029744334

Ratio Sum: 0.9600589227704958
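The number of components r can also be read off a single full fit: take the cumulative sum of the explained-variance ratios and find the first position where it reaches 0.95. The sketch below is a minimal illustration on hypothetical synthetic data (not the segmentation data). It also uses scikit-learn's option of passing a fraction as `n_components`, which selects the component count for that variance target.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 300 samples, 6 features with decreasing spread.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# Fit once with all components and accumulate the variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# First index where the cumulative ratio reaches 0.95, converted to a count.
r = int(np.searchsorted(cumulative, 0.95)) + 1

# Let scikit-learn choose the count for the same variance target.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X)

print(r, pca_95.n_components_)
```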


## Then use these r components as features to transform the data into a reduced dimension space. 

In [12]:
reduced_dimension_space = segmentation_data_normalized[pca_names]
reduced_dimension_space

Unnamed: 0,11,18,0,1,17,3,3.1
0,0.06,0.83,0.43,0.74,0.51,0.00,0.00
1,0.07,0.84,0.34,0.73,0.46,0.00,0.00
2,0.06,0.84,0.89,0.97,0.48,0.00,0.00
3,0.06,0.83,0.18,0.92,0.50,0.00,0.00
4,0.08,0.82,0.38,0.73,0.44,0.00,0.00
...,...,...,...,...,...,...,...
2095,0.08,0.18,0.12,0.61,0.52,0.00,0.00
2096,0.08,0.17,0.03,0.63,0.48,0.33,0.33
2097,0.07,0.18,0.50,0.62,0.54,0.00,0.00
2098,0.08,0.18,0.59,0.61,0.50,0.00,0.00
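Note that the cell above keeps original columns, each picked as the feature with the largest loading on a component, which is why column 3 appears twice. That is feature selection rather than a projection onto the principal components. A minimal sketch of the projection itself follows, on hypothetical synthetic data with an arbitrary r = 3. For this notebook the analogous call would presumably be `PCA(n_components=7)` on the normalized data, and the scores reported below would need to be recomputed.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 100 samples, 6 correlated features.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 6)) @ rng.standard_normal((6, 6))

r = 3  # hypothetical number of components to keep
pca = PCA(n_components=r)
pca.fit(X)

# Project every sample onto the first r principal components.
X_reduced = pca.transform(X)

# The same projection by hand: center with the fitted mean, then
# multiply by the component vectors (rows of components_).
X_manual = (X - pca.mean_) @ pca.components_.T

print(X_reduced.shape)
print(np.allclose(X_reduced, X_manual))
```

Each column of `X_reduced` is a linear combination of all original features, so no feature is duplicated or discarded outright.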


## 4.2.d
## Perform Kmeans again, but this time on the lower dimensional transformed data. Then, compute the Completeness and Homogeneity values of the new clusters.

In [13]:
kmeans = KMeans(n_clusters=7,max_iter=1000,verbose=0)
kmeans.fit(segmentation_data_normalized)
clusters = kmeans.predict(segmentation_data_normalized)
print('Original Normalized Set')
print('Completeness Score:  ',completeness_score(segmentation_classes,clusters))
print('Homogeneity Score:   ',homogeneity_score(segmentation_classes,clusters))

Original Normalized Set
Completeness Score:   0.6117374684331665
Homogeneity Score:    0.6100499914689614


In [14]:
kmeans = KMeans(n_clusters=7,max_iter=1000,verbose=0)
kmeans.fit(reduced_dimension_space)
clusters = kmeans.predict(reduced_dimension_space)
print('Reduced Dimension Space')
print('Completeness Score:  ',completeness_score(segmentation_classes,clusters))
print('Homogeneity Score:   ',homogeneity_score(segmentation_classes,clusters))

Reduced Dimension Space
Completeness Score:   0.5459411203386391
Homogeneity Score:    0.533535854923261


## 4.2.e
## Discuss your observations based on the comparison of the two clustering results.