## PCA for Reduced Dimensionality in Clustering
For this task I experiment with PCA as a way to reduce dimensionality and noise in the data, then compare the results of clustering the data with and without PCA, using the provided image class assignments as the ground truth.

### Data Set Information
The data set used in this task is based on the Image Segmentation data set at the UCI Machine Learning Repository. The file "segmentation_data.txt" contains data about images, with each line corresponding to one image. Each image is represented by 19 features (these are the columns in the data and correspond to the feature names in the file "segmentation_names.txt"). The file "segmentation_classes.txt" contains the class labels (the type of image) and a numeric class label for each of the corresponding images in the data file.

In [413]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### A. Load the Data
1. Load image data matrix (with rows as images and columns as features). 
2. Load the numeric class labels from the segmentation class file. 
3. Perform min-max normalization on the data matrix so that each feature is scaled to the [0, 1] range.
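
As a rough illustration of step 3, min-max normalization maps each feature column x to (x - min) / (max - min). The sketch below uses a small made-up array rather than the segmentation data; the helper name `min_max_scale` and the choice to map constant columns to 0 are assumptions of this sketch, not part of the original notebook.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to the [0, 1] range (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column has zero range, so divide by 1 instead
    # and let every entry of that column map to 0.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Toy data: three samples, three features (the last feature is constant).
toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 40.0, 5.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Each feature is scaled independently, so a feature with a large raw range no longer dominates the Euclidean distances used later in clustering.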

In [414]:
data=pd.read_csv("C:/Users/Rai Chiang/Desktop/478 data/assignment4/Q1/segmentation_data.txt",header = None)
feature_name=pd.read_csv("C:/Users/Rai Chiang/Desktop/478 data/assignment4/Q1/segmentation_names.txt")
class_labels=pd.read_csv("C:/Users/Rai Chiang/Desktop/478 data/assignment4/Q1/segmentation_classes.txt",header = None)

In [415]:
from sklearn import preprocessing
MinMax=preprocessing.MinMaxScaler()
data_trans=MinMax.fit_transform(data)

### B. Perform K-means Clustering on the Image Data
1. Since there are a total of 7 pre-assigned image classes, I use K = 7 in clustering.
2. Use Euclidean distance as the distance measure for the clustering. 
3. Print the cluster centroids. 
4. Compare 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.
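
To make the role of Euclidean distance in step 2 concrete, here is a minimal sketch of one K-means iteration on made-up 2-D points: each point is assigned to its nearest centroid, then each centroid is moved to the mean of its assigned points. The function name `assign_to_nearest` and the toy values are illustrative assumptions; scikit-learn's `KMeans` repeats these two steps internally until convergence.

```python
import numpy as np

def assign_to_nearest(X, centroids):
    """Return, for each row of X, the index of the closest centroid."""
    # Pairwise differences, shape (n_samples, n_centroids, n_features).
    diffs = X[:, None, :] - centroids[None, :, :]
    # Euclidean distance from every sample to every centroid.
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

# Assignment step: nearest centroid by Euclidean distance.
labels = assign_to_nearest(points, centroids)

# Update step: each centroid becomes the mean of its assigned points.
new_centroids = np.array([points[labels == k].mean(axis=0)
                          for k in range(len(centroids))])
print(labels)
print(new_centroids)
```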

In [416]:
from sklearn.cluster import KMeans
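# Note: this rebinds the name KMeans from the class to the fitted estimator;
# the class is re-imported before it is used again later in the notebook.
KMeans=KMeans(n_clusters=7, max_iter=500, verbose=1)
KMeans=KMeans.fit(data_trans)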

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 422.5804386140194
start iteration
done sorting
end inner loop
Iteration 1, inertia 375.78242444517707
start iteration
done sorting
end inner loop
Iteration 2, inertia 364.6887494121455
start iteration
done sorting
end inner loop
Iteration 3, inertia 358.57050004147663
start iteration
done sorting
end inner loop
Iteration 4, inertia 356.2373649149557
start iteration
done sorting
end inner loop
Iteration 5, inertia 354.7874631827957
start iteration
done sorting
end inner loop
Iteration 6, inertia 353.9337584364134
start iteration
done sorting
end inner loop
Iteration 7, inertia 353.26175987295136
start iteration
done sorting
end inner loop
Iteration 8, inertia 352.12475785340143
start iteration
done sorting
end inner loop
Iteration 9, inertia 351.41863875215984
start iteration
done sorting
end inner loop
Iteration 10, inertia 351.344633048776
start iteration
done sorting
end inner loop
Iteration 11, 

end inner loop
Iteration 8, inertia 363.49166793803846
start iteration
done sorting
end inner loop
Iteration 9, inertia 363.44512430034115
start iteration
done sorting
end inner loop
Iteration 10, inertia 363.41889361911905
start iteration
done sorting
end inner loop
Iteration 11, inertia 363.40605499912965
start iteration
done sorting
end inner loop
Iteration 12, inertia 363.40331667437147
start iteration
done sorting
end inner loop
Iteration 13, inertia 363.40331667437147
center shift 0.000000e+00 within tolerance 4.150157e-06


In [425]:
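# Cluster index assigned to each image
predict_clusters=KMeans.predict(data_trans)
print(predict_clusters)

# Cluster centroids: one row per cluster, one column per feature
np.set_printoptions(precision=2,suppress=True,linewidth=100)
print()
print(KMeans.cluster_centers_)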

[0 0 0 ... 6 6 3]

[[0.51 0.81 0.   0.08 0.01 0.05 0.   0.05 0.   0.11 0.09 0.09 0.14 0.68 0.08 0.82 0.13 0.41 0.89]
 [0.75 0.53 0.   0.04 0.04 0.11 0.02 0.11 0.02 0.3  0.28 0.35 0.27 0.59 0.45 0.31 0.35 0.3  0.16]
 [0.54 0.15 0.   0.03 0.   0.03 0.   0.03 0.   0.82 0.78 0.89 0.79 0.27 0.67 0.29 0.89 0.21 0.13]
 [0.26 0.39 0.   0.07 0.02 0.08 0.   0.06 0.   0.15 0.14 0.19 0.12 0.72 0.34 0.36 0.19 0.41 0.2 ]
 [0.25 0.46 0.   0.03 0.01 0.04 0.   0.03 0.   0.03 0.02 0.04 0.02 0.77 0.22 0.51 0.04 0.8  0.18]
 [0.3  0.53 0.   0.05 0.05 0.1  0.01 0.08 0.01 0.4  0.37 0.47 0.35 0.5  0.57 0.21 0.47 0.3  0.16]
 [0.77 0.43 0.   0.01 0.02 0.04 0.   0.02 0.   0.04 0.04 0.06 0.03 0.78 0.22 0.49 0.06 0.54 0.24]]


In [9]:
from sklearn.metrics import completeness_score, homogeneity_score
class_labels=np.array(class_labels).reshape(-1)

# First line of output: completeness; second line: homogeneity
print(completeness_score(class_labels,predict_clusters))
print(homogeneity_score(class_labels,predict_clusters))

0.6131870124853012
0.6115021163370863


### C. Perform PCA on the Normalized Image Data Matrix. 
1. Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of variance in the data. 
2. Then use these r components as features to transform the data into a reduced dimension space. 

In [13]:
from sklearn import decomposition
pca=decomposition.PCA(n_components=10)
pca_trains=pca.fit(data_trans).transform(data_trans)
print(pca.explained_variance_ratio_)
# The cumulative explained variance reaches roughly 95% at the 6th component,
# so about r = 6 components are needed. Note that pca_trains keeps all 10 components.

[0.61 0.13 0.1  0.05 0.04 0.02 0.02 0.02 0.01 0.01]
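
One way to pick r programmatically instead of reading it off the printed ratios is to take the cumulative sum of `explained_variance_ratio_` and find the first index where it reaches 0.95. The sketch below runs on synthetic data (a hypothetical stand-in with 19 features, not the segmentation data), so the value of r it finds says nothing about the real data set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 samples, 19 features driven by 3 latent factors
# plus a little noise. This is NOT the segmentation data.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 19))
X = latent.dot(mixing) + 0.05 * rng.normal(size=(200, 19))

# Fit PCA with all components so the cumulative ratio is guaranteed to reach 1.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance is at least 95%.
r = int(np.argmax(cumulative >= 0.95)) + 1

# Project the data onto only those r components.
X_reduced = PCA(n_components=r).fit_transform(X)
print(r, X_reduced.shape)
```

Applied to the notebook, the same pattern would replace `X` with `data_trans`, and the reduced matrix would then have r columns rather than 10.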


### D. Perform K-means Again on the Lower Dimensional Transformed Data
1. Compute the Completeness and Homogeneity values of the new clusters.

A clustering result satisfies homogeneity if every cluster contains only data points from a single class. Conversely, it satisfies completeness if all data points of a given class are assigned to the same cluster. As the scores below show, applying PCA leaves both measures essentially unchanged: completeness is 0.6117 with PCA versus 0.6132 without, and homogeneity is 0.6100 versus 0.6115. Neither score improves, which suggests that the lower-dimensional representation retains most of the structure in the original data that K-means relies on, but that PCA does not make the classes easier to separate.
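
The two definitions above can be checked on a tiny hand-made example. In the sketch below, the label lists are invented for illustration: splitting one class across two clusters keeps every cluster pure (homogeneity 1) but breaks completeness, while merging everything into one cluster does the opposite.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Toy ground truth: two classes with two points each.
true_classes = [0, 0, 1, 1]

# Class 1 is split across clusters 1 and 2: every cluster is pure,
# but class 1 is not kept together.
split_clusters = [0, 0, 1, 2]

# All points share one cluster: each class is kept together,
# but the single cluster mixes both classes.
merged_clusters = [0, 0, 0, 0]

h_split = homogeneity_score(true_classes, split_clusters)
c_split = completeness_score(true_classes, split_clusters)
h_merged = homogeneity_score(true_classes, merged_clusters)
c_merged = completeness_score(true_classes, merged_clusters)

print("split :", h_split, c_split)
print("merged:", h_merged, c_merged)
```

Both scores lie between 0 and 1, so values near 0.61 for the image data indicate clusters that only partly match the 7 pre-assigned classes.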

In [16]:
from sklearn.cluster import KMeans
KMeans1=KMeans(n_clusters=7, max_iter=500, verbose=1)
KMeans1=KMeans1.fit(pca_trains)
preddict_class_pca=KMeans1.predict(pca_trains)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 421.10953220024186
start iteration
done sorting
end inner loop
Iteration 1, inertia 407.7542808926719
start iteration
done sorting
end inner loop
Iteration 2, inertia 397.10503953598635
start iteration
done sorting
end inner loop
Iteration 3, inertia 373.3479342997225
start iteration
done sorting
end inner loop
Iteration 4, inertia 368.74163733754705
start iteration
done sorting
end inner loop
Iteration 5, inertia 367.49011311416984
start iteration
done sorting
end inner loop
Iteration 6, inertia 365.7836456031216
start iteration
done sorting
end inner loop
Iteration 7, inertia 364.8852565002595
start iteration
done sorting
end inner loop
Iteration 8, inertia 364.51688414923717
start iteration
done sorting
end inner loop
Iteration 9, inertia 364.27531546532276
start iteration
done sorting
end inner loop
Iteration 10, inertia 364.16446885603045
start iteration
done sorting
end inner loop
Iteration 1

In [23]:
from sklearn.metrics import completeness_score,homogeneity_score
# First line of output: completeness; second line: homogeneity
print(completeness_score(class_labels,preddict_class_pca))
print(homogeneity_score(class_labels,preddict_class_pca))

0.6117374684331666
0.6100499914689615
