# Day 2: Clustering and Dimensionality Reduction
## K-Means & Principal Components Analysis (PCA)

In this notebook we will implement clustering with and without dimensionality reduction. In particular, you will:

* Complete the function `kmeansNewPredict(X, k)` to implement the k-means algorithm from scratch for different numbers of clusters (k = 2, 3, 4, 5).
* Use `PCA()` from scikit-learn to reduce the dimension of the data to d = 2, then plot the data.


# Import libraries

The required libraries for this notebook are pandas, sklearn, copy, numpy, pickle, mpl_toolkits and matplotlib.

In [None]:
# import libraries
import pickle
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


# Load the data
For the clustering tasks we will use the dataset `clustering_data.pkl`. It consists of 3000 data points, each with 2 features (2D data). The points are drawn from several different distributions (between 2 and 5).

In [None]:
# Load the pickle file; the with-block closes the file automatically
with open('./dataset/clustering_data.pkl', 'rb') as data_file:
    data = pickle.load(data_file, encoding='latin1')
X = data['X']

# Use K-means clustering from a library

We will first see how k-means can be implemented using already available functions from the scikit-learn library.

In [None]:
# sklearn functions implementation
def kmeansPredict(X, k):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)
    return y_kmeans, kmeans.cluster_centers_

# run K-means for different values of k
k1 = 2
y_predict, centers = kmeansPredict(X,k1)
#  plot the clusters 
plt.scatter(X[:, 0], X[:, 1], c=y_predict, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
print(centers)


# Implement your own K-means clustering function

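Fill in the missing variables (`cluster`, `points` and `C[i]`) in the `kmeansNewPredict(X,k)` function. The assignment step gives each point the index of its closest centroid, and the update step moves each centroid to the mean of the points assigned to it.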
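
As a hint for the two missing steps, here is a minimal sketch of a single k-means iteration on a tiny hand-made array. The names `toy_X`, `toy_C`, `labels` and `new_C` are illustrative and not part of the notebook, and the vectorized form is only one possible way to write the assignment and update steps.

```python
import numpy as np

# Hypothetical toy data: four 2D points forming two obvious groups
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
# Hypothetical starting centroids
toy_C = np.array([[1.0, 1.0], [9.0, 9.0]])

# Assignment step: distance from every point to every centroid, shape (4, 2)
distances = np.linalg.norm(toy_X[:, None, :] - toy_C[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# Update step: each centroid becomes the mean of the points assigned to it
new_C = np.array([toy_X[labels == j].mean(axis=0) for j in range(len(toy_C))])

print(labels)
print(new_C)
```

In the exercise the same two steps are repeated until the centroids stop moving.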

In [None]:
# Euclidean Distance Calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

def kmeansNewPredict(X,k):
    # Generate X and Y coordinates of random centroids
    C_x = np.random.randint(0, np.max(X), size=k)
    C_y = np.random.randint(0, np.max(X), size=k)
    C = np.array(list(zip(C_x, C_y)), dtype=np.float32)

    C_old = np.zeros(C.shape)
    clusters = np.zeros(len(X))
    # Error: Distance between new centroids and old centroids
    error = dist(C, C_old, None)
    # Loop will run till the error becomes zero
    while error != 0:
        # Assigning each value to its closest cluster
        for i in range(len(X)):
            distances = dist(X[i], C)
            #cluster = ...
            clusters[i] = cluster
        # Storing the old centroid values
        C_old = deepcopy(C)
        # Finding the new centroids by taking the average value
        for i in range(k):
            #points = ...
            #C[i] = ...
        error = dist(C, C_old, None)


    # Return the cluster label of every point together with the final centroids
    return clusters, C

# run K-means for different values of k
k2 = 2
y_kmeans1, centers1 = kmeansNewPredict(X,k2)

#  plot the clusters 
fig, ax = plt.subplots()
for i in range(k2):
    # select only the points assigned to cluster i
    points = X[y_kmeans1 == i]
    ax.scatter(points[:, 0], points[:, 1], s=50)
ax.scatter(centers1[:, 0], centers1[:, 1],  c='black', s=200, alpha=0.5);

print(centers1)


# Load the iris dataset for PCA

We will now use the iris dataset available from the scikit-learn library, which includes 4 feature columns (sepal length, sepal width, petal length, and petal width) and 150 data points.

In [None]:
dataset=load_iris()

#print(dataset.data.shape) # shape of data
#print(dataset.target.shape)
#print(dataset.feature_names)
dataset_df=pd.DataFrame(dataset.data,columns=dataset.feature_names) # convert dataset.data to a DataFrame
print(dataset_df.head())

X_iris=dataset.data
y_iris=dataset.target

#print(X_iris.shape)
#print(y_iris.shape)


In [None]:
# to enable graph interactivity 
%matplotlib notebook  
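fig_pca=plt.figure(1, figsize=(4,3))
# Create the 3D axes through add_subplot; recent matplotlib versions no longer attach Axes3D(fig) to the figure automatically
plot=fig_pca.add_subplot(111, projection='3d', elev=48, azim=134)
plot.set_position([0,0,.95,1])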


for name, label in [('Setosa',0), ('Versicolour',1), ('Virginica',2)]:
    plot.text3D(X_iris[y_iris==label,0].mean(),
                X_iris[y_iris==label,1].mean()+1.5,
                X_iris[y_iris==label,2].mean(),name, 
                horizontalalignment='center',
                bbox=dict(alpha=.5, edgecolor='w',facecolor='w') )



y_iris=y_iris.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent



plot.scatter(X_iris[:, 0], X_iris[:, 1], X_iris[:, 2], c=y_iris,  cmap=plt.cm.nipy_spectral, edgecolor='k')



# Scale your data

Before we apply PCA, the features (i.e. X) should be standardized to unit scale (mean = 0 and variance = 1). You can do so with `StandardScaler()` from scikit-learn.

For more information on the importance of feature scaling, please visit: **https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py**
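
To see what standardization does, the following minimal sketch scales a small made-up array whose two columns live on very different scales. The array `toy` and its values are assumptions chosen for illustration; they are not taken from the iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# After scaling, each column has mean ~0 and (population) standard deviation ~1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

Both columns end up identical after scaling, which is why a feature with large raw values no longer dominates the variance that PCA looks at.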

In [None]:
# Create the scaler object
sc = StandardScaler() 
# Standardizing the features
X_scale = sc.fit_transform(X_iris)  
#print(X_iris.shape)

# Project data to 2D using PCA

In the following cell, implement the projection of the original data onto 2 dimensions. Then apply k-means and compare your results with and without PCA.
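
One possible way to make the comparison concrete is to cluster the same data before and after the projection and measure how well the two labelings agree. The sketch below does this on synthetic blobs rather than on iris. The data, the variable names, and the choice of `adjusted_rand_score` as the agreement measure are all assumptions made for illustration, not something the exercise prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Hypothetical synthetic data: three well separated blobs in 4 dimensions
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]], dtype=float)
toy = np.vstack([c + rng.normal(size=(50, 4)) for c in blob_centers])

# Cluster in the original 4D space
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy)

# Project to 2D with PCA, then cluster again
toy_2d = PCA(n_components=2).fit_transform(toy)
labels_2d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_2d)

# The adjusted Rand index ignores how clusters are numbered; 1.0 means identical partitions
score = adjusted_rand_score(labels_full, labels_2d)
print(score)
```

Comparing raw label arrays directly would be misleading, because k-means numbers its clusters arbitrarily on each run.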

**Note 1:** The original data is 4 dimensional.

**Note 2:** The principal components produced by dimensionality reduction usually have no particular meaning assigned to them. They are simply the two directions along which the data varies the most.
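
To illustrate Note 2, the sketch below fits PCA on synthetic 4D data and inspects the resulting directions. The data and the names `toy_4d` and `toy_pca` are hypothetical. The point is only that each component is a unit-length mix of the original features, not a feature with a name of its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 samples, 4 features, seeded for reproducibility
rng = np.random.default_rng(0)
toy_4d = rng.normal(size=(100, 4))
# Make feature 1 closely track feature 0 so that one direction carries extra variance
toy_4d[:, 1] = 2.0 * toy_4d[:, 0] + 0.1 * toy_4d[:, 1]

toy_scaled = StandardScaler().fit_transform(toy_4d)
toy_pca = PCA(n_components=2).fit(toy_scaled)

# Each row of components_ is one direction in the original 4D feature space
print(toy_pca.components_.shape)
# The directions have unit length
print(np.linalg.norm(toy_pca.components_, axis=1))
# Share of the total variance captured by each direction, largest first
print(toy_pca.explained_variance_ratio_)
```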

In [None]:
#pca = ... #call PCA
#X_new = ... # fit
print('Information included per principal component (%): '+str(pca.explained_variance_ratio_*100))

In [None]:
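# k was never defined above; iris contains three species, so k = 3 is a natural starting value (try others as well)
k = 3
y_new, centers_new = kmeansPredict(X_new,k)
print(centers_new)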