__NAME:__ __FULLNAME__  
__SECTION:__ __NUMBER__  
__CS 5970: Machine Learning Practices__

# Homework 13: Clustering

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include figure/plot titles appropriately as well.
Post any questions to the Canvas Discussion.

For this assignment you will be exploring clustering. Clustering is an unsupervised learning technique that can be utilized to discover interesting patterns or groupings amongst data.
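
As a quick illustration of the idea, the sketch below clusters a small synthetic dataset without using its labels. The data from `make_blobs` and every variable name here are assumptions made for the example, not part of the assignment data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumption): 300 points drawn around 3 centers
X_toy, y_toy = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                          random_state=0)

# Fit KMeans on the features only; y_toy is never shown to the model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_toy)

# Number of points assigned to each cluster, and the learned centers
print(np.bincount(cluster_ids))
print(kmeans.cluster_centers_)
```

The cluster ids are arbitrary integers, so cluster 0 need not correspond to class 0.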

Select one of the two tasks below.

### [Task 1](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones)
Explore clustering for the Human Activity Recognition dataset. Recordings come from 30 subjects
performing activities of daily living while carrying a waist-mounted smartphone with embedded
inertial sensors.


### Task 2
Explore clustering for a few synthetic datasets.


### Objectives
* Clustering
* Dimensionality Reduction

### Notes
* Do not save work within the ml_practices folder

### General References
* [Guide to Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Numpy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [DataCamp: Matplotlib](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Scikit-learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Scikit-learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Scikit-learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Scikit-learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [Scikit-learn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [Scikit-learn Preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)

In [None]:
"""import sys
sys.path.insert(0, "../hw09")

# THESE 4 IMPORTS ARE CUSTOM .py FILES AND CAN BE FOUND 
# ON THE SERVER AND GIT
import visualize
import metrics_plots
from pipeline_components import DataSampleDropper, DataFrameSelector
from pipeline_components import DataScaler, DataLabelEncoder"""

import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import os, re, fnmatch
import pathlib, itertools
import time as timelib
import matplotlib.pyplot as plt
import matplotlib.patheffects as peffects

from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import explained_variance_score, confusion_matrix
from sklearn.metrics import mean_squared_error, roc_curve, auc, f1_score

from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.neighbors import NearestNeighbors
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

FIGWIDTH = 5
FIGHEIGHT = 5
FONTSIZE = 12

plt.rcParams['figure.figsize'] = (FIGWIDTH, FIGHEIGHT)
plt.rcParams['font.size'] = FONTSIZE

plt.rcParams['xtick.labelsize'] = FONTSIZE
plt.rcParams['ytick.labelsize'] = FONTSIZE

%matplotlib inline

In [None]:
"""
Display current working directory of this notebook. If you are using 
relative paths for your data, then it needs to be relative to the CWD.
"""
HOME_DIR = pathlib.Path.home()
pathlib.Path.cwd()
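
The note about relative paths can be checked directly. The sketch below assumes the dataset folder sits in the CWD (the relative path is the one used in the loading cell of this notebook) and reports whether the file can be found there. `data_path` is a name introduced only for this example.

```python
import pathlib

# Resolve the relative data path against the current working directory
cwd = pathlib.Path.cwd()
data_path = cwd / 'UCI_HAR_Dataset' / 'uci_har_Xy_train.csv'

# If this prints False, adjust the path (or the CWD) before calling read_csv
print(data_path)
print(data_path.exists())
```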

# TASK 1 DATASET: UCI_HAR_Dataset

### LOAD DATA

In [None]:
"""
https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Abstract: Human Activity Recognition database built from the recordings of 30 subjects
performing activities of daily living (ADL) while carrying a waist-mounted smartphone 
with embedded inertial sensors.

Number of Attributes: 561

Source:
Jorge L. Reyes-Ortiz(1,2), Davide Anguita(1), Alessandro Ghio(1), Luca Oneto(1) and
Xavier Parra(2)
1 - Smartlab - Non-Linear Complex Systems Laboratory
DITEN - Università degli Studi di Genova, Genoa (I-16145), Italy.
2 - CETpD - Technical Research Centre for Dependency Care and Autonomous Living
Universitat Politècnica de Catalunya (BarcelonaTech). Vilanova i la Geltrú (08800), 
Spain activityrecognition '@' smartlab.ws


Data Set Information:
The experiments have been carried out with a group of 30 subjects. Each person
performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, 
STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. 

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise 
filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap
(128 readings/window).

Check the README.txt file for further details about this dataset.

Attribute Information:
For each record in the dataset it is provided:
- Triaxial acceleration from the accelerometer (total acceleration) and the 
  estimated body acceleration.
- Triaxial Angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.

Citation:
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
"""
# TODO: Set any data file paths appropriately
data = pd.read_csv('UCI_HAR_Dataset/uci_har_Xy_train.csv')
data.shape

In [None]:
data.head()

In [None]:
""" PROVIDED
Check for any NaNs
"""
data.isna().any().any()

In [None]:
data.describe()

In [None]:
""" TODO
Class counts
Use data['activity_label'].value_counts() (the top-level pd.value_counts is 
deprecated in recent pandas) to determine how many instances there are for 
each activity class
"""
cnt = # TODO
cnt
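
For reference, this is how class counting behaves on a tiny made-up label column. The `toy` frame is an assumption standing in for the real data.

```python
import pandas as pd

# Hypothetical label column standing in for data['activity_label']
toy = pd.DataFrame({'activity_label': [1, 2, 2, 3, 3, 3]})

# value_counts() returns one row per distinct value, most frequent first
toy_cnt = toy['activity_label'].value_counts()
print(toy_cnt)
```

The index of the result holds the distinct labels and the values hold their counts.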

In [None]:
""" PROVIDED
Feature names for each column in the data
"""
features = pd.read_csv('UCI_HAR_Dataset/meta/features.txt', sep=r'\s+', header=None)
features.columns = ['num', 'feature_name']
features.shape

In [None]:
features.head()

In [None]:
""" PROVIDED
Activity Class Label names
"""
activity_labels = pd.read_csv('UCI_HAR_Dataset/meta/activity_labels.txt', sep=r'\s+', header=None)
activity_labels.columns = ['num', 'activity_name']
nclasses, ncols = activity_labels.shape
nclasses, ncols

In [None]:
# Display class names and corresponding number
activity_labels

In [None]:
# Separate out just the class names
activity_names = list(activity_labels['activity_name'].values)
activity_names

### PARTITION DATA

In [None]:
""" TODO
Separate the data into X and y. Use the features variable to pull out the 
appropriate feature data. For y we are predicting the 'activity_label' 
column from the data.

Hold out a subset of the data using train_test_split, with a test_size 
fraction of 0.2 and shuffle set to False
"""
# Feature Names
feature_names = features['feature_name'].values

# TODO: Separate the data into X and y
X = # TODO

# Subtract 1 from the label numbers for convenience, so that each number matches
# the list index, i.e. change the label numbers from 1-6 to 0-5
y = data['activity_label'].copy().values - 1

# TODO: Split into train and validation



nsamples_train = Xtrain.shape[0]

Xtrain.shape, Xval.shape, ytrain.shape, yval.shape
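
To see what an unshuffled hold-out does, the sketch below splits ten made-up examples. The arrays and the names `Xtr`, `Xva`, `ytr`, `yva` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption): 10 examples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# shuffle=False keeps the original row order, so the last 20% of the rows
# becomes the held-out set
Xtr, Xva, ytr, yva = train_test_split(X_toy, y_toy, test_size=0.2,
                                      shuffle=False)

print(ytr)
print(yva)
```

Because the order is preserved, the held-out rows are simply the final rows of the file, which matters if the file is sorted by subject or by time.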

### CLUSTERING

In [None]:
def group_scatter_plot(X, labels, feature_names, label_names, centers=None,
                         leg_on=False, FIGSIZE=(15,10), elev=35, angle=310):
    '''
    Plot 2D or 3D scatter plots of selected sets of features
    PARAMS:
        X: full feature space as a dataframe
        labels: labels for each example in X
        feature_names: subset of features to plot from X
        label_names: contains nclass elements, where each element is the name 
                     for each class (Note: only viable for classes not clusters)
        centers: nclass-by-2 or nclass-by-3 matrix of group centers.
        leg_on: flag whether to display the legend (Note: only set True when 
                plotting the actual class groupings. False when displaying clusters)
        FIGSIZE: tuple of figure width and height
        elev: 3D plot view elevation
        angle: 3D plot view angle
    '''
    # Select a subset of features
    data = X[feature_names].copy().values

    # Create the figure
    fig = plt.figure(figsize=FIGSIZE)
    
    # 2D Plots
    if data.shape[1] == 2:
        ax0 = fig.add_subplot(111)
        # Plot the points by class or cluster
        for i, name in enumerate(label_names):
            inds = labels == i
            ax0.scatter(data[inds,0], data[inds,1], label=name)
        if leg_on: ax0.legend()
        
    # 3D Plots
    elif data.shape[1] > 2:
        ax0 = fig.add_subplot(111, projection='3d')
        # Plot the points by class or cluster
        for i, name in enumerate(label_names):
            inds = labels == i
            ax0.scatter(data[inds,0], data[inds,1], data[inds,2], label=name)
        ax0.view_init(elev, angle)
        if leg_on: ax0.legend()
    
    if centers is not None:
        # Plot the group centers
        mn = np.min(labels)
        mx = np.max(labels)
        if data.shape[1] == 2:
            ax0.scatter(centers[:,0], centers[:,1], c=np.arange(mn, mx+1), 
                        marker='D', cmap=plt.cm.rainbow)
        elif data.shape[1] > 2:
            ax0.scatter(centers[:,0], centers[:,1], centers[:,2], 
                       c=np.arange(mn, mx+1), marker='D', cmap=plt.cm.rainbow)

In [None]:
def compute_class_centers(X, y, feature_names, classes):
    '''
    Compute group centers within the selected sub-feature space
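    PARAMS:
        X: full feature space as a dataframe
        y: labels for each example in X
        feature_names: subset of features from X defining the sub-feature space
        classes: contains nclass elements, where each element is the index for each class
    RETURNS: nclass-by-nfeats matrix of class centers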
    '''
    data = X[feature_names].copy().values

    nclasses = len(classes)
    nfeats = len(feature_names)
    centers = np.empty((nclasses, nfeats))
    
    for c in classes:
        centers[c, :] = np.mean(data[y == c, :], axis=0)
    return centers
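
As a sanity check on what a class center is, the sketch below computes per-class feature means on a tiny made-up table using a pandas groupby. The feature names `f1` and `f2` are hypothetical, and the labels are assumed to be 0..nclass-1, as `compute_class_centers` also assumes when it indexes rows by class value.

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical names 'f1', 'f2') with labels 0 and 1
X_toy = pd.DataFrame({'f1': [0.0, 2.0, 10.0, 12.0],
                      'f2': [1.0, 3.0, 5.0, 7.0]})
y_toy = np.array([0, 0, 1, 1])

# Mean of each feature within each class: one row per class, one column
# per feature
centers_toy = X_toy.groupby(y_toy).mean().values
print(centers_toy)
```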

In [None]:
""" TODO
Use the following two cells.

Observe and analyze 2 feature subspaces of 2 or 3 features. To do this select sets
of 2 or 3 features. For example, consider the feature subspace defined by the features
'tBodyAcc-entropy()-X', 'tBodyAcc-entropy()-Y' and 'tBodyAcc-entropy()-Z'.

First plot the actual classifications in this subspace using:
    group_scatter_plot(Xtrain, ytrain, selected_feats, activity_names, leg_on=True, angle=300)

Second, construct a KMeans model for unsupervised learning of various clusterings of
the data in the selected feature subspaces. Use predict on the KMeans model to determine
the set of 6 clusters. Use group_scatter_plot(leg_on=False) with the predicted clusters as the 
labels, and not the real classifications. Display the model's inertia (i.e. the sum of squared
distances of samples to their closest cluster center).
"""
angle = 300

feats = # TODO: list of subset of selected features
# TODO: Plot Actual classifications. use group_scatter_plot with leg_on=True




# TODO: Determine clusters. Create KMeans model and predict the clusters



# TODO: Plot determined clusters. use group_scatter_plot with leg_on=False



# Sum of squared distances of samples to their closest cluster center
model.inertia_
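
The definition of inertia can be verified by hand. The sketch below fits KMeans on four made-up 1D points and recomputes the sum of squared distances to the closest center. All names are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): two tight groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Squared distance from every sample to every center,
# shape (n_samples, n_clusters)
diffs = X_toy[:, None, :] - km.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# Inertia: for each sample take the closest center, then sum
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

Inertia always decreases as the number of clusters grows, so it is most useful for comparing models with the same number of clusters in the same feature space.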

In [None]:
""" TODO
Observe the class groupings in the second selected feature subspace:
"""



### IsoMap

In [None]:
""" TODO
Reduce the full feature space (i.e. all 561 features) down to 
2 features (i.e. n_components) using Isomap. Also, make sure to 
determine a good choice for the number of neighbors.

Display the classes in the new feature space.

Then construct a KMeans model to locate a set of 6 clusters
in this new feature subspace. Display the determined clusters in 
this new feature subspace.
"""
# TODO: Create the Isomap object and transform the training data
isomap2 = # TODO: create Isomap object
Xmap2 = # TODO: transform the training data

# TODO: Plot actual classifications in the new feature space
fig = plt.figure(figsize=(15,8))
ax0 = fig.add_subplot(111)
for i, name in enumerate(activity_names):
    # Mask of examples belonging to the current class 
    inds = ytrain == i
    # TODO: use scatter to plot the selected examples in the isomap
    # subspace. set the label to the class name
    # see the API pages for matplotlib scatter
    
ax0.set_title('Actual Classifications')
ax0.legend()


# TODO: Construct a KMeans Model. fit it to the Isomap features
iso2_model = # TODO: create and fit the model
# TODO: determine the cluster groupings using predict
pred = # TODO



# TODO: Plot determined predicted clusters
fig = plt.figure(figsize=(15,8))
ax0 = fig.add_subplot(111)
# TODO: use scatter to plot all the examples in the isomap subspace.
# do NOT set the label, instead set the parameter c to the predicted clusters
# see the API pages for matplotlib scatter

ax0.set_title('Determined clusters')

# Sum of squared distances of samples to their closest cluster center
iso2_model.inertia_
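
The sketch below shows the Isomap API on a small synthetic swiss roll rather than on the assignment data. The neighbor counts tried here are arbitrary assumptions. `reconstruction_error()` is only one rough signal for comparing them, and looking at the resulting scatter plots is at least as informative.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Toy data (assumption): 400 points on a 3D swiss roll
X_roll, _ = make_swiss_roll(n_samples=400, random_state=0)

# Embed into 2 dimensions for a few neighborhood sizes
for k in (10, 15, 20):
    iso = Isomap(n_neighbors=k, n_components=2)
    X_iso = iso.fit_transform(X_roll)
    print(k, X_iso.shape, iso.reconstruction_error())
```

Too few neighbors can leave the neighborhood graph poorly connected, while too many can link points across folds of the manifold.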

In [None]:
""" TODO
Reduce the full feature space (i.e. all 561 features) down to 
3 features using Isomap. Also, make sure to determine a good 
choice for the number of neighbors.

Display the classes in the new feature space.

Then construct a KMeans model to locate a set of 6 clusters
in this new feature space. Display the determined clusters in 
this new feature space.
"""
# TODO: Create the Isomap object and transform the training data
isomap3 = # TODO
Xmap3 = # TODO

# TODO: Plot actual classifications in the new feature space
fig = plt.figure(figsize=(15,15))
ax0 = fig.add_subplot(111, projection='3d')
for i, name in enumerate(activity_names):
    # Mask of examples belonging to the current class 
    inds = ytrain == i
    # TODO: use scatter to plot the selected examples in the isomap
    # subspace. set the label to the class name
    
    
ax0.set_title('Actual Classifications')
ax0.legend()


# TODO: Construct a KMeans Model
iso3_model = # TODO
# TODO: determine the cluster groupings
pred = # TODO



# TODO: Plot determined clusters
fig = plt.figure(figsize=(15,15))
ax0 = fig.add_subplot(111, projection='3d')
# TODO: use scatter to plot all the examples in the isomap subspace.
# do NOT set the label, instead set the parameter c to the predicted clusters

ax0.set_title('Determined clusters')

iso3_model.inertia_

### PCA

In [None]:
""" TODO
Reduce the full feature space (i.e. all 561 features) down to 
2 features using PCA. Also, set whiten to True.

Display the classes in the new feature space.

Then construct a KMeans model to locate a set of 6 clusters
in this new feature space. Display the determined clusters in 
this new feature space.
"""
# TODO: Create the PCA object and transform the training data
pca2 = # TODO
Xpca2 = # TODO

# TODO: Plot actual classifications in the new feature space
fig = plt.figure(figsize=(15,8))
ax0 = fig.add_subplot(111)
for i, name in enumerate(activity_names):
    # Mask of examples belonging to the current class 
    inds = ytrain == i
    # TODO: use scatter to plot the selected examples in the PCA
    # subspace. set the label to the class name
    # see the API pages for matplotlib scatter

ax0.set_title('Actual Classifications')
ax0.legend()



# TODO: Construct a KMeans Model. fit the model to the PCA features
pca2_model = # TODO
# TODO: determine the cluster groupings
pred = # TODO

# TODO: Plot determined clusters
fig = plt.figure(figsize=(15,8))
ax0 = fig.add_subplot(111)
# TODO: use scatter to plot all the examples in the PCA subspace.
# do not set the label, instead set the parameter c to the predicted clusters
# see the API pages for matplotlib scatter

ax0.set_title('Determined clusters')

pca2_model.inertia_
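
To see what `whiten=True` changes, the sketch below projects made-up data with and without whitening and compares the variance of each component. The random data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (assumption): 200 samples, 5 features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

X_plain = PCA(n_components=2).fit_transform(X_toy)
X_white = PCA(n_components=2, whiten=True).fit_transform(X_toy)

# Without whitening each component keeps its own variance;
# with whitening each component is rescaled to unit variance
print(X_plain.var(axis=0, ddof=1))
print(X_white.var(axis=0, ddof=1))
```

Since KMeans relies on Euclidean distance, whitening keeps the first component from dominating the cluster assignments purely because of its larger spread.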

In [None]:
""" TODO
Reduce the full feature space (i.e. all 561 features) down to 
3 features using PCA. Also, set whiten to True.

Display the classes in the new feature space.

Then construct a KMeans model to locate a set of 6 clusters
in this new feature space. Display the determined clusters in 
this new feature space.
"""
# TODO: Create the PCA object and transform the training data
pca3 = # TODO
Xpca3 = # TODO

# TODO: Plot actual classifications in the new feature space
elev = 25
angle = 300
fig = plt.figure(figsize=(15,15))
ax0 = fig.add_subplot(111, projection='3d')
for i, name in enumerate(activity_names):
    # Mask of examples belonging to the current class 
    inds = ytrain == i
    # TODO: use scatter to plot the selected examples in the PCA
    # subspace. set the label to the class name
    # see the API pages for matplotlib scatter

ax0.view_init(elev, angle)
ax0.set_title('Actual Classifications')
ax0.legend()


# TODO: Construct a KMeans Model. fit the model on the PCA features
pca3_model = # TODO
# TODO: determine the cluster groupings
pred = # TODO

# TODO: Plot determined clusters
fig = plt.figure(figsize=(15,15))
ax0 = fig.add_subplot(111, projection='3d')
# TODO: use scatter to plot all the examples in the PCA subspace.
# do not set the label, instead set the parameter c to the predicted clusters
# see the API pages for matplotlib scatter

ax0.view_init(elev, angle)
ax0.set_title('Determined clusters')

pca3_model.inertia_

# TASK 2 DATASETS: SYNTHETIC DATA

### D31

In [None]:
""" PROVIDED
Load the dataset
"""
D31 = pd.read_csv('synthetic/D31.txt', sep=r'\s+', header=None)
D31.columns = ['x', 'y', 'cluster']
D31['cluster'] = D31['cluster'] - 1
D31.shape

In [None]:
# Display first few examples
D31.head()

In [None]:
""" TODO
Display class counts using value_counts() on the 'cluster' column
"""
d31_cnt = # TODO
d31_cnt

In [None]:
""" TODO
Display the actual classifications and the predicted clusters
"""
# TODO: Plot true classifications. use group_scatter_plot. 
# the feature names are ['x', 'y']. the label names are d31_cnt.index


# TODO: Determine a set of clusters using KMeans



# TODO: Plot the determined clusters. use group_scatter_plot. 


model.inertia_

### AGGREGATION

In [None]:
""" PROVIDED
Load the dataset
"""
Aggregation = pd.read_csv('synthetic/Aggregation.txt', sep=r'\s+', header=None)
Aggregation.columns = ['x', 'y', 'cluster']
Aggregation['cluster'] = Aggregation['cluster'] - 1
Aggregation.shape

In [None]:
# Display first few examples
Aggregation.head()

In [None]:
""" TODO
Display class counts
"""
agg_cnt = # TODO
agg_cnt

In [None]:
""" TODO
Display the actual and predicted clusters
"""
# TODO: Plot true classifications. use group_scatter_plot. 
# the feature names are ['x', 'y']. the label names are agg_cnt.index



# TODO: Determine clusters using KMeans



# TODO: Plot the determined clusters. use group_scatter_plot. 


model.inertia_

### R15

In [None]:
""" PROVIDED
Load the dataset
"""
R15 = pd.read_csv('synthetic/R15.txt', sep=r'\s+', header=None)
R15.columns = ['x', 'y', 'cluster']
R15['cluster'] = R15['cluster'] - 1
R15.shape

In [None]:
# Display first few examples
R15.head()

In [None]:
""" TODO
Display class counts
"""
r15_cnt = # TODO
r15_cnt

In [None]:
""" TODO
Display the actual and predicted clusters
"""
# TODO: Plot true classifications. use group_scatter_plot. 
# the feature names are ['x', 'y']. the label names are r15_cnt.index


# TODO: Determine clusters using KMeans



# TODO: Plot the determined clusters. use group_scatter_plot. 


model.inertia_

# DISCUSSION
For whichever task you selected, answer the following question:

In several paragraphs describe the original clusters and compare them to the clusters learned by the KMeans model. What are the limitations or issues with the learned clusters? Please be clear and concise in your response.
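
One optional way to support the comparison is a contingency table of true classes against cluster ids, since KMeans numbers its clusters arbitrarily and the ids cannot be matched to class labels directly. The sketch below uses made-up labels. `adjusted_rand_score` is an addition that is not used elsewhere in this notebook, and it is only one possible summary of agreement.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy labels (assumption): cluster ids are a relabeling of the classes,
# with one example of class 1 placed in the wrong cluster
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_clus = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

# Rows: true classes, columns: cluster ids
table = pd.crosstab(y_true, y_clus, rownames=['class'], colnames=['cluster'])
print(table)

# Adjusted Rand index: 1.0 when the two partitions are identical up to
# relabeling, close to 0 for unrelated labelings
ari = adjusted_rand_score(y_true, y_clus)
print(ari)
```

A row spread across several columns indicates a class that KMeans split, and a column drawing from several rows indicates classes that were merged.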