# Assignment 05 - [30 points] Solutions

## <u>Case Study</u>: Wine Dataset Analysis

Suppose that you are a chemist who is interested in studying the properties of different types of wine. The wine.csv file contains the results of a chemical analysis of 178 wines grown in the same region in Italy but derived from three different cultivars. The cultivar is listed in the 'Wine Class' column. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

For *this* case study, the goal of our cluster analysis is to find the larger naturally occurring clusters in the dataset and learn more about them. *If* we find that there are outliers and noise in this dataset, then it is ok to identify them as such and leave them out of the clusters. However, if a point is not *actually* noise or an outlier, we would like it to be part of a cluster in our final clustering.

https://archive.ics.uci.edu/ml/datasets/wine

### <u>Research Questions</u>:

We would like to answer the following research questions about the dataset.
* Is this dataset clusterable? If so, how many clusters are there, what are their shapes and sizes (i.e., the number of objects in them), and are they well-separated?
* How far apart are these clusters? Are they cohesive?
* How do these clusters associate with the wine class labels (i.e., the three cultivars)? How homogeneous are the clusters with respect to the wine class labels? To what extent is each wine class kept together within a single naturally occurring cluster in this dataset?
* Are there noise and outliers in this dataset?
* Which clustering algorithm will be most useful in identifying the naturally occurring large clusters in the dataset?


<hr style="height:1px;border:none;color:#333;background-color:#333;" />

### Imports

## 1. [0.5 pt] Data Pre-Processing 

This dataset does not have any missing values.

* Read the wine.csv into a dataframe.
* Create another dataframe that contains the scaled numerical variables.
* Create another dataframe that contains the scaled numerical variables and the wine class labels.

You should scale each variable by subtracting its mean and dividing by its standard deviation.
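The scaling step above can be sketched as follows. This is a minimal illustration, not the official solution: the small dataframe is a hypothetical stand-in for `pd.read_csv("wine.csv")`, and the column names other than 'Wine Class' are assumed.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("wine.csv"); the numeric column
# names and values here are assumptions made for illustration only.
df = pd.DataFrame({
    "Wine Class": [1, 1, 2, 2, 3, 3],
    "Alcohol": [14.2, 13.2, 12.4, 12.3, 13.7, 12.9],
    "Malic acid": [1.7, 1.8, 0.9, 1.1, 3.3, 4.6],
})

# Every column except the class label is treated as a numerical variable.
num_cols = [c for c in df.columns if c != "Wine Class"]

# z-score scaling: subtract the column mean, divide by the column std.
X = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Scaled variables together with the class labels.
X_with_labels = X.copy()
X_with_labels["Wine Class"] = df["Wine Class"]
```

Note that pandas computes the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population version (`ddof=0`); either is a reasonable reading of the instructions, and the difference is small for 178 rows.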

## 2. Clusterability

### 2.1. [0.5 pt] t-SNE Plots
Using 6 different perplexity values and at least two random states for each perplexity value, project this dataset onto two dimensions with the t-SNE algorithm. Show the projected coordinates in a scatterplot for each combination of random state and perplexity value. Also, color code your points by the Wine Class labels.
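One way to organize the perplexity and random state grid is sketched below. The synthetic `make_blobs` data is an assumption standing in for the scaled wine dataframe, and the specific perplexity values and random states are illustrative choices, not prescribed ones.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled wine data (assumption): 60 points,
# 13 features, 3 groups. y plays the role of the Wine Class labels.
X, y = make_blobs(n_samples=60, n_features=13, centers=3, random_state=0)

perplexities = [5, 10, 15, 20, 25, 30]   # illustrative values
random_states = [0, 1]                   # illustrative values

# Store one 2-D embedding per (perplexity, random_state) combination.
embeddings = {}
for perp in perplexities:
    for rs in random_states:
        tsne = TSNE(n_components=2, perplexity=perp, random_state=rs)
        embeddings[(perp, rs)] = tsne.fit_transform(X)
```

Each entry of `embeddings` is an array with one row per observation and two columns, which can be passed to a scatterplot with the class labels as the color variable. The perplexity must be smaller than the number of observations.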

### 2.2. [1.5 pt] Interpretation:

Use your t-SNE plots to answer the following questions.

1. Is this dataset clusterable?
2. If so, how many clusters are in this dataset?
3. What is the shape of the clusters in this dataset?
4. Are these clusters well separated?
5. Are these clusters balanced in size?
6. Describe the relationship between the clusters suggested by the t-SNE plots and the wine classes.

Finally, pick out a random state and perplexity value that reflects the answers to your questions and show the corresponding t-SNE plot below.

## 3. Dataset Noise Assessment

### 3.1. [1.5 pt] Nearest Neighbor Distance Plots

For your scaled dataset, create the following 4 plots:
* k=2 nearest neighbor sorted distance plot,
* k=3 nearest neighbor sorted distance plot,
* k=4 nearest neighbor sorted distance plot,
* k=5 nearest neighbor sorted distance plot.

See DBSCAN application notes for code assistance.
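As a rough sketch of how the sorted distance curves might be computed (the course's DBSCAN notes may use a different convention), the code below uses scikit-learn's `NearestNeighbors` on synthetic stand-in data. The helper name `sorted_kth_neighbor_distances` is hypothetical, and whether a point counts as its own neighbor is an assumption stated in the comments.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the scaled wine data (assumption).
X, _ = make_blobs(n_samples=100, n_features=13, centers=3, random_state=0)

def sorted_kth_neighbor_distances(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted."""
    # Assumption: the point itself is NOT counted as a neighbor. Querying
    # the training data returns each point as its own nearest neighbor at
    # distance 0, so we ask for k + 1 neighbors and take column k.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])

# One sorted curve per k; plot each curve against its index.
curves = {k: sorted_kth_neighbor_distances(X, k) for k in [2, 3, 4, 5]}
```

Plotting each curve against `range(len(curve))` gives the sorted distance plot. A sharp elbow near the right end suggests a small number of points that are far from their neighbors.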

### 3.2. [1.5 pt] Noise Assessment

Based on these plots above, would you say that your dataset has a lot of noise? Explain.

## 4. Clustering Algorithm Selection

### 4.1. DBSCAN - Parameter Selection and Algorithm Fit

We would like to try to use DBSCAN to cluster this dataset, but first we need to determine the best parameter values for $\epsilon$ and $minpts$. For *this* case study, the goal of our cluster analysis is to find the naturally occurring clusters in the dataset and learn more about them. It is ok if some points, whether they are outliers or noise, are not included in the final clustering. But if we have a strong indication that there is not a lot of noise and there are few outliers, then we should not expect to see too many points labeled as such in our result.

#### 4.1.1.  [3 pt] t-SNE Plots

Try to find a pair of $minpts$ and $\epsilon$ values for which DBSCAN will mostly identify the main clusters suggested by the t-SNE plots and label very few points as noise. If you are unable to find a pair of parameter values that can do this, explain why this might have happened.

<u>Hint</u>: *To test this out, you should try out many different combinations of minpts and epsilon values. Try out $minpts\in\{2,3,4,5\}$ and many values of $\epsilon$ in the range $[1.5,4.5]$.*
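The parameter search in the hint can be sketched as a simple grid loop. The data below is a synthetic stand-in (an assumption), and the step size of 0.5 for $\epsilon$ is an illustrative choice; a finer grid may be needed on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled wine data (assumption).
X, _ = make_blobs(n_samples=150, n_features=13, centers=3, random_state=0)

results = []
for minpts in [2, 3, 4, 5]:
    for eps in np.arange(1.5, 4.51, 0.5):
        # In scikit-learn, min_samples counts the point itself.
        labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
        # DBSCAN marks noise points with the label -1.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        results.append((minpts, round(float(eps), 2), n_clusters, n_noise))

# Each row: (minpts, eps, number of clusters, number of noise points).
for row in results:
    print(row)
```

Scanning the printed rows for combinations with the expected number of clusters and few noise points narrows down which pairs are worth plotting on the t-SNE coordinates.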

### 4.2.  [1 pt] k-Means: Algorithm Fit

List three reasons why this dataset might be a good fit for k-means (aside from the fact that this is a numerical dataset).

### 4.3. k-Means Clustering

#### 4.3.1.  [1 pt] Cluster the data

Cluster this dataset into the number of clusters that your t-SNE plots are suggesting. Then color code your clustering results on your t-SNE plot. Use a random state of 100.
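A minimal k-means sketch with the requested random state is shown below. The synthetic data and the choice of three clusters are assumptions for illustration; on the real data, use the scaled dataframe and the number of clusters your t-SNE plots suggest.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled wine data (assumption).
X, _ = make_blobs(n_samples=150, n_features=13, centers=3, random_state=0)

# n_clusters=3 is an assumed value; n_init is set explicitly because its
# default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=100)
cluster_labels = kmeans.fit_predict(X)
```

The resulting `cluster_labels` array can be added as a column to the dataframe holding the t-SNE coordinates and used as the color variable in the scatterplot.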

#### 4.3.2. [1 pt] Interpretation:

For *this* case study, one of the goals of our cluster analysis is to find larger, naturally occurring clusters in the dataset and learn more about them. How well did k-means do in meeting this goal? Explain.

## 5. Post k-Means Cluster Analysis

### 5.1. Cluster Sorted Similarity Matrix

#### 5.1.1. [2 pt] Create it
Create a cluster sorted similarity matrix for this k-means clustering and your scaled dataset.
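One possible construction is sketched below: sort the observations by cluster label, then compute all pairwise Euclidean distances in that order. The synthetic data is an assumption, and using Euclidean distance as the underlying measure is also an assumption about what the course intends.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled wine data (assumption).
X, _ = make_blobs(n_samples=90, n_features=13, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=100).fit_predict(X)

# Reorder rows so that members of the same cluster are adjacent.
order = np.argsort(labels, kind="stable")
X_sorted = X[order]
labels_sorted = labels[order]

# Square matrix of pairwise Euclidean distances in cluster-sorted order.
dist_matrix = squareform(pdist(X_sorted, metric="euclidean"))
```

Displaying `dist_matrix` as a heatmap should show blocks along the diagonal, one per cluster, when the clustering is cohesive and well separated.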

#### 5.1.2. [2 pt] Interpretation

Use this cluster sorted similarity matrix to answer the following questions.

1. Which two clusters are farthest apart? Explain.
2. Which cluster has the worst overall cohesion? Explain.

### 5.2. Cluster Sorted Similarity Matrix Correlation

#### 5.2.1. [1 pt] Calculate correlation coefficient
Scale this cluster sorted similarity matrix (using the method we discussed in class) and calculate the correlation of this scaled matrix with the "ideal" scaled cluster sorted similarity matrix, which has the same number of clusters and the same number of objects within each of these clusters.
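The scaling method from class is not stated here, so the sketch below assumes a common choice: min-max scaling the distances and flipping them so that 1 means identical and 0 means farthest apart. The ideal matrix is then 1 for pairs in the same cluster and 0 otherwise. The toy points and labels are hypothetical, and correlating only the upper triangle entries is also an assumption.

```python
import numpy as np

# Hypothetical cluster-sorted toy data: 3 tight groups in 2-D.
X_sorted = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                     [5.0, 5.0], [5.1, 5.0],
                     [10.0, 0.0], [10.0, 0.1]])
labels_sorted = np.array([0, 0, 0, 1, 1, 2, 2])

# Pairwise Euclidean distances.
diff = X_sorted[:, None, :] - X_sorted[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# Assumed scaling: min-max distances, then convert to similarities in [0, 1].
sim = 1 - (dist - dist.min()) / (dist.max() - dist.min())

# Ideal similarity: 1 if two objects share a cluster, 0 otherwise.
ideal = (labels_sorted[:, None] == labels_sorted[None, :]).astype(float)

# Correlate the entries above the diagonal (each pair counted once).
iu = np.triu_indices_from(sim, k=1)
corr = np.corrcoef(sim[iu], ideal[iu])[0, 1]
print(corr)
```

A correlation close to 1 means that the observed similarities nearly match the block structure of the ideal matrix. If the class method scales differently, only the `sim` line should need to change.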

#### 5.2.2. [1 pt] Interpretation

Does this correlation coefficient suggest that this clustering has a very *high* amount of cohesion and separation? Explain.

### 5.3. Pre-Assigned Class Label Association

#### 5.3.1. [1 pt] Homogeneity and Completeness
Calculate the homogeneity score, the completeness score, and the V-score between the Wine Class labels and the k-means clustering.
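These three scores are available in `sklearn.metrics`. The sketch below uses small hypothetical label lists in place of the real Wine Class and k-means labels, so the values it produces say nothing about the wine data.

```python
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)

# Hypothetical labels for 9 objects (not taken from the wine data).
wine_class = [1, 1, 1, 2, 2, 2, 3, 3, 3]
cluster = [0, 0, 0, 1, 1, 2, 2, 2, 2]

# Homogeneity: does each cluster contain only one class?
h = homogeneity_score(wine_class, cluster)
# Completeness: are all members of a class in the same cluster?
c = completeness_score(wine_class, cluster)
# V-measure: harmonic mean of homogeneity and completeness.
v = v_measure_score(wine_class, cluster)

print(h, c, v)
```

The first argument is the pre-assigned class labeling and the second is the clustering; swapping them swaps the homogeneity and completeness values.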

#### 5.3.2. [1 pt] Interpretation

1. Interpret your homogeneity score and your completeness score.

2. In your t-SNE plot, color code the points by the k-means cluster labels and set the marker shape of the points by the wine class labels. Do the homogeneity score and the completeness score that you just calculated agree with what you see in this t-SNE plot? Explain.