Data on popular wines was gathered with the goal of clustering the wines into groups that a wine delivery service could use to make better recommendations to its customers. The dataset includes the following features:
Alcohol
Malic_Acid
Ash
Ash_Alcanity
Magnesium
Total_Phenols
Flavanoids
Nonflavanoid_Phenols
Proanthocyanins
Color_Intensity
Hue
OD280
Proline
Before clustering the data, we ran a principal component analysis (PCA). PCA projects the data onto a smaller number of features (principal components) while keeping most of the variance in the original data, which helps the clustering model run better. We condensed the 13 features into 5 components, which retained 80% of the variance. We then went with 3 clusters, as our tests showed this worked best.
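The PCA and clustering steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses scikit-learn's bundled wine dataset as a stand-in for the project data, standardizes the features before PCA, and compares cluster counts with the silhouette score. The write-up does not say which preprocessing or which comparison test was used, so all of these choices are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset,
# which has 13 numeric features per wine.
X = load_wine().data

# Standardize first (assumption) so large-valued features such as
# proline do not dominate the principal components.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (here, 80%).
pca = PCA(n_components=0.80, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)
print("components kept:", pca.n_components_)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))

# Compare candidate cluster counts with the silhouette score
# (higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# Fit the final 3-cluster model on the reduced data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_pca)
```

Passing a fraction to `n_components` lets PCA choose the number of components from the variance target instead of hard-coding it, so the sketch still works if the data changes.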
Relationships were found between the clusters and six features: Alcohol, Malic_Acid, Flavanoids, Color_Intensity, Hue, and OD280. The cluster averages for each feature are listed below.
Alcohol: Cluster 0 has the lowest average (12), Cluster 1 the highest (14), and Cluster 2 the middle (13).
Malic_Acid: Cluster 0 has the lowest average (1.9), Cluster 1 the middle (2), and Cluster 2 the highest (3.5).
Flavanoids: Cluster 0 has the middle average (2.1), Cluster 1 the highest (3), and Cluster 2 the lowest (0.6).
Color_Intensity: Cluster 0 has the lowest average (3), Cluster 1 the middle (5.2), and Cluster 2 the highest (7.2).
Hue: Clusters 0 and 1 are tied for the highest average (1), and Cluster 2 has the lowest (0.7).
OD280: Cluster 0 has the middle average (2.7), Cluster 1 the highest (3.2), and Cluster 2 the lowest (1.6).
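Per-cluster averages like the ones above can be computed by grouping the data on the cluster label. The sketch below assumes pandas and uses a few made-up rows with a hypothetical `cluster` column, so the numbers are for illustration only and are not the project's data.

```python
import pandas as pd

# Made-up rows for illustration only; these are not the project's data.
# The "cluster" column is a hypothetical name for the assigned labels.
df = pd.DataFrame({
    "Alcohol": [12.1, 11.9, 14.0, 13.9, 13.1, 12.9],
    "Hue": [1.0, 1.1, 1.0, 0.9, 0.7, 0.6],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Average of every feature within each cluster.
cluster_means = df.groupby("cluster").mean().round(2)
print(cluster_means)

# Which cluster has the highest and lowest average for one feature.
print("highest alcohol:", cluster_means["Alcohol"].idxmax())
print("lowest alcohol:", cluster_means["Alcohol"].idxmin())
```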
After finding these relationships, we ran several hypothesis tests to check whether the differences in these features between clusters were statistically significant. Alcohol content was the only feature with a statistically significant difference.
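The write-up does not say which hypothesis tests were run. One common choice for comparing a feature across three groups is a one-way ANOVA, sketched below with SciPy. The values are synthetic, drawn around the reported alcohol averages purely for illustration, and the spread, sample size, and 0.05 significance level are all assumptions.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic alcohol values per cluster (illustration only), centered on
# the reported averages of 12, 14, and 13. Spread and size are assumed.
groups = [rng.normal(loc=mean, scale=0.5, size=50) for mean in (12.0, 14.0, 13.0)]

# One-way ANOVA: the null hypothesis is that all clusters share the same mean.
f_stat, p_value = f_oneway(*groups)

alpha = 0.05  # assumed significance level
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
print("significant" if p_value < alpha else "not significant")
```

ANOVA assumes roughly normal data with similar variance in each group; if that does not hold, a rank-based alternative such as `scipy.stats.kruskal` may be a better fit.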
As a next step, the model could be trained on more data to see whether more relationships can be found. Clusters based on alcohol alone might be enough for customers; however, consulting a subject matter expert (SME) for a second opinion might be wise.