# **Conclusion Notebook**

## In this project we examined customer segmentation of the Instacart dataset from Kaggle using several clustering methods. Each method was analyzed individually, and all of them are compared here to determine which is the best fit for this project.

- **KMeans Clustering:** Identified three distinct customer segments (Frequent, After Hours, and Weekend Shoppers) with clear separation between clusters.

- **Agglomerative Clustering:** Also identified three similar segments with a slightly different distribution, showing consistent patterns in customer behavior.

- **DBSCAN:** Revealed noise in the data and identified core points, but produced fewer clear segments than KMeans and Agglomerative. Because of the noise points, no cluster beyond the single core cluster could be reliably identified.


In [2]:
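import pandas as pd

# Summary metrics for the three clustering methods.
# No WCSS, BCSS, or silhouette values were recorded for DBSCAN,
# so they are None here (printed as NaN).
metrics = {
    'Method': ['KMeans', 'Agglomerative', 'DBSCAN'],
    'Number of Clusters': [3, 3, 1],
    'WCSS': [473425.83, 28520.669, None],
    'BCSS': [351379.47, 15327.630, None],
    'Silhouette Score': [0.3521, 0.2776, None],
}

# Build the comparison table and display it.
metrics_df = pd.DataFrame(metrics)
print(metrics_df)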


          Method  Number of Clusters        WCSS       BCSS  Silhouette Score
0         KMeans                   3  473425.830  351379.47            0.3521
1  Agglomerative                   3   28520.669   15327.63            0.2776
2         DBSCAN                   1         NaN        NaN               NaN


## Metrics Overview
#### WCSS (Within Cluster Sum of Squares) - measures the compactness of each cluster or group of customers; lower scores are better. Higher scores indicate more within-cluster variance, meaning the data points are spread farther from the center of their own cluster.

#### BCSS (Between Cluster Sum of Squares) - measures the separation between clusters or groups of customers. Higher scores are better for this metric.

#### Silhouette Score - measures how similar an object (read: data point) is to its own cluster compared to other clusters. Scores range from -1 to 1, and higher is better. Higher scores indicate that the clusters are well-defined and distinct from each other, with points well matched to their own cluster and poorly matched to other clusters.
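To make the first two definitions concrete, here is a minimal sketch of how WCSS and BCSS can be computed from a set of points and their cluster labels. It assumes the standard centroid-based definitions (BCSS weighted by cluster size); the helper name `wcss_bcss` and the toy data are hypothetical and not taken from the project notebooks.

```python
import numpy as np

def wcss_bcss(X, labels):
    """Return (WCSS, BCSS) for points X and integer cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    wcss = 0.0
    bcss = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        # Squared distances from each member to its own centroid.
        wcss += ((members - centroid) ** 2).sum()
        # Size-weighted squared distance from the centroid to the overall mean.
        bcss += len(members) * ((centroid - overall_mean) ** 2).sum()
    return wcss, bcss

# Toy data (hypothetical): two tight groups in 2-D.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

wcss, bcss = wcss_bcss(X, labels)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
print(wcss, bcss, total_ss)
```

Under these definitions WCSS and BCSS always add up to the total sum of squares of the data, which is why a lower WCSS and a higher BCSS go together for a fixed dataset. For the silhouette score, scikit-learn provides `sklearn.metrics.silhouette_score(X, labels)`.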

## Clustering Method Selection
#### This came down to KMeans and Agglomerative Clustering. I am selecting KMeans Clustering as the best method for our customer segmentation. While KMeans has a noticeably higher WCSS score (more within-cluster variance), it shows much better separation between clusters in its BCSS score, and its silhouette score, while not great, is comparatively better than that of Agglomerative (0.3521 versus 0.2776). One caveat: the raw WCSS and BCSS values of the two methods are on very different scales, so the silhouette score is the more direct comparison. The 3D render comparison (shown again below) gives a visual aid for how well the separation and clustering performed. If desired, the KMeans variance can be investigated further, though the number of clusters may change as a result.
![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
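Because the raw WCSS and BCSS values for the two methods sit on very different scales, one way to put them on a common footing is to look at BCSS as a share of the total sum of squares (WCSS + BCSS). The sketch below does this with the numbers from the metrics table; it assumes both notebooks used the standard definitions, and the helper name `bcss_share` is hypothetical.

```python
# Values copied from the metrics table in this notebook.
scores = {
    "KMeans": {"wcss": 473425.83, "bcss": 351379.47},
    "Agglomerative": {"wcss": 28520.669, "bcss": 15327.630},
}

def bcss_share(wcss, bcss):
    """Fraction of the total sum of squares that lies between clusters."""
    return bcss / (wcss + bcss)

shares = {name: bcss_share(**vals) for name, vals in scores.items()}

for name, share in shares.items():
    total = scores[name]["wcss"] + scores[name]["bcss"]
    print(f"{name}: total SS = {total:.2f}, BCSS share = {share:.3f}")
```

On this normalized view KMeans places roughly 43% of the total variation between clusters versus roughly 35% for Agglomerative, which points the same way as the silhouette scores.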

#### In conclusion, KMeans Clustering is the method chosen to go forward with for this project, based on the clustering methods tested and the work performed.




## Future Work
#### If desired, several items could enhance the scope and value of this project:

#### -As mentioned above, look into the variance of the KMeans Clustering method (the WCSS score) and optimize it further.
#### -Look deeper into DBSCAN to reach a final verdict on it. Many parameter settings (read: eps, min_samples) were tried, but more work can be done to establish whether DBSCAN really is an unreliable candidate for this project.
#### -A big value add for this project from the outset was creating marketing promotions for the customer clusters once they were segmented. If so desired, continue on and create a list of custom promotions, then predict their success using Linear Regression and plot the results for each cluster group. Report how successful each promotion is for each cluster group so the stakeholders can use the results going forward.
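For the DBSCAN item above, one common heuristic for narrowing down `eps` is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the bend where the distances start rising sharply. This was not part of the original analysis. The sketch below is only an illustration on synthetic stand-in data; the helper name `k_distances` and the choice `k=5` are assumptions, and the full pairwise distance matrix it builds is only practical on a small sample.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix; suitable for small samples only.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

rng = np.random.default_rng(0)
# Synthetic stand-in data (hypothetical): one dense blob plus scattered noise.
blob = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
noise = rng.uniform(low=-5.0, high=5.0, size=(20, 2))
X = np.vstack([blob, noise])

kd = k_distances(X, k=5)
print("median k-distance:", np.median(kd))
print("largest k-distance:", kd[-1])
```

Plotting `kd` and reading off the value at the bend gives a candidate `eps` to try alongside a `min_samples` close to `k`. If no clear bend appears on the real data, that would be further evidence that DBSCAN is a poor fit here.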