<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-your-start:" data-toc-modified-id="Before-your-start:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before your start:</a></span></li><li><span><a href="#Challenge-1---Import-and-Describe-the-Dataset" data-toc-modified-id="Challenge-1---Import-and-Describe-the-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenge 1 - Import and Describe the Dataset</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?" data-toc-modified-id="Explore-the-dataset-with-mathematical-and-visualization-techniques.-What-do-you-find?-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Explore the dataset with mathematical and visualization techniques. What do you find?</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-2---Data-Cleaning-and-Transformation" data-toc-modified-id="Challenge-2---Data-Cleaning-and-Transformation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Challenge 2 - Data Cleaning and Transformation</a></span></li><li><span><a href="#Challenge-3---Data-Preprocessing" data-toc-modified-id="Challenge-3---Data-Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Challenge 3 - Data Preprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here." data-toc-modified-id="We-will-use-the-StandardScaler-from-sklearn.preprocessing-and-scale-our-data.-Read-more-about-StandardScaler-here.-4.0.0.1"><span class="toc-item-num">4.0.0.1&nbsp;&nbsp;</span>We will use the <code>StandardScaler</code> from <code>sklearn.preprocessing</code> and scale our data. 
Read more about <code>StandardScaler</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler" target="_blank">here</a>.</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Challenge-4---Data-Clustering-with-K-Means" data-toc-modified-id="Challenge-4---Data-Clustering-with-K-Means-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Challenge 4 - Data Clustering with K-Means</a></span></li><li><span><a href="#Challenge-5---Data-Clustering-with-DBSCAN" data-toc-modified-id="Challenge-5---Data-Clustering-with-DBSCAN-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Challenge 5 - Data Clustering with DBSCAN</a></span></li><li><span><a href="#Challenge-6---Compare-K-Means-with-DBSCAN" data-toc-modified-id="Challenge-6---Compare-K-Means-with-DBSCAN-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Challenge 6 - Compare K-Means with DBSCAN</a></span></li><li><span><a href="#Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters" data-toc-modified-id="Bonus-Challenge-2---Changing-K-Means-Number-of-Clusters-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bonus Challenge 2 - Changing K-Means Number of Clusters</a></span></li><li><span><a href="#Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples" data-toc-modified-id="Bonus-Challenge-3---Changing-DBSCAN-eps-and-min_samples-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bonus Challenge 3 - Changing DBSCAN <code>eps</code> and <code>min_samples</code></a></span></li></ul></div>

# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [2]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings                                              
from sklearn.exceptions import DataConversionWarning          
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The origin of the dataset is [here](https://archive.ics.uci.edu/ml/datasets/wholesale+customers).

In [None]:

# Load the dataset
url = './Wholesalecustomersdata.csv'
data = pd.read_csv(url)

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(data.head())

# Display the summary statistics of the dataset
print("\nSummary statistics of the dataset:")
print(data.describe())


#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, the Italian economist Vilfredo Pareto observed that roughly 20% of the population owned about 80% of the land in Italy. Applied to retail, the same idea suggests that roughly 20% of customers account for about 80% of sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check whether this dataset displays this characteristic.
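The Pareto check boils down to one number: the share of total spend contributed by the top 20% of customers. Below is a minimal plain-Python sketch of that calculation. The function name `top_share` and the example totals are hypothetical and are not taken from the dataset.

```python
def top_share(totals, fraction=0.2):
    """Return the share of total spend contributed by the top `fraction` of customers."""
    ranked = sorted(totals, reverse=True)
    # Number of customers in the top group (at least one)
    n_top = max(1, round(len(ranked) * fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical yearly totals for ten customers (not taken from the dataset)
example_totals = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = top_share(example_totals)
print(f"Top 20% of customers account for {share:.0%} of spend")
```

In this made-up example the top two customers contribute 800 of 1000, so the share is 80%. A value well below that on the real data would suggest the dataset follows the principle only loosely.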

In [None]:
print(data.columns)
print("")
print(data.info())
print("\nNull values:")
print(data.isnull().sum())
print("\nCorrelation Matrix")
correlation_matrix = data.corr()
print(correlation_matrix)
print("\ndata describe")
print(data.describe())
data.hist(figsize=(12, 10)) 
plt.show()


# Calculate total spend per customer on a copy, so the helper columns
# do not end up among the features that are scaled and clustered later
pareto = data.copy()
pareto['TotalSpend'] = pareto.drop(columns=['Channel', 'Region']).sum(axis=1)

# Sort the data by total spend and renumber the rows in that order
sorted_data = pareto.sort_values(by='TotalSpend', ascending=False).reset_index(drop=True)

# Calculate cumulative spend
sorted_data['CumulativeSpend'] = sorted_data['TotalSpend'].cumsum()

# Calculate cumulative percentage of customers from the sorted position
# (using the original index here would scramble the x-axis after sorting)
sorted_data['CumulativeCustomers'] = (sorted_data.index + 1) / len(sorted_data)

# Calculate cumulative percentage of spend
sorted_data['CumulativeSpendPercentage'] = sorted_data['CumulativeSpend'] / sorted_data['TotalSpend'].sum()

# Plot the cumulative spend vs customers
plt.figure(figsize=(10, 6))
plt.plot(sorted_data['CumulativeCustomers'], sorted_data['CumulativeSpendPercentage'], marker='o')
plt.xlabel('Cumulative Percentage of Customers')
plt.ylabel('Cumulative Percentage of Spend')
plt.title('Cumulative Spend vs Cumulative Customers')
plt.grid(True)
plt.show()


**Your observations here**

- ex.: Frozen, Grocery, Milk and Detergents Paper have a high...
- ...



# Challenge 2 - Data Cleaning and Transformation

If you concluded in the previous challenge that the data need cleaning or transformation, do it in the cells below. If you concluded that they do not, feel free to skip this challenge, but please provide your rationale.

In [None]:

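# Function to remove outliers using the IQR rule
def remove_outliers(df, columns):
    """Drop rows lying more than 1.5 * IQR outside the quartiles, one column at a time.

    The quartiles of each column are computed on the rows that survived the
    filters applied to the previous columns.
    """
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# Columns to check for outliers
columns = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

# Remove outliers; .copy() gives an independent DataFrame, so adding label
# columns later does not trigger a SettingWithCopyWarning
data_cleaned = remove_outliers(data, columns).copy()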

print("Data after outlier removal:")
print(data_cleaned.describe())
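Dropping rows is only one option. Since the checklist in Challenge 1 asks about skewed distributions, a log transform is a common alternative that keeps every customer while compressing the long right tail. The sketch below is plain Python with hypothetical values; it is an illustration of the idea, not the approach this notebook takes. With NumPy the equivalent call would be `np.log1p`.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x) to each value; intended for non-negative data."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed spend values (not taken from the dataset)
spend = [120, 250, 300, 410, 520, 800, 1500, 40000]
transformed = log_transform(spend)

# The gap between mean and median is a rough indicator of skew
print("raw:        ", statistics.mean(spend), statistics.median(spend))
print("transformed:", round(statistics.mean(transformed), 2),
      round(statistics.median(transformed), 2))
```

After the transform the mean sits much closer to the median, which suggests the single very large value no longer dominates the column.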




**Your comment here**

-  ...
-  ...

# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also researched deeply into [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling). Keep up the good work!

Diverse value ranges across features can cause issues in our clustering: K-Means and DBSCAN both rely on distances between points, so features with large values dominate the result. Feature scaling reduces this problem. We'll use this technique again with this dataset.

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*
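To make the effect of `StandardScaler` concrete, here is a minimal plain-Python sketch of the per-column calculation it performs: subtract the column mean and divide by the population standard deviation. The helper name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(column):
    """Scale a column to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


# Hypothetical values on very different scales (not taken from the dataset)
fresh = [12000.0, 7000.0, 6000.0, 13000.0, 22000.0]
delicassen = [1300.0, 1800.0, 7800.0, 1800.0, 5200.0]

fresh_scaled = standardize(fresh)
delicassen_scaled = standardize(delicassen)
print([round(v, 2) for v in fresh_scaled])
print([round(v, 2) for v in delicassen_scaled])
```

Both columns end up centred on zero with unit spread, so neither dominates a distance calculation just because of its units.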

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the spending columns only (the `columns` list from Challenge 2),
# so helper or label columns never become clustering features
customers_scale = scaler.fit_transform(data_cleaned[columns])

# Convert the scaled data back to a DataFrame
customers_scale = pd.DataFrame(customers_scale, columns=columns)

# Display the first few rows of the scaled data
print("Scaled data:")
print(customers_scale.head())


# Challenge 4 - Data Clustering with K-Means

Now let's cluster the data with K-Means first. Initiate the K-Means model, then fit your scaled data. The fitted model returned by the `.fit` method has an attribute called `labels_`, which holds the cluster number assigned to each data record. Assign these labels back to `customers` (in this notebook, `data_cleaned`) in a new column called `customers['labels']`. You will then see the cluster results alongside the original data.

In [None]:
from sklearn.cluster import KMeans

k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)

k_means.fit(customers_scale)

labels = k_means.labels_

data_cleaned['labels'] = labels

print("Clustered data with labels:") 
print(data_cleaned.head())

### Looking at the elbow, we can choose 2 as the number of clusters
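The notebook does not show how the elbow was obtained. The idea is to compute the inertia (the sum of squared distances from each point to its nearest centroid) for several values of `k` and look for the point where the curve stops dropping sharply. The sketch below is a simplified plain-Python K-Means with a deterministic initialisation, run on hypothetical points. It is not the scikit-learn implementation. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans` model.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, n_iter=20):
    """Run a basic Lloyd's k-means and return the final inertia."""
    # Deterministic initialisation: k points spread evenly through the list
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical data with two well-separated groups (not taken from the dataset)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

For these points the inertia falls steeply from `k=1` to `k=2` and only marginally afterwards, which is the elbow shape that motivates choosing 2 clusters.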

In [16]:
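# Refit K-Means with 2 clusters; random_state makes the result reproducible
kmeans_2 = KMeans(n_clusters=2, n_init=12, random_state=42).fit(customers_scale)

# Cluster assignment for each row of the scaled data
labels = kmeans_2.predict(customers_scale)

clusters = kmeans_2.labels_.tolist()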

In [None]:
data_cleaned['Label'] = clusters

Count the values in `labels` (stored here in the `Label` column).

In [None]:
# Count the values in the 'Label' column
label_counts = data_cleaned['Label'].value_counts()

print("Counts of each cluster label:")
print(label_counts)


# Challenge 5 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit your scaled data. From the fitted model returned by the `.fit` method, assign the `labels_` back to `customers['labels_DBSCAN']` (in this notebook, `data_cleaned['labels_DBSCAN']`). Now your original data have two sets of labels, one from K-Means and the other from DBSCAN.

In [None]:
from sklearn.cluster import DBSCAN 

dbscan = DBSCAN(eps=0.5)

dbscan.fit(customers_scale)

dbscan_labels = dbscan.labels_

data_cleaned['labels_DBSCAN'] = dbscan_labels

print("Clustered data with DBSCAN labels:") 
print(data_cleaned.head())



Count the values in `labels_DBSCAN`.

In [None]:
dbscan_label_counts = data_cleaned['labels_DBSCAN'].value_counts()

print(dbscan_label_counts)
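When reading these counts, keep in mind that scikit-learn's DBSCAN uses the label `-1` for noise points, so `-1` is not a cluster of its own. A small sketch of how the counts could be summarised, using a hypothetical helper and a made-up label list:

```python
from collections import Counter


def summarize_dbscan_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN-style labels."""
    counts = Counter(labels)
    n_noise = counts.get(-1, 0)
    # Every distinct label except -1 is a real cluster
    n_clusters = len(counts) - (1 if -1 in counts else 0)
    return n_clusters, n_noise


# Hypothetical label list (not produced from the dataset)
example_labels = [0, 0, -1, 1, 1, 1, -1, 0, -1]
print(summarize_dbscan_labels(example_labels))
```

For the example list this reports 2 clusters and 3 noise points. If most rows of the real data come back as `-1`, that suggests `eps` is too small for the scaled features.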


# Challenge 6 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot a scatter plot using `labels` and another using `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes better sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [24]:
def plot(x, y, hue, title='Detergents Paper vs Milk'):
    """Scatter plot of x against y, coloured by the cluster labels in hue."""
    sns.scatterplot(x=x,
                    y=y,
                    hue=hue)
    plt.title(title)
    plt.show()

In [None]:




import matplotlib.pyplot as plt

# Plot the scatter plots for 'Detergents_Paper' vs 'Milk'
plt.figure(figsize=(14, 6))

# Plot K-Means clustering
plt.subplot(1, 2, 1)
plt.scatter(data_cleaned['Detergents_Paper'], data_cleaned['Milk'], c=data_cleaned['Label'], cmap='viridis', s=50)
plt.title('K-Means Clustering: Detergents_Paper vs Milk')
plt.xlabel('Detergents_Paper')
plt.ylabel('Milk')

# Plot DBSCAN clustering
plt.subplot(1, 2, 2)
plt.scatter(data_cleaned['Detergents_Paper'], data_cleaned['Milk'], c=data_cleaned['labels_DBSCAN'], cmap='viridis', s=50)
plt.title('DBSCAN Clustering: Detergents_Paper vs Milk')
plt.xlabel('Detergents_Paper')
plt.ylabel('Milk')

plt.tight_layout()
plt.show()


Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:

# Plot the scatter plots for 'Grocery' vs 'Fresh'
plt.figure(figsize=(14, 6))

# Plot K-Means clustering
plt.subplot(1, 2, 1)
plt.scatter(data_cleaned['Grocery'], data_cleaned['Fresh'], c=data_cleaned['Label'], cmap='viridis', s=50)
plt.title('K-Means Clustering: Grocery vs Fresh')
plt.xlabel('Grocery')
plt.ylabel('Fresh')

# Plot DBSCAN clustering
plt.subplot(1, 2, 2)
plt.scatter(data_cleaned['Grocery'], data_cleaned['Fresh'], c=data_cleaned['labels_DBSCAN'], cmap='viridis', s=50)
plt.title('DBSCAN Clustering: Grocery vs Fresh')
plt.xlabel('Grocery')
plt.ylabel('Fresh')

plt.tight_layout()
plt.show()

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:

# Plot the scatter plots for 'Frozen' vs 'Delicassen'
plt.figure(figsize=(14, 6))

# Plot K-Means clustering
plt.subplot(1, 2, 1)
plt.scatter(data_cleaned['Frozen'], data_cleaned['Delicassen'], c=data_cleaned['Label'], cmap='viridis', s=50)
plt.title('K-Means Clustering: Frozen vs Delicassen')
plt.xlabel('Frozen')
plt.ylabel('Delicassen')

# Plot DBSCAN clustering
plt.subplot(1, 2, 2)
plt.scatter(data_cleaned['Frozen'], data_cleaned['Delicassen'], c=data_cleaned['labels_DBSCAN'], cmap='viridis', s=50)
plt.title('DBSCAN Clustering: Frozen vs Delicassen')
plt.xlabel('Frozen')
plt.ylabel('Delicassen')

plt.tight_layout()
plt.show()

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# Group by 'labels' from K-Means and compute the means
kmeans_grouped = data_cleaned.groupby('Label').mean()

# Group by 'labels_DBSCAN' from DBSCAN and compute the means
dbscan_grouped = data_cleaned.groupby('labels_DBSCAN').mean()

print("Mean values for each cluster (K-Means):")
print(kmeans_grouped)

print("\nMean values for each cluster (DBSCAN):")
print(dbscan_grouped)


Which algorithm appears to perform better?

**Your observations here**

- 

# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today; we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize the results with scatter plots. What number of clusters seems to work best for K-Means?
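Although this lab relies on visual inspection, it may help to see what one such statistical measure looks like. The sketch below computes the silhouette coefficient in plain Python on hypothetical points. Note that the silhouette is an internal measure (it needs no ground-truth labels), unlike the external measures covered by the link above, and it is chosen here only as an illustration. scikit-learn offers the same measure as `sklearn.metrics.silhouette_score`.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by that point's cluster
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dist_by_cluster.pop(labels[i], None)
        if not own or not dist_by_cluster:
            # Singleton cluster or a single cluster overall: score this point as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data with two well-separated groups (not taken from the dataset)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]

print(round(silhouette_score(points, good_labels), 3))
print(round(silhouette_score(points, mixed_labels), 3))
```

A score close to 1 indicates compact, well-separated clusters, while a score near 0 or below indicates overlapping or arbitrary assignments. Comparing the score across values of `n_clusters` is one way to back up what the scatter plots suggest.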

In [None]:


# Define the range of cluster numbers to experiment with
cluster_range = range(2, 10)

# Define the column pairs to visualize
x_col, y_col = 'Grocery', 'Fresh'

plt.figure(figsize=(20, 30))

for i, n_clusters in enumerate(cluster_range, start=1):
    # Initialize and fit K-Means model
    k_means = KMeans(init="k-means++", n_clusters=n_clusters, n_init=12, random_state=42)
    k_means.fit(customers_scale)
    
    # Get the cluster labels
    labels = k_means.labels_
    
    # Plot the clustering results
    plt.subplot(4, 2, i)
    plt.scatter(data_cleaned[x_col], data_cleaned[y_col], c=labels, cmap='viridis', s=50)
    plt.title(f'K-Means Clustering with {n_clusters} Clusters')
    plt.xlabel(x_col)
    plt.ylabel(y_col)

plt.tight_layout()
plt.show()


**Your comment here**

- 

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment with changing the `eps` and `min_samples` params for DBSCAN. See how the results differ using scatter plot visualization.
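To build intuition for the two parameters before experimenting: a point is a core point when at least `min_samples` points, counting the point itself, lie within distance `eps` of it, and clusters grow outward from core points. The plain-Python sketch below only flags core points on hypothetical data. It does not perform the full clustering, and the helper name `core_point_flags` is made up for illustration.

```python
import math


def core_point_flags(points, eps, min_samples):
    """Flag each point as core if at least min_samples points (itself included) lie within eps."""
    flags = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(neighbours >= min_samples)
    return flags


# Hypothetical points: a tight group of four plus one distant point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (2.0, 2.0)]

print(core_point_flags(points, eps=0.2, min_samples=3))  # tight group is core
print(core_point_flags(points, eps=0.2, min_samples=5))  # stricter density: no core points
print(core_point_flags(points, eps=3.0, min_samples=5))  # large radius: every point is core
```

Raising `min_samples` or lowering `eps` demands denser neighbourhoods, which tends to produce more noise points. Lowering `min_samples` or raising `eps` tends to merge everything into fewer, larger clusters.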

In [None]:


# Define different parameter settings
parameters = [
    {'eps': 0.3, 'min_samples': 5},
    {'eps': 0.5, 'min_samples': 5},
    {'eps': 0.7, 'min_samples': 5},
    {'eps': 0.5, 'min_samples': 10},
    {'eps': 0.5, 'min_samples': 20}
]

# Loop through the parameter settings
for params in parameters:
    # Initialize the DBSCAN model with specified parameters
    dbscan = DBSCAN(eps=params['eps'], min_samples=params['min_samples'])
    
    # Fit the model to the scaled data
    dbscan.fit(customers_scale)
    
    # Get the cluster labels
    labels = dbscan.labels_
    
    # Plot the scatter plot for 'Grocery' vs 'Fresh'
    plt.figure(figsize=(7, 6))
    plt.scatter(data_cleaned['Grocery'], data_cleaned['Fresh'], c=labels, cmap='viridis', s=50)
    plt.title(f"DBSCAN Clustering: eps={params['eps']}, min_samples={params['min_samples']}")
    plt.xlabel('Grocery')
    plt.ylabel('Fresh')
    plt.show()


**Your comment here**

- 