# IE6400 Foundations of Data Analytics Engineering
# Fall 2023 
### Module 3: Clustering Methods Part - 1
#### - STUDENT VERSION -

### Proximity Measures

Proximity measures are metrics used to determine the similarity or dissimilarity between data points. They play a crucial role in various machine learning and data analysis techniques, especially clustering and classification. The choice of a proximity measure often depends on the nature of the data and the specific problem at hand.

#### Types of Proximity Measures

There are two main types of proximity measures:

1. **Similarity Measures**: These quantify how similar two data points are. Higher values indicate greater similarity.
2. **Dissimilarity Measures (or Distance Measures)**: These represent the "distance" or dissimilarity between two data points. Higher values indicate greater dissimilarity.

#### Common Proximity Measures

#### For Continuous Data:

1. **Euclidean Distance**: 
    - It's the "ordinary" straight-line distance between two points in Euclidean space.

    $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$


2. **Manhattan Distance (or L1 norm)**:
    - It's the distance between two points measured along axes at right angles (taxicab or city block distance).

    $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$


3. **Minkowski Distance**:
    - A generalized metric. When $p=1$, it is the Manhattan distance; when $p=2$, it is the Euclidean distance; as $p \to \infty$, it approaches the Chebyshev distance (the largest single-coordinate difference).

    $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$
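
As a quick numerical check of the three formulas above, the sketch below evaluates the Minkowski distance for several values of $p$ on one pair of points. The points `x` and `y` and the helper name `minkowski` are made up for illustration.

```python
import numpy as np

# Two illustrative points (made up for this sketch)
x = np.array([2.0, 3.0])
y = np.array([5.0, 7.0])

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

manhattan = minkowski(x, y, 1)       # p = 1
euclidean = minkowski(x, y, 2)       # p = 2
chebyshev = np.max(np.abs(x - y))    # limiting case as p grows without bound
large_p = minkowski(x, y, 50)        # a large p is already close to the limit

print(f"Manhattan (p=1):   {manhattan:.4f}")
print(f"Euclidean (p=2):   {euclidean:.4f}")
print(f"Minkowski (p=50):  {large_p:.4f}")
print(f"Chebyshev (limit): {chebyshev:.4f}")
```

The coordinate differences here are 3 and 4, so the Manhattan distance is 7, the Euclidean distance is 5, and the Chebyshev distance is 4. The distance never increases as $p$ grows.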


#### For Categorical Data:

1. **Hamming Distance**: 
    - Used for categorical variables. It's the number of positions at which the corresponding symbols in two strings of equal length are different.

2. **Jaccard Similarity**:
    - Measures the similarity between two sets. It's the size of the intersection divided by the size of the union of the two sets.

    $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
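
The two categorical measures above can be sketched in plain Python. The function names and the sample inputs are illustrative, and treating two empty sets as identical is an assumption of this sketch rather than part of the definition.

```python
def hamming_distance(s1, s2):
    """Number of positions where two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s1, s2))

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical (similarity 1.0)
    return len(a & b) / len(union) if union else 1.0

h = hamming_distance("karolin", "kathrin")
j = jaccard_similarity({"Apple", "Banana", "Cherry"},
                       {"Banana", "Cherry", "Date"})

print(f"Hamming distance:   {h}")
print(f"Jaccard similarity: {j}")
```

The two strings differ in three positions, so the Hamming distance is 3. The two sets share 2 of 4 distinct items, so the Jaccard similarity is 0.5.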


#### For Mixed-Type Data:

1. **Gower Distance**: 
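    - Combines per-feature dissimilarities for mixed-type data: a range-normalized absolute difference for numeric features and a simple match/mismatch (0 or 1) for categorical features, averaged across features so the result lies between 0 and 1.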
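
One common formulation of Gower distance averages a per-feature dissimilarity: the absolute difference divided by the feature's range for numeric features, and a 0/1 mismatch for categorical features. The sketch below follows that formulation with equal feature weights. The records, column layout, and function name are hypothetical.

```python
# Hypothetical records: (Age, Income, Fruit_Preference, Is_Smoker)
records = [
    (25, 40000, "Apple", True),
    (45, 60000, "Banana", True),
    (35, 80000, "Apple", False),
]
numeric_idx = [0, 1]        # positions of numeric features
categorical_idx = [2, 3]    # positions of categorical features

# Range (max - min) of each numeric feature, used for normalization
ranges = {
    k: max(r[k] for r in records) - min(r[k] for r in records)
    for k in numeric_idx
}

def gower_distance(a, b):
    """Average per-feature dissimilarity, assuming equal feature weights."""
    parts = []
    for k in numeric_idx:
        # Range-normalized absolute difference, in [0, 1]
        parts.append(abs(a[k] - b[k]) / ranges[k] if ranges[k] else 0.0)
    for k in categorical_idx:
        # Simple matching: 0 if equal, 1 otherwise
        parts.append(0.0 if a[k] == b[k] else 1.0)
    return sum(parts) / len(parts)

d01 = gower_distance(records[0], records[1])
d12 = gower_distance(records[1], records[2])
print(f"Gower distance, records 0 and 1: {d01}")
print(f"Gower distance, records 1 and 2: {d12}")
```

For records 0 and 1 the four per-feature terms are 1.0 (age), 0.5 (income), 1 (fruit differs), and 0 (same smoker status), which average to 0.625.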

#### For Binary Data:

1. **Jaccard Coefficient**: 
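    - The Jaccard similarity applied to binary attributes: the number of attributes where both points are 1, divided by the number of attributes where at least one point is 1. Attributes where both points are 0 are ignored.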

2. **Cosine Similarity**:
    - Measures the cosine of the angle between two non-zero vectors. It's often used in text analysis to determine similarity between documents.

    $\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}$
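
To make the two binary measures concrete, the sketch below computes both by hand with NumPy on a pair of made-up binary vectors. The variable names `m11`, `m10`, and `m01` are illustrative labels for the match counts.

```python
import numpy as np

# Two illustrative binary vectors (made up for this sketch)
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])

# Count attribute patterns; positions where both are 0 are ignored by Jaccard
m11 = np.sum((a == 1) & (b == 1))   # both 1
m10 = np.sum((a == 1) & (b == 0))   # only a is 1
m01 = np.sum((a == 0) & (b == 1))   # only b is 1

jaccard_coef = m11 / (m11 + m10 + m01)

# Cosine similarity: dot product divided by the product of vector norms
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Jaccard coefficient: {jaccard_coef:.4f}")
print(f"Cosine similarity:   {cosine_sim:.4f}")
```

Here both vectors are 1 in two positions and at least one is 1 in four positions, so the Jaccard coefficient is 2/4 = 0.5. Each vector has three 1s, so the cosine similarity is 2/3.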


#### Exercise 1 Understanding Euclidean Distance

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
customers = np.array([[5, 3], [2, 8], [9, 1], [4, 7]])

# Visualize the data
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Purchase History of Customers')
plt.grid(True)
plt.show()


In [None]:
# Define a function to compute Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Calculate the distance between Customer 1 and Customer 2
distance = euclidean_distance(customers[0], customers[1])

print(f"Euclidean Distance between Customer 1 and Customer 2: {distance:.2f}")


In [None]:
# Visualize the data with the Euclidean distance
plt.scatter(customers[:, 0], customers[:, 1], color='blue', label='Customers')
plt.plot([customers[0][0], customers[1][0]], [customers[0][1], customers[1][1]], 'ro-')
plt.xlabel('Product A')
plt.ylabel('Product B')
plt.title('Euclidean Distance between Customer 1 and Customer 2')
plt.grid(True)
plt.show()


#### Exercise 2 Understanding Manhattan Distance

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
rides = {
    'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
    'End': [(5, 6), (4, 2), (3, 7), (2, 5)]
}

# Visualize the data
for start, end in zip(rides['Start'], rides['End']):
    plt.plot([start[0], end[0]], [start[1], end[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Taxi Rides')
plt.grid(True)
plt.show()


In [None]:
# Define a function to compute Manhattan distance
def manhattan_distance(point1, point2):
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])

# Calculate the distance for Ride 1
distance = manhattan_distance(rides['Start'][0], rides['End'][0])

print(f"Manhattan Distance for Ride 1: {distance} blocks")


In [None]:
# Visualize the data with the Manhattan distance for Ride 1
start, end = rides['Start'][0], rides['End'][0]
plt.plot([start[0], end[0]], [start[1], start[1]], 'bo-')
plt.plot([end[0], end[0]], [start[1], end[1]], 'bo-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Manhattan Distance for Ride 1')
plt.grid(True)
plt.show()


#### Exercise 3 Understanding Chebyshev Distance

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the dataset
moves = {
    'Start': [(2, 3), (1, 4), (3, 3), (6, 1)],
    'Target': [(5, 6), (4, 2), (3, 7), (2, 5)]
}

# Visualize the data
for start, target in zip(moves['Start'], moves['Target']):
    plt.plot([start[0], target[0]], [start[1], target[1]], 'ro-')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Game Moves')
plt.grid(True)
plt.show()


In [None]:
# Define a function to compute Chebyshev distance
def chebyshev_distance(point1, point2):
    return max(abs(point1[0] - point2[0]), abs(point1[1] - point2[1]))

# Calculate the distance for Move 1
distance = chebyshev_distance(moves['Start'][0], moves['Target'][0])

print(f"Chebyshev Distance for Move 1: {distance} squares")


In [None]:
# Visualize Move 1: the Chebyshev distance is the longer of the two dashed legs (horizontal and vertical)
start, target = moves['Start'][0], moves['Target'][0]
plt.scatter(*zip(*[start, target]), color=['blue', 'red'])
plt.plot([start[0], target[0]], [start[1], start[1]], 'g--')
plt.plot([target[0], target[0]], [start[1], target[1]], 'g--')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title('Chebyshev Distance for Move 1')
plt.grid(True)
plt.show()


#### Exercise 4 Understanding Minkowski Distance

In [None]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt

# Sample Dataset
points = np.array([[2, 3], [3, 5]])

# Minkowski Distance Function
def minkowski_distance(p1, p2, p):
    return np.sum(np.abs(p1 - p2) ** p) ** (1/p)

# Calculate Minkowski Distance for p=1,2,3,4
p_values = [1, 2, 3, 4]
distances = [minkowski_distance(points[0], points[1], p) for p in p_values]

distances


In [None]:
for p, d in zip(p_values, distances):
    print(f"For p = {p}, Minkowski Distance = {d:.2f}")


In [None]:
plt.plot(p_values, distances, 'o-', color='blue')
plt.xlabel('Value of p')
plt.ylabel('Minkowski Distance')
plt.title('Minkowski Distance for Different p Values')
plt.grid(True)
plt.show()


#### Exercise 5 Understanding the Dissimilarity Matrix

In [None]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import distance_matrix

# Sample Dataset
data = np.array([[2, 3], [3, 5], [5, 8], [8, 9], [7, 5]])

# Compute Dissimilarity Matrix
dissimilarity = distance_matrix(data, data)

dissimilarity


In [None]:
sns.heatmap(dissimilarity, annot=True, cmap='YlGnBu', cbar=True)
plt.title('Dissimilarity Matrix Heatmap')
plt.show()


#### Exercise 6 Understanding the Hamming Distance

In [None]:
# Required Libraries
import numpy as np
import matplotlib.pyplot as plt

# Sample Dataset
strings = ["101010", "100010", "111011", "101011", "110010"]

# Compute Hamming Distance
def hamming_distance(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

distances = [hamming_distance(strings[0], s) for s in strings]

distances


In [None]:
plt.bar(strings, distances, color='skyblue')
plt.xlabel('Strings')
plt.ylabel('Hamming Distance')
plt.title('Hamming Distance from the First String')
plt.show()


#### Exercise 7 Understanding the Jaccard Similarity for Categorical Data

In [None]:
import numpy as np
import pandas as pd

# Generating a sample dataset
np.random.seed(42)
data = {
    'Group1': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100),
    'Group2': np.random.choice(['Apple', 'Banana', 'Cherry', 'Date'], 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()


In [None]:
# Step 1: Determine the unique categories chosen by each group
group1_unique = set(df['Group1'].unique())
group2_unique = set(df['Group2'].unique())

# Step 2: Compute the intersection of the two sets
intersection = group1_unique.intersection(group2_unique)

# Step 3: Compute the union of the two sets
union = group1_unique.union(group2_unique)

# Step 4: Calculate the Jaccard Similarity
jaccard_similarity = len(intersection) / len(union)
jaccard_similarity


In [None]:
#!conda install -c conda-forge matplotlib-venn

In [None]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Plotting the unique categories for each group and their intersection
venn2_subsets = (len(group1_unique - group2_unique), 
                 len(group2_unique - group1_unique), 
                 len(intersection))

plt.figure(figsize=(8, 8))
venn2(subsets=venn2_subsets, set_labels=('Group1', 'Group2'))
plt.title("Venn Diagram of Preferences for Group1 and Group2")
plt.show()


#### Exercise 8 Understanding the Gower Distance for Mixed-Type Data

In [None]:
import numpy as np
import pandas as pd

# Generating a sample dataset
np.random.seed(42)
data = {
    'Age': np.random.randint(20, 60, 100),
    'Income': np.random.randint(30000, 80000, 100),
    'Fruit_Preference': np.random.choice(['Apple', 'Banana', 'Cherry'], 100),
    'Is_Smoker': np.random.choice([True, False], 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()


In [None]:
#!pip install gower

In [None]:
import gower

# Compute the Gower Distance matrix
gower_distance_matrix = gower.gower_matrix(df)

# Display a portion of the Gower Distance matrix
gower_distance_matrix[:5, :5]


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(gower_distance_matrix, cmap='viridis', cbar=True)
plt.title("Gower Distance Heatmap")
plt.show()


#### Exercise 9 Understanding the Jaccard Coefficient for Binary Data

In [None]:
import numpy as np
import pandas as pd

# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
    'Bought_Apple': np.random.choice([0, 1], 100),
    'Bought_Banana': np.random.choice([0, 1], 100),
    'Bought_Cherry': np.random.choice([0, 1], 100),
    'Is_Vegetarian': np.random.choice([0, 1], 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()


In [None]:
from sklearn.metrics import jaccard_score

# Compute the Jaccard Coefficient for the first two data points as an example
data_point_1 = df.iloc[0]
data_point_2 = df.iloc[1]

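# average='binary' scores only the positive class (1), which matches the Jaccard coefficient;
# average='macro' would also average in the score for class 0 and so count 0-0 matches
jaccard_coefficient = jaccard_score(data_point_1, data_point_2, average='binary')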

jaccard_coefficient


In [None]:
jaccard_coefficients = []

# Compute Jaccard Coefficients for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        # average='binary' ignores 0-0 matches, as the Jaccard coefficient requires
        coef = jaccard_score(df.iloc[i], df.iloc[j], average='binary')
        jaccard_coefficients.append(coef)

# Plotting the histogram
plt.hist(jaccard_coefficients, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Jaccard Coefficients")
plt.xlabel("Jaccard Coefficient")
plt.ylabel("Frequency")
plt.show()


#### Exercise 10 Understanding the Cosine Similarity for Binary Data

In [None]:
import numpy as np
import pandas as pd

# Generating a sample dataset with binary attributes
np.random.seed(42)
data = {
    'Bought_Apple': np.random.choice([0, 1], 100),
    'Bought_Banana': np.random.choice([0, 1], 100),
    'Bought_Cherry': np.random.choice([0, 1], 100),
    'Is_Vegetarian': np.random.choice([0, 1], 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity for the first two data points as an example
data_point_1 = df.iloc[0].values.reshape(1, -1)
data_point_2 = df.iloc[1].values.reshape(1, -1)

cosine_sim = cosine_similarity(data_point_1, data_point_2)

cosine_sim[0][0]


In [None]:
cosine_similarities = []

# Compute cosine similarities for all pairs of data points
for i in range(len(df)):
    for j in range(i+1, len(df)):
        coef = cosine_similarity(df.iloc[i].values.reshape(1, -1), df.iloc[j].values.reshape(1, -1))
        cosine_similarities.append(coef[0][0])

# Plotting the histogram
import matplotlib.pyplot as plt
plt.hist(cosine_similarities, bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Cosine Similarities")
plt.xlabel("Cosine Similarity")
plt.ylabel("Frequency")
plt.show()


### Evaluating Clustering Methods

Evaluating the results of clustering methods is critical to understand the quality and relevance of the clusters formed. Since clustering is unsupervised, assessing its effectiveness can be somewhat subjective. However, there are established metrics and techniques to guide this evaluation, both when ground truth labels are available and when they aren't.

#### Internal Evaluation:

Without ground truth labels, evaluate clustering based on the dataset's intrinsic properties.

#### Metrics:

1. **Silhouette Coefficient**:
   - Compares how close each data point is to the other points in its own cluster with how close it is to the points in the nearest other cluster.
   - Values range from -1 (the point is probably in the wrong cluster) to 1 (dense, well-separated clusters); values near 0 suggest overlapping clusters.

2. **Davies-Bouldin Index**:
   - Averages, over all clusters, the worst-case ratio of within-cluster scatter to the distance between cluster centroids.
   - Lower values indicate better clustering.

3. **Calinski-Harabasz Index**:
   - Compares between-cluster dispersion to within-cluster dispersion.
   - Higher values indicate better-defined clusters.

4. **Dunn Index**:
   - Ratio of the smallest distance between points in different clusters to the largest intra-cluster distance.
   - Higher values indicate better clustering.
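
As an illustration of how the Silhouette Coefficient is built up, the sketch below computes it by hand on a tiny made-up dataset. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The dataset and function name are assumptions for illustration; a singleton cluster is given a score of 0, following the usual convention.

```python
import numpy as np

# Tiny illustrative dataset: two tight clusters, four units apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_samples_manual(X, labels):
    """Per-point silhouette scores using Euclidean distance."""
    # Full pairwise distance matrix via broadcasting
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores[i] = 0.0   # convention for singleton clusters
            continue
        # a: mean distance to other members of the same cluster (self excluded)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

scores = silhouette_samples_manual(X, labels)
print("Per-point scores:", np.round(scores, 4))
print("Mean silhouette: ", round(scores.mean(), 4))
```

By symmetry every point here has $a = 1$ and $b = (4 + \sqrt{17})/2 \approx 4.06$, giving a score of about 0.754. On the same inputs, `sklearn.metrics.silhouette_score` should return the same mean.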

#### Relative Evaluation:

This involves comparing the results of clustering for different configurations or numbers of clusters.

#### Techniques:

1. **Elbow Method**:
   - Used with K-means to determine the optimal number of clusters.
   - Plot the within-cluster sum of squares (inertia) against the number of clusters. The "elbow" point, where the decrease in inertia levels off, often indicates a good number of clusters.

#### Stability and Consistency:

Evaluate the robustness of clusters by perturbing the dataset.

#### Techniques:

1. **Sub-sampling or Bootstrapping**:
   - Repeatedly sample subsets of data and perform clustering.
   - Examine the consistency of the clustering results.

2. **Adding Noise**:
   - Introduce random noise to the data.
   - Stable clusters should remain relatively unchanged.
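
A minimal sketch of the sub-sampling idea, assuming K-means with four clusters on synthetic blobs: cluster the full dataset once as a reference, re-cluster random 80% subsamples, and compare each result with the reference labels of the same points. The adjusted Rand index is used as the agreement measure because it does not depend on how clusters are numbered. The subsample fraction, number of repetitions, and seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Reference clustering on the full dataset
reference = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

rng = np.random.default_rng(0)
ari_scores = []
for _ in range(10):
    # Draw 80% of the points without replacement
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X[idx])
    # Compare with the reference labels of the same points
    ari_scores.append(adjusted_rand_score(reference[idx], sub_labels))

print(f"Mean adjusted Rand index: {np.mean(ari_scores):.3f}")
print(f"Minimum over subsamples:  {np.min(ari_scores):.3f}")
```

Scores close to 1 across subsamples suggest the clusters are stable. Scores that vary widely or fall toward 0 suggest the clustering depends heavily on which points happen to be included.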

#### Challenges and Considerations:

1. **Subjectivity**: Without a definitive "correct" clustering, some evaluation aspects remain subjective.
2. **Scale Sensitivity**: Distance-based metrics are dominated by features with large numeric ranges, so the data often need normalization or standardization first.
3. **Choice of Metric**: Different metrics might give varied evaluations for the same clustering result.
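
The scale-sensitivity point can be seen on a small made-up example. In the sketch below, the hypothetical customers have an age in years and an income in dollars. Before standardization, the income column dominates the Euclidean distance. After z-scoring each column, the nearest neighbor of customer A changes.

```python
import numpy as np

# Hypothetical customers: columns are (Age in years, Income in dollars)
X = np.array([
    [25.0, 50000.0],   # A
    [55.0, 51000.0],   # B: similar income to A, very different age
    [27.0, 54000.0],   # C: similar age to A, somewhat different income
    [50.0, 60000.0],   # D
])

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

# Distances on the raw data: income differences swamp age differences
raw_ab = euclidean(X[0], X[1])
raw_ac = euclidean(X[0], X[2])

# Standardize each column to zero mean and unit variance (z-scores)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ab = euclidean(Z[0], Z[1])
std_ac = euclidean(Z[0], Z[2])

print(f"Raw:          d(A,B) = {raw_ab:.1f}, d(A,C) = {raw_ac:.1f}")
print(f"Standardized: d(A,B) = {std_ab:.3f}, d(A,C) = {std_ac:.3f}")
```

On the raw data A appears closer to B, because a 1,000-dollar income gap is numerically small next to a 4,000-dollar one and the 30-year age gap barely registers. After standardization A is closer to C.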

In conclusion, while evaluating clustering methods, it's often beneficial to consider multiple metrics and, when possible, combine them with domain knowledge to get a comprehensive view of the clustering quality.


#### Exercise 11 Evaluating Clustering with Silhouette Coefficient

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings 
warnings.filterwarnings('ignore')

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed for reproducible results
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Silhouette Coefficient
sil_coeff = silhouette_score(X, predicted_clusters, metric='euclidean')

sil_coeff


In [None]:
plt.scatter(X[:, 0], X[:, 1], c=predicted_clusters, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("Clusters Formed by KMeans")
plt.show()


#### Exercise 12 Evaluating Clustering with Davies-Bouldin Index

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed for reproducible results
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Davies-Bouldin Index
dbi = davies_bouldin_score(X, predicted_clusters)

dbi


#### Exercise 13 Evaluating Clustering with Calinski-Harabasz Index

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed for reproducible results
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Calinski-Harabasz Index
chi = calinski_harabasz_score(X, predicted_clusters)

chi


#### Exercise 14 Evaluating Clustering with Dunn Index

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
import numpy as np

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed for reproducible results
predicted_clusters = kmeans.fit_predict(X)

# Calculating the Dunn Index
def dunn_index(X, labels):
    pairwise_dists = pairwise_distances(X)
    clusters = np.unique(labels)

    # Smallest distance between any two points that lie in different clusters
    min_intercluster_distance = min(
        pairwise_dists[labels == i][:, labels == j].min()
        for i in clusters for j in clusters if i != j
    )
    # Largest distance between any two points in the same cluster (the diameter)
    max_diameter = max(
        pairwise_dists[labels == i][:, labels == i].max() for i in clusters
    )

    return min_intercluster_distance / max_diameter

di = dunn_index(X, predicted_clusters)

di


#### Exercise 15 Validating Clustering using the Elbow Method

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans

# Applying KMeans clustering for a range of k values
wcss = []  # Within-Cluster-Sum-of-Squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


#### Exercise 16 Validating Clustering using Cohesion

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)

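# Computing the cohesion value: inertia_ is the sum of squared distances from each
# sample to its closest cluster center (lower values mean tighter clusters)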
cohesion = kmeans.inertia_

cohesion


#### Exercise 17 Validating Clustering using Separation Score

In [None]:
# Generating a sample dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the generated dataset
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Generated Data Points")
plt.show()


In [None]:
from sklearn.cluster import KMeans
import numpy as np

# Applying KMeans clustering
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(X)

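# Computing a simple separation score: the total variance of the centroid coordinates
# (not weighted by cluster size; larger values mean the centroids are more spread out)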
centroids = kmeans.cluster_centers_
separation = np.sum(np.var(centroids, axis=0))

separation


---

#### Revised Date: October 23, 2023