# Lab06a KMeans

This lab explores the differences between `NearestNeighbors` and `KMeans` and discusses considerations for working with untagged (unlabeled) data.

### Nearest Neighbors (`NearestNeighbors`)

**Import Statement:**
```python
from sklearn.neighbors import NearestNeighbors
```

**Purpose:**
- `NearestNeighbors` finds the nearest neighbors of data points.
- In scikit-learn it is an unsupervised learner: it is fitted without labels and only answers neighbor queries. The same neighbor search underlies supervised estimators such as `KNeighborsClassifier` and `KNeighborsRegressor`.
- It is not a clustering algorithm, although its distance queries can be adapted to group data points by proximity.

**How it works:**
- It computes the distance between data points and identifies the nearest neighbors based on a chosen distance metric (e.g., Euclidean, Manhattan).
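- For supervised learning, the same neighbor search is used by dedicated estimators, most commonly in classification tasks (e.g., the k-NN classifier `KNeighborsClassifier`).
- For unsupervised learning, it can support anomaly detection (for example, flagging points that lie far from their nearest neighbors) or be adapted to identify groupings.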
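
The neighbor query described above can be sketched as follows. This is a minimal illustration, not part of the original lab: the toy array `X` and the "mean distance to neighbors" outlier score are assumptions chosen for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D data (made up for illustration): two tight groups plus one far-away point
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # group near the origin
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # group near (10, 10)
    [50.0, 50.0],                              # isolated point
])

# Fit the neighbor index; note that no labels are passed
nn = NearestNeighbors(n_neighbors=2, metric="euclidean")
nn.fit(X)

# Calling kneighbors() without arguments queries the fitted points themselves
# and does not count a point as its own neighbor
distances, indices = nn.kneighbors()

# A simple outlier score: mean distance to the 2 nearest neighbors
outlier_score = distances.mean(axis=1)
most_isolated = int(np.argmax(outlier_score))

print(indices)
print("Most isolated point:", X[most_isolated])
```

Swapping `metric="euclidean"` for `metric="manhattan"` changes the distance measure without changing the rest of the code.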

### K-Means (`KMeans`)

**Import Statement:**
```python
from sklearn.cluster import KMeans
```

**Purpose:**
- `KMeans` is specifically designed for clustering.
- It is unsupervised and does not require prior labels or tags on the data.

**How it works:**
- The algorithm partitions the data into `k` clusters.
- It initializes `k` centroids (cluster centers), either randomly or with a seeding scheme such as `k-means++` (the default in scikit-learn).
- It then iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.
- This process continues until convergence (i.e., the centroids no longer change significantly).
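
The steps above can be sketched in plain NumPy. This is a simplified teaching version with random initialization, not scikit-learn's implementation; the function names `lloyd_kmeans` and `nearest_centroid`, the parameters `n_iter` and `tol`, and the synthetic blobs are all assumptions made for the example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for every point, the index of its closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: random init, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = nearest_centroid(X, centroids)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Convergence check: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment against the final centroids
    return nearest_centroid(X, centroids), centroids

# Two well-separated synthetic blobs (made-up data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
print(centroids.round(2))
```

In practice `sklearn.cluster.KMeans` is preferable: it adds `k-means++` seeding, multiple restarts, and an optimized implementation of the same loop.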

### When Data is Not Tagged with True Values

When your data is not tagged with true values (which is a typical scenario in clustering), the goal is to uncover the natural structure in the data. Here are the key points to consider:

1. **Clustering Algorithms like KMeans:**
   - **Effective Use**: Algorithms like `KMeans` do not require true labels beforehand. They are designed to find patterns and group similar data points together based purely on the data's intrinsic properties.
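   - **Initialization**: Initialize the centroids carefully, because k-means only converges to a local optimum that depends on the starting points. `init='k-means++'` (the scikit-learn default) spreads the initial centroids apart, and `n_init` reruns the algorithm from several starts and keeps the result with the lowest inertia.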

2. **Evaluation Metrics:**
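   - **Internal Validation**: Use metrics that do not require ground-truth labels to evaluate cluster quality, such as the silhouette score (higher is better), the Davies-Bouldin index (lower is better), or inertia. Inertia generally decreases as `k` grows, so it is meaningful for comparing fits at the same `k` or for the elbow method, not as an absolute score.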
   - **Visualization**: Plotting the clusters can also help in visually assessing the clustering quality.

3. **Choosing the Number of Clusters:**
   - **Elbow Method**: Plot the within-cluster sum of squares (also known as inertia) for different values of `k` (number of clusters) and look for an "elbow" point.
   - **Silhouette Analysis**: Calculate the silhouette score for different values of `k` to determine the number of clusters that best fits the data.
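
The internal metrics and the silhouette analysis described above can be sketched as a loop over candidate values of `k`. This is a minimal example on the Iris features; the range of `k` values and the variable names `scores` and `best_k` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score, silhouette_score

X = load_iris().data  # features only; the species labels are not used

scores = {}
for k in range(2, 7):  # both metrics need at least 2 clusters
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = model.fit_predict(X)
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

for k, (sil, db) in scores.items():
    print(f"k={k}: silhouette={sil:.3f} (higher is better), "
          f"Davies-Bouldin={db:.3f} (lower is better)")

# Candidate k with the highest average silhouette score
best_k = max(scores, key=lambda k: scores[k][0])
print("Best k by silhouette:", best_k)
```

The two metrics need not agree with each other or with the elbow method, so it helps to treat each as one piece of evidence rather than a final answer.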

Here’s an example using `KMeans` for clustering untagged data:

```python
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
```

```python
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Apply KMeans clustering; n_init is set explicitly so the result does not
# depend on the default of the installed scikit-learn version.
# Labels are stored as strings so Plotly treats them as discrete categories.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df[iris.feature_names]).astype(str)

# Visualize the clusters using the first two features
fig = px.scatter(df, x='sepal length (cm)', y='sepal width (cm)', color='cluster',
                 title='KMeans Clustering (Sepal Length vs Sepal Width)')
fig.show()
```

### Elbow Method Example with KMeans

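The elbow method helps choose the number of clusters when it is not known in advance: fit `KMeans` for a range of `k` values and look for the point where adding more clusters stops reducing the inertia substantially.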

### Explanation:

1. **Imports**:
   - The required libraries (pandas, Plotly for plotting, and scikit-learn for `KMeans`) were imported in the first code cell.
2. **Data Loading**:
   - The Iris dataset was loaded and converted to a pandas DataFrame (`df`) in the previous example.
3. **Inertia Calculation**:
   - Define a function `calculate_inertia` that computes the within-cluster sum of squares (inertia) for values of `k` ranging from 1 to `max_k`.
   - For each `k`, fit a `KMeans` model and store its `inertia_`.
4. **Elbow Plot**:
   - Create a line plot with markers using Plotly, showing the relationship between the number of clusters `k` and the calculated inertia.
   - Look for the "elbow", the point after which the inertia decreases much more slowly, to choose the number of clusters.

```python
# Function to calculate inertia for different k values
def calculate_inertia(data, max_k):
    """Return the KMeans inertia for each k from 1 to max_k."""
    inertias = []
    for k in range(1, max_k + 1):
        # n_init is explicit so results do not depend on the library default
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        kmeans.fit(data)
        inertias.append(kmeans.inertia_)
    return inertias

# Calculate inertia for k values from 1 to 10
max_k = 10
inertias = calculate_inertia(df[iris.feature_names], max_k)

# Visualize the Elbow Method
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(range(1, max_k + 1)),
    y=inertias,
    mode='lines+markers',
    marker=dict(color='blue'),
    name="Inertia (Within-cluster Sum of Squares)"
))

fig.update_layout(
    title="Elbow Method for Determining Optimal Number of Clusters",
    xaxis_title="Number of Clusters (k)",
    yaxis_title="Inertia",
    showlegend=False
)
fig.show()
```


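In this example, the inertia curve falls sharply over the first few values of `k` and flattens from about `k = 3` onward, which is consistent with the three species in the Iris dataset. The elbow is a visual heuristic and can be ambiguous (here a case can also be made for `k = 2`), so it is worth confirming the choice with a second criterion such as the silhouette score. The same procedure can be applied to any dataset to choose an appropriate number of clusters.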
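
Because Iris happens to ship with species labels, the agreement between three clusters and "what we already know" can be checked directly. The sketch below is an addition for illustration, assuming `adjusted_rand_score` and `pd.crosstab` as the comparison tools. This check is only possible when true labels exist, which is not the case for genuinely untagged data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Cross-tabulate species against cluster ids (the cluster numbering is arbitrary)
species = pd.Series(iris.target_names[iris.target], name="species")
table = pd.crosstab(species, pd.Series(labels, name="cluster"))
print(table)

# Adjusted Rand index: 1.0 means perfect agreement, values near 0.0 mean chance level
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

Each row of the table shows how one species is spread across the clusters, which makes it easy to see which species the clustering separates cleanly and which it mixes.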

### Summary
- **`NearestNeighbors`** performs unsupervised nearest-neighbor searches, which support tasks such as anomaly detection; the same search underlies supervised estimators such as `KNeighborsClassifier`.
- **`KMeans`** is specifically tailored for clustering tasks and does not require labeled data.
- When data is not tagged with true values, use appropriate clustering algorithms, evaluation metrics, and visualization techniques to uncover and assess the natural groupings in your data.