## Short Coding Project: K-Means Clustering

### Project Overview

You will perform **K-Means clustering** to group **heavy equipment usage scenarios** based on multiple features. The dataset is synthetic, focusing on equipment rental costs, usage hours, maintenance issues, transport distances, and more. Tasks include:

1. **Generating a synthetic dataset** (already in the code) with random missing values.
2. **Handling missing data** appropriately.
3. **Encoding categorical variables** using `LabelEncoder`.
4. **Scaling features** for distance-based clustering.
5. **Determining the optimal number of clusters** via the elbow method.
6. **Visualizing and interpreting** the clustering results.
7. **(Advanced)** Optionally using the Silhouette Score for deeper evaluation.

- Replace each `# YOUR CODE HERE` comment with your own code.
- **Do not change** the variable names.

### Step 0: Imports & Synthetic Dataset Generation

Run the cell below to import necessary libraries and create a **synthetic dataset** of 200 samples representing heavy equipment usage. Some values are intentionally replaced with `NaN` to simulate missing data.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.datasets import make_blobs
warnings.filterwarnings("ignore")

# Random seed for reproducibility
np.random.seed(42)

num_samples = 200

# Generate three features using make_blobs
blobs, _ = make_blobs(n_samples=num_samples, n_features=3, centers=4, random_state=42)

# Define a function to linearly scale a feature to a desired range.
def scale_feature(feature, new_min, new_max):
    old_min = feature.min()
    old_max = feature.max()
    return new_min + (feature - old_min) * (new_max - new_min) / (old_max - old_min)

# Transform the blob columns:
rental_cost = scale_feature(blobs[:, 0], 2000, 7000).round(2)
usage_hours = scale_feature(blobs[:, 1], 60, 180).round(1)
maintenance_issues = scale_feature(blobs[:, 2], 0, 10)
maintenance_issues = np.round(maintenance_issues).astype(int)

# Generate Transport_Distance (in miles)
transport_distance = np.random.randint(low=10, high=1000, size=num_samples)

# Generate categorical columns
equipment_type = np.random.choice(['Excavator','Crane','Bulldozer','Loader','Forklift'], size=num_samples)
project_location = np.random.choice(['Urban','Rural','Suburban'], size=num_samples)
fuel_type = np.random.choice(['Diesel','Electric','Gas'], size=num_samples)

# Create the DataFrame with the original feature names
data = pd.DataFrame({
    'Rental_Cost': rental_cost,
    'Usage_Hours': usage_hours,
    'Maintenance_Issues': maintenance_issues,
    'Transport_Distance': transport_distance,
    'Equipment_Type': equipment_type,
    'Project_Location': project_location,
    'Fuel_Type': fuel_type
})

# Introduce some missing values randomly (~10% chance per value)
mask_missing = np.random.choice([True, False], size=data.shape, p=[0.1, 0.9])
data = data.mask(mask_missing)

print("Initial data preview:")
print(data.head())

**Dataset columns**:

1. **Rental_Cost (float)** — Daily cost to rent the equipment (USD).  
2. **Usage_Hours (float)** — Total equipment operating hours logged.  
3. **Maintenance_Issues (int)** — Count of maintenance issues reported.  
4. **Transport_Distance (int)** — Distance in miles transported to/from the site.  
5. **Equipment_Type (object)** — Type of heavy equipment (Excavator, Crane, etc.).  
6. **Project_Location (object)** — Where the project is located (Urban, Rural, or Suburban).  
7. **Fuel_Type (object)** — Type of fuel (Diesel, Electric, or Gas).
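
One practical note on the types listed above: `NaN` is a floating-point value, so the masking step in Step 0 will typically leave the integer columns (`Maintenance_Issues`, `Transport_Distance`) stored as floats. The sketch below shows the effect on a hypothetical one-column frame; the names `toy`, `toy_mask`, and `toy_masked` are illustrative and not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single integer column (not the assignment's `data`)
toy = pd.DataFrame({"issues": [3, 5, 7, 9]})

# Boolean mask with the same shape as the frame; True marks a value to blank out
toy_mask = np.array([[False], [True], [False], [False]])

# DataFrame.mask replaces the True positions with NaN
toy_masked = toy.mask(toy_mask)

print("dtype before masking:", toy["issues"].dtype)
print("dtype after masking: ", toy_masked["issues"].dtype)
print(toy_masked)
```

If integer columns matter for your analysis, you could cast them back after filling the missing values, although K-Means itself works fine with floats.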


### Question 1: Handling Missing Values

1. **Create** a variable `missing_counts` that counts missing values per column.
2. **Fill** missing values for **numeric** columns with their **mean**.
3. **Fill** missing values for **categorical** columns with `'Unknown'`.
4. **Create** a variable `post_missing_counts` that stores the count of missing values for each column after filling. This should confirm that all missing values have been handled (i.e., should be 0 for all columns).
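
The steps above can be sketched on a small, hypothetical frame. This is only an illustration of the pattern (count, fill numeric with the mean, fill categorical with a placeholder, count again), not the solution for `data`. The names `toy`, `num_cols`, and `cat_cols` are assumptions, and the sketch selects categorical columns with `exclude=[np.number]` rather than the `include=['object']` hint used in the cell below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "hours": [10.0, np.nan, 30.0, 20.0],
    "fuel": ["Diesel", np.nan, "Gas", "Diesel"],
})

# Count missing values per column before filling
toy_missing = toy.isnull().sum()

# Numeric columns: fill each column with its own mean
num_cols = toy.select_dtypes(include=[np.number]).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Non-numeric columns: fill with a placeholder category
cat_cols = toy.select_dtypes(exclude=[np.number]).columns
toy[cat_cols] = toy[cat_cols].fillna("Unknown")

# Count again to confirm nothing is missing
toy_missing_after = toy.isnull().sum()
print(toy)
print(toy_missing_after)
```

Here the missing `hours` value becomes 20.0, the mean of 10, 30, and 20.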

In [None]:
# 1. Create `missing_counts`
missing_counts =  # YOUR CODE HERE
print("Missing values before filling:\n", missing_counts, "\n")

# 2. Fill numeric columns with the mean
#    Hint: data.select_dtypes(include=[np.number]) -> numeric columns
# YOUR CODE HERE

# 3. Fill categorical columns with 'Unknown'
#    Hint: data.select_dtypes(include=['object']) -> categorical columns
# YOUR CODE HERE

# 4. Create `post_missing_counts`
post_missing_counts = # YOUR CODE HERE
print("Missing values after filling:\n", post_missing_counts)

### Question 2: Encoding Categorical Features

K-Means requires numeric data. We will encode `Equipment_Type`, `Project_Location`, and `Fuel_Type` using `LabelEncoder`. The list of columns to encode has already been identified and stored in `categorical_cols`.

- You need to **initialize** a `LabelEncoder` for each column and **transform** that column in `data`.

At the end, we print the first 5 rows of the updated DataFrame to verify the changes.
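
As a minimal sketch of what `LabelEncoder` does, the example below encodes a single hypothetical column. The names `toy`, `site`, and `site_code` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column (not the assignment's `data`)
toy = pd.DataFrame({"site": ["Urban", "Rural", "Suburban", "Rural"]})

# fit_transform learns the sorted set of classes and maps each value to its index
encoder = LabelEncoder()
toy["site_code"] = encoder.fit_transform(toy["site"])

print(list(encoder.classes_))      # ['Rural', 'Suburban', 'Urban']
print(toy["site_code"].tolist())   # [2, 0, 1, 0]
```

Note that the integer codes follow the sorted order of the class names, so they imply an ordering that the categories do not really have. For this exercise that simplification is accepted.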

In [None]:
from sklearn.preprocessing import LabelEncoder

# 1. Identify the categorical columns
categorical_cols = ["Equipment_Type", "Project_Location", "Fuel_Type"]

# 2. Encode these columns
for col in categorical_cols:
    # YOUR CODE HERE


# 3. Print first 5 rows to check
print("\nData after label encoding (first 5 rows):")
print(data.head())

### Question 3: Feature Scaling

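We will use **all columns** in `data` for clustering. Because K-Means assigns points by distance, standardizing keeps features with large numeric ranges (such as `Rental_Cost`) from dominating features with small ranges (such as `Maintenance_Issues`).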

1. **Copy** the entire `data` into a variable `features`.
2. **Initialize** a `StandardScaler` as `scaler` and fit-transform it on `features`. Store the result in `features_scaled`.

Now we print the shape of `features_scaled` to confirm it has 200 rows and 7 columns.
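
To see the effect of standardization, here is a small sketch on a hypothetical array whose two columns differ in scale by a factor of 2000. The names `toy`, `toy_scaler`, and `toy_scaled` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 in single digits
toy = np.array([
    [2000.0, 1.0],
    [4000.0, 2.0],
    [6000.0, 3.0],
])

# fit_transform subtracts each column's mean and divides by its standard deviation
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

print(toy_scaled)
print("column means:", toy_scaled.mean(axis=0))
print("column stds: ", toy_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so each contributes comparably to the distance computation.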


In [None]:
# 1. Copy data
features = # YOUR CODE HERE

# 2. Scale the features
scaler =  # YOUR CODE HERE
features_scaled = # YOUR CODE HERE

# Print shape of features_scaled
print("Shape of scaled features:", features_scaled.shape)

### Question 4: Elbow Method

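We will choose the number of clusters `k` by looking at **inertia** (the sum of squared distances from each sample to its nearest cluster center) for `k` in `[1..10]`. Inertia generally decreases as `k` grows, so we look for the "elbow" where the improvement levels off.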

First, we initialize a list `inertia_values`.

Then complete these steps:

- **Loop** through each `k` in `[1..10]`.
- **Create** a `KMeans` model with `n_clusters=k` and these settings: `init='k-means++'`, `n_init=10`, `random_state=42`.
- **Fit** it to `features_scaled`.
- **Append** `kmeans.inertia_` to `inertia_values`.

At the end, we plot `k` vs. `inertia_values`.
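
The loop described above can be sketched on a separate toy dataset. This is an illustration under assumed names (`toy_X`, `toy_inertias`) and an assumed range of `k`, not the solution for `features_scaled`, and it prints the values instead of plotting them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 150 points drawn around 3 centers
toy_X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

toy_k_values = range(1, 7)
toy_inertias = []
for k in toy_k_values:
    # Fit one model per candidate k and record its inertia
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(toy_X)
    toy_inertias.append(model.inertia_)

for k, inertia in zip(toy_k_values, toy_inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this, the drop in inertia is usually steep up to the true number of groups and much flatter afterwards, which is the pattern to look for in the plot.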

In [None]:
inertia_values = []
k_values = range(1, 11)

for k in k_values:
    # YOUR CODE HERE

# Plot the elbow curve
plt.figure()
plt.plot(k_values, inertia_values, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

> **Choose** `k=3`, based on the "elbow" point identified in the plot.

### Question 5: K-Means Clustering & Visualization

1. **Set** `optimal_k` to the chosen value above.
2. **Create** a `KMeans` model (`kmeans_model`) with `n_clusters=optimal_k` and these settings: `init='k-means++'`, `n_init=10`, `random_state=42`. Then fit it to `features_scaled` and store the cluster labels in `labels`.
3. **Add** these labels to the original `data` in a new column named `"Cluster"`.
4. **Plot** the clusters in 3D using `Rental_Cost`, `Usage_Hours`, and `Maintenance_Issues` colored by `"Cluster"`.  

   - Also **plot** cluster centroids by inverse-transforming the model’s `cluster_centers_`.
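
Because the model is fitted on standardized features, its `cluster_centers_` are in standardized units. The sketch below shows, on hypothetical data, how a fitted scaler's `inverse_transform` maps centroids back to the original units. The names `toy`, `toy_scaler`, `toy_model`, and `toy_centers_original` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two obvious groups
toy = np.array([
    [1000.0, 10.0],
    [1100.0, 12.0],
    [5000.0, 50.0],
    [5200.0, 52.0],
])

# Scale, then cluster in the scaled space
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)
toy_model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
toy_labels = toy_model.fit_predict(toy_scaled)

# Map the centroids from scaled units back to the original units
toy_centers_original = toy_scaler.inverse_transform(toy_model.cluster_centers_)
print(toy_centers_original)
```

The recovered centroids are the per-cluster means in original units, here (1050, 11) and (5100, 51) in some order. The columns of the result follow the column order of the scaled input, which is why the plotting code in the cell below can index the first three columns.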

In [None]:
# 1. Set `optimal_k`
optimal_k = # YOUR CODE HERE

# 2. Create and fit KMeans
kmeans_model = # YOUR CODE HERE
# YOUR CODE HERE

# 3. Assign labels to `data["Cluster"]`
labels = # YOUR CODE HERE
# YOUR CODE HERE

# 4. Plot the clusters
centroids_scaled = # YOUR CODE HERE
centroids_original = # YOUR CODE HERE

# Create a 3D scatter plot using the 3 blob-generated features:
# Rental_Cost, Usage_Hours, and Maintenance_Issues
fig = plt.figure(figsize=(10, 8))

ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=25, azim=70)
scatter = ax.scatter(data["Rental_Cost"], data["Usage_Hours"], data["Maintenance_Issues"],
                     c=data["Cluster"], cmap='viridis', alpha=0.6)
ax.set_xlabel("Rental_Cost", fontsize=12)
ax.set_ylabel("Usage_Hours", fontsize=12)
ax.set_zlabel("Maintenance_Issues", fontsize=12)

plt.title("3D K-Means Clusters", fontsize=14)

# Plot centroids
ax.scatter(centroids_original[:, 0], centroids_original[:, 1], centroids_original[:, 2],
           c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.show()

### Question 6 (Advanced): Silhouette Score

**Silhouette Score** measures how similar each sample is to its own cluster compared to other clusters. Values range from `-1` to `1`, and higher generally indicates better clustering.

1. **Import** `silhouette_score` from `sklearn.metrics`.
2. **Compute** the silhouette score using `(features_scaled, labels)` and store it in `sil_score`.
3. **Print** `sil_score`.
4. **Loop** over different `k` values to find which has the highest silhouette score.

**Hint**: Check [Silhouette Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) for details.
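
As a minimal sketch of the call, the example below computes a silhouette score on a separate toy dataset. The names `toy_X`, `toy_labels`, and `toy_sil`, along with the `cluster_std` setting, are assumptions for illustration and not part of the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: 150 points around 3 fairly tight centers
toy_X, _ = make_blobs(n_samples=150, centers=3, n_features=2,
                      cluster_std=0.5, random_state=0)

# Cluster the data, then score the resulting labels against the same features
toy_labels = KMeans(n_clusters=3, init="k-means++", n_init=10,
                    random_state=42).fit_predict(toy_X)
toy_sil = silhouette_score(toy_X, toy_labels)

print(f"Silhouette score: {toy_sil:.3f}")
```

The score is only defined when there are at least two clusters, which is why the loop in the cell below starts at `k=2` rather than `k=1`.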

In [None]:
# 1. Import silhouette_score
# YOUR CODE HERE

# 2. Compute silhouette score
sil_score = # YOUR CODE HERE

# 3. Print the score
print("\nSilhouette Score:", sil_score)

# 4. Loop over a range of k values and see which yields the best silhouette score
best_k = None
best_score = -1
for k_test in range(2, 11):
    km_test = KMeans(n_clusters=k_test, init='k-means++', n_init=10, random_state=42)
    km_test.fit(features_scaled)
    test_labels = km_test.labels_
    score = # YOUR CODE HERE
    if score > best_score:
        best_score = score
        best_k = k_test
    print(f"k={k_test}, Silhouette Score={score:.4f}")
print(f"Best k by silhouette: {best_k}, Score={best_score:.4f}")