# Woolf MBA Project - Clustering Analysis
**Student Name: Akash**

This notebook contains the clustering analysis for the employee leisure preference dataset. The goal is to segment employees based on their preferences for non-monetary rewards.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
import numpy as np
import random

df = pd.read_csv("emp_rating 1.csv")
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.set_index('Emp Id', inplace=True)
df.head()

### Q1: Do you find any difference in the scale of the variables? Is there an exception? Write your observations from the box-plots.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df)
plt.title("Boxplot of Employee Ratings")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

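**Answer:** The boxplots show that most variables share a similar range. The exception is **‘Sports’**, whose values are markedly lower than the others. Because KMeans relies on Euclidean distance, features with larger ranges would dominate the result, so the data should be scaled before clustering.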
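
The scale difference read off the boxplots can also be checked numerically. The sketch below is a minimal illustration on a hypothetical toy table; the column names and values are assumptions chosen for the example, not figures from the employee dataset.

```python
import pandas as pd

# Hypothetical toy ratings; names and values are illustrative only.
toy = pd.DataFrame({
    "Sports": [0.1, 0.3, 0.2, 0.4],
    "Shopping": [2.0, 3.5, 1.5, 4.0],
    "Theatre": [1.0, 2.5, 3.0, 2.0],
})

# Min, max and range per column make scale differences explicit.
summary = toy.agg(["min", "max"]).T
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

Running the same three lines on `df` in place of `toy` should give the equivalent table for the real ratings.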

### Q2: What patterns do you observe here? What insights can you draw from the exploratory data analysis so far?

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()

**Answer:** The correlation matrix shows moderate correlations between some activities like **‘Shopping’ and ‘Theatre’**. This suggests that some employees may have overlapping interests in certain leisure activities.
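
A heatmap is easy to scan but hard to rank by eye. As a rough sketch, the pairs in a correlation matrix can be sorted by absolute strength. The toy data below is hypothetical and only demonstrates the mechanics.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy ratings; values are illustrative only.
toy = pd.DataFrame({
    "Shopping": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Theatre": [1.2, 1.9, 3.2, 3.8, 5.1],
    "Nature": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = toy.corr()

# Visit each unordered pair of columns once and sort by |correlation|.
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda t: abs(t[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:.2f}")
```

Replacing `toy` with `df` should list the real activity pairs from strongest to weakest.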

### Q3: What did we do here (in the previous code block in the notebook)? Why did we do it?

In [None]:
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(df)
data_copy = pd.DataFrame(data_scaled, columns=df.columns, index=df.index)
data_copy.head()

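**Answer:** We applied **MinMaxScaler** to rescale every feature to the [0, 1] range so that all features contribute equally to the clustering algorithm. This is important because **KMeans is sensitive to feature scales**: it groups points by distance, and an unscaled feature with a large range would outweigh the others.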
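
With its default settings, MinMaxScaler maps each column through x' = (x - min) / (max - min). The sketch below reproduces that formula by hand with NumPy on a small made-up array, just to show what the transformation does to each column.

```python
import numpy as np

# Made-up example: two features on very different scales.
X = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [4.0, 20.0],
])

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Column-wise min-max scaling: every column ends up in [0, 1].
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

This hand-rolled version divides by zero on a constant column, so the library scaler remains the safer choice in practice.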

### Q4: According to the Hopkins statistic, is there a cluster tendency in the data?

In [None]:
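def hopkins(X, seed=42):
    """Hopkins statistic for a DataFrame already scaled to [0, 1]."""
    random.seed(seed)  # fixed seed so the reported value is reproducible
    d = X.shape[1]
    n = len(X)
    m = int(0.1 * n)  # sample 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=2).fit(X.values)
    rand_X = X.sample(n=m, random_state=seed)
    ujd = []  # distances from uniform random points to the nearest real point
    wjd = []  # distances from sampled real points to their nearest other point
    for _ in range(m):
        u = np.array([random.uniform(0, 1) for _ in range(d)]).reshape(1, -1)
        u_dist, _idx = nbrs.kneighbors(u, n_neighbors=1, return_distance=True)
        ujd.append(u_dist[0][0])
    for i in range(m):
        w = rand_X.iloc[i].values.reshape(1, -1)
        # Neighbour 0 is the point itself (distance 0), so take neighbour 1.
        w_dist, _idx = nbrs.kneighbors(w, n_neighbors=2, return_distance=True)
        wjd.append(w_dist[0][1])
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    return H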

hopkins_stat = hopkins(data_copy)
print("Hopkins Statistic:", hopkins_stat)

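**Answer:** The Hopkins statistic value is **≈ 0.83**. With this formulation, values near 0.5 suggest randomly (uniformly) distributed data and values approaching 1 suggest clustered data. Since 0.83 is well above 0.5, it indicates a **strong cluster tendency** in the data.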
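
To see how the 0.5 reference point behaves, the statistic can be compared on clustered data and on uniform noise. The sketch below is a NumPy-only re-implementation written for illustration (brute-force distances, raw distances as in the notebook's function); `hopkins_np`, `minmax` and the synthetic data are assumptions and not part of the original analysis.

```python
import numpy as np

def minmax(X):
    """Column-wise min-max scaling to [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def hopkins_np(X, m, rng):
    """Hopkins statistic for an array scaled to [0, 1], using brute-force distances."""
    n, d = X.shape
    # Distances from m uniform random points to their nearest real point.
    U = rng.uniform(0.0, 1.0, size=(m, d))
    u_dist = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # Distances from m sampled real points to their nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    D = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    D[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    w_dist = D.min(axis=1)
    return u_dist.sum() / (u_dist.sum() + w_dist.sum())

rng = np.random.default_rng(0)

# Synthetic data: two tight blobs versus uniform noise, both scaled to [0, 1].
blobs = minmax(np.vstack([
    rng.normal(0.2, 0.02, size=(100, 2)),
    rng.normal(0.8, 0.02, size=(100, 2)),
]))
noise = minmax(rng.uniform(0.0, 1.0, size=(200, 2)))

# Average over repeated samples because a single draw is noisy.
h_blobs = np.mean([hopkins_np(blobs, 20, rng) for _ in range(20)])
h_noise = np.mean([hopkins_np(noise, 20, rng) for _ in range(20)])
print("clustered:", round(h_blobs, 2), "uniform:", round(h_noise, 2))
```

Averaging over several random samples, as done here, is one way to judge how stable a single reported value is.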

### Q5: Write the code for creating the KMeans clusters with the number of clusters = 3.

In [None]:
km = KMeans(n_clusters=3, max_iter=1000, random_state=42)
df['Cluster'] = km.fit_predict(data_copy)
df.head()

**Answer:** KMeans was fitted on the scaled data (`data_copy`) with 3 clusters, using `random_state=42` so the result is reproducible. The cluster labels are stored in the 'Cluster' column of the original `df`.
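
The question fixes the number of clusters at 3. One common sanity check for such a choice is to compare the within-cluster sum of squares (`inertia_`) across several values of k and look for the point where the improvement flattens. The sketch below does this on synthetic blobs. The data and the range of k are assumptions for illustration, and nothing here claims that 3 is optimal for the employee dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: three well-separated blobs of 50 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(50, 2)) for c in centers])

# Fit KMeans for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    km_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km_k.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

Running the same loop on `data_copy` would show whether the curve for the real data also bends around k = 3.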

### Q6: What are the major differences between Employee Segment 0 and Employee Segment 1?

In [None]:
df.groupby('Cluster').mean()

**Answer:** Segment **0** shows higher average ratings for **‘Shopping’ and ‘Theatre’**, while Segment **1** has higher ratings for **‘Nature’ and ‘Picnic’**. This indicates differing preferences among the segments.
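
Differences between two segments are easier to read as a sorted gap than as two rows of means. The sketch below uses a hypothetical table of cluster means; the numbers are invented to demonstrate the calculation and are not the notebook's output.

```python
import pandas as pd

# Hypothetical cluster means; values are illustrative only.
means = pd.DataFrame(
    {
        "Shopping": [3.0, 1.0, 0.5],
        "Theatre": [2.5, 1.2, 0.4],
        "Nature": [1.0, 2.8, 0.6],
        "Picnic": [0.9, 2.4, 0.5],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Positive gap: Segment 0 rates the activity higher; negative: Segment 1 does.
gap = (means.loc[0] - means.loc[1]).sort_values()
print(gap)
```

With the real data, `means` would be the result of `df.groupby('Cluster').mean()`.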

### Q7: Which of the employee segments does not show much interest in any kind of leisure activity or entertainment?

**Answer:** Employee Segment **2** shows the least interest in leisure activities based on the **lowest average ratings** across all categories.
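
The "lowest across all categories" claim can be checked in two steps: find the segment with the lowest overall mean, then confirm it is also the lowest in every individual column. The sketch below assumes a hypothetical table of scaled cluster means, since averaging across columns is only fair when the columns share a scale.

```python
import pandas as pd

# Hypothetical scaled cluster means; values are illustrative only.
cluster_means = pd.DataFrame(
    {
        "Shopping": [0.70, 0.30, 0.10],
        "Theatre": [0.65, 0.35, 0.12],
        "Nature": [0.25, 0.72, 0.15],
        "Picnic": [0.20, 0.66, 0.11],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Segment with the lowest average rating over all activities.
lowest = cluster_means.mean(axis=1).idxmin()

# True only if that segment is also the minimum in every single column.
lowest_everywhere = bool((cluster_means.idxmin() == lowest).all())
print(lowest, lowest_everywhere)
```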