### Detailed Steps for Cluster Analysis

#### **Objective**: Perform clustering to group movies in the IMDB dataset by common characteristics using **K-Means** and **Hierarchical Clustering**.

---

### **Step 1: Data Preparation**
1. **Load the Dataset**:
   - Import the dataset `imdb_dataset.csv` from the given path (`Files -> Labs -> Data`).
   - Check the structure of the dataset (e.g., with `pandas` in Python):
     

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage


data = pd.read_csv("imdb_dataset.csv")
print(data.head())

2. **Data Cleaning**:
   - Remove missing or irrelevant rows (if any).
   - Remove labels if present (clustering is unsupervised).

3. **Preprocessing for Clustering**:
   - **Text Feature Extraction with TF-IDF**:
     - Clean the data first (the cell below covers item 2), then vectorize the text column with `TfidfVectorizer` to build `tfidf_matrix`.
      

In [14]:
# Data Cleaning
# Remove rows with missing values
data.dropna(inplace=True)

# Remove the 'Unnamed: 0' column if it exists
if 'Unnamed: 0' in data.columns:
    data.drop(columns=['Unnamed: 0'], inplace=True)

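# Remove identifier columns that should not drive the clustering
# (assumes 'title' and 'imdb_url' are identifiers; missing columns are skipped)
data.drop(columns=['title', 'imdb_url'], inplace=True, errors='ignore')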

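- Output:
  - `tfidf_matrix` (the result of `TfidfVectorizer.fit_transform` on the text column) is a sparse matrix where each row corresponds to a document and each column to a term. Each value is the term's frequency in that document weighted by its inverse document frequency, so terms that appear in most documents get low weights.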

---



### **Step 2: K-Means Clustering**
1. **Determine Optimal Number of Clusters (Elbow Method)**:
   - Use the **Sum of Squared Errors (SSE)** to find the "elbow" point.
   - Plot SSE vs. Number of Clusters:
    

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(tfidf_matrix)
    sse.append(kmeans.inertia_)  # Sum of squared distances to cluster centers

plt.plot(range(1, 11), sse)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.title('Elbow Method')
plt.show()



2. **Apply K-Means**:
   - Choose the optimal number of clusters (`k`) based on the elbow method.
   - Perform clustering:
   

In [None]:
optimal_k = 5  # Replace with the value from the elbow plot
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(tfidf_matrix)
labels = kmeans.labels_  # Cluster labels for each document


   - Add cluster labels back to the dataset:
   

In [None]:
data['cluster'] = labels



3. **Visualize Clusters** (optional):
   - Reduce the dimensionality for visualization using PCA or t-SNE:
    

In [None]:
from sklearn.decomposition import PCA
import seaborn as sns

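# PCA needs a dense array; for a large vocabulary this can use a lot of memory
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(tfidf_matrix.toarray())
# Color each projected document by its K-Means cluster label
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=labels, palette="viridis")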



---

### **Step 3: Hierarchical Clustering**
1. **Compute Linkages**:
   - Use **single**, **complete**, and **average** linkages.
   - Import and fit the hierarchical clustering algorithm:
    

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

linkages = ['single', 'complete', 'average']
for method in linkages:
    Z = linkage(tfidf_matrix.toarray(), method=method)
    plt.figure(figsize=(10, 7))
    dendrogram(Z)
    plt.title(f"Hierarchical Clustering Dendrogram ({method.capitalize()} Linkage)")
    plt.show()


2. **Interpret the Dendrogram**:
   - The vertical axis represents the distance between merged clusters.
   - Cut the dendrogram at a height that yields the desired number of clusters.
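
Cutting the dendrogram can be done programmatically with SciPy's `fcluster`. The sketch below uses toy 2-D points in place of the dense TF-IDF rows, and the threshold `1.0` is an arbitrary value chosen for this toy data, not a recommendation for the IMDB dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points standing in for the dense TF-IDF rows: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
])

Z = linkage(X, method="average")

# Option 1: ask for a fixed number of flat clusters.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Option 2: cut at a given height (distance threshold) read off the dendrogram.
labels_by_height = fcluster(Z, t=1.0, criterion="distance")

print(labels_by_count)
print(labels_by_height)
```

`criterion="maxclust"` is convenient when the number of clusters should match the `k` used for K-Means, while `criterion="distance"` mirrors drawing a horizontal line across the dendrogram.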

---

### **Step 4: Evaluate Clustering Results**
1. **Internal Evaluation**:
   - Use metrics such as the **silhouette score** (range -1 to 1; higher values indicate tighter, better-separated clusters) to assess cluster quality:
    

In [None]:
from sklearn.metrics import silhouette_score
score = silhouette_score(tfidf_matrix, labels)
print(f"Silhouette Score: {score}")


2. **External Evaluation** (Optional):
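   - Compare clusters with true labels (if available) using permutation-invariant metrics such as the adjusted Rand index. Plain accuracy is not directly applicable because cluster ids are arbitrary and would first have to be matched to the true classes.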
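
As a minimal sketch of external evaluation, the example below scores a clustering against ground-truth categories. Both `true_labels` (for instance a genre column) and `cluster_labels` are made-up values for illustration; the dataset may not contain such a column.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth categories and cluster assignments.
true_labels = ["drama", "drama", "comedy", "comedy", "horror", "horror"]
cluster_labels = [2, 2, 0, 0, 1, 1]  # same grouping, different label ids

# Both metrics compare partitions, so the arbitrary cluster ids do not matter.
ari = adjusted_rand_score(true_labels, cluster_labels)
nmi = normalized_mutual_info_score(true_labels, cluster_labels)
print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}")
```

Because the two partitions here are identical up to renaming, both scores reach their maximum of 1; on real data, values closer to 0 indicate agreement no better than chance.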

---

### **Output**
- For **K-Means**:
  - Cluster assignments for each document.
  - SSE plot to explain the optimal number of clusters.
- For **Hierarchical Clustering**:
  - Dendrograms visualizing the clustering process.
- Discussion on the differences in clusters produced by the two methods.
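
To support the discussion of differences between the two methods, one option is to cross-tabulate their labels and measure agreement with the adjusted Rand index. This is a sketch on toy blobs standing in for the dense TF-IDF matrix; `k = 3` and the average linkage are assumptions chosen for the toy data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: three separated blobs stand in for the dense TF-IDF matrix.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in centers])

k = 3
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
hier_labels = fcluster(linkage(X, method="average"), t=k, criterion="maxclust")

# Contingency table: rows are K-Means clusters, columns are hierarchical clusters.
print(pd.crosstab(kmeans_labels, hier_labels,
                  rownames=["kmeans"], colnames=["hierarchical"]))

# ARI ignores label ids: 1.0 means identical partitions, values near 0 mean chance-level agreement.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Agreement (ARI): {agreement:.3f}")
```

Off-diagonal mass in the contingency table shows which documents the two methods group differently, which is a useful starting point for the written comparison.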