#### K-Means Clustering
##### K-means clustering is a popular unsupervised machine learning algorithm that groups data into distinct clusters. It partitions a dataset into 𝑘 clusters so that each data point belongs to the cluster with the nearest mean (the cluster centroid). It is widely used for customer segmentation, image compression, and anomaly detection.
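
To make the "nearest mean" idea concrete, here is a minimal NumPy sketch of the usual iteration (often called Lloyd's algorithm): assign each point to its closest centroid, then recompute each centroid as the mean of its points. The function name `kmeans_sketch`, its parameters, and the toy data are made up for illustration. scikit-learn's `KMeans`, used in the cells below, adds smarter initialization and other refinements that this sketch leaves out.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Illustrative k-means loop: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of assigned points (keep the old one if a cluster is empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable, so the loop has converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made-up data, not from the review dataset)
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary: which blob is labelled 0 and which is labelled 1 depends on the random starting centroids.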

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [5]:


# Load your dataset
reviews_data = pd.read_csv("D:/Associate - Junior DS Assessment/Junior (A - L2) Data Science/Data/final_ds_nlp/modified_final_file.csv")
# Assuming 'translated_content' is the column with review texts
reviews_data = reviews_data.dropna(subset=['translated_content'])

# Preprocess the text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(reviews_data['translated_content'])

# Apply K-means clustering
num_clusters = 10  # Choose the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X)

# Assign clusters to reviews
reviews_data['cluster'] = kmeans.labels_

# Analyze the clusters
# For each cluster, print the 10 terms with the highest TF-IDF weight in its centroid.
# argsort() sorts ascending, so the strongest term is the last one in each list.
feature_names = vectorizer.get_feature_names_out()
for i in range(num_clusters):
    cluster_center = kmeans.cluster_centers_[i]
    top_features = [feature_names[j] for j in cluster_center.argsort()[-10:]]
    print(f"Cluster {i} top features: {top_features}")


Cluster 0 top features: ['google', 'prominent', 'hotel', 'god', 'best', 'good', 'clean', 'amazing', 'place', 'beautiful']
Cluster 1 top features: ['renovation', 'services', 'cleanliness', 'maintenance', 'place', 'care', 'beautiful', 'development', 'attention', 'needs']
Cluster 2 top features: ['wonderful', 'good', 'room', 'hotel', 'excellent', 'sea', 'place', 'nice', 'beautiful', 'view']
Cluster 3 top features: ['tidy', 'beach', 'hotel', 'amazing', 'services', 'park', 'clean', 'place', 'quiet', 'beautiful']
Cluster 4 top features: ['clean', 'thank', 'garden', 'recommend', 'quiet', 'atmosphere', 'park', 'beautiful', 'place', 'wonderful']
Cluster 5 top features: ['organized', 'good', 'experience', 'hotel', 'park', 'quiet', 'family', 'clean', 'place', 'nice']
Cluster 6 top features: ['good', 'services', 'beautiful', 'prices', 'place', 'clean', 'location', 'hotel', 'service', 'excellent']
Cluster 7 top features: ['best', 'amazing', 'clean', 'advise', 'good', 'visiting', 'beautiful', 'place
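
The value `num_clusters = 10` above is a manual choice. One common way to sanity-check it is to compare silhouette scores across several values of k. The sketch below is only an illustration under stated assumptions: it runs on a handful of short sample reviews rather than the real dataset, and the k range of 2 to 5, the `n_init=10` setting, and the variable names are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Small stand-in corpus; replace with reviews_data['translated_content'] for real use
sample_reviews = [
    'The service was excellent and the staff was very friendly.',
    'The food was terrible and the restaurant was dirty.',
    'I loved the ambiance and the food was great!',
    'The staff was rude and the place was too noisy.',
    'A wonderful experience with delicious food and great service.',
    'I did not like the food and the service was poor.',
    'The restaurant had a great atmosphere and the food was okay.',
    'The service was slow and the food was average.',
]
X_sample = TfidfVectorizer(stop_words='english').fit_transform(sample_reviews)

# Fit one model per candidate k and record its mean silhouette score
scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = model.fit_predict(X_sample)
    scores[k] = silhouette_score(X_sample, cluster_labels)
    print(f"k={k}: silhouette={scores[k]:.3f}, inertia={model.inertia_:.3f}")

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette score: {best_k}")
```

On a corpus this small the scores are noisy, so treat the result as a demonstration of the procedure rather than a recommendation for the full dataset.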

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Sample review data
data = {
    'review': [
        'The service was excellent and the staff was very friendly.',
        'The food was terrible and the restaurant was dirty.',
        'I loved the ambiance and the food was great!',
        'The staff was rude and the place was too noisy.',
        'A wonderful experience with delicious food and great service.',
        'I did not like the food and the service was poor.',
        'The restaurant had a great atmosphere and the food was okay.',
        'The service was slow and the food was average.'
    ]
}

df = pd.DataFrame(data)

# Step 1: Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['review'])

# Step 2: Apply K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=0)  # You can adjust the number of clusters
kmeans.fit(X)

# Step 3: Assign each review to its cluster
# (on the data the model was fitted on, this normally matches kmeans.labels_)
df['cluster'] = kmeans.predict(X)

# Step 4: Analyze the Clusters
for cluster_num in sorted(df['cluster'].unique()):
    print(f"Cluster {cluster_num}:")
    cluster_reviews = df[df['cluster'] == cluster_num]['review']
    for review in cluster_reviews:
        print(f" - {review}")
    print()

# Optional: to plot the clusters, first reduce the TF-IDF matrix to 2D
# (e.g., with PCA or t-SNE), then draw a scatter plot colored by cluster label


Cluster 0:
 - I loved the ambiance and the food was great!
 - A wonderful experience with delicious food and great service.
 - The restaurant had a great atmosphere and the food was okay.

Cluster 1:
 - The service was excellent and the staff was very friendly.
 - The staff was rude and the place was too noisy.

Cluster 2:
 - The food was terrible and the restaurant was dirty.
 - I did not like the food and the service was poor.
 - The service was slow and the food was average.
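
The optional plotting step mentioned in the cell above could look roughly like the sketch below. It swaps in `TruncatedSVD` for the 2D projection because it works directly on the sparse TF-IDF matrix. That choice, the `n_init=10` setting, and the variable names are assumptions for this example, not part of the original cell. Because of `n_init`, the cluster numbering may differ from the output above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    'The service was excellent and the staff was very friendly.',
    'The food was terrible and the restaurant was dirty.',
    'I loved the ambiance and the food was great!',
    'The staff was rude and the place was too noisy.',
    'A wonderful experience with delicious food and great service.',
    'I did not like the food and the service was poor.',
    'The restaurant had a great atmosphere and the food was okay.',
    'The service was slow and the food was average.',
]

# Vectorize and cluster as in the cell above
X = TfidfVectorizer(stop_words='english').fit_transform(reviews)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project the sparse TF-IDF vectors down to two dimensions for plotting
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Scatter plot with one color per cluster and the review index as a point label
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='viridis', s=60)
for idx, (x, y) in enumerate(coords):
    ax.annotate(str(idx), (x, y), textcoords='offset points', xytext=(5, 5))
ax.set_xlabel('SVD component 1')
ax.set_ylabel('SVD component 2')
ax.set_title('Review clusters (2D projection of TF-IDF vectors)')
fig.tight_layout()
```

A two-component projection keeps only part of the variance in the TF-IDF space, so clusters that overlap in the plot may still be well separated in the original feature space.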

