# Yelp Category Normalization
This step extracts the category string from the Yelp business data and groups the businesses with K-means clustering, producing a shorter list of categories that is easier to work with.



In [1]:
import pandas as pd
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

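# Load the business data from the JSON Lines file (one JSON object per line)
data = []
# Read as UTF-8 explicitly so the result does not depend on the platform default
with open('../data/raw/yelp_academic_dataset_business.json', 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))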

df = pd.DataFrame(data)
# Drop businesses without categories, as we will be clustering based on these
df = df.dropna(subset=['categories'])


In [2]:
# Use TF-IDF to turn the category strings into numerical features
# max_features keeps only the 1000 most frequent terms so the matrix stays manageable
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df['categories'])

## Actual Clustering

Now we can run K-means to assign each business to a cluster and find the cluster centroids.

In [3]:
k_val = 10 # Number of categories we want to end with
model = KMeans(n_clusters=k_val, init='k-means++', max_iter=200, n_init=10)
df['cluster'] = model.fit_predict(X)

# Print the first two category strings in each cluster as examples of what the clusters contain
print("Cluster Examples")
for i in range(k_val):
    print(f"\nCLUSTER {i}:")
    samples = df[df['cluster'] == i]['categories'].head(2).tolist()
    
    for j, sample in enumerate(samples):
        print(f"  Sample {j+1}: {sample}")

Cluster Examples

CLUSTER 0:
  Sample 1: Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek
  Sample 2: Sushi Bars, Restaurants, Japanese

CLUSTER 1:
  Sample 1: Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services
  Sample 2: Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories

CLUSTER 2:
  Sample 1: Hair Salons, Hair Extensions, Beauty & Spas, Wigs, Shopping
  Sample 2: Health & Medical, Personal Care Services, Beauty & Spas, Massage, Nail Salons

CLUSTER 3:
  Sample 1: Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists
  Sample 2: General Dentistry, Dentists, Health & Medical, Cosmetic Dentists

CLUSTER 4:
  Sample 1: Automotive, Car Rental, Hotels & Travel, Truck Rental
  Sample 2: Event Planning & Services, Hotels, Hotels & Travel

CLUSTER 5:
  Sample 1: Korean, Restaurants
  Sample 2: Steakhouses, Asian Fusion, Restaurants

CLUSTER 6:
  Sample 1: Restaura
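
K-means starts from randomly chosen centroids, so the cluster numbers and memberships shown above can differ from run to run. The sketch below, on made-up data, illustrates one common way to make a run repeatable: passing a fixed `random_state` to `KMeans`. The seed value and the `toy_*` / `fit_labels` names are assumptions for illustration, not part of the original cell.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical category strings: three food-related, three health-related
toy_categories = [
    "Korean, Restaurants",
    "Pizza, Italian, Restaurants",
    "Sushi Bars, Japanese, Restaurants",
    "Dentists, Health & Medical",
    "Doctors, Health & Medical",
    "Acupuncture, Health & Medical",
]
toy_X = TfidfVectorizer().fit_transform(toy_categories)


def fit_labels(seed):
    """Fit K-means with a fixed seed and return the cluster label of each row."""
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=seed)
    return km.fit_predict(toy_X).tolist()


labels_a = fit_labels(42)
labels_b = fit_labels(42)

# With the same data and the same seed, both fits assign identical labels
print(labels_a == labels_b)
```

Without a fixed seed the groupings are often similar, but the integer assigned to each cluster is arbitrary, so any code that refers to a cluster by number should not assume it stays the same across runs.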

## Name Clusters
Next, we use the highest-weighted terms in each centroid to come up with a suitable name for that cluster.

In [4]:
print("Top terms per cluster:")
# For each centroid, sort the term indices from highest to lowest weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

cluster_counts = df['cluster'].value_counts().sort_index()

cluster_names = {}
for i in range(k_val):
    # Get the top 3 most significant words for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :3]]
    cluster_names[i] = " & ".join(top_words)
    print(f"Cluster {i}, {cluster_counts.get(i, 0)} items, category: {cluster_names[i]}")

# Map names back to the dataframe in a new 'normal_category' field
df['normal_category'] = df['cluster'].map(cluster_names)

Top terms per cluster:
Cluster 0, 11294 items, category: bars & nightlife & restaurants
Cluster 1, 32736 items, category: shopping & services & pet
Cluster 2, 11438 items, category: spas & beauty & salons
Cluster 3, 10846 items, category: medical & health & dentists
Cluster 4, 8785 items, category: event & hotels & planning
Cluster 5, 18238 items, category: restaurants & pizza & italian
Cluster 6, 23859 items, category: food & restaurants & tea
Cluster 7, 8881 items, category: automotive & auto & repair
Cluster 8, 16588 items, category: home & services & estate
Cluster 9, 7578 items, category: american & traditional & restaurants
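
The naming logic can be checked by hand on a tiny example. The sketch below uses a made-up vocabulary and made-up centroid weights (all `toy_*` names and values are hypothetical) and applies the same `argsort()[:, ::-1]` pattern to pick the two highest-weighted terms per centroid.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights: one row per cluster,
# one column per term
toy_terms = np.array(["auto", "bars", "nightlife", "repair", "restaurants"])
toy_centers = np.array([
    [0.0, 0.6, 0.5, 0.0, 0.3],
    [0.7, 0.0, 0.0, 0.4, 0.1],
])

# argsort sorts each row ascending; reversing the columns gives descending order
toy_order = toy_centers.argsort()[:, ::-1]

# Join the two highest-weighted terms of each centroid into a name
toy_names = {
    i: " & ".join(toy_terms[toy_order[i, :2]])
    for i in range(len(toy_centers))
}
print(toy_names)  # {0: 'bars & nightlife', 1: 'auto & repair'}
```

Names built this way describe the dominant terms of a centroid, so two clusters can end up with overlapping words, as "restaurants" does in the output above.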


## Return Processed Data

We can now write the businesses, together with their new categories, as JSON to our processed folder for further integration.

In [5]:
outputLoc = '../data/processed/yelp_clustered_categories.json'

# Save each row as a separate JSON object on its own line, matching the original Yelp format
df.to_json(outputLoc, orient='records', lines=True)

print(f'Export complete at: {outputLoc}')

Export complete at: ../data/processed/yelp_clustered_categories.json
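
A quick way to sanity-check an export like this is to read it back. The sketch below round-trips a small made-up dataframe through a temporary file (the `toy_*` names, the file name, and the rows are hypothetical), confirming that `lines=True` writes one JSON object per line and that `pd.read_json(..., lines=True)` restores the rows.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for the clustered business data
toy_df = pd.DataFrame({
    "name": ["Toy Diner", "Toy Dental"],
    "categories": ["Korean, Restaurants", "Dentists, Health & Medical"],
    "cluster": [5, 3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_clustered.json")
    toy_df.to_json(toy_path, orient='records', lines=True)

    # Count the non-empty lines: there should be one per dataframe row
    with open(toy_path, encoding='utf-8') as f:
        line_count = sum(1 for line in f if line.strip())

    # Read the JSON Lines file back into a dataframe
    reloaded = pd.read_json(toy_path, lines=True)

print(line_count, reloaded.shape)
```

The same `pd.read_json(path, lines=True)` call should also work for loading the processed file in a later notebook, assuming it fits in memory.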
