# Yelp Category Normalization
This step extracts the category string from each business in the Yelp data and groups the businesses with K-means clustering, producing a shorter list of categories that is easier to work with.

The resulting cluster name is then stored in a new field on each JSON record, 'normal_category', for use in later steps.


In [2]:
import pandas as pd
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 1. Load Data from JSON
data = []
with open('../data/processed/yelp_cleaned.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)
# Drop businesses without categories, as we will be clustering based on these
df = df.dropna(subset=['categories'])
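
The line-by-line loop above can also be expressed with pandas' own JSON Lines reader. The following is a minimal sketch on an in-memory sample rather than the real file; the three records and the `sample_df` name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the same one-object-per-line layout as the Yelp file
sample_text = "\n".join([
    '{"name": "A", "categories": "pizza, restaurants"}',
    '{"name": "B", "categories": null}',
    '{"name": "C", "categories": "automotive, car wash"}',
])

# lines=True tells pandas to parse each line as a separate JSON object
sample_df = pd.read_json(io.StringIO(sample_text), lines=True)

# Drop records with no categories, as in the cell above
sample_df = sample_df.dropna(subset=["categories"])
print(sample_df)
```

With a real file, the path would be passed in place of the `StringIO` object.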


## TF-IDF Usage

We use TF-IDF (term frequency-inverse document frequency) to turn each category string into a numerical vector. Each word's weight combines how often it appears in a string with how rare it is across all strings, so the strings can be clustered just as numerical values would be.

In [3]:
# Use TF-IDF to turn the category strings into numerical features;
# max_features caps the vocabulary to keep the matrix manageable
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df['categories'])

## Actual Clustering

Now we can run K-means on the TF-IDF matrix to find the centroids and assign each business to a cluster.

In [4]:
k_val = 10 # Number of categories we want to end with
model = KMeans(n_clusters=k_val, init='k-means++', max_iter=500, n_init=10)
df['cluster'] = model.fit_predict(X)

# Print the first two businesses in each cluster as examples of what each cluster captured
print("Cluster Examples")
for i in range(k_val):
    print(f"\nCLUSTER {i}:")
    samples = df[df['cluster'] == i]['categories'].head(2).tolist()
    
    for j, sample in enumerate(samples):
        print(f"  Sample {j+1}: {sample}")

Cluster Examples

CLUSTER 0:
  Sample 1: fashion, shopping, department stores, furniture stores, candle stores, home & garden, home decor
  Sample 2: shopping, auction houses, active life

CLUSTER 1:
  Sample 1: health & medical, beauty & spas, laser hair removal, doctors, hair removal, chiropractors, weight loss centers, sports medicine, medical spas, skin care
  Sample 2: ophthalmologists, eyewear & opticians, health & medical, shopping, doctors, optometrists

CLUSTER 2:
  Sample 1: korean, restaurants
  Sample 2: restaurants, italian

CLUSTER 3:
  Sample 1: eatertainment, arts & entertainment, brewpubs, american (traditional), bakeries, breweries, food, restaurants
  Sample 2: food, grocery

CLUSTER 4:
  Sample 1: pizza, restaurants, salad, soup
  Sample 2: pizza, restaurants

CLUSTER 5:
  Sample 1: dui law, professional services, lawyers, criminal defense law
  Sample 2: keys & locksmiths, home services, local services

CLUSTER 6:
  Sample 1: car wash, auto detailing, automotive
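
K-means starts from random centroids, so cluster numbers (and which businesses land in which cluster) can change between runs unless `random_state` is fixed. The sketch below, on a small made-up list of category strings, shows one way to make the assignment repeatable and to compare a few values of k with the silhouette score. The toy data, the seed of 42, and the range of k are assumptions, not settings used in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical category strings, not taken from the dataset
toy_categories = [
    "pizza, restaurants, italian",
    "restaurants, korean",
    "restaurants, chinese",
    "pizza, restaurants, salad",
    "automotive, auto repair",
    "car wash, auto detailing, automotive",
    "automotive, tires, auto repair",
    "beauty & spas, hair salons",
    "nail salons, beauty & spas",
    "beauty & spas, skin care",
]
toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_categories)


def fit_labels(k, seed=42):
    """Cluster the toy matrix into k groups with a fixed seed."""
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    return model.fit_predict(toy_X)


# Two fits with the same seed should give the same labels
labels_a = fit_labels(3)
labels_b = fit_labels(3)
print((labels_a == labels_b).all())

# Silhouette score for a few candidate values of k (higher is better)
scores = {k: silhouette_score(toy_X, fit_labels(k)) for k in range(2, 6)}
for k, score in scores.items():
    print(k, round(score, 3))
```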
  

## Name Clusters
Next, we use the highest-weighted terms in each centroid to come up with a suitable name for that cluster.

In [5]:
print("Top terms per cluster:")
# For each centroid, sort the term indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

cluster_counts = df['cluster'].value_counts().sort_index()

cluster_names = {}
for i in range(k_val):
    # Get the top 3 most significant words for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :3]]
    cluster_names[i] = " & ".join(top_words)
    print(f"Cluster {i}, {cluster_counts.get(i, 0)} items, category: {cluster_names[i]}")

# Map names back to the dataframe in a new 'normal_category' field
df['normal_category'] = df['cluster'].map(cluster_names)

Top terms per cluster:
Cluster 0, 2434 items, category: shopping & stores & fashion
Cluster 1, 1673 items, category: medical & health & dentists
Cluster 2, 3471 items, category: restaurants & chinese & american
Cluster 3, 3031 items, category: food & restaurants & grocery
Cluster 4, 1054 items, category: pizza & restaurants & italian
Cluster 5, 5615 items, category: services & home & local
Cluster 6, 1187 items, category: automotive & auto & repair
Cluster 7, 1219 items, category: tea & coffee & food
Cluster 8, 1845 items, category: spas & beauty & salons
Cluster 9, 1870 items, category: bars & nightlife & restaurants
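
The `argsort()[:, ::-1]` step used for naming can be checked on a tiny hand-made centroid matrix. This is only an illustration; the weights and the three terms below are invented.

```python
import numpy as np

# Hypothetical centroids: 2 clusters x 3 terms
toy_terms = np.array(["auto", "pizza", "repair"])
toy_centers = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.0, 0.3],
])

# argsort sorts ascending along each row; reversing the columns gives descending order
toy_order = toy_centers.argsort()[:, ::-1]

# Join the two highest-weighted terms of each centroid into a name
toy_names = {
    i: " & ".join(toy_terms[j] for j in toy_order[i, :2])
    for i in range(len(toy_centers))
}
print(toy_names)
```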


## Return Processed Data

We can now write the records, with their new categories, back to JSON in our processed folder for further integration.

In [6]:
outputLoc = '../data/processed/yelp_clustered_categories.json'

# Save each row as a separate JSON object on its own line (JSON Lines), matching the original Yelp format
df.to_json(outputLoc, orient='records', lines=True)

print(f'Export complete at: {outputLoc}')

Export complete at: ../data/processed/yelp_clustered_categories.json
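
As a sanity check on the export format, a file written with `orient='records', lines=True` can be read back with `pd.read_json(..., lines=True)`. The sketch below does this round trip on a made-up two-row frame in a temporary directory; the records and file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records with the new field attached
sample_df = pd.DataFrame({
    "name": ["Pizza Place", "Auto Shop"],
    "categories": ["pizza, restaurants", "automotive, auto repair"],
    "normal_category": ["pizza & restaurants & italian", "automotive & auto & repair"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    # One JSON object per line, as in the export cell above
    sample_df.to_json(path, orient="records", lines=True)
    reloaded = pd.read_json(path, lines=True)

print(list(reloaded.columns))
print(len(reloaded) == len(sample_df))
```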
