<a href="https://colab.research.google.com/github/KhanhHa26/restaurant-embedding-map/blob/main/Restaurant_Embedding_Map.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading Dataset

I picked the **Yelp Dataset** from **Kaggle**, which contains data about businesses. After loading the dataset, I filtered it down to restaurants only, and then combined the **"name", "categories", "attributes", and "city"** fields into a new column called "text" so that I could compute Sentence-BERT embeddings later on 🙂

In [None]:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Choose a file inside the dataset
file_path = "yelp_academic_dataset_business.json"

df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "yelp-dataset/yelp-dataset",
    file_path,
    pandas_kwargs={"lines": True}
)

print(df.head())
print(df.shape)


In [None]:
restaurants = df[df["categories"].str.contains("Restaurants", na=False)].copy()

In [None]:
print(restaurants.shape)
restaurants.head()

In [None]:
# Reset the index to 0, 1, 2, ... so row labels match row positions (later lookups rely on this)
restaurants.reset_index(drop=True, inplace=True)

In [None]:
import pandas as pd
def combine_text(row):
    text = ""

    if pd.notna(row["name"]):
        text += row["name"] + " "

    if pd.notna(row["categories"]):
        text += row["categories"] + " "

    if pd.notna(row["attributes"]):
        text += str(row["attributes"]) + " "

    if pd.notna(row["city"]):
        text += str(row["city"]) + " "

    return text.strip()

In [None]:
restaurants["text"] = restaurants.apply(combine_text, axis=1)

In [None]:
restaurants["text"]

Next, we will use **Sentence-BERT embeddings**. We picked this over TF-IDF, another common way to turn text into vectors, because TF-IDF doesn't take context into account. TF-IDF only considers word frequencies and treats each word as independent of the others, so two descriptions that express the same idea with different words look unrelated. Sentence-BERT instead maps each text to a dense vector, and texts with similar meaning end up close to each other (thanks to the underlying BERT-style model).
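To make the TF-IDF limitation concrete, here is a minimal sketch using scikit-learn. The two descriptions are made up for illustration; they mean roughly the same thing but share no words, so TF-IDF gives them a cosine similarity of zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical descriptions with similar meaning but no shared words
texts = [
    "cheap sushi bar",
    "affordable japanese restaurant",
]

tfidf = TfidfVectorizer().fit_transform(texts)
tfidf_similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# No overlapping words, so TF-IDF treats the two texts as unrelated
print(tfidf_similarity)
```

A sentence embedding model would be expected to place these two texts much closer together, which is the behavior we want for a restaurant similarity map.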

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import os

def compute_or_load_embeddings(texts):
    if os.path.exists("embeddings.npy"):
        print("Loading saved embeddings...")
        return np.load("embeddings.npy")

    print("Computing embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, batch_size=16, show_progress_bar=True)

    np.save("embeddings.npy", emb)
    return emb

# Pass a plain list of strings so encode() does not depend on pandas indexing
embeddings = compute_or_load_embeddings(restaurants["text"].tolist())

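After this, we will use nearest neighbors to find the top k restaurants that are closest to each restaurant. Initially, the plan was to compute the full **cosine similarity** matrix, but our dataset is fairly large (about 50k restaurants) and we can't afford to keep an N×N matrix in memory. A nearest-neighbors search with the cosine metric gives the same ranking while only keeping the k closest matches per restaurant.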
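A quick back-of-the-envelope calculation shows why the full matrix is a problem. This sketch assumes about 50,000 restaurants and 4-byte (float32) values; both numbers are approximations for illustration.

```python
# Rough memory estimate for a dense N x N similarity matrix
# versus keeping only the top-k distances per restaurant.
n_restaurants = 50_000   # approximate dataset size (assumption)
bytes_per_value = 4      # float32 (assumption)
k = 10

full_matrix_gb = n_restaurants * n_restaurants * bytes_per_value / 1e9
top_k_mb = n_restaurants * k * bytes_per_value / 1e6

print(f"Full N x N matrix: {full_matrix_gb:.1f} GB")   # 10.0 GB
print(f"Top-{k} distances only: {top_k_mb:.1f} MB")     # 2.0 MB
```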

In [None]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

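k = 10  # find the top 10 similar restaurants

nn = NearestNeighbors(metric='cosine', algorithm='auto')
nn.fit(embeddings)

# indices[i]   → row positions of the restaurants closest to restaurant i
# distances[i] → their cosine distances (1 - cosine similarity)
# A restaurant is normally its own nearest neighbor (distance 0), so ask for
# k + 1 neighbors; column 0 can then be dropped to keep k real matches.
distances, indices = nn.kneighbors(embeddings, n_neighbors=k + 1)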

In [None]:
distances

In [None]:
indices
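One detail worth checking in this output: when the query points are the same points the index was fitted on, each point normally comes back as its own closest neighbor. The sketch below uses random stand-in vectors (not the real embeddings) to show this.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-in vectors (hypothetical data, not the real embeddings)
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 8))

toy_nn = NearestNeighbors(metric="cosine").fit(toy_embeddings)
toy_distances, toy_indices = toy_nn.kneighbors(toy_embeddings, n_neighbors=3)

# Column 0 is each point matched with itself
self_is_first = bool((toy_indices[:, 0] == np.arange(len(toy_embeddings))).all())
print(self_is_first)
print(toy_distances[:, 0].max())  # effectively zero
```

With real data, exact duplicates (for example two rows with identical text) can tie at distance 0, so the first column is not guaranteed to be the query row itself.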

In [None]:
!pip install umap-learn

# Plotting

Now we can start plotting. Points that cluster together on the map represent restaurants whose embeddings are similar to each other.

**UMAP** is a non-linear dimensionality reduction technique that is commonly used for data visualization. I will be using UMAP instead of t-SNE because it is typically faster and tends to scale better to a dataset of this size.

First, we will find the main category of each restaurant by taking the first category in the 'categories' column. Second, we will transform those main categories into numbers so that each category can correspond to one color when graphing.

In [None]:
import umap

# Initialize UMAP with desired parameters (n_components for target dimensions)
# metric='cosine' to use cosine distance
reducer = umap.UMAP(n_components=2, metric='cosine', random_state=42) # random_state for reproducibility

# Fit and transform the data
X_umap = reducer.fit_transform(embeddings)

In [None]:
# Convert category strings into a single primary category
def get_primary_category(cat_string):
    if not isinstance(cat_string, str):
        return "Other"
    return cat_string.split(",")[0].strip()  # use the first category

restaurants["main_category"] = restaurants["categories"].apply(get_primary_category)


In [None]:
#Convert these categories into integer IDs
restaurants["category_id"] = restaurants["main_category"].astype('category').cat.codes
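As a small illustration of these two steps, here is a sketch on a made-up DataFrame. The category strings and the `first_category` helper are hypothetical stand-ins that mirror the logic above; note that pandas assigns the integer codes in sorted order of the category names.

```python
import pandas as pd

# Hypothetical category strings in the same comma-separated format
toy = pd.DataFrame({
    "categories": [
        "Pizza, Italian, Restaurants",
        "Sushi Bars, Japanese",
        "Pizza, Restaurants",
        None,
    ],
})

def first_category(cat_string):
    # Missing values fall back to "Other", as in get_primary_category
    if not isinstance(cat_string, str):
        return "Other"
    return cat_string.split(",")[0].strip()

toy["main_category"] = toy["categories"].apply(first_category)
toy["category_id"] = toy["main_category"].astype("category").cat.codes

# Sorted categories are Other=0, Pizza=1, Sushi Bars=2
print(toy[["main_category", "category_id"]])
```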

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.scatter(
    X_umap[:,0],
    X_umap[:,1],
    c=restaurants["category_id"],
    cmap="tab20",
    s=4,
    alpha=0.7
)
plt.title("Restaurant Embedding Map Colored by Category")
plt.show()

Then, we can make the plot interactive with Plotly. In the graph below, each restaurant category corresponds to one color. The user can zoom in and out; clicking a category in the legend hides or shows it, and double-clicking a category shows only the restaurants in that category.

In [None]:
import plotly.express as px

fig = px.scatter(restaurants, x=X_umap[:, 0], y=X_umap[:, 1], color='main_category', hover_data=['main_category'])
config = {'scrollZoom': True}
# fig.show(config=config)

Great, the graph looks good! Now, I want to make this even more interactive by letting the user enter a restaurant name and then showing the top 10 restaurants that are most similar to it.

In [None]:
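def find_restaurant(name, df):
  name = name.lower()
  # regex=False matches the name literally (so characters like "(" or "+" are safe);
  # na=False skips rows with a missing name
  matches = df[df["name"].str.lower().str.contains(name, regex=False, na=False)]

  if len(matches) == 0:
    print("There is no restaurant with the given name. Please try again!")
    return None

  # return the index of the first matched restaurant
  return matches.index[0]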

In [None]:
def get_10_similar_restaurants(match, embeddings, df, nn):
  idx = find_restaurant(match, df)
  if idx is None:
      return None
  # kneighbors defaults to 5 neighbors, so ask for 11 explicitly:
  # the query restaurant itself plus its 10 closest matches
  distances, indices = nn.kneighbors([embeddings[idx]], n_neighbors=11)

  # skip the first neighbor because it's the restaurant itself
  similar_idx = indices[0][1:]
  similar_distances = distances[0][1:]

  results = df.iloc[similar_idx].copy()
  # cosine similarity = 1 - cosine distance, shown as a percentage
  results["similarity (%)"] = (1 - similar_distances) * 100
  return df.loc[idx, "name"], results
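The "similarity (%)" column relies on the fact that cosine distance is defined as 1 minus cosine similarity. Here is a minimal sketch of that conversion on hand-picked 2D vectors; the helper names are hypothetical and only meant for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def to_similarity_pct(distance):
    # same conversion as the "similarity (%)" column
    return (1 - distance) * 100

query = np.array([1.0, 0.0])
pct_same = to_similarity_pct(cosine_distance(query, np.array([2.0, 0.0])))
pct_orthogonal = to_similarity_pct(cosine_distance(query, np.array([0.0, 3.0])))
pct_opposite = to_similarity_pct(cosine_distance(query, np.array([-1.0, 0.0])))

print(pct_same, pct_orthogonal, pct_opposite)  # 100.0 0.0 -100.0
```

Because cosine distance ranges from 0 to 2, the percentage can in principle be negative for vectors pointing in opposite directions, although close neighbors in an embedding space usually score well above zero.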

In [None]:
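query = "Vietnamese Food Truck"   # user input
found = get_10_similar_restaurants(query, embeddings, restaurants, nn)

# get_10_similar_restaurants returns None when no restaurant matches the query
if found is not None:
    matched_name, results = found  # avoid shadowing the built-in input()
    print("🔍 Input restaurant:", matched_name)
    print("\n🍽 Top 10 similar restaurants:")
    print(results)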