# Recommendation System for Restaurants

## Documentation
In this notebook, we developed a semantic tag affinity map using restaurant data from the Yelp dataset. The core idea is to leverage a large, diverse dataset to learn how certain categories (tags) of restaurants tend to appear together, and thus infer similarity between them.

This affinity map is later used to enhance the recommendation engine by expanding the set of user preferences or restaurant tags with semantically similar alternatives.

### Objective

The goal is to generate a dictionary where each restaurant category (e.g., `"vegetarian"`) maps to a list of the most semantically similar categories based on co-occurrence patterns across the Yelp dataset.

Example output (illustrative; the actual neighbors depend on the data and on how the categories are tokenized):

```json
{
  "vegetarian": ["vegan", "salad", "juice bars", "organic", "healthy"],
  ...
}
```

### Methodology

#### Step 1: Data Preprocessing

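We extracted the `categories` field from the Yelp business data, which contains tags like "Mexican", "Fast Food", and "Vegetarian". Each restaurant may have multiple comma-separated categories.

These strings were cleaned (lowercased, with commas replaced by spaces) to prepare for vectorization. Because the vectorizer then tokenizes on word boundaries, multi-word categories such as "Fast Food" are split into the separate terms `fast` and `food`, while hyphenated words are kept whole.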
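As a minimal sketch of this cleaning step, the snippet below processes one hypothetical row and approximates the tokenization with the same regular expression that is passed to the vectorizer later in this notebook. The helper name `clean_categories` and the sample row are assumptions for illustration only.

```python
import re

# Same token pattern that this notebook passes to TfidfVectorizer.
TOKEN_PATTERN = r"(?u)\b[\w\-]+\b"


def clean_categories(raw: str) -> str:
    """Lowercase a raw category string and replace commas with spaces."""
    return raw.lower().replace(",", " ")


# Hypothetical row, not taken from the Yelp data.
raw = "Mexican, Fast Food, Gluten-Free"
cleaned = clean_categories(raw)

# Approximate the vectorizer's tokenization with a plain regex search.
tokens = re.findall(TOKEN_PATTERN, cleaned)
print(tokens)
```

Note that "Fast Food" ends up as two separate tokens, whereas "gluten-free" survives as a single token because the pattern allows hyphens inside a word.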

#### Step 2: TF-IDF Vectorization

We applied **TF-IDF (Term Frequency – Inverse Document Frequency)** vectorization on the cleaned categories. TF-IDF is a classic method in natural language processing that weighs terms by how frequently they appear in a document (here, a restaurant's tag list) versus how common they are across the entire corpus.

TF-IDF was chosen for several reasons:

- **Simplicity and interpretability**: It is easy to understand and debug.
- **Efficiency**: Lightweight and fast to compute, especially on structured short texts like tags.
- **Noise resistance**: TF-IDF reduces the impact of overly common tags (like "Fast Food") while emphasizing distinctive ones (like "Vegan", "Ethiopian").

The vectorizer produces a sparse restaurants × terms matrix. Transposing it represents each tag as a high-dimensional sparse vector with one dimension per restaurant.
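To make the weighting concrete, the sketch below computes inverse document frequency by hand on a hypothetical four-restaurant corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the documented scikit-learn default; the corpus and variable names are illustrative and not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Hypothetical tokenized tag lists, one per restaurant.
docs = [
    ["fast", "food", "burgers"],
    ["fast", "food", "mexican"],
    ["fast", "food", "vegan"],
    ["vegan", "ethiopian"],
]
n_docs = len(docs)

# Document frequency: number of restaurants in which each term appears.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

# Print terms from most common (lowest weight) to rarest (highest weight).
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} df={df[term]} idf={idf[term]:.3f}")
```

Terms that appear in most restaurants, such as `fast`, receive a lower weight than rare ones such as `ethiopian`, which is the noise-resistance property described above.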

#### Step 3: Cosine Similarity

We then computed the **cosine similarity** between all tag vectors. Cosine similarity is the cosine of the angle between two vectors, so it depends only on their direction and not on their magnitude, which makes it well suited to comparing TF-IDF vectors.

It tells us: *"how similar are the contexts in which two tags appear?"*

Tags that frequently appear together across many restaurants (e.g., "vegetarian" and "vegan") will have a high cosine similarity.
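The following sketch illustrates the idea with plain Python on hypothetical tag vectors, where each position holds a tag's weight in one restaurant. The weights and the helper `cosine` are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weights across four restaurants (one entry per restaurant).
vegetarian = [0.8, 0.6, 0.0, 0.7]
vegan = [0.7, 0.5, 0.0, 0.0]
burgers = [0.0, 0.0, 0.9, 0.0]

print(round(cosine(vegetarian, vegan), 3))    # tags that share restaurants
print(round(cosine(vegetarian, burgers), 3))  # tags that never co-occur

# Scaling a vector does not change the result: only direction matters.
scaled = [2 * x for x in vegetarian]
print(math.isclose(cosine(scaled, vegan), cosine(vegetarian, vegan)))
```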

#### Step 4: Top-k Similar Tags

For each tag, we selected the top-k most similar tags (k = 5) based on cosine similarity scores, excluding the tag itself. This creates the final tag affinity map, saved as a JSON file.
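The selection step can be sketched as follows for a single tag. The function name `top_k_similar`, the term list, and the scores are hypothetical; the sketch excludes the tag's own index explicitly rather than assuming it always ranks first.

```python
def top_k_similar(scores, terms, self_index, k=5):
    """Return the k terms with the highest scores, skipping the term at self_index."""
    candidates = (j for j in range(len(terms)) if j != self_index)
    ranked = sorted(candidates, key=lambda j: scores[j], reverse=True)
    return [terms[j] for j in ranked[:k]]


terms = ["vegetarian", "vegan", "salad", "burgers"]
# Hypothetical similarities of "vegetarian" to every term, including itself.
scores = [1.0, 0.82, 0.40, 0.05]

print(top_k_similar(scores, terms, self_index=0, k=2))
```

Excluding the index explicitly is a small design choice: if two tags have identical vectors, both reach the maximum similarity and a plain sort gives no guarantee about which one comes first.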

### Justification of the Chosen Methods

TF-IDF combined with cosine similarity is a well-established approach for tasks involving structured, categorical text like restaurant tags. Here's why it was particularly well-suited for this use case:

- **No training required**: Unlike neural embeddings, there's no need to pre-train a model.
- **Generalizable**: The same pipeline applies to other tagged datasets and domains; only the vectorizer needs to be refit on the new data.
- **Explainable recommendations**: You can clearly trace why tags are considered similar, which is crucial in user-facing systems.
- **Efficient at scale**: Handles thousands of tags and restaurants with low computational cost.

While more complex methods like Word2Vec or BERT could be used, they introduce training or inference overhead and make similarity scores harder to trace back to the data. Given the structured nature of the tags and the clear co-occurrence relationships, TF-IDF with cosine similarity offers a good balance of performance and interpretability.

### Output

The resulting JSON dictionary (`tag_affinity.json`) is used in the recommendation engine to enrich user preference profiles or restaurant metadata with semantically similar tags, so that a preference for one tag can also match restaurants labeled with closely related tags.
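How the recommendation engine consumes the map is not shown in this notebook, so the sketch below is only one plausible way to expand a preference list. The function `expand_tags`, its `per_tag` parameter, and the sample affinity entries are all hypothetical.

```python
def expand_tags(tags, affinity, per_tag=2):
    """Append up to per_tag related tags for each input tag, preserving order and skipping duplicates."""
    expanded = list(tags)
    seen = set(tags)
    for tag in tags:
        for related in affinity.get(tag, [])[:per_tag]:
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded


# Hypothetical excerpt of an affinity map, shaped like tag_affinity.json.
affinity = {
    "vegetarian": ["vegan", "salad", "organic"],
    "mexican": ["tacos", "tex-mex"],
}

print(expand_tags(["vegetarian", "mexican"], affinity))
```

In practice the map would be loaded with `json.load` from `tag_affinity.json`; tags missing from the map are simply left unexpanded.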


## Implementation

### Import Libraries

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pandas_gbq import read_gbq
import json

### Load restaurant dataset from BigQuery

In [None]:
project_id = "campusbites-72033"
query = """
SELECT
  categories
FROM
  `campusbites-72033.recommendation_system.business_yelp`
WHERE
  categories IS NOT NULL
"""
df = read_gbq(query, project_id=project_id)

### Vectorize the categories

In [None]:
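# clean up the categories: lowercase and replace commas with spaces
# (multi-word categories such as "fast food" become separate word tokens)
df['categories'] = df['categories'].str.lower().str.replace(',', ' ', regex=False)

# vectorize the categories with TF-IDF; the token pattern keeps hyphenated words whole
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b[\w\-]+\b")
# sparse matrix of shape (n_restaurants, n_terms)
tfidf_matrix = vectorizer.fit_transform(df['categories'])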

### Calculate the Cosine similarity

In [None]:
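# calculate cosine similarity between term vectors
# (transpose so that each row is a term and each column a restaurant)
terms = vectorizer.get_feature_names_out()
similarity_matrix = cosine_similarity(tfidf_matrix.T)

# affinity map: each term -> its top_k most similar terms
tag_affinity = {}
top_k = 5
for i, tag in enumerate(terms):
    sim_scores = similarity_matrix[i]
    # rank by descending similarity and drop the term itself explicitly,
    # since ties could otherwise push it out of first position
    ranked = [j for j in sim_scores.argsort()[::-1] if j != i]
    related_tags = [terms[j] for j in ranked[:top_k]]
    tag_affinity[tag] = related_tags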

### Save the affinity map to a JSON file

In [None]:
with open("tag_affinity.json", "w") as f:
    json.dump(tag_affinity, f, indent=2)

# sample output
print("Sample of affinity for tag 'vegetarian'")
print(tag_affinity.get("vegetarian", []))
