# Zomato Bangalore Restaurants: 05 - NLP Feature Engineering

**Author:** Puneet Kumar Mishra
**Date:** 14-09-2025

## 1. Objective

This notebook is dedicated to extracting valuable, predictive features from the unstructured text data in our dataset. The primary goal is to convert the raw text from `reviews_list`, `menu_item`, `cuisines`, and `dish_liked` into meaningful numerical signals that our machine learning model can understand.

This is a critical phase where we move beyond structured data and into the nuanced world of customer sentiment and restaurant offerings.

### Key Features to Engineer:

1.  **Simple Count-Based Features:** We will start by creating simple but powerful features like `review_count` and `menu_item_count`.
2.  **Sentiment Analysis:** We will analyze the sentiment of all reviews for each restaurant to create an `avg_sentiment_score`. This has the potential to be a very strong predictor of the overall `rate`.
3.  **Advanced Text Features (TF-IDF):** For high-cardinality text like `dish_liked` and `menu_item`, we will use TF-IDF vectorization to identify "signature" dishes or menu themes that are characteristic of high or low-rated restaurants.

The final output will be a new dataset containing these engineered NLP features, ready to be merged with our main tabular and geo datasets for the final modeling stage.

In [1]:
# --- 1. CORE LIBRARIES ---
import os
import sys
import warnings

# In your main setup cell, replace the old NLTK section with this:
import nltk
import numpy as np

# --- 2. DATA HANDLING & ANALYSIS ---
import pandas as pd
from nltk.corpus import stopwords

# --- 3. NATURAL LANGUAGE PROCESSING (NLP) ---
from textblob import TextBlob  # For easy sentiment analysis

# This will now work because you've manually downloaded the data.
STOPWORDS = set(stopwords.words("english"))

# --- 4. UTILITIES ---
from loguru import logger
from tqdm.auto import tqdm

tqdm.pandas()

# ===================================================================
#                      CONFIGURATION
# ===================================================================
# (Your standard, excellent configuration settings)
pd.set_option("display.max_columns", None)
# ... etc. ...

logger.remove()
logger.add(
    sys.stdout,
    colorize=True,
    format=(
        "<green>{time:YYYY-MM-DD HH:mm:ss}</green> | "
        "<level>{level: <8}</level> | "
        "<level>{message}</level>"
    ),
)

logger.info("✅ All libraries imported and configurations set successfully!")

# --- Load the NLP Dataset ---
DATA_PATH = "../data/processed/zomato_nlp.parquet"
try:
    df_nlp = pd.read_parquet(DATA_PATH)
    logger.success(f"Successfully loaded the NLP dataset from '{DATA_PATH}'.")
    logger.info(f"DataFrame shape: {df_nlp.shape}")
except FileNotFoundError:
    logger.error(
        f"FATAL: The file was not found at '{DATA_PATH}'. Please ensure the path is correct."
    )

df_nlp.head()

[32m2025-09-16 20:15:24[0m | [1mINFO    [0m | [1m✅ All libraries imported and configurations set successfully![0m
[32m2025-09-16 20:15:26[0m | [32m[1mSUCCESS [0m | [32m[1mSuccessfully loaded the NLP dataset from '../data/processed/zomato_nlp.parquet'.[0m
[32m2025-09-16 20:15:26[0m | [1mINFO    [0m | [1mDataFrame shape: (45187, 7)[0m


Unnamed: 0,name,address,rate,reviews_list,menu_item,cuisines,dish_liked
0,Jalsa,"942, 21st Main Road, 2nd Stage, Banashankari, ...",4.1,"[[Rated 2.0, RATED\n Its a restaurant near to...",[Unknown],"[Chinese, Mughlai, North Indian]","[Dum Biryani, Lunch Buffet, Masala Papad, Pane..."
1,Spice Elephant,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",4.1,"[[Rated 2.0, RATED\n I had a very bad experie...",[Unknown],"[Chinese, North Indian, Thai]","[Chicken Biryani, Chocolate Nirvana, Dum Birya..."
2,San Churro Cafe,"1112, Next to KIMS Medical College, 17th Cross...",3.8,"[[Rated 1.0, RATED\n Cockroaches !! I Repeat ...",[Unknown],"[Cafe, Italian, Mexican]","[Cannelloni, Churros, Hot Chocolate, Minestron..."
3,Addhuri Udupi Bhojana,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",3.7,"[[Rated 1.5, RATED\n The food was not satisfa...",[Unknown],"[North Indian, South Indian]",[Masala Dosa]
4,Grand Village,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",3.8,"[[Rated 4.0, RATED\n Great service, overwhelm...",[Unknown],"[North Indian, Rajasthani]","[Gol Gappe, Panipuri]"


---
## 2. The Grand NLP Feature Engineering Plan

This notebook is a comprehensive exploration of Natural Language Processing techniques, applied to the Zomato review dataset. Guided by the principles of structured experimentation, we will systematically build a rich set of features, progressing from fundamental text statistics to state-of-the-art deep learning models.

Our workflow is divided into three major phases, inspired by the provided NLP mind map:

---

### **Phase 1: Foundational Text Analysis & Preprocessing**

*Goal: To create a clean, standardized text corpus and extract basic, yet powerful, statistical features.*

1.  **Text Aggregation & Cleaning:**
    *   **Action:** Combine all English reviews for each restaurant into a single, unified text document.
    *   **Action:** Create a "god-level" text preprocessing function that will lowercase text, remove punctuation, numbers, and URLs, and handle extra whitespace.
2.  **Lexical Features (Readability & Complexity):**
    *   **Action:** Engineer features based on the raw text, such as `total_review_length`, `avg_word_length`, and `readability_score` (e.g., Flesch-Kincaid). This tests the hypothesis that the *style* of reviews correlates with the rating.
3.  **Tokenization & Stopword Removal:**
    *   **Action:** Convert the cleaned text into a list of individual words (tokens).
    *   **Action:** Remove common English stopwords (e.g., "the", "a", "is") to reduce noise and focus on meaningful words.

---

### **Phase 2: Classic NLP Feature Extraction**

*Goal: To apply traditional, rule-based, and statistical NLP methods to extract sentiment and topic-based features.*

1.  **Sentiment Analysis (The "Vibe" Score):**
    *   **Action:** We will scientifically compare three different sentiment analysis techniques on the cleaned text:
        1.  **TextBlob:** A simple, fast baseline.
        2.  **VADER:** A lexicon and rule-based engine optimized for social media and review text.
        3.  **Result:** Create `textblob_sentiment` and `vader_sentiment` features.
2.  **Keyword & N-gram Analysis (The "Bag-of-Words" Approach):**
    *   **Action:** We will use **TF-IDF (Term Frequency-Inverse Document Frequency)** on the tokenized text. This will convert our text into a numerical matrix where each column represents a word, and the value represents its importance to that restaurant's reviews.
    *   **Feature Creation:** We will use dimensionality reduction techniques (like SVD or NMF) on the TF-IDF matrix to distill it into a few powerful, high-level "topic" features (e.g., `topic_food_quality`, `topic_service`, `topic_ambiance`).

---

### **Phase 3: State-of-the-Art Deep Learning (The Transformer Era)**

*Goal: To leverage a pre-trained, multilingual deep learning model to capture the deepest contextual understanding of the text, including "Hinglish" and other nuances.*

1.  **Contextual Sentiment Analysis:**
    *   **Action:** We will use a powerful, pre-trained Transformer model (like `cardiffnlp/twitter-xlm-roberta-base-sentiment`) from the Hugging Face library.
    *   **Key Advantage:** This model understands context, negation, and multilingual text far better than the classic methods. It will be run on the **original, unfiltered review text** to maximize its power.
    *   **Result:** Create a `transformer_sentiment` feature, which will likely be our most powerful sentiment-based signal.

By the end of this notebook, we will have engineered a wide array of NLP features, from simple counts to complex topic and sentiment scores. This will provide our final model with an incredibly rich understanding of the customer experience described in the reviews.

In [2]:
import numpy as np
from loguru import logger
from tqdm.auto import tqdm

tqdm.pandas()


def aggregate_reviews_and_create_lexical_features(
    df: pd.DataFrame, review_col: str = "reviews_list"
) -> pd.DataFrame:
    """
    Aggregates review texts into a single document per restaurant and
    creates initial lexical (count-based) features.
    """
    logger.info("--- Starting Text Aggregation & Lexical Feature Engineering ---")
    df_out = df.copy()

    def get_all_review_texts(review_array):
        if len(review_array) > 0:
            return [
                review[1]
                for review in review_array
                if len(review) == 2 and isinstance(review[1], str)
            ]
        return []

    logger.info("Aggregating all review texts for each restaurant...")
    df_out["full_review_text"] = df_out[review_col].progress_apply(
        lambda arr: " ".join(get_all_review_texts(arr))
    )

    logger.info("Engineering lexical features...")
    df_out["review_count"] = df_out[review_col].apply(len)
    df_out["total_review_length"] = df_out["full_review_text"].str.len()

    # Use a temporary series to avoid chained assignment warnings
    avg_word_length_series = (
        df_out["full_review_text"]
        .str.split()
        .apply(
            lambda tokens: np.mean([len(token) for token in tokens]) if tokens else 0
        )
    )

    # --- THE FIX IS HERE ---
    # Assign the result of fillna back to the column instead of using inplace=True
    df_out["avg_word_length"] = avg_word_length_series.fillna(0)
    # --- END OF FIX ---

    logger.success("Aggregation and lexical feature creation complete.")
    return df_out


# --- Execute the first step ---
df_nlp_features = aggregate_reviews_and_create_lexical_features(df_nlp)

# --- Verification ---
display(
    df_nlp_features[
        [
            "name",
            "review_count",
            "total_review_length",
            "avg_word_length",
            "full_review_text",
        ]
    ].head()
)

[32m2025-09-16 20:15:26[0m | [1mINFO    [0m | [1m--- Starting Text Aggregation & Lexical Feature Engineering ---[0m
[32m2025-09-16 20:15:26[0m | [1mINFO    [0m | [1mAggregating all review texts for each restaurant...[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-16 20:15:27[0m | [1mINFO    [0m | [1mEngineering lexical features...[0m
[32m2025-09-16 20:15:35[0m | [32m[1mSUCCESS [0m | [32m[1mAggregation and lexical feature creation complete.[0m


Unnamed: 0,name,review_count,total_review_length,avg_word_length,full_review_text
0,Jalsa,10,2906,4.662083,RATED\n Its a restaurant near to Banashankari...
1,Spice Elephant,14,4958,4.493304,RATED\n I had a very bad experience here.\nI ...
2,San Churro Cafe,20,6993,4.519108,RATED\n Cockroaches !! I Repeat cockroaches!!...
3,Addhuri Udupi Bhojana,23,7708,4.539797,RATED\n The food was not satisfactory. Not on...
4,Grand Village,2,651,5.252427,"RATED\n Great service, overwhelming experienc..."


**Result:** The initial feature creation is a success. We have successfully aggregated all review texts into a new `full_review_text` column. Additionally, we have created three new numerical features: `review_count`, `total_review_length`, and `avg_word_length`. These will serve as our first set of NLP-derived predictors. The next step is to perform advanced preprocessing on the `full_review_text` to prepare it for sentiment analysis and topic modeling.

### 3.1. Advanced Text Preprocessing

The `full_review_text` column contains raw, unstructured text. To prepare it for more advanced NLP tasks, we need to clean and standardize it. The following function creates a comprehensive preprocessing pipeline that will:

1.  **Lowercase** all text for consistency.
2.  Remove the recurring **"RATED\n"** prefix.
3.  Use regular expressions to **remove all punctuation, numbers, and special characters**, leaving only letters and spaces.
4.  **Tokenize** the text by splitting it into individual words.
5.  Remove common English **stopwords** (e.g., "the", "a", "is") which add little semantic value.
6.  Perform **lemmatization**, which intelligently reduces words to their root form (e.g., "running," "ran," and "runs" all become "run"). This is more advanced than simple stemming.
7.  Join the cleaned tokens back into a final, processed string.

This will create a new `processed_text` column, which will be the ideal input for our subsequent NLP models.

In [3]:
import re

import nltk
import pandas as pd
from loguru import logger
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm.auto import tqdm

tqdm.pandas()

# (Assume NLTK data is already downloaded)


def create_specialized_text_corpora(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """
    Creates two specialized text columns for different NLP tasks:
    1. A lightly cleaned version for sentiment analysis.
    2. A heavily cleaned version for topic modeling.
    """
    logger.info(f"--- Creating Specialized Text Corpora from '{text_col}' ---")
    df_out = df.copy()

    # --- Tools for Heavy Cleaning ---
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    non_alpha_pattern = re.compile(r"[^a-z\s]")

    # --- 1. Create 'text_for_sentiment' (Light Cleaning) ---
    def light_clean(text):
        if not isinstance(text, str):
            return ""
        # Just lowercase and remove the "RATED" prefix. Keep everything else!
        text = text.lower().replace("rated\n", " ").strip()
        return text

    logger.info("Creating 'text_for_sentiment' (lightly cleaned)...")
    df_out["text_for_sentiment"] = df_out[text_col].progress_apply(light_clean)

    # --- 2. Create 'text_for_topics' (Heavy Cleaning) ---
    def heavy_clean(text):
        if not isinstance(text, str):
            return ""
        text = text.lower().replace("rated\n", " ").strip()
        text = non_alpha_pattern.sub(" ", text)  # Remove punctuation, numbers, emojis
        tokens = text.split()
        lemmatized_tokens = [
            lemmatizer.lemmatize(word)
            for word in tokens
            if word not in stop_words and len(word) > 2
        ]
        return " ".join(lemmatized_tokens)

    logger.info("Creating 'text_for_topics' (heavily cleaned and lemmatized)...")
    df_out["text_for_topics"] = df_out[text_col].progress_apply(heavy_clean)

    logger.success("Specialized text corpora created successfully.")
    return df_out


# --- Execute Preprocessing ---
# df_nlp_features is the output from our first step (aggregation)
df_nlp_processed = create_specialized_text_corpora(
    df_nlp_features, text_col="full_review_text"
)

# --- Verification ---
print("\n--- Verification of Specialized Text Columns ---")
display(
    df_nlp_processed[
        ["name", "full_review_text", "text_for_sentiment", "text_for_topics"]
    ].head()
)

[32m2025-09-16 20:15:36[0m | [1mINFO    [0m | [1m--- Creating Specialized Text Corpora from 'full_review_text' ---[0m
[32m2025-09-16 20:15:36[0m | [1mINFO    [0m | [1mCreating 'text_for_sentiment' (lightly cleaned)...[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-16 20:15:36[0m | [1mINFO    [0m | [1mCreating 'text_for_topics' (heavily cleaned and lemmatized)...[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-16 20:16:25[0m | [32m[1mSUCCESS [0m | [32m[1mSpecialized text corpora created successfully.[0m

--- Verification of Specialized Text Columns ---


Unnamed: 0,name,full_review_text,text_for_sentiment,text_for_topics
0,Jalsa,RATED\n Its a restaurant near to Banashankari...,its a restaurant near to banashankari bda. me ...,restaurant near banashankari bda along office ...
1,Spice Elephant,RATED\n I had a very bad experience here.\nI ...,i had a very bad experience here.\ni don't kno...,bad experience know carte buffet worst gave co...
2,San Churro Cafe,RATED\n Cockroaches !! I Repeat cockroaches!!...,cockroaches !! i repeat cockroaches!!bakasura ...,cockroach repeat cockroach bakasura disappoint...
3,Addhuri Udupi Bhojana,RATED\n The food was not satisfactory. Not on...,the food was not satisfactory. not one item se...,food satisfactory one item served could eaten ...
4,Grand Village,"RATED\n Great service, overwhelming experienc...","great service, overwhelming experience.\n\none...",great service overwhelming experience one kind...


## 4. Phase 2: Classic NLP Feature Extraction

With a clean, preprocessed text corpus, we can now move on to extracting meaningful features. This phase focuses on applying classic, well-established NLP techniques to quantify the subjective aspects of the reviews.

### 4.1. Sentiment Analysis: Quantifying the "Vibe"

Our first and most important task is to quantify the sentiment of the reviews. A numerical sentiment score has the potential to be a very strong predictor of the restaurant's `rate`.

We will implement a function that calculates sentiment using two different popular libraries, allowing us to compare their results:

1.  **TextBlob:** A simple and fast library that provides a **polarity score** ranging from -1 (very negative) to +1 (very positive). It's a great baseline.
2.  **VADER (Valence Aware Dictionary and sEntiment Reasoner):** A more advanced, rule-based sentiment analysis tool specifically tuned for social media and review text. It's better at handling negation (e.g., "not good"), emphasis (e.g., "SOOO GOOD!!"), and slang. It provides positive, negative, neutral, and a final combined **compound score** (also from -1 to +1).

This function will take our `processed_text` and create new numerical features for each sentiment score.

In [4]:
import pandas as pd
from loguru import logger
from textblob import TextBlob
from tqdm.auto import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tqdm.pandas()


def generate_sentiment_features(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """
    Calculates sentiment scores using both TextBlob and VADER.
    """
    logger.info(f"--- Generating Sentiment Features for '{text_col}' ---")
    df_out = df.copy()

    # --- Initialize VADER once for efficiency ---
    vader_analyzer = SentimentIntensityAnalyzer()

    # --- TextBlob Polarity ---
    logger.info("Calculating TextBlob polarity scores...")
    df_out["sentiment_textblob"] = df_out[text_col].progress_apply(
        lambda text: TextBlob(text).sentiment.polarity
    )
    logger.success("TextBlob sentiment calculated.")

    # --- VADER Compound Score ---
    logger.info("Calculating VADER compound scores...")
    df_out["sentiment_vader"] = df_out[text_col].progress_apply(
        lambda text: vader_analyzer.polarity_scores(text)["compound"]
    )
    logger.success("VADER sentiment calculated.")

    logger.success("All sentiment features created successfully.")
    return df_out


# --- NOW, run sentiment analysis on the CORRECT column ---
# (The sentiment analysis function code remains the same)
df_nlp_sentiments = generate_sentiment_features(
    df_nlp_processed, text_col="text_for_sentiment"
)

# --- Verification of Sentiment ---
print("\n--- Verification of Sentiment Features ---")
display(
    df_nlp_sentiments[["name", "rate", "sentiment_textblob", "sentiment_vader"]].head()
)

[32m2025-09-16 20:16:25[0m | [1mINFO    [0m | [1m--- Generating Sentiment Features for 'text_for_sentiment' ---[0m
[32m2025-09-16 20:16:25[0m | [1mINFO    [0m | [1mCalculating TextBlob polarity scores...[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-16 20:18:04[0m | [32m[1mSUCCESS [0m | [32m[1mTextBlob sentiment calculated.[0m
[32m2025-09-16 20:18:04[0m | [1mINFO    [0m | [1mCalculating VADER compound scores...[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-16 20:54:30[0m | [32m[1mSUCCESS [0m | [32m[1mVADER sentiment calculated.[0m
[32m2025-09-16 20:54:30[0m | [32m[1mSUCCESS [0m | [32m[1mAll sentiment features created successfully.[0m

--- Verification of Sentiment Features ---


Unnamed: 0,name,rate,sentiment_textblob,sentiment_vader
0,Jalsa,4.1,0.34506,0.9996
1,Spice Elephant,4.1,0.195731,0.9996
2,San Churro Cafe,3.8,0.166162,0.9997
3,Addhuri Udupi Bhojana,3.7,0.309284,0.9998
4,Grand Village,3.8,0.447883,0.9856


### 4.2. Keyword & Topic Analysis with TF-IDF

While sentiment scores tell us if reviews are positive or negative, they don't tell us *why*. Are customers happy about the `food`, the `ambiance`, or the `service`? To answer this, we will use a powerful technique called **TF-IDF**.

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It identifies words that are:
1.  **Frequent** within a single restaurant's reviews (high Term Frequency).
2.  **Rare** across all other restaurants' reviews (high Inverse Document Frequency).

This allows us to automatically discover the most important and characteristic keywords for each restaurant. For example, the word "biryani" might be common everywhere, but "wood-fired" might be a highly important, characteristic term for a specific set of pizza places.

Our plan is to:
1.  Apply a `TfidfVectorizer` to our `processed_text` corpus.
2.  This will create a massive matrix where rows are restaurants and columns are words.
3.  We will then use this matrix to understand the key topics in the reviews.

In [5]:
import numpy as np  # Make sure numpy is imported
from loguru import logger
from sklearn.feature_extraction.text import TfidfVectorizer


def build_tfidf_matrix(df: pd.DataFrame, text_col: str):
    """
    Builds a TF-IDF matrix from a text column and returns the vectorizer
    and the resulting matrix.
    """
    logger.info(f"--- Building TF-IDF Matrix from '{text_col}' ---")

    if text_col not in df.columns:
        logger.error(
            f"FATAL: Column '{text_col}' not found in the DataFrame. Cannot build TF-IDF matrix."
        )
        return None, None

    # Initialize the TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(
        max_df=0.95,
        min_df=5,
        ngram_range=(1, 2),
        max_features=3000,
        stop_words="english",  # More robust than our manual list for this tool
    )

    logger.info("Fitting the TfidfVectorizer to the corpus...")
    tfidf_matrix = tfidf_vectorizer.fit_transform(df[text_col])

    logger.success("TF-IDF matrix built successfully.")
    logger.info(f"Matrix Shape: {tfidf_matrix.shape} (Restaurants x Features/Terms)")

    logger.info(
        "Displaying a sample of the 20 most important features (terms) learned:"
    )
    feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
    tfidf_scores = tfidf_matrix.sum(axis=0).tolist()[0]
    df_tfidf_scores = pd.DataFrame({"term": feature_names, "score": tfidf_scores})
    display(df_tfidf_scores.sort_values(by="score", ascending=False).head(20))

    return tfidf_vectorizer, tfidf_matrix


# --- Execute TF-IDF (with the CORRECT column name) ---

# <<--- THE FIX IS HERE --- >>
# We must use the 'text_for_topics' column, which was specifically created for this task.
tfidf_vectorizer, tfidf_matrix = build_tfidf_matrix(
    df_nlp_sentiments, text_col="text_for_topics"
)
# --- END OF FIX ---

[32m2025-09-16 20:54:30[0m | [1mINFO    [0m | [1m--- Building TF-IDF Matrix from 'text_for_topics' ---[0m
[32m2025-09-16 20:54:30[0m | [1mINFO    [0m | [1mFitting the TfidfVectorizer to the corpus...[0m
[32m2025-09-16 20:54:51[0m | [32m[1mSUCCESS [0m | [32m[1mTF-IDF matrix built successfully.[0m
[32m2025-09-16 20:54:51[0m | [1mINFO    [0m | [1mMatrix Shape: (45187, 3000) (Restaurants x Features/Terms)[0m
[32m2025-09-16 20:54:51[0m | [1mINFO    [0m | [1mDisplaying a sample of the 20 most important features (terms) learned:[0m


Unnamed: 0,term,score
909,food,4557.806792
1069,good,4293.974173
1951,place,4244.043175
403,chicken,2740.858407
2386,service,2006.092024
2648,taste,2000.145988
1792,ordered,1890.284472
237,biryani,1625.688771
60,ambience,1604.463496
2718,time,1514.436362


### 4.3. Topic Modeling with Latent Semantic Analysis (LSA)

We have successfully created a TF-IDF matrix, which represents the importance of 3,000 key terms for each restaurant. However, adding 3,000 new features to our model is computationally impractical.

Our next step is to use **Dimensionality Reduction** to distill this vast matrix into a small number of high-level **"topics"**. We will use a technique called **Latent Semantic Analysis (LSA)**, which is implemented using **Truncated SVD (Singular Value Decomposition)**.

LSA will analyze the co-occurrence patterns of words in our TF-IDF matrix and automatically group them into a specified number of topics. For each restaurant, we will then get a score for each of these topics, effectively creating powerful new features like `topic_food_quality_score` or `topic_ambiance_service_score`. This allows us to capture the core themes of the reviews in a very compact and model-friendly format.

In [6]:
from loguru import logger
from sklearn.decomposition import TruncatedSVD


def extract_topics_with_lsa(tfidf_matrix, tfidf_vectorizer, n_topics: int = 10):
    """
    Applies Latent Semantic Analysis (LSA) via TruncatedSVD to a TF-IDF matrix
    to discover latent topics.

    Args:
        tfidf_matrix: The sparse matrix from the TfidfVectorizer.
        tfidf_vectorizer: The fitted TfidfVectorizer instance.
        n_topics (int): The number of topics to extract.

    Returns:
        pd.DataFrame: A DataFrame where rows are restaurants and columns are topic scores.
    """
    logger.info(f"--- Extracting {n_topics} Topics using LSA (TruncatedSVD) ---")

    # --- 1. Apply TruncatedSVD ---
    # We use SVD to reduce the dimensionality of our 3000-column matrix
    lsa = TruncatedSVD(n_components=n_topics, random_state=42)
    topic_matrix = lsa.fit_transform(tfidf_matrix)

    logger.success("LSA model fitted and data transformed successfully.")

    # --- 2. Analyze the Topics ---
    # Let's inspect what words define each topic
    terms = tfidf_vectorizer.get_feature_names_out()

    logger.info("--- Top 10 Terms per Discovered Topic ---")
    for i, comp in enumerate(lsa.components_):
        # Sort the terms by their weight in the current topic
        terms_in_comp = zip(terms, comp)
        sorted_terms = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
        top_terms = [t[0] for t in sorted_terms]
        print(f"Topic {i}: {', '.join(top_terms)}")

    # --- 3. Create the Final DataFrame ---
    # Create column names for our new features
    topic_col_names = [f"topic_{i}" for i in range(n_topics)]
    df_topics = pd.DataFrame(topic_matrix, columns=topic_col_names)

    return df_topics


# --- Execute LSA ---
# tfidf_matrix and tfidf_vectorizer are from our previous step
# Let's extract 10 topics as a starting point
df_topic_features = extract_topics_with_lsa(tfidf_matrix, tfidf_vectorizer, n_topics=10)

# --- Merge the new topic features back into our main NLP dataframe ---
# First, we need to reset the index of our main df to ensure a clean merge
df_nlp_final = df_nlp_sentiments.reset_index(drop=True)
df_nlp_final = pd.concat([df_nlp_final, df_topic_features], axis=1)

logger.success("Successfully merged new topic features into the main NLP DataFrame.")

# --- Verification ---
print("\n--- Verification of New Topic Features ---")
display(df_nlp_final[["name", "rate", "topic_0", "topic_1", "topic_2"]].head())

[32m2025-09-16 20:54:51[0m | [1mINFO    [0m | [1m--- Extracting 10 Topics using LSA (TruncatedSVD) ---[0m
[32m2025-09-16 20:54:52[0m | [32m[1mSUCCESS [0m | [32m[1mLSA model fitted and data transformed successfully.[0m
[32m2025-09-16 20:54:52[0m | [1mINFO    [0m | [1m--- Top 10 Terms per Discovered Topic ---[0m
Topic 0: food, good, place, chicken, service, ambience, taste, ordered, biryani, great
Topic 1: biryani, chicken, biriyani, rice, food, mutton, delivery, restaurant, chicken biryani, taste
Topic 2: cake, cream, ice cream, ice, chocolate, biryani, waffle, taste, order, ordered
Topic 3: pizza, chicken, biryani, burger, beer, pasta, biriyani, drink, ambience, music
Topic 4: cake, pizza, pastry, order, birthday, cupcake, bakery, delivery, ordered, burger
Topic 5: pizza, burger, delivery, order, ordered, sandwich, cheese, pasta, shake, taste
Topic 6: biryani, dosa, coffee, place, breakfast, tea, masala dosa, biriyani, cafe, south
Topic 7: pizza, biryani, buffet, do

Unnamed: 0,name,rate,topic_0,topic_1,topic_2
0,Jalsa,4.1,0.482831,-0.053563,-0.142821
1,Spice Elephant,4.1,0.602828,-0.031769,-0.063036
2,San Churro Cafe,3.8,0.331411,-0.179015,0.001051
3,Addhuri Udupi Bhojana,3.7,0.541989,0.015496,-0.120429
4,Grand Village,3.8,0.220451,-0.022821,-0.090174


In [7]:
df_nlp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          45187 non-null  object 
 1   address       45187 non-null  object 
 2   rate          45187 non-null  float64
 3   reviews_list  45187 non-null  object 
 4   menu_item     45187 non-null  object 
 5   cuisines      45187 non-null  object 
 6   dish_liked    45187 non-null  object 
dtypes: float64(1), object(6)
memory usage: 2.4+ MB


In [8]:
df_nlp_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 45187 non-null  object 
 1   address              45187 non-null  object 
 2   rate                 45187 non-null  float64
 3   reviews_list         45187 non-null  object 
 4   menu_item            45187 non-null  object 
 5   cuisines             45187 non-null  object 
 6   dish_liked           45187 non-null  object 
 7   full_review_text     45187 non-null  object 
 8   review_count         45187 non-null  int64  
 9   total_review_length  45187 non-null  int64  
 10  avg_word_length      45187 non-null  float64
dtypes: float64(2), int64(2), object(7)
memory usage: 3.8+ MB


In [9]:
df_nlp_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 45187 non-null  object 
 1   address              45187 non-null  object 
 2   rate                 45187 non-null  float64
 3   reviews_list         45187 non-null  object 
 4   menu_item            45187 non-null  object 
 5   cuisines             45187 non-null  object 
 6   dish_liked           45187 non-null  object 
 7   full_review_text     45187 non-null  object 
 8   review_count         45187 non-null  int64  
 9   total_review_length  45187 non-null  int64  
 10  avg_word_length      45187 non-null  float64
 11  text_for_sentiment   45187 non-null  object 
 12  text_for_topics      45187 non-null  object 
 13  sentiment_textblob   45187 non-null  float64
 14  sentiment_vader      45187 non-null  float64
 15  topic_0              45187 non-null 

In [10]:
df_nlp_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 45187 non-null  object 
 1   address              45187 non-null  object 
 2   rate                 45187 non-null  float64
 3   reviews_list         45187 non-null  object 
 4   menu_item            45187 non-null  object 
 5   cuisines             45187 non-null  object 
 6   dish_liked           45187 non-null  object 
 7   full_review_text     45187 non-null  object 
 8   review_count         45187 non-null  int64  
 9   total_review_length  45187 non-null  int64  
 10  avg_word_length      45187 non-null  float64
 11  text_for_sentiment   45187 non-null  object 
 12  text_for_topics      45187 non-null  object 
dtypes: float64(2), int64(2), object(9)
memory usage: 4.5+ MB


In [11]:
df_nlp_sentiments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 45187 non-null  object 
 1   address              45187 non-null  object 
 2   rate                 45187 non-null  float64
 3   reviews_list         45187 non-null  object 
 4   menu_item            45187 non-null  object 
 5   cuisines             45187 non-null  object 
 6   dish_liked           45187 non-null  object 
 7   full_review_text     45187 non-null  object 
 8   review_count         45187 non-null  int64  
 9   total_review_length  45187 non-null  int64  
 10  avg_word_length      45187 non-null  float64
 11  text_for_sentiment   45187 non-null  object 
 12  text_for_topics      45187 non-null  object 
 13  sentiment_textblob   45187 non-null  float64
 14  sentiment_vader      45187 non-null  float64
dtypes: float64(4), int64(2), object(9)
m

In [12]:
df_topic_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   topic_0  45187 non-null  float64
 1   topic_1  45187 non-null  float64
 2   topic_2  45187 non-null  float64
 3   topic_3  45187 non-null  float64
 4   topic_4  45187 non-null  float64
 5   topic_5  45187 non-null  float64
 6   topic_6  45187 non-null  float64
 7   topic_7  45187 non-null  float64
 8   topic_8  45187 non-null  float64
 9   topic_9  45187 non-null  float64
dtypes: float64(10)
memory usage: 3.4 MB


## 5. Phase 3: State-of-the-Art Deep Learning with Transformers

Having extracted features using classic NLP techniques, we now advance to the cutting edge. Our dataset contains a mix of languages ("Hinglish") and informal text that can confuse simpler models like TextBlob and VADER.

To overcome this, we will deploy a **pre-trained, multilingual Transformer model**. Specifically, we will use `cardiffnlp/twitter-xlm-roberta-base-sentiment`, a powerful model from the Hugging Face Hub that is fine-tuned on multilingual social media text.

**Key Advantages of this Approach:**
1.  **Contextual Understanding:** Unlike bag-of-words models, Transformers understand the order of words and the context in which they appear. It can differentiate between "good food" and "not good food."
2.  **Multilingual & Code-Mixed Capability:** This model was trained on diverse, real-world text and can effectively process the "Hinglish" and other languages present in our reviews.
3.  **No Data Filtering:** We can run this model on our **original, unfiltered review text**, ensuring we extract a signal from 100% of our available data, maximizing its value.

This process will be computationally intensive, but it will yield our premium sentiment feature: `transformer_sentiment`.

In [72]:
from transformers import pipeline


def generate_transformer_sentiment(
    df: pd.DataFrame, review_col: str = "reviews_list"
) -> pd.DataFrame:
    """
    The final, robust version for calculating Transformer sentiment using the online model.
    """
    logger.info("--- 3. Generating State-of-the-Art Transformer Sentiment ---")
    df_out = df.copy()

    try:
        sentiment_pipeline = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
            device=0,  # Use GPU
        )
        logger.success("Multilingual sentiment model initialized successfully ON GPU.")
    except Exception as e:
        logger.error(
            f"Failed to initialize on GPU. Error: {e}. Falling back to CPU (will be slow)."
        )
        sentiment_pipeline = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
            device=-1,
        )

    def get_avg_sentiment(review_array):
        if not hasattr(review_array, "__len__") or len(review_array) == 0:
            return 0
        review_texts = [
            review[1]
            for review in review_array
            if len(review) == 2 and isinstance(review[1], str)
        ]
        if not review_texts:
            return 0
        sentiments = []
        try:
            # Added batch_size for better GPU performance. Removed trust_remote_code.
            results = sentiment_pipeline(
                review_texts, truncation=True, max_length=512, batch_size=16
            )
            for result in results:
                if result["label"] == "positive":
                    sentiments.append(1 * result["score"])
                elif result["label"] == "negative":
                    sentiments.append(-1 * result["score"])
                else:
                    sentiments.append(0.0)
        except Exception:
            # On error, we don't want to log thousands of times. Silently return neutral.
            return 0
        return np.mean(sentiments) if sentiments else 0

    logger.info(
        "Applying Transformer sentiment analysis... (This is the long one, go grab a coffee)"
    )
    # We run this on the original 'reviews_list' for maximum accuracy
    df_out["transformer_sentiment"] = df_out[review_col].progress_apply(
        get_avg_sentiment
    )

    logger.success("Transformer sentiment analysis complete.")
    return df_out


# --- Execute ---
# This is the final step, we use df_nlp_topics as the input
df_nlp_final = generate_transformer_sentiment(df_nlp_final)

# --- Verification & Export ---
logger.info("--- Final NLP Feature Set ---")
display(
    df_nlp_final[
        ["name", "rate", "sentiment_vader", "topic_0", "transformer_sentiment"]
    ].head()
)

[32m2025-09-17 11:12:03[0m | [1mINFO    [0m | [1m--- 3. Generating State-of-the-Art Transformer Sentiment ---[0m




[32m2025-09-17 11:12:12[0m | [32m[1mSUCCESS [0m | [32m[1mMultilingual sentiment model initialized successfully ON GPU.[0m
[32m2025-09-17 11:12:12[0m | [1mINFO    [0m | [1mApplying Transformer sentiment analysis... (This is the long one, go grab a coffee)[0m


  0%|          | 0/45187 [00:00<?, ?it/s]

[32m2025-09-17 12:16:43[0m | [32m[1mSUCCESS [0m | [32m[1mTransformer sentiment analysis complete.[0m
[32m2025-09-17 12:16:46[0m | [1mINFO    [0m | [1m--- Final NLP Feature Set ---[0m


Unnamed: 0,name,rate,sentiment_vader,topic_0,transformer_sentiment
0,Jalsa,4.1,0.9996,0.482831,0.606136
1,Spice Elephant,4.1,0.9996,0.602828,0.409859
2,San Churro Cafe,3.8,0.9997,0.331411,0.18812
3,Addhuri Udupi Bhojana,3.7,0.9998,0.541989,0.185093
4,Grand Village,3.8,0.9856,0.220451,0.668551


[32m2025-09-17 12:16:46[0m | [1mINFO    [0m | [1mExporting final NLP features to '../data/processed/zomato_nlp_features_final.parquet'...[0m
[32m2025-09-17 12:16:56[0m | [32m[1mSUCCESS [0m | [32m[1mFinal NLP features saved successfully.[0m


In [73]:
df_nlp_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45187 entries, 0 to 45186
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   name                   45187 non-null  object 
 1   address                45187 non-null  object 
 2   rate                   45187 non-null  float64
 3   reviews_list           45187 non-null  object 
 4   menu_item              45187 non-null  object 
 5   cuisines               45187 non-null  object 
 6   dish_liked             45187 non-null  object 
 7   full_review_text       45187 non-null  object 
 8   review_count           45187 non-null  int64  
 9   total_review_length    45187 non-null  int64  
 10  avg_word_length        45187 non-null  float64
 11  text_for_sentiment     45187 non-null  object 
 12  text_for_topics        45187 non-null  object 
 13  sentiment_textblob     45187 non-null  float64
 14  sentiment_vader        45187 non-null  float64
 15  to

## 6. Phase 4: Feature Engineering with Word Embeddings

To handle the high-cardinality, multi-value text columns (`menu_item`, `cuisines`, `dish_liked`), a simple one-hot encoding approach would be impractical. Instead, we have employed a state-of-the-art **representation learning** technique.

### 6.1. Custom Word2Vec Model

We implemented a "titan-level" pipeline to:
1.  **Build a Unified Corpus:** All items from the `menu_item`, `cuisines`, and `dish_liked` columns were combined into a single, massive corpus.
2.  **Train a Custom Word2Vec Model:** A `gensim.Word2Vec` model was trained on this corpus. This allows the model to learn the semantic relationships and context between food items specific to the Bangalore restaurant scene (e.g., learning that "Biryani" is similar to "Mughlai").
3.  **Vectorize Restaurant Profiles:** For each restaurant, we calculated the average vector of all the items in its `menu_item`, `cuisines`, and `dish_liked` lists.

**Result:**
This process has successfully transformed our three most complex text columns into **60 new, dense, and powerful numerical features** (20 vector dimensions for each column). These embedding features provide our model with a deep, nuanced understanding of a restaurant's culinary profile.

In [83]:
import pandas as pd
from gensim.models import Word2Vec
from loguru import logger
from tqdm.auto import tqdm
import numpy as np

def create_food_embeddings_simple_v2(df: pd.DataFrame, 
                                     list_cols: list, 
                                     vector_size: int = 50) -> pd.DataFrame:
    """
    A simplified and direct Word2Vec pipeline that is robust to both
    Python lists and NumPy arrays in the input columns.
    """
    logger.info("--- Starting Simplified Feature Engineering with Custom Food Embeddings (v2) ---")
    df_out = df.copy()
    
    # --- Step 1: Build the Giant Unified Corpus ---
    logger.info("Step 1: Building a unified corpus...")
    corpus = []
    for col in list_cols:
        for item_list in df_out[col]:
            # This check is important to handle different empty types
            if hasattr(item_list, '__len__') and len(item_list) > 0:
                clean_list = [item for item in item_list if str(item).lower() != 'unknown']
                if clean_list:
                    corpus.append(clean_list)
        
    if not corpus:
        logger.error("Corpus is empty. Aborting."); return df_out
        
    logger.info(f"Unified corpus created with {len(corpus):,} total documents.")

    # --- Step 2: Train the custom Word2Vec model ---
    logger.info(f"Step 2: Training a custom Word2Vec model...")
    w2v_model = Word2Vec(sentences=corpus, vector_size=vector_size, window=5, min_count=3, workers=-1, sg=1)
    logger.success("Custom Word2Vec model trained successfully.")

    # --- Step 3: Create a function to vectorize a list of items ---
    def get_average_vector(items_list):
        # --- THE FIX IS HERE ---
        # Instead of 'if not items_list', we explicitly check the length.
        # This works for both Python lists and NumPy arrays.
        if not hasattr(items_list, '__len__') or len(items_list) == 0:
            return np.zeros(vector_size)
        # --- END OF FIX ---
        
        vectors = [w2v_model.wv[item] for item in items_list if item in w2v_model.wv]
        if not vectors: return np.zeros(vector_size)
        return np.mean(vectors, axis=0)

    # --- Step 4: Apply the vectorization to each column ---
    logger.info("Step 3: Vectorizing each column...")
    for col in tqdm(list_cols, desc="Creating Embedding Features"):
        vectors = df_out[col].apply(get_average_vector) # This will now work
        vec_df = pd.DataFrame(vectors.tolist(), index=df_out.index)
        vec_df.columns = [f"{col}_vec_{i}" for i in range(vector_size)]
        df_out = pd.concat([df_out, vec_df], axis=1)

    logger.success("All specified columns have been converted into embedding features.")
    return df_out

# --- EXECUTION ---
# ... (your execution code remains the same) ...
cols_to_embed = ['menu_item', 'cuisines', 'dish_liked']
df_with_embeddings = create_food_embeddings_simple_v2(df_nlp_final, 
                                                     list_cols=cols_to_embed, 
                                                     vector_size=20)

# --- Verification ---
display(df_with_embeddings.head())
print("\nNew shape:", df_with_embeddings.shape)

[32m2025-09-17 19:04:07[0m | [1mINFO    [0m | [1m--- Starting Simplified Feature Engineering with Custom Food Embeddings (v2) ---[0m
[32m2025-09-17 19:04:08[0m | [1mINFO    [0m | [1mStep 1: Building a unified corpus...[0m
[32m2025-09-17 19:04:08[0m | [1mINFO    [0m | [1mUnified corpus created with 34,894 total documents.[0m
[32m2025-09-17 19:04:08[0m | [1mINFO    [0m | [1mStep 2: Training a custom Word2Vec model...[0m
[32m2025-09-17 19:04:09[0m | [32m[1mSUCCESS [0m | [32m[1mCustom Word2Vec model trained successfully.[0m
[32m2025-09-17 19:04:09[0m | [1mINFO    [0m | [1mStep 3: Vectorizing each column...[0m


Creating Embedding Features:   0%|          | 0/3 [00:00<?, ?it/s]

[32m2025-09-17 19:04:12[0m | [32m[1mSUCCESS [0m | [32m[1mAll specified columns have been converted into embedding features.[0m


Unnamed: 0,name,address,rate,reviews_list,menu_item,cuisines,dish_liked,full_review_text,review_count,total_review_length,avg_word_length,text_for_sentiment,text_for_topics,sentiment_textblob,sentiment_vader,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,transformer_sentiment,menu_item_vec_0,menu_item_vec_1,menu_item_vec_2,menu_item_vec_3,menu_item_vec_4,menu_item_vec_5,menu_item_vec_6,menu_item_vec_7,menu_item_vec_8,menu_item_vec_9,menu_item_vec_10,menu_item_vec_11,menu_item_vec_12,menu_item_vec_13,menu_item_vec_14,menu_item_vec_15,menu_item_vec_16,menu_item_vec_17,menu_item_vec_18,menu_item_vec_19,cuisines_vec_0,cuisines_vec_1,cuisines_vec_2,cuisines_vec_3,cuisines_vec_4,cuisines_vec_5,cuisines_vec_6,cuisines_vec_7,cuisines_vec_8,cuisines_vec_9,cuisines_vec_10,cuisines_vec_11,cuisines_vec_12,cuisines_vec_13,cuisines_vec_14,cuisines_vec_15,cuisines_vec_16,cuisines_vec_17,cuisines_vec_18,cuisines_vec_19,dish_liked_vec_0,dish_liked_vec_1,dish_liked_vec_2,dish_liked_vec_3,dish_liked_vec_4,dish_liked_vec_5,dish_liked_vec_6,dish_liked_vec_7,dish_liked_vec_8,dish_liked_vec_9,dish_liked_vec_10,dish_liked_vec_11,dish_liked_vec_12,dish_liked_vec_13,dish_liked_vec_14,dish_liked_vec_15,dish_liked_vec_16,dish_liked_vec_17,dish_liked_vec_18,dish_liked_vec_19
0,Jalsa,"942, 21st Main Road, 2nd Stage, Banashankari, ...",4.1,"[[Rated 2.0, RATED\n Its a restaurant near to...",[Unknown],[],"[Dum Biryani, Lunch Buffet, Masala Papad, Pane...",RATED\n Its a restaurant near to Banashankari...,10,2906,4.662083,its a restaurant near to banashankari bda. me ...,restaurant near banashankari bda along office ...,0.34506,0.9996,0.482831,-0.053563,-0.142821,-0.054958,-0.02147,-0.13186,-0.106416,0.075845,0.018076,-0.025933,0.606136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.000139,-0.014075,0.016368,-0.008531,-0.002681,-0.002833,-0.004693,-0.005675,-0.010905,0.007609,0.004142,0.018541,0.008585,-0.004918,0.002574,-0.000303,-0.004193,-0.007041,0.003208,0.000806
1,Spice Elephant,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",4.1,"[[Rated 2.0, RATED\n I had a very bad experie...",[Unknown],[],"[Chicken Biryani, Chocolate Nirvana, Dum Birya...",RATED\n I had a very bad experience here.\nI ...,14,4958,4.493304,i had a very bad experience here.\ni don't kno...,bad experience know carte buffet worst gave co...,0.195731,0.9996,0.602828,-0.031769,-0.063036,-0.030274,-0.01863,-0.168695,-0.074582,0.049469,0.012818,-0.075545,0.409859,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010823,-0.001898,-0.003,-0.007676,-0.004414,0.001336,0.002587,0.00763,-0.004245,-0.001513,-0.001917,-0.000603,-0.019428,0.005711,0.014509,0.005603,-0.006174,-0.00501,0.01061,-0.016491
2,San Churro Cafe,"1112, Next to KIMS Medical College, 17th Cross...",3.8,"[[Rated 1.0, RATED\n Cockroaches !! I Repeat ...",[Unknown],[],"[Cannelloni, Churros, Hot Chocolate, Minestron...",RATED\n Cockroaches !! I Repeat cockroaches!!...,20,6993,4.519108,cockroaches !! i repeat cockroaches!!bakasura ...,cockroach repeat cockroach bakasura disappoint...,0.166162,0.9997,0.331411,-0.179015,0.001051,0.081526,0.052696,0.054141,0.051919,0.096816,0.014956,-0.050901,0.18812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00337,-0.01384,0.005028,0.00275,-0.019702,-0.003763,-0.012497,0.021039,-0.006153,-0.013886,-0.007933,-0.00246,0.003207,0.006325,0.011423,-0.002227,-0.004967,-0.003433,0.00084,-0.002099
3,Addhuri Udupi Bhojana,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",3.7,"[[Rated 1.5, RATED\n The food was not satisfa...",[Unknown],[],[Masala Dosa],RATED\n The food was not satisfactory. Not on...,23,7708,4.539797,the food was not satisfactory. not one item se...,food satisfactory one item served could eaten ...,0.309284,0.9998,0.541989,0.015496,-0.120429,-0.329167,-0.045043,-0.000278,0.053153,0.119441,0.053072,0.128186,0.185093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023452,-0.001145,-0.044709,-0.004473,-0.001355,0.004533,0.043311,-0.013881,-0.04341,-0.044169,-0.005288,0.006859,0.000436,-0.011015,0.042738,0.000532,0.032093,0.019955,-0.036016,-0.010618
4,Grand Village,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",3.8,"[[Rated 4.0, RATED\n Great service, overwhelm...",[Unknown],[],"[Gol Gappe, Panipuri]","RATED\n Great service, overwhelming experienc...",2,651,5.252427,"great service, overwhelming experience.\n\none...",great service overwhelming experience one kind...,0.447883,0.9856,0.220451,-0.022821,-0.090174,-0.110779,-0.016,-0.052406,-0.03473,0.065407,0.034503,-0.035731,0.668551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026897,-0.01582,0.000905,-0.030901,0.003124,0.001516,0.026496,-0.018324,-0.01446,-0.00278,-0.005007,-0.013286,-0.006045,0.032203,0.00821,-0.019635,0.014608,0.000582,0.042274,-0.017556



New shape: (45187, 86)


In [86]:
# df_with_embeddings is your final DataFrame from the previous step

# --- FINAL EXPORT ---
FINAL_NLP_PATH = "../data/processed/zomato_nlp_features_final.parquet"
logger.info(f"--- Exporting Final NLP Feature Set ({df_with_embeddings.shape}) ---")

try:
    df_with_embeddings.to_parquet(FINAL_NLP_PATH, index=False)
    logger.success(f"Final NLP features saved successfully to '{FINAL_NLP_PATH}'")
except Exception as e:
    logger.error(f"Failed to save final NLP features: {e}")

[32m2025-09-17 20:53:48[0m | [1mINFO    [0m | [1m--- Exporting Final NLP Feature Set ((45187, 86)) ---[0m
[32m2025-09-17 20:53:52[0m | [32m[1mSUCCESS [0m | [32m[1mFinal NLP features saved successfully to '../data/processed/zomato_nlp_features_final.parquet'[0m


## 7. Conclusion & Final NLP Export

This notebook has successfully executed a comprehensive, multi-phase NLP feature engineering pipeline. We have progressed from basic text statistics to classic sentiment analysis, topic modeling, and finally, state-of-the-art word embeddings.

The final, enriched NLP DataFrame, containing our primary keys, target variable, and all newly engineered features, is now ready for export.

**Final Exported File:**
-   `zomato_nlp_features_final.parquet`

This file will serve as a critical input to our final modeling notebook, where these rich features will be merged with our tabular and geospatial data to build the ultimate predictive model.