# Initial Exploration - Text Track

To illustrate the approach and establish a pipeline for initial exploration, this analysis uses a sample of 10,000 data points.

In [1]:
import pandas as pd

file_path = "yelp_sampled_data.csv"
df = pd.read_csv(file_path)

- Data Processing

  **To-Do:**

  * The columns **"address," "city," "state," "postal_code," "latitude,"** and **"longitude"** contain geographical location data for each business. The dataset available at [Kaggle](https://www.kaggle.com/datasets/thedevastator/2013-irs-us-income-data-by-zip-code) provides average income for each ZIP code as of 2013, which may be a more useful signal than the raw location fields. We can merge it into our data on ZIP code (the **"postal_code"** column) and then drop the original location-related columns.

  * The **"BusinessParking"** and **"Ambience"** columns contain nested attributes. When present, they store structured data as dictionary-like strings, for example:  
    **BusinessParking**: `"{'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}"`  
    **Ambience**: `"{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': False}"`  
    We need to parse these strings and expand each key into its own binary column; one-hot encoding should be a suitable approach.

  * Building on the previous point, if a business lacks **BusinessParking** or **Ambience** data, we need to determine an appropriate way to handle missing values in the one-hot encoding. Setting all values to zero might be a reasonable default.

  * The **"RestaurantsAttire"** column contains inconsistent values, including:  
    `"u'casual'"`, `nan`, `"'casual'"`, `"u'dressy'"`, `"'dressy'"`, `"'formal'"`, `"u'formal'"`.  
    We need to standardize these values and clarify their meanings.

  * The **"categories"** column stores a comma-separated string of labels, such as:  
    `"Restaurants, Gluten-Free, Bars, Food, Nightlife, Sandwiches, Burgers"`.  
    We should split these into individual categorical variables for better usability.

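The cleaning steps in this list could be sketched roughly as follows. This is a minimal illustration on a toy frame rather than the project's final preprocessing: `df_demo`, `parse_flags`, and the `parking_` / `cat_` prefixes are hypothetical names, and mapping missing or malformed dictionaries to all zeros simply follows the default suggested above. The ZIP-code merge is left out because the column names of the income dataset are not known here.

```python
import ast

import pandas as pd

# Toy rows that mimic the formats described above (values are made up).
df_demo = pd.DataFrame({
    "BusinessParking": ["{'garage': False, 'street': True, 'lot': False}", None],
    "RestaurantsAttire": ["u'casual'", "'dressy'"],
    "categories": ["Restaurants, Bars, Food", "Food, Burgers"],
})


def parse_flags(value):
    """Parse a dict-like string into {key: 0/1}; missing or malformed input gives {}."""
    if not isinstance(value, str):
        return {}
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return {}
    if not isinstance(parsed, dict):
        return {}
    return {key: int(bool(flag)) for key, flag in parsed.items()}


# Nested attributes: one binary column per key, all zeros when the dict is missing.
parking = (
    pd.DataFrame(df_demo["BusinessParking"].map(parse_flags).tolist(), index=df_demo.index)
    .fillna(0)
    .astype(int)
    .add_prefix("parking_")
)

# Attire: strip the optional u prefix and the surrounding quotes; NaN stays NaN.
attire = df_demo["RestaurantsAttire"].str.replace(r"^u?'(.*)'$", r"\1", regex=True)

# Categories: one indicator column per comma-separated label.
categories = df_demo["categories"].str.get_dummies(sep=", ").add_prefix("cat_")

print(pd.concat([parking, attire, categories], axis=1))
```

The same `parse_flags` helper would apply to **Ambience**; only the column prefix changes.
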
- Text Processing

  Below are some basic text processing methods considered:

  - **Sentiment Analysis**

  - **SBERT Embeddings**

  - **Topic Modeling (LDA)**

  - **TF-IDF Features**

  These methods, while conventional, may introduce excessive noise, sparsity, or irrelevant features. Some approaches (e.g., LDA, TF-IDF) lack deep contextual understanding, while others (e.g., SBERT) may be too complex relative to the dataset size. 
  
  **To-Do:**
  
  * Further Investigation of Existing Methods

    * Evaluate whether they actually help with prediction. We can add one technique at a time to see if it helps with performance. Some other useful techniques include permutation importance or SHAP values for feature selection.
    
    * Implement other preprocessing steps for simple methods (TF-IDF and LDA), including stopword removal, stemming, etc.
    
    * Technique-specific follow-ups: tune the LDA hyperparameters, and compare sentiment scores against key business indicators to gauge their predictive value. SBERT embeddings may also not be the best choice, so we can try Word2Vec or GloVe as well.
  
  * Other NLP Techniques

    * Experiment with fine-tuning BERT variants (e.g., DistilBERT, RoBERTa) on a small dataset of labeled business reviews to adapt them to this domain.  

    * Use transformer-based models to extract contextualized word importance instead of relying solely on TF-IDF.  

    * Develop a pipeline combining LDA for topic extraction with embeddings for semantic similarity scoring.  

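    * Investigate whether zero-shot classifiers built on large language models such as Mistral can generalize well on sentiment and topic classification without requiring extensive labeled data. (OpenAI’s CLIP targets image-text pairs, so it is not a natural fit for text-only reviews.)  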

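The "add one technique at a time" check from the list above might look like the sketch below. Everything in it is an assumption for illustration: the data are synthetic, `feature_groups` and `score` are hypothetical names, and logistic regression is only a fast stand-in for whichever model is eventually chosen.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Hypothetical feature blocks standing in for the notebook's feature groups.
feature_groups = {
    "base": rng.normal(size=(n, 3)),                                   # unrelated to y
    "sentiment": (y + rng.normal(scale=0.5, size=n)).reshape(-1, 1),   # informative
    "topics": rng.normal(size=(n, 5)),                                 # pure noise
}


def score(groups):
    """Mean 5-fold macro-F1 for a model trained on the given feature groups."""
    X = np.hstack([feature_groups[g] for g in groups])
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()


baseline = score(["base"])

# Gain from adding each group on top of the baseline features.
results = {g: score(["base", g]) - baseline for g in feature_groups if g != "base"}

for group, gain in results.items():
    print(f"{group}: {gain:+.3f} macro-F1 vs. baseline")
```

A group whose gain stays near zero (or negative) across folds is a candidate for removal before moving on to permutation importance or SHAP.
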
- Modeling

  The current model is used primarily for testing purposes. A key challenge in this project is the severe class imbalance in target variables such as **usefulness** and **price range**, which requires careful handling.

  **To-Do:**
  - select an appropriate model, training strategy, and hyperparameter tuning approach.
  - research and implement techniques to address class imbalance.
  - ...

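As one possible starting point for the class-imbalance item, the class weights can be derived from the label frequencies instead of being set by hand. The sketch below uses scikit-learn's `compute_class_weight` with the `'balanced'` heuristic, which assigns each class the weight `n_samples / (n_classes * class_count)`; the toy label counts are made up.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up label distribution: 8 "average", 1 "less useful", 1 "more useful".
y_demo = np.array(["average"] * 8 + ["less useful"] + ["more useful"])

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Dictionary in the form accepted by the class_weight parameter of many estimators.
class_weight = dict(zip(classes, weights))
print(class_weight)
```

Passing `class_weight="balanced"` directly to an estimator such as `RandomForestClassifier` applies the same heuristic without the intermediate dictionary.
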
# Predicting Attributes for Businesses

There are two primary approaches for this dataset:  
1. Predict attributes at the **individual review** level.  
2. Predict attributes at the **business** level.  

If we aim to predict business attributes such as **price range, ambience,** or similar characteristics, the following roadmap outlines the necessary steps.

1. revert to the full dataset (instead of the sample dataset) and apply the following transformations.
2. since the goal is to predict attributes for businesses rather than individual reviews, we should aggregate the dataset at the **business level**.
3. decide how to handle various attributes. Below are suggested approaches:
    - remove `user_id` since user identity is irrelevant for business-level predictions
    - the following attributes should remain **constant** for a business across all its reviews. We can simply copy them from any review associated with that business:  
        - `business_id`
        - `name`
        - `business_stars`
        - `business_review_count`
        - `BusinessParking`
        - `Ambience`
        - `RestaurantsAttire`
        - `RestaurantsPriceRange2`
        - `categories`
    - `stars`, `useful`, `cool`: These are properties of individual reviews. Are they still useful when aggregating at the business level? We need to determine whether to retain, aggregate (e.g., average), or discard them.
    - `user_review_count`, `user_useful`, `user_funny`, `user_cool`, `average_stars`, `fans`, `compliment_hot`, `compliment_more`, `compliment_profile`, `compliment_cute`, `compliment_list`, `compliment_note`, `compliment_plain`, `compliment_cool`, `compliment_funny`, `compliment_writer`, `compliment_photos`  
    These attributes describe individual users. Should we **remove** them entirely, or **aggregate** them (e.g., taking the median) as a proxy for a typical reviewer’s profile for each business?
    - `text`: we can combine all review texts for a business into a single large paragraph while maintaining structure to differentiate individual reviews. One possible format:  "{review: ...} {review: ...}..."
4. once the transformations are complete, we should extract a subset of businesses from the full dataset for exploratory analysis.
5. from this point onward, the workflow will align with the framework for predicting at the **individual review level** (outlined later). The main difference will be the **target variable**, but adjusting the framework should be straightforward.

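Steps 2 and 3 of the roadmap could be sketched with a single `groupby` as shown below. The toy `reviews` frame, the output column names (`mean_stars`, `median_fans`, `n_reviews`), and the choice of mean versus median are all assumptions for illustration; which review-level and user-level attributes to keep is still an open question above.

```python
import pandas as pd

# Toy review-level data with a handful of the columns listed above.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u3"],
    "name": ["Cafe A", "Cafe A", "Bar B"],
    "RestaurantsPriceRange2": [2, 2, 3],
    "stars": [5.0, 3.0, 4.0],
    "useful": [1, 3, 0],
    "fans": [10, 0, 4],
    "text": ["Great coffee.", "Slow service.", "Nice cocktails."],
})

business_df = (
    reviews.drop(columns=["user_id"])          # user identity is not needed
    .groupby("business_id")
    .agg(
        # Constant per business: copy from any review.
        name=("name", "first"),
        RestaurantsPriceRange2=("RestaurantsPriceRange2", "first"),
        # Review-level properties: averaged here as one option.
        mean_stars=("stars", "mean"),
        mean_useful=("useful", "mean"),
        # User-level properties: median as a proxy for a typical reviewer.
        median_fans=("fans", "median"),
        n_reviews=("text", "count"),
        # Keep review boundaries visible inside the combined text.
        text=("text", lambda s: " ".join(f"{{review: {t}}}" for t in s)),
    )
    .reset_index()
)

print(business_df)
```
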

# Predicting Attributes for Individual Reviews

The following code provides an example of how to predict an attribute for each individual review. In this case, we use "useful" votes as the attribute of interest.

For each business, we first calculate the average number of useful votes received across all its reviews. Each individual review is then compared against this average and classified into one of three categories:

- "Average" – if the review's useful votes are close to the business's average.

- "More useful" – if the review has significantly more useful votes than the business's average.

- "Less useful" – if the review has significantly fewer useful votes.

This approach allows us to categorize reviews relative to their business, rather than using an absolute threshold, making the classification more context-aware.

In [2]:
df.columns

Index(['user_id', 'business_id', 'stars', 'useful', 'cool', 'text', 'name',
       'address', 'city', 'state', 'postal_code', 'latitude', 'longitude',
       'business_stars', 'business_review_count', 'BusinessParking',
       'Ambience', 'RestaurantsAttire', 'RestaurantsPriceRange2', 'categories',
       'user_review_count', 'user_useful', 'user_funny', 'user_cool',
       'average_stars', 'fans', 'compliment_hot', 'compliment_more',
       'compliment_profile', 'compliment_cute', 'compliment_list',
       'compliment_note', 'compliment_plain', 'compliment_cool',
       'compliment_funny', 'compliment_writer', 'compliment_photos'],
      dtype='object')

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sentence_transformers import SentenceTransformer
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import RandomizedSearchCV
from imblearn.combine import SMOTETomek
from sklearn.utils.class_weight import compute_sample_weight

### Drop Average

In [4]:
# 1. Drop missing rows for 'useful' or 'business_id'
#    .copy() avoids SettingWithCopyWarning when new columns are assigned below
df_filtered = df.dropna(subset=['useful', 'business_id']).copy()

# 2. Compute 40th and 60th percentile of "useful" per business
#    transform() ensures the percentile values are repeated for each row in that group
df_filtered['p40'] = df_filtered.groupby('business_id')['useful']\
                                .transform(lambda x: x.quantile(0.40))
df_filtered['p60'] = df_filtered.groupby('business_id')['useful']\
                                .transform(lambda x: x.quantile(0.60))

# 3. Define classification function based on 40th/60th cutoffs
def classify_useful(row):
    if row['useful'] < row['p40']:
        return "less useful"
    elif row['useful'] > row['p60']:
        return "more useful"
    else:
        return "average"

df_filtered['useful_category'] = df_filtered.apply(classify_useful, axis=1)

# 4. Optionally remove the "average" rows to focus on a 2-class problem
df_filtered = df_filtered[df_filtered['useful_category'] != 'average']

# 5. Drop the temporary percentile columns if you don’t need them later
#    (reassign instead of inplace=True, since df_filtered is now a filtered slice)
df_filtered = df_filtered.drop(columns=['p40', 'p60'])

# Now df_filtered has only two categories: 'more useful' and 'less useful'
print(df_filtered.head())


                   user_id             business_id  stars  useful  cool  \
1   bDjP8ELwG0OIWHsLgYyhLw  9vH3pJlBjfhi8HFlX9EB1w    5.0       1     0   
2   NRHPcLq2vGWqgqwVugSgnQ  8sf9kv6O4GgEb0j1o22N1g    5.0       0     0   
7   _cpU0VVdQcfN5AnuL6M56A  CADaY34LnEGjpSpU7Lee8w    4.0      31    24   
13  mUINJT7vETh8ds9-jQhtiQ  ksisjnLbytLO8wlbxPZoCg    1.0       0     0   
16  NeeSsIDvn-5-OYxMyp8doQ  R3FDYMQBrMpkUquwE-eniQ    3.0       0     0   

                                                 text  \
1   I had recently started looking for a new hair ...   
2   Jim Woltman who works at Goleta Honda is 5 sta...   
7   Great lil bar on strip. Waitress was sweet she...   
13  Decided to stop by and try this place again be...   
16  Kind of a cool place... it's an old diner conv...   

                                          name                 address  \
1                                     Mi Salon    1221 State St, Ste 4   
2                          Santa Barbara Honda       475 S

### Old Average

In [None]:
# convert useful votes to useful classifications

df_filtered = df.dropna(subset = ['useful', 'business_id'])


# preprocessing for useful categories
business_avg_useful = df_filtered.groupby('business_id')['useful'].mean().reset_index()
business_avg_useful.rename(columns = {'useful': 'average_business_useful'}, inplace = True)

# merge the average back
df_filtered = df_filtered.merge(business_avg_useful, on = 'business_id', how = 'left')

def classify_useful(row):
    if row['useful'] > row['average_business_useful']:
        return "more useful"
    elif row['useful'] < row['average_business_useful']:
        return "less useful"
    else:
        return "average"

df_filtered['useful_category'] = df_filtered.apply(classify_useful, axis=1)

In [6]:
y = df_filtered['useful_category']
#X = df_filtered.drop(columns=['useful', 'average_business_useful', 'useful_category'])
X = df_filtered.drop(columns=['useful', 'useful_category'])

categorical_cols = ['state', 'city', 'categories', 'BusinessParking', 'Ambience', 'RestaurantsAttire']
numerical_cols = ['stars', 'RestaurantsPriceRange2', 'cool', 'business_stars', 'business_review_count',
                  'user_review_count', 'user_useful', 'user_funny', 'user_cool', 'average_stars', 'fans']

In [7]:
# sentiment analysis
df_filtered['sentiment'] = df_filtered['text'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
numerical_cols.append('sentiment')

In [None]:
# SBERT embeddings: encode each review, then reduce to 50 dimensions with PCA

sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
# Coerce to str so that missing reviews (NaN) do not break the encoder
texts = df_filtered['text'].fillna('').astype(str).tolist()
text_embeddings_sbert = sbert_model.encode(texts, convert_to_numpy = True)
pca_sbert = PCA(n_components = 50, random_state = 42)
text_embeddings_sbert_reduced = pca_sbert.fit_transform(text_embeddings_sbert)
embedding_cols_sbert = [f"text_sbert_pca_{i}" for i in range(50)]
# Reuse df_filtered's index so the new columns align row by row
df_embeddings_sbert = pd.DataFrame(text_embeddings_sbert_reduced, columns = embedding_cols_sbert, index = df_filtered.index)
df_filtered = pd.concat([df_filtered, df_embeddings_sbert], axis=1)
numerical_cols.extend(embedding_cols_sbert)

In [None]:
# LDA topic modeling: bag-of-words counts -> 5 topics -> per-review topic proportions

vectorizer = CountVectorizer(max_features = 1000, stop_words = 'english')
# Coerce to str so that missing reviews (NaN) do not break the vectorizer
X_text = vectorizer.fit_transform(df_filtered['text'].fillna('').astype(str))
lda = LatentDirichletAllocation(n_components = 5, random_state = 42)
lda_topics = lda.fit_transform(X_text)
topic_cols = [f"topic_{i}" for i in range(5)]
df_topics = pd.DataFrame(lda_topics, columns = topic_cols, index = df_filtered.index)
df_filtered = pd.concat([df_filtered, df_topics], axis=1)
numerical_cols.extend(topic_cols)

In [None]:
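# TF-IDF: keep the 500 highest-frequency terms (English stop words removed)
tfidf = TfidfVectorizer(max_features = 500, stop_words = 'english')
# Coerce to str so that missing reviews (NaN) do not break the vectorizer
X_tfidf = tfidf.fit_transform(df_filtered['text'].fillna('').astype(str))
tfidf_cols = [f"tfidf_{word}" for word in tfidf.get_feature_names_out()]
# toarray() densifies the sparse matrix; fine for a 10,000-row sample
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns = tfidf_cols, index = df_filtered.index)
df_filtered = pd.concat([df_filtered, df_tfidf], axis = 1)
numerical_cols.extend(tfidf_cols)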

In [None]:
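# Rebuild X now that the text features are attached; keep the target columns out of the feature frame
X = df_filtered.drop(columns = ['text', 'useful', 'useful_category'])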

# preprocessing pipelines
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = 'mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown = 'ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

Try new training: 

In [None]:
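# Split first so the imputers, scaler, and encoder are fit on the training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)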
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],  # Number of trees in the forest
    'max_depth': [10, 20, 30, None],       # Depth of each tree
    'min_samples_split': [2, 5, 10],       # Minimum samples needed to split a node
    'min_samples_leaf': [1, 2, 4],         # Minimum samples per leaf
    'max_features': ['sqrt', 'log2'],      # Number of features considered for splits
    'bootstrap': [True, False]             # Bootstrap sampling for trees
}

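# Initialize Random Forest with class weights
# (the weights assume the three-class labels; drop the 'average' entry if those rows were removed above)
rf_classifier = RandomForestClassifier(random_state=42, class_weight={'average': 1, 'less useful': 20, 'more useful': 20})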

# Perform hyperparameter tuning using RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_classifier,
    param_distributions=param_grid,
    n_iter=15,  # Number of random combinations to try
    cv=3,       # 3-fold cross-validation
    scoring='f1_weighted',  # Support-weighted F1; 'f1_macro' would weight all classes equally
    random_state=42,
    n_jobs=-1   # Use all available processors
)

# Fit the search on the training data (no resampling here; class imbalance is handled via class_weight)
random_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refit on the full training set
best_classifier = random_search.best_estimator_

# Make predictions
y_pred = best_classifier.predict(X_test)

# Display the best parameters
print("Best Parameters Found:", random_search.best_params_)

# Evaluate performance
print("\nClassification Report:\n", classification_report(y_test, y_pred))

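One way to keep the preprocessing inside each cross-validation fold is to wrap it together with the classifier in a single `Pipeline` and tune that. The sketch below is an illustration under assumptions, not the notebook's actual setup: the data are synthetic, the `_demo` names are hypothetical, and the search space is kept tiny so it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: the label depends only on "stars".
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "stars": rng.integers(1, 6, size=n).astype(float),
    "fans": rng.poisson(3, size=n).astype(float),
    "state": rng.choice(["CA", "PA", "FL"], size=n),
})
y_demo = np.where(X_demo["stars"] >= 4, "more useful", "less useful")

preprocessor_demo = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["stars", "fans"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["state"]),
])

# Preprocessing and classifier in one estimator, so every CV fold refits both.
model = Pipeline([
    ("prep", preprocessor_demo),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = RandomizedSearchCV(
    model,
    # Parameters of a pipeline step are addressed as <step name>__<parameter>.
    param_distributions={"clf__n_estimators": [10, 20], "clf__max_depth": [3, None]},
    n_iter=3,
    cv=3,
    scoring="f1_macro",
    random_state=42,
)
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print("Held-out macro-F1:", round(test_score, 3))
```

Because the raw frame is split before anything is fitted, the imputers, scaler, and encoder never see the test rows or the validation fold.
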

Previous

In [None]:
# Split into train/test sets first, then fit the preprocessing on the training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)

# Apply SMOTE to handle class imbalance
smote = SMOTE(random_state = 42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a Random Forest Classifier
classifier = RandomForestClassifier(n_estimators = 200, random_state = 42, class_weight = 
                                    {'average': 1, 'less useful': 20, 'more useful': 20})
classifier.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate model performance
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))

In [None]:
# classification visualization

class_labels = sorted(y.unique())
conf_matrix = confusion_matrix(y_test, y_pred, labels=class_labels)

plt.figure(figsize=(8, 6))
# annot=True with fmt='d' prints the integer counts; seaborn picks a readable text color per cell
sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = 'Blues', annot_kws = {'size': 12},
            xticklabels = class_labels, yticklabels = class_labels)

plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
