# A2

In this assignment, I first implemented a Multinomial Naive Bayes classifier from scratch, using the text of user reviews as input. After filling missing values, lowercasing the text, and removing stop words, I built a bag‑of‑words count matrix. The baseline model estimates each class's prior probability and each term's class‑conditional probability with Laplace smoothing, and scores documents in log space. On the test set it reached an accuracy of 0.8963. Per class, precision, recall, and F1‑score were all 0.61 for Nightlife, 0.92/0.93/0.93 for Restaurants, and 0.93/0.92/0.92 for Shopping, giving weighted averages of 0.90 for all three metrics.
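
For reference, the quantities the baseline estimates can be written out as below. This is the standard multinomial Naive Bayes formulation rather than anything specific to this assignment, and the symbols are notation introduced here: $N$ is the number of training reviews, $N_c$ the number in class $c$, $N_{cj}$ the total count of term $j$ in class $c$, $V$ the vocabulary size, $x_j$ the count of term $j$ in the review being classified, and $\alpha$ the smoothing constant (1.0 for Laplace smoothing).

```latex
\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(w_j \mid c) = \frac{N_{cj} + \alpha}{\sum_{k=1}^{V} N_{ck} + \alpha V}, \qquad
\hat{y}(x) = \arg\max_{c} \Big[ \log \hat{P}(c) + \sum_{j=1}^{V} x_j \log \hat{P}(w_j \mid c) \Big]
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long reviews.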

Next, inspired by Rennie et al. (2003), “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” I replaced raw term counts with TF‑IDF‑style features and switched to the ComplementNB classifier, which is designed to mitigate Naive Bayes's weaknesses under class imbalance and heavy‑tailed word‑frequency distributions. I tuned sublinear\_tf, use\_idf, norm, min\_df, ngram\_range, and the smoothing parameter alpha with a grid search under stratified five‑fold cross‑validation. The best configuration (sublinear\_tf=True, use\_idf=False, norm=None, min\_df=2, unigrams plus bigrams, alpha=0.95) reached a CV accuracy of 0.8867 and a test accuracy of 0.9048, up from 0.8963 for the baseline. Per‑class precision/recall/F1 were 0.71/0.64/0.67 for Nightlife, 0.94/0.93/0.93 for Restaurants, and 0.90/0.94/0.92 for Shopping.
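
The weighting that the grid search ended up selecting (sublinear\_tf=True, use\_idf=False, norm=None, as listed in the section 2 output) is effectively a log‑damped term count rather than full TF‑IDF. The following is a minimal sketch of that behaviour on toy documents that are invented for illustration and are not part of the assignment data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration only
docs = ["beer beer beer bar", "shoes store shoes", "pizza pasta pizza pizza pizza"]

# Raw term counts, as used by the baseline model
counts = CountVectorizer().fit_transform(docs).toarray()

# Same weighting options the grid search selected
tfidf = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
weights = tfidf.fit_transform(docs).toarray()

# With use_idf=False and norm=None, each nonzero count tf becomes 1 + ln(tf)
expected = np.zeros_like(counts, dtype=float)
nonzero = counts > 0
expected[nonzero] = 1.0 + np.log(counts[nonzero])

print(np.allclose(weights, expected))
```

Damping the counts this way limits the influence of a word that is repeated many times within a single review, which is one of the adjustments Rennie et al. motivate.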

Finally, when exploring additional attributes, I found that ID, latitude/longitude, and mean\_checkin\_time bore little relation to the review category. I therefore added the “name” field as a second text feature, since it carries information that does not depend on the wording of the reviews, and used a ColumnTransformer to apply separately configured TF‑IDF vectorizers to reviews and names. The two feature blocks are combined in a single Pipeline. The review vectorizer was fixed at the settings found in the previous step, while the name vectorizer's parameters and the classifier's smoothing parameter were grid‑searched. This multi‑attribute model achieved a CV accuracy of 0.9023 and a test accuracy of 0.9162. Nightlife's precision/recall/F1 rose to 0.74/0.66/0.69, Restaurants to 0.94/0.94/0.94, and Shopping to 0.91/0.95/0.93, so test accuracy increased at each step (0.8963, 0.9048, 0.9162).
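
The way a ColumnTransformer merges the two text columns can be sketched on a toy frame. The rows below are invented for illustration; only the column names `review` and `name` come from the assignment data, and the vectorizer settings are simplified.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy rows, invented for illustration; column names match the assignment data
toy = pd.DataFrame({
    "review": ["great cocktails and music",
               "fresh pasta and friendly staff",
               "cheap shoes and bags"],
    "name": ["Moonlight Bar", "Luigi Trattoria", "City Shoe Outlet"],
})

# Passing a column name as a plain string hands each vectorizer a 1-D column of text
features = ColumnTransformer(transformers=[
    ("review_tfidf", TfidfVectorizer(stop_words="english"), "review"),
    ("name_tfidf", TfidfVectorizer(), "name"),
])
X = features.fit_transform(toy)

# Each vectorizer learns its own vocabulary; the outputs are stacked side by side
n_review = len(features.named_transformers_["review_tfidf"].vocabulary_)
n_name = len(features.named_transformers_["name_tfidf"].vocabulary_)
print(X.shape, n_review, n_name)
```

Because the vocabularies are kept separate, a word such as "bar" in a business name gets its own feature, distinct from the same word appearing in a review.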

### 1. Build a baseline model by implementing the Naive Bayes classifier from scratch

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load training and testing datasets from CSV files
train_df = pd.read_csv("training.csv")
test_df  = pd.read_csv("testing.csv")
# Extract the 'review' column as input features, replacing any missing values with empty strings
X_train = train_df["review"].fillna("")
X_test  = test_df["review"].fillna("")
# Extract the 'category' column as target labels and convert to a NumPy array
y_train = train_df["category"].to_numpy()
y_test = test_df["category"].to_numpy()

In [2]:
class MultinomialNaiveBayes:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha # Laplace smoothing constant
        self.classes_ = None # Unique class labels
        self.class_log_prior_ = None # Log prior probabilities log(P(c))
        self.feature_log_prob_ = None # Log conditional probabilities log(P(w_j | c))
        self.vocab_size_ = None # Number of features (vocabulary size)

    def fit(self, X, y):
        # Identify the unique classes and map y to integer indices
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.vocab_size_ = n_features

        # Count how many samples belong to each class
        class_counts = np.bincount(y_idx)
        # Compute log prior probabilities: log(P(c)) = log(n_c) - log(n_total)
        self.class_log_prior_ = np.log(class_counts) - np.log(class_counts.sum())

        # Initialize a matrix to count feature occurrences per class
        feature_count = np.zeros((n_classes, n_features), dtype=np.float64)
        # Sum the counts of each feature over the samples in each class
        for i in range(n_classes):
            # np.asarray(...).ravel() flattens the 1 x V matrix that a sparse sum returns
            feature_count[i, :] = np.asarray(X[y_idx == i].sum(axis=0)).ravel()

        # Apply Laplace smoothing to avoid zero probabilities
        smoothed = feature_count + self.alpha
        # Compute the normalization denominator for each class (total count of words + alpha * V)
        denom = smoothed.sum(axis=1, keepdims=True)
        # Compute log probabilities of features given class: log(P(w_j | c))
        self.feature_log_prob_ = np.log(smoothed) - np.log(denom)
        return self

    def _joint_log_likelihood(self, X):
        # jll = X * log(P(w|c)).T + log(P(c))
        return X @ self.feature_log_prob_.T + self.class_log_prior_

    def predict(self, X):
        jll = self._joint_log_likelihood(X)
        # Select the class with the highest joint log likelihood
        return self.classes_[np.argmax(jll, axis=1)]

    def predict_proba(self, X):
        jll = self._joint_log_likelihood(X)
        # Compute log of marginal likelihood: log P(x) = logsumexp over classes
        log_prob_x = np.logaddexp.reduce(jll, axis=1, keepdims=True)
        # Convert joint log likelihoods to normalized probabilities
        return np.exp(jll - log_prob_x)
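
One way to sanity‑check a from‑scratch implementation like the class above is to compare its fitted parameters with scikit‑learn's `MultinomialNB` on a small count matrix. The sketch below restates the same estimates in function form so that it runs on its own; `fit_mnb` is a hypothetical helper introduced here, and the toy matrix and labels are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_mnb(X, y, alpha=1.0):
    """Return (classes, log priors, log P(w|c)) with Laplace-smoothed estimates."""
    classes, y_idx = np.unique(y, return_inverse=True)
    log_prior = np.log(np.bincount(y_idx)) - np.log(len(y))
    # One row of summed term counts per class
    counts = np.vstack([X[y_idx == i].sum(axis=0) for i in range(len(classes))])
    smoothed = counts + alpha
    log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_prob

# Toy count matrix (4 documents, 3 terms) and labels, invented for illustration
X = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 2]])
y = np.array(["Nightlife", "Nightlife", "Shopping", "Shopping"])

classes, log_prior, log_prob = fit_mnb(X, y)
reference = MultinomialNB(alpha=1.0).fit(X, y)

# Both comparisons should agree if the estimates match scikit-learn's
print(np.allclose(log_prior, reference.class_log_prior_))
print(np.allclose(log_prob, reference.feature_log_prob_))
```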

In [3]:
# Initialize a CountVectorizer to transform text into token counts
vectorizer = CountVectorizer(
    lowercase=True, # convert all characters to lowercase before tokenizing
    stop_words="english",  # remove common English stop words
    min_df=2  # ignore terms that appear in fewer than 2 documents
)

# Learn the vocabulary from the training data and vectorize the text
X_train = vectorizer.fit_transform(X_train)
# Transform the test data using the same vocabulary
X_test  = vectorizer.transform(X_test)

# Create and train the Multinomial Naive Bayes classifier
nb = MultinomialNaiveBayes(alpha=1.0)
nb.fit(X_train, y_train);

# Make predictions on the test set
y_pred = nb.predict(X_test)
# Evaluate and print the model's accuracy
print(f"Accuracy of baseline model: {accuracy_score(y_test, y_pred):.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy of baseline model: 0.8963

Classification Report:
              precision    recall  f1-score   support

   Nightlife       0.61      0.61      0.61        64
 Restaurants       0.92      0.93      0.93       422
    Shopping       0.93      0.92      0.92       218

    accuracy                           0.90       704
   macro avg       0.82      0.82      0.82       704
weighted avg       0.90      0.90      0.90       704
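
The `macro avg` and `weighted avg` rows can be reproduced from the per‑class rows. The sketch below does this for recall, using the rounded values printed in the report above, so the results are only accurate to about two decimals.

```python
import numpy as np

# Per-class recall and support copied from the baseline classification report
recall = np.array([0.61, 0.93, 0.92])   # Nightlife, Restaurants, Shopping
support = np.array([64, 422, 218])

# Macro average: every class counts equally
macro = recall.mean()
# Weighted average: every class counts in proportion to its support
weighted = np.average(recall, weights=support)

print(f"macro={macro:.2f} weighted={weighted:.2f}")
```

Support‑weighted recall is the same quantity as overall accuracy, since both equal the number of correctly classified reviews divided by the total. The gap between the macro and weighted figures reflects the weaker Nightlife class, which has the smallest support.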



### 2. Improve on the baseline model using the review attribute only

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Prepare text data: fill missing reviews with empty strings
X_train = train_df["review"].fillna("")
y_train = train_df["category"].to_numpy()
X_test  = test_df["review"].fillna("")
y_test = test_df["category"].to_numpy()
# Build a pipeline: first compute TF-IDF features, then apply ComplementNB classifier
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"), # convert all text to lowercase and remove common English stop words
    ComplementNB() # initialize the Complement Naive Bayes classifier
)
# Define hyperparameter grid for exhaustive search
param_grid = {
    "tfidfvectorizer__sublinear_tf": [True, False],
    "tfidfvectorizer__use_idf":      [True, False],
    "tfidfvectorizer__norm":         ["l1", "l2", None],
    "tfidfvectorizer__min_df":       [2, 3, 4],
    "tfidfvectorizer__ngram_range":  [(1, 1), (1, 2)],
    "complementnb__alpha":           [0.9, 0.95, 1.0]
}
# Set up stratified 5-fold cross-validation to preserve class proportions in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

# Make predictions on the test set using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy  : {:.4f}".format(accuracy_score(y_test, y_pred)))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best CV Accuracy: 0.8867
Best hyperparameters: {'complementnb__alpha': 0.95, 'tfidfvectorizer__min_df': 2, 'tfidfvectorizer__ngram_range': (1, 2), 'tfidfvectorizer__norm': None, 'tfidfvectorizer__sublinear_tf': True, 'tfidfvectorizer__use_idf': False}
Accuracy  : 0.9048
Classification Report:
              precision    recall  f1-score   support

   Nightlife       0.71      0.64      0.67        64
 Restaurants       0.94      0.93      0.93       422
    Shopping       0.90      0.94      0.92       218

    accuracy                           0.90       704
   macro avg       0.85      0.84      0.84       704
weighted avg       0.90      0.90      0.90       704
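
The "216 candidates, totalling 1080 fits" line in the log follows directly from the size of the grid. A small sketch using scikit‑learn's `ParameterGrid`, with the grid copied from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the review-only search above
param_grid = {
    "tfidfvectorizer__sublinear_tf": [True, False],
    "tfidfvectorizer__use_idf":      [True, False],
    "tfidfvectorizer__norm":         ["l1", "l2", None],
    "tfidfvectorizer__min_df":       [2, 3, 4],
    "tfidfvectorizer__ngram_range":  [(1, 1), (1, 2)],
    "complementnb__alpha":           [0.9, 0.95, 1.0],
}

# Number of combinations: 2 * 2 * 3 * 3 * 2 * 3
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5

print(n_candidates, n_candidates * n_folds)  # 216 1080
```

Each additional value in any list multiplies the number of fits, which is worth checking before widening the search.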



### 3. Improve the model by adding additional attributes

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Fill missing values in both text columns with empty strings so the vectorizers can handle them
train_df["name"] = train_df["name"].fillna("")
test_df["name"] = test_df["name"].fillna("")
X_train = train_df[["review", "name"]].fillna("")
X_test = test_df[["review", "name"]].fillna("")
# Define a ColumnTransformer that applies a separate TF-IDF vectorizer to each text column
preprocessor = ColumnTransformer(transformers=[
    ("review_tfidf", TfidfVectorizer(lowercase=True, stop_words="english"),
     "review"),
    ("name_tfidf",   TfidfVectorizer(lowercase=True, analyzer="word"),
     "name")
])

# Build a pipeline that first transforms features using our preprocessor,
# then fits a Complement Naive Bayes classifier on the combined TF-IDF features
pipeline = Pipeline([
    ("features", preprocessor),
    ("clf", ComplementNB())
])

# Specify grid of hyperparameters to search over
param_grid = {
    # Hyperparameters for the review TF-IDF transformer
    "features__review_tfidf__sublinear_tf": [True],
    "features__review_tfidf__use_idf": [False],
    "features__review_tfidf__norm": [None],
    "features__review_tfidf__ngram_range": [(1,2)],
    "features__review_tfidf__min_df": [2],
    # Hyperparameters for the name TF-IDF transformer
    "features__name_tfidf__sublinear_tf": [True, False],
    "features__name_tfidf__use_idf": [True, False],
    "features__name_tfidf__norm": ["l1", "l2", None],
    "features__name_tfidf__ngram_range": [(1,1), (1,2)],
    "features__name_tfidf__max_features": [500, 1000],
    # Smoothing parameter for the ComplementNB classifier
    "clf__alpha": [0.8, 0.95, 1.0]
}
# Use stratified 5-fold cross-validation to preserve class distribution in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=cv, scoring="accuracy",
    n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

# Report the best CV score and hyperparameters, then evaluate the best model on the test data
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
print("Best Hyperparameters:")
for param, val in grid_search.best_params_.items():
    print(f"  - {param}: {val}")
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best CV Accuracy: 0.9023
Best Hyperparameters:
  - clf__alpha: 0.95
  - features__name_tfidf__max_features: 500
  - features__name_tfidf__ngram_range: (1, 2)
  - features__name_tfidf__norm: None
  - features__name_tfidf__sublinear_tf: True
  - features__name_tfidf__use_idf: False
  - features__review_tfidf__min_df: 2
  - features__review_tfidf__ngram_range: (1, 2)
  - features__review_tfidf__norm: None
  - features__review_tfidf__sublinear_tf: True
  - features__review_tfidf__use_idf: False

Accuracy: 0.9162

Classification Report:
              precision    recall  f1-score   support

   Nightlife       0.74      0.66      0.69        64
 Restaurants       0.94      0.94      0.94       422
    Shopping       0.91      0.95      0.93       218

    accuracy                           0.92       704
   macro avg       0.86      0.85      0.86       704
weighted avg       0.91      0.92      0.92       704
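
To put the three test accuracies on a concrete footing, they can be converted back into counts of correctly classified reviews. The sketch below uses the accuracies printed in the three sections and the test‑set size of 704 from the support column. Rounding to the nearest integer assumes the printed accuracies are exact to four decimals.

```python
# Test accuracies reported in sections 1-3 and the test-set size from the reports
n_test = 704
accuracies = {
    "baseline": 0.8963,
    "review-only ComplementNB": 0.9048,
    "review + name": 0.9162,
}

# Convert each accuracy back into a number of correctly classified reviews
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, n in correct.items():
    print(f"{name}: {n}/{n_test} correct")
```

On this test set the two improvements correspond to 6 and then 8 additional correctly classified reviews, so the gains are real but modest in absolute terms.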

