# Assignment 2 – Recommender Systems

**Name:**  
**PID:**  
**Dataset:** Food.com Recipes & Reviews (McAuley Lab Recommender Systems Datasets)

---

## 1. Dataset Selection & Exploratory Analysis

We use the **Food.com Recipes & Reviews** dataset from Julian McAuley’s Recommender Systems and Personalization Datasets page.

This dataset contains:

- ~231k recipes
- ~226k users
- ~1.1M reviews (user–recipe interactions)

Each interaction includes:

- `user_id`, `recipe_id`
- integer `rating` from 1 to 5
- review `date`

Each recipe includes:

- `minutes`, `n_steps`, `n_ingredients`
- `nutrition` (list of [calories, total_fat, sugar, sodium, protein, sat_fat, carbs])
- title, description, ingredients, steps, tags, etc.

We will:

- Load recipes and interactions
- Compute basic statistics: number of users/items/interactions, sparsity, rating distribution, time span
- Use these to motivate our **features** and **predictive task**

In [None]:
%pip install --upgrade pip
%pip install numpy pandas matplotlib seaborn scikit-learn
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
%pip install ipykernel

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval
import matplotlib.pyplot as plt

# For pretty printing
pd.set_option("display.max_columns", 50)

# === Adjust these paths to where you store the dataset ===
RECIPES_PATH = "RAW_recipes.csv"
INTERACTIONS_PATH = "RAW_interactions.csv"

recipes = pd.read_csv(RECIPES_PATH)
interactions = pd.read_csv(INTERACTIONS_PATH)

recipes.head()

In [None]:
interactions.head()

### 1.1 Basic dataset statistics

We first compute:

- Number of users
- Number of recipes
- Number of interactions
- Density / sparsity of the user–item matrix
- Rating distribution
- Time period covered

In [None]:
n_users = interactions["user_id"].nunique()
n_items = interactions["recipe_id"].nunique()
n_interactions = len(interactions)

print("Number of users:", n_users)
print("Number of recipes (items):", n_items)
print("Number of interactions (reviews):", n_interactions)

# Sparsity of the user–item matrix
total_possible = n_users * n_items
sparsity = 1 - (n_interactions / total_possible)
print(f"Sparsity of user–item matrix: {sparsity:.6f}")

In [None]:
# Rating distribution
rating_counts = interactions["rating"].value_counts().sort_index()
print("Rating distribution:")
print(rating_counts)

plt.figure()
rating_counts.plot(kind="bar")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Rating Distribution")
plt.show()

In [None]:
# Time period covered
interactions["date"] = pd.to_datetime(interactions["date"])
print("Earliest review date:", interactions["date"].min())
print("Latest review date:", interactions["date"].max())

### 1.2 Interesting phenomena from exploratory analysis (to discuss in writeup)

Based on the above:

- The dataset is **large enough** to apply collaborative filtering and other models from class.
- Ratings may be skewed towards high values (common in review data).
- The dataset spans multiple years, so we can potentially do **time-based splits** (train on older, test on newer).
- The user–item matrix is very sparse, which motivates using **matrix factorization** and other latent-factor methods.

We will use these observations to justify our **prediction task** and **feature choices**.
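The time-based split mentioned above can be sketched as follows. This is a minimal illustration on a toy table, not the split used in this notebook; the helper name `time_based_split` and the two cutoff dates are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the interactions table (hypothetical values).
toy = pd.DataFrame({
    "user_id":   [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    "recipe_id": [10, 10, 11, 12, 11, 10, 12, 12, 11, 13],
    "rating":    [5, 4, 3, 5, 5, 2, 4, 5, 1, 5],
    "date": pd.to_datetime([
        "2005-01-10", "2006-03-02", "2007-07-19", "2008-11-30", "2010-02-14",
        "2011-06-01", "2012-09-09", "2014-04-22", "2016-08-05", "2018-12-01",
    ]),
})


def time_based_split(frame, valid_start, test_start, date_col="date"):
    """Split chronologically: rows before valid_start train, rows from test_start on test."""
    valid_start = pd.Timestamp(valid_start)
    test_start = pd.Timestamp(test_start)
    train = frame[frame[date_col] < valid_start]
    valid = frame[(frame[date_col] >= valid_start) & (frame[date_col] < test_start)]
    test = frame[frame[date_col] >= test_start]
    return train, valid, test


# Cutoff dates are illustrative assumptions, not values derived from the real data.
train_t, valid_t, test_t = time_based_split(toy, "2012-01-01", "2015-01-01")
print(len(train_t), len(valid_t), len(test_t))
```

Unlike a random split, every training review here predates every validation and test review, which is closer to how a deployed recommender would be used.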

## 2. Predictive Task & Feature Engineering

### 2.1 Predictive Task

We define the following **binary classification task**:

> Given a user and a recipe, predict whether the user will give a **high rating (≥ 4)** to that recipe.

Formally, for each interaction `(user u, item i)` with rating `r`:

- `y = 1` if `r ≥ 4`
- `y = 0` otherwise

This is a standard setup in recommender systems, interpreting ratings 4–5 as “liked” and ≤3 as “not liked / neutral”.

We will evaluate our predictions using:

- **Accuracy**
- **AUC (Area Under ROC Curve)**
- Optionally, **F1-score** for class-imbalance sensitivity.
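To see why accuracy alone can mislead when one class dominates, the sketch below scores a trivial "always predict like" baseline on hypothetical labels. The label counts are made up for illustration and the helper names are assumptions, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical imbalanced labels: 9 likes, 1 non-like.
y_true_toy = [1] * 9 + [0]
# Majority-class baseline: predict "like" for everything.
y_pred_toy = [1] * 10

acc = accuracy(y_true_toy, y_pred_toy)
f1_like = f1_for_class(y_true_toy, y_pred_toy, positive=1)
f1_not_like = f1_for_class(y_true_toy, y_pred_toy, positive=0)

print(f"accuracy={acc:.3f}  F1(like)={f1_like:.3f}  F1(not like)={f1_not_like:.3f}")
```

The baseline looks good on accuracy while never identifying a single non-like, so any model in Section 3 should be compared against it rather than against 50%.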

In [None]:
interactions["label"] = (interactions["rating"] >= 4).astype(int)
interactions[["user_id", "recipe_id", "rating", "label"]].head()

### 2.2 Merging interactions with recipe metadata

We join interactions with recipe metadata so we can build **content-based features** (recipe-level) and also support **collaborative filtering** (user–item IDs).

In [None]:
# Parse nutrition field into separate columns
# nutrition is a string like "[calories, total_fat, sugar, sodium, protein, sat_fat, carbs]"
recipes["nutrition_list"] = recipes["nutrition"].apply(literal_eval)
nutrition_cols = ["calories", "total_fat", "sugar", "sodium", "protein", "sat_fat", "carbs"]
for i, col in enumerate(nutrition_cols):
    recipes[col] = recipes["nutrition_list"].apply(lambda x: x[i] if len(x) > i else np.nan)

# Select recipe features of interest
recipe_features = [
    "id",              # recipe_id
    "minutes",
    "n_steps",
    "n_ingredients",
] + nutrition_cols
recipes_small = recipes[recipe_features]

# Merge
df = interactions.merge(recipes_small, left_on="recipe_id", right_on="id", how="inner")
df.head()

### 2.3 Content-based feature matrix

For the **content-based model**, we only use recipe features (no user IDs), so the same recipe always has the same predicted “likelihood of like”.
This gives a **non-personalized baseline** against which the personalized model can be compared.

We will use:

- `minutes`
- `n_steps`
- `n_ingredients`
- `calories`, `total_fat`, `sugar`, `sodium`, `protein`, `sat_fat`, `carbs`

We fill missing values with 0 for simplicity.

In [None]:
content_feature_cols = [
    "minutes", "n_steps", "n_ingredients"
] + nutrition_cols

X_content = df[content_feature_cols].fillna(0)
y = df["label"].values

X_content.head()

### 2.4 Train / validation / test split

To assess validity and significance of our results, we split the data into **train, validation, and test** sets.

- Train: used to fit the model
- Validation: used for model/parameter selection (if needed)
- Test: used for final evaluation

We start with a simple random split; later we could explore time-based splits.

In [None]:
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% evenly into validation and test
X_train_c, X_temp_c, y_train, y_temp = train_test_split(
    X_content, y, test_size=0.3, random_state=42, stratify=y
)
X_valid_c, X_test_c, y_valid, y_test = train_test_split(
    X_temp_c, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

len(X_train_c), len(X_valid_c), len(X_test_c)

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to validation and test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_c)
X_valid_scaled = scaler.transform(X_valid_c)
X_test_scaled = scaler.transform(X_test_c)

## 3. Models & Evaluation

We implement and compare:

1. **Logistic Regression (Content-Based Baseline)**
   - Uses only recipe features (no user ID).
   - A standard **classification model from class**.
   - Answers: “Given the recipe’s properties, how likely is it to be liked on average?”

2. **Matrix Factorization (Collaborative Filtering)**
   - Uses **user and item IDs** to learn low-dimensional embeddings.
   - Captures personalized preferences.
   - A standard **recommender model** from class.

We evaluate both with **Accuracy** and **AUC** on the held-out test set, and we discuss:

- Relevant baselines
- Feature representations
- Overfitting / scaling issues
- Noise / missing data considerations

### 3.1 Model 1 – Logistic Regression (Content-Based)

This is a **linear classifier**:

\[
\hat{y} = \sigma(w^\top x + b)
\]

Where:

- \( x \) = recipe feature vector
- \( w \) = learned weights
- \( b \) = learned bias (intercept)
- \( \sigma \) = sigmoid function

We expect this to capture **global tendencies** (e.g., very long, high-calorie recipes might be rated differently), but it cannot personalize to individual users.
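As a worked instance of the formula above, the sketch below evaluates \( \sigma(w^\top x + b) \) by hand for one recipe. The weights, bias, and feature values are hypothetical numbers chosen for easy arithmetic, not parameters learned in this notebook.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_like_probability(x, w, b):
    """Compute sigma(w . x + b) for a single feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical weights for three standardized features, plus a bias.
w_toy = [0.5, -0.25, 0.1]
b_toy = 0.2
# One hypothetical standardized recipe: score is 0.5 - 0.5 - 0.1 + 0.2 = 0.1
x_toy = [1.0, 2.0, -1.0]

p_toy = predict_like_probability(x_toy, w_toy, b_toy)
print(f"predicted probability of a like: {p_toy:.4f}")
```

Because the input contains no user information, every user receives this same probability for the recipe, which is the sense in which the model is non-personalized.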

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report

lr = LogisticRegression(class_weight='balanced', max_iter=5000)
lr.fit(X_train_scaled, y_train)

# Validation
valid_probs_lr = lr.predict_proba(X_valid_scaled)[:, 1]
valid_preds_lr = (valid_probs_lr >= 0.5).astype(int)

print("Logistic Regression (Validation) Accuracy:", accuracy_score(y_valid, valid_preds_lr))
print("Logistic Regression (Validation) AUC:", roc_auc_score(y_valid, valid_probs_lr))

In [None]:
# Test performance
# Use the scaled test features: the model was fit on standardized inputs,
# so passing the raw X_test_c would produce meaningless probabilities.
test_probs_lr = lr.predict_proba(X_test_scaled)[:, 1]
test_preds_lr = (test_probs_lr >= 0.5).astype(int)

print("Logistic Regression (Test) Accuracy:", accuracy_score(y_test, test_preds_lr))
print("Logistic Regression (Test) AUC:", roc_auc_score(y_test, test_probs_lr))
print("Logistic Regression (Test) F1:", f1_score(y_test, test_preds_lr))
print()
print("Classification report (Logistic Regression):")
print(classification_report(y_test, test_preds_lr))

#### Notes:

- This model is **fast and scalable** even on large datasets.
- It provides interpretable coefficients, which we’ll analyze in Section 5.
- It ignores user identity, so it fails to capture “user A likes spicy food, user B doesn’t”.

Next, we move to a **collaborative filtering** model that incorporates user and item IDs.

### 3.2 Preparing data for Matrix Factorization

For MF, we need:

- Integer-encoded `user_id`
- Integer-encoded `recipe_id`
- Binary label `y` (like vs not like)

We will re-split the data into train/valid/test on the **full interaction data** with user and item IDs.

In [None]:
from sklearn.preprocessing import LabelEncoder

user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

df["user_idx"] = user_encoder.fit_transform(df["user_id"])
df["item_idx"] = item_encoder.fit_transform(df["recipe_id"])

# Note: these overwrite the n_users / n_items computed in Section 1.1
n_users = df["user_idx"].nunique()
n_items = df["item_idx"].nunique()

print("Number of encoded users:", n_users)
print("Number of encoded items:", n_items)

In [None]:
# Create arrays for MF
u_all = df["user_idx"].values
i_all = df["item_idx"].values
y_all = df["label"].values

u_train, u_temp, i_train, i_temp, y_train_cf, y_temp_cf = train_test_split(
    u_all, i_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
u_valid, u_test, i_valid, i_test, y_valid_cf, y_test_cf = train_test_split(
    u_temp, i_temp, y_temp_cf, test_size=0.5, random_state=42, stratify=y_temp_cf
)

len(u_train), len(u_valid), len(u_test)
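With a random split, some users or recipes in the validation and test sets may never appear in training, so an ID-only model has no learned embedding for them. The sketch below measures that on toy arrays; `unseen_fraction` is a hypothetical helper and the arrays are made-up stand-ins for `u_train`, `u_valid`, `i_train`, and `i_valid`.

```python
import numpy as np


def unseen_fraction(train_ids, eval_ids):
    """Fraction of evaluation rows whose ID never occurs in the training rows."""
    seen = np.isin(eval_ids, np.unique(train_ids))
    return float(1.0 - seen.mean())


# Toy stand-ins for the encoded ID arrays (hypothetical values).
u_train_toy = np.array([0, 0, 1, 2, 2, 3])
u_valid_toy = np.array([0, 4, 2, 5])       # users 4 and 5 are unseen
i_train_toy = np.array([10, 11, 10, 12, 13, 11])
i_valid_toy = np.array([10, 14, 12, 13])   # item 14 is unseen

cold_users = unseen_fraction(u_train_toy, u_valid_toy)
cold_items = unseen_fraction(i_train_toy, i_valid_toy)
print(f"cold-start users: {cold_users:.2f}, cold-start items: {cold_items:.2f}")
```

Running the same helper on the real arrays gives a quick sense of how much of the evaluation set is a cold-start problem for MF.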

### 3.3 Model 2 – Matrix Factorization (Collaborative Filtering)

We implement a simple MF model using PyTorch:

- Each user \( u \) has embedding vector \( p_u \in \mathbb{R}^k \).
- Each item \( i \) has embedding vector \( q_i \in \mathbb{R}^k \).
- Predicted probability of a “like”:

\[
\hat{y}_{ui} = \sigma(p_u^\top q_i)
\]

We train with **binary cross-entropy loss**.

This is the standard latent factor model from class, adapted for binary labels.
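Before the PyTorch version, the prediction rule and loss can be checked by hand in NumPy. This is a minimal sketch with tiny hypothetical embedding matrices (`P_toy` for users, `Q_toy` for items, with k = 2); none of the numbers come from the trained model.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def mf_predict(P, Q, users, items):
    """sigma(p_u . q_i) for each (user, item) pair."""
    return sigmoid(np.sum(P[users] * Q[items], axis=1))


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))


# Hypothetical embeddings: 2 users and 3 items.
P_toy = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
Q_toy = np.array([[1.0, 1.0],
                  [-1.0, 0.5],
                  [0.0, 0.0]])

users_toy = np.array([0, 1, 0])
items_toy = np.array([0, 1, 2])
labels_toy = np.array([1.0, 1.0, 0.0])

# Dot products are 1, 1, and 0, so the probabilities are sigma(1), sigma(1), 0.5.
probs_toy = mf_predict(P_toy, Q_toy, users_toy, items_toy)
loss_toy = bce_loss(probs_toy, labels_toy)
print(probs_toy, loss_toy)
```

An item with an all-zero embedding always yields a probability of 0.5, which is roughly where the small random initialization used in the model below starts every prediction.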

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
class InteractionDataset(Dataset):
    """Wraps parallel arrays of user indices, item indices, and binary labels."""

    def __init__(self, users, items, labels):
        self.users = users
        self.items = items
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.users[idx]).long(),
            torch.tensor(self.items[idx]).long(),
            torch.tensor(self.labels[idx]).float()
        )

train_dataset = InteractionDataset(u_train, i_train, y_train_cf)
valid_dataset = InteractionDataset(u_valid, i_valid, y_valid_cf)
test_dataset  = InteractionDataset(u_test,  i_test,  y_test_cf)

train_loader = DataLoader(train_dataset, batch_size=4096, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=4096, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=4096, shuffle=False)

In [None]:
class MFModel(nn.Module):
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)
        self.sigmoid = nn.Sigmoid()

        # Optional: initialize embeddings
        nn.init.normal_(self.user_emb.weight, std=0.01)
        nn.init.normal_(self.item_emb.weight, std=0.01)

    def forward(self, u, i):
        u_vec = self.user_emb(u)
        i_vec = self.item_emb(i)
        # Row-wise dot product between user and item embeddings
        dot = (u_vec * i_vec).sum(dim=1)
        return self.sigmoid(dot)

In [None]:
model = MFModel(n_users=n_users, n_items=n_items, k=32).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [None]:
def evaluate_mf(model, data_loader):
    """Return (accuracy, AUC, F1) of the model over all batches in data_loader."""
    model.eval()
    all_labels = []
    all_probs = []

    with torch.no_grad():
        for u, it, labels in data_loader:
            u = u.to(device)
            it = it.to(device)
            labels = labels.to(device)

            probs = model(u, it)
            all_labels.append(labels.cpu().numpy())
            all_probs.append(probs.cpu().numpy())

    all_labels = np.concatenate(all_labels)
    all_probs = np.concatenate(all_probs)
    preds = (all_probs >= 0.5).astype(int)

    acc = accuracy_score(all_labels, preds)
    auc = roc_auc_score(all_labels, all_probs)
    f1  = f1_score(all_labels, preds)
    return acc, auc, f1

In [None]:
n_epochs = 5  # you can increase this if training is fast

for epoch in range(1, n_epochs + 1):
    model.train()
    total_loss = 0.0

    for u, it, labels in train_loader:
        u = u.to(device)
        it = it.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        probs = model(u, it)
        loss = criterion(probs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Report the mean per-batch loss so the value does not depend on the number of batches
    avg_loss = total_loss / len(train_loader)
    val_acc, val_auc, val_f1 = evaluate_mf(model, valid_loader)
    print(f"Epoch {epoch:02d} | Train Loss: {avg_loss:.4f} | "
          f"Valid Acc: {val_acc:.4f} | Valid AUC: {val_auc:.4f} | Valid F1: {val_f1:.4f}")
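The loop above trains for a fixed number of epochs. One common alternative is early stopping on the validation AUC; the sketch below shows the bookkeeping only, using a hypothetical `EarlyStopper` class and a made-up sequence of validation AUCs rather than values from a real run.

```python
class EarlyStopper:
    """Signals a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record a validation score; return True when training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation AUCs per epoch (not results from this notebook).
valid_aucs_toy = [0.60, 0.65, 0.67, 0.66, 0.66, 0.64]

stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, auc in enumerate(valid_aucs_toy, start=1):
    if stopper.step(auc, epoch):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In the MF loop, `stopper.step(val_auc, epoch)` could be called right after `evaluate_mf`, and saving `model.state_dict()` whenever the best score improves would allow restoring the best weights before the final test evaluation.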

In [None]:
test_acc_mf, test_auc_mf, test_f1_mf = evaluate_mf(model, test_loader)
print("MF (Test) Accuracy:", test_acc_mf)
print("MF (Test) AUC:", test_auc_mf)
print("MF (Test) F1:", test_f1_mf)

### 3.4 Baselines and model comparison

We now summarize results (fill actual numbers from your run):

| Model                     | Accuracy (Test) | AUC (Test) | F1 (Test) |
|---------------------------|-----------------|------------|-----------|
| Logistic Regression       | …               | …          | …         |
| Matrix Factorization (MF) | …               | …          | …         |

**Discussion points for the writeup:**

- LR uses only content-based features → non-personalized baseline.
- MF uses only IDs but captures personalized patterns in the interaction matrix.
- Which performs better overall?
- Does MF overfit? (Look at train vs valid performance.)
- Scaling issues: MF training time vs LR training time.
- Any sensitivity to class imbalance? (Check F1, thresholding.)
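For the thresholding point above, one way to probe sensitivity is to sweep the decision threshold on the validation set and see how F1 changes. This is a sketch on hypothetical labels and probabilities; the helper names and candidate thresholds are assumptions for illustration.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 for the positive class when predicting 1 iff prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, candidates):
    """Candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, probs, th))


# Hypothetical validation labels and predicted probabilities.
y_valid_toy = [1, 1, 1, 0, 1, 0, 1, 0]
probs_valid_toy = [0.9, 0.8, 0.45, 0.4, 0.6, 0.55, 0.35, 0.2]
candidates_toy = [0.3, 0.4, 0.5, 0.6, 0.7]

for th in candidates_toy:
    print(f"threshold={th:.1f}  F1={f1_at_threshold(y_valid_toy, probs_valid_toy, th):.3f}")

chosen = best_threshold(y_valid_toy, probs_valid_toy, candidates_toy)
print("chosen threshold:", chosen)
```

The threshold should be selected on the validation set and then applied unchanged to the test set. Note that positive-class F1 tends to favor low thresholds when likes dominate, so it is worth reporting alongside AUC, which does not depend on the threshold.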

## 4. Related Literature

In this section (for the report, not code), you should discuss:

- **Dataset source and usage**
  - Food.com data was introduced in:
    > *Generating Personalized Recipes from Historical User Preferences* (Majumder et al., EMNLP 2019).
  - How they use the dataset (personalized recipe generation, recommendation) vs your simpler binary prediction task.

- **Similar datasets**
  - Other recommender datasets from McAuley’s page (Amazon reviews, BeerAdvocate, RateBeer, Steam, etc.).
  - How those have been used for rating prediction, ranking, explanation generation, etc.

- **State-of-the-art methods for this task**
  - Collaborative filtering (matrix factorization, factorization machines).
  - Neural approaches (Neural Collaborative Filtering, sequence-based recommenders).
  - Hybrid methods: combining content features (ingredients, nutrition) with CF.

- **Connection to your work**
  - You’re implementing:
    - A classic **logistic regression** baseline using content features.
    - A classic **MF** model using user–item interactions.
  - You can mention that your work is a simplified version of the setups in those papers, focusing on clarity and comparison of basic models, not SOTA performance.

## 5. Results Interpretation & Conclusions

### 5.1 Which models worked best?

- Compare LR vs MF:
  - Which has higher **AUC / accuracy / F1** on the test set?
  - Does personalization (MF) give a clear boost over content-only LR?
  - Are there cases where LR might do reasonably well (e.g., recipes that are almost universally liked or disliked)?

### 5.2 Interpretation of parameters and features

For **Logistic Regression**, we can inspect the learned coefficients to see which recipe features are associated with higher probability of a “like”.

In [None]:
coef = lr.coef_[0]

# Print features ordered by absolute coefficient size (largest effect first)
for name, c in sorted(zip(content_feature_cols, coef), key=lambda x: -abs(x[1])):
    print(f"{name:15s}  coeff = {c:+.4f}")
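Because the model was fit on standardized features, each coefficient is the change in log-odds of a “like” per one standard deviation increase in that feature, and exponentiating it gives an odds ratio. The sketch below uses hypothetical coefficient values purely to show the conversion; they are not outputs of this notebook.

```python
import math


def odds_ratio(coef):
    """Multiplicative change in the odds of a like per +1 standard deviation."""
    return math.exp(coef)


# Hypothetical coefficients on standardized features (illustrative only).
example_coefs = {"n_ingredients": 0.10, "minutes": -0.05, "calories": 0.0}

for name, c in example_coefs.items():
    print(f"{name:15s} coeff = {c:+.2f}  odds ratio per +1 SD = {odds_ratio(c):.3f}")
```

An odds ratio above 1 means the feature is associated with higher odds of a like, below 1 with lower odds, and exactly 1 with no association under this linear model.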

Interpretation examples (fill with your observations):

- Positive coefficient: feature increases probability of high rating.
- Negative coefficient: feature decreases probability of high rating.

For instance:

- If `n_ingredients` has a positive coefficient: more complex recipes might be rated more favorably.
- If `minutes` has a negative coefficient: very long-cooking recipes may be less popular on average.

For **Matrix Factorization**:

- Individual embedding dimensions are harder to interpret directly, but:
  - Large norms for particular user/item embeddings → very “opinionated” users or polarized items.
  - You could also look at nearest neighbors in embedding space (optional) to see clusters of similar items/users.

### 5.3 Issues encountered

You should comment on (in prose):

- **Preprocessing**:
  - Parsing `nutrition` strings.
  - Handling missing values.
  - Encoding user/item IDs.
- **Scaling**:
  - Time to train MF vs LR.
  - Choice of embedding dimension `k`, batch size, and epochs.
- **Overfitting**:
  - Compare train/valid metrics across epochs for MF.
  - Could add regularization (weight decay) or early stopping.
- **Noise / missing data**:
  - Ratings are noisy, subjective.
  - Some users have very few reviews (cold-start).
  - Some items have very few reviews.

### 5.4 Final conclusion

Summarize in a short paragraph (you can edit this):

> In this assignment, we used the Food.com recipe and review dataset to study personalized rating prediction. We framed the problem as a binary classification task (high vs low rating) and compared a non-personalized logistic regression model based on recipe content against a personalized matrix factorization model based on user–item interactions. Our experiments showed that [MF/logistic regression] achieved better performance in terms of AUC and accuracy, highlighting the importance of personalization in recommender systems.
> At the same time, analyzing the logistic regression coefficients provided interpretable insights into how recipe properties such as cooking time, number of ingredients, and nutrition information relate to user satisfaction.