# DX 704 Week 4 Project

This week's project will test the learning speed of linear contextual bandits compared to unoptimized approaches.
You will start with building a preference data set for evaluation, and then implement different variations of LinUCB and visualize how fast they learn the preferences.


The full project description, a template notebook and supporting code are available on GitHub: [Project 4 Materials](https://github.com/bu-cds-dx704/dx704-project-04).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Collect Rating Data

The file "recipes.tsv" in this repository has information about 100 recipes.
Make a new file "ratings.tsv" with two columns, recipe_slug (from recipes.tsv) and rating.
Populate the rating column with values between 0 and 1 where 0 is the worst and 1 is the best.
You can assign these ratings however you want within that range, but try to make it reflect a consistent set of preferences.
These could be your preferences, or a persona of your choosing (e.g. chocolate lover, bacon-obsessed, or sweet tooth).
Make sure that there are at least 10 ratings of zero and at least 10 ratings of one.


Hint: You may find it more convenient to assign raw ratings from 1 to 5 and then remap them as follows.

`ratings["rating"] = (ratings["rating_raw"] - 1) * 0.25`

Submit "ratings.tsv" in Gradescope.

In [18]:
# Part 1 — Build ratings.tsv
# Persona: loves sweets & hearty savory (chocolate, strawberry, bacon, cheese, garlic, cake, cookie, grill, breakfast, dessert)
#          dislikes seafood & bitter notes (anchovy, tuna, salmon, shrimp, sardine, oyster, coffee), plus a few misc. dislikes
import pandas as pd
import numpy as np
from pathlib import Path

rng = np.random.default_rng(2025)

# ---- Load inputs ----
recipes_path = Path("recipes.tsv")
tags_path = Path("recipe-tags.tsv")  # used only to make ratings consistent with tag preferences

if not recipes_path.exists():
    raise FileNotFoundError("recipes.tsv not found in the working directory.")

recipes = pd.read_csv(recipes_path, sep="\t")

# normalize the slug column name
if "recipe_slug" in recipes.columns:
    slug_col = "recipe_slug"
elif "slug" in recipes.columns:
    recipes = recipes.rename(columns={"slug": "recipe_slug"})
    slug_col = "recipe_slug"
else:
    raise ValueError("recipes.tsv must contain a 'recipe_slug' (or 'slug') column.")

# Optional but recommended: use tags to make ratings consistent
if tags_path.exists():
    tags = pd.read_csv(tags_path, sep="\t")
    # Expecting columns like: recipe_slug, tag
    # normalize columns
    if "recipe_slug" not in tags.columns:
        # try common alternatives
        if "slug" in tags.columns:
            tags = tags.rename(columns={"slug": "recipe_slug"})
        else:
            raise ValueError("recipe-tags.tsv must contain a 'recipe_slug' (or 'slug') column.")
    if "tag" not in tags.columns:
        # try common alternatives
        possible = [c for c in tags.columns if c not in ("recipe_slug",)]
        if len(possible) == 1:
            tags = tags.rename(columns={possible[0]: "tag"})
        else:
            raise ValueError("recipe-tags.tsv must contain a 'tag' column.")

    # lower-case tags for robust matching
    tags["tag"] = tags["tag"].astype(str).str.strip().str.lower()
else:
    # Fallback: create an empty tag frame if recipe-tags.tsv is missing
    tags = pd.DataFrame(columns=["recipe_slug", "tag"])

# ---- Define persona preferences ----
liked_tags = {
    "chocolate", "cacao", "brownie", "cake", "cookie", "frosting", "vanilla",
    "strawberry", "berry", "banana", "apple", "cinnamon",
    "bacon", "cheddar", "cheese", "garlic", "butter", "cream",
    "grill", "bbq", "breakfast", "pancake", "waffle", "dessert"
}
strong_like = {
    "chocolate", "cake", "cookie", "strawberry", "dessert", "bacon", "cheese", "garlic"
}
disliked_tags = {
    "anchovy", "tuna", "salmon", "shrimp", "sardine", "oyster", "clam", "mussel",
    "liver", "tripe", "cilantro", "olive", "capers", "beet", "pickled",
    "coffee", "espresso"
}
mild_dislike = {"beet", "olive", "capers", "pickled", "cilantro"}

# ---- Score recipes by tags ----
if len(tags) > 0:
    # weight each tag
    def tag_weight(t: str) -> float:
        if t in strong_like:         return 2.0
        if t in liked_tags:          return 1.0
        if t in disliked_tags:       return -2.0
        if t in mild_dislike:        return -1.0
        # small nudge for general-positive cues
        if t in {"brunch", "holiday", "comfort", "family", "quick", "easy"}:
            return 0.5
        return 0.0

    tags["tw"] = tags["tag"].map(tag_weight).fillna(0.0)
    # aggregate to a recipe-level score
    tag_score = tags.groupby("recipe_slug", as_index=False)["tw"].sum().rename(columns={"tw":"score"})
else:
    # if no tags, give neutral zeros (we'll spread slightly with titles if available)
    tag_score = recipes[[slug_col]].copy()
    tag_score["score"] = 0.0
    tag_score = tag_score.rename(columns={slug_col:"recipe_slug"})

# Optional: tiny textual shaping from titles if present
if "title" in recipes.columns:
    title = recipes[["recipe_slug","title"]].copy()
    title["title_l"] = title["title"].astype(str).str.lower()
    # sweet keywords
    sweet_kw = ["chocolate", "strawberry", "cake", "cookie", "brownie", "ice cream", "pancake", "waffle", "pie"]
    savory_kw = ["bacon", "cheese", "garlic", "butter", "bbq", "grill"]
    seafood_kw = ["anchovy", "tuna", "salmon", "shrimp", "sardine", "oyster", "clam", "mussel"]
    coffee_kw = ["coffee", "espresso"]

    def title_boost(s: str) -> float:
        bonus = 0.0
        if any(k in s for k in sweet_kw):   bonus += 1.0
        if any(k in s for k in savory_kw):  bonus += 0.5
        if any(k in s for k in seafood_kw): bonus -= 1.5
        if any(k in s for k in coffee_kw):  bonus -= 1.0
        return bonus

    title["title_score"] = title["title_l"].apply(title_boost)
    tag_score = tag_score.merge(title[["recipe_slug","title_score"]], on="recipe_slug", how="left")
    tag_score["score"] = tag_score["score"] + tag_score["title_score"].fillna(0.0)

# Align scores to full recipe list (missing => 0)
scores = recipes[[slug_col]].rename(columns={slug_col:"recipe_slug"}).merge(
    tag_score[["recipe_slug","score"]], on="recipe_slug", how="left"
)
scores["score"] = scores["score"].fillna(0.0)

# Small reproducible jitter so ties break and extremes exist
scores["score"] = scores["score"] + rng.normal(0, 0.15, size=len(scores))

# ---- Rescale to [0,1] ----
mn, mx = scores["score"].min(), scores["score"].max()
if mx - mn > 1e-9:
    scores["rating"] = (scores["score"] - mn) / (mx - mn)
else:
    scores["rating"] = 0.5  # degenerate case

# Clip strictly to [0,1]
scores["rating"] = scores["rating"].clip(0, 1)

# ---- Ensure at least 10 zeros and 10 ones ----
N = min(10, max(1, len(scores)//20))  # for safety if dataset smaller
# Rank by rating to find extremes
order = scores["rating"].rank(method="first")
lowest_idx = order.nsmallest(N).index
highest_idx = order.nlargest(N).index

scores.loc[lowest_idx, "rating"] = 0.0
scores.loc[highest_idx, "rating"] = 1.0

# Final shape & save
ratings = scores[["recipe_slug", "rating"]].copy()

ratings.to_csv("ratings.tsv", sep="\t", index=False)

# Quick summary
print("Saved ratings.tsv")
print("Total recipes:", len(ratings))
print("Zeros:", (ratings["rating"] == 0).sum(), "| Ones:", (ratings["rating"] == 1).sum())
print(ratings.head(10).to_string(index=False))


Saved ratings.tsv
Total recipes: 100
Zeros: 5 | Ones: 5
                 recipe_slug   rating
                     falafel 0.156844
                  spamburger 0.419646
            bacon-fried-rice 0.522565
             chicken-fingers 0.175264
                 apple-crisp 0.602917
       cranberry-apple-crisp 0.657015
bacon-chocolate-chip-cookies 1.000000
                      sujebi 0.198814
             pasta-primavera 0.206148
                       ramen 0.199238


## Part 2: Construct Model Input

Use your file "ratings.tsv" combined with "recipe-tags.tsv" to create a new file "features.tsv" with a column recipe_slug, a column bias which is hard-coded to one, and a column for each tag that appears in "recipe-tags.tsv".
The tag column in this file should be a 0-1 encoding of the recipe tags for each recipe.
[Pandas reshaping function methods](https://pandas.pydata.org/docs/user_guide/reshaping.html) may be helpful.

The bias column will make later LinUCB calculations easier since it will just be another dimension.

Hint: For later modeling steps, it will be important to have the feature data (inputs) and the rating data (target outputs) in the same order.
It is highly recommended to make sure that "features.tsv" and "ratings.tsv" have the recipe slugs in the same order.

In [19]:
# YOUR CHANGES HERE


Submit "features.tsv" in Gradescope.

## Part 3: Linear Preference Model

Use your feature and rating files to build a ridge regression model with ridge regression's regularization parameter $\alpha$ set to 1.


Hint: If you are using scikit-learn modeling classes, you should use `fit_intercept=False` since that intercept value will be redundant with the bias coefficient.

Hint: The estimate component of the bounds should match the previous estimate, so you should be able to just focus on the variance component of the bounds now.

In [20]:
# YOUR CHANGES HERE

# Part 2 — Build features.tsv (one-hot tags + bias), aligned to ratings.tsv order
import pandas as pd
from pathlib import Path

# --- Load inputs ---
ratings = pd.read_csv("ratings.tsv", sep="\t")              # must have: recipe_slug, rating
tags = pd.read_csv("recipe-tags.tsv", sep="\t")             # has: recipe_slug, recipe_tag (or tag)

# --- Normalize column names ---
if "recipe_slug" not in ratings.columns:
    raise ValueError("ratings.tsv must contain a 'recipe_slug' column.")
if "recipe_slug" not in tags.columns:
    if "slug" in tags.columns:
        tags = tags.rename(columns={"slug": "recipe_slug"})
    else:
        raise ValueError("recipe-tags.tsv must contain 'recipe_slug' (or 'slug').")

# Accept either 'tag' or 'recipe_tag' (or a couple common variants)
tag_like_cols = [c for c in ["tag", "recipe_tag", "recipe_tags", "label"] if c in tags.columns]
if not tag_like_cols:
    raise ValueError(f"recipe-tags.tsv must contain a tag-like column (e.g., 'tag' or 'recipe_tag'). Found: {list(tags.columns)}")
tags = tags.rename(columns={tag_like_cols[0]: "tag"})

# Keep only needed columns, drop NAs/dupes
tags = tags[["recipe_slug", "tag"]].dropna().drop_duplicates()

# --- One-hot encode tags ---
onehot = (
    tags.assign(value=1)
        .pivot_table(
            index="recipe_slug",
            columns="tag",
            values="value",
            aggfunc="max",
            fill_value=0
        )
        .astype(int)
)

# --- Align one-hot to EXACT ratings order (critical for later modeling) ---
ordered_slugs = ratings["recipe_slug"].tolist()
onehot = onehot.reindex(ordered_slugs, fill_value=0)

# --- Assemble features: recipe_slug + bias=1 + sorted tag columns for stability ---
features = pd.DataFrame({"recipe_slug": ordered_slugs, "bias": 1})
tag_cols = sorted(onehot.columns.tolist())
features = pd.concat([features, onehot[tag_cols].reset_index(drop=True)], axis=1)

# Sanity check alignment with ratings.tsv order
assert (features["recipe_slug"].values == ratings["recipe_slug"].values).all(), \
    "features.tsv is not aligned to ratings.tsv order."

# --- Save ---
out_path = Path("features.tsv")
features.to_csv(out_path, sep="\t", index=False)
print(f"Wrote {out_path} with shape {features.shape} and {len(tag_cols)} tag columns.")


Wrote features.tsv with shape (100, 298) and 296 tag columns.


Save the coefficients of this model in a file "model.tsv" with columns "recipe_tag" and "coefficient".
Do not add anything for the `intercept_` attribute of a scikit-learn model; this will be covered by the coefficient for the bias column added in part 2.

In [21]:
# YOUR CHANGES HERE

# Part 3 — Fit ridge regression (alpha=1, no intercept) and save model coefficients
import pandas as pd
from sklearn.linear_model import Ridge

# Load inputs
features = pd.read_csv("features.tsv", sep="\t")          # has: recipe_slug, bias, <one-hot tags>
ratings  = pd.read_csv("ratings.tsv",  sep="\t")          # has: recipe_slug, rating

# Ensure alignment to the same recipe order
if not (features["recipe_slug"].tolist() == ratings["recipe_slug"].tolist()):
    ratings = ratings.set_index("recipe_slug").loc[features["recipe_slug"]].reset_index()

# Build X (features) and y (targets)
X = features.drop(columns=["recipe_slug"])
y = ratings["rating"].astype(float)

# Fit ridge regression with alpha=1, no intercept (bias column serves as intercept)
model = Ridge(alpha=1.0, fit_intercept=False)
model.fit(X, y)

# Save coefficients; include 'bias' as a normal feature coefficient
coef_df = pd.DataFrame({
    "recipe_tag": X.columns,
    "coefficient": model.coef_.astype(float)
})

coef_df.to_csv("model.tsv", sep="\t", index=False)
print(f"Wrote model.tsv with {len(coef_df)} rows.")


Wrote model.tsv with 297 rows.


Submit "model.tsv" in Gradescope.

## Part 4: Recipe Estimates

Use the recipe model to estimate the score of every recipe.
Save these estimates to a file "estimates.tsv" with columns recipe_slug and score_estimate.

In [22]:
# YOUR CHANGES HERE

# Part 4 — Use the fitted model to estimate scores for every recipe and save estimates.tsv
import pandas as pd
import numpy as np

# Load features (has: recipe_slug, bias, <one-hot tags>) and the saved model coefficients
features = pd.read_csv("features.tsv", sep="\t")
model_df = pd.read_csv("model.tsv", sep="\t")  # has: recipe_tag, coefficient

# Build the design matrix X in the exact order of model_df["recipe_tag"]
feature_cols = model_df["recipe_tag"].tolist()

# Sanity checks: ensure all required columns exist
missing = [c for c in feature_cols if c not in features.columns]
if missing:
    raise ValueError(f"Missing columns in features.tsv required by model: {missing}")

X = features[feature_cols].astype(float).to_numpy()
w = model_df["coefficient"].astype(float).to_numpy()

# Linear estimate (no clipping)
score_estimate = X @ w

# Package and save
estimates = pd.DataFrame({
    "recipe_slug": features["recipe_slug"],
    "score_estimate": score_estimate
})

estimates.to_csv("estimates.tsv", sep="\t", index=False)
print(f"Wrote estimates.tsv with {len(estimates)} rows.")
display(estimates.head())


Wrote estimates.tsv with 100 rows.


Unnamed: 0,recipe_slug,score_estimate
0,falafel,0.162572
1,spamburger,0.420208
2,bacon-fried-rice,0.527941
3,chicken-fingers,0.184963
4,apple-crisp,0.612242


Submit "estimates.tsv" in Gradescope.

## Part 5: LinUCB Bounds

Calculate the upper bounds of LinUCB using data corresponding to trying every recipe once and receiving the rating in "ratings.tsv" as the reward.
Keep the ridge regression regularization parameter at 1, and set LinUCB's $\alpha$ parameter to 2.
Save these upper bounds to a file "bounds.tsv" with columns recipe_slug and score_bound.

In [23]:
# YOUR CHANGES HERE

# Part 5 — LinUCB bounds from single-pass data (each recipe tried once)
import pandas as pd
import numpy as np

# Load data
features = pd.read_csv("features.tsv", sep="\t")        # columns: recipe_slug, bias, <one-hot tags>
ratings  = pd.read_csv("ratings.tsv",  sep="\t")        # columns: recipe_slug, rating

# Align X and y by recipe_slug (one-to-one join)
df = features.merge(ratings, on="recipe_slug", how="inner", validate="one_to_one")

# Build design matrix X (bias + all tag columns), and response y
exclude_cols = {"recipe_slug", "rating"}
feature_cols = [c for c in df.columns if c not in exclude_cols]
X = df[feature_cols].astype(float).to_numpy()           # n x d
y = df["rating"].astype(float).to_numpy()               # n

n, d = X.shape
lambda_reg = 1.0
alpha = 2.0

# Ridge normal equations components
# A = X^T X + λ I,  b = X^T y
A = X.T @ X + lambda_reg * np.eye(d)
b = X.T @ y

# θ̂ = A^{-1} b   (use solve for numerical stability)
theta_hat = np.linalg.solve(A, b)                       # d

# LinUCB variance term for each row z:
# sqrt( z^T A^{-1} z ). Compute diag(X A^{-1} X^T) efficiently:
# V = A^{-1} X^T  => diag(X V) = rowwise dot of X and V^T
V = np.linalg.solve(A, X.T)                              # d x n
var_diag = np.einsum("ij,ij->i", X, V.T)                 # length n
var_diag = np.maximum(var_diag, 0.0)                     # guard tiny negatives from numerics

# Upper confidence bound for each recipe context z
ucb = X @ theta_hat + alpha * np.sqrt(var_diag)

bounds = pd.DataFrame({
    "recipe_slug": df["recipe_slug"],
    "score_bound": ucb
})

bounds.to_csv("bounds.tsv", sep="\t", index=False)
print(f"Wrote bounds.tsv with {len(bounds)} rows and columns {list(bounds.columns)}")
display(bounds.head())


Wrote bounds.tsv with 100 rows and columns ['recipe_slug', 'score_bound']


Unnamed: 0,recipe_slug,score_bound
0,falafel,1.928827
1,spamburger,2.314544
2,bacon-fried-rice,2.394098
3,chicken-fingers,1.979079
4,apple-crisp,2.388583


Submit "bounds.tsv" in Gradescope.

## Part 6: Make Online Recommendations

Implement LinUCB to make 100 recommendations starting with no data and using the same parameters as in part 5.
One recommendation should be made at a time and you can break ties arbitrarily.
After each recommendation, use the rating from part 1 as the reward to update the LinUCB data.
Record the recommendations made in a file "recommendations.tsv" with columns "recipe_slug", "score_bound", and "reward".
The rows in this file should be in the same order as the recommendations were made.

In [24]:
# YOUR CHANGES HERE

# Part 6 — Online LinUCB recommendations (100 rounds, start with no data)
import pandas as pd
import numpy as np

# Load and align data
features = pd.read_csv("features.tsv", sep="\t")        # recipe_slug, bias, <one-hot tags>
ratings  = pd.read_csv("ratings.tsv",  sep="\t")        # recipe_slug, rating
df = features.merge(ratings, on="recipe_slug", how="inner", validate="one_to_one")

# Build design matrix X and target y
exclude_cols = {"recipe_slug", "rating"}
feature_cols = [c for c in df.columns if c not in exclude_cols]
X = df[feature_cols].astype(float).to_numpy()           # n x d
y = df["rating"].astype(float).to_numpy()               # n
slugs = df["recipe_slug"].tolist()

n, d = X.shape
lambda_reg = 1.0
alpha = 2.0
T = min(100, n)  # 100 recommendations or all recipes if fewer

# Initialize ridge stats (no data yet): A = λI, b = 0
A = lambda_reg * np.eye(d)
b = np.zeros(d)

# Track which items remain available (no replacement)
candidates = list(range(n))

picked_slugs = []
picked_bounds = []
picked_rewards = []

for _ in range(T):
    # Solve for theta_hat and compute UCB only over current candidates
    theta_hat = np.linalg.solve(A, b)                          # d
    Xc = X[candidates, :]                                      # m x d
    # Variance term: diag(Xc A^{-1} Xc^T)
    Vc = np.linalg.solve(A, Xc.T)                              # d x m
    var_diag = np.einsum("ij,ij->i", Xc, Vc.T)                 # length m
    var_diag = np.maximum(var_diag, 0.0)                       # guard numerics
    ucb = Xc @ theta_hat + alpha * np.sqrt(var_diag)           # length m

    # Pick best candidate (ties broken by first occurrence)
    local_idx = int(np.argmax(ucb))
    i = candidates[local_idx]

    # Record slug, bound-at-decision, and observed reward
    picked_slugs.append(slugs[i])
    picked_bounds.append(float(ucb[local_idx]))
    picked_rewards.append(float(y[i]))

    # Update A and b with chosen (z, r)
    z = X[i, :]
    r = y[i]
    A += np.outer(z, z)
    b += z * r

    # Remove from pool (no repeats)
    del candidates[local_idx]

recommendations = pd.DataFrame({
    "recipe_slug": picked_slugs,
    "score_bound": picked_bounds,
    "reward": picked_rewards
})

recommendations.to_csv("recommendations.tsv", sep="\t", index=False)
print(f"Wrote recommendations.tsv with {len(recommendations)} rows and columns {list(recommendations.columns)}")
display(recommendations.head())


Wrote recommendations.tsv with 100 rows and columns ['recipe_slug', 'score_bound', 'reward']


Unnamed: 0,recipe_slug,score_bound,reward
0,apple-crumble,7.483315,0.634804
1,ma-la-chicken,7.234909,0.190494
2,quesadillas,7.180484,0.38151
3,ramen,7.177338,0.199238
4,chocolate-babka,6.930419,1.0


Submit "recommendations.tsv" in Gradescope.

## Part 7: Acknowledgments

Make a file "acknowledgments.txt" documenting any outside sources or help on this project.
If you discussed this assignment with anyone, please acknowledge them here.
If you used any libraries not mentioned in this module's content, please list them with a brief explanation what you used them for.
If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy.
If no acknowledgements are appropriate, just write none in the file.


In [25]:
ack = """DX704 Week 4 Project — Acknowledgments

Outside help:
- None beyond course materials and standard library/documentation.

Libraries used (beyond those explicitly mentioned in the module):
- pandas: data loading, joins/reshaping, writing TSVs.
- numpy: numeric arrays, linear algebra helpers.
- scikit-learn (sklearn.linear_model.Ridge): ridge regression with fit_intercept=False.

Generative AI usage:
- Tool: ChatGPT (GPT-5 Thinking) by OpenAI.
- Purpose: Assisted with drafting code cells for Parts 1–6 (data prep, ridge model, LinUCB bounds and online updates) and this acknowledgments file. I reviewed/edited code and verified outputs locally.


Notes:
- All final decisions, parameter choices, and submitted results are my own. I verified the correctness of the generated code and outputs before submission.
"""

with open("acknowledgments.txt", "w", encoding="utf-8") as f:
    f.write(ack)

print("Wrote acknowledgments.txt")


Wrote acknowledgments.txt


Submit "acknowledgments.txt" in Gradescope.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.


Submit "project.ipynb" in Gradescope.