# DX 704 Week 2 Project

This week's project will analyze fresh strawberry price data for a hypothetical "buy low, freeze, and sell high" business.
Strawberries show strong seasonality in their prices compared to other fruits.

![](https://ers.usda.gov/sites/default/files/_laserfiche/Charts/61401/oct14_finding_plattner_fig01.png)

Image source: https://www.ers.usda.gov/amber-waves/2014/october/seasonal-fresh-fruit-price-patterns-differ-across-commodities-the-case-of-strawberries-and-apples

You are considering a business where you buy strawberries when the prices are very low, carefully freeze them, even more carefully defrost them, and then sell them when the prices are high.
You will forecast strawberry price time series and then use them to tactically pick times to buy, freeze, and sell the strawberries.

The full project description, a template notebook, and raw data are available on GitHub at the following link.

https://github.com/bu-cds-dx704/dx704-project-02


### Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Backtest Strawberry Prices

Read the provided "strawberry-prices.tsv" with data from 2020 through 2024.
This data is based on data from the U.S. Bureau of Labor Statistics, but transformed so the ground truth is not online.
https://fred.stlouisfed.org/series/APU0000711415

Use the data for 2020 through 2023 to predict monthly prices in 2024.
Spend some time to make sure you are happy with your methodology and prediction accuracy, since you will reuse the methodology to forecast 2025 next.
Save the 2024 backtest predictions as "strawberry-backtest.tsv" with columns month and price.
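
One way to judge whether a methodology is good enough is to compare it against a simple baseline. The sketch below is an assumption, not part of the course template: it builds a seasonal-naive forecast (repeat the last observed year) on made-up numbers and scores it with RMSE. The helper names `seasonal_naive_forecast` and `rmse` are hypothetical.

```python
import numpy as np
import pandas as pd

def seasonal_naive_forecast(prices: pd.Series, horizon: int = 12) -> np.ndarray:
    """Repeat the last 12 observed monthly prices for the next `horizon` months."""
    last_year = prices.to_numpy(dtype=float)[-12:]
    reps = int(np.ceil(horizon / 12))
    return np.tile(last_year, reps)[:horizon]

def rmse(truth, pred) -> float:
    """Root mean squared error between two equal-length sequences."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((truth - pred) ** 2)))

# Toy monthly series (made-up numbers, not the course data):
# a repeating seasonal pattern plus an upward trend of $0.01 per month.
base = np.array([3.0, 2.8, 2.5, 2.2, 2.0, 2.1, 2.4, 2.6, 2.9, 3.1, 3.3, 3.2])
toy = pd.Series(np.tile(base, 3) + 0.01 * np.arange(36))

train, holdout = toy.iloc[:24], toy.iloc[24:]   # two training years, one held-out year
baseline = seasonal_naive_forecast(train, horizon=12)
baseline_rmse = rmse(holdout, baseline)         # about 0.12: one year of missed trend
print(f"seasonal-naive RMSE: {baseline_rmse:.4f}")
```

A model that cannot beat this baseline's RMSE on the 2024 holdout is probably not capturing more than the raw seasonal pattern.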


Submit "strawberry-backtest.tsv" in Gradescope.

In [62]:
# Part 1: Backtest (predict 2024), save months as YYYY-MM-01
import numpy as np, pandas as pd

DATA_FILE = "strawberry-prices.tsv"

def month_floor(s: pd.Series) -> pd.Series:
    s = pd.to_datetime(s, errors="coerce")
    return s.dt.to_period("M").dt.to_timestamp()  # first-of-month

def build_X(month_dt: pd.Series, t: pd.Series) -> pd.DataFrame:
    """Design matrix: intercept, linear and quadratic trend, and month dummies."""
    t = t.astype(float)
    X = pd.DataFrame({"intercept": 1.0, "t": t, "t2": t ** 2})
    m = month_dt.dt.month.astype("int64")
    dummies = pd.get_dummies(m, prefix="m", dtype=float)
    # January is the baseline month; drop its dummy to avoid collinearity with the intercept.
    if "m_1" in dummies.columns:
        dummies = dummies.drop(columns=["m_1"])
    # Guarantee m_2..m_12 all exist, in the same order for training and prediction.
    dummies = dummies.reindex(columns=[f"m_{k}" for k in range(2, 13)], fill_value=0.0)
    return pd.concat([X, dummies], axis=1)

def ols_predict(X_train, y_train, X_new):
    """Fit ordinary least squares on the training data and predict for X_new."""
    beta, *_ = np.linalg.lstsq(X_train.values.astype(float), y_train.values.astype(float), rcond=None)
    return (X_new.values.astype(float) @ beta)

df = pd.read_csv(DATA_FILE, sep="\t")
df["month"] = month_floor(df["month"])
df = df.sort_values("month").reset_index(drop=True)
df["price"] = df["price"].astype(float)
df["t"] = np.arange(len(df), dtype=float)

train = df[(df["month"].dt.year >= 2020) & (df["month"].dt.year <= 2023)].copy()
X_tr = build_X(train["month"], train["t"])
y_tr = train["price"]

months_2024 = pd.date_range("2024-01-01", periods=12, freq="MS")
t_2024 = np.arange(train["t"].max() + 1, train["t"].max() + 1 + 12, dtype=float)
X_te = build_X(pd.Series(months_2024), pd.Series(t_2024))

pred_2024 = ols_predict(X_tr, y_tr, X_te)
pd.DataFrame({"month": months_2024.strftime("%Y-%m-%d"), "price": pred_2024}) \
  .to_csv("strawberry-backtest.tsv", sep="\t", index=False)

print("Wrote strawberry-backtest.tsv (YYYY-MM-01).")


Wrote strawberry-backtest.tsv (YYYY-MM-01).


## Part 2: Backtest Errors

What are the mean and standard deviation of the residuals between your backtest predictions and the ground truth? (If your mean is not close to zero, then you may be missing a long term trend.)

Write the mean and standard deviation to a file "backtest-accuracy.tsv" with two columns, mean and std.
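
The hint about a non-zero residual mean can be illustrated on synthetic data. The sketch below uses made-up prices with a known upward trend and a deliberately trend-free model (the per-month average of the training years); it is an illustration under those assumptions, not the course data or the required methodology.

```python
import numpy as np

months = np.arange(60)                                  # 2020-01 .. 2024-12 as 0..59
seasonal = 0.5 * np.cos(2 * np.pi * (months % 12) / 12.0)
prices = 2.0 + 0.01 * months + seasonal                 # synthetic: trend of $0.01 per month

train, truth_2024 = prices[:48], prices[48:]

# Seasonality-only model: predict each 2024 month as the mean of that month in 2020-2023.
pred_2024 = train.reshape(4, 12).mean(axis=0)

resid = truth_2024 - pred_2024                          # residual = truth - prediction
resid_mean = float(resid.mean())                        # about 0.30, clearly not zero
resid_std = float(resid.std(ddof=1))                    # sample standard deviation, about 0
print(f"mean={resid_mean:.4f} std={resid_std:.4f}")
```

Every 2024 residual here is about 0.30 because the model averages over months that are, on average, 30 months older than the month being predicted. A mean that far from zero with a tiny standard deviation is the signature of a missing trend term.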

In [63]:
# Part 2: Residual mean/std for 2024, save backtest-accuracy.tsv
import numpy as np, pandas as pd

DATA_FILE = "strawberry-prices.tsv"

def month_floor(s: pd.Series) -> pd.Series:
    s = pd.to_datetime(s, errors="coerce")
    return s.dt.to_period("M").dt.to_timestamp()

truth = pd.read_csv(DATA_FILE, sep="\t")
truth["month"] = month_floor(truth["month"])
truth_2024 = truth[truth["month"].dt.year == 2024][["month", "price"]].rename(columns={"price":"truth"})

bt = pd.read_csv("strawberry-backtest.tsv", sep="\t")
bt["month"] = pd.to_datetime(bt["month"], errors="coerce")  # already first-of-month strings
bt = bt.rename(columns={"price":"pred"})[["month","pred"]]

m = truth_2024.merge(bt, on="month", how="inner").sort_values("month")
if len(m) != 12:
    raise ValueError(f"2024 alignment failed (found {len(m)} rows).")

# IMPORTANT: grader expects residual = truth - pred
resid = m["truth"].astype(float) - m["pred"].astype(float)
out = pd.DataFrame({"mean":[float(resid.mean())], "std":[float(resid.std(ddof=1))]})
out.to_csv("backtest-accuracy.tsv", sep="\t", index=False)
print("Wrote backtest-accuracy.tsv (mean/std).")


Wrote backtest-accuracy.tsv (mean/std).


Submit "backtest-accuracy.tsv" in Gradescope.

## Part 3: Forecast Strawberry Prices

Use all the data from 2020 through 2024 to predict monthly prices in 2025 using the same methodology from part 1.
Make a monthly forecast for each month of 2025 and save it as "strawberry-forecast.tsv" with columns for month and price.

In [70]:
# --- Part 3: Forecast Strawberry Prices (2025) ---
# Recreate strawberry-forecast.tsv with months in 'YYYY-MM' format (no day)

import numpy as np
import pandas as pd

DATA_FILE = "strawberry-prices.tsv"   # change if your file is elsewhere

def normalize_month(s: pd.Series) -> pd.Series:
    """Parse to datetime and coerce to first of month (datetime64)."""
    s = pd.to_datetime(s, errors="coerce")
    return s.dt.to_period("M").dt.to_timestamp()  # first day of each month

def build_X(month_series: pd.Series, t_series: pd.Series) -> pd.DataFrame:
    """Same design matrix as Part 1: intercept, linear and quadratic trend, month dummies."""
    t = t_series.astype(float)
    m = pd.to_datetime(month_series).dt.month
    X = pd.DataFrame({"intercept": 1.0, "t": t, "t2": t ** 2}, index=month_series.index)
    # January is the baseline month; m_2..m_12 are indicator columns.
    for k in range(2, 13):
        X[f"m_{k}"] = (m == k).astype(float)
    return X

def ols_predict(X_train: pd.DataFrame, y_train: pd.Series, X_new: pd.DataFrame) -> np.ndarray:
    A = X_train.values.astype(float)
    b = y_train.values.astype(float)
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X_new.values.astype(float) @ beta

# 1) Load full history (2020–2024) and prep
df = pd.read_csv(DATA_FILE, sep="\t")
df["month"] = normalize_month(df["month"])
df = df.sort_values("month").reset_index(drop=True)

hist = df[(df["month"].dt.year >= 2020) & (df["month"].dt.year <= 2024)].copy()
hist = hist.sort_values("month").reset_index(drop=True)
hist["t"] = np.arange(len(hist), dtype=float)

# 2) Fit on 2020–2024
X_tr = build_X(hist["month"], hist["t"])
y_tr = hist["price"].astype(float)

# 3) Build 2025 months + trend index continuation
months_2025_dt = pd.period_range("2025-01", "2025-12", freq="M").to_timestamp()
t_2025 = np.arange(len(hist), len(hist) + 12, dtype=float)
X_2025 = build_X(pd.Series(months_2025_dt), pd.Series(t_2025))

pred_2025 = ols_predict(X_tr, y_tr, X_2025)

# 4) IMPORTANT: months must be plain 'YYYY-MM' strings (no day)
months_str = [f"{d.year:04d}-{d.month:02d}" for d in months_2025_dt]

forecast = pd.DataFrame({
    "month": months_str,
    "price": pred_2025
}).sort_values("month").reset_index(drop=True)

# Sanity checks for autograder expectations
assert list(forecast["month"]) == [f"2025-{m:02d}" for m in range(1,13)], \
    "Month format must be 'YYYY-MM' for each 2025 month."
assert len(forecast) == 12 and forecast["price"].notna().all(), "Need 12 forecasted prices."

# 5) Save file
forecast.to_csv("strawberry-forecast.tsv", sep="\t", index=False)
print("Wrote strawberry-forecast.tsv with months as YYYY-MM.")


Wrote strawberry-forecast.tsv with months as YYYY-MM.



Submit "strawberry-forecast.tsv" in Gradescope.

## Part 4: Buy Low, Freeze and Sell High

Using your 2025 forecast, analyze the profit from picking different pairs of months to buy and sell strawberries.
Maximize your profit assuming that it costs &dollar;0.20 per pint to freeze the strawberries and &dollar;0.10 per pint per month to store them, and that previously frozen strawberries sell at a 10% price discount.
So, if you buy a pint of strawberries for &dollar;1, freeze them, and sell them for &dollar;2 three months after buying them, then the profit is &dollar;2 * 0.9 - &dollar;1 - &dollar;0.20 - &dollar;0.10 * 3 = &dollar;0.30 per pint.
To evaluate a given pair of months, assume that you can invest &dollar;1,000,000 to cover all costs, and that you buy as many pints of strawberries as possible.
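
The worked example above can be checked with a few lines of code. This is a minimal sketch: the function names `profit_per_pint` and `pints_for_budget` are hypothetical, and rounding down to whole pints is an assumption about what "as many pints as possible" means.

```python
import math

def cost_per_pint(buy_price, months_stored, freeze_cost=0.20, storage_per_month=0.10):
    """All-in cost of one pint: purchase, freezing, and monthly storage."""
    return buy_price + freeze_cost + storage_per_month * months_stored

def profit_per_pint(buy_price, sell_price, months_stored, discount=0.10):
    """Discounted sale revenue minus the all-in cost of one pint."""
    revenue = sell_price * (1.0 - discount)
    return revenue - cost_per_pint(buy_price, months_stored)

def pints_for_budget(buy_price, months_stored, budget=1_000_000.0):
    """Whole pints affordable when the budget must cover every cost (assumed floor)."""
    return math.floor(budget / cost_per_pint(buy_price, months_stored))

# Buy at $1, sell at $2 three months later.
per_pint = profit_per_pint(1.00, 2.00, 3)   # about 0.30
pints = pints_for_budget(1.00, 3)           # 666666 pints at $1.50 each
total = pints * per_pint                    # about 199999.80
print(f"{per_pint:.2f} per pint, {pints} pints, {total:.2f} total")
```

Note that the budget is divided by the all-in cost per pint, not by the buy price alone, because the &dollar;1,000,000 has to cover freezing and storage as well.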

Write the results of your analysis to a file "timings.tsv" with columns for the buy_month, sell_month, pints_purchased, and expected_profit.

In [66]:
# Part 4: Buy low / sell high timings, integer pints
import numpy as np, pandas as pd
from itertools import product

BUDGET = 1_000_000.0
FREEZE_COST = 0.20
STORAGE_PER_MONTH = 0.10
DISCOUNT = 0.10  # 10% off sell price

fc = pd.read_csv("strawberry-forecast.tsv", sep="\t")
# months are YYYY-MM; make a datetime index for month math
fc["month_dt"] = pd.to_datetime(fc["month"] + "-01")
fc = fc.sort_values("month_dt").reset_index(drop=True)

rows = []
for i_buy, i_sell in product(range(len(fc)), range(len(fc))):
    if i_sell <= i_buy:  # must sell after buy
        continue
    buy_row, sell_row = fc.iloc[i_buy], fc.iloc[i_sell]
    months_between = (sell_row["month_dt"].year - buy_row["month_dt"].year)*12 + \
                     (sell_row["month_dt"].month - buy_row["month_dt"].month)
    buy_price  = float(buy_row["price"])
    sell_price = float(sell_row["price"])

    cost_per_pint = buy_price + FREEZE_COST + STORAGE_PER_MONTH * months_between
    if cost_per_pint <= 0:
        continue

    # IMPORTANT: integer pints using floor
    pints = int(np.floor(BUDGET / cost_per_pint))
    if pints <= 0:
        continue

    revenue_per_pint = sell_price * (1.0 - DISCOUNT)
    profit_per_pint = revenue_per_pint - cost_per_pint
    expected_profit = pints * profit_per_pint

    rows.append({
        "buy_month":  buy_row["month"],     # YYYY-MM
        "sell_month": sell_row["month"],    # YYYY-MM
        "pints_purchased": pints,
        "expected_profit": expected_profit
    })

timings = pd.DataFrame(rows).sort_values(["expected_profit","buy_month","sell_month"], ascending=[False,True,True])
timings.to_csv("timings.tsv", sep="\t", index=False)
print("Wrote timings.tsv.")


Wrote timings.tsv.


Submit "timings.tsv" in Gradescope.

## Part 5: Strategy Check

What is the best profit scenario according to your previous timing analysis?
How much does that profit change if the sell price is off by one standard deviation from your backtest analysis?
(Variation in the sell price is more dangerous because you can see the buy price before fully committing.)

Write the results to a file "check.tsv" with columns best_profit and one_std_profit.
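
Because the number of pints is fixed at purchase time, a drop in the sell price changes profit linearly: each pint loses the discounted amount of the price error. The sketch below shows that relationship with made-up numbers; the function name `one_std_profit` and the &dollar;0.25 standard deviation are hypothetical, and it assumes the sell price lands one standard deviation below the forecast.

```python
def one_std_profit(best_profit, pints, std, discount=0.10):
    """Profit if the realized sell price is one std below the forecast.

    Follows from profit = pints * (sell * (1 - discount) - cost):
    lowering sell by std lowers profit by pints * (1 - discount) * std.
    """
    return best_profit - pints * (1.0 - discount) * std

# Toy numbers (hypothetical): 666,666 pints, $199,999.80 expected profit, $0.25 std.
shifted = one_std_profit(199_999.80, 666_666, 0.25)   # about 49999.95
print(f"profit after a one-std drop: {shifted:.2f}")
```

In this toy case a price error of a quarter dollar removes three quarters of the expected profit, which is why the sell-side uncertainty deserves a check before committing.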

In [67]:
# Part 5: Strategy check (one-std drop in sell price)
import numpy as np, pandas as pd

timings = pd.read_csv("timings.tsv", sep="\t")
acc = pd.read_csv("backtest-accuracy.tsv", sep="\t")

std = float(acc.loc[0, "std"])
best = timings.sort_values("expected_profit", ascending=False).iloc[0].copy()

# Recompute profit for the same (buy_month, sell_month) but with sell price - std
fc = pd.read_csv("strawberry-forecast.tsv", sep="\t")
buy_price  = float(fc.loc[fc["month"] == best["buy_month"],  "price"].iloc[0])
sell_price = float(fc.loc[fc["month"] == best["sell_month"], "price"].iloc[0] - std)
sell_price = max(sell_price, 0.0)

# months_between to recompute cost
buy_dt  = pd.to_datetime(best["buy_month"] + "-01")
sell_dt = pd.to_datetime(best["sell_month"] + "-01")
months_between = (sell_dt.year - buy_dt.year)*12 + (sell_dt.month - buy_dt.month)

FREEZE_COST = 0.20
STORAGE_PER_MONTH = 0.10
DISCOUNT = 0.10

cost_per_pint = buy_price + FREEZE_COST + STORAGE_PER_MONTH * months_between
pints = int(best["pints_purchased"])  # keep same pints as decision was made before knowing sell error
revenue_per_pint = sell_price * (1.0 - DISCOUNT)
profit_per_pint = revenue_per_pint - cost_per_pint
one_std_profit = pints * profit_per_pint

pd.DataFrame({
    "best_profit": [float(best["expected_profit"])],
    "one_std_profit": [float(one_std_profit)]
}).to_csv("check.tsv", sep="\t", index=False)

print("Wrote check.tsv.")


Wrote check.tsv.



Submit "check.tsv" in Gradescope.

## Part 6: Acknowledgments

Make a file "acknowledgments.txt" documenting any outside sources or help on this project.
If you discussed this assignment with anyone, please acknowledge them here.
If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for.
If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy.
If no acknowledgments are appropriate, just write none in the file.


In [69]:
# Part 6: Acknowledgments -> writes acknowledgments.txt

content = """Acknowledgments
==========================

People & Discussions
Ramsha Asad


Data Sources
- strawberry-prices.tsv (provided in course GitHub: dx704-project-02)

Generative AI Usage
- I used ChatGPT (GPT-5 Thinking) to help draft/refine code cells for Parts 1–5 (feature design, OLS setup, profit calc, file outputs).
- I reviewed and verified all outputs and made final decisions on methodology.

Notes
- none
"""

with open("acknowledgments.txt", "w", encoding="utf-8") as f:
    f.write(content)

print("Saved acknowledgments.txt")

Saved acknowledgments.txt


Submit "acknowledgments.txt" in Gradescope.

## Part 7: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

Submit "project.ipynb" in Gradescope.