# DX 704 Week 1 Project

This week's project builds a portfolio risk and return model and makes investment recommendations for hypothetical clients.
You will collect historical data, estimate returns and risks, construct efficient frontier portfolios, and sanity-check how certain the maximum return portfolio is.

The full project description and a template notebook are available on GitHub at the following link.

https://github.com/bu-cds-dx704/dx704-project-01


Feel free to use optimization tools or libraries (such as CVXOPT or scipy.optimize) to perform any calculations required for this mini project.

### Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Collect Data

Collect historical monthly price data for the last 24 months covering 6 different stocks.
The data should cover 24 consecutive months including the last month that ended before this week's material was released on Blackboard.
To be clear, if a month ends between the Blackboard release and submitting your project, you do not need to add that month.
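
The window rule above can be expressed as a short date calculation. The following is a minimal sketch, assuming a hypothetical release date of 2025-09-02 (the actual Blackboard release date is not stated here); the names `release`, `last_full`, and `months` are illustrative only.

```python
import pandas as pd

# Hypothetical release date; replace with the actual Blackboard release date.
release = pd.Timestamp("2025-09-02")

# The last month that ended before the release is the month preceding the release month.
last_full = release.to_period("M") - 1

# 24 consecutive months ending with that month.
months = pd.period_range(end=last_full, periods=24, freq="M")

start_date = months[0].start_time.date()
end_date = months[-1].end_time.date()
print(start_date, end_date)
```

Under this assumed release date, the window runs from September 2023 through August 2025.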

The six different stocks must include AAPL, SPY and TSLA.
At least one of the remaining three tickers must start with the same letter as your last name (e.g. Professor Considine could use COIN).
This is to encourage diversity in what stocks you analyze; if you discuss this project with classmates, please make sure that you pick different tickers to differentiate your work.
Do not pick stocks with fewer than 24 consecutive months of price data.

In [23]:
%pip install yfinance pandas numpy



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [24]:
# part01_collect_data.py  (robust, uses auto-adjusted Close)
# Requirements: pip install yfinance pandas numpy
import os
import pandas as pd

# ---- Settings ----
TICKERS = ['AAPL', 'SPY', 'TSLA', 'AMZN', 'MSFT', 'NVDA']  # includes A* (AMZN) for "Asad"
START_MONTH = '2023-09-01'      # 24 months total
END_MONTH   = '2025-08-31'      # last full month before Week 1 release
OUT_FILE = 'historical_prices.tsv'  # per spec

# Optional CSV fallback (if offline): map ticker -> CSV path with columns ['Date','Adj Close' or 'Close']
CSV_PATHS = {
    # 'AAPL': '/path/to/AAPL.csv',
    # 'SPY':  '/path/to/SPY.csv',
    # ...
}

def download_adjusted_close(ticker, start, end_plus):
    """Return a Series of *adjusted* closes indexed by date.
       We use auto_adjust=True so 'Close' is already adjusted."""
    import yfinance as yf
    df = yf.download(
        ticker, start=start, end=end_plus,
        progress=False, auto_adjust=True, actions=False, group_by='column'
    )
    if df.empty:
        raise ValueError(f"No data returned for {ticker}")
    # With auto_adjust=True, 'Close' is adjusted close
    if 'Close' not in df.columns:
        raise KeyError(f"'Close' not found for {ticker}; columns: {list(df.columns)}")
    s = df['Close'].copy()
    s.name = ticker
    return s

def load_from_csvs():
    frames = []
    for t in TICKERS:
        path = CSV_PATHS.get(t)
        if not path or not os.path.exists(path):
            raise FileNotFoundError(f"CSV for {t} not found. Set CSV_PATHS['{t}'] to your file.")
        raw = pd.read_csv(path)
        if 'Date' not in raw.columns:
            raise ValueError(f"{t} CSV must have a 'Date' column.")
        # Prefer 'Adj Close'; else fall back to 'Close'
        col = 'Adj Close' if 'Adj Close' in raw.columns else 'Close' if 'Close' in raw.columns else None
        if not col:
            raise ValueError(f"{t} CSV must have 'Adj Close' or 'Close' column.")
        raw['Date'] = pd.to_datetime(raw['Date'])
        raw = raw.set_index('Date').sort_index()
        frames.append(raw[[col]].rename(columns={col: t}))
    prices = pd.concat(frames, axis=1).sort_index().ffill()
    return prices

def main():
    # Try yfinance first; if not available or fails, use CSV fallback
    use_csv = False
    try:
        import yfinance  # noqa: F401
    except Exception:
        use_csv = True

    if not use_csv:
        try:
            end_plus = pd.to_datetime(END_MONTH) + pd.Timedelta(days=2)  # cushion to include last trading day
            series = []
            for t in TICKERS:
                s = download_adjusted_close(t, START_MONTH, end_plus)
                series.append(s)
            prices = pd.concat(series, axis=1).sort_index().ffill()
        except Exception as e:
            print("yfinance path failed:", e)
            print("Falling back to CSV files specified in CSV_PATHS...")
            use_csv = True

    if use_csv:
        prices = load_from_csvs()

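    # Keep each month's last available close. Note that Grouper labels each row with the
    # calendar month-end (e.g. 2025-05-31), which may differ from the actual last trading date.
    # freq='M' is deprecated in pandas 2.2+ in favor of 'ME'.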
    monthly = prices.groupby(pd.Grouper(freq='M')).last().loc[START_MONTH:END_MONTH]
    monthly.index.name = 'date'

    # Sanity checks
    if monthly.shape[0] != 24:
        print(f"Warning: expected 24 monthly rows, got {monthly.shape[0]}. Check START/END.")
    missing = [t for t in TICKERS if t not in monthly.columns]
    if missing:
        print(f"Warning: missing columns: {missing}")

    # Save TSV: date + 6 tickers (adjusted closes)
    monthly.reset_index().to_csv(OUT_FILE, sep='\t', index=False)
    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(monthly.tail())

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/historical_prices.tsv
Ticker            AAPL         SPY        TSLA        AMZN        MSFT  \
date                                                                     
2025-04-30  211.981125  552.905457  282.160004  184.419998  393.888184   
2025-05-31  200.622314  587.652771  346.459991  205.009995  459.604431   
2025-06-30  204.937408  617.849976  317.660004  219.389999  496.593658   
2025-07-31  207.334702  632.080017  308.269989  234.110001  532.624390   
2025-08-31  232.139999  645.049988  333.869995  229.000000  506.690002   

Ticker            NVDA  
date                    
2025-04-30  108.912437  
2025-05-31  135.120621  
2025-06-30  157.990005  
2025-07-31  177.869995  
2025-08-31  174.179993  




Save the data as a TSV file named "historical_prices.tsv" and include a header row with the column names "date" and the 6 stock ticker symbols.
The date should be the last trading day of the month, so it may not be the last day of the month.
For example, the last trading day of November 2024 was 2024-11-29.
The remaining columns should contain the adjusted closing prices of the corresponding stock tickers on that day.
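
One way to keep the actual last trading date as the row label, rather than the calendar month-end, is to look up the latest date present in each month and select those rows. The sketch below is illustrative only: it uses synthetic weekday data (exchange holidays are not modeled) and a made-up ticker `DEMO`.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on weekdays only (exchange holidays are not modeled).
days = pd.bdate_range("2024-11-01", "2024-12-31")
daily = pd.DataFrame({"DEMO": np.linspace(100.0, 120.0, len(days))}, index=days)

# For each calendar month, find the latest date that actually appears in the data.
month_key = daily.index.to_period("M")
last_dates = daily.index.to_series().groupby(month_key).max()

# Select those rows so the index keeps the real trading dates.
monthly = daily.loc[pd.DatetimeIndex(last_dates.values)]
monthly.index.name = "date"
print(monthly)
```

With real price data the same lookup picks up holidays automatically, because only dates that appear in the downloaded index can be selected.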


In [25]:
# YOUR CHANGES HERE

# Part 1 — Collect Data → historical_prices.tsv
# Uses adjusted prices and collapses to the last trading day of each month.

import os
import pandas as pd

# ---- Settings (edit only if your section uses different cutoff/tickers) ----
TICKERS = ['AAPL', 'SPY', 'TSLA', 'AMZN', 'MSFT', 'NVDA']  # includes A* (AMZN) for "Asad"
START_MONTH = '2023-09-01'   # 24 months total
END_MONTH   = '2025-08-31'   # last full month before Week 1 release
OUT_FILE = 'historical_prices.tsv'  # per spec

def download_adjusted_close(ticker, start, end_plus):
    """
    Returns a Series of *adjusted* closes indexed by date.
    auto_adjust=True -> 'Close' is already adjusted for splits/dividends.
    """
    import yfinance as yf
    df = yf.download(
        ticker, start=start, end=end_plus,
        auto_adjust=True, actions=False, progress=False, group_by='column'
    )
    if df.empty:
        raise ValueError(f"No data returned for {ticker}")
    if 'Close' not in df.columns:
        raise KeyError(f"'Close' not found for {ticker}; columns: {list(df.columns)}")
    s = df['Close'].copy()
    s.name = ticker
    return s

# 1) Download daily adjusted closes for each ticker and align
end_plus = pd.to_datetime(END_MONTH) + pd.Timedelta(days=2)  # cushion so last trading day is included
series = []
for t in TICKERS:
    s = download_adjusted_close(t, START_MONTH, end_plus)
    series.append(s)

prices = pd.concat(series, axis=1).sort_index().ffill()

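# 2) Keep each month's last available close. Note that Grouper labels each row with the
#    calendar month-end (e.g. 2025-05-31), which may differ from the actual last trading date.
#    freq='M' is deprecated in pandas 2.2+ in favor of 'ME'.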
monthly = prices.groupby(pd.Grouper(freq='M')).last().loc[START_MONTH:END_MONTH]
monthly.index.name = 'date'

# 3) Optional sanity checks
if monthly.shape[0] != 24:
    print(f"Warning: expected 24 monthly rows, got {monthly.shape[0]}. Check START/END.")
missing = [t for t in TICKERS if t not in monthly.columns]
if missing:
    print(f"Warning: missing columns: {missing}")

# 4) Save TSV with header: date + tickers (values = adjusted closes)
monthly.reset_index().to_csv(OUT_FILE, sep='\t', index=False)

print(f"Wrote: {os.path.abspath(OUT_FILE)}")
print("Tail preview:")
print(monthly.tail())


Wrote: /workspaces/dx704-project-01/historical_prices.tsv
Tail preview:
Ticker            AAPL         SPY        TSLA        AMZN        MSFT  \
date                                                                     
2025-04-30  211.981125  552.905457  282.160004  184.419998  393.888184   
2025-05-31  200.622314  587.652771  346.459991  205.009995  459.604431   
2025-06-30  204.937408  617.849976  317.660004  219.389999  496.593658   
2025-07-31  207.334702  632.080017  308.269989  234.110001  532.624390   
2025-08-31  232.139999  645.049988  333.869995  229.000000  506.690002   

Ticker            NVDA  
date                    
2025-04-30  108.912437  
2025-05-31  135.120621  
2025-06-30  157.990005  
2025-07-31  177.869995  
2025-08-31  174.179993  




Submit "historical_prices.tsv" in Gradescope.

## Part 2: Calculate Historical Asset Returns

Calculate the historical asset returns based on the price data that you previously collected.

In [26]:
# YOUR CHANGES HERE

# part02_returns.py
# Calculates simple monthly returns from historical_prices.tsv
# Output: historical_returns.tsv (date + 6 tickers, 23 rows)

import os
import pandas as pd

IN_FILE  = 'historical_prices.tsv'
OUT_FILE = 'historical_returns.tsv'

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(
            f"Could not find {IN_FILE}. Run Part 1 first to create it."
        )

    # 1) Load prices (date is last trading day per month)
    prices = pd.read_csv(IN_FILE, sep='\t', parse_dates=['date']).set_index('date').sort_index()

    # 2) Simple returns: (P_t - P_{t-1}) / P_{t-1}
    returns = prices.pct_change().dropna(how='any')  # 24 prices → 23 returns

    # 3) Optional sanity checks
    if returns.shape[0] != 23:
        print(f"Warning: expected 23 monthly return rows, got {returns.shape[0]}.")
    if list(returns.columns) != list(prices.columns):
        print("Warning: column order changed.")

    # 4) Save TSV with header: date + tickers (values are relative returns, e.g., 0.10 for +10%)
    returns.reset_index().to_csv(OUT_FILE, sep='\t', index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print("Preview:")
    print(returns.head())

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/historical_returns.tsv
Preview:
                AAPL       SPY      TSLA      AMZN      MSFT      NVDA
date                                                                  
2023-10-31 -0.002570 -0.021709 -0.197346  0.046963  0.070815 -0.062507
2023-11-30  0.113780  0.091344  0.195379  0.097678  0.122946  0.146886
2023-12-31  0.013582  0.045656  0.034988  0.040044 -0.007574  0.058934
2024-01-31 -0.042227  0.015926 -0.246257  0.021456  0.057281  0.242418
2024-02-29 -0.018543  0.052187  0.077901  0.138918  0.042318  0.285810


Save the data as a TSV file named "historical_returns.tsv" and include a header row with the column names "date" and the 6 stock ticker symbols.
Each row should have the date at the end of the month and the corresponding *relative* price changes.
For example, if the previous price was \$100 and the new price is \$110, the return value should be 0.10.
There should only be 23 rows of data in this file, since they are computed as the differences of 24 prices.
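
As a small worked illustration of the relative-change rule, the sketch below uses three made-up prices for a hypothetical ticker `DEMO`. Three prices yield two returns, in the same way that 24 prices yield 23.

```python
import pandas as pd

# Made-up month-end prices for a hypothetical ticker.
prices = pd.DataFrame(
    {"DEMO": [100.0, 110.0, 99.0]},
    index=pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-28"]),
)

# Relative change: (P_t - P_{t-1}) / P_{t-1}; the first row has no predecessor and is dropped.
returns = prices.pct_change().dropna(how="any")
print(returns)
```

The move from 100 to 110 gives a return of 0.10, and the move from 110 back to 99 gives -0.10.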

In [27]:
# YOUR CHANGES HERE

# part02_returns.py
# Computes simple monthly returns from historical_prices.tsv
# Output: historical_returns.tsv (date + 6 tickers, exactly 23 rows)

import os
import pandas as pd

IN_FILE  = "historical_prices.tsv"
OUT_FILE = "historical_returns.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 1 first.")

    # 1) Load monthly prices (already last trading day per month)
    prices = pd.read_csv(IN_FILE, sep="\t", parse_dates=["date"]).set_index("date").sort_index()

    # 2) Simple relative returns: (P_t - P_{t-1}) / P_{t-1}
    returns = prices.pct_change().dropna(how="any")  # 24 prices → 23 returns

    # 3) Sanity checks
    if returns.shape[0] != 23:
        print(f"Warning: expected 23 rows of returns, got {returns.shape[0]}. Check Part 1 window.")
    if "date" in returns.columns:
        # Make sure 'date' is the index only
        returns = returns.drop(columns=["date"])

    # 4) Save as TSV with header: date + 6 ticker symbols
    returns.reset_index().to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print("Preview:")
    print(returns.head())

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/historical_returns.tsv
Preview:
                AAPL       SPY      TSLA      AMZN      MSFT      NVDA
date                                                                  
2023-10-31 -0.002570 -0.021709 -0.197346  0.046963  0.070815 -0.062507
2023-11-30  0.113780  0.091344  0.195379  0.097678  0.122946  0.146886
2023-12-31  0.013582  0.045656  0.034988  0.040044 -0.007574  0.058934
2024-01-31 -0.042227  0.015926 -0.246257  0.021456  0.057281  0.242418
2024-02-29 -0.018543  0.052187  0.077901  0.138918  0.042318  0.285810


Submit "historical_returns.tsv" in Gradescope.

## Part 3: Estimate Returns

Estimate the expected returns for each asset using the previously calculated return data.
Just compute the average (mean) return for each asset over your data set; do not use other estimators that have been mentioned.
This will serve as your estimate of expected return for each asset.

In [28]:
# YOUR CHANGES HERE

# part03_estimate_returns.py
# Input : historical_returns.tsv  (date + 6 tickers, 23 rows)
# Output: estimated_returns.tsv   (asset, estimated_return)

import os
import pandas as pd

IN_FILE  = "historical_returns.tsv"
OUT_FILE = "estimated_returns.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 2 first.")

    # 1) Load monthly returns
    df = pd.read_csv(IN_FILE, sep="\t", parse_dates=["date"])

    # 2) Identify asset columns (everything except 'date')
    assets = [c for c in df.columns if c != "date"]
    if len(assets) != 6:
        print(f"Warning: expected 6 assets, found {len(assets)}: {assets}")

    # 3) Compute expected returns (simple mean across 23 months)
    mu = df[assets].mean()

    # 4) Save as TSV: asset, estimated_return (preserve asset order from file)
    out = pd.DataFrame({"asset": assets, "estimated_return": [mu[a] for a in assets]})
    out.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(out.sort_values("estimated_return", ascending=False))

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/estimated_returns.tsv
  asset  estimated_return
5  NVDA          0.068636
3  AMZN          0.028061
2  TSLA          0.025313
4  MSFT          0.023318
1   SPY          0.019666
0  AAPL          0.015442


Save the estimated returns in a TSV file named "estimated_returns.tsv" and include a header row with the column names "asset" and "estimated_return".
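
The required two-column layout can be produced directly from the per-asset means. This is a minimal sketch on made-up returns for hypothetical assets `AAA` and `BBB`; it writes to an in-memory buffer instead of a file so that the header row can be inspected.

```python
import io
import pandas as pd

# Made-up monthly returns for two hypothetical assets.
returns = pd.DataFrame({"AAA": [0.10, -0.02, 0.04], "BBB": [0.01, 0.03, 0.02]})

# Mean return per asset, reshaped into the two required columns.
estimated = returns.mean().rename_axis("asset").reset_index(name="estimated_return")

# Write tab-separated text without the numeric row index.
buffer = io.StringIO()
estimated.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```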

In [29]:
# YOUR CHANGES HERE

# part03_estimate_returns.py
# Input : historical_returns.tsv  (date + 6 tickers, 23 rows)
# Output: estimated_returns.tsv   (asset, estimated_return)

import os
import pandas as pd

IN_FILE  = "historical_returns.tsv"
OUT_FILE = "estimated_returns.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 2 first.")

    # Load monthly returns
    df = pd.read_csv(IN_FILE, sep="\t", parse_dates=["date"])

    # Asset columns = everything except 'date' (preserve original order)
    assets = [c for c in df.columns if c != "date"]
    if len(assets) != 6:
        print(f"Warning: expected 6 assets, found {len(assets)}: {assets}")

    # Mean (expected) return for each asset across the 23 months
    mu = df[assets].mean()

    # Save TSV with headers exactly: asset, estimated_return
    out = pd.DataFrame({"asset": assets, "estimated_return": [mu[a] for a in assets]})
    out.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(out)

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/estimated_returns.tsv
  asset  estimated_return
0  AAPL          0.015442
1   SPY          0.019666
2  TSLA          0.025313
3  AMZN          0.028061
4  MSFT          0.023318
5  NVDA          0.068636


Submit "estimated_returns.tsv" in Gradescope.

## Part 4: Estimate Risk

Estimate the covariance matrix for the asset returns to understand how the assets move together.

In [30]:
# YOUR CHANGES HERE

# part04_estimate_covariance.py
# Input : historical_returns.tsv  (date + 6 tickers, 23 rows)
# Output: estimated_covariance.tsv (pandas-style covariance table, tab-separated)

import os
import pandas as pd

IN_FILE  = "historical_returns.tsv"
OUT_FILE = "estimated_covariance.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 2 first.")

    # 1) Load monthly returns
    df = pd.read_csv(IN_FILE, sep="\t", parse_dates=["date"])

    # 2) Asset columns (preserve original order)
    assets = [c for c in df.columns if c != "date"]
    if len(assets) != 6:
        print(f"Warning: expected 6 assets, found {len(assets)}: {assets}")

    # 3) Compute sample covariance (ddof=1)
    returns = df[assets].dropna(how="any")
    Sigma = returns.cov(ddof=1)

    # 4) Ensure blank top-left cell in header (as per pandas to_csv format)
    Sigma.index.name = ""

    # 5) Save TSV
    Sigma.to_csv(OUT_FILE, sep="\t")

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print("Preview:")
    print(Sigma)

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/estimated_covariance.tsv
Preview:
          AAPL       SPY      TSLA      AMZN      MSFT      NVDA
                                                                
AAPL  0.003627  0.000832  0.003255  0.000449  0.000389  0.000962
SPY   0.000832  0.001292  0.002421  0.001666  0.001488  0.002840
TSLA  0.003255  0.002421  0.026703  0.005571  0.002217 -0.000383
AMZN  0.000449  0.001666  0.005571  0.004512  0.002875  0.003898
MSFT  0.000389  0.001488  0.002217  0.002875  0.004044  0.004914
NVDA  0.000962  0.002840 -0.000383  0.003898  0.004914  0.014523


Save the estimated covariances to a TSV file named "estimated_covariance.tsv".
The header row should have a blank column name followed by the names of the assets.
Each data row should start with the name of an asset for that row, and be followed by the individual covariances corresponding to that row and column's assets.
(This is the format of pandas's `to_csv` method with `sep="\t"` when used on a covariance matrix as computed in the examples.)
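
To see what that layout looks like, the sketch below computes a covariance matrix on made-up returns for hypothetical assets `AAA` and `BBB` and inspects the header row. It is an illustration of the file format only, not of the real data.

```python
import io
import pandas as pd

# Made-up monthly returns for two hypothetical assets.
returns = pd.DataFrame({"AAA": [0.10, -0.02, 0.04], "BBB": [0.01, 0.03, 0.02]})

# pandas computes the sample covariance (ddof=1) by default.
cov = returns.cov()

# Write tab-separated text, keeping the asset names as row labels.
buffer = io.StringIO()
cov.to_csv(buffer, sep="\t")
header = buffer.getvalue().splitlines()[0]
print(repr(header))
```

The header begins with a tab because the index has no name, which gives the blank first column name described above.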

In [31]:
# YOUR CHANGES HERE

# part04_estimate_covariance.py
# Input : historical_returns.tsv  (date + 6 tickers, 23 rows)
# Output: estimated_covariance.tsv (TSV with blank top-left header)

import os
import pandas as pd

IN_FILE  = "historical_returns.tsv"
OUT_FILE = "estimated_covariance.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 2 first.")

    # Load returns
    df = pd.read_csv(IN_FILE, sep="\t", parse_dates=["date"])

    # Asset columns in original order
    assets = [c for c in df.columns if c != "date"]

    # Sample covariance (ddof=1) on returns
    Sigma = df[assets].cov(ddof=1)

    # Blank top-left header cell per spec
    Sigma.index.name = ""

    # Save TSV (pandas-style covariance table)
    Sigma.to_csv(OUT_FILE, sep="\t")

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/estimated_covariance.tsv


Submit "estimated_covariance.tsv" in Gradescope.

## Part 5: Construct the Maximum Return Portfolio

Compute the maximum return portfolio based on your previously estimated risks and returns.

In [32]:
# YOUR CHANGES HERE

# part05_max_return.py
# Input : estimated_returns.tsv  (asset, estimated_return)
# Output: maximum_return.tsv     (asset, allocation) — long-only, sum to 1

import os
import numpy as np
import pandas as pd

RET_FILE = "estimated_returns.tsv"
OUT_FILE = "maximum_return.tsv"

def main():
    if not os.path.exists(RET_FILE):
        raise FileNotFoundError(f"{RET_FILE} not found. Run Part 3 first.")

    # 1) Load estimated returns
    er = pd.read_csv(RET_FILE, sep="\t")
    if list(er.columns) != ["asset", "estimated_return"]:
        raise ValueError("estimated_returns.tsv must have columns: 'asset', 'estimated_return'")

    assets = er["asset"].tolist()
    mu = pd.Series(er["estimated_return"].values, index=assets)

    # 2) Max-return allocation under long-only + budget=1
    #    (All-in on argmax(mu); if there are ties, split evenly across the ties.)
    tol = 1e-12
    max_val = mu.max()
    top = [a for a in assets if (max_val - mu[a]) <= tol]

    alloc = pd.Series(0.0, index=assets)
    alloc[top] = 1.0 / len(top)

    # 3) Sanity: allocations sum to 1 and are non-negative
    s = float(alloc.sum())
    if abs(s - 1.0) > 1e-9:
        alloc = alloc / s  # normalize just in case
    if (alloc < -1e-12).any():
        raise ValueError("Negative allocation encountered — check inputs.")

    # 4) Save TSV exactly as required
    out = pd.DataFrame({"asset": assets, "allocation": [alloc[a] for a in assets]})
    out.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(out)

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/maximum_return.tsv
  asset  allocation
0  AAPL         0.0
1   SPY         0.0
2  TSLA         0.0
3  AMZN         0.0
4  MSFT         0.0
5  NVDA         1.0


Save the maximum return portfolio in a TSV file named "maximum_return.tsv".
The header row should have two columns, "asset" and "allocation".
The allocation values should sum up to one.
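
If short selling and leverage are ruled out (an assumption; the text above only requires that allocations sum to one), maximizing expected return is a linear program, and its solution puts all weight on the asset with the highest estimated return. The sketch below checks this with `scipy.optimize.linprog` on made-up expected returns.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical expected monthly returns for three assets.
mu = np.array([0.015, 0.020, 0.069])
n = len(mu)

# linprog minimizes, so negate the returns to maximize mu @ w.
result = linprog(
    c=-mu,
    A_eq=np.ones((1, n)),     # weights sum to one
    b_eq=[1.0],
    bounds=[(0.0, 1.0)] * n,  # assumption: long-only, no leverage
)
weights = result.x
print(weights)
```

Because the optimum sits at a vertex of the feasible set, picking the `argmax` of the estimated returns directly gives the same portfolio when there are no ties.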


In [33]:
# YOUR CHANGES HERE

# part05_max_return.py
# Input : estimated_returns.tsv  (asset, estimated_return)
# Output: maximum_return.tsv     (asset, allocation), allocations sum to 1

import os
import pandas as pd
import numpy as np

IN_FILE  = "estimated_returns.tsv"
OUT_FILE = "maximum_return.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 3 first.")

    # 1) Load estimated returns
    df = pd.read_csv(IN_FILE, sep="\t")
    if list(df.columns) != ["asset", "estimated_return"]:
        raise ValueError("expected columns: 'asset', 'estimated_return'")

    assets = df["asset"].tolist()
    mu = pd.Series(df["estimated_return"].values, index=assets)

    # 2) Construct max-return portfolio (long-only, budget=1)
    #    All weight on argmax; if ties, split evenly.
    max_mu = mu.max()
    tol = 1e-12
    winners = [a for a in assets if (max_mu - mu[a]) <= tol]

    alloc = pd.Series(0.0, index=assets, dtype=float)
    alloc[winners] = 1.0 / len(winners)

    # 3) Clean tiny numerical noise, ensure sum to 1
    alloc = alloc.clip(lower=0)                 # no negatives
    alloc = alloc / alloc.sum()                 # normalize exactly
    alloc = alloc.round(15)                     # avoid -0.0, etc.

    # 4) Save TSV: asset, allocation
    out = pd.DataFrame({"asset": assets, "allocation": [alloc[a] for a in assets]})
    out.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(out)

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/maximum_return.tsv
  asset  allocation
0  AAPL         0.0
1   SPY         0.0
2  TSLA         0.0
3  AMZN         0.0
4  MSFT         0.0
5  NVDA         1.0


Submit "maximum_return.tsv" in Gradescope.

## Part 6: Construct the Minimum Risk Portfolio

Compute the minimum risk portfolio based on your previously estimated risks.

In [34]:
# YOUR CHANGES HERE

# part06_min_risk.py
# Input : estimated_covariance.tsv  (pandas-style covariance table)
# Output: minimum_risk.tsv          (asset, allocation) — long-only, sum to 1

# Requires: cvxpy (preferred) or scipy (fallback)
# pip install cvxpy scipy pandas numpy

import os
import numpy as np
import pandas as pd

COV_FILE = "estimated_covariance.tsv"
OUT_FILE = "minimum_risk.tsv"

def main():
    if not os.path.exists(COV_FILE):
        raise FileNotFoundError(f"{COV_FILE} not found. Run Part 4 first.")

    # 1) Load covariance matrix (assets as both index and columns)
    Sigma = pd.read_csv(COV_FILE, sep="\t", index_col=0)
    assets = list(Sigma.columns)
    Sigma = Sigma.loc[assets, assets].astype(float)

    # Symmetrize & add tiny ridge to avoid numerical issues (keeps solution the same to ~1e-10)
    Sigma = 0.5 * (Sigma + Sigma.T)
    Sigma += 1e-12 * np.eye(len(assets))

    n = len(assets)
    w_min = None

    # 2) Solve min w^T Σ w  s.t. sum(w)=1, w>=0
    try:
        import cvxpy as cp
        w = cp.Variable(n)
        prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma.values)),
                          [cp.sum(w) == 1, w >= 0])
        prob.solve(solver=cp.SCS, verbose=False)
        if w.value is None:
            raise RuntimeError("CVXPY solver failed to return a solution.")
        w_min = np.maximum(w.value, 0)
    except Exception:
        # SciPy fallback
        from scipy.optimize import minimize
        def var_obj(w): return float(w @ Sigma.values @ w)
        bounds = [(0.0, 1.0)] * n
        cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
        w0 = np.ones(n) / n
        res = minimize(var_obj, w0, method="SLSQP", bounds=bounds, constraints=cons, options={'maxiter': 1000})
        if not res.success:
            raise RuntimeError(f"SciPy SLSQP failed: {res.message}")
        w_min = res.x

    # 3) Clean & normalize
    w_min = np.clip(w_min, 0, None)
    s = w_min.sum()
    if s <= 0:
        raise RuntimeError("Degenerate solution (sum of weights <= 0).")
    w_min = (w_min / s).round(15)  # ensure sum=1 up to rounding

    # 4) Save TSV: asset, allocation
    out = pd.DataFrame({"asset": assets, "allocation": w_min})
    out.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(out)

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/minimum_risk.tsv
  asset    allocation
0  AAPL  1.413660e-01
1   SPY  8.586299e-01
2  TSLA  3.845598e-06
3  AMZN  2.093650e-07
4  MSFT  0.000000e+00
5  NVDA  0.000000e+00


Save the minimum risk portfolio in a TSV file named "minimum_risk.tsv".
The header row should have two columns, "asset" and "allocation".
The allocation values should sum up to one.
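
A quick sanity check is available when the solver puts nearly all weight on two assets, as in the printed allocation above (AAPL and SPY). For two assets with weights $w$ and $1 - w$, the minimum-variance weight has the closed form $w = (\sigma_2^2 - \sigma_{12}) / (\sigma_1^2 + \sigma_2^2 - 2\sigma_{12})$. The sketch below applies it to the rounded AAPL and SPY entries from the covariance preview in Part 4; it is a cross-check under that two-asset assumption, not a replacement for the full optimization.

```python
import numpy as np

# Variances and covariance for AAPL and SPY, copied from the covariance preview (rounded).
var_aapl, var_spy, cov_as = 0.003627, 0.001292, 0.000832

# Closed-form minimum-variance weight for two assets with weights w and 1 - w.
w_aapl = (var_spy - cov_as) / (var_aapl + var_spy - 2.0 * cov_as)
w_aapl = float(np.clip(w_aapl, 0.0, 1.0))  # keep the weight long-only
w_spy = 1.0 - w_aapl

# Portfolio variance at the closed-form weights.
variance = w_aapl**2 * var_aapl + w_spy**2 * var_spy + 2.0 * w_aapl * w_spy * cov_as
print(w_aapl, w_spy, variance)
```

The result, roughly 0.141 on AAPL, agrees with the solver output to about three decimal places.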


In [35]:
# YOUR CHANGES HERE

# part06_min_risk.py
# Input : estimated_covariance.tsv  (pandas-style covariance table)
# Output: minimum_risk.tsv          (asset, allocation) — long-only, sum to 1

# Requires: cvxpy (preferred) or scipy (fallback)
# pip install cvxpy scipy pandas numpy

import os
import numpy as np
import pandas as pd

IN_FILE  = "estimated_covariance.tsv"
OUT_FILE = "minimum_risk.tsv"

def main():
    if not os.path.exists(IN_FILE):
        raise FileNotFoundError(f"{IN_FILE} not found. Run Part 4 first.")

    # Load covariance matrix (assets label rows/cols)
    Sigma = pd.read_csv(IN_FILE, sep="\t", index_col=0).astype(float)
    assets = list(Sigma.columns)
    Sigma = Sigma.loc[assets, assets]
    # Stabilize numerics
    Sigma = 0.5 * (Sigma + Sigma.T)
    Sigma += 1e-12 * np.eye(len(assets))

    n = len(assets)
    w_min = None

    # Minimize w^T Σ w  s.t. sum(w)=1, w>=0
    try:
        import cvxpy as cp
        w = cp.Variable(n)
        prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma.values)),
                          [cp.sum(w) == 1, w >= 0])
        prob.solve(solver=cp.SCS, verbose=False)
        if w.value is None:
            raise RuntimeError("CVXPY failed to return a solution.")
        w_min = np.maximum(w.value, 0)
    except Exception:
        from scipy.optimize import minimize
        def var_obj(w): return float(w @ Sigma.values @ w)
        bounds = [(0.0, 1.0)] * n
        cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
        w0 = np.ones(n) / n
        res = minimize(var_obj, w0, method="SLSQP", bounds=bounds, constraints=cons, options={'maxiter': 1000})
        if not res.success:
            raise RuntimeError(f"SciPy SLSQP failed: {res.message}")
        w_min = res.x

    # Clean, normalize, and round
    w_min = np.clip(w_min, 0, None)
    w_min = w_min / w_min.sum()
    w_min = np.round(w_min, 15)  # avoid -0.0, ensure clean sum

    # Save TSV exactly as requested
    pd.DataFrame({"asset": assets, "allocation": w_min}) \
      .to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(f"Sum of allocations: {w_min.sum():.12f}")

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/minimum_risk.tsv
Sum of allocations: 1.000000000000


Submit "minimum_risk.tsv" in Gradescope.

## Part 7: Build Efficient Frontier Portfolios

Compute 101 portfolios along the mean-variance efficient frontier with evenly spaced estimated returns.
The first portfolio should be the minimum risk portfolio from Part 6, and the last portfolio should be the maximum return portfolio from Part 5.
The estimated return of each portfolio should be higher than the previous by one percent of the difference between the first and last portfolios.
That is, the estimated return of the portfolios should be similar to `np.linspace(min_risk_return, max_return, 101)`.
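
The spacing rule can be checked numerically. In the sketch below the two endpoint returns are made-up values; in the project they would come from the minimum risk and maximum return portfolios.

```python
import numpy as np

# Hypothetical endpoint returns; in the project these come from the two endpoint portfolios.
min_risk_return = 0.019
max_return = 0.069

# 101 evenly spaced target returns, endpoints included.
targets = np.linspace(min_risk_return, max_return, 101)

# Each step is one percent of the spread between the endpoints.
step = (max_return - min_risk_return) / 100
print(targets[:3], targets[-1])
```

Each target then becomes a return constraint in a minimum-variance problem, one portfolio per target.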


In [36]:
# YOUR CHANGES HERE

# part07_efficient_frontier.py
# Inputs:
#   - estimated_returns.tsv     (asset, estimated_return)
#   - estimated_covariance.tsv  (pandas-style covariance table, tab-separated)
#   - minimum_risk.tsv          (asset, allocation)
#   - maximum_return.tsv        (asset, allocation)
# Output:
#   - efficient_frontier.tsv    (index, return, risk, <asset columns>) — 101 rows, weights sum to 1
#
# Requires: cvxpy (preferred) or scipy (fallback)
# pip install cvxpy scipy pandas numpy

import math
import os
import numpy as np
import pandas as pd

RET_FILE   = "estimated_returns.tsv"
COV_FILE   = "estimated_covariance.tsv"
WMIN_FILE  = "minimum_risk.tsv"
WMAX_FILE  = "maximum_return.tsv"
OUT_FILE   = "efficient_frontier.tsv"

def port_stats(w, mu, Sigma):
    """Return (mean, std) for weights w given mu (Series) and Sigma (DataFrame)."""
    w = np.asarray(w, dtype=float)
    mu_p = float(w @ mu.values)
    var_p = float(w @ Sigma.values @ w)
    sd_p  = math.sqrt(max(var_p, 0.0))
    return mu_p, sd_p

def main():
    # --- Load inputs ---
    if not all(os.path.exists(p) for p in [RET_FILE, COV_FILE, WMIN_FILE, WMAX_FILE]):
        missing = [p for p in [RET_FILE, COV_FILE, WMIN_FILE, WMAX_FILE] if not os.path.exists(p)]
        raise FileNotFoundError(f"Missing inputs: {missing}. Run Parts 3, 4, 5, and 6 first.")

    mu = pd.read_csv(RET_FILE, sep="\t").set_index("asset")["estimated_return"]
    Sigma = pd.read_csv(COV_FILE, sep="\t", index_col=0)
    w_min = pd.read_csv(WMIN_FILE, sep="\t").set_index("asset")["allocation"]
    w_max = pd.read_csv(WMAX_FILE, sep="\t").set_index("asset")["allocation"]

    # Align assets (intersection, keep order from returns file)
    assets = [a for a in mu.index if a in Sigma.columns and a in w_min.index and a in w_max.index]
    if len(assets) < 2:
        raise ValueError("Need at least two aligned assets to form a frontier.")

    mu    = mu.loc[assets]
    Sigma = Sigma.loc[assets, assets].astype(float)
    w_min = w_min.loc[assets].fillna(0.0)
    w_max = w_max.loc[assets].fillna(0.0)

    # Stabilize covariance for numeric safety
    Sigma = 0.5 * (Sigma + Sigma.T)
    Sigma += 1e-12 * np.eye(len(assets))

    # Endpoints' returns
    mu_min = float(w_min.values @ mu.values)
    mu_max = float(w_max.values @ mu.values)
    if mu_max < mu_min:
        # swap if weird ordering (shouldn't happen, but let's be safe)
        mu_min, mu_max = mu_max, mu_min
        w_min, w_max = w_max, w_min

    # 101 evenly spaced target returns
    targets = np.linspace(mu_min, mu_max, 101)

    # Try CVXPY first; else SciPy SLSQP
    rows = []
    n = len(assets)

    try:
        import cvxpy as cp
        w = cp.Variable(n)
        Sigma_np = Sigma.values
        mu_np = mu.values
        for i, t in enumerate(targets):
            # Constraints: sum w = 1, w >= 0, w·μ >= t
            cons = [cp.sum(w) == 1, w >= 0, w @ mu_np >= float(t)]
            prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma_np)), cons)
            prob.solve(solver=cp.SCS, verbose=False)
            w_sol = np.array(w.value, dtype=float).ravel()
            # If solver struggles at extremes, project to feasible and renormalize
            w_sol = np.maximum(w_sol, 0)
            s = w_sol.sum()
            if s <= 0:  # fallback to min-risk if infeasible
                w_sol = w_min.values.copy()
            else:
                w_sol = w_sol / s

            r, sdev = port_stats(w_sol, mu, Sigma)
            row = {"index": i, "return": r, "risk": sdev}
            row.update({a: w_sol[j] for j, a in enumerate(assets)})
            rows.append(row)

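    except Exception:
        # CVXPY is missing or a solve failed partway through:
        # discard any partial rows so the fallback yields exactly 101 rows
        rows = []
        from scipy.optimize import minimize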

        def var_obj(w): return float(w @ Sigma.values @ w)
        bounds = [(0.0, 1.0)] * n
        w0 = w_min.values.copy()

        for i, t in enumerate(targets):
            cons = (
                {"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},
                {"type": "ineq", "fun": lambda w, t=t: (w @ mu.values) - float(t)},
            )
            res = minimize(var_obj, w0, method="SLSQP", bounds=bounds, constraints=cons, options={"maxiter": 800})
            if not res.success:
                # If a target fails (usually very close to endpoints), back off to last good or endpoint
                w_sol = w0.copy()
            else:
                w_sol = res.x
            r, sdev = port_stats(w_sol, mu, Sigma)
            row = {"index": i, "return": r, "risk": sdev}
            row.update({a: w_sol[j] for j, a in enumerate(assets)})
            rows.append(row)
            w0 = w_sol

    # Force endpoints to match exactly your provided min-risk and max-return portfolios
    rows[0].update({"return": float(w_min.values @ mu.values),
                    "risk":   math.sqrt(float(w_min.values @ Sigma.values @ w_min.values))})
    for j, a in enumerate(assets):
        rows[0][a] = float(w_min.values[j])

    rows[-1].update({"return": float(w_max.values @ mu.values),
                     "risk":   math.sqrt(float(w_max.values @ Sigma.values @ w_max.values))})
    for j, a in enumerate(assets):
        rows[-1][a] = float(w_max.values[j])

    # Save output: index, return, risk, then asset columns
    cols = ["index", "return", "risk"] + assets
    df = pd.DataFrame(rows, columns=cols)

    # Clean tiny numerical noise and ensure weights sum to 1
    W = df[assets].to_numpy()
    W = np.clip(W, 0, None)
    sums = W.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    W = W / sums
    for k, a in enumerate(assets):
        df[a] = np.round(W[:, k], 15)

    df.to_csv(OUT_FILE, sep="\t", index=False)
    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(df.head())

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/efficient_frontier.tsv
   index    return      risk      AAPL       SPY          TSLA          AMZN  \
0      0  0.019069  0.035033  0.141366  0.858630  3.845598e-06  2.093650e-07   
1      1  0.019565  0.035378  0.105495  0.887477  9.225344e-07  1.804254e-07   
2      2  0.020060  0.035810  0.107275  0.875424  4.865523e-06  9.328190e-07   
3      3  0.020556  0.036265  0.109054  0.863372  9.657223e-06  1.842615e-06   
4      4  0.021051  0.036743  0.110837  0.851312  8.581446e-06  1.646273e-06   

   MSFT      NVDA  
0   0.0  0.000000  
1   0.0  0.007027  
2   0.0  0.017296  
3   0.0  0.027563  
4   0.0  0.037841  


Save the portfolios in a TSV file named "efficient_frontier.tsv".
The header row should have columns "index", "return", "risk", and all the asset tickers.
Each data row should have the portfolio index (0-100), the estimated return of the portfolio, the estimated standard deviation (not variance) of the portfolio, and all the asset allocations (which should sum to one).
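
Before submitting, it may help to check the saved table against the format described above. The sketch below is one possible check, not part of the assignment: `check_frontier` is a hypothetical helper, the tolerance is an assumed value, and the demo table uses made-up tickers `AAA` and `BBB` with made-up numbers that do not form a real frontier.

```python
import numpy as np
import pandas as pd

def check_frontier(df, n_rows=101, tol=1e-6):
    """Return a list of problems found in an efficient-frontier table (empty if none)."""
    required = ["index", "return", "risk"]
    if list(df.columns[:3]) != required:
        return ["first three columns must be index, return, risk"]
    problems = []
    if len(df) != n_rows:
        problems.append(f"expected {n_rows} rows, found {len(df)}")
    # Everything after the three required columns is treated as an allocation.
    weights = df.drop(columns=required)
    if not np.allclose(weights.sum(axis=1), 1.0, atol=tol):
        problems.append("allocations do not sum to one in every row")
    # Evenly spaced returns have (nearly) constant consecutive differences.
    steps = np.diff(df["return"].to_numpy())
    if len(steps) and not np.allclose(steps, steps.mean(), atol=tol):
        problems.append("returns are not evenly spaced")
    return problems

# Synthetic table with hypothetical tickers, for illustration only.
t = np.linspace(0.0, 1.0, 101)
demo = pd.DataFrame({
    "index": range(101),
    "return": 0.01 + 0.02 * t,
    "risk": 0.03 + 0.05 * t,
    "AAA": 1 - t,
    "BBB": t,
})
print(check_frontier(demo))
```

To run the same check on the real file, pass `pd.read_csv("efficient_frontier.tsv", sep="\t")` to `check_frontier`.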

In [37]:
# YOUR CHANGES HERE

# part07_efficient_frontier.py
# Inputs:
#   - estimated_returns.tsv     (asset, estimated_return)
#   - estimated_covariance.tsv  (TSV covariance, blank top-left header)
#   - minimum_risk.tsv          (asset, allocation)   # index 0 target
#   - maximum_return.tsv        (asset, allocation)   # index 100 target
# Output:
#   - efficient_frontier.tsv    (index, return, risk, <asset columns>) — 101 rows

# Requires: cvxpy (preferred) or scipy (fallback)
# pip install cvxpy scipy pandas numpy

import os, math
import numpy as np
import pandas as pd

RET_FILE   = "estimated_returns.tsv"
COV_FILE   = "estimated_covariance.tsv"
WMIN_FILE  = "minimum_risk.tsv"
WMAX_FILE  = "maximum_return.tsv"
OUT_FILE   = "efficient_frontier.tsv"

def port_stats(w, mu, Sigma):
    w = np.asarray(w, float)
    mu_p = float(w @ mu.values)
    var_p = float(w @ Sigma.values @ w)
    return mu_p, math.sqrt(max(var_p, 0.0))  # standard deviation

def main():
    # Load inputs
    if not all(os.path.exists(p) for p in [RET_FILE, COV_FILE, WMIN_FILE, WMAX_FILE]):
        missing = [p for p in [RET_FILE, COV_FILE, WMIN_FILE, WMAX_FILE] if not os.path.exists(p)]
        raise FileNotFoundError(f"Missing inputs: {missing}")

    mu = pd.read_csv(RET_FILE, sep="\t").set_index("asset")["estimated_return"]
    Sigma = pd.read_csv(COV_FILE, sep="\t", index_col=0).astype(float)
    w_min = pd.read_csv(WMIN_FILE, sep="\t").set_index("asset")["allocation"]
    w_max = pd.read_csv(WMAX_FILE, sep="\t").set_index("asset")["allocation"]

    # Align assets (keep order from estimated_returns.tsv)
    assets = [a for a in mu.index if a in Sigma.columns and a in w_min.index and a in w_max.index]
    if len(assets) < 2:
        raise ValueError("Need at least two aligned assets.")
    mu, Sigma = mu.loc[assets], Sigma.loc[assets, assets]
    w_min, w_max = w_min.loc[assets].fillna(0.0), w_max.loc[assets].fillna(0.0)

    # Stabilize covariance
    Sigma = 0.5*(Sigma + Sigma.T)
    Sigma += 1e-12*np.eye(len(assets))

    # Endpoints
    mu_min = float(w_min.values @ mu.values)
    mu_max = float(w_max.values @ mu.values)
    if mu_max < mu_min:
        mu_min, mu_max = mu_max, mu_min
        w_min, w_max = w_max, w_min

    targets = np.linspace(mu_min, mu_max, 101)

    rows, n = [], len(assets)

    # Solve with CVXPY if available, else SciPy
    try:
        import cvxpy as cp
        w = cp.Variable(n)
        S = Sigma.values
        m = mu.values
        for i, t in enumerate(targets):
            cons = [cp.sum(w) == 1, w >= 0, w @ m >= float(t)]
            cp.Problem(cp.Minimize(cp.quad_form(w, S)), cons).solve(solver=cp.SCS, verbose=False)
            w_sol = np.maximum(np.array(w.value, float).ravel(), 0)
            s = w_sol.sum()
            w_sol = w_sol / s if s > 0 else w_min.values.copy()
            r, sdev = port_stats(w_sol, mu, Sigma)
            row = {"index": i, "return": r, "risk": sdev}
            row.update({a: w_sol[j] for j,a in enumerate(assets)})
            rows.append(row)
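    except Exception:
        # CVXPY is missing or a solve failed partway through:
        # discard any partial rows so the fallback yields exactly 101 rows
        rows = []
        from scipy.optimize import minimize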
        def var_obj(x): return float(x @ Sigma.values @ x)
        bnds = [(0.0, 1.0)]*n
        w0 = w_min.values.copy()
        for i, t in enumerate(targets):
            cons = (
                {"type": "eq",   "fun": lambda x: np.sum(x) - 1.0},
                {"type": "ineq", "fun": lambda x, t=t: (x @ mu.values) - float(t)},
            )
            res = minimize(var_obj, w0, method="SLSQP", bounds=bnds, constraints=cons, options={"maxiter": 800})
            w_sol = res.x if res.success else w0.copy()
            r, sdev = port_stats(w_sol, mu, Sigma)
            row = {"index": i, "return": r, "risk": sdev}
            row.update({a: w_sol[j] for j,a in enumerate(assets)})
            rows.append(row)
            w0 = w_sol

    # Force endpoints to match exactly provided min-risk and max-return portfolios
    rows[0]["return"] = float(w_min.values @ mu.values)
    rows[0]["risk"]   = math.sqrt(float(w_min.values @ Sigma.values @ w_min.values))
    for j,a in enumerate(assets): rows[0][a] = float(w_min.values[j])

    rows[-1]["return"] = float(w_max.values @ mu.values)
    rows[-1]["risk"]   = math.sqrt(float(w_max.values @ Sigma.values @ w_max.values))
    for j,a in enumerate(assets): rows[-1][a] = float(w_max.values[j])

    # Assemble dataframe and clean small numerical noise; ensure weights sum to 1
    cols = ["index","return","risk"] + assets
    df = pd.DataFrame(rows, columns=cols)
    W = df[assets].to_numpy()
    W = np.clip(W, 0, None)
    sums = W.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    W = W / sums
    for k,a in enumerate(assets):
        df[a] = np.round(W[:,k], 15)

    # Save TSV with required header and rows
    df.to_csv(OUT_FILE, sep="\t", index=False)
    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(df.head())

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/efficient_frontier.tsv
   index    return      risk      AAPL       SPY          TSLA          AMZN  \
0      0  0.019069  0.035033  0.141366  0.858630  3.845598e-06  2.093650e-07   
1      1  0.019565  0.035378  0.105495  0.887477  9.225344e-07  1.804254e-07   
2      2  0.020060  0.035810  0.107275  0.875424  4.865523e-06  9.328190e-07   
3      3  0.020556  0.036265  0.109054  0.863372  9.657223e-06  1.842615e-06   
4      4  0.021051  0.036743  0.110837  0.851312  8.581446e-06  1.646273e-06   

   MSFT      NVDA  
0   0.0  0.000000  
1   0.0  0.007027  
2   0.0  0.017296  
3   0.0  0.027563  
4   0.0  0.037841  


Submit "efficient_frontier.tsv" in Gradescope.

## Part 8: Check Maximum Return Portfolio Stability

Check the stability of the maximum return portfolio by resampling the estimated risk/return model.

Repeat the following steps 1000 times:
1. Use `np.random.multivariate_normal` to generate 23 return samples using your previously estimated risks and returns.
2. Estimate the return of each asset using that resampled return history.
3. Check which asset had the highest return in those resampled estimates.

This procedure is a reduced and simplified version of the Michaud resampled efficient frontier procedure that takes uncertainty in the risk model into account.
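
The three steps above can be sketched on a toy model as follows. The two-asset means and covariance are hypothetical values chosen for illustration, and the sketch uses NumPy's `Generator.multivariate_normal` (via `np.random.default_rng`) rather than the legacy `np.random.multivariate_normal` function so that the seed is explicit.

```python
import numpy as np

# Hypothetical two-asset model, for illustration only.
mu = np.array([0.02, 0.03])                 # monthly mean returns
cov = np.array([[0.0025, 0.0010],
                [0.0010, 0.0100]])          # monthly covariance

rng = np.random.default_rng(0)              # seeded for repeatability
n_trials, n_samples = 1000, 23

wins = np.zeros(len(mu), dtype=int)
for _ in range(n_trials):
    # Step 1: draw a resampled return history of 23 months.
    sample = rng.multivariate_normal(mu, cov, size=n_samples)  # shape (23, 2)
    # Steps 2 and 3: re-estimate mean returns and record the best asset.
    wins[np.argmax(sample.mean(axis=0))] += 1

win_share = wins / n_trials
print(win_share)
```

With these made-up numbers, the asset with the lower true mean should still come out on top in roughly a third of the trials, which is the kind of instability this part is meant to reveal.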

In [38]:
# YOUR CHANGES HERE

# part08_max_return_stability.py
# Inputs :
#   - estimated_returns.tsv     (asset, estimated_return)
#   - estimated_covariance.tsv  (pandas-style covariance table)
#   - historical_returns.tsv    (optional; used only to infer T=number of months)
# Output :
#   - max_return_probabilities.tsv  (asset, probability)

import os
import numpy as np
import pandas as pd

RET_FILE   = "estimated_returns.tsv"
COV_FILE   = "estimated_covariance.tsv"
RET_HIST   = "historical_returns.tsv"  # optional, to infer T
OUT_FILE   = "max_return_probabilities.tsv"

N_TRIALS = 1000  # per spec
SEED     = 17    # reproducibility

def make_psd(S):
    """Project symmetric matrix to nearest PSD by zeroing negative eigenvalues."""
    # Ensure symmetry
    S = 0.5 * (S + S.T)
    vals, vecs = np.linalg.eigh(S)
    vals_clipped = np.clip(vals, a_min=0.0, a_max=None)
    S_psd = (vecs * vals_clipped) @ vecs.T
    # Tiny ridge for numerical stability
    S_psd += 1e-12 * np.eye(S_psd.shape[0])
    return S_psd

def main():
    # --- Load estimated returns and covariance ---
    if not os.path.exists(RET_FILE) or not os.path.exists(COV_FILE):
        missing = [p for p in [RET_FILE, COV_FILE] if not os.path.exists(p)]
        raise FileNotFoundError(f"Missing input(s): {missing}. Run Parts 3 and 4 first.")

    mu = pd.read_csv(RET_FILE, sep="\t").set_index("asset")["estimated_return"]
    Sigma = pd.read_csv(COV_FILE, sep="\t", index_col=0)

    # Align assets (intersection; keep order from mu file)
    assets = [a for a in mu.index if a in Sigma.columns]
    if len(assets) < 2:
        raise ValueError("Need at least two aligned assets.")
    mu = mu.loc[assets].astype(float)
    Sigma = Sigma.loc[assets, assets].astype(float)

    # In-sample length T: read from historical_returns if available, else use 23
    if os.path.exists(RET_HIST):
        T = pd.read_csv(RET_HIST, sep="\t", parse_dates=["date"]).shape[0]
    else:
        T = 23  # 24 months of prices -> 23 monthly returns (per spec)

    # Ensure PSD covariance for sampling
    Sigma_psd = make_psd(Sigma.to_numpy())

    # --- Resampling loop ---
    rng = np.random.default_rng(SEED)
    counts = dict.fromkeys(assets, 0)

    for _ in range(N_TRIALS):
        # Draw T samples of returns ~ N(mu, Sigma)
        sample = rng.multivariate_normal(mean=mu.values, cov=Sigma_psd, size=T)  # shape (T, n)
        # Re-estimate mean returns from the sample
        mu_hat = sample.mean(axis=0)  # shape (n,)
        # Identify the highest-return asset this trial
        winner_idx = int(np.argmax(mu_hat))
        counts[assets[winner_idx]] += 1

    # Convert counts -> probabilities
    probs = pd.DataFrame({
        "asset": assets,
        "probability": [counts[a] / N_TRIALS for a in assets]
    })

    # Save TSV exactly as requested
    probs.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(probs.sort_values("probability", ascending=False))

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/max_return_probabilities.tsv
  asset  probability
5  NVDA        0.843
2  TSLA        0.145
0  AAPL        0.006
3  AMZN        0.005
1   SPY        0.001
4  MSFT        0.000


Save a file "max_return_probabilities.tsv" with the distribution of highest return assets.
The header row should have columns "asset" and "probability".
There should be a data row for each asset and its sample probability of having the highest return based on those 1000 resampled estimates.
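
A small check of the saved table against the format described above might look like the sketch below. `check_probabilities` is a hypothetical helper and its tolerance is an assumed value; the demo rows reuse the probabilities printed in the output above.

```python
import numpy as np
import pandas as pd

def check_probabilities(df, tol=1e-6):
    """Return a list of problems found in a max-return probability table (empty if none)."""
    if list(df.columns) != ["asset", "probability"]:
        return ["columns must be exactly: asset, probability"]
    problems = []
    if df["asset"].duplicated().any():
        problems.append("duplicate asset rows")
    p = df["probability"].to_numpy(dtype=float)
    if ((p < 0) | (p > 1)).any():
        problems.append("probability outside [0, 1]")
    # Each trial has exactly one winner, so the shares should total one.
    if not np.isclose(p.sum(), 1.0, atol=tol):
        problems.append(f"probabilities sum to {p.sum():.6f}, not 1")
    return problems

# Values copied from the printed output above.
demo = pd.DataFrame({
    "asset": ["AAPL", "SPY", "TSLA", "AMZN", "MSFT", "NVDA"],
    "probability": [0.006, 0.001, 0.145, 0.005, 0.000, 0.843],
})
print(check_probabilities(demo))
```

To run the same check on the real file, pass `pd.read_csv("max_return_probabilities.tsv", sep="\t")` to `check_probabilities`.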


In [39]:
# YOUR CHANGES HERE

# part08_max_return_stability.py
# Inputs:
#   - estimated_returns.tsv     (asset, estimated_return)
#   - estimated_covariance.tsv  (pandas-style covariance table)
#   - historical_returns.tsv    (optional; used to infer T=number of months)
# Output:
#   - max_return_probabilities.tsv (asset, probability)

import os
import numpy as np
import pandas as pd

RET_FILE = "estimated_returns.tsv"
COV_FILE = "estimated_covariance.tsv"
RET_HIST = "historical_returns.tsv"  # optional to infer T
OUT_FILE = "max_return_probabilities.tsv"

N_TRIALS = 1000
SEED = 17  # reproducible

def make_psd(S):
    S = 0.5 * (S + S.T)
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)
    S_psd = (vecs * vals) @ vecs.T
    S_psd += 1e-12 * np.eye(S_psd.shape[0])
    return S_psd

def main():
    if not os.path.exists(RET_FILE) or not os.path.exists(COV_FILE):
        missing = [p for p in [RET_FILE, COV_FILE] if not os.path.exists(p)]
        raise FileNotFoundError(f"Missing input(s): {missing}. Run Parts 3 and 4 first.")

    mu = pd.read_csv(RET_FILE, sep="\t").set_index("asset")["estimated_return"].astype(float)
    Sigma = pd.read_csv(COV_FILE, sep="\t", index_col=0).astype(float)

    # Align assets (keep order from estimated_returns.tsv)
    assets = [a for a in mu.index if a in Sigma.columns]
    mu = mu.loc[assets]
    Sigma = Sigma.loc[assets, assets]

    # Infer T (# of monthly returns)
    if os.path.exists(RET_HIST):
        T = pd.read_csv(RET_HIST, sep="\t", parse_dates=["date"]).shape[0]
    else:
        T = 23  # 24 months of prices -> 23 returns

    Sigma_psd = make_psd(Sigma.to_numpy())
    rng = np.random.default_rng(SEED)

    counts = dict.fromkeys(assets, 0)
    for _ in range(N_TRIALS):
        sample = rng.multivariate_normal(mean=mu.values, cov=Sigma_psd, size=T)  # (T, n)
        mu_hat = sample.mean(axis=0)
        winner = assets[int(np.argmax(mu_hat))]
        counts[winner] += 1

    probs = pd.DataFrame({
        "asset": assets,
        "probability": [counts[a] / N_TRIALS for a in assets]
    })
    probs.to_csv(OUT_FILE, sep="\t", index=False)

    print(f"Wrote: {os.path.abspath(OUT_FILE)}")
    print(probs.sort_values("probability", ascending=False))

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/max_return_probabilities.tsv
  asset  probability
5  NVDA        0.843
2  TSLA        0.145
0  AAPL        0.006
3  AMZN        0.005
1   SPY        0.001
4  MSFT        0.000


Submit "max_return_probabilities.tsv" in Gradescope.

## Part 9: Acknowledgments

Make a file "acknowledgments.txt" documenting any outside sources or help on this project.
If you discussed this assignment with anyone, please acknowledge them here.
If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for.
If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy.
If no acknowledgments are appropriate, just write "none" in the file.


In [40]:
# part09_acknowledgments.py
# Output: acknowledgments.txt

import os
OUT_FILE = "acknowledgments.txt"

# Set to True if you want the file to contain just "none"
WRITE_NONE = False

TEMPLATE = """Acknowledgments
---------------



Libraries used (brief purpose):
- pandas — data manipulation and TSV I/O
- numpy — numeric operations
- yfinance — downloading historical prices from Yahoo Finance
- cvxpy — quadratic programming (min-variance & frontier)
- scipy.optimize — SLSQP fallback optimizer
(Keep/edit as needed to reflect what you actually used.)

Generative AI usage (per course policy):
- Tool: ChatGPT (used for code scaffolding and clarification).
- I reviewed, executed, and verified all code and results.
- Transcript links:

Course-provided resources consulted:
- https://github.com/bu-cds-dx704/dx704-project-01
- https://github.com/bu-cds-omds/dx704-examples


"""

def main():
    content = "none\n" if WRITE_NONE else TEMPLATE
    with open(OUT_FILE, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Wrote: {os.path.abspath(OUT_FILE)}")

if __name__ == "__main__":
    main()


Wrote: /workspaces/dx704-project-01/acknowledgments.txt


Submit "acknowledgments.txt" in Gradescope.

## Part 10: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

In [41]:
%pip install pandas numpy cvxpy scipy yfinance

[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Submit "project.ipynb" in Gradescope.