## Problem 3: Taking stock (15 points)

A joint distribution of data has a natural graph associated with it. When the distribution is multivariate normal, this graph is encoded in the pattern of zeros and non-zeros in the inverse of the covariance matrix, also known as the "precision matrix."
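
To make the link between zeros in the precision matrix and the graph concrete, here is a minimal sketch using a made-up 4-variable chain (the matrix entries are illustrative assumptions, not taken from any data set). It shows that the covariance is dense even though the precision matrix is sparse, and that the partial correlation $\rho_{ij} = -\Theta_{ij} / \sqrt{\Theta_{ii}\Theta_{jj}}$ vanishes exactly where the graph has no edge.

```python
import numpy as np

# Hypothetical chain X0 - X1 - X2 - X3: only neighbouring variables share an edge,
# so the precision matrix is tridiagonal.
precision = np.array([
    [ 1.0, -0.4,  0.0,  0.0],
    [-0.4,  1.0, -0.4,  0.0],
    [ 0.0, -0.4,  1.0, -0.4],
    [ 0.0,  0.0, -0.4,  1.0],
])

# The covariance is the inverse of the precision matrix; it has no zeros,
# because X0 and X3 are still marginally correlated through X1 and X2.
covariance = np.linalg.inv(precision)

# Partial correlation of X_i and X_j given all the other variables.
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print("covariance:\n", np.round(covariance, 3))
print("partial correlations:\n", np.round(partial_corr, 3))
```

A zero in `partial_corr[0, 3]` corresponds to the statement that X0 and X3 are conditionally independent given X1 and X2, even though `covariance[0, 3]` is nonzero.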

In class we demonstrated the graphical lasso for estimating the graph on ETF data.
In this problem you will construct two different "portfolios" of stocks,
and run the graphical lasso to estimate a graph, commenting on your results.
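
As a reminder of what running the graphical lasso looks like, the sketch below fits scikit-learn's `GraphicalLasso` to synthetic Gaussian data. The data, the chain-structured precision matrix, the value `alpha=0.05`, and the `1e-6` threshold for calling an entry nonzero are all assumptions made for illustration; the class demo may use different settings.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Synthetic stand-in for real returns: 4 variables on a chain (assumed structure).
true_precision = np.array([
    [ 1.0, -0.4,  0.0,  0.0],
    [-0.4,  1.0, -0.4,  0.0],
    [ 0.0, -0.4,  1.0, -0.4],
    [ 0.0,  0.0, -0.4,  1.0],
])
true_cov = np.linalg.inv(true_precision)
X = rng.multivariate_normal(np.zeros(4), true_cov, size=2000)

# alpha is the L1 penalty; larger values give sparser precision estimates.
model = GraphicalLasso(alpha=0.05).fit(X)
est_precision = model.precision_

# Read the estimated graph off the nonzero off-diagonal entries.
p = est_precision.shape[0]
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if abs(est_precision[i, j]) > 1e-6]
print("estimated edges:", edges)
```

With real prices, `X` would be a matrix of returns (rows are dates, columns are tickers) rather than simulated draws.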

All of the code you might need for this is contained in the demo.

## Downloading data

As demonstrated in class, we will use R to download equity prices from Yahoo Finance (via the BatchGetSymbols package), then analyze them in Python.
Your job is to construct two "portfolios" of stocks, each of which has some kind of organization to it. For example, in one portfolio you might have 5 energy stocks, 5 tech stocks, 5 consumer staples stocks, and 5 ETFs. Each portfolio should have at least 20 stocks.

The page https://en.wikipedia.org/wiki/List_of_S%26P_500_companies lists GICS sectors (also written to `sp500_meta.csv` by the R script below).

### R → CSV workflow
Use the provided R script (`problem3_yahoo.Rmd`) to fetch prices in batched, cached calls and write CSVs for Python analysis.

**What the R script does**
- Scrapes the current S&P 500 membership (symbol, company, sector).
- Downloads **raw prices** (Adjusted Close by default) for any tickers (S&P 500 or an ETF list), at **daily/weekly/monthly** frequency.
- Writes:
  - `sp500_meta.csv` — symbol/company/sector.
  - `weekly_stock.csv` — prices for your example ETF set (weekly, unadjusted close in this demo).
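
One way to turn the sector metadata into a portfolio is sketched below. The helper `pick_by_sector` is hypothetical, and the column names `symbol` and `sector` are assumptions about the header of `sp500_meta.csv`; check the actual file and adjust. A tiny inline table stands in for the CSV so the snippet runs on its own.

```python
import pandas as pd

def pick_by_sector(meta, sectors, n_per_sector=5,
                   symbol_col="symbol", sector_col="sector"):
    """Return the first n_per_sector symbols listed for each requested sector."""
    picks = {}
    for sector in sectors:
        in_sector = meta.loc[meta[sector_col] == sector, symbol_col]
        picks[sector] = in_sector.head(n_per_sector).tolist()
    return picks

# Toy stand-in for pd.read_csv("sp500_meta.csv"); column names are assumed.
meta = pd.DataFrame({
    "symbol": ["XOM", "CVX", "AAPL", "MSFT", "KO", "PG"],
    "company": ["Exxon Mobil", "Chevron", "Apple", "Microsoft",
                "Coca-Cola", "Procter & Gamble"],
    "sector": ["Energy", "Energy",
               "Information Technology", "Information Technology",
               "Consumer Staples", "Consumer Staples"],
})

portfolio = pick_by_sector(meta, ["Energy", "Consumer Staples"], n_per_sector=2)
tickers = [t for group in portfolio.values() for t in group]
print(portfolio)
print(tickers)
```

Taking the first few rows per sector is only a placeholder rule; any deliberate choice (largest companies, a sub-industry, and so on) fits the assignment equally well.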

### Analyzing your portfolios

Your task is to analyze each portfolio using the graphical lasso, and comment on your findings.
Here are the types of questions you should address:

* How did you choose the portfolio? How did you choose the date range and frequency (daily, weekly, etc.)? Remember, each of the portfolios must contain at least 20 stocks, and be organized in some reasonable way.

* Display the graph obtained with the graphical lasso, using networkx. How did you choose the regularization level? Does the structure of the graph make sense? Is it sensitive to the choice of regularization level? Is this the structure you expected to see when you designed the portfolio? Why or why not?

* What are some of the conditional independence assumptions implied by the graph? Are some parts of the graph more densely connected than others? Why?
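
The questions above can be explored with a pipeline along the following lines. This is a sketch under stated assumptions, not the demo's code: the helpers `log_returns` and `glasso_graph` are hypothetical names, the prices are simulated (two groups of three tickers, each driven by its own common factor), returns are standardized before fitting, the alpha grid is arbitrary, and an entry of the precision matrix counts as an edge when its magnitude exceeds `1e-6`.

```python
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.covariance import GraphicalLasso

def log_returns(prices):
    """Convert a Date x ticker price table into log returns."""
    return np.log(prices).diff().dropna()

def glasso_graph(returns, alpha, tol=1e-6):
    """Fit the graphical lasso and return a graph of nonzero precision entries."""
    z = (returns - returns.mean()) / returns.std()  # put tickers on a common scale
    theta = GraphicalLasso(alpha=alpha, max_iter=200).fit(z.values).precision_
    d = np.sqrt(np.diag(theta))
    pcorr = -theta / np.outer(d, d)                 # partial correlations
    cols = list(returns.columns)
    graph = nx.Graph()
    graph.add_nodes_from(cols)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(theta[i, j]) > tol:
                graph.add_edge(cols[i], cols[j], weight=pcorr[i, j])
    return graph

# Simulated weekly prices (NOT real data): groups A and B each share one factor.
rng = np.random.default_rng(0)
n_weeks = 260
factor_a = rng.normal(0, 0.02, n_weeks)
factor_b = rng.normal(0, 0.02, n_weeks)
rets = {f"A{k}": factor_a + rng.normal(0, 0.01, n_weeks) for k in range(3)}
rets.update({f"B{k}": factor_b + rng.normal(0, 0.01, n_weeks) for k in range(3)})
prices = 100 * np.exp(pd.DataFrame(rets).cumsum())

returns = log_returns(prices)

# Sensitivity check: how does the number of edges change with the penalty?
graphs = {alpha: glasso_graph(returns, alpha) for alpha in [0.05, 0.2, 0.5]}
for alpha, g in graphs.items():
    missing = list(nx.non_edges(g))  # pairs read as conditionally independent
    print(f"alpha={alpha}: {g.number_of_edges()} edges, {len(missing)} non-edges")
```

To display one of the graphs, `nx.draw_networkx(graphs[0.2], with_labels=True)` followed by `plt.show()` works once matplotlib is imported as `plt`. Each missing edge is a claimed conditional independence between two tickers given all the others, so comparing the edge lists across alphas shows directly which of those claims depend on the regularization level.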




```python
# (Optional) Python helper: load the CSVs produced by the R script
import pandas as pd
from IPython.display import display  # explicit import so this also runs outside a notebook

try:
    # Load prices written by the R script
    fname = "weekly_stock.csv"
    df = pd.read_csv(fname)

    # Normalize date column name (could be 'date' or 'Date')
    date_col = "date" if "date" in df.columns else ("Date" if "Date" in df.columns else None)
    if date_col is None:
        raise ValueError("No 'date' or 'Date' column found in CSV.")

    # Use the dates as a sorted DatetimeIndex so rows are in chronological order
    df[date_col] = pd.to_datetime(df[date_col])
    df = df.set_index(date_col).sort_index()
    df.index.name = "Date"

    print(f"{fname}: {df.shape[0]} rows × {df.shape[1]} tickers")
    display(df.head())
except FileNotFoundError:
    print("Run the R script first to create weekly_stock.csv in this folder.")
```


weekly_stock.csv: 262 rows × 33 tickers


| Date | ECH | EIDO | EIRL | EIS | ENZL | EPHE | EPOL | EPU | ERUS | EWA | ... | EWS | EWT | EWU | EWW | EWY | EWZ | EZA | FXI | THD | TUR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-03-20 | 43.380001 | 25.950001 | 40.209999 | 51.689999 | 41.439999 | 34.189999 | 22.09 | 33.84 | 32.650002 | 22.17 | ... | 22.620001 | 33.669998 | 32.470001 | 51.560001 | 62.470001 | 37.119999 | 60.369999 | 39.25 | 77.470001 | 36.73 |
| 2017-03-27 | 43.560001 | 25.790001 | 40.060001 | 51.25 | 42.220001 | 34.16 | 21.58 | 34.080002 | 32.119999 | 22.610001 | ... | 22.809999 | 33.23 | 32.549999 | 51.169998 | 61.869999 | 37.459999 | 55.189999 | 38.490002 | 77.989998 | 35.799999 |
| 2017-04-03 | 44.849998 | 26.24 | 39.959999 | 51.470001 | 42.150002 | 35.860001 | 21.959999 | 34.389999 | 31.9 | 22.309999 | ... | 22.65 | 33.240002 | 32.299999 | 52.07 | 60.349998 | 37.0 | 54.369999 | 38.740002 | 78.129997 | 35.25 |
| 2017-04-10 | 45.130001 | 26.01 | 40.16 | 50.849998 | 42.310001 | 36.0 | 21.540001 | 33.75 | 30.879999 | 22.43 | ... | 22.639999 | 33.18 | 32.419998 | 51.189999 | 59.919998 | 36.09 | 57.18 | 38.259998 | 78.650002 | 36.369999 |
| 2017-04-17 | 44.369999 | 26.23 | 40.810001 | 50.900002 | 42.439999 | 35.810001 | 22.09 | 33.630001 | 31.07 | 22.33 | ... | 22.559999 | 32.959999 | 32.369999 | 51.369999 | 60.959999 | 36.16 | 57.720001 | 37.959999 | 78.110001 | 37.779999 |
