## Cell 1 — API Setup with yfinance
Define a helper function to fetch stock price data using `yfinance` and save it.


In [None]:
# Cell 1 — API to fetch stock data using yfinance
import time
from pathlib import Path

import pandas as pd
import yfinance as yf

# Directory for raw downloads; both save steps in this notebook write here
RAW_DIR = Path("data/raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

def fetch_and_save(ticker="AAPL"):
    # Request one year of daily prices from Yahoo Finance
    df = yf.download(ticker, period="1y", interval="1d")

    # Basic validation: shape, column names, and missing values per column
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist())
    print("NA counts:\n", df.isna().sum())

    # Save the data to a timestamped CSV file
    ts = time.strftime("%Y%m%d-%H%M")
    fname = RAW_DIR / f"api_yf_{ticker}_{ts}.csv"
    df.to_csv(fname)
    print("Saved to:", fname)

    return df

## Cell 2 — Fetch and Save API Data
Fetch Apple (AAPL) daily prices for the past year, validate, and save to CSV.


In [None]:
# Cell 2 — Fetch AAPL prices and save them to CSV

# Use "AAPL" as the stock ticker
ticker = "AAPL"
df = fetch_and_save(ticker)

# print the first few rows of the DataFrame
df.head()

[*********************100%***********************]  1 of 1 completed

Shape: (250, 5)
Columns: [('Close', 'AAPL'), ('High', 'AAPL'), ('Low', 'AAPL'), ('Open', 'AAPL'), ('Volume', 'AAPL')]
NA counts:
 Price   Ticker
Close   AAPL      0
High    AAPL      0
Low     AAPL      0
Open    AAPL      0
Volume  AAPL      0
dtype: int64
Saved to: data/raw/api_yf_AAPL_20250820-1810.csv





| Date | Close (AAPL) | High (AAPL) | Low (AAPL) | Open (AAPL) | Volume (AAPL) |
|---|---|---|---|---|---|
| 2024-08-21 | 225.351196 | 226.923879 | 224.007459 | 225.470651 | 34765500 |
| 2024-08-22 | 223.489868 | 227.282216 | 222.862782 | 226.734761 | 43695300 |
| 2024-08-23 | 225.789169 | 227.162781 | 223.290802 | 224.614643 | 38677300 |
| 2024-08-26 | 226.127579 | 226.227122 | 222.852826 | 225.709526 | 30602200 |
| 2024-08-27 | 226.973648 | 227.789857 | 223.848195 | 224.953053 | 35934600 |
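
The `Columns:` output above shows that the download returned two-level columns, pairing each price field with the ticker. For a single ticker, one option is to keep only the field level before further analysis. The following is a minimal sketch, not part of the original notebook: `flatten_single_ticker` is a hypothetical helper, and the stand-in frame copies the first two rows shown above so that the example runs without a network call.

```python
import pandas as pd

def flatten_single_ticker(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose columns are plain field names (hypothetical helper)."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Level 0 holds the price field ("Close", "High", ...); level 1 holds the ticker
        out.columns = list(out.columns.get_level_values(0))
    return out

# Stand-in frame shaped like the output above (first two rows only)
columns = pd.MultiIndex.from_tuples(
    [("Close", "AAPL"), ("High", "AAPL"), ("Low", "AAPL"),
     ("Open", "AAPL"), ("Volume", "AAPL")],
    names=["Price", "Ticker"],
)
sample = pd.DataFrame(
    [[225.351196, 226.923879, 224.007459, 225.470651, 34765500],
     [223.489868, 227.282216, 222.862782, 226.734761, 43695300]],
    index=pd.DatetimeIndex(["2024-08-21", "2024-08-22"], name="Date"),
    columns=columns,
)

flat = flatten_single_ticker(sample)
print(flat.columns.tolist())
```

This only makes sense when the frame holds one ticker; with several tickers, dropping the ticker level would produce duplicate column names.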


## Cell 3 — Web Scraping (Wikipedia S&P 500 List)
Use `requests` + `BeautifulSoup` to scrape a small public table from Wikipedia.


In [None]:
# Cell 3 — WIKI S&P500 companies list scraping

from io import StringIO

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Request the webpage and fail early on an HTTP error
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(resp.text, "html.parser")

# Find the S&P 500 constituents table by its id
table = soup.find("table", {"id": "constituents"})
# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated
df_sp500 = pd.read_html(StringIO(str(table)))[0]

# Print basic information
print("Shape:", df_sp500.shape)
print("Columns:", df_sp500.columns.tolist())
df_sp500.head()

Shape: (503, 8)
Columns: ['Symbol', 'Security', 'GICS Sector', 'GICS Sub-Industry', 'Headquarters Location', 'Date added', 'CIK', 'Founded']

| | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


## Cell 4 — Validate and Save Scraped Data
Perform simple validation (shape, columns, NA counts) and save to CSV.


In [5]:
# Cell 4 — Validate and save the scraped data

import time

# Simple validation
print("Shape:", df_sp500.shape)
print("Columns:", df_sp500.columns.tolist())
print("NA counts:\n", df_sp500.isna().sum())

# Save the file
ts = time.strftime("%Y%m%d-%H%M")
fname = RAW_DIR / f"scrape_wikipedia_sp500_{ts}.csv"
df_sp500.to_csv(fname, index=False)
print("Saved to:", fname)

# Show the first few rows
df_sp500.head()


Shape: (503, 8)
Columns: ['Symbol', 'Security', 'GICS Sector', 'GICS Sub-Industry', 'Headquarters Location', 'Date added', 'CIK', 'Founded']
NA counts:
 Symbol                   0
Security                 0
GICS Sector              0
GICS Sub-Industry        0
Headquarters Location    0
Date added               0
CIK                      0
Founded                  0
dtype: int64
Saved to: data/raw/scrape_wikipedia_sp500_20250820-1817.csv


| | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


## Documentation: Data Sources and Validation

**API Source**  
- Library: `yfinance`  
- Example ticker: `AAPL`  
- Parameters: period = "1y", interval = "1d"  
- Validation: checked shape, columns (Close, High, Low, Open, Volume, each paired with the ticker in a two-level header), and NA counts.  

**Scraping Source**  
- Website: [Wikipedia — List of S&P 500 companies](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)  
- Table: S&P 500 constituents  
- Validation: checked DataFrame shape, column names, and NA counts.  
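
The same three checks (shape, column names, NA counts) are applied to both sources, so they could be collected in one function. The sketch below is an assumption about how that might look, not code from the notebook: `validate_frame` and its `required_columns` parameter are hypothetical, and the sample rows are copied from the scraped output.

```python
import pandas as pd

def validate_frame(df: pd.DataFrame, required_columns: list) -> dict:
    """Summarize a DataFrame; raise if it is empty or lacks required columns."""
    if df.empty:
        raise ValueError("DataFrame is empty")
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return {
        "shape": df.shape,
        "columns": df.columns.tolist(),
        "na_counts": df.isna().sum().to_dict(),
    }

# Two rows copied from the scraped table, used as a stand-in for df_sp500
sample = pd.DataFrame(
    {
        "Symbol": ["MMM", "AOS"],
        "Security": ["3M", "A. O. Smith"],
        "GICS Sector": ["Industrials", "Industrials"],
    }
)

report = validate_frame(sample, required_columns=["Symbol", "Security"])
print(report)
```

Raising on a missing column, rather than only printing, makes a changed table layout fail at the validation step instead of later in the analysis.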

**File Naming**  
- API output: `data/raw/api_yf_<ticker>_<timestamp>.csv`  
- Scraping output: `data/raw/scrape_wikipedia_sp500_<timestamp>.csv`  
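
Both patterns share the form `<prefix>_<timestamp>.csv` under `data/raw`, with the timestamp formatted as `%Y%m%d-%H%M` in the cells above. A small helper can build these paths in one place. This is a sketch only: `raw_csv_path` is a hypothetical name, and the fixed timestamps are the ones printed in the `Saved to:` lines.

```python
import time
from pathlib import Path

RAW_DIR = Path("data/raw")  # directory shown in the "Saved to:" output

def raw_csv_path(prefix, ts=None, raw_dir=RAW_DIR):
    """Build <raw_dir>/<prefix>_<timestamp>.csv (hypothetical helper)."""
    if ts is None:
        # Same format the notebook cells use: date, hyphen, hour and minute
        ts = time.strftime("%Y%m%d-%H%M")
    return raw_dir / f"{prefix}_{ts}.csv"

api_path = raw_csv_path("api_yf_AAPL", ts="20250820-1810")
scrape_path = raw_csv_path("scrape_wikipedia_sp500", ts="20250820-1817")
print(api_path.as_posix())     # data/raw/api_yf_AAPL_20250820-1810.csv
print(scrape_path.as_posix())  # data/raw/scrape_wikipedia_sp500_20250820-1817.csv
```

Because the timestamp has minute resolution, two saves with the same prefix in the same minute would write to the same file.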

---

## Assumptions & Risks

- **API Risks**:  
  - `yfinance` is a free, unofficial wrapper around Yahoo Finance, so data may be delayed or missing fields.  
  - Time zones and trading calendar could affect date alignment.  

- **Scraping Risks**:  
  - Website structure may change, breaking the parser.  
  - Wikipedia content is community-edited, so data might not always be accurate.  

- **General Assumptions**:  
  - `.env` file is local-only and not committed to GitHub.  
  - Saved CSV files are reproducible and follow the required naming convention.  
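
One way to spot-check that a saved price CSV can be reloaded faithfully is to write a frame and read it back. Because the downloaded frame has two column levels, the file has two header rows plus an index-name row, so `pd.read_csv` needs `header=[0, 1]` and `index_col=0`. The sketch below assumes that layout and uses an in-memory buffer with two rows copied from the output above instead of a real file.

```python
import io

import pandas as pd

# Stand-in for the downloaded frame: two-level columns and a "Date" index
columns = pd.MultiIndex.from_tuples(
    [("Close", "AAPL"), ("Volume", "AAPL")], names=["Price", "Ticker"]
)
original = pd.DataFrame(
    [[225.351196, 34765500], [223.489868, 43695300]],
    index=pd.DatetimeIndex(["2024-08-21", "2024-08-22"], name="Date"),
    columns=columns,
)

# Write to an in-memory buffer instead of data/raw so the example has no side effects
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] rebuilds the two column levels; index_col=0 restores the date index
restored = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)

print(restored.columns.tolist())
print(restored.shape == original.shape)
```

The scraped S&P 500 file has a single header row and was saved with `index=False`, so a plain `pd.read_csv(path)` should be enough for it.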
