# Homework Starter — Stage 04: Data Acquisition and Ingestion
Name: 
Date: 

## Objectives
- Ingest data from an API, keeping secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [None]:
import os
import pathlib
import datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Raw data directory; created on first run if it does not exist.
RAW = pathlib.Path('data/raw')
RAW.mkdir(parents=True, exist_ok=True)
# load_dotenv(); print('ALPHAVANTAGE_API_KEY loaded?', bool(os.getenv('ALPHAVANTAGE_API_KEY')))
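
The commented line above only reports whether the key loaded. A minimal sketch of failing fast instead, assuming `load_dotenv()` has already been called; `require_env` is a hypothetical helper name, not part of `python-dotenv`:

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, '').strip()
    if not value:
        # Assumption: secrets live in a local .env file that load_dotenv() has read.
        raise RuntimeError(f"{name} is not set; add it to .env and rerun load_dotenv()")
    return value
```

With this in place, `api_key = require_env('ALPHAVANTAGE_API_KEY')` could replace the repeated `os.getenv` calls, so a missing key surfaces before any request is sent.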

## Helpers (use or modify)

In [7]:
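def ts():
    """Return a local timestamp such as 20250818-150953 for use in filenames."""
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    """Write df to data/raw/ as <prefix>_<key-value pairs>_<timestamp>.csv; return the path."""
    # Building the name from parts avoids a double underscore when no meta is passed.
    parts = [prefix, *(f"{k}-{v}" for k, v in meta.items()), ts()]
    path = RAW / ('_'.join(parts) + '.csv')
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    """Report missing required columns, the frame's shape, and the total NA count."""
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}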
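
`validate` checks column presence and NA counts but not types. A possible companion check is sketched below; `validate_types` and its `expected` mapping are hypothetical names introduced here, and the demo values are placeholders rather than real prices.

```python
import pandas as pd

def validate_types(df: pd.DataFrame, expected: dict) -> dict:
    """Report columns that are missing or whose dtype kind differs from `expected`.

    `expected` maps column name -> NumPy dtype kind, e.g. 'M' (datetime), 'f' (float).
    """
    wrong = {}
    for col, kind in expected.items():
        if col not in df.columns:
            wrong[col] = 'missing'
        elif df[col].dtype.kind != kind:
            wrong[col] = str(df[col].dtype)
    return wrong

# Placeholder values for illustration only, not real prices.
demo = pd.DataFrame({'date': ['2025-01-02', '2025-01-03'], 'adj_close': [1.0, 2.0]})
before = validate_types(demo, {'date': 'M', 'adj_close': 'f'})  # 'date' is still text
demo['date'] = pd.to_datetime(demo['date'])
after = validate_types(demo, {'date': 'M', 'adj_close': 'f'})   # nothing left to flag
```

Running a check like this before `save_csv` would catch a date column that was never parsed, which `validate` alone reports as clean.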

## Part 1 — API Pull (Required)
Choose an API endpoint (e.g., Alpha Vantage), or use the `yfinance` fallback when no API key is available.

In [9]:
SYMBOL = 'NVDA'
# USE_ALPHA = bool(os.getenv('ALPHAVANTAGE_API_KEY'))
# if USE_ALPHA:
#     url = 'https://www.alphavantage.co/query'
#     params = {'function':'TIME_SERIES_DAILY_ADJUSTED','symbol':SYMBOL,'outputsize':'compact','apikey':os.getenv('ALPHAVANTAGE_API_KEY')}
#     r = requests.get(url, params=params, timeout=30)
#     r.raise_for_status()
#     js = r.json()
#     key = [k for k in js if 'Time Series' in k][0]
#     df_api = pd.DataFrame(js[key]).T.reset_index().rename(columns={'index':'date','5. adjusted close':'adj_close'})[['date','adj_close']]
#     df_api['date'] = pd.to_datetime(df_api['date']); df_api['adj_close'] = pd.to_numeric(df_api['adj_close'])
# else:
import yfinance as yf
df_api = yf.download(SYMBOL, period='3mo', interval='1d', auto_adjust=False).reset_index()[['Date','Adj Close']]
df_api.columns = ['date','adj_close']

v_api = validate(df_api, ['date','adj_close']); v_api

[*********************100%***********************]  1 of 1 completed


{'missing': [], 'shape': (63, 2), 'na_total': 0}
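
The commented-out Alpha Vantage branch indexes `[0]` on the list of matching keys, which raises an unhelpful `IndexError` if the response carries no time series. A guarded version of the same parsing is sketched below. The payload layout is an assumption inferred from the commented-out code, `parse_daily_adjusted` is a hypothetical name, and the sample values are made up.

```python
import pandas as pd

def parse_daily_adjusted(js: dict) -> pd.DataFrame:
    """Turn a daily-adjusted style payload into sorted date/adj_close rows."""
    keys = [k for k in js if 'Time Series' in k]
    if not keys:
        # Assumption: non-data responses (rate-limit notes, errors) lack a 'Time Series' key.
        raise ValueError(f"No time series in response; top-level keys: {list(js)}")
    df = (pd.DataFrame(js[keys[0]]).T
            .reset_index()
            .rename(columns={'index': 'date', '5. adjusted close': 'adj_close'})
            [['date', 'adj_close']])
    df['date'] = pd.to_datetime(df['date'])
    df['adj_close'] = pd.to_numeric(df['adj_close'])
    return df.sort_values('date').reset_index(drop=True)

# Made-up placeholder payload for illustration; not real market data.
sample = {
    'Meta Data': {'2. Symbol': 'DEMO'},
    'Time Series (Daily)': {
        '2025-01-03': {'4. close': '11.0', '5. adjusted close': '10.5'},
        '2025-01-02': {'4. close': '10.0', '5. adjusted close': '9.5'},
    },
}
df_demo = parse_daily_adjusted(sample)
```

Raising with the top-level keys in the message makes it easier to tell a rate-limit response from a genuine schema change.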

In [10]:
print(df_api)

         date   adj_close
0  2025-05-19  135.560593
1  2025-05-20  134.370667
2  2025-05-21  131.790848
3  2025-05-22  132.820786
4  2025-05-23  131.280884
..        ...         ...
58 2025-08-12  183.160004
59 2025-08-13  181.589996
60 2025-08-14  182.020004
61 2025-08-15  180.449997
62 2025-08-18  182.559998

[63 rows x 2 columns]


In [11]:
_ = save_csv(df_api.sort_values('date'), prefix='api', source='yfinance', symbol=SYMBOL)

Saved data/raw/api_source-yfinance_symbol-NVDA_20250818-150953.csv


## Part 2 — Scrape a Public Table (Required)
Replace `SCRAPE_URL` with a permitted page containing a simple table.

In [12]:
SCRAPE_URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'  # TODO: replace with permitted page
headers = {'User-Agent':'AFE-Homework/1.0'}
try:
    resp = requests.get(SCRAPE_URL, headers=headers, timeout=30); resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
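    # Caution: this gathers <tr> rows from every table on the page, so a page with
    # several tables yields mixed rows and many NaNs; narrow to one table if needed.
    rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]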
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)
except Exception as e:
    print('Scrape failed, using inline demo table:', e)
    html = '<table><tr><th>Ticker</th><th>Price</th></tr><tr><td>AAA</td><td>101.2</td></tr></table>'
    soup = BeautifulSoup(html, 'html.parser')
    rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)

if 'Price' in df_scrape.columns:
    df_scrape['Price'] = pd.to_numeric(df_scrape['Price'], errors='coerce')
v_scrape = validate(df_scrape, list(df_scrape.columns)); v_scrape

{'missing': [], 'shape': (880, 8), 'na_total': 758}
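
The 880 rows and 758 missing values reported above are consistent with rows from more than one table being stacked under a single header, since the cell collects every `<tr>` on the page. A minimal sketch of restricting the parse to one table follows; `first_table_to_df` is a hypothetical helper, and the inline HTML uses placeholder values.

```python
import pandas as pd
from bs4 import BeautifulSoup

def first_table_to_df(html: str) -> pd.DataFrame:
    """Parse only the first <table> in `html`, using its first row as the header."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError('No <table> found')
    # A ' ' separator keeps words apart when a cell contains nested tags.
    rows = [[c.get_text(' ', strip=True) for c in tr.find_all(['th', 'td'])]
            for tr in table.find_all('tr')]
    header, *data = [r for r in rows if r]
    return pd.DataFrame(data, columns=header)

# Inline demo page with two tables; values are placeholders, not real data.
html = (
    '<table><tr><th>Ticker</th><th>Price</th></tr>'
    '<tr><td>AAA</td><td>101.2</td></tr></table>'
    '<table><tr><th>Date</th><th>Added</th><th>Removed</th></tr>'
    '<tr><td>2020-01-01</td><td>BBB</td><td>CCC</td></tr></table>'
)
df_one = first_table_to_df(html)
```

On a real page the target table may not be the first one, in which case a more specific selector is needed; which selector works depends on the page's markup and should be checked by inspecting the HTML.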

In [13]:
df_scrape

Unnamed: 0,Symbol,Security,GICSSector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,0000091142,1916
2,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,0000001800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,0001551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,0001467373,1989
...,...,...,...,...,...,...,...,...
875,"December 11, 1998",CPWR,Compuware,SUN,SunAmerica,AIGpurchased SunAmerica.[265],,
876,"June 17, 1997",CCI,Countrywide Credit Industries,USL,USLife,AIGacquired USLife.[266],,
877,"September 30, 1994",NCC,National City,MCK,McKesson,McKesson sold PCS Health Services toEli Lilly....,,
878,"July 1, 1976",BUD,Anheuser Busch,HNG,Houston Natural Gas,Major restructuring of S&P 500 to have fewer i...,,


In [14]:
_ = save_csv(df_scrape, prefix='scrape', site='example', table='markets')

Saved data/raw/scrape_site-example_table-markets_20250818-151332.csv


## Documentation
- API Source: (URL/endpoint/params)
- Scrape Source: (URL/table description)
- Assumptions & risks: (rate limits, selector fragility, schema changes)
- Confirm `.env` is not committed.