# Homework Starter — Stage 04: Data Acquisition and Ingestion
Name: Siddhant Yadav

Date: 08/18/25

## Objectives
- API ingestion with secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [14]:
# !pip install yfinance

In [15]:
import os, pathlib, datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv


RAW = pathlib.Path('..') / 'data' / 'raw'  # portable path; a backslash literal only works on Windows
RAW.mkdir(parents=True, exist_ok=True)
load_dotenv()

print('KAGGLE_API_KEY loaded?', bool(os.getenv('kaggle_key')))

KAGGLE_API_KEY loaded? True
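
The check above only reports whether the key is present. A minimal sketch of a fail-fast variant follows. `require_env` is a hypothetical helper (not part of the starter), and `DEMO_API_KEY` is a dummy variable set only so the sketch runs without a real secret.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable {name!r}; add it to your .env file")
    return value

# Demo with a dummy value (hypothetical name, not a real credential).
os.environ["DEMO_API_KEY"] = "dummy-value"
key = require_env("DEMO_API_KEY")
print("DEMO_API_KEY loaded?", bool(key))  # report presence only, never print the secret itself
```

In the notebook, the same call could be made with `'kaggle_key'` after `load_dotenv()`, so a missing key stops the run early instead of silently switching branches.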


## Helpers (use or modify)

In [16]:
def ts():
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    mid = '_'.join([f"{k}-{v}" for k,v in meta.items()])
    path = RAW / f"{prefix}_{mid}_{ts()}.csv"
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}
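
As a quick illustration of what `validate` reports, the sketch below repeats the helper so it runs on its own and applies it to a tiny made-up frame. The column names and values are invented for the demo.

```python
import pandas as pd

def validate(df: pd.DataFrame, required):
    # Same helper as above, repeated so this sketch is self-contained.
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}

# Hypothetical demo data: one missing value and no 'volume' column.
demo = pd.DataFrame({'date': ['2020-01-02', '2020-01-03'], 'close': [72.5, None]})
report = validate(demo, required=['date', 'close', 'volume'])
print(report)  # {'missing': ['volume'], 'shape': (2, 2), 'na_total': 1}
```

A non-empty `missing` list or an unexpectedly large `na_total` is a reasonable signal to stop before calling `save_csv`.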

## Part 1 — API Pull (Required)
Choose an endpoint (e.g., Alpha Vantage, or `yfinance` as a fallback). This notebook uses the Kaggle API to download a housing-prices dataset.

In [23]:
# !kaggle datasets download -d yasserh/housing-prices-dataset

In [22]:
import zipfile
import os


SYMBOL = 'AAPL'
USE_KAGGLE = bool(os.getenv('kaggle_key'))
if USE_KAGGLE:
    # Assumes the downloaded zip sits in RAW; relative paths keep the cell runnable on other machines.
    zip_file_path = RAW / 'housing-prices-dataset.zip'
    extracted_path = RAW.parent / 'processed'

    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_path)

    print(os.listdir(extracted_path)) # Verify extracted files
else:
    print("Use yfinance")

['Housing.csv', 'summary.csv']
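
The extraction step above can also be wrapped in a small reusable function. The sketch below is one possible shape, not the notebook's actual code: `extract_and_load` is a hypothetical helper, and the demo builds a throwaway archive in a temporary directory (with an invented two-row `Housing.csv`) so it runs without the Kaggle download.

```python
import pathlib
import tempfile
import zipfile

import pandas as pd

def extract_and_load(zip_path: pathlib.Path, out_dir: pathlib.Path) -> dict:
    """Extract every file in `zip_path` into `out_dir` and load the top-level CSVs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(out_dir)
    # Map each extracted CSV's file name to its DataFrame.
    return {p.name: pd.read_csv(p) for p in sorted(out_dir.glob('*.csv'))}

# Demo: a tiny stand-in archive with invented rows, not the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    demo_zip = tmp_dir / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('Housing.csv', 'price,area\n100,50\n200,80\n')
    frames = extract_and_load(demo_zip, tmp_dir / 'processed')
    print({name: df.shape for name, df in frames.items()})
```

Returning the frames keyed by file name makes it straightforward to pass each one through `validate` before any further processing.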


## Part 2 — Scrape a Public Table (Required)
Replace `SCRAPE_URL` with a permitted page containing a simple table.
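
One way to sanity-check that a page is permitted is to consult the site's `robots.txt` before requesting it. The sketch below uses the standard library's `urllib.robotparser` on an inline, made-up `robots.txt` so it runs offline. `is_allowed` is a hypothetical helper, and a passing check is only one signal: the site's terms of use still apply.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Invented rules for illustration; a real run would download the site's own robots.txt.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

UA = 'AFE-Homework/1.0'
print(is_allowed(ROBOTS, UA, 'https://example.com/markets-table'))  # True
print(is_allowed(ROBOTS, UA, 'https://example.com/private/table'))  # False
```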

In [24]:
# Scraping a public table using yfinance

import yfinance as yf

data = yf.download("AAPL", start="2020-01-01", end="2021-01-01")
print(data.head())

[*********************100%***********************]  1 of 1 completed

Price           Close       High        Low       Open     Volume
Ticker           AAPL       AAPL       AAPL       AAPL       AAPL
Date                                                             
2020-01-02  72.538521  72.598899  71.292311  71.545897  135480400
2020-01-03  71.833305  72.594071  71.608700  71.765682  146322800
2020-01-06  72.405663  72.444306  70.702997  70.954173  118387200
2020-01-07  72.065155  72.671348  71.845377  72.415345  108872000
2020-01-08  73.224411  73.526303  71.768086  71.768086  132079200
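
The printed frame has two header levels (`Price` and `Ticker`) and keeps the dates in the index. Written with `to_csv(index=False)`, as `save_csv` does, the dates would be dropped and the header would span two rows. A hedged sketch of flattening the frame for a single ticker follows. It rebuilds the first two rows shown above by hand (rounded) instead of calling `yfinance`, and `flatten_single_ticker` is a hypothetical helper name.

```python
import pandas as pd

def flatten_single_ticker(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first column level and turn the Date index into a column."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Only safe for one ticker; with several tickers the names would collide.
        out.columns = list(out.columns.get_level_values(0))
    return out.reset_index()

# Hand-built stand-in for the yfinance output (first two rows, rounded).
cols = pd.MultiIndex.from_product(
    [['Close', 'High', 'Low', 'Open', 'Volume'], ['AAPL']], names=['Price', 'Ticker']
)
idx = pd.DatetimeIndex(['2020-01-02', '2020-01-03'], name='Date')
raw = pd.DataFrame(
    [[72.54, 72.60, 71.29, 71.55, 135480400],
     [71.83, 72.59, 71.61, 71.77, 146322800]],
    index=idx, columns=cols,
)

flat = flatten_single_ticker(raw)
print(list(flat.columns))  # ['Date', 'Close', 'High', 'Low', 'Open', 'Volume']
```

The flattened frame can then go through `validate(flat, ['Date', 'Close'])` and `save_csv(flat, prefix='api', source='yfinance', symbol='AAPL')`.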

In [25]:
# SCRAPE_URL = 'https://example.com/markets-table'  # TODO: replace with permitted page
# headers = {'User-Agent':'AFE-Homework/1.0'}

# try:
#     resp = requests.get(SCRAPE_URL, headers=headers, timeout=30); resp.raise_for_status()
#     soup = BeautifulSoup(resp.text, 'html.parser')
#     rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
#     header, *data = [r for r in rows if r]
#     df_scrape = pd.DataFrame(data, columns=header)
# except Exception as e:
#     print('Scrape failed, using inline demo table:', e)
#     html = '<table><tr><th>Ticker</th><th>Price</th></tr><tr><td>AAA</td><td>101.2</td></tr></table>'
#     soup = BeautifulSoup(html, 'html.parser')
#     rows = [[c.get_text(strip=True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
#     header, *data = [r for r in rows if r]
#     df_scrape = pd.DataFrame(data, columns=header)

# if 'Price' in df_scrape.columns:
#     df_scrape['Price'] = pd.to_numeric(df_scrape['Price'], errors='coerce')
# v_scrape = validate(df_scrape, list(df_scrape.columns)); v_scrape

In [26]:
# _ = save_csv(df_scrape, prefix='scrape', site='example', table='markets')

## Documentation
- API Source: Kaggle (`yasserh/housing-prices-dataset`)
- Scrape Source: yfinance (AAPL daily prices, 2020-01-01 to 2021-01-01)
- `.env` is not committed. Only a `.env.example` with dummy values is pushed to the repo, as instructed.