# Stage 04: Data Acquisition and Ingestion

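This notebook demonstrates two data acquisition methods: an API pull of daily AAPL prices from Alpha Vantage (with a yfinance fallback) and a web scrape of the S&P 500 constituents table on Wikipedia.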


## Setup and Imports


In [None]:
import pandas as pd
import requests
import os
from dotenv import load_dotenv
import yfinance as yf
from bs4 import BeautifulSoup
import time

load_dotenv()


## 1. API Pull (Alpha Vantage)


In [None]:
api_key = os.getenv('ALPHA_VANTAGE_API_KEY')
ticker = 'AAPL'
base_url = 'https://www.alphavantage.co/query'

params = {
    'function': 'TIME_SERIES_DAILY',
    'symbol': ticker,
    'apikey': api_key,
    'outputsize': 'compact'
}

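# Make sure the output directory exists before any branch writes to it.
os.makedirs('../data/raw', exist_ok=True)

try:
    # A timeout keeps the cell from hanging if the service is unreachable.
    response = requests.get(base_url, params=params, timeout=30)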
    response.raise_for_status()
    data = response.json()
    
    if 'Time Series (Daily)' in data:
        time_series = data['Time Series (Daily)']
        df_api = pd.DataFrame.from_dict(time_series, orient='index')
        df_api.index = pd.to_datetime(df_api.index)
        # The API labels fields '1. open' ... '5. volume'; rename them positionally.
        df_api.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
        
        for col in ['Open', 'High', 'Low', 'Close', 'Volume']:
            df_api[col] = pd.to_numeric(df_api[col])
        
        print(f"API Data Shape: {df_api.shape}")
        print(f"Missing Values: {df_api.isnull().sum().sum()}")
        print(f"Required columns present: {all(col in df_api.columns for col in ['Open', 'High', 'Low', 'Close', 'Volume'])}")
        
        df_api.to_csv('../data/raw/alpha_vantage_aapl.csv')
        print("Data saved to data/raw/alpha_vantage_aapl.csv")
        print(df_api.head())
        
    else:
        print("API Error or rate limit reached, using yfinance fallback")
        df_api = yf.download(ticker, period='1mo')
        df_api.to_csv('../data/raw/yfinance_aapl_fallback.csv')
        print("Fallback data saved to data/raw/yfinance_aapl_fallback.csv")
        print(df_api.head())
        
except Exception as e:
    print(f"Error: {e}")
    print("Using yfinance fallback")
    df_api = yf.download(ticker, period='1mo')
    df_api.to_csv('../data/raw/yfinance_aapl_fallback.csv')
    print("Fallback data saved")


## 2. Web Scraping (S&P 500 Companies)


In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

try:
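    # Send a descriptive User-Agent (placeholder value) and a timeout.
    headers = {'User-Agent': 'stage04-data-acquisition-notebook/0.1 (educational use)'}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    
    soup = BeautifulSoup(response.content, 'html.parser')
    # select_one matches on both classes even if the table carries extra ones,
    # whereas {'class': 'wikitable sortable'} needs that exact attribute string.
    table = soup.select_one('table.wikitable.sortable')
    if table is None:
        raise ValueError("Constituents table not found; page layout may have changed")
    
    rows = table.find_all('tr')[1:]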
    
    data = []
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 3:
            symbol = cols[0].text.strip()
            company = cols[1].text.strip()
            sector = cols[2].text.strip()
            data.append([symbol, company, sector])
    
    df_scraped = pd.DataFrame(data, columns=['Symbol', 'Company', 'Sector'])
    
    print(f"Scraped Data Shape: {df_scraped.shape}")
    print(f"Missing Values: {df_scraped.isnull().sum().sum()}")
    print(f"Text columns validated: {df_scraped['Symbol'].dtype == 'object' and df_scraped['Company'].dtype == 'object'}")
    
    df_scraped.to_csv('../data/raw/sp500_companies.csv', index=False)
    print("Data saved to data/raw/sp500_companies.csv")
    print(df_scraped.head())
    
except Exception as e:
    print(f"Scraping error: {e}")
    df_scraped = pd.DataFrame({'Symbol': ['AAPL', 'MSFT', 'GOOGL'], 
                              'Company': ['Apple Inc.', 'Microsoft Corp.', 'Alphabet Inc.'],
                              'Sector': ['Technology', 'Technology', 'Technology']})
    df_scraped.to_csv('../data/raw/sp500_companies_sample.csv', index=False)
    print("Sample data saved as fallback")


## 3. Documentation


### Data Sources and Parameters

**API Source:**
- URL: https://www.alphavantage.co/query
- Function: TIME_SERIES_DAILY
- Symbol: AAPL (Apple Inc.)
- API Key: Loaded from .env file
- Fallback: yfinance library for AAPL data

**Web Scraping Source:**
- URL: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
- Target: First table with class 'wikitable sortable'
- Extracted columns: Symbol, Company Name, Sector

### Validation Logic

**API Data Validation:**
- Check for required columns: Open, High, Low, Close, Volume
- Convert the price and volume columns to numeric types
- Parse the date index with `pd.to_datetime`
- Count missing values
- Verify data shape and completeness
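
The checks above can be gathered into one helper. The following is a minimal sketch, not the notebook's own code: `validate_ohlcv` and `REQUIRED_COLUMNS` are hypothetical names, and the sample frame is made-up data used only to show the report format.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Open", "High", "Low", "Close", "Volume"]


def validate_ohlcv(df: pd.DataFrame) -> dict:
    """Return a small report covering the checks listed above."""
    missing_cols = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    present = [c for c in REQUIRED_COLUMNS if c in df.columns]
    return {
        "missing_columns": missing_cols,
        # Every required column that is present must hold a numeric dtype.
        "all_numeric": all(pd.api.types.is_numeric_dtype(df[c]) for c in present),
        # The index should already have been parsed with pd.to_datetime.
        "datetime_index": isinstance(df.index, pd.DatetimeIndex),
        "missing_values": int(df[present].isnull().sum().sum()),
        "shape": df.shape,
    }


# Tiny made-up frame, for illustration only (not real market data).
sample = pd.DataFrame(
    {"Open": [1.0, 2.0], "High": [1.5, 2.5], "Low": [0.5, 1.5],
     "Close": [1.2, None], "Volume": [100, 200]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)
report = validate_ohlcv(sample)
print(report)
```

In the notebook this could be called as `validate_ohlcv(df_api)` right after the numeric conversion, with the cell raising or falling back if `missing_columns` is non-empty.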

**Scraped Data Validation:**
- Ensure text columns (Symbol, Company) are object dtype; Sector should be checked the same way
- Check for missing values in critical fields
- Expect roughly 500 rows; the scraping cell prints the shape but does not enforce a minimum
- Strip whitespace from text fields
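
One way to enforce these checks before building the DataFrame is to validate the scraped `[symbol, company, sector]` rows directly. This is a sketch under assumptions: `validate_constituent_rows` is a hypothetical helper, and the threshold of 450 is an arbitrary illustrative value chosen to sit below the roughly 500 expected rows.

```python
MIN_EXPECTED_ROWS = 450  # hypothetical threshold, not taken from the notebook


def validate_constituent_rows(rows, min_rows=MIN_EXPECTED_ROWS):
    """Check [symbol, company, sector] rows and return a list of problems."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {min_rows}")
    for i, row in enumerate(rows):
        if len(row) != 3:
            problems.append(f"row {i}: expected 3 fields, got {len(row)}")
            continue
        for name, value in zip(("Symbol", "Company", "Sector"), row):
            if not isinstance(value, str):
                problems.append(f"row {i}: {name} is not text")
            elif not value.strip():
                problems.append(f"row {i}: {name} is empty")
            elif value != value.strip():
                problems.append(f"row {i}: {name} has stray whitespace")
    return problems


# Made-up rows that trigger two of the checks.
rows = [["AAPL", "Apple Inc.", "Information Technology"],
        ["MSFT ", "Microsoft", ""]]
problems = validate_constituent_rows(rows, min_rows=2)
print(problems)
```

An empty list means every check passed. Returning the problems instead of raising lets the caller decide whether to fall back to the sample data or stop.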

### Environment Security
- .env file contains sensitive API keys
- .env is included in .gitignore to prevent accidental commits
- API keys should never be hardcoded in notebooks
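
A missing key is easier to diagnose if the notebook fails fast instead of sending `apikey=None` to the service. The sketch below assumes two hypothetical helpers, `require_env` and `mask`; neither is part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file (never to the notebook)."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so logs never show the full key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

After `load_dotenv()`, the API cell could use `api_key = require_env('ALPHA_VANTAGE_API_KEY')` and print `mask(api_key)` when confirming which key is loaded.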


## 4. Assumptions & Risks


### Assumptions
1. **API Availability**: Alpha Vantage API service is operational and accessible
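2. **Rate Limits**: API calls stay within the free-tier quota (recorded here as 5 calls/minute; Alpha Vantage has changed its free-tier limits over time, so confirm the current quota before relying on this figure)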
3. **Data Format**: API returns data in expected JSON structure with time series
4. **Web Structure**: Wikipedia S&P 500 page maintains consistent table structure
5. **Network Access**: Reliable internet connection for API calls and web scraping
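
Assumptions 2 and 3 can be tested at run time by inspecting the decoded JSON before using it. The sketch below is illustrative only: `classify_payload` is a hypothetical helper, and the `Note`, `Information`, and `Error Message` keys are assumptions about how the service reports problems, so verify them against a real response.

```python
def classify_payload(payload: dict) -> str:
    """Label a decoded JSON response as 'ok', 'rate_limited', 'error', or 'unexpected'."""
    if "Time Series (Daily)" in payload:
        return "ok"
    # Assumption: throttling notices arrive under 'Note' or 'Information'.
    if "Note" in payload or "Information" in payload:
        return "rate_limited"
    # Assumption: invalid requests carry an 'Error Message' key.
    if "Error Message" in payload:
        return "error"
    return "unexpected"


# Made-up payloads, for illustration only.
for example in ({"Time Series (Daily)": {}}, {"Note": "slow down"}, {}):
    print(classify_payload(example))
```

Separating these cases lets the fallback message say whether the quota was hit or the request itself was wrong, instead of reporting both as a single API error.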

### Risks
1. **API Rate Limiting**: Free tier limitations may cause data acquisition failures
2. **API Key Exposure**: Risk of accidentally committing sensitive credentials
3. **Web Scraping Fragility**: Website structure changes could break scraping logic
4. **Data Quality**: No guarantee of data accuracy from external sources
5. **Service Downtime**: External services may be temporarily unavailable
6. **Legal Compliance**: Web scraping may violate a site's terms of service if those terms are not checked first

### Mitigation Strategies
- Implement fallback data sources (yfinance for financial data)
- Use robust error handling with try/except blocks
- Validate data quality after acquisition
- Respect robots.txt and website terms of service
- Monitor API usage to avoid rate limits
- Secure credential management through environment variables
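
Respecting robots.txt can be automated with the standard library. The following is a minimal sketch: `is_allowed` is a hypothetical helper, and `EXAMPLE_ROBOTS` is a made-up file parsed locally so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "my-notebook", "https://example.org/wiki/Page"))
print(is_allowed(EXAMPLE_ROBOTS, "my-notebook", "https://example.org/private/x"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of parsing a string. A robots.txt check does not replace reading the site's terms of service.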
