## Week 2 - Scraping Data

#### Name: Ruixin Huang 

#### NIM: 2802466713

##### Task Description:

1. Try to scrape data from any HTML dataset. Pick your favourite website that may be useful for your project topic later on.
2. Store the data in CSV format, and try to get at least 5 columns for each dataset.
3. To go the extra mile, you can also scrape as much as possible from more than one source.

The website I chose is **Yahoo Finance**.
I originally tried scraping the HTML pages directly, but the data was not always accessible that way, so I use the Yahoo Finance API through the `yfinance` library instead.



**Plans**

1. **Understanding the Web Structure**:

- Yahoo Finance contains useful data like stock prices, financial statements, and historical data for various companies.
- The information typically lies within tables, spans, or other HTML tags, which we need to target in the scraping process.

2. **Selecting the Data**:

- **Technical Indicator Data**: Extract technical indicators such as moving averages and the Relative Strength Index (RSI).

- **Historical Data**: Gather historical data (daily close, open, high, low, and volume) for a time-series dataset.

- **Multiple Stock Data**: Extend the scraping to multiple stocks for diversified trading strategies.

- **Scalable and Modular Code**: Create functions for scalability and better maintainability.

- **Error Handling and Logging**: Implement robust error handling and logging for better diagnostics.

- **Efficient Data Storage**: Store the scraped data in CSV format with appropriate labels and historical data, making it easier for later use in machine learning models.

3. **Details of the Indicators I Used**

- 1. Simple Moving Average (SMA)
    - Definition: The Simple Moving Average (SMA) is calculated by taking the average of a stock’s closing prices over a specific period. It smooths out price data by creating a constantly updated average price.
    - Calculation: In this project, we calculate the 20-day SMA by averaging the stock’s closing prices over the last 20 trading days.
    - Purpose:
        - SMA helps identify trends over time. For example, if the stock price is consistently above its 20-day SMA, it could indicate an uptrend, while prices below the SMA might suggest a downtrend.
        - Traders often use SMA to smooth out short-term fluctuations and make it easier to spot longer-term trends.
- 2. Exponential Moving Average (EMA)
    - Definition: The Exponential Moving Average (EMA) is similar to the SMA but gives more weight to recent prices. This makes the EMA more sensitive to recent price changes compared to the SMA.
    - Calculation: We calculate the 20-day EMA in this project, meaning that recent prices have a stronger influence on the average than prices from earlier in the period.
    - Purpose:
        - EMA responds faster to recent price movements, which is particularly useful for detecting trend changes early.
        - EMA is often preferred by traders who want to react more quickly to market changes than the slower-moving SMA allows.
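
To make the two definitions concrete, here is a minimal pure-Python sketch of both averages on a toy price list. It assumes the EMA recursion with smoothing factor `alpha = 2 / (span + 1)`, seeded with the first price, which is what pandas computes for `ewm(span=..., adjust=False).mean()`. The helper names `simple_moving_average` and `exponential_moving_average`, the made-up prices, and the 3-day window are illustrative only and are not part of the project code.

```python
def simple_moving_average(prices, window):
    """Return the SMA series; positions without a full window are None."""
    result = []
    for i in range(len(prices)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            window_slice = prices[i + 1 - window : i + 1]
            result.append(sum(window_slice) / window)
    return result


def exponential_moving_average(prices, span):
    """Return the EMA series with alpha = 2 / (span + 1), seeded with the first price."""
    alpha = 2 / (span + 1)
    result = []
    for price in prices:
        if not result:
            result.append(float(price))  # first value is the price itself
        else:
            result.append(alpha * price + (1 - alpha) * result[-1])
    return result


toy_prices = [10, 11, 12, 13, 14]  # made-up closing prices, not real quotes
sma_3 = simple_moving_average(toy_prices, window=3)
ema_3 = exponential_moving_average(toy_prices, span=3)
print(sma_3)  # [None, None, 11.0, 12.0, 13.0]
print(ema_3)  # [10.0, 10.5, 11.25, 12.125, 13.0625]
```

On the last day of this rising toy series the EMA (13.0625) sits slightly above the SMA (13.0), because the EMA gives the most recent price more weight.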


pip install yfinance


In [1]:
import yfinance as yf
import pandas as pd
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

# Setup logging for better debugging and monitoring
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

In [2]:

### Function to get the stock summary
def get_stock_summary(symbol):
    """
    Fetch stock summary data using Yahoo Finance API (yfinance).
    Args:
        symbol (str): Stock ticker symbol (e.g., 'GOOG').
    Returns:
        dict: A dictionary containing stock summary data.
    """
    # Create a Ticker object for the provided symbol
    stock = yf.Ticker(symbol)
    
    try:
        # Get the stock's information
        stock_info = stock.info
        
        # Prepare the stock summary data in dictionary format
        stock_data = {
            'Ticker': stock_info.get('symbol', 'N/A'),  # Ticker symbol (e.g., 'GOOG')
            'Current Price': stock_info.get('regularMarketPrice', 'N/A'),  # Current market price
            'Previous Close': stock_info.get('previousClose', 'N/A'),  # Previous day's close price
            'Open': stock_info.get('open', 'N/A'),  # Opening price of the current trading day
            '52 Week Range': f"{stock_info.get('fiftyTwoWeekLow', 'N/A')} - {stock_info.get('fiftyTwoWeekHigh', 'N/A')}",  # 52-week low and high price
            'Market Cap': stock_info.get('marketCap', 'N/A'),  # Total market capitalization
            'Volume': stock_info.get('volume', 'N/A'),  # Current volume traded
        }
        
        logging.info(f"Successfully fetched summary data for {symbol}")
        return stock_data  # Return the fetched data
    
    except Exception as e:
        logging.error(f"Error fetching summary data for {symbol}: {e}")
        return None  # Return None if an error occurs

### Function to get the historical stock data
def get_historical_data(symbol, period='1mo'):
    """
    Fetch historical stock data (OHLCV) using Yahoo Finance API (yfinance).
    Args:
        symbol (str): Stock ticker symbol (e.g., 'GOOG').
        period (str): Time period for historical data (e.g., '1mo', '6mo', '1y', '5y').
    Returns:
        pd.DataFrame: DataFrame containing historical stock data with technical indicators.
    """
    try:
        # Create a Ticker object for the given symbol
        stock = yf.Ticker(symbol)
        
        # Get the historical stock data for the specified period
        historical_data = stock.history(period=period)
        
        # Check if data is empty
        if historical_data.empty:
            logging.error(f"No historical data found for {symbol}")
            return None  # Return None if no data is available
        
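        # Calculate the 20-day Simple Moving Average (SMA).
        # Note: the first 19 rows are NaN because the window is not yet full, so with
        # period='1mo' (roughly 20 trading days) only the last few rows, if any, get a value.
        historical_data['SMA_20'] = historical_data['Close'].rolling(window=20).mean()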
        
        # Calculate the 20-day Exponential Moving Average (EMA)
        historical_data['EMA_20'] = historical_data['Close'].ewm(span=20, adjust=False).mean()
        
        # Reset the index to make 'Date' a column in the DataFrame
        historical_data.reset_index(inplace=True)
        
        logging.info(f"Successfully fetched historical data for {symbol}")
        return historical_data  # Return the historical data with added indicators
    
    except Exception as e:
        logging.error(f"Error fetching historical data for {symbol}: {e}")
        return None  # Return None if an error occurs

### Function to save data to CSV
def save_data_to_csv(data, filename):
    """
    Save data to CSV file.
    Args:
        data (pd.DataFrame): Data to be saved.
        filename (str): Output CSV filename.
    """
    try:
        # Save the DataFrame to a CSV file
        data.to_csv(filename, index=False)
        logging.info(f"Data saved to {filename}")
    
    except Exception as e:
        logging.error(f"Error saving data to {filename}: {e}")

### Wrapper function to fetch and save both stock summary and historical data
def fetch_and_save_stock_data(symbol, period='1mo'):
    """
    Wrapper function to fetch both stock summary and historical data, then save to CSV.
    Args:
        symbol (str): Stock ticker symbol (e.g., 'GOOG').
        period (str): Time period for historical data (default is '1mo').
    """
    logging.info(f"Fetching data for {symbol}...")  # Log the start of data fetching
    
    # Fetch stock summary
    stock_summary = get_stock_summary(symbol)
    
    # If stock summary data is fetched successfully, save it to CSV
    if stock_summary:
        pd.DataFrame([stock_summary]).to_csv(f'{symbol}_summary.csv', index=False)
    
    # Fetch historical data
    historical_data = get_historical_data(symbol, period)
    
    # If historical data is fetched successfully, save it to CSV
    if historical_data is not None:
        save_data_to_csv(historical_data, f'{symbol}_historical_data.csv')

### Function to fetch data for multiple symbols using multi-threading
def multi_threaded_stock_data_fetch(stock_symbols, period='1mo', max_workers=5):
    """
    Multi-threaded function to fetch stock data for multiple symbols.
    Args:
        stock_symbols (list): List of stock ticker symbols.
        period (str): Time period for historical data (default is '1mo').
        max_workers (int): Maximum number of threads to run concurrently.
    """
    # Start a timer to track the duration of the fetching process
    start_time = time.time()
    
    # Use ThreadPoolExecutor to fetch data concurrently for multiple stocks
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Create a dictionary to map each future to its corresponding stock symbol
        future_to_symbol = {executor.submit(fetch_and_save_stock_data, symbol, period): symbol for symbol in stock_symbols}
        
        # Process each future as it completes
        for future in as_completed(future_to_symbol):
            symbol = future_to_symbol[future]
            
            try:
                # If the future completes successfully, log the completion
                future.result()
                logging.info(f"Completed fetching data for {symbol}")
            except Exception as e:
                logging.error(f"Error fetching data for {symbol}: {e}")
    
    # Calculate the total time taken for fetching data
    end_time = time.time()
    logging.info(f"Finished fetching data for all stocks in {end_time - start_time:.2f} seconds")

### Main execution block
if __name__ == "__main__":
    # List of stock symbols to fetch data for
    stock_symbols = ['GOOG', 'AAPL', 'MSFT', 'AMZN', 'TSLA']
    
    # Run multi-threaded data fetching for the list of stock symbols
    multi_threaded_stock_data_fetch(stock_symbols, period='1mo', max_workers=3)


2024-09-18 22:11:51,363 - INFO - Fetching data for GOOG...
2024-09-18 22:11:51,363 - INFO - Fetching data for AAPL...
2024-09-18 22:11:51,363 - INFO - Fetching data for MSFT...
2024-09-18 22:11:53,470 - INFO - Successfully fetched summary data for AAPL
2024-09-18 22:11:53,892 - INFO - Successfully fetched historical data for AAPL
2024-09-18 22:11:53,894 - INFO - Data saved to AAPL_historical_data.csv
2024-09-18 22:11:53,895 - INFO - Fetching data for AMZN...
2024-09-18 22:11:53,895 - INFO - Completed fetching data for AAPL
2024-09-18 22:11:54,438 - INFO - Successfully fetched summary data for AMZN
2024-09-18 22:11:54,721 - INFO - Successfully fetched historical data for AMZN
2024-09-18 22:11:54,724 - INFO - Data saved to AMZN_historical_data.csv
2024-09-18 22:11:54,725 - INFO - Fetching data for TSLA...
2024-09-18 22:11:54,725 - INFO - Completed fetching data for AMZN
2024-09-18 22:11:55,311 - INFO - Successfully fetched summary data for TSLA
2024-09-18 22:11:55,441 - INFO - Successful

Some Points:
1. Technical Indicators:
- The Simple Moving Average (SMA) and Exponential Moving Average (EMA) are calculated for each stock to expose price trends; these are useful inputs for quantitative trading strategies and GAN-based models.

2. Multi-threading:
- Using `ThreadPoolExecutor` from the `concurrent.futures` module, the code fetches data for several stocks concurrently. Because each request spends most of its time waiting on the network, overlapping the requests shortens the total run time.

3. Modular Design:
- The code is split into modular functions (get_stock_summary(), get_historical_data(), save_data_to_csv(), etc.), making it scalable and maintainable.

4. Logging and Error Handling:
- Every fetch and save is logged with a timestamp and level, and the fetch functions catch exceptions, log them, and return `None`, so one failing symbol does not stop the others.

5. Data Storage:
- The data is saved to separate CSV files for each stock, with both stock summary and historical data being stored independently.

6. Customizable Time Period:
- The time period for historical data is easy to change (e.g., '1mo', '6mo', '1y', '5y') through the `period` argument, so you can fetch exactly the data you need.

7. Parallel Processing:
- The `max_workers` argument controls how many symbols are fetched at once, so the same code scales from a handful of tickers to a larger list of assets.
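
The multi-threading pattern from points 2 and 7 can be tried without any network access. The sketch below replaces the real fetch with a hypothetical `fake_fetch` that only sleeps for 0.2 seconds to imitate a slow request; the function names, the delay, and the returned record are assumptions for illustration and do not come from `yfinance`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(symbol, delay=0.2):
    """Hypothetical stand-in for a network call: wait, then return a small record."""
    time.sleep(delay)
    return {"Ticker": symbol, "status": "ok"}


def fetch_all(symbols, max_workers=3):
    """Run fake_fetch for every symbol in a thread pool and collect results by symbol."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its symbol so results can be labelled on completion
        future_to_symbol = {executor.submit(fake_fetch, s): s for s in symbols}
        for future in as_completed(future_to_symbol):
            symbol = future_to_symbol[future]
            results[symbol] = future.result()
    return results


symbols = ["GOOG", "AAPL", "MSFT"]

# Baseline: one request after another
start = time.perf_counter()
sequential = {s: fake_fetch(s) for s in symbols}
sequential_seconds = time.perf_counter() - start

# Same work, but the waits overlap in the thread pool
start = time.perf_counter()
threaded = fetch_all(symbols)
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

The sequential loop has to wait for each simulated request in turn, while the pool overlaps the waits. With real requests the gain is also limited by `max_workers` and by how many concurrent requests the data source tolerates.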

- Why this is better than direct scraping:
    - Libraries like `yfinance` return structured data and are often more efficient and stable than scraping HTML directly, which can break easily when the website changes or blocks automated requests.
- Store the Data into CSV Format
    - The code automatically stores two types of CSV files for each stock:
        - Stock summary CSVs (`{symbol}_summary.csv`): contain fields such as Ticker, Current Price, and Market Cap, with at least 5 columns.
        - Historical data CSVs (`{symbol}_historical_data.csv`): contain Open, High, Low, Close, and Volume (OHLCV) plus the SMA and EMA indicators.
- At Least 5 Columns for Each Data
    - Summary Data: Ticker, Current Price, Previous Close, Open, 52 Week Range, Market Cap, Volume.
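    - Historical Data: Date, Open, High, Low, Close, Volume, Dividends, Stock Splits, SMA_20, EMA_20. (`Ticker.history()` returns adjusted prices by default, so there is no separate Adj Close column.)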
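
A quick way to confirm the five-column requirement is to read back only the header row of a saved file. The sketch below writes a placeholder summary row (the values are not real quotes) to a temporary file and reads its header with the standard `csv` module. `read_csv_header` is a hypothetical helper; in practice it would be pointed at a file such as `GOOG_summary.csv` produced by the code above.

```python
import csv
import os
import tempfile

# Column names taken from the summary dictionary built in get_stock_summary()
SUMMARY_COLUMNS = ["Ticker", "Current Price", "Previous Close", "Open",
                   "52 Week Range", "Market Cap", "Volume"]


def read_csv_header(path):
    """Return the list of column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


# Placeholder row: every field is 'N/A' except the ticker
placeholder_row = {name: "N/A" for name in SUMMARY_COLUMNS}
placeholder_row["Ticker"] = "GOOG"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "GOOG_summary.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=SUMMARY_COLUMNS)
        writer.writeheader()
        writer.writerow(placeholder_row)
    header = read_csv_header(sample_path)

print(len(header), header)
```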