# Data Collection for Stock Price

## Objectives
The goal of this notebook is to collect stock price data from Yahoo Finance and save it in an organised CSV file. In this project we will start with Coca-Cola (ticker: KO).
This will include:
1. Setting up and opening the webpage
2. Scraping the stock data table
3. Processing the data and saving it to a CSV file

## Importing Necessary Libraries

First, we must import all the necessary libraries to run this notebook.

We will be importing:

- **pandas**: This library is used for data manipulation and analysis. It provides data structures like DataFrames that will help us organise and clean the data we scrape.
  
- **selenium**: Selenium is a powerful tool for controlling web browsers through programs. It allows us to automate web scraping by navigating to web pages and extracting data programmatically.

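- **webdriver-manager**: This library is used to automatically manage the browser drivers required by Selenium. It ensures that the correct version of ChromeDriver is used to interact with Google Chrome. It is only needed when running with Chrome; this notebook uses Safari, so it is not imported in the cell below.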

- **time**: This standard Python library is used to introduce delays in the code. We will use it to pause the execution for a few seconds, allowing the web page to fully load before scraping the data.

- **os**: This standard Python library is used to create the folder in which the CSV file will be saved.
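
Before running the import cell, it can help to confirm that the third-party libraries listed above are installed. The following is a minimal sketch, not part of the original notebook: `missing_packages` and `REQUIRED` are hypothetical names, and the list assumes you want all three third-party libraries available. Note that the webdriver-manager package is imported under the name `webdriver_manager`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names, not PyPI names: webdriver-manager is imported as webdriver_manager.
REQUIRED = ["pandas", "selenium", "webdriver_manager"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing would need to be installed (for example with pip) before the rest of the notebook can run.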

## Setting Up the Web Scraping Environment

In [38]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import os


## 1. Set Up and Open the Webpage
First, we need to set up the Safari WebDriver to control the browser. Then, we will navigate to the Yahoo Finance page with historical data for Coca-Cola.
We also catch any exception raised while loading, such as the timeout error Selenium raises if the page takes too long to load.
Note: I am using Safari because Chrome does not work on my laptop, but the code for Chrome should work in a similar manner.

In [39]:
# Set up the Safari WebDriver
driver = webdriver.Safari()

# Set the page load timeout (in seconds)
driver.set_page_load_timeout(10)

# Define the URL for Coca-Cola's historical stock data
url = 'https://finance.yahoo.com/quote/KO/history/'

try:
    driver.get(url)  # Open the URL; raises if loading exceeds the timeout
    time.sleep(5)  # Give dynamically loaded content extra time to render
except Exception as e:
    print(f"Error loading the page: {e}")
    driver.quit()  # Close the driver if there's an error
    raise  # Stop here: the following cells need a live driver
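
A fixed `time.sleep(5)` either waits longer than necessary or not long enough. One alternative is to poll until the table actually appears. The sketch below illustrates the idea in plain Python; `wait_until` is a hypothetical helper, not something the notebook defines, and the default timeout and poll interval are arbitrary assumptions. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical usage with the `driver` created in the cell above:
# tables = wait_until(lambda: driver.find_elements(By.TAG_NAME, "table"), timeout=10)
```

This works because `find_elements` returns an empty list when nothing matches, which counts as falsy, so the loop keeps polling until at least one table is present.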

## 2. Scrape the Table Data
Next, we will locate the table element on the webpage and extract the rows and columns of stock data.

In [None]:
# Scroll to the bottom of the page to load all data
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2) # Wait for new data to load
    # Check if the page height has stopped increasing
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
        
# Find the table element by XPath
try:
    table = driver.find_element(By.XPATH, '//table')
except Exception as e:
    print(f"Error finding the table: {e}")
    driver.quit()
    raise  # Stop here: `table` is undefined if the lookup failed
    
# Extract the rows from the table
rows = table.find_elements(By.TAG_NAME, 'tr')

# Parse the table data into a list of rows, each containing cell data
data = []
for row in rows:
    cells = row.find_elements(By.TAG_NAME, 'td')
    # Header rows use <th> cells, so they contain no <td> elements; skip them
    if cells:
        data.append([cell.text for cell in cells])
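
Not every scraped row is necessarily a full price row: header rows produce no `td` cells, and the table may also contain shorter rows for events such as dividends. The sketch below shows one way to separate them before building the DataFrame. `split_rows` and `EXPECTED_COLUMNS` are hypothetical names, the sample values are made up for illustration, and the assumption that short rows represent dividends or splits should be checked against the actual page.

```python
EXPECTED_COLUMNS = 7  # Date, Open, High, Low, Close, Adj Close, Volume


def split_rows(rows, expected=EXPECTED_COLUMNS):
    """Separate full price rows from other non-empty rows (e.g. dividend rows)."""
    price_rows, other_rows = [], []
    for cells in rows:
        if len(cells) == expected:
            price_rows.append(cells)
        elif cells:  # Drop empty rows, keep short rows for inspection
            other_rows.append(cells)
    return price_rows, other_rows


# Made-up sample values, for illustration only
sample = [
    [],
    ["Jan 2, 2024", "10.00", "11.00", "9.50", "10.50", "10.40", "1,000"],
    ["Jan 1, 2024", "0.25 Dividend"],
]
price_rows, other_rows = split_rows(sample)
print(len(price_rows), len(other_rows))  # prints: 1 1
```

Keeping the short rows in a separate list, rather than discarding them, makes it easy to inspect what was filtered out.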

## 3. Process the Data and Save the File
After collecting the data, we will close the browser driver, convert the data to a pandas DataFrame, and save it to a CSV file.

In [None]:
# Close the WebDriver
driver.quit()

# Convert data to a DataFrame
df = pd.DataFrame(data, columns=['Date', 'Open', 'High', 'Low', 'Close*', 'Adj Close**', 'Volume'])

# Path of the output CSV file
file_path = 'data/KO_data.csv'

# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Now save the DataFrame to the CSV file
df.to_csv(file_path, index=False)

print(f"Data saved to {file_path}")

Data saved to data/KO_data.csv
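
After saving, a quick sanity check of the file can catch problems early. The following is a minimal sketch using only the standard library; `summarize_csv` and `EXPECTED_HEADER` are hypothetical names, and the header is assumed to match the column names passed to the DataFrame above.

```python
import csv

EXPECTED_HEADER = ['Date', 'Open', 'High', 'Low', 'Close*', 'Adj Close**', 'Volume']


def summarize_csv(path, expected_header=EXPECTED_HEADER):
    """Return (total_rows, incomplete_rows) for the CSV file at `path`.

    A row counts as incomplete if it has the wrong number of fields or any
    empty field. Raises ValueError if the header does not match.
    """
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != expected_header:
            raise ValueError(f"Unexpected header: {header}")
        total = incomplete = 0
        for row in reader:
            total += 1
            if len(row) != len(header) or any(cell == '' for cell in row):
                incomplete += 1
    return total, incomplete


# Hypothetical usage with the file saved above:
# total, incomplete = summarize_csv('data/KO_data.csv')
```

A non-zero count of incomplete rows would suggest that some scraped rows did not have all seven values and may need attention during preprocessing.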


## Summary
We have now successfully scraped historical stock price data for Coca-Cola (KO) from Yahoo Finance and saved it into a CSV file for future use. In the next step of the project we will preprocess this data for stock price prediction.