# Extracting Stock Data Using Web Scraping

This Jupyter notebook demonstrates how to extract stock data through web scraping. We download a web page containing stock information with requests, parse the HTML with BeautifulSoup, and use pandas to organize the data.

In [39]:
# Import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

Download the webpage and save the HTML content as a variable named html_data:

In [40]:
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/amazon_data_webpage.html'

html_data = requests.get(url).text

Parse the HTML data using BeautifulSoup.

In [41]:
soup = BeautifulSoup(html_data, 'html5lib')

But what is parsing?

Parsing is the process of analyzing raw, unstructured text (here, the HTML retrieved from the page) to work out its structure so that specific fields can be extracted. BeautifulSoup turns the HTML string into a tree of tags that can be searched by tag name or attribute.
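
As a rough illustration of what parsing provides, the sketch below parses a small, made-up HTML snippet (not the page scraped in this notebook) and reads values by navigating tags. It assumes Python's built-in 'html.parser' instead of the 'html5lib' parser used above, and the names `sample_html`, `sample_soup`, `page_title`, and `first_row_cells` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only
sample_html = """
<html>
  <head><title>Sample Prices</title></head>
  <body>
    <table>
      <tbody>
        <tr><td>Jan 01, 2021</td><td>10.50</td></tr>
        <tr><td>Dec 01, 2020</td><td>9.75</td></tr>
      </tbody>
    </table>
  </body>
</html>
"""

# Assumption: 'html.parser' (bundled with Python) is used here;
# the notebook itself parses with 'html5lib'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Navigate the parse tree by tag name instead of searching the raw string
page_title = sample_soup.title.text
first_row = sample_soup.find("tbody").find("tr")
first_row_cells = [td.text for td in first_row.find_all("td")]

print(page_title)
print(first_row_cells)
```

The same `find` and `find_all` calls are what the table extraction later in this notebook relies on.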

# Questions about this Notebook

## Question 1: What is the content of the title attribute?
Here "title" refers to the page's title tag rather than an HTML attribute. We access the tag through soup.title and read the text it contains.

In [42]:
title_content = soup.title.text
print("Title content", title_content)

Title content Amazon.com, Inc. (AMZN) Stock Historical Prices & Data - Yahoo Finance
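
Because "title" can mean either a tag or an attribute in HTML, the following minimal sketch shows the difference. It uses a made-up snippet, and the names `demo_html`, `demo_soup`, `tag_text`, and `attr_value` are hypothetical; 'html.parser' is assumed here in place of 'html5lib'.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
demo_html = (
    "<html><head><title>Page Title</title></head>"
    '<body><a href="#" title="Tooltip text">link</a></body></html>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Text inside the <title> tag
tag_text = demo_soup.title.text

# Value of the title attribute on the <a> tag
attr_value = demo_soup.find("a")["title"]

print(tag_text)
print(attr_value)
```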


Let's extract the table of historical share prices and store it in a data frame named 'amazon_data'.

In [43]:
# Column order matches the price table on the page
columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]

# Collect one dict per table row, then build the data frame once
# instead of calling pd.concat on every iteration
rows = []
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    rows.append({
        "Date": col[0].text,
        "Open": col[1].text,
        "High": col[2].text,
        "Low": col[3].text,
        "Close": col[4].text,
        "Adj Close": col[5].text,
        "Volume": col[6].text,
    })

amazon_data = pd.DataFrame(rows, columns=columns)



## Question 2: What are the names of the columns in the data frame?

We can retrieve the column names using the columns attribute of the data frame.

In [44]:
column_names = amazon_data.columns
print("Column names:", column_names)

Column names: Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')


Print out the first five rows of the amazon_data data frame.

In [45]:
print(amazon_data.head())

           Date      Open      High       Low     Close Adj Close       Volume
0  Jan 01, 2021  3,270.00  3,363.89  3,086.00  3,206.20  3,206.20   71,528,900
1  Dec 01, 2020  3,188.50  3,350.65  3,072.82  3,256.93  3,256.93   77,556,200
2  Nov 01, 2020  3,061.74  3,366.80  2,950.12  3,168.04  3,168.04   90,810,500
3  Oct 01, 2020  3,208.00  3,496.24  3,019.00  3,036.15  3,036.15  116,226,100
4  Sep 01, 2020  3,489.58  3,552.25  2,871.00  3,148.73  3,148.73  115,899,300
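
Note that the scraped values are strings (for example '3,270.00'), so arithmetic on them would not work as-is. A minimal sketch of one way to convert them is shown below; it uses two rows copied from the output above, and the names `sample` and `cleaned` are hypothetical. It assumes the thousands separator is always a comma and that dates follow the 'Jan 01, 2021' pattern.

```python
import pandas as pd

# Two rows copied from the output above, kept as strings as the scraper produces them
sample = pd.DataFrame({
    "Date": ["Jan 01, 2021", "Dec 01, 2020"],
    "Open": ["3,270.00", "3,188.50"],
    "Volume": ["71,528,900", "77,556,200"],
})

cleaned = sample.copy()

# Assumption: dates follow the abbreviated-month pattern shown in the output
cleaned["Date"] = pd.to_datetime(cleaned["Date"], format="%b %d, %Y")

# Strip the thousands separators before converting to numeric types
cleaned["Open"] = cleaned["Open"].str.replace(",", "", regex=False).astype(float)
cleaned["Volume"] = cleaned["Volume"].str.replace(",", "", regex=False).astype("int64")

print(cleaned.dtypes)
```

The same pattern could be applied to the remaining price columns of amazon_data if numeric analysis is needed.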


## Question 3: What is the 'Open' of the last row of the amazon_data data frame?

In [46]:
last_open_price = amazon_data.iloc[-1]['Open']
print("Open price of the last row:", last_open_price)

Open price of the last row: 656.29
