# Week 6: Requests, BeautifulSoup, and API

# Requests and BeautifulSoup

The requests library is the de facto standard for making HTTP requests in Python. It abstracts the complexities of making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data in your application. 

In [None]:
import requests

# Example 1: SEC EDGAR

We will be processing financial statements from the SEC. The SEC provides a database that allows you to access every single filing of listed companies in HTML format. 

https://www.sec.gov/edgar/searchedgar/companysearch.html

Although you can apply these methods to any web page, we will concentrate on processing accounting and finance data. 

Not all the filings are easily processed. There will be variations between companies. Therefore it will take **trial and error** to extract the information that you want. 

The example below is more of an overview. 

## Before we begin

Often, when you request data from the internet using a program, the website can tell that a program is making the request. 

One way to "act like a human" in your program is to change the request header. 

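A request header looks like the following. Note that when it is left unchanged, the requests library sends a User-Agent of the form `python-requests/<version>` rather than the browser-style value shown here: 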

Accept-Encoding identity

User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36

In [None]:
# Changing the header of the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

print (headers)

There are many ways for websites to check whether a request comes from a human, but among the HTTP headers, the setting that usually matters most is the User-Agent.

There are many other settings within the header that you can explore, such as the Accept-Language header, which can possibly get you the website in a different language. 
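To see which headers would actually go out, a request can be prepared without being sent. The sketch below assumes the `requests` library is installed; the URL is only a placeholder and nothing is transmitted over the network.

```python
import requests

# Same style of custom header as above; the URL is a placeholder.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'}
url = 'https://example.com/'

# Headers that requests would send if nothing were overridden.
default_headers = requests.utils.default_headers()
print(default_headers['User-Agent'])

# Prepare (but do not send) a request to inspect the merged headers.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', url, headers=headers))
print(prepared.headers['User-Agent'])
print(prepared.headers['Accept-Encoding'])
```

The custom User-Agent replaces the default one, while the remaining default headers are still sent.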

In [None]:
# We will be using the following URL. 

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000162828025003063/0001628280-25-003063.txt'


response = requests.get(url, headers=headers)

# Make up an error URL
errorurl = 'https://www.amazon.com/404' 

# Will be used later as an example
error_response = requests.get(errorurl, headers=headers)

In [None]:
print (response)

In [None]:
# The Error 404 "Page not found" is the error page displayed whenever someone asks for a page that’s simply not available on your site. 
# The reason for this is that there may be a link on your site that was wrong or the page might have been recently removed from the site. 
# As there is no web page to display, the web server sends a page that simply says "404 Page not found".
print (error_response)

In [None]:
type(response)

In [None]:
response.status_code

In [None]:
error_response.status_code

With this status code information you can create an error check, so that when you are going through a list of URLs you will know which ones are not valid. 
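As a rough sketch of such a check over a list of URLs, the helper below collects every URL whose status code is not 200. The function name `find_invalid_urls`, the `fetch_status` parameter, and the example URLs are all hypothetical; canned status codes stand in for real requests so the example runs offline.

```python
def find_invalid_urls(urls, fetch_status):
    """Return (url, status) pairs for every URL whose status is not 200."""
    invalid = []
    for url in urls:
        status = fetch_status(url)
        if status != 200:
            invalid.append((url, status))
    return invalid

# Canned statuses stand in for real requests (hypothetical URLs).
canned = {
    'https://example.com/ok': 200,
    'https://example.com/missing': 404,
}
invalid = find_invalid_urls(canned, canned.get)
print(invalid)
```

With real requests, `fetch_status` could be something like `lambda u: requests.get(u, headers=headers).status_code`.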

In [None]:
if response.status_code == 200:
    print('Success!')
elif response.status_code == 404: 
    print('Not Found')

In [None]:
if error_response.status_code == 200:
    print('Success!')
elif error_response.status_code == 404: 
    print('Not Found')

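But if you noticed earlier, when we printed response it already showed a 200 code because the request was successful. A Response object also evaluates to True as long as its status code is not an error code (400 or above), so you can test the response directly. 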

In [None]:
if response:
    print('Success!')
else:
    print('An error has occurred.')

In [None]:
if error_response:
    print('Success!')
else:
    print('An error has occurred.')
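The truthiness check is broader than comparing against 200: redirects and other non-error codes also count as success. The sketch below, which assumes `requests` is installed, builds `Response` objects by hand (no network call) to show this, and also shows `raise_for_status()`, which raises an exception for error codes. The helper name `summarize` is made up for this illustration.

```python
import requests

def summarize(status_code):
    # Build a Response by hand so the example runs without a network call.
    resp = requests.Response()
    resp.status_code = status_code
    try:
        resp.raise_for_status()   # raises HTTPError for 4xx and 5xx codes
    except requests.HTTPError:
        return (bool(resp), 'error')
    return (bool(resp), 'ok')

results = {code: summarize(code) for code in (200, 301, 404, 500)}
print(results)
```

If you specifically need a 200, keep the explicit `status_code == 200` comparison.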

Now that we know that the request was successful, we can view the payload, or body, of the response. 

In [None]:
#This allows you to view the raw bytes (displayed with a b'' prefix).
response.content

In [None]:
#Often you want the body decoded into text instead
response.text

Below is another useful attribute: the headers that the server sent back with the response. 

In [None]:
response.headers

In [None]:
#Now let's save the body to a file so we don't have to constantly request it
# This file appears in the same folder as this Jupyter file.
filename = 'Tesla_10K_20241231.html'
# The with statement closes the file automatically, even if the write fails
with open(filename, 'w', encoding="utf-8") as outputfile:
    outputfile.write(response.text)
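One way to avoid repeated requests is to check for the saved file first and only fetch when it is missing. This is a minimal sketch using only the standard library; `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real call such as `requests.get(url, headers=headers).text`.

```python
import tempfile
from pathlib import Path

def load_or_fetch(path, fetch_text):
    """Return cached text from path, calling fetch_text() only on a cache miss."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding='utf-8')
    text = fetch_text()
    path.write_text(text, encoding='utf-8')
    return text

calls = []

def fake_fetch():
    # Stand-in for a real download; records how often it is called.
    calls.append(1)
    return '<html><body>filing</body></html>'

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'filing.html'
    first = load_or_fetch(target, fake_fetch)    # cache miss: fetches and saves
    second = load_or_fetch(target, fake_fetch)   # cache hit: reads the file

print(len(calls))
```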

Now that we have the file, let's process it using Beautiful Soup 4.


In [None]:
from bs4 import BeautifulSoup
#https://www.w3schools.com/html/html_intro.asp

In [None]:
# Input the file that we just saved
# Read with the same encoding that was used when writing the file
inputfile = open(filename, 'r', encoding='utf-8')

In [None]:
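# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(inputfile, 'html.parser')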
soup

In [None]:
#Let's clean up the HTML
print(soup.prettify())
# soup.prettify() returns the HTML with each tag indented on its own line, which is much easier to read.

Now let's try to look at some methods in beautiful soup.

To use Beautiful Soup, you will have to understand the structure of an HTML file. The easiest structures to manipulate are usually well defined, but in the real world that is usually not the case. 

That is why, when you are mining for data, it is often a case of trial and error. 

You have to look for patterns that you could easily exploit and make rules for. 

Beautiful Soup works best when the HTML follows the rules that are defined for HTML documents, but as stated before this isn't usually the case.


In [None]:
# This will give you the title tag. We could use this for any of the tag pairs in the document.
soup.title

### When you are navigating through the document, the simplest way is to use the tag name as shown below. The downside is that the tag name will only give you the first tag by that name. 

In [None]:
soup.title.name

In [None]:
soup.title.parent.name

In [None]:
# <p> tag defines a paragraph.
soup.p

In [None]:
# <a> tag defines a hyperlink, which is used to link from one page to another.
soup.a

In [None]:
soup.body

In [None]:
#The list of children is available in contents
soup.body.contents

In [None]:
type(soup.body.contents)

###  In order to look for the other tags, you will have to use find_all
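The difference between tag-name access and `find_all` is easiest to see on a tiny document. This sketch assumes `bs4` is installed and uses a made-up HTML string rather than the filing.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Sample filing</title></head>
<body>
<p>First paragraph</p>
<p>Second paragraph</p>
</body></html>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

first_p = soup_demo.p               # tag-name access: the first <p> only
all_p = soup_demo.find_all('p')     # every <p>, in document order

print(first_p.text)
print([p.text for p in all_p])
print(soup_demo.title.parent.name)
```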

In [None]:
# Find all <a> tags
anchor_tags = soup.find_all('a')

In [None]:
anchor_tags[1]

In [None]:
anchor_tags[1].text

In [None]:
anchor_tags[3]

In [None]:
anchor_tags[3]['href']

href:

(Hypertext REFerence) The HTML attribute used to create a link to another page. The href is an attribute of the anchor 
tag, which is also used to identify sections within a document. A link has two components: the URL, which 
is the value of href, and the clickable text between the opening and closing anchor tags, called the "anchor text."
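The two parts of a link can be read separately in Beautiful Soup. This is a small sketch on a made-up snippet, assuming `bs4` is installed; the second anchor deliberately has no href to show the safer `.get` lookup.

```python
from bs4 import BeautifulSoup

snippet = '<p>See <a href="#mda">Management Discussion</a> and <a name="top">the top</a>.</p>'
demo = BeautifulSoup(snippet, 'html.parser')
links = demo.find_all('a')

link = links[0]
url_part = link['href']            # the URL (here a fragment inside the page)
text_part = link.text              # the anchor text shown to the reader
missing = links[1].get('href')     # .get returns None instead of raising KeyError

print(url_part, '|', text_part, '|', missing)
```

Indexing with `['href']` raises a `KeyError` when the attribute is absent, so `.get('href')` is the safer choice when looping over every anchor in a filing.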

### If you know the id or a specific href, you could use the find function.

In [None]:
soup.find(href='#mda')

In [None]:
# An HTML table consists of one <table> element and one or more <tr>, <th>, and <td> elements.

# The <tr> element defines a table row, the <th> element defines a table header, and the <td> element defines a table cell.

# An HTML table may also include <caption>, <colgroup>, <thead>, <tfoot>, and <tbody> elements.

html_tables = soup.find_all('table')

In [None]:
html_tables

In [None]:
soup.body

In [None]:
soup.body.get_text()

# Example 2: S&P 500 List

As you can see, it can be quite difficult to work with SEC filings due to their nature. Processing them requires another skill set: Natural Language Processing and regular expressions. 

Let's try to extract data from something with more structure. 

We will try to extract the list of stocks in the S&P 500 from Wikipedia: 
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies


In [None]:
Sp500Wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

In [None]:
#Let's grab the HTML just like above

spResponse = requests.get(Sp500Wiki, headers = headers)

In [None]:
#Let's do some error checking
if spResponse.status_code==200:
    print('Success!')
else:
    print('An error has occurred.')

In [None]:
#Let's save the file to our local directory so we don't have to keep requesting data.
sp500filename = 'wikiSP500.html'

#Open the Stream 
outputfile = open(sp500filename, 'w', encoding = "utf-8")

#Write to the file
outputfile.write(spResponse.text)

#Close the file
outputfile.close()

Did we run into an encoding error? You will run into these quite often, so let's check.

In [None]:
spResponse.encoding

In [None]:
spResponse.encoding = 'utf-8'

In [None]:
#Open the Stream 
outputfile = open(sp500filename, 'w', encoding="utf-8")

#Write to the file
outputfile.write(spResponse.text)

#Close the file
outputfile.close()

In [None]:
#Now that the file is saved, let's load it into soup. You can choose to load it from the file or from the response object. 

#You could use the file if you do not have internet access to make the request again. 
soup = BeautifulSoup(spResponse.text, 'html.parser')

So now let's take this opportunity to look at how the file is formatted. 

What do you see? 

You will notice that all the symbols are in a neat table, so let's try to extract them from it. 



In [None]:
soup.table

In [None]:
soup.tr

In [None]:
rows = soup.find_all('tr')

In [None]:
#What do you see below? 
rows[1]

In [None]:
#So you notice that symbol is the first element between the td so you try below
rows[1].td

In [None]:
#So now you see that it is between anchor tags, so try 
rows[1].td.a

In [None]:
#Now you want the text between the tags
rows[1].td.a.text

So now you can do it for 1 of the stocks in the S&P500, so how would you get them into a list?

In [None]:
#Let's initialize a list first
sp500Stocks = []

In [None]:
#Let's check the size of the rows list. 
len(rows)

In [None]:
rows[623]

That doesn't seem right. Why are there more than 500 rows? Look at the data again.

There are two tables in the HTML, so you have to be more specific with your find_all. 
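Another way to be specific is to select the table by an attribute before collecting its rows. The sketch below uses a made-up two-table page, and the `id` values are hypothetical; check the real page source for the attributes it actually uses. It assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

page = """
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td><a href="#">MMM</a></td><td>3M</td></tr>
  <tr><td><a href="#">AOS</a></td><td>A. O. Smith</td></tr>
</table>
<table id="changes">
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>2024-01-01</td><td>XYZ</td></tr>
</table>
"""
doc = BeautifulSoup(page, 'html.parser')

all_rows = doc.find_all('tr')                     # rows from both tables
table = doc.find('table', id='constituents')      # hypothetical id
table_rows = table.find_all('tr')                 # rows from one table only

# Keep only rows whose first cell exists and contains a link.
symbols = [
    row.td.a.text
    for row in table_rows
    if row.td is not None and row.td.a is not None
]
print(len(all_rows), len(table_rows), symbols)
```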

In [None]:
sp500_rows = soup.table.find_all('tr')

In [None]:
len(sp500_rows)

In [None]:
#That seems a little better, let's see why we have 504 elements
sp500_rows[503]

In [None]:
#let's try to extract what we can

for row in sp500_rows:
    # Skip header rows (no <td>) and rows whose first cell has no link
    if row.td is None or row.td.a is None:
        continue
    print(row.td.a.text)
    sp500Stocks.append(row.td.a.text)

In [None]:
len(sp500Stocks)

Despite being referred to as “500,” the index actually contains 503 listed stocks, because some companies are represented by more than one share class. For instance, Google’s parent company Alphabet has both Class A and Class C shares in the index.

Today, the S&P 500 index covers approximately 80 percent of available market capitalization (or the total dollar market value of all listed firms’ outstanding shares).

# Using APIs with Python

Many times there is already a library that allows you to easily connect to an API. There are libraries for the Bloomberg API and many others available for you to install. 

We will be utilizing a free stock data API known as AlphaVantage. 

Please take the time to sign up at AlphaVantage, using the link below, to get your API key. 

https://www.alphavantage.co/

We will be utilizing the wrapper 
https://github.com/RomelTorres/alpha_vantage

Please install it with pip install alpha_vantage
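Rather than typing the key directly into the notebook, one option is to keep it in an environment variable and look it up at run time. This is only a sketch: the helper `get_api_key` and the variable name `ALPHAVANTAGE_API_KEY` are assumptions made for illustration, and `'demo'` is a placeholder, not a working key.

```python
import os

def get_api_key(env=os.environ, name='ALPHAVANTAGE_API_KEY'):
    """Look up the API key in the environment instead of hard-coding it."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f'Set the {name} environment variable first.')
    return key

# Demonstration with a stand-in mapping so the example runs anywhere.
demo_key = get_api_key({'ALPHAVANTAGE_API_KEY': 'demo'})
print(demo_key)
```

The result could then be passed along as `TimeSeries(key=get_api_key(), output_format='pandas')`, which keeps the key out of any notebook you share.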

# Example 1: AlphaVantage

In [None]:
!pip install alpha_vantage

In [None]:
#So let's import the library in which we will be using the timeseries module 

from alpha_vantage.timeseries import TimeSeries

ts = TimeSeries(key='HNQ62X7XN1BICIIA', output_format='pandas')
data = ts.get_intraday(symbol='DIA', outputsize='full') # DIA: Dow Jones Industrial Average ETF
# For intraday data, outputsize='compact' requests only the latest 100 data points, and outputsize='full' requests the full intraday series that the API makes available

# The adjusted closing price amends a stock's closing price to reflect that stock's value after accounting for 
# any corporate actions, such as stock splits, dividends, and rights offerings.

In [None]:
data

In [None]:
data, meta_data = ts.get_intraday(symbol='MSFT',interval='1min', outputsize='full')
# meta data: a set of data that describes and gives information about other data
# There are two parts in the API response: "Meta Data" and "Time Series"
# The library is mapping meta_data to "Meta Data" and "Time Series" to data

In [None]:
data.head()

In [None]:
meta_data

In [None]:
data, meta_data

In [None]:
data, meta_data = ts.get_intraday(symbol='MSFT', outputsize='full')

In [None]:
data.head()

In [None]:
data, meta_data = ts.get_intraday(symbol='MSFT', outputsize='compact')

In [None]:
data.describe()

In [None]:
data.tail()

In [None]:
data.head()

In [None]:
#now let's quickly save this to csv 
filename = 'microsoft.csv'

data.to_csv(filename)

Let's say we want to grab data for multiple stocks. Because the API is free, there is a limit on how fast you can make requests, so we should save the data locally instead of requesting it again every time. So what do you do?

In [None]:
#Let's just take the first 5 stocks from our sp500 list

data_request = sp500Stocks[:5]

In [None]:
data_request

In [None]:
#Now let's loop to get the data, pausing between requests so we don't overwhelm the API.

import time
from random import randint

for ticker in data_request:

    data, meta_data = ts.get_intraday(symbol=ticker, outputsize='full')

    #Now we have to create a filename dynamically, so we will use string manipulation
    filename = ticker + '.csv'

    data.to_csv(filename)

    #The API limits you to 5 calls per minute, so wait at least 12 seconds between calls
    time.sleep(randint(12, 15))
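The pause needed to respect a calls-per-minute limit can be computed rather than guessed. The sketch below is a generic pacing helper using only the standard library; `paced` is a hypothetical name, and the `sleep` parameter is injectable so the demonstration records the pauses instead of actually waiting.

```python
import time

def paced(items, calls_per_minute=5, sleep=time.sleep):
    """Yield items, pausing between them so the call rate does not exceed the limit."""
    interval = 60 / calls_per_minute
    for index, item in enumerate(items):
        if index > 0:
            sleep(interval)   # no pause is needed before the first call
        yield item

pauses = []
tickers = ['MMM', 'AOS', 'ABT']
visited = list(paced(tickers, calls_per_minute=5, sleep=pauses.append))
print(visited, pauses)
```

In the download loop this would read `for ticker in paced(data_request):`, leaving the default `time.sleep` in place.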

In [None]:
import pandas as pd

data = pd.read_csv('MMM.csv')
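When reading a saved file back, it is often convenient to restore the timestamp column as the index. The sketch below assumes `pandas` is installed and uses an in-memory CSV with made-up prices; the column names are assumptions about what the saved file looks like, so check your own file's header.

```python
import io
import pandas as pd

# Made-up sample in the shape of a saved intraday file (column names assumed).
csv_text = """date,1. open,4. close,5. volume
2024-01-02 09:30:00,100.0,100.5,1200
2024-01-02 09:31:00,100.5,100.2,900
"""

# index_col=0 uses the first column as the index; parse_dates=True parses it as timestamps.
prices = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(prices.index.dtype)
print(prices['4. close'].mean())
```

For the saved file, the same arguments would apply as `pd.read_csv('MMM.csv', index_col=0, parse_dates=True)`.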