# Simple Webscraping with Beautiful Soup
Scraping the web with Beautiful Soup requires two libraries: `requests`, which downloads the page, and `bs4`, which provides the `BeautifulSoup` class that parses the HTML.

Useful Commands:
- `Shift + Return`: Runs the current cell and moves to the next one.
- `Control + Return`: Runs the current cell and stays in the same cell.
- `Option + Return`: Runs the current cell and inserts a new one below.



In [1]:
from bs4 import BeautifulSoup
import requests # Aids in downloading the webpage.

## Creating Soup Objects
A soup object is created by passing an HTML string and the name of a parser to the `BeautifulSoup` constructor.

In [2]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [3]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

The `prettify()` method returns the HTML as an indented string with one tag per line. The nesting makes the parent, child, and sibling relationships between tags easier to see.

In [4]:
print(soup.prettify())  # Call the method; without parentheses only the bound method is printed

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


## Accessing Tags
If more than one element has the same tag name, attribute access such as `soup.title` returns only the first one.

In [5]:
tag_object = soup.title # The <title></title> tag
print(f'Tag Object: {tag_object}')
print(f'Tag Object Type: {type(tag_object)}')

Tag Object: <title>Page Title</title>
Tag Object Type: <class 'bs4.element.Tag'>
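
The first-match behaviour described above can be checked with a small sketch. The shortened HTML string and the variable names below are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the player markup used in this notebook
html = "<body><h3>Lebron James</h3><h3>Stephen Curry</h3><h3>Kevin Durant</h3></body>"
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.h3            # attribute access returns only the first <h3>
all_h3 = soup.find_all("h3")  # find_all() returns every <h3> in document order

print(first_h3.string)
print([tag.string for tag in all_h3])
```

Attribute access is convenient for unique tags such as `title`; for repeated tags, `find_all()` is usually the safer choice.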


## Accessing Child, Parent, and Sibling
A child is an element nested one level below the current tag and can be reached with dot notation on the tag name, such as `h3_object.b`. The parent and the next sibling are reached with the `parent` and `next_sibling` attributes.

In [7]:
# The first h3 tag would be accessed
h3_object = soup.h3

# The child can be accessed using the b tag
child_object = h3_object.b
print(f'Child: {child_object}')

# Parent Object
parent_object = child_object.parent
print(f'Parent: {parent_object}')

# Sibling Object
sibling_object = parent_object.next_sibling
print(f'Sibling: {sibling_object}')

Child: <b id="boldest">Lebron James</b>
Parent: <h3><b id="boldest">Lebron James</b></h3>
Sibling: <p> Salary: $ 92,000,000 </p>
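
One caveat worth sketching: when the HTML contains line breaks between tags, `next_sibling` can return the whitespace text node rather than the next tag. The multi-line markup below is an assumed variant of the notebook's example; `find_next_sibling()` is the Beautiful Soup method that skips over text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the tags
html = """<body>
<h3>Lebron James</h3>
<p>Salary: $92,000,000</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

h3 = soup.h3
raw_sibling = h3.next_sibling            # the newline text between </h3> and <p>
tag_sibling = h3.find_next_sibling("p")  # the next <p> tag, ignoring text nodes

print(repr(raw_sibling))
print(tag_sibling)
```

In the single-line HTML used in this notebook there is no whitespace between tags, which is why `next_sibling` returns the `<p>` tag directly.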


## HTML Attributes
A tag's attributes can be accessed as if the tag were a dictionary. The value of an attribute is returned by bracket notation or by the `get()` method. The full dictionary of attribute names and values is available through the `attrs` attribute.

In [8]:
# Get the id value of the child 
bracket_way = child_object['id']
method_way = child_object.get('id')
print(f'Using Bracket: {bracket_way}')
print(f'Using the get() method: {method_way}')

# Get the full dictionary
full_dict = child_object.attrs
print(f'Full Dictionary: {full_dict}')

Using Bracket: boldest
Using the get() method: boldest
Full Dictionary: {'id': 'boldest'}
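
The two access styles differ when the attribute is missing. The sketch below, which reuses the `<b>` tag from the example and looks up a `class` attribute that the tag does not have, shows one way to see the difference.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
tag = soup.b

missing_default = tag.get("class")       # get() returns None for a missing attribute
missing_fallback = tag.get("class", [])  # or a fallback value supplied by the caller

try:
    tag["class"]                         # bracket notation raises KeyError instead
    raised = False
except KeyError:
    raised = True

has_id = tag.has_attr("id")              # explicit membership test
print(missing_default, missing_fallback, raised, has_id)
```

`get()` tends to be the more forgiving option when scraping pages whose tags do not all carry the same attributes.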


## Obtain the Text/String
The `string` attribute returns the text inside an HTML tag. Beautiful Soup stores this text as a **NavigableString** object, which can be converted to a plain Python string with `str()`.

In [9]:
text = sibling_object.string
print(text)

 Salary: $ 92,000,000 
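
As a small sketch of the `str()` conversion and of a limitation of `string`: when a tag holds more than one child, `string` returns `None`, and `get_text()` can be used to join the text of all descendants. The markup below is a made-up variation on the salary example.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with mixed text and a nested tag
html = "<p>Salary: <b>$92,000,000</b> per year</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

whole = p.string          # None, because <p> has three children
joined = p.get_text()     # text of all descendants joined together
inner = p.b.string        # NavigableString inside the <b> tag
plain = str(inner)        # converted to a built-in Python str

print(whole)
print(joined)
print(type(inner).__name__, type(plain).__name__)
```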


## Filtering
Filters select the parts of a document that match a given tag name, attribute value, or string.

In [10]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [13]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"
table_soup = BeautifulSoup(table, "html.parser")

The `find_all(name, attrs, recursive, string, limit, **kwargs)` method searches all descendants of a tag and returns every element that matches the filters. The `name` parameter selects tags by name, and the result is an iterable, list-like object of the matching tags, each with its children intact.

In [16]:
table_rows = table_soup.find_all('tr') 
print(table_rows)

first_row = table_rows[0]
print(first_row)

cell_object = first_row.td
print(cell_object)



[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>, <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]
<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
<td id="flight">Flight No</td>
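
The `limit` and `recursive` parameters from the signature are not used in the cell above. A minimal sketch of what they do, using an assumed two-row table rather than the full one:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the flight table
table = "<table><tr><td>1</td><td>Florida</td></tr><tr><td>2</td><td>Texas</td></tr></table>"
soup = BeautifulSoup(table, "html.parser")

# limit stops the search after the given number of matches
first_two_cells = soup.find_all("td", limit=2)

# recursive=False looks only at direct children of the tag being searched
direct_rows = soup.table.find_all("tr", recursive=False)
direct_cells = soup.table.find_all("td", recursive=False)  # <td> tags are grandchildren

print(first_two_cells)
print(len(direct_rows), direct_cells)
```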


We can also iterate over the result and print each row.

In [None]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)

In [17]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
column 2 cell <td>80 kg</td>
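
The same nested loop can collect cell text instead of printing tags. The sketch below is one possible approach; it uses a shortened copy of the table and `get_text(strip=True)` to drop the surrounding whitespace and the `<a>` markup.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the flight table from this section
table = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr>"
    "</table>"
)
soup = BeautifulSoup(table, "html.parser")

# One inner list per row, one stripped string per cell
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]
header, *records = rows

print(header)
print(records)
```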


In [18]:
# Pass a list of tag names to find_all() to match any of them
list_input = table_soup.find_all(name=['tr', 'td'])
print(list_input)

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>, <td>80 kg</td>]


If the value of an attribute is known, it can be passed directly to `find_all()` as a keyword argument, such as `id="flight"`. Passing `True` as the value, as in `href=True`, matches every tag that has the attribute regardless of its value, and passing `False` matches every tag that lacks it. The `string` argument searches for text instead of tags.

In [20]:
print(table_soup.find_all(id="flight"))

# Find the string Florida
print(table_soup.find_all(string='Florida'))

[<td id="flight">Flight No</td>]
['Florida', 'Florida']
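
The `True` and `False` filters are not exercised in the cell above. A short sketch with an assumed two-row table shows how they might be used with the `href` attribute:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the flight table
table = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td></tr>"
    "</table>"
)
soup = BeautifulSoup(table, "html.parser")

with_href = soup.find_all(href=True)      # tags that have an href attribute
without_href = soup.find_all(href=False)  # tags that do not have one

print([tag.name for tag in with_href])
print([tag.name for tag in without_href])
```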


The `find()` method accepts the same filters as `find_all()` but returns only the first matching element, which makes it a good fit when a single tag or attribute value is wanted.

In [21]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [24]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"
two_tables_bs = BeautifulSoup(two_tables, 'html.parser') 
pizza_class = two_tables_bs.find('table', class_='pizza')
print(pizza_class)


<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
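
A brief sketch of how `find()` behaves with and without a filter, and when nothing matches. The two one-cell tables and the `burger` class are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened version of the two tables
two_tables = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Little Caesars</td></tr></table>"
)
soup = BeautifulSoup(two_tables, "html.parser")

first_table = soup.find("table")                 # first <table> in the document
pizza = soup.find("table", class_="pizza")       # first <table> with class "pizza"
missing = soup.find("table", class_="burger")    # no match, so find() returns None

print(first_table["class"])
print(pizza.td.string)
print(missing)
```

Because `find()` returns `None` when there is no match, it is worth checking the result before accessing attributes on it.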


## Scraping a Web Page
Scraping a live page requires the `requests` library to download the HTML and Beautiful Soup to parse it.

In [25]:
url = "https://web.archive.org/web/20230224123642/https://www.ibm.com/us-en/"
data = requests.get(url).text  # Download the page and keep its HTML as text
soup = BeautifulSoup(data, "html.parser")

In [26]:
# Scrape all links
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>
    print(link.get('href'))

https://web.archive.org/web/20230224123642/https://www.ibm.com/reports/threat-intelligence/
https://web.archive.org/web/20230224123642/https://www.ibm.com/about
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/strategy/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/ibmix?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/technology/
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/operations/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/strategic-partnerships
https://web.archive.org/web/20230224123642/https://www.ibm.com/employment/?lnk=flatitem
https://web.archive.org/web/20230224123642/https://www.ibm.com/impact
https://web.archive.org/web/20230224123642/https://research.ibm.com/
https://web.archive.org/web/20230224123642/https://www.ibm.com/
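
The link loop can be tried offline on an in-memory snippet. In the sketch below the HTML, the `base_url`, and the use of `urllib.parse.urljoin` to turn relative links into absolute ones are all assumptions added for illustration; the archived page above already lists absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.example.com/us-en/"  # hypothetical page address
html = (
    "<a href='/about'>About</a>"
    "<a href='https://research.example.com/'>Research</a>"
    "<a>No link</a>"                         # anchor without href, skipped by href=True
)
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that carry an href and resolve each one against the base URL
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```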


The pandas `read_html()` function parses the `<table>` elements in an HTML document, for example the string form of a soup object, and returns a list of DataFrames, one per table.
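
A minimal sketch of that conversion, assuming pandas is installed together with a parser backend that `read_html()` can use (such as `lxml`). The shortened table and the `header=0` choice, which promotes the first row to column names, are illustrative assumptions.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the flight table
table = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td>Florida</td><td>300 kg</td></tr>"
    "<tr><td>2</td><td>Texas</td><td>94 kg</td></tr>"
    "</table>"
)
soup = BeautifulSoup(table, "html.parser")

# read_html() returns a list with one DataFrame per <table> found
dataframes = pd.read_html(StringIO(str(soup)), header=0)
flights = dataframes[0]

print(flights)
```

Wrapping the HTML in `StringIO` avoids the deprecation warning that recent pandas versions emit when a literal HTML string is passed.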