# **Web Scraping**

## Table of Contents
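* Beautiful Soup Objects
* Tags
* Children, Parents, and Siblings
* HTML Attributes
* Navigable String
* Filter
* find All
* find
* Downloading And Scraping The Contents Of A Web Page
* Scrape data from HTML tables
* Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas read_html
* Scrape data from HTML tables into a DataFrame using read_html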

## Import the required libraries

In [None]:
from bs4 import BeautifulSoup # this module helps in web scraping.
import requests  # this module helps us to download a web page

## Beautiful Soup Objects

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree, filter out what we are looking for, or both.

Consider the following HTML:

In [None]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable html:

In [None]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the BeautifulSoup constructor:

In [None]:
soup = BeautifulSoup(html, "html.parser")

We can use the method prettify() to display the HTML in the nested structure:

In [None]:
print(soup.prettify())

## Tags

A Tag object corresponds to an HTML tag in the original document, for example the title tag:

In [None]:
tag_object=soup.title
print("tag object:",tag_object)

If there is more than one tag with the same name, the first element with that tag name is returned:

In [None]:
tag_object=soup.h3
tag_object
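
As a small self-contained check of this behaviour, the sketch below parses a shortened copy of the players document. The names `sample_html`, `sample_soup`, `first_h3`, and `all_h3` are illustrative and not part of the original notebook. It shows that attribute-style access returns only the first matching tag, while find_all returns every match.

```python
from bs4 import BeautifulSoup

# Shortened copy of the players document; all names here are illustrative.
sample_html = (
    "<html><head><title>Page Title</title></head><body>"
    "<h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p>"
    "<h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p>"
    "</body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_h3 = sample_soup.h3             # attribute access: the first <h3> only
all_h3 = sample_soup.find_all("h3")   # every <h3>, in document order

print("first:", first_h3)
print("count:", len(all_h3))
```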

## Children, Parents, and Siblings

A Tag object is a node in a tree of objects. We can access a child of the tag, or navigate down the branch, as follows:

In [None]:
tag_child =tag_object.b
tag_child

You can access the parent with the parent attribute:

In [None]:
parent_tag=tag_child.parent
parent_tag

The next sibling of tag_object is the paragraph element:

In [None]:
sibling_1=tag_object.next_sibling
sibling_1

sibling_2 is the next header element, which is a sibling of both sibling_1 and tag_object:

In [None]:
sibling_2=sibling_1.next_sibling
sibling_2
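
The sibling steps above work cleanly because the html string has no whitespace between tags. The sketch below, which uses a made-up two-element document (`spaced_html` is an assumption for illustration), suggests what to expect when the markup contains line breaks: next_sibling may return the whitespace text node, while find_next_sibling with a tag name skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup with newlines between the tags.
spaced_html = "<body>\n<h3>Lebron James</h3>\n<p>Salary: $ 92,000,000</p>\n</body>"
spaced_soup = BeautifulSoup(spaced_html, "html.parser")
heading = spaced_soup.h3

raw_sibling = heading.next_sibling            # the "\n" text node between the tags
tag_sibling = heading.find_next_sibling("p")  # the next <p> sibling, skipping text nodes

print(repr(raw_sibling))
print(tag_sibling)
```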

## HTML Attributes

You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
tag_child['id']

You can access that dictionary directly as attrs:

In [None]:
tag_child.attrs
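
Dictionary-style access raises a KeyError when the attribute is missing. A minimal sketch of the safer get method follows; the extra class attribute and the variable names are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative tag: the class attribute is not in the original document.
attr_soup = BeautifulSoup(
    "<b id='boldest' class='name star'>Lebron James</b>", "html.parser"
)
bold_tag = attr_soup.b

tag_id = bold_tag.get("id")       # value of an attribute that exists
missing = bold_tag.get("href")    # None instead of a KeyError
classes = bold_tag["class"]       # class is multi-valued, so it comes back as a list

print(tag_id, missing, classes)
```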

## Navigable String

We can obtain the name of the first player by extracting the string of the Tag object tag_child as follows:

In [None]:
tag_string=tag_child.string
tag_string
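
The value returned by the string attribute is a NavigableString rather than a plain Python string. The sketch below uses a made-up heading (the "(forward)" text is an assumption for illustration) to show how to convert it with str, and what happens when a tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Illustrative markup: the <h3> has two children, a <b> tag and a text node.
name_soup = BeautifulSoup("<h3><b>Lebron James</b> (forward)</h3>", "html.parser")

bold_string = name_soup.b.string        # NavigableString
plain_name = str(bold_string)           # plain Python str
heading_string = name_soup.h3.string    # None, because <h3> has more than one child
heading_text = name_soup.h3.get_text()  # all text inside <h3>, joined together

print(type(bold_string).__name__, plain_name, heading_string, heading_text)
```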

## Filter

Filters allow you to find complex patterns; the simplest filter is a string. Consider the following HTML of rocket launches:

In [None]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

We can store it as a string in the variable table:

In [None]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [None]:
table_bs = BeautifulSoup(table, "html.parser")

## find All

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is find_all(name, attrs, recursive, string, limit, **kwargs).

### Name
When we set the name parameter to a tag name, the method returns every tag with that name; each returned tag includes its children.

In [None]:
table_rows=table_bs.find_all('tr')
table_rows

In [None]:
first_row =table_rows[0]
first_row

If we iterate through the list, each element corresponds to a row in the table:

In [None]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)
    

As row is a Tag object, we can apply the method find_all to it and extract the table cells into the variable cells using the tag name td; this returns all the descendants with the name td. The result is a list in which each element corresponds to a cell and is a Tag object, so we can iterate through this list as well. We can extract the content of each cell using the string attribute.

In [None]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)
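
The loop above prints each cell as a Tag. To keep only the cell contents, one option is a nested list comprehension, sketched below on a two-row version of the table. The names `small_table`, `small_bs`, and `cell_text` are illustrative; get_text is used instead of string so that cells containing a link still yield their text.

```python
from bs4 import BeautifulSoup

# Two-row version of the launch table; names are illustrative.
small_table = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td>"
    "<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>"
    "<td>300 kg</td></tr>"
    "</table>"
)
small_bs = BeautifulSoup(small_table, "html.parser")

# One inner list per row, one stripped string per cell.
cell_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in small_bs.find_all("tr")
]
print(cell_text)
```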

If we pass a list, find_all matches against any item in that list:

In [None]:
list_input=table_bs.find_all(name=["tr", "td"])
list_input

### Attributes

If a keyword argument is not recognized, it is turned into a filter on the tag’s attributes. For example, the first td element has an id of flight, so we can filter on that id value:

In [None]:
table_bs.find_all(id="flight")

We can find all the elements that have links to the Florida Wikipedia page:

In [None]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

If we set the href attribute to True, the code finds all tags that have an href attribute, regardless of its value:

In [None]:
table_bs.find_all(href=True)
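
Keyword filters only work for attribute names that are valid Python identifiers. For other names, the attrs parameter from the method signature accepts a dictionary. The sketch below assumes a made-up data-unit attribute, which does not appear in the original table.

```python
from bs4 import BeautifulSoup

# Illustrative cells: data-unit is a hypothetical attribute for this sketch.
attrs_html = (
    "<td id='flight' data-unit='none'>Flight No</td>"
    "<td data-unit='kg'>300 kg</td>"
)
attrs_bs = BeautifulSoup(attrs_html, "html.parser")

by_keyword = attrs_bs.find_all(id="flight")              # keyword form
by_dict = attrs_bs.find_all(attrs={"data-unit": "kg"})   # dictionary form

print(by_keyword)
print(by_dict)
```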

### string

With string you can search for strings instead of tags. Here we find all the strings that are exactly Florida:

In [None]:
table_bs.find_all(string="Florida")
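
A plain string filter matches the whole text exactly, so a cell such as "Florida " with a trailing space is not returned. A compiled regular expression is one way to loosen the match, as in this sketch (the three-cell document is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Illustrative cells; the last one has a trailing space.
sites_bs = BeautifulSoup(
    "<td>Florida</td><td>Texas</td><td>Florida </td>", "html.parser"
)

exact = sites_bs.find_all(string="Florida")                # exact match only
pattern = sites_bs.find_all(string=re.compile("Florida"))  # regex search within each string

print(exact)
print(pattern)
```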

## find

The find_all() method scans the entire document looking for results. If you are looking for only one element, you can use the find() method, which returns the first matching element in the document. Consider the following two tables:

In [None]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>


We store the HTML as a Python string and assign it to two_tables:

In [None]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table></p>"

We create a BeautifulSoup object two_tables_bs:

In [None]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

We can find the first table using the tag name table:

In [None]:
two_tables_bs.find("table")

We can filter on the class attribute to find the second table, but because class is a keyword in Python, we add an underscore.

In [None]:
two_tables_bs.find("table",class_='pizza')
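
Unlike find_all, which returns an empty list when nothing matches, find returns None. The sketch below uses two one-cell tables and a class name, burger, that is deliberately absent; both are assumptions for illustration. It shows the check that avoids an AttributeError on a missing result.

```python
from bs4 import BeautifulSoup

# Illustrative two-table document; 'burger' deliberately matches nothing.
menus_bs = BeautifulSoup(
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Domino's Pizza</td></tr></table>",
    "html.parser",
)

pizza_table = menus_bs.find("table", class_="pizza")
missing_table = menus_bs.find("table", class_="burger")   # no match, so None

if missing_table is None:
    print("no table with class 'burger'")

first_pizza_cell = pizza_table.find("td").string
print(first_pizza_cell)
```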

## Downloading And Scraping The Contents Of A Web Page

We start with the URL of the web page to download:

In [None]:
url = "http://www.ibm.com"

We use get to download the contents of the web page in text format and store it in a variable called data:

In [None]:
data  = requests.get(url).text 
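
The one-line download above does not handle slow servers or error responses. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # stop waiting after `timeout` seconds
    response.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx
    return response.text


# Usage (requires network access):
# data = fetch_html("http://www.ibm.com")
```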

We create a BeautifulSoup object using the BeautifulSoup constructor:

In [None]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

### Scrape all links

In [None]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>
    print(link.get('href'))
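
Many href values on a real page are relative paths. One way to turn them into absolute URLs is urljoin from the standard library, sketched here on a made-up fragment (`page` and its links are assumptions; only the base URL comes from the notebook).

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://www.ibm.com"   # the URL used in this notebook

# Illustrative fragment: one relative link, one absolute link, one anchor without href.
page = (
    "<a href='/products'>Products</a>"
    "<a href='https://example.com/x'>X</a>"
    "<a name='top'>no href</a>"
)
links_bs = BeautifulSoup(page, "html.parser")

# href=True skips anchors that have no href attribute.
absolute_links = [
    urljoin(base_url, a["href"]) for a in links_bs.find_all("a", href=True)
]
print(absolute_links)
```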

### Scrape all image tags

In [None]:
for image in soup.find_all('img'): # in html image is represented by the tag <img>
    print(image)
    print(image.get('src'))

## Scrape data from HTML tables

The URL below contains an HTML table with data about colors and color codes.

In [None]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Get the contents of the web page in text format and store them in a variable called data:

In [None]:
data  = requests.get(url).text

In [None]:
soup = BeautifulSoup(data,"html.parser")

In [None]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [None]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all cells in each row.
    cols = row.find_all('td') # in html a table cell is represented by the tag <td>
    if len(cols) < 4: # skip rows without enough <td> cells (for example a header row built from <th>)
        continue
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))
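
To keep the scraped values instead of printing them, the same loop can fill a dictionary. The sketch below runs on a small made-up table that mirrors the layout the loop expects, with the name in the third cell and the code in the fourth. The table contents and variable names are assumptions, not the contents of the real page.

```python
from bs4 import BeautifulSoup

# Made-up table with the assumed column layout: number, swatch, name, hex code.
colors_html = """
<table>
  <tr><th>Number</th><th>Color</th><th>Color Name</th><th>Hex Code</th></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""
colors_bs = BeautifulSoup(colors_html, "html.parser")

color_codes = {}
for row in colors_bs.find("table").find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:   # the <th> header row has no <td> cells
        continue
    color_codes[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)

print(color_codes)
```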

## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas read_html

In [None]:
import pandas as pd

In [None]:
# The URL below contains HTML tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

In [None]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [None]:
soup = BeautifulSoup(data,"html.parser")

In [None]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [None]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

We can now use the pandas function read_html, giving it the string version of the table and the flavor, which selects the parsing engine (here bs4).

In [None]:
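from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string

pd.read_html(StringIO(str(tables[5])), flavor='bs4')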

The function read_html always returns a list of DataFrames, so we must pick the one we want out of the list.

In [None]:
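from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string

population_data_read_html = pd.read_html(StringIO(str(tables[5])), flavor='bs4')[0]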

population_data_read_html
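
For comparison, a DataFrame can also be assembled by hand from the rows that BeautifulSoup finds, without read_html. This is a minimal sketch on a three-row version of the launch table; the names `launch_html`, `launch_bs`, `rows`, and `launch_df` are illustrative, and the first row is assumed to hold the column names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Three-row version of the launch table; names are illustrative.
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td>Florida</td><td>300 kg</td></tr>"
    "<tr><td>2</td><td>Texas</td><td>94 kg</td></tr>"
    "</table>"
)
launch_bs = BeautifulSoup(launch_html, "html.parser")

# One list of cell strings per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in launch_bs.find_all("tr")
]

# Assumption: the first row is the header, the rest are data.
launch_df = pd.DataFrame(rows[1:], columns=rows[0])
print(launch_df)
```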

## Scrape data from HTML tables into a DataFrame using read_html

We can also use the read_html function to get DataFrames directly from a URL.

In [None]:
dataframe_list = pd.read_html(url, flavor='bs4')

In [None]:
dataframe_list[5]

We can also use the match parameter to select the specific table we want. Only tables containing text that matches the given string are read.

In [None]:
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]
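
The effect of match can be tried offline on a small document. The sketch below assumes that an HTML parser backend for pandas is installed (lxml, or bs4 with html5lib) and uses two made-up tables; only the table containing the text Orders should be returned.

```python
from io import StringIO
import pandas as pd

# Two made-up tables; only the second one contains the text "Orders".
two_small_tables = """
<table>
  <tr><th>Flight No</th><th>Launch site</th></tr>
  <tr><td>1</td><td>Florida</td></tr>
</table>
<table>
  <tr><th>Pizza Place</th><th>Orders</th></tr>
  <tr><td>Domino's Pizza</td><td>10</td></tr>
  <tr><td>Little Caesars</td><td>12</td></tr>
</table>
"""

# read_html still returns a list, here containing only the matching table.
matched = pd.read_html(StringIO(two_small_tables), match="Orders")
pizza_df = matched[0]
print(pizza_df)
```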