<a href="https://colab.research.google.com/github/Matinnorouzi2023/data-science/blob/main/Web_Scraping_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping Lab**

## Objectives

After completing this lab, you will be able to:

*   Use the basics of the `BeautifulSoup` Python library
*   Scrape web pages for data and filter the results

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li>
            <a href="#Beautiful-Soup-Object">Beautiful Soup Object</a>
            <ul>
                <li><a href="#Tags">Tags</a></li>
                <li><a href="#Children,-Parents,-and-Siblings">Children, Parents, and Siblings</a></li>
                <li><a href="#HTML-Attributes">HTML Attributes</a></li>
                <li><a href="#Navigable-String">Navigable String</a></li>
            </ul>
        </li>
     </ul>
    <ul>
        <li>
            <a href="#Filter">Filter</a>
            <ul>
                <li><a href="#find_All">find_All</a></li>
                <li><a href="#find">find</a></li>
            </ul>
        </li>
     </ul>
     <ul>
        <li>
            <a href="#Downloading-And-Scraping-The-Contents-Of-A-Web-Page">Downloading And Scraping The Contents Of A Web Page</a></li>
         <li> <a href="#Scraping-tables-from-a-Web-page-using-Pandas">Scraping tables from a Web page using Pandas</a></li>
    </ul>

</div>

<hr>


For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.

In [1]:
!pip install html5lib



**Note: After running the cell above, restart the kernel, and do not run that cell again after the restart.**


In [2]:
!pip install bs4
#!pip install requests

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1257 sha256=8f173944ffcb609244e723b6bf9e134155d0ed7c84d5da8cb2ad183ca5fda70a
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


Import the required modules and functions

In [3]:
from bs4 import BeautifulSoup # this module helps in web scraping.
import requests  # this module helps us to download a web page

## Beautiful Soup Object


Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods for parsing and searching. We can navigate the HTML as a tree and/or filter out what we are looking for.

Consider the following HTML:


In [4]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable <code>html</code>:

In [None]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The <code>BeautifulSoup</code> object represents the document as a nested data structure:

In [None]:
soup = BeautifulSoup(html, 'html5lib')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab behave identically. Finally, we will look at <code>NavigableString</code> objects.

We can use the method <code>prettify()</code> to display the HTML in the nested structure:

In [None]:
print(soup.prettify())
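The entity conversion described above can be checked on a tiny snippet. This is a minimal sketch, assuming the built-in `html.parser` backend rather than `html5lib`; the names `snippet` and `demo_soup` are made up for illustration.

```python
from bs4 import BeautifulSoup

# "html.parser" ships with Python; it is used here only to keep the sketch small.
snippet = "<p>caf&eacute; &amp; bar</p>"
demo_soup = BeautifulSoup(snippet, "html.parser")

# The entities &eacute; and &amp; come back as ordinary Unicode characters.
text = demo_soup.p.string
print(text)
print(isinstance(text, str))
```

The parsed text is already a Unicode string, so no manual unescaping should be needed before further processing.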

## Tags

Let's say we want the title of the page and the name of the highest-paid player. We can use a <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example the <code>title</code> tag:

In [None]:
tag_object=soup.title
print("tag object:",tag_object)

We can see that the type of the object is <code>bs4.element.Tag</code>:

In [None]:
print("tag object type:",type(tag_object))

If there is more than one <code>Tag</code> with the same name, the first element with that name is returned. Here, this corresponds to the highest-paid player:


In [None]:
tag_object=soup.h3
tag_object

The name is enclosed in the bold tag <code>b</code>. Here it helps to think of the document as a tree: we can navigate down the tree, using the child tag's name as an attribute, to get the player's name.
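Accessing a tag by name, as in this section, behaves like a call to `find()`. The sketch below illustrates that on a shortened, made-up version of the player HTML (`demo_html` and the other variable names are assumptions, and `html.parser` is used to keep it dependency-light).

```python
from bs4 import BeautifulSoup

demo_html = "<body><h3>Lebron James</h3><h3>Stephen Curry</h3></body>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_by_attribute = demo_soup.h3      # first <h3> in document order
first_by_find = demo_soup.find("h3")   # the same element, found explicitly
missing = demo_soup.h4                 # there is no <h4> in this snippet

print(first_by_attribute)
print(first_by_attribute is first_by_find)
print(missing)
```

Because a missing tag yields `None` rather than an error, it is worth checking the result before reading attributes from it.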

### Children, Parents, and Siblings

As stated above, a <code>Tag</code> object is a node in a tree of objects. We can access the child of the tag or navigate down the branch as follows:


In [None]:
tag_child =tag_object.b
tag_child

You can access the parent with the <code>parent</code> attribute:

In [None]:
parent_tag=tag_child.parent
parent_tag

This is identical to:

In [None]:
tag_object

The parent of <code>tag_object</code> is the <code>body</code> element:

In [None]:
tag_object.parent

The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:

In [None]:
sibling_1=tag_object.next_sibling
sibling_1

`sibling_2` is the next heading (`h3`) element, which is a sibling of both `sibling_1` and `tag_object`:

In [None]:
sibling_2=sibling_1.next_sibling
sibling_2

<h3 id="first_question">Exercise: <code>next_sibling</code></h3>

Use the object <code>sibling_2</code> and the attribute <code>next_sibling</code> to find the salary of Stephen Curry:

<details><summary>Click here for the solution</summary>

```
sibling_2.next_sibling

```

</details>


In [None]:
sibling_2.next_sibling
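The sibling navigation above works cleanly because the HTML string has no whitespace between tags. In indented HTML, the text between two tags is itself a node, so `next_sibling` may return a whitespace string. A minimal sketch, assuming a made-up, pretty-printed snippet (`pretty_html`) and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

pretty_html = """<body>
<h3>Stephen Curry</h3>
<p>Salary: $85,000,000</p>
</body>"""
demo_soup = BeautifulSoup(pretty_html, "html.parser")
heading = demo_soup.h3

# The newline between </h3> and <p> is the immediate next sibling.
print(repr(heading.next_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag.
salary_tag = heading.find_next_sibling("p")
print(salary_tag.string)
```

When scraping real pages, `find_next_sibling()` is usually the safer choice, since formatting whitespace is common.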

### HTML Attributes

If a tag has attributes, you can read them by name. For example, the tag <code>&lt;b id="boldest"&gt;</code> has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag's attributes by treating the tag like a dictionary:

In [None]:
tag_child['id']

You can access that dictionary directly as <code>attrs</code>:

In [None]:
tag_child.attrs

You can also work with multi-valued attributes. See the Beautiful Soup documentation <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01">[1]</a> for more.


We can also obtain the value of an attribute of the <code>tag</code> using its <code>get()</code> method:

In [None]:
tag_child.get('id')
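The two access styles differ when an attribute is missing, and multi-valued attributes such as `class` come back as a list. The sketch below shows both on a made-up tag (the `class` and `title` attributes are assumptions added for illustration; `html.parser` is used for brevity).

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup(
    "<b id='boldest' class='name star'>Lebron James</b>", "html.parser"
)
bold = demo_soup.b

print(bold.get("id"))            # value of an existing attribute
print(bold.get("title"))         # missing attribute: get() returns None
print(bold.get("title", "n/a"))  # or a default of your choosing
print(bold["class"])             # multi-valued attribute: a list of values

try:
    bold["title"]                # dictionary-style access raises on a missing key
except KeyError as err:
    print("missing attribute:", err)
```

`get()` tends to be convenient when an attribute is optional, while dictionary-style access makes a missing attribute fail loudly.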

### Navigable String

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:


In [None]:
tag_string=tag_child.string
tag_string

We can verify that the type is <code>NavigableString</code>:

In [None]:
type(tag_string)

A <code>NavigableString</code> is similar to a Python string. The main difference is that it also supports some <code>BeautifulSoup</code> features, such as navigating the tree. We can convert it to a plain Python string:

In [None]:
unicode_string = str(tag_string)
unicode_string
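One detail worth knowing: `.string` only returns text when a tag has a single child. If the tag mixes text and nested tags, `.string` is `None`, and `get_text()` can be used to collect all the text. A small sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

demo_soup = BeautifulSoup("<p>Salary: <b>$92,000,000</b></p>", "html.parser")
paragraph = demo_soup.p

# <p> has two children (a text node and a <b> tag), so .string is None.
print(paragraph.string)

# get_text() concatenates every string found inside the tag.
print(paragraph.get_text())

# The <b> tag has a single text child, so .string works there.
amount = paragraph.b.string
print(isinstance(amount, NavigableString), isinstance(amount, str))
```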

## Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:


In [None]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td>
    <td>Payload mass</td>
   </tr>
  <tr>
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

We can store it as a string in the variable <code>table</code>:

In [None]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [None]:
table_bs = BeautifulSoup(table, 'html5lib')

## find_All

The <code>find_all()</code> method looks through a tag's descendants and retrieves all descendants that match your filters.

<p>
The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.
</p>


### Name

When we set the <code>name</code> parameter to a tag name, the method returns every tag with that name, each including its children.

In [None]:
table_rows=table_bs.find_all('tr')
table_rows

The result is a <code>ResultSet</code>, which behaves like a Python list; each element is a <code>Tag</code> object:

In [None]:
first_row =table_rows[0]
first_row

The type is <code>Tag</code>:

In [None]:
print(type(first_row))

We can obtain the first child with the tag name <code>td</code>:

In [None]:
first_row.td

If we iterate through the list, each element corresponds to a row in the table:

In [None]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)

As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag name <code>td</code>; this returns all the descendants named <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content using the <code>string</code> attribute.


In [None]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

If we use a list we can match against any item in that list.

In [None]:
list_input=table_bs.find_all(name=["tr", "td"])
list_input
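The signature shown earlier also lists `limit` and `recursive`, which the cells above do not exercise. The following sketch illustrates them on a small made-up table; it assumes the `html.parser` backend, which keeps `tr` as a direct child of `table` (`html5lib` inserts a `tbody` element, so the `recursive=False` results would differ there).

```python
from bs4 import BeautifulSoup

demo_table = (
    "<table>"
    "<tr><td>1</td><td>Florida</td></tr>"
    "<tr><td>2</td><td>Texas</td></tr>"
    "</table>"
)
demo_soup = BeautifulSoup(demo_table, "html.parser")
table_tag = demo_soup.table

# limit stops the search after the given number of matches.
first_two_cells = table_tag.find_all("td", limit=2)

# recursive=False only looks at direct children of the tag.
direct_rows = table_tag.find_all("tr", recursive=False)
direct_cells = table_tag.find_all("td", recursive=False)  # cells are grandchildren

print([cell.string for cell in first_two_cells])
print(len(direct_rows), len(direct_cells))
```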

### Attributes

If a keyword argument is not recognized, it is turned into a filter on the tag's attributes. For example, with the <code>id</code> argument, Beautiful Soup filters against each tag's <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that value:

In [None]:
table_bs.find_all(id="flight")

We can find all the elements that have links to the Florida Wikipedia page:

In [None]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

If we set the <code>href</code> argument to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of its value:

In [None]:
table_bs.find_all(href=True)

There are other methods for dealing with attributes and other related methods. Check out the following <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01#css-selectors'>link</a>
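Two further ways of filtering on attributes are sketched here: passing a dictionary through `attrs` (needed for names such as `data-state` that cannot be written as Python keyword arguments) and passing a compiled regular expression as the value. The snippet and the `data-state` attribute are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

demo_html = (
    "<td id='flight'>Flight No</td>"
    "<a href='https://en.wikipedia.org/wiki/Florida'>Florida</a>"
    "<a href='https://en.wikipedia.org/wiki/Texas'>Texas</a>"
    "<a data-state='draft'>No link yet</a>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Same filter as find_all(id="flight"), written as a dictionary.
by_dict = demo_soup.find_all(attrs={"id": "flight"})

# A regular expression matches any href containing the Wikipedia article path.
wiki_links = demo_soup.find_all("a", href=re.compile(r"wikipedia\.org/wiki/"))

# Hyphenated attribute names have to go through attrs.
drafts = demo_soup.find_all(attrs={"data-state": "draft"})

print(by_dict)
print([a.string for a in wiki_links])
print(drafts)
```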

<h3 id="exer_type">Exercise: <code>find_all</code></h3>

Using the logic above, find all the elements without an <code>href</code> attribute:

<details><summary>Click here for the solution</summary>

```
table_bs.find_all(href=False)

```

</details>


In [None]:
table_bs.find_all(href=False)

Using the soup object <code>soup</code>, find the element whose <code>id</code> attribute is set to <code>"boldest"</code>.

<details><summary>Click here for the solution</summary>

```
soup.find_all(id="boldest")

```

</details>

In [None]:
soup.find_all(id="boldest")

### string

With <code>string</code> you can search for strings instead of tags. Here we find all the strings that are exactly <code>Florida</code>:


In [None]:
table_bs.find_all(string="Florida")
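Because a plain string filter is an exact match, text with stray whitespace (such as `Florida ` with a trailing space) is not found. A regular expression can be passed instead. This is a minimal sketch on a made-up row of cells:

```python
import re
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup(
    "<td>Florida</td><td>Texas</td><td>Florida </td>", "html.parser"
)

# Exact match: the cell with the trailing space is skipped.
exact = demo_soup.find_all(string="Florida")

# Regular expression: surrounding whitespace is tolerated.
pattern = demo_soup.find_all(string=re.compile(r"^\s*Florida\s*$"))

# The results are strings; .parent gives the enclosing tag for each one.
cells = [s.parent for s in pattern]

print(len(exact), len(pattern))
print(cells)
```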

## find

The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method instead, which returns the first matching element in the document. Consider the following two tables:

In [5]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td>
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>


<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td>
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


We store the HTML as a Python string and assign it to <code>two_tables</code>:

In [None]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table></p>"

We create a <code>BeautifulSoup</code> object named <code>two_tables_bs</code>:

In [None]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

We can find the first table using the tag name <code>table</code>:

In [None]:
two_tables_bs.find("table")

We can filter on the class attribute to find the second table. Because <code>class</code> is a reserved keyword in Python, Beautiful Soup uses the argument name <code>class_</code>, with a trailing underscore, instead:

In [None]:
two_tables_bs.find("table",class_='pizza')
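When nothing matches, `find()` returns `None` instead of an empty list, so it is worth checking the result before using it. The sketch below shows this on a shortened, made-up version of the two tables; the `burger` class is hypothetical and deliberately absent.

```python
from bs4 import BeautifulSoup

demo_html = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Little Caesars</td></tr></table>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

pizza = demo_soup.find("table", class_="pizza")
same_pizza = demo_soup.find("table", attrs={"class": "pizza"})  # equivalent filter
burger = demo_soup.find("table", class_="burger")               # no such table

if burger is None:
    print("no table with class 'burger'")

# find() can be chained, because its result is an ordinary Tag.
first_cell = pizza.find("td").string
print(first_cell)
```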

<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2>


We download the contents of the web page:

In [None]:
url = "http://www.ibm.com"

We use <code>get</code> to download the contents of the web page in text format and store it in a variable called <code>data</code>:

In [None]:
data  = requests.get(url).text

We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor:

In [None]:
soup = BeautifulSoup(data,"html5lib")  # create a soup object using the variable 'data'

### Scrape all links

In [None]:
for link in soup.find_all('a',href=True):  # in html an anchor/link is represented by the tag <a>
    print(link.get('href'))
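Many `href` values on real pages are relative (for example `/products`). The sketch below resolves them against a base URL with the standard library's `urljoin`. The helper name `extract_links`, the sample markup, and the `https://example.com` base are all assumptions for illustration, and the sketch parses a fixed string so that it runs without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html_text, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    page = BeautifulSoup(html_text, "html.parser")
    return [urljoin(base_url, a["href"]) for a in page.find_all("a", href=True)]

sample_page = """
<a href="/products">Products</a>
<a href="https://example.com/contact">Contact</a>
<a name="top">No href here</a>
"""
links = extract_links(sample_page, "https://example.com")
for url_found in links:
    print(url_found)
```

With a live page, the first argument could be `requests.get(url, timeout=10).text`; passing a `timeout` keeps the call from hanging indefinitely on an unresponsive server.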

### Scrape all image tags

In [None]:
for image in soup.find_all('img'):  # in html an image is represented by the tag <img>
    print(image)
    print(image.get('src'))

### Scrape data from HTML tables


In [None]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a web site, you need to examine the contents and the way data is organized on the website. Open the above url in your browser and check how many rows and columns there are in the color table.

In [None]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [None]:
soup = BeautifulSoup(data,"html5lib")

In [None]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [None]:
#Get all rows from the table
for row in table.find_all('tr'): # in html a table row is represented by the tag <tr>
    # Get all cells in each row.
    cols = row.find_all('td') # in html a table cell is represented by the tag <td>
    if len(cols) < 4: # skip rows without enough cells (for example, rows built from <th> tags)
        continue
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].text # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))
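The same row-by-row pattern can be wrapped in a small helper that returns the data instead of printing it. This is a sketch only: `table_to_rows` is a hypothetical helper, and the sample table below is made up to resemble a color table rather than copied from the page above.

```python
from bs4 import BeautifulSoup

def table_to_rows(table_tag, min_cells=1):
    """Return each table row as a list of cell texts, skipping short rows."""
    rows = []
    for tr in table_tag.find_all("tr"):
        # Accept both <td> and <th> cells and strip surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if len(cells) >= min_cells:
            rows.append(cells)
    return rows

# Made-up sample data for illustration.
sample = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>coral</td><td>#FF7F50</td></tr>
  <tr><td colspan="4"></td></tr>
</table>
"""
sample_table = BeautifulSoup(sample, "html.parser").table
header, *body = table_to_rows(sample_table, min_cells=4)
color_codes = {row[2]: row[3] for row in body}
print(header)
print(color_codes)
```

Returning lists keeps the scraping step separate from whatever is done with the data afterwards, such as building a dictionary or a DataFrame.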

## Scraping tables from a Web page using Pandas

Particularly for extracting tabular data from a web page, you may also use the `read_html()` function of the pandas library.

In [None]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

You may extract all the tables from the given webpage simply by using the following commands.

In [None]:
import pandas as pd

tables = pd.read_html(url)
tables

`tables` is now a list of dataframes representing the tables from the web page, in the order of their appearance. The current URL contains only a single table, so it can be accessed as shown below.

In [None]:
tables[0]
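When a table's header row is built from `td` cells rather than `th` cells, `read_html()` labels the columns `0, 1, 2, ...` and treats the header as data. Passing `header=0` promotes the first row to column names. A minimal sketch, assuming a made-up table wrapped in `StringIO` so that no network access is needed; it relies on an HTML parser that pandas supports (`lxml`, or `bs4` with `html5lib`) being installed.

```python
from io import StringIO
import pandas as pd

sample_html = """
<table>
  <tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
  <tr><td>1</td><td>Florida</td><td>300 kg</td></tr>
  <tr><td>2</td><td>Texas</td><td>94 kg</td></tr>
</table>
"""

# header=0 uses the first row as the column names.
launches = pd.read_html(StringIO(sample_html), header=0)[0]

# Scraped values arrive as text; strip the unit to get a numeric column.
launches["Payload mass (kg)"] = (
    launches["Payload mass"].str.replace(" kg", "", regex=False).astype(int)
)
print(launches)
```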

## Editor

Matin Hajnorouzi