# Introduction
Web scraping, also known as web harvesting or web data extraction, is a technique for automatically extracting large amounts of data from websites. Most data on websites is unstructured or semi-structured HTML, and web scraping lets us convert it into a structured form.

## Importance of Web Scraping in Data Science
Web scraping plays an integral role in data science. Common uses include:

- **Data Collection**: Web scraping is a primary method of collecting data from the internet for analysis and research.
- **Real-time Applications**: Web scraping supports applications that need frequently refreshed data, such as weather updates and price comparison.
- **Machine Learning**: Web scraping can supply the data needed to train machine learning models.

## Web Scraping with Python
Several Python libraries support web scraping. Here are some of them:

- **BeautifulSoup**: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which can then be navigated and searched to extract data in a hierarchical, more readable manner.

- **Scrapy**: Scrapy is an open-source, collaborative web crawling framework for Python. It is used to crawl websites and extract structured data from their pages.

- **Selenium**: Selenium is a tool for controlling web browsers programmatically and automating browser tasks, which helps with pages that render their content with JavaScript.
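
As a rough illustration of how these libraries fit together, the sketch below pairs `requests` (to download a page) with BeautifulSoup (to parse it). The helper names `fetch_html` and `parse_page_title`, the user-agent string, and the ten-second timeout are illustrative assumptions, not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # Identify the scraper; this user-agent string is only a placeholder.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_page_title(html_text):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html_text, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


# Parsing works on any HTML string, so it can be tried without network access.
sample = "<html><head><title>Page Title</title></head><body></body></html>"
print(parse_page_title(sample))  # prints: Page Title

# With network access, the two steps would be chained like this:
# title = parse_page_title(fetch_html("https://example.com"))
```

Keeping the download and the parsing in separate functions makes the parsing logic easy to test on saved or hand-written HTML.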

## Applications of Web Scraping
Web scraping is used in various fields and has many applications:

- **Price Comparison**: Tools such as ParseHub can collect product data from online shopping websites, which is then used to compare prices.
- **Email Address Gathering**: Many companies that use email as a marketing medium use web scraping to collect email addresses and then send bulk emails.
- **Social Media Scraping**: Web scraping is used to collect data from social media websites such as Twitter to find out what's trending.

## Conclusion
Web scraping is an essential skill in the fast-growing field of data science. It turns the web into a source of data that can be analyzed, processed, and used in a variety of applications. However, scraping should be done responsibly and ethically, respecting each website's terms of use and robots.txt file.
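
One way to check a robots.txt policy before scraping is the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so that it runs without network access; the rules, the `example.com` URLs, and the `my-scraper` user-agent name are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one lives at https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the request.
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
allowed = parser.can_fetch("my-scraper", "https://example.com/public/index.html")

print(blocked)  # False
print(allowed)  # True
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a string. The terms of use still need to be read separately, since robots.txt does not cover them.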


In [1]:
from bs4 import BeautifulSoup
import requests

In [7]:
%%HTML
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [None]:
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())

In [3]:
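# Access the first <title> tag as an attribute of the soup
tag_object = soup.title
print("tag object: ", tag_object)
print("tag object type: ", type(tag_object))

# soup.h3 returns only the first <h3> in the document
tag_object_h3 = soup.h3
print("tag object h3: ", tag_object_h3)

# Navigate down the tree to the <b> child of that <h3>
child_h3 = tag_object_h3.b
print("child of h3: ", child_h3)

# attrs holds the tag's attributes as a dictionary
child_attribute = child_h3.attrs
print("child attribute: ", child_attribute)

# Navigate back up the tree to the enclosing <h3>
parent_of_child_h3 = child_h3.parent
print("parent of child h3: ", parent_of_child_h3)

# Move sideways: the next sibling of the first <h3> is the salary <p>,
# and the sibling after that is the second <h3>
sibling_1 = tag_object_h3.find_next_sibling()
print("sibling 1: ", sibling_1)
sibling_2 = sibling_1.find_next_sibling()
print("sibling 2: ", sibling_2)

# string returns the text of a tag that has a single text child
sibling_2_string = sibling_2.string
print("sibling 2 to string: ", sibling_2_string)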


tag object:  <title>Page Title</title>
tag object type:  <class 'bs4.element.Tag'>
tag object h3:  <h3><b id="boldest">Lebron James</b></h3>
child of h3:  <b id="boldest">Lebron James</b>
child attribute:  {'id': 'boldest'}
parent of child h3:  <h3><b id="boldest">Lebron James</b></h3>
sibling 1:  <p> Salary: $ 92,000,000 </p>
sibling 2:  <h3> Stephen Curry</h3>
sibling 2 to string:   Stephen Curry
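
The sibling navigation above can be combined into a small routine that turns the page into structured records. This is a minimal sketch, assuming every player's `<h3>` is followed by a `<p>` holding the salary; the function name `extract_salaries` and the `salary_usd` key are hypothetical.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><body><h3><b id='boldest'>Lebron James</b></h3>"
    "<p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3>"
    "<p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3>"
    "<p> Salary: $73,200, 000</p></body></html>"
)


def extract_salaries(html_text):
    """Pair each <h3> name with the salary in the <p> that follows it."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for heading in soup.find_all("h3"):
        paragraph = heading.find_next_sibling("p")
        if paragraph is None:
            continue  # no salary paragraph after this heading
        # Keep only the digits, dropping "$", commas, and stray spaces.
        digits = re.sub(r"\D", "", paragraph.get_text())
        if not digits:
            continue
        records.append(
            {"name": heading.get_text(strip=True), "salary_usd": int(digits)}
        )
    return records


for record in extract_salaries(html):
    print(record)
# {'name': 'Lebron James', 'salary_usd': 92000000}
# {'name': 'Stephen Curry', 'salary_usd': 85000000}
# {'name': 'Kevin Durant', 'salary_usd': 73200000}
```

Stripping non-digit characters also absorbs the inconsistent spacing in the sample salaries.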


In [6]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [None]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

table_bs = BeautifulSoup(table, "html.parser")
print(table_bs.prettify())

In [13]:
# find_all returns a list-like ResultSet of every matching tag
table_rows = table_bs.find_all("tr")

first_row = table_rows[0]
print("first row: ", first_row)
print("type of first row: ", type(first_row))
print("first td in the row: ", first_row.td)

# Iterate through the table row by row, then cell by cell
for i, row in enumerate(table_rows):
    print("Row ", i, ": ", row)
    cells = row.find_all("td")
    for j, cell in enumerate(cells):
        print("\tCell ", j, ": ", cell)

# Filter on attributes: a specific id, an exact href value, or any href at all
find_by_id = table_bs.find_all(id="flight")
print("find by id: ", find_by_id)

find_by_link = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
print("find by link: ", find_by_link)

find_all_href = table_bs.find_all(href=True)
print("find all href: ", find_all_href)

# Match on text content instead of on tags
find_by_string = table_bs.find_all(string="Florida")
print("find by string: ", find_by_string)

first row:  <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
type of first row:  <class 'bs4.element.Tag'>
first td in the row:  <td id="flight">Flight No</td>
Row  0 :  <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
	Cell  0 :  <td id="flight">Flight No</td>
	Cell  1 :  <td>Launch site</td>
	Cell  2 :  <td>Payload mass</td>
Row  1 :  <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
	Cell  0 :  <td>1</td>
	Cell  1 :  <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
	Cell  2 :  <td>300 kg</td>
Row  2 :  <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
	Cell  0 :  <td>2</td>
	Cell  1 :  <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
	Cell  2 :  <td>94 kg</td>
Row  3 :  <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>
	Cell  0 :  <t
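
The row-and-cell loop above can be adapted to build structured records from the table. This is a minimal sketch, assuming the first `<tr>` holds the column names; `table_to_records` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

table = (
    "<table><tr><td id='flight'>Flight No</td><td>Launch site</td>"
    "<td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>"
    "<td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>"
    "<td>94 kg</td></tr>"
    "<tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>"
    "<td>80 kg</td></tr></table>"
)


def table_to_records(table_html):
    """Convert an HTML table into a list of dicts keyed by the first row."""
    soup = BeautifulSoup(table_html, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return []
    # get_text(strip=True) drops the markup and surrounding whitespace.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("td")]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        records.append(dict(zip(header, values)))
    return records


for record in table_to_records(table):
    print(record)
# {'Flight No': '1', 'Launch site': 'Florida', 'Payload mass': '300 kg'}
# {'Flight No': '2', 'Launch site': 'Texas', 'Payload mass': '94 kg'}
# {'Flight No': '3', 'Launch site': 'Florida', 'Payload mass': '80 kg'}
```

A list of dictionaries like this can be handed to other tools, for example written out with the standard `csv` module.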

In [14]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [None]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table></p>"

two_tables_bs = BeautifulSoup(two_tables, "html.parser")

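# find() returns only the first match, here the rocket table
first_table = two_tables_bs.find("table")

# class_ (with a trailing underscore, because class is a Python keyword)
# selects the pizza table instead
pizza_table = two_tables_bs.find("table", class_='pizza')

# Display the first match
first_table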

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>
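
When a page holds several tables, filtering by class is one way to pick the right one. The sketch below uses a shortened stand-in for the two-table page and returns the cell text of the requested table; `table_rows_by_class` is a hypothetical helper name, and the `burger` class is deliberately one that does not exist.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the two-table page used above.
page = (
    "<table class='rocket'><tr><td>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Pizza Place</td><td>Orders</td></tr>"
    "<tr><td>Domino's Pizza</td><td>10</td></tr></table>"
)


def table_rows_by_class(html_text, class_name):
    """Return the cell text of the first table with the given class, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", class_=class_name)
    if table is None:  # find() returns None when nothing matches
        return None
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.find_all("tr")
    ]


print(table_rows_by_class(page, "pizza"))
# [['Pizza Place', 'Orders'], ["Domino's Pizza", '10']]
print(table_rows_by_class(page, "burger"))
# None
```

Checking for `None` before using the result avoids an `AttributeError` when the expected table is missing from the page.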