**What is Web scraping?**

Web scraping is the extraction of data from a website. The collected data is then
exported into a format that is more useful to the user, such as a spreadsheet or a CSV file. Later, this data can
be used for business analysis, training machine learning models, and so on.

### Task 1:

*Environment Setup*
1. Installing Beautiful Soup:
`pip install beautifulsoup4`
2. Installing requests:
`pip install requests`

Inside a notebook cell, prefix each command with `!`, as in the cell below.

In [2]:
!pip install -q beautifulsoup4 requests

*Collecting a Web Page with Requests:*

To begin scraping, we first need to fetch the HTML content of a webpage. We'll use the ```requests```
module for this task. We’ll assign the URL (below) of the web page to the variable url:



In [3]:
import requests

url = 'https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html'

Next, we can assign the result of a request for that page to the variable page with the requests.get() method.
We pass the page’s URL (that was assigned to the url variable) to that method.


In [4]:
page = requests.get(url)
print(page)


<Response [200]>


The variable **page** is assigned a Response object and you will see some status code like "<Response [200]>" after
executing the above print statement. The returned code of **200** tells us that the page downloaded successfully.
Codes that begin with the number 2 generally indicate success, while codes that begin with a 4 or 5 indicate that
an error occurred. You can read more about HTTP status codes from the [W3C’s Status Code Definitions](https://www.w3.org/Protocols/HTTP/1.1/draft-ietf-http-v11-spec-01#Status-Codes).
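
As a rough illustration of the status-code families described above, the sketch below groups a code by its first digit. The helper name `status_class` is hypothetical and not part of Requests; `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus

def status_class(code: int) -> str:
    """Map an HTTP status code to its family, based on the first digit."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return families.get(code // 100, "unknown")

for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

The same helper could be applied to `page.status_code` from the cell above.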




In [5]:
# Check the page status code to ensure the request was successful
if page.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print(f"Failed to fetch the webpage. Status code: {page.status_code}")

Successfully fetched the webpage!
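
The manual check above can also be delegated to Requests itself. The sketch below assumes a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout; `raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses. So that it runs without network access, the demonstration uses a hand-built `Response` object instead of a live request.

```python
import requests

def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page body, raising requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

# Offline illustration: a hand-built Response with a 404 status
# triggers the same error path that a failed download would.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    outcome = "ok"
except requests.HTTPError:
    outcome = "http error"

print(outcome)
```

With a live connection, `fetch_html(url)` would return the same text as `page.text`.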


*Parsing HTML Content with ```BeautifulSoup```:*

The Beautiful Soup library creates a parse tree from parsed HTML and XML documents (including documents
with non-closed tags or tag soup and other malformed markup). This functionality will make the web page text
more readable than what we saw coming from the Requests module.
To start, we’ll import Beautiful Soup into the Python console:

In [6]:
from bs4 import BeautifulSoup

Next, we’ll run the page.text document through the module to give us a BeautifulSoup object, that is, a parse
tree built by running Python’s built-in html.parser over the page's HTML. The
constructed object represents the mockturtle.html document as a nested data structure. This is assigned to the
variable soup.

In [7]:
soup = BeautifulSoup(page.text, 'html.parser')
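
To make the "nested data structure" idea concrete, here is a minimal sketch that parses a small inline snippet and walks the resulting tree. The snippet and the names `sample_html` and `sample_soup` are assumptions made for illustration; they only mimic the shape of mockturtle.html.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the shape of mockturtle.html.
sample_html = """
<html>
 <head><title>Turtle Soup</title></head>
 <body>
  <h1>Turtle Soup</h1>
  <p class="verse" id="first">Beautiful Soup, so rich and green,</p>
 </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags are reachable as attributes of the tags that contain them.
print(sample_soup.title.string)    # text inside <title>
print(sample_soup.body.h1.string)  # text inside <h1>
print(sample_soup.p["id"])         # attributes are read like dictionary keys
print(sample_soup.p.parent.name)   # every tag knows its parent
```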

To show the contents of the page on the terminal, we can print it with the prettify() method, which turns the
Beautiful Soup parse tree into a nicely formatted Unicode string.

In [8]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
  </p>
  <p class="chorus" id="second">
   Beau--ootiful Soo--oop!
   <br/>
   Beau--ootiful Soo--oop!
   <br/>
   Soo--oop of the e--e--evening,
   <br/>
   Beautiful, beautiful Soup!
   <br/>
  </p>
  <p class="verse" id="third">
   Beautiful Soup! Who cares for fish,
   <br/>
   Game or any other dish?
   <br/>
   Who would not give all else for two
   <br/>
   Pennyworth only of 

*Finding Instances of a Tag*

We can extract every occurrence of a given tag from a page by using Beautiful Soup’s find_all() method.
Running that method on our object returns the full text of the song along with the relevant `<p>` tags and any tags
contained within them, which here includes the line break tags `<br/>`.


In [9]:
soup.find_all('p')


[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

You will notice in the output above that the data is contained in square brackets [ ]. This means it is a Python list
data type.
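
More precisely, find_all() returns a `ResultSet`, which is a subclass of the built-in list, so the usual list operations apply. A small sketch on a hypothetical inline snippet (the names `sample_html`, `sample_soup` and `paragraphs` are illustrative):

```python
from bs4 import BeautifulSoup

sample_html = '<p id="first">one</p><p id="second">two</p><p id="third">three</p>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = sample_soup.find_all("p")

print(isinstance(paragraphs, list))  # the result behaves like a list
print(len(paragraphs))               # number of <p> tags found

# Iterate over the result like any other list.
ids = [p["id"] for p in paragraphs]
print(ids)

# A tag that does not occur yields an empty list rather than an error.
print(sample_soup.find_all("table"))
```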


Because it is a list, we can index a particular item within it (for example, the third `<p>` element, at index 2) and use the
get_text() method to extract all the text from inside that tag.

The output that we receive will be the text of the third `<p>` element:


In [10]:
soup.find_all('p')[2].get_text()

'Beautiful Soup! Who cares for fish,\n  Game or any other dish?\n  Who would not give all else for two\n  Pennyworth only of Beautiful Soup?\n  Pennyworth only of beautiful Soup?'

Note that the \n line breaks from the page source are also shown in the returned string above.
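
If those line breaks and the surrounding indentation are unwanted, get_text() accepts `separator` and `strip` arguments, and `stripped_strings` yields each piece of text separately. The sketch below uses a shortened, hypothetical copy of the third verse:

```python
from bs4 import BeautifulSoup

# Shortened copy of the third verse, kept inline so the example runs offline.
sample_html = (
    '<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>\n'
    '   Game or any other dish?<br/></p>'
)
verse = BeautifulSoup(sample_html, "html.parser").find("p")

raw = verse.get_text()                               # keeps \n and indentation
cleaned = verse.get_text(separator=" ", strip=True)  # one tidy line
lines = list(verse.stripped_strings)                 # one entry per text fragment

print(repr(raw))
print(cleaned)
print(lines)
```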


*Finding Tags by Class and ID*

HTML attributes such as class and id, which CSS selectors commonly refer to, are helpful when working with web
data using Beautiful Soup. We can target specific classes and IDs by using the find_all() method and passing the
class and ID strings as arguments.
First, let’s find all of the instances of the class chorus. In Beautiful Soup we assign the string for the class to
the keyword argument class_ (with a trailing underscore, because class is a reserved word in Python).
Running the line below returns only the `<p>` elements whose class is chorus:


In [11]:
soup.find_all('p', class_='chorus')


[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]
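
The same filter can be written as a CSS selector with the select() method. The sketch below runs on a hypothetical inline snippet that mirrors the class and id layout of the song, and compares both approaches:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same class/id layout as the song.
sample_html = """
<p class="verse" id="first">verse one</p>
<p class="chorus" id="second">chorus one</p>
<p class="verse" id="third">verse two</p>
<p class="chorus" id="fourth">chorus two</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_keyword = sample_soup.find_all("p", class_="chorus")  # keyword-argument form
by_selector = sample_soup.select("p.chorus")             # CSS-selector form

chorus_ids = [p["id"] for p in by_selector]
print(chorus_ids)
print([p["id"] for p in by_keyword] == chorus_ids)
```

Which form to use is largely a matter of taste; selectors tend to be more compact when several conditions are combined.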


We can also use Beautiful Soup to target IDs associated with HTML tags. In this case we will assign the string
'third' to the keyword argument id.
Once we run the line below, we’ll receive the following output:


In [12]:
soup.find_all(id='third')


[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]


The text associated with the `<p>` tag with the id of third is printed out to the terminal along with the relevant tags.
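
Since an id is normally unique on a page, find() is often more convenient than find_all() here: it returns the first matching tag directly, or None when nothing matches. A minimal sketch on a hypothetical one-line snippet:

```python
from bs4 import BeautifulSoup

sample_html = '<p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

third = sample_soup.find(id="third")    # a single Tag, not a list
missing = sample_soup.find(id="fifth")  # None, because no such id exists

print(third.name, third["class"], third.get_text())

# Guard against None before calling methods on the result.
if missing is None:
    print("no element with id 'fifth'")
```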




### Task 2:
Scrape the following website.
https://webscraper.io/test-sites/e-commerce/allinone/computers
1) Import libraries
2) Access the URL
3) Check whether the page can be scraped
4) Parse the page
5) Start an empty list named products
6) Find title, description, price, rating and reviews using the following code
7) Append the data to the products list
8) Display the extracted data


In [1]:
# import libraries
from bs4 import BeautifulSoup
import requests

In [2]:
# access url
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
page = requests.get(url)
print(page)

<Response [200]>


In [3]:
# find if page is scrapable
if page.status_code == 200:
    print("Successfully fetched the webpage!")
else:
    print(f"Failed to fetch the webpage. Status code: {page.status_code}")

Successfully fetched the webpage!
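
A 200 status only shows that the download worked. Whether a site permits automated access is usually stated in its robots.txt file, which the standard library can read. The sketch below is an illustration only: the rules in `robots_lines` and the example.com URLs are hypothetical, chosen so the code runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) answers whether the rules allow the request.
allowed = parser.can_fetch("*", "https://example.com/test-sites/e-commerce/")
blocked = parser.can_fetch("*", "https://example.com/private/report.html")
print(allowed, blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real robots.txt instead of the hand-written lines.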


In [4]:
# parse url
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Google Tag Manager -->
  <script nonce="2tnMkefddOxsuh7BCsVsrqNXEcEEi7Gj">
   (function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');
  </script>
  <!-- End Google Tag Manager -->
  <title>
   Allinone | Web Scraper Test Sites
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords">
   <meta content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to downloa

In [5]:
# start an empty list named products
products = []

In [6]:
# find title, description, price, rating and reviews
for card in soup.find_all('div', class_='product-wrapper'):
    title = card.find('a', class_='title').text.strip()
    description = card.find('p', class_='description').text.strip()
    price = card.find('h4', class_='price').text.strip()
    # rating = number of star icons inside the ratings block
    rating = len(card.find('div', class_='ratings').find_all('span', class_='ws-icon-star'))
    # review count = first token of text such as '14 reviews'
    reviews = card.find('p', class_='review-count').text.split()[0]
    products.append({
        'title': title,
        'description': description,
        'price': price,
        'rating': rating,
        'reviews': reviews
    })
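
The loop above assumes every card contains every element; if one is missing, find() returns None and `.text` raises an AttributeError. The sketch below shows one defensive variant. The helper `text_or_none`, the inline markup and the `demo_` names are hypothetical and only reuse the class names seen above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one complete card and one card with no price element.
html = """
<div class="product-wrapper">
  <a class="title">Laptop A</a>
  <p class="description">14 inch, 4GB</p>
  <h4 class="price">$295.99</h4>
</div>
<div class="product-wrapper">
  <a class="title">Laptop B</a>
  <p class="description">15.6 inch, 8GB</p>
</div>
"""

def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, class_=css_class)
    return node.get_text(strip=True) if node is not None else None

demo_soup = BeautifulSoup(html, "html.parser")
demo_products = [
    {
        "title": text_or_none(card, "a", "title"),
        "description": text_or_none(card, "p", "description"),
        "price": text_or_none(card, "h4", "price"),
    }
    for card in demo_soup.find_all("div", class_="product-wrapper")
]
print(demo_products)
```

Missing fields then show up as None in the result instead of stopping the whole loop.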

In [7]:
# display extracted data
for product in products:
    print(product)

{'title': 'Asus VivoBook...', 'description': 'Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd', 'price': '$295.99', 'rating': 3, 'reviews': '14'}
{'title': 'Prestigio Smar...', 'description': 'Prestigio SmartBook 133S Dark Grey, 13.3" FHD IPS, Celeron N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam', 'price': '$299', 'rating': 2, 'reviews': '8'}
{'title': 'Prestigio Smar...', 'description': 'Prestigio SmartBook 133S Gold, 13.3" FHD IPS, Celeron N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam', 'price': '$299', 'rating': 4, 'reviews': '12'}
{'title': 'Aspire E1-510', 'description': '15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux', 'price': '$306.99', 'rating': 3, 'reviews': '2'}
{'title': 'Lenovo V110-15...', 'description': 'Lenovo V110-15IAP, 15.6" HD, Celeron N3350 1.1GHz, 4GB, 128GB SSD, Windows 10 Home', 'price': '$321.94', 'rating': 3, 'reviews': '5'}
{'title': 'Lenovo V110-15...', 'description': 'Asus V
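
In the output above, price and reviews are still strings, which gets in the way of sorting or arithmetic. A minimal sketch of converting them, using two rows copied from that output (descriptions omitted); the helper `parse_price` and the names `sample` and `typed` are hypothetical:

```python
def parse_price(text: str) -> float:
    """Turn a price string such as '$295.99' into a float."""
    return float(text.replace("$", "").replace(",", ""))

# Two rows taken from the output above, without the description field.
sample = [
    {"title": "Asus VivoBook...", "price": "$295.99", "rating": 3, "reviews": "14"},
    {"title": "Prestigio Smar...", "price": "$299", "rating": 2, "reviews": "8"},
]

# Build new dictionaries with numeric price and review count.
typed = [
    {**row, "price": parse_price(row["price"]), "reviews": int(row["reviews"])}
    for row in sample
]

cheapest = min(typed, key=lambda row: row["price"])
print(cheapest["title"], cheapest["price"])
```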

In [12]:
# show the list as a pretty-printed table
!pip install -q tabulate

from tabulate import tabulate
# convert the list of dictionaries to a list of rows for tabulate
headers = list(products[0].keys())
rows = [list(product.values()) for product in products]
print(tabulate(rows, headers=headers, tablefmt='pretty'))

+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------+---------+
|       title       |                                                                     description                                                                      |  price   | rating | reviews |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------+---------+
| Asus VivoBook...  |                         Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd                          | $295.99  |   3    |   14    |
| Prestigio Smar... |               Prestigio SmartBook 133S Dark Grey, 13.3" FHD IPS, Celeron N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam                |   $299   |   2 
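
Since the introduction mentions exporting scraped data to CSV, here is a minimal sketch using the standard library's csv module. It writes to an in-memory buffer so it runs anywhere; the `sample_products` rows are copied from the output above with the description left out.

```python
import csv
import io

# Two rows taken from the scraped output, without the description field.
sample_products = [
    {"title": "Asus VivoBook...", "price": "$295.99", "rating": 3, "reviews": "14"},
    {"title": "Prestigio Smar...", "price": "$299", "rating": 2, "reviews": "8"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(sample_products[0].keys()))
writer.writeheader()                # first line: column names
writer.writerows(sample_products)   # one line per product

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer could be replaced with `open('products.csv', 'w', newline='', encoding='utf-8')`, where the file name is only an example.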

In [15]:
import pprint

pprint.pp(products)

[{'title': 'Asus VivoBook...',
  'description': 'Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron '
                 'N3450, 4GB, 128GB SSD, Endless OS, ENG kbd',
  'price': '$295.99',
  'rating': 3,
  'reviews': '14'},
 {'title': 'Prestigio Smar...',
  'description': 'Prestigio SmartBook 133S Dark Grey, 13.3" FHD IPS, Celeron '
                 'N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam',
  'price': '$299',
  'rating': 2,
  'reviews': '8'},
 {'title': 'Prestigio Smar...',
  'description': 'Prestigio SmartBook 133S Gold, 13.3" FHD IPS, Celeron N3350 '
                 '1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam',
  'price': '$299',
  'rating': 4,
  'reviews': '12'},
 {'title': 'Aspire E1-510',
  'description': '15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux',
  'price': '$306.99',
  'rating': 3,
  'reviews': '2'},
 {'title': 'Lenovo V110-15...',
  'description': 'Lenovo V110-15IAP, 15.6" HD, Celeron N3350 1.1GHz, 4GB, '
                 '128GB SS