
### requests
The requests module is a popular library in Python for making HTTP requests. It simplifies the process of interacting with web servers and APIs by providing a user-friendly interface.

Here's a quick rundown of what the requests module offers:

* **Sends HTTP Requests:** You can use it to send various types of HTTP requests like GET, POST, PUT, and DELETE.
* **Easy to Use:** It avoids complex low-level details and provides a clean API for making requests.
* **Handles Responses:** It retrieves and parses the response data from the server, including status codes, headers, and content.

Overall, the requests library makes it much easier to write Python code that interacts with web services and APIs.

In [None]:
<i>InterstellarNOT</i>

<h1>Hello</h1>

<h1 id="firstHeading" class="firstHeading mw-first-heading"><i>InterstellarNOT</i> (film)</h1>

In [None]:
# <h1 id="firstHeading" class="firstHeading mw-first-heading"><i>InterstellarNOT</i> (film)</h1>

In [None]:
import requests

url = 'https://en.wikipedia.org/wiki/abracadabra'

response = requests.get(url)
print(response)

<Response [200]>


In [None]:
response.status_code

200

In [None]:
response.content

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Abracadabra - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feat

In [None]:
import requests

base_url = "http://books.toscrape.com/index.html"
home_page = requests.get(base_url)
print(home_page.status_code)

200


In [None]:
# checking whether the request made was successful or not
if home_page.status_code == 200:
  print("SUCCESS")
else:
  print(f"FAILED, status code: {home_page.status_code}")

SUCCESS


In [None]:
print(home_page.content)



In [None]:
import requests

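# requesting a page that does not exist returns 404 (Not Found);
# separate names keep base_url and home_page from the cells above intact
missing_url = "http://books.toscrape.com/bookdoesnotexist"
missing_page = requests.get(missing_url)
print(missing_page.status_code)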

404
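
Status codes fall into classes: 2xx means success, 4xx signals a client-side problem such as the 404 above, and 5xx signals a server-side failure. As a rough sketch, the standard library's `http.HTTPStatus` can turn a numeric code into a readable description. `describe_status` is a hypothetical helper name introduced here for illustration; it is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not a registered status

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(404))
```

With a requests response, the helper could be called as `describe_status(response.status_code)`.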


We now have the page content as raw HTML. The next step is to parse it so that we can extract the relevant information.

A popular Python library for this task is Beautiful Soup. It parses the HTML and lets you extract data from specific tags.


### Let's Structure it

In [None]:
from bs4 import BeautifulSoup
# parse the home page HTML with Python's built-in html.parser
soup = BeautifulSoup(markup=home_page.content, features="html.parser")

Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the document, which makes the data easy to navigate and extract. Some of its most commonly used methods are:

1. `find()`: Returns the first tag that matches the given criteria. For example, `soup.find('div')` returns the first `div` tag in the document. You can also pass attributes to refine the search, as in `soup.find('div', class_='example')`.

2. `find_all()`: Unlike `find()`, this returns every tag that matches the criteria. It is useful when you want to extract information from multiple tags of the same type. For example, `soup.find_all('a')` returns a list of all anchor tags in the document.

3. `select()`: Finds elements using CSS selectors. It is particularly handy when dealing with classes or IDs. For instance, `soup.select('.someclass')` returns all elements with the class `someclass`.

4. `select_one()`: Works like `select()`, but returns only the first match. For example, `soup.select_one('#uniqueId')` returns the first element with the ID `uniqueId`.

These methods are the main tools for navigating and parsing HTML/XML documents with Beautiful Soup.
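
The four methods can be compared side by side on a small document. The HTML below is a hypothetical fragment written only for this illustration, and `demo_soup` is named so that it does not overwrite the `soup` object used in the rest of the notebook.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, not taken from any real site
html = """
<html><body>
  <h1 id="title">Book list</h1>
  <ul>
    <li class="book"><a href="/a.html">Alpha</a></li>
    <li class="book featured"><a href="/b.html">Beta</a></li>
  </ul>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

first_link = demo_soup.find("a")            # first <a> tag only
all_links = demo_soup.find_all("a")         # every <a> tag
book_items = demo_soup.select("li.book")    # CSS selector, all matches
heading = demo_soup.select_one("#title")    # CSS selector, first match

print(first_link.get_text())
print([a["href"] for a in all_links])
print(len(book_items))
print(heading.get_text())
```

Note that `find()` and `select_one()` return `None` when nothing matches, so it is worth checking the result before calling methods on it.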

In [None]:
books = soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
len(books)

20

### Extracting data of a single book


In [None]:
book = books[0]
book

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>

In [None]:
# find() returns the first <a> tag; findChild() is a deprecated alias for it
book_url = book.find(name="a").get("href")
book_url

'catalogue/a-light-in-the-attic_1000/index.html'

In [None]:
from urllib.parse import urljoin

book_url = urljoin(base_url, book_url)
book_url

'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
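
`urljoin` resolves a relative link against the URL of the page it was found on, much as a browser would. The sketch below uses only the standard library; the listing-page URL and the second href are illustrative assumptions about how links on a page inside `/catalogue/` might look.

```python
from urllib.parse import urljoin

home = "http://books.toscrape.com/index.html"
listing = "https://books.toscrape.com/catalogue/page-1.html"  # assumed listing page URL

# A relative path replaces the last segment of the base URL
from_home = urljoin(home, "catalogue/a-light-in-the-attic_1000/index.html")

# The same kind of link found on a page inside /catalogue/ (hypothetical href)
from_listing = urljoin(listing, "a-light-in-the-attic_1000/index.html")

# ".." climbs one directory; a leading "/" restarts from the site root
parent = urljoin(listing, "../index.html")
rooted = urljoin(listing, "/index.html")

print(from_home)
print(from_listing)
print(parent)
print(rooted)
```

The base passed to `urljoin` should be the URL of the page where the link was found; resolving against a different page can produce a wrong path.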

### Let's send a request to this new URL

In [None]:
book_info = requests.get(book_url).content


and parse it with Beautiful Soup

In [None]:
book_soup = BeautifulSoup(markup=book_info, features="html.parser")


#### Let's get the title stored in h1

In [None]:
name = book_soup.find(name="h1").getText()
name

'A Light in the Attic'

#### Let's get the table

In [None]:
book_table_data = book_soup.find_all(name="tr")
len(book_table_data)

7

In [None]:
book_table_data

[<tr>
 <th>UPC</th><td>a897fe39b1053632</td>
 </tr>,
 <tr>
 <th>Product Type</th><td>Books</td>
 </tr>,
 <tr>
 <th>Price (excl. tax)</th><td>£51.77</td>
 </tr>,
 <tr>
 <th>Price (incl. tax)</th><td>£51.77</td>
 </tr>,
 <tr>
 <th>Tax</th><td>£0.00</td>
 </tr>,
 <tr>
 <th>Availability</th>
 <td>In stock (22 available)</td>
 </tr>,
 <tr>
 <th>Number of reviews</th>
 <td>0</td>
 </tr>]

Now extract the key/value pairs from the parsed table rows and store them in book_data.

In [None]:
book_data = {}
for row in book_table_data:
  key = row.find(name="th").getText()
  value = row.find(name="td").getText()
  book_data[key] = value

book_data

{'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': '£51.77',
 'Price (incl. tax)': '£51.77',
 'Tax': '£0.00',
 'Availability': 'In stock (22 available)',
 'Number of reviews': '0'}

Let's wrap all the functionality into one function, that takes the absolute URL of a particular book page and returns data in a dictionary.

#### Function to parse and extract book info

In [None]:
def scrape_book(book_url):
  book_info = requests.get(book_url).content
  book_soup = BeautifulSoup(markup=book_info, features="html.parser")

  book_data = {}

  # getting name
  name = book_soup.find(name="h1").getText()
  book_data['name'] = name

  # getting other data
  book_table_data = book_soup.find_all(name="tr")
  for row in book_table_data:
    key = row.find(name="th").getText()
    value = row.find(name="td").getText()
    book_data[key] = value

  # let's also keep the url of book in final result
  book_data['url'] = book_url
  return book_data

# let's test this
scrape_book(book_url)

{'name': 'A Light in the Attic',
 'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': '£51.77',
 'Price (incl. tax)': '£51.77',
 'Tax': '£0.00',
 'Availability': 'In stock (22 available)',
 'Number of reviews': '0',
 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}
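
One possible refinement, sketched under the assumption that a book page keeps the layout shown above (an `h1` title and a table of `th`/`td` rows), is to separate parsing from downloading. `parse_book_html` is a hypothetical helper and the sample HTML is a hand-written fragment, so the parsing logic can be tried without any network access.

```python
from bs4 import BeautifulSoup


def parse_book_html(html, book_url):
    """Extract the title and the product table from a book page's HTML."""
    page = BeautifulSoup(html, "html.parser")
    data = {"name": page.find("h1").get_text(strip=True)}

    for row in page.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is None or cell is None:
            continue  # skip rows that are not key/value pairs
        data[header.get_text(strip=True)] = cell.get_text(strip=True)

    data["url"] = book_url
    return data


# Hand-written stand-in for a downloaded book page
sample_html = """
<h1>A Light in the Attic</h1>
<table>
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Availability</th>
      <td>In stock (22 available)</td></tr>
</table>
"""

sample = parse_book_html(sample_html, "http://example.com/book")
print(sample)
```

With this split, a downloading function only needs to fetch the page and pass `response.content` to the parser.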

### Fetch all books

In [None]:
# Fetching the book entries listed on the 1st page
page_url = "https://books.toscrape.com/catalogue/page-1.html"

page_content = requests.get(page_url).content
page_soup = BeautifulSoup(markup=page_content, features="html.parser")
# search the page we just fetched (page_soup), not the home page's soup
page_books = page_soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

print(len(page_books))

20


In [None]:
def scrape_page(page_url):
  books_data = []
  page_content = requests.get(page_url).content
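  page_soup = BeautifulSoup(markup=page_content, features="html.parser")
  # search the page we just fetched (page_soup), not the home page's global soup
  page_books = page_soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

  for book in page_books:
    book_url = book.find(name="a").get("href")
    # links are relative to the listing page, so resolve them against page_url
    book_url = urljoin(page_url, book_url)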
    book_data = scrape_book(book_url)
    books_data.append(book_data)
  return books_data


page_url = "https://books.toscrape.com/catalogue/page-1.html"
books_data = scrape_page(page_url)
books_data[:3]

[{'name': 'A Light in the Attic',
  'UPC': 'a897fe39b1053632',
  'Product Type': 'Books',
  'Price (excl. tax)': '£51.77',
  'Price (incl. tax)': '£51.77',
  'Tax': '£0.00',
  'Availability': 'In stock (22 available)',
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
 {'name': 'Tipping the Velvet',
  'UPC': '90fa61229261140a',
  'Product Type': 'Books',
  'Price (excl. tax)': '£53.74',
  'Price (incl. tax)': '£53.74',
  'Tax': '£0.00',
  'Availability': 'In stock (20 available)',
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'},
 {'name': 'Soumission',
  'UPC': '6957f44c3847a760',
  'Product Type': 'Books',
  'Price (excl. tax)': '£50.10',
  'Price (incl. tax)': '£50.10',
  'Tax': '£0.00',
  'Availability': 'In stock (20 available)',
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html'}]

In [None]:
requests.get("https://books.toscrape.com/catalogue/page-100.html").status_code

404

In [None]:
page_count = 1
data = []

while True:
  page_url = f"https://books.toscrape.com/catalogue/page-{page_count}.html"
  status = requests.get(page_url).status_code

  # break the loop if we exceed the total page count
  if status == 404:
    break

  page_data = scrape_page(page_url)
  data.extend(page_data) # do not use .append() since the function returns a list
  print(f"Page: {page_count} is SUCCESSFULLY scraped")

  page_count += 1

Page: 1 is SUCCESSFULLY scraped
Page: 2 is SUCCESSFULLY scraped
Page: 3 is SUCCESSFULLY scraped
Page: 4 is SUCCESSFULLY scraped
Page: 5 is SUCCESSFULLY scraped
Page: 6 is SUCCESSFULLY scraped
Page: 7 is SUCCESSFULLY scraped
Page: 8 is SUCCESSFULLY scraped
Page: 9 is SUCCESSFULLY scraped
Page: 10 is SUCCESSFULLY scraped
Page: 11 is SUCCESSFULLY scraped
Page: 12 is SUCCESSFULLY scraped
Page: 13 is SUCCESSFULLY scraped
Page: 14 is SUCCESSFULLY scraped
Page: 15 is SUCCESSFULLY scraped
Page: 16 is SUCCESSFULLY scraped
Page: 17 is SUCCESSFULLY scraped
Page: 18 is SUCCESSFULLY scraped
Page: 19 is SUCCESSFULLY scraped
Page: 20 is SUCCESSFULLY scraped
Page: 21 is SUCCESSFULLY scraped
Page: 22 is SUCCESSFULLY scraped
Page: 23 is SUCCESSFULLY scraped
Page: 24 is SUCCESSFULLY scraped
Page: 25 is SUCCESSFULLY scraped
Page: 26 is SUCCESSFULLY scraped
Page: 27 is SUCCESSFULLY scraped
Page: 28 is SUCCESSFULLY scraped
Page: 29 is SUCCESSFULLY scraped
Page: 30 is SUCCESSFULLY scraped
Page: 31 is SUCCESS
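
The loop above requests every page twice: once to check the status code and once more inside `scrape_page`. A possible alternative, sketched here with a hypothetical `fetch_page` callable standing in for the real download-and-parse step, fetches each page once and stops at the first 404. `fake_fetch` and `max_pages` are assumptions made only for this illustration.

```python
def scrape_all_pages(fetch_page, first_page=1, max_pages=1000):
    """Collect items page by page until fetch_page reports a 404.

    fetch_page(page_number) is a hypothetical callable that returns
    (status_code, items) for one page, so each page is requested once.
    max_pages is a safety cap against an endless loop.
    """
    collected = []
    for page_number in range(first_page, first_page + max_pages):
        status, items = fetch_page(page_number)
        if status == 404:
            break  # ran past the last page
        collected.extend(items)  # items is a list, so extend rather than append
    return collected


# Stand-in for the network: three pages of two "books" each, then 404
def fake_fetch(page_number):
    if page_number > 3:
        return 404, []
    return 200, [f"book-{page_number}-{i}" for i in (1, 2)]


all_items = scrape_all_pages(fake_fetch)
print(len(all_items))
```

In a real run, `fetch_page` would wrap `requests.get` and the page parsing, returning the response's status code together with the scraped records.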

### Let's format the fields

In [None]:
book = data[0].copy()
book

{'name': 'A Light in the Attic',
 'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': '£51.77',
 'Price (incl. tax)': '£51.77',
 'Tax': '£0.00',
 'Availability': 'In stock (22 available)',
 'Number of reviews': '0',
 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}

In [None]:
float(book['Price (excl. tax)'][1:])

51.77

In [None]:
# splitting 'availability' in 2 keys: 'quantity_available' and 'is_available'

quantity_available = int(book['Availability'].split("(")[-1][:-1].split()[0])
quantity_available

22

In [None]:
is_available = book['Availability'].split("(")[0].strip()
is_available

'In stock'

In [None]:
def fix(item):
  item['Price (excl. tax)'] = float(item['Price (excl. tax)'][1:])
  item['Price (incl. tax)'] = float(item['Price (incl. tax)'][1:])
  item['Tax'] = float(item['Tax'][1:])
  availability = item.pop('Availability')
  item['is_available'] = availability.split("(")[0].strip() == 'In stock'
  item['quantity_available'] = int(availability.split("(")[-1][:-1].split()[0])
  return item

formatted_book = fix(data[0].copy())
formatted_book

{'name': 'A Light in the Attic',
 'UPC': 'a897fe39b1053632',
 'Product Type': 'Books',
 'Price (excl. tax)': 51.77,
 'Price (incl. tax)': 51.77,
 'Tax': 0.0,
 'Number of reviews': '0',
 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'is_available': True,
 'quantity_available': 22}
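
The slicing in `fix` assumes a one-character currency symbol and an availability string of the exact form shown above. A slightly more tolerant sketch could use regular expressions; `parse_price` and `parse_availability` are hypothetical helper names, and treating a missing count as 0 is an assumption rather than something the site defines.

```python
import re


def parse_price(text):
    """Return the first number in a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def parse_availability(text):
    """Return (is_available, quantity) from an availability string."""
    is_available = text.strip().lower().startswith("in stock")
    match = re.search(r"\((\d+)\s+available\)", text)
    quantity = int(match.group(1)) if match else 0  # assumption: no count means 0
    return is_available, quantity


print(parse_price("£51.77"))
print(parse_availability("In stock (22 available)"))
```

Because the price pattern looks for digits rather than skipping a fixed number of characters, it also copes with a currency prefix that is longer than one character.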

In [None]:
# format a copy of every record so the raw scraped data stays unchanged
formatted_book_list = [fix(item.copy()) for item in data]
formatted_book_list

[{'name': 'A Light in the Attic',
  'UPC': 'a897fe39b1053632',
  'Product Type': 'Books',
  'Price (excl. tax)': 51.77,
  'Price (incl. tax)': 51.77,
  'Tax': 0.0,
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'is_available': True,
  'quantity_available': 22},
 {'name': 'Tipping the Velvet',
  'UPC': '90fa61229261140a',
  'Product Type': 'Books',
  'Price (excl. tax)': 53.74,
  'Price (incl. tax)': 53.74,
  'Tax': 0.0,
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'is_available': True,
  'quantity_available': 20},
 {'name': 'Soumission',
  'UPC': '6957f44c3847a760',
  'Product Type': 'Books',
  'Price (excl. tax)': 50.1,
  'Price (incl. tax)': 50.1,
  'Tax': 0.0,
  'Number of reviews': '0',
  'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html',
  'is_available': True,
  'quantity_available': 20},
 {'name': 'Sharp Objects',
  'UPC': 'e