<a href="https://colab.research.google.com/github/Pam2020/DataExtraction_using_WebScraping_and_APIs/blob/main/WebScraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Web Scraping

In this project, we perform web scraping and load data from the Amazon website into a DataFrame.

The steps involved are:
1. Use the Python `requests` library to send an HTTP GET request to Amazon and retrieve the HTML.
2. Use the Beautiful Soup library to parse the HTML.
3. Create a DataFrame and extract the data into it.

References:

1. https://www.youtube.com/watch?v=2hPCX-p_X8Q&ab_channel=DarshilParmar
2. https://www.youtube.com/watch?v=tb8gHvYlCFs&ab_channel=CoreySchafer
3. https://www.youtube.com/watch?v=ng2o98k983k&ab_channel=CoreySchafer


Let's get started!

In [70]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [50]:
!pip install fake-useragent

Collecting fake-useragent
  Downloading fake_useragent-1.5.1-py3-none-any.whl.metadata (15 kB)
Downloading fake_useragent-1.5.1-py3-none-any.whl (17 kB)
Installing collected packages: fake-useragent
Successfully installed fake-useragent-1.5.1


In [51]:
from fake_useragent import UserAgent
ua = UserAgent()

# Get a random browser user-agent string
print(ua.random)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36


In [52]:
# URL of the page we will scrape data from

url = "https://www.amazon.com/s?k=data+engineering+books"

In [53]:
# We will send an HTTP request to the page above, which requires defining HTTP headers.
# One of the most important headers is User-Agent, which identifies the client to the server.

# Headers
HEADERS = {'User-Agent': ua.random, 'Accept-Language': 'en-US, en;q=0.5'}

In [54]:
webpage = requests.get(url, headers=HEADERS)

In [55]:
webpage

<Response [200]>

In [56]:
# The response content is raw bytes.
# To turn it into a navigable HTML tree, we use the Beautiful Soup library.
type(webpage.content)

bytes
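
A 200 status only says that the server answered; it does not guarantee that the body is the search results page. Before parsing, a small sanity check can help. The sketch below is a minimal example under stated assumptions: `looks_like_results_page` is a hypothetical helper, and the strings in `CAPTCHA_MARKERS` are assumed markers of a block page, not something documented by Amazon or confirmed by this notebook.

```python
# Assumed marker strings for a block/CAPTCHA page (an assumption, adjust as needed)
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_like_results_page(status_code, body):
    """Return True if the response plausibly contains real page content."""
    if status_code != 200:
        return False
    # Accept either raw bytes (response.content) or an already decoded string
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    return not any(marker in text for marker in CAPTCHA_MARKERS)


# Small offline demonstration with made-up bodies
ok_check = looks_like_results_page(200, b"<html><body>results</body></html>")
blocked_check = looks_like_results_page(200, b"Enter the characters you see below")
error_check = looks_like_results_page(503, b"")
print(ok_check, blocked_check, error_check)
```

With the objects defined above, the call would be `looks_like_results_page(webpage.status_code, webpage.content)`.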

In [57]:
# Soup Object containing all data
soup = BeautifulSoup(webpage.content, "html.parser")


In [58]:
soup

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-us"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-na.ssl-images-amazon.com" rel="dns-prefetch"/>
<link href="https://m.media-amazon.com" rel="dns-prefetch"/>
<link href="https://completion.amazon.com" rel="dns-prefetch"/>
<!-- sp:end-feature:cs-optimization -->
<!-- sp:feature:csm:head-open-part2 -->
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=func

In [92]:
books = []
seen_titles = set()  # To keep track of seen titles

page = 1
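
The `page` counter is defined here, but the cells that follow only parse the single page fetched earlier. If more pages are needed, one option is to build one URL per page and repeat the request and parse steps for each. The sketch below uses only the standard library; `build_search_url` and `BASE_SEARCH_URL` are hypothetical names, and the use of a `page` query parameter for pagination is an assumption that this notebook does not confirm.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.amazon.com/s"


def build_search_url(query, page=1):
    """Build a search URL for the given query and page number."""
    params = {"k": query}
    if page > 1:
        # Assumption: the site paginates search results via a "page" parameter
        params["page"] = page
    return f"{BASE_SEARCH_URL}?{urlencode(params)}"


page_urls = [build_search_url("data engineering books", page=n) for n in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

The first URL is identical to the `url` defined earlier, so the existing request code can be reused unchanged inside a loop over `page_urls`.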

In [93]:


# Find book containers (you may need to adjust the class names based on the actual HTML structure)
book_containers = soup.find_all("div", {"class": "sg-col-4-of-24 sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 sg-col s-widget-spacing-small sg-col-4-of-20 gsx-ies-anchor"})


# Loop through the book containers and extract data
for book in book_containers:
  title = book.find("span", {"class": "a-size-base-plus a-color-base a-text-normal"})
  authors = book.find_all("a", {"class":"a-size-base"})
  price = book.find("span", {"class": "a-price-whole"})
  rating = book.find("span", {"class": "a-icon-alt"})
  # Alternative class string (unused): "a-size-base a-link-normal s-underline-text s-underline-link-text s-link-style"
  if title and authors and price and rating:
    book_title = title.text.strip()


    # Check if title has been seen before
    if book_title not in seen_titles:
      seen_titles.add(book_title)
      author_names = [author.text.strip() for author in authors if author.text.strip()]
      # Fall back to "Unknown" when none of the matched links has any text
      book_authors = ", ".join(author_names) if author_names else "Unknown"

      books.append({
          "Title": book_title,
          "Authors": book_authors,
          "Price": price.text.strip(),
          "Rating": rating.text.strip(),
          })
    # Increment the page number for the next iteration
    #page += 1
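
The container lookup above passes the full class string to `find_all`. Beautiful Soup treats a multi-class string as an exact match on the attribute value, so a change in class order or an added layout class returns no results. A possible alternative is a CSS selector that requires only a couple of classes. The sketch below runs on a made-up HTML snippet; `SAMPLE_HTML`, `find_book_containers`, and `text_or_none` are illustrative names, and treating `s-result-item` and `s-asin` as the stable classes is an assumption based on the class string used above.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure of a results page
SAMPLE_HTML = """
<div class="sg-col s-result-item s-asin extra-layout-class">
  <span class="a-size-base-plus a-color-base a-text-normal">Book A</span>
  <span class="a-price-whole">42.</span>
</div>
<div class="s-result-item s-asin">
  <span class="a-size-base-plus a-color-base a-text-normal">Book B</span>
</div>
<div class="s-result-item">Not a product</div>
"""


def find_book_containers(parsed):
    """Select divs that carry both classes, regardless of order or extra classes."""
    return parsed.select("div.s-result-item.s-asin")


def text_or_none(container, css):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = container.select_one(css)
    return node.get_text(strip=True) if node else None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_containers = find_book_containers(sample_soup)
sample_titles = [text_or_none(c, "span.a-text-normal") for c in sample_containers]
sample_prices = [text_or_none(c, "span.a-price-whole") for c in sample_containers]
print(sample_titles)
print(sample_prices)
```

Returning `None` for a missing field keeps the decision about incomplete results in one place, in the same spirit as the `if title and authors and price and rating` check above.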


In [94]:
# Limit to the requested number of books
#books = books[:num_books]

pd.set_option('display.max_colwidth', None)
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(books)

# Remove duplicates based on 'Title' column
df.drop_duplicates(subset="Title", inplace=True)

In [95]:
df.head()

| | Title | Authors | Price | Rating |
|---|---|---|---|---|
| 0 | Fundamentals of Data Engineering: Plan and Build Robust Data Systems | Joe Reis, Matt Housley, Paperback, Audible Audiobook, Kindle, Audio CD | 42.0 | 4.7 out of 5 stars |
| 1 | Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems | Martin Kleppmann, Paperback, Audible Audiobook, Kindle | 41.0 | 4.7 out of 5 stars |
| 2 | Data Engineering Best Practices: Architect robust and cost-effective data solutions in the cloud era | Richard J. Schiller, David Larochelle, Paperback, Kindle | 49.0 | 5.0 out of 5 stars |
| 3 | Data Engineering with AWS - Second Edition: Acquire the skills to design and build AWS-based data transformation pipelines like a pro | Gareth Eagar, Paperback, Kindle | 41.0 | 4.3 out of 5 stars |
| 4 | Data Pipelines Pocket Reference: Moving and Processing Data for Analytics | James Densmore, Paperback, Kindle | 17.0 | 4.5 out of 5 stars |


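The data is now in the DataFrame. However, the Authors column also contains format labels such as "Paperback" and "Kindle", most likely because the `a-size-base` class matches the format links as well as the author links. These extra entries still need to be removed.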
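
One possible way to clean the Authors column is to split each value on commas and drop the entries that are known format labels. The sketch below is a minimal example, not a complete solution: `strip_format_labels` and `FORMAT_LABELS` are hypothetical names, the label set is taken from the output above, and "Hardcover" is added as an assumption.

```python
# Format labels seen in the scraped output; "Hardcover" is an assumed extra
FORMAT_LABELS = {"Paperback", "Hardcover", "Kindle", "Audible Audiobook", "Audio CD"}


def strip_format_labels(authors):
    """Remove known format labels from a comma-separated authors string."""
    parts = [part.strip() for part in authors.split(",")]
    kept = [part for part in parts if part and part not in FORMAT_LABELS]
    return ", ".join(kept) if kept else "Unknown"


raw_authors = "Joe Reis, Matt Housley, Paperback, Audible Audiobook, Kindle, Audio CD"
cleaned_authors = strip_format_labels(raw_authors)
print(cleaned_authors)
```

With the DataFrame above, this could be applied as `df["Authors"] = df["Authors"].apply(strip_format_labels)`. A fixed label set misses any format name that is not listed, so narrowing the selector used during scraping may be the more robust fix.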

In [97]:
df.shape

(46, 4)