# Python Web Scraping "Amazon.com"

## Case study: Amazon.com [web scraping]


Amazon.com URL: 'https://www.amazon.com/s?k=data+analyst+book&sprefix=data%2Banalyst%2Bbo%2Caps%2C488&ref=nb_sb_ss_ts-doa-p_1_15'

## How do you scrape data from a website?
- Find the URL you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code and extract the data
- Store the data in the required format
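
The steps above can be sketched end to end on a small inline HTML snippet, so the example runs without network access. The markup, the class names (`book`, `title`, `price`), and the helper names `extract_books` and `to_csv` are made up for illustration; they are not Amazon's real markup or part of this notebook.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page (not Amazon's real HTML).
SAMPLE_HTML = """
<div class="book"><h2 class="title">Book A</h2><span class="price">$10.00</span></div>
<div class="book"><h2 class="title">Book B</h2><span class="price">$12.50</span></div>
"""


def extract_books(html):
    """Parse the HTML and return one dict per book container."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.find_all("div", class_="book"):
        rows.append({
            "title": container.find("h2", class_="title").text.strip(),
            "price": container.find("span", class_="price").text.strip(),
        })
    return rows


def to_csv(rows):
    """Store the extracted rows as CSV text (the 'required format' in this sketch)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


books = extract_books(SAMPLE_HTML)
csv_text = to_csv(books)
```

The sections below follow the same order against the live search page: request, parse, locate containers, extract fields, and collect the result.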

### Import Libraries & Methods

In [54]:
# import libraries 
from bs4 import BeautifulSoup as bs
import requests
import time
import datetime

import smtplib


#### Inputting the URL

In [55]:
URL = "https://www.amazon.com/s?k=data+analyst+book&sprefix=data%2Banalyst%2Bbo%2Caps%2C488&ref=nb_sb_ss_ts-doa-p_1_15"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

page = requests.get(URL, headers=headers)
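
The request above does not check whether it succeeded, and a site may answer with an error status instead of the search results. A minimal sketch of a more defensive fetch, assuming a 10-second timeout is acceptable; the helper name `fetch_html` and its injectable `get` parameter are hypothetical additions, not part of the original notebook.

```python
import requests


def fetch_html(url, headers, timeout=10.0, get=requests.get):
    """Download a page and return its HTML, raising on HTTP errors.

    `get` is injectable (an assumption of this sketch) so the function
    can be exercised without touching the network.
    """
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With the variables defined above, it could be called as `html = fetch_html(URL, headers)` and the result passed to BeautifulSoup.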

#### Creating an HTML Parser Using BeautifulSoup

In [56]:
soup1 = bs(page.content, "html.parser")
soup2 = bs(soup1.prettify(), "html.parser")
print(soup2)

<!DOCTYPE html>

<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-us">
<!-- sp:feature:head-start -->
<head>
<script>
   var aPageStart = (new Date()).getTime();
  </script>
<meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">
   var ue_t0=ue_t0||+new Date();
  </script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-na.ssl-images-amazon.com" rel="dns-prefetch"/>
<link href="https://m.media-amazon.com" rel="dns-prefetch"/>
<link href="https://completion.amazon.com" rel="dns-prefetch"/>
<!-- sp:end-feature:cs-optimization -->
<!-- sp:feature:csm:head-open-part2 -->
<script type="text/javascript">
   window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){retu

#### Creating a Container for the Needed Data

In [57]:
containers = soup2.find_all("div",{'class':'sg-col sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-20 s-list-col-right'})

In [58]:
len(containers)

22
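
The `find_all` call above passes the full class string, which BeautifulSoup compares against the attribute value exactly, so it stops matching if the site reorders or adds classes. A sketch of the alternative, a CSS selector on a single class, using made-up markup; whether `s-list-col-right` alone is specific enough on the live page is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same classes in a different order, plus an unrelated div.
html = """
<div class="sg-col s-list-col-right sg-col-4-of-12">first</div>
<div class="sg-col sg-col-4-of-12 s-list-col-right">second</div>
<div class="sg-col">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class-string match: only finds the div whose classes appear in this order.
exact = soup.find_all("div", {"class": "sg-col sg-col-4-of-12 s-list-col-right"})

# CSS selector on one class: order and extra classes do not matter.
by_selector = soup.select("div.s-list-col-right")

exact_count = len(exact)
selector_count = len(by_selector)
```

On this snippet the exact match finds one `div` and the selector finds two.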

In [59]:
containers[0].prettify()

'<div class="sg-col sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-20 s-list-col-right">\n <div class="sg-col-inner">\n  <div class="a-section a-spacing-small a-spacing-top-small">\n   <div class="a-section a-spacing-none puis-padding-right-small s-title-instructions-style">\n    <div class="a-row a-spacing-micro">\n     <span class="a-declarative" data-a-popover=\'{"closeButton":"true","dataStrategy":"preload","name":"sp-info-popover-1945051752","position":"triggerVertical"}\' data-action="a-popover" data-csa-c-func-deps="aui-da-a-popover" data-csa-c-type="widget">\n      <a aria-label="View Sponsored information or leave ad feedback" class="s-label-popover s-sponsored-label-text" href="javascript:void(0)" role="button" style="text-decoration: none;">\n       <span class="s-label-popover-default">\n        <span class="a-color-secondary">\n         Sponsored\n        </span>\n       </span>\n       <span class="s-label-popover-hover">\n        <span class="a-color-base">\n         Sponsor

#### Accessing Page Elements

In [60]:
b_title = containers[0].findAll("h2",{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-2'})
b_title[0].text.strip()

"SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL"

In [61]:
b_price = containers[0].findAll('span',{'class':'a-offscreen'})
b_price[0].text.strip()

'$22.49'

In [62]:
try:
    b_rate = containers[0].findAll("span",{'class':'a-icon-alt'})
    print(b_rate[0].text.strip())
except IndexError:  # findAll returned an empty list: no rating element
    print("No Ratings")

4.6 out of 5 stars


In [63]:
book_title=[]
book_price=[]
book_rating=[]
for container in containers:
    
    b_title = container.findAll("h2",{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-2'})
    book_title.append(b_title[0].text.strip())
    
    b_price = container.findAll('span',{'class':'a-offscreen'})
    book_price.append(b_price[0].text.strip())
    
    try:
        b_rate = container.findAll("span",{'class':'a-icon-alt'})
        book_rating.append(b_rate[0].text.strip())
    except IndexError:
        # findAll returned an empty list: this container has no rating
        book_rating.append("No Ratings")
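
In the loop above only the rating is guarded; a container without a title or price would stop the loop with an `IndexError`. One way to treat all three fields the same is a small helper that returns a default when nothing matches. This is a sketch on made-up markup, and `first_text` is a hypothetical name, not something from BeautifulSoup or the original notebook.

```python
from bs4 import BeautifulSoup


def first_text(container, tag, css_class, default="N/A"):
    """Return the stripped text of the first matching element, or `default`."""
    element = container.find(tag, class_=css_class)
    return element.text.strip() if element is not None else default


# Made-up container with a title and a price but no rating element.
sample = BeautifulSoup(
    '<div><h2 class="title"> Some Book </h2>'
    '<span class="a-offscreen">$9.99</span></div>',
    "html.parser",
)

title = first_text(sample, "h2", "title")
price = first_text(sample, "span", "a-offscreen")
rating = first_text(sample, "span", "a-icon-alt", default="No Ratings")
```

Because every field falls back to a default, the three lists stay the same length even when a listing is incomplete.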

In [64]:
#len(book_title)
#len(book_price)
#len(book_rating) 

#### Loading the Data into Pandas

In [66]:
import pandas as pd

data = {"book_title":book_title, "book_price":book_price,"book_rating": book_rating}
df = pd.DataFrame(data)
df.head()

| | book_title | book_price | book_rating |
|---|---|---|---|
| 0 | SQL QuickStart Guide: The Simplified Beginner'... | $22.49 | 4.6 out of 5 stars |
| 1 | Mastering Tableau 2021: Implement advanced bus... | $46.99 | 4.4 out of 5 stars |
| 2 | In God We Trust All Others Must Bring Data: Bl... | $7.99 | No Ratings |
| 3 | Data Analyst Coloring Book: A Versatile, Humor... | $6.99 | 5.0 out of 5 stars |
| 4 | Storytelling with Data: A Data Visualization G... | $28.49 | 4.6 out of 5 stars |
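
The prices and ratings above are still strings, so they cannot be sorted or averaged numerically. A minimal sketch of converting them with the standard library, assuming prices look like `$22.49` and ratings like `4.6 out of 5 stars`; the helper names `parse_price` and `parse_rating` are hypothetical.

```python
import re


def parse_price(text):
    """Convert a price string such as '$22.49' to a float, or None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def parse_rating(text):
    """Extract the leading number from '4.6 out of 5 stars', or None (e.g. 'No Ratings')."""
    match = re.match(r"\s*(\d+(?:\.\d+)?) out of 5", text)
    return float(match.group(1)) if match else None


prices = [parse_price(p) for p in ["$22.49", "$46.99", "$7.99"]]
ratings = [parse_rating(r) for r in ["4.6 out of 5 stars", "No Ratings", "5.0 out of 5 stars"]]
```

Applied to the DataFrame, this could look like `df["book_price"].map(parse_price)` and `df["book_rating"].map(parse_rating)`, stored in new columns of your choosing.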


----------------------------
## Thank You!