# **Automated Web Scraper For Amazon.com**

In this tutorial, we will build an **Automated Web Scraper** to extract data from **amazon.com** that we can use in any data analysis, data science, or machine learning project.

Before we get started, let me make it clear that Amazon has tight security on its platform, and some of the things you can easily do on other webpages won't work on Amazon.

Previously, we could have used **Beautiful Soup and Requests** to easily get titles from the page, but things have changed a little bit. We will still use **Beautiful Soup**, but in a different way.

Let's see how we can do things differently now.

# Installation

We will be using:
* **Selenium**
* **BeautifulSoup**
* **Webdrivers**

In [49]:
!pip install selenium
!pip install msedge-selenium-tools
!pip install bs4



In [4]:
!pip install chromedriver_binary==112.0.5615.49.0

Collecting chromedriver_binary==112.0.5615.49.0
  Using cached chromedriver-binary-112.0.5615.49.0.tar.gz (5.1 kB)
Building wheels for collected packages: chromedriver-binary
  Building wheel for chromedriver-binary (setup.py): started
  Building wheel for chromedriver-binary (setup.py): finished with status 'done'
  Created wheel for chromedriver-binary: filename=chromedriver_binary-112.0.5615.49.0-py3-none-any.whl size=7091665 sha256=95a00ef60fa8807db11d27b0cd744f9e328f6b31c6aa09d6bcbe24c23a8014ca
  Stored in directory: c:\users\hp\appdata\local\pip\cache\wheels\46\59\c2\be916df54da4ba517a74b756d7bc11d1018fa56f86c00daf10
Successfully built chromedriver-binary
Installing collected packages: chromedriver-binary
Successfully installed chromedriver-binary-112.0.5615.49.0


**Import the necessary libraries**

In [5]:
from selenium import webdriver
import chromedriver_binary

# for Microsoft Edge
from msedge.selenium_tools import Edge, EdgeOptions
import csv

**Setup the web driver**

In [11]:
driver = webdriver.Chrome()

In [12]:
url= 'https://www.amazon.com/'

In [13]:
driver.get(url)

In [14]:
def my_url(keyword):
    # Start from a real search URL such as 'https://www.amazon.com/s?k=phone+case&ref=nb_sb_noss_1'
    # and replace 'phone+case' with {} to make the URL generic.
    temp = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'  # a template url
    keyword = keyword.replace(' ', '+')
    return temp.format(keyword)

In [15]:
url=my_url('laptop') 
url

'https://www.amazon.com/s?k=laptop&ref=nb_sb_noss_1'

In [16]:
url=my_url('laptop charger')
url

'https://www.amazon.com/s?k=laptop+charger&ref=nb_sb_noss_1'
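Replacing spaces with `+` works for simple keywords, but a keyword containing characters such as `&` or `#` would break the query string. A minimal sketch of an alternative, assuming we are happy to let the standard library do the encoding, is shown below. `build_search_url` and `BASE_SEARCH_URL` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

# Same search endpoint as the template above (name is an assumption for this sketch)
BASE_SEARCH_URL = 'https://www.amazon.com/s'


def build_search_url(keyword):
    """Build a search URL and percent-encode the keyword (hypothetical helper)."""
    # urlencode turns spaces into '+' and escapes characters such as '&' or '#'
    query = urlencode({'k': keyword, 'ref': 'nb_sb_noss_1'})
    return f'{BASE_SEARCH_URL}?{query}'


print(build_search_url('laptop charger'))
print(build_search_url('black & white case'))
```

For `'laptop charger'` this produces the same URL as `my_url`, while the second keyword has its `&` escaped as `%26` instead of being read as a new query parameter.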

In [17]:
driver.get(url) #this will open in your browser and return the page for your keyword

In [18]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source,'html.parser')

In [19]:
soup_results=soup.find_all('div',{'data-component-type':'s-search-result'})

In [20]:
len(soup_results)

22

In [21]:
obj=soup_results[0]

In [22]:
obj

<div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 AdHolder sg-col s-widget-spacing-small sg-col-12-of-16" data-asin="B093V1XBJS" data-cel-widget="search_result_1" data-component-id="12" data-component-type="s-search-result" data-index="1" data-uuid="12352664-edf5-4687-9f3f-543b2419ba87"><div class="sg-col-inner"><div cel_widget_id="MAIN-SEARCH_RESULTS-1" class="s-widget-container s-spacing-small s-widget-container-height-small celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results_1" data-cel-widget="MAIN-SEARCH_RESULTS-1" data-csa-c-id="y40s0q-f4yc4u-w1y0zl-i6eeq2" data-csa-c-item-id="amzn1.asin.1.B093V1XBJS" data-csa-c-pos="1" data-csa-c-type="item" data-csa-op-log-render="">
<div class="rush-component" data-component-id="13" data-component-props='{"percentageShownToFire":"50","batchable":true,"requiredElementSelector":".s-image:visible","url":"https://unagi-na.amazon.com/1/events/com.amazon.eel.SponsoredProductsEventTracking.prod?qualifier=

In [43]:
atag = obj.h2.a  # the <a> tag inside the result's <h2> heading; it holds the title and the link

In [24]:
des = atag.text.strip()

In [25]:
des #we can see below that we have the title correctly scraped

'65W 45W USB C Laptop Power Replacement Adapter Charger for Lenovo Chromebook/Yoga/ThinkPad L580 L590 E580 E585 P43s P53s with Power Cord'

In [26]:
#let's now build the full product url (the href already starts with '/', so the base has no trailing slash)

url='https://www.amazon.com'+atag.get('href')
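Joining a base URL and an `href` by hand makes it easy to end up with a doubled or a missing slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves the `href` against the site root for us. `absolute_url` is a hypothetical helper name, and the `/dp/...` path below is only an illustrative value, not one scraped in this tutorial.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.amazon.com/'


def absolute_url(href):
    """Resolve an href from a search result against the site root (hypothetical helper)."""
    # urljoin handles leading slashes and leaves already-absolute URLs unchanged
    return urljoin(BASE_URL, href)


print(absolute_url('/dp/B093V1XBJS'))  # illustrative path
```

With `atag` from the cell above, the call would be `absolute_url(atag.get('href'))`.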

# **Get the Price**

In [44]:
#let's get the price the same way we got the title, this time by looking for the tag that contains the price of the item.

#the 'span' with class 'a-price' wraps the price, and the 'span' with class 'a-offscreen' inside it holds the actual price text.

parent=obj.find('span','a-price')

price=parent.find('span','a-offscreen').text

price

'$17.99'
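The price comes back as a string, which is fine for saving to CSV but awkward for any later arithmetic. A small sketch of converting it to a number follows, assuming US-style formatting (a `$` sign, commas as thousands separators, a dot as the decimal point). `parse_price` is a hypothetical helper, not something the scraper below depends on.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as '$17.99' to a Decimal, or None if it cannot be parsed."""
    # Keep digits and the decimal point; drop the currency symbol and thousands separators
    cleaned = re.sub(r'[^\d.]', '', text or '')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Raised for empty or malformed input such as '' or '1.2.3'
        return None


print(parse_price('$17.99'))
print(parse_price('$1,299.00'))
```

`Decimal` is used here instead of `float` so that prices keep their exact two-decimal value.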

# **Get the Reviews**

In [45]:
#We will do the same thing for the review rating

rate=obj.find('span','a-icon-alt').text
rate

'4.3 out of 5 stars'

In [29]:
obj.i.text  # the first <i> tag in the result holds the same rating text

'4.3 out of 5 stars'
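If we want the rating as a number instead of a sentence, a regular expression can pull it out of the text. This is only a sketch, assuming the text always follows the "X out of Y stars" pattern seen above. `parse_rating` is a hypothetical helper name.

```python
import re


def parse_rating(text):
    """Extract the numeric rating from text like '4.3 out of 5 stars' (hypothetical helper)."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of\s+\d+', text or '')
    # Return None when the text does not follow the expected pattern
    return float(match.group(1)) if match else None


print(parse_rating('4.3 out of 5 stars'))
```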

# **Get the Review Counts**

In [46]:
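#we can get the number of customers who have reviewed the item as well
#note: find() returns None when no tag matches, and calling .text on None raises the AttributeError shown below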

counts_review=obj.find('span',{'class':'a-size-base','dir':'auto'}).text
counts_review

AttributeError: 'NoneType' object has no attribute 'text'
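The `AttributeError` above comes from calling `.text` on `None`, which is what `find()` returns when nothing matches. One way to guard against this is a small helper that falls back to a default value. The sketch below is an assumption about how we might wrap the lookup. `text_or_default` is a hypothetical name and not part of Beautiful Soup; it relies only on the `get_text(strip=True)` method that Beautiful Soup tags provide.

```python
def text_or_default(tag, default=''):
    """Return the stripped text of a tag, or `default` when the tag is None."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


# With the search result from above, the lookup would become:
# counts_review = text_or_default(obj.find('span', {'class': 'a-size-base', 'dir': 'auto'}))
print(repr(text_or_default(None)))
```

This keeps a missing field from stopping the whole run, which is the same idea the `try`/`except` blocks in the generic function rely on.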

In [47]:
obj.img

<img alt="Sponsored Ad - 65W 45W USB C Laptop Power Replacement Adapter Charger for Lenovo Chromebook/Yoga/ThinkPad L580 L590 E580 E..." class="s-image" data-image-index="1" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" src="https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY327_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY436_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY545_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY654_FMwebp_QL65_.jpg 3x"/>
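The `src` attribute points at the smallest image, while `srcset` lists larger versions of the same picture. If we wanted the largest one, we could parse `srcset` ourselves. The sketch below assumes every entry has the form `<url> <density>x`, as in the output above, and ignores anything else (for example width descriptors such as `640w`). `largest_image` is a hypothetical helper name.

```python
def largest_image(srcset):
    """Return the URL with the highest pixel density from an img srcset string."""
    candidates = []
    for entry in srcset.split(','):
        parts = entry.split()
        if len(parts) != 2 or not parts[1].endswith('x'):
            continue  # skip anything that is not '<url> <density>x'
        url, density = parts
        try:
            candidates.append((float(density[:-1]), url))
        except ValueError:
            continue  # density was not a number
    # max() compares the density first, so the highest-density URL wins
    return max(candidates)[1] if candidates else None


srcset = (
    'https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY218_.jpg 1x, '
    'https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY327_FMwebp_QL65_.jpg 1.5x, '
    'https://m.media-amazon.com/images/I/71dPHTRusTS._AC_UY654_FMwebp_QL65_.jpg 3x'
)
print(largest_image(srcset))
```

With the result object from above, the input would be `obj.img.get('srcset')`.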

# **Generic Function**

In [37]:
from selenium import webdriver
import chromedriver_binary
from bs4 import BeautifulSoup
# for Microsoft Edge
from msedge.selenium_tools import Edge, EdgeOptions
import csv

#We will be using functions to achieve this

def my_url(keyword):
    temp = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    keyword = keyword.replace(' ', '+')
    
    # Add Term Query To URL
    url = temp.format(keyword)
    
    # Add Page Query Placeholder (the '=' is required for the page number to be read as a query parameter)
    url += '&page={}'
    
    return url

def extract_record(obj):
    atag = obj.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    # Some items on amazon.com are missing one of the fields we want (for example, the rating or the price),
    # which would raise an error, so we add some error handling.
    # If there is no price, the item is probably out of stock or unavailable, so we skip it.
    # If there are no reviews yet, that is fine and we still extract the item.
    try:
        parent = obj.find('span', 'a-price')
        price = parent.find('span', 'a-offscreen').text
    except AttributeError:
        # Skip this item so the program can move on to the next one instead of stopping with an error
        return
    
    try:
        rate = obj.i.text
        counts_review = obj.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        # If either lookup fails, assign empty strings to both the rating and the review count
        rate = ''
        counts_review = ''
    
    image = obj.find('img', {'class': 's-image'}).get('src') 
    
    #let's create a tuple that will contain all these items and assign it to a result variable
    result = (description, price, rate, counts_review, url,image)
    return result

def main(keyword):
    '''Run Main Program Routine'''
    # Startup The Webdriver
    driver = webdriver.Chrome()
    # To use Microsoft Edge instead:
    #     options = EdgeOptions()
    #     options.use_chromium = True
    #     driver = Edge(options=options)

    records = []  # an empty records list to contain all of our extracted records
    url = my_url(keyword)

    try:
        for page in range(1, 50):
            driver.get(url.format(page))
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            results = soup.find_all('div', {'data-component-type': 's-search-result'})  # same as we did above

            # extract_record returns None for items without a price, so keep only non-empty records
            for item in results:
                record = extract_record(item)
                if record:  # if the record has something in it, append it to the records list
                    records.append(record)
    finally:
        # Save Results To CSV File once, even if a page fails to load part-way through
        with open('Results.csv', 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['Description', 'Price', 'Rating', 'Reviews Count', 'URL', 'Image link'])
            writer.writerows(records)

        # Close the browser when we are done
        driver.quit()
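Because we collect records across many pages, the same item may show up more than once. A minimal sketch of removing duplicates before saving is shown below, assuming the URL (position 4 in our tuple) is a good enough key. `dedupe_by_url` and `write_records` are hypothetical helpers, and the sample records are made up for illustration.

```python
import csv
import io


def dedupe_by_url(records, url_index=4):
    """Drop records whose URL was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        url = record[url_index]
        if url not in seen:
            seen.add(url)
            unique.append(record)
    return unique


def write_records(records, f):
    """Write the header row and the records to an open text file object."""
    writer = csv.writer(f)
    writer.writerow(['Description', 'Price', 'Rating', 'Reviews Count', 'URL', 'Image link'])
    writer.writerows(records)


# Made-up records in the same (description, price, rate, counts_review, url, image) order
sample = [
    ('Item A', '$1.00', '', '', 'https://example.com/a', 'https://example.com/a.jpg'),
    ('Item A', '$1.00', '', '', 'https://example.com/a', 'https://example.com/a.jpg'),
    ('Item B', '$2.00', '', '', 'https://example.com/b', 'https://example.com/b.jpg'),
]
buffer = io.StringIO()  # stands in for the real Results.csv file
write_records(dedupe_by_url(sample), buffer)
print(buffer.getvalue())
```

Writing to any file-like object, instead of opening `Results.csv` inside the function, makes the saving step easy to try out without touching the disk.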

In [40]:
main('cloths')

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=112.0.5615.121)


In [35]:
import pandas as pd

In [41]:
df=pd.read_csv('Results.csv')
df

| | Description | Price | Rating | Reviews Count | URL | Image link |
|---|---|---|---|---|---|---|
| 0 | Amazon Basics Microfiber Cleaning Cloths, Non-... | $13.22 | | | https://www.amazon.com/gp/slredirect/picassoRe... | https://m.media-amazon.com/images/I/91srWSvgA3... |
| 1 | USANOOKS Microfiber Cleaning Cloth Grey - 12Pc... | $13.99 | | | https://www.amazon.com/Microfiber-Cleaning-Clo... | https://m.media-amazon.com/images/I/A1U4ZA-OJm... |
| 2 | AIDEA Microfiber Cleaning Cloths-8PK, All-Purp... | $5.99 | | | https://www.amazon.com/AIDEA-Microfiber-Cleani... | https://m.media-amazon.com/images/I/71QbuUZzMX... |
| 3 | AIDEA Microfiber Cleaning Cloths White-50PK, S... | $18.99 | | | https://www.amazon.com/AIDEA-Microfiber-Absorp... | https://m.media-amazon.com/images/I/71xVouSe2E... |
| 4 | AIDEA Microfiber Cleaning Cloths-100PK, Softer... | $28.99 | | | https://www.amazon.com/AIDEA-Microfiber-Absorb... | https://m.media-amazon.com/images/I/81zpetuiJz... |
| ... | ... | ... | ... | ... | ... | ... |
| 1411 | Simple Joys by Carter's Unisex Babies' 6-Piece... | $22.80 | | | https://www.amazon.com/Simple-Joys-Carters-6-P... | https://m.media-amazon.com/images/I/81D7d6cfKw... |
| 1412 | Aunti Em's Kitchen Wrinkle Resistant Dinner Na... | $22.88 | | | https://www.amazon.com/Aunti-Ems-Kitchen-Wrink... | https://m.media-amazon.com/images/I/61tM2TfPwM... |
| 1413 | The Children's Place Baby Girls' and Toddler P... | $12.25 | | | https://www.amazon.com/Childrens-Place-Toddler... | https://m.media-amazon.com/images/I/81bgz1vT+0... |
| 1414 | KAFIREN Baby Boy Clothes Toddler Boy Summer Ou... | $18.95 | | | https://www.amazon.com/KAFIREN-Short-Light-Gre... | https://m.media-amazon.com/images/I/61Sm7M-x16... |
