# Amazon Web Scraper

In this project, I will scrape product data(including Description, Price, Rating, ReviewCount and url) from amazon.com step by step. Then export the data to be a csv file. 

To get the data is the very first step for data analyst. So this is the very important, especially when we don't data at hand.

- import libraries
- startup the webdriver
- extract the collection
- prototype the record
- generalize the pattern
- Error handling
- Getting the next page
- Putting all together

In [71]:
# the necessary library
from selenium import webdriver
from bs4 import BeautifulSoup
import csv

## Startup the webdriver

In [81]:
driver = webdriver.Chrome()

for **driver**, there are two things need to be paid attention: 1) for different webbrowser, this maybe different(edge is different from chrome); 2) the webdriver location is important, if it is located in the same folder with the file, it can be written as **webdriver.Chrome()**,leaving the parenthesis to be blank, otherwise we have to plug the detailed position in it.

In [83]:
# normally the following two step is the way to access the webdata
url = 'http://www.amazon.com'
driver.get(url)

# to make it more specific, which means to make it target the goal product, we can define the function and get access to any product data
def get_url(search_term):
    """generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_ss_ts-doa-p_3_4'
    
    # when we search product there are space between different words, we replace it with "+"
    search_term = search_term.replace(' ', '+')
    
    return template.format(search_term)

# make a concrete example
url = get_url('adjustable desk')
url
driver.get(url)

## Extract the collection

In [85]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

result = soup.find_all('div',{'data-component-type':'s-search-result'})

In [86]:
len(result)

60

The reason why the len(result) is different from the number on topleft of the websites is because there are some sponsored products, which are not included in the total amount of the search results.

## Prototype the record
- Description
- Price
- Rating
- Review Counts
- URL

In [94]:
# choose one item to prototype the record
item = result[0]

In [95]:
# get the description
atag = item.h2.a

description = atag.text.strip()

description

'Amazon Basics Classic Home Office Computer Desk With Shelves - 29.5 x 19.6 x 35.5 Inches, Black'

In [116]:
#get the price
price = item.find('span','a-offscreen').text

price

'$64.21'

In [132]:
# get the rating
rating = item.i.text
print(rating)

# the above can also be written as item.find('span','a-icon-alt')
item.find('span','a-icon-alt').text

4.5 out of 5 stars


'4.5 out of 5 stars'

In [133]:
# get the review counts
review_count = item.find('span',{'class':'a-size-base'})
print(review_count)
# the above can also be writen as item.find('span','a-size-base')

# if we only want to get the text
review_count = item.find('span',{'class':'a-size-base'}).text
review_count

<span class="a-size-base">6,897</span>


'6,897'

In [100]:
# get the url 
atag.get('href')

# the previous line is not actually a url since no http://..., here add them
url = 'http://www.amazon.com' + atag.get('href')

url

'http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0933660UYTL8OF3G24R&url=%2FAmazonBasics-Classic-Computer-Desk-Shelves%2Fdp%2FB07PVL2N3D%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641411006%26sr%3D8-1-spons%26psc%3D1&qualifier=1641411006&id=1985587025016585&widgetName=sp_atf'

## Generalize the pattern

In [50]:
def extract_record(item):
    
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    url = 'http://www.amazon.com' + atag.get('href')
    
    # price
    price_parent = item.find('span','a-price')
    price = price_parent.find('span','a-offscreen').text
    
    # rank and rating
    rating = item.i.text
    review_count = item.find('span',{'class':'a-size-base'}).text
    
    result = (description, price, rating, review_count, url)
    
    return result

In [51]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
result = soup.find_all('div',{'data-component-type':'s-search-result'})

records =[]

for item in result:
    records.append(extract_record(item))

AttributeError: 'NoneType' object has no attribute 'find'

The reason why such 'AttributeError' appear is because that some item may not have price or rating etc. So, we need to modify the function further.

## Error Handling

In [134]:
def extract_record(item):
    
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    url = 'http://www.amazon.com' + atag.get('href')
    
    # price
    try:
        price_parent = item.find('span','a-price')
        price = price_parent.find('span','a-offscreen').text
    except AttributeError:
        return
    
    # rank and rating
    try:
        rating = item.i.text
        review_count = item.find('span',{'class':'a-size-base'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    
    result = (description, price, rating, review_count, url)
    
    return result

In [135]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
result = soup.find_all('div',{'data-component-type':'s-search-result'})

records =[]

for item in result:
    record = extract_record(item)
    if record:
        records.append(extract_record(item))

In [136]:
# check one item
records[0]

('Amazon Basics Classic Home Office Computer Desk With Shelves - 29.5 x 19.6 x 35.5 Inches, Black',
 '$64.21',
 '4.5 out of 5 stars',
 '6,897',
 'http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0933660UYTL8OF3G24R&url=%2FAmazonBasics-Classic-Computer-Desk-Shelves%2Fdp%2FB07PVL2N3D%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641411006%26sr%3D8-1-spons%26psc%3D1&qualifier=1641411006&id=1985587025016585&widgetName=sp_atf')

In [137]:
#check all prices listed so far
for item in records:
    print(item[1])

$64.21
$299.87
$259.99
$239.99
$239.99
$259.99
$219.99
$249.99
$299.87
$275.99
$429.99
$579.99
$199.99
$264.74
$679.00
$249.99
$111.51
$249.99
$239.99
$179.99
$118.97
$189.79
$399.99
$109.99
$309.99
$249.99
$269.99
$284.99
$199.99
$109.99
$264.74
$259.99
$118.97
$309.99
$299.99
$159.99
$278.17
$179.99
$179.99
$269.99
$249.99
$69.99
$259.99
$229.99
$69.99
$289.99
$269.99
$99.99
$499.99
$349.99
$237.99
$219.99
$254.99
$299.99
$359.99
$369.99
$359.00
$299.99


## Getting the next page

In [57]:
def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_ss_ts-doa-p_3_4'
    search_term = search_term.replace(' ', '+')
    
    # add term query to url
    url = template.formate(search_term)
    
    # add page query placeholder
    url += '&page{}'
    
    return url

## Putting all together

In [138]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver


def get_url(search_term):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_ss_ts-doa-p_3_4'
    search_term = search_term.replace(' ', '+')
    
    # add term query to url
    url = template.format(search_term)
    
    # add page query placeholder
    url += '&page{}'
    
    return url



def extract_record(item):
    
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    url = 'http://www.amazon.com' + atag.get('href')
    
    # price
    try:
        price_parent = item.find('span','a-price')
        price = price_parent.find('span','a-offscreen').text
    except AttributeError:
        return
    
    # rank and rating
    try:
        rating = item.i.text
        review_count = item.find('span',{'class':'a-size-base'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    
    result = (description, price, rating, review_count, url)
    
    return result


def main(search_term):
    """Run main program routine"""
    # startup the webdriver
    driver = webdriver.Chrome()
    
    records = []
    url = get_url(search_term)
    
    for page in range(1,51):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source,'html.parser')
        results = soup.find_all('div',{'data-component-type':'s-search-result'})
        
        for item in results:
            record = extract_record(item)
            if record:
                records.append(record)
    
    driver.close()
    
    # save data to csv file
    with open('result.csv', 'w', newline ='', encoding = 'utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description','Price','Rating','ReviewCount','url'])
        writer.writerows(records)

In [140]:
# Make a specific example
main('adjustable desk')

In [236]:
# read the csv file and show the top 10 items
import pandas as pd
df = pd.read_csv('result.csv')
df.head()

Unnamed: 0,Description,Price,Rating,ReviewCount,url
0,"Amazon Basics Classic Home Office Computer Desk With Shelves - 29.5 x 19.6 x 35.5 Inches, Black",$64.21,4.5 out of 5 stars,6897,http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0933660UYTL8OF3G24R&url=%2FAmazonBasics-Classic-Computer-Desk-Shelves%2Fdp%2FB07PVL2N3D%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641434075%26sr%3D8-1-spons%26psc%3D1&qualifier=1641434075&id=692511382098507&widgetName=sp_atf
1,"VIVO 60-inch Electric Height Adjustable 60 x 24 inch Stand Up Desk, White Solid One-Piece Table Top, White Frame Standing Workstation, Home & Office Furniture Sets, DESK-KIT-W06W",$259.99,4.7 out of 5 stars,2955,http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A07960728MBCCWN6NCDR&url=%2FVIVO-Adjustable-Workstation-Controller-DESK-KIT-W06W%2Fdp%2FB0829FV8MT%2Fref%3Dsr_1_2_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641434075%26sr%3D8-2-spons%26psc%3D1&qualifier=1641434075&id=692511382098507&widgetName=sp_atf
2,"Flexispot EC1 Electric White Standing Desk Adjustable Height Desk, 48 x 30 Inches Whole Piece Board Sit Stand Desk Home Office Workstation Stand up Desk (White Frame + 48 in White Top)",$239.99,4.6 out of 5 stars,5103,http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A0549693201M1ED2ESKE6&url=%2FFlexispot-EC1W-R4830W-Electric-Adjustable-Standing%2Fdp%2FB07W42DSG8%2Fref%3Dsr_1_3_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641434075%26sr%3D8-3-spons%26psc%3D1&qualifier=1641434075&id=692511382098507&widgetName=sp_atf
3,"FAMISKY Dual Motor Adjustable Height Standing Desk, Electric Sit Stand Desk with Screen Panel, 48 x 24 Inches Stand up Desk, Home Office Desk with Greige Top and Black Frame",$304.99,4.6 out of 5 stars,115,http://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A06067243O1B8FW9X9TKU&url=%2FFAMISKY-Adjustable-Electric-Standing-Tabletop%2Fdp%2FB08HMSCBZV%2Fref%3Dsr_1_4_sspa%3Fkeywords%3Dadjustable%2Bdesk%26qid%3D1641434075%26sr%3D8-4-spons%26psc%3D1&qualifier=1641434075&id=692511382098507&widgetName=sp_atf
4,"Flexispot EC1 Electric White Standing Desk Adjustable Height Desk, 48 x 30 Inches Whole Piece Board Sit Stand Desk Home Office Workstation Stand up Desk (White Frame + 48 in White Top)",$239.99,4.6 out of 5 stars,5103,http://www.amazon.com/Flexispot-EC1W-R4830W-Electric-Adjustable-Standing/dp/B07W42DSG8/ref=sr_1_5?keywords=adjustable+desk&qid=1641434075&sr=8-5


In [230]:
df.dtypes

Description    object
Price          object
Rating         object
ReviewCount    object
url            object
dtype: object

In [231]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2778 entries, 0 to 2777
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  2778 non-null   object
 1   Price        2778 non-null   object
 2   Rating       2778 non-null   object
 3   ReviewCount  2778 non-null   object
 4   url          2778 non-null   object
dtypes: object(5)
memory usage: 108.6+ KB


In [232]:
# define the dataframe
df = pd.read_csv('result.csv')

# only take the 'ReviewCount' >= 500
df['ReviewCount'] = df['ReviewCount'].str.replace(',','').astype(int)
df = df[df['ReviewCount']>=500 ]

# sort the dataframe according to their rating and ReviewCount
df['Rating'] = df['Rating'].str[:3]
sorted_df = df.sort_values(['Rating','ReviewCount','Price'],ascending = (False,False,True))

In [239]:
pd.set_option('display.max_colwidth',None)
sorted_df.head(10)

Unnamed: 0,Description,Price,Rating,ReviewCount,url
25,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434075&sr=8-26
82,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434080&sr=8-26
139,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434084&sr=8-26
196,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434087&sr=8-26
253,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434091&sr=8-26
310,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434093&sr=8-26
367,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434097&sr=8-26
424,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434100&sr=8-26
474,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_19?keywords=adjustable+desk&qid=1641434103&sr=8-19
526,"FLEXISPOT Standing Desk Converter 28 Inches Stand up Desk Riser, Height Adjustable Home Office Desk with Deep Keyboard Tray for Laptop (M7B)",$109.99,4.8,10790,http://www.amazon.com/FlexiSpot-Converter-Standing-Keyboard-M7B/dp/B0762K7JJT/ref=sr_1_26?keywords=adjustable+desk&qid=1641434106&sr=8-26
