We are going to use functionality provided by Beautiful Soup and Selenium to scrape data from an Amazon product webpage.

Run the following code snippet to import all necessary tools:

 

In [2]:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

We start with a [seed page](https://www.amazon.com/Learning-Python-Powerful-Object-Oriented-Programming-ebook/dp/B00DDZPC9S). The following code snippet drives the ChromeDriver to open the page and read in the source code:

In [3]:
driver = Chrome()     # older Selenium releases also accept the path to the ChromeDriver binary as the argument; newer ones take a Service object instead
driver.get("https://www.amazon.com/Learning-Python-Powerful-Object-Oriented-Programming-ebook/dp/B00DDZPC9S")

We are going to find the product pages to be scraped by looking at items displayed in the "Customers who bought this item also bought" carousel.

<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/carousel.png" width=750   />

The carousel presents the details of each item in a card and displays only as many cards as fit the width of the browser window. The remaining cards are placed on separate slides, which can be navigated by clicking the left and right arrow buttons at the two ends of the carousel.

When you inspect the raw content of this page with the Chrome browser's developer tools, you will find that the HTML that displays the carousel looks like the following:

```html

<div class="a-section a-spacing-large bucket" id="desktop-dp-sims_purchase-similarities-sims-feature">
    ...
</div>    
```

or

```html
<div class="a-section a-spacing-large bucket" id="desktop-dp-sims_purchase-similarities-esp-sims-feature">
    ...
</div> 
```
Write code to locate this tag and name the returned `Tag` object `carousel`:


In [5]:
# Write your code here

bs_amazon_pg = BeautifulSoup(driver.page_source, 'html.parser')
carousel = bs_amazon_pg.find('div', {'id': 'desktop-dp-sims_purchase-similarities-esp-sims-feature'})
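
Because the `id` may or may not contain `-esp`, one option is to match both variants with a regular expression instead of hard-coding one of them. The snippet below is only a sketch: `sample_html` is a hypothetical, heavily shortened stand-in for `driver.page_source`, and the `demo_` names are made up so that they do not overwrite the objects used in the exercise.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for driver.page_source.
sample_html = """
<div class="a-section" id="some-other-feature">...</div>
<div class="a-section a-spacing-large bucket"
     id="desktop-dp-sims_purchase-similarities-esp-sims-feature">...</div>
"""

# "(-esp)?" makes the "-esp" part optional, so both id variants match.
carousel_id = re.compile(
    r'^desktop-dp-sims_purchase-similarities(-esp)?-sims-feature$')

demo_soup = BeautifulSoup(sample_html, 'html.parser')
demo_carousel = demo_soup.find('div', id=carousel_id)
print(demo_carousel['id'])
```

Beautiful Soup accepts a compiled pattern as an attribute filter, so the same call should work on the real page source whichever variant Amazon serves.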

In [6]:
print(carousel.prettify())

<div class="a-section a-spacing-large bucket" id="desktop-dp-sims_purchase-similarities-esp-sims-feature">
 <div class="a-begin a-carousel-container a-carousel-display-swap a-carousel-transition-swap similarities-aui-carousel p13n-sc-carousel a-carousel-initialized" data-a-carousel-options='{"ajax":{"params":{"asinMetadataKeys":"adId:ParentReasonId:ParentReasonId.substitutions.purchase_date:rId","widgetTemplateClass":"PI::Similarities::ViewTemplates::Carousel::Desktop","linkGetParameters":"{\"pd_rd_wg\":\"IJFur\",\"pd_rd_r\":\"ea50a232-cae9-47f4-8d63-36582ce2e9fd\",\"pf_rd_r\":\"QMB0JT1V7SAVAZRETBJE\",\"pf_rd_p\":\"ec5f570b-7db1-4816-9bbe-a67d0b1d643f\",\"pd_rd_w\":\"B7AMs\"}","productDetailsTemplateClass":"PI::P13N::ViewTemplates::ProductDetails::Desktop::DeliverySpeed","forceFreshWin":0,"painterId":"PersonalizationDesktopSimilaritiesCarousel","featureId":"SimilaritiesCarousel","reftagPrefix":"pd_sim_nf_351","imageHeight":160,"faceoutTemplateClass":"PI::P13N::ViewTemplates::Product::D

You can use the `prettify()` method to inspect the contents of the `Tag` object. You will find that the HTML code that displays each product item in the carousel has the following general structure:




```html

<li class="a-carousel-card aok-float-left" role="listitem" aria-setsize="91" aria-posinset="1" aria-hidden="false" style="margin-left: 14px;">
...
</li>
```

Write code to extract all `<li>` tags that display the product items, make them available in a list, and name the resulting list `list_of_items`:

In [7]:
# Write your code here
 
list_of_items = carousel.find_all('li',{'class': 'a-carousel-card'})
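
Note that each card carries two classes (`a-carousel-card` and `aok-float-left`), yet filtering on `a-carousel-card` alone is enough, because Beautiful Soup compares the filter with each individual class of a tag. The sketch below illustrates this on a hypothetical, simplified list (`sample_html` and the `demo_`/`by_` names are invented for the illustration) and shows the equivalent CSS selector form.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the carousel markup.
sample_html = """
<ol>
  <li class="a-carousel-card aok-float-left" role="listitem">first</li>
  <li class="a-carousel-card aok-float-left" role="listitem">second</li>
  <li class="unrelated">other</li>
</ol>
"""
demo_soup = BeautifulSoup(sample_html, 'html.parser')

# Matches every <li> whose class list contains "a-carousel-card".
by_find_all = demo_soup.find_all('li', {'class': 'a-carousel-card'})

# The same query written as a CSS selector.
by_select = demo_soup.select('li.a-carousel-card')

print(len(by_find_all), len(by_select))
```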



Run the following code to inspect the structure of the first `<li>` tag:

In [8]:
print(list_of_items[0].prettify())

<li aria-hidden="false" aria-posinset="1" aria-setsize="91" class="a-carousel-card aok-float-left" role="listitem" style="margin-left: 18px;">
 <div class="a-section a-spacing-none p13n-asin" data-p13n-asin-metadata='{"ref":"pd_sim_nf_351_1","asin":"B004GTLFJ6"}'>
  <a class="a-link-normal" href="/Programming-Python-Powerful-Object-Oriented-ebook/dp/B004GTLFJ6/ref=pd_sim_nf_351_1/131-1505432-1604003?_encoding=UTF8&amp;pd_rd_i=B004GTLFJ6&amp;pd_rd_r=ea50a232-cae9-47f4-8d63-36582ce2e9fd&amp;pd_rd_w=B7AMs&amp;pd_rd_wg=IJFur&amp;pf_rd_p=ec5f570b-7db1-4816-9bbe-a67d0b1d643f&amp;pf_rd_r=QMB0JT1V7SAVAZRETBJE&amp;psc=1&amp;refRID=QMB0JT1V7SAVAZRETBJE">
   <div class="a-section a-spacing-mini">
    <img alt="Programming Python: Powerful Object-Oriented Programming" class="a-dynamic-image p13n-sc-dynamic-image" data-a-dynamic-image='{"https://images-na.ssl-images-amazon.com/images/I/9122eE339dL.__BG0,0,0,0_FMpng_AC_UL480_SR366,480_.jpg":[480,366],"https://images-na.ssl-images-amazon.com/images/I

Note that the title of the book (i.e., *Programming Python: Powerful Object-Oriented Programming*) appears in two places in the above code: the `alt` attribute of an `<img>` tag, and the (possibly truncated) text embedded in the `<div>` tag that has both the `p13n-sc-truncate-desktop-type2` and `p13n-sc-truncated` classes.
 
To make sure what we extract is the full title rather than a truncated one, we are going to extract the value of the `alt` attribute of the `<img>` tag. Write code below to do so:

In [9]:
# Write your code here

list_of_items[0].find('img').attrs['alt']


'Programming Python: Powerful Object-Oriented Programming'

The author information is contained in the following piece of the HTML code:

```html

<div class="a-row a-size-small">...
    Mark Lutz
</div>

```

Write code to extract the author(s) of the book:

In [10]:
# Write your code here

list_of_items[0].find('div', {'class': 'a-row a-size-small'}).get_text()



'Mark Lutz'


The rating received from buyers is embedded in the following piece of the HTML code:

```html
<span class="a-icon-alt">4.5 out of 5 stars</span>
```
Write code to extract the rating only (i.e., 4.5) from the text:

In [11]:
# Write your code here

r = list_of_items[0].find('span', {'class': 'a-icon-alt'}).get_text()

r[0:3]

'4.5'
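
Slicing the first three characters works for text such as `'4.5 out of 5 stars'`, but it would mis-slice a string whose leading number is shorter or longer than three characters. A slightly more defensive alternative is to pull out the leading number with a regular expression. This is a sketch only; `parse_rating` is a hypothetical helper name, not part of the exercise.

```python
import re

def parse_rating(text):
    """Return the leading number of a string such as '4.5 out of 5 stars'.

    Returns None when the text does not start with a number.
    """
    match = re.match(r'\s*(\d+(?:\.\d+)?)', text)
    return float(match.group(1)) if match else None

print(parse_rating('4.5 out of 5 stars'))
```

Returning a `float` rather than a string also makes the value directly usable for sorting or averaging later on.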

The book's price is contained in a `span` tag of class `"p13n-sc-price"`. Write code to extract the price of the book:

In [12]:
# Write your code here

list_of_items[0].find('span', {'class': 'a-size-base a-color-price'}).get_text()


'$37.49'
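
The extracted price is a string. If you later want to compare or sum prices, it can be converted to a number. The sketch below assumes prices are formatted in US dollars like `'$37.49'` (possibly with thousands separators); `parse_price` is a hypothetical helper name.

```python
from decimal import Decimal

def parse_price(text):
    """Convert a price string such as '$37.49' into a Decimal.

    Assumes a leading dollar sign and optional thousands separators.
    """
    cleaned = text.strip().lstrip('$').replace(',', '')
    return Decimal(cleaned)

print(parse_price('$37.49'))
```

`Decimal` is used here instead of `float` so that monetary amounts are not subject to binary rounding error.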

The last piece of information we would like to extract about this book is the URL of its product page. This information is maintained as the value of the `href` attribute of the first `<a>` tag. Write code to extract the value of that `href` attribute and name it `url_string`: 

In [13]:
# Write your code here

url_string = list_of_items[0].find('a').attrs['href']
url_string

'/Programming-Python-Powerful-Object-Oriented-ebook/dp/B004GTLFJ6/ref=pd_sim_nf_351_1/131-1505432-1604003?_encoding=UTF8&pd_rd_i=B004GTLFJ6&pd_rd_r=ea50a232-cae9-47f4-8d63-36582ce2e9fd&pd_rd_w=B7AMs&pd_rd_wg=IJFur&pf_rd_p=ec5f570b-7db1-4816-9bbe-a67d0b1d643f&pf_rd_r=QMB0JT1V7SAVAZRETBJE&psc=1&refRID=QMB0JT1V7SAVAZRETBJE'

Comparing the URL you extracted with that of the seed page (https://www.amazon.com/Learning-Python-Powerful-Object-Oriented-Programming-ebook/dp/B00DDZPC9S) reveals that the raw URL is a relative one. Besides, some parts of it serve purposes other than locating the product page (e.g., tracking referring sources).

We want to remove the irrelevant part and use the remaining part to form absolute URLs for later use.

Run the code below to replace the irrelevant part with an empty string and prefix it with `'https://www.amazon.com'`:

In [14]:
import re

'https://www.amazon.com' + re.sub(r'/ref=.*', r'', url_string)

'https://www.amazon.com/Programming-Python-Powerful-Object-Oriented-ebook/dp/B004GTLFJ6'
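
The same cleaning step can be wrapped in a small function so that it can be reused for every card. The sketch below uses `urllib.parse.urljoin` instead of string concatenation to resolve the relative path against the site root; `BASE_URL` and `clean_product_url` are names invented for this illustration, and the sample URL is a shortened version of the one extracted above.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.amazon.com'

def clean_product_url(relative_url, base=BASE_URL):
    """Drop the '/ref=...' tracking suffix and return an absolute URL."""
    path = re.sub(r'/ref=.*', '', relative_url)
    return urljoin(base, path)

raw = ('/Programming-Python-Powerful-Object-Oriented-ebook/dp/B004GTLFJ6'
       '/ref=pd_sim_nf_351_1/131-1505432-1604003?_encoding=UTF8&psc=1')
print(clean_product_url(raw))
```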

Extend the previous steps that scrape the data for the first product to the remaining elements contained in `list_of_items`:


In [15]:
title_list, authors_list, rating_list, price_list, url_list = [], [], [], [], []

# Write your code here
import re

for item in list_of_items:
    img = item.find('img')
    # Skip cards that have no image or no alt text (and hence no title)
    if img is None or img.get('alt') is None:
        continue
    title_list.append(img['alt'])
    authors_list.append(item.find('div', {'class': 'a-row a-size-small'}).get_text())
    rating_text = item.find('span', {'class': 'a-icon-alt'}).get_text()
    rating_list.append(rating_text[0:3])    # e.g., '4.5 out of 5 stars' -> '4.5'
    price_list.append(item.find('span', {'class': 'a-size-base a-color-price'}).get_text())
    href = item.find('a').attrs['href']
    # Strip the tracking suffix and turn the relative URL into an absolute one
    url_list.append('https://www.amazon.com' + re.sub(r'/ref=.*', r'', href))
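
A card that lacks one of these elements (for instance, a book with no rating yet) would make `find()` return `None` and the subsequent `get_text()` call fail. One way to guard against this is to collect each card into a dictionary with a helper that tolerates missing tags. The sketch below is an assumption-laden illustration: `text_or_none` and `parse_card` are hypothetical helpers, and `sample_card` is a made-up, simplified card rather than real Amazon markup.

```python
import re
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def parse_card(item):
    """Extract the five fields from one carousel card, tolerating gaps."""
    img = item.find('img')
    link = item.find('a')
    href = link.get('href') if link is not None else None
    return {
        'title': img.get('alt') if img is not None else None,
        'authors': text_or_none(item.find('div', {'class': 'a-row a-size-small'})),
        'rating': text_or_none(item.find('span', {'class': 'a-icon-alt'})),
        'price': text_or_none(item.find('span', {'class': 'a-size-base a-color-price'})),
        'url': 'https://www.amazon.com' + re.sub(r'/ref=.*', '', href) if href else None,
    }

# Made-up card with no rating and no price.
sample_card = BeautifulSoup("""
<li class="a-carousel-card">
  <a class="a-link-normal" href="/Some-Book/dp/B000000000/ref=pd_sim_1?psc=1">
    <img alt="Some Book" src="cover.jpg"/>
  </a>
  <div class="a-row a-size-small">Some Author</div>
</li>
""", 'html.parser').find('li')

record = parse_card(sample_card)
print(record)
```

A list of such dictionaries can be passed directly to `pd.DataFrame()`, which keeps the five fields of each book aligned even when some values are missing.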

    

With these 5 pieces of information about these books having been organized in the corresponding lists, run the following code to pack them into a structured dataset maintained in a pandas `DataFrame` object:

In [16]:
import pandas as pd
products = pd.DataFrame({'title': title_list, 'authors': authors_list, 'rating': rating_list,
                        'price': price_list, 'url': url_list})

products

| | title | authors | rating | price | url |
|---|---|---|---|---|---|
| 0 | Programming Python: Powerful Object-Oriented P... | Mark Lutz | 4.5 | $37.49 | https://www.amazon.com/Programming-Python-Powe... |
| 1 | Python Pocket Reference: Python In Your Pocket... | Mark Lutz | 4.5 | $10.99 | https://www.amazon.com/Python-Pocket-Reference... |
| 2 | Python Cookbook: Recipes for Mastering Python 3 | David Beazley | 4.5 | $23.49 | https://www.amazon.com/Python-Cookbook-Recipes... |
| 3 | Python Crash Course, 2nd Edition: A Hands-On, ... | Eric Matthes | 4.7 | $23.99 | https://www.amazon.com/Python-Crash-Course-Eri... |
| 4 | Python for Data Analysis: Data Wrangling with ... | Wes McKinney | 4.5 | $34.21 | https://www.amazon.com/Python-Data-Analysis-Wr... |
| 5 | Fluent Python: Clear, Concise, and Effective P... | Luciano Ramalho | 4.6 | $29.99 | https://www.amazon.com/Fluent-Python-Concise-E... |


The code above only extracts the cards displayed on the first slide of the carousel. To scrape the data contained in the remaining cards, we have to use Selenium to drive the browser to load subsequent slides and then repeat the previous steps for each of them.

To automate this process with a `for` loop, we have to know in advance how many slides we need to go through.



The information about the maximum number of slides is embedded in a `<span>` tag beneath the `Tag` object referred to by `carousel`, as shown below:

```html
<span class="a-carousel-page-count">Page <span class="a-carousel-page-current">1</span> of <span class="a-carousel-page-max">18</span>  </span>
```
Write code to extract the number (18 in the snippet above; the number on your page may differ):

In [17]:
# Write your code here

pagenumber = carousel.find_all('div',{'class': 'a-column a-span4 a-span-last a-text-right'})

pagenumber[0].find('span', {'class': 'a-carousel-page-max'}).get_text()


'16'
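
Since the class `a-carousel-page-max` identifies the target `<span>` by itself, the lookup can also be done in one step, and the text can be converted to an `int` so that it can be used as a loop bound. The sketch below runs on the HTML snippet shown above rather than on the live page; `snippet`, `demo_soup`, and `max_pages` are names chosen for this illustration.

```python
from bs4 import BeautifulSoup

# The page-count snippet shown above, used as a stand-in for the live page.
snippet = ('<span class="a-carousel-page-count">Page '
           '<span class="a-carousel-page-current">1</span> of '
           '<span class="a-carousel-page-max">18</span>  </span>')

demo_soup = BeautifulSoup(snippet, 'html.parser')

# Locate the <span> directly and convert its text to an integer.
max_pages = int(demo_soup.find('span', {'class': 'a-carousel-page-max'}).get_text())
print(max_pages)  # 18
```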

The following HTML code displays the right arrow button:

```html
<a class="a-button a-button-image a-carousel-button a-carousel-goto-nextpage" tabindex="0" href="#" id="a-autoid-6" style="top: 135.594px;"><span class="a-button-inner"><i class="a-icon a-icon-next"><span class="a-icon-alt">Next</span></i></span></a>
```

Since there is a preceding carousel that has a right arrow button of the same class, we need to locate the desired element with a CSS selector passed to `driver`'s `find_element()` method to avoid the ambiguity. (Older Selenium releases also offered `find_element_by_css_selector()` for this, but it was removed in Selenium 4.3.)

The selector we will supply to the method is `'div#desktop-dp-sims_purchase-similarities(-esp)-sims-feature a.a-carousel-goto-nextpage'`, which means finding an `<a>` tag of class `a-carousel-goto-nextpage` beneath the `<div>` tag of id `desktop-dp-sims_purchase-similarities(-esp)-sims-feature` (include `-esp` as appropriate).

Run the following code to locate the right arrow button and name the returned `WebElement` object `next_button`:

In [19]:
from selenium.webdriver.common.by import By

# include -esp as appropriate
next_button = driver.find_element(
    By.CSS_SELECTOR,
    'div#desktop-dp-sims_purchase-similarities-esp-sims-feature a.a-carousel-goto-nextpage')
next_button

<selenium.webdriver.remote.webelement.WebElement (session="c4295fb076e1f8f33711b94ff4df79c6", element="491a83d4-3b81-4f2b-be72-dc851888ba34")>

Then, using its `click()` or `send_keys()` method, we can simulate a user clicking the button to load subsequent slides. Write code to load the 2nd slide:

In [20]:
# Write your code here

next_button.click()

Then scrape data from cards displayed on the 2nd slide and append different pieces of information to the corresponding lists we created before:

In [21]:
# Write your code here

# Re-parse the page source, which now reflects the 2nd slide
bs_amazon_pg = BeautifulSoup(driver.page_source, 'html.parser')
carousel = bs_amazon_pg.find('div', {'id': 'desktop-dp-sims_purchase-similarities-esp-sims-feature'})  # include -esp as appropriate
list_of_items = carousel.find_all('li', {'class': 'a-carousel-card'})

import re

for item in list_of_items:
    img = item.find('img')
    # Skip cards that have no image or no alt text (and hence no title)
    if img is None or img.get('alt') is None:
        continue
    title_list.append(img['alt'])
    authors_list.append(item.find('div', {'class': 'a-row a-size-small'}).get_text())
    rating_text = item.find('span', {'class': 'a-icon-alt'}).get_text()
    rating_list.append(rating_text[0:3])    # e.g., '4.5 out of 5 stars' -> '4.5'
    price_list.append(item.find('span', {'class': 'a-size-base a-color-price'}).get_text())
    href = item.find('a').attrs['href']
    # Strip the tracking suffix and turn the relative URL into an absolute one
    url_list.append('https://www.amazon.com' + re.sub(r'/ref=.*', r'', href))
     

Run the following code snippet, and you will find that additional records (rows 6 to 9 below) have been added to the dataset:

In [22]:
import pandas as pd
products = pd.DataFrame({'title': title_list, 'authors': authors_list, 'rating': rating_list,
                        'price': price_list, 'url': url_list})

products

| | title | authors | rating | price | url |
|---|---|---|---|---|---|
| 0 | Programming Python: Powerful Object-Oriented P... | Mark Lutz | 4.5 | $37.49 | https://www.amazon.com/Programming-Python-Powe... |
| 1 | Python Pocket Reference: Python In Your Pocket... | Mark Lutz | 4.5 | $10.99 | https://www.amazon.com/Python-Pocket-Reference... |
| 2 | Python Cookbook: Recipes for Mastering Python 3 | David Beazley | 4.5 | $23.49 | https://www.amazon.com/Python-Cookbook-Recipes... |
| 3 | Python Crash Course, 2nd Edition: A Hands-On, ... | Eric Matthes | 4.7 | $23.99 | https://www.amazon.com/Python-Crash-Course-Eri... |
| 4 | Python for Data Analysis: Data Wrangling with ... | Wes McKinney | 4.5 | $34.21 | https://www.amazon.com/Python-Data-Analysis-Wr... |
| 5 | Fluent Python: Clear, Concise, and Effective P... | Luciano Ramalho | 4.6 | $29.99 | https://www.amazon.com/Fluent-Python-Concise-E... |
| 6 | Automate the Boring Stuff with Python, 2nd Edi... | Al Sweigart | 4.8 | $23.99 | https://www.amazon.com/Automate-Boring-Stuff-P... |
| 7 | Python Data Science Handbook: Essential Tools ... | Jake VanderPlas | 4.5 | $34.99 | https://www.amazon.com/Python-Data-Science-Han... |
| 8 | Introducing Python: Modern Computing in Simple... | Bill Lubanovic | 4.7 | $24.99 | https://www.amazon.com/Introducing-Python-Mode... |
| 9 | Hands-On Machine Learning with Scikit-Learn, K... | Aurélien Géron | 4.8 | $37.49 | https://www.amazon.com/Hands-Machine-Learning-... |
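
The remaining slides can be handled by repeating the click-and-scrape steps in a `for` loop bounded by the maximum page number. The following is a minimal sketch under several assumptions: `scrape_all_slides` is a hypothetical function, `parse_slide` is a hypothetical callable that takes the page source and returns the records for the cards currently displayed, and a fixed pause stands in for a proper wait (Selenium's `WebDriverWait` would be a more robust choice). The function only relies on `driver.page_source` and `next_button.click()`, so it is demonstrated here with stand-in objects instead of a real browser.

```python
import time

def scrape_all_slides(driver, next_button, n_slides, parse_slide, pause=2.0):
    """Collect the records from every slide of the carousel.

    parse_slide: hypothetical callable mapping page source -> list of records.
    """
    records = list(parse_slide(driver.page_source))   # slide 1 is already loaded
    for _ in range(n_slides - 1):
        next_button.click()        # ask the browser to load the next slide
        time.sleep(pause)          # crude fixed wait for the new cards to render
        records.extend(parse_slide(driver.page_source))
    return records

# Stand-ins that mimic a driver and a button, for demonstration only.
class FakeDriver:
    page_source = 'slide-1'

class FakeButton:
    def __init__(self, driver):
        self.driver = driver
        self.current = 1

    def click(self):
        self.current += 1
        self.driver.page_source = f'slide-{self.current}'

fake_driver = FakeDriver()
fake_button = FakeButton(fake_driver)
demo_records = scrape_all_slides(fake_driver, fake_button, 3,
                                 parse_slide=lambda source: [source], pause=0)
print(demo_records)
```

With a real session, `driver` and `next_button` would be the objects created earlier, `n_slides` the integer extracted from the `a-carousel-page-max` tag, and `parse_slide` a function wrapping the parsing steps used for the first two slides.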


When we are finished with the browser session, remember to close the browser window:

In [23]:
driver.close()