# Amazon crawling!

In this task we will crawl the https://www.amazon.com/ website!

**NB:** This lab is designed to be executed **locally** on your laptop, because it launches a local application (a browser). Headless mode can be used in Colab, but that would require additional browser installation steps, so please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.amazon.com/")
print("Status:", resp.status_code)

Status: 503


Wowowow! https://www.lifewire.com/503-service-unavailable-explained-2622940 :

```
The 503 Service Unavailable error is an HTTP status code that means a website's server is not available right now. Most of the time, it occurs because the server is too busy or maintenance is being performed on it.
```
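The meaning of a status code can also be looked up without leaving Python. Below is a minimal sketch using the standard library's `http.HTTPStatus`; the helper name `describe_status` is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry have no predefined phrase.
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}"


print(describe_status(503))  # 503 Service Unavailable
```

With the response from the cell above, `describe_status(resp.status_code)` would give the same label.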

As we can see, the output is not what we expected. So what can we do when a page is not served right away, but is rendered by a script, and only in a real browser? Browser engines can help us get the data. Let's load the same web page in a different way: we give a browser some time to load the scripts and run them. Then we work with the DOM (Document Object Model), but we obtain this DOM from the browser engine itself, not via `BeautifulSoup`.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called `drivers` interpret the commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from Selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location in your code.
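The last step can be checked from Python before launching anything. The sketch below looks for the driver on PATH with `shutil.which`. `find_driver` and `launch_firefox` are hypothetical helper names, and the `Service(executable_path=...)` form assumes Selenium 4.

```python
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver") -> Optional[str]:
    """Return the full path of the driver binary if it is on PATH, else None."""
    return shutil.which(name)


def launch_firefox(driver_path: Optional[str] = None):
    """Start Firefox, optionally pointing Selenium at an explicit driver binary."""
    # Imported here so that find_driver() works even before selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    if driver_path is None:
        # Let Selenium locate the driver on its own.
        return webdriver.Firefox()
    return webdriver.Firefox(service=Service(executable_path=driver_path))


path = find_driver()
print("geckodriver on PATH:", path if path else "not found")
```

If the driver is not on PATH, its location can be passed directly, e.g. `launch_firefox("/path/to/geckodriver")` (a placeholder path).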

### Download driver

Place it in any folder, or in a folder listed in the PATH environment variable: [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)

**Firefox** is recommended for this lab.

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate user actions on the website.

In [2]:
!pip install -U selenium

Defaulting to user installation because normal site-packages is not writeable


Check it works

In [15]:
! chmod +x /home/kamil/Desktop/New\ IR/information-retrieval/labs/lab-02/New-Lab/geckodriver


In [3]:
from selenium import webdriver

### Launch browser

This will open a browser window

In [4]:
browser = webdriver.Chrome()  # use webdriver.Firefox() if you installed geckodriver


### Download the page ... again

In [5]:
from selenium.webdriver.common.by import By

browser.get('https://www.amazon.com/gp/bestsellers/?ref_=nav_cs_bestsellers')
browser.implicitly_wait(10)  # element lookups retry for up to 10 seconds

Now we have a browser that has downloaded the page for us!
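An implicit wait tells the driver to keep retrying element lookups for up to the given time instead of failing immediately. The idea behind it is plain polling, sketched here in pure Python. `wait_until` is a hypothetical helper and not part of Selenium, which ships its own `WebDriverWait` class for waiting on a specific condition.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Toy condition for demonstration: becomes true on the third call.
calls = []


def ready():
    calls.append(1)
    return len(calls) >= 3


print(wait_until(ready, timeout=2.0, poll_interval=0.01))
```

With a live browser, the condition could be a lambda that calls `browser.find_elements(...)`, which returns an empty (falsy) list until matching elements exist.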


We want to know the price and the review score for some products (clothes).
First, let's select the elements that contain these clothes.

In [6]:
elements = browser.find_elements(By.CSS_SELECTOR, "li.a-carousel-card")
print("Elements found:", len(elements))


Elements found: 36


We have found some products matching the specific class.

Now we will interact with the website by opening each of these products in a separate tab. To do so, use `click` to perform a click in the browser; see [here](https://stackoverflow.com/questions/9798273/python-selenium-find-and-click-an-element).

In [7]:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(browser)


for element in elements:     ### this part can be given as a task.
    #element.click()    ### this part can be given as a task.
    actions.key_down(Keys.CONTROL).click(element).key_up(Keys.CONTROL).perform()
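Selenium keeps sending commands to the original tab until it is told to switch, so each new tab has to be selected before its content can be read. Below is a minimal sketch using `window_handles` and `switch_to.window`. `visit_other_tabs` and `handler` are hypothetical names, and on macOS the modifier key for opening a new tab would presumably be `Keys.COMMAND` rather than `Keys.CONTROL`.

```python
def visit_other_tabs(browser, handler):
    """Run `handler(browser)` in every tab except the current one, then switch back."""
    original = browser.current_window_handle
    results = []
    for handle in browser.window_handles:
        if handle == original:
            continue
        browser.switch_to.window(handle)  # make this tab the active one
        results.append(handler(browser))
    browser.switch_to.window(original)  # return to the starting tab
    return results
```

For instance, `visit_other_tabs(browser, lambda b: b.title)` would collect the page titles of the opened product tabs.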


Now your task is to complete the following cell. To find the element that holds all the information we need, `id=centerCol` can be used. To find the title, `id="title"` can be used; notice that the title is an `h1` tag. To find the rating score, `id="averageCustomerReviews"` can be used. Finally, the element containing the price can be found using `class="a-section a-spacing-none aok-align-center"`.

See the documentation [here](https://selenium-python.readthedocs.io/api.html#selenium.webdriver.remote.webelement.WebElement.get_attribute) and [here](https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.by.html)

Now that we are done with the browser, we can simply close it.
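The price container is identified by three classes at once. `By.CLASS_NAME` expects a single class name, so a value containing spaces is usually written as a CSS selector with the classes chained by dots. Here is a small sketch of that conversion; `classes_to_css` is a hypothetical helper, not a Selenium function.

```python
def classes_to_css(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a CSS selector matching all classes."""
    return tag + "".join("." + name for name in class_attr.split())


print(classes_to_css("a-section a-spacing-none aok-align-center"))
# .a-section.a-spacing-none.aok-align-center
```

The resulting string can be passed to `find_element(By.CSS_SELECTOR, ...)`, called on the `centerCol` element so that the search stays inside the product block.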

In [39]:
browser.quit()

Now that we have retrieved the data, it's time to get rid of nonsense characters such as `\n` or `\t` and any useless words. Your task here is to write a small regex that matches only the rating `score` in the rating field, and a regex that matches the `discount` and the `price` of the product.
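Stray `\n` and `\t` characters can be removed before any field-specific matching. Below is a minimal sketch; `squash_whitespace` is a hypothetical helper and the sample string is made up, not real scraped output.

```python
import re


def squash_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


raw = "\n\t 4.5 out of 5 stars \n\n 1,234 ratings\t"  # made-up sample text
print(squash_whitespace(raw))  # 4.5 out of 5 stars 1,234 ratings
```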

In [10]:
products_info.keys()

NameError: name 'products_info' is not defined

In [41]:
import re
pattern_price = r'[0-9.\-%]*[0-9$]+'  ### can be given as a task
pattern_rating = r'[0-9.]+'  ### can be given as a task
price_match = re.findall(pattern_price, products_info[2]['price'])
rating_match = re.findall(pattern_rating, products_info[2]['rating'])

print(f'the product has {price_match[0]} % discount and the final price is {price_match[1]}.{price_match[2]} and the rating is {rating_match[0]} out of 5')

KeyError: 2
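Indexing into `findall` results depends on every field having exactly the expected shape. A somewhat more defensive sketch is shown below, with one pattern per value. The sample strings are assumptions about what the scraped text might look like (discount, then the whole and fractional parts of the price on separate lines), and all names are hypothetical.

```python
import re

DISCOUNT_RE = re.compile(r"-?(\d+)\s*%")
# Whole part after "$", optionally followed by "." or a newline and two digits.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)(?:\s*[.\n]\s*(\d{2}))?")
RATING_RE = re.compile(r"(\d(?:\.\d)?)\s+out of\s+5")


def parse_price_field(text):
    """Return (discount percent or None, price as float or None)."""
    discount = DISCOUNT_RE.search(text)
    price = PRICE_RE.search(text)
    discount_value = int(discount.group(1)) if discount else None
    price_value = None
    if price:
        whole = price.group(1).replace(",", "")
        cents = price.group(2) or "00"
        price_value = float(f"{whole}.{cents}")
    return discount_value, price_value


def parse_rating_field(text):
    """Return the rating as a float, or None if no 'X out of 5' text is found."""
    match = RATING_RE.search(text)
    return float(match.group(1)) if match else None


sample_price = "-20%\n$15\n99"  # assumed layout, not real output
sample_rating = "4.5 out of 5 stars\n1,234 ratings"  # assumed layout
print(parse_price_field(sample_price), parse_rating_field(sample_rating))
```

Returning `None` for a missing value keeps products without a discount or rating from raising an exception.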