# Ozon crawling!

In this task we will crawl the https://www.ozon.ru/ website!

**NB:** This lab is designed to be executed **locally** on your laptop, as it launches a local application (a browser). Headless mode can be used in Colab, but that would require additional browser installation steps. Thus, please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.ozon.ru/category/smartfony-15502/")
print("Status:", resp.status_code)

Status: 403


Wowowow! https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
```
403 Forbidden (you don't have permission to access this resource) is an HTTP status code returned when the web server understands the request but refuses to grant access.
```
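The standard library can spell out what a status code means, which is handy when logging failed requests. A minimal sketch: the helper name `describe_status` is our own illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry (e.g. vendor-specific ones)
        return f"{code} (unknown status)"
    return f"{code} {status.phrase}"


print(describe_status(403))
```

Calling it as `describe_status(resp.status_code)` right after the request makes the failure reason visible without looking the code up.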

As we see, the response is not what we expected. So what can we do when a page is not served right away, but is rendered by a script, and only in a real browser? Browser engines can help us get the data. Let's load the same web page in a different way: we give a browser some time to load the scripts and run them. Then we work with the DOM (Document Object Model), but we obtain this DOM from the browser engine itself, not via `BeautifulSoup`.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called `drivers` interpret commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver for Chrome](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location in your code.
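Whether the driver ended up on PATH can be checked from Python before launching anything. A small sketch using only the standard library: the function name `find_drivers` is an illustrative choice, and `geckodriver` and `chromedriver` are the executable names the two driver projects ship.

```python
import shutil


def find_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its full path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

A missing entry is not necessarily fatal: Selenium 4.6 and later include Selenium Manager, which can try to fetch a matching driver on its own.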

### Download driver

Download the driver and place it in a folder listed in the PATH environment variable (or in any folder, if you pass its path in code): [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)

**Firefox** is recommended for this lab.

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate user actions on the website.

In [2]:
!pip install -U selenium

Defaulting to user installation because normal site-packages is not writeable


Check that it works:

In [3]:
from selenium import webdriver

### Launch browser
This will open a browser window

In [4]:
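browser = webdriver.Chrome()
# or explicitly (Selenium 4 style: paths are passed via Service and Options,
# the old executable_path= / firefox_binary= arguments were removed)
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# options = Options()
# options.binary_location = 'C:/Program Files/Mozilla Firefox/firefox.exe'
# browser = webdriver.Firefox(
#     service=Service(executable_path='C:/bin/geckodriver.exe'),
#     options=options,
# )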

### Download the page ... again

In [5]:
from selenium.webdriver.common.by import By

browser.get("https://www.ozon.ru/category/smartfony-15502/")
browser.implicitly_wait(10)  # element lookups will retry for up to 10 seconds

Now we have a browser that has downloaded the page for us!
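Script-rendered content may show up a little after `get` returns, so it helps to wait for a specific element rather than for a fixed time. Selenium ships `WebDriverWait` and `expected_conditions` for this purpose. The hand-rolled loop below is not that API, only a sketch of what such a wait does. The name `wait_for_elements` is hypothetical.

```python
import time


def wait_for_elements(browser, css_selector, timeout=10.0, poll=0.5):
    """Poll the page until at least one element matches, or time runs out.

    `browser` is any object with Selenium's find_elements(by, value) method.
    Returns the (possibly empty) list of matches.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        found = browser.find_elements("css selector", css_selector)
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)
```

A call might look like `cards = wait_for_elements(browser, "a[href*='/product/']")`, where the selector is only a guess at the page markup and should be checked in the browser's inspector. If an implicit wait is also set, each `find_elements` call may itself block for up to that long.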


We want to know the price and the review score for some products (phones).
First let's try selecting the elements that contain these phones.

In [6]:
# https://www.ozon.ru/category/smartfony-15502/
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()  # note: opens a new window; skip this line to reuse the one above

browser.get('https://www.ozon.ru/category/smartfony-15502/')
browser.implicitly_wait(10)  # element lookups will retry for up to 10 seconds

# Scroll to the end of the page
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # scroll down by the full page height
time.sleep(3)  # pause for 3 seconds so newly loaded products can render
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # scroll again to reach the new bottom

# ToDo: Find the info for each phone
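Two fixed scrolls may not reach the bottom if the page keeps loading more products. One possible generalisation, sketched under the assumption that new content makes `document.body.scrollHeight` grow, is to scroll until the height stops changing. The helper name `scroll_to_bottom` and its defaults are illustrative.

```python
import time


def scroll_to_bottom(browser, pause=2.0, max_rounds=10):
    """Scroll down until the page height stops growing.

    `browser` is any object with Selenium's execute_script(script) method.
    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight;")
    for round_no in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = browser.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            return round_no  # nothing new appeared, so we reached the end
        last_height = new_height
    return max_rounds  # gave up: the page was still growing
```

Calling `scroll_to_bottom(browser)` after `browser.get(...)` would replace the two manual scrolls. The `max_rounds` cap keeps an endlessly growing feed from looping forever.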



### Self practice
- What if you want to get more info about the products from inside their pages?
- What about the rest of the phones? How can we go to a different page?
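
As a starting point for the second question, many catalogues expose pagination through a query parameter. The sketch below assumes the parameter is called `page`. That is a guess about the site's URL scheme, so verify it by clicking through to the next page in the browser and reading the address bar. The helper name `page_url` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with the pagination query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep any existing query parameters, replacing only the page number
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


for n in range(1, 4):
    print(page_url("https://www.ozon.ru/category/smartfony-15502/", n))
```

Each generated URL can then be passed to `browser.get(...)` in a loop.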