# Ozon crawling!

In this task we will crawl the https://www.ozon.ru/ website!

**NB:** This lab is designed to be run **locally** on your laptop, as it launches a local application (a browser). Headless mode can be used in Colab, but that would require extra browser installation steps. So please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.ozon.ru/category/smartfony-15502/")
print("Status:", resp.status_code)

Status: 403


Wowowow! https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
```
403 Forbidden (you don't have permission to access this resource) is an HTTP status code returned when the web server understands the request but refuses to authorize it.
```
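To look up what a status code means without leaving the notebook, the standard library's `http.HTTPStatus` enum can help. The sketch below is only an illustration; `describe_status` is a hypothetical helper name, not something used elsewhere in this lab.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not registered in the enum end up here
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}: {status.description}"


print(describe_status(403))
```

Calling it as `describe_status(resp.status_code)` right after `requests.get(...)` gives a quick hint about why the page was not returned.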

As we see, the output is not what we would expect. So, what can we do when a page is not loaded right away, but is instead rendered by a script, and only in a valid browser? Browser engines can help us get the data. Let's try to load the same web page in a different way: let's give a browser some time to load the scripts and run them. Then we will work with the DOM (Document Object Model), but we will obtain this DOM from the browser engine itself, not via `BeautifulSoup`.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called `drivers` interpret commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location in your code.
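Step 3 can be checked from Python before launching anything. This is a minimal sketch using only the standard library; `find_driver` is a hypothetical helper name, and `geckodriver` is assumed to be the executable name of the Firefox driver on your system.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path:
    print("Driver found:", path)
else:
    print("geckodriver is not on PATH; add its folder to PATH or pass its location explicitly")
```

Recent selenium releases (4.6 and later) also bundle Selenium Manager, which tries to fetch a matching driver on its own, so a missing driver on PATH is not necessarily fatal.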

### Download driver

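Download the driver and place it in a folder listed in the PATH environment variable (or in any other folder, if you pass its location explicitly in code): [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)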

**Firefox** is recommended for this lab.

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate user actions on the website.

In [2]:
!pip install -U selenium

Collecting selenium
  Downloading selenium-4.24.0-py3-none-any.whl.metadata (7.1 kB)
Downloading selenium-4.24.0-py3-none-any.whl (9.6 MB)
Installing collected packages: selenium
  Attempting uninstall: selenium
    Found existing installation: selenium 4.23.1
    Uninstalling selenium-4.23.1:
      Successfully uninstalled selenium-4.23.1
Successfully installed selenium-4.24.0

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip


Check that it works:

In [1]:
from selenium import webdriver

ModuleNotFoundError: No module named 'selenium'
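A `ModuleNotFoundError` right after a successful install usually means the notebook kernel runs a different Python interpreter than the one `pip` installed into. The sketch below prints which interpreter the kernel uses and whether it can see `selenium`; `is_installed` is a hypothetical helper name.

```python
import importlib.util
import sys


def is_installed(module_name):
    """True if the interpreter running this code can import the module."""
    return importlib.util.find_spec(module_name) is not None


print("Interpreter:", sys.executable)
print("selenium importable:", is_installed("selenium"))
```

If this prints `False`, running `%pip install -U selenium` in a cell (the IPython magic installs into the running kernel's environment) and restarting the kernel is the usual remedy.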

### Launch browser
This will open a browser window

In [None]:
browser = webdriver.Firefox()
# browser = webdriver.Chrome()  # Make sure you have the ChromeDriver installed and in your PATH
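# or explicitly (selenium 4 style: driver path via Service, browser binary via Options)
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# options = Options()
# options.binary_location = 'C:/Program Files/Mozilla Firefox/firefox.exe'
# browser = webdriver.Firefox(service=Service(executable_path='C:/bin/geckodriver.exe'), options=options)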
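The note at the top mentions headless mode. As a rough sketch, Firefox can be started without a visible window by passing the `-headless` argument through its options object. `firefox_arguments` and `make_firefox` are hypothetical helper names; the selenium imports sit inside `make_firefox` so that the argument helper can be tried without selenium installed.

```python
def firefox_arguments(headless=False):
    """Command-line arguments for Firefox; "-headless" hides the window."""
    return ["-headless"] if headless else []


def make_firefox(headless=False):
    """Start Firefox through selenium, optionally without a visible window."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for arg in firefox_arguments(headless):
        options.add_argument(arg)
    return webdriver.Firefox(options=options)


# browser = make_firefox(headless=True)  # uncomment to launch a real browser
```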

### Download the page ... again

In [None]:
from selenium.webdriver.common.by import By

browser.get("https://www.ozon.ru/category/smartfony-15502/")
browser.implicitly_wait(5)  # element lookups will now keep retrying for up to 5 seconds
# 'reload-button' is the ID used on this page; replace it if the button has a different ID
button = browser.find_element(By.ID, 'reload-button')

# Click the button
button.click()
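Note that `implicitly_wait` does not pause the script; it only sets how long element lookups keep retrying, and `find_element` raises an exception if the button never appears. The sketch below shows the same polling idea in plain Python and clicks the button only if it shows up. `wait_until` and `click_if_present` are hypothetical helpers; the string `"id"` is the value of selenium's `By.ID`. Selenium's own `WebDriverWait` covers the same need and is usually the better choice in real code.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.2):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def click_if_present(browser, element_id, timeout=5.0):
    """Click the element with the given ID if it appears in time; return True if clicked."""
    # find_elements returns an empty list (instead of raising) when nothing matches
    found = wait_until(lambda: browser.find_elements("id", element_id), timeout)
    if not found:
        return False
    found[0].click()
    return True


# click_if_present(browser, "reload-button")  # with the browser opened above
```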

Now we have a browser that has downloaded the page for us!


We want to know the price and the review score for some products (phones).
First, let's try selecting the elements that contain these phones.
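A selenium element can itself be searched, so once a product card is selected, its nested nodes can be queried the same way as the whole page. The sketch below collects the texts of nested matches; `nested_texts` is a hypothetical helper, the string `"css selector"` is the value of selenium's `By.CSS_SELECTOR`, and any class names passed to it are assumptions about the page markup that may change at any time.

```python
def nested_texts(element, css_selector):
    """Return the stripped, non-empty texts of descendants matching css_selector."""
    matches = element.find_elements("css selector", css_selector)
    texts = [match.text.strip() for match in matches]
    return [text for text in texts if text]


# for card in browser.find_elements("css selector", ".some-card-class"):
#     print(nested_texts(card, "span"))
```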

In [109]:
import time

# Scroll down by the full page height so that lazily loaded product cards get rendered
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
browser.implicitly_wait(3)  # element lookups will now keep retrying for up to 3 seconds
time.sleep(3)  # pause to let the page load more items
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # scroll once more

# ToDo: Find the info for each phone

# '.j4n_23' is a generated class name; it may differ when you open the page
elements = browser.find_elements(By.CSS_SELECTOR, '.j4n_23')

# Loop through each element and find the nested <span> with class 'tsBody500Medium'


36
Blackview Смартфон A200 12/256 ГБ, синий
21 451 ₽
rating not found

Moris Смартфон x500pro Ростест (EAC) 6/256 ГБ, черный
10 848 ₽
rating not found

HUAWEI Смартфон nova Y61 Ростест (EAC) 4/128 ГБ, синий
7 759 ₽
4.9  807 отзывов

Infinix Смартфон Note 30i 8/256 ГБ, черный. . Уцененный товар
11 778 ₽
rating not found

PAGRAER Смартфон GT3 Pro EU 64 ГБ, черный
4 122 ₽
5.0  17 отзывов

Starlet Смартфон 15 Pro Max Global 4/128 ГБ, серебристый
6 255 ₽
rating not found

Tecno Смартфон POP 7+SIM-карта МегаФон 2/64 ГБ, черный
5 346 ₽
4.9  459 отзывов

Tecno Смартфон Spark 20C Ростест (EAC) 4/128 ГБ, черный
8 092 ₽
4.9  1 870 отзывов

ZUNYI Смартфон Camon 30 Pro Ростест (EAC) 8/128 ГБ, черный
10 191 ₽
rating not found

Tecno Смартфон Spark Go 2024 4/64 ГБ, белый
7 265 ₽
4.9  3 053 отзыва

Xiaomi Смартфон Redmi A3 Global 4/128 ГБ, черный
7 300 ₽
4.8  2 028 отзывов

IIIF150 Смартфон B1 Pro Plus 6/128 ГБ, белый, серый
14 388 ₽
4.7  91 отзыв

Tecno Смартфон POVA 6 Pro 5G "8 ядер (2.4 ГГц), 2SIM,
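The price and rating lines printed above are plain strings, so they need a bit of parsing before they can be compared or sorted. This is a minimal sketch that assumes the formats seen in the output (`21 451 ₽`, `4.9  807 отзывов`, `rating not found`); `parse_price` and `parse_rating` are hypothetical helper names.

```python
import re


def parse_price(text):
    """Extract the price as an integer, or None if the text has no digits."""
    digits = re.sub(r"\D", "", text)  # drops spaces and the currency sign
    return int(digits) if digits else None


def parse_rating(text):
    """Extract (score, review_count), or None if the text has no rating."""
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)\s+(\d[\d\s]*)", text)
    if match is None:
        return None
    score = float(match.group(1).replace(",", "."))
    reviews = int(re.sub(r"\D", "", match.group(2)))
    return score, reviews


print(parse_price("21 451 ₽"))
print(parse_rating("4.9  1 870 отзывов"))
print(parse_rating("rating not found"))
```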

### Next task
- Navigate to the second page of the list
- Open the first product among the retrieved ones (navigate to its page)
- Print the information about the product (the "О товаре", i.e. "About the product", section)
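One possible way to reach the second page is to build its URL directly rather than clicking the pager. The sketch below assumes that the listing accepts a `page` query parameter, which the text above does not confirm, so check it in the browser's address bar first. `with_page` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    # Keep every other query parameter and replace any existing 'page' value
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page("https://www.ozon.ru/category/smartfony-15502/", 2))
```

The resulting URL can then be passed to `browser.get(...)` in the same way as the first page.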

In [106]:
# ToDo: Print information about first product in the second page

Тип
Смартфон
Диагональ экрана, дюймы
7.3
Емкость аккумулятора, мАч
5800
Процессор
Snapdragon 8 Gen2 (8 ядер), 3.2 ГГц
Основной материал корпуса
Металл, Стекло
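The output above alternates between a label line and a value line, so it can be folded into a dictionary for easier lookup. This is a small sketch under that assumption; `pairs_to_dict` is a hypothetical helper name.

```python
def pairs_to_dict(lines):
    """Turn [label, value, label, value, ...] into {label: value, ...}."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # Pair even-indexed items with odd-indexed ones; a trailing label without a value is dropped
    return dict(zip(cleaned[0::2], cleaned[1::2]))


specs = pairs_to_dict(["Тип", "Смартфон", "Диагональ экрана, дюймы", "7.3"])
print(specs["Тип"])
```

With selenium, the input could come from something like `element.text.splitlines()`, where `element` is whichever node holds the product description on the page.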
