# Ozon crawling!

In this task we will crawl the https://www.ozon.com/ website!

**NB:** This lab is designed to be executed **locally** on your laptop, as it launches a local application (a browser). Headless mode can be used in Colab, but this would also require specific browser installation steps. Thus, please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [137]:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://www.ozon.ru/category/smartfony-15502/")
print("Status:", resp.status_code)

Status: 403


Wowowow! https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
```
403 Forbidden (you don't have permission to access this resource) is an HTTP status code returned when the web server understands the request but refuses to process it.
```

As we can see, the output is not what we expected. So what can we do when a page is not served as ready-made HTML, but is instead rendered by a script, and only in a real browser? Browser engines can help us get the data. Let's load the same web page in a different way: we give a browser some time to load and run the scripts, and then work with the DOM (Document Object Model) obtained from the browser engine itself rather than via `BeautifulSoup`.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called `drivers` interpret the commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from Selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location in your code.
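
Step 3 can be checked from Python before launching anything. The sketch below is a minimal helper, not part of the lab itself: the function name `find_drivers` and the `DRIVER_BINARIES` mapping are made up for illustration, and it assumes the drivers keep their default executable names (`geckodriver`, `chromedriver`).

```python
import shutil

# Default executable names of the driver releases (assumption: not renamed).
# shutil.which also handles the ".exe" extension on Windows.
DRIVER_BINARIES = {"firefox": "geckodriver", "chrome": "chromedriver"}


def find_drivers(binaries=DRIVER_BINARIES):
    """Return {browser name: absolute driver path, or None if not on PATH}."""
    return {name: shutil.which(exe) for name, exe in binaries.items()}


if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(f"{name}: {path or 'driver not found on PATH'}")
```

If a driver is reported as missing, either fix PATH or pass the driver location explicitly when creating the browser object. Recent Selenium releases (4.6 and later) also bundle Selenium Manager, which can usually fetch a matching driver on its own.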

### Download driver

Download the driver and place it in a folder listed in the PATH environment variable (or in any other folder, if you specify its location in code): [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)

**Firefox** is recommended for this lab.

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate user actions on the website.

In [9]:
!pip install -U selenium

Check that it works:

In [129]:
from selenium import webdriver

### Launch browser
This will open a browser window

In [138]:
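browser = webdriver.Firefox()
# browser = webdriver.Chrome()  # Make sure you have ChromeDriver installed and on your PATH
# Or pass the locations explicitly (Selenium 4 API; the old `executable_path`
# and `firefox_binary` keyword arguments were removed):
# from selenium.webdriver.firefox.options import Options
# from selenium.webdriver.firefox.service import Service
# options = Options()
# options.binary_location = 'C:/Program Files/Mozilla Firefox/firefox.exe'
# browser = webdriver.Firefox(
#     service=Service(executable_path='C:/bin/geckodriver.exe'),
#     options=options,
# )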

### Download the page ... again

In [139]:
from selenium.webdriver.common.by import By

browser.get("https://www.ozon.ru/category/smartfony-15502/")
# Let element lookups retry for up to 5 seconds before failing
# (this sets a lookup timeout; it does not pause the script)
browser.implicitly_wait(5)

# Find the reload button by its ID
button = browser.find_element(By.ID, 'reload-button')

# Click the button
button.click()
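
`find_element` raises `NoSuchElementException` when nothing matches, so the cell above stops with an error whenever the page loads without a reload button. A minimal sketch of a more forgiving variant follows. The helper name `click_if_present` is hypothetical; it relies only on `find_elements`, which returns an empty list instead of raising.

```python
def click_if_present(browser, by, value):
    """Click the first element matching (by, value).

    Returns True if an element was found and clicked, False otherwise.
    """
    matches = browser.find_elements(by, value)  # empty list when nothing matches
    if not matches:
        return False
    matches[0].click()
    return True


# Intended use in this notebook (needs the `browser` and `By` defined above):
# click_if_present(browser, By.ID, "reload-button")
```

The function only needs an object with a `find_elements(by, value)` method, so it works with the `browser` object as well as with any element returned by a previous lookup.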

Now we have a browser that has downloaded the page for us!


We want to know the price and the review score for some products (phones).
First, let's try selecting the elements that contain these phones.
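
The next cell scrolls twice with a fixed pause so that more products appear in the list. If the number of scrolls needed is not known in advance, one option is to keep scrolling until the page height stops growing. The sketch below assumes that new products make `document.body.scrollHeight` grow; the helper name `scroll_until_stable` and its default values are illustrative, not part of the lab.

```python
import time


def scroll_until_stable(browser, pause=2.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    get_height = "return document.body.scrollHeight"
    last_height = browser.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append more items
        new_height = browser.execute_script(get_height)
        if new_height == last_height:  # nothing new was loaded
            return rounds
        last_height = new_height
    return max_rounds


# Intended use in this notebook: scroll_until_stable(browser, pause=3)
```

`max_rounds` caps the loop so that a page with endless content cannot keep the notebook scrolling forever.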

In [140]:
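import time

# Scroll down by one full page height so that more products get loaded
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
browser.implicitly_wait(3)  # from now on, element lookups retry for up to 3 seconds
time.sleep(3)  # pause for 3 seconds while new products load
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # scroll down once more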

# ToDo: Find the info for each phone
from selenium.common.exceptions import NoSuchElementException

# Product cards. Class names such as 'j4n_23' look auto-generated and may
# change, so check the current ones in the browser's inspector.
elements = browser.find_elements(By.CSS_SELECTOR, '.j4n_23')
print(len(elements))

# For each card, print the nested elements holding the name, price, and rating
for element in elements:
    try:
        span_element = element.find_element(By.CSS_SELECTOR, ".tsBody500Medium")
        print(span_element.text)
    except NoSuchElementException:
        print("Span element with class 'tsBody500Medium' not found in this element.")
    try:
        span_element = element.find_element(By.CSS_SELECTOR, ".c3015-a1")
        print(span_element.text)
    except NoSuchElementException:
        print("price not found")
    try:
        span_element = element.find_element(By.CSS_SELECTOR, ".tsBodyMBold")
        # span_element = span_element.find_element(By.CSS_SELECTOR, '.q2')
        print(span_element.text)
    except NoSuchElementException:
        print("rating not found")
    print()

36
TechnoMiga Смартфон g24_ultra_серый_all Ростест (EAC) 2/32 ГБ, серебристый
6 999 ₽
5.0  35 отзывов

TechnoMiga Смартфон G24_ultra_черный_all Ростест (EAC) 2/32 ГБ, черный
6 899 ₽
5.0  36 отзывов

Xiaomi Смартфон Redmi Note 13 8/256 ГБ, синий
16 019 ₽
4.9  1 701 отзыв

Tecno Смартфон Spark Go 2024 4/128 ГБ, черный
7 294 ₽
4.9  3 066 отзывов

Tecno Смартфон Spark 20 8/128 ГБ, черный
9 690 ₽
4.9  1 588 отзывов

Xiaomi Смартфон Redmi A3x 3/64 ГБ, черный
5 917 ₽
4.9  417 отзывов

realme Смартфон Note 50 4/128 ГБ, голубой
7 469 ₽
4.9  37 275 отзывов

Xiaomi Смартфон Redmi A3 Global 4/128 ГБ, черный
7 646 ₽
4.8  2 051 отзыв

realme Смартфон Note 50 3/64 ГБ, черный
6 208 ₽
4.9  37 275 отзывов

Infinix Смартфон NOTE 30i X6716 Ростест (EAC) 8/128 ГБ, черный
10 148 ₽
4.9  12 816 отзывов

Xiaomi Смартфон Redmi 13C 4/128 ГБ, синий
10 030 ₽
4.9  4 536 отзывов

Infinix Смартфон HOT 40i 8/256 ГБ, голубой
10 660 ₽
4.9  1 640 отзывов

GK Retail Смартфон SP2-10 Ростест (EAC) 16/1 ТБ, черный матовый
7 
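
The printed values are plain strings such as `6 999 ₽` and `4.9  1 701 отзыв`. To compare or sort products they have to become numbers. The sketch below shows one possible way to do that, assuming the text keeps the format seen in the output above. The helper names `parse_price` and `parse_rating` are made up for this example.

```python
import re


def parse_price(text):
    """'6 999 ₽' -> 6999. Drops every non-digit, so any kind of space works."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_rating(text):
    """'4.9  1 701 отзыв' -> (4.9, 1701); returns None if the format differs."""
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)\s+(\d[\d\s]*)", text)
    if match is None:
        return None
    score = float(match.group(1).replace(",", "."))
    reviews = int(re.sub(r"\s", "", match.group(2)))
    return score, reviews


print(parse_price("6 999 ₽"))
print(parse_rating("4.9  1 701 отзыв"))
```

Both helpers return `None` instead of raising when the text does not look as expected, which fits the "not found" fallbacks used in the loop above.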

### Next task 
- Navigate to the second page of the list
- Open the first product among those retrieved
- Print the information about the product ("О товаре", i.e. "About the product")

In [None]:
# ToDo: Print information about the first product on the second page
# Click the "next page" button
next_button = browser.find_element(By.CSS_SELECTOR, '.qe2')
next_button.click()
browser.implicitly_wait(5)  # element lookups retry for up to 5 seconds

# Open the first product card via its link
element = browser.find_element(By.CSS_SELECTOR, '.j4n_23')
link = element.find_element(By.CSS_SELECTOR, '.jl0_23')
link.click()
browser.implicitly_wait(5)

# Print the product description block
about_phone = browser.find_element(By.CSS_SELECTOR, '.mu5_27')
print(about_phone.text)
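
Note that `implicitly_wait` does not pause the script after a click. It only sets how long later element lookups keep retrying. When a page changes after navigation, it is often clearer to wait explicitly for the element that is needed next. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose. The sketch below shows the same polling idea in plain Python, with a hypothetical helper name `wait_for_elements` and illustrative default timings.

```python
import time


def wait_for_elements(browser, by, value, timeout=10.0, poll=0.5):
    """Poll until at least one element matches, then return the matches.

    Raises TimeoutError if nothing matched within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        matches = browser.find_elements(by, value)
        if matches:
            return matches
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {value!r} within {timeout} s")
        time.sleep(poll)


# Intended use in this notebook, e.g. after clicking the "next page" button:
# cards = wait_for_elements(browser, By.CSS_SELECTOR, ".j4n_23")
```

The helper returns the matched elements, so the result can be used directly in place of a separate `find_elements` call.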