# Ozon crawling!

In this task we will crawl https://www.ozon.com/ website!

**NB:** This lab is designed to be executed **locally** at your laptop, as it launches local application (browser). Indeed, headless mode can be used in colab, but this would also require specific browser installation steps. Thus, please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [59]:
from datetime import time

import requests
from bs4 import BeautifulSoup
from prompt_toolkit.contrib.telnet.protocol import EC

resp = requests.get("https://www.ozon.ru/category/smartfony-15502/")
print("Status:", resp.status_code)

Status: 403


Wowowow! https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
```
403 Forbidden – you don't have permission to access this resource is an HTTP status code that occurs when the web server understands the request but can't provide additional access. :
```

As we see, the output is not what we would expect. So, what can we do when a page is not being loaded right away, but is rather rendered by a script, and only in a valid browser? Browser engines can help us getting the data. Let's try to load the same web page, but do it in a different way: let's give a browser some time to load the scripts and run them. And then we will work with DOM (Document Object Model), but we will obtain this DOM from the browser engine itself, not via `BeautifulSoup`.

Where do we get browser engine from? Simply installing a browser will do the thing. How do we send commands to it from the code, and retrieve the DOM? Service applications called `drivers` will interpret commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install browser itself;
2. download 'driver' - binary executable, which passed commands from selenium to browser. E.g. [Gecko = Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack driver into a folder under PATH environment variable. Or specify exact binary location when you write the code.

### Download driver

And place it in any folder or under PATH env. variable. [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)

**FireFox** is recommended for this lab

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate used actions with the website.

In [29]:
!pip install -U selenium




[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Check it works

In [30]:
from selenium import webdriver

### Launch browser
This will open a browser window

In [60]:
browser = webdriver.Firefox()
# browser = webdriver.Chrome()  # Make sure you have the ChromeDriver installed and in your PATH
# or explicitly
# browser = webdriver.Firefox(
#     executable_path='C:/bin/geckodriver.exe', 
#     firefox_binary='C:/Program Files/Mozilla Firefox/firefox.exe'
# )

### Download the page ... again

In [52]:
from selenium.webdriver.common.by import By

browser.get("https://www.ozon.ru/category/smartfony-15502/")
browser.implicitly_wait(5)  # wait for 10 seconds
# element = browser.find_element_by_id("button")
# element.click()
button = browser.find_element(By.ID, 'reload-button')  # Replace 'button-id' with the actual ID of the button

# Click the button
button.click()

Now we have a browser that has downloaded the page for us!


We want to know the price and the review score for some products (phone).
First let's try selecting the elements that contains these phones.

In [43]:
import time

browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # Scroll down by 400 pixels
browser.implicitly_wait(3)  # wait for 3 seconds
time.sleep(3)
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")  # Scroll down by 400 pixels

elements = browser.find_elements(By.CSS_SELECTOR, '.j4n_23')

for element in elements:
    try:
        nested_span = element.find_element(By.CSS_SELECTOR, 'span.tsBody500Medium')
        name = nested_span.text

        price_span = element.find_element(By.CSS_SELECTOR, 'span.tsHeadline500Medium')
        price = price_span.text

        stars_span = element.find_elements(By.CSS_SELECTOR, 'span.q2')
        try:
            rating, reviews = stars_span[2], stars_span[3]
            print('Rating ' + rating.text + 'Reviews ' + reviews)
        except:
            print("No data")

        print(f"Name: {name}")
        print(f"Price: {price}")

    except Exception as e:
        print(f"Error: {e}")


No data
Name: Poco Смартфон M6 Pro 8/256 ГБ, фиолетовый
Price: 19 563 ₽
No data
Name: Samsung Смартфон Galaxy A55 Global 8/128 ГБ, черный
Price: 26 535 ₽
No data
Name: Xiaomi Смартфон Redmi Note 12 6/128 ГБ, зеленый
Price: 13 439 ₽
No data
Name: Tecno Смартфон POVA 6 Pro 5G Ростест (EAC) 12/256 ГБ, черный
Price: 21 255 ₽
No data
Name: ZUNYI Смартфон GT10 Plus, глобальная русская версия, сеть 4g, Android 14, две SIM-карты, 7,3 дюйма, подарок， ударопрочная и водонепроницаемая защита, мощные игровые функции, гибкая камера, длительное время автономной работы，Интерфейс Type-C Ростест (EAC) 6/128 ГБ, черный
Price: 7 487 ₽
No data
Name: Tecno Смартфон Spark 20 Pro+ Ростест (EAC) 8/256 ГБ, черный матовый
Price: 18 720 ₽
No data
Name: ZUNYI Смартфон Note 30i，Note 13 Pro，X6 Neo，Смартфон русской версии，сеть 4g，6,8 дюйма，две SIM-карты，ударопрочная и водонепроницаемая защита，длительное время автономной работы，мощные игровые функции，большой HD экран，сенсорный телефон，быстрая зарядка，отличный подарок

### Next task 
- Navigate through the second page in the list
- Select the first product among retrieved products (navigate)
- print information about the product (О товаре)

In [66]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


browser.get("https://www.ozon.ru/category/smartfony-15502/")
browser.implicitly_wait(5)  # Initial wait

browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
time.sleep(3)

next_page_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//a[contains(@href, "page=2")]'))  # Update XPath if needed
)
next_page_button.click()

products = browser.find_elements(By.XPATH, '//div[contains(@class, "j4n_23")]')

browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
time.sleep(3) 

product_descriptions = []

for i in range(5):
    product = products[i]
    product_link = product.find_element(By.XPATH, './/a')
    product_link.click()

    WebDriverWait(browser, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '[data-widget="webShortCharacteristics"]'))
    )

    product_details = browser.find_element(By.CSS_SELECTOR, '[data-widget="webShortCharacteristics"]')
    description = product_details.text
    product_descriptions.append(description)

    browser.back()

    WebDriverWait(browser, 50).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "j4n_23")]'))
    )
    products = browser.find_elements(By.XPATH, '//div[contains(@class, "j4n_23")]')
    
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    time.sleep(3) 

with open("product_descriptions.txt", "w", encoding="utf-8") as file:
    for idx, description in enumerate(product_descriptions, start=1):
        file.write(f"Product {idx} Description:\n{description}\n\n")

print(product_details.text)



IndexError: list index out of range