# Ozon crawling!

In this task we will crawl the https://www.ozon.ru/ website!

**NB:** This lab is designed to be run **locally** on your laptop, because it launches a local application (a browser). Headless mode can be used in Colab, but that requires additional browser installation steps, so please use Anaconda.

## Dependency installation

Let's try to load and parse the page the way we did before:

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium.common import NoSuchElementException

resp = requests.get("https://www.ozon.ru/category/smartfony-15502/")
print("Status:", resp.status_code)

Status: 403


Wowowow! https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
```
403 Forbidden is an HTTP status code meaning that the web server understood the request but refuses to grant access: you don't have permission to access this resource.
```
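To make status codes like this one easier to read in logs, here is a minimal sketch that turns a numeric code into a short label using only the standard library. The helper name `describe_status` is hypothetical and not part of the lab.

```python
from http import HTTPStatus

# Human-readable names for the five status code classes
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not a code the standard library knows about
        return f"{code} (unknown status)"
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"

print(describe_status(403))  # 403 Forbidden (client error)
```

A 4xx code means that the request itself was rejected, so simply retrying the same `requests.get` call is unlikely to help.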

As we see, the output is not what we expected. So what can we do when a page is not served right away, but is instead rendered by a script, and only in a real browser? Browser engines can help us get the data. Let's load the same web page in a different way: give a browser some time to load the scripts and run them. We will then work with the DOM (Document Object Model), but we will obtain this DOM from the browser engine itself, not via `BeautifulSoup`.

Where do we get a browser engine from? Simply installing a browser will do. How do we send commands to it from code and retrieve the DOM? Service applications called `drivers` interpret commands and translate them into browser actions.

For each supported browser engine you will need to:
1. install the browser itself;
2. download the 'driver', a binary executable that passes commands from Selenium to the browser, e.g. [geckodriver for Firefox](https://github.com/mozilla/geckodriver/releases), [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html);
3. unpack the driver into a folder listed in the PATH environment variable, or specify the exact binary location in your code.
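The last step can be verified from Python before launching anything. The sketch below assumes the default executable names `geckodriver` and `chromedriver`; the helper `find_drivers` is hypothetical.

```python
import shutil

def find_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}

for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver is reported as not found, either add its folder to PATH or pass the exact location when creating the browser.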

### Download driver

Download the driver and place it in any folder, ideally one listed in the PATH environment variable: [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](http://chromedriver.storage.googleapis.com/index.html)

**Firefox** is recommended for this lab.

### Install selenium

Selenium is a powerful tool for automated UI testing. We will use it to emulate user actions on the website.

In [9]:
!pip install -U selenium


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Check it works
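One possible check, sketched with the standard library only so that it does not fail when the package is missing. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("selenium")
print("selenium", version if version else "is not installed")
```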

### Launch browser
This will open a browser window

In [69]:
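from selenium import webdriver
browser = webdriver.Firefox()
# browser = webdriver.Chrome()  # Make sure you have the ChromeDriver installed and in your PATH
# or explicitly (Selenium 4 style: paths are passed through Service and Options)
# from selenium.webdriver.firefox.service import Service
# from selenium.webdriver.firefox.options import Options
# options = Options()
# options.binary_location = 'C:/Program Files/Mozilla Firefox/firefox.exe'
# browser = webdriver.Firefox(
#     service=Service(executable_path='C:/bin/geckodriver.exe'),
#     options=options
# )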

### Download the page ... again

In [170]:
from selenium.webdriver.common.by import By
from selenium import webdriver

# NB: this opens a new window; skip this line if the browser from the previous cell is still open
browser = webdriver.Firefox()
# implicitly_wait does not pause: it makes element lookups retry for up to 10 seconds
browser.implicitly_wait(10)
browser.get("https://www.ozon.ru/category/smartfony-15502/")

# Find the reload button by its ID and click it
button = browser.find_element(By.ID, 'reload-button')
button.click()

Now we have a browser that has downloaded the page for us!
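Because the page is rendered by scripts, elements may appear some time after the page is loaded. The implicit wait used above handles this by retrying lookups. The sketch below shows the same polling idea in plain Python; `wait_until` is a hypothetical helper written for illustration, and Selenium ships its own `WebDriverWait` class for this purpose.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Hypothetical usage with the browser opened above:
# tiles = wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, '.j4n_23'))
```

Polling for a concrete condition is usually more reliable than a fixed `time.sleep`, because it returns as soon as the data is there and fails loudly when it never arrives.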


We want to know the price and the review score for some products (phones).
First, let's try selecting the elements that contain these phones.

In [110]:
import time

browser.implicitly_wait(3)  # element lookups will now retry for up to 3 seconds

# Scroll to the bottom a few times so that lazily loaded product tiles get rendered
for _ in range(3):
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    time.sleep(3)  # give the page time to load more tiles

# NB: class names such as 'j4n_23' look auto-generated and may differ when you run this
elements = browser.find_elements(By.CSS_SELECTOR, '.j4n_23')
for phone in elements:
    name = phone.find_element(By.CSS_SELECTOR, 'span.tsBody500Medium').text
    price = phone.find_element(By.CSS_SELECTOR, 'span.tsHeadline500Medium').text
    # The rating and the review count are two separate spans; skip tiles without both
    rate_score = phone.find_elements(By.CSS_SELECTOR, 'div.tsBodyMBold span.q2')
    if len(rate_score) != 2:
        continue
    rating = rate_score[0].text
    scores = rate_score[1].text

    print(f"Name: {name}\n"
          f"Price: {price}\n"
          f"Rating: {rating}\n"
          f"Scores: {scores}\n------------")

Name: Tecno Смартфон Spark 20 8/128 ГБ, черный
Price: 9 690 ₽
Rating: 4.9  
Scores: 1 586 отзывов
------------
Name: realme Смартфон Note 50 4/128 ГБ, голубой
Price: 7 469 ₽
Rating: 4.9  
Scores: 37 245 отзывов
------------
Name: Xiaomi Смартфон Redmi A3 Global 4/128 ГБ, черный
Price: 7 646 ₽
Rating: 4.8  
Scores: 2 049 отзывов
------------
Name: Infinix Смартфон HOT 40i 8/256 ГБ, голубой
Price: 10 660 ₽
Rating: 4.9  
Scores: 1 639 отзывов
------------
Name: Infinix Смартфон NOTE 30i X6716 Ростест (EAC) 8/128 ГБ, черный
Price: 10 589 ₽
Rating: 4.9  
Scores: 12 812 отзывов
------------
Name: HUAWEI Смартфон nova Y91 8/128 ГБ, черный
Price: 10 799 ₽
Rating: 4.9  
Scores: 2 365 отзывов
------------
Name: Xiaomi Смартфон Redmi 13C 4/128 ГБ, синий
Price: 10 030 ₽
Rating: 4.9  
Scores: 4 523 отзыва
------------
Name: Xiaomi Смартфон Redmi A3 Global 4/128 ГБ, голубой
Price: 7 645 ₽
Rating: 4.8  
Scores: 2 049 отзывов
------------
Name: Xiaomi Смартфон Redmi Note 13 8/128 ГБ, синий
Price: 14 3
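
The scraped values are plain strings such as `9 690 ₽` or `1 586 отзывов`. To compare or sort products they need to be converted to numbers. A minimal sketch, assuming the formats seen in the output above; the helper names `parse_int` and `parse_rating` are hypothetical.

```python
import re

def parse_int(text: str) -> int:
    """Extract an integer from text such as '9 690 ₽' or '1 586 отзывов'."""
    # Drop everything that is not a digit: any kind of space, the currency sign, words
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    return int(digits)

def parse_rating(text: str) -> float:
    """Convert a rating string such as '4.9  ' to a float."""
    return float(text.strip().replace(",", "."))

print(parse_int("9 690 ₽"))         # 9690
print(parse_int("37 245 отзывов"))  # 37245
print(parse_rating("4.9  "))        # 4.9
```

Removing all non-digit characters, rather than only ASCII spaces, also covers the case where the site uses non-breaking or thin spaces as thousands separators.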

### Next task 
- Navigate to the second page of the list
- Select the first product among the retrieved products (navigate to its page)
- Print information about the product (the "О товаре" section, i.e. "About the product")

In [184]:
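# Go to the next page of results by clicking the pagination link
button = browser.find_element(By.CSS_SELECTOR, '.r2e .qe2')
button.click()
# Scroll to the bottom so that lazily loaded product tiles get rendered
browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
browser.implicitly_wait(3)  # element lookups will now retry for up to 3 seconds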
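
An alternative to clicking the pagination link is to open the second page by URL. The sketch below only builds the URL. It assumes that the listing accepts a `page` query parameter, which the lab does not confirm, so check it in the address bar first. The helper `with_page` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url: str, page: int) -> str:
    """Return url with its 'page' query parameter set to the given page number."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page)  # assumption: the site paginates via ?page=N
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_page("https://www.ozon.ru/category/smartfony-15502/", 2))
# https://www.ozon.ru/category/smartfony-15502/?page=2
```

The result can then be passed to `browser.get(...)`, which avoids depending on the class names of the pagination control.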

In [197]:
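import json

# Collect links to the first 5 distinct products on the page
links = []
buttons = browser.find_elements(By.CLASS_NAME, 'tile-hover-target')
for button in buttons:
    href = button.get_attribute('href')
    # Skip empty and repeated links (the same product link can appear more than once)
    if href and href not in links:
        links.append(href)
    if len(links) == 5:
        break

for i, link in enumerate(links):
    browser.get(link)
    data = {}
    # Each matched element holds one characteristic: its name and its value
    info_rows = browser.find_elements(By.CLASS_NAME, 'm5u_27')
    for element in info_rows:
        name = element.find_element(By.CSS_SELECTOR, 'span.tsBodyM').text
        value = element.find_element(By.CLASS_NAME, 'tsBody400Small').text
        data[name] = value
    print(data)

    # ensure_ascii=False keeps Cyrillic text readable, so write the file as UTF-8
    with open(f'{i}.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)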
	

{'Тип': 'Смартфон', 'Диагональ экрана, дюймы': '6.7', 'Емкость аккумулятора, мАч': '4500', 'Процессор': 'Snapdragon 680 (8 ядер), 2.4 ГГц', 'Беспроводные интерфейсы': 'NFC,'}
{'Тип': 'Смартфон', 'Диагональ экрана, дюймы': '6.6', 'Емкость аккумулятора, мАч': '5000', 'Основной материал корпуса': 'Пластик', 'Беспроводные интерфейсы': 'Wi-Fi,'}
{'Тип': 'Смартфон', 'Диагональ экрана, дюймы': '6.78', 'Емкость аккумулятора, мАч': '5500', 'Процессор': 'Dimensity 1000 Plus (8 ядер), 2.6 ГГц', 'Основной материал корпуса': 'Пластик'}
{'Тип': 'Смартфон', 'Диагональ экрана, дюймы': '6.56', 'Емкость аккумулятора, мАч': '5000', 'Процессор': 'Helio G36 (8 ядер), 2.2 ГГц', 'Основной материал корпуса': 'Пластик'}
<class 'str'>
<