# Data scraping

In this notebook we scrape smartphone reviews from the Yandex Market catalog: https://market.yandex.ru/catalog--mobilnye-telefony/54726/list.

The first step is to collect the links to the smartphones in the catalog (the get_smartphones function).

The second step consists of several stages for each smartphone in the list:
- go to the reviews section,
- determine the number of reviews and, from it, the number of review pages for the current smartphone,
- step through all these pages automatically and collect the reviews and their scores,
- save each review and its score to a .json file.

We used the selenium module for this. Attempts with the BeautifulSoup and Scrapy libraries failed because the site blocked the large number of automated requests, and the workarounds we tried did not get past the protection. With selenium, the captcha had to be solved only once.

In [66]:
from selenium import webdriver
import time 
import json

import warnings
warnings.filterwarnings('ignore')

Creating the variable that stores the links to all smartphones in the Yandex Market catalog.

In [3]:
links_list = []

The get_smartphones function opens a catalog page and collects the links to all smartphones on it. It takes the page URL and links_list as parameters and returns the updated links_list.

Unfortunately, the smartphone links appear on the catalog pages in what looks like random order (or we did not find the logic behind it). For this reason we process each page separately, check whether each link is already in links_list, and append it only if it is new.

As the outputs below show, some pages return only links we already have, so the list does not grow. We check only 6 pages and get 190 smartphone links, which is more than enough for our purposes.
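The deduplication step described above can be looked at separately from the browser code. The following is a minimal sketch, not the function used in this notebook: the helper name `merge_new_links` and the example.com links are hypothetical. It keeps a set next to the list so that the membership check stays fast as the list grows, while the list preserves the order in which links were first seen.

```python
def merge_new_links(found_links, links_list):
    """Append links from found_links that are not yet in links_list, keeping order."""
    seen = set(links_list)  # set lookup is faster than scanning the list each time
    for href in found_links:
        # skip missing hrefs (None or empty) and links collected earlier
        if href and href not in seen:
            seen.add(href)
            links_list.append(href)
    return links_list


# Hypothetical links standing in for two catalog pages that overlap
page_a = ["https://example.com/phone-1", "https://example.com/phone-2"]
page_b = ["https://example.com/phone-2", "https://example.com/phone-3", None]

collected = merge_new_links(page_a, [])
collected = merge_new_links(page_b, collected)
print(len(collected))  # 3
```

With only a few hundred links a plain list check is fast enough, so the set is a convenience rather than a necessity here.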

In [19]:
def get_smartphones(url, links_list):
    
    driver = webdriver.Chrome('/Users/marinatrofimovich/studing/coursere_course/final_project/week6/chromedriver')
    driver.get(url)
    # give the page time to load
    time.sleep(10)
    links = driver.find_elements_by_css_selector('.wwZc93J2Ao')
    for link in links:
        href = link.get_attribute('href')
        # keep only the links that have not been collected yet
        if href not in links_list:
            links_list.append(href)
    
    # close the browser window opened for this page
    driver.quit()
    return links_list

In [20]:
url1 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list"
links_list1 = get_smartphones(url1, links_list)

In [21]:
len(links_list1)

96

In [22]:
url2 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list?cpa=0&hid=91491&onstock=1&page=2&local-offers-first=0"
links_list2 = get_smartphones(url2, links_list1)

In [23]:
len(links_list2)

96

In [24]:
url3 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list?cpa=0&hid=91491&onstock=1&page=3&local-offers-first=0"
links_list3 = get_smartphones(url3, links_list2)

In [25]:
len(links_list3)

96

In [26]:
url4 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list?cpa=0&hid=91491&onstock=1&page=4&local-offers-first=0"
links_list4 = get_smartphones(url4, links_list3)

In [27]:
len(links_list4)

96

In [28]:
url5 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list?cpa=0&hid=91491&onstock=1&page=5&local-offers-first=0"
links_list5 = get_smartphones(url5, links_list4)

In [29]:
len(links_list5)

142

In [30]:
url6 = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list?cpa=0&hid=91491&onstock=1&page=6&local-offers-first=0"
links_list6 = get_smartphones(url6, links_list5)

In [31]:
len(links_list6)

190
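The six calls above differ only in the `page` query parameter, so the URLs could also be generated. The sketch below is one way to do that, assuming the query string keeps the same parameters as in the cells above. The names `BASE_URL`, `catalog_page_url` and `page_urls` are introduced here for illustration only.

```python
from urllib.parse import urlencode

BASE_URL = "https://market.yandex.ru/catalog--mobilnye-telefony/54726/list"


def catalog_page_url(page):
    """Return the catalog URL for a 1-based page number."""
    if page == 1:
        # the first page was requested without a query string
        return BASE_URL
    # same parameters, in the same order, as the URLs used above
    params = {"cpa": 0, "hid": 91491, "onstock": 1, "page": page, "local-offers-first": 0}
    return BASE_URL + "?" + urlencode(params)


page_urls = [catalog_page_url(page) for page in range(1, 7)]
for page_url in page_urls:
    print(page_url)
```

Each URL in `page_urls` could then be passed to get_smartphones in a loop, reusing the same links_list.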

The number_pages function calculates the number of review pages for a smartphone, assuming 10 reviews per page.

In [32]:
def number_pages(num_reviews):
    # 10 reviews per page; a partially filled last page still counts as a page
    if (num_reviews % 10) == 0:
        num_pages = num_reviews // 10
    else:
        num_pages = num_reviews // 10 + 1
    return num_pages
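The page count is a ceiling division: $\lceil n / 10 \rceil$ pages for $n$ reviews. As an illustration, here is an equivalent one-line variant. The name `number_pages_ceil` and the `per_page` parameter are not part of the notebook, and the page size of 10 is the assumption stated above.

```python
import math

REVIEWS_PER_PAGE = 10  # page size assumed in this notebook


def number_pages_ceil(num_reviews, per_page=REVIEWS_PER_PAGE):
    """Number of pages needed to show num_reviews reviews."""
    return math.ceil(num_reviews / per_page)


for n in (0, 1, 10, 11, 25):
    print(n, number_pages_ceil(n))
# 0 0
# 1 1
# 10 1
# 11 2
# 25 3
```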

In [87]:
def get_reviews(smartphones):
    
    driver = webdriver.Chrome('/Users/marinatrofimovich/studing/coursere_course/final_project/week6/chromedriver')
    
    for i, smartphone in enumerate(smartphones):
        try:
            # get the url to the page with current smartphone
            url = smartphone
            driver.get(url)
        
            time.sleep(10)
            
            # find the reviews-button and click it
            li_element = driver.find_element_by_class_name("QjE88eF2HX")
            a_element = li_element.find_element_by_class_name("_2XmtVnQ64x")
            a_element.click()
        
            time.sleep(2)
        
            # for each smartphone get the number of reviews
            num_reviews_text = driver.find_elements_by_class_name("yVmxx3-ZVv")
            
            # several elements on the page share this class; the review count is the 3rd one
            # (if fewer than 3 are found, the IndexError is caught below and the smartphone is skipped)
            num_reviews = int(num_reviews_text[2].text)
                    
            # for each smartphone calculate the number of pages with reviews
            if num_reviews != 0:
                num_pages = number_pages(num_reviews)
                    
                # go through all pages with reviews
                for num in range(1, num_pages + 1):
                    try:
                        
                        time.sleep(1)
                        
                        # get review and score
                        reviews_text = driver.find_elements_by_class_name("_3IXczk7DdZ")
                        scores_text = driver.find_elements_by_class_name("_2QBYNzUrMp")
                        
                        # save them in the file
                        for review, score in zip(reviews_text, scores_text):
                            review_text = review.text
                            sc = score.get_attribute("data-rate")
                            # write UTF-8 explicitly because ensure_ascii=False keeps non-ASCII text as-is
                            with open('train.json', 'a', encoding='utf-8') as f:
                                json.dump({'review': review_text, 'score': sc}, f, ensure_ascii=False)
                                f.write('\n')
                        
                        time.sleep(2)
                        
                        # if there are unviewed pages with the reviews click the "Next page" button
                        if (num != num_pages):
                            element = driver.find_element_by_class_name('_3OFYTyXi90')
                            driver.execute_script("arguments[0].click();", element)
                            
                    except Exception:
                        pass
                        
        except Exception:
            pass
    
    # close the browser once all smartphones have been processed
    driver.quit()

In [88]:
get_reviews(links_list6)

The train.json file now contains 24186 reviews with their scores, scraped from Yandex Market. They are saved in the format:

$\{\text{"review": text1, "score":  mark1}\}$ line feed $\{\text{"review": text2, "score":  mark2}\}$, etc. 
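Because every line of the file is a separate JSON object, the file has to be read line by line rather than with a single `json.load` call. The sketch below shows one way to do that. The helper name `load_reviews` and the two sample records are made up for illustration, and the example writes to a temporary file rather than touching train.json. Scores are kept as strings because they were read from an HTML attribute.

```python
import json
import os
import tempfile


def load_reviews(path):
    """Read a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Write two made-up records in the same one-object-per-line layout, then read them back
sample = [
    {"review": "Отличный телефон", "score": "5"},
    {"review": "Быстро разряжается", "score": "2"},
]
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    for record in sample:
        json.dump(record, f, ensure_ascii=False)
        f.write("\n")

loaded = load_reviews(path)
print(len(loaded), loaded[0]["score"])
```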

We will use them later to develop a sentiment prediction model.