# Ceneo Scraper

## Components of a single opinion

|Component|Selector|Key|
|---------|--------|--------|
|opinion ID|["data-entry-id"]|opinion_id|
|opinion’s author|span.user-post__author-name|author|
|author’s recommendation|span.user-post__author-recomendation > em|recommendation|
|score expressed in number of stars|span.user-post__score-count|score|
|opinion’s content|div.user-post__text|content|
|list of product advantages|div.review-feature__title--positives ~ div.review-feature__item|pros|
|list of product disadvantages|div.review-feature__title--negatives ~ div.review-feature__item|cons|
|number of users who found the opinion helpful|button.vote-yes > span|helpful|
|number of users who found the opinion unhelpful|button.vote-no > span|unhelpful|
|publishing date|span.user-post__published > time:nth-child(1)["datetime"]|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)["datetime"]|purchase_date|

## Loading libraries

In [21]:
import os
import json
import requests  # PyPI package: requests
from bs4 import BeautifulSoup  # PyPI package: beautifulsoup4
from deep_translator import GoogleTranslator

## Structure of a single opinion

In [22]:
# key: [selector, attribute, return_list] (trailing items are optional)
selectors = {
    "opinion_id": [None, "data-entry-id"],
    "author": ["span.user-post__author-name"],
    "recommendation": ["span.user-post__author-recomendation > em"],
    "score": ["span.user-post__score-count"],
    "content": ["div.user-post__text"],
    "pros": ["div.review-feature__title--positives ~ div.review-feature__item", None, True],
    "cons": ["div.review-feature__title--negatives ~ div.review-feature__item", None, True],
    "helpful": ["button.vote-yes > span"],
    "unhelpful": ["button.vote-no > span"],
    "publish_date": ["span.user-post__published > time:nth-child(1)", "datetime"],
    "purchase_date": ["span.user-post__published > time:nth-child(2)", "datetime"],
}
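
Each list in `selectors` is later unpacked with `*value` into the parameters that follow `ancestor` in `extract`. The sketch below illustrates only that unpacking; `describe` is a hypothetical stand-in that mirrors those parameters and does not touch any HTML.

```python
def describe(selector, attribute=None, return_list=False):
    # Hypothetical stand-in with the same parameters as extract() after `ancestor`.
    return {"selector": selector, "attribute": attribute, "return_list": return_list}

sample_selectors = {
    "opinion_id": [None, "data-entry-id"],
    "author": ["span.user-post__author-name"],
    "pros": ["div.review-feature__title--positives ~ div.review-feature__item", None, True],
}

# One-, two-, and three-item lists all unpack cleanly thanks to the defaults.
unpacked = {key: describe(*value) for key, value in sample_selectors.items()}
for key, args in unpacked.items():
    print(key, args)
```

A one-item list sets only the selector, a two-item list adds the attribute to read, and a three-item list also switches on list mode.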

## Transformation functions

In [23]:
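def rate(score):
    # Convert a score such as "4,5/5" to a fraction of the maximum, here 0.9.
    value, scale = score.split("/")
    return float(value.replace(",", ".")) / float(scale)

def recommend(recommendation):
    # "Polecam" means recommended, "Nie polecam" means not recommended.
    return {"Polecam": True, "Nie polecam": False}.get(recommendation)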
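
Because `extract` can return `None` when an element is missing, a score conversion may need a guard. The following is a minimal sketch of one possible guarded variant; the name `rate_or_none` is hypothetical and not part of the notebook.

```python
def rate_or_none(score):
    # Hypothetical guarded variant of rate(): returns None instead of raising.
    if not score:
        return None
    try:
        value, scale = score.split("/")
        return float(value.replace(",", ".")) / float(scale)
    except (ValueError, ZeroDivisionError):
        # Covers strings without exactly one "/", non-numeric parts, and a zero scale.
        return None

print(rate_or_none("4,5/5"))  # 0.9
print(rate_or_none(None))     # None
```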


## Translation

In [24]:
def translate(text, from_lang="pl", to_lang="en"):
    # Return the original text alongside its translation, or None for empty input.
    if not text:
        return None
    translator = GoogleTranslator(source=from_lang, target=to_lang)
    if isinstance(text, list):
        return {from_lang: text, to_lang: [translator.translate(t) for t in text]}
    return {from_lang: text, to_lang: translator.translate(text)}

## Transformations

In [25]:
# translate must already be defined when this dictionary is built.
transformations = {
    "recommendation": recommend,
    "score": rate,
    "helpful": int,
    "unhelpful": int,
    "content": translate,
    "pros": translate,
    "cons": translate,
}
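
To see how a key-to-function mapping like `transformations` reshapes a record without calling the translation service, the sketch below swaps in `fake_translate`, a hypothetical stub that upper-cases text instead of translating it. The sample record is invented for illustration.

```python
def fake_translate(text):
    # Hypothetical offline stand-in for translate(): upper-cases instead of translating.
    if not text:
        return None
    if isinstance(text, list):
        return {"pl": text, "en": [t.upper() for t in text]}
    return {"pl": text, "en": text.upper()}

sample_transformations = {
    "helpful": int,
    "content": fake_translate,
    "pros": fake_translate,
}

# Invented record shaped like the raw output of the extraction step.
record = {"author": "jan", "helpful": "3", "content": "dobry", "pros": []}

# Keys without a transformation (here "author") are left untouched.
for key, func in sample_transformations.items():
    record[key] = func(record[key])

print(record)
```

An empty list of pros collapses to `None`, which matches the `None` lines visible in the scraper's printed output.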

## Function for extracting data from HTML

In [26]:
def extract(ancestor, selector, attribute=None, return_list=False):
    # Return text or an attribute value from the tag(s) matching selector inside ancestor.
    if return_list:
        if attribute:
            return [tag[attribute] for tag in ancestor.select(selector)]
        return [tag.get_text().strip() for tag in ancestor.select(selector)]
    if selector:
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute]
            except TypeError:
                return None
        try:
            return ancestor.select_one(selector).get_text().strip()
        except AttributeError:
            return None
    # No selector: read the attribute from the ancestor tag itself.
    if attribute:
        return ancestor[attribute]
    return None
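
The branches of `extract` correspond to a handful of BeautifulSoup calls. The sketch below runs those calls directly on a small, invented HTML fragment; the fragment only imitates the class names used in the selectors and is not actual Ceneo markup.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration; real pages are far larger.
html = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <span class="user-post__published">
    <time datetime="2024-01-02 10:00:00">2 stycznia</time>
  </span>
  <div class="review-feature__title--positives">Zalety</div>
  <div class="review-feature__item">wydajność</div>
  <div class="review-feature__item">trwałość</div>
</div>
"""
opinion = BeautifulSoup(html, "html.parser").select_one("div.js_product-review")

# Attribute read from the ancestor itself (no selector).
opinion_id = opinion["data-entry-id"]
# Text of a single descendant.
author = opinion.select_one("span.user-post__author-name").get_text().strip()
# Attribute of a single descendant.
published = opinion.select_one("span.user-post__published > time:nth-child(1)")["datetime"]
# Text of every matching sibling (list mode).
pros = [
    tag.get_text().strip()
    for tag in opinion.select("div.review-feature__title--positives ~ div.review-feature__item")
]
# select_one returns None when nothing matches, which is why extract catches exceptions.
missing = opinion.select_one("div.user-post__text")

print(opinion_id, author, published, pros, missing)
```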


## URL of the first page of opinions about a product

In [27]:
#product_id = "44958016"
product_id = input("Please provide Ceneo.pl product code: ")
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"
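
Since the product code comes straight from user input, it may be worth checking it before building the URL. The sketch below assumes product codes are purely numeric, as the sample code `44958016` suggests; the helper name `build_reviews_url` is hypothetical.

```python
def build_reviews_url(product_id):
    # Hypothetical helper; assumes Ceneo product codes consist of digits only.
    product_id = product_id.strip()
    if not product_id.isdigit():
        raise ValueError(f"Unexpected product code: {product_id!r}")
    return f"https://www.ceneo.pl/{product_id}#tab=reviews"

print(build_reviews_url(" 44958016 "))
```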



## Extracting all opinions from HTML code

In [28]:
all_opinions = []
while url:
    print(url)
    response = requests.get(url)
    page_dom = BeautifulSoup(response.text, "html.parser")
    opinions = page_dom.select("div.js_product-review")
    for opinion in opinions:    
        single_opinion = {
            key: extract(opinion, *value)
                for key, value in selectors.items()
        }
        for key, value in transformations.items():
            single_opinion[key] = value(single_opinion[key])
            print(single_opinion[key])
        all_opinions.append(single_opinion)
    # On the last page extract returns None, so the concatenation raises TypeError.
    try:
        url = "https://www.ceneo.pl/" + extract(page_dom, "a.pagination__next", "href")
    except TypeError:
        url = None

https://www.ceneo.pl/44958016#tab=reviews
True
0.9
1
7
{'pl': 'mama będzie na pewno zadowolona jak dla starszej osoby myślę że lepszy i droższy sprzęt nie jest potrzebny co do trwałości urządzenia nie mogę się na obecną chwilę wypowiedzieć', 'en': 'My mother will definitely be happy as it is for an older person, I think that better and more expensive equipment is not necessary, I cannot comment on the durability of the device at the moment.'}
{'pl': ['głośność pracy', 'wydajność'], 'en': ['working volume', 'efficiency']}
None
True
1.0
0
0
{'pl': 'Znam firmę Esperanza ,co do samej maszynki jeszcze nie używałam', 'en': "I know the Esperanza company, but I haven't used the razor itself yet"}
None
None
True
0.8
2
0
{'pl': 'Prosty w użytkowaniu, estetyczny i jakość adekwatna do ceny.', 'en': 'Easy to use, aesthetic and quality adequate to the price.'}
{'pl': ['głośność pracy', 'robi się papka z mięsa zamiast je mielic', 'trwałość', 'wydajność'], 'en': ['working volume', 'the meat becomes a 
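
The loop builds the next page URL by string concatenation and relies on a `TypeError` to stop. A possible alternative, sketched below, uses `urllib.parse.urljoin` with an explicit check for a missing link. The helper name `next_page_url` and the example `href` value are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ceneo.pl/"

def next_page_url(href):
    # Hypothetical helper: href is the value of the "next page" link, or None.
    if not href:
        return None
    # urljoin copes with hrefs both with and without a leading slash.
    return urljoin(BASE_URL, href)

print(next_page_url("/44958016/opinie-2"))  # assumed href format
print(next_page_url(None))
```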

## Saving all opinions to JSON file

In [29]:
os.makedirs("opinions", exist_ok=True)
# The with block closes the file even if json.dump raises.
with open(f"opinions/{product_id}.json", "w", encoding="UTF-8") as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
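
As a quick check of the saving step, the sketch below writes an invented sample to a temporary directory with the same `json.dump` arguments and reads it back. It shows that `ensure_ascii=False` keeps Polish characters readable in the file.

```python
import json
import os
import tempfile

# Invented sample shaped like one scraped opinion.
sample = [{"opinion_id": "1", "pros": {"pl": ["głośność pracy"], "en": ["working volume"]}}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "opinions", "sample.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="UTF-8") as jf:
        json.dump(sample, jf, indent=4, ensure_ascii=False)
    with open(path, encoding="UTF-8") as jf:
        raw = jf.read()

loaded = json.loads(raw)
print("głośność" in raw)   # True: characters are stored unescaped
print(loaded == sample)    # True: the data survives the round trip
```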
