#  Ceneo Scraper

## Components of a single opinion

|Component|Selector|Variable|
|---------|--------|--------|
|opinion ID|["data-entry-id"]|opinion_id|
|opinion's author|span.user-post__author-name|author|
|author's recommendation|span.user-post__author-recomendation > em|recommendation|
|score expressed in number of stars|span.user-post__score-count|score|
|opinion's content|div.user-post__text|content|
|list of the product's advantages|div.review-feature__title--positives ~ div.review-feature__item|pros|
|list of the product's disadvantages|div.review-feature__title--negatives ~ div.review-feature__item|cons|
|number of users who found the opinion helpful|span[id^="votes-yes"]|helpful|
|number of users who found the opinion unhelpful|span[id^="votes-no"]|unhelpful|
|publishing date|span.user-post__published > time:nth-child(1)["datetime"]|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)["datetime"]|purchase_date|


## Loading libraries

In [48]:
import json
import os
import requests
from bs4 import BeautifulSoup

## Function to extract data from HTML code

In [49]:
def extract(ancestor, selector=None, attribute=None, return_list=False):
    """Return text or an attribute value from `ancestor` or its descendants.

    Returns None when a single requested tag or attribute is missing.
    """
    if return_list:
        tags = ancestor.select(selector)
        if attribute:
            return [tag[attribute] for tag in tags]
        return [tag.get_text().strip() for tag in tags]
    if selector:
        tag = ancestor.select_one(selector)
        if tag is None:
            return None
        if attribute:
            # .get returns None instead of raising KeyError for a missing attribute.
            return tag.get(attribute)
        return tag.get_text().strip()
    # No selector: read directly from the ancestor tag itself.
    if attribute:
        return ancestor.get(attribute)
    return ancestor.get_text().strip()
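
To see what the selectors passed to `extract` match, the sketch below runs a few of them directly against a small HTML fragment. The fragment is hypothetical: it only imitates the class names from the table above and is not copied from a real Ceneo page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure of one opinion.
html = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <div class="review-feature">
    <div class="review-feature__title--positives">Zalety</div>
    <div class="review-feature__item">cena</div>
    <div class="review-feature__item">jakość</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
opinion = soup.select_one("div.js_product-review")

# Attribute read from the opinion tag itself (no selector needed).
opinion_id = opinion["data-entry-id"]

# Text of a single descendant, stripped of surrounding whitespace.
author = opinion.select_one("span.user-post__author-name").get_text().strip()

# "~" matches every later sibling of the title, so this yields a list.
pros = [
    tag.get_text().strip()
    for tag in opinion.select(
        "div.review-feature__title--positives ~ div.review-feature__item"
    )
]

# select_one returns None when nothing matches, which is why extract
# has to handle the missing-tag case.
missing = opinion.select_one("span.user-post__score-count")

print(opinion_id, author, pros, missing)
```

The last lookup shows the case that matters most in practice: optional components such as the purchase date are simply absent from some opinions.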

## Transformation functions 

In [46]:
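def rate(score):
    # Convert a score such as "4,5/5" to a fraction between 0 and 1.
    value, maximum = score.split("/")
    return float(value.replace(",", ".")) / float(maximum)

def recommend(recommendation):
    # Map the Polish labels to booleans; any other value becomes None.
    return {"Polecam": True, "Nie polecam": False}.get(recommendation)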
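
As a quick check of what the two transformations produce, here is a minimal sketch with the functions restated compactly so that the block runs on its own. The input strings are made-up examples of the formats the functions expect: a comma decimal separator and the Polish labels.

```python
def rate(score):
    # "4,5/5" -> 4.5 / 5
    value, maximum = score.split("/")
    return float(value.replace(",", ".")) / float(maximum)


def recommend(recommendation):
    # Unknown labels, including None, map to None.
    return {"Polecam": True, "Nie polecam": False}.get(recommendation)


print(rate("4,5/5"))             # 0.9
print(rate("3/5"))               # 0.6
print(recommend("Polecam"))      # True
print(recommend("Nie polecam"))  # False
print(recommend(None))           # None
```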


## Structure of a single opinion

In [45]:
selectors = {
    "opinion_id": [None, "data-entry-id"],
    "author": ["span.user-post__author-name"],
    "recommendation": ["span.user-post__author-recomendation > em"],
    "score": ["span.user-post__score-count"],
    "content": ["div.user-post__text"],
    "pros": ["div.review-feature__title--positives ~ div.review-feature__item", None, True],
    "cons": ["div.review-feature__title--negatives ~ div.review-feature__item", None, True],
    "helpful": ["button.vote-yes > span"],
    "unhelpful": ["button.vote-no > span"],
    "publish_date": ["span.user-post__published > time:nth-child(1)", "datetime"],
    "purchase_date": ["span.user-post__published > time:nth-child(2)", "datetime"],
}

## Transformations

In [47]:
transformations = {
    "recommendation": recommend,
    "score": rate,
    "helpful": int,
    "unhelpful": int,
}
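
A dictionary like this can be applied to a raw opinion key by key. The sketch below is one possible way to do that while leaving missing values untouched, since `int(None)` would raise a `TypeError`. The helper name `apply_transformations` and the sample data are hypothetical and not part of the notebook.

```python
def apply_transformations(opinion, transformations):
    """Return a copy of `opinion` with each transformation applied to non-None values."""
    result = dict(opinion)
    for key, transform in transformations.items():
        if result.get(key) is not None:
            result[key] = transform(result[key])
    return result


# Hypothetical raw values: scraped fields arrive as strings or None.
raw = {"author": "Jan", "helpful": "3", "unhelpful": None}
sample_transformations = {"helpful": int, "unhelpful": int}

clean = apply_transformations(raw, sample_transformations)
print(clean)  # {'author': 'Jan', 'helpful': 3, 'unhelpful': None}
```

Keys without an entry in the dictionary, such as `author` here, pass through unchanged.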

## URL address of the first page of opinions about the product

In [36]:
product_id = "40279876"
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"



## Extracting all opinions about the product from HTML code

In [None]:
all_opinions = []
while url:
    print(url)
    response = requests.get(url)
    page_dom = BeautifulSoup(response.text, "html.parser")
    opinions = page_dom.select("div.js_product-review")
    for opinion in opinions:
        single_opinion = {
            key: extract(opinion, *value)
            for key, value in selectors.items()
        }
        # Skip missing values so that int(None) or rate(None) cannot raise.
        for key, transform in transformations.items():
            if single_opinion[key] is not None:
                single_opinion[key] = transform(single_opinion[key])
        all_opinions.append(single_opinion)
    # extract returns None when there is no "next" link, which ends the loop.
    next_href = extract(page_dom, "a.pagination__next", "href")
    url = "https://www.ceneo.pl" + next_href if next_href else None
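
The loop moves to the next page by turning the relative `href` of the pagination link into an absolute URL, and it stops when there is no such link. A minimal sketch of that step using the standard library's `urllib.parse.urljoin` follows. The helper name `next_page_url` and the sample `href` are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ceneo.pl"


def next_page_url(href, base=BASE_URL):
    """Return the absolute URL of the next page, or None when there is none."""
    if not href:
        return None
    return urljoin(base, href)


# Hypothetical relative link, as it might appear in a pagination anchor.
print(next_page_url("/40279876/opinie-2"))  # https://www.ceneo.pl/40279876/opinie-2
print(next_page_url(None))                  # None
```

Unlike plain string concatenation, `urljoin` also copes with an `href` that is already absolute.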




## Saving all opinions to a JSON file


In [39]:
os.makedirs("opinions", exist_ok=True)
# The context manager closes the file even if json.dump raises.
with open(f"opinions/{product_id}.json", "w", encoding="UTF-8") as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
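
To confirm that a saved file can be read back, and that `ensure_ascii=False` keeps Polish characters readable instead of escaping them, the sketch below does a round trip in a temporary directory. The sample record is made up and stands in for `all_opinions`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data standing in for all_opinions.
sample_opinions = [{"opinion_id": "123", "author": "Jan", "content": "Świetna jakość"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "opinions" / "40279876.json"
    path.parent.mkdir(parents=True, exist_ok=True)

    with path.open("w", encoding="UTF-8") as jf:
        json.dump(sample_opinions, jf, indent=4, ensure_ascii=False)

    raw_text = path.read_text(encoding="UTF-8")
    with path.open(encoding="UTF-8") as jf:
        loaded = json.load(jf)

print("Świetna" in raw_text)      # True: the text is stored unescaped
print(loaded == sample_opinions)  # True: the round trip preserves the data
```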
