## Ceneo Scraper

## Structure of a single opinion

|Component|Selector|Variable|
|---------|--------|--------|
|opinion ID|['data-entry-id']|opinion_id|
|author|.user-post__author-name|author|
|recommendation|.user-post__author-recomendation|recommendation|
|stars|.user-post__score-count|stars|
|content|.user-post__text|content|
|list of pros|.review-feature__title--positives ~ .review-feature__item|pros|
|list of cons|.review-feature__title--negatives ~ .review-feature__item|cons|
|number of users who found it helpful|button.vote-yes > span|helpful|
|number of users who found it unhelpful|button.vote-no > span|unhelpful|
|publication date|span.user-post__published > time:nth-child(1)['datetime']|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)['datetime']|purchase_date|
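
To illustrate how the selectors in the table behave, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption made up for illustration: it only mimics the class names from the table and is not a copy of Ceneo's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that only mimics the class names from the table.
SAMPLE_HTML = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <div class="review-feature__col">
    <div class="review-feature__title review-feature__title--positives">Pros</div>
    <div class="review-feature__item">battery</div>
    <div class="review-feature__item">screen</div>
  </div>
  <div class="review-feature__col">
    <div class="review-feature__title review-feature__title--negatives">Cons</div>
    <div class="review-feature__item">price</div>
  </div>
</div>
"""

opinion = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("div.js_product-review")

# An attribute is read with item access on the tag itself.
opinion_id = opinion["data-entry-id"]

# select_one returns the first match; .text keeps surrounding whitespace, hence strip().
author = opinion.select_one(".user-post__author-name").text.strip()

# "A ~ B" matches every B that follows A under the same parent,
# which keeps the pros list and the cons list apart.
pros = [tag.text.strip() for tag in opinion.select(
    ".review-feature__title--positives ~ .review-feature__item")]
cons = [tag.text.strip() for tag in opinion.select(
    ".review-feature__title--negatives ~ .review-feature__item")]

print(opinion_id, author, pros, cons)
```

The sibling combinator only works this way when each title and its items share one parent element, as they do in the sample fragment.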

## Loading libraries

In [1]:
import os
import requests
from bs4 import BeautifulSoup
import json


In [2]:
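def extract(ancestor, selector=None, attribute=None, return_list=False):
    """Return text or an attribute value from `ancestor` or one of its descendants.

    With return_list=True every match of `selector` is returned as a list;
    otherwise only the first match is used and None is returned if it is missing.
    """
    if selector:
        if return_list:
            if attribute:
                return [tag[attribute].strip() for tag in ancestor.select(selector)]
            return [tag.text.strip() for tag in ancestor.select(selector)]
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute].strip()
            except (TypeError, KeyError):
                # TypeError: no tag matched; KeyError: the tag lacks the attribute.
                return None
        try:
            return ancestor.select_one(selector).text.strip()
        except AttributeError:
            # select_one returned None because nothing matched.
            return None

    # No selector: read directly from the ancestor tag itself.
    if attribute:
        return ancestor[attribute].strip()
    return ancestor.text.strip()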
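
The `try`/`except` branches above rely on how BeautifulSoup reports a missing element: `select_one` returns `None` when nothing matches, so reading `.text` raises `AttributeError` and indexing raises `TypeError`. The sketch below shows this on a made-up fragment (the HTML and the `.missing` selector are assumptions for illustration), together with a third case: a tag that exists but lacks the requested attribute.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only to trigger the three error cases.
dom = BeautifulSoup('<div><time datetime="2024-01-01">x</time></div>', "html.parser")

missing = dom.select_one(".missing")  # nothing matches, so this is None
errors = []

try:
    missing.text  # attribute access on None
except AttributeError as exc:
    errors.append(type(exc).__name__)

try:
    missing["datetime"]  # indexing None
except TypeError as exc:
    errors.append(type(exc).__name__)

try:
    dom.select_one("time")["title"]  # the tag exists, the attribute does not
except KeyError as exc:
    errors.append(type(exc).__name__)

print(missing, errors)
```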

## Selectors for opinion components

In [3]:
# Each value is a tuple of positional arguments for extract:
# (selector, attribute, return_list); trailing items may be omitted.
selectors = {
    'opinion_id': (None, 'data-entry-id'),
    'author': ('.user-post__author-name',),
    'recommendation': ('.user-post__author-recomendation',),
    'stars': ('.user-post__score-count',),
    'content': ('.user-post__text',),
    'pros': ('.review-feature__title--positives ~ .review-feature__item', None, True),
    'cons': ('.review-feature__title--negatives ~ .review-feature__item', None, True),
    'helpful': ('button.vote-yes > span',),
    'unhelpful': ('button.vote-no > span',),
    'publish_date': ('span.user-post__published > time:nth-child(1)', 'datetime'),
    'purchase_date': ('span.user-post__published > time:nth-child(2)', 'datetime'),
}
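
Each tuple in `selectors` lines up with the positional parameters `(selector, attribute, return_list)` of `extract`, so it can be unpacked with `*`. The sketch below uses a stand-in function, `show_args` (a hypothetical name, not part of the notebook), to show how tuples of different lengths fill those parameters.

```python
def show_args(ancestor, selector=None, attribute=None, return_list=False):
    """Stand-in with the same signature as extract; returns the arguments it received."""
    return {"selector": selector, "attribute": attribute, "return_list": return_list}


examples = {
    "opinion_id": (None, "data-entry-id"),
    "author": (".user-post__author-name",),
    "pros": (".review-feature__title--positives ~ .review-feature__item", None, True),
}

# "tag" is a placeholder for the opinion element passed as `ancestor`.
resolved = {key: show_args("tag", *value) for key, value in examples.items()}

for key, args in resolved.items():
    print(key, args)
```

Parameters that a shorter tuple leaves out keep their defaults, which is why a one-element tuple is enough for plain text fields.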

## Sending a request to the server

In [4]:
# product_id = '138331381'
product_id = '39562616'
url = f'https://www.ceneo.pl/{product_id}#tab=reviews'
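
The `#tab=reviews` suffix is a URL fragment. Fragments are handled on the client side and are not part of the HTTP request, so the server only sees the path. A quick way to inspect the pieces, using `urlsplit` from the standard library:

```python
from urllib.parse import urlsplit

product_id = "39562616"
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"

parts = urlsplit(url)
print(parts.netloc)    # host name
print(parts.path)      # the part that identifies the product page
print(parts.fragment)  # client-side only
```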


## Extracting all opinions from the page's HTML

In [5]:
all_opinions = []
while url:
    response = requests.get(url)
    page_dom = BeautifulSoup(response.text, 'html.parser')
    opinions = page_dom.select("div.js_product-review")

    # Build one dictionary per opinion by applying every selector to it.
    for opinion in opinions:
        single_opinion = {
            key: extract(opinion, *value)
            for key, value in selectors.items()
        }
        all_opinions.append(single_opinion)

    # Follow the "next page" link. On the last page select_one returns None,
    # indexing it raises TypeError, and the loop ends.
    try:
        url = "https://www.ceneo.pl" + page_dom.select_one("a.pagination__next")['href'].strip()
    except TypeError:
        url = None
    print(url)

https://www.ceneo.pl/138331381/opinie-2
None
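
The loop above stops when a page has no `a.pagination__next` link. The same step can be sketched offline. The helper name `next_page_url` and the two HTML fragments are assumptions for illustration, and `urljoin` from the standard library is swapped in for the string concatenation used in the notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.ceneo.pl"


def next_page_url(page_dom, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None on the last page."""
    link = page_dom.select_one("a.pagination__next")
    if link is None or not link.has_attr("href"):
        return None
    # urljoin resolves a relative href against the site root.
    return urljoin(base_url, link["href"].strip())


# Hypothetical fragments: one page with a "next" link and one without.
page_with_next = BeautifulSoup(
    '<a class="pagination__next" href="/138331381/opinie-2">Next</a>', "html.parser")
last_page = BeautifulSoup("<p>no pagination here</p>", "html.parser")

first_result = next_page_url(page_with_next)
last_result = next_page_url(last_page)
print(first_result)
print(last_result)
```

Checking for `None` explicitly avoids using an exception for the normal end-of-pagination case, at the cost of a slightly longer helper.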


## Saving opinions to a JSON file

In [6]:
# Create the output directory if it does not exist yet.
os.makedirs("opinions", exist_ok=True)
with open(f"opinions/{product_id}.json", 'w', encoding="UTF-8") as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
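
To see what `ensure_ascii=False` and `encoding="UTF-8"` change for Polish text, the round trip below writes a made-up opinion to a temporary directory and reads it back. The sample record and the temporary path are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up record with Polish characters, shaped like one scraped opinion.
sample_opinions = [
    {"opinion_id": "123", "author": "Jan", "content": "Świetna jakość", "pros": ["cena"]}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "opinions")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "sample.json")

    with open(path, "w", encoding="UTF-8") as jf:
        json.dump(sample_opinions, jf, indent=4, ensure_ascii=False)

    with open(path, encoding="UTF-8") as jf:
        raw_text = jf.read()

loaded = json.loads(raw_text)

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample_opinions)

print("Świetna" in raw_text)   # characters are stored as written
print("\\u015a" in escaped)    # escaped form of "Ś" in the default output
print(loaded == sample_opinions)
```

Both forms load back to the same data; `ensure_ascii=False` mainly keeps the file readable when opened in a text editor.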