# Ceneo Scraper

## Structure of a single opinion

|Component|Selector|Variable|
|---------|--------|--------|
|opinion ID|["data-entry-id"]|opinion_id|
|author|span.user-post__author-name|author|
|recommendation|span.user-post__author-recomendation|recommendation|
|star rating|span.user-post__score-count|rating|
|content|div.user-post__text|content|
|list of pros|div.review-feature__title--positives ~ div.review-feature__item|pros|
|list of cons|div.review-feature__title--negatives ~ div.review-feature__item|cons|
|number of users who found it helpful|span[id^='votes-yes']|useful|
|number of users who found it unhelpful|span[id^='votes-no']|useless|
|publication date|span.user-post__published > time:nth-child(1)["datetime"]|post_date|
|purchase date|span.user-post__published > time:nth-child(2)["datetime"]|purchase_date|
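
To see how the selectors in the table behave, they can be tried on a small hand-written fragment. The sketch below is only an illustration: the HTML is made up for this example (it reuses the class names from the table but is not a copy of Ceneo's markup), and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample, NOT real Ceneo markup; only the class names come from the table.
sample_html = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__author-name"> Jan </span>
  <div class="review-feature__title--positives">Pros</div>
  <div class="review-feature__item">price</div>
  <div class="review-feature__item">quality</div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
opinion = soup.select_one('div.js_product-review')

# An attribute of the opinion tag itself is read with dictionary-style access.
opinion_id = opinion['data-entry-id']

# select_one() returns the first matching descendant.
author = opinion.select_one('span.user-post__author-name').get_text().strip()

# "~" is the general sibling combinator: every item that follows the title
# inside the same parent element.
pros = [
    tag.get_text().strip()
    for tag in opinion.select('div.review-feature__title--positives ~ div.review-feature__item')
]

# A selector that matches nothing gives None, which is why the lookups need guarding.
missing = opinion.select_one('span.user-post__score-count')

print(opinion_id, author, pros, missing)
```

Because `select_one()` returns `None` for a missing element, calling `.get_text()` on the result directly would raise `AttributeError`.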

## Code

### Libraries

In [1]:
# standard library first, then packages installed with pip, then our own modules
import json
import os

import requests
from bs4 import BeautifulSoup

### Function for extracting data from an HTML page

In [10]:
def extract(ancestor, selector=None, attribute=None, returns_list=False):
    """Return text or an attribute value from a tag or from its descendants."""
    if selector:
        if returns_list:
            # one list element per matching descendant
            if attribute:
                return [tag[attribute].strip() for tag in ancestor.select(selector)]
            return [tag.get_text().strip() for tag in ancestor.select(selector)]

        if attribute:
            try:
                return ancestor.select_one(selector)[attribute].strip()
            except (TypeError, KeyError):
                # no matching tag (None) or the tag has no such attribute
                return None

        try:
            return ancestor.select_one(selector).get_text().strip()
        except AttributeError:
            # no matching tag
            return None

    # no selector: read directly from the ancestor tag
    if attribute:
        return ancestor[attribute].strip()

    return ancestor.get_text().strip()

### Dictionary representing the structure of an opinion

In [12]:
selectors = {
    'opinion_id'        : (None, 'data-entry-id'),
    'author'            : ('span.user-post__author-name',),
    'recommendation'    : ('span.user-post__author-recomendation',),
    'rating'            : ('span.user-post__score-count',),
    'content'           : ('div.user-post__text',),
    'pros'              : ('div.review-feature__title--positives ~ div.review-feature__item', None, True),
    'cons'              : ('div.review-feature__title--negatives ~ div.review-feature__item', None, True),
    'useful'            : ("[id^='votes-yes']",),
    'useless'           : ("[id^='votes-no']",),
    'post_date'         : (".user-post__published > time:nth-child(1)", "datetime"),
    'purchase_date'     : (".user-post__published > time:nth-child(2)", "datetime")
}
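
Each value in the dictionary is a tuple of positional arguments that is later unpacked with `*value`, so a one-element tuple supplies only the selector and the remaining parameters keep their defaults. The sketch below illustrates that mapping with a hypothetical stand-in function, `describe_call`, which has the same parameters as `extract` but only reports what it received; the string `'opinion'` is a placeholder for a real tag.

```python
def describe_call(ancestor, selector=None, attribute=None, returns_list=False):
    """Hypothetical stand-in for extract(): report the arguments it received."""
    return {'selector': selector, 'attribute': attribute, 'returns_list': returns_list}


# A shortened copy of the dictionary, enough to show all three tuple shapes.
sample_selectors = {
    'opinion_id': (None, 'data-entry-id'),
    'author': ('span.user-post__author-name',),
    'pros': ('div.review-feature__title--positives ~ div.review-feature__item', None, True),
}

# *value spreads each tuple over the parameters that follow the first one.
calls = {
    key: describe_call('opinion', *value)
    for key, value in sample_selectors.items()
}

for key, arguments in calls.items():
    print(key, arguments)
```

The trailing comma in `('span.user-post__author-name',)` matters: without it the parentheses would enclose a plain string, and `*value` would unpack it character by character.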

### Link to the first page of opinions about the given product

In [11]:
# example product code: 114700014
product_id = input('Please enter the product code from Ceneo.pl: ')
url = f'https://www.ceneo.pl/{product_id}/opinie-1'
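
Since the product code is pasted straight into the URL, it may be worth checking it first. The following is a minimal sketch, assuming that product codes consist of digits only (as the example code 114700014 suggests); `opinions_url` is a hypothetical helper, not part of the original notebook.

```python
def opinions_url(product_id, page=1):
    """Hypothetical helper: URL of one page of opinions for a numeric product code."""
    product_id = str(product_id).strip()
    # Assumption: a valid product code contains digits only.
    if not product_id.isdigit():
        raise ValueError(f'Product code must contain digits only, got {product_id!r}')
    return f'https://www.ceneo.pl/{product_id}/opinie-{page}'


first_page = opinions_url(' 114700014 ')
print(first_page)
```

Rejecting a malformed code up front avoids sending a request that cannot succeed.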

### Fetching the opinions

In [13]:
all_opinions = []
while url:
    response = requests.get(url)
    page_dom = BeautifulSoup(response.text, 'html.parser')
    opinions = page_dom.select('div.js_product-review')

    for opinion in opinions:
        # build one dictionary per opinion from the selectors dictionary
        single_opinion = {
            key: extract(opinion, *value)
            for key, value in selectors.items()
        }
        all_opinions.append(single_opinion)

    # extract() returns None when there is no "next page" link,
    # so the concatenation raises TypeError and the loop ends
    try:
        url = 'https://www.ceneo.pl/' + extract(page_dom, 'a.pagination__next', 'href')
    except TypeError:
        url = None
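
The loop above builds the next URL by string concatenation and relies on a `TypeError` to stop. An alternative, sketched below, is `urllib.parse.urljoin` from the standard library, which gives the same result whether or not the `href` starts with a slash. The helper name `next_page_url` and the sample `href` values are assumptions made for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.ceneo.pl/'


def next_page_url(href):
    """Hypothetical helper: absolute URL of the next page, or None on the last page."""
    if not href:
        # no "next page" link was found
        return None
    return urljoin(BASE_URL, href)


# Sample href values (assumed shapes, not taken from a real page).
with_slash = next_page_url('/114700014/opinie-2')
without_slash = next_page_url('114700014/opinie-2')
last_page = next_page_url(None)

print(with_slash, without_slash, last_page)
```

With such a helper the `try`/`except` around the concatenation is no longer needed, because the `None` case is handled explicitly.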


### Saving the opinions to a file

In [14]:
# create the output directory if it does not exist yet
os.makedirs('opinions', exist_ok=True)

with open(f'opinions/{product_id}.json', 'w', encoding='UTF-8') as jsonfile:
    json.dump(all_opinions, jsonfile, indent=4, ensure_ascii=False)
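
As a quick check of the saved format, the sketch below writes one made-up opinion with the same `json.dump` settings and reads it back. It uses a temporary directory instead of `opinions/` so that it leaves nothing behind, and the sample data is invented for the illustration.

```python
import json
import tempfile
from pathlib import Path

# Invented sample data with Polish characters, to show the effect of ensure_ascii=False.
sample_opinions = [
    {'opinion_id': '1', 'author': 'Jan', 'content': 'Świetna jakość', 'pros': ['cena'], 'cons': []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'opinions' / 'sample.json'
    target.parent.mkdir(parents=True, exist_ok=True)

    with open(target, 'w', encoding='UTF-8') as jsonfile:
        json.dump(sample_opinions, jsonfile, indent=4, ensure_ascii=False)

    # Raw file content: the Polish letters are stored as-is, not as \uXXXX escapes.
    raw_text = target.read_text(encoding='UTF-8')

    with open(target, encoding='UTF-8') as jsonfile:
        loaded = json.load(jsonfile)

print(raw_text)
```

Passing `encoding='UTF-8'` explicitly on both the write and the read keeps the result independent of the platform's default encoding.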