## Scraping Engine

To run the scraper, execute the script from the command line:

```
$ python3 dprice.py
```
or
```
$ python3 dprice.py [-u URL] [-l LABEL]
```
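The optional arguments in the usage line above could be parsed along these lines. This is only a minimal sketch, not the actual code of `dprice.py`: the short flags `-u` and `-l` come from the usage line, while the long names `--url` and `--label`, the defaults and the help texts are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the two optional flags shown in the usage line."""
    parser = argparse.ArgumentParser(prog="dprice.py")
    # Long option names and help texts are assumptions, not taken from dprice.py.
    parser.add_argument("-u", "--url", default=None,
                        help="Discogs URL with offers (seller inventory or marketplace search)")
    parser.add_argument("-l", "--label", default=None,
                        help="label for this run (its exact use is not shown in this document)")
    return parser


# Parse an explicit argument list instead of sys.argv, so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-u", "https://www.discogs.com/sell/list?page=1", "-l", "house-pl"]
)
print(args.url, args.label)
```

Both flags are optional in this sketch, which matches the first form of the command, where the script is started without any arguments.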

### Input

Look for the best bargains on your country's market or audit any seller: just paste the link.

The engine takes any URL with Discogs offers. It can be a seller's inventory...
<img src='discogs_view4.png'>


...or a marketplace page with selected filters such as shipping country, style, format, currency and many more. With pagination, it can check up to 10k results (Discogs limit).

<img src='discogs_view.png'>

### Scrape method

```
from bs4 import BeautifulSoup
import requests

url_page = "https://www.discogs.com/sell/list?sort=listed%2Cdesc&limit=250&ships_from=Poland&format=Vinyl&style=House&page=1"

soup = BeautifulSoup(requests.get(url_page).content, 'html.parser')

# Each section is written as <tr class="shortcut_navigable "> so we can create a list of listings
list_of_listings_soup = soup.select("tr.shortcut_navigable")
len(list_of_listings_soup)
```

```
250
```

Now `list_of_listings_soup` contains all listings from a single search response page. They are grouped into a list, so we can iterate over it and create objects with the information we want, applying the same template to each successive slice of our soup. Please note that from now on, the next steps operate on a single item: a particular listing offer.

### First object - SearchListObject

The first object is lightweight, as it is scraped directly from a single search response page.

It doesn't scrape much information, just enough to reject unwanted records based on arguments like `condition`, `wants` and `comments`.


```
class SearchListObject:
    def __init__(self, listing_select_soup):
        (...)
```

The first object scrapes the data marked in red below.

<img src='discogs_view2.png'>

### First selection
    
<a id='wants'>  </a>
`wants` is the number of Discogs users who have the record in their "wanted" category. They are potential <b> buyers </b>, so the number reflects <b> demand </b>.
The engine will only look at records with more `wants` than the value given below. This saves time and ensures that the scraper only looks at wanted records, which keeps their resale potential.
 

```
wants = 100
```

`condition` is a Discogs grading standard that the seller must declare for each record. Low quality commonly goes with low prices. As we don't want to see poor quality records, we can reject them at this stage. The conditions left uncommented below will NOT be scraped for release and sales statistics.

```
conditions = (
    # 'Mint (M)',
    # 'Near Mint (NM or M-)',
    # 'Very Good Plus (VG+)',
    'Very Good (VG)',
    'Good Plus (G+)',
    'Good (G)',
    'Fair (F)',
    'Poor (P)',
)
```

Objects that don't fulfill our requirements are not investigated further, while the others are transformed into the second object.
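The first selection described above can be sketched as a single predicate. This is a hedged illustration only: the helper name `passes_first_selection` and the constant names are hypothetical, and the real check in the engine (including any filtering on `comments`) is not shown in this document. The threshold and the rejected conditions are the values used above.

```python
# Values taken from the cells above; the names of the constants are hypothetical.
MIN_WANTS = 100
REJECTED_CONDITIONS = (
    "Very Good (VG)",
    "Good Plus (G+)",
    "Good (G)",
    "Fair (F)",
    "Poor (P)",
)


def passes_first_selection(want, condition,
                           min_wants=MIN_WANTS, rejected=REJECTED_CONDITIONS):
    """Return True when a listing has more wants than the threshold
    and its condition is not on the rejected list."""
    return want > min_wants and condition not in rejected


# Made-up sample listings as (want, condition) pairs.
listings = [
    (250, "Near Mint (NM or M-)"),  # kept: wanted and in good condition
    (250, "Good (G)"),              # rejected: condition is on the rejected list
    (40, "Mint (M)"),               # rejected: not enough wants
]
kept = [item for item in listings if passes_first_selection(*item)]
print(kept)
```

Only listings that pass this cheap check would need the more expensive request to the release page.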

### Second object - DataFrameObject

The second object takes arguments from the first object...

```
class DataFrameObject:
    def __init__(self, lst_id, rls_link, seller, have, want, item_title, condition, comment, query):
        (...)
```

... but also enters the detailed release page for each listing, using the known `id`, to access a wealth of information such as sales statistics, last sold date, ratings and many more. This makes the process a lot more time-consuming, but provides complete information about the records.

<img src='discogs_view3.png'>

The last step is to write the object as a row in a temporary CSV file.
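Appending one row per object could look like the sketch below. It is an assumption-laden illustration rather than the engine's actual code: the helper `append_row` is hypothetical, the column names are borrowed from a few of the `DataFrameObject` arguments, and the sample values are made up.

```python
import csv
import os
import tempfile

# Hypothetical column subset, named after DataFrameObject arguments shown above.
FIELDNAMES = ["lst_id", "seller", "want", "condition"]


def append_row(path, row, fieldnames=FIELDNAMES):
    """Append one listing as a CSV row; write the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Demo in a throwaway directory with made-up sample values.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_csv = os.path.join(tmp_dir, "temp.csv")
    append_row(tmp_csv, {"lst_id": 1, "seller": "seller_a", "want": 250,
                         "condition": "Mint (M)"})
    append_row(tmp_csv, {"lst_id": 2, "seller": "seller_b", "want": 130,
                         "condition": "Near Mint (NM or M-)"})
    with open(tmp_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    print(len(rows))
```

Opening the file in append mode for every row means that an interrupted run still leaves the rows scraped so far on disk.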

## Pagination

The scrape engine uses pagination, so each search runs through successive pages until the offers end or until page 40 (Discogs limit). When pagination ends, the <a href='convert.ipynb'> database convert </a> runs automatically and the temporary file is written out as the required CSV file.
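One way to walk through the pages is to rewrite the `page` query parameter of the pasted URL. The sketch below is an assumption about how this could be done, not the engine's actual implementation: `page_url` and `iter_page_urls` are hypothetical helpers, and only the page limit of 40 comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_PAGE = 40  # page limit stated above


def page_url(url, page):
    """Return the same search URL with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Drop any existing page parameter, keep every other filter unchanged.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def iter_page_urls(url, max_page=MAX_PAGE):
    """Yield the URLs of pages 1..max_page for a given search URL."""
    for page in range(1, max_page + 1):
        yield page_url(url, page)


start = "https://www.discogs.com/sell/list?sort=listed%2Cdesc&limit=250&page=1"
print(page_url(start, 2))
```

A real loop would also stop early, as soon as a page returns no listings. With `limit=250`, as in the example URL, 40 pages cover the 10k results mentioned in the Input section.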

## Summary

Following these steps for a number of searches and records will create a database of good quality records only (and by good quality I mean both the physical and the subjective perspective). This approach also shortens scraping time significantly: experience shows that usually only about <b> 25-30% </b> of the whole search result remains after the first selection. We reject the "trash" and compile a very clear working list of items for <a href='convert.ipynb'> database convert </a> and <a href='show.ipynb'> further analysis </a>.