# Scraping wine data from Vivino

***

To scrape wine data from vivino.com, I started from the original set-up by the user [gugarosa](https://github.com/gugarosa/viviner). However, that scraper did not retrieve all the data my project needs, such as each wine's price, ratings, and year, so I modified the scraping code as explained in this notebook. Please note that you will need the ***utils*** folder from [gugarosa](https://github.com/gugarosa/viviner) and its requirements to extract the data (see [**Set-up**](#1)).

At first, the modified scraping code only worked for the first 89 pages; after that it started again from page 1, which produced many duplicates. To avoid this, I wrote a function that splits the scraping into smaller steps based on price (a country filter is also available in the payload, but commented out), so that each step returns no more than 2000 wines (see [**Wine Scraper Function**](#2)). In addition, the modified code collects only the data needed for the project, which improves scraping speed.
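
As a rough illustration of the splitting idea, the sketch below generates the price bands that are later passed to the scraper one at a time. The helper name `price_bands` and its parameters are hypothetical and not part of the notebook or of the `utils` folder; the 0.99 band width mirrors the `max_price = min_price + 0.99` line in the scraper function.

```python
def price_bands(max_price, step=1, width=0.99):
    """Yield (low, high) price pairs from 0 up to (but excluding) max_price."""
    for low in range(0, max_price, step):
        # Round to avoid floating-point noise in the upper bound
        yield low, round(low + width, 2)


# Hypothetical usage: three bands of width 0.99
bands = list(price_bands(3))
print(bands)
```

Each band can then be scraped with its own request, which keeps the number of matches per query small.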

The process for scraping the wines from Vivino is explained in [**Scraping Process**](#3).

***

## Notebook Contents

1. [**Set-up**](#1)<br>

2. [**Wine Scraper Function**](#2)<br>
    
3. [**Scraping Process**](#3)<br>
        

***

# Set-up <a id="1"></a>

In [None]:
# Import of pandas and constants & requester from utils folder
import pandas as pd
import utils.constants as c
from utils.requester import Requester

In [None]:
# Base URL
BASE_URL = 'https://www.vivino.com/api/'

# Number of records per page
RECORDS_PER_PAGE = 30

In [None]:
# Instantiates a wrapper over the `requests` package
r = Requester(c.BASE_URL)

# Wine Scraper Function <a id="2"></a>

In [None]:
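def wine_scraper(min_price):
    """Scrape the wines priced between min_price and min_price + 0.99.

    Returns a DataFrame with one row per matched vintage.
    """
    # Create an empty dataframe to collect the results
    wine_df = pd.DataFrame()

    # Upper bound of the price band covered by this call
    max_price = min_price + 0.99

    # Query parameters for the request; uncomment a filter to narrow the search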
    payload = {
        #"country_codes[]": country_code,
        # "food_ids[]": 20,
        # "grape_ids[]": 3,
        # "grape_filter": "varietal",
        # "min_rating": 3.7,
        # "order_by": "ratings_average",
        # "order": "desc",
        "price_range_min": min_price,
        "price_range_max": max_price,
        # "region_ids[]": 383,
        # "wine_style_ids[]": 98,
        "wine_type_ids[]": 1, # Red wine
        # "wine_type_ids[]": 2, # White wine
        # "wine_type_ids[]": 3, # Sparkling
        # "wine_type_ids[]": 4,
        # "wine_type_ids[]": 7,
        # "wine_type_ids[]": 24,
    }
    
    # Performs an initial request to get the number of records (wines)
    res = r.get('explore/explore', params=payload)
    n_matches = res.json()['explore_vintage']['records_matched']

    print(f'Number of matches: {n_matches}')

    # Iterates through all pages; ceiling division so a final partial page is not skipped
    n_pages = max(1, (n_matches + c.RECORDS_PER_PAGE - 1) // c.RECORDS_PER_PAGE)
    for i in range(1, n_pages + 1):
        # Creates a list to hold the wine data
        wine_data = []

        # Adds the page to the payload
        payload['page'] = i

        print(f'Page: {payload["page"]}')

        # Performs the request and collects the matches on this page
        res = r.get('explore/explore', params=payload)
        matches = res.json()['explore_vintage']['matches']

        # Iterates over every match
        for match in matches:

            vintage_wine = match['vintage']['wine']
            vintage_statistics = match['vintage']['statistics']

            wine = {
                'wine_id': vintage_wine['id'] if vintage_wine else None,
                'wine_name': vintage_wine['name'] if vintage_wine else None,
                'winery': vintage_wine['winery']['name'] if vintage_wine and vintage_wine['winery'] else None,
                'year': match['vintage']['year'] if match['vintage']['year'] else None,
                'country': vintage_wine['region']['country']['name'] if vintage_wine and vintage_wine['region'] and vintage_wine['region']['country'] else None,
                'region': vintage_wine['region']['name'] if vintage_wine and vintage_wine['region'] else None,
                'avg_rating_wine': vintage_statistics['wine_ratings_average'] if vintage_statistics else None,
                'num_rating_wine': vintage_statistics['wine_ratings_count'] if vintage_statistics else None,
                'avg_rating_wine_year': vintage_statistics['ratings_average'] if vintage_statistics else None,
                'num_rating_wine_year': vintage_statistics['ratings_count'] if vintage_statistics else None,
                'price': match['prices'][0]['amount'] if match['prices'] else None,
                'url': match['prices'][0]['url'] if match['prices'] else None,
                'volume': match['prices'][0]['bottle_type']['volume_ml'] if match['prices'] else None,
                'currency': match['prices'][0]['currency']['code'] if match['prices'] else None,
                'body': vintage_wine['style']['body'] if vintage_wine and vintage_wine['style'] else None,
                'taste_intensity': vintage_wine['taste']['structure']['intensity'] if vintage_wine and vintage_wine['taste'] and vintage_wine['taste']['structure'] else None,
                'taste_tannin': vintage_wine['taste']['structure']['tannin'] if vintage_wine and vintage_wine['taste'] and vintage_wine['taste']['structure'] else None,
                'taste_sweetness': vintage_wine['taste']['structure']['sweetness'] if vintage_wine and vintage_wine['taste'] and vintage_wine['taste']['structure'] else None,
                'taste_acidity': vintage_wine['taste']['structure']['acidity'] if vintage_wine and vintage_wine['taste'] and vintage_wine['taste']['structure'] else None,
                'taste_fizziness': vintage_wine['taste']['structure']['fizziness'] if vintage_wine and vintage_wine['taste'] and vintage_wine['taste']['structure'] else None,
                'grapes': vintage_wine['style']['grapes'] if vintage_wine and vintage_wine['style'] else None,
                'flavor': vintage_wine['taste']['flavor'] if vintage_wine and vintage_wine['taste'] else None,
                'food': vintage_wine['style']['food'] if vintage_wine and vintage_wine['style'] else None,
                'description': vintage_wine['style']['description'] if vintage_wine and vintage_wine['style'] else None
                #'interesting_fact': vintage_wine['style']['interesting_facts'] if vintage_wine and vintage_wine['style'] else None,
                #'varietal_name': vintage_wine['style']['varietal_name'] if vintage_wine and vintage_wine['style'] else None
            }

            wine_data.append(wine)

        # Convert wine_data to DataFrame
        wine_data_df = pd.DataFrame(wine_data)
            
        # Concatenate wine_data_df with wine_df
        wine_df = pd.concat([wine_df, wine_data_df], ignore_index=True)
    
    return wine_df
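
The long conditional chains in the `wine` dictionary all guard against missing levels in the nested response. A minimal sketch of how such lookups could be shortened is shown below. The helper `dig` is hypothetical (it is not part of `utils`), and the sample `match` is made-up data that only imitates the shape of the fields used above. Note one behavioural difference: `dig` also returns the default when a key is absent, whereas the original expressions would raise a `KeyError` in that case.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts/lists; return default on any miss."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, empty list, or a None somewhere along the path
            return default
    return current


# Made-up record imitating the response structure
match = {
    'vintage': {'wine': {'name': 'Example Red', 'region': None}},
    'prices': [],
}

name = dig(match, 'vintage', 'wine', 'name')
country = dig(match, 'vintage', 'wine', 'region', 'country', 'name')
price = dig(match, 'prices', 0, 'amount')
print(name, country, price)
```

With such a helper, an entry like `'country'` could be written as a single `dig(...)` call instead of a chained condition.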


# Scraping Process <a id="3"></a>

In [None]:
# 1. Build the list of minimum prices, from 0 up to the highest price on Vivino (e.g. 2,500)
price_list = list(range(0,2500,1))

In [None]:
# 2. Scrape wines from Vivino in small price steps and collect the results in a dataframe
df_list = []
for min_price in price_list:
    # Progress as a percentage of the highest minimum price
    status = min_price / price_list[-1] * 100
    print(f'Scraping status: {status:.1f}%')
    # wine_scraper takes only the minimum price of the band
    df = wine_scraper(min_price)
    df_list.append(df)

df_raw = pd.concat(df_list, ignore_index=True)


In [None]:
# 3. Check the number of rows and columns
df_raw.shape
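
Since duplicates were the original motivation for splitting the scrape, it may be worth checking for them before saving. The sketch below uses a small toy frame in place of `df_raw`. Treating `wine_id` plus `year` as the identity of a row is an assumption, not something the notebook states. The `subset` argument is used because columns such as `grapes` may hold lists, which `drop_duplicates` cannot hash when comparing whole rows.

```python
import pandas as pd

# Toy stand-in for df_raw; the values are made up
toy = pd.DataFrame({
    'wine_id': [1, 1, 2],
    'year': [2018, 2018, 2019],
    'grapes': [['Merlot'], ['Merlot'], ['Syrah']],  # list-valued column
})

# Compare only the (assumed) key columns, which are hashable
deduped = toy.drop_duplicates(subset=['wine_id', 'year']).reset_index(drop=True)
print(len(toy), len(deduped))
```

The same call could be applied to `df_raw` if that key turns out to fit the data.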

In [None]:
# 4. Save it to CSV
df_raw.to_csv('data_raw.csv')