# Parsing Vivino

## Importing libraries

In [1]:
# Importing libraries
from fake_useragent import UserAgent  # Library for generating fake user agents
from bs4 import BeautifulSoup  # Library for parsing HTML
import pandas as pd  # Library for data manipulation and analysis
import requests  # Library for making HTTP requests
from tqdm import tqdm  # Library for creating progress bars
import numpy as np  # Library for numerical computing
import os  # Library for operating system functions
from sanitize_filename import sanitize  # Library for sanitizing file names
import os.path  # Library for operating system path functions
import yadisk  # Library for working with Yandex Disk API
token = ''  # Yandex Disk API token
y = yadisk.YaDisk(token=token)  # Creating an instance of YaDisk with the provided token

import warnings  # Library for managing warnings
warnings.filterwarnings("ignore")  # Ignoring warning messages
print(y.check_token())  # Checking if the provided Yandex Disk API token is valid

False


## Creating functions

## Extract information from the code of a web page

In [2]:
# This function extracts information from the code of a web page
def find_in_a(a):
    region = str(a[10])[str(a[10]).find('"region":') : str(a[10]).find('","name_en":')]
    country = str(a[10])[str(a[10]).find(',"country":{"code":')+20:str(a[10]).find('","native_name"')]
    grape = str(a[10])[str(a[10]).find('"grapes":[{"id"'):str(a[10]).find('}],"foods":[{')]
    type_of_wine = str(a[10])[str(a[10]).find('","type_id":'):str(a[10]).find(',"vintage_type":')]
    rating = str(a[10])[str(a[10]).find(',"statistics":{"'):str(a[10]).find(',"labels_count":')]
    wine_id = str(a[10])[str(a[10]).find(',"wine":{"id":'):str(a[10]).find(',"wine":{"id":')+21]
    organic_cert = str(a[10])[str(a[10]).find(',"organic_certification_id":')+28:str(a[10]).find(',"organic_certification_id":')+32]
    bio_cert = str(a[10])[str(a[10]).find(',"certified_biodynamic":')+24:str(a[10]).find(',"certified_biodynamic":')+28]
    is_natural = str(a[10])[str(a[10]).find(',"is_natural":')+14:str(a[10]).find(',"region":{"')]

    region = region.replace('"region":{"id":','')
    region = region.replace(',"name":"',' ')
    country = country.replace('","name":"',' ')
    country = country[3:]
    type_of_wine = type_of_wine.replace('","type_id":','')
    wine_id = wine_id.replace(',"wine":{"id":', '')
    rating = rating[-3:]
    rating = rating.replace('":','')
    return region,country,grape,type_of_wine,rating,wine_id,organic_cert,bio_cert,is_natural

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
    
This function appears to extract information from the code of a web page related to wine. It takes a parameter 'a', which is expected to be a list or an object that can be indexed like a list.

The function extracts the following information from the 'a' parameter:

- <code>region</code>: Slices the text between '"region":' and '","name_en":' in 'a[10]' (converted to a string), then strips the key text so that only the ID and name of the region remain, separated by a space.
- <code>country</code>: Slices the text after ',"country":{"code":' up to '","native_name"', joins the code and the name with a space, and then drops the first three characters. Only the country name remains, which assumes a two-letter country code.
- <code>grape</code>: Slices the raw fragment from '"grapes":[{"id"' up to '}],"foods":[{'. No cleaning happens here; the fragment is parsed later by 'get_grapes'.
- <code>type_of_wine</code>: Slices the text between '","type_id":' and ',"vintage_type":' and removes the key text, leaving only the ID of the type of wine.
- <code>rating</code>: Slices the text between ',"statistics":{"' and ',"labels_count":', keeps its last three characters, and removes any leftover '":', leaving only the rating value.
- <code>wine_id</code>: Takes 21 characters starting at ',"wine":{"id":' and removes the key text. The result can still carry a few trailing characters, which 'find_wines' strips afterwards.
- <code>organic_cert</code>: Takes the four characters that follow ',"organic_certification_id":'.
- <code>bio_cert</code>: Takes the four characters that follow ',"certified_biodynamic":'.
- <code>is_natural</code>: Slices the text between ',"is_natural":' and ',"region":{"', which holds the flag for whether the wine is natural or not.
    
    
The extracted values are returned as a tuple in the order: 'region', 'country', 'grape', 'type_of_wine', 'rating', 'wine_id', 'organic_cert', 'bio_cert', 'is_natural'.
    
</div>
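
The slicing pattern that 'find_in_a' repeats for every field can be isolated in a small helper. The following is a minimal sketch and not part of the original notebook: the helper name `slice_between` and the sample string are made up for illustration, and the real page markup may differ.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Made-up fragment that imitates the JSON-like text inside a script tag
sample = '{"vintage":{"wine":{"id":1234567,"name":"Example","type_id":1,"vintage_type":0}}}'

type_id = slice_between(sample, '"type_id":', ',"vintage_type":')
wine_id = slice_between(sample, '"wine":{"id":', ',"name"')
missing = slice_between(sample, '"region":', '","name_en":')
print(type_id, wine_id, missing)
```

Unlike a bare `str.find`, which returns -1 for a missing marker and then produces a misleading slice, this helper reports a missing field as None.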

## Clean cuvee and producer name

In [3]:
# Clean cuvee name and producer name
def clean_names(cuvee_name, producer):
    try:
        cuvee_name = cuvee_name.replace('\n','') # Remove newline characters from cuvee name

    except: pass
    try:
        cuvee_name = cuvee_name.replace('N.V.',' ') # Replace 'N.V.' with a space in cuvee name

    except: pass
    try:
        string = producer
        producer = string[1:-1] # Remove the first and last character from producer name
    except: pass
    
    return cuvee_name, producer

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>

This function appears to clean the names of a cuvee (a type of wine blend) and a wine producer.

The function takes two input parameters: 'cuvee_name' and 'producer', which are the names of the cuvee and the producer, respectively.

The function performs the following steps:

1. It uses a try-except block to remove any newline characters ('\n') from the 'cuvee_name' parameter using the 'replace' method, and stores the cleaned cuvee name back in the 'cuvee_name' variable. If an error occurs during this step, the except block is executed, but no action is taken ('pass').
1. It uses another try-except block to replace the string 'N.V.' with a space (' ') in the 'cuvee_name' parameter using the 'replace' method, and stores the cleaned cuvee name back in the 'cuvee_name' variable. If an error occurs during this step, the except block is executed, but no action is taken ('pass').
1. Inside a third try-except block, it assigns the value of the 'producer' parameter to a new variable 'string'.
1. In the same block, it removes the first and last characters from the producer name using string slicing with indices [1:-1], and stores the result back in the 'producer' variable. If an error occurs during this step (for example, when 'producer' is not a string), the except block is executed, but no action is taken ('pass').
1. Finally, the function returns the cleaned 'cuvee_name' and 'producer' as a tuple.

</div>
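
The same cleaning can be written with explicit type checks instead of bare try-except blocks. This is only a sketch under the assumption that non-string inputs should pass through unchanged; the name `clean_names_sketch` is hypothetical and does not appear in the notebook.

```python
def clean_names_sketch(cuvee_name, producer):
    """Clean names with explicit type checks instead of bare try/except."""
    if isinstance(cuvee_name, str):
        # Remove newlines, then replace 'N.V.' with a space, as clean_names does
        cuvee_name = cuvee_name.replace('\n', '').replace('N.V.', ' ')
    if isinstance(producer, str):
        # Drop the first and last character, as clean_names does
        producer = producer[1:-1]
    return cuvee_name, producer


print(clean_names_sketch('\nExample Cuvee N.V.\n', '\nExample Winery\n'))
print(clean_names_sketch(None, None))
```

The type checks make it visible which inputs are skipped, whereas a bare `except` also hides unrelated errors.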

## Create full name for wines

In [4]:
# Create full name for wine for future bottle photo naming
def full_name(df):
    for i in range(len(df)):
        try:
            if df['nv'][i] == 'NV':
                    df['full_name'][i] = str(df['producer'][i]) + ' ' + str(df['cuvee_name'][i]) + str(df['nv'][i])
            else: df['full_name'][i] = str(df['producer'][i]) + ' ' + str(df['cuvee_name'][i])
        except: pass
    return df['full_name']

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>

This function creates a full name for a wine to be used for future bottle photo naming. It takes a DataFrame ('df') as input.

The function iterates through each row of the DataFrame using a for loop with the index variable 'i'. For each row, it performs the following steps:

1. Inside a try-except block, it checks whether the value in the 'nv' column of the DataFrame at index 'i' is equal to 'NV'. If it is, it concatenates the values in the 'producer', 'cuvee_name', and 'nv' columns at index 'i' using string concatenation ('+'), inserting a space only between the producer and the cuvee name, and stores the result in the 'full_name' column at index 'i'.
1. If the value in the 'nv' column is not equal to 'NV', it concatenates only the 'producer' and 'cuvee_name' values, separated by a space, and stores the result in the 'full_name' column of the DataFrame at index 'i'.
1. If any errors occur during these steps, the except block is executed, but no action is taken ('pass').
1. After processing all rows of the DataFrame, the function returns the 'full_name' column of the DataFrame as the output.

</div>
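
The branching described above can be expressed for a single row as a plain function. This is a minimal sketch: `build_full_name` is a hypothetical helper, and the sample rows are invented for illustration.

```python
def build_full_name(producer, cuvee_name, nv):
    """Mirror the branching in full_name for a single row."""
    name = str(producer) + ' ' + str(cuvee_name)
    if nv == 'NV':
        # No extra space is added here, matching the original concatenation
        name += str(nv)
    return name


# Invented rows standing in for DataFrame records
rows = [
    {'producer': 'Example Winery', 'cuvee_name': 'Brut ', 'nv': 'NV'},
    {'producer': 'Example Winery', 'cuvee_name': 'Reserve 2015', 'nv': None},
]
names = [build_full_name(r['producer'], r['cuvee_name'], r['nv']) for r in rows]
print(names)
```

With pandas, a helper like this could be applied row-wise through `df.apply(..., axis=1)`, which avoids the chained assignments inside the loop.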

## Extract grape names

In [5]:
# Extract grape names from code pieces in dataset cells
def get_grapes(df):
    
    grapes = []
    g = ''
    clean_grape = []
    
    for i in df['grape']:
        try:
            i = i.replace('true', 'True')  # JSON true -> Python True
            i = i.replace('false', 'False')  # JSON false -> Python False
            i = '{' + i + '}]}'
            i = eval(i)

            grapes.append(i)
        except: grapes.append(None)

    for i in range(len(grapes)):
        try:
            for j in range(len(grapes[i]['grapes'])):
                g += grapes[i]['grapes'][j]['name']
                g += ' | , '
            clean_grape.append(g)
            g = ''
        except: clean_grape.append(None)
    for i in range(len(clean_grape)):
        if clean_grape[i] != None:
            clean_grape[i] = clean_grape[i][:-3]
    df['grape'] = clean_grape
    
    return df

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>

This function, 'get_grapes', extracts grape names from code pieces in dataset cells. It takes a DataFrame ('df') as input.

The function initializes an empty list 'grapes' to store the parsed grape dictionaries, an empty string 'g' to accumulate the grape names of one wine, and another empty list 'clean_grape' to store the cleaned grape names.

The function then iterates through each value in the 'grape' column of the DataFrame using a for loop and the index variable 'i'. For each value, it performs the following steps:

1. It rewrites the JSON boolean literals 'true' and 'false' in 'i' as Python literals so that the string can be evaluated. It also adds '{' at the beginning and '}]}' at the end of 'i' to make it a valid dictionary string.
1. It uses the 'eval()' function to convert the string 'i' into a dictionary object, and appends it to the 'grapes' list.
1. If any errors occur during these steps, the except block is executed, and None is appended to the 'grapes' list.
    
After processing all values in the 'grape' column, the function then iterates through each element in the 'grapes' list using a for loop and the index variable 'i'. For each element, it performs the following steps:

1. It uses another for loop with the index variable 'j' to iterate through each grape name in the 'grapes' dictionary at index 'i', and appends it to the 'g' string followed by ' | , ' as a separator.
1. After processing all grape names, it appends the 'g' string to the 'clean_grape' list.
1. It resets the 'g' string to an empty string for the next iteration.
    
The function then iterates through each element in the 'clean_grape' list using a for loop and the index variable 'i'. For each element that is not None, it uses string slicing to remove the last 3 characters (' , ') from the end of the string at index 'i'.

After this loop:

1. It replaces the 'grape' column of the DataFrame with the cleaned grape names from the 'clean_grape' list.
1. Finally, the function returns the updated DataFrame ('df') as the output.

</div>
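
Because the grape fragment is JSON once the closing brackets are restored, it can also be parsed with the standard 'json' module instead of 'eval()', which avoids executing scraped text as code. The sketch below is an alternative and not the notebook's method: the function name `grape_names`, the ' | ' separator, and the sample fragment are assumptions.

```python
import json


def grape_names(fragment, separator=' | '):
    """Parse a '"grapes":[{...},{...' fragment and join the grape names."""
    try:
        # Restore the braces and brackets that the slicing cut off
        data = json.loads('{' + fragment + '}]}')
    except (TypeError, ValueError):
        # Not a string, or not valid JSON after wrapping
        return None
    return separator.join(grape['name'] for grape in data['grapes'])


# Made-up fragment in the shape that find_in_a returns
sample = ('"grapes":[{"id":14,"name":"Pinot Noir","has_detailed_info":true},'
          '{"id":5,"name":"Chardonnay","has_detailed_info":false')
print(grape_names(sample))
print(grape_names(None))
```

JSON literals such as 'true', 'false', and 'null' are handled by the parser, so no string replacement is needed.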

## Extract images on a web page

In [6]:
# This function extracts images on a web page
def get_image(soup):
    img = img_large = img_med = None  # Defaults in case no image is found or the slicing fails
    images = soup.find_all('img')
    for image in images:
        img = image['src']
    string = img
    
    try:
        if string[-11:-8] != '150':
            img_large = string[:-7] + '960.png'
            img_med = string[:-7] + '600.png'
        else: 
            img_large = None
            img_med = None
    except: pass
    return img,img_large,img_med

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
    
This function appears to process images on a web page using the BeautifulSoup library, which is commonly used for web scraping in Python.

The function takes a BeautifulSoup object 'soup' as input, which is expected to represent the HTML structure of a web page.

The function performs the following steps:

1. It uses the 'find_all' method of the BeautifulSoup object to find all 'img' tags in the HTML structure and stores them in the 'images' variable.
1. It iterates through the 'images' list using a for loop and stores the 'src' attribute of each 'img' tag in the 'img' variable. Because 'img' is overwritten on every pass, only the 'src' of the last image on the page is kept. Note that the 'src' attribute typically contains the URL or file path of the image.
1. It assigns the value of 'img' to the 'string' variable for further processing.
1. It uses a try-except block to handle any potential errors that may occur in the following steps.
1. It checks whether the three characters at positions [-11:-8] of 'string' (the three characters that precede the final eight) are not equal to '150'. If this condition is true, it creates two new variables 'img_large' and 'img_med' by removing the last 7 characters from 'string' and appending '960.png' and '600.png' respectively to the end of the string. This is likely done to generate URLs for larger and medium-sized versions of the image.
1. If the condition in step 5 is false (i.e., those three characters are '150'), it sets 'img_large' and 'img_med' to None.
1. If any errors occur during the above steps, the except block will be executed, but no action is taken (i.e., 'pass').
1. Finally, the function returns the original image URL or file path ('img'), as well as the URLs for the larger and medium-sized versions of the image ('img_large' and 'img_med').

</div>
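
The URL rewriting can be tried in isolation on plain strings. This is a sketch under assumptions: `derive_image_urls` is a hypothetical name, and the two URLs are invented to show each branch, so they may not match the real image addresses.

```python
def derive_image_urls(src):
    """Return (large, medium) URLs derived from an image URL, or (None, None)."""
    if not isinstance(src, str) or len(src) < 11:
        return None, None
    if src[-11:-8] == '150':
        # Same check as get_image: such URLs get no larger variants
        return None, None
    base = src[:-7]  # Drop the last 7 characters (size digits plus '.png')
    return base + '960.png', base + '600.png'


# Invented URLs, one for each branch
print(derive_image_urls('//images.example.com/thumbs/abc_pb_x300.png'))
print(derive_image_urls('//images.example.com/thumbs/abc_150x200.png'))
```

Returning a pair in every branch means the caller never meets an unassigned variable, whatever the input looks like.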

## Saving extracted images

In [7]:
# Parse and save images of height 600 and 960 from URLs in the dataframe
def save_images(df, path):
    for i in range(len(df)):
        try:
            # Select image URLs
            img_med = df['img_med'][i]  # URL for medium-sized image
            img_large = df['img_large'][i]  # URL for large-sized image
            
            # Generate file name
            directory = sanitize(str(df['full_name'][i]))  # Sanitize cuvee name to use as file name
            
            # Set file path for medium and large images
            parent_dir_med = path + ' Medium'  # Parent directory for medium-sized images
            parent_dir_large = path + ' Large'  # Parent directory for large-sized images

            # Set file path with directory name
            path_med = os.path.join(parent_dir_med, directory)  # File path for medium-sized image
            path_large = os.path.join(parent_dir_large, directory)  # File path for large-sized image

            # Set file path with directory name and file extension
            IMG_med = os.path.join(parent_dir_med, directory + '.png')  # Complete file path for medium-sized image
            IMG_large = os.path.join(parent_dir_large, directory + '.png')  # Complete file path for large-sized image

            # Open each file in binary write mode and write the downloaded image content;
            # the with-blocks close the files even if a download fails
            with open(IMG_med, "wb") as file_med:  # File object for medium-sized image
                file_med.write(requests.get('http:' + str(img_med)).content)  # Save medium-sized image
            with open(IMG_large, "wb") as file_large:  # File object for large-sized image
                file_large.write(requests.get('http:' + str(img_large)).content)  # Save large-sized image
        except:
            pass  # Skip to the next iteration if any error occurs

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>

This function, 'save_images', takes a DataFrame ('df') and a file path ('path') as inputs. It parses and saves images of height 600 and 960 from URLs in the DataFrame.

The function iterates through each row in the DataFrame using a for loop and the index variable 'i'. For each row, it performs the following steps:

1. It retrieves the URLs for the medium-sized and large-sized images from the 'img_med' and 'img_large' columns of the DataFrame, respectively, using the index 'i' and stores them in variables 'img_med' and 'img_large'.
1. It generates a file name for the images by sanitizing the 'full_name' value from the DataFrame using the 'sanitize' function, and stores it in the 'directory' variable.
1. It sets the parent directory paths for the medium-sized and large-sized images by appending ' Medium' and ' Large' to the 'path' input, respectively, and stores them in variables 'parent_dir_med' and 'parent_dir_large'.
1. It creates file paths for the medium-sized and large-sized images by joining the parent directory paths with the directory name, and stores them in variables 'path_med' and 'path_large'.
1. It creates complete file paths for the medium-sized and large-sized images by appending the file extension '.png' to the directory name, and stores them in variables 'IMG_med' and 'IMG_large'.
1. It opens empty files in binary write mode using the file paths created in the previous step, and stores the file objects in variables 'file_med' and 'file_large'.
1. It writes the content of the medium-sized and large-sized images from the URLs to the respective files using the 'requests.get()' function, and saves them.
1. If any errors occur during these steps, the except block is executed, and the function skips to the next iteration using the 'pass' statement.
    
After processing all rows in the DataFrame, the function completes without returning any output. The images are saved in the specified file paths with the sanitized directory name as the file name, and with the file extension '.png'.
</div>
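
'save_images' expects the ' Medium' and ' Large' folders to exist already; if they do not, 'open()' raises an error that the bare except silently hides. The sketch below shows one way to build the target paths and create the folders first. The helper name `image_paths` and the sample file name are hypothetical, and a temporary directory stands in for the real 'path'.

```python
import os
import tempfile


def image_paths(path, file_stem):
    """Build the medium and large target paths and make sure the folders exist."""
    targets = {}
    for size in ('Medium', 'Large'):
        folder = path + ' ' + size  # Same naming as save_images: '<path> Medium'
        os.makedirs(folder, exist_ok=True)  # Create the folder if it is missing
        targets[size] = os.path.join(folder, file_stem + '.png')
    return targets


# A temporary directory stands in for the real target location
base = os.path.join(tempfile.mkdtemp(), 'Bottles')
paths = image_paths(base, 'Example Winery Brut NV')
print(paths['Medium'])
print(paths['Large'])
```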

## MAIN Parsing Function

In [8]:
# Parsing wines, without considering vintage years
def find_wines(start,end):
    # Maximum number of cells in the table
    dif = (end+1-start)
    # Create a dataframe
    df = pd.DataFrame(columns = ['number','nv','vintage_type','wine_id','producer','cuvee_name','country','region',
                             'full_name','grape','type_of_wine','rating','organic_cert','bio_cert','is_natural',
                             'img','img_large','img_med'],index = [i for i in range(dif)])
    # Cell number
    k=0
    # Loop between the first and last id in the range
    for i in tqdm([i for i in range(start,end+1)]):
        try:
            url = f'https://www.vivino.com/w/{i}'
            # To avoid Vivino ban, create a random User-Agent each time
            ua = UserAgent()
            HEADERS = {'User-Agent' : ua.random}
            r = requests.get(url, headers = HEADERS)
            # Copy the page's source code
            soup = BeautifulSoup(r.content)    
            a=[]
            # Find all script tags that store wine information
            for j in soup.find_all('script'):
                a.append(j)
            # Find wine name and producer in the soup, if available
            try:
                cuvee_name = soup.find('span','vintage').text
                producer = soup.find('a','winery').text
            except: pass
            # Call a function to clean names from unnecessary characters
            cuvee_name, producer = clean_names(cuvee_name, producer)
            
            '''Find vintage type
            0 - vintage wine
            1 - non-vintage wine
            2 and beyond - unknown, need to manually check at the end of parsing'''
            
            vintage_type = str(a[10])[str(a[10]).find(',"vintage_type":'):str(a[10]).find(',"vintage_type":')+17]
            vintage_type = vintage_type.replace(',"vintage_type":','')
            
            # Call functions to find all wine information and image links
            region,country,grape,type_of_wine,rating,wine_id,organic_cert,bio_cert,is_natural = find_in_a(a)
            img,img_large,img_med = get_image(soup)
            
            # Write found information to the dataframe
            df['wine_id'][k], df['producer'][k], df['cuvee_name'][k], df['country'][k]  = wine_id, producer, cuvee_name, country
            df['region'][k], df['grape'][k], df['type_of_wine'][k], df['rating'][k]  = region, grape, type_of_wine, rating
            df['organic_cert'][k], df['bio_cert'][k], df['is_natural'][k], df['vintage_type'][k]  = organic_cert, bio_cert, is_natural, int(vintage_type)
            df['number'][k], df['img'][k], df['img_large'][k],df['img_med'][k] = int(i), img, img_large, img_med
            
            # Set values in the NV column according to the vintage type
            if int(vintage_type) == 0:
                df['nv'][k] = None  # Vintage wine: no NV label
            else:
                df['nv'][k] = 'NV'  # Non-vintage wine or unknown vintage type
                
            

        except: pass
        k+=1
    # Clean, convert to different data types, save full wine names
    df = df.drop_duplicates()
    df['vintage_type'] = df['vintage_type'].astype('Int64')
    df['number'] = df['number'].astype('Int64')
    try:
        df['full_name'] = full_name(df)
    except: pass
    
    # Clean grape names
    df = get_grapes(df[df['number'] != 1])
    
    # Remove missing values from producer names
    df['producer'] = df['producer'].replace('',pd.NaT)
    df = df.dropna(subset = ['producer'])
    df = df.reset_index(drop = True)
    
    # Remove missing values from region names
    lst = []
    for i in range(len(df)):
        try:    
            if df['region'][i][:1] == '"': lst.append(i)
        except: pass
    df = df.drop(index = lst)
    df = df.reset_index(drop = True)
    
    # Remove unnecessary characters from wine_id
    for i in range(len(df)):
        try:
            df['wine_id'][i] = df['wine_id'][i].replace('a','')
            df['wine_id'][i] = df['wine_id'][i].replace('n','')
            df['wine_id'][i] = df['wine_id'][i].replace('"','')
            df['wine_id'][i] = df['wine_id'][i].replace(',','')
        except: pass
    # Return the cleaned dataset
    return df

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
The function find_wines(start, end) is designed to parse wine information from the website Vivino.com for a given range of wine IDs, without considering the vintage years.

Here is a step-by-step explanation of what the function does:

    
1. It calculates the maximum number of rows in the table as 'end + 1 - start' (the ID range includes both ends), and creates an empty dataframe with columns to store wine information.
2. It loops through the wine IDs in the specified range (from start to end), and for each wine ID:
    
- It sends a request to Vivino.com to fetch the web page corresponding to the wine ID.
- It extracts information such as wine name, producer, vintage type, region, country, grape, type of wine, rating, organic certification, biodynamic certification, and image links from the web page's source code using BeautifulSoup and stores them in appropriate columns of the dataframe.
- It sets the "vintage type" column based on the value extracted from the source code, where 0 represents vintage wine, 1 represents non-vintage wine, and 2 and beyond represent unknown (which will be checked manually later).
- It sets the "nv" column in the dataframe to "NV" if the vintage type is 1, indicating a non-vintage wine, and None for vintage wines.
    
3. After parsing all the wine IDs, it cleans and converts the data types of relevant columns, saves the full wine names, and cleans the grape names.
4. It removes rows with missing values in the "producer" and "region" columns.
5. It removes unnecessary characters from the "wine_id" column.
6. Finally, it returns the cleaned dataset as a dataframe.
    
    
</div>
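
The vintage type lookup and the resulting 'nv' label can be sketched separately from the scraping loop. This version uses a regular expression rather than the fixed-width slice in 'find_wines', so it is an alternative and not the notebook's code; the function names and the sample text are made up.

```python
import re


def extract_vintage_type(script_text):
    """Return the integer after '"vintage_type":', or None if it is absent."""
    match = re.search(r'"vintage_type":(\d+)', script_text)
    return int(match.group(1)) if match else None


def nv_label(vintage_type):
    """Map a vintage type to the value stored in the 'nv' column."""
    if vintage_type is None:
        return None
    # 0 is a vintage wine; every other value is labelled 'NV', as in find_wines
    return None if vintage_type == 0 else 'NV'


# Made-up fragment imitating the script text
sample = '{"id":1,"type_id":3,"vintage_type":1,"is_natural":false}'
vintage_type = extract_vintage_type(sample)
print(vintage_type, nv_label(vintage_type))
```

A missing key yields None here instead of raising an error, so the caller can decide whether to skip the wine.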

## Save DataFrame to Excel

In [9]:
# Function to save DataFrame to Excel
def save_df(df, path, file_name):
    df.to_excel(path+file_name)

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
This function takes a DataFrame (df), a file path (path), and a file name (file_name) as input parameters. It saves the DataFrame to an Excel file at the specified path and with the specified file name.
</div>

## Function to run whole code

In [10]:
# Function for parsing and saving wine data
def parsing(start, end, path, file_name, save_to_YD, token):
    df = find_wines(start,end)
    save_df(df, path, file_name)
    if save_to_YD == True:
        y = yadisk.YaDisk(token=token)
        
        if y.check_token() == True:
            y.upload(path+file_name, f"Vivino_data/{file_name}")
        else: print('The Yandex Disk token is incorrect, please check the token')


<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
This function takes a start ID (start), an end ID (end), a file path (path), a file name (file_name), a boolean flag for saving to Yandex Disk (save_to_YD), and a token for Yandex Disk (token) as input parameters.

    
- It calls the find_wines() function with the specified start and end to retrieve wine data and stores it in a DataFrame (df).
- It calls the save_df() function to save the DataFrame to an Excel file at the specified path and with the specified file name.
- If the save_to_YD flag is set to True, it creates a Yandex Disk object (y) with the provided token.
- It checks if the Yandex Disk token is valid using the check_token() method of the Yandex Disk object.
- If the token is valid, it uploads the Excel file to Yandex Disk in the "Vivino_data" directory with the same file name.
- If the token is invalid, it prints a message indicating that the token is incorrect and asks the user to check it.

</div>

## Conclusion

<div style='border-radius: 15px; box-shadow: 2px 2px 2px; border: 1px solid green; padding: 20px'>
In conclusion, the functions defined in this notebook parse wine data from Vivino and handle the related bottle images. Here is a summary of each function:

- <code>find_in_a</code>: This function takes the list of script tags from a wine page and slices the region, country, grape fragment, wine type, rating, wine ID, certification values, and natural-wine flag out of the JSON-like text.

- <code>clean_names</code>: This function removes newline characters and 'N.V.' from the cuvee name and trims the first and last characters of the producer name.

- <code>full_name</code>: This function takes a DataFrame as input and creates a full name for wines by combining values from the 'producer', 'cuvee_name', and 'nv' columns. If the 'nv' value is 'NV', it is appended to the cuvee name. The resulting full names are returned as a Series.

- <code>get_grapes</code>: This function takes a DataFrame as input and extracts grape names from code pieces in the 'grape' column. It uses string manipulation and evaluation of string expressions to parse the grape names. The extracted grape names are cleaned and stored in the 'grape' column of the DataFrame.

- <code>get_image</code>: This function takes a BeautifulSoup object, reads the 'src' attribute of the 'img' tags, and derives URLs for the 960 and 600 versions of the image.

- <code>save_images</code>: This function takes a DataFrame and a file path as input and parses and saves images of height 600 and 960 from URLs in the DataFrame. It uses the 'img_med' and 'img_large' columns to retrieve the image URLs, and the 'full_name' column to generate file names for the images. The file names are cleaned with the 'sanitize' function imported from the 'sanitize_filename' library, and the images are saved with the file extension '.png'.

- <code>find_wines</code>: This function loops over a range of wine IDs, requests each Vivino page, calls the functions above to collect the data, and returns a cleaned DataFrame.

- <code>save_df</code>: This function saves a DataFrame to an Excel file.

- <code>parsing</code>: This function runs 'find_wines' and 'save_df' and, if requested, uploads the resulting file to Yandex Disk.

These functions can be used together as a wine data parsing pipeline that covers page retrieval, extraction of relevant information, data cleaning, and image handling.

</div>