# 0. Imports

In [1]:
from bs4 import BeautifulSoup

import requests

import pandas as pd
import numpy as np

# add the parent directory to the path so that `src` can be imported
import sys
sys.path.append("..")

# import data extraction support function
from src.support.data_extraction_support_draft import extract_table_from_link, extract_productnames_links, extract_categorynames_links, extract_supermarkets

# 1. Introduction to this notebook

This notebook outlines the logical process of extracting the data for the supermarket product price analysis. The goal is to scrape historical supermarket product prices, broken down by supermarket chain, for three main product categories: milk, olive oil and sunflower oil.

The main source used for this extraction will be [FACUA](https://super.facua.org/). 

# 2. Scraping

## 2.1 Get supermarket URLs to scrape by surface

During an initial exploration of the FACUA main page, a button stands out for every supermarket with available data. The goal is to read the hrefs behind those buttons, if possible, or otherwise to navigate through the buttons to reach each supermarket's individual page.






![surfaces.png](../assets/surfaces.png)

Let's try parsing the main html looking for the hrefs inside those buttons.

In [2]:
link = "https://super.facua.org"

response = requests.get(link)

if response.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

main_soup = BeautifulSoup(response.content, "html.parser")

Successful connection.
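The request, status check and parsing steps above recur for every page visited in this notebook, so they could be wrapped in one helper. A minimal sketch, assuming a hypothetical helper name `get_soup` and hypothetical `session` and `timeout` parameters (none of them are part of the project's support module):

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def get_soup(link: str, session=None, timeout: float = 10) -> Optional[BeautifulSoup]:
    """Fetch `link` and return the parsed HTML, or None if the status is not 200.

    `session` is a hypothetical hook: any object with a `.get(url, timeout=...)`
    method (for example a `requests.Session`) can be passed in.
    """
    http = session or requests
    response = http.get(link, timeout=timeout)

    if response.status_code != 200:
        print(f"Connection failed ({response.status_code}): {link}")
        return None

    return BeautifulSoup(response.content, "html.parser")
```

With such a helper, each page would be loaded with a single call, e.g. `main_soup = get_soup("https://super.facua.org")`, and a `None` result would signal that the page should be skipped.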


Searching the HTML for the text "Precios en {supermarket}" quickly locates the hrefs, so let's extract them that way.

In [3]:
supermarket_cards = main_soup.findAll("div",{"class":"card h-100"})

print(f"There are {len(supermarket_cards)} supermarket cards.")


There are 6 supermarket cards.


There are as many supermarket cards in the parsed HTML as in the visual exploration of the website. Each card holds the href of its supermarket's page.

In [4]:
supermarket_links = [card.find("a")["href"] for card in supermarket_cards]
supermarket_links

['https://super.facua.org/mercadona/',
 'https://super.facua.org/carrefour/',
 'https://super.facua.org/eroski/',
 'https://super.facua.org/dia/',
 'https://super.facua.org/hipercor/',
 'https://super.facua.org/alcampo/']
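Each link ends in a slug that identifies the supermarket, which could later serve as the supermarket name attached to the extracted prices. A minimal sketch, assuming the name can be read from the last path segment of the URL (the helper `slug_from_url` is hypothetical and not part of the project's support module):

```python
from urllib.parse import urlparse


def slug_from_url(url: str) -> str:
    """Return the last non-empty path segment of a URL, or '' if there is none."""
    path_segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return path_segments[-1] if path_segments else ""


# Two of the links listed above, used as sample input
sample_links = [
    "https://super.facua.org/mercadona/",
    "https://super.facua.org/carrefour/",
]

supermarket_names = [slug_from_url(link) for link in sample_links]
print(supermarket_names)
```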

Now that the links have been obtained, let's define the process to extract the prices from one supermarket. Then, it will be a matter of replicating it over the remaining 5.

In [5]:
mercadona_link = supermarket_links[0]

response_mercadona = requests.get(mercadona_link)

if response_mercadona.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

mercadona_soup = BeautifulSoup(response_mercadona.content, "html.parser")

Successful connection.


## 2.2 Get category URLs

The category cards are extracted from the supermarket page in the same way as the supermarket cards were extracted from the main page.

In [6]:
product_category_cards = mercadona_soup.findAll("div",{"class":"card h-100"})

print(f"There are {len(product_category_cards)} product cards.\n")

product_category_names = [card.find("p").text.strip() for card in product_category_cards]

product_category_links = [card.find("a")["href"] for card in product_category_cards]

for name, link in zip(product_category_names, product_category_links):
    print(f"Product category: {name}. Link: {link}")

There are 3 product cards.

Product category: Aceite de girasol. Link: https://super.facua.org/mercadona/aceite-de-girasol/
Product category: Aceite de oliva. Link: https://super.facua.org/mercadona/aceite-de-oliva/
Product category: Leche. Link: https://super.facua.org/mercadona/leche/
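Since the same `card h-100` pattern served for both the supermarket and the category pages, the name and link extraction could be wrapped in a single helper. A minimal sketch on a made-up HTML fragment (the helper name `extract_card_names_links` and the fragment are illustrative assumptions, not taken from the project's support module or from the real page):

```python
from typing import List, Tuple

from bs4 import BeautifulSoup

# Made-up fragment: the real page layout may differ
SAMPLE_HTML = """
<div class="card h-100">
  <p> Leche </p>
  <a href="https://super.facua.org/mercadona/leche/">Ver precios</a>
</div>
<div class="card h-100">
  <p>Aceite de oliva</p>
  <a href="https://super.facua.org/mercadona/aceite-de-oliva/">Ver precios</a>
</div>
"""


def extract_card_names_links(soup: BeautifulSoup) -> Tuple[List[str], List[str]]:
    """Return the <p> text and the first href of every 'card h-100' div."""
    cards = soup.find_all("div", {"class": "card h-100"})
    names = [card.find("p").text.strip() for card in cards]
    links = [card.find("a")["href"] for card in cards]
    return names, links


names, links = extract_card_names_links(BeautifulSoup(SAMPLE_HTML, "html.parser"))

for name, link in zip(names, links):
    print(f"{name}: {link}")
```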


Again, as with the supermarket, let's just focus on the first url from the categories to later replicate the example.

In [7]:
first_category_link = product_category_links[0]

response_first_category = requests.get(first_category_link)

if response_first_category.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

first_category_soup = BeautifulSoup(response_first_category.content, "html.parser")

Successful connection.


## 2.3 Get product URLs

Inside the categories, there are again card elements. This time, the card elements hold the urls to each product's information and prices table:

In [8]:
# attrs must be a dict ({"class": "..."}); the last matching row holds the product cards
product_cards = first_category_soup.findAll("div",{"class":"row gx-4 gx-lg-5 row-cols-2 row-cols-md-3 row-cols-xl-4 justify-content-center"})[-1]

product_cards = product_cards.findAll("div",{"class":"card h-100"})

print(f"There are {len(product_cards)} product cards.\n")

product_names = [card.find("p").text.strip() for card in product_cards]

product_links = [card.find("a")["href"] for card in product_cards]

for name, link in zip(product_names, product_links):
    print(f"Product category: {name}. Link: {link}")



There are 2 product cards.

Product category: Aceite De Girasol Refinado 0,2º Hacendado 1 L.. Link: https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-1-l/
Product category: Aceite De Girasol Refinado 0,2º Hacendado 5 L.. Link: https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-5-l/


Once again, let's take just one URL, parse its HTML and establish the pattern.

In [9]:
first_product_link = product_links[0]

response_first_product = requests.get(first_product_link)

if response_first_product.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

first_product_soup = BeautifulSoup(response_first_product.content, "html.parser")

Successful connection.


## 2.4 Scrape prices table inside the product page

Let's look for a table element inside the product page.

In [10]:
tables = first_product_soup.findAll("table")

print(f"There are {len(tables)} tables.\n")

product_price_table = tables[0]
product_price_table

There are 1 tables.



<table class="table table-striped table-responsive text-center" style="width:100%"><thead><tr><th scope="col">Día</th><th scope="col">Precio (€)</th><th scope="col">Variación</th></tr></thead><tbody><tr><td>12/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>13/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>14/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>15/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>16/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>17/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>18/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>19/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>20/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>21/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>22/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>23/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>24/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>25/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>26/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>27/07/2024</td><td>1,45</td>

Perfect. The table is there, and now it's only a matter of getting its header, to use as the column names of the final CSV file, and its body, for the values.

In [11]:
product_table_head = [element.text.strip() for element in product_price_table.find("thead").findAll("th")][:2]
product_table_head

['Día', 'Precio (€)']

In [12]:
product_table_body = [[element.text.strip() for element in row.findAll("td")][:2] for row in product_price_table.find("tbody").findAll("tr")]
product_table_body[:5]

[['12/07/2024', '1,45'],
 ['13/07/2024', '1,45'],
 ['14/07/2024', '1,45'],
 ['15/07/2024', '1,45'],
 ['16/07/2024', '1,45']]
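The scraped values are still strings: the dates come in `dd/mm/yyyy` format and the prices use a decimal comma. A small sketch of how one row could be converted to typed values (the helper `parse_row` is hypothetical; this notebook itself leaves type conversion to the transformation phase):

```python
from datetime import date, datetime
from typing import List, Tuple


def parse_row(row: List[str]) -> Tuple[date, float]:
    """Convert a ['dd/mm/yyyy', 'x,yy'] row into a (date, float) pair."""
    raw_day, raw_price = row
    day = datetime.strptime(raw_day, "%d/%m/%Y").date()
    # Replace the decimal comma with a dot before casting to float
    price = float(raw_price.replace(",", "."))
    return day, price


sample_rows = [["12/07/2024", "1,45"], ["13/07/2024", "1,45"]]
parsed_rows = [parse_row(row) for row in sample_rows]
print(parsed_rows[0])
```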

This is what the DataFrame from extracting the product's history information would look like:

In [13]:
pd.DataFrame(product_table_body)

              0     1
0    12/07/2024  1,45
1    13/07/2024  1,45
2    14/07/2024  1,45
3    15/07/2024  1,45
4    16/07/2024  1,45
..          ...   ...
103  23/10/2024  1,48
104  24/10/2024  1,48
105  25/10/2024  1,48
106  26/10/2024  1,48


And this is the format that should be used to upload the table to a database with psycopg2:

In [14]:
[tuple([row[0], row[1], "supermercado","category"]) for row in product_table_body][:3]

[('12/07/2024', '1,45', 'supermercado', 'category'),
 ('13/07/2024', '1,45', 'supermercado', 'category'),
 ('14/07/2024', '1,45', 'supermercado', 'category')]
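A list of tuples like this maps directly onto a DB-API `executemany` call. The sketch below uses the standard-library `sqlite3` module instead of psycopg2 so that it runs without a database server; with psycopg2 the same call would go through a cursor and use `%s` placeholders instead of `?`. The table name `prices` and its column names are assumptions made for illustration.

```python
import sqlite3

rows = [
    ("12/07/2024", "1,45", "supermercado", "category"),
    ("13/07/2024", "1,45", "supermercado", "category"),
    ("14/07/2024", "1,45", "supermercado", "category"),
]

# In-memory database, used here only as a stand-in for the real target database
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE prices (day TEXT, price TEXT, supermarket TEXT, category TEXT)"
)

# One parameterised INSERT is executed per tuple in `rows`
connection.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", rows)
connection.commit()

row_count = connection.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(f"{row_count} rows inserted.")
```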

If the structure repeats across all products, the whole extraction can follow this pattern.


## 2.5 Integrated extraction of all products


The extraction functions can be found in the script `src/support/data_extraction_support_draft.py`. For clarity, the lowest-level one is reproduced here:

In [15]:
from typing import List
import requests
import pandas as pd
from bs4 import BeautifulSoup

def extract_table_from_link(
    link: str,
    supermarket_name: str,
    category_name: str,
    product_name: str
) -> pd.DataFrame:
    """
    Extracts a table from a web page link and returns it as a pandas DataFrame.

    Parameters:
    ----------
    link : str
        The URL of the web page containing the table to extract.
    supermarket_name : str
        Name of the supermarket associated with the product.
    category_name : str
        Category name for the product.
    product_name : str
        Name of the product.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the extracted table data with columns for 
        product name, category name, and supermarket name.
    """

    #make request to the specified link
    response = requests.get(link)

    # check response and stop early if the request failed
    if response.status_code != 200:
        print(f"Connection failed for {link}.")
        return pd.DataFrame()

    #parse html content
    product_data_soup = BeautifulSoup(response.content, "html.parser")

    #extract table header and body, keeping 2 columns
    table_head_list: List[str] = [element.text.strip() for element in product_data_soup.find("thead").findAll("th")][:2]
    table_body_list: List[List[str]] = [[element.text.strip() for element in row.findAll("td")][:2] for row in product_data_soup.find("tbody").findAll("tr")]

    # convert to dataframe and add column names
    extracted_table_df = pd.DataFrame(table_body_list, columns=table_head_list)
    extracted_table_df[["product_name", "category_name", "supermarket_name"]] = product_name, category_name, supermarket_name

    return extracted_table_df


With the function above, which scrapes a product's information, and the other three, which navigate the website to collect all the URLs, this is what the process to extract every product's prices looks like:

In [16]:
total_result_df = pd.DataFrame()

supermarket_links = extract_supermarkets("https://super.facua.org/")

for supermarket_link in supermarket_links:

    category_links = extract_categorynames_links(supermarket_link)

    for category_link in category_links:

        product_names, product_links = extract_productnames_links(category_link)

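        # zip pairs each product name with its link
        for product_name, product_link in zip(product_names, product_links):

            # extract_table_from_link takes four arguments; passing only two raises
            # "TypeError: missing 2 required positional arguments: 'category_name' and 'product_name'".
            # The supermarket and category names are assumed here to be the last URL path segments.
            supermarket_name = supermarket_link.rstrip("/").split("/")[-1]
            category_name = category_link.rstrip("/").split("/")[-1]
            product_df = extract_table_from_link(product_link, supermarket_name, category_name, product_name)

            total_result_df = pd.concat([total_result_df, product_df])

total_result_df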
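Pairing product names with product links in a loop requires `zip`. Iterating over `product_names, product_links` directly walks over a tuple of the two lists, so each step unpacks one whole list rather than one (name, link) pair. A tiny illustration with made-up values:

```python
product_names = ["Producto A", "Producto B"]
product_links = ["https://example.org/a/", "https://example.org/b/"]

# Without zip: the loop runs once per list and unpacks that list's two items
unpacked_without_zip = [
    (first, second) for first, second in (product_names, product_links)
]

# With zip: the loop runs once per product and yields (name, link) pairs
paired_with_zip = list(zip(product_names, product_links))

print(unpacked_without_zip)
print(paired_with_zip)
```

Without `zip`, the unpacking only succeeds here because each list happens to hold exactly two items; with any other length it raises a `ValueError`.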

# 3. Conclusion of this notebook

This process works, but downloading the whole database, cleaning it and then uploading it in a single pass inside a pipeline means that if it fails at any point along the way, nothing will be uploaded.

An alternative is to extract, transform and load within the same product iteration, which can be slower per product but is safer, since CSV checkpoints are saved along the way. That is what the updated function `get_table_from_product_link()` inside `src/data_etl.py` does; it calls the support functions from the final version of the extraction support module at `src/support/data_extraction_support.py`.
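The per-product checkpoint idea can be sketched as follows. This is not the project's `get_table_from_product_link()`; the function `run_with_checkpoints`, the stand-in extractor `fake_extract_rows`, the example URLs and the CSV column names are all hypothetical, and the standard-library `csv` module is used so that the sketch runs offline.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoints(
    product_links: Dict[str, str],
    extract_rows: Callable[[str], List[List[str]]],
    checkpoint_dir: Path,
) -> List[str]:
    """Process one product at a time, saving a CSV per product.

    Returns the names of the products that failed, so a failure in one
    product does not discard the work already saved for the others.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    failed: List[str] = []

    for product_name, link in product_links.items():
        try:
            rows = extract_rows(link)
        except Exception as error:
            print(f"Skipping {product_name}: {error}")
            failed.append(product_name)
            continue

        checkpoint_path = checkpoint_dir / f"{product_name}.csv"
        with open(checkpoint_path, "w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow(["day", "price"])
            writer.writerows(rows)

    return failed


def fake_extract_rows(link: str) -> List[List[str]]:
    """Stand-in for the real scraper: fails for links containing 'broken'."""
    if "broken" in link:
        raise ValueError("no table found")
    return [["12/07/2024", "1,45"]]


with tempfile.TemporaryDirectory() as temp_dir:
    checkpoint_dir = Path(temp_dir)
    failed_products = run_with_checkpoints(
        {
            "product_a": "https://example.org/a/",
            "product_b": "https://example.org/broken/",
        },
        fake_extract_rows,
        checkpoint_dir,
    )
    saved_files = sorted(path.name for path in checkpoint_dir.glob("*.csv"))

print(saved_files, failed_products)
```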

To continue with the transformation phase, go to `notebooks/2_data_transformation.ipynb`.