# 0. Imports

In [167]:
from bs4 import BeautifulSoup

import requests

import pandas as pd
import numpy as np

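# standard library modules used further down (regex parsing, directory creation)
import os
import re

# make the parent directory importable so that `src` can be found
import sys
sys.path.append("..")

# import data extraction support functions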
from src.support.data_extraction_support import extract_table_from_link, extract_productnames_links, extract_categorynames_links, extract_supermarkets

# 1. Introduction to this notebook

This notebook outlines the logical process for extracting the data used in the supermarket product price analysis. The goal is to scrape historical product prices, broken down by supermarket chain, for three main product categories: milk, olive oil and sunflower oil.

The main source used for this extraction will be [FACUA](https://super.facua.org/). 

#### Get the supermarket URLs to scrape

An initial exploration of the FACUA main page shows one button for every supermarket with available data. The goal is to read the hrefs behind those buttons or, failing that, to navigate with the buttons themselves in order to reach each supermarket's page.

![surfaces.png](../assets/surfaces.png)

Let's try parsing the main html looking for the hrefs inside those buttons.

In [168]:
link = "https://super.facua.org"

response = requests.get(link)

if response.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

main_soup = BeautifulSoup(response.content, "html.parser")
main_soup

Successful connection.



<!DOCTYPE html>

<html lang="es-ES">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="Te ayudamos a comparar y a descubrir cuándo y cuánto suben los productos básicos en los grandes supermercados" name="description"/>
<meta content="super.FACUA.org" name="author"/>
<meta content="iva aceite de oliva" name="Keywords"/>
<meta content="all" name="googlebot"/>
<meta content="index" name="googlebot"/>
<meta content="follow" name="googlebot"/>
<meta content="all" name="robots"/>
<meta content="index" name="robots"/>
<meta content="follow" name="robots"/>
<meta content="index,follow" name="robots"/>
<meta content="es" http-equiv="Content-Language">
<title>FACUA vigila los precios de los alimentos para ti</title>
<!-- Favicon-->
<link href="https://super.facua.org/assets/favicon1.ico" rel="icon" type="image/x-icon"/>
<!-- Bootstrap icons-->
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.5.0/font/b

Searching the HTML for the text "Precios en __" locates the hrefs quickly, so let's extract them that way.

In [169]:
supermarket_cards = main_soup.findAll("div",{"class":"card h-100"})

print(f"There are {len(supermarket_cards)} supermarket cards.")


There are 6 supermarket cards.


The parsed HTML contains as many supermarket cards as the visual exploration of the website showed. Each card holds the href of the corresponding supermarket page.

In [170]:
supermarket_links = [card.find("a")["href"] for card in supermarket_cards]
supermarket_links

['https://super.facua.org/mercadona/',
 'https://super.facua.org/carrefour/',
 'https://super.facua.org/eroski/',
 'https://super.facua.org/dia/',
 'https://super.facua.org/hipercor/',
 'https://super.facua.org/alcampo/']

Now that we have the links, let's define the process to extract the prices from one supermarket. Then it is a matter of repeating it for the remaining 5.

In [171]:
mercadona_link = supermarket_links[0]

response_mercadona = requests.get(mercadona_link)

if response_mercadona.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

mercadona_soup = BeautifulSoup(response_mercadona.content, "html.parser")

Successful connection.
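
Every request in this notebook is followed by the same status check. A minimal sketch of a reusable helper is shown below. `fetch_with_retries`, `retries` and `pause` are hypothetical names that do not exist in the project's support module, and the HTTP call is passed in as `http_get` (for example `requests.get`), so the sketch itself only needs the standard library.

```python
import time

def fetch_with_retries(link, http_get, retries=3, pause=1.0):
    """Return the first response with status 200, or None after `retries` attempts.

    `http_get` is any callable that takes a url and returns an object with a
    `status_code` attribute, e.g. `requests.get` (an assumption of this sketch).
    """
    for attempt in range(1, retries + 1):
        response = http_get(link)
        if response.status_code == 200:
            return response
        print(f"Attempt {attempt} failed for {link} (status {response.status_code}).")
        # wait before the next attempt, to avoid hammering the server
        if attempt != retries:
            time.sleep(pause)
    return None
```

A call such as `fetch_with_retries(mercadona_link, requests.get)` could then replace the repeated request and `if`/`else` blocks, with `None` signalling that the page was not retrieved.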


In [172]:
product_category_cards = mercadona_soup.findAll("div",{"class":"card h-100"})

print(f"There are {len(product_category_cards)} product cards.")

There are 3 product cards.


In [173]:
product_category_names = [card.find("p").text.strip() for card in product_category_cards]

product_category_links = [card.find("a")["href"] for card in product_category_cards]

for name, link in zip(product_category_names, product_category_links):
    print(f"Product category: {name}. Link: {link}")


Product category: Aceite de girasol. Link: https://super.facua.org/mercadona/aceite-de-girasol/
Product category: Aceite de oliva. Link: https://super.facua.org/mercadona/aceite-de-oliva/
Product category: Leche. Link: https://super.facua.org/mercadona/leche/


In [174]:
first_category_link = product_category_links[0]

response_first_category = requests.get(first_category_link)

if response_first_category.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

first_category_soup = BeautifulSoup(response_first_category.content, "html.parser")

Successful connection.


In [175]:
# the attrs argument must be a dict ("class": value); the last matching row holds the products
product_cards = first_category_soup.findAll("div",{"class":"row gx-4 gx-lg-5 row-cols-2 row-cols-md-3 row-cols-xl-4 justify-content-center"})[-1]

product_cards = product_cards.findAll("div",{"class":"card h-100"})

print(f"There are {len(product_cards)} product cards.\n")

product_names = [card.find("p").text.strip() for card in product_cards]

product_links = [card.find("a")["href"] for card in product_cards]

for name, link in zip(product_names, product_links):
    print(f"Product category: {name}. Link: {link}")



There are 2 product cards.

Product category: Aceite De Girasol Refinado 0,2º Hacendado 1 L.. Link: https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-1-l/
Product category: Aceite De Girasol Refinado 0,2º Hacendado 5 L.. Link: https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-5-l/


In [176]:
first_product_link = product_links[0]

response_first_product = requests.get(first_product_link)

if response_first_product.status_code == 200:
    print("Successful connection.")

else:
    print("Connection failed.")

# note: the name first_category_soup is reused here, but from now on it holds the product page
first_category_soup = BeautifulSoup(response_first_product.content, "html.parser")

Successful connection.


In [358]:
first_product_link.split("/")

['https:',
 '',
 'super.facua.org',
 'mercadona',
 'aceite-de-girasol',
 'aceite-de-girasol-refinado-02-hacendado-1-l',
 '']
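
The split above suggests that the URL itself carries the supermarket, the category and the product slug. A small sketch of a helper that reads them with `urllib.parse` follows; `parse_product_url` is a hypothetical name, and the three-segment path is an assumption based on the links seen so far.

```python
from urllib.parse import urlparse

def parse_product_url(link):
    """Split a product url into (supermarket, category, product_slug)."""
    # the path is assumed to look like /supermarket/category/product-slug/
    parts = [part for part in urlparse(link).path.split("/") if part]
    if len(parts) != 3:
        raise ValueError(f"Unexpected url structure: {link}")
    supermarket, category, product_slug = parts
    return supermarket, category, product_slug

print(parse_product_url(
    "https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-1-l/"
))
```

Compared with indexing into `link.split("/")`, this fails loudly when a link does not have the expected shape instead of silently returning the wrong segment.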

In [177]:
tables = first_category_soup.findAll("table")

print(f"There are {len(tables)} tables.\n")

product_price_table = tables[0]
product_price_table

There are 1 tables.



<table class="table table-striped table-responsive text-center" style="width:100%"><thead><tr><th scope="col">Día</th><th scope="col">Precio (€)</th><th scope="col">Variación</th></tr></thead><tbody><tr><td>12/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>13/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>14/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>15/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>16/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>17/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>18/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>19/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>20/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>21/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>22/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>23/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>24/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>25/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>26/07/2024</td><td>1,45</td><td>=</td></tr><tr><td>27/07/2024</td><td>1,45</td>

In [178]:
product_table_head = [element.text.strip() for element in product_price_table.find("thead").findAll("th")][:2]
product_table_head

['Día', 'Precio (€)']

In [179]:
product_table_body = [[element.text.strip() for element in row.findAll("td")][:2] for row in product_price_table.find("tbody").findAll("tr")]
product_table_body[:5]

[['12/07/2024', '1,45'],
 ['13/07/2024', '1,45'],
 ['14/07/2024', '1,45'],
 ['15/07/2024', '1,45'],
 ['16/07/2024', '1,45']]
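
Both columns arrive as text: the dates are day/month/year and the prices use a decimal comma. A minimal sketch of how one row could be converted to typed values is shown below; `clean_price_row` is a hypothetical helper, not part of the support module.

```python
from datetime import datetime

def clean_price_row(row):
    """Convert ['12/07/2024', '1,45'] into (date(2024, 7, 12), 1.45)."""
    day_text, price_text = row
    # dates on the site are written as day/month/year
    day = datetime.strptime(day_text, "%d/%m/%Y").date()
    # prices use a decimal comma
    price = float(price_text.replace(",", "."))
    return day, price

sample_rows = [["12/07/2024", "1,45"], ["13/07/2024", "1,45"]]
print([clean_price_row(row) for row in sample_rows])
```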

In [353]:
pd.DataFrame(product_table_body)

[('12/07/2024', '1,45'),
 ('13/07/2024', '1,45'),
 ('14/07/2024', '1,45'),
 ('15/07/2024', '1,45'),
 ('16/07/2024', '1,45'),
 ('17/07/2024', '1,45'),
 ('18/07/2024', '1,45'),
 ('19/07/2024', '1,45'),
 ('20/07/2024', '1,45'),
 ('21/07/2024', '1,45'),
 ('22/07/2024', '1,45'),
 ('23/07/2024', '1,45'),
 ('24/07/2024', '1,45'),
 ('25/07/2024', '1,45'),
 ('26/07/2024', '1,45'),
 ('27/07/2024', '1,45'),
 ('28/07/2024', '1,45'),
 ('29/07/2024', '1,45'),
 ('30/07/2024', '1,45'),
 ('31/07/2024', '1,45'),
 ('01/08/2024', '1,45'),
 ('02/08/2024', '1,45'),
 ('03/08/2024', '1,45'),
 ('04/08/2024', '1,45'),
 ('05/08/2024', '1,45'),
 ('06/08/2024', '1,45'),
 ('07/08/2024', '1,45'),
 ('08/08/2024', '1,45'),
 ('09/08/2024', '1,45'),
 ('10/08/2024', '1,45'),
 ('11/08/2024', '1,45'),
 ('12/08/2024', '1,45'),
 ('13/08/2024', '1,45'),
 ('14/08/2024', '1,45'),
 ('15/08/2024', '1,45'),
 ('16/08/2024', '1,45'),
 ('17/08/2024', '1,45'),
 ('18/08/2024', '1,45'),
 ('19/08/2024', '1,45'),
 ('20/08/2024', '1,45'),


In [360]:
[tuple([row[0], row[1], "supermercado","category"]) for row in product_table_body]

[('12/07/2024', '1,45', 'supermercado', 'category'),
 ('13/07/2024', '1,45', 'supermercado', 'category'),
 ('14/07/2024', '1,45', 'supermercado', 'category'),
 ('15/07/2024', '1,45', 'supermercado', 'category'),
 ('16/07/2024', '1,45', 'supermercado', 'category'),
 ('17/07/2024', '1,45', 'supermercado', 'category'),
 ('18/07/2024', '1,45', 'supermercado', 'category'),
 ('19/07/2024', '1,45', 'supermercado', 'category'),
 ('20/07/2024', '1,45', 'supermercado', 'category'),
 ('21/07/2024', '1,45', 'supermercado', 'category'),
 ('22/07/2024', '1,45', 'supermercado', 'category'),
 ('23/07/2024', '1,45', 'supermercado', 'category'),
 ('24/07/2024', '1,45', 'supermercado', 'category'),
 ('25/07/2024', '1,45', 'supermercado', 'category'),
 ('26/07/2024', '1,45', 'supermercado', 'category'),
 ('27/07/2024', '1,45', 'supermercado', 'category'),
 ('28/07/2024', '1,45', 'supermercado', 'category'),
 ('29/07/2024', '1,45', 'supermercado', 'category'),
 ('30/07/2024', '1,45', 'supermercado', 'categ

In [None]:
[tuple(row) for row in product_table_body]

If this structure repeats across all products, the whole extraction can follow the same pattern. Let's define the functions from the bottom level up.

#### Extract product table

In [None]:
def extract_table_from_link(link, supermarket_name,category_name, product_name):

    # make request
    response = requests.get(link)

    # check response: stop early if the page could not be retrieved
    if response.status_code != 200:
        print(f"Connection failed for {link}.")
        return pd.DataFrame()

    # parse html
    product_data_soup = BeautifulSoup(response.content, "html.parser")

    # extract table header and body
    table_head_list = [element.text.strip() for element in product_data_soup.find("thead").findAll("th")][:2]

    table_body_list = [[element.text.strip() for element in row.findAll("td")][:2] for row in product_data_soup.find("tbody").findAll("tr")]

    # convert to dataframe and return
    extracted_table_df = pd.DataFrame(table_body_list, columns=table_head_list)
    extracted_table_df[["product_name","category_name","supermarket_name"]] = product_name, category_name, supermarket_name
    

    return extracted_table_df

In [186]:
total_result_df = pd.DataFrame()

supermarket_names, supermarket_links = extract_supermarkets("https://super.facua.org/")

for supermarket_name, supermarket_link in zip(supermarket_names, supermarket_links):

    supermarket_name_repeat, category_names, category_links = extract_categorynames_links(supermarket_link, supermarket_name)

    for supermarket_name, category_name, category_link in zip(supermarket_name_repeat, category_names, category_links):

        supermarket_name_repeat, category_name_repeat, product_names, product_links = extract_productnames_links(category_link, supermarket_name, category_name)

        for supermarket_name, category_name, product_name, product_link in zip(supermarket_name_repeat, category_name_repeat, product_names, product_links):

            product_df = extract_table_from_link(product_link, supermarket_name,category_name, product_name)

            total_result_df = pd.concat([total_result_df,product_df])
    
total_result_df

Successful connection.
https://super.facua.org/mercadona/aceite-de-girasol/ Precios en Mercadona Aceite de girasol
https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-1-l/ Precios en Mercadona Aceite de girasol Aceite De Girasol Refinado 0,2º Hacendado 1 L.
https://super.facua.org/mercadona/aceite-de-girasol/aceite-de-girasol-refinado-02-hacendado-5-l/ Precios en Mercadona Aceite de girasol Aceite De Girasol Refinado 0,2º Hacendado 5 L.
https://super.facua.org/mercadona/aceite-de-oliva/ Precios en Mercadona Aceite de oliva
https://super.facua.org/mercadona/aceite-de-oliva/aceite-de-oliva-04-hacendado/ Precios en Mercadona Aceite de oliva Aceite De Oliva 0,4º Hacendado 1 L.
https://super.facua.org/mercadona/aceite-de-oliva/aceite-de-oliva-1-hacendado-botella-1-l/ Precios en Mercadona Aceite de oliva Aceite De Oliva 1º Hacendado 1 L.
https://super.facua.org/mercadona/aceite-de-oliva/aceite-de-oliva-intenso-hacendado-garrafa-3-l/ Precios en Mercadon

Unnamed: 0,Día,Precio (€),product_name,category_name,supermarket_name
0,12/07/2024,145,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",Aceite de girasol,Precios en Mercadona
1,13/07/2024,145,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",Aceite de girasol,Precios en Mercadona
2,14/07/2024,145,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",Aceite de girasol,Precios en Mercadona
3,15/07/2024,145,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",Aceite de girasol,Precios en Mercadona
4,16/07/2024,145,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",Aceite de girasol,Precios en Mercadona
...,...,...,...,...,...
37,21/10/2024,473,Tierra De Sabor Leche Semidesnatada De Vaca 6 ...,Leche,Precios en Alcampo
38,22/10/2024,473,Tierra De Sabor Leche Semidesnatada De Vaca 6 ...,Leche,Precios en Alcampo
39,23/10/2024,473,Tierra De Sabor Leche Semidesnatada De Vaca 6 ...,Leche,Precios en Alcampo
40,24/10/2024,473,Tierra De Sabor Leche Semidesnatada De Vaca 6 ...,Leche,Precios en Alcampo
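
Calling `pd.concat` inside the innermost loop copies all accumulated rows on every iteration and keeps each table's own index, which would explain why the row labels restart for every product in the output above. A common alternative, sketched here with small stand-in tables instead of real scraped data, is to collect the tables in a list and concatenate once at the end.

```python
import pandas as pd

# stand-ins for the per-product tables returned by the extraction function
frames = [
    pd.DataFrame({"Día": ["12/07/2024", "13/07/2024"], "Precio (€)": ["1,45", "1,45"]}),
    pd.DataFrame({"Día": ["12/07/2024"], "Precio (€)": ["4,73"]}),
]

collected = []
for frame in frames:
    # inside the real loop this would be the table of one product
    collected.append(frame)

# a single concatenation; ignore_index gives one continuous 0..n-1 index
total_result_df = pd.concat(collected, ignore_index=True)
print(total_result_df)
```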


In [187]:
total_result_df.to_csv("../data/extracted/facua_extracted.csv")

In [195]:
df = pd.read_csv("../data/extracted/facua_extracted.csv")
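
By default `to_csv` also writes the index, and reading the file back then produces an extra `Unnamed: 0` column. The sketch below shows the round trip with `index=False`; it uses an in-memory buffer and a one-row stand-in table only to stay self-contained.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"Día": ["12/07/2024"], "Precio (€)": ["1,45"]})

# in-memory buffer instead of a file on disk, only for illustration
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)  # do not write the index as a column
buffer.seek(0)

df_in = pd.read_csv(buffer)
print(list(df_in.columns))
```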

This process works, but it only produces raw data. A finer approach is to extract, clean and load within the same iteration.
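
As a rough sketch of that idea, the loop below cleans and loads each product's rows as soon as they are available. Everything in it is an assumption made for illustration: the in-memory SQLite database, the `prices` table and its columns, the helper names, and the hard-coded rows that stand in for scraped tables.

```python
import sqlite3
from datetime import datetime

def clean_row(row, product_name):
    """Turn ['12/07/2024', '1,45'] into ('2024-07-12', 1.45, product_name)."""
    day = datetime.strptime(row[0], "%d/%m/%Y").date().isoformat()
    price = float(row[1].replace(",", "."))
    return day, price, product_name

def load_rows(connection, rows):
    """Insert cleaned rows into the (hypothetical) prices table."""
    connection.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    connection.commit()

# in-memory database standing in for the real target
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE prices (day TEXT, price REAL, product_name TEXT)")

# stand-in for the rows scraped from each product page
extracted = {
    "Producto A": [["12/07/2024", "1,45"], ["13/07/2024", "1,45"]],
    "Producto B": [["12/07/2024", "4,73"]],
}

for product_name, raw_rows in extracted.items():
    # clean and load each product right after it is extracted
    load_rows(connection, [clean_row(row, product_name) for row in raw_rows])

print(connection.execute("SELECT COUNT(*) FROM prices").fetchone()[0])
```

Handling one product at a time keeps memory use flat and means a failure halfway through the crawl does not lose the products already stored.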

In [203]:
link_list = list()

supermarket_names, supermarket_links = extract_supermarkets("https://super.facua.org/")

for supermarket_name, supermarket_link in zip(supermarket_names, supermarket_links):

    supermarket_name_repeat, category_names, category_links = extract_categorynames_links(supermarket_link, supermarket_name)

    for supermarket_name, category_name, category_link in zip(supermarket_name_repeat, category_names, category_links):

        supermarket_name_repeat, category_name_repeat, product_names, product_links = extract_productnames_links(category_link, supermarket_name, category_name)

        link_list.extend(product_links)

Successful connection.
https://super.facua.org/mercadona/aceite-de-girasol/ Precios en Mercadona Aceite de girasol
https://super.facua.org/mercadona/aceite-de-oliva/ Precios en Mercadona Aceite de oliva
https://super.facua.org/mercadona/leche/ Precios en Mercadona Leche
https://super.facua.org/carrefour/aceite-de-girasol/ Precios en Carrefour Aceite de girasol
https://super.facua.org/carrefour/aceite-de-oliva/ Precios en Carrefour Aceite de oliva
https://super.facua.org/carrefour/leche/ Precios en Carrefour Leche
https://super.facua.org/eroski/aceite-de-girasol/ Precios en Eroski Aceite de girasol
https://super.facua.org/eroski/aceite-de-oliva/ Precios en Eroski Aceite de oliva
https://super.facua.org/eroski/leche/ Precios en Eroski Leche
https://super.facua.org/dia/aceite-de-girasol/ Precios en Dia Aceite de girasol
https://super.facua.org/dia/aceite-de-oliva/ Precios en Dia Aceite de oliva
https://super.facua.org/dia/leche/ Precios en Dia Leche
https://super.facua.org/hipercor/aceite

In [225]:
df_links = pd.DataFrame([link.split("/")[3:-1] for link in link_list])
df_links.columns = ["supermercado","categoria","nombre"]

Inspect the product names by category and supermarket to extract the brand names.

Extract liters and brand.

Mercadona has only 2 brands: Hacendado for almost everything, and Celta for milk.

What names remain after removing the last part?

In [222]:
df = pd.read_csv("../data/extracted/facua_extracted.csv")

In [242]:
df_links["categoria"].unique()

array(['aceite-de-girasol', 'aceite-de-oliva', 'leche'], dtype=object)

In [None]:
"(?:\d\s?x\s?)?\d?(?:,|\.)\d*\s?[A-Za-z]*(?:^|\.|\s)?"

In [302]:
names.str.extract(r"((\d+(?:[.,]\d+)?)\s?(L|litros?|ml|g|gr|cl)?(?:\s?x\s?\d+(?:[.,]\d+)?)?)")[1].unique()

array(['6', '1', '1.5', '9', '1.2', '50', '450', '210', '387', '370',
       '740', '400', '500', '2', '3', '200', '0', '1,5', '12', '800',
       '750', '4', '10', '40', '265', '14', '30', '2,2', '525', '0,0',
       '2.2', '100'], dtype=object)

### EXTRACT liters and quantities - leche

In [351]:
names = df.loc[df["category_name"] == "Leche","product_name"].str.lower().str.replace(" unidades de ", " x ").str.replace(" bricks de ", " x ").str.replace(" uds. x ", " x ").str.replace(" uds. ", " x ").str.replace(" briks de ", " x ")

names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|g|gr|cl|g)|\d+\s?(?:uds\.?|botes|x)\s?\d+(?:[.,]\d+)?\s?(?:l|ml|g|gr|cl|g))")
names.iloc[:,0].unique()

array(['6 l', '1 l', '1.5 l', '9 l', '1.2 l', '250 ml', '450 g', '210 g',
       '387 g', '370 g', '740 g', '400 ml', '500 ml', '800 g', '1200 g',
       '200 ml', '1,5 l', '9 x 1 l', nan, '6 x 200 ml', '750 ml', '525 g',
       '10 x 7,5 g', '2 x 210 g', '2 x 160 g', '265 ml', '4 x 120 g',
       '6 x 100 g', '14 x 100 g', '270 ml', '2,2 l', '50 cl', '400 g',
       '6x200 ml', '3x210 g', '10x7,5 g', '6x188 ml', '3x200 ml',
       '6 x 1 l', '2.2 l', '6 x 1.5 l', '6 x 2.2 l', '3 x 200 ml',
       '6 x 188 ml', '600 g', '1,5 ml', '500 g', '188 ml', '200 cl',
       '2 l', '6x1 l', '6x 1 l', '6 x 1l', '4 x 1.5 l', '6 x 500 ml',
       '1.5l', '1l', '6x 1l'], dtype=object)
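
The chain of `str.replace` calls above rewrites several pack wordings as " x " before the size is extracted. The same normalization could be sketched for a single name with one compiled pattern, as below. `normalize_pack_wording` is a hypothetical helper and the product names in the example are made up.

```python
import re

# wordings that are treated above as "N units of"; the longer "uds. x" comes before "uds."
PACK_WORDING = re.compile(r"\s(?:unidades de|bricks de|briks de|uds\. x|uds\.)\s")

def normalize_pack_wording(product_name):
    """Lowercase the name and rewrite pack wordings as ' x '."""
    return PACK_WORDING.sub(" x ", product_name.lower())

print(normalize_pack_wording("Leche Entera 6 Bricks de 1 L"))
print(normalize_pack_wording("Leche Desnatada 6 uds. x 200 ml"))
```

Escaping the dot in `uds\.` makes the match literal regardless of whether the replacement is applied as a regex or as plain text.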

### EXTRACT liters and quantities - Aceite de girasol

In [328]:
names = df.loc[df["category_name"] == "Aceite de girasol","product_name"].str.lower()
names

0            aceite de girasol refinado 0,2º hacendado 1 l.
1            aceite de girasol refinado 0,2º hacendado 1 l.
2            aceite de girasol refinado 0,2º hacendado 1 l.
3            aceite de girasol refinado 0,2º hacendado 1 l.
4            aceite de girasol refinado 0,2º hacendado 1 l.
                                ...                        
102492    ucasol aceite refinado de girasol  garrafa de ...
102493    ucasol aceite refinado de girasol  garrafa de ...
102494    ucasol aceite refinado de girasol  garrafa de ...
102495    ucasol aceite refinado de girasol  garrafa de ...
102496    ucasol aceite refinado de girasol  garrafa de ...
Name: product_name, Length: 6084, dtype: object

In [350]:
names = df.loc[df["category_name"] == "Aceite de girasol","product_name"].str.lower()

names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))")
names.iloc[:,0].unique()

array(['1 l', '5 l', '3 l', '150 ml', '50 ml', '200 ml'], dtype=object)
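
One detail of this pattern is worth seeing in isolation: alternation is tried from left to right, so `l` wins over `litros?`, while a spelled-out `mililitros` is captured whole and is not a key of the unit conversion dictionaries defined later in this notebook. The sketch below illustrates both cases and a possible variant. `find_volume` and `UNIT_ALIASES` are hypothetical names and the sample product names are made up.

```python
import re

# pattern used above: "l" is listed before "litros?"
original = r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))"
print(re.findall(original, "aceite de girasol 1 litro"))         # ['1 l']
print(re.findall(original, "aceite de girasol 500 mililitros"))  # ['500 mililitros']

# hypothetical variant: longest spellings first, word boundary after the unit
UNIT_ALIASES = {"litro": "l", "litros": "l", "mililitro": "ml", "mililitros": "ml"}
ordered = re.compile(r"(\d+(?:[.,]\d+)?)\s?(mililitros?|litros?|ml|l)\b")

def find_volume(product_name):
    """Return (amount_text, unit) with the unit abbreviated, or None."""
    match = ordered.search(product_name.lower())
    if match is None:
        return None
    amount, unit = match.groups()
    return amount, UNIT_ALIASES.get(unit, unit)

print(find_volume("Aceite de girasol 500 mililitros"))
```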

### EXTRACT liters and quantities - Aceite de oliva

In [None]:
names = df.loc[df["category_name"] == "Aceite de oliva","product_name"].str.lower()
names

In [394]:
names = df.loc[df["category_name"] == "Aceite de girasol","product_name"].str.lower()

names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))")
names.iloc[:,0].unique()

array(['1 l', '5 l', '3 l', '150 ml', '50 ml', '200 ml'], dtype=object)

In [393]:
cadena = df.loc[df["category_name"] == "Aceite de girasol","product_name"].str.lower()[0]

In [396]:
re.findall(r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))", cadena.lower())[0]

'1 l'

In [None]:
re.findall(r"\d\s?(\w{1,2})$", "6x 1 g")

['g']

In [None]:
re.findall(r"(?:\d\s?x\s?)?(\d?\.?\d+)\s?\w{1,2}?", cadena.replace(",","."))

['6',
 '1',
 '1.5',
 '9',
 '1.2',
 '250',
 '450',
 '210',
 '387',
 '370',
 '740',
 '400',
 '500',
 '800',
 '1200',
 '200',
 '1.5',
 '1',
 '200',
 '750',
 '525',
 '10',
 '7.5',
 '210',
 '160',
 '265',
 '120',
 '100',
 '14',
 '100',
 '270',
 '2.2',
 '50',
 '400',
 '200',
 '210',
 '10',
 '7.5',
 '188',
 '200',
 '1',
 '2.2',
 '1.5',
 '2.2',
 '200',
 '188',
 '600',
 '1.5',
 '500',
 '188',
 '200',
 '2',
 '1',
 '1',
 '1',
 '1.5',
 '500',
 '1.5',
 '1',
 '1']

In [488]:
cadena = """'6 l', '1 l', '1.5 l', '9 l', '1.2 l', '250 ml', '450 g', '210 g',
       '387 g', '370 g', '740 g', '400 ml', '500 ml', '800 g', '1200 g',
       '200 ml', '1,5 l', '9 x 1 l', '6 x 200 ml', '750 ml', '525 g',
       '10 x 7,5 g', '2 x 210 g', '2 x 160 g', '265 ml', '4 x 120 g',
       '6 x 100 g', '14 x 100 g', '270 ml', '2,2 l', '50 cl', '400 g',
       '6x200 ml', '3x210 g', '10x7,5 g', '6x188 ml', '3x200 ml',
       '6 x 1 l', '2.2 l', '6 x 1.5 l', '6 x 2.2 l', '3 x 200 ml',
       '6 x 188 ml', '600 g', '1,5 ml', '500 g', '188 ml', '20 cl.',
       '2 l', '6x1 l', '6x 1 l', '6 x 1l', '4 x 1.5 l', '6 x 500 ml',
       '1.5l', '1l', '6x 1l'"""

In [489]:
re.findall(r"(?:\d\s?x\s?)?(\d?\.?\d+)\s?\w{1,2}?", cadena.replace(",","."))[-10:]

['20', '2', '1', '1', '1', '1.5', '500', '1.5', '1', '1']

In [None]:
re.findall(r"(\d+)\s?x", cadena)

['9',
 '6',
 '10',
 '2',
 '2',
 '4',
 '6',
 '14',
 '6',
 '3',
 '10',
 '6',
 '3',
 '6',
 '6',
 '6',
 '3',
 '6',
 '6',
 '6',
 '6',
 '4',
 '6',
 '6']
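
The outputs above show that `(?:\d\s?x\s?)?` only skips a single-digit pack count: for '10x7,5 g' and '14 x 100 g' the count itself comes back as a magnitude. A possible single-pass alternative with named groups is sketched below. `SIZE` and `parse_size` are hypothetical names, and the list of units is an assumption based on the values seen so far.

```python
import re

SIZE = re.compile(
    r"(?:(?P<count>\d+)\s?x\s?)?"         # optional pack count, any number of digits
    r"(?P<magnitude>\d+(?:[.,]\d+)?)"     # size of one item, decimal comma or point
    r"\s?(?P<unit>ml|cl|kg|mg|gr|g|l)\b"  # unit, longer spellings first
)

def parse_size(text):
    """Return (count, magnitude, unit); count is None when there is no multiplier."""
    match = SIZE.search(text.lower())
    if match is None:
        return None
    count = int(match.group("count")) if match.group("count") else None
    magnitude = float(match.group("magnitude").replace(",", "."))
    return count, magnitude, match.group("unit")

for text in ["10x7,5 g", "6 x 200 ml", "1.5l", "20 cl."]:
    print(text, "->", parse_size(text))
```

Reading count, magnitude and unit from one match keeps the three values consistent with each other, which separate `findall` calls cannot guarantee.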

In [534]:
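import re
def extract_quantity_from_product_name(product_name, category_name):
    # size pattern per category slug, as it appears in the product url
    patterns = {
        "aceite-de-oliva" : r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))",
        "aceite-de-girasol": r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))",
        "leche" : r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|g|gr|cl|g)|\d+\s?(?:uds\.?|botes|x)\s?\d+(?:[.,]\d+)?\s?(?:l|ml|g|gr|cl|g))"
    }

    # factors and target units to express every magnitude in grams or liters
    conversions_magnitude = {'g': 1, 'kg': 1000, 'mg': 0.001, 'l': 1, 'ml': 0.001, 'cl': 0.01}
    conversions_unit = {'g': 'g', 'kg': 'g', 'mg': 'g', 'l': 'l', 'ml': 'l', 'cl': 'l'}

    # size expression found in the name, e.g. "6 x 200 ml" or "1,5 l"
    try:
        quantity_magnitude_unit = re.findall(patterns[category_name], product_name.lower())[0]
    except (KeyError, IndexError):
        # unknown category or no size in the name: nothing to extract
        return np.nan, np.nan, np.nan

    # number of packaged items, e.g. the "6" in "6 x 200 ml" (NaN for single items)
    try:
        quantity = re.findall(r"(\d+)\s?x", quantity_magnitude_unit)[0]
    except IndexError:
        quantity = np.nan

    # unit at the end of the expression, e.g. "ml"
    try:
        units = re.findall(r"\d\s?([a-z]+)$", quantity_magnitude_unit)[0]
    except IndexError:
        units = np.nan

    # size of a single item; \d+ before "x" so that "10 x 7,5 g" yields 7.5 and not 10
    try:
        magnitude = re.findall(r"(?:\d+\s?x\s?)?(\d+(?:\.\d+)?)\s?[a-z]+$", quantity_magnitude_unit.replace(",","."))[0]
    except IndexError:
        magnitude = 1

    # express the magnitude in the base unit (g or l)
    magnitude = float(magnitude) * conversions_magnitude.get(units, np.nan)
    units = conversions_unit.get(units, np.nan)

    return quantity, magnitude, units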

In [540]:
def extract_table_from_link(link, product_name):

    # make request
    response = requests.get(link)

    # check response: stop early if the page could not be retrieved
    if response.status_code != 200:
        print(f"Connection failed for {link}.")
        return pd.DataFrame()

    # parse html
    product_data_soup = BeautifulSoup(response.content, "html.parser")

    table = product_data_soup.find("table", {"class":"table table-striped table-responsive text-center"})

    # extract table header and body
    table_head_list = [element.text.strip() for element in table.find("thead").findAll("th")][:2]

    table_body_list = [[element.text.strip() for element in row.findAll("td")][:2] for row in table.find("tbody").findAll("tr")]


    # read category and supermarket from the url (segments 4 and 3 after splitting on "/")

    category_name = link.split("/")[4]

    supermarket_name = link.split("/")[3]

    quantity, magnitude, units = extract_quantity_from_product_name(product_name, category_name)

    table_body_list_filled_tuples = [tuple([row[0], row[1].replace(",","."), product_name, quantity, magnitude, units,
                                              category_name, supermarket_name, link]) for row in table_body_list]

    # TODO: call the function that loads these rows into the database here

    # convert to dataframe and save
    table_head_list.extend(["product_name", "quantity", "magnitude", "units","category_name","supermarket_name","url"])

    extracted_table_df = pd.DataFrame(table_body_list_filled_tuples, columns=table_head_list)

    dir_path = f"../data/extracted/{supermarket_name}/{category_name}"

    # create the target directory if it does not exist yet
    os.makedirs(dir_path, exist_ok=True)

    extracted_table_df.to_csv(f"{dir_path}/{product_name}.csv")

    return extracted_table_df
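
The function above uses `product_name` directly as a file name. Should a name ever contain a slash or another character that is not valid in a path, the save step would fail or write to an unexpected location. A hedged sketch of a sanitizing helper follows; `safe_filename` is a hypothetical name and the sample name is made up.

```python
import re

def safe_filename(product_name):
    """Return a version of the name that is safe to use as a file name."""
    # replace every run of characters that is not a word character, dot or dash
    cleaned = re.sub(r"[^\w.\-]+", "_", product_name.strip())
    # drop leading or trailing separators and dots
    return cleaned.strip("._")

print(safe_filename("Leche 1/2 L."))
```

The CSV path could then be built as `f"{dir_path}/{safe_filename(product_name)}.csv"`, while the original name stays untouched in the `product_name` column.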

In [541]:
extract_table_from_link("https://super.facua.org/mercadona/aceite-de-oliva/aceite-de-oliva-04-hacendado/","Aceite de girasol refinado 0,2º Hacendado 100 ml.")

Unnamed: 0,Día,Precio (€),product_name,quantity,magnitude,units,category_name,supermarket_name,url
0,22/06/2024,8.00,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
1,23/06/2024,8.00,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
2,24/06/2024,8.00,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
3,25/06/2024,8.00,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
4,26/06/2024,8.00,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
...,...,...,...,...,...,...,...,...,...
121,21/10/2024,6.75,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
122,22/10/2024,6.75,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
123,23/10/2024,6.75,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
124,24/10/2024,6.75,"Aceite de girasol refinado 0,2º Hacendado 100 ml.",,0.1,l,aceite-de-oliva,mercadona,https://super.facua.org/mercadona/aceite-de-ol...
