### 0. Imports

In [61]:
# data processing
import pandas as pd
import numpy as np

# regular expressions
import re

# function typing
from typing import Tuple, Union

# standardization of strings
from unidecode import unidecode

# append parent folder to path for easier src imports
import sys
sys.path.append("..")

# import data extraction support function
from src.support.data_extraction_support_draft import extract_table_from_link, extract_productnames_links, extract_categorynames_links, extract_supermarkets

# import data transformation support functions
from src.support.data_transformation_support import extract_quantity_from_product_name, sanitize_filename

# 1. Introduction to this notebook

The purpose of this notebook is to explain the decisions behind the cleaning procedures applied to the data scraped during the extraction phase.

# 2. Data transformation

Scraping the historical data poses no major problem, as the only transformations needed to make the data ready for use are:
- Replacing the decimal comma with a decimal point.
- Converting values to datetime and float data types.
- Removing characters that are incompatible with file paths.
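
A minimal sketch of these three transformations, using only the standard library. The raw values, the day/month/year date format and the helper name `clean_for_path` are assumptions made for illustration; the project's own `sanitize_filename` may behave differently.

```python
import re
from datetime import datetime

# Hypothetical raw values as they might come out of the scraper
raw_price = "1,25"
raw_date = "03/11/2024"  # assumed day/month/year format

# 1. Replace the decimal comma and cast to float
price = float(raw_price.replace(",", "."))

# 2. Parse the date string into a datetime object
date = datetime.strptime(raw_date, "%d/%m/%Y")

# 3. Replace characters that are unsafe in file paths (hypothetical helper)
def clean_for_path(name: str) -> str:
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()

print(price, date.date(), clean_for_path("Leche 1/2 Desnatada: 1 L."))
```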

However, the HTML fields available for extraction do not explicitly provide other information about the products, such as:
- Quantity
- Unit of measure
- Volume/weight
- Brand
- Packaging
- Subcategory of product (Milk for babies, extra quality olive oil, protein milk, with or without lactose)

This information is extremely valuable for the analysis, especially the quantity, unit of measure and volume/weight, because ignoring them can inadvertently inflate price differences (a pack of six is not comparable to a single unit). As noted, this data cannot be extracted from dedicated HTML fields; it is only available inside the product name itself. The extracted product name must therefore be processed to obtain these fields for the database.
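
To illustrate why these fields matter, here is a small sketch that normalizes a price by quantity and magnitude. The prices and the helper name `price_per_unit` are invented for the example and do not come from the scraped data.

```python
def price_per_unit(price: float, quantity: int, magnitude: float) -> float:
    """Price per standard unit (e.g. euros per litre) for a pack."""
    return price / (quantity * magnitude)

# Hypothetical prices: a single 1 l brick versus a pack of 6 x 1 l
single = price_per_unit(price=0.95, quantity=1, magnitude=1.0)
pack = price_per_unit(price=5.40, quantity=6, magnitude=1.0)

print(f"single: {single:.2f} EUR/l, pack: {pack:.2f} EUR/l")
```

Compared on shelf price alone, the pack looks almost six times more expensive; per litre it is actually slightly cheaper.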

An extraction of a base dataframe has been made to enable the exploration necessary for the cleaning process.

## 2.1 Extract quantity, unit of measure and volume/weight

The structure of product names across brands, supermarkets and products is erratic. Quantity, unit of measure and volume or weight are therefore extracted with regular expression patterns.

The cell below replicates the main extraction loop as it currently stands, to collect the product names and categories:

In [36]:
names_list = list()
category_list = list()
supermarket_list = list()

supermarket_links = extract_supermarkets("https://super.facua.org/")

# product categories and supermarkets information are found inside the url
for supermarket_link in supermarket_links:

    category_links = extract_categorynames_links(supermarket_link)

    for category_link in category_links:

        product_names, product_links = extract_productnames_links(category_link)

        names_list.extend(product_names)
        supermarket_names = [product_link.split("/")[3].replace("-","_") for product_link in product_links]
        supermarket_list.extend(supermarket_names)
        category_names = [product_link.split("/")[4].replace("-","_") for product_link in product_links]
        category_list.extend(category_names)


Let's build a dataframe from these lists to test the solutions because, as said above, the wanted information about quantity, units, etc. is contained in the name of the product.

In [37]:
products = pd.DataFrame(zip(names_list,category_list, supermarket_list), columns=["product_name","category","supermarket"])
products.head()

Unnamed: 0,product_name,category,supermarket
0,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",aceite_de_girasol,mercadona
1,"Aceite De Girasol Refinado 0,2º Hacendado 5 L.",aceite_de_girasol,mercadona
2,"Aceite De Oliva 0,4º Hacendado 1 L.",aceite_de_oliva,mercadona
3,Aceite De Oliva 1º Hacendado 1 L.,aceite_de_oliva,mercadona
4,Aceite De Oliva Intenso Hacendado 3 L.,aceite_de_oliva,mercadona


Let's remove accents and inconsistent characters, as would be done during processing:

In [38]:
products["product_name2"] = products["product_name"].apply(sanitize_filename)

### 2.1.1 Visual exploration of different strings

Visual inspection of all the possible patterns for quantity, volume/weight and units:

In [39]:
# # IF YOU WANT TO HAVE THIS OUTPUT DISPLAYED: UNCOMMENT THIS CELL

# for name in products["product_name2"]:
#     print(name)

### 2.1.2 General exploration

Let's gather examples from the visual exploration to start testing the regular expressions:

In [40]:
cadena = """'6 l', '1 l', '1.5 l', '9 l', '1.2 l', '250ml', '450 g', '210 g',
       '387 g', '370 g', '740 g', '400 ml', '500 ml', '800 g', '1200 g',
       '200 ml', '1,5l', '9 x 1l', '6 x 200 ml', '750 ml', '525 g',
       '10 x 7,5 g', '2 x 210 g', '2 x 160 g', '265 ml', '4 x 120 g',
       '6 x 100 g', '14 x 100 g', '270 ml', '2,2 l', '50 cl', '400 g',
       '6x200 ml', '3x210 g', '10x7,5 g', '6x188ml', '3x200 ml',
       '6 x 1 l', '2.2 l', '6 x 1.5 l', '6 x 2.2 l', '3 x 200 ml',
       '6 x 188 ml', '600 g', '1,5 ml', '500 g', '188 ml', '20cl.',
       '2 l', '6x1 l', '6x 1 l', '6 x 1l', '4 x 1.5 l', '6 x 500 ml',
       '1.5l', '1l', '6x 1l'"""

Pattern to get quantity, volume/weight and units all together (restricted here to litre-based units):

In [41]:
re.findall(r"(\d+(?:[.,]\d+)?\s?(?:l|litros?|ml|mililitros?))", cadena.lower())[:10]

['6 l',
 '1 l',
 '1.5 l',
 '9 l',
 '1.2 l',
 '250ml',
 '400 ml',
 '500 ml',
 '200 ml',
 '1,5l']

Pattern to get the units of measure from the previous quantity_volume/weight_units extraction:

In [42]:
re.findall(r"\d\s?(\w{1,2})$", "6x 1 g")

['g']

Pattern to extract the volume/weight from the previous quantity_volume/weight_units extraction:

In [43]:
re.findall(r"(?:\d\s?x\s?)?(\d?\.?\d+)\s?\w{1,2}?", cadena.replace(",","."))[0:10]

['6', '1', '1.5', '9', '1.2', '250', '450', '210', '387', '370']

In [44]:
re.findall(r"(?:\d\s?x\s?)?(\d?\.?\d+)\s?\w{1,2}?", cadena.replace(",","."))[-10:]

['20', '2', '1', '1', '1', '1.5', '500', '1.5', '1', '1']

Pattern to extract the quantity from the previous quantity_volume/weight_units extraction:

In [45]:
re.findall(r"(\d+)\s?x", cadena)

['9',
 '6',
 '10',
 '2',
 '2',
 '4',
 '6',
 '14',
 '6',
 '3',
 '10',
 '6',
 '3',
 '6',
 '6',
 '6',
 '3',
 '6',
 '6',
 '6',
 '6',
 '4',
 '6',
 '6']
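
The three separate patterns above can also be combined into a single pass with named groups. This is only a sketch under assumptions: the pattern, the unit list and the helper name `split_fields` are illustrative and are not the function used in the project.

```python
import re

# One pattern with named groups: optional "N x" multiplier, magnitude, optional unit
PATTERN = re.compile(
    r"(?:(?P<quantity>\d+)\s?x\s?)?"      # optional pack size, e.g. "6 x "
    r"(?P<magnitude>\d+(?:[.,]\d+)?)\s?"  # volume/weight, comma or dot decimal
    r"(?P<unit>ml|cl|kg|l|g)?"            # unit; longer alternatives first
)

def split_fields(text: str):
    """Hypothetical helper: return (quantity, magnitude, unit) or None."""
    match = PATTERN.search(text.lower())
    if match is None:
        return None
    quantity = int(match.group("quantity") or 1)
    magnitude = float(match.group("magnitude").replace(",", "."))
    return quantity, magnitude, match.group("unit")

for sample in ["6 x 200 ml", "10x7,5 g", "1.5l", "50 cl"]:
    print(sample, "->", split_fields(sample))
```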

## 2.2 Extract liters and quantities

### 2.2.1 Extract liters and quantities from 'leche'

Now, let's try it out in the whole names list.

First, replace some common phrases with a common equivalent (" x "), to make regex extraction easier.

In [46]:
names = (products.loc[products["category"] == "leche","product_name"].str.lower().str.replace(" unidades de ", " x ")
            .str.replace(" uds. x ", " x ").str.replace(" uds. ", " x ").str.replace(" briks de ", " x "))
names.head()

13    leche +proteínas desnatada hacendado 6 l.
14        leche desnatada calcio hacendado 6 l.
15               leche desnatada hacendado 1 l.
16             leche desnatada hacendado 1.5 l.
17               leche desnatada hacendado 6 l.
Name: product_name, dtype: object

The unique results from the extraction are:

In [47]:
names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|cl|ml|l|kg|gr|g|cl)|\d+\s?(?:uds\.?|botes|x)\s?\d+(?:[.,]\d+)?\s?(?:|cl|ml|l|g|gr|g))")
names.iloc[:,0].unique()

array(['6 l', '1 l', '1.5 l', '9 l', '1.2 l', '250 ml', '450 g', '210 g',
       '387 g', '370 g', '740 g', '400 ml', '500 ml', '800 gr', '800 g',
       '1200 g', '200 ml', '1,5 l', '9 x 1 ', '1 kg', '6 x 200 ',
       '750 ml', '525 g', '10 x 7,5 ', '2 x 210 ', '2 x 160 ', '265 ml',
       '4 x 120 ', '6 x 100 ', '14 x 100 ', '270 ml', '1 litro',
       '1,5 litros', '2,2 litros', '50 cl', '400 g', '6x200 ', '3x210 ',
       '10x7,5 ', '6x188 ', '3x200 ', '6 x 1 ', '2.2 l', '6 x 1.5 ',
       '6 x 2.2 ', '3 x 200 ', '6 x 188 ', '2,2 l', '600 g', '1,5 ml',
       '500 g', '188 ml', '200 cl', '2 l', '6x1 ', '6x 1 ', '6 x 1',
       '4 x 1.5 ', '6 x 500 ', '1.5l', '1l', '6x 1', '1 x6 '],
      dtype=object)

This captures a quantity and magnitude for every example, although multipack matches such as '9 x 1 ' come out without their unit. It should be further checked later on that these extractions are also correct, not just non-empty.
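
Several multipack matches in the output above, such as '9 x 1 ' and '6 x 200 ', end in a space with no unit. This follows from the empty first alternative in the last group, `(?:|cl|ml|l|g|gr|g)`: the regex engine tries alternatives from left to right, so the empty one succeeds before any unit is tested. The sketch below uses a simplified version of the multipack branch and an invented product name to show the effect of moving the empty alternative to the end.

```python
import re

name = "leche entera 6 x 200 ml"

# Empty alternative first: the group succeeds without consuming the unit
empty_first = r"\d+\s?x\s?\d+(?:[.,]\d+)?\s?(?:|cl|ml|l|gr|g)"
# Empty alternative last: the unit is consumed when present, still optional otherwise
empty_last = r"\d+\s?x\s?\d+(?:[.,]\d+)?\s?(?:cl|ml|l|gr|g|)"

print(repr(re.search(empty_first, name).group()))          # '6 x 200 '
print(repr(re.search(empty_last, name).group()))           # '6 x 200 ml'
print(repr(re.search(empty_last, "leche 6 x 1").group()))  # '6 x 1'
```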

In [48]:
names.isna().sum() / names.shape[0]

0    0.0
dtype: float64

### 2.2.2 Extract liters and quantities from 'aceite_de_girasol'

Getting the names of products from the 'aceite_de_girasol' category.

In [49]:
names = products.loc[products["category"] == "aceite_de_girasol","product_name"].str.lower()
names.head()

0     aceite de girasol refinado 0,2º hacendado 1 l.
1     aceite de girasol refinado 0,2º hacendado 5 l.
41            aceite de girasol capicua garrafa 5 l.
42         aceite de girasol carrefour classic' 1 l.
43          aceite de girasol carrefour garrafa 5 l.
Name: product_name, dtype: object

Testing the pattern. Note that within the alternation, the shortest options such as a single "l" go at the end, so that longer units like "litros" are tried first.

In [50]:
names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|ml|cl|l))")
names.iloc[:,0].unique()

array(['1 l', '5 l', '3 l', '150 ml', '50 ml', '1 litro', '5 litros',
       '3 litros', '200 ml'], dtype=object)

No records are left unmatched.

In [51]:
names.isna().sum() / names.shape[0]

0    0.0
dtype: float64
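
The ordering concern mentioned above can be seen in a two-line experiment. Python's `re` module tries alternatives left to right and keeps the first one that lets the match succeed, so a short alternative placed first hides a longer one that starts with the same letters. The product name is invented for the example.

```python
import re

name = "aceite de girasol 5 litros"

# Short alternative first: "l" wins and the rest of "litros" is left out
short_first = re.search(r"\d+\s?(?:l|litros?)", name).group()
# Long alternative first: the full word is captured
long_first = re.search(r"\d+\s?(?:litros?|l)", name).group()

print(repr(short_first))  # '5 l'
print(repr(long_first))   # '5 litros'
```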

### 2.2.3 Extract liters and quantities from 'aceite_de_oliva'

Getting the names of products from the 'aceite_de_oliva' category.

In [52]:
names = products.loc[products["category"] == "aceite_de_oliva","product_name"].str.lower()
names

2                     aceite de oliva 0,4º hacendado 1 l.
3                       aceite de oliva 1º hacendado 1 l.
4                  aceite de oliva intenso hacendado 3 l.
5                    aceite de oliva suave hacendado 3 l.
6           aceite de oliva virgen extra hacendado 0.2 l.
                              ...                        
1420    ybarra aceite de oliva virgen extra  botella d...
1421    ybarra aceite de oliva virgen extra botella 50...
1422    ybarra aceite de oliva virgen extra botella de...
1423    ybarra aceite de oliva virgen extra botella de...
1424      ybarra aceite de oliva virgen extra garrafa 5 l
Name: product_name, Length: 697, dtype: object

Again, testing the pattern, with the shortest alternatives such as a single "l" placed at the end of the alternation.

In [53]:
names = names.str.extract(r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|ml|cl|l))")
names.iloc[:,0].unique()

array(['1 l', '3 l', '0.2 l', '0.75 l', '0.5 l', '5 l', '750 ml',
       '250 ml', '500 ml', '10 ml', '200 ml', '2,5 l', '50 cl', nan,
       '75 cl', '300 ml', '400 ml', '25 cl', '150 ml', '1 litro',
       '3 litros', '5 litros', '20 cl', '20 ml', '20cl', '2 l', '280 ml',
       '100 ml', '4 l', '1l'], dtype=object)

This time, about 5% of the names are left unmatched. This is acceptable, as the most erratic quantity formats belong to products that do not really belong to the category, such as tuna in olive oil or extra virgin avocado oil, so the missing match also serves as a filter for them.

In [54]:
names.isna().sum() / names.shape[0]

0    0.05165
dtype: float64
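
A small sketch of how a missing match can act as a filter. The three product names are invented for illustration; only the pattern is the one used above.

```python
import re

pattern = r"\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|ml|cl|l)"

# Hypothetical names; the second one is not an olive oil product
names = [
    "aceite de oliva virgen extra carbonell 1 l.",
    "atun claro en aceite de oliva 80 g",
    "aceite de oliva suave koipe 750 ml",
]

# Names with no volume match are set aside instead of being forced into the category
matched = [n for n in names if re.search(pattern, n)]
unmatched = [n for n in names if not re.search(pattern, n)]

print(f"unmatched share: {len(unmatched) / len(names):.2%}")
print(unmatched)
```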

### 2.2.4 Final quantity, volume/weight and units extraction function

This function extracts quantity_magnitude_unit, and then divides each into its single field to be input as a separate column value.

In [55]:
def extract_quantity_from_product_name(
    product_name: str,
    category_name: str
) -> Tuple[int, float, Union[str, float]]:
    """
    Extracts quantity, magnitude, and unit from the product name
    based on the specified category.

    Parameters:
    ----------
    product_name : str
        The name of the product, which may contain quantity information.
    category_name : str
        The category of the product to select appropriate patterns.

    Returns:
    -------
    Tuple[int, float, Union[str, float]]
        - Quantity: Number of units in the product (default is 1 if not found).
        - Magnitude: Magnitude converted to the standard unit, or NaN if it cannot be determined.
        - Units: Standardized unit of measure ('g' or 'l'), or NaN if unavailable.
    """

    patterns = {
        "aceite-de-oliva": r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|ml|cl|l))",
        "aceite-de-girasol": r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|ml|cl|l))",
        # the empty alternative goes last so that a unit is captured whenever one is present
        "leche": r"(\d+(?:[.,]\d+)?\s?(?:litros?|mililitros?|cl|ml|l|kg|gr|g)|\d+\s?(?:uds\.?|botes|x)\s?\d+(?:[.,]\d+)?\s?(?:cl|ml|l|gr|g|))"
    }

    conversions_abbr = {
        "gramos": "g", "gr": "g", "kilogramos": "kg", "miligramo": "mg",
        "miligramos": "mg", "litros": "l", "litro": "l",
        "mililitro": "ml", "mililitros": "ml",
        "centilitro": "cl", "centilitros": "cl"
    }
    conversions_magnitude = {'g': 1, 'kg': 1000, 'mg': 0.001, 'l': 1, 'ml': 0.001, 'cl': 0.01}
    conversions_unit = {'g': 'g', 'kg': 'g', 'mg': 'g', 'l': 'l', 'ml': 'l', 'cl': 'l'}

    # extract the quantity_magnitude_unit string (e.g. "6 x 200 ml") and the quantity value
    quantity_magnitude_unit = ""
    quantity = 1
    try:
        quantity_magnitude_unit = re.findall(patterns[category_name], product_name.lower())[0].strip()
        quantity = int(re.findall(r"(\d+)\s?x", quantity_magnitude_unit)[0])
    except (KeyError, IndexError):
        # unknown category, no match, or no "N x" multiplier: keep the defaults
        pass

    try:
        # find units and standardize to abbreviations; [a-z]+ also captures
        # full words such as "litros", which \w{1,2} could not match
        units = re.findall(r"\d\s?([a-z]+)$", quantity_magnitude_unit)[0]
        units = conversions_abbr.get(units, units)
    except IndexError:
        units = np.nan

    try:
        # get magnitude value, skipping an optional (possibly multi-digit) "N x" prefix
        magnitude = re.findall(r"(?:\d+\s?x\s?)?(\d*\.?\d+)", quantity_magnitude_unit.replace(",", "."))[0]
    except IndexError:
        magnitude = 1

    # convert to same units
    magnitude = float(magnitude) * conversions_magnitude.get(units, np.nan)
    units = conversions_unit.get(units, np.nan)

    return quantity, magnitude, units
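
As a worked example of the final conversion step, the sketch below applies the same factor and unit tables to a few values. The helper name `to_standard` is hypothetical and exists only for this illustration.

```python
# Conversion tables mirroring the ones used in the function above
TO_STANDARD_FACTOR = {"g": 1, "kg": 1000, "mg": 0.001, "l": 1, "ml": 0.001, "cl": 0.01}
TO_STANDARD_UNIT = {"g": "g", "kg": "g", "mg": "g", "l": "l", "ml": "l", "cl": "l"}

def to_standard(magnitude: float, unit: str):
    """Hypothetical helper: convert a magnitude to grams or litres."""
    return magnitude * TO_STANDARD_FACTOR[unit], TO_STANDARD_UNIT[unit]

for magnitude, unit in [(50, "cl"), (750, "ml"), (1, "kg")]:
    print(magnitude, unit, "->", to_standard(magnitude, unit))
```

With this normalization, '50 cl' and '0.5 l' end up as the same magnitude and unit, which is what makes prices comparable across products.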


## 2.3 Extract brand names from product name

Extracting the brand names from the product names is no easy task. A first attempt used regular expressions, but the inconsistent placement and formatting of the brands made that approach unworkable.

Instead, the approach taken was to generate an initial list of brands from the product names with the help of AI, and then to iterate on it, adding the brand names that had not been captured as they surfaced while exploring the results.

In [None]:
# IF YOU WANT TO SEE RESULTS FROM THIS CELL, UNCOMMENT
# for x in products[(products["category"]=="aceite_de_girasol")]["product_name"]:
#     print(x)

The function to extract brand names is stored in `src/support/data_transformation_support.py`, but it is shown here for clarity:



In [65]:
def extract_brand(product_name: str) -> Union[str, float]:
    """
    Extracts and returns the brand from the product name, applying 
    any necessary normalization for known variations.

    Parameters:
    ----------
    product_name : str
        The name of the product to identify the brand.

    Returns:
    -------
    Union[str, float]
        The normalized brand name if recognized, or NaN if no brand is found.
    """

    normalizations = {
        "k arginano": "karlos arguinano",
        "k. arguinano": "karlos arguinano",
        "karlos arguinano": "karlos arguinano",
        "carbonel": "carbonell",
        "el molino d gines": "el molino de gines",
        "la española": "la espanola",
        "oleo cazorla": "oleocazorla",
        "coop": "dcoop",
        "arrolan": "arrolan",
        "oleaestepa": "oleoestepa",
        "bailén": "oro bailen"
    }

    brands = [
        'natursoy', 'l.r.', 'nunez de prado', 'laban', 'ram', 'hojiblanca', 'oleodiel', 'dia', "l'estornell", 
        'president', 'la masia', 'la laguna', 'aromas del sur', 'carbonel', 'feiraco', 'carrefour', 'kaiku', 
        'suroliva', 'ferrarini', 'el buen pastor', 'de nuestra tierra', 'aceites de ardales', 'priegola', 'montbelle', 
        'alhema de queiles', 'ecran sunnique', 'almaoliva', 'amarga y pica', 'capricho andaluz', 'carapelli', 
        'ondosol', 'tierra de sabor', 'verde segura', 'eroski', 'larsa', 'ideal', 'jacoliva', 'go vegg', 'lanisol', 
        'saqura', 'saha', 'oliva verde', 'merula', 'oro', 'arrolan', 'la boella', 'reales almazaras de alcaniz', 
        'carbonell', 'marques de grinon', 'finca penamoucho', 'la almazara de canjayar', 'elizondo', 'bomilk', 
        'mueloliva', 'hacienda el palo', 'fontasol', 'euskal herria', 'oro bailen', 'jaencoop', 'cantero de letur', 
        'miro', 'flor de arana', 'covap', 'cexasol', 'babaria', 'parqueoliva', 'gaza', 'pago baldios san carlos', 
        'lacturale', 'mar de olivos', 'agus', 'lauki', 'palacio de los olivos', 'casas de hualdo', 'madriz', 
        'don arroniz', 'oleoestepa', 'dcoop', 'leyma natura', 'borges', 'alcampo', 'giralda', 'duc', 'coosur', 
        'nekeas', 'santa teresa', 'ucasol', 'dominus', 'lar', 'abril', 'beyena', 'romanico', 'ester sole', 'koipe', 
        'capicua', 'lletera', 'puleva', 'babybio', 'retama', 'granja noe', 'nestle', 'santiveri', 'valroble', 
        'mustela', 'la yerbera', 'clesa', 'campomar nature', 'oleocazorla', 'ozolife', 'urzante', 'sveltesse', 
        'casa juncal', 'olilan', 'k arginano', 'hacendado', 'olivar de segura', 'flora', 'el corte ingles', 
        'picualia', 'la colmenarena', 'unicla', 'guillen', 'celta', 'altamira', 'coosol', 'arboleda', 'ybarra', 
        'conde de benalua', 'saeta', 'maestros de hojiblanca', 'el molino d gines', 'iznaoliva', 'maeva', 'denenes', 
        'lactebal', 'bizkaia esnea', 'la organic cuisine', 'aljibes', 'k. arguinano', 'el lagar del soto', 'ecomil', 
        'fruto del sur', 'olivar del sur', 'mil olivas', 'villacorona', 'valdezarza', 'oleum', 'pascual', 
        'karlos arguinano', 'tresces', 'fuenroble', 'oleo cazorla', 'ato', 'cambil', 'lilibet', 'ondoliva', 
        'changlot real', 'somontano', 'nectar of bio', 'molino de olivas de bolea', 'germanor', 'cazorliva', 
        'venta del baron', 'elosol', 'la redonda', 'olibeas', 'abaco', 'nivea', 'letona', 'santa gadea', 'monegros', 
        'asturiana', 'rio', 'llet nostra', 'danone', 'la espanola', 'castillo de canena', 'valles unidos', 'unio', 
        'oleaurum', 'senorio de segura', 'ultzama', 'el castillo'
    ]

    brands = sorted(brands, key=len, reverse=True)

    # longest brands are checked first, so "oro bailen" wins over "oro"
    for brand in brands:
        if brand in product_name:
            return normalizations.get(brand, brand)

    return np.nan
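
One caveat of the plain substring test is that short brands such as 'dia' can match inside unrelated words. The sketch below contrasts it with a hypothetical word-boundary variant. It uses a small subset of the brand list, an invented product name, and returns `None` instead of NaN to stay free of dependencies; it is not the function stored in the project.

```python
import re
from typing import Optional

# Small illustrative subset of the brand list, longest first
BRANDS = sorted(["oro bailen", "oro", "dia", "hacendado"], key=len, reverse=True)

def find_brand_substring(product_name: str) -> Optional[str]:
    # Plain substring test, as in the function above
    return next((b for b in BRANDS if b in product_name), None)

def find_brand_whole_word(product_name: str) -> Optional[str]:
    # Hypothetical variant: require word boundaries around the brand
    for brand in BRANDS:
        if re.search(rf"\b{re.escape(brand)}\b", product_name):
            return brand
    return None

name = "leche semidesnatada mediana 1 l"  # invented name, contains no brand
print(find_brand_substring(name))   # dia (matched inside "mediana")
print(find_brand_whole_word(name))  # None
print(find_brand_whole_word("aceite de oliva oro bailen 1 l"))  # oro bailen
```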


Applying the function to the test dataframe and inspecting the results for an example brand, 'alcampo':

In [66]:
products["brands"] = products["product_name"].apply(lambda product_name: extract_brand(unidecode(product_name.lower())))
products[products["brands"] == "alcampo"][:5]

Let's count how many products still have no brand captured. As can be observed, it is not worth capturing brands for those products, as they are not proper olive oil, sunflower oil or milk products.

In [70]:
product_names_filtered = products[(products["brands"].isna())]

print(f"There are {product_names_filtered['brands'].isna().sum()} products without brand\n\n")

## IF YOU WANT TO SEE THE OUTPUT, UNCOMMENT
# for product_name in product_names_filtered["product_name"]:
#     print(product_name)

There are 36 products without brand




## 2.4 Extract subcategories

Finally, extract subcategories. Not all milk products are equal, neither are all olive oil products. Therefore, to make filtering easier during the analysis phase, and to have this information more ingestible at database level, it is worth extracting them.

The process followed is quite straightforward, albeit not perfect, and mirrors the approach taken with the brand names: if a keyword is found inside the product_name, the product is assigned a subcategory and/or a distinction.

These functions can be found in `src/support/data_transformation_support.py`, but they are reproduced here because they were built iteratively and make the subcategory assignment process clear.

In [71]:
def extract_distinction_eco(
    product_name: str,
    category: str
) -> Tuple[Union[str, float], int]:
    """
    Extracts the distinction and eco-friendly status of a product based on 
    the product name and category.

    Parameters:
    ----------
    product_name : str
        The name of the product.
    category : str
        The category of the product to determine distinctions.

    Returns:
    -------
    Tuple[Union[str, float], int]
        - distinction: A string representing specific distinctions 
          (e.g., 'semidesnatada', 'desnatada sin lactosa') or NaN if none.
        - eco: An integer indicating eco-friendly status (1 for eco, 0 otherwise).
    """
    distinction = np.nan

    if category == "leche":
        if "semidesnatada" in product_name:
            distinction = "semidesnatada"
        elif "desnatada" in product_name:
            distinction = "desnatada"
        elif "entera" in product_name:
            distinction = "entera"

        if not pd.isna(distinction) and "lactosa" in product_name:
            distinction += " sin lactosa"
        if not pd.isna(distinction) and "calcio" in product_name:
            distinction += " calcio"
        if not pd.isna(distinction) and "proteinas" in product_name:
            distinction += " proteinas"

    if " eco " in product_name or "ecologic" in product_name:
        eco = 1
    else:
        eco = 0

    return distinction, eco

def extract_subcategory(
    product_name: str,
    category: str
) -> Union[str, float]:
    """
    Determines the subcategory of a product based on its name and category.

    Parameters:
    ----------
    product_name : str
        The name of the product.
    category : str
        The product category to narrow down subcategory choices.

    Returns:
    -------
    Union[str, float]
        The subcategory of the product, or NaN if not applicable.
    """

    if category == "aceite_de_girasol":
        if "freir" in product_name:
            subcategory = "freir"
        else:
            subcategory = "normal"

    elif category == "aceite_de_oliva" and "en aceite" not in product_name and "con aceite" not in product_name:
        if "virgen extra" in product_name:
            subcategory = "virgen extra"
        elif "virgen" in product_name:
            subcategory = "virgen"
        elif "intenso" in product_name:
            subcategory = "intenso"
        else:
            subcategory = "suave"

    elif category == "leche":
        if "cabra" in product_name:
            subcategory = "leche cabra"
        elif "vaca" in product_name:
            subcategory = "leche vaca"
        elif "condensada" in product_name:
            subcategory = "leche condensada"
        elif "leche" in product_name:
            subcategory = "leche vaca"
        else: 
            subcategory = np.nan

    else:
        subcategory = np.nan

    return subcategory
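
The order of the keyword checks in both functions matters, because some keywords are substrings of others ("desnatada" inside "semidesnatada", "virgen" inside "virgen extra"). A minimal sketch of the effect, with a hypothetical helper `first_keyword`:

```python
def first_keyword(product_name: str, keywords: list) -> str:
    """Return the first keyword found in the name, or '' if none is found."""
    for keyword in keywords:
        if keyword in product_name:
            return keyword
    return ""

name = "leche semidesnatada hacendado 1 l."

# Specific keyword first: correct result
print(first_keyword(name, ["semidesnatada", "desnatada", "entera"]))  # semidesnatada
# Generic keyword first: "desnatada" is a substring of "semidesnatada"
print(first_keyword(name, ["desnatada", "semidesnatada", "entera"]))  # desnatada
```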


The two functions above do the hard work; the function below combines them:

In [77]:
def get_subcategory_distinction(
    product_name: str,
    category: str
) -> Tuple[Union[str, float], Union[str, float], int]:
    """
    Determines the subcategory, distinction, and eco-friendly status 
    of a product based on its name and category.

    Parameters:
    ----------
    product_name : str
        The name of the product.
    category : str
        The category of the product.

    Returns:
    -------
    Tuple[Union[str, float], Union[str, float], int]
        - subcategory: A string representing the subcategory of the product, or NaN if not applicable.
        - distinction: Specific distinction within the category (e.g., 'desnatada'), or NaN if not found.
        - eco: Integer indicating eco-friendly status (1 for eco, 0 otherwise).
    """
    distinction, eco = extract_distinction_eco(product_name, category)
    # extract_subcategory takes only the name and the category
    subcategory = extract_subcategory(product_name, category)

    return subcategory, distinction, eco


Applying it to the test products df:

In [78]:
products[["subcategory","distinction","eco"]] = products[["product_name", "category"]].apply(
    lambda row: get_subcategory_distinction(unidecode(row["product_name"].lower()), row["category"]),
    axis=1,
    result_type="expand"
)

It yields the following results:

In [82]:
print(f"There are {products[products['subcategory'].isna()].shape[0]} out of {products.shape[0]} without subcategory assignment.")

products.head()

There are 69 out of 1629 without subcategory assignment.


Unnamed: 0,product_name,category,supermarket,product_name2,brands,subcategory,distinction,eco
0,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",aceite_de_girasol,mercadona,"Aceite De Girasol Refinado 0,2º Hacendado 1 L.",hacendado,normal,,0.0
1,"Aceite De Girasol Refinado 0,2º Hacendado 5 L.",aceite_de_girasol,mercadona,"Aceite De Girasol Refinado 0,2º Hacendado 5 L.",hacendado,normal,,0.0
2,"Aceite De Oliva 0,4º Hacendado 1 L.",aceite_de_oliva,mercadona,"Aceite De Oliva 0,4º Hacendado 1 L.",hacendado,suave,,0.0
3,Aceite De Oliva 1º Hacendado 1 L.,aceite_de_oliva,mercadona,Aceite De Oliva 1º Hacendado 1 L.,hacendado,suave,,0.0
4,Aceite De Oliva Intenso Hacendado 3 L.,aceite_de_oliva,mercadona,Aceite De Oliva Intenso Hacendado 3 L.,hacendado,intenso,,0.0


# 3. Conclusion of this notebook

The functions outlined here can be found in the script `src/support/data_transformation_support.py`. They are used inside an ETL script, along with the updated functions from the extraction and load phases.

If you wish to consult the data load process, please refer to `notebooks/3_data_load.ipynb`.