# Product Attribute Discovery, Extraction, and Harmonization Using LLMs

In this notebook, we demonstrate how LLMs can be used to extract and harmonize product attributes. The extracted attributes are intended as inputs to search engines, faceted catalog navigation, and other applications that require high-quality product data. Intelligent attribute extraction and harmonization is particularly useful in applications with multiple sellers or suppliers, such as marketplaces.

We consider the following scenarios:
  1. We receive a product description and need to determine whether it is a new or known (registered) product type. If it is a new product type, we need to generate a schema (list of attribute names) for it.
  1. Assuming that the product schema is known (either an existing schema fetched from a repository or a newly generated one), we need to extract the attributes from the product description. In particular, complex attributes might need to be generated.
  1. Extracted or manually entered attributes need to be harmonized, so that they use the same measures, taxonomy, etc. 
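
The first scenario boils down to a lookup with a fallback: reuse a registered schema when the category is known, otherwise generate one and register it. A minimal sketch of that decision is shown below. It assumes the category name has already been determined, uses a plain dictionary as the schema repository, and takes the schema generator as a callable; `resolve_schema`, `registry`, and `stub_llm` are hypothetical names introduced for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def resolve_schema(
    category_name: str,
    registry: Dict[str, List[str]],
    propose_schema: Callable[[str], List[str]],
) -> Tuple[List[str], bool]:
    """Return (attribute_names, is_new) for a product category."""
    # Known product type: reuse the registered schema
    if category_name in registry:
        return registry[category_name], False
    # New product type: generate a schema and register it for later reuse
    attribute_names = propose_schema(category_name)
    registry[category_name] = attribute_names
    return attribute_names, True


# Hypothetical repository contents and a stub standing in for the LLM call
registry = {"Table Vases": ["Brand", "Size", "Vase Shape", "Indoor Use Only"]}
stub_llm = lambda name: ["Brand", "Material"]

print(resolve_schema("Table Vases", registry, stub_llm))    # known category
print(resolve_schema("Cookware Sets", registry, stub_llm))  # new category
```

In a real pipeline the stub would be replaced by an LLM call, and the dictionary by a persistent schema store.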

We use LangChain as an abstraction layer to invoke LLMs. GCP Vertex AI is used as the LLM provider, but other proprietary LLMs can be used as well.

## Environment Setup and Initialization

In [106]:
#
# Initialize LLM provider
# (google-cloud-aiplatform must be installed)
#
from google.cloud import aiplatform
aiplatform.init(
    project='<< specify your project name here >>',
    location='us-central1'
)

#
# Helper function for calling LLM using a prompt template and one argument
# Parameter name in the template must be 'input'
#
from langchain.llms import VertexAI
from langchain import PromptTemplate, LLMChain

def query_llm_with_one_argument(query_template, argument):
    prompt = PromptTemplate(template=query_template, input_variables=['input'])

    llm = VertexAI()
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    return llm_chain.run(argument)    

#
# Test LangChain
#
query = "Question: {input} Answer: Let's think step by step."
question = "What are the largest industries in the world?"
query_llm_with_one_argument(query, question)

'The largest industries in the world are the automotive industry, the oil and gas industry, the pharmaceutical industry, the food and beverage industry, and the telecommunications industry. The automotive industry is the largest industry in the world, with a global market of over $2.5 trillion. The oil and gas industry is the second largest industry in the world, with a global market of over $2.2 trillion. The pharmaceutical industry is the third largest industry in the world, with a global market of over $1.2 trillion. The food and beverage industry is the fourth largest industry in the world, with a global market of over $1'

## Attribute Discovery

In [20]:
product_data = [
"""
Title: Caannasweis 10 Pieces Pots and Pans Non Stick Pan White Pot Sets Nonstick Cookware Sets w/ Grill Pan
Description:
1: These cookware sets are included 1*9.5” frying pan, 1* 8” frying pan, 1*1.5QT sauce pot with glass lid, 1*4.5QT deep frying pan with lid, 1*5QT stock pot with glass lid, and 1*9.5” square grill pan. 
This 10-piece granite pots and pans set is everything you need to get cooking in your kitchen. Not just that, the cooking set also makes an appreciable housewarming gift or holiday gift for your loved ones. 
2: Our pots and pans use scratch-proof nonstick granite coating, which can keep food sliding smoothly along the surface, preventing food to stick, and making cooking easier. Sturdy interiors can avoid 
chipping and coming off. 
3: These nonstick cookware sets are free of PFOA, PFOS, lead & cadmium(Have FDA certification for SGS testing). Giving you and your family a healthier and easier home culinary experience. 
4: The pots and pans in this cookware set do not have a magnetic induction bottom, which allows faster heating on other cooktops, designed for people who do not use induction cooktops. 
5: All-in-one design. Rivetless interior to prevent snags and food buildup. Just rinse with a rag or water to finish cleaning.
"""
]

In [101]:
discovery_template = """
Your goal is to determine the product category and propose attributes for this category based on the user's input. The output must follow the format described below.

```TypeScript
category: {{                               // Category metadata      
   category_name: string                   // Product type
   product_attribute_names: Array<string>  // A list of product attribute names that should be used to describe products in this category (not more than 4 attributes)
}}
```

Please output the extracted information in JSON format. Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional fields that do not appear in the schema.

Input: Noriega Glass Table Vase. A retro green tone meets a classic shape: This bud vase has a sleek, eye-catching look that's inspired by vintage design.
Output: {{ {{"category_name": "Table Vases"}}, {{"product_attribute_names": ["Brand", "Size", "Vase Shape", "Indoor Use Only"]}} }}

Input: {input}
Output:
"""

query_llm_with_one_argument(discovery_template, product_data[0])

'{ {"category_name": "Cookware Sets"}, {"product_attribute_names": ["Brand", "Material", "Color", "Number of Pieces"]} }'
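
The response above wraps every field in its own pair of braces, mirroring the few-shot example in the prompt, so it is not strictly valid JSON and `json.loads` rejects it. The following is a minimal sketch of a tolerant parser. `parse_llm_fields` is a hypothetical helper, not part of LangChain or Vertex AI, and it assumes the response is either a proper JSON object or a brace-wrapped, comma-separated sequence of single-field objects.

```python
import json


def parse_llm_fields(raw: str) -> dict:
    """Parse an LLM response into a flat dict of fields."""
    text = raw.strip()
    # First try the response as regular JSON
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        return parsed
    # Fallback: treat '{ {...}, {...} }' as a list of objects and merge them
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("response is not wrapped in braces")
    fragments = json.loads("[" + text[1:-1] + "]")
    merged = {}
    for fragment in fragments:
        if not isinstance(fragment, dict):
            raise ValueError("expected a sequence of JSON objects")
        merged.update(fragment)
    return merged


response = '{ {"category_name": "Cookware Sets"}, {"product_attribute_names": ["Brand", "Material", "Color", "Number of Pieces"]} }'
category = parse_llm_fields(response)
print(category["category_name"])            # Cookware Sets
print(category["product_attribute_names"])
```

An alternative is to change the few-shot example so that the model emits a single JSON object, in which case the fallback branch is never needed.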

## Attribute Extraction

In [103]:
extraction_template = """
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript
product: {{                 // Product attributes
   brand: string            // The name of the product's brand
   material: string         // The primary material the product is made of
   category: string         // Product type such as set, pot, or pan
   items_count: integer     // Number of items
   features: Array<string>  // A list of the main product features (not more than three)
}}
```

Please output the extracted information in JSON format. Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional fields that do not appear in the schema.

Input: Gotham Aluminium Cookware 12 Pieces Set. The nonstick cooking surface is coated 3x, and reinforced with ceramic and titanium to deliver the ultimate food release. Dishwasher safe.
Output: {{ {{"brand": "Gotham"}}, {{"material": "Aluminium"}}, {{"category": "Set"}}, {{"items_count": 12}}, {{"features": ["ceramic coating", "dishwasher safe"]}}  }}

Input: {input}
Output:
"""

query_llm_with_one_argument(extraction_template, product_data[0])

'{ {"brand": "Caannasweis"}, {"material": "Granite"}, {"category": "Set"}, {"items_count": 10}, {"features": ["nonstick coating", "rivetless interior", "FDA certified"]}  }'
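
Because the model is only instructed, not forced, to follow the schema, it can be useful to check the extracted attributes before storing them. The sketch below validates an already parsed attribute dictionary against the types listed in the prompt's schema. `validate_product`, `EXPECTED_TYPES`, and `MAX_FEATURES` are hypothetical names introduced for this illustration.

```python
# Attribute names and Python types corresponding to the schema in the prompt
EXPECTED_TYPES = {
    "brand": str,
    "material": str,
    "category": str,
    "items_count": int,
    "features": list,
}
MAX_FEATURES = 3  # the prompt asks for "not more than three" features


def validate_product(product: dict) -> list:
    """Return a list of schema violations (empty if the product is valid)."""
    errors = []
    for name, expected_type in EXPECTED_TYPES.items():
        if name not in product:
            errors.append(f"missing attribute: {name}")
            continue
        value = product[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    for name in product:
        if name not in EXPECTED_TYPES:
            errors.append(f"unexpected attribute: {name}")
    features = product.get("features")
    if isinstance(features, list):
        if len(features) > MAX_FEATURES:
            errors.append("too many features")
        if not all(isinstance(item, str) for item in features):
            errors.append("features must contain only strings")
    return errors


# The attributes from the response above, after parsing into a flat dict
product = {
    "brand": "Caannasweis",
    "material": "Granite",
    "category": "Set",
    "items_count": 10,
    "features": ["nonstick coating", "rivetless interior", "FDA certified"],
}
print(validate_product(product))  # []
```

Returning a list of violations rather than raising on the first one makes it easier to log all problems for a product in a single pass.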

In [118]:
#
# A more advanced example is the generation of complex attributes
# such as search engine optimization (SEO) tags and personalized product page titles
#

social_media_template = """
Create a new product page title specifically for the price-sensitive customer segment based on the product description provided below. 

Input: {input}
Output:
"""

query_llm_with_one_argument(social_media_template, product_data[0])

'Caannasweis 10 Pieces Pots and Pans Non Stick Pan White Pot Sets Nonstick Cookware Sets w/ Grill Pan - Affordable Price'

## Attribute Harmonization

In [80]:
product_attributes_raw = [
"""
{ "style" : "classic blue polka dot pattern", "dimensions" : "2-3/8 inch x 8-5/16 inch x 1/2 inch" }
"""
]

In [104]:
harmonization_template = """
Your goal is to format product attribute values in the user's input to match the format described below.

```TypeScript
product: {{                   // Product attributes
    style: string             // One of the following three style values: Traditional, Modern, Nature 
    dimensions: Array<float>  // An array of floating-point values expressed in inches 
}}
```

Please output the extracted information in JSON format. Do NOT add any clarifying information.

Input: {{ "style" : "wooden texture", "dimensions" : "8-1/2” x 1-1/16”" }}
Output: {{ "style" : "Nature", "dimensions" : [8.5, 1.0625] }}

Input: {{ "style" : "abstract hexagons", "dimensions" : "2/5 inch x 2 inch" }}
Output: {{ "style" : "Modern", "dimensions" : [0.4, 2] }}

Input: {input}
Output:
"""

query_llm_with_one_argument(harmonization_template, product_attributes_raw[0])

'{ "style" : "Traditional", "dimensions" : [2.375, 8.3125, 0.5] }'
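
The dimension conversion is plain arithmetic (for example, 2-3/8 = 2 + 3/8 = 2.375), so the LLM output can be cross-checked deterministically. The sketch below parses mixed fractions with the standard `fractions` module and compares the result with the harmonized values. `parse_inches`, `parse_dimensions`, `check_harmonized`, and `ALLOWED_STYLES` are hypothetical names, and the parser assumes measurements are written in inches and separated by `x`, as in the examples above.

```python
import math
import re
from fractions import Fraction

ALLOWED_STYLES = {"Traditional", "Modern", "Nature"}


def parse_inches(text: str) -> float:
    """Parse one value such as '2-3/8', '1/2', '8.5' or '2' into a float."""
    match = re.fullmatch(r"\s*(?:(\d+)[-\s]+)?(\d+)/(\d+)\s*", text)
    if match:
        whole, numerator, denominator = match.groups()
        return float(int(whole or 0) + Fraction(int(numerator), int(denominator)))
    return float(text)


def parse_dimensions(text: str) -> list:
    """Split 'A x B x C' and convert every part to inches as a float."""
    values = []
    for part in re.split(r"\s*x\s*", text):
        # Drop unit markers: the word 'inch(es)' and straight or curly quotes
        cleaned = re.sub(r"inch(?:es)?|[”\"]", "", part).strip()
        values.append(parse_inches(cleaned))
    return values


def check_harmonized(raw: dict, harmonized: dict) -> bool:
    """Check the style vocabulary and compare dimensions numerically."""
    expected = parse_dimensions(raw["dimensions"])
    actual = harmonized["dimensions"]
    return (
        harmonized["style"] in ALLOWED_STYLES
        and len(expected) == len(actual)
        and all(math.isclose(e, a) for e, a in zip(expected, actual))
    )


raw = {"style": "classic blue polka dot pattern",
       "dimensions": "2-3/8 inch x 8-5/16 inch x 1/2 inch"}
harmonized = {"style": "Traditional", "dimensions": [2.375, 8.3125, 0.5]}
print(parse_dimensions(raw["dimensions"]))  # [2.375, 8.3125, 0.5]
print(check_harmonized(raw, harmonized))    # True
```

A check like this only covers the numeric part. Whether "classic blue polka dot pattern" really maps to "Traditional" remains a judgment made by the model.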