In [1]:
import pandas as pd
import time

In this notebook, we explore the metadata columns to identify additional information that should be included in the text embeddings.

In [2]:
start = time.time()
df = pd.read_parquet("../../data/clean/clean_data.parquet", engine="pyarrow", columns=["Pid", "Category", "Name", "MergedBrand", "Condition"])
print("Load time: {:.2f} seconds".format(time.time() - start))

Load time: 6.97 seconds


## Product Name vs Brand
Let's start with simple string matching to check whether the Name column already contains the brand.

In [27]:
df['brand_in_name'] = df.apply(
    lambda row: str(row['MergedBrand']).lower() in str(row['Name']).lower(), axis=1
)

In [28]:
# Total number of rows
total = len(df)

# Count of True values in brand_in_name
count = df['brand_in_name'].sum()

# Percentage
percentage = (count / total) * 100

print(f"Count: {count}")
print(f"Percentage of product names containing the brand: {percentage:.2f}%")

Count: 5331277
Percentage of product names containing the brand: 48.35%


Using simple string matching, we found that 48% of product names already contain the brand name. However, not all products have brand information available. As shown below, 44% of the products are missing this detail.

In [29]:
print(df.isnull().sum() / len(df) * 100)

Pid               0.000000
Category         37.897004
Name              0.101166
MergedBrand      44.491586
Condition        13.894525
brand_in_name     0.000000
dtype: float64


Now let's check, among the products whose name doesn't contain the brand, how many have a missing brand value.

In [30]:
# Total rows where brand_in_name is False
total_false = df[df['brand_in_name'] == False]

# Rows where brand is also NaN
false_and_null = total_false['MergedBrand'].isnull().sum()

# Percentage
percentage = (false_and_null / len(total_false)) * 100

print("Number of products whose name doesn't contain the brand:", len(total_false))
print("Number of those products whose brand is NaN:", false_and_null)
print(f"Percentage of missing brand values (NaN) among product names that don't contain the brand: {percentage:.2f}%")

df[(df['brand_in_name'] == False) & (~df['MergedBrand'].isnull())]

Number of products whose name doesn't contain the brand: 5694138
Number of those products whose brand is NaN: 4905382
Percentage of missing brand values (NaN) among product names that don't contain the brand: 86.15%


Unnamed: 0,Pid,Category,Name,MergedBrand,Condition,brand_in_name
7,127.2.DFF8DD86A0648144.CC18C6740EDBD90C.812303...,Home & Garden >Kitchen & Dining >Kitchen Appli...,Nsw - Cult Of The Lamb,U & I Entertainment,new,False
39,127.2.DFF8DD86A0648144.84B45CDFDD8A8F82.733569...,Sporting Goods >Indoor Games >Ping Pong >Ping ...,60'' Portable Table Tennis Ping Pong Folding T...,Costway,new,False
42,127.2.DFF8DD86A0648144.D009B1C8DBE15487.653046...,Arts & Entertainment >Hobbies & Creative Arts ...,27 Note Foldable Glockenspiel Xylophone Alumin...,Costway,new,False
53,127.2.DFF8DD86A0648144.4414CEDC31195D37.810057...,,Mosquito Repellent Refillable Wristbands - 4PK...,Cliganic,new,False
55,127.2.DFF8DD86A0648144.1FE97CECEE9445A1.860000...,Arts & Entertainment >Hobbies & Creative Arts ...,Celine Computerized Sewing Machine - White,Eversewn,new,False
...,...,...,...,...,...,...
11025410,239604.1.BBD0.E9597205DD0CA762.UB2-61042,Luggage & Bags >Backpacks,Urbo 2 Travelpack - Navy,LOJEL,,False
11025411,239604.1.BBD0.3341870201C6F4E7.UB2-61043,Luggage & Bags >Backpacks,Urbo 2 Citybag - Navy,LOJEL,,False
11025412,239604.1.BBD0.603591056B13FF27.VLC-L1-BLK-01-K...,Luggage & Bags >Luggage Accessories >Luggage C...,"Voja - Luggage Cover - Black, Large",LOJEL,,False
11025413,239604.1.BBD0.E2F97E6876AD6420.VLC-M1-BLK-01-K...,Luggage & Bags >Luggage Accessories >Luggage C...,"Voja - Luggage Cover - Black, Medium",LOJEL,,False


Now, as we can see, the main reason a product name doesn't include the brand is that many of those products simply lack brand information. However, there are 788,756 items where the brand is available but not mentioned in the product name. Note that this analysis was based on basic string matching.
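The caveat about basic string matching matters because a plain substring test can also match a brand inside an unrelated word. Below is a minimal sketch of a stricter check, assuming that word-level matching after stripping punctuation is acceptable for this data. `brand_in_name_strict` and `_normalize` are hypothetical helpers, not part of the notebook.

```python
import re


def _normalize(text):
    # Lowercase and collapse every run of non-alphanumeric characters
    # (punctuation, whitespace, accented letters) into a single space.
    return re.sub(r"[^a-z0-9]+", " ", text.casefold()).strip()


def brand_in_name_strict(name, brand):
    """Return True only if the brand appears in the name as whole words."""
    # Missing values (None, float NaN, pd.NA) are not strings, so they never match.
    if not isinstance(name, str) or not isinstance(brand, str):
        return False
    norm_brand = _normalize(brand)
    if not norm_brand:
        return False
    # Padding with spaces turns the substring test into a whole-word test.
    return f" {norm_brand} " in f" {_normalize(name)} "


# A plain substring test finds "andina" inside "scandinavian"; the strict check does not.
naive_hit = "andina" in "scandinavian lounge chair"
strict_hit = brand_in_name_strict("Scandinavian Lounge Chair", "Andina")
```

With pandas this could replace the lambda used earlier, for example through a list comprehension over `zip(df['Name'], df['MergedBrand'])`. Whether it changes the counts noticeably on the full dataset has not been checked here.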

### Conclusion
Including the brand in the text embeddings could be helpful, especially for edge cases where the product name is vague or uninformative (as shown below). While we may not see a significant boost in overall accuracy, adding brand information could enhance model performance in such scenarios.

In [31]:
df[(df['Name'] == '95')]

Unnamed: 0,Pid,Category,Name,MergedBrand,Condition,brand_in_name
9618512,202186.2.499B3C49E28651BB.4F0E795813525035.80I...,Apparel & Accessories >Shoes,95,GIANVITO ROSSI,new,False
9618513,202186.2.499B3C49E28651BB.A7C6CC406F655F18.80I...,Apparel & Accessories >Shoes,95,GIANVITO ROSSI,new,False
9635133,202186.2.499B3C49E28651BB.DDF4F457AFAAB9D4.79I...,Apparel & Accessories >Shoes,95,GIANVITO ROSSI,new,False
9638757,202186.2.499B3C49E28651BB.E0C71962BC15F3C3.77I...,Apparel & Accessories >Shoes,95,MACH & MACH,new,False


## Product Condition

Now let’s examine the 'Condition' column to understand which types of products are listed as 'used'.

In [24]:
df['Condition'].value_counts()

Condition
new            3523196
New            3142284
Used           2826973
refurbished        882
Refurbished        151
Name: count, dtype: Int64

In [33]:
print("Number of used books: ", len(df[
    df['Category'].str.contains('Book', case=False, na=False) & 
    (df['Condition'].str.lower() == 'used')
]))

print("Number of used Music related products: ", len(df[
    df['Category'].str.contains('Music', case=False, na=False) & 
    (df['Condition'].str.lower() == 'used')
]))

print("Number of used Software related products: ", len(df[
    df['Category'].str.contains('Software', case=False, na=False) & 
    (df['Condition'].str.lower() == 'used')
]))

print("Number of used items with NaN category: ", len(df[
    df['Category'].isnull() & 
    (df['Condition'].str.lower() == 'used')
]))

Number of used books:  697883
Number of used Music related products:  278
Number of used Software related products:  352
Number of used items with NaN category:  2128459


In [23]:
df[
    df['Category'].notnull() & 
    ~df['Category'].str.contains('Book', case=False, na=False) & 
    ~df['Category'].str.contains('Music', case=False, na=False) & 
    ~df['Category'].str.contains('Software', case=False, na=False) & 
    (df['Condition'].str.lower() == 'used')
]

Unnamed: 0,Pid,Category,Name,MergedBrand,Condition
9423583,190499.156074.AE28426A76AAE034.974BD6BDF0FAD82...,Mature >Erotic >Pole Dancing Kits,Previously Played - LEGO Marvel Super Heroes (...,Wb Games,Used


In [34]:
df[
    df['Category'].isnull() & 
    (df['Condition'].str.lower() == 'used')
]

Unnamed: 0,Pid,Category,Name,MergedBrand,Condition,brand_in_name
4299732,178866.156074.2E7F5E21C961BE0.D84BDA13AA68554F...,,"[Signed] Buddy Longway, tome 13 : Le Vent Sauv...",,Used,False
4299733,178866.156074.2E7F5E21C961BE0.8087BD32F4B5D0BD...,,"[Signed] Dragon Rider - Red Endpapers Issue, S...",,Used,False
4299734,178866.156074.2E7F5E21C961BE0.A3E463DAC8A5D9D0...,,[Signed] Carta Executoria de Hidalguia di Don ...,,Used,False
4299735,178866.156074.2E7F5E21C961BE0.B1289B1E3DA55568...,,[Signed] Will Eisner's The Spirit Portfolio Ei...,,Used,False
4299736,178866.156074.2E7F5E21C961BE0.B3E937D8F2FF8703...,,[Signed] Das Reich [with ten original serigrap...,,Used,False
...,...,...,...,...,...,...
9060133,178866.156074.99167FE934A5BC7F.89B56B2557BCCDD...,,[Signed] Autographs -- 292 of Britains House ...,,Used,False
9060148,178866.156074.99167FE934A5BC7F.B08EAAB6C08EFE9...,,"[Signed] 21 ALS, 5 TLS and 6 APCS E.P.Conkle D...",,Used,False
9060176,178866.156074.99167FE934A5BC7F.B31DBEA659669C8...,,The Sunday Times Colour Section (Magazine) 20t...,,Used,False
9060180,178866.156074.99167FE934A5BC7F.ADE6C8A6D1D5367...,,[Signed] sinbad tome 1 le cratère d'alexandrie...,,Used,False


### Conclusion 

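Of the 2,826,973 used items, 697,883 are categorized as books and 2,128,459 have no category at all; the uncategorized rows sampled above look like books and other printed material. Music and software account for only 278 and 352 used items, and the single remaining item is a video game with a mislabeled category. While incorporating the Condition field into text embeddings could potentially improve search accuracy for used books and similar items, it is not considered essential. Instead, we prefer to treat it as a post-search filtering criterion.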
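Post-search filtering on Condition would first need to reconcile the mixed casing seen in the value counts (`new` vs `New`, `refurbished` vs `Refurbished`). The sketch below shows one way to do that on a list of search results. The result format (a list of dicts with a `Condition` key), the sample rows, and both function names are assumptions made for illustration.

```python
def normalize_condition(value):
    """Map raw Condition values such as 'New' and 'new' to one lowercase label."""
    if not isinstance(value, str):
        return None  # missing condition (None / NaN)
    return value.strip().lower()


def filter_by_condition(results, wanted):
    """Keep only the results whose normalized condition equals `wanted`."""
    wanted = normalize_condition(wanted)
    return [r for r in results if normalize_condition(r.get("Condition")) == wanted]


# Made-up sample results, only to exercise the filter.
results = [
    {"Name": "sample backpack", "Condition": None},
    {"Name": "sample signed book", "Condition": "Used"},
    {"Name": "sample sewing machine", "Condition": "new"},
    {"Name": "sample sofa table", "Condition": "New"},
]
new_items = filter_by_condition(results, "new")
used_items = filter_by_condition(results, "USED")
```

Items with a missing condition are dropped by any explicit filter here; whether they should instead be kept is a product decision that this sketch does not settle.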

## Metadata Columns Summary

Below is a summary of metadata columns and whether they will be included in the text embeddings.

**✅MergedBrand:** This will be included in the text embeddings. By simple string matching, most product names with a known brand already contain it (roughly 87%), but explicitly embedding the brand can help in edge cases where the product name is ambiguous.

**✅Category:** After incorporating category into the text embeddings, we observed improved search results in some cases; for example, the query "Furniture" no longer returns books. However, since about 38% of products lack category information and the category taxonomy is complex, we may consider enriching the category manually. For now, including category has already shown a positive impact in certain scenarios.

**✅Gender:** Although product images can sometimes convey gender information, having explicit gender data is helpful for items where the image alone is ambiguous about the gender—such as watches or rings.

**❌Condition:** Used items are almost entirely books and other printed material, plus a small number of music, software, and game products. As confirmed with the partner, this field will not be included in the text embeddings. Instead, it will be considered for post-search filtering.

**❌Size:** Due to the high cardinality of the size field and its relatively low importance in common search scenarios, this attribute will also be treated as a post-search filter.

**❌Color:** Since product images already capture color information effectively, we will not include color in the text embeddings.

**⚠️Price:** Price requires separate handling, as general-purpose models like all-MiniLM-L6-v2 are not trained to interpret or enforce numerical conditions. I experimented with adding price to the text embeddings, but the model still failed to capture the price information.
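One way to handle price separately is to pull a numeric constraint out of the query before embedding it, and then apply that constraint as a filter on the results. The sketch below is only an illustration of that idea: the supported phrases ("under", "over", and so on), the function names, and the choice to treat bounds as inclusive are all assumptions, not something the notebook implements.

```python
import re

# Hypothetical phrasing list; real queries would likely need more patterns.
_PRICE_RE = re.compile(
    r"\b(under|below|less than|over|above|more than)\s+\$?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)
_UPPER_BOUND_WORDS = {"under", "below", "less than"}


def parse_price_constraint(query):
    """Split a query into (text_for_embedding, min_price, max_price)."""
    match = _PRICE_RE.search(query)
    if match is None:
        return query.strip(), None, None
    bound = float(match.group(2))
    # Remove the price phrase and collapse the leftover whitespace.
    remaining = " ".join((query[:match.start()] + " " + query[match.end():]).split())
    if match.group(1).lower() in _UPPER_BOUND_WORDS:
        return remaining, None, bound
    return remaining, bound, None


def within_price(price, min_price, max_price):
    """Check a product price against optional inclusive bounds."""
    if price is None or price != price:  # None or NaN: price unknown
        return False
    if min_price is not None and price < min_price:
        return False
    if max_price is not None and price > max_price:
        return False
    return True


text, min_price, max_price = parse_price_constraint("running shoes under $50")
```

Only `text` would be sent to the embedding model; `min_price` and `max_price` would be applied to the candidates afterwards with `within_price`.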

**The returned result should prioritize the category as the most important factor, followed by price, and then brand.**
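The priority order above can be read in several ways. One possible reading is a strict lexicographic re-ranking: category matches first, then price matches, then brand matches, with embedding similarity as the final tie-breaker. The sketch below implements that reading only. The candidate format (dicts with `Category`, `Price`, `MergedBrand`, and a `score` similarity), the sample rows, and the `rank_results` name are assumptions.

```python
def rank_results(candidates, category=None, max_price=None, brand=None):
    """Re-rank candidates by category match, then price match, then brand match."""

    def sort_key(item):
        item_category = str(item.get("Category") or "").lower()
        item_brand = str(item.get("MergedBrand") or "").lower()
        price = item.get("Price")
        category_ok = category is not None and category.lower() in item_category
        price_ok = max_price is not None and price is not None and price <= max_price
        brand_ok = brand is not None and item_brand == brand.lower()
        # Tuples compare element by element, so earlier criteria dominate later ones.
        return (category_ok, price_ok, brand_ok, item.get("score", 0.0))

    return sorted(candidates, key=sort_key, reverse=True)


# Made-up candidates, only to show the ordering.
candidates = [
    {"Name": "A", "Category": "Media >Books", "Price": 10.0, "MergedBrand": "X", "score": 0.9},
    {"Name": "B", "Category": "Furniture >Sofas", "Price": 500.0, "MergedBrand": "Slickblue", "score": 0.5},
    {"Name": "C", "Category": "Furniture >Sofas", "Price": 120.0, "MergedBrand": "Other", "score": 0.4},
    {"Name": "D", "Category": "Furniture >Sofas", "Price": 150.0, "MergedBrand": "Slickblue", "score": 0.3},
]
ranked = rank_results(candidates, category="Furniture", max_price=200, brand="Slickblue")
ranked_names = [item["Name"] for item in ranked]
```

A weighted score would be a softer alternative if strict ordering turns out to bury highly similar items; that choice is left open here.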

## Product Description

Now let’s take a look at the product description column. We’ll start by performing named entity recognition (NER) on the descriptions. We'll use the sample dataset for now, as processing the full dataset takes quite a long time.

### Named Entity Recognition

In [38]:
df = pd.read_csv("../data/csv/sample_100k_v2.csv")

In [39]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Function to extract named entities and noun phrases
def extract_entities_and_phrases(text):
    if not isinstance(text, str):
        return [], []
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    return entities, noun_phrases

In [40]:
# Apply to DataFrame
df[["named_entities", "noun_phrases"]] = df["Description"].apply(
    lambda x: pd.Series(extract_entities_and_phrases(x))
)

In [41]:
pd.set_option('display.max_colwidth', None)

df[df['Category'].str.contains('Tops', case=False, na=False)][["Description", "named_entities", "noun_phrases"]].head(10)

Unnamed: 0,Description,named_entities,noun_phrases
8,"knitted, medium-weight knit, no appliqués, solid color, deep neckline, long sleeves, no pockets , Color: Turquoise , Size: S",[Size],"[knitted, medium-weight knit, no appliqués, solid color, deep neckline, long sleeves, no pockets, Color, Turquoise, Size, S]"
14,Add understated elegance to your wardrobe with this plus size top from Style & Co.,[Style & Co.],"[understated elegance, your wardrobe, this plus size top, Style, Co.]"
27,"jersey, brand logo, solid color with print, crew neck, short sleeves, no pockets , Color: Black , Size: L",[Size],"[jersey, brand logo, solid color, print, crew neck, short sleeves, no pockets, Color, Black, Size, L]"
49,This graphic Hurley T-shirt will make a perfect year-round wardrobe essential. It features a graphic Hurley logo at chest on a soft jersey.,"[Hurley, Hurley]","[This graphic Hurley T-shirt, a perfect year-round wardrobe essential, It, a graphic Hurley logo, chest, a soft jersey]"
55,"For an extra layer when the temperatures drop, wear the Antigua Women's Pittsburgh Steelers Figure 1/4 Zip. With Pittsburgh Steelers colors and graphics, you&rsquo;ll be able to show off your team pride throughout the entire season and year. Made with soft fabric, you&rsquo;ll stay comfortable throughout the whole game in the Antigua Women's Pittsburgh Steelers Figure 1/4 Zip.Fabric Content:100% polyester","[the Antigua Women's Pittsburgh Steelers, 1/4, Pittsburgh, the entire season and year, the Antigua Women's Pittsburgh Steelers, 1/4]","[an extra layer, the temperatures, the Antigua Women's Pittsburgh Steelers Figure, 1/4 Zip, Pittsburgh Steelers colors, graphics, you&rsquo;ll, your team pride, the entire season, year, soft fabric, you&rsquo;ll, the whole game, the Antigua Women's Pittsburgh Steelers Figure, 1/4 Zip, Fabric Content:100% polyester]"
56,This short sleeve T-shirt features a crew neckline and chest signatures.,[],"[This short sleeve T-shirt, a crew, neckline, chest signatures]"
60,"Magic mirror, on the wall - what is the fairest Disney shirt of all add a little Disney magic to your day with a fun Disney T-Shirt celebrate all of your favorites with designs that feature Beauty Rose.","[Magic, Disney, Disney, day, Disney, Beauty Rose]","[Magic mirror, the wall, what, the fairest Disney shirt, all, a little Disney magic, your day, Disney T-Shirt, all, your favorites, designs, that, Beauty Rose]"
67,"sweatshirt fleece, no appliqués, solid color with print, hooded collar, long sleeves, fleece lining, single pocket , Color: White , Size: S",[Size],"[sweatshirt fleece, no appliqués, solid color, print, hooded collar, long sleeves, fleece lining, single pocket, Color, White, Size, S]"
73,"Diversify your T-shirt drawer with our ribbed space dye top. Elbow-length sleeves lend an extra touch of polish to this piece. About The Brand: Our brand was designed for and with women like you. Made for your body, your style, and your truth. As your wardrobe's best friend, these mix-and-match pieces work with everything in your closet.","[Elbow, polish]","[your T-shirt drawer, our ribbed space dye top, Elbow-length sleeves, an extra touch, polish, this piece, The Brand, Our brand, women, you, your body, your style, your truth, your wardrobe's best friend, these mix-and-match pieces, everything, your closet]"
79,"jersey, brand logo, solid color, classic neckline, long sleeves, button closing, single chest pocket , Color: White , Size: 15 ½","[Size, 15, ½]","[jersey, brand logo, solid color, classic neckline, long sleeves, button closing, single chest pocket, Color, White, Size, 15 ½]"


In [43]:
df[df['Category'].str.contains('Pants', case=False, na=False)][["Description", "named_entities", "noun_phrases"]].head(10)

Unnamed: 0,Description,named_entities,noun_phrases
2,. . . . 2,[2],[]
6,"corduroy, ribbed, no appliqués, solid color, high waisted, elasticized waist, tapered leg, regular fit, multipockets, this brand runs small , Color: Red , Size: 8","[Size, 8]","[no appliqués, solid color, high waisted, elasticized waist, leg, regular fit, multipockets, this brand, Red, Size]"
47,"denim, brand logo, solid color with appliqués, colored wash, high waisted, belt loops, relaxed fit, 1 button, zipper fastening, multipockets, stretch, machine wash or dry clean, do not bleach, do not tumble dry, cropped, straight-leg jeans , Color: Black , Size: 27","[denim, 1, Size, 27]","[denim, brand logo, solid color, appliqués, colored wash, high waisted, belt loops, 1 button, zipper fastening, multipockets, stretch, machine wash, cropped, straight-leg jeans, Black, Size]"
52,"The Exaggerated Icon Jogger is for all the icons. This men's pair of joggers feature slanted front pockets, elasticized trim, slanted front pockets, and a back patch pocket. Finished with a large ""True Religion"" text written down one of the legs.","[True Religion, one]","[The Exaggerated Icon Jogger, all the icons, This men's pair, joggers, slanted front pockets, trim, slanted front pockets, a back patch pocket, a large ""True Religion"" text, the legs]"
57,"jersey, brand logo, solid color, regular fit, without pockets, stretch, cropped, skinny pants , Color: Ivory , Size: 10","[Size, 10]","[jersey, brand logo, solid color, regular fit, pockets, stretch, Color, Ivory, Size]"
86,"Move effortlessly throughout your day with these unrestrictive 4-way stretch chino pants from Frank and Oak, featuring a streamlined fit and a stylish cropped leg.","[4, Frank]","[your day, these unrestrictive 4-way stretch chino pants, Frank, Oak, a streamlined fit, a stylish cropped leg]"
88,"Step into effortless style and unbeatable comfort with the Women's Mica Denim Crop Drawstring Jogger Pants. These versatile joggers feature a chic cropped design and an adjustable drawstring waist, perfect for customizing your fit. Made from soft, high-quality fabric, they offer the perfect blend of durability and stretch for all-day wear. Whether you're running errands or enjoying a casual day out, these jogger pants will quickly become your go-to wardrobe essential.","[the Women's Mica Denim Crop Drawstring Jogger Pants, all-day]","[Step, effortless style, unbeatable comfort, the Women's Mica Denim Crop Drawstring Jogger Pants, These versatile joggers, a chic cropped design, an adjustable drawstring waist, your fit, soft, high-quality fabric, they, the perfect blend, durability, stretch, all-day wear, you, errands, a casual day, these jogger pants, your go]"
152,"This high rise, 30"" inseam wide leg jean features a tulip hem made with ""Ab"" Solution technology, meant to mold, hold, and boost your assets","[30, Ab"" Solution]","[This high rise, 30"" inseam wide leg jean, a tulip hem, ""Ab"" Solution technology, mold, your assets]"
219,"corduroy, no appliqués, solid color, low waisted, regular fit, multipockets, stretch, straight-leg pants , Color: Camel , Size: 32","[Size, 32]","[Color, Camel, Size]"
246,"bootcut pants, woven, solid color, brand logo, embellished, mid rise, regular fit, zipper fastening, button fastenings, multipockets, stretch, machine wash or dry clean, do not bleach, tumble dry , Color: Cream , Size: 4","[Size, 4]","[bootcut pants, Cream]"


In [44]:
df[df['Category'].str.contains('Shoes', case=False, na=False)][["Description", "named_entities", "noun_phrases"]].head(10)

Unnamed: 0,Description,named_entities,noun_phrases
21,"stained effect, brand logo, solid color, rubber lining, buckle fastening, round toeline, flat, rubber sole, flip flops , Color: Pastel pink , Size: 6","[Size, 6]","[stained effect, brand logo, solid color, rubber lining, buckle fastening, round toeline, flat, rubber sole, flip flops, Pastel pink]"
24,"A fresh update to one of your favorite women's tennis shoes, you'll heighten every occasion. With easy lace up styling, women's casual sneakers with a clean white platform sole go the extra mile in comfort. Sporty women's sneakers with cushioned back collar. Leather, suede, leather/synthetic metallic, leather/suede, leather/fabric, fabric/leather or synthetic upper with a closed round toe. Contour+ Comfort technology for a premium fit and all-day comfort experience. Removable footbed, non-slip outsole. 1.25"" platform heel. Why You'Ll Love It: Designed to the contours of a woman's foot. Available in an inclusive range of sizes and widths for a custom-designed fit and all-day wear. The Beautiful Fit. Est. 1927.","[one, Sporty, Comfort, all-day, 1.25, Love, all-day, 1927]","[your favorite women's tennis shoes, you, every occasion, easy lace, styling, women's casual sneakers, a clean white platform sole, the extra mile, comfort, Sporty women's sneakers, cushioned back collar, Leather, suede, leather/synthetic metallic, leather/suede, leather/fabric, fabric/leather, a closed round toe, Contour+ Comfort technology, a premium fit and all-day comfort experience, 1.25"" platform heel, You'Ll, It, the contours, a woman's foot, an inclusive range, sizes, widths, a custom-designed fit and all-day wear, The Beautiful Fit. Est]"
30,"chunky loafers, glossed-leather, brand logo, solid color, leather lining, round toe, square heel, rubber sole, contains non-textile parts of animal origin , Color: Black , Size: 8","[Size, 8]","[chunky loafers, glossed-leather, brand logo, solid color, leather lining, round toe, square heel, rubber sole, non-textile parts, animal origin, Black, Size]"
31,"canvas, brand logo, solid color, lace-up, round toe, flat, rubber sole, high-top sneakers , Color: Military green , Size: 11","[Size, 11]","[canvas, brand logo, solid color, lace-up, round toe, flat, rubber sole, high-top sneakers, Color, Military green]"
61,"Elevate your winter experience with the Men's Pajar Canada Maddox Ice Grip Boots, engineered to tackle the toughest of winter challenges. The waterproof leather and nylon upper provide a dry and flexible fit, ensuring comfort in harsh conditions. With Pajar's signature ice-gripper sole, you'll stride confidently through icy terrain. Seam-sealed and comfort rated to -25°C (-13°F), these boots feature Pajar-Tex waterproof membrane bootie construction for snug warmth, while the premium removable comfort molded insole, crafted with a cozy wool blend, offers breathability and antimicrobial benefits. Face the cold in style and comfort with the rugged sophistication of the Men's Pajar Canada Maddox Ice Grip Boots.","[winter, the Men's Pajar Canada Maddox Ice Grip Boots, winter, Pajar, Pajar-Tex, the Men's Pajar Canada Maddox Ice Grip Boots]","[your winter experience, the Men's Pajar Canada Maddox Ice Grip Boots, winter challenges, The waterproof leather, nylon, a dry and flexible fit, comfort, harsh conditions, Pajar's signature ice-gripper sole, you, icy terrain, °, C, F, these boots, Tex, snug warmth, the premium removable comfort, a cozy wool blend, breathability, antimicrobial benefits, the cold, style, comfort, the rugged sophistication, the Men's Pajar Canada Maddox Ice Grip Boots]"
68,"You bring the heat every mission-but when the mission brings the heat right back, the Scorch is here to serve. It's built first for breathability thanks to a lightweight upper patterned with mesh panels*. The Plylolite midsole and the speed lace system work together to make it light, agile and ready when you are-while the outsole is slip resistant and designed for stability. When the heat is on, the Scorch is extra breathable to keep you cool all the way down.Lightweight, strategically patterned upper for maximum air circulation (mesh panels are transparent and may reveal sock color or pattern) Semi-locking YKK Side-Zip Polishable toe Breathable mesh lining Speed lacing for quick adjustments Cushioning open-cell OrthoLite footbed Danner Plyolite EVA midsole offers lasting support and rebound Nylon shank Danner Scorch outsole is slip resistant and designed with a stability control archHeight: 8""Weight: 40 ozInsulation: Non-Insulated","[Scorch, first, Plylolite, Scorch, Nylon shank Danner Scorch, 40, Non-Insulated]","[You, the heat, every mission, the mission, the heat, the Scorch, It, breathability thanks, mesh panels, The Plylolite midsole, it, you, the outsole, stability, the heat, the Scorch, you, maximum air circulation, mesh panels, sock color, pattern, quick adjustments, open-cell OrthoLite, Danner Plyolite EVA midsole, lasting support, Nylon shank Danner Scorch outsole, a stability control]"
94,"Break up the monotony with the Run Star Hike Platform Sneaker from Converse! These fashion-forward Chucks feature sturdy denim uppers with SmartFOAM&reg; cushioning, chunky platform midsoles, and&nbsp;two-tone split rubber outsoles for a unique look. Please note: this style runs a half size large.","[the Run Star Hike Platform Sneaker, half]","[the monotony, the Run Star Hike Platform Sneaker, Converse, These fashion-forward Chucks, sturdy denim uppers, SmartFOAM&reg, cushioning, chunky platform midsoles, and&nbsp;two-tone, rubber outsoles, a unique look, this style]"
98,"A utilitarian option with stylish versatility, the Garnerr booties from Style & Co take your look from work to weekend with comfort in mind.","[Garnerr, Style & Co]","[A utilitarian option, stylish versatility, the Garnerr booties, Style, Co, your look, work, weekend, comfort, mind]"
103,"Get ready for summer fun with the Vanni heeled sandals by Easy Street. This stylish sandal combines an on-trend interwoven, chevron pattern to elevate all your favorite outfits. The block heel is a fashionable touch, while the back zipper ensures easy on and off. You'll enjoy all day style and comfort.","[summer, Vanni, Easy Street, chevron, all day]","[summer fun, the Vanni, heeled sandals, Easy Street, This stylish sandal, -trend, , chevron pattern, all your favorite outfits, The block heel, a fashionable touch, the back zipper, You, all day style, comfort]"
115,"leather, studs, brand logo, solid color with appliqués, lined in shearling, buckle fastening, round toe, wedge heel, rubber cleated sole, contains non-textile parts of animal origin, mules , Color: Black , Size: 6","[Size, 6]","[leather, studs, brand logo, solid color, appliqués, buckle fastening, round toe, wedge heel, rubber cleated sole, non-textile parts, animal origin, mules, Black, Size]"


**The named entities are still quite noisy, which makes it difficult to extract meaningful information from them. We also don't need information that duplicates existing columns, such as size, color, and brand.**

### Frequent words
So instead, I’ll examine the 500 most frequent noun phrases in the description column to see what insights they might reveal.

In [10]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

all_noun_phrases = []

for desc in df['Description'].dropna():
    doc = nlp(desc)
    all_noun_phrases.extend([chunk.text.lower() for chunk in doc.noun_chunks])

# Count frequency
freq = Counter(all_noun_phrases)

In [11]:
for phrase, count in freq.most_common(500):
    print(phrase, count)

you 32874
it 29528
that 27136
he 12739
they 10981
who 10910
size 9825
we 8094
she 7928
what 6741
which 6284
them 5755
this book 5729
this 4825
color 4764
solid color 4384
faster shipping 4367
better service 4367
the book 4292
i 3741
all 3545
comfort 3351
style 3302
him 3273
us 3176
the world 3147
book 2963
the perfect way 2777
women 2770
no appliqués 2768
readers 2751
pages 2717
brand logo 2707
condition 2662
eyes 2637
collection 2616
warranty 2597
24 months 2593
a step 2592
harmful uv rays 2581
life 2572
the fashion curve 2558
a discounted price 2557
smartbuyglasses 2557
authencity 2463
her 2419
non-textile parts 2373
animal origin 2371
men 2129
students 2095
those 1996
leather 1826
everything 1767
stretch 1748
black 1722
access codes 1712
long sleeves 1708
some 1688
measurements 1679
people 1672
everyone 1671
god 1659
cds 1651
time 1623
nbsp 1604
- 1535
love 1533
multipockets 1528
the author 1516
you&rsquo;ll 1463
the story 1408
children 1392
anti-scratch coating 1385
anti-reflective
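The list above is dominated by pronouns and HTML residue (`nbsp`, `you&rsquo;ll`), which hides the product-related phrases. A minimal sketch of a cleanup step before counting is shown below. The `UNINFORMATIVE` set is a hand-picked assumption based on the output above, and `clean_phrase` and `count_informative` are hypothetical helpers.

```python
import html
import re
from collections import Counter

# Hand-picked from the frequency list above; an assumption, not an exhaustive stop list.
UNINFORMATIVE = {
    "you", "that", "they", "who", "she", "what", "which", "them", "this",
    "all", "him", "her", "those", "some", "everything", "everyone", "nbsp",
}


def clean_phrase(phrase):
    """Decode HTML entities, collapse whitespace, and lowercase a phrase."""
    text = html.unescape(phrase).replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip().lower()


def count_informative(phrases, min_len=3):
    """Count phrases after dropping pronouns, very short tokens, and punctuation."""
    counts = Counter()
    for phrase in phrases:
        text = clean_phrase(phrase)
        if len(text) < min_len or text in UNINFORMATIVE:
            continue
        if not any(ch.isalpha() for ch in text):
            continue  # pure punctuation or digits
        counts[text] += 1
    return counts


sample = ["you", "It", "solid color", "Solid&nbsp;Color", "nbsp", "-", "leather"]
counts = count_informative(sample)
```

In the notebook this would be applied to `all_noun_phrases` in place of the plain `Counter(all_noun_phrases)` call. The `min_len` threshold also drops one-letter size codes such as `s` or `l`.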

I provided the 500 most frequent noun phrases from the product descriptions to ChatGPT for analysis. It sorted them into the following groups, which can help us extract useful insights and enrich the data:

**Material & Build Quality**  
These words suggest something about the item's durability, comfort, craftsmanship, or premium feel:  
`leather`, `cotton`, `polyester`, `rubber sole`, `steel`, `glass`, `wood`, `soft`, `stretch`, `lightweight`, `durable`, `high-quality`, `handcrafted`, `waterproof`, `scratch-resistant`

**Intended Use or Target Audience**  
These terms indicate who the product is designed for or its typical use scenarios:  
`for men`, `for women`, `for kids`, `for home`, `for gym`, `daily use`, `office`, `travel`, `casual`, `formal`, `reading`, `workwear`

**Features / Selling Points**   
Highlight specific functionalities or product benefits that might influence a purchase decision:  
`machine washable`, `UV protection`, `ergonomic`, `noise-cancelling`, `wireless`, `easy to clean`, `eco-friendly`, `made in USA`, `adjustable`, `foldable`, `battery operated`

**Descriptive Adjectives & Vibes**  
These adjectives evoke a mood, aesthetic, or style associated with the product:  
`elegant`, `cozy`, `sleek`, `trendy`, `classic`, `vintage`, `minimalist`, `sporty`, `relaxing`, `professional`

Two further groups came up but will be ignored for now:  
**Condition & Edition** (mainly for books or media): `new`, `used`, `like new`, `publisher overstock`, `former library book`, `annotated`, `revised`, `hardcover`, `first edition`, `dust jacket`, `signed`  
**Themes / Genres / Subject Matter** (mainly for books, movies, or games): `romance`, `history`, `science fiction`, `philosophy`, `religion`, `self-help`, `memoir`, `bestseller`, `spiritual`, `educational`

Ideally, if we could perform the same analysis on the entire dataset, we would identify more words for these groups. We could then add all those words as tags and incorporate them into the text embeddings. **However, this won't be able to capture all the important information for all products.**

Below is sample code that uses the sample data to extract the information mentioned above from the description column as tags, and then incorporates these tags into the text embeddings.

In [14]:
df = pd.read_csv("../data/csv/sample_100k_v2.csv")

In [15]:
import re

# Use joined patterns for speed
MATERIAL_PATTERN = re.compile(
    r"\b(?:leather|cotton|polyester|rubber|nylon|wood|steel|glass|soft|stretch|durable|lightweight|handcrafted|waterproof|scratch[- ]?resistant|machine[- ]?washable)\b",
    re.I
)

USE_PATTERN = re.compile(
    r"\b(?:for (?:men|women|kids)|unisex|everyday use|office|travel|casual|formal|gym|workwear|sleepwear|loungewear)\b",
    re.I
)

FEATURE_PATTERN = re.compile(
    r"\b(?:uv protection|eco[- ]?friendly|wireless|adjustable|foldable|ergonomic|noise[- ]?cancelling|battery[- ]?powered|lightweight|easy to clean|compact|space[- ]?saving|made in [A-Z]{2,})\b",
    re.I
)

VIBE_PATTERN = re.compile(
    r"\b(?:elegant|cozy|sleek|trendy|classic|vintage|minimalist|sporty|relaxing|professional|bold|modern|artsy|luxurious)\b",
    re.I
)

def extract_tags_column(df, pattern, col='Description'):
    return df[col].fillna('').str.findall(pattern).apply(
        lambda x: list(set(map(str.lower, x))) if isinstance(x, list) else []
    )

In [16]:
df['tags_material'] = extract_tags_column(df, MATERIAL_PATTERN)
df['tags_use'] = extract_tags_column(df, USE_PATTERN)
df['tags_features'] = extract_tags_column(df, FEATURE_PATTERN)
df['tags_vibes'] = extract_tags_column(df, VIBE_PATTERN)

In [17]:
df['all_tags'] = (
    df['tags_material'] + 
    df['tags_use'] + 
    df['tags_features'] + 
    df['tags_vibes']
).apply(lambda x: " ".join(sorted(set(x))))

In [18]:
df['CombinedInfo'] = df.apply(
    lambda row: f"name: {row['Name']}, gender: {row['Gender']}, brand: {row['MergedBrand']}, category: {row['Category']}, tag: {row['all_tags']}",
    axis=1
)
len(df[df['all_tags'].astype(bool)])

38609

In [21]:
df[df['all_tags'].astype(bool)][['Name', 'all_tags', 'CombinedInfo']]

Unnamed: 0,Name,all_tags,CombinedInfo
3,"Slickblue Console Sofa Table With 3 Shelves, Metal Frame - Black",glass steel,"name: Slickblue Console Sofa Table With 3 Shelves, Metal Frame - Black, gender: unisex, brand: Slickblue, category: Furniture >Sofas , tag: glass steel"
9,One Handed: A Guide to Piano Music for One Hand (Music Reference Collection),compact,"name: One Handed: A Guide to Piano Music for One Hand (Music Reference Collection), gender: nan, brand: nan, category: Media >Books , tag: compact"
13,Barrier Free Travel: Utah National Parks for Wheelers and Slow Walkers,travel,"name: Barrier Free Travel: Utah National Parks for Wheelers and Slow Walkers, gender: nan, brand: nan, category: nan, tag: travel"
17,Suzy Levian New York Suzy Levian Sterling Silver Cubic Zirconia Three Row Modern Eternity Band Ring - Silver,modern,"name: Suzy Levian New York Suzy Levian Sterling Silver Cubic Zirconia Three Row Modern Eternity Band Ring - Silver, gender: female, brand: Suzy Levian New York, category: Apparel & Accessories >Jewelry >Rings , tag: modern"
21,Havaianas Woman Thong sandal Pastel pink Size 6 Rubber,rubber,"name: Havaianas Woman Thong sandal Pastel pink Size 6 Rubber, gender: female, brand: HAVAIANAS, category: Apparel & Accessories >Shoes , tag: rubber"
...,...,...,...
99991,Doucal's Man Ankle boots Dark brown Size 7 Soft Leather,leather rubber,"name: Doucal's Man Ankle boots Dark brown Size 7 Soft Leather, gender: male, brand: DOUCAL'S, category: Apparel & Accessories >Shoes , tag: leather rubber"
99992,Kids Squall Knit Hat - Lands' End - Red - M-L,cozy soft,"name: Kids Squall Knit Hat - Lands' End - Red - M-L, gender: unisex, brand: Lands' End, category: Apparel & Accessories >Clothing Accessories >Hats , tag: cozy soft"
99996,Boys' Jordan Post Slide Sandals Big Orange Blaze/Orange Peel/University Red,durable lightweight sleek,"name: Boys' Jordan Post Slide Sandals Big Orange Blaze/Orange Peel/University Red, gender: nan, brand: Jordan, category: Apparel & Accessories >Shoes , tag: durable lightweight sleek"
99997,Men's '47 Brand Gray Distressed Jacksonville Jaguars Downburst Franklin T-shirt - Gray,classic vintage,"name: Men's '47 Brand Gray Distressed Jacksonville Jaguars Downburst Franklin T-shirt - Gray, gender: male, brand: '47 Brand, category: Apparel & Accessories , tag: classic vintage"


In [None]:
df_sample = df.sample(frac=0.1, random_state=42)

In [None]:
# df_sample.to_csv("data/csv/processed.csv", index=False)

### Keyword/Phrase Extraction

Next, let's try [KeyBERT](https://github.com/MaartenGr/KeyBERT) to extract relevant keywords and keyphrases. Rather than relying on term frequency, KeyBERT uses text embeddings to pick the terms that are semantically closest to the text as a whole, so the results reflect its most meaningful parts. The trade-off is speed: extraction is slow.

KeyBERT uses a pretrained BERT-based embedding model (e.g., MiniLM) and follows this process:

1. Text Embedding  
It computes a vector for the entire input text (e.g., a product description).

2. Candidate Keyword Extraction  
It extracts candidate keywords or phrases from the text, using n-grams in our case. A candidate is a potential keyword or keyphrase taken from the input before any scoring or ranking: a raw piece of text that might be relevant, but whose relevance KeyBERT has not yet determined. KeyBERT then scores and filters the candidates to find the ones most representative of the text.

3. Candidate Embeddings  
It embeds each candidate phrase using the same embedding model.

4. Similarity Scoring  
It computes the cosine similarity between each candidate embedding and the full-text embedding.

5. Ranking  
It ranks all candidates by similarity score and returns the top ones as keywords.
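
Steps 4 and 5 can be illustrated with a toy example. This is only a sketch: the 3-dimensional vectors and the phrases are made up for illustration and are not real model embeddings, and `rank_candidates` is a hypothetical helper, not part of KeyBERT.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(doc_vec, candidate_vecs, top_n=2):
    """Score each candidate against the document vector and keep the top_n."""
    scored = [
        (phrase, cosine_similarity(doc_vec, vec))
        for phrase, vec in candidate_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Toy "embeddings" (assumed values, for illustration only)
doc_vec = [1.0, 1.0, 0.0]
candidate_vecs = {
    "suede slipper": [1.0, 0.9, 0.1],
    "free shipping": [0.0, 0.1, 1.0],
    "slipper": [1.0, 0.5, 0.0],
}

top = rank_candidates(doc_vec, candidate_vecs, top_n=2)
print([phrase for phrase, _ in top])  # ['suede slipper', 'slipper']
```

The candidate pointing in nearly the same direction as the document vector ranks first, while the unrelated phrase drops out.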



In [46]:
df = pd.read_csv("../data/csv/sample_100k_v2.csv")

In [47]:
df_sample = df.sample(frac=0.1, random_state=42)
df_sample.shape

(10000, 17)

In [48]:
# Step 1: Install KeyBERT if you haven't already
# !pip install keybert

from keybert import KeyBERT

# Step 2: Initialize the model
kw_model = KeyBERT()

# Step 3: Define a function to extract top keywords from each description
def extract_keybert_keywords(text, top_n=5):
    if not isinstance(text, str) or not text.strip():
        return []
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),  # extract 1-gram and 2-gram phrases
        stop_words='english',
        use_maxsum=True,
        nr_candidates=20,  # shortlist the 20 candidates most similar to the text; Max Sum then picks a diverse final top_n from them
        top_n=top_n
    )
    return [kw[0] for kw in keywords]

# Step 4: Apply to your DataFrame
start = time.time()
df_sample['tags_from_keybert'] = df_sample['Description'].apply(extract_keybert_keywords)
print(f"Process time: {(time.time() - start) / 60:.2f} minutes")

Process time: 44.69 minutes
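
The `use_maxsum=True` option trades some relevance for diversity. The sketch below shows the general idea as we understand it: shortlist the `nr_candidates` phrases most similar to the document, then pick the `top_n` combination whose members are least similar to each other. It is not KeyBERT's actual implementation; `max_sum_select` is a hypothetical helper and all similarity values are made up.

```python
from itertools import combinations

def max_sum_select(doc_sims, pair_sims, top_n=2, nr_candidates=3):
    """Pick top_n phrases from the nr_candidates most similar to the document,
    choosing the combination whose members are least similar to each other."""
    # Shortlist the phrases most similar to the document
    shortlist = sorted(doc_sims, key=doc_sims.get, reverse=True)[:nr_candidates]

    best_combo, best_score = None, float("inf")
    for combo in combinations(shortlist, top_n):
        # Sum of pairwise similarities inside this combination (lower = more diverse)
        score = sum(pair_sims[frozenset(pair)] for pair in combinations(combo, 2))
        if score < best_score:
            best_combo, best_score = combo, score
    return list(best_combo)

# Assumed similarity of each phrase to the document
doc_sims = {
    "casey slipper": 0.90,
    "slipper features": 0.88,
    "luxurious suede": 0.80,
    "free shipping": 0.10,
}
# Assumed similarity between pairs of shortlisted phrases
pair_sims = {
    frozenset(("casey slipper", "slipper features")): 0.95,
    frozenset(("casey slipper", "luxurious suede")): 0.40,
    frozenset(("slipper features", "luxurious suede")): 0.45,
}

selected = max_sum_select(doc_sims, pair_sims, top_n=2, nr_candidates=3)
print(selected)  # ['casey slipper', 'luxurious suede']
```

Plain ranking would return the two near-duplicate slipper phrases; this selection swaps one for a less redundant phrase. A brute-force search like this one checks 15,504 combinations per description when `nr_candidates=20` and `top_n=5`.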


In [51]:
# Optional: Join all tags into a string for easier viewing
df_sample['tags_from_keybert_str'] = df_sample['tags_from_keybert'].apply(lambda x: ', '.join(x))

In [52]:
pd.set_option('display.max_colwidth', None)
df_sample[df_sample['Category'].str.contains('Shoes', case=False, na=False)][['Pid', 'Name', 'Category', 'tags_from_keybert_str']].head(10)

Unnamed: 0,Pid,Name,Category,tags_from_keybert_str
76434,158391.2.5542A8E9F71ED4D3.106B68B52D8A3997.439454,Mens Minnetonka Casey Slipper - Charcoal,Apparel & Accessories >Shoes,"minnetonka, stitching available, luxurious suede, way casey, slipper features"
80917,127.2.DFF8DD86A0648144.3F5EC200C893296E.197943396380,Franco Sarto Women's Stevie Mid Shaft Boots - Bronze Leather,Apparel & Accessories >Shoes,"high, chic touch, loved mid, dressy casual, women boots"
51685,127.2.DFF8DD86A0648144.FBDE27B7AFFBA569.196371757732,Easy Street Women's Feena Slingback Pumps - Nude,Apparel & Accessories >Shoes,"dress occasion, adjustable buckle, heel sleek, easy street, feena"
39691,159390.1.5EDD.A616FA04F2C8CF41.US-17898774CI-9-AI24,Tua By Braccialini Woman Sandals Beige Size 8 Textile fibers,Apparel & Accessories >Shoes,"lining buckle, parts animal, sole contains, fabric appliqués, beige size"
32260,127.2.DFF8DD86A0648144.9F032A3E13FFD1C9.197651056965,Lucky Brand Women's Carolie Strappy Espadrille Wedge Sandals - Coffee Quart Leather,Apparel & Accessories >Shoes,"addition jeans, espadrille, asymmetrical straps, wedges lucky, carolie"
49182,159390.1.5EDD.8C9A5B4686172176.US-17778370UQ-11-AI24,Pollini Woman Sandals Tan Size 9 Leather,Apparel & Accessories >Shoes,"contains non, parts animal, toe geometric, slide sandals, textured leather"
55325,127.2.DFF8DD86A0648144.5E60F5EE8E431DC4.194655553469,BCBGeneration Women's Darmena Kitten Heel Dress Booties - Black,Apparel & Accessories >Shoes,"bit, bcbgeneration, dress, darmena, pointed toe"
93780,158391.2.6B6A87FA39B97B14.5D152E65285A82D2.808755,Dr. Martens Zebzag Slingback Platform Mule - Ultimate Grey,Apparel & Accessories >Shoes,"dr martens, zebzag, slingback platform, treat feet, mule features"
38603,127.2.DFF8DD86A0648144.2958FBD3A1FE1A66.198756092858,Nine West Women's Rhonda Pointy Toe Tapered Heel Dress Pumps - Blue Denim,Apparel & Accessories >Shoes,"slingback strap, style founded, empowers women, pumps features, west rhonda"
14952,127.2.DFF8DD86A0648144.E9D49C911A7AADFB.195690637367,White Mountain Women's Bocci Ballet Flat - Beige Boucle Fabric Multi,Apparel & Accessories >Shoes,"white mountain, accent design, ballet, chic cap, statement bocci"


In [53]:
df_sample[df_sample['Category'].str.contains('Tops', case=False, na=False)][['Pid', 'Name', 'Category', 'tags_from_keybert_str']].head(10)

Unnamed: 0,Pid,Name,Category,tags_from_keybert_str
50074,161091.2.52965A6F00B51D20.67CB1DDC41BC6569.6259013,Women's Plus Size Waffle Relaxed Long Sleeve Mock Neck Pullover - Lands' End - Green - 1X,Apparel & Accessories >Clothing >Shirts & Tops,"stay cozy, raglan, blazer stylish, waffle textured, sleeves ribbed"
63120,159390.1.5EDD.E5A11790F33EDD68.US-10284375GO-6-AI24,"The Editor Man Sweatshirt White Size L Cotton, Polyester",Apparel & Accessories >Clothing >Shirts & Tops,"brand logo, multipockets color, collar long, size, lining zipper"
41331,159390.1.5EDD.6325ADAC31DD1B5B.US-10482406RE-7-AI24,"Erika Cavallini Woman Top Ivory Size M Alpaca wool, Virgin Wool, Polyamide",Apparel & Accessories >Clothing >Shirts & Tops,"sleeves pockets, color detachable, medium weight, ivory, knit appliqués"
88636,127.2.DFF8DD86A0648144.107DA92078A4ED86.197178693216,Democracy Women's Mineral Washed Embroidered Sweatshirt - Dusty Blue,Apparel & Accessories >Clothing >Shirts & Tops,"hem mineral, finish embroidered, women long, scoop neck, sweatshirt features"
84936,202186.2.29701ABDE4D61068.F3D79DB39A80914E.80IM9L024-WDA4MzE1,Printed Jersey Turtleneck,Clothing & Accessories >Clothing >Shirts & Tops,"vary, model, placement, wearing size42, print placement"
31776,159390.1.5EDD.D8D01AE8D151976B.US-14469151HT-2-AI24,Bottega Veneta Woman Sweater Black Size 2 Silk,Apparel & Accessories >Clothing >Shirts & Tops,"pockets dry, solid, bleach tumble, black size, knit appliqués"
73691,159390.1.5EDD.B21A7EAEFF191100.US-10212151FU-7-AI24,"Dsquared2 Man T-shirt Light grey Size M Cotton, Viscose",Apparel & Accessories >Clothing >Shirts & Tops,"collar short, color light, pockets, brand, logo solid"
84610,159390.1.5EDD.5E7719B2F2A3E4EC.US-10261615CB-7-AI24,Some Ware Man T-shirt Black Size XL Organic cotton,Apparel & Accessories >Clothing >Shirts & Tops,"pattern crew, pockets, print brand, logo multicolor, size xl"
34451,159390.1.5EDD.48882FA69169598A.US-10272677QE-9-AI24,Tommy Jeans Woman Shirt White Size L Viscose,Apparel & Accessories >Clothing >Shirts & Tops,"dot, closing pockets, sleeves button, color white, crepe brand"
53929,159390.1.5EDD.52A53EB6B3D6042E.US-14472700KV-9-AI24,Brunello Cucinelli Man Turtleneck Burgundy Size 50 Cashmere,Apparel & Accessories >Clothing >Shirts & Tops,"pockets stretch, wash dry, lightweight, knit appliqués, color burgundy"


In [54]:
df_sample[df_sample['Category'].str.contains('Pants', case=False, na=False)][['Pid', 'Name', 'Category', 'tags_from_keybert_str']].head(10)

Unnamed: 0,Pid,Name,Category,tags_from_keybert_str
89577,160244.2.C6DCFA4883B330A3.AC19CCD9F9E63E35.696691483958,Connected Petite Popover Jumpsuit - Dark Plum,Apparel & Accessories >Clothing >Pants,"petite, soft popover, bodice, chic connected, jumpsuit finished"
84463,161091.2.52965A6F00B51D20.1B5422A9D74B2B7B.5701309,Men's Traditional Fit Stretch Jeans - Lands' End - Blue - 31,Apparel & Accessories >Clothing >Pants,"follow footsteps, durable 12, rigger, introduced square, denim breaks"
39525,127.2.DFF8DD86A0648144.400F818D3B19973B.755403668516,Dkny Sport Women's Cargo Bungee-Hem Pants - Black,Apparel & Accessories >Clothing >Pants,"style, cargo, pair bungee, relaxed fit, drawcord waistband"
13799,159390.1.5EDD.F161F1005E5C115F.US-13891094SL-3-AI24,"Y's Yohji Yamamoto Woman Pants Black Size 2 Cotton, Bamboo fiber",Apparel & Accessories >Clothing >Pants,"regular fit, solid color, chinos large, unlined zipper, fleece"
29231,159390.1.5EDD.CD0860EF397B4AED.US-13587118UW-80-AI24,"Dondup Kid Girl Pants Burgundy Size 12 Polyester, Acetate, Polyamide",Apparel & Accessories >Clothing >Pants,"size 12, multipockets wash, multicolor pattern, jacquard, cleanable iron"
7144,127.2.DFF8DD86A0648144.ADBAFFFC1FCE1A5.5056270178630,Seraphine Women's Denim Maternity Overalls - Black,Apparel & Accessories >Clothing >Pants,"adjustable panels, mama black, expertly tailored, pregnancy, soft stretch"
99836,160244.2.C6DCFA4883B330A3.ECA81A1348343BD3.670589702611,Michael Kors Men's Classic Fit Performance Dress Pants - Navy,Apparel & Accessories >Clothing >Pants,"classic, designed versatile, michael, dress, kors delivers"
62005,159390.1.5EDD.8B468F5FBA06BFBF.US-13884640WN-10-AI24,"Golden Goose Kid Boy Pants Slate blue Size 12 Cotton, Polyester",Apparel & Accessories >Clothing >Pants,"wash 30, color print, logo solid, sporty large, fleece"
94921,160244.2.C6DCFA4883B330A3.73433896681E666F.194136888042,"Charter Club Plus Size 100% Linen Cropped Pants, Created for Macy's - Bright White",Apparel & Accessories >Clothing >Pants,"modern pull, ease charter, size cropped, club plus, crisp linen"
62122,202186.2.499B3C49E28651BB.1EE08F18DB5D4483.81IM5B006-MDA4NQ2,,Apparel & Accessories >Clothing >Pants,


ChatGPT also suggested a MiniLM + FAISS-style keyword extraction pipeline. It is similar in concept to KeyBERT, but more customizable and potentially faster when properly tuned. The version below uses scikit-learn's `cosine_similarity` in place of FAISS. Since we don't plan to fine-tune the model, its out-of-the-box results look less accurate and reliable than KeyBERT's: the top keywords for each product are mostly overlapping variants of the same phrase.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

df_sample_2 = df.sample(frac=0.001, random_state=42)

# Load the MiniLM model
model = SentenceTransformer('all-MiniLM-L6-v2')

def extract_keywords(text, top_n=5, ngram_range=(1, 3)):
    if not isinstance(text, str) or len(text.strip()) == 0:
        return []

    # Generate candidate keywords/phrases
    try:
        vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit([text])
    except ValueError:
        # Raised when the text yields an empty vocabulary (e.g. only stop words)
        return []
    candidates = vectorizer.get_feature_names_out()

    # Embed document and candidates
    doc_embedding = model.encode([text], convert_to_tensor=True)
    candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

    # Compute cosine similarity
    similarities = cosine_similarity(doc_embedding.cpu().numpy(), candidate_embeddings.cpu().numpy())[0]

    # Sort and select top N
    top_indices = np.argsort(similarities)[::-1][:top_n]
    return [candidates[i] for i in top_indices]

# Example usage on a DataFrame
df_sample_2["keywords"] = df_sample_2["Description"].apply(lambda x: extract_keywords(x, top_n=5))

In [56]:
pd.set_option('display.max_colwidth', None)
df_sample_2[df_sample_2['Category'].str.contains('Shoes', case=False, na=False)][['Pid', 'Name', 'Category', 'keywords']].head(10)

Unnamed: 0,Pid,Name,Category,keywords
76434,158391.2.5542A8E9F71ED4D3.106B68B52D8A3997.439454,Mens Minnetonka Casey Slipper - Charcoal,Apparel & Accessories >Shoes,"[new casey slipper, casey slipper features, casey slipper, casey slipper minnetonka, way casey slipper]"
80917,127.2.DFF8DD86A0648144.3F5EC200C893296E.197943396380,Franco Sarto Women's Stevie Mid Shaft Boots - Bronze Leather,Apparel & Accessories >Shoes,"[calf boot women, mid calf boot, calf boot, women boots high, women boots]"
51685,127.2.DFF8DD86A0648144.FBDE27B7AFFBA569.196371757732,Easy Street Women's Feena Slingback Pumps - Nude,Apparel & Accessories >Shoes,"[feena easy street, feena easy, feena, easy street versatile, easy street]"
39691,159390.1.5EDD.A616FA04F2C8CF41.US-17898774CI-9-AI24,Tua By Braccialini Woman Sandals Beige Size 8 Textile fibers,Apparel & Accessories >Shoes,"[sandals color beige, textile parts animal, coated fabric, contains non textile, color leather lining]"
32260,127.2.DFF8DD86A0648144.9F032A3E13FFD1C9.197651056965,Lucky Brand Women's Carolie Strappy Espadrille Wedge Sandals - Coffee Quart Leather,Apparel & Accessories >Shoes,"[carolie wedges, breezy espadrille heels, espadrille heels, carolie wedges lucky, espadrille heels asymmetrical]"
49182,159390.1.5EDD.8C9A5B4686172176.US-17778370UQ-11-AI24,Pollini Woman Sandals Tan Size 9 Leather,Apparel & Accessories >Shoes,"[leather sole contains, color leather, textured leather, heel leather, textured leather appliqués]"
55325,127.2.DFF8DD86A0648144.5E60F5EE8E431DC4.194655553469,BCBGeneration Women's Darmena Kitten Heel Dress Booties - Black,Apparel & Accessories >Shoes,"[bcbgeneration dress bootie, dress bootie darmena, dress bootie, darmena pointed toe, bcbgeneration dress]"


### Conclusion

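The keywords generated by KeyBERT look promising: compared with the MiniLM pipeline above, they are more varied and cover more product attributes. To enrich the text embeddings, we could run KeyBERT over the descriptions to generate a keyword tag column and include it in the text that gets embedded, as in `CombinedInfo` below. The main cost is runtime: the 10,000-row sample took 44.69 minutes.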

In [57]:
df_sample['CombinedInfo'] = df_sample.apply(
    lambda row: f"name: {row['Name']}, gender: {row['Gender']}, brand: {row['MergedBrand']}, category: {row['Category']}, tag: {row['tags_from_keybert_str']}",
    axis=1
)
df_sample[['Pid', 'Name', 'Category', 'CombinedInfo']]

Unnamed: 0,Pid,Name,Category,CombinedInfo
75721,226170.2.82D42C0152FBB491.4B463957A03AAFCE.245VR8SN250XL,Falken Sincera SN250 A/S 225/45-18 XL 95V Grand Touring All-Season Tire 28294339,Vehicles & Parts >Vehicle Parts & Accessories >Motor Vehicle Parts >Motor Vehicle Wheel Systems >Motor Vehicle Tires >Automotive Tires,"name: Falken Sincera SN250 A/S 225/45-18 XL 95V Grand Touring All-Season Tire 28294339, gender: nan, brand: Falken, category: Vehicles & Parts >Vehicle Parts & Accessories >Motor Vehicle Parts >Motor Vehicle Wheel Systems >Motor Vehicle Tires >Automotive Tires , tag: sincera, falken, grand touring, sn250 195, tire price"
80184,178866.156074.820F1205554371C6.B6C77FD0E98449E6.COM9780295988689USED,Walls of Algiers: Narratives of the City through Text and Image [first edition],,"name: Walls of Algiers: Narratives of the City through Text and Image [first edition], gender: nan, brand: nan, category: nan, tag: ottoman, urban studies, paintings architectural, history walls, algiers serves"
19864,178866.156074.99167FE934A5BC7F.B66A9A98420EE68A.bi: 30934116025,"[Signed] The Inaugural Album Carter, Jimmy [Near Fine] [Hardcover]",Office Supplies >Book Accessories >Book Lights,"name: [Signed] The Inaugural Album Carter, Jimmy [Near Fine] [Hardcover], gender: nan, brand: nan, category: Office Supplies >Book Accessories >Book Lights , tag: cloth title, illustrated endpapers, signed president, beige, quarto 36pp"
76699,178866.156074.820F1205554371C6.2778968A7B7BC45B.COM9781887983129USED,Working With the Poor: New Insights and Learnings from Development Practitioners,,"name: Working With the Poor: New Insights and Learnings from Development Practitioners, gender: nan, brand: nan, category: nan, tag: development sustainable, spiritual realm, christian practitioners, understanding poverty, transformation urban"
92991,159390.1.5EDD.6CB3287A75491FCF.US-15300316SI-4-AI24,"Bikkembergs Infant Boy Baby set Grey Size 6 Cotton, Elastane",Apparel & Accessories >Clothing >Baby & Toddler Clothing >Baby & Toddler Outfits,"name: Bikkembergs Infant Boy Baby set Grey Size 6 Cotton, Elastane, gender: male, brand: BIKKEMBERGS, category: Apparel & Accessories >Clothing >Baby & Toddler Clothing >Baby & Toddler Outfits , tag: wash 30, pockets, multicolor pattern, size, brand logo"
...,...,...,...,...
5002,159390.1.5EDD.DF5C7E19B754023D.US-11835500WE-3-AI24,Pedro García Woman Thong sandal Blue Size 5 Textile fibers,Apparel & Accessories >Shoes,"name: Pedro García Woman Thong sandal Blue Size 5 Textile fibers, gender: female, brand: PEDRO GARCÍA, category: Apparel & Accessories >Shoes , tag: design square, size, cleated sole, appliqués floral, satin"
30151,178866.156074.820F1205554371C6.56D766DB4D107543.COM9780993467868USED,Dip In Brilliant: An Indian Recipe Adventure with a Contemporary Twist,,"name: Dip In Brilliant: An Indian Recipe Adventure with a Contemporary Twist, gender: nan, brand: nan, category: nan, tag: purchase signed, brilliant, dip, anand new, copy chef"
93194,127.2.DFF8DD86A0648144.5789065A624CB8BF.190052019959,Superior Memory Foam Wedge Pillow with Removable Cover - White,Home & Garden >Linens & Bedding >Bedding >Pillows,"name: Superior Memory Foam Wedge Pillow with Removable Cover - White, gender: unisex, brand: Superior, category: Home & Garden >Linens & Bedding >Bedding >Pillows , tag: heat sensitive, wedge, makes breathing, relaxation memory, pillow shape"
73199,127.2.DFF8DD86A0648144.85179F33E58E00F3.724190605080,"Mepra Serving Set Fork and Spoon Flatware Set, Set of 2 - Silver-tone",Home & Garden >Kitchen & Dining >Tableware >Flatware >Flatware Sets,"name: Mepra Serving Set Fork and Spoon Flatware Set, Set of 2 - Silver-tone, gender: unisex, brand: Mepra, category: Home & Garden >Kitchen & Dining >Tableware >Flatware >Flatware Sets , tag: double, ergonomic, highest, quality, serration durable"


In [36]:
df_sample['CombinedInfo'].loc[df_sample['Pid'] == '127.2.DFF8DD86A0648144.5D97B0B9DB1AB9FB.194573409152']

29857    name: Lands' End Big & Tall Super-t Long Sleeve T-Shirt with Pocket - Rich burgundy, gender: male, brand: Lands' End, category: Apparel & Accessories >Clothing >Shirts & Tops , tag: end men, jersey knit, cotton stretches, strongest long, tee ll
Name: CombinedInfo, dtype: object
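
Several `CombinedInfo` values above contain literal `nan` fields (for example `gender: nan, brand: nan`), which add noise to the text being embedded. One possible way to avoid this is to skip missing fields when building the string. The sketch below is an assumption about how that could look; `is_missing` and `build_combined_info` are hypothetical helpers, not part of the notebook.

```python
import math

def is_missing(value):
    """True for None, float NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""

def build_combined_info(fields):
    """Join 'label: value' pairs in order, skipping missing values."""
    parts = [
        f"{label}: {str(value).strip()}"
        for label, value in fields.items()
        if not is_missing(value)
    ]
    return ", ".join(parts)

example = build_combined_info({
    "name": "Barrier Free Travel: Utah National Parks for Wheelers and Slow Walkers",
    "gender": float("nan"),
    "brand": None,
    "category": float("nan"),
    "tag": "travel",
})
print(example)
# name: Barrier Free Travel: Utah National Parks for Wheelers and Slow Walkers, tag: travel
```

With pandas, this could be applied row by row, e.g. `df_sample.apply(lambda row: build_combined_info({"name": row["Name"], "gender": row["Gender"], "brand": row["MergedBrand"], "category": row["Category"], "tag": row["tags_from_keybert_str"]}), axis=1)`.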