---

# Lumora : Multi-Label Classifier and Smart Search Feature for Contemporary Arts Craft of Filipinos

---
# Background of the Study


## Source of Data
to be continue...


## Brief Description of Dataset
This dataset is designed to train a model for the platform Lumora, which aims to support Filipino artisans. It is a table of product listings, with each row representing a unique handcrafted item.


## Model Variables
to be continue...


## Objectives
The objective is to develop a multi-label NLP classifier to automatically assign relevant categories and stylistic attributes to new product listings on the Lumora C2C e-commerce platform. The model will improve product discoverability by generating tags (e.g., cute, crochet, pastel, minimalist) that allow the Smart Search feature to find items even with varied or imperfect user queries.

---

# Data Collection / Loading
description...

In [1]:
# Importing necessary libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import re
# initializing dataframe
df = pd.read_csv('LumoraProductDataset_refactored.csv')
df.head()

Unnamed: 0,Product name,Product description,Price,Category,Subcategory,Color,Size,Material,Tags,Product link,Image link,Brand / seller name
0,Flowers Convertible Puso Wedding Tote,A versatile hobo-style tote embroidered with f...,PHP 11172.22,Bags,Wedding Tote,White,Unspecified,"Upcycled fabric, leather","wedding, tote, floral embroidery, Filipino, su...",Unspecified,Unspecified,SintaWeddings
1,Manila Jeepney 3-in-1 Handbag,A colorful handbag inspired by the iconic jeep...,PHP 12406.79,Bags,Handbag,Multicolor,Unspecified,"Upcycled fabric, leather","jeepney, handbag, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings
2,Vinia Hardin Fanny Pack,A belt-style fanny pack handwoven with upcycle...,PHP 4875.93,Bags,Fanny Pack,Black,Unspecified,"Upcycled fabric, leather","fanny pack, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings
3,Sling Bag (Pinilian/Inabel Weave),A crossbody sling bag showcasing traditional P...,PHP 5554.94,Bags,Sling Bag,Blue,Unspecified,"Upcycled fabric, Pinilian/Inabel weave","sling bag, Filipino, handwoven, sustainable",Unspecified,Unspecified,SintaWeddings
4,Alon Woven Waves Shoulder Bag,"A shoulder bag with wave-pattern weaving, comb...",PHP 12653.70,Bags,Shoulder Bag,Blue,Unspecified,"Upcycled fabric, leather","shoulder bag, woven waves, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings


---
# Data Information and Summary Statistics
description...

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Product name         644 non-null    object
 1   Product description  644 non-null    object
 2   Price                644 non-null    object
 3   Category             644 non-null    object
 4   Subcategory          644 non-null    object
 5   Color                644 non-null    object
 6   Size                 644 non-null    object
 7   Material             644 non-null    object
 8   Tags                 644 non-null    object
 9   Product link         644 non-null    object
 10  Image link           644 non-null    object
 11  Brand / seller name  639 non-null    object
dtypes: object(12)
memory usage: 60.5+ KB


In [3]:
df.describe()

Unnamed: 0,Product name,Product description,Price,Category,Subcategory,Color,Size,Material,Tags,Product link,Image link,Brand / seller name
count,644,644,644,644,644,644,644,644,644,644,644,639
unique,601,560,120,46,150,71,68,102,565,4,3,15
top,Filipino Capiz Parol Wooden Christmas Ornament,"Downloadable graphic design, shirt design, mad...",PHP 1604.94,Jewelry,Ring,Multicolor,Standard,Brass,"Filipino Christmas lantern, parol, capiz shell",Unspecified,Unspecified,PrettyfulShop
freq,7,11,62,151,59,189,362,64,10,510,505,172


In [4]:
df.shape

(644, 12)

In [13]:
category_df = df['Category'].value_counts().to_frame()
print(category_df)

                            count
Category                         
Jewelry                       151
Vintage                        59
Ornaments                      32
POD                            28
Stickers                       25
Clothing                       23
Philippine Handicrafts         22
Digital Downloads              22
Philippine Souvenir            18
Pasko & Parols                 17
Bundle Deals                   16
Accessories                    16
Wedding Ceremony               16
Bags                           16
Stickers/Decals                16
Apparel                        13
Keychains/Charms               13
Filipiniana Attire             13
Prints                         12
Pinoy Keychains & Charms       10
Vintage Movies                  9
Stationery & Stickers           9
Keychains                       8
Native                          7
Printables                      5
Decor                           4
Mugs                            4
Art Prints    

In [5]:
df.isnull().sum()

Product name           0
Product description    0
Price                  0
Category               0
Subcategory            0
Color                  0
Size                   0
Material               0
Tags                   0
Product link           0
Image link             0
Brand / seller name    5
dtype: int64

## Summary:
- findings...
- findings...

---
# Data Cleaning
This section shows the critical data cleaning steps applied to the raw text datasets before feature engineering and pre-processing of the dataset. The goal of these steps is to transform noisy, unstructured text into a clean and consistent format.

## Handle Missing Values
Missing data (NaN) in text columns can cause errors or be treated as a useless string. It must be fill with a specific, non-empty placeholder like an empty string ('') before concatenation.

In [6]:
# Cleaning Column Names
df.columns = df.columns.str.strip()
df.columns

Index(['Product name', 'Product description', 'Price', 'Category',
       'Subcategory', 'Color', 'Size', 'Material', 'Tags', 'Product link',
       'Image link', 'Brand / seller name'],
      dtype='object')

In [7]:
# Handling Missing Values
df = df.fillna('')
df.isnull().sum()

Product name           0
Product description    0
Price                  0
Category               0
Subcategory            0
Color                  0
Size                   0
Material               0
Tags                   0
Product link           0
Image link             0
Brand / seller name    0
dtype: int64

## Handle Duplicate Rows 
description...

In [8]:
# handle duplicate rows
df = df.drop_duplicates()
df.shape

(618, 12)

## Handle Inconsistent Data
lowercasing, removal of special characters

In [10]:
# 1. Strip whitespace from all text columns
df_clean = df.copy()

text_columns = ['Product name', 'Product description', 'Category', 'Subcategory', 'Size', 'Material', 'Tags']

for col in text_columns:
    # Remove leading/trailing whitespace
    df_clean[col] = df_clean[col].str.strip()
    # Remove extra spaces between words (replace multiple spaces with single space)
    df_clean[col] = df_clean[col].str.replace(r'\s+', ' ', regex=True)

print("✓ Whitespace standardized")

✓ Whitespace standardized


In [None]:
# 2. Fix inconsistent capitalization in Categorical Data Columns

# Standardize COLOR field
df_clean['COLOR'] = df_clean['COLOR'].str.title()

# standardize SIZE field
df_clean['SIZE'] = df_clean['SIZE'].str.upper()

# Standardize MATERIAL field
df_clean['MATERIAL'] = df_clean['MATERIAL'].str.title()

print("✓ Capitalization standardized")
df_clean.head(15)

✓ Capitalization standardized


Unnamed: 0,PRODUCT TITLE,PRODUCT DESCRIPTION,COLOR,SIZE,MATERIAL
0,Halo-Halo Keychain,Miniature halo-halo dessert keychain made from...,"Transparent, Multi-Colored",SMALL,"Clay, Resin"
1,Filipino Funny Magnets,A set of novelty refrigerator magnets featurin...,Multi-Colored,2X2 INCHES,Vinyl-Coated Magnetic Sheet
2,"Filipino Snacks Parody Stickers (Skyflakes, Su...",A set of Filipino parody snack stickers inspir...,Multi-Colored,SUSMARYOSEP: 3.5 IN X 2 IN AND POTCHA: 3.25 IN...,Vinyl Sticker With Matte Finish
3,Philippine-Inspired Monogram Keychain,A 2-inch tall handmade keychain featuring a mo...,"Red, Blue, Yellow",2 IN,"Resin, Sticker, Glitter, Metal Findings"
4,Filipino Dessert Drink Stickers: Taho and Iskr...,A pair of illustrated stickers inspired by ico...,Multi-Colored,APPROX. 3 IN X 2.3 IN,Vinyl Sticker With Matte Finish
5,Wala Akong Pake! - Filipino Plastic Bag Sticke...,A waterproof vinyl sticker designed as a playf...,White With Orange-Black Accents,APPROX. 3 IN X 2.75 IN,Vinyl Sticker With Matte Finish
6,"I'm Not Late, I'm on Filipino Time! - Funny Fi...",A 3-inch vinyl sticker featuring a hand-drawn ...,Multi-Colored,3 X 3 IN,Vinyl Sticker With Matte Finish
7,Isaw (Chicken Gizzard) - Philippine Street Foo...,A 3-inch vinyl sticker featuring a cartoon-sty...,Brown,3 X 3 IN,Vinyl Sticker With Matte Finish
8,Baguio-Inspired Barrel Man: A Wood Carved Stic...,A 4-inch vinyl sticker inspired by the iconic ...,Multi-Colored,4 IN X 3 IN,Vinyl Sticker With Matte Finish
9,Laban Lang! - Motivational Sticker Tag,A 4-inch vinyl sticker featuring a pop-up wind...,Black/Yellow,APPROX. 4 IN X 1.25 IN,"Vinyl, Ink"


In [None]:
# 3. Standardize SIZE format inconsistencies
def standardize_size(size_str):
    """Standardize size formatting"""
    if pd.isna(size_str) or size_str == '':
        return ''
    
    size_str = str(size_str).strip()
    
    # Standardize "inches" variations
    size_str = re.sub(r'\binches\b', 'in', size_str, flags=re.IGNORECASE)
    size_str = re.sub(r'\binch\b', 'in', size_str, flags=re.IGNORECASE)
    
    # Standardize "x" separator (remove spaces around 'x')
    size_str = re.sub(r'\s*x\s*', ' x ', size_str, flags=re.IGNORECASE)
    
    # Standardize "approx." variations
    size_str = re.sub(r'\bapprox\.?\b', 'Approx.', size_str, flags=re.IGNORECASE)
    
    # Standardize "diameter"
    size_str = re.sub(r'\bdiameter\b', 'diameter', size_str, flags=re.IGNORECASE)
    
    # Handle "Unidentified" consistently
    if size_str.lower() == 'unidentified':
        return 'Unidentified'
    
    return size_str.strip()

df_clean['SIZE'] = df_clean['SIZE'].apply(standardize_size)

print("✓ Size formatting standardized")
df_clean.head(15)

✓ Size formatting standardized


Unnamed: 0,PRODUCT TITLE,PRODUCT DESCRIPTION,COLOR,SIZE,MATERIAL
0,Halo-Halo Keychain,Miniature halo-halo dessert keychain made from...,"Transparent, Multi-Colored",SMALL,"Clay, Resin"
1,Filipino Funny Magnets,A set of novelty refrigerator magnets featurin...,Multi-Colored,2 x 2 in,Vinyl-Coated Magnetic Sheet
2,"Filipino Snacks Parody Stickers (Skyflakes, Su...",A set of Filipino parody snack stickers inspir...,Multi-Colored,SUSMARYOSEP: 3.5 IN x 2 IN AND POTCHA: 3.25 IN...,Vinyl Sticker With Matte Finish
3,Philippine-Inspired Monogram Keychain,A 2-inch tall handmade keychain featuring a mo...,"Red, Blue, Yellow",2 IN,"Resin, Sticker, Glitter, Metal Findings"
4,Filipino Dessert Drink Stickers: Taho and Iskr...,A pair of illustrated stickers inspired by ico...,Multi-Colored,APPRO x . 3 IN x 2.3 IN,Vinyl Sticker With Matte Finish
5,Wala Akong Pake! - Filipino Plastic Bag Sticke...,A waterproof vinyl sticker designed as a playf...,White With Orange-Black Accents,APPRO x . 3 IN x 2.75 IN,Vinyl Sticker With Matte Finish
6,"I'm Not Late, I'm on Filipino Time! - Funny Fi...",A 3-inch vinyl sticker featuring a hand-drawn ...,Multi-Colored,3 x 3 IN,Vinyl Sticker With Matte Finish
7,Isaw (Chicken Gizzard) - Philippine Street Foo...,A 3-inch vinyl sticker featuring a cartoon-sty...,Brown,3 x 3 IN,Vinyl Sticker With Matte Finish
8,Baguio-Inspired Barrel Man: A Wood Carved Stic...,A 4-inch vinyl sticker inspired by the iconic ...,Multi-Colored,4 IN x 3 IN,Vinyl Sticker With Matte Finish
9,Laban Lang! - Motivational Sticker Tag,A 4-inch vinyl sticker featuring a pop-up wind...,Black/Yellow,APPRO x . 4 IN x 1.25 IN,"Vinyl, Ink"


In [None]:
# 4. Standardize MATERIAL field
def standardize_materials(material_str):
    """Standardize material names and formatting"""
    if pd.isna(material_str) or material_str == '':
        return ''
    
    material_str = str(material_str).strip()
    
    # Split by comma, clean each material, and rejoin
    materials = [m.strip() for m in material_str.split(',')]
    
    # Standardize common material names
    material_mapping = {
        'resin': 'Resin',
        'clay': 'Clay',
        'polymer clay': 'Polymer Clay',
        'vinyl': 'Vinyl',
        'ink': 'Ink',
        'foam': 'Foam',
        'beads': 'Beads',
        'metal': 'Metal',
        'glitter': 'Glitter',
        'sticker': 'Sticker',
        'vinyl sticker with matte finish': 'Vinyl Sticker (Matte Finish)',
        'vinyl-coated magnetic sheet': 'Vinyl-Coated Magnetic Sheet',
        'capiz shell': 'Capiz Shell',
        'brass': 'Brass',
        'pearl': 'Pearl',
        'silk organza': 'Silk Organza',
        'stainless steel': 'Stainless Steel',
        'polysatin': 'Polysatin',
        'canvas': 'Canvas',
        'zipper': 'Zipper'
    }
    
    standardized_materials = []
    for mat in materials:
        mat_lower = mat.lower().strip()
        standardized = material_mapping.get(mat_lower, mat.title())
        standardized_materials.append(standardized)
    
    return ', '.join(standardized_materials)

df_clean['MATERIAL'] = df_clean['MATERIAL'].apply(standardize_materials)

print("✓ Material names standardized")
df_clean.head(10)

✓ Material names standardized


Unnamed: 0,PRODUCT TITLE,PRODUCT DESCRIPTION,COLOR,SIZE,MATERIAL
0,Halo-Halo Keychain,Miniature halo-halo dessert keychain made from...,"Transparent, Multi-Colored",SMALL,"Clay, Resin"
1,Filipino Funny Magnets,A set of novelty refrigerator magnets featurin...,Multi-Colored,2 x 2 in,Vinyl-Coated Magnetic Sheet
2,"Filipino Snacks Parody Stickers (Skyflakes, Su...",A set of Filipino parody snack stickers inspir...,Multi-Colored,SUSMARYOSEP: 3.5 IN x 2 IN AND POTCHA: 3.25 IN...,Vinyl Sticker (Matte Finish)
3,Philippine-Inspired Monogram Keychain,A 2-inch tall handmade keychain featuring a mo...,"Red, Blue, Yellow",2 IN,"Resin, Sticker, Glitter, Metal Findings"
4,Filipino Dessert Drink Stickers: Taho and Iskr...,A pair of illustrated stickers inspired by ico...,Multi-Colored,APPRO x . 3 IN x 2.3 IN,Vinyl Sticker (Matte Finish)
5,Wala Akong Pake! - Filipino Plastic Bag Sticke...,A waterproof vinyl sticker designed as a playf...,White With Orange-Black Accents,APPRO x . 3 IN x 2.75 IN,Vinyl Sticker (Matte Finish)
6,"I'm Not Late, I'm on Filipino Time! - Funny Fi...",A 3-inch vinyl sticker featuring a hand-drawn ...,Multi-Colored,3 x 3 IN,Vinyl Sticker (Matte Finish)
7,Isaw (Chicken Gizzard) - Philippine Street Foo...,A 3-inch vinyl sticker featuring a cartoon-sty...,Brown,3 x 3 IN,Vinyl Sticker (Matte Finish)
8,Baguio-Inspired Barrel Man: A Wood Carved Stic...,A 4-inch vinyl sticker inspired by the iconic ...,Multi-Colored,4 IN x 3 IN,Vinyl Sticker (Matte Finish)
9,Laban Lang! - Motivational Sticker Tag,A 4-inch vinyl sticker featuring a pop-up wind...,Black/Yellow,APPRO x . 4 IN x 1.25 IN,"Vinyl, Ink"


In [None]:
# 5. Fix special characters and encoding issues
def clean_special_chars(text):
    """Remove or replace problematic special characters"""
    if pd.isna(text) or text == '':
        return ''
    
    text = str(text)
    
    # Replace problematic quotes
    text = text.replace('"', '"').replace('"', '"')
    text = text.replace(''', "'").replace(''', "'")
    
    # Remove zero-width spaces and other invisible characters
    text = re.sub(r'[\u200b-\u200f\u202a-\u202e\ufeff]', '', text)
    
    # Normalize em-dash and en-dash
    text = text.replace('—', '-').replace('–', '-')
    
    return text

for col in text_columns:
    df_clean[col] = df_clean[col].apply(clean_special_chars)

print("✓ Special characters cleaned")

✓ Special characters cleaned


In [None]:
# 6. Standardize common terms in PRODUCT DESCRIPTION
def standardize_description_terms(desc):
    """Standardize common terms and phrases in descriptions"""
    if pd.isna(desc) or desc == '':
        return ''
    
    desc = str(desc)
    
    # Standardize measurement units
    desc = re.sub(r'\binches\b', 'in', desc, flags=re.IGNORECASE)
    desc = re.sub(r'\binch\b', 'in', desc, flags=re.IGNORECASE)
    
    # Standardize "handmade/hand-made/hand made"
    desc = re.sub(r'\bhand[\s-]?made\b', 'handmade', desc, flags=re.IGNORECASE)
    desc = re.sub(r'\bhand[\s-]?crafted\b', 'handcrafted', desc, flags=re.IGNORECASE)
    
    # Standardize product type terms
    desc = re.sub(r'\bkey[\s-]?chain\b', 'keychain', desc, flags=re.IGNORECASE)
    desc = re.sub(r'\bwater[\s-]?proof\b', 'waterproof', desc, flags=re.IGNORECASE)
    
    return desc

df_clean['PRODUCT DESCRIPTION'] = df_clean['PRODUCT DESCRIPTION'].apply(standardize_description_terms)

print("✓ Description terms standardized")
df_clean.head(10)

✓ Description terms standardized


Unnamed: 0,PRODUCT TITLE,PRODUCT DESCRIPTION,COLOR,SIZE,MATERIAL
0,Halo-Halo Keychain,Miniature halo-halo dessert keychain made from...,"Transparent, Multi-Colored",SMALL,"Clay, Resin"
1,Filipino Funny Magnets,A set of novelty refrigerator magnets featurin...,Multi-Colored,2 x 2 in,Vinyl-Coated Magnetic Sheet
2,"Filipino Snacks Parody Stickers (Skyflakes, Su...",A set of Filipino parody snack stickers inspir...,Multi-Colored,SUSMARYOSEP: 3.5 IN x 2 IN AND POTCHA: 3.25 IN...,Vinyl Sticker (Matte Finish)
3,Philippine-Inspired Monogram Keychain,A 2-in tall handmade keychain featuring a mono...,"Red, Blue, Yellow",2 IN,"Resin, Sticker, Glitter, Metal Findings"
4,Filipino Dessert Drink Stickers: Taho and Iskr...,A pair of illustrated stickers inspired by ico...,Multi-Colored,APPRO x . 3 IN x 2.3 IN,Vinyl Sticker (Matte Finish)
5,Wala Akong Pake! - Filipino Plastic Bag Sticke...,A waterproof vinyl sticker designed as a playf...,White With Orange-Black Accents,APPRO x . 3 IN x 2.75 IN,Vinyl Sticker (Matte Finish)
6,"I'm Not Late, I'm on Filipino Time! - Funny Fi...",A 3-in vinyl sticker featuring a hand-drawn wr...,Multi-Colored,3 x 3 IN,Vinyl Sticker (Matte Finish)
7,Isaw (Chicken Gizzard) - Philippine Street Foo...,A 3-in vinyl sticker featuring a cartoon-style...,Brown,3 x 3 IN,Vinyl Sticker (Matte Finish)
8,Baguio-Inspired Barrel Man: A Wood Carved Stic...,A 4-in vinyl sticker inspired by the iconic Ba...,Multi-Colored,4 IN x 3 IN,Vinyl Sticker (Matte Finish)
9,Laban Lang! - Motivational Sticker Tag,A 4-in vinyl sticker featuring a pop-up window...,Black/Yellow,APPRO x . 4 IN x 1.25 IN,"Vinyl, Ink"


In [None]:

# 7. Check for and report remaining inconsistencies
print("\n" + "="*60)
print("INCONSISTENCY REPORT")
print("="*60)

# Check COLOR field
unique_colors = df_clean['COLOR'].value_counts()
print(f"\nUnique COLOR values ({len(unique_colors)}):")
print(unique_colors.head(10))

# Check SIZE patterns
unique_sizes = df_clean['SIZE'].value_counts()
print(f"\nUnique SIZE values ({len(unique_sizes)}):")
print(unique_sizes.head(10))

# Check MATERIAL patterns
unique_materials = df_clean['MATERIAL'].value_counts()
print(f"\nUnique MATERIAL values ({len(unique_materials)}):")
print(unique_materials.head(10))

# 8. Display before/after comparison for verification
print("\n" + "="*60)
print("BEFORE vs AFTER COMPARISON (Sample)")
print("="*60)

sample_idx = 2  # You can change this to check different rows
print(f"\nRow {sample_idx} - ORIGINAL:")
print(f"Size: '{df.loc[sample_idx, 'SIZE']}'")
print(f"Material: '{df.loc[sample_idx, 'MATERIAL']}'")

print(f"\nRow {sample_idx} - CLEANED:")
print(f"Size: '{df_clean.loc[sample_idx, 'SIZE']}'")
print(f"Material: '{df_clean.loc[sample_idx, 'MATERIAL']}'")

print("\n✓ Data cleaning completed successfully!")
print(f"\nShape: {df_clean.shape}")
# Optional: Save cleaned data
# df_clean.to_csv('Product_Datasets_Cleaned.csv', index=False)
# print("\n✓ Cleaned data saved to 'Product_Datasets_Cleaned.csv'")


INCONSISTENCY REPORT

Unique COLOR values (17):
COLOR
Multi-Colored                      30
Translucent White/Gold              3
Transparent, Multi-Colored          2
Red, Blue, Yellow, White            2
Antique Brass                       2
Black/Yellow                        1
Red, Blue, Yellow                   1
White With Orange-Black Accents     1
Brown                               1
White/Red/Pink                      1
Name: count, dtype: int64

Unique SIZE values (37):
SIZE
3 IN x 3 IN                                                 6
Unidentified                                                4
APPRO x . 2 IN TALL                                         3
3 x 3 IN                                                    2
SMALL                                                       2
APPRO x . 4 IN x 2 IN                                       2
APPRO x . 4 IN x 1.25 IN                                    2
SUSMARYOSEP: 3.5 IN x 2 IN AND POTCHA: 3.25 IN x 1.75 IN    1
2 x 2 in    

In [None]:
# save cleaned data
df_clean.to_csv('Product_Datasets_Cleaned.csv', index=False)
print("\n✓ Cleaned data saved to 'Product_Datasets_Cleaned.csv'")


✓ Cleaned data saved to 'Product_Datasets_Cleaned.csv'


## Summary:
- sdfad
- adfsdf

---

# Data Engineering / Pre-processing
- Tokenization (splitting text into individual words/tokens)
- Removal of Stop Words (common words like "the," "is," "a")
- Stemming or Lemmatization (reducing words to their root form, e.g., "processing" $\rightarrow$ "process")

## Concatination

In [None]:
# initialize dataframe
df_cleaned = pd.read_csv('Product_Datasets_Cleaned.csv')

# Concatenate text columns into TEXT_CONTENT
df_cleaned['TEXT_CONTENT'] = (
    df_cleaned['PRODUCT TITLE'].fillna('').astype(str) + ' ' +
    df_cleaned['PRODUCT DESCRIPTION'].fillna('').astype(str) + ' ' +
    df_cleaned['COLOR'].fillna('').astype(str) + ' ' +
    df_cleaned['SIZE'].fillna('').astype(str) + ' ' +
    df_cleaned['MATERIAL'].fillna('').astype(str)
)

print("✓ TEXT_CONTENT column created")

✓ TEXT_CONTENT column created


In [None]:
# Step 2: Text Standardization & Noise Reduction
def clean_text(text):
    """
    Clean and standardize text for NLP processing
    """
    # Handle NaN or non-string values
    if pd.isna(text) or text == '':
        return ''
    
    # Ensure text is string type
    text = str(text)
    
    # A. Lowercasing
    text = text.lower()
    
    # B. Remove HTML Tags/URLs (common in scraped descriptions)
    text = re.sub(r'<.*?>|http\S+|www\.\S+', '', text)
    
    # C. Remove Punctuation (keep letters, numbers, and space)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    
    # D. Remove Extra Whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the TEXT_CONTENT column
df_cleaned['TEXT_CONTENT'] = df_cleaned['TEXT_CONTENT'].apply(clean_text)

print("✓ Text cleaning completed\n")

df_cleaned.to_csv('Product_Datasets_check.csv', index=False)
df_cleaned.head()

✓ Text cleaning completed



Unnamed: 0,PRODUCT TITLE,PRODUCT DESCRIPTION,COLOR,SIZE,MATERIAL,TEXT_CONTENT
0,Halo-Halo Keychain,Miniature halo-halo dessert keychain made from...,"Transparent, Multi-Colored",SMALL,"Clay, Resin",halo halo keychain miniature halo halo dessert...
1,Filipino Funny Magnets,A set of novelty refrigerator magnets featurin...,Multi-Colored,2 x 2 in,Vinyl-Coated Magnetic Sheet,filipino funny magnets a set of novelty refrig...
2,"Filipino Snacks Parody Stickers (Skyflakes, Su...",A set of Filipino parody snack stickers inspir...,Multi-Colored,SUSMARYOSEP: 3.5 IN x 2 IN AND POTCHA: 3.25 IN...,Vinyl Sticker (Matte Finish),filipino snacks parody stickers skyflakes susm...
3,Philippine-Inspired Monogram Keychain,A 2-in tall handmade keychain featuring a mono...,"Red, Blue, Yellow",2 IN,"Resin, Sticker, Glitter, Metal Findings",philippine inspired monogram keychain a 2 in t...
4,Filipino Dessert Drink Stickers: Taho and Iskr...,A pair of illustrated stickers inspired by ico...,Multi-Colored,APPRO x . 3 IN x 2.3 IN,Vinyl Sticker (Matte Finish),filipino dessert drink stickers taho and iskra...


## Tokenization and Stopword Removal

In [None]:
# Word Frequency Analysis for Custom Stopwords Selection
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

print("Attempting to find 'punkt'...")
try:
    # This line checks if the resource is present in the standard search paths
    nltk.find('tokenizers/punkt')
    print("✅ Resource 'punkt' found. Please ensure you have restarted the kernel.")
except LookupError:
    # If it still fails, run the download command again just in case
    print("❌ Resource 'punkt' still not found. Downloading again...")
    nltk.download('punkt')

# Rerun the code after this.
# Get standard English stopwords
standard_stopwords = set(stopwords.words('english'))


Attempting to find 'punkt'...
✅ Resource 'punkt' found. Please ensure you have restarted the kernel.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:

print("="*70)
print("WORD FREQUENCY ANALYSIS FOR CUSTOM STOPWORDS")
print("="*70)

# ============================================================
# STEP 2: Tokenize All Text
# ============================================================
print("\n[STEP 2] Tokenizing all text...")

def basic_tokenize(text):
    """Basic tokenization with lowercase conversion"""
    if pd.isna(text) or text == '':
        return []
    
    text = text.lower()
    tokens = word_tokenize(text)
    
    # Remove punctuation and single characters
    tokens = [t for t in tokens if t.isalnum() and len(t) > 1]
    
    return tokens


df_cleaned['ALL_TOKENS'] = df_cleaned['TEXT_CONTENT'].apply(basic_tokenize)

# Flatten all tokens into single list
all_tokens = [token for tokens in df_cleaned['ALL_TOKENS'] for token in tokens]

print(f"✓ Total tokens extracted: {len(all_tokens)}")
print(f"✓ Unique tokens: {len(set(all_tokens))}")

# ============================================================
# STEP 3: Calculate Word Frequencies
# ============================================================
print("\n[STEP 3] Calculating word frequencies...")

# Get frequency distribution
word_freq = Counter(all_tokens)

print(f"✓ Word frequencies calculated")

# ============================================================
# STEP 4: Separate Words by Stopword Status
# ============================================================
print("\n[STEP 4] Categorizing words...")

# Words that are already stopwords
stopword_freq = {word: count for word, count in word_freq.items() 
                 if word in standard_stopwords}

# Words that are NOT stopwords (candidates for custom stopwords)
non_stopword_freq = {word: count for word, count in word_freq.items() 
                     if word not in standard_stopwords}

print(f"✓ Standard stopwords found: {len(stopword_freq)}")
print(f"✓ Non-stopwords found: {len(non_stopword_freq)}")

# ============================================================
# STEP 5: Display Complete Frequency Tables
# ============================================================
print("\n" + "="*70)
print("COMPLETE WORD FREQUENCY ANALYSIS")
print("="*70)

# Create DataFrames for better display
print("\n📊 ALL WORDS (Sorted by Frequency)")
print("-"*70)

all_freq_df = pd.DataFrame(word_freq.most_common(), 
                           columns=['Word', 'Frequency'])
all_freq_df['Is_Stopword'] = all_freq_df['Word'].apply(
    lambda x: 'Yes' if x in standard_stopwords else 'No'
)
all_freq_df['Cumulative_Frequency'] = all_freq_df['Frequency'].cumsum()
all_freq_df['Percentage'] = (all_freq_df['Frequency'] / len(all_tokens) * 100).round(2)

print(all_freq_df.to_string(index=False))

# ============================================================
# STEP 6: Top Non-Stopwords (Candidates for Custom Stopwords)
# ============================================================
print("\n" + "="*70)
print("🎯 TOP NON-STOPWORDS (Candidates for Custom Stopwords)")
print("="*70)
print("\nThese are frequent words NOT in standard stopwords.")
print("Review these to decide which should be added to custom_stopwords:\n")

non_stopword_df = pd.DataFrame(
    [(word, count) for word, count in non_stopword_freq.items()],
    columns=['Word', 'Frequency']
).sort_values('Frequency', ascending=False).reset_index(drop=True)

non_stopword_df['Percentage'] = (
    non_stopword_df['Frequency'] / len(all_tokens) * 100
).round(2)

print(non_stopword_df.head(50).to_string(index=True))

# ============================================================
# STEP 7: Category-Based Analysis
# ============================================================
print("\n" + "="*70)
print("📂 CATEGORY-BASED WORD ANALYSIS")
print("="*70)

# Common categories to identify
categories = {
    'Product_Types': ['sticker', 'keychain', 'magnet', 'earring', 'pin', 'bag', 'pouch'],
    'Materials': ['vinyl', 'resin', 'clay', 'metal', 'polymer', 'beads', 'foam'],
    'Colors': ['red', 'blue', 'yellow', 'green', 'brown', 'black', 'white', 'multi', 'colored'],
    'Sizes': ['inch', 'small', 'medium', 'large', 'tall', 'approx'],
    'Descriptors': ['handmade', 'handcrafted', 'featuring', 'inspired', 'designed', 'set'],
    'Filipino_Terms': ['filipino', 'philippine', 'pinoy', 'pinay', 'tagalog']
}

print("\nWord frequency by category:")
for category, words in categories.items():
    print(f"\n{category}:")
    for word in words:
        if word in word_freq:
            print(f"  {word:20s} : {word_freq[word]:3d}")

# ============================================================
# STEP 8: Visualizations
# ============================================================
print("\n[STEP 8] Creating visualizations...")

# Set style
sns.set_style("whitegrid")
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Top 30 Most Common Words (All)
ax1 = axes[0, 0]
top_30 = all_freq_df.head(30)
colors = ['red' if x == 'Yes' else 'steelblue' for x in top_30['Is_Stopword']]
ax1.barh(range(len(top_30)), top_30['Frequency'], color=colors)
ax1.set_yticks(range(len(top_30)))
ax1.set_yticklabels(top_30['Word'])
ax1.invert_yaxis()
ax1.set_xlabel('Frequency')
ax1.set_title('Top 30 Most Common Words\n(Red = Standard Stopwords)', fontweight='bold')
ax1.grid(axis='x', alpha=0.3)

# 2. Top 30 Non-Stopwords
ax2 = axes[0, 1]
top_30_non = non_stopword_df.head(30)
ax2.barh(range(len(top_30_non)), top_30_non['Frequency'], color='green', alpha=0.7)
ax2.set_yticks(range(len(top_30_non)))
ax2.set_yticklabels(top_30_non['Word'])
ax2.invert_yaxis()
ax2.set_xlabel('Frequency')
ax2.set_title('Top 30 Non-Stopwords\n(Candidates for Custom Stopwords)', fontweight='bold')
ax2.grid(axis='x', alpha=0.3)

# 3. Word Length Distribution
ax3 = axes[1, 0]
word_lengths = [len(word) for word in all_tokens]
ax3.hist(word_lengths, bins=range(2, 20), color='purple', alpha=0.7, edgecolor='black')
ax3.set_xlabel('Word Length')
ax3.set_ylabel('Frequency')
ax3.set_title('Distribution of Word Lengths', fontweight='bold')
ax3.grid(axis='y', alpha=0.3)

# 4. Cumulative Frequency (Pareto)
ax4 = axes[1, 1]
top_100 = all_freq_df.head(100)
ax4.plot(range(len(top_100)), top_100['Cumulative_Frequency'], color='darkorange', linewidth=2)
ax4.fill_between(range(len(top_100)), top_100['Cumulative_Frequency'], alpha=0.3, color='orange')
ax4.set_xlabel('Word Rank')
ax4.set_ylabel('Cumulative Frequency')
ax4.set_title('Cumulative Frequency (Top 100 Words)', fontweight='bold')
ax4.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('word_frequency_analysis.png', dpi=300, bbox_inches='tight')
print("✓ Visualizations saved as 'word_frequency_analysis.png'")

# ============================================================
# STEP 9: Save Frequency Tables to CSV
# ============================================================
print("\n[STEP 9] Saving frequency tables...")

# Save complete frequency table
all_freq_df.to_csv('all_word_frequencies.csv', index=False)
print("✓ Saved: all_word_frequencies.csv")

# Save non-stopwords only
non_stopword_df.to_csv('non_stopword_frequencies.csv', index=False)
print("✓ Saved: non_stopword_frequencies.csv")

# ============================================================
# STEP 10: Recommendations for Custom Stopwords
# ============================================================
print("\n" + "="*70)
print("💡 RECOMMENDATIONS FOR CUSTOM STOPWORDS")
print("="*70)

# Suggest words that appear very frequently but may not add semantic value
high_freq_threshold = 10  # Words appearing more than 10 times
suggested_custom_stopwords = []

print(f"\nWords appearing more than {high_freq_threshold} times:")
print("(Consider adding these to custom_stopwords if they don't add semantic value)\n")

for word, count in non_stopword_df.head(40).values:
    if count > high_freq_threshold:
        # Check if it's a generic descriptor
        generic_words = ['featuring', 'inspired', 'designed', 'made', 'set', 
                        'inch', 'product', 'item', 'perfect', 'ideal', 'great']
        status = "⚠️  CONSIDER" if word in generic_words else "✓ KEEP"
        suggested_custom_stopwords.append((word, count, status))
        print(f"{status:12s} - {word:20s} (frequency: {count})")

print("\n" + "="*70)
print("✓ WORD FREQUENCY ANALYSIS COMPLETE!")
print("="*70)
print(f"\nTotal unique words: {len(word_freq)}")
print(f"Total tokens processed: {len(all_tokens)}")
print(f"\nReview the CSV files and visualizations to decide on custom stopwords.")

WORD FREQUENCY ANALYSIS FOR CUSTOM STOPWORDS

[STEP 2] Tokenizing all text...


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\63920/nltk_data'
    - 'c:\\Users\\63920\\AppData\\Local\\Programs\\Python\\Python311\\nltk_data'
    - 'c:\\Users\\63920\\AppData\\Local\\Programs\\Python\\Python311\\share\\nltk_data'
    - 'c:\\Users\\63920\\AppData\\Local\\Programs\\Python\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\63920\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [None]:
'''
# Tokenization and Stopword Removal
# Import required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# Download required NLTK resources (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Get English stopwords
stop_words = set(stopwords.words('english'))
'''

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\63920\AppData\Roaming\nltk_data...


In [None]:
'''
custom_stopwords = {
    'product', 'item', 'featuring', 'made', 'designed', 'inspired',
    'perfect', 'ideal', 'great', 'comes', 'includes'
}
stop_words.update(custom_stopwords)

print(f"Total stopwords: {len(stop_words)}")
print(f"\nSample stopwords: {list(stop_words)[:20]}")
'''

In [None]:
'''
def tokenize_text(text):
    """
    Tokenize text into individual words
    
    Args:
        text: Input string to tokenize
    
    Returns:
        List of tokens (words)
    """
    if pd.isna(text) or text == '':
        return []
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize using NLTK
    tokens = word_tokenize(text)
    
    return tokens

# Apply tokenization
df_cleaned['tokens'] = df_cleaned['combined_text'].apply(tokenize_text)

# Display tokenization results
print("\nOriginal text:")
print(df_cleaned['combined_text'].iloc[0][:150])
print("\nTokenized:")
print(df_cleaned['tokens'].iloc[0][:30])
print(f"\nTotal tokens in first product: {len(df_cleaned['tokens'].iloc[0])}")
'''

In [None]:
'''
def remove_stopwords_and_punctuation(tokens):
    """
    Remove stopwords and punctuation from token list
    
    Args:
        tokens: List of word tokens
    
    Returns:
        Filtered list of tokens
    """
    # Remove punctuation and stopwords
    filtered_tokens = [
        token for token in tokens 
        if token not in stop_words 
        and token not in string.punctuation
        and len(token) > 1  # Remove single characters
        and not token.isdigit()  # Remove pure numbers
    ]
    
    return filtered_tokens

# Apply stopword removal
df_cleaned['tokens_filtered'] = df_cleaned['tokens'].apply(remove_stopwords_and_punctuation)
'''


In [None]:
'''
def lemmatize_tokens(tokens):
    """
    Reduce tokens to their base/root form
    
    Args:
        tokens: List of filtered tokens
    
    Returns:
        List of lemmatized tokens
    """
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized

# Apply lemmatization
df_cleaned['tokens_lemmatized'] = df_cleaned['tokens_filtered'].apply(lemmatize_tokens)
'''

In [None]:
#df_cleaned['processed_text'] = df_cleaned['tokens_lemmatized'].apply(lambda x: ' '.join(x))

In [None]:
''''

print("\n" + "="*60)
print("STEP 7: MOST COMMON TOKENS")
print("="*60)

from collections import Counter

# Flatten all tokens into single list
all_tokens = [token for tokens in df_cleaned['tokens_lemmatized'] for token in tokens]

# Get most common tokens
token_freq = Counter(all_tokens)
most_common = token_freq.most_common(30)

print("\nTop 30 most common tokens:")
for token, count in most_common:
    print(f"{token:20s} : {count:3d}")

# ============================================================
# STEP 8: Save Preprocessed Data
# ============================================================
print("\n" + "="*60)
print("STEP 8: SAVING PREPROCESSED DATA")
print("="*60)

# Select relevant columns for saving
output_columns = [
    'PRODUCT TITLE', 'PRODUCT DESCRIPTION', 'COLOR', 'SIZE', 'MATERIAL',
    'processed_text', 'tokens_lemmatized'
]

# Save to CSV
df_cleaned[output_columns].to_csv('Product_Datasets_Preprocessed.csv', index=False)
print("\n✓ Preprocessed data saved to 'Product_Datasets_Preprocessed.csv'")

# Display final dataframe info
print("\nFinal DataFrame shape:", df_cleaned.shape)
print("\nColumns:", df_cleaned.columns.tolist())

print("\n" + "="*60)
print("✓ TOKENIZATION AND STOPWORD REMOVAL COMPLETED!")
print("="*60)

'''