# 1. Background of the Study: Multi-Label Product Classifier
This notebook documents the development of a Multi-Label Classification Model for the Lumora e-commerce platform.

---

# Title of the Study
## **Lumora: Multi-Label NLP Classifier for Automatic Tagging of Filipino Contemporary Arts and Crafts**

---

# Source of Data
The dataset, stored as `Lumora_Product_Dataset.csv`, was collected from various online sources showcasing Filipino handmade goods and artisanal products.

- Source: Aggregated data from various public e-commerce listings focusing on Filipino handcrafted goods.
- Original Format: CSV/Excel tabular data.

---

# Brief Description of Dataset
This dataset consists of product listings intended to train an automated tagging model for the Lumora C2C platform. Each row represents a single handcrafted or creative item.

- Data Dimensions: The initial dataset contained 644 rows and 12 columns. 

- Meaning of Each Variable (Selected for Modeling):

1. `Product name` and `Product description`: Primary text fields used to infer the tags.

2. `Color`, `Size`, `Material`: Secondary text fields concatenated to enrich the product context.

3. `Tags`: The comma-separated field of labels manually assigned to the product (e.g., wedding, tote, floral embroidery, Filipino, sustainable).

---

# Model Variables
| Variable | Description | Role in Model |
|--------|-------------|-------------|
| **Selected Features (Independent Variables)** | The concatenated and pre-processed text derived from the Product name, Product description, Color, Size, and Material columns. | **X (Input)**: This single text input is converted into a numerical vector (e.g., using TF-IDF). |
| **Target / Label Column (Dependent Variable)** | The cleaned Tags column. This column will be converted into a binary matrix where each unique tag (e.g., cute, crochet, minimalist) is a separate binary feature (0 or 1). | **Y (Output)**: The labels the model is trained to predict simultaneously for a given product. |

---

# Objective
The objective is to develop a highly accurate Multi-Label NLP Classifier that can automatically assign a set of relevant categories and stylistic attributes to new product listings. This model will reduce the manual effort for sellers and ensure new products are appropriately tagged (e.g., keychain, crochet, kawaii, minimalist), thereby improving product discoverability on the Lumora platform for both general and niche search queries.

---

# 2. Data Collection / Loading
The data used for training the Multi-Label Classifier model is sourced from the `Lumora_Product_Dataset.csv` file, which aggregates product listings from various Filipino arts and crafts sellers.

In [1]:
# Importing necessary libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import re
# initializing dataframe
df = pd.read_csv('Lumora_Product_Dataset.csv')
df.head()

Unnamed: 0,Product name,Product description,Price,Category,Subcategory,Color,Size,Material,Tags,Product link,Image link,Brand / seller name
0,Flowers Convertible Puso Wedding Tote,A versatile hobo-style tote embroidered with f...,PHP 11172.22,Bags,Wedding Tote,White,Unspecified,"Upcycled fabric, leather","wedding, tote, floral embroidery, Filipino, su...",Unspecified,Unspecified,SintaWeddings
1,Manila Jeepney 3-in-1 Handbag,A colorful handbag inspired by the iconic jeep...,PHP 12406.79,Bags,Handbag,Multicolor,Unspecified,"Upcycled fabric, leather","jeepney, handbag, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings
2,Vinia Hardin Fanny Pack,A belt-style fanny pack handwoven with upcycle...,PHP 4875.93,Bags,Fanny Pack,Black,Unspecified,"Upcycled fabric, leather","fanny pack, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings
3,Sling Bag (Pinilian/Inabel Weave),A crossbody sling bag showcasing traditional P...,PHP 5554.94,Bags,Sling Bag,Blue,Unspecified,"Upcycled fabric, Pinilian/Inabel weave","sling bag, Filipino, handwoven, sustainable",Unspecified,Unspecified,SintaWeddings
4,Alon Woven Waves Shoulder Bag,"A shoulder bag with wave-pattern weaving, comb...",PHP 12653.70,Bags,Shoulder Bag,Blue,Unspecified,"Upcycled fabric, leather","shoulder bag, woven waves, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings


---
# 3. Data Information and Summary Statistics
This section presents the initial inspection of the loaded dataset to understand its structure, completeness, and the distribution of the key variables.

## Initial Data Inspection
The initial inspection confirms the overall data integrity, type, and dimensions.

In [2]:
# Show column types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Product name         644 non-null    object
 1   Product description  644 non-null    object
 2   Price                644 non-null    object
 3   Category             644 non-null    object
 4   Subcategory          644 non-null    object
 5   Color                644 non-null    object
 6   Size                 644 non-null    object
 7   Material             644 non-null    object
 8   Tags                 644 non-null    object
 9   Product link         644 non-null    object
 10  Image link           644 non-null    object
 11  Brand / seller name  639 non-null    object
dtypes: object(12)
memory usage: 60.5+ KB


In [3]:
# Show dataset dimensions
df.shape

(644, 12)

In [4]:
# Show count of missing values per column
df.isnull().sum()

Product name           0
Product description    0
Price                  0
Category               0
Subcategory            0
Color                  0
Size                   0
Material               0
Tags                   0
Product link           0
Image link             0
Brand / seller name    5
dtype: int64

## Summary of Initial Findings
- **Data Dimensions**: The initial raw dataset contains 644 entries (rows) and 12 columns.

- **Missing Values**: Only the Brand / seller name column has missing (null) values, 5 in total. This column is not used for the NLP model's input text or the target tags; its missing values are filled with empty strings during the cleaning phase, and the column itself is dropped during pre-processing.

- **Data Types**: All columns are of the generic object (string) type, which is expected since the majority of the columns (Product name, Product description, Tags, Material, etc.) are text-based inputs for the NLP model.

## Key Metric Analysis (Categorical and Target)
Since this is a classification problem, it is crucial to analyze the unique values and frequency distribution of the target column (Tags) and other categorical columns that influence it (Category).

In [5]:
# Show descriptive statistics for object columns
df.describe(include='object')

# Show value counts for the primary Category column
category_df = df['Category'].value_counts().to_frame()
print("\nCategory Distribution:\n")
print(category_df.head(46))


Category Distribution:

                            count
Category                         
Jewelry                       151
Vintage                        59
Ornaments                      46
POD                            28
Stickers                       25
Clothing                       23
Philippine Handicrafts         22
Digital Downloads              22
Stickers/Decals                19
Philippine Souvenir            18
Pasko & Parols                 17
Keychains/Charms               17
Wedding Ceremony               16
Bags                           16
Accessories                    16
Bundle Deals                   16
Filipiniana Attire             13
Apparel                        13
Prints                         12
Pinoy Keychains & Charms       10
Vintage Movies                  9
Stationery & Stickers           9
Keychains                       8
Native                          7
Capiz Decor                     6
Printables                      5
Mugs                   

## Summary of Key Metrics
**Categories**: There are 46 unique categories in the dataset. The most frequent category is "Jewelry" (151 counts), followed by "Vintage" (59 counts). This unequal distribution is common and must be considered during modeling.

**Product Names/Descriptions**: There are 601 unique product names and 560 unique product descriptions out of 644 total entries, suggesting high diversity among the listed products.

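**Tags (Target Variable)**: The Tags column has 565 unique values out of 644 total rows. Each value is a full comma-separated tag string, so this count reflects distinct tag combinations rather than distinct tags. Most products carry their own combination of attributes, which is consistent with framing the task as Multi-Label Classification rather than assigning a single class per product.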
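
Because each Tags value is a whole tag string, per-tag frequencies have to be computed by splitting the strings first. The following is a minimal sketch of that step using only the standard library; the three sample rows are the tag strings of the first three listings shown earlier, and `split_tags` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Tag strings of the first three listings shown in the notebook
tag_strings = [
    "wedding, tote, floral embroidery, Filipino, sustainable",
    "jeepney, handbag, Filipino, sustainable",
    "fanny pack, Filipino, sustainable",
]

def split_tags(tag_string):
    """Split a comma-separated tag string into a list of trimmed tags."""
    return [tag.strip() for tag in tag_string.split(",") if tag.strip()]

# Count how often each individual tag occurs across all rows
tag_counts = Counter(tag for row in tag_strings for tag in split_tags(row))

print(len(set(tag_strings)))      # number of distinct tag strings
print(len(tag_counts))            # number of distinct individual tags
print(tag_counts.most_common(2))  # most frequent individual tags
```

On the full dataset, the same counting would show how many tags occur only once, which matters later when deciding which labels the model can realistically learn.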

---
# 4. Data Cleaning
This section details the critical data cleaning operations performed on the raw text dataset to ensure consistency, handle missing values, and prepare the data for subsequent feature engineering.

## A. Handle Missing Values
Missing values (NaN or empty strings) in text columns can disrupt the NLP pipeline. Based on the initial data inspection, only the Brand / seller name column had 5 missing values. Since this column is text-based and its content might be useful for enrichment, we fill the missing values with an empty string ('') rather than dropping the entire row.

In [6]:
# Handle missing values by filling NaN with an empty string (for text compatibility)
df = df.fillna('')

# Verify that all missing values have been handled
print("Missing values after filling:")
print(df.isnull().sum())

Missing values after filling:
Product name           0
Product description    0
Price                  0
Category               0
Subcategory            0
Color                  0
Size                   0
Material               0
Tags                   0
Product link           0
Image link             0
Brand / seller name    0
dtype: int64


## B. Handle Duplicate Rows
Duplicate product listings can skew frequency analysis and unnecessarily increase model training time. We remove any identical rows that may have resulted from data scraping or entry errors.

In [7]:
# Drop any completely duplicate rows and reassign the result to df
initial_rows = df.shape[0]
df = df.drop_duplicates()
rows_after_cleaning = df.shape[0]

# Report the change in dimensions
print(f"\nInitial rows: {initial_rows}")
print(f"Rows after dropping duplicates: {rows_after_cleaning}")
print(f"Total duplicates removed: {initial_rows - rows_after_cleaning}")
print(f"New DataFrame shape: {df.shape}")


Initial rows: 644
Rows after dropping duplicates: 618
Total duplicates removed: 26
New DataFrame shape: (618, 12)
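
Exact-match deduplication only removes rows that are identical character for character. As a hedged illustration of a stricter check, the sketch below compares listings on a normalized key that ignores case and irregular spacing. The helper `normalized_key` and the sample listings are hypothetical and are not taken from the dataset.

```python
import re

def normalized_key(name, description):
    """Build a comparison key that ignores case and irregular whitespace."""
    joined = f"{name} {description}"
    return re.sub(r"\s+", " ", joined).strip().lower()

# Hypothetical listings: the second differs from the first only in spacing and case
listings = [
    ("Vinia Hardin Fanny Pack", "A belt-style fanny pack."),
    ("vinia hardin  fanny pack ", "A belt-style fanny pack."),
    ("Manila Jeepney 3-in-1 Handbag", "A colorful handbag."),
]

seen = set()
unique_listings = []
for name, description in listings:
    key = normalized_key(name, description)
    if key not in seen:  # keep only the first listing for each key
        seen.add(key)
        unique_listings.append((name, description))

print(len(unique_listings))
```

With pandas, a comparable effect could be obtained by storing such a key in a helper column and passing it to `drop_duplicates` through its `subset` parameter.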


## C. Standardize Inconsistent Data (Text Normalization)
Inconsistent text data, such as differing capitalization, irregular spacing, and varied units, confuses the model by treating the same concept (e.g., 'RING' vs. 'ring') as two different entities. We apply normalization steps to the text columns that will serve as model features.

### 1. Standardize Whitespace and Casing
We remove leading/trailing whitespace and reduce multiple spaces between words to a single space. We then convert the feature columns to a standardized casing (e.g., Title Case for categorical fields, Lowercase for the main text fields like Product Description) to group similar terms.

In [8]:
# List of text columns for general cleaning
text_columns = ['Product name', 'Product description', 'Category', 'Subcategory', 'Color', 'Size', 'Material', 'Tags']
df_clean = df.copy()

# 1. Strip whitespace and reduce multi-spaces
for col in text_columns:
    df_clean[col] = df_clean[col].str.strip()
    df_clean[col] = df_clean[col].str.replace(r'\s+', ' ', regex=True)

# 2. Standardize Casing (Note: Categorical data is often left Title/Upper until feature encoding)
df_clean['Size'] = df_clean['Size'].str.upper()
df_clean['Material'] = df_clean['Material'].str.title()
# The main text fields for the model will be lowercased later in the pre-processing stage

print("\n✓ Whitespace and Casing Standardization applied to categorical features.")
df_clean.head(2)


✓ Whitespace and Casing Standardization applied to categorical features.


Unnamed: 0,Product name,Product description,Price,Category,Subcategory,Color,Size,Material,Tags,Product link,Image link,Brand / seller name
0,Flowers Convertible Puso Wedding Tote,A versatile hobo-style tote embroidered with f...,PHP 11172.22,Bags,Wedding Tote,White,UNSPECIFIED,"Upcycled Fabric, Leather","wedding, tote, floral embroidery, Filipino, su...",Unspecified,Unspecified,SintaWeddings
1,Manila Jeepney 3-in-1 Handbag,A colorful handbag inspired by the iconic jeep...,PHP 12406.79,Bags,Handbag,Multicolor,UNSPECIFIED,"Upcycled Fabric, Leather","jeepney, handbag, Filipino, sustainable",Unspecified,Unspecified,SintaWeddings


### 2. Standardize Inconsistent Formatting
We use regular expressions to fix common inconsistencies in the unstructured data, such as measurement units and common compound words.

In [9]:
# 3. Standardize SIZE field inconsistencies (e.g., 'inches' to 'in')
def standardize_size(size_str):
    if not isinstance(size_str, str) or size_str == '':
        return ''
    # Standardize "inches" variations
    size_str = re.sub(r'\binches\b', 'in', size_str, flags=re.IGNORECASE)
    size_str = re.sub(r'\binch\b', 'in', size_str, flags=re.IGNORECASE)
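    # Standardize the 'x' separator only when it is not part of a word (leaves 'XL', 'APPROX' intact)
    size_str = re.sub(r'\s*(?<![A-Za-z])x(?![A-Za-z])\s*', ' x ', size_str, flags=re.IGNORECASE)
    # Standardize common abbreviations (optional period placed after the word boundary to avoid 'Approx..')
    size_str = re.sub(r'\bapprox\b\.?', 'Approx.', size_str, flags=re.IGNORECASE)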
    return size_str.strip()

df_clean['Size'] = df_clean['Size'].apply(standardize_size)

# 4. Standardize MATERIAL field inconsistencies
def standardize_materials(material_str):
    if not isinstance(material_str, str) or material_str == '':
        return ''
    # Standardize common material combinations
    material_str = re.sub(r'\bvinyl sticker with matte finish\b', 'Vinyl Sticker (Matte Finish)', material_str, flags=re.IGNORECASE)
    material_str = re.sub(r'\bmetal keychain ring\b', 'Metal Findings', material_str, flags=re.IGNORECASE)
    return material_str.title()

df_clean['Material'] = df_clean['Material'].apply(standardize_materials)

print("✓ Size and Material formats standardized.")

✓ Size and Material formats standardized.
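
To make the effect of the size rules concrete, here is a small worked example on sample strings. It is a sketch only: `normalize_size` is a hypothetical stand-alone variant written for illustration, and the sample values are invented rather than taken from the dataset. It restricts the `x` rule to separators that are not part of a word, since an unrestricted `x` pattern would also split strings such as `XL` or `APPROX`.

```python
import re

def normalize_size(size_str):
    """Hypothetical stand-alone variant of the size clean-up rules."""
    # Collapse 'inches' / 'inch' to 'in'
    size_str = re.sub(r"\binch(?:es)?\b", "in", size_str, flags=re.IGNORECASE)
    # Put single spaces around an 'x' that is not adjacent to a letter
    size_str = re.sub(r"\s*(?<![A-Za-z])x(?![A-Za-z])\s*", " x ", size_str,
                      flags=re.IGNORECASE)
    # Normalize 'approx' with or without a trailing period
    size_str = re.sub(r"\bapprox\b\.?", "Approx.", size_str, flags=re.IGNORECASE)
    return size_str.strip()

samples = ["2 INCHES", "3x4 inch", "3 in x 4 in", "APPROX. 5 IN", "XL"]
for sample in samples:
    print(repr(sample), "->", repr(normalize_size(sample)))
```

Note that a bare `IN` is left as-is by these rules, so uppercasing the column first and then substituting `in` can produce mixed casing within the same field.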


### 3. Clean Special Characters
We clean up stray characters, quotes, and encoding issues that can split words or introduce noise into the tokenization process.

In [10]:
# 5. Fix special characters and encoding issues
def clean_special_chars(text):
    if not isinstance(text, str) or text == '':
        return ''
    # Normalize dashes and remove invisible characters
    text = text.replace('—', '-').replace('–', '-')
    text = re.sub(r'[\u200b-\u200f\u202a-\u202e\ufeff]', '', text)
    # Quotes and other visible symbols are left unchanged by this function
    return text

for col in text_columns:
    df_clean[col] = df_clean[col].apply(clean_special_chars)

print("✓ Special characters cleaned.")

✓ Special characters cleaned.
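
The reason invisible characters matter is that they split words without any visible sign. The sketch below illustrates this on an invented string; `strip_invisible` is a hypothetical stand-alone helper that applies the same character classes as the cell above.

```python
import re

# Zero-width and bidirectional control characters that are invisible when rendered
INVISIBLE_CHARS = re.compile(r"[\u200b-\u200f\u202a-\u202e\ufeff]")

def strip_invisible(text):
    """Replace em/en dashes with hyphens and delete invisible control characters."""
    text = text.replace("\u2014", "-").replace("\u2013", "-")
    return INVISIBLE_CHARS.sub("", text)

# Invented sample: a zero-width space inside 'crochet', an em dash, and a BOM inside 'keychain'
raw = "cro\u200bchet \u2014 key\ufeffchain"
cleaned = strip_invisible(raw)

print("crochet" in raw)      # False: the zero-width space splits the word
print("crochet" in cleaned)  # True
print(cleaned)               # crochet - keychain
```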


## Summary of Data Cleaning Operations
The data cleaning phase achieved the following:

- **Completeness**: All 5 missing values in the Brand / seller name column were successfully filled with empty strings.

- **Validity**: 26 duplicate rows were removed, resulting in a cleaner dataset of 618 unique entries for modeling.

- **Consistency**: The text columns were normalized for whitespace and special characters, and the Size and Material columns were additionally standardized for casing and key format variations, so that the NLP model trains on unified concepts (e.g., '2 in' instead of '2 inches', 'Vinyl Sticker (Matte Finish)' instead of 'vinyl sticker with matte finish').

---
# 5. Data Engineering / Pre-processing
The goal of the Multi-Label Classifier is to predict the tags from the product description and related attributes. To provide the model with the richest context, we combine the most descriptive text fields: `Product name`, `Product description`, `Color`, `Size`, and `Material`.

## A. Dropping Unnecessary Columns
The columns Price, Product link, Image link, and Brand / seller name are unnecessary for the Multi-Label Classifier because they don't help determine the product's descriptive tags, or they introduce noise. 
1. Irrelevance: The Price is a numerical variable and is generally not a direct semantic feature that dictates the style or material tags of an item.

2. Noise: Product link and Image link mostly contain non-semantic URLs that, even after cleaning, would add unnecessary noise to the model's vocabulary.

3. Low Value: The Brand / seller name might add a little value, but it's not core to describing the product itself and can introduce bias or overfitting based on a specific seller. Since the column also had missing values that were just filled with empty strings, it's best to exclude it.

4. Efficiency: Dropping these columns makes the DataFrame smaller and faster to process during the subsequent NLP steps (Tokenization, Vectorization, etc.).

In [11]:
# List of columns to drop as they are not needed for tag prediction
columns_to_drop = [
    'Price', 
    'Product link', 
    'Image link', 
    'Brand / seller name'
]

# Drop the columns from the cleaned DataFrame
df_clean = df_clean.drop(columns=columns_to_drop)

print("✓ Unnecessary columns successfully dropped.")
print(f"New DataFrame shape: {df_clean.shape}")
print(f"Remaining columns: {df_clean.columns.tolist()}")

✓ Unnecessary columns successfully dropped.
New DataFrame shape: (618, 8)
Remaining columns: ['Product name', 'Product description', 'Category', 'Subcategory', 'Color', 'Size', 'Material', 'Tags']


## B. Concatenation
Combining these features into a single string lets the model relate words from any attribute column (e.g., the word "Blue" in the Color column) to the resulting tags (e.g., a tag like "pastel" or "ocean-themed" that the description may only imply).


In [12]:
# --- RUN THIS CODE AFTER DROPPING COLUMNS ---

# Concatenate relevant text columns into one feature column (TEXT_CONTENT)
# We fill any remaining blanks with an empty string and ensure they are strings before concatenation
df_clean['TEXT_CONTENT'] = (
    df_clean['Product name'].fillna('').astype(str) + ' ' +
    df_clean['Product description'].fillna('').astype(str) + ' ' +
    df_clean['Color'].fillna('').astype(str) + ' ' +
    df_clean['Size'].fillna('').astype(str) + ' ' +
    df_clean['Material'].fillna('').astype(str)
)

print("\n✓ TEXT_CONTENT column created successfully.")

# Display the remaining columns to verify
print("\nVerification of Final Features and Target:")
print(df_clean[['Product name', 'Product description', 'Tags', 'TEXT_CONTENT']].head(2).to_string())


✓ TEXT_CONTENT column created successfully.

Verification of Final Features and Target:
                            Product name                                                                                                              Product description                                                     Tags                                                                                                                                                                                                      TEXT_CONTENT
0  Flowers Convertible Puso Wedding Tote  A versatile hobo-style tote embroidered with floral motifs, designed for weddings and crafted from upcycled fabric and leather.  wedding, tote, floral embroidery, Filipino, sustainable  Flowers Convertible Puso Wedding Tote A versatile hobo-style tote embroidered with floral motifs, designed for weddings and crafted from upcycled fabric and leather. White UNSPECIFIED Upcycled Fabric, Leather
1          Manila Jeepney 3-in-1 Hand

## C. Tokenization
Tokenization is the process of splitting a continuous sequence of text into smaller, meaningful units called tokens. These tokens are typically individual words, but they can also be phrases, numbers, or punctuation.

In [13]:
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data used by word_tokenize (only needed once per environment)
nltk.download('punkt_tab')

# Assumes df_clean already contains the 'TEXT_CONTENT' column

def tokenize_content(text):
    """Tokenizes text into a list of individual words."""
    if not isinstance(text, str) or text == '':
        return []
    
    # Tokenize the text using NLTK's word tokenizer
    tokens = word_tokenize(text)
    
    return tokens

# Apply tokenization to the TEXT_CONTENT column
df_clean['TOKENS'] = df_clean['TEXT_CONTENT'].apply(tokenize_content)

print("✓ Tokenization completed. 'TOKENS' column created.")
print(f"New DataFrame shape: {df_clean.shape}")

# Display verification of the tokens for the first product
print("\nVerification of Tokenization (First Product):")
print("-" * 50)
print(f"Original Text (Snippet): {df_clean['TEXT_CONTENT'].iloc[0][:100]}...")
print(f"Tokens: {df_clean['TOKENS'].iloc[0][:20]} (first 20 tokens)")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\PLPASIG\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


✓ Tokenization completed. 'TOKENS' column created.
New DataFrame shape: (618, 10)

Verification of Tokenization (First Product):
--------------------------------------------------
Original Text (Snippet): Flowers Convertible Puso Wedding Tote A versatile hobo-style tote embroidered with floral motifs, de...
Tokens: ['Flowers', 'Convertible', 'Puso', 'Wedding', 'Tote', 'A', 'versatile', 'hobo-style', 'tote', 'embroidered', 'with', 'floral', 'motifs', ',', 'designed', 'for', 'weddings', 'and', 'crafted', 'from'] (first 20 tokens)


## D. Stopword Removal
**Stopword removal** is the process of eliminating common words that appear frequently in text but carry little semantic value for the task, such as `a`, `the`, `is`, `and`, and `with`. Removing these words reduces the dimensionality of the text data and focuses the model on the most descriptive keywords (like `crochet`, `kawaii`, `keychain`) that are critical for predicting the product's tags.

In [14]:
import string
nltk.download('stopwords')
from nltk.corpus import stopwords

# --- Stopword Setup ---
# 1. Get standard English stopwords
standard_stopwords = set(stopwords.words('english'))

# 2. Define custom domain-specific stopwords (e.g., words common to all e-commerce items)
custom_stopwords = {
    'product', 'item', 'featuring', 'made', 'designed', 'inspired',
    'perfect', 'ideal', 'great', 'comes', 'includes', 'set', 
    'inch', 'pinoy', 'tagalog', 'in', 'approx', 'tote', 
    'link', 'php', 'style', 'versatile', 'convertible', 'hobo' # Based on observed dataset values
}
stop_words = standard_stopwords.union(custom_stopwords)
print(f"✓ Total unique words in stoplist: {len(stop_words)}")

def remove_stopwords(tokens):
    """Removes stopwords and single-character tokens from a list of words."""
    if not tokens:
        return []
    
    # Filter out stopwords, punctuation, and single characters (e.g., 'A', 'I', 'S').
    # Note: the stopword check is case-sensitive, so capitalized tokens such as 'Tote' are kept.
    filtered_tokens = [
        token for token in tokens
        if token not in stop_words 
        and token not in string.punctuation
        and len(token) > 1  # Remove single characters
    ]
    
    return filtered_tokens

# Apply stopword removal to the TOKENS column
df_clean['TOKENS_FILTERED'] = df_clean['TOKENS'].apply(remove_stopwords)

print("✓ Stopword removal completed. 'TOKENS_FILTERED' column created.")
print(f"New DataFrame shape: {df_clean.shape}")

# Display verification of the filtered tokens for the first product
print("\nVerification of Stopword Removal (First Product):")
print("-" * 50)
print(f"Original Tokens (Snippet): {df_clean['TOKENS'].iloc[0][:15]}")
print(f"Filtered Tokens (Snippet): {df_clean['TOKENS_FILTERED'].iloc[0]}")

✓ Total unique words in stoplist: 221
✓ Stopword removal completed. 'TOKENS_FILTERED' column created.
New DataFrame shape: (618, 11)

Verification of Stopword Removal (First Product):
--------------------------------------------------
Original Tokens (Snippet): ['Flowers', 'Convertible', 'Puso', 'Wedding', 'Tote', 'A', 'versatile', 'hobo-style', 'tote', 'embroidered', 'with', 'floral', 'motifs', ',', 'designed']
Filtered Tokens (Snippet): ['Flowers', 'Convertible', 'Puso', 'Wedding', 'Tote', 'hobo-style', 'embroidered', 'floral', 'motifs', 'weddings', 'crafted', 'upcycled', 'fabric', 'leather', 'White', 'UNSPECIFIED', 'Upcycled', 'Fabric', 'Leather']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PLPASIG\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
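
In the verification output above, `Convertible` and `Tote` survive filtering even though their lowercase forms are in the custom stoplist, because the membership test compares tokens exactly as written. The sketch below shows one possible case-insensitive variant. It is an illustration under stated assumptions: `remove_stopwords_casefold` is a hypothetical helper, and the stoplist here is a shortened stand-in for the notebook's full list.

```python
import string

# Hypothetical, abbreviated stoplist; the notebook's full list is much longer
stop_words = {"a", "with", "for", "and", "from", "designed", "tote", "convertible"}

def remove_stopwords_casefold(tokens):
    """Drop stopwords (ignoring case), pure-punctuation tokens, and single characters."""
    return [
        token for token in tokens
        if token.lower() not in stop_words
        and not all(ch in string.punctuation for ch in token)
        and len(token) > 1
    ]

tokens = ["Flowers", "Convertible", "Puso", "Wedding", "Tote", "A",
          "versatile", "hobo-style", "tote", ",", "designed"]
print(remove_stopwords_casefold(tokens))
```

Whether removing `tote` is desirable is a separate question, since `tote` also appears as a tag on the first listing; a stoplist entry that doubles as a label may be worth reviewing.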


## E. Lemmatization
Lemmatization is the process of reducing different inflected forms of a word to a single base form, known as the lemma. This is more sophisticated than stemming, as it relies on a dictionary or vocabulary to ensure the root form is an actual word (e.g., changing "crocheting" to "crochet," or "leaves" to "leaf"). Note that NLTK's `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is supplied, so in this notebook mainly plural nouns are reduced (e.g., "motifs" to "motif"), while verb forms such as "embroidered" are left unchanged.

This step helps variations of the same product attribute or material be treated as one feature by the classification model, which reduces the total vocabulary size and can improve prediction accuracy.

In [15]:
import nltk
from nltk.stem import WordNetLemmatizer

# --- Lemmatization Setup ---
# Download the WordNet data required by the lemmatizer (only needed once per environment)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    """Reduces tokens to their base/root form (lemma)."""
    if not tokens:
        return []
    
    # Apply lemmatization to each token in the list
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
    
    return lemmatized

# Apply lemmatization to the filtered tokens column
df_clean['TOKENS_LEMMATIZED'] = df_clean['TOKENS_FILTERED'].apply(lemmatize_tokens)

print("✓ Lemmatization completed. 'TOKENS_LEMMATIZED' column created.")
print(f"New DataFrame shape: {df_clean.shape}")

# Display verification of the lemmatized tokens for the first product
print("\nVerification of Lemmatization (First Product):")
print("-" * 50)
print(f"Filtered Tokens (Snippet): {df_clean['TOKENS_FILTERED'].iloc[0]}")
print(f"Lemmatized Tokens (Snippet): {df_clean['TOKENS_LEMMATIZED'].iloc[0]}")

✓ Lemmatization completed. 'TOKENS_LEMMATIZED' column created.
New DataFrame shape: (618, 12)

Verification of Lemmatization (First Product):
--------------------------------------------------
Filtered Tokens (Snippet): ['Flowers', 'Convertible', 'Puso', 'Wedding', 'Tote', 'hobo-style', 'embroidered', 'floral', 'motifs', 'weddings', 'crafted', 'upcycled', 'fabric', 'leather', 'White', 'UNSPECIFIED', 'Upcycled', 'Fabric', 'Leather']
Lemmatized Tokens (Snippet): ['Flowers', 'Convertible', 'Puso', 'Wedding', 'Tote', 'hobo-style', 'embroidered', 'floral', 'motif', 'wedding', 'crafted', 'upcycled', 'fabric', 'leather', 'White', 'UNSPECIFIED', 'Upcycled', 'Fabric', 'Leather']


## F. Target Label Encoding
The target variable of the model is the Tags column, which contains a string of comma-separated tags (e.g., "wedding, tote, floral embroidery").

Multi-label encoding converts this string into a binary matrix (or multi-hot encoded vector).
- Each unique tag in the entire dataset becomes a separate column.
- For each product, a 1 is placed in the column corresponding to a tag that applies to that product, and a 0 is placed everywhere else.

This process is necessary because the Multi-Label Classifier predicts a probability for every single possible tag simultaneously.
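
The encoding described above can be sketched in plain Python to show what the resulting matrix looks like. This is only an illustration of the idea on two tag lists from the listings shown earlier; it is not the notebook's implementation, which relies on scikit-learn.

```python
# Tag lists for two of the listings shown earlier in the notebook
tag_lists = [
    ["jeepney", "handbag", "Filipino", "sustainable"],
    ["fanny pack", "Filipino", "sustainable"],
]

# Column order: every unique tag, sorted (uppercase letters sort before lowercase)
classes = sorted({tag for tags in tag_lists for tag in tags})

# One row per product: 1 where the tag applies, 0 elsewhere
multi_hot = [[1 if cls in tags else 0 for cls in classes] for tags in tag_lists]

print(classes)
for row in multi_hot:
    print(row)
```

Each row has the same length as the list of unique tags, so a dataset with many rare tags yields a wide and mostly zero matrix.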

In [16]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# --- Data Preparation for Encoding ---

# 1. Clean up and split the 'Tags' column
# The 'Tags' column contains comma-separated strings (e.g., 'tag1, tag2, tag3').
# We must split this string into a list of individual tags.

# Assuming df_clean is your current DataFrame with the 'Tags' column
# Apply a lambda function to split the string by comma and remove surrounding whitespace
df_clean['TAGS_LIST'] = df_clean['Tags'].apply(
    lambda x: [tag.strip() for tag in x.split(',')] if isinstance(x, str) and x.strip() else []
)

# 2. Initialize and Fit the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit the binarizer to all the tags across the entire dataset
# This discovers all unique tags and assigns them an index
Y_labels = mlb.fit_transform(df_clean['TAGS_LIST'])

# 3. Create the final Target DataFrame (Y)
# Convert the binary matrix back into a labeled DataFrame
Y = pd.DataFrame(Y_labels, columns=mlb.classes_)

# 4. Concatenate Y back to the main DataFrame (optional, but good for inspection)
df_encoded = pd.concat([df_clean.reset_index(drop=True), Y], axis=1)

print("✓ Target Labels encoded into binary matrix (Y).")
print(f"Total Unique Tags Discovered: {len(mlb.classes_)}")
print(f"Target Matrix Shape (Rows, Tags): {Y.shape}")

# Display verification of the target encoding
print("\nVerification of Target Label Encoding (First 5 Rows):")
print("-" * 70)
# Show the original tags and the first few encoded tag columns
print(df_encoded[['Tags'] + list(Y.columns)[:5]].head(5).to_string())

✓ Target Labels encoded into binary matrix (Y).
Total Unique Tags Discovered: 717
Target Matrix Shape (Rows, Tags): (618, 717)

Verification of Target Label Encoding (First 5 Rows):
----------------------------------------------------------------------
                                                      Tags  1940s movie  1980s movie  1990s movie  2 custom heart Instagram decals  Acrylic Elysse parol ornament
0  wedding, tote, floral embroidery, Filipino, sustainable            0            0            0                                0                              0
1                  jeepney, handbag, Filipino, sustainable            0            0            0                                0                              0
2                        fanny pack, Filipino, sustainable            0            0            0                                0                              0
3              sling bag, Filipino, handwoven, sustainable            0            0            0  

## G. Targeted Pruning
Identify and remove non-semantic noise tags (like branded or ultra-specific single-count tags) from the $\mathbf{Y}$ matrix. Because noise tags are mixed with meaningful rare tags, a clearly defined set of noise criteria is required to preserve only high-value descriptors.
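
One way to make the noise criteria explicit is to combine a minimum frequency with a whitelist of rare tags that should be preserved. The sketch below is a hypothetical illustration of that rule, not the method used in this notebook: `prune_rare_tags`, the `min_count` threshold, the `keep` whitelist, and the sample tag lists are all assumptions.

```python
from collections import Counter

def prune_rare_tags(tag_lists, min_count=2, keep=frozenset()):
    """Drop tags seen fewer than min_count times unless explicitly whitelisted."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    allowed = {tag for tag, n in counts.items() if n >= min_count} | set(keep)
    return [[tag for tag in tags if tag in allowed] for tags in tag_lists]

# Hypothetical tag lists for illustration
tag_lists = [
    ["Filipino", "sustainable", "woven waves"],
    ["Filipino", "sustainable", "handwoven"],
    ["Filipino", "Y2K Heart Tsurikawa"],
]

# Keep 'handwoven' even though it occurs once, as an example of a meaningful rare tag
pruned = prune_rare_tags(tag_lists, min_count=2, keep={"handwoven"})
print(pruned)
```

A rule like this is reproducible when the data changes, whereas a hand-written list has to be revisited each time new listings are added.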

In [None]:
# List of tags identified as noise based on low semantic value or duplication (Count 1-4)
noise_tags_to_drop = [
    # Too specific/non-generalizable
    '2 custom heart Instagram decals', 'Acrylic Elysse parol ornament', 'Acrylic Sampabell parol ornament', 
    'Acrylic Tala parol ornament', 'Acrylic parol ornament special with ring', 
    'Acrylic parol ornament special with ring duplicate', 'Acrylic tulip parol ornament', 
    'Wooden Capiz parol ornament variant 2', 'Wooden Capiz parol ornament variant', 
    'Y2K Heart Tsurikawa', 'Wooden tricycle ornament', 'Wooden pride fist ornament', 
    'White enamel Capiz parol ornament', 'White Men Can\'t Jump', 'Wedgwood', 'Wheel acrylic keychain', 
    'Vincent Van Beau', 'Utot shirt', 'Utot', 'Tsinelas keychain', 'Tres Santan motif', 
    'Tree blueprint notepad', 'Time of Your Life', 'Tilso Japan', 'Thomas', 'Theology books sticker pack', 
    'Theology books keychain', 'Teletubbies', 'Tboli malong', 'Tboli', 'Tagalog humor', 
    'Tagalog design', 'Tagalog hat', 'Tagalog ay nako sticker', 'Tabo keychain', 'Sungka keychain', 
    'Sun Araw stud', 'Sun Araw statement necklace', 'Sun Araw statement hoops', 'Sun Araw silver necklace',
    'Sun Araw silver hoops', 'Sun Araw pearl ring', 'Sun Araw hoop earrings', 'Sun Araw enamel pin',
    'Sun Araw + pearl dangle', 'Sun + pearl studs silver', 'Sun + pearl dangle gold', 
    'Rice knuckle cooking printable', 'Philippine Sun friendship bracelet', 
    'Philippine pearls bridal trio', 'Philippine pearls bridal drop necklace', 
    'Philippine parol', 'Philippine flag heart pin', 'Philippine Sun hoop earrings silver', 
    'Philippine Sun pearl dangle silver', 'Philippine Sun pearl necklace silver', 
    'Philippine farming regions print', 'Philippine flag graduation stole variant', 
    'Philippine flag graduation stole limited', 'Philippine flag graduation stole', 
    'Philippine flag', 'Philippine Sun bracelet', 'Philippine Sun citrine necklace', 
    'Philippine Sun dainty earrings silver', 'Philippine Sun dangle earrings silver', 
    'Philippine Sun enamel pin', 'Philippine Sun hoop earrings gold', 
    'Philippine Sun necklace gold', 'Philippine Sun pearl bracelet silver', 
    'Pekpek Turbo sticker', 'Pearl scoop necklace PR-4', 'Pearl beaded Christmas ornament', 
    'toy', 'yoga mat', 'zip lips', 'Bahala ka sticker', 'Bahala Ka Sa Buhay Mo greeting card', 
    'Bad Dog Club', 'Babae print', 'Artsy floral bookmark', 'Artist sticker pack', 
    'Araw necklace', 'wedding heritage motif', 'woven waves', 'wrap top', 'wooden plaque', 
    'wooden keychain', 'whale sticker', 'wedding veil motif', 'wedding symbol motif', 
    'wedding symbol', 'Berenstain', 'Batman Forever', 'Batik button pins', 'Pekpek power print', 
    'Palau', 'Outline Sun Araw', 'Olo motif', 'NeverEnding Story', 'Nonom motif', 
    'Objects in mirror are cuter decal', 'Obsidian Sun Araw', 'Michael', 'Mikasa', 
    'Mini Capiz star ornaments', 'Motorcycle plate frame', 'Munggo food art printable', 
    'Jollibee reusable tote bag', 'John Travolta', 'Japanese sticker', 'Japanese illustration', 
    'James Cagney', 'Jack Russell', 'Jade Sun Araw', 'JDM wheel shoe charm', 'Itneg tapis', 
    'Italian cookbook', 'Indiana Glass', 'Ilocos Abra', 'Honda Civic Type R art print', 
    'Hellacute windshield banner sticker', 'Hellacute windshield banner', 'Hellacute tactical keychain',
    'Hellacute heart croc charms', 'Hellacute croc charms', 'Healing vibes vapor rub printable', 
    'Haw Flakes candy printable', 'Halo-halo Filipino dessert sticker', 'Half Sun', 
    'Greenhouse house keychain', 'Green Pastures notepad', 'Godfather Part 1', 'Gengar shoe charms', 
    'Fried Green Tomatoes', 'Flag-inspired earrings', 'Free all political prisoners print', 
    'FRS/S2000/Miata/Type R charms', 'floral drop earrings', 'floral earrings', 
    'floral bloom earrings', 'flag pin', 'flag enamel pin', 'flag design', 'flag', 
    'fishtail motif', 'Filipino tarot', 'Filipino pet accessory', 'Filipino motif', 
    'Filipina nurse sticker set', 'Filipina mug', 'Filipina empowerment', 'Federal Windsor', 
    'Federal Pressed Glass', 'Eiffel Tower', 'Elizabeth Arden', 'Empowerment pin', 
    'Enamel Capiz parol ornament', 'Driving Miss Daisy', 'Diana Princess of Wales', 
    'Death to imperialism print', 'Dainty Sun Araw silver', 'Dainty Sun Araw dangle', 
    'Custom heart Instagram decal', 'Custom cherry blossom Instagram decal', 
    'Cristal D\'Arques', 'Colonizers burned our fields print', 'Coconut grater sticker', 
    'Christian sticker pack', 'Christmas lantern stickers', 'Christmas décor', 'Cherry blossom valve stem caps', 
    'Cherry blossom Instagram decal variant 3', 'Cherry blossom Instagram decal variant 2', 
    'Cherry blossom Instagram decal variant', 'Cherry blossom Instagram decal', 
    'Cherry Blossom motorcycle frame', 'Capiz star ornaments', 'Capiz shell', 
    'Capiz mango tray set', 'Capiz flower napkin holders', 'Capiz candy cane ornaments', 
    'Burwood', 'Buri reindeer ornaments', 'unity cord motif', 'two-way earrings', 'turtle', 
    'tumbler', 'tulip candle holder', 'trinket box', 'tray', 'tradition motif', 'wedding pillow motif',
    'Amethyst Sun Araw', 'Anime girl keychain', '1940s movie',
    # Specific Brand Names (Low-Value)
    'Judy Belle', 'Royal Cornwall', 'Heisey', 'Lenox', 'Mikasa', 'Paul Revere', 'Thomas', 'Roger Duvoisin', 
    'Berman & Anderson', 'Anchor Hocking', 'Burwood', 'Hofbauer Byrdes', 'Indiana Glass', 'Cristal D\'Arques', 'Federal Pressed Glass', 'Federal Windsor', 'Wedgwood',
    # Product Metadata / Duplicates
    'Unspecified', 'Parol', 'Parol dangle variant', 'Parol earrings', 'Parol earrings Christmas', 
    'Christmas lantern stickers', 'Christmas décor', 'Parol coconut shell keychain',
    'wall art', 'wall pocket', 'tote bag', 'serving tray', 'silver-plated', 'snowman', 'reversible', 'ribbed glass',
    'rosary motif', 'rose pattern', 'rose pearl', 'rose pearl motif', 'rosette', 'rosette motif', 'sampaguita',
    'sampaguita motif', 'santan flower', 'scrunchie', 'portrait sticker', 'pressed cut', 'pride fist necklace',
    'printable poster', 'programming', 'puso', 'pyramid', 'raffia', 'rainbow', 'relish dish', 'reproduction bowl',
    'resin', 'reverse psychology', 'pedicab', 'pendant', 'pestle', 'photo album', 'picnic mat', 'pillow', 
    'pineapple style', 'pitcher', 'platter', 'porcelain floral earrings', 'porcelain plaque', 'porcelain set',
    'portrait illustration', 'portrait mug', 'owl', 'oval dish', 'ornament', 'nostalgia', 'net bag', 'mythology',
    'music box', 'mother & child', 'mortar', 'monogram', 'minimalist motif', 'miniature jeepney', 'mini hoops',
    'mahal kita print', 'mahal kita necklace', 'mahal kita', 'lingling-o earrings', 'lady frame', 'lamp',
    'leaf motif', 'lighthouse', 'kutsinta', 'jewelry box', 'jacquard', 'inabel', 'hoop motif', 'hoodie',
    'headscarf', 'hair clip', 'handbag', 'handloomed', 'handmade earrings', 'handmade keychain', 
    'handwoven skirt', 'handwoven table runner', 'halo-halo', 'gold plated hoops', 'glitter', 'fruit pattern',
    'frog tote bag', 'fried chicken', 'floral vase', 'fishtail motif', 'flag and sun', 'flag pin', 'flag enamel pin',
    'flag design', 'flag', 'floral bloom earrings', 'floral drop earrings', 'floral earrings', 'formal wear',
    'gravy bowl', 'golden bloom motif', 'crystal dish', 'crystal bowl', 'fanny pack', 'fan design', 
    'family motif', 'faith motif', 'etched glass', 'embroidered', 'elephant', 'dominoes', 'distressed cap', 
    'dinuguan', 'dinner plate', 'different', 'decor', 'decanter', 'dad cap', 'cut glass', 'culture motif', 
    'creamer cup', 'cracker nut', 'coding power', 'coin wallet', 'coin motif', 'coffee mug', 'collectible plush',
    'collectible mug', 'collectible frame', 'dog pin', 'dog lover', 'dog bandana', 'clutch bag', 'clutch', 
    'classic motif', 'clam shell', 'creative mind', 'candy bowl', 'calendar', 'collectible dish', 'capiz motif',
    'carabao', 'cardinal bird', 'cat figurine', 'ceremony motif', 'champorado tuyo', 'charm', 'chicken inasal',
    'bowl set', 'bowl', 'board game', 'bloom earrings', 'birds of paradise motif', 'bilo-bilo', 'bell', 
    'bees buddies', 'beer mug', 'beer', 'beaded keychain', 'bayong bag', 'bangle', 'balikbayan', 'bag', 
    'badge reel', 'backpack', 'ashtray', 'artist', 'arroz caldo', 'arrow design', 'arras motif', 'apparel', 
    'angel', 'abaca fiber', 'abaca bag', 'bridal necklace', 'bridal earrings', 'brass earrings', 
    'butterfly basket', 'butter dish', 'bubble glass', 'bridal set', 'caldereta', 'capiz shell dinnerware set', 
    'capiz shell', 'capiz', 'capiz christmas tree ornaments', 'Capiz', 'jollibee reusable tote bag', 
    'kamagong wood cross necklace', 'kapwa tarot', 'katol keychain', 'kawaii cow croc charms', 
    'kirby shoe charms set', 'klifus motif', 'kumain ka na ba sticker', 'longganisa', 'love motif', 
    'malong motif', 'mama bear', 'micro bag', 'kalesa', 'seashell', 'seaglass', 'seaglass earring', 
    'seaglass jewelry', 'seaglass necklace', 'seaglass pendant', 'seashell and pearl', 'seashell jewelry', 
    'sinigang ingredients print', 'sinigang mini print', 'star and sun araw dangle', 'stars and sun ear cuff', 
    'sun araw + pearl dangle', 'sun araw enamel pin', 'sun araw hoop earrings', 'sun araw pearl ring', 
    'sun araw silver hoops', 'sun araw silver necklace', 'sun araw statement hoops', 'sun araw statement necklace',
    'sun araw stud', 'sunday morning', 'sungka keychain', 'tabo keychain', 'tagalog ay nako sticker', 
    'tagalog hat', 'tagalog humor', 'tboli', 'tboli malong', 'teletubbies', 'theology books keychain', 
    'theology books sticker pack', 'thomas', 'tilso japan', 'time of your life', 'tita tagalog necklace', 
    'to love is to resist print', 'todd parr', 'wooden pride fist ornament', 'wooden tricycle ornament', 
    'y2k heart tsurikawa', 'tree blueprint notepad', 'tres santan motif', 'tsinelas keychain', 'utot', 'utot shirt',
    'vincent van beau', 'wedgwood', 'wheel acrylic keychain', 'white men can\'t jump', 'white enamel capiz parol ornament',
    'wooden \'mahal kita\' ornament', 'wooden capiz parol ornament', 'wooden capiz parol ornament duplicate', 
    'wooden sampabell parol ornament', 'wooden sampabell parol ornament variant', 'wooden bahay kubo ornament', 
    'wooden bahay kubo ornament duplicate', 'wooden basketball ornament', 'wooden firework parol ornament', 
    'wooden firework parol ornament variant', 'wooden jeepney ornament', 'wooden jeepney ornament variant', 
    'wooden jeepney ornament variant 2', 'wooden lechon ornament', 'wooden nurse heart + flag ornament', 
    'abaniko motif', 'acrylic box', 'adjustable', 'neverending story', 'nonom motif', 
    'objects in mirror are cuter decal', 'obsidian sun araw', 'john travolta', 'jollibee reusable tote bag', 
    'kamagong wood cross necklace', 'kapwa tarot', 'katol keychain', 'kawaii cow croc charms', 
    'kirby shoe charms set', 'klifus motif', 'kumain ka na ba sticker', 'birds of paradise motif', 
    'bloom earrings', 'board game', 'boho tote', 'arras motif', 'arrow design', 'arroz caldo', 'artist', 
    'ashtray', 'backpack', 'badge reel', 'bag', 'balikbayan', 'bangle', 'abaca fiber', 'abaca bag', 
    'brass earrings', 'bridal earrings', 'bridal necklace', 'bridal set', 'bubble glass', 'butter dish', 
    'butterfly basket', 'caldereta', 'bayong bag', 'beaded keychain', 'beer', 'beer mug', 'bees buddies', 
    'bell', 'bilo-bilo', 'birds of paradise', 'honda civic type r art print', 'ilocos abra', 'indiana glass',
    'italian cookbook', 'itneg tapis', 'jdm wheel shoe charm', 'jack russell', 'jade sun araw', 
    'james cagney', 'japan pottery', 'ceremony motif', 'champorado tuyo', 'charm', 'chicken inasal', 'bowl', 
    'bowl set', 'coding power', 'coffee mug', 'coin motif', 'coin wallet', 'collectible dish', 'calendar', 
    'candy bowl', 'capiz motif', 'carabao', 'cardinal bird', 'cat figurine', 'lazy lechon sticker', 'lenox',
    'leonard', 'lewd anime keychain', 'hofbauer byrdes', 'green pastures notepad', 'greenhouse house keychain',
    'half sun', 'halo-halo filipino dessert sticker', 'haw flakes candy printable', 
    'healing vibes vapor rub printable', 'hellacute croc charms', 'hellacute heart croc charms', 
    'hellacute lanyard keychain', 'cracker nut', 'creamer cup', 'creative mind', 'clam shell', 
    'classic motif', 'clutch', 'dog bandana', 'dog lover', 'dog pin', 'collectible frame', 
    'collectible mug', 'collectible plush', 'compote bowl', 'condensed milk', 'condensed milk can', 
    'cord motif', 'japanese illustration', 'japanese sticker', 'jeepney charm', 'jo ann shirley', 
    'gengar shoe charms', 'godfather part 1', 'culture motif', 'cut glass', 'dad cap', 'decanter', 
    'decor', 'different', 'dinner plate', 'dinuguan', 'distressed cap', 'hellacute tactical keychain', 
    'hellacute windshield banner', 'hellacute windshield banner sticker', 'filipino motif', 
    'filipino pet accessory', 'filipino tarot', 'filipino traditional', 'filipino dad shirt', 
    'filipino decor', 'filipino dessert charm', 'floral vase', 'formal wear', 'dominoes', 'elephant',
    'embroidered', 'enamel brass', 'etched glass', 'faith motif', 'family motif', 'fan design', 
    'fanny pack', 'crystal bowl', 'crystal dish', 'golden bloom motif', 'gravy bowl', 
    'flag-inspired earrings', 'free all political prisoners print', 'fried green tomatoes', 
    'genesis notepad', 'filipino bread', 'filipino car magnet', 'fried chicken', 'frog tote bag', 
    'fruit pattern', 'glitter', 'glossy pearl motif', 'filipino hat', 'filipino humor shirt', 
    'english', 'frs/s2000/miata/type r charms', 'federal pressed glass', 'federal windsor', 
    'filipina empowerment', 'filipina mug', 'filipina nurse sticker set', 'filipino christmas', 
    'kawaii', 'hair clip', 'halo-halo', 'handbag', 'handloomed', 'handmade earrings', 
    'handmade keychain', 'handwoven skirt', 'handwoven table runner', 'longganisa', 'love motif', 
    'malong motif', 'mama bear', 'micro bag', 'headscarf', 'hoodie', 'hoop motif', 'inabel', 
    'jacquard', 'jewelry box', 'filipino apparel', 'coconut grater sticker', 
    'colonizers burned our fields print', 'cristal d\'arques', 'custom blossom instagram decal',
    'miniature jeepney', 'minimalist motif', 'monogram', 'mortar', 'mother & child', 'music box', 
    'mythology', 'net bag', 'nostalgia', 'ornament', 'oval dish', 'kutsinta', 'lady frame', 
    'lamp', 'leaf motif', 'lighthouse', 'capiz flower napkin holders', 'capiz mango tray set', 
    'capiz shell', 'capiz shell dinnerware set', 'capiz star ornaments', 
    'cherry blossom motorcycle frame', 'cherry blossom instagram decal', 
    'cherry blossom instagram decal variant', 'cherry blossom instagram decal variant 2', 
    'cherry blossom instagram decal variant 3', 'cherry blossom valve stem caps', 
    'pearl dangle', 'pearl jewelry set', 'pearl necklace', 'pearls', 'mini hoops', 
    'porcelain plaque', 'porcelain set', 'portrait illustration', 'portrait mug', 'owl', 
    'palm', 'pandesal', 'paradise motif', 'patadyong', 'pear pin', 'pear pin set', 'eiffel tower',
    'elizabeth arden', 'empowerment pin', 'enamel capiz parol ornament', 
    'capiz candy cane ornaments', 'rattan', 'religious', 'relish dish', 'reproduction bowl', 
    'resin', 'reverse psychology', 'pedicab', 'pendant', 'pestle', 'photo album', 'picnic mat', 
    'pillow', 'pineapple style', 'pitcher', 'platter', 'porcelain floral earrings', 
    'rose pearl motif', 'rosette', 'rosette motif', 'sampaguita', 'sampaguita motif', 
    'santan flower', 'scrunchie', 'portrait sticker', 'pressed cut', 'pride fist necklace', 
    'printable poster', 'programming', 'puso', 'pyramid', 'raffia', 'rainbow', 'bookstore notepad',
    'bridal pearl cluster', 'bridal pearl set', 'bridal scoop pearl', 'bukaka', 'bukaka shirt', 
    'bukaka toddler tee', 'silver-plated', 'snowman', 'reversible', 'ribbed glass', 
    'ring pillow motif', 'rituals', 'rosary motif', 'rose pattern', 'rose pearl', 
    'serving tray', 'shell', 'shell keychain', 'shirt', 'silk', 'silk accent', 'silk flower', 
    'silk organza', 'christian sticker pack', 'christmas décor', 'christmas lantern stickers', 
    'coconut grater printable', 'berman & anderson', 'binakol', 'binakol jacket', 
    'bluso bnetek motif', 'teak vase', 'textile', 'threader earrings', 'toothpick holder', 
    'tote bag', 'spaghetti', 'sphere motif', 'spiral motif', 'spoon', 'star', 'sticker sheet', 
    'stocking', 'stud earrings', 'sun rays', 'sungka', 'sunset', 'vintage decor', 'vinyl', 
    'votive candle holder', 'wall art', 'wall pocket', 'waterproof', 'unity cord symbol', 
    'unity heritage motif', 'unity symbol', 'unity symbol motif', 'swan vase', 'sweatshirt', 
    'tapis wrap', 'tapsilog', 'tassel design', 'tassel earrings', 'wedding heritage motif', 
    'araw necklace', 'artist sticker pack', 'artsy floral bookmark', 'asian snacks', 'babae print',
    'bad dog club', 'bahala ka sa buhay mo greeting card', 'bahala ka sticker', 'zip lips', 
    'bastos shirt', 'bastos toddler tee', 'yoga mat', 'yugal motif', 'anime girl car decal',
    'toy', 'pearl beaded christmas ornament', 'pearl scoop necklace pr-4', 'pekpek turbo sticker', 
    'batik button pins', 'batman forever', 'berenstain', 'wedding symbol', 'wedding symbol motif', 
    'wedding veil motif', 'whale sticker', 'wooden keychain', 'wooden plaque', 'woven waves', 'wrap top',
    'wedding bag', 'wedding gift'
]

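# 1. Prune the noise tags from Y (target matrix)
# Keep every tag column EXCEPT the noise tags. errors='ignore' skips list entries
# that are not columns of Y, so a misspelled tag would be silently left in place.
Y_pruned = Y.drop(columns=noise_tags_to_drop, errors='ignore')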

# 2. Prune the Noise Tags from the TAGS_LIST column (for Augmentation)
def final_filter_tags(tag_list, tags_to_keep):
    """Filters a list of tags to only include those in the final model output."""
    return [tag for tag in tag_list if tag in tags_to_keep]

tags_to_keep_final = Y_pruned.columns.tolist()

df_clean['TAGS_LIST_FINAL'] = df_clean['TAGS_LIST'].apply(
    lambda x: final_filter_tags(x, tags_to_keep_final)
)

print("✓ Noise tags successfully pruned from the target matrix (Y) and TAGS_LIST.")
print(f"Original number of tags: {len(Y.columns)}")
print(f"Final number of tags (after noise removal): {Y_pruned.shape[1]}")
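Dropping label columns can leave some products with no tags at all, which is worth checking before training. The following sketch is an illustration under stated assumptions rather than part of the original pipeline: `rows_without_tags` is a hypothetical helper, and the toy lists stand in for the filtered `TAGS_LIST_FINAL` values.

```python
def rows_without_tags(filtered_tag_lists):
    """Return the positions of products whose filtered tag list is empty."""
    return [i for i, tags in enumerate(filtered_tag_lists) if not tags]

# Toy stand-in for the filtered tag lists (illustrative only)
toy_filtered = [
    ["Filipino", "sustainable"],
    [],                      # every tag of this listing was pruned
    ["handwoven"],
]

empty_rows = rows_without_tags(toy_filtered)
print(f"Products left without tags: {len(empty_rows)} at positions {empty_rows}")
```

On the real data the same check could be run as `rows_without_tags(df_clean['TAGS_LIST_FINAL'])`. Whether such rows should be dropped or re-tagged is a modeling decision left open here.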

## Conclusion of Data Engineering / Pre-processing
The data preparation for the Multi-Label Classifier is now complete.
- Input Feature (X): Created and cleaned (TEXT_CONTENT, tokenized, lemmatized).
- Output Label (Y): Created, encoded into a binary matrix, and pruned of noise tags (`Y_pruned` DataFrame).