# 1. Part 2: Smart Search System Development

## Project Title
**Lumora: Smart Search and Multi-Label NLP Classifier for Automatic Tagging of Filipino Contemporary Arts and Crafts**

---

## Introduction
This section of the Lumora Documentation presents the development of the Smart Search System, which serves as the practical implementation of the structured data and automated tags generated in Part 1. The Smart Search System is designed to enhance product discoverability within the Lumora e-commerce platform by applying Natural Language Processing (NLP) techniques to deliver a search experience that moves beyond traditional keyword matching. Through semantic understanding, the system identifies products that are contextually relevant to the user's query.

The model and search components developed in this part are built using the **df_final_preprocessed.csv** dataset, which was fully cleaned, engineered, and augmented during the initial phase of the project.

---

## Core Objectives of the Smart Search System

### **1. Semantic Search**
Enable product retrieval through descriptive and natural phrases (e.g., *“small woven red bag”*) rather than relying on strict keyword matches. This ensures more accurate and intuitive search outcomes.

### **2. Feature Reuse**
Utilize the `TEXT_CONTENT` column—constructed from the combined and pre-processed product attributes such as name, description, color, size, and material—as the primary feature input for vectorization and similarity computation.

### **3. Deployment Readiness**
Produce deployable outputs such as the **TF-IDF Vectorizer**, the **feature matrix**, and the **similarity index**, which can be integrated into a backend API for real-time semantic search functionality.

---

## Data and Feature Recap

The Smart Search System is developed using the following key outputs from Part 1:

- **Dataset:** `df_final_preprocessed.csv` containing **2,509 rows** after data augmentation.
- **Primary Input Feature (X):**  
  The `TEXT_CONTENT` column, which consolidates all relevant descriptive product attributes into a single, normalized text string suitable for NLP vectorization.
- **Vectorization Technique:**  
  The **TF-IDF (Term Frequency–Inverse Document Frequency)** method is applied to convert text features into numerical vectors, forming the searchable representation used for similarity-based retrieval.

---



# 2. Data Loading for the Smart Search System

## Overview
The dataset used to build the **Smart Search System** is sourced from the preprocessed file **`df_final_preprocessed.csv`**, which contains the cleaned, engineered, and augmented product information generated in Part 1. This dataset includes consolidated text features used to train the semantic search model.

In this step, the dataset is loaded into a pandas DataFrame to prepare the text content for vectorization and similarity computation.


In [7]:
# Importing necessary libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import re
# initializing dataframe
df = pd.read_csv('df_final_preprocessed.csv')
df.head()

|  | Product name | Product description | Category | Subcategory | Color | Size | Material | Tags | TEXT_CONTENT | TOKENS | TOKENS_FILTERED | TOKENS_LEMMATIZED | TAGS_LIST | TAGS_LIST_FINAL | PROCESSED_TEXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Flowers Convertible Puso Wedding Tote | A versatile hobo-style tote embroidered with f... | Bags | Wedding Tote | White | UNSPECIFIED | Upcycled Fabric, Leather | wedding, tote, floral embroidery, Filipino, su... | Flowers Convertible Puso Wedding Tote A versat... | ['Flowers', 'Convertible', 'Puso', 'Wedding', ... | ['Flowers', 'Convertible', 'Puso', 'Wedding', ... | ['Flowers', 'Convertible', 'Puso', 'Wedding', ... | ['wedding', 'tote', 'floral embroidery', 'Fili... | ['wedding', 'tote', 'floral embroidery', 'Fili... | flower convertible puso wedding tote hobo-styl... |
| 1 | Manila Jeepney 3-in-1 Handbag | A colorful handbag inspired by the iconic jeep... | Bags | Handbag | Multicolor | UNSPECIFIED | Upcycled Fabric, Leather | jeepney, handbag, Filipino, sustainable | Manila Jeepney 3-in-1 Handbag A colorful handb... | ['Manila', 'Jeepney', '3-in-1', 'Handbag', 'A'... | ['Manila', 'Jeepney', '3-in-1', 'Handbag', 'co... | ['Manila', 'Jeepney', '3-in-1', 'Handbag', 'co... | ['jeepney', 'handbag', 'Filipino', 'sustainable'] | ['jeepney', 'Filipino', 'sustainable'] | manila jeepney 3-in-1 handbag colorful handbag... |
| 2 | Vinia Hardin Fanny Pack | A belt-style fanny pack handwoven with upcycle... | Bags | Fanny Pack | Black | UNSPECIFIED | Upcycled Fabric, Leather | fanny pack, Filipino, sustainable | Vinia Hardin Fanny Pack A belt-style fanny pac... | ['Vinia', 'Hardin', 'Fanny', 'Pack', 'A', 'bel... | ['Vinia', 'Hardin', 'Fanny', 'Pack', 'belt-sty... | ['Vinia', 'Hardin', 'Fanny', 'Pack', 'belt-sty... | ['fanny pack', 'Filipino', 'sustainable'] | ['Filipino', 'sustainable'] | vinia hardin fanny pack belt-style fanny pack ... |
| 3 | Sling Bag (Pinilian/Inabel Weave) | A crossbody sling bag showcasing traditional P... | Bags | Sling Bag | Blue | UNSPECIFIED | Upcycled Fabric, Pinilian/Inabel Weave | sling bag, Filipino, handwoven, sustainable | Sling Bag (Pinilian/Inabel Weave) A crossbody ... | ['Sling', 'Bag', '(', 'Pinilian/Inabel', 'Weav... | ['Sling', 'Bag', 'Pinilian/Inabel', 'Weave', '... | ['Sling', 'Bag', 'Pinilian/Inabel', 'Weave', '... | ['sling bag', 'Filipino', 'handwoven', 'sustai... | ['sling bag', 'Filipino', 'handwoven', 'sustai... | sling bag pinilian/inabel weave crossbody slin... |
| 4 | Alon Woven Waves Shoulder Bag | A shoulder bag with wave-pattern weaving, comb... | Bags | Shoulder Bag | Blue | UNSPECIFIED | Upcycled Fabric, Leather | shoulder bag, woven waves, Filipino, sustainable | Alon Woven Waves Shoulder Bag A shoulder bag w... | ['Alon', 'Woven', 'Waves', 'Shoulder', 'Bag', ... | ['Alon', 'Woven', 'Waves', 'Shoulder', 'Bag', ... | ['Alon', 'Woven', 'Waves', 'Shoulder', 'Bag', ... | ['shoulder bag', 'woven waves', 'Filipino', 's... | ['shoulder bag', 'Filipino', 'sustainable'] | alon woven wave shoulder bag shoulder bag wave... |


---

# 3. Model Development
The core of the Smart Search system is the transformation of product descriptions and search queries into a numerical format, allowing for the calculation of semantic similarity.

### 3.1. Vectorization (TF-IDF)
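The first step is to transform the product text into numerical vectors. After rows with duplicate product names are removed, the vectorizer is fitted on the `PROCESSED_TEXT` column, the normalized text derived from `TEXT_CONTENT`.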

In [13]:
# Drop rows with a duplicate 'Product name' (the first occurrence is kept)
df = df.drop_duplicates(subset=['Product name'])
df.shape

(601, 15)
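
Note that `drop_duplicates` keeps the original index labels, so the remaining labels are no longer consecutive. Row positions, not labels, are what line up with the rows of the TF-IDF matrix. The sketch below shows the gap on a hypothetical four-row table and one optional way to realign labels with positions; the table contents are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the product table with one repeated name.
toy_df = pd.DataFrame({
    "Product name": ["Tote", "Tote", "Sling Bag", "Bowl"],
    "PROCESSED_TEXT": ["tote bag", "tote bag large", "sling bag", "wooden bowl"],
})

deduped = toy_df.drop_duplicates(subset=["Product name"])
print(list(deduped.index))  # original labels survive, leaving a gap

# Resetting the index makes labels equal to row positions again.
deduped_reset = deduped.reset_index(drop=True)
print(list(deduped_reset.index))
```

Positional lookups with `.iloc` stay correct either way; resetting the index only matters if rows are later looked up by label.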

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Initialize the TF-IDF Vectorizer
# - We'll use the 'english' stop words list to remove common words.
# - The actual model should be trained on the full 'df_final_preprocessed.csv'
tfidf = TfidfVectorizer(stop_words='english')

# 2. Fit and Transform the text data
# X will be the sparse matrix containing the TF-IDF scores for each product
X = tfidf.fit_transform(df['PROCESSED_TEXT'])

# Print the shape of the resulting matrix
print("Shape of the TF-IDF Matrix (Documents x Features):", X.shape)

Shape of the TF-IDF Matrix (Documents x Features): (601, 1282)


### 3.2. Similarity Computation (Cosine Similarity)
Now that products are represented as vectors, the next step is to compare them with cosine similarity, which measures the cosine of the angle between two vectors. Because TF-IDF values are non-negative, the score falls between 0 and 1: a value close to 1 (a small angle) means two products share many heavily weighted terms, while a value of 0 means they share none.
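
As a small worked example, the score can be computed by hand as the dot product of two vectors divided by the product of their lengths, and compared against scikit-learn's result. The vectors and the helper `cosine_by_hand` below are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, different length
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero component with a

def cosine_by_hand(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    u, v = np.ravel(u), np.ravel(v)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

score_ab = cosine_by_hand(a, b)
score_ac = cosine_by_hand(a, c)
print(round(score_ab, 6), round(score_ac, 6))

# scikit-learn should agree with the manual calculation.
print(np.isclose(score_ab, cosine_similarity(a, b)[0, 0]))
```

Vector length does not affect the score, which is why a long product description and a short query can still match strongly.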

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# 1. Compute the Cosine Similarity Matrix
# The input 'X' is the TF-IDF matrix from the previous step.
# 'cosine_sim' is a matrix where cosine_sim[i][j] is the similarity
# between the i-th and j-th product.
cosine_sim = cosine_similarity(X, X)

# 2. Print the shape of the similarity matrix
print("Shape of the Cosine Similarity Matrix:", cosine_sim.shape)
# Print a snippet of the matrix
print("\nCosine Similarity Matrix (Snippet):")
print(cosine_sim)

Shape of the Cosine Similarity Matrix: (601, 601)

Cosine Similarity Matrix (Snippet):
[[1.         0.39784617 0.40740689 ... 0.         0.         0.14913856]
 [0.39784617 1.         0.33284693 ... 0.         0.01083348 0.14414243]
 [0.40740689 0.33284693 1.         ... 0.00582015 0.         0.13048839]
 ...
 [0.         0.         0.00582015 ... 1.         0.00631365 0.01161331]
 [0.         0.01083348 0.         ... 0.00631365 1.         0.01827303]
 [0.14913856 0.14414243 0.13048839 ... 0.01161331 0.01827303 1.        ]]
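
One possible use of a product-to-product matrix like this is a "similar items" lookup. The sketch below is only an illustration under stated assumptions: the three products are invented, and `most_similar` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical products; names and texts share the same row order.
toy_names = ["Red Woven Bag", "Blue Woven Bag", "Wooden Bowl"]
toy_texts = ["red woven bag", "blue woven bag", "wooden bowl"]

toy_X = TfidfVectorizer().fit_transform(toy_texts)
toy_sim = cosine_similarity(toy_X)

def most_similar(position, sim_matrix, names, top_n=1):
    """Return (name, score) pairs for the rows most similar to `position`."""
    scores = sim_matrix[position].copy()
    scores[position] = -1.0  # exclude the product itself
    order = np.argsort(scores)[::-1][:top_n]
    return [(names[i], float(scores[i])) for i in order]

neighbors = most_similar(0, toy_sim, toy_names)
print(neighbors)
```

The matrix grows with the square of the number of products, so it is cheap at 601 rows but may need a different approach for a much larger catalog.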


### 3.3. Saved Models and Data Structures

In [16]:
import joblib
from scipy.sparse import save_npz, load_npz

# --- Saving Assets (Run this ONLY ONCE after final training) ---

# Assuming 'tfidf' is your trained TfidfVectorizer object
# and 'X' is your transformed sparse matrix of product vectors.

# 1. Save the trained TF-IDF Vectorizer model
joblib.dump(tfidf, 'tfidf_vectorizer.joblib')
# File saved as: tfidf_vectorizer.joblib

# 2. Save the Product Vectors (Sparse Matrix X)
save_npz('product_vectors_X.npz', X)
# File saved as: product_vectors_X.npz


# --- Loading Assets in Production Environment ---

# 1. Load the trained TF-IDF Vectorizer
loaded_tfidf = joblib.load('tfidf_vectorizer.joblib')

# 2. Load the Product Vectors (X)
loaded_X = load_npz('product_vectors_X.npz')

# You can now use loaded_tfidf and loaded_X in your smart_search function!
print("Assets loaded successfully for production use.")

Assets loaded successfully for production use.
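
Before relying on saved assets, it can be worth confirming that they survive a save/load round trip unchanged and that the matrix width still matches the vectorizer's vocabulary. The following is a minimal sketch on a hypothetical three-document corpus, written to a temporary directory so that it does not touch the real files.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import load_npz, save_npz
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus used only to exercise the round trip.
toy_texts = ["red woven bag", "blue woven bag", "wooden bowl"]
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_X = toy_tfidf.fit_transform(toy_texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, "tfidf_vectorizer.joblib")
    mat_path = os.path.join(tmp_dir, "product_vectors_X.npz")
    joblib.dump(toy_tfidf, vec_path)
    save_npz(mat_path, toy_X)
    restored_tfidf = joblib.load(vec_path)
    restored_X = load_npz(mat_path)

# The restored assets should match the originals and each other.
same_vocab = restored_tfidf.vocabulary_ == toy_tfidf.vocabulary_
same_matrix = np.allclose(restored_X.toarray(), toy_X.toarray())
same_width = restored_X.shape[1] == len(restored_tfidf.vocabulary_)
print(same_vocab, same_matrix, same_width)
```

The vectorizer and the matrix must always be saved and deployed as a pair, since a matrix built with one vocabulary cannot be compared against queries transformed with another.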


### 3.4. Search Function

In [17]:
# Create a series that maps the index to the Product name for easy lookup
product_names = df['Product name']

def smart_search(query, tfidf_model, product_vectors, product_names_series, top_n=3):
    """
    Performs a semantic search for a given query.
    """
    # 1. Vectorize the query using the fitted TF-IDF model
    query_vec = tfidf_model.transform([query])

    # 2. Calculate the cosine similarity between the query and all product vectors
    # This results in a 1D array of scores.
    cosine_scores = cosine_similarity(query_vec, product_vectors).flatten()

    # 3. Get the indices of the top N most similar products
    # np.argsort returns indices that would sort the array, so we slice with [::-1] for descending
    top_indices = cosine_scores.argsort()[::-1][:top_n]

    # 4. Retrieve the product names and scores
    results = []
    for i in top_indices:
        results.append({
            'Product': product_names_series.iloc[i],
            'Similarity Score': cosine_scores[i]
        })

    return results

# --- Test the Smart Search System ---
query_test = "woven bag from Ilocos"
search_results = smart_search(query_test, tfidf, X, product_names, top_n=10)

print(f"--- Search Results for: '{query_test}' ---")
for result in search_results:
    print(f"Product: {result['Product']} (Score: {result['Similarity Score']:.4f})")

--- Search Results for: 'woven bag from Ilocos' ---
Product: Natural White Clutch bag | Woven Sabutan sling bag | Souvenir from Philippines | Eco friendly gift for her | Lightweight pouch bag (Score: 0.2658)
Product: Puso Micro Bag Charms (Score: 0.2041)
Product: Alon Woven Waves Shoulder Bag (Score: 0.2022)
Product: Handwoven Pink Kantarines Scrunchie (Score: 0.1815)
Product: Natural White Clutch bag | Souvenir from Philippines | Lightweight pouch bag (Score: 0.1681)
Product: Bayong Bag with Wooden Handle | Filipino Handwoven Bag (Score: 0.1680)
Product: Made in Philippines Flag Barcode Canvas Tote Bag (Score: 0.1662)
Product: Bayong Bag with Plastic Handle | Filipino Handwoven Bag (Score: 0.1611)
Product: Bayong Bag with Leather Handle | Filipino Handwoven Bag (Score: 0.1604)
Product: Bayong Bag with Zipper | Filipino Handwoven Bag (Score: 0.1583)
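
Query terms that are stop words or that do not occur in the fitted vocabulary contribute nothing to the score, and a query with no known terms scores 0 against every product. A possible refinement, sketched below under stated assumptions, is to drop results at or below a minimum score. The function `search_with_threshold`, its `min_score` parameter, and the toy products are hypothetical and are not part of the notebook above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical products; names and texts share the same row order.
toy_names = pd.Series(["Red Woven Bag", "Blue Sling Bag", "Wooden Bowl"])
toy_texts = ["red woven bag", "blue sling bag", "wooden bowl"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_X = toy_tfidf.fit_transform(toy_texts)

def search_with_threshold(query, tfidf_model, product_vectors, names,
                          top_n=3, min_score=0.0):
    """Variant of smart_search that keeps only scores above min_score."""
    query_vec = tfidf_model.transform([query])
    scores = cosine_similarity(query_vec, product_vectors).flatten()
    order = scores.argsort()[::-1][:top_n]
    return [
        {"Product": names.iloc[i], "Similarity Score": float(scores[i])}
        for i in order
        if scores[i] > min_score
    ]

# "from" is a stop word and "ilocos" is not in the toy vocabulary.
hits = search_with_threshold("woven bag from Ilocos", toy_tfidf, toy_X, toy_names)
# None of these terms are in the toy vocabulary.
misses = search_with_threshold("ceramic vase", toy_tfidf, toy_X, toy_names)
print([h["Product"] for h in hits], misses)
```

Returning an empty list for an unmatched query lets the calling code show a "no results" message instead of arbitrary products with a score of 0.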
