# Semantic Search Demo with Clothing Reviews

This notebook demonstrates how to perform semantic search using a dataset of clothing reviews. We will:
1. Preprocess the dataset (cleaning and handling missing values).
2. Use Sentence-BERT to generate embeddings for the combined `Title` and `Review` text.
3. Perform semantic search using cosine similarity.

The dataset includes information about clothing items, reviews, and additional attributes like material, color, and durability.
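
For reference, cosine similarity between two vectors is their dot product divided by the product of their norms. The following is a minimal NumPy sketch of that definition on made-up 3-dimensional vectors (real sentence embeddings have hundreds of dimensions). The function name `cosine_sim` is illustrative and not part of any library.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for text embeddings (values are made up)
query = np.array([1.0, 0.0, 1.0])
same_direction = np.array([2.0, 0.0, 2.0])  # parallel to the query
orthogonal = np.array([0.0, 1.0, 0.0])      # shares no direction with the query

print(cosine_sim(query, same_direction))
print(cosine_sim(query, orthogonal))
```

Parallel vectors score close to 1 and orthogonal vectors score 0, regardless of vector length.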


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("../backend/data/dataset.csv")

# Inspect the data
print(df.info())
print(df.head(10))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49338 entries, 0 to 49337
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         45370 non-null  object 
 1   Review        48507 non-null  object 
 2   Cons_rating   49124 non-null  float64
 3   Cloth_class   49322 non-null  object 
 4   Materials     5741 non-null   float64
 5   Construction  5743 non-null   float64
 6   Color         5742 non-null   float64
 7   Finishing     5737 non-null   float64
 8   Durability    5734 non-null   float64
dtypes: float64(6), object(3)
memory usage: 3.4+ MB
None
                                  Title  \
0                                   NaN   
1                                   NaN   
2               Some major design flaws   
3                      My favorite buy!   
4                      Flattering shirt   
5               Not for the very petite   
6                  Cagrcoal shimmer fun   
7  Shimmer, surpri

In [2]:
# Fill missing 'Title' and 'Review' values with empty strings
df['Title'] = df['Title'].fillna("")
df['Review'] = df['Review'].fillna("")

# Drop rows where all five attribute scores are NaN (keeps 5,744 of the 49,338 rows)
df = df.dropna(subset=['Materials', 'Construction', 'Color', 'Finishing', 'Durability'], how='all')


# Display the cleaned dataset
print(df.info())


<class 'pandas.core.frame.DataFrame'>
Index: 5744 entries, 0 to 49337
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         5744 non-null   object 
 1   Review        5744 non-null   object 
 2   Cons_rating   5739 non-null   float64
 3   Cloth_class   5744 non-null   object 
 4   Materials     5741 non-null   float64
 5   Construction  5743 non-null   float64
 6   Color         5742 non-null   float64
 7   Finishing     5737 non-null   float64
 8   Durability    5734 non-null   float64
dtypes: float64(6), object(3)
memory usage: 448.8+ KB
None
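
The row count falls from 49,338 to 5,744 because `how='all'` drops a row only when every listed column is missing. A small sketch on a made-up three-row frame (the values are invented for illustration) shows how this differs from `how='any'`:

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 is complete, row 1 is partly missing, row 2 is entirely missing
toy = pd.DataFrame({
    "Materials":  [4.0, np.nan, np.nan],
    "Durability": [5.0, 3.0, np.nan],
})

kept_all = toy.dropna(subset=["Materials", "Durability"], how="all")
kept_any = toy.dropna(subset=["Materials", "Durability"], how="any")

print(len(kept_all))  # rows with at least one score present
print(len(kept_any))  # rows with every score present
```

With `how='all'` the partly filled row survives, which is why a few NaN values remain in the attribute columns after cleaning.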


In [3]:
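# Combine Title and Review for semantic analysis; strip the stray space left when Title is empty
df['Combined_Text'] = (df['Title'] + " " + df['Review']).str.strip()
# Missing values were already filled with "", so dropna would find nothing; drop rows with no text instead
df = df[df['Combined_Text'] != ""]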

# Inspect the combined column
print(df[['Title', 'Review', 'Combined_Text']].head(10))


                                  Title  \
0                                         
1                                         
2               Some major design flaws   
3                      My favorite buy!   
4                      Flattering shirt   
5               Not for the very petite   
6                  Cagrcoal shimmer fun   
7  Shimmer, surprisingly goes with lots   
8                            Flattering   
9                     Such a fun dress!   

                                              Review  \
0  Absolutely wonderful - silky and sexy and comf...   
1  Love this dress!  it's sooo pretty.  i happene...   
2  I had such high hopes for this dress and reall...   
3  I love, love, love this jumpsuit. it's fun, fl...   
4  This shirt is very flattering to all due to th...   
5  I love tracy reese dresses, but this one is no...   
6  I aded this in my basket at hte last mintue to...   
7  I ordered this in carbon for store pick up, an...   
8  I love this dress. 

In [4]:
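# Work with the first 1,000 rows; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df.head(1000).copy()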

In [5]:
df.to_csv("../backend/data/cleaned_dataset.csv")

In [8]:
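# If this import fails with the Keras 3 error shown in the output below,
# install the compatibility package first: pip install tf-keras
from sentence_transformers import SentenceTransformer

# Load pre-trained Sentence-BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')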

# Generate embeddings for the Combined_Text column
embeddings = model.encode(df['Combined_Text'].tolist(), convert_to_tensor=True)

print(f"Generated embeddings for {len(embeddings)} entries.")


RuntimeError: Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

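# Example query
query = "Comfortable and durable dress for summer"
query_embedding = model.encode(query)  # returns a 1-D NumPy array by default

# Compute cosine similarity. scikit-learn expects 2-D arrays, so reshape the
# query to (1, dim) and move the tensor embeddings to CPU/NumPy first.
similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings.cpu().numpy())

# Add similarity scores to the DataFrame
df['Similarity'] = similarities[0]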

# Sort results by similarity
top_results = df.sort_values(by='Similarity', ascending=False).head(5)

# Display top results
print(top_results[['Title', 'Review', 'Cloth_class', 'Similarity']])
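
Sorting the whole DataFrame is fine for 1,000 rows. As a rough alternative sketch, the top matches can also be picked straight from the score array with NumPy. The scores below are made up, and `top_k` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the positions of the k highest scores, best first."""
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition isolates the k largest values without sorting everything;
    # only those k candidates are then fully sorted.
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(scores[candidates])[::-1]]

# Made-up similarity scores for six reviews
scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30, 0.66])
print(top_k(scores, 3))
```

The returned positions can be passed to `df.iloc[...]` to fetch the matching rows.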


## Dataset
The dataset contains 49,338 entries with the following columns:
- `Title`: Short title for the review.
- `Review`: Detailed review text.
- `Cons_rating`: Numerical rating of the product.
- `Cloth_class`: Category of clothing (e.g., Dresses, Intimates).
- Other numerical attributes like `Materials`, `Construction`, `Color`, `Finishing`, and `Durability`.

## Notebook
The notebook `notebooks/semantic_search_demo.ipynb` covers:
1. Data preprocessing.
2. Embedding generation using Sentence-BERT.
3. Semantic search implementation with cosine similarity.
4. Saving the processed dataset for reuse.

Run the notebook using:
```bash
jupyter notebook notebooks/semantic_search_demo.ipynb
```


In [None]:
from neo4j import GraphDatabase

# Connect to the Neo4j database
uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "password"))

# Add nodes and relationships
def add_data_to_neo4j(tx, title, cloth_class, review):
    query = """
    MERGE (product:Product {title: $title})
    MERGE (category:Category {name: $cloth_class})
    MERGE (review:Review {text: $review})
    MERGE (product)-[:BELONGS_TO]->(category)
    MERGE (product)-[:HAS_REVIEW]->(review)
    """
    tx.run(query, title=title, cloth_class=cloth_class, review=review)

# Execute for all rows in the dataset.
# execute_write replaces write_transaction, which is deprecated since driver 5.0.
with driver.session() as session:
    for _, row in df.iterrows():
        session.execute_write(add_data_to_neo4j, row['Title'], row['Cloth_class'], row['Review'])

print("Knowledge graph populated!")
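
Running one transaction per row means one round trip per review. A common way to reduce that is to send rows in batches and let Cypher iterate over them with `UNWIND`. The sketch below only builds the batches and the query text and does not connect to a database. `chunked`, `BATCH_QUERY`, and the batch size are illustrative assumptions, and the rows are invented.

```python
from typing import Dict, Iterator, List

# Same MERGE pattern as add_data_to_neo4j, applied to each element of $rows
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (product:Product {title: row.title})
MERGE (category:Category {name: row.cloth_class})
MERGE (review:Review {text: row.review})
MERGE (product)-[:BELONGS_TO]->(category)
MERGE (product)-[:HAS_REVIEW]->(review)
"""

def chunked(rows: List[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of rows with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Invented rows standing in for DataFrame records
rows = [
    {"title": "Example title A", "cloth_class": "Dresses", "review": "Example review A"},
    {"title": "Example title B", "cloth_class": "Dresses", "review": "Example review B"},
    {"title": "Example title C", "cloth_class": "Intimates", "review": "Example review C"},
]

batches = list(chunked(rows, 2))
print([len(batch) for batch in batches])
```

Each batch could then be passed as the `rows` parameter inside a transaction function, for example `tx.run(BATCH_QUERY, rows=batch)`.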


In [None]:
def query_knowledge_graph(tx, keyword):
    query = """
    MATCH (product:Product)-[:BELONGS_TO]->(category:Category),
          (product)-[:HAS_REVIEW]->(review:Review)
    WHERE product.title CONTAINS $keyword OR review.text CONTAINS $keyword
    RETURN product.title AS Title, category.name AS Category, review.text AS Review
    LIMIT 5
    """
    return tx.run(query, keyword=keyword).data()

# Execute a query (execute_read replaces read_transaction, which is deprecated since driver 5.0)
with driver.session() as session:
    results = session.execute_read(query_knowledge_graph, "dress")
    print(results)


In [None]:
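import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)  # the WordNet corpus must be present for synsets()

def expand_query(query):
    words = query.split()
    expanded_query = list(words)
    for word in words:
        # First lemma of each synset; multi-word lemmas use "_" in place of spaces
        lemmas = [syn.lemmas()[0].name().replace("_", " ") for syn in wordnet.synsets(word) if syn.lemmas()]
        expanded_query.extend(lemmas[:3])  # Limit to 3 synonyms per word
    # Deduplicate while preserving order (a set would not keep a stable order)
    return list(dict.fromkeys(expanded_query))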

# Example
query = "comfortable dress"
expanded_query = expand_query(query)
print(f"Original Query: {query}")
print(f"Expanded Query: {expanded_query}")


In [None]:
# Generate embeddings for all expanded query terms in one batch (NumPy array, one row per term)
expanded_embeddings = model.encode(expanded_query)

# Compute similarity of every term against every review: shape (n_terms, n_reviews)
all_similarities = cosine_similarity(expanded_embeddings, embeddings.cpu().numpy())

# Aggregate similarities across terms and rank results
df['Similarity'] = all_similarities.sum(axis=0)
top_results = df.sort_values(by='Similarity', ascending=False).head(5)
print(top_results[['Title', 'Review', 'Similarity']])
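
Summing the per-term scores lets a review that matches several expanded terms moderately outrank one that matches a single term strongly. Whether that is desirable depends on the use case. The sketch below compares sum, mean, and max on a made-up score matrix (rows are query terms, columns are reviews); none of these is presented as the notebook's preferred method.

```python
import numpy as np

# Made-up similarities: 3 query terms (rows) x 4 reviews (columns)
term_scores = np.array([
    [0.9, 0.5, 0.1, 0.4],
    [0.1, 0.5, 0.2, 0.4],
    [0.1, 0.5, 0.1, 0.4],
])

by_sum = term_scores.sum(axis=0)    # rewards matching many terms
by_mean = term_scores.mean(axis=0)  # same ranking as sum, but on the original score scale
by_max = term_scores.max(axis=0)    # rewards the single best-matching term

print(int(by_sum.argmax()), int(by_mean.argmax()), int(by_max.argmax()))
```

Here the sum and mean both favour the review that matches every term evenly, while the max favours the review with one strong match.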


In [None]:
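import os
import joblib

# Save embeddings to file (create the target directory if it does not exist yet)
os.makedirs('data', exist_ok=True)
joblib.dump(embeddings, 'data/embeddings.pkl')
print("Embeddings saved!")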

In [None]:
import os

# Load embeddings if available
if os.path.exists('data/embeddings.pkl'):
    embeddings = joblib.load('data/embeddings.pkl')
    print("Loaded precomputed embeddings.")
else:
    embeddings = model.encode(df['Combined_Text'].tolist(), convert_to_tensor=True)
    joblib.dump(embeddings, 'data/embeddings.pkl')
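
The load-or-compute pattern above can be wrapped in a small helper. This is a sketch under assumptions: `load_or_compute` is a hypothetical function, it stores a NumPy array with `np.save` rather than pickling via `joblib`, and the demo writes to a temporary directory with a dummy function standing in for `model.encode`.

```python
import os
import tempfile
from typing import Callable

import numpy as np

def load_or_compute(path: str, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load an array from `path` if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    array = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    np.save(path, array)
    return array

calls = []

def fake_encode() -> np.ndarray:
    """Stand-in for model.encode(...); records each call."""
    calls.append(1)
    return np.ones((2, 3), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "data", "embeddings.npy")
    first = load_or_compute(cache_path, fake_encode)   # computes and saves
    second = load_or_compute(cache_path, fake_encode)  # loads from disk

print(len(calls), np.array_equal(first, second))
```

The compute function runs only on the first call; the second call reads the cached file instead.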
