## Building the Search Engine for E-commerce Product Recommendation

### Project Overview
This notebook builds a semantic search engine for e-commerce product recommendations, specifically for Nike products. I use product names, subtitles, and descriptions together so that a query can match any of the three fields. The steps are preprocessing, embedding computation, similarity calculation, ranking, query expansion, knowledge graph integration, and optimization.

### Steps to Implement

1. **Preprocess the Dataset**: Utilize name, subtitle, and description for embedding computation.
2. **Compute Embeddings**: Use SBERT to generate embeddings for a concatenated string of name, subtitle, and description.
3. **Calculate Cosine Similarity**: Compare the embeddings to find relevant products.
4. **Rank Products**: Sort results based on similarity scores.
5. **Query Expansion**: Expand the search terms with WordNet synonyms.
6. **Knowledge Graph Integration**: Simulate a knowledge graph with NetworkX for smaller datasets or prototyping.
7. **Optimization**: Implement caching for embeddings to improve speed.

### Why SBERT?
SBERT (Sentence-BERT) is chosen because it produces dense sentence embeddings whose cosine similarity reflects semantic similarity. This suits our multi-field approach, where name, subtitle, and description are embedded as a single concatenated string.

### Why NetworkX for Knowledge Graph?
NetworkX provides a simple way to simulate a knowledge graph for smaller datasets or prototyping. It allows us to represent relationships between products based on attributes like color, which can enrich our search capabilities.
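As a rough illustration of the idea, the sketch below links a few products to color nodes and then looks up products that share a color. The product names and the `same_color_products` helper are hypothetical and only serve the example; the notebook's real graph is built from the dataset in Step 6.

```python
import networkx as nx

G = nx.DiGraph()

# Hypothetical toy products; names and colors are illustrative only
toy_products = [("jacket a", "Black"), ("shorts b", "Black"), ("shirt c", "Navy")]
for name, color in toy_products:
    G.add_node(name, type="Product")
    G.add_node(color, type="Color")
    G.add_edge(name, color, type="has_color")

def same_color_products(graph, product):
    """Products that point at the same color node as `product`."""
    related = set()
    for color in graph.successors(product):
        # Every predecessor of a color node is a product with that color
        related.update(graph.predecessors(color))
    related.discard(product)
    return sorted(related)

print(same_color_products(G, "jacket a"))  # ['shorts b']
```

Following an edge out to the color and back in to its other products is the kind of relationship lookup that a plain embedding search does not provide on its own.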



### Step 1: Preprocess the Dataset

In [1]:
import pandas as pd

# Load the cleaned dataset
data = pd.read_csv('../data/semantic_search_ready_data.csv')

# Combine name, sub_title, and description for embedding.
# Fill missing values per column first: concatenating a string with NaN
# yields NaN, which would blank out the whole row if filled afterwards.
text_columns = ['name', 'sub_title', 'description']
data['combined_text'] = data[text_columns].fillna('').agg(' '.join, axis=1).str.strip()

In [2]:
data.head()

Unnamed: 0,name,sub_title,color,price,description,avg_rating,review_count,parsed_sizes,dominant_color,combined_text
0,nike dri-fit team (mlb minnesota twins),men's long-sleeve t-shirt,Navy,40.0,sweat-wicking comfort.the nike dri-fit team (m...,4.773913,0.0,"['S', 'M', 'L', 'XL', '2XL']",Navy,nike dri-fit team (mlb minnesota twins) men's ...
1,club américa,women's nike dri-fit soccer jersey dress,Black/Black,90.0,"inspired by traditional soccer jerseys, the cl...",5.0,1.0,['L (12–14)'],Black,club américa women's nike dri-fit soccer jerse...
2,nike sportswear swoosh,men's overalls,Black/White,140.0,working hard to keep you comfortable.the nike ...,4.9,11.0,[],Black,nike sportswear swoosh men's overalls working ...
3,nike dri-fit one luxe,big kids' (girls') printed tights (extended size),Black/Rush Pink,22.97,elevated comfort goes full bloom.the nike dri-...,4.773913,0.0,[],Black,nike dri-fit one luxe big kids' (girls') print...
4,paris saint-germain repel academy awf,big kids' soccer jacket,Dark Grey/Black/Siren Red/Siren Red,70.0,water-repellent coverage gets psg details.the ...,4.773913,0.0,"['XS', 'S', 'M', 'L', 'XL']",Dark Grey,paris saint-germain repel academy awf big kids...
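The descriptions in the preview above contain sentences that run together without a space, such as `comfort.the`. Whether this affects the embeddings has not been measured here, but a small normalization pass is cheap to try. The sketch below is one possible approach; `split_fused_sentences` is a hypothetical helper, not part of the original pipeline.

```python
import re

# Matches a period directly followed by a lowercase letter, e.g. "comfort.the"
_FUSED_SENTENCE = re.compile(r"\.(?=[a-z])")

def split_fused_sentences(text: str) -> str:
    """Insert a space after periods that run straight into the next word."""
    return _FUSED_SENTENCE.sub(". ", text)

sample = "sweat-wicking comfort.the nike dri-fit team"
print(split_fused_sentences(sample))
# sweat-wicking comfort. the nike dri-fit team
```

The pattern would also split lowercase domain names such as `nike.com`, so it is worth checking a sample of the output before applying it to the whole column.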


### Step 2: Compute Embeddings

In [3]:
from sentence_transformers import SentenceTransformer

# Initialize SBERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

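# Compute embeddings for combined text (name + sub_title + description).
# Encoding the whole list in one call lets the model batch the inputs.
embeddings = model.encode(data['combined_text'].tolist(), show_progress_bar=True)
data['embedding'] = list(embeddings)

# Save the dataset. Note that CSV stores each embedding array as a string;
# the numeric vectors are cached separately with joblib in the optimization step.
data.to_csv('../data/semantic_search_with_embeddings.csv', index=False)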
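Writing the DataFrame to CSV serializes each embedding array as text, so reading the file back does not restore numeric vectors. The sketch below shows this on a made-up two-row frame and contrasts it with NumPy's binary format, which is one possible alternative. The toy data and in-memory buffers are assumptions for the demonstration only.

```python
import io

import numpy as np
import pandas as pd

# Toy frame standing in for the product data; the vectors are made up
df = pd.DataFrame({
    "name": ["a", "b"],
    "embedding": [np.array([0.1, 0.2]), np.array([0.3, 0.4])],
})

# Round trip through CSV: the arrays come back as plain strings
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
restored = pd.read_csv(csv_buffer)
print(type(restored["embedding"][0]))  # <class 'str'>

# Round trip through NumPy's binary format keeps shape and dtype
matrix = np.vstack(df["embedding"].tolist())
npy_buffer = io.BytesIO()
np.save(npy_buffer, matrix)
npy_buffer.seek(0)
loaded = np.load(npy_buffer)
print(loaded.shape, loaded.dtype)  # (2, 2) float64
```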



### Step 3: Calculate Cosine Similarity

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(query, data):
    # Compute embedding for the query
    query_embedding = model.encode(query)
    
    # Compute cosine similarity between query embedding and all product embeddings
    similarities = cosine_similarity([query_embedding], list(data['embedding']))
    
    # Get the indices of the top 5 similar products
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    
    return top_indices, similarities[0][top_indices]
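To make the scoring and the `argsort` slicing concrete, here is a small worked example with made-up 2-D vectors in place of real embeddings. `cosine_scores` and `top_k` are hypothetical helpers written with plain NumPy; they mirror what `cosine_similarity` and the `[-5:][::-1]` slice do in `search`, but are not the notebook's own functions.

```python
import numpy as np

def cosine_scores(query_vec, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    row_units = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return row_units @ query_unit

def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(scores)[-k:][::-1]

# Made-up vectors standing in for product embeddings
product_vecs = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
    [-1.0, 0.0],  # opposite direction
])
query_vec = np.array([2.0, 0.0])

scores = cosine_scores(query_vec, product_vecs)
print(scores.round(4).tolist())  # [1.0, 0.0, 0.7071, -1.0]
print(top_k(scores, 2))          # [0 2]
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` scores exactly 1.0 against `[1.0, 0.0]`.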

### Step 4: Rank Products

In [5]:
def rank_and_display_results(query, data):
    indices, scores = search(query, data)
    
    print(f"Top 5 results for query '{query}':")
    for idx, score in zip(indices, scores):
        product = data.iloc[idx]
        print(f"- {product['name']} - {product['sub_title']} (Score: {score:.4f})")
        print(f"  Description: {product['description'][:100]}...")
        print()

# Example query
query = "running shoes for men"
rank_and_display_results(query, data)

Top 5 results for query 'running shoes for men':
- zion 2 - men's basketball shoes (Score: 0.5316)
  Description: channel your skills in footwear designed for zion williamson and built for ballers at any level. an ...

- nike air pegasus 83 premium - men's shoes (Score: 0.5296)
  Description: travel first class in the nike air pegasus 83 premium—the callback classic. with a splash of running...

- nike offline pack - men's shoes (Score: 0.5220)
  Description: sleeping bags for your feet.get ready for a sensory experience like no other—the ergonomic design, p...

- nike air force 1 lv8 utility - big kids' shoes (Score: 0.5173)
  Description: legendary sneaker for a legendary stride.with a nod to the 1982 classic, the nike air force 1 lv8 ut...

- nike air deschutz+ se - men's shoes (Score: 0.5093)
  Description: from city hikes to canyon trails and long walks on the beach, the nike acg air deschutz+ is built to...



### Step 5: Query Expansion

In [6]:
import nltk
# Download wordnet
nltk.download('wordnet')
from nltk.corpus import wordnet

def expand_query(query):
    """
    Expands the query by adding synonyms from WordNet.
    
    Args:
    query (str): The original search query.
    
    Returns:
    list: A list of terms including the original query and its expansions.
    """
    terms = query.split()
    expanded_terms = []
    for term in terms:
        synsets = wordnet.synsets(term)
        if synsets:
            # Getting synonyms (lemmas) from the first synset
            for lemma in synsets[0].lemmas():
                expanded_terms.append(lemma.name())
    # Combine original terms with expanded terms and remove duplicates
    return list(set(expanded_terms + terms))

# Example usage
query = "running shoes for men"
expanded_query = expand_query(query)
print(f"Original query: {query}")
print(f"Expanded query: {' '.join(expanded_query)}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original query: running shoes for men
Expanded query: men workforce place running_play for running_game manpower hands shoes work_force running run
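Because `expand_query` returns `list(set(...))`, the word order of the original query is lost, and the order of the expanded string can differ between runs. Multi-word lemmas also keep their WordNet underscores. The sketch below shows one possible way to merge the terms so that the original query stays first and intact. `merge_terms` is a hypothetical helper, and the expansion list is copied from the output above.

```python
def merge_terms(original_terms, expanded_terms):
    """Combine terms without duplicates, keeping the original query order first."""
    # WordNet joins multi-word lemmas with underscores, e.g. "running_play"
    cleaned = [term.replace("_", " ") for term in expanded_terms]
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(original_terms + cleaned))

original = "running shoes for men".split()

# Terms copied from the expanded query printed above, in the order shown there
expanded = ["men", "workforce", "place", "running_play", "for", "running_game",
            "manpower", "hands", "shoes", "work_force", "running", "run"]

print(" ".join(merge_terms(original, expanded)))
# running shoes for men workforce place running play running game manpower hands work force run
```

This only stabilizes the ordering. Terms such as `workforce` or `place` come from the first WordNet synset regardless of word sense, so they still pull the query away from footwear.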


### Step 6: Knowledge Graph Simulation with NetworkX

In [7]:
import networkx as nx

# Create a directed graph
G = nx.DiGraph()

# Add nodes and edges based on product attributes
for _, row in data.iterrows():
    # Adding product node
    G.add_node(row['name'], type='Product', description=row['description'], color=row['dominant_color'])
    
    # Example: Add edges based on color
    G.add_edge(row['name'], row['dominant_color'], type='has_color')

def query_knowledge_graph(query, G):
    """
    Queries the knowledge graph simulation for products related to the query.
    
    Args:
    query (str): The search query.
    G (nx.DiGraph): The knowledge graph.
    
    Returns:
    list: A list of tuples with product details matching the query.
    """
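    results = []
    query_lower = query.lower()
    for name, attrs in G.nodes(data=True):
        # Color nodes are created implicitly by add_edge and have no 'type'
        if attrs.get('type') != 'Product':
            continue
        # The whole query must occur verbatim as a substring, so a multi-word
        # query such as "running shoes for men" can easily return no matches
        description, color = str(attrs['description']), str(attrs['color'])
        if query_lower in description.lower() or query_lower in color.lower():
            results.append((name, description, color))
    return results[:5]  # Limit to the first 5 matches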

# Example usage
query = "running shoes for men"
results = query_knowledge_graph(query, G)
print("\nKnowledge Graph Results:")
for name, desc, color in results:
    print(f"- Name: {name}, Color: {color}")
    print(f"  Description: {desc[:100]}...")
    print()


Knowledge Graph Results:
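The result list is empty because `query_knowledge_graph` looks for the entire query phrase as one substring, and no description or color contains "running shoes for men" verbatim. One possible alternative is to match individual words and rank products by how many of them appear. The sketch below assumes whitespace tokenization and a hypothetical toy graph; `query_by_tokens` is not part of the original notebook.

```python
import networkx as nx

def query_by_tokens(query, graph, limit=5):
    """Rank product nodes by how many query words appear in their text."""
    tokens = set(query.lower().split())
    scored = []
    for name, attrs in graph.nodes(data=True):
        # Skip color nodes, which carry no 'type' attribute
        if attrs.get("type") != "Product":
            continue
        text = f"{name} {attrs.get('description', '')} {attrs.get('color', '')}".lower()
        hits = len(tokens & set(text.split()))
        if hits:
            scored.append((hits, name))
    # Most overlapping words first; ties broken alphabetically by name
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]

# Hypothetical toy graph; product names and descriptions are made up
G = nx.DiGraph()
G.add_node("road runner", type="Product",
           description="lightweight running shoes for men", color="Black")
G.add_node("court king", type="Product",
           description="basketball shoes", color="White")
G.add_node("team tee", type="Product",
           description="long-sleeve t-shirt", color="Navy")
for product, color in [("road runner", "Black"), ("court king", "White"), ("team tee", "Navy")]:
    G.add_edge(product, color, type="has_color")

print(query_by_tokens("running shoes for men", G))  # ['road runner', 'court king']
```

Splitting on whitespace leaves punctuation attached to words, so real descriptions would likely need the same kind of cleanup as the text used for embeddings.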


### Cosine Similarity with Query Expansion

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_with_expansion(query, data):
    """
    Performs a search using an expanded query and returns the top 5 results.
    
    Args:
    query (str): The original search query.
    data (pd.DataFrame): The dataset with product information and embeddings.
    
    Returns:
    tuple: Indices of top results and their similarity scores.
    """
    expanded_query = expand_query(query)
    query_embedding = model.encode(' '.join(expanded_query))
    similarities = cosine_similarity([query_embedding], list(data['embedding']))
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    return top_indices, similarities[0][top_indices]

def rank_and_display_results(query, data):
    indices, scores = search_with_expansion(query, data)
    
    print(f"\nTop 5 results for query '{query}':")
    for idx, score in zip(indices, scores):
        product = data.iloc[idx]
        print(f"- {product['name']} - {product['sub_title']} (Score: {score:.4f})")
        print(f"  Description: {product['description'][:100]}...")
        print()

# Example query
query = "running shoes for men"
rank_and_display_results(query, data)


Top 5 results for query 'running shoes for men':
- zion 2 - men's basketball shoes (Score: 0.3397)
  Description: channel your skills in footwear designed for zion williamson and built for ballers at any level. an ...

- chelsea fc strike - men's nike dri-fit soccer track jacket (Score: 0.3351)
  Description: elevate your training.the chelsea fc strike track jacket has design details specifically tailored fo...

- portugal 2022/23 stadium home - men's nike dri-fit soccer shorts (Score: 0.3039)
  Description: like other shorts from our stadium collection, these ones pair replica design details with sweat-wic...

- air jordan xxxv low ds pf - basketball shoes (Score: 0.2981)
  Description: basketball players create separation by cutting quicker, running faster and jumping higher than the ...

- nike college dri-fit coach (duke) - men's shorts (Score: 0.2946)
  Description: sweat-wicking comfort for the stands.designed to fit the needs of coaches on the practice field, the...



### Step 7: Optimization Using Cached Embeddings

In [9]:
import joblib

# Save embeddings to disk for caching
joblib.dump(data['embedding'].tolist(), '../data/embeddings_cache.joblib')

# Load cached embeddings for faster access
cached_embeddings = joblib.load('../data/embeddings_cache.joblib')

def search_with_cache(query):
    """
    Performs a search using cached embeddings for speed optimization.
    
    Args:
    query (str): The search query.
    
    Returns:
    tuple: Indices of top results and their similarity scores.
    """
    query_embedding = model.encode(query)
    similarities = cosine_similarity([query_embedding], cached_embeddings)
    top_indices = np.argsort(similarities[0])[-5:][::-1]
    return top_indices, similarities[0][top_indices]

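# Example using cached embeddings
indices, scores = search_with_cache(query)
# Note: rank_and_display_results (redefined in the previous cell) calls
# search_with_expansion internally, so the listing printed below reflects the
# expanded query, not the cached search stored in `indices` and `scores`.
rank_and_display_results(query, data)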


Top 5 results for query 'running shoes for men':
- zion 2 - men's basketball shoes (Score: 0.3397)
  Description: channel your skills in footwear designed for zion williamson and built for ballers at any level. an ...

- chelsea fc strike - men's nike dri-fit soccer track jacket (Score: 0.3351)
  Description: elevate your training.the chelsea fc strike track jacket has design details specifically tailored fo...

- portugal 2022/23 stadium home - men's nike dri-fit soccer shorts (Score: 0.3039)
  Description: like other shorts from our stadium collection, these ones pair replica design details with sweat-wic...

- air jordan xxxv low ds pf - basketball shoes (Score: 0.2981)
  Description: basketball players create separation by cutting quicker, running faster and jumping higher than the ...

- nike college dri-fit coach (duke) - men's shorts (Score: 0.2946)
  Description: sweat-wicking comfort for the stands.designed to fit the needs of coaches on the practice field, the...
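The cell above writes the cache and reads it straight back in the same session, so the saving only shows up on later runs that can skip the encoding step. A minimal sketch of that load-or-compute pattern follows. It uses `np.save` and `np.load` instead of joblib, and a stand-in `fake_encode` function instead of `model.encode`, purely to keep the example self-contained; `load_or_compute` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np

def load_or_compute(cache_path, texts, encode_fn):
    """Return cached embeddings if the file exists, otherwise compute and store them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(encode_fn(texts))
    np.save(cache_path, embeddings)
    return embeddings

calls = {"count": 0}

def fake_encode(texts):
    """Stand-in for model.encode: maps each text to a tiny made-up vector."""
    calls["count"] += 1
    return [[float(len(t)), float(t.count(" "))] for t in texts]

texts = ["running shoes", "soccer jersey"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    first = load_or_compute(path, texts, fake_encode)   # computes and saves
    second = load_or_compute(path, texts, fake_encode)  # loads from disk

print(calls["count"])                 # 1
print(np.array_equal(first, second))  # True
```

This sketch never invalidates the cache, so the file would need to be deleted or versioned whenever the product texts or the model change.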

