## Building the Search Engine for E-commerce Product Recommendation

### Project Overview
This notebook builds a semantic search engine for e-commerce product recommendations, using a dataset of Nike products. I combine product names, subtitles, and descriptions into a single text field so that a query can match on any of them. The steps include preprocessing, embedding computation, similarity calculation, ranking, knowledge graph integration, query expansion, and optimization.

### Steps to Implement

1. **Preprocess the Dataset**: Combine the name, subtitle, and description fields into one text field for embedding computation.
2. **Compute Embeddings**: Use SBERT to generate embeddings for a concatenated string of name, subtitle, and description.
3. **Calculate Cosine Similarity**: Compare the embeddings to find relevant products.
4. **Rank Products**: Sort results based on similarity scores.
5. **Knowledge Graph Integration**: Use Neo4j to enhance search capabilities.
6. **Query Expansion**: Allow users to expand their search terms.
7. **Optimization**: Implement caching for embeddings to improve speed.
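Steps 6 and 7 can be sketched without any model at all. The following is a minimal illustration, not the notebook's implementation: the `SYNONYMS` table, `expand_query`, and `EmbeddingCache` are hypothetical names introduced here, and the cache accepts any encoding function.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

# Hypothetical synonym table; a real one might come from a thesaurus
# or from relationships stored in the knowledge graph.
SYNONYMS: Dict[str, List[str]] = {
    "sneakers": ["shoes", "trainers"],
    "tee": ["t-shirt"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Append known synonyms of each query term to the query string."""
    terms = query.lower().split()
    extra = [s for t in terms for s in synonyms.get(t, []) if s not in terms]
    return " ".join(terms + extra)


class EmbeddingCache:
    """Memoize embeddings, keyed by a hash of the input text."""

    def __init__(self, encode_fn: Callable[[str], Sequence[float]]):
        self._encode_fn = encode_fn
        self._store: Dict[str, Sequence[float]] = {}

    def get(self, text: str) -> Sequence[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Only call the (potentially slow) encoder on a cache miss.
        if key not in self._store:
            self._store[key] = self._encode_fn(text)
        return self._store[key]
```

In this notebook the cache could wrap the SBERT encoder, for example `EmbeddingCache(model.encode)`, so that repeated queries skip recomputation.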

### Why SBERT?
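SBERT (Sentence-BERT) is chosen because it produces dense sentence-level vectors whose cosine similarity reflects semantic similarity, so a query can match a product even when the two share few exact words. This suits our multi-field approach, in which name, subtitle, and description are embedded as one string.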

### Why Neo4j for Knowledge Graph?
Neo4j stores and queries relationships efficiently, and the graph can be extended to include product names, subtitles, and descriptions, providing richer context for search.


### Step 1: Preprocess the Dataset

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the cleaned dataset
data = pd.read_csv('../data/semantic_search_ready_data.csv')

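# Combine name, sub_title, and description for embedding.
# Fill missing values per column first: concatenating with NaN yields NaN,
# which would otherwise blank out the whole combined text for that row.
data['combined_text'] = (
    data['name'].fillna('') + ' '
    + data['sub_title'].fillna('') + ' '
    + data['description'].fillna('')
).str.strip()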
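String concatenation with `+` in pandas propagates missing values, so the order of filling and joining matters. The sketch below shows the difference on a toy frame; the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration; the second product has no subtitle.
toy = pd.DataFrame({
    "name": ["air zoom", "club fleece"],
    "sub_title": ["men's running shoes", None],
    "description": ["light and fast.", "soft and warm."],
})

# Concatenating first: one missing field turns the whole result into NaN.
naive = toy["name"] + " " + toy["sub_title"] + " " + toy["description"]

# Filling each field first keeps the text that is present;
# split/join then collapses the double space left by the empty field.
safe = (
    toy["name"].fillna("") + " "
    + toy["sub_title"].fillna("") + " "
    + toy["description"].fillna("")
).str.split().str.join(" ")

print(naive.isna().tolist())
print(safe.tolist())
```

Filling after concatenation would replace the second row with an empty string, so that product would be embedded with no text at all.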

In [2]:
data.head()

| Unnamed: 0 | name | sub_title | color | price | description | avg_rating | review_count | parsed_sizes | dominant_color | combined_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | nike dri-fit team (mlb minnesota twins) | men's long-sleeve t-shirt | Navy | 40.0 | sweat-wicking comfort.the nike dri-fit team (m... | 4.773913 | 0.0 | ['S', 'M', 'L', 'XL', '2XL'] | Navy | nike dri-fit team (mlb minnesota twins) men's ... |
| 1 | club américa | women's nike dri-fit soccer jersey dress | Black/Black | 90.0 | inspired by traditional soccer jerseys, the cl... | 5.0 | 1.0 | ['L (12–14)'] | Black | club américa women's nike dri-fit soccer jerse... |
| 2 | nike sportswear swoosh | men's overalls | Black/White | 140.0 | working hard to keep you comfortable.the nike ... | 4.9 | 11.0 | [] | Black | nike sportswear swoosh men's overalls working ... |
| 3 | nike dri-fit one luxe | big kids' (girls') printed tights (extended size) | Black/Rush Pink | 22.97 | elevated comfort goes full bloom.the nike dri-... | 4.773913 | 0.0 | [] | Black | nike dri-fit one luxe big kids' (girls') print... |
| 4 | paris saint-germain repel academy awf | big kids' soccer jacket | Dark Grey/Black/Siren Red/Siren Red | 70.0 | water-repellent coverage gets psg details.the ... | 4.773913 | 0.0 | ['XS', 'S', 'M', 'L', 'XL'] | Dark Grey | paris saint-germain repel academy awf big kids... |


### Step 2: Compute Embeddings

In [3]:
from sentence_transformers import SentenceTransformer

# Initialize SBERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

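# Compute embeddings for combined text (name + sub_title + description).
# Encoding the whole list in one call lets the model batch the inputs,
# which is typically faster than calling encode() once per row.
embeddings = model.encode(data['combined_text'].tolist())
data['embedding'] = list(embeddings)

# Save embeddings. Note: CSV stores each array as its string representation,
# so the column must be parsed back into arrays after reloading.
data.to_csv('../data/semantic_search_with_embeddings.csv', index=False)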
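A column of NumPy arrays written to CSV is stored as text, so it does not load back as numeric data without extra parsing. One alternative, sketched here under assumptions, is to keep the vectors in a separate binary `.npy` file. The file name is hypothetical, and small random vectors stand in for the SBERT output.

```python
import os
import tempfile

import numpy as np

# Stand-in for per-product SBERT vectors (random values, illustrative only).
rng = np.random.default_rng(0)
embeddings = [rng.random(4, dtype=np.float32) for _ in range(3)]

# Stack the per-row vectors into one (n_products, dim) matrix.
matrix = np.vstack(embeddings)

# Hypothetical file name; row i corresponds to product i in the DataFrame.
path = os.path.join(tempfile.mkdtemp(), "product_embeddings.npy")
np.save(path, matrix)

# Loading restores the same shape, dtype, and values.
restored = np.load(path)
print(restored.shape, restored.dtype)
```

This keeps the product table and the embedding matrix aligned by row position, which only holds as long as the DataFrame is not reordered or filtered between saving and loading.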



### Step 3: Calculate Cosine Similarity

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

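def search(query, data, top_k=5):
    """Return positional indices and scores of the top_k products most similar to the query."""
    # Compute embedding for the query
    query_embedding = model.encode(query)

    # Stack product embeddings into an (n_products, dim) matrix and
    # compute cosine similarity between the query and every product
    product_embeddings = np.vstack(data['embedding'].tolist())
    similarities = cosine_similarity([query_embedding], product_embeddings)[0]

    # argsort is ascending: take the last top_k indices and reverse them
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    return top_indices, similarities[top_indices]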
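The scoring and ranking inside `search` reduce to a cosine similarity followed by a descending sort. The sketch below reproduces that logic in plain NumPy instead of scikit-learn's `cosine_similarity`, using tiny hand-made vectors in place of SBERT embeddings; `rank_by_cosine` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, product_vecs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k products most similar to the query.

    Assumes all vectors are non-zero; a zero vector would divide by zero.
    """
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    p = product_vecs / np.linalg.norm(product_vecs, axis=1, keepdims=True)
    scores = p @ q
    # argsort is ascending, so take the last k entries and reverse them.
    order = np.argsort(scores)[-k:][::-1]
    return order, scores[order]


# Hand-made 2-D vectors standing in for product embeddings.
products = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])

indices, scores = rank_by_cosine(query, products, k=2)
print(indices, scores)
```

The indices returned by `search` are positional, so `data.iloc[top_indices]` retrieves the matching product rows in ranked order.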