# STEP 1: SCRAPE leetcode.com

P.S. No need to scrape; I found an existing dataset.

In the future I'll scrape all of LeetCode to build a DB, but for now I'm using:

https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset

# STEP 2: Preprocessing and Vectorizing with TF-IDF

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset (assuming it's in CSV format)
df = pd.read_csv('leetcode_problems.csv')

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the description column
tfidf_matrix = vectorizer.fit_transform(df['description'])

# Function to find similar problems based on a user query
def find_similar_problems(user_query, top_n=5):
    # Transform the user query using the same vectorizer
    query_vec = vectorizer.transform([user_query])
    
    # Compute the cosine similarity between the query and all problem descriptions
    cosine_sim = cosine_similarity(query_vec, tfidf_matrix).flatten()
    
    # Get the top N similar problems
    similar_indices = cosine_sim.argsort()[-top_n:][::-1]
    
    # Return the title, URL, difficulty, and acceptance rate of the most similar problems
    return df.iloc[similar_indices][['title', 'url', 'difficulty', 'acceptance_rate']]

# Example: User input description
user_query = "find 2 numbers in an array with given sum"
similar_problems = find_similar_problems(user_query)

# Display the most similar problems
similar_problems


Unnamed: 0,title,url,difficulty,acceptance_rate
166,Two Sum II - Input array is sorted,https://leetcode.com/problems/two-sum-ii-input...,Easy,55.8
547,Split Array with Equal Sum,https://leetcode.com/problems/split-array-with...,Medium,48.3
1507,Range Sum of Sorted Subarray Sums,https://leetcode.com/problems/range-sum-of-sor...,Medium,60.4
1261,Greatest Sum Divisible by Three,https://leetcode.com/problems/greatest-sum-div...,Medium,49.9
1678,Max Number of K-Sum Pairs,https://leetcode.com/problems/max-number-of-k-...,Medium,53.9
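Several of the matches above share the words "array" and "sum" with the query without being the same kind of problem. That is expected, since TF-IDF only rewards overlapping keywords. A minimal sketch on a made-up toy corpus (the three descriptions and the `toy_*` names are assumptions for illustration, not rows from the dataset) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (invented for illustration, not taken from the dataset)
toy_descriptions = [
    "find two numbers in an array that add up to a target sum",
    "reverse a linked list in place",
    "longest substring without repeating characters",
]

# Separate names so the notebook's real vectorizer is not overwritten
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

toy_query = "numbers in an array with given sum"
toy_scores = cosine_similarity(
    toy_vectorizer.transform([toy_query]), toy_matrix
).flatten()

# Highest score first; descriptions sharing no keyword with the query score 0
toy_ranking = toy_scores.argsort()[::-1]
for idx in toy_ranking:
    print(f"{toy_scores[idx]:.3f}  {toy_descriptions[idx]}")
```

Only the first description shares any non-stop-word with the query, so the other two get a score of exactly zero no matter how the query is phrased.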


# APPROACH 2: SENTENCE EMBEDDINGS

In [None]:
%pip install sentence-transformers

In [8]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset
df = pd.read_csv('leetcode_problems.csv')

# Load the pre-trained Sentence-BERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Encode the problem descriptions into sentence embeddings
description_embeddings = model.encode(df['description'].tolist(), show_progress_bar=True)


Batches: 100%|██████████| 58/58 [00:13<00:00,  4.39it/s]
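Encoding every description takes a while on each run, so it may be worth caching the result on disk. A minimal sketch, assuming a hypothetical helper `load_or_compute_embeddings` and a cache file name of my own choosing (neither is part of the notebook above):

```python
import os
import numpy as np

def load_or_compute_embeddings(cache_path, compute_fn):
    """Load embeddings from cache_path if it exists, else compute and save them.

    cache_path should end in '.npy', because np.save appends that
    extension when it is missing and the existence check would then miss it.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(compute_fn())
    np.save(cache_path, embeddings)
    return embeddings

# Possible usage with the model and DataFrame from the cell above
# (left commented out; the file name is an assumption):
# description_embeddings = load_or_compute_embeddings(
#     "description_embeddings.npy",
#     lambda: model.encode(df["description"].tolist(), show_progress_bar=True),
# )
```

The cache is not invalidated automatically, so the file would need to be deleted whenever the dataset or the model changes.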


Unnamed: 0,title,url,difficulty,acceptance_rate,discuss_count
423,Longest Repeating Character Replacement,https://leetcode.com/problems/longest-repeatin...,Medium,48.3,379
1155,Swap For Longest Repeated Character Substring,https://leetcode.com/problems/swap-for-longest...,Medium,47.1,172
1061,Longest Repeating Substring,https://leetcode.com/problems/longest-repeatin...,Medium,58.4,114
1519,Maximum Number of Non-Overlapping Substrings,https://leetcode.com/problems/maximum-number-o...,Hard,36.5,90
1370,Find the Longest Substring Containing Vowels i...,https://leetcode.com/problems/find-the-longest...,Medium,60.8,

In [14]:

# Function to find similar problems based on a user query
def find_similar_problems(user_query, top_n=7):
    # Encode the user query
    query_embedding = model.encode([user_query])
    
    # Compute the cosine similarity between the query and all problem descriptions
    cosine_sim = cosine_similarity(query_embedding, description_embeddings).flatten()
    
    # Get the top N similar problems
    similar_indices = cosine_sim.argsort()[-top_n:][::-1]
    
    # Return the titles, URLs, and other info of the most similar problems
    return df.iloc[similar_indices][['title', 'url', 'difficulty', 'acceptance_rate', 'discuss_count']]

# Example: User input description
user_query = "find longest substring in the string in which the elements repeats atleast k times"
similar_problems = find_similar_problems(user_query)

# Display the most similar problems
similar_problems

Unnamed: 0,title,url,difficulty,acceptance_rate,discuss_count
1003,Max Consecutive Ones III,https://leetcode.com/problems/max-consecutive-...,Medium,60.8,615
1061,Longest Repeating Substring,https://leetcode.com/problems/longest-repeatin...,Medium,58.4,114
394,Longest Substring with At Least K Repeating Ch...,https://leetcode.com/problems/longest-substrin...,Medium,43.6,546
299,Longest Increasing Subsequence,https://leetcode.com/problems/longest-increasi...,Medium,44.5,999
672,Number of Longest Increasing Subsequence,https://leetcode.com/problems/number-of-longes...,Medium,38.6,308
561,Longest Line of Consecutive One in Matrix,https://leetcode.com/problems/longest-line-of-...,Medium,46.2,224
127,Longest Consecutive Sequence,https://leetcode.com/problems/longest-consecut...,Hard,46.5,999
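The table above lists the matches in order but not how close each one is to the query. A minimal sketch of the same ranking written out in plain NumPy, returning the scores too, might look like this (the function name `top_k_by_cosine` is hypothetical, and it assumes no row is an all-zero vector):

```python
import numpy as np

def top_k_by_cosine(query_vec, matrix, k):
    """Return (indices, scores) of the k rows of matrix most similar to query_vec.

    Assumes query_vec and every row of matrix are non-zero vectors.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)

    # Cosine similarity is the dot product of unit-length vectors
    query_unit = query_vec / np.linalg.norm(query_vec)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit

    # Sort descending and keep the first k
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

# Tiny hand-checkable example
indices, scores = top_k_by_cosine([1, 0], [[1, 0], [0, 1], [1, 1]], k=2)
print(indices, scores)
```

With the notebook's variables, something like `top_k_by_cosine(model.encode([user_query])[0], description_embeddings, 7)` should give the same ordering as `find_similar_problems`, and the scores could then be attached to the returned rows as an extra column.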
