# Project Title: Hybrid Recommendation System for Similar Questions

## Introduction
This project aims to develop a **Hybrid Recommendation System** to predict similar coding problems based on multiple textual features such as titles, topic tags, and problem descriptions. The system integrates **Sentence-BERT embeddings**, **TF-IDF weights**, and **XGBoost-based multi-label classification** to provide accurate recommendations.

The methodology followed for this project adheres to the **CRISP-DM** framework, which includes:
1. **Business Understanding**: Understanding the problem and defining the goals.
2. **Data Understanding**: Exploration and preprocessing of the data.
3. **Data Preparation**: Feature engineering and transformation.
4. **Modeling**: Building the hybrid recommendation system.
5. **Evaluation**: Assessing the system using metrics such as Precision, Recall, F1-Score, and Hamming Loss.
6. **Deployment**: Preparing the model for practical use and visualization of results.

---


# Experiment 1: Simple Content-Based Recommender System

## Objective
To build a **content-based recommendation system** that suggests similar coding problems using:
- **TF-IDF Vectorization** for text descriptions.
- **Cosine Similarity** to measure problem similarity.
- **Popularity Score** to rank problems based on acceptance, engagement, and submission metrics.

---

## Workflow

1. **Text Preprocessing**:
   - Convert `problem_description` into numerical features using **TF-IDF**.
   - Remove stop words and handle missing values.

2. **Popularity Score**:
   - Min-max normalize acceptance rate, likes (engagement), and submission count, then combine them with fixed weights:
     \[
     \text{popularity\_score} = 0.3 \times \text{Acceptance} + 0.5 \times \text{Engagement} + 0.2 \times \text{Submissions}
     \]

3. **Recommendations**:
   - **Content-Based**: Use **Cosine Similarity** to find top `n` most similar problems based on text.
   - **Popularity-Based**: Rank the top similar problems using the calculated popularity score.
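The popularity score in step 2 can be illustrated with a minimal sketch. The toy values and the `min_max` helper below are assumptions made for illustration, not project data; the guard for a constant column is an addition that the notebook code does not have.

```python
import pandas as pd

# Toy metrics for three hypothetical problems (illustrative values only).
toy = pd.DataFrame({
    "acceptance": [50.0, 40.0, 30.0],
    "likes": [1000.0, 500.0, 0.0],
    "submission": [200.0, 100.0, 300.0],
})

def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

# Weighted sum with the 0.3 / 0.5 / 0.2 weights from the formula above.
toy["popularity_score"] = (
    0.3 * min_max(toy["acceptance"])
    + 0.5 * min_max(toy["likes"])
    + 0.2 * min_max(toy["submission"])
)
print(toy["popularity_score"].round(2).tolist())  # → [0.9, 0.4, 0.2]
```

Because `likes` carries the largest weight, a problem with few likes scores low even if it has the most submissions, as the third row shows.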

---

## Key Components

- **TextProcessor**: Preprocesses text data using TF-IDF.
- **PopularityCalculator**: Calculates the weighted popularity score.
- **ProblemRecommender**:
   - `recommend_similar_problems`: Finds top `n` problems based on cosine similarity.
   - `recommend_top_problems`: Ranks problems based on popularity.
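As a rough sketch of what `recommend_similar_problems` does, the snippet below ranks toy descriptions by cosine similarity of their TF-IDF vectors. The descriptions and the `top_n_similar` helper are hypothetical. Unlike the notebook code, this variant excludes the query row from its own results, which is a design choice made here and not something the original does.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy problem descriptions (hypothetical, for illustration only).
descriptions = [
    "find two numbers in an array that add up to a target",
    "find three numbers in an array that sum to zero",
    "reverse a singly linked list",
    "search for a target value in a sorted array",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

def top_n_similar(query_idx, n=2):
    """Return the positions of the n most similar rows, excluding the query."""
    scores = cosine_similarity(tfidf[query_idx], tfidf).ravel()
    scores[query_idx] = -1.0  # push the query itself to the end of the ranking
    return np.argsort(scores)[::-1][:n].tolist()

print(top_n_similar(0))
```

The linked-list description shares no terms with the query, so it is never among the two nearest neighbours of row 0.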

---

## Function: recommender_system

### Input:
- `problem_id` (default: `1`): ID of the problem for which to generate recommendations.

### Output:
A DataFrame with the following columns:
- `title`: Problem title.
- `difficulty`: Difficulty level.
- `topic_tags`: Tags associated with the problem.
- `problem_URL`: URL link to the problem.

### Usage:
```python
recommendations = recommender_system(problem_id=1)
print(recommendations)
```


In [2]:
!pip install pandas numpy scikit-learn


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
from recommender_system import recommender_system
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score

In [202]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path

# current_directory = Path(__file__).resolve().parent
csv_file_path =  "preprocessed_data.csv"

# Constants
ACCEPTANCE_WEIGHT = 0.3
ENGAGEMENT_WEIGHT = 0.5
SUBMISSION_WEIGHT = 0.2
SMOOTHING_FACTOR = 100000

class TextProcessor:
    def __init__(self, text_data):
        self.text_data = text_data
        self.text_preprocessor = TfidfVectorizer(stop_words='english')

    def preprocess_text_data(self):
        # Replace missing descriptions with empty strings without mutating the caller's column
        text = self.text_data.fillna('')
        return self.text_preprocessor.fit_transform(text)

class PopularityCalculator:
    def __init__(self, acceptance, engagement, submission):
        self.acceptance = acceptance
        self.engagement = engagement
        self.submission = submission

    def calculate_engagement_score(self):
        total_engagement = self.acceptance.fillna(0) + self.engagement.fillna(0) + SMOOTHING_FACTOR
        max_engagement = total_engagement.max()
        return total_engagement / max_engagement

    def normalize_series(self, series):
        return (series - series.min()) / (series.max() - series.min())

    def calculate_popularity_score(self):
        return (
            self.normalize_series(self.acceptance) * ACCEPTANCE_WEIGHT +
            self.normalize_series(self.engagement) * ENGAGEMENT_WEIGHT +
            self.normalize_series(self.submission) * SUBMISSION_WEIGHT
        )

class ProblemRecommender:
    @staticmethod
    def recommend_similar_problems(df, problem_id, X_processed, n=10):
        idx = df[df['id'] == problem_id].index
        if len(idx) == 0:
            return pd.DataFrame()  # Return empty DataFrame if problem_id is not found
        idx = idx[0]
        sim_scores = cosine_similarity(X_processed[idx], X_processed).flatten()
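        # Note: the query problem is not excluded (the line below is disabled),
        # so it is usually among the returned rows.
        # sim_scores[idx] = 0  # Exclude the problem itself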
        sim_indices = sim_scores.argsort()[-n:][::-1]
        return df.iloc[sim_indices]

    @staticmethod
    def recommend_top_problems(df, n=10):
        return df.sort_values(by='popularity_score', ascending=False).head(n)

def recommender_system(problem_id=1):
    df = pd.read_csv(csv_file_path)
    df['topic_tags'] = df['topic_tags'].str.replace("'", "")

    text_processor = TextProcessor(df['problem_description'])
    X_processed = text_processor.preprocess_text_data()

    popularity_calculator = PopularityCalculator(df['acceptance'], df['likes'], df['submission'])
    df['popularity_score'] = popularity_calculator.calculate_popularity_score()

    content_based_recommendations = ProblemRecommender.recommend_similar_problems(df, problem_id=problem_id, X_processed=X_processed)
    popularity_based_recommendations = ProblemRecommender.recommend_top_problems(content_based_recommendations)

    return popularity_based_recommendations[['title', 'difficulty', 'topic_tags', 'problem_URL']]


In [203]:
results_df = recommender_system(1)
results_df.head()

Unnamed: 0,title,difficulty,topic_tags,problem_URL
0,1. Two Sum,Easy,"Array, Hash Table",https://leetcode.com/problems/two-sum
32,33. Search in Rotated Sorted Array,Medium,"Array, Binary Search",https://leetcode.com/problems/search-in-rotate...
33,34. Find First and Last Position of Element in...,Medium,"Array, Binary Search",https://leetcode.com/problems/find-first-and-l...
34,35. Search Insert Position,Easy,"Array, Binary Search",https://leetcode.com/problems/search-insert-po...
703,704. Binary Search,Easy,"Array, Binary Search",https://leetcode.com/problems/binary-search


In [204]:
# Load the preprocessed data
df = pd.read_csv("preprocessed_data.csv")
df.head()
# df.iloc[0].title == results_df.iloc[0].title

Unnamed: 0,id,page_number,is_premium,title,problem_description,topic_tags,difficulty,similar_questions,no_similar_questions,acceptance,accepted,submission,solution,discussion_count,likes,dislikes,problem_URL,solution_URL
0,1,1,False,1. Two Sum,Given an array of integers nums and an integer...,"'Array', 'Hash Table'",Easy,"[""'3Sum'"", ""'4Sum'"", ""'Two Sum II - Input Arra...",21.0,51.0,11300000.0,22100000.0,26800.0,638.0,52700.0,1700.0,https://leetcode.com/problems/two-sum,https://leetcode.com/problems/two-sum/solution
1,2,1,False,2. Add Two Numbers,You are given two non-empty linked lists repre...,"'Linked List', 'Math', 'Recursion'",Medium,"[""'Multiply Strings'"", ""'Add Binary'"", ""'Sum o...",8.0,41.5,4000000.0,9700000.0,15700.0,428.0,28900.0,5600.0,https://leetcode.com/problems/add-two-numbers,https://leetcode.com/problems/add-two-numbers/...
2,3,1,False,3. Longest Substring Without Repeating Characters,"Given a string s, find the length of the longe...","'Hash Table', 'String', 'Sliding Window'",Medium,"[""'Longest Substring with At Most Two Distinct...",9.0,34.1,5100000.0,14900000.0,18100.0,237.0,37700.0,1700.0,https://leetcode.com/problems/longest-substrin...,https://leetcode.com/problems/longest-substrin...
3,4,1,False,4. Median of Two Sorted Arrays,Given two sorted arrays nums1 and nums2 of siz...,"'Array', 'Binary Search', 'Divide and Conquer'",Hard,"[""'Median of a Row Wise Sorted Matrix'""]",1.0,38.3,2200000.0,5800000.0,14100.0,304.0,26600.0,2900.0,https://leetcode.com/problems/median-of-two-so...,https://leetcode.com/problems/median-of-two-so...
4,5,1,False,5. Longest Palindromic Substring,"Given a string s, return the longest palindrom...","'String', 'Dynamic Programming'",Medium,"[""'Shortest Palindrome'"", ""'Palindrome Permuta...",6.0,33.2,2700000.0,8200000.0,9600.0,225.0,27900.0,1600.0,https://leetcode.com/problems/longest-palindro...,https://leetcode.com/problems/longest-palindro...


In [207]:
# Function to parse similar problems from the ground truth column
def parse_similar_problems(similar_problems_str):
    if pd.isna(similar_problems_str):
        return []
    # Strip brackets and split by comma, then clean each string
    similar_problems_str = similar_problems_str.strip("[]")
    return [problem.strip().strip("'\"") for problem in similar_problems_str.split(",")]

# Add a column with parsed similar problems as lists
df["similar_problems_list"] = df["similar_questions"].apply(parse_similar_problems)
df.head()

Unnamed: 0,id,page_number,is_premium,title,problem_description,topic_tags,difficulty,similar_questions,no_similar_questions,acceptance,accepted,submission,solution,discussion_count,likes,dislikes,problem_URL,solution_URL,similar_problems_list
0,1,1,False,1. Two Sum,Given an array of integers nums and an integer...,"'Array', 'Hash Table'",Easy,"[""'3Sum'"", ""'4Sum'"", ""'Two Sum II - Input Arra...",21.0,51.0,11300000.0,22100000.0,26800.0,638.0,52700.0,1700.0,https://leetcode.com/problems/two-sum,https://leetcode.com/problems/two-sum/solution,"[3Sum, 4Sum, Two Sum II - Input Array Is Sorte..."
1,2,1,False,2. Add Two Numbers,You are given two non-empty linked lists repre...,"'Linked List', 'Math', 'Recursion'",Medium,"[""'Multiply Strings'"", ""'Add Binary'"", ""'Sum o...",8.0,41.5,4000000.0,9700000.0,15700.0,428.0,28900.0,5600.0,https://leetcode.com/problems/add-two-numbers,https://leetcode.com/problems/add-two-numbers/...,"[Multiply Strings, Add Binary, Sum of Two Inte..."
2,3,1,False,3. Longest Substring Without Repeating Characters,"Given a string s, find the length of the longe...","'Hash Table', 'String', 'Sliding Window'",Medium,"[""'Longest Substring with At Most Two Distinct...",9.0,34.1,5100000.0,14900000.0,18100.0,237.0,37700.0,1700.0,https://leetcode.com/problems/longest-substrin...,https://leetcode.com/problems/longest-substrin...,[Longest Substring with At Most Two Distinct C...
3,4,1,False,4. Median of Two Sorted Arrays,Given two sorted arrays nums1 and nums2 of siz...,"'Array', 'Binary Search', 'Divide and Conquer'",Hard,"[""'Median of a Row Wise Sorted Matrix'""]",1.0,38.3,2200000.0,5800000.0,14100.0,304.0,26600.0,2900.0,https://leetcode.com/problems/median-of-two-so...,https://leetcode.com/problems/median-of-two-so...,[Median of a Row Wise Sorted Matrix]
4,5,1,False,5. Longest Palindromic Substring,"Given a string s, return the longest palindrom...","'String', 'Dynamic Programming'",Medium,"[""'Shortest Palindrome'"", ""'Palindrome Permuta...",6.0,33.2,2700000.0,8200000.0,9600.0,225.0,27900.0,1600.0,https://leetcode.com/problems/longest-palindro...,https://leetcode.com/problems/longest-palindro...,"[Shortest Palindrome, Palindrome Permutation, ..."
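Splitting on commas breaks any title that itself contains a comma. A hedged alternative, assuming each `similar_questions` cell is a valid Python list literal as the preview above suggests, is to parse it with `ast.literal_eval`. The function name `parse_similar_questions` and the second example title are hypothetical.

```python
import ast

def parse_similar_questions(raw):
    """Parse a stringified list of quoted titles into a list of clean titles."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # covers NaN and empty cells
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed cell: treat as "no ground truth"
    if not isinstance(items, (list, tuple)):
        return []
    # Each item still carries its inner quotes, e.g. "'3Sum'".
    return [str(item).strip().strip("'\"") for item in items]

example = '["\'3Sum\'", "\'Best Time to Buy, Sell\'"]'
print(parse_similar_questions(example))  # → ['3Sum', 'Best Time to Buy, Sell']
```

A title with an embedded comma stays in one piece here, whereas the comma-split version would produce two fragments that can never match a real title.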



In [209]:
# Function to clean recommendation titles by removing numbering and periods
def clean_recommendation_title(title):
    # Remove leading numbers and periods (e.g., "1. Two Sum" -> "Two Sum")
    return title.split('. ', 1)[-1].strip()

def evaluate_recommender(df, k=10):
    precision_scores = []
    recall_scores = []
    
    for idx, row in df.iterrows():
        problem_id = row["id"]
        ground_truth = set(row["similar_problems_list"])
        
        if not ground_truth:
            # Skip problems with no ground truth
            continue
        
        # Get model recommendations for the current problem
        recommended_problems = recommender_system(problem_id)
        recommended_titles = recommended_problems["title"].map(clean_recommendation_title)
        # print(recommended_titles, 'hereee')
        
        # Compute relevant items
        relevant_items = set(recommended_titles) & ground_truth
        
        # Precision@k
        precision_at_k = len(relevant_items) / k
        precision_scores.append(precision_at_k)
        
        # Recall@k
        recall_at_k = len(relevant_items) / len(ground_truth)
        recall_scores.append(recall_at_k)
    
    # Return average metrics
    avg_precision = np.mean(precision_scores)
    avg_recall = np.mean(recall_scores)
    return avg_precision, avg_recall

# Evaluate the recommender system
precision, recall = evaluate_recommender(df, k=10)
print(f"Precision@10: {precision:.2f}")
print(f"Recall@10: {recall:.2f}")

Precision@10: 0.03
Recall@10: 0.13
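To make the two metrics concrete, here is a small worked sketch for a single query. The item names are placeholders; the counts (10 recommendations, 21 ground-truth items, 2 in common) are chosen only to show the arithmetic. The helper truncates the recommendation list to `k`, which is an assumption of this sketch.

```python
def precision_recall_at_k(recommended, ground_truth, k):
    """Precision@k and Recall@k for one query, using only the first k recommendations."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(ground_truth))
    precision = hits / k if k else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Placeholder items: 10 recommendations, 21 ground-truth items, 2 in common.
recommended = [f"rec_{i}" for i in range(8)] + ["hit_a", "hit_b"]
ground_truth = {"hit_a", "hit_b"} | {f"gt_{i}" for i in range(19)}

p, r = precision_recall_at_k(recommended, ground_truth, k=10)
print(round(p, 2), round(r, 3))  # → 0.2 0.095
```

With a long ground-truth list, recall stays low even when a fifth of the recommendations are correct, because at most `k` of the 21 items can ever be retrieved.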


In [22]:
# Function to clean recommendation titles by removing numbering and periods
def clean_recommendation_title(title):
    # Remove leading numbers and periods (e.g., "1. Two Sum" -> "Two Sum")
    return title.split('. ', 1)[-1].strip()

def evaluate_recommender(df, k=10):
    precision_scores = []
    recall_scores = []
    
    for idx, row in df.iterrows():
        problem_id = row["id"]
        ground_truth = set(row["similar_problems_list"])
        
        if not ground_truth:
            # Skip problems with no ground truth
            continue
        
        # Get model recommendations for the current problem
        recommended_problems = recommender_system(problem_id)
        recommended_titles = recommended_problems["title"].map(clean_recommendation_title)
        # print(problem_id, 'idddddd')
        # print(set(recommended_titles), 'hereee')
        # print(ground_truth, 'ground truth')
        # print(set(recommended_titles) & ground_truth, 'common')
        
        # Compute relevant items
        relevant_items = set(recommended_titles) & ground_truth
        
        # Precision@k
        precision_at_k = len(relevant_items) / k
        precision_scores.append(precision_at_k)
        
        # Recall@k
        recall_at_k = len(relevant_items) / len(ground_truth)
        recall_scores.append(recall_at_k)
    
    # Return average metrics
    avg_precision = np.mean(precision_scores)
    avg_recall = np.mean(recall_scores)
    return avg_precision, avg_recall

# Evaluate the recommender system
precision, recall = evaluate_recommender(df.head(), k=10)
print(f"Precision@10: {precision:.2f}")
print(f"Recall@10: {recall:.2f}")

1 idddddd
{'Count Pairs Whose Sum is Less than Target', 'Find Target Indices After Sorting Array', 'Search Insert Position', 'Minimum Size Subarray Sum', '4Sum', 'Find First and Last Position of Element in Sorted Array', 'Two Sum', '3Sum Closest', 'Binary Search', 'Search in Rotated Sorted Array'} hereee
{'4Sum', 'First Letter to Appear Twice', 'Largest Positive Integer That Exists With Its Negative', 'Number of Pairs of Strings With Concatenation Equal to Target', 'Two Sum II - Input Array Is Sorted', '3Sum', 'Number of Excellent Pairs', 'Two Sum IV - Input is a BST', 'Find Subarrays With Equal Sum', 'Number of Distinct Averages', 'Subarray Sum Equals K', 'Two Sum Less Than K', 'Count Number of Pairs With Absolute Difference K', 'Count Good Meals', 'Max Number of K-Sum Pairs', 'Find All K-Distant Indices in an Array', 'Two Sum III - Data structure design', 'Count Pairs Whose Sum is Less than Target', 'Node With Highest Edge Score', 'Check Distances Between Same Letters', 'Number of Ar

In [26]:
from joblib import Parallel, delayed

def clean_recommendation_title(title):
    return title.split('. ', 1)[-1].strip()

# Function to evaluate a single problem
def evaluate_problem(row, recommender_system, k=10):
    problem_id = row["id"]
    ground_truth = set(row["similar_problems_list"])
    
    if not ground_truth:
        # Skip problems with no ground truth
        return None, None, None  # same 3-tuple shape as the normal return value
    
    # Get model recommendations for the current problem
    recommended_problems = recommender_system(problem_id)
    recommended_titles = recommended_problems["title"].map(clean_recommendation_title)
    
    # Compute relevant items
    relevant_items = set(recommended_titles) & ground_truth
    
    # Precision@k
    precision_at_k = len(relevant_items) / k
    
    # Recall@k
    recall_at_k = len(relevant_items) / len(ground_truth)

    # F1 score
    f1_score = 0.0
    if precision_at_k + recall_at_k > 0:
        f1_score = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)
    
    return precision_at_k, recall_at_k, f1_score

# Function to evaluate the recommender system in parallel
def parallel_evaluate_recommender(df, recommender_system, k=10, n_jobs=-1):
    # Use joblib to process rows in parallel
    results = Parallel(n_jobs=n_jobs)(
        delayed(evaluate_problem)(row, recommender_system, k) for _, row in df.iterrows()
    )
    print(results)
    
    # Collect metrics
    # precision_scores = [result[0] for result in results if result[0] is not None]
    # recall_scores = [result[1] for result in results if result[1] is not None]
    # f1_scores = [result[2] for result in results if result[2] is not None]
    
    # # Compute average metrics
    # avg_precision = np.mean(precision_scores)
    # avg_recall = np.mean(recall_scores)
    # avg_f1_score = np.mean(f1_scores)
    # return avg_precision, avg_recall, avg_f1_score
    return 0,0,0

precision, recall, f1 = parallel_evaluate_recommender(df, recommender_system, k=10, n_jobs=10)
print(f"Precision@10: {precision:.2f}")
print(f"Recall@10: {recall:.2f}")
print(f"F1 Score@10: {f1:.2f}")

[(0.2, 0.09523809523809523, 0.12903225806451613), (0.2, 0.25, 0.22222222222222224), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.2, 0.3333333333333333, 0.25), (0.0, 0.0, 0.0), (0.1, 0.25, 0.14285714285714288), (0.1, 0.3333333333333333, 0.15384615384615383), (0.0, 0.0, 0.0), (0.1, 1.0, 0.18181818181818182), (0.1, 0.3333333333333333, 0.15384615384615383), (0.2, 1.0, 0.33333333333333337), (0.1, 1.0, 0.18181818181818182), (0.0, 0.0, 0.0), (0.2, 0.2857142857142857, 0.23529411764705882), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.1, 0.25, 0.14285714285714288), (0.1, 0.3333333333333333, 0.15384615384615383), (0.2, 0.3333333333333333, 0.25), (0.1, 0.14285714285714285, 0.11764705882352941), (0.0, 0.0, 0.0), (0.1, 0.3333333333333333, 0.15384615384615383), (0.2, 1.0, 0.33333333333333337), (0.2, 0.6666666666666666, 0.30769230769230765), (0.2, 0.5, 0.28571428571428575), (0.1, 0.3333333333333333, 0.15384615384615383), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.1, 0.2, 0.13333333333333333), (0.0, 0.0, 0

In [27]:
# Function to clean recommendation titles
def clean_recommendation_title(title):
    return title.split('. ', 1)[-1].strip()

# Function to evaluate a single problem
def evaluate_problem(row, recommender_system, k=10):
    problem_id = row["id"]
    ground_truth = set(row["similar_problems_list"])
    
    if not ground_truth:
        # Skip problems with no ground truth
        return None  # Return None if no ground truth
    
    # Get model recommendations for the current problem
    recommended_problems = recommender_system(problem_id)
    recommended_titles = recommended_problems["title"].map(clean_recommendation_title)
    
    # Compute relevant items
    relevant_items = set(recommended_titles) & ground_truth
    
    # Precision@k
    precision_at_k = len(relevant_items) / k if k > 0 else 0.0
    
    # Recall@k
    recall_at_k = len(relevant_items) / len(ground_truth) if len(ground_truth) > 0 else 0.0
    
    # F1 score
    if precision_at_k + recall_at_k > 0:
        f1_score = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)
    else:
        f1_score = 0.0
    
    return precision_at_k, recall_at_k, f1_score

# Function to evaluate the recommender system in parallel
def parallel_evaluate_recommender(df, recommender_system, k=10, n_jobs=-1):
    # Use joblib to process rows in parallel
    results = Parallel(n_jobs=n_jobs)(
        delayed(evaluate_problem)(row, recommender_system, k) for _, row in df.iterrows()
    )
    
    # Filter out None results
    valid_results = [result for result in results if result is not None]
    
    # Collect metrics
    precision_scores = [result[0] for result in valid_results]
    recall_scores = [result[1] for result in valid_results]
    f1_scores = [result[2] for result in valid_results]
    
    # Compute average metrics
    avg_precision = np.mean(precision_scores) if precision_scores else 0.0
    avg_recall = np.mean(recall_scores) if recall_scores else 0.0
    avg_f1_score = np.mean(f1_scores) if f1_scores else 0.0
    return avg_precision, avg_recall, avg_f1_score

# Example usage
precision, recall, f1 = parallel_evaluate_recommender(df, recommender_system, k=10, n_jobs=10)
print(f"Precision@10: {precision:.2f}")
print(f"Recall@10: {recall:.2f}")
print(f"F1 Score@10: {f1:.2f}")

Precision@10: 0.03
Recall@10: 0.13
F1 Score@10: 0.04
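The reported F1 is the mean of per-problem F1 scores, which is generally not the harmonic mean of the averaged precision and recall. The sketch below shows the difference on illustrative values that are not taken from the dataset; the helper name `f1` is hypothetical.

```python
import numpy as np

# Per-problem (precision, recall) pairs; illustrative values only.
pairs = [(0.2, 0.1), (0.0, 0.0), (0.1, 1.0), (0.0, 0.0)]

def f1(p, r):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

mean_p = np.mean([p for p, _ in pairs])
mean_r = np.mean([r for _, r in pairs])

mean_of_f1 = np.mean([f1(p, r) for p, r in pairs])  # average of per-problem F1
f1_of_means = f1(mean_p, mean_r)                     # F1 of the averaged metrics

print(f"mean of F1: {mean_of_f1:.3f}, F1 of means: {f1_of_means:.3f}")
```

Problems with zero hits pull the mean of per-problem F1 down, which helps explain why an averaged F1 can sit below what the averaged precision and recall alone would suggest.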


# Experiment 2: Multi-Label Classification Using Logistic Regression

## Objective
To build a **multi-label classification model** that predicts **similar problems** for a given problem. The workflow uses:
- **TF-IDF** to process text data.
- **MultiLabelBinarizer** to encode multi-label targets.
- **One-vs-Rest Logistic Regression** for multi-label classification.

---

## Workflow

1. **Data Preparation**:
   - Use the `MultiLabelBinarizer` to transform the `similar_problems_list` column into a **binary multi-label matrix**.
   - Combine `title` and `problem_description` into a new column `combined_text`.

2. **Feature Extraction**:
   - Apply **TF-IDF Vectorization** to convert `combined_text` into a feature matrix.
   - The vectorizer removes stop words and uses the top 5,000 features.

3. **Model Training**:
   - Train a **One-vs-Rest Logistic Regression** model using the TF-IDF feature matrix.
   - Split the data into **80% training** and **20% testing**.

4. **Evaluation**:
   - Evaluate the model on the test set using:
     - **Hamming Loss**: Measures the fraction of labels incorrectly predicted.
     - **F1 Score (Micro)**: Balances precision and recall across all labels.

---

## Key Steps and Components

1. **MultiLabelBinarizer**:
   - Converts the multi-label column `similar_problems_list` into a binary matrix where each column represents a label.

2. **TF-IDF Vectorizer**:
   - Combines `title` and `problem_description` into `combined_text` for feature extraction.
   - Reduces text data to a numerical matrix with 5,000 important features.

3. **Logistic Regression (One-vs-Rest)**:
   - Treats each label as a separate binary classification problem.
   - Efficient for multi-label classification tasks with linear decision boundaries.
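A minimal sketch of the label encoding step, using a toy subset of titles as an assumption. The name `mlb_demo` is used to keep it separate from the notebook's own `mlb`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy ground-truth lists (hypothetical subset of titles); the last problem has none.
similar_lists = [["3Sum", "4Sum"], ["Add Binary"], []]

mlb_demo = MultiLabelBinarizer()
Y = mlb_demo.fit_transform(similar_lists)

print(mlb_demo.classes_.tolist())  # → ['3Sum', '4Sum', 'Add Binary']
print(Y.tolist())                  # → [[1, 1, 0], [0, 0, 1], [0, 0, 0]]

# inverse_transform maps rows of the binary matrix back to label tuples.
print(mlb_demo.inverse_transform(Y))
```

Each distinct title becomes one column, so the number of binary classifiers trained by One-vs-Rest equals the number of distinct ground-truth titles.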

---

## Evaluation Results

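The following metrics are used to evaluate the performance of the multi-label classifier:
- **Hamming Loss**: Lower is better. Because the label matrix is sparse, this value stays small even when most true labels are missed, so it should be read together with F1.
- **F1 Score (Micro)**: Higher is better, balancing precision and recall across all labels.

**Sample Output** (from the evaluation cell of this experiment): Hamming Loss `0.0009`, F1 Score (Micro) `0.0334`.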


In [28]:
from sklearn.preprocessing import MultiLabelBinarizer

# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the labels
labels = mlb.fit_transform(df["similar_problems_list"])

# Add binary labels as new columns to the DataFrame
label_columns = mlb.classes_
df_binary = pd.DataFrame(labels, columns=label_columns)

# Combine binary labels with original DataFrame
df = pd.concat([df, df_binary], axis=1)

# Show the updated DataFrame
print("\nDataFrame with binary labels:")
print(df.head())

# Save the label columns for later use
print("\nLabel Columns:", label_columns)



DataFrame with binary labels:
   id  page_number  is_premium  \
0   1            1       False   
1   2            1       False   
2   3            1       False   
3   4            1       False   
4   5            1       False   

                                               title  \
0                                         1. Two Sum   
1                                 2. Add Two Numbers   
2  3. Longest Substring Without Repeating Characters   
3                     4. Median of Two Sorted Arrays   
4                   5. Longest Palindromic Substring   

                                 problem_description  \
0  Given an array of integers nums and an integer...   
1  You are given two non-empty linked lists repre...   
2  Given a string s, find the length of the longe...   
3  Given two sorted arrays nums1 and nums2 of siz...   
4  Given a string s, return the longest palindrom...   

                                       topic_tags difficulty  \
0                         

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine title and description as input features
df["combined_text"] = df["title"].fillna('') + " " + df["problem_description"].fillna('')

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit and transform the text data
X = tfidf_vectorizer.fit_transform(df["combined_text"])

# Print shape of the resulting feature matrix
print("\nTF-IDF feature matrix shape:", X.shape)



TF-IDF feature matrix shape: (3000, 5000)


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

# Define features (X) and multi-label targets (y)
y = labels  # Binary multi-label matrix produced by MultiLabelBinarizer above

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize One-vs-Rest Logistic Regression model
model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))

# Train the model
print("\nTraining the model...")
model.fit(X_train, y_train)
print("Model training complete.")



Training the model...




Model training complete.




In [31]:
from sklearn.metrics import hamming_loss, f1_score

# Predict on the test set
y_pred = model.predict(X_test)

# Compute evaluation metrics
hamming = hamming_loss(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='micro')  # 'micro' averages across all labels

# Print results
print(f"\nEvaluation Results:")
print(f"Hamming Loss: {hamming:.4f}")
print(f"F1 Score: {f1:.4f}")



Evaluation Results:
Hamming Loss: 0.0009
F1 Score: 0.0334
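A very low Hamming Loss next to a low F1 is typical of sparse multi-label targets. The synthetic sketch below, which does not use the project data, shows that a "model" predicting no labels at all still gets a small Hamming Loss while its micro F1 is zero.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Synthetic targets: 4 samples x 1000 labels, with 2 true labels per sample.
y_true_demo = np.zeros((4, 1000), dtype=int)
for row in range(4):
    y_true_demo[row, [row, row + 10]] = 1

# A degenerate predictor that never assigns any label.
y_pred_empty = np.zeros_like(y_true_demo)

print(hamming_loss(y_true_demo, y_pred_empty))  # → 0.002
print(f1_score(y_true_demo, y_pred_empty, average="micro", zero_division=0))  # → 0.0
```

Only 8 of the 4000 label slots are wrong, so the Hamming Loss looks excellent even though no similar problem was found. For this task, F1 or Precision@k and Recall@k are the more informative numbers.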


In [None]:
# Function to make predictions for a new problem
def predict_similar_questions(title, description, model, vectorizer, mlb):
    text = title + " " + description
    text_vectorized = vectorizer.transform([text])
    predictions = model.predict(text_vectorized)
    predicted_labels = mlb.inverse_transform(predictions)
    return predicted_labels

# Example input
title = "Find pairs of numbers with a target sum"
description = "Given an array of integers, return indices of two numbers that add up to a given target."

# Make predictions
predicted_similar = predict_similar_questions(title, description, model, tfidf_vectorizer, mlb)
print("\nPredicted Similar Questions:", predicted_similar)
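`model.predict` applies a per-label decision threshold, so with sparse labels it may return an empty tuple for a new problem. One possible alternative, sketched here on a tiny hypothetical training set, is to rank labels by `predict_proba` and keep the top `k`. All `toy_*` names and the `top_k_labels` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Tiny hypothetical training set; every label has positive and negative rows.
toy_texts = [
    "three numbers that sum to zero",
    "sum of numbers in a sorted array using binary search",
    "binary search in a sorted array",
    "add two binary strings",
]
toy_labels = [["3Sum"], ["3Sum", "Binary Search"], ["Binary Search"], ["Add Binary"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)
toy_model = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(toy_X, toy_y)

def top_k_labels(text, k=2):
    """Return the k labels with the highest predicted probability."""
    proba = toy_model.predict_proba(toy_vectorizer.transform([text]))[0]
    best = np.argsort(proba)[::-1][:k]
    return [str(toy_mlb.classes_[i]) for i in best]

print(top_k_labels("find three numbers with a given sum"))
```

This always returns exactly `k` candidates, which fits a recommendation setting better than a hard threshold, at the cost of sometimes returning low-confidence labels.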


# Experiment 3: Title-Based Hybrid Model

## Objective
To build a **hybrid recommendation system** that combines:
1. **Title-Based Similarity** using **TF-IDF** and **Cosine Similarity**.
2. **Machine Learning (ML) Predictions** using a trained multi-label classifier.

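The hybrid approach uses **title similarity** when a strong match exists (cosine similarity above a threshold) and supplements it with predictions from the **ML model**. In the `hybrid_predict` implementation in this notebook, both sources are always computed and their union is returned.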

---

## Workflow

1. **Title-Based Similarity**:
   - Use **TF-IDF Vectorization** to process problem titles into numerical vectors.
   - Compute **Cosine Similarity** between all titles.
   - Identify similar problems based on a similarity threshold.

2. **Multi-Label Model Predictions**:
   - Combine `title` and `problem_description` into a unified text feature.
   - Use a trained **One-vs-Rest Logistic Regression** model to predict similar problems.

3. **Hybrid Prediction**:
   - Step 1: Retrieve similar questions based on title similarity if the cosine similarity exceeds a threshold (default: 0.7).
   - Step 2: Run the ML model on the combined title and description to predict similar problems.
   - Step 3: Return the union of the title-based and ML-based predictions.

4. **Evaluation**:
   - Evaluate the system using:
     - **Hamming Loss**: Fraction of label slots predicted incorrectly.
     - **Precision@k**: Fraction of the top `k` predictions that appear in the ground truth.
     - **Recall@k**: Fraction of ground-truth labels that appear in the top `k` predictions.
     - **F1-Score**: Harmonic mean of precision and recall.

---

## Key Components

### Title-Based Similarity
- **TF-IDF Vectorizer**:
   - Converts problem titles into a TF-IDF matrix.
- **Cosine Similarity**:
   - Measures similarity between the TF-IDF vectors of titles.
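The effect of the similarity threshold can be seen on a few titles taken from the dataset preview. This is a sketch only; the `similar_titles` helper is hypothetical and numeric prefixes such as `1.` are omitted from the titles for brevity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Two Sum",
    "Two Sum II - Input Array Is Sorted",
    "Binary Search",
    "Search Insert Position",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(pd.Series(titles).str.lower())
sim = pd.DataFrame(cosine_similarity(matrix), index=titles, columns=titles)

def similar_titles(title, threshold):
    """Titles whose similarity to `title` exceeds the threshold, best first."""
    scores = sim[title].drop(labels=title)  # remove the query title itself
    return scores[scores > threshold].sort_values(ascending=False).index.tolist()

print(similar_titles("Two Sum", threshold=0.7))
print(similar_titles("Two Sum", threshold=0.3))
```

Short titles share few tokens after stop-word removal, so a strict threshold such as 0.7 can return nothing even for closely related problems, while a looser one picks up "Two Sum II - Input Array Is Sorted".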

### Hybrid Prediction Function
The **`hybrid_predict`** function combines:
1. **Title Similarity**: Retrieves problems with cosine similarity above a threshold.
2. **ML Model Predictions**: Uses text data to predict labels.

**Output**: A combined set of recommendations.
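Two ways of combining the sources are possible: a strict fallback, where the ML model is consulted only when no title match exists, and a union of both. The sketch below contrasts them using plain lists. Both function names are hypothetical and neither is the notebook's `hybrid_predict`.

```python
def hybrid_with_fallback(title_matches, model_labels, top_k=3):
    """Prefer title matches; use model labels only when no title match exists."""
    if title_matches:
        return list(title_matches)[:top_k]
    return list(model_labels)[:top_k]

def hybrid_union(title_matches, model_labels):
    """Merge both sources, keeping title matches first and dropping duplicates."""
    return list(dict.fromkeys([*title_matches, *model_labels]))

print(hybrid_with_fallback([], ["3Sum", "4Sum"]))      # → ['3Sum', '4Sum']
print(hybrid_with_fallback(["Two Sum II"], ["3Sum"]))  # → ['Two Sum II']
print(hybrid_union(["3Sum"], ["3Sum", "4Sum"]))        # → ['3Sum', '4Sum']
```

The union tends to favour recall, while the strict fallback keeps the output short when a confident title match is available.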

---

## Evaluation Metrics

The following metrics are used to evaluate the hybrid model:
- **Hamming Loss**: Fraction of incorrect labels predicted.
- **Precision@k**: Proportion of predicted labels that are correct.
- **Recall@k**: Proportion of true labels that are correctly predicted.
- **F1-Score**: Harmonic mean of precision and recall.

---


In [36]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load Data
# df = pd.read_csv("preprocessed_data.csv")

# Step 1: Compute TF-IDF Matrix for Titles
df["cleaned_title"] = df["title"].fillna('').str.lower()
tfidf_vectorizer_titles = TfidfVectorizer(stop_words='english')
tfidf_title_matrix = tfidf_vectorizer_titles.fit_transform(df["cleaned_title"])

# Step 2: Compute Cosine Similarity
title_similarity = cosine_similarity(tfidf_title_matrix, tfidf_title_matrix)
title_similarity_df = pd.DataFrame(title_similarity, index=df["title"], columns=df["title"])

# Function to get title-based similar questions
def get_title_based_similar(title, similarity_df, threshold=0.7, top_k=3):
    if title not in similarity_df.columns:
        return []
    # Drop the query title explicitly instead of assuming it sorts first
    similar_titles = similarity_df[title].drop(labels=title, errors="ignore")
    filtered_titles = similar_titles[similar_titles > threshold].sort_values(ascending=False)
    return list(filtered_titles.index[:top_k])

# Hybrid Prediction Function
def hybrid_predict(title, description, model, vectorizer, mlb, similarity_df, threshold=0.7, top_k=3):
    """
    Hybrid prediction function:
    1. Use title similarity when strong patterns exist.
    2. Fall back to the ML model for multi-label classification.
    """
    # Step 1: Title-based similar questions
    title_similar_questions = get_title_based_similar(title, similarity_df, threshold, top_k)
    
    # Step 2: Multi-label model predictions
    text = title + " " + description
    text_vectorized = vectorizer.transform([text])
    predictions = model.predict(text_vectorized)
    model_predicted_labels = mlb.inverse_transform(predictions)
    
    # Combine both results
    combined_predictions = set(title_similar_questions) | set(model_predicted_labels[0])
    return list(combined_predictions)


In [37]:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss

# Function to evaluate the hybrid model
def evaluate_hybrid_model(df, model, vectorizer, mlb, similarity_df, threshold=0.7, top_k=3):
    y_true = []  # Ground truth labels
    y_pred = []  # Predicted labels
    
    for _, row in df.iterrows():
        title = row["title"]
        description = row["problem_description"]
        ground_truth = row["similar_problems_list"]
        
        # Skip rows with no ground truth
        if not ground_truth:
            continue
        
        # Hybrid model prediction
        predicted_labels = hybrid_predict(
            title, description, model, vectorizer, mlb, similarity_df, threshold, top_k
        )
        
        # Append binary format
        y_true.append([1 if label in ground_truth else 0 for label in mlb.classes_])
        y_pred.append([1 if label in predicted_labels else 0 for label in mlb.classes_])
    
    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    # Compute evaluation metrics
    hamming = hamming_loss(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="micro")
    recall = recall_score(y_true, y_pred, average="micro")
    f1 = f1_score(y_true, y_pred, average="micro")
    
    # Print results
    print("\nHybrid Model Evaluation Results:")
    print(f"Hamming Loss: {hamming:.4f}")
    print(f"Precision@k: {precision:.4f}")
    print(f"Recall@k: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")


In [38]:
# Evaluate the Hybrid Model
evaluate_hybrid_model(
    df, model, tfidf_vectorizer, mlb, title_similarity_df, threshold=0.7, top_k=3
)



Hybrid Model Evaluation Results:
Hamming Loss: 0.0011
Precision@k: 0.8667
Recall@k: 0.0429
F1-Score: 0.0817


In [43]:
# Increase top_k from 3 to 10 (threshold stays at 0.7)
evaluate_hybrid_model(
    df, model, tfidf_vectorizer, mlb, title_similarity_df, threshold=0.7, top_k=10
)



Hybrid Model Evaluation Results:
Hamming Loss: 0.0011
Precision@k: 0.8667
Recall@k: 0.0429
F1-Score: 0.0817


# Experiment 4: Topic Tags Weighted Hybrid Model

## Objective
To improve the **hybrid recommendation system** by incorporating **Topic Tags** alongside titles and descriptions using **weighted TF-IDF features** and title-based similarity.

---

## Workflow

1. **TF-IDF Feature Engineering**:
   - Apply **TF-IDF Vectorization** to the following text fields:
     - `title` (weight: **0.35**)
     - `topic_tags` (weight: **0.35**)
     - `problem_description` (weight: **0.3**)
   - Scale each TF-IDF feature matrix based on its weight.

2. **Combine Features**:
   - Concatenate the weighted TF-IDF matrices into a single feature matrix.

3. **Model Training**:
   - Use a **One-vs-Rest Logistic Regression** model to train on the combined TF-IDF feature matrix.

4. **Hybrid Predictions**:
   - **Title Similarity**: Use cosine similarity between problem titles to identify similar questions.
   - **Weighted TF-IDF Prediction**: Predict similar questions using the trained ML model.
   - **Combine Results**: Merge title-based similar questions with model predictions to generate final recommendations.

5. **Evaluation**:
   - Evaluate the hybrid model using:
     - **Hamming Loss**: Fraction of label assignments that are wrong.
     - **Precision@k**: Share of predicted labels that are correct.
     - **Recall@k**: Share of ground truth labels that are recovered.
     - **F1-Score**: Balance between precision and recall.

---

## Key Components

- **TF-IDF Vectorization**:
   - Processes `title`, `topic_tags`, and `problem_description` into weighted numerical vectors.
   - Ensures different features contribute proportionally to the final prediction.

- **Title-Based Similarity**:
   - Computes cosine similarity between titles to find questions with strong title patterns.

- **Hybrid Prediction Function**:
   - Combines title-based similar questions with multi-label classification predictions for better performance.
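
One way to keep the per-field weights attached to the vectorizers, so that training and prediction apply the same scaling, is scikit-learn's `ColumnTransformer`, whose `transformer_weights` argument multiplies each transformer's output by a constant. This is a sketch of an alternative arrangement, not the approach used in the cells below; the toy rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "title": ["Two Sum", "Three Sum", "Valid Parentheses"],
    "topic_tags": ["array hash-table", "array two-pointers", "string stack"],
    "problem_description": [
        "find two numbers that add up to a target",
        "find three numbers that add up to zero",
        "check whether the brackets in a string are balanced",
    ],
})

# A string column selector hands each vectorizer a 1-D column of text
weighted_features = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "title"),
        ("tags", TfidfVectorizer(), "topic_tags"),
        ("description", TfidfVectorizer(), "problem_description"),
    ],
    transformer_weights={"title": 0.35, "tags": 0.35, "description": 0.3},
)
X_toy = weighted_features.fit_transform(toy)

# New rows are vectorized, weighted, and concatenated in a single call
new_row = pd.DataFrame({
    "title": ["Four Sum"],
    "topic_tags": ["array"],
    "problem_description": ["find four numbers that add up to a target"],
})
X_new = weighted_features.transform(new_row)
print(X_toy.shape, X_new.shape)
```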

---

## Results

**Evaluation Metrics** (`threshold=0.4`, `top_k=5`):

- Hamming Loss: 0.0012
- Precision@k: 0.4901
- Recall@k: 0.1197
- F1-Score: 0.1925

In [118]:
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Define weights
TITLE_WEIGHT = 0.35
TAGS_WEIGHT = 0.35
DESCRIPTION_WEIGHT = 0.3

# TF-IDF for Titles
tfidf_title = TfidfVectorizer(stop_words='english', max_features=2000)
X_title = tfidf_title.fit_transform(df["title"].fillna(''))

# TF-IDF for Topic Tags
tfidf_tags = TfidfVectorizer(stop_words='english', max_features=1000)
X_tags = tfidf_tags.fit_transform(df["topic_tags"].fillna(''))

# TF-IDF for Problem Descriptions
tfidf_description = TfidfVectorizer(stop_words='english', max_features=2000)
X_description = tfidf_description.fit_transform(df["problem_description"].fillna(''))

# Scale TF-IDF matrices with their respective weights
X_title_weighted = X_title * TITLE_WEIGHT
X_tags_weighted = X_tags * TAGS_WEIGHT
X_description_weighted = X_description * DESCRIPTION_WEIGHT

# Concatenate the weighted matrices
X_combined = hstack([X_title_weighted, X_tags_weighted, X_description_weighted])

print("\nFinal Combined Feature Matrix Shape:", X_combined.shape)



Final Combined Feature Matrix Shape: (3000, 4090)


In [119]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# MultiLabelBinarizer for targets
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["similar_problems_list"])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Train One-vs-Rest Logistic Regression
model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
model.fit(X_train, y_train)

print("Model training complete.")




Model training complete.
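
To show what the target encoding and the One-vs-Rest wrapper do, here is a toy version of the same pipeline. The targets and the two-dimensional features are made up; only the scikit-learn classes match the cell above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up targets: each row lists the titles similar to one problem
toy_targets = [["3Sum", "4Sum"], ["3Sum"], ["Valid Parentheses"], ["4Sum"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_targets)
print(toy_mlb.classes_)  # sorted label names, one column per label
print(toy_y)             # indicator matrix with shape (4, 3)

# Made-up features; one binary classifier is trained per column of toy_y
toy_X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.2, 1.0]])
toy_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
toy_model.fit(toy_X, toy_y)

# predict returns an indicator matrix; inverse_transform maps it back to titles
toy_pred = toy_model.predict(toy_X)
toy_labels = toy_mlb.inverse_transform(toy_pred)
print(toy_labels)
```

Every distinct title in `similar_problems_list` becomes its own binary problem, so the number of classifiers grows with the number of distinct labels.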




In [120]:
def hybrid_predict_weighted(title, topic_tags, description, model, tfidf_title, tfidf_tags, tfidf_description, mlb, similarity_df, threshold=0.5, top_k=5):
    """
    Predict similar questions using weighted hybrid features and title similarity.
    """
    # Step 1: Title-based similar questions
    title_similar_questions = get_title_based_similar(title, similarity_df, threshold, top_k)
    
    # Step 2: Create weighted TF-IDF features for the input
    title_vector = tfidf_title.transform([title]) * TITLE_WEIGHT
    tags_vector = tfidf_tags.transform([topic_tags]) * TAGS_WEIGHT
    description_vector = tfidf_description.transform([description]) * DESCRIPTION_WEIGHT
    
    # Combine features
    combined_vector = hstack([title_vector, tags_vector, description_vector])
    
    # Step 3: Predict using the model
    model_pred = model.predict(combined_vector)
    model_predicted_labels = mlb.inverse_transform(model_pred)
    
    # Step 4: Combine title-based and model predictions
    combined_predictions = list(set(title_similar_questions) | set(model_predicted_labels[0]))
    return combined_predictions


In [121]:
def evaluate_hybrid_model_weighted(df, model, tfidf_title, tfidf_tags, tfidf_description, mlb, similarity_df, threshold=0.5, top_k=5):
    y_true = []
    y_pred = []

    for idx, row in df.iterrows():
        title = row["title"]
        topic_tags = row["topic_tags"]
        description = row["problem_description"]
        ground_truth = row["similar_problems_list"]

        if not ground_truth:
            continue

        # Hybrid predictions
        predicted_labels = hybrid_predict_weighted(
            title, topic_tags, description, model, tfidf_title, tfidf_tags, tfidf_description, mlb, similarity_df, threshold, top_k
        )
        # Clean the predicted titles with clean_recommendation_title
        predicted_labels = [clean_recommendation_title(label) for label in predicted_labels]

        # Convert to binary format
        y_true.append([1 if label in ground_truth else 0 for label in mlb.classes_])
        y_pred.append([1 if label in predicted_labels else 0 for label in mlb.classes_])

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # print(y_true, 'y_true')
    # print(y_pred, 'y_pred')

    # Compute metrics
    hamming = hamming_loss(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="micro")
    recall = recall_score(y_true, y_pred, average="micro")
    f1 = f1_score(y_true, y_pred, average="micro")

    print("\nUpdated Hybrid Model Evaluation Results:")
    print(f"Hamming Loss: {hamming:.4f}")
    print(f"Precision@k: {precision:.4f}")
    print(f"Recall@k: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")


In [122]:
# Evaluate the updated hybrid model
evaluate_hybrid_model_weighted(
    df, model, tfidf_title, tfidf_tags, tfidf_description, mlb, title_similarity_df, threshold=0.4, top_k=5
)



Updated Hybrid Model Evaluation Results:
Hamming Loss: 0.0012
Precision@k: 0.4901
Recall@k: 0.1197
F1-Score: 0.1925
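
`evaluate_hybrid_model_weighted` above iterates over the whole `df`, which includes the rows the model was trained on. One way to score only unseen rows is to split the row positions together with the features and then evaluate on `df.iloc[...]` for the held-out positions. The sketch below uses toy stand-ins (`toy_df`, `toy_X`, `toy_y`) for the notebook's objects.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for df, X_combined, and y
toy_df = pd.DataFrame({"title": [f"Problem {i}" for i in range(10)]})
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10) % 2
positions = np.arange(len(toy_df))

# Splitting the positions alongside X and y keeps all three aligned
X_tr, X_te, y_tr, y_te, pos_tr, pos_te = train_test_split(
    toy_X, toy_y, positions, test_size=0.2, random_state=42
)

# Only these rows would be passed to the evaluation function
held_out_df = toy_df.iloc[pos_te]
print(held_out_df)
```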


# Experiment 5: XGBoost-Based Multi-Label Classification with Threshold Tuning

## Objective
To train a **multi-label classification model** using **XGBoost** and optimize its predictions by dynamically adjusting the probability threshold for better F1-Score.

---

## Workflow

1. **Model Training**:
   - Use **TF-IDF features** for titles, topic tags, and descriptions.
   - Train an **XGBoost classifier** (One-vs-Rest strategy) for multi-label classification.

2. **Probability-Based Predictions**:
   - Predict label probabilities for the test dataset.
   - Convert probabilities into binary predictions based on a **threshold**.

3. **Threshold Tuning**:
   - Iterate over multiple thresholds to find the **best threshold** that maximizes the **F1-Score**.

4. **Evaluation**:
   - Evaluate the model using:
     - **Precision**: Proportion of predicted labels that are correct.
     - **Recall**: Proportion of ground truth labels captured.
     - **F1-Score**: Balance between precision and recall.

---

## Key Steps

### 1. Model Initialization and Training
- **XGBoost Classifier**:
   - `XGBClassifier` with `OneVsRestClassifier` for multi-label classification.
   - `logloss` as the evaluation metric.

### 2. Threshold Adjustment
- Convert predicted probabilities into binary predictions using a dynamic threshold.
- Tune the threshold using a range of values (e.g., 0.1 to 0.6) to find the **optimal threshold** for F1-Score.
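
The effect of the threshold can be seen on a toy example. The probabilities and labels below are made up; the thresholding expression is the same one used in the cells further down.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up predicted probabilities and ground truth: 3 problems x 3 labels
toy_proba = np.array([[0.90, 0.20, 0.05],
                      [0.40, 0.35, 0.12],
                      [0.15, 0.60, 0.25]])
toy_true = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 1]])

def scores_at(threshold):
    """Return micro-averaged (precision, recall) at the given threshold."""
    pred = (toy_proba >= threshold).astype(int)
    return (precision_score(toy_true, pred, average="micro"),
            recall_score(toy_true, pred, average="micro"))

for t in (0.5, 0.3, 0.1):
    p, r = scores_at(t)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Lowering the threshold from 0.5 to 0.1 raises recall from 0.4 to 1.0 in this toy case while precision drops from 1.0 to 0.625, which is the trade-off the tuning loop searches over.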

### 3. Evaluation Metrics
- **Precision**: Proportion of predicted labels that are correct.
- **Recall**: Measures the model's ability to capture ground truth labels.
- **F1-Score**: A single metric balancing precision and recall.

---

## Results

**Initial Results with Threshold 0.3**:

- Precision: 0.3902
- Recall: 0.1184
- F1-Score: 0.1816

**Results with Tuned Threshold 0.10**:

- Precision: 0.3138
- Recall: 0.1985
- F1-Score: 0.2432

In [123]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.3-py3-none-macosx_12_0_arm64.whl (1.9 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.1.3


Train with XGBoost

In [124]:
from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier

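# Initialize XGBoost classifier
# Note: XGBoost 2.x ignores use_label_encoder and only warns that it is "not used"; the argument can be dropped
xgb_model = OneVsRestClassifier(XGBClassifier(use_label_encoder=False, eval_metric='logloss'))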

# Train the model
xgb_model.fit(X_train, y_train)

print("XGBoost training complete.")


Parameters: { "use_label_encoder" } are not used.


XGBoost training complete.


In [128]:
# Predict probabilities
y_pred_proba = xgb_model.predict_proba(X_test)
# print(y_pred_proba, 'y_pred_proba')

# Adjust threshold dynamically
threshold = 0.3
y_pred_adjusted = (y_pred_proba >= threshold).astype(int)

# Evaluate performance
precision = precision_score(y_test, y_pred_adjusted, average="micro")
recall = recall_score(y_test, y_pred_adjusted, average="micro")
f1 = f1_score(y_test, y_pred_adjusted, average="micro")

print("\nAdjusted Threshold Results:")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")



Adjusted Threshold Results:
Precision: 0.3902, Recall: 0.1184, F1-Score: 0.1816


In [130]:
# Define threshold
threshold = 0.3

# Convert probabilities into binary predictions
y_pred_binary = (y_pred_proba >= threshold).astype(int)

# Print predictions and ground truth for comparison
for i in range(len(y_test)):
    print(f"\nTest Instance {i+1}:")
    print("Ground Truth Labels:")
    print([label for label, value in zip(mlb.classes_, y_test[i]) if value == 1])
    
    print("Predicted Labels (Binary):")
    print([label for label, value in zip(mlb.classes_, y_pred_binary[i]) if value == 1])
    
    print("Predicted Probabilities:")
    print({label: round(prob, 3) for label, prob in zip(mlb.classes_, y_pred_proba[i]) if prob >= 0.1})



Test Instance 1:
Ground Truth Labels:
['']
Predicted Labels (Binary):
['']
Predicted Probabilities:
{'': 0.355, 'Left and Right Sum Differences': 0.245}

Test Instance 2:
Ground Truth Labels:
['']
Predicted Labels (Binary):
['']
Predicted Probabilities:
{'': 0.689}

Test Instance 3:
Ground Truth Labels:
['Minimum Sum of Squared Difference']
Predicted Labels (Binary):
[]
Predicted Probabilities:
{}

Test Instance 4:
Ground Truth Labels:
[]
Predicted Labels (Binary):
[]
Predicted Probabilities:
{}

Test Instance 5:
Ground Truth Labels:
['Count the Number of Consistent Strings', 'Number of Good Paths', 'Sort Characters By Frequency']
Predicted Labels (Binary):
['']
Predicted Probabilities:
{'': 0.419}

Test Instance 6:
Ground Truth Labels:
[]
Predicted Labels (Binary):
[]
Predicted Probabilities:
{}

Test Instance 7:
Ground Truth Labels:
[]
Predicted Labels (Binary):
[]
Predicted Probabilities:
{}

Test Instance 8:
Ground Truth Labels:
['Number of Valid Words in a Sentence']
Predicted La
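
The `''` entries in the output above show that `mlb.classes_` contains an empty string, so a problem with no similar questions can be scored as if `''` were a real recommendation. Below is a minimal sketch of one possible cleanup, assuming `similar_problems_list` holds lists of title strings; `drop_empty_labels` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def drop_empty_labels(labels):
    """Remove empty or whitespace-only titles from one list of labels."""
    return [label for label in labels if label and label.strip()]

# Made-up target lists, including the empty-string case seen in the output
toy_lists = pd.Series([[""], ["Minimum Sum of Squared Difference"], [], ["Two Sum", ""]])
cleaned = toy_lists.apply(drop_empty_labels)

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(cleaned)
print(toy_mlb.classes_)
print(toy_y)
```

After the cleanup, rows without similar questions become all-zero rows instead of rows with a positive `''` label.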

In [131]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def tune_threshold(y_true, y_pred_proba, thresholds):
    best_threshold = 0.5
    best_f1 = 0.0
    
    for threshold in thresholds:
        y_pred_binary = (y_pred_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred_binary, average="micro")
        
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    
    print(f"Best Threshold: {best_threshold:.2f} | Best F1-Score: {best_f1:.4f}")
    return best_threshold

# Tune thresholds
thresholds = np.arange(0.1, 0.6, 0.05)
best_threshold = tune_threshold(y_test, y_pred_proba, thresholds)

# Apply best threshold
y_pred_best = (y_pred_proba >= best_threshold).astype(int)

# Final metrics
precision = precision_score(y_test, y_pred_best, average="micro")
recall = recall_score(y_test, y_pred_best, average="micro")
f1 = f1_score(y_test, y_pred_best, average="micro")

print("\nFinal Results with Tuned Threshold:")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")


Best Threshold: 0.10 | Best F1-Score: 0.2432

Final Results with Tuned Threshold:
Precision: 0.3138, Recall: 0.1985, F1-Score: 0.2432
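
The search above selects the threshold on `y_test` and then reports metrics on the same data, and the selected value (0.10) is the smallest value in the grid. A hedged sketch of an alternative is to pick the threshold on a separate validation split and to check whether the optimum sits at an edge of the grid. The synthetic probabilities and the names `y_val`, `proba_val`, and `tune_threshold_on_validation` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold_on_validation(y_val, proba_val, thresholds):
    """Return (best_threshold, best_f1) using micro-F1 on validation data."""
    best_threshold, best_f1 = None, -1.0
    for threshold in thresholds:
        pred = (proba_val >= threshold).astype(int)
        score = f1_score(y_val, pred, average="micro", zero_division=0)
        if score > best_f1:
            best_threshold, best_f1 = float(threshold), score
    return best_threshold, best_f1

# Synthetic validation data: sparse labels, noisy scores that are higher for positives
rng = np.random.default_rng(0)
y_val = (rng.random((200, 20)) < 0.1).astype(int)
proba_val = np.clip(0.15 + 0.5 * y_val + rng.normal(0, 0.2, y_val.shape), 0, 1)

grid = np.linspace(0.02, 0.9, 45)
best_threshold, best_f1 = tune_threshold_on_validation(y_val, proba_val, grid)

# If the optimum sits on a grid edge, the grid is probably too narrow
at_edge = best_threshold in (float(grid[0]), float(grid[-1]))
print(best_threshold, round(best_f1, 4), at_edge)
```

The chosen threshold would then be applied once to the test set to report final metrics.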


## Modeling

In this phase, we focus on building the hybrid recommendation system by:
- **Sentence-BERT embeddings**: Generating semantic embeddings for titles, tags, and problem descriptions.
- **Weighted Feature Combination**: Assigning custom weights to each feature (Title: 0.35, Tags: 0.35, Description: 0.3) to balance their importance.
- **XGBoost Classifier**: Training a multi-label classification model to predict similar questions.
- **Hybrid Approach**: Combining predictions from the classifier with title-based similarity scores for improved accuracy.

### Sentence-BERT Embeddings
We utilize the `all-MiniLM-L6-v2` variant of **Sentence-BERT** to generate embeddings for each feature. Sentence-BERT captures semantic relationships between texts, making it suitable for identifying question similarity.

### Weighted Embedding Combination
The final embedding is a **weighted concatenation** (each embedding is scaled by its weight, then the three are stacked side by side) of:
- Title embeddings
- Topic tag embeddings
- Description embeddings

This ensures that **titles** and **tags** have higher influence, as they often directly determine similarity.
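
How much influence each component has on a cosine similarity between two concatenated vectors can be worked out directly. Assuming each component embedding is L2-normalized (an assumption made for this sketch), the cosine of the concatenation equals a weighted average of the per-component cosines with weights proportional to the squared scaling factors. The random vectors below are stand-ins for real embeddings.

```python
import numpy as np

weights = np.array([0.35, 0.35, 0.30])  # title, tags, description

rng = np.random.default_rng(0)

def random_unit_vectors(n_parts, dim):
    """Draw n_parts random vectors and scale each to unit length."""
    vectors = rng.normal(size=(n_parts, dim))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Stand-ins for the three component embeddings of two problems
parts_a = random_unit_vectors(3, 8)
parts_b = random_unit_vectors(3, 8)

# Scale each component by its weight and concatenate
combined_a = np.concatenate([w * part for w, part in zip(weights, parts_a)])
combined_b = np.concatenate([w * part for w, part in zip(weights, parts_b)])

cosine_combined = combined_a @ combined_b / (
    np.linalg.norm(combined_a) * np.linalg.norm(combined_b)
)

# Per-component cosines (dot products of unit vectors) and their effective weights
component_cosines = np.einsum("ij,ij->i", parts_a, parts_b)
effective_weights = weights**2 / np.sum(weights**2)
cosine_from_parts = float(effective_weights @ component_cosines)

print(effective_weights)
print(cosine_combined, cosine_from_parts)
```

Under that assumption, weights of 0.35, 0.35, and 0.3 translate into effective shares of roughly 0.366, 0.366, and 0.269 of the combined cosine similarity.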

### XGBoost for Multi-Label Classification
We use **XGBoost** (One-vs-Rest strategy) to predict multiple similar questions for a given input. XGBoost is chosen for its:
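- Efficiency on numeric feature matrices such as the combined embeddings.
- Compatibility with the One-vs-Rest wrapper, which trains one binary classifier per candidate label.
- Options for handling class imbalance, such as `scale_pos_weight` (not set in the cells below).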

---


In [133]:
!pip install sentence-transformers


In [137]:
from sentence_transformers import SentenceTransformer

# Load pre-trained Sentence-BERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Define weights for title, topic tags, and description embeddings
TITLE_WEIGHT = 0.35
TAGS_WEIGHT = 0.35
DESCRIPTION_WEIGHT = 0.3

# Generate Sentence-BERT embeddings
print("Generating Sentence-BERT embeddings for components...")

# Title embeddings (plain lists are passed so that encode does not depend on the Series index)
title_embeddings = sbert_model.encode(df["title"].fillna('').tolist(), show_progress_bar=True)
title_embeddings_weighted = np.array(title_embeddings) * TITLE_WEIGHT

# Topic tags embeddings
tags_embeddings = sbert_model.encode(df["topic_tags"].fillna('').tolist(), show_progress_bar=True)
tags_embeddings_weighted = np.array(tags_embeddings) * TAGS_WEIGHT

# Problem description embeddings
description_embeddings = sbert_model.encode(df["problem_description"].fillna('').tolist(), show_progress_bar=True)
description_embeddings_weighted = np.array(description_embeddings) * DESCRIPTION_WEIGHT

print("Embeddings for title, tags, and descriptions generated and weighted.")


Generating Sentence-BERT embeddings for components...


Batches: 100%|██████████| 94/94 [00:01<00:00, 77.49it/s]
Batches: 100%|██████████| 94/94 [00:02<00:00, 39.68it/s]
Batches: 100%|██████████| 94/94 [00:09<00:00,  9.75it/s]

Embeddings for title, tags, and descriptions generated and weighted.





In [138]:
# Combine weighted embeddings
X_combined = np.hstack([
    title_embeddings_weighted,
    tags_embeddings_weighted,
    description_embeddings_weighted
])

print("Shape of Final Combined Sentence-BERT Embedding Matrix:", X_combined.shape)


Shape of Final Combined Sentence-BERT Embedding Matrix: (3000, 1152)


In [139]:
# MultiLabelBinarizer for multi-label encoding
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["similar_problems_list"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets.")


Data split into training and testing sets.


In [140]:
# Train Logistic Regression with One-vs-Rest
logreg_model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
logreg_model.fit(X_train, y_train)

# Predict on test set
y_pred = logreg_model.predict(X_test)

# Evaluate performance
precision = precision_score(y_test, y_pred, average="micro")
recall = recall_score(y_test, y_pred, average="micro")
f1 = f1_score(y_test, y_pred, average="micro")
hamming = hamming_loss(y_test, y_pred)

print("\nLogistic Regression (One-vs-Rest) with Weighted Sentence-BERT Results:")
print(f"Hamming Loss: {hamming:.4f}")
print(f"Precision@k: {precision:.4f}")
print(f"Recall@k: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")





Logistic Regression (One-vs-Rest) with Weighted Sentence-BERT Results:
Hamming Loss: 0.0008
Precision@k: 0.7037
Recall@k: 0.0234
F1-Score: 0.0453


In [141]:
# Train XGBoost with One-vs-Rest
xgb_model = OneVsRestClassifier(XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
xgb_model.fit(X_train, y_train)

# Predict on test set
y_pred = xgb_model.predict(X_test)

# Evaluate performance
precision = precision_score(y_test, y_pred, average="micro")
recall = recall_score(y_test, y_pred, average="micro")
f1 = f1_score(y_test, y_pred, average="micro")
hamming = hamming_loss(y_test, y_pred)

print("\nXGBoost (One-vs-Rest) with Weighted Sentence-BERT Results:")
print(f"Hamming Loss: {hamming:.4f}")
print(f"Precision@k: {precision:.4f}")
print(f"Recall@k: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


Parameters: { "use_label_encoder" } are not used.



XGBoost (One-vs-Rest) with Weighted Sentence-BERT Results:
Hamming Loss: 0.0008
Precision@k: 0.5354
Recall@k: 0.0654
F1-Score: 0.1165


In [None]:
# Function to get title-based similar questions
def get_title_based_similar_sbert(title, embeddings, df, top_k=5):
    """
    Find top-k similar titles using Sentence-BERT embeddings and cosine similarity.
    """
    title_embedding = sbert_model.encode([title])  # Embed the input title
    similarity_scores = cosine_similarity(title_embedding, embeddings).flatten()
    # Rank by similarity, then skip the query title itself if it is in the corpus
    ranked_indices = similarity_scores.argsort()[::-1][:top_k + 1]
    ranked_titles = [df.iloc[idx]["title"] for idx in ranked_indices]
    return [t for t in ranked_titles if t != title][:top_k]

# Hybrid prediction function
def hybrid_predict_sbert(title, description, model, mlb, df, embeddings, top_k=5):
    """
    Hybrid predictions combining model and title-based similarity.
    """
    # Title-based similarity
    title_similar_questions = get_title_based_similar_sbert(title, embeddings, df, top_k)
    
    # Model predictions
    combined_text = title + " " + description
    text_embedding = sbert_model.encode([combined_text])
    model_pred = model.predict(text_embedding)
    model_predicted_labels = mlb.inverse_transform(model_pred)

    # Combine predictions
    combined_predictions = list(set(title_similar_questions) | set(model_predicted_labels[0]))
    return combined_predictions


In [136]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

# MultiLabelBinarizer for targets
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["similar_problems_list"])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_sbert, y, test_size=0.2, random_state=42)

# Train Logistic Regression
logreg_model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
logreg_model.fit(X_train, y_train)

# Train XGBoost with One-vs-Rest
xgb_model = OneVsRestClassifier(XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
xgb_model.fit(X_train, y_train)

print("XGBoost training complete.")


# Evaluate performance
# precision = precision_score(y_test, y_pred, average="micro")
# recall = recall_score(y_test, y_pred, average="micro")
# f1 = f1_score(y_test, y_pred, average="micro")
# hamming = hamming_loss(y_test, y_pred)

# print("\nLogistic Regression (One-vs-Rest) with Sentence-BERT Results:")
# print(f"Hamming Loss: {hamming:.4f}")
# print(f"Precision@k: {precision:.4f}")
# print(f"Recall@k: {recall:.4f}")
# print(f"F1-Score: {f1:.4f}")


Parameters: { "use_label_encoder" } are not used.


XGBoost training complete.






Hybrid once again

In [142]:
# Generate Sentence-BERT embeddings for each component
print("Generating Sentence-BERT embeddings for title, tags, and descriptions...")

title_embeddings = sbert_model.encode(df["title"].fillna(''), show_progress_bar=True)
tags_embeddings = sbert_model.encode(df["topic_tags"].fillna(''), show_progress_bar=True)
description_embeddings = sbert_model.encode(df["problem_description"].fillna(''), show_progress_bar=True)

print("Embeddings generated for all components.")


Generating Sentence-BERT embeddings for title, tags, and descriptions...


Batches: 100%|██████████| 94/94 [00:01<00:00, 79.76it/s] 
Batches: 100%|██████████| 94/94 [00:00<00:00, 96.51it/s] 
Batches: 100%|██████████| 94/94 [00:09<00:00,  9.65it/s]

Embeddings generated for all components.





In [144]:
def hybrid_predict_sbert_weighted(title, topic_tags, description, model, mlb, title_embeddings, tags_embeddings, description_embeddings, df, top_k=5):
    """
    Predict similar questions using weighted Sentence-BERT embeddings and title-based similarity.
    """
    # Step 1: Compute Sentence-BERT embeddings for the input
    title_embedding = sbert_model.encode([title]) * TITLE_WEIGHT
    tags_embedding = sbert_model.encode([topic_tags]) * TAGS_WEIGHT
    description_embedding = sbert_model.encode([description]) * DESCRIPTION_WEIGHT

    # Combine weighted embeddings
    combined_embedding = np.hstack([title_embedding, tags_embedding, description_embedding])
    
    # Step 2: Model predictions
    model_pred = model.predict(combined_embedding)
    model_predicted_labels = mlb.inverse_transform(model_pred)
    
    # Step 3: Title-based similarity
    title_similarity_scores = cosine_similarity(title_embedding, title_embeddings).flatten()
    top_indices = title_similarity_scores.argsort()[-top_k:][::-1]
    title_similar_questions = [df.iloc[idx]["title"] for idx in top_indices]

    # Step 4: Combine predictions
    combined_predictions = list(set(title_similar_questions) | set(model_predicted_labels[0]))
    
    # Step 5: Cleanup predictions
    predicted_labels = pd.Series(combined_predictions)
    predicted_labels = predicted_labels.map(clean_recommendation_title)
    return predicted_labels.tolist()
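
The `argsort` slice in Step 3 can return the query problem itself when that problem is part of `title_embeddings`, because a vector has cosine similarity 1.0 with itself. A minimal sketch of a top-k lookup that skips a known row is shown below; `top_k_excluding_self` and the toy embeddings are hypothetical, and the function assumes the caller knows the row position of the query.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_excluding_self(query_vector, embeddings, self_index, top_k=5):
    """Return row positions of the top_k most similar rows, skipping self_index."""
    scores = cosine_similarity(query_vector.reshape(1, -1), embeddings).flatten()
    scores[self_index] = -np.inf  # never recommend the query itself
    k = min(top_k, len(scores) - 1)
    ranked = np.argsort(scores)[::-1]
    return ranked[:k].tolist()

# Made-up 2-D embeddings; row 0 is the query
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.7, 0.7],
])
neighbours = top_k_excluding_self(toy_embeddings[0], toy_embeddings, self_index=0, top_k=2)
print(neighbours)  # [1, 3]
```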


In [145]:
def evaluate_hybrid_model_weighted(df, model, mlb, title_embeddings, tags_embeddings, description_embeddings, top_k=5):
    """
    Evaluate the hybrid model using Sentence-BERT embeddings for title, tags, and descriptions.
    """
    y_true = []
    y_pred = []

    for idx, row in df.iterrows():
        title = row["title"]
        topic_tags = row["topic_tags"]
        description = row["problem_description"]
        ground_truth = row["similar_problems_list"]

        if not ground_truth:
            continue

        # Hybrid predictions
        predicted_labels = hybrid_predict_sbert_weighted(
            title, topic_tags, description, model, mlb,
            title_embeddings, tags_embeddings, description_embeddings,
            df, top_k=top_k
        )
        # hybrid_predict_sbert_weighted already returns cleaned titles,
        # so no further cleanup is needed here.

        # Convert to binary format
        y_true.append([1 if label in ground_truth else 0 for label in mlb.classes_])
        y_pred.append([1 if label in predicted_labels else 0 for label in mlb.classes_])

    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Compute metrics
    hamming = hamming_loss(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="micro")
    recall = recall_score(y_true, y_pred, average="micro")
    f1 = f1_score(y_true, y_pred, average="micro")

    print("\nUpdated Hybrid Model Evaluation Results:")
    print(f"Hamming Loss: {hamming:.4f}")
    print(f"Precision@k: {precision:.4f}")
    print(f"Recall@k: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
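
The values printed as `Precision@k` and `Recall@k` are micro-averaged precision and recall over the full binary label matrix (`average="micro"`), not ranking metrics restricted to the top `k` items. The sketch below reproduces those micro-averaged quantities and the Hamming loss by hand on a toy label matrix, using their standard definitions; the function name and data are illustrative only.

```python
def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1 and Hamming loss for binary indicator rows."""
    tp = fp = fn = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Hamming loss is the fraction of individual label cells that are wrong
    hamming = (fp + fn) / total if total else 0.0
    return precision, recall, f1, hamming

y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
precision, recall, f1, hamming = micro_metrics(y_true, y_pred)
print(precision, recall, f1, hamming)
```

With thousands of candidate labels and only a few positives per row, the Hamming loss is dominated by true negatives, so it can be very small even when precision is modest.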


In [147]:
from joblib import Parallel, delayed
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss
import numpy as np
import pandas as pd  # needed by process_row for pd.Series

# Define the parallelized function to process a single row
def process_row(row, model, mlb, title_embeddings, tags_embeddings, description_embeddings, df, top_k):
    """
    Processes a single row to compute predictions and ground truth.
    """
    title = row["title"]
    topic_tags = row["topic_tags"]
    description = row["problem_description"]
    ground_truth = row["similar_problems_list"]

    if not ground_truth:
        return None, None

    # Hybrid predictions
    predicted_labels = hybrid_predict_sbert_weighted(
        title, topic_tags, description, model, mlb, title_embeddings, tags_embeddings, description_embeddings, df, top_k
    )
    predicted_labels = pd.Series(predicted_labels)
    predicted_labels = predicted_labels.map(clean_recommendation_title)
    predicted_labels = predicted_labels.tolist()

    # Convert to binary format
    y_true = [1 if label in ground_truth else 0 for label in mlb.classes_]
    y_pred = [1 if label in predicted_labels else 0 for label in mlb.classes_]

    return y_true, y_pred

# Parallelized evaluation function
def evaluate_hybrid_model_weighted_parallel(df, model, mlb, title_embeddings, tags_embeddings, description_embeddings, top_k=5, n_jobs=-1):
    """
    Evaluate the hybrid model using Sentence-BERT embeddings for title, tags, and descriptions in parallel.
    """
    # Use joblib to parallelize row processing
    results = Parallel(n_jobs=n_jobs)(
        delayed(process_row)(row, model, mlb, title_embeddings, tags_embeddings, description_embeddings, df, top_k)
        for _, row in df.iterrows()
    )

    # Filter out None results (rows without ground truth)
    results = [result for result in results if result[0] is not None and result[1] is not None]
    
    # Separate y_true and y_pred
    y_true = np.array([result[0] for result in results])
    y_pred = np.array([result[1] for result in results])

    # Compute metrics
    hamming = hamming_loss(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="micro")
    recall = recall_score(y_true, y_pred, average="micro")
    f1 = f1_score(y_true, y_pred, average="micro")

    print("\nUpdated Hybrid Model Evaluation Results (Parallelized):")
    print(f"Hamming Loss: {hamming:.4f}")
    print(f"Precision@k: {precision:.4f}")
    print(f"Recall@k: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
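
Running this function after `sbert_model.encode` has already been called in the parent process triggers a fork warning from Hugging Face `tokenizers`. The warning text itself recommends setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is executed before the worker processes are started (ideally at the top of the notebook):

```python
import os

# Turn off tokenizers' internal parallelism so that worker processes
# started later do not trigger the "process just got forked" warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

This only changes how the tokenizer parallelizes internally; the `joblib` workers still run in parallel.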


In [148]:
# Define weights for Sentence-BERT components
TITLE_WEIGHT = 0.35
TAGS_WEIGHT = 0.35
DESCRIPTION_WEIGHT = 0.3

# Evaluate the updated hybrid model
# evaluate_hybrid_model_weighted(
#     df, xgb_model, mlb, title_embeddings, tags_embeddings, description_embeddings, top_k=5
# )
# Run parallelized evaluation
evaluate_hybrid_model_weighted_parallel(
    df, xgb_model, mlb, title_embeddings, tags_embeddings, description_embeddings, top_k=5, n_jobs=10
)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Updated Hybrid Model Evaluation Results (Parallelized):
Hamming Loss: 0.0021
Precision@k: 0.2842
Recall@k: 0.5459
F1-Score: 0.3738
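
As a sanity check, the reported F1-score should equal the harmonic mean of the reported micro-averaged precision and recall. The snippet below recomputes it from the two printed values:

```python
precision = 0.2842
recall = 0.5459

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # prints 0.3738
```

The recomputed value matches the printed F1-score to four decimal places.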


In [151]:
# Grid search for weights (only a single candidate triple is evaluated here)
weights = [(0.3, 0.4, 0.3)]
best_f1 = 0
best_weights = None

for title_w, tags_w, desc_w in weights:
    print(f"Testing Weights - Title: {title_w}, Tags: {tags_w}, Description: {desc_w}")
    # Recompute embeddings with current weights
    title_weighted = np.array(title_embeddings) * title_w
    tags_weighted = np.array(tags_embeddings) * tags_w
    description_weighted = np.array(description_embeddings) * desc_w
    combined_embeddings = np.hstack([title_weighted, tags_weighted, description_weighted])

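    # Train and evaluate with the combined embeddings.
    # Note: fit() retrains xgb_model in place, so later cells use the model
    # fitted on this training split with the last weights tested here.
    X_train, X_test, y_train, y_test = train_test_split(combined_embeddings, y, test_size=0.2, random_state=42)
    xgb_model.fit(X_train, y_train)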
    y_pred = xgb_model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average="micro")

    if f1 > best_f1:
        best_f1 = f1
        best_weights = (title_w, tags_w, desc_w)

print(f"Best Weights: {best_weights}, Best F1-Score: {best_f1:.4f}")


Testing Weights - Title: 0.3, Tags: 0.4, Description: 0.3


Parameters: { "use_label_encoder" } are not used.


Best Weights: (0.3, 0.4, 0.3), Best F1-Score: 0.1227
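
The loop above evaluates a single weight triple. Below is a sketch of how a fuller candidate grid could be generated, assuming the three weights should be positive multiples of 0.1 that sum to 1. That constraint is an assumption, not something the notebook states, and `weight_grid` is a hypothetical helper.

```python
def weight_grid(steps=10):
    """Yield (title_w, tags_w, desc_w) triples of positive multiples of 1/steps that sum to 1."""
    for title_units in range(1, steps - 1):
        for tags_units in range(1, steps - title_units):
            desc_units = steps - title_units - tags_units
            # Count in integer units and divide once to limit float drift in the sum
            yield (title_units / steps, tags_units / steps, desc_units / steps)

weights = list(weight_grid())
print(len(weights), weights[:3])
```

Each candidate requires refitting the XGBoost model, so a grid of this size may be slow; a coarser `steps` value or a random sample of triples would reduce the cost.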


## Getting Recommendations for Sample Problems

In [184]:
sample_title = df["title"].iloc[10]
print(sample_title)
sample_topic_tag = df["topic_tags"].iloc[10]
print(sample_topic_tag)
sample_description = df["problem_description"].iloc[10]
print(sample_description)
sample_id = df["id"].iloc[10]
print(sample_id)
sample_similar_problems = df["similar_problems_list"].iloc[10]
print(sample_similar_problems)

11. Container With Most Water
'Array', 'Two Pointers', 'Greedy'
You are given an integer array height of length n. There are n vertical lines drawn such that the two endpoints of the ith line are (i, 0) and (i, height[i]).
Find two lines that together with the x-axis form a container, such that the container contains the most water.
Return the maximum amount of water a container can store.
Notice that you may not slant the container.
 
Example 1:

Input: height = [1,8,6,2,5,4,8,3,7]
Output: 49
Explanation: The above vertical lines are represented by array [1,8,6,2,5,4,8,3,7]. In this case, the max area of water (blue section) the container can contain is 49.

Example 2:
Input: height = [1,1]
Output: 1

 
Constraints:

n == height.length
2 <= n <= 10^5
0 <= height[i] <= 10^4


11
['Trapping Rain Water', 'Maximum Tastiness of Candy Basket', 'House Robber IV']


In [197]:
sample_title = df["title"].iloc[19]
print(sample_title)
sample_topic_tag = df["topic_tags"].iloc[19]
print(sample_topic_tag)
sample_description = df["problem_description"].iloc[19]
print(sample_description)
sample_id = df["id"].iloc[19]
print(sample_id)
sample_similar_problems = df["similar_problems_list"].iloc[19]
print(sample_similar_problems)

title_embedding = sbert_model.encode([sample_title]) * 0.35
tags_embedding = sbert_model.encode([sample_topic_tag]) * 0.35
description_embedding = sbert_model.encode([sample_description]) * 0.3

# Combine weighted embeddings
combined_embedding = np.hstack([title_embedding, tags_embedding, description_embedding])
    
# Step 2: Model predictions
model_pred = xgb_model.predict(combined_embedding)
model_predicted_labels = mlb.inverse_transform(model_pred)
print(model_predicted_labels)

20. Valid Parentheses
'String', 'Stack'
Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid.
An input string is valid if:

Open brackets must be closed by the same type of brackets.
Open brackets must be closed in the correct order.
Every close bracket has a corresponding open bracket of the same type.

 
Example 1:
Input: s = "()"
Output: true

Example 2:
Input: s = "()[]{}"
Output: true

Example 3:
Input: s = "(]"
Output: false

 
Constraints:

1 <= s.length <= 10^4
s consists of parentheses only '()[]{}'.


20
['Generate Parentheses', 'Longest Valid Parentheses', 'Remove Invalid Parentheses', 'Check If Word Is Valid After Substitutions', 'Check if a Parentheses String Can Be Valid', 'Move Pieces to Obtain a String']
[('Check if a Parentheses String Can Be Valid', 'Generate Parentheses')]
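
For this sample, the classifier's two predicted labels can be compared with the ground-truth list directly. A small sketch using the values printed above:

```python
ground_truth = {
    "Generate Parentheses",
    "Longest Valid Parentheses",
    "Remove Invalid Parentheses",
    "Check If Word Is Valid After Substitutions",
    "Check if a Parentheses String Can Be Valid",
    "Move Pieces to Obtain a String",
}
predicted = {"Check if a Parentheses String Can Be Valid", "Generate Parentheses"}

# Labels that appear in both sets are the correct predictions
hits = ground_truth & predicted
precision = len(hits) / len(predicted)
recall = len(hits) / len(ground_truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Both predictions are correct, but only two of the six similar problems are recovered. This row may also have been part of the training split, so it should not be read as an unbiased estimate of model quality.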


## Deployment

In [186]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.9.1-py3-none-any.whl (57.2 MB)
Collecting ruff>=0.2.2
  Downloading ruff-0.8.3-py3-none-macosx_11_0_arm64.whl (9.9 MB)
Collecting starlette<1.0,>=0.40.0
  Downloading starlette-0.42.0-py3-none-any.whl (73 kB)
Collecting fastapi<1.0,>=0.115.2
  Downloading fastapi-0.115.6-py3-none-any.whl (94 kB)
Collecting tomlkit<0.14.0,>=0.12.0
  Downloading tomlkit-0.13.2-py3-none-any.whl (37 kB)
Collecting safehttpx<0.2.0,>=0.1.6
  Downloading s

In [192]:
import joblib
joblib.dump(xgb_model, "xgb_model.pkl")

['xgb_model.pkl']

In [193]:
joblib.dump(mlb, "mlb.pkl")


['mlb.pkl']
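
The embedding weights are not stored in either pickle, yet the model's predictions only make sense with the weights it was trained on. One option, sketched here as an assumption rather than as part of the notebook, is to save the weights next to the model artifacts. The file name `embedding_weights.json` is hypothetical.

```python
import json
from pathlib import Path

weights = {"title": 0.35, "tags": 0.35, "description": 0.3}

weights_path = Path("embedding_weights.json")  # hypothetical file name
weights_path.write_text(json.dumps(weights, indent=2))

# At inference time, load the same weights instead of hard-coding them
loaded = json.loads(weights_path.read_text())
print(loaded)
```

Loading the weights from one file keeps the training, evaluation, and deployment cells from drifting apart when the values change.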

In [200]:
import gradio as gr
import numpy as np
from sentence_transformers import SentenceTransformer
import joblib
from sklearn.preprocessing import MultiLabelBinarizer

# Load your trained model and MultiLabelBinarizer
xgb_model = joblib.load("xgb_model.pkl")  # Replace with your model file
mlb = joblib.load("mlb.pkl")             # MultiLabelBinarizer saved during training

# Load Sentence-BERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')  # Replace with the same model used during training

# Function to process input and get predictions
def recommend_labels(title, topic_tags, description):
    # Step 1: Generate Sentence-BERT embeddings with custom weights
    title_embedding = sbert_model.encode([title]) * 0.35
    tags_embedding = sbert_model.encode([topic_tags]) * 0.35
    description_embedding = sbert_model.encode([description]) * 0.3

    # Combine embeddings into a single vector
    combined_embedding = np.hstack([title_embedding, tags_embedding, description_embedding])

    # Step 2: Predict using the XGBoost model
    model_pred = xgb_model.predict(combined_embedding)

    # Step 3: Convert predictions back to human-readable text
    predicted_labels = mlb.inverse_transform(model_pred)

    # Format the output
    if predicted_labels and len(predicted_labels[0]) > 0:
        return ", ".join(predicted_labels[0])  # Join the labels into a readable string
    else:
        return "No recommendations found."

# Gradio interface
interface = gr.Interface(
    fn=recommend_labels,  # The function to process input and output
    inputs=[
        gr.Textbox(label="Title of the Problem", placeholder="Enter the problem title here..."),
        gr.Textbox(label="Topic Tags", placeholder="Enter topic tags (comma-separated)..."),
        gr.Textbox(label="Problem Description", placeholder="Describe the problem...")
    ],
    outputs=gr.Textbox(label="Recommended Labels"),
    title="Multi-Label Classification App",
    description="Provide the problem title, topic tags, and description to get the most relevant recommendations."
)

# Launch the Gradio app
interface.launch()


* Running on local URL:  http://127.0.0.1:7864

To create a public link, set `share=True` in `launch()`.


