#  NLP Project: Text Similarity Checker using SBERT

**Name:** Karim Gaber  
**Registration Number:** 221003403  
**Date:** February 2024

---

##  1. Project Description

This project aims to build a model that determines whether two questions are semantically similar or duplicates. We use the Quora Question Pairs dataset, which contains real-world examples of question pairs from Quora's platform.

This task falls under **semantic textual similarity**, a fundamental problem in **Natural Language Processing (NLP)**, with real-world applications in question answering, search engines, and chatbots.

The project involves text preprocessing, embedding generation using **Sentence-BERT (SBERT)** with a TF-IDF baseline for comparison, and classification on the resulting similarity scores using **Logistic Regression**, SVM, and Random Forest.

---

##  2. Problem Statement

Duplicate questions on platforms like Quora create redundancy and degrade the user experience. Our goal is to automatically identify whether two questions express the same intent, even when they are worded differently.

This project will:
- Extract meaningful representations of question pairs.
- Compute similarity scores between the question embeddings.
- Train a binary classification model to detect duplicates.

---



##  3. Dataset Description

We used the **Quora Question Pairs** dataset from Kaggle, which contains over 400,000 question pairs collected from the Quora Q&A platform. The file loaded in this notebook, `dataset.csv`, has 49,848 rows.

Each data point includes:
- Two questions (`question1`, `question2`)
- A binary label (`is_duplicate`) indicating whether they have the same meaning.

| Feature        | Description                                |
|----------------|--------------------------------------------|
| id             | Unique row ID                              |
| qid1, qid2     | Unique IDs for each question                |
| question1/2    | Text of the two questions                  |
| is_duplicate   | 1 = same meaning, 0 = different meaning     |

**Class Distribution**:  
About 37% of the pairs are duplicates and 63% are not, a mild class imbalance that should be addressed during modeling.
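
One common way to address this imbalance is to weight each class inversely to its frequency. The sketch below computes such weights from the class counts reported by `value_counts()` in this notebook (31,254 non-duplicates and 18,593 duplicates), using the `n_samples / (n_classes * count)` heuristic that scikit-learn applies for `class_weight='balanced'`. The helper name `balanced_class_weights` is illustrative, and the models in this notebook are trained without these weights.

```python
def balanced_class_weights(class_counts):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts taken from the value_counts() output of this notebook.
counts = {0: 31254, 1: 18593}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The minority (duplicate) class receives the larger weight. Passing `class_weight='balanced'` to `LogisticRegression`, `SVC`, or `RandomForestClassifier` should have the same effect without computing the weights by hand.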

 **Cleaning Steps**:
- Drop rows with missing questions
- Convert text to lowercase
- (Optionally) remove punctuation

 **Why It Matters**:  
This dataset provides a real-world binary classification challenge for detecting semantic similarity in NLP tasks.


## Load data

In [None]:
import pandas as pd

df = pd.read_csv('dataset.csv')

# Dataset shape

In [None]:
print("Dataset shape:", df.shape)


Dataset shape: (49848, 6)


# Column names

In [None]:
print("Columns:", df.columns)

Columns: Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], dtype='object')


In [None]:
df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|----|------|------|-----------|-----------|--------------|
| 0 | 0 | 1 | 2.0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0.0 |
| 1 | 1 | 3 | 4.0 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0.0 |
| 2 | 2 | 5 | 6.0 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0.0 |
| 3 | 3 | 7 | 8.0 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0.0 |
| 4 | 4 | 9 | 10.0 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0.0 |


In [None]:
df.info()  # info() prints its summary directly and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49848 entries, 0 to 49847
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            49848 non-null  int64  
 1   qid1          49848 non-null  int64  
 2   qid2          49847 non-null  float64
 3   question1     49847 non-null  object 
 4   question2     49847 non-null  object 
 5   is_duplicate  49847 non-null  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 2.3+ MB


# Check class balance

In [None]:
print(df['is_duplicate'].value_counts())

is_duplicate
0.0    31254
1.0    18593
Name: count, dtype: int64


# Check nulls

In [None]:
print(df.isnull().sum())


id              0
qid1            0
qid2            1
question1       1
question2       1
is_duplicate    1
dtype: int64


In [None]:
df = df.dropna(subset=['question1', 'question2'])


### 🧹 Text Preprocessing

To prepare the text data for modeling, we applied a comprehensive cleaning process to both `question1` and `question2`. The steps include:

- Converting text to lowercase
- Removing HTML tags and special characters
- Removing punctuation and numbers
- Removing common English stopwords
- Lemmatizing each word to its base form
- Removing extra whitespace

This standardizes the text and removes most of the noise before vectorization and modeling.


In [None]:
import pandas as pd
import numpy as np
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
def clean_text(text):
    if pd.isnull(text):
        return ""
    text = text.lower()  # Lowercase
    text = re.sub(r'<.*?>', ' ', text)  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', ' ', text)  # Remove non-letters
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    words = text.split()
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatize
    return ' '.join(words)
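
To make the effect of the regular-expression steps concrete, the sketch below applies only the lowercase, tag-removal, and non-letter-removal steps to a made-up question (not a row from the dataset). The helper name `strip_markup_and_symbols` is hypothetical, and stopword removal and lemmatization are left out so the example runs without NLTK data.

```python
import re

def strip_markup_and_symbols(text):
    """Apply the lowercase, tag-removal, and non-letter-removal steps only."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)      # drop HTML-style tags
    text = re.sub(r'[^a-z\s]', ' ', text)   # drop digits, punctuation, symbols
    return ' '.join(text.split())           # collapse repeated whitespace

sample = "Find <b>x</b> when [math]2^{10}[/math] = x?"  # made-up example
cleaned = strip_markup_and_symbols(sample)
print(cleaned)
```

All digits and operators are discarded, so questions that differ only in their numbers become identical after cleaning. This is worth keeping in mind when interpreting similarity scores for math-style questions.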


## Feature Engineering

In [None]:
df['question1_clean'] = df['question1'].apply(clean_text)
df['question2_clean'] = df['question2'].apply(clean_text)

In [None]:
df.drop(['id', 'qid1', 'qid2', 'question1', 'question2'], axis=1, inplace=True)

##  Text Representation

In [None]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# Encoding using TF-IDF

In [None]:
# 3. TF-IDF Vectorization
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=5000)
combined_qs = pd.concat([df['question1_clean'], df['question2_clean']], axis=0)
tfidf.fit(combined_qs)

q1_tfidf = tfidf.transform(df['question1_clean'])
q2_tfidf = tfidf.transform(df['question2_clean'])


In [None]:
# Cosine Similarity Feature
df['cosine_sim'] = [cosine_similarity(q1, q2)[0][0] for q1, q2 in zip(q1_tfidf, q2_tfidf)]
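
The per-row loop above calls `cosine_similarity` once for every question pair. Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the cosine similarity of two rows equals their dot product, which can be computed for all pairs in one step. A minimal sketch on two made-up question pairs (not the notebook's data) follows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two illustrative pairs: an identical pair and a pair with no shared words.
q1 = ["how do i learn python", "what is machine learning"]
q2 = ["how do i learn python", "best pizza in rome"]

vectorizer = TfidfVectorizer()  # default norm='l2' gives unit-length rows
vectorizer.fit(q1 + q2)
a = vectorizer.transform(q1)
b = vectorizer.transform(q2)

# Element-wise product of matching rows, summed per row = row-wise dot product.
pair_sims = np.asarray(a.multiply(b).sum(axis=1)).ravel()
print(pair_sims)
```

The same expression applied to `q1_tfidf` and `q2_tfidf` should give the values stored in `df['cosine_sim']`, assuming the vectorizer keeps its default normalization.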

# Modeling after TF-IDF

## Logistic Regression

In [None]:
X = df[['cosine_sim']]
y = df['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
y_pred_log = model.predict(X_test)

print("Accuracy with  LogisticRegression model :", accuracy_score(y_test, y_pred_log))

Accuracy with  LogisticRegression model : 0.6456369107321966


In [None]:
y_pred_log = model.predict(X_test)
print("Confusion Matrix for LogisticRegression model :\n", confusion_matrix(y_test, y_pred_log))

Confusion Matrix for LogisticRegression model :
 [[4992 1282]
 [2251 1445]]
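
The accuracy, precision, recall, and F1-score reported in this notebook can all be derived from a confusion matrix. The sketch below does this for the matrix above, reading it in scikit-learn's layout `[[TN, FP], [FN, TP]]` with the duplicate class as positive. The function name `metrics_from_confusion` is illustrative.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Values read from the confusion matrix above: [[TN, FP], [FN, TP]].
acc, prec, rec, f1 = metrics_from_confusion(tn=4992, fp=1282, fn=2251, tp=1445)
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

The low recall shows that this baseline misses most true duplicates, which accuracy alone does not reveal.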


In [None]:
print("\nClassification Report:\n",classification_report(y_test, y_pred_log))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.69      0.80      0.74      6274
         1.0       0.53      0.39      0.45      3696

    accuracy                           0.65      9970
   macro avg       0.61      0.59      0.59      9970
weighted avg       0.63      0.65      0.63      9970
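
With a single input feature, a fitted logistic regression reduces to a cut-off on the similarity score: it predicts a duplicate when `coef * x + intercept > 0`. The sketch below recovers that cut-off on synthetic similarity scores, which are randomly generated and are not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic similarity scores: duplicates tend to score higher.
rng = np.random.default_rng(0)
sim_neg = rng.uniform(0.0, 0.7, 300)
sim_pos = rng.uniform(0.4, 1.0, 200)
X_demo = np.concatenate([sim_neg, sim_pos]).reshape(-1, 1)
y_demo = np.concatenate([np.zeros(300), np.ones(200)])

clf = LogisticRegression().fit(X_demo, y_demo)

# Decision boundary: coef * x + intercept = 0  =>  x = -intercept / coef
threshold = -clf.intercept_[0] / clf.coef_[0, 0]
by_threshold = (X_demo[:, 0] > threshold).astype(float)
print(f"similarity threshold: {threshold:.3f}")
```

Applying the same formula to the fitted `model` above would give the similarity value at which this notebook's classifier switches from "not duplicate" to "duplicate".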



## SVM (SVC)

In [None]:
from sklearn.svm import SVC
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)


In [None]:
y_pred_svm = svm_model.predict(X_test)
print("Confusion Matrix for SVC model :\n", confusion_matrix(y_test, y_pred_svm))

Confusion Matrix for SVC model :
 [[6274    0]
 [3696    0]]


In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.63      1.00      0.77      6274
         1.0       0.00      0.00      0.00      3696

    accuracy                           0.63      9970
   macro avg       0.31      0.50      0.39      9970
weighted avg       0.40      0.63      0.49      9970




Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
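
This warning appears because the SVC predicted class 0 for every test pair (see the confusion matrix above), so precision for the duplicate class has a zero denominator. A small sketch with made-up labels shows how the `zero_division` parameter makes that case explicit and suppresses the warning.

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # a classifier that never predicts the positive class

# zero_division=0 returns 0.0 instead of warning when TP + FP == 0.
precision = precision_score(y_true, y_pred, zero_division=0)
print(precision)
```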





## Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier
# Initialize and train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
# Predict and evaluate
y_pred_rf = rf_model.predict(X_test)

In [None]:
y_pred_rf = rf_model.predict(X_test)
print("Confusion Matrix for Random Forest model :\n", confusion_matrix(y_test, y_pred_rf))

Confusion Matrix for Random Forest model :
 [[4172 2102]
 [1635 2061]]


In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.72      0.66      0.69      6274
         1.0       0.50      0.56      0.52      3696

    accuracy                           0.63      9970
   macro avg       0.61      0.61      0.61      9970
weighted avg       0.64      0.63      0.63      9970



##  Model Comparison: Interactive Visualization

Below is an interactive bar chart comparing the performance of different classifiers based on accuracy, precision, recall, and F1-score.


In [None]:
import plotly.graph_objects as go
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Collect the fitted classifiers
models = {
    "Logistic Regression": model,
    "SVM": svm_model,
    "Random Forest": rf_model
}

# Evaluation storage
model_names = []
accuracies = []
precisions = []
recalls = []
f1_scores = []

for name, clf in models.items():  # `clf` avoids rebinding the global `model`
    y_pred = clf.predict(X_test)
    model_names.append(name)
    accuracies.append(accuracy_score(y_test, y_pred))
    precisions.append(precision_score(y_test, y_pred))
    recalls.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

# Plotly bar chart
fig = go.Figure()

fig.add_trace(go.Bar(x=model_names, y=accuracies, name='Accuracy'))
fig.add_trace(go.Bar(x=model_names, y=precisions, name='Precision'))
fig.add_trace(go.Bar(x=model_names, y=recalls, name='Recall'))
fig.add_trace(go.Bar(x=model_names, y=f1_scores, name='F1-Score'))

fig.update_layout(
    title='📊 Model Performance Comparison',
    xaxis_title='Models',
    yaxis_title='Score',
    barmode='group',
    height=500,
    template='plotly_white'
)

fig.show()



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



### 🔍 Citation for SBERT
This project uses [Sentence-BERT (SBERT)](https://www.sbert.net) for generating high-quality sentence embeddings.  
SBERT was introduced in:  
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *EMNLP 2019*.  
Paper: https://arxiv.org/abs/1908.10084


# Encoding using SBERT

In [None]:
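# Load SBERT model (named `sbert_model` so the classifier variable `model` cannot overwrite it)
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Encode both question columns into dense NumPy arrays
embeddings_q1 = sbert_model.encode(df['question1_clean'].tolist(), convert_to_tensor=False)
embeddings_q2 = sbert_model.encode(df['question2_clean'].tolist(), convert_to_tensor=False)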

## Measure similarity between question 1 & question 2

In [None]:
# Compute cosine similarity
similarities = [cosine_similarity([vec1], [vec2])[0][0] for vec1, vec2 in zip(embeddings_q1, embeddings_q2)]

# Add to DataFrame
df['sbert_similarity'] = similarities
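
The list comprehension above computes one cosine similarity per call. For dense embeddings, the same row-wise similarity can be written as a single vectorized NumPy expression. The sketch below uses small hand-made vectors in place of SBERT embeddings, and the function name `rowwise_cosine` is illustrative.

```python
import numpy as np

def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two 2-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)  # eps guards against zero vectors

# Toy vectors standing in for embeddings_q1 and embeddings_q2.
emb_a = [[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
emb_b = [[1.0, 0.0], [-1.0, 1.0], [4.0, 3.0]]
sims = rowwise_cosine(emb_a, emb_b)
print(sims)
```

Calling `rowwise_cosine(embeddings_q1, embeddings_q2)` should reproduce the `similarities` list up to floating-point error.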

# Modeling after SBERT & cosine similarity

## Using Logistic Regression

In [None]:
X = df[['sbert_similarity']]
y = df['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
y_pred_log = model.predict(X_test)

print("Accuracy with  LogisticRegression model :", accuracy_score(y_test, y_pred_log))

Accuracy with  LogisticRegression model : 0.7228686058174524


In [None]:
y_pred_log = model.predict(X_test)
print("Confusion Matrix for LogisticRegression model :\n", confusion_matrix(y_test, y_pred_log))

Confusion Matrix for LogisticRegression model :
 [[4961 1313]
 [1450 2246]]


In [None]:
print("\nClassification Report:\n",classification_report(y_test, y_pred_log))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.77      0.79      0.78      6274
         1.0       0.63      0.61      0.62      3696

    accuracy                           0.72      9970
   macro avg       0.70      0.70      0.70      9970
weighted avg       0.72      0.72      0.72      9970



## Using SVM classifier

In [None]:
from sklearn.svm import SVC
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)


In [None]:
y_pred_svm = svm_model.predict(X_test)
print("Confusion Matrix for SVC model :\n", confusion_matrix(y_test, y_pred_svm))

Confusion Matrix for SVC model :
 [[4799 1475]
 [1293 2403]]


In [None]:
print("Accuracy with  SVC model :", accuracy_score(y_test, y_pred_svm))

Accuracy with  SVC model : 0.7223671013039117


In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.79      0.76      0.78      6274
         1.0       0.62      0.65      0.63      3696

    accuracy                           0.72      9970
   macro avg       0.70      0.71      0.71      9970
weighted avg       0.73      0.72      0.72      9970



## Modeling using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Initialize and train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
# Predict and evaluate
y_pred_rf = rf_model.predict(X_test)

In [None]:
y_pred_rf = rf_model.predict(X_test)
print("Confusion Matrix for Random Forest model :\n", confusion_matrix(y_test, y_pred_rf))

Confusion Matrix for Random Forest model :
 [[4552 1722]
 [1516 2180]]


In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.75      0.73      0.74      6274
         1.0       0.56      0.59      0.57      3696

    accuracy                           0.68      9970
   macro avg       0.65      0.66      0.66      9970
weighted avg       0.68      0.68      0.68      9970



##  Model Comparison: Interactive Visualization

Below is an interactive bar chart comparing the performance of different classifiers based on accuracy, precision, recall, and F1-score.


In [None]:
import plotly.graph_objects as go
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Collect the fitted classifiers
models = {
    "Logistic Regression": model,
    "SVM": svm_model,
    "Random Forest": rf_model
}

# Evaluation storage
model_names = []
accuracies = []
precisions = []
recalls = []
f1_scores = []

for name, clf in models.items():  # `clf` avoids rebinding the global `model`
    y_pred = clf.predict(X_test)
    model_names.append(name)
    accuracies.append(accuracy_score(y_test, y_pred))
    precisions.append(precision_score(y_test, y_pred))
    recalls.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

# Plotly bar chart
fig = go.Figure()

fig.add_trace(go.Bar(x=model_names, y=accuracies, name='Accuracy'))
fig.add_trace(go.Bar(x=model_names, y=precisions, name='Precision'))
fig.add_trace(go.Bar(x=model_names, y=recalls, name='Recall'))
fig.add_trace(go.Bar(x=model_names, y=f1_scores, name='F1-Score'))

fig.update_layout(
    title='📊 Model Performance Comparison',
    xaxis_title='Models',
    yaxis_title='Score',
    barmode='group',
    height=500,
    template='plotly_white'
)

fig.show()


## ✅ Conclusion and Future Work

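In this project, we built a text similarity checker on the Quora Question Pairs dataset. We compared a traditional baseline (TF-IDF + cosine similarity) with SBERT sentence embeddings, feeding the cosine similarity of each question pair into three classifiers: Logistic Regression, SVM, and Random Forest.

SBERT similarity outperformed TF-IDF similarity for every classifier. With SBERT, Logistic Regression and the linear SVC both reached about 72% test accuracy (0.7229 and 0.7224), compared with about 65% for Logistic Regression on TF-IDF similarity. The SVC had the slightly higher F1-score on the duplicate class (0.63 versus 0.62), so SBERT embeddings with an SVC classifier is the configuration we recommend for this text similarity checker.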
