<a href="https://colab.research.google.com/github/Coding-bot007/machine-learning/blob/main/Semantic_Text_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

The goal of this project is to build a machine learning model that can predict the semantic similarity between pairs of text paragraphs. The dataset consists of randomly sampled pairs of text paragraphs, where each pair may or may not be semantically similar. The model needs to predict a continuous value between 0 and 1, indicating the degree of similarity between the two text paragraphs.



# Dataset

The dataset contains pairs of text paragraphs. Each pair is meant to be scored with a similarity value ranging from 0 (not similar) to 1 (very similar). The CSV file loaded in this notebook holds 3,000 pairs in two columns, `text1` and `text2`, and does not include a similarity-score column or a separate train/test split.

# Problem Approach:
To solve this problem, we will follow these steps:

Data Preprocessing: The text paragraphs will be preprocessed by removing special characters and stop words, and then tokenized.

Text Embeddings: We will use pre-trained language models, such as BERT, to convert the tokenized text paragraphs into dense vector representations (embeddings).

Model Building: A machine learning model, such as a neural network or regression model, will be trained on the training dataset using the text embeddings and corresponding similarity scores.

Model Evaluation: The trained model will be evaluated on the test set to assess its performance in predicting the similarity scores.
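
The evaluation step could be sketched as follows, assuming gold similarity scores are available for the test pairs. The source does not name a metric, so Pearson correlation and mean squared error are assumptions here, and the `y_true` and `y_pred` values are made up for illustration; they are not results from this notebook.

```python
import numpy as np

def evaluate_similarity(y_true, y_pred):
    """Return (pearson_r, mse) for gold vs. predicted similarity scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Pearson correlation: how well predictions track the gold ranking/scale
    pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
    # Mean squared error: average squared distance from the gold scores
    mse = np.mean((y_true - y_pred) ** 2)
    return float(pearson_r), float(mse)

# Hypothetical scores, for illustration only
y_true = [0.0, 0.2, 0.5, 0.9, 1.0]
y_pred = [0.1, 0.3, 0.4, 0.8, 0.9]
pearson_r, mse = evaluate_similarity(y_true, y_pred)
print(f"Pearson r: {pearson_r:.3f}, MSE: {mse:.4f}")
```

Correlation rewards predictions that order the pairs correctly, while MSE also penalizes predictions that are on the wrong scale, so reporting both gives a fuller picture.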

In [31]:
#Importing the necessary libraries for the model
import numpy as np
import pandas as pd

import re
from tqdm import tqdm

import collections

from sklearn.cluster import KMeans

from nltk.stem import WordNetLemmatizer  # For lemmatization of words
from nltk.corpus import stopwords  # Load the list of stop words
from nltk import word_tokenize  # Split a paragraph into tokens

import pickle
import sys

from gensim.models import word2vec  # For representing words as vectors
import gensim

import nltk
nltk.download('stopwords')

!pip install transformers


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
text_data = pd.read_csv("/content/drive/MyDrive/Precily_dataset.csv")
print("Shape of text_data : ", text_data.shape)
text_data.head(3)


Shape of text_data :  (3000, 2)


Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...


In [4]:
text_data.isnull().sum()  # Check whether text_data has any null values

text1    0
text2    0
dtype: int64

# Data preprocessing

In [5]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
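
To see what these substitutions do in practice, here is a small sketch that applies the same rules in the same order, written as a table. The name `expand_contractions` and the sample phrases are illustrative assumptions, not part of the notebook.

```python
import re

# Same substitution rules as decontracted() above, in the same order.
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(phrase):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, replacement in CONTRACTION_RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("they won't say we're late"))
# they will not say we are late
print(expand_contractions("britain's jason gardener"))
# britain is jason gardener  (a possessive is expanded as if it were "is")
```

The second call shows a limitation of the `'s` rule: it cannot tell a possessive from a contraction. The sample rows above already read "britain s", which suggests apostrophes were stripped before this step, so the rule may rarely fire on this dataset.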

In [8]:
# Combining all the above statements

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

preprocessed_text1 = []

# tqdm displays a progress bar

for sentence in tqdm(text_data['text1'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)

    sent = ' '.join(e for e in sent.split() if e not in stop_words)
    preprocessed_text1.append(sent.lower().strip())

100%|██████████| 3000/3000 [02:17<00:00, 21.81it/s]


In [9]:
text_data['text1'] = preprocessed_text1
text_data.head(3)

Unnamed: 0,text1,text2
0,broadband challenges tv viewing number europea...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested drug find rap mogul marion s...,amnesty chief laments war failure the lack of ...
2,player burn worries robinson england coach and...,hanks greeted at wintry premiere hollywood sta...


In [10]:
# Combining all the above statements, this time for text2

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

preprocessed_text2 = []

# tqdm displays a progress bar
for sentence in tqdm(text_data['text2'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)

    sent = ' '.join(e for e in sent.split() if e not in stop_words)
    preprocessed_text2.append(sent.lower().strip())

100%|██████████| 3000/3000 [02:26<00:00, 20.48it/s]


In [11]:
# Merging preprocessed_text2 in text_data

text_data['text2'] = preprocessed_text2

text_data.head(3)

Unnamed: 0,text1,text2
0,broadband challenges tv viewing number europea...,gardener wins double glasgow britain jason gar...
1,rap boss arrested drug find rap mogul marion s...,amnesty chief laments war failure lack public ...
2,player burn worries robinson england coach and...,hanks greeted wintry premiere hollywood star t...
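
The two preprocessing loops above differ only in the column they read, so the cleaning could be factored into one helper. The sketch below is a hypothetical refactoring: `clean_text` and the tiny `STOP_WORDS` set are assumptions chosen so the example runs without NLTK data, and the contraction step is left out for brevity.

```python
import re

# Tiny stand-in stop-word set so the sketch runs without NLTK data.
# In the notebook this would be set(stopwords.words('english')).
STOP_WORDS = {"the", "of", "in", "at", "a", "is", "over"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep alphanumerics, drop stop words, then lowercase and strip."""
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    kept = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(kept).lower().strip()

print(clean_text("hanks greeted at wintry premiere"))
# hanks greeted wintry premiere
```

With a helper like this, each column could be cleaned with `text_data['text1'].apply(clean_text)`. Like the loops above, it filters stop words before lowercasing, so capitalized stop words would survive; lowercasing first would avoid that.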


In [12]:
def word_tokenizer(text):
    # Tokenize the text and lemmatize each token
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

In [51]:
# Tokenize each text pair with the BERT tokenizer
from transformers import BertTokenizer, BertModel
import torch

model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Note: some texts exceed BERT's 512-token limit; the ids are only inspected here,
# not fed to the model
tokenized_texts = []
for i, row in text_data.iterrows():
    text1 = row['text1']
    text2 = row['text2']
    tokenized_text1 = tokenizer.encode(text1, add_special_tokens=True)
    tokenized_text2 = tokenizer.encode(text2, add_special_tokens=True)
    tokenized_texts.append((tokenized_text1, tokenized_text2))

num_tokenized_texts = len(tokenized_texts)
print(f"Number of tokenized texts: {num_tokenized_texts}")

# Print only the first few pairs; printing all 3,000 exceeds the notebook's output rate limit
for i, (tokenized_text1, tokenized_text2) in enumerate(tokenized_texts[:3]):
    print(f"Tokenized Text {i+1} (Text 1): {tokenized_text1}")
    print(f"Tokenized Text {i+1} (Text 2): {tokenized_text2}")

# The token lists have different lengths, so build an object array once, outside the loop
tokenized_texts_2d = np.array(tokenized_texts, dtype=object)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Token indices sequence length is longer than the specified maximum sequence length for this model (583 > 512). Running this sequence through the model will result in indexing errors


Number of tokenized texts: 3000
Tokenized Text 1 (Text 1): [101, 19595, 7860, 2694, 10523, 2193, 13481, 19595, 9913, 2627, ...]

[Output truncated: the original run printed token ids for every pair and hit the notebook's "IOPub data rate exceeded" limit.]
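
The warning above (583 > 512) means some paragraphs are longer than BERT can process in one pass. The simplest remedy is to truncate in the tokenizer call (`truncation=True, max_length=512`). An alternative that keeps all of the text is to split the ids into windows, sketched below in plain Python. `chunk_token_ids` is a hypothetical helper, not something the notebook defines, and how to combine the per-chunk embeddings afterwards (for example by averaging) is a further design choice the source does not cover.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in bert-base-uncased

def chunk_token_ids(token_ids, max_len=512):
    """Split ids (including [CLS]/[SEP]) into windows of at most max_len ids."""
    body = token_ids[1:-1]   # drop the original [CLS] and [SEP]
    window = max_len - 2     # leave room to re-add both special tokens
    starts = range(0, max(len(body), 1), window)
    return [[CLS_ID] + body[s:s + window] + [SEP_ID] for s in starts]

# A made-up 583-id sequence, matching the length reported in the warning
ids = [CLS_ID] + list(range(1000, 1581)) + [SEP_ID]
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
# [512, 73]
```

Each chunk starts with `[CLS]` and ends with `[SEP]`, so it can be passed to the model like any other single sequence.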

In [52]:
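from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# text1 and text2 still hold the last row of text_data from the loop in the previous cell,
# so this cell scores a single pair. Truncate to BERT's 512-token limit.
tokens1 = tokenizer(text1, return_tensors="pt", truncation=True, max_length=512)
tokens2 = tokenizer(text2, return_tensors="pt", truncation=True, max_length=512)

model = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():  # no gradients are needed for inference
    output1 = model(**tokens1)
    output2 = model(**tokens2)

# Mean-pool the token embeddings into one vector per text
sentence_embedding1 = torch.mean(output1.last_hidden_state, dim=1)
sentence_embedding2 = torch.mean(output2.last_hidden_state, dim=1)

# cosine_similarity returns a 1x1 matrix for two single-row inputs
cosine_sim = cosine_similarity(sentence_embedding1.numpy(), sentence_embedding2.numpy())

print("Cosine Similarity:", cosine_sim[0][0])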




Cosine Similarity: 0.8727516
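
Two details of the cell above matter when scoring many pairs. Averaging over every token position is fine for a single unpadded text, but in a padded batch the padding positions should be excluded. Cosine similarity also ranges over [-1, 1], while the problem statement asks for a value in [0, 1]. The NumPy sketch below illustrates both points on toy data. The function names and the linear rescaling `(cos + 1) / 2` are assumptions, one possible choice rather than the notebook's method.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cosine_similarity_01(a, b):
    """Cosine similarity of two vectors, rescaled from [-1, 1] to [0, 1]."""
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cos + 1.0) / 2.0

# Toy batch: 2 sequences, 3 positions, 2-dim vectors.
# The last position of the first sequence is padding and must be ignored.
hidden = np.array([[[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],
                   [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
score = cosine_similarity_01(pooled[0], pooled[1])
print(pooled)  # rows: [2. 0.] and [0. 4.]
print(score)   # 0.5, since orthogonal vectors have cosine 0
```

With real BERT outputs, `hidden` would correspond to `last_hidden_state` and `mask` to the tokenizer's `attention_mask`, both converted to NumPy arrays.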


# Conclusion:

In this project, we preprocessed 3,000 pairs of text paragraphs and used a pre-trained BERT model (`bert-base-uncased`) to measure semantic similarity as the cosine similarity between mean-pooled embeddings. For the pair scored above, the cosine similarity was 0.8727516. This value describes how similar that one pair is; it is not a measure of model accuracy, which would require labeled similarity scores to assess. This work lays the foundation for applications in information retrieval and content organization.