## Assignment: Associate Data Scientist  
**Candidate:** Karishma Rawal  
**Position Applied:** Associate Data Scientist 📊  


### PROBLEM STATEMENT
- Dataset (attached with the task): The data contains pairs of paragraphs, randomly sampled from a raw dataset. Each pair may or may not be semantically similar. The candidate is to predict a value between 0 and 1 indicating the similarity between the two paragraphs in each pair. A sample of a similar dataset will be used as test data, so it is crucial to build the model solution using the provided dataset.


#### Part A
- Build an algorithm/model that can quantify the degree of similarity between the two texts based on semantic similarity. Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other.
- 1 means highly similar
- 0 means highly dissimilar

#### 1. Importing libraries

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # imported but not used in the TF-IDF pipeline below

#### 2. Reading the CSV file into a Pandas DataFrame 📂➡️📊

In [2]:
data = pd.read_csv('DataNeuron_Text_Similarity.csv')

#### 3. Checking the first few rows to understand the structure of the data 🧐🔍

In [3]:
data.head()

Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


#### 4. Getting an Overview of the Dataset 📊🔍

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text1   3000 non-null   object
 1   text2   3000 non-null   object
dtypes: object(2)
memory usage: 47.0+ KB


In [5]:
data.dtypes

text1    object
text2    object
dtype: object

#### 5. Checking the shape of the dataset 📏🔍

In [6]:
rows, columns = data.shape
print(f'Rows:{rows} \n Columns:{columns}')

Rows:3000 
 Columns:2


#### 6. Checking Missing Values 🔍❓➡️

In [7]:
data.isnull().sum()

text1    0
text2    0
dtype: int64

#### 7. 🔍✨ Checking for Duplicates

In [8]:
data.duplicated().sum()

1

In [9]:
# Drop the duplicate row and renumber the index so it stays contiguous
data = data.drop_duplicates().reset_index(drop=True)
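
As a small illustration of the duplicate check, the sketch below uses a hypothetical three-row frame (not the real dataset) to show what `duplicated()` counts and how the index looks after `drop_duplicates()`, with and without a reset. The frame and its values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy frame; row 1 repeats row 0 exactly.
toy = pd.DataFrame({
    "text1": ["a b", "a b", "c d"],
    "text2": ["x y", "x y", "z w"],
})

# duplicated() flags rows that repeat an earlier row across all columns.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels.
kept = toy.drop_duplicates()
kept_index = list(kept.index)

# reset_index(drop=True) renumbers the surviving rows from zero.
renumbered_index = list(kept.reset_index(drop=True).index)

print(n_dupes)
print(kept_index)
print(renumbered_index)
```

The later steps in this notebook address rows by position, so a gap in the index labels does not change the scores; a contiguous index mainly avoids surprises with label-based lookups.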


#### 8. 🔽✨ Downloading NLTK Packages

In [10]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KARISHMA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KARISHMA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\KARISHMA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### 9. 🚀✨ Initializing Lemmatizer and Stopwords

In [11]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


#### 10. 📝✨ Text Preprocessing Function

In [12]:
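def preprocess(text):
    """Lowercase, strip non-letters, drop stopwords, and lemmatize a text."""
    text = text.lower()
    # Delete everything except lowercase letters and whitespace
    # (digits and punctuation vanish, so "burn-out" becomes "burnout").
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    # Remove English stopwords, then reduce each remaining word to its lemma.
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)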
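
To make the cleaning step concrete, here is a minimal sketch of what the regular expression does on its own, without the NLTK steps. The sample sentence is a hypothetical input, and the space-replacement variant is an assumption shown only for comparison; it is not what the notebook uses.

```python
import re

# Hypothetical input, loosely modelled on the headlines in the dataset.
sample = "Player burn-out worries; Hearts of Oak 3-2 Cotonsport"
lowered = sample.lower()

# Same pattern as preprocess(): matched characters are deleted,
# so the two halves of a hyphenated word are joined together.
deleted = re.sub(r"[^a-z\s]", "", lowered).split()

# Variant (an assumption, not the notebook's choice): replace matches
# with a space, which keeps the halves as separate tokens.
spaced = re.sub(r"[^a-z\s]", " ", lowered).split()

print(deleted)
print(spaced)
```

With deletion, `burn-out` becomes the single token `burnout`, which matches the preprocessed output shown later in the notebook, and the score `3-2` disappears entirely. With space replacement, `burn` and `out` would stay separate.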

#### 11. 🔄✨ Applying Preprocessing to Data

In [13]:
data['text1'] = data['text1'].apply(preprocess)
data['text2'] = data['text2'].apply(preprocess)

#### 12. 🔢✨ Converting Text to TF-IDF Vectors

In [14]:
vectorizer = TfidfVectorizer()
# Fit on both columns together so text1 and text2 share one vocabulary and one set of IDF weights
tfidf_matrix = vectorizer.fit_transform(data['text1'].tolist() + data['text2'].tolist())


#### 13. 🔢✨ Splitting TF-IDF Matrix into Text1 and Text2 Vectors

In [15]:
tfidf_text1 = tfidf_matrix[:len(data)]
tfidf_text2 = tfidf_matrix[len(data):]
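
The fit-then-split pattern can be checked on a tiny corpus. The sketch below is only an illustration: the two short lists are hypothetical stand-ins for the `text1` and `text2` columns, and the variable names are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

left = ["cat sat mat", "stock market fell"]      # hypothetical text1 column
right = ["cat lay mat", "rain expected today"]   # hypothetical text2 column

# One vectorizer fitted on both lists gives every row the same columns.
toy_vectorizer = TfidfVectorizer()
matrix = toy_vectorizer.fit_transform(left + right)

# The first len(left) rows belong to text1, the rest to text2.
left_vecs = matrix[:len(left)]
right_vecs = matrix[len(left):]

print(matrix.shape)
print(left_vecs.shape, right_vecs.shape)
```

Because both halves come from the same matrix, row `i` of `left_vecs` and row `i` of `right_vecs` live in the same vector space and can be compared directly.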

#### 14. 🔢✨ Computing Cosine Similarity Between Text1 and Text2

In [16]:
# Row i of each matrix belongs to the same pair, so compare them row by row
data['similarity_score'] = [cosine_similarity(tfidf_text1[i], tfidf_text2[i])[0][0] for i in range(len(data))]

print(data[['text1', 'text2', 'similarity_score']].head())

                                               text1  \
0  broadband challenge tv viewing number european...   
1  rap bos arrested drug find rap mogul marion su...   
2  player burnout worry robinson england coach an...   
3  heart oak cotonsport heart oak set ghanaian co...   
4  sir paul rock super bowl crowd sir paul mccart...   

                                               text2  similarity_score  
0  gardener win double glasgow britain jason gard...          0.066590  
1  amnesty chief lament war failure lack public o...          0.004893  
2  hank greeted wintry premiere hollywood star to...          0.010037  
3  redford vision sundance despite sporting cordu...          0.013939  
4  mauresmo open victory la amelie mauresmo maria...          0.023245  
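
The row-by-row loop works, but it calls `cosine_similarity` once per pair. A possible alternative, sketched below on hypothetical toy data, relies on the fact that `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity of two rows equals their dot product. This is an optional variant, not the notebook's method, and the names `fast` and `slow` are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

left = ["cat sat mat", "stock market fell"]      # hypothetical text1 column
right = ["cat lay mat", "rain expected today"]   # hypothetical text2 column

m = TfidfVectorizer().fit_transform(left + right)
a, b = m[:len(left)], m[len(left):]

# Rows are unit length, so the element-wise product summed per row
# is the cosine similarity of each (text1, text2) pair.
fast = np.asarray(a.multiply(b).sum(axis=1)).ravel()

# Reference: the per-pair loop used in the notebook.
slow = np.array([cosine_similarity(a[i], b[i])[0, 0] for i in range(a.shape[0])])

print(np.allclose(fast, slow))
print(fast)
```

TF-IDF weights are non-negative, so these scores already fall in the 0 to 1 range the task asks for. Pairs with no shared vocabulary, like the second toy pair, score exactly 0.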


In [17]:
import pickle

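# Saving the fitted TF-IDF vectorizer for deployment
# (the pickle holds only the vectorizer, not the preprocessing or scoring steps)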
with open("similarity_model.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Saving the predictions
data.to_csv("predicted_similarity_scores.csv", index=False)
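
Since the pickle contains only the vectorizer, scoring a new pair means reloading it and repeating the transform and cosine steps. The sketch below shows one way this might look. It uses an in-memory pickle round trip and a toy vectorizer instead of `similarity_model.pkl`, and `score_pair` is a hypothetical helper, not part of the notebook.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the fitted vectorizer; in the notebook this object
# would be read back from "similarity_model.pkl" with pickle.load.
fitted = TfidfVectorizer().fit(["cat sat mat", "cat lay mat", "stock market fell"])
restored = pickle.loads(pickle.dumps(fitted))  # in-memory round trip

def score_pair(vectorizer, text1, text2):
    """Hypothetical helper: TF-IDF cosine similarity for one pair of texts."""
    vecs = vectorizer.transform([text1, text2])
    return float(cosine_similarity(vecs[0], vecs[1])[0, 0])

same = score_pair(restored, "cat sat mat", "cat sat mat")
different = score_pair(restored, "cat sat", "stock market")

print(same, different)
```

New texts would need the same `preprocess` step before scoring, and words the vectorizer never saw during fitting are ignored by `transform`.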
