<a href="https://colab.research.google.com/github/Raj-kumarpatidar/Text-Similarity/blob/main/Text_Similarity_in_python_with_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the spaCy library.\
spaCy is a popular, easy-to-use natural language processing library for Python. It helps build applications that process and extract insights from large volumes of text. It can be used for tasks such as information extraction, natural language understanding, and preparing text for deep learning.

In [None]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


spaCy (Natural Language Processing Library):\
spaCy is a popular Python library for natural language processing (NLP). It is used for NLP tasks such as tokenization, part-of-speech tagging, and named entity recognition. spaCy's medium and large models also provide word vectors, which are useful for many NLP applications.\
scikit-learn (Machine Learning Library):\
scikit-learn is a widely used library for machine learning in Python. It includes tools for tasks such as classification, regression, clustering, and dimensionality reduction. Here I use scikit-learn for text analysis.\
TfidfVectorizer (Term Frequency-Inverse Document Frequency Vectorizer):\
TfidfVectorizer is a feature extraction technique used in natural language processing. It converts a collection of raw documents (text) into a matrix of TF-IDF features. This allows me to represent text data numerically, where each document's words are weighted based on their importance in the document and across the entire corpus. It's often used for text classification and clustering tasks.\
cosine_similarity:\
cosine_similarity is a function in scikit-learn that calculates the cosine similarity between two or more vectors. Here, I use it to measure the similarity between text documents. It is commonly used for tasks like finding similar documents, clustering related documents, or building recommendation systems.
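
To make the measure concrete, here is a minimal pure-Python sketch of the cosine similarity formula (dot product divided by the product of the vector lengths). The helper name `cosine` and the sample vectors are illustrative only; scikit-learn's cosine_similarity works on whole matrices, but the idea per pair of rows is the same.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score (approximately) 1,
# regardless of their length.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])

# Vectors with no shared direction score 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Because only the angle matters, a long document and a short document that use words in the same proportions still come out as highly similar.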

The typical workflow for using these libraries in text analysis might involve the following steps:

Data Preprocessing:

Use spaCy for text preprocessing tasks like tokenization, lemmatization, and removing stop words. (spaCy provides lemmatization rather than stemming.)
Prepare your text data by cleaning and structuring it for analysis.

Feature Extraction:

Use TfidfVectorizer to convert the text data into TF-IDF feature vectors. This step transforms the text data into a format suitable for machine learning.

Cosine Similarity Calculation:

Calculate the cosine similarity between pairs of documents using the cosine_similarity function. The result will be a similarity score indicating how closely related the documents are.



In [None]:
nlp = spacy.load("en_core_web_sm")

Loading the "en_core_web_sm" model with spacy.load() creates a language object named nlp. This object bundles pre-trained processing components, including a tokenizer, part-of-speech tagger, dependency parser, lemmatizer, and named entity recognizer, and lets you process and analyze text with them. Note that the small model does not ship with static word vectors; the medium (en_core_web_md) and large (en_core_web_lg) models do.

"en_core_web_sm" is the name of a specific pre-trained spaCy model: English (en), general-purpose pipeline (core), trained on web text (web), small size (sm).



In [None]:
def preprocess(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)


After nlp processes the text, the list comprehension builds a list called tokens. It iterates over each token in the processed doc and applies the following:

token.text.lower() returns the lowercase form of the token. This standardizes the text and makes later comparisons case-insensitive.
not token.is_stop keeps only tokens that are not stop words. Stop words are very common words like "the," "is," "and" that are often removed in text preprocessing because they typically add little meaning to the text.
not token.is_punct keeps only tokens that are not punctuation marks. Punctuation marks like periods, commas, and exclamation marks are often removed during preprocessing.
So, the comprehension produces a list of lowercase words from the input text, excluding stop words and punctuation marks.
Finally, the tokens list is joined into a single string with the words separated by spaces, giving a preprocessed version of the text.
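
The filtering can be tried on a single sentence. The sketch below is an illustration under one assumption: it uses a blank English pipeline (spacy.blank("en")) instead of "en_core_web_sm", on the basis that is_stop and is_punct are lexical attributes that do not require the trained model. The names blank_nlp and preprocess_demo are hypothetical and not part of the notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because is_stop and is_punct come from the language's lexical data.
blank_nlp = spacy.blank("en")

def preprocess_demo(text):
    """Same filtering as preprocess(), but on the blank pipeline."""
    doc = blank_nlp(text)
    kept = [t.text.lower() for t in doc if not t.is_stop and not t.is_punct]
    return " ".join(kept)

result = preprocess_demo("This is the first document.")
print(result)
```

Printing the result for each input is a quick way to see how much text survives stop-word removal; very short documents can shrink to one or two words.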

In [None]:
def calculate_cosine_similarity(documents):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return similarity_matrix


A TF-IDF vectorizer is created using TfidfVectorizer and fitted to the input documents with the fit_transform method. TF-IDF converts text data into numerical vectors, where each vector represents a document and each element reflects the importance of a word in that document relative to the entire collection. It emphasizes words that are distinctive for a document while downweighting words that appear in many documents.\
The cosine_similarity function from scikit-learn then calculates the similarity between the documents. It takes the TF-IDF matrix as input and computes the pairwise cosine similarity between the rows of this matrix.\
In short, this function takes a list of text documents, converts them into TF-IDF vectors, and calculates the cosine similarity between these vectors. The resulting similarity matrix measures how similar each pair of documents is, which is useful for text analysis tasks such as document retrieval, clustering, or recommendation systems.
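
The following sketch inspects the intermediate TF-IDF matrix on three made-up documents (the toy strings are not from the notebook). It relies on TfidfVectorizer's default settings, under which each row is scaled to unit length, so the cosine similarity of two rows is simply their dot product; linear_kernel is used here only to check that.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents, invented for illustration.
docs = ["apple banana", "banana cherry", "apple banana cherry"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Length of each row vector; with the default settings these should all be 1.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

sims = cosine_similarity(tfidf)  # pairwise cosine similarity
dots = linear_kernel(tfidf)      # pairwise dot products

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
print(np.allclose(sims, dots))
```

The matrix has one column per distinct term, so its width grows with the vocabulary of the corpus, not with the number of documents.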

In [None]:
def calculate_word_embedding_similarity(doc1, doc2):
    tokens1 = nlp(doc1)
    tokens2 = nlp(doc2)

    vec1 = tokens1.vector
    vec2 = tokens2.vector

    similarity = cosine_similarity([vec1], [vec2])[0][0]
    return similarity


This cell defines a function named calculate_word_embedding_similarity that takes two arguments, doc1 and doc2. Both should be strings holding the two documents to compare.\
The input documents are first processed with the nlp object, the spaCy language model. This step tokenizes the text and performs various linguistic analyses.\
The code then reads the vector attribute of each processed document. In spaCy, this attribute provides a dense vector representation of the entire document, by default an average of its token vectors. Because "en_core_web_sm" does not include static word vectors, the values here come from the pipeline's context-sensitive tensors; a model such as en_core_web_md or en_core_web_lg is the usual choice when real word embeddings are needed.\
With the two vectors obtained, the code calculates the cosine similarity between them using the cosine_similarity function from scikit-learn. It compares vec1 and vec2 to determine how similar the two documents are.\
In short, this function uses spaCy's document vectors to calculate the cosine similarity between two documents. Unlike TF-IDF, it can score documents as similar even when they share no exact words, provided the underlying vectors capture word meaning. This is useful for NLP tasks such as document similarity, recommendation systems, or information retrieval.
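
To illustrate the averaging idea without downloading a model, here is a sketch with hypothetical three-dimensional embeddings. The toy_vectors values are invented for illustration, and doc_vector only mimics the "average of the token vectors" default; it is not spaCy's implementation.

```python
import numpy as np

# Hypothetical embeddings, made up so that "cat" and "dog" point in a
# similar direction and "car" points elsewhere.
toy_vectors = {
    "cat": np.array([1.0, 0.1, 0.0]),
    "dog": np.array([0.9, 0.2, 0.0]),
    "sat": np.array([0.5, 0.5, 0.0]),
    "car": np.array([0.0, 0.1, 1.0]),
}

def doc_vector(tokens):
    """Document vector as the mean of its token vectors."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(doc_vector(["cat", "sat"]), doc_vector(["dog", "sat"]))
sim_unrelated = cosine(doc_vector(["cat", "sat"]), doc_vector(["car"]))

print(sim_related, sim_unrelated)
```

The two "animal" documents share only one literal word, yet their averaged vectors are nearly parallel, which is the behavior that word-level matching such as TF-IDF cannot provide.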

In [None]:
doc1 = "This is the first document."
doc2 = "Here is the second document."
doc3 = "A different document, unrelated to the others."

# doc1="raj"
# doc2="ravina"
# doc3="ravi"

documents = [doc1, doc2, doc3]

# Preprocess the documents
processed_documents = [preprocess(doc) for doc in documents]

# Calculate cosine similarity
similarity_matrix = calculate_cosine_similarity(processed_documents)

print("Cosine Similarity Matrix:")
print(similarity_matrix)


Cosine Similarity Matrix:
[[1.         0.50854232 0.38537163]
 [0.50854232 1.         0.19597778]
 [0.38537163 0.19597778 1.        ]]


You have three documents (doc1, doc2, and doc3) stored as strings.

You create a list called documents that contains these three documents.

You then preprocess these documents using the preprocess function. The preprocessed documents are stored in the processed_documents list.

Next, you calculate the cosine similarity between the preprocessed documents using the calculate_cosine_similarity function. The result is stored in the similarity_matrix.

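Finally, you print the cosine similarity matrix, which holds the pairwise similarity between the documents. The entry in row i and column j is the similarity between document i and document j, so the diagonal is 1 (each document compared with itself) and the matrix is symmetric. In the output above, doc1 and doc2 are the most similar pair (about 0.51) and doc2 and doc3 the least similar (about 0.20).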
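
The printed values can be reproduced by hand, which shows what TfidfVectorizer does internally. This sketch rests on an assumption: the preprocessed strings below are inferred from the printed matrix and are not shown in the notebook. The IDF formula is the smoothed variant that scikit-learn documents as its default; the helper names are hypothetical.

```python
import math
from collections import Counter

# Assumed output of preprocess() for doc1, doc2, doc3 (inferred, not shown above).
processed = ["document", "second document", "different document unrelated"]

tokenized = [text.split() for text in processed]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_unit_vector(tokens):
    """TF-IDF weights for one document, scaled to unit length."""
    weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_unit_vector(tokens) for tokens in tokenized]

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# For unit-length vectors, the dot product is the cosine similarity.
matrix = [[dot(u, v) for v in vectors] for u in vectors]

for row in matrix:
    print([round(x, 4) for x in row])
```

Under that assumption the off-diagonal entries agree with the printed matrix to four decimals. The only word all three documents share is "document", which is why every pair has a non-zero similarity even though doc3 is described as unrelated.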