# Document Similarity Task


This notebook creates three short documents, preprocesses them, and then compares them pairwise using hand-written implementations of Jaccard similarity, cosine similarity, and Euclidean distance.

## Step 1: Create the Documents
        

In [6]:

# Document Texts with approximately 100 words each
doc1 = '''The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet. The fox is quick and brown. 
He repeatedly does so throughout the day, every day. Observing this, the lazy dog decides it needs to start jumping too. 
However, the dog is not just lazy but also not as agile as the fox. So, it starts with small leaps. This goes on for weeks, 
and the dog gradually increases its efforts, inspired by the energetic fox. Despite many falls, the dog keeps trying, aiming 
to match the fox's agility and speed one day.'''
doc2 = '''The quick brown fox jumps over the lazy cat. This sentence contains all letters of the alphabet. The fox is fast and brown. 
Unlike the dog, the cat is intrigued but unmotivated to join. The fox, on the other hand, continues his routines without 
distraction. Over time, the cat observes and starts to ponder if it could do the same. It's a slow start, but the cat is 
curious and eventually tries to mimic the fox. The attempts are clumsy at first, but with persistence, the cat improves, 
becoming a little more like the fox every day.'''
doc3 = '''Data science involves crunching large datasets to find patterns that are not immediately obvious. It uses statistical methods 
to analyze data and generate useful business insights. This field combines expertise from several areas such as statistics, 
machine learning, and software engineering. Data scientists work on various challenges, including prediction models, 
algorithm design, and data visualization to make data understandable to stakeholders. They also ensure that data 
privacy and security are maintained, given the sensitive nature of the information handled.'''

# Display the documents
print("Document 1:", doc1)
print("Document 2:", doc2)
print("Document 3:", doc3)
    

Document 1: The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet. The fox is quick and brown. 
He repeatedly does so throughout the day, every day. Observing this, the lazy dog decides it needs to start jumping too. 
However, the dog is not just lazy but also not as agile as the fox. So, it starts with small leaps. This goes on for weeks, 
and the dog gradually increases its efforts, inspired by the energetic fox. Despite many falls, the dog keeps trying, aiming 
to match the fox's agility and speed one day.
Document 2: The quick brown fox jumps over the lazy cat. This sentence contains all letters of the alphabet. The fox is fast and brown. 
Unlike the dog, the cat is intrigued but unmotivated to join. The fox, on the other hand, continues his routines without 
distraction. Over time, the cat observes and starts to ponder if it could do the same. It's a slow start, but the cat is 
curious and eventually tries to mimic the fox. The attempts a

In [7]:

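# Shorter document texts; these overwrite the ~100-word versions above
# and are the ones used in the remaining steps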
doc1 = "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet. The fox is quick and brown."
doc2 = "The quick brown fox jumps over the lazy cat. This sentence contains all letters of the alphabet. The fox is fast and brown."
doc3 = "Data science involves crunching large datasets. It uses statistical methods to analyze data and generate useful business insights."

# Display the documents
print("Document 1:", doc1)
print("Document 2:", doc2)
print("Document 3:", doc3)
        

Document 1: The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet. The fox is quick and brown.
Document 2: The quick brown fox jumps over the lazy cat. This sentence contains all letters of the alphabet. The fox is fast and brown.
Document 3: Data science involves crunching large datasets. It uses statistical methods to analyze data and generate useful business insights.


## Step 2: Text Preprocessing

In [8]:

# Function to preprocess text: convert to lowercase, remove punctuation, split into words
import string

def preprocess(text):
    text = text.lower()  # convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    words = text.split()  # split into words
    return words

# Preprocess the documents
doc1_words = preprocess(doc1)
doc2_words = preprocess(doc2)
doc3_words = preprocess(doc3)

# Display preprocessed documents
print("Preprocessed Document 1:", doc1_words)
print("Preprocessed Document 2:", doc2_words)
print("Preprocessed Document 3:", doc3_words)
        

Preprocessed Document 1: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'this', 'sentence', 'contains', 'every', 'letter', 'of', 'the', 'alphabet', 'the', 'fox', 'is', 'quick', 'and', 'brown']
Preprocessed Document 2: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'cat', 'this', 'sentence', 'contains', 'all', 'letters', 'of', 'the', 'alphabet', 'the', 'fox', 'is', 'fast', 'and', 'brown']
Preprocessed Document 3: ['data', 'science', 'involves', 'crunching', 'large', 'datasets', 'it', 'uses', 'statistical', 'methods', 'to', 'analyze', 'data', 'and', 'generate', 'useful', 'business', 'insights']
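
One side effect of stripping every character in `string.punctuation` is that apostrophes disappear too, so contractions and possessives collapse into new tokens. The sketch below uses the same logic as `preprocess` above; the sample sentence is made up for illustration (it borrows phrases from the longer versions of the documents) and is not part of the notebook's data.

```python
import string

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Hypothetical sample sentence containing a contraction and a possessive
sample = "It's a slow start, but the cat tries to match the fox's agility."
tokens = preprocess(sample)
print(tokens)
# ['its', 'a', 'slow', 'start', 'but', 'the', 'cat', 'tries', 'to', 'match', 'the', 'foxs', 'agility']
```

Here "It's" becomes `its` and "fox's" becomes `foxs`, so neither matches `it` or `fox` in another document. For the short documents used in this notebook that does not matter, but it may be worth keeping in mind for texts with many contractions.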


## Step 3: Compare Similarity
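
Three measures are used in this step. For reference, their standard definitions are given below, where $X$ and $Y$ are the sets of distinct words in two documents and $x_w$, $y_w$ are the counts of word $w$ in each document (this notation is introduced here for convenience and does not appear in the notebook code):

$$
J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}, \qquad
\cos(x, y) = \frac{\sum_w x_w \, y_w}{\sqrt{\sum_w x_w^2} \, \sqrt{\sum_w y_w^2}}, \qquad
d(x, y) = \sqrt{\sum_w (x_w - y_w)^2}
$$

Jaccard and cosine are similarities (higher means more alike), while Euclidean is a distance (lower means more alike).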

In [9]:

# Function to compute Jaccard similarity
def jaccard_similarity(docx, docy):
    set_x = set(docx)
    set_y = set(docy)
    intersection = set_x.intersection(set_y)
    union = set_x.union(set_y)
    return len(intersection) / len(union)

# Function to compute Euclidean distance
def euclidean_distance(docx, docy):
    word_set = set(docx) | set(docy)
    freqx = {word: docx.count(word) for word in word_set}
    freqy = {word: docy.count(word) for word in word_set}
    distance = math.sqrt(sum((freqx.get(word, 0) - freqy.get(word, 0))**2 for word in word_set))
    return distance

# Function to compute Cosine similarity
def cosine_similarity(docx, docy):
    word_set = set(docx) | set(docy)
    freqx = {word: docx.count(word) for word in word_set}
    freqy = {word: docy.count(word) for word in word_set}
    dot_product = sum(freqx.get(word, 0) * freqy.get(word, 0) for word in word_set)
    magnitude_x = math.sqrt(sum(freqx.get(word, 0)**2 for word in word_set))
    magnitude_y = math.sqrt(sum(freqy.get(word, 0)**2 for word in word_set))
    similarity = dot_product / (magnitude_x * magnitude_y)
    return similarity



# Calculate Jaccard similarity for each pair of documents
similarity_12 = jaccard_similarity(doc1_words, doc2_words)
similarity_13 = jaccard_similarity(doc1_words, doc3_words)
similarity_23 = jaccard_similarity(doc2_words, doc3_words)

# Display the similarities
print("Similarity between Document 1 and 2:", similarity_12)
print("Similarity between Document 1 and 3:", similarity_13)
print("Similarity between Document 2 and 3:", similarity_23)
        

Similarity between Document 1 and 2: 0.6666666666666666
Similarity between Document 1 and 3: 0.030303030303030304
Similarity between Document 2 and 3: 0.029411764705882353
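
To see where the 0.6667 for Documents 1 and 2 comes from, it can help to look at the word sets themselves. The sketch below repeats the preprocessing for those two documents and prints the sizes of the intersection and union along with the words that differ; the variable names (`set1`, `shared`, `combined`) are illustrative only.

```python
import string

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

doc1 = "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet. The fox is quick and brown."
doc2 = "The quick brown fox jumps over the lazy cat. This sentence contains all letters of the alphabet. The fox is fast and brown."

set1, set2 = set(preprocess(doc1)), set(preprocess(doc2))
shared = set1 & set2      # words that appear in both documents
combined = set1 | set2    # words that appear in either document

print("Shared words:", len(shared))                 # 14
print("Distinct words overall:", len(combined))     # 21
print("Only in Document 1:", sorted(set1 - set2))   # ['dog', 'every', 'letter']
print("Only in Document 2:", sorted(set2 - set1))   # ['all', 'cat', 'fast', 'letters']

jaccard = len(shared) / len(combined)
print("Jaccard:", round(jaccard, 4))                # 0.6667
```

Because Jaccard works on sets, the repeated "quick" in Document 1 has no effect on the score; only the seven words unique to one document or the other pull it below 1.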


In [12]:

import math  # euclidean_distance and cosine_similarity call math.sqrt

preprocessed_docs = [preprocess(doc) for doc in [doc1, doc2, doc3]]
n = len(preprocessed_docs)

def pairwise(metric):
    """Return the n x n matrix of metric values for every pair of documents."""
    return [[metric(preprocessed_docs[i], preprocessed_docs[j]) for j in range(n)] for i in range(n)]

def print_matrix(title, matrix):
    """Print a title followed by each row, rounded to 4 decimals for readability."""
    print(title)
    for row in matrix:
        print([round(value, 4) for value in row])

# Calculate similarities
jaccard_similarities = pairwise(jaccard_similarity)
cosine_similarities = pairwise(cosine_similarity)
euclidean_distances = pairwise(euclidean_distance)

# Display the results
print_matrix("Jaccard Similarities:", jaccard_similarities)
print_matrix("\nCosine Similarities:", cosine_similarities)
print_matrix("\nEuclidean Distances:", euclidean_distances)
        

Jaccard Similarities:
[1.0, 0.6667, 0.0303]
[0.6667, 1.0, 0.0294]
[0.0303, 0.0294, 1.0]

Cosine Similarities:
[1.0, 0.9003, 0.0349]
[0.9003, 1.0, 0.0358]
[0.0349, 0.0358, 1.0]

Euclidean Distances:
[0.0, 2.8284, 7.6811]
[2.8284, 0.0, 7.5498]
[7.6811, 7.5498, 0.0]
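
The cosine and Euclidean helpers above call `list.count` once per vocabulary word, which rescans the document each time. One possible alternative, sketched here with `collections.Counter`, counts each document once. The function names `cosine_counter` and `euclidean_counter` and the two toy word lists are invented for this example and are not part of the notebook. Returning 0.0 for an empty document is also an assumption, since the original functions would raise a `ZeroDivisionError` in that case.

```python
import math
from collections import Counter

def cosine_counter(words_x, words_y):
    """Cosine similarity between two token lists, using word counts as vectors."""
    cx, cy = Counter(words_x), Counter(words_y)
    # Only words present in both documents contribute to the dot product
    dot = sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())
    norm_x = math.sqrt(sum(v * v for v in cx.values()))
    norm_y = math.sqrt(sum(v * v for v in cy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty document is treated as dissimilar to everything
    return dot / (norm_x * norm_y)

def euclidean_counter(words_x, words_y):
    """Euclidean distance between two token lists, using word counts as vectors."""
    cx, cy = Counter(words_x), Counter(words_y)
    # Counter returns 0 for missing words, so the union covers both vocabularies
    return math.sqrt(sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys()))

# Hypothetical toy inputs, already tokenized
a = "the fox is quick and the fox is brown".split()
b = "the cat is fast and the cat is brown".split()

print(round(cosine_counter(a, b), 4))     # 0.6667
print(round(euclidean_counter(a, b), 4))  # 3.1623
```

For documents this small the difference in running time is negligible; the main benefit is that each document is scanned once rather than once per vocabulary word.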