## String matching

This Python code defines a function called simple_string_match that checks if one string (text2) is contained within another string (text1). The function uses the in operator to perform this check, returning True if text2 is found in text1, and False otherwise. The function includes a docstring explaining its purpose, parameters, and return value.
The code then demonstrates how to use this function with an example. It creates two variables: text1 containing a full sentence, and text2 containing the word "example". It calls the simple_string_match function with these variables and uses an if-else statement to print whether the substring was found or not. This example illustrates basic string manipulation, function definition and usage, and conditional statements in Python.

In [3]:
def simple_string_match(text1, text2):
    """
    This function checks for the presence of a substring (text2) within a larger text (text1).

    Args:
        text1: The larger string to search within.
        text2: The substring to search for.

    Returns:
        True if the substring is found, False otherwise.
    """
    # The in operator already evaluates to a bool, so return it directly
    return text2 in text1

# Usage:
text1 = "This is an example sentence."
text2 = "example"

if simple_string_match(text1, text2):
    print("Substring found!")
else:
    print("Substring not found.")

Substring found!
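
The in operator compares characters exactly, so the match above is case-sensitive. If case should be ignored, one option is to normalize both strings before comparing. The following is a minimal sketch, assuming a hypothetical helper named case_insensitive_match that is not part of the original code:

```python
def case_insensitive_match(text1, text2):
    """Hypothetical helper: substring check that ignores letter case."""
    # casefold() is a more aggressive lower() intended for caseless matching
    return text2.casefold() in text1.casefold()

text1 = "This is an example sentence."

print("Example" in text1)                        # False: 'E' does not equal 'e'
print(case_insensitive_match(text1, "Example"))  # True
```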


# Distance metrics

This Python code defines a function called euclidean_distance that calculates the Euclidean distance between two text documents. It uses the CountVectorizer from scikit-learn to convert the text into numerical vectors based on word frequencies, and then applies the Euclidean distance formula from SciPy's spatial module. The function takes two text strings as input and returns a float representing their distance.
The code then demonstrates the function's usage with an example. It creates two identical text strings and calculates their Euclidean distance. The result is 0.0, because identical texts produce identical word-count vectors.


The Euclidean distance calculates the straight-line distance between two points in Euclidean space. It works by applying the Pythagorean theorem to a right triangle formed by the two points and their projections onto the axes.

Here's how it works in two dimensions:

- Consider two points:  Let's say we have point A with coordinates (x1, y1) and point B with coordinates (x2, y2).

- Form a right triangle: Imagine drawing a horizontal line from A and a vertical line from B. These lines intersect at a point C with coordinates (x2, y1), where they form a right angle.

- Calculate side lengths: The length of AC is the absolute difference in x-coordinates, |x2 - x1|, and the length of BC is the absolute difference in y-coordinates, |y2 - y1|.

- Apply Pythagorean theorem:  The Euclidean distance between A and B is the hypotenuse of the triangle (AB).  Using the Pythagorean theorem (a² + b² = c²), we can calculate it as: distance AB = √[(x2 - x1)² + (y2 - y1)²]

In higher dimensions, the Euclidean distance is simply an extension of this concept, incorporating the differences in additional coordinates (e.g., z-coordinates in three dimensions).
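
The formula can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the function name euclidean is chosen here for illustration and the sample points are made up:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two dimensions: A = (1, 2), B = (4, 6), so sqrt(3**2 + 4**2) = 5.0
print(euclidean((1, 2), (4, 6)))

# Three dimensions: the z-difference is one more squared term, sqrt(1 + 4 + 4) = 3.0
print(euclidean((0, 0, 0), (1, 2, 2)))
```

In Python 3.8 and later, math.dist performs the same calculation.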

**Analogy**

Imagine a map of a park with two points marked: your starting position and the location of a hidden treasure. The Euclidean distance is the shortest distance between those two points, as if you could walk in a straight line through everything in the park (trees, ponds, etc.).

It's like using a ruler to find the length of that straight path, ignoring any obstacles or detours you might have to take in reality.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

def euclidean_distance(text1, text2):
    """
    Calculates the Euclidean distance between two text documents.

    Args:
        text1: The first text document (string).
        text2: The second text document (string).

    Returns:
        The Euclidean distance between the documents (float).
    """
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform([text1, text2])

    # Convert sparse vectors to dense arrays and flatten them
    vector1 = vectors[0].toarray().flatten()
    vector2 = vectors[1].toarray().flatten()

    return distance.euclidean(vector1, vector2)

# Usage:
text1 = "This is an example sentence."
text2 = "This is an example sentence."

dist = euclidean_distance(text1, text2)
print(f"Euclidean distance: {dist}")



Euclidean distance: 0.0
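
Identical texts always give 0.0, so it can help to see a non-zero case. The sketch below reproduces the idea without scikit-learn, using collections.Counter for the word counts. The helper name bow_euclidean_distance is hypothetical, and the lowercase whitespace tokenization is an assumption that is simpler than CountVectorizer's default tokenizer, so results may differ on text with punctuation:

```python
import math
from collections import Counter

def bow_euclidean_distance(text1, text2):
    """Hypothetical helper: Euclidean distance between word-count vectors."""
    # Assumption: lowercase whitespace tokenization
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())
    vocabulary = set(counts1) | set(counts2)
    # A Counter returns 0 for words it has not seen
    return math.sqrt(sum((counts1[w] - counts2[w]) ** 2 for w in vocabulary))

print(bow_euclidean_distance("the cat sat", "the cat sat"))  # 0.0
# "cat" and "dog" each differ by one count, so the distance is sqrt(2)
print(bow_euclidean_distance("the cat sat", "the dog sat"))
```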


# Word embedding

This Python code demonstrates the use of Word2Vec, a popular word embedding technique, using the Gensim library. It starts by defining a small dataset of tokenized sentences. The Word2Vec model is then trained on these sentences, creating vector representations for each word in the vocabulary.
After training, the code retrieves the vector representations for two specific words: "sentence" and "cats". It then calculates the cosine similarity between these two word vectors using the model's built-in similarity function. Finally, it prints the vector representations of both words and their similarity score.


**Cosine similarity** is a way to measure how similar two things are, especially when those things are represented as vectors (lists of numbers).  Imagine each vector as an arrow pointing in a certain direction in space. Cosine similarity focuses on the angle between these two arrows, rather than their lengths.

Here's how it works:

1. **Represent your items as vectors:**
   Each item (e.g., a document, a word, an image) is converted into a vector, where each number in the vector represents a specific feature or attribute. For example, in text analysis, each number might represent the frequency of a particular word.

2. **Calculate the dot product:**
   The dot product is a mathematical operation that multiplies corresponding elements of the two vectors and then sums up the results. It gives you an idea of how much the two vectors "overlap" in terms of their direction.

3. **Calculate the magnitudes:**
   The magnitude of a vector is its length, calculated using the Pythagorean theorem (square root of the sum of squares of its components).

4. **Find the cosine of the angle:**
   Cosine similarity is the dot product of the two vectors divided by the product of their magnitudes.  This value corresponds to the cosine of the angle between the two vectors.

The resulting cosine similarity value ranges from -1 to 1:

* **1:** The vectors point in exactly the same direction (the angle between them is 0 degrees). This means the items are very similar.
* **0:** The vectors are perpendicular (the angle between them is 90 degrees). This means the items share nothing along the measured features, so they are unrelated rather than opposite.
* **-1:** The vectors point in opposite directions (the angle between them is 180 degrees). This means the items are very dissimilar in an opposing way.

**Analogy:**

Imagine two flashlights shining on a wall. If they are shining in the same direction, the light overlaps completely (cosine similarity = 1). If they are shining at a right angle, there's no overlap (cosine similarity = 0). If they are shining in opposite directions, the light is completely separate (cosine similarity = -1).
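
The four steps above can be sketched in plain Python. This is an illustrative implementation using only the standard library, not the code Gensim uses internally; the function name cosine_similarity and the sample vectors are chosen here for demonstration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))      # step 2: dot product
    norm_u = math.sqrt(sum(a * a for a in u))   # step 3: magnitudes
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)              # step 4: cosine of the angle

print(cosine_similarity([1, 2], [2, 4]))   # same direction: approximately 1
print(cosine_similarity([1, 0], [0, 1]))   # perpendicular: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite directions: -1.0
```

Because the result depends only on direction, [1, 2] and [2, 4] score as (almost exactly) identical even though one vector is twice as long as the other.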


In [17]:
from gensim.models import Word2Vec

# Sample sentences (your actual dataset would be much larger)
sentences = [
    ["This", "is", "an", "example", "sentence"],
    ["This", "is", "another", "sentence"],
    ["Yet", "another", "sentence", "about", "cats"],
    ["Cats", "are", "awesome", "pets"]
]

# Train the Word2Vec model
model = Word2Vec(sentences, min_count=1)  # Train on all words (min_count=1)

# Get word vectors
word1 = "sentence"
word2 = "cats"

vector1 = model.wv[word1]
vector2 = model.wv[word2]

# Calculate similarity (cosine similarity is common for word embeddings)
similarity = model.wv.similarity(word1, word2)

print(f"Vector for '{word1}': {vector1}")
print(f"Vector for '{word2}': {vector2}")
print(f"Similarity between '{word1}' and '{word2}': {similarity}")


Vector for 'sentence': [-5.3624064e-04  2.3643726e-04  5.1034773e-03  9.0094982e-03
 -9.3031824e-03 -7.1169869e-03  6.4590340e-03  8.9732129e-03
 -5.0155534e-03 -3.7634657e-03  7.3806890e-03 -1.5335097e-03
 -4.5367270e-03  6.5542157e-03 -4.8602819e-03 -1.8160631e-03
  2.8766517e-03  9.9189859e-04 -8.2854219e-03 -9.4490545e-03
  7.3119490e-03  5.0703888e-03  6.7578624e-03  7.6288462e-04
  6.3510491e-03 -3.4054511e-03 -9.4642502e-04  5.7687177e-03
 -7.5218258e-03 -3.9362018e-03 -7.5117699e-03 -9.3006546e-04
  9.5383571e-03 -7.3193498e-03 -2.3338271e-03 -1.9377895e-03
  8.0776392e-03 -5.9310440e-03  4.5163568e-05 -4.7538527e-03
 -9.6037909e-03  5.0074183e-03 -8.7598041e-03 -4.3919352e-03
 -3.5100860e-05 -2.9618884e-04 -7.6614316e-03  9.6149836e-03
  4.9821823e-03  9.2333741e-03 -8.1581213e-03  4.4959104e-03
 -4.1371793e-03  8.2455669e-04  8.4988326e-03 -4.4622882e-03
  4.5176134e-03 -6.7871297e-03 -3.5485774e-03  9.3987426e-03
 -1.5776921e-03  3.2137960e-04 -4.1407333e-03 -7.6828799e-03
 

## Pretrained word embedding

This Python code demonstrates the use of a pre-trained Word2Vec model from the Gensim library. It loads the 'word2vec-google-news-300' model, which was trained on Google News data and produces 300-dimensional word vectors. The code then defines a list of target words and uses the model to find the three most similar words for each target word based on their vector representations.
For each target word, the code prints the top three most similar words along with their similarity scores. This similarity is based on the cosine similarity between word vectors in the embedding space.

In [16]:
import gensim.downloader as api

# Load a pre-trained Word2Vec model
model = api.load('word2vec-google-news-300')

# Target words
target_words = ["king", "queen", "man", "woman", "cat", "dog"]

# Find similar words for each target word
for word in target_words:
    similar_words = model.most_similar(word, topn=3)  # Get top 3 most similar words

    print(f"\nWords most similar to '{word}':")
    for similar_word, similarity in similar_words:
        print(f"- {similar_word}: {similarity:.3f}")



Words most similar to 'king':
- kings: 0.714
- queen: 0.651
- monarch: 0.641

Words most similar to 'queen':
- queens: 0.740
- princess: 0.707
- king: 0.651

Words most similar to 'man':
- woman: 0.766
- boy: 0.682
- teenager: 0.659

Words most similar to 'woman':
- man: 0.766
- girl: 0.749
- teenage_girl: 0.734

Words most similar to 'cat':
- cats: 0.810
- dog: 0.761
- kitten: 0.746

Words most similar to 'dog':
- dogs: 0.868
- puppy: 0.811
- pit_bull: 0.780


# N-Grams

This Python code demonstrates the creation of n-grams, specifically bigrams and trigrams, from a given text string. The code starts by defining a sample sentence. It then splits this sentence into individual words and creates bigrams by pairing each word with the next word in the sequence. This process is repeated to create trigrams, where each group consists of three consecutive words.
The code then prints out all the generated bigrams and trigrams. This technique is commonly used in natural language processing for tasks such as language modeling, text generation, and feature extraction. N-grams capture local word order and can provide insights into word associations and patterns within the text.

In [18]:
text = "the quick brown fox jumps over the lazy dog"

# Create bigrams (2-grams)
bigrams = []
words = text.split()
for i in range(len(words)-1):
    bigram = (words[i], words[i+1])  # Pair current word with the next
    bigrams.append(bigram)

print("Bigrams:")
for bigram in bigrams:
    print(bigram)

# Create trigrams (3-grams) – just expand the window
trigrams = []
for i in range(len(words)-2):
    trigram = (words[i], words[i+1], words[i+2])
    trigrams.append(trigram)

print("\nTrigrams:")
for trigram in trigrams:
    print(trigram)


Bigrams:
('the', 'quick')
('quick', 'brown')
('brown', 'fox')
('fox', 'jumps')
('jumps', 'over')
('over', 'the')
('the', 'lazy')
('lazy', 'dog')

Trigrams:
('the', 'quick', 'brown')
('quick', 'brown', 'fox')
('brown', 'fox', 'jumps')
('fox', 'jumps', 'over')
('jumps', 'over', 'the')
('over', 'the', 'lazy')
('the', 'lazy', 'dog')


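N-grams can also be used to compare two texts. The following code defines a function, generate_ngrams, that builds the n-grams of a text. It converts the bigrams of two sentences to sets, finds the bigrams they share, and computes a similarity score as the number of shared bigrams divided by the size of the larger bigram set.

In [35]: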
def generate_ngrams(text, n):
    """Generates n-grams (tuples of n consecutive words) from a given text."""
    words = text.split()  # Split the text into words
    return zip(*[words[i:] for i in range(n)])  # Generate n-grams

# Two texts to compare
text1 = "This is an original sentence."
text2 = "This is a partly reused sentence."

# Generate bigrams (2-grams) and convert them to sets for easy comparison
bigrams1 = set(generate_ngrams(text1, 2))
bigrams2 = set(generate_ngrams(text2, 2))

# Find common bigrams
common_bigrams = bigrams1.intersection(bigrams2)

print("Common Bigrams:")  # Display common bigrams
for bigram in common_bigrams:
    print(bigram)

# Calculate a simple similarity score based on shared bigrams
similarity_score = len(common_bigrams) / max(len(bigrams1), len(bigrams2))
print(f"\nSimilarity Score: {similarity_score:.2f}")



Common Bigrams:
('This', 'is')

Similarity Score: 0.20
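
The score above divides by the size of the larger bigram set. A common alternative is the Jaccard index, which divides the number of shared items by the number of distinct items across both sets. The sketch below swaps in that measure for comparison; jaccard_similarity is a hypothetical helper, not part of the original code, and generate_ngrams is repeated so the block runs on its own:

```python
def generate_ngrams(text, n):
    """Generates n-grams (tuples of n consecutive words) from a given text."""
    words = text.split()
    return zip(*[words[i:] for i in range(n)])

def jaccard_similarity(set1, set2):
    """Hypothetical helper: shared items divided by all distinct items."""
    union = set1 | set2
    if not union:
        return 0.0  # Assumption: treat two empty sets as having no overlap
    return len(set1 & set2) / len(union)

bigrams1 = set(generate_ngrams("This is an original sentence.", 2))
bigrams2 = set(generate_ngrams("This is a partly reused sentence.", 2))

# 1 shared bigram out of 8 distinct bigrams
print(jaccard_similarity(bigrams1, bigrams2))  # 0.125
```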


# Hashing

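This Python code demonstrates the use of hashing for text comparison. It defines a function hash_text that calculates a hash value for a given string using Python's built-in hash() function. The code then applies this function to three text strings: an original text, a plagiarized (identical) text, and a slightly modified text. Identical strings always produce the same hash within one run, while changing a single word produces a different hash. Python randomizes string hashes per process unless PYTHONHASHSEED is set, so the printed values differ from run to run. Equal hashes indicate a likely exact copy rather than a proof, because two different strings can in rare cases share a hash value.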


In [38]:
def hash_text(text):
    """Calculates the hash value of a given text."""
    return hash(text)

# Example usage
original_text = "The quick brown fox jumps over the lazy dog."
plagiarized_text = "The quick brown fox jumps over the lazy dog."
modified_text = "The quick brown fox jumps over the lazy cat."

hash_original = hash_text(original_text)
hash_plagiarized = hash_text(plagiarized_text)
hash_modified = hash_text(modified_text)

print(f"Hash of Original Text (check the length): {hash_original}")
print(f"Hash of Plagiarized Text (check the length): {hash_plagiarized}")
print(f"Hash of Modified Text (check the length): {hash_modified}")

# Compare hashes
if hash_original == hash_plagiarized:
    print("Plagiarized Text is an exact copy of the Original Text.")
else:
    print("Plagiarized Text is NOT an exact copy of the Original Text.")

if hash_original == hash_modified:
    print("Modified Text is an exact copy of the Original Text.")
else:
    print("Modified Text is NOT an exact copy of the Original Text.")



Hash of Original Text (check the length): 6692867180466064543
Hash of Plagiarized Text (check the length): 6692867180466064543
Hash of Modified Text (check the length): 4220578042092582446
Plagiarized Text is an exact copy of the Original Text.
Modified Text is NOT an exact copy of the Original Text.
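
Where a hash must stay the same across runs and machines, for example when it is stored for later comparison, a digest from the standard hashlib module is a common choice. The following is a minimal sketch; stable_hash_text is a hypothetical helper, and the choice of SHA-256 and UTF-8 encoding is an assumption made for illustration:

```python
import hashlib

def stable_hash_text(text):
    """Hypothetical helper: SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original_text = "The quick brown fox jumps over the lazy dog."
plagiarized_text = "The quick brown fox jumps over the lazy dog."
modified_text = "The quick brown fox jumps over the lazy cat."

print(stable_hash_text(original_text) == stable_hash_text(plagiarized_text))  # True
print(stable_hash_text(original_text) == stable_hash_text(modified_text))     # False
print(len(stable_hash_text(original_text)))  # 64 hex characters
```

As with hash(), this only detects exact copies: any change to the text, however small, produces an unrelated digest.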
