<a href="https://colab.research.google.com/github/AiMl-hub/Gists/blob/main/text_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

In [9]:
import nltk
# nltk.download('punkt')  # Download the tokenizer models (recent NLTK releases may need 'punkt_tab' instead)


text = "Hello, world! This is a simple sentence. Tokenization is fun."

# Word Tokenization
words = nltk.word_tokenize(text, language='english')
print("\nWord Tokenization:")
print(words)

# Sentence Tokenization
sentences = nltk.sent_tokenize(text)
print("\nSentence Tokenization:")
print(sentences)


Word Tokenization:
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'simple', 'sentence', '.', 'Tokenization', 'is', 'fun', '.']

Sentence Tokenization:
['Hello, world!', 'This is a simple sentence.', 'Tokenization is fun.']
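
To make the output above less opaque, here is a rough standard-library sketch of what the two tokenizers do on this particular sentence. The regular expressions and the helper names `simple_word_tokenize` and `simple_sent_tokenize` are illustrative assumptions, not NLTK's actual algorithm; NLTK's tokenizers also handle contractions, abbreviations, and other cases that this sketch ignores.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: a run of word characters becomes one token,
    # and every other non-space character (punctuation) is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Hypothetical helper: split on whitespace that follows '.', '!' or '?'.
    # Naive on purpose: an abbreviation such as "Dr." is split incorrectly.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Hello, world! This is a simple sentence. Tokenization is fun."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
```

For this input the sketch should produce the same tokens and sentences as the NLTK output above. It diverges on harder text, for example `simple_sent_tokenize("Dr. Smith left.")` breaks after `Dr.`, which is the kind of case a trained sentence tokenizer is meant to handle.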


# Keyword Search vs Semantic Search

In [10]:
# Keyword Search Example
print("--- Keyword Search Example ---")
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A dog barks at the cat.",
    "Semantic search uses embeddings.",
    "Keyword search relies on exact word matches."
]
keyword_query = "dog"

print(f"Keyword search for: '{keyword_query}'")
keyword_results = []
for i, doc in enumerate(documents):
    if keyword_query.lower() in doc.lower():
        keyword_results.append((i, doc))

if keyword_results:
    print("Found in documents:")
    for i, doc in keyword_results:
        print(f"- Doc {i+1}: {doc}")
else:
    print("No matching documents found.")
print("\n")


# Semantic Search Example
print("--- Semantic Search Example ---")

# Install sentence-transformers if not already installed
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model. 'all-MiniLM-L6-v2' is a small general-purpose model that produces 384-dimensional embeddings.
model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = [
    "A cat sits on the mat.",
    "The dog runs in the park.",
    "Machine learning is a field of artificial intelligence.",
    "Natural Language Processing deals with text data.",
    "This document talks about pets and animals.",
    "Information retrieval methods include keyword and semantic search."
]

print("Corpus documents:")
for i, doc in enumerate(corpus):
    print(f"- {i+1}: {doc}")
print("\n")

# Encode the corpus to get embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Define a query for semantic search
semantic_query = "animals in a park"
print(f"Semantic search query: '{semantic_query}'")

# Encode the query
query_embedding = model.encode(semantic_query, convert_to_tensor=True)

# Compute cosine similarity between query and all corpus embeddings
cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]

# Combine corpus and scores, then sort by score
results = []
for i, score in enumerate(cosine_scores):
    results.append({'corpus_id': i, 'score': score.item(), 'text': corpus[i]})

# Sort the results by score in descending order
results = sorted(results, key=lambda x: x['score'], reverse=True)

print("\nTop 3 semantic search results:")
for i, result in enumerate(results[:3]):
    print(f"{i+1}. Score: {result['score']:.4f}, Document: {result['text']}")

--- Keyword Search Example ---
Keyword search for: 'dog'
Found in documents:
- Doc 1: The quick brown fox jumps over the lazy dog.
- Doc 2: A dog barks at the cat.


--- Semantic Search Example ---


Corpus documents:
- 1: A cat sits on the mat.
- 2: The dog runs in the park.
- 3: Machine learning is a field of artificial intelligence.
- 4: Natural Language Processing deals with text data.
- 5: This document talks about pets and animals.
- 6: Information retrieval methods include keyword and semantic search.


Semantic search query: 'animals in a park'

Top 3 semantic search results:
1. Score: 0.6072, Document: The dog runs in the park.
2. Score: 0.4447, Document: This document talks about pets and animals.
3. Score: 0.0620, Document: A cat sits on the mat.
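
The keyword example above uses a substring test (`keyword_query.lower() in doc.lower()`), so a query such as `dog` would also match `dogma`. A minimal sketch of whole-word matching with a regular expression follows; the helper name `keyword_search` and the sample documents are assumptions made for illustration and are not part of the original notebook.

```python
import re

def keyword_search(query, documents):
    # \b anchors the match at word boundaries, so "dog" does not match "dogma".
    # re.escape keeps any punctuation in the query from being read as regex syntax.
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [(i, doc) for i, doc in enumerate(documents) if pattern.search(doc)]

# Illustrative documents (not from the notebook)
docs = [
    "A dog barks at the cat.",
    "Dogma is a strongly held belief.",
    "Hot dogs are sold at the park.",
]

whole_word_hits = keyword_search("dog", docs)
substring_hits = [(i, d) for i, d in enumerate(docs) if "dog" in d.lower()]

print("Whole-word matches:", whole_word_hits)
print("Substring matches: ", substring_hits)
```

Whole-word matching drops the `Dogma` false positive but also misses the plural `dogs`, and neither variant finds a document that only mentions `puppy`. Stemming would address the first gap; the embedding-based search shown above is aimed at the second.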


# Measuring Vector Distance

In [11]:
import numpy as np
from scipy.spatial import distance

# Define three sample embeddings (vectors)
embedding1 = np.array([1.0, 2.0, 3.0, 4.0])
embedding2 = np.array([2.0, 3.0, 4.0, 5.0])
embedding3 = np.array([-1.0, -2.0, -3.0, -4.0])

print("Embedding 1:", embedding1)
print("Embedding 2:", embedding2)
print("Embedding 3:", embedding3)
print("\n-- Similarity between Embedding 1 and Embedding 2 --")

# 1. Euclidean Distance
# Lower distance means higher similarity
euclidean_dist_1_2 = distance.euclidean(embedding1, embedding2)
print(f"Euclidean Distance: {euclidean_dist_1_2:.4f}")

# 2. Cosine Similarity
# Ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction); magnitude is ignored
# Using 1 - cosine_distance because scipy's cosine is a distance metric
cosine_similarity_1_2 = 1 - distance.cosine(embedding1, embedding2)
print(f"Cosine Similarity: {cosine_similarity_1_2:.4f}")

# 3. Dot Product Similarity
# Higher value means higher similarity, but unlike cosine similarity it also grows with vector magnitude
dot_product_similarity_1_2 = np.dot(embedding1, embedding2)
print(f"Dot Product Similarity: {dot_product_similarity_1_2:.4f}")

print("\n-- Similarity between Embedding 1 and Embedding 3 --")

# Euclidean Distance
euclidean_dist_1_3 = distance.euclidean(embedding1, embedding3)
print(f"Euclidean Distance: {euclidean_dist_1_3:.4f}")

# Cosine Similarity
cosine_similarity_1_3 = 1 - distance.cosine(embedding1, embedding3)
print(f"Cosine Similarity: {cosine_similarity_1_3:.4f}")

# Dot Product Similarity
dot_product_similarity_1_3 = np.dot(embedding1, embedding3)
print(f"Dot Product Similarity: {dot_product_similarity_1_3:.4f}")

Embedding 1: [1. 2. 3. 4.]
Embedding 2: [2. 3. 4. 5.]
Embedding 3: [-1. -2. -3. -4.]

-- Similarity between Embedding 1 and Embedding 2 --
Euclidean Distance: 2.0000
Cosine Similarity: 0.9938
Dot Product Similarity: 40.0000

-- Similarity between Embedding 1 and Embedding 3 --
Euclidean Distance: 10.9545
Cosine Similarity: -1.0000
Dot Product Similarity: -30.0000
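
The three measures can also be written out directly from their definitions, which shows where the numbers above come from. This is a minimal NumPy sketch; the function names `euclidean` and `cosine_similarity` are chosen here for illustration, and the last two lines add a scaled vector (not in the original cell) to show how the dot product depends on magnitude while cosine similarity does not.

```python
import numpy as np

def euclidean(a, b):
    # Square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

e1 = np.array([1.0, 2.0, 3.0, 4.0])
e2 = np.array([2.0, 3.0, 4.0, 5.0])
e3 = np.array([-1.0, -2.0, -3.0, -4.0])

for name, other in [("e2", e2), ("e3", e3)]:
    print(f"e1 vs {name}: "
          f"euclidean={euclidean(e1, other):.4f}, "
          f"cosine={cosine_similarity(e1, other):.4f}, "
          f"dot={float(np.dot(e1, other)):.4f}")

# Scaling e2 by 10 multiplies the dot product by 10 but leaves the cosine unchanged.
print(f"dot(e1, 10*e2)    = {float(np.dot(e1, 10 * e2)):.4f}")
print(f"cosine(e1, 10*e2) = {cosine_similarity(e1, 10 * e2):.4f}")
```

When embeddings are normalized to unit length, the dot product and cosine similarity coincide, which is one reason many retrieval setups normalize vectors before comparing them.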
