<a href="https://colab.research.google.com/github/Aananda-giri/scripts/blob/main/search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* [source: chat-gpt](https://chat.openai.com/share/5551867b-a01d-4706-8a5c-ba2251c93112)

* [colab](https://colab.research.google.com/drive/1x2J_i_gj4iMSxV5W-C0-PLe4AqJFF-9p?authuser=2)

In [27]:
# simple search

# Sample corpus for simple search
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Function for simple search
def simple_search(query, corpus):
    results = []
    for sentence in corpus:
        if query.lower() in sentence.lower():
            results.append(sentence)
    return results

# Example usage
query = "Radio Frequency and Microwave"
results = simple_search(query, corpus)

print(f"Results containing '{query}':")
for result in results:
    print(result)


Results containing 'Radio Frequency and Microwave':
I'm studying Radio Frequency and Microwave circuits.
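
The substring test above only matches when the query appears verbatim, so reordering the words ("Microwave and Radio Frequency") would return nothing. A minimal sketch of one possible relaxation is to match sentences that contain every query word in any order. The names `tokenize` and `token_search` are hypothetical and not part of the original notebook.

```python
import re

corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_search(query, corpus):
    """Return sentences that contain every word of the query, in any order."""
    query_tokens = tokenize(query)
    return [sentence for sentence in corpus if query_tokens <= tokenize(sentence)]

# Word order differs from the sentence, which the substring search would miss
results = token_search("microwave and radio frequency", corpus)
for result in results:
    print(result)
```

This still treats "RF" and "Radio Frequency" as unrelated terms, which is the gap the following cells try to close.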


In [3]:
import re

def normalize_search_term(search_term):
    normalized_term = search_term.lower()  # Convert to lowercase for case insensitivity
    normalized_term = re.sub(r'[^a-z\s]', '', normalized_term)  # Remove non-alphabetic characters
    normalized_term = re.sub(r'\s+', ' ', normalized_term).strip()  # Remove extra spaces

    # Handle specific terms or abbreviations
    replacements = {
        'radio frequency': 'rf',
        'microwave': 'mw'
        # Add more replacements as needed
    }

    for original, replacement in replacements.items():
        normalized_term = normalized_term.replace(original, replacement)

    return normalized_term

def search(query, document):
    normalized_query = normalize_search_term(query)
    normalized_document = normalize_search_term(document)

    return normalized_query in normalized_document

# Example usage
query = "Radio Frequency and microwave"
document = "Introduction to RF and Microwave Engineering"
result = search(query, document)

if result:
    print("Found a match!")
else:
    print("No match found.")


Found a match!
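
Both `str.replace` and the `in` test above work on raw substrings, so a query such as "RF" would also match inside an unrelated word like "surf". A hedged sketch of one way around this is to anchor the replacements and the final match on word boundaries. `ABBREVIATIONS`, `normalize` and `whole_word_search` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical abbreviation table, mirroring the replacements dict above
ABBREVIATIONS = {
    "radio frequency": "rf",
    "microwave": "mw",
}

def normalize(text):
    """Lowercase, strip non-letters, collapse spaces, then abbreviate whole phrases."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    for phrase, abbreviation in ABBREVIATIONS.items():
        # \b anchors replace the phrase only when it stands as whole words
        text = re.sub(rf"\b{re.escape(phrase)}\b", abbreviation, text)
    return text

def whole_word_search(query, document):
    """Return True if the normalized query occurs in the document on word boundaries."""
    pattern = rf"\b{re.escape(normalize(query))}\b"
    return re.search(pattern, normalize(document)) is not None

match = whole_word_search("Radio Frequency and microwave",
                          "Introduction to RF and Microwave Engineering")
false_positive = whole_word_search("RF", "Learning to surf")
print(match, false_positive)
```

The trade-off is that inflected forms such as "microwaves" are no longer abbreviated, so a stemming step may be needed depending on the data.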


In [8]:
# 1. Word Embeddings and Similarity with the popular Word2Vec model and Gensim library:

# download necessary resources
import nltk
nltk.download('punkt')

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# Sample corpus for training the Word2Vec model
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits.",
    "EDC",
    "Electronic device and Circuit"
]

# Preprocess the corpus by tokenizing and lowercasing
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train a Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=0)

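# Function to calculate cosine similarity between a tokenized search term and a tokenized document
def similarity(search_term, document):
    search_vecs = [model.wv[word] for word in search_term if word in model.wv]
    doc_vecs = [model.wv[word] for word in document if word in model.wv]
    if not search_vecs or not doc_vecs:
        return 0.0  # No in-vocabulary words on one side, so np.mean would return nan
    search_vec = np.mean(search_vecs, axis=0)
    doc_vec = np.mean(doc_vecs, axis=0)
    return np.dot(search_vec, doc_vec) / (np.linalg.norm(search_vec) * np.linalg.norm(doc_vec))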

# Example usage
# query = "Radio Frequency and Microwave"
query = "Electronic device and Circuit"
# query = "EDC"
for sentence in corpus:
    sim = similarity(query.lower().split(), word_tokenize(sentence.lower()))
    print(f"Similarity with '{sentence}': {sim}")



# In this code, we first define a sample corpus. We tokenize and lowercase the sentences.
# Then, we train a Word2Vec model on the tokenized sentences.
#
# The similarity function calculates the cosine similarity between the vectors of the
# search term and the document content.
#
# Keep in mind that this is a simple example and may not cover all possible variations.
# Depending on your specific use case and the complexity of your data, you might need to
# fine-tune the Word2Vec model or consider using more advanced embedding techniques.

Similarity with 'RF and Microwave engineering is fascinating.': 0.25191694498062134
Similarity with 'Microwave frequencies are used in many applications.': -0.046546995639801025
Similarity with 'The RF signal strength is measured in decibels.': 0.11139184981584549
Similarity with 'I'm studying Radio Frequency and Microwave circuits.': 0.22794122993946075
Similarity with 'EDC': 0.1180267333984375
Similarity with 'Electronic device and Circuit': 1.0000001192092896
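
The averaging and cosine steps in `similarity` can be traced by hand on a few made-up vectors. The sketch below uses plain Python and a hypothetical `toy_vectors` table in place of `model.wv`; the values are invented for illustration and are not learned embeddings. Returning 0.0 when one side has no known words is an assumed convention.

```python
import math

# Hypothetical toy vectors standing in for model.wv (real Word2Vec vectors are learned)
toy_vectors = {
    "rf": [1.0, 0.0, 0.0],
    "microwave": [0.0, 1.0, 0.0],
    "circuit": [0.0, 0.0, 1.0],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_similarity(query_tokens, doc_tokens, vectors=toy_vectors):
    """Cosine similarity between the mean vectors of two token lists."""
    query_vec = mean_vector(query_tokens, vectors)
    doc_vec = mean_vector(doc_tokens, vectors)
    if query_vec is None or doc_vec is None:
        return 0.0  # assumed convention: nothing known on one side means no similarity
    return cosine(query_vec, doc_vec)

print(toy_similarity(["rf"], ["rf", "microwave"]))
print(toy_similarity(["edc"], ["rf"]))
```

Because the mean is taken over whatever tokens are in the vocabulary, a query made entirely of unseen words carries no signal at all, which is worth checking before trusting a score.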



In [11]:
# 2. Latent Dirichlet Allocation (LDA) for topic modeling using the popular library Gensim:
from gensim import corpora
from gensim.models import LdaModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

import nltk
nltk.download('stopwords')


# Sample corpus for topic modeling
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Preprocessing: Tokenization, lowercase, removing punctuation and stopwords
stop_words = set(stopwords.words('english'))
punctuations = string.punctuation

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    return tokens

tokenized_corpus = [preprocess_text(sentence) for sentence in corpus]

# Create a dictionary and a corpus
dictionary = corpora.Dictionary(tokenized_corpus)
corpus_bow = [dictionary.doc2bow(text) for text in tokenized_corpus]

# Train the LDA model
lda_model = LdaModel(corpus_bow, num_topics=2, id2word=dictionary)

# Print the topics
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}: {topic}")

# Example usage for searching for topics
search_query = "RF and Microwave"
query_bow = dictionary.doc2bow(preprocess_text(search_query))

# Get the topic distribution for the query
query_topic_distribution = lda_model[query_bow]
print(f"Topic distribution for '{search_query}': {query_topic_distribution}")


# In this code:
# - We start with a sample corpus.
# - We preprocess the text by tokenizing, converting to lowercase, removing punctuation, and eliminating stopwords.
# - We create a dictionary and a Bag of Words representation of the corpus.
# - We train an LDA model with 2 topics. You can adjust the num_topics parameter as needed.
# - We print out the topics discovered by the model.
# - We demonstrate how to use the model for a search query.
#
# Keep in mind that topic modeling may not always produce interpretable results, and the
# number of topics (num_topics) may need to be adjusted based on the nature of your corpus.
# Additionally, more sophisticated preprocessing steps and techniques like coherence score
# evaluation can be used to enhance the quality of topics.



Topic 0: 0.117*"rf" + 0.078*"strength" + 0.077*"signal" + 0.077*"decibels" + 0.075*"microwave" + 0.074*"measured" + 0.070*"engineering" + 0.067*"fascinating" + 0.049*"frequencies" + 0.045*"many"
Topic 1: 0.134*"microwave" + 0.068*"circuits" + 0.068*"radio" + 0.067*"studying" + 0.067*"frequency" + 0.067*"'m" + 0.064*"used" + 0.062*"applications" + 0.061*"many" + 0.058*"frequencies"
Topic distribution for 'RF and Microwave': [(0, 0.5527477), (1, 0.4472523)]
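
The model returns the query's topic distribution as a list of `(topic_id, probability)` pairs. To use it for search, one common next step is to pick the dominant topic and check how clear the win is. The sketch below works on the distribution printed above; `dominant_topic` is a hypothetical helper, and the numbers will differ between runs because the LDA model above is trained without a fixed `random_state`.

```python
# Topic distribution copied from the output above: (topic_id, probability) pairs
query_topic_distribution = [(0, 0.5527477), (1, 0.4472523)]

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])

topic_id, probability = dominant_topic(query_topic_distribution)
# Gap between the best and the worst topic; a small gap means weak evidence
margin = probability - min(p for _, p in query_topic_distribution)
print(f"Dominant topic: {topic_id} (p={probability:.3f}, margin={margin:.3f})")
```

A margin of roughly 0.1 between two topics suggests the query is not clearly assigned to either, which is plausible for a four-sentence corpus.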


In [12]:
# 3. TF-IDF and Cosine Similarity to handle text variations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample corpus for TF-IDF and similarity
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Create TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Function to calculate cosine similarity between a search term and a document
def similarity(search_term, document):
    search_tfidf = tfidf_vectorizer.transform([search_term])
    doc_tfidf = tfidf_vectorizer.transform([document])
    return cosine_similarity(search_tfidf, doc_tfidf)[0][0]

# Example usage
query = "Radio Frequency and Microwave"
for sentence in corpus:
    sim = similarity(query, sentence)
    print(f"Similarity with '{sentence}': {sim}")


# In this code:
# - We start with a sample corpus.
# - We initialize a TF-IDF vectorizer.
# - We create a TF-IDF matrix for the corpus.
# - We define a function similarity that calculates the cosine similarity between the TF-IDF vectors of the search term and the document content.
# - We use the function to compare the query with each sentence in the corpus.
#
# Keep in mind that this is a basic example and may require further refinement depending on
# your specific use case. For instance, you might want to experiment with different
# preprocessing steps or explore more advanced techniques for handling text variations.

Similarity with 'RF and Microwave engineering is fascinating.': 0.28604976898651346
Similarity with 'Microwave frequencies are used in many applications.': 0.09533655557814368
Similarity with 'The RF signal strength is measured in decibels.': 0.0
Similarity with 'I'm studying Radio Frequency and Microwave circuits.': 0.7760843129566192
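
To see where these scores come from, the weighting can be sketched in plain Python. The version below assumes the `TfidfVectorizer` defaults: tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The helper names are hypothetical, and this is an illustration rather than a replacement for scikit-learn.

```python
import math
import re
from collections import Counter

corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: the number of sentences each term appears in
doc_freq = Counter(term for sentence in corpus for term in set(tokenize(sentence)))
n_docs = len(corpus)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(text):
    """Map each in-vocabulary term to tf * idf, then L2-normalize the weights."""
    counts = Counter(t for t in tokenize(text) if t in doc_freq)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

query = "Radio Frequency and Microwave"
scores = [cosine(tfidf_vector(query), tfidf_vector(s)) for s in corpus]
for sentence, score in zip(corpus, scores):
    print(f"{score:.4f}  {sentence}")
```

The third sentence scores exactly zero because it shares no token with the query: TF-IDF has no way to know that "RF" and "Radio Frequency" mean the same thing.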


In [15]:
# 4. Nearest Neighbor Search

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Sample corpus for nearest neighbor search
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Create TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Initialize NearestNeighbors for nearest neighbor search
n_neighbors = 1  # Set the number of neighbors to find
knn = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute', metric='cosine')
knn.fit(tfidf_matrix)

# Function to find nearest neighbor to a search term
def nearest_neighbor(search_term):
    search_vector = tfidf_vectorizer.transform([search_term])
    _, indices = knn.kneighbors(search_vector)
    return corpus[indices[0][0]]

# Example usage
query = "Radio Frequency and Microwave"
result = nearest_neighbor(query)

print(f"The nearest neighbor to '{query}' is: {result}")



The nearest neighbor to 'Radio Frequency and Microwave' is: I'm studying Radio Frequency and Microwave circuits.
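
With `metric='cosine'`, the neighbors are ranked by cosine distance, which is 1 minus the cosine similarity. `kneighbors` also returns those distances, which the cell above discards via `_`. The sketch below reproduces the ranking step on the similarity scores printed in the TF-IDF cell (section 3) for the same query; `k_nearest` is a hypothetical helper.

```python
import heapq

corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Cosine similarities for "Radio Frequency and Microwave", copied from the TF-IDF output
similarities = [0.28604976898651346, 0.09533655557814368, 0.0, 0.7760843129566192]

def k_nearest(similarities, k):
    """Return (distance, index) pairs for the k smallest cosine distances."""
    distances = [(1.0 - s, i) for i, s in enumerate(similarities)]
    return heapq.nsmallest(k, distances)

for distance, index in k_nearest(similarities, k=2):
    print(f"{distance:.4f}  {corpus[index]}")
```

Keeping the distance alongside the index makes it possible to reject a "nearest" neighbor that is still far away, for example when every distance is close to 1.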


In [18]:
# 5. BERT and Fine-tuning to handle text variations without explicitly specifying replacement text

!pip install transformers

from transformers import pipeline

# Load pre-trained BERT model for question answering
qa_pipeline = pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad', tokenizer='bert-large-uncased-whole-word-masking-finetuned-squad')

# Sample context
context = (
    "RF and Microwave engineering is fascinating. Microwave frequencies are used in many applications. "
    "The RF signal strength is measured in decibels. I'm studying Radio Frequency and Microwave circuits."
)

# Example questions
questions = [
    "What is RF and Microwave engineering?",
    "How are Microwave frequencies used?",
    "How is RF signal strength measured?",
    "What are you studying?"
]

# Answer the questions
for question in questions:
    answer = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}\n")


# Explanation:
#
# We load a pre-trained BERT model and tokenizer that were fine-tuned on the SQuAD dataset,
# wrapped in a Hugging Face question-answering pipeline.
#
# We provide a sample context related to RF and Microwave engineering.
#
# We define a list of example questions.
#
# We loop through the questions, call qa_pipeline with each question and the context,
# and print the extracted answers.
#
# Keep in mind that this is a simplified example. Fine-tuning BERT for specific tasks may
# require additional steps such as creating a suitable dataset and training the model on it.
# The Hugging Face library provides tools for these tasks as well.




Question: What is RF and Microwave engineering?
Answer: fascinating

Question: How are Microwave frequencies used?
Answer: in many applications

Question: How is RF signal strength measured?
Answer: in decibels

Question: What are you studying?
Answer: Radio Frequency and Microwave circuits




In [19]:
# 6. Clustering to handle text variations without explicitly specifying replacement text

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Sample corpus for clustering
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Preprocessing: Tokenization, lowercase, removing punctuation and stopwords
stop_words = set(stopwords.words('english'))
punctuations = string.punctuation

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    return " ".join(tokens)

preprocessed_corpus = [preprocess_text(sentence) for sentence in corpus]

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Create TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_corpus)

# Apply K-Means clustering
num_clusters = 2  # Set the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
clusters = kmeans.fit_predict(tfidf_matrix)

# Group sentences by cluster
clustered_sentences = [[] for _ in range(num_clusters)]
for i, cluster in enumerate(clusters):
    clustered_sentences[cluster].append(corpus[i])

# Print clusters
for i, cluster in enumerate(clustered_sentences):
    print(f"Cluster {i+1}:")
    for sentence in cluster:
        print(sentence)
    print()


# In this code:
# - We start with a sample corpus.
# - We define a preprocessing function that tokenizes, converts to lowercase, removes punctuation and stopwords, and then joins the tokens back into a sentence.
# - We preprocess the corpus.
# - We initialize a TF-IDF vectorizer and create a TF-IDF matrix for the preprocessed corpus.
# - We apply K-Means clustering with a specified number of clusters.
# - We group the sentences by their assigned cluster.
# - Finally, we print out the clusters.
#
# Please note that the number of clusters (num_clusters) is set arbitrarily in this example.
# In practice, you might need to experiment with different numbers of clusters to find the
# optimal grouping for your specific data.




Cluster 1:
RF and Microwave engineering is fascinating.
The RF signal strength is measured in decibels.

Cluster 2:
Microwave frequencies are used in many applications.
I'm studying Radio Frequency and Microwave circuits.
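
K-Means alternates two steps until the assignments stop changing: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below shows that loop on made-up 2-D points with fixed starting centroids so the result is deterministic. This is an assumption for illustration: scikit-learn works on the TF-IDF vectors and chooses its initial centroids differently.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move either
        labels = new_labels
        # Update step: each centroid moves to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical toy data: two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[1]])
print(labels, centroids)
```

With only four short sentences, as in the corpus above, the grouping depends heavily on a handful of shared words, so the clusters should be read as a demonstration rather than a meaningful partition.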




In [26]:
# 7. Semantic Search with Sentence Embedding

# we'll use Universal Sentence Encoder (USE)
import tensorflow as tf
import tensorflow_hub as hub

# Load Universal Sentence Encoder
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Sample corpus for semantic search
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Encode sentences
sentence_embeddings = embed(corpus)

# Function to find semantically similar sentences
def find_similar_sentences(query, threshold=0.2):
    query_embedding = embed([query])[0]
    similarity_scores = tf.tensordot(query_embedding, sentence_embeddings, axes=[[0], [1]])
    similar_sentences = [[corpus[i], similarity_scores[i]] for i in tf.argsort(similarity_scores, direction='DESCENDING') if similarity_scores[i] > threshold]
    return similar_sentences

# Example usage
query = "RF and Microwave"
similar_sentences = find_similar_sentences(query)

print(f"Sentences similar to '{query}':")
for sentence in similar_sentences:
    print(sentence)

# Explanation:
#
# We load the Universal Sentence Encoder from TensorFlow Hub. This model converts sentences
# into high-dimensional embeddings that capture their semantic meaning.
#
# We define a sample corpus.
#
# We encode the sentences using the Universal Sentence Encoder.
#
# We define a function find_similar_sentences that takes a query and a threshold for
# similarity scores. It computes the similarity between the query and each sentence in the
# corpus and returns sentences with similarity scores above the threshold.
#
# We use the function to find sentences similar to the given query.
#
# Keep in mind that the threshold value may need to be adjusted based on your specific use
# case and data. This example provides a basic illustration of how to perform semantic
# search using sentence embeddings.

Sentences similar to 'RF and Microwave':
['RF and Microwave engineering is fascinating.', <tf.Tensor: shape=(), dtype=float32, numpy=0.60963064>]
["I'm studying Radio Frequency and Microwave circuits.", <tf.Tensor: shape=(), dtype=float32, numpy=0.513118>]
['Microwave frequencies are used in many applications.', <tf.Tensor: shape=(), dtype=float32, numpy=0.31462774>]
['The RF signal strength is measured in decibels.', <tf.Tensor: shape=(), dtype=float32, numpy=0.2823853>]
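
The results print as `<tf.Tensor ...>` because the scores are kept as tensors. The filter-and-sort step itself is easier to read on plain floats, as in the sketch below, which uses the scores printed above arranged in corpus order. `rank_above_threshold` is a hypothetical helper; in the notebook, something like `similarity_scores.numpy().tolist()` should give equivalent plain floats.

```python
corpus = [
    "RF and Microwave engineering is fascinating.",
    "Microwave frequencies are used in many applications.",
    "The RF signal strength is measured in decibels.",
    "I'm studying Radio Frequency and Microwave circuits."
]

# Scores copied from the output above, rearranged into corpus order
scores = [0.60963064, 0.31462774, 0.2823853, 0.513118]

def rank_above_threshold(sentences, scores, threshold=0.2):
    """Return (sentence, score) pairs above the threshold, best first."""
    pairs = [(s, score) for s, score in zip(sentences, scores) if score > threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = rank_above_threshold(corpus, scores, threshold=0.5)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

With the default threshold of 0.2 all four sentences pass, so the threshold does no filtering on this corpus. Raising it to 0.5 keeps only the two sentences that mention both RF (or Radio Frequency) and Microwave.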
