
In [None]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""

import torch

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus with example sentences
corpus = [
    "Leakage detected in sodium-air cell after 3 hours of operation at Pune plant.",
    "Voltage drop reported in hydrogen stack during startup phase.",
    "Temperature spike noted in storage chamber due to faulty insulation.",
    "Sodium battery exploded due to internal short circuit.",
    "Pressure variation found in cathode chamber at Nagpur facility.",
    "Cell degradation observed under high humidity conditions.",
    "Production halted due to electrolyte contamination.",
    "Cooling system failure caused overheating in pack #42.",
    "Test batch failed due to excessive impedance levels.",
    "Sensor malfunction caused false reading in plant 3."
]
# Encode the corpus into embedding vectors. convert_to_tensor=True returns a
# PyTorch tensor, so the embeddings can be moved to a GPU if one is available.
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "Why was the sodium battery unstable last week?",
    "Any overheating incidents in Pune plant?",
    "What are common issues in hydrogen stacks?",
    "Was there any false alarm or sensor error?",
    "Did electrolyte ever get contaminated?",
]

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # Score the query against every corpus sentence, then use torch.topk to keep the top_k highest scores
    similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print(f"Top {top_k} most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(corpus[idx], f"(Score: {score:.3f})")


Query: Why was the sodium battery unstable last week?
Top 5 most similar sentences in corpus:
Sodium battery exploded due to internal short circuit. (Score: 0.692)
Leakage detected in sodium-air cell after 3 hours of operation at Pune plant. (Score: 0.421)
Production halted due to electrolyte contamination. (Score: 0.374)
Voltage drop reported in hydrogen stack during startup phase. (Score: 0.332)
Temperature spike noted in storage chamber due to faulty insulation. (Score: 0.321)

Query: Any overheating incidents in Pune plant?
Top 5 most similar sentences in corpus:
Temperature spike noted in storage chamber due to faulty insulation. (Score: 0.461)
Cooling system failure caused overheating in pack #42. (Score: 0.441)
Leakage detected in sodium-air cell after 3 hours of operation at Pune plant. (Score: 0.435)
Pressure variation found in cathode chamber at Nagpur facility. (Score: 0.263)
Cell degradation observed under high humidity conditions. (Score: 0.261)

Query: What are common is
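
The scores printed above come from `embedder.similarity`, which (as far as I know) defaults to cosine similarity for this model. As a rough sketch of what the scoring and top-k selection amount to, the following uses only the standard library and tiny made-up vectors; `cosine_similarity`, `top_k_similar`, and the toy data are hypothetical names introduced for illustration and are not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k highest-scoring corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors, invented for illustration only.
# Real all-MiniLM-L6-v2 embeddings have 384 dimensions.
toy_corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

for index, score in top_k_similar(toy_query, toy_corpus, k=2):
    print(index, f"{score:.3f}")
```

In the notebook itself the tensor-based calls are preferable, since they score the whole corpus in one batched operation.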

In [None]:
# ✅ Install the required library (the `!` prefix works in Colab/Jupyter; in a terminal, run `pip install sentence-transformers`)
!pip install sentence-transformers

import pandas as pd
import pickle
from sentence_transformers import SentenceTransformer

# Load the CSV
df = pd.read_csv("fact_dpos_1_day.csv")

# Filter valid remarks
df_valid = df[df['remark'].notnull()]

# Create searchable sentences
corpus_sentences = df_valid.apply(
    lambda row: f"On {str(row['date_id'])[:10]}, turbine {row['turbine_code']} at {row['site']} reported: {row['remark']}",
    axis=1
).tolist()

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)

# Cache embeddings
with open("sorigin_embeddings.pkl", "wb") as fOut:
    pickle.dump({"sentences": corpus_sentences, "embeddings": corpus_embeddings}, fOut)
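
The cell above always re-encodes the corpus before writing the cache. One possible refinement, sketched below under assumptions, is to reuse the pickle when it already exists and was built from the same sentences. `load_or_build_cache` and `fake_embed` are hypothetical helpers; `fake_embed` merely stands in for `model.encode` so the sketch runs without downloading a model.

```python
import os
import pickle
import tempfile


def load_or_build_cache(cache_path, sentences, embed_fn):
    """Return cached embeddings if they match `sentences`, else rebuild and save."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f_in:
            cache = pickle.load(f_in)
        # Reuse the cache only if it was built from exactly the same sentences.
        if cache.get("sentences") == sentences:
            return cache["embeddings"]
    embeddings = embed_fn(sentences)
    with open(cache_path, "wb") as f_out:
        pickle.dump({"sentences": sentences, "embeddings": embeddings}, f_out)
    return embeddings


def fake_embed(sentences):
    """Stand-in for model.encode: one made-up number per sentence."""
    return [[float(len(s))] for s in sentences]


# Demonstration in a temporary directory so no real cache file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_embeddings.pkl")
    first = load_or_build_cache(demo_path, ["a", "bb"], fake_embed)   # builds and saves
    second = load_or_build_cache(demo_path, ["a", "bb"], fake_embed)  # loads from disk
    print(first == second)
```

In the notebook, the third argument could be something like `lambda s: model.encode(s, show_progress_bar=True, convert_to_tensor=True)`. Note that pickle files should only be loaded from sources you trust.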


In [None]:
import pickle

import torch
from sentence_transformers import util

# Load embeddings
with open("sorigin_embeddings.pkl", "rb") as fIn:
    cache = pickle.load(fIn)
    corpus_sentences = cache["sentences"]
    corpus_embeddings = cache["embeddings"]

# Move embeddings to available device
device = "cuda" if torch.cuda.is_available() else "cpu"
corpus_embeddings = corpus_embeddings.to(device)
model = model.to(device)

# Start semantic query loop
while True:
    query = input("\n💬 Enter your query (or type 'exit' to stop): ")
    if query.strip().lower() == "exit":
        print("🔚 Exiting search...")
        break

    # Embed user query
    query_embedding = model.encode(query, convert_to_tensor=True).to(device)

    # Semantic search (cosine similarity)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]

    print("\n🔍 Top 5 Most Similar Issues:\n")
    for idx, hit in enumerate(hits, 1):
        score = hit["score"]
        sentence = corpus_sentences[hit["corpus_id"]]
        print(f"{idx}. (Score: {score:.4f}) - {sentence}")

    # Pick the top answer (hits is empty only when the corpus is empty)
    if hits:
        top_answer = corpus_sentences[hits[0]['corpus_id']]
        print(f"\n✅ Final Answer:\n{top_answer}")



💬 Enter your query (or type 'exit' to stop): give me site details which is running

🔍 Top 5 Most Similar Issues:

1. (Score: 0.1906) - On 2024-11-19, turbine M121 at Suthari reported: Turbine was under breakdown due to Elec Yaw Sensor Error stop from 08:25 hours and restored at 10:21 hours.
2. (Score: 0.1878) - On 2024-11-19, turbine VM09 at Maliya Miyana reported: WTG manual stop on 16-Nov-24 @ 13:11 Hrs for Hub lock system repairing work, During trial run Pitch system error occurred on 16-Nov-24 21:53Hrs , In T/s found pitch brake3 found faund faulty. 
As of 17-Nov-24, the Pitch Brake and SFS Blower Fan are not available on-site. We are actively working on procuring these components. While the Pitch Brake has been received, the SFS Blower Fan is still awaited. The vendor is being expedited to expedite delivery, which may take an additional 2 days. He will dispatched from Vadodara today evening.
3. (Score: 0.1614) - On 2024-11-19, turbine M58 at Suthari reported: FSS Overspeed error 

KeyboardInterrupt: Interrupted by user
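
In the run above every hit scores below 0.2, yet the loop would still report the first one as the final answer. A minimal sketch of one way to guard against that is to require a minimum score before accepting a hit. `pick_answer`, the example data, and the cutoff of 0.3 are all hypothetical; the cutoff is an arbitrary illustrative value, not one tuned for this data.

```python
def pick_answer(hits, sentences, min_score=0.3):
    """Return the best-matching sentence, or None if no hit reaches min_score.

    `hits` is a list of {"corpus_id": int, "score": float} dicts, the shape
    that util.semantic_search returns for a single query.
    min_score=0.3 is an arbitrary illustrative cutoff, not a tuned value.
    """
    confident = [h for h in hits if h["score"] >= min_score]
    if not confident:
        return None
    best = max(confident, key=lambda h: h["score"])
    return sentences[best["corpus_id"]]


# Hypothetical hits with scores similar to the low-scoring run above.
example_sentences = ["sentence A", "sentence B"]
example_hits = [
    {"corpus_id": 0, "score": 0.1906},
    {"corpus_id": 1, "score": 0.1878},
]

answer = pick_answer(example_hits, example_sentences)
print(answer if answer is not None else "No sufficiently similar remark found.")
```

A suitable threshold depends on the model and the data, so it would need to be chosen by inspecting scores for queries with known good and bad matches.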