### Text Similarity Using OpenAI Embeddings and Cosine Similarity

In [21]:
# A simple demo of text embeddings and similarity search using OpenAI embeddings, without ChromaDB.
# It shows how text embeddings work without requiring a vector database.

"""

- Program: Text Embeddings & Similarity Search (No ChromaDB)
- Steps:  
1. Convert text into **embeddings** using `OpenAI` API.  
2. Store embeddings in a list (instead of ChromaDB).  
3. Use **Cosine Similarity** to find similar texts.

"""



In [13]:
# Read the OpenAI API key from a local text file
with open('C:\\Users\\Shailendra Kadre\\Desktop\\OPEN_AI_KEY.txt') as f:
    api_key = f.read().strip()  # strip() drops any trailing newline or spaces
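
As an alternative to a hard-coded file path, the key can be looked up in an environment variable first, with the text file as a fallback. The following is a minimal sketch under stated assumptions: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are illustrative choices, not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(key_file=None, env_var="OPENAI_API_KEY"):
    """Return an API key from the environment, falling back to a text file.

    `load_api_key` and the default `env_var` name are hypothetical helpers
    introduced for this sketch.
    """
    # Prefer the environment variable so the key never has to live in the notebook
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to a plain-text file containing only the key
    if key_file is not None:
        return Path(key_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No API key found in ${env_var} and no key file given")
```

A call such as `api_key = load_api_key(key_file="OPEN_AI_KEY.txt")` would then replace the manual `open()` and `read()` steps.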

In [19]:
# Importing necessary libraries
import openai  # OpenAI's Python client library to interact with its API
import numpy as np  # NumPy for handling numerical computations and arrays
from sklearn.metrics.pairwise import cosine_similarity  # Function to compute similarity between vectors

# Set the OpenAI API key (the value read from the key file in the previous cell)
openai.api_key = api_key  # This allows access to OpenAI's services

# Function to generate text embeddings using OpenAI's embedding model
def get_embedding(text):
    response = openai.embeddings.create(  # Request embeddings from OpenAI
        model="text-embedding-ada-002",  # Specify the embedding model
        input=[text]  # Input text must be passed as a list
    )
    return np.array(response.data[0].embedding)  # Extract and return the embedding as a NumPy array

# List of sample text data for embedding generation
texts = [
    "Artificial Intelligence is transforming industries.",  # AI-related sentence
    "Machine learning helps in predictive analytics.",  # ML-related sentence
    "Deep learning is a subset of machine learning.",  # DL-related sentence
    "I love pizza and Italian food."  # Unrelated topic (food preference)
]

# Convert each text in the list into an embedding using the get_embedding function
embeddings = [get_embedding(text) for text in texts]

# Convert the list of embeddings into a NumPy array for efficient processing
embeddings_matrix = np.array(embeddings)

# Compute cosine similarity between the first text embedding and every embedding (including itself)
similarities = cosine_similarity([embeddings_matrix[0]], embeddings_matrix)

# Print similarity scores for each text compared to the first text
print("Similarity Scores with First Text:")
for i, score in enumerate(similarities[0]):  # Iterate over similarity scores
    print(f"{i}: {score:.4f} → {texts[i]}")  # Print index, similarity score, and corresponding text

Similarity Scores with First Text:
0: 1.0000 → Artificial Intelligence is transforming industries.
1: 0.8563 → Machine learning helps in predictive analytics.
2: 0.8294 → Deep learning is a subset of machine learning.
3: 0.7261 → I love pizza and Italian food.
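
To make the scores above less of a black box, cosine similarity can be computed by hand: it is the dot product of two vectors divided by the product of their lengths. The sketch below uses small made-up 2-D vectors instead of real embeddings, and the helper name `cosine_sim` is an illustrative choice, not something from the notebook.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (not real embeddings) chosen so the results are easy to check
v1 = [1.0, 0.0]
v2 = [1.0, 1.0]
v3 = [0.0, 1.0]

print(round(cosine_sim(v1, v1), 4))  # same direction     -> 1.0
print(round(cosine_sim(v1, v2), 4))  # 45 degrees apart   -> 0.7071
print(round(cosine_sim(v1, v3), 4))  # orthogonal vectors -> 0.0
```

Because the result depends only on the angle between the vectors, scaling either vector leaves the score unchanged.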


In [18]:
"""

- How It Works:
1. Gets embeddings for each text using OpenAI’s `text-embedding-ada-002`.  
2. Stores embeddings in a list (instead of ChromaDB).  
3. Computes similarity using `cosine_similarity()` from `sklearn`.  
4. Prints similarity scores, showing which texts are most similar.  

"""

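
Step 1 above calls the API once per text. Since the embeddings endpoint accepts a list of inputs, all texts could in principle be embedded in a single request. The sketch below is one possible way to do that; `embed_batch` is a hypothetical helper, and the request function is passed in as `create_fn` so the logic can be tried without network access. It assumes the response exposes `data` items with `index` and `embedding` attributes, as the `openai` v1 client does.

```python
import numpy as np


def embed_batch(texts, create_fn, model="text-embedding-ada-002"):
    """Embed all texts with one request and return a (n_texts, dim) array.

    `create_fn` is expected to behave like `openai.embeddings.create`
    (an assumption of this sketch).
    """
    response = create_fn(model=model, input=list(texts))
    # Sort by the index field so rows line up with the input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return np.array([item.embedding for item in ordered])


# Possible usage with the real client (requires a valid API key):
# import openai
# embeddings_matrix = embed_batch(texts, openai.embeddings.create)
```

Passing the request function as a parameter is a design choice for this sketch only; it keeps the batching logic separate from the network call.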

In [20]:

"""
- Output from the run above:
Similarity Scores with First Text:
0: 1.0000 → Artificial Intelligence is transforming industries.
1: 0.8563 → Machine learning helps in predictive analytics.
2: 0.8294 → Deep learning is a subset of machine learning.
3: 0.7261 → I love pizza and Italian food.

- Higher scores mean more similarity. The unrelated text (pizza) gets the lowest score,
  but with `text-embedding-ada-002` it is still well above zero (0.7261 here), so compare
  scores relative to each other rather than against a fixed threshold.

"""
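
Since the scores are mainly useful relative to each other, a natural next step is to sort the stored texts by similarity to a query and keep the top few. The following is a minimal sketch using NumPy only; `rank_by_similarity` is a hypothetical helper, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def rank_by_similarity(query_vec, matrix, top_k=3):
    """Return (row_index, score) pairs for the top_k rows most similar to query_vec."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every row at once
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    scores = (matrix @ query_vec) / norms
    # argsort on the negated scores gives indices in descending score order
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D vectors (not real embeddings) to show the ordering
toy = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
for idx, score in rank_by_similarity([1.0, 0.0], toy, top_k=2):
    print(idx, round(score, 4))
```

With the notebook's data, `rank_by_similarity(embeddings_matrix[0], embeddings_matrix)` would list the first text itself at the top, followed by its nearest neighbours.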