### **What is an Embedding Model?**

An **embedding model** transforms data (e.g., text or images) into dense numerical vectors called **embeddings** that capture semantic meaning.

#### **Key Features:**
- **Dimensionality Reduction**: Maps high-dimensional, sparse inputs (such as raw text) into a lower-dimensional dense vector space.
- **Semantic Representation**: Similar inputs are mapped to similar vectors.
- **Applications**: Used in search, clustering, classification, and recommendations.
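
To make the idea of "similar inputs are mapped to similar vectors" concrete, here is a minimal sketch using hand-written toy vectors rather than real model output. The helper name `toy_cosine` and the three-dimensional vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions.

```python
import math

def toy_cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors made up for illustration (not produced by any model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones
print(toy_cosine(cat, kitten) > toy_cosine(cat, car))  # True
```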

---

### **OpenAI Embedding Models**

OpenAI's embedding models, such as **`text-embedding-3-small`** (used in the code cells in this notebook) and the older **`text-embedding-ada-002`**, generate text embeddings optimized for semantic tasks.

#### **Key Features:**
- **Scalable**: Supports inputs up to 8192 tokens.
- **Applications**: Ideal for semantic search, question-answering, and ranking.
- **Efficiency**: Aims to balance embedding quality against cost per request.

OpenAI models simplify the use of embeddings in AI applications without requiring deep ML expertise.


In [1]:
from openai import OpenAI

In [None]:
# Only in Google Colab
from google.colab import userdata
client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))

In [2]:
# Load the OpenAI API key from a .env file on a local machine
from dotenv import load_dotenv
import os

# Load environment variables from the .env file
load_dotenv()

# Retrieve the API key from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")

In [None]:

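# OpenAI() reads OPENAI_API_KEY from the environment by default
client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=input("Enter a sentence to embed: "),
    encoding_format="float",
)

# The embedding vector for the first (and only) input
print(response.data[0].embedding)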


[0.008252129, -0.033444047, -0.024320858, 0.0049856612, 0.04155864, -0.00044555767, 0.009610292, -0.0058968337, -0.061936814, 0.008452701, -0.018544368, 0.053226233, -0.044446886, -0.037019968, 0.020676168, 0.04004575, -0.0149226, 0.006905427, 0.032481298, 0.03426926, 0.012916874, -0.048550025, -0.024710542, 0.012848106, -0.011873897, 0.005366749, 0.0128939515, 0.030830871, 0.023908252, -0.010590232, 0.023587335, -0.034750633, -0.0028610246, -0.07880783, 0.03972483, -0.030257806, 0.0072435355, 0.016229186, -0.03447556, 0.0046045734, 0.01864752, -0.0019297948, 0.0064011305, 0.0038395324, 0.05033799, -0.025902515, -0.014418303, 0.011701978, 0.013776471, -0.0062120194, -0.028469846, -0.021719145, -0.046234846, 0.04465319, -0.05189672, -0.03323774, 0.017489929, 0.024756387, 0.005461305, -0.046418227, 0.0052005607, -0.011249256, -0.007701987, 0.014636068, -0.04148987, -0.040297896, 0.0150372125, 0.011094529, 0.002051571, 0.019277891, 0.01980511, -0.0009068746, 0.024343781, -0.02727787, -0.0

In [15]:
def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        encoding_format="float"
    )
    return response.data[0].embedding

In [23]:
# Five sample sentences from different domains
sentences = [
    "I am a software engineer",
    "I am a civil engineer",
    "I am going to school",
    "The engine is running very hot.",
    "India is a country in Asia.",
]

# Use the OpenAI API to embed each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]
print(embeddings[0])
print(embeddings[1])
print(len(embeddings[0]))  # length of one embedding vector

[0.007288928, -0.017361501, -0.041477628, -0.008027716, 0.011372047, -0.036649127, 0.03005281, 0.007994734, -0.06290246, -0.022664938, 0.01941955, -0.05467026, -0.044168923, -0.022387892, 0.017018493, 0.03000004, -0.023628, -0.007420854, -0.010969671, -0.013073896, 0.02382589, 0.009689987, 0.019643826, 0.014340389, 0.027150432, -0.0050956532, 0.006560035, 0.025356235, -0.0064445995, -0.012862814, -0.02357523, -0.01992087, -0.034379993, -0.00036836296, 0.058522508, 0.0074076615, 0.030712442, -0.00065056153, 0.034195296, 0.0017117438, -0.03105545, -0.06000008, 0.015593688, 0.014907672, -0.012796851, 0.018693956, -0.050026454, 0.036939364, 0.036490813, 0.0061081876, -0.04960429, -0.026622728, 0.05292883, 0.076358944, 0.015316643, 0.019023772, 0.02212404, 0.0028413627, 0.008364127, -0.008938007, 0.038153086, -0.007981541, 0.027097661, 0.011391836, -0.025369426, 0.015567304, -0.02062008, -0.0109366905, 0.029182097, -0.024459135, 0.023258606, 0.0465436, -0.0059828577, 0.03316627, 0.017084455

In [21]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Ask the user for a query sentence and embed it
query = input("Enter a query sentence: ")
query_embedding = get_embedding(query)

# Cosine similarity between the query and every sentence embedding.
# cosine_similarity returns a (1, n) matrix; [0] takes the row of scores.
similarities = cosine_similarity([query_embedding], embeddings)[0]
most_similar_index = int(np.argmax(similarities))
print(sentences[most_similar_index])

The engine is running very hot.
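
The search above returns only the single best match. A minimal sketch of ranking every sentence by similarity, using NumPy alone, might look like the following. `rank_by_cosine` is a hypothetical helper, and the two-dimensional toy vectors stand in for real embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(-scores)  # descending by score
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors made up for illustration
toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
for idx, score in rank_by_cosine([1.0, 0.1], toy_docs):
    print(idx, round(score, 3))
```

With the notebook's own variables, the equivalent call would be `rank_by_cosine(query_embedding, embeddings)`, and each returned index can be looked up in `sentences`.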


# API Request Body Documentation

## **Input**
- **Type:** `string` or `array`  
- **Required:** Yes  
- **Description:**  
  Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or an array of token arrays.  
  - The input must:
    - Not exceed the maximum input tokens for the model (e.g., **8192 tokens** for `text-embedding-ada-002`).
    - Not be an empty string.
    - Contain **2048 elements or fewer** (if an array).  
  - Token counts can be computed in Python, for example with the `tiktoken` library. Some models may also impose a limit on the total number of tokens summed across all inputs.
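
The non-token constraints listed above can be checked on the client before a request is sent. The following is a sketch only: `validate_embedding_input` is a hypothetical helper, not part of the OpenAI SDK, and it covers just the empty-string and array-size rules (token counting is left out).

```python
MAX_ARRAY_ITEMS = 2048  # array-size limit quoted in the list above

def validate_embedding_input(value):
    """Raise if `value` breaks the empty-string or array-size rules."""
    if isinstance(value, str):
        items = [value]
    elif isinstance(value, list):
        if len(value) > MAX_ARRAY_ITEMS:
            raise ValueError(
                f"array has {len(value)} items; limit is {MAX_ARRAY_ITEMS}"
            )
        items = value
    else:
        raise TypeError("input must be a string or a list")
    # Empty strings are rejected whether passed alone or inside an array
    for item in items:
        if item == "":
            raise ValueError("input must not contain an empty string")
    return value

validate_embedding_input(["I am a software engineer", "I am a civil engineer"])
```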

---

## **Model**
- **Type:** `string`  
- **Required:** Yes  
- **Description:**  
  ID of the model to use.  
  - Use the `List models API` to view available models.  
  - Refer to the `Model Overview` for detailed descriptions of each model.

---

## **Encoding Format**
- **Type:** `string`  
- **Required:** No (Optional)  
- **Default:** `float`  
- **Description:**  
  The format to return the embeddings in.  
  - Options:
    - `float`
    - `base64`

---

## **Dimensions**
- **Type:** `integer`  
- **Required:** No (Optional)  
- **Description:**  
  Specifies the number of dimensions for the resulting output embeddings.  
  - **Supported Only In:** `text-embedding-3` and later models.
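
A sketch of requesting shorter vectors through the `dimensions` parameter follows. `embed_with_dimensions` is a hypothetical wrapper; it accepts any client object exposing `embeddings.create(...)` in the shape used earlier in this notebook, and the value 256 is only an example.

```python
def embed_with_dimensions(client, text, dimensions=256):
    """Request an embedding with `dimensions` output values."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # a text-embedding-3 model is required
        input=text,
        dimensions=dimensions,
    )
    return response.data[0].embedding
```

With the notebook's `client`, a call such as `embed_with_dimensions(client, "I am a software engineer")` should return a vector whose length matches the requested `dimensions`.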

---

## **User**
- **Type:** `string`  
- **Required:** No (Optional)  
- **Description:**  
  A unique identifier representing your end-user, which can help OpenAI monitor and detect abuse.  
  - [Learn more](https://platform.openai.com/docs/abuse-monitoring)


In [6]:
print(response)

CreateEmbeddingResponse(data=[Embedding(embedding=[0.0022756963, -0.009305916, 0.015742613, -0.0077253063, -0.0047450014, 0.014917395, -0.009807394, -0.038264707, -0.0069127847, -0.028590616, 0.025251659, 0.018116701, -0.0036309576, -0.02554366, 0.00055543496, -0.016428178, 0.02828592, 0.0054083494, 0.009610611, -0.016415482, -0.015412526, 0.004272088, 0.0069953064, -0.007223828, -0.0039007403, 0.018573744, 0.008734611, -0.022699833, 0.011508612, 0.023893224, 0.015602961, -0.0035706533, -0.034963835, -0.0041514793, -0.026178442, -0.02150644, -0.0057066972, 0.011768873, 0.008455306, 0.004129262, 0.019157745, -0.014358787, 0.008982176, 0.0063605234, -0.04570436, 0.017900875, -0.005570219, -0.0007716578, -0.02215392, -0.0039229575, 0.02101131, -0.017608874, -0.011699047, -0.02256018, 0.01633931, 0.017164527, -0.00838548, 0.0015901309, 0.02507392, -0.024997747, 0.007807828, 0.005798741, -0.022115832, 0.002948566, -0.0061764363, -0.025556356, -0.008074437, 0.0010585003, 0.00023824192, 0.004

# API Response Documentation: Embedding Object

## **The Embedding Object**
Represents an embedding vector returned by the embedding endpoint.

---

### **Index**
- **Type:** `integer`  
- **Description:**  
  The index of the embedding in the list of embeddings.

---

### **Embedding**
- **Type:** `array`  
- **Description:**  
  The embedding vector, which is a list of floats.  
  - The length of the vector depends on the model, as specified in the [Embedding Guide](https://platform.openai.com/docs/guides/embeddings).

---

### **Object**
- **Type:** `string`  
- **Description:**  
  The object type, which is always `"embedding"`.
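
The three fields described above can be read directly off each item in `response.data`. As a hedged sketch, the hypothetical helper below collects the vectors in input order using `index`. The stand-in objects built with `SimpleNamespace` only mimic the documented fields and are not real API responses.

```python
from types import SimpleNamespace

def embeddings_in_input_order(response):
    """Return embedding vectors sorted by their `index` field."""
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]

# Stand-in response with the items deliberately out of order
fake_response = SimpleNamespace(data=[
    SimpleNamespace(object="embedding", index=1, embedding=[0.3, 0.4]),
    SimpleNamespace(object="embedding", index=0, embedding=[0.1, 0.2]),
])
print(embeddings_in_input_order(fake_response))  # [[0.1, 0.2], [0.3, 0.4]]
```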
