# Text Embedding

Embedding models convert text, such as documents, into numerical vectors. These vectors capture the semantic meaning of the text, so they can be used for downstream tasks such as context-aware search, classification, and clustering.

In [None]:
%%capture
# After executing this cell, restart the kernel and run all the cells.
%pip install --user "ibm-watsonx-ai==1.1.2"
%pip install --user "langchain==0.2.11"
%pip install --user "langchain-ibm==0.1.11"
%pip install --user "langchain-community==0.2.10"
%pip install --user "sentence-transformers==3.0.1"

### Getting the data

In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/i5V3ACEyz6hnYpVq6MTSvg/state-of-the-union.txt"

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("state-of-the-union.txt")
data = loader.load()

data

### Splitting the data

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

chunks = text_splitter.split_text(data[0].page_content)

len(chunks)
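To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which this sketch does not do. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration; not the algorithm LangChain uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return windows


# Made-up sample text of 250 characters.
sample_text = "abcdefghij" * 25
sample_chunks = sliding_window_chunks(sample_text, chunk_size=100, chunk_overlap=20)

print(len(sample_chunks))
print([len(c) for c in sample_chunks])
```

Each window starts 80 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence cut at a chunk boundary loses its context.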

## Watsonx embedding model

Here, we will use IBM's `slate-125m-english-rtrvr` model.

The `slate-125m-english-rtrvr` model is a standard [sentence transformers](https://www.sbert.net/) model based on bi-encoders.

At a high level, the model is trained to maximize the cosine similarity between two pieces of input text, for example text A (the query) and text B (the passage), which are encoded into the sentence embeddings q and p. These embeddings can then be compared using cosine similarity, which measures how close two sentences are in meaning by the angle between their embedding vectors.
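The cosine similarity of two embeddings q and p is their dot product divided by the product of their lengths. The sketch below computes it in plain Python. The three-dimensional vectors are made-up toy values, not real model output, and the helper name `cosine_similarity` is our own.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values for illustration only).
q = [1.0, 2.0, 3.0]
p_similar = [2.0, 4.0, 6.0]      # same direction as q
p_orthogonal = [3.0, 0.0, -1.0]  # perpendicular to q (dot product is 0)

print(cosine_similarity(q, p_similar))
print(cosine_similarity(q, p_orthogonal))
```

Vectors pointing in the same direction score close to 1, and perpendicular vectors score close to 0, regardless of their lengths.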

### Building the model

In [None]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings

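embed_params = {
    # Truncate each input to at most this many tokens before embedding.
    # 3 is very small; raise it if the whole chunk should influence the vector.
    EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
    # Ask the service to return the input text alongside each embedding.
    EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
}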

watsonx_embedding = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="skills-network",
    params=embed_params,
)

Let's try it out. This step requires an API key.

In [None]:
query = "How are you?"

query_result = watsonx_embedding.embed_query(query)

len(query_result)

We can also create embeddings for documents.

In [None]:
doc_result = watsonx_embedding.embed_documents(chunks)
len(doc_result)

doc_result[0][:5]

In [None]:
len(doc_result[0])
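Once a query embedding and the document embeddings are available, a simple semantic search ranks the chunks by their similarity to the query. The sketch below shows the idea with made-up three-dimensional vectors in place of `query_result` and `doc_result`, so it runs without an API key. The names `rank_chunks`, `toy_chunks`, `toy_doc_vectors`, and `toy_query_vector` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_chunks(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Toy stand-ins for `chunks`, `doc_result`, and `query_result` (assumed values).
toy_chunks = ["chunk about the economy", "chunk about healthcare", "chunk about energy"]
toy_doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
toy_query_vector = [0.8, 0.2, 0.1]

ranked = rank_chunks(toy_query_vector, toy_doc_vectors, k=2)
for index, score in ranked:
    print(f"{score:.3f}  {toy_chunks[index]}")
```

With real data, the same function could be called as `rank_chunks(query_result, doc_result)` and the returned indices used to look up entries in `chunks`. For large collections, a vector store is usually a better fit than this linear scan.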