<a href="https://colab.research.google.com/github/Mays-Waddah/Text-Embedding-/blob/main/TextEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notes: 1. Why use transformers and torch

transformers:

What it does: This library, developed by Hugging Face, provides pre-trained transformer models and the tools to work with them. Transformers are a neural network architecture that excels at understanding and generating text. The library includes a wide range of models (e.g., BERT, GPT, T5) for natural language processing (NLP) tasks.
Why: It simplifies using state-of-the-art NLP models by providing easy-to-use interfaces for loading pre-trained models and their tokenizers.

torch:

What it does: PyTorch is an open-source machine learning library that provides tools for tensor computation (similar to NumPy) and deep learning. It is commonly used to perform the computations required to train and run neural networks.
Why: It handles the underlying computations when running the transformer models: the tensor operations needed for inference and, during training, the gradients. Inference itself does not need gradients.

In [None]:
!pip install transformers



In [None]:
# Creating a config.ini file in Google Colab
with open('config.ini', 'w') as configfile:
    configfile.write('[model]\n')
    configfile.write('name = bert-base-uncased\n')
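
The model name is later read back from this file with `configparser`. As a minimal sketch of a more defensive read (the helper name `read_model_name` and its `section`/`option` parameters are hypothetical, not part of the notebook), the check below relies on the fact that `ConfigParser.read()` silently skips files it cannot find:

```python
import configparser


def read_model_name(config_file="config.ini", section="model", option="name"):
    """Return the model name stored in an INI file, failing loudly if it is absent.

    Hypothetical helper for illustration; the notebook reads the value inline.
    """
    config = configparser.ConfigParser()
    # read() returns the list of files it parsed and skips missing ones,
    # so an empty result means the config file was not found or not readable.
    if not config.read(config_file):
        raise FileNotFoundError(f"Could not read config file: {config_file}")
    # has_option() also returns False when the section itself is missing.
    if not config.has_option(section, option):
        raise KeyError(f"Missing '{option}' in section [{section}] of {config_file}")
    return config.get(section, option).strip()
```

With the `config.ini` written by the cell above, `read_model_name('config.ini')` should return `'bert-base-uncased'`.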

In [None]:
import configparser

import torch
from transformers import AutoTokenizer, AutoModel

def get_text_embedding(text, config_file='config.ini'):
    """
    Embed a text with the transformer model named in a config file.

    Args:
      text: The English text to embed (a single string).
      config_file: Path to an INI file whose [model] section has a `name` option.

    Returns:
      A NumPy array of shape (1, hidden_size): the mean of the last-layer token vectors.
    """
    # Read the model name from the config file
    config = configparser.ConfigParser()
    config.read(config_file)
    model_name = config.get('model', 'name')

    # Load the tokenizer and model (downloaded on first use, then cached locally)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize the text; truncation keeps long inputs within the model's maximum length
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Run the model without tracking gradients, which inference does not need
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the last hidden state over the token dimension to get one vector
    embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings.numpy()

# Example usage:
text = "This is a sample sentence."
embedding = get_text_embedding(text)
print(embedding)





[[-6.38738796e-02 -4.28365737e-01 -6.67791516e-02 -3.84301543e-01
  -6.57844692e-02 -2.18256712e-01  4.76355165e-01  4.86585647e-01
   4.06466424e-05 -7.42733032e-02 -7.47402012e-02 -4.76345003e-01
  -1.97731793e-01  2.48242348e-01 -1.21619247e-01  1.66783273e-01
   2.10446060e-01 -1.45755380e-01  1.26364425e-01  1.86348483e-02
   2.46397227e-01  5.70897281e-01 -4.70136940e-01  1.37820095e-01
   7.36503124e-01 -3.38082016e-01 -5.03306016e-02 -1.64524272e-01
  -4.35167968e-01 -1.28996313e-01  1.65156007e-01  3.40043575e-01
  -1.49296954e-01  2.24215575e-02 -1.04884043e-01 -5.19162595e-01
   3.29641640e-01 -2.21617594e-01 -3.42063248e-01  1.19932838e-01
  -7.01478362e-01 -2.31263906e-01  1.12237230e-01  1.25501096e-01
  -2.51906514e-01 -4.63743001e-01 -2.72606201e-02 -2.84153402e-01
  -9.92493853e-02 -3.70169654e-02 -8.91916752e-01  2.50046223e-01
   1.58158571e-01  2.27008805e-01 -2.84966260e-01  4.53001261e-01
   5.09447046e-03 -7.94410348e-01 -3.10075462e-01 -1.74034357e-01
   4.30291
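
The embedding printed above is the mean of the token vectors in the last hidden state. To make that pooling step concrete, here is a minimal pure-Python sketch of averaging over the token axis, with an optional attention mask so that padding positions are ignored. The function name `masked_mean_pool` and the toy numbers are illustrative assumptions, not part of the notebook. For a single unpadded sentence, as in the example above, the masked and plain means are the same, so the mask only matters once several texts are padded into one batch.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors over the sequence axis, skipping masked-out positions.

    token_vectors: list of per-token vectors, shape (seq_len, hidden_size)
    attention_mask: list of 0/1 flags, one per token (1 = real token, 0 = padding)
    """
    hidden_size = len(token_vectors[0])
    totals = [0.0] * hidden_size
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention_mask excludes every token")
    return [total / kept for total in totals]


# Toy input: two real tokens followed by one padding row (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]

# Plain mean over all three rows: the padding row drags the average down.
print(masked_mean_pool(tokens, [1, 1, 1]))

# Masked mean over the two real tokens only.
print(masked_mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

In PyTorch the same idea is usually written with tensor operations on `last_hidden_state` and the tokenizer's `attention_mask`; the loop form here is only meant to show the arithmetic.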

Create a function that accepts text as input and returns the embedding of that text using a transformer model. The function should take two arguments: the text to be embedded and the model name, which is read from a config file.
The input text is in English.