
# Tutorial: Textual Embedding

### Created by Etzion Harari | Geo-AI Course

[**https://github.com/EtzionR/LM4GeoAI**](https://github.com/EtzionR/LM4GeoAI)

## Imports

In [1]:
from transformers import pipeline, AutoTokenizer
import numpy as np



## Define embedding model

In [2]:
MODEL = "intfloat/multilingual-e5-large"
MODEL

'intfloat/multilingual-e5-large'

## Init embedding object

In [3]:
embedder = pipeline("feature-extraction", model=MODEL)
embedder

Device set to use cpu


<transformers.pipelines.feature_extraction.FeatureExtractionPipeline at 0x7cf5a574ac00>

## Define embedding function

In [4]:
def get_text_embedding(text):
    # Returns an array of shape (1, n_tokens, hidden_size): one vector per token
    return np.array(embedder(text))

get_text_embedding('input text for example')

array([[[ 0.79483914,  0.09726129, -0.67902607, ..., -0.17881775,
         -0.8551029 , -0.64124048],
        [ 1.01271474, -0.09400418, -0.65372515, ..., -0.16159202,
         -0.41561291, -0.86768848],
        [ 0.99399877, -0.10537175, -0.44847006, ..., -0.04342989,
         -0.83776927, -0.63593787],
        [ 1.15808368,  0.15189037, -0.38839754, ..., -0.25002098,
         -0.74189728, -0.88681883],
        [ 0.91491306,  0.02136497, -0.65068048, ..., -0.02586937,
         -0.74884582, -1.05341482],
        [ 0.79489964,  0.09747247, -0.67871487, ..., -0.17891693,
         -0.8550998 , -0.64096588]]])
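
The output above holds one vector per token (including the `<s>` and `</s>` markers), so its shape is `(1, n_tokens, hidden_size)`. When a single vector per text is needed, a common choice is to average over the token axis. The following is a minimal sketch and not part of the original notebook: `mean_pool` is a hypothetical helper name, and the random array only stands in for a real model output.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average token vectors of shape (1, n_tokens, dim) into one (dim,) vector."""
    token_embeddings = np.asarray(token_embeddings)
    return token_embeddings[0].mean(axis=0)

# Stand-in for a real pipeline output: 1 text, 6 tokens, 1024 dimensions
dummy_output = np.random.default_rng(0).normal(size=(1, 6, 1024))
text_vector = mean_pool(dummy_output)
print(text_vector.shape)  # (1024,)
```

With the real model, `mean_pool(get_text_embedding(text))` would apply the same averaging to actual token embeddings.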

## Simple embedding experiment

In [5]:
king  = 'king'
queen = 'queen'
kong  = 'kong'

# Index -2 selects the last token before the closing '</s>' marker
king_embedding = get_text_embedding(king)[0][-2]
queen_embedding = get_text_embedding(queen)[0][-2]
kong_embedding = get_text_embedding(kong)[0][-2]

# Mean squared difference between the vectors, used here as a simple distance
king_to_queen = ((king_embedding - queen_embedding)**2).mean().round(3)
king_to_kong  = ((king_embedding - kong_embedding)**2).mean().round(3)
queen_to_kong = ((queen_embedding - kong_embedding)**2).mean().round(3)

print(f'Distance between king and queen: {king_to_queen}')
print(f'Distance between king and kong: {king_to_kong}')
print(f'Distance between queen and kong: {queen_to_kong}')

Distance between king and queen: 0.235
Distance between king and kong: 0.294
Distance between queen and kong: 0.356
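
The values printed above are mean squared differences, which depend on vector magnitude as well as direction. Cosine similarity is a common alternative for comparing embeddings because it compares direction only. A minimal sketch, assuming a hypothetical helper named `cosine_similarity` and toy vectors rather than real model outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0])
v = np.array([2.0, 0.0])   # same direction as u, different length
w = np.array([0.0, 3.0])   # orthogonal to u

print(cosine_similarity(u, v))  # 1.0
print(cosine_similarity(u, w))  # 0.0
```

The same function could be applied to `king_embedding`, `queen_embedding` and `kong_embedding`; here a higher value means the vectors are more alike, the opposite of a distance.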


## Contextual embedding experiment

In [6]:
first_example_text = 'Dog, cat and mouse'
second_example_text = 'Computer, keyboard and mouse'

# len() on a string counts characters, not tokens
print(f'First text length: {len(first_example_text)}\nSecond text length: {len(second_example_text)}')

First text length: 18
Second text length: 28


## Tokenization process example

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

tokenized_first_text = tokenizer.convert_ids_to_tokens(tokenizer(first_example_text)['input_ids'])
tokenized_second_text = tokenizer.convert_ids_to_tokens(tokenizer(second_example_text)['input_ids'])

print(f'First text after tokenization: {tokenized_first_text} (length = {len(tokenized_first_text)})')
print(f'Second text after tokenization: {tokenized_second_text} (length = {len(tokenized_second_text)})')

First text after tokenization: ['<s>', '▁Dog', ',', '▁cat', '▁and', '▁mouse', '</s>'] (length = 7)
Second text after tokenization: ['<s>', '▁Computer', ',', '▁keyboard', '▁and', '▁mouse', '</s>'] (length = 7)
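
Both texts tokenize to seven pieces, with `▁mouse` sitting just before the closing `</s>`, which is why index `-2` picks out its embedding in this notebook. A small sketch of looking the position up by name instead of hard-coding it; `token_index` is a hypothetical helper, and the token lists are copied from the output above.

```python
def token_index(tokens, target):
    """Return the position of the first token equal to `target`."""
    return tokens.index(target)

first_tokens = ['<s>', '▁Dog', ',', '▁cat', '▁and', '▁mouse', '</s>']
second_tokens = ['<s>', '▁Computer', ',', '▁keyboard', '▁and', '▁mouse', '</s>']

print(token_index(first_tokens, '▁mouse'))   # 5
print(token_index(second_tokens, '▁mouse'))  # 5
```

Position 5 in a seven-token list is the same element as index `-2`. A lookup like this would also guard against words that the tokenizer splits into several pieces, where `-2` would only cover the final piece.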


## First input embedding output

In [8]:
first_output = get_text_embedding(first_example_text)

first_output.shape

(1, 7, 1024)

## Second input embedding output

In [9]:
second_output = get_text_embedding(second_example_text)

second_output.shape

(1, 7, 1024)

## The embedding for "mouse" (for the first text)

In [10]:
first_output[0][-2]

array([ 0.4099046 ,  0.2268908 , -0.85698509, ..., -0.42554754,
       -1.81469822, -0.54044926])

## The embedding for "mouse" (for the second text)

In [11]:
second_output[0][-2]

array([ 0.96612006,  0.6694116 , -1.14001191, ..., -0.12522601,
       -0.43087587, -0.5866484 ])
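
The two vectors above belong to the same token, "mouse", yet they differ because each one reflects its surrounding text. To quantify the gap, the mean squared difference from the earlier experiment can be reused. A minimal sketch: `mean_squared_difference` is a hypothetical helper, and the short arrays hold only the six values visible in the truncated outputs above, so the printed number is an illustration and not the distance between the full 1024-dimensional vectors.

```python
import numpy as np

def mean_squared_difference(a, b):
    """Mean of squared element-wise differences between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(((a - b) ** 2).mean())

# Only the values shown in the truncated outputs (first 3 and last 3 entries)
mouse_animal_partial = np.array([0.4099046, 0.2268908, -0.85698509,
                                 -0.42554754, -1.81469822, -0.54044926])
mouse_device_partial = np.array([0.96612006, 0.6694116, -1.14001191,
                                 -0.12522601, -0.43087587, -0.5866484])

print(mean_squared_difference(mouse_animal_partial, mouse_device_partial))
```

In the notebook itself, `mean_squared_difference(first_output[0][-2], second_output[0][-2])` would compare the complete vectors.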

Created by Etzion Harari | Geo-AI Course | [https://github.com/EtzionR/LM4GeoAI](https://github.com/EtzionR/LM4GeoAI)