# Semantic emoji search

# Team members
* Mylena Paes Santos Matsuki - 11202230790

* Pedro Paulo Ayala Yamada - 11202020731

# Introduction
Emojis are used in an ever wider range of contexts, from symbols for fan clubs to emblems of social movements, and their meaning can vary from one culture to another. This project implements semantic emoji search using embeddings and metadata generated by Meta's "Llama 3" LLM. Each emoji is represented not only by its metadata but also by an embedding produced with a multilingual transformer, so queries can be written in more than 50 languages rather than in English only. In addition, the embeddings are stored in Qdrant, a vector database, to make the semantic search faster.

## Dependencies

In [None]:
import os
import sys
from google.colab import drive

In [None]:
! python --version

Python 3.10.12


In [None]:
! python --upgrade

unknown option --upgrade
usage: python3 [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.


In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir('/content/drive/MyDrive/Q 2024.2/PLN/seminario_PLN/emojeez')

In [None]:
! pip install -r requirements.txt

Collecting aiohttp==3.9.5 (from -r requirements.txt (line 1))
  Downloading aiohttp-3.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting altair==5.3.0 (from -r requirements.txt (line 3))
  Downloading altair-5.3.0-py3-none-any.whl.metadata (9.2 kB)
Collecting anyio==4.4.0 (from -r requirements.txt (line 5))
  Downloading anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting asttokens==2.4.1 (from -r requirements.txt (line 6))
  Downloading asttokens-2.4.1-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting attrs==23.2.0 (from -r requirements.txt (line 7))
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting blinker==1.8.2 (from -r requirements.txt (line 8))
  Downloading blinker-1.8.2-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools==5.3.3 (from -r requirements.txt (line 9))
  Downloading cachetools-5.3.3-py3-none-any.whl.metadata (5.3 kB)
Collecting colorama==0.4.6 (from -r requirements.txt (line 13))
  Downloading

In [None]:
! pip install -U sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Using cached sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


In [None]:
! pip install qdrant-client

Collecting qdrant-client
  Downloading qdrant_client-1.11.1-py3-none-any.whl.metadata (10 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client)
  Downloading grpcio_tools-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting httpx>=0.20.0 (from httpx[http2]>=0.20.0->qdrant-client)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting grpcio>=1.41.0 (from qdrant-client)
  Downloading grpcio-1.66.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client)
  Using cached httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h2<5,>=3 (from httpx[http2]>=0.20.0->qdrant-client)
  Using cached h2-4.1.0-py3-none-any.whl.metadata (3.6 kB)
Downloading qdrant_client-1.11.1-py3-none-any.whl (259 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m259.4/259.4 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownl

In [None]:
import numpy as np
import pickle
from typing import Dict, List, Any
import random
from tqdm.notebook import tqdm
from sentence_transformers import SentenceTransformer
from qdrant_client import models, QdrantClient
import emoji as em
import warnings

warnings.filterwarnings('ignore')

  from tqdm.autonotebook import tqdm, trange


## Representing emojis as embeddings

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*bW0-VPzolusp2GKCDCA1wQ.png)

**Llama 3**: LLM used to generate the metadata for each emoji. Because it was trained on web data, it can describe how each emoji is used in different contexts.
  *Documentation*: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

**Dataset**: LLM-generated description, Unicode character, and semantic tags for each emoji.
  *Documentation*: https://huggingface.co/datasets/badrex/llm-emoji-dataset

In [None]:
# Load the emoji dictionary: each emoji is a key, and its value holds the description and the semantic tags
with open('data/emoji_llm.pkl', 'rb') as file:
    emoji_dict: Dict[str, Dict[str, Any]] = pickle.load(file)
emoji_dict

{'🥇': {'Emoji': '🥇',
  'Description': 'This emoji represents a first place medal, often used to symbolize victory, achievement, and being the best in a competition or event.',
  'Semantic_Tags': ['first place',
   'victory',
   'achievement',
   'success',
   'competition',
   'winner',
   'award']},
 '🥈': {'Emoji': '🥈',
  'Description': 'This emoji represents a silver medal, often used to symbolize coming in second place or achieving a high level of success.',
  'Semantic_Tags': ['medal',
   'silver',
   'second place',
   'achievement',
   'success',
   'ranking']},
 '🥉': {'Emoji': '🥉',
  'Description': "This emoji represents a bronze medal, symbolizing third place or a commendable achievement, often used in sports and competitions to recognize participants' efforts and accomplishments.",
  'Semantic_Tags': ['medal',
   'bronze',
   'third place',
   'achievement',
   'sports',
   'competition',
   'recognition']},
 '🆎': {'Emoji': '🆎',
  'Description': 'This emoji represents the AB b

In [None]:
# Initialize the sentence encoder: this embedding model converts the emoji descriptions into high-dimensional vectors.
embedding_model: str = 'paraphrase-multilingual-MiniLM-L12-v2'
sentence_encoder = SentenceTransformer(embedding_model)

# Build a richer description for each emoji by combining the existing description with a sentence that lists its usage contexts, taken from the semantic tags.
for emoji in tqdm(emoji_dict):
    try:
        description = emoji_dict[emoji]['Description']
        semantic_tags = emoji_dict[emoji]['Semantic_Tags']

        emoji_dict[emoji]['LLM_description'] = description + \
            ' This emoji is usually used in the contexts of: ' + \
            ', '.join(str(s) for s in semantic_tags[:-1]) + \
            ', and ' + \
            str(semantic_tags[-1]) + '.'

    # Report any error raised while building the description
    except Exception as e:
        print(f"Error occurred for emoji: {emoji}. Error message: {str(e)}")



# Initialize the dictionary that stores the embeddings
vector_dict: Dict[str, np.ndarray] = {}
# Create one embedding vector per emoji.
for emoji in tqdm(emoji_dict):
    vector_dict[emoji] = sentence_encoder.encode(
        emoji_dict[emoji]['LLM_description']
    )

# Build a dictionary that combines the existing emoji data with the generated embedding.
emoji_embeddings_dict: Dict[str, Dict[str, Any]] = {
    emoji: {
        **emoji_dict[emoji],
        "embedding": vector_dict[emoji]
    }
    for emoji in emoji_dict
}

emoji_embeddings_dict

Output hidden; open in https://colab.research.google.com to view.
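The description-building loop above assumes every emoji has at least two tags: with a single tag it would produce `: , and tag.`, and with an empty tag list it would raise an `IndexError` (which the `try`/`except` only prints). As a minimal sketch, a helper such as the hypothetical `build_llm_description` below (not part of the original notebook) could cover those cases while producing the same text for three or more tags:

```python
from typing import List


def build_llm_description(description: str, tags: List[str]) -> str:
    """Append a tag-based context sentence to an emoji description."""
    tags = [str(t) for t in tags]
    if not tags:
        # No tags available: keep the description unchanged.
        return description
    if len(tags) == 1:
        tag_text = tags[0]
    elif len(tags) == 2:
        tag_text = f"{tags[0]} and {tags[1]}"
    else:
        # Same format as the notebook loop: "a, b, and c".
        tag_text = ", ".join(tags[:-1]) + ", and " + tags[-1]
    return f"{description} This emoji is usually used in the contexts of: {tag_text}."


# Shortened, illustrative description (not the dataset's full text).
example = build_llm_description(
    "This emoji represents a silver medal.",
    ["medal", "silver", "second place"],
)
print(example)
```

The two-tag format ("a and b") is a design choice made for this sketch; the original loop does not define one.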

## Running the search

The following functions are defined:
* `load_dictionary`: loads the emoji dictionary
* `load_encoder`: loads the sentence encoder
* `load_qdrant_client`: creates the vector database holding every emoji in the dictionary

Qdrant documentation: https://python-client.qdrant.tech/ \\
Documentation for the Sentence Transformers model used: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

In [None]:
# Load the emoji dictionary
def load_dictionary(file_path: str) -> Dict[str, Dict[str, Any]]:

    with open(file_path, 'rb') as file:
        emoji_dict = pickle.load(file)
    return emoji_dict


# Load the encoder model
def load_encoder(model_name: str) -> SentenceTransformer:

    sentence_encoder = SentenceTransformer(model_name)
    return sentence_encoder


# Create the Qdrant vector database
def load_qdrant_client(emoji_dict: Dict[str, Dict[str, Any]]) -> QdrantClient:

    # Set up an in-memory database
    vector_DB_client = QdrantClient(":memory:")
    embedding_dict = {
        emoji: np.array(metadata['embedding'])
        for emoji, metadata in emoji_dict.items()
    }

    # Remove the embeddings from the metadata so they are stored as
    # point vectors rather than duplicated in the payload
    # (note: this mutates emoji_dict in place)
    for emoji in list(emoji_dict):
        del emoji_dict[emoji]['embedding']

    embedding_dim = next(iter(embedding_dict.values())).shape[0]

    # Create the emoji collection in the database
    vector_DB_client.create_collection(
        collection_name="EMOJIS",
        vectors_config=models.VectorParams(
            size=embedding_dim,
            distance=models.Distance.COSINE
        ),
    )

    # Insert the points into the database
    vector_DB_client.upload_points(
        collection_name="EMOJIS",
        points=[
            models.PointStruct(
                id=idx,
                vector=embedding_dict[emoji].tolist(),
                payload=emoji_dict[emoji]
            )
            for idx, emoji in enumerate(emoji_dict)
        ],
    )

    return vector_DB_client


The function below returns the `n` most relevant emojis for the input query.

In [None]:
def retrieve_relevant_emojis(
        embedding_model: SentenceTransformer,
        vector_DB_client: QdrantClient,
        query: str,
        num_to_retrieve: int) -> List[models.ScoredPoint]:
    """
    Return the emojis most similar to the query as scored Qdrant hits,
    using the sentence encoder and Qdrant.
    """

    # Embed the query
    query_vector = embedding_model.encode(query).tolist()

    # Retrieve the closest points, ranked by cosine similarity
    hits = vector_DB_client.search(
        collection_name="EMOJIS",
        query_vector=query_vector,
        limit=num_to_retrieve,
    )

    return hits
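Because the collection is created with `Distance.COSINE`, the score attached to each hit is a cosine similarity, so higher means more similar. The sketch below shows that ranking idea as a brute-force search in plain Python. The helper names and the three-dimensional toy vectors are made up for illustration, and the code says nothing about how Qdrant performs the search internally.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          vectors: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for sentence embeddings (hypothetical values).
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
}
ranking = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ranking)
```

Here `cat` ranks first and `kitten` second, since their vectors point in nearly the same direction as the query.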

In [None]:
# Load the text encoder
model_name = 'paraphrase-multilingual-MiniLM-L12-v2'
sentence_encoder = load_encoder(model_name)

# Load the dictionary of metadata and embeddings
embedding_dict = load_dictionary('data/emoji_embeddings_dict.pkl')

# Load the vector database
vector_DB_clinet = load_qdrant_client(embedding_dict)

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/4.12k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/471M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
def show_top_10(query: str) -> None:
    """
    Print the ten most relevant emojis in order of relevance.
    """
    emojis = retrieve_relevant_emojis(
        sentence_encoder,
        vector_DB_clinet,
        query,
        num_to_retrieve=10
    )

    for i, hit in enumerate(emojis, start=1):

        emoji_char = hit.payload['Emoji']
        score = hit.score

        # Column width grows with the number of code points in the emoji sequence
        _spec = len(emoji_char) + 3

        # Turn the alias returned by demojize (':some_name:') into an upper-case label
        unicode_desc = ' '.join(em.demojize(emoji_char).split('_'))[1:-1].upper()

        print(f"{i:<3} {emoji_char:<{_spec}}", end='')
        print(f"{score:<7.3f}", end= '')
        print(f"{unicode_desc:<55}")
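The label printed in the last column comes from the alias returned by `em.demojize`. The sketch below isolates that string transformation on a hard-coded alias, which is assumed here to be what `demojize` returns for 😼, so that it runs without the `emoji` package. `format_label` and `format_label_alt` are illustrative names, not functions from the notebook.

```python
def format_label(demojized: str) -> str:
    """Same steps as in show_top_10: underscores to spaces, drop the colons, upper-case."""
    return " ".join(demojized.split("_"))[1:-1].upper()


def format_label_alt(demojized: str) -> str:
    """Equivalent spelling for aliases wrapped in a single pair of colons."""
    return demojized.strip(":").replace("_", " ").upper()


alias = ":cat_with_wry_smile:"  # assumed demojize output for the wry-smile cat emoji
print(format_label(alias))      # CAT WITH WRY SMILE
print(format_label_alt(alias))  # CAT WITH WRY SMILE
```

Note that the slice `[1:-1]` always drops the first and last character, so it relies on the alias being wrapped in colons.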

In [None]:
show_top_10('cat smiling')  # English

1   😼   0.651  CAT WITH WRY SMILE                                     
2   😸   0.643  GRINNING CAT WITH SMILING EYES                         
3   😹   0.611  CAT WITH TEARS OF JOY                                  
4   😻   0.603  SMILING CAT WITH HEART-EYES                            
5   😺   0.596  GRINNING CAT                                           
6   🐱   0.522  CAT FACE                                               
7   🐈   0.513  CAT                                                    
8   🐈‍⬛   0.495  BLACK CAT                                              
9   😽   0.468  KISSING CAT                                            
10  🐆   0.452  LEOPARD                                                


In [None]:
show_top_10('protect from evil eye')  # English

1   🧿   0.409  NAZAR AMULET                                           
2   👓   0.405  GLASSES                                                
3   🥽   0.387  GOGGLES                                                
4   👁   0.383  EYE                                                    
5   🦹🏻   0.382  SUPERVILLAIN LIGHT SKIN TONE                           
6   👀   0.374  EYES                                                   
7   🦹🏿   0.370  SUPERVILLAIN DARK SKIN TONE                            
8   🛡️   0.369  SHIELD                                                 
9   🦹🏼   0.366  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE                    
10  🦹🏻‍♂   0.364  MAN SUPERVILLAIN LIGHT SKIN TONE                       


In [None]:
show_top_10('proteção contra mal olhar')  # Brazilian Portuguese

1   🥽   0.405  GOGGLES                                                
2   🛡️   0.397  SHIELD                                                 
3   👓   0.396  GLASSES                                                
4   💂🏻‍♀️   0.385  WOMAN GUARD LIGHT SKIN TONE                            
5   💂🏻‍♀   0.385  WOMAN GUARD LIGHT SKIN TONE                            
6   👀   0.384  EYES                                                   
7   🕵🏻‍♀   0.382  WOMAN DETECTIVE LIGHT SKIN TONE                        
8   🕵🏻‍♀️   0.382  WOMAN DETECTIVE LIGHT SKIN TONE                        
9   👁   0.380  EYE                                                    
10  💂🏻   0.379  GUARD LIGHT SKIN TONE                                  


In [None]:
show_top_10('يحمي من العين الشريرة')  # Arabic

1   🧿   0.442  NAZAR AMULET                                           
2   👓   0.430  GLASSES                                                
3   👁   0.414  EYE                                                    
4   🥽   0.403  GOGGLES                                                
5   👀   0.403  EYES                                                   
6   🦹🏻   0.398  SUPERVILLAIN LIGHT SKIN TONE                           
7   🙈   0.394  SEE-NO-EVIL MONKEY                                     
8   🫣   0.387  FACE WITH PEEKING EYE                                  
9   🧛🏻   0.385  VAMPIRE LIGHT SKIN TONE                                
10  🦹🏼   0.383  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE                    


In [None]:
show_top_10('Vor dem bösen Blick schützen')  # German

1   😷   0.369  FACE WITH MEDICAL MASK                                 
2   🫣   0.364  FACE WITH PEEKING EYE                                  
3   🛡️   0.360  SHIELD                                                 
4   🙈   0.359  SEE-NO-EVIL MONKEY                                     
5   👀   0.353  EYES                                                   
6   🙉   0.350  HEAR-NO-EVIL MONKEY                                    
7   👁   0.346  EYE                                                    
8   🧿   0.345  NAZAR AMULET                                           
9   💂🏿‍♀️   0.345  WOMAN GUARD DARK SKIN TONE                             
10  💂🏿‍♀   0.345  WOMAN GUARD DARK SKIN TONE                             


In [None]:
show_top_10('Προστατέψτε από το κακό μάτι')  # Greek

1   👓   0.497  GLASSES                                                
2   🥽   0.484  GOGGLES                                                
3   👁   0.452  EYE                                                    
4   🕶️   0.430  SUNGLASSES                                             
5   🕶   0.430  SUNGLASSES                                             
6   👀   0.429  EYES                                                   
7   👁️   0.415  EYE                                                    
8   🧿   0.411  NAZAR AMULET                                           
9   🫣   0.404  FACE WITH PEEKING EYE                                  
10  😷   0.391  FACE WITH MEDICAL MASK                                 


In [None]:
show_top_10('Защитете от лошото око')  # Bulgarian

1   👓   0.475  GLASSES                                                
2   🥽   0.452  GOGGLES                                                
3   👁   0.448  EYE                                                    
4   👀   0.418  EYES                                                   
5   👁️   0.412  EYE                                                    
6   🫣   0.397  FACE WITH PEEKING EYE                                  
7   🕶️   0.387  SUNGLASSES                                             
8   🕶   0.387  SUNGLASSES                                             
9   😝   0.375  SQUINTING FACE WITH TONGUE                             
10  🧿   0.373  NAZAR AMULET                                           


In [None]:
show_top_10('防止邪眼')  # Japanese (kanji only; the string also reads as Chinese)

1   👓   0.425  GLASSES                                                
2   🥽   0.397  GOGGLES                                                
3   👁   0.392  EYE                                                    
4   🧿   0.383  NAZAR AMULET                                           
5   👀   0.380  EYES                                                   
6   🙈   0.370  SEE-NO-EVIL MONKEY                                     
7   😷   0.369  FACE WITH MEDICAL MASK                                 
8   🕶️   0.363  SUNGLASSES                                             
9   🕶   0.363  SUNGLASSES                                             
10  🫣   0.360  FACE WITH PEEKING EYE                                  


In [None]:
show_top_10('邪眼から守る')  # Japanese (alternative phrasing)

1   🙈   0.379  SEE-NO-EVIL MONKEY                                     
2   🧿   0.379  NAZAR AMULET                                           
3   🙉   0.370  HEAR-NO-EVIL MONKEY                                    
4   😷   0.363  FACE WITH MEDICAL MASK                                 
5   🙊   0.363  SPEAK-NO-EVIL MONKEY                                   
6   🫣   0.355  FACE WITH PEEKING EYE                                  
7   🛡️   0.355  SHIELD                                                 
8   👁   0.351  EYE                                                    
9   🦹🏼   0.350  SUPERVILLAIN MEDIUM-LIGHT SKIN TONE                    
10  👓   0.350  GLASSES                                                


In [None]:
show_top_10('amuleto contra mau olhar')  # Brazilian Portuguese

1   🫣   0.611  FACE WITH PEEKING EYE                                  
2   😝   0.605  SQUINTING FACE WITH TONGUE                             
3   👀   0.582  EYES                                                   
4   🧿   0.576  NAZAR AMULET                                           
5   🦂   0.574  SCORPION                                               
6   🥽   0.573  GOGGLES                                                
7   🧑🏼‍🦯‍➡   0.573  PERSON WITH WHITE CANE FACING RIGHT MEDIUM-LIGHT SKIN TONE
8   🧑🏼‍🦯‍➡️   0.573  PERSON WITH WHITE CANE FACING RIGHT MEDIUM-LIGHT SKIN TONE
9   👓   0.569  GLASSES                                                
10  👨🏼‍🦯‍➡️   0.567  MAN WITH WHITE CANE FACING RIGHT MEDIUM-LIGHT SKIN TONE


In [None]:
show_top_10('risada')  # Brazilian Portuguese

1   🤣   0.434  ROLLING ON THE FLOOR LAUGHING                          
2   😂   0.425  FACE WITH TEARS OF JOY                                 
3   😆   0.409  GRINNING SQUINTING FACE                                
4   😄   0.396  GRINNING FACE WITH SMILING EYES                        
5   😉   0.393  WINKING FACE                                           
6   😁   0.388  BEAMING FACE WITH SMILING EYES                         
7   😃   0.386  GRINNING FACE WITH BIG EYES                            
8   😏   0.381  SMIRKING FACE                                          
9   🙂‍↕️   0.378  HEAD SHAKING VERTICALLY                                
10  🫠   0.378  MELTING FACE                                           


# Limitations


1. The model's training may be subject to sociocultural biases and may amplify stereotypes.
2. Precision: the user may have to go through several emojis before finding the intended one.
3. New contexts: emoji usage keeps evolving, so new contexts constantly emerge and the model has to be updated to learn them.
4. Dependence on the cultural context of the language.
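One concrete source of noise in the result lists earlier in this notebook is that some emojis appear twice with identical scores (for example SUNGLASSES and WOMAN GUARD). The pairs appear to differ only by the variation selector U+FE0F, which is an assumption based on the printed output rather than on an inspection of the dataset. Under that assumption, a post-processing step along these lines could free up slots in the top 10. `dedupe_hits` is a hypothetical helper that works on plain `(emoji, score)` tuples rather than on Qdrant hit objects.

```python
from typing import Iterable, List, Tuple

VARIATION_SELECTOR_16 = "\ufe0f"  # requests emoji-style presentation


def dedupe_hits(hits: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep the first hit for each emoji, ignoring U+FE0F when comparing."""
    seen = set()
    unique: List[Tuple[str, float]] = []
    for emoji_char, score in hits:
        key = emoji_char.replace(VARIATION_SELECTOR_16, "")
        if key in seen:
            continue
        seen.add(key)
        unique.append((emoji_char, score))
    return unique


# Illustrative hits modeled on the Greek query output.
hits = [
    ("\U0001F576\ufe0f", 0.430),  # sunglasses with variation selector
    ("\U0001F576", 0.430),        # sunglasses without it
    ("\U0001F440", 0.429),        # eyes
]
print(dedupe_hits(hits))
```

Since deduplication shortens the list, one would likely retrieve more than ten hits before filtering and then truncate.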



# Conclusion
This project demonstrates semantic search for emojis, which can be seen as an improvement over traditional, more limited keyword-based search. However, as noted above, challenges remain, such as handling sociocultural variation, and there are still opportunities to explore this model further.