In this repository, we built a word-card recommendation system using the text-bison (PaLM 2) model and embeddings from SBERT, a pre-trained sentence-embedding model.
The goal is to make it easier for educators to find words that fit a given situation by recommending word cards that match it.
```
📦Say_Better_ML
 ┣ 📂image
 ┃ ┣ 📜Say-Better_logo1.png
 ┃ ┣ 📜Say_Better-System-Architecture.drawio.png
 ┃ ┣ 📜say_better_embedding_graph2d.png
 ┃ ┗ 📜say_better_embedding_graph3d.png
 ┣ 📂preprocessing
 ┃ ┗ 📜sentence_transformer_word_embedding.ipynb
 ┣ 📂recommender
 ┃ ┣ 📜create_relate_word.py
 ┃ ┗ 📜recommend_word_card.py
 ┣ 📜KAAC_basic.csv
 ┣ 📜main.py
 ┣ 📜README.md
 ┣ 📜requirements.txt
 ┣ 📜word_card_embed.npy
 ┗ 📜your_key.json
```
Preprocessing
- The word cards contained many stopwords and no spaces, so in preprocessing we removed the stopwords and inserted spaces using the text-bison (PaLM 2) model.
- We then vectorized the cleaned word cards using SBERT.
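The resulting vectors are stored in `word_card_embed.npy` so the service can load them without re-running the model. A minimal sketch of the normalization-and-save step (the helper name is ours; in the real pipeline the input vectors come from SBERT):

```python
import numpy as np

def normalize_embeddings(embeddings):
    """L2-normalize word-card vectors so cosine similarity reduces to a dot product."""
    vecs = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)  # guard against zero-norm rows

# Persist for serving, e.g.:
# np.save("word_card_embed.npy", normalize_embeddings(card_vectors))
```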
Operation
- The user enters a situation.
- The Cloud Function receives the input and passes the situation to Vertex AI.
- Vertex AI returns 10 keywords to the Cloud Function.
- The Cloud Function vectorizes the keywords and computes their cosine similarity against the 543 word-card vectors.
- For each keyword, the three word cards with the highest cosine similarity are added to the list.
- The Cloud Function returns a total of 30 word cards.
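The similarity-and-selection steps above can be sketched with plain NumPy (the function name and the assumption of pre-computed vectors are ours; the real logic runs inside the Cloud Function):

```python
import numpy as np

def recommend_cards(keyword_vecs, card_vecs, top_k=3):
    """Return, per keyword, the indices of the top_k most cosine-similar word cards."""
    kw = keyword_vecs / np.linalg.norm(keyword_vecs, axis=1, keepdims=True)
    cards = card_vecs / np.linalg.norm(card_vecs, axis=1, keepdims=True)
    sims = kw @ cards.T                         # (num_keywords, num_cards) cosine matrix
    top = np.argsort(-sims, axis=1)[:, :top_k]  # top_k card indices per keyword row
    return top.ravel().tolist()
```

With 10 keyword vectors and `top_k=3`, this yields the 30 recommended cards described above.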
The codebase was developed with Python 3.10.12. After creating an environment, install the requirements as follows:

```
pip install -r requirements.txt
```
We used the Hugging Face pre-trained model kykim/bert-kor-base. Loading the pre-trained model is as easy as running the following code:
```python
from transformers import BertTokenizer, TFBertModel

model_id = 'kykim/bert-kor-base'
tokenizer = BertTokenizer.from_pretrained(model_id)
model = TFBertModel.from_pretrained(model_id)
```
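`TFBertModel` returns token-level hidden states, so an SBERT-style sentence vector is typically obtained by mean-pooling over the non-padding tokens. A minimal NumPy sketch of that pooling step (the function is illustrative, not taken from the repo):

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings while ignoring padding positions."""
    mask = attention_mask[..., None].astype(np.float32)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)      # sum of real-token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # number of real tokens
    return summed / counts
```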
This graph shows the embedding vectors of the word cards.
Click on the image to zoom in.
Word cards with similar meanings cluster together.
The text-bison (PaLM 2) model was used for keyword extraction.
The kykim/bert-kor-base model was used for sentence-similarity analysis.
The word cards were provided by KAAC with copyright permission; 2,000 word cards were used.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Google Cloud Function Overview
SayBetter-TeamDoc:
(1) https://github.com/Say-Better/Team-Docs
SayBetter-Server:
(1) https://github.com/Say-Better/Service-Server
SayBetter-Front: