## Team Members
1) Meet Patel (C0871240)

## Deliverables for your Assignment:

1) Describe the MODEL that you are using in your code as the Model for your Embedding. Research and discuss WHY you chose that model. How is it of particular value to your Project Business Domain?

2) Research and select a MODEL for your Embedding (and therefore later your Project), and support and defend your reasoning and decision making as to why you chose that MODEL for your Use Cases and Business Domain.

3) If you were doing this at work: What licensing and pricing considerations for using the APIs would you take into account?

1) **Model Description and Selection**: The model used in the code is `jinaai/jina-embeddings-v2-base-en`¹. It is an English, monolingual embedding model that supports sequences of up to 8192 tokens¹. It is based on a BERT architecture (JinaBert) that uses the symmetric bidirectional variant of ALiBi to support longer sequences¹. The backbone, `jina-bert-v2-base-en`, is pretrained on the C4 dataset¹. The model is then further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives¹. These pairs come from various domains and were selected through a thorough cleaning process¹. The model was chosen for its ability to handle long sequences, which makes it well suited to tasks that involve long documents, such as long document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search¹.

2) **Model Value for Business Domain**: The `jinaai/jina-embeddings-v2-base-en` model is of particular value to many business domains due to its extended context capabilities⁴. For instance, in the legal domain, it can capture and analyze intricate details in extensive legal texts effectively⁴. In the medical research domain, it can holistically embed scientific papers for advanced analytics and discoveries⁴. The model's ability to handle long sequences makes it especially useful when processing long documents is needed¹.

3) **Licensing and Pricing Considerations**: The `jinaai/jina-embeddings-v2-base-en` model is freely available under the Apache 2.0 license³. There is no license fee for the model itself, which makes it a cost-effective choice for businesses, although running it yourself still incurs compute and hosting costs. Apache 2.0 permits commercial use, but you should still review its terms (for example, retaining the license and attribution notices) to ensure compliance. If the model is accessed through a hosted API instead, pricing depends on the specific API provider and on usage. Factors to consider include the number of API calls needed, data transfer costs, and whether the provider offers a free tier or volume discounts. Always check the provider's pricing documentation for current figures.
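
The usage-based factors above can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the function name `estimate_monthly_cost`, the per-million-token pricing scheme, and every number in it are hypothetical placeholders, not prices from any real provider.

```python
def estimate_monthly_cost(tokens_per_month, price_per_million_tokens, free_tokens=0):
    """Rough monthly cost for a hypothetical per-token embedding API."""
    # Tokens covered by a free tier are not billed.
    billable = max(tokens_per_month - free_tokens, 0)
    return billable / 1_000_000 * price_per_million_tokens


# Placeholder numbers for illustration only; not real prices.
cost = estimate_monthly_cost(
    tokens_per_month=50_000_000,
    price_per_million_tokens=0.10,
    free_tokens=10_000_000,
)
print(f"Estimated monthly cost: {cost:.2f}")
```

Swapping in the figures from a provider's pricing page gives a quick way to compare a hosted API against the cost of self-hosting the open model.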

Source: Conversation with Bing, 11/18/2023
(1) jinaai/jina-embeddings-v2-base-en · Hugging Face. https://huggingface.co/jinaai/jina-embeddings-v2-base-en.
(2) jina-embeddings-v2-base-en model | Clarifai - The World's AI. https://clarifai.com/jinaai/jina-embeddings/models/jina-embeddings-v2-base-en.
(3) Jina AI's Open-Source Embedding Model Outperforms OpenAI's Ada - InfoQ. https://www.infoq.com/news/2023/11/jina-ai-embeddings/.
(4) jinaai/jina-embeddings-v2-small-en · Hugging Face. https://huggingface.co/jinaai/jina-embeddings-v2-small-en.
(5) Embedding API - jinaai.cn. https://www.jinaai.cn/embeddings/.
(6) Jina AI’s jina-embeddings-v2: an open source text embedding model that .... https://www.baseten.co/blog/jina-embeddings-v2-open-source-text-embedding-that-matches-openai-ada-002/.
(7) Jina Embeddings - Finetuner documentation. https://finetuner.jina.ai/get-started/pretrained/.

# Before using a pretrained model from Hugging Face, let's create our own model

In [None]:

corpus = [
    'I love machine learning',
    'I love deep learning',
    'Deep learning is a subfield of machine learning',
    'AI is fascinating',
    'Machine learning is fascinating'
]


In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Convert text to sequence of integers
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
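
To make the prefix-and-pad step above easier to follow, here is a minimal pure-Python sketch of the same idea. It does not use Keras: the `word_index` built here is a hypothetical stand-in that numbers words in order of first appearance, so the integer ids may differ from those assigned by the Keras `Tokenizer`. The helper names `prefix_sequences` and `pad_pre` are illustrative as well.

```python
corpus = [
    "I love machine learning",
    "I love deep learning",
    "Deep learning is a subfield of machine learning",
    "AI is fascinating",
    "Machine learning is fascinating",
]

# Stand-in tokenizer: lowercase, split on spaces, number words from 1
# in order of first appearance (0 is reserved for padding).
word_index = {}
for line in corpus:
    for word in line.lower().split():
        word_index.setdefault(word, len(word_index) + 1)


def prefix_sequences(token_ids):
    """Return every prefix of length >= 2: [w1, w2], [w1, w2, w3], ..."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]


def pad_pre(seq, maxlen):
    """Left-pad with zeros so the last position always holds the label."""
    return [0] * (maxlen - len(seq)) + seq


sequences = []
for line in corpus:
    ids = [word_index[w] for w in line.lower().split()]
    sequences.extend(prefix_sequences(ids))

max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

print("vocabulary size (with padding id):", len(word_index) + 1)
print("number of training sequences:", len(padded))
print("padded length:", max_len)
print("first padded sequence:", padded[0])
```

Each sentence with n words contributes n - 1 prefixes, so this small corpus yields 18 training sequences of padded length 8, which leaves 7 input positions once the last token is split off as the label.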

Model Explanation

1. **Embedding Layer**: The first layer is an Embedding layer, which is used for word embeddings. Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. The Embedding layer takes the vocabulary size (`total_words`) and the length of the input sequences (`max_sequence_len-1`) as arguments and maps each integer-encoded word to a dense vector of fixed size (10 in this case). In a `Sequential` model such as this one, it is used as the first layer.

2. **LSTM Layer**: The next layer is an LSTM (Long Short-Term Memory) layer with 50 units. LSTM is a type of recurrent neural network (RNN) that can learn and remember over long sequences and is much less prone to the vanishing gradient problem, which is a common issue with traditional RNNs. This makes LSTMs useful for processing and making predictions based on time series data or any data where the order of the elements matters.

3. **Dense Layer**: The final layer is a Dense layer, which is a regular densely-connected neural network layer. It implements the operation: `output = activation(dot(input, kernel) + bias)`. Here, `total_words` is the dimensionality of the output space and `softmax` is the activation function. The softmax function outputs a vector that represents the probability distribution of a list of potential outcomes.

4. **Compilation**: Finally, the model is compiled with the `adam` optimizer and the `categorical_crossentropy` loss function, which is suitable for multi-class classification problems. The model's performance is measured with the `accuracy` metric during training and testing.
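
As an illustration of the softmax activation and the `categorical_crossentropy` loss described above, the following is a minimal pure-Python sketch. It is not how Keras implements them internally; the logits and the one-hot target are made-up values chosen only to show the arithmetic.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


logits = [2.0, 1.0, 0.1]   # hypothetical Dense-layer outputs before softmax
target = [1, 0, 0]         # one-hot label: the first class is the true next word

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)

print("probabilities:", [round(p, 3) for p in probs])
print("loss:", round(loss, 3))
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what the optimizer works toward during training.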

In [4]:
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))  # Embedding layer
model.add(LSTM(50))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 7, 10)             120       
                                                                 
 lstm (LSTM)                 (None, 50)                12200     
                                                                 
 dense (Dense)               (None, 12)                612       
                                                                 
Total params: 12932 (50.52 KB)
Trainable params: 12932 (50.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
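
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard formulas for these layer types (an embedding table, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and a dense layer with bias); the variable names are illustrative.

```python
vocab_size = 12    # total_words, from the (None, 12) output of the Dense layer
embed_dim = 10     # size of each word vector
lstm_units = 50

# Embedding: one vector of length embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per (LSTM unit, output class) pair plus one bias per class.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
# Size in KB with 4 bytes per float32 parameter
print(round(total_params * 4 / 1024, 2))
```

The results (120, 12200, 612, and 12932 in total, about 50.52 KB as float32) match the figures reported by `model.summary()`.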


In [5]:
from tensorflow.keras.utils import to_categorical

# Splitting data into predictors and label
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

# One-hot encoding the labels
y = to_categorical(y, num_classes=total_words)

# Training the model
model.fit(X, y, epochs=200, verbose=1)

Epoch 1/200
...
Epoch 78

<keras.src.callbacks.History at 0x1eccd89d110>
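
The predictor/label split and one-hot encoding performed before training can be illustrated without TensorFlow. This is a minimal sketch on a tiny hand-made batch: the padded sequences and the vocabulary size of 5 are hypothetical values, and the helper names are illustrative rather than part of any library.

```python
def split_predictors_label(padded_sequences):
    """All tokens except the last are the input; the last token is the label."""
    X = [seq[:-1] for seq in padded_sequences]
    y = [seq[-1] for seq in padded_sequences]
    return X, y


def one_hot(index, num_classes):
    """Vector of zeros with a single 1.0 at the label's position."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


# Hypothetical pre-padded prefix sequences (0 is the padding id).
padded = [
    [0, 0, 1, 2],
    [0, 1, 2, 3],
    [1, 2, 3, 4],
]
num_classes = 5  # hypothetical vocabulary size, including the padding id

X, y = split_predictors_label(padded)
Y = [one_hot(label, num_classes) for label in y]

print(X)
print(y)
print(Y[0])
```

Each row of `X` is a left-padded context and the matching row of `Y` marks the next word, which is the shape of data the softmax output layer is trained against.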

# Extracting the embeddings from our trained embedding layer

In [6]:
embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]

# Create a dictionary to store the embeddings
word_embeddings = {}
for word, i in tokenizer.word_index.items():
    word_embeddings[word] = weights[i]

In [7]:
print(word_embeddings['machine'])

[-0.1577532  -0.26447624 -0.27063128 -0.20841956  0.2808393  -0.28767025
 -0.3072741   0.08063859  0.25579762  0.16986361]
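
Once the vectors are in a dictionary like `word_embeddings`, a common next step is to look up which words lie closest to a given word. The sketch below shows one way to do this with cosine similarity in plain Python. The three-dimensional `toy_embeddings` are made-up values for illustration, not the trained 10-dimensional vectors, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up vectors standing in for the trained embeddings.
toy_embeddings = {
    "machine": [0.9, 0.1, 0.0],
    "deep": [0.8, 0.2, 0.1],
    "fascinating": [0.0, 0.1, 0.9],
}

for other, score in most_similar("machine", toy_embeddings):
    print(other, round(score, 3))
```

With a corpus as small as the one used here, the learned vectors are unlikely to capture much meaning, so neighbours found this way should be treated as a demonstration of the mechanics rather than a quality signal.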


# Now let's use a model from Hugging Face

Installation: The `transformers` library, which provides pre-trained models for various text-related tasks, is installed with the command `!pip install transformers`.

Imports: The `AutoModel` class from the `transformers` library and the `norm` function from the `numpy.linalg` module are imported.

Cosine Similarity Function: A function named `cos_sim` is defined to calculate the cosine similarity between two vectors, that is, the cosine of the angle between two non-zero vectors. Values close to 1 indicate that the vectors point in nearly the same direction.

Model Loading: The pre-trained model `jinaai/jina-embeddings-v2-base-en` is loaded with the `AutoModel.from_pretrained` method. The `trust_remote_code=True` argument is required to use the model's `encode` method.

Encoding and Similarity Calculation: Two sentences, 'How is the weather today?' and 'What is the current weather like today?', are encoded with the pre-trained model. The cosine similarity between the resulting embeddings is then calculated with the `cos_sim` function.

In [1]:
# !pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))






0.9341315


In [2]:
embeddings

array([[-0.34827104, -0.60091805,  0.6022362 , ..., -0.2523272 ,
         0.23249894, -0.7026478 ],
       [-0.11724894, -0.89896137,  0.4500913 , ..., -0.02847653,
        -0.22871459, -0.42282885]], dtype=float32)