# **Basic information**

Every sentence can be described in terms of tokens, a vocabulary, and a sequence; this section introduces each of them.
Several libraries can extract this information. We start with spaCy, which ships ready-made models, language support, and processing methods. After that, we look at how text is embedded into a vector space.

## **spaCy**

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.
It can be used to build information extraction or natural language understanding systems.


In [None]:
# Install the libraries used in this notebook

!pip install spacy

'''Download a pretrained pipeline from spaCy. Here we download and use en_core_web_md (an English pipeline optimized for CPU).
spaCy also supports other languages (https://spacy.io/usage/models), such as Persian; their pipelines are downloaded
separately in the multi-language section (xx_sent_ud_sm is used for Persian).
'''
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 43.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Features of the spaCy library:

| **Name**                     | **Description** |
|------------------------------|---------------|
| **Tokenization**            | Segmenting text into words, punctuation marks, etc. |
| **Part-of-speech (POS) Tagging** | Assigning word types to tokens (e.g., verb, noun). |
| **Dependency Parsing**      | Assigning syntactic dependency labels (e.g., subject, object) to describe token relations. |
| **Lemmatization**          | Assigning the base forms of words (e.g., "was" → "be", "rats" → "rat"). |
| **Sentence Boundary Detection (SBD)** | Finding and segmenting individual sentences. |
| **Named Entity Recognition (NER)** | Labelling named "real-world" objects (e.g., persons, companies, locations). |
| **Entity Linking (EL)**    | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity**             | Comparing words, text spans, or documents to measure similarity. |
| **Text Classification**    | Assigning categories/labels to a document or parts of a document. |
| **Rule-based Matching**    | Finding token sequences based on text/linguistic patterns (like regex for NLP). |
| **Training**              | Updating and improving a statistical model’s predictions. |
| **Serialization**         | Saving objects (e.g., models, docs) to files or byte strings. |


There are several libraries similar to **spaCy** for Natural Language Processing (NLP), each with its own strengths. Here are some popular alternatives:

### **1. Hugging Face Transformers (🤗)**
   - **Best for:** State-of-the-art (SOTA) transformer models (BERT, GPT, T5, etc.)
   - **Features:**
     - Pre-trained models for tasks like text classification, NER, summarization, translation.
     - Easy inference with the `pipeline()` API; fine-tuning through the `Trainer` API.
     - Supports PyTorch & TensorFlow.
   - **Website:** [huggingface.co](https://huggingface.co)

### **2. NLTK (Natural Language Toolkit)**
   - **Best for:** Education, research, and basic NLP tasks.
   - **Features:**
     - Tokenization, stemming, POS tagging, parsing.
     - Large collection of corpora and lexical resources.
     - Less optimized for production than spaCy.
   - **Website:** [nltk.org](https://www.nltk.org/)

### **3. Stanza (by Stanford NLP)**
   - **Best for:** Multilingual NLP with high accuracy.
   - **Features:**
     - Supports 70+ languages.
     - Dependency parsing, NER, POS tagging.
     - Built on PyTorch.
   - **Website:** [stanfordnlp.github.io/stanza](https://stanfordnlp.github.io/stanza/)

### **4. Flair (by Zalando Research)**
   - **Best for:** Contextual embeddings & advanced NLP.
   - **Features:**
     - Built on PyTorch.
     - Supports embeddings (BERT, ELMo, Flair).
     - Good for NER and text classification.
   - **Website:** [github.com/flairNLP/flair](https://github.com/flairNLP/flair)

### **5. Gensim**
   - **Best for:** Topic modeling & word embeddings.
   - **Features:**
     - Implements Word2Vec, Doc2Vec, FastText.
     - LDA for topic modeling.
     - Not for deep learning tasks.
   - **Website:** [radimrehurek.com/gensim](https://radimrehurek.com/gensim/)

### **6. AllenNLP**
   - **Best for:** Research & custom deep learning NLP models.
   - **Features:**
     - Built on PyTorch.
     - High-level API for NLP tasks.
     - Good for prototyping new models.
   - **Website:** [allennlp.org](https://allennlp.org/)

### **7. TextBlob**
   - **Best for:** Simple NLP tasks (beginners).
   - **Features:**
     - Built on NLTK & Pattern.
     - Sentiment analysis, translation, noun phrase extraction.
     - Easy-to-use API.
   - **Website:** [textblob.readthedocs.io](https://textblob.readthedocs.io/)

### **Comparison Table**
| Library          | Best For                     | Deep Learning Support | Production Ready | Multilingual |
|------------------|-----------------------------|----------------------|------------------|--------------|
| **spaCy**       | Fast, production NLP        | ✅ (via extensions)  | ✅               | ✅ (20+ langs) |
| **Hugging Face**| SOTA transformers           | ✅ (PyTorch/TF)      | ✅               | ✅ (100+ langs) |
| **NLTK**        | Education/research          | ❌                   | ❌               | ✅ (limited) |
| **Stanza**      | Accurate multilingual NLP   | ✅ (PyTorch)         | ✅               | ✅ (70+ langs) |
| **Flair**       | Contextual embeddings       | ✅ (PyTorch)         | ✅               | ✅ (limited) |
| **Gensim**      | Topic modeling/embeddings   | ❌                   | ✅               | ✅ (limited) |
| **AllenNLP**    | Custom deep NLP models      | ✅ (PyTorch)         | ⚠️ (research)   | ✅ |
| **TextBlob**    | Simple NLP tasks            | ❌                   | ❌               | ✅ (limited) |

### **Which One Should You Choose?**
- **For production pipelines** → **spaCy** (fast) or **Hugging Face** (transformers).
- **For research/education** → **NLTK**, **AllenNLP**, or **Flair**.
- **For multilingual tasks** → **Stanza** or **Hugging Face**.
- **For embeddings & topic modeling** → **Gensim**.


In [None]:
# Import spaCy and load the pretrained pipeline
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

''' The word vectors that spaCy assigns to each word or symbol are semantic.
This means the vector for "bus" should lie close to the vector for "car", because both are vehicles.'''

# Inspect the vector of a vocabulary entry
print(nlp.vocab['bus'].vector,'\n',100*'-')
# Test the tokenizer
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print("Tokens of the input sentence: ",'\n')
for token in doc:
    print(token.text)

print(100*'-','\n',"Tokens of the input sentence with additional information: ",'\n')
for token in doc:
    print(token.text, token.has_vector, token.vector_norm)

[-6.0572e-01 -1.0594e-01 -3.9338e-01 -1.5178e-01 -6.6658e-02 -2.9802e-01
  2.9612e-01 -4.3170e-01  7.1819e-02  2.4310e+00 -3.2158e-01 -6.7728e-02
 -2.7922e-01 -1.7656e-01 -6.6799e-01 -3.2203e-01 -8.3831e-02 -9.6224e-02
  4.0088e-01  3.2672e-02  1.7744e-01 -6.0914e-01  3.2836e-01 -3.2014e-01
 -9.8430e-02 -1.6396e-01  1.8223e-02 -1.8930e-01  2.9505e-01 -7.8922e-01
 -1.5273e-01 -2.0994e-01 -6.2897e-02  3.0917e-01  1.9331e-02  5.8974e-02
 -4.5463e-02 -3.4386e-01 -3.7858e-01  4.5687e-01 -1.5136e-01 -5.4682e-01
  5.1279e-01  3.7433e-01 -4.1068e-01 -6.2451e-03  2.0224e-01 -2.4735e-01
  1.5234e-01  3.6067e-01 -4.7580e-01  3.1522e-01  1.6621e-01  3.0444e-01
 -5.9939e-01  4.8027e-01 -9.8689e-02 -5.0677e-01  1.1521e-01 -3.1151e-01
 -6.6905e-02 -1.0644e-01 -2.4069e-01  1.3970e-01  1.3763e-01 -1.8278e-02
 -4.0451e-02 -1.7977e-01 -8.9384e-02  2.7004e-01 -2.8265e-01 -4.2389e-01
  7.8729e-01 -3.2836e-01  1.6841e-01 -2.2462e-01 -3.7333e-01  3.7803e-01
 -3.1320e-01  6.9726e-01  7.1111e-02  1.4135e-01 -1

In [None]:
# Vocabulary: every string (word, punctuation mark, etc.) is mapped to a 64-bit hash ID, and the ID maps back to the string
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee
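
The two-way lookup above can be pictured with a small stand-in. The sketch below is not spaCy's implementation: the class name `ToyStringStore` is hypothetical, and `hashlib.blake2b` stands in for spaCy's own hash function, so the IDs will differ from spaCy's. It only illustrates the idea of string → 64-bit ID → string.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a string store: string <-> 64-bit integer ID."""

    def __init__(self):
        self._id_to_string = {}

    def add(self, text: str) -> int:
        # Hash the UTF-8 bytes down to 8 bytes (64 bits); NOT spaCy's hash function.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._id_to_string[key] = text
        return key

    def __getitem__(self, key):
        # A string looks up (and registers) its ID; an integer looks up its string.
        if isinstance(key, str):
            return self.add(key)
        return self._id_to_string[key]


store = ToyStringStore()
coffee_id = store["coffee"]
print(coffee_id)         # some 64-bit integer
print(store[coffee_id])  # coffee
```

Because the ID is derived from the string itself, the same string always maps to the same ID, which is what lets different objects share one vocabulary.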


## **Similarity between two vectors**

Measuring the similarity between two vectors is one way to quantify how far apart they are. Because embeddings place related meanings close together, this tells us whether two words or texts are semantically close. There are many techniques ([overview](https://www.elastic.co/search-labs/blog/vector-similarity-techniques-and-scoring)) for calculating similarity, including L1 distance, L2 distance, cosine similarity, dot product similarity, and maximum inner product similarity. Here we focus on cosine similarity.
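
To make the listed measures concrete, here is a minimal sketch that computes four of them for two small made-up vectors. It uses only the standard library; the function names are illustrative and do not come from any of the libraries in this notebook.

```python
import math


def l1_distance(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def l2_distance(a, b):
    """Straight-line (Euclidean) distance."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dot_product(a, b):
    """Sum of coordinate-wise products."""
    return sum(p * q for p, q in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

print("L1 distance:      ", l1_distance(x, y))
print("L2 distance:      ", round(l2_distance(x, y), 3))
print("Dot product:      ", dot_product(x, y))
print("Cosine similarity:", round(cosine_similarity(x, y), 3))
```

Note the direction of each measure: for L1 and L2 a smaller value means "closer", while for dot product and cosine similarity a larger value means "closer". Cosine similarity ignores vector length and looks only at direction.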


In [None]:
'''
Cosine similarity measures the similarity between two non-zero vectors by calculating the cosine of the angle between them. It is widely used in machine learning and data analysis, especially in text analysis, document comparison, search queries, and recommendation systems.

Similarity measure calculates the distance between data objects based on their feature dimensions in a dataset.
A smaller distance indicates a higher similarity, while a larger distance indicates a lower similarity.

The formula for the cosine similarity between two vectors is:

Cs(x, y) = (x . y) / (||x|| × ||y||)
where,

x . y = dot product of the vectors 'x' and 'y'.
||x|| and ||y|| = lengths (magnitudes) of the two vectors 'x' and 'y'.
||x|| × ||y|| = ordinary product of the two magnitudes.

Example
Consider an example to find the similarity between two vectors 'x' and 'y' using cosine similarity.
The 'x' vector has values x = { 3, 2, 0, 5 } and the 'y' vector has values y = { 1, 0, 0, 0 }. The formula for calculating the cosine similarity is:
Cs(x, y) = (x . y) / (||x|| × ||y||)

x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = √38 ≈ 6.16

||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1

Cs(x, y) = 3 / (6.16 * 1) ≈ 0.49

The dissimilarity between the two vectors 'x' and 'y' is given by 1 - Cs(x, y) = 1 - 0.49 = 0.51
'''

def cosine_sim(v1,v2):
  return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


bus_v = nlp.vocab['bus'].vector
car_v = nlp.vocab['car'].vector
cat_v = nlp.vocab['cat'].vector
horse_v = nlp.vocab['horse'].vector

print(f'Similarity between bus and car: {cosine_sim(bus_v,car_v)}')
print(f'Similarity between horse and car: {cosine_sim(horse_v,car_v)}')

print("Check against spaCy's built-in similarity --> Similarity between bus and car: ", nlp.vocab['bus'].similarity(nlp.vocab['car']))


Similarity between bus and car: 0.21017424762248993
Similarity between horse and car: 0.16499139368534088
Check against spaCy's built-in similarity --> Similarity between bus and car:  0.21017423272132874


In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.8015960454940796
salty fries <-> hamburgers 0.5733411312103271


## **Embedding or Vectorization**

Embedding (or vectorization) is a technique that makes input text usable by a computer by mapping it into a vector space. Several libraries can do this, such as spacy-transformers; for variety, and because of its strengths, we use Sentence Transformers here.

Sentence Transformers is the go-to Python module for accessing, using, and training state-of-the-art embedding and reranker models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.
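
To make the paraphrase-mining idea concrete, here is a minimal sketch that scores every pair of embeddings and reports the closest pair. The three-dimensional vectors and the example sentences are invented toy values (real models produce hundreds of dimensions), and `most_similar_pair` is a hypothetical helper, not part of the Sentence Transformers API.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def most_similar_pair(embeddings):
    """Return (score, i, j) for the pair of rows with the highest cosine similarity."""
    return max(
        (cosine_similarity(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    )


# Toy embeddings: invented numbers, not the output of a real model.
sentences = ["I ate a burger", "I had a sandwich", "The exam is tomorrow"]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

score, i, j = most_similar_pair(embeddings)
print(f"{sentences[i]!r} <-> {sentences[j]!r}: {score:.3f}")
```

Scoring all pairs costs quadratic time in the number of sentences, which is fine for a handful of texts but is the reason large collections rely on an index instead.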




The main embedding libraries and their strengths are listed below:

### **1. Hugging Face `sentence-transformers` (Official)**
   - **Best for:** State-of-the-art (SOTA) sentence embeddings.
   - **Features:**
     - Built on top of Hugging Face Transformers.
     - Pre-trained models (e.g., `all-MiniLM-L6-v2`, `mpnet-base`).
     - Supports semantic search, clustering, and similarity tasks.
   - **GitHub:** [github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)

### **2. Hugging Face Transformers (`pipeline` + Custom Models)**
   - **Best for:** Using standalone models like `BERT`, `RoBERTa`, `T5` for embeddings.
   - **Features:**
     - Can extract embeddings from any Hugging Face model.
     - Less optimized for sentence-level tasks than `sentence-transformers`.
   - **Example:**
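
A minimal sketch of the manual mean pooling that raw transformer outputs need, assuming the per-token vectors and the attention mask are already available. The numbers are made up for illustration and do not come from a real model, and `mean_pool` is a hypothetical helper rather than a Hugging Face function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for d in range(dim):
                totals[d] += vector[d]
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / kept for total in totals]


# Toy token vectors for one sentence: three real tokens plus one padding position.
token_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [9.0, 9.0],  # padding position; must not influence the result
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```

Skipping the masked positions matters: averaging over padding as well would pull every short sentence toward the padding vector.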


### **3. FastText (by Facebook)**
   - **Best for:** Word and sentence embeddings (especially for rare words).
   - **Features:**
     - Trains subword embeddings (good for morphologically rich languages).
     - Can generate sentence embeddings by averaging word vectors.
   - **GitHub:** [github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)
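
The subword idea can be sketched in a few lines. FastText represents a word by its character n-grams, with `<` and `>` marking the word boundaries, so a rare or unseen word still shares pieces with known words. The helper `char_ngrams` below is a hypothetical illustration of that decomposition only; it does not train or load any vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, using < and > as boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share several n-grams, which is why their vectors end up related.
shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
print(sorted(shared))
```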

### **4. Gensim (`Doc2Vec`, `Word2Vec`)**
   - **Best for:** Lightweight document/paragraph embeddings.
   - **Features:**
     - `Doc2Vec` for fixed-length document embeddings.
     - No transformer-based SOTA, but fast for small datasets.

### **5. Flair (Contextual Embeddings)**
   - **Best for:** Advanced contextualized embeddings (e.g., `FlairEmbeddings`, `TransformerEmbeddings`).
   - **Features:**
     - Combines multiple embeddings (e.g., BERT + Flair).
     - Good for downstream NLP tasks (NER, classification).
   - **GitHub:** [github.com/flairNLP/flair](https://github.com/flairNLP/flair)

### **6. TensorFlow Hub (Pre-trained Encoders)**
   - **Best for:** Ready-to-use TF models for embeddings.
   - **Features:**
     - Hosts models like `Universal Sentence Encoder` (USE).
     - One-line embedding extraction.


### **7. spaCy (with `spacy-transformers`)**
   - **Best for:** Embeddings within a production NLP pipeline.
   - **Features:**
     - Integrates Hugging Face models into spaCy.
     - Supports sentence embeddings via `doc.vector` or `span.vector`.


### **8. Jina AI (`Finetuner`)**
   - **Best for:** Fine-tuning sentence embeddings for specific domains.
   - **Features:**
     - Optimizes `sentence-transformers` models for custom data.
     - Focuses on search and retrieval tasks.
   - **GitHub:** [github.com/jina-ai/finetuner](https://github.com/jina-ai/finetuner)

---

### **Comparison Table**

| **Library**               | **Strengths** | **Compared with Sentence Transformers** | **Transformer-Based?** | **Best Use Case** |
|---------------------------|--------------|-------------------------------------|------------------------|------------------|
| **Sentence Transformers** | SOTA sentence embeddings | ✅ **Optimized for sentence-level tasks** (unlike raw Hugging Face models)<br>✅ **Pre-trained models fine-tuned for similarity** (e.g., `all-mpnet-base-v2`)<br>✅ **Built-in pooling** (no manual mean/max pooling needed)<br>✅ **Semantic search/clustering support** (e.g., `util.cos_sim()`) | ✅ | Semantic search, clustering, retrieval |
| **Hugging Face (Raw Models)** | Flexible model usage | ❌ Requires manual pooling (e.g., `mean` of BERT outputs)<br>❌ Not fine-tuned for sentence similarity by default | ✅ | Custom embedding extraction |
| **FastText** | Subword embeddings, rare words | ❌ Word-level only (no native sentence embeddings)<br>❌ No transformer-based context | ❌ | Multilingual/word-level tasks |
| **Gensim** | Lightweight Doc2Vec/Word2Vec | ❌ Bag-of-words style (no contextual embeddings)<br>❌ Outperformed by transformers | ❌ | Small-scale doc similarity |
| **Flair** | Hybrid contextual embeddings | ❌ Focused on NER/classification, not sentence similarity | ✅ (optional) | NER, classification |
| **TF Hub (USE)** | Pre-trained Universal Sentence Encoder | ❌ Fixed models (less flexible than Sentence Transformers)<br>✅ Good for quick prototypes | ✅ | Quick sentence embeddings |
| **spaCy + transformers** | Production NLP pipelines | ❌ Embeddings are a side effect (not optimized for similarity)<br>✅ Integrates with NLP tasks | ✅ (with plugin) | Combined NLP + embeddings |
| **Jina AI Finetuner** | Domain-specific fine-tuning | ✅ **Extends Sentence Transformers** for custom data<br>✅ Optimized for search/retrieval | ✅ | Custom search systems |




In [None]:
# Install and import libraries
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer

 

In [None]:
'''
Sentence Transformers offers many pretrained embedding models (https://sbert.net/docs/sentence_transformer/pretrained_models.html).
Here we use one of the smallest pretrained models, all-MiniLM-L6-v2.
This model was trained on more than 1 billion sentence pairs and maps each input text to a 384-dimensional vector'''

model = SentenceTransformer("all-MiniLM-L6-v2")

# Test the embedding quality of all-MiniLM-L6-v2 on Persian sentences
Text = [
    # "I went to a restaurant and ate koobideh kebab"
    'من به رستوران مراجعه کردم و کباب کوبیده خوردم',
    # "While studying, listening to relaxing music can be helpful"
    'در هنگام درس خواندن، شنیدن موسیقی آرامشبخش می تواند کمک کننده باشد',
    # "Cafe standards for coffee and food have risen"
    'استاندارد های کافه در حوزه قهوه و خوراکی بالا رفته است',
    # "To pass the university entrance exam, we must work hard and study various books several times"
    'برای قبولی در کنکور باید تلاش کنیم و کتاب های مختلف را چندبار مطالعه کنیم'
]

# Pass the text to the model to map it into the vector space
Text_vec = model.encode(Text)


In [None]:
# Check the similarity between sentence pairs with cosine similarity
print(f'Similarity between an eating sentence and a studying sentence (1 & 2): {cosine_sim(Text_vec[0],Text_vec[1])}')
print(f'Similarity between an eating sentence and a studying sentence (3 & 4): {cosine_sim(Text_vec[2],Text_vec[3])}')
print(f'Similarity between two eating sentences (1 & 3): {cosine_sim(Text_vec[0],Text_vec[2])}')
print(f'Similarity between two studying sentences (2 & 4): {cosine_sim(Text_vec[1],Text_vec[3])}')

print("As we can see, the model can tell related and unrelated sentences apart, but its performance is not good.",'\n',100*'-')

# Check the embedding similarity with model.similarity from Sentence Transformers
similarities = model.similarity(Text_vec, Text_vec)
print(similarities)


Similarity between an eating sentence and a studying sentence (1 & 2): -0.06872143596410751
Similarity between an eating sentence and a studying sentence (3 & 4): 0.07620497792959213
Similarity between two eating sentences (1 & 3): 0.2625127136707306
Similarity between two studying sentences (2 & 4): 0.22927787899971008
As we can see, the model can tell related and unrelated sentences apart, but its performance is not good. 
 ----------------------------------------------------------------------------------------------------
tensor([[ 1.0000, -0.0687,  0.2625,  0.1134],
        [-0.0687,  1.0000,  0.0260,  0.2293],
        [ 0.2625,  0.0260,  1.0000,  0.0762],
        [ 0.1134,  0.2293,  0.0762,  1.0000]])


In [None]:
# Test the embedding quality of distiluse-base-multilingual-cased-v2
model_2 = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Pass the text to the model to map it into the vector space
Text_vec = model_2.encode(Text)

# Check the similarity between sentence pairs
print(f'Similarity between an eating sentence and a studying sentence (1 & 2): {cosine_sim(Text_vec[0],Text_vec[1])}')
print(f'Similarity between an eating sentence and a studying sentence (3 & 4): {cosine_sim(Text_vec[2],Text_vec[3])}')
print(f'Similarity between two eating sentences (1 & 3): {cosine_sim(Text_vec[0],Text_vec[2])}')
print(f'Similarity between two studying sentences (2 & 4): {cosine_sim(Text_vec[1],Text_vec[3])}')

print("distiluse-base-multilingual-cased-v2 is a multilingual model, so it should separate related and unrelated Persian sentences more clearly than all-MiniLM-L6-v2.",'\n',100*'-')

# Check the embedding similarity with model_2.similarity from Sentence Transformers
similarities = model_2.similarity(Text_vec, Text_vec)
print(similarities)


## **Building a simple sentence collection from the vectorized data of the previous section**

After embedding, the next step is to store the words, sentences, and punctuation in a collection using a vector database library such as ChromaDB, Milvus, or FAISS. Here we use ChromaDB.
You can find examples in the [Chroma repository](https://github.com/chroma-core/chroma).
The advantages of ChromaDB are that it integrates with the sentence-transformers library and supports semantic search of a new input over the entire collection.
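
Conceptually, a semantic query embeds the new input and returns the stored items with the smallest distance to it. The sketch below shows that idea with a brute-force scan and cosine distance (1 minus cosine similarity). `ToyCollection` and the three-dimensional vectors are hypothetical; a real vector database uses an approximate index instead of scanning every item, and a real embedding model instead of hand-written numbers.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm


class ToyCollection:
    """Hypothetical in-memory collection that stores (id, document, embedding) triples."""

    def __init__(self):
        self._items = []

    def add(self, ids, documents, embeddings):
        self._items.extend(zip(ids, documents, embeddings))

    def query(self, query_embedding, n_results=1):
        # Brute force: score every stored item, then keep the n closest.
        scored = sorted(
            (cosine_distance(query_embedding, emb), item_id, doc)
            for item_id, doc, emb in self._items
        )
        return [
            {"id": item_id, "document": doc, "distance": dist}
            for dist, item_id, doc in scored[:n_results]
        ]


collection = ToyCollection()
collection.add(
    ids=["id0", "id1"],
    documents=["I ate kebab at a restaurant", "Music helps while studying"],
    embeddings=[[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]],  # invented vectors
)

# Pretend this is the embedding of a food-related query.
results = collection.query([0.8, 0.2, 0.1], n_results=1)
print(results[0]["id"], results[0]["document"])
```

The returned distance is what to inspect when judging a match: a value near 0 means the query and the document point in almost the same direction, while a value near 1 means they are unrelated.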

In [None]:
!pip install chromadb
import chromadb
# chromadb.utils provides embedding functions that let us combine Sentence Transformers with Chroma
from chromadb.utils import embedding_functions

In [None]:
# Define the path where Chroma stores its data
'''
Chroma can save the database to, and load it from, the local machine by using PersistentClient
'''
Client = chromadb.PersistentClient(path='/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/')

In [None]:
# As mentioned above, Chroma can be combined with sentence-transformers.
# This embedding function lets Chroma use any Sentence Transformers model (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) during embedding.
# Load one of the Sentence Transformers models as the embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name='distiluse-base-multilingual-cased-v2')

In [None]:
# Create an empty collection that uses the distiluse-base-multilingual-cased-v2 embedding function.
# The metadata argument configures the index; here the HNSW search index is set to use cosine distance.
# Other approximate nearest-neighbor index types exist, such as PQ (product quantization), IVF (inverted file index), and IVFPQ (IVF combined with PQ); HNSW typically gives fast queries with high recall at the cost of more memory.
Collection = Client.create_collection(name="Sina", embedding_function= ef, metadata={"hnsw:space": "cosine"})

# Our Data
Text = [
    'من به رستوران مراجعه کردم و کباب کوبیده خوردم',
    'در هنگام درس خواندن، شنیدن موسیقی آرامشبخش می تواند کمک کننده باشد',
    'استاندارد های کافه در حوزه قهوه و خوراکی بالا رفته است',
    'برای قبولی در کنکور باید تلاش کنیم و کتاب های مختلف را چندبار مطالعه کنیم'
]

# ids assigns a unique ID to each sentence
Collection.add(documents=Text, ids=[f'id{i}' for i in range(len(Text))])

In [None]:
# Test with a new text. The query runs a semantic search over the whole collection and returns the IDs and documents that are semantically closest to the input.
# query is the method that performs this search on the collection.
# n_results sets how many of the closest documents are returned; here we set it to 1.

# The query text means "ghormeh sabzi" (a Persian herb stew), so a food-related sentence should come back.
query_results = Collection.query(query_texts=['قرمه سبزی'], n_results=1)
query_results

{'ids': [['id0']],
 'embeddings': None,
 'documents': [['من به رستوران مراجعه کردم و کباب کوبیده خوردم']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None]],
 'distances': [[0.8582597970962524]]}

## **Building a database from our own texts (PDFs)**

Here we build the collection from PDFs stored on the local system.
In this process we READ, SPLIT, and SAVE our documents to the chosen path.
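
The SPLIT step can be pictured as a sliding window over the text. The sketch below is a simplified fixed-size character splitter with overlap; `split_with_overlap` is a hypothetical helper, and it leaves out the separator-aware logic (paragraphs, sentences, words) that a recursive splitter adds on top.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk["start_index"], chunk["text"])
```

A larger overlap reduces the chance that an answer is cut in half at a chunk boundary, at the price of storing and embedding more duplicated text.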

In [None]:
# Install the required libraries
!pip install langchain langchain_community pypdf chromadb langchain_huggingface
!pip install sentence-transformers

In [None]:
# PyPDFDirectoryLoader reads every PDF in a directory
# RecursiveCharacterTextSplitter splits the text into chunks (paragraphs)
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
import os
import shutil

In [None]:
# Determining our data path
DATA_PATH = r"/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data"

def load_documents():
    '''
    Read every PDF in the data directory.
    Takes no arguments (it uses DATA_PATH) and returns a list of Document objects,
    one per PDF page, each carrying the page text and its metadata.
    '''
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()

documents = load_documents()
print(documents)
print("\n","The type of the output is: ",type(documents))

In [None]:
'''
To answer questions from our data, we first need to make it searchable.
One way is to split each document into smaller pieces (chunks or paragraphs) with "RecursiveCharacterTextSplitter",
which lets us set the size of each chunk.
The chunks are what we later embed and store in the vector database.
'''
def split_text(documents: list[Document]):
    '''
    Take the loaded documents and return them as chunks (paragraphs).
    RecursiveCharacterTextSplitter is used as the text splitter.
    '''
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,       # maximum chunk length in characters (length_function=len)
        chunk_overlap=100,    # characters shared between consecutive chunks
        length_function=len,
        add_start_index=True, # store each chunk's start offset in its metadata
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Show one sample chunk (guarded, since short inputs may give fewer than 11 chunks)
    if len(chunks) > 10:
        document = chunks[10]
        print(document.page_content)
        print(document.metadata)

    return chunks

chunks = split_text(documents)
print('\n',100*'-')
for chunk in chunks[:10]:
    print(chunk)

In [None]:
# Build the collection
# Embed the chunks and store them with Chroma to create the vector database.
# Instead of calling sentence-transformers directly, we use LangChain's HuggingFaceEmbeddings wrapper.

CHROMA_PATH = "/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data/chromadb"
def save_to_chroma(chunks: list[Document]):
    '''
    Build a Chroma collection from the chunks. First, any existing database at CHROMA_PATH is removed.
    Then Chroma.from_documents takes the prepared chunks, the embedding function, and the path, and creates the database.
    '''
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH) # delete the old database directory so it is rebuilt from scratch

    db = Chroma.from_documents(
        chunks, HuggingFaceEmbeddings(), persist_directory=CHROMA_PATH
    ) # HuggingFaceEmbeddings() loads a sentence-transformers model through LangChain.
    db.persist() # Writes the vector database to disk so it can be reloaded later without re-embedding; with chromadb 0.4 or newer this happens automatically.
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

In [None]:
def generate_data_store():
    '''
    generate_data_store runs the three steps in order: read the PDFs, split them into chunks, and save the chunks as vectors.
    '''
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)

generate_data_store()

In [None]:
# A quick test to see how HuggingFaceEmbeddings works
ex = "apple"
ex_1 = "orange"
ex_2 = "iphone"

embedding_function = HuggingFaceEmbeddings()
vector = embedding_function.embed_query(ex)
vector_1 = embedding_function.embed_query(ex_1)
vector_2 = embedding_function.embed_query(ex_2)
print(vector)