# **Basic information**

Every piece of text can be described in terms of tokens, a vocabulary, and sequences; this section introduces each of them.
Several libraries can extract this information. We start with spaCy, which ships ready-made pipelines, language support, and methods. After that, we look at how text is embedded into a vector space.

## **spaCy**

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.
It can be used to build information extraction or natural language understanding systems.


In [None]:
# Install spaCy

!pip install spacy

# Download a trained pipeline from spaCy. Here we use en_core_web_md, an English pipeline optimized for CPU.
# spaCy also supports other languages (https://spacy.io/usage/models); their pipelines are downloaded separately.
# For example, the multi-language section offers xx_sent_ud_sm (used here for Persian).
!python -m spacy download en_core_web_md

Features of the spaCy library:

| **Name**                     | **Description** |
|------------------------------|---------------|
| **Tokenization**            | Segmenting text into words, punctuation marks, etc. |
| **Part-of-speech (POS) Tagging** | Assigning word types to tokens (e.g., verb, noun). |
| **Dependency Parsing**      | Assigning syntactic dependency labels (e.g., subject, object) to describe token relations. |
| **Lemmatization**          | Assigning the base forms of words (e.g., "was" → "be", "rats" → "rat"). |
| **Sentence Boundary Detection (SBD)** | Finding and segmenting individual sentences. |
| **Named Entity Recognition (NER)** | Labelling named "real-world" objects (e.g., persons, companies, locations). |
| **Entity Linking (EL)**    | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity**             | Comparing words, text spans, or documents to measure similarity. |
| **Text Classification**    | Assigning categories/labels to a document or parts of a document. |
| **Rule-based Matching**    | Finding token sequences based on text/linguistic patterns (like regex for NLP). |
| **Training**              | Updating and improving a statistical model’s predictions. |
| **Serialization**         | Saving objects (e.g., models, docs) to files or byte strings. |


There are several libraries similar to **spaCy** for Natural Language Processing (NLP), each with its own strengths. Here are some popular alternatives:

### **1. Hugging Face Transformers (🤗)**
   - **Best for:** State-of-the-art (SOTA) transformer models (BERT, GPT, T5, etc.)
   - **Features:**
     - Pre-trained models for tasks like text classification, NER, summarization, translation.
     - Easy inference with the `pipeline()` API, plus fine-tuning support.
     - Supports PyTorch & TensorFlow.
   - **Website:** [huggingface.co](https://huggingface.co)

### **2. NLTK (Natural Language Toolkit)**
   - **Best for:** Education, research, and basic NLP tasks.
   - **Features:**
     - Tokenization, stemming, POS tagging, parsing.
     - Large collection of corpora and lexical resources.
     - Less optimized for production than spaCy.
   - **Website:** [nltk.org](https://www.nltk.org/)

### **3. Stanza (by Stanford NLP)**
   - **Best for:** Multilingual NLP with high accuracy.
   - **Features:**
     - Supports 70+ languages.
     - Dependency parsing, NER, POS tagging.
     - Built on PyTorch.
   - **Website:** [stanfordnlp.github.io/stanza](https://stanfordnlp.github.io/stanza/)

### **4. Flair (by Zalando Research)**
   - **Best for:** Contextual embeddings & advanced NLP.
   - **Features:**
     - Built on PyTorch.
     - Supports embeddings (BERT, ELMo, Flair).
     - Good for NER and text classification.
   - **Website:** [github.com/flairNLP/flair](https://github.com/flairNLP/flair)

### **5. Gensim**
   - **Best for:** Topic modeling & word embeddings.
   - **Features:**
     - Implements Word2Vec, Doc2Vec, FastText.
     - LDA for topic modeling.
     - Not for deep learning tasks.
   - **Website:** [radimrehurek.com/gensim](https://radimrehurek.com/gensim/)

### **6. AllenNLP**
   - **Best for:** Research & custom deep learning NLP models.
   - **Features:**
     - Built on PyTorch.
     - High-level API for NLP tasks.
     - Good for prototyping new models.
   - **Website:** [allennlp.org](https://allennlp.org/)

### **7. TextBlob**
   - **Best for:** Simple NLP tasks (beginners).
   - **Features:**
     - Built on NLTK & Pattern.
     - Sentiment analysis, translation, noun phrase extraction.
     - Easy-to-use API.
   - **Website:** [textblob.readthedocs.io](https://textblob.readthedocs.io/)

### **Comparison Table**
| Library          | Best For                     | Deep Learning Support | Production Ready | Multilingual |
|------------------|-----------------------------|----------------------|------------------|--------------|
| **spaCy**       | Fast, production NLP        | ✅ (via extensions)  | ✅               | ✅ (20+ langs) |
| **Hugging Face**| SOTA transformers           | ✅ (PyTorch/TF)      | ✅               | ✅ (100+ langs) |
| **NLTK**        | Education/research          | ❌                   | ❌               | ✅ (limited) |
| **Stanza**      | Accurate multilingual NLP   | ✅ (PyTorch)         | ✅               | ✅ (70+ langs) |
| **Flair**       | Contextual embeddings       | ✅ (PyTorch)         | ✅               | ✅ (limited) |
| **Gensim**      | Topic modeling/embeddings   | ❌                   | ✅               | ✅ (limited) |
| **AllenNLP**    | Custom deep NLP models      | ✅ (PyTorch)         | ⚠️ (research)   | ✅ |
| **TextBlob**    | Simple NLP tasks            | ❌                   | ❌               | ✅ (limited) |

### **Which One Should You Choose?**
- **For production pipelines** → **spaCy** (fast) or **Hugging Face** (transformers).
- **For research/education** → **NLTK**, **AllenNLP**, or **Flair**.
- **For multilingual tasks** → **Stanza** or **Hugging Face**.
- **For embeddings & topic modeling** → **Gensim**.


In [None]:
# Import libraries and load the spaCy pipeline
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

# The word vectors that spaCy provides are semantic: words with related meanings get nearby vectors.
# For example, the vector for "bus" should lie close to the vector for "car" because both are vehicles.

# Inspect the vector of a vocabulary entry
print(nlp.vocab['bus'].vector, '\n', 100*'-')
# Test the tokenizer
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print("Tokens of the input sentence: ", '\n')
for token in doc:
    print(token.text)

print(100*'-', '\n', "Tokens of the input sentence with vector information: ", '\n')
for token in doc:
    print(token.text, token.has_vector, token.vector_norm)

In [None]:
# Vocabulary: each string (word, punctuation mark, or other token text) is stored under a hash ID in the vocab
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

## **Similarity between two vectors**

Measuring the similarity between two vectors is one way to tell how far apart they are, and therefore whether the texts they represent are semantically close, since semantically related inputs are mapped to nearby vectors. There are many techniques ([overview](https://www.elastic.co/search-labs/blog/vector-similarity-techniques-and-scoring)) for calculating similarity, including L1 distance, L2 distance, cosine similarity, dot product similarity, and maximum inner product. Here we focus on cosine similarity.
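To make the difference between these measures concrete, here is a minimal sketch that computes four of them for one pair of toy vectors. The function names and the example vectors are illustrative assumptions; only the standard library is used, so the sketch runs without spaCy.

```python
import math

def l1_distance(x, y):
    # Sum of absolute coordinate differences (Manhattan distance)
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    # Straight-line (Euclidean) distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot_product(x, y):
    # Grows with both the alignment and the lengths of the vectors
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the magnitudes; ignores vector length
    return dot_product(x, y) / (math.sqrt(dot_product(x, x)) * math.sqrt(dot_product(y, y)))

# Toy vectors chosen only for illustration
x = [3.0, 2.0, 0.0, 5.0]
y = [1.0, 0.0, 0.0, 0.0]

print("L1 distance:      ", l1_distance(x, y))
print("L2 distance:      ", round(l2_distance(x, y), 3))
print("Dot product:      ", dot_product(x, y))
print("Cosine similarity:", round(cosine_similarity(x, y), 3))
```

Distances shrink as vectors get closer, while dot product and cosine similarity grow, so the two families are read in opposite directions.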


In [None]:
'''
Cosine similarity measures the similarity between two non-zero vectors by calculating the cosine of the angle between them. It is widely used in machine learning and data analysis, especially in text analysis, document comparison, search queries, and recommendation systems.

A similarity measure compares data objects based on their feature dimensions in a dataset.
A smaller distance indicates a higher similarity, while a larger distance indicates a lower similarity.

The formula for the cosine similarity between two vectors is:

Cs(x, y) = (x . y) / (||x|| × ||y||)
where

x . y = dot product of the vectors 'x' and 'y'.
||x|| and ||y|| = lengths (magnitudes) of the vectors 'x' and 'y'.
||x|| × ||y|| = ordinary product of the two magnitudes.

Example
Find the cosine similarity between x = { 3, 2, 0, 5 } and y = { 1, 0, 0, 0 }.

x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = √38 ≈ 6.16

||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1

Cs(x, y) = 3 / (6.16 * 1) ≈ 0.49

The dissimilarity between 'x' and 'y' is 1 - Cs(x, y) = 1 - 0.49 = 0.51
'''

def cosine_sim(v1, v2):
    # Cosine of the angle between v1 and v2 (both must be non-zero vectors)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


bus_v = nlp.vocab['bus'].vector
car_v = nlp.vocab['car'].vector
cat_v = nlp.vocab['cat'].vector
horse_v = nlp.vocab['horse'].vector

print(f'Similarity between bus and car: {cosine_sim(bus_v, car_v)}')
print(f'Similarity between cat and car: {cosine_sim(cat_v, car_v)}')
print(f'Similarity between horse and car: {cosine_sim(horse_v, car_v)}')

# Cross-check against spaCy's built-in similarity
print("spaCy similarity between bus and car: ", nlp.vocab['bus'].similarity(nlp.vocab['car']))


In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

## **Embedding or Vectorization**

Embedding (or vectorization) is a technique that makes input text usable by a computer by mapping it into a vector space. Several libraries can do this, such as spacy-transformers, but for variety and because of its strengths we will work with Sentence Transformers.

Sentence Transformers is the go-to Python module for accessing, using, and training state-of-the-art embedding and reranker models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.




Sentence Transformers and the alternative libraries, with their advantages, are listed below:

### **1. Hugging Face `sentence-transformers` (Official)**
   - **Best for:** State-of-the-art (SOTA) sentence embeddings.
   - **Features:**
     - Built on top of Hugging Face Transformers.
     - Pre-trained models (e.g., `all-MiniLM-L6-v2`, `mpnet-base`).
     - Supports semantic search, clustering, and similarity tasks.
   - **GitHub:** [github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)

### **2. Hugging Face Transformers (`pipeline` + Custom Models)**
   - **Best for:** Using standalone models like `BERT`, `RoBERTa`, `T5` for embeddings.
   - **Features:**
     - Can extract embeddings from any Hugging Face model.
     - Less optimized for sentence-level tasks than `sentence-transformers`.
   - **Example:**
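
The manual pooling that raw models require can be sketched without downloading a model. The snippet below is a minimal illustration under stated assumptions: the token vectors and the attention mask are made-up values standing in for real model outputs, and `mean_pool` is a hypothetical helper, not a function of the Transformers library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens, not padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical per-token outputs for a 3-token sentence padded to length 4
token_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [9.0, 9.0],  # padding position, ignored by the mask
]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [3.0, 4.0]
```

Sentence Transformers models apply a pooling step of this kind internally, which is why no manual pooling is needed there.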


### **3. FastText (by Facebook)**
   - **Best for:** Word and sentence embeddings (especially for rare words).
   - **Features:**
     - Trains subword embeddings (good for morphologically rich languages).
     - Can generate sentence embeddings by averaging word vectors.
   - **GitHub:** [github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)

### **4. Gensim (`Doc2Vec`, `Word2Vec`)**
   - **Best for:** Lightweight document/paragraph embeddings.
   - **Features:**
     - `Doc2Vec` for fixed-length document embeddings.
     - No transformer-based SOTA, but fast for small datasets.

### **5. Flair (Contextual Embeddings)**
   - **Best for:** Advanced contextualized embeddings (e.g., `FlairEmbeddings`, `TransformerEmbeddings`).
   - **Features:**
     - Combines multiple embeddings (e.g., BERT + Flair).
     - Good for downstream NLP tasks (NER, classification).
   - **GitHub:** [github.com/flairNLP/flair](https://github.com/flairNLP/flair)

### **6. TensorFlow Hub (Pre-trained Encoders)**
   - **Best for:** Ready-to-use TF models for embeddings.
   - **Features:**
     - Hosts models like `Universal Sentence Encoder` (USE).
     - One-line embedding extraction.


### **7. spaCy (with `spacy-transformers`)**
   - **Best for:** Embeddings within a production NLP pipeline.
   - **Features:**
     - Integrates Hugging Face models into spaCy.
     - Supports sentence embeddings via `doc.vector` or `span.vector`.


### **8. Jina AI (`Finetuner`)**
   - **Best for:** Fine-tuning sentence embeddings for specific domains.
   - **Features:**
     - Optimizes `sentence-transformers` models for custom data.
     - Focuses on search and retrieval tasks.
   - **GitHub:** [github.com/jina-ai/finetuner](https://github.com/jina-ai/finetuner)

---

### **Comparison Table**

| **Library**               | **Strengths** | **Sentence Transformers' Strengths** | **Transformer-Based?** | **Best Use Case** |
|---------------------------|--------------|-------------------------------------|------------------------|------------------|
| **Sentence Transformers** | SOTA sentence embeddings | ✅ **Optimized for sentence-level tasks** (unlike raw Hugging Face models)<br>✅ **Pre-trained models fine-tuned for similarity** (e.g., `all-mpnet-base-v2`)<br>✅ **Built-in pooling** (no manual mean/max pooling needed)<br>✅ **Semantic search/clustering support** (e.g., `util.cos_sim()`) | ✅ | Semantic search, clustering, retrieval |
| **Hugging Face (Raw Models)** | Flexible model usage | ❌ Requires manual pooling (e.g., `mean` of BERT outputs)<br>❌ Not fine-tuned for sentence similarity by default | ✅ | Custom embedding extraction |
| **FastText** | Subword embeddings, rare words | ❌ Word-level only (no native sentence embeddings)<br>❌ No transformer-based context | ❌ | Multilingual/word-level tasks |
| **Gensim** | Lightweight Doc2Vec/Word2Vec | ❌ Bag-of-words style (no contextual embeddings)<br>❌ Outperformed by transformers | ❌ | Small-scale doc similarity |
| **Flair** | Hybrid contextual embeddings | ❌ Focused on NER/classification, not sentence similarity | ✅ (optional) | NER, classification |
| **TF Hub (USE)** | Pre-trained Universal Sentence Encoder | ❌ Fixed models (less flexible than Sentence Transformers)<br>✅ Good for quick prototypes | ✅ | Quick sentence embeddings |
| **spaCy + transformers** | Production NLP pipelines | ❌ Embeddings are side effect (not optimized for similarity)<br>✅ Integrates with NLP tasks | ✅ (with plugin) | Combined NLP + embeddings |
| **Jina AI Finetuner** | Domain-specific fine-tuning | ✅ **Extends Sentence Transformers** for custom data<br>✅ Optimized for search/retrieval | ✅ | Custom search systems |




In [None]:
# Install and import libraries
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer

In [None]:
'''
Sentence Transformers offers many pretrained models for vectorizing text (https://sbert.net/docs/sentence_transformer/pretrained_models.html).
Here we use one of the smallest pretrained models, all-MiniLM-L6-v2.
It was trained on more than 1 billion sentence pairs, mostly English, and maps each input to a 384-dimensional vector.'''

model = SentenceTransformer("all-MiniLM-L6-v2")

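# Testing the performance of all-MiniLM-L6-v2 embeddings on Persian sentences
Text = [
    # "I went to a restaurant and ate koobideh kebab"
    'من به رستوران مراجعه کردم و کباب کوبیده خوردم',
    # "While studying, listening to relaxing music can be helpful"
    'در هنگام درس خواندن، شنیدن موسیقی آرامشبخش می تواند کمک کننده باشد',
    # "Café standards for coffee and food have risen"
    'استاندارد های کافه در حوزه قهوه و خوراکی بالا رفته است',
    # "To pass the university entrance exam we must work hard and study various books several times"
    'برای قبولی در کنکور باید تلاش کنیم و کتاب های مختلف را چندبار مطالعه کنیم'
]

# Encode the sentences into the vector space
Text_vec = model.encode(Text)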

In [None]:
# Checking the similarity between sentence pairs with cosine similarity
print(f'Similarity between an eating sentence and a studying sentence (1 & 2): {cosine_sim(Text_vec[0], Text_vec[1])}')
print(f'Similarity between an eating sentence and a studying sentence (3 & 4): {cosine_sim(Text_vec[2], Text_vec[3])}')
print(f'Similarity between two eating sentences (1 & 3): {cosine_sim(Text_vec[0], Text_vec[2])}')
print(f'Similarity between two studying sentences (2 & 4): {cosine_sim(Text_vec[1], Text_vec[3])}')

print("The model picks up some of the relations and differences between the sentences, but it does not perform well.", '\n', 100*'-')

# Checking embedding similarity with model.similarity from Sentence Transformers
similarities = model.similarity(Text_vec, Text_vec)
print(similarities)


In [None]:
# Testing the performance of distiluse-base-multilingual-cased-v2 embeddings
model_2 = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Encode the sentences into the vector space
Text_vec = model_2.encode(Text)

# Checking the similarity between sentence pairs with cosine similarity
print(f'Similarity between an eating sentence and a studying sentence (1 & 2): {cosine_sim(Text_vec[0], Text_vec[1])}')
print(f'Similarity between an eating sentence and a studying sentence (3 & 4): {cosine_sim(Text_vec[2], Text_vec[3])}')
print(f'Similarity between two eating sentences (1 & 3): {cosine_sim(Text_vec[0], Text_vec[2])}')
print(f'Similarity between two studying sentences (2 & 4): {cosine_sim(Text_vec[1], Text_vec[3])}')

print("The multilingual model detects the relations and differences between the sentences, and it performs well.", '\n', 100*'-')

# Checking embedding similarity with model_2.similarity from Sentence Transformers
similarities = model_2.similarity(Text_vec, Text_vec)
print(similarities)

## **Building a simple sentence collection from the vectorized data of the previous section**

After embedding, the next step is to store the words, sentences, and punctuation in a collection using a vector database library such as ChromaDB, Milvus, or FAISS. Here we use ChromaDB.
Examples are available in the [Chroma repository](https://github.com/chroma-core/chroma).
An advantage of ChromaDB is that it integrates with the sentence-transformers library and can run a semantic search for a new input over the entire collection.
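
Conceptually, a semantic search ranks every stored vector by its similarity to the query vector and returns the best matches. The sketch below shows that idea as an exhaustive scan, assuming made-up 3-dimensional embeddings; the `query` helper and its `n_results` parameter are illustrative names and not ChromaDB's implementation, which relies on an approximate index instead of scanning everything.

```python
import math

def cosine_sim(v1, v2):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def query(collection, query_vec, n_results=1):
    """Return the ids of the n_results stored vectors most similar to query_vec."""
    ranked = sorted(
        collection,
        key=lambda doc_id: cosine_sim(collection[doc_id], query_vec),
        reverse=True,
    )
    return ranked[:n_results]

# Hypothetical embeddings: the first axis loosely means "food", the last "studying"
collection = {
    "id_0": [0.9, 0.1, 0.0],
    "id_1": [0.0, 0.2, 0.9],
    "id_2": [0.8, 0.3, 0.1],
}

food_query = [1.0, 0.0, 0.0]
print(query(collection, food_query, n_results=2))
```

An exhaustive scan is exact but costs one comparison per stored vector, which is the reason vector databases use index structures for large collections.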

In [None]:
!pip install chromadb
import chromadb
# Chromadb has some utils like embedding function that allows us to combine sentence transformer with this library
from chromadb.utils import embedding_functions

In [None]:
# Define a path where the database will be saved.
# Chroma can save and load the database on the local machine through PersistentClient.
Client = chromadb.PersistentClient(path='/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/')

In [None]:
# As mentioned before, Chroma can be combined with sentence-transformers.
# The embedding function below lets us use Sentence Transformers models (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) in the embedding process.
# Load one of the Sentence Transformers models as the embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name='distiluse-base-multilingual-cased-v2')

In [None]:
# Create an empty database (collection) that uses the distiluse-base-multilingual-cased-v2 embedding function.
# The metadata argument configures the index, for example using cosine distance in the HNSW search engine.
# Alternatives to HNSW include PQ (product quantization), IVF (inverted file index), and IVFPQ (a combination of IVF and PQ); HNSW is generally fast at query time.
Collection = Client.create_collection(name="Sina", embedding_function= ef, metadata={"hnsw:space": "cosine"})

# Our Data
Text = [
    'من به رستوران مراجعه کردم و کباب کوبیده خوردم',
    'در هنگام درس خواندن، شنیدن موسیقی آرامشبخش می تواند کمک کننده باشد',
    'استاندارد های کافه در حوزه قهوه و خوراکی بالا رفته است',
    'برای قبولی در کنکور باید تلاش کنیم و کتاب های مختلف را چندبار مطالعه کنیم'
]

# The ids argument assigns a unique ID to each sentence
Collection.add(documents=Text, ids=[f'id_{i}' for i in range(len(Text))])

In [None]:
# Test with new text. The query runs a semantic search over the entire collection and returns the ID and document that are semantically closest to the input.
# query is the method that performs the search on the collection.
# n_results sets how many of the closest documents are returned; here we set it to 1.
# The query text below is "ghormeh sabzi", a Persian herb stew.

query_results = Collection.query(query_texts=['قرمه سبزی'], n_results=1)
query_results

## **Building a database from our own text (PDFs)**

Here we build the collection from PDFs stored on the local system.
In this process we READ, SPLIT, and SAVE the documents under our path.
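
The SPLIT step can be pictured as sliding a fixed-size window over the text so that neighbouring chunks share some characters. The sketch below is a simplified stand-in under that assumption: `split_with_overlap` is a hypothetical helper, and real splitters such as `RecursiveCharacterTextSplitter` additionally try to break at separators like paragraphs and newlines instead of at arbitrary positions.

```python
def split_with_overlap(text, chunk_size=400, chunk_overlap=100):
    """Cut text into chunks of at most chunk_size characters.

    Neighbouring chunks share chunk_overlap characters, so a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 10  # 100 characters of toy text
for chunk in split_with_overlap(sample, chunk_size=40, chunk_overlap=10):
    print(chunk["start_index"], len(chunk["text"]))
```

A larger overlap preserves more context across boundaries at the cost of storing more duplicated text.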

In [None]:
# Install useful libraries
!pip install langchain langchain_community pypdf chromadb langchain_huggingface
!pip install sentence-transformers

In [None]:
# To read our PDF documents we import PyPDFDirectoryLoader
# To split our texts into chunks or paragraphs we import RecursiveCharacterTextSplitter
# These import paths assume a recent LangChain release that is split into langchain_community, langchain_core, and related packages
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
import os
import shutil

In [None]:
# Path to the directory that holds our PDFs
DATA_PATH = r"/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data"

def load_documents():
    '''
    Read all PDFs in the data directory.
    Uses DATA_PATH and returns a list of Document objects, one per PDF page.
    '''
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()

documents = load_documents()
print(documents)
print("\n", "The type of the output is: ", type(documents))

In [None]:
'''
We need to feed our data to the model so that it can answer questions from it.
One way to do this is to split the text into smaller pieces (separate paragraphs or chunks).
So we take each document and turn it into small chunks, using "RecursiveCharacterTextSplitter".
This splitter lets us set how large each chunk is.
We do this because the chunks are what we store in the vector database.
'''
def split_text(documents: list[Document]):
    '''
    Take the loaded documents and return chunks (paragraphs).
    RecursiveCharacterTextSplitter is used as the text splitter.
    '''
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400, # maximum chunk length, measured in characters by len
        chunk_overlap=100, # characters shared between neighbouring chunks
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Show one sample chunk (the 11th, or the last one if there are fewer)
    if chunks:
        document = chunks[min(10, len(chunks) - 1)]
        print(document.page_content)
        print(document.metadata)

    return chunks

chunks = split_text(documents)
print('\n', 100*'-')
for chunk in chunks[:10]:
    print(chunk)
In [None]:
# Making the collection
# Embed the chunks and store them in a Chroma vector database.
# Instead of calling sentence-transformers directly, we use LangChain's HuggingFaceEmbeddings wrapper for vectorization.

CHROMA_PATH = "/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data/chromadb"
def save_to_chroma(chunks: list[Document]):
    '''
    Build a collection with Chroma. First clear the target path if it already exists.
    Then Chroma.from_documents takes the prepared chunks, the embedding function, and the path, and builds the database.
    '''
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH) # remove the old database directory and everything in it

    db = Chroma.from_documents(
        chunks, HuggingFaceEmbeddings(), persist_directory=CHROMA_PATH
    ) # HuggingFaceEmbeddings is used here instead of the sentence-transformers API.
    db.persist() # Save the vector database (index) to disk so that it can be reloaded later without reprocessing the documents.
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

In [None]:
def generate_data_store():
    '''
    generate_data_store combines our three functions (read, split, and save as vectors)
    '''
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)

generate_data_store()

In [None]:
# A quick test to see how HuggingFaceEmbeddings works
ex = "apple"
ex_1 = "orange"
ex_2 = "iphone"

embedding_function = HuggingFaceEmbeddings()
vector = embedding_function.embed_query(ex)
vector_1 = embedding_function.embed_query(ex_1)
vector_2 = embedding_function.embed_query(ex_2)
print(vector)

# **Developing Models**
Here we develop some language-model applications with Hugging Face and LangChain.

## **A definition of multi-stage and single-stage tasks**

With LLMs there are two kinds of task: single-stage and multi-stage. In a single-stage task we obtain the output directly, while in a multi-stage task we break the task down into sub-tasks and combine their results to extract the final result, such as the sentiment of an article.
Instead of using one single shot, we can first extract a summary of the article and then pass this summary to another agent that determines its sentiment.
To do this, we use models that each perform one of these tasks.
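
The structure of a multi-stage task can be sketched independently of any model: each stage is a function, and the output of one stage becomes the input of the next. In the sketch below both stages are deliberately naive, hypothetical stand-ins (first sentences as the "summary", cue-word counting as the "sentiment"); they only illustrate the wiring, not how a real model works.

```python
def summarize_stage(article: str, max_sentences: int = 2) -> str:
    """Hypothetical stand-in for a summarization model: keep the first sentences."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

def sentiment_stage(text: str) -> str:
    """Hypothetical stand-in for a sentiment model: count positive and negative cue words."""
    positive = {"good", "great", "improved"}
    negative = {"bad", "worse", "failed"}
    words = [w.strip(".,").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"

def run_pipeline(article: str) -> dict:
    # Stage 1 produces the summary; stage 2 only ever sees that summary
    summary = summarize_stage(article)
    return {"summary": summary, "sentiment": sentiment_stage(summary)}

article = "The harvest was bad. Prices got worse. Farmers hope next year is great."
result = run_pipeline(article)
print(result)
```

Because the second stage sees only the summary, anything the first stage drops (here the hopeful last sentence) cannot influence the final result, which is the main trade-off of chaining stages.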

In [None]:
!pip install transformers
# This library provides access to a multitude of large language models

In [None]:
'''
HuggingFace:
Transformers is a library of pretrained natural language processing, computer vision, audio, and multimodal models for inference and training.
 Use Transformers to train models on your data, build inference applications, and generate text with large language models. (https://huggingface.co/docs/transformers/en/index#design)
'''
from transformers import pipeline

In [None]:
'''
summarize_article summarizes an article.
It takes the article as input and returns its summary.
It is built on the Transformers pipeline with the t5-small model, chosen because
it is small enough to run easily in this environment.
The pipeline limits the summary length to between 40 and 150 tokens.
'''
def summarize_article(article):
  summarize = pipeline("summarization", model="t5-small") # pipeline takes the task and the model name and creates the summarization engine
  summary = summarize(article, max_length=150, min_length=40, do_sample=False)[0]['summary_text'] # the engine takes the input text and the minimum and maximum summary lengths (in tokens)
  return summary

In [None]:
'''
sentiment_analysis determines the sentiment of a text.
It takes text as input and returns its sentiment label and score.
t5-small is not fine-tuned for classification, so this function uses
distilbert-base-uncased-finetuned-sst-2-english, a model fine-tuned for sentiment analysis.
'''
def sentiment_analysis(text):
  sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english") # pipeline takes the task and the model name and creates the sentiment analysis engine
  sentiment = sentiment_analyzer(text)[0]
  return sentiment

In [None]:
article = """
Global surface temperature has increased faster since 1970 than in any other 50-year period over at least the last 2000 years.
Based on the global average temperature for the most recent 10-year period (2014- 2023), the Earth is now about 1.2°C warmer than it was in the pre-industrial era (1850- 1900). 2023 was the warmest year on record, with the global average near-surface temperature 1.45°C above the pre-industrial baseline. The period 2011-2020 was the warmest decade on record for both land and ocean.
Monthly and annual breaches of 1.5°C do not mean that the world has failed to achieve the Paris Agreement’s temperature goal, which refers to a long-term temperature increase over decades, not individual months or years. Temperatures for any single month or year fluctuate due to natural variability, including El Niño/La Niña and volcanic eruptions. Consequently, long-term temperature changes are typically considered on decadal timescales.
On an average day in 2023, nearly one third of the ocean was gripped by a marine heatwave. Over 90 per cent of the ocean experienced heatwave conditions at some point during 2023. Glaciers around the world thinned by an average of one meter per year and sea level rose at a rate of 4.5mm per year between 2011 and 2020. Greenland and Antarctica lost 38 per cent more ice during the period 2011-2020 than during 2001- 2010.
Every fraction of a degree of warming matters. With every additional increment of global warming, changes in extremes and risks become larger. For example, every 0.1°C increase in global warming causes clearly discernible increases in the intensity and frequency of temperature and precipitation extremes, as well as agricultural and ecological droughts in some regions.
Greenhouse gas emissions reached a new record high of 57.4 gigatonnes in 2023. They must drop by 43 per cent by 2030 (compared to 2019 levels) to keep temperature increase from exceeding 1.5°C. Under current national climate plans, the world is on track for a global average temperature rise of 2.5-2.9°C above pre-industrial levels.
Greenhouse gas concentrations in the atmosphere, already at their highest levels in 2 million years, have continued to rise. Global concentrations of carbon dioxide are now a full 50 per cent higher than they were in the pre-industrial era.
The emissions gap in 2030, or the difference between necessary carbon dioxide reduction and current trends, is estimated at 21-24 gigatons of carbon dioxide equivalent (Gt CO2e) to limit global warming to 1.5°C.
To ensure a safe and liveable planet, experts say humanity must phase out global coal production and use by 2040, and reduce oil and gas production and use by three- quarters between 2020 and 2050.
"""

summary = summarize_article(article)
print(f"Summary: {summary}")

sentiment = sentiment_analysis(summary)
print(f"Sentiment: {sentiment}")


## **LangChain**

Here we implement a multi-stage reasoning workflow (summarization followed by sentiment analysis) using LangChain.
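
Before bringing in LangChain, the idea of chaining prompts can be sketched with plain string templates: each template is filled in, sent to a model, and the answer is inserted into the next template. Everything below is an assumption made for illustration: the prompt wording, the `run_chain` helper, and `fake_llm`, which returns canned text so the sketch runs offline instead of calling a hosted model.

```python
SUMMARY_TEMPLATE = "Summarize the following article in 50 words:\n\n{article}\n\nSummary:"
SENTIMENT_TEMPLATE = "Analyze the sentiment of the following text:\n\n{summary}\n\nSentiment:"

def run_chain(llm, article):
    """Fill each template, call the model, and feed the first answer into the second prompt."""
    summary = llm(SUMMARY_TEMPLATE.format(article=article)).strip()
    sentiment = llm(SENTIMENT_TEMPLATE.format(summary=summary)).strip()
    return {"summary": summary, "sentiment": sentiment}

def fake_llm(prompt):
    """Hypothetical stand-in for a hosted model; returns canned text."""
    if prompt.startswith("Summarize"):
        return " Temperatures keep rising. "
    return " negative "

output = run_chain(fake_llm, "Global surface temperature has increased faster since 1970.")
print(output)
```

Any callable that maps a prompt string to a response string can replace `fake_llm`, which keeps the chaining logic separate from the choice of model.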

In [None]:
'''
To do this, we first install langchain, openai, langchain_community, huggingface-hub, and transformers
'''
!pip install langchain openai langchain_community
!pip install huggingface-hub==0.16.4
!pip install transformers==4.33.3

In [None]:
'''According to the LangChain website, it has a multitude of modules, such as llms.
It also has a community package with integrations contributed by other developers.
We use OpenAI and HuggingFaceEndpoint because some LLMs are hosted on remote servers rather than on Google Colab.
'''
import os
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate # Prompt templates help to translate user input and parameters into instructions for a language model.
# This can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output.
from langchain.chains import LLMChain
from langchain_community.llms import HuggingFaceEndpoint # Huggingface Endpoints: The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k demo apps (Spaces),
# all open source and publicly available, in an online platform where people can easily collaborate and build ML together.
from langchain_community.chat_models.huggingface import ChatHuggingFace # This will help us getting started with langchain_huggingface chat models

In [None]:
'''
To use the HuggingFace or OpenAI services, we need access tokens from their websites.
To get them, log in to the Hugging Face (https://huggingface.co/settings/) and OpenAI websites and create new tokens.
With these tokens we can use LLMs that are not available inside Google Colab.
'''
# os.environ["OPENAI_API_KEY"] = "xxx"
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "....."

In [None]:
'''
Here we can load a model from either OpenAI or HuggingFace.
OpenAI is not free.
'''
# llm = OpenAI(model="gpt-3.5-turbo")
# llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="summarization",
    huggingfacehub_api_token="....."
)

In [None]:
'''
After picking our model from HuggingFace, we need to define our prompt templates.
These definitions let us build a multi-stage pipeline for our task.
'''
# summarization task
'''
In this template we tell the model what it should do, for instance "Summarize the following article:", and then inject the input.
'''
summarization_prompt_template = PromptTemplate(
    input_variables=["article"],
    template="Summarize the following article in 50 words:\n\n{article}\n\nSummary:"
)

# sentiment task
'''
In this template we tell the model what it should do, for instance "Analyze the sentiment of the following text:", and then inject the input.
'''
sentiment_analysis_prompt_template = PromptTemplate(
    input_variables=["summary"],
    template="Analyze the sentiment of the following text:\n\n{summary}\n\nSentiment:"
)
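
As a rough illustration of what a prompt template does, the sketch below fills a named placeholder with plain `str.format`. It does not use LangChain; `fill_template` and the one-sentence sample article are hypothetical and exist only for this example.

```python
# Minimal sketch (no LangChain): a prompt template is a string with named placeholders.
SUMMARY_TEMPLATE = "Summarize the following article in 50 words:\n\n{article}\n\nSummary:"

def fill_template(template: str, **variables: str) -> str:
    """Substitute named variables into the template (hypothetical helper)."""
    return template.format(**variables)

example_prompt = fill_template(SUMMARY_TEMPLATE, article="Sea level rose 4.5mm per year.")
print(example_prompt)
```

The text that reaches the model is simply the instruction, the injected input, and the trailing cue ("Summary:") that invites the model to continue.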

In [None]:
'''
After defining our templates, we can build chains that perform our tasks.
To do this, we use the LLMChain class, which takes the language model and our prompt template.
'''

summarization_chain = LLMChain(
    llm=llm,
    prompt=summarization_prompt_template
)

sentiment_analysis_chain = LLMChain(
    llm=llm,
    prompt=sentiment_analysis_prompt_template
)

In [None]:
'''
Now, by defining a function, we get a multi-stage pipeline that performs both tasks in sequence.
'''
def process_article(article):
    summary = summarization_chain.run(article)
    sentiment = sentiment_analysis_chain.run(summary)
    return summary, sentiment
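
The two-stage idea can be sketched without any model at all. In the example below, `fake_llm` is a stand-in that returns canned text (an assumption made purely so the example runs offline), and `make_stage` and `process` are hypothetical helper names; the point is only that the second stage consumes the output of the first.

```python
from typing import Callable

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: returns canned text)."""
    if prompt.startswith("Summarize"):
        return "Emissions keep rising."
    return "negative"

def make_stage(template: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Build one stage: fill the template, then call the model."""
    def stage(text: str) -> str:
        return llm(template.format(text=text)).strip()
    return stage

summarize = make_stage("Summarize the following article:\n\n{text}\n\nSummary:", fake_llm)
classify = make_stage("Analyze the sentiment of the following text:\n\n{text}\n\nSentiment:", fake_llm)

def process(text: str) -> tuple[str, str]:
    summary = summarize(text)      # stage 1
    sentiment = classify(summary)  # stage 2 consumes the output of stage 1
    return summary, sentiment

print(process("Greenhouse gas emissions reached a record high in 2023."))
```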


In [None]:
article = """
Global surface temperature has increased faster since 1970 than in any other 50-year period over at least the last 2000 years.
Based on the global average temperature for the most recent 10-year period (2014-2023), the Earth is now about 1.2°C warmer than it was in the pre-industrial era (1850-1900). 2023 was the warmest year on record, with the global average near-surface temperature 1.45°C above the pre-industrial baseline. The period 2011-2020 was the warmest decade on record for both land and ocean.
Monthly and annual breaches of 1.5°C do not mean that the world has failed to achieve the Paris Agreement’s temperature goal, which refers to a long-term temperature increase over decades, not individual months or years. Temperatures for any single month or year fluctuate due to natural variability, including El Niño/La Niña and volcanic eruptions. Consequently, long-term temperature changes are typically considered on decadal timescales.
On an average day in 2023, nearly one third of the ocean was gripped by a marine heatwave. Over 90 per cent of the ocean experienced heatwave conditions at some point during 2023. Glaciers around the world thinned by an average of one meter per year and sea level rose at a rate of 4.5mm per year between 2011 and 2020. Greenland and Antarctica lost 38 per cent more ice during the period 2011-2020 than during 2001-2010.
Every fraction of a degree of warming matters. With every additional increment of global warming, changes in extremes and risks become larger. For example, every 0.1°C increase in global warming causes clearly discernible increases in the intensity and frequency of temperature and precipitation extremes, as well as agricultural and ecological droughts in some regions.
Greenhouse gas emissions reached a new record high of 57.4 gigatonnes in 2023. They must drop by 43 per cent by 2030 (compared to 2019 levels) to keep temperature increase from exceeding 1.5°C. Under current national climate plans, the world is on track for a global average temperature rise of 2.5-2.9°C above pre-industrial levels.
Greenhouse gas concentrations in the atmosphere, already at their highest levels in 2 million years, have continued to rise. Global concentrations of carbon dioxide are now a full 50 per cent higher than they were in the pre-industrial era.
The emissions gap in 2030, or the difference between necessary carbon dioxide reduction and current trends, is estimated at 21-24 gigatons of carbon dioxide equivalent (Gt CO2e) to limit global warming to 1.5°C.
To ensure a safe and liveable planet, experts say humanity must phase out global coal production and use by 2040, and reduce oil and gas production and use by three-quarters between 2020 and 2050.
"""

In [None]:
# Process the article
summary, sentiment = process_article(article)
print(f"Summary: {summary}")
print(f"Sentiment: {sentiment}")

### **LangChain: 2nd part**

In the previous section, we developed a multi-stage pipeline that summarized our text and analyzed its sentiment. Here we develop an agent that answers our questions by searching the internet and drawing on its own knowledge.

In [None]:
# Some libraries need to be installed for this implementation. Some of them are used for searching the internet (httpx, wikipedia, etc.).
!pip install -U openai httpcore httpx typing-extensions pydantic langchain
!pip install -U langchain-experimental
!pip install -U wikipedia google-search-results sqlalchemy

# To run without errors, we need to install particular versions of some libraries.
!pip install --force-reinstall pydantic==1.10.8
!pip install --force-reinstall typing-inspect==0.8.0 typing_extensions==4.5.0
!pip install --force-reinstall chromadb==0.3.26

!pip install huggingface-hub==0.16.4
!pip install transformers==4.33.3

In [None]:
# Importing libraries
import openai,os,json
import pandas as pd
from langchain import OpenAI
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, ConversationalRetrievalChain, ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.schema import messages_from_dict, messages_to_dict
from langchain.memory.chat_message_histories.in_memory import ChatMessageHistory
from langchain.agents import Tool
from langchain.agents import initialize_agent
from langchain.agents import AgentType

from io import StringIO
import sys
from typing import Dict, Optional

from langchain.agents import load_tools
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace

In [None]:
# Apply some display settings to the pandas library
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('expand_frame_repr', True)

In [None]:
# Set the API tokens from OpenAI, SerpApi (https://serper.dev/api-key), and HuggingFace.
# os.environ["OPENAI_API_KEY"] = "xxx"
# os.environ["SERPAPI_API_KEY"] = "....." # SerpApi gives our model a web-search mechanism. It has usage limits, so keep them in mind.
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "......."

In [None]:
# A Python REPL lets the agent run Python code while it works. To provide this capability, we write a class and wrap it with Tool from "langchain.agents".

class PythonREPL:
    def __init__(self):
        pass

    def run(self, command: str) -> str:
        sys.stderr.write("EXECUTING PYTHON CODE:\n---\n" + command + "\n---\n")
        old_stdout = sys.stdout
        sys.stdout = mystdout = StringIO()
        try:
            exec(command, globals())
            sys.stdout = old_stdout
            output = mystdout.getvalue()
        except Exception as e:
            sys.stdout = old_stdout
            output = str(e)
        sys.stderr.write("PYTHON OUTPUT: \"" + output + "\"\n")
        return output


python_repl = Tool(
        "Python REPL",
        PythonREPL().run,
        """A Python shell. Use this to execute python commands. Input should be a valid python command.
        If you expect output it should be printed out.""",
    )
tools_py = [python_repl] # List of Tools
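
The same capture-the-output idea can be sketched with `contextlib.redirect_stdout`, which restores `sys.stdout` automatically even when the executed code raises. This is an alternative sketch, not the class above: `run_python` is a hypothetical name, and it runs each command in a fresh namespace instead of `globals()`.

```python
import contextlib
import io

def run_python(command: str) -> str:
    """Execute a command and return whatever it printed, or the error message."""
    buffer = io.StringIO()
    try:
        # redirect_stdout puts sys.stdout back even if exec raises
        with contextlib.redirect_stdout(buffer):
            exec(command, {})  # fresh namespace for every call
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()

print(run_python("print(1 + 1)"))
print(run_python("1 / 0"))
```

Because each call gets its own namespace, variables defined in one command are not visible in the next; the class above shares state through `globals()` instead.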

In [None]:
# llm = OpenAI(model="gpt-3.5-turbo")
llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

# Some existing tools can be loaded with load_tools:
# "wikipedia" searches Wikipedia, "serpapi" searches the web, and "terminal" runs commands in a terminal.
tools = load_tools(["wikipedia", "serpapi", "terminal"], llm=llm, allow_dangerous_tools=True)

agent = initialize_agent(
    tools + tools_py, #merge all tools together
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
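
To give a feel for what a ReAct-style agent loop does, here is a simplified sketch. It assumes the model replies with `Action:` / `Action Input:` lines or a `Final Answer:` line; the scripted replies, the `calculator` tool, and `run_agent` are all hypothetical, and this is not LangChain's actual implementation.

```python
import re
from typing import Callable, Dict, List

def run_agent(llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]],
              question: str, max_steps: int = 5) -> str:
    """Loop: ask the model, run the tool it names, feed the observation back."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(scratchpad)
        final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.*?)\nAction Input:\s*(.*)", reply, re.DOTALL)
        if not action:
            return reply.strip()  # model ignored the format; return raw text
        name, tool_input = action.group(1).strip(), action.group(2).strip()
        observation = tools[name](tool_input) if name in tools else f"Unknown tool: {name}"
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "Stopped: step limit reached."

# Scripted stand-in for a model: first asks for a tool, then answers.
scripted: List[str] = [
    "Thought: I should compute this.\nAction: calculator\nAction Input: 6*7",
    "Thought: I know the answer.\nFinal Answer: 42",
]

def scripted_llm(_prompt: str) -> str:
    return scripted.pop(0)

def calculator(expression: str) -> str:
    """Toy tool: multiplies two integers written as 'a*b'."""
    a, b = expression.split("*")
    return str(int(a) * int(b))

answer = run_agent(scripted_llm, {"calculator": calculator}, "What is 6 times 7?")
print(answer)
```

The tool descriptions matter in the real agent because the model only sees their names and descriptions when deciding which `Action` to emit.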

In [None]:
agent.run(
    "Create sample fake time-series data for an airport by searching the web. You need to plot your results." # our instruction for the model
)

In [None]:
agent.run(
    "Plot a sine wave chart." # our instruction for the model
)

## **Run LLM Locally on Google Colab using GPU**

Here we implement a single-stage reasoning model (an assistant) on Google Colab using LangChain.

In [None]:
!pip install langchain openai langchain_community transformers langchain_huggingface

In [None]:
from langchain_huggingface.llms import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation", # type of task
    pipeline_kwargs={"max_new_tokens": 30}, # maximum number of new tokens the model may generate
    device=0, # index of the GPU to use
)

In [None]:
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is electroencephalography?"

print(chain.invoke({"question": question})) # Answer our question from the model

In [None]:
template = """Question: {question}

Answer: """
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "Is the weather cold in Canada?"

print(chain.invoke({"question": question})) # Answer our question from the model
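
The `prompt | hf` syntax composes two steps into one chain. The toy class below illustrates how such pipe composition can be built with Python's `__or__` operator; `Step`, `toy_prompt`, and `toy_model` are hypothetical names, and this is not how LangChain itself is implemented.

```python
from typing import Any, Callable

class Step:
    """Wrap a function so that steps can be chained with `|` (toy illustration)."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # (a | b).invoke(x) computes b(a(x))
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)

toy_prompt = Step(lambda inputs: f"Question: {inputs['question']}\n\nAnswer: ")
toy_model = Step(lambda text: text + "(model output would go here)")

toy_chain = toy_prompt | toy_model
print(toy_chain.invoke({"question": "Is the weather cold in Canada?"}))
```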

## **Run LLM based on chromadb and langchain**

We want to develop a model that can read our documents, such as an organization's PDFs, and answer our questions from them.

In [None]:
# Install useful libraries
!pip install langchain langchain_community pypdf chromadb langchain_huggingface openai tiktoken huggingface_hub
!pip install sentence-transformers

In [None]:
# To read our PDF documents we need to import a few libraries, such as PyPDFDirectoryLoader
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.vectorstores.chroma import Chroma
from langchain.chat_models import ChatOpenAI
from langchain_community.chat_models.huggingface import ChatHuggingFace
import os
import shutil
from langchain.prompts import ChatPromptTemplate

# assign hugging face token
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = ".........."

In [None]:
DATA_PATH = r"/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data"

def load_documents():
    '''
    load_documents reads all PDFs in our directory.
    It takes the data path and returns the documents.
    The output is a list holding the content of all PDFs.
    '''
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()

documents = load_documents()
print(documents)
print("\n","The type of the output is: ",type(documents))

In [None]:
'''
We need to inject our data into the model so that it can answer questions from it.
One way to do this is to split the full text into smaller pieces (separate paragraphs or chunks).
So we take each document and turn it into small chunks, which we can do with "RecursiveCharacterTextSplitter".
With this method we can set the size of each chunk.
We split the text because we will build a vector database from the chunks.
'''
def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks

chunks = split_text(documents)
print('\n',100*'-')
for chunk in chunks[:10]:
    print(chunk)
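
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-size sliding window over characters. It is only a sketch: `chunk_text` is a hypothetical helper, and RecursiveCharacterTextSplitter additionally tries to split on separators such as paragraph breaks, which this example does not do.

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list[str]:
    """Fixed-size sliding window; consecutive chunks share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# 1000 characters of dummy text
sample = "".join(str(i % 10) for i in range(1000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.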

In [None]:
# Here we embed our chunks and store them in a vector database (Chroma).
CHROMA_PATH = "/content/drive/MyDrive/Large Language Model (LLM)/Developing My Knowledge In LLM/FaraDars/Data/chromadb"
def save_to_chroma(chunks: list[Document]):
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    db = Chroma.from_documents(
        chunks, HuggingFaceEmbeddings(), persist_directory=CHROMA_PATH
    ) # here we used HuggingFaceEmbeddings instead of the sentence-transformers method.
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

In [None]:
def generate_data_store():
    '''
    generate_data_store combines our three functions (load, split, and save as vectors).
    '''
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)

generate_data_store()

In [None]:
# This is just a test to see how HuggingFaceEmbeddings works
ex = "apple"
ex_1 = "orange"
ex_2 = "iphone"

embedding_function = HuggingFaceEmbeddings()
vector = embedding_function.embed_query(ex)
vector_1 = embedding_function.embed_query(ex_1)
vector_2 = embedding_function.embed_query(ex_2)
print(vector)
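
Embedding vectors are usually compared with cosine similarity. The sketch below implements it in plain Python on made-up three-dimensional vectors; the toy values are assumptions for illustration and are not outputs of HuggingFaceEmbeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made up); real embeddings have hundreds of dimensions.
toy_apple = [0.9, 0.1, 0.3]
toy_orange = [0.8, 0.2, 0.1]
toy_iphone = [0.2, 0.9, 0.4]

print(cosine_similarity(toy_apple, toy_orange))
print(cosine_similarity(toy_apple, toy_iphone))
```

With the vectors from the cell above, the same function could be called as `cosine_similarity(vector, vector_1)` and `cosine_similarity(vector, vector_2)` to compare "apple" with "orange" and "iphone".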

**Based on our documents, we can now answer questions. For instance, since our documents are about motion in MRI, we write a question on that topic.
Then we create a prompt template that takes our question, and we semantically search for the chunks whose meaning is closest to the question.**

In [None]:
query_text = "How many types of motion are there in MR imaging?"
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


In [None]:
'''
To find the similarity between our question and the chunks, we must first embed the question.
Then we can search with "similarity_search_with_relevance_scores".
'''
embedding_function = HuggingFaceEmbeddings()
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

results = db.similarity_search_with_relevance_scores(query_text, k=3)

if len(results) == 0 or results[0][1] < 0.1:
    print(f"Unable to find matching results.")
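
The selection step can be sketched on precomputed scores: keep the `k` best items and drop anything below a minimum relevance. The chunk names and scores below are made up, and `top_k_above` is a hypothetical helper; unlike the check above, which only inspects the best result, this variant filters every result.

```python
import heapq

def top_k_above(scored: list[tuple[str, float]], k: int = 3,
                min_score: float = 0.1) -> list[tuple[str, float]]:
    """Return the k highest-scoring items, dropping any below min_score."""
    best = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return [(text, score) for text, score in best if score >= min_score]

# Made-up (chunk, relevance score) pairs
toy_scores = [("chunk A", 0.62), ("chunk B", 0.05), ("chunk C", 0.41), ("chunk D", 0.33)]
selected = top_k_above(toy_scores)
print(selected if selected else "Unable to find matching results.")
```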

In [None]:
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results]) # join the 3 retrieved chunks; "\n\n---\n\n" separates them
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE) # turn our prompt template into a chat prompt template
prompt = prompt_template.format(context=context_text, question=query_text)
print(prompt)

'''
This whole prompt is passed to an LLM. Based on our question, the information retrieved from our database, and its own knowledge, the model answers the question.
'''

In [None]:
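# LlamaCpp needs the llama-cpp-python package to be installed.
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])  # streams generated tokens to stdout

n_gpu_layers = -1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
# Note: LlamaCpp expects the path of a local GGUF file, so a file from this repository has to be downloaded first.
llm = LlamaCpp(
    model_path="https://huggingface.co/mradermacher/dolphin-2.9.3-mistral-nemo-12b-llamacppfixed-GGUF",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)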

In [None]:
!huggingface-cli login

In [None]:
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(
    # model_id="openai-community/gpt2",
    model_id="MaziyarPanahi/Llama-3-Groq-8B-Tool-Use-GGUF",
    # model_id="HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,
        do_sample=False,
        repetition_penalty=1.03,
        token = ".....",
    ),
)
model = ChatHuggingFace(llm=llm)
response_text = model.invoke(prompt).content  # invoke returns a message object; .content holds its text
sources = [doc.metadata.get("source", None) for doc, _score in results]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)

In [None]:
response_text