# **1. Create Embeddings using `sentence_transformers`** from Hugging Face
The sentence_transformers model we're using is `all-MiniLM-L6-v2`. You can find more info about it here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [47]:
(# pip install PyPDF2==3.0.1

In [2]:
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv,dotenv_values
import json

In [3]:
# Specify the path to your source document
FILE_PATH = r"C:\Users\Seyed Barabadi\Downloads\Gen AI\kebo104.pdf"

In [4]:
load_dotenv()

# specify the address for the .env file where all the keys for different services are saved.
values_env = dotenv_values(r"C:\Users\Seyed Barabadi\Downloads\Gen AI\AZURE-AI-VECTOR-SEARCH-main\azure_ai_vector_search\notebooks\keys.env")
values_env

# This model is used to create embeddings
MODEL_NAME = values_env['MODEL_NAME']
MODEL_NAME

'all-MiniLM-L6-v2'

In [12]:
# Read the PDF file and return the text
def get_pdf_data(file_path, num_pages = 1):
    reader = PdfReader(file_path)
    full_doc_text = ""
    pages = reader.pages
    num_pages = len(pages) 
    
    try:
        for page in range(num_pages):
            current_page = reader.pages[page]
            text = current_page.extract_text()
            full_doc_text += text
    except:
        print("Error reading file")
    finally:
        return full_doc_text

## When deviding the input file into chuncks the best values are:

> 512 tokens

> 25% overlapping chuks


In [13]:
import nltk.data

def create_chunks(text, max_tokens=512, overlap_ratio=0.25):
    # Tokenize the input text into sentences
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(text)

    # Initialize variables
    chunks = []
    current_chunk = []
    current_chunk_length = 0
    first_chunk = True

    # Check if adding the current sentence to the chunk would exceed the maximum new token limit
    # becaue at the end we're adding 25% of the previous chunk, we need to adjust the limit  
    adjusted_max_token_limit = max_tokens * (1 - overlap_ratio)

    # Iterate over sentences to create chunks
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence)
        num_tokens = len(tokens)
        
        if first_chunk:
            adjusted_max_token_limit = max_tokens
            first_chunk = False

        if current_chunk_length + num_tokens > adjusted_max_token_limit:
            # Append the current chunk to the list of chunks
            chunks.append(current_chunk)

            # Reset the current chunk and update its length
            current_chunk = []
            current_chunk_length = 0
            max_tokens = max_tokens * (1 - overlap_ratio)

        # Add the tokens of the current sentence to the current chunk
        current_chunk.extend(tokens)
        current_chunk_length += num_tokens

    # Append the last chunk
    if current_chunk:
        chunks.append(current_chunk)

    # Apply overlap between chunks
    overlap_size = int(max_tokens * overlap_ratio)
    for i in range(1, len(chunks)):
        chunks[i] = chunks[i - 1][-overlap_size:] + chunks[i]

    return [' '.join(chunk) for chunk in chunks]
    # return chunks




In [33]:
full_doc_text = get_pdf_data(FILE_PATH)

In [18]:
print(f'Full doc text length: {len(full_doc_text)}')

Full doc text length: 31306


In [34]:
Lines = create_chunks(full_doc_text, 512)

In [35]:
len(Lines)

12

In [36]:
type(Lines)

list

In [37]:
Lines[0]

'46 BIOLOGY When you look around , you will observe different animals with different structures and forms . As over a million species of animals have been described till now , the need for classification becomes all the more important . The classification also helps in assigning a systematic position to newly described species . 4.1 BASIS OF CLASSIFICATION Inspite of differences in structure and form of different animals , there are fundamental features common to various individuals in relation to the arrangement of cells , body symmetry , nature of coelom , patterns of digestive , circulatory or reproductive systems . These features are used as the basis of animal classification and some of them are discussed here . 4.1.1 Levels of Organisation Though all members of Animalia are multicellular , all of them do not exhibit the same pattern of organisation of cells . For example , in sponges , the cells are arranged as loose cell aggregates , i.e. , they exhibit cellular level of organis

In [24]:
model = SentenceTransformer(MODEL_NAME)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [38]:
embeddings_all = model.encode(Lines, show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [39]:
len(embeddings_all)

12

In [40]:
len(embeddings_all[4].tolist())

384

In [41]:
embeddings_all[4].tolist()[:5]

[0.007901678793132305,
 -0.02649737522006035,
 0.03473445028066635,
 -0.03499046340584755,
 -0.10507051646709442]

In [42]:
counter = 0
input_data = []

for line in Lines:
    d = {}
    d['id'] = str(counter)
    d['line'] = line
    d['embedding'] = embeddings_all[counter].tolist()
    d['filename'] = FILE_PATH.split('\\')[-1]
    counter +=  1
    input_data.append(d)

In [43]:
len(input_data)

12

In [44]:
input_data[0]

{'id': '0',
 'line': '46 BIOLOGY When you look around , you will observe different animals with different structures and forms . As over a million species of animals have been described till now , the need for classification becomes all the more important . The classification also helps in assigning a systematic position to newly described species . 4.1 BASIS OF CLASSIFICATION Inspite of differences in structure and form of different animals , there are fundamental features common to various individuals in relation to the arrangement of cells , body symmetry , nature of coelom , patterns of digestive , circulatory or reproductive systems . These features are used as the basis of animal classification and some of them are discussed here . 4.1.1 Levels of Organisation Though all members of Animalia are multicellular , all of them do not exhibit the same pattern of organisation of cells . For example , in sponges , the cells are arranged as loose cell aggregates , i.e. , they exhibit cell

In [46]:
# Save embeddings to docVectors.json file
with open("../output/docVectors.json", "w") as f:
    json.dump(input_data, f)

$$End$$