## Vectorstores And Embeddings

### What Is an Embedding

In the context of machine learning and natural language processing (NLP), an embedding refers to a numerical representation of a word, sentence, or document in a continuous vector space. It is a way to convert textual data into a format that can be effectively processed by machine learning algorithms.

Word embeddings, in particular, are widely used in NLP tasks. They capture the semantic and syntactic meaning of words by representing them as dense, low-dimensional vectors. **The main idea behind word embeddings is that similar words should have similar vector representations**, allowing algorithms to capture relationships and similarities between words.
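As a toy illustration of that idea, the sketch below compares hand-written three-dimensional vectors using cosine similarity. The vectors and the words attached to them are invented for this example; real embeddings are learned by a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked vectors; not produced by any real model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Related words should score higher than unrelated ones.
print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs car:    {cat_car:.3f}")
```

With these made-up numbers, "cat" and "kitten" point in nearly the same direction, while "cat" and "car" do not, so the first score comes out much higher than the second.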

### Vector Stores

In the context of NLP, a vector store refers to a collection or database of precomputed word embeddings or other types of embeddings. It is essentially a repository of vector representations for words, sentences, or documents.

**The vector store allows for efficient similarity search and retrieval based on vector distances**. For example, given a query word, one can compare its vector representation with the vectors of all other words in the store to find the most similar words based on cosine similarity or other distance metrics. This enables tasks like finding synonyms, identifying related terms, or building recommendation systems based on textual similarity.
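The search described above can be sketched with a small in-memory store. `ToyVectorStore` and its vectors are hypothetical and exist only for this illustration; it scans every stored vector, whereas production vector stores usually rely on an index to avoid a full scan.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory store that keeps (text, vector) pairs in a list."""

    items: list = field(default_factory=list)

    def add(self, text, vector):
        self.items.append((text, vector))

    def similarity_search(self, query_vector, k=2):
        """Return the k stored texts most similar to query_vector."""
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for text, vector in self.items
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add("cat", [0.9, 0.8, 0.1])
store.add("kitten", [0.85, 0.9, 0.15])
store.add("car", [0.1, 0.2, 0.95])
store.add("truck", [0.15, 0.1, 0.9])

# A made-up query vector pointing roughly in the "vehicle" direction.
results = store.similarity_search([0.1, 0.15, 1.0], k=2)
print(results)
```

The `k` parameter here plays the same role as the `k` argument of a real store's search method: it caps how many nearest neighbours are returned.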

## Load Documents

In [32]:
from langchain.document_loaders import PyPDFLoader

In [33]:
loader = PyPDFLoader("./datasets/example_doc.pdf")
pages = loader.load_and_split()

In [34]:
pages

[Document(page_content='Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide\n|\nCode\nwith\nPrince\nDescription:\nWelcome\nback\nto\n"Code\nwith\nPrince"!\nIn\nthis\nsecond\nvideo\nof\nour\nTypeScript\ntutorial\nseries,\nwe\'ll\ndelve\ninto\nthe\npowerful\nworld\nof\nfunctions\nin\nTypeScript.\nFunctions\nare\nthe\nbackbone\nof\nany\nprogramming\nlanguage,\nand\nTypeScript\nprovides\nadditional\nfeatures\nand\nenhancements\nto\nmake\nyour\ncode\nmore\nmaintainable\nand\nscalable.\nIn\nthis\ntutorial,\nwe\'ll\nexplore\na\nwide\nrange\nof\ntopics\nrelated\nto\nfunctions\nin\nTypeScript:\n1.\nIntroduction\nto\nFunctions:\nUnderstand\nthe\nfundamentals\nof\nfunctions\nand\ntheir\nsignificance\nin\nprogramming.\n2.\nFunction\nDeclaration\nand\nParameters:\nLearn\nhow\nto\ndeclare\nfunctions,\ndefine\nparameters,\nand\nspecify\nreturn\ntypes\nin\nTypeScript.\n3.\nOptional\nand\nDefault\nParameters:\nDiscover\nTypeScript\'s\nsupport\nfor\noptional\nand\ndefault\nparameters,\nallowi

## Create Splits

In [35]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [36]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 10
)

In [37]:
splits = text_splitter.split_documents(pages)

In [38]:
splits

[Document(page_content='Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide\n|\nCode\nwith\nPrince\nDescription:\nWelcome', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='Welcome\nback\nto\n"Code\nwith\nPrince"!\nIn\nthis\nsecond\nvideo\nof\nour\nTypeScript\ntutorial\nseries,\nwe\'ll', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content="we'll\ndelve\ninto\nthe\npowerful\nworld\nof\nfunctions\nin\nTypeScript.\nFunctions\nare\nthe\nbackbone\nof\nany", metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='of\nany\nprogramming\nlanguage,\nand\nTypeScript\nprovides\nadditional\nfeatures\nand\nenhancements\nto\nmake', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content="to\nmake\nyour\ncode\nmore\nmaintainable\nand\nscalable.\nIn\nthis\ntutorial,\nwe'll\nexplore\na\nwide\nrange\nof", metadata={'source': './datasets/example_doc.pdf', 'page': 

In [39]:
len(splits)

26
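To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter breaks on separators such as newlines and spaces, which is why the overlaps in the output above fall on whole words. The function name `split_with_overlap` is an assumption made for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into character windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so context that straddles a chunk boundary is not lost entirely.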

## Embeddings

In [40]:
from langchain.embeddings.openai import OpenAIEmbeddings
import openai

In [41]:
from dotenv import load_dotenv
import os

%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [42]:
openai_api_key = os.environ['OPENAI_API_KEY']
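A misspelled or missing variable name makes `os.environ[...]` fail with a bare `KeyError`. One possible guard is a small helper that raises a more descriptive error. `require_env` and the `EXAMPLE_API_KEY` variable are hypothetical names used only for this sketch.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Demonstration with a placeholder variable (hypothetical name and value).
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
example_key = require_env("EXAMPLE_API_KEY")
```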

In [43]:
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [44]:
toy_text_one = "Hello world"
toy_text_two = "Hello everyone, this is John Doe"
toy_text_three = "I love coding in Python"

In [45]:
embedding_one = embedding.embed_query(toy_text_one)
embedding_two = embedding.embed_query(toy_text_two)
embedding_three = embedding.embed_query(toy_text_three)

In [46]:
embedding_one

[-0.0048746285028755665,
 0.0048683262430131435,
 -0.01641051284968853,
 -0.02448972873389721,
 -0.017292799428105354,
 0.012547362595796585,
 -0.01913299411535263,
 0.009081240743398666,
 -0.010184097103774548,
 -0.02704835683107376,
 0.022838594391942024,
 0.01036685612052679,
 -0.023468798026442528,
 -0.006591934245079756,
 0.007990987040102482,
 0.0025917140301316977,
 0.025145141407847404,
 -0.012125126086175442,
 0.01293178740888834,
 0.013032619841396809,
 -0.010461387224495411,
 -0.003447216236963868,
 0.003989191725850105,
 0.008633795194327831,
 -0.020658088847994804,
 -0.001871705986559391,
 0.012200750410556793,
 -0.0192338265478611,
 0.030375834554433823,
 -0.03105645626783371,
 0.0035669549833983183,
 -0.007814530283212662,
 -0.006043656729161739,
 -0.017784358933568,
 0.004934497643262148,
 -0.015629060566425323,
 0.0013242162531241775,
 -0.01559124793857336,
 0.019410284236073494,
 -0.016108015552163124,
 0.007266252767294645,
 0.008331297896802425,
 0.01141929719597101

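#### Checking vector similarity

The dot product of two embeddings is used here as a similarity score: a higher value means the two texts are closer in meaning.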

In [47]:
import numpy as np

In [48]:
np.dot(embedding_one, embedding_two)

0.8604210596039017

In [49]:
np.dot(embedding_one, embedding_three)

0.7803344007246864
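A plain dot product matches cosine similarity only when both vectors have unit length, which is the assumption made here for the OpenAI embeddings. The sketch below checks that equivalence on small made-up vectors; the helper names are chosen for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for real embeddings.
raw_a = [3.0, 4.0, 0.0]
raw_b = [1.0, 2.0, 2.0]

unit_a = normalize(raw_a)
unit_b = normalize(raw_b)

cosine = cosine_similarity(raw_a, raw_b)
dot_of_units = dot(unit_a, unit_b)

# The two numbers agree up to floating-point error.
print(f"cosine similarity of raw vectors: {cosine:.4f}")
print(f"dot product of unit vectors:      {dot_of_units:.4f}")
```

To test the unit-length assumption on the real data, `np.linalg.norm(embedding_one)` should come out very close to 1.0.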

## Vectorstores

In [50]:
# !pip install chromadb

In [51]:
from langchain.vectorstores import Chroma

In [52]:
persist_directory = 'vecstores/chroma/'

In [53]:
# !rm -rf ./vecstores/chroma 

In [54]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [56]:
vectordb.persist()

### Similarity Checks / Semantic Searches

In [57]:
question = "Is there a point where we go over function parameters"

The `k` argument specifies how many documents the search returns. The store was built from the 26 chunks created earlier, so `k=6` asks for the six chunks most similar to the question.

In [58]:
docs = vectordb.similarity_search(question,k=6)

In [59]:
len(docs)

6

### Preprocessing

In [60]:
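for doc in docs:
    # Replace the newlines left over from PDF extraction with spaces.
    # Each doc is modified in place, so no reassignment is needed.
    doc.page_content = doc.page_content.replace("\n", " ")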
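If the extracted text also contains tabs or repeated spaces, a slightly more general clean-up is to collapse every run of whitespace. This is a minimal sketch; `normalize_whitespace` is a hypothetical helper, and the sample string is taken from the page content shown earlier.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, repeated spaces) into one space."""
    return " ".join(text.split())


raw = "Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

Unlike `replace("\n", " ")`, this also strips leading and trailing whitespace, which may or may not be wanted depending on how the text is displayed.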

In [61]:
print(docs[0].page_content)

#TypeScriptRestParameters #TypeScriptFunctionOverloading #TypeScriptArrowFunctions


In [62]:
print(docs[1].page_content)

#TypeScriptFunctionParameters #TypeScriptOptionalParameters #TypeScriptDefaultParameters


In [63]:
print(docs[2].page_content)

variable number of arguments. 5. Function Overloading: Understand function overloading in


In [64]:
for i, doc in enumerate(docs):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: #TypeScriptRestParameters #TypeScriptFunctionOverloading #TypeScriptArrowFunctions 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 1: #TypeScriptFunctionParameters #TypeScriptOptionalParameters #TypeScriptDefaultParameters 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 2: variable number of arguments. 5. Function Overloading: Understand function overloading in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 3: which can accept functions as parameters or return functions, enabling powerful abstractions in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 4: range of topics related to functions in TypeScript: 1. Introduction to Functions: Understand the 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 5: and Parameters: Learn how to declare functions, define parameters, and specify return types in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

