## Vector Databases
![VectorDB](../assets/PineCone.png)

A vector database stores and manages data in the form of vectors. These vectors represent complex data items such as images, text, and audio files in a format that is better suited to certain types of processing, such as similarity searches.

- **Specialized in Similarity Searches -** Ideal for finding 'similar' items, crucial for recommendation systems and image recognition.
- **Leveraging Machine Learning -** Uses ML models to transform data into vectors for semantic-based indexing and retrieval.
- **Fast Indexing and Retrieval -** Enables quick responses in real-time applications through specialized indexing.
- **Scalable for Big Data -** Well-suited for large volumes, facilitating horizontal scalability in big data environments.
- **Integration with Existing Systems -** Can be merged with current databases and data processing pipelines.
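
To make the idea concrete, here is a minimal sketch of what a vector database does at its core: store vectors under an ID and return the closest ones for a query vector. The class name `TinyVectorStore`, its methods, and the three-dimensional sample vectors are hypothetical and chosen only for illustration. Real services such as Pinecone add specialized indexes so they do not have to scan every vector.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (id, vector) pairs and scans all of them on query."""

    def __init__(self):
        self.items = {}  # item id -> vector (list of floats)

    def upsert(self, item_id, vector):
        # Insert a new vector or overwrite the one stored under the same id
        self.items[item_id] = list(vector)

    @staticmethod
    def _cosine(a, b):
        # Cosine similarity: dot product divided by the product of the vector lengths
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def query(self, vector, top_k=2):
        # Brute-force search: score every stored vector, then keep the best top_k
        scored = [(item_id, self._cosine(vector, v)) for item_id, v in self.items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.upsert("trout", [0.9, 0.1, 0.0])    # hypothetical hand-made vectors
store.upsert("bass", [0.8, 0.2, 0.1])
store.upsert("tractor", [0.0, 0.1, 0.9])

nearest = store.query([1.0, 0.0, 0.0], top_k=2)
print([item_id for item_id, _ in nearest])  # ['trout', 'bass']
```

The brute-force scan is fine for a handful of items; the "fast indexing" point above is about avoiding exactly this full scan at scale.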


## Working with Pinecone Vector Database
This section sets up a Python environment for the Pinecone vector database service: it imports the necessary libraries, loads the API credentials from environment variables so they stay out of the code, and initializes the Pinecone client with the API key.

In [2]:
# Importing Dependencies
import pinecone
from langchain.vectorstores import Pinecone
from dotenv import load_dotenv, find_dotenv
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
import PyPDF2
from tqdm.autonotebook import tqdm

# Load the Pinecone API key and environment from a .env file
load_dotenv(find_dotenv(), override=True)
api_key = os.getenv("PINECONE_API_KEY")
env = os.getenv("PINECONE_ENV")
# Do not print the API key: it is a secret and would be saved in the notebook output
pinecone.init(api_key=api_key, environment=env)
pinecone.info.version()


VersionResponse(server='2.0.11', client='2.2.4')

In [None]:
# Check if there is an index with the given name
indexes = pinecone.list_indexes()
print(indexes)

In [None]:
# deleting all indexes
for i in indexes:
    pinecone.delete_index(i)
    print("Index Deleted")
indexes = pinecone.list_indexes()
print(f'There are: {len(indexes)} indexes in database')

In [None]:
# Create Pinecone index
index_name = "fishing"
if index_name not in pinecone.list_indexes():
    print(f'Create index {index_name}')
    # dimension=1536 matches the vector size of OpenAI's text-embedding-ada-002 model
    pinecone.create_index(index_name, dimension=1536, metric='cosine', pods=1, pod_type='p1.x2')
    print('Index Created')
else:
    print("Index exists")
pinecone.list_indexes()

In [None]:
# Read the PDF file
pdf_reader = PyPDF2.PdfReader('../data/TroutStocking.pdf')

# Extract text from each page and concatenate it
full_text = ""
for page in pdf_reader.pages:
    full_text += page.extract_text() + "\n"

print(full_text)

In [None]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,     # Maximum chunk length, measured by length_function (characters here)
    chunk_overlap=20,   # Number of characters shared between consecutive chunks
    length_function=len)

# Create chunks from the extracted text
chunks = text_splitter.create_documents([full_text])
# Print the first chunk
print(chunks[0])
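
To see what chunk size and overlap mean, the following is a simplified sketch of a fixed-width sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, newlines, and spaces); the function name `split_with_overlap` and the sample string are hypothetical and exist only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least part of a neighboring chunk.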


### Creating Embeddings

- Definition: Embeddings are vector representations of complex data (like words, images, sounds) in a lower-dimensional space.
- Purpose: They transform data into a format understandable by machine learning models, preserving essential characteristics.
- NLP Use: In natural language processing, word embeddings capture semantic relationships between words.
- Examples: Tools like Word2Vec, GloVe, and BERT are used for creating word embeddings.
- Applications: Beyond text, used in image processing and recommendation systems to represent visual features or user preferences.
- Advantages: Facilitate efficient model processing and capture deep similarities and relationships in data.
- Training: Can be pre-trained on large datasets or trained from scratch for specific tasks.
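
As a rough illustration of turning text into a vector, the sketch below counts how often each word of a small fixed vocabulary occurs and normalizes the result. This is a toy bag-of-words encoding, not a learned embedding: models such as Word2Vec, BERT, or the OpenAI embeddings used in this notebook capture meaning well beyond word overlap. The vocabulary and the function name `toy_embed` are hypothetical.

```python
import math

# Hypothetical toy vocabulary; each word gets one dimension of the vector
VOCABULARY = ["trout", "lake", "river", "stocking", "engine", "wheel"]


def toy_embed(text):
    """Map text to a unit-length vector of vocabulary word counts."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCABULARY]
    norm = math.sqrt(sum(c * c for c in counts))
    # Leave an all-zero vector unchanged to avoid dividing by zero
    return [c / norm for c in counts] if norm else counts


def dot(u, v):
    # For unit-length vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(u, v))


a = toy_embed("trout stocking in the river")
b = toy_embed("river trout")
c = toy_embed("engine and wheel")

print(round(dot(a, b), 4))  # 0.8165 (two shared words)
print(dot(a, c))            # 0.0 (no shared words)
```

Texts about the same topic end up with vectors pointing in similar directions, which is the property that similarity search relies on.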

In [None]:
# Create embeddings
from langchain.embeddings import OpenAIEmbeddings
# Load the OpenAI API key from the .env file
load_dotenv(find_dotenv(), override=True)
api_key = os.getenv("OPENAI_API_KEY")
# Do not print the API key: it is a secret

embeddings = OpenAIEmbeddings(openai_api_key=api_key)
# Embed a single chunk to inspect one vector
vector = embeddings.embed_query(chunks[0].page_content)
# Embed every chunk and upload the vectors to the Pinecone index
Pinecone.from_documents(chunks, embeddings, index_name=index_name)

### Asking Question and performing Similarity Searches
![VectorDB](../assets/Similarity.png)

Vector similarity search finds the elements in a database that are most similar to a given query item. It is particularly relevant to vector databases, where data is represented as vectors: lists of numbers that encode information about the items. Here is a basic explanation:

- **Vector Representation -** In a vector similarity search, items in the database (like text, images, or sounds) are transformed into vectors using algorithms. These vectors are numerical representations that capture the essential features of the items.

- **Measuring Similarity -** The core idea is to measure how 'close' two vectors are to each other. This closeness is typically determined by calculating the distance or angle between vectors. Common measures include Euclidean distance, cosine similarity, and Manhattan distance.

- **Querying -** When a query is made (for instance, a search for an image), the query item is also converted into a vector. The search algorithm then looks through the database to find vectors that are closest to the query vector.

- **Applications -** This method is widely used in various fields like recommendation systems (suggesting products or content similar to what a user likes), image and voice recognition systems, and natural language processing (finding documents or texts similar to a given piece of text).

- **Advantages -** Vector similarity searches are powerful because they can find items that are 'semantically' similar, not just exact matches. This allows for more nuanced and context-aware results.

- **Challenges -** One challenge in vector similarity search is the computational cost, especially with very large databases. Efficient algorithms and indexing strategies are crucial for maintaining fast and accurate search results.
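
The three measures named above can be sketched in a few lines of plain Python. The sample vectors are hypothetical and picked to show that the measures can disagree: cosine similarity looks only at direction, while Euclidean and Manhattan distance also react to magnitude.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    # Sum of absolute differences along each axis
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 2.0]
same_direction = [2.0, 4.0]  # twice as long as the query, same direction
nearby = [1.0, 1.0]          # closer in space, slightly different direction

for name, candidate in [("same_direction", same_direction), ("nearby", nearby)]:
    print(
        name,
        round(euclidean_distance(query, candidate), 3),
        round(manhattan_distance(query, candidate), 3),
        round(cosine_similarity(query, candidate), 3),
    )
# same_direction 2.236 3.0 1.0
# nearby 1.0 1.0 0.949
```

By Euclidean and Manhattan distance `nearby` is the better match, whereas by cosine similarity `same_direction` wins. This is why the metric chosen when an index is created (the Pinecone index in this notebook uses `cosine`) affects which results a query returns.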



In [None]:
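## Asking Questions (Similarity Search)
# Connect to the index populated above instead of calling from_documents again,
# which would embed and upload every chunk a second time
vector_store = Pinecone.from_existing_index(index_name, embeddings)
query = 'Give me all bodies of water in Lumpkin County'
results = vector_store.similarity_search(query)
print(results)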

### Clustering based on similarity
Clustering groups the items in a vector space based on their similarity.
The entire diagram below can be thought of as a vector space: a mathematical space in which each item (represented by a dot) is a vector. The position of each dot indicates its relationship to the others, so dots that lie close together form a group of similar items.

![VectorDB](../assets/VectorGroups.png)
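
The text does not name a specific clustering algorithm, so the sketch below uses k-means as one common choice, written in plain Python. The two-dimensional points, the starting centroids, and the function names are hypothetical and serve only to illustrate how nearby vectors end up in the same group.

```python
import math


def nearest_centroid(point, centroids):
    # Index of the centroid with the smallest Euclidean distance to the point
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest_centroid(p, centroids)].append(p)
        # New centroid = mean of its group; keep the old centroid if the group is empty
        centroids = [
            [sum(coords) / len(g) for coords in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    labels = [nearest_centroid(p, centroids) for p in points]
    return labels, centroids


points = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],  # one tight group
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],  # a second, distant group
]
# Start from one point of each group so the run is deterministic
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the vectors would be embeddings with hundreds of dimensions, but the grouping principle is the same as for these two-dimensional dots.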


### Chroma VectorDB Example

Chroma is an open-source vector database designed to store and utilize embeddings for various applications, such as building large language model (LLM) applications. It's engineered to make knowledge, facts, and skills easily pluggable into LLMs, streamlining the development of AI applications by efficiently handling vector similarity searches crucial for recommendation systems, image recognition, and natural language processing tasks.

In [None]:
# import
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# load the document and split it into chunks
loader = TextLoader("../data/AI_And_Morality.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "Who is Nick Bostrom?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)