How to vectorize a document with a neural network and search it
To vectorize a document with a neural network and perform a search within it, you typically follow these steps:

1. Preprocessing the Document
    1. Tokenization: Split the document into words or subwords.
    2. Normalization: Convert to lowercase, remove punctuation, and handle stopwords.
2. Vectorizing the Document
    1. Choose a Pre-trained Model: Select a model like BERT, RoBERTa, or any other transformer-based model.
    2. Embedding: Use the chosen model to convert tokens into dense vectors. For BERT and similar models, the [CLS] token is often used as a representation of the whole document or sentence.
3. Storing Vectors for Search
    1. Database or Index: Store the vectors in a database optimized for vector search, such as Elasticsearch with its vector search plugin, Faiss, or Annoy.
4. Performing a Search
    1. Query Vectorization: Convert the search query into a vector using the same model and preprocessing steps.
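
The preprocessing steps listed above can be sketched in plain Python. This is only a minimal illustration: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical names introduced here, and transformer tokenizers such as BERT's perform their own subword tokenization, so manual cleanup like this may not be needed when the model's tokenizer is used directly.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "for", "of", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = preprocess("Neural networks are powerful for natural language processing.")
print(tokens)
```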

# Custom vectorization and search

Here’s a step-by-step guide to vectorizing a document with a neural network (such as BERT) and searching it in Python. We will use the Hugging Face Transformers library for the neural network and Faiss for efficient vector search.

### Step 1: Install Necessary Libraries
First, install the required libraries:

In [None]:
!pip install transformers faiss-cpu torch

### Step 2: Load Pre-trained BERT Model and Tokenizer

In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token representation as the document vector
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()
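
Because the function above uses `truncation=True` with `max_length=512`, anything beyond the first 512 tokens of a long document is dropped. One possible workaround, sketched here under assumptions, is to split the text into overlapping word windows and vectorize each window separately. The `chunk_words` helper and its default sizes are hypothetical; words only approximate BERT subword tokens, so the window size should stay well below 512.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# A made-up 450-word text standing in for a long document.
long_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to `document_to_vector` and either indexed on its own or averaged into a single document vector; which option works better depends on the data.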

### Step 3: Vectorize Documents

In [None]:
# Example documents
documents = [
    "This is a sample document.",
    "Another document for testing.",
    "Neural networks are powerful for natural language processing."
]

# Convert documents to vectors
vectors = [document_to_vector(doc) for doc in documents]

In [None]:
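# Alternatively, load one document per line from a text file.
# Keeping the lines in a list lets documents[idx] work when printing results.
with open('movies.txt') as f:
    documents = [line.strip() for line in f if line.strip()]
# Convert documents to vectors
vectors = [document_to_vector(doc) for doc in documents]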

### Step 4: Store Vectors in Faiss

In [None]:
import faiss
import numpy as np

# Convert list of vectors to a numpy array
vector_array = np.vstack(vectors).astype('float32')

# Create a Faiss index (exact search; IndexFlatL2 reports squared L2 distances)
index = faiss.IndexFlatL2(vector_array.shape[1])

# Add vectors to the index
index.add(vector_array)
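
As a rough mental model, a flat L2 index compares the query exhaustively against every stored vector. The NumPy sketch below mimics that behaviour for illustration only; `brute_force_l2_search` is a hypothetical helper, not part of Faiss, and it returns squared Euclidean distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np

def brute_force_l2_search(vector_array, query_vector, k):
    """Return (squared L2 distances, indices) of the k nearest stored vectors."""
    diffs = vector_array - query_vector.reshape(1, -1)
    squared_distances = np.sum(diffs ** 2, axis=1)
    nearest = np.argsort(squared_distances)[:k]
    return squared_distances[nearest], nearest

# Toy 2-dimensional vectors standing in for 768-dimensional BERT embeddings.
toy_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 2.0]], dtype="float32")
toy_query = np.array([1.0, 0.0], dtype="float32")

distances, indices = brute_force_l2_search(toy_vectors, toy_query, k=2)
print(indices, distances)
```

Taking the square root of these values gives the ordinary Euclidean distance; the ranking is the same either way.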

### Step 5: Vectorize Query and Perform Search

In [None]:
# Example query
query = "harry"

# Convert query to vector
query_vector = document_to_vector(query).astype('float32')

# Perform search
k = 3  # Number of nearest neighbors
distances, indices = index.search(query_vector.reshape(1, -1), k)

# Print results
for i, idx in enumerate(indices[0]):
    print(f"Document {idx} - Distance: {distances[0][i]}")
    print(documents[idx])

Document 0 - Distance: 46.69883346557617
This is a sample document.
Document 1 - Distance: 47.88974380493164
Another document for testing.
Document 2 - Distance: 99.28144836425781
Neural networks are powerful for natural language processing.


This script shows how to preprocess documents, convert them into vectors using a neural network, store them in a Faiss index, and perform a search with a query vector.

### Full Script
Putting it all together:

In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


In [None]:
from transformers import BertTokenizer, BertModel
import torch
import faiss
import numpy as np

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

# Example documents
documents = [
    "This is a sample document.",
    "Another document for testing.",
    "Neural networks are powerful for natural language processing."
]
# documents = open('sample_data/movies_short.txt')
# Convert documents to vectors
vectors = [document_to_vector(doc) for doc in documents]

# Convert list of vectors to a numpy array
vector_array = np.vstack(vectors).astype('float32')

# Create a Faiss index (exact search; IndexFlatL2 reports squared L2 distances)
index = faiss.IndexFlatL2(vector_array.shape[1])

# Add vectors to the index
index.add(vector_array)

# Example query
query = "Harry"

# Convert query to vector
query_vector = document_to_vector(query).astype('float32')

# Perform search
k = len(vectors)  # Number of nearest neighbors
distances, indices = index.search(query_vector.reshape(1, -1), k)

# Print results
for i, idx in enumerate(indices[0]):
    print(f"Document {idx} - Distance: {distances[0][i]}")
    print(documents[idx])

Document 0 - Distance: 46.69883346557617
This is a sample document.
Document 1 - Distance: 47.88974380493164
Another document for testing.
Document 2 - Distance: 99.28144836425781
Neural networks are powerful for natural language processing.


### Vectorization API (BERT)

In [None]:
!pip install fastapi uvicorn

Collecting fastapi
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting uvicorn
  Downloading uvicorn-0.30.1-py3-none-any.whl (62 kB)
Collecting starlette<0.38.0,>=0.37.2 (from fastapi)
  Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
Collecting fastapi-cli>=0.0.2 (from fastapi)
  Downloading fastapi_cli-0.0.4-py3-none-any.whl (9.5 kB)
Collecting httpx>=0.23.0 (from fastapi)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting python-multipart>=0.0.7 (from fastapi)
  Downlo

In [None]:
import json
from fastapi import FastAPI
from transformers import BertTokenizer, BertModel
import torch

In [None]:
app = FastAPI()

In [None]:
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

In [None]:
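@app.get('/vectoriz')  # query parameter route, e.g. /vectoriz?vectorized=some+text
def get_by_name(vectorized: str):
    # json.dumps already returns a JSON string, and FastAPI serializes that string
    # again, so clients have to decode the response twice.
    return json.dumps({'data': document_to_vector(vectorized).reshape(1, -1).astype('str').tolist()[0]})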

### How to run the API
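
The `api:app` argument tells uvicorn to import an object named `app` from a module named `api`, so a file called `api.py` has to exist on the import path; the "Could not import module" error in this section's output is what appears when it does not. One possible way to create that file, assuming the working directory is writable, is sketched below. The file content simply collects the API cells from this notebook into one module.

```python
from pathlib import Path

# Source for api.py: the imports, model loading, helper, and route
# from the notebook cells, collected into a single module.
API_SOURCE = '''\
import json
from fastapi import FastAPI
from transformers import BertTokenizer, BertModel
import torch

app = FastAPI()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

@app.get('/vectoriz')
def get_by_name(vectorized: str):
    return json.dumps({'data': document_to_vector(vectorized).reshape(1, -1).astype('str').tolist()[0]})
'''

# Write the module into the working directory so that `uvicorn api:app` can import it.
Path("api.py").write_text(API_SOURCE)
```

Note that a `!` command in a notebook blocks its cell while the server is running, so the test request may need to be sent from a separate terminal or process.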

In [None]:
!python -m uvicorn api:app

ERROR:    Error loading ASGI app. Could not import module "api".


### Test the API

In [None]:
import json
import requests

text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Egestas purus viverra accumsan in nisl nisi. Arcu cursus vitae congue mauris rhoncus aenean vel elit scelerisque. In egestas erat imperdiet sed euismod nisi porta lorem mollis. Morbi tristique senectus et netus. Mattis pellentesque id nibh tortor id aliquet lectus proin. Sapien faucibus et molestie ac feugiat sed lectus vestibulum. Ullamcorper velit sed ullamcorper morbi tincidunt ornare massa eget. Dictum varius duis at consectetur lorem. Nisi vitae suscipit tellus mauris a diam maecenas sed enim. Velit ut tortor pretium viverra suspendisse potenti nullam. Et molestie ac feugiat sed lectus. Non nisi est sit amet facilisis magna. Dignissim diam quis enim lobortis scelerisque fermentum. Odio ut enim blandit volutpat maecenas volutpat. Ornare lectus sit amet est placerat in egestas erat. Nisi vitae suscipit tellus mauris a diam maecenas sed. Placerat duis ultricies lacus sed turpis tincidunt id aliquet.'

response = requests.get('http://127.0.0.1:8000/vectoriz', params={'vectorized': text})
print(json.loads(response.json()))
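
The double decoding in the test above follows from how the endpoint builds its response: it returns a string produced by `json.dumps`, and the framework then serializes that string as JSON once more. The stdlib-only sketch below imitates the round trip without a running server; the three-element vector is made up for illustration.

```python
import json

# What the endpoint returns: a JSON *string* whose values are numbers as text.
endpoint_return_value = json.dumps({"data": ["0.12", "-0.5", "1.0"]})

# The web framework serializes the returned string as JSON again.
http_body = json.dumps(endpoint_return_value)

# Client side: response.json() undoes the outer layer and yields a str...
outer = json.loads(http_body)
# ...and a second json.loads recovers the dictionary.
payload = json.loads(outer)

# The vector components were sent as strings, so convert them back to floats.
vector = [float(x) for x in payload["data"]]
print(type(outer).__name__, vector)
```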

### Searching without Faiss

In [None]:

from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()


# Example documents
documents = [
    {"id": 1, "text": "This is a sample document."},
    {"id": 2, "text": "Another document for testing."},
    {"id": 3, "text": "Neural networks are powerful for natural language processing."}
]

# Vectorize documents and prepare for indexing
for doc in documents:
    vector = document_to_vector(doc['text'])
    doc['vector'] = vector.tolist()  # Convert numpy array to list for JSON serialization

# Example query
query = "How do neural networks work?"

# Convert query to vector
query_vector = document_to_vector(query).astype('float32')

# Compute distances
for result in documents:
    doc_vector = np.array(result['vector'])
    distance = np.linalg.norm(doc_vector - query_vector)
    result['distance'] = distance

# Sort results by distance (ascending order)
sorted_results = sorted(documents, key=lambda x: x['distance'])

# Display results
for result in sorted_results:
    print(f"Document ID: {result['id']} - Distance: {result['distance']}")
    print(result['text'])

Document ID: 1 - Distance: 6.609472434707151
This is a sample document.
Document ID: 2 - Distance: 7.044052941662354
Another document for testing.
Document ID: 3 - Distance: 7.270871062557614
Neural networks are powerful for natural language processing.
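
Euclidean distance and cosine similarity are closely related once vectors are scaled to unit length: for unit vectors, the squared distance equals 2 - 2·cos. The small sketch below checks this numerically on made-up 2-dimensional vectors; the helper names are illustrative and not taken from any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [4.0, 3.0]

lhs = squared_distance(l2_normalize(a), l2_normalize(b))
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
```

In other words, normalizing the vectors before a Euclidean search yields the same ranking as a cosine similarity search.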


## Searching using scikit-learn cosine similarity

To vectorize documents with a neural network and perform a search using cosine similarity, you can use libraries like transformers for vectorization and scikit-learn for cosine similarity computation. Here's a step-by-step guide:

#### Step-by-Step Guide
    1. Install Necessary Libraries
    2. Vectorize Documents Using BERT
    3. Calculate Cosine Similarity

### Step 1: Install Necessary Libraries
First, ensure you have the required libraries installed:

In [None]:
!pip install transformers torch scikit-learn numpy

### Step 2: Vectorize Documents Using BERT
Here’s how to vectorize documents using BERT:

In [None]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of all token embeddings as the document vector.
    # Note: padding='max_length' pads every input to 512 positions, and this mean includes them.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Example documents
documents = [
    "This is a sample document.",
    "Another document for testing.",
    "Neural networks are powerful for natural language processing."
]

# Convert documents to vectors
vectors = np.array([document_to_vector(doc) for doc in documents])
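
With `padding='max_length'`, every input is padded to 512 positions, and a plain `mean(dim=1)` averages over those padding positions as well. A common alternative is to weight the average by the attention mask so that only real tokens contribute. The NumPy sketch below shows the idea on toy arrays; `masked_mean` is a hypothetical helper, and with the real model the mask would presumably come from `inputs['attention_mask']`.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings over positions where attention_mask == 1.

    token_embeddings: array of shape (seq_len, hidden_size)
    attention_mask:   array of shape (seq_len,), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.astype(token_embeddings.dtype)[:, None]
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # guard against an all-padding input
    return summed / count

# Two real tokens followed by two padding positions (toy values).
toy_embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]])
toy_mask = np.array([1, 1, 0, 0])

print(masked_mean(toy_embeddings, toy_mask))  # averages only the two real tokens
print(toy_embeddings.mean(axis=0))            # padding rows pull the average away
```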

### Step 3: Calculate Cosine Similarity and Perform Search
We'll calculate cosine similarity between the query vector and document vectors:

In [None]:

from sklearn.metrics.pairwise import cosine_similarity

# Function to perform a cosine similarity search
def search_documents(query, documents, vectors):
    query_vector = document_to_vector(query).reshape(1, -1)
    similarities = cosine_similarity(query_vector, vectors).flatten()
    sorted_indices = similarities.argsort()[::-1]

    results = [(documents[idx], similarities[idx]) for idx in sorted_indices]
    return results

# Example query
query = "How do neural networks work?"

# Perform search
results = search_documents(query, documents, vectors)

# Display results (i is the rank in the sorted list, not the original document index)
for i, (doc, similarity) in enumerate(results):
    print(f"Document {i} - Similarity: {similarity:.4f}")
    print(doc)

Document 0 - Similarity: 0.8089
Neural networks are powerful for natural language processing.
Document 1 - Similarity: 0.7755
Another document for testing.
Document 2 - Similarity: 0.7485
This is a sample document.
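
For reference, the value that `cosine_similarity` returns for each query-document pair is the dot product divided by the product of the two vector norms. A NumPy-only sketch of the same computation, using a hypothetical `cosine_scores` helper and toy vectors:

```python
import numpy as np

def cosine_scores(query_vector, vectors):
    """Cosine similarity between one query vector and each row of `vectors`."""
    query_norm = np.linalg.norm(query_vector)
    doc_norms = np.linalg.norm(vectors, axis=1)
    return (vectors @ query_vector) / (doc_norms * query_norm)

# Toy 2-dimensional vectors standing in for BERT embeddings.
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_query = np.array([1.0, 0.0])

scores = cosine_scores(toy_query, toy_vectors)
ranking = np.argsort(scores)[::-1]  # highest similarity first
print(ranking, np.round(scores, 4))
```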


### Full Script
Here’s the complete script integrating all steps:

In [1]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to convert a document to a BERT vector
def document_to_vector(doc):
    inputs = tokenizer(doc, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of all token embeddings as the document vector.
    # Note: padding='max_length' pads every input to 512 positions, and this mean includes them.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Example documents
documents = [
    "This is a sample document.",
    "Another document for testing.",
    "Neural networks are powerful for natural language processing."
]

# Convert documents to vectors
vectors = np.array([document_to_vector(doc) for doc in documents])

# Function to perform a cosine similarity search
def search_documents(query, documents, vectors):
    query_vector = document_to_vector(query).reshape(1, -1)
    similarities = cosine_similarity(query_vector, vectors).flatten()
    sorted_indices = similarities.argsort()[::-1]

    results = [(documents[idx], similarities[idx]) for idx in sorted_indices]
    return results

# Example query
query = "How do neural networks work?"

# Perform search
results = search_documents(query, documents, vectors)

# Display results (i is the rank in the sorted list, not the original document index)
for i, (doc, similarity) in enumerate(results):
    print(f"Document {i} - Similarity: {similarity:.4f}")
    print(doc)

Document 0 - Similarity: 0.8089
Neural networks are powerful for natural language processing.
Document 1 - Similarity: 0.7755
Another document for testing.
Document 2 - Similarity: 0.7485
This is a sample document.
