<a href="https://colab.research.google.com/github/GeorgeCrossIV/Yelp-Review-Search-AstraPy/blob/main/Yelp_Review_Search_via_AstraPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Astra DB with AstraPy - Query Yelp reviews

This notebook loads Yelp review data from a CSV file, embeds each review's `text` field, and stores the embedding in a field named `$vector`. An example of querying the embedded collection appears near the end of the notebook.

_Prerequisites:_ You need an Astra DB instance, along with its *Token* and *API Endpoint*
(read more [here](https://docs.datastax.com/en/astra/home/astra.html)).

Set the following Colab secrets:
- `ASTRA_DB_API_ENDPOINT` - the Astra DB API endpoint
- `ASTRA_DB_TOKEN_BASED_PASSWORD` - the Astra DB token
- `huggingface_token` - the Hugging Face token read by the embedding cell

## Setup

In [None]:
!pip install --quiet --upgrade astrapy transformers torch

### Import needed libraries

In [2]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from astrapy.db import AstraDB
from google.colab import userdata

In [3]:
# Model name and Hugging Face token used to load the embedding model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
token = userdata.get("huggingface_token")  # Hugging Face token

# Return the embedding of a text as a flat list of floats
def get_embedding(text, model, tokenizer):
    # Tokenize input text (inputs longer than 512 tokens are truncated)
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")

    # Get model output without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the last hidden state over the token dimension
    # You might also consider using pooled output for sentence-level embeddings
    embeddings = outputs.last_hidden_state.mean(dim=1)

    # Convert the tensor embeddings into a flat list of floats
    float_embeddings = embeddings.numpy().flatten().tolist()

    return float_embeddings

# Load pretrained MiniLM model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModel.from_pretrained(model_name, token=token)
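
The plain mean over `last_hidden_state` in `get_embedding` weights every token position equally. That is fine for one input at a time, but if several texts were batched together, padding positions would be averaged in as well. The following dependency-free sketch shows mask-aware mean pooling; `masked_mean_pool` is a hypothetical helper (not part of transformers), and plain lists stand in for tensors.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where the mask is 1."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Two real tokens followed by one padding position (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = masked_mean_pool(tokens, [1, 1, 0])
print(pooled)  # the padding row does not affect the result
```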

### Provide database credentials

These are the connection parameters on your Astra dashboard. Example values:

- API Endpoint: `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`
- Token: `AstraCS:6gBhNmsk135...`


In [4]:
ASTRA_DB_API_ENDPOINT = userdata.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = userdata.get("ASTRA_DB_TOKEN_BASED_PASSWORD")
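
`userdata.get` is only available inside Colab. When running elsewhere, one option is to read the same names from environment variables. The sketch below assumes that approach; `read_setting` is a hypothetical helper and the endpoint value is a placeholder.

```python
import os


def read_setting(name: str, env=None) -> str:
    """Return a non-empty setting from the environment or raise a clear error."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Demo with a stand-in mapping instead of the real environment.
demo_env = {"ASTRA_DB_API_ENDPOINT": "https://example.invalid"}
endpoint = read_setting("ASTRA_DB_API_ENDPOINT", demo_env)
print(endpoint)
```

Failing early with the name of the missing setting tends to be easier to debug than an authentication error raised later by the client.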

## Create a collection

### Create the client

In [5]:
astra_db = AstraDB(
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)

Create `review_collection`, deleting any existing collection of that name first:

In [6]:
astra_db.delete_collection("review_collection")
collection = astra_db.create_collection("review_collection", dimension=384)
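
The `dimension=384` argument has to match the length of the vectors produced by the embedding model (384 for `all-MiniLM-L6-v2`). A small check before inserting can catch a mismatch early. This is only a sketch: `EXPECTED_DIMENSION` and `validate_vector` are hypothetical names, not part of AstraPy.

```python
import math

EXPECTED_DIMENSION = 384  # must match the dimension passed to create_collection


def validate_vector(vector, expected=EXPECTED_DIMENSION):
    """Raise ValueError unless vector holds exactly `expected` finite numbers."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("vector contains non-finite or non-numeric values")
    return vector


ok = validate_vector([0.0] * 384)
print(len(ok))
```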

Load the Yelp review data from a CSV file

In [7]:
# get the file from GitHub
!wget https://raw.githubusercontent.com/GeorgeCrossIV/Yelp-Review-Search-AstraPy/main/review_chunk_sample.csv

# Load the CSV file
df = pd.read_csv('review_chunk_sample.csv')

# Convert the DataFrame to a list of JSON objects
review_list = df.to_dict(orient='records')

--2023-12-06 07:28:16--  https://raw.githubusercontent.com/GeorgeCrossIV/Yelp-Review-Search-AstraPy/main/review_chunk_sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5248 (5.1K) [text/plain]
Saving to: ‘review_chunk_sample.csv.1’


2023-12-06 07:28:16 (55.0 MB/s) - ‘review_chunk_sample.csv.1’ saved [5248/5248]
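
`to_dict(orient='records')` keeps empty CSV cells as float `NaN`, which is not a valid JSON value. The sample file may not contain any, but for other data it can help to replace them before sending documents to the database. A minimal sketch, with `clean_record` as a hypothetical helper:

```python
import math


def clean_record(record: dict) -> dict:
    """Return a copy of record with float NaN values replaced by None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


raw = {"review_id": "abc", "stars": 3, "useful": float("nan")}
cleaned = clean_record(raw)
print(cleaned)
```

Applied to the notebook's data, this would be `review_list = [clean_record(r) for r in review_list]`.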



In [8]:
# embed the text field into the reserved $vector field
field_name = '$vector'
for record in review_list:
    record[field_name] = get_embedding(record['text'], model, tokenizer)

In [9]:
# show the embedding of the first record
print(review_list[0]['$vector'])

[0.1061306819319725, -0.06004926189780235, 0.1146770715713501, 0.19054478406906128, -0.22961261868476868, -0.03450368344783783, -0.18126365542411804, -0.16320475935935974, -0.09324700385332108, -0.10770738869905472, 0.014381177723407745, 0.04382028803229332, 0.030794376507401466, 0.015380681492388248, 0.060133907943964005, -0.17556646466255188, 0.45995041728019714, -0.3072662949562073, 0.1463429182767868, 0.025539351627230644, -0.006727352272719145, -0.11765347421169281, -0.01867607980966568, -0.0904260203242302, -0.018381208181381226, -0.01733686588704586, 0.06587473303079605, 0.20797519385814667, -0.044517841190099716, -0.03884926810860634, -0.075787752866745, 0.08765065670013428, 0.11222372949123383, -0.08224109560251236, -0.06109614297747612, 0.1512645184993744, 0.11071901768445969, -0.11991360038518906, 0.017604226246476173, 0.17823468148708344, -0.029410287737846375, 0.0993693396449089, 0.07005283981561661, -0.05324579030275345, -0.04112647846341133, -0.1238551065325737, -0.13363

### Insert multiple documents

In [10]:
response = collection.insert_many(review_list)
print(response)

{'status': {'insertedIds': ['44b3dbb2-55f5-454a-8bd9-cb46a35f7b3c', 'cfc105e7-4188-439d-8179-3200683145a0', 'e7f21ae5-0f9e-4047-8cb8-fc30f0036945', 'e872a1b6-ea95-46be-ba29-c92a196a86ca', '6b8cd559-ad2b-48aa-8b1f-61daa140e6a4', '4f4914dc-3c7c-49d8-bdff-9436dc1aac94', '5c9142b6-e19a-4c38-a45c-533b83bde200', '56c4db60-cc20-4af1-9b3c-203849f4d75f', '89bb8284-3d3f-41c4-b78c-d987be716bf4']}}
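
The sample has only nine records, so a single `insert_many` call is enough. For larger files it is common to send documents in batches, since APIs often cap the number of documents per request. The sketch below assumes a batch size of 20, which is an illustrative value rather than a documented limit; `batched` is a hypothetical helper.

```python
from typing import Iterator, List


def batched(items: List[dict], batch_size: int = 20) -> Iterator[List[dict]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [{"n": i} for i in range(45)]
sizes = [len(batch) for batch in batched(documents)]
print(sizes)
```

In the notebook this would be used as `for batch in batched(review_list): collection.insert_many(batch)`.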


## Find documents

Find by `_id` (the ID used here is not among the inserted IDs, so the result is `None`):

In [11]:
document = collection.find_one(filter={"_id":"d5c5a29f-6841-451d-85c2-e8409c215b41"})
print(document)

{'data': {'document': None}}


Find by any (non-vector) filter clause:

In [12]:
document = collection.find_one(filter={"review_id":"KU_O5udG6zpxOg-VcAEodg"})
print(document)

{'data': {'document': {'_id': '44b3dbb2-55f5-454a-8bd9-cb46a35f7b3c', 'review_id': 'KU_O5udG6zpxOg-VcAEodg', 'business_id': 'XQfwVwDr-v0ZS3_CbbE5Xw', 'cool': 0, 'date': '7/7/2018 22:09', 'funny': 0, 'stars': 3, 'text': "If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.", 'useful': 0, 'user_id': 'mh_-eMZ6K5RLWhZyISBhwA', '$vector': [0.1061306819319725, -0.06004926189780235, 0.1146770715713501, 0.19054478406906128, -0.22961261868476868, -0.03450368344783783, -0.18126365542411804, -0.16320475935935974, -0.09324700385332108, -0.1077

### Find by vector similarity

By default, the `$similarity` field is returned with each document, and results are ordered by decreasing similarity:

In [13]:
customer_input = "Which reviews mention cycling?"

query_vector = get_embedding(customer_input, model, tokenizer)

documents = collection.vector_find(query_vector, limit=5)
for document in documents:
    print(f"\n{document}")


{'_id': 'cfc105e7-4188-439d-8179-3200683145a0', 'review_id': 'BiTunyQ73aT9WBnpR9DZGw', 'business_id': '7ATYjTIgM3jUlt4UM3IypQ', 'cool': 1, 'date': '1/3/2012 15:28', 'funny': 0, 'stars': 5, 'text': "I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.\n\nFor anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).\n\nThere is no way I can write this review without giving Russell, the owner of Body Cycle, a shout out. Russell's passion for fitness and cycling is so evident, as is his desire for all of his clients to succeed. He is always dropping in to classes to check in/provide encouragement, and is open to ideas and recommendations from anyone. Russell always wears a smile on his face, even when 
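
To make the ranking concrete, here is a small sketch of similarity-ordered search in plain Python. It assumes a cosine metric (the notebook does not set one explicitly), and the score the database reports as `$similarity` may be scaled differently. The vectors and names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}
# Sort candidate names by similarity to the query, most similar first.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity depends only on direction, so the vector `[2.0, 0.0]` scores as a perfect match for `[1.0, 0.0]` despite its different length.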

You can specify which **fields** you'll get back and/or whether you need the **similarity** as well:

In [14]:
documents = collection.vector_find(
    query_vector,
    limit=5,
    fields=["text", "$vector"],  # remember the dollar sign (reserved name)
    include_similarity=False,
)
for document in documents:
    print(f"\n{document}")


{'_id': 'cfc105e7-4188-439d-8179-3200683145a0', 'text': "I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space and amazing bikes, to the welcoming and motivating instructors, every class is a top notch work out.\n\nFor anyone who struggles to fit workouts in, the online scheduling system makes it easy to plan ahead (and there's no need to line up way in advanced like many gyms make you do).\n\nThere is no way I can write this review without giving Russell, the owner of Body Cycle, a shout out. Russell's passion for fitness and cycling is so evident, as is his desire for all of his clients to succeed. He is always dropping in to classes to check in/provide encouragement, and is open to ideas and recommendations from anyone. Russell always wears a smile on his face, even when he's kicking your butt in class!", '$vector': [-0.023203440010547638, 0.20790797472000122, -0.015124350786209106, -0.019379911944270134, -0.1
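
Printing whole documents produces very long lines, mostly because of the 384-element `$vector`. For inspection, a display helper can drop the vector and shorten long strings. This is a sketch only; `summarize` is a hypothetical function and the 80-character cutoff is arbitrary.

```python
def summarize(document: dict, max_chars: int = 80) -> dict:
    """Copy a document without '$vector' and with long strings truncated."""
    summary = {}
    for key, value in document.items():
        if key == "$vector":
            continue  # skip the embedding, which dominates the output
        if isinstance(value, str) and len(value) > max_chars:
            value = value[:max_chars] + "..."
        summary[key] = value
    return summary


doc = {"_id": "1", "text": "x" * 200, "$vector": [0.1, 0.2]}
short = summarize(doc)
print(short)
```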

You can compound with other `filter` clauses, effectively implementing **metadata filtering** on your vector searches:

In [15]:
# find reviews based on the query that have three stars
documents = collection.vector_find(
    query_vector,
    limit=5,
    filter={"stars": 3},
)
for document in documents:
    print(f"\n{document}")


{'_id': '44b3dbb2-55f5-454a-8bd9-cb46a35f7b3c', 'review_id': 'KU_O5udG6zpxOg-VcAEodg', 'business_id': 'XQfwVwDr-v0ZS3_CbbE5Xw', 'cool': 0, 'date': '7/7/2018 22:09', 'funny': 0, 'stars': 3, 'text': "If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.", 'useful': 0, 'user_id': 'mh_-eMZ6K5RLWhZyISBhwA', '$vector': [0.1061306819319725, -0.06004926189780235, 0.1146770715713501, 0.19054478406906128, -0.22961261868476868, -0.03450368344783783, -0.18126365542411804, -0.16320475935935974, -0.09324700385332108, -0.10770738869905472, 0.0143
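
Conceptually, a filtered vector search returns only documents that satisfy the filter, ranked by similarity to the query. The in-memory sketch below illustrates that behaviour with equality filters and a dot product as a stand-in score. It does not describe how Astra DB executes the query internally, and all names and data are hypothetical.

```python
def matches(document: dict, criteria: dict) -> bool:
    """True when every key in criteria equals the document's value for that key."""
    return all(key in document and document[key] == value for key, value in criteria.items())


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(documents, query_vector, criteria, limit=5):
    """Keep documents matching criteria, then rank them by dot product with the query."""
    kept = [doc for doc in documents if matches(doc, criteria)]
    kept.sort(key=lambda doc: dot(doc["$vector"], query_vector), reverse=True)
    return kept[:limit]


reviews = [
    {"_id": "a", "stars": 3, "$vector": [0.9, 0.1]},
    {"_id": "b", "stars": 5, "$vector": [1.0, 0.0]},
    {"_id": "c", "stars": 3, "$vector": [0.2, 0.8]},
]
hits = filtered_search(reviews, [1.0, 0.0], {"stars": 3})
print([doc["_id"] for doc in hits])
```

Document `b` is the closest match to the query vector but is excluded because its `stars` value does not satisfy the filter.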

## Delete a collection

In [16]:
# Uncomment to delete the collection created in this notebook
#response = astra_db.delete_collection("review_collection")
#print(response)