<a href="https://colab.research.google.com/github/GeorgeCrossIV/trip-recommendations/blob/master/Trips_Similarity_Search_Vector_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trip Recommendations, Similarity Search & Collaborative Filtering

# Introduction

This project is dedicated to developing a trip similarity search feature for travelers using AstraDB and vector embeddings. The goal is to enhance travel recommendations by identifying trips similar to a user's past preferences or trending selections. By leveraging the powerful capabilities of AstraDB's vector database and OpenAI's advanced embedding techniques, we can offer personalized travel suggestions that are both relevant and timely.

Collaborative filtering extends the recommendations beyond vector-based similarity search. It is a recommender-system method that predicts a user’s preferences from the preferences of other users: it builds a matrix of user preferences (such as ratings or behavioral patterns) and finds similarities between users or items based on that matrix.
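As a rough illustration of the idea, the sketch below recommends items to a user by weighting other users' items with how much their histories overlap. It uses Jaccard similarity on sets of items, which is a simpler stand-in for the matrix factorization used later in this notebook; the user names, destinations, and the `recommend` helper are hypothetical.

```python
from collections import Counter

# Hypothetical interaction data: user -> set of trip products they booked.
interactions = {
    "alice": {"rio", "salvador", "natal"},
    "bob": {"rio", "salvador", "recife"},
    "carol": {"florianopolis"},
}

def jaccard(a, b):
    """Overlap between two item sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, interactions, top_n=3):
    """Score unseen items by the similarity of the users who interacted with them."""
    seen = interactions[user]
    scores = Counter()
    for other, items in interactions.items():
        if other == user:
            continue
        weight = jaccard(seen, items)
        for item in items - seen:
            scores[item] += weight
    # Keep only items backed by at least one similar user.
    return [item for item, score in scores.most_common(top_n) if score > 0]

print(recommend("alice", interactions))
```

Here "bob" shares two of four distinct destinations with "alice", so his unseen destination is suggested, while "carol" has no overlap and contributes nothing.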

We will utilize the Kaggle Travel Dataset (argodatathon2019) which provides rich, detailed travel data. Our process will involve several key phases:

1. **Vector Embedding Generation**: Convert travel data into meaningful vector embeddings using OpenAI’s models. This step involves preprocessing the data to extract useful features and transforming them into numerical vectors that capture the essence of each trip.

2. **AstraDB Setup**: Configure and deploy AstraDB to store and manage these embeddings efficiently. This includes designing a schema that supports quick retrieval and similarity searches within the vector space.

3. **Data Insertion**: Populate AstraDB with the preprocessed and embedded travel data.

4. **Similarity Search Implementation**: Develop a querying mechanism in AstraDB to identify and suggest trips that are most similar to a user's input or preferred travel profile.

5. **Collaborative Filtering Implementation**: Integrate collaborative filtering so that the project leverages both content-based features (from vector embeddings) and user interaction patterns, improving the quality and relevance of the travel recommendations.

6. **Interface and Deployment**: Create a user-friendly interface that allows travelers to find and explore suggested trips, refining the system based on user feedback and interaction patterns. This notebook won't provide the UI.

This notebook will guide you through each of these steps, providing code snippets, explanations, and best practices for setting up a robust and scalable trip similarity search system.

### Quick Links to the queries
- [Go to the Similarity Search Implementation](#similarity_search_implementation)
- [Go to the Collaborative Filtering Implementation](#collaborative_filtering_implementation)


## Prerequisites
- Astra DB database with vector enabled (endpoint and token)
- OpenAI API key
- Kaggle API key (username and token)

## Data Preparation

In [None]:
!pip install -qU pandas==1.5.3 scipy torch==2.1.0 numpy==1.24.4 implicit kaggle astrapy openai

## Determine how you will enter variables
Set use_keys to True to use the Google Colab secrets; otherwise, set use_keys to False to get prompted to enter variable values
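The same switch can be expressed as one small helper. This is only a sketch: `resolve_setting` and its parameters are hypothetical names, and `secrets` stands for any callable that maps a name to a value (in Colab, `userdata.get` could play that role).

```python
import getpass

def resolve_setting(name, use_keys, secrets, prompt=getpass.getpass):
    """Return a setting from `secrets` when use_keys is True, else prompt for it."""
    if use_keys:
        value = secrets(name)
        if not value:
            raise KeyError(f"Secret {name!r} is not set")
        return value
    return prompt(f"Enter {name}: ")

# Example with an in-memory stand-in for the secret store (hypothetical values).
fake_secrets = {"ASTRA_DB_API_ENDPOINT": "https://example.invalid"}.get
print(resolve_setting("ASTRA_DB_API_ENDPOINT", True, fake_secrets))
```

Passing the lookup and the prompt as arguments keeps the helper usable outside Colab and easy to exercise without typing anything.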

In [None]:
# global settings
process_data = False  # set to True to reset the data and generate the vectors
use_keys = True # set to False if you'd like to be prompted for key variables

**Import Python Libraries**

In [None]:
# Standard libraries for file operations, time tracking, and secure user input
import os
import time
import getpass

# Data handling and numerical operations
import numpy as np
import pandas as pd
import scipy.sparse as sparse

# Machine learning and neural network operations
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

# For working with JSON files and random number generation
import json
import random

# Google Colab specific utilities
from google.colab import userdata

# External libraries for specific functionalities
from openai import OpenAI  # for OpenAI-related operations
from astrapy.db import AstraDB, AstraDBCollection  # AstraDB database operations

# Utilities for working with compressed files
import zipfile

import itertools
import implicit
from implicit import evaluation

In [None]:
# Astra variables
token = userdata.get('ASTRA_DB_TOKEN_BASED_PASSWORD')  if use_keys else getpass.getpass(prompt="Enter your AstraDB token: ")
api_endpoint = userdata.get('ASTRA_DB_API_ENDPOINT') if use_keys else input("Enter your AstraDB API endpoint: ")

# Vector variables
vector_model="text-embedding-ada-002"

## Get Kaggle credentials so you can access the travel dataset
Start by logging into Kaggle and opening your profile; the profile button is in the top right corner. You will need your Kaggle username and API key.

In [None]:
kaggle_token = userdata.get('kaggle_key') if use_keys else getpass.getpass("Enter your kaggle token/key: ")
kaggle_username = userdata.get('kaggle_username') if use_keys else input("Enter your username: ")

### Download the Kaggle Travel Dataset (argodatathon2019)

In [None]:
# Ensure the directory exists before writing the file
os.makedirs('/root/.kaggle', exist_ok=True)
os.makedirs('/content/logs', exist_ok=True)

with open('/root/.kaggle/kaggle.json', 'w') as fp:
    fp.write(json.dumps({"username":f"{kaggle_username}","key":f"{kaggle_token}"}))


In [None]:
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d leomauro/argodatathon2019

Dataset URL: https://www.kaggle.com/datasets/leomauro/argodatathon2019
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading argodatathon2019.zip to /content
100% 3.00M/3.00M [00:00<00:00, 121MB/s]


### Extract the downloaded files:

In [None]:
zip_path = '/content/argodatathon2019.zip'
extract_to_dir = '/content/travel-data'  # Specify the directory to extract to

# Ensure the extraction directory exists
os.makedirs(extract_to_dir, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_dir)

print(f'Files extracted to {extract_to_dir}')

Files extracted to /content/travel-data


### Load the files we downloaded and extracted:
The dataset used in this example focuses on flights and hotels.
It contains over one thousand users and about 250 thousand travel records.

In [None]:
# load the travel data
users = pd.read_csv('/content/travel-data/users.csv')
flights = pd.read_csv('/content/travel-data/flights.csv')
hotels = pd.read_csv('/content/travel-data/hotels.csv')

In [None]:
# Function to join the dataframes
def combine_user_data(users_df, hotels_df, flights_df):
    # Join users to their hotel stays, then to the flights for the same trip
    user_hotels = pd.merge(users_df, hotels_df, left_on='code', right_on='userCode')
    combined_df = pd.merge(user_hotels, flights_df, on=['userCode', 'travelCode'], how='inner')

    # Rename the overlapping date columns: date_x comes from hotels, date_y from flights
    combined_df = combined_df.rename(columns={'date_y':'flightDate','date_x':'hotelDate'})

    # Sorting by 'userCode', 'travelCode', and 'date' to ensure outgoing flights come first
    combined_df_sorted = combined_df.sort_values(by=['userCode', 'travelCode', 'flightDate'])

    # Dropping duplicates based on 'userCode' and 'travelCode', keeping the first (which is the outgoing flight)
    combined_df_deduped = combined_df_sorted.drop_duplicates(subset=['userCode', 'travelCode'], keep='first')

    return combined_df_deduped

def get_user_profile_from_row(row):
    # Build the profile string from a row's gender and age_group columns
    return get_user_profile(row['gender'], row['age_group'])

def get_user_profile(gender, age_group):
    # Build the profile string from explicit gender and age group values
    return f"User Profile: gender: {gender}; age_group: {age_group}"

def encode_hotel_price(price):
    if price < 100:
        return 'Low'
    elif 100 <= price < 250:
        return 'Medium'
    else:  # for prices >= 250
        return 'High'

def encode_flight_distance(distance_in_miles):
    if distance_in_miles < 200:
        return 'Short'
    elif 200 <= distance_in_miles <= 750:
        return 'Medium'
    else:  # for distances greater than 750 miles
        return 'Long'

def encode_flight_price(price):
    if price < 200:
        return 'Low'
    elif 200 <= price <= 750:
        return 'Medium'
    else:  # for prices greater than $750
        return 'High'

def create_user_trip_summary(combined_df):
    # Prepare a list to hold data for the new dataframe
    data = []

    # Iterate over each row in the combined dataframe
    for index, row in combined_df.iterrows():
        # Create the user profile for the current row
        profile = get_user_profile_from_row(row)

        # Extract necessary information from the current row
        user_id = row['user_id']
        destination = row['place']
        days = row['days']
        flight_price = encode_flight_price(row['flightPrice'])  # Assuming this is the flight price in the combined_df
        hotel_price = encode_hotel_price(row['hotelPrice'])  # Assuming 'total' is the total hotel price in combined_df
        flight_type = row['flightType']
        distance = encode_flight_distance(row['distance'])
        agency = row['agency']
        user_trip = f"{profile} Destination: {destination} Flight Price: {flight_price} Hotel Price: {hotel_price} Distance: {distance}"

        # Append the extracted information as a new record in the data list
        data.append([user_id, user_trip, profile, destination, days, flight_price, hotel_price, flight_type, distance, agency])

    # Create a new dataframe with the specified columns and the compiled data
    summary_df = pd.DataFrame(data, columns=['user_id', 'user_trip', 'user_profile', 'destination', 'days', 'flight_price', 'hotel_price', 'flight_type', 'distance', 'agency'])

    return summary_df

def create_user_trip_string_from_row(row):
    return f"{row['user_profile']} Destination: {row['destination']} Flight Price: {row['flight_price']} Hotel Price: {row['hotel_price']} Distance: {row['distance']}"

def create_user_trip_string(gender, age_group, destination, flight_price, hotel_price, distance):
    return f"{get_user_profile(gender, age_group)} Destination: {destination} Flight Price: {flight_price} Hotel Price: {hotel_price} Distance: {distance}"

def populate_age_group(data):
    # Define the age bins and corresponding labels for the age groups
    bins = [0, 12, 17, 24, 34, 49, 64, float('inf')]
    labels = ['Children (0-12)', 'Teenagers (13-17)', 'Young Adults (18-24)',
              'Adults (25-34)', 'Middle-Aged Adults (35-49)',
              'Pre-Retirement (50-64)', 'Seniors (65+)']

    # Use pd.cut() to categorize each 'age' value into an 'age_group'.
    # Bins are right-inclusive so the edges match the labels (e.g. age 12 -> 'Children (0-12)').
    data['age_group'] = pd.cut(data['age'], bins=bins, labels=labels, right=True, include_lowest=True)

    # The DataFrame is modified in place; it is also returned for convenience
    return data

def create_user_products(user_trip_summaries):
    # prepare list to hold the user trip strings
    data = []

    # Iterate over each row in the combined dataframe
    for index, row in user_trip_summaries.iterrows():
        product_name = create_user_trip_string_from_row(row)
        data.append([row['user_id'],product_name])

    summary_df = pd.DataFrame(data, columns=['user_id','product_name'])

    return summary_df

def generate_random_user_trip():
    genders = ['Male', 'Female']
    age_groups = ['Children (0-12)', 'Teenagers (13-17)', 'Young Adults (18-24)', 'Adults (25-34)', 'Middle-Aged Adults (35-49)', 'Pre-Retirement (50-64)', 'Seniors (65+)']
    destinations = ['Paris', 'New York', 'Tokyo', 'Sydney', 'Cape Town']
    flight_prices = ['Low', 'Medium', 'High']
    hotel_prices = ['Low', 'Medium', 'High']
    distances = ['Short', 'Medium', 'Long']

    # Randomly select one value from each list
    gender = random.choice(genders)
    age_group = random.choice(age_groups)
    destination = random.choice(destinations)
    flight_price = random.choice(flight_prices)
    hotel_price = random.choice(hotel_prices)
    distance = random.choice(distances)

    # Use the create_user_trip_string function to generate the trip description
    return create_user_trip_string(gender, age_group, destination, flight_price, hotel_price, distance)

def products_bought_by_user_in_the_past(user_id: int, top: int = 10):

    selected = data[data.user_id == user_id].sort_values(by=['total_orders'], ascending=False)

    selected['product_name'] = selected['product_id'].map(products_lookup.set_index('product_id')['product_name'])
    selected = selected[['product_id', 'product_name', 'total_orders']].reset_index(drop=True)
    if selected.shape[0] < top:
        return selected

    return selected[:top]

def get_embedding(text, model=vector_model):
    # `client` is the OpenAI client created later in the notebook
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

In [None]:
# combine_user_data, populate_age_group, create_user_trip_summary

# Step 1: Combine data from users, hotels, and flights into a single DataFrame
user_trips = combine_user_data(users, hotels, flights)

# Step 2: Rename columns for clarity and consistency
# (combine_user_data has already renamed the date columns)
user_trips = user_trips.rename(columns={
    'code': 'user_id',
    'price_y': 'flightPrice',
    'total': 'hotelPrice'
})

# Step 3: Remove duplicates to ensure each trip is represented once per user
# (a safeguard; combine_user_data already deduplicates on the same keys)
user_trips = user_trips.drop_duplicates(subset=['userCode', 'travelCode'], keep='first')

# Step 4: Categorize users into age groups
user_trips = populate_age_group(user_trips)

# Step 5: Create detailed trip summaries for each user
user_trip_summaries = create_user_trip_summary(user_trips)

# Step 6: Preview the first few entries of the transformed data
print(user_trip_summaries.head())


   user_id                                          user_trip  \
0        0  User Profile: gender: male; age_group: Young A...   
1        0  User Profile: gender: male; age_group: Young A...   
2        0  User Profile: gender: male; age_group: Young A...   
3        0  User Profile: gender: male; age_group: Young A...   
4        0  User Profile: gender: male; age_group: Young A...   

                                        user_profile         destination  \
0  User Profile: gender: male; age_group: Young A...  Florianopolis (SC)   
1  User Profile: gender: male; age_group: Young A...       Salvador (BH)   
2  User Profile: gender: male; age_group: Young A...       Salvador (BH)   
3  User Profile: gender: male; age_group: Young A...       Salvador (BH)   
4  User Profile: gender: male; age_group: Young A...  Florianopolis (SC)   

   days flight_price hotel_price flight_type distance       agency  
0     4         High        High  firstClass   Medium  FlyingDrops  
1     2       

## Set up the AstraDB collections

In [None]:
# Initialize the AstraDB client
astrapy_db = AstraDB(
    token=token,
    api_endpoint=api_endpoint
)

In [None]:
# Collection names and vector dimensions
vector_collection_name = "user_trips_vector_collection"
vector_factors = 1536  # dimension of text-embedding-ada-002 embeddings
vector_model = "text-embedding-ada-002"
cf_collection_name = "user_trips_collaborative_filtering_collection"
cf_factors = 100  # must match the number of ALS factors used below

if process_data:
    # Delete any existing collections so the data can be rebuilt
    astrapy_db.delete_collection(vector_collection_name)
    astrapy_db.delete_collection(cf_collection_name)

    # Create the collections
    vector_collection = astrapy_db.create_collection(vector_collection_name, dimension=vector_factors)
    cf_collection = astrapy_db.create_collection(cf_collection_name, dimension=cf_factors)
else:
    # Connect to the existing collections
    vector_collection = AstraDBCollection(vector_collection_name, astra_db=astrapy_db)
    cf_collection = AstraDBCollection(cf_collection_name, astra_db=astrapy_db)

## Load Data for similarity searches

In [None]:
client = OpenAI(api_key=userdata.get('openai_api_key'))

In [None]:
user_trip_summaries.shape

(40552, 10)

In [None]:
# Embed the user_trip field and store the result in the reserved $vector field
if process_data:
    #user_trip_summaries = user_trip_summaries.head(1000)  # reduce the dataset to the top 1,000 rows
    user_trip_list = user_trip_summaries.to_dict(orient='records')
    for record in user_trip_list:
        try:
            record['$vector'] = get_embedding(record['user_trip'])
            vector_collection.insert_one(record)
        except Exception as e:
            # Report failures such as network errors or API limits and move on to the next record
            print(f"An error occurred: {e}")
            record['$vector'] = None

# Prepare data for Collaborative Filtering
In this example, the collaborative filtering model relies solely on users' historical preferences across a range of items. Because our dataset lacks explicit ratings, the number of times a user booked a given trip profile serves as a proxy for "confidence," indicating the strength of the interaction between that user and that product.

This data will be stored in a dataframe, forming the foundation for the model.
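A minimal sketch of the counting step, using the standard library instead of a dataframe. The `(user_id, product_id)` events below are made-up values; the name `total_orders` mirrors the column used in this notebook.

```python
from collections import Counter

# Hypothetical (user_id, product_id) events, one per booked trip.
events = [(0, 10), (0, 10), (0, 11), (1, 10), (1, 12), (1, 12), (1, 12)]

# Count repeated (user, product) pairs; the count stands in for a rating.
total_orders = Counter(events)

for (user_id, product_id), count in sorted(total_orders.items()):
    print(user_id, product_id, count)
```

A pair that appears more often is treated as a stronger signal, which is the role the confidence value plays when the model is trained.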

In [None]:
# Build one product string per user trip
user_products = create_user_products(user_trip_summaries)
user_products.head()
# Create a lookup table of unique trip products
travel_products = user_products['product_name'].drop_duplicates().reset_index(drop=True)
# Assign a unique, zero-based product_id to each product
travel_products = travel_products.reset_index().rename(columns={'index': 'product_id'})

# Map each user trip to its product_id
data = user_products.merge(travel_products, on='product_name', how='left').drop('product_name', axis=1)
#data.head()

# create the summary table
data = data.groupby(['user_id', 'product_id']).size().reset_index(name='total_orders')
data.product_id = data.product_id.astype('int64')

# renaming the travel_products table to products_lookup to reduce changing the example code. Travel Trips --> Products
products_lookup = travel_products
#products_lookup.head()

# rename some fields in the user table
users = users.rename(columns={'code':'user_id'})
users = populate_age_group(users)
#users.head()

# print the shapes of the products_lookup and data tables
print(f"Unique products (rows,cols): {products_lookup.shape} unique trips (rows,cols): {data.shape}")

Unique products (rows,cols): (674, 2) unique trips (rows,cols): (18970, 3)


Next, we extract the unique user and item identifiers needed to create a CSR (Compressed Sparse Row) matrix.
Credit for this snippet goes to [miararoy](https://github.com/pinecone-io/examples/blob/master/learn/recommendation/product-recommender/product_recommender.ipynb).
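To make the CSR layout concrete, here is a dependency-free sketch that builds the three CSR arrays (`data`, `indices`, `indptr`) from `(row, col, value)` triplets. The helpers `to_csr` and `row_entries` are hypothetical, and the sketch assumes each `(row, col)` pair appears at most once; it is meant to show the structure, not to replace `scipy.sparse.csr_matrix`.

```python
def to_csr(triplets, n_rows):
    """Build CSR arrays (data, indices, indptr) from (row, col, value) triplets."""
    ordered = sorted(triplets)              # group entries by row, then column
    data = [value for _, _, value in ordered]
    indices = [col for _, col, _ in ordered]
    indptr = [0] * (n_rows + 1)
    for row, _, _ in ordered:
        indptr[row + 1] += 1                # count the entries in each row
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]          # prefix sum gives row start offsets
    return data, indices, indptr

def row_entries(csr, row):
    """Return the (col, value) pairs stored for one row."""
    data, indices, indptr = csr
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], data[start:end]))

# Hypothetical (user, product, total_orders) triplets.
csr = to_csr([(0, 1, 3), (2, 0, 1), (0, 2, 4)], n_rows=3)
print(csr)
print(row_entries(csr, 0))
```

Only the non-zero values are stored, and `indptr` records where each row begins, so reading one user's interactions is a single slice.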

In [None]:
users = list(np.sort(users.user_id.unique()))
items = list(np.sort(products_lookup.product_id.unique()))
purchases = list(data.total_orders)

# map zero-based index position -> user ID
index_to_user = pd.Series(users)

# reverse mapping: user ID -> index position (offset by one)
user_to_index = pd.Series(data=index_to_user.index + 1, index=index_to_user.values)

# map zero-based index position -> item ID
index_to_item = pd.Series(items)

# reverse mapping: item ID -> index position
item_to_index = pd.Series(data=index_to_item.index, index=index_to_item.values)

# Get the rows and columns for our new matrix
products_rows = data.product_id.astype(int)
users_cols = data.user_id.astype(int)

# Create a sparse matrix for our users and products containing number of purchases
sparse_product_user = sparse.csr_matrix((purchases, (products_rows, users_cols)), shape=(len(items) + 1, len(users) + 1))
sparse_product_user.data = np.nan_to_num(sparse_product_user.data, copy=False)

sparse_user_product = sparse.csr_matrix((purchases, (users_cols, products_rows)), shape=(len(users) + 1, len(items) + 1))
sparse_user_product.data = np.nan_to_num(sparse_user_product.data, copy=False)

## Implicit Model
In this section we create and train a recommender model using the **implicit** library. The model is based on the algorithms described in the paper [Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets) with performance optimizations described in [Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.379.6473&rep=rep1&type=pdf).
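For reference, the objective from the first paper can be written as follows. Here $r_{ui}$ is the observed interaction count (`total_orders` in this notebook), $p_{ui}$ is the binary preference, $c_{ui}$ is the confidence, $x_u$ and $y_i$ are the user and item factors, and $\lambda$ is the regularization weight. This is the paper's formulation; how the **implicit** library applies `alpha_val` and `regularization` internally is not shown in this notebook and is not assumed here.

$$
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}, \qquad c_{ui} = 1 + \alpha\, r_{ui}
$$

$$
\min_{x_*,\, y_*} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2 + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)
$$

Alternating least squares minimizes this by fixing the item factors and solving for the user factors, then fixing the user factors and solving for the item factors, and repeating.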

In [None]:
#split data into train and test sets
train_set, test_set = evaluation.train_test_split(sparse_user_product, train_percentage=0.9)

factors = 100

# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=factors,
                                             regularization=0.05,
                                             iterations=50,
                                             num_threads=1)

alpha_val = 15
train_set = (train_set * alpha_val).astype('double')

# train the model on a sparse user/item matrix of confidence weights
model.fit(train_set, show_progress = True)


In [None]:
# Evaluate the model using the library's built-in ranking metrics (may take a few minutes)
test_set = (test_set * alpha_val).astype('double')
evaluation.ranking_metrics_at_k(model, train_set, test_set, K=100,
                         show_progress=True, num_threads=1)


{'precision': 0.9745395449620802,
 'map': 0.4592624231974912,
 'ndcg': 0.6059004027778536,
 'auc': 0.976092454391791}

This is what item and user factors look like. These vectors will be stored in our vector index later and used for recommendation.
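The factors become recommendations through a dot product: an item scores highly for a user when their vectors point in similar directions. The sketch below ranks items for one user with tiny made-up vectors; `top_items` is a hypothetical helper, not part of the **implicit** library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_items(user_vector, item_factors, k=2, exclude=()):
    """Rank items by the dot product of the user vector and each item vector."""
    scores = [
        (dot(user_vector, item_vector), item_id)
        for item_id, item_vector in enumerate(item_factors)
        if item_id not in exclude
    ]
    scores.sort(reverse=True)
    return [(item_id, score) for score, item_id in scores[:k]]

# Hypothetical 2-dimensional factors; the model in this notebook uses 100.
user_vector = [0.5, 1.0]
item_factors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_items(user_vector, item_factors, k=2, exclude={2}))
```

The `exclude` argument shows one common design choice: items the user has already booked can be left out of the ranking.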

In [None]:
model.item_factors[1:3]

array([[-0.01869054,  0.01624159,  0.01445023, -0.01471141, -0.01752022,
        -0.03099159, -0.0045502 ,  0.01801882,  0.01576493,  0.01532222,
         0.00287872, -0.00283497, -0.04835457, -0.01986769, -0.00463835,
         0.02314002, -0.0124543 ,  0.00029302, -0.01941025,  0.00919497,
         0.00468089, -0.00799547,  0.00281197,  0.01179938,  0.00473643,
         0.03054055, -0.00940915,  0.00585659,  0.00819855,  0.02541376,
        -0.00571563,  0.01245054,  0.01300477, -0.00661199,  0.01853507,
        -0.01769906,  0.01168907, -0.01397778,  0.00521798,  0.00477376,
         0.02675454,  0.0043005 ,  0.01148617,  0.00259627,  0.00454474,
        -0.014676  , -0.00043504, -0.0215588 , -0.0093579 ,  0.00651564,
         0.01726513,  0.00127685,  0.00256362, -0.01637328, -0.01024396,
         0.01700275,  0.00831247,  0.00215778,  0.00491612, -0.0114581 ,
         0.02117233, -0.0243272 , -0.01256597, -0.02053637, -0.01829091,
         0.02650644, -0.01280063, -0.00946465, -0.0

In [None]:
model.user_factors[1:3]

array([[ 0.18877874, -0.2078289 ,  0.1875522 ,  0.66935563,  0.03368911,
        -0.24832201, -0.02690276,  0.18913196,  0.20950755,  0.18112655,
         0.22417712,  0.15185118,  0.08453239,  0.32291588, -0.09702265,
         0.1488636 , -0.16453679,  0.16499755,  0.1871986 , -0.17420869,
         0.27767295, -0.22264476, -0.323834  , -0.04821065, -0.18742846,
         0.04331789, -0.15769868, -0.01048369,  0.15476489,  0.14720456,
         0.30790466, -0.03473528,  0.27523896, -0.33198532,  0.15769248,
        -0.0963456 , -0.36345157,  0.2766423 ,  0.255034  ,  0.29527998,
        -0.08704789,  0.10603929,  0.6572821 , -0.21875231,  0.16968183,
         0.1364574 ,  0.02857736,  0.00440361, -0.09264286,  0.04441794,
         0.2432849 , -0.15182377,  0.02409743, -0.48702347, -0.1244401 ,
        -0.42523366,  0.42168048,  0.00580856,  0.15157332,  0.11410328,
        -0.15036887,  0.2343768 , -0.3716784 , -0.12161482,  0.39719427,
         0.27621576,  0.04035279, -0.16038208, -0.1

In [None]:
model.user_factors.shape

(1341, 100)

# Load data for Collaborative Filtering
Here we upload all items (the products, i.e. the user trips one can book) and display an example of a product with its vector representation.

In [None]:
# Get all of the items
all_items_titles = [{'title': title} for title in products_lookup['product_name']]
all_items_ids = [str(product_id) for product_id in products_lookup['product_id']]

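# Transform items into factors
items_factors = model.item_factors

# product_id is zero-based, so row i of item_factors belongs to product_id i;
# the final row is the unused padding row created by the +1 in the matrix shape
item_embeddings = items_factors[:len(all_items_ids)].tolist()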

# Prepare item factors for upload
items_to_insert = list(zip(all_items_ids, item_embeddings, all_items_titles))

In [None]:
display(items_to_insert[1])

In [None]:
# Insert records into the AstraDB collection
if process_data:
    from tqdm.auto import tqdm
    # Iterate through the prepared items and insert each one
    for item_id, embedding, item_data in tqdm(items_to_insert):
        # item_id is unique and used as _id; item_data is only {'title': '...'}
        try:
            cf_collection.insert_one({"_id": item_id, "$vector": embedding, **item_data})
        except Exception as e:
            print(f"Exception: {e}")  # We can ignore any 0 vectors.

In [None]:
data.head()

   user_id  product_id  total_orders
0        0           0             3
1        0           1             4
2        0           2             3
3        0           3             2
4        0           4             3


<a name="similarity_search_implementation"></a>
# Similarity Search Implementation

This section of the notebook demonstrates the core functionality of our trip similarity search system. By leveraging the vector database capabilities of AstraDB and the computational power of vector embeddings, we will showcase various scenarios that illustrate how this system can be applied to real-world travel recommendation tasks.

We will explore several practical examples that highlight the flexibility and effectiveness of our approach:
1. **Retrieve a User Profile**: Fetching a stored document with a traditional exact-match filter on the `user_trip` string.
2. **Profile Search by Attributes**: Searching for profiles using key factors such as gender, age group, trip destination, and price preferences.
3. **Similar Trips from Attributes**: Finding trips that align with a set of randomly generated user trip attributes.
4. **Natural Language Search**: Utilizing natural language descriptions to find trips that match user preferences described in plain English.

Each example will be supported by detailed explanations and Python code snippets, demonstrating the querying capabilities and how they can be used to enhance user experience by providing personalized travel recommendations.
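Conceptually, a vector search ranks stored documents by how close their vectors are to the query vector. The brute-force sketch below does this locally with cosine similarity over tiny made-up vectors. `vector_search` is a hypothetical helper, and the `$vector` and `$similarity` keys only mimic the field names seen in this notebook's results; the database performs the search server-side, and its reported scores may be computed or scaled differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def vector_search(query, documents, limit=2):
    """Return the `limit` documents whose '$vector' is closest to the query."""
    scored = [
        {**doc, "$similarity": cosine_similarity(query, doc["$vector"])}
        for doc in documents
    ]
    return sorted(scored, key=lambda d: d["$similarity"], reverse=True)[:limit]

# Hypothetical 3-dimensional embeddings; text-embedding-ada-002 vectors have 1536.
documents = [
    {"destination": "Rio de Janeiro (RJ)", "$vector": [1.0, 0.0, 0.0]},
    {"destination": "Natal (RN)", "$vector": [0.0, 1.0, 0.0]},
    {"destination": "Salvador (BH)", "$vector": [0.9, 0.1, 0.0]},
]
for hit in vector_search([1.0, 0.0, 0.0], documents):
    print(hit["destination"], round(hit["$similarity"], 3))
```

Because the comparison is made on embeddings rather than exact strings, a query that is merely close to a stored trip still returns it near the top.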


In [None]:
# Find a user's profile based on a user trip (Traditional search or filtering method)
document = vector_collection.find_one(filter={"user_trip":"User Profile: gender: female; age_group: Middle-Aged Adults (35-49) Destination: Rio de Janeiro (RJ) Flight Price: Medium Hotel Price: Medium Distance: Medium"})
print(document)

{'data': {'document': {'_id': '5f90d8ca-bfd8-4b81-90d8-cabfd82b81ec', 'user_id': 2, 'user_trip': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49) Destination: Rio de Janeiro (RJ) Flight Price: Medium Hotel Price: Medium Distance: Medium', 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'days': 1, 'flight_price': 'Medium', 'hotel_price': 'Medium', 'flight_type': 'economic', 'distance': 'Medium', 'agency': 'CloudFy', '$vector': [-0.012503329664468765, 0.002725820289924741, -0.020294280722737312, -0.009823080152273178, -0.003341872710734606, 0.002081075217574835, -0.02666746824979782, -0.022333160042762756, -0.02123945765197277, -0.013212211430072784, -0.013988606631755829, 0.006585851777344942, -0.012894902378320694, -0.01725621521472931, -0.00038608754402957857, 0.0005742788780480623, 0.014987792819738388, -0.018039360642433167, 0.012003736570477486, -0.01304342970252037, -0.0172832198441028

In [None]:
# Find a user's profile based on a user trip string that's vectorized (Uses similarity search and vectors)
query_vector_string = create_user_trip_string('female','Middle-Aged Adults','Rio de Janeiro (RJ)','Medium','Medium','Medium')
query_vector = get_embedding(query_vector_string)

documents = vector_collection.vector_find(
    query_vector,
    limit=10,
    fields=["user_id","user_profile","destination", "flight_price","hotel_price","distance"],
)
for document in documents:
    print(f"\n similarity: {document['$similarity']} - {document}")


 similarity: 0.9983452 - {'_id': '26adb7b9-3089-4cb7-adb7-b93089ccb7e1', 'user_id': 1255, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9983452}

 similarity: 0.9983452 - {'_id': '47f57d38-17b2-499e-b57d-3817b2f99e27', 'user_id': 1255, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9983452}

 similarity: 0.9983452 - {'_id': '06ca5870-8a1e-47e1-8a58-708a1e87e1d9', 'user_id': 835, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9983452}

 similarity: 0.99834144 - {'_id': 'a350d25e-73c5-4398-9

In [None]:
# Find a user's profile based on a Random user trip profile that's vectorized
query_vector_string = generate_random_user_trip()
print(f"The random query string: {query_vector_string}")
query_vector = get_embedding(query_vector_string)

documents = vector_collection.vector_find(
    query_vector,
    limit=10,
    fields=["user_id","user_profile","destination", "flight_price","hotel_price","distance"],
)
for document in documents:
    print(f"\n similarity: {document['$similarity']} - {document}")

The random query string: User Profile: gender: Female; age_group: Pre-Retirement (50-64) Destination: New York Flight Price: High Hotel Price: Low Distance: Long

 similarity: 0.9779643 - {'_id': '6e197068-2286-4b94-9970-682286ab9481', 'user_id': 11, 'user_profile': 'User Profile: gender: female; age_group: Pre-Retirement (50-64)', 'destination': 'Natal (RN)', 'flight_price': 'High', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9779643}

 similarity: 0.9779643 - {'_id': '49d0dd54-fc34-4238-90dd-54fc34423808', 'user_id': 1036, 'user_profile': 'User Profile: gender: female; age_group: Pre-Retirement (50-64)', 'destination': 'Natal (RN)', 'flight_price': 'High', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9779643}

 similarity: 0.9779643 - {'_id': 'e6e9870b-de31-42cf-a987-0bde3172cf64', 'user_id': 882, 'user_profile': 'User Profile: gender: female; age_group: Pre-Retirement (50-64)', 'destination': 'Natal (RN)', 'flight_price': 'High', 'hotel_price':

In [None]:
# Find trips based on a natural language query
query_vector_string = "An older gentleman who likes inexpensive trips that are short"
print(f"The query string: {query_vector_string}")
query_vector = get_embedding(query_vector_string)

documents = vector_collection.vector_find(
    query_vector,
    limit=10,
    fields=["user_id","user_profile","destination", "flight_price","hotel_price","distance"],
)
for document in documents:
    print(f"\n similarity: {document['$similarity']} - {document}")

The query string: An older gentleman who likes inexpensive trips that are short

 similarity: 0.90976644 - {'_id': '8d295961-7556-49c3-a959-617556b9c32d', 'user_id': 12, 'user_profile': 'User Profile: gender: male; age_group: Pre-Retirement (50-64)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Short', '$similarity': 0.90976644}

 similarity: 0.90976644 - {'_id': 'b4223d3d-2914-41d2-a23d-3d291481d232', 'user_id': 273, 'user_profile': 'User Profile: gender: male; age_group: Pre-Retirement (50-64)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Short', '$similarity': 0.90976644}

 similarity: 0.90976644 - {'_id': 'b066d4a2-0b72-4824-a6d4-a20b727824e2', 'user_id': 274, 'user_profile': 'User Profile: gender: male; age_group: Pre-Retirement (50-64)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Short', '$similarity':

In [None]:
# Find trips based on a natural language query
query_vector_string = "A thirty-something woman who likes middle distanced reasonably priced trips"
print(f"The query string: {query_vector_string}")
query_vector = get_embedding(query_vector_string)

documents = vector_collection.vector_find(
    query_vector,
    limit=10,
    fields=["user_id","user_profile","destination", "flight_price","hotel_price","distance"],
)
for document in documents:
    print(f"\n similarity: {document['$similarity']} - {document}")

The query string: A thirty-something woman who likes middle distanced reasonably priced trips

 similarity: 0.9195458 - {'_id': '9cc85738-6fcc-4202-8857-386fccc20257', 'user_id': 1314, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'Medium', 'distance': 'Medium', '$similarity': 0.9195458}

 similarity: 0.9194253 - {'_id': '4cef41ad-da41-49c9-af41-adda4149c978', 'user_id': 1295, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'High', 'distance': 'Medium', '$similarity': 0.9194253}

 similarity: 0.9194253 - {'_id': 'ac9afc83-2e27-4bea-9afc-832e27cbea49', 'user_id': 1295, 'user_profile': 'User Profile: gender: female; age_group: Middle-Aged Adults (35-49)', 'destination': 'Rio de Janeiro (RJ)', 'flight_price': 'Medium', 'hotel_price': 'High', 'dista
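
To make the `$similarity` ranking in the cells above more concrete, here is a minimal brute-force sketch of nearest-neighbor search. It assumes cosine similarity as the metric; the metric configured on `vector_collection` is not shown in this section, and the score the database reports may be scaled differently from the raw cosine. The toy vectors and the `top_k` helper are hypothetical and are not part of the notebook or of the AstraDB client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=10):
    """Return (score, document) pairs for the k documents most similar to query."""
    scored = [(cosine_similarity(query, doc["vector"]), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors (hypothetical; real embeddings have far more dimensions)
toy_documents = [
    {"_id": "a", "vector": [1.0, 0.0, 0.0]},
    {"_id": "b", "vector": [0.9, 0.1, 0.0]},
    {"_id": "c", "vector": [0.0, 1.0, 0.0]},
]

toy_results = top_k([1.0, 0.0, 0.0], toy_documents, k=2)
for score, doc in toy_results:
    print(f"similarity: {score:.4f} - {doc['_id']}")
```

A vector database avoids this full scan by using an index, but the ordering it returns follows the same idea: documents whose vectors point in nearly the same direction as the query vector come first.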

<a name="collaborative_filtering_implementation"></a>
## Collaborative Filtering Queries - Find recommended user trips
Collaborative filtering recommends trips based on user similarity and past interactions, whereas vector search uses the semantic content of trip and user data to find trips that are contextually similar to a user's interests. Collaborative filtering can also weight a user's trips, for example by how often the user visits a particular location. This weighting refines the recommendations according to the strength or frequency of each preference.
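
As an illustration of frequency weighting, the sketch below builds one vector per user from visit counts per destination and compares users with cosine similarity. It is a simplified example under stated assumptions: the `trips` log, `build_user_vectors`, and `cosine` are hypothetical names, and this section does not show how the vectors stored in `cf_collection` were actually produced.

```python
import math
from collections import Counter

# Hypothetical trip log: one (user_id, destination) pair per trip taken
trips = [
    (1, "Rio de Janeiro (RJ)"), (1, "Rio de Janeiro (RJ)"), (1, "Natal (RN)"),
    (2, "Rio de Janeiro (RJ)"), (2, "Rio de Janeiro (RJ)"),
    (2, "Rio de Janeiro (RJ)"), (2, "Natal (RN)"),
    (3, "Natal (RN)"), (3, "Natal (RN)"), (3, "Natal (RN)"),
]

def build_user_vectors(trip_log):
    """Map each user to a vector of visit counts, one entry per destination."""
    destinations = sorted({dest for _, dest in trip_log})
    users = sorted({user for user, _ in trip_log})
    counts = Counter(trip_log)  # missing (user, destination) pairs count as 0
    vectors = {u: [counts[(u, d)] for d in destinations] for u in users}
    return vectors, destinations

def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

user_vectors, destinations = build_user_vectors(trips)
sim_1_2 = cosine(user_vectors[1], user_vectors[2])
sim_1_3 = cosine(user_vectors[1], user_vectors[3])
print(f"user 1 vs user 2: {sim_1_2:.3f}")
print(f"user 1 vs user 3: {sim_1_3:.3f}")
```

Users 1 and 2 both travel mostly to Rio de Janeiro, so their count vectors are closer to each other than either is to user 3, who only visits Natal. Without the counts (a plain visited or not-visited flag), users 1 and 2 would look identical and the difference in how strongly each prefers Rio would be lost.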

In [None]:
# Look up a stored vector by _id, then find the documents with the most similar vectors
document = cf_collection.find_one(filter={"_id":"26"})
# Note: time.process_time() counts CPU time only, so time spent waiting on the database is not included
start_time = time.process_time()
user_embedding = document['data']['document']['$vector']
query_results = cf_collection.vector_find(
    vector=user_embedding,
    limit=10,
)
print("Time needed for retrieving recommended trips using AstraDB: " + str(time.process_time() - start_time) + ' seconds.\n')
print(query_results)
for result in query_results:
    # .get() avoids a KeyError when a document has no 'title' field
    print(f"id={result['_id']} | title={result.get('title')} | similarity={result['$similarity']}")

print('\r') # add a blank line

Time needed for retrieving recommended trips using AstraDB: 0.006238177000000178 seconds.

[{'_id': '26', '$vector': [0.02385653369128704, -0.033946532756090164, -0.017650935798883438, -0.04332209378480911, 0.02083425596356392, -0.018534818664193153, 0.020569683983922005, -0.026660744100809097, 0.0256964024156332, -0.01084749773144722, -0.022798756137490273, -0.044192757457494736, 0.02369656041264534, -0.007521080318838358, 0.0634937733411789, -0.007625754922628403, -0.024149693548679352, -0.01254771463572979, 0.006926544941961765, 0.015137676149606705, -0.015155146829783916, -0.005329479463398457, -0.015582157298922539, 0.009699961170554161, -0.007862324826419353, -0.026955444365739822, 0.010041138157248497, 0.0322384387254715, 0.0028038881719112396, -0.04351278021931648, 0.01609230786561966, -0.014077933505177498, -0.026340875774621964, 0.010667816735804081, 0.021877044811844826, 0.025248365476727486, 0.021565914154052734, 0.05605557933449745, -0.02739633619785309, 0.0022363376338
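
The printed results begin with `_id` `'26'`, the same document whose vector was used as the query, which is expected because a vector is always its own nearest neighbor. A small post-processing step can drop that self-match before the results are treated as recommendations. The helper below is a hypothetical sketch, not part of the notebook or the AstraDB client, and the sample data only mimics the shape of the documents printed above.

```python
def exclude_self(results, query_id):
    """Drop the query document from a list of nearest-neighbor results."""
    return [doc for doc in results if doc["_id"] != query_id]

# Hypothetical results in the same shape as the printed documents
sample_results = [
    {"_id": "26", "$similarity": 1.0},
    {"_id": "311", "$similarity": 0.97},
    {"_id": "42", "$similarity": 0.95},
]

neighbors = exclude_self(sample_results, query_id="26")
for doc in neighbors:
    print(f"id={doc['_id']} | similarity={doc['$similarity']}")
```

If exactly ten neighbors are needed after filtering, one option is to request one extra result (for example `limit=11`) so the list is still full once the self-match is removed.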

# Project Summary

This notebook details the creation of a trip similarity search system using AstraDB and vector embeddings from OpenAI. Our objective is to enhance the travel planning experience by enabling users to find trips that closely match their preferences or past travel experiences. In this example we used [AstraDB](https://astra.datastax.com/) to build and deploy, relatively quickly, a trip recommendation engine that uses vector search.

## Key Components:
- **Vector Embeddings**: Utilizing OpenAI's advanced models, we generate embeddings from travel data obtained from the Kaggle Travel Dataset (argodatathon2019). These embeddings capture the essential features of each trip.
  
- **AstraDB Configuration**: We set up and configure AstraDB to store and efficiently retrieve these embeddings, leveraging its robust vector database capabilities to facilitate fast and scalable similarity searches.

- **Data Management**: The notebook walks through inserting and managing travel data within AstraDB, ensuring data integrity and accessibility.

- **Similarity Searches**: We implement several search functionalities, demonstrating the system's capability to find similar trips based on user profiles, specific trip attributes, and natural language descriptions.

- **User Interface Integration**: The final step involves creating a user-friendly interface that allows travelers to interact with the system, inputting their preferences and receiving personalized trip recommendations.

This project aims to showcase how vector databases and AI-driven embeddings can revolutionize the way we approach personalized travel recommendations, making the process more intuitive and user-centric.

