# Recommendation Systems

Recommendation systems have revolutionized the way we discover and explore new things. These intelligent systems utilize sophisticated algorithms and data analysis to understand individual preferences and provide personalized recommendations.
<br><br>
By analyzing user data, such as **browsing history**, **purchase patterns**, and **social interactions**, recommendation systems can effectively predict and suggest items that align with users' interests. Whether it's suggesting a new movie to watch, a book to read, or a product to buy, these systems streamline decision-making and enhance the overall user experience.
<br>
With their ability to uncover hidden gems and introduce users to exciting possibilities, recommendation systems have become invaluable tools in navigating the overwhelming abundance of choices in today's digital landscape.

### Recommendation Systems and Vector Databases

In recommendation systems, understanding the similarity between users and items is crucial for generating accurate and personalized recommendations. By leveraging vector databases, these systems can store and organize user and item vectors, which capture the essential characteristics and preferences associated with each user and item.
<br><br>
The vector database employs <a href="https://www.pinecone.io/learn/vector-database/#:~:text=a%20vector%20database.-,Algorithms,-Several%20algorithms%20can">advanced indexing techniques</a> to enable fast retrieval of **similar users or items** based on their **vector representations**. This allows recommendation systems to efficiently process *large-scale datasets* and identify meaningful connections, leading to more precise and relevant recommendations.
<br><br>
By harnessing the power of vector databases, such as **Pinecone**, recommendation systems can optimize their performance, enhance user satisfaction, and deliver tailored experiences that align with individual preferences.
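
To make the idea of vector similarity concrete, here is a minimal sketch of an exhaustive cosine-similarity search over a handful of toy vectors. The item names, the vectors, and the helper functions are invented for illustration. A vector database answers the same kind of question using indexes rather than a full scan, so this shows the concept, not how Pinecone is implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, items, k=2):
    """Return the k (item_id, score) pairs most similar to the query vector."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-dimensional vectors, invented for illustration
item_vectors = {
    "action-movie": [1.0, 0.0],
    "superhero-movie": [0.9, 0.1],
    "romance-movie": [0.0, 1.0],
}
user_vector = [1.0, 0.05]

recommendations = top_k_similar(user_vector, item_vectors, k=2)
print(recommendations)
```

The two items whose vectors point in nearly the same direction as the user vector rank first, while the orthogonal one is left out.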
<br><br><br>
Let's take a look at how we can implement one of those use cases!

We start by installing all necessary libraries.

In [6]:
!pip install -qU \
    "pinecone-client[grpc]"==2.2.1 \
    pinecone-datasets==0.5.1 \
    transformers==4.30.2 \
    tensorflow==2.11.1

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preparation

<img alt="Onboarding recommender diagram" src="https://raw.githubusercontent.com/pinecone-io/examples/master/recommendation/onboarding-recommender/assets/onboarding_recommender_data_flow.jpg"  width="70%">

#### Downloading the Dataset

We will download a pre-embedded dataset from `pinecone-datasets`, which allows us to skip the embedding and any other preprocessing steps.
<br><br>
When working with your own dataset, you will need to perform this embedding step yourself, but here the embeddings are prebuilt so we can jump right to the action.

In [10]:
from pinecone_datasets import load_dataset

dataset_name = "movielens-user-ratings"
dataset = load_dataset(dataset_name)
dataset.head()

|   | id | values | sparse_values | metadata | blob |
|---|----|--------|---------------|----------|------|
| 0 | tt5027774 | [-0.12388430535793304, 0.23021861910820007, -0... |  |  | {'imdb_id': 'tt5027774', 'movie_id': 6705, 'po... |
| 1 | tt5463162 | [0.008479624055325985, 0.3665461540222168, -0.... |  |  | {'imdb_id': 'tt5463162', 'movie_id': 7966, 'po... |
| 2 | tt4007502 | [-0.0022702165879309177, 0.5886886715888977, -... |  |  | {'imdb_id': 'tt4007502', 'movie_id': 1614, 'po... |
| 3 | tt4209788 | [0.08350061625242233, 0.4322584867477417, -0.2... |  |  | {'imdb_id': 'tt4209788', 'movie_id': 7022, 'po... |
| 4 | tt2948356 | [-0.1614755392074585, 0.41389355063438416, -0.... |  |  | {'imdb_id': 'tt2948356', 'movie_id': 3571, 'po... |


In [11]:
len(dataset)

970582

#### Reformatting the Dataset

A dataset loaded from `pinecone-datasets` always contains the columns `id`, `values`, `sparse_values`, `metadata`, and `blob`. All we need are the IDs, the vector embeddings (stored in `values`), and some metadata (which is actually stored in `blob`). Let's reformat the dataset so it is ready to add to Pinecone: we drop `sparse_values` and the empty `metadata` column, as neither is needed for this example, and rename `blob` to `metadata`.


In [12]:
dataset.documents.drop(['sparse_values', 'metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

dataset.head()

|   | id | values | metadata |
|---|----|--------|----------|
| 0 | tt5027774 | [-0.12388430535793304, 0.23021861910820007, -0... | {'imdb_id': 'tt5027774', 'movie_id': 6705, 'po... |
| 1 | tt5463162 | [0.008479624055325985, 0.3665461540222168, -0.... | {'imdb_id': 'tt5463162', 'movie_id': 7966, 'po... |
| 2 | tt4007502 | [-0.0022702165879309177, 0.5886886715888977, -... | {'imdb_id': 'tt4007502', 'movie_id': 1614, 'po... |
| 3 | tt4209788 | [0.08350061625242233, 0.4322584867477417, -0.2... | {'imdb_id': 'tt4209788', 'movie_id': 7022, 'po... |
| 4 | tt2948356 | [-0.1614755392074585, 0.41389355063438416, -0.... | {'imdb_id': 'tt2948356', 'movie_id': 3571, 'po... |


Here is an example of the metadata value.

In [13]:
from pprint import pp

pp(dataset.documents['metadata'][0])

{'imdb_id': 'tt5027774',
 'movie_id': 6705,
 'poster': 'https://m.media-amazon.com/images/M/MV5BMjI0ODcxNzM1N15BMl5BanBnXkFtZTgwMzIwMTEwNDI@._V1_SX300.jpg',
 'rating': 4.0,
 'title': 'Three Billboards Outside Ebbing, Missouri (2017)',
 'user_id': 4556}


Now we move on to initializing our Pinecone vector database.

## Creating an Index

In [14]:
import os
import pinecone
import time

We read the `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` environment variables, which we will use during the initialization step. You can find these values in the [Pinecone Console](https://app.pinecone.io/) in the API Keys section.

In [15]:
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'
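
Note that the fallback above passes the literal placeholder string on to Pinecone when a variable is unset, so the mistake only surfaces later. As an alternative, here is a minimal sketch of a helper that fails immediately with a clear message. The name `require_env` and its optional `environ` argument are hypothetical and are not part of the Pinecone client.

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    # Fall back to the real process environment when no mapping is supplied
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a stand-in mapping instead of the real environment
fake_environ = {"PINECONE_API_KEY": "example-key"}
print(require_env("PINECONE_API_KEY", fake_environ))
```

In the notebook this could be used as `api_key = require_env('PINECONE_API_KEY')`.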

Now we can use these variables for initialization.

In [16]:
pinecone.init(
    api_key=api_key,
    environment=env
)

In order to create a new index, we need to specify the index name, similarity metric, as well as the dimension of the vectors stored in that index.
<br>
We will assign these values here.
<br>
Note that the dimension parameter has to match the embedding dimensions provided in the dataset (or the model that outputs those embeddings).

In [17]:
# embedding dimensions
len(dataset.documents['values'][0])

32
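
The cell above inspects only the first embedding. Since the index dimension must match every vector, it can be worth confirming that all embeddings share one length. The following is a minimal sketch of such a check. The helper name `infer_dimension` is hypothetical, and the toy vectors stand in for `dataset.documents['values']`.

```python
def infer_dimension(vectors):
    """Return the shared length of all vectors, or raise if they differ."""
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) != 1:
        raise ValueError(f"Expected one embedding dimension, found {sorted(dimensions)}")
    return dimensions.pop()

# Stand-in for dataset.documents['values']: three toy 4-dimensional embeddings
toy_values = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.5, 0.1, 0.9], [0.7, 0.7, 0.2, 0.1]]
dimension = infer_dimension(toy_values)
print(dimension)
```

The returned value could then be passed as the `dimension` argument when creating the index, instead of a hard-coded number.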

In [18]:
index_name = 'onboarding-recommender'

First, we check whether the index already exists. If it does, in this example we delete it and create a new one.

In [20]:
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(
    name=index_name,
    metric='cosine',
    dimension=32,
)
# wait a moment for the index to be fully initialized
while not pinecone.describe_index(index_name).status['ready']:
    time.sleep(1)
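
The `while` loop above waits indefinitely if the index never becomes ready. As a hedged alternative, here is a small sketch of a polling helper with a timeout. The name `wait_until` and its parameters are hypothetical, and the readiness check is a stand-in rather than a real Pinecone call.

```python
import time

def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in readiness check that becomes true on the third call
calls = {"count": 0}

def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3

wait_until(fake_ready, timeout=5.0, interval=0.01)
print(calls["count"])
```

With the client used in this notebook, the readiness check would be `lambda: pinecone.describe_index(index_name).status['ready']`.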

We now initialize an `index` object that we will use later to describe the index and upsert vectors.

In [21]:
index = pinecone.GRPCIndex(index_name)

Initially the index will be empty:

In [22]:
index.describe_index_stats()

{'dimension': 32,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We upsert like so:

In [23]:
index.upsert_from_dataframe(dataset.documents)

sending upsert requests:   0%|          | 0/970582 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1942 [00:00<?, ?it/s]

upserted_count: 970582
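
`upsert_from_dataframe` takes care of batching for us. The progress output above reports 1942 async responses for 970582 rows, which is consistent with batches of about 500 rows, although that batch size is inferred from the output rather than stated in this notebook. As a rough sketch of what batching involves, the hypothetical helper `batched` below splits any iterable into fixed-size chunks.

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield successive lists of at most batch_size items from rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 970582 rows in batches of 500 (an assumed batch size) gives 1942 batches
batch_count = sum(1 for _ in batched(range(970582), 500))
print(batch_count)
```

Sending rows in batches keeps each request small while avoiding the overhead of one request per vector.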

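Although 970,582 rows were upserted, the index reports far fewer vectors. Upserting a vector whose ID already exists overwrites the stored vector, and many rows in this dataset appear to share the same movie `id`: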

In [24]:
index.describe_index_stats()

{'dimension': 32,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10269}},
 'total_vector_count': 10269}
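
A likely explanation for the gap between the 970582 upserted rows and the vector count reported above is that an upsert replaces any existing vector with the same ID, and the dataset appears to hold one row per user rating, so many rows share a movie `id`. The sketch below imitates that behavior with a plain dictionary. The function `simulate_upserts` and the toy rows are invented for illustration.

```python
def simulate_upserts(rows):
    """Apply rows to a dict keyed by id; later rows replace earlier ones."""
    index = {}
    for row in rows:
        index[row["id"]] = row["values"]
    return index

# Toy rows: three ratings, two of which refer to the same movie id
rows = [
    {"id": "tt0000001", "values": [0.1, 0.2]},
    {"id": "tt0000002", "values": [0.3, 0.4]},
    {"id": "tt0000001", "values": [0.1, 0.2]},
]
stored = simulate_upserts(rows)
print(len(rows), len(stored))
```

Under this assumption, `dataset.documents['id'].nunique()` should give the number of vectors to expect in the index.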

## Querying the Index

Now that the index is populated, we can query it to find the most relevant recommendations.
<br>
To do that, we need to instantiate our embedding models so that we can create vectors from our input user or input item objects.

### Getting the Model

We will download the models from the Hugging Face Hub. We will use one model to embed the *example user* and another model to embed the *example item*. <br>
This will allow us to retrieve the most relevant items for a specific user or find the most similar items to a specific item.

In [25]:
from huggingface_hub import from_pretrained_keras

user_model = from_pretrained_keras("pinecone/movie-recommender-user-model")
movie_model = from_pretrained_keras("pinecone/movie-recommender-movie-model")

config.json not found in HuggingFace Hub.





Before we proceed, we create a `movies_details` DataFrame that we can use later on to print out the results.

In [26]:
import pandas as pd

movies_details = pd.DataFrame(
    dataset.documents['metadata'].values.tolist()
)
movies_details.head()

|   | imdb_id | movie_id | poster | rating | title | user_id |
|---|---------|----------|--------|--------|-------|---------|
| 0 | tt5027774 | 6705 | https://m.media-amazon.com/images/M/MV5BMjI0OD... | 4.0 | Three Billboards Outside Ebbing, Missouri (2017) | 4556 |
| 1 | tt5463162 | 7966 | https://m.media-amazon.com/images/M/MV5BMDkzNm... | 3.5 | Deadpool 2 (2018) | 20798 |
| 2 | tt4007502 | 1614 | https://m.media-amazon.com/images/M/MV5BMjY3YT... | 4.5 | Frozen Fever (2015) | 26543 |
| 3 | tt4209788 | 7022 | https://m.media-amazon.com/images/M/MV5BNTkzMz... | 4.0 | Molly's Game (2017) | 4106 |
| 4 | tt2948356 | 3571 | https://m.media-amazon.com/images/M/MV5BOTMyMj... | 4.0 | Zootopia (2016) | 15259 |


#### Item Similarity

First, let's see how the vector database behaves when we query it with a movie vector produced by the `movie_model` loaded above. It should return the most similar movies.

In [27]:
movie_id = 1263  # you can try experimenting with different movie ids to obtain different results, for example 3571
movie_vector = movie_model(movie_id).numpy().tolist()

In [28]:
movies_details[movies_details['movie_id'] == movie_id]['title'].tolist()[0]

'Avengers: Infinity War - Part I (2018)'
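
The lookup above filters the whole DataFrame every time it runs. If many titles need to be resolved, one option is to build a dictionary once and reuse it. This is a minimal sketch: the helper `build_title_lookup` is hypothetical, and the toy records reuse the metadata keys shown earlier (the third record is invented to illustrate a repeated movie).

```python
def build_title_lookup(metadata_records):
    """Map each movie_id to its title, keeping the first title seen."""
    lookup = {}
    for record in metadata_records:
        lookup.setdefault(record["movie_id"], record["title"])
    return lookup

# Toy records with the same keys as the metadata shown earlier
records = [
    {"movie_id": 6705, "title": "Three Billboards Outside Ebbing, Missouri (2017)", "user_id": 4556},
    {"movie_id": 3571, "title": "Zootopia (2016)", "user_id": 15259},
    {"movie_id": 3571, "title": "Zootopia (2016)", "user_id": 42},
]
titles = build_title_lookup(records)
print(titles[3571])
```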

In [29]:
movie_query_results = index.query(
    queries=[movie_vector],
    top_k=10,
    include_metadata=True
)

In [30]:
for res in movie_query_results.results:
    df = pd.DataFrame(
        {
            'movies': [match.metadata['title'] for match in res.matches],
            'scores': [match.score for match in res.matches]
        }
    )
    print("Recommendations: ")
    display(df)

Recommendations: 


|   | movies | scores |
|---|--------|--------|
| 0 | Avengers: Infinity War - Part I (2018) | 1.0 |
| 1 | Avengers: Infinity War - Part II (2019) | 0.987279 |
| 2 | Thor: Ragnarok (2017) | 0.981357 |
| 3 | Captain America: Civil War (2016) | 0.978873 |
| 4 | Guardians of the Galaxy (2014) | 0.976149 |
| 5 | Guardians of the Galaxy 2 (2017) | 0.960252 |
| 6 | Avengers: Age of Ultron (2015) | 0.945295 |
| 7 | Untitled Spider-Man Reboot (2017) | 0.943886 |
| 8 | Logan (2017) | 0.936165 |
| 9 | Star Wars: Dresca | 0.933541 |


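The first result, with a score of 1.0, is the query movie itself. Most of the remaining results are other Marvel superhero films, so the index is doing a good job of finding similar movies, and it returns them very quickly.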
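
Because the query movie is itself stored in the index, it comes back as the top match. When showing "similar movies" to a user, we would usually want to drop it. The sketch below does this on stand-in matches represented as plain dictionaries. The helper `drop_query_item` and the IDs are invented for illustration, and real match objects use attribute access (for example `match.score`) rather than dictionary keys.

```python
def drop_query_item(matches, query_id):
    """Remove the match whose id equals the item used as the query."""
    return [match for match in matches if match["id"] != query_id]

# Stand-in matches as plain dicts; the ids are invented for illustration
matches = [
    {"id": "movie-a", "title": "Avengers: Infinity War - Part I (2018)", "score": 1.0},
    {"id": "movie-b", "title": "Avengers: Infinity War - Part II (2019)", "score": 0.987279},
    {"id": "movie-c", "title": "Thor: Ragnarok (2017)", "score": 0.981357},
]
similar = drop_query_item(matches, query_id="movie-a")
print([match["title"] for match in similar])
```

To still end up with ten results after filtering, one could request `top_k=11` in the query.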

#### User Recommendations

Now, let's observe how our vector database behaves when we query it using the user vector.
<br>
We expect to receive movies that closely resemble the ones that the user rated highly.

In [31]:
user_id = 3
user_vector = user_model(user_id).numpy().tolist()

Here, we are defining a function that allows us to easily display the movies that the user rated in the past.

In [32]:
def top_movies_user_rated(user):
    # get list of movies that the user has rated
    user_movies = movies_details[movies_details["user_id"] == user]
    # order by their top rated movies
    top_rated = user_movies.sort_values(by=['rating'], ascending=False)
    # return the top 14 movies
    return pd.DataFrame(
        {
            'movies': top_rated['title'].tolist()[:14],
            'ratings': top_rated['rating'].tolist()[:14]
        }
    )

In [33]:
display(top_movies_user_rated(user_id))

|   | movies | ratings |
|---|--------|---------|
| 0 | Big Hero 6 (2014) | 4.5 |
| 1 | Captain America: Civil War (2016) | 4.0 |
| 2 | Avengers: Age of Ultron (2015) | 4.0 |
| 3 | Arrival (2016) | 2.5 |
| 4 | The Martian (2015) | 2.5 |


And now we can pass our `user_vector` to the query to get the recommendations.

In [34]:
query_results = index.query(
    queries=[user_vector],
    top_k=10,
    include_metadata=True
)

In [35]:
for res in query_results.results:
    df = pd.DataFrame(
        {
            'movies': [match.metadata['title'] for match in res.matches],
            'scores': [match.score for match in res.matches]
        }
    )
    print("Recommendations: ")
    display(df)

Recommendations: 


|   | movies | scores |
|---|--------|--------|
| 0 | Big Hero 6 (2014) | 0.854731 |
| 1 | Captain America: Civil War (2016) | 0.84943 |
| 2 | Avengers: Age of Ultron (2015) | 0.834037 |
| 3 | The Witch Files (2018) | 0.830561 |
| 4 | Avengers: Infinity War - Part I (2018) | 0.829733 |
| 5 | Untitled Spider-Man Reboot (2017) | 0.82595 |
| 6 | Monster High: 13 Wishes (2013) | 0.824344 |
| 7 | Guardians of the Galaxy 2 (2017) | 0.822629 |
| 8 | Lovestruck: The Musical (2013) | 0.820314 |
| 9 | Spider-Man: Far from Home (2019) | 0.820075 |


Again, the recommendations largely resemble the movies that the user rated highly: the top three are the user's own highest-rated movies, and several others are Marvel superhero films. None of them resemble the movies that the user rated with a low value.
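
The top three recommendations are movies the user has already rated. In a real application we would typically hide those and surface only new titles. Here is a minimal sketch of that post-processing step, using titles from the outputs above. The helper `unseen_recommendations` is hypothetical.

```python
def unseen_recommendations(recommended_titles, rated_titles, limit=5):
    """Keep recommendations the user has not rated, preserving rank order."""
    seen = set(rated_titles)
    return [title for title in recommended_titles if title not in seen][:limit]

# Titles taken from the outputs above
rated = [
    "Big Hero 6 (2014)",
    "Captain America: Civil War (2016)",
    "Avengers: Age of Ultron (2015)",
    "Arrival (2016)",
    "The Martian (2015)",
]
recommended = [
    "Big Hero 6 (2014)",
    "Captain America: Civil War (2016)",
    "Avengers: Age of Ultron (2015)",
    "The Witch Files (2018)",
    "Avengers: Infinity War - Part I (2018)",
    "Untitled Spider-Man Reboot (2017)",
]
print(unseen_recommendations(recommended, rated, limit=3))
```

Because the filter runs after the query, it helps to request more results than needed so that enough remain once the rated movies are removed.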

## Summary

The notebook demonstrated the step-by-step process of creating and populating an index in the vector database. It covered aspects such as specifying the index name, similarity metric, and vector dimensions. The example also showed how to check whether an index exists and how to delete and recreate it when necessary.

Furthermore, the notebook illustrated the usage of embedding models to generate vector representations of both users and items. The results showed that the recommendations closely resembled the movies that the user rated highly, and dissimilar movies were not included.

Overall, this example showcased the power and efficiency of vector databases in recommendation systems. It is important to note that the benefits of vector databases extend beyond movies and can be applied to various types of items, making them a valuable tool in building effective recommendation systems.