<a href="https://colab.research.google.com/github/TJhon/lanchain_curso/blob/day3/Pinecone/recomender_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install  pinecone-client openai numpy pandas langchain tiktoken -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m201.4/201.4 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.7/226.7 kB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m815.9/815.9 kB[0m [31m40.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.9/75.9 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m47.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m241.2/241.2 kB[0m [31m24.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.5/56.5 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.4/49.4 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━

# Recommendation System

## Pinecone

Pinecone simplifies the provision of long-term memory for high-performance AI applications. It is a managed, cloud-native vector database with a straightforward API and no infrastructure complexities. Pinecone delivers fresh, filtered query results with low latency, capable of scaling to billions of vectors.

## Preview

To access the resource, you need to create an account on [Pinecone](https://app.pinecone.io/). Currently, only the free version will be used, which allows for only one index storage, sufficient for the exercise.

Previously, two elements of Pinecone were needed to make queries. However, now only the Pinecone API key is required to handle vectors.

### Creating API KEY

Login Pinecone

![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/login-pinecone.png?raw=1)

Create Api Key

![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/api-key.png?raw=1)

![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/create_api-key.png?raw=1)

Save Api Key

![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/save-api-key.png?raw=1)

## How to Use

To create the index in Pinecone, we need to import the necessary methods. Additionally, we must create the OpenAI client to generate embeddings (remember that embeddings are the vector representation of content).

In this specific case, we are deleting all existing indexes with `pc.delete_index` in Pinecone because we can only have one in the free version. Additionally, we define the index name as `nameindex` and create the index (index creation typically takes between 1 to 3 minutes). Finally, we define our `index` to make queries.

In [2]:
# pinecone
from pinecone import Pinecone, PodSpec
from google.colab import userdata

pc_api_key = userdata.get("PINECONE_API_KEY")
pc = Pinecone(api_key=pc_api_key)

nameindex = "recommended"
for index in pc.list_indexes():
  pc.delete_index(index.get('name'))

pc.create_index(
    name=nameindex,
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment='us-west1-gcp',
        pod_type='p1.x1'
    )
)
# define the endpoint
index = pc.Index(nameindex)

In [3]:
from openai import OpenAI

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
openai_client = OpenAI(api_key=OPENAI_API_KEY)

To better understand what embeddings are, let's create them with OpenAI. First, we need to call the embeddings method and create the embeddings by providing the text and the model to use. OpenAI works with embeddings of 1536 elements, so each embedding must be a list of that same size. Within these embeddings, there is the element `data[0].embedding`, which is the value needed to upload to Pinecone as `value`.

Para entender mejor que son los embedings, haremos la creacion de este con openai, primero tenemos que llamar al metodo de embeddings y crear el embeddings tomando el texto y el modelo a utilizar. Openai trabaja con un los embeddings de 1536 elementos, por lo que cada emebddings debe ser una lista de ese mismo tamanio. Dentro de este embeddings existe el elemmeto `data[0].embeddig` el cual es el valor que se necesita para poder subir a pinecone como `value`


In [4]:
model_openai_e = "text-embedding-ada-002"
embed = openai_client.embeddings.create(input='text', model=model_openai_e)
# dir(embed)

In [5]:
value = embed.data[0].embedding
print(f"""value: {value[:10]}\n length: {len(value)}""")

value: [-0.010077533312141895, -0.015133497305214405, 0.008619214408099651, -0.012760289944708347, 0.00478080939501524, 0.018903113901615143, -0.015326105058193207, -0.016440480947494507, -0.013750845566391945, -0.002245948649942875]
 length: 1536


Since this procedure will be repetitive, it will be included in a function.

In [6]:
def get_embeddings(input, model=model_openai_e):
    embed = openai_client.embeddings.create(input='text', model=model_openai_e)
    value = embed.data[0].embedding
    return value

In order to upload this vector to Pinecone, two additional elements are needed: the ID, which will serve as an identifier, and the metadata, which in Python is a dictionary and will be used for more precise queries.

As an example, let's use a sample:

In [7]:
import numpy as np, time

time.sleep(5)

value_vector = np.random.rand(1536)  # length of vector
metadata = {
    "test": "yes",
    "title": "none"
}

upsert_response = index.upsert(
    vectors=[
        ("id_1", value_vector, metadata)
    ]
)

![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/index-created.png?raw=1)
![](https://github.com/TJhon/lanchain_curso/blob/day3/Pinecone/figs/data-inside-index.png?raw=1)

In the Pinecone console, you can see that there is now an element with the values defined previously.

## Example

Let's use news data, which is available at the following [link](https://www.dropbox.com/scl/fi/wruzj2bwyg743d0jzd7ku/all-the-news-3.zip?rlkey=rgwtwpeznbdadpv3f01sznwxa&dl=1%22). To use it, we can directly execute this code, which will download a zip file and extract its contents into the current folder.

In [8]:
!wget -q --show-progress -O all-the-news-3.zip "https://www.dropbox.com/scl/fi/wruzj2bwyg743d0jzd7ku/all-the-news-3.zip?rlkey=rgwtwpeznbdadpv3f01sznwxa&dl=1"

!unzip all-the-news-3.zip

Archive:  all-the-news-3.zip
  inflating: all-the-news-3.csv      


The data is contained within a CSV file.

In [9]:
import pandas as pd
df = pd.read_csv('./all-the-news-3.csv', nrows=99)
print(df.columns, df.shape)

Index(['date', 'year', 'month', 'day', 'author', 'title', 'article', 'url',
       'section', 'publication'],
      dtype='object') (99, 10)


In [10]:
df.head(3)

Unnamed: 0,date,year,month,day,author,title,article,url,section,publication
0,2016-12-09 18:31:00,2016,12.0,9,Lee Drutman,We should take concerns about the health of li...,"This post is part of Polyarchy, an independent...",https://www.vox.com/polyarchy/2016/12/9/138983...,,Vox
1,2016-10-07 21:26:46,2016,10.0,7,Scott Davis,Colts GM Ryan Grigson says Andrew Luck's contr...,The Indianapolis Colts made Andrew Luck the h...,https://www.businessinsider.com/colts-gm-ryan-...,,Business Insider
2,2018-01-26 00:00:00,2018,1.0,26,,Trump denies report he ordered Mueller fired,"DAVOS, Switzerland (Reuters) - U.S. President ...",https://www.reuters.com/article/us-davos-meeti...,Davos,Reuters


In this example, only a sample will be uploaded to Pinecone.

In [11]:
import numpy as np
df = pd.read_csv('./all-the-news-3.csv', nrows=100)

Remember that we need 3 elements - ID, vector, and metadata - to upload to Pinecone. We'll generate a function that generates an ID, the embedding of the news titles, and the metadata, all within a tuple.

In [12]:
import uuid

def to_update(title: str):
    _id = str(uuid.uuid1())[:12]
    embed = get_embeddings(title)
    metadata = {
        "title": title
    }
    return _id, embed, metadata

value = to_update("We should take concerns about the health ...")

In [13]:
from tqdm import tqdm

titles = df['title'].values
for title in titles:
    value = to_update(title)
    index.upsert(vectors = [value])

To view recommendations from the database, we can make the request with the following prompt, where we specify in `vector` the query's embedding, `top_k` the number of items to display, and `include_metadata` to extract metadata that matches the vector.

In [14]:
query_vector = get_embeddings("Google")
response = index.query(
    vector=query_vector,
    top_k=10,
    include_metadata=True
)



Subsequently, we can see the score of how matching our vector recommendations are with the query embed, and we can also access the metadata.

In [15]:
print(response.matches)

[{'id': '2a7b0d0e-cb9',
 'metadata': {'title': 'UK PM May presses on with bid to get Brexit deal '
                       'through parliament: spokesman'},
 'score': 1.0,
 'values': []}, {'id': '2908afc6-cb9',
 'metadata': {'title': "Paris Hilton: Woman In Black For Uncle Monty's "
                       'Funeral'},
 'score': 1.0,
 'values': []}, {'id': '29232d10-cb9',
 'metadata': {'title': "ECB's Coeure: If we decide to cut rates, we'd have to "
                       'consider tiering'},
 'score': 1.0,
 'values': []}, {'id': '28b630b6-cb9',
 'metadata': {'title': 'Trump denies report he ordered Mueller fired'},
 'score': 1.0,
 'values': []}, {'id': '296cc204-cb9',
 'metadata': {'title': 'You Can Trick Your Brain Into Being More Focused'},
 'score': 1.0,
 'values': []}, {'id': '28df85c4-cb9',
 'metadata': {'title': "France's Sarkozy reveals his 'Passions' but insists no "
                       'come-back on cards'},
 'score': 1.0,
 'values': []}, {'id': '29d0d6ea-cb9',
 'metadata': 

## Large Texts

The previous example used the title as the embedding and metadata. Now, the exercise will involve using the content of the news as the embedding and the title as metadata. To achieve this, we need to consider that the texts can be very long, so we'll need to split them.

First, let's delete the existing index to upload the content again.

In [16]:
name_index = 'article'
for index in pc.list_indexes():
    pc.delete_index(index.get('name'))

pc.create_index(
    name=name_index,
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment='us-west1-gcp',
        pod_type='p1.x1'
    )
)
index = pc.Index(name_index)

We'll reuse the data, but modify the `get_embeddings` function because the previous one only accepted a string. However, when the text passes through `text_splitter`, it will return a list of strings, each of which must be uploaded individually to Pinecone.

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def get_embeddings(articles, model="text-embedding-ada-002"):
   return openai_client.embeddings.create(input = articles, model=model)

def embed(embeddings, title, prepped, embed_num):
  for embedding in embeddings.data:
    prepped.append({'id':str(embed_num), 'values':embedding.embedding, 'metadata':{'title':title}})
    embed_num += 1
    if len(prepped) >= 100:
        index.upsert(prepped)
        prepped.clear()
  return embed_num

news_data_rows_num = 100

embed_num = 0 #keep track of embedding number for 'id'
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
    chunk_overlap=20) # how to chunk each article
prepped = []

articles_list = df['article'].values
titles_list = df['title'].values

for i in tqdm(range(0, len(articles_list))):
    # print(".",end="")
    art = articles_list[i]
    title = titles_list[i]
    if art is not None and isinstance(art, str):
      texts = text_splitter.split_text(art)
      embeddings = get_embeddings(texts)
      embed_num = embed(embeddings, title, prepped, embed_num)

100%|██████████| 100/100 [01:39<00:00,  1.01it/s]


<!-- Now that we have our index, we need to make queries. We'll reuse the `get_recommendations` function to find the vectors that match the query from highest to lowest. -->

Unlike before, it will search within the articles for the most similar vectors. So, if an article is highly related to the query, the response will contain many matches corresponding to the same article. However, this is where we can use the metadata, as we can classify when each different article is mentioned.

In [18]:
def get_recommendations(pinecone_index, search_term, top_k=10):
  embed = get_embeddings([search_term]).data[0].embedding
  res = pinecone_index.query(vector=embed, top_k=top_k, include_metadata=True)
  return res

query = "Google and Innovation"
reco = get_recommendations(index, query, top_k=10)
seen = {}
for r in reco.matches:
    title = r.metadata['title']
    if title not in seen:
        print(f"Score: {r.score} \t Title: {title}")
        seen[title] = '.'


Score: 0.85238415 	 Title: Forget Facebook, Amazon or Google. Up-and-coming top tech talent is opting for startups.
Score: 0.829823554 	 Title: How to watch the Google I/O keynote live
Score: 0.816159368 	 Title: Peter Thiel vs. the FDA
Score: 0.814539969 	 Title: How love and marriage are changing, according to 63,000 New York Times wedding announcements
