# Medium Article Semantic Search by Title+Subtitle

### Load Data

In [2]:
import pandas as pd

In [42]:
df = pd.read_csv("Data/medium_post_titles.csv", nrows=10000)  # first 10,000 rows only; the exercise uses the whole data set
# data source: https://www.kaggle.com/datasets/nulldata/medium-post-titles

In [43]:
df["subtitle_truncated_flag"].value_counts()

False    6318
True     3682
Name: subtitle_truncated_flag, dtype: int64

### Data Cleanup

In [44]:
df = df.dropna()
df = df[~df["subtitle_truncated_flag"]]

df['title_extended'] = df['title'] + df['subtitle']

df.head()

Unnamed: 0,category,title,subtitle,subtitle_truncated_flag,title_extended
0,work,"""21 Conversations"" - A fun (and easy) game for...",A (new?) Icebreaker game to get your team to s...,False,"""21 Conversations"" - A fun (and easy) game for..."
1,spirituality,"""Biblical Porn"" at Mars Hill",Author and UW lecturer Jessica Johnson talks a...,False,"""Biblical Porn"" at Mars HillAuthor and UW lect..."
2,lgbtqia,"""CISGENDER?! Is That A Disease?!""","Or, a primer in gender vocabulary for the curi...",False,"""CISGENDER?! Is That A Disease?!""Or, a primer ..."
4,artificial-intelligence,"""Can I Train my Model on Your Computer?""",How we waste computational resources and how t...,False,"""Can I Train my Model on Your Computer?""How we..."
5,cryptocurrency,"""Cypherpunks and Wall Street"": The Security To...",Bruce Fenton presents at the World Blockchain ...,False,"""Cypherpunks and Wall Street"": The Security To..."
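
Note that a bare `+` fuses the last word of the title with the first word of the subtitle (see `Mars HillAuthor` in the output above), which may affect how the text is tokenized. A minimal sketch of joining with a separator follows. The helper name `join_title_subtitle` and the `". "` separator are assumptions for illustration and are not part of the original notebook.

```python
def join_title_subtitle(title: str, subtitle: str, sep: str = ". ") -> str:
    """Join a title and subtitle with a separator so the words do not fuse."""
    title = title.strip()
    subtitle = subtitle.strip()
    # If the title already ends with punctuation, a single space is enough.
    if title.endswith((".", "!", "?", ":")):
        return f"{title} {subtitle}"
    return f"{title}{sep}{subtitle}"


print(join_title_subtitle("A city and its architecture", "Bangalore Chapter"))
# A city and its architecture. Bangalore Chapter
```

In pandas the simplest equivalent would be `df['title'] + ". " + df['subtitle']`. Changing the text changes the embeddings, so the scores shown later in this notebook would no longer match exactly.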


In [45]:
print(df.shape)  # about 6k rows remain; the full set is left for the exercise

df.groupby(["category","subtitle_truncated_flag"], as_index = False).count().sort_values("title", ascending = False)

(6211, 5)


Unnamed: 0,category,subtitle_truncated_flag,title,subtitle,title_extended
92,writing,False,292,292,292
90,work,False,285,285,285
9,business,False,224,224,224
24,equality,False,213,213,213
60,politics,False,212,212,212
...,...,...,...,...,...
82,transportation,False,2,2,2
67,race,False,2,2,2
65,psychedelics,False,2,2,2
87,venture-capital,False,1,1,1


### Prep for Upsert

In [46]:
import os

variable_name = "pinecone_api_key_Cordero"
API_KEY = os.getenv(variable_name)
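
`os.getenv` returns `None` when the variable is missing, and the error then only surfaces later when the client is used. A small sketch that fails early instead is shown below. The helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(variable_name: str) -> str:
    """Return the API key stored in an environment variable, or raise."""
    api_key = os.getenv(variable_name)
    if not api_key:
        # Covers both a missing variable (None) and an empty string.
        raise RuntimeError(
            f"Environment variable {variable_name!r} is not set or is empty."
        )
    return api_key
```

It could be used as `API_KEY = load_api_key(variable_name)` in place of the plain `os.getenv` call.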

In [48]:
# init pinecone
from pinecone import Pinecone, ServerlessSpec
# API_KEY = "YOUR API KEY"
pc = Pinecone(api_key = API_KEY)

In [49]:
pc.create_index(name = "medium-data", 
                dimension=384, 
                metric="cosine",
                spec=ServerlessSpec(
                    cloud="aws",
                    region="us-east-1"
                )) # remember to use only us-east-1 in free tier
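
Re-running the cell above when the index already exists typically fails, so it can help to create the index only when it is missing. The sketch below assumes the client exposes `list_indexes().names()`, as recent versions of the Pinecone Python client do; check this against your installed version. The helper name `ensure_index` is hypothetical.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index only if it does not exist yet.

    Returns True when a new index was created, False otherwise.
    `pc` is expected to be a pinecone.Pinecone client (assumption: it
    provides list_indexes().names() and create_index(...)).
    """
    existing_names = pc.list_indexes().names()
    if name in existing_names:
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True
```

With the objects from this notebook, the call would look like `ensure_index(pc, "medium-data", 384, ServerlessSpec(cloud="aws", region="us-east-1"))`.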

In [None]:
# Legacy pod-based call from the older client, kept for reference:
# pinecone.create_index(name='medium-data', dimension=384, pod_type='s1', metric="cosine")

In [52]:
#!pip install sentence-transformers

The device='cuda' parameter in SentenceTransformer allows the model to run on the GPU for better performance, but you need to ensure that PyTorch detects CUDA correctly.

1️⃣ Check if PyTorch detects CUDA

Before running the code, make sure your Python environment has PyTorch with CUDA support. If torch.cuda.is_available() returns False, PyTorch is not using CUDA, and you may need to install a CUDA-enabled build of PyTorch.

You can check this by running:


In [56]:
import torch
print(torch.cuda.is_available())  # Should print True if CUDA is available
print(torch.cuda.device_count())  # Number of available GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Name of the first GPU


True
1
NVIDIA GeForce RTX 3060 Laptop GPU


2️⃣ Install PyTorch with CUDA support
If you need to install PyTorch with CUDA support, use this command (adjust according to your CUDA version):

In [57]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

^C


3️⃣ Configure the SentenceTransformer model
If a GPU is available, the model will use CUDA; otherwise, it will use the CPU:

In [None]:
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

print(f"Model loaded on {device}.")


In [10]:
df['values'] = df['title_extended'].map(
    lambda x: (model.encode(x)).tolist())  # Python list; about 1 min for 6k rows
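
Encoding one row at a time works, but `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, which is usually faster, especially on a GPU. A minimal sketch follows. The helper name `encode_all` and the batch size of 64 are assumptions.

```python
def encode_all(model, texts, batch_size=64):
    """Encode all texts in batches and return plain Python lists of floats.

    `model` is expected to behave like a SentenceTransformer: its
    encode(sentences, batch_size=...) returns one vector per sentence.
    """
    vectors = model.encode(list(texts), batch_size=batch_size)
    # Convert every vector (a NumPy row or similar) into a list of floats,
    # which is the format used for the 'values' column in this notebook.
    return [[float(value) for value in vector] for vector in vectors]
```

In this notebook it could be used as `df['values'] = encode_all(model, df['title_extended'])`.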

In [11]:
df['id'] = range(len(df))  # consecutive IDs 0..n-1, independent of the gaps left by the filtering above

In [12]:
df['metadata'] = df.apply(lambda x: {
    'title' : x['title'],
    'subtitle': x['subtitle'],
    'category': x['category']
    
}, axis=1)

In [13]:
df_upsert = df[['id', 'values', 'metadata']].copy()  # explicit copy avoids SettingWithCopyWarning

In [14]:
df_upsert['id'] = df_upsert['id'].astype(str)  # Pinecone expects string IDs


In [18]:
index = pc.Index('medium-data')

In [19]:
index.upsert_from_dataframe(df_upsert)  # ~6k vectors; about 20 s in the run below

sending upsert requests: 100%|████████████████████████████████████████████████████| 6211/6211 [00:20<00:00, 305.21it/s]


{'upserted_count': 6211}
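
`upsert_from_dataframe` handles batching for you. If you call `index.upsert(vectors=...)` directly instead, it is common to send the records in smaller batches. The sketch below shows only the batching part. The helper name `chunked` and the dummy records are assumptions for illustration.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Dummy records in the same id/values/metadata shape used in this notebook.
records = [{"id": str(i), "values": [0.0] * 3, "metadata": {}} for i in range(5)]
for batch in chunked(records, 2):
    # With a real index this would be: index.upsert(vectors=batch)
    print(len(batch))
```

With the real data, `records` could come from `df_upsert.to_dict("records")`.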

### Query

In [26]:
xc = index.query(vector=(model.encode("which city is the most beautiful")).tolist(), # python list
           top_k=10,
           include_metadata=True) 

In [27]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['title']}: {result['metadata']['category']} ")

0.57: 3 Places Where You Can Find Beauty: photography 
0.46: 6 Easy Reasons to Enjoy Exploring South Wales: travel 
0.45: A City That’s Better for the Blind Is Better for Everyone: accessibility 
0.45: A Shining City on a Hill: politics 
0.42: A Most Beautiful Game: sports 
0.4: 6 Literary Cities for Book Lovers To Visit This Year: travel 
0.4: Ace Hotel: A UX Case Study: ux 
0.39: A city and its architecture: cities 
0.39: Adaptive urban design: design 
0.38: Aesthetics of Being: spirituality 


In [28]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['subtitle']}: {result['metadata']['category']} ")

0.57: If you are willing to look hard enough, eventually you will see beauty in the most difficult of places.: photography 
0.46: Pembrokeshire is as beautiful as the Italian Coast.: travel 
0.45: Complete parity with the sighted may seem like an impossible goal, but maybe the only thing holding us back is a lack of imagination.: accessibility 
0.45: What does America stand for?: politics 
0.42: The World Cup gets advertising right: sports 
0.4: Combine your love for books and travel with these 6 literary cities.: travel 
0.4: Discover the city you are visting like a local: ux 
0.39: Bangalore Chapter: cities 
0.39: Choatic nature of order: design 
0.38: Examining life through a lens of beauty: spirituality 
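
Since `category` is stored as metadata, the same query can be restricted to a single category with a metadata filter. The sketch below assumes the `filter` argument of `index.query` and the `$eq` operator as documented by Pinecone. The helper name `query_in_category` is hypothetical.

```python
def query_in_category(index, vector, category, top_k=10):
    """Query the index but only return matches from one category.

    `index` is expected to be a Pinecone index object and `vector`
    a plain Python list of floats.
    """
    return index.query(
        vector=vector,
        top_k=top_k,
        include_metadata=True,
        # Metadata filter: keep only vectors whose 'category' equals `category`.
        filter={"category": {"$eq": category}},
    )
```

For example, `query_in_category(index, model.encode("which city is the most beautiful").tolist(), "travel")` would keep only the travel articles.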


### Exercise: Upsert all data