# Universidad Politécnica de Madrid
**Name:** Javier Manobanda

MUIA 2023 - 2024

**Description:**

A semantic movie search engine that uses a BERT-based model to create the embeddings. It stores them in the Pinecone vector database and exposes a graphical interface built with [Gradio](https://www.gradio.app/).

To use the interface, you need a (free) [Pinecone](https://www.pinecone.io/) account and must provide its API key.

# Requirements

In [180]:
%%capture
!pip install -U sentence-transformers;
!pip install pinecone-client;
!pip install gradio;
!pip install rank-bm25

# Imports

In [181]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from ast import literal_eval
import pinecone
from getpass import getpass
import gradio as gr

# Dataset

The dataset is a list of movies from [IMDb](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows), with the following columns:
- Poster_Link - Link to the poster used by IMDb.
- Series_Title - Name of the movie.
- Released_Year - Year the movie was released.
- Certificate - Certificate earned by the movie.
- Runtime - Total runtime of the movie.
- Genre - Genre of the movie.
- IMDB_Rating - Rating of the movie on IMDb.
- Overview - Short story/summary.
- Meta_score - Score earned by the movie.
- Director - Name of the director.
- Star1, Star2, Star3, Star4 - Names of the stars.
- No_of_Votes - Total number of votes.
- Gross - Revenue earned by the movie.



In [182]:
df = pd.read_csv('/content/imdb_top_1000.csv')
df.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


In [183]:
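# Replace missing values with empty strings so the string operations below do not fail
df = df.fillna('')
# Split the comma-separated genres into a list (the space after each comma is kept, e.g. ' Drama')
df['Genre'] = df['Genre'].apply(lambda x: x.split(','))
UNIQUE_GENEROUS = df['Genre'].explode().unique()
# Join the four cast columns into a single 'Actors' string
df['Actors'] = df.apply(lambda x: f"{x['Star1']} {x['Star2']} {x['Star3']} {x['Star4']}", axis=1)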
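
Splitting on a bare comma keeps the space that follows each comma in the CSV, so a value such as `'Action, Crime, Drama'` yields `' Crime'` and `' Drama'` with a leading space (this is visible in the `Genre` metadata of the example query response further down, e.g. `' Adventure'`). The following is a minimal, dependency-free sketch of a whitespace-tolerant split; `split_genres` is a hypothetical helper and is not part of the notebook.

```python
def split_genres(raw: str) -> list[str]:
    """Split a comma-separated genre string and trim surrounding whitespace."""
    return [part.strip() for part in raw.split(',') if part.strip()]


# Plain split keeps the leading spaces; the helper removes them.
print('Action, Crime, Drama'.split(','))      # ['Action', ' Crime', ' Drama']
print(split_genres('Action, Crime, Drama'))   # ['Action', 'Crime', 'Drama']
```

With pandas this could be applied as `df['Genre'].apply(split_genres)`. Doing so would also change the genre values stored as metadata and offered in the UI dropdown, so the index would need to be re-populated.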

In [184]:
df.drop(['Star1','Star2','Star3', 'Star4'], axis=1, inplace=True)
df.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,No_of_Votes,Gross,Actors
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,[Drama],9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,2343110,28341469,Tim Robbins Morgan Freeman Bob Gunton William ...
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"[Crime, Drama]",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,1620367,134966411,Marlon Brando Al Pacino James Caan Diane Keaton
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,2303232,534858444,Christian Bale Heath Ledger Aaron Eckhart Mich...


In [185]:
df['text_use_to_found'] = df.apply(lambda x: f"{x['Overview']} {x['Actors']}", axis=1)
df.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,No_of_Votes,Gross,Actors,text_use_to_found
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,[Drama],9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,2343110,28341469,Tim Robbins Morgan Freeman Bob Gunton William ...,Two imprisoned men bond over a number of years...
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"[Crime, Drama]",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,1620367,134966411,Marlon Brando Al Pacino James Caan Diane Keaton,An organized crime dynasty's aging patriarch t...
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,2303232,534858444,Christian Bale Heath Ledger Aaron Eckhart Mich...,When the menace known as the Joker wreaks havo...


A new column is created with the text that will be embedded. It combines the `Overview` and `Actors` fields, which are the ones the semantic search will match against.

In [186]:
print('Example of the text that will be embedded for the semantic search:')
print()
df['text_use_to_found'][0]

Example of the text that will be embedded for the semantic search:



'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency. Tim Robbins Morgan Freeman Bob Gunton William Sadler'

`SentenceTransformer` is used to load a pretrained embedding model. The library builds on Transformer models such as BERT, RoBERTa, or DistilBERT, which have been widely used and studied in the field of NLP.

In [187]:
%%capture
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [188]:
embeddings = model.encode(df['text_use_to_found'],batch_size=64,show_progress_bar=True)


In [189]:
df['embeddings'] = embeddings.tolist()
df['ids'] = df.index
df['ids'] = df['ids'].astype('str')

In [190]:
df.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,No_of_Votes,Gross,Actors,text_use_to_found,embeddings,ids
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,[Drama],9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,2343110,28341469,Tim Robbins Morgan Freeman Bob Gunton William ...,Two imprisoned men bond over a number of years...,"[-0.10262078791856766, 0.007203442044556141, -...",0
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"[Crime, Drama]",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,1620367,134966411,Marlon Brando Al Pacino James Caan Diane Keaton,An organized crime dynasty's aging patriarch t...,"[-0.09972403943538666, 0.004932244773954153, -...",1
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,2303232,534858444,Christian Bale Heath Ledger Aaron Eckhart Mich...,When the menace known as the Joker wreaks havo...,"[-0.002044508932158351, 0.04783324524760246, -...",2


# Base de Datos Vectorial
A vector database is used to store the embeddings.

ℹ **NOTE: running the next cell prompts for the Pinecone API key.**

In [191]:
pincone_api_key = getpass('Enter the secret value: ')

Enter the secret value: ··········


In [192]:
pc = pinecone.Pinecone(api_key=pincone_api_key)

An index is created in the vector database to hold the embeddings. The comparison metric is cosine similarity.

In [193]:
dimensions_embeddings = len(df['embeddings'][0])
index_name = "imdb-movies-embeddings"
all_index = [index['name'] for index in pc.list_indexes().index_list['indexes']]

# Create the index only if it does not exist yet
if index_name not in all_index:
    pc.create_index(
        name=index_name,
        dimension=dimensions_embeddings,
        metric="cosine",
        spec=pinecone.PodSpec(environment="gcp-starter", pod_type="s1", pods=1),
    )
index = pc.Index(index_name)
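
To make the `cosine` metric concrete, here is a small standard-library sketch of cosine similarity between two vectors. It only illustrates the formula and is not how the database computes it internally; `cosine_similarity` is a name chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction gives 1.0 regardless of length; orthogonal vectors give 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Because the measure ignores vector length, two texts are compared by the direction of their embeddings only.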

The embeddings and their metadata are uploaded (upserted) to the index in batches of 64.

In [194]:
from tqdm.auto import tqdm
batch_size=64
for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i+batch_size, len(df))
    batch = df[i:i_end]
    data = batch.rename(columns={'embeddings':'values', 'ids': 'id'})
    data['metadata'] = data.drop(['id','values','text_use_to_found'],axis=1).to_dict('records')
    to_upsert = data[['id','values','metadata']].to_dict('records')
    _ = index.upsert(to_upsert)
index.describe_index_stats()


{'dimension': 384,
 'index_fullness': 0.01,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}
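
The batching arithmetic used by the upload loop can be checked in isolation. This sketch assumes the same sizes as the notebook (1000 rows, batches of 64); `batch_ranges` is a hypothetical helper introduced only for illustration.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in consecutive batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(1000, 64))
print(len(ranges))   # 16
print(ranges[-1])    # (960, 1000)
```

The last batch is shorter (40 rows), which is why the end index is clamped with `min`.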

Example query

In [195]:
query = 'a history about travel time'
query_vector = model.encode(query).tolist()

responses = index.query(
  vector=query_vector,
  top_k=3,
  include_metadata=True,
)
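
Conceptually, a `top_k` query returns the `k` stored vectors most similar to the query vector. The sketch below performs that as an exhaustive scan over a tiny in-memory dictionary. It illustrates the shape of the result only and makes no claim about how Pinecone implements the search; `toy_index`, `normalize`, and `top_k` are names invented for this example.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, vectors, k):
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    q = normalize(query)
    scored = (
        (vec_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for vec_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


toy_index = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
print([vec_id for vec_id, _ in top_k([1.0, 0.1], toy_index, k=2)])  # ['a', 'c']
```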

In [196]:
responses

{'matches': [{'id': '761',
              'metadata': {'Actors': 'Riisa Naka Takuya Ishida Mitsutaka '
                                     'Itakura Ayami Kakiuchi',
                           'Certificate': 'U',
                           'Director': 'Mamoru Hosoda',
                           'Genre': ['Animation', ' Adventure', ' Comedy'],
                           'Gross': '',
                           'IMDB_Rating': 7.7,
                           'Meta_score': '',
                           'No_of_Votes': 60368.0,
                           'Overview': 'A high-school girl named Makoto '
                                       'acquires the power to travel back in '
                                       'time, and decides to use it for her '
                                       'own personal benefits. Little does she '
                                       'know that she is affecting the lives '
                                       'of others just as much as she is her '
   

The search can be run using the query text, a genre, a minimum rating, and the maximum number of results (`top_k`).

In [197]:
def search(query, genre, ranking, top_k):
    # Embed the query with the same model used for the documents
    query_vector = model.encode(query).tolist()
    # Minimum rating defaults to 0 when none is given
    filter_ranking = ranking if ranking else 0
    if genre:
        conditions = {
            "Genre": {"$in": [genre]},
            "IMDB_Rating": {"$gte": filter_ranking},
        }
    else:
        conditions = {
            "IMDB_Rating": {"$gte": filter_ranking},
        }
    responses = index.query(
        vector=query_vector,
        top_k=int(top_k),  # the UI may pass the value as a float
        include_metadata=True,
        filter=conditions,
    )
    response_data = []
    for response in responses['matches']:
        response_data.append({
            'Title': response['metadata']['Series_Title'],
            'Overview': response['metadata']['Overview'],
            'Director': response['metadata']['Director'],
            'Genre': response['metadata']['Genre'],
            'year': response['metadata']['Released_Year'],
            'Rating': response['metadata']['IMDB_Rating'],
            'Score': response['score'],
        })

    # Return the matches as a DataFrame (one row per result)
    return pd.DataFrame(response_data)
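
The metadata filter is a plain dictionary, so it can be built and inspected without contacting the database. A possible more compact construction is sketched below, assuming the same field names and operators used in `search`; `build_filter` is a hypothetical helper, not part of the notebook.

```python
def build_filter(genre=None, min_rating=None):
    """Build the metadata filter: always a rating floor, optionally a genre match."""
    conditions = {"IMDB_Rating": {"$gte": min_rating if min_rating else 0}}
    if genre:
        # "$in" matches when the stored Genre list contains the given value
        conditions["Genre"] = {"$in": [genre]}
    return conditions


print(build_filter('Drama', 8))  # {'IMDB_Rating': {'$gte': 8}, 'Genre': {'$in': ['Drama']}}
print(build_filter())            # {'IMDB_Rating': {'$gte': 0}}
```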


In [198]:
search('Movie about resurrected reptiles.', None,  0, 3)

Unnamed: 0,Title,Overview,Director,Genre,year,Rating,Score
0,Jurassic Park,A pragmatic paleontologist visiting an almost ...,Steven Spielberg,"[Action, Adventure, Sci-Fi]",1993,8.1,0.352887
1,Young Frankenstein,An American grandson of the infamous scientist...,Mel Brooks,[Comedy],1974,8.0,0.352224
2,Edge of Tomorrow,A soldier fighting aliens gets to relive the s...,Doug Liman,"[Action, Adventure, Sci-Fi]",2014,7.9,0.338429


# Buscador Tradicional

The BM25 algorithm is used to implement a traditional (keyword-based) search engine.

In [212]:
import pandas as pd
from rank_bm25 import BM25Okapi
import numpy as np

documents = df['text_use_to_found'].tolist()
tokenized_documents = [doc.split(" ") for doc in documents]
bm25 = BM25Okapi(tokenized_documents)

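def bm25_search(query, top_n=3):
    # Whitespace tokenization, matching how the documents were tokenized
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)
    # Indices of the top_n highest-scoring documents, best first
    top_n_indices = np.argsort(doc_scores)[::-1][:top_n]
    print(f"\nDocumentos más relevantes top ({top_n}):")
    documents_index = []
    documents_scores = {}
    for idx in top_n_indices:
        # idx + 1 is a 1-based position; the DataFrame index below is 0-based
        print(f"Documento {idx + 1}: {documents[idx]} (Score: {doc_scores[idx]})")
        documents_index.append(idx)
        documents_scores[idx] = doc_scores[idx]

    # .copy() so the score column is added to a new frame, not to a slice of df
    df_results = df.iloc[documents_index][["Series_Title", "Overview", "Genre", "Released_Year", "IMDB_Rating"]].copy()
    # Ad hoc rescaling: the raw BM25 score is divided by 10
    df_results["score"] = df_results.apply(lambda x: documents_scores[x.name] / 10.0, axis=1)
    return df_results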
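
For reference, the scoring idea behind BM25 can be written in a few lines of standard-library Python. This is a textbook Okapi BM25 sketch with a smoothed IDF and the common defaults `k1=1.5`, `b=0.75`; it is an assumption that this matches `rank_bm25` closely, and details such as IDF handling may differ. The toy corpus is invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "dinosaurs escape in a theme park".split(),
    "a man plans a prison escape".split(),
    "two friends open a bakery".split(),
]
scores = bm25_scores("prison escape".split(), corpus)
print(scores.index(max(scores)))  # 1
```

Only exact token matches contribute to the score, which is the main contrast with the embedding-based search.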


In [215]:
df_result = bm25_search("Movie about resurrected reptiles.")
df_result


Documentos más relevantes top (3):
Documento 972: A twenty-seven-year-old office worker travels to the countryside while reminiscing about her childhood in Tokyo. Miki Imai Toshirô Yanagiba Yoko Honna Mayumi Izuka (Score: 4.057417181897897)
Documento 134: An experienced investigator confronts several conflicting theories about the perpetrators of a violent double homicide. Irrfan Khan Konkona Sen Sharma Neeraj Kabi Sohum Shah (Score: 3.994650214425348)
Documento 838: "Documentary" about a man who can look and act like whoever he's around, and meets various famous people. Woody Allen Mia Farrow Patrick Horgan John Buckwalter (Score: 3.8747673503567492)


Unnamed: 0,Series_Title,Overview,Genre,Released_Year,IMDB_Rating,score
971,Omohide poro poro,A twenty-seven-year-old office worker travels ...,"[Animation, Drama, Romance]",1991,7.6,0.405742
133,Talvar,An experienced investigator confronts several ...,"[Crime, Drama, Mystery]",2015,8.2,0.399465
837,Zelig,"""Documentary"" about a man who can look and act...",[Comedy],1983,7.7,0.387477


# Comparison between the semantic search engine and BM25

In [216]:
search("Movie about resurrected reptiles.", None,  0, 3) # Semantic search

Unnamed: 0,Title,Overview,Director,Genre,year,Rating,Score
0,Jurassic Park,A pragmatic paleontologist visiting an almost ...,Steven Spielberg,"[Action, Adventure, Sci-Fi]",1993,8.1,0.352887
1,Young Frankenstein,An American grandson of the infamous scientist...,Mel Brooks,[Comedy],1974,8.0,0.352224
2,Edge of Tomorrow,A soldier fighting aliens gets to relive the s...,Doug Liman,"[Action, Adventure, Sci-Fi]",2014,7.9,0.338429


In [202]:
df_result = bm25_search('Movie about resurrected reptiles.')
df_result


Documentos más relevantes top (3):
Documento 972: A twenty-seven-year-old office worker travels to the countryside while reminiscing about her childhood in Tokyo. Miki Imai Toshirô Yanagiba Yoko Honna Mayumi Izuka (Score: 4.057417181897897)
Documento 134: An experienced investigator confronts several conflicting theories about the perpetrators of a violent double homicide. Irrfan Khan Konkona Sen Sharma Neeraj Kabi Sohum Shah (Score: 3.994650214425348)
Documento 838: "Documentary" about a man who can look and act like whoever he's around, and meets various famous people. Woody Allen Mia Farrow Patrick Horgan John Buckwalter (Score: 3.8747673503567492)


Unnamed: 0,Series_Title,Overview,Genre,Released_Year,IMDB_Rating,score
971,Omohide poro poro,A twenty-seven-year-old office worker travels ...,"[Animation, Drama, Romance]",1991,7.6,4.057417
133,Talvar,An experienced investigator confronts several ...,"[Crime, Drama, Mystery]",2015,8.2,3.99465
837,Zelig,"""Documentary"" about a man who can look and act...",[Comedy],1983,7.7,3.874767
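
In the BM25 results above, each of the three top-ranked documents appears to match the query only on the token `about`: the query is split on spaces without lowercasing or removing punctuation, so `Movie` and `reptiles.` are kept as-is. A minimal sketch of a more forgiving tokenizer follows; `tokenize` is a hypothetical helper, and applying it would change the recorded scores.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


query = "Movie about resurrected reptiles."
print(query.split(" "))   # ['Movie', 'about', 'resurrected', 'reptiles.']
print(tokenize(query))    # ['movie', 'about', 'resurrected', 'reptiles']
```

Normalization would let `reptiles.` match `reptiles`, but it would not bridge the gap between `reptiles` and an overview that only mentions dinosaurs. That gap is what the embedding-based search addresses.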


In [223]:
search("spide man", None,  0, 3) # Semantic search

Unnamed: 0,Title,Overview,Director,Genre,year,Rating,Score
0,Det sjunde inseglet,"A man seeks answers about life, death, and the...",Ingmar Bergman,"[Drama, Fantasy, History]",1957,8.2,0.347928
1,Chhichhore,"A tragic incident forces Anirudh, a middle-age...",Nitesh Tiwari,"[Comedy, Drama]",2019,8.2,0.305351
2,Vikram Vedha,"Vikram, a no-nonsense police officer, accompan...",Gayatri,"[Action, Crime, Drama]",2017,8.4,0.298801


In [224]:
df_result = bm25_search('spide man')
df_result


Documentos más relevantes top (3):
Documento 419: A man befriends a fellow criminal as the two of them begin serving their sentence on a dreadful prison island, which inspires the man to plot his escape. Steve McQueen Dustin Hoffman Victor Jory Don Gordon (Score: 3.114221600226639)
Documento 486: A young Arab man is sent to a French prison. Tahar Rahim Niels Arestrup Adel Bencherif Reda Kateb (Score: 2.8199040183634403)
Documento 1000: A man in London tries to help a counter-espionage Agent. But when the Agent is killed, and the man stands accused, he must go on the run to save himself and stop a spy ring which is trying to steal top secret information. Robert Donat Madeleine Carroll Lucie Mannheim Godfrey Tearle (Score: 2.7288006082174237)


Unnamed: 0,Series_Title,Overview,Genre,Released_Year,IMDB_Rating,score
418,Papillon,A man befriends a fellow criminal as the two o...,"[Biography, Crime, Drama]",1973,8.0,0.311422
485,Un prophète,A young Arab man is sent to a French prison.,"[Crime, Drama]",2009,7.9,0.28199
999,The 39 Steps,A man in London tries to help a counter-espion...,"[Crime, Mystery, Thriller]",1935,7.6,0.27288


# Graphical interface for the semantic search engine

The following block displays a UI for interacting with the semantic search engine.

In [203]:
genres = UNIQUE_GENEROUS.tolist()
iface = gr.Interface(
    fn=search,
    inputs=[
        gr.Textbox(lines=5, placeholder="Movie to search for", label="Query"),
        gr.Dropdown(choices=genres, label="Movie genre"),
        gr.Slider(minimum=1, maximum=10, value=5, label="Minimum rating"),
        # precision=0 makes the component return an integer
        gr.Number(minimum=1, maximum=10, value=3, precision=0, label="Number of results"),
    ],
    outputs=gr.Dataframe(type="pandas", label="Results"),
    title="Semantic movie search",
    description="Enter your query, select a genre, and set a minimum rating to search for movies.",
)

# Launch the interface
iface.launch()



