# Semantic Search for Movie Plots

This notebook builds a small semantic search engine for movie plots. It uses `sentence-transformers` to embed each plot as a vector, then ranks movies by the cosine similarity between their plot embedding and the embedding of a query.
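
To make the ranking criterion concrete, here is a minimal, dependency-free sketch of cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper name `cosine` is an assumption for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
query = [1.0, 0.0, 1.0]
similar_plot = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_plot = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine(query, similar_plot))    # close to 1: same direction
print(cosine(query, unrelated_plot))  # 0: no overlap in direction
```

Only the direction of the vectors matters, not their length, which is why the score is a reasonable proxy for "similar meaning" when the vectors come from a sentence embedding model.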

### 1. Install and Import Libraries

In [1]:
# Install the required libraries
!pip install sentence-transformers pandas scikit-learn



In [3]:


# Install tf_keras first so it is available before sentence_transformers is imported
!pip install tf_keras

# Import necessary libraries
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-intel 2.17.0 requires ml-dtypes<0.5.0,>=0.3.1, but you have ml-dtypes 0.5.3 which is incompatible.
tensorflow-intel 2.17.0 requires tensorboard<2.18,>=2.17, but you have tensorboard 2.19.0 which is incompatible.





### 2. Load the Dataset

In [4]:
# Load the movies.csv file into a pandas DataFrame
df = pd.read_csv('movies.csv')
print("Dataset loaded successfully. Here are the first 5 rows:")


Dataset loaded successfully. Here are the first 5 rows:


In [5]:
df.head()

Unnamed: 0,title,plot
0,Spy Movie,A spy navigates intrigue in Paris to stop a te...
1,Romance in Paris,A couple falls in love in Paris under romantic...
2,Action Flick,A high-octane chase through New York with expl...
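
Before embedding, it can help to confirm that the file has the columns the rest of the notebook expects and that no plot is empty, since the encoding step expects a list of strings. The sketch below uses only the standard library and an in-memory CSV; the sample rows are hypothetical and are not the contents of `movies.csv`.

```python
import csv
import io

# Hypothetical stand-in for movies.csv; the real file's contents may differ.
sample_csv = io.StringIO(
    "title,plot\n"
    "Example A,A detective hunts a thief.\n"
    "Example B,\n"
    "Example C,Two friends cross a desert.\n"
)

reader = csv.DictReader(sample_csv)

# Fail early if a column the notebook relies on is absent
required = {"title", "plot"}
missing = required - set(reader.fieldnames or [])
if missing:
    raise ValueError(f"Missing required columns: {sorted(missing)}")

# Keep only rows whose plot is non-empty after stripping whitespace
rows = [row for row in reader if (row["plot"] or "").strip()]
print([row["title"] for row in rows])
```

With pandas, a comparable check would be `df.dropna(subset=['plot'])` applied before the plots are converted to a list.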


### 3. Create Embeddings for Movie Plots

In [6]:
# Load the pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for the movie plots
# This may take a moment to run
print("Creating embeddings for movie plots...")
embeddings = model.encode(df['plot'].tolist(), convert_to_tensor=False)
print("Embeddings created successfully.")

Creating embeddings for movie plots...
Embeddings created successfully.
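
One optional refinement at this step is to L2-normalize the embeddings once, so that a plain dot product later equals the cosine similarity. The sketch below shows the idea on made-up two-dimensional vectors; the helper names `l2_normalize` and `dot` are assumptions for illustration. In `sentence-transformers`, `encode` accepts a `normalize_embeddings` argument that serves the same purpose.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Made-up vectors standing in for a plot embedding and a query embedding
plot_vec = [3.0, 4.0]
query_vec = [4.0, 3.0]

unit_plot = l2_normalize(plot_vec)
unit_query = l2_normalize(query_vec)

# For unit vectors, the dot product is the cosine similarity
score = dot(unit_plot, unit_query)
print(round(score, 2))
```

Normalizing up front moves the norm computation out of every query, which matters more as the number of plots grows.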


### 4. Implement the Search Function

In [7]:
def search_movies(query, top_n=5):
    """
    Searches for movies based on a query using semantic similarity.

    Args:
        query (str): The search query.
        top_n (int): The number of top results to return.

    Returns:
        pandas.DataFrame: A DataFrame with the top N movies, including their 
                          titles, plots, and similarity scores.
    """
    # Encode the query to get its embedding
    query_embedding = model.encode([query], convert_to_tensor=False)
    
    # Calculate cosine similarity between the query and all movie plots
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    
    # Get the indices of the top N most similar movies, best match first.
    # Reversing before slicing also returns an empty result when top_n is 0.
    top_indices = np.argsort(similarities)[::-1][:top_n]
    
    # Create a result DataFrame
    result_df = df.iloc[top_indices].copy()
    result_df['similarity'] = similarities[top_indices]
    
    return result_df
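
For larger collections, sorting every score just to keep a handful is more work than needed. As a sketch of an alternative, the standard library's `heapq.nlargest` selects the best `k` indices directly. The function name `top_k_indices` and the scores are hypothetical, and this is not the selection method used in `search_movies`.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    if k <= 0:
        return []
    # nlargest caps k at len(scores), so oversized requests are safe
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

scores = [0.12, 0.77, 0.39, 0.05, 0.26]  # made-up similarity scores
print(top_k_indices(scores, 3))  # [1, 2, 4]
```

The returned indices could then be passed to `df.iloc[...]` in the same way `top_indices` is used in the search function.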

### 5. Test the Search Function

In [8]:
# Test the search function with the query 'spy thriller in Paris'
test_query = 'spy thriller in Paris'
results = search_movies(test_query, top_n=3)

print(f"Top {len(results)} results for the query: '{test_query}'")
results

Top 3 results for the query: 'spy thriller in Paris'


Unnamed: 0,title,plot,similarity
0,Spy Movie,A spy navigates intrigue in Paris to stop a te...,0.769684
1,Romance in Paris,A couple falls in love in Paris under romantic...,0.388029
2,Action Flick,A high-octane chase through New York with expl...,0.256777
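
The search always returns `top_n` rows, even when some are only weakly related: "Action Flick" appears here with a score of 0.256777 despite being set in New York. One possible follow-up, sketched below, is to drop results under a minimum similarity. The cutoff value, the constant `MIN_SIMILARITY`, and the helper `filter_by_score` are hypothetical; a sensible threshold would need to be tuned on real queries.

```python
# (title, similarity) pairs copied from the output above
scored_results = [
    ("Spy Movie", 0.769684),
    ("Romance in Paris", 0.388029),
    ("Action Flick", 0.256777),
]

MIN_SIMILARITY = 0.3  # hypothetical cutoff; tune it on real queries

def filter_by_score(pairs, min_score):
    """Keep only results whose similarity is at least min_score."""
    return [(title, score) for title, score in pairs if score >= min_score]

relevant = filter_by_score(scored_results, MIN_SIMILARITY)
print([title for title, _ in relevant])  # ['Spy Movie', 'Romance in Paris']
```

On the DataFrame returned by `search_movies`, the equivalent filter would be `results[results['similarity'] >= MIN_SIMILARITY]`.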
