# Problem Statement
Semantic search aims to enhance search accuracy by comprehending the content of the search query. In contrast to conventional search engines that primarily rely on lexical matches to locate documents, semantic search can also identify synonyms.

# Table of Contents

1. [**Step 1: Install and Import Required Libraries**](#step-1-install-and-import-required-libraries)
2. [**Step 2: Data Loading**](#step-2-data-loading)
3. [**Step 3: Data Cleaning and Pre-processing**](#step-3-data-cleaning-and-pre-processing)
4. [**Step 4: Unveiling Hot Keywords with Not Fully Fledged Semantic Search**](#step-4-unveiling-hot-keywords-with-not-fully-fledged-semantic-search)
    1. [**Step 4-1: Building Word Dictionary**](#step-4-1-building-word-dictionary)
    2. [**Step 4-2: Feature Extraction (Bag of Words)**](#step-4-2-feature-extraction-bag-of-words)
5. [**Step 5: Unveiling Hot Keywords with Word2Vec Semantic Search**](#step-5-unveiling-hot-keywords-with-word2vec-semantic-search)
6. [**Step 6: Unveiling Hot Keywords with Sentence Transformers Search**](#step-6-unveiling-hot-keywords-with-sentence-transformers-search)

---

## Step 1 : Install and Import Required Libraries

In [None]:
# Install the spaCy library
!pip install spacy

In [None]:
# Download the spaCy English language model
!python -m spacy download en_core_web_sm

In [None]:
# Install the YAKE keyword extraction library
!pip install yake

In [None]:
# Upgrade the Gensim library to the latest version
!pip install --upgrade gensim

In [None]:
# Install the Sentence Transformers library
!pip install sentence-transformers

In [1]:
# import necessary libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ast
import spacy
import string
import gensim
import torch
import operator
import re
from gensim.similarities import MatrixSimilarity
from spacy.lang.en.stop_words import STOP_WORDS
from gensim import corpora, models
from operator import itemgetter
from sentence_transformers import SentenceTransformer
from gensim.models import Word2Vec
from yake import KeywordExtractor

In [3]:
# Import classes from python scripts
from utils import Utils
from basic_search import MovieSearch
from word2vec_search import Word2VecSearch
from Transformer_Search import TransformerSearch

## Step 2 : Data loading
We will now load the movies data csv into dataframe and quickly peek into the columns and data provided

movie-title - Title of the movie

year        - year in which the movie released

genre       - The style or category of the movie

synopsis    - The basic plot of the movie available in IMDB

cast        - List of people involved in the movie

In [4]:
# Read the CSV file into a DataFrame
df = pd.read_csv('C:/Users/Mustafa Abdulnasser/Desktop/Task 3/data/imdb_movies.csv')

# Display the DataFrame
df

Unnamed: 0,movie_title,year,genre,synopsis,cast
0,A Splash of Love,2022.0,"Comedy, Romance, Back to top","Chloe Turner, a Ph.D. candidate in Marine Mamm...","Heather Hawthorn Doyle (dir.), Rhiannon Fish, ..."
1,The Grey Man,2007.0,"Biography, Crime, Drama, Back to top",Kevin Dodds is a browbeaten deputy bank manage...,"Declan O'Dwyer (dir.), Daniel Ryan, Nitin Ganatra"
2,Descendants,2015.0,"Comedy, Family, Fantasy, Back to top","Ben, son of Belle and the once selfish Beast, ...","Kenny Ortega (dir.), Dove Cameron, Cameron Boyce"
3,Teen Wolf: The Movie,,"Action, Comedy, Drama, Back to top","A full moon rises in Beacon Hills, and with it...","Russell Mulcahy (dir.), Melissa Ponzio, Linden..."
4,High School Musical,2006.0,"Comedy, Drama, Family, Back to top",Troy Bolton and Gabriella Montez are two total...,"Kenny Ortega (dir.), Zac Efron, Vanessa Hudgens"
...,...,...,...,...,...
10050,Hyperland,2021.0,"Drama, Back to top",A mother afflicted with depression comes into ...,"Mario Sixtus (dir.), Lorna Ishema, Samuel Schn..."
10051,The Brass Ring,1983.0,"Crime, Drama, Back to top",When a wealthy businessman is murdered by a hi...,"Bob Balaban (dir.), Dina Merrill, Sylvia Sidney"
10052,One Hot Summer Night,1998.0,"Family, Back to top",A look at Yellowstone National Park's wildlife...,"James A. Contner (dir.), Erika Eleniak, Brian ..."
10053,Yellowstone in Four Seasons,2017.0,"Drama, Back to top","Biopic of sorts about Agatha Christie, the fam...",Daniel P. Dauterive (dir.)


## Step 3 : Data Cleaning and Pre-processing
Data pre-processing is one of the most significant step in text analytics. The purpose is to remove any unwanted words or characters which are written for human readability, but won't contribute to topic modelling in anyway.

Now let us apply removing dublications in our dataset

In [5]:
# Create an instance of the Utils class
utils = Utils()

In [6]:
# Drop duplicates by a specified column
df = utils.drop_duplicates_by_column(df, column_to_check='movie_title')

Duplicates before:


Unnamed: 0,movie_title,year,genre,synopsis,cast
56,A Splash of Love,2022.0,"Biography, Crime, Drama, Back to top",Kevin Dodds is a browbeaten deputy bank manage...,"Heather Hawthorn Doyle (dir.), Rhiannon Fish, ..."
57,The Grey Man,2007.0,"Comedy, Family, Fantasy, Back to top","Ben, son of Belle and the once selfish Beast, ...","Declan O'Dwyer (dir.), Daniel Ryan, Nitin Ganatra"
58,Descendants,2015.0,"Action, Comedy, Drama, Back to top","A full moon rises in Beacon Hills, and with it...","Kenny Ortega (dir.), Dove Cameron, Cameron Boyce"
59,Teen Wolf: The Movie,,"Comedy, Drama, Family, Back to top",Troy Bolton and Gabriella Montez are two total...,"Russell Mulcahy (dir.), Melissa Ponzio, Linden..."
60,High School Musical,2006.0,"Action, Drama, Family, Back to top",Royal Navy captain Wentworth was haughtily tur...,"Kenny Ortega (dir.), Zac Efron, Vanessa Hudgens"
...,...,...,...,...,...
9960,Dummy,1977.0,"Drama, Back to top",Jim Rayborn has for years been doing most of t...,"Franc Roddam (dir.), Geraldine James, Lara Crooks"
9961,You Lucky Dog,2010.0,"Comedy, Back to top",It looks like we don't have any Plot Summaries...,"John Bradshaw (dir.), Natasha Henstridge, Harr..."
9983,Home for the Holidays,2005.0,"Documentary, Back to top",The show focuses on the extraordinary stories ...,"Richard Compton (dir.), Sean Young, Lucia Walters"
10028,Maternal Instinct,2017.0,"Documentary, Back to top",A history and celebration of the Lincoln Highw...,"George Erschbamer (dir.), Laura Mennell, Marcu..."



DataFrame without Duplicates:


Unnamed: 0,movie_title,year,genre,synopsis,cast
0,A Splash of Love,2022.0,"Comedy, Romance, Back to top","Chloe Turner, a Ph.D. candidate in Marine Mamm...","Heather Hawthorn Doyle (dir.), Rhiannon Fish, ..."
1,The Grey Man,2007.0,"Biography, Crime, Drama, Back to top",Kevin Dodds is a browbeaten deputy bank manage...,"Declan O'Dwyer (dir.), Daniel Ryan, Nitin Ganatra"
2,Descendants,2015.0,"Comedy, Family, Fantasy, Back to top","Ben, son of Belle and the once selfish Beast, ...","Kenny Ortega (dir.), Dove Cameron, Cameron Boyce"
3,Teen Wolf: The Movie,,"Action, Comedy, Drama, Back to top","A full moon rises in Beacon Hills, and with it...","Russell Mulcahy (dir.), Melissa Ponzio, Linden..."
4,High School Musical,2006.0,"Comedy, Drama, Family, Back to top",Troy Bolton and Gabriella Montez are two total...,"Kenny Ortega (dir.), Zac Efron, Vanessa Hudgens"
...,...,...,...,...,...
9741,Hyperland,2021.0,"Drama, Back to top",A mother afflicted with depression comes into ...,"Mario Sixtus (dir.), Lorna Ishema, Samuel Schn..."
9742,The Brass Ring,1983.0,"Crime, Drama, Back to top",When a wealthy businessman is murdered by a hi...,"Bob Balaban (dir.), Dina Merrill, Sylvia Sidney"
9743,One Hot Summer Night,1998.0,"Family, Back to top",A look at Yellowstone National Park's wildlife...,"James A. Contner (dir.), Erika Eleniak, Brian ..."
9744,Yellowstone in Four Seasons,2017.0,"Drama, Back to top","Biopic of sorts about Agatha Christie, the fam...",Daniel P. Dauterive (dir.)



Duplicates after:


Unnamed: 0,movie_title,year,genre,synopsis,cast



Total Rows in Original DataFrame: 10055
Total Rows in Filtered DataFrame: 9746
Total Duplicates Before: 309
Total Duplicates After: 0


Now let us apply the data-cleaning and pre-processing function to our movies "synopsis" column and store the cleaned, tokenized data into new column

In [7]:
# df['synopsis'] contains the text I want to tokenize
print ('Cleaning and Tokenizing...')
%time df['tokenized_synopsis'] = df['synopsis'].map(utils.custom_tokenizer)

In [8]:
# Display the Head of the DataFrame
df.head()

Unnamed: 0,movie_title,year,genre,synopsis,cast
0,A Splash of Love,2022.0,"Comedy, Romance, Back to top","Chloe Turner, a Ph.D. candidate in Marine Mamm...","Heather Hawthorn Doyle (dir.), Rhiannon Fish, ..."
1,The Grey Man,2007.0,"Biography, Crime, Drama, Back to top",Kevin Dodds is a browbeaten deputy bank manage...,"Declan O'Dwyer (dir.), Daniel Ryan, Nitin Ganatra"
2,Descendants,2015.0,"Comedy, Family, Fantasy, Back to top","Ben, son of Belle and the once selfish Beast, ...","Kenny Ortega (dir.), Dove Cameron, Cameron Boyce"
3,Teen Wolf: The Movie,,"Action, Comedy, Drama, Back to top","A full moon rises in Beacon Hills, and with it...","Russell Mulcahy (dir.), Melissa Ponzio, Linden..."
4,High School Musical,2006.0,"Comedy, Drama, Family, Back to top",Troy Bolton and Gabriella Montez are two total...,"Kenny Ortega (dir.), Zac Efron, Vanessa Hudgens"


Store the tokenized column into a sepearte variable for ease of operations in subsequent sections and have a quick peek into the values

In [11]:
# Extract the 'tokenized_synopsis' column from the DataFrame and assign it to the variable 'movie_plot'
movie_plot = df['tokenized_synopsis']

# Display the first 5 rows of the 'movie_plot' Series
movie_plot[0:5]

0    [chloe, turner, candidate, marine, mammalogy, ...
1    [kevin, dodds, browbeaten, deputy, bank, manag...
2    [ben, son, belle, selfish, beast, poise, thron...
3    [moon, rise, beacon, hills, terrifying, evil, ...
4    [troy, bolton, gabriella, montez, totally, dif...
Name: tokenized_synopsis, dtype: object

## Step 4 : Unveiling Hot Keywords with Not Fully Fledged Semantic Search

### Step 4-1 : Building Word Dictionary
In the next step we will build the vocabulary of the corpus in which all the unique words are given IDs and their frequency counds are also stored. You may note that we are using gensim library for building the dictionary. In gensim, the words are referred as "tokens" adn the index of each word in the dictionary is called ID

In [12]:
#creating term dictionary
%time dictionary = corpora.Dictionary(movie_plot)

#list of few which which can be further removed
stoplist = set('hello and if this can would should could tell ask stop come go')
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
dictionary.filter_tokens(stop_ids)

Wall time: 354 ms


In [13]:
#print top 50 items from the dictionary with their unique token-id
dict_tokens = [[[dictionary[key], dictionary.token2id[dictionary[key]]] for key, value in dictionary.items() if key <= 50]]
print (dict_tokens)

[[['affect', 0], ['area', 1], ['arguably', 2], ['ask', 3], ['averse', 4], ['base', 5], ['ben', 6], ['cable', 7], ['candidate', 8], ['central', 9], ['chloe', 10], ['college', 11], ['concern', 12], ['conservation', 13], ['cove', 14], ['deem', 15], ['deny', 16], ['different', 17], ['dissertation', 18], ['distance', 19], ['echo', 20], ['economics', 21], ['endangered', 22], ['eventually', 23], ['fall', 24], ['field', 25], ['fish', 26], ['guide', 27], ['having', 28], ['help', 29], ['home', 30], ['incomplete', 31], ['independence', 32], ['interest', 33], ['knowledgeable', 34], ['let', 35], ['like', 36], ['locale', 37], ['locate', 38], ['mammalogy', 39], ['marcus', 40], ['marine', 41], ['miami', 42], ['non', 43], ['northwest', 44], ['obstacle', 45], ['operate', 46], ['opportune', 47], ['opportunity', 48], ['orca', 49], ['orcas', 50]]]


### Step 4-2 : Feature Extraction (Bag of Words)
A bag of words model, or BoW for short is a way of extracting features from text for use in modelling, such as with machine learning algorithms. It is a representation of tet that describes teh occurence of words within a document. It involves two things

A vocabulary of known words
A measure of the presence of known words
The doc2bow method of dictionary, iterates through all the words in the text, if the word already exists in the corpus, it increments the frequency count, other wise it inserts the word into the corpus and sets it freqeuncy count to 1

In [14]:
# Assuming corpus is a list of bag-of-words representations for each document
corpus = [dictionary.doc2bow(desc) for desc in movie_plot]

# Extract word frequencies from the first 3 documents in the corpus
word_frequencies = []
for doc_bow in corpus[:1]:
    word_frequency_doc = [(dictionary[id], frequency) for id, frequency in doc_bow]
    word_frequencies.append(word_frequency_doc)

print(word_frequencies)

[[('affect', 1), ('area', 1), ('arguably', 1), ('ask', 2), ('averse', 1), ('base', 1), ('ben', 2), ('cable', 3), ('candidate', 2), ('central', 1), ('chloe', 2), ('college', 1), ('concern', 1), ('conservation', 1), ('cove', 3), ('deem', 1), ('deny', 1), ('different', 1), ('dissertation', 1), ('distance', 1), ('echo', 1), ('economics', 1), ('endangered', 1), ('eventually', 1), ('fall', 1), ('field', 1), ('fish', 1), ('guide', 1), ('having', 2), ('help', 2), ('home', 1), ('incomplete', 1), ('independence', 1), ('interest', 1), ('knowledgeable', 1), ('let', 1), ('like', 1), ('locale', 1), ('locate', 1), ('mammalogy', 1), ('marcus', 2), ('marine', 2), ('miami', 1), ('non', 1), ('northwest', 1), ('obstacle', 1), ('operate', 1), ('opportune', 1), ('opportunity', 1), ('orca', 2), ('orcas', 1), ('pacific', 1), ('person', 1), ('pod', 1), ('population', 3), ('possibility', 1), ('problem', 1), ('project', 2), ('reevaluate', 1), ('regardless', 1), ('relationship', 4), ('research', 3), ('resident', 

The above results shows vocabulary with their frequency.

Build Tf-Idf and LSI Model
Tf-Idf means, Term frequency-Inverse Document Frequency. it is a commonly used NLP model that helps you determine the most important words in each document in the corpus. Once the Tf-Idf is build, pass it to LSI model and specify the num of features to build

In [15]:
# Create a TF-IDF model
%time movie_tfidf_model = gensim.models.TfidfModel(corpus, id2word=dictionary)

# Create an LSI model using the TF-IDF model
%time movie_lsi_model = gensim.models.LsiModel(movie_tfidf_model[corpus], id2word=dictionary, num_topics=300)

Wall time: 113 ms
Wall time: 3.86 s


Serialize and Store the corpus locally for easy retrival whenver required.

In [16]:
# Serialize the TF-IDF model to disk
%time gensim.corpora.MmCorpus.serialize('C:/Users/Mustafa Abdulnasser/Desktop/Task 3/models/tfidf-lsi-models/movie_tfidf_model_mm', movie_tfidf_model[corpus])

# Serialize the LSI model to disk
%time gensim.corpora.MmCorpus.serialize('C:/Users/Mustafa Abdulnasser/Desktop/Task 3/models/tfidf-lsi-models/movie_lsi_model_mm', movie_lsi_model[movie_tfidf_model[corpus]])

Wall time: 1.03 s
Wall time: 6.13 s


In [17]:
# Load the TF-IDF corpus from disk
movie_tfidf_corpus = gensim.corpora.MmCorpus('C:/Users/Mustafa Abdulnasser/Desktop/Task 3/models/tfidf-lsi-models/movie_tfidf_model_mm')

# Load the LSI corpus from disk
movie_lsi_corpus = gensim.corpora.MmCorpus('C:/Users/Mustafa Abdulnasser/Desktop/Task 3/models/tfidf-lsi-models/movie_lsi_model_mm')

# Print the TF-IDF corpus
print(movie_tfidf_corpus)

# Print the LSI corpus
print(movie_lsi_corpus)

MmCorpus(9746 documents, 28637 features, 314736 non-zero entries)
MmCorpus(9746 documents, 300 features, 2923800 non-zero entries)


In [18]:
# Create a similarity index for the LSI corpus
%time movie_index = MatrixSimilarity(movie_lsi_corpus, num_features=movie_lsi_corpus.num_terms)

Wall time: 3.7 s


We will input a search query and model will return relevant movie titles. The higher the similarity score, the more similar the query to the documetn at the given index

Now, let's embark on an exciting journey of basic semantic search tailored for uncovering hot keywords. With the movie index initialized and loaded, our aim is to input a search query and unveil the most relevant hot keywords.

The process involves using LSI weights to identify the terms that contribute significantly to the semantic meaning of the documents in the index. Each keyword will be associated with a word, indicating its importance in the context of the query.

In [19]:
# Create a MovieSearch object with specified components
movie_search = MovieSearch(
    tokenizer=utils,           # 'utils' is an instance of a tokenizer
    dictionary=dictionary,     # Dictionary used in the models
    movie_tfidf_model=movie_tfidf_model,   # TF-IDF model
    movie_lsi_model=movie_lsi_model,       # LSI model
    movie_index=movie_index,               # Similarity index
    df=df                    # DataFrame 'df' containing movie information
)

In [27]:
# Specify the search term
search_term = "The Other Guys"

# Use the MovieSearch object to find similar movies
similar_movies_info = movie_search.search_similar_movies(search_term)

# Display information about similar movies
for movie_info in similar_movies_info:
    print("Most Similar Movies with Hot Keywords:\n")
    print(f"Title: {movie_info['Title']}")
    print(f"Genre: {movie_info['Genre']}")
    print(f"Year: {movie_info['Year']:.0f}")  # Format the year as an integer
    print(f"Hot Keywords: {movie_info['Hot Keywords']}")
    
    # Add a separator line between movies for better readability
    print("\n" + "*" * 50 + "\n")

Most Similar Movies with Hot Keywords:

Title: Babe Ruth
Genre: Drama, Mystery, Romance, Back to top
Year: 1991
Hot Keywords: ['chloe', 'area', 'want', 'deny', 'evil', 'historic', 'poise', 'flak', 'fashionable', 'kitsunes']

**************************************************

Most Similar Movies with Hot Keywords:

Title: Possible Side Effects
Genre: Comedy, Back to top
Year: 2009
Hot Keywords: ['area', 'winters', 'teammate', 'dodds', 'college', 'farmer', 'daughter', 'sir', 'cheap', 'emerge']

**************************************************

Most Similar Movies with Hot Keywords:

Title: The Sisterhood
Genre: Drama, War, Back to top
Year: 2019
Hot Keywords: ['area', 'teammate', 'salish', 'flak', 'help', 'status', 'true', 'endangered', 'set', 'dance']

**************************************************



## Step 5 : Unveiling Hot Keywords with Word2Vec Semantic Search:
Unveiling Hot Keywords with Word2Vec Semantic Search
In this section, we will leverage the Word2Vec model to perform a basic semantic search. The model takes a search query and returns relevant movie titles based on the similarity score. The higher the similarity score, the more similar the query is to the document at the given index.

To achieve this:

#### Input Query: Provide a search query.
#### Compute Similarities: Utilize the Word2Vec model to calculate the similarity between the query and the movie synopses.
#### Extract Hot Keywords using two ways: Identify significant terms using Word2Vec weights, indicating their importance in the context of the query.
   ****User-Defined Method:**** Utilize a user-defined method developed to extract hot keywords. This method may be tailored to specific requirements or domain knowledge.

   ****YAKE (Yet Another Keyword Extractor):**** Alternatively, employ YAKE, a robust keyword extraction tool, to identify additional relevant terms contributing to the semantic meaning of the documents.
#### Present Results: Unveil the most relevant movie titles along with their associated hot keywords.

In [29]:
# Train a Word2Vec model on the 'movie_plot' data
# Parameters:
# - vector_size: The size of the word vectors (100 in this case)
# - window: The maximum distance between the current and predicted word within a sentence (10 in this case)
# - min_count: Ignores all words with a total frequency lower than this (1 in this case, include even infrequent words)
# - workers: Number of CPU cores to use (4 in this case)
word2vec_model = Word2Vec(movie_plot, vector_size=100, window=10, min_count=1, workers=4)

In [30]:
# Save the trained Word2Vec model to a file
word2vec_model.save("word2vec_imdb.model")

In [13]:
# Create a Word2VecSearch object for searching using the Word2Vec model
word2vec_search = Word2VecSearch(word2vec_model_path='word2vec_imdb.model', tokenizer=utils)

In [39]:
# Perform semantic search using Word2Vec for movies similar to 'A Splash of Love'
most_similar_movies_word2vec = word2vec_search.semantic_search_word2vec(query_title='A Splash of Love', df=df, yake=False)

# Print the Most Similar Movies
print("Most Similar Movies:")
for index, row in most_similar_movies_word2vec.iterrows():
    print(f"Title: {row['movie_title']}")
    print(f"Genre: {row['genre']}")
    print(f"Year: {row['year']:.0f}")  # Format the year as an integer
    print("\n" + "*" * 50 + "\n")  # Add stars between movies

Most Similar Movies:
Title: The Rocky Horror Picture Show: Let's Do the Time Warp Again
Genre: Comedy, Romance, Back to top
Year: 2016

**************************************************

Title: The Child in Time
Genre: Drama, Romance, Back to top
Year: 2017

**************************************************

Title: AD/BC: A Rock Opera
Genre: Comedy, Romance, Back to top
Year: 2004

**************************************************



In [41]:
# Extract Hot Keywords for each similar movie
for index, row in most_similar_movies_word2vec.iterrows():
    movie_title = row['movie_title']
    movie_synopsis = row['synopsis']

    # Extract top 5 hot keywords for the movie's synopsis
    hot_keywords = word2vec_search.extract_hot_keywords_word2vec(movie_synopsis, top_keywords=5)

    print(f"\nHot Keywords for Movie: {movie_title}")
    for word, similarity in hot_keywords:
        print(f"  Word: {word}, Similarity: {similarity:.4f}")

    # Print stars between movies for better readability
    print("\n" + "*" * 50 + "\n")


Hot Keywords for Movie: The Rocky Horror Picture Show: Let's Do the Time Warp Again
  Word: from, Similarity: 0.9987
  Word: for, Similarity: 0.9969
  Word: respect, Similarity: 0.9963
  Word: problems, Similarity: 0.9953
  Word: having, Similarity: 0.9946

**************************************************


Hot Keywords for Movie: The Child in Time
  Word: pursuing, Similarity: 0.9986
  Word: relationship, Similarity: 0.9981
  Word: ownership, Similarity: 0.9977
  Word: it, Similarity: 0.9967
  Word: sparks, Similarity: 0.9946

**************************************************


Hot Keywords for Movie: AD/BC: A Rock Opera
  Word: keeping, Similarity: 0.9985
  Word: largely, Similarity: 0.9961
  Word: throws, Similarity: 0.9955
  Word: especially, Similarity: 0.9954
  Word: all, Similarity: 0.9948

**************************************************



In [14]:
#Perform semantic search with hot keywords using Word2Vec for movies similar to 'The Other Guys'
most_similar_movies_word2vec_with_keywords = word2vec_search.semantic_search_word2vec(query_title='The Other Guys', 
                                                                                      df=df, yake=True)

# Print the result with hot keywords
print("Most Similar Movies with Hot Keywords:")
for index, row in most_similar_movies_word2vec_with_keywords.iterrows():
    print(f"Title: {row['movie_title']}")
    print(f"Genre: {row['genre']}")
    print(f"Year: {row['year']:.0f}")  # Format the year as an integer
    print(f"Hot Keywords: {row['Hot Keywords']}")
    print("\n" + "*" * 50 + "\n")  # Add stars between movies

Movie with title 'The Other Guys' not found. Searching based on movie title with other movies' synopses.
Most Similar Movies with Hot Keywords:
Title: A Christmas Movie Christmas
Genre: Drama, Back to top
Year: 2019
Hot Keywords: ['sugar daddy decision', 'art history transfer', 'majoring art history', 'junior majoring art', 'transfer local community']

**************************************************

Title: Stranger in My Bed
Genre: Comedy, Romance, Back to top
Year: 2005
Hot Keywords: ['mind potentially painful', 'work reflect grade', 'reluctance open potentially', 'painful heartache fail', 'math teacher middle']

**************************************************

Title: The Exotic Time Machine II: Forbidden Encounters
Genre: Comedy, Back to top
Year: 2000
Hot Keywords: ['clark daryl sabara', 'neve chloe bridges', 'chloe bridges heather', 'heather haley ramm', 'evening thing bad']

**************************************************



## Step 6 : Unveiling Hot Keywords with Sentence Transformers Search
Sentence Transformers are a type of pre-trained transformer-based models designed for encoding and understanding the semantic meaning of sentences. These models, often fine-tuned on specific tasks, are adept at capturing contextual relationships and semantic nuances within the text.

#### Semantic Search Process:
**Input Query** : Begin by inputting a search query, representing the user's information need.

**Sentence Embeddings** : Utilize the Sentence Transformer model to convert the query and movie synopses into high-dimensional embeddings, capturing their semantic content.

**Compute Similarities**: Calculate similarity scores between the query embedding and the embeddings of movie synopses. This reflects the semantic similarity between the query and each movie synopsis.

**Extract Hot Keywords by two ways:**

   ****User-Defined Method:**** Utilize a user-defined method developed to extract hot keywords. This method may be tailored to specific requirements or domain knowledge.

   ****YAKE (Yet Another Keyword Extractor):**** Alternatively, employ YAKE, a robust keyword extraction tool, to identify additional relevant terms contributing to the semantic meaning of the documents.

**Present Results:** Unveil the most relevant movie titles along with their associated hot keywords, providing insights into the semantic relationships discovered by the Sentence Transformer.

**Key Advantages:**

   ****Contextual Understanding:**** Sentence Transformers capture contextual information, allowing them to understand the meaning of words within the context of a sentence.

   ****Semantic Relationships:**** These models excel at revealing semantic relationships between sentences, making them suitable for tasks like semantic search.

In [15]:
# Load the Sentence Transformer model
sentence_transformer_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create a TransformerSearch object with the loaded sentence transformer model
transformer_search = TransformerSearch(sentence_transformer_model=sentence_transformer_model)

In [16]:
# Perform semantic search using the Transformer model for movies similar to 'A Splash of Love'
most_similar_movies_transformer = transformer_search.semantic_search_transformer(query_title='A Splash of Love',
                                                                                 df=df, yake=False)

# Print the Most Similar Movies
print("Most Similar Movies:")
for index, row in most_similar_movies_transformer.iterrows():
    print(f"Title: {row['movie_title']}")
    print(f"Genre: {row['genre']}")
    print(f"Year: {row['year']:.0f}")  # Format the year as an integer
    print("\n" + "*" * 50 + "\n")  # Add stars between movies

Most Similar Movies:
Title: The Rocky Horror Picture Show: Let's Do the Time Warp Again
Genre: Comedy, Romance, Back to top
Year: 2016

**************************************************

Title: Love on Safari
Genre: Drama, Romance, Back to top
Year: 2018

**************************************************

Title: Stone Pillow
Genre: Comedy, Romance, Back to top
Year: 1985

**************************************************



In [18]:
# Extract Hot Keywords for each similar movie using the Transformer model
for index, row in most_similar_movies_transformer.iterrows():
    movie_title = row['movie_title']
    movie_synopsis = row['synopsis']

    # Extract hot keywords from the movie's synopsis
    hot_keywords = transformer_search.extract_hot_keywords_transformer(movie_synopsis)

    print(f"\nHot Keywords for Movie: {movie_title}")
    for word, similarity in hot_keywords:
        print(f"  Word: {word}, Similarity: {similarity:.4f}")

    # Print stars between movies for better readability
    print("\n" + "*" * 50 + "\n")


Hot Keywords for Movie: The Rocky Horror Picture Show: Let's Do the Time Warp Again
  Word: let, Similarity: 0.6263
  Word: concerning, Similarity: 0.6241
  Word: specifically, Similarity: 0.6083
  Word: arguably, Similarity: 0.6037
  Word: project, Similarity: 0.6012

**************************************************


Hot Keywords for Movie: Love on Safari
  Word: subject, Similarity: 0.6740
  Word: writes, Similarity: 0.6634
  Word: come, Similarity: 0.6420
  Word: writer, Similarity: 0.6141
  Word: having, Similarity: 0.6107

**************************************************


Hot Keywords for Movie: Stone Pillow
  Word: fellow, Similarity: 0.6517
  Word: interesting, Similarity: 0.6308
  Word: friend, Similarity: 0.6104
  Word: allison, Similarity: 0.6027
  Word: way, Similarity: 0.6018

**************************************************



In [19]:
# Example usage: Perform semantic search with hot keywords using the Transformer model for movies similar to 'The Other Guys'
most_similar_movies_transformer_with_keywords = transformer_search.semantic_search_transformer(query_title='The Other Guys',
                                                                                               df=df, yake=True)

# Print the result with hot keywords
print("\nMost Similar Movies with Hot Keywords:")
for index, row in most_similar_movies_transformer_with_keywords.iterrows():
    print(f"Title: {row['movie_title']}")
    print(f"Genre: {row['genre']}")
    print(f"Year: {row['year']:.0f}")  # Format the year as an integer
    print(f"Hot Keywords: {row['Hot Keywords']}")
    print("\n" + "*" * 50 + "\n")  # Add stars between movies

Movie with title 'The Other Guys' not found. Searching based on movie title with other movies' synopses.

Most Similar Movies with Hot Keywords:
Title: Haunted High
Genre: Comedy, Back to top
Year: 2012
Hot Keywords: ['offbeat eccentric friends', 'awkward experiences racy', 'experiences racy tribulations', 'tribulations manny offbeat', 'friends']

**************************************************

Title: The Choking Game
Genre: Comedy, Drama, Back to top
Year: 2014
Hot Keywords: ['group men kent', 'form large plot', 'men kent clive', 'martin clunes rob', 'clunes rob neil']

**************************************************

Title: Last Rites
Genre: Comedy, Drama, Sport, Back to top
Year: 1998
Hot Keywords: ['day life group', 'baseball team game', 'group gambling buddies', 'buddies sit beloved', 'sit beloved baseball']

**************************************************

