<center><h2>Task 1: Preprocessing</h2></center>

<h4>Read input data into pandas dataframe.</h4>

In [1]:
import pandas as pd
file_path = "assignment2_data.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Title,Popularity,Tagline,Overview
0,Minions,875.581305,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by..."
1,Interstellar,724.247784,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...
2,Deadpool,514.569956,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...
3,Guardians of the Galaxy,481.098624,All heroes start somewhere.,"Light years from Earth, 26 years after being a..."
4,Mad Max: Fury Road,434.278564,What a Lovely Day.,An apocalyptic story set in the furthest reach...


<h4>Cleanup the data</h4>
<ol>
    <li>Remove records which missing under "Tagline" and "Overview"</li>
    <li>If even one of those columns has value, do not remove the record</li>
</ol>

In [2]:
# Remove records where both "Tagline" and "Overview" are missing
df_cleaned = df.dropna(subset=['Tagline', 'Overview'], how='all')

# Display dataset size before and after cleaning
print("Before Cleaning:", df.shape[0], "records")
print("After Cleaning:", df_cleaned.shape[0], "records")

Before Cleaning: 4803 records
After Cleaning: 4800 records


In [3]:
df_cleaned.head()

Unnamed: 0,Title,Popularity,Tagline,Overview
0,Minions,875.581305,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by..."
1,Interstellar,724.247784,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...
2,Deadpool,514.569956,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...
3,Guardians of the Galaxy,481.098624,All heroes start somewhere.,"Light years from Earth, 26 years after being a..."
4,Mad Max: Fury Road,434.278564,What a Lovely Day.,An apocalyptic story set in the furthest reach...


<h4>DataFrame "data" using "Title" and "Popularity" columns of the original data</h4>
<h5><b>"Full_Overview" = "Tagline" + " " + "Overview"</b></h5>

In [4]:
data = df_cleaned[['Title', 'Popularity']].copy()
data["Full_Overview"] = df_cleaned["Tagline"].fillna('')+" "+ df_cleaned["Overview"].fillna('')
print(data.head())

                     Title  Popularity  \
0                  Minions  875.581305   
1             Interstellar  724.247784   
2                 Deadpool  514.569956   
3  Guardians of the Galaxy  481.098624   
4       Mad Max: Fury Road  434.278564   

                                       Full_Overview  
0  Before Gru, they had a history of bad bosses M...  
1  Mankind was born on Earth. It was never meant ...  
2  Witness the beginning of a happy ending Deadpo...  
3  All heroes start somewhere. Light years from E...  
4  What a Lovely Day. An apocalyptic story set in...  


<h4>Perform Regular Expressions to remove punctuations, symbols and special characters from "Title" & "Full_Overview" columns.</h4>

In [5]:
import re
# Step 3: Remove punctuation, symbols, and special characters from "Title" and "Full_Overview"
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
        text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
        return text
    return text

data["Title"] = data["Title"].apply(clean_text)
data["Full_Overview"] = data["Full_Overview"].apply(clean_text)
data.head()

Unnamed: 0,Title,Popularity,Full_Overview
0,Minions,875.581305,Before Gru they had a history of bad bosses Mi...
1,Interstellar,724.247784,Mankind was born on Earth It was never meant t...
2,Deadpool,514.569956,Witness the beginning of a happy ending Deadpo...
3,Guardians of the Galaxy,481.098624,All heroes start somewhere Light years from Ea...
4,Mad Max Fury Road,434.278564,What a Lovely Day An apocalyptic story set in ...


<h4>Tokenize your input documents using SpaCy Model 'en_core_web_sm'.</h4>

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
def tokenize_text(text):
    if isinstance(text, str):
        doc = nlp(text)
        return [token.text for token in doc]
    return []

data["Tokens"] = data["Full_Overview"].apply(tokenize_text)
data.head()

Unnamed: 0,Title,Popularity,Full_Overview,Tokens
0,Minions,875.581305,Before Gru they had a history of bad bosses Mi...,"[Before, Gru, they, had, a, history, of, bad, ..."
1,Interstellar,724.247784,Mankind was born on Earth It was never meant t...,"[Mankind, was, born, on, Earth, It, was, never..."
2,Deadpool,514.569956,Witness the beginning of a happy ending Deadpo...,"[Witness, the, beginning, of, a, happy, ending..."
3,Guardians of the Galaxy,481.098624,All heroes start somewhere Light years from Ea...,"[All, heroes, start, somewhere, Light, years, ..."
4,Mad Max Fury Road,434.278564,What a Lovely Day An apocalyptic story set in ...,"[What, a, Lovely, Day, An, apocalyptic, story,..."


<h4>Remove stopwords from the "Full_Overview" column {Use SpaCy's stopwords list}</h4>

In [7]:
stopwords = nlp.Defaults.stop_words
def remove_stopwords(text):
    if isinstance(text, list):
        return [word for word in text if word.lower() not in stopwords]
    return text

data["Tokens_Cleaned"] = data["Tokens"].apply(remove_stopwords)
data.head()

Unnamed: 0,Title,Popularity,Full_Overview,Tokens,Tokens_Cleaned
0,Minions,875.581305,Before Gru they had a history of bad bosses Mi...,"[Before, Gru, they, had, a, history, of, bad, ...","[Gru, history, bad, bosses, Minions, Stuart, K..."
1,Interstellar,724.247784,Mankind was born on Earth It was never meant t...,"[Mankind, was, born, on, Earth, It, was, never...","[Mankind, born, Earth, meant, die, Interstella..."
2,Deadpool,514.569956,Witness the beginning of a happy ending Deadpo...,"[Witness, the, beginning, of, a, happy, ending...","[Witness, beginning, happy, ending, Deadpool, ..."
3,Guardians of the Galaxy,481.098624,All heroes start somewhere Light years from Ea...,"[All, heroes, start, somewhere, Light, years, ...","[heroes, start, Light, years, Earth, 26, years..."
4,Mad Max Fury Road,434.278564,What a Lovely Day An apocalyptic story set in ...,"[What, a, Lovely, Day, An, apocalyptic, story,...","[Lovely, Day, apocalyptic, story, set, furthes..."


<h4>I choose Lemmatization because...</h4>
<ol>
    <li><b>More Meaningful Representation</b></li>
    <li><b>Improves document similarity calculations</b></li>
    <li><b>Improves Model efficiency by reducing the number of unique words.</b></li>
</ol>

In [8]:
def lemmatize_text(tokens):
    if isinstance(tokens, list):
        return [token.lemma_ for token in nlp(" ".join(tokens))]
    return tokens

data["Tokens_Lemmatized"] = data["Tokens_Cleaned"].apply(lemmatize_text)
data.head()

Unnamed: 0,Title,Popularity,Full_Overview,Tokens,Tokens_Cleaned,Tokens_Lemmatized
0,Minions,875.581305,Before Gru they had a history of bad bosses Mi...,"[Before, Gru, they, had, a, history, of, bad, ...","[Gru, history, bad, bosses, Minions, Stuart, K...","[gru, history, bad, boss, minion, Stuart, Kevi..."
1,Interstellar,724.247784,Mankind was born on Earth It was never meant t...,"[Mankind, was, born, on, Earth, It, was, never...","[Mankind, born, Earth, meant, die, Interstella...","[mankind, bear, Earth, mean, die, Interstellar..."
2,Deadpool,514.569956,Witness the beginning of a happy ending Deadpo...,"[Witness, the, beginning, of, a, happy, ending...","[Witness, beginning, happy, ending, Deadpool, ...","[Witness, begin, happy, end, Deadpool, tell, o..."
3,Guardians of the Galaxy,481.098624,All heroes start somewhere Light years from Ea...,"[All, heroes, start, somewhere, Light, years, ...","[heroes, start, Light, years, Earth, 26, years...","[hero, start, light, year, Earth, 26, year, ab..."
4,Mad Max Fury Road,434.278564,What a Lovely Day An apocalyptic story set in ...,"[What, a, Lovely, Day, An, apocalyptic, story,...","[Lovely, Day, apocalyptic, story, set, furthes...","[lovely, Day, apocalyptic, story, set, furth, ..."


<h1><center>Task 2 - Similarity using Sparse Vectors</center></h1>

<h4>Main objective of this task:</h4>
<ol>
    <li>Create TF-IDF representation from "Full_Overview".</li>
    <li>Each row under "Full_Overview" considered as document.</li>
</ol>

In [9]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

<h4>Perform TF-IDF sparse vectorization</h4>

In [10]:
# Fill NaN values in "Full_Overview" with an empty string
data["Full_Overview"] = data["Full_Overview"].fillna("")

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Apply TF-IDF transformation on "Full_Overview"
tfidf_matrix = tfidf_vectorizer.fit_transform(data["Full_Overview"])

# Get feature names (words in vocabulary)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF sparse matrix into a DataFrame for visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Display first few rows
print(tfidf_df.head())

# Confirm completion
print("TF-IDF representation saved as 'tfidf_representation.csv'.")


    00  000  007  050506   10  100  1000  10000  100000  1000000  ...  zula  \
0  0.0  0.0  0.0     0.0  0.0  0.0   0.0    0.0     0.0      0.0  ...   0.0   
1  0.0  0.0  0.0     0.0  0.0  0.0   0.0    0.0     0.0      0.0  ...   0.0   
2  0.0  0.0  0.0     0.0  0.0  0.0   0.0    0.0     0.0      0.0  ...   0.0   
3  0.0  0.0  0.0     0.0  0.0  0.0   0.0    0.0     0.0      0.0  ...   0.0   
4  0.0  0.0  0.0     0.0  0.0  0.0   0.0    0.0     0.0      0.0  ...   0.0   

   zuzu  zwei  zyklon  æon  éloigne  émigré  única  übertarget  最后的舞者  
0   0.0   0.0     0.0  0.0      0.0     0.0    0.0         0.0    0.0  
1   0.0   0.0     0.0  0.0      0.0     0.0    0.0         0.0    0.0  
2   0.0   0.0     0.0  0.0      0.0     0.0    0.0         0.0    0.0  
3   0.0   0.0     0.0  0.0      0.0     0.0    0.0         0.0    0.0  
4   0.0   0.0     0.0  0.0      0.0     0.0    0.0         0.0    0.0  

[5 rows x 24272 columns]
TF-IDF representation saved as 'tfidf_representation.csv'.


<h4>Function to calculate document similarities</h4>
<h5>Finds the top N most similar movies based on cosine similarity.</h5>
<h5><b>Parameters:</b></h5>
<ol>
    <li>query_index: Index of the query document in the dataset.</li>
    <li>tfidf_matrix: TF-IDF representation of all documents.</li>
    <li>data: DataFrame containing movie titles and popularity.</li>
    <li>top_n: Number of top similar movies to return.</li>
</ol>
<h5><b>Returns:</b></h5>
<h5>DataFrame with top N most similar movies sorted by popularity.</h5>

In [11]:
def find_similar_movies(query_index, tfidf_matrix, data, top_n=5):
    # Compute cosine similarity between the query movie and all other movies
    cosine_similarities = cosine_similarity(tfidf_matrix[query_index], tfidf_matrix).flatten()

    # Get indices of top similar movies (excluding itself)
    similar_indices = np.argsort(cosine_similarities)[::-1][1:top_n+1]

    # Retrieve similar movies with their scores
    similar_movies = data.iloc[similar_indices][["Title", "Popularity"]].copy()
    similar_movies["Similarity_Score"] = cosine_similarities[similar_indices]

    # Sort by popularity score in descending order
    similar_movies = similar_movies.sort_values(by="Popularity", ascending=False)

    return similar_movies

# Query movie titles
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]

# Find indices of the query movies in the dataset
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Compute and display the most similar movies for each query movie
for query_index, movie in zip(query_indices, query_movies):
    print(f"\nTop 5 most similar movies to '{movie}':")
    print(find_similar_movies(query_index, tfidf_matrix, data))



Top 5 most similar movies to 'Taken':
               Title  Popularity  Similarity_Score
420   Men in Black 3   52.035179          0.111708
1193       The Sting   28.500913          0.108671
1204           Crash   28.223163          0.137383
1741          Clerks   19.748658          0.124037
1808      Timecrimes   19.029650          0.107731

Top 5 most similar movies to 'Pulp Fiction':
                    Title  Popularity  Similarity_Score
58          Batman Begins  115.040024          0.157471
288   The Incredible Hulk   62.898336          0.149320
912                  Hulk   34.981698          0.146552
3120     My Name Is Bruce    7.559100          0.172524
4411            Road Hard    0.859014          0.237082

Top 5 most similar movies to 'Mad Max':
                         Title  Popularity  Similarity_Score
1092               Drive Angry   30.387148          0.172601
1194               Bad Teacher   28.497242          0.161232
1386  The Transporter Refueled   25.002715       

<h4>Questions for Task 2</h4>

<h4>1. What is the vector size of TF-IDF vectors?</h4>

In [12]:
# Get the size (dimensions) of the TF-IDF vectors
num_documents, num_features = tfidf_matrix.shape

# Display the vector size
print(f"Vector Size of TF-IDF: {num_documents} documents × {num_features} unique words")


Vector Size of TF-IDF: 4800 documents × 24272 unique words


<h4>2. What is the vocabulary size of TF-IDF vectors?</h4>

In [13]:
# Get the vocabulary size of TF-IDF vectors
vocabulary_size = len(tfidf_vectorizer.get_feature_names_out())

# Display the vocabulary size
print(f"TF-IDF Vocabulary Size: {vocabulary_size} unique words")


TF-IDF Vocabulary Size: 24272 unique words


<h4>3. For each query and top 5 recommended movies, read the full overviews and state whether or not you agree with TF-IDF-based recommender system. Note that you do not need to read all the 4800+ overviews to come up with your answer, just judge whether the top 5 picks for each query movie are fair.</h4>

In [14]:
# Function to get movie overviews for the query and top recommended movies
def evaluate_recommendations(query_index, tfidf_matrix, data):
    # Find the top 5 most similar movies using the existing function
    similar_movies = find_similar_movies(query_index, tfidf_matrix, data)
    
    # Get the Full_Overview of the query movie
    query_movie = data.iloc[query_index]
    query_title = query_movie["Title"]
    query_overview = query_movie["Full_Overview"]
    
    # Print the query movie details
    print(f"\nQuery Movie: {query_title}")
    print(f"Overview: {query_overview}\n")
    
    # Print the top 5 recommended movies with their overviews
    print("Top 5 Recommended Movies:")
    for index, row in similar_movies.iterrows():
        print(f"\nTitle: {row['Title']}")
        print(f"Overview: {data[data['Title'] == row['Title']]['Full_Overview'].values[0]}")
        print(f"Similarity Score: {row['Similarity_Score']:.4f}")

# Query movie titles
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]

# Find indices of query movies
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Evaluate recommendations for each query movie
for query_index in query_indices:
    evaluate_recommendations(query_index, tfidf_matrix, data)



Query Movie: Pulp Fiction
Overview: Just because you are a character doesnt mean you have character A burgerloving hit man his philosophical partner a drugaddled gangsters moll and a washedup boxer converge in this sprawling comedic crime caper Their adventures unfurl in three stories that ingeniously trip back and forth in time

Top 5 Recommended Movies:

Title: Men in Black 3
Overview: They are back in time Agents J Will Smith and K Tommy Lee Jones are backin time J has seen some inexplicable things in his 15 years with the Men in Black but nothing not even aliens perplexes him as much as his wry reticent partner But when Ks life and the fate of the planet are put at stake Agent J will have to travel back in time to put things right J discovers that there are secrets to the universe that K never told him secrets that will reveal themselves as he teams up with the young Agent K Josh Brolin to save his partner the agency and the future of humankind
Similarity Score: 0.1117

Title: The

<h1><center>Task 3: Similarity using Dense Vectors</br></br>3.1 Training Dense Vectors</center></h1>

In [15]:
import gensim
print(gensim.__version__)


4.3.3


In [16]:
import numpy as np
import pandas as pd
import spacy
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

<h4>Function to tokenize text</h4>

In [17]:
def tokenize_text(text):
    if isinstance(text, str):
        doc = nlp(text)
        return [token.text.lower() for token in doc if not token.is_punct and not token.is_space]
    return []

# Apply tokenization to "Full_Overview"
data["Tokens"] = data["Full_Overview"].apply(tokenize_text)


<h4>Experiment with vector sizes between 150 and 300, window size between 7 and 13 and even for iterations between 10 and 20</br>getting best model</h4>

In [18]:
vector_sizes = [150, 200, 250, 300]
best_vector_size = None
best_model = None
best_loss = float('inf')

# Experiment with window sizes between 7 and 13
window_size = 10  # Chosen within range [7, 13]

# Set the training iterations
iterations = 15  # Chosen within range [10, 20]

# Train models with different vector sizes and select the best
for size in vector_sizes:
    print(f"Training Word2Vec with vector size: {size}")
    
    # Train Word2Vec using Skip-gram model (sg=1), min_count=1, and chosen parameters
    model = Word2Vec(sentences=data["Tokens"], vector_size=size, window=window_size, min_count=1, sg=1, epochs=iterations)
    
    # Calculate loss (proxy for model quality)
    loss = model.get_latest_training_loss()
    
    # Select the best model with the lowest training loss
    if loss < best_loss:
        best_loss = loss
        best_vector_size = size
        best_model = model

# Save the best model
best_model.save("word2vec_best.model")
print(f"Best Vector Size: {best_vector_size}")


Training Word2Vec with vector size: 150
Training Word2Vec with vector size: 200
Training Word2Vec with vector size: 250
Training Word2Vec with vector size: 300
Best Vector Size: 150


In [19]:
# Load the best model
best_model = Word2Vec.load("word2vec_best.model")

# Test: Find similar words to "hero"
print(best_model.wv.most_similar("hero"))

# Test: Find similar words to "revenge"
print(best_model.wv.most_similar("revenge"))


[('highlands', 0.6263734698295593), ('bodeen', 0.6096665859222412), ('republican', 0.6064697504043579), ('warrior', 0.6024805903434753), ('skywalker', 0.5977209806442261), ('algren', 0.5936062932014465), ('1966', 0.5932949185371399), ('armor', 0.5910802483558655), ('odins', 0.5894972681999207), ('theseus', 0.5887970924377441)]
[('exact', 0.6946643590927124), ('diablo', 0.6782649755477905), ('vetter', 0.6257968544960022), ('intended', 0.6104127168655396), ('carry', 0.605866551399231), ('rampage', 0.6045740842819214), ('munro', 0.598985493183136), ('redemption', 0.5938963890075684), ('seeks', 0.5909680724143982), ('ling', 0.5839093327522278)]


<h4>Finds similar movies using Word2Vec and cosine similarity.</h4> <h5><b>Parameters:</b></h5> <ol> <li><b>query_index:</b> Index of the query movie.</li> <li><b>data:</b> DataFrame with movie details.</li> <li><b>model:</b> Trained Word2Vec model.</li> </ol> <h5><b>Returns:</b> DataFrame with top similar movies.</h5>

In [21]:
def find_similar_movies_dense(query_index, data, model, top_n=5):
    if "Tokens" not in data.columns:
        raise ValueError("Dataset is missing the 'Tokens' column. Ensure tokenization is completed.")
    
    # Function to compute the centroid vector for a document using the trained Word2Vec model
    def get_document_vector(tokens, model):
        word_vectors = [model.wv[word] for word in tokens if word in model.wv]
        return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)
    
    # Compute document vectors for all documents (if not already computed)
    if "Dense_Vector" not in data.columns:
        data["Dense_Vector"] = data["Tokens"].apply(lambda tokens: get_document_vector(tokens, model))
    
    # Get the dense vector of the query movie
    query_vector = data.iloc[query_index]["Dense_Vector"].reshape(1, -1)
    
    # Compute cosine similarity between the query vector and all other document vectors
    document_vectors = np.stack(data["Dense_Vector"].values)
    cosine_similarities = cosine_similarity(query_vector, document_vectors).flatten()
    
    # Get indices of top similar movies (excluding itself)
    similar_indices = np.argsort(cosine_similarities)[::-1][1:top_n+1]
    
    # Retrieve similar movies with their scores
    similar_movies = data.iloc[similar_indices][["Title", "Popularity"]].copy()
    similar_movies["Similarity_Score"] = cosine_similarities[similar_indices]
    
    # Sort by popularity score in descending order
    return similar_movies.sort_values(by="Popularity", ascending=False)

# Query movie titles
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]

# Find indices of the query movies in the dataset
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Compute and display the most similar movies for each query movie
for query_index, movie in zip(query_indices, query_movies):
    print(f"\nTop 5 most similar movies to '{movie}':")
    print(find_similar_movies_dense(query_index, data, model))



Top 5 most similar movies to 'Taken':
                 Title  Popularity  Similarity_Score
2114     Sliding Doors   15.639016          0.919980
3254  Virgin Territory    6.760922          0.921518
3513       Big Trouble    5.201688          0.923634
4392        The Hammer    0.957022          0.926005
4801        Alien Zone    0.000372          0.919205

Top 5 most similar movies to 'Pulp Fiction':
                                        Title  Popularity  Similarity_Score
1529                              The Gambler   22.622453          0.947449
3536                        Copying Beethoven    5.062687          0.944743
4385  The Rocket The Legend of Rocket Richard    0.983484          0.945835
4457                          Out of the Blue    0.679351          0.943115
4600                           Roadside Romeo    0.253595          0.941664

Top 5 most similar movies to 'Mad Max':
                                    Title  Popularity  Similarity_Score
469                         

<h1><center>3.2 Using pretrained dense vectors</center></h1>

In [22]:
import gensim.downloader as api
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

<h4>Load the pretrained Google News Word2Vec model (300-dimensional vectors)</h4>

In [23]:
print("Loading Pretrained Word2Vec Model...")
pretrained_model = api.load("word2vec-google-news-300")
print("Model Loaded Successfully!")


Loading Pretrained Word2Vec Model...
Model Loaded Successfully!


<h4>Finds similar movies using pretrained Word2Vec embeddings and cosine similarity.</h4>
    
<h5><b>Parameters:</b></h5>
    <ol>
        <li><b>query_index:</b> Index of the query movie.</li>
        <li><b>data:</b> DataFrame with movie details.</li>
        <li><b>model:</b> Pretrained Word2Vec model.</li>
    </ol>
    
<h5><b>Returns:</b> DataFrame with top similar movies.</h5>


In [24]:
def find_similar_movies_pretrained(query_index, data, model, top_n=5):
 
    if "Tokens" not in data.columns:
        raise ValueError("Dataset is missing the 'Tokens' column. Ensure tokenization is completed.")
    
    # Function to compute the centroid vector for a document
    def get_pretrained_document_vector(tokens, model):
        word_vectors = [model[word] for word in tokens if word in model]
        return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)
    
    # Compute document vectors for all documents (if not already computed)
    if "Pretrained_Dense_Vector" not in data.columns:
        data["Pretrained_Dense_Vector"] = data["Tokens"].apply(lambda tokens: get_pretrained_document_vector(tokens, model))
    
    # Get the dense vector of the query movie
    query_vector = data.iloc[query_index]["Pretrained_Dense_Vector"].reshape(1, -1)
    
    # Compute cosine similarity between the query vector and all other document vectors
    document_vectors = np.stack(data["Pretrained_Dense_Vector"].values)
    cosine_similarities = cosine_similarity(query_vector, document_vectors).flatten()
    
    # Get indices of top similar movies (excluding itself)
    similar_indices = np.argsort(cosine_similarities)[::-1][1:top_n+1]
    
    # Retrieve similar movies with their scores
    similar_movies = data.iloc[similar_indices][["Title", "Popularity"]].copy()
    similar_movies["Similarity_Score"] = cosine_similarities[similar_indices]
    
    # Sort by popularity score in descending order
    return similar_movies.sort_values(by="Popularity", ascending=False)

# Query movie titles
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]

# Find indices of the query movies in the dataset
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Compute and display the most similar movies for each query movie using pretrained embeddings
for query_index, movie in zip(query_indices, query_movies):
    print(f"\nTop 5 most similar movies to '{movie}' using Pretrained Word2Vec:")
    print(find_similar_movies_pretrained(query_index, data, pretrained_model))



Top 5 most similar movies to 'Taken' using Pretrained Word2Vec:
                             Title  Popularity  Similarity_Score
1138                  Analyze This   29.415229          0.854923
2114                 Sliding Doors   15.639016          0.846045
3254              Virgin Territory    6.760922          0.847204
3466                  Soul Kitchen    5.461487          0.848808
4616  The Ghastly Love of Johnny X    0.209475          0.870943

Top 5 most similar movies to 'Pulp Fiction' using Pretrained Word2Vec:
            Title  Popularity  Similarity_Score
490      Superman   48.507081          0.857545
904   SpiderMan 2   35.149586          0.861984
1532   Hellraiser   22.583834          0.859091
3773       Krrish    3.759988          0.856107
4223  Saint Ralph    1.688495          0.853940

Top 5 most similar movies to 'Mad Max' using Pretrained Word2Vec:
                               Title  Popularity  Similarity_Score
875                        Homefront   35.737655   

<h4>1. For each query and top 5 recommended movies, read the full overviews and state whether or not you agree with Word2Vec-based recommender system. Note that you do not need to read all the 4800+ overviews to come up with your answer, just judge whether the top 5 picks for each query movie are fair.</h4>

<h4># Function to print Full_Overview for the query and top 5 recommended movies</h4>
<h5>Prints the Full_Overview for the query movie and the top 5 recommended movies.</h5>
<h5><b<>Parameters:</b></h5>
<ol>
    <li><b>query_index: </b>Index of the query movie.</li>
    <li><b>data: </b>DataFrame containing Title, Full_Overview, and popularity.</li>
</ol>

In [26]:
def evaluate_word2vec_recommendations(query_index, data, model):
    # Get query movie details
    query_movie = data.iloc[query_index]
    query_title = query_movie["Title"]
    query_overview = query_movie["Full_Overview"]
    
    # Print query movie overview
    print(f"\nQuery Movie: {query_title}")
    print(f"Overview: {query_overview}\n")
    
    # Get top 5 recommended movies (Pass model as an argument)
    similar_movies = find_similar_movies_pretrained(query_index, data, model)
    
    # Print recommended movies and their overviews
    print("Top 5 Recommended Movies:")
    for index, row in similar_movies.iterrows():
        recommended_title = row["Title"]
        recommended_overview = data[data["Title"] == recommended_title]["Full_Overview"].values[0]
        print(f"\nTitle: {recommended_title}")
        print(f"Overview: {recommended_overview}")
        print(f"Similarity Score: {row['Similarity_Score']:.4f}")

# Query movie titles
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]

# Find indices of query movies
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Evaluate recommendations for each query movie (Pass pretrained_model)
for query_index in query_indices:
    evaluate_word2vec_recommendations(query_index, data, pretrained_model)



Query Movie: Pulp Fiction
Overview: Just because you are a character doesnt mean you have character A burgerloving hit man his philosophical partner a drugaddled gangsters moll and a washedup boxer converge in this sprawling comedic crime caper Their adventures unfurl in three stories that ingeniously trip back and forth in time

Top 5 Recommended Movies:

Title: Analyze This
Overview: New Yorks most powerful gangster is about to get in touch with his feelings YOU try telling him his 50 minutes are up Countless wiseguy films are spoofed in this film that centers on the neuroses and angst of a powerful Mafia racketeer who suffers from panic attacks When Paul Vitti needs help dealing with his role in the family unlucky shrink Dr Ben Sobel is given just days to resolve Vittis emotional crisis and turn him into a happy welladjusted gangster
Similarity Score: 0.8549

Title: Sliding Doors
Overview: What if one split second sent your life in two completely different directions Gwyneth Paltro

<h4>2. Comment on training Word2Vec vectors versus using pretrained ones. Which one worked better? Do you recommend training dense vectors for a task like this or is it better to use pretrained vectors?</h4>

In [29]:

# Compute Dense Vectors for Each Document
data["Trained_Dense_Vector"] = data["Tokens"].apply(lambda tokens: get_document_vector(tokens, trained_model))
data["Pretrained_Dense_Vector"] = data["Tokens"].apply(lambda tokens: get_pretrained_document_vector(tokens, pretrained_model))

# Function to Find Similar Movies Using Any Vector Representation
def find_similar_movies(query_index, vectors, data, top_n=5):
    """Finds top N most similar movies using cosine similarity."""
    query_vector = vectors[query_index].reshape(1, -1)
    cosine_similarities = cosine_similarity(query_vector, vectors).flatten()

    # Get indices of top similar movies (excluding itself)
    similar_indices = np.argsort(cosine_similarities)[::-1][1:top_n+1]

    # Retrieve similar movies with their scores
    similar_movies = data.iloc[similar_indices][["Title", "Popularity"]].copy()
    similar_movies["Similarity_Score"] = cosine_similarities[similar_indices]

    # Sort by popularity score in descending order
    similar_movies = similar_movies.sort_values(by="Popularity", ascending=False)
    return similar_movies

# Query Movies
query_movies = ["Taken", "Pulp Fiction", "Mad Max", "Rain Man", "Bruce Almighty"]
query_indices = data[data["Title"].isin(query_movies)].index.tolist()

# Compute and Compare Results for Trained vs. Pretrained Word2Vec
for query_index, movie in zip(query_indices, query_movies):
    print(f"\nTop 5 most similar movies to '{movie}':")

    # Trained Word2Vec Recommendations
    print("\nUsing Trained Word2Vec:")
    print(find_similar_movies(query_index, np.stack(data["Trained_Dense_Vector"].values), data))

    # Pretrained Word2Vec Recommendations
    print("\nUsing Pretrained Word2Vec:")
    print(find_similar_movies(query_index, np.stack(data["Pretrained_Dense_Vector"].values), data))


Loading Pretrained Word2Vec Model...
Pretrained Model Loaded Successfully!

Top 5 most similar movies to 'Taken':

Using Trained Word2Vec:
               Title  Popularity  Similarity_Score
2114   Sliding Doors   15.639016          0.914470
2588        Repo Man   11.353440          0.919353
3081  Country Strong    7.809701          0.918699
3612  Get on the Bus    4.623059          0.916312
4798      Alien Zone    0.000372          0.918360

Using Pretrained Word2Vec:
                             Title  Popularity  Similarity_Score
1138                  Analyze This   29.415229          0.840227
2640                    Salton Sea   10.991281          0.842788
2922                  The Visitors    8.893676          0.834058
2980                      Defendor    8.453420          0.840445
4614  The Ghastly Love of Johnny X    0.209475          0.856451

Top 5 most similar movies to 'Pulp Fiction':

Using Trained Word2Vec:
                                        Title  Popularity  Similar

<h4>3. Based on the results so far, rank sparse vectorization, task-specific training of dense vectors, and using pretrained dense word vectors for the document similarity task given in this assignment. Comment on this ranking. Is the result what you expected?</h4>

In [32]:
# Creating a ranking dataframe
ranking_data = {
    "Vectorization Method": [
        "Sparse Vectorization (TF-IDF)",
        "Task-Specific Trained Word2Vec",
        "Pretrained Word2Vec (Google News 300)"
    ],
    "Rank": [3, 2, 1],  # 1 is best, 3 is worst
    "Comments": [
        "TF-IDF is limited to exact keyword matches; does not capture semantic meaning.",
        "Trained Word2Vec learns dataset-specific representations but may struggle with small datasets.",
        "Pretrained Word2Vec captures deep semantic relationships and works well even on small datasets."
    ],
    "Expected Outcome?": [
        "Yes, TF-IDF is expected to perform the worst due to lack of semantic understanding.",
        "Partially, trained Word2Vec should improve with a larger dataset but does well for dataset-specific terms.",
        "Yes, pretrained embeddings generalize best, providing the most meaningful recommendations."
    ]
}

# Convert to DataFrame
ranking_df = pd.DataFrame(ranking_data)

# Display ranking results
print(ranking_df)



                    Vectorization Method  Rank  \
0          Sparse Vectorization (TF-IDF)     3   
1         Task-Specific Trained Word2Vec     2   
2  Pretrained Word2Vec (Google News 300)     1   

                                            Comments  \
0  TF-IDF is limited to exact keyword matches; do...   
1  Trained Word2Vec learns dataset-specific repre...   
2  Pretrained Word2Vec captures deep semantic rel...   

                                   Expected Outcome?  
0  Yes, TF-IDF is expected to perform the worst d...  
1  Partially, trained Word2Vec should improve wit...  
2  Yes, pretrained embeddings generalize best, pr...  
