# Similarity-Based Movie Recommendation System

This notebook implements a hybrid similarity system to recommend movies.

## Overview

Our system combines **6 different similarity metrics** to find movies similar to a query movie:

1. **Info Similarity** - Based on movie description embeddings
2. **Genre Similarity** - Based on genre overlap (action, drama, etc.)
3. **Year Similarity** - Based on release year proximity
4. **Content Rating Similarity** - Based on age rating (PG, R, etc.)
5. **Review Style Similarity** - Based on how critics write about the movie
6. **Review Type Similarity** - Based on fresh/rotten patterns across critics

## Key Technologies

- **Sentence Transformers** - Converts text to 384-dimensional embeddings
- **Cosine Similarity** - Measures similarity between vectors
- **Hybrid Scoring** - Weighted combination of all metrics

## Process

1. Load and group data by movie
2. Vectorize movie data (convert to numbers)
3. Build similarity matrices
4. Combine into hybrid score
5. Query for similar movies


## Step 1: Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
sns.set_style("whitegrid")


## Step 2: Define Global Variables

These are all possible genres and age ratings in our dataset.

In [100]:
all_genres = [
    'science fiction & fantasy', 'drama', 'western', 'comedy', 'classics',
    'action & adventure', 'kids & family', 'musical & performing arts',
    'documentary', 'art house & international', 'horror', 'sports & fitness',
    'faith & spirituality', 'mystery & suspense', 'animation', 'special interest', 'romance'
]
# 'nr' (not rated) is deliberately left out so that it encodes as an all-zero (neutral) vector
all_age_ratings = ['pg', 'r', 'g', 'pg-13', 'nc17']

## Step 3: Data Loading Functions

In [None]:
def get_dataset() -> pd.DataFrame:

    dataset = ''

    final_dataset = pd.read_csv(dataset)

    return final_dataset

In [102]:
def group() -> list[dict]:

    final_dataset = get_dataset()
    
    #group movies together
    grouped = final_dataset.groupby('rotten_tomatoes_link')

    movie_data = []

    for movie_id, rows in grouped:

        #Critic data (renamed from `group` to avoid shadowing the function name)
        reviews = rows['review_content'].tolist()
        critic_names = rows['critic_name'].tolist()
        review_types = rows['review_type'].tolist()

        # Get metadata (take first row since they're all the same)
        first_row = rows.iloc[0]

        #append dict of each movie
        movie_data.append({
            'movie_id': movie_id,
            'movie_title': first_row['movie_title'],
            'content_rating': first_row['content_rating'],
            'genres': first_row['genres'],
            'year': first_row['original_release_date'],
            'movie_info': first_row['movie_info'],  # Movie description text

            #Per critic data
            'reviews': reviews,  # List of all review texts
            'critic_names': critic_names,   #List of all critic names
            'review_types': review_types, #List of all review types

        })
        
    #minimum amount of reviews
    min_reviews = 5

    movie_data_filtered = [
        movie for movie in movie_data
        if len(movie['reviews']) >= min_reviews
    ]
    
    print(f"Loaded {len(movie_data_filtered)} movies with ≥{min_reviews} reviews")
    movie_data = movie_data_filtered

    return movie_data

## Step 4: Vectorization Functions

**Goal:** Convert text and categories to numbers that we can compare
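
As a rough illustration of what "converting categories to numbers" means here, the sketch below encodes a toy genre list and rating as 0/1 vectors and scales a year into the 0-1 range. The helper names `multi_hot` and `min_max` and the tiny vocabularies are assumptions made for this example, not part of the notebook's code.

```python
def multi_hot(labels, vocabulary):
    """Return a 0/1 list marking which vocabulary entries appear in labels."""
    present = set(labels)
    return [1 if item in present else 0 for item in vocabulary]


def min_max(value, low, high):
    """Scale value into [0, 1]; fall back to 0.0 when the range is empty."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


# Toy vocabularies (hypothetical, much smaller than the real lists)
genres = ['drama', 'comedy', 'horror']
ratings = ['pg', 'r', 'g']

genre_vec = multi_hot(['horror', 'drama'], genres)   # two genres set
rating_vec = multi_hot(['r'], ratings)               # exactly one rating set
unrated_vec = multi_hot(['nr'], ratings)             # not in vocabulary: all zeros
year_norm = min_max(1986, 1900, 2020)                # fraction of the way through the range
```

A value that is missing from the vocabulary, such as `'nr'`, ends up as an all-zero vector, which is how a "not rated" movie stays neutral.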

In [103]:
#Create embeddings for content and review style
def vectorize(movie_data: list) -> list[dict]:

    #creates binary list of genres 
    def encode_genres(movie_genres:list,all_genres:list):
        return [1 if genre in movie_genres else 0 for genre in all_genres]
    
    #creates binary list of ratings
    def encode_content_rating(content_rating,all_content_ratings):
        return [1 if cr == content_rating else 0 for cr in all_content_ratings]
    
    def encode_review_type(review_types):
        return [1 if rt == "fresh" else 0 for rt in review_types]
       
    print("Initializing sentence transformer...")

    model = SentenceTransformer('all-MiniLM-L6-v2')

    #normalize year range
    year = [movie['year'] for movie in movie_data if movie['year'] > 0]
    min_year = min(year)
    max_year = max(year)

    total_movies = len(movie_data)

    for i, movie in enumerate(movie_data, start=1):

        # Print progress
        print(f"Processing movie {i}/{total_movies} ({i/total_movies*100:.1f}%) - {movie['movie_title']}")

        # Encode the movie description for this movie
        movie['movie_info_embeddings'] = model.encode(movie['movie_info'])

        #Content rating
        movie['content_rating_norm'] = encode_content_rating(movie['content_rating'],all_age_ratings)

        #Genres
        movie['genre_vector'] = encode_genres(movie['genres'], all_genres)

        #Year
        movie['year_norm'] = (movie['year'] - min_year) / (max_year - min_year)

        # Encode all reviews for this movie
        review_embeddings = model.encode(movie['reviews'])# Shape: (num_reviews, 384)
        #Average to get single movie embedding
        movie['avg_review_embeddings'] = review_embeddings.mean(axis=0)

        #review type
        movie['review_types_norm'] = encode_review_type(movie['review_types'])
       
    vectorized_movie_data = movie_data

    return vectorized_movie_data

Check whether the vectorized movie data pickle exists. If it does, load it; otherwise generate the data, save it, and return it.

**Caching:** Results saved to `../cache/vectorized_movie_data.pkl` so we don't recompute (takes ~10-20 minutes first time)
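
The load-or-compute pattern used for caching can be sketched in isolation. This is a minimal version under assumptions: `load_or_compute` and `expensive` are hypothetical names, and a temporary directory stands in for `../cache/`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at path, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the cache directory if needed before writing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    """Stand-in for a slow step; records each time it actually runs."""
    calls.append(1)
    return {"answer": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache", "demo.pkl")
    first = load_or_compute(cache_path, expensive)    # computes and saves
    second = load_or_compute(cache_path, expensive)   # loads from disk
```

One consequence of this pattern is that a stale cache is loaded silently, so the pickle file should be deleted whenever the vectorization code or the dataset changes.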

In [104]:
def load_or_create_vectorized_data(pickle_path="../cache/vectorized_movie_data.pkl") -> list[dict]:
    if os.path.exists(pickle_path):
        print(f"Loading vectorized movie data from {pickle_path}...")
        with open(pickle_path, "rb") as f:
            vectorized_movie_data = pickle.load(f)
    else:
        print("Pickle file not found. Generating vectorized movie data...")
        grouped = group()
        vectorized_movie_data = vectorize(grouped)
        with open(pickle_path, "wb") as f:
            pickle.dump(vectorized_movie_data, f)
        print(f"Vectorized movie data saved to {pickle_path}.")
    return vectorized_movie_data


## Step 5: Similarity Matrix Functions

**Goal:** Create 6 separate similarity matrices, each capturing a different aspect

Most functions (Type is the exception: it uses Pearson correlation instead of cosine similarity):
1. Stacks all movie vectors into a matrix (one row per movie)
2. Computes pairwise cosine similarity
3. Returns matrix where `matrix[i][j]` = similarity between movie i and movie j

**Similarity matrices:**

1. **Info** - Based on movie description embeddings (semantic meaning)
2. **Genre** - Based on genre overlap (how many genres they share)
3. **Year** - Based on release year proximity (1999 vs 2000 = very similar)
4. **Content Rating** - Based on age rating match (both R-rated = similar)
5. **Style** - Based on review writing style (how critics describe them)
6. **Type** - Based on fresh/rotten patterns across critics (correlation of ratings)
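
The Type matrix compares movies through the critics they share. The sketch below shows the idea on toy data: correlate two movies' fresh/rotten values over only the critics who reviewed both, map the result from [-1, 1] to [0, 1], and use 0.5 when the correlation is undefined. The function names and the `None`-for-missing convention are assumptions of this example; the notebook itself relies on pandas for this step.

```python
import math


def pearson_on_shared(a, b):
    """Pearson correlation over positions where both lists have a value.

    Returns None with fewer than two shared values or zero variance.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def to_unit_interval(r):
    """Map a correlation in [-1, 1] to [0, 1]; undefined becomes neutral 0.5."""
    return 0.5 if r is None else (r + 1) / 2


# One entry per critic: 1 = fresh, 0 = rotten, None = did not review
movie_a = [1, 0, 1, None]
movie_b = [1, 0, 1, 1]
movie_c = [0, 1, 0, None]
movie_d = [None, None, None, 1]

agree = to_unit_interval(pearson_on_shared(movie_a, movie_b))      # critics agree
disagree = to_unit_interval(pearson_on_shared(movie_a, movie_c))   # critics disagree
unknown = to_unit_interval(pearson_on_shared(movie_a, movie_d))    # no shared critics
```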

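**Cosine Similarity:** Ranges from -1 to 1 in general, and from 0 to 1 for non-negative vectors such as the genre and rating encodings, where: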
- 1.0 = identical
- 0.5 = somewhat similar
- 0.0 = completely different
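
To make these values concrete, here is a plain-Python sketch of cosine similarity on a few toy vectors. The `cosine` helper is written for illustration only (the notebook uses scikit-learn), and returning 0.0 for an all-zero vector is a convention assumed in this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector has no direction
    return dot / (norm_u * norm_v)


same = cosine([1, 0, 1], [1, 0, 1])            # identical genre sets: close to 1
partial = cosine([1, 0, 1], [1, 1, 0])         # one of two genres shared: close to 0.5
disjoint = cosine([1, 0, 0], [0, 1, 0])        # nothing shared: 0
opposite = cosine([1.0, -2.0], [-1.0, 2.0])    # signed vectors can go negative
scalars = cosine([0.2], [0.9])                 # 1-D positive inputs: close to 1
```

The last case is worth noting: for one-dimensional positive inputs the result is always about 1 regardless of how far apart the values are, so a distance-based measure may be more informative for a single number such as a normalized year.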

**Caching:** All matrices saved to `../cache/similarity_matrices.pkl`

In [105]:
def build_info_sim(vectorized_movie_data) -> np.ndarray:

    info_vectors = np.vstack([movie['movie_info_embeddings'] for movie in vectorized_movie_data])

    #pairwise cosine similarity
    info_sim = cosine_similarity(info_vectors).astype('float32')

    print(f"Info similarity matrix: {info_sim.shape}")

    return info_sim


def build_content_rating_sim(vectorized_movie_data) -> np.ndarray:

    content_rating_vectors = np.vstack([movie['content_rating_norm'] for movie in vectorized_movie_data])

    #pairwise cosine similarity
    content_rating_sim = cosine_similarity(content_rating_vectors).astype('float32')

    print(f"Content rating similarity matrix: {content_rating_sim.shape}")

    return content_rating_sim

def build_genre_sim(vectorized_movie_data) -> np.ndarray:

    genre_vectors = np.vstack([movie['genre_vector'] for movie in vectorized_movie_data])

    #pairwise cosine similarity
    genre_sim = cosine_similarity(genre_vectors).astype('float32')

    print(f"Genre similarity matrix: {genre_sim.shape}")

    return genre_sim

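def build_year_sim(vectorized_movie_data) -> np.ndarray:

    # Column vector of normalized years, shape (num_movies, 1)
    year_vectors = np.vstack([movie['year_norm'] for movie in vectorized_movie_data])

    # Cosine similarity between positive scalars is always 1, so measure
    # proximity as 1 - |year_i - year_j| (clipped in case a value falls outside 0-1)
    year_sim = np.clip(1.0 - np.abs(year_vectors - year_vectors.T), 0.0, 1.0).astype('float32')

    print(f"Year similarity matrix: {year_sim.shape}")

    return year_sim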

#content (unused: vectorize() never sets 'content_vector', so calling this raises KeyError)
def build_content_sim(vectorized_movie_data) -> np.ndarray:

    content_vectors = np.vstack([movie['content_vector'] for movie in vectorized_movie_data])

    #pairwise cosine similarity
    content_sim = cosine_similarity(content_vectors).astype('float32')

    print(f"Content similarity matrix: {content_sim.shape}")

    return content_sim
    
#style
def build_review_style_sim(vectorized_movie_data) -> np.ndarray:
    
    review_embeddings = np.vstack([movie['avg_review_embeddings'] for movie in vectorized_movie_data])

    #pairwise cosine similarity
    review_sim = cosine_similarity(review_embeddings).astype('float32')

    print(f"Review similarity matrix: {review_sim.shape}")

    return review_sim

def build_review_type_sim(vectorized_movie_data) -> np.ndarray:

    #Get all critics
    all_critics = set()
    for movie in vectorized_movie_data:
        all_critics.update(movie['critic_names'])
    all_critics = sorted(all_critics)

    #Get all movie ids (unique per movie, whereas titles can repeat, e.g. remakes)
    all_movies = [movie['movie_id'] for movie in vectorized_movie_data]
    
    #create Movie x Critic matrix
    matrix = pd.DataFrame(np.nan, index = all_movies, columns=all_critics)

    #Fill in review scores
    for movie in vectorized_movie_data:

        mid = movie['movie_id']

        for critic,review_type in zip(movie['critic_names'], movie['review_types_norm']):
            matrix.loc[mid,critic] = review_type

    #Pearson correlation movie x movie (-1 to 1)
    type_sim = matrix.T.corr(method='pearson')

    # normalize to 0-1
    type_sim = (type_sim + 1) / 2

    #convert to np array
    type_sim_array = type_sim.values

    #fill undefined pairs (too few shared critics or zero variance) with 0.5 (neutral)
    type_sim = np.nan_to_num(type_sim_array, nan=0.5)

    print(f"Type similarity matrix: {type_sim.shape}")

    return type_sim

Check whether the similarity matrices pickle exists. If it does, load it; otherwise compute the matrices, save them, and return them.

In [106]:
def load_or_create_similarity_matrices(vectorized_data, pickle_path="../cache/similarity_matrices.pkl"):
    """Load or compute all similarity matrices"""
    
    if os.path.exists(pickle_path):
        print(f"Loading similarity matrices from {pickle_path}...")
        with open(pickle_path, "rb") as f:
            matrices = pickle.load(f)
        info_sim = matrices['info']
        content_rating_sim = matrices['content_rating']
        genre_sim = matrices['genre']
        year_sim = matrices['year']
        style_sim = matrices['style']
        type_sim = matrices['type']
    else:
        print("Pickle file not found. Computing similarity matrices...")
        info_sim = build_info_sim(vectorized_data)
        content_rating_sim = build_content_rating_sim(vectorized_data)
        genre_sim = build_genre_sim(vectorized_data)
        year_sim = build_year_sim(vectorized_data)
        style_sim = build_review_style_sim(vectorized_data)
        type_sim = build_review_type_sim(vectorized_data)
        
        print(f"Saving similarity matrices to {pickle_path}...")
        with open(pickle_path, "wb") as f:
            pickle.dump({
                'info': info_sim,
                'content_rating': content_rating_sim,
                'genre': genre_sim,
                'year': year_sim,
                'style': style_sim,
                'type': type_sim
            }, f)
    
    return info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim

## Step 6: Hybrid Scoring

**Goal:** Combine all 6 similarity matrices into one final score

**Method:** Weighted sum
- Each similarity matrix gets a weight (α, β, γ, δ, ε, ζ)
- Final similarity = weighted average of all 6 scores

**Current weights:**
- α (alpha) = 0.2 → Info similarity (movie description)
- β (beta) = 0.2 → Content rating similarity
- γ (gamma) = 0.2 → Genre similarity
- δ (delta) = 0.2 → Year similarity
- ε (epsilon) = 0.1 → Review style similarity
- ζ (zeta) = 0.1 → Review type similarity

**Total = 1.0** (weights must sum to 1)

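**You can experiment with different weights!** When you raise or zero out one weight, adjust the others so the total stays 1.0. For example:
- Emphasize genre: `gamma = 0.4` (and lower the other weights by 0.2 in total)
- Ignore year: `delta = 0.0` (and redistribute its 0.2 among the other weights)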
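
The weighted sum can be checked by hand on a single pair of movies. In the sketch below, scalar scores stand in for one entry of each similarity matrix. The score values are made up, and `normalize_weights` and `combine` are hypothetical helpers, shown as one possible way to keep the weights summing to 1 after changing one of them.

```python
def normalize_weights(weights):
    """Rescale weights so they sum to 1, keeping their relative sizes."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def combine(scores, weights):
    """Weighted sum of per-metric similarity scores."""
    return sum(weights[name] * scores[name] for name in weights)


# Made-up similarity scores for one pair of movies
scores = {'info': 0.6, 'content_rating': 1.0, 'genre': 0.5,
          'year': 0.9, 'style': 0.7, 'type': 0.5}

# The weights listed above
default = {'info': 0.2, 'content_rating': 0.2, 'genre': 0.2,
           'year': 0.2, 'style': 0.1, 'type': 0.1}

default_score = combine(scores, default)

# Emphasize genre, then rescale so the total is 1 again
genre_heavy = normalize_weights({**default, 'genre': 0.4})
genre_heavy_score = combine(scores, genre_heavy)
```

Because `hybrid_score` only multiplies and adds, the same arithmetic applies element by element when the inputs are full matrices.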

In [107]:
def hybrid_score(info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim, alpha, beta, gamma, delta, epsilon, zeta):
    
    hybrid_sim = (alpha * info_sim) + (beta * content_rating_sim) + (gamma * genre_sim) + (delta * year_sim) + (epsilon * style_sim) + (zeta * type_sim)

    return hybrid_sim

## Step 7: Query Function

**Goal:** Find movies most similar to a query movie

**Process:**
1. Find the query movie in our dataset (by title)
2. Get its row from the hybrid similarity matrix
3. Sort all movies by similarity score (highest first)
4. Return top k movies (excluding the query itself)

**Parameters:**
- `movie_title` - Name of movie to find similar to (lowercase)
- `k` - Number of recommendations to return (default 10)

**Output:** Prints a formatted table of recommendations with scores
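
The ranking step can be sketched without NumPy on a toy similarity row. The titles and scores below are placeholders, and `top_k_similar` is a hypothetical helper. It drops the query by index rather than assuming the query always sorts first, which matters when another movie ties with it.

```python
def top_k_similar(sim_row, query_index, k):
    """Return the k (index, score) pairs with the highest score, skipping the query."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_index]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Placeholder data: row 0 of a hybrid similarity matrix
titles = ['movie_a', 'movie_b', 'movie_c', 'movie_d']
row = [1.0, 0.83, 0.80, 0.12]

top = top_k_similar(row, query_index=0, k=2)
names = [titles[i] for i, _ in top]
```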

In [108]:
def query_movie(movie_title,vectorized_movie_data,hybrid_sim,k=10):

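    #Find index of query movie
    query_id = None
    for i,movie in enumerate(vectorized_movie_data):
        if movie['movie_title'] == movie_title:
            query_id = i
            break

    #Indexing with None would add an axis instead of failing, so stop early
    if query_id is None:
        raise ValueError(f"Movie not found: {movie_title!r}")
   
    #row of hybrid sim scores sorted (highest first), skipping the query itself
    sim = hybrid_sim[query_id]
    sorted_indices = [pos for pos in np.argsort(sim)[::-1] if pos != query_id]

    recommendations = []

    for pos in sorted_indices[:k]:
        recommendations.append((vectorized_movie_data[pos]['movie_title'], sim[pos]))

    #Print formatted table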
    max_len = max(len(movie) for movie, _ in recommendations)
    print()
    print(f"Query movie: {movie_title}")
    print()
    print("MOVIE".ljust(max_len)  +  "  SCORE")

    for m,s in recommendations:
        score = round(s,2)
        print(f'{m:{max_len}}  {score}')
    
    return recommendations

## Step 8: Load All Data and Compute Similarities

**This cell does everything:**

1. **Load vectorized data** (or create if first time)
   - If cached: loads in ~10 seconds
   - If not cached: takes ~10-30 minutes to compute

2. **Load similarity matrices** (or create if first time)
   - If cached: loads in ~5 seconds
   - If not cached: takes ~2-5 minutes to compute

3. **Compute hybrid similarity**
   - Combines all 6 matrices with weights
   - Takes ~1 second

**After this runs, the system is ready to query!**

In [109]:
# Load or create vectorized data
vectorized_data = load_or_create_vectorized_data('../cache/vectorized_movie_data.pkl')

# Load or create similarity matrices  
info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim = load_or_create_similarity_matrices(
    vectorized_data, 
    '../cache/similarity_matrices.pkl'
)

# Compute hybrid similarity
hybrid_sim = hybrid_score(info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim, 
                          0.2, 0.2, 0.2, 0.2, 0.1, 0.1)

print("System ready")

Loading vectorized movie data from ../cache/vectorized_movie_data.pkl...
Loading similarity matrices from ../cache/similarity_matrices.pkl...
System ready



## Step 9: Query for Similar Movies!

Change `"aliens"` to any movie title in the dataset (lowercase).
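
Since the lookup is an exact match on the lowercase title, it can help to normalize user input before querying. The sketch below is one possible approach using the standard library's `difflib`. `resolve_title` is a hypothetical helper, not part of the notebook, and the short title list stands in for the titles in `vectorized_data`.

```python
import difflib


def resolve_title(user_input, known_titles, cutoff=0.6):
    """Return an exact lowercase match, else the closest known title, else None."""
    cleaned = user_input.strip().lower()
    if cleaned in known_titles:
        return cleaned
    matches = difflib.get_close_matches(cleaned, known_titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Stand-in for [movie['movie_title'] for movie in vectorized_data]
known = ['aliens', 'alien3', 'pandorum', 'doom']

exact = resolve_title('  Aliens ', known)    # whitespace and case are ignored
fuzzy = resolve_title('pandorm', known)      # small typo resolves to the nearest title
missing = resolve_title('zzzz', known)       # nothing close enough
```

The resolved title could then be passed to `query_movie`, with a `None` result handled before the call.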

In [110]:
# Query Example

query_movie("aliens", vectorized_data, hybrid_sim, k=10);


Query movie: aliens

MOVIE                                 SCORE
alien3                                0.83
lifeforce                             0.82
virus                                 0.82
alien resurrection                    0.81
contamination                         0.8
underworld: awakening                 0.8
doom                                  0.8
pandorum                              0.8
aliens vs. predator: requiem (avp 2)  0.8
deepstar six                          0.8
