# Movie Recommender System
A basic content-based movie recommender system. Given a movie, the system recommends the N most similar movies that you might like.

In the future I would like to use the ratings & user data to make an item-based collaborative filtering recommender system.

The dataset used is the MovieLens latest dataset (small version), which can be found here: https://grouplens.org/datasets/movielens/#:~:text=recommended%20for%20education%20and%20development

## Text Preprocessing
1. We will **combine the textual columns into a single text feature:** the movie title, genres, and tags.
2. Then all **text will be converted to lowercase** to ensure uniformity.
3. **Punctuation and special characters will be removed** to further reduce unnecessary variance.
4. **The text will be tokenized** by splitting on whitespace, which happens during vectorization. No explicit stop-word list is applied; words that are missing from the Word2Vec vocabulary are simply skipped.
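The cleaning steps above can be tried on a single string before running them on the whole dataframe. The following is a minimal sketch, not the notebook's pipeline: `normalize_text` and the tiny `STOP_WORDS` set are hypothetical names introduced here, and the stop-word filter is shown only to illustrate where such a step could be added.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def normalize_text(text, stop_words=frozenset()):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Replace each run of punctuation, underscores, or whitespace with one space
    text = re.sub(r"[\W_]+", " ", text)
    # Whitespace tokenization
    tokens = text.split()
    # Optional stop-word filtering (a no-op with the default empty set)
    return [tok for tok in tokens if tok not in stop_words]

tokens = normalize_text("Father of the Bride Part II Comedy pregnancy", STOP_WORDS)
print(tokens)
```

With the default empty `stop_words`, the function only lowercases, removes punctuation, and splits, which mirrors what the cells below do.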

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

# Google News Word2Vec model
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Load data from movies and tags files
movies_df = pd.read_csv('ml-latest-small/movies.csv')
tags_df = pd.read_csv('ml-latest-small/tags.csv')

In [3]:
# Aggregate tags by movieId
tags_combined = tags_df.groupby('movieId')['tag'].apply(lambda x: ' '.join(set(x.astype(str)))).reset_index()

# Merge tags with movies dataframe
movies_with_tags = pd.merge(movies_df, tags_combined, on='movieId', how='left')

# Handle missing tags
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('notags')

# Replace the '|' separator between genres with spaces
movies_with_tags['genres'] = movies_with_tags['genres'].str.replace('|', ' ', regex=False)

def preprocess_title(title: str):
    # Remove leading and trailing quotation marks if they exist
    if title.startswith('"') and title.endswith('"'):
        title = title[1:-1]

    # Remove year from the title
    title = re.sub(r' \(\d{4}\)$', '', title)

    return title

# Apply the preprocess_title function to the title column
movies_with_tags['title'] = movies_with_tags['title'].apply(preprocess_title)

# print(movies_with_tags.head())

In [4]:
# Combine into a single text feature
movies_with_tags['combined_features'] = (movies_with_tags['title'] + ' ' +
                                         movies_with_tags['genres'] + ' ' +
                                         movies_with_tags['tag'])

# Convert text to lowercase
movies_with_tags['combined_features'] = movies_with_tags['combined_features'].str.lower()

# Replace hyphens, underscores, punctuation, and special characters with spaces.
# A single pattern suffices: [\W_]+ also matches hyphens and whitespace, so each
# run of such characters collapses to one space.
movies_with_tags['combined_features'] = movies_with_tags['combined_features'].str.replace(r'[\W_]+', ' ', regex=True)

# Tokenization (a whitespace split) happens during vectorization.
# No stop-word list is applied; out-of-vocabulary words are dropped there instead.

print(movies_with_tags[['combined_features']].head())

                                   combined_features
0  toy story adventure animation children comedy ...
1  jumanji adventure children fantasy fantasy gam...
2          grumpier old men comedy romance moldy old
3      waiting to exhale comedy drama romance notags
4  father of the bride part ii comedy pregnancy r...


## Word2Vec
After preprocessing the text, we can proceed to **Word2Vec.** We will use a pre-trained model to vectorize our text data: each movie is represented by the mean of the vectors of its words.

In [5]:
movies_with_tags.columns

Index(['movieId', 'title', 'genres', 'tag', 'combined_features'], dtype='object')

Download Google's pre-trained Word2Vec model here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g

In [6]:
# Load the pre-trained Word2Vec model (this example uses the Google News model)
model_path = 'GoogleNews-vectors-negative300.bin'
word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

def vectorize_text(text):
    # Tokenize the text
    words = text.split()

    # Retrieve vectors for each word and ignore out-of-vocabulary words
    vectors = [word2vec_model[word] for word in words if word in word2vec_model]

    # If no words in the text are in the model, return a zero vector
    if not vectors:
        return np.zeros(word2vec_model.vector_size)

    # Aggregate word vectors using mean
    return np.mean(vectors, axis=0)

# Apply vectorization to your combined text features
movies_with_tags['vector'] = movies_with_tags['combined_features'].apply(vectorize_text)

# Now, 'movies_with_tags' contains a 'vector' column with Word2Vec representations
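
The averaging step can be checked without downloading the full model. Below is a minimal sketch that assumes a hypothetical dictionary of 3-dimensional toy vectors (`toy_vectors`) in place of the pre-trained embeddings; `mean_pool` is an illustrative name, not part of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the pre-trained model
toy_vectors = {
    "toy": np.array([1.0, 0.0, 0.0]),
    "story": np.array([0.0, 1.0, 0.0]),
    "comedy": np.array([0.0, 0.0, 1.0]),
}

def mean_pool(text, vectors, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# 'pixar' is not in the toy vocabulary, so it is skipped
print(mean_pool("toy story pixar", toy_vectors))
# No known words at all: falls back to the zero vector
print(mean_pool("pixar", toy_vectors))
```

Mean pooling ignores word order and weights every word equally, so a movie with many tags is represented by the average of all of them.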

## Cosine Similarity
We use cosine similarity between the movie vectors to find the N movies most similar to a given movie.
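
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it depends only on direction, not magnitude. The sketch below computes the pairwise matrix with plain NumPy to make that explicit; `cosine_similarity_matrix` is a hypothetical helper, and mapping all-zero rows to a similarity of 0 is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid division by zero: all-zero rows stay zero and get similarity 0
    norms[norms == 0] = 1.0
    unit = X / norms
    return unit @ unit.T

X = np.array([
    [1.0, 0.0],   # same direction as the next row
    [2.0, 0.0],
    [0.0, 3.0],   # orthogonal to the first two rows
    [0.0, 0.0],   # zero vector, e.g. a movie with no in-vocabulary words
])
sim = cosine_similarity_matrix(X)
print(sim)
```

Rows 0 and 1 point in the same direction and score 1, row 2 is orthogonal to both and scores 0, and the zero row scores 0 against everything.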

In [7]:
# Convert the list of vectors into a matrix (assuming 'vector' column contains numpy arrays)
vector_matrix = np.vstack(movies_with_tags['vector'])

# Compute cosine similarity matrix
cosine_sim_matrix = cosine_similarity(vector_matrix)

# cosine_sim_matrix[i, j] represents the similarity score between movie i and movie j

def recommend_movies(movie_id, movies_df, cosine_sim_matrix, top_n=10):
    # Get the positional index of the movie that matches the movie_id
    # (raises IndexError if movie_id is not present in movies_df)
    movie_idx = np.flatnonzero(movies_df['movieId'].to_numpy() == movie_id)[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim_matrix[movie_idx]))

    # Drop the query movie explicitly; assuming it is always ranked first is
    # unsafe when another movie has identical features (a tie in similarity)
    sim_scores = [pair for pair in sim_scores if pair[0] != movie_idx]

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top_n most similar movies
    sim_scores = sim_scores[:top_n]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the movieIds of the top_n most similar movies
    return movies_df['movieId'].iloc[movie_indices]
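
The ranking step can also be written with NumPy instead of sorting a Python list. This is a sketch on a small made-up similarity matrix; `top_n_similar` and `toy_sim` are hypothetical names, and the function returns row positions rather than movieIds.

```python
import numpy as np

def top_n_similar(sim_matrix, idx, top_n=10):
    """Return positions of the top_n rows most similar to row idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float)
    # Sort positions by descending similarity; a stable sort keeps ties in row order
    order = np.argsort(-scores, kind="stable")
    # Never recommend the query item itself
    order = order[order != idx]
    return order[:top_n]

# Made-up symmetric similarity matrix for four items
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])
print(top_n_similar(toy_sim, idx=0, top_n=2))
```

For item 0 the two nearest neighbours are items 1 and 3. If `top_n` exceeds the number of other items, the function returns all of them.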

In [8]:
def get_movie_title(movie_id, movies_df):
    # Find the title for the given movie_id
    title = movies_df.loc[movies_df['movieId'] == movie_id, 'title'].iloc[0]
    return title

In [10]:
# Example usage
movie_id = 1
recommended_movie_ids = recommend_movies(movie_id=movie_id, movies_df=movies_with_tags, cosine_sim_matrix=cosine_sim_matrix, top_n=10)
# print(f"Recommended movie IDs for movie ID {movie_id}: {recommended_movie_ids.tolist()}")
recommend_titles = '\n'.join(f'[{movie_recc}] ' + get_movie_title(movie_recc, movies_with_tags) for movie_recc in recommended_movie_ids.tolist())
print(f"Recommended movies for: {get_movie_title(movie_id, movies_with_tags)}\n{recommend_titles}")

Recommended movies for: Toy Story
[78499] Toy Story 3
[2294] Antz
[65577] Tale of Despereaux, The
[166461] Moana
[3114] Toy Story 2
[53121] Shrek the Third
[84637] Gnomeo & Juliet
[45074] Wild, The
[8974] SpongeBob SquarePants Movie, The
[103755] Turbo
