In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm

## Project:
## A recommender system for movies, based only on their scripts.

Most recommendation algorithms we are used to rely on user data to decide what to recommend. So I thought it would be interesting to try to build a recommendation system that relies only on the content of the movie.

First, we import the scripts. Refer to the README to find the resources for getting this data yourself.

In [2]:
filenames = [file for file in tqdm(os.listdir("scripts"))]

movie_scripts = []

for file in filenames:
    # Read each script as text, keeping the same order as filenames
    with open(f"scripts/{file}", 'r') as f:
        movie_scripts.append(f.read())

100%|██████████| 2059/2059 [00:00<00:00, 6030776.49it/s]


Now, we're gonna use doc2vec to turn every script into a vector. As the name suggests, doc2vec maps each document to a fixed-length numerical vector (300 dimensions here) that we can use for our model.

### Memory intensive! Run at your own risk.

In [3]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


tagged_data = [TaggedDocument(words=script.split(), tags=[str(i)]) for i, script in enumerate(movie_scripts)]

model = Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=10)
model.build_vocab(tagged_data)


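# Train one epoch at a time so tqdm can show progress, lowering the learning rate by hand.
# Note: gensim can also run all epochs and decay alpha itself in a single train(..., epochs=model.epochs) call.
for epoch in tqdm(range(model.epochs), desc="Training Doc2Vec"):
    model.train(tagged_data, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # keep the learning rate constant within the next epoch


# Pair each filename with its document vector (tags were assigned in the same order as filenames)
movie_vectors = [(title, model.dv[i].tolist()) for i, title in enumerate(filenames)]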

Training Doc2Vec: 100%|██████████| 10/10 [04:14<00:00, 25.48s/it]


Now, I decided to keep two versions of the dataset: one in which every column holds a single vector component (dfscript), and another in which one column holds the entire vector (dfvectors). Each format is useful in different situations. What follows is the code for generating both. For dfscript, it also cleans up the document names by dropping the last 4 characters (meant to strip the file extension) and replacing the "-" with spaces.

In [4]:
rows = []

for title, vector in movie_vectors:
    # Drop the last 4 characters of the filename and replace "-" with spaces
    rows.append([title[:-4].replace("-", " ")] + vector)


dfscript = pd.DataFrame(rows)
dfscript.rename(columns={0: 'Title'}, inplace=True)
dfvectors = pd.DataFrame(movie_vectors)

dfscript.head()

Unnamed: 0,Title,1,2,3,4,5,6,7,8,9,...,291,292,293,294,295,296,297,298,299,300
0,No Country for Old Men,-0.028571,-0.626604,-0.426759,1.275116,-0.664674,-1.341324,-3.400897,0.028038,-1.144914,...,0.38004,0.712044,-1.997797,0.810932,2.649217,0.865844,-0.188804,-1.0562,2.869202,-2.518997
1,Jerry Maguire,-0.433639,0.544738,-2.904589,0.683909,-0.801527,-1.431894,0.144542,1.422127,1.429749,...,0.212943,1.244765,1.381701,0.259034,-0.564082,-0.617412,-1.39702,-2.012383,-0.637088,-0.622857
2,Addams Family The,-0.881256,0.899343,0.571875,-0.255316,-1.069809,-0.638983,-1.601006,-1.181492,0.871448,...,0.938385,2.192843,-0.078602,-1.737962,2.752352,1.265107,1.523025,-0.050683,1.919953,-0.686938
3,Machine Gun Preacher,0.468456,1.561777,-2.543494,-2.23648,-0.68352,2.238774,-0.783927,-0.151393,1.29502,...,-1.60003,-0.73276,0.190548,-1.395511,-0.084236,1.45326,1.515121,-2.202631,-0.54452,-1.760366
4,Things My Father Never Taught Me The,0.147514,0.087347,-0.853418,1.076034,-0.907109,0.590321,-0.213311,-0.114611,-1.113721,...,-1.368993,1.145039,0.131187,-1.088247,0.10137,0.515319,-0.18467,-0.288485,0.118238,-1.162176


In [5]:
# Replacing the "-" in the movie names with spaces, dropping the last 4 characters
# (meant to strip the file extension), and renaming the columns in the dfvectors dataframe

dfvectors[0] = dfvectors[0].str.replace("-", " ").str[:-4]
dfvectors.columns = ["Title", "Vector"]
dfvectors.head()

Unnamed: 0,Title,Vector
0,No Country for Old Men,"[-0.028570765629410744, -0.6266040205955505, -..."
1,Jerry Maguire,"[-0.43363916873931885, 0.5447375178337097, -2...."
2,Addams Family The,"[-0.8812557458877563, 0.8993426561355591, 0.57..."
3,Machine Gun Preacher,"[0.46845629811286926, 1.5617766380310059, -2.5..."
4,Things My Father Never Taught Me The,"[0.14751435816287994, 0.08734674006700516, -0...."
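
To make the difference between the two layouts concrete, here is a small sketch on toy data. The names `toy_vectors`, `df_long`, `df_wide` and `df_back` are made up for this example and the vectors have only 3 components; it just shows one way to go from the single-column layout to the one-column-per-component layout and back.

```python
import pandas as pd

# Toy stand-in for movie_vectors: (title, vector) pairs with 3 components each
toy_vectors = [("Movie A", [0.1, 0.2, 0.3]), ("Movie B", [0.4, 0.5, 0.6])]

# One column holds the whole vector (the dfvectors layout)
df_long = pd.DataFrame(toy_vectors, columns=["Title", "Vector"])

# One column per vector component (the dfscript layout)
df_wide = pd.DataFrame(df_long["Vector"].tolist())
df_wide.insert(0, "Title", df_long["Title"])

# And back again: collapse the component columns into one list per row
df_back = pd.DataFrame({
    "Title": df_wide["Title"],
    "Vector": df_wide.drop(columns="Title").values.tolist(),
})

print(df_wide)
print(df_back)
```

The wide layout is convenient for anything that expects a plain numeric matrix (like clustering), while the single-column layout is handy when you want to grab one movie's vector at a time.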


Now, we're gonna define some functions. First of all, cosine similarity, which is useful for comparing texts: it depends only on the direction of the two vectors, not on their length.
Then, we're gonna define a function that, given a movie, returns all the other films ranked by cosine similarity to it.

In [6]:
from numpy.linalg import norm

def cosine_sim(A,B):
    return np.dot(A, B)/(norm(A)*norm(B))

def similar_movies(movie_selected):
    # Vector of the selected movie (raises IndexError if the title is not in dfvectors)
    sel_vector = np.array(dfvectors.loc[dfvectors['Title'] == movie_selected, 'Vector'].iloc[0])

    # Every other movie is a candidate
    dfop = dfvectors.loc[dfvectors['Title'] != movie_selected]

    results = []
    for _, movie in dfop.iterrows():
        similarity = cosine_sim(sel_vector, np.array(movie['Vector']))
        results.append([movie['Title'], similarity])

    # Most similar first
    df_results = pd.DataFrame(results, columns=['movie', 'similarity']).sort_values(by='similarity', ascending=False)

    return df_results
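
As a quick sanity check of the "direction, not length" idea, here is a tiny sketch with made-up 2-D vectors (they are not real script embeddings). It redefines `cosine_sim` so the snippet runs on its own.

```python
import numpy as np
from numpy.linalg import norm

def cosine_sim(A, B):
    # Dot product divided by the product of the norms
    return np.dot(A, B) / (norm(A) * norm(B))

# Toy vectors, invented for illustration
a = np.array([1.0, 2.0])
b = 10 * a                   # same direction as a, ten times longer
c = np.array([-2.0, 1.0])    # perpendicular to a
d = -a                       # opposite direction to a

sim_same = cosine_sim(a, b)
sim_perp = cosine_sim(a, c)
sim_opp = cosine_sim(a, d)

print(sim_same, sim_perp, sim_opp)
```

Up to floating-point rounding, the three values come out as 1, 0 and -1: scaling a vector doesn't change its similarity, and only the angle between the vectors matters.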

This, by itself, already works fine. But we can add clusters in order to only recommend movies in the same cluster. We will then use cosine similarity to find the closest members in that cluster. We also have to adapt our algorithm to work with more than one movie as input.

For the clustering, I'm gonna use k-means.

In [7]:

from sklearn.cluster import KMeans


X = dfscript.iloc[:, 1:].values  


num_clusters = 40


# n_init is set explicitly to avoid scikit-learn's warning about its changing default
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
dfscript['cluster'] = kmeans.fit_predict(X)



After training, we can write a function that handles the remaining problems, such as deciding which cluster to use and accepting more than one movie as input.

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

def recommend(movie_selected):
    # Accept a single title passed as a string by wrapping it in a list
    if isinstance(movie_selected, str):
        movie_selected = [movie_selected]

    # Average the vectors of the selected movies (the columns between 'Title' and 'cluster')
    watched_movies_vectors = dfscript[dfscript['Title'].isin(movie_selected)].iloc[:, 1:-1].values
    average_vector = watched_movies_vectors.mean(axis=0)

    # Find the cluster that best accommodates the average
    watched_movies_cluster = kmeans.predict([average_vector])

    # Keep the movies in that cluster, excluding the ones passed as input
    cluster_movies = dfscript[dfscript['cluster'] == watched_movies_cluster[0]]
    cluster_movies = cluster_movies[~cluster_movies['Title'].isin(movie_selected)].copy()

    # Calculate the similarity between each film in the cluster and the average
    similarities = cosine_similarity([average_vector], cluster_movies.iloc[:, 1:-1].values)
    cluster_movies['similarity'] = similarities[0]

    # Sort from most similar to least
    cluster_movies = cluster_movies.sort_values(by='similarity', ascending=False)

    # Take the 5 most similar movies, print them for the user, and return them
    top_recommendations = cluster_movies[['Title', 'similarity']].head(5)
    print(f"Top Recommendations for '{movie_selected}':")
    print(top_recommendations)

    return top_recommendations
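
To see the steps of the function in isolation, here is a toy walk-through that uses only NumPy. Everything in it is made up for illustration: the titles, the 2-D vectors, and the `centroids` and `labels` arrays, which stand in for what a fitted k-means model would provide.

```python
import numpy as np

# Toy 2-D "embeddings", invented for illustration
titles = np.array(["A", "B", "C", "D", "E", "F"])
vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.3],
    [0.8, -0.2],
    [-1.0, 0.2],
    [-0.9, -0.1],
    [0.7, 0.5],
])
# Hypothetical cluster centres and assignments, standing in for a fitted KMeans
centroids = np.array([[0.9, 0.0], [-0.95, 0.05]])
labels = np.array([0, 0, 0, 1, 1, 0])

watched = ["A", "C"]
watched_mask = np.isin(titles, watched)

# Step 1: average the vectors of the watched movies
average_vector = vectors[watched_mask].mean(axis=0)

# Step 2: pick the nearest centroid by Euclidean distance (how k-means assigns points)
cluster = np.argmin(np.linalg.norm(centroids - average_vector, axis=1))

# Step 3: candidates are the unwatched movies in that cluster
candidates = (labels == cluster) & ~watched_mask

# Step 4: rank the candidates by cosine similarity to the average vector
cand_vectors = vectors[candidates]
sims = cand_vectors @ average_vector / (
    np.linalg.norm(cand_vectors, axis=1) * np.linalg.norm(average_vector)
)
order = np.argsort(-sims)
ranking = [(str(t), float(s)) for t, s in zip(titles[candidates][order], sims[order])]

print(ranking)
```

Here "B" ranks above "F" because it points in almost the same direction as the average of "A" and "C", while "D" and "E" are never considered since they sit in the other cluster.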

Testing:

In [12]:
recommend(['No Country for Old Men', 'Mean Girls'])


Top Recommendations for '['No Country for Old Men', 'Mean Girls']':
                   Title  similarity
1003  The Usual Suspects    0.464227
1137      Field of Dream    0.454095
1181       Apartment The    0.436844
163    The Hateful Eight    0.426630
1282       Stir Of Echoe    0.413532


If you ask me, those movies feel exactly in the middle of "No Country for Old Men" and "Mean Girls".

## Conclusion

We used doc2vec embeddings, k-means clustering and cosine similarity to create a working recommendation system that needs no user data.

The algorithm could probably use some polishing, such as recommending movies from more than a single cluster. I might come back to that eventually. I also have to find a way to deal with duplicates.
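
One possible way to go beyond a single cluster is to keep the few centroids closest to the average vector instead of only the nearest one. The sketch below is just that idea on toy data: `nearest_clusters` is a hypothetical helper, and the `centroids` array is made up to stand in for `kmeans.cluster_centers_`.

```python
import numpy as np

def nearest_clusters(query_vector, centroids, n_clusters=3):
    """Return the indices of the n_clusters centroids closest to query_vector."""
    distances = np.linalg.norm(centroids - query_vector, axis=1)
    return np.argsort(distances)[:n_clusters]

# Hypothetical centroids, invented for illustration
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
query = np.array([0.4, 0.1])

chosen = nearest_clusters(query, centroids, n_clusters=2)
print(chosen)
```

With something like this, the cluster filter in `recommend` could use `dfscript['cluster'].isin(chosen)` instead of comparing against a single cluster. For the duplicates, pandas' `drop_duplicates(subset='Title')` might be a starting point, assuming duplicated scripts end up with identical titles.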