# Movie Recommendations task
The task was to create a machine learning model that recommends movies: first explore the data, then build a model.
There was a choice between developing a model of one's own and using one of the more traditional recommendation techniques: content-based filtering or collaborative filtering. As collaborative filtering can offer more diverse recommendations, and can also suggest movies that the user might not otherwise have found (according to both Wikipedia and ChatGPT), this technique was chosen initially.

As we will see, the results with that model were less than satisfactory, which is why content-based filtering was implemented later instead.

## EDA 

In [26]:
import pandas as pd
import numpy as np

# Load the data
movie_df = pd.read_csv('moviedata/movies.csv', header=0)
rating_df = pd.read_csv('moviedata/ratings.csv', header=0)

In [27]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Collaborative filtering was already in mind at this stage, which is why the title was added to `rating_df`:

In [28]:
rating_df['title'] = rating_df['movieId'].map(movie_df.set_index('movieId')['title'])

In [29]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,296,5.0,1147880044,Pulp Fiction (1994)
1,1,306,3.5,1147868817,Three Colors: Red (Trois couleurs: Rouge) (1994)
2,1,307,5.0,1147868828,Three Colors: Blue (Trois couleurs: Bleu) (1993)
3,1,665,5.0,1147878820,Underground (1995)
4,1,899,3.5,1147868510,Singin' in the Rain (1952)


Checking how many unique movie titles are in the dataframe:

In [30]:
rating_df['title'].nunique()

58958

To get a more reasonably sized model, 1M rows were sampled out of the 25M. 

In [31]:
smaller_df = rating_df.sample(n=1000000, random_state=1)
smaller_df.shape

(1000000, 5)

Check how many unique movies are in this smaller subset:

In [32]:
smaller_df['title'].nunique()

23166
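
To put these counts in context, here is a minimal sketch of how one might measure how sparse the user-item grid is. The toy ratings and the helper name `rating_density` are made up for illustration; on the real data the same function could be called with `smaller_df`, assuming each (user, movie) pair occurs at most once.

```python
import pandas as pd

def rating_density(df: pd.DataFrame) -> float:
    """Share of the user x movie grid that actually holds a rating."""
    n_users = df['userId'].nunique()
    n_movies = df['movieId'].nunique()
    return len(df) / (n_users * n_movies)

# Made-up toy ratings: 3 users, 4 movies, 5 ratings
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 20, 30, 40],
    'rating':  [5.0, 3.5, 4.0, 2.0, 5.0],
})

density = rating_density(toy_ratings)
print(f"Density: {density:.3f}")
```

A low density means most users have rated only a tiny fraction of the movies, which is the sparsity the later sections refer to.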

Check how many unique movies have 5-star ratings:

In [33]:
smaller_df[(smaller_df['rating'] == 5)]['title'].value_counts()

title
Shawshank Redemption, The (1994)             1617
Pulp Fiction (1994)                          1282
Schindler's List (1993)                      1072
Star Wars: Episode IV - A New Hope (1977)    1061
Forrest Gump (1994)                          1021
                                             ... 
Another Cinderella Story (2008)                 1
D.C.H. (Dil Chahta Hai) (2001)                  1
Peacemaker, The (1997)                          1
Prison Break: The Final Break (2009)            1
Biutiful (2010)                                 1
Name: count, Length: 8600, dtype: int64

8600 unique titles with at least one 5-star rating (about 37% of the 23166 titles in the sample) seemed like enough to work with, so it was deemed good enough. The vague idea was to recommend movies based on 5-star ratings.

## Choosing the model

To get an overview of the difference between the different approaches, I asked ChatGPT. The comparison shows that Collaborative Filtering probably would be the best choice for this domain, which is also what I was thinking after reading up on the two models on Wikipedia:  

*ChatGPT4: Comparing Content Filtering and Collaborative Filtering:*

| Feature                         | Content Filtering                                   | Collaborative Filtering                                      |
|---------------------------------|-----------------------------------------------------|--------------------------------------------------------------|
| **Data Required**               | Item features (e.g., genre, author)                  | User-item interactions (e.g., ratings, views)                |
| **Recommendation Basis**        | Similarity between item features and user preferences| Similarity between users or items based on user interactions |
| **Advantages**                  | - Privacy-friendly <br> - Can recommend new items <br> - Transparent reasoning      | - Diverse recommendations <br> - Can discover serendipitous items <br> - Effective without item metadata            |
| **Limitations**                 | - Limited by item features <br> - Risk of over-specialization <br> - Cold start problem for new users | - Cold start problem for new users/items <br> - Requires large amounts of data <br> - Scalability issues            |
| **Application Examples**        | - E-commerce product recommendations <br> - Online libraries and content platforms | - Movie, music, and book recommendations <br> - E-commerce and social networking sites                             |
| **User/Item Newness Handling**  | Can handle new items if item features are available  | Struggles with new users and items due to lack of interaction data                                                 |
| **Diversity of Recommendations**| May recommend items too similar to user's past likes | Generally offers more diverse recommendations                |
| **Requirement for Metadata**    | High (needs detailed item features)                   | Low (relies on user behavior rather than item specifics)      |

Thus, as already stated, **Collaborative Filtering** was chosen initially. 

ChatGPT suggested using **Singular Value Decomposition (SVD)**, a model-based collaborative filtering method, to handle the sparse data and the large number of rows. As it was also one of the methods mentioned in the [Wikipedia article](https://en.wikipedia.org/wiki/Collaborative_filtering), and for lack of further domain knowledge, that's what was developed.

## Collaborative filtering
As user ratings of the movies are the foundation of this method, some things were immediately clear:
- `tags.csv` and `links.csv` can be ignored.
- From `movies.csv` only the `title` column is needed; from `ratings.csv`, all columns.

Also, a random subset needed to be selected for training, as 25,000,000 ratings are quite a lot (already done above).

### Theory
*Credit for this section goes to ChatGPT, which was tasked with giving an overview of the theory for SVD, as that was a bit unclear.*

Singular Value Decomposition, or SVD, is a powerful technique that breaks down a matrix $A$ into three distinct matrices: $U$, $\Sigma$, and $V^{T}$. Here's how it goes:

- $A$ is a matrix where the rows represent users and columns represent movies, filled with ratings users have given to movies.
- $U$ is a matrix where each row is a "user feature vector," representing hidden characteristics of a user.
- $\Sigma$ is a diagonal matrix whose entries are singular values. These values rank the importance of each latent feature, from most to least impactful.
- $V^{T}$, the transpose of $V$, contains "item feature vectors" in its columns, showing hidden characteristics of items.

The magic formula looks like this: $A = U \Sigma V^{T}$.

By breaking down $A$ into these components, we can uncover hidden patterns in how users rate items. It's all about finding out what common tastes or preferences (represented by the latent features in $U$ and $V$) users share.

The singular values in $\Sigma$ tell us how significant each hidden feature is. By keeping only the top few features (the highest values in $\Sigma$), we can simplify the complex world of user-item ratings into something more manageable, yet still insightful for making recommendations.

This process allows us to predict how a user might rate items they haven't encountered yet, based on the latent features.
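
The decomposition and the "keep only the top few features" step can be sketched with NumPy on a tiny, fully filled-in rating matrix. The ratings are made up, and this is plain SVD on a dense matrix. A real rating matrix has missing entries, and the `SVD` class in `scikit-surprise` is documented as a Funk-style factorization that learns the factors from the known ratings only, so treat this purely as an illustration of $A = U \Sigma V^{T}$.

```python
import numpy as np

# Made-up dense ratings: 4 users (rows) x 3 movies (columns)
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: U is 4x3, s holds the singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (they come sorted, largest first)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("Singular values:", np.round(s, 2))
print("Rank-2 approximation:\n", np.round(A_k, 2))
```

The rank-2 matrix `A_k` has the same shape as `A` but is built from two latent features only; its entries play the role of predicted ratings.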

### Implementing the model

This required an additional package, `scikit-surprise`, which provides a ready-built implementation of Singular Value Decomposition. As it was not possible to install `scikit-surprise` using `pip`, an environment was created using `conda`, where the required packages were installed.

The data was loaded into the SVD and 5-fold Cross Validation was performed to evaluate the results:

In [34]:
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy

# Load the dataset into Surprise's format.
# Fitting is fast enough for the full dataset: model.fit() on all rows runs in
# just under 14 minutes on an 8-year-old laptop. But due to the size of the pickle
# file, and the execution time at use, the smaller dataset of 1M rows was used.

reader = Reader(rating_scale=(min(smaller_df['rating']), max(smaller_df['rating'])))
data = Dataset.load_from_df(smaller_df[['userId', 'movieId', 'rating']], reader)

model = SVD()

# Perform 5-fold cross-validation to check results
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True, n_jobs=-2)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9162  0.9147  0.9177  0.9184  0.9155  0.9165  0.0014  
MAE (testset)     0.7058  0.7040  0.7063  0.7061  0.7050  0.7054  0.0008  
Fit time          22.72   23.15   22.00   15.69   14.75   19.66   3.66    
Test time         4.42    5.26    4.85    3.32    2.43    4.06    1.04    
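
For reference, the two error measures in the table above can be computed by hand. This is a minimal sketch on made-up ratings; the helper names `rmse` and `mae` are illustrative and are not the Surprise functions.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalised more heavily."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(errors)))

# Made-up true and predicted ratings
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 5.0, 3.0]

print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"MAE:  {mae(actual, predicted):.4f}")
```

Both are measured in rating points, so an RMSE of 0.92 means predictions are typically off by a little under one star.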


With an RMSE consistently at 0.92 (rounded) it seemed to be doing very well (although perhaps being a bit too consistent?). Next, the data is split into a training set and a test set for later use, to avoid data leakage, and the RMSE is checked again:

In [35]:
# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Train the model on the training set
model.fit(trainset)

# Predict ratings for the test set
predictions = model.test(testset)

# Evaluate the predictions
rmse = accuracy.rmse(predictions)

RMSE: 0.9184


This time, too, the RMSE on the test set is 0.92.

It was time to implement the recommendations. Are we going to be blown away by this great model, or what?

### Putting the model to the test
Here's a function for entering favorite movie titles:

In [36]:
# Separated this out to be able to use it later also
def enter_favorite_movies():
    user_favorites_titles = []
    print("Enter your favorite movies (press enter to finish):")
    while True:
        title_input = input()
        if title_input == "":
            break
        user_favorites_titles.append(title_input)
    return user_favorites_titles

The code below finds the matching `movieId`s. This approach has a limitation: it returns every movie whose title contains the entered string. E.g. if you enter "Matrix", all movies with that string in the title are returned, including "Animatrix". But for an initial build of the model it was deemed good enough.

In [41]:
user_favorites_titles = enter_favorite_movies()

favorite_movie_ids = []

for user_title in user_favorites_titles:
    matches = movie_df[movie_df['title'].str.contains(user_title, case=False, na=False)]
    if not matches.empty:
        for _, row in matches.iterrows():
            favorite_movie_ids.append(row['movieId'])
            print(f"Found: {row['title']} with ID: {row['movieId']}")
    else:
        print(f"No matches found for: {user_title}")

print("Your favorite movie IDs:", favorite_movie_ids)


Enter your favorite movies (press enter to finish):
Found: Matrix, The (1999) with ID: 2571
Found: Matrix Reloaded, The (2003) with ID: 6365
Found: Matrix Revolutions, The (2003) with ID: 6934
Found: Animatrix, The (2003) with ID: 27660
Found: Return to Source: The Philosophy of The Matrix (2004) with ID: 132490
Found: Armitage: Dual Matrix (2002) with ID: 157721
Found: The Matrix Revisited (2001) with ID: 172255
Found: The Living Matrix (2009) with ID: 179489
Found: Matrix of Evil (2003) with ID: 181103
Your favorite movie IDs: [2571, 6365, 6934, 27660, 132490, 157721, 172255, 179489, 181103]
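
The over-matching seen above (for instance "Animatrix") could be reduced by matching whole words only. The following is a sketch on a toy frame whose titles are taken from the output above; the helper `find_titles` and its `whole_word` flag are hypothetical names introduced here.

```python
import re
import pandas as pd

# Toy stand-in for movie_df, using three titles from the output above
toy_movies = pd.DataFrame({
    'movieId': [2571, 27660, 172255],
    'title': ['Matrix, The (1999)', 'Animatrix, The (2003)', 'The Matrix Revisited (2001)'],
})

def find_titles(df, query, whole_word=False):
    """Return titles containing the query, optionally as a whole word only."""
    pattern = re.escape(query)          # treat user input literally, not as a regex
    if whole_word:
        pattern = r'\b' + pattern + r'\b'
    mask = df['title'].str.contains(pattern, case=False, na=False, regex=True)
    return df.loc[mask, 'title'].tolist()

substring_hits = find_titles(toy_movies, 'matrix')
whole_word_hits = find_titles(toy_movies, 'matrix', whole_word=True)

print(substring_hits)
print(whole_word_hits)
```

With word boundaries the query still matches several titles, so a further step (such as letting the user pick one) would be needed to pin down a single movie.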


Here the user enters some favorite movies and gets recommendations. The debugger was used to inspect the `predictions` variable and find out how to extract the `predicted_ratings`.

In [42]:
new_user_id = 'new_user'  

all_movie_ids = set(smaller_df['movieId'].unique())

# Exclude movies that the new user has listed as their favorites
movies_to_predict = list(all_movie_ids - set(favorite_movie_ids))

# Create a list of tuples in the form of (new_user_id, movieId, actual_rating)
# Since we don't have actual ratings for these, we use a dummy rating value of 5
testset = [[new_user_id, movie_id, 5.] for movie_id in movies_to_predict] 

# Predict ratings for all movies the new user hasn't rated
predictions = model.test(testset)

# Convert predictions to a list of (movieId, predicted_rating) tuples
predicted_ratings = [(pred.iid, pred.est) for pred in predictions]

# Sort the predictions by estimated rating in descending order
predicted_ratings.sort(key=lambda x: x[1], reverse=True)

# Get the top 10 recommendations
top_recommendations = predicted_ratings[:10]

print("Top 10 movie recommendations for you:")
print("=====================================")
i=1
for movie_id, rating in top_recommendations:
    movie_name = movie_df.loc[movie_df['movieId'] == movie_id, 'title'].values[0]
    print(f"{i}) {movie_name}, Predicted Rating: {rating:.1f}")
    i += 1

# Get the top 10 movies to stay away from
bottom_recommendations = predicted_ratings[-10:]

i=10
print("\nTop 10 movies for you to avoid:")
print("##############################################")
for movie_id, rating in bottom_recommendations:
    movie_name = movie_df.loc[movie_df['movieId'] == movie_id, 'title'].values[0]
    print(f"{i}) {movie_name}, Predicted Rating: {rating:.1f}")
    i -= 1


Top 10 movie recommendations for you:
1) Planet Earth II (2016), Predicted Rating: 4.6
2) Band of Brothers (2001), Predicted Rating: 4.6
3) Planet Earth (2006), Predicted Rating: 4.5
4) Children of Paradise (Les enfants du paradis) (1945), Predicted Rating: 4.5
5) To Live (Huozhe) (1994), Predicted Rating: 4.4
6) Exterminating Angel, The (Ángel exterminador, El) (1962), Predicted Rating: 4.4
7) Song of the Little Road (Pather Panchali) (1955), Predicted Rating: 4.4
8) Shawshank Redemption, The (1994), Predicted Rating: 4.4
9) Harakiri (Seppuku) (1962), Predicted Rating: 4.4
10) There Once Was a Dog (1982), Predicted Rating: 4.4

Top 10 movies for you to avoid:
##############################################
10) Baby Geniuses (1999), Predicted Rating: 2.0
9) Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie) (2002), Predicted Rating: 1.9
8) Kazaam (1996), Predicted Rating: 1.9
7) Pokémon 3: The Movie (2001), Predicted Rating: 1.9
6) Glitter (2001), Predicted Rating: 1.9
5) Problem Child 2 (199

Saving the model (skipping as the file is useless and takes up too much space):

In [41]:
from joblib import dump, load
# dump(model, './save/SVD_movie_model_1M.pkl', compress=5)

['./save/SVD_movie_model_1M.pkl']

### Issue
This all looks good at first glance. There is a slight drawback though: **the recommendations are always the same**.

The reasons for this eluded the author, who suspected it might have something to do with the `user_id` feature (but then why would it have such a great impact?) or with the sparse interaction of users given the totality of movies. In the end, the author couldn't think of a way to get around it, and figured that the gods might be more favorable to implementing content-based filtering instead.
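
One likely explanation, going by the Surprise documentation for `SVD`: a rating is predicted as $\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$, and for a user who is not in the training set the user bias $b_u$ and user factors $p_u$ are taken to be zero. In the cell above, `'new_user'` never appears in the training data and the favorite movies are only excluded from the candidates, never shown to the model. Every visitor would therefore be scored with $\mu + b_i$ alone. The sketch below mimics that rule in NumPy with made-up parameters; it is not Surprise's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_factors = 6, 3

# Made-up "learned" parameters standing in for a fitted model
global_mean = 3.5                                         # mu
item_bias = rng.normal(0, 0.5, n_items)                   # b_i
item_factors = rng.normal(0, 0.1, (n_items, n_factors))   # q_i

def predict_all(user_bias=0.0, user_factors=None):
    """Predicted rating for every item: mu + b_u + b_i + q_i . p_u."""
    if user_factors is None:
        # Unknown user: no learned factors, so the dot product contributes nothing
        user_factors = np.zeros(n_factors)
    return global_mean + user_bias + item_bias + item_factors @ user_factors

# Two different "new" users: their favorites never enter the formula
ranking_alice = np.argsort(-predict_all())
ranking_bob = np.argsort(-predict_all())
print("Same ranking for both new users:", np.array_equal(ranking_alice, ranking_bob))

# A user with learned bias and factors gets individual scores
known_user_scores = predict_all(user_bias=0.2, user_factors=rng.normal(0, 1.0, n_factors))
print("Scores for a known user:", np.round(known_user_scores, 2))
```

If this is indeed the cause, a possible remedy (not tried here) would be to add the favorites as ratings under a new user ID and refit, so that the model has something to learn about that user.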

## Content based filtering
As the above model did not produce usable recommendations, here's a go at content-based filtering.
First, the metadata now needs to be included: the genre descriptions and the user-provided tags.

In [1]:
import pandas as pd
import numpy as np

# Load the data
movie_df = pd.read_csv('moviedata/movies.csv', header=0)
rating_df = pd.read_csv('moviedata/ratings.csv', header=0)
tag_df = pd.read_csv('moviedata/tags.csv', header=0)

aggregated_tags_df = tag_df.groupby('movieId')['tag'].agg(list).reset_index()

# Merge the aggregated tags into the movie_df based on movieId
movie_df_with_tags = pd.merge(movie_df, aggregated_tags_df, on='movieId', how='left')

movie_df_with_tags

Unnamed: 0,movieId,title,genres,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Robin Williams, time travel, fantasy, based o..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"[funny, best friend, duringcreditsstinger, fis..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[based on novel or book, chick flick, divorce,..."
4,5,Father of the Bride Part II (1995),Comedy,"[aging, baby, confidence, contraception, daugh..."
...,...,...,...,...
62418,209157,We (2018),Drama,
62419,209159,Window of the Soul (2001),Documentary,
62420,209163,Bad Poems (2018),Comedy|Drama,
62421,209169,A Girl Thing (2001),(no genres listed),


Checking the proportion of movies without tags:

In [2]:
movie_df_with_tags['tag'].isna().sum(), movie_df_with_tags.shape

(17172, (62423, 4))
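
With the figures above, 17172 of 62423 movies (about 27.5%) have no tags. As a sketch, the same aggregate-merge-check pattern on made-up toy frames, where the share is read off directly with `isna().mean()`:

```python
import pandas as pd

# Made-up toy data: movie 3 has no tags at all
toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A (2000)', 'B (2001)', 'C (2002)']})
toy_tags = pd.DataFrame({'movieId': [1, 1, 2],
                         'tag': ['pixar', 'funny', 'time travel']})

# One row per movie, with all of its tags collected in a list
tags_per_movie = toy_tags.groupby('movieId')['tag'].agg(list).reset_index()

# Left merge keeps every movie; movies without tags get NaN in 'tag'
merged = toy_movies.merge(tags_per_movie, on='movieId', how='left')

share_untagged = merged['tag'].isna().mean()
print(f"{share_untagged:.1%} of movies have no tags")
```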

Create a `metadata` column to avoid `NaN` values and to have a single feature for all metadata:

In [3]:
# Ensure 'genres' is a string
movie_df_with_tags['genres'] = movie_df_with_tags['genres'].astype(str)
# Lambda function handles NaN values and ensures all items are treated as strings (needed ChatGPT for this one)
movie_df_with_tags['tag'] = movie_df_with_tags['tag'].apply(lambda x: ', '.join([str(tag) for tag in x]) if isinstance(x, list) else '')

# Concatenate 'genres' and 'tag' into a 'metadata' column
movie_df_with_tags['metadata'] = movie_df_with_tags['genres'] + ' ' + movie_df_with_tags['tag'].str.strip()

In [4]:
movie_df_with_tags

Unnamed: 0,movieId,title,genres,tag,metadata
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"Owned, imdb top 250, Pixar, Pixar, time travel...",Adventure|Animation|Children|Comedy|Fantasy Ow...
1,2,Jumanji (1995),Adventure|Children|Fantasy,"Robin Williams, time travel, fantasy, based on...","Adventure|Children|Fantasy Robin Williams, tim..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"funny, best friend, duringcreditsstinger, fish...","Comedy|Romance funny, best friend, duringcredi..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"based on novel or book, chick flick, divorce, ...","Comedy|Drama|Romance based on novel or book, c..."
4,5,Father of the Bride Part II (1995),Comedy,"aging, baby, confidence, contraception, daught...","Comedy aging, baby, confidence, contraception,..."
...,...,...,...,...,...
62418,209157,We (2018),Drama,,Drama
62419,209159,Window of the Soul (2001),Documentary,,Documentary
62420,209163,Bad Poems (2018),Comedy|Drama,,Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed),,(no genres listed)


Checking that there are no missing values:

In [5]:
movie_df_with_tags['metadata'].isna().sum()

0

Drop `genres`, `tag` columns to slim dataframe:

In [6]:

movie_df_with_tags = movie_df_with_tags.drop(columns=['genres', 'tag'])

### Choosing the method: TF-IDF
When researching how content-based filtering is usually implemented, I quickly stumbled upon the TF-IDF method and decided to give it a go, as cosine similarity is a familiar concept and the name had been mentioned during several lectures. The scikit-learn implementation also seemed pretty straightforward [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

The first step was to fit the TF-IDF vectorizer on the metadata so that it learns the vocabulary, and then transform the metadata into numerical form.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

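# Lowercase the text (lowercase=True is already the default).
# The default token pattern only keeps alphanumeric tokens of two or more characters,
# so ',' and '|' never become tokens and this stop_words list has no practical effect.
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words=[',','|'])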

In [48]:
# Fit and transform the metadata to TF-IDF features
tfidf_matrix = tfidf_vectorizer.fit_transform(movie_df_with_tags['metadata'])
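
To make the fit-and-transform step concrete, here is a small sketch on three made-up metadata strings in the same genres-plus-tags format. It uses the vectorizer's defaults only; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up metadata strings in the same "genres tags" format as above
toy_metadata = [
    'Adventure|Children|Fantasy Robin Williams, time travel',
    'Comedy|Romance funny, best friend',
    'Comedy',
]

# Defaults: text is lowercased and each row is scaled to unit (L2) length
vectorizer = TfidfVectorizer()
toy_matrix = vectorizer.fit_transform(toy_metadata)

print(sorted(vectorizer.vocabulary_))   # learned terms; separators are gone
print(toy_matrix.shape)                 # (documents, distinct terms)
```

Each row of the resulting sparse matrix is one movie, each column one term, and the separators `|` and `,` do not show up in the vocabulary.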

To get more of a feel for what the vectorization did, the TF-IDF Vectorizer and the matrix output were inspected: 

In [8]:
tfidf_vectorizer.vocabulary_

{'adventure': 657,
 'animation': 1424,
 'children': 5767,
 'comedy': 6477,
 'fantasy': 10919,
 'owned': 23409,
 'imdb': 15335,
 'top': 32525,
 '250': 208,
 'pixar': 24552,
 'time': 32316,
 'travel': 32826,
 'funny': 12127,
 'witty': 35178,
 'rated': 26112,
 'computer': 6609,
 'good': 12985,
 'cartoon': 5179,
 'chindren': 5788,
 'friendship': 11985,
 'bright': 4280,
 'daring': 7792,
 'rescues': 26664,
 'fanciful': 10895,
 'heroic': 14324,
 'mission': 20999,
 'humorous': 14990,
 'light': 18546,
 'rousing': 27383,
 'toys': 32669,
 'come': 6464,
 'to': 32401,
 'life': 18526,
 'unlikely': 33559,
 'friendships': 11986,
 'warm': 34585,
 'disney': 8778,
 'boy': 4065,
 'next': 22260,
 'door': 9076,
 'bullying': 4563,
 'friends': 11984,
 'jealousy': 16329,
 'martial': 19851,
 'arts': 1935,
 'neighborhood': 22109,
 'new': 22220,
 'toy': 32663,
 'rescue': 26661,
 'resourcefulness': 26700,
 'rivalry': 27043,
 'comes': 6481,
 'walkie': 34509,
 'talkie': 31586,
 'clever': 6143,
 'tom': 32461,
 'hanks

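All words in the vocabulary are used as features (see below, where `feature_names` are the features extracted from the corpus). This seemed a bit odd at first, but it is how the vectorizer works: `vocabulary_` maps each term to its feature index, so the two lengths always match. Reducing the number of features (e.g. with `min_df` or `max_features`) was left as something to deal with later if the results were to be unsatisfactory.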

In [46]:
voclen = len(tfidf_vectorizer.vocabulary_)
feature_names = tfidf_vectorizer.get_feature_names_out()

print(f"Vocabulary length: {voclen}\nFeature length: {len(feature_names)}")

Vocabulary length: 36025
Feature length: 36025


Checking the most and least common terms by their IDF values (the debugger and documentation were used to find out how to get to the values):

In [11]:
idf_values = tfidf_vectorizer.idf_
idf_dict = dict(zip(feature_names, idf_values))
sorted_idf = sorted(idf_dict.items(), key=lambda x: x[1])

print("Lowest IDF values (most common terms):")
print(sorted_idf[:10])

print("\nHighest IDF values (least common terms):")
print(sorted_idf[-10:])

Lowest IDF values (most common terms):
[('drama', 1.8791664536078412), ('comedy', 2.276158644373692), ('thriller', 2.95161329641659), ('romance', 3.0088181492175083), ('action', 3.092080348042381), ('horror', 3.3113378841036405), ('crime', 3.380238415223935), ('documentary', 3.385619905513741), ('no', 3.4095773662882602), ('genres', 3.5117931319725852)]

Highest IDF values (least common terms):
[('惊悚', 11.348557915236652), ('扭曲', 11.348557915236652), ('斯巴达克斯', 11.348557915236652), ('淘金记', 11.348557915236652), ('独闯龙潭', 11.348557915236652), ('竞技场之神', 11.348557915236652), ('臥底', 11.348557915236652), ('莫声版', 11.348557915236652), ('魔鬼司令', 11.348557915236652), ('카운트다운', 11.348557915236652)]
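
The IDF values above can be checked by hand. With scikit-learn's default `smooth_idf=True`, the documented formula is $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing term $t$. A short sketch, using the 62423 rows of `movie_df_with_tags` as $n$ (the helper name `smooth_idf` is introduced here for illustration):

```python
import math

def smooth_idf(n_documents: int, document_frequency: int) -> float:
    """IDF as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_movies = 62423  # number of rows in movie_df_with_tags, as shown earlier

# A term that occurs in exactly one movie's metadata
rarest = smooth_idf(n_movies, document_frequency=1)
print(f"IDF of a term found in one movie: {rarest:.6f}")
```

The result agrees with the 11.3486 printed for the least common terms, which means each of those terms appears in the metadata of exactly one movie.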


Apparently, some Chinese and Korean words are included in the vocabulary, and they are among the least common terms.

Next, to get a feel for how well TF-IDF captures what is special about a particular movie, the scores for the movie at a given index were printed:

In [12]:
def print_tfidf_scores(doc_index):
    feature_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_scores = tfidf_matrix[doc_index].toarray().flatten()
    
    scores_dict = dict(zip(feature_names, tfidf_scores))
    sorted_scores = sorted(scores_dict.items(), key=lambda x: x[1], reverse=True)
    
    for term, score in sorted_scores:
        if score > 0:
            print(f"{term}: {score}")

# According to movie_df.head() index 1 is "Jumanji"
print_tfidf_scores(1)


robin: 0.5446167136607799
williams: 0.5121227672667045
board: 0.2920345565561034
game: 0.24535226462132634
time: 0.2257176790211587
travel: 0.22426969548899756
animals: 0.18892492557180915
fantasy: 0.1841419060632971
cgi: 0.15610941156195504
kid: 0.11901695063268322
flick: 0.11687951873875763
scary: 0.11325956282652598
bad: 0.08138703201423816
dunst: 0.07531870707143819
kirsten: 0.07448789483781622
magic: 0.06360251108349665
children: 0.05993814149032501
kids: 0.054416420120817835
book: 0.05359588577004541
family: 0.048878009765698865
for: 0.04791134950633545
herds: 0.046441238547714675
recaptured: 0.046441238547714675
not: 0.0446225018349652
johnston: 0.04130411157874487
adventure: 0.03997129264894706
monkey: 0.033276821319161896
joe: 0.031201831471880374
childhood: 0.02712967702839731
fiction: 0.026686060681343323
of: 0.02542277467377018
zathura: 0.02473114826061427
based: 0.024301448984561255
lebbat: 0.02384754544710914
on: 0.023344585212458423
allsburg: 0.023220619273857337
adapted

#### Initial evaluation
So far this method seemed to be performing really well. In the example movie above (*"Jumanji"*) the top words are `robin williams board game time travel`, which are probably among the words most people would use to describe the movie.

#### Next step: Cosine similarity
Below, the cosine similarity between two movies is calculated to get a feel for how it works. Getting the slicing to work with `cosine_similarity()` took some figuring out, instead of just being able to use the index directly (the function requires 2D input, not a 1D array).

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between the first and second movie
similarity_matrix = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
similarity_matrix


array([[0.05164751]])
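
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. Here is a minimal NumPy sketch on made-up vectors; the helper name `cosine_sim` is illustrative and is not the scikit-learn function.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 0], [2, 4, 0])   # parallel vectors
no_overlap = cosine_sim([1, 0, 0], [0, 3, 0])       # no shared terms

print(same_direction, no_overlap)
```

For TF-IDF vectors, which have no negative entries, the value lies between 0 (no terms in common) and 1 (same direction). A score of about 0.05 between "Toy Story" and "Jumanji" therefore means little shared vocabulary.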

As that was otherwise pretty straightforward, the next step was to go for the full shebang by getting movie recommendations based on the user's favorite movies:

In [14]:
# Vstack stacks sparse matrices (where most elements are 0) vertically. Further explained below.
from scipy.sparse import vstack
import re

Function to get recommendations based on entered movies:

In [49]:
def get_content_based_recommendations(movie_df_with_tags, favorite_movie_titles, tfidf_matrix):
    favorite_movie_indices = []
    for title in favorite_movie_titles:
        # Matches whole words (regex \b) to exclude some potential non-intended matches
        matched_indices = movie_df_with_tags.index[movie_df_with_tags['title'].str.contains(r'\b' + re.escape(title) + r'\b', case=False, regex=True)]
        if not matched_indices.empty:
            # Further narrowing it down by only taking the first match if there are multiple matches
            # This would then make it possible for it to recommend other movies in e.g. a trilogy.
            favorite_movie_indices.append(matched_indices[0])
        else:
            print(f"Warning: '{title}' not found in the movie database.")

    if not favorite_movie_indices:
        return "No favorite movies found in the dataset."
    
    # This section was problematic to get working and eventually I had to resort to asking ChatGPT to solve it,
    # that also needed a couple of rounds. The issue was that the sparse matrix was not being converted to a dense array,
    # and thus not accepted by the cosine_similarity function.
    user_profile = vstack([tfidf_matrix[index] for index in favorite_movie_indices]).mean(axis=0)
    user_profile_dense = user_profile.A  # Convert to dense array

    cosine_similarities = cosine_similarity(user_profile_dense, tfidf_matrix)
    # Get indices of similar movies, sorted by similarity score
    similar_movies_indices = cosine_similarities.argsort().flatten()  
    # Exclude the favorite movies from the recommendations
    similar_movies_indices = [idx for idx in similar_movies_indices[::-1] if idx not in favorite_movie_indices][:10]

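    # The indices above are already ordered most similar first, so the titles are
    # returned in that order (a second reversal would put the best match last)
    similar_movies_titles = movie_df_with_tags.iloc[similar_movies_indices]['title'].tolist()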
    return similar_movies_titles
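
The core of the function (average the favorites into one profile vector, score every movie against it, sort, drop the favorites) can be sketched with plain NumPy on made-up dense vectors. The helper `rank_by_profile` is a hypothetical name, and the real function works on the sparse TF-IDF matrix instead.

```python
import numpy as np

def rank_by_profile(item_vectors, favorite_rows, top_n=3):
    """Return row indices of the items most similar to the mean of the favorites."""
    item_vectors = np.asarray(item_vectors, dtype=float)
    profile = item_vectors[favorite_rows].mean(axis=0)

    # Cosine similarity of every row against the profile
    norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(profile)
    norms[norms == 0] = 1.0                     # avoid division by zero for all-zero rows
    similarities = item_vectors @ profile / norms

    order = np.argsort(-similarities)           # most similar first
    favorites = set(favorite_rows)
    return [int(i) for i in order if int(i) not in favorites][:top_n]

toy_vectors = [
    [1.0, 0.0, 0.0],   # 0: the favorite
    [0.9, 0.1, 0.0],   # 1: close to the favorite
    [0.0, 1.0, 0.0],   # 2: unrelated
    [0.5, 0.5, 0.0],   # 3: in between
]
recommended = rank_by_profile(toy_vectors, favorite_rows=[0])
print(recommended)
```

Sorting on the negated similarities gives a descending order in one step, so no reversal is needed afterwards.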

Finally we get to enter our favorite movies and get recommendations:

In [50]:
user_favorites = enter_favorite_movies()
recommendations = get_content_based_recommendations(movie_df_with_tags, user_favorites, tfidf_matrix)

print("Top 10 movie recommendations for you:")
print("=====================================")
i=1
for movie in recommendations:
    print(f"{i}) {movie}")
    i += 1

Enter your favorite movies (press enter to finish):
Top 10 movie recommendations for you:
1) The Machine (1994)
2) Ex Machina (2015)
3) Machine, The (2013)
4) I, Monster (1971)
5) Screamers (1995)
6) Debug (2014)
7) Game Box 1.0 (2004)
8) Ace (1981)
9) Matrix Reloaded, The (2003)
10) Matrix Revolutions, The (2003)


**Evaluation**: Subjectively, this model works much better than the collaborative filtering model, and is quite satisfactory. Note that in the recorded listing above the closest matches appear to come last (the two *Matrix* sequels at positions 9 and 10), which suggests the top ten were printed in reverse order.
Some research was done on how to evaluate a model such as this, but nothing similar to RMSE was found. Instead, the best way to evaluate it seemed to be to look at the results in an intended-use situation, which has now been done.

Save the trained vectorizer:

In [18]:
from joblib import dump
dump(tfidf_vectorizer, './save/tfidf_vectorizer.pkl', compress=5)

['./save/tfidf_vectorizer.pkl']

## Summary
For this task, *content-based filtering* using TF-IDF turned out to be much more adequate.

For *collaborative filtering*, there were methods to measure RMSE and MAE and get nice, scientific-looking metrics that confirmed small error rates in the predictions. 5-fold cross-validation was performed and the results seemed very promising.

However, when put to the real test, that of recommending movies based on favorite movies, it always generated the same list of recommendations from top to bottom, no matter which movies were entered as favorites. This is another good example of how metrics are not always indicative of a well-performing model.

This was particularly surprising, as SVD was supposed to be a technique especially well suited for handling sparse matrices. The letdown was so big that the possibility of implementing a technique other than SVD for collaborative filtering didn't even cross the mind until much later. There might be a way to make it work, but no viable option was found within the given timeframe.

**However, the content based approach proved to have several key advantages:**
- It was super quick to fit and transform (3 seconds) and use (0.1 seconds).
- No movies needed to be filtered out, thus a complete coverage.
- Subjectively accurate recommendations that actually changed (applause) as the entered movies changed.
- TF-IDF seemed to capture the essence of a movie's characteristics very well, given the available metadata.
- No need to install strange libraries using `conda` that couldn't be installed using `pip`, even after having installed the Microsoft C++ compiler thingy that the `pip` error message recommended to resolve the issue.
- Could easily be developed further to recommend movies not just based on the title, but also on key words.

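It performed that well even with every word in the vocabulary used as a feature. That finding was somewhat worrying at first (there was an expectation that not all words would be used), but it is simply how the vectorizer works unless the vocabulary is restricted, e.g. with `min_df` or `max_features`.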

The task has given insight and experience into how machine learning algorithms can be used for content recommendations, what some challenges are, and how well TF-IDF performed in this scenario. 

As a side note, an early exploration of the dataframes is no longer included in this notebook. It recommended movies directly by selecting data from the dataframe ("users who had given 5 star ratings to the entered favorite movies had also given 5 star ratings to these movies"), and took a good minute or two just to comb through the whole dataframe before printing the recommendations. That showed how ML can really help speed things up, using far fewer resources.

All in all, the project underscored the importance of selecting the right approach for the task at hand, which might not be obvious from the get-go, especially as the initial research here pointed in a direction that the results then contradicted. It also demonstrated the importance of thorough testing, and of not relying solely on metrics.