# Movie Recommendations task
Here the task is to create a model that recommends movies. The first step is to explore the data and build a simpler model.

The second step is to describe and implement a common recommendation technique. The suggested techniques are content filtering and collaborative filtering. Collaborative filtering was chosen because it can offer more diverse recommendations and can suggest movies that the user might not otherwise have found.

After a first look at the MovieLens archive, the scope was narrowed as follows:
- Ignore tags.csv and links.csv.
- From movies.csv, use the genres column; from ratings.csv, use all columns.

A random subset needs to be selected for training, as 25,000,000 ratings are quite a lot.

In [1]:
import pandas as pd
import numpy as np

# Load the data
movie_df = pd.read_csv('moviedata/movies.csv', header=0)
rating_df = pd.read_csv('moviedata/ratings.csv', header=0)

In [2]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
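# Reset the index of rating_df (it is already 0..n-1 right after read_csv, so this changes nothing here)
rating_df = rating_df.reset_index(drop=True)

# Add a title column by looking up each rating's movieId in movie_df
rating_df['title'] = rating_df['movieId'].map(movie_df.set_index('movieId')['title'])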

In [4]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,296,5.0,1147880044,Pulp Fiction (1994)
1,1,306,3.5,1147868817,Three Colors: Red (Trois couleurs: Rouge) (1994)
2,1,307,5.0,1147868828,Three Colors: Blue (Trois couleurs: Bleu) (1993)
3,1,665,5.0,1147878820,Underground (1995)
4,1,899,3.5,1147868510,Singin' in the Rain (1952)


In [5]:
rating_df['title'].nunique()

58958

To get a more reasonably sized model, 1M rows will be sampled out of the 25M. 

In [25]:
smaller_df = rating_df.sample(n=1000000, random_state=1)
smaller_df.shape

(1000000, 5)

Check how many unique movies are in the 1M-row sample:

In [26]:
smaller_df['title'].nunique()

23166
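
Sampling thins out the ratings, so it can be useful to know how sparse the user-movie grid is before fitting a model. A minimal sketch of that check follows; `toy_df` is a made-up stand-in, and on the real data the same lines would be applied to `smaller_df`.

```python
import pandas as pd

# Hypothetical toy ratings; stands in for smaller_df
toy_df = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40],
    "rating":  [5.0, 3.5, 4.0, 2.0, 4.5, 3.0],
})

n_users = toy_df["userId"].nunique()
n_movies = toy_df["movieId"].nunique()

# Share of the user x movie grid that actually holds a rating
density = len(toy_df) / (n_users * n_movies)
print(f"{n_users} users x {n_movies} movies, density = {density:.2f}")
```

A low density means most user-movie pairs are unrated, which is the situation collaborative filtering has to cope with.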

Count the 5-star ratings per movie (the `Length` in the output is the number of unique movies with at least one 5-star rating):

In [27]:
smaller_df[(smaller_df['rating'] == 5)]['title'].value_counts()

title
Shawshank Redemption, The (1994)             1617
Pulp Fiction (1994)                          1282
Schindler's List (1993)                      1072
Star Wars: Episode IV - A New Hope (1977)    1061
Forrest Gump (1994)                          1021
                                             ... 
Another Cinderella Story (2008)                 1
D.C.H. (Dil Chahta Hai) (2001)                  1
Peacemaker, The (1997)                          1
Prison Break: The Final Break (2009)            1
Biutiful (2010)                                 1
Name: count, Length: 8600, dtype: int64
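
Raw 5-star counts mostly reflect how often a movie was rated at all. One possible "simpler model" is a popularity baseline that ranks movies by mean rating but only considers movies with enough ratings. The sketch below uses made-up data, and the `min_ratings` threshold is a hypothetical value, not one taken from this notebook.

```python
import pandas as pd

# Hypothetical toy ratings; on the real data this would be smaller_df
toy_df = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C"],
    "rating": [5.0, 4.0, 4.5, 5.0, 2.0, 5.0],
})

min_ratings = 2  # hypothetical threshold for "enough" ratings

# Mean rating and number of ratings per movie
stats = toy_df.groupby("title")["rating"].agg(["mean", "count"])

# Keep only movies with enough ratings, best mean first
popular = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(popular)
```

Movie "C" has a perfect mean but only one rating, so the threshold filters it out. Such a baseline recommends the same list to everyone and is only meant as a point of comparison.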

## Implementing a model

For the second task I chose the second option: to implement a recommendation model based on either Content Filtering or Collaborative Filtering. To get an overview of the difference between the two, I asked ChatGPT. The comparison suggests that Collaborative Filtering is probably the best choice for this domain, which matches my impression after reading up on the two approaches on Wikipedia.

ChatGPT4: Here's a table that compares Content Filtering and Collaborative Filtering across several dimensions:

| Feature                         | Content Filtering                                   | Collaborative Filtering                                      |
|---------------------------------|-----------------------------------------------------|--------------------------------------------------------------|
| **Data Required**               | Item features (e.g., genre, author)                  | User-item interactions (e.g., ratings, views)                |
| **Recommendation Basis**        | Similarity between item features and user preferences| Similarity between users or items based on user interactions |
| **Advantages**                  | - Privacy-friendly <br> - Can recommend new items <br> - Transparent reasoning      | - Diverse recommendations <br> - Can discover serendipitous items <br> - Effective without item metadata            |
| **Limitations**                 | - Limited by item features <br> - Risk of over-specialization <br> - Cold start problem for new users | - Cold start problem for new users/items <br> - Requires large amounts of data <br> - Scalability issues            |
| **Application Examples**        | - E-commerce product recommendations <br> - Online libraries and content platforms | - Movie, music, and book recommendations <br> - E-commerce and social networking sites                             |
| **User/Item Newness Handling**  | Can handle new items if item features are available  | Struggles with new users and items due to lack of interaction data                                                 |
| **Diversity of Recommendations**| May recommend items too similar to user's past likes | Generally offers more diverse recommendations                |
| **Requirement for Metadata**    | High (needs detailed item features)                   | Low (relies on user behavior rather than item specifics)      |

Since Collaborative Filtering was chosen, genres and tags are not needed as features for this implementation. As the specific Collaborative Filtering method, ChatGPT suggested Singular Value Decomposition (SVD), which copes with the sparse data and the large number of rows.

## Theory

Singular Value Decomposition, or SVD, is a powerful technique that breaks down a matrix $A$ into three distinct matrices: $U$, $\Sigma$, and $V^{T}$. Here's how it goes:

- Imagine $A$ as a grid where rows represent users and columns represent items, filled with ratings users have given to items.
- $U$ is a matrix where each row is a "user feature vector," representing hidden characteristics of one user; each column corresponds to one latent feature.
- $\Sigma$ is a diagonal matrix whose entries are singular values. These values rank the importance of each latent feature, from most to least impactful.
- $V^{T}$, the transpose of $V$, contains "item feature vectors" in its columns, showing hidden characteristics of items.

The magic formula looks like this: $A = U \Sigma V^{T}$.

By breaking down $A$ into these components, we can uncover hidden patterns in how users rate items. It's all about finding out what common tastes or preferences (represented by the latent features in $U$ and $V$) users share.

The singular values in $\Sigma$ tell us how significant each hidden feature is. By keeping only the top few features (the highest values in $\Sigma$), we can simplify the complex world of user-item ratings into something more manageable, yet still incredibly insightful for making recommendations.

This process, in essence, allows us to predict how a user might rate items they haven't encountered yet, based on the latent features. It's a cornerstone for building recommendation systems that feel almost psychic in their ability to guess what users will like.
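
To make the decomposition concrete, here is a minimal NumPy sketch on a small, made-up and fully filled rating matrix. It computes $A = U \Sigma V^{T}$ and then keeps only the top $k$ singular values. The matrix and the choice `k = 2` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical dense 4 users x 3 items rating matrix (real rating data is sparse)
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: U is 4x3, s holds the singular values (largest first), Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest latent features
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 2))
print("rank-2 approximation:\n", np.round(A_k, 2))
```

Note that a real rating matrix is mostly empty, so this exact decomposition cannot be applied to it directly. The `SVD` algorithm in the Surprise library instead learns user and item factors by stochastic gradient descent on the observed ratings only, and predicts $\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$ and $q_i$ are the latent vectors.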

## Implementing the model
The basis for the Collaborative Filtering using SVD is the `SVD` algorithm from the Surprise library (scikit-surprise).

As scikit-surprise could not be installed with pip, a conda environment was created and the required packages were installed there.

In [37]:
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy

# Load the dataset into Surprise's format.
# Fitting is fast enough to use the full dataset: model.fit() on all rows runs in
# just under 14 minutes on my 8-year-old laptop. But due to the size of the pickle
# file, the smaller dataset of 1M rows will be used.

reader = Reader(rating_scale=(smaller_df['rating'].min(), smaller_df['rating'].max()))
data = Dataset.load_from_df(smaller_df[['userId', 'movieId', 'rating']], reader)

model = SVD()

# Perform 5-fold cross-validation to check results
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True, n_jobs=-2)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9151  0.9138  0.9178  0.9170  0.9181  0.9164  0.0017  
MAE (testset)     0.7044  0.7032  0.7058  0.7058  0.7063  0.7051  0.0011  
Fit time          17.07   17.96   17.76   15.08   13.87   16.35   1.60    
Test time         3.92    4.09    4.53    2.72    2.72    3.60    0.74    


In [38]:
# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Train the model on the training set
model.fit(trainset)

# Predict ratings for the test set
predictions = model.test(testset)

# Evaluate the predictions
rmse = accuracy.rmse(predictions)

RMSE: 0.9159


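The results look good: RMSE on the test set is consistently around 0.92, both in the 5-fold cross-validation and on the separate 80/20 split, and MAE is about 0.71. The small standard deviation across folds (0.0017 for RMSE) shows the result is stable.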
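
For reference, the two error measures can be computed by hand. The sketch below uses made-up numbers, not predictions from the model above.

```python
import numpy as np

# Hypothetical true and predicted ratings
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = predicted - actual

# RMSE squares the errors first, so large misses weigh more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# MAE is the plain average size of the miss
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

Because of the squaring, RMSE is never smaller than MAE on the same predictions, which is also visible in the cross-validation table.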

### Now to input some favourite movies

In [39]:
# User input process
user_favorites_titles = []
print("Enter your favorite movies (press enter to finish):")
while True:
    title_input = input()
    if title_input == "":
        break
    user_favorites_titles.append(title_input)

# Find matching movie IDs
favorite_movie_ids = []

for user_title in user_favorites_titles:
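    # Case-insensitive substring match; regex=False treats the input as literal text,
    # so titles typed with parentheses such as "(1995)" still match
    matches = movie_df[movie_df['title'].str.contains(user_title, case=False, na=False, regex=False)]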
    if not matches.empty:
        for _, row in matches.iterrows():
            favorite_movie_ids.append(row['movieId'])
            print(f"Found: {row['title']} with ID: {row['movieId']}")
    else:
        print(f"No matches found for: {user_title}")

print("Your favorite movie IDs:", favorite_movie_ids)


Enter your favorite movies (press enter to finish):
Found: Emil i Lönneberga (1971) with ID: 51354
Your favorite movie IDs: [51354]
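
One thing to keep in mind with `str.contains`: by default pandas interprets the search text as a regular expression, so characters such as parentheses get a special meaning. Since MovieLens titles end with a year in parentheses, a title typed in full may fail to match unless `regex=False` is passed. A small sketch with two titles from the data:

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)"])
query = "Toy Story (1995)"

# Default: the query is a regular expression, "(1995)" becomes a capture group
# (pandas may also emit a warning about match groups here)
as_regex = titles.str.contains(query, case=False, na=False)

# regex=False: the query is matched as literal text
as_literal = titles.str.contains(query, case=False, na=False, regex=False)

print("regex  :", as_regex.tolist())
print("literal:", as_literal.tolist())
```

Another option would be to escape the input with `re.escape` before searching.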


In [58]:
# New user details. Note: this ID is not in the training set, so the model has no
# user bias or user factors for it. Each estimate is then the global mean plus the
# movie's learned bias (if the movie is known), i.e. the same ranking for any new user.
new_user_id = 'new_user'

# Get the list of all movies in the dataset
all_movie_ids = set(smaller_df['movieId'].unique())

# Exclude the movies that were entered as favorites above
movies_to_predict = list(all_movie_ids - set(favorite_movie_ids))

# Create a list of tuples in the form of (new_user_id, movieId, actual_rating)
# The rating is only used for error metrics, not for the estimate, so a dummy value is fine
testset = [(new_user_id, movie_id, 5.) for movie_id in movies_to_predict]  # Dummy rating of 5

# Predict ratings for all movies the new user hasn't rated
predictions = model.test(testset)

# Convert predictions to a list of (movieId, predicted_rating) tuples
predicted_ratings = [(pred.iid, pred.est) for pred in predictions]

# Sort the predictions by estimated rating in descending order
predicted_ratings.sort(key=lambda x: x[1], reverse=True)

# Get the top 10 recommendations
top_recommendations = predicted_ratings[:10]

print("Top 10 movie recommendations for you:")
print("=====================================")
for i, (movie_id, rating) in enumerate(top_recommendations, start=1):
    movie_name = movie_df.loc[movie_df['movieId'] == movie_id, 'title'].values[0]
    print(f"{i}) {movie_name}, Predicted Rating: {rating:.1f}")

# Get the top 10 movies to stay away from (the last entry is the lowest prediction)
bottom_recommendations = predicted_ratings[-10:]

print("\nTop 10 movie recommendations for you to avoid:")
print("##############################################")
# Count down from 10 so that the worst movie ends up as number 1
for i, (movie_id, rating) in zip(range(10, 0, -1), bottom_recommendations):
    movie_name = movie_df.loc[movie_df['movieId'] == movie_id, 'title'].values[0]
    print(f"{i}) {movie_name}, Predicted Rating: {rating:.1f}")


Top 10 movie recommendations for you:
1) Planet Earth II (2016), Predicted Rating: 4.6
2) Planet Earth (2006), Predicted Rating: 4.5
3) Children of Paradise (Les enfants du paradis) (1945), Predicted Rating: 4.5
4) Lady Eve, The (1941), Predicted Rating: 4.4
5) Band of Brothers (2001), Predicted Rating: 4.4
6) Shawshank Redemption, The (1994), Predicted Rating: 4.4
7) Jetée, La (1962), Predicted Rating: 4.4
8) Godfather, The (1972), Predicted Rating: 4.4
9) Mulholland Dr. (1999), Predicted Rating: 4.4
10) George Carlin: Life Is Worth Losing (2005), Predicted Rating: 4.4

Top 10 movie recommendations for you to avoid:
##############################################
10) Pokémon 3: The Movie (2001), Predicted Rating: 1.9
9) Norbit (2007), Predicted Rating: 1.9
8) Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie) (2002), Predicted Rating: 1.9
7) Kazaam (1996), Predicted Rating: 1.9
6) Problem Child 2 (1991), Predicted Rating: 1.8
5) From Justin to Kelly (2003), Predicted Rating: 1.8
4) Glitter (
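
A caveat on these lists: the ID `'new_user'` was never seen during training, so Surprise's `SVD` has no user bias or user factors for it and falls back to the global mean plus each movie's bias. The favourites only decide which movies are excluded. The most direct remedy would be to append the new user's ratings to the data and refit. An alternative, sketched below in plain NumPy, is "fold-in": keep the item parameters fixed and solve a small ridge regression for the new user's vector. All numbers here are made up; with a fitted model the item parameters would presumably come from its `qi` and `bi` attributes and the training set's `global_mean`, and `reg` is a hypothetical regularisation strength.

```python
import numpy as np

# Made-up stand-ins for a fitted model's parameters (not real output)
global_mean = 3.5
item_bias = np.array([0.4, -0.2, 0.1, -0.5])   # b_i, one per item
item_factors = np.array([                      # q_i, one row per item
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# The new user gave 5 stars to item 0 and rated nothing else
rated_idx = np.array([0])
ratings = np.array([5.0])

# Cold start: without user parameters the estimate is mu + b_i for everyone
cold_start = global_mean + item_bias

# Fold-in: ridge regression for the user vector p_u on the rated items only
reg = 0.1  # hypothetical regularisation strength
Q = item_factors[rated_idx]
residual = ratings - global_mean - item_bias[rated_idx]
k = item_factors.shape[1]
p_u = np.linalg.solve(Q.T @ Q + reg * np.eye(k), Q.T @ residual)

# Personalised estimate: mu + b_i + q_i . p_u
personalised = global_mean + item_bias + item_factors @ p_u

print("cold start  :", np.round(cold_start, 2))
print("personalised:", np.round(personalised, 2))
```

In this toy setup, items whose factors resemble the favourite gain more than the others, which is the personalisation that the cold-start estimate lacks.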

In [41]:
from joblib import dump, load
dump(model, './save/SVD_movie_model_1M.pkl', compress=5)


['./save/SVD_movie_model_1M.pkl']
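
The saved file can later be read back with `load`, which the cell above already imports. A minimal round-trip sketch, using a temporary directory and a made-up dictionary in place of the fitted model:

```python
import os
import tempfile
from joblib import dump, load

# Hypothetical stand-in for the fitted model
toy_model = {"name": "stand-in for the fitted SVD model", "version": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    dump(toy_model, path, compress=5)  # same compression level as above
    restored = load(path)              # read the object back from disk

print(restored == toy_model)
```

For the real model, the same `load` call on `'./save/SVD_movie_model_1M.pkl'` should return an object whose `predict` and `test` methods can be used without retraining, provided the same Surprise version is installed.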