# Movie Recommendations task
The task is to create a model that recommends movies. The first step is to explore the data and build a simple model.

The next step is to describe and implement a standard recommendation technique. The suggested techniques are content filtering and collaborative filtering. Collaborative filtering was chosen because it can offer more diverse recommendations and can suggest movies that the user might not otherwise have found.

A first look at the MovieLens archive led to the following scoping decisions:
- Ignore `tags.csv` and `links.csv`.
- From `movies.csv`, use the `genres` column; from `ratings.csv`, use all columns.

The full dataset contains 25,000,000 ratings, which is too many to train on comfortably, so a random subset of 200,000 ratings was drawn for training.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
movie_df = pd.read_csv('moviedata/movies.csv')
rating_df = pd.read_csv('moviedata/ratings.csv')

Putting relevant information in one dataframe:

In [2]:
rating_df = rating_df.drop('timestamp', axis=1)
rating_df = rating_df.dropna()
rating_df['title'] = rating_df['movieId'].map(movie_df.set_index('movieId')['title'])

In [6]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,genres,title
0,1,296,5.0,Adventure|Animation|Children|Comedy|Fantasy,Pulp Fiction (1994)
1,1,306,3.5,Adventure|Children|Fantasy,Three Colors: Red (Trois couleurs: Rouge) (1994)
2,1,307,5.0,Comedy|Romance,Three Colors: Blue (Trois couleurs: Bleu) (1993)
3,1,665,5.0,Comedy|Drama|Romance,Underground (1995)
4,1,899,3.5,Comedy,Singin' in the Rain (1952)
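
The `genres` values in the output above do not match the titles (Pulp Fiction is listed as Adventure|Animation|Children|Comedy|Fantasy). That would be consistent with the column having been assigned by row position rather than looked up by `movieId`. A minimal sketch of the difference, assuming the column names of `movies.csv` and `ratings.csv`; the toy rows and the names `wrong` and `right` are illustrative only:

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative rows only)
movies = pd.DataFrame({
    "movieId": [1, 296],
    "title": ["Toy Story (1995)", "Pulp Fiction (1994)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Comedy|Crime|Drama|Thriller"],
})
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [296, 1], "rating": [5.0, 4.0]})

# Direct assignment aligns on the row index, not on movieId,
# so row 0 of ratings receives the genres of row 0 of movies.
wrong = ratings.assign(genres=movies["genres"])

# A left merge on movieId attaches the title and genres of the rated movie.
right = ratings.merge(movies, on="movieId", how="left")

print(wrong[["movieId", "genres"]])
print(right[["movieId", "title", "genres"]])
```

Mapping through `movie_df.set_index('movieId')['genres']`, in the same way the `title` column is built, should give the same result as the merge.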


In [17]:
# Checking number of unique movies in the dataset
rating_df['title'].nunique()

58958

To get a more manageable dataset, 200,000 rows are sampled from the 25M ratings.

In [23]:
smaller_df = rating_df.sample(n=200000, random_state=1)
smaller_df.shape

(200000, 5)

Check how many unique movies are in the sample:

In [24]:
smaller_df['title'].nunique()

13165

Count the 5-star ratings per movie, since 5-star ratings are the signal used to recommend movies:

In [28]:
smaller_df[(smaller_df['rating'] == 5)]['title'].value_counts()

title
Shawshank Redemption, The (1994)             334
Pulp Fiction (1994)                          266
Star Wars: Episode IV - A New Hope (1977)    215
Schindler's List (1993)                      210
Forrest Gump (1994)                          201
                                            ... 
Scary Movie 4 (2006)                           1
One False Move (1992)                          1
She's Gotta Have It (1986)                     1
Fallen Angels (Duo luo tian shi) (1995)        1
Shock Corridor (1963)                          1
Name: count, Length: 4408, dtype: int64

The 4,408 movies with at least one 5-star rating are about a third of the 13,165 movies in the sample, which presumably covers most of the movies that users are likely to enjoy.

For the simpler model, one option is to predict movie preference from tags and genres. Using only genres may be preferable, since adding tags would produce too many features.
The features would probably need to be one-hot encoded, and the dataset would need to be downloaded again.
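
One-hot encoding of the pipe-separated `genres` column could be sketched as follows. This is a minimal example on toy rows, assuming the `movies.csv` column layout; the name `genre_features` is illustrative and not part of the notebook:

```python
import pandas as pd

# Toy rows in the movies.csv layout (illustrative only)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Drama"],
})

# str.get_dummies splits each string on the separator and
# creates one 0/1 indicator column per distinct genre.
genre_features = movies["genres"].str.get_dummies(sep="|")
genre_features.insert(0, "movieId", movies["movieId"])

print(genre_features)
```

MovieLens uses roughly 20 distinct genre labels, so the feature count stays small with genres alone, whereas free-text tags would add far more columns.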

## Implementing a simple model

A simple linear regression model was the first candidate because of its speed, but I was not sure how to implement it for this task, so I used a more direct approach instead.
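
For reference, a linear regression on one-hot genre features could look roughly like the sketch below. It is not the model used in this notebook. It assumes the ratings of a single hypothetical user, uses made-up values, and fits ordinary least squares with `np.linalg.lstsq` without an intercept term:

```python
import numpy as np

# Toy design matrix: one row per rating, one 0/1 column per genre
# (columns: Action, Comedy, Drama); values are illustrative only.
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)
y = np.array([4.0, 5.0, 2.0, 3.0, 3.0, 4.0])  # ratings given by the user

# Ordinary least squares: find the weights w that minimize ||X w - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(genre_row):
    """Predicted rating for a movie described by its 0/1 genre vector."""
    return float(np.asarray(genre_row, dtype=float) @ w)

print(predict([1, 0, 0]))
```

With one genre per movie and no intercept, each weight is simply the mean rating for that genre; movies with several genres get the sum of their genre weights.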

The following table, generated by ChatGPT-4, compares content filtering and collaborative filtering across several dimensions:

| Feature                         | Content Filtering                                   | Collaborative Filtering                                      |
|---------------------------------|-----------------------------------------------------|--------------------------------------------------------------|
| **Data Required**               | Item features (e.g., genre, author)                  | User-item interactions (e.g., ratings, views)                |
| **Recommendation Basis**        | Similarity between item features and user preferences| Similarity between users or items based on user interactions |
| **Advantages**                  | - Privacy-friendly <br> - Can recommend new items <br> - Transparent reasoning      | - Diverse recommendations <br> - Can discover serendipitous items <br> - Effective without item metadata            |
| **Limitations**                 | - Limited by item features <br> - Risk of over-specialization <br> - Cold start problem for new users | - Cold start problem for new users/items <br> - Requires large amounts of data <br> - Scalability issues            |
| **Application Examples**        | - E-commerce product recommendations <br> - Online libraries and content platforms | - Movie, music, and book recommendations <br> - E-commerce and social networking sites                             |
| **User/Item Newness Handling**  | Can handle new items if item features are available  | Struggles with new users and items due to lack of interaction data                                                 |
| **Diversity of Recommendations**| May recommend items too similar to user's past likes | Generally offers more diverse recommendations                |
| **Requirement for Metadata**    | High (needs detailed item features)                   | Low (relies on user behavior rather than item specifics)      |

My initial impression from reading about the two techniques was that collaborative filtering is better suited to movie recommendations, and ChatGPT's comparison supported this.

Thus genres and tags are not necessary features for this implementation.
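
To make the idea concrete, here is a minimal sketch of item-based collaborative filtering on toy data. It assumes the `ratings.csv` column names; the ratings themselves and the helper `similar_items` are illustrative only. Unrated entries are filled with 0, which is a simplification:

```python
import numpy as np
import pandas as pd

# Toy ratings in the ratings.csv layout (illustrative only)
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 40, 30, 40, 10, 20],
    "rating":  [5, 5, 1, 4, 5, 2, 5, 4, 5, 4],
})

# User-item matrix: one row per user, one column per movie
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

# Cosine similarity between the item (column) vectors
items = matrix.to_numpy().T
norms = np.linalg.norm(items, axis=1, keepdims=True)
similarity = pd.DataFrame(
    (items @ items.T) / (norms @ norms.T),
    index=matrix.columns, columns=matrix.columns,
)

def similar_items(movie_id, n=2):
    """Return the n movies most similar to movie_id, excluding itself."""
    return similarity[movie_id].drop(movie_id).nlargest(n)

print(similar_items(10))
```

Only user-item interactions are used here; no genre or tag columns are needed.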

In [15]:
# First implementing my own movie recommendation algorithm

# A user enters at least one movie they like. The algorithm will then find the most similar movies to the one entered.

# Say the user enters Pulp Fiction. It will then find all users who rated that movie 5 stars. It will then find all the other movies
# those users rated 5 stars and recommend the top 3 of those movies to the user. 

user_pref_list = []
print("Enter a movie you like. The algorithm will then find the most similar movies to the one entered.")
print("Break by entering nothing and pressing enter.")
while True:
    user_pref_list.append(input("Enter a movie you like: "))
    if user_pref_list[-1] == "":
        user_pref_list.pop()
        break

import re

# Escape regex metacharacters so the entered titles are matched literally
pattern = '|'.join(re.escape(title) for title in user_pref_list)

print(f"\nYou entered: {user_pref_list}")
print("Hang on, this will take a while ... ")

# Use str.contains() to match the title, ignoring case and allowing for additional characters (like the year in parentheses)
# Note: Be aware that this method might also match titles that contain the search string as a substring.
user_id_5_star = rating_df[
    rating_df['title'].str.contains(pattern, case=False, regex=True) & 
    (rating_df['rating'] == 5)
]['userId']

# Sample up to 100 users who rated the movie 5 stars
sample_user_id = user_id_5_star.sample(min(100, len(user_id_5_star)), random_state=1)

# Find all the movies those users rated 5 stars (this was auto suggested by GitHub Copilot)
recommended_movies = rating_df[
    rating_df['userId'].isin(sample_user_id) & 
    ~rating_df['title'].str.contains(pattern, case=False, regex=True) & 
    (rating_df['rating'] == 5)
]['title'].value_counts().head(3)

print("Recommended movies based on your input:")
print(recommended_movies)

Enter a movie you like. The algorithm will then find the most similar movies to the one entered.
Break by entering nothing and pressing enter.
You entered: ['Pulp Fiction', 'Matrix']
Hang on, this will take a while ... 
Recommended movies based on your input:
title
Shawshank Redemption, The (1994)    44
Fight Club (1999)                   34
Silence of the Lambs, The (1991)    34
Name: count, dtype: int64
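
The same logic can also be wrapped in a function so that it can be tried without interactive input. This is a sketch under a few assumptions: the function name `recommend` and its parameters are illustrative, the toy data is made up, and the input frame has `userId`, `title` and `rating` columns like `rating_df`:

```python
import re
import pandas as pd

def recommend(ratings, liked_titles, n_users=100, top_n=3, seed=1):
    """Recommend titles that fans of liked_titles also rated 5 stars."""
    if not liked_titles:
        raise ValueError("liked_titles must contain at least one title")
    pattern = "|".join(re.escape(t) for t in liked_titles)
    is_liked = ratings["title"].str.contains(pattern, case=False, regex=True, na=False)
    five_star = ratings["rating"] == 5

    # Users who gave 5 stars to at least one of the liked titles
    fans = ratings.loc[is_liked & five_star, "userId"].drop_duplicates()
    fans = fans.sample(min(n_users, len(fans)), random_state=seed)

    # Other titles those users rated 5 stars, most frequent first
    mask = ratings["userId"].isin(fans) & ~is_liked & five_star
    return ratings.loc[mask, "title"].value_counts().head(top_n)

# Toy data (illustrative only)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 3],
    "title": ["Pulp Fiction (1994)", "Fight Club (1999)",
              "Pulp Fiction (1994)", "Fight Club (1999)",
              "Pulp Fiction (1994)", "Fargo (1996)", "Fight Club (1999)"],
    "rating": [5, 5, 5, 5, 5, 5, 3],
})
print(recommend(toy, ["Pulp Fiction"]))
```

Unlike the cell above, this version counts each fan only once, even if they rated several of the entered movies 5 stars.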
