# IMPLEMENTING A BASIC RECOMMENDER

## Instructor: Ekpe Okorafor
## The CODATA-RDA School for Research Data Science


## Introduction:

In this hands-on exercise, you will build a basic movie recommendation engine using two approaches: content-based filtering and collaborative filtering.

### The Dataset

The dataset comes from MovieLens and is publicly available at https://files.grouplens.org/datasets/movielens/ml-latest-small.zip.

To keep the recommender simple, we will use the smallest dataset available (ml-latest-small.zip).



## Exercise #1:

 1. Download, save and extract files

 2. Note where the files are located. You will need the path shortly

 3. Examine the files (movies & ratings) in Excel or another spreadsheet program to get a sense of the file structures

 4. Optionally, delete the timestamp column in the ratings CSV file; it is not used in this exercise

 5. Explore the structure of the files
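
As a rough sketch of step 4, the timestamp column can also be dropped in pandas rather than in a spreadsheet. The `sample_ratings` frame below is made-up data that only mimics the column layout of `ratings.csv`; with the real file you would apply the same `drop` call to the DataFrame you load.

```python
import pandas as pd

# Made-up rows that mimic the layout of ratings.csv (not real MovieLens data)
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# Drop the timestamp column; errors="ignore" keeps this safe if it was already removed
sample_ratings = sample_ratings.drop(columns=["timestamp"], errors="ignore")

print(sample_ratings.columns.tolist())
print(sample_ratings.dtypes)
```

Dropping the column in code leaves the original CSV untouched, which makes the step easy to repeat.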



## 1. Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


## 2. Load the Dataset

In [3]:
# Load the Ratings Data
ratings = pd.read_csv('ratings.csv')

# Load the Movies Data
movies = pd.read_csv('movies.csv')

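In [None]:
# Optional (Google Colab only): mount Google Drive if the CSV files are stored there.
# Run this cell before loading the data and point pd.read_csv at the mounted path.
from google.colab import drive
drive.mount('/content/drive')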

## 3. Explore the Structure of the Files

### Step 1: Movies Dataset

In [None]:
# Display the structure of the movies dataset
print(f'Movies Dataset: {movies.shape[0]} rows and {movies.shape[1]} columns')
print(movies.info())
print(movies.head())


### Step 2: Ratings Dataset

In [None]:
# Display the structure of the ratings dataset
print(f'Ratings Dataset: {ratings.shape[0]} rows and {ratings.shape[1]} columns')
print(ratings.info())
print(ratings.head())


### Step 3: Display the First Few Rows of Both Datasets

In [None]:
# Display the first few rows of both datasets

print(ratings.head())
print(movies.head())

## 4. Cool Visualizations

## Exercise #2:

Yesterday, you did visualizations in Python. Now is your time to brag about your awesome visualization skills. Let us see who comes up with the coolest innovative visualizations from the ratings dataset.

 1. Create a basic histogram with count on the y-axis and unique ratings on the x-axis
 2. Now, wow us! – Do your thing, create an amazing visualization
 3. Below are some simple examples to get you started!



### Step 1: Distribution of Movie Ratings

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(ratings['rating'], bins=20, kde=False, color='blue')
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


### Step 2: Number of Ratings per Movie

In [None]:
# Count the number of ratings per movie
ratings_per_movie = ratings.groupby('movieId').size().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.histplot(ratings_per_movie, bins=50, kde=False, color='orange')
plt.title('Number of Ratings per Movie')
plt.xlabel('Number of Ratings')
plt.ylabel('Count of Movies')
plt.yscale('log')
plt.show()


### Step 3: Top 10 Movies with the Most Ratings

In [None]:
# Merge movies and ratings datasets
movie_ratings = ratings.merge(movies, on='movieId')

# Group by movie title and count the number of ratings
top_10_movies = movie_ratings.groupby('title').size().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_movies.values, y=top_10_movies.index, palette='viridis')
plt.title('Top 10 Movies with Most Ratings')
plt.xlabel('Number of Ratings')
plt.ylabel('Movie Title')
plt.show()


### Step 4: Average Ratings per Movie

In [None]:
# Calculate the average rating per movie
average_ratings = movie_ratings.groupby('title')['rating'].mean().sort_values(ascending=False)

# Keep only movies with at least 100 ratings
popular_movies = movie_ratings.groupby('title').filter(lambda x: x['rating'].count() >= 100)
average_ratings_filtered = popular_movies.groupby('title')['rating'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=average_ratings_filtered.values, y=average_ratings_filtered.index, palette='magma')
plt.title('Top 10 Movies with Highest Average Ratings (with at least 100 ratings)')
plt.xlabel('Average Rating')
plt.ylabel('Movie Title')
plt.show()


## 5. Content-Based Filtering

As the name suggests, content-based filtering analyzes an item a user has interacted with and recommends items with similar content. Content, in this case, refers to the set of attributes/features that describe an item. For a movie recommendation engine, a content-based approach recommends the movies most similar to a given movie based on their **features**, such as genres, actors, directors, and year of production. The assumption is that users prefer certain types of product, so we recommend products similar to those the user has already liked. The goal is to provide alternatives or substitutes to the item that was viewed.

We will be building a basic content-based recommender engine based on **movie titles and genres** only.
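
To make the idea of "similar in content" concrete, here is a minimal sketch that compares three movies by their genre strings alone. The titles and genres in `toy_movies` are made up for illustration, and the variable names are not part of the exercise.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up toy catalogue; titles and genres are illustrative only
toy_movies = pd.DataFrame({
    "title": ["Space Quest", "Galaxy Wars", "Love in Paris"],
    "genres": ["Action|Adventure", "Action|Adventure|Fantasy", "Romance|Drama"],
})

# Each genre string becomes a TF-IDF vector; the default tokenizer keeps
# alphanumeric words, so the "|" characters act as separators
toy_tfidf = TfidfVectorizer().fit_transform(toy_movies["genres"])

# Pairwise cosine similarity between the genre vectors
toy_sim = pd.DataFrame(
    cosine_similarity(toy_tfidf),
    index=toy_movies["title"],
    columns=toy_movies["title"],
)
print(toy_sim.round(2))
```

The two movies that share genres score well above zero against each other, while the movie with no genre in common scores zero against both.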

## Exercise #3:

 1. Combine the movie titles with genres
 2. Create a vector representation
 3. Compute the similarity matrix
 4. Show the similarity matrix for the first 5 movies
 5. Build the recommendation function
 6. Change the movie and get the recommended movies


### Step 1: Preprocess the Data

In [None]:
# Combine movie titles with genres for the content-based filtering
movies['title_genres'] = movies['title'] + ' ' + movies['genres']

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the movies' genres and titles
tfidf_matrix = tfidf_vectorizer.fit_transform(movies['title_genres'])

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Show the similarity matrix for the first 5 movies
print(cosine_sim[:5, :5])


### Step 2: Build a Recommendation Function

In [None]:
# Function to get movie recommendations based on content
def get_content_based_recommendations(title, cosine_sim=cosine_sim):
    idx = movies[movies['title'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # Skip the top match (normally the movie itself) and keep the next 10
    movie_indices = [i[0] for i in sim_scores]
    return movies['title'].iloc[movie_indices]

# Example: Get recommendations for a movie
recommended_movies = get_content_based_recommendations('Toy Story (1995)')
#recommended_movies = get_content_based_recommendations('Star Wars: Episode IV - A New Hope (1977)')

print(recommended_movies)


## 6. User-Based Collaborative Filtering Approach

The User-Based Collaborative Filtering approach groups users according to their prior usage behavior or preferences, and then recommends an item that a similar user in the same group viewed or liked. To put this in layman's terms, if user 1 liked movies A, B and C, and user 2 liked movies A and B, then movie C might make a good recommendation for user 2. This approach mimics how word-of-mouth recommendations work in real life. Note that the code below approximates this idea with a truncated SVD (matrix factorization) of the user-item matrix rather than an explicit nearest-neighbor search over users.
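
To make the word-of-mouth idea concrete, here is a minimal sketch of a neighbor-based lookup on made-up ratings. It is not the method used in the steps that follow, which rely on SVD; the names `toy_ratings`, `user_sim`, and `neighbor` are illustrative assumptions.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: 0 means "not rated"
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5, 4, 0],
        "Movie B": [4, 5, 1],
        "Movie C": [5, 0, 0],
        "Movie D": [0, 0, 5],
    },
    index=["user1", "user2", "user3"],
)

# Cosine similarity between the users' rating vectors
user_sim = pd.DataFrame(
    cosine_similarity(toy_ratings),
    index=toy_ratings.index,
    columns=toy_ratings.index,
)

target = "user2"
# The most similar other user acts as the "neighbor"
neighbor = user_sim[target].drop(target).idxmax()

# Suggest what the neighbor rated that the target has not rated yet
unseen = toy_ratings.columns[(toy_ratings.loc[target] == 0).values]
suggestions = toy_ratings.loc[neighbor, unseen]
suggestions = suggestions[suggestions > 0].sort_values(ascending=False)
print(neighbor, suggestions.index.tolist())
```

Here user2's closest neighbor is user1, so the only suggestion is Movie C, which user1 rated and user2 has not seen.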

## Exercise #4:

 1. Create the User-Item Matrix
 2. Build the recommendation function
 3. Change the user and get the recommended movies


### Step 1: Create the User-Item Matrix

In [15]:
# Create the user-item matrix
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Convert the user-item matrix into a sparse matrix
user_item_matrix_sparse = csr_matrix(user_item_matrix.values)

# Perform Singular Value Decomposition
U, sigma, Vt = svds(user_item_matrix_sparse, k=50)

# Convert sigma to a diagonal matrix
sigma = np.diag(sigma)

# Compute the predicted ratings
predicted_ratings = np.dot(np.dot(U, sigma), Vt)

# Create a DataFrame for the predicted ratings, labelled by userId (rows) and movieId (columns)
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)


### Step 2: Build a Recommendation Function

In [None]:
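# Function to recommend movies to a user based on collaborative filtering
def get_collaborative_recommendations(user_id, num_recommendations=10):
    # Look up the user's row position instead of assuming it equals userId - 1
    user_idx = user_item_matrix.index.get_loc(user_id)
    user_ratings = predicted_ratings_df.iloc[user_idx].sort_values(ascending=False)
    # Exclude movies the user has already rated
    already_rated = user_item_matrix.columns[(user_item_matrix.iloc[user_idx] > 0).values]
    user_ratings = user_ratings.drop(already_rated)
    recommended_movie_ids = user_ratings.index[:num_recommendations]
    # Return titles in ranked order (isin() would return them in catalogue order)
    return movies.set_index('movieId').loc[recommended_movie_ids, 'title']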

# Example: Get recommendations for a user
collaborative_recommendations = get_collaborative_recommendations(user_id=88)
print(collaborative_recommendations)


## 7. Evaluation

Here we evaluate the collaborative filtering model on a held-out test set using the root mean squared error (RMSE).


In [18]:
from sklearn.metrics import mean_squared_error

# Split data into training and test sets
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Create training user-item matrix
train_user_item_matrix = train_data.pivot(index='userId', columns='movieId', values='rating').fillna(0)
train_user_item_matrix_sparse = csr_matrix(train_user_item_matrix.values)

# Perform SVD on training data
U_train, sigma_train, Vt_train = svds(train_user_item_matrix_sparse, k=50)
sigma_train = np.diag(sigma_train)

# Compute predicted ratings for the training set
predicted_ratings_train = np.dot(np.dot(U_train, sigma_train), Vt_train)
predicted_ratings_train_df = pd.DataFrame(predicted_ratings_train, columns=train_user_item_matrix.columns, index=train_user_item_matrix.index)

# Create a user-item matrix for the test data
test_user_item_matrix = test_data.pivot(index='userId', columns='movieId', values='rating')

# Align the test set with the predicted ratings
aligned_test_user_item_matrix = test_user_item_matrix.reindex_like(predicted_ratings_train_df).fillna(0)

# Calculate the aligned predictions
aligned_predicted_ratings_test = predicted_ratings_train_df.loc[aligned_test_user_item_matrix.index, aligned_test_user_item_matrix.columns].fillna(0)

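# Calculate RMSE for the test set
# Note: missing test ratings were filled with 0 above, so this RMSE is averaged over
# every user-movie cell in the matrix, not only the ratings held out for testing
rmse = np.sqrt(mean_squared_error(aligned_test_user_item_matrix.values, aligned_predicted_ratings_test.values))
print(f'Collaborative Filtering RMSE: {rmse}')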




Collaborative Filtering RMSE: 0.37420556921016007
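
The RMSE above is computed over every cell of the zero-filled matrix, so it mostly reflects how close the predictions are to 0 for movies without a test rating. A common alternative is to score only the held-out ratings. The following is a minimal sketch on made-up data, assuming the same zero-fill and truncated SVD recipe; `toy_train`, `toy_test`, and `k=2` are illustrative choices, not values from the exercise.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Made-up training ratings in the same long format as ratings.csv
toy_train = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 40, 10, 40],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 5.0, 4.0, 5.0, 1.0],
})
# Made-up held-out ratings for users and movies that also appear in training
toy_test = pd.DataFrame({
    "userId": [2, 4], "movieId": [30, 20], "rating": [1.0, 4.0],
})

# Zero-filled user-item matrix, then truncated SVD
train_matrix = toy_train.pivot(index="userId", columns="movieId", values="rating").fillna(0)
U_toy, s_toy, Vt_toy = svds(csr_matrix(train_matrix.values), k=2)
pred = pd.DataFrame(U_toy @ np.diag(s_toy) @ Vt_toy,
                    index=train_matrix.index, columns=train_matrix.columns)

# Look up a prediction only for each (user, movie) pair that has a test rating
row_pos = pred.index.get_indexer(toy_test["userId"])
col_pos = pred.columns.get_indexer(toy_test["movieId"])
known = (row_pos >= 0) & (col_pos >= 0)  # skip users/movies unseen in training
test_pred = pred.values[row_pos[known], col_pos[known]]
test_true = toy_test["rating"].values[known]

observed_rmse = np.sqrt(np.mean((test_true - test_pred) ** 2))
print(f"RMSE on observed test ratings only: {observed_rmse:.3f}")
```

On real data this figure is typically much larger than an RMSE averaged over all cells, because zero-filled SVD tends to predict values near 0 for movies a user has not rated.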


## Bonus Exercise

### Task 1: Content-Based Filtering Enhancement
 1. Incorporate Additional Features:
    - The current content-based filtering uses only movie titles and genres. Try incorporating tags (if available) or keywords associated with movies to improve the recommendations.
    - You can simulate this by creating a new column that combines the title, genres, and some synthetic tags you create (e.g., "action-packed," "romantic," "classic").
    

 2. Adjust the TF-IDF Vectorizer:
    - Experiment with different parameters of the TfidfVectorizer, such as max_features, ngram_range, or max_df/min_df, to see how they affect the quality of recommendations.
    - Compare the results before and after these changes by getting recommendations for the same movie.
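
One possible way to start experimenting is to loop over a few vectorizer settings and see how the feature space changes. The sketch below uses made-up "title + genres" strings and arbitrary parameter values; the `settings` dictionary is a hypothetical starting point, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up "title + genres" strings, standing in for movies['title_genres']
toy_docs = [
    "Robot Picnic (1995) Adventure|Animation|Comedy",
    "Robot Picnic 2 (1999) Adventure|Animation|Comedy",
    "Night Heist (1995) Action|Crime|Thriller",
    "Card Sharks (1995) Crime|Drama",
]

# Hypothetical settings to compare; the values are arbitrary starting points
settings = {
    "default": {},
    "bigrams": {"ngram_range": (1, 2)},
    "min_df=2": {"min_df": 2},
}

vocab_sizes = {}
for name, params in settings.items():
    vectorizer = TfidfVectorizer(stop_words="english", **params)
    matrix = vectorizer.fit_transform(toy_docs)
    # The number of columns equals the number of features the vectorizer kept
    vocab_sizes[name] = matrix.shape[1]
    print(f"{name}: {matrix.shape[1]} features")
```

Adding bigrams enlarges the vocabulary, while `min_df=2` shrinks it to terms that appear in at least two documents. Rebuilding the similarity matrix under each setting shows how the recommendations shift.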


### Hint

 1. Create a synthetic 'tags' column (this can be adjusted as needed)
 2. Combine title, genres, and tags for a richer content description
 3. Re-run the TF-IDF Vectorizer with the new combined column
 4. Compute the cosine similarity matrix
 5. Get recommendations for a movie with the enhanced content-based filtering
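
The hint steps can be sketched end to end on a tiny made-up catalogue. Everything in `toy_catalogue`, including the synthetic `tags` column, is invented for illustration, and `recommend_enhanced` is a hypothetical helper name rather than part of the exercise.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalogue; the 'tags' values are synthetic, as the hint suggests
toy_catalogue = pd.DataFrame({
    "title": ["Robot Picnic (1995)", "Night Heist (1995)",
              "Harbor Lights (1942)", "Steel Rain (2001)"],
    "genres": ["Animation|Comedy", "Action|Crime", "Drama|Romance", "Action|Thriller"],
    "tags": ["family friendly", "action-packed gritty",
             "romantic classic", "action-packed explosive"],
})

# Steps 1-2: combine title, genres, and tags into one text field
toy_catalogue["content"] = (
    toy_catalogue["title"] + " " + toy_catalogue["genres"] + " " + toy_catalogue["tags"]
)

# Steps 3-4: TF-IDF vectors and cosine similarity on the combined field
enhanced_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_catalogue["content"])
enhanced_sim = cosine_similarity(enhanced_tfidf)

def recommend_enhanced(title, top_n=2):
    """Return the top_n most similar titles, excluding the query movie itself."""
    idx = toy_catalogue.index[toy_catalogue["title"] == title][0]
    scores = pd.Series(enhanced_sim[idx], index=toy_catalogue["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n)

# Step 5: recommendations for one movie
print(recommend_enhanced("Night Heist (1995)"))
```

In this toy data the shared genre and tag words pull the two action movies together, which is the effect the extra tags are meant to produce.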


### Task 2: Visualization

 1. Visualize Similarity:
    - Create a heatmap of the cosine similarity matrix for a small subset of movies (e.g., the top 20 most-rated movies). This visualization will help you understand which movies are considered similar.
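
As a starting point, a heatmap of a small similarity matrix might look like the sketch below. The matrix here is built from random placeholder vectors, not from the MovieLens data, and the labels are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up similarity matrix standing in for a slice of cosine_sim
labels = ["Movie A", "Movie B", "Movie C", "Movie D"]
rng = np.random.default_rng(0)
vectors = rng.random((4, 6))  # placeholder feature vectors
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
toy_similarity = pd.DataFrame(unit @ unit.T, index=labels, columns=labels)

# Annotated heatmap; vmin/vmax pin the colour scale to the 0-1 range
# that cosine similarity takes for non-negative feature vectors
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(toy_similarity, annot=True, fmt=".2f", cmap="viridis",
            vmin=0, vmax=1, ax=ax)
ax.set_title("Cosine Similarity Between Movies (toy data)")
fig.tight_layout()
```

In the notebook, the figure renders inline; in a script, call `plt.show()` afterwards. For the real task, one option is to select the rows and columns of `cosine_sim` for the chosen movies (for example with `np.ix_`) and label them with the movie titles.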
