In [1]:
import numpy as np

# Create synthetic data that mimics the structure of the .mat file
# -----------------------------------------------------------------

# Create a list of sample movie titles (50 movies total)
movie_titles = [
    "The Shawshank Redemption", "The Godfather", "The Dark Knight", "Pulp Fiction",
    "Schindler's List", "The Lord of the Rings: The Return of the King", "Fight Club",
    "Forrest Gump", "Inception", "The Matrix", "Goodfellas", "Star Wars: Episode V",
    "One Flew Over the Cuckoo's Nest", "The Silence of the Lambs", "Interstellar",
    "The Usual Suspects", "The Green Mile", "The Prestige", "Gladiator", "Whiplash",
    "The Departed", "The Lion King", "Back to the Future", "Alien", "Django Unchained",
    "The Shining", "Parasite", "Apocalypse Now", "Good Will Hunting", "Raiders of the Lost Ark",
    "City of God", "The Pianist", "Memento", "Avengers: Endgame", "The Intouchables",
    "The Dark Knight Rises", "American History X", "Casablanca", "Psycho", "Joker",
    "Braveheart", "The Godfather Part II", "Toy Story", "Inglourious Basterds", "Eternal Sunshine",
    "Amadeus", "Full Metal Jacket", "The Sixth Sense", "Scarface", "No Country for Old Men"
]

# Reshape movie titles into a 2D array of shape (50,1)
# This mimics the cell array structure in MATLAB
movies = np.array(movie_titles, dtype=object).reshape(-1, 1)

# Generate synthetic user ratings data
# Assume 200 users rating 50 movies on a scale of 0-5 (0 means not rated)
num_users = 200
num_movies = 50
np.random.seed(42)  # For reproducibility

# Create user ratings matrix with some sparsity (some movies not rated by all users)
users_movies = np.random.choice([0, 1, 2, 3, 4, 5], size=(num_users, num_movies),
                               p=[0.3, 0.1, 0.1, 0.2, 0.15, 0.15])

# Select 20 most popular movies (simplification: just take first 20)
index_small = np.arange(20)  # Indices of the 20 "popular" movies

# Extract ratings for these 20 popular movies
users_movies_sort = users_movies[:, index_small]

# Create a trial user vector (ratings for the 20 popular movies)
trial_user = np.array([5, 4, 0, 5, 3, 4, 0, 5, 4, 3, 0, 5, 4, 3, 5, 4, 0, 3, 5, 4])

print("======= MOVIE RECOMMENDER SYSTEM =======")
print(f"Movies shape: {movies.shape}")
print(f"Users Movies shape: {users_movies.shape}")
print(f"Users Movies Sort shape: {users_movies_sort.shape}")
print(f"Index Small shape: {index_small.shape}")
print(f"Trial User shape: {trial_user.shape}")
print(f"Dimensions of users_movies: {users_movies.shape[0]} rows (users), {users_movies.shape[1]} columns (movies)")

# Print the titles of the 20 most popular movies
print('\nRating is based on these movies:')
for idx in index_small:
    print(f"{idx+1}. {movies[idx][0]}")

# Get the dimensions of the users_movies_sort matrix
m1, n1 = users_movies_sort.shape

Movies shape: (50, 1)
Users Movies shape: (200, 50)
Users Movies Sort shape: (200, 20)
Index Small shape: (20,)
Trial User shape: (20,)
Dimensions of users_movies: 200 rows (users), 50 columns (movies)

Rating is based on these movies:
1. The Shawshank Redemption
2. The Godfather
3. The Dark Knight
4. Pulp Fiction
5. Schindler's List
6. The Lord of the Rings: The Return of the King
7. Fight Club
8. Forrest Gump
9. Inception
10. The Matrix
11. Goodfellas
12. Star Wars: Episode V
13. One Flew Over the Cuckoo's Nest
14. The Silence of the Lambs
15. Interstellar
16. The Usual Suspects
17. The Green Mile
18. The Prestige
19. Gladiator
20. Whiplash


# Movie Recommendation System: How It Works

## Introduction

This document explains the movie recommendation system implemented in the accompanying Python code. The system uses collaborative filtering techniques to suggest movies based on user ratings and similarities between users.

## Data Structure

The system works with the following key data structures:

1. **Movies Matrix** (`movies`): A `(50, 1)` object array of movie titles, mimicking a MATLAB cell array.
2. **User-Movie Ratings Matrix** (`users_movies`): A matrix where each row represents a user and each column represents a movie. The values are ratings from 0-5, where 0 means the user hasn't rated the movie.
3. **Popular Movies Subset** (`users_movies_sort`): A matrix containing user ratings for only the most popular movies.
4. **Popular Movie Indices** (`index_small`): Indices of the most popular movies in the original movie list.
5. **Trial User** (`trial_user`): Ratings provided by a new user for the popular movies.

## Recommendation Process

### Step 1: Data Preparation
- Load or generate the movie rating data
- Identify the most popular movies
- Extract the trial user's ratings for these popular movies
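
The notebook takes the first 20 columns as a stand-in for popularity. A minimal sketch of one alternative follows, assuming "popular" means "rated by the most users". The seed and the use of `np.random.default_rng` are illustrative choices, not the notebook's own setup.

```python
import numpy as np

# Illustrative synthetic ratings (0 = not rated); seed chosen arbitrarily
rng = np.random.default_rng(0)
users_movies = rng.choice([0, 1, 2, 3, 4, 5], size=(200, 50),
                          p=[0.3, 0.1, 0.1, 0.2, 0.15, 0.15])

# Number of actual (non-zero) ratings each movie received
rating_counts = np.count_nonzero(users_movies, axis=0)

# Indices of the 20 most-rated movies, most popular first;
# kind="stable" keeps tied movies in their original column order
index_small = np.argsort(-rating_counts, kind="stable")[:20]

# Ratings restricted to those 20 movies
users_movies_sort = users_movies[:, index_small]
print(users_movies_sort.shape)
```

With this variant, `index_small` is no longer `0..19`, so any lookup into `movies` has to go through `index_small`, as the notebook's printing loops already do.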

### Step 2: Find Similar Users
The system keeps only users who have rated all 20 popular movies. If fewer than 5 users qualify, it relaxes the requirement to users who rated at least 15 of the 20. These users form the basis for comparison.
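
The filtering step can be sketched in vectorized form. This is a sketch under assumptions: the function name `select_comparable_users` and its parameters are hypothetical, and it additionally returns the original row indices, which the notebook's loop does not keep.

```python
import numpy as np

def select_comparable_users(popular_ratings, min_users=5, relaxed_min_rated=15):
    """Return (original_row_indices, ratings_subset) for comparable users.

    Hypothetical helper: first require a rating for every movie, then fall
    back to `relaxed_min_rated` rated movies if too few users qualify.
    """
    popular_ratings = np.asarray(popular_ratings)
    rated_counts = np.count_nonzero(popular_ratings, axis=1)
    n_movies = popular_ratings.shape[1]

    rows = np.flatnonzero(rated_counts == n_movies)
    if len(rows) < min_users:
        rows = np.flatnonzero(rated_counts >= relaxed_min_rated)
    return rows, popular_ratings[rows]

# Tiny demonstration with 4 users and 4 movies
demo = np.array([[5, 4, 3, 2],
                 [0, 4, 3, 2],
                 [0, 0, 3, 2],
                 [1, 1, 1, 1]])
rows, subset = select_comparable_users(demo, min_users=3, relaxed_min_rated=3)
print(rows)  # original row positions of the users that were kept
```

Keeping `rows` matters because a position in the filtered subset is generally not the same as that user's row in the full ratings matrix.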

### Step 3: Calculate Similarity Using Two Methods

#### Method 1: Euclidean Distance
The Euclidean distance measures the straight-line distance between two users' rating vectors, computed here over only the movies both users have rated:

```
distance = √(Σ(rating_user1 - rating_user2)²)
```

**Intuition**: Users with a smaller distance have more similar absolute ratings.
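
A minimal sketch of this distance, assuming (as in the notebook) that 0 marks an unrated movie and that users with no movie in common get an infinite distance. The function name `masked_euclidean` is hypothetical.

```python
import numpy as np

def masked_euclidean(a, b):
    """Euclidean distance over the movies both users rated (0 = not rated)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = (a > 0) & (b > 0)
    if not both.any():
        return float("inf")  # no co-rated movies, so no basis for comparison
    return float(np.linalg.norm(a[both] - b[both]))

print(masked_euclidean([5, 0, 3], [4, 2, 0]))  # only the first movie is co-rated
```

Because the sum runs over co-rated movies only, two users who share few movies can end up with a small distance simply because there are fewer terms to add up.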

#### Method 2: Pearson Correlation
The Pearson correlation coefficient measures the linear correlation between two users' rating patterns:

```
r = Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)² * Σ(y - ȳ)²)
```

Where:
- x and y are the rating vectors
- x̄ and ȳ are the mean ratings for each user

**Intuition**: This accounts for differences in rating scales between users. For example, if one user tends to rate everything higher than another, but they still like/dislike the same things, they'll have a high correlation.
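
The formula can be written out directly and checked against `np.corrcoef`. This is a sketch: the function name `pearson_r` and the example vectors are illustrative, and returning 0 for a constant rating vector is an assumption that mirrors the notebook's fallback.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if denom == 0:
        return 0.0  # a constant vector has no variance; treat as uncorrelated
    return float(np.sum(dx * dy) / denom)

# A generous rater and a stricter one who scores every movie one point lower
generous = np.array([5, 4, 5, 3, 4])
strict = generous - 1

print(pearson_r(generous, strict))
print(np.corrcoef(generous, strict)[0, 1])
```

Subtracting each user's mean removes the constant offset between the two raters, which is why the shifted vectors correlate perfectly.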

### Step 4: Generate Recommendations
For each similarity method:
1. Find the most similar user to the trial user
2. Identify movies that the similar user rated highly (5/5)
3. Recommend these movies to the trial user (excluding movies the trial user has already rated 5/5)
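
The three steps above can be sketched as a small function. The name `recommend` and its parameters are hypothetical; the logic assumes the nearest neighbour has already been found and that "rated highly" means exactly 5.

```python
import numpy as np

def recommend(neighbour_ratings, liked_indices, top_rating=5):
    """Movie indices the neighbour rated `top_rating`, minus already-liked ones."""
    liked = {int(i) for i in liked_indices}
    top_rated = np.flatnonzero(np.asarray(neighbour_ratings) == top_rating)
    return [int(k) for k in top_rated if int(k) not in liked]

# Neighbour gave 5/5 to movies 0, 2 and 4; the trial user already loves movie 0
neighbour = np.array([5, 3, 5, 0, 5])
print(recommend(neighbour, liked_indices=[0]))  # prints [2, 4]
```

Only movies the trial user rated 5 are filtered out, so a movie the trial user rated lower can still appear in the list.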

## Why Two Different Methods?

The two recommendation methods address different patterns in user ratings:

- **Euclidean Distance** works best when users rate on similar scales
- **Pearson Correlation** works better when users have similar preferences but different rating scales (e.g., one user might rarely give 5 stars while another gives them frequently)

## Example Scenarios

### Scenario 1: Similar Absolute Ratings
User A and User B both rate movies similarly:
- User A rates Movie X: 5, Movie Y: 2
- User B rates Movie X: 5, Movie Y: 2

Both methods will find these users similar.

### Scenario 2: Similar Preferences, Different Scales
User A is generous with ratings, User B is stricter:
- User A rates Movie X: 5, Movie Y: 3
- User B rates Movie X: 4, Movie Y: 2

Euclidean distance will find them somewhat different, but Pearson correlation will recognize they have the same preference pattern (they both rate Movie X higher than Movie Y by the same margin of 2 points).
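
As a quick numeric check of this scenario, the sketch below computes both measures for the two users above, using plain NumPy calls.

```python
import numpy as np

user_a = np.array([5.0, 3.0])  # generous rater: Movie X, Movie Y
user_b = np.array([4.0, 2.0])  # stricter rater: Movie X, Movie Y

distance = np.linalg.norm(user_a - user_b)       # sqrt(1^2 + 1^2)
correlation = np.corrcoef(user_a, user_b)[0, 1]  # same ordering and spacing

print(f"Euclidean distance:  {distance:.2f}")
print(f"Pearson correlation: {correlation:.2f}")
```

One caveat: with only two movies, any pair of users with two distinct ratings each has a correlation of exactly +1 or -1, so Pearson scores become informative only with more co-rated movies.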

## Interpreting Results

When the recommendations from both methods match:
- High confidence in the recommendation

When they differ:
- Euclidean recommendations: "Users who gave similar absolute ratings liked these movies"
- Pearson recommendations: "Users with similar taste patterns liked these movies"
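
Agreement between the two methods can also be checked beyond the single closest user. The sketch below uses the top-5 neighbour lists printed by the notebook run; the overlap ratio is an illustrative measure, not something the notebook computes.

```python
# Top-5 neighbours as printed in the notebook output
top_dist = [64, 10, 44, 3, 11]      # by Euclidean distance
top_pearson = [64, 58, 65, 7, 44]   # by Pearson correlation

# Users that both methods place in their top 5, in Euclidean order
pearson_set = set(top_pearson)
shared = [u for u in top_dist if u in pearson_set]
overlap = len(shared) / len(top_dist)

print(shared)   # prints [64, 44]
print(overlap)  # prints 0.4
```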

## Customization

The system allows for personalization by:
1. Entering your own ratings for popular movies
2. Generating personalized recommendations using both similarity methods
3. Comparing which method produces better recommendations for your taste

## Conclusion

This recommendation system demonstrates the fundamentals of collaborative filtering. While modern commercial systems use more sophisticated techniques (including matrix factorization, deep learning, etc.), the principles illustrated here form the foundation of many recommendation engines.

In [2]:
ratings = []

# Loop through each row in users_movies_sort
for j in range(m1):
    # Check if the user has rated all 20 movies (no zeros in their ratings)
    if np.all(users_movies_sort[j, :] != 0):
        # Append the row to the ratings list
        ratings.append(users_movies_sort[j, :])

# Convert the ratings list to a NumPy array
ratings = np.array(ratings)
print(f"\nFound {len(ratings)} users who rated all 20 popular movies")

# If fewer than 5 users rated all movies, relax the constraint to find some comparable users
if len(ratings) < 5:
    print("Too few users rated all movies, relaxing constraints...")
    ratings = []
    for j in range(m1):
        # Consider users who rated at least 15 of the 20 movies
        if np.count_nonzero(users_movies_sort[j, :]) >= 15:
            ratings.append(users_movies_sort[j, :])
    ratings = np.array(ratings)
    print(f"Found {len(ratings)} users who rated at least 15/20 popular movies")

# Calculate Euclidean Distance Recommendations
# --------------------------------------------
m2, n2 = ratings.shape
# Initialize an empty list to store the Euclidean distances
eucl = []

# Loop through each row in ratings
for i in range(m2):
    # Calculate the Euclidean distance between trial_user and current user
    # Only consider movies that both users have rated (non-zero ratings)
    mask = (ratings[i, :] > 0) & (trial_user > 0)
    if np.sum(mask) > 0:  # Only calculate if they have at least one movie in common
        distance = np.linalg.norm(ratings[i, mask] - trial_user[mask])
        eucl.append(distance)
    else:
        eucl.append(float('inf'))  # Assign infinite distance if no movies in common

# Convert the eucl list to a NumPy array
eucl = np.array(eucl)

# Sort the Euclidean distances in ascending order
DistIndex = np.argsort(eucl)
MinDist = np.sort(eucl)

# Find the index of the closest user
closest_user_Dist = DistIndex[0]
print(f"\nClosest user by Euclidean distance: User #{closest_user_Dist}")
print(f"Euclidean distance: {MinDist[0]:.2f}")

# Calculate Pearson Correlation Recommendations
# --------------------------------------------
# Initialize the pearson array
pearson = np.zeros(m2)

# Compute Pearson correlation coefficients
for i in range(m2):
    # Create masks for non-zero ratings from both users
    mask = (ratings[i, :] > 0) & (trial_user > 0)
    if np.sum(mask) > 1:  # Need at least 2 points to calculate correlation
        # Get the ratings where both users have rated the movie
        user_ratings = ratings[i, mask]
        trial_ratings = trial_user[mask]

        # Calculate correlation if there's enough variance
        if np.std(user_ratings) > 0 and np.std(trial_ratings) > 0:
            pearson[i] = np.corrcoef(user_ratings, trial_ratings)[0, 1]
        else:
            pearson[i] = 0
    else:
        pearson[i] = 0


Found 0 users who rated all 20 popular movies
Too few users rated all movies, relaxing constraints...
Found 76 users who rated at least 15/20 popular movies

Closest user by Euclidean distance: User #64
Euclidean distance: 3.46


In [3]:
PearsonIndex = np.argsort(pearson)[::-1]
MaxPearson = np.sort(pearson)[::-1]

# Find the index of the user with the highest correlation coefficient
closest_user_Pearson = PearsonIndex[0]
print(f"Most similar user by Pearson correlation: User #{closest_user_Pearson}")
print(f"Pearson correlation: {MaxPearson[0]:.2f}")

# Compare the elements of the vectors DistIndex and PearsonIndex
print("\nTop 5 users by Euclidean distance:", DistIndex[:5])
print("Top 5 users by Pearson correlation:", PearsonIndex[:5])

# Check if the closest users by each metric are the same
if closest_user_Pearson == closest_user_Dist:
    print("The closest user is the same using both metrics.")
else:
    print("The closest user is different using the two metrics.")

# Generate Recommendations
# -----------------------
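# Note: closest_user_Dist and closest_user_Pearson are row positions in the
# filtered `ratings` array, which generally differ from row positions in
# users_movies; the lookups below use them against users_movies unchanged
dist_user_idx = closest_user_Dist
pearson_user_idx = closest_user_Pearson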

# Recommendations based on the distance criterion (highly rated movies by the closest user)
recommend_dist = []
for k in range(num_movies):
    if users_movies[dist_user_idx, k] == 5:
        recommend_dist.append(k)

# Recommendations based on the Pearson correlation coefficient criterion
recommend_pearson = []
for k in range(num_movies):
    if users_movies[pearson_user_idx, k] == 5:
        recommend_pearson.append(k)

Most similar user by Pearson correlation: User #64
Pearson correlation: 0.57

Top 5 users by Euclidean distance: [64 10 44  3 11]
Top 5 users by Pearson correlation: [64 58 65  7 44]
The closest user is the same using both metrics.


In [4]:
# Movies highly rated by the trial user (rated 5)
liked = []
for k in range(20):  # Only consider the 20 popular movies
    if trial_user[k] == 5:
        liked.append(index_small[k])

# Print the results
print("\n===== RECOMMENDATIONS =====")
print("\nMovies liked by the trial user:")
for idx in liked:
    print(f"- {movies[idx][0]}")

print("\nRecommended movies based on Euclidean distance:")
for idx in recommend_dist:
    # Only recommend movies that the trial user hasn't rated highly already
    if idx not in liked:
        print(f"- {movies[idx][0]}")

print("\nRecommended movies based on Pearson correlation:")
for idx in recommend_pearson:
    # Only recommend movies that the trial user hasn't rated highly already
    if idx not in liked:
        print(f"- {movies[idx][0]}")

# Create your own ratings
# -----------------------
print("\n===== PERSONAL RECOMMENDATIONS =====")
print("Creating personal recommendations based on user-defined ratings")

# Define your own ratings for the 20 popular movies (1-5, or 0 for not seen)
# You can change these values to get personalized recommendations
myratings = np.array([5, 3, 4, 2, 5, 1, 0, 4, 5, 3, 4, 5, 2, 0, 5, 3, 4, 1, 5, 2])

print("\nYour ratings for the popular movies:")
for i, rating in enumerate(myratings):
    movie = movies[index_small[i]][0]
    if rating > 0:
        print(f"{movie}: {rating}/5")
    else:
        print(f"{movie}: Not rated")

# Calculate Euclidean distances for personal recommendations
eucl_personal = []
for i in range(m2):
    # Only consider movies that both you and the other user rated
    mask = (ratings[i, :] > 0) & (myratings > 0)
    if np.sum(mask) > 0:
        distance = np.linalg.norm(ratings[i, mask] - myratings[mask])
        eucl_personal.append(distance)
    else:
        eucl_personal.append(float('inf'))

eucl_personal = np.array(eucl_personal)
DistIndex_personal = np.argsort(eucl_personal)
closest_user_Dist_personal = DistIndex_personal[0]


===== RECOMMENDATIONS =====

Movies liked by the trial user:
- The Shawshank Redemption
- Pulp Fiction
- Forrest Gump
- Star Wars: Episode V
- Interstellar
- Gladiator

Recommended movies based on Euclidean distance:
- Fight Club
- The Shining
- Casablanca
- Scarface

Recommended movies based on Pearson correlation:
- Fight Club
- The Shining
- Casablanca
- Scarface

===== PERSONAL RECOMMENDATIONS =====
Creating personal recommendations based on user-defined ratings

Your ratings for the popular movies:
The Shawshank Redemption: 5/5
The Godfather: 3/5
The Dark Knight: 4/5
Pulp Fiction: 2/5
Schindler's List: 5/5
The Lord of the Rings: The Return of the King: 1/5
Fight Club: Not rated
Forrest Gump: 4/5
Inception: 5/5
The Matrix: 3/5
Goodfellas: 4/5
Star Wars: Episode V: 5/5
One Flew Over the Cuckoo's Nest: 2/5
The Silence of the Lambs: Not rated
Interstellar: 5/5
The Usual Suspects: 3/5
The Green Mile: 4/5
The Prestige: 1/5
Gladiator: 5/5
Whiplash: 2/5


In [5]:
pearson_personal = np.zeros(m2)
for i in range(m2):
    mask = (ratings[i, :] > 0) & (myratings > 0)
    if np.sum(mask) > 1:
        user_ratings = ratings[i, mask]
        my_ratings = myratings[mask]
        if np.std(user_ratings) > 0 and np.std(my_ratings) > 0:
            pearson_personal[i] = np.corrcoef(user_ratings, my_ratings)[0, 1]

PearsonIndex_personal = np.argsort(pearson_personal)[::-1]
closest_user_Pearson_personal = PearsonIndex_personal[0]

# Generate personal recommendations
recommend_dist_personal = []
for k in range(num_movies):
    if users_movies[closest_user_Dist_personal, k] == 5:
        recommend_dist_personal.append(k)

recommend_pearson_personal = []
for k in range(num_movies):
    if users_movies[closest_user_Pearson_personal, k] == 5:
        recommend_pearson_personal.append(k)

# Movies you liked (rated 5)
liked_personal = []
for k in range(20):
    if myratings[k] == 5:
        liked_personal.append(index_small[k])

# Print personal recommendations
print("\nMovies you rated 5/5:")
for idx in liked_personal:
    print(f"- {movies[idx][0]}")

print("\nRecommended movies based on Euclidean distance:")
for idx in recommend_dist_personal:
    # Only recommend movies that you haven't rated highly already
    if idx not in liked_personal:
        print(f"- {movies[idx][0]}")

print("\nRecommended movies based on Pearson correlation:")
for idx in recommend_pearson_personal:
    # Only recommend movies that you haven't rated highly already
    if idx not in liked_personal:
        print(f"- {movies[idx][0]}")


Movies you rated 5/5:
- The Shawshank Redemption
- Schindler's List
- Inception
- Star Wars: Episode V
- Interstellar
- Gladiator

Recommended movies based on Euclidean distance:
- The Dark Knight
- One Flew Over the Cuckoo's Nest
- The Usual Suspects
- Back to the Future
- Alien
- Django Unchained
- Parasite
- Eternal Sunshine
- Scarface

Recommended movies based on Pearson correlation:
- The Dark Knight
- One Flew Over the Cuckoo's Nest
- The Usual Suspects
- Back to the Future
- Alien
- Django Unchained
- Parasite
- Eternal Sunshine
- Scarface
