# Project 4: Movie Recommender System

CS 598 Practical Statistical Learning

2023-12-10

UIUC Fall 2023

**Authors**
* Ryan Fogle
    - rsfogle2@illinois.edu
    - UIN: 652628818
* Sean Enright
    - seanre2@illinois.edu
    - UIN: 661791377

**Contributions**

### Load in the Data

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
movie_dir = Path('.') / 'ml-1m' / 'ml-1m' 
ratings = pd.read_csv(movie_dir / 'ratings.dat', sep='::', engine = 'python', header=None)
ratings.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']
movies = pd.read_csv(movie_dir / 'movies.dat', sep='::', engine = 'python',
                     encoding="ISO-8859-1", header = None)
movies.columns = ['MovieID', 'Title', 'Genres']
users = pd.read_csv(movie_dir / 'users.dat', sep='::', engine='python', header=None)
users.columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode']

# Create new entry for each genre in a movie, join movie and ratings together. 
movies['Genres'] = movies['Genres'].str.split('|')
df = movies.explode('Genres')
df = df.merge(ratings, on=['MovieID'], how='left')
df.rename(columns={'Genres': 'Genre'}, inplace=True)
df
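
As a small illustration of the split-and-explode step above, the sketch below applies the same calls to a toy table. The frame `toy_movies` and its titles and genres are made up for the example and do not come from the MovieLens files.

```python
import pandas as pd

# Hypothetical stand-in for movies.dat: one row per movie, genres joined by '|'
toy_movies = pd.DataFrame({
    "MovieID": [1, 2],
    "Title": ["Movie A", "Movie B"],
    "Genres": ["Action|Comedy", "Drama"],
})

# Split the genre string into a list, then emit one row per list element
toy_movies["Genres"] = toy_movies["Genres"].str.split("|")
toy_long = toy_movies.explode("Genres").rename(columns={"Genres": "Genre"})
print(toy_long)
```

A movie with two genres becomes two rows, so any ratings merged in afterwards are counted once per genre.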

Create a MovieID -> Title map for later

In [None]:
mov_title_map = dict(zip(movies['MovieID'], movies['Title']))
list(mov_title_map.items())[:5]

## System I: Recommendation Based on Genres

General idea: Use Bayesian probability to calculate new ratings based upon additional prior assumptions, and then rank movies by this new rating to select the top 5 per genre.

This algorithm is based on this Stack Overflow post:
https://stackoverflow.com/questions/2495509/how-to-balance-number-of-ratings-versus-the-ratings-themselves

$$
\tilde{R} = \frac{\bar{w} \bar{r} + \sum_{i=1}^{n}{r_i}}{\bar{w} + n}
$$

Where:
- $\tilde{R}$ is the new rating
- $\bar{w}$ is the predefined number of ratings (weight) to include in our prior assumption
- $\bar{r}$ is the predefined average rating to include in our prior assumption
- $n$ is the number of ratings
- $r_i$ is the rating for a given entry.

**Interpretation**:

Consider $\bar{r}$ to be the prior average rating for a given genre, and $\bar{w}$ to be the number of times that prior rating is counted, as if the movie had already received $\bar{w}$ ratings of $\bar{r}$. When $n=0$, the rating of a movie is $\frac{\bar{w} \bar{r}}{\bar{w}} = \bar{r}$; each observed rating then moves the estimate toward the movie's own average.

In our implementation, we define $\bar{r}$ for each genre as the median of the per-movie average ratings in that genre, and $\bar{w}$ as the genre's 25th percentile of the number of ratings per movie.
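
To make the effect of the prior concrete, here is a minimal sketch of the formula on toy numbers. The function name `weighted_rating` and the parameter names `prior_mean` (for $\bar{r}$) and `prior_weight` (for $\bar{w}$) are chosen for this example only, and the sketch assumes `prior_weight` is positive.

```python
def weighted_rating(ratings, prior_mean, prior_weight):
    """Shrink a movie's mean rating toward prior_mean using prior_weight pseudo-ratings."""
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)

# Two perfect ratings barely move the estimate away from the prior mean of 3.0 ...
few = weighted_rating([5, 5], prior_mean=3.0, prior_weight=10)
# ... while two hundred perfect ratings pull it close to 5.
many = weighted_rating([5] * 200, prior_mean=3.0, prior_weight=10)
print(few, many)
```

This is why a movie with a handful of 5-star ratings does not outrank a widely rated movie with a slightly lower average.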

In [None]:
# group by Genre and Movie, this will be used to find median ratings and percentile counts for our prior
gb = df.groupby(['Genre', 'MovieID'])

# Median Genre ratings of Average Movie ratings
median_ratings = gb['Rating'].mean().reset_index().groupby('Genre')['Rating'].median().reset_index()
median_ratings = dict(zip(median_ratings['Genre'], median_ratings['Rating']))
median_ratings

In [None]:
# Grab 25th Percentile of count by genre
quantile_count = gb['Timestamp'].count().reset_index().groupby('Genre')['Timestamp'].quantile(0.25).reset_index()
quantile_count.columns = ['Genre', 'Count']
quantile_count = dict(zip(quantile_count['Genre'], quantile_count['Count']))
quantile_count

Run algorithm

In [None]:
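weighted_ratings = []
for (genre, movie_id), movie in gb:
    # Count actual ratings only: the left merge leaves one NaN row for each unrated movie
    n = movie['Rating'].count()
    w = quantile_count[genre]  # prior weight: 25th percentile of ratings per movie
    r = median_ratings[genre]  # prior mean: median of per-movie average ratings
    weighted_rating = (r * w + movie['Rating'].sum()) / (w + n)
    # mean() skips NaN, so unrated movies get NaN rather than 0 as their average
    weighted_ratings.append((genre, movie_id, mov_title_map[movie_id], weighted_rating, movie['Rating'].mean(), n))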

w_df = pd.DataFrame(weighted_ratings, columns=['Genre', 'MovieID', 'Title', 'WeightedRating', 'AverageRating', '# of Ratings'])
w_df

Sort ratings by `WeightedRating`, group by genre, and keep the first ten rows per genre.

In [None]:
sysI_recs = w_df.sort_values('WeightedRating', ascending=False).groupby('Genre').head(n=10).sort_values(['Genre', 'WeightedRating'], ascending=[True, False])
sysI_recs
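
The sort-then-`groupby().head()` pattern used above can be checked on a toy frame. The data and the name `top_n` below are hypothetical; the sketch only illustrates that sorting before grouping makes `head` return the highest-rated rows of each group.

```python
import pandas as pd

# Hypothetical weighted ratings for five movies in two genres
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Drama", "Drama"],
    "MovieID": [1, 2, 3, 4, 5],
    "WeightedRating": [3.9, 4.5, 4.1, 4.8, 3.2],
})

top_n = 2
toy_top = (
    toy.sort_values("WeightedRating", ascending=False)  # best movies first
       .groupby("Genre")
       .head(top_n)                                      # first top_n rows of each genre
       .sort_values(["Genre", "WeightedRating"], ascending=[True, False])
)
print(toy_top)
```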

Output the System I recommendations for the dashboard to use.

In [None]:
sysI_recs.to_csv('sysI_recs.csv', index=False)
w_df.to_csv('sysI_recs_full.csv', index=False)

## System II: Recommendation Based on IBCF

### Similarity Matrix Construction

To construct the similarity matrix, we require user ratings for various items. The input rating matrix is $R_{a \times i}$, where $a$ is the number of users who have reviewed one or more movies, and $i$ is the number of movies.

In the case of our dataset, there are 6040 users and 3706 movies, so $R$ is of shape $6040 \times 3706$.

In [None]:
user_mov_df = pd.read_csv('Rmat.csv')
user_mov_df.shape

### Normalization of Ratings Matrix
We normalize the rating matrix by subtracting the row means from each row, ignoring `NA` entries. This addresses the variation in each user's average rating.

In [None]:
user_mov_df_norm = user_mov_df.sub(user_mov_df.mean(axis=1, skipna=True), axis=0)
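
As a quick sanity check of the centering step, the sketch below applies the same `mean`/`sub` calls to a hypothetical 2-user, 3-movie matrix (`toy_R` is not part of the project data).

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: NaN marks a movie the user has not rated
toy_R = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [np.nan, 2.0, 4.0]],
    index=["u1", "u2"], columns=["m1", "m2", "m3"],
)

row_means = toy_R.mean(axis=1, skipna=True)  # u1: 4.0, u2: 3.0
toy_R_norm = toy_R.sub(row_means, axis=0)    # subtract each user's mean from that user's row
print(toy_R_norm)
```

Unrated entries stay `NaN`, and each user's remaining ratings now average to zero.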


### Cosine Similarity

We seek to compute the similarity between movies (items). We select centered cosine similarity as our measure of similarity. Having normalized our ratings matrix by each user's average rating, the next step is computation of similarity.

Our implementation of the cosine similarity between items is described below.

In [None]:
import numpy as np
from tqdm import tqdm

def cosine_similarity(x, min_cardinality=3):    
    # Prepare symmetric result matrix
    s = np.empty((x.shape[1], x.shape[1]))
    s[:] = np.nan

    # Calculate similarity for upper triangular matrix
    for i in tqdm(range(0, x.shape[1] - 1)):
        i_valid = ~np.isnan(x[:, i])
        for j in range(i + 1, x.shape[1]):
            j_valid = ~np.isnan(x[:, j])
            row_mask = np.logical_and(i_valid, j_valid)
            if row_mask.sum() >= min_cardinality:
                r_li = x[row_mask, i]
                r_lj = x[row_mask, j]
                s[i, j] = (np.dot(r_li, r_lj)
                           / (np.sqrt(np.power(r_li, 2).sum()) 
                              * np.sqrt(np.power(r_lj, 2).sum())))
    s = 0.5 + s / 2

    # Transpose upper triangular matrix to form lower
    lower_idx = np.tril_indices(x.shape[1])
    s[lower_idx] = s.T[lower_idx]
    return s
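
To see what a single entry of the matrix looks like, here is a minimal sketch that computes the similarity for one pair of item columns, including the `0.5 + s / 2` rescaling to $[0, 1]$. The helper `pair_similarity` and the toy vectors are assumptions made for this example and are not part of the implementation above.

```python
import numpy as np

def pair_similarity(col_i, col_j, min_cardinality=3):
    """Rescaled cosine similarity over the users who rated both items."""
    both = ~np.isnan(col_i) & ~np.isnan(col_j)
    if both.sum() < min_cardinality:
        return np.nan  # too few common raters to trust the estimate
    a, b = col_i[both], col_j[both]
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 0.5 + cos / 2  # map [-1, 1] onto [0, 1]

# Hypothetical centered ratings from four users for three movies
m_a = np.array([1.0, -1.0, 0.5, np.nan])
m_b = np.array([2.0, -2.0, 1.0, 3.0])     # same direction as m_a on common users
m_c = np.array([-1.0, 1.0, -0.5, np.nan])  # opposite direction to m_a

sim_ab = pair_similarity(m_a, m_b)
sim_ac = pair_similarity(m_a, m_c)
print(sim_ab, sim_ac)
```

Perfectly aligned columns map to 1 and perfectly opposed columns map to 0; pairs with fewer than `min_cardinality` common raters stay `NaN`.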

We apply this function to our centered ratings matrix, producing a symmetric similarity matrix $S_{i \times i}$.

We extract and re-wrap the column indices to retain the movie IDs.

In [None]:
min_cardinality = 3

s = cosine_similarity(user_mov_df_norm.to_numpy(), min_cardinality=min_cardinality)
s = pd.DataFrame(data=s,
                 index=user_mov_df_norm.columns,
                 columns=user_mov_df_norm.columns)

#### Validation of Similarity Matrix Before Filtering

In order to validate our similarity matrix and our implementation of centered cosine similarity, we show the pairwise similarity values from the $S$ matrix for the following specified movies:

```m1, m10, m100, m1510, m260, m3212```

We are validating our results against the values in [Campuswire post #861](https://campuswire.com/c/G06C55090/feed/861)

In [None]:
pd.set_option("display.precision", 7)
specified_movies = ["m1", "m10", "m100", "m1510", "m260", "m3212"]
s.loc[specified_movies, specified_movies]

### Filtering by Most Similar Movies

Next, for each movie, we determine the 30 most similar movies and set all other movies to NA. This allows for a more compact $S$ matrix. For movies that have fewer than 30 similar movies, all available similar movies (i.e., non-`NA`) are kept.

In [None]:
max_similar = 30

for i in range(s.shape[0]):
    row = s.iloc[i, :]
    num_selected = min([(~np.isnan(row)).sum(), max_similar, len(row)])
    # Find max allowed similarity with NaN vals
    max_sim = np.roll(np.sort(row)[::-1],
                      -np.count_nonzero(np.isnan(row)))[num_selected - 1]
    na_mask = row < max_sim
    s.iloc[i, na_mask] = np.nan
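
The per-row filtering can be illustrated on a single short row. The helper `keep_top_k` below is a hypothetical simplification, not the loop used above: it drops `NaN` values before sorting instead of rolling them to the end, but it keeps the same entries, including any ties at the threshold.

```python
import numpy as np

def keep_top_k(row, k):
    """Return a copy of row with values below its k-th largest non-NaN value set to NaN."""
    out = row.astype(float).copy()
    valid = out[~np.isnan(out)]
    if valid.size <= k:
        return out  # fewer than k similarities available: keep them all
    threshold = np.sort(valid)[-k]  # k-th largest non-NaN value
    out[out < threshold] = np.nan   # comparisons with NaN are False, so NaN stays NaN
    return out

toy_row = np.array([0.9, np.nan, 0.4, 0.7, 0.6])
filtered = keep_top_k(toy_row, 2)
print(filtered)
```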

This filtered similarity matrix is written to file as `similarity.csv`.

In [None]:
s.to_csv("similarity.csv")

### IBCF

#### Implementation of IBCF

We calculate IBCF for all non-rated movies and return the 10 highest recommendations.
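
For a single unrated movie $l$, the prediction is a similarity-weighted average of the user's own ratings, taken over the movies that are both similar to $l$ and rated by the user. The notation below ($w_i$ for the user's rating of movie $i$, $S(l)$ for the retained neighbors of $l$) is assumed for this sketch:

$$
\hat{r}_l = \frac{\sum_{i \in S(l),\, w_i \neq \text{NA}} S_{li} \, w_i}{\sum_{i \in S(l),\, w_i \neq \text{NA}} S_{li}}
$$

A minimal version for one movie, using a hypothetical helper `predict_one` and toy vectors, might look like this:

```python
import numpy as np

def predict_one(sim_row, user_ratings):
    """Similarity-weighted average of the user's ratings for one target movie."""
    usable = ~np.isnan(sim_row) & ~np.isnan(user_ratings)
    denom = sim_row[usable].sum()
    if denom == 0:
        return np.nan  # no rated neighbor: no prediction
    return np.dot(sim_row[usable], user_ratings[usable]) / denom

# Hypothetical similarities of the target movie to four movies, and the user's ratings of them
sim_row = np.array([np.nan, 0.8, 0.2, np.nan])
user = np.array([np.nan, 5.0, 3.0, 4.0])

pred = predict_one(sim_row, user)
print(pred)
```

Only the second and third movies contribute here; the fourth is rated but is not among the target's retained neighbors.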

Ties in the IBCF prediction are broken by `WeightedRating` from System I, and any remaining ties by `MovieID`, both in descending order: the movie with the highest (`WeightedRating`, `MovieID`) pair comes first, followed by the second-highest, and so on.
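
This ordering amounts to sorting on a three-part key. The sketch below shows it on hypothetical `(movie, IBCF value, WeightedRating)` tuples; the numbers are invented for illustration.

```python
# Hypothetical candidates: (movie ID, IBCF prediction, System I weighted rating)
candidates = [
    ("m10", 4.5, 3.9),
    ("m2", 4.5, 4.2),
    ("m7", 5.0, 3.1),
    ("m30", 4.5, 3.9),
]

# Sort descending by IBCF value, then weighted rating, then numeric movie ID
ranked = sorted(candidates, key=lambda c: (c[1], c[2], int(c[0][1:])), reverse=True)
top = [mid for mid, _, _ in ranked]
print(top)
```

Here `m7` wins on IBCF value, `m2` wins the three-way tie at 4.5 on weighted rating, and `m30` precedes `m10` on movie ID.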

If IBCF yields fewer than 10 recommendations, we fill the remaining slots with the highest-rated movies, by `WeightedRating` from System I, drawn from the genres of the movies the user has rated. Movies the user has already rated and movies already recommended by IBCF are excluded. Ties in `WeightedRating` are broken in favor of the lowest `MovieID`.

In [None]:
def myIBCF(s, newuser, sysI, num_recs=10):

    recs = newuser.copy(deep=True)
    recs.iloc[:] = np.nan

    i_in_w = ~np.isnan(newuser)
    # Compute IBCF for non-rated movies
    for l in np.arange(newuser.shape[0])[np.isnan(newuser)]:
        s_li = s.iloc[l, :]
        i_in_sl = ~np.isnan(s_li)
        col_mask = np.logical_and(i_in_sl, i_in_w)
        if s_li[col_mask].sum() == 0:
            continue
        recs.iloc[l] = (
            1 / (s_li[col_mask].sum())
            * np.dot(s_li[col_mask], newuser[col_mask])
        )
    recs = recs[~np.isnan(recs)] 

    # Create mappings needed for ranking
    mid_to_rating = dict(zip(sysI['MovieID'], sysI['WeightedRating']))
    mid_to_genre = dict(zip(sysI['MovieID'], sysI['Genre']))
    #print(f"# ratings: {np.count_nonzero(~np.isnan(newuser))}")
    #print(f"   # recs: {recs.shape[0]}")
    if recs.shape[0] >= num_recs:
        #print(recs.iloc[recs.argsort().iloc[-num_recs:]])
        rec_df = recs.iloc[recs.argsort().iloc[-num_recs:]]

        # Find (mid, IBCF value, Weighted Rating from System I) pairs
        recnames = [(mid, val, mid_to_rating[int(mid[1:])]) for mid, val in zip(rec_df.index, rec_df.values)]

        # Sort by (IBCF value, Weighted rating from System I, then mid) descending
        recs = [x[0] for x in sorted(recnames, key=lambda x: (x[1], x[2], int(x[0][1:])))][::-1]
        return recs
    else:
        additional_recs = num_recs - recs.shape[0]

        # Run through regular logic
        rec_df = recs.iloc[recs.argsort().iloc[-num_recs:]]
        mids = [int(mid[1:]) for mid in rec_df.index]
        recnames = [(mid, val, mid_to_rating[int(mid[1:])]) for mid, val in zip(rec_df.index, rec_df.values)]
        recs = [x[0] for x in sorted(recnames, key=lambda x: (x[1], x[2], int(x[0][1:])))][::-1]

        # From the movies rated by the user, collect their genres (mid_to_genre keeps one genre per movie).
        # Every genre found is used; rated movies are not counted per genre.
        # Select the top movies by WeightedRating from System I within those genres.
        # Exclude movies the user already rated and movies already included from the IBCF recommendations.
        rated_movies = newuser[~np.isnan(newuser)]
        genre_mids = [int(movie[1:]) for movie in rated_movies.index]
        genres = np.unique([mid_to_genre[mid] for mid in genre_mids])
        mids.extend(genre_mids)
        movie_ids = sysI[sysI['Genre'].isin(genres) & ~sysI['MovieID'].isin(mids)].sort_values(by=['WeightedRating', 'MovieID'], ascending=[False, True])[:additional_recs]['MovieID']
        movie_ids = [f'm{mid}' for mid in movie_ids.values]

        return recs + movie_ids
        

#### Validation of `myIBCF`

To validate our implementation of `myIBCF`, we show the top 10 recommendations for:
* User "u1181" from rating matrix $R$
* User "u1351" from rating matrix $R$
* A hypothetical user who rates movie “m1613” with 5 and movie “m1755” with 4

In [None]:
hypothetical_user = user_mov_df.iloc[0, :].copy(deep=True)
hypothetical_user.iloc[:] = np.nan
hypothetical_user.loc[["m1613", "m1755"]] = [5, 4]

test_users = [
    ("User u1181", user_mov_df.loc["u1181", :]),
    ("User u1351", user_mov_df.loc["u1351", :]),
    ("Hypothetical user", hypothetical_user)
]

for username, w in test_users:
    print(f"\n{username}\n--{len(username)*'-'}\n{myIBCF(s, w, w_df)}")

Test the edge case of fewer than 10 recommendations returned by IBCF.

In [None]:
hypothetical_user = user_mov_df.iloc[0, :].copy(deep=True)
hypothetical_user.iloc[:] = np.nan
hypothetical_user.loc[["m6"]] = [5]

test_users = [
    ("Hypothetical user", hypothetical_user)
]

for username, w in test_users:
    print(f"\n{username}\n--{len(username)*'-'}\n{myIBCF(s, w, w_df)}")