# Project 4: Movie Recommender System

CS 598 Practical Statistical Learning

2023-12-10

UIUC Fall 2023

**Authors**
* Ryan Fogle
    - rsfogle2@illinois.edu
    - UIN: 652628818
* Sean Enright
    - seanre2@illinois.edu
    - UIN: 661791377

**Contributions**

### Load in the Data

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
movie_dir = Path('.') / 'ml-1m' / 'ml-1m' 
ratings = pd.read_csv(movie_dir / 'ratings.dat', sep='::', engine = 'python', header=None)
ratings.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']
movies = pd.read_csv(movie_dir / 'movies.dat', sep='::', engine = 'python',
                     encoding="ISO-8859-1", header = None)
movies.columns = ['MovieID', 'Title', 'Genres']
users = pd.read_csv(movie_dir / 'users.dat', sep='::', engine='python', header=None)
users.columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode']

# Create one row per (movie, genre) pair, then join movies and ratings together.
movies['Genres'] = movies['Genres'].str.split('|')
df = movies.explode('Genres')
df = df.merge(ratings, on=['MovieID'], how='left')
df.rename(columns={'Genres': 'Genre'}, inplace=True)
df

Create a MovieID -> Title map for later

In [None]:
mov_title_map = dict(zip(movies['MovieID'], movies['Title']))
list(mov_title_map.items())[:5]

## System I: Recommendation Based on Genres

General idea: Use Bayesian probability to calculate new ratings that incorporate prior assumptions, then rank movies by this new rating to select the top 5 per genre.

This algorithm is based on this Stack Overflow post:
https://stackoverflow.com/questions/2495509/how-to-balance-number-of-ratings-versus-the-ratings-themselves

$$
\tilde{R} = \frac{\bar{w} \bar{r} + \sum_{i=1}^{n}{r_i}}{\bar{w} + n}
$$

Where:
- $\tilde{R}$ is the new rating
- $\bar{w}$ is the predefined number of ratings (weight) to include in our prior assumption
- $\bar{r}$ is the predefined average rating to include in our prior assumption
- $n$ is the number of ratings
- $r_i$ is the rating for a given entry.

**Interpretation**:

Consider $\bar{r}$ to be the prior average rating for a given genre, and $\bar{w}$ to be the number of times that prior rating is counted. When $n=0$, the rating of a movie is $\frac{\bar{w} \bar{r}}{\bar{w}} = \bar{r}$; each new rating then moves the estimate slightly away from the prior and toward the movie's observed ratings.
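
One way to see this shrinkage is to rearrange the formula above. The symbol $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} r_i$ (the movie's observed mean rating, defined for $n > 0$) is introduced here only for illustration and is not used elsewhere in this notebook:

$$
\tilde{R} = \frac{\bar{w}}{\bar{w} + n}\,\bar{r} + \frac{n}{\bar{w} + n}\,\bar{x}
$$

The new rating is a weighted average of the prior rating and the observed mean. When $n$ is small relative to $\bar{w}$ it stays close to $\bar{r}$; as $n$ grows it approaches $\bar{x}$.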

In our implementation, we define $\bar{r}$ to be the genre's median rating, taken over the average ratings of the movies in that genre. We define $\bar{w}$ to be the 25th percentile of the number of ratings per movie in that genre.
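
As a minimal sketch of how the formula behaves, the snippet below applies it to toy inputs. The function name `weighted_rating_sketch` and the prior values (a prior rating of 3.0 counted 10 times) are hypothetical and chosen only for illustration; they are not taken from the MovieLens data.

```python
def weighted_rating_sketch(ratings, prior_mean, prior_weight):
    """Return (prior_weight * prior_mean + sum of ratings) / (prior_weight + n)."""
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)

# Hypothetical prior: rating 3.0, counted as 10 pseudo-ratings.
none = weighted_rating_sketch([], prior_mean=3.0, prior_weight=10)
few = weighted_rating_sketch([5, 5], prior_mean=3.0, prior_weight=10)
many = weighted_rating_sketch([5] * 200, prior_mean=3.0, prior_weight=10)

print(f"no ratings:  {none:.3f}")
print(f"2 ratings:   {few:.3f}")
print(f"200 ratings: {many:.3f}")
```

A movie with two perfect ratings stays close to the prior, while one with two hundred perfect ratings ends up close to 5, which is the intended balance between rating value and rating count.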

In [None]:
# group by Genre and Movie, this will be used to find median ratings and percentile counts for our prior
gb = df.groupby(['Genre', 'MovieID'])

# Median Genre ratings of Average Movie ratings
median_ratings = gb['Rating'].mean().reset_index().groupby('Genre')['Rating'].median().reset_index()
median_ratings = dict(zip(median_ratings['Genre'], median_ratings['Rating']))
median_ratings

In [None]:
# Grab 25th Percentile of count by genre
quantile_count = gb['Timestamp'].count().reset_index().groupby('Genre')['Timestamp'].quantile(0.25).reset_index()
quantile_count.columns = ['Genre', 'Count']
quantile_count = dict(zip(quantile_count['Genre'], quantile_count['Count']))
quantile_count

Run the algorithm: compute the weighted rating for every (genre, movie) pair.

In [None]:
weighted_ratings = []
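for (genre, movie_id), movie in gb:
    # Count only actual ratings: after the left merge, an unrated movie has one row with a NaN Rating
    n = movie['Rating'].count()
    w = quantile_count[genre]  # prior weight: 25th percentile of ratings per movie in the genre
    r = median_ratings[genre]  # prior rating: median of the genre's average movie ratings
    weighted_rating = (r * w + movie['Rating'].sum()) / (w + n)
    # mean() skips NaN and yields NaN (rather than 0) for a movie with no ratings
    weighted_ratings.append((genre, movie_id, mov_title_map[movie_id], weighted_rating, movie['Rating'].mean(), n))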

w_df = pd.DataFrame(weighted_ratings, columns=['Genre', 'MovieID', 'Title', 'WeightedRating', 'AverageRating', '# of Ratings'])
w_df

Sort by WeightedRating, group by genre, and keep the first ten rows per genre (`head(n=10)`), which include the top five.
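
The sort-then-`head` pattern can be sketched on a toy table. The DataFrame `toy`, its titles and ratings, and the cutoff `top_k` are all hypothetical and only illustrate the mechanics:

```python
import pandas as pd

# Hypothetical weighted ratings for five movies in two genres.
toy = pd.DataFrame({
    'Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy'],
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'WeightedRating': [4.1, 4.6, 3.9, 3.2, 4.0],
})

top_k = 2  # hypothetical cutoff for this sketch

top = (
    toy.sort_values('WeightedRating', ascending=False)  # best movies first
       .groupby('Genre')
       .head(top_k)                                      # first top_k rows of each genre
       .sort_values(['Genre', 'WeightedRating'], ascending=[True, False])
)
print(top)
```

Because `groupby(...).head(k)` preserves the row order of the sorted frame, sorting by rating first means the rows kept for each genre are its `k` highest-rated movies. The final sort only arranges the output by genre for readability.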

In [None]:
sysI_recs = w_df.sort_values('WeightedRating', ascending=False).groupby('Genre').head(n=10).sort_values(['Genre', 'WeightedRating'], ascending=[True, False])
sysI_recs

Output the System I recommendations for the dashboard to use.

In [None]:
sysI_recs.to_csv('sysI_recs.csv', index=False)

## System II: Recommendation Based on IBCF

In [None]:
usr_mov_df = pd.read_csv('Rmat.csv')
usr_mov_df