We consider the following five simple baseline models. Their performance sets the bar that any subsequent machine learning model should beat.

1. **Global Mean Rating**:
This model predicts the global mean rating for all user-item pairs, serving as the most basic form of recommendation.

2. **User Mean Rating**:
For this model, the mean rating of each user is calculated and used to predict ratings for all items the user has not yet interacted with.

3. **Item Mean Rating**:
In contrast to the User Mean Rating, this model focuses on the mean rating of each item and uses it to predict ratings for all users.

4. **User-Item Mean Rating**:
This model predicts the rating for a user-item pair as the average of the user's mean rating and the item's mean rating. The formula is:
$$\text{prediction} = \frac{\text{User Mean Rating} + \text{Item Mean Rating}}{2}$$

5. **Weighted Mean Ratings**:
This model uses a weighted average of the user mean and item mean ratings. The weight $w$ can be adjusted based on domain understanding; setting $w = 0.5$ recovers the User-Item Mean Rating model. The formula is:
$$\text{prediction} = w \times \text{User Mean Rating} + (1 - w) \times \text{Item Mean Rating}, \quad \text{where } 0 \leq w \leq 1$$
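As a rough illustration of the five formulas above, the following sketch computes each prediction on a tiny, made-up ratings table. The column names mirror the ratings data (`userId`, `movieId`, `rating`). The toy values, the helper name `baseline_predictions`, the choice of `w`, and the fallback to the global mean for unseen users or items are assumptions made for this example, not details taken from the project code.

```python
import pandas as pd


def baseline_predictions(train, pairs, w=0.5):
    """Attach the five baseline predictions to each (userId, movieId) pair.

    Hypothetical helper for illustration only. Users or items that do not
    appear in `train` fall back to the global mean (an assumption).
    """
    global_mean = train["rating"].mean()
    user_mean = train.groupby("userId")["rating"].mean()
    item_mean = train.groupby("movieId")["rating"].mean()

    out = pairs.copy()
    # 1. Global mean: the same value for every pair.
    out["global_mean_rating"] = global_mean
    # 2. User mean: look up each user's average training rating.
    out["user_mean_rating"] = out["userId"].map(user_mean).fillna(global_mean)
    # 3. Item mean: look up each item's average training rating.
    out["item_mean_rating"] = out["movieId"].map(item_mean).fillna(global_mean)
    # 4. User-item mean: simple average of (2) and (3).
    out["user_item_mean_rating"] = (
        out["user_mean_rating"] + out["item_mean_rating"]
    ) / 2
    # 5. Weighted mean: w on the user mean, (1 - w) on the item mean.
    out["weighted_mean_rating"] = (
        w * out["user_mean_rating"] + (1 - w) * out["item_mean_rating"]
    )
    return out


# Toy data, invented for this example.
toy_train = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 2.0, 5.0, 3.0],
})
toy_pairs = pd.DataFrame({"userId": [1, 2], "movieId": [30, 20]})

toy_preds = baseline_predictions(toy_train, toy_pairs, w=0.75)
print(toy_preds)
```

All statistics are computed from the training frame only, so the pairs being scored never leak their own ratings into the predictions.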

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
movies_df = pd.read_csv("../data/movies_metadata_after_eda.csv")
ratings_df = pd.read_csv("../data/ratings_small.csv")

In [3]:
# Hold out 30% of the ratings for evaluation; the fixed seed keeps the split reproducible.
train_df, test_df = train_test_split(ratings_df, test_size=0.3, random_state=42)

In [5]:
from sameer.ModelExperimentation import calculate_user_item_mean_rating

test_df = calculate_user_item_mean_rating(train_df, test_df)
test_df.head()

| | userId | movieId | rating | timestamp | global_mean_rating | user_mean_rating | item_mean_rating | user_item_mean_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 1028 | 5.0 | 1049690908 | 3.540256 | 3.844444 | 3.836364 | 3.840404 |
| 1 | 665 | 4736 | 1.0 | 1010197684 | 3.540256 | 3.294304 | 3.540256 | 3.41728 |
| 2 | 120 | 4002 | 3.0 | 1167420604 | 3.540256 | 3.573684 | 3.318182 | 3.445933 |
| 3 | 257 | 1274 | 4.0 | 1348544094 | 3.540256 | 3.80137 | 3.791667 | 3.796518 |
| 4 | 468 | 6440 | 4.0 | 1296191715 | 3.540256 | 2.946196 | 3.9 | 3.423098 |
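
One way to turn these prediction columns into baseline metrics is to compute an error measure per column. The sketch below uses root mean squared error (RMSE); that choice is an assumption, since the notebook does not state which metric it uses, and the helper name `rmse_per_baseline` is hypothetical. The sample frame copies the five rows printed above, so the resulting scores describe only those rows, not the full test set.

```python
import numpy as np
import pandas as pd


def rmse_per_baseline(df, prediction_cols, target_col="rating"):
    """Return the RMSE of each prediction column against the true rating.

    Hypothetical helper; RMSE is an assumed metric for this sketch.
    """
    return pd.Series({
        col: float(np.sqrt(((df[col] - df[target_col]) ** 2).mean()))
        for col in prediction_cols
    })


# The five rows shown in the output above, copied by hand.
sample = pd.DataFrame({
    "rating": [5.0, 1.0, 3.0, 4.0, 4.0],
    "global_mean_rating": [3.540256] * 5,
    "user_mean_rating": [3.844444, 3.294304, 3.573684, 3.80137, 2.946196],
    "item_mean_rating": [3.836364, 3.540256, 3.318182, 3.791667, 3.9],
    "user_item_mean_rating": [3.840404, 3.41728, 3.445933, 3.796518, 3.423098],
})

baseline_cols = [
    "global_mean_rating",
    "user_mean_rating",
    "item_mean_rating",
    "user_item_mean_rating",
]
scores = rmse_per_baseline(sample, baseline_cols)
print(scores.sort_values())
```

On the real data, the same call could be applied to the full `test_df`, where a lower RMSE indicates a stronger baseline.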
