# Content-Based Recommender System using Naive Bayes

This notebook implements two content-based recommenders on the MovieLens dataset, followed by an evaluation methodology:
1. User-specific recommender using Naive Bayes (user profile models)
2. Global recommender: a single Naive Bayes model trained on movie tags for all users
3. Evaluation methodology for realistic recommendation performance

### Load and Preprocess Data

In [1]:
import pandas as pd
import os
import re

DATA_PATH = "../ml-latest-small"

ratings = pd.read_csv(os.path.join(DATA_PATH, "ratings.csv"))
movies = pd.read_csv(os.path.join(DATA_PATH, "movies.csv"))
tags = pd.read_csv(os.path.join(DATA_PATH, "tags.csv"))

## 🧹 Preprocess Movie Metadata
tags_agg = tags.groupby("movieId")["tag"].apply(lambda x: " ".join(x)).reset_index()
movies = movies.merge(tags_agg, on="movieId", how="left")
movies["tag"] = movies["tag"].fillna("")
# regex=False treats "|" as a literal separator rather than a regex alternation
movies["content"] = movies["genres"].str.replace("|", " ", regex=False) + " " + movies["tag"]

In [2]:
movies.head()

Unnamed: 0,movieId,title,genres,tag,content
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar pixar fun,Adventure Animation Children Comedy Fantasy pi...
1,2,Jumanji (1995),Adventure|Children|Fantasy,fantasy magic board game Robin Williams game,Adventure Children Fantasy fantasy magic board...
2,3,Grumpier Old Men (1995),Comedy|Romance,moldy old,Comedy Romance moldy old
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,,Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy,pregnancy remake,Comedy pregnancy remake


## 1. User-Specific Naive Bayes Recommender

The model is trained on metadata including the movie title and genres, with titles cleaned to remove release years.

In [18]:
# metadata available for each movie
metadata = movies[["movieId", "title", "genres"]]
metadata.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Removing the release year keeps tokens such as "1995" out of the TF-IDF vocabulary built from the titles. The year says little about what a title is about, so dropping it reduces noise and lets the title features reflect the words of the title itself.

In [19]:
# Clean the title by removing the year in parentheses
def clean_title(title):
    return re.sub(r'\s*\(\d{4}\)', '', title)

In [20]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (metadata was created as a column slice of movies)
metadata = metadata.assign(title=metadata['title'].apply(clean_title))


In [21]:
metadata.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,Adventure|Children|Fantasy
2,3,Grumpier Old Men,Comedy|Romance
3,4,Waiting to Exhale,Comedy|Drama|Romance
4,5,Father of the Bride Part II,Comedy
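
The regular expression only removes parenthesized four-digit groups, so other parentheticals in a title survive. The sketch below checks this on a few illustrative strings. `clean_title_strict` is a hypothetical variant, not part of the notebook, that only strips a year at the end of the title and also tolerates trailing whitespace.

```python
import re

def clean_title(title):
    # Same pattern as the notebook: drop any "(YYYY)" group and the whitespace before it
    return re.sub(r'\s*\(\d{4}\)', '', title)

def clean_title_strict(title):
    # Hypothetical variant: only strip a year at the very end, including trailing spaces
    return re.sub(r'\s*\(\d{4}\)\s*$', '', title)

# Illustrative strings, not read from the dataset
examples = [
    "Toy Story (1995)",
    "Ichi the Killer (Koroshiya 1) (2001)",  # non-year parenthetical is kept
    "Sneakers (1992) ",                      # trailing space after the year
]
for t in examples:
    print(repr(clean_title(t)), repr(clean_title_strict(t)))
```

The two functions differ only on the last string: `clean_title` leaves the trailing space in place, while the stricter variant removes it.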


### Binarize the genres

Binarizing genres turns the pipe-separated genre string into one 0/1 column per genre, a format that machine learning models can consume directly. A movie with several genres simply gets a 1 in each matching column. Note that Naive Bayes treats these columns as conditionally independent given the label, so the encoding by itself does not model interactions between genres.

In [22]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(metadata['genres'].str.split('|'))

In [23]:
genres_encoded

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [24]:
# Create a DataFrame with the encoded genres
genres_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)
genres_df.head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [25]:
# Concatenate the original metadata with the encoded genres
metadata = pd.concat([metadata[['movieId', 'title']], genres_df], axis=1)
# metadata = metadata.drop(columns=['(no genres listed)'])
metadata.head()

Unnamed: 0,movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

tfidf = TfidfVectorizer(max_features=1000, stop_words='english')

# Create a preprocessor that transforms the movie metadata:
# - Applies TF-IDF vectorization to the cleaned 'title' column to extract textual features.
# - Passes through the binary genre columns (already transformed by MultiLabelBinarizer).
# - Drops any remaining columns that are not explicitly selected.
preprocessor = ColumnTransformer(
    transformers=[
        ('tfidf', tfidf, 'title'),
        ('genres', 'passthrough', genres_df.columns)
    ],
    remainder='drop'
)
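
To make the shape of the preprocessor's output concrete, here is a small sketch on a made-up three-row frame. The frame `toy`, the name `toy_preprocessor`, and the two genre columns are illustrative assumptions. The vectorizer is left at its defaults so the vocabulary size is easy to count by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data for illustration only
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "Comedy": [1, 0, 1],
    "Fantasy": [1, 1, 0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # A string column name passes a 1-D column, which TfidfVectorizer expects
        ("tfidf", TfidfVectorizer(), "title"),
        ("genres", "passthrough", ["Comedy", "Fantasy"]),
    ],
    remainder="drop",  # movieId is not used as a feature
)

features = toy_preprocessor.fit_transform(toy)
n_title_terms = len(toy_preprocessor.named_transformers_["tfidf"].vocabulary_)
print(features.shape, n_title_terms)
```

The output has one column per title term followed by one column per genre, so here the width is six title terms plus two genre columns.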

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def user_specific_model(user_id, top_k=10):
    # Join this user's ratings with the movie features
    user_ratings = ratings[ratings['userId'] == user_id]
    user_data = pd.merge(user_ratings, metadata, on='movieId')

    # Create labels: like (1) for ratings >= 4, dislike (0) for ratings <= 2, drop neutral ratings
    user_data['label'] = user_data['rating'].apply(lambda r: 1 if r >= 4 else (0 if r <= 2 else None))
    user_data = user_data.dropna(subset=['label'])
    user_data['label'] = user_data['label'].astype(int)

    # predict_proba(...)[:, 1] below needs both classes in the training labels;
    # this also covers the case where no labeled ratings remain
    if user_data['label'].nunique() < 2:
        return pd.DataFrame()

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', MultinomialNB())
    ])

    X_train = user_data.drop(columns=['userId', 'rating', 'label', 'timestamp'])  # drop irrelevant columns
    y_train = user_data['label']

    pipeline.fit(X_train, y_train)

    # Score only movies the user has not rated yet
    seen = user_ratings['movieId'].unique()
    candidate_pool = metadata[~metadata['movieId'].isin(seen)].copy()
    input_cols = list(X_train.columns)  # X used during training

    candidate_pool['score'] = pipeline.predict_proba(candidate_pool[input_cols])[:, 1]
    return candidate_pool.sort_values("score", ascending=False)[['movieId', 'title', 'score']].head(top_k)

In [34]:
user_specific_model(user_id=37)

Unnamed: 0,movieId,title,score
5657,27549,Dead or Alive: Final,0.999369
454,519,RoboCop 3,0.999117
2248,2985,RoboCop,0.999117
4843,7235,Ichi the Killer (Koroshiya 1),0.999023
4693,7007,"Last Boy Scout, The",0.998991
3657,5027,Another 48 Hrs.,0.998991
19,20,Money Train,0.998991
1076,1396,Sneakers,0.998991
3989,5628,Wasabi,0.998991
1103,1432,Metro,0.998991
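
The labeling rule used inside `user_specific_model` can be checked in isolation. The sketch below applies the same thresholds to the half-star scale from 0.5 to 5.0 that MovieLens ratings use. The helper name `rating_to_label` is hypothetical and only mirrors the lambda in the cell above.

```python
def rating_to_label(rating):
    """Return 1 for a like, 0 for a dislike, None for a neutral rating."""
    if rating >= 4:
        return 1
    if rating <= 2:
        return 0
    return None

# Half-star rating scale
ratings_scale = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
labels = {r: rating_to_label(r) for r in ratings_scale}

# Ratings that survive the dropna() step
kept = [r for r, lab in labels.items() if lab is not None]
print(labels)
print(kept)
```

Ratings of 2.5, 3.0 and 3.5 map to `None` and are dropped, so the model never sees a user's lukewarm ratings.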


### Evaluate the model

In [41]:
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from tqdm import tqdm

def evaluate_accuracy_all_users(ratings_df, metadata_df, min_ratings=50):
    results = []

    user_ids = ratings_df['userId'].value_counts()
    user_ids = user_ids[user_ids >= min_ratings].index

    for user_id in tqdm(user_ids, desc="Evaluating accuracy"):
        user_ratings = ratings_df[ratings_df['userId'] == user_id]
        user_data = pd.merge(user_ratings, metadata_df, on='movieId')

        # Create binary labels: like (1), dislike (0), ignore neutral
        user_data['label'] = user_data['rating'].apply(lambda r: 1 if r >= 4 else (0 if r <= 2 else None))
        user_data = user_data.dropna(subset=['label'])
        user_data['label'] = user_data['label'].astype(int)

        if len(user_data) < min_ratings:
            continue

        # Train/test split
        train = user_data.sample(frac=0.8, random_state=42)
        test = user_data.drop(train.index)

        if test.empty or train.empty:
            continue

        X_train = train.drop(columns=['userId', 'rating', 'label', 'timestamp'])
        y_train = train['label']
        X_test = test.drop(columns=['userId', 'rating', 'label', 'timestamp'])
        y_test = test['label']

        try:
            pipeline = Pipeline([
                ('preprocessor', preprocessor),
                ('classifier', MultinomialNB())
            ])

            pipeline.fit(X_train, y_train)
            y_pred = pipeline.predict(X_test)
            acc = accuracy_score(y_test, y_pred)

            results.append({
                'userId': user_id,
                'accuracy': acc,
                'n_train': len(train),
                'n_test': len(test)
            })

        except Exception:
            # Skip users whose data cannot be fitted (e.g. an empty TF-IDF vocabulary)
            continue

    return pd.DataFrame(results)

In [42]:
results_df = evaluate_accuracy_all_users(ratings, metadata)
print(results_df)

Evaluating accuracy: 100%|██████████| 385/385 [00:03<00:00, 107.63it/s]

     userId  accuracy  n_train  n_test
0       414  0.700297     1349     337
1       599  0.834286      699     175
2       474  0.764423      834     208
3       448  0.624413      851     213
4       274  0.683673      391      98
..      ...       ...      ...     ...
301     585  1.000000       44      11
302     267  0.818182       44      11
303     413  1.000000       42      10
304     348  1.000000       42      11
305     224  1.000000       43      11

[306 rows x 4 columns]





In [43]:
results_df.describe()

Unnamed: 0,userId,accuracy,n_train,n_test
count,306.0,306.0,306.0,306.0
mean,302.287582,0.840202,141.849673,35.421569
std,182.190233,0.133264,141.621072,35.40189
min,1.0,0.4,40.0,10.0
25%,139.25,0.75,57.0,14.0
50%,299.5,0.857143,91.0,23.0
75%,461.5,0.948381,177.0,44.75
max,610.0,1.0,1349.0,337.0
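
Accuracy is easier to interpret next to a trivial baseline. If most of a user's labeled ratings are likes, always predicting "like" already scores well. The sketch below computes that majority-class baseline on made-up labels. The helper `majority_baseline_accuracy` is a hypothetical name, and nothing here is computed from the notebook's data.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(1 for y in y_test if y == majority_label) / len(y_test)

# Made-up labels for illustration only (1 = like, 0 = dislike)
y_train = [1, 1, 1, 0, 1, 1, 0, 1]
y_test = [1, 1, 0, 1, 1]

baseline = majority_baseline_accuracy(y_train, y_test)
print(baseline)
```

Inside `evaluate_accuracy_all_users`, the same calculation could be run on each user's `y_train` and `y_test` and stored next to `accuracy`, so the two can be compared per user.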


## 2. Global Content-Based Recommender (Single Model for All Users)

In [None]:
def train_global_model():
    data = pd.merge(ratings, tags_agg, on='movieId')
    data = data.dropna(subset=['tag'])

    data['label'] = data['rating'].apply(lambda r: 1 if r >= 4 else (0 if r <= 2 else None))
    data = data.dropna(subset=['label'])
    data['label'] = data['label'].astype(int)

    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    X = tfidf.fit_transform(data['tag'])
    y = data['label']

    model = MultinomialNB()
    model.fit(X, y)
    return model, tfidf

model_global, tfidf_global = train_global_model()

def recommend_global(user_id, top_k=10):
    seen = ratings[ratings['userId'] == user_id]['movieId'].unique()
    unseen = tags_agg[~tags_agg['movieId'].isin(seen)].copy()
    X_test = tfidf_global.transform(unseen['tag'])

    unseen['score'] = model_global.predict_proba(X_test)[:, 1]
    result = unseen.merge(movies[['movieId', 'title']], on='movieId')
    return result.sort_values('score', ascending=False)[['movieId', 'title', 'score']].head(top_k)

### 🔍 Test it:

In [None]:
recommend_global(user_id=1)
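
In `recommend_global`, the `user_id` is used only to filter out movies the user has already rated. The score itself depends on the tags alone, so every user sees the same ranking over their unseen movies. The toy sketch below shows the same TF-IDF plus Naive Bayes pattern on made-up tag strings, and why column 1 of `predict_proba` is the "like" probability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up tag strings and labels for illustration only (1 = like, 0 = dislike)
tag_docs = [
    "pixar fun animation",
    "dark gritty violence",
    "fun family pixar",
    "violence revenge dark",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tag_docs)
model = MultinomialNB().fit(X, labels)

# Columns of predict_proba follow model.classes_, so column 1 is P(like)
new_docs = ["pixar family", "gritty revenge"]
proba = model.predict_proba(vectorizer.transform(new_docs))
print(model.classes_)
print(proba[:, 1])
```

The first new document shares terms only with liked examples and the second only with disliked ones, so their "like" probabilities fall on opposite sides of 0.5.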

## 3. Evaluation Methodology (Candidate Pool Strategy)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, roc_auc_score

def evaluate_user_model(user_id):
    user_ratings = ratings[ratings['userId'] == user_id]
    if len(user_ratings) < 5:
        return None

    # Like (1) for ratings >= 4, dislike (0) for ratings <= 2; neutral ratings are dropped
    def add_labels(df):
        df = df.copy()
        df['label'] = df['rating'].apply(lambda r: 1 if r >= 4 else (0 if r <= 2 else None))
        df = df.dropna(subset=['label'])
        df['label'] = df['label'].astype(int)
        return df

    train, test = train_test_split(user_ratings, test_size=0.4, random_state=42)
    train_data = add_labels(pd.merge(train, movies, on='movieId'))
    # Label the test set with the same rule as the training set
    test_data = add_labels(pd.merge(test, movies, on='movieId'))

    # Both classes are needed to fit the classifier and to compute AUC
    if len(train_data) < 3 or train_data['label'].nunique() < 2 or test_data['label'].nunique() < 2:
        return None

    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    X_train = tfidf.fit_transform(train_data['content'])
    y_train = train_data['label']

    model = MultinomialNB()
    model.fit(X_train, y_train)

    X_test = tfidf.transform(test_data['content'])
    y_test = test_data['label']

    preds = model.predict(X_test)
    # zero_division=0 avoids a warning when the model predicts no positives
    precision = precision_score(y_test, preds, zero_division=0)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    return {
        'user_id': user_id,
        'precision': precision,
        'auc': auc
    }


### 🔍 Test it:

In [None]:
evaluate_user_model(user_id=1)
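
The section title mentions a candidate pool, which `evaluate_user_model` does not build explicitly. One possible reading, stated here as an assumption rather than the notebook's method, is to rank a pool of candidate movies by model score and measure how many of the top-k are held-out likes. The sketch below shows that calculation on made-up scores. `precision_at_k`, `candidate_scores` and `held_out_likes` are hypothetical names.

```python
def precision_at_k(ranked_movie_ids, relevant_movie_ids, k=10):
    """Fraction of the top-k ranked items that are in the relevant set."""
    top_k = ranked_movie_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_movie_ids)
    return sum(1 for m in top_k if m in relevant) / len(top_k)

# Made-up model scores for a candidate pool (movieId -> score), illustration only
candidate_scores = {10: 0.91, 20: 0.40, 30: 0.75, 40: 0.10, 50: 0.66}
# Made-up held-out movies the user liked
held_out_likes = {10, 50}

# Rank the pool by descending score, then score the top 3
ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
p_at_3 = precision_at_k(ranked, held_out_likes, k=3)
print(ranked)
print(p_at_3)
```

Unlike accuracy on rated movies only, this kind of metric depends on how the candidate pool is assembled, so the pool construction would need to be fixed and reported alongside the number.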