# Hybrid approach

The hybrid model combines multiple recommendation techniques to improve the accuracy and robustness of predictions. This implementation blends a collaborative-filtering recommender with a content-based recommender, both built on cosine similarity.

Cosine similarity measures the cosine of the angle between two vectors, representing the similarity between user or item profiles. By combining the predictions from multiple models based on cosine similarity, the hybrid model aims to provide more personalized and accurate recommendations by considering both collaborative and content-based aspects.
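
As a minimal sketch of the definition above, cosine similarity is the dot product of two vectors divided by the product of their norms. The vectors below are made-up watch-ratio profiles, not values from KuaiRec, and the helper name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen for this sketch: an all-zero vector is similar to nothing.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical watch-ratio profiles over the same four videos.
alice = [1.0, 0.0, 2.0, 0.0]
bob = [2.0, 0.0, 4.0, 0.0]    # same direction as alice, twice the magnitude
carol = [0.0, 3.0, 0.0, 1.0]  # shares no watched video with alice

print(cosine(alice, bob))     # close to 1: only the direction matters
print(cosine(alice, carol))   # 0.0: no overlap at all
```

Because the measure ignores magnitude, a user who watches everything twice as long as another user still counts as a near-perfect match.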

This approach allows us to take advantage of the strengths of the individual models while mitigating their weaknesses, ultimately offering a more balanced recommendation system.

## Setup

In [1]:
%%bash
DATA_FOLDER="../data"
if [ ! -d "$DATA_FOLDER" ]; then
    wget --no-check-certificate "https://drive.usercontent.google.com/download?id=1qe5hOSBxzIuxBb1G_Ih5X-O65QElollE&export=download&confirm=t&uuid=b2002093-cc6e-4bd5-be47-9603f0b33470" -O KuaiRec.zip
    unzip KuaiRec.zip -d "$DATA_FOLDER"
    rm KuaiRec.zip
fi

In [2]:
import pandas as pd
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [3]:
# Load the datasets.
data = "../data/KuaiRec 2.0/data"
interactions = pd.read_csv(f"{data}/big_matrix.csv")
categories = pd.read_csv(f"{data}/item_categories.csv")

In [4]:
# Drop rows with missing values, duplicate rows, and rows with a negative timestamp.
interactions = interactions.dropna()
interactions = interactions.drop_duplicates()
interactions = interactions[interactions["timestamp"] >= 0]

# Drop the columns that are not used by the recommenders.
interactions.drop(columns=["play_duration", "video_duration", "time", "date", "timestamp"], inplace=True)

# Drop rows with missing values and duplicate rows.
categories = categories.dropna()
categories = categories.drop_duplicates()

# Convert the feat column from a string representation of an array into an array.
categories["feat"] = categories["feat"].apply(json.loads)
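
To make the last conversion concrete, here is a small sketch of what happens to one `feat` value. The raw string is a hypothetical example of the format, not a row copied from `item_categories.csv`.

```python
import json

raw_feat = "[6, 11, 26]"          # hypothetical raw cell value (a string)
feat = json.loads(raw_feat)       # parsed into a Python list of ints

# Later cells join the IDs into one space-separated "document" per video.
doc = " ".join(map(str, feat))

print(feat)   # [6, 11, 26]
print(doc)    # 6 11 26
```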

## Collaborative filtering using cosine similarity

In [5]:
user_item_matrix = interactions.pivot_table(
    index="user_id", columns="video_id", values="watch_ratio",
    aggfunc="max", fill_value=0
)

In [6]:
user_similarity = cosine_similarity(user_item_matrix)

In [7]:
def recommend_videos_by_users(user_id, nb_top_similar=10, res_length=10):
    """
    Get top content recommendations for a given user using collaborative filtering.

    The function finds the most similar users to the target `user_id` and aggregates their preferences
    to return a ranked list of recommended items.

    Args:
        user_id (int): The ID of the user for whom to generate recommendations.
        nb_top_similar (int, optional): Number of most similar users to consider. Default is 10.
        res_length (int, optional): Number of top recommendations to return. Default is 10.

    Returns:
        pandas.Series: A sorted list of recommended items for the specified user.
    """
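    # Positions of the most similar rows, best first. This slice yields nb_top_similar - 1
    # positions, and the first one is normally the user's own row (self-similarity).
    similar_users = user_similarity[user_id].argsort()[:-nb_top_similar:-1]
    # Average the watch ratios of those users per video, highest first. Positions are used
    # as labels here, which assumes user IDs run from 0 to n-1 in matrix order.
    recommended_items = user_item_matrix.loc[similar_users].mean(axis=0).sort_values(ascending=False)

    # Keep only the videos the target user has not watched yet.
    items_user_has_watched = user_item_matrix.loc[user_id][user_item_matrix.loc[user_id] != 0].index
    recommended_items = recommended_items[~recommended_items.index.isin(items_user_has_watched)]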

    return recommended_items.head(res_length)
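
The slice `[:-nb_top_similar:-1]` in `recommend_videos_by_users` is easy to misread, so here is a small pure-Python sketch of what it selects. The similarity row is made up, and `argsort` is a stand-in for NumPy's `ndarray.argsort`.

```python
def argsort(values):
    """Indices that sort `values` in ascending order (stand-in for NumPy's argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical similarity row for user 2; position 2 holds the self-similarity.
similarity_row = [0.10, 0.80, 1.00, 0.40, 0.65]
user_id = 2
nb_top_similar = 3

order = argsort(similarity_row)

# The notebook's slice: walk backwards from the end, stop before position -nb_top_similar.
as_written = order[:-nb_top_similar:-1]

# A variant that skips the user and keeps exactly nb_top_similar neighbours.
neighbours = [i for i in reversed(order) if i != user_id][:nb_top_similar]

print(as_written)   # [2, 1]
print(neighbours)   # [1, 4, 3]
```

With these values the slice as written returns two positions, one of which is the user. The variant is only a suggestion; the rest of this notebook keeps the slice as written.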


## Content-based filtering using cosine similarity

In [8]:
# Create TF-IDF matrix from the video feats.
data_as_strings = [" ".join(map(str, doc)) for doc in categories["feat"]]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data_as_strings)
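
One thing worth checking in the cell above: the feature IDs become tokens such as `"6"` or `"11"`, and `TfidfVectorizer`'s default `token_pattern` is `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more characters. The sketch below applies that regular expression directly with `re` to a hypothetical document. Passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer would be one way to keep single-digit IDs; the cell above uses the default.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's default token_pattern
SINGLE_CHAR_OK = r"(?u)\b\w+\b"      # alternative that also keeps one-character tokens

doc = "6 11 26 8"                    # hypothetical space-joined feature IDs

kept_by_default = re.findall(DEFAULT_PATTERN, doc)
kept_by_alternative = re.findall(SINGLE_CHAR_OK, doc)

print(kept_by_default)       # ['11', '26']
print(kept_by_alternative)   # ['6', '11', '26', '8']
```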


In [9]:
content_based_cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
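
`linear_kernel` computes plain dot products. That matches cosine similarity here because `TfidfVectorizer` L2-normalises each row by default (`norm="l2"`). A small pure-Python check of that identity, using made-up vectors and illustrative helper names:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical unnormalised TF-IDF rows for two videos.
video_a = [0.0, 1.2, 0.7]
video_b = [0.5, 0.9, 0.0]

# Dot product of the unit-length rows, which is what a linear kernel would return.
dot_of_normalized = dot(l2_normalize(video_a), l2_normalize(video_b))

print(math.isclose(dot_of_normalized, cosine(video_a, video_b)))  # True
```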

In [10]:
def recommend_similar_videos(video_id, res_length=20):
    """
    Retrieve top-N videos similar to a given video using content-based filtering.

    This function returns a list of videos most similar to the specified `video_id`,
    based on content features and cosine similarity from TF-IDF.

    Args:
        video_id (int): The ID of the reference video for which similar videos are to be found.
        res_length (int, optional): The number of most similar videos to return. Default is 20.

    Returns:
        pandas.Series: A Series containing the IDs of the top-N similar videos,
                       sorted by descending similarity score.
    """
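    # Row label of the query video. It is used below as a position in the similarity
    # matrix, which assumes `categories` still has its default 0..n-1 index.
    idx = categories[categories["video_id"] == video_id].index[0]
    sim_scores = list(enumerate(content_based_cosine_sim[idx]))
    # Sort by descending similarity and skip the first entry, assumed to be the query video itself.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:res_length+1]

    # Extract positions and scores separately.
    video_indices = [i[0] for i in sim_scores]
    scores = [i[1] for i in sim_scores]

    video_ids = categories["video_id"].iloc[video_indices].reset_index(drop=True)

    return pd.Series(scores, index=video_ids, name="similarity")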
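
`recommend_similar_videos` drops the first entry of the sorted list on the assumption that it is the query video. When another video has exactly the same features, its score ties with the query's, and a stable sort may put it first. The sketch below, with invented scores and a hypothetical helper `top_similar`, compares dropping by rank with filtering the query out by position.

```python
def top_similar(scores, query_idx, n):
    """Return the n highest-scoring positions, excluding the query itself."""
    ranked = sorted(
        (i for i in range(len(scores)) if i != query_idx),
        key=lambda i: scores[i],
        reverse=True,
    )
    return ranked[:n]

# Hypothetical similarity row for the video at position 2.
# Position 0 has identical features, so it ties with the query at 1.0.
scores = [1.0, 0.3, 1.0, 0.6]

# Dropping by rank, as in the notebook: the tie at position 0 is removed instead of the query.
by_rank = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:3]
by_position = top_similar(scores, query_idx=2, n=2)

print([i for i, _ in by_rank])   # [2, 3]  (the query video is still in the list)
print(by_position)               # [0, 3]
```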

## Hybrid filtering

In [11]:
def get_most_watched_videos_by_user_id(df, user_id, head_count=5):
    """
    Get the most watched videos by a specific user, sorted by watch ratio.

    Args:
        df (pandas.DataFrame): The input DataFrame containing video data, where each row represents a video watched by a user.
        user_id (int or str): The ID of the user for whom the most watched videos are being retrieved.
        head_count (int, optional): The number of top videos to return based on the highest watch ratio. Default is 5.

    Returns:
        pandas.DataFrame: A DataFrame containing the top 'head_count' most watched videos by the specified user,
                           sorted by the 'watch_ratio' column in descending order.
    """
    return df[df["user_id"] == user_id].sort_values(by="watch_ratio", ascending=False).head(head_count)

In [12]:
def recommmend_videos_for_user(user_id, content_weight=0.3, res_length=10):
    """
    Generate hybrid video recommendations for a given user by combining 
    content-based and collaborative filtering approaches.

    This function returns a ranked list of recommended videos by merging 
    the outputs of:
      - Collaborative filtering: finds videos liked by similar users.
      - Content-based filtering: finds videos similar to those the user has interacted with.

    Args:
        user_id (int): The ID of the user for whom to generate recommendations.
        content_weight (float, optional): Weight given to content-based recommendations. 
                                           Collaborative filtering is weighted as (1 - content_weight). Default is 0.3.
        res_length (int, optional): The number of top recommendations to return. Default is 10.

    Returns:
        pandas.Series: A Series where the index contains recommended video IDs and the values
                       represent the aggregated relevance scores (e.g., weighted similarity),
                       sorted in descending order.
    """
    collab_scores = recommend_videos_by_users(user_id, nb_top_similar=10, res_length=50)
    
    # Get the history of the user's favorite videos.
    user_history = get_most_watched_videos_by_user_id(interactions, user_id, 20)["video_id"].tolist()
    
    content_scores = pd.Series(dtype=float)
    for video_id in user_history:
        similar_videos = recommend_similar_videos(video_id, res_length=10)
        content_scores = content_scores.add(similar_videos, fill_value=0)
        
    # Scale each score set so that its maximum is 1, making the two comparable.
    if not content_scores.empty:
        content_scores /= content_scores.max()
    if not collab_scores.empty:
        collab_scores /= collab_scores.max()

    # Combine the two score sets with a weighted sum; a video missing from one set counts as 0 there.
    combined_scores = content_scores.mul(content_weight).add(
        collab_scores.mul(1 - content_weight), fill_value=0
    )

    # Remove the videos used as seeds (the user's top-watched history).
    combined_scores = combined_scores.drop(labels=user_history, errors='ignore')

    return combined_scores.sort_values(ascending=False).head(res_length)
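
The normalise-then-blend step can be sketched without pandas. The dictionaries below stand in for the two score Series, the video IDs and scores are invented, and a missing key plays the role of `fill_value=0`. The helper names are illustrative.

```python
def normalize_by_max(scores):
    """Scale scores so the largest becomes 1.0 (empty or all-zero input is returned as is)."""
    top = max(scores.values(), default=0)
    if top == 0:
        return dict(scores)
    return {vid: s / top for vid, s in scores.items()}

def blend(content, collab, content_weight=0.3):
    """Weighted sum of two score dicts; a video missing from one side contributes 0 there."""
    content = normalize_by_max(content)
    collab = normalize_by_max(collab)
    return {
        vid: content_weight * content.get(vid, 0.0)
        + (1 - content_weight) * collab.get(vid, 0.0)
        for vid in content.keys() | collab.keys()
    }

# Hypothetical raw scores from the two recommenders.
content_scores = {101: 2.0, 102: 1.0}
collab_scores = {102: 4.0, 103: 8.0}

combined = blend(content_scores, collab_scores)
ranking = sorted(combined, key=combined.get, reverse=True)

print(ranking)   # [103, 102, 101]
```

A video found by only one recommender can score at most that recommender's weight, so with `content_weight=0.3` a collaborative-only video tops out at 0.7 and a content-only video at 0.3.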


## Model testing

Let's test our model by recommending videos to user 0.

In [13]:
# The videos recommended by our model.
recommended_videos = recommmend_videos_for_user(0, content_weight=0.3, res_length=5)
recommended_videos

video_id
6013    0.700000
7143    0.514801
2589    0.479550
861     0.397293
6769    0.390690
dtype: float64

In [14]:
# The top 5 most watched videos by the user 0.
most_watched_videos_by_user_0 = get_most_watched_videos_by_user_id(interactions, 0)
most_watched_videos_by_user_0

|      | user_id | video_id | watch_ratio |
|------|---------|----------|-------------|
| 434  | 0       | 7046     | 130.818866  |
| 2014 | 0       | 3289     | 89.249457   |
| 1419 | 0       | 6088     | 33.397893   |
| 17   | 0       | 171      | 33.276021   |
| 2179 | 0       | 6309     | 9.239163    |


In [15]:
def get_features_of_videos(videos):
    """
    Retrieve the unique features associated with a list of video IDs.

    This function looks up each video ID in the global `categories` DataFrame 
    and extracts its associated features from the 'feat' column. It returns a 
    set of all distinct features found across the given videos.

    Args:
        videos (list of int): A list of video IDs for which to collect features.

    Returns:
        set: A set containing all unique features associated with the provided video IDs.
    """
    video_categories = set()
    for video_id in videos:
        feats = categories[categories['video_id'] == video_id]['feat'].values[0]
        video_categories.update(feats)
    return video_categories

In [16]:
# Features of the most watched videos of user 0.
get_features_of_videos(most_watched_videos_by_user_0["video_id"])

{6, 8, 10, 11, 17, 25, 26}

In [17]:
# Features of the recommended videos.
get_features_of_videos(recommended_videos.index)

{6, 11, 26, 28}

## Benchmarks

In [18]:
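# In this notebook, `recommended` and `relevant` are sets of video features, not video IDs.

def precision_at_k(recommended, relevant, k):
    # Number of recommended features that are relevant, divided by the list length k.
    relevant_items_in_top_k = len(recommended.intersection(relevant))
    return relevant_items_in_top_k / k

def recall_at_k(recommended, relevant):
    # Share of the relevant features that were recommended.
    relevant_items_in_top_k = len(recommended.intersection(relevant))
    return relevant_items_in_top_k / len(relevant) if len(relevant) > 0 else 0.0

def f1_score_at_k(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

def hit_ratio_at_k(recommended, relevant):
    # 1 if at least one recommended feature is relevant, else 0.
    return 1 if len(recommended.intersection(relevant)) > 0 else 0

def coverage_at_k(recommended, total_features):
    # Share of all distinct features that appear in the recommendations.
    return len(recommended) / total_features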
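
As a quick worked example of the arithmetic, the sketch below redefines two of the helpers in compact form and feeds them the two feature sets printed earlier for user 0 (most watched: `{6, 8, 10, 11, 17, 25, 26}`, recommended: `{6, 11, 26, 28}`). Those sets come from the training interactions, so the numbers illustrate the calculation only and are not a benchmark result.

```python
def precision_at_k(recommended, relevant, k):
    return len(recommended & relevant) / k

def recall_at_k(recommended, relevant):
    return len(recommended & relevant) / len(relevant) if relevant else 0.0

recommended_features = {6, 11, 26, 28}
relevant_features = {6, 8, 10, 11, 17, 25, 26}

shared = recommended_features & relevant_features
precision = precision_at_k(recommended_features, relevant_features, k=5)
recall = recall_at_k(recommended_features, relevant_features)

print(sorted(shared))        # [6, 11, 26]
print(round(precision, 4))   # 0.6     (3 shared features / k = 5)
print(round(recall, 4))      # 0.4286  (3 shared features / 7 relevant features)
```

Since `k` counts videos while the numerator counts features, and the number of distinct features is fixed, this precision cannot exceed (number of distinct features) / `k`. That is worth keeping in mind when reading the large-`K` rows.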

In [19]:
# Load the tests.
test_df = pd.read_csv(f"{data}/small_matrix.csv")

# Count the number of unique features.
total_features = pd.Series([feat for feats in categories["feat"] for feat in feats]).nunique()

def metrics(K):
    recommended_features = get_features_of_videos(recommmend_videos_for_user(14, content_weight=0.3, res_length=K).index)

    most_watched_videos_by_user_14 = get_most_watched_videos_by_user_id(test_df, 14, K)
    favorite_videos_features_of_user_14 = get_features_of_videos(most_watched_videos_by_user_14["video_id"])

    precision = precision_at_k(recommended_features, favorite_videos_features_of_user_14, K)
    recall = recall_at_k(recommended_features, favorite_videos_features_of_user_14)
    f1_score = f1_score_at_k(precision, recall)
    hit_ratio = hit_ratio_at_k(recommended_features, favorite_videos_features_of_user_14)
    coverage = coverage_at_k(recommended_features, total_features)

    print(f"➡️  Precision@{K}               : {round(precision, 4)}")
    print(f"➡️  Recall@{K}                  : {round(recall, 4)}")
    print(f"➡️  F1-Score@{K}                : {round(f1_score, 4)}")
    print(f"➡️  Hit Ratio@{K}               : {round(hit_ratio, 4)}")
    print(f"➡️  Coverage@{K}                : {round(coverage, 4)}")

print("📊 Model Evaluation Metrics:")
print("-----------------------------------------")
for K in [5, 10, 100, 1000]:
    metrics(K)
    print("-----------------------------------------")

📊 Model Evaluation Metrics:
-----------------------------------------
➡️  Precision@5               : 0.6
➡️  Recall@5                  : 0.5
➡️  F1-Score@5                : 0.5455
➡️  Hit Ratio@5               : 1
➡️  Coverage@5                : 0.129
-----------------------------------------
➡️  Precision@10               : 0.5
➡️  Recall@10                  : 0.5
➡️  F1-Score@10                : 0.5
➡️  Hit Ratio@10               : 1
➡️  Coverage@10                : 0.2581
-----------------------------------------
➡️  Precision@100               : 0.17
➡️  Recall@100                  : 0.9444
➡️  F1-Score@100                : 0.2881
➡️  Hit Ratio@100               : 1
➡️  Coverage@100                : 0.6129
-----------------------------------------
➡️  Precision@1000               : 0.022
➡️  Recall@1000                  : 0.8148
➡️  F1-Score@1000                : 0.0428
➡️  Hit Ratio@1000               : 1
➡️  Coverage@1000                : 0.7419
-----------------------------------------

## Conclusion

In this notebook, we implemented and evaluated a **hybrid recommender system** based on **cosine similarity**. The system blends collaborative-filtering scores (user-to-user similarity on watch ratios) with content-based scores (TF-IDF similarity on video features), aiming to suggest the most relevant videos to each user.

We evaluated the model for a single test user (user 14) at several **K values** (5, 10, 100, and 1000), measuring relevance on the features of the videos rather than on the video IDs. We observed the following:

### Key Metrics and Observations:

- **Precision**: As `K` increases, **precision** decreases, from **0.6 at K=5** to **0.022 at K=1000**. This is expected with the metric as implemented: the number of matched features is divided by `K`, and the number of distinct features is fixed, so the ratio shrinks as the list grows.

- **Recall**: **Recall** is **0.5 at K=5 and K=10**, peaks at **0.9444 at K=100**, and falls back to **0.8148 at K=1000**. It is not monotonic because the relevant set is built from the user's top-`K` most watched test videos, so it grows with `K` as well.

- **F1-Score**: The **F1-Score**, which balances precision and recall, peaks at **K=5** with a score of **0.5455** and declines to **0.0428 at K=1000**, driven mainly by the drop in precision.

- **Hit Ratio**: The **hit ratio** is **1** at every `K`, meaning that for the evaluated user at least one recommended feature also appears among the features of the user's most watched test videos.

- **Coverage**: **Coverage** increases from **0.129 at K=5** to **0.7419 at K=1000**. Coverage here is the share of distinct video features that appear among the recommended videos, so longer lists touch a broader range of categories.
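
The F1 values follow directly from the reported precision and recall, F1 = 2PR / (P + R). A short check that uses only the numbers printed in the benchmark output:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the benchmark output for user 14.
reported = {
    5: (0.6, 0.5),
    10: (0.5, 0.5),
    100: (0.17, 0.9444),
    1000: (0.022, 0.8148),
}

f1_by_k = {k: round(f1(p, r), 4) for k, (p, r) in reported.items()}
print(f1_by_k)   # {5: 0.5455, 10: 0.5, 100: 0.2881, 1000: 0.0428}
```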

### Overall Insights:

- The hybrid recommender system performs best at **lower K values**, particularly at **K=5**, where it achieves its highest **precision (0.6)** and **F1-Score (0.5455)**.

- It maintains a **hit ratio** of 1 across all `K` values for the evaluated user.

- **Coverage** increases steadily with `K`, indicating that longer recommendation lists span a wider range of video categories.

- The observed metrics broadly follow the usual trade-off in recommender systems: higher recall and coverage come at the cost of precision.

In conclusion, this cosine similarity-based hybrid approach gives its best precision and F1-Score on short lists and covers more video categories as the list grows. These figures come from a single test user, so they should be read as indicative. The model can be tuned depending on the application, favoring **precision** for targeted recommendations or **recall/coverage** for exploratory or diverse content discovery.

Further enhancements could include integrating additional user behavior signals or experimenting with more advanced hybridization techniques to further optimize recommendation quality.
