# Collaborative Filtering-Based Recommender System with Hybrid Approach

## Project Overview
This notebook implements a hybrid collaborative filtering recommender system that combines **user-based** and **item-based** methods to predict user ratings for books. Thresholds on the number of ratings per user and per book decide which method is used for each prediction, which helps with sparse data. A fallback to the global mean rating guarantees a prediction even when neither method can produce one.


### **Parameter Optimization**
   - The parameters of the hybrid model, namely `alpha` (the weight between user-based and item-based predictions) and the thresholds for user and book rating counts, were tuned by splitting the training dataset into training and validation subsets.
   - The best-performing parameters were determined based on the lowest **Mean Squared Error (MSE)** on the validation set.
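
The selection step described above can be sketched as a small grid search. This is a minimal illustration under stated assumptions, not the notebook's actual tuning code: the helper names `blend` and `grid_search_alpha` are hypothetical, the validation data is made up, and only `alpha` is searched (the thresholds would need an outer loop of the same shape).

```python
import numpy as np

def blend(user_pred, item_pred, alpha):
    """Weighted blend: alpha on the user-based prediction, (1 - alpha) on the item-based one."""
    return alpha * user_pred + (1 - alpha) * item_pred

def grid_search_alpha(user_pred, item_pred, y_true, alphas):
    """Return (best_alpha, best_mse) over the candidate alphas (hypothetical helper)."""
    best_alpha, best_mse = None, np.inf
    for alpha in alphas:
        mse = float(np.mean((blend(user_pred, item_pred, alpha) - y_true) ** 2))
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse

# Toy validation data, made up for illustration only
y_true = np.array([4.0, 3.0, 5.0, 2.0])
user_pred = np.array([5.0, 4.0, 5.0, 3.0])
item_pred = np.array([3.0, 2.0, 5.0, 1.0])

best_alpha, best_mse = grid_search_alpha(
    user_pred, item_pred, y_true, alphas=[0.0, 0.25, 0.5, 0.75, 1.0]
)
print(best_alpha, best_mse)
```

In this toy data the true ratings sit exactly halfway between the two predictions, so the search settles on the middle candidate.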

### **Kaggle Submission**
   - The optimized parameters were used to predict ratings for the test dataset provided in the competition.
   - Predictions were saved in the required format and submitted to Kaggle for evaluation.

---

## Highlights of the Approach
1. **Hybrid Model**: Combines user-based and item-based predictions for improved accuracy.
2. **Threshold-Based Strategy**: Dynamically chooses the best prediction method based on data availability for users and books.
3. **Fallback Mechanism**: Ensures robustness in cases of extreme sparsity by falling back to the global mean rating when both the user and the book have fewer ratings than their thresholds.
4. **Parameter Optimization**: Tested parameter combinations on a held-out split of the training dataset to find the best-performing values for the model.
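
The threshold-based strategy in item 2 can be summarized as a small decision function. This is a sketch that mirrors the branching of `predict_combined_with_threshold` defined later in the notebook; the name `choose_method` and the returned labels are hypothetical and exist only for illustration.

```python
def choose_method(user_count, book_count, threshold_user=5, threshold_book=5):
    """Return which prediction path applies for the given rating counts (hypothetical helper)."""
    user_ok = user_count >= threshold_user   # enough ratings by this user -> item-based is usable
    book_ok = book_count >= threshold_book   # enough ratings for this book -> user-based is usable
    if not user_ok and not book_ok:
        return "global_mean"
    if user_ok and book_ok:
        return "hybrid"
    return "item_based" if user_ok else "user_based"

for counts in [(1, 1), (10, 1), (1, 10), (10, 10)]:
    print(counts, choose_method(*counts))
```

Note the cross-over: the user-based method depends on how many ratings the *book* has, and the item-based method on how many ratings the *user* has.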

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
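# Predict a rating as the similarity-weighted average of other users' ratings for the book.
# Returns NaN when the book is unknown, nobody has rated it, or all similarities are zero.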
def predict_user_based_rating(user_id, book_id, user_item_matrix, user_similarity_df):
    if book_id not in user_item_matrix.columns:
        return np.nan
    
    user_ratings = user_item_matrix.loc[:, book_id]
    user_similarities = user_similarity_df.loc[user_id] if user_id in user_similarity_df.index else pd.Series(0, index=user_item_matrix.index)
    
    rated_users = user_ratings[user_ratings.notnull()].index
    similarities = user_similarities[rated_users]
    ratings = user_ratings[rated_users]
    
    if len(rated_users) == 0:
        return np.nan
    
    weighted_sum = np.dot(similarities, ratings)
    similarity_sum = np.sum(np.abs(similarities))
    
    if similarity_sum == 0:
        return np.nan
    
    return weighted_sum / similarity_sum

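# Predict a rating as the similarity-weighted average of the user's ratings for other books.
# Returns NaN when the user is unknown, has no ratings, or all similarities are zero.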
def predict_item_based_rating(user_id, book_id, user_item_matrix, item_similarity_df):
    if user_id not in user_item_matrix.index:
        return np.nan
    
    item_ratings = user_item_matrix.loc[user_id]
    item_similarities = item_similarity_df.loc[book_id] if book_id in item_similarity_df.index else pd.Series(0, index=user_item_matrix.columns)
    
    rated_items = item_ratings[item_ratings.notnull()].index
    similarities = item_similarities[rated_items]
    ratings = item_ratings[rated_items]
    
    if len(rated_items) == 0:
        return np.nan
    
    weighted_sum = np.dot(similarities, ratings)
    similarity_sum = np.sum(np.abs(similarities))
    
    if similarity_sum == 0:
        return np.nan
    
    return weighted_sum / similarity_sum
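
Both functions above compute the same kind of similarity-weighted average. Written out, with notation introduced here for illustration only (it does not appear in the original notebook): $s_{uv}$ is the cosine similarity between users $u$ and $v$, $s_{bj}$ the similarity between books $b$ and $j$, $U_b$ the set of users who rated book $b$, and $B_u$ the set of books rated by user $u$.

```latex
% User-based prediction (predict_user_based_rating)
\hat{r}_{u,b} = \frac{\sum_{v \in U_b} s_{uv} \, r_{v,b}}{\sum_{v \in U_b} \lvert s_{uv} \rvert}

% Item-based prediction (predict_item_based_rating)
\hat{r}_{u,b} = \frac{\sum_{j \in B_u} s_{bj} \, r_{u,j}}{\sum_{j \in B_u} \lvert s_{bj} \rvert}
```

When the denominator is zero, the functions return NaN instead of dividing.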

In [3]:
# Optimized predict_combined_with_threshold function
def predict_combined_with_threshold(
    user_id, book_id, user_item_matrix, user_similarity_df, item_similarity_df, 
    user_rating_counts, book_rating_counts, global_mean, alpha=0.5, threshold_book=5, threshold_user=5
):
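    """
    Predict the rating for a given user and book by combining user-based and item-based
    collaborative filtering.

    Precomputed rating counts decide which methods apply: `global_mean` is returned when both
    counts are below their thresholds, a single method's prediction when only one is available,
    and the blend `alpha * user_based + (1 - alpha) * item_based` otherwise. Returns NaN if
    neither method yields a prediction.
    """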
    # Retrieve precomputed rating counts for user and book
    user_ratings_count = user_rating_counts.get(user_id, 0)
    book_ratings_count = book_rating_counts.get(book_id, 0)

    # Fallback to global mean if both user and book have insufficient ratings
    if user_ratings_count < threshold_user and book_ratings_count < threshold_book:
        return global_mean

    # Predict using user-based method if book has sufficient ratings
    user_based_prediction = (
        predict_user_based_rating(user_id, book_id, user_item_matrix, user_similarity_df)
        if book_ratings_count >= threshold_book
        else np.nan
    )

    # Predict using item-based method if user has sufficient ratings
    item_based_prediction = (
        predict_item_based_rating(user_id, book_id, user_item_matrix, item_similarity_df)
        if user_ratings_count >= threshold_user
        else np.nan
    )

    # Combine predictions
    if np.isnan(user_based_prediction) and np.isnan(item_based_prediction):
        return np.nan
    elif np.isnan(user_based_prediction):
        return item_based_prediction
    elif np.isnan(item_based_prediction):
        return user_based_prediction
    else:
        return alpha * user_based_prediction + (1 - alpha) * item_based_prediction

In [4]:
# Load the training data
train_data = pd.read_csv('/kaggle/input/dis-project-2-recommender-systems-f2024/train.csv')

# Create user-item rating matrices for the training set
user_item_matrix_train = train_data.pivot(index='user_id', columns='book_id', values='rating')
item_user_matrix_train = user_item_matrix_train.T  # Transpose for item-user matrix

# Fill each user's missing ratings with that user's mean rating (rows are users)
user_item_filled_train = user_item_matrix_train.apply(lambda row: row.fillna(row.mean()), axis=1)
# Fill each book's missing ratings with that book's mean rating (rows are books)
item_user_filled_train = item_user_matrix_train.apply(lambda row: row.fillna(row.mean()), axis=1)

# Calculate user and item similarity matrices
user_similarity = cosine_similarity(user_item_filled_train)
item_similarity = cosine_similarity(item_user_filled_train)

# Convert to DataFrames for easy indexing
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix_train.index, columns=user_item_matrix_train.index)
item_similarity_df = pd.DataFrame(item_similarity, index=item_user_matrix_train.index, columns=item_user_matrix_train.index)
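
The mean-filling and cosine-similarity steps in the cell above can be reproduced on a toy matrix. The following is a NumPy-only sketch rather than the notebook's code: `fill_rows_with_mean` and `cosine_similarity_rows` are hypothetical helper names, the ratings are made up, and every row is assumed to contain at least one observed, non-zero rating.

```python
import numpy as np

def fill_rows_with_mean(matrix):
    """Replace NaNs in each row with the mean of that row's observed values."""
    filled = matrix.copy()
    row_means = np.nanmean(filled, axis=1)
    nan_rows, nan_cols = np.where(np.isnan(filled))
    filled[nan_rows, nan_cols] = row_means[nan_rows]
    return filled

def cosine_similarity_rows(matrix):
    """Pairwise cosine similarity between the rows of a dense matrix (rows must be non-zero)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms
    return normalized @ normalized.T

# Two users, three books; NaN marks a missing rating (toy data)
ratings = np.array([
    [5.0, np.nan, 3.0],
    [4.0, 2.0, np.nan],
])

filled = fill_rows_with_mean(ratings)
similarity = cosine_similarity_rows(filled)
print(filled)
print(similarity)
```

After filling, every cell contributes to the similarity, including the imputed ones, so two users can appear similar even when they share no rated books.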

In [5]:
test_data = pd.read_csv('/kaggle/input/dis-project-2-recommender-systems-f2024/test.csv')

# Precompute global mean rating from the training data (used as the final fallback below)
global_mean = train_data['rating'].mean()

# Calculate the number of ratings for each user and book
user_rating_counts = user_item_matrix_train.notna().sum(axis=1)  # Number of ratings by each user
book_rating_counts = user_item_matrix_train.notna().sum(axis=0)  # Number of ratings for each book

# Users and books with fewer than 2 ratings (the same thresholds are passed to the prediction call)
low_threshold_book = book_rating_counts[book_rating_counts < 2].index
low_threshold_user = user_rating_counts[user_rating_counts < 2].index

# Keep only the matrix cells where both the user and the book are below the threshold
low_threshold_ratings = user_item_matrix_train.loc[low_threshold_user, low_threshold_book]

# Flatten the subset and average its observed ratings (mean() skips NaN).
# The result is NaN if the subset contains no ratings.
low_threshold_mean_rating = low_threshold_ratings.stack().mean()

# Create a list to store predictions
predictions = []
alpha_value = 0.3  # Weight of the user-based prediction; the item-based prediction gets 1 - alpha
# The rating-count thresholds (2 for both users and books) are passed directly in the call below

# Iterate through the test data
for _, row in test_data.iterrows():
    user_id = row['user_id']
    book_id = row['book_id']
    
    # Predict the rating using the threshold-based function
    predicted_rating = predict_combined_with_threshold(
        user_id, book_id, user_item_matrix_train, user_similarity_df, item_similarity_df, 
        user_rating_counts, book_rating_counts, global_mean=low_threshold_mean_rating,
        alpha=alpha_value, threshold_book=2, threshold_user=2
    )
    
    # Handle missing predictions with a fallback
    if np.isnan(predicted_rating):
        predicted_rating = global_mean  # Use the global mean rating as a fallback
    
    # Append the prediction with the corresponding ID
    predictions.append({'id': row['id'], 'rating': predicted_rating})

# Convert to DataFrame format and save as submission file
predictions_df = pd.DataFrame(predictions)
predictions_df.to_csv('/kaggle/working/submission.csv', index=False)