# Recommendation Improvement Analysis

This notebook analyzes the prediction errors of the current recommendation model on an evaluation set to identify areas for improvement.

## Table of Contents

1. [Import Libraries](#1-import-libraries)
2. [Load Data and Models](#2-load-data-and-models)
3. [Analyze Recommendation Errors](#3-analyze-recommendation-errors)
4. [Investigate User Behavior](#4-investigate-user-behavior)
5. [Improve Model Performance](#5-improve-model-performance)
6. [Conclusions and Next Steps](#6-conclusions-and-next-steps)


## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from surprise import Dataset, Reader


## 2. Load Data and Models

In [None]:
# Load evaluation data
eval_data_path = '../data/evaluation/eval_set_1.csv'
eval_data = pd.read_csv(eval_data_path)

# Load trained model
with open('../models/trained_models/current_model.pkl', 'rb') as f:
    model = pickle.load(f)


## 3. Analyze Recommendation Errors

In [None]:
# Prepare data for predictions
reader = Reader(rating_scale=(eval_data['rating'].min(), eval_data['rating'].max()))
data = Dataset.load_from_df(eval_data[['user_id', 'movie_id', 'rating']], reader)
trainset = data.build_full_trainset()

# Get predictions
predictions = model.test(trainset.build_testset())

# Convert predictions to DataFrame
predictions_df = pd.DataFrame(predictions, columns=['user_id', 'movie_id', 'actual_rating', 'predicted_rating', 'details'])

# Calculate error
predictions_df['error'] = predictions_df['actual_rating'] - predictions_df['predicted_rating']

# Plot error distribution
plt.figure(figsize=(10,6))
sns.histplot(predictions_df['error'], bins=50, kde=True)
plt.title('Distribution of Prediction Errors')
plt.xlabel('Error')
plt.ylabel('Count')
plt.show()
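
The histogram shows the shape of the error distribution; a few summary numbers make it easier to compare runs. The following is a minimal sketch, not part of the original analysis: `summarize_errors` is a hypothetical helper, and the example values are made up for illustration. It reports the mean signed error (bias), the mean absolute error (MAE), and the root mean squared error (RMSE).

```python
import numpy as np

def summarize_errors(errors):
    """Return bias (mean signed error), MAE, and RMSE for a sequence of errors."""
    errors = np.asarray(errors, dtype=float)
    return {
        "bias": float(errors.mean()),                  # > 0 means under-prediction on average
        "mae": float(np.abs(errors).mean()),           # average error magnitude
        "rmse": float(np.sqrt((errors ** 2).mean())),  # penalizes large errors more heavily
    }

# Hypothetical errors (actual - predicted), for illustration only
example_errors = [0.5, -0.5, 1.0, -1.0]
summary = summarize_errors(example_errors)
print(summary)
```

In this notebook the helper could be called as `summarize_errors(predictions_df['error'])`. Surprise also provides `accuracy.rmse` and `accuracy.mae`, which operate directly on the `predictions` list.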


## 4. Investigate User Behavior

In [None]:
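# Identify users with the highest mean absolute prediction error.
# Signed errors can cancel out within a user, so take the absolute value first.
user_errors = (
    predictions_df.assign(abs_error=predictions_df['error'].abs())
    .groupby('user_id')['abs_error'].mean()
    .reset_index()
)
top_users = user_errors.sort_values('abs_error', ascending=False).head(10)
print("Users with highest mean absolute prediction errors:")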
print(top_users)

# Analyze ratings of top users
for user_id in top_users['user_id']:
    user_data = eval_data[eval_data['user_id'] == user_id]
    plt.figure(figsize=(8,4))
    sns.countplot(x='rating', data=user_data)
    plt.title(f'Rating Distribution for User {user_id}')
    plt.show()
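
A per-user average based on only a handful of ratings is noisy, so it may help to look at the number of ratings next to the error. The sketch below is one possible follow-up, not part of the original analysis: `per_user_error_summary` is a hypothetical helper, and the example frame uses made-up data with the same `user_id` and `error` columns as `predictions_df`.

```python
import pandas as pd

def per_user_error_summary(df):
    """Aggregate each user's rating count and mean absolute error, worst users first."""
    return (
        df.assign(abs_error=df["error"].abs())
        .groupby("user_id")
        .agg(n_ratings=("abs_error", "size"), mean_abs_error=("abs_error", "mean"))
        .reset_index()
        .sort_values("mean_abs_error", ascending=False)
    )

# Hypothetical data, for illustration only
example = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "error": [0.5, -0.5, 0.2, -2.0],
})
user_summary = per_user_error_summary(example)
print(user_summary)
```

If the users with the largest errors also tend to have the fewest ratings, the problem may be data sparsity rather than unusual taste; that is a hypothesis to check, not a finding of this notebook.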


## 5. Improve Model Performance
- Consider adding more user-specific features.
- Experiment with different algorithms or hyperparameters.
- Address data sparsity issues by implementing dimensionality reduction techniques.
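
To make the dimensionality reduction idea above concrete, here is a minimal NumPy sketch of a truncated SVD on a small rating matrix. It is an illustration under simplifying assumptions, not the method used by the current model: missing ratings are filled with each movie's mean before factorizing, `low_rank_approximation` is a hypothetical helper, and the matrix is made up.

```python
import numpy as np

def low_rank_approximation(ratings, k):
    """Fill missing ratings (NaN) with each item's mean, then keep the top-k singular components."""
    ratings = np.asarray(ratings, dtype=float)
    item_means = np.nanmean(ratings, axis=0)                    # per-movie mean over observed ratings
    filled = np.where(np.isnan(ratings), item_means, ratings)   # simple mean imputation (assumption)
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]                # rank-k reconstruction

# Hypothetical 4-user x 3-movie rating matrix; NaN marks a missing rating
example_ratings = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 1.0],
    [1.0, 2.0, 5.0],
    [np.nan, 1.0, 4.0],
])
approx = low_rank_approximation(example_ratings, k=2)
print(np.round(approx, 2))
```

The reconstructed matrix has a value in every cell, including the ones that were missing, which is how a low-rank model can score unseen user and movie pairs. Surprise's `SVD` algorithm is related but learns its factors only from observed ratings instead of imputing the missing ones.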

## 6. Conclusions and Next Steps

- The model performs well overall but struggles with certain users.
- Further investigation is needed to understand these discrepancies.
- Implementing the identified improvements could enhance the recommendation quality.

