# Content-Based Movie Recommendation System

This notebook implements a content-based movie recommendation system based on the paper:

> Surendiran B., Syed Ibrahim S.P. (2021). Hybrid movie recommendation based on interactive genetic algorithm with temporal and demographic features. International Journal of Information Technology, 14(1), 375–382. https://doi.org/10.1007/s41870-021-00769-w

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from ast import literal_eval
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)

## 2. Load and Explore Data

We'll load the movies, ratings, and user demographics files to build our recommendation system.
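As a small illustration of why the loading cell passes `engine='python'`, the sketch below parses two illustrative rows in the same `::`-delimited layout from an in-memory buffer. The `sample` and `sample_movies` names are hypothetical and used only here.

```python
import io
import pandas as pd

# Two illustrative rows in the same "::"-delimited layout as movies.dat
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A separator longer than one character is treated as a regular expression,
# which is only supported by the Python parsing engine
sample_movies = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["MovieID", "Title", "Genres"],
    header=None,
)
print(sample_movies)
```

The genres stay in a single pipe-separated string at this point; they are split into lists later in the preprocessing step.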

In [None]:
# Load movie data
movies_file = '../preprocessing/movies.dat'
movies = pd.read_csv(movies_file, sep='::', engine='python', encoding='latin-1',
                    names=['MovieID', 'Title', 'Genres'], header=None)

# Load ratings data
ratings_file = '../preprocessing/ratings.dat'
ratings = pd.read_csv(ratings_file, sep='::', engine='python', encoding='latin-1',
                     names=['UserID', 'MovieID', 'Rating', 'Timestamp'], header=None)

# Load user data
users_file = '../preprocessing/users.dat'
users = pd.read_csv(users_file, sep='::', engine='python', encoding='latin-1',
                   names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], header=None)

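# Reuse the pre-merged data if it exists; otherwise it is built in Section 3
try:
    # Parse 'Date' as datetime so that .dt accessors work on the reloaded data
    merged_data = pd.read_csv('../preprocessing/merged_data.csv', parse_dates=['Date'])
    print("Loaded pre-merged data from CSV.")
except FileNotFoundError:
    print("Pre-merged data not found; it will be created in the preprocessing step.")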

In [None]:
# Display basic information about the datasets
print("Movies dataset shape:", movies.shape)
print("Ratings dataset shape:", ratings.shape)
print("Users dataset shape:", users.shape)

# Display the first few rows of each dataset
print("\nMovies dataset preview:")
display(movies.head())

print("\nRatings dataset preview:")
display(ratings.head())

print("\nUsers dataset preview:")
display(users.head())

## 3. Data Preprocessing

According to the reference paper, we need to:
1. Extract the movie genres and create feature vectors
2. Consider user demographics (age, gender) for the cold-start problem
3. Consider temporal aspects (recent transactions with higher ratings)

In [None]:
# Create a merged dataset if not already loaded
if 'merged_data' not in locals():
    # Merge ratings with movies
    ratings_movies = pd.merge(ratings, movies, on='MovieID')
    
    # Merge with users
    merged_data = pd.merge(ratings_movies, users, on='UserID')
    
    # Convert timestamp to datetime
    merged_data['Date'] = pd.to_datetime(merged_data['Timestamp'], unit='s')
    
    # Save the merged data
    merged_data.to_csv('../preprocessing/merged_data.csv', index=False)
    print("Created and saved merged dataset.")

# Display the merged dataset
display(merged_data.head())

In [None]:
# Extract year from movie title
def extract_year(title):
    year_match = re.search(r'\((\d{4})\)$', title)
    if year_match:
        return int(year_match.group(1))
    return None

# Create a clean movie title without year
def clean_title(title):
    return re.sub(r'\s*\(\d{4}\)$', '', title)

# Apply the functions to the movies dataframe
movies['Year'] = movies['Title'].apply(extract_year)
movies['Clean_Title'] = movies['Title'].apply(clean_title)

# Split genres into a list
movies['Genres'] = movies['Genres'].apply(lambda x: x.split('|'))

# Display the processed movies dataset
display(movies.head())

## 4. Analyze User Preferences

As mentioned in the paper, we'll focus on higher ratings (5-star) and recent transactions to model user preferences.
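The temporal filter can be sketched on a toy table. This is a minimal illustration, assuming a three-calendar-year window counted back from the latest rating year; the `toy`, `RatingYear`, `latest`, and `recent_top` names are hypothetical.

```python
import pandas as pd

# Hypothetical ratings; timestamps are seconds since the Unix epoch
# (1 January of 2000, 2001, 2002 and 2003, in that order)
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 2],
    "MovieID": [10, 20, 10, 30],
    "Rating": [5, 3, 5, 5],
    "Timestamp": [946684800, 978307200, 1009843200, 1041379200],
})

# Year in which each rating was given (not the movie's release year)
toy["RatingYear"] = pd.to_datetime(toy["Timestamp"], unit="s").dt.year

# Keep 5-star ratings from the latest year and the two years before it
latest = toy["RatingYear"].max()
recent_top = toy[(toy["Rating"] == 5) & toy["RatingYear"].between(latest - 2, latest)]
print(recent_top[["UserID", "MovieID", "RatingYear"]])
```

The 5-star rating from 2000 falls outside the window and the 3-star rating is dropped by the rating filter, so only movies 10 and 30 remain.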

In [None]:
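# Add the year of each rating (not the movie's release year) to the merged data
merged_data['Year'] = pd.to_datetime(merged_data['Date']).dt.year

# Extract high-rated movies (rating = 5); done after adding 'Year' so the column carries over
high_rated = merged_data[merged_data['Rating'] == 5]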

# Get the most recent year in the dataset
max_year = merged_data['Year'].max()

# Consider recent transactions (last 3 years as mentioned in the paper)
recent_years = [max_year - i for i in range(3)]
recent_high_rated = high_rated[high_rated['Year'].isin(recent_years)]

print(f"Total number of 5-star ratings: {len(high_rated)}")
print(f"Number of recent 5-star ratings (last 3 years): {len(recent_high_rated)}")

## 5. Content-Based Features Engineering

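Now we'll create content-based features for our movies: each movie's genres are one-hot encoded into a binary vector, and pairwise cosine similarity between these vectors measures how alike two movies are.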
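A compact sketch of the idea on three made-up movies follows. It uses `Series.str.get_dummies` on pipe-separated genre strings and computes cosine similarity directly with NumPy; this is an alternative formulation for illustration, not the notebook's own implementation, and `toy_movies`, `onehot`, and `sim` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up movies with pipe-separated genre strings
toy_movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Genres": ["Action|Sci-Fi", "Action|Thriller", "Romance"],
})

# One 0/1 column per genre
onehot = toy_movies["Genres"].str.get_dummies(sep="|")
X = onehot.to_numpy(dtype=float)

# Cosine similarity = dot product of L2-normalised rows
# (assumes every movie has at least one genre, so no row norm is zero)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T

print(onehot)
print(np.round(sim, 2))
```

Movies 1 and 2 share one of their two genres, giving a similarity of 0.5, while movie 3 shares none with either and scores 0.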

In [None]:
# Create genre matrix using one-hot encoding
def create_genre_matrix(movies_df):
    # Get all unique genres
    all_genres = set()
    for genres in movies_df['Genres']:
        all_genres.update(genres)
    
    # Create a dataframe with genre columns
    genre_matrix = pd.DataFrame(index=movies_df.index)
    
    for genre in all_genres:
        genre_matrix[genre] = movies_df['Genres'].apply(lambda x: 1 if genre in x else 0)
    
    # Add MovieID as a column before returning
    genre_matrix['MovieID'] = movies_df['MovieID']
    return genre_matrix

# Create the genre matrix
genre_matrix = create_genre_matrix(movies)

# Display the genre matrix
display(genre_matrix.head())

In [None]:
# Compute similarity between movies based on genres
# Remove MovieID before computing similarity
genre_features = genre_matrix.drop('MovieID', axis=1)

# Compute cosine similarity
movie_similarity = cosine_similarity(genre_features)

# Create a DataFrame for the similarity matrix
movie_sim_df = pd.DataFrame(movie_similarity, 
                           index=movies['MovieID'], 
                           columns=movies['MovieID'])

print("Movie similarity matrix shape:", movie_sim_df.shape)
display(movie_sim_df.iloc[:5, :5])  # Show a 5x5 slice of the similarity matrix

## 6. User Profile Creation

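Following the approach in the paper, we build each user's profile as the average genre vector of the movies they rated 5 stars in their three most recent years of activity. If that set is empty, we fall back to all of the user's 5-star ratings, then to ratings at or above the user's own average, and finally to demographic information for users with no ratings at all.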
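The averaging step can be illustrated with a toy genre table. This is a minimal sketch with made-up movie IDs; `genres`, `liked`, and `toy_profile` are hypothetical names.

```python
import pandas as pd

# Hypothetical one-hot genre vectors, indexed by MovieID
genres = pd.DataFrame(
    {"Action": [1, 1, 0], "Comedy": [0, 1, 1], "Drama": [0, 0, 1]},
    index=pd.Index([101, 102, 103], name="MovieID"),
)

# Movies the user rated highly
liked = [101, 102]

# The profile is the column-wise mean of the liked movies' genre vectors
toy_profile = genres.loc[liked].mean(axis=0)
print(toy_profile)
```

Each profile entry is the fraction of liked movies carrying that genre: here Action is 1.0, Comedy is 0.5, and Drama is 0.0.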

In [None]:
# Create a function to build a user profile based on genre preferences
def create_user_profile(user_id, ratings_df=merged_data, movies_df=movies, genre_mat=genre_matrix):
    # Get the user's ratings
    user_ratings = ratings_df[ratings_df['UserID'] == user_id]
    
    # Focus on high-rated recent movies (as per the paper)
    max_year = user_ratings['Year'].max() if not user_ratings.empty else 0
    recent_years = [max_year - i for i in range(3) if max_year - i > 0]
    
    user_high_ratings = user_ratings[
        (user_ratings['Rating'] == 5) & 
        (user_ratings['Year'].isin(recent_years))
    ]
    
    # If no recent high ratings, use all high ratings
    if user_high_ratings.empty:
        user_high_ratings = user_ratings[user_ratings['Rating'] == 5]
    
    # If still empty, use all ratings above average
    if user_high_ratings.empty and not user_ratings.empty:
        avg_rating = user_ratings['Rating'].mean()
        user_high_ratings = user_ratings[user_ratings['Rating'] >= avg_rating]
    
    # If the user has no ratings, return demographic information for cold start
    if user_high_ratings.empty:
        user_info = users[users['UserID'] == user_id]
        if not user_info.empty:
            return {
                'user_id': user_id,
                'demographics': {
                    'age': user_info['Age'].values[0],
                    'gender': user_info['Gender'].values[0],
                    'occupation': user_info['Occupation'].values[0]
                },
                'has_ratings': False
            }
        return None
    
    # Get the genre features of the movies rated highly by the user
    user_movie_ids = user_high_ratings['MovieID'].tolist()
    
    # Extract genre features for these movies
    movie_genres = genre_mat[genre_mat['MovieID'].isin(user_movie_ids)].drop('MovieID', axis=1)
    
    # Create a user profile by averaging the genre features
    user_profile = movie_genres.mean(axis=0)
    
    return {
        'user_id': user_id,
        'profile': user_profile,
        'has_ratings': True,
        'favorite_movies': user_movie_ids
    }

# Example: Create profile for a user
sample_user_id = merged_data['UserID'].sample(1).values[0]
user_profile = create_user_profile(sample_user_id)

if user_profile and user_profile.get('has_ratings'):
    print(f"User {sample_user_id} profile (genre preferences):")
    # Display top 5 genres for this user
    top_genres = user_profile['profile'].sort_values(ascending=False).head(5)
    display(top_genres)
else:
    print(f"User {sample_user_id} has no ratings. Demographics information:")
    display(user_profile['demographics'] if user_profile else "No user info found")

## 7. Demographic-Based Clustering for Cold Start

As mentioned in the paper, we'll use demographic information to handle the cold start problem.
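The demographic fallback amounts to ranking movies by mean rating within each group, after discarding movies with too few ratings in that group. A toy sketch, assuming a minimum of two ratings per movie and group (the notebook uses a larger threshold); `toy`, `stats`, and `best` are hypothetical names.

```python
import pandas as pd

# Hypothetical ratings already tagged with the rater's age group
toy = pd.DataFrame({
    "AgeGroup": ["18-24"] * 5 + ["25-34"] * 4,
    "MovieID": [1, 1, 2, 2, 4, 1, 1, 3, 3],
    "Rating": [5, 4, 3, 3, 5, 2, 3, 5, 5],
})

# Mean rating and rating count per (group, movie)
stats = (
    toy.groupby(["AgeGroup", "MovieID"])["Rating"]
    .agg(["mean", "count"])
    .reset_index()
)

# Drop movies with fewer than 2 ratings in the group (assumed threshold)
stats = stats[stats["count"] >= 2]

# Best-rated remaining movie per group
best = (
    stats.sort_values(["AgeGroup", "mean"], ascending=[True, False])
    .groupby("AgeGroup")
    .head(1)
)
print(best)
```

Movie 4 has a perfect mean in the 18-24 group but only one rating, so the count filter removes it and movie 1 is selected instead.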

In [None]:
# Process age groups
def map_age_group(age):
    if age < 18:
        return 'Under 18'
    elif 18 <= age < 25:
        return '18-24'
    elif 25 <= age < 35:
        return '25-34'
    elif 35 <= age < 45:
        return '35-44'
    elif 45 <= age < 55:
        return '45-54'
    else:
        return '55+'

# Add age group to users dataframe
users['AgeGroup'] = users['Age'].apply(map_age_group)

# Add gender and age group to merged data
merged_with_demo = merged_data.copy()
merged_with_demo['AgeGroup'] = merged_with_demo['Age'].apply(map_age_group)

# Find top movies by demographic group
def get_top_movies_by_demo(df, demo_col, min_ratings=10):
    # Count ratings per movie per demographic group
    movie_demo_ratings = df.groupby(['MovieID', demo_col])['Rating'].agg(['mean', 'count'])
    
    # Filter movies with at least min_ratings
    qualified = movie_demo_ratings[movie_demo_ratings['count'] >= min_ratings]
    
    # Get top movies per demographic group
    top_movies = qualified.reset_index().sort_values(by=[demo_col, 'mean'], 
                                                   ascending=[True, False])
    
    # Get movie titles
    top_movies = pd.merge(top_movies, movies[['MovieID', 'Title']], on='MovieID')
    
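    # Get top 10 movies per group; sorting and taking the head keeps demo_col as a
    # regular column regardless of how groupby.apply treats grouping columns
    top_n = (top_movies.sort_values(by=[demo_col, 'mean'], ascending=[True, False])
                       .groupby(demo_col).head(10).reset_index(drop=True))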
    
    return top_n

# Get top movies by age group
top_movies_by_age = get_top_movies_by_demo(merged_with_demo, 'AgeGroup')
display(top_movies_by_age.head(10))

# Get top movies by gender
top_movies_by_gender = get_top_movies_by_demo(merged_with_demo, 'Gender')
display(top_movies_by_gender.head(10))

# Save top movies by age group for cold start recommendations
top_movies_by_age.to_csv('../preprocessing/top_movies_by_age_group.csv', index=False)

## 8. Content-Based Movie Recommendation Function

Now, let's implement the main recommendation function that incorporates both content-based filtering and demographic attributes for cold start.
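The content-based branch boils down to scoring every movie against the user's profile and dropping what the user has already seen. A minimal sketch on toy data, with cosine similarity written out in NumPy; `genres`, `profile`, `watched`, and `ranked` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot genre vectors, indexed by MovieID
genres = pd.DataFrame(
    {"Action": [1, 1, 0, 0], "Comedy": [0, 1, 1, 0], "Drama": [0, 0, 1, 1]},
    index=pd.Index([101, 102, 103, 104], name="MovieID"),
)

# Profile in the same column order as the genre table
profile = np.array([1.0, 0.5, 0.0])

# Movies the user has already rated
watched = {101, 102}

# Cosine similarity of every movie row against the profile in one pass
# (assumes no all-zero rows, which would give a zero norm)
M = genres.to_numpy(dtype=float)
scores = M @ profile / (np.linalg.norm(M, axis=1) * np.linalg.norm(profile))

# Drop watched movies, then rank the rest by similarity
ranked = (
    pd.Series(scores, index=genres.index)
    .drop(index=list(watched))
    .sort_values(ascending=False)
)
print(ranked)
```

Movie 103 shares the Comedy genre with the profile and ranks first; movie 104 is pure Drama and scores 0.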

In [None]:
def recommend_movies(user_id, n=10, include_watched=False):
    # Get user profile
    profile = create_user_profile(user_id)
    
    # Handle users with ratings
    if profile and profile.get('has_ratings'):
        user_profile = profile['profile']
        favorite_movies = profile['favorite_movies']
        
        # Calculate similarity between user profile and all movies
        # Get genre features for all movies
        all_movie_features = genre_matrix.drop('MovieID', axis=1)
        
        # Calculate similarity scores for all movies in one vectorized call
        similarity_scores = cosine_similarity(
            all_movie_features.values, user_profile.values.reshape(1, -1)
        ).ravel()
        
        # Create a DataFrame with movie IDs and similarity scores
        sim_df = pd.DataFrame({
            'MovieID': genre_matrix['MovieID'],
            'similarity': similarity_scores
        })
        
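        # Remove every movie the user has already rated (not only their favorites) if required
        if not include_watched:
            watched_ids = merged_data.loc[merged_data['UserID'] == user_id, 'MovieID']
            sim_df = sim_df[~sim_df['MovieID'].isin(watched_ids)]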
        
        # Get top N movies
        top_movies = sim_df.sort_values('similarity', ascending=False).head(n)
        
        # Get movie details
        recommendations = pd.merge(top_movies, movies[['MovieID', 'Title', 'Genres']], on='MovieID')
        
        return recommendations, 'content-based'
        
    # Handle cold start using demographic information
    else:
        # Get user demographic information
        user_info = users[users['UserID'] == user_id]
        
        if user_info.empty:
            return pd.DataFrame(), 'no-user-info'
        
        # Use age group as the primary demographic factor
        age_group = user_info['AgeGroup'].values[0]
        
        # Get top movies for this age group
        try:
            # Try to load from saved file
            top_movies = pd.read_csv('../preprocessing/top_movies_by_age_group.csv')
            recommendations = top_movies[top_movies['AgeGroup'] == age_group].head(n)
        except FileNotFoundError:
            # Calculate on the fly
            top_movies_by_age = get_top_movies_by_demo(merged_with_demo, 'AgeGroup')
            recommendations = top_movies_by_age[top_movies_by_age['AgeGroup'] == age_group].head(n)
        
        return recommendations, 'demographic-based'

# Test the recommendation function with a known user
test_user_id = merged_data['UserID'].sample(1).values[0]
recommendations, rec_type = recommend_movies(test_user_id, n=10)

print(f"Recommendations for user {test_user_id} (Type: {rec_type})")
display(recommendations)

## 9. Evaluation of the Recommendation System

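Let's evaluate our content-based recommendation system with a simple coverage check: for a sample of users, we record which recommendation path was used and whether any recommendations were returned. This measures availability, not recommendation accuracy.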
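Coverage here means the percentage of evaluated users who receive at least one recommendation. A tiny, dependency-free sketch of that calculation follows; the `coverage` helper and the `counts` mapping are hypothetical and not part of the notebook.

```python
def coverage(rec_counts):
    """Percentage of evaluated users with at least one recommendation."""
    if not rec_counts:
        return 0.0
    served = sum(1 for count in rec_counts.values() if count > 0)
    return 100.0 * served / len(rec_counts)


# Hypothetical number of recommendations returned per user ID
counts = {1: 10, 2: 10, 3: 0, 4: 7}
print(f"Coverage: {coverage(counts):.2f}%")
```

Dividing by the number of users actually evaluated, instead of the number requested, keeps the figure correct when fewer distinct users are sampled than asked for.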

In [None]:
def evaluate_recommendations(n_users=10, n_recommendations=10):
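    # Sample distinct users for evaluation (sampling rating rows could repeat a user)
    unique_users = merged_data['UserID'].drop_duplicates()
    sampled_users = unique_users.sample(min(n_users, len(unique_users))).values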
    
    # Metrics to track
    results = {
        'user_id': [],
        'recommendation_type': [],
        'num_recommendations': []
    }
    
    # Get recommendations for each user
    for user_id in sampled_users:
        recommendations, rec_type = recommend_movies(user_id, n=n_recommendations)
        
        results['user_id'].append(user_id)
        results['recommendation_type'].append(rec_type)
        results['num_recommendations'].append(len(recommendations))
    
    # Create a DataFrame of results
    results_df = pd.DataFrame(results)
    
    # Calculate coverage: share of evaluated users who received at least one recommendation
    coverage = (results_df['num_recommendations'] > 0).mean() * 100
    
    # Count recommendation types
    rec_type_counts = results_df['recommendation_type'].value_counts(normalize=True) * 100
    
    print(f"Recommendation coverage: {coverage:.2f}%")
    print("Recommendation type distribution:")
    display(rec_type_counts)
    
    return results_df

# Run evaluation
eval_results = evaluate_recommendations(n_users=20, n_recommendations=10)
display(eval_results)

## 10. Interactive Recommendation Example

Let's create an interactive function to get recommendations for any user:

In [None]:
def get_user_recommendations(user_id=None, n=10):
    # If no user_id provided, select a random one
    if user_id is None:
        user_id = merged_data['UserID'].sample(1).values[0]
    
    # Check if user exists
    if user_id not in merged_data['UserID'].values and user_id not in users['UserID'].values:
        print(f"User {user_id} not found!")
        return
    
    # Get user info
    user_info = users[users['UserID'] == user_id]
    if not user_info.empty:
        age = user_info['Age'].values[0]
        gender = user_info['Gender'].values[0]
        age_group = map_age_group(age)
        print(f"User {user_id} - Age: {age} ({age_group}), Gender: {gender}")
    
    # Get recommendations
    recommendations, rec_type = recommend_movies(user_id, n=n)
    
    print(f"\nRecommendation type: {rec_type}")
    
    if recommendations.empty:
        print("No recommendations found!")
    else:
        # Clean up the display format
        if 'Genres' in recommendations.columns:
            recommendations['Genres'] = recommendations['Genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
        
        # Display recommendations
        display(recommendations[['Title', 'similarity']] if 'similarity' in recommendations.columns 
                              else recommendations[['Title', 'mean']])
    
    return recommendations

# Test the function with a random user
random_user = merged_data['UserID'].sample(1).values[0]
# Assign the result so the notebook does not render the returned DataFrame a second time
_ = get_user_recommendations(random_user, n=10)

## 11. Conclusion and Future Work

In this notebook, we implemented a content-based movie recommendation system based on the approach described in the paper by Surendiran & Syed Ibrahim. The system has the following key features:

1. **Content-based filtering** using movie genres
2. **Temporal aspects** by focusing on recent high-rated movies
3. **Demographic-based recommendations** for cold start problems

Future improvements could include:
- Incorporating collaborative filtering for a hybrid approach
- Adding more content features (directors, actors, plot keywords)
- Implementing A/B testing to evaluate recommendation quality
- Adding real-time user feedback mechanisms

The current system follows the main concepts from the reference paper, particularly focusing on temporal aspects and demographic attributes to enhance recommendation quality.