**Import Libraries and Configure Logging**

This cell imports the necessary libraries for building a movie recommender system and sets up logging to monitor the execution process.



In [89]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import precision_score, recall_score
from difflib import get_close_matches
import logging
from collections import Counter
import ast

In [90]:
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

**Load and Preprocess Data**

This cell defines the load_data function to load and preprocess movie metadata, ratings, and keywords datasets.





Purpose: Load raw data from CSV files, clean it, and prepare it for model training.



Functionality:





Loads movies_metadata.csv, ratings_small.csv, and keywords.csv from Kaggle dataset.



Cleans movie data by selecting relevant columns (id, original_title, genres, overview), removing duplicates, and handling missing values.



Processes genres by parsing JSON-like strings into lists of genre names.



Processes keywords by extracting the top 1000 most frequent keywords to reduce dimensionality.



Merges ratings with movies to include genres and keywords, ensuring consistency.



Logging:





Logs the number of movies and ratings with non-empty genres for quality checks.





In [91]:
# Load and preprocess data
def load_data():
    movies = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv', low_memory=False)
    ratings = pd.read_csv('/kaggle/input/the-movies-dataset/ratings_small.csv')
    keywords = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv')
    
    # Select relevant columns and drop rows with missing overviews
    movies = movies[['id', 'original_title', 'genres', 'overview']].dropna(subset=['id', 'original_title', 'genres', 'overview'])
    logger.info(f"Dropped {45466 - len(movies)} movies with missing overviews. Remaining: {len(movies)}")
    
    # Convert id to int64, ensuring no NaN values
    movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
    movies = movies[movies['id'].notna()]
    movies['id'] = movies['id'].astype('int64')
    
    # Remove duplicates, keeping first occurrence
    movies = movies[~movies['id'].duplicated(keep='first')]
    movies = movies[~movies['original_title'].duplicated(keep='first')].reset_index(drop=True)
    logger.info(f"After duplicate removal, movies remaining: {len(movies)}")
    
    # Verify no NaN titles
    if movies['original_title'].isna().any():
        logger.error("NaN titles detected after cleaning.")
        raise ValueError("NaN titles detected in movies dataset.")
    
    # Process genres
    def parse_genres(text):
        try:
            genres = ast.literal_eval(text)
            if not isinstance(genres, list):
                return []
            return [g['name'] for g in genres if isinstance(g, dict) and 'name' in g]
        except (ValueError, SyntaxError):
            return []
    
    movies['genres'] = movies['genres'].apply(parse_genres)
    
    # Filter out movies with empty genres
    movies = movies[movies['genres'].map(len) > 0].reset_index(drop=True)
    logger.info(f"Movies with non-empty genres: {len(movies)} / {len(movies)}")
    
    # Verify no empty genres
    if any(movies['genres'].map(len) == 0):
        logger.error("Empty genres detected after filtering.")
        raise ValueError("Empty genres detected in movies dataset.")
    
    # Process keywords
    def parse_keywords(text):
        try:
            keywords_list = ast.literal_eval(text)
            return [k['name'] for k in keywords_list if isinstance(k, dict) and 'name' in k]
        except (ValueError, SyntaxError):
            return []
    
    keywords['keywords'] = keywords['keywords'].apply(parse_keywords)
    all_keywords = [k for sublist in keywords['keywords'] for k in sublist]
    keyword_counts = pd.Series(all_keywords).value_counts().head(1000).index.tolist()
    keywords['keywords'] = keywords['keywords'].apply(lambda x: [k for k in x if k in keyword_counts])
    
    # Merge keywords with movies
    movies = movies.merge(keywords[['id', 'keywords']], on='id', how='left')
    movies['keywords'] = movies['keywords'].fillna('[]').apply(lambda x: x if isinstance(x, list) else [])
    
    # Merge ratings with movies
    ratings = ratings.merge(movies[['id', 'genres', 'keywords']], left_on='movieId', right_on='id', how='inner')
    ratings['genres'] = ratings['genres'].apply(lambda x: x if isinstance(x, list) and len(x) > 0 else [])
    ratings['keywords'] = ratings['keywords'].apply(lambda x: x if isinstance(x, list) else [])
    ratings = ratings[['userId', 'movieId', 'rating', 'genres', 'keywords']].dropna(subset=['userId', 'movieId', 'rating'])
    logger.info(f"Ratings with non-empty genres: {sum(len(g) > 0 for g in ratings['genres'])} / {len(ratings)}")
    
    # Verify data integrity
    logger.info(f"Final movies: {len(movies)}, Final ratings: {len(ratings)}")
    return movies, ratings

**Create Index Mappings**

This cell defines the create_mappings function to generate index mappings for users, movies, genres, and keywords.





Purpose: Create numerical indices for categorical variables to enable embedding layers in the neural network.



Functionality:





Maps unique userId and movie id to zero-based indices.



Collects unique genres and keywords from both movies and ratings datasets, adding an 'Unknown' token for missing values.



Creates dictionaries mapping genres and keywords to indices.



Logging:





Logs the number of unique genres for verification.

In [92]:
# Create mappings
def create_mappings(movies, ratings):
    user_ids = ratings['userId'].unique()
    movie_ids = movies['id'].unique()
    user_to_index = {uid: idx for idx, uid in enumerate(user_ids)}
    movie_to_index = {mid: idx for idx, mid in enumerate(movie_ids)}
    
    unique_genres = set()
    unique_keywords = set()
    movies['genres'].apply(lambda x: unique_genres.update(x))
    ratings['genres'].apply(lambda x: unique_genres.update(x))
    movies['keywords'].apply(lambda x: unique_keywords.update(x))
    ratings['keywords'].apply(lambda x: unique_keywords.update(x))
    unique_genres.add('Unknown')
    unique_keywords.add('Unknown')
    genre_to_index = {genre: idx for idx, genre in enumerate(unique_genres)}
    keyword_to_index = {keyword: idx for idx, keyword in enumerate(unique_keywords)}
    logger.info(f"Unique genres: {len(unique_genres)}")
    
    return user_to_index, movie_to_index, genre_to_index, keyword_to_index

**Find Similar Movie Titles**

This cell defines the find_similar_titles function to suggest similar movie titles for user input errors.





Purpose: Enhance user experience by suggesting close matches for movie titles not found in the dataset.



Functionality:





Uses difflib.get_close_matches to find up to 3 movie titles with a similarity score above 0.6.



Compares the input title against all original_title values in the movies dataset.

In [93]:
# Find similar movie titles
def find_similar_titles(title, movies, n=3):
    titles = movies['original_title'].tolist()
    similar = get_close_matches(title, titles, n=n, cutoff=0.6)
    return similar

In [94]:
# Display user's historical data
def display_user_history(user_id, movies, ratings):
    if user_id not in ratings['userId'].values:
        logger.info(f"No historical data found for user {user_id}.")
        print(f"No historical data found for user {user_id}.")
        return None
    
    user_ratings = ratings[ratings['userId'] == user_id][['movieId', 'rating']]
    user_history = user_ratings.merge(
        movies[['id', 'original_title', 'genres', 'overview']],
        left_on='movieId',
        right_on='id',
        how='left'
    )
    
    if user_history.empty:
        logger.info(f"No valid movie data found for user {user_id}'s ratings.")
        print(f"No valid movie data found for user {user_id}'s ratings.")
        return None
    
    # Format genres and truncate overview
    user_history['genres'] = user_history['genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) and x else 'Unknown')
    user_history['overview'] = user_history['overview'].apply(lambda x: x[:100] + '...' if isinstance(x, str) and len(x) > 100 else x if isinstance(x, str) else 'No overview available')
    
    # Select and rename columns
    user_history = user_history[['original_title', 'genres', 'overview', 'rating']]
    user_history = user_history.rename(columns={
        'original_title': 'Movie Title',
        'genres': 'Genres',
        'overview': 'Overview',
        'rating': 'User Rating'
    })
    
    # Log genre distribution
    genres_list = [g.strip() for genres in user_history['Genres'] for g in genres.split(',') if g.strip() != 'Unknown']
    genre_counts = Counter(genres_list)
    logger.info(f"User {user_id} genre preferences: {dict(genre_counts)}")
    
    print(f"\nHistorical data for user {user_id} ({len(user_history)} movies rated):")
    return user_history

**Prepare Data for Training**

This cell defines the prepare_data function to transform data into a format suitable for model training.





Purpose: Preprocess ratings data, generate negative samples, and prepare input features for the neural network.



Functionality:





Maps user and movie IDs to indices using provided mappings.



Converts genres and keywords to numerical indices, using 'Unknown' for missing values.



Generates negative samples (1.5x the number of positive samples per user) with a rating of 0.5 for unrated movies.



Pads genre and keyword sequences to fixed lengths for consistent input shapes.



Normalizes ratings to [0, 1] from [0.5, 5.0] for training stability.



Creates binary relevance labels (1 if rating > 3.5, else 0).



Logging:





Logs the number of negative samples and the rating distribution.


Output: Returns user indices, movie indices, padded genre and keyword data, normalized ratings, relevance labels, and maximum sequence lengths.

In [95]:
# Prepare data for training
def prepare_data(ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index):
    ratings['user_index'] = ratings['userId'].map(user_to_index)
    ratings['movie_index'] = ratings['movieId'].map(movie_to_index)
    ratings['genre_indices'] = ratings['genres'].apply(
        lambda x: [genre_to_index[g] for g in x if g in genre_to_index] or [genre_to_index['Unknown']]
    )
    ratings['keyword_indices'] = ratings['keywords'].apply(
        lambda x: [keyword_to_index[k] for k in x if k in keyword_to_index] or [keyword_to_index['Unknown']]
    )
    
    ratings = ratings[ratings['movie_index'].notna()]
    ratings.loc[:, 'movie_index'] = ratings['movie_index'].astype(np.int64)
    
    # Add negative sampling
    negative_samples = []
    for user_id in user_to_index.keys():
        user_ratings = ratings[ratings['userId'] == user_id]
        rated_movies = set(user_ratings['movieId'])
        all_movies = set(movie_to_index.keys())
        unrated_movies = list(all_movies - rated_movies)
        if len(unrated_movies) > int(1.5 * len(user_ratings)):
            unrated_sample = np.random.choice(unrated_movies, size=int(1.5*len(user_ratings)), replace=False)
        else:
            unrated_sample = unrated_movies
        for movie_id in unrated_sample:
            movie_row = ratings[ratings['movieId'] == movie_id]
            if movie_row.empty:
                movie_row = movies[movies['id'] == movie_id]
                if movie_row.empty:
                    continue
                genres = movie_row['genres'].iloc[0]
                keywords = movie_row['keywords'].iloc[0]
            else:
                genres = movie_row['genres'].iloc[0]
                keywords = movie_row['keywords'].iloc[0]
            negative_samples.append({
                'userId': user_id,
                'movieId': movie_id,
                'rating': 0.5,
                'genres': genres,
                'keywords': keywords,
                'user_index': user_to_index[user_id],
                'movie_index': movie_to_index[movie_id],
                'genre_indices': [genre_to_index[g] for g in genres if g in genre_to_index] or [genre_to_index['Unknown']],
                'keyword_indices': [keyword_to_index[k] for k in keywords if k in keyword_to_index] or [keyword_to_index['Unknown']]
            })
    
    negative_df = pd.DataFrame(negative_samples)
    logger.info(f"Negative samples created: {len(negative_df)}")
    ratings = pd.concat([ratings, negative_df], ignore_index=True)
    
    # Pad genre and keyword indices
    max_genres = max(len(g) for g in ratings['genre_indices'])
    max_keywords = min(max(len(k) for k in ratings['keyword_indices']), 50)
    genre_data = pad_sequences(ratings['genre_indices'], maxlen=max_genres, padding='post')
    keyword_data = pad_sequences(ratings['keyword_indices'], maxlen=max_keywords, padding='post')
    
    # Normalize ratings to [0, 1]
    ratings_data = ratings['rating'].values
    ratings_data = (ratings_data - 0.5) / 4.5  # [0.5, 5] to [0, 1]
    logger.info(f"Rating distribution: {np.histogram(ratings_data, bins=10)}")
    
    # Binary relevance labels
    relevance_labels = (ratings['rating'] > 3.5).astype(int).values
    
    return (ratings['user_index'].values, ratings['movie_index'].values, genre_data, keyword_data,
            ratings_data, relevance_labels, max_genres, max_keywords)

**Build the Neural Network Model**

This cell defines the build_model function to construct a neural network for movie rating prediction and relevance classification.





Purpose: Define a multi-task learning model that predicts movie ratings and binary relevance.



Architecture:





Inputs: User ID, movie ID, genre indices, and keyword indices.



Embeddings: 256-dimensional for users and movies, 32-dimensional for genres and keywords, with L2 regularization.



Processing:





Flattens user and movie embeddings.



Applies global average pooling to genre and keyword embeddings.



Concatenates all embeddings.



Dense Layers: Four layers (128, 64, 32, 16 units) with ReLU activation, batch normalization, and 30% dropout.



Outputs:





rating_output: Sigmoid activation, scaled and clipped to [0.5, 5.0].



relevance_output: Sigmoid for binary classification (relevant if rating > 3.5).



Compilation:





Optimizer: Adam with learning rate 0.0001.



Loss: Mean squared error for ratings, binary cross-entropy for relevance, with weights 1.0 and 0.85.



Metrics: MAE for ratings, accuracy for relevance.



Output: Returns the compiled Keras model.

In [96]:
# Define the model
def build_model(num_users, num_movies, num_genres, num_keywords, max_genres, max_keywords):
    user_input = layers.Input(shape=(1,), name='user_input')
    movie_input = layers.Input(shape=(1,), name='movie_input')
    genre_input = layers.Input(shape=(max_genres,), name='genre_input')
    keyword_input = layers.Input(shape=(max_keywords,), name='keyword_input')
    
    # Embeddings with L2 regularization
    user_embed = layers.Embedding(num_users, 256, embeddings_regularizer=regularizers.l2(1e-4), name='user_embedding')(user_input)
    movie_embed = layers.Embedding(num_movies, 256, embeddings_regularizer=regularizers.l2(1e-4), name='movie_embedding')(movie_input)
    genre_embed = layers.Embedding(num_genres, 32, embeddings_regularizer=regularizers.l2(1e-4), name='genre_embedding')(genre_input)
    keyword_embed = layers.Embedding(num_keywords, 32, embeddings_regularizer=regularizers.l2(1e-4), name='keyword_embedding')(keyword_input)
    
    # Flatten embeddings
    user_vec = layers.Flatten()(user_embed)
    movie_vec = layers.Flatten()(movie_embed)
    genre_vec = layers.GlobalAveragePooling1D()(genre_embed)
    keyword_vec = layers.GlobalAveragePooling1D()(keyword_embed)
    
    # Concatenate
    concat = layers.Concatenate()([user_vec, movie_vec, genre_vec, keyword_vec])
    
    # Dense layers with batch normalization
    dense = layers.Dense(128, activation='relu')(concat)
    dense = layers.BatchNormalization()(dense)
    dense = layers.Dropout(0.3)(dense)
    dense = layers.Dense(64, activation='relu')(dense)
    dense = layers.BatchNormalization()(dense)
    dense = layers.Dropout(0.3)(dense)
    dense = layers.Dense(32, activation='relu')(dense)
    dense = layers.Dense(16, activation='relu')(dense)
    
    # Outputs
    rating_output = layers.Dense(1, activation='sigmoid', name='rating_sigmoid')(dense)
    rating_output = layers.Lambda(lambda x: tf.clip_by_value(x * 4.5 + 0.5, 0.5, 5.0), name='rating_output')(rating_output)  # [0, 1] to [0.5, 5]
    relevance_output = layers.Dense(1, activation='sigmoid', name='relevance_output')(dense)
    
    model = Model(inputs=[user_input, movie_input, genre_input, keyword_input], 
                  outputs=[rating_output, relevance_output])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss={
            'rating_output': 'mean_squared_error',
            'relevance_output': 'binary_crossentropy'
        },
        loss_weights={
            'rating_output': 1.0,
            'relevance_output': 0.85
        },
        metrics={
            'rating_output': ['mae'],
            'relevance_output': ['accuracy']
        }
    )
    
    return model

**Train the Model**

This cell defines the train_model function to train the neural network on the prepared data.





Purpose: Train the model using training and validation data, with callbacks to optimize performance.



Functionality:





Unpacks training and test data (users, movies, genres, keywords, ratings, relevance).



Configures batch size (512) and logs expected steps per epoch.



Uses callbacks:





EarlyStopping: Stops training if val_rating_output_mae doesn’t improve for 3 epochs, restoring best weights.



ReduceLROnPlateau: Reduces learning rate by 50% if val_rating_output_mae doesn’t improve for 2 epochs, down to 1e-6.



Trains for 20 epochs, feeding inputs and outputs for both tasks.



Logging:





Logs training sample count and batch size details.



Output: Returns the training history object.

In [97]:
# Train the model
def train_model(model, train_data, test_data):
    (train_users, train_movies, train_genres, train_keywords, train_ratings, train_relevance) = train_data
    (test_users, test_movies, test_genres, test_keywords, test_ratings, test_relevance) = test_data
    
    logger.info(f"Training samples: {len(train_users)}")
    batch_size = 512
    logger.info(f"Batch size: {batch_size}, Steps per epoch: {len(train_users) // batch_size + (1 if len(train_users) % batch_size else 0)}")
    
    early_stopping = EarlyStopping(
        monitor='val_rating_output_mae', 
        patience=3, 
        restore_best_weights=True, 
        verbose=1, 
        mode='min'
    )
    
    reduce_lr = ReduceLROnPlateau(
        monitor='val_rating_output_mae',
        factor=0.5,
        patience=2,
        min_lr=1e-6,
        verbose=1
    )
    
    history = model.fit(
        [train_users, train_movies, train_genres, train_keywords],
        {'rating_output': train_ratings, 'relevance_output': train_relevance},
        validation_data=([test_users, test_movies, test_genres, test_keywords], 
                         {'rating_output': test_ratings, 'relevance_output': test_relevance}),
        epochs=20,
        batch_size=batch_size,
        verbose=1,
        callbacks=[early_stopping, reduce_lr]
    )
    
    return history

**Evaluate Top-K Metrics**

This cell defines the evaluate_top_k function to compute Precision@10 and Recall@10 for the recommender system.





Purpose: Evaluate the model’s ability to recommend relevant movies (rating > 3.5) in the top-10 predictions.



Functionality:





Predicts ratings for test data and rescales to [0.5, 5.0].



Groups predictions by user and selects the top-10 highest-rated movies.



Computes:





Precision@10: Proportion of top-10 movies with true rating > 3.5.



Recall@10: Proportion of all relevant movies captured in the top-10.



Handles edge cases (e.g., users with no relevant movies).



Logging:





Logs prediction range, number of user groups, and final metrics.



Output: Returns mean Precision@10 and Recall@10.

In [98]:
# Evaluate top-K metrics
def evaluate_top_k(model, test_data, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, k=10):
    logger.info("Starting top-K evaluation")
    test_users, test_movies, test_genres, test_keywords, test_ratings, test_relevance = test_data
    try:
        predictions, _ = model.predict([test_users, test_movies, test_genres, test_keywords], verbose=0)
        predictions = np.clip(predictions.flatten() * 4.5 + 0.5, 0.5, 5.0)  # Rescale [0, 1] to [0.5, 5]
        logger.info(f"Top-K prediction range: {predictions.min():.2f} to {predictions.max():.2f}")
        
        test_df = pd.DataFrame({
            'user_index': test_users,
            'movie_index': test_movies,
            'rating': test_ratings * 4.5 + 0.5,  # Rescale [0, 1] to [0.5, 5]
            'prediction': predictions
        })
        
        user_groups = test_df.groupby('user_index')
        precisions, recalls = [], []
        logger.info(f"Processing {len(user_groups)} user groups")
        for user_idx, group in user_groups:
            top_k_indices = group['prediction'].argsort()[-k:][::-1]
            true = group['rating'] > 3.5
            true_positive = sum(true.iloc[top_k_indices])
            precisions.append(true_positive / k)
            recall = true_positive / sum(true) if sum(true) > 0 else 0
            recalls.append(recall)
        
        precision = np.mean(precisions)
        recall = np.mean(recalls)
        logger.info(f"Precision@{k}: {precision:.4f}")
        logger.info(f"Recall@{k}: {recall:.4f}")
        print(f"Precision@{k}: {precision:.4f}")
        print(f"Recall@{k}: {recall:.4f}")
        return precision, recall
    except Exception as e:
        logger.error(f"Error in evaluate_top_k: {str(e)}")
        raise

**Predict Rating for a User-Movie Pair**

This cell defines the predict_rating function to predict a rating for a specific user and movie.





Purpose: Provide a single rating prediction for a user-movie pair, handling edge cases and ensuring valid outputs.



Functionality:





Validates user and movie existence using index mappings.



Suggests similar titles if the movie is not found.



Prepares input data (user index, movie index, padded genre and keyword indices).



Predicts rating, rescales from [0, 1] to [0.5, 5.0], and clips to ensure range compliance.



Asserts rating is within [0.5, 5.0].



Logging:





Logs raw and clipped rating values for debugging.



Output: Returns a formatted string with the predicted rating or an error message.

In [99]:
# Predict rating for a user-movie pair
def predict_rating(model, user_id, movie_title, movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords):
    user_idx = user_to_index.get(user_id, -1)
    if user_idx == -1:
        return f"User {user_id} not found."
    
    movie_row = movies[movies['original_title'].str.lower() == movie_title.lower()]
    if movie_row.empty:
        similar_titles = find_similar_titles(movie_title, movies)
        if similar_titles:
            suggestion = f"Movie '{movie_title}' not found. Did you mean: {', '.join(similar_titles)}?"
        else:
            suggestion = f"Movie '{movie_title}' not found in the dataset."
        return suggestion
    
    movie_id = movie_row['id'].iloc[0]
    movie_idx = movie_to_index.get(movie_id, -1)
    if movie_idx == -1:
        return f"Movie ID {movie_id} not found in the model's movie index."
    
    genres = movie_row['genres'].iloc[0]
    keywords = movie_row['keywords'].iloc[0]
    genre_indices = [genre_to_index[g] for g in genres if g in genre_to_index] or [genre_to_index['Unknown']]
    keyword_indices = [keyword_to_index[k] for k in keywords if k in keyword_to_index] or [keyword_to_index['Unknown']]
    genre_data = pad_sequences([genre_indices], maxlen=max_genres, padding='post')
    keyword_data = pad_sequences([keyword_indices], maxlen=max_keywords, padding='post')
    
    rating, _ = model.predict([np.array([user_idx]), np.array([movie_idx]), genre_data, keyword_data], verbose=0)
    raw_rating = rating[0][0]  # Store raw prediction
    rating = np.clip(raw_rating * 4.5 + 0.5, 0.5, 5.0)
    assert 0.5 <= rating <= 5.0, f"Rating {rating} out of [0.5, 5.0]"
    logger.info(f"Raw rating for {movie_title}: {raw_rating:.4f}, Clipped: {rating:.4f}")
    return f"Predicted rating for '{movie_title}' by user {user_id}: {rating:.2f}"

**Recommend Top-N Movies for a User**

This cell defines the recommend_movies function to generate a list of top-N movie recommendations for a user.





Purpose: Recommend the top-N movies with predicted ratings above 3.5, formatted as a table.



Functionality:





Validates user existence and available movies.



Prepares input data for all valid movies (user index, movie indices, padded genres, keywords).



Predicts ratings, rescales to [0.5, 5.0], and filters movies with ratings > 3.5.



Falls back to top-N if no movies exceed the threshold.



Formats output as a DataFrame with movie titles, genres, and truncated overviews.



Logging:





Logs the range of predicted ratings and fallback usage.



Output: Returns a DataFrame of recommendations and prints a formatted table.

In [100]:
# Recommend top-N movies for a user
def recommend_movies(model, user_id, movies, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, top_n=5):
    user_idx = user_to_index.get(user_id, -1)
    if user_idx == -1:
        return f"User {user_id} not found."
    
    valid_movies = movies[movies['id'].isin(movie_to_index.keys())].copy()
    if valid_movies.empty:
        return "No valid movies available for recommendation."
    
    movie_indices = np.array([movie_to_index[mid] for mid in valid_movies['id']])
    user_data = np.full_like(movie_indices, user_idx)
    genre_data = pad_sequences(
        valid_movies['genres'].apply(
            lambda x: [genre_to_index[g] for g in x if g in genre_to_index] or [genre_to_index['Unknown']]
        ),
        maxlen=max_genres,
        padding='post'
    )
    keyword_data = pad_sequences(
        valid_movies['keywords'].apply(
            lambda x: [keyword_to_index[k] for k in x if k in keyword_to_index] or [keyword_to_index['Unknown']]
        ),
        maxlen=max_keywords,
        padding='post'
    )
    
    ratings, _ = model.predict([user_data, movie_indices, genre_data, keyword_data], verbose=0)
    ratings = np.clip(ratings.flatten() * 4.5 + 0.5, 0.5, 5.0)  # Rescale [0, 1] to [0.5, 5]
    logger.info(f"Recommendation ratings range: {ratings.min():.2f} to {ratings.max():.2f}")
    
    # Filter predictions above threshold
    valid_indices = np.where(ratings > 3.5)[0]
    if len(valid_indices) == 0:
        # Fallback: select top-N movies regardless of threshold
        valid_indices = np.argsort(ratings)[::-1][:top_n]
        logger.info(f"User {user_id}: No movies above 3.5; using top {top_n} instead")
    
    ratings = ratings[valid_indices]
    movie_indices = movie_indices[valid_indices]
    valid_movies = valid_movies.iloc[valid_indices].reset_index(drop=True)
    
    top_indices = np.argsort(ratings)[::-1][:top_n]
    top_movie_ids = [valid_movies['id'].iloc[i] for i in top_indices]
    top_movies = movies[movies['id'].isin(top_movie_ids)][['original_title', 'genres', 'overview']].copy()
    
    # Format genres as string and create DataFrame
    top_movies['genres'] = top_movies['genres'].apply(lambda x: ', '.join(x) if x else 'Unknown')
    top_movies['overview'] = top_movies['overview'].apply(lambda x: x[:100] + '...' if len(x) > 100 else x)
    result_df = top_movies[['original_title', 'genres', 'overview']].reset_index(drop=True)
    
    # Display table
    print(f"\nTop {top_n} recommendations for user {user_id}:")
    return result_df

In [111]:
# Recommend movies similar to a given movie, personalized for a user
def recommend_similar_movies(model, user_id, movie_title, movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, top_n=5):
    user_idx = user_to_index.get(user_id, -1)
    if user_idx == -1:
        logger.info(f"User {user_id} not found.")
        return f"User {user_id} not found."
    
    movie_row = movies[movies['original_title'].str.lower() == movie_title.lower()]
    if movie_row.empty:
        similar_titles = find_similar_titles(movie_title, movies)
        if similar_titles:
            suggestion = f"Movie '{movie_title}' not found. Did you mean: {', '.join(similar_titles)}?"
        else:
            suggestion = f"Movie '{movie_title}' not found in the dataset."
        logger.info(suggestion)
        return suggestion
    
    # Get reference movie details
    ref_movie_id = movie_row['id'].iloc[0]
    ref_genres = set(movie_row['genres'].iloc[0])
    ref_keywords = set(movie_row['keywords'].iloc[0])
    
    # Compute similarity (Jaccard for genres and keywords)
    def jaccard_similarity(set1, set2):
        if not set1 and not set2:
            return 0.0
        intersection = len(set1 & set2)
        union = len(set1 | set2)
        return intersection / union if union > 0 else 0.0
    
    valid_movies = movies[movies['id'].isin(movie_to_index.keys())].copy()
    if valid_movies.empty:
        logger.info("No valid movies available for recommendation.")
        return "No valid movies available for recommendation."
    
    valid_movies['similarity'] = valid_movies.apply(
        lambda x: 0.7 * jaccard_similarity(ref_genres, set(x['genres'])) + 
                  0.3 * jaccard_similarity(ref_keywords, set(x['keywords'])),
        axis=1
    )
    
    # Filter top 50 similar movies (excluding the reference movie)
    similar_movies = valid_movies[valid_movies['id'] != ref_movie_id].nlargest(50, 'similarity')
    if similar_movies.empty:
        logger.info(f"No movies similar to '{movie_title}' found.")
        return f"No movies similar to '{movie_title}' found."
    
    # Prepare model inputs
    movie_indices = np.array([movie_to_index[mid] for mid in similar_movies['id']])
    user_data = np.full_like(movie_indices, user_idx)
    genre_data = pad_sequences(
        similar_movies['genres'].apply(
            lambda x: [genre_to_index[g] for g in x if g in genre_to_index] or [genre_to_index['Unknown']]
        ),
        maxlen=max_genres,
        padding='post'
    )
    keyword_data = pad_sequences(
        similar_movies['keywords'].apply(
            lambda x: [keyword_to_index[k] for k in x if k in keyword_to_index] or [keyword_to_index['Unknown']]
        ),
        maxlen=max_keywords,
        padding='post'
    )
    
    # Predict ratings
    ratings, _ = model.predict([user_data, movie_indices, genre_data, keyword_data], verbose=0)
    ratings = np.clip(ratings.flatten() * 4.5 + 0.5, 0.5, 5.0)
    assert ratings.min() >= 0.5 and ratings.max() <= 5.0, f"Ratings range {ratings.min()} to {ratings.max()}"
    logger.info(f"Similar movies ratings range for user {user_id}: {ratings.min():.2f} to {ratings.max():.2f}")
    
    # Filter by threshold or take top-N
    valid_indices = np.where(ratings > 3.5)[0]
    if len(valid_indices) == 0:
        valid_indices = np.argsort(ratings)[::-1][:top_n]
        logger.info(f"User {user_id}: No similar movies above 3.5; using top {top_n} instead")
    
    ratings = ratings[valid_indices]
    movie_indices = movie_indices[valid_indices]
    similar_movies = similar_movies.iloc[valid_indices].reset_index(drop=True)
    
    top_indices = np.argsort(ratings)[::-1][:top_n]
    top_movie_ids = [similar_movies['id'].iloc[i] for i in top_indices]
    top_movies = movies[movies['id'].isin(top_movie_ids)][['original_title', 'genres', 'overview']].copy()
    
    # Format output
    top_movies['genres'] = top_movies['genres'].apply(lambda x: ', '.join(x) if x else 'Unknown')
    top_movies['overview'] = top_movies['overview'].apply(lambda x: x[:100] + '...' if len(x) > 100 else x)
    result_df = top_movies[['original_title', 'genres', 'overview']].reset_index(drop=True)
    
    print(f"\nTop movies similar to '{movie_title}' for user {user_id}:")
    return result_df

**Main Execution and Model Training**

This cell contains the main execution block to orchestrate data loading, model training, and evaluation.





Purpose: Execute the end-to-end pipeline for training and evaluating the recommender system.



Functionality:





Loads and preprocesses data using load_data.



Creates index mappings with create_mappings.



Prepares training data with prepare_data.



Splits data into 80% training and 20% test sets.



Builds and trains the model using build_model and train_model.



Evaluates top-K metrics with evaluate_top_k.



Saves the trained model to a file.



Logging:





Logs data loading, mapping sizes, sample counts, training completion, and best validation MAE.



Output: Saves the model and logs evaluation metrics.

In [102]:
# Main execution
if __name__ == "__main__":
    # Load data
    movies, ratings = load_data()
    logger.info("Data loaded successfully")
    
    # Create mappings
    user_to_index, movie_to_index, genre_to_index, keyword_to_index = create_mappings(movies, ratings)
    logger.info(f"Users: {len(user_to_index)}, Movies: {len(movie_to_index)}, Genres: {len(genre_to_index)}, Keywords: {len(keyword_to_index)}")
    
    # Prepare data
    user_data, movie_data, genre_data, keyword_data, ratings_data, relevance_data, max_genres, max_keywords = prepare_data(
        ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index
    )
    
    # Split data
    train_idx, test_idx = train_test_split(np.arange(len(ratings_data)), test_size=0.2, random_state=42)
    train_data = (user_data[train_idx], movie_data[train_idx], genre_data[train_idx], keyword_data[train_idx], 
                  ratings_data[train_idx], relevance_data[train_idx])
    test_data = (user_data[test_idx], movie_data[test_idx], genre_data[test_idx], keyword_data[test_idx], 
                 ratings_data[test_idx], relevance_data[test_idx])
    
    logger.info(f"Total samples: {len(ratings_data)}, Training samples: {len(train_data[0])}, Test samples: {len(test_data[0])}")
    
    # Build and train model
    model = build_model(len(user_to_index), len(movie_to_index), len(genre_to_index), len(keyword_to_index), max_genres, max_keywords)
    logger.info("Model built successfully")
    history = train_model(model, train_data, test_data)
    logger.info(f"Training completed. Best validation MAE: {min(history.history['val_rating_output_mae']):.4f}")
    
    # Evaluate top-K metrics
    try:
        precision, recall = evaluate_top_k(model, test_data, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, k=10)
        logger.info(f"Top-K evaluation completed successfully")
    except Exception as e:
        logger.error(f"Failed to evaluate top-K metrics: {str(e)}")
    
    # Save model
    model.save('/kaggle/working/final_precision_recall_movie_recommender.h5')
    logger.info("Model saved to /kaggle/working/final_precision_recall_movie_recommender.h5")

Epoch 1/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 192ms/step - loss: 3.9308 - rating_output_loss: 2.0184 - rating_output_mae: 1.2309 - relevance_output_accuracy: 0.2214 - relevance_output_loss: 1.3632 - val_loss: 3.3552 - val_rating_output_loss: 2.0711 - val_rating_output_mae: 1.3920 - val_relevance_output_accuracy: 0.2114 - val_relevance_output_loss: 0.8562 - learning_rate: 1.0000e-04
Epoch 2/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 181ms/step - loss: 2.0802 - rating_output_loss: 0.6292 - rating_output_mae: 0.6725 - relevance_output_accuracy: 0.2809 - relevance_output_loss: 1.0771 - val_loss: 1.8774 - val_rating_output_loss: 0.7321 - val_rating_output_mae: 0.7712 - val_relevance_output_accuracy: 0.2466 - val_relevance_output_loss: 0.7708 - learning_rate: 1.0000e-04
Epoch 3/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 178ms/step - loss: 1.4485 - rating_output_loss: 0.3585 - rating_output_mae: 0.5186 - rel

**Display User’s Historical Data**

This cell defines the display_user_history function to show the movies a user has rated, along with their titles, genres, overviews, and ratings.





Purpose: Display a user’s rating history to analyze preferences before generating predictions or recommendations.



Functionality:





Filters ratings for the specified user and merges with movie metadata.



Formats genres as comma-separated strings and truncates overviews to 100 characters.



Creates a DataFrame with columns: Movie Title, Genres, Overview, User Rating.



Logs the genre distribution to identify dominant preferences (e.g., drama).



Logging:





Logs the number of rated movies and genre counts.



Output: Returns a DataFrame of the user’s history and prints a formatted table, or a message if no data is found.

In [108]:
# Example usage for user
print("\n=== User 123 Analysis ===")
history_123 = display_user_history(123, movies, ratings)
if history_123 is not None:
    display(history_123)


=== User 123 Analysis ===

Historical data for user 123 (13 movies rated):


Unnamed: 0,Movie Title,Genres,Overview,User Rating
0,The Wanderers,Drama,The streets of the Bronx are owned by 60’s you...,4.0
1,High Noon,Western,High Noon is about a recently freed leader of ...,5.0
2,Kurz und schmerzlos,"Drama, Thriller",Three friends get caught in a life of major cr...,5.0
3,Dog Day Afternoon,"Crime, Drama, Thriller",A man robs a bank to pay for his lover's opera...,3.0
4,Fools Rush In,"Drama, Comedy, Romance",Alex Whitman (Matthew Perry) is a designer fro...,4.0
5,Jezebel,"Drama, Romance","In 1850s Louisiana, the willfulness of a tempe...",4.0
6,Anatomie de l'enfer,Drama,A man rescues a woman from a suicide attempt i...,4.0
7,The Greatest Story Ever Told,"Drama, History",All-star epic retelling of Christ's life.,5.0
8,The Bourne Supremacy,"Action, Drama, Thriller",When a CIA operation to purchase classified Ru...,5.0
9,Young and Innocent,"Drama, Crime",Derrick De Marney finds himself in a 39 Steps ...,5.0


**Example Predictions and Recommendations**

Demonstrates the model’s functionality by generating example predictions and recommendations.





Purpose: Showcase the model’s ability to predict ratings and recommend movies for specific users.



Functionality:





Predicts ratings for user 123 for ‘Ein Lied von Liebe und Tod – Gloomy Sunday’ and ‘Toy Story’.



Predicts rating for user 13 for a non-existent movie to test error handling.



Generates top-5 movie recommendations for user 123, displayed as a formatted table.



Output: Prints predicted ratings and recommendation table.

In [104]:
# Example predictions
print(predict_rating(model, 123, 'Ein Lied von Liebe und Tod – Gloomy Sunday', movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords))
print(predict_rating(model, 13, 'Non Existent Movie', movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords))
print(predict_rating(model, 123, 'Toy Story', movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords))

Predicted rating for 'Ein Lied von Liebe und Tod – Gloomy Sunday' by user 123: 4.33
Movie 'Non Existent Movie' not found. Did you mean: Silent Movie, No Home Movie, Extreme Movie?
Predicted rating for 'Toy Story' by user 123: 2.77


In [105]:
# Display recommendations as table
recommendations = recommend_movies(model, 123, movies, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, top_n=10)
display(recommendations)


Top 10 recommendations for user 123:


Unnamed: 0,original_title,genres,overview
0,Attack of the Killer Tomatoes!,"Comedy, Horror, Science Fiction",Attack of the Killer Tomatoes is a 1978 comedy...
1,Trouble in Paradise,"Comedy, Romance",Trouble in Paradise is one of the most importa...
2,Du rififi chez les hommes,"Drama, Action, Crime","Out of prison after a five-year stretch, jewel..."
3,My Darling Clementine,"Drama, Western",Wyatt Earp and his brothers Morgan and Virgil ...
4,Monsieur Ibrahim et les fleurs du Coran,Drama,Monsieur Ibrahim is a story about a young Jewi...
5,Journal d'un curé de campagne,Drama,A frail priest is assigned to a small French p...
6,INLAND EMPIRE,"Horror, Drama, Mystery, Thriller",An actress's perception of reality becomes inc...
7,Les demoiselles de Rochefort,"Music, Romance, Drama, Comedy",Delphine and Solange are two sisters living in...
8,The Butter Battle Book,"Animation, Family",The Zooks and the Yooks are at war over the bu...
9,Ballo a tre passi,Drama,Four separate-but-interconnected stories - one...


In [118]:
similar_recommendations_123 = recommend_similar_movies(model, 1, 'Toy Story', movies, ratings, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, top_n=5)
display(similar_recommendations_123)


Top movies similar to 'Toy Story' for user 1:


Unnamed: 0,original_title,genres,overview
0,A Close Shave,"Family, Animation, Comedy",Wallace falls in love with wool-shop owner Wen...
1,The Wrong Trousers,"Animation, Comedy, Family",Gromit finds himself being pushed out of his r...
2,"Monsters, Inc.","Animation, Comedy, Family","James Sullivan and Mike Wazowski are monsters,..."
3,Meet the Robinsons,"Animation, Comedy, Family","In this animated adventure, brilliant preteen ..."
4,The Simpsons Movie,"Animation, Comedy, Family",After Homer accidentally pollutes the town's w...


In [119]:
# Display recommendations as table
recommendations = recommend_movies(model, 1, movies, user_to_index, movie_to_index, genre_to_index, keyword_to_index, max_genres, max_keywords, top_n=10)
display(recommendations)


Top 10 recommendations for user 1:


Unnamed: 0,original_title,genres,overview
0,Once Were Warriors,Drama,A drama about a Maori family lving in Auckland...
1,Sleepless in Seattle,"Comedy, Drama, Romance",A young boy who tries to set his dad up on a d...
2,Mission: Impossible,"Adventure, Action, Thriller","When Ethan Hunt, the leader of a crack espiona..."
3,A River Runs Through It,Drama,A River Runs Through is a cinematographically ...
4,Taxi,"Action, Comedy, Crime","In Marseilles (France), skilled pizza delivery..."
5,Broken Flowers,"Comedy, Drama, Mystery, Romance",As the devoutly single Don Johnston is dumped ...
6,Lonely Hearts,"Drama, Thriller, Crime, Romance","In the late 1940's, Martha Beck and Raymond Fe..."
7,Terminator Salvation,"Action, Science Fiction, Thriller","All grown up in post-apocalyptic 2018, John Co..."
8,Miffo,"Comedy, Drama","Tobias is the new, idealistic priest in a subu..."
9,The Lost Continent,"Adventure, Fantasy",An eclectic group of characters set sail on Ca...
