The following code snippet imports necessary libraries and modules for a machine learning task. Here's a description of each import:

- `numpy` (imported as `np`): A library for numerical operations and array manipulation in Python.
- `pandas` (imported as `pd`): A library for data manipulation and analysis, providing data structures and functions to work with structured data.
- `scipy.stats`: A module from SciPy, a scientific computing library in Python, providing statistical functions and distributions.
- `sklearn.model_selection`: A module from scikit-learn, a popular machine learning library in Python, used for splitting data into training and testing sets.
- `keras.models`: A module from Keras, a deep learning library, used for defining and training models.
- `keras.layers`: A module from Keras, used for constructing the layers of a neural network model.
- `keras.optimizers`: A module from Keras, providing various optimization algorithms for training neural networks.
- `keras.callbacks`: A module from Keras, containing callbacks that can be used during model training.
- `keras.losses`: A module from Keras, providing different loss functions for regression and classification tasks.

This code snippet sets up the necessary imports for working with Keras and other required libraries in a machine learning project. These libraries provide essential functionality for data manipulation, model construction, optimization, and evaluation.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Embedding, Flatten, Dot
from keras.optimizers import Adam, RMSprop, SGD
from keras.callbacks import ModelCheckpoint
from keras.losses import MeanAbsoluteError, MeanSquaredError, Huber, LogCosh

# Load data

Load the preprocessed users, ratings, and books tables from CSV files.

In [None]:
users = pd.read_csv('../Data/users.csv')
ratings = pd.read_csv('../Data/ratings.csv')
books = pd.read_csv('../Data/books.csv')

# Create user-id and book-id mapping

The code below builds the ID mappings in six steps:

1. **Get Unique User IDs**: Retrieves the unique user IDs from the 'user_id' column of the ratings DataFrame and converts them to a list using the `tolist()` function. The user IDs are stored in the `user_ids` variable.

2. **Create User ID to Index Mapping**: Creates a dictionary, `user2user_encoded`, to map each user ID to its corresponding index. The dictionary comprehension iterates over the `user_ids` list, assigning an index (starting from 0) to each user ID using the `enumerate()` function. The resulting dictionary maps each user ID to its index.

3. **Create Index to User ID Mapping**: Creates a dictionary, `userencoded2user`, to map each index to its corresponding user ID. The dictionary comprehension performs the reverse mapping, iterating over the enumerated indices and assigning each index to its corresponding user ID.

4. **Get Unique Book IDs**: Retrieves the unique book IDs from the 'book_id' column of the ratings DataFrame and converts them to a list using the `tolist()` function. The book IDs are stored in the `book_ids` variable.

5. **Create Book ID to Index Mapping**: Creates a dictionary, `book2book_encoded`, to map each book ID to its corresponding index. Similar to the user ID mapping, the dictionary comprehension assigns an index (starting from 0) to each book ID using the `enumerate()` function.

6. **Create Index to Book ID Mapping**: Creates a dictionary, `book_encoded2book`, to map each index to its corresponding book ID. The dictionary comprehension performs the reverse mapping, assigning each index to its corresponding book ID.

These mappings between user and book IDs and their corresponding indices are essential when working with embedding layers in recommendation systems. They provide a convenient way to convert between raw IDs and indices during model training and prediction.



In [None]:
user_ids = ratings['user_id'].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
book_ids = ratings['book_id'].unique().tolist()
book2book_encoded = {x: i for i, x in enumerate(book_ids)}
book_encoded2book = {i: x for i, x in enumerate(book_ids)}
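
To make the round trip concrete, here is a minimal sketch of the same pattern on a toy list. The ID values and the names `raw_user_ids`, `id_to_index`, and `index_to_id` are made up for illustration and are not part of the notebook's data.

```python
# Toy IDs standing in for ratings['user_id']; the values are invented.
raw_user_ids = [276725, 276726, 276725, 276729]

# dict.fromkeys keeps first-seen order, like Series.unique().tolist()
unique_ids = list(dict.fromkeys(raw_user_ids))

# Forward mapping (raw ID -> index) and reverse mapping (index -> raw ID)
id_to_index = {x: i for i, x in enumerate(unique_ids)}
index_to_id = {i: x for i, x in enumerate(unique_ids)}

encoded = [id_to_index[x] for x in raw_user_ids]
decoded = [index_to_id[i] for i in encoded]

print(encoded)                  # [0, 1, 0, 2]
print(decoded == raw_user_ids)  # True
```

The indices are contiguous and start at 0, which is what an embedding lookup expects.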

# Map user-id and book-ids to user and book indices

1. **Map User IDs to Indices**: Adds a new column called 'user' to the ratings DataFrame by mapping the 'user_id' column through the `user2user_encoded` dictionary with the `map()` function. The original 'user_id' column is kept; the new column holds the corresponding index for each row.

2. **Map Book IDs to Indices**: Adds a new column called 'book' to the ratings DataFrame by mapping the 'book_id' column through the `book2book_encoded` dictionary. The original 'book_id' column is kept as well.

Embedding layers look up rows of a weight matrix by position, so they need contiguous integer indices starting at 0 rather than arbitrary raw IDs. After this step, every row of `ratings` carries a 'user' and a 'book' index that can be fed directly to the model.


In [None]:
ratings['user'] = ratings['user_id'].map(user2user_encoded)
ratings['book'] = ratings['book_id'].map(book2book_encoded)
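
One property of `Series.map()` with a dictionary is worth knowing: keys that are missing from the dictionary become `NaN`. In this notebook the dictionaries are built from the same columns, so no gaps are expected, but the following sketch (with an invented toy frame and a deliberately incomplete mapping) shows how such a gap could be detected.

```python
import pandas as pd

# Toy frame; the IDs are invented for illustration.
toy = pd.DataFrame({"user_id": [10, 20, 10, 30]})

# Hypothetical mapping that deliberately omits ID 30
mapping = {10: 0, 20: 1}

toy["user"] = toy["user_id"].map(mapping)

# Unmapped IDs show up as NaN, which also turns the column into floats
n_missing = int(toy["user"].isna().sum())
print(n_missing)  # 1
```

A quick `isna().sum()` check after mapping is a cheap guard before the indices reach an embedding layer.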

# Split data into training and testing set

1. **Split Data into Train and Test Sets**: Splits the ratings DataFrame into train and test sets using the `train_test_split()` function from scikit-learn. The `train_test_split()` function takes the input data (ratings) and splits it into two subsets based on the specified `test_size` parameter. In this case, the test set will have a size of 20% of the entire dataset.

2. **Assign Train and Test Sets**: Assigns the train and test sets to the variables `train` and `test`, respectively. The train set will contain 80% of the data, while the test set will contain 20% of the data.

By splitting the data into train and test sets, we create separate subsets that can be used for training and evaluating the recommendation model. The train set is used to train the model, while the test set is used to evaluate its performance and generalization on unseen data.


In [None]:
train, test = train_test_split(ratings, test_size=0.2, random_state=42)
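
The sketch below illustrates the split on a small invented DataFrame: a `test_size` of 0.2 yields an 80/20 partition, and fixing `random_state` makes the partition reproducible. The toy data and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented rating rows
toy = pd.DataFrame({"user": range(10), "book": range(10), "rating": [5] * 10})

# Two calls with the same random_state produce the same partition
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))  # 8 2
same_split = test_a.index.equals(test_b.index)
print(same_split)  # True
```

Note that a purely random split can place a user's or a book's only rating in the test set, in which case its embedding receives no training signal.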

# Get the number of users and books


1. **Get the Number of Users**: Calculates the number of unique users in the dataset by taking the length of the `user2user_encoded` dictionary. The `len()` function returns the number of elements in the dictionary, which represents the total number of unique users.

2. **Get the Number of Books**: Calculates the number of unique books in the dataset by taking the length of the `book_encoded2book` dictionary. Similarly, the `len()` function returns the number of elements in the dictionary, which represents the total number of unique books.

By obtaining the number of users and books, we can determine the dimensions of the embedding layers in the recommendation model. The number of users corresponds to the number of unique user indices, and the number of books corresponds to the number of unique book indices. These values are essential for setting the correct dimensions of the embedding layers to capture the user and book representations.


In [None]:
num_users = len(user2user_encoded)
num_books = len(book_encoded2book)

# Set embedding dimension

1. **Set the Embedding Dimension**: Specifies the dimensionality of the embedding vectors in the recommendation model. The `embedding_dim` variable is set to 10, which means each user and book will be represented by a vector of length 10 in the embedding space.

By setting the embedding dimension, we determine the size of the vector representations for users and books. A higher embedding dimension may allow for more expressive representations but can also increase the model's complexity and resource requirements. On the other hand, a lower embedding dimension may result in more compact representations but may also limit the model's ability to capture intricate user-item interactions.


In [None]:
embedding_dim=10
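
To get a feel for the cost of a given embedding dimension, the number of trainable weights in the two embedding tables can be computed directly. The helper name `embedding_param_count` and the user and book counts below are hypothetical, not figures from this dataset.

```python
def embedding_param_count(num_users: int, num_books: int, embedding_dim: int) -> int:
    """Weights in a user table and a book table of the given width."""
    return (num_users + num_books) * embedding_dim

# Hypothetical sizes: 1,000 users and 500 books
print(embedding_param_count(1000, 500, 10))  # 15000
print(embedding_param_count(1000, 500, 50))  # 75000
```

The count grows linearly with `embedding_dim`, so doubling the dimension doubles the memory used by the embeddings.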

# Build model
We're using the Keras Functional API to build a model with Embedding layers for users and books.  
These embeddings will learn to represent user preferences and book properties during training.


1. **Define Input Layers**: Creates two input layers, `user_input` and `book_input`, which receive the user index and the book index, respectively. Each input has shape `(1,)`, one integer index per sample.

2. **Embedding Layers**: Creates embedding layers for users and books using the `Embedding` class. The embedding layers map the user and book indices to their corresponding embedding vectors in the latent space. The size of each embedding table is determined by the number of users or books (`num_users` and `num_books`) and the specified `embedding_dim`.

3. **Flatten Layers**: Flattens the user and book embeddings using the `Flatten` class. For each sample, this converts the `(1, embedding_dim)` output of the embedding layer into a vector of length `embedding_dim`.

4. **Dot Product**: Combines the flattened user and book vectors with the `Dot` layer (`axes=1`). The dot product of the two vectors is the predicted rating for that user-book pair. There are no dense layers, bias terms, or activation functions, so the model is a plain matrix factorization and its output is not bounded to the rating scale.

5. **Create the Model**: Creates an instance of the `Model` class, with the two input layers as inputs and the dot product as the output.

6. **Return the Model**: Returns the created model.

The `create_model` function encapsulates the creation of the recommendation model, so a fresh, untrained model can be built for every hyperparameter combination.


In [None]:
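def create_model():
    """Build a dot-product (matrix factorization) model for user-book ratings."""
    # User branch: integer index -> embedding vector of length embedding_dim
    user_input = Input(shape=(1,), name="user_input")
    user_embedding = Embedding(num_users, embedding_dim, name="user_embedding")(user_input)
    user_vec = Flatten()(user_embedding)

    # Book branch: same structure as the user branch
    book_input = Input(shape=(1,), name="book_input")
    book_embedding = Embedding(num_books, embedding_dim, name="book_embedding")(book_input)
    book_vec = Flatten()(book_embedding)

    # Predicted rating = dot product of the book and user vectors
    product = Dot(axes=1)([book_vec, user_vec])
    model = Model(inputs=[user_input, book_input], outputs=product)

    return model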
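
As a rough illustration of what this architecture computes, the following NumPy sketch performs the same lookup and dot product by hand. The table sizes and random values are toy assumptions; in the real model the tables are the learned weights of the two `Embedding` layers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_books, embedding_dim = 4, 6, 3  # toy sizes, not the real data

# Stand-ins for the learned embedding tables
user_table = rng.normal(size=(num_users, embedding_dim))
book_table = rng.normal(size=(num_books, embedding_dim))

# A batch of three (user index, book index) pairs
user_idx = np.array([0, 2, 3])
book_idx = np.array([5, 1, 1])

# Embedding lookup is row selection; each result has shape (batch, embedding_dim)
user_vecs = user_table[user_idx]
book_vecs = book_table[book_idx]

# Row-wise dot product, kept as a column like the output of Dot(axes=1)
predicted = np.sum(user_vecs * book_vecs, axis=1, keepdims=True)
print(predicted.shape)  # (3, 1)
```

Training adjusts the rows of both tables so that these dot products approach the observed ratings.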

# Train model

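The `train_model` function trains the model held in the global variable `model`, so `create_model()` must be called and its result assigned to `model` before each call. It performs the following steps: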

1. **Compile the Model**: Compiles the recommendation model with the specified `loss_function` and `optimizer`. This step configures the model for training by defining the loss function to optimize and the optimizer algorithm to use.

2. **Fit the Model**: Fits the recommendation model to the training data using the `fit` method. It specifies the training inputs (`[train.user.values, train.book.values]`), the training targets (`train.rating.values`), and other parameters such as `batch_size`, `epochs`, and `verbose`. This step trains the model on the training data for the specified number of epochs.

3. **Validate the Model**: Passes `[test.user.values, test.book.values]` and `test.rating.values` as `validation_data`, so Keras evaluates the model on this data after every epoch and records the result as `val_loss`. Note that the test split doubles as the validation set here: because hyperparameters are later chosen on the same data, the final test scores are likely to be somewhat optimistic.

4. **Save the Model**: Saves the trained model to the specified `model_save_path` using the `ModelCheckpoint` callback. This ensures that only the best model based on the validation loss is saved.

5. **Return the Validation Loss**: Returns the validation loss (`val_loss`) from the history of the model. The validation loss provides an indication of how well the model is generalizing to unseen data.


In [None]:
def train_model(batch_size, optimizer, loss_function, model_save_path, num_epochs):
    # Uses the global `model` created by create_model() before this call.
    # The checkpoint writes the model to disk whenever val_loss improves.
    checkpoint = ModelCheckpoint(model_save_path, monitor='val_loss', save_best_only=True, mode='min')
    model.compile(loss=loss_function, optimizer=optimizer)
    history = model.fit(x=[train.user.values, train.book.values], y=train.rating.values,
                        batch_size=batch_size, epochs=num_epochs, verbose=1,
                        validation_data=([test.user.values, test.book.values], test.rating.values),
                        callbacks=[checkpoint])
    # No model.save() here: it would overwrite the best checkpoint with the
    # last-epoch model. Return the loss of the checkpointed (best) epoch.
    return min(history.history['val_loss'])
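
With `save_best_only=True`, the checkpoint file holds the model from the epoch with the lowest `val_loss`, which is not necessarily the last epoch. The sketch below uses an invented list of per-epoch losses to show the difference between the two.

```python
# Made-up per-epoch validation losses, shaped like history.history['val_loss']
val_losses = [1.20, 0.95, 0.90, 0.97, 1.05]

# Index of the epoch a save_best_only checkpoint would keep
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
best_loss = val_losses[best_epoch]
last_loss = val_losses[-1]

print(best_epoch + 1, best_loss, last_loss)  # 3 0.9 1.05
```

When the validation loss rises again in later epochs, as in this toy series, the last-epoch value overstates the loss of the model that was actually kept on disk.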

<b>Tuning the Recommendation Model</b>

- **Optimizers**: A list that contains instances of different optimizers, including Adam and RMSprop. Optimizers are responsible for updating the model's weights during training to minimize the loss function.

- **Loss Functions**: A list that contains instances of different loss functions, such as MeanAbsoluteError, MeanSquaredError, Huber, and LogCosh. Loss functions quantify the difference between predicted ratings and actual ratings.

- **Batch Sizes**: A list that contains the batch sizes to try, here 16 and 32, which determine the number of samples processed before updating the model's weights.

- **Model Paths**: An empty list that will collect the save path of every model trained during the tuning process.

- **Best Validation Loss**: A variable initialized with a large value (`float('inf')`) to track the best validation loss achieved during training.

- **Best Optimizer**: A variable that will store the optimizer yielding the best validation loss.

- **Best Loss Function**: A variable that will store the loss function yielding the best validation loss.

- **Best Batch Size**: A variable that will store the batch size yielding the best validation loss.

- **Best Number of Epochs**: A variable that will store the number of epochs corresponding to the best validation loss.

- **Best Model Path**: A variable that will store the path of the model with the best validation loss.

These variables and lists will be used to track and update the best hyperparameters and model performance during the tuning process.

In [None]:
optimizers = [Adam(), RMSprop()]
loss_functions = [MeanAbsoluteError(), MeanSquaredError(), Huber(), LogCosh()]
batch_sizes = [16, 32]
model_paths = []

best_val_loss = float('inf')
best_optimizer = None
best_loss_function = None
best_batch_size = None
best_num_epochs = None
best_model_path = None
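
For intuition about how the four loss functions differ, here is a minimal NumPy sketch of their textbook definitions. The function names and the toy ratings are illustrative assumptions; the Huber threshold `delta=1.0` is assumed to mirror the Keras default.

```python
import numpy as np

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def mse(y, p):
    return float(np.mean((y - p) ** 2))

def huber(y, p, delta=1.0):
    err = np.abs(y - p)
    quadratic = 0.5 * err ** 2              # used for small errors
    linear = delta * (err - 0.5 * delta)    # used for large errors
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def log_cosh(y, p):
    return float(np.mean(np.log(np.cosh(y - p))))

# Invented ratings: errors of 1, 0, and 4
y_true = np.array([8.0, 5.0, 10.0])
y_pred = np.array([7.0, 5.0, 6.0])

for fn in (mae, mse, huber, log_cosh):
    print(fn.__name__, round(fn(y_true, y_pred), 4))
```

The same predictions produce quite different numbers under each loss, because the large error of 4 is squared by MSE but treated roughly linearly by the other three.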

<b>Tuning Hyperparameters and Training the Recommendation Model</b>

1. **Optimizer Loop**: The outer loop iterates over the optimizers in the `optimizers` list.

2. **Loss Function Loop**: The second loop iterates over the loss functions in the `loss_functions` list.

3. **Batch Size Loop**: The innermost loop iterates over the batch sizes in the `batch_sizes` list.

4. **Number of Epochs**: The `num_epochs` variable is set to 30, indicating the number of training epochs for each combination of hyperparameters.

5. **Create Model**: A new instance of the recommendation model is created using the `create_model()` function.

6. **Model Save Path**: The `model_save_path` variable is generated to specify the path for saving the trained model based on the current combination of optimizer, loss function, and batch size.

7. **Print Hyperparameters**: The optimizer, loss function, and batch size are printed for tracking the progress of the tuning process.

8. **Train the Model**: The `train_model()` function is called to train the model with the current hyperparameters. The function takes the batch size, optimizer, loss function, model save path, and number of epochs as input arguments. It returns the validation loss of the trained model.

9. **Update Best Hyperparameters**: If the validation loss obtained with the current hyperparameters is lower than the previous best validation loss (`best_val_loss`), the best hyperparameters and model information are updated.

10. **Best Model Path**: The `best_model_path` variable stores the path of the model with the lowest validation loss among all the combinations.

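By iterating over all 16 combinations (2 optimizers × 4 loss functions × 2 batch sizes), the code selects the combination with the lowest validation loss. Keep in mind that each run reports its validation loss in the units of its own loss function, so the values are only directly comparable between runs that share a loss function.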

In [None]:
num_epochs = 30  # same for every combination

for optimizer in optimizers:
    for loss_function in loss_functions:
        for batch_size in batch_sizes:
            model = create_model()
            # Use a fresh copy of the optimizer for each model: in recent Keras
            # versions an optimizer instance is tied to the first model it trains.
            run_optimizer = optimizer.__class__.from_config(optimizer.get_config())
            model_save_path = f"model_{optimizer.__class__.__name__}_{loss_function.__class__.__name__}_batch{batch_size}.tf"
            model_paths.append(model_save_path)
            print(optimizer.__class__.__name__, loss_function.__class__.__name__, batch_size)
            val_loss = train_model(batch_size, run_optimizer, loss_function, model_save_path, num_epochs)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_optimizer = optimizer
                best_loss_function = loss_function
                best_batch_size = batch_size
                best_num_epochs = num_epochs
                best_model_path = model_save_path
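
The three nested loops enumerate a full grid. As a sketch, the same grid and the resulting file names can be generated with `itertools.product`; plain strings stand in for the optimizer and loss objects here, and the variable names are illustrative.

```python
from itertools import product

# Class names standing in for the optimizer and loss instances
optimizer_names = ["Adam", "RMSprop"]
loss_names = ["MeanAbsoluteError", "MeanSquaredError", "Huber", "LogCosh"]
batch_sizes = [16, 32]

# Every (optimizer, loss, batch size) combination, in nested-loop order
grid = list(product(optimizer_names, loss_names, batch_sizes))
paths = [f"model_{o}_{l}_batch{b}.tf" for o, l, b in grid]

print(len(grid))  # 16
print(paths[0])   # model_Adam_MeanAbsoluteError_batch16.tf
```

Since every combination trains for the full number of epochs, the total training cost is 16 times that of a single run.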

1. **Best Optimizer**: The `best_optimizer` variable holds the optimizer object with the lowest validation loss. This line prints the class name of the best optimizer using the `__class__.__name__` attribute.

2. **Best Loss Function**: The `best_loss_function` variable holds the loss function object with the lowest validation loss. This line prints the class name of the best loss function using the `__class__.__name__` attribute.

3. **Best Batch Size**: The `best_batch_size` variable stores the batch size value that yielded the lowest validation loss.

4. **Best Number of Epochs**: The `best_num_epochs` variable stores the number of epochs used for training the model with the best hyperparameters.

5. **Best Model Path**: The `best_model_path` variable contains the file path where the best-trained model is saved.

By printing this information, you can easily track the best hyperparameters and model details for further analysis and evaluation.

In [None]:
print(f"Best optimizer: {best_optimizer.__class__.__name__}")
print(f"Best loss function: {best_loss_function.__class__.__name__}")
print(f"Best batch size: {best_batch_size}")
print(f"Best number of epochs: {best_num_epochs}")
print(f"Best model path: {best_model_path}")

# Evaluate models
We evaluate our trained models on the test data to see which one performs best.

In [None]:
# Reuse the paths recorded in `model_paths` during tuning so that the
# names match the files that were actually written to disk.
results = []

y_true = test.rating.values
for model_path in model_paths:
    # compile=False: only predictions are needed, not the training loss
    model = load_model(model_path, compile=False)
    preds = model.predict([test.user.values, test.book.values]).flatten()
    # Score every model with the same metric (MSE) so the ranking is comparable;
    # model.evaluate would return each model's own training loss instead.
    test_mse = float(np.mean((preds - y_true) ** 2))
    results.append((model_path, test_mse))

# Sort the results based on test MSE in ascending order
results.sort(key=lambda x: x[1])

# Print the models and their corresponding test MSE in ranking order
for i, (model_path, test_mse) in enumerate(results, start=1):
    print(f'Rank {i}: Test MSE for {model_path}: {test_mse}')
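
One pitfall when scoring predictions by hand: Keras models with a single output unit return predictions of shape `(n, 1)`, while the ratings array has shape `(n,)`. Subtracting the two directly broadcasts to an `(n, n)` matrix and silently gives a wrong score. The sketch below flattens both arrays first; the helper name `mse` and the invented predictions are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [8, 5, 10]

# Invented column-shaped predictions for two hypothetical models
fake_predictions = {
    "model_a": [[7.0], [5.0], [9.0]],
    "model_b": [[8.0], [2.0], [10.0]],
}

# Rank models by the shared metric, lowest error first
ranked = sorted((mse(y_true, p), name) for name, p in fake_predictions.items())
print(ranked[0][1])  # model_a
```

Using one metric for all models keeps the ranking meaningful even when the models were trained with different loss functions.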