### This notebook implements the genre-classification part of the Cross-Domain Recommendation System. It uses a pre-trained RoBERTa model to predict genres for book text, mapping them to the unified genre space that the wider system uses to generate recommendations based on user preferences.

### This notebook performs the following steps:

* Imports and Setup: Loads necessary libraries and downloads NLTK data.
* Load Data and Model: Loads genre mappings and the pre-trained RoBERTa model.
* Text Preprocessing: Defines a function to clean and preprocess text data.
* Prepare Labels: Extracts genre labels for mapping predictions.
* Map Predictions: Defines a function that converts the model's binary output into human-readable genres.
* Make Predictions: Defines a function that cleans the test data, runs the model, and prints the predicted genres.
* Execute Predictions: Runs that function and inspects the model's raw scores.

These steps produce genre labels for books, which the system uses for cross-domain recommendations between books and movies.

In [3]:
import pickle
import json
import nltk
import re
from nltk.corpus import stopwords

# Download necessary NLTK data files
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\MSI-1\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MSI-1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Load the book genres from a JSON file
with open('book_genres.json', 'r') as f:
    book_genres = json.load(f)

# Load the pre-trained RoBERTa model
with open('model-v1.pkl', 'rb') as f:
    model = pickle.load(f)

In [4]:
# Define regular expressions for text cleaning.
# Raw strings keep escapes such as \[ and \| from triggering invalid-escape warnings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;-]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

In [5]:
def clean_text(text):
    """
    Cleans the input text by:
    - Converting to lowercase
    - Replacing specific symbols with space
    - Removing unwanted symbols
    - Removing stopwords
    """
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # Replace specified symbols with space
    text = BAD_SYMBOLS_RE.sub('', text)        # Remove unwanted symbols
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # Remove stopwords
    return text
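To make the effect of each cleaning step concrete, here is a minimal, self-contained sketch of `clean_text` applied to one sample string. The small `STOPWORDS` set and the sample title are assumptions made so the example runs without NLTK; the notebook itself uses NLTK's full English stopword list, so real results may drop more words.

```python
import re

# Same patterns as the notebook, written as raw strings
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;-]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# Assumption: a tiny stand-in for stopwords.words('english')
STOPWORDS = {'the', 'a', 'of', 'and', 'in'}

def clean_text(text):
    text = text.lower()                        # 1. lowercase
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # 2. separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # 3. drop remaining symbols
    # 4. drop stopwords; split() also collapses repeated spaces
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

sample = "The Lord of the Rings: A (Fantasy) Epic!"  # hypothetical input
cleaned = clean_text(sample)
print(cleaned)  # lord rings fantasy epic
```

Note that the parentheses are replaced by spaces in step 2, while the colon and exclamation mark are deleted outright in step 3.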

In [6]:
# Extract genre names from book_genres, skipping the last entry
# (this leaves 17 labels, one per score in the model's output)
labels = [x['genre'] for x in book_genres[:-1]]
labels

['Fiction',
 'Romance',
 'Nonfiction',
 "Children's",
 'Young Adult',
 'Teen',
 'Mystery',
 'Crime',
 'Thriller',
 'Fantasy',
 'Science Fiction',
 'Horror',
 'Drama',
 'Poetry',
 'Art',
 'Humor',
 'Religion']

In [7]:
def map_predictions(predictions):
    """
    Maps binary prediction arrays to their corresponding genre labels.

    Args:
        predictions (list of lists): Each sublist contains binary indicators for genres.

    Returns:
        list of lists: Each sublist contains the genres predicted for a book.
    """
    return [[genre for i, genre in enumerate(labels) if prediction[i]] for prediction in predictions]
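As a quick illustration of the mapping, the sketch below applies the same logic to two hand-written indicator rows. The `map_predictions_checked` name and its length check are additions for illustration only, not part of the notebook: the original function indexes `prediction[i]` directly, so a row shorter than `labels` would raise an `IndexError` and a longer one would be silently truncated.

```python
# The 17 labels shown in the notebook output above
labels = ['Fiction', 'Romance', 'Nonfiction', "Children's", 'Young Adult',
          'Teen', 'Mystery', 'Crime', 'Thriller', 'Fantasy',
          'Science Fiction', 'Horror', 'Drama', 'Poetry', 'Art',
          'Humor', 'Religion']

def map_predictions_checked(predictions):
    """Hypothetical variant of map_predictions that validates row length."""
    mapped = []
    for row in predictions:
        if len(row) != len(labels):
            raise ValueError(
                f"expected {len(labels)} indicators, got {len(row)}")
        # Keep every genre whose indicator is truthy (1 or True)
        mapped.append([genre for genre, flag in zip(labels, row) if flag])
    return mapped

# Hand-written example rows (not real model output)
example = [
    [0] * 17,                                             # no genre selected
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # indices 0 and 9
]
mapped = map_predictions_checked(example)
print(mapped)  # [[], ['Fiction', 'Fantasy']]
```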

In [8]:
def make_predictions():
    """
    Processes the test data, makes genre predictions, and prints the results.

    Returns:
        tuple: Contains raw predictions, labeled predictions, and model outputs.
    """
    # Read and clean test inputs
    with open('test.txt', 'r') as f:
        test_inputs = [clean_text(x.strip()) for x in f.readlines()]
    
    # Make predictions using the pre-trained model
    predictions, outputs = model.predict(test_inputs)
    
    # Map binary predictions to genre labels
    labeled_predictions = map_predictions(predictions)
    
    # Print each set of predicted genres
    for p in labeled_predictions:
        print(p)
    
    return predictions, labeled_predictions, outputs


In [9]:
# Execute the prediction function and store results
p, lp, o = make_predictions()

1it [00:07,  7.20s/it]
100%|██████████| 1/1 [00:00<00:00,  2.40it/s]

[]
['Nonfiction']
['Fiction', 'Fantasy']





In [11]:
# Display the model's raw per-genre scores for the first test input
o[0]

array([0.34887695, 0.07330322, 0.0165863 , 0.0053215 , 0.01327515,
       0.00201035, 0.07519531, 0.04302979, 0.09570312, 0.01542664,
       0.00734711, 0.00391388, 0.08404541, 0.00069046, 0.00054836,
       0.0165863 , 0.00728989])
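These scores suggest why the first test input was labeled `[]`: the largest value is about 0.35, so no genre would pass a cut-off of 0.5. That threshold is an assumption here; the notebook does not state what the model uses. The sketch below pairs the scores above with the label list and shows one possible fallback, ranking genres by score when nothing clears the threshold. The helper names `genres_above` and `top_k_genres` are hypothetical.

```python
labels = ['Fiction', 'Romance', 'Nonfiction', "Children's", 'Young Adult',
          'Teen', 'Mystery', 'Crime', 'Thriller', 'Fantasy',
          'Science Fiction', 'Horror', 'Drama', 'Poetry', 'Art',
          'Humor', 'Religion']

# Values copied from the o[0] output above
scores = [0.34887695, 0.07330322, 0.0165863, 0.0053215, 0.01327515,
          0.00201035, 0.07519531, 0.04302979, 0.09570312, 0.01542664,
          0.00734711, 0.00391388, 0.08404541, 0.00069046, 0.00054836,
          0.0165863, 0.00728989]

def genres_above(scores, threshold=0.5):
    """Genres whose score meets the threshold (0.5 is an assumed default)."""
    return [g for g, s in zip(labels, scores) if s >= threshold]

def top_k_genres(scores, k=3):
    """The k highest-scoring genres, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [g for g, _ in ranked[:k]]

selected = genres_above(scores)
fallback = top_k_genres(scores)
print(selected)  # []
print(fallback)  # ['Fiction', 'Thriller', 'Drama']
```

Whether a fallback like this is appropriate depends on how the recommendation step treats low-confidence genres; an empty list may be the more honest answer for some inputs.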