## # MovieBookGenreClassifier Notebook
## 
## **Purpose:**  
## This notebook prepares and maps book and movie genres to a set of 20 major genre 
## categories, performing data cleaning and feature mapping. It also trains a 
## RoBERTa-based multi-label classification model to classify books into these genres.
##
## **Datasets Used:**
## - `movies.csv`: Contains movie metadata and descriptions.
## - `books.csv`: Contains book metadata, genres, and descriptions.
##
## These datasets will be linked to the final cross-domain recommendation system 
## described in the main project. The outputs from this notebook (e.g., processed 
## data, saved model) will feed into the recommendation logic and other components.
##
## **Key Steps:**
## 1. Load and preview the raw datasets.
## 2. Clean and preprocess the data (remove NaNs, unwanted columns, stopwords).
## 3. Map original genres to a reduced set of 20 major genres.
## 4. Train a multi-label classification model (RoBERTa) on book descriptions.
## 5. Save transformed data and the trained model for downstream tasks.
## --------------------------------------------------------------------------------


In [6]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import re

In [18]:
# Load the movies dataset
movies = pd.read_csv("movies.csv")
len(movies["names"])

10178

In [19]:
# Load the books dataset
books = pd.read_csv("books.csv")
len(books["title"])

52478

In [3]:
# Download NLTK stopwords and punkt tokenizer if not already available
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mausam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mausam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Google Drive Mounting (Colab Specific):**
## The following cells assume running in Google Colab. They mount Google Drive and 
## handle credential files for Kaggle. If you're running locally, you may adjust 
## or skip these steps.
## --------------------------------------------------------------------------------

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
from google.colab import files
import os

# Check if Kaggle credentials are already present. If not, upload them.
if not os.path.isfile('kaggle.json'):
    files.upload()

In [None]:
# Install Kaggle CLI tool
!pip install -q kaggle

# Set permissions and configure Kaggle environment
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Print current directory for confirmation
!pwd

## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Downloading Datasets from Kaggle:**
## In this section, we use Kaggle CLI to download two datasets:
## 1. Goodreads Best Books Ever (for books)
## 2. IMDb Movies Dataset (for movies)
##
## After downloading, we unzip and reorganize them into a `dataset` directory 
## for standardized referencing.
## --------------------------------------------------------------------------------


In [None]:
!kaggle datasets download -d arnabchaki/goodreads-best-books-ever
!kaggle datasets download -d ashpalsingh1525/imdb-movies-dataset
!unzip goodreads-best-books-ever.zip -d books
!unzip imdb-movies-dataset.zip -d movies

In [None]:
# Create a dataset directory (no error if it already exists) and move final CSVs inside it
!mkdir -p dataset
!mv '/content/movies/imdb_movies.csv' '/content/dataset/movies.csv'
!mv '/content/books/books_1.Best_Books_Ever.csv' '/content/dataset/books.csv'


## --------------------------------------------------------------------------------
## **Data Loading:**
## Here we define paths to the `books.csv` and `movies.csv` files and load them 
## into Pandas DataFrames. The paths below are relative to the working directory; 
## in Colab, point them at the `dataset` directory created above.
## --------------------------------------------------------------------------------


In [4]:
books_path = 'books.csv'
movies_path = 'movies.csv'

books_df = pd.read_csv(books_path)
movies_df = pd.read_csv(movies_path)

## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Data Cleaning:**
## We drop unnecessary or noisy columns to streamline our datasets. Then we remove 
## rows with missing values (NaN). For movies, we also extract the first atomic genre 
## from the genre field.
## --------------------------------------------------------------------------------


In [6]:
# Drop irrelevant columns from books
books_df.drop(['series', 'author', 'characters', 'bookFormat', 'edition', 'pages', 
               'publisher', 'publishDate', 'firstPublishDate', 'awards', 'ratingsByStars', 
               'likedPercent', 'setting', 'bbeScore', 'bbeVotes', 'price'], 
              axis=1, inplace=True)
books_df.dropna(inplace=True)

# Drop irrelevant columns from movies
movies_df.drop(['date_x', 'crew', 'status', 'orig_lang', 'budget_x', 'revenue', 'country'], 
               axis=1, inplace=True)
movies_df.dropna(inplace=True)

# Extract a single representative genre (atomic genre) from the first listed genre in movies
movies_df['atomic_genres'] = movies_df['genre'].apply(lambda x: x.split(',')[0])

In [7]:
books_df.head()

Unnamed: 0,title,rating,description,language,isbn,genres,numRatings
0,The Hunger Games,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780440000000.0,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",6376780
1,Harry Potter and the Order of the Phoenix,4.5,There is a door at the end of a silent corrido...,English,9780440000000.0,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",2507623
2,To Kill a Mockingbird,4.28,The unforgettable novel of a childhood in a sl...,English,10000000000000.0,"['Classics', 'Fiction', 'Historical Fiction', ...",4501075
3,Pride and Prejudice,4.26,Alternate cover edition of ISBN 9780679783268S...,English,10000000000000.0,"['Classics', 'Fiction', 'Romance', 'Historical...",2998241
4,Twilight,3.6,About three things I was absolutely positive.\...,English,9780320000000.0,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",4964519


## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Text Cleaning Setup:**
## We define regular expressions and a cleaning function that lowercases the text, 
## removes symbols, and filters out stopwords in both movie and book descriptions.
## --------------------------------------------------------------------------------


In [8]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;-]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

In [9]:
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Replace specified symbols with space
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # Remove symbols not defined in allowed pattern
    text = BAD_SYMBOLS_RE.sub('', text)
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text
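
## The cleaning steps above delete disallowed symbols outright, which fuses words 
## that are separated only by a period (the book output below contains tokens such 
## as `fortunelosing`). A minimal sketch of one possible variant follows; the name 
## `clean_text_spaced` and the tiny `DEMO_STOPWORDS` set are assumptions made for 
## illustration, not part of the notebook, which uses NLTK's English stopword list.

```python
import re

# Same character classes as the notebook's patterns, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]|@,;-]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Tiny stand-in stopword list (assumption); the notebook uses NLTK's list.
DEMO_STOPWORDS = {"and", "the", "a", "of", "have"}

def clean_text_spaced(text, stopwords=DEMO_STOPWORDS):
    """Hypothetical variant of clean_text that keeps word boundaries."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # Substitute a space (not an empty string) so "fortune.losing"
    # stays two tokens instead of becoming "fortunelosing".
    text = BAD_SYMBOLS_RE.sub(' ', text)
    # split() collapses the extra whitespace introduced above.
    return ' '.join(w for w in text.split() if w not in stopwords)

sample = "WINNING MEANS FAME AND FORTUNE.LOSING MEANS CERTAIN DEATH."
cleaned = clean_text_spaced(sample)
print(cleaned)  # winning means fame fortune losing means certain death
```

## The trade-off is that contractions are split as well (`sister's` becomes 
## `sister s`), so apostrophes may need separate handling.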

In [10]:
# Clean movie descriptions
movies_df['clean_description'] = movies_df['overview'].apply(clean_text)

In [11]:
movies_df.head()

Unnamed: 0,names,score,genre,overview,orig_title,atomic_genres,clean_description
0,Creed III,73.0,"Drama, Action","After dominating the boxing world, Adonis Cree...",Creed III,Drama,dominating boxing world adonis creed thriving ...
1,Avatar: The Way of Water,78.0,"Science Fiction, Adventure, Action",Set more than a decade after the events of the...,Avatar: The Way of Water,Science Fiction,set decade events first film learn story sully...
2,The Super Mario Bros. Movie,76.0,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main,...",The Super Mario Bros. Movie,Animation,working underground fix water main brooklyn pl...
3,Mummies,70.0,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three ...",Momias,Animation,series unfortunate events three mummies end pr...
4,Supercell,61.0,Action,Good-hearted teenager William always lived in ...,Supercell,Action,good hearted teenager william always lived hop...


In [10]:
# Compare original vs cleaned description for the first movie
movies_df.iloc[0]['overview'], movies_df.iloc[0]['clean_description']

('After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.',
 'dominating boxing world adonis creed thriving career family life childhood friend former boxing prodigy damien anderson resurfaces serving long sentence prison eager prove deserves shot ring face former friends fight settle score adonis must put future line battle damien fighter nothing lose')

In [12]:
# Clean book descriptions
books_df['clean_description'] = books_df['description'].apply(clean_text)

In [14]:
books_df.head()

Unnamed: 0,title,rating,description,language,isbn,genres,numRatings,clean_description
0,The Hunger Games,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780440000000.0,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",6376780,winning means fame fortunelosing means certain...
1,Harry Potter and the Order of the Phoenix,4.5,There is a door at the end of a silent corrido...,English,9780440000000.0,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",2507623,door end silent corridor haunting harry pottte...
2,To Kill a Mockingbird,4.28,The unforgettable novel of a childhood in a sl...,English,10000000000000.0,"['Classics', 'Fiction', 'Historical Fiction', ...",4501075,unforgettable novel childhood sleepy southern ...
3,Pride and Prejudice,4.26,Alternate cover edition of ISBN 9780679783268S...,English,10000000000000.0,"['Classics', 'Fiction', 'Romance', 'Historical...",2998241,alternate cover edition isbn 9780679783268sinc...
4,Twilight,3.6,About three things I was absolutely positive.\...,English,9780320000000.0,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",4964519,three things absolutely positivefirst edward v...


In [15]:
# Compare original vs cleaned description for the first book
books_df.iloc[0]['description'], books_df.iloc[0]['clean_description']

("WINNING MEANS FAME AND FORTUNE.LOSING MEANS CERTAIN DEATH.THE HUNGER GAMES HAVE BEGUN. . . .In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and once girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.Sixteen-year-old Katniss Everdeen regards it as a death sentence when she steps forward to take her sister's place in the Games. But Katniss has been close to dead before—and survival, for her, is second nature. Without really meaning to, she becomes a contender. But if she is to win, she will have to start making choices that weight survival against humanity and life against love.",
 'winning means fame fortunelosing means certain deaththe hunger games begun ruins place known north america lies nation panem shining capitol surrounded 

In [16]:
import ast
# Extract a single atomic genre from the book's genre list (parsed safely, without eval)
books_df['atomic_genre'] = books_df['genres'].apply(lambda x: (ast.literal_eval(x) or [None])[0])

In [17]:
books_df['atomic_genre'].head()

0    Young Adult
1        Fantasy
2       Classics
3       Classics
4    Young Adult
Name: atomic_genre, dtype: object
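
## The `genres` column stores each list as a string, so it has to be parsed before 
## the first element can be taken. The sketch below shows that step in isolation 
## with `ast.literal_eval`, which only accepts Python literals and never executes 
## code. The helper name `first_genre` is an assumption for illustration.

```python
import ast

def first_genre(genres_field):
    """Return the first genre from a stringified list, or None if it is empty."""
    genres = ast.literal_eval(genres_field)  # parses literals only
    return genres[0] if genres else None

print(first_genre("['Young Adult', 'Fiction', 'Dystopia']"))  # Young Adult
print(first_genre("[]"))                                      # None
```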

In [13]:
# Filter out books that have no genre assigned
books_classify_df = books_df.dropna()

In [14]:
books_classify_df.loc[0]['clean_description'], books_classify_df.loc[0]['atomic_genre']

('winning means fame fortunelosing means certain deaththe hunger games begun ruins place known north america lies nation panem shining capitol surrounded twelve outlying districts capitol harsh cruel keeps districts line forcing send one boy girl ages twelve eighteen participate annual hunger games fight death live tvsixteen year old katniss everdeen regards death sentence steps forward take sisters place games katniss close dead beforeand survival second nature without really meaning becomes contender win start making choices weight survival humanity life love',
 'Young Adult')

In [15]:
books_classify_df.head()

Unnamed: 0,title,rating,description,language,isbn,genres,numRatings,clean_description,atomic_genre
0,The Hunger Games,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780440000000.0,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",6376780,winning means fame fortunelosing means certain...,Young Adult
1,Harry Potter and the Order of the Phoenix,4.5,There is a door at the end of a silent corrido...,English,9780440000000.0,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",2507623,door end silent corridor haunting harry pottte...,Fantasy
2,To Kill a Mockingbird,4.28,The unforgettable novel of a childhood in a sl...,English,10000000000000.0,"['Classics', 'Fiction', 'Historical Fiction', ...",4501075,unforgettable novel childhood sleepy southern ...,Classics
3,Pride and Prejudice,4.26,Alternate cover edition of ISBN 9780679783268S...,English,10000000000000.0,"['Classics', 'Fiction', 'Romance', 'Historical...",2998241,alternate cover edition isbn 9780679783268sinc...,Classics
4,Twilight,3.6,About three things I was absolutely positive.\...,English,9780320000000.0,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",4964519,three things absolutely positivefirst edward v...,Young Adult


## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Mapping to Major Genres:**
## We load a JSON file that contains mappings of various sub-genres to a set of 
## 20 major genres. This normalization allows the model to generalize better.
## --------------------------------------------------------------------------------

## Training a Multi-Label Classifier Using Transformers

In [8]:
# Load the mapping from sub-genres to major genres
import json
with open('book_genres.json', 'r') as f:
    major_genres = json.load(f)
major_genres

[{'genre': 'Fiction',
  'includes': ['Young Adult',
   'Fantasy',
   'Classics',
   'Historical Fiction',
   'Science Fiction',
   'Horror',
   'Mystery',
   'Thriller',
   'Dystopian',
   'Contemporary',
   'Mythology',
   'Paranormal',
   'Coming of Age',
   'Epic',
   'Western',
   'Gothic',
   'Crime',
   'Fairy Tales',
   'Regency',
   'High Fantasy']},
 {'genre': 'Romance',
  'includes': ['Romance',
   'Contemporary Romance',
   'Historical Romance',
   'Romantic Suspense',
   'Young Adult Romance',
   'Science Fiction Romance',
   'M/M Romance',
   'F/F Romance',
   'Interracial Romance',
   'Regency Romance']},
 {'genre': 'Nonfiction',
  'includes': ['History',
   'Biography',
   'Memoir',
   'Travel',
   'Science',
   'Philosophy',
   'Religion',
   'True Crime',
   'Self Help',
   'Art',
   'Food',
   'Economics',
   'Psychology',
   'Business',
   'Education',
   'Nature',
   'Health',
   'Music',
   'Journalism']},
 {'genre': "Children's",
  'includes': ["Children's",
   'P

In [20]:
tf_books_df = books_df.copy()

In [21]:
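# Take explicit copies so that adding columns later does not trigger SettingWithCopyWarning
books_out_df = tf_books_df[['title', 'description', 'genres']].copy()
trans_books_df = tf_books_df[['clean_description', 'genres']].copy()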
trans_books_df.head()

Unnamed: 0,clean_description,genres
0,winning means fame fortunelosing means certain...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas..."
1,door end silent corridor haunting harry pottte...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',..."
2,unforgettable novel childhood sleepy southern ...,"['Classics', 'Fiction', 'Historical Fiction', ..."
3,alternate cover edition isbn 9780679783268sinc...,"['Classics', 'Fiction', 'Romance', 'Historical..."
4,three things absolutely positivefirst edward v...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire..."


In [9]:
# Collect the major genre names from the loaded JSON, skipping the last entry in the file
book_features = [x['genre'] for x in major_genres[:-1]]
book_features

['Fiction',
 'Romance',
 'Nonfiction',
 "Children's",
 'Young Adult',
 'Teen',
 'Mystery',
 'Crime',
 'Thriller',
 'Fantasy',
 'Science Fiction',
 'Horror',
 'Drama',
 'Poetry',
 'Art',
 'Humor',
 'Religion']

In [35]:
# Create a dictionary mapping each major genre to the set of sub-genres it includes
book_feature_dict = {
    item['genre']: item['includes'] for item in major_genres
}
book_feature_dict

{'Fiction': ['Young Adult',
  'Fantasy',
  'Classics',
  'Historical Fiction',
  'Science Fiction',
  'Horror',
  'Mystery',
  'Thriller',
  'Dystopian',
  'Contemporary',
  'Mythology',
  'Paranormal',
  'Coming of Age',
  'Epic',
  'Western',
  'Gothic',
  'Crime',
  'Fairy Tales',
  'Regency',
  'High Fantasy'],
 'Romance': ['Romance',
  'Contemporary Romance',
  'Historical Romance',
  'Romantic Suspense',
  'Young Adult Romance',
  'Science Fiction Romance',
  'M/M Romance',
  'F/F Romance',
  'Interracial Romance',
  'Regency Romance'],
 'Nonfiction': ['History',
  'Biography',
  'Memoir',
  'Travel',
  'Science',
  'Philosophy',
  'Religion',
  'True Crime',
  'Self Help',
  'Art',
  'Food',
  'Economics',
  'Psychology',
  'Business',
  'Education',
  'Nature',
  'Health',
  'Music',
  'Journalism'],
 "Children's": ["Children's",
  'Picture Books',
  'Graphic Novels',
  'Middle Grade'],
 'Young Adult': ['Young Adult', 'Dystopian', 'High School'],
 'Teen': ['Teen', 'Young Adult 

In [37]:
def map_feature(feature_list, target):
    # Convert lists to sets for efficient intersection operations
    sft = set(feature_list)
    mft = set(book_feature_dict[target])
    # If target genre is present or intersects with sub-genres, return 1 (True)
    if target in sft or len(sft.intersection(mft)):
        return 1
    return 0
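
## To make the mapping concrete, here is a small self-contained sketch that turns 
## one book's genre tags into a multi-hot vector. The two-genre `demo_mapping` and 
## the helper `to_multi_hot` are hypothetical; the real mapping comes from 
## `book_genres.json`, and this version of `map_feature` takes the mapping as an 
## argument instead of reading a global.

```python
# Hypothetical mapping (assumption); the notebook loads its own from JSON.
demo_mapping = {
    "Fantasy": ["Fantasy", "Magic", "High Fantasy"],
    "Horror": ["Horror", "Vampires", "Gothic"],
}
demo_features = list(demo_mapping)  # column order of the label vector

def map_feature(feature_list, target, mapping):
    """Return 1 if the target genre or any of its sub-genres is in the tags."""
    tags = set(feature_list)
    return int(target in tags or bool(tags & set(mapping[target])))

def to_multi_hot(feature_list, features, mapping):
    """Build one 0/1 flag per major genre, in the order given by `features`."""
    return [map_feature(feature_list, f, mapping) for f in features]

labels = to_multi_hot(["Young Adult", "Magic", "Vampires"], demo_features, demo_mapping)
print(labels)  # [1, 1]
```

## A book can match a major genre through any of its sub-genres, so a single tag 
## may switch on several flags when the sub-genre lists overlap.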

In [None]:
import ast
# Map each book's genres to binary features for each major genre (parsed safely, without eval)
for feature in book_features:
    trans_books_df[feature] = trans_books_df['genres'].map(lambda x: map_feature(ast.literal_eval(x), feature))
    books_out_df[feature] = books_out_df['genres'].map(lambda x: map_feature(ast.literal_eval(x), feature))

In [41]:
trans_books_df.head()


Unnamed: 0,clean_description,genres,Fiction,Romance,Nonfiction,Children's,Young Adult,Teen,Mystery,Crime,Thriller,Fantasy,Science Fiction,Horror,Drama,Poetry,Art,Humor,Religion
0,winning means fame fortunelosing means certain...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0
1,door end silent corridor haunting harry pottte...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0
2,unforgettable novel childhood sleepy southern ...,"['Classics', 'Fiction', 'Historical Fiction', ...",1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,alternate cover edition isbn 9780679783268sinc...,"['Classics', 'Fiction', 'Romance', 'Historical...",1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,three things absolutely positivefirst edward v...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",1,1,0,0,1,1,0,0,0,1,0,1,0,0,0,0,0


In [25]:
# Save processed book data with major genre flags
books_out_df.to_csv('books_out_df.csv', index=False)
trans_books_df.to_csv('trans_books_df.csv', index=False)

In [53]:
# Label flags for the first book (the columns after title, description, genres)
books_out_df.iloc[0][3:].values.tolist()

[1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Multi-label Classification Setup:**
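## We will train a RoBERTa model using the `simpletransformers` library. The model 
## predicts the major genres for each book from its 'clean_description'. Although 
## the mapping targets 20 major genres, the `book_features` list built above 
## contains 17 labels, so the model is trained with 17 outputs.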
## --------------------------------------------------------------------------------


In [None]:
from simpletransformers.classification import MultiLabelClassificationModel, MultiLabelClassificationArgs
import logging

In [10]:
trans_books_df = pd.read_csv("trans_books_df.csv")

# Create a new column 'labels' that holds a list of 0/1 indicating the presence of each major genre
trans_books_w_labels = trans_books_df.copy()
trans_books_w_labels['labels'] = trans_books_w_labels.apply(lambda x: [x[feature] for feature in book_features], axis=1)
trans_books_w_labels.head()

Unnamed: 0,clean_description,genres,Fiction,Romance,Nonfiction,Children's,Young Adult,Teen,Mystery,Crime,Thriller,Fantasy,Science Fiction,Horror,Drama,Poetry,Art,Humor,Religion,labels
0,winning means fame fortunelosing means certain...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,"[1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, ..."
1,door end silent corridor haunting harry pottte...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,"[1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
2,unforgettable novel childhood sleepy southern ...,"['Classics', 'Fiction', 'Historical Fiction', ...",1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,"[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,alternate cover edition isbn 9780679783268sinc...,"['Classics', 'Fiction', 'Romance', 'Historical...",1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,three things absolutely positivefirst edward v...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",1,1,0,0,1,1,0,0,0,1,0,1,0,0,0,0,0,"[1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, ..."


In [11]:
import math
# Split data into training and test sets by row order, without shuffling (70% train, 30% test)
train_size = math.floor(len(trans_books_w_labels) * 0.7)
books_t_train = trans_books_w_labels[:train_size].copy()[['clean_description', 'labels']]
books_t_test = trans_books_w_labels[train_size:].copy()[['clean_description', 'labels']]

In [14]:
len(books_t_train)

33366
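
## The split above slices the frame in its stored order. If the rows are ordered in 
## some meaningful way, a shuffled split may give a more representative test set. 
## The sketch below shows one way to do that with a fixed seed; `shuffled_split` 
## and its parameters are assumptions for illustration and are not what the 
## notebook ran.

```python
import math
import random

def shuffled_split(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices reproducibly, then cut at the train fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # private RNG, global state untouched
    cut = math.floor(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = [f"book_{i}" for i in range(10)]
train, test = shuffled_split(rows)
print(len(train), len(test))  # 7 3
```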

In [None]:
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Prepare dataframes for the model
train_df = books_t_train
train_df.columns = ["text", "labels"]

eval_df = books_t_test
eval_df.columns = ["text", "labels"]

## --------------------------------------------------------------------------------
## MARKDOWN CELL:
## **Model Configuration:**
## We define training arguments such as epochs, batch sizes, learning rate, and 
## sequence length. Then we initialize a RoBERTa-based MultiLabelClassificationModel.
## --------------------------------------------------------------------------------


In [28]:
# Optional model configuration
model_args = MultiLabelClassificationArgs(
    num_train_epochs=5,
    overwrite_output_dir=True,
    train_batch_size=16,
    eval_batch_size=16,
    evaluate_during_training_steps=2000,  # step interval (int); only used if evaluate_during_training=True
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    max_seq_length=128,
    optimizer="AdamW",
    save_steps=2000
    )

# Create a MultiLabelClassificationModel
model = MultiLabelClassificationModel(
    "roberta",
    "roberta-base",
    num_labels=len(book_features),
    args=model_args
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(
    eval_df
)


Some weights of RobertaForMultiLabelSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
67it [00:23,  2.88it/s]                        
Epochs 1/5. Running Loss:    0.1006: 100%|██████████| 4171/4171 [10:12<00:00,  6.81it/s]
Epochs 2/5. Running Loss:    0.2601: 100%|██████████| 4171/4171 [09:05<00:00,  7.65it/s]
Epochs 3/5. Running Loss:    0.0935: 100%|██████████| 4171/4171 [08:56<00:00,  7.78it/s]
Epochs 4/5. Running Loss:    0.0740: 100%|██████████| 4171/4171 [08:51<00:00,  7.85it/s]
Epochs 5/5. Running Loss:    0.2770: 100%|██████████| 4171/4171 [08:51<00:00,  7.85it/s]
Epoch 5 of 5: 100%|██████████| 5/5 [46:05<00:00, 553.05s/it]
29it [00:20,  1
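
## As a sanity check on the training log, the number of batches per epoch should be 
## the training-set size divided by the batch size, rounded up. The sketch below 
## does that arithmetic for the 33366 training rows. The logged 4171 iterations per 
## epoch match a batch size of 8 rather than the configured 16, which may mean the 
## logged run used a different batch size than the configuration shown. The helper 
## name `batches_per_epoch` is an assumption for illustration.

```python
import math

def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

train_rows = 33366
print(batches_per_epoch(train_rows, 16))  # 2086
print(batches_per_epoch(train_rows, 8))   # 4171
```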

In [29]:
result

{'LRAP': 0.9011010019326716, 'eval_loss': 0.2321822503581643}
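
## LRAP (label ranking average precision) measures how highly the true labels are 
## ranked among all labels for each sample; 1.0 means every true label outranks 
## every false one. The following is a minimal pure-Python sketch of the metric 
## for intuition only. The function name `lrap` and the handling of samples with 
## no true labels (counted as 1.0) are assumptions of this sketch, and the example 
## scores are made up rather than taken from the model.

```python
def lrap(y_true, y_score):
    """Label ranking average precision over a list of samples."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            total += 1.0  # assumption: nothing to rank counts as perfect
            continue
        sample = 0.0
        for j in relevant:
            # Rank of label j among all labels (1 = highest score).
            rank = sum(1 for s in scores if s >= scores[j])
            # How many true labels are ranked at or above label j.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            sample += hits / rank
        total += sample / len(relevant)
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
score = lrap(y_true, y_score)
print(round(score, 4))  # 0.4167
```

## In the first sample the true label is ranked second (1/2); in the second it is 
## ranked last of three (1/3); the mean of the two is about 0.4167.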

## --------------------------------------------------------------------------------
## **Testing the Model:**
## We try out the model on a test description (for instance, from a movie or a book) 
## to see which major genres it predicts.
## --------------------------------------------------------------------------------


In [30]:
test_desc = "John Form has found the perfect gift for his expectant wife, Mia - a beautiful, rare vintage doll in a pure white wedding dress. But Mia's delight with Annabelle doesn't last long. On one horrific night, their home is invaded by members of a satanic cult, who violently attack the couple. Spilled blood and terror are not all they leave behind. The cultists have conjured an entity so malevolent that nothing they did will compare to the sinister conduit to the damned that is now... Annabelle"
# Clean the test description
test_desc = clean_text(test_desc)

In [31]:
prediction, raw_outputs = model.predict([test_desc])

1it [00:05,  5.72s/it]
100%|██████████| 1/1 [00:00<00:00,  9.60it/s]


In [32]:
# Show which major genres were predicted
[feature for i, feature in enumerate(book_features) if prediction[0][i]]

['Fiction', 'Mystery', 'Crime', 'Thriller', 'Fantasy', 'Horror']
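
## The 0/1 predictions come from thresholding per-label scores. Assuming 
## `raw_outputs` holds one probability per label, the sketch below shows how a 
## different cut-off changes the predicted genre list. The helper 
## `genres_above_threshold`, the three-genre list, and the scores are made up for 
## illustration and are not the model's actual output.

```python
def genres_above_threshold(probabilities, features, threshold=0.5):
    """Return the genre names whose score reaches the threshold."""
    return [name for name, p in zip(features, probabilities) if p >= threshold]

demo_features = ["Fiction", "Romance", "Horror"]
demo_scores = [0.97, 0.08, 0.81]  # hypothetical per-label probabilities

print(genres_above_threshold(demo_scores, demo_features))       # ['Fiction', 'Horror']
print(genres_above_threshold(demo_scores, demo_features, 0.9))  # ['Fiction']
```

## Raising the threshold trades recall for precision, which may help when broad 
## genres such as Fiction are predicted for almost every description.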

In [None]:
raw_outputs

In [34]:
# Save the trained model for future inference
import pickle
with open('model-v2.pkl', 'wb') as f:
    pickle.dump(model, f)

In [35]:
# Save the transformed dataset with labels
trans_books_w_labels.to_csv('trans_books_w_labels.csv', index=False)
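
## One thing to keep in mind when reloading this file: a CSV has no list type, so 
## the `labels` column comes back as a string such as `"[1, 0, 0, 1]"` and needs 
## to be parsed before it can be used for training again. The sketch below shows 
## the round trip with the standard `csv` module standing in for pandas; the 
## sample row is made up for illustration.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["clean_description", "labels"])
writer.writerow(["door end silent corridor", [1, 0, 0, 1]])  # stored via str()

# Read it back: the list is now plain text.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["labels"]
print(type(raw).__name__, raw)  # str [1, 0, 0, 1]

# Parse the text back into a list of ints.
parsed = ast.literal_eval(raw)
print(parsed)  # [1, 0, 0, 1]
```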