<a href="https://www.kaggle.com/code/mohammedmohsen0404/proj32-nlp-cmu-books-recommendation-system?scriptVersionId=196402836" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
**<center><h1>CMU Books Recommendation System</h1></center>**
<center><h3>Learning ML, DL through 100 Practical Projects</h3></center>

---

This project aims to analyze and compare plot summaries from the CMU Book Summary Dataset, which includes summaries for 16,559 books extracted from Wikipedia. By leveraging the TF-IDF (Term Frequency-Inverse Document Frequency) model to represent the text, we can calculate the cosine similarity between different book summaries. This approach enables us to identify similarities and patterns across various genres and authors, offering insights into the thematic and stylistic connections between different literary works.

# **Import Libraries and Data**
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import re
import html
import unicodedata
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


In [3]:
#fix nltk kaggle problem

import nltk
import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except:
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

# Now you can import the NLTK resources as usual
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /kaggle/working/...
Archive:  /kaggle/working/corpora/wordnet.zip
   creating: /kaggle/working/corpora/wordnet/
  inflating: /kaggle/working/corpora/wordnet/lexnames  
  inflating: /kaggle/working/corpora/wordnet/data.verb  
  inflating: /kaggle/working/corpora/wordnet/index.adv  
  inflating: /kaggle/working/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/corpora/wordnet/index.verb  
  inflating: /kaggle/working/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/corpora/wordnet/data.adj  
  inflating: /kaggle/working/corpora/wordnet/index.adj  
  inflating: /kaggle/working/corpora/wordnet/LICENSE  
  inflating: /kaggle/working/corpora/wordnet/citation.bib  
  inflating: /kaggle/working/corpora/wordnet/noun.exc  
  inflating: /kaggle/working/corpora/wordnet/verb.exc  
  inflating: /kaggle/working/corpora/wordnet/README  
  inflating: /kaggle/working/corpora/wordnet/index.sense  
  inflating: /kaggle/working/corpora/wordnet/data.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report , f1_score, confusion_matrix

In [5]:
! pip install kaggle
! mkdir ~/.kaggle
! cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d ymaricar/cmu-book-summary-dataset
!unzip cmu-book-summary-dataset.zip

cp: cannot stat '/content/drive/MyDrive/kaggle/kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset
License(s): CC-BY-SA-3.0
Downloading cmu-book-summary-dataset.zip to /kaggle/working
 62%|███████████████████████▍              | 10.0M/16.2M [00:00<00:00, 36.7MB/s]
100%|██████████████████████████████████████| 16.2M/16.2M [00:00<00:00, 52.0MB/s]
Archive:  cmu-book-summary-dataset.zip
  inflating: booksummaries.txt       


# **Converting Data into a Dataframe**
---

In [6]:
import nltk
import json
import re
import csv
from tqdm import tqdm
pd.set_option('display.max_colwidth', 300)

In [7]:
data = []

with open("booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in tqdm(reader):
        data.append(row)

16559it [00:00, 18407.67it/s]


In [8]:
book_index = []
book_id = []
book_author = []
book_name = []
summary = []
genre = []
a = 1
for i in tqdm(data):
    book_index.append(a)
    a = a+1
    book_id.append(i[0])
    book_name.append(i[2])
    book_author.append(i[3])
    genre.append(i[5])
    summary.append(i[6])

df = pd.DataFrame({'Index': book_index, 'ID': book_id, 'BookTitle': book_name, 'Author': book_author,
                       'Genre': genre, 'Summary': summary}).copy()

100%|██████████| 16559/16559 [00:00<00:00, 419470.93it/s]


# **Take a look at the data**
---

In [9]:
df.head()

Unnamed: 0,Index,ID,BookTitle,Author,Genre,Summary
0,1,620,Animal Farm,George Orwell,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"": ""Satire"", ""/m/0dwly"": ""Children's literature"", ""/m/014dfn"": ""Speculative fiction"", ""/m/02xlf"": ""Fiction""}","Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p..."
1,2,843,A Clockwork Orange,Anthony Burgess,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""Novella"", ""/m/014dfn"": ""Speculative fiction"", ""/m/0c082"": ""Utopian and dystopian fiction"", ""/m/06nbt"": ""Satire"", ""/m/02xlf"": ""Fiction""}","Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ""ultra-violence."" Alex's friends (""droogs"" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and..."
2,3,986,The Plague,Albert Camus,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fiction"", ""/m/0pym5"": ""Absurdist fiction"", ""/m/05hgj"": ""Novel""}","The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order t..."
3,4,1756,An Enquiry Concerning Human Understanding,David Hume,,"The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his principles to specific topics. In the first section of the Enquiry, Hume provides a rough introdu..."
4,5,2080,A Fire Upon the Deep,Vernor Vinge,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90"": ""Science Fiction"", ""/m/014dfn"": ""Speculative fiction"", ""/m/01hmnh"": ""Fantasy"", ""/m/02xlf"": ""Fiction""}","The novel posits that space around the Milky Way is divided into concentric layers called Zones, each being constrained by different laws of physics and each allowing for different degrees of biological and technological advancement. The innermost, the ""Unthinking Depths"", surrounds the galacti..."


In [10]:
df.describe()

Unnamed: 0,Index
count,16559.0
mean,8280.0
std,4780.315889
min,1.0
25%,4140.5
50%,8280.0
75%,12419.5
max,16559.0


In [11]:
df.isnull().sum()

Index        0
ID           0
BookTitle    0
Author       0
Genre        0
Summary      0
dtype: int64

In [12]:
df['Summary'][0]

' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. N

In [13]:
df['Genre'][0]

'{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children\'s literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}'

# **Data Preprocessing**
--------

In [14]:
# Define emoji removal function
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def convert_emojis(text):
    for emot, emoji in UNICODE_EMO.items():
        text = re.sub(r'('+re.escape(emoji)+')', "_".join(emot.replace(",", "").replace(":", "").split()), text)
    return text

# Define URL removal function
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def remove_special_chars(text):
    re1 = re.compile(r'  +')
    x1 = text.lower().replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x1))

def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_non_ascii(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def replace_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_whitespaces(text):
    return text.strip()

def text2words(text):
    return word_tokenize(text)

def remove_stopwords(words, stop_words):
    return [word for word in words if word not in stop_words]

def remove_frequent_words(words, frequent_words):
    return [word for word in words if word not in frequent_words]

def stem_words(words):
    """Stem words in text"""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]

def lemmatize_words(words):
    """Lemmatize words in text"""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

def lemmatize_verbs(words):
    """Lemmatize verbs in text"""
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word, pos='v') for word in words])

def compute_frequent_words(texts, n=10):
    """Compute the most frequent words in a list of texts"""
    cnt = Counter()
    for text in texts:
        for word in text.split():
            cnt[word] += 1
    return set(word for word, _ in cnt.most_common(n))

def normalize_text(text, frequent_words, stop_words):
    """Normalize text by performing various preprocessing steps"""
    text = remove_emoji(text)
    #text = convert_emojis(text)
    text = remove_urls(text)
    text = remove_special_chars(text)
    text = remove_html_tags(text)
    text = remove_non_ascii(text)
    text = to_lowercase(text)
    text = remove_punctuation(text)
    text = replace_numbers(text)
    text = remove_whitespaces(text)
    words = text2words(text)
    words = remove_stopwords(words, stop_words)
    words = remove_frequent_words(words, frequent_words)
    words = lemmatize_words(words)
    #words = stem_words(words)
    text = ' '.join(words)
    #text = correct_spellings(text)
    return text

texts = df["Summary"].values
frequent_words = compute_frequent_words(texts)

# Apply normalization function to the DataFrame
df["Summary"] = df["Summary"].apply(lambda x: normalize_text(x, frequent_words, stop_words))


**json**

In [15]:
df['Genre'] = df['Genre'].str.lower()
df['Genre'] = df['Genre'].apply(lambda x: re.sub('(<.*?>)', ' ', x))

In [16]:
import json

# Step 1: Drop rows where the genre column is empty
df.drop(df[df['Genre'] == ''].index, inplace=True)

# Step 2: Function to parse JSON and extract genre values
def extract_genres(genre_str):
    # Parse the JSON string and extract the values (genre names)
    genre_dict = json.loads(genre_str)
    return list(genre_dict.values())

# Step 3: Apply the function to the 'genre' column and create a new 'genre_new' column
df['Genre_new'] = df['Genre'].apply(extract_genres)

# Optionally, inspect the DataFrame to ensure everything is correct
print(df[['Genre', 'Genre_new']].head())

                                                                                                                                                                                  Genre  \
0                              {"/m/016lj8": "roman \u00e0 clef", "/m/06nbt": "satire", "/m/0dwly": "children's literature", "/m/014dfn": "speculative fiction", "/m/02xlf": "fiction"}   
1  {"/m/06n90": "science fiction", "/m/0l67h": "novella", "/m/014dfn": "speculative fiction", "/m/0c082": "utopian and dystopian fiction", "/m/06nbt": "satire", "/m/02xlf": "fiction"}   
2                                                                           {"/m/02m4t": "existentialism", "/m/02xlf": "fiction", "/m/0pym5": "absurdist fiction", "/m/05hgj": "novel"}   
4                                {"/m/03lrw": "hard science fiction", "/m/06n90": "science fiction", "/m/014dfn": "speculative fiction", "/m/01hmnh": "fantasy", "/m/02xlf": "fiction"}   
5                                                                

In [17]:
df.drop('Genre', axis=1, inplace=True)
df.rename(columns={'Genre_new': 'Genre'}, inplace=True)

# **Text preparation**
----------

**Combine Text**

In [18]:
df['GenreString'] = df['Genre'].apply(lambda x: ' '.join(x))
df["combined_text"] = df["Summary"] + " " + df["Author"] + " " + df["GenreString"]

In [19]:
df["combined_text"][0]

"old major old boar manor farm call animal farm meeting compare human parasite teach animal revolutionary song beast england major dy two young pig snowball napoleon assume command turn dream philosophy animal revolt drive drunken irresponsible mr jones farm renaming animal farm adopt seven commandment animalism important animal equal snowball attempt teach animal reading writing food plentiful farm run smoothly pig elevate position leadership set aside special food item ostensibly personal health napoleon take pup farm dog train privately napoleon snowball struggle leadership snowball announces plan build windmill napoleon dog chase snowball away declares leader napoleon enacts change governance structure farm replacing meeting committee pig run farm using young pig named squealer mouthpiece napoleon claim credit windmill idea animal work harder promise easier life windmill violent storm animal find windmill annihilated napoleon squealer convince animal snowball destroyed although sco

**TF-IDF**

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
tf = TfidfVectorizer(analyzer = "word", ngram_range=(1,2), min_df=1, stop_words='english') # min_df should be at least 1

tfidf_matrix = tf.fit_transform(df['combined_text'])

**Cosine Similarty**

In [22]:
cosine =  cosine_similarity(tfidf_matrix, tfidf_matrix)

# **Recommendation System**
-------

In [23]:
# Create a series to map book titles to their index
indices = pd.Series(df.index, index=df['BookTitle']).drop_duplicates()

def get_recommendations(title, cosine_sim=cosine):
    # Get the index of the book that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar books
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar books
    return df['BookTitle'].iloc[book_indices]


In [24]:
print(get_recommendations("Dune"))

7950                    The Road
1582             The Giving Tree
7856         Children of Orpheus
16494           The Summer Birds
3229       The Phantom Freighter
1920           The Polar Express
8588     The Wreck of the Zephyr
1475               The Borrowers
10329      The Book of Dead Days
8052            Arrow to the Sun
Name: BookTitle, dtype: object
