### 1.0 Data Understanding

The Dataset used was from Kaggle,(https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset?select=Books+Data+with+Category+Language+and+Summary) extracted in csv form and excel form (both in text form) and containing important information with regards to Book Summary and Category which were the primary basis for perfoming Natural Language Processing, inorder to recommend similar books. The Dataset contains 1,031,175 books. I used the Csv (Comma Separated Values) form.

       Importing Libraries for NLP

In [1]:
import nltk   # the natural language toolkit for natural language processing tasks inthis project
import pandas as pd  

from nltk.sentiment import SentimentIntensityAnalyzer # for sentiment analysis
from nltk.tokenize import word_tokenize, blankline_tokenize # for word tokenization and blank lines tokenization
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
import re # regular expression module for pattern matching
import warnings
warnings.filterwarnings("ignore")


nltk.download("vader_lexicon") # Valence Aware Dictionary and sEntiment Reasoner Lexicon for sentiment analysis
nltk.download("punkt")  # downloading punkt models for tokenization
nltk.download("wordnet") # downloading wordnet, a lexical database for Lemmatization
nltk.download("stopwords") # downloading common stop words in English such as "the", "and" etc which I will remove as they carry no meaning in NLP
nltk.download("omw-1.4") # downloading Open Multilingual Wordnet dataset for lemmatization



[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:

file_path = r"C:\Users\user\Desktop\book_summary_dataset.csv"
# The "r" string prefix notation i used when opening this file was to ensure "\" was not intepreted as mode but as the normal symbol


books = pd.read_csv(file_path)

books.tail()

Unnamed: 0,rating,book_title,book_author,publisher,Summary,Category
1031170,0,As Hogan Said . . . : The 389 Best Things Anyo...,Randy Voorhees,Simon & Schuster,Golf lovers will revel in this collection of t...,['Humor']
1031171,5,All Elevations Unknown: An Adventure in the He...,Sam Lightner,Broadway Books,A daring twist on the travel-adventure genre t...,['Nature']
1031172,7,Why stop?: A guide to Texas historical roadsid...,Claude Dooley,Lone Star Books,9,9
1031173,7,The Are You Being Served? Stories: 'Camping In...,Jeremy Lloyd,Kqed Books,These hilarious stories by the creator of publ...,['Fiction']
1031174,10,"Dallas Street Map Guide and Directory, 2000 Ed...",Mapsco,American Map Corporation,9,9


### 2.0 Data Pre-processing

       Getting the summary of the Dataframe

In [3]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1031175 entries, 0 to 1031174
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   rating       1031175 non-null  int64 
 1   book_title   1031175 non-null  object
 2   book_author  1031175 non-null  object
 3   publisher    1031175 non-null  object
 4   Summary      1031175 non-null  object
 5   Category     1031175 non-null  object
dtypes: int64(1), object(5)
memory usage: 47.2+ MB


            Getting the number of rows and columns

In [4]:
books.shape

# There are 1,031,175 rows and 6 columns

(1031175, 6)

          Checking for duplicates

In [5]:
books.duplicated().value_counts()

# There are 589421 duplicated values which is a huge number 

True     589421
False    441754
dtype: int64

        Dropping duplicates
Some books were recorded multiple times as in the case from index 1 to index 4 which were recorded four times or more. Therefore; this will lead to reducing the number of books 

In [6]:

book_df = books.drop_duplicates(subset=["book_title", "book_author", "publisher"]) #books containing same info on book title, book author and publisher, to be removed

book_df.head(5)

Unnamed: 0,rating,book_title,book_author,publisher,Summary,Category
0,0,Classical Mythology,Mark P. O. Morford,Oxford University Press,Provides an introduction to classical myths pl...,['Social Science']
1,5,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,"In a small town in Canada, Clara Callan reluct...",['Actresses']
15,0,Decision in Normandy,Carlo D'Este,HarperPerennial,"Here, for the first time in paperback, is an o...",['1940-1949']
18,0,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,Farrar Straus Giroux,"Describes the great flu epidemic of 1918, an o...",['Medical']
29,0,The Mummies of Urumchi,E. J. W. Barber,W. W. Norton & Company,A look at the incredibly well-preserved ancien...,['Design']


In [7]:
book_df.info() #from below the dataframe now has 264,984 books

<class 'pandas.core.frame.DataFrame'>
Int64Index: 264984 entries, 0 to 1031174
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   rating       264984 non-null  int64 
 1   book_title   264984 non-null  object
 2   book_author  264984 non-null  object
 3   publisher    264984 non-null  object
 4   Summary      264984 non-null  object
 5   Category     264984 non-null  object
dtypes: int64(1), object(5)
memory usage: 14.2+ MB


           Checking for missing values in each colmn

In [8]:
book_df.isna().sum()

rating         0
book_title     0
book_author    0
publisher      0
Summary        0
Category       0
dtype: int64

no missing values

           Removing columns that have less use in this analysis

In [9]:
book_df.drop(columns=["rating", "publisher"], inplace=True)


In [10]:
# Converting Category from a list of string to string; by removing square brackets

book_df['Category'] = book_df['Category'].apply(lambda x: re.sub(r'\[|\]|\'|\s', ' ', x))

book_df.head(5)

Unnamed: 0,book_title,book_author,Summary,Category
0,Classical Mythology,Mark P. O. Morford,Provides an introduction to classical myths pl...,Social Science
1,Clara Callan,Richard Bruce Wright,"In a small town in Canada, Clara Callan reluct...",Actresses
15,Decision in Normandy,Carlo D'Este,"Here, for the first time in paperback, is an o...",1940-1949
18,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,"Describes the great flu epidemic of 1918, an o...",Medical
29,The Mummies of Urumchi,E. J. W. Barber,A look at the incredibly well-preserved ancien...,Design


In [11]:
book_slicing = book_df.iloc[0:27000, 0:4] #Reducing the df size to 27,000 books to reduce memory therefore avoiding computational complexities ahead

book_slicing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27000 entries, 0 to 461621
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   book_title   27000 non-null  object
 1   book_author  27000 non-null  object
 2   Summary      27000 non-null  object
 3   Category     27000 non-null  object
dtypes: object(4)
memory usage: 1.0+ MB


    Using the preprocess_text Function for nltk preprocessing 

In [12]:
 # Initializing the Stopwords,Lemmatizer and analyzer

stop_words=set(stopwords.words("english"))
 
lemmatizer=WordNetLemmatizer()

sia=SentimentIntensityAnalyzer()

def preprocess_text(text):
    text=re.sub(r"[^a-zA-Z]", " ", text)  # Removing special characters/non-alphabetical characters from the text.
    tokens = text.split()      #splitting into tokens
    tokens=word_tokenize(text.lower())  # tokenizing the words and converting to lower case
    tokens=[t for t in tokens if t not in stop_words] # removing stop words
    tokens= [lemmatizer.lemmatize(t) for t in tokens] #Lemmatizing the tokens
    return " ".join(tokens)

def preprocess_df(df_name, column):
    if df_name[column].dtype == "object": #"object" data type is used as a container for various non-numeric types of data. 
        df_name[column] = df_name[column].apply(lambda x: preprocess_text(str(x)))
    return df_name


new_book_df= preprocess_df(book_slicing, "book_title")
new_book_df= preprocess_df(book_slicing,"book_author")
new_book_df=preprocess_df(book_slicing, "Summary")
new_book_df=preprocess_df(book_slicing,"Category")

new_book_df.head()


# summary and category had numerical values which were converted to space and hence no missing values

Unnamed: 0,book_title,book_author,Summary,Category
0,classical mythology,mark p morford,provides introduction classical myth placing a...,social science
1,clara callan,richard bruce wright,small town canada clara callan reluctantly tak...,actress
15,decision normandy,carlo este,first time paperback outstanding military hist...,
18,flu story great influenza pandemic search viru...,gina bari kolata,describes great flu epidemic outbreak killed f...,medical
29,mummy urumchi,e j w barber,look incredibly well preserved ancient mummy f...,design


         Creating a new column called "book_info" that contains Summary and Book Category

In [13]:

new_book_df["book_info"]=new_book_df["Summary"] + " " + new_book_df["Category"]


new_book_df.drop(columns=["Summary", "Category"], inplace=True)

new_book_df["book_info"][0] # preview the first book_info row


'provides introduction classical myth placing addressed topic within historical context discussion archaeological evidence support mythical event theme portrayed literature art social science'

### 3.0 Feature Engineering

I) Data Vectorization

Text data in the "new_book_df" dataframe is transformed into numerical vectors to make it suitable for mathematical
operations.
Techniques I used for vectorization are TF-IDF 
(Term Frequency-Inverse Document Frequency) Vectorization


       Term Frequency (TF)

This will measure the frequency of terms (tokens/words). Words that appear frequently within single document but rare acrosss the corpus are assigned higher scores. this will help to capture the uniqueness of terms 

      Inverse Document Frequency (IDF)

IDF measures the importance of a term across a collection of documents (corpus). IDF has an inverse relationship with the number of documents containing the term. If a term appears in many documents (high document frequency), its IDF value will be lower because it is considered less important for distinguishing documents. The primary purpose of IDF is to identify terms that are discriminative or rare in the corpus, giving them higher weights.

      TF-IDF Score

The TF-IDF score for a term in a document combines both term frequency and inverse document frequency.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


tf = TfidfVectorizer(analyzer="word", ngram_range=(1,2), min_df=0, stop_words="english") #analyzer will use words/tokens, ngram_range=(1,2) meaning both unigrams(single word) and 
                                                                                     # bigrams (2 adjacent words) should be considered. minimum_document frequency=0 meaning words that appear only in a single corpus to be also considered.
# Converting the vectors into Tf_Idf Matrix
tfidf_matrix = tf.fit_transform(new_book_df["book_info"]) #if Double brackets to show its a list of columns
print(tfidf_matrix)

  (0, 189685)	0.10576753329233111
  (0, 10475)	0.17596808362257044
  (0, 119929)	0.1830288257081547
  (0, 156479)	0.1830288257081547
  (0, 205149)	0.1830288257081547
  (0, 68145)	0.1830288257081547
  (0, 137910)	0.1830288257081547
  (0, 199746)	0.1830288257081547
  (0, 68341)	0.17596808362257044
  (0, 9409)	0.1830288257081547
  (0, 56396)	0.1830288257081547
  (0, 41538)	0.1830288257081547
  (0, 93958)	0.17596808362257044
  (0, 208195)	0.1830288257081547
  (0, 2265)	0.1830288257081547
  (0, 153765)	0.1830288257081547
  (0, 137887)	0.1830288257081547
  (0, 35593)	0.1830288257081547
  (0, 104259)	0.1830288257081547
  (0, 161707)	0.1830288257081547
  (0, 180565)	0.0902495489233351
  (0, 189613)	0.09648483161322553
  (0, 10362)	0.09448759558190829
  (0, 119926)	0.11308780227763199
  (0, 156475)	0.1550021822924483
  :	:
  (26997, 146122)	0.25852378884240157
  (26997, 58537)	0.24855066144598337
  (26997, 164451)	0.24855066144598337
  (26997, 13746)	0.24147461494813813
  (26997, 184346)	0.2277

II) Dimensionality Reduction

Since book data is high dimensional, It will be necessary for me to reduce the number of features to avoid computational complexities when getting the cosine similarity.Important information will be preserved. Otherwise; attempting without reducing the number of documents will return "Memory Error"

    Latent Semantic Analysis
LSA for analysing r/ship between words and undelying patterns of word co-occurence in the documents, while retrieving important information.
I used Truncated Singular Value Decomposition (which is the mathematical basis for LSA)

In [15]:
from sklearn.decomposition import TruncatedSVD
svd=TruncatedSVD(n_components= 100, random_state=42)
tfidf_svd=svd.fit_transform(tfidf_matrix)





III) Cosine Similarity:

After vectorization, you have numerical representations of text documents. These vectors can be considered as points in a high-dimensional space.
Cosine similarity measures the cosine of the angle between two vectors. The cosine of 0 degrees is 1, and the cosine of 90 degrees is 0. In other words, if two vectors have a cosine similarity of 1, they point in the same direction (high similarity), and if the cosine similarity is 0, they are orthogonal (no similarity).
In the context of NLP, cosine similarity calculates how similar two text documents are based on their vector representations. If the vectors point in a similar direction, they have a higher cosine similarity, indicating that the documents are more similar.

In [16]:
cos_similarity=cosine_similarity(tfidf_svd, tfidf_svd)

IV) Converting Book Title column to Series "indices" 


In [17]:
indices=pd.Series(new_book_df["book_title"]) 
indices[1900:1907]

91968                                        hallowed grnd
91969    woman war essential voice nuclear age brillian...
91971                                        lonesome dove
92045                                       bitter harvest
92114                       sex art american culture essay
92132            little woman treasury illustrated classic
92135    swiss family robinson treasury illustrated cla...
Name: book_title, dtype: object

### 4.0 Book Recommender System
Takes 2 params; book_title (which I want to find recommendations) and cosine similarity score for books

In [18]:
def recommend(book_title, cos_similarity=cos_similarity):
    if book_title not in indices.values:
        return "Book title does not exist in my Book Catalogue. Please contact Isaac for the book to be added."
    recommended_books = []
    idx = indices[indices == book_title].index[0]   # to get the index of book title matching the input
    score_series = pd.Series(cos_similarity[idx]).sort_values(ascending = False)   # similarity scores in descending order
    top_5_indices = list(score_series.iloc[1:6].index)   # to get the indices of top 5 most similar books
    # [1:6] to exclude 0 (index 0 is the input book itself)
    
    for i in top_5_indices:   # to append the titles of top 5 similar books
        recommended_books.append(list(book_df["book_title"])[i])
        
    return recommended_books
    

Now lets try to Recommend 5 other similar books, by giving the Book Title as Input 

In [19]:
recommend("mummy urumchi")

['Something Special: A Story',
 'The Doors of Perception and Heaven and Hell',
 'Island',
 'Der Kleine Prinz, Mit Zeichnungen Des Verfassers (Harbrace Paperbound Library)',
 'Blood Shot (V.I. Warshawski Novels (Paperback))']

In [20]:
recommend("clara callan")

['Kaaterskill Falls',
 'Too Close to the Falls: A Memoir',
 'Sleepwalk',
 'LAKE NEWS : A Novel',
 'Extra Virgin: A Young Woman Discovers the Italian Riviera, Where Every Month Is Enchanted']

Let me try recommending a book that I'm pretty sure is not in my catalogue 

In [21]:
recommend("mstahiki meya")

'Book title does not exist in my Book Catalogue. Please contact Isaac for the book to be added.'

In [22]:
recommend("classical mythology")

['Gay Ideas: Outing and Other Controversies',
 'Walden Two (Trade Book)',
 'The Community in America',
 'A Certain People: American Jews and Their Lives Today',
 'Media Virus!: Hidden Agendas in Popular Culture']

Thats all for now