## 2.1 Preparing Data

In [2]:
from langchain_community.document_loaders import TextLoader     # loads a text file as LangChain Document objects
from langchain_text_splitters import CharacterTextSplitter      # splits the document into chunks, here one chunk per description
from langchain_openai import OpenAIEmbeddings                   # creates vector embeddings of the description chunks
from langchain_chroma import Chroma                             # stores the embeddings in a vector database
import csv          # provides csv.QUOTE_NONE for to_csv, which stops descriptions containing " from being wrapped in extra quotes

In [3]:
from dotenv import load_dotenv
load_dotenv()

True
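
The `True` above does not by itself confirm that the key needed for the embeddings step is present. A minimal sketch of an early check, assuming the key is stored under `OPENAI_API_KEY` (the variable `OpenAIEmbeddings` reads by default); `has_api_key` is a hypothetical helper, not part of any library.

```python
import os

def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `name` is set to a non-empty value in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())

# Warn early instead of failing later when the embeddings are created.
if not has_api_key():
    print("OPENAI_API_KEY is not set; check the .env file before creating embeddings")
```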

In [4]:
import pandas as pd
# load the cleaned dataset
books = pd.read_csv('books_cleaned.csv')

In [5]:
books["tagged_description"]

0       9780002005883 A NOVEL THAT READERS and critics...
1       9780002261982 A new 'Christie for Christmas' -...
2       9780006178736 A memorable, mesmerizing heroine...
3       9780006280897 Lewis' work on the nature of lov...
4       9780006280934 "In The Problem of Pain, C.S. Le...
                              ...                        
5192    9788172235222 On A Train Journey Home To North...
5193    9788173031014 This book tells the tale of a ma...
5194    9788179921623 Wisdom to Create a Life of Passi...
5195    9788185300535 This collection of the timeless ...
5196    9789027712059 Since the three volume edition o...
Name: tagged_description, Length: 5197, dtype: object

In [6]:
# TextLoader reads from a file, so write the tagged descriptions out as plain text, one description per line
books["tagged_description"].to_csv("tagged_description.txt",
                                   sep="\n",
                                   index=False,
                                   header=False,
                                   quoting=csv.QUOTE_NONE)
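
The export above relies on each description occupying exactly one line of the text file. A minimal sketch of the same idea in plain Python, assuming any line breaks inside a description should be collapsed to spaces (whether the cleaned dataset contains such line breaks is not shown here). `write_one_per_line` is a hypothetical helper and the sample ISBNs are made up.

```python
import os
import tempfile

def write_one_per_line(texts, path):
    """Write each text on its own line, collapsing internal whitespace runs (including newlines)."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(" ".join(str(text).split()) + "\n")

# Made-up sample rows: one with double quotes, one with an embedded line break.
sample = ['9780000000001 A "quoted" description', "9780000000002 Two\nline description"]
path = os.path.join(tempfile.mkdtemp(), "tagged_description_sample.txt")
write_one_per_line(sample, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Writing the lines directly leaves double quotes untouched, since no CSV quoting is involved.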

In [7]:
raw_documents = TextLoader("tagged_description.txt", encoding="UTF-8").load()   # load the text file; the encoding is set explicitly to avoid a UnicodeDecodeError
text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0, separator="\n")  # chunk_size=0 forces a split at every separator, so the chunk-length warnings below are expected
documents = text_splitter.split_documents(raw_documents)    # turn the single loaded document into a list with one Document per description

Created a chunk of size 1168, which is longer than the specified 0
Created a chunk of size 1214, which is longer than the specified 0
Created a chunk of size 373, which is longer than the specified 0
Created a chunk of size 309, which is longer than the specified 0
Created a chunk of size 479, which is longer than the specified 0
Created a chunk of size 482, which is longer than the specified 0
Created a chunk of size 960, which is longer than the specified 0
Created a chunk of size 188, which is longer than the specified 0
Created a chunk of size 843, which is longer than the specified 0
Created a chunk of size 284, which is longer than the specified 0
Created a chunk of size 193, which is longer than the specified 0
Created a chunk of size 877, which is longer than the specified 0
Created a chunk of size 1088, which is longer than the specified 0
Created a chunk of size 1189, which is longer than the specified 0
Created a chunk of size 304, which is longer than the specified 0
Create

In [8]:
display(documents[0])
display(documents[1589])
# verify that the text file was loaded and split correctly

Document(metadata={'source': 'tagged_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gi

Document(metadata={'source': 'tagged_description.txt'}, page_content="9780349113463 THE TIPPING POINT is the biography of an idea, and the idea is quite simple. It is that many of the problems we face - from crime to teenage delinquency to traffic jams - behave like epidemics. They aren't linear phenomena in the sense that they steadily and predictably change according to the level of effort brought to bear against them. They are capable of sudden and dramatic changes in direction. Years of well-intentioned intervention may have no impact at all, yet the right intervention - at just the right time - can start a cascade of change. Many of the social ills that face us today, in other words, are as inherently volatile as the epidemics that periodically sweep through the human population: little things can cause them to 'tip' at any time and if we want to understand how to confront and solve them we have to understand what those 'Tipping Points' are. In this revolutionary new study, Malcol
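
Spot-checking two documents can be complemented by a check over every chunk. The sketch below approximates the split in plain Python and tests that each chunk begins with a 13-digit ISBN. It is a simplified illustration, not `CharacterTextSplitter`'s implementation; both helper names are hypothetical and the sample text is made up.

```python
def split_on_newlines(text):
    """Split text on newlines, dropping empty pieces and surrounding whitespace."""
    return [piece.strip() for piece in text.split("\n") if piece.strip()]

def starts_with_isbn13(chunk):
    """Return True if the first whitespace-separated token is exactly 13 digits."""
    parts = chunk.split(maxsplit=1)
    return bool(parts) and len(parts[0]) == 13 and parts[0].isdigit()

# Made-up sample text in the same "ISBN description" layout.
raw = "9780000000001 First description\n9780000000002 Second description\n"
chunks = split_on_newlines(raw)
all_valid = all(starts_with_isbn13(chunk) for chunk in chunks)
```

In the notebook, the same predicate could be applied to `doc.page_content` for each item in `documents`.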

## 2.2 Vector Database

In [9]:
# embed each description with OpenAI embeddings and store the vectors in a Chroma vector database
db_books = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings())

In [10]:
#test query
query = "A book to teach me about egyptian mythology"       #can be any book request
docs = db_books.similarity_search(query, k=5)   #compares query vector against database vectors and returns the 5 best matches
docs                                            #displaying the suggested books based on the query

[Document(id='951ee001-f4ed-44b5-af8f-e5a55209ca33', metadata={'source': 'tagged_description.txt'}, page_content='9780192803467 Egyptian myths articulated the core values of one of the longest lasting civilizations in history, and myths of deities such as Isis and Osiris influenced contemporary cultures and became part of the Western cultural heritage. Egyptian Mythology: A Very Short Introduction explains the cultural and historical background to the fascinating and complex world of Egyptian myth, with each chapter dealing with a particular theme. To show the variety of source material for Egyptian myth, each chapter features a particular object--such as the obelisk known as Cleopatra\'s Needle, a golden statue of Tutankhamun, and a papyrus containing a story in which the Egyptian gods behave outrageously--which is illustrated by a photograph or line-drawing. The myth "The Contendings of Horus and Seth" is looked at in detail, and the many interpretations it has provoked are examined.
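
Conceptually, the search embeds the query and returns the stored descriptions whose vectors lie closest to it. A minimal sketch of that ranking on toy 3-dimensional vectors, assuming cosine similarity as the measure; the metric and index Chroma actually uses are not shown in this notebook, and real embeddings have far more dimensions. All names and vectors here are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings" keyed by a label instead of a real document id.
toy_vectors = {
    "myths": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "history": [0.7, 0.6, 0.1],
}
best = top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
```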

In [15]:
# look up the top match in the books dataframe, using the ISBN at the start of its page_content
books[books["isbn13"] == int(docs[0].page_content.split()[0].strip())]

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,missing_description,age_of_book,words_in_description,title_and_subtitle,tagged_description
948,9780192803467,192803468,Egyptian Myth: A Very Short Introduction,,Geraldine Pinch,History,http://books.google.com/books/content?id=pXkRD...,Egyptian myths articulated the core values of ...,2004.0,3.62,143.0,178.0,0,21.0,180,Egyptian Myth: A Very Short Introduction,9780192803467 Egyptian myths articulated the c...
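
The lookup above depends on the first token of `page_content` being the ISBN. A slightly more defensive sketch of that parsing step, assuming a chunk might carry a stray leading quote or lack an ISBN altogether; `isbn_from_chunk` is a hypothetical helper.

```python
def isbn_from_chunk(page_content):
    """Return the leading ISBN-13 of a chunk as an int, or raise ValueError if it is missing."""
    parts = page_content.split(maxsplit=1)
    first = parts[0].strip("\"'") if parts else ""
    if not (len(first) == 13 and first.isdigit()):
        raise ValueError(f"chunk does not start with an ISBN-13: {page_content[:30]!r}")
    return int(first)

isbn = isbn_from_chunk("9780192803467 Egyptian myths articulated the core values")
```

Raising on a malformed chunk makes a bad split visible immediately instead of producing an empty dataframe lookup.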


In [35]:
# function for retrieving a list of book recommendations based on a query
def retrieve_recommendations(query: str, top_k: int = 5) -> pd.DataFrame:
    recommendations = db_books.similarity_search(query, k=50)     # fetch the 50 descriptions most similar to the query
    books_list = []                                 # ISBNs of the matched books
    for rec in recommendations:                     # the ISBN is the first token of each description
        books_list.append(int(rec.page_content.split()[0]))
    # note: isin() keeps the dataframe's row order, so head(top_k) returns the first top_k matches by row position, not by similarity rank
    return books[books["isbn13"].isin(books_list)].head(top_k)


In [36]:
#runs the above function
retrieve_recommendations("books about epic adventures")

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,missing_description,age_of_book,words_in_description,title_and_subtitle,tagged_description
55,9780007136599,7136595,The Fellowship of the Ring,,John Ronald Reuel Tolkien;Alan Lee,"Baggins, Frodo (Fictitious character)",http://books.google.com/books/content?id=K7xSP...,Tolkien's classic fantasy about the quest to s...,2002.0,4.35,410.0,56.0,0,23.0,41,The Fellowship of the Ring,9780007136599 Tolkien's classic fantasy about ...
276,9780060891565,60891564,Eaters of the Dead,,Michael Crichton,Fiction,http://books.google.com/books/content?id=o2yu6...,An ambassador of the tenth-century Caliph of B...,2006.0,3.66,304.0,27412.0,0,19.0,32,Eaters of the Dead,9780060891565 An ambassador of the tenth-centu...
348,9780061052392,61052396,Realms of Dragons,The Worlds of Weis and Hickman,Margaret Weis;Denise Little;Tracy Hickman,Fiction,http://books.google.com/books/content?id=ieAFA...,"In the tradition of ""The Wheel of Time, "" this...",1999.0,4.04,218.0,47.0,0,26.0,33,Realms of Dragons: The Worlds of Weis and Hickman,"9780061052392 In the tradition of ""The Wheel o..."
358,9780061094156,61094153,Imajica II,The Reconciliation,Clive Barker,Fiction,http://books.google.com/books/content?id=DZVKS...,The magical tale of ill-fated lovers lost amon...,1995.0,4.42,544.0,2538.0,0,30.0,104,Imajica II: The Reconciliation,9780061094156 The magical tale of ill-fated lo...
440,9780066238500,66238501,The Chronicles of Narnia (adult),,C. S. Lewis,Fiction,http://books.google.com/books/content?id=3VGkK...,"Journeys to the end of the world, fantastic cr...",2001.0,4.26,767.0,425445.0,0,24.0,188,The Chronicles of Narnia (adult),9780066238500 Journeys to the end of the world...
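
The rows above appear in ascending dataframe index order (55, 276, 348, ...), which reflects row position in `books` rather than how closely each book matched the query. A minimal sketch of a rank-preserving selection in plain Python, assuming the ranked ISBNs come from the similarity search in best-first order; the helper name, records, and ranking are made up for illustration.

```python
def rank_preserving_top_k(ranked_isbns, records, top_k=5):
    """Return up to top_k records ordered by their position in ranked_isbns."""
    by_isbn = {rec["isbn13"]: rec for rec in records}
    seen = set()
    results = []
    for isbn in ranked_isbns:
        if len(results) >= top_k:
            break
        if isbn in by_isbn and isbn not in seen:   # skip unknown or repeated ISBNs
            seen.add(isbn)
            results.append(by_isbn[isbn])
    return results

# Made-up records in "table order" and a made-up similarity ranking.
records = [
    {"isbn13": 9780000000001, "title": "Book A"},
    {"isbn13": 9780000000002, "title": "Book B"},
    {"isbn13": 9780000000003, "title": "Book C"},
]
ranked = [9780000000003, 9780000000001, 9780000000002]
titles = [rec["title"] for rec in rank_preserving_top_k(ranked, records, top_k=2)]
```

With pandas, one option for the same ordering is `books.set_index("isbn13").loc[ranked_isbns]`, provided every ranked ISBN exists in the dataframe.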
