### __Useful Imports__

1. __Pandas:__ This library was used to handle the data.

2. __Warnings:__ This was only to ignore the warnings.

3. __Requests:__ This was used to send requests to APIs and websites to get the data.

4. __BeautifulSoup:__ This library was used to extract data from the source code of websites.

5. __Pyisbn:__ This was used later to convert ISBN10 to ISBN13 to search for books on bookswagon.com.

6. __Concurrent:__ The `concurrent.futures` module was used to do multithreaded scraping.

7. __Time:__ This was used to insert a delay if the threads send requests too quickly and get denied.


In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from requests import request
from bs4 import BeautifulSoup

import pyisbn

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

### __Data Ingestion__

The following data was collected from the RC of the college, with the following fields:

1. __Acc. Date__: The date on which the RC acquired the book.

2. __Acc. No.__: The accession number assigned to the book when it was acquired.

3. __Title__: The title of the book.

4. __ISBN__: The ISBN of the book. (This is the field we mainly rely on.)

5. __Ed./Vol.__: Indicates whether the book is an edition other than the first.

6. __Place & Publisher__: The place of publication and the name of the publisher.

7. __Year__: The year in which the book was published.

8. __Pages__: The number of pages in the book.

9. __Class No./Book No.__: Another identifier for the book.

The data was cleaned both manually and programmatically. It was inconsistent because many book titles contain `,` and `;`, both of which are used as delimiters in `.csv` files. Some ISBNs and dates were wrong as well. About 400 books were removed because they had no ISBN or a wrong ISBN.
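As an illustration of the delimiter problem described above, the following is a minimal sketch (not the cleaning code that was actually used) of how quoting keeps a title containing `,` or `;` in a single field. The helper name `roundtrip` and the sample row are hypothetical and exist only for this example.

```python
import csv
import io


def roundtrip(rows, delimiter=","):
    """Write rows to an in-memory CSV and read them back (hypothetical helper)."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL wraps a field in quotes only when it contains the delimiter,
    # the quote character, or a line break.
    writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer, delimiter=delimiter))


# Made-up sample row: a title with both a comma and a semicolon, plus a dummy ISBN.
sample = [["Example title, with a comma; and a semicolon", "0000000000"]]

# A naive split breaks the title into two pieces...
naive_fields = ",".join(sample[0]).split(",")
# ...while the quoted round trip preserves the two original fields.
restored = roundtrip(sample)
```

The same helper works with `delimiter=";"`, since the writer then quotes fields containing a semicolon instead.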

In [11]:
data_file = './Accession Register-Books _with_ ISBN_numbers.xlsx'
data_table = pd.read_excel(data_file)
data_table.drop_duplicates(subset='ISBN', keep='first', inplace=True)
data_table.sort_values('Acc. No.', inplace=True)
data_table.index = range(len(data_table))
data_table

Unnamed: 0,Acc. Date,Acc. No.,Title,ISBN,Author/Editor,Place & Publisher,Year,Page(s),Class No.Book No.
0,08-09-2001,1,Network design : management and technical pers...,849334047,"Mann-Rubinson, Teresa C.","Boca Raton: CRC Press,",1999.0,405 p.,004.6 MAN
1,08-09-2001,2,Multimedia information analysis and retrieval ...,9783540648260,"Ip, Horace H. S.","Berlin: Springer,",1998.0,"viii, 264 p.;",004 IPH
2,08-09-2001,3,"Multimedia systems : delivering, generating, a...",1852332484,"Morris, Tim","London: Springer,",2000.0,"xi, 191 p.;",006.7 MOR
3,08-09-2001,4,Principles of Data Mining and Knowledge Discovery,9783540410669,"Zytkov, Jan. M.","New York: Springer-Verlag,",1999.0,593 p.,006.3 ZYT
4,08-09-2001,5,Focusing solutions for data mining : analytica...,3540664297,"Reinartz, Thomas","New York: Springer,",1999.0,"xiv, 307 p.;",006.3 REI
...,...,...,...,...,...,...,...,...,...
31527,05-01-2026,36354,The Ruba'iyat of Omar Khayyam,9780140443844,"Khayyam, Omar","London : Penguin Books,",1981.0,116 p.,891.5511 KHA
31528,05-01-2026,36355,Artificial intelligence for robotics,9781805129592,Francis X.,"UK : Packt Publishing,",2024.0,"xvii, 325 p. ;",006.3 FRA
31529,05-01-2026,36356,Too big to fail: The inside story of how wall ...,9780143118244,"Sorkin, Andrew Ross","New York : Penguin Books,",2010.0,"xx, 618p. ;",330.973 SOR
31530,07-01-2026,36357,Digital communications : A foundational approach,9781009429665,"Fischer, Robert F. H.","Cambridge : Cambridge University Press,",2024.0,"xiv, 397p. ;",621.382 FIS


In [12]:
data_table.to_excel('./Data.xlsx', index_label="Index")

### __Data Scraping__

This is the header that was used. It was copied from the developer tools in Firefox after visiting a website.

In [3]:
header = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0'}

We first made a scraper that scraped books from the Google Books site and, if it did not find a summary there, fell back to goodreads.com.

Later, we realised that Google provides an API for books, which is faster than scraping their site, so we switched our method. We also realised that a lot of the summaries provided by goodreads.com were not of good quality, so in the second method we stopped using it and searched for better alternatives.

In [4]:
# USING GOOGLE WEBSITE

def summary_finder(isbn, header):
    try:
        summary = ""
        isbn = str(isbn)
        if len(isbn) < 10:
            isbn = f"{'0'*(10 - len(isbn))}{isbn}"
        print(isbn)
        search_base_link = f'https://books.google.com/books?vid=ISBN{isbn}'
        search_req = request(method='GET', url=search_base_link, headers=header)
        book_soup = BeautifulSoup(search_req.text, 'html.parser')
        # print('initial google search')
        if book_soup.find(name='a', attrs={'class': 'opt-in-header-link'}):
            if book_soup.find(name='a', attrs={'class': 'opt-in-header-link'}).text  == "Try the new Google Books":
                new_link = book_soup.find(name='a', attrs={'class': 'opt-in-header-link'}).attrs['href']
                new_req = request(method='GET', url=new_link, headers=header)
                new_soup = BeautifulSoup(new_req.text, 'html.parser')
                # print('new google search')
                book_soup = new_soup
        # print('finding summary in google')
        summary = book_soup.find(name='div', attrs={'class': 'Mhmsgc'}) or book_soup.find(name='div', attrs={'id': 'synopsistext'})
        if not summary:
            # print('goodreads')
            search_base_link = f'https://www.goodreads.com/search?utf8=%E2%9C%93&search%5Bquery%5D={isbn}'
            search_req = request(method='GET', url=search_base_link, headers=header)
            book_soup = BeautifulSoup(search_req.text, 'html.parser')
            summary = book_soup.find(name='span', attrs={'class': 'Formatted'})
    except Exception as exc:
        # Chain the original exception so the real cause stays in the traceback
        raise ReferenceError(f"{isbn} failed") from exc
        # return isbn, "FAILED"
    if summary:
        for br in summary.find_all('br'):
            br.replace_with('\\n')
        return isbn, summary.text
    return isbn, ""
        

We found out that OpenLibrary has its own API as well and started using it. We noticed that both Google and OpenLibrary return a category/subject field, which can be really helpful for the semantic search that the data science team will perform, so we decided to save the keywords as well. We also found the site bookswagon.com, which has summaries and keywords for many books that are not available from Google or OpenLibrary.

Our function first tries the OpenLibrary API, then bookswagon.com, and finally the Google Books API as a fallback. If at any point it has both keywords and a summary, it directly returns the ISBN, keywords and summary as a tuple. When moving from one site to the next, it keeps only the higher-quality (longer) summary and keyword list and replaces the existing one only if the new one is better. If the ISBN is wrong, the `pyisbn.convert` call raises an error, so the conversion is wrapped in a try-except that returns empty results. Any other failure is caught by the outer try-except, which appends the ISBN to `notFound.txt`.

<div style="background-color: #e7f3ff; padding: 15px; border-radius: 5px; border-left: 5px solid #007acc; color: #444; width: 95%">
    <b>Note:</b> 
    <br>
    We were not able to scrape much using this function; the details are given later in this file.
    <br>
    Also note that we did use an API Key from Google but we removed it before pushing it to GitHub.
</div>
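The "keep the better result" rule described above can be sketched in isolation. This is a simplified illustration, assuming that length is the quality measure (as in the comparisons the scraper makes); the names `keep_better` and `merge_sources` are hypothetical and do not appear in the notebook.

```python
def keep_better(current, candidate):
    """Return the longer of the two values; on a tie, keep the current one."""
    return candidate if len(candidate) > len(current) else current


def merge_sources(results):
    """Fold (keywords, summary) pairs from several providers in order.

    Stops as soon as both keywords and a summary are available, mirroring
    the early return in the scraper.
    """
    keywords, summary = [], ""
    for new_keywords, new_summary in results:
        keywords = keep_better(keywords, new_keywords)
        summary = keep_better(summary, new_summary)
        if keywords and summary:
            break
    return keywords, summary


# Made-up provider results: the first source has keywords but no summary,
# so the second source is consulted; the third is never reached.
merged = merge_sources([
    (["a"], ""),
    (["b", "c"], "long text"),
    (["x", "y", "z"], "an even longer text"),
])
```

Passing a generator for `results` would mean that later providers are only queried when the earlier ones were not sufficient.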


In [5]:
# Using Google API, OpenLibrary API and bookwagon.com

def summary_finder(isbn, header):
    try:
        summary = ""
        keywords = []        
        isbn = str(isbn)
        if len(isbn) < 10:
            isbn = f"{'0'*(10 - len(isbn))}{isbn}"
        # print(isbn)
        
        # print('openlibrary')
        api_link = f"https://www.openlibrary.org/isbn/{isbn}"
        api_req = request(method="GET", url=api_link, headers=header)
        try:
            api_res = api_req.json()
            if 'subjects' in api_res:
                keywords = api_res['subjects']
            if 'description' in api_res:
                if type(api_res['description']) == dict:
                    summary = api_res['description']['value']
                else:
                    summary = api_res['description']
            if not summary and 'first_sentence' in api_res:
                if type(api_res['first_sentence']) == dict:
                    summary = api_res['first_sentence']['value']
                else:
                    summary = api_res['first_sentence']
            if 'works' in api_res:
                api_link = f"https://www.openlibrary.org{api_res['works'][0]['key']}.json"
                api_req = request(method="GET", url = api_link, headers=header)
                api_res = api_req.json()
                if 'subjects' in api_res:
                    if len(keywords) < len(api_res['subjects']):
                        keywords = api_res['subjects']
                if 'description' in api_res:
                    # OpenLibrary returns the description either as a str or as a dict with a 'value' key
                    work_desc = api_res['description']
                    if isinstance(work_desc, dict):
                        work_desc = work_desc.get('value', '')
                    if len(work_desc) > len(summary):
                        summary = work_desc
            if summary and keywords:
                return isbn, ', '.join(keywords), summary
        except:
            pass
        
            
        
        # print('bookswagon')
        if len(isbn) == 10:
            try:
                isbn = pyisbn.convert(isbn)
            except:
                return isbn, "", ""
        search_base_link = f'https://www.bookswagon.com/book/c/{isbn}'
        search_req = request(method='GET', url=search_base_link, headers=header)
        book_soup = BeautifulSoup(search_req.text, 'html.parser')
        new_summary = book_soup.find(name='div', attrs={'id': 'aboutbook'})
        if new_summary:
            new_summary = new_summary.p
            if len(summary) < len(new_summary.text.strip()):
                summary = new_summary.text.strip()
        cats = book_soup.find('ul', attrs={'class': 'blacklistreview'})
        if cats:
            cats = cats.find_all('a')
            cats = list({k.text.strip() for k in cats})
            if len(cats) > len(keywords):
                keywords = cats
        if keywords and summary:
            return isbn, ', '.join(keywords), summary
                
                
        # print('finding summary in google')
        api_link = f"https://www.googleapis.com/books/v1/volumes?q=isbn:{isbn}"
        api_req = request(method='GET', url=api_link, headers=header)
        api_res = api_req.json()
        if api_res['totalItems'] != 0:
            if 'description' in api_res['items'][0]['volumeInfo'] and len(summary) < len(api_res['items'][0]['volumeInfo']['description']):
                summary = api_res['items'][0]['volumeInfo']['description']
            if 'categories' in api_res['items'][0]['volumeInfo'] and len(keywords) < len(api_res['items'][0]['volumeInfo']['categories']):
                keywords = api_res['items'][0]['volumeInfo']['categories']
        
        return isbn, ', '.join(keywords), summary
        
    except:
        with open('./notFound.txt', 'a') as nf:
            nf.write(f'{isbn}\n')
        return isbn, "", ""
        

Here we used multithreading to work on several books of a batch at once. Several oversights came to light while applying the function with a version similar to the following one that, instead of retrying, raised an error with the details of the book and the error. These oversights and bugs were then fixed. After that, the function only failed when one of the providers gave us a timeout. So we made it recursive: it waits 5 seconds for the timeout to clear and then continues from where it left off. It also concatenates the new results with the ones found in previous iterations.

<div style="background-color: #e7f3ff; padding: 15px; border-radius: 5px; border-left: 5px solid #007acc; color: #444; width: 95%">
    <b>Note:</b> 
    <br>
    We supervised this function for the first 1600 books, during which it worked perfectly and returned both keywords and summaries. After 8000+ books had been scraped, the function ran into an error loop (due to the try-except clause). When we dug deeper into what happened, we found the last bug: OpenLibrary sometimes sends the summary as a <code>dict</code> and not a <code>str</code>. We fixed that immediately, but when we then checked the output file, we found that the summaries of most of the books (~97%) had been overwritten for some reason. We did not have enough time to scrape the data again and submit the project before the deadline, so we asked another team for their scraped data.
</div>

In [6]:
def scrap_and_save(start_idx):
    for i in range(start_idx, 3500):
        books = []
        with ThreadPoolExecutor(max_workers=4) as executor:
            futures = [
                executor.submit(summary_finder, isbn, header)
                for isbn in data_table['ISBN'][10*i:10*(i+1)]
            ]
            
            for future in as_completed(futures):
                isbn, keywords, summary = future.result()
                books.append({
                    'isbn': isbn,
                    'keywords': keywords,
                    'summary': summary
                })
            
        pd.DataFrame(books).to_excel(f'./Summaries/{i}.xlsx', index=False)
        
# scrap_and_save(0)
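The wait-and-resume behaviour described earlier is not visible in the cell above, so here is a minimal sketch of one way it could look. It uses a loop rather than recursion and caps the number of consecutive retries so that a persistent error cannot turn into an endless loop. The function name `process_chunks` and its parameters (`process_chunk`, `wait_seconds`, `max_retries`, `sleep`) are assumptions made for this illustration, not the code that was actually run.

```python
import time


def process_chunks(process_chunk, start_idx, stop_idx,
                   wait_seconds=5, max_retries=3, sleep=time.sleep):
    """Call process_chunk(i) for each i in [start_idx, stop_idx).

    A failed chunk is retried after a pause; after max_retries consecutive
    failures on the same chunk the last exception is re-raised.
    Returns the list of chunk indices that completed.
    """
    done = []
    i = start_idx
    retries = 0
    while i < stop_idx:
        try:
            process_chunk(i)
        except Exception:
            retries += 1
            if retries > max_retries:
                raise  # give up instead of looping forever
            sleep(wait_seconds)  # wait for the provider's timeout to clear
            continue  # resume from the chunk that failed
        done.append(i)
        retries = 0
        i += 1
    return done
```

The `sleep` parameter is injectable only so that the pause can be replaced in a dry run; with the default, the function really waits `wait_seconds` between attempts.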

After running this function, we have a separate Excel file for each chunk of 10 books in `./Summaries/`.

### __Data Merging__

In [3]:
data_table = pd.read_excel('./Data.xlsx')

In [4]:
def norm_isbn(isbn):
    isbn = str(isbn)
    if len(isbn) < 10:
        isbn = f"{'0' * (10-len(isbn))}{isbn}"
    if len(isbn) == 10:
        try:
            new_isbn = pyisbn.convert(isbn)
        except:
            return isbn
    else:
        new_isbn = isbn
    return new_isbn
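
For readers unfamiliar with the conversion that `norm_isbn` delegates to `pyisbn`, the arithmetic can be sketched in plain Python: prefix `978`, drop the old check character, and recompute the ISBN-13 check digit with alternating weights 1 and 3. This is an illustrative sketch, not a replacement for the library; the names `isbn10_to_isbn13` and `normalize` are hypothetical, and the sketch does not validate the original ISBN-10 check character.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-character ISBN to ISBN-13 (978 prefix, new check digit)."""
    core = "978" + isbn10[:9]  # drop the old ISBN-10 check character
    if len(isbn10) != 10 or not core.isdigit():
        raise ValueError(f"not a 10-character ISBN: {isbn10!r}")
    # ISBN-13 weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def normalize(isbn) -> str:
    """Left-pad short values to 10 characters, then convert ISBN-10 to ISBN-13."""
    text = str(isbn).zfill(10)
    return isbn10_to_isbn13(text) if len(text) == 10 else text
```

For the first row of the table, `normalize(849334047)` pads the value to `0849334047` and returns `9780849334047`, which matches the `Norm ISBN` column.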

In [5]:
data_table['Chunk'] = data_table['Index'] //10
data_table['Norm ISBN'] = data_table['ISBN'].apply(norm_isbn)
data_table

Unnamed: 0,Index,Acc. Date,Acc. No.,Title,ISBN,Author/Editor,Place & Publisher,Year,Page(s),Class No.Book No.,Chunk,Norm ISBN
0,0,08-09-2001,1,Network design : management and technical pers...,849334047,"Mann-Rubinson, Teresa C.","Boca Raton: CRC Press,",1999.0,405 p.,004.6 MAN,0,9780849334047
1,1,08-09-2001,2,Multimedia information analysis and retrieval ...,9783540648260,"Ip, Horace H. S.","Berlin: Springer,",1998.0,"viii, 264 p.;",004 IPH,0,9783540648260
2,2,08-09-2001,3,"Multimedia systems : delivering, generating, a...",1852332484,"Morris, Tim","London: Springer,",2000.0,"xi, 191 p.;",006.7 MOR,0,9781852332488
3,3,08-09-2001,4,Principles of Data Mining and Knowledge Discovery,9783540410669,"Zytkov, Jan. M.","New York: Springer-Verlag,",1999.0,593 p.,006.3 ZYT,0,9783540410669
4,4,08-09-2001,5,Focusing solutions for data mining : analytica...,3540664297,"Reinartz, Thomas","New York: Springer,",1999.0,"xiv, 307 p.;",006.3 REI,0,9783540664291
...,...,...,...,...,...,...,...,...,...,...,...,...
31527,31527,05-01-2026,36354,The Ruba'iyat of Omar Khayyam,9780140443844,"Khayyam, Omar","London : Penguin Books,",1981.0,116 p.,891.5511 KHA,3152,9780140443844
31528,31528,05-01-2026,36355,Artificial intelligence for robotics,9781805129592,Francis X.,"UK : Packt Publishing,",2024.0,"xvii, 325 p. ;",006.3 FRA,3152,9781805129592
31529,31529,05-01-2026,36356,Too big to fail: The inside story of how wall ...,9780143118244,"Sorkin, Andrew Ross","New York : Penguin Books,",2010.0,"xx, 618p. ;",330.973 SOR,3152,9780143118244
31530,31530,07-01-2026,36357,Digital communications : A foundational approach,9781009429665,"Fischer, Robert F. H.","Cambridge : Cambridge University Press,",2024.0,"xiv, 397p. ;",621.382 FIS,3153,9781009429665


In [6]:
book_table = pd.DataFrame(columns=['isbn', 'keywords', 'summary', 'Chunk', 'Norm ISBN'])
location = './Summaries/'
for chunk in range(0, 3154):
    book_chunk = pd.read_excel(f"{location}{chunk}.xlsx")
    book_chunk['Norm ISBN'] = book_chunk['isbn'].apply(norm_isbn)
    book_chunk['Chunk'] = chunk
    book_table = pd.concat([book_table, book_chunk], ignore_index=True)

In [7]:
book_reset = book_table.reset_index()
book_reset

Unnamed: 0,index,isbn,keywords,summary,Chunk,Norm ISBN
0,0,9783540410669,"The Arts, Interdisciplinary studies, Internet ...",This book constitutes the refereed proceedings...,0,9783540410669
1,1,9783540648260,"Digital signal processing (DSP), Computing and...",This book constitutes the refereed proceedings...,0,9783540648260
2,2,9781852332488,"Computing and Information Technology, Computer...",What are Multimedia Systems? This book is inte...,0,9781852332488
3,3,9780849334047,"Computing and Information Technology, Computer...",Network Design outlines the fundamental princi...,0,9780849334047
4,4,9783540660828,"The Arts, Internet guides and online services,...",This book constitutes the refereed proceedings...,0,9783540660828
...,...,...,...,...,...,...
31527,31527,9781805129592,"Storage media and peripherals, Computer hardwa...",Let an AI and robotics expert help you apply A...,3152,9781805129592
31528,31528,9780143118244,"Economic geography, Geography, Economic histor...",NEW YORK TIMES BESTELLER • The definitive acco...,3152,9780143118244
31529,31529,9780140443844,"Poetry by individual poets, Poetry, Biography,...",Only widely-available edition of Khayyam's lyr...,3152,9780140443844
31530,31530,9781009429665,Communications engineering / telecommunication...,Introducing the fundamentals of digital commun...,3153,9781009429665


In [8]:
data_table_merge = data_table.merge(book_reset, on=['Chunk', 'Norm ISBN'], how='left')

In [9]:
final_data = data_table_merge.drop_duplicates(['Index'])[[
    'Acc. Date', 
    'Acc. No.', 
    'Title', 
    'ISBN', 
    'Norm ISBN', 
    'Author/Editor', 
    'Place & Publisher', 
    'Year', 
    'Page(s)', 
    'Class No.Book No.', 
    'keywords', 
    'summary'
]]

In [10]:
final_data.columns = [
    'AccDate',
    'AccNo',
    'Title',
    'ISBN',
    'ISBN13',
    'Author',
    'Publisher',
    'Year',
    'Pages',
    'DDC',
    'Keywords',
    'Summary'  
]

In [11]:
final_data

Unnamed: 0,AccDate,AccNo,Title,ISBN,ISBN13,Author,Publisher,Year,Pages,DDC,Keywords,Summary
0,08-09-2001,1,Network design : management and technical pers...,849334047,9780849334047,"Mann-Rubinson, Teresa C.","Boca Raton: CRC Press,",1999.0,405 p.,004.6 MAN,"Computing and Information Technology, Computer...",Network Design outlines the fundamental princi...
1,08-09-2001,2,Multimedia information analysis and retrieval ...,9783540648260,9783540648260,"Ip, Horace H. S.","Berlin: Springer,",1998.0,"viii, 264 p.;",004 IPH,"Digital signal processing (DSP), Computing and...",This book constitutes the refereed proceedings...
2,08-09-2001,3,"Multimedia systems : delivering, generating, a...",1852332484,9781852332488,"Morris, Tim","London: Springer,",2000.0,"xi, 191 p.;",006.7 MOR,"Computing and Information Technology, Computer...",What are Multimedia Systems? This book is inte...
3,08-09-2001,4,Principles of Data Mining and Knowledge Discovery,9783540410669,9783540410669,"Zytkov, Jan. M.","New York: Springer-Verlag,",1999.0,593 p.,006.3 ZYT,"The Arts, Interdisciplinary studies, Internet ...",This book constitutes the refereed proceedings...
4,08-09-2001,5,Focusing solutions for data mining : analytica...,3540664297,9783540664291,"Reinartz, Thomas","New York: Springer,",1999.0,"xiv, 307 p.;",006.3 REI,"Data capture and analysis, Research and inform...","In the first part, this book analyzes the know..."
...,...,...,...,...,...,...,...,...,...,...,...,...
31551,05-01-2026,36354,The Ruba'iyat of Omar Khayyam,9780140443844,9780140443844,"Khayyam, Omar","London : Penguin Books,",1981.0,116 p.,891.5511 KHA,"Poetry by individual poets, Poetry, Biography,...",Only widely-available edition of Khayyam's lyr...
31552,05-01-2026,36355,Artificial intelligence for robotics,9781805129592,9781805129592,Francis X.,"UK : Packt Publishing,",2024.0,"xvii, 325 p. ;",006.3 FRA,"Storage media and peripherals, Computer hardwa...",Let an AI and robotics expert help you apply A...
31553,05-01-2026,36356,Too big to fail: The inside story of how wall ...,9780143118244,9780143118244,"Sorkin, Andrew Ross","New York : Penguin Books,",2010.0,"xx, 618p. ;",330.973 SOR,"Economic geography, Geography, Economic histor...",NEW YORK TIMES BESTELLER • The definitive acco...
31554,07-01-2026,36357,Digital communications : A foundational approach,9781009429665,9781009429665,"Fischer, Robert F. H.","Cambridge : Cambridge University Press,",2024.0,"xiv, 397p. ;",621.382 FIS,Communications engineering / telecommunication...,Introducing the fundamentals of digital commun...


In [95]:
final_data.to_excel('./Final_Data.xlsx', index=False)

### __Stats for books__

In [118]:
total = len(final_data)

# Books with no summaries
no_sum = final_data['Summary'].isna().sum()
print(f"No summaries: {f'{no_sum} ( {no_sum/total*100:.2f} % )':>50}")

# Books with no keywords
no_key = final_data['Keywords'].isna().sum()
print(f"No keywords: {f'{no_key} ( {no_key/total*100:.2f} % )':>51}")

# Books with no extra details
no_sumkey = final_data[final_data['Summary'].isna() & final_data['Keywords'].isna()].shape[0]
print(f"Neither summary nor keywords: {f'{no_sumkey} ( {no_sumkey/total*100:.2f} % )':>34}")

# Total
print(f"Total Books: {total:>39}")

No summaries:                                   4006 ( 12.70 % )
No keywords:                                    4268 ( 13.54 % )
Neither summary nor keywords:                   3391 ( 10.75 % )
Total Books:                                   31532


#### Extra

While cleaning the data given by the RC, we found that the RC's website has even more data, which is not only helpful but also crucial. It already had many summaries, subjects, descriptions and contributor names that were not in the data provided. We therefore tried to scrape it, but stopped after a warning from the faculty.

In [None]:
# Data Correction from RC website scraping


for i in range(334, 2000):
    not_Found = []
    cleaned_data_table = pd.DataFrame(columns=[
                        'Acc. No.', 
                        'Acc. Date', 
                        'ISBN', 
                        'Title', 
                        'Edition', 
                        'Author', 
                        'Contributor(s)',
                        'Series',
                        'Publisher', 
                        'Publication Date', 
                        'Publisher Location', 
                        'Description', 
                        'Subjects', 
                        'DDC', 
                        'Summary'
                    ])
    try:
        for acc_no, acc_date, isbn in data_table[10*i:10*(i+1)].itertuples(index=False):
            while len(isbn) < 10:
                isbn = f'0{isbn}'
            # print(isbn)
            search_base_link = f'https://opac.daiict.ac.in/cgi-bin/koha/opac-search.pl?q={isbn}'
            print('searched', isbn)
            search_req = request('GET', search_base_link, headers=header)
            rc_soup = BeautifulSoup(search_req.text, 'html.parser')
            if rc_soup.find('div', {'id': 'didyoumean'}):
                if rc_soup.find('h1', {'id': 'numresults'}).text == 'No results found!':
                    not_Found.append((acc_no, acc_date, isbn))
                    print(isbn, 'not found')
                    print(f'{"-"*30}')
                    continue
                print('clicking results')
                results = rc_soup.find_all('a', {'class': 'title'})
                for idx, result in enumerate(results):
                    search_base_link = f"https://opac.daiict.ac.in{result.get_attribute_list('href')[0]}"
                    search_req = request('GET', search_base_link, headers=header)
                    rc_soup = BeautifulSoup(search_req.text, 'html.parser')
                    if isbn == rc_soup.find('span', {'property': 'isbn'}).text.strip():
                        break
                    print(f'going to next result {idx}/{len(results)}')
            title = rc_soup.find('h1', {'class': 'title'}).text
            # print(repr(title))
            # if rc_soup.find('span', {'property': 'bookEdition'}):
            edition = rc_soup.find('span', {'property': 'bookEdition'}).text if rc_soup.find('span', {'property': 'bookEdition'}) else ""
            # print(repr(edition))
            # else:
                # edition = ""
            author = rc_soup.find('span', {'property': 'author'}).text
            # print(repr(author))
            contributors = [i.text for i in rc_soup.find_all('span', {'property': 'contributor'})] if rc_soup.find('span', {'property': 'contributor'}) else []
            # print(repr(contributors))
            series = rc_soup.find('span', {'class': 'series'}).a.text if rc_soup.find('span', {'class': 'series'}) else ""
            # print(repr(series))
            publisher = rc_soup.find('span', {'class': 'publisher_name'}).text[:-2]
            # print(repr(publisher))
            pub_date = rc_soup.find('span', {'class': 'publisher_date'}).text
            # print(repr(pub_date))
            pub_place = rc_soup.find('span', {'class': 'publisher_place'}).text[:-2]
            # print(repr(pub_place))
            desc = rc_soup.find('span', {'property': 'description'}).text
            # print(repr(desc))
            sub = rc_soup.find('span', {'class': 'subjects'}).ul.text.strip().split('\n') if rc_soup.find('span', {'class': 'subjects'}) else []
            # print(repr(sub))
            ddc = rc_soup.find('span', {'class': 'ddc'}).ul.text
            # print(repr(ddc))
            summary = rc_soup.find('span', {'class': 'summary'}).text[9:] if rc_soup.find('span', {'class': 'summary'}) else ""
            # print(repr(summary))
            
            book = {
                'Acc. No.': acc_no,
                'Acc. Date': acc_date,
                'ISBN': isbn,
                'Title': title,
                'Edition': edition,
                'Author': author,
                'Contributor(s)': contributors,
                'Series': series,
                'Publisher': publisher, 
                'Publication Date': pub_date, 
                'Publisher Location': pub_place, 
                'Description': desc, 
                'Subjects': sub, 
                'DDC': ddc, 
                'Summary': summary
            }
            cleaned_data_table = pd.concat([cleaned_data_table, pd.DataFrame([book])], ignore_index=True)
            
            print(isbn, 'added')
            print(f"{'-'*30}")

        if not_Found:
            with open('notFound.txt', 'a') as nf:
                for n in not_Found:
                    nf.write(f'{n}\n')
        old_df = pd.read_excel('./CleanedData.xlsx')
        pd.concat([old_df, cleaned_data_table], ignore_index=True).to_excel('./CleanedData.xlsx', index=False)
    except Exception as exc:
        # Chain the original exception so the real cause stays in the traceback
        raise LookupError(f"Failed for {isbn} and i={i}") from exc
        
    