### Understand the schema
This step looks at how many columns the dataset has and what they contain. We then decide which columns we need right now, which can be kept for later, and which are not needed at all.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
cols = ["title", "author", "desc", "genre", "isbn", "link", "pages"]
df = pd.read_csv("GoodReads_100k_books.csv", usecols=cols)

In [None]:
len(df)

100000

In [None]:
df.head()

Unnamed: 0,author,desc,genre,isbn,link,pages,title
0,Laurence M. Hauptman,Reveals that several hundred thousand Indians ...,"History,Military History,Civil War,American Hi...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,Between Two Fires: American Indians in the Civ...
1,"Charlotte Fiell,Emmanuelle Dirix",Fashion Sourcebook - 1920s is the first book i...,"Couture,Fashion,Historical,Art,Nonfiction",1906863482,https://goodreads.com/book/show/10010552-fashi...,576,Fashion Sourcebook 1920s
2,Andy Anderson,The seminal history and analysis of the Hungar...,"Politics,History",948984147,https://goodreads.com/book/show/1001077.Hungar...,124,Hungary 56
3,Carlotta R. Anderson,"""All-American Anarchist"" chronicles the life a...","Labor,History",814327079,https://goodreads.com/book/show/1001079.All_Am...,324,All-American Anarchist: Joseph A. Labadie and ...
4,Jean Leveille,"Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa ta...",,2761920813,https://goodreads.com/book/show/10010880-les-o...,177,Les oiseaux gourmands


In [None]:
df.columns
# mandatory - title, author, desc, genre
# deduplication - isbn, book_id (might have to create this or get this from link column)
# optional metadata - pages, link
# not needed (for now) - bookformat, img, isbn13, rating, reviews, totalratings

Index(['author', 'desc', 'genre', 'isbn', 'link', 'pages', 'title'], dtype='object')
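The comment in the cell above notes that a `book_id` might have to be derived from the `link` column. A minimal sketch of that idea, assuming every link follows the `/book/show/<digits>` pattern seen in the sample rows; `extract_book_id` and the `book_id` column are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

def extract_book_id(links: pd.Series) -> pd.Series:
    """Pull the digits that follow '/book/show/' in each link (assumed URL pattern)."""
    # expand=False returns a Series; links that do not match become NaN
    return links.str.extract(r"/book/show/(\d+)", expand=False)

# Links copied from rows shown in this notebook, plus one made-up non-matching value
sample = pd.DataFrame({
    "link": [
        "https://goodreads.com/book/show/13552877-n-a",
        "https://goodreads.com/book/show/1001463.Ru_486",
        "https://goodreads.com/not-a-book",  # hypothetical bad link
    ]
})
sample["book_id"] = extract_book_id(sample["link"])
print(sample)
```

The id is kept as a string here so that unmatched rows can stay as NaN; whether it should be cast to an integer depends on how it is used for deduplication later.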

### Check for missing or incomplete values
This step looks at individual columns and checks for missing values, blanks, and placeholder values such as "Unknown". We focus first on the mandatory columns and then move on to the optional ones.

In [None]:
# How many books have missing titles?

df[df["title"].isna()]
# count 1

Unnamed: 0,author,desc,genre,isbn,link,pages,title
54953,Jacqui Malpass,,"Diary,Journaling",,https://goodreads.com/book/show/13552877-n-a,0,


In [None]:
# How many books have no authors or author is blank or Unknown?

len(df[df["author"] == "Unknown"])
# count 15
df[df["author"].str.len() <= 7]
# TODO more anomalies might be present but keeping them for later

Unnamed: 0,author,desc,genre,isbn,link,pages,title
944,Plato,"This edition is written in English. However, t...","Philosophy,Nonfiction,Classics,Literature,Anci...",546399401,https://goodreads.com/book/show/10006519-lesse...,48,Lesser Hippias
1078,Tom Ang,In the fast-changing world of digital photogra...,"Art,Photography,Nonfiction,Reference",756682142,https://goodreads.com/book/show/10082170-digit...,360,Digital Photography Essentials
1102,Igort,Dopo aver esplorato il mondo del jazz e del cr...,"Sequential Art,Graphic Novels,Sequential Art,C...",8804604425,https://goodreads.com/book/show/10083629-quade...,180,Quaderni ucraini: Memorie dai tempi dell'URSS
1226,albac,,,,https://goodreads.com/book/show/10091448-392003,0,392003
1248,Cao Yu,,,231056567,https://goodreads.com/book/show/1009277.Peking...,181,Peking Man
...,...,...,...,...,...,...,...
98935,Yeyu,The son of a Han traitor who had let the Xianb...,"Historical,Romance,M M Romance,LGBT,M M Romanc...",1623803659,https://goodreads.com/book/show/17279572-erasi...,350,Erasing Shame
98974,Demi,When a peaceful kingdom is overtaken by an evi...,"Childrens,Picture Books,Fantasy,Mythology,Reli...",1937786056,https://goodreads.com/book/show/17280735-the-f...,44,The Fantastic Adventures of Krishna
98996,27Press,"Learn Everything You Need To Know About Tea,Th...","Nonfiction,Food and Drink,Tea,Food and Drink,C...",,https://goodreads.com/book/show/17281679-19-le...,118,"19 Lessons On Tea: Become an Expert on Buying,..."
99564,Amos Oz,'On the kibbutz it's hard to know. We're all s...,"Fiction,Short Stories,Cultural,Israel,Literatu...",701187964,https://goodreads.com/book/show/17302874-betwe...,208,Between Friends


In [None]:
# How many books have missing descriptions?

len(df[df["desc"].isna()]) / len(df) * 100
# count 6772 i.e. 6.77%

6.772

In [None]:
# How many books have missing genres?

len(df[df["genre"].isna()]) / len(df) * 100
# count 10467 i.e. 10.47%

10.467

In [None]:
# How many books have 0 pages?

df[df["pages"] == 0]
# count 7752

Unnamed: 0,author,desc,genre,isbn,link,pages,title
0,Laurence M. Hauptman,Reveals that several hundred thousand Indians ...,"History,Military History,Civil War,American Hi...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,Between Two Fires: American Indians in the Civ...
14,Graham Purchase,"In this wide-ranging book, Graham Purchase, on...","Biology,Ecology",961328983,https://goodreads.com/book/show/1001220.Anarch...,0,Anarchism & Environmental Survival
16,Umberto Eco,In the course of the long debate on the nature...,History,9027232938,https://goodreads.com/book/show/1001231.On_The...,0,On The Medieval Theory Of Signs
38,Ronald Jackson II,,,791482375,https://goodreads.com/book/show/10014315-scrip...,0,"Scripting the Black Masculine Body: Identity, ..."
39,Richard Allen,,,915992582,https://goodreads.com/book/show/1001432.A_Narr...,0,A Narrative of the Proceedings of the Black Pe...
...,...,...,...,...,...,...,...
99979,Jaqueline E. Pearson,"My brother disappeared when I was eight, after...","Paranormal,Vampires,Fantasy,Paranormal",,https://goodreads.com/book/show/17319136-i-was...,0,I was Sold to My Dead Brother's Best Friend
99980,Mi-Ri Hwang,,"Manga,Manhwa,Sequential Art,Manga",,https://goodreads.com/book/show/17319160-say-s...,0,Say say say(#2)
99983,Roberta Horton,Important Note about PRINT ON DEMAND Editions:...,"Crafts,Quilting",1571200479,https://goodreads.com/book/show/1731941.Scrap_...,0,Scrap Quilts
99984,Roberta Horton,Important Note about PRINT ON DEMAND Editions:...,"Crafts,Quilting",914881035,https://goodreads.com/book/show/1731942.Calico...,0,Calico and Beyond - Print on Demand Edition


In [None]:
# How many books have no ISBN code?

len(df[df["isbn"].isna()]) / len(df) * 100
# count 14482 i.e. 14.48%

14.482000000000001

In [None]:
# How many books have no link?

df[ (df["link"].isna()) | (df["link"] == "Unknown") | (df["link"] == "")]
# None baabbyy

Unnamed: 0,author,desc,genre,isbn,link,pages,title
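The per-column checks in this section could also be gathered into one summary table. The sketch below is one way to do that, assuming that empty strings and the word "unknown" (case-insensitive) are the only placeholder values; `missing_summary` and `PLACEHOLDERS` are hypothetical names, and zero page counts are deliberately not treated as missing here.

```python
import pandas as pd

PLACEHOLDERS = {"", "unknown"}  # assumed placeholder values, compared case-insensitively

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and placeholder values per column and express them as a percentage."""
    rows = []
    for col in frame.columns:
        values = frame[col]
        is_nan = values.isna()
        # Compare the string form so numeric columns are handled too
        as_text = values.astype(str).str.strip().str.lower()
        is_placeholder = ~is_nan & as_text.isin(PLACEHOLDERS)
        missing = int((is_nan | is_placeholder).sum())
        rows.append({
            "column": col,
            "missing": missing,
            "missing_pct": round(missing / len(frame) * 100, 2),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "author": ["Unknown", "X", "", "Y"],
    "pages": [0, 10, 20, 30],
})
summary = missing_summary(toy)
print(summary)
```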


### Detective work

**Look for anomalies or inconsistencies**

These will be little things that can mess up the embeddings. Descriptions might be the most important field so we will heavily focus on that.

_Description field_ - we need to check whether there are any blank descriptions, whether descriptions contain HTML tags, stray regex-like patterns, or escape characters, and whether any descriptions are too short. Descriptions that are too short may be of no use when generating embeddings.

_Genre field_ - we need to check how chaotic the genres are. They are currently comma-separated in one column, but we need to see whether this holds consistently across all rows or whether there are any anomalies. Inconsistencies could cause problems when creating embeddings or trying to detect a genre.

_Duplicates_ - we need to check for exact duplicate rows and for partial ("combinational") duplicates. For example, the same title + author could repeat with two different ISBN codes; two rows for the same book could each hold part of the data (one with title, author, and ISBN, the other with title, description, and ISBN) and may need to be combined; or descriptions could be near-identical while other fields differ. Duplicates matter for indexing and recommendation quality because they can bias results towards the duplicated entity, which we need to avoid.
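The "too short" check on descriptions can be sketched as a word-count mask. The cut-off of 10 words is an arbitrary assumption for illustration, and `short_description_mask` and `MIN_WORDS` are hypothetical names; a sensible threshold would need to be chosen by inspecting real rows.

```python
import pandas as pd

MIN_WORDS = 10  # assumed cut-off, not taken from the dataset

def short_description_mask(desc: pd.Series, min_words: int = MIN_WORDS) -> pd.Series:
    """True where a description exists but has fewer than `min_words` words."""
    # Missing descriptions are counted separately, so they are excluded here
    word_counts = desc.fillna("").str.split().str.len()
    return desc.notna() & (word_counts < min_words)

# Toy data for illustration only
toy = pd.Series([
    "A sweeping history of a small town and the families who built it over two centuries.",
    "Great book.",
    None,
])
mask = short_description_mask(toy)
print(mask.tolist())
```

Splitting on whitespace is a rough proxy for length; it would undercount text in languages that do not separate words with spaces.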

#### _Genre_

In [None]:
df[(~df["genre"].isna()) & (~df["genre"].str.contains(",", na=False))]
df[~df["genre"].str.contains(",", na=True)]
# The two queries above return the same rows
# Single-genre rows are also possible

Unnamed: 0,author,desc,genre,isbn,link,pages,title
16,Umberto Eco,In the course of the long debate on the nature...,History,9027232938,https://goodreads.com/book/show/1001231.On_The...,0,On The Medieval Theory Of Signs
44,"Renate Klein,Janice G. Raymond,Lynette Dumble",AÂ classic text for health activists and femin...,Feminism,963008307,https://goodreads.com/book/show/1001463.Ru_486,0,"Ru 486: Misconceptions, Myths, and Morals"
47,Richard L. Kagan,This engrossing book examines the particular i...,History,300083149,https://goodreads.com/book/show/1001480.Urban_...,0,"Urban Images Of The Hispanic World, 1493 1793"
131,Andrew Bradstock,"""The present state of the old world is running...",History,1845117654,https://goodreads.com/book/show/10022889-radic...,224,Radical Religion in Cromwell's England: A Conc...
133,"Donna Campbell,Nicholas A. Veronico,John M. C...",,Aviation,879388544,https://goodreads.com/book/show/1002306.F4U_Co...,144,"F4U Corsair: Combat, Development and Racing Hi..."
...,...,...,...,...,...,...,...
99877,JenÅ‘ RejtÅ‘,"Pipacs alacsony, de igen vastag ember volt. Va...",Humor,9636980039,https://goodreads.com/book/show/17317286-pipac...,150,"Pipacs, a fenegyerek"
99897,Judy Lowe,Never garden alone! The Month-By-Month series ...,Reference,1591865786,https://goodreads.com/book/show/17318234-tenne...,240,Tennessee Kentucky Month-by-Month Gardening: ...
99906,Mike Bartlett,Two jobs. Three candidates. This would be a re...,Plays,1848422806,https://goodreads.com/book/show/17318546-bull,64,Bull
99959,Samuel Carr,Stunningly designed with nostalgic illustratio...,Poetry,184994119X,https://goodreads.com/book/show/17318821-ode-t...,96,Ode to Flowers: A Celebration of the Poetry of...


In [None]:
df1 = df.copy()
df1 = df1[~df1["genre"].isna()]
df1["split_genre"] = df1["genre"].str.split(",")
df1.head()

Unnamed: 0,author,desc,genre,isbn,link,pages,title,split_genre
0,Laurence M. Hauptman,Reveals that several hundred thousand Indians ...,"History,Military History,Civil War,American Hi...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,Between Two Fires: American Indians in the Civ...,"[History, Military History, Civil War, America..."
1,"Charlotte Fiell,Emmanuelle Dirix",Fashion Sourcebook - 1920s is the first book i...,"Couture,Fashion,Historical,Art,Nonfiction",1906863482,https://goodreads.com/book/show/10010552-fashi...,576,Fashion Sourcebook 1920s,"[Couture, Fashion, Historical, Art, Nonfiction]"
2,Andy Anderson,The seminal history and analysis of the Hungar...,"Politics,History",948984147,https://goodreads.com/book/show/1001077.Hungar...,124,Hungary 56,"[Politics, History]"
3,Carlotta R. Anderson,"""All-American Anarchist"" chronicles the life a...","Labor,History",814327079,https://goodreads.com/book/show/1001079.All_Am...,324,All-American Anarchist: Joseph A. Labadie and ...,"[Labor, History]"
5,Jeffrey Pfeffer,Why is common sense so uncommon when it comes ...,"Business,Leadership,Romance,Historical Romance...",875848419,https://goodreads.com/book/show/1001090.The_Hu...,368,The Human Equation: Building Profits by Puttin...,"[Business, Leadership, Romance, Historical Rom..."


In [None]:
def get_unique_genres(df, column):
    # Use the arguments instead of the global df1 so the helper works on any frame
    return df[column].explode().unique().tolist()
    # Equivalent loop-based version:
    # all_genres = []
    # for genre_list in df[column].tolist():
    #     all_genres.extend(genre_list)
    # return list(set(all_genres))

all_genres = get_unique_genres(df1, "split_genre")
len(all_genres)
# count 1182

1182
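Beyond the number of unique genres, it may help to see how often each genre occurs. The sketch below counts each genre once per row, since some rows in the sample output repeat a genre (for example `Manga` twice); `genre_counts` is a hypothetical helper that works on the raw comma-separated column.

```python
import pandas as pd

def genre_counts(genre: pd.Series) -> pd.Series:
    """Count how many rows mention each genre, counting a genre once per row."""
    exploded = (
        genre.dropna()
        .str.split(",")
        # Strip whitespace, drop empty pieces, and deduplicate within the row
        .apply(lambda items: sorted({g.strip() for g in items if g.strip()}))
        .explode()
    )
    return exploded.value_counts()

# Toy data for illustration only
toy = pd.Series(["History,Politics", "History", None, "Sequential Art,Manga,Manga"])
counts = genre_counts(toy)
print(counts)
```

A long tail of genres that appear only once or twice would be a hint that some labels are too specific to be useful as features.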

#### _Duplicates_

- There are no complete duplicate rows in the dataset.

In [None]:
# Find duplicated rows - None baabbyy
# Find duplicates titles + authors
df2 = df[df.duplicated(subset=["title", "author"])]

In [None]:
# Find duplicates on title + author + isbn

df2 = df[df.duplicated(subset=["title", "author", "isbn"])]
len(df2)    # 92
df2.head()

# Count how often duplicate titles + authors appear
# df.value_counts(["title", "author", "isbn"])

Unnamed: 0,author,desc,genre,isbn,link,pages,title
6055,Milla Paloniemi,,"Sequential Art,Comics,European Literature,Finn...",,https://goodreads.com/book/show/10400365-kiroi...,80,Kiroileva siili
22072,Lee Hyeon-sook,"In 18th century England, Gabriel, an orphan gi...","Sequential Art,Manga,Manga,Manhwa,Sequential A...",,https://goodreads.com/book/show/11474117-savag...,196,Savage Garden
22073,Lee Hyeon-sook,"In 18th century England, Gabriel, an orphan gi...","Sequential Art,Manga,Manga,Manhwa,Sequential A...",,https://goodreads.com/book/show/11474144-savag...,196,Savage Garden
44991,Mi-Ri Hwang,,"Sequential Art,Manga,Manga,Manhwa,Romance,Mang...",,https://goodreads.com/book/show/13094539-honggane,0,Honggane
44993,Mi-Ri Hwang,,"Sequential Art,Manga,Manga,Manhwa,Romance,Mang...",,https://goodreads.com/book/show/13094586-honggane,0,Honggane


In [None]:
# Testing with an example
# df3 = df[(df["title"] == "Love in the Mask") & (df["author"] == "Yu-Rang Han")]
# df3 = df[(df["title"] == "Honggane") & (df["author"] == "Mi-Ri Hwang")]
df3 = df[(df["title"] == "The Mahabharata") & (df["author"] == "Krishna-Dwaipayana Vyasa,Bibek Debroy")]
df3 = df[(df["title"] == "Kiroileva siili") & (df["author"] == "Milla Paloniemi")]
df3.shape

# Check if they are true duplicates or partial duplicates
df3.nunique()

# Visually inspect rows
df3.head(20)

Unnamed: 0,author,desc,genre,isbn,link,pages,title
5978,Milla Paloniemi,"We all love hedgehogs, don't we? How could any...","Sequential Art,Comics,European Literature,Finn...",,https://goodreads.com/book/show/10395261-kiroi...,80,Kiroileva siili
6055,Milla Paloniemi,,"Sequential Art,Comics,European Literature,Finn...",,https://goodreads.com/book/show/10400365-kiroi...,80,Kiroileva siili
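Instead of inspecting one title at a time, the duplicate groups could be summarised in a single table. This is a sketch under the assumption that a group is defined by an exact match on `title` and `author`; `duplicate_groups` and its output column names are hypothetical.

```python
import pandas as pd

def duplicate_groups(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise every title + author pair that occurs more than once."""
    grouped = frame.groupby(["title", "author"], dropna=False)
    summary = grouped.agg(
        rows=("link", "size"),               # number of rows in the group
        distinct_isbn=("isbn", "nunique"),   # NaN is not counted by nunique
        with_desc=("desc", "count"),         # count() skips missing descriptions
    )
    return summary[summary["rows"] > 1].sort_values("rows", ascending=False)

# Toy data for illustration only; descriptions and links are placeholders
toy = pd.DataFrame({
    "title": ["Kiroileva siili", "Kiroileva siili", "Bull"],
    "author": ["Milla Paloniemi", "Milla Paloniemi", "Mike Bartlett"],
    "desc": ["placeholder description", None, "placeholder description"],
    "isbn": [None, None, "1848422806"],
    "link": ["link-a", "link-b", "link-c"],
})
dups = duplicate_groups(toy)
print(dups)
```

Groups where `with_desc` is smaller than `rows` are candidates for merging, because at least one row is missing a description that another row in the group may provide.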


#### _Descriptions_

In [None]:
df.head()

Unnamed: 0,author,desc,genre,isbn,link,pages,title
0,Laurence M. Hauptman,Reveals that several hundred thousand Indians ...,"History,Military History,Civil War,American Hi...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,Between Two Fires: American Indians in the Civ...
1,"Charlotte Fiell,Emmanuelle Dirix",Fashion Sourcebook - 1920s is the first book i...,"Couture,Fashion,Historical,Art,Nonfiction",1906863482,https://goodreads.com/book/show/10010552-fashi...,576,Fashion Sourcebook 1920s
2,Andy Anderson,The seminal history and analysis of the Hungar...,"Politics,History",948984147,https://goodreads.com/book/show/1001077.Hungar...,124,Hungary 56
3,Carlotta R. Anderson,"""All-American Anarchist"" chronicles the life a...","Labor,History",814327079,https://goodreads.com/book/show/1001079.All_Am...,324,All-American Anarchist: Joseph A. Labadie and ...
4,Jean Leveille,"Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa ta...",,2761920813,https://goodreads.com/book/show/10010880-les-o...,177,Les oiseaux gourmands
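Row 4 in the output above shows text such as `Aujourdâ€™hui`, which looks like UTF-8 that was decoded as Windows-1252 at some point. That cause is an assumption, not something the notebook confirms. If it holds, a round trip through the two encodings can undo the damage; `repair_mojibake` is a hypothetical helper and falls back to the original text whenever the round trip fails.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252 (assumed cause)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage, or already clean: leave the text untouched
        return text

print(repair_mojibake("Aujourdâ€™hui, lâ€™oiseau"))
print(repair_mojibake("Les oiseaux gourmands"))
```

This only handles one specific kind of encoding damage. Rows that mix clean and damaged text, or that were damaged differently, are returned unchanged, so results should be spot-checked before the repaired text is used for embeddings.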


In [None]:
# Do descriptions have HTML tags?

# match a substring that starts with a <, followed by one or more characters that are not a >, and ends with a >
mask = df["desc"].str.contains(r"<[^>]+>", regex=True, na=False)
print(f"mask_desc_html_tags count: {mask.sum()}")
df[mask]

mask_desc_html_tags count: 14


Unnamed: 0,author,desc,genre,isbn,link,pages,title
3992,"NisiOisiN,VOFAN",100ãƒ‘ãƒ¼ã‚»ãƒ³ãƒˆä¿®ç¾…ã§æ›¸ã‹ã‚ŒãŸå°èª¬ã...,"Novels,Light Novel,Asian Literature,Japanese L...",4062837676.0,https://goodreads.com/book/show/10278609-kabuk...,356,å‚¾ç‰©èªž [Kabukimonogatari]
12531,Dan Pearson,<!--StartFragment--> Ten years ago Dan Pearson...,"Gardening,Nonfiction",1840915374.0,https://goodreads.com/book/show/10834573-home-...,272,Home Ground: Sanctuary in the City
13396,Architecture For Humanity,"Design Like You Give a Damn [2], is the indisp...","Architecture,Design,Nonfiction",810997029.0,https://goodreads.com/book/show/10887582-desig...,336,Design Like You Give a Damn {2}: Building Chan...
26520,F. Burton Howard,"Eternal marriage cannot be ,achieved without ...","Christianity,Lds,Religion,Nonfiction",1590382765.0,https://goodreads.com/book/show/1182250.Eterna...,58,Eternal Marriage and the Parable of the Silver...
39896,Ben Kiernan,This book narrates the history of the differen...,"History,Nonfiction,Cultural,Asia,Cultural",195160762.0,https://goodreads.com/book/show/12781033-viet-nam,592,Viet Nam: A History from Earliest Times to the...
40570,Brit Stakston,"From the backside:,Nu behÃ¶ver ni inte ha dÃ¥l...","Social Science,Social Media",9187003007.0,https://goodreads.com/book/show/12832059-gilla...,160,"Gilla! Dela Engagemang, Passion Och IdÃ©er Via..."
64320,Liz Adair,",I know there was a problem ,with the mine at...","Mystery,Lds,Lds Fiction,Christianity,Lds",1590383060.0,https://goodreads.com/book/show/1432469.Snakew...,248,Snakewater Affair: A Spider Latham Mystery
76399,Mary Ellen Edmunds,"style=""COLOR: black; mso-bidi-font-size: 10.0p...","Christianity,Lds,Nonfiction,Inspirational,Reli...",1590383125.0,https://goodreads.com/book/show/1574830.Mee_Th...,185,Mee Thinks: Random Thoughts on Life's Wrinkles
82896,Orna Ross,"""One is immersed in this epic story immediatel...","Historical,Historical Fiction,Cultural,Ireland...",,https://goodreads.com/book/show/15994867-after...,293,After the Rising (An Irish Trilogy Book 1)
83359,"Sadanatsu Anda,åºµç”° å®šå¤,Shiromizakana,ç™½...",æ­¢ã¾ã‚‰ãªã„ã€æ­¢ã¾ã‚Œãªã„ã€æ­¢ã‚ã‚‰ã...,"Novels,Light Novel,Mystery,Asian Literature,Ja...",4047265373.0,https://goodreads.com/book/show/16007775-kokor...,314,ã‚³ã‚³ãƒ­ã‚³ãƒã‚¯ãƒˆ ã‚­ã‚ºãƒ©ãƒ³ãƒ€ãƒ [Koko...


In [None]:
# Are there escaped characters?

# match a single, literal backslash character
mask = df["desc"].str.contains(r"\\", regex=True, na=False)
print(f"mask_desc_escaped_chars count: {mask.sum()}")

mask_desc_escaped_chars count: 24


In [None]:
# HTML escaped characters

mask = df["desc"].str.contains(r"&[A-Za-z0-9#]+;", regex=True, na=False)
print(f"mask_desc_html_escaped count: {mask.sum()}")
df.loc[86622]["desc"]

mask_desc_html_escaped count: 56


'Bargain omnibus editions of one of the bestselling manga series of all time!,In an alchemical ritual gone wrong, Edward Elric lost his arm and his leg, and his brother Alphonse became nothing but a soul in a suit of armor. Equipped with mechanical ""auto-mail"" limbs, Edward becomes a state alchemist, seeking the one thing that can restore his and his brother\'s bodies...the legendary Philosopher\'s Stone.,Contains volumes 10, 11 and 12 of Fullmetal Alchemist!, , Ed returns to Resembool and meets his estranged father Hohenheim for the first time in many years. Though their meeting is brief and strained, Ed comes away with the revelation that Al&#8217;s body is still alive somewhere. But before the newly energized brothers can search for it, Scar returns, catalyzing an unlikely alliance between the Elric brothers, Prince Lin of Xing, and Colonel Mustang. Though they hope to use Scar to lure in a homunculus, the hunters become the hunted when Gluttony proves more than they can handle., 
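The description above contains the numeric entity `&#8217;`. Python's standard `html.unescape` decodes such entities, which is also what the cleaning step later in this notebook relies on. A small check, reusing the entity pattern from the cell above on a fragment of that description:

```python
import html
import re

ENTITY_PATTERN = re.compile(r"&[A-Za-z0-9#]+;")  # same pattern as the check above

snippet = "Al&#8217;s body is still alive somewhere."
decoded = html.unescape(snippet)

print(decoded)
# Before decoding the pattern matches; after decoding it no longer does
print(bool(ENTITY_PATTERN.search(snippet)), bool(ENTITY_PATTERN.search(decoded)))
```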

In [None]:
# Control characters (ASCII range 0x00-0x1F plus DEL; this includes \n, \t and \r)
mask = df["desc"].str.contains(r"[\x00-\x1F\x7F]", regex=True, na=False)
print(f"mask_desc_control_chars count: {mask.sum()}")

mask_desc_control_chars count: 1881
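The character range used above also covers ordinary newlines, tabs, and carriage returns, so the count on its own does not say which characters are present. A sketch of a per-character breakdown follows; `control_char_counts` is a hypothetical helper and the toy data is for illustration only.

```python
from collections import Counter
import re

import pandas as pd

CONTROL_CHARS = re.compile(r"[\x00-\x1F\x7F]")  # same range as the check above

def control_char_counts(desc: pd.Series) -> Counter:
    """Count each control character across all non-missing descriptions."""
    counts = Counter()
    for text in desc.dropna():
        # repr() makes invisible characters readable, e.g. '\n' or '\t'
        counts.update(repr(ch) for ch in CONTROL_CHARS.findall(text))
    return counts

toy = pd.Series(["line one\nline two", "tab\tseparated\tvalues", None, "clean text"])
counts = control_char_counts(toy)
print(counts.most_common())
```

If the breakdown on the real data turns out to be dominated by newlines and tabs, replacing them with spaces is probably enough; other control characters may need to be removed outright.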


## Data Manipulation

In [None]:
from pathlib import Path
import html
import pandas as pd
import re

In [None]:
INPUT_COLS = ["title", "author", "desc", "genre", "isbn", "link", "pages"]
COMMON_GENRE_DELIMS = [";", "|", "/", "•"]

In [None]:
def normalize_author(author_col):
    # Ensure author column is string
    authors_list = author_col.fillna("").astype(str)

    # Replace separators in bulk
    authors_list = (authors_list
                    .str.replace(";", ",", regex=False)
                    .str.replace("&", ",", regex=False)
                    .str.replace(r"\s+and\s+", ",", regex=True)
                )

    # Remove editor markers in bulk
    authors_list = (authors_list
                    .str.replace(r"\(.*?editor.*?\)", "", regex=True, case=False)
                    .str.replace(r"\beditor\b", "", regex=True, case=False)
                    # no trailing \b: it would only match when a word character follows "ed."
                    .str.replace(r"\bed\.", "", regex=True, case=False)
                    )

    # Collapse extra spaces
    authors_list = authors_list.str.replace(r"\s+", " ", regex=True).str.strip()

    # Split into lists
    authors_list = authors_list.str.split(",")

    # Trim whitespace inside list items
    authors_list = authors_list.apply(lambda lst: [item.strip() for item in lst if item.strip()])

    # Remove invalid values
    invalid_authors = ["Unknown"]
    authors_list = authors_list.apply(lambda lst: [item for item in lst if item not in invalid_authors])

    # Deduplicate + title case + sort
    # Note: str.title() also lowercases suffixes such as "II" (it becomes "Ii")
    authors_list = authors_list.apply(lambda lst: sorted({item.title() for item in lst}))

    return authors_list

def normalize_genre(genre_col):
    # Ensure string type
    genre_list = genre_col.fillna("").astype(str)

    # Replace common separators with commas
    for sep in COMMON_GENRE_DELIMS:
        genre_list = genre_list.str.replace(sep, ",", regex=False)

    # Collapse extra spaces
    genre_list = genre_list.str.replace(r"\s+", " ", regex=True).str.strip()

    # Split into lists
    genre_list = genre_list.str.split(",")

    # Trim whitespace inside list items
    genre_list = genre_list.apply(lambda lst: [item.strip() for item in lst if item.strip()])

    # Drop truncated / obviously invalid entries, then deduplicate + title case + sort
    genre_list = genre_list.apply(lambda lst: sorted({item.title() for item in lst if len(item) > 2 and "..." not in item}))

    return genre_list

def normalize_text_field(text_col):
    # Ensure string type
    text_col = text_col.fillna("").astype(str)

    # Remove HTML tags
    text_col = text_col.str.replace(r"<[^>]+>", "", regex=True)

    # Decode HTML entities before whitespace handling, so that decoded
    # whitespace (e.g. &nbsp;) is normalised instead of being dropped below
    text_col = text_col.apply(html.unescape)

    # Collapse newlines, tabs and repeated whitespace into single spaces
    text_col = text_col.str.replace(r"\s+", " ", regex=True).str.strip()

    # Remove remaining control/non-printable characters
    text_col = text_col.apply(lambda s: "".join(ch for ch in s if ch.isprintable()))

    return text_col

# Unions genres across rows that share the same title and author(s); no rows are dropped
def handle_duplicate_rows(df):
    # Make unique key using title and author
    title_lower = df["title"].str.lower()
    author_list_lower = df["author"].apply(lambda lst: tuple(sorted([item.lower() for item in lst])))
    df["title_author_key"] = list(zip(title_lower, author_list_lower))

    # Union genres for each title + author(s) group
    genre_union_map = {}
    for idx, row in df.iterrows():
        key = row["title_author_key"]
        if key not in genre_union_map:
            genre_union_map[key] = set()
        genre_union_map[key].update(row["genre"])

    # Convert sets to lists
    for key in genre_union_map:
        genre_union_map[key] = list(genre_union_map[key])

    # Apply the unioned genres to all rows
    df["genre_union"] = df["title_author_key"].map(genre_union_map)

    # Drop helper column
    df = df.drop(columns="title_author_key")

    return df
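`handle_duplicate_rows` walks the frame with `iterrows`, which can be slow on 100k rows, and it returns the unioned genres in set order. As an alternative sketch (not the notebook's method), the same union can be built with plain iteration over the columns, without touching the input frame, and with sorted output so that results are reproducible. `union_genres` is a hypothetical name, and the input is assumed to have already been through the normalisation steps, so `author` and `genre` hold lists.

```python
from collections import defaultdict

import pandas as pd

def union_genres(frame: pd.DataFrame) -> pd.Series:
    """Return, for every row, the sorted union of genres across its title + author group."""
    # Same key as in handle_duplicate_rows: lowercased title + sorted lowercased authors
    keys = [
        (title.lower(), tuple(sorted(name.lower() for name in authors)))
        for title, authors in zip(frame["title"], frame["author"])
    ]
    union_by_key = defaultdict(set)
    for key, genres in zip(keys, frame["genre"]):
        union_by_key[key].update(genres)
    return pd.Series([sorted(union_by_key[key]) for key in keys], index=frame.index)

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["Savage Garden", "savage garden", "Bull"],
    "author": [["Lee Hyeon-Sook"], ["Lee Hyeon-Sook"], ["Mike Bartlett"]],
    "genre": [["Manga", "Sequential Art"], ["Manhwa"], ["Plays"]],
})
toy["genre_union"] = union_genres(toy)
print(toy[["title", "genre_union"]])
```

Sorting the union differs from the original function, which keeps whatever order the set produces, so outputs would not match the `genre_union` values shown later element for element.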

In [None]:
# Step 0: snapshot raw CSV
# snapshot = snapshot_raw(input_csv, CONFIG["snapshot_dir"])

# Step 1: schema validation
df = pd.read_csv("GoodReads_100k_books.csv", usecols=INPUT_COLS)

# Ensure expected columns exist
missing_cols = [c for c in INPUT_COLS if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing columns in input CSV: {missing_cols}")

# Only use the needed columns for processing
df = df[INPUT_COLS]

# Step 2: keep raw backup columns
for c in INPUT_COLS:
    df[f"raw_{c}"] = df[c]  # keep the original values, including NaN
    df[c] = df[c].fillna("").astype(str)

# Step 3: convert and standardize pages as integers
df["pages"] = pd.to_numeric(df["pages"], errors="coerce").fillna(0).astype(int).clip(lower=0)

# Step 4: normalize authors
df["author"] = normalize_author(df["author"])

# Step 5: normalize genres
df["genre"] = normalize_genre(df["genre"])

# Step 6: normalize titles and descriptions
df["title"] = normalize_text_field(df["title"])
df["desc"] = normalize_text_field(df["desc"])

# Step 7: handling duplicate data
df = handle_duplicate_rows(df)

# Step 8: Remove missing or incomplete data
# Drop rows only if all essential fields are invalid:
# no authors AND no genres AND a blank description AND a title containing "unknown".
# (Blank titles are not checked by the mask below.)
mask = (df["author"].apply(lambda x: len(x) == 0) &
        df["genre_union"].apply(lambda x: len(x) == 0) &
        df["desc"].apply(lambda x: (not isinstance(x, str)) or (x.strip() == "")) &
        df["title"].apply(lambda t: (not isinstance(t, str)) or ("unknown" in t.lower())))

df_cleaned = df[~mask].copy()

# For now
df_cleaned = df_cleaned.drop(df_cleaned.filter(regex='^raw').columns, axis=1)

In [None]:
num_rows_removed = mask.sum()
print(f"Number of rows removed in Step 8: {num_rows_removed}")

In [None]:
df_cleaned.head()

Unnamed: 0,title,author,desc,genre,isbn,link,pages,genre_union
0,Between Two Fires: American Indians in the Civ...,[Laurence M. Hauptman],Reveals that several hundred thousand Indians ...,"[American Civil War, American History, Civil W...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,"[Native Americans, Military History, Civil War..."
1,Fashion Sourcebook 1920s,"[Charlotte Fiell, Emmanuelle Dirix]",Fashion Sourcebook - 1920s is the first book i...,"[Art, Couture, Fashion, Historical, Nonfiction]",1906863482,https://goodreads.com/book/show/10010552-fashi...,576,"[Couture, Fashion, Art, Historical, Nonfiction]"
2,Hungary 56,[Andy Anderson],The seminal history and analysis of the Hungar...,"[History, Politics]",948984147,https://goodreads.com/book/show/1001077.Hungar...,124,"[History, Politics]"
3,All-American Anarchist: Joseph A. Labadie and ...,[Carlotta R. Anderson],"""All-American Anarchist"" chronicles the life a...","[History, Labor]",814327079,https://goodreads.com/book/show/1001079.All_Am...,324,"[History, Labor]"
4,Les oiseaux gourmands,[Jean Leveille],"Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa tab...",[],2761920813,https://goodreads.com/book/show/10010880-les-o...,177,[]


### _Analyzing the cleaned data_

In [None]:
# df[(df["title"].isna())]
df[df["title"].str.lower().str.contains("unknown")]

Unnamed: 0,title,author,desc,genre,isbn,link,pages,genre_union
449,Supernatural: Your Guide Through the Unexplain...,"[Colin Wilson, Damon Wilson]",Popular culture has created a new thirst for s...,"[Fantasy, Nonfiction, Occult, Paranormal, Supe...",1907486550,https://goodreads.com/book/show/10048356-super...,600,"[Fantasy, Paranormal, Nonfiction, Supernatural..."
2007,Unknown Armies,"[Greg Stolze, J. Tynes]","What will you risk to change the world? ,The a...","[Fantasy, Fiction, Games, Gaming, Horror, Role...",1589780132,https://goodreads.com/book/show/1013943.Unknow...,336,"[Fantasy, Sports And Games, Gaming, Games, Urb..."
2479,The Once Unknown Familiar: Shamanic Paths to U...,[Timothy Roderick],Discover the magical animal of power residing ...,[Witchcraft],875424392,https://goodreads.com/book/show/1016970.The_On...,240,[Witchcraft]
3360,Memoirs of a Monster Hunter: A Five-Year Journ...,[Nick Redfern],"For centuries, people across the world have ha...","[Autobiography, Cryptozoology, Fantasy, Folklo...",1564149765,https://goodreads.com/book/show/1023148.Memoir...,256,"[Monsters, Memoir, Fantasy, Paranormal, Folklo..."
3648,Unknown Book 10254514,[Unknown Author 413],,"[Manga, Sequential Art, Yuri]",,https://goodreads.com/book/show/10254514-unkno...,0,"[Yuri, Manga, Sequential Art]"
4027,His Unknown Heir,[Chantelle Shaw],"Ramon Velaquez, heir to the Velaquez winery, c...","[Adult, Category Romance, Contemporary, Contem...",263886328,https://goodreads.com/book/show/10281439-his-u...,144,"[Adult, Harlequin Presents, Contemporary Roman..."
4034,The Unknown Five,"[Alfred Bester, Cleve Cartmill, Donald R. Bens...","Contents:,â€¢ Author! Author! by Isaac Asimov,...","[Anthologies, Science Fiction]",,https://goodreads.com/book/show/10281867-the-u...,192,"[Anthologies, Science Fiction]"
5548,Borderlands: The Ultimate Exploration of the U...,[Mike Dash],"Explore the ,Borderlands,...,* The charred rem...","[Fantasy, Nonfiction, Occult, Paranormal]",440236568,https://goodreads.com/book/show/1037360.Border...,544,"[Paranormal, Nonfiction, Fantasy, Occult]"
9859,The Quickening: Unknown Poetry of Tahirih,"[Amrollah Hemmat, John S. Hatcher]","The Quickening, is a newly translated collecti...","[Baha I, Religion]",1931847835,https://goodreads.com/book/show/10650046-the-q...,261,"[Religion, Baha I]"
13539,Unknown Earth: A Handbook of Geological Enigmas,[William R. Corliss],,"[Geology, Science]",915554062,https://goodreads.com/book/show/1089700.Unknow...,0,"[Geology, Science]"


In [None]:
invalid_authors = ['Unknown']
mask = df_cleaned["author"].apply(lambda g: (g is None) or (len(g) == 0) or (len(set(g).intersection(invalid_authors)) > 0))
# len(df_cleaned[mask])
df_cleaned[mask]

Unnamed: 0,title,author,desc,genre,isbn,link,pages,genre_union
4540,Poema de FernÃ¡n GonzÃ¡lez,[],,"[European Literature, Poetry, Spanish Literature]",8470390252.0,https://goodreads.com/book/show/10312702-poema...,147,"[Poetry, Spanish Literature, European Literature]"
7399,The Yoga Cookbook: Vegetarian Food for Body an...,[],",Eat Wisely and Well, , The teachings of yoga ...","[Cookbooks, Cooking, Food, Food And Drink, Non...",684856417.0,https://goodreads.com/book/show/1048226.The_Yo...,160,"[Self Help, Food And Drink, Vegan, Vegetarian,..."
17598,Imasara Hito Ni Kikenai RÄ«su Torihiki No HÅri...,[],,[],4901380842.0,https://goodreads.com/book/show/11189913-imasa...,239,[]
23712,"Look Homeward, Angel Part 2 of 2",[],,[],5557123725.0,https://goodreads.com/book/show/1158811.Look_H...,0,[]
25533,Unknown Book 11751242,[],,"[Computers, Internet, Web]",,https://goodreads.com/book/show/11751242-unkno...,0,"[Web, Internet, Computers]"
31146,Peak Oil,[],,[],3462033514.0,https://goodreads.com/book/show/12166429-peak-oil,0,[]
31178,"Playboy's Party Jokes, No. 6",[],,[],515070610.0,https://goodreads.com/book/show/12170158-playb...,0,[]
57285,Judith (poem),[],,[],,https://goodreads.com/book/show/13611543-judith,0,[]
71749,"Diggers, Tractors and Trucks (Jigsaw Transport)",[],"Find out about diggers, tractors, and trucks w...",[],333781023.0,https://goodreads.com/book/show/15204616-digge...,12,[]
81524,Ancient and Modern Full Music Edition: Hymns a...,[],The world's most famous hymn book has undergon...,[],1848252420.0,https://goodreads.com/book/show/15902230-ancie...,1904,[]


In [None]:
mask = (df["desc"].isna()) | (df["desc"] == "")
df[mask]
len(df[mask]) / len(df) * 100

6.772

In [None]:
mask = df["genre_union"].apply(lambda g: g is None or len(g) == 0)
len(df[mask]) / len(df) * 100

10.463000000000001

In [None]:
df[df["pages"] == 0]

Unnamed: 0,title,author,desc,genre,isbn,link,pages,genre_union
0,Between Two Fires: American Indians in the Civ...,[Laurence M. Hauptman],Reveals that several hundred thousand Indians ...,"[American Civil War, American History, Civil W...",002914180X,https://goodreads.com/book/show/1001053.Betwee...,0,"[Native Americans, Military History, Civil War..."
14,Anarchism & Environmental Survival,[Graham Purchase],"In this wide-ranging book, Graham Purchase, on...","[Biology, Ecology]",961328983,https://goodreads.com/book/show/1001220.Anarch...,0,"[Ecology, Biology]"
16,On The Medieval Theory Of Signs,[Umberto Eco],In the course of the long debate on the nature...,[History],9027232938,https://goodreads.com/book/show/1001231.On_The...,0,[History]
38,"Scripting the Black Masculine Body: Identity, ...",[Ronald Jackson Ii],,[],791482375,https://goodreads.com/book/show/10014315-scrip...,0,[]
39,A Narrative of the Proceedings of the Black Pe...,[Richard Allen],,[],915992582,https://goodreads.com/book/show/1001432.A_Narr...,0,[]
...,...,...,...,...,...,...,...,...
99979,I was Sold to My Dead Brother's Best Friend,[Jaqueline E. Pearson],"My brother disappeared when I was eight, after...","[Fantasy, Paranormal, Vampires]",,https://goodreads.com/book/show/17319136-i-was...,0,"[Vampires, Fantasy, Paranormal]"
99980,Say say say(#2),[Mi-Ri Hwang],,"[Manga, Manhwa, Sequential Art]",,https://goodreads.com/book/show/17319160-say-s...,0,"[Manhwa, Manga, Sequential Art]"
99983,Scrap Quilts,[Roberta Horton],Important Note about PRINT ON DEMAND Editions:...,"[Crafts, Quilting]",1571200479,https://goodreads.com/book/show/1731941.Scrap_...,0,"[Quilting, Crafts]"
99984,Calico and Beyond - Print on Demand Edition,[Roberta Horton],Important Note about PRINT ON DEMAND Editions:...,"[Crafts, Quilting]",914881035,https://goodreads.com/book/show/1731942.Calico...,0,"[Quilting, Crafts]"


In [None]:
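# Step 2 replaced NaN with "", so isna() no longer detects missing ISBNs;
# compare against "" instead to count them.
len(df[df["isbn"].isna()]) / len(df) * 100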

0.0

In [None]:
df[(df["link"].isna()) | (df["link"] == "Unknown") | (df["link"] == "")]

Unnamed: 0,title,author,desc,genre,isbn,link,pages,genre_union


## Books_Cleaned :)

In [None]:
import pandas as pd
df = pd.read_csv("books_cleaned.csv")

In [None]:
df.head()

Unnamed: 0,title,author,desc,genre_union
0,Between Two Fires: American Indians in the Civ...,['Laurence M. Hauptman'],Reveals that several hundred thousand Indians ...,"['Military History', 'Nonfiction', 'Civil War'..."
1,Fashion Sourcebook 1920s,"['Charlotte Fiell', 'Emmanuelle Dirix']",Fashion Sourcebook - 1920s is the first book i...,"['Art', 'Nonfiction', 'Historical', 'Couture',..."
2,Hungary 56,['Andy Anderson'],The seminal history and analysis of the Hungar...,"['Politics', 'History']"
3,All-American Anarchist: Joseph A. Labadie and ...,['Carlotta R. Anderson'],"""All-American Anarchist"" chronicles the life a...","['Labor', 'History']"
4,Les oiseaux gourmands,['Jean Leveille'],"Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa tab...",[]


In [None]:
len(df)

99996
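In the `df.head()` output above, `author` and `genre_union` come back from the CSV as strings such as `"['Politics', 'History']"` rather than as lists. One way to restore them is to parse those columns with `ast.literal_eval` while reading. This is a sketch, assuming every cell in those columns is a valid Python list literal; `LIST_COLS` is a hypothetical name and the inline CSV stands in for `books_cleaned.csv`.

```python
import ast
import io

import pandas as pd

LIST_COLS = ["author", "genre_union"]  # columns assumed to hold list literals

# Small inline stand-in for books_cleaned.csv; descriptions are placeholders
csv_text = io.StringIO('''title,author,desc,genre_union
Hungary 56,"['Andy Anderson']",placeholder description,"['Politics', 'History']"
Les oiseaux gourmands,"['Jean Leveille']",placeholder description,[]
''')

# literal_eval safely parses Python literals and does not execute arbitrary code
books = pd.read_csv(csv_text, converters={col: ast.literal_eval for col in LIST_COLS})

print(books["genre_union"].tolist())
print(type(books.loc[0, "author"]))
```

An empty cell would make `ast.literal_eval` raise, so a wrapper that returns an empty list for blank input may be needed if the real file contains any.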

## Vector Embedding