In [19]:
import re
import pandas as pd
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Preprocessing

In [58]:
df = pd.read_csv('../data/raw/books_dataset.csv')
df = df[['book_title', 'book_authors', 'book_desc']]
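The column selection above could also be done while reading the file. A minimal sketch, assuming an in-memory CSV in place of the raw file (the `raw_csv` contents and the `book_rating` column are hypothetical):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the raw file; the extra
# "book_rating" column and its value are invented for illustration.
raw_csv = io.StringIO(
    "book_title,book_authors,book_desc,book_rating\n"
    "Twilight,Stephenie Meyer,This is a love story with bite.,3.6\n"
)

# usecols tells read_csv to parse only the listed columns.
wanted = ["book_title", "book_authors", "book_desc"]
toy = pd.read_csv(raw_csv, usecols=wanted)
print(toy.columns.tolist())
```

This avoids loading columns that are discarded on the next line, which may matter for larger files.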

In [66]:
df.head()

Unnamed: 0,book_title,book_authors,book_desc
0,The Hunger Games,Suzanne Collins,"Winning will make you famous. Losing means certain death.The nation of Panem, formed from a post-apocalyptic North America, is a country that consists of a wealthy Capitol region surrounded by 12 poorer districts. Early in its history, a rebellion led by a 13th district against the Capitol resulted in its destruction and the creation of an annual televised event known as the Hunger Games. In punishment, and as a reminder of the power and grace of the Capitol, each district must yield one boy and one girl between the ages of 12 and 18 through a lottery system to participate in the games. The 'tributes' are chosen during the annual Reaping and are forced to fight to the death, leaving only one survivor to claim victory.When 16-year-old Katniss's young sister, Prim, is selected as District 12's female representative, Katniss volunteers to take her place. She and her male counterpart Peeta, are pitted against bigger, stronger representatives, some of whom have trained for this their whole lives. , she sees it as a death sentence. But Katniss has been close to death before. For her, survival is second nature."
1,Harry Potter and the Order of the Phoenix,J.K. Rowling|Mary GrandPré,"There is a door at the end of a silent corridor. And it’s haunting Harry Pottter’s dreams. Why else would he be waking in the middle of the night, screaming in terror?Harry has a lot on his mind for this, his fifth year at Hogwarts: a Defense Against the Dark Arts teacher with a personality like poisoned honey; a big surprise on the Gryffindor Quidditch team; and the looming terror of the Ordinary Wizarding Level exams. But all these things pale next to the growing threat of He-Who-Must-Not-Be-Named---a threat that neither the magical government nor the authorities at Hogwarts can stop.As the grasp of darkness tightens, Harry must discover the true depth and strength of his friends, the importance of boundless loyalty, and the shocking price of unbearable sacrifice.His fate depends on them alll.(back cover)"
2,To Kill a Mockingbird,Harper Lee,"The unforgettable novel of a childhood in a sleepy Southern town and the crisis of conscience that rocked it, To Kill A Mockingbird became both an instant bestseller and a critical success when it was first published in 1960. It went on to win the Pulitzer Prize in 1961 and was later made into an Academy Award-winning film, also a classic.Compassionate, dramatic, and deeply moving, To Kill A Mockingbird takes readers to the roots of human behavior - to innocence and experience, kindness and cruelty, love and hatred, humor and pathos. Now with over 18 million copies in print and translated into forty languages, this regional story by a young Alabama woman claims universal appeal. Harper Lee always considered her book to be a simple love story. Today it is regarded as a masterpiece of American literature."
3,Pride and Prejudice,Jane Austen|Anna Quindlen|Mrs. Oliphant|George Saintsbury|Mark Twain|A.C. Bradley|Walter A. Raleigh|Virginia Woolf,"«È cosa ormai risaputa che a uno scapolo in possesso di un'ingente fortuna manchi soltanto una moglie. Questa verità è cosí radicata nella mente delle famiglie del luoho che, nel momento in cui un simile personaggio viene a far parte del vicinato, prima ancora di conoscere anche lontanamente i suoi desiderî in proposito, viene immediatamente considerato come proprietà legittima di una o l'altra delle loro figlie.»Orgoglio e pregiudizio è uno dei primi romanzi di Jane Austen. La scrittrice lo iniziò a ventun anni; il libro, rifiutato da un editore londinese, rimase in un cassetto fino alla sua pubblicazione anonima nel 1813, e da allora è considerato tra i piú importanti romanzi della letteratura inglese. È la storia delle cinque sorelle Bennet e dei loro corteggiatori, con al centro il romantico contrasto tra l'adorabile e capricciosa Elizabeth e l'altezzoso Darcy; lo spirito di osservazione implacabile e quasi cinico, lo studio arguto dei caratteri, la satira delle vanità e delle debolezze della vita domestica, fanno di questo romanzo una delle piú efficaci e indimenticabili commedie di costume del periodo Regency inglese."
4,Twilight,Stephenie Meyer,"About three things I was absolutely positive.First, Edward was a vampire.Second, there was a part of him—and I didn't know how dominant that part might be—that thirsted for my blood.And third, I was unconditionally and irrevocably in love with him.In the first book of the Twilight Saga, internationally bestselling author Stephenie Meyer introduces Bella Swan and Edward Cullen, a pair of star-crossed lovers whose forbidden relationship ripens against the backdrop of small-town suspicion and a mysterious coven of vampires. This is a love story with bite."


In [61]:
df.shape

(55442, 3)

## Nulls and duplicates

In [23]:
df = df.drop_duplicates()
df = df.dropna()
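To see how many rows each of the two steps removes, the row count can be compared before and after. A minimal sketch on a toy frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one row with a null.
toy = pd.DataFrame({
    "book_title": ["A", "A", "B", "C"],
    "book_authors": ["x", "x", "y", None],
    "book_desc": ["d1", "d1", "d2", "d3"],
})

n_start = len(toy)
deduped = toy.drop_duplicates()   # drops rows identical in every column
n_after_dupes = len(deduped)
cleaned = deduped.dropna()        # drops rows with a null in any column
n_after_nulls = len(cleaned)

print(f"duplicates removed: {n_start - n_after_dupes}")
print(f"rows with nulls removed: {n_after_dupes - n_after_nulls}")
```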

## Language standardization

In [25]:
# I'm going to keep only the books with English descriptions
DetectorFactory.seed = 0

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return None

df['is_english'] = df['book_desc'].apply(lambda x: detect_language(x) == 'en')
df = df[df['is_english']]
df = df.drop(columns='is_english')
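The same filter can be written with a boolean mask, without the temporary `is_english` column. A minimal sketch of the pattern, assuming a stub detector so that it runs without `langdetect` installed (`fake_detect` is hypothetical and is not how `langdetect.detect` works internally):

```python
import pandas as pd

def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect, for illustration only."""
    if not text.strip():
        # Mimics a detector that fails on text with no usable features.
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"

def safe_detect(text):
    # Same shape as detect_language above: swallow the failure, return None.
    try:
        return fake_detect(text)
    except ValueError:
        return None

toy = pd.DataFrame(
    {"book_desc": ["A love story with bite.", "È cosa ormai risaputa", "   "]}
)

# None == "en" is False, so rows where detection failed are dropped too.
mask = [safe_detect(text) == "en" for text in toy["book_desc"]]
english_only = toy[mask]
print(english_only)
```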

In [48]:
# I will also remove the books whose title or description contains Arabic characters:
# they are not in English, but the language detection missed them
df = df[~df['book_title'].apply(lambda x: bool(re.search(r'[\u0600-\u06FF\u0750-\u077F]', x)))]
df = df[~df['book_desc'].apply(lambda x: bool(re.search(r'[\u0600-\u06FF\u0750-\u077F]', x)))]
# Finally, drop every title that contains any non-ASCII character
df = df[df['book_title'].apply(lambda x: x.isascii())]
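The two checks in this cell behave differently, which a few sample strings make visible. A minimal sketch using only the standard library (the sample titles are hypothetical, chosen for illustration):

```python
import re

# Same character ranges as in the cell above: the Arabic and
# Arabic Supplement Unicode blocks.
ARABIC_RE = re.compile(r'[\u0600-\u06FF\u0750-\u077F]')

# Hypothetical sample titles.
titles = ["Twilight", "ألف ليلة وليلة", "Les Misérables"]

has_arabic = [bool(ARABIC_RE.search(t)) for t in titles]
is_ascii = [t.isascii() for t in titles]

for title, arabic, ascii_only in zip(titles, has_arabic, is_ascii):
    print(f"{title!r}: arabic={arabic}, ascii={ascii_only}")
```

The `isascii()` filter is the stricter of the two: it also drops titles with accented Latin characters, such as the third sample.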

## Handling description column

In [36]:
# Some descriptions are duplicated because they refer to the book only generically; I will remove the repeats
df['book_desc'].value_counts()

book_desc
This book was converted from its physical edition to the digital format by a community of volunteers. You may find it for free on the web. Purchase of the Kindle edition includes wireless delivery.

In [37]:
df.drop_duplicates(subset=['book_desc'], inplace=True)
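With its default `keep='first'`, `drop_duplicates` retains the first row of each group of repeated descriptions, so one book with the generic text stays in the data. A minimal sketch on hypothetical toy data comparing that default with `keep=False`, which drops every row that shares a description:

```python
import pandas as pd

# Hypothetical toy data: two books share the same generic description.
toy = pd.DataFrame({
    "book_title": ["A", "B", "C"],
    "book_desc": ["generic blurb", "generic blurb", "a unique description"],
})

# Default behaviour: the first occurrence of each description is kept.
keep_first = toy.drop_duplicates(subset=["book_desc"])

# keep=False removes all rows whose description appears more than once.
drop_all = toy.drop_duplicates(subset=["book_desc"], keep=False)

print(keep_first["book_title"].tolist())
print(drop_all["book_title"].tolist())
```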

In [39]:
# I'll drop rows with ISBN in book_desc, as they refer to alternative editions
df = df[~df['book_desc'].str.contains(r'ISBN', na=False)]
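The pattern above is case-sensitive, so a description that spells it `isbn` would pass the filter. A minimal sketch on hypothetical descriptions (the ISBN digits are placeholders) comparing it with the `case=False` option of `str.contains`:

```python
import pandas as pd

# Hypothetical descriptions; the ISBN digits are placeholders.
descs = pd.Series([
    "A classic novel.",
    "Alternate cover edition of ISBN 0000000000",
    "See isbn 0000000000 for the hardcover",
    None,
])

# na=False makes missing descriptions count as "no match".
case_sensitive = descs.str.contains(r"ISBN", na=False)
case_insensitive = descs.str.contains(r"ISBN", case=False, na=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```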

## Handling books with more than one entry

In [42]:
df['book_title'].value_counts().sort_values(ascending=False)

book_title
Broken                                                                           13
Selected Poems                                                                   13
Legacy                                                                           11
A Christmas Carol                                                                11
Little Women                                                                     10
                                                                                 ..
Intellectuals and Race                                                            1
The Road to Wigan Pier                                                            1
Black Butler, Vol. 16                                                             1
Sexting                                                                           1
Power Without Responsibility: Press, Broadcasting and the Internet in Britain     1
Name: count, Length: 40988, dtype: int64

In [43]:
pd.set_option('display.max_colwidth', None)

In [51]:
df_book_to_check = df[df['book_title'] == '1984']
df_book_to_check

Unnamed: 0,book_title,book_authors,book_desc
100,1984,George Orwell|Erich Fromm,"Among the seminal texts of the 20th century, Nineteen Eighty-Four is a rare work that grows more haunting as its futuristic purgatory becomes more real. Published in 1949, the book offers political satirist George Orwell's nightmare vision of a totalitarian, bureaucratic world and one poor stiff's attempt to find individuality. The brilliance of the novel is Orwell's prescience of modern life--the ubiquity of television, the distortion of the language--and his ability to construct such a thorough version of hell. Required reading for students since it was published, it ranks among the most terrifying novels ever written."
5161,1984,George Orwell|Peter Hobley Davison,"'It was a bright cold day in April, and the clocks were striking thirteen.'Winston Smith works for the Ministry of truth in London, chief city of Airstrip One. Big Brother stares out from every poster, the Thought Police uncover every act of betrayal. When Winston finds love with Julia, he discovers that life does not have to be dull and deadening, and awakens to new possibilities. Despite the police helicopters that hover and circle overhead, Winston and Julia begin to question the Party; they are drawn towards conspiracy. Yet Big Brother will not tolerate dissent - even in the mind. For those with original thoughts they invented Room 101 . . . Nineteen Eighty-Four is George Orwell's terrifying vision of a totalitarian future in which everything and everyone is slave to a tyrannical regime."
21486,1984,George Orwell,"Nineteen Eighty-Four revealed George Orwell as one of the twentieth century’s greatest mythmakers. While the totalitarian system that provoked him into writing it has since passed into oblivion, his harrowing cautionary tale of a man trapped in a political nightmare has had the opposite fate: its relevance and power to disturb our complacency seem to grow decade by decade. In Winston Smith’s desperate struggle to free himself from an all-encompassing, malevolent state, Orwell zeroed in on tendencies apparent in every modern society, and made vivid the universal predicament of the individual."


In [45]:
# The goal is to handle cases where the same book appears multiple times with different authors.
# I assume that versions with fewer authors are the original ones, while others might be adaptations or special editions.
# To simplify the dataset, we will keep only the version of each book with the fewest authors.

df['num_authors'] = df['book_authors'].str.split('|').apply(len)
df = df.loc[df.groupby('book_title')['num_authors'].idxmin()]
df = df.drop(columns=['num_authors'])
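The selection above can be traced on a small frame. A minimal sketch, assuming toy rows that reuse the titles, authors and index labels shown in the outputs of this notebook (descriptions omitted for brevity):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "book_title": ["1984", "1984", "1984", "Twilight"],
        "book_authors": [
            "George Orwell|Erich Fromm",
            "George Orwell|Peter Hobley Davison",
            "George Orwell",
            "Stephenie Meyer",
        ],
    },
    index=[100, 5161, 21486, 4],
)

# Count the authors in each pipe-separated string.
num_authors = toy["book_authors"].str.split("|").str.len()

# For each title, idxmin returns the index label of the row with the
# fewest authors; on ties it returns the first such row.
keep_idx = num_authors.groupby(toy["book_title"]).idxmin()
result = toy.loc[keep_idx]
print(result)
```

Grouping on `book_title` alone treats any two rows with the same title as the same book, whoever the authors are.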

In [52]:
# We can see that the book 1984 now appears only once
df_book_to_check = df[df['book_title'] == '1984']
df_book_to_check

Unnamed: 0,book_title,book_authors,book_desc
21486,1984,George Orwell,"Nineteen Eighty-Four revealed George Orwell as one of the twentieth century’s greatest mythmakers. While the totalitarian system that provoked him into writing it has since passed into oblivion, his harrowing cautionary tale of a man trapped in a political nightmare has had the opposite fate: its relevance and power to disturb our complacency seem to grow decade by decade. In Winston Smith’s desperate struggle to free himself from an all-encompassing, malevolent state, Orwell zeroed in on tendencies apparent in every modern society, and made vivid the universal predicament of the individual."


## Save dataset

In [54]:
df.reset_index(drop=True, inplace=True)

In [57]:
df.shape

(40589, 3)

In [55]:
# Save dataset to csv
df.to_csv('../data/processed/cleaned_books.csv', index=False)