# Process
In this notebook, I remove from the dataset the authors whose names cannot be encoded in ASCII, which is used here as a proxy for names written in the Latin alphabet.

## Import libraries

In [2]:
import pandas as pd

## Read book file

In [3]:
filename = "data/items_books.csv" # Book-Crossing books
books = pd.read_csv(filename, low_memory=False).drop(["Image-URL-S", "Image-URL-M", "Image-URL-L"], axis=1) # read books and drop the image URL columns
books.columns = ["ISBN", "title", "author", "year", "publisher"] # rename columns to simplify

In [4]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


In [5]:
print("There are "+str(len(books))+" books.")

There are 271360 books.


## Unique authors

In [6]:
authors = pd.DataFrame(books.author.unique(), columns=["author"]).dropna().reset_index(drop=True) # find unique authors and remove NaN values

In [7]:
print("There are "+str(len(authors))+" unique authors.")

There are 102023 unique authors.


In [8]:
authors.head()

Unnamed: 0,author
0,Mark P. O. Morford
1,Richard Bruce Wright
2,Carlo D'Este
3,Gina Bari Kolata
4,E. J. W. Barber


## Check encoding

In [9]:
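def isEnglish(s):
    """Return True if the string s contains only ASCII characters."""
    try:
        # Decoding the UTF-8 bytes as ASCII fails on any non-ASCII character
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True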
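
The check above tests for pure ASCII, which is stricter than "Latin alphabet": a correctly encoded accented name is rejected too. Below is a minimal sketch of the same test, assuming Python 3.7+ where `str.isascii()` is available. The helper name `is_ascii` and the sample names are only for illustration.

```python
def is_ascii(s):
    """Return True if every character of s is in the ASCII range."""
    return s.isascii()

# Illustrative names: plain ASCII, accented Latin, and non-Latin script
samples = ["Carlo D'Este", "Gabriel Garc\u00eda M\u00e1rquez", "\u6751\u4e0a\u6625\u6a39"]
for name in samples:
    print(name, is_ascii(name))
```

Only the first sample passes, so names with accents are removed along with names in other scripts.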

In [10]:
# Print the first author name that contains non-ASCII characters
for index, row in authors.iterrows():
    if not isEnglish(row["author"]): # the name contains at least one non-ASCII character
        print(row["author"])
        break

Isabel-Clara SimÃ³
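
The printed name looks like an encoding mix-up rather than a non-Latin name: `Ã³` is what the UTF-8 bytes of `ó` become when they are read as Latin-1. Below is a hedged sketch of how such a string could be repaired instead of dropped. It assumes the text is UTF-8 that was decoded as Latin-1, which has not been checked for the whole dataset, and `fix_mojibake` is a hypothetical helper.

```python
def fix_mojibake(s):
    """Try to undo a UTF-8-read-as-Latin-1 mix-up; return s unchanged on failure."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards: leave as is
        return s

print(fix_mojibake("Isabel-Clara Sim\u00c3\u00b3"))  # Isabel-Clara Simó
```

Strings that are already correct, including plain ASCII names, pass through unchanged.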


In [11]:
# Drop every author whose name contains non-ASCII characters
for index, row in authors.iterrows():
    if not isEnglish(row["author"]): # the name contains at least one non-ASCII character
        authors.drop(index, inplace=True)
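
Dropping rows one at a time inside `iterrows()` is slow on roughly 100,000 rows. The same filter can be sketched with a boolean mask, assuming Python 3.7+ for `str.isascii`. The small `authors` frame below is a stand-in for the real one.

```python
import pandas as pd

# Toy stand-in for the `authors` frame built earlier
authors = pd.DataFrame(
    {"author": ["Mark P. O. Morford", "Isabel-Clara Sim\u00c3\u00b3", "Carlo D'Este"]}
)

# Boolean mask: True where the name is pure ASCII
mask = authors["author"].map(str.isascii)
authors_ascii = authors[mask]
print(authors_ascii)
```

As with `drop`, the surviving rows keep their original index labels.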

In [12]:
print("After removing non latin names, there are "+str(len(authors))+" unique authors.")

After removing non latin names, there are 100642 unique authors.


## Keep latin authored books

In [13]:
books = pd.merge(books, authors, on="author", how="inner") # keep only the books whose author is still in `authors`
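
The merge acts as a filter: it is an inner join on `author`, so only books whose author remains in `authors` are kept. Books with a missing author are dropped as well, because NaN values were removed from `authors`. Below is a sketch of a filter with `Series.isin` that should select the same rows while leaving the original index untouched. The toy frames and placeholder titles are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames; titles are placeholders
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "author": ["Mark P. O. Morford", "Isabel-Clara Sim\u00c3\u00b3", "Carlo D'Este", None],
})
authors = pd.DataFrame({"author": ["Mark P. O. Morford", "Carlo D'Este"]})

# Keep the rows whose author appears in the filtered author list
kept = books[books["author"].isin(authors["author"])]
print(kept)
```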

In [14]:
books

Unnamed: 0,ISBN,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0801319536,Classical Mythology,Mark P. O. Morford,1998,John Wiley &amp; Sons
2,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
3,0771597185,The teacher's daughter,Richard Bruce Wright,1982,Macmillan of Canada
4,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
...,...,...,...,...,...
269269,0762412119,"Burpee Gardening Cyclopedia: A Concise, Up to ...",Allan Armitage,2002,Running Press Book Publishers
269270,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press
269271,1845170423,Cocktail Classics,David Biggs,2004,Connaught
269272,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books


## Save files

In [13]:
books.to_csv("data/items_books_latin.csv")
authors.to_csv("data/authors_latin.csv")
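
By default `to_csv` also writes the DataFrame index as an unnamed first column. The sketch below shows what that looks like when the file is read back, using an in-memory buffer and a toy frame instead of the real files.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by the filtering step
authors = pd.DataFrame({"author": ["Mark P. O. Morford", "Carlo D'Este"]}, index=[0, 2])

buffer = io.StringIO()
authors.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
default_cols = pd.read_csv(buffer).columns.tolist()
print(default_cols)  # ['Unnamed: 0', 'author']

buffer.seek(0)
indexed_cols = pd.read_csv(buffer, index_col=0).columns.tolist()
print(indexed_cols)  # ['author']
```

Passing `index_col=0` when reading restores the saved index. Passing `index=False` to `to_csv` would omit it from the file altogether.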