# Process
In this notebook, I will analyze the authors' names and process them accordingly.

## Import libraries

In [16]:
from viapy.api import ViafAPI, ViafEntity, SRUResult, SRUItem
import pandas as pd
import pickle as pkl
import time
from lxml import etree
from urllib.request import urlopen
from urllib.parse import quote
import json
import requests
import logging
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Read files

In [17]:
authors = pd.read_csv("data/authors_latin.csv", index_col=0)
books = pd.read_csv("data/items_books_latin.csv", low_memory = False, index_col=0)

In [41]:
books[books.author == "J.K. Rowling"]

Unnamed: 0,ISBN,title,author,year,publisher
132974,043965548X,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling,2004,Scholastic Paperbacks
132975,1855494981,Harry Potter and the Philosopher's Stone (Cove...,J.K. Rowling,1999,BBC Consumer Publishing
132976,1855496704,Harry Potter and the Philosopher's Stone (Cove...,J.K. Rowling,2000,BBC Consumer Publishing
132977,8478889019,Harry Potter y la Ã?rden del FÃ©nix,J.K. Rowling,2004,Lectorum Publications
132978,2070556859,Harry Potter et l'Ordre du PhÃ©nix (Harry Pott...,J.K. Rowling,2003,Gallimard Jeunesse
132979,185549860X,Harry Potter and the Philosopher's Stone,J.K. Rowling,1999,BBC Consumer Publishing
132980,0747561966,Harry Potter and the Philosopher's Stone,J.K. Rowling,2003,Bloomsbury
132981,8475967760,I El Pres D'askaban,J.K. Rowling,0,CELESA (Centro de exportacion de Libros Espano...
132982,8475967744,I La Pedra Filosofal,J.K. Rowling,0,CELESA (Centro de exportacion de Libros Espano...
132983,0939173344,Harry Potter and the Sorcerer's Stone,J.K. Rowling,1999,"National Braille Press, Inc."


In [43]:
books.loc[132975].title

"Harry Potter and the Philosopher's Stone (Cover to Cover)"

## Correct some encoding
I noticed that some author entries (173) appear as {name_1} &amp; {name_2}: the symbol & was left HTML-escaped as &amp;. So, I will first fix this escaping.
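A minimal sketch of undoing the escaping, assuming the only problem is HTML entity escaping. `html.unescape` comes from Python's standard library; `unescape_author` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import html


def unescape_author(name: str) -> str:
    """Turn HTML entities such as '&amp;' back into plain characters."""
    return html.unescape(name)


print(unescape_author("Denning &amp; Phillips"))
print(unescape_author("FC&amp;A"))
```

On the dataframe this could be applied with something like `authors["author"].map(unescape_author)`.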

There are four cases:

1. Two authors' last names next to each other, e.g. "Denning &amp; Phillips" is Melita Denning and Osborne Phillips.
2. The first author's first name, then the second author's last name, a comma, the second author's first name, and finally the first author's last name, e.g. "William &amp; Johnson, Virginia Masters" is William Masters and Virginia Johnson.
3. Two authors' first names with their common last name, e.g. "Mike &amp; Mary Couillard" is Mike and Mary Couillard.
4. Institution names, e.g. "The Staff of Research &amp; Education Association".

I will make a rule scheme to fix these issues. 
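The four cases above can be sketched as one small classifier. This is an illustration under stated assumptions, not the rule set actually used in this notebook: the function name, the case labels, and the `INSTITUTION_WORDS` keyword list are all hypothetical. Pairs such as "Simon &amp; Schuster" have the same shape as case 1 and cannot be told apart from two authors by shape alone.

```python
import html
from typing import Optional, Tuple

# Assumption: a small hand-picked keyword list is enough to flag institutions.
INSTITUTION_WORDS = {"staff", "association", "publishing", "editors", "project", "ltd", "co"}


def split_ampersand_author(raw: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (case, first_author, second_author) for one author string."""
    name = html.unescape(raw).strip()
    if "&" not in name:
        return ("single", name, None)

    # Case 4: institution names, detected by keyword.
    tokens = {t.strip(".,").lower() for t in name.replace("&", " ").split()}
    if tokens & INSTITUTION_WORDS:
        return ("institution", name, None)

    left, right = (part.strip() for part in name.split("&", 1))

    # Case 2: "First1 & Last2, First2 Last1"
    if "," in right:
        second_last, rest = (part.strip() for part in right.split(",", 1))
        rest_tokens = rest.split()
        if len(rest_tokens) < 2:
            return ("unresolved", name, None)
        first_last = rest_tokens[-1]
        second_first = " ".join(rest_tokens[:-1])
        return ("interleaved", f"{left} {first_last}", f"{second_first} {second_last}")

    left_tokens, right_tokens = left.split(), right.split()
    # Case 1: "Last1 & Last2"
    if len(left_tokens) == 1 and len(right_tokens) == 1:
        return ("two_last_names", left, right)
    # Case 3: "First1 & First2 Last"
    if len(left_tokens) == 1 and len(right_tokens) == 2:
        last = right_tokens[-1]
        return ("shared_last_name", f"{left} {last}", f"{right_tokens[0]} {last}")
    return ("unresolved", name, None)


for example in ["Denning &amp; Phillips",
                "William &amp; Johnson, Virginia Masters",
                "Mike &amp; Mary Couillard",
                "The Staff of Research &amp; Education Association"]:
    print(split_ampersand_author(example))
```

Anything that fits none of the four shapes, for example two full names such as "Yitta Halberstam &amp; Judith Leventhal", is returned as "unresolved" rather than guessed.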

In [19]:
# authors["first_author"] = authors["author"]
# authors["second_author"] = ""

In [20]:
# for index, row in authors.iterrows():
#     author_name = row["author"]
#     if "&amp;" in author_name:
#         # case 2
#         splt_author = author_name.split("&amp;")
#         if "," in splt_author[1]: # so basically if it's case number two
#             first_first_name = splt_author[0]
#             splt_author_2 = splt_author[1].split(",")
#             second_last_name = splt_author_2[0]
#             splt_author_3 = splt_author_2[1].split(" ")
#             second_first_name = splt_author_3[1]
#             first_last_name = splt_author_3[-1]
#             first_author = first_first_name +" "+first_last_name
#             second_author = second_first_name +" "+second_last_name
#             authors.at[index, "first_author"] = first_author
#             authors.at[index, "second_author"] = second_author
#         else:
#             # it can be either case 1 or 3
#             second_part = splt_author[-1].split(" ")
#             if len(second_part) == 2: #then case number 1
#                 first_last_name = splt_author[0]
#                 second_last_name = splt_author[-1]
#                 authors.at[index, "first_author"] = first_last_name
#                 authors.at[index, "second_author"] = second_last_name
#             else: #then case number 3
#                 first_first_name = splt_author[0]
#                 second_part = splt_author[1].split(" ")
#                 if len(second_part) >1:
#                     common_last_name = second_part[-1]
#                     second_first_name = second_part[-2]
#                     authors.at[index, "first_author"] = first_first_name +" "+common_last_name
#                     authors.at[index, "second_author"] = second_first_name +" "+common_last_name
#                 else:
#                     print("WTF", second_part, author_name)

In [21]:
# authors[authors.second_author!=""]

## Some checks

Count how many authors have only one name.

In [36]:
sum_single_names = 0
for author in authors.author:
    if len(author.split(" "))==1:
        sum_single_names+=1
        print(author)

Roy
Schiller
Golding
HarperReference
LTD
Jack
Moveon
Alain-Fournier
Lewis
Pergaud
Collectif
Vinke
Plato
Janosch
Tournier
Dick
Anonimo
Grierson
Osuntoki
Scott
Aleramo
Halim
Sark
Daley
Holcomb
Starhawk
Ahne
Sapphire
Pirandello
Benni
Brizzi
Balzac
Bstan-Dzin-Rgya-Mtsho
Fitzpatrick
Robbins
Jelloun
Anonymous
Seuss
Hentoff
Peters
Renee
Fernandez
Christie
Boll
Bird
Anthony
Nostradamus
Parragon
Illiad
Trevanian
Thucydides
Williams
Rhue
Bach
Salinger
McCaffrey
Carroll
Zolar
Virgil
Landoll
Bukowski
Collard
Magnan
Jerome
Tamaro
Milligan
Prabhavananda
Curl
Lucretius
Horace
Korman
Girling
Lunn
Silverberg
Gibran
Schwanitz
Jewel
Sage
Hitchcock
Allende
Weigle
Assorted
HSA-UWC
Avi
Springfiel
Curtis
Clarke
Aoumiel
Archer
Pinclo
Bergren
Stevens
Sophocles
Stanley
Colette
Corneille
Rudrananda
Hoffmann
Oxford
Merriam-Webster
Ortho
Grimes
Kaplan
Sallenav
Chiflet
Herge
Walker
Forsyth
Clamp
Dalglish
Merril
Stuart
Colman
Fowler
Giles
Marisol
Pla
Berberova
Fredriksson
Roberts
Fergusson
Fodor's
Calchman
Gzowski
Y

Henslin
Siviter
Mayer
Wilkerson
McLeish
Lamming
Comte-Sponville
Matthews
Gateway
Rickett
Mermet
Mccarthy
Grousset
O'Callagh
Evangelisti
Ardai
Goforth
Dobb
Editors
Satyavan
Hampton
Jace
Melody
Janry
Darcy
Brandewyn
Ballmann
Knightmare
Cornelle
T/K
Tormont
Tanaka
Poilus
Schulman
Aufderstrab
Alain
Huxley
Steel
KEENE
DeLorme
Pitigrilli
Royall
Bischoff
Theophane
Groebner
Prevention
Tamaya
MacLennan
Feyerabend
Litton
Momatiuk
Miklowitz
John
McIver
Montalban
Plauto
Storm
Kazin
Spoerl
Dominiqu
Datamyte
Eichhorn
Helmstetter
Wilkins
Bravestarr
Frohlichstein
Patrick
Morss
Bancroft
Hawksley
Mapsco
Franck
Wolf
Raimond
Marshall
Brussolo-S
Theroux
Miryam
Begarnie
Garrett
Anon.
Jamoo
Sabin
Vallier
Russ
Fawkes
Mortimer
Nottingham
Fried
Imagineers
Liffiton
Dadie
Kraft
Corbeyran
Maccaig
Judy
Lustbade
Westminster
Damascene
Brussolo
Lansky
Hobsbawm
Odell
Blish
Smullyan
Attac
Percy
Stine
Brownlow
Harcourt
Han-Shan
Opencourt
Harnadek
Pollard
Samois
Bascove
Schaef
Sweetgall
Langseth
Dodson
Foord
Lukin
Wayland

In [23]:
print(sum_single_names, "authors in authors")

2484 authors in authors


In [14]:
sum_single_names = 0
for author in books.author:
    if len(author.split(" "))==1:
        sum_single_names+=1

In [15]:
print(sum_single_names, "books")

4949 books


In [27]:
sum_and_in_author = 0
for author in authors.author:
    if " and " in author:
        sum_and_in_author+=1

In [28]:
print(sum_and_in_author, "authors in authors")

136 authors in authors


In [35]:
sum_plus_in_author = 0
for author in authors.author:
    if "&" in author:
        sum_plus_in_author+=1
        print(author)

Mary-Kate &amp; Ashley Olsen
Denning &amp; Phillips
The Staff of Research &amp; Education Association
G &amp; R Publishing
W &amp; M Hoffer
Douglas R. &amp; Dennett, Daniel C. Hofstadter
Yuri &amp; Wiseman, Ian Rubinsky
Fc &amp; A Publishing Staff
Howe &amp; Steiger
Bette &amp; Sansan Lord
Harry &amp; Elizabeth Lawrence
Andrews &amp; McMeel
Gregory &amp; Griffin, Ricky W. Moorhead
Beatrix &amp; Atkinson, Allen Potter
Walker &amp; Co
Barns &amp; Budd
Susan &amp; Kuhiwczak, Piotr Bassnett
The American Poetry &amp; Literacy Project
Thomas N. &amp; Robinson, Frank M. Scortia
FC&amp;A
Norman D. &amp; Brown, Walter R. Anderson
Armen A. &amp; Allen, William R. Alchain
Linda &amp; richard Eyre
A &amp; C Black Ltd.
Better Homes &amp; Gardens Editors
Yitta Halberstam &amp; Judith Leventhal
Rose &amp; Radner, Gilda Roseannadanna
Erin&amp;Bill Woods
Margaret&amp;Maurine Moon
Marvin &amp; Koppel, Ted Kalb
Simon &amp; Schuster
Richard &amp; Sterin, Chuck Shames
Ernst &amp; Young
Carol &amp; Price, N

In [34]:
print(sum_plus_in_author, "authors in authors")

173 authors in authors
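The three checks above (single-token names, names containing " and ", names containing "&") each loop over the column separately. As a sketch, they could be gathered in a single pass. `count_name_patterns` and its key names are hypothetical, and the sample list is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_name_patterns(names: Iterable[str]) -> Counter:
    """Count the three name patterns checked above in one pass."""
    counts = Counter()
    for name in names:
        if len(name.split(" ")) == 1:   # same test as the loops above
            counts["single_token"] += 1
        if " and " in name:
            counts["contains_and"] += 1
        if "&" in name:
            counts["contains_ampersand"] += 1
    return counts


# Made-up sample, not taken from the dataset.
sample = ["Plato", "Mary-Kate &amp; Ashley Olsen", "Howe and Steiger", "J.K. Rowling"]
print(count_name_patterns(sample))
```

On the real data this would be called as `count_name_patterns(authors.author)` or `count_name_patterns(books.author)`.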


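## Drop multi-author entries

Perhaps I shouldn't be doing this, but instead of applying the rules above I simply drop every row whose author contains &amp; or " and ".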

In [4]:
len(authors),len(books)

(100642, 269274)

In [9]:
# Keep only rows whose author field does not contain "&amp;" or " and ".
authors = authors[~authors["author"].str.contains("&amp;", regex=False)]
authors = authors[~authors["author"].str.contains(" and ", regex=False)]

books = books[~books["author"].str.contains("&amp;", regex=False)]
books = books[~books["author"].str.contains(" and ", regex=False)]

In [10]:
len(authors), len(books)

(100334, 268766)
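As a quick worked check of how much the filter removed, the two pairs of row counts printed above can be subtracted. The numbers are copied from the outputs; the variable names are only illustrative.

```python
# Row counts copied from the two len(...) outputs above.
before = {"authors": 100642, "books": 269274}
after = {"authors": 100334, "books": 268766}

dropped = {key: before[key] - after[key] for key in before}
print(dropped)   # {'authors': 308, 'books': 508}
```

The 308 dropped authors are close to the 173 + 136 = 309 names counted earlier. The gap of one could come from a name matching both patterns, or from a bare "&" that is not written as "&amp;", though the notebook does not check this.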

In [11]:
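# Strip the literal "Ph.D" title and commas; regex=False so "." is not treated as a wildcard.
authors.author = authors.author.str.replace("Ph.D", "", regex=False)
authors.author = authors.author.str.replace(",", "", regex=False)

books.author = books.author.str.replace("Ph.D", "", regex=False)
books.author = books.author.str.replace(",", "", regex=False)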
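The dot in "Ph.D" deserves care. Older pandas versions treated the pattern passed to `str.replace` as a regular expression by default, and in a regular expression "." matches any character. Whether that affected this notebook depends on the pandas version it ran with. The sketch below uses only the standard library to show the difference. `strip_title_and_commas` is a hypothetical helper, and its whitespace tidying is an addition that the cell above does not perform.

```python
import re


def strip_title_and_commas(name: str) -> str:
    """Remove a literal 'Ph.D' and commas, then collapse leftover whitespace."""
    cleaned = name.replace("Ph.D", "").replace(",", "")  # built-in str.replace is always literal
    return " ".join(cleaned.split())


# As a regular expression, the dot in "Ph.D" matches any character,
# so the unrelated token "PhxD" is removed as well.
print(repr(re.sub("Ph.D", "", "Phil PhxD")))

# Escaping the pattern makes the dot literal, so nothing is removed here.
print(repr(re.sub(re.escape("Ph.D"), "", "Phil PhxD")))

# Made-up example name.
print(strip_title_and_commas("Smith, John Ph.D"))
```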

## Save the cleaned files

In [245]:
authors.to_csv("data/authors_latin_fixed.csv")
books.to_csv("data/items_books_latin_fixed.csv")