This notebook tries to fill in missing values where possible.

In [1]:
import requests
import json
import csv
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import random
import os
import string
from time import sleep
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import WebDriverException
from dotenv import load_dotenv
load_dotenv();

In [2]:
books = pd.read_csv('book_data_clean.csv')

First, let's drop our duplicate rows.

In [3]:
len(books)

24971

In [4]:
books = books.drop_duplicates()

In [5]:
len(books)

24861

In [6]:
books.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"['Science Fiction', 'Fiction', 'Fantasy', 'Que...",2020.0
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,['Science Fiction'],2017.0
2,The Mathematical Murder of Innocence,Michael Carter,37688.0,24.08,8.11,4.13,1.56,2.58,[],2020.0
3,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"['Mystery', 'Fiction', 'Crime', 'Humor', 'Clas...",1940.0
4,Zombie Nation,Charlie Dalton,64396.0,51.11,8.22,2.21,0.58,1.63,[],2020.0


I've noticed that some books are missing Prosecraft data. Let's find out why.

In [7]:
missing_prosecraft = books[books['vividness'].isna()].copy()

In [8]:
len(missing_prosecraft)

410

In [9]:
missing_prosecraft.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year
176,Infinite Baseball,Alva No',,,,,,,"['Baseball', 'Sports', 'Nonfiction', 'Philosop...",2019.0
233,The Little Buddhist Monk,César Aira,,,,,,,"['Fiction', 'Latin American', 'Contemporary', ...",2017.0
329,The Fire Engine That Disappeared,Maj Sjöwall & Per Wahlöö,,,,,,,"['Mystery', 'Crime', 'Fiction', 'Scandinavian ...",1969.0
388,How to Turn Into a Bird,María José Ferrada,,,,,,,"['Fiction', 'Contemporary', 'Coming Of Age', '...",2022.0
588,The Silence of the White City,Eva García Sáenz,,,,,,,"['Thriller', 'Mystery', 'Crime', 'Fiction', 'S...",2016.0


What these seem to have in common is that the title or author contains special characters. When I check the URLs of these books on Prosecraft, the accents do not appear, so that is likely where the error occurred. First, I find all the non-alphanumeric characters that appear in the titles and author names.

In [10]:
alphanum = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890')

In [11]:
print(set(missing_prosecraft.title.sum()+missing_prosecraft.author.sum()).difference(alphanum))

{'í', '’', '̈', '̂', "'", '–', 'ô', 'ö', '/', 'Ø', 'ç', '%', 'ä', 'ō', '̇', 'ž', 'ë', 'Æ', ')', '!', 'ó', '́', 'ð', ';', '̀', 'è', 'ū', '?', 'ü', '‘', '.', ' ', 'Ž', '-', '+', ':', 'ø', '̊', 'é', ']', '̃', ',', '(', 'â', '°', 'à', '̧', '$', 'á', 'ñ', '[', '̌', '&', 'ł', '#', 'ï', '*'}


In [12]:
replacements = {'à':'a','é':'e','â':'a','ç':'c','ñ':'n','Ž':'Z',
                'ž':'z','è':'e','ö':'o','á':'a','ó':'o','ū':'u',
                'í':'i','ô':'o','Ø':'O','ł':'l','ä':'a','ï':'i',
                'ë':'e','ü':'u','ō':'o'}

In [13]:
missing_prosecraft['author_clean'] = missing_prosecraft['author']

In [14]:
for before, after in replacements.items():
    missing_prosecraft.author_clean = missing_prosecraft.author_clean.str.replace(before,after)

In [15]:
pd.concat([missing_prosecraft[['author','author_clean']].head(),(missing_prosecraft[['author','author_clean']].tail())])

Unnamed: 0,author,author_clean
176,Alva No',Alva No'
233,César Aira,César Aira
329,Maj Sjöwall & Per Wahlöö,Maj Sjowall & Per Wahloo
388,María José Ferrada,María José Ferrada
588,Eva García Sáenz,Eva García Sáenz
24739,Ragnar Jónasson,Ragnar Jónasson
24745,María Amparo Ruiz de Burton,María Amparo Ruiz de Burton
24783,M. L. Longworth,M. L. Longworth
24955,Tor Fleck,Tor Fleck
24970,Joe Schrieber,Joe Schrieber


Hmm... not all of the accents are gone. Let's investigate why!

In [16]:
print(set(missing_prosecraft.author_clean.sum()).difference(alphanum))

{'’', '̈', '̂', "'", '̇', 'Æ', ')', '́', 'ð', '̀', ' ', '.', '-', 'ø', '̊', '̃', '(', ',', '̧', '̌', '&'}


In [17]:
missing_prosecraft.iloc[1].author.replace('é','e')

'César Aira'

Strange! It's almost as if it's a different character entirely. So I tried copy/pasting the accented e directly from the previous line's output, and...

In [18]:
missing_prosecraft.iloc[1].author.replace('é','e')

'Cesar Aira'

Well, here's your problem!

In [19]:
'é' == 'é'

False
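A likely explanation is Unicode normalization: the same accented letter can be stored either as one precomposed code point or as a base letter followed by a combining mark. The combining marks in the set printed earlier (for example '́' and '̈') suggest that at least some of the data uses the second form. The sketch below uses made-up strings rather than the book data, and only shows why two identical-looking characters can compare unequal.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but are different sequences of code points.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalizing both to the same form (here NFC) makes them comparable.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)                    # True
```

This is why a replacement key typed on the keyboard can fail to match while the same-looking character pasted from the output succeeds.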

Let's run this again, this time with the accented letters that did not change copied directly from the data, so that the replacement keys match the characters actually stored there.

In [32]:
corrected_replacements = {'á':'a','é':'e','ó':'o','ë':'e'}

In [21]:
for before, after in corrected_replacements.items():
    missing_prosecraft.author_clean = missing_prosecraft.author_clean.str.replace(before,after)

In [22]:
pd.concat([missing_prosecraft[['author','author_clean']].head(),(missing_prosecraft[['author','author_clean']].tail())])

Unnamed: 0,author,author_clean
176,Alva No',Alva No'
233,César Aira,Cesar Aira
329,Maj Sjöwall & Per Wahlöö,Maj Sjowall & Per Wahloo
388,María José Ferrada,María Jose Ferrada
588,Eva García Sáenz,Eva García Saenz
24739,Ragnar Jónasson,Ragnar Jonasson
24745,María Amparo Ruiz de Burton,María Amparo Ruiz de Burton
24783,M. L. Longworth,M. L. Longworth
24955,Tor Fleck,Tor Fleck
24970,Joe Schrieber,Joe Schrieber
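An alternative to maintaining the replacement dictionaries by hand is to decompose every string and drop the combining marks, which handles both encodings of an accented letter at once. The sketch below is one way to do that with the standard library. The function names and the `EXTRA` table are hypothetical, and the mappings for 'Æ' and 'ð' are assumptions, since the notebook does not show how Prosecraft writes those letters.

```python
import unicodedata

def strip_accents(text):
    """Decompose text (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Letters with no Unicode decomposition need an explicit mapping.
# 'Ø' and 'ł' follow the notebook's dictionary; 'ø', 'Æ', and 'ð' are assumptions.
EXTRA = str.maketrans({"ø": "o", "Ø": "O", "ł": "l", "Æ": "AE", "ð": "d"})

def to_ascii_letters(text):
    """Strip accents, then apply the explicit mapping for the remaining letters."""
    return strip_accents(text).translate(EXTRA)

# Both encodings of 'é' give the same result.
print(to_ascii_letters("C\u00e9sar Aira"))
print(to_ascii_letters("Ce\u0301sar Aira"))
```

Applied to the notebook's data, this would look something like `missing_prosecraft['author'].map(to_ascii_letters)`, assuming the column contains only strings.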


Now let's run it for the titles!

In [23]:
missing_prosecraft['title_clean'] = missing_prosecraft['title']

In [33]:
replacements.update(corrected_replacements)

In [34]:
for before, after in (replacements).items():
    missing_prosecraft.title_clean = missing_prosecraft.title_clean.str.replace(before,after)

In [35]:
pd.concat([missing_prosecraft[['title','title_clean']].head(),(missing_prosecraft[['title','title_clean']].tail())])

Unnamed: 0,title,title_clean
176,Infinite Baseball,Infinite Baseball
233,The Little Buddhist Monk,The Little Buddhist Monk
329,The Fire Engine That Disappeared,The Fire Engine That Disappeared
388,How to Turn Into a Bird,How to Turn Into a Bird
588,The Silence of the White City,The Silence of the White City
24739,The Mist,The Mist
24745,The Squatter and the Don,The Squatter and the Don
24783,A Noël Killing,A Noel Killing
24955,Agency ‘O’,Agency ‘O’
24970,Star Wars - The Mandalorian: Junior Novel,Star Wars - The Mandalorian: Junior Novel


We still have a number of special characters left that are probably giving us trouble. 

In [41]:
specials = set(missing_prosecraft.title_clean.sum()+missing_prosecraft.author_clean.sum()).difference(alphanum)
print(specials)

{'’', '̈', '̂', "'", '–', '/', '%', '̇', 'Æ', ')', '!', '́', 'ð', ';', '̀', '?', '‘', '.', ' ', '-', '+', ':', 'ø', '̊', ']', '̃', ',', '(', '°', '̧', '$', '[', '̌', '&', '#', '*'}


In [48]:
special_df = pd.DataFrame()

In [72]:
# Build a DataFrame of the rows whose cleaned title or author still contains a special character
for special in specials:
    # regex=False matches the character literally, so no manual escaping is needed
    mask = (missing_prosecraft['title_clean'].str.contains(special, regex=False) |
            missing_prosecraft['author_clean'].str.contains(special, regex=False))
    special_df = pd.concat([special_df, missing_prosecraft[mask]])

In [73]:
special_df

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,author_clean,title_clean
891,"The Year’s Best Fantasy, Volume 1","Various Authors (ed, Paula Guran)",,,,,,,[],,"Various Authors (ed, Paula Guran)","The Year’s Best Fantasy, Volume 1"
1233,The Gravediggers’ Bread,Frédéric Dard,,,,,,,[],,Frederic Dard,The Gravediggers’ Bread
3336,"Surely You’re Joking, Mr. Feynman!",Richard P. Feynman,,,,,,,"['Science', 'Nonfiction', 'Biography', 'Physic...",1985.0,Richard P. Feynman,"Surely You’re Joking, Mr. Feynman!"
3438,Star Wars - Galaxy’s Edge: Black Spire,Delilah S. Dawson,,,,,,,"['Star Wars', 'Science Fiction', 'Fiction', 'A...",2019.0,Delilah S. Dawson,Star Wars - Galaxy’s Edge: Black Spire
3596,Goosebumps: Don’t go to Sleep!,R. L. Stine,,,,,,,[],,R. L. Stine,Goosebumps: Don’t go to Sleep!
...,...,...,...,...,...,...,...,...,...,...,...,...
24667,Other Terrors,"Various Authors (ed, Vince A. Liaguno & Rena M...",,,,,,,"['Horror', 'Short Stories', 'Anthologies', 'Fi...",2022.0,"Various Authors (ed, Vince A. Liaguno & Rena M...",Other Terrors
5383,#MurderTrending,Gretchen McNeil,,,,,,,"['Young Adult', 'Mystery', 'Horror', 'Thriller...",2018.0,Gretchen McNeil,#MurderTrending
6026,Nova Project #1,Emma Trevayne,,,,,,,"['Young Adult', 'Science Fiction', 'Dystopia',...",2016.0,Emma Trevayne,Nova Project #1
15172,Juror #3,James Patterson & Nancy Allen,,,,,,,"['Mystery', 'Fiction', 'Thriller', 'Mystery Th...",2018.0,James Patterson & Nancy Allen,Juror #3
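Looping over the characters one at a time can add the same row several times, once per special character it contains. It is also worth noting that the set printed earlier includes the space character, which will match almost every title. A possible alternative is to build a single character-class pattern and test each string once. The sketch below uses a small hypothetical sample in place of the notebook's `specials` set and title column.

```python
import re

# Hypothetical sample of leftover characters (the space is deliberately left out).
specials = {"’", "#", "(", ")", "&", "*", "."}

# re.escape makes each character literal, so '(' or '*' cannot break the pattern.
pattern = "[" + "".join(re.escape(ch) for ch in sorted(specials)) + "]"

# Hypothetical sample titles standing in for the title_clean column.
titles = ["Juror #3", "Other Terrors", "The Gravediggers’ Bread"]

# Each title is tested once, so no title can be flagged twice.
flagged = [t for t in titles if re.search(pattern, t)]
print(flagged)
```

In pandas, the same pattern could be passed to `Series.str.contains(pattern)` to get one boolean mask per column.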


In [71]:
type(special_df)

pandas.core.frame.DataFrame

In [75]:
more_replacements = {'#':'','*':'-',',':'','(':'',')':'','&':'and','.':''}

In [76]:
for before, after in more_replacements.items():
    # regex=False: '(', ')', '.', and '*' must be treated as literal characters
    missing_prosecraft.title_clean = missing_prosecraft.title_clean.str.replace(before, after, regex=False)
    missing_prosecraft.author_clean = missing_prosecraft.author_clean.str.replace(before, after, regex=False)

In [77]:
specials = set(missing_prosecraft.title_clean.sum()+missing_prosecraft.author_clean.sum()).difference(alphanum)
print(special)

*


In [78]:
print(type(special))

<class 'str'>
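The `*` and `<class 'str'>` outputs are consistent with `special` (singular) being the loop variable left over from the earlier `for special in specials` loop, rather than the set `specials`. In Python a loop variable stays defined after the loop ends and holds the last element visited. A minimal illustration with a made-up set:

```python
specials = {"#", "*"}

for special in specials:
    pass  # the loop body is irrelevant here

# `special` still exists after the loop and holds a single element of the set.
print(type(special).__name__)    # str
print(type(specials).__name__)   # set
```

Printing `specials` instead would show the full set of remaining characters.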

