## Authors Processing

This Jupyter notebook processes the raw authors CSV data and performs basic data cleaning:

1. Selecting only desired features (raw/derived)
2. String cleaning
3. Handling missing values

#### Import Libraries

In [21]:
import pandas as pd

#### Data Upload

In [22]:
path = "C:\\Users\\juhic\\OneDrive\\Desktop\\goodreads_kaggle_authors.csv"
authors = pd.read_csv(path)
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209517 entries, 0 to 209516
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   authorid           209517 non-null  int64  
 1   name               209517 non-null  object 
 2   workcount          209517 non-null  int64  
 3   fan_count          209517 non-null  int64  
 4   gender             209517 non-null  object 
 5   image_url          209517 non-null  object 
 6   about              86724 non-null   object 
 7   born               31230 non-null   object 
 8   died               12488 non-null   object 
 9   influence          7882 non-null    object 
 10  average_rate       209517 non-null  float64
 11  rating_count       209517 non-null  int64  
 12  review_count       209517 non-null  int64  
 13  website            58320 non-null   object 
 14  twitter            35122 non-null   object 
 15  genre              73983 non-null   object 
 16  or

In [23]:
path = "C:\\Users\\juhic\\OneDrive\\Desktop\\goodreads_kaggle_authors.csv"
authors = pd.read_csv(path, usecols = [
                                       'name',
                                       'workcount',
                                       'fan_count',
                                       'gender',
                                       'average_rate',
                                       'rating_count',
                                       'review_count',
                                       'country'])


# Rename columns ---------------------------------------------------------
cols = {'name': 'author',
        'workcount': 'work_count',
        'average_rate': 'author_avg_rating',
        'rating_count': 'author_rating_count',
        'review_count': 'author_review_count',
        'country': 'author_country',
        'gender': 'sex'}
authors.rename(columns = cols, inplace = True)

#### Data Overview I

In [24]:
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209517 entries, 0 to 209516
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   author               209517 non-null  object 
 1   work_count           209517 non-null  int64  
 2   fan_count            209517 non-null  int64  
 3   sex                  209517 non-null  object 
 4   author_avg_rating    209517 non-null  float64
 5   author_rating_count  209517 non-null  int64  
 6   author_review_count  209517 non-null  int64  
 7   author_country       44599 non-null   object 
dtypes: float64(1), int64(4), object(3)
memory usage: 12.8+ MB


In [25]:
authors.head()

| | author | work_count | fan_count | sex | author_avg_rating | author_rating_count | author_review_count | author_country |
|---|---|---|---|---|---|---|---|---|
| 0 | Jason Wallace | 2 | 13 | male | 3.74 | 1028 | 175 | United Kingdom |
| 1 | Rosan Hollak | 4 | 0 | unknown | 3.73 | 15 | 1 | |
| 2 | Nanna Foss | 6 | 156 | female | 4.35 | 1172 | 205 | |
| 3 | Terri Savelle Foy | 23 | 125 | female | 4.56 | 1054 | 151 | |
| 4 | Vishwas Nangare Patil | 1 | 127 | unknown | 4.15 | 725 | 43 | |


#### String Cleaning

In [26]:
cols = ['author','sex','author_country']

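# lowercase and strip spaces from both ends (applied column by column)
authors[cols] = authors[cols].apply(lambda x: x.str.lower().str.strip())

# collapse runs of spaces inside each string;
# regex = True is explicit because newer pandas versions default to literal matching
for c in cols:
    authors[c] = authors[c].str.replace(r" +", " ", regex = True)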
    

# Authors: Drop authors with unknown sex -------------------------------------------------
authors.drop(authors[authors['sex'] == 'unknown'].index, inplace = True)
authors.reset_index(drop = True, inplace = True)
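
The cleaning steps in this cell can be tried on a small frame first. The sketch below is only an illustration: the `toy` rows and the `clean_text` helper are hypothetical and are not part of the Goodreads data or the notebook.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Goodreads file.
toy = pd.DataFrame({
    "author": ["  Jane   DOE ", "John  Smith"],
    "sex": ["Female ", " unknown"],
    "author_country": ["United  Kingdom", None],
})

def clean_text(col: pd.Series) -> pd.Series:
    """Lowercase, trim both ends, and collapse runs of spaces."""
    return col.str.lower().str.strip().str.replace(r" +", " ", regex=True)

cols = ["author", "sex", "author_country"]
toy[cols] = toy[cols].apply(clean_text)  # column-wise; missing values stay missing

# Drop rows whose sex is unknown and renumber the index.
toy = toy[toy["sex"] != "unknown"].reset_index(drop=True)
print(toy)
```

Filtering with a boolean mask and then calling `reset_index(drop=True)` gives the same rows as dropping by index label; it is shown here only because it avoids mutating the frame in place.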



#### Handling Missing Values (WIP)

In [27]:
print("Null records \n")
for c in authors.columns:
    i = authors[c].isnull().sum()   # number of missing values in column c
    print(f"{c}: {i}")

Null records 

author: 0
work_count: 0
fan_count: 0
sex: 0
author_avg_rating: 0
author_rating_count: 0
author_review_count: 0
author_country: 49626


In [28]:
authors['author_country'].value_counts()

united states                  16718
united kingdom                  4425
canada                          1276
france                          1006
germany                          924
                               ...  
solomon islands                    1
french southern territories        1
micronesia                         1
djibouti                           1
mauritania                         1
Name: author_country, Length: 210, dtype: int64

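*Observation*: Author country is missing for 49,626 records, and the known values are heavily imbalanced: the United States alone accounts for 16,718 authors, while many of the 210 countries appear only once.

*Action*: We will drop the author country column from the analysis rather than dropping the records with null values. This keeps all authors in the dataset and means we make no inference based on country.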
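
One way to put numbers on both problems is to compute the share of missing values and the share of each country among the known values. The following is a minimal sketch on a hypothetical toy column; on the real data the same two calls would be applied to `authors['author_country']`.

```python
import pandas as pd

# Hypothetical toy column standing in for authors['author_country'].
country = pd.Series(
    ["united states", "united states", "united states", "canada",
     None, None, None, None],
    name="author_country",
)

# Fraction of rows with no country at all.
missing_share = country.isnull().mean()

# Share of each country among the non-null rows (value_counts skips nulls by default).
country_share = country.value_counts(normalize=True)

print(f"missing: {missing_share:.0%}")
print(country_share.head())
```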

In [29]:
authors.drop(columns = 'author_country', inplace = True)

#### Quality check for unique authors

In [30]:
authors_deduped = authors.groupby(by = ['author','sex'], as_index = False).agg({'work_count':'max',
                                                                                'fan_count':'max',
                                                                                'author_avg_rating':'max',
                                                                                'author_rating_count':'max',
                                                                                'author_review_count':'max'})

In [31]:
authors_deduped[authors_deduped['author']=='alexis hall']

| | author | sex | work_count | fan_count | author_avg_rating | author_rating_count | author_review_count |
|---|---|---|---|---|---|---|---|
| 2328 | alexis hall | female | 15 | 10 | 3.52 | 112 | 21 |
| 2329 | alexis hall | male | 33 | 1918 | 4.1 | 23489 | 5633 |


In [32]:
authors_deduped[['author','work_count']].drop_duplicates().shape

(87656, 2)

In [33]:
authors_deduped['author'].value_counts()

jamie ivey           2
robin hardy          2
lee weeks            2
j.m. winchester      2
robin palmer         2
                    ..
maría baranda        1
john everson         1
lawrence a. machi    1
momoko koda          1
vint virga           1
Name: author, Length: 87639, dtype: int64

In [34]:
authors_deduped[authors_deduped['author']=='robin hardy']

| | author | sex | work_count | fan_count | author_avg_rating | author_rating_count | author_review_count |
|---|---|---|---|---|---|---|---|
| 69417 | robin hardy | female | 36 | 34 | 4.22 | 1691 | 119 |
| 69418 | robin hardy | male | 23 | 4 | 3.77 | 754 | 88 |


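After grouping by author and sex, a few names (for example alexis hall and robin hardy) still appear twice, once per sex. These could be different people who share a name, so we leave them in place for now. Let's merge this data with the books dataset and then check for duplicate authors there.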
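
Names that occur more than once can be listed directly instead of reading them off `value_counts()`. This is a sketch under assumptions: the `deduped` frame below is a hypothetical stand-in for `authors_deduped`, and its values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for authors_deduped; values are illustrative only.
deduped = pd.DataFrame({
    "author": ["robin hardy", "robin hardy", "john everson"],
    "sex": ["female", "male", "male"],
    "work_count": [36, 23, 10],
})

# keep=False marks every row of a repeated name, not just the later occurrences.
ambiguous = deduped[deduped["author"].duplicated(keep=False)]
n_ambiguous_names = ambiguous["author"].nunique()

print(ambiguous)
print(n_ambiguous_names)
```

Keeping these rows in a separate frame makes it easy to inspect them again after the merge with the books dataset.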

#### Data Download

In [35]:
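# Save processed authors data ------------------------------------------------------
# index = False keeps the row index out of the file, so it does not come back as an "Unnamed: 0" column
authors_deduped.to_csv('C:\\Users\\juhic\\OneDrive\\Desktop\\authors_processed.csv', index = False)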
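
By default, `to_csv` writes the row index as an extra first column with no name, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the two behaviours; the two-row frame and the temporary file names are hypothetical and only serve the illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame; the temporary paths are for illustration only.
df = pd.DataFrame({"author": ["jane doe", "john smith"], "work_count": [3, 5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    no_index = os.path.join(tmp, "no_index.csv")

    df.to_csv(with_index)             # default: index written as an unnamed first column
    df.to_csv(no_index, index=False)  # index omitted from the file

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(no_index).columns.tolist()

print(cols_with)
print(cols_without)
```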