## Authors Processing

This Jupyter notebook cleans the raw authors CSV data in three basic steps:

1. Selecting only desired features (raw/derived)
2. String cleaning
3. Handling missing values

#### Import Libraries

In [1]:
import os
import pandas as pd

#### Data Upload

In [2]:
path = "C:\\Users\\juhic\\OneDrive\\Desktop\\goodreads_kaggle_authors.csv"
authors = pd.read_csv(path, usecols = ['name',
                                     'workcount',
                                     'gender',
                                     'average_rate',
                                     'rating_count',
                                     'review_count',
                                     'country'])


# Rename columns ---------------------------------------------------------
cols = {'name': 'author',
        'workcount': 'work_count',
        'average_rate': 'author_avg_rating',
        'rating_count': 'author_rating_count',
        'review_count': 'author_review_count',
        'country': 'author_country',
        'gender': 'sex'}
authors.rename(columns = cols, inplace = True)
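
The load-and-rename step above can be tried without the original file. The following is a minimal sketch on an in-memory CSV; the sample rows and the extra `authorid` column are assumptions made for illustration, not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# The "authorid" column is an assumed extra column that we do not want.
sample_csv = io.StringIO(
    "authorid,name,workcount,gender,average_rate\n"
    "1,Jason Wallace,2,male,3.74\n"
    "2,Rosan Hollak,4,unknown,3.73\n"
)

# usecols keeps only the listed columns; column order follows the file.
sample = pd.read_csv(
    sample_csv,
    usecols=["name", "workcount", "gender", "average_rate"],
)

# rename returns a new frame when inplace is not set.
sample = sample.rename(
    columns={
        "name": "author",
        "workcount": "work_count",
        "gender": "sex",
        "average_rate": "author_avg_rating",
    }
)

print(list(sample.columns))
# ['author', 'work_count', 'sex', 'author_avg_rating']
```

Assigning the result of `rename` back to the variable is equivalent to `inplace = True` here; which form to use is a matter of style.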

#### Data Overview I

In [3]:
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209517 entries, 0 to 209516
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   author               209517 non-null  object 
 1   work_count           209517 non-null  int64  
 2   sex                  209517 non-null  object 
 3   author_avg_rating    209517 non-null  float64
 4   author_rating_count  209517 non-null  int64  
 5   author_review_count  209517 non-null  int64  
 6   author_country       44599 non-null   object 
dtypes: float64(1), int64(3), object(3)
memory usage: 11.2+ MB


In [4]:
authors.head()

|   | author | work_count | sex | author_avg_rating | author_rating_count | author_review_count | author_country |
|---|---|---|---|---|---|---|---|
| 0 | Jason Wallace | 2 | male | 3.74 | 1028 | 175 | United Kingdom |
| 1 | Rosan Hollak | 4 | unknown | 3.73 | 15 | 1 | NaN |
| 2 | Nanna Foss | 6 | female | 4.35 | 1172 | 205 | NaN |
| 3 | Terri Savelle Foy | 23 | female | 4.56 | 1054 | 151 | NaN |
| 4 | Vishwas Nangare Patil | 1 | unknown | 4.15 | 725 | 43 | NaN |


#### String Cleaning

In [5]:
cols = ['author','sex','author_country']

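# lowercase and strip leading/trailing spaces, one column at a time
authors[cols] = authors[cols].apply(lambda x: x.str.lower().str.strip())

# collapse runs of internal spaces into a single space
# (regex = True is required: recent pandas treats the pattern as a literal string by default)
for c in cols:
    authors[c] = authors[c].str.replace(r" +", " ", regex = True)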
    

# Authors: Drop authors with unknown sex -------------------------------------------------
authors.drop(authors[authors['sex'] == 'unknown'].index, inplace = True)
authors.reset_index(drop = True, inplace = True)
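
The cleaning steps in this cell can be checked on a small example. This is a sketch on a toy frame whose values are made up for illustration; it chains the same three string operations and then filters out the `unknown` rows with a boolean mask, which is an alternative to `drop` with an index.

```python
import pandas as pd

# Toy frame; the messy spacing and casing are invented for the example.
toy = pd.DataFrame(
    {
        "author": ["  Jason   Wallace ", "Nanna Foss", "Rosan  Hollak"],
        "sex": ["Male", " female", "unknown"],
    }
)

for col in ["author", "sex"]:
    toy[col] = (
        toy[col]
        .str.lower()                          # lowercase
        .str.strip()                          # trim both ends
        .str.replace(r" +", " ", regex=True)  # collapse internal runs of spaces
    )

# Keep only rows whose sex is known, then renumber the index from 0.
toy = toy[toy["sex"] != "unknown"].reset_index(drop=True)

print(toy["author"].tolist())
# ['jason wallace', 'nanna foss']
```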



#### Handling Missing Values (WIP)

In [6]:
print("Null records \n")
for c in authors.columns:
    i = authors[c].isnull().sum()
    print(f"{c}: {i}")

Null records 

author: 0
work_count: 0
sex: 0
author_avg_rating: 0
author_rating_count: 0
author_review_count: 0
author_country: 49626
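
The same per-column count can be obtained without a loop. A minimal sketch on a toy frame (values invented for illustration): `isna().sum()` counts missing values per column and `isna().mean()` gives the missing share.

```python
import pandas as pd

# Toy frame with two missing countries out of four rows.
toy = pd.DataFrame(
    {
        "author": ["a", "b", "c", "d"],
        "author_country": ["canada", None, None, "france"],
    }
)

null_counts = toy.isna().sum()   # number of missing values per column
null_share = toy.isna().mean()   # fraction of missing values per column

print(null_counts)
print(null_share)
```

Looking at the share rather than the raw count makes it easier to judge how much data would be lost by dropping the affected rows.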


Dropping every record with a missing author country would discard 49,626 of the remaining 88,544 rows (about 56%), so let's review the column first:

In [7]:
authors['author_country'].value_counts()

united states             16718
united kingdom             4425
canada                     1276
france                     1006
germany                     924
                          ...  
andorra                       1
bhutan                        1
macao                         1
congo republic                1
svalbard and jan mayen        1
Name: author_country, Length: 210, dtype: int64

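*Observation*: Author country is heavily imbalanced. The United States alone has 16,718 records, almost four times as many as the next country (United Kingdom, 4,425), while many countries appear only once.

*Action*: We will drop the author country column instead of dropping the records with null values. As a result, the analysis makes no inference based on country.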
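
One way to quantify this kind of imbalance is to look at relative frequencies instead of raw counts. The sketch below uses a made-up series, not the real data; `value_counts(normalize=True)` returns each value's share of the non-null records, and `dropna=False` includes the missing values in the denominator.

```python
import pandas as pd

# Invented distribution: 6 x united states, 2 x united kingdom, 1 x canada, 1 missing.
countries = pd.Series(
    ["united states"] * 6 + ["united kingdom"] * 2 + ["canada", None],
    name="author_country",
)

# Shares among non-null records (missing values are excluded by default).
shares = countries.value_counts(normalize=True)

# Shares among all records, with missing values counted as their own group.
shares_all = countries.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_all)
```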

In [8]:
authors.drop(columns = 'author_country', inplace = True)

#### Data Overview II

In [9]:
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88544 entries, 0 to 88543
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   author               88544 non-null  object 
 1   work_count           88544 non-null  int64  
 2   sex                  88544 non-null  object 
 3   author_avg_rating    88544 non-null  float64
 4   author_rating_count  88544 non-null  int64  
 5   author_review_count  88544 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 4.1+ MB


In [10]:
authors.head()

|   | author | work_count | sex | author_avg_rating | author_rating_count | author_review_count |
|---|---|---|---|---|---|---|
| 0 | jason wallace | 2 | male | 3.74 | 1028 | 175 |
| 1 | nanna foss | 6 | female | 4.35 | 1172 | 205 |
| 2 | terri savelle foy | 23 | female | 4.56 | 1054 | 151 |
| 3 | phil hamman | 6 | male | 3.83 | 1340 | 193 |
| 4 | august turak | 3 | male | 4.34 | 274 | 49 |


#### Data Download

In [11]:
# Download processed authors data --------------------------------------------------
authors.to_csv('authors_processed.csv')
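
Whether the row index ends up in the file depends on the `index` argument of `to_csv`. With the default `index=True`, the index is written as an extra column with no header, which `read_csv` later labels `Unnamed: 0`. The sketch below shows both variants on a toy frame written to in-memory buffers; the data is invented, and which variant suits this project depends on how the file is read downstream.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"author": ["jason wallace", "nanna foss"], "work_count": [2, 6]}
)

# Default behaviour: the row index is written as a first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
# ['Unnamed: 0', 'author', 'work_count']
print(list(reloaded_without_index.columns))
# ['author', 'work_count']
```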