In [30]:
from core.goodreads import GoodReadsData
import gc

goodreads = GoodReadsData()
filename = goodreads.file_names[2]
filename

'goodreads_book_series'

In [31]:
gc.collect()

0

# Analysis of `goodreads_book_series`

In [32]:
import os
import pandas as pd
import numpy as np

In [33]:
# Download
if not os.path.exists(goodreads.get_file_path(filename)):
    goodreads.download_file(filename)
    
# Load
df = goodreads.load_file(filename)

### Analysis
1. Understand the data.
2. Detect `nan` values.
3. Detect duplicates.
4. Detect possible errors.
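The four checks above can be sketched on a small made-up frame. The `toy` data and its values are invented for illustration; only the column names come from the real file.

```python
import pandas as pd

# Toy data invented for illustration; column names mirror the real file.
toy = pd.DataFrame({
    "title": ["A", "", "C", "C"],
    "note": ["", "", "see wiki", ""],
    "series_id": [1, 2, 3, 3],
    "series_works_count": [3, -2, 5, 5],
})

# 1. Understand the data: one dtype per column.
dtypes = toy.dtypes

# 2. Missing values: real NaN plus empty strings used as placeholders.
nan_counts = toy.isna().sum()
empty_counts = (toy == "").sum()

# 3. Duplicated identifiers.
n_duplicated_ids = int(toy["series_id"].duplicated().sum())

# 4. Possible errors: a count should not be negative.
n_negative = int((toy["series_works_count"] < 0).sum())

print(dtypes, nan_counts, empty_counts, n_duplicated_ids, n_negative, sep="\n\n")
```

Counting empty strings separately matters here because `isna()` alone would report no missing values in the raw file.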

In [34]:
df.sample(5)

index,numbered,note,description,title,series_works_count,series_id,primary_work_count
363361,True,book order per author's website 8/28/12,,Valentine Submission,2,413652,2
13748,True,,The Legend of Drizzt is the overarching series...,The Legend of Drizzt,36,447501,31
130822,True,,,Boss Me,3,957947,3
162028,True,Seduced by Fire was originally part of the Par...,"With over 8 million reads online, Tara Sue Me'...",Submissive,13,830335,12
291923,True,,,Funf Freunde Horspiele,80,895707,80


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400390 entries, 0 to 400389
Data columns (total 7 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   numbered            400390 non-null  object
 1   note                400390 non-null  object
 2   description         400390 non-null  object
 3   title               400390 non-null  object
 4   series_works_count  400390 non-null  int64 
 5   series_id           400390 non-null  int64 
 6   primary_work_count  400390 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 21.4+ MB


In [36]:
for i in range(10):
    try:
        print(i, df[df["note"]!=""]["note"].iloc[i])
    except IndexError:  # fewer than 10 non-empty notes
        break

0 http://en.wikipedia.org/wiki/Foundation_Trilogy#List_of_books_within_the_Foundation_Universe
1 Please only keep directly related series in the quick links, otherwise it will turn into a mess.
2 According to Asimov, The Complete Robot was the first book in the series, although it did include all the stories in the earlier I, Robot and The Rest of the Robots. 
    Therefore the earlier collections might be considered precursors for the entire series.
    
    The remaining short story collections and The Positronic Man chronologically fall between The Complete Robot and the The Caves of Steel.
    One could also argue that the Tiedemann/Irvine Robot Mystery series as well as Allen's Caliban series belong in the series.
3 http://en.wikipedia.org/wiki/Foundation_Trilogy#List_of_books_within_the_Foundation_Universe
4 Better known with the English title
5 GR generally goes with the author's official series name (Will Trent); they are the same books, they should not have two different serie

In [37]:
for i in range(10):
    try:
        print(i, df[df["description"]!=""]["description"].iloc[i])
    except IndexError:  # fewer than 10 non-empty descriptions
        break

0 This series is also known as * Avalon : Jalinan Sihir (Bahasa Indonesia) See also the spin-off manga series .
1 Plot-wise, "Crowner's Crusade" is a prequel to the series, but #15 in publication order.
2 When seven women get struck by lightning, they'll find out fate isn't what they thought and now they will all turn into different paranormal creatures. Angels, leprechauns, pixies, brownies, djinn, dragons, bears shifters, and many others will come together when the magic of their blood is revealed. Humans aren't as alone as they choose to believe. Every human possesses a trait of supernatural that lays dormant within their genetic make-up. Centuries of diluting and breeding have allowed humans to think they are alone and untouched by magic. But what happens when something changes?
3 Patrick Grant, a professor and amateur sleuth, in Oxford, England:
4 Part of the . The Foundation series is a science fiction series by Isaac Asimov which covers a span of about 550 years. It consists o

In [38]:
(df == "").sum()

numbered                   0
note                  375111
description           249371
title                      6
series_works_count         0
series_id                  0
primary_work_count         0
dtype: int64

In [39]:
df["series_id"].duplicated().sum()

np.int64(0)

In [40]:
df.describe()

statistic,series_works_count,series_id,primary_work_count
count,400390.0,400390.0,400390.0
mean,21.588149,623045.0,19.771653
std,65.1031,294445.3,63.501377
min,-14.0,144392.0,0.0
25%,3.0,363737.2,3.0
50%,6.0,615837.0,5.0
75%,14.0,877564.8,12.0
max,893.0,1143859.0,893.0


`series_works_count` contains negative values (the minimum is -14), which cannot be valid counts. We replace them with their absolute value.
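Before taking absolute values it may be worth looking at the affected rows. This is a minimal sketch on invented data. It assumes (this is not stated anywhere in the dataset) that a plausible total should be at least `primary_work_count`.

```python
import pandas as pd

# Invented rows; column names mirror the real file.
toy = pd.DataFrame({
    "series_id": [10, 11, 12],
    "series_works_count": [5, -14, 3],
    "primary_work_count": [5, 14, 2],
})

# Rows whose total count is negative.
negative = toy["series_works_count"] < 0
print(toy[negative])

# Assumption: after flipping the sign, the total should not be smaller
# than the number of primary works.
flipped = toy.loc[negative, "series_works_count"].abs()
consistent = flipped >= toy.loc[negative, "primary_work_count"]
print(int(negative.sum()), bool(consistent.all()))

# The correction itself: absolute value of the whole column.
fixed = toy["series_works_count"].abs()
```

If most flipped values pass the consistency check, a sign error is a reasonable reading. If not, marking those rows as missing might be the safer choice.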

### Transformation
1. Correct errors.
2. Convert data types.
3. Set missing values to `nan`.

In [41]:
df["series_works_count"] = df["series_works_count"].abs()

We store boolean masks marking the empty strings that will later become `nan`.

In [42]:
n_nan = df["note"] == ""
d_nan = df["description"] == ""
t_nan = df["title"] == ""
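One thing to watch: these masks only match strings that are exactly empty, so a value made up solely of whitespace would not be flagged. Whether the file contains such values is not shown here. A sketch of the difference on invented data:

```python
import pandas as pd

# Invented values: one truly empty string and two whitespace-only strings.
notes = pd.Series(["", "  ", "see wiki", "\n"])

# Mask as computed in the notebook: exact match on the empty string.
exact_empty = notes == ""

# Mask computed after stripping: also catches whitespace-only values.
blank = notes.str.strip() == ""

print(int(exact_empty.sum()), int(blank.sum()))
```

Comparing the two sums on the real columns would show whether the order of the steps matters for this file.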

We prepare the data for the transformation, correcting possible errors.

In [43]:
# Assign the result back instead of using inplace=True on a column selection,
# which operates on a copy and is deprecated.
df["numbered"] = df["numbered"].replace({"true": 1, "false": 0})
df["numbered"] = df["numbered"].astype(np.uint8)

df["note"] = df["note"].str.strip()
df["description"] = df["description"].str.strip()
df["title"] = df["title"].str.strip()
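An alternative that avoids `replace` altogether is `Series.map` with an explicit dictionary. Unlike `replace`, it turns any value missing from the dictionary into `NaN`, which makes unexpected entries easy to spot. The sketch below uses invented values, including a hypothetical stray `"yes"`.

```python
import pandas as pd

# Invented values; "yes" stands in for an unexpected entry.
raw = pd.Series(["true", "false", "true", "yes"], dtype=object)

# Values not present in the dictionary become NaN.
mapped = raw.map({"true": True, "false": False})

# Entries that could not be mapped.
unexpected = raw[mapped.isna()]
print(unexpected.tolist())

# Convert only the successfully mapped entries to a real boolean dtype.
as_bool = mapped[mapped.notna()].astype(bool)
```

Note that `map` would also turn values that are already Python booleans into `NaN` unless they are added to the dictionary, so the mapping needs to list every representation present in the column.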


In [44]:
df["numbered"] = df["numbered"].astype(bool)
df["description"] = df["description"].astype("string")
df["note"] = df["note"].astype("string")
df["title"] = df["title"].astype("string")
df["series_id"] = df["series_id"].astype(np.uint32)
df["series_works_count"] = df["series_works_count"].astype(np.int16)
df["primary_work_count"] = df["primary_work_count"].astype(np.int16)
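Before narrowing integer dtypes it can help to confirm that the observed range fits the target type. The helper `fits` below is hypothetical (it is not part of the notebook), and the sample values are the minima and maxima reported by `df.describe()` above.

```python
import numpy as np
import pandas as pd

def fits(series: pd.Series, dtype) -> bool:
    """Return True if every value of `series` is representable in `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

# Extremes taken from the describe() output.
counts = pd.Series([0, 14, 893])
ids = pd.Series([144392, 1143859])

print(fits(counts, np.int16))   # counts against int16
print(fits(ids, np.uint32))     # ids against uint32
print(fits(ids, np.uint16))     # ids against uint16
```

Running such a check first avoids silent wrap-around, since `astype` to a smaller NumPy integer type does not always raise on overflow.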

In [45]:
df.loc[n_nan, "note"] = np.nan
df.loc[d_nan, "description"] = np.nan
df.loc[t_nan, "title"] = np.nan

df.isna().sum()

numbered                   0
note                  375111
description           249371
title                      6
series_works_count         0
series_id                  0
primary_work_count         0
dtype: int64

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400390 entries, 0 to 400389
Data columns (total 7 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   numbered            400390 non-null  bool  
 1   note                25279 non-null   string
 2   description         151019 non-null  string
 3   title               400384 non-null  string
 4   series_works_count  400390 non-null  int16 
 5   series_id           400390 non-null  uint32
 6   primary_work_count  400390 non-null  int16 
dtypes: bool(1), int16(2), string(3), uint32(1)
memory usage: 12.6 MB
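The `+` in the first `df.info()` report (`21.4+ MB`) means the size of the Python objects inside `object` columns was not counted. For a like-for-like comparison, `memory_usage(deep=True)` can be used. A sketch on an invented numeric frame, showing only the effect of the integer downcast:

```python
import numpy as np
import pandas as pd

# Invented frame with 1000 rows of int64 values.
before = pd.DataFrame({
    "series_id": np.arange(1000, dtype=np.int64),
    "series_works_count": np.arange(1000, dtype=np.int64),
})

# Same dtypes as chosen in the notebook for these two columns.
after = before.astype({"series_id": np.uint32, "series_works_count": np.int16})

# Bytes used by the columns only (index excluded).
bytes_before = int(before.memory_usage(index=False, deep=True).sum())
bytes_after = int(after.memory_usage(index=False, deep=True).sum())
print(bytes_before, bytes_after)
```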
