# Recommender Systems - Data Preparation

<font size=4>Advanced Data Analysis Techniques</font>

> Daute Rodríguez Rodríguez

Running this notebook successfully requires the following files:

* goodreads_interactions.csv
* goodreads_book_authors.json
* book_id_map.csv
* goodreads_book_genres_initial.json
* goodreads_books.json

They can be downloaded from [https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home).

## Loading libraries

In [0]:
import json
import numpy as np
import pandas as pd

## Reading the data

### Ratings

In [0]:
ratingsDf = pd.read_csv('goodreads_interactions.csv')
print(ratingsDf.shape)
ratingsDf.head()

(228648342, 5)


Unnamed: 0,user_id,book_id,is_read,rating,is_reviewed
0,0,948,1,5,0
1,0,947,1,5,1
2,0,946,1,5,0
3,0,945,1,5,0
4,0,944,1,5,0


Unneeded variables are removed, along with the records in which the user has not rated the corresponding book. Specifically, the following steps are performed:

* Drop the is_reviewed column
* Remove the rows in which the boolean is_read is false
* Drop the is_read column

In [0]:
ratingsDf.drop(['is_reviewed'], axis=1, inplace=True)
ratingsDf.drop(ratingsDf[ratingsDf['is_read'] == 0 ].index, inplace=True)
ratingsDf.drop(['is_read'], axis=1, inplace=True)
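
With more than 228 million rows, loading the interactions file in a single call may exhaust memory on a modest machine. The sketch below applies the same three steps chunk by chunk. It runs on a small in-memory sample rather than the real file; the sample rows, the helper name `load_read_ratings` and the chunk size are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the columns of goodreads_interactions.csv.
sample_csv = """user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,0,0,0
1,946,1,3,1
1,945,0,0,0
2,944,1,4,0
"""


def load_read_ratings(source, chunksize=2):
    """Read the CSV in chunks, keeping only rows with is_read == 1."""
    kept = []
    # is_reviewed is never loaded because it is not listed in usecols.
    columns = ['user_id', 'book_id', 'is_read', 'rating']
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        chunk = chunk[chunk['is_read'] == 1]
        kept.append(chunk.drop(columns=['is_read']))
    return pd.concat(kept, ignore_index=True)


sample_ratings = load_read_ratings(io.StringIO(sample_csv))
print(sample_ratings)
```

Unlike the in-place version, this sketch renumbers the index (`ignore_index=True`), so the row labels differ from those of the original file.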

### Authors

The author data is stored in JSON format (one JSON object per line). For each author, the name and the identifier are kept.

In [0]:
authors = {'author_id': [], 'name': []}

# Stream the file line by line; blank lines (e.g. a trailing newline) are
# skipped because json.loads('') raises an error.
with open('goodreads_book_authors.json') as data:
    for line in data:
        if not line.strip():
            continue
        author = json.loads(line)
        authors['author_id'].append(author['author_id'])
        authors['name'].append(author['name'])

authorsDf = pd.DataFrame(authors, columns = ['author_id', 'name'])
print(authorsDf.shape)
authorsDf.head()

(829529, 2)


Unnamed: 0,author_id,name
0,604031,Ronald J. Fields
1,626222,Anita Diamant
2,10333,Barbara Hambly
3,9212,Jennifer Weiner
4,149918,Nigel Pennick
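
One detail of the one-JSON-per-line layout is worth illustrating: splitting a text on `"\n"` leaves an empty trailing element when the text ends with a newline, and `json.loads('')` fails on it. The sketch below shows a tolerant parser on a two-record sample. The sample text and the helper name `parse_json_lines` are illustrative assumptions.

```python
import json

import pandas as pd

# Hypothetical two-record sample in the one-JSON-object-per-line layout.
sample_text = (
    '{"author_id": "604031", "name": "Ronald J. Fields"}\n'
    '{"author_id": "626222", "name": "Anita Diamant"}\n'
)


def parse_json_lines(lines):
    """Yield one dict per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:  # json.loads('') would raise JSONDecodeError
            yield json.loads(line)


# split('\n') leaves a trailing empty string when the text ends with a newline.
records = list(parse_json_lines(sample_text.split('\n')))
sample_authors = pd.DataFrame(records, columns=['author_id', 'name'])
print(sample_authors)
```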


### Books

As with the authors, the book information is stored in a file in which each line is a JSON object. The identifiers also need to be converted: the books file and the interactions file use different book identifiers, and book_id_map.csv maps one to the other.

In [0]:
idsDf = pd.read_csv('book_id_map.csv')
idsDf.set_index('book_id', inplace=True)
idsMap = idsDf.to_dict('index')
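
To make the shape of `idsMap` concrete, the following sketch builds the same structure from a few made-up rows. The identifier values are hypothetical; the column names `book_id` and `book_id_csv` are the ones the notebook's code relies on. A flat dictionary is shown alongside as a possible alternative, not as the notebook's approach.

```python
import pandas as pd

# Hypothetical rows; the real book_id_map.csv provides at least these two
# columns, as implied by the lookup idsMap[...]['book_id_csv'] used later.
sample_ids = pd.DataFrame({'book_id_csv': [0, 1, 2],
                           'book_id': [34684622, 34536488, 34017076]})

# Nested form, as built in the notebook: {book_id: {'book_id_csv': value}}
nested_map = sample_ids.set_index('book_id').to_dict('index')

# Flat alternative: {book_id: book_id_csv}
flat_map = sample_ids.set_index('book_id')['book_id_csv'].to_dict()

print(nested_map[34536488]['book_id_csv'])
print(flat_map[34536488])
```

Both lookups return the same value; the flat form simply avoids the inner dictionary per book.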

In [0]:
bookGenres = {}
with open('goodreads_book_genres_initial.json') as data:
    for line in data:
        book = json.loads(line)
        currentBookGenres = []
        for key in book['genres'].keys():
            aux = key.split(',')
            aux = list(map(lambda value: value.strip(), aux))
            for genre in aux:
                currentBookGenres.append(genre)
        bookGenres[book['book_id']] = currentBookGenres
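
The loop above flattens genre keys that bundle several genres into one comma-separated string. A minimal sketch of that step in isolation follows. The record is a made-up example in the shape the loop expects, and `split_genres` is a hypothetical helper name.

```python
def split_genres(genre_counts):
    """Flatten keys such as 'history, historical fiction, biography'."""
    genres = []
    for key in genre_counts:
        genres.extend(part.strip() for part in key.split(','))
    return genres


# Hypothetical record in the shape the loop above expects.
sample_book = {
    'book_id': '5333265',
    'genres': {'history, historical fiction, biography': 1, 'fiction': 3},
}

sample_genres = split_genres(sample_book['genres'])
print(sample_genres)
```

Note that the counts attached to each key are discarded; only the genre names are kept.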

In [0]:
books = {'book_id': [], 'title': [], 'format': [], 'authors': [], 'size': [], 'publicationYear': [], 'genres': []}

with open('goodreads_books.json') as data:
    for line in data:
        book = json.loads(line)
        try:
            realId = book['book_id']
            # Build the full row first: a KeyError raised halfway through a
            # series of appends would leave the lists with different lengths.
            row = {
                'book_id': idsMap[int(realId)]['book_id_csv'],
                'title': book['title'],
                'format': book['format'],
                'authors': ','.join(author['author_id'] for author in book['authors']),
                'size': book['num_pages'],
                'publicationYear': book['publication_year'],
                'genres': ','.join(bookGenres[realId]),
            }
        except KeyError:
            # Skip books without an id mapping, a genre entry or an expected field.
            continue
        for column, value in row.items():
            books[column].append(value)

In [0]:
booksDf = pd.DataFrame(books, columns = ['book_id', 'title', 'format', 'authors', 'size', 'publicationYear', 'genres'])
booksDf.shape

(2360650, 7)

In [0]:
for column in booksDf.columns:
    booksDf.loc[booksDf[column] == "", column] = np.nan

booksDf.head()



Unnamed: 0,book_id,title,format,authors,size,publicationYear,genres
0,1950356.0,W.C. Fields: A Life on Film,Paperback,604031,256.0,1984.0,"history,historical fiction,biography"
1,2084644.0,Good Harbor,Audio CD,626222,,2001.0,"fiction,history,historical fiction,biography"
2,740362.0,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",Hardcover,10333,600.0,1987.0,"fantasy,paranormal,fiction,mystery,thriller,cr..."
3,14854.0,Best Friends Forever,Hardcover,9212,368.0,2009.0,"fiction,romance,mystery,thriller,crime"
4,979469.0,Runic Astrology: Starcraft and Timekeeping in ...,,149918,,,non-fiction
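
Missing values in the books file arrive as empty strings, which pandas does not count as null. A minimal sketch of the conversion on made-up rows follows. It uses `DataFrame.replace` as a one-call alternative to the column loop, and the sample frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with empty strings where the source JSON had no value.
sample_books = pd.DataFrame({
    'book_id': [10, 11, 12],
    'format': ['Paperback', '', 'Hardcover'],
    'size': ['256', '', ''],
})

# Empty strings become NaN; the integer column has none and is left as is.
cleaned = sample_books.replace('', np.nan)

print(cleaned.isna().sum())
```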


## Handling null values

Because so much data is available, every observation in the books dataset with a null value in any variable is discarded. The ratings and authors dataframes are then updated accordingly.

In [0]:
def GetDataVolume(ratingsDf, authorsDf, booksDf):
    print('Ratings:', ratingsDf.shape)
    print('Authors:', authorsDf.shape)
    print('Books:', booksDf.shape)

In [0]:
GetDataVolume(ratingsDf, authorsDf, booksDf)

Ratings: (112131203, 3)
Authors: (829529, 2)
Books: (2360650, 7)


In [0]:
booksDf.isna().sum()

book_id                 0
title                   7
format             646754
authors               537
size               764131
publicationYear    599624
genres             409513
dtype: int64

In [0]:
booksDf.dropna(inplace=True)

In [0]:
ratingsDf = ratingsDf[ratingsDf['book_id'].isin(booksDf['book_id'])]

In [0]:
authors = set()
booksDf['authors'].apply(lambda element: authors.update(element.split(',')))
authorsDf = authorsDf[authorsDf['author_id'].isin(authors)]

In [0]:
GetDataVolume(ratingsDf, authorsDf, booksDf)

Ratings: (84411022, 3)
Authors: (472220, 2)
Books: (1230089, 7)
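
The three steps above (drop incomplete books, then restrict ratings and authors to the surviving books) can be traced on a miniature example. All three frames below are made up for illustration, and `Series.str.split(...).explode()` is used here as an alternative to the `apply`-based set update.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the three dataframes.
toy_books = pd.DataFrame({
    'book_id': [1, 2, 3],
    'authors': ['10,11', '12', '13'],
    'format': ['Paperback', np.nan, 'Hardcover'],
})
toy_ratings = pd.DataFrame({
    'user_id': [0, 0, 1, 2],
    'book_id': [1, 2, 2, 3],
    'rating': [5, 3, 4, 2],
})
toy_authors = pd.DataFrame({
    'author_id': ['10', '11', '12', '13'],
    'name': ['A', 'B', 'C', 'D'],
})

# 1. Drop books with any missing value (book 2 has no format).
toy_books = toy_books.dropna()

# 2. Keep only ratings that refer to a surviving book.
toy_ratings = toy_ratings[toy_ratings['book_id'].isin(toy_books['book_id'])]

# 3. Keep only authors credited on a surviving book.
kept_author_ids = set(toy_books['authors'].str.split(',').explode())
toy_authors = toy_authors[toy_authors['author_id'].isin(kept_author_ids)]

print(len(toy_ratings), len(toy_authors), len(toy_books))
```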


In [0]:
rCopy = ratingsDf.copy()
aCopy = authorsDf.copy()
bCopy = booksDf.copy()

In [0]:
ratingsDf = rCopy.copy()
authorsDf = aCopy.copy()
booksDf = bCopy.copy()

## Dimensionality reduction

Given the large amount of data available, its dimensionality needs to be reduced. To that end, the following operations are performed:

* Keep only users whose number of ratings lies between a minimum and a maximum
* Keep only books that have been rated at least a given number of times

Both filters are applied repeatedly until every remaining user and book satisfies them, since removing books changes the users' rating counts and vice versa.

In [0]:
def ReduceDimensionality(ratingsDf, minUserRatings=20, maxUserRatings=50, minBooksRating=20):
    userCounts = ratingsDf['user_id'].value_counts()
    bookCounts = ratingsDf['book_id'].value_counts()    
    usersWithFewRatings = (userCounts < minUserRatings).any()
    usersWithManyRatings = (userCounts > maxUserRatings).any()
    booksWithFewRatings = (bookCounts < minBooksRating).any()

    while (usersWithFewRatings or usersWithManyRatings or booksWithFewRatings):
        if usersWithFewRatings:
            users = userCounts < minUserRatings
            users = set(users[users].index.values)
            ratingsDf = ratingsDf[~ratingsDf['user_id'].isin(users)]
        
        if usersWithManyRatings:
            users = userCounts > maxUserRatings
            users = set(users[users].index.values)
            ratingsDf = ratingsDf[~ratingsDf['user_id'].isin(users)]
        
        if booksWithFewRatings:
            books = bookCounts < minBooksRating
            books = set(books[books].index.values)
            ratingsDf = ratingsDf[~ratingsDf['book_id'].isin(books)]
            
        userCounts = ratingsDf['user_id'].value_counts()
        bookCounts = ratingsDf['book_id'].value_counts()
        usersWithFewRatings = (userCounts < minUserRatings).any()
        usersWithManyRatings = (userCounts > maxUserRatings).any()
        booksWithFewRatings = (bookCounts < minBooksRating).any()

    return ratingsDf
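
The loop is needed because a single pass is not enough: removing a rarely rated book can push one of its readers below the minimum, which in turn lowers other books' counts. The compact sketch below shows this cascade on made-up ratings. `filter_until_stable` is a hypothetical helper that follows the same rule as `ReduceDimensionality`, and the thresholds and data are illustrative only.

```python
import pandas as pd


def filter_until_stable(ratings, min_user, max_user, min_book):
    """Repeat the user and book filters until no row is removed."""
    while True:
        user_counts = ratings['user_id'].value_counts()
        book_counts = ratings['book_id'].value_counts()
        good_users = user_counts[(user_counts >= min_user)
                                 & (user_counts <= max_user)].index
        good_books = book_counts[book_counts >= min_book].index
        kept = ratings[ratings['user_id'].isin(good_users)
                       & ratings['book_id'].isin(good_books)]
        if len(kept) == len(ratings):
            return kept
        ratings = kept


# Hypothetical ratings: user 3 only fails after book 102 has been removed,
# and user 4 rates too many books.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
    'book_id': [100, 101, 100, 101, 100, 102, 100, 101, 103, 104],
    'rating':  [5, 4, 3, 5, 4, 2, 5, 5, 1, 3],
})

result = filter_until_stable(toy, min_user=2, max_user=3, min_book=2)
print(result)
```

In this sample the first pass drops user 4 and books 102 to 104, and only the second pass drops user 3, who is left with a single rating.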

In [0]:
ratingsDf = ReduceDimensionality(ratingsDf, 5, 8, 10)

books = ratingsDf['book_id']
booksDf = booksDf[booksDf['book_id'].isin(books)]
authors = set()
booksDf['authors'].apply(lambda element: authors.update(element.split(',')))
authorsDf = authorsDf[authorsDf['author_id'].isin(authors)]

In [0]:
GetDataVolume(ratingsDf, authorsDf, booksDf)

Ratings: (75444, 3)
Authors: (903, 2)
Books: (1101, 7)


## Saving the data

In [0]:
suffix = '-5-8-10'

In [0]:
import os
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
authorsDf.to_csv(f'data/authors{suffix}.csv')
booksDf.to_csv(f'data/books{suffix}.csv')
ratingsDf.to_csv(f'data/ratings{suffix}.csv')
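
Because `to_csv` writes the index by default, reading these files back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. A small sketch of the round trip follows. It uses a temporary directory and made-up ratings, so the file name and values are assumptions rather than the notebook's real outputs.

```python
import os
import tempfile

import pandas as pd

toy_ratings = pd.DataFrame(
    {'user_id': [7, 7, 9], 'book_id': [100, 101, 100], 'rating': [5, 4, 3]},
    index=[12, 40, 57],  # non-contiguous index, as left behind by filtering
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'ratings-sample.csv')
    toy_ratings.to_csv(path)  # the index is written as an unnamed first column

    with_extra_column = pd.read_csv(path)      # index shows up as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)  # first column used as the index

print(list(with_extra_column.columns))
print(restored.index.tolist())
```

Passing `index_col=0` when loading is one way to recover the original row labels; writing with `index=False` is another if the labels are not needed downstream.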