<h2> Problem:</h2>

The goal is to build a book recommendation system based on the books' summaries and their topics.

To that end, we use the [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html).

# Libraries and functions

In [None]:
!pip install pyLDAvis  # library that extracts information from an LDA model to build an interactive visualization

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install scikit-learn==1.0.2  # pinned for compatibility with pyLDAvis 3.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Reading the data
import csv
import json

import pandas as pd
import numpy as np
from collections import Counter # for counting frequencies

# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Visualization
import pyLDAvis
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import sklearn
sklearn.__version__
# 1.0.2

'1.0.2'

In [None]:
pyLDAvis.__version__

'3.4.0'

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Reading and exploring the data

In [None]:
data = []

with open("/content/drive/MyDrive/Datos/BookSummaries/booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in reader:
        data.append(row)

In [None]:
len(data)

16559

In [None]:
title = []
author = []
genre = []
summary = []
for row in data:
    title.append(row[2])    # column 2: book title
    author.append(row[3])   # column 3: author
    # column 5: genres stored as a JSON dict (Freebase ID -> genre name); may be empty
    if row[5] == '':
        genre.append([''])
    else:
        genre.append(list(json.loads(row[5]).values()))
    summary.append(row[6])  # column 6: plot summary

df = pd.DataFrame({'Title': title, 'Author': author,
                   'Genre': genre, 'Summary': summary})
print(df.shape)
df.head(3)


(16559, 4)


Unnamed: 0,Title,Author,Genre,Summary
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca..."
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan..."
2,The Plague,Albert Camus,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16559 entries, 0 to 16558
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Title    16559 non-null  object
 1   Author   16559 non-null  object
 2   Genre    16559 non-null  object
 3   Summary  16559 non-null  object
dtypes: object(4)
memory usage: 517.6+ KB


In [None]:
df[['Title', 'Author']].nunique()


Title     16277
Author     4715
dtype: int64

In [None]:
df.Title.value_counts().head()

Nemesis     6
Outcast     4
Haunted     4
Inferno     4
The Gift    3
Name: Title, dtype: int64

Why do some titles have more than one summary?

In [None]:
df[df.Title=='Inferno']

Unnamed: 0,Title,Author,Genre,Summary
1174,Inferno,Jerry Pournelle,"[Science Fiction, Speculative fiction, Fiction...",Inferno is based upon the hell described in D...
7339,Inferno,Alexander C. Irvine,"[Science Fiction, Speculative fiction]","A former firefighter, now an explosives exper..."
7482,Inferno,Troy Denning,"[Science Fiction, Speculative fiction, Fiction]","Jacen Solo, now the Sith lord Darth Caedus, c..."
7957,Inferno,August Strindberg,[Autobiographical novel],"The narrator (ostensibly Strindberg, although..."
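
The rows above suggest that a repeated title corresponds to different books by different authors rather than to duplicated records. A minimal sketch of how this could be checked on any DataFrame with `Title` and `Author` columns follows; the toy frame `libros` and the variable names are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data (assumption): two different books share the title "Inferno"
libros = pd.DataFrame({
    'Title':  ['Inferno', 'Inferno', 'Nemesis', 'Dune'],
    'Author': ['Jerry Pournelle', 'Troy Denning', 'Isaac Asimov', 'Frank Herbert'],
})

# Titles that appear more than once
repeated_titles = libros['Title'].value_counts().loc[lambda s: s > 1]

# Rows whose (Title, Author) pair is repeated, i.e. candidate true duplicates
true_duplicates = libros[libros.duplicated(subset=['Title', 'Author'], keep=False)]

print(repeated_titles)
print(len(true_duplicates))
```

On the real data the same two expressions could be applied to `df`; an empty `true_duplicates` would indicate that repeated titles are simply homonymous books.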


How many categories does the Genre variable have?

In [None]:
# Count how many books carry each genre label
frec_genre = Counter(g for generos in df["Genre"] for g in generos)

print('Distinct genres: {}\n '.format(len(frec_genre)))

frec_genre.most_common(10)

Distinct genres: 228
 


[('Fiction', 4747),
 ('Speculative fiction', 4314),
 ('', 3718),
 ('Science Fiction', 2870),
 ('Novel', 2463),
 ('Fantasy', 2413),
 ("Children's literature", 2122),
 ('Mystery', 1396),
 ('Young adult literature', 825),
 ('Suspense', 765)]

Note that 3,718 summaries have no genre information (they appear under the empty label `''`).

In [None]:
# number of whitespace-separated words in each summary
df['len Summary'] = df['Summary'].apply(lambda x: len(str(x).split()))

df['len Summary'].describe()

count    16559.000000
mean       429.202126
std        500.339692
min          1.000000
25%        120.000000
50%        263.000000
75%        569.000000
max      10334.000000
Name: len Summary, dtype: float64

In [None]:
df[df['len Summary']<10].sort_values('len Summary')

Unnamed: 0,Title,Author,Genre,Summary,len Summary
16531,Guardians of Ga'Hoole Book 4: The Siege,Helen Dunmore,"[Speculative fiction, Fantasy, Historical novel]",==Receptio,1
11215,Chucaro: Wild Pony of the Pampa,Francis Kalnay,[Children's literature],==Reference,1
5879,The Caverns of Kalte,Joe Dever,"[Gamebook, Speculative fiction, Fantasy, Child...",==Receptio,1
5693,The Deathlord of Ixia,John Grant,"[Gamebook, Speculative fiction, Children's lit...",==Receptio,1
5972,The Eyes of Darkness,Dean Koontz,"[Speculative fiction, Horror, Fiction, Romance...",==Character,1
...,...,...,...,...,...
13201,Archform: Beauty,"L. E. Modesitt, Jr.",[Science Fiction],Archform: Beauty is set in 24th century Earth.,8
9689,"The Princess Diaries, Volume VII and 3/4: Vale...",Meg Cabot,[Young adult literature],Mia and Michael share Valentine's Day togethe...,9
12201,The Temple of the Ten,H. Bedford-Jones,[Fantasy],The novel adventures in the realms of Prester...,9
12856,The Sword of Aldones,Marion Zimmer Bradley,[Science Fiction],The novel concerns involved intrigue on the p...,9


In [None]:
# keep only summaries with at least 10 words (many shorter ones are leftover section headings)
df = df[df['len Summary']>=10].copy().reset_index(drop=True)
df.sort_values('len Summary')

Unnamed: 0,Title,Author,Genre,Summary,len Summary
11840,The Abyss of Wonders,Perley Poore Sheehan,[Science Fiction],The novel concerns a lost race in the Gobi De...,10
11810,Seeds of Life,Eric Temple Bell,[Science Fiction],The novel concerns the creation of a superman...,10
6395,Bullet Time,David A. McIntee,[Science Fiction],Sarah Jane Smith encounters the Seventh Docto...,10
10853,Stone Tables,Orson Scott Card,"[History, Fiction]",Stone Tables is a novelization of the life of...,10
12356,Yellow Fog,Les Daniels,"[Speculative fiction, Horror]",The novel concerns the vampire Don Sebastian ...,10
...,...,...,...,...,...
14161,March to the Stars,John Ringo,[Science Fiction],The story opens in the restored city of Voita...,6560
12448,Dawkins vs. Gould,Kim Sterelny,[],In the introductory chapter the author points...,7182
14619,Fire World,Chris D'Lacey,[Fantasy],It opens on the planet Co:pern:ica with Couns...,7958
518,"The History of Tom Jones, a Foundling",Henry Fielding,"[Fiction, Novel]",The novel's events occupy eighteen books. Squ...,9055


In [None]:
df.loc[12401,'Summary']

' In the aftermath of World War II, the island of Okinawa was occupied by the American military. Captain Fisby, a young army officer, is transferred to a tiny Okinawa island town called Tobiki by his Commanding Officer Colonel Purdy. Fisby is tasked with the job of implementing “Plan B”. The plan calls for teaching the natives all things American and the first step for Capt. Fisby is to establish a democratically elected Mayor, Chief of Agriculture, Chief of Police and President of the Ladies League for Democratic Action. Plan “B” also calls for the building of a schoolhouse (Pentagon shaped), democracy lessons and establishing capitalism through means left up to the good Captain’s judgment. A local Tobiki native, Sakini by name, is assigned to act as Fisby’s interpreter. Sakini, a Puck-like character, attempts to acquaint Fisby with the local customs as well as guide the audiences through the play providing both historical and cultural framework through his asides and monologues. Afte

# Extracting the main topics

## Text vectorization

In [None]:
# Build the stop-word pattern and the stemmer once instead of on every call
stop = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
st = PorterStemmer()

def preprocesar(texto):
  # convert to lowercase
  texto = texto.lower()

  # remove stop words
  texto = stop.sub('', texto)

  # replace punctuation and digits with spaces
  texto = re.sub('[^ña-z]+', ' ', texto)

  # stem (Porter stemmer, not lemmatization) and keep tokens with at least three characters
  texto = texto.split()
  texto = ' '.join([st.stem(i) for i in texto if len(i)>2])

  return texto

In [None]:
df['Summary_pp'] = df['Summary'].apply(preprocesar)
df.head()

Unnamed: 0,Title,Author,Genre,Summary,len Summary,Summary_pp
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",957,old major old boar manor farm call anim farm m...
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",998,alex teenag live near futur england lead gang ...
2,The Plague,Albert Camus,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,1119,text plagu divid five part town oran thousand ...
3,An Enquiry Concerning Human Understanding,David Hume,[],The argument of the Enquiry proceeds by a ser...,2825,argument enquiri proce seri increment step sep...
4,A Fire Upon the Deep,Vernor Vinge,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,722,novel posit space around milki way divid conce...


In [None]:
# unigrams and bigrams that occur in at least 10 documents and in at most 10% of them
vectorizer = CountVectorizer(min_df=10, max_df=0.10, ngram_range=(1,2))
BOW = vectorizer.fit_transform(df['Summary_pp'])
BOW.shape

(16496, 37122)

In [None]:
vocabulario = vectorizer.get_feature_names_out()
len(vocabulario)

37122

In [None]:
vocabulario

array(['aaron', 'aback', 'abandon', ..., 'zuckerman', 'zulu', 'zurich'],
      dtype=object)

## Training the model

The optimal number of topics depends on the characteristics of the corpus (the length of the texts, the number of distinct ideas they cover).

Nevertheless, some metrics, such as held-out perplexity or topic coherence, help in choosing k.
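
As a minimal sketch of one such metric (not the procedure used in this notebook), the code below fits scikit-learn's `LatentDirichletAllocation` for a few candidate values of k and compares the perplexity on held-out documents, where lower is better. The synthetic count matrix, the candidate values, and the names `X_train`, `X_test`, and `perplexity_by_k` are assumptions for illustration; with the real data one would split `BOW` instead.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term count matrix (assumption): 60 documents, 30 terms
X = rng.poisson(lam=1.0, size=(60, 30))
X_train, X_test = X[:45], X[45:]

perplexity_by_k = {}
for k_candidate in (2, 4, 6):
    model = LatentDirichletAllocation(n_components=k_candidate,
                                      random_state=42, max_iter=10)
    model.fit(X_train)
    # perplexity() is computed from the approximate likelihood bound; lower is better
    perplexity_by_k[k_candidate] = model.perplexity(X_test)

for k_candidate, value in perplexity_by_k.items():
    print(k_candidate, round(value, 2))
```

Perplexity does not always agree with human judgments of topic quality, so it is usually read together with an inspection of the top words of each topic.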

In [None]:
k = 10

In [None]:

lda_model = LatentDirichletAllocation(n_components=k, learning_method='online',
                                      random_state=42, max_iter=50) 

In [None]:
%%time
lda_model.fit(BOW) # trains the model; the document-topic matrix is obtained afterwards with transform

CPU times: user 8min 58s, sys: 1min 40s, total: 10min 39s
Wall time: 9min 17s


LatentDirichletAllocation(learning_method='online', max_iter=50,
                          random_state=42)

### Topic distribution of each summary ($X$)

In [None]:
doc_top = pd.DataFrame(lda_model.transform(BOW))
print(doc_top.shape)
doc_top.head()

(16496, 10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.032797,0.051167,0.481432,0.1756,0.030791,0.000227,0.000227,0.138991,0.044412,0.044356
1,0.013268,0.000204,0.152536,0.000204,0.000204,0.154217,0.000204,0.1032,0.07289,0.503075
2,0.078117,0.083671,0.240716,0.071468,0.000222,0.088561,0.050103,0.228708,0.158212,0.000222
3,9.1e-05,9.1e-05,0.740925,0.027551,9.1e-05,0.061346,0.047996,0.078882,0.042938,9.1e-05
4,0.091719,0.021369,0.048427,0.000311,0.000311,0.008979,0.815066,0.013198,0.000311,0.000311


In [None]:
doc_top.sum(axis=1)

0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
        ... 
16491    1.0
16492    1.0
16493    1.0
16494    1.0
16495    1.0
Length: 16496, dtype: float64

In [None]:
df_lda = pd.merge(df, doc_top, left_index=True, right_index=True)
df_lda.head(2)

Unnamed: 0,Title,Author,Genre,Summary,len Summary,Summary_pp,0,1,2,3,4,5,6,7,8,9
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",957,old major old boar manor farm call anim farm m...,0.032797,0.051167,0.481432,0.1756,0.030791,0.000227,0.000227,0.138991,0.044412,0.044356
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",998,alex teenag live near futur england lead gang ...,0.013268,0.000204,0.152536,0.000204,0.000204,0.154217,0.000204,0.1032,0.07289,0.503075


### Word distribution of each topic ($\mu$)

In [None]:
μs = pd.DataFrame(lda_model.exp_dirichlet_component_,
                         columns=vocabulario)
print(μs.shape)
μs

(10, 37122)


Unnamed: 0,aaron,aback,abandon,abandon child,abandon church,abandon citi,abandon famili,abandon farm,abandon group,abandon home,...,zoe,zoey,zola,zombi,zombi like,zone,zoo,zuckerman,zulu,zurich
0,1.355498e-10,1.356356e-10,0.0008111187,1.356291e-10,1.358826e-10,1.35487e-10,1.354017e-10,1.359244e-10,4.357188e-05,1.357232e-10,...,1.35555e-10,1.352e-10,1.352381e-10,1.354298e-10,1.355846e-10,0.0001164478,1.984918e-05,1.352016e-10,1.851751e-05,1.357418e-10
1,0.0003070025,8.138796e-11,0.0009502357,2.065533e-05,8.11222e-11,8.140782e-11,8.121868e-11,3.451142e-05,8.109617e-11,8.112871e-11,...,0.0001542362,8.092416e-11,8.098876e-11,8.100732e-11,8.099291e-11,8.106464e-11,8.115582e-11,8.104724e-11,8.099705e-11,8.132636e-11
2,7.058952e-11,6.040138e-05,0.0003347842,7.044255e-11,7.043288e-11,7.076264e-11,7.072868e-11,7.065276e-11,7.0585e-11,7.078434e-11,...,7.046038e-11,7.041263e-11,5.481145e-05,7.050952e-11,7.049319e-11,2.166476e-05,7.079634e-05,7.059045e-11,7.047087e-11,7.053272e-11
3,0.0001348652,7.850058e-06,0.0004341682,7.280481e-11,7.360871e-11,3.655307e-05,7.277956e-11,7.310606e-11,7.282042e-11,2.253673e-05,...,7.28785e-11,7.273261e-11,7.271236e-11,1.255581e-05,7.291286e-11,7.281637e-11,7.30611e-11,7.27148e-11,7.303678e-11,7.274827e-11
4,4.843314e-10,4.8536e-10,4.994452e-10,4.834587e-10,4.843427e-10,4.836513e-10,4.838551e-10,4.841726e-10,4.841051e-10,4.838029e-10,...,4.846546e-10,0.00329703,4.83387e-10,4.84048e-10,4.845398e-10,4.837177e-10,4.8464e-10,4.834086e-10,4.853142e-10,4.853453e-10
5,1.737892e-10,1.74137e-10,0.0005768488,1.737303e-10,1.73856e-10,1.734864e-10,0.000150134,1.739269e-10,1.735084e-10,1.743208e-10,...,1.739074e-10,1.733114e-10,1.733429e-10,0.0005395855,5.698358e-05,1.735579e-10,5.434565e-05,1.734932e-10,1.739134e-10,1.737664e-10
6,1.323559e-10,1.319668e-10,0.0008714662,1.317152e-10,1.319137e-10,1.32696e-10,1.320224e-10,1.335075e-10,1.320693e-10,1.318225e-10,...,1.320955e-10,1.317033e-10,1.317111e-10,0.000745519,1.331664e-10,0.0009800475,0.000153489,1.318458e-10,1.317998e-10,1.318118e-10
7,8.986888e-11,9.011798e-11,0.0004028629,7.562188e-06,8.992275e-11,9.009651e-11,8.995602e-11,8.990773e-11,9.002228e-11,1.001535e-05,...,8.984491e-11,8.976231e-11,2.84847e-05,8.98808e-11,8.978677e-11,8.998324e-11,8.998222e-11,8.984001e-11,3.13706e-06,9.032393e-11
8,9.321567e-05,1.098619e-10,0.0001480112,1.097574e-10,1.09779e-10,1.101748e-10,1.102483e-10,1.109483e-10,1.100346e-10,1.098552e-10,...,1.099043e-10,1.096763e-10,1.098575e-10,1.098808e-10,1.098595e-10,1.10104e-10,1.099159e-10,1.099393e-10,7.761681e-05,2.676966e-05
9,8.032495e-11,8.045586e-11,0.0002617053,8.01615e-11,2.844392e-05,8.010711e-11,8.029343e-11,8.025329e-11,8.019729e-11,8.083726e-11,...,0.0001906665,8.006726e-11,8.00935e-11,8.026008e-11,8.033223e-11,8.027071e-11,0.0002082789,9.227851e-05,8.026694e-11,4.446689e-05
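
Note that `exp_dirichlet_component_` stores exp(E[log β]), so its rows are not probability distributions and need not sum to 1. The scikit-learn documentation describes `components_`, normalized row by row, as the topic-word distribution. The sketch below shows that normalization on a small made-up array that stands in for `lda_model.components_`; the values and the names `components` and `topic_word` are assumptions for illustration.

```python
import numpy as np

# Stand-in for lda_model.components_ (assumption): 2 topics x 4 terms of pseudo-counts
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [0.2, 0.8, 3.0, 6.0]])

# Divide each row by its sum so that every topic becomes a distribution over terms
topic_word = components / components.sum(axis=1, keepdims=True)

# Indices of the two heaviest terms of each topic, in decreasing order of weight
top_terms = np.argsort(topic_word, axis=1)[:, ::-1][:, :2]

print(topic_word.sum(axis=1))  # each row sums to 1
print(top_terms)
```

Within a topic, the ordering of the terms is the same under both quantities, because `exp_dirichlet_component_` is an increasing function of `components_` along each row. The lists of top words per topic are therefore unaffected by this choice; only the magnitudes differ.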


In [None]:
for top in range(k):
  print('\nHighest-weight words for topic {}'.format(top))
  tokMasFrec = μs.T.loc[:,top].sort_values(ascending=False).head(10).index
  for tok in tokMasFrec:
      print(tok)


Highest-weight words for topic 0
ship
island
captain
command
soldier
crew
sea
armi
prison
boat

Highest-weight words for topic 1
mr
john
david
mari
babi
peter
aunt
paul
spend
care

Highest-weight words for topic 2
chapter
narrat
author
natur
reader
discuss
god
self
women
view

Highest-weight words for topic 3
king
magic
dragon
lord
god
dark
armi
queen
forest
anim

Highest-weight words for topic 4
harri
simon
alic
luke
georg
ghote
freddi
val
jedi
poirot

Highest-weight words for topic 5
jack
tom
vampir
thoma
jim
room
laura
rachel
anita
sarah

Highest-weight words for topic 6
earth
planet
space
ship
doctor
alien
system
race
star
univers

Highest-weight words for topic 7
marriag
ladi
sir
letter
london
arthur
england
king
henri
richard

Highest-weight words for topic 8
govern
unit
american
polit
nation
unit state
presid
countri
militari
soviet

Highest-weight words for topic 9
polic
car
job
case
york
new york
miss
alex
room
apart


### Model visualization

In [None]:
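# The scikit-learn helper is pyLDAvis.lda_model from release 3.4.0 on, pyLDAvis.sklearn before
try:
    import pyLDAvis.lda_model as pyldavis_sk
except ImportError:
    import pyLDAvis.sklearn as pyldavis_sk
pyLDAvis.enable_notebook()

pyldavis_sk.prepare(lda_model, BOW, vectorizer)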



### Saving the model

In [None]:
# import pickle
# path = '/content/drive/MyDrive/Modelos/modelosLDA/LDA Books/'
# tuple_models = (lda_model, BOW, vectorizer)
# pickle.dump(tuple_models, open (path + "tuple_model_books_k10.pkl", 'wb'))

In [None]:
# import pickle
# path = '/content/drive/MyDrive/Modelos/modelosLDA/LDA Books/'
# lda_model, BOW, vectorizer = pickle.load(open(path + "tuple_model_books_k10.pkl", 'rb'))


## Recommendation system using cosine similarity

In [None]:

def similitud_coseno(a_vector, b_vector):
    '''Computes the cosine similarity between vectors a and b'''

    numerador = np.dot(a_vector, b_vector)
    
    a_norm = np.sqrt(np.sum(a_vector**2))  
    b_norm = np.sqrt(np.sum(b_vector**2))
    
    denominador = a_norm * b_norm
    
    similitud_coseno = numerador / denominador 
    
    return similitud_coseno

In [None]:
def documentos_similares(titulo):
  # index of the first book with this title (titles are not unique in the dataset)
  inx = df[df['Title']==titulo].index[0]
  q_k = doc_top.loc[inx].values  # topic distribution of the query book
  n = doc_top.shape[0]
  similaridad = {}

  # cosine similarity between the query book and every other document
  for doc_inx in range(n):
      if doc_inx == inx:
          continue
      similaridad[doc_inx] = similitud_coseno(q_k, doc_top.loc[doc_inx].values)

  # rank by decreasing similarity and keep the five best matches
  rank = {doc: sim for doc, sim in sorted(similaridad.items(), key=lambda x: x[1],
                                          reverse=True)}
  top5 = pd.DataFrame.from_dict(rank, orient='index', columns=['sim_cos']).head()
  # attach Title, Author and Genre to the selected documents
  recomendaciones = pd.merge(df.iloc[:,0:3], top5, how='right', right_index=True, left_index=True)
  recomendaciones.index = np.arange(1, 6)
  return recomendaciones
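
The function above computes one cosine similarity per document inside a Python loop. As a hedged alternative sketch (not the notebook's implementation), the same ranking can be obtained with a single matrix product after L2-normalizing the rows of the document-topic matrix. The function name `top_similar`, its parameters, and the toy matrix are assumptions for illustration.

```python
import numpy as np

def top_similar(doc_topic, query_idx, n=5):
    """Return (indices, similarities) of the n rows most similar to row query_idx."""
    # L2-normalize each row so that a dot product equals the cosine similarity
    unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf           # exclude the query document itself
    best = np.argsort(sims)[::-1][:n]   # indices sorted by decreasing similarity
    return best, sims[best]

# Toy document-topic matrix (assumption): 4 documents, 3 topics
toy = np.array([[0.8, 0.1, 0.1],
                [0.7, 0.2, 0.1],
                [0.1, 0.1, 0.8],
                [0.2, 0.7, 0.1]])

indices, similarities = top_similar(toy, query_idx=0, n=2)
print(indices, similarities)
```

With the notebook's objects, one would pass `doc_top.values` as `doc_topic` and the row index looked up from `df['Title']` as `query_idx`, then use the returned indices to select rows of `df`.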

  

In [None]:
documentos_similares('Dune')

Unnamed: 0,Title,Author,Genre,sim_cos
1,The Mandalorian Armor,K. W. Jeter,"[Science Fiction, Speculative fiction]",0.977255
2,Balance Point,Kathy Tyers,"[Science Fiction, Speculative fiction, Fantasy]",0.957015
3,The Fight for Truth,Judy Blundell,"[Science Fiction, Speculative fiction, Childre...",0.954514
4,The Clone Wars,Karen Traviss,[Science Fiction],0.950693
5,Invincible,Troy Denning,"[Science Fiction, Speculative fiction]",0.948524


In [None]:
documentos_similares('The Time Machine')

Unnamed: 0,Title,Author,Genre,sim_cos
1,The Flames: A Fantasy,Olaf Stapledon,"[Science Fiction, Novel]",0.993844
2,The Great Romance,,"[Science Fiction, Speculative fiction, Utopian...",0.987241
3,Bones of the Earth,Michael Swanwick,"[Science Fiction, Speculative fiction]",0.985852
4,Isaac Asimov's Robot City: Prodigy,Isaac Asimov,[Science Fiction],0.985713
5,Die Vecna Die!,Steve Miller,[Role-playing game],0.985452


In [None]:
documentos_similares('Harry Potter and the Chamber of Secrets')

Unnamed: 0,Title,Author,Genre,sim_cos
1,Harry Potter and the Philosopher's Stone,J. K. Rowling,"[Children's literature, Fantasy, Speculative f...",0.992247
2,Harry Potter and the Goblet of Fire,J. K. Rowling,[Speculative fiction],0.986299
3,Harry Potter and the Prisoner of Azkaban,J. K. Rowling,"[Fantasy, Speculative fiction, Young adult lit...",0.982123
4,Harry Potter and the Order of the Phoenix,J. K. Rowling,"[Fantasy, Young adult literature, Fiction]",0.981412
5,Harry Potter and the Half-Blood Prince,J. K. Rowling,"[Fantasy, Young adult literature, Fiction]",0.979495
