<a href="https://colab.research.google.com/github/AnIsAsPe/Recomendaci-n-de-libros-usando-LDA/blob/main/Notebooks/Recomendaci%C3%B3n_de_libros_usando_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2> Problem:</h2>

The goal is to build a book recommendation system based on the books' summaries and on the topics extracted from them.

To that end, we use the [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html).

# Libraries and functions

In [None]:
!pip install pyLDAvis  # library that extracts information from an LDA model to build an interactive visualization

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)

In [None]:
# Reading the data
import csv
import json

import pandas as pd
import numpy as np
from collections import Counter  # to count frequencies

# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import pyLDAvis
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

<h2> Functions</h2>

# Loading and exploring the data

In [None]:
data = []

with open("/content/drive/MyDrive/Datos/BookSummaries/booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in reader:
        data.append(row)

In [None]:
len(data)

16559

In [None]:
title, author, genre, summary = [], [], [], []
for row in data:
    title.append(row[2])
    author.append(row[3])
    # the genre column holds a JSON object whose values are the genre names;
    # it is empty when the genre is unknown
    genre.append(list(json.loads(row[5]).values()) if row[5] else [''])
    summary.append(row[6])

df = pd.DataFrame({'Title': title, 'Author': author,
                   'Genre': genre, 'Summary': summary})
print(df.shape)
df.head(3)


(16559, 4)


Unnamed: 0,Title,Author,Genre,Summary
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca..."
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan..."
2,The Plague,Albert Camus,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...
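
The parsing step can be tried without the full file. The following is a minimal sketch on two made-up rows: the row contents and the helper name `parse_genres` are hypothetical, and only columns 2, 3, 5 and 6 are read, as in the loop above.

```python
import csv
import io
import json

# Hypothetical rows in a 7-column, tab-separated layout; columns 0, 1 and 4 are placeholders
sample_rows = [
    ['1', 'id-a', 'Book A', 'Author A', '1999',
     '{"g1": "Satire", "g2": "Fiction"}', 'Summary of book A.'],
    ['2', 'id-b', 'Book B', '', '', '', 'Summary of book B.'],
]
sample = '\n'.join('\t'.join(r) for r in sample_rows)


def parse_genres(field):
    """Return the genre names stored in the JSON field, or [''] when it is empty."""
    return list(json.loads(field).values()) if field else ['']


reader = csv.reader(io.StringIO(sample), dialect='excel-tab')
books = [
    {'Title': row[2], 'Author': row[3],
     'Genre': parse_genres(row[5]), 'Summary': row[6]}
    for row in reader
]

print(books[0]['Genre'])
print(books[1]['Genre'])
```

A book without genre information ends up with the placeholder list `['']`, which is why an empty string shows up later among the genre counts.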


In [None]:
df[['Title', 'Author']].nunique()


Title     16277
Author     4715
dtype: int64

In [None]:
df.Title.value_counts().head()

Nemesis     6
Outcast     4
Haunted     4
Inferno     4
The Gift    3
Name: Title, dtype: int64

Why do some titles have more than one summary? Inspecting one of them shows that the title is shared by different books:

In [None]:
df[df.Title=='Nemesis']

Unnamed: 0,Title,Author,Genre,Summary
375,Nemesis,Isaac Asimov,"[Science Fiction, Speculative fiction, Childre...",The novel is set in an era in which interstel...
3499,Nemesis,Agatha Christie,"[Crime Fiction, Mystery, Children's literature...",Miss Marple receives a post card from the rec...
5157,Nemesis,Scott Ciencin,"[Speculative fiction, Horror]",One of Fred's old friends from graduate schoo...
6159,Nemesis,Jo Nesbø,[Crime Fiction],A bank robbery is committed by a lone robber ...
13696,Nemesis,Philip Roth,[],Nemesis explores the effect of a 1944 polio e...
13842,Nemesis,,[],"The story, set in Latium in AD 77, opens with..."


How many categories does the `Genre` variable have?

In [None]:
genre_dict = {}
for i in df["Genre"]:
    for j in i:
        if j not in genre_dict:
            genre_dict[j] = 1
        else:
            genre_dict[j] += 1
frec_genre = Counter(genre_dict)

print('Distinct genres: {}\n '.format(len(frec_genre)))

frec_genre.most_common(10)

Distinct genres: 228
 


[('Fiction', 4747),
 ('Speculative fiction', 4314),
 ('', 3718),
 ('Science Fiction', 2870),
 ('Novel', 2463),
 ('Fantasy', 2413),
 ("Children's literature", 2122),
 ('Mystery', 1396),
 ('Young adult literature', 825),
 ('Suspense', 765)]
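
In the output above, the empty string is counted as if it were a genre. As a sketch of an alternative, `Counter` can consume the flattened genre lists directly and the placeholder can be separated from the real genres. The sample lists and the names `freq` and `missing` are hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical genre lists, shaped like the values of df['Genre']
genres = [['Fiction', 'Novel'], ['Fiction'], [''], ['Fantasy', 'Fiction'], ['']]

# Count every genre occurrence across all books in one pass
freq = Counter(chain.from_iterable(genres))

# Remove the empty placeholder and keep its count separately
missing = freq.pop('', 0)

print('Distinct genres:', len(freq))
print('Books without genre:', missing)
print(freq.most_common(1))
```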

Note that 3,718 summaries have no genre information.

In [None]:
# summary length, in words
df['len Summary'] = df['Summary'].apply(lambda x: len(str(x).split()))

df['len Summary'].describe()

count    16559.000000
mean       429.202126
std        500.339692
min          1.000000
25%        120.000000
50%        263.000000
75%        569.000000
max      10334.000000
Name: len Summary, dtype: float64

In [None]:
df[df['len Summary']<10].sort_values('len Summary')

Unnamed: 0,Title,Author,Genre,Summary,len Summary
16531,Guardians of Ga'Hoole Book 4: The Siege,Helen Dunmore,"[Speculative fiction, Fantasy, Historical novel]",==Receptio,1
11215,Chucaro: Wild Pony of the Pampa,Francis Kalnay,[Children's literature],==Reference,1
5879,The Caverns of Kalte,Joe Dever,"[Gamebook, Speculative fiction, Fantasy, Child...",==Receptio,1
5693,The Deathlord of Ixia,John Grant,"[Gamebook, Speculative fiction, Children's lit...",==Receptio,1
5972,The Eyes of Darkness,Dean Koontz,"[Speculative fiction, Horror, Fiction, Romance...",==Character,1
...,...,...,...,...,...
13201,Archform: Beauty,"L. E. Modesitt, Jr.",[Science Fiction],Archform: Beauty is set in 24th century Earth.,8
9689,"The Princess Diaries, Volume VII and 3/4: Vale...",Meg Cabot,[Young adult literature],Mia and Michael share Valentine's Day togethe...,9
12201,The Temple of the Ten,H. Bedford-Jones,[Fantasy],The novel adventures in the realms of Prester...,9
12856,The Sword of Aldones,Marion Zimmer Bradley,[Science Fiction],The novel concerns involved intrigue on the p...,9


In [None]:
df = df[df['len Summary']>=10].copy()
df.sort_values('len Summary')

Unnamed: 0,Title,Author,Genre,Summary,len Summary
11879,The Abyss of Wonders,Perley Poore Sheehan,[Science Fiction],The novel concerns a lost race in the Gobi De...,10
11849,Seeds of Life,Eric Temple Bell,[Science Fiction],The novel concerns the creation of a superman...,10
6405,Bullet Time,David A. McIntee,[Science Fiction],Sarah Jane Smith encounters the Seventh Docto...,10
10887,Stone Tables,Orson Scott Card,"[History, Fiction]",Stone Tables is a novelization of the life of...,10
12401,Yellow Fog,Les Daniels,"[Speculative fiction, Horror]",The novel concerns the vampire Don Sebastian ...,10
...,...,...,...,...,...
14214,March to the Stars,John Ringo,[Science Fiction],The story opens in the restored city of Voita...,6560
12494,Dawkins vs. Gould,Kim Sterelny,[],In the introductory chapter the author points...,7182
14672,Fire World,Chris D'Lacey,[Fantasy],It opens on the planet Co:pern:ica with Couns...,7958
518,"The History of Tom Jones, a Foundling",Henry Fielding,"[Fiction, Novel]",The novel's events occupy eighteen books. Squ...,9055


# Extracting the main topics

## Text vectorization

In [None]:
# Compile the stop-word pattern and create the stemmer once, not on every call
STOP_RE = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
STEMMER = PorterStemmer()

def preprocesar(texto):
  # convert to lowercase
  texto = texto.lower()

  # remove stop words
  texto = STOP_RE.sub('', texto)

  # replace punctuation and digits with spaces
  texto = re.sub('[^ña-z]+', ' ', texto)

  # stem each word (Porter stemmer), keeping only words with at least three characters
  texto = ' '.join([STEMMER.stem(i) for i in texto.split() if len(i)>2])

  return texto
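
To see what the regular-expression steps do on their own, here is a small sketch that leaves out the stemmer and NLTK. The four-word stop list, the function name `limpiar` and the sample sentence are hypothetical; the notebook itself uses `stopwords.words('english')`.

```python
import re

# Hypothetical, tiny stop list standing in for stopwords.words('english')
STOP_WORDS = ['the', 'on', 'a', 'of']
stop = re.compile(r'\b(' + '|'.join(STOP_WORDS) + r')\b\s*')


def limpiar(texto):
    texto = texto.lower()                    # lowercase
    texto = stop.sub('', texto)              # drop stop words and the spaces after them
    texto = re.sub('[^ña-z]+', ' ', texto)   # punctuation and digits become one space
    # keep only words with at least three characters
    return ' '.join(tok for tok in texto.split() if len(tok) > 2)


resultado = limpiar("Old Major, the old boar on the Manor Farm, calls an ox at 12!")
print(resultado)  # old major old boar manor farm calls
```

The short words `an`, `ox` and `at` are not in the stop list, yet they disappear because of the length filter.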

In [None]:
df['Summary_pp'] = df['Summary'].apply(preprocesar)
df.head()

Unnamed: 0,Title,Author,Genre,Summary,len Summary,Summary_pp
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",957,old major old boar manor farm call anim farm m...
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",998,alex teenag live near futur england lead gang ...
2,The Plague,Albert Camus,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,1119,text plagu divid five part town oran thousand ...
3,An Enquiry Concerning Human Understanding,David Hume,[],The argument of the Enquiry proceeds by a ser...,2825,argument enquiri proce seri increment step sep...
4,A Fire Upon the Deep,Vernor Vinge,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,722,novel posit space around milki way divid conce...


In [None]:
# unigrams and bigrams that appear in at least 10 summaries and in at most 10% of them
vectorizer = CountVectorizer(min_df=10, max_df=0.10, ngram_range=(1,2))
BOW = vectorizer.fit_transform(df['Summary_pp'])
BOW.shape

(16496, 37247)

In [None]:
vocabulario = vectorizer.get_feature_names_out()
len(vocabulario)

37247
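
The vocabulary size is driven by `min_df`, `max_df` and `ngram_range`. The sketch below mimics that filtering in plain Python on a toy corpus, assuming whitespace tokenization; the documents, thresholds and helper name `ngrams` are hypothetical, and this is an illustration of the idea, not scikit-learn's implementation.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Unigrams plus space-joined n-grams up to n_max, in the spirit of ngram_range=(1, n_max)."""
    out = []
    for n in range(1, n_max + 1):
        out += [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out


# Hypothetical preprocessed documents
docs = ['old farm anim', 'farm anim rebel', 'old king battl', 'king battl farm']

# Document frequency: number of documents that contain each term at least once
doc_freq = Counter(term for d in docs for term in set(ngrams(d.split())))

min_df, max_df = 2, 0.5              # an absolute count and a proportion of documents
max_count = max_df * len(docs)

# Keep terms that are neither too rare nor too common
vocab = sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)
print(vocab)
```

Here `farm` occurs in three of the four documents and is dropped as too common, while bigrams seen only once are dropped as too rare.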

## Training the model

The optimal number of topics depends on the characteristics of the corpus, such as the length of the texts and how many distinct ideas they cover.

Even so, some metrics, such as perplexity or topic coherence, help to choose $k$.

In [None]:
k = 10

In [None]:

lda_model = LatentDirichletAllocation(n_components=k, learning_method='online',
                                      random_state=42, max_iter=50) 

In [None]:
%%time
lda_model.fit(BOW) # train the model

CPU times: user 6min 50s, sys: 1min 58s, total: 8min 49s
Wall time: 6min 46s


LatentDirichletAllocation(learning_method='online', max_iter=50,
                          random_state=42)

### Topic distribution for each summary ($\theta$)

In [None]:
doc_top = pd.DataFrame(lda_model.transform(BOW))
print(doc_top.shape)
doc_top.head()

(16496, 10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.215131,0.134591,0.014022,0.018184,0.0605,0.12834,0.035139,0.121208,0.084215,0.18867
1,0.206781,0.020721,0.000187,0.000187,0.018205,0.000187,0.100419,0.02601,0.627115,0.000187
2,0.271411,0.08317,0.014507,0.057841,0.105605,0.000194,0.143514,0.199848,0.114129,0.009781
3,0.76399,7.9e-05,0.109207,0.022806,0.029919,7.9e-05,0.002289,7.9e-05,0.071472,7.9e-05
4,0.000272,0.000272,0.463323,0.000272,0.479507,0.000272,0.000272,0.000272,0.007345,0.048194


In [None]:
doc_top.sum(axis=1)

0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
        ... 
16491    1.0
16492    1.0
16493    1.0
16494    1.0
16495    1.0
Length: 16496, dtype: float64
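
Each row of the document-topic matrix is a probability distribution, so a simple way to read it is to look at the topic with the largest share. A minimal sketch with NumPy, reusing rows 0, 1 and 4 of the table printed above (rounded values, so the sums are only approximately 1):

```python
import numpy as np

# Rows 0, 1 and 4 of the document-topic matrix shown above
theta = np.array([
    [0.215131, 0.134591, 0.014022, 0.018184, 0.0605,
     0.12834, 0.035139, 0.121208, 0.084215, 0.18867],
    [0.206781, 0.020721, 0.000187, 0.000187, 0.018205,
     0.000187, 0.100419, 0.02601, 0.627115, 0.000187],
    [0.000272, 0.000272, 0.463323, 0.000272, 0.479507,
     0.000272, 0.000272, 0.000272, 0.007345, 0.048194],
])

row_sums = theta.sum(axis=1)      # each close to 1
dominant = theta.argmax(axis=1)   # index of the topic with the largest proportion

print(row_sums.round(3))
print(dominant)
```

The last row is a reminder that the dominant topic can hide a mixture: topics 2 and 4 carry almost the same weight there.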

In [None]:
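# df keeps its original row labels after filtering, while doc_top is numbered 0..n-1,
# so reset the index to align both tables by position
df_lda = pd.merge(df.reset_index(drop=True), doc_top, left_index=True, right_index=True)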
df_lda.head(2)

Unnamed: 0,Title,Author,Genre,Summary,len Summary,0,1,2,3,4,5,6,7,8,9
0,Animal Farm,George Orwell,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",957,0.215131,0.134591,0.014022,0.018184,0.0605,0.12834,0.035139,0.121208,0.084215,0.18867
1,A Clockwork Orange,Anthony Burgess,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",998,0.206781,0.020721,0.000187,0.000187,0.018205,0.000187,0.100419,0.02601,0.627115,0.000187


### Word distribution for each topic ($\mu$)

In [None]:
μs = pd.DataFrame(lda_model.exp_dirichlet_component_,
                         columns=vocabulario)
print(μs.shape)
μs

(10, 37247)


Unnamed: 0,aaron,aback,abandon,abandon child,abandon church,abandon citi,abandon famili,abandon farm,abandon group,abandon home,...,zoe,zoey,zola,zombi,zombi like,zone,zoo,zuckerman,zulu,zurich
0,6.55361e-11,6.575006e-11,0.000243,6.545777e-11,6.54106e-11,3.288741e-05,2.388161e-05,6.573345e-11,6.554218e-11,6.570583e-11,...,6.543511e-11,6.544801e-11,7.255484e-05,6.550433e-11,6.545155e-11,6.554761e-11,6.562192e-11,6.553318e-11,7.57303e-06,6.550983e-11
1,0.0007025162,1.484435e-10,0.000255,1.487027e-10,1.482216e-10,1.49371e-10,1.483484e-10,1.485381e-10,1.262603e-05,1.491897e-10,...,1.482229e-10,1.48179e-10,1.48219e-10,1.484207e-10,1.485186e-10,1.484634e-10,1.482912e-10,1.481327e-10,0.0001149664,1.485156e-10
2,8.44364e-11,8.449317e-11,0.000447,8.429517e-11,8.428923e-11,8.633624e-11,8.428704e-11,8.453545e-11,8.44069e-11,8.455685e-11,...,8.420432e-11,8.451737e-11,8.417244e-11,8.447097e-11,1.344257e-05,3.128272e-05,8.436388e-11,8.42139e-11,8.421698e-11,8.415767e-11
3,9.136459e-11,2.358386e-05,0.000536,1.659349e-05,9.138472e-11,9.132191e-11,9.140445e-11,9.124645e-11,9.128102e-11,1.321716e-05,...,9.12331e-11,9.121048e-11,9.135255e-11,9.124103e-11,9.117275e-11,9.129039e-11,9.132063e-11,9.121401e-11,9.132578e-11,7.408956e-05
4,7.684005e-11,7.707053e-11,0.000575,7.660247e-11,7.671193e-11,7.692268e-11,7.674248e-11,7.716572e-11,1.716959e-05,7.677925e-11,...,7.668808e-11,7.658514e-11,7.66073e-11,7.689035e-11,7.728166e-11,0.0006320289,1.020222e-05,7.662925e-11,7.672196e-11,7.678593e-11
5,2.050839e-10,2.045477e-10,0.000116,2.044061e-10,2.045876e-10,2.04428e-10,2.049206e-10,2.045421e-10,2.046677e-10,2.056383e-10,...,0.0008784326,2.04913e-10,2.043714e-10,2.047341e-10,2.056575e-10,2.046805e-10,9.650596e-05,2.042701e-10,2.043821e-10,2.04614e-10
6,1.029271e-10,1.030658e-10,0.000538,1.029091e-10,1.923886e-05,1.040002e-10,1.028656e-10,1.030588e-10,1.034413e-10,1.036997e-10,...,1.028123e-10,0.000700572,1.027177e-10,1.028953e-10,1.030548e-10,1.028523e-10,1.028867e-10,1.02749e-10,1.028944e-10,1.029806e-10
7,7.335595e-11,5.169479e-05,0.000825,1.15083e-05,7.330654e-11,7.332062e-11,7.342953e-11,3.119239e-05,7.332922e-11,2.021456e-05,...,7.318662e-11,7.327348e-11,7.318146e-11,7.322074e-11,7.329533e-11,7.32499e-11,1.856184e-05,7.324334e-11,7.31446e-11,7.325505e-11
8,0.0001265502,7.160248e-11,0.000186,7.152498e-11,7.15975e-11,7.144051e-11,3.484761e-05,7.156313e-11,7.163583e-11,7.197139e-11,...,7.152426e-11,7.142848e-11,7.148391e-11,7.156978e-11,1.114659e-05,7.154445e-11,7.16819e-11,8.229663e-05,7.147023e-11,7.168499e-11
9,1.350605e-10,1.352306e-10,0.000496,1.34894e-10,2.088335e-05,1.356124e-10,1.350209e-10,1.352622e-10,1.36051e-10,1.35247e-10,...,1.350792e-10,1.349123e-10,1.348213e-10,0.001209596,1.352541e-10,1.35072e-10,0.0005912947,1.348502e-10,1.349481e-10,1.349447e-10
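
The rows of `exp_dirichlet_component_` are not guaranteed to sum to 1, which does not matter for ranking words within a topic. If proper probabilities are wanted, one common option is to normalize the rows of `components_`. A minimal sketch with NumPy, in which the pseudo-count matrix and the four-word vocabulary are hypothetical stand-ins for `lda_model.components_` and `vocabulario`:

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary,
# standing in for lda_model.components_
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [0.2, 0.3, 6.0, 3.5]])
vocab = np.array(['king', 'armi', 'ship', 'planet'])

# Divide each row by its total so that every topic becomes a distribution over words
mu = components / components.sum(axis=1, keepdims=True)

# Two highest-probability words of each topic
top_words = vocab[np.argsort(-mu, axis=1)[:, :2]]

print(mu.sum(axis=1))
print(top_words)
```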


In [None]:
for top in range(k):
  # ten highest-weight tokens of the topic
  tokMasFrec = μs.loc[top].sort_values(ascending=False).head(10).index
  print('Topic {}: {}'.format(top, ', '.join(tokMasFrec)))

Topic 0: chapter, describ, narrat, author, societi, differ, polit, histori, experi, write
Topic 1: king, armi, battl, fight, princ, command, captur, empir, emperor, assassin
Topic 2: magic, dragon, king, god, destroy, dark, creatur, vampir, land, demon
Topic 3: marriag, husband, mr, letter, ladi, sir, london, england, sister, money
Topic 4: ship, earth, planet, destroy, system, space, unit, crew, alien, doctor
Topic 5: school, boy, harri, david, wolf, rachel, teacher, student, jake, simon
Topic 6: men, villag, jack, tom, dead, prison, manag, soldier, taken, camp
Topic 7: children, say, feel, sister, boy, parent, town, child, stay, year old
Topic 8: polic, investig, case, york, new york, paul, car, bodi, offic, job
Topic 9: cat, island, dog, anim, fli, water, ship, red, captain, jim


### Model visualization

In [None]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(lda_model, BOW, vectorizer)



### Saving the model

In [None]:
import pickle
path = '/content/drive/MyDrive/Modelos/modelosLDA/LDA Books/'
tuple_models = (lda_model, BOW, vectorizer)
# the with block closes the file even if pickling fails
with open(path + "tuple_model_books_k10.pkl", 'wb') as f:
    pickle.dump(tuple_models, f)

In [None]:
import pickle
path = '/content/drive/MyDrive/Modelos/modelosLDA/LDA Books/'
with open(path + "tuple_model_books_k10.pkl", 'rb') as f:
    lda_model, BOW, vectorizer = pickle.load(f)


## Recommendation system using cosine similarity

In [None]:

def similitud_coseno(a_vector, b_vector):
    '''Compute the cosine similarity between vectors a and b'''

    numerador = np.dot(a_vector, b_vector)

    a_norm = np.sqrt(np.sum(a_vector**2))
    b_norm = np.sqrt(np.sum(b_vector**2))

    denominador = a_norm * b_norm

    return numerador / denominador
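
For reference, the function above evaluates the standard definition of cosine similarity between two topic vectors $a$ and $b$ with $k$ components each:

$$
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{\sum_{i=1}^{k} a_i b_i}{\sqrt{\sum_{i=1}^{k} a_i^2}\,\sqrt{\sum_{i=1}^{k} b_i^2}}
$$

Because topic proportions are non-negative, the value lies between 0 and 1, and it equals 1 when two summaries have the same topic mix.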

In [None]:
def documentos_similares(titulo):
  # df keeps its original row labels after filtering, while doc_top is numbered
  # 0..n-1, so work on a copy of df whose index matches the rows of doc_top
  libros = df.reset_index(drop=True)
  inx = libros[libros['Title']==titulo].index[0]
  q_k = doc_top.loc[inx].values
  n = doc_top.shape[0]
  similaridad = {}

  # Compute the cosine similarity against every other document
  for doc_inx in range(n):
      if doc_inx == inx:
          continue
      similaridad[doc_inx] = similitud_coseno(q_k, doc_top.loc[doc_inx].values)

  # Sort from most to least similar and keep the first five
  rank = {doc: sim for doc, sim in sorted(similaridad.items(), key=lambda x: x[1],
                                          reverse=True)}
  top5 = pd.DataFrame.from_dict(rank, orient = 'index', columns=['sim_cos']).head()
  recomendaciones = pd.merge(libros.iloc[:,0:3], top5, how='right', right_index=True, left_index=True)
  recomendaciones.index = np.arange(1, 6)
  return recomendaciones
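
The loop in `documentos_similares` calls `similitud_coseno` once per document. As a sketch of a faster alternative, the same ranking can be obtained with a single matrix-vector product. The function name `top_similares` and the toy matrix are hypothetical, and rows are addressed by position, not by DataFrame label.

```python
import numpy as np


def top_similares(theta, query_pos, n=5):
    """Return positions and cosine similarities of the n rows most similar to the query row."""
    theta = np.asarray(theta, dtype=float)
    norms = np.linalg.norm(theta, axis=1)
    # cosine similarity of every row against the query row, in one product
    sims = theta @ theta[query_pos] / (norms * norms[query_pos])
    sims[query_pos] = -np.inf            # exclude the query document itself
    order = np.argsort(-sims)[:n]        # positions sorted from most to least similar
    return order, sims[order]


# Hypothetical document-topic matrix: 4 documents, 3 topics
theta = np.array([[0.8, 0.1, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.7, 0.1]])

orden, sims = top_similares(theta, query_pos=0, n=2)
print(orden, sims.round(3))
```

With the notebook's data, `theta` would be `doc_top.values` and the returned positions could be looked up with `df.iloc`.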

  

In [None]:
documentos_similares('Dune')

Unnamed: 0,Title,Author,Genre,sim_cos
1,Children of Dune,Frank Herbert,"[Science Fiction, Speculative fiction, Childre...",0.98876
2,Passage,Lois McMaster Bujold,"[Speculative fiction, Fantasy]",0.98733
3,Rushing to Paradise,J. G. Ballard,"[Speculative fiction, Fantasy, Fiction, Novel]",0.986713
4,Damned,Chuck Palahniuk,[],0.986615
5,The Return of Conan,Björn Nyberg,"[Sword and sorcery, Speculative fiction, Fantasy]",0.98384


In [None]:
documentos_similares('The Time Machine')

Unnamed: 0,Title,Author,Genre,sim_cos
1,The Crystal World,J. G. Ballard,"[Science Fiction, Speculative fiction, Fiction...",0.983638
2,The Sound of His Horn,,"[Alternate history, Science Fiction, Dystopia]",0.971643
3,Vulcan!,Kathleen Sky,[Speculative fiction],0.967763
4,The Charnel Prince,Gregory Keyes,"[Speculative fiction, Fantasy]",0.961541
5,Anvil of Stars,Greg Bear,"[Science Fiction, Speculative fiction, Fiction]",0.959628


In [None]:
documentos_similares('Harry Potter and the Chamber of Secrets')

Unnamed: 0,Title,Author,Genre,sim_cos
1,Harry Potter and the Philosopher's Stone,J. K. Rowling,"[Children's literature, Fantasy, Speculative f...",0.997807
2,Harry Potter and the Goblet of Fire,J. K. Rowling,[Speculative fiction],0.980242
3,A Right to Die,Rex Stout,"[Mystery, Detective fiction, Fiction, Suspense]",0.970375
4,The Hot Kid,Elmore Leonard,"[Thriller, Mystery, Fiction, Novel]",0.969886
5,Harry Potter and the Order of the Phoenix,J. K. Rowling,"[Fantasy, Young adult literature, Fiction]",0.969733
