### <font color='red'>Objetivo do Projeto: Criar um Algoritmo de ML para Recomendação de Livros</font>

Link dataset: https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data

In [1]:
# Importação dos pacotes
import pandas as pd
import numpy as np
import sklearn
import numpy as np
import pandas as pd
from langdetect import detect
from ast import literal_eval
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import streamlit as st
import ipywidgets as widgets
from IPython.display import display
pd.options.mode.chained_assignment = None

### <font color='red'>Parte 1: Exploração e Limpeza dos Dados</font>

In [2]:
# Importando dados
df = pd.read_csv('data.csv')

In [3]:
# Primeiros valores do dataset
df.head(5)

Unnamed: 0.1,Unnamed: 0,Book,Author,Description,Genres,Avg_Rating,Num_Ratings,URL
0,0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",4.27,5691311,https://www.goodreads.com/book/show/2657.To_Ki...
1,1,Harry Potter and the Philosopher’s Stone (Harr...,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',...",4.47,9278135,https://www.goodreads.com/book/show/72193.Harr...
2,2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical...",4.28,3944155,https://www.goodreads.com/book/show/1885.Pride...
3,3,The Diary of a Young Girl,Anne Frank,Discovered in the attic in which she spent the...,"['Classics', 'Nonfiction', 'History', 'Biograp...",4.18,3488438,https://www.goodreads.com/book/show/48855.The_...
4,4,Animal Farm,George Orwell,Librarian's note: There is an Alternate Cover ...,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",3.98,3575172,https://www.goodreads.com/book/show/170448.Ani...


In [4]:
df.shape

(10000, 8)

In [5]:
# Verificando colunas
df.columns

Index(['Unnamed: 0', 'Book', 'Author', 'Description', 'Genres', 'Avg_Rating',
       'Num_Ratings', 'URL'],
      dtype='object')

In [6]:
'''
As colunas que serão removidas serão: 'Unnamed: 0', 'URL', Avg_Rating e Num_Rating pois não acrescentam uma informação importante para 
a recomendação de algum livro.
'''
#Lista de colunas para remover
colunas_para_remover = ['Unnamed: 0', 'URL', 'Avg_Rating', 'Num_Ratings']

#Removendo colunas
df = df.drop(colunas_para_remover, axis = 1)

df.sample(5)

Unnamed: 0,Book,Author,Description,Genres
1282,Survivor,Chuck Palahniuk,From the author of the underground sensation F...,"['Fiction', 'Contemporary', 'Thriller', 'Novel..."
2431,Antigone,Jean Anouilh,Antigone was originally produced in Paris in 1...,"['Plays', 'Classics', 'France', 'Theatre', 'Dr..."
8056,"The Warrior's Path (The Sacketts, #3)",Louis L'Amour,"Filled with exciting tales of the frontier, th...","['Westerns', 'Fiction', 'Historical Fiction', ..."
5603,My Life: An Attempt at an Autobiography,Leon Trotsky,Autobiographical account by a leader of the Oc...,"['History', 'Biography', 'Politics', 'Nonficti..."
669,The Joy of Cooking,Irma S. Rombauer,"Since its original publication, Joy of Cooking...","['Cookbooks', 'Cooking', 'Food', 'Nonfiction',..."


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Book         10000 non-null  object
 1   Author       10000 non-null  object
 2   Description  9923 non-null   object
 3   Genres       10000 non-null  object
dtypes: object(4)
memory usage: 312.6+ KB


In [8]:
# Verificando valores ausentes
def valores_ausentes(df):
    soma = df.isnull().sum()
    porcentagem = ((df.isnull().sum() / len(df)) * 100).map('{:.2f}%'.format)
    return pd.DataFrame({'valores_ausentes' : soma,
                         'porcentagem': porcentagem})

In [9]:
valores_ausentes(df)

Unnamed: 0,valores_ausentes,porcentagem
Book,0,0.00%
Author,0,0.00%
Description,77,0.77%
Genres,0,0.00%


In [10]:
# Como são poucos os valores ausentes, serão apenas removidos do dataset
df.dropna(inplace = True)

In [11]:
valores_ausentes(df)

Unnamed: 0,valores_ausentes,porcentagem
Book,0,0.00%
Author,0,0.00%
Description,0,0.00%
Genres,0,0.00%


In [12]:
# Remover o texto dentro dos parênteses e os próprios parênteses
df['Book'] = df['Book'].str.replace(r'\s*\(.*\)', '', regex=True)

In [13]:
#Alterando nome das colunas
df = df.rename(columns = {'Book': 'book',
                          'Author': 'author',
                          'Description': 'description',
                          'Genres': 'genres'
                          })

In [14]:
df.head(3)

Unnamed: 0,book,author,description,genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ..."
1,Harry Potter and the Philosopher’s Stone,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',..."
2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","['Classics', 'Fiction', 'Romance', 'Historical..."


### <font color='red'>Parte 2: Processamento de Texto</font>

In [15]:
# Convertendo valores na coluna com literal_eval
'''
A função literal_eval será usada para converter strings que representam listas de gêneros de filmes em listas reais de Python. 
Inicialmente, a coluna do DataFrame comtém os gêneros como strings no formato de listas, como ['Classics', 'Fiction', 'Historical Fiction', ...]. 
Usando literal_eval, essas strings serão transformadas em listas de verdade, permitindo que cada célula da coluna genres seja convertida para o 
formato de lista Python correspondente.
'''
# Conversão de strings para listas usando literal_eval
df['genres'] = df['genres'].apply(literal_eval)

In [16]:
df.head(3)

Unnamed: 0,book,author,description,genres
0,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"[Classics, Fiction, Historical Fiction, School..."
1,Harry Potter and the Philosopher’s Stone,J.K. Rowling,Harry Potter thinks he is an ordinary boy - un...,"[Fantasy, Fiction, Young Adult, Magic, Childre..."
2,Pride and Prejudice,Jane Austen,"Since its immediate success in 1813, Pride and...","[Classics, Fiction, Romance, Historical Fictio..."


In [17]:
# Inicializando uma lista vazia
new_column = []

# Loop para percorrer cada descrição e adicionar a lista de palavras à nova lista
for description in df['description']:
    words_list = description.split()  # Dividindo a descrição em uma lista de palavras
    new_column.append(words_list)     # Adicionando a lista de palavras à nova coluna

# Substituindo a coluna original pela nova lista de listas
df['description'] = new_column

In [18]:
df.head(3)

Unnamed: 0,book,author,description,genres
0,To Kill a Mockingbird,Harper Lee,"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, Historical Fiction, School..."
1,Harry Potter and the Philosopher’s Stone,J.K. Rowling,"[Harry, Potter, thinks, he, is, an, ordinary, ...","[Fantasy, Fiction, Young Adult, Magic, Childre..."
2,Pride and Prejudice,Jane Austen,"[Since, its, immediate, success, in, 1813,, Pr...","[Classics, Fiction, Romance, Historical Fictio..."


In [19]:
# Inicializando uma lista vazia
new_column = []

# Loop para percorrer cada descrição e adicionar a lista de palavras à nova lista
for author in df['author']:
    new_column.append([author])

df['author'] = new_column

In [20]:
df.head(3)

Unnamed: 0,book,author,description,genres
0,To Kill a Mockingbird,[Harper Lee],"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, Historical Fiction, School..."
1,Harry Potter and the Philosopher’s Stone,[J.K. Rowling],"[Harry, Potter, thinks, he, is, an, ordinary, ...","[Fantasy, Fiction, Young Adult, Magic, Childre..."
2,Pride and Prejudice,[Jane Austen],"[Since, its, immediate, success, in, 1813,, Pr...","[Classics, Fiction, Romance, Historical Fictio..."


### <font color='red'>Parte 3: Limpeza de Dados</font>

In [21]:
# Removendo espaços vazios

df['author'] = df['author'].apply(lambda x:[i.replace(' ','') for i in x])

df['description'] = df['description'].apply(lambda x:[i.replace(' ','') for i in x])

df['genres'] = df['genres'].apply(lambda x:[i.replace(' ','') for i in x])

In [22]:
# Removendo ponto de autores
df['author'] = df['author'].apply(lambda x:[i.replace('.','') for i in x])

In [23]:
df.head()

Unnamed: 0,book,author,description,genres
0,To Kill a Mockingbird,[HarperLee],"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, HistoricalFiction, School,..."
1,Harry Potter and the Philosopher’s Stone,[JKRowling],"[Harry, Potter, thinks, he, is, an, ordinary, ...","[Fantasy, Fiction, YoungAdult, Magic, Children..."
2,Pride and Prejudice,[JaneAusten],"[Since, its, immediate, success, in, 1813,, Pr...","[Classics, Fiction, Romance, HistoricalFiction..."
3,The Diary of a Young Girl,[AnneFrank],"[Discovered, in, the, attic, in, which, she, s...","[Classics, Nonfiction, History, Biography, Mem..."
4,Animal Farm,[GeorgeOrwell],"[Librarian's, note:, There, is, an, Alternate,...","[Classics, Fiction, Dystopia, Fantasy, Politic..."


### <font color='red'>Parte 4: Preparando DataFrame para Vetorização</font>

In [24]:
# Criamos a coluna de tags, nesse caso um vetor de strings com os valores das colunas
df['tags'] = df['author'] + \
             df['description'] + \
             df['genres']

In [25]:
df.head()

Unnamed: 0,book,author,description,genres,tags
0,To Kill a Mockingbird,[HarperLee],"[The, unforgettable, novel, of, a, childhood, ...","[Classics, Fiction, HistoricalFiction, School,...","[HarperLee, The, unforgettable, novel, of, a, ..."
1,Harry Potter and the Philosopher’s Stone,[JKRowling],"[Harry, Potter, thinks, he, is, an, ordinary, ...","[Fantasy, Fiction, YoungAdult, Magic, Children...","[JKRowling, Harry, Potter, thinks, he, is, an,..."
2,Pride and Prejudice,[JaneAusten],"[Since, its, immediate, success, in, 1813,, Pr...","[Classics, Fiction, Romance, HistoricalFiction...","[JaneAusten, Since, its, immediate, success, i..."
3,The Diary of a Young Girl,[AnneFrank],"[Discovered, in, the, attic, in, which, she, s...","[Classics, Nonfiction, History, Biography, Mem...","[AnneFrank, Discovered, in, the, attic, in, wh..."
4,Animal Farm,[GeorgeOrwell],"[Librarian's, note:, There, is, an, Alternate,...","[Classics, Fiction, Dystopia, Fantasy, Politic...","[GeorgeOrwell, Librarian's, note:, There, is, ..."


In [26]:
# Adicionar coluna de id_movie
# Adicionar uma coluna 'movie_id' com IDs sequenciais
df['book_id'] = range(1, len(df) + 1)

In [27]:
df_final = df[['book_id', 'book', 'tags']]

In [28]:
df_final.head(5)

Unnamed: 0,book_id,book,tags
0,1,To Kill a Mockingbird,"[HarperLee, The, unforgettable, novel, of, a, ..."
1,2,Harry Potter and the Philosopher’s Stone,"[JKRowling, Harry, Potter, thinks, he, is, an,..."
2,3,Pride and Prejudice,"[JaneAusten, Since, its, immediate, success, i..."
3,4,The Diary of a Young Girl,"[AnneFrank, Discovered, in, the, attic, in, wh..."
4,5,Animal Farm,"[GeorgeOrwell, Librarian's, note:, There, is, ..."


In [29]:
# Join das strings para simplificar o vetor
df_final['tags'] = df_final['tags'].apply(lambda x:" ".join(x))

In [30]:
# Coloca tudo em minúsculo para evitar diferença de palavras maiúsculo/minúsculo
df_final['tags'] = df_final['tags'].apply(lambda x:x.lower())

In [31]:
df_final.head(5)

Unnamed: 0,book_id,book,tags
0,1,To Kill a Mockingbird,harperlee the unforgettable novel of a childho...
1,2,Harry Potter and the Philosopher’s Stone,jkrowling harry potter thinks he is an ordinar...
2,3,Pride and Prejudice,janeausten since its immediate success in 1813...
3,4,The Diary of a Young Girl,annefrank discovered in the attic in which she...
4,5,Animal Farm,georgeorwell librarian's note: there is an alt...


### <font color='red'>Parte 5: Perse e Vetorização</font>

Stemming é o processo de redução de uma palavra ao seu radical que está ligado a sufixos e prefixos ou às raízes de palavras conhecidas como "lemmas". Stemming é importante na compreensão de linguagem natural e processamento de linguagem natural.

In [32]:
# Criamos o parser
parser_ps = PorterStemmer()

In [33]:
# Função de stemming
def stem(text):
    
    # Cria uma lista vazia chamada y para armazenar as palavras após o stemming.
    y = []
    
    # Divide a string de entrada 'text' em palavras e itera sobre elas.
    for i in text.split():
        
        # Realiza o stemming na palavra atual 'i' e adiciona o resultado à lista y.
        y.append(parser_ps.stem(i))
    
    # Retorna as palavras processadas como uma string, unindo-as com espaços.
    return " ".join(y)

In [34]:
# Aplica a função à coluna de tags
df_final['tags'] = df_final['tags'].apply(stem)

In [35]:
df_final.head()

Unnamed: 0,book_id,book,tags
0,1,To Kill a Mockingbird,harperle the unforgett novel of a childhood in...
1,2,Harry Potter and the Philosopher’s Stone,jkrowl harri potter think he is an ordinari bo...
2,3,Pride and Prejudice,"janeausten sinc it immedi success in 1813, pri..."
3,4,The Diary of a Young Girl,annefrank discov in the attic in which she spe...
4,5,Animal Farm,georgeorwel librarian' note: there is an alter...


A vetorização no contexto de processamento de linguagem natural (PLN) é o processo de converter texto em uma representação numérica, geralmente na forma de vetores. Este processo é fundamental para que algoritmos de aprendizado de máquina possam trabalhar com dados textuais, uma vez que eles requerem entradas numéricas.

In [36]:
# Cria o vetorizador com no máximo 5000 atributos

common_words = ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'I']

cv = CountVectorizer(max_features = 15000, stop_words = common_words)

In [37]:
# Cria os vetores para as tags
vectors = cv.fit_transform(df_final['tags']).toarray()

In [38]:
len(cv.get_feature_names_out())

15000

In [39]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### <font color='red'>Parte 6: Distância de Vetores</font>

In [40]:
# Calcula a similaridade entre os vetores calculando as distâncias entre eles
similaridades = cosine_similarity(vectors)

### <font color='red'>Parte 7: Construção do Sistema de Recomendação</font>

In [41]:
# Função para o sistema de recomendação
def sistema_recomendacao(book):
    
    # Obtém o índice do filme passado como argumento (o que o usuário assistiu)
    # Index[0] = significa que vai trazer apenas o primeiro elemento
    index = df_final[df_final['book'] == book].index[0]
    
    # Verificamos então os filmes com vetores de menor distância para o filme passado como argumento
    distances = sorted(list(enumerate(similaridades[index])), reverse = True, key = lambda x: x[1])
    
    # E então consideramos os 5 filmes com menor distância, ou seja, maior similaridade
    for i in distances[1:6]:
        print(df_final.iloc[i[0]]['book'])

### <font color='red'>Parte 8: Aplicando o Sistema de Recomendação</font>

In [42]:
# Ajustar configurações para mostrar texto completo
pd.set_option('display.max_colwidth', None)

# Ajustar configuração para exibir todas as linhas
pd.set_option('display.max_rows', None)

In [43]:
sistema_recomendacao('The Hobbit')

The Hobbit
Oliver Twist
Alice in Wonderland
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings
The Life and Opinions of Tristram Shandy, Gentleman


In [44]:
sistema_recomendacao('Harry Potter and the Philosopher’s Stone')

Harry Potter and the Goblet of Fire
Harry Potter and the Half-Blood Prince
Harry Potter and the Cursed Child: Parts One and Two
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Order of the Phoenix


In [45]:
sistema_recomendacao('1984')

Foundation and Earth
Archer's Voice
Too Loud a Solitude
The Storied Life of A.J. Fikry
The Arbitrator


In [46]:
sistema_recomendacao('The Lord of the Rings')

The Fellowship of the Ring
The Two Towers
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings
The Children of Húrin
Hitman: My Real Life in the Cartoon World of Wrestling


In [47]:
sistema_recomendacao('Harry Potter and the Deathly Hallows')

Harry Potter and the Order of the Phoenix
River Secrets
Harry Potter and the Chamber of Secrets
Fool Moon
Cold Days


### <font color='red'>Sistema e Versões dos Pacotes</font>

In [48]:
%reload_ext watermark
%watermark -v -m
%watermark --iversions

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 186 Stepping 3, GenuineIntel
CPU cores   : 12
Architecture: 64bit

streamlit : 1.32.0
langdetect: 1.0.9
numpy     : 1.26.4
IPython   : 8.25.0
sklearn   : 1.4.2
nltk      : 3.8.1
ipywidgets: 7.8.1
pandas    : 2.2.2

