##Importing libraries
- unlike the learning models we built earlier, this one uses cosine similarity, one of the most widely used metrics in recommender systems
- note that some scenarios can instead use machine learning models such as K-means to group similar items, or KNN with the Manhattan distance metric to find the points that are closest to each other
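As a rough illustration of the metric mentioned above: cosine similarity divides the dot product of two vectors by the product of their norms, so it measures the angle between them rather than their length. The sketch below uses NumPy and made-up word-count vectors. The helper `cosine_sim` and the vectors `doc_a`, `doc_b`, `doc_c` are hypothetical names, not part of this notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: (a . b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical word-count vectors for three short texts
doc_a = [1, 1, 0, 1]
doc_b = [1, 1, 0, 1]   # same words as doc_a
doc_c = [0, 0, 1, 0]   # no word in common with doc_a

sim_ab = cosine_sim(doc_a, doc_b)   # same direction: similarity close to 1
sim_ac = cosine_sim(doc_a, doc_c)   # orthogonal vectors: similarity 0
```

Texts that share no words get a similarity of 0, which is why most entries of the similarity matrix built later in the notebook are zero.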

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

##Loading and cleaning the dataset
 - this dataset is about books
 - it contains fields such as title, author, genre, height, and publisher

In [2]:
uri = 'https://raw.githubusercontent.com/RafaelBernardo18/aprendizado-de-maquina/main/books.csv.txt'

livros = pd.read_csv(uri)

In [3]:
livros.head()

Unnamed: 0,Title,Author,Genre,Height,Publisher
0,Fundamentals of Wavelets,"Goswami, Jaideva",signal_processing,228,Wiley
1,Data Smart,"Foreman, John",data_science,235,Wiley
2,God Created the Integers,"Hawking, Stephen",mathematics,197,Penguin
3,Superfreakonomics,"Dubner, Stephen",economics,179,HarperCollins
4,Orientalism,"Said, Edward",history,197,Penguin


In [4]:
#checking the entries in the dataset
#note that some fields are not filled in
livros.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      211 non-null    object
 1   Author     187 non-null    object
 2   Genre      211 non-null    object
 3   Height     211 non-null    int64 
 4   Publisher  115 non-null    object
dtypes: int64(1), object(4)
memory usage: 8.4+ KB


In [5]:
livros = livros.dropna() #dropping rows with null values
livros = livros.reset_index(drop=True) #remember to reset the index whenever you drop rows

livros.tail()

Unnamed: 0,Title,Author,Genre,Height,Publisher
107,Rationality & Freedom,"Sen, Amartya",economics,213,Springer
108,Clash of Civilizations and Remaking of the Wor...,"Huntington, Samuel",history,228,Simon&Schuster
109,Uncommon Wisdom,"Capra, Fritjof",nonfiction,197,Fontana
110,One,"Bach, Richard",nonfiction,172,Dell
111,To Sir With Love,Braithwaite,fiction,197,Penguin
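The reason for resetting the index can be seen on a small example. `dropna()` keeps the original row labels, so the index has gaps. Later cells look rows up by label (`dado['Title'][i]`) and align frames on the index (`pd.concat(..., axis=1)`), and both assume labels running from 0 with no gaps. The frame `toy` below is made up for illustration and is not the books dataset.

```python
import pandas as pd

# hypothetical three-row frame with one missing publisher
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Publisher": ["Wiley", None, "Penguin"],
})

no_nulls = toy.dropna()

# the labels of the surviving rows are kept, leaving a gap where row 1 was
labels_before = list(no_nulls.index)

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
labels_after = list(no_nulls.reset_index(drop=True).index)
```

Here `labels_before` is `[0, 2]` and `labels_after` is `[0, 1]`.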


In [6]:
#we will create a column holding each book's id, to be used later
numeros = [ ]
for i in range(1, 113):
    numeros.append(i)


df_numeros = pd.DataFrame(numeros, columns = ["BookId"])

base = pd.concat([livros, df_numeros], axis=1)

base.tail()

Unnamed: 0,Title,Author,Genre,Height,Publisher,BookId
107,Rationality & Freedom,"Sen, Amartya",economics,213,Springer,108
108,Clash of Civilizations and Remaking of the Wor...,"Huntington, Samuel",history,228,Simon&Schuster,109
109,Uncommon Wisdom,"Capra, Fritjof",nonfiction,197,Fontana,110
110,One,"Bach, Richard",nonfiction,172,Dell,111
111,To Sir With Love,Braithwaite,fiction,197,Penguin,112
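A possible shortcut, sketched here on a made-up frame, is to assign the ids directly as a column. This avoids both the hard-coded upper bound in `range(1, 113)` and the index alignment that `pd.concat(..., axis=1)` performs. The frame `toy` is hypothetical; only the `BookId` column name comes from the notebook.

```python
import pandas as pd

# hypothetical stand-in for the cleaned books frame
toy = pd.DataFrame({"Title": ["A", "B", "C"]})

# ids run from 1 to the number of rows, whatever that number is
toy["BookId"] = range(1, len(toy) + 1)
```

Because the length is read from the frame itself, the ids stay correct if the number of rows left after `dropna()` changes.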


In [7]:
base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      112 non-null    object
 1   Author     112 non-null    object
 2   Genre      112 non-null    object
 3   Height     112 non-null    int64 
 4   Publisher  112 non-null    object
 5   BookId     112 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 5.4+ KB


In [8]:
#calling the method to check whether any null fields remain
base.isnull().values.any()

#a return value of False means there are no unfilled fields

False

##Extracting the values of interest

In [9]:
#keep only the columns that matter for recommending books
colunas = ['Title', 'Author', 'Genre', 'Publisher']

#viewing the columns we selected
base[colunas].head(3)

Unnamed: 0,Title,Author,Genre,Publisher
0,Fundamentals of Wavelets,"Goswami, Jaideva",signal_processing,Wiley
1,Data Smart,"Foreman, John",data_science,Wiley
2,God Created the Integers,"Hawking, Stephen",mathematics,Penguin


In [10]:
base[colunas].shape

(112, 4)

In [11]:
base[colunas].isnull().values.any()

False

In [12]:
#function that joins the attributes of interest into one string per book
def pegar_valores_importantes(dado):
    valores_importantes = [ ]
    #relies on the index running from 0 to n-1, which is why it was reset after dropna()
    for i in range(0, dado.shape[0]):
        valores_importantes.append(dado['Title'][i] + ' ' + dado['Author'][i] + ' ' + dado['Genre'][i] + ' ' + dado['Publisher'][i])
    return valores_importantes
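The same join can also be written without an explicit loop over row labels. The sketch below is one possible alternative, not the notebook's method. It applies `" ".join` to each row of the selected columns, using two rows copied from the `head()` output above. The names `toy`, `cols` and `combined` are hypothetical.

```python
import pandas as pd

# two rows copied from the head() output shown earlier
toy = pd.DataFrame({
    "Title": ["Data Smart", "Orientalism"],
    "Author": ["Foreman, John", "Said, Edward"],
    "Genre": ["data_science", "history"],
    "Publisher": ["Wiley", "Penguin"],
})

cols = ["Title", "Author", "Genre", "Publisher"]

# join the four text fields of every row with single spaces
combined = toy[cols].apply(lambda row: " ".join(row), axis=1)
```

This version does not depend on the index labels, so it also works on a frame whose index was not reset.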

In [13]:
#creating a new column in the dataset with the joined values
base['valores_importates'] = pegar_valores_importantes(base)

In [14]:
base.head() #viewing the new column

Unnamed: 0,Title,Author,Genre,Height,Publisher,BookId,valores_importates
0,Fundamentals of Wavelets,"Goswami, Jaideva",signal_processing,228,Wiley,1,"Fundamentals of Wavelets Goswami, Jaideva sign..."
1,Data Smart,"Foreman, John",data_science,235,Wiley,2,"Data Smart Foreman, John data_science Wiley"
2,God Created the Integers,"Hawking, Stephen",mathematics,197,Penguin,3,"God Created the Integers Hawking, Stephen math..."
3,Superfreakonomics,"Dubner, Stephen",economics,179,HarperCollins,4,"Superfreakonomics Dubner, Stephen economics Ha..."
4,Orientalism,"Said, Edward",history,197,Penguin,5,"Orientalism Said, Edward history Penguin"


##Transforming the values for the cosine similarity calculation

In [15]:
#CountVectorizer turns a collection of text documents into a sparse matrix of token counts
modelo = CountVectorizer().fit_transform(base['valores_importates'])

modelo

<112x436 sparse matrix of type '<class 'numpy.int64'>'
	with 829 stored elements in Compressed Sparse Row format>

In [16]:
#building the cosine similarity matrix from the vectorized data
cos_similar = cosine_similarity(modelo)

cos_similar

array([[1.        , 0.15430335, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.15430335, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.13363062],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.18257419,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.18257419, 1.        ,
        0.        ],
       [0.        , 0.        , 0.13363062, ..., 0.        , 0.        ,
        1.        ]])
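The entry 0.15430335 between the first two books can be checked by hand. The sketch below rebuilds their joined strings from the first two rows shown by `base.head()` and runs the same two steps on them alone. The variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# joined strings of the first two books, rebuilt from the head() output
texts = [
    "Fundamentals of Wavelets Goswami, Jaideva signal_processing Wiley",
    "Data Smart Foreman, John data_science Wiley",
]

# default settings: lowercase the text, keep tokens of two or more word characters
# (commas are dropped; "signal_processing" stays a single token)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# 7 tokens in the first text, 6 in the second, "wiley" shared by both
vocabulary_size = len(vectorizer.vocabulary_)

sim = cosine_similarity(counts)
pair_similarity = sim[0, 1]   # 1 / sqrt(7 * 6)
```

With a single shared token and every count equal to 1, the similarity is 1 / sqrt(7 × 6) ≈ 0.1543, which matches the value in the matrix above.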

In [17]:
#note that the result is a 112x112 matrix,
#because a similarity was computed between every pair of rows in the column
cos_similar.shape 

(112, 112)

##Testing the recommendation

In [18]:
#finding the BookId that corresponds to the title of the book we want
t = 'God Created the Integers'

livro_id = base[base.Title == t]['BookId'].values[0] #looking up the BookId field in the dataset

print(livro_id)

3


In [19]:
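#listing the cosine similarities for the selected row
#builds a list of (row position, cosine similarity) pairs
#caution: rows of cos_similar are numbered from 0 while BookId starts at 1, so cos_similar[livro_id]
#reads the row of the next book in the table (row 3 is Superfreakonomics); cos_similar[livro_id - 1] is the searched book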
pontuacao = list(enumerate(cos_similar[livro_id]))

print(pontuacao)

[(0, 0.0), (1, 0.0), (2, 0.15811388300841894), (3, 0.9999999999999999), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.0), (30, 0.14907119849998596), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.0), (42, 0.1414213562373095), (43, 0.6), (44, 0.1690308509457033), (45, 0.0), (46, 0.0), (47, 0.0), (48, 0.0), (49, 0.0), (50, 0.0), (51, 0.0), (52, 0.0), (53, 0.0), (54, 0.0), (55, 0.0), (56, 0.15811388300841894), (57, 0.19999999999999998), (58, 0.0), (59, 0.0), (60, 0.0), (61, 0.0), (62, 0.0), (63, 0.0), (64, 0.0), (65, 0.0), (66, 0.0), (67, 0.0), (68, 0.0), (69, 0.0), (70, 0.0), (71, 0.0), (72, 0.19999999999999998), (73, 0.0), (74, 0.0), (75, 0.14907119849998596), (76, 0.0), (77, 0.0), (78, 0.0),

In [20]:
#sorting from highest to lowest cosine similarity
pontuacao_ordenada = sorted(pontuacao, key = lambda x:x[1], reverse = True)
pontuacao_ordenada = pontuacao_ordenada[1:] #dropping the first element, which is the row's own book

print(pontuacao_ordenada)

[(43, 0.6), (57, 0.19999999999999998), (72, 0.19999999999999998), (107, 0.18257418583505539), (44, 0.1690308509457033), (2, 0.15811388300841894), (56, 0.15811388300841894), (90, 0.15811388300841894), (91, 0.15811388300841894), (30, 0.14907119849998596), (75, 0.14907119849998596), (95, 0.14907119849998596), (103, 0.14907119849998596), (42, 0.1414213562373095), (0, 0.0), (1, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.0), (45, 0.0), (46, 0.0), (47, 0.0), (48, 0.0), (49, 0.0), (50, 0.0), (51, 0.0), (52, 0.0), (53, 0.0), (54, 0.0), (55, 0.0), (58, 0.0), (59, 0.0), (60, 0.0), (61, 0.0), (62, 0.0), (63, 0.0), (64, 0.0), (65, 0.0), (66, 0.0

In [21]:
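#finally, let's view the books our system recommended
#caution: i[0] is a row position in cos_similar (starting at 0) but is matched against BookId (starting at 1),
#so each printed title is the book one row above the one that was scored
j = 0

print('the most recommended books for those who read: ', t, '\nare:')
for i in pontuacao_ordenada:
    livro_titulo = base[base.BookId == i[0]]['Title'].values[0]
    print(j+1, livro_titulo)
    j = j+1
    if(j>13):
        break

the most recommended books for those who read:  God Created the Integers 
are: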
1 Tales of Mystery and Imagination
2 Textbook of Economic Theory
3 Prisoner of Birth, A
4 History of Western Philosophy
5 Freakonomics
6 Data Smart
7 Soft Computing & Intelligent Systems
8 Bookless in Baghdad
9 Theory of Everything, The
10 Complete Sherlock Holmes, The - Vol II
11 Last Mughal, The
12 Burning Bright
13 Zen & The Art of Motorcycle Maintenance
14 Russian Journal, A
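One way to avoid mixing `BookId` values with matrix row positions is to work with positions only. The sketch below is one possible arrangement, not the notebook's implementation. The helper `recommend` and the frame `toy` are hypothetical, and `toy` holds only the five rows shown by `base.head()`, so its results do not reproduce the list above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, frame, sim_matrix, top_n=5):
    """Return up to top_n titles most similar to `title`, using row positions only."""
    matches = np.flatnonzero(frame["Title"].to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"title not found: {title!r}")
    pos = int(matches[0])
    # positions sorted by decreasing similarity; stable sort keeps table order on ties
    order = np.argsort(-sim_matrix[pos], kind="stable")
    # drop the queried book itself by position, not by assuming it sorts first
    order = order[order != pos][:top_n]
    return frame["Title"].iloc[order].tolist()

# the five rows shown by base.head(), with their joined text
toy = pd.DataFrame({
    "Title": ["Fundamentals of Wavelets", "Data Smart", "God Created the Integers",
              "Superfreakonomics", "Orientalism"],
    "text": [
        "Fundamentals of Wavelets Goswami, Jaideva signal_processing Wiley",
        "Data Smart Foreman, John data_science Wiley",
        "God Created the Integers Hawking, Stephen mathematics Penguin",
        "Superfreakonomics Dubner, Stephen economics HarperCollins",
        "Orientalism Said, Edward history Penguin",
    ],
})

sims = cosine_similarity(CountVectorizer().fit_transform(toy["text"]))
top_for_data_smart = recommend("Data Smart", toy, sims, top_n=1)
```

In this toy frame "Data Smart" shares only the token "wiley" with "Fundamentals of Wavelets", so that title is the single recommendation returned.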
