## Protótipo II - Sistema de recomendação baseado em modelos de <i>Machine Learning</i>

Nesse notebook, continuamos o desenvolvimento do TCC acerca do tema acima, aqui iremos pegar os arquivos .pkl criados e trabalharmos um pouco mais no processamento e organização dos dados ali presentes.

In [4]:
# CONSTANTES COLLABORATIVE FILTERING

# CAMINHOS

# 
PATH_TO_FULL_CF_FILE = "preprocessed-data/CF/data_cf.pkl"

PATH_TO_MOVIES_CF_FILE = "preprocessed-data/CF/movies_cf.pkl"

PATH_TO_RATINGS_CF_FILE = "preprocessed-data/CF/ratings_cf.pkl"

# DataFrames Names

# data_cf = arquivo completo 
# movies_cf = arquivos de filmes
# ratings_cf = arquivos de ratings

# KNN
N_NEIGHBORS = 10

In [5]:
# Importando bibliotecas necessárias
import pandas as pd
pd.set_option("display.max_rows", 25)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sys

# Importando garbage collector
import gc

# Importando bibliotecas para o sistema de recomendação
import scipy.sparse as sparse # Matriz esparsa (csr_matrix)

# Importando sklearn
import sklearn
from sklearn.neighbors import NearestNeighbors

# Importando Regular Expression operations
import re

# 1 - Preparação dos dados para Collaborative Filtering

#### Trabalharemos aqui com a criação da matriz filme x usuário da melhor maneira possível.

In [77]:
# Definindo Função que carrega os arquivos
def load_cf_files(full_file=True, movie_file=False, ratings_file=False):
    if(full_file):
        # Carregando o arquivo pré processado completo
        data_cf = pd.read_pickle(PATH_TO_FULL_CF_FILE)
        data_cf = data_cf[["movieId", "title", "userId", "rating"]] # Reajustando ordem das colunas
        
        print("Arquivo completo: Carregado com sucesso!")
    else:
        print("Arquivo completo: Não foi carregado, verifique os parâmetros para ver se essa era a intenção!")
        
    if(movie_file):
        # Carregando arquivo de filmes
        movies_cf = pd.read_pickle(PATH_TO_MOVIES_CF_FILE)
        print("Arquivo de filmes: Carregado com sucesso!")
    else:
        print("Arquivo de filmes: Não foi carregado, verifique os parâmetros para ver se essa era a intenção!")
        
    if(ratings_file):
        ratings_cf = pd.read_pickle(PATH_TO_RATINGS_CF_FILE)
        print("Arquivo de ratings: Carregado com sucesso!")
    else:
        print("Arquivo de ratings: Não foi carregado, verifique os parâmetros para ver se essa era a intenção!")
        
    #definindo retornos
    if("data_cf" in locals()):
        if("movies_cf" in locals()):
            if("ratings_cf" in locals()):
                return data_cf, movies_cf, ratings_cf
            else:
                return data_cf, movies_cf
        else:
            return data_cf
    elif("movies_cf" in locals()):
        if("ratings_cf" in locals()):
            return movies_cf, ratings_cf
        else:
            return movies_cf
    elif("ratings_cf" in locals()):
        return ratings_cf
    
    else:
        return None

In [78]:
# Chamando função de carregar os arquivos
data_cf, movies_cf = load_cf_files(full_file=True, movie_file=True, ratings_file=False)

Arquivo completo: Carregado com sucesso!
Arquivo de filmes: Carregado com sucesso!
Arquivo de ratings: Não foi carregado, verifique os parâmetros para ver se essa era a intenção!


In [79]:
data_cf.tail()

Unnamed: 0,movieId,title,userId,rating
25000090,209157,We,119571,1.5
25000091,209159,Window of the Soul,115835,3.0
25000092,209163,Bad Poems,6964,4.5
25000093,209169,A Girl Thing,119571,3.0
25000094,209171,Women of Devil's Island,119571,3.0


#### Analisando um pouco o arquivo de filmes e o completo

In [80]:
# Informações gerais sobre o arquivo completo
print(data_cf.describe())
print("\n\n")
print(data_cf['title'].value_counts())

            movieId        userId        rating
count  2.500010e+07  2.500010e+07  2.500010e+07
mean   2.138798e+04  8.118928e+04  3.533854e+00
std    3.919886e+04  4.679172e+04  1.060744e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    1.196000e+03  4.051000e+04  3.000000e+00
50%    2.947000e+03  8.091400e+04  3.500000e+00
75%    8.623000e+03  1.215570e+05  4.000000e+00
max    2.091710e+05  1.625410e+05  5.000000e+00



Forrest Gump                          81491
Shawshank Redemption, The             81482
Pulp Fiction                          79672
Silence of the Lambs, The             74127
Matrix, The                           72674
                                      ...  
Young Billy Young                         1
I Promessi sposi - Secondo il Trio        1
Woman Basketball Player No. 5             1
First Round Down                          1
H3 - Halloween Horror Hostel              1
Name: title, Length: 55463, dtype: int64


#### Podemos notar acima que, alguns filmes tem apenas 1 avaliação, o que pode complicar na comparação, devemos removê-los?

In [81]:
# Mostrando filmes com menos de X votos
cpy = movies_cf.copy()

rating_count = pd.DataFrame(data_cf.groupby("movieId").count()["rating"])
rating_count.reset_index(inplace=True)
rating_count.rename(columns={"rating": "rating count"},inplace=True)

res = cpy.merge(rating_count)

res.head()

Unnamed: 0,title,movieId,rating count
0,Toy Story,1,57309
1,Jumanji,2,24228
2,Grumpier Old Men,3,11804
3,Waiting to Exhale,4,2523
4,Father of the Bride Part II,5,11714


In [82]:
# Exibindo
X = rating_count["rating count"].quantile(0.50) # 90% dos filmes da lista tem pelo menos essa quantidade de votos

print(X)

res[res["rating count"] == 1]

6.0


Unnamed: 0,title,movieId,rating count
1759,Ratchet,1847,1
3462,Stacy's Knights,3561,1
5793,Soap Girl,5905,1
8461,B.F.'s Daughter,25935,1
8522,All at Sea,26016,1
...,...,...,...
59042,We,209157,1
59043,Window of the Soul,209159,1
59044,Bad Poems,209163,1
59045,A Girl Thing,209169,1


In [83]:
data_cf.value_counts("title")

title
Forrest Gump                                        81491
Shawshank Redemption, The                           81482
Pulp Fiction                                        79672
Silence of the Lambs, The                           74127
Matrix, The                                         72674
                                                    ...  
L'uccello migratore                                     1
L'opéra de quat'sous                                    1
L'onorata famiglia                                      1
L'oeuvre au noir                                        1
"BLOW THE NIGHT!" Let's Spend the Night Together        1
Length: 55463, dtype: int64

#### Vamos verificar agora ocorrências de filmes com títulos em linguas diferentes de inglês e removê-los.

In [84]:
# Mostrando filmes que contem caracteres não ascii no titulo
data_cf[data_cf["title"].map(lambda x: not x.isascii())]["title"].value_counts()

Amelie (Fabuleux destin d'Amélie Poulain, Le)              34320
Léon: The Professional (a.k.a. The Professional) (Léon)    33680
WALL·E                                                     27374
Life Is Beautiful (La Vita è bella)                        23976
Alien³ (a.k.a. Alien 3)                                    13783
                                                           ...  
Komisarz Blond i Oko sprawiedliwości                           1
Мышонок Пик                                                    1
Dzień dobry, kocham cię!                                       1
Krakonoš a lyžníci                                             1
Attack of La Niña                                              1
Name: title, Length: 2322, dtype: int64

### 1.1 - Problemas do Collaborative Filtering:
<ul>
    <li>Esparsidade</li>
    <li>Cold Start</li>
</ul>

#### Técnicas possíveis:
<ul>
    <li><b>Algoritmos não probabilisticos:</b></li>
    <li>User-based nearest neighbor</li>
    <li>Item-based nearest neighbor</li>
    <li>Reducing dimensionality</li>
</ul>

<ul>
    <li><b>Algoritmos probabilisticos:</b></li>
    <li>Bayesian-network model</li>
    <li>Expectation-minimization</li>
</ul>


Ver: https://pub.towardsai.net/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444

### 1.2 - Criando uma estrutura de matriz esparsa com o dataframe

In [85]:
# Usando o dataframe para criar uma matriz esparsa que contenha todos os filmes
def create_sparse_matrix(df):
    sparse_matrix = sparse.csr_matrix((df["rating"], (df["userId"], df["movieId"])))
    
    return sparse_matrix

# a matriz gerada é uma User x Movie, queremos Movie x User, então fazemos a transposta
user_movie_matrix = create_sparse_matrix(data_cf).transpose()

### 1.3 - Criando um modelo simples utilizando o KNN -  CF - Item Based

In [86]:
# Criando o modelo knn
knn_cf = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine') # temos que mexer nos parâmetros posteriormente

knn_cf.fit(user_movie_matrix)

NearestNeighbors(metric='cosine', n_neighbors=10)

In [97]:
# criando função que gera recomendações basedo em um filme - utilizando um modelo KNN
def get_recommendations_cf(movie_name, model): #nome do filme, modelo
    # Pegando o Id do filme que tenha o nome passado
    movieId = data_cf.loc[data_cf["title"] == movie_name]["movieId"].values[0]
    
    distances, suggestions = model.kneighbors(user_movie_matrix.getrow(movieId).todense().tolist(), n_neighbors=N_NEIGHBORS)
    
    for i in range(0, len(distances.flatten())):
        if(i == 0):
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de {2}:'.format(i, data_cf.loc[data_cf["movieId"] == suggestions.flatten()[i]]["title"].values[0], distances.flatten()[i]))
    
    return distances, suggestions

In [98]:
# Função para pesquisar o nome correto do filme
def search_movies_cf(search_word):
    return movies_cf[movies_cf.title.str.contains(search_word, flags=re.IGNORECASE)]

In [99]:
# Setando um tamanho de coluna, para ver o nome completo dos filmes
pd.set_option('display.max_colwidth', 500)
# Pesquisando filmes
search_movies_cf("Spider-Man")

Unnamed: 0,title,movieId
5241,Spider-Man,5349
7923,Spider-Man 2,8636
11561,Spider-Man 3,52722
14527,Spider-Man: The Ultimate Villain Showdown,76709
18241,"Amazing Spider-Man, The",95510
21454,The Amazing Spider-Man 2,110553
25074,Untitled Spider-Man Reboot,122926
56890,Spider-Man: Into the Spider-Verse,195159
59844,Spider-Man: Far from Home,201773


In [100]:
#Voltando a coluna ao normal 
pd.set_option('display.max_colwidth', 50)

In [101]:
# Pegando recomendações
a, b = get_recommendations_cf("Spider-Man 2", knn_cf)

Recomendações para Spider-Man 2: 

1: Spider-Man, com distância de 0.30051949781903664:
2: X2: X-Men United, com distância de 0.38035866481375724:
3: Pirates of the Caribbean: The Curse of the Black Pearl, com distância de 0.44583045604832083:
4: X-Men, com distância de 0.45347815941142156:
5: Incredibles, The, com distância de 0.45585894066206556:
6: Shrek 2, com distância de 0.4564660168298432:
7: Batman Begins, com distância de 0.45717348204712305:
8: Star Wars: Episode III - Revenge of the Sith, com distância de 0.46928468125362977:
9: Finding Nemo, com distância de 0.4844064554284505:


## Ideia: Se o filme tiver menos que X votos, utilizar exclusivamente Content-Based, caso contrario, fazemos a abordagem entrelaçada.

# 2 - Preparação dos dados para Content-Based

In [6]:
# CONSTANTES CONTENT BASED

# CAMINHOS

PATH_TO_FULL_CB_FILE = "preprocessed-data/CB/data_cb.pkl"

PATH_TO_MOVIES_CB_FILE = "preprocessed-data/CB/movies_cb.pkl"

PATH_TO_RATINGS_CB_FILE = "preprocessed-data/CB/ratings_cb.pkl"

PATH_TO_RATINGS_INFOS_CB_FILE = "preprocessed-data/CB/ratings_info_cb.pkl"

PATH_TO_TAG_RELEVANCE_GROUPED_CB_FILE = "preprocessed-data/CB/tag_relevance_grouped_cb.pkl"

PATH_TO_TAG_RELEVANCE_CB_FILE = "preprocessed-data/CB/tag_relevance_cb.pkl"

PATH_TO_TAGS_PROCESSED_CB_FILE = "preprocessed-data/CB/tags_processed_cb.pkl"

# DataFrames Names

# data_cb = arquivo completo 
# movies_cb = arquivos de filmes
# ratings_cb = arquivos de ratings
# ratings_infos_cb = arquivos de informações sobre os ratings
# tag_relevance_grouped_cb = relevancia de tags após o agrupamento
# tag_relevance_cb = relevancia de tags original
# tags_processed_cb = tags todas juntas em uma coluna e processadas pelo nltk

In [7]:
def load_cb_files(full=True, movies=False, ratings=False, ratings_infos=False ,relevance_grouped=False, relevance=False, tags_processed=False):
    data_cb = None
    movies_cb = None 
    ratings_cb = None
    ratings_infos_cb = None
    tag_relevance_grouped_cb = None
    tag_relevance_cb = None
    tags_processed_cb = None
    
    # Caso se queira carregar o completo
    if(full):
        data_cb = pd.read_pickle(PATH_TO_FULL_CB_FILE)
        print("Arquivo completo: Carregado com sucesso!")
    else:
        print("Arquivo completo: Não foi carregado, verifique se era o que desejava.")
    
    # Caso queira-se carregar o arquivo de filmes
    if(movies):
        movies_cb = pd.read_pickle(PATH_TO_MOVIES_CB_FILE)
        print("Arquivo movies: Carregado com sucesso!")
    else:
        print("Arquivo movies: Não foi carregado, verifique se era o que desejava.")
        
    if(ratings):
        ratings_cb = pd.read_pickle(PATH_TO_RATINGS_CB_FILE)
        print("Arquivo ratings: Carregado com sucesso!")
    else:
        print("Arquivo ratings: Não foi carregado, verifique se era o que desejava.")
    
    if(ratings_infos):
        ratings_infos_cb = pd.read_pickle(PATH_TO_RATINGS_INFOS_CB_FILE)
        print("Arquivo ratings infos: Carregado com sucesso!")
    else:
        print("Arquivo ratings infos: Não foi carregado, verifique se era o que desejava.")
        
    if(relevance_grouped):
        tag_relevance_grouped_cb = pd.read_pickle(PATH_TO_TAG_RELEVANCE_GROUPED_CB_FILE)
        print("Arquivo relevance grouped: Carregado com sucesso!")
    else:
        print("Arquivo relevance grouped: Não foi carregado, verifique se era o que desejava.")
    
    if(relevance):
        tag_relevance_cb = pd.read_pickle(PATH_TO_TAG_RELEVANCE_CB_FILE)
        print("Arquivo relevance: Carregado com sucesso!")
    else:
        print("Arquivo relevance: Não foi carregado, verifique se era o que desejava.")
        
    if(tags_processed):
        tags_processed_cb = pd.read_pickle(PATH_TO_TAGS_PROCESSED_CB_FILE)
        print("Arquivo tags processed: Carregado com sucesso!")
    else:
        print("Arquivo tags processed: Não foi carregado, verifique se era o que desejava.")
        
        
    return data_cb, movies_cb, ratings_cb, ratings_infos_cb, tag_relevance_grouped_cb, tag_relevance_cb, tags_processed_cb
    

In [8]:
data_cb, movies_cb, ratings_cb, ratings_infos_cb, tag_relevance_grouped_cb, tag_relevance_cb, tags_processed_cb = load_cb_files(full=True, movies=True, ratings_infos=True, tags_processed=True)

Arquivo completo: Carregado com sucesso!
Arquivo movies: Carregado com sucesso!
Arquivo ratings: Não foi carregado, verifique se era o que desejava.
Arquivo ratings infos: Carregado com sucesso!
Arquivo relevance grouped: Não foi carregado, verifique se era o que desejava.
Arquivo relevance: Não foi carregado, verifique se era o que desejava.
Arquivo tags processed: Carregado com sucesso!


In [9]:
movies_cb.tail()

Unnamed: 0,movieId,title,genres
62418,209157,We (2018),[Drama]
62419,209159,Window of the Soul (2001),[Documentary]
62420,209163,Bad Poems (2018),"[Comedy, Drama]"
62421,209169,A Girl Thing (2001),[(no genres listed)]
62422,209171,Women of Devil's Island (1962),"[Action, Adventure, Drama]"


In [10]:
ratings_infos_cb.tail()

Unnamed: 0,movieId,average rating,rating count,weighted rating
59042,209157,1.5,1,3.067578
59043,209159,3.0,1,3.071202
59044,209163,4.5,1,3.074825
59045,209169,3.0,1,3.071202
59046,209171,3.0,1,3.071202


Olhando aqui o fim da lista de filmes e de informações de ratings, podemos notar que varios filmes não tem nenhum rating, e isso é importante de se lembrar mais para frente.

## 2.1 - FOCANDO NOS GÊNEROS: Separando gêneros em colunas diferentes

In [11]:
movies_genres_separated_cb = movies_cb.copy()

genres_list = []
#Para cada linha no dataframe, iteramos pela lista de generos e colocamos 1 na coluna correspondente
for index, row in movies_genres_separated_cb.iterrows():
    for genre in row['genres']:
        movies_genres_separated_cb.at[index, genre] = 1
        if(not genre in genres_list):
            genres_list.append(genre)
            
#Todos os valores não preenchidos se tornam 0
movies_genres_separated_cb = movies_genres_separated_cb.fillna(0)

In [12]:
movies_genres_separated_cb.tail()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
62418,209157,We,[Drama],2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62419,209159,Window of the Soul,[Documentary],2001,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
62420,209163,Bad Poems,"[Comedy, Drama]",2018,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62421,209169,A Girl Thing,[(no genres listed)],2001,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
62422,209171,Women of Devil's Island,"[Action, Adventure, Drama]",1962,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2.1.1 - Renomeando a coluna "(no genres listed)" para "None", para ficar mais legivel e mais facil de acessar.

In [13]:
movies_genres_separated_cb.rename({"(no genres listed)":"None"}, axis=1, inplace=True)

movies_genres_separated_cb.tail()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,None
62418,209157,We,[Drama],2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62419,209159,Window of the Soul,[Documentary],2001,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
62420,209163,Bad Poems,"[Comedy, Drama]",2018,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62421,209169,A Girl Thing,[(no genres listed)],2001,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
62422,209171,Women of Devil's Island,"[Action, Adventure, Drama]",1962,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Backup do dataset com a lista de generos ainda incluida
movies_genres_separated = movies_genres_separated_cb.copy()

In [15]:
# removendo coluna de generos
movies_genres_separated_cb.pop("genres")

0        [Adventure, Animation, Children, Comedy, Fantasy]
1                           [Adventure, Children, Fantasy]
2                                        [Comedy, Romance]
3                                 [Comedy, Drama, Romance]
4                                                 [Comedy]
                               ...                        
62418                                              [Drama]
62419                                        [Documentary]
62420                                      [Comedy, Drama]
62421                                 [(no genres listed)]
62422                           [Action, Adventure, Drama]
Name: genres, Length: 62423, dtype: object

In [16]:
movies_genres_separated_cb.head()

Unnamed: 0,movieId,title,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,None
0,1,Toy Story,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2.2 Criando o sistema de recomendação Content-Based - KNN

### 2.2.1 Utilizando generos e ano

In [17]:
# Removendo colunas "inuteis" para a criação da matriz
movies_genres_separated_cb_cpy = movies_genres_separated_cb.copy()

movies_genres_separated_cb_cpy.pop("title")
#movies_genres_separated_cb_cpy.pop("year")

movies_genres_separated_cb_cpy.head()

Unnamed: 0,movieId,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,None
0,1,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Convertendo coluna Year para float
movies_genres_separated_cb_cpy['year'] = movies_genres_separated_cb_cpy['year'].astype(float)

In [19]:
# Criando matriz com o dataFrame acima
movie_genres_matrix = pd.pivot_table(movies_genres_separated_cb_cpy, index = ["movieId"])

In [20]:
# Mostrando matriz
movie_genres_matrix

Unnamed: 0_level_0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,...,IMAX,Musical,Mystery,None,Romance,Sci-Fi,Thriller,War,Western,year
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1995.0
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1995.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1995.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1995.0
5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1995.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209157,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018.0
209159,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2001.0
209163,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2018.0
209169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2001.0


In [21]:
# Gerando matriz esparsa com a matriz acima
sparse_matrix_cb = sparse.csr_matrix(movie_genres_matrix)

In [22]:
# Criando o modelo knn
knn_genres_cb = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine') # temos que mexer nos parâmetros posteriormente

knn_genres_cb.fit(sparse_matrix_cb)

NearestNeighbors(metric='cosine', n_neighbors=10)

In [23]:
# procura filmes baseando-se nas bases de dados que contém os generos
def search_movies_genres_cb(search_word):
    return movies_cb[movies_cb.title.str.contains(search_word, flags=re.IGNORECASE)]

In [24]:
# criando função que gera recomendações basedo em um filme - utilizando um modelo KNN - baseando-se nos generos
def get_recommendations_genres_cb(movie_name, model): #nome do filme, modelo
    # Pegando o Id do filme que tenha o nome passado
    movieId = movies_cb.loc[movies_cb["title"] == movie_name]["movieId"].values[0]
    
    index = movies_cb.loc[movies_cb["movieId"] == movieId].index.values[0]
    
    distances, suggestions = model.kneighbors(sparse_matrix_cb.getrow(index).todense().tolist(), n_neighbors=N_NEIGHBORS)

    for i in range(0, len(distances.flatten())):
        if(i == 0):
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de {2}:'.format(i, movies_cb.loc[movies_cb.index == suggestions.flatten()[i]]["title"].values[0], distances.flatten()[i]))
    
    return distances, suggestions

In [25]:
# Setando um tamanho de coluna, para ver o nome completo dos filmes
pd.set_option('display.max_colwidth', 500)

search_movies_genres_cb("Spider-Man")

Unnamed: 0,movieId,title,genres,year
5241,5349,Spider-Man,"[Action, Adventure, Sci-Fi, Thriller]",2002
7923,8636,Spider-Man 2,"[Action, Adventure, Sci-Fi, IMAX]",2004
11561,52722,Spider-Man 3,"[Action, Adventure, Sci-Fi, Thriller, IMAX]",2007
14527,76709,Spider-Man: The Ultimate Villain Showdown,[Animation],2002
18241,95510,"Amazing Spider-Man, The","[Action, Adventure, Sci-Fi, IMAX]",2012
21454,110553,The Amazing Spider-Man 2,"[Action, Sci-Fi, IMAX]",2014
25074,122926,Untitled Spider-Man Reboot,"[Action, Adventure, Fantasy]",2017
56890,195159,Spider-Man: Into the Spider-Verse,"[Action, Adventure, Animation, Sci-Fi]",2018
59844,201773,Spider-Man: Far from Home,"[Action, Adventure, Sci-Fi]",2019


In [26]:
# Setando o tamanho de coluna de volta ao normal
pd.set_option('display.max_colwidth', 50)

In [27]:
a, b = get_recommendations_genres_cb("Spider-Man 2", knn_genres_cb)

Recomendações para Spider-Man 2: 

1: Superman Returns, com distância de 4.950484466803573e-13:
2: Star Wars: Episode II - Attack of the Clones, com distância de 4.970468481246826e-13:
3: Star Trek, com distância de 3.0845326293160724e-12:
4: Transformers: Revenge of the Fallen, com distância de 3.0845326293160724e-12:
5: Avatar, com distância de 3.0845326293160724e-12:
6: Tron: Legacy, com distância de 4.437561429426751e-12:
7: John Carter, com distância de 7.87325760143176e-12:
8: Avengers, The, com distância de 7.87325760143176e-12:
9: Amazing Spider-Man, The, com distância de 7.87325760143176e-12:


### 2.2.2 Utilizando as tags para fazer um sistema de recomendação baseado na matriz TFidf

In [28]:
# Carregando o arquivo que contém 
tags_processed_cb.head()

Unnamed: 0,movieId,tags,tags_processed
0,1,owned imdb top 250 pixar pixar time travel chi...,imdb top pixar pixar time travel child comedy ...
1,2,robin williams time travel fantasy based on ch...,robin williams time travel fantasy base child ...
2,3,funny best friend duringcreditsstinger fishing...,funny best friend duringcreditsstinger fish ol...
3,4,based on novel or book chick flick divorce int...,base novel book chick flick divorce interracia...
4,5,aging baby confidence contraception daughter g...,age baby confidence contraception daughter gyn...


#### Criação de uma matriz palavras_chave x filmes

### PODE-SE FAZER APENAS COM A CONTAGEM DE OCORRENCIA DAS PALAVRAS.

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(tags_processed_cb["tags_processed"].values.astype('U'))
#tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=tags_processed_cb.index.tolist())
print(tfidf_matrix.shape)

(45251, 32277)


In [54]:
tfidf.get_feature_names()[5000:5010]

['choreographed',
 'choreographer',
 'choreographic',
 'choreography',
 'chores',
 'chorine',
 'chorus',
 'chosen',
 'chou',
 'chow']

In [55]:
# Verificando graficamente

# Compress with SVD
# from sklearn.decomposition import TruncatedSVD
# svd = TruncatedSVD(n_components=700)
# latent_matrix = svd.fit_transform(tfidf_matrix_cpy)
 
# # plot var expalined to see what latent dimensions to use
# explained = svd.explained_variance_ratio_.cumsum()
# plt.plot(explained, '.-', ms = 16, color='red')
# plt.xlabel('Singular value components', fontsize= 12)
# plt.ylabel('Cumulative percent of variance', fontsize=12)        
# plt.show()

In [56]:
# Criando o modelo knn utilizando as tags
knn_tags_cb = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine') # temos que mexer nos parâmetros posteriormente

knn_tags_cb.fit(tfidf_matrix)

NearestNeighbors(metric='cosine', n_neighbors=10)

In [57]:
# criando função que gera recomendações basedo em um filme - utilizando um modelo KNN - baseando-se nas tags
def get_recommendations_tags_cb(movie_name, model): #nome do filme, modelo
    # Pegando o Id do filme que tenha o nome passado
    movieId = movies_cb.loc[movies_cb["title"] == movie_name]["movieId"].values[0]
    
    index = movies_cb.loc[movies_cb["movieId"] == movieId].index.values[0]
    
    distances, suggestions = model.kneighbors(tfidf_matrix.getrow(index).todense().tolist(), n_neighbors=N_NEIGHBORS)

    for i in range(0, len(distances.flatten())):
        if(i == 0):
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de {2}:'.format(i, movies_cb.loc[movies_cb.index == suggestions.flatten()[i]]["title"].values[0], distances.flatten()[i]))
    
    return distances, suggestions

In [58]:
a, b = get_recommendations_tags_cb("Spider-Man 2", knn_tags_cb)

print("\n", b)

Recomendações para Spider-Man 2: 

1: Stateside, com distância de 0.43413231440510947:
2: 47 Ronin, The (Genroku Chûshingura), com distância de 0.4668735525384128:
3: Symphony No. 42, com distância de 0.4923495317314386:
4: Going Home, com distância de 0.5466584022331984:
5: Secret Garden, The, com distância de 0.5468300241019454:
6: Heat, The, com distância de 0.5530226005100427:
7: Submarine X-1, com distância de 0.5600991413092565:
8: Nothing Left Unsaid: Gloria Vanderbilt & Anderson Cooper, com distância de 0.5627528096766188:
9: Chosen, com distância de 0.5737336287443238:

 [[ 7923  8281  8409 41548 33215  8479 19925 18740 40517 41918]]


In [59]:
ratings_infos_cb.head()

Unnamed: 0,movieId,average rating,rating count,weighted rating
0,1,3.893708,57309,3.887824
1,2,3.251527,24228,3.248508
2,3,3.142028,11804,3.13964
3,4,2.853547,2523,2.884188
4,5,3.058434,11714,3.058875


### 2.2.2 Utilizando as tags para fazer um sistema de recomendação baseado na ocorrência e ausencia de tags

In [60]:
# colocando as tags processadas em listas
tags_list_df = tags_processed_cb.copy()

tags_list_df['tags_processed'] = tags_list_df['tags_processed'].str.split(' ')

tags_list_df.drop("tags", inplace=True, axis=1)

tags_list_df.head()

Unnamed: 0,movieId,tags_processed
0,1,"[imdb, top, pixar, pixar, time, travel, child,..."
1,2,"[robin, williams, time, travel, fantasy, base,..."
2,3,"[funny, best, friend, duringcreditsstinger, fi..."
3,4,"[base, novel, book, chick, flick, divorce, int..."
4,5,"[age, baby, confidence, contraception, daughte..."


In [61]:
tags_list_df.describe()

Unnamed: 0,movieId
count,45251.0
mean,106384.672206
std,62915.776571
min,1.0
25%,55009.5
50%,121797.0
75%,159553.0
max,209063.0


#### AVISO: Não executar as duas células abaixo, apenas carregar o arquivo na posterior

In [52]:
# fazendo uma lista de tags, para cada tag na lista de tags daquela linha, caso não esteja na lista, adicionamos.
tags_list = []

for index, row in tags_list_df.iterrows():
    for tag in row['tags_processed']:
        #tags_list_df.at[index, tag] = 1
        if(not tag in tags_list):
            tags_list.append(tag)

len(tags_list)

32443

In [53]:
# salvando a lista de tags em um arquivo pickle
import pickle
with open('preprocessed-data/CB/movies_tags_matrix/processed_tags_list.pkl', 'w+b') as f:
     pickle.dump(tags_list, f)

In [62]:
tags_list = pd.read_pickle('preprocessed-data/CB/movies_tags_matrix/processed_tags_list.pkl');

In [63]:
print(tags_list)



#### "tags_list" são as nossas colunas e "moviesId_list" as linhas

In [64]:
# pegando lista de moviesID
moviesId_list = tags_list_df["movieId"].tolist()

# recuperando dimensoes da matriz
number_of_movies_with_tags = len(moviesId_list) # num linhas
number_of_tags = len(tags_list) # num colunas

In [65]:
# criando uma matriz temporaria preenchida nos locais corretos
temp = np.zeros((number_of_movies_with_tags, number_of_tags))

#### AVISO: Não executar as duas células abaixo, apenas carregar o arquivo na posterior

In [66]:
coordinates_to_fill = []
# recuperando os lugares a serem preenchidos na matriz
for index, row in tags_list_df.iterrows():
    for tag in row['tags_processed']:
        coordinates_to_fill.append([moviesId_list.index(row['movieId']), tags_list.index(tag)])


KeyboardInterrupt: 

In [None]:
# salvando as coordenadas a serem preenchidas num arquivo pickle
import pickle
with open('preprocessed-data/CB/movies_tags_matrix/coordinates_to_fill.pkl', 'w+b') as f:
     pickle.dump(coordinates_to_fill, f)

In [67]:
coordinates_to_fill = pd.read_pickle('preprocessed-data/CB/movies_tags_matrix/coordinates_to_fill.pkl')

In [68]:
# preenchendo essa matriz nos locais corretos
for coordinate in coordinates_to_fill:
    temp[coordinate[0]][coordinate[1]] = 1

In [69]:
# criando a matriz final
movies_tags_matrix = sparse.csr_matrix(temp, dtype=int)

movies_tags_matrix

<45251x32443 sparse matrix of type '<class 'numpy.intc'>'
	with 608415 stored elements in Compressed Sparse Row format>

In [70]:
# Criando o modelo knn utilizando as tags
knn_tags_processed_cb = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine') # temos que mexer nos parâmetros posteriormente

knn_tags_processed_cb.fit(movies_tags_matrix)

NearestNeighbors(metric='cosine', n_neighbors=10)

In [73]:
# criando função que gera recomendações basedo em um filme - utilizando um modelo KNN - baseando-se nas tags processadas
def get_recommendations_tags_processed_cb(movie_name, model): #nome do filme, modelo
    # Pegando o Id do filme que tenha o nome passado
    movieId = movies_cb.loc[movies_cb["title"] == movie_name]["movieId"].values[0]
    
    index = moviesId_list.index(movieId)
    
    distances, suggestions = model.kneighbors(movies_tags_matrix.getrow(index).todense().tolist(), n_neighbors=N_NEIGHBORS)
    
    suggestionsIds_list = [moviesId_list[index] for index in suggestions.flatten()]
    
    for i in range(0, len(distances.flatten())):
        if(i == 0):
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de: {2}'.format(i, movies_cb.loc[movies_cb["movieId"] == suggestionsIds_list[i]]["title"].values[0], distances.flatten()[i]))
    
    return distances, suggestions

In [74]:
a, b = get_recommendations_tags_processed_cb("Spider-Man 2", knn_tags_processed_cb)

Recomendações para Spider-Man 2: 

1: Spider-Man, com distância de: 0.47533317688720234
2: Spider-Man 3, com distância de: 0.5851433159120089
3: Batman Begins, com distância de: 0.6811919836281075
4: Batman Returns, com distância de: 0.7031840172183927
5: Superman III, com distância de: 0.734730957399361
6: Batman, com distância de: 0.741303137003815
7: Superman Returns, com distância de: 0.7444917564055193
8: Superman IV: The Quest for Peace, com distância de: 0.7522839326113557
9: X-Men: The Last Stand, com distância de: 0.753803711212755


# AS RECOMENDACOES KNN CONTENT BASED NAO ESTAO TAO BOAS.
# TODO - PESQUISAR OUTRAS FORMAS DE RECOMENDAR

# TODO2 - VER UMA FORMA DE MEDIR A ACURACIA DO KNN

# IDEIA: UTILIZAR PARA COMPARAÇÃO DE ACURÁCIA: CALCULAR O AVERAGE RATING DOS FILMES RECOMENDADOS E COMPARAR COM O AVERAGE RATING DO FILME INPUT.  
# SOURCE: https://hendra-herviawan.github.io/Movie-Recommendation-based-on-KNN-K-Nearest-Neighbors.html