<a href="https://colab.research.google.com/github/BrunoSaintClair/AlgoritmoDeRecomendacao/blob/main/Modelo%201%20-%20Embedding%20e%20similaridade%20por%20cosseno.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook implements a movie recommendation algorithm.


## To make recommendations, it represents each movie's text as a numerical vector (here, TF-IDF weights) and uses cosine similarity to measure how close those vectors are.

## The algorithm is a form of `content-based filtering`: it uses each movie's content (title, overview, keywords, and genres) to find similar movies.
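
To make the similarity measure concrete, here is a minimal sketch of cosine similarity written with only the standard library. It is a simplification: the vectors are raw word counts rather than the TF-IDF weights the notebook uses, and the three descriptions are hypothetical toy data.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy descriptions, vectorized as raw word counts.
space_a = Counter("explorers travel through space".split())
space_b = Counter("astronauts travel through space".split())
ocean = Counter("a clownfish searches the ocean".split())

print(cosine(space_a, space_b))  # 0.75: three of the four words are shared
print(cosine(space_a, ocean))    # 0.0: no words in common
```

TF-IDF changes only the weights inside each vector (rare words count for more); the similarity formula stays the same.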

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Dataset link:
# https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies

df = pd.read_csv("/content/drive/MyDrive/TMDB_movie_dataset_v11.csv")

In [None]:
# display(df.info())
# display(df.head(3))

### Filtering the data and keeping only the desired columns:


In [None]:
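# Keep only the columns used below; .copy() avoids pandas' SettingWithCopyWarning.
df = df[["title", "vote_average", "vote_count", "overview", "genres", "keywords", "runtime"]].copy()

# Keep the first occurrence of each title so titles can serve as unique labels later.
df = df.drop_duplicates(subset="title")

# Drop rows missing any of the text fields that feed the model.
df = df.dropna(subset=["title", "overview", "genres", "keywords"], how="any")

# Discard movies with fewer than 75 votes.
df = df.query("vote_count >= 75")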

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18289 entries, 0 to 21828
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         18289 non-null  object 
 1   vote_average  18289 non-null  float64
 2   vote_count    18289 non-null  int64  
 3   overview      18289 non-null  object 
 4   genres        18289 non-null  object 
 5   keywords      18289 non-null  object 
 6   runtime       18289 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 1.1+ MB

### Rounding the average rating and creating a new column with the text that will be used later:

In [None]:
df['vote_average'] = df['vote_average'].round(2)

df['info'] = df['title'] + ' | ' + df['overview'] + ' | ' + df['keywords'] + ' | ' + df['genres']

df.head(3)

,title,vote_average,vote_count,overview,genres,keywords,runtime,info
0,Inception,8.36,34495,"Cobb, a skilled thief who commits corporate es...","Action, Science Fiction, Adventure","rescue, mission, dream, airplane, paris, franc...",148,"Inception | Cobb, a skilled thief who commits ..."
1,Interstellar,8.42,32571,The adventures of a group of explorers who mak...,"Adventure, Drama, Science Fiction","rescue, future, spacecraft, race against time,...",169,Interstellar | The adventures of a group of ex...
2,The Dark Knight,8.51,30619,Batman raises the stakes in his war on crime. ...,"Drama, Action, Crime, Thriller","joker, sadism, chaos, secret identity, crime f...",152,The Dark Knight | Batman raises the stakes in ...


### Building the model:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
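# Turn each movie's combined text into a TF-IDF vector (one sparse row per movie).
vectorizer = TfidfVectorizer()

info_embedding = vectorizer.fit_transform(df['info'])

# Pairwise cosine similarity between all movies (a dense N x N matrix).
similarity = cosine_similarity(info_embedding)

# Label rows and columns by title so movies can be looked up by name.
similarity_df = pd.DataFrame(similarity, index=df['title'], columns=df['title'])
display(similarity_df.head())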

title,Inception,Interstellar,The Dark Knight,Avatar,The Avengers,Deadpool,Avengers: Infinity War,Fight Club,Guardians of the Galaxy,Pulp Fiction,...,Stasera a casa di Alice,Pretty Poison,Elsewhere,The Lusty Men,Motocrossed,Riddick: Blindsided,ZMD: Zombies of Mass Destruction,Daleks' Invasion Earth: 2150 A.D.,The Killer Shrews,Ménilmontant
Inception,1.0,0.050875,0.023865,0.027553,0.034165,0.024962,0.043719,0.007389,0.044825,0.030299,...,0.005001,0.004343,0.045339,0.00877,0.014445,0.021843,0.018286,0.028464,0.021138,0.007798
Interstellar,0.050875,1.0,0.017219,0.207241,0.033476,0.018742,0.06209,0.047074,0.094458,0.038551,...,0.023313,0.009511,0.006836,0.018492,0.031071,0.07072,0.015531,0.081722,0.032784,0.009342
The Dark Knight,0.023865,0.017219,1.0,0.053794,0.076905,0.082962,0.085714,0.037984,0.03724,0.058257,...,0.018433,0.041507,0.036053,0.016253,0.033809,0.034865,0.032295,0.022798,0.028156,0.015871
Avatar,0.027553,0.207241,0.053794,1.0,0.067957,0.026155,0.078681,0.012863,0.082851,0.009159,...,0.022013,0.009453,0.006294,0.004018,0.01515,0.060137,0.017686,0.081712,0.018105,0.005086
The Avengers,0.034165,0.033476,0.076905,0.067957,1.0,0.088911,0.229936,0.031001,0.216067,0.018735,...,0.043379,0.014123,0.009509,0.014528,0.025523,0.055694,0.036402,0.077299,0.029196,0.024468
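
With 18,289 movies, the dense similarity matrix holds 18,289 × 18,289 float64 values, roughly 2.7 GB. If that does not fit in memory, one alternative is to score a single movie at a time against the sparse TF-IDF matrix. The sketch below assumes the default `TfidfVectorizer` settings; `top_k_for_row` and the toy corpus are hypothetical names and data introduced for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_for_row(tfidf_matrix, row, k=10):
    """Return (indices, scores) of the k rows most similar to `row`, best first."""
    # TfidfVectorizer L2-normalizes rows by default (norm="l2"),
    # so a plain dot product equals the cosine similarity.
    scores = linear_kernel(tfidf_matrix[[row]], tfidf_matrix).ravel()
    scores[row] = -np.inf                       # exclude the movie itself
    k = min(k, len(scores) - 1)
    top = np.argpartition(scores, -k)[-k:]      # the k best, in arbitrary order
    top = top[np.argsort(scores[top])[::-1]]    # sort just those k, best first
    return top, scores[top]

# Hypothetical toy corpus standing in for df['info'].
toy_info = [
    "batman fights the joker in gotham",
    "the joker returns to gotham and batman responds",
    "toys come alive when the boy leaves",
    "a race car gets lost in a small town",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_info)

indices, scores = top_k_for_row(toy_matrix, row=0, k=2)
print(indices, scores.round(3))
```

Only one row of similarities exists at any time, so memory use grows linearly with the number of movies instead of quadratically.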


### Function that returns the 10 most similar movies for any movie in the dataset:

In [None]:
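def top10_similarity(movie):
  # Sort every title by its similarity to `movie`, highest first.
  # A local name other than `df` avoids shadowing the global dataframe.
  scores = similarity_df[[movie]].sort_values(by=movie, ascending=False)
  # The first row is the movie itself (similarity 1.0), so skip it.
  return scores.iloc[1:11]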

### Usage example:

In [None]:
top10_similarity('The Dark Knight')

title,The Dark Knight
Batman,0.376398
The Dark Knight Rises,0.356161
"Batman: The Long Halloween, Part One",0.348818
"Batman: The Long Halloween, Part Two",0.339057
Batman: Mask of the Phantasm,0.327778
Batman Begins,0.306638
Batman Beyond: Return of the Joker,0.296134
Batman: Under the Red Hood,0.294194
Batman Forever,0.284154
The Batman,0.28386
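
A possible variant of this lookup, sketched under assumptions: it takes the similarity table and the number of results as parameters, raises a clear error for unknown titles, and drops the queried movie by label rather than by position. `top_n_similar` and the toy matrix are hypothetical and are not part of the notebook.

```python
import pandas as pd

def top_n_similar(sim: pd.DataFrame, movie: str, n: int = 10) -> pd.Series:
    """Return the n titles most similar to `movie`, excluding the movie itself."""
    if movie not in sim.columns:
        raise KeyError(f"{movie!r} is not in the similarity matrix")
    # Dropping by label stays correct even if another title ties at 1.0.
    return sim[movie].drop(labels=movie).nlargest(n)

# Hypothetical 3 x 3 similarity matrix with made-up scores.
titles = ["A", "B", "C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=titles, columns=titles,
)

print(top_n_similar(toy_sim, "A", n=2))
```

Like the original function, this assumes titles are unique, so each column selects a single Series.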


### Using the function to create a column with the similar movies

In [None]:
df['similar_movies'] = df['title'].apply(lambda x: top10_similarity(x).index.tolist())

In [None]:
df[['title', 'similar_movies']].head(3)

,title,similar_movies
0,Inception,"[Inception: The Cobol Job, The Cell, Virtual R..."
1,Interstellar,"[Battle Beyond the Stars, SpaceCamp, Journey t..."
2,The Dark Knight,"[Batman, The Dark Knight Rises, Batman: The Lo..."


### Looking up, in the dataframe, the movies most similar to those in the list, and formatting the output to include additional useful data:

In [None]:
movies_list = [
    'Cars', 'Toy Story', 'Fight Club',
    'The Silence of the Lambs', 'The Dark Knight',
    ]

# Index by title once; .loc lookups avoid building query strings,
# which break on titles that contain quotes.
movies = df.set_index('title')

for movie in movies_list:
  print("=" * 100)
  print(f"\nTop 10 movies most similar to {movie}:\n")
  for similar_movie in movies.loc[movie, 'similar_movies']:
    row = movies.loc[similar_movie]
    print(f"- {similar_movie}, average rating: {row['vote_average']}, "
          f"vote count: {row['vote_count']}, runtime: {row['runtime']} minutes")
  print()


Top 10 movies most similar to Cars:

- Cars 2, average rating: 6.08, vote count: 7115, runtime: 106 minutes
- The Radiator Springs 500½, average rating: 6.2, vote count: 78, runtime: 6 minutes
- Cars 3, average rating: 6.85, vote count: 5263, runtime: 102 minutes
- Time Travel Mater, average rating: 6.0, vote count: 115, runtime: 7 minutes
- Vacation, average rating: 6.3, vote count: 3472, runtime: 99 minutes
- A Goofy Movie, average rating: 6.98, vote count: 1592, runtime: 78 minutes
- Death Race 2, average rating: 5.79, vote count: 1011, runtime: 100 minutes
- Tom and Jerry: The Fast and the Furry, average rating: 6.86, vote count: 239, runtime: 75 minutes
- Ford v Ferrari, average rating: 8.01, vote count: 6972, runtime: 153 minutes
- The Great Race, average rating: 7.15, vote count: 282, runtime: 160 minutes


Top 10 movies most similar to Toy Story:

- Toy Story 2, average rating