# Intro to Recommender Systems

Recommender systems can be defined as a collection of algorithms used to recommend items to a user, based on information gathered from that user.

Recommender systems fall into two main types:

* __Memory-Based__
    * Uses the entire dataset to generate a recommendation
    * Uses statistical techniques to measure how close users and items are (e.g. Pearson correlation, cosine similarity, Euclidean distance)


* __Model-Based__
    * Builds a model of the user in an attempt to learn their preferences
    * Such models can be created with ML techniques such as linear regression, clustering and classification
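
As a rough illustration of the memory-based measures listed above, the sketch below computes them for two made-up rating vectors. The users and their ratings are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical ratings given by two users to the same four items
user_a = np.array([5.0, 4.0, 1.0, 4.0])
user_b = np.array([4.0, 5.0, 2.0, 3.0])

# Pearson correlation: covariance normalised by both standard deviations
pearson = np.corrcoef(user_a, user_b)[0, 1]

# Cosine similarity: cosine of the angle between the two rating vectors
cosine = user_a.dot(user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))

# Euclidean distance: a smaller value means the users are closer
euclidean = np.linalg.norm(user_a - user_b)

print(pearson, cosine, euclidean)
```

Pearson and cosine are similarities (higher means closer), while the Euclidean value is a distance (lower means closer), so they cannot be compared directly.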

## Data

The data comes from [GroupLens](https://grouplens.org/datasets/movielens/) and consists of movie ratings.

In [3]:
#Download data
!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip 

--2020-05-18 23:02:31--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160301210 (153M) [application/zip]
Saving to: ‘moviedataset.zip’


2020-05-18 23:16:19 (189 KB/s) - ‘moviedataset.zip’ saved [160301210/160301210]

unziping ...
Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


## Preprocessing

In [52]:
#Libraries
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('movies.csv')
#Storing the user information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Let's use a regex to remove the year from the "title" column and store it in a new column.
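
The idea can be tried on a tiny example first. This is only a sketch on two hypothetical titles, one of which contains a year inside the title itself, to show why the pattern is anchored on the parentheses.

```python
import pandas as pd

# Hypothetical titles, including one with a year inside the title itself
titles = pd.Series(['Toy Story (1995)', '2001: A Space Odyssey (1968)'])

# Only a four-digit number wrapped in parentheses is treated as the year
years = titles.str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesised year, then trim the leftover whitespace
clean_titles = titles.str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```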

In [3]:
# Keep the parentheses in the pattern so it does not clash with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# Now remove the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# With the years extracted, we can remove them from the title column
# regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# Strip any whitespace left at the end of the title
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


In [4]:
# To simplify later processing, split the "genres" column on the '|' character
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [5]:
# new_movies_1 = {'movieId':151712 , 'title': 'Dance in the dark', 'genres': ['Crime', 'Drama', 'Musical'], 'year':2000}
# new_movies_2 = {'movieId':151711 , 'title': 'Deadpool', 'genres': ['Action', 'Adventure', 'Comedy'], 'year':2016}
# new_movies_3 = {'movieId':151711 , 'title': 'Suicide Squad', 'genres': ['Action', 'Adventure', 'Fantasy'], 'year':2016}

# movies_df = movies_df.append(new_movies_1, ignore_index=True)
# movies_df = movies_df.append(new_movies_2, ignore_index=True)
# movies_df = movies_df.append(new_movies_3, ignore_index=True)

We will use __One Hot Encoding__ to convert the list of genres into a vector in which each column corresponds to one value of that feature. This turns our categorical feature into several numerical features.
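
As a side note, the same encoding can be obtained without an explicit loop. The sketch below is an alternative to the row-by-row approach used in this notebook, not the approach itself. It assumes a hypothetical three-movie frame (`toy_df`) shaped like `movies_df` after the genre split.

```python
import pandas as pd

# Hypothetical miniature version of movies_df after the genre split
toy_df = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [['Adventure', 'Animation'], ['Adventure', 'Fantasy'], ['Comedy', 'Romance']],
})

# Join each list back into a '|'-separated string and expand it into 0/1 columns
one_hot = toy_df['genres'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns to the original frame
encoded = pd.concat([toy_df, one_hot], axis=1)
print(encoded)
```

On the full dataset this tends to be much faster than iterating with `iterrows()`, at the cost of producing the genre columns in alphabetical order rather than in order of first appearance.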

In [6]:
filmes = ['Memento', 'Grown Ups', 'Hulk']
movies_df[movies_df['title'].isin(filmes)]

Unnamed: 0,movieId,title,genres,year
4133,4226,Memento,"[Mystery, Thriller]",2000
6425,6534,Hulk,"[Action, Adventure, Sci-Fi]",2003
15563,79134,Grown Ups,[Comedy],2010


In [7]:
# Copy the dataframe
MoviesWithGenres_df = movies_df.copy()

In [8]:
# Iterate over each row of the df and over its list of genres, setting 1 when the movie belongs to that genre
# (the missing entries become 0 further down)

"""
DataFrame.iterrows() -- Method that iterates over the rows of a df
DataFrame.at[] -- Similar to DataFrame.loc[], this accessor addresses a single value in the df (df.at[row, column])
"""

for index, row in movies_df.iterrows():
    for genre in row['genres']:
        MoviesWithGenres_df.at[index, genre] = 1

# Replace NaN with 0 to mark the genres a movie does not belong to
MoviesWithGenres_df = MoviesWithGenres_df.fillna(0)
MoviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
MoviesWithGenres_df = MoviesWithGenres_df.drop('(no genres listed)', axis=1)

In [10]:
#Dataframe ratings
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [11]:
# Drop the timestamp column
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


# Content-based recommendation system

This technique tries to identify the user's favorite aspects of an item and then recommends items that share those aspects. In this case we will identify the user's favorite movie genres from their ratings.

We start by "simulating" an initial user input in which the user rates a few movies.

## Modeling

In [18]:
userInput = [
            {'title':'Memento', 'rating':4},
            {'title':'American Beauty', 'rating':5},
            {'title':'Old Boy', 'rating':4},
            {'title':"Hulk", 'rating':1},
            {'title':'Grown Ups', 'rating':1}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Memento,4
1,American Beauty,5
2,Old Boy,4
3,Hulk,1
4,Grown Ups,1


In [19]:
# Filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputId
# Merge on title for simplicity
inputMovies = pd.merge(inputId, inputMovies)

inputMovies

Unnamed: 0,movieId,title,genres,year,rating
0,2858,American Beauty,"[Drama, Romance]",1999,5
1,4226,Memento,"[Mystery, Thriller]",2000,4
2,6534,Hulk,"[Action, Adventure, Sci-Fi]",2003,1
3,27773,Old Boy,"[Mystery, Thriller]",2003,4
4,79134,Grown Ups,[Comedy],2010,1


In [20]:
# Drop information that will not be used
inputMovies = inputMovies.drop('genres', axis=1).drop('year', axis=1)
inputMovies

Unnamed: 0,movieId,title,rating
0,2858,American Beauty,5
1,4226,Memento,4
2,6534,Hulk,1
3,27773,Old Boy,4
4,79134,Grown Ups,1


We start by learning the user's preferences from the input they provided.

In [21]:
# Subset with the features of only the movies the user rated
userMovies = MoviesWithGenres_df[MoviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
2773,2858,American Beauty,"[Drama, Romance]",1999,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4133,4226,Memento,"[Mystery, Thriller]",2000,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6425,6534,Hulk,"[Action, Adventure, Sci-Fi]",2003,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
9482,27773,Old Boy,"[Mystery, Thriller]",2003,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15563,79134,Grown Ups,[Comedy],2010,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We only need a table with the genres, so let's drop what is unnecessary and reset the table index.

In [22]:
# Reset the index to avoid problems
userMovies = userMovies.reset_index(drop=True)

# Drop unnecessary columns
userGenreTable = userMovies.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
#userGenreTable = userGenreTable.drop([0])
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can now start learning the user's preferences. To do so, we turn each genre into a __weight__ by multiplying the transposed genre table by the user's ratings - _(dot product)_

The operation is therefore of the form:


\begin{equation*}
    W = G^T X
\end{equation*}

Where $W$ is the vector of genre weights, $G^T$ is the transposed genre matrix and $X$ is the vector of user ratings.
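
To make the product concrete, the sketch below redoes it by hand with NumPy. It writes `G` for the one-hot genre table of the five rated movies (one row per movie) and `X` for their ratings. Only the eight genres that actually occur in those movies are kept, which is a simplification for readability.

```python
import numpy as np

# Genre columns restricted to the ones that appear in the five rated movies
genres = ['Adventure', 'Comedy', 'Romance', 'Drama', 'Action', 'Thriller', 'Mystery', 'Sci-Fi']

# One row per rated movie, in movieId order
G = np.array([
    [0, 0, 1, 1, 0, 0, 0, 0],  # American Beauty: Drama, Romance
    [0, 0, 0, 0, 0, 1, 1, 0],  # Memento: Mystery, Thriller
    [1, 0, 0, 0, 1, 0, 0, 1],  # Hulk: Action, Adventure, Sci-Fi
    [0, 0, 0, 0, 0, 1, 1, 0],  # Old Boy: Mystery, Thriller
    [0, 1, 0, 0, 0, 0, 0, 0],  # Grown Ups: Comedy
])

# Ratings in the same movie order
X = np.array([5, 4, 1, 4, 1])

# Each genre weight is the sum of the ratings of the movies tagged with that genre
W = G.T @ X
profile = dict(zip(genres, W.tolist()))
print(profile)
```

Thriller and Mystery each collect the two ratings of 4 from Memento and Old Boy, which is why they end up as the strongest weights in the profile.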

In [23]:
#X
inputMovies['rating']

0    5
1    4
2    1
3    4
4    1
Name: rating, dtype: int64

In [24]:
userGenreTable.shape, inputMovies.shape

((5, 19), (5, 3))

In [25]:
# Dot product
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
userProfile

Adventure      1.0
Animation      0.0
Children       0.0
Comedy         1.0
Fantasy        0.0
Romance        5.0
Drama          5.0
Action         1.0
Crime          0.0
Thriller       8.0
Horror         0.0
Mystery        8.0
Sci-Fi         1.0
IMAX           0.0
Documentary    0.0
War            0.0
Musical        0.0
Western        0.0
Film-Noir      0.0
dtype: float64

Now we can use this _User Profile_, the user's preferences, to make the recommendations.

In [26]:
# Get the genres of every movie in the dataframe
genreTable = MoviesWithGenres_df.set_index(MoviesWithGenres_df['movieId'])

# Drop unnecessary information
genreTable = genreTable.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
genreTable.head()

Unnamed: 0_level_0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
genreTable.shape

(34208, 19)

In [30]:
# Multiply the genre table by the weights and take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
recommendationTable_df.head()

movieId
83266    0.966667
76153    0.933333
75408    0.933333
27016    0.900000
27781    0.900000
dtype: float64
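
The score of a single movie is simply the share of the total profile weight covered by its genres. The sketch below reproduces that calculation for two hypothetical candidates (`dark_drama` and `action_comedy` are made-up labels), using the non-zero weights of the user profile shown earlier.

```python
import pandas as pd

# Non-zero genre weights of the user profile
user_profile = pd.Series({
    'Adventure': 1.0, 'Comedy': 1.0, 'Romance': 5.0, 'Drama': 5.0,
    'Action': 1.0, 'Thriller': 8.0, 'Mystery': 8.0, 'Sci-Fi': 1.0,
})

# Hypothetical candidates encoded as 0/1 genre rows
candidates = pd.DataFrame(
    [[0, 0, 1, 1, 0, 1, 1, 0],   # Drama, Mystery, Romance, Thriller
     [1, 1, 0, 0, 1, 0, 0, 1]],  # Action, Adventure, Comedy, Sci-Fi
    index=['dark_drama', 'action_comedy'], columns=user_profile.index)

# Sum the weights of the genres each movie has, then normalise by the total weight
scores = (candidates * user_profile).sum(axis=1) / user_profile.sum()
print(scores)
```

Because the score only adds up genre weights, a movie tagged with many genres can outrank a movie that matches the user's taste more narrowly.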

In [31]:
# Final recommendation table
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,movieId,title,genres,year
670,680,"Alphaville (Alphaville, une étrange aventure d...","[Drama, Mystery, Romance, Sci-Fi, Thriller]",1965
1236,1264,Diva,"[Action, Drama, Mystery, Romance, Thriller]",1981
2734,2819,Three Days of the Condor (3 Days of the Condor),"[Drama, Mystery, Romance, Thriller]",1975
7907,8588,Killing Me Softly,"[Drama, Mystery, Romance, Thriller]",2002
9174,27016,"Curve, The (Dead Man's Curve)","[Comedy, Drama, Mystery, Romance, Thriller]",1998
9488,27781,Svidd Neger,"[Comedy, Crime, Drama, Horror, Mystery, Romanc...",2003
13754,68685,Incendiary,"[Drama, Mystery, Romance, Thriller]",2008
15001,75408,Lupin III: Sweet Lost Night (Rupan Sansei: Swe...,"[Action, Animation, Comedy, Crime, Drama, Myst...",2008
15073,76153,Lupin III: First Contact (Rupan Sansei: Faasut...,"[Action, Animation, Comedy, Crime, Drama, Myst...",2002
16504,83266,Kaho Naa... Pyaar Hai,"[Action, Adventure, Comedy, Drama, Mystery, Ro...",2000


## Advantages and Disadvantages

* __Advantages:__
    * Learns the user's preferences
    * Highly personalized for each user
    

* __Disadvantages:__
    * Does not take other aspects of the item into account, which can lower the quality of the recommendations
    * Extracting the data is not trivial
    * Determining which characteristics the user likes or dislikes is not simple either

# Collaborative Filtering

This technique comes in two variants:

* __User-based collaborative filtering:__
    * Based on the similarity between the user and their neighborhood of users
    
* __Item-based collaborative filtering:__
    * Based on the similarity between items

__Some challenges of this approach__

* Sparse data
    * Users generally rate only a limited number of items, which becomes a problem in a large dataset
 
* Cold start
    * It is difficult to make a recommendation for a new user
    
* Scalability
    * The cost of the computation grows with the number of users or items
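
The sparsity problem can be quantified as the fraction of user-item pairs that have no rating. The sketch below does this for a small hypothetical ratings log with the same columns as `ratings_df`.

```python
import pandas as pd

# Hypothetical ratings log: one row per (user, movie) rating
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 4],
    'movieId': [10, 20, 10, 30, 40, 20],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = ratings['userId'].nunique()
n_items = ratings['movieId'].nunique()

# Fraction of the user-item matrix that has no rating at all
sparsity = 1 - len(ratings) / (n_users * n_items)
print(f'{n_users} users x {n_items} items, sparsity = {sparsity:.3f}')
```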

The process for building a user-based recommender system consists of:

* Selecting the user together with the movies they have watched
* Finding the top X neighbors based on this user's ratings
* Retrieving the watched-movie record of the user for each neighbor
* Calculating the similarity between them
* Recommending the items with the highest scores

## Modeling

In [35]:
#movies_df.head()

In [36]:
#ratings_df.head()

In [34]:
# User input
userInput = [
            {'title':'Memento', 'rating':4},
            {'title':'American Beauty', 'rating':5},
            {'title':'Old Boy', 'rating':4},
            {'title':"Hulk", 'rating':1},
            {'title':'Grown Ups', 'rating':1}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Memento,4
1,American Beauty,5
2,Old Boy,4
3,Hulk,1
4,Grown Ups,1


In [37]:
# Look up the user's movies in the dataset

# Filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# Merge
inputMovies = pd.merge(inputId, inputMovies)

# Drop the year column
inputMovies = inputMovies.drop('year', axis=1)

inputMovies

Unnamed: 0,movieId,title,genres,rating
0,2858,American Beauty,"[Drama, Romance]",5
1,4226,Memento,"[Mystery, Thriller]",4
2,6534,Hulk,"[Action, Adventure, Sci-Fi]",1
3,27773,Old Boy,"[Mystery, Thriller]",4
4,79134,Grown Ups,[Comedy],1


In [38]:
# Subset with other users' ratings of the same movies rated by our user of interest
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
115,4,2858,3.0
248,7,4226,5.0
597,13,2858,4.0
958,15,2858,3.5
1011,15,4226,3.5


In [42]:
# Create several sub-dataframes, one per distinct value of the given column
# A scalar key (not a one-element list) keeps the group names as plain user ids
userSubsetGroup = userSubset.groupby('userId')

In [44]:
userSubsetGroup.get_group(1155)

Unnamed: 0,userId,movieId,rating
106822,1155,4226,5.0


We sort the users so that those who share the most movies with the user of interest get higher priority, which improves the quality of the recommendation.

In [45]:
# Sort the users by the number of movies they share with the user of interest
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [49]:
#userSubsetGroup[1]

To compute the similarity between these users we will use the Pearson correlation coefficient. This choice rests mainly on the fact that the coefficient is invariant to scale, which makes the values more consistent for this case. For example, two users from the same group may rate an item very differently in absolute terms, yet they still belong to the same group and are still similar users.

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 "Pearson Correlation")

The resulting values range from $r = -1$ to $r = 1$
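
The scale invariance can be checked on a small example. The sketch below writes the coefficient with the sums of squares `Sxx`, `Syy` and `Sxy`, and applies it to three hypothetical users: `b` rates every movie exactly one point lower than `a`, and `c` has the opposite taste.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation written with the Sxx, Syy and Sxy sums of squares."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A user who gives every movie the same rating has zero variance
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings of the same four movies
a = [5, 4, 2, 4]
b = [4, 3, 1, 3]  # same taste as a, but one point stricter
c = [1, 2, 4, 2]  # opposite taste

r_ab = pearson(a, b)
r_ac = pearson(a, c)
print(r_ab, r_ac)
```

Users `a` and `b` never give the same rating to a movie, yet their correlation is 1. Users `a` and `c` rank the movies in reverse order, so their correlation is -1.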

In [50]:
# For simplicity, impose a limit so we do not have to iterate over every similar user
userSubsetGroup = userSubsetGroup[0:100]

In [53]:
# Compute the Pearson correlation coefficient

pearsonCorrelationDict = {}

for name, group in userSubsetGroup:
    # Sort the input and the user group by movieId so that their ratings line up
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    
    # N for the formula
    nRatings = len(group)
    
    # Ratings of the movies both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    # Store those ratings in a temporary list (buffer)
    tempRatingList = temp_df['rating'].tolist()
    
    # Ratings given by the current user group
    tempGroupList = group['rating'].tolist()
    
    # Formula
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    # Result (0 when either variance is zero)
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

In [54]:
pearsonCorrelationDict.items()

dict_items([(815, 0.776819332832332), (2452, 0.9245881363528833), (11957, 0.8964214570007949), (14588, 0.5860090386731199), (22884, 0.8416833807014048), (28946, 0.9569487529386911), (29300, 0.9287269204251444), (31602, 0.7319250547114001), (32474, 0.9245881363528833), (35743, 0.819920061690787), (35887, 0.7637626158259734), (38446, -0.07356123579206249), (39273, 0.7862136275414388), (41609, 0.8827348295047499), (42185, 0.8706814878985729), (44027, 0.727632202114423), (44760, 0.820412654142368), (47919, 0.9583148474999101), (51033, 0.9759000729485335), (51920, 0.791117470161674), (52575, 0.7807200583588266), (52673, 0.9597148699373935), (54059, 0.35714285714285715), (54133, 0.7985957062499233), (58040, 0.8081220356417685), (60769, 0.9376144618769919), (63062, 0.9021937088963177), (63839, 0.881293447800432), (66096, 0.822612745660623), (66748, 0.7950542444263022), (67776, 0.48795003647426677), (71009, 0.9354143466934853), (71115, 0.6167939547439616), (71491, 0.7513428837969105), (73419, 

In [55]:
# DataFrame of the coefficients
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.776819,815
1,0.924588,2452
2,0.896421,11957
3,0.586009,14588
4,0.841683,22884


In [57]:
# Top 50 users most similar to the input
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
68,1.0,148766
42,0.989743,91951
56,0.981981,134046
75,0.981878,166325
86,0.981878,198183


## Recommendation

Ratings of the selected users for all movies.
To get them we take the weighted average of the movie ratings, using the Pearson correlation as the weight. To do that, we first need to fetch the movies watched by the users in our pearsonDF from the ratings dataframe, keeping each user's correlation in the __similarityIndex__ column.
This is done with the merge below.

In [58]:
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,148766,2,3.5
1,1.0,148766,3,3.0
2,1.0,148766,6,3.5
3,1.0,148766,7,3.5
4,1.0,148766,9,3.5


Now we need to multiply each movie rating by its weight ( _pearson_ ), sum the new ratings and divide by the sum of the weights.
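
The whole weighted average fits in a few lines when tried on a toy table. The sketch below assumes three hypothetical neighbours and two movies. The column names mirror the ones used in this notebook, but the values are made up.

```python
import pandas as pd

# Hypothetical neighbours with their similarity to the target user and their ratings
neighbour_ratings = pd.DataFrame({
    'userId':          [1, 1, 2, 2, 3],
    'similarityIndex': [1.0, 1.0, 0.5, 0.5, 0.25],
    'movieId':         [10, 20, 10, 20, 10],
    'rating':          [4.0, 2.0, 5.0, 4.0, 1.0],
})

# Weight every rating by the similarity of the user who gave it
neighbour_ratings['weightedRating'] = (
    neighbour_ratings['similarityIndex'] * neighbour_ratings['rating'])

# Sum the weights and the weighted ratings per movie, then divide
sums = neighbour_ratings.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
score = sums['weightedRating'] / sums['similarityIndex']
print(score)
```

Note that a movie rated by a single neighbour simply inherits that neighbour's rating, whatever the similarity, so scores backed by very few ratings should be read with caution.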

In [59]:
# Multiply the similarity by the ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,148766,2,3.5,3.5
1,1.0,148766,3,3.0,3.0
2,1.0,148766,6,3.5,3.5
3,1.0,148766,7,3.5,3.5
4,1.0,148766,9,3.5,3.5


In [60]:
# Sum the new ratings after grouping by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,40.642297,155.732431
2,38.859937,114.316091
3,14.672419,43.212966
4,2.646699,5.814047
5,15.557294,41.625888


In [61]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

# Take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.831782,1
2,2.941747,2
3,2.945183,3
4,2.196716,4
5,2.675651,5


In [62]:
# Sort by score and recommend the 10 movies the user is most likely to want to watch
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
8015,5.0,8015
4737,5.0,4737
102217,5.0,102217
5221,5.0,5221
47642,5.0,47642
2553,5.0,2553
6784,5.0,6784
6021,5.0,6021
6051,5.0,6051
6206,5.0,6206


In [63]:
# And the movies are:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,genres,year
2469,2553,Village of the Damned,"[Horror, Sci-Fi, Thriller]",1960
4642,4737,"American Rhapsody, An",[Drama],2001
5125,5221,Harrison's Flowers,[Drama],2000
5923,6021,"American Friend, The (Amerikanische Freund, Der)","[Crime, Drama, Mystery, Thriller]",1977
5953,6051,"Harder They Come, The","[Action, Crime, Drama]",1973
6108,6206,Seize the Day,[Drama],1986
6675,6784,"Song Remains the Same, The","[Documentary, Musical]",1976
7621,8015,"Phantom Tollbooth, The","[Adventure, Animation, Children, Fantasy]",1970
11264,47642,How to Eat Fried Worms,"[Children, Drama]",2006
21000,102217,Bill Hicks: Revelations,[Comedy],1993


## Advantages and Disadvantages

* __Advantages:__
    * Takes other users' ratings into account
    * No need to study or extract information about the item being recommended
    * Adapts to changes in the user's tastes
    

* __Disadvantages:__
    * The approximation function is slow
    * There may be too few users to compare against
    * Privacy concerns arise when trying to learn the user's preferences