# Non-personalized recommender systems on MovieLens
Filippo Fantinato 2041620

In this notebook I experiment with non-personalized recommender systems on the MovieLens dataset. The two techniques I use are most popular and highest rated films.

In [None]:
import numpy as np
import pandas as pd

Let's download and unzip the dataset, then read the movies and ratings files.

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2022-12-29 14:37:38--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip.2’


2022-12-29 14:37:38 (6.62 MB/s) - ‘ml-latest-small.zip.2’ saved [978202/978202]

Archive:  ml-latest-small.zip
replace ml-latest-small/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [None]:
movies = pd.read_csv('ml-latest-small/movies.csv')
# Use movieId as the index so that movies can be looked up by id
movies = movies.set_index('movieId')
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [None]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Since both techniques rely on the number of ratings for each movie, I create a dataframe with that information, indexed by movieId.

In [None]:
num_ratings = ratings[['movieId', 'userId']].groupby(['movieId']).count()
num_ratings = num_ratings.rename(columns={"userId": "#ratings"})
num_ratings

movieId,#ratings
1,215
2,110
3,52
4,7
5,49
...,...
193581,1
193583,1
193585,1
193587,1


To avoid duplicating code, I declare a function that, given a dataframe indexed by movieId and the number of movies to return, returns the information about those films.

In [None]:
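def get_films_by_idx(df, n_movies):
  # Take the movieIds of the first n_movies rows of df (by position)
  # and look them up in the global movies dataframe
  movie_idx = df.iloc[:n_movies].index
  return movies.loc[movie_idx]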
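
The helper above reads the global `movies` dataframe. As a minimal sketch of the same lookup pattern on toy data, the variant below takes the movies table as an explicit argument. The names `top_n_movies`, `toy_movies` and `toy_ranked` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def top_n_movies(ranked, movies, n_movies=10):
    """Return the rows of `movies` for the first `n_movies` ids in `ranked`'s index."""
    movie_idx = ranked.index[:n_movies]  # positional slice of the ranked ids
    return movies.loc[movie_idx]         # label-based lookup, keeps the ranking order

# Toy catalogue (hypothetical titles), indexed by movieId
toy_movies = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Toy ranking: movie 3 first, then movie 1, then movie 2
toy_ranked = pd.DataFrame(
    {"#ratings": [9, 5, 2]},
    index=pd.Index([3, 1, 2], name="movieId"),
)

top_two = top_n_movies(toy_ranked, toy_movies, n_movies=2)
print(top_two)
```

Passing the table explicitly makes the function easier to test on small inputs, at the cost of one extra argument at each call.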

## Most popular

Let's sort by number of ratings in descending order and take the $10$ most rated movies (the preview below shows only the first $5$ rows),

In [None]:
most_populars = num_ratings.sort_values(by=['#ratings'], ascending=False)
most_populars.head()

movieId,#ratings
356,329
318,317
296,307
593,279
2571,278


which are the following:

In [None]:
get_films_by_idx(most_populars, n_movies = 10)

movieId,title,genres
356,Forrest Gump (1994),Comedy|Drama|Romance|War
318,"Shawshank Redemption, The (1994)",Crime|Drama
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
110,Braveheart (1995),Action|Drama|War
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi
527,Schindler's List (1993),Drama|War


## Highest rated

After computing the average rating for each film,

In [None]:
avg_ratings = ratings[['movieId',  'rating']].groupby(['movieId']).mean()
avg_ratings = avg_ratings.rename(columns={"rating": "avgRating"})
avg_ratings

movieId,avgRating
1,3.920930
2,3.431818
3,3.259615
4,2.357143
5,3.071429
...,...
193581,4.000000
193583,3.500000
193585,3.500000
193587,3.500000


I apply a discount factor equal to the ratio between the number of ratings of that film and the total number of ratings. This prevents a movie with a very high average rating but few ratings from being preferred over a film with a lower average but many more ratings.
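
To make the effect of the discount explicit, here is a short derivation. The symbols are introduced only for illustration and do not appear in the code: $r_{u,i}$ is the rating given by user $u$ to movie $i$, $U_i$ is the set of users who rated $i$, $n_i = |U_i|$ is the number of ratings of movie $i$, $N$ is the total number of ratings, and $\bar{r}_i$ is the average rating of movie $i$.

$$
s_i = \bar{r}_i \cdot \frac{n_i}{N}
    = \left( \frac{1}{n_i} \sum_{u \in U_i} r_{u,i} \right) \frac{n_i}{N}
    = \frac{1}{N} \sum_{u \in U_i} r_{u,i}
$$

So ranking by the discounted score $s_i$ is the same as ranking by the sum of the ratings a movie received. For instance, with the values shown above, movie $1$ ($215$ ratings, average $3.920930$) gets $s_1 \approx 843 / N$, while movie $193581$ (a single rating of $4.0$) gets $4 / N$, even though its plain average is higher.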

Then I sort the movies by the discounted score in descending order and take the first $10$ of them,

In [None]:
tot_ratings = ratings.shape[0]

# Discount each average by the movie's share of all ratings (aligned on movieId).
# Note: from here on the avgRating column holds the discounted score.
avg_ratings['avgRating'] = (
  avg_ratings['avgRating'] * (num_ratings['#ratings'] / tot_ratings)
)
highest_ratings = avg_ratings.sort_values(by=['avgRating'], ascending=False)
highest_ratings

movieId,avgRating
318,0.013924
356,0.013586
296,0.012778
2571,0.011558
593,0.011514
...,...
160872,0.000005
8236,0.000005
57326,0.000005
82684,0.000005


which are the following:

In [None]:
get_films_by_idx(highest_ratings, n_movies = 10)

movieId,title,genres
318,"Shawshank Redemption, The (1994)",Crime|Drama
356,Forrest Gump (1994),Comedy|Drama|Romance|War
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
110,Braveheart (1995),Action|Drama|War
2959,Fight Club (1999),Action|Crime|Drama|Thriller
527,Schindler's List (1993),Drama|War
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
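
As a small self-contained check of how the discount changes the ranking, the sketch below repeats the three orderings (by number of ratings, by plain average, by discounted average) on toy data. The ratings and the names `toy_ratings` and `stats` are made up for illustration and are not part of MovieLens or of the cells above.

```python
import pandas as pd

# Toy ratings (hypothetical): movie 30 has a single 5-star rating
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 1],
    "movieId": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 4.5, 3.5, 3.0, 4.0, 5.0],
})

# Number of ratings and plain average per movie
stats = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

# Discounted score: average times the movie's share of all ratings
stats["discounted"] = stats["mean"] * stats["count"] / len(toy_ratings)

by_popularity = stats.sort_values("count", ascending=False).index.tolist()
by_plain_mean = stats.sort_values("mean", ascending=False).index.tolist()
by_discounted = stats.sort_values("discounted", ascending=False).index.tolist()

print(stats)
print("most popular:      ", by_popularity)
print("plain average:     ", by_plain_mean)
print("discounted average:", by_discounted)
```

On this toy data the plain average puts movie 30 first thanks to its single rating, while the discounted score ranks it last and agrees with the popularity ordering, which is consistent with the large overlap between the two top-$10$ lists above.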
