In [1]:
!wget "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
!unzip ml-100k.zip
!ls

--2020-05-21 08:27:26--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2020-05-21 08:27:28 (5.71 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

In this analysis, we using [MovieLens](https://grouplens.org/datasets/movielens/100k/) dataset provided by GroupLens. It conmprise 100,000 ratings (1-5) from 943 users on 1682 movies.

# **Read and preprocess data**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import math

  import pandas.util.testing as tm


In [3]:
item_column_name = "movieId,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western"
item = pd.read_csv("ml-100k/u.item",sep='|',names=item_column_name.split(","),encoding='latin-1')
item

Unnamed: 0,movieId,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
col_to_removed = ['movie_title', 'release_date', 'video_release_date', 'IMDb_URL']
clear_item = item.drop(col_to_removed, axis=1).set_index('movieId')
clear_item

Unnamed: 0_level_0,unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1679,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1680,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1681,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
# the number of the kinds of topics in the whole recommendation system.
N = len(clear_item.columns)
N

19

# **Total genres of all the movies**

Let say we want to get the details of each movie's genre, we interest to know number of genres that relate to each of the movie. Below is the very first and rough idea come to my mind for get the number. It has other ways or maybe better approach to get desired result. Here we are use `sum` function to sum up the value 1 across the genre columns.

In [6]:
# convert index movieId as list
genre_num = clear_item.sum(axis = 1)
genre_num

movieId
1       3
2       3
3       1
4       3
5       3
       ..
1678    1
1679    2
1680    2
1681    1
1682    1
Length: 1682, dtype: int64

Next, we total up the number of movie for each genre.

In [7]:
num_by_genre = clear_item.sum()
df_num_of_genre = num_by_genre.to_frame()
df_num_of_genre.reset_index(inplace=True)
df_num_of_genre.columns = ['genre', 'total_number']
df_num_of_genre

Unnamed: 0,genre,total_number
0,unknown,2
1,Action,251
2,Adventure,135
3,Animation,42
4,Children,122
5,Comedy,505
6,Crime,109
7,Documentary,50
8,Drama,725
9,Fantasy,22


In [8]:
movies = genre_num.index.to_list()
len(movies)

1682

In [9]:
sum(df_num_of_genre['total_number'])

2893

# **Simple generate random result**

As a part of my recommendation research, I have to predict and recommend the movies for each test user. Since that is another complex part which is not a focus in this noebook, therefore, we just simple generate a prediction of movie list randomly for random 10 users. We use `random` function from Python and using `sample` function to make sure no repeating items in prediction list.

In [10]:
from random import randint

# random generate unique movieId lists
random_movieId = []
random_userId = random.sample(range(1,200), 10)

for i in range(10):
  random_movieId.append(random.sample(movies, 10))

x = {'userId' : random_userId, 'movieId' : random_movieId}

random_prediction = pd.DataFrame(x)
random_prediction.set_index('userId', inplace=True)
random_prediction

Unnamed: 0_level_0,movieId
userId,Unnamed: 1_level_1
8,"[1310, 377, 885, 147, 1400, 517, 1424, 1102, 7..."
77,"[1370, 1279, 1406, 1601, 1452, 1600, 1036, 575..."
5,"[1429, 997, 1592, 483, 1145, 1661, 1503, 463, ..."
178,"[862, 1351, 328, 936, 924, 760, 1259, 1221, 98..."
134,"[1179, 1169, 383, 707, 127, 671, 270, 23, 1335..."
98,"[1676, 80, 1152, 823, 27, 1083, 839, 591, 1318..."
176,"[1198, 1392, 17, 826, 1603, 626, 1264, 459, 12..."
180,"[145, 387, 453, 636, 1466, 894, 1054, 805, 50,..."
27,"[209, 1362, 771, 709, 298, 594, 1054, 1311, 19..."
97,"[731, 1379, 1087, 1351, 110, 824, 1313, 1306, ..."


# **Build function for genre calculation**

We can pre-define the function for easy to use it whenever needed.

In [0]:
# this function is to check how many genre (topics) included in the movie item
def check_genre_num(movieid):
  n_genre = genre_num.loc[movieid]
  return n_genre

# this function is to get the list of genre that included in the movieId
def check_genre_list(movieid):
  movie_genres = genre_list_by_movieid['Genres'].loc[movieid]
  return movie_genres

# **Get the genre list for each movie**

In [12]:
# group table by index and creates a dict with lists of clear_item as values
df_dict = dict(list(clear_item.groupby(clear_item.index)))

# Gather all the genres that related to all movies respectively
movieid = []
genre_list = []

for u, v in df_dict.items():
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
      movieid.append(u)
      genre_list.append(check.to_list())

d = {'movieId' : movieid, 'Genres' : genre_list}

# compile in DataFrame
genre_list_by_movieid = pd.DataFrame(d)
genre_list_by_movieid.set_index('movieId', inplace=True)
genre_list_by_movieid

Unnamed: 0_level_0,Genres
movieId,Unnamed: 1_level_1
1,"[Animation, Children, Comedy]"
2,"[Action, Adventure, Thriller]"
3,[Thriller]
4,"[Action, Comedy, Drama]"
5,"[Crime, Drama, Thriller]"
...,...
1678,[Drama]
1679,"[Romance, Thriller]"
1680,"[Drama, Romance]"
1681,[Comedy]


In [13]:
# This cell is just for checking if the process above is correct.
# It should generate same result for every execution.

# Enter any movieId from 1 to 1682
# if input movieId 5, the output should be 3
check_genre_num(5)

3

In [69]:
# This cell is check if the matching is correct
# Enter any movieId from 1 to 1682 based on the random prediction output movie list
# if enter '1680', it should output ['Drama', 'Romance']
check_genre_list(1680)

['Drama', 'Romance']

Now we can collect all the genre for each movie in result list. Note that the list output is based on all movies in the random prediction, it is not group according to userId.

In [14]:
# Collect all genres that related to each movie
genres = []
genre_per_list = []

for user in random_prediction.index:
  movies = random_prediction['movieId'].loc[user]
  for i in movies:
    genres.append(check_genre_list(i))

genres

[['Drama'],
 ['Children', 'Comedy'],
 ['Horror'],
 ['Action', 'Thriller'],
 ['Drama', 'Romance'],
 ['Comedy', 'Drama', 'Romance'],
 ['Comedy', 'Drama', 'Romance'],
 ['Comedy', 'Romance'],
 ['Horror', 'Thriller'],
 ['Action', 'Comedy'],
 ['Drama', 'Thriller'],
 ['Adventure', 'Children'],
 ['Drama', 'Romance'],
 ['Thriller'],
 ['Comedy', 'Mystery'],
 ['Comedy'],
 ['Comedy', 'Fantasy'],
 ['Comedy', 'Western'],
 ['Drama', 'Romance'],
 ['Sci-Fi', 'Thriller'],
 ['Drama', 'Romance'],
 ['Comedy'],
 ['Drama'],
 ['Drama', 'Romance', 'War'],
 ['Drama'],
 ['Drama'],
 ['Adventure', 'Children'],
 ['Adventure'],
 ['Drama'],
 ['Comedy'],
 ['Adventure', 'Children', 'Comedy'],
 ['Comedy'],
 ['Action', 'Mystery', 'Romance', 'Thriller'],
 ['Comedy', 'Drama', 'Romance'],
 ['Adventure', 'Drama'],
 ['Sci-Fi'],
 ['Comedy', 'Romance'],
 ['Drama'],
 ['Animation', 'Children', 'Musical'],
 ['Action'],
 ['Comedy'],
 ['Drama'],
 ['Children', 'Comedy'],
 ['Drama'],
 ['Action', 'Crime', 'Drama'],
 ['Horror'],
 ['Dram

After get all the genre for each movie in random prediction, we can rearrange the list to match with every 10 movies for each userId.

In [15]:
# Arrange the list to bind with the random prediction based on userId
genre_per_list = [genres[x:x+10] for x in range(0, len(genres),10)]
genre_per_list

[[['Drama'],
  ['Children', 'Comedy'],
  ['Horror'],
  ['Action', 'Thriller'],
  ['Drama', 'Romance'],
  ['Comedy', 'Drama', 'Romance'],
  ['Comedy', 'Drama', 'Romance'],
  ['Comedy', 'Romance'],
  ['Horror', 'Thriller'],
  ['Action', 'Comedy']],
 [['Drama', 'Thriller'],
  ['Adventure', 'Children'],
  ['Drama', 'Romance'],
  ['Thriller'],
  ['Comedy', 'Mystery'],
  ['Comedy'],
  ['Comedy', 'Fantasy'],
  ['Comedy', 'Western'],
  ['Drama', 'Romance'],
  ['Sci-Fi', 'Thriller']],
 [['Drama', 'Romance'],
  ['Comedy'],
  ['Drama'],
  ['Drama', 'Romance', 'War'],
  ['Drama'],
  ['Drama'],
  ['Adventure', 'Children'],
  ['Adventure'],
  ['Drama'],
  ['Comedy']],
 [['Adventure', 'Children', 'Comedy'],
  ['Comedy'],
  ['Action', 'Mystery', 'Romance', 'Thriller'],
  ['Comedy', 'Drama', 'Romance'],
  ['Adventure', 'Drama'],
  ['Sci-Fi'],
  ['Comedy', 'Romance'],
  ['Drama'],
  ['Animation', 'Children', 'Musical'],
  ['Action']],
 [['Comedy'],
  ['Drama'],
  ['Children', 'Comedy'],
  ['Drama'],
  [

In [16]:
# Add the list into random prediction column
random_prediction['genres'] = genre_per_list
random_prediction

Unnamed: 0_level_0,movieId,genres
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
8,"[1310, 377, 885, 147, 1400, 517, 1424, 1102, 7...","[[Drama], [Children, Comedy], [Horror], [Actio..."
77,"[1370, 1279, 1406, 1601, 1452, 1600, 1036, 575...","[[Drama, Thriller], [Adventure, Children], [Dr..."
5,"[1429, 997, 1592, 483, 1145, 1661, 1503, 463, ...","[[Drama, Romance], [Comedy], [Drama], [Drama, ..."
178,"[862, 1351, 328, 936, 924, 760, 1259, 1221, 98...","[[Adventure, Children, Comedy], [Comedy], [Act..."
134,"[1179, 1169, 383, 707, 127, 671, 270, 23, 1335...","[[Comedy], [Drama], [Children, Comedy], [Drama..."
98,"[1676, 80, 1152, 823, 27, 1083, 839, 591, 1318...","[[Drama], [Action, Comedy, War], [Romance, War..."
176,"[1198, 1392, 17, 826, 1603, 626, 1264, 459, 12...","[[Crime, Thriller], [Drama], [Action, Comedy, ..."
180,"[145, 387, 453, 636, 1466, 894, 1054, 805, 50,...","[[Action, Sci-Fi, Thriller], [Drama], [Action,..."
27,"[209, 1362, 771, 709, 298, 594, 1054, 1311, 19...","[[Comedy, Drama, Musical], [Action], [Action, ..."
97,"[731, 1379, 1087, 1351, 110, 824, 1313, 1306, ...","[[Comedy, Drama, Romance], [Romance], [Action]..."


# **Compute diversity**

For the research, one of the paper we referred is using the following equations to compute diversity of the recommendation item list.

![alt text](https://live.staticflickr.com/65535/49911126126_6d056e799d_b.jpg)


* Lu is recommendation list for user u.

* txi is the number of topics included in item xi.

* zlu is the number of total topics in recommendation list Lu.

* Div(Lu) is elaborate as follows:

![alt text](https://live.staticflickr.com/65535/49912986207_4b64453f05_m.jpg)

* S_Lu is is the number of different topics in recommendation
list Lu.

* H(Lu) denotes the topic distribution of recommendation list Lu:

![alt text](https://live.staticflickr.com/65535/49913042447_d6d0a97bed_m.jpg)

* Nt is the number of the kinds of topics in the whole recommendation system
* qj is the probability of the occurrence of topic j in recommendation list Lu.

The probability of occurrence of topic j is calculated as below:

![alt text](https://live.staticflickr.com/65535/49912232138_110f1383df_m.jpg)

* S_Luj is the number of topic j in the set S_Lu


It might be confusing for both terms "topics" and "genre". Assuming they both are same referring to genre.



In [0]:
# get the movie list per user and return the total genres for that list, z_Lu
def total_genre_per_list(user_movies_list):
  
  total_num_list = []
  
  for n in user_movies_list:
      genre = check_genre_list(n)
      for g in genre:
        total_num_list.append(g)
  
  return total_num_list, len(total_num_list)

In [0]:
# get the movie list per user and return the unique genres for that list, S_Lu
def get_unique_genre_list_by_user(user_movies_list):

  total_list = []
  
  for h in user_movies_list:
    g = check_genre_list(h)
    for k in g:
      total_list.append(k)
  
  unique_list = list(set(total_list))

  return unique_list, len(unique_list)

In [0]:
def H_lu(lu):

  H_lu = 0
  # need to find how many times that topic j is appear in Z_Lu list
  for j in slu:
    count_appear = zlu.count(j)
    qj = count_appear/len(zlu)
    H_lu += -(qj * math.log10(qj))

  return H_lu


def Div_lu(Lu):
  sl , SLu = get_unique_genre_list_by_user(Lu)
  Nt = N
  hlu = H_lu(Lu)
  Div_lu = SLu / Nt * hlu

  return Div_lu

In [75]:
Div_lu(test_list)

0.6350528553754095

In [0]:
# get the userId list in prediction result
def diversity_per_user(user):
  sum_part = 0
  movie_list = random_prediction['movieId'].loc[user]

  # total number of topics in recommendation list per user
  zlu, n_zlu = total_genre_per_list(movie_list)

  for item in movie_list:
    txi = check_genre_num(item)
    value1 = txi/n_zlu
    value2 = math.log10(value1)
    sum_part += value1*value2

  diversity_lu = -(sum_part) * Div_lu(movie_list)

  return diversity_lu

# **Testing**

We have complete building the function, then we can test the diverisity calculation by enter the userId into `diversity_per_user()`

In [77]:
diversity_per_user(180)

0.5967663737119576

Compute the diversity of every userId in prediction result.

In [0]:
diversity = []
for u in random_prediction.index:
  diversity.append(diversity_per_user(u))

In [79]:
diversity

[0.36202930991758225,
 0.5226962806815196,
 0.30415195945912665,
 0.5505346119379276,
 0.40465877155812735,
 0.5131475224783436,
 0.4892114256947269,
 0.5967663737119576,
 0.4629688853580586,
 0.48708131167314356]

In [80]:
random_prediction['diversity'] = diversity
random_prediction

Unnamed: 0_level_0,movieId,genres,diversity
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8,"[1310, 377, 885, 147, 1400, 517, 1424, 1102, 7...","[[Drama], [Children, Comedy], [Horror], [Actio...",0.362029
77,"[1370, 1279, 1406, 1601, 1452, 1600, 1036, 575...","[[Drama, Thriller], [Adventure, Children], [Dr...",0.522696
5,"[1429, 997, 1592, 483, 1145, 1661, 1503, 463, ...","[[Drama, Romance], [Comedy], [Drama], [Drama, ...",0.304152
178,"[862, 1351, 328, 936, 924, 760, 1259, 1221, 98...","[[Adventure, Children, Comedy], [Comedy], [Act...",0.550535
134,"[1179, 1169, 383, 707, 127, 671, 270, 23, 1335...","[[Comedy], [Drama], [Children, Comedy], [Drama...",0.404659
98,"[1676, 80, 1152, 823, 27, 1083, 839, 591, 1318...","[[Drama], [Action, Comedy, War], [Romance, War...",0.513148
176,"[1198, 1392, 17, 826, 1603, 626, 1264, 459, 12...","[[Crime, Thriller], [Drama], [Action, Comedy, ...",0.489211
180,"[145, 387, 453, 636, 1466, 894, 1054, 805, 50,...","[[Action, Sci-Fi, Thriller], [Drama], [Action,...",0.596766
27,"[209, 1362, 771, 709, 298, 594, 1054, 1311, 19...","[[Comedy, Drama, Musical], [Action], [Action, ...",0.462969
97,"[731, 1379, 1087, 1351, 110, 824, 1313, 1306, ...","[[Comedy, Drama, Romance], [Romance], [Action]...",0.487081
