#### Abstract

The last two classes covered Bayesian ranking, which estimates the rating of an item from the reviews collected for it. In this project we apply that technique to list the top-10 movies in several categories.

The Bayesian rating formula from class is $\text{bayesian\_rating} = \frac{C \cdot m + \sum \text{ratings}}{C + N}$, where $N$ is the number of ratings, $m$ is a prior for the average rating, and $C$ is a prior for the number of ratings. In this project we set $C = 15$ and $m = 2$. We use this formula to compute a Bayesian rating for every movie, rank the movies by it, and extract several top-10 lists. To process the data we use the Python module **pandas**.
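As a quick check of the formula, the sketch below evaluates it for two hypothetical movies. The function name `bayesian_rating` is an illustrative assumption, and the rating sum of 10143 is inferred from a count of 2227 and a mean of about 4.5546, so treat it as approximate.

```python
def bayesian_rating(rating_sum, n, C=15, m=2):
    """Bayesian rating: (C*m + sum of ratings) / (C + N)."""
    return (C * m + rating_sum) / (C + n)

# a movie with many ratings is barely pulled toward the prior mean m
popular = bayesian_rating(rating_sum=10143, n=2227)

# a movie with a single 5-star rating is pulled strongly toward m
obscure = bayesian_rating(rating_sum=5, n=1)

print(round(popular, 4), round(obscure, 4))
```

With $C = 15$ and $m = 2$, the single 5-star movie scores $35/16 = 2.1875$, far below its mean of 5, while the heavily rated movie stays close to its mean.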

#### Implementation

- Reading input

In [1]:
import pandas as pd
from IPython.display import display, HTML

# set pandas print option to print full text of movies' names
# pd.set_option('display.max_colwidth',200)

# read input using pandas and set columns' names
ratings = pd.read_table('./ml-1m/ratings.dat', header=None, names=['UserID', 'MovieID','Rating','Timestamp'], sep="::", engine='python',encoding='latin-1')
movies = pd.read_table('./ml-1m/movies.dat', header=None, names=['MovieID','Title','Genres'], sep="::", engine='python',encoding='latin-1')
users = pd.read_table('./ml-1m/users.dat', header=None, names=['UserID','Gender','Age','Occupation','Zip-code'], sep="::", engine='python',encoding='latin-1')

In [2]:
# set prior distribution parameters
C, m = 15, 2

# group movies by movieid
grouped_ratings = ratings.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>General top-10</h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,318,"Shawshank Redemption, The (1994)",2227,4.537467,4.554558
2,858,"Godfather, The (1972)",2223,4.508043,4.524966
3,2019,Seven Samurai (The Magnificent Seven) (Shichin...,628,4.500778,4.56051
4,50,"Usual Suspects, The (1995)",1783,4.496107,4.517106
5,527,Schindler's List (1993),2304,4.494179,4.510417
6,1148,"Wrong Trousers, The (1993)",882,4.465998,4.507937
7,745,"Close Shave, A (1995)",657,4.464286,4.520548
8,1198,Raiders of the Lost Ark (1981),2514,4.463029,4.477725
9,260,Star Wars: Episode IV - A New Hope (1977),2991,4.44145,4.453694
10,904,Rear Window (1954),1050,4.441315,4.47619
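The computation in the cell above could also be wrapped in a reusable function built on a vectorized `groupby` aggregation. This is a minimal sketch, not the notebook's own method: the function name `bayesian_top_n` and the toy data are assumptions made for illustration, and the column names follow the notebook.

```python
import pandas as pd

def bayesian_top_n(ratings, movies, C=15, m=2, n=10):
    """Return the n movies with the highest Bayesian rating."""
    # count and sum of ratings per movie in a single aggregation
    stats = ratings.groupby('MovieID')['Rating'].agg(Count='count', Sum='sum')
    stats['MeanRating'] = stats['Sum'] / stats['Count']
    stats['BayesianRanking'] = (C * m + stats['Sum']) / (C + stats['Count'])
    top = stats.sort_values('BayesianRanking', ascending=False).head(n).reset_index()
    # a left merge keeps the sorted order of the ranking
    top = top.merge(movies[['MovieID', 'Title']], on='MovieID', how='left')
    top.index = top.index + 1
    return top[['MovieID', 'Title', 'Count', 'BayesianRanking', 'MeanRating']]

# toy data (hypothetical): movie 2 has the higher mean but only one rating
toy_ratings = pd.DataFrame({'MovieID': [1, 1, 1, 2], 'Rating': [5, 4, 5, 5]})
toy_movies = pd.DataFrame({'MovieID': [1, 2], 'Title': ['A', 'B'],
                           'Genres': ['Drama', 'Action']})
toy_top = bayesian_top_n(toy_ratings, toy_movies)
print(toy_top)
```

In the toy data, movie 1 ranks first because its three ratings outweigh the single 5-star rating of movie 2. With the real data, a call such as `bayesian_top_n(ratings_males, movies)` would cover the per-group rankings without repeating the cell.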


In [3]:
ratings_males = ratings[ratings['UserID'].isin(users[users['Gender']=='M']['UserID'])]

# group movies by movieid
grouped_ratings = ratings_males.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>Top-10 among males</h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,858,"Godfather, The (1972)",1740,4.561254,4.583333
2,318,"Shawshank Redemption, The (1994)",1600,4.536842,4.560625
3,2019,Seven Samurai (The Magnificent Seven) (Shichin...,522,4.504655,4.576628
4,1198,Raiders of the Lost Ark (1981),1942,4.501277,4.520597
5,50,"Usual Suspects, The (1995)",1370,4.490975,4.518248
6,260,Star Wars: Episode IV - A New Hope (1977),2344,4.47944,4.495307
7,527,Schindler's List (1993),1689,4.469484,4.491415
8,750,Dr. Strangelove or: How I Learned to Stop Worr...,1136,4.432667,4.464789
9,912,Casablanca (1942),1164,4.430025,4.46134
10,904,Rear Window (1954),759,4.425065,4.472991


In [4]:
ratings_females = ratings[ratings['UserID'].isin(users[users['Gender']=='F']['UserID'])]

# group movies by movieid
grouped_ratings = ratings_females.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>Top-10 among females</h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,527,Schindler's List (1993),615,4.501587,4.562602
2,318,"Shawshank Redemption, The (1994)",627,4.479751,4.539075
3,745,"Close Shave, A (1995)",180,4.441026,4.644444
4,1148,"Wrong Trousers, The (1993)",238,4.434783,4.588235
5,50,"Usual Suspects, The (1995)",413,4.425234,4.513317
6,2762,"Sixth Sense, The (1999)",664,4.42268,4.47741
7,1207,To Kill a Mockingbird (1962),300,4.415873,4.536667
8,904,Rear Window (1954),291,4.362745,4.484536
9,2324,Life Is Beautiful (La Vita è bella) (1997),367,4.327225,4.422343
10,910,Some Like It Hot (1959),255,4.325926,4.462745


In [5]:
# group movies by movieid
grouped_ratings = ratings.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
# select romance movie
bayes = bayes[bayes['MovieID'].isin(movies[movies['Genres'].str.contains('Romance')]['MovieID'])]
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>Top-10 of category Romance</h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,912,Casablanca (1942),1669,4.39133,4.412822
2,1197,"Princess Bride, The (1987)",2318,4.288898,4.30371
3,3307,City Lights (1931),271,4.262238,4.387454
4,898,"Philadelphia Story, The (1940)",582,4.242881,4.300687
5,899,Singin' in the Rain (1952),751,4.238903,4.283622
6,1172,Cinema Paradiso (1988),615,4.233333,4.287805
7,969,"African Queen, The (1951)",1057,4.220149,4.251656
8,930,Notorious (1946),445,4.219565,4.294382
9,1247,"Graduate, The (1967)",1261,4.219436,4.245837
10,2692,Run Lola Run (Lola rennt) (1998),1072,4.194112,4.224813


In [6]:
# group movies by movieid
grouped_ratings = ratings.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
# select action movie
bayes = bayes[bayes['MovieID'].isin(movies[movies['Genres'].str.contains('Action')]['MovieID'])]
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>Top-10 of category Action</h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,858,"Godfather, The (1972)",2223,4.508043,4.524966
2,2019,Seven Samurai (The Magnificent Seven) (Shichin...,628,4.500778,4.56051
3,1198,Raiders of the Lost Ark (1981),2514,4.463029,4.477725
4,260,Star Wars: Episode IV - A New Hope (1977),2991,4.44145,4.453694
5,1221,"Godfather: Part II, The (1974)",1692,4.336848,4.357565
6,2028,Saving Private Ryan (1998),2653,4.324213,4.337354
7,2571,"Matrix, The (1999)",2590,4.302495,4.31583
8,1197,"Princess Bride, The (1987)",2318,4.288898,4.30371
9,1196,Star Wars: Episode V - The Empire Strikes Back...,2990,4.281531,4.292977
10,1233,"Boat, The (Das Boot) (1981)",1001,4.268701,4.302697


#### Other top-10s (optional)
As we can see, selecting a particular group of users or movies is easy with pandas; most of the time it takes only one extra line of code:
```python 
ratings_males = ratings[ratings['UserID'].isin(users[users['UserFeature']=='Something']['UserID'])]
bayes = bayes[bayes['MovieID'].isin(movies[movies['MovieFeature'].str.contains('Something')]['MovieID'])]
```
We can also filter on several features at once by applying logical operations to the user or movie selections. Here is an example.


In [7]:
# select specific type of users
ratings_specific = ratings[ratings['UserID'].isin(users[(users['Gender']=='M') & (users['Age']==18) & (users['Occupation']==4)]['UserID'])]

# group movies by movieid
grouped_ratings = ratings_specific.groupby(['MovieID'])

# calculate Bayesian ratings and sort them in descending order
bayes = pd.DataFrame({'MovieID': list(grouped_ratings.groups.keys()),
                      'BayesianRanking': list(map((lambda count,mean: (C*m + count*mean)/(C + count)), grouped_ratings['Rating'].count(),grouped_ratings['Rating'].mean())),
                      'MeanRating':grouped_ratings['Rating'].mean().tolist(),
                      'Count':grouped_ratings['Rating'].count().tolist()
                      },
                      columns = ['MovieID','BayesianRanking','MeanRating','Count'])
# select specific type of movies
bayes = bayes[bayes['MovieID'].isin(movies[(movies['Genres'].str.contains('Crime')) | (movies['Genres'].str.contains('War'))]['MovieID'])]
bayes = bayes.sort_values(by='BayesianRanking', ascending=False)

# extract top-10 movies from movies' list
top10_movies = pd.merge(bayes[0:10],movies, on='MovieID')

# print the outcome
top10_movies.index = top10_movies.index + 1
display(HTML("<h3>Top-10 of crime/war among 18-24 male college/grad students </h3>"))
display(top10_movies[['MovieID', 'Title','Count','BayesianRanking','MeanRating']])

Rank,MovieID,Title,Count,BayesianRanking,MeanRating
1,50,"Usual Suspects, The (1995)",149,4.45122,4.697987
2,858,"Godfather, The (1972)",148,4.355828,4.594595
3,296,Pulp Fiction (1994),184,4.321608,4.51087
4,1196,Star Wars: Episode V - The Empire Strikes Back...,218,4.296137,4.454128
5,110,Braveheart (1995),209,4.290179,4.454545
6,527,Schindler's List (1993),163,4.280899,4.490798
7,2028,Saving Private Ryan (1998),214,4.270742,4.429907
8,1213,GoodFellas (1990),124,4.223022,4.491935
9,1221,"Godfather: Part II, The (1974)",103,4.177966,4.495146
10,750,Dr. Strangelove or: How I Learned to Stop Worr...,84,4.171717,4.559524
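The genre filters above rely on substring matching with `str.contains`. Since the `Genres` column is pipe-separated, an alternative is to split it and match whole genre names. The sketch below shows that idea on a small hypothetical table; the names `toy_movies`, `wanted` and `selected_ids` are assumptions for illustration.

```python
import pandas as pd

# hypothetical movies table with the same columns as movies.dat
toy_movies = pd.DataFrame({
    'MovieID': [1, 2, 3],
    'Title': ['A', 'B', 'C'],
    'Genres': ['Action|Crime', 'Drama|War', 'Comedy'],
})

wanted = {'Crime', 'War'}

# turn "Action|Crime" into {"Action", "Crime"} for exact genre matching
genre_sets = toy_movies['Genres'].str.split('|').apply(set)

# keep movies that share at least one genre with the wanted set
mask = genre_sets.apply(lambda genres: bool(genres & wanted))
selected_ids = toy_movies.loc[mask, 'MovieID'].tolist()
print(selected_ids)
```

The resulting ID list can be passed to `isin` in the same way as in the cell above.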


#### Analysis and Conclusion

Bayesian ranking takes the number of ratings into account, which the mean rating does not. As a result, the ordering differs to some degree from a ranking by mean rating, and it also changes when we vary the prior parameters $C$ and $m$. This project gives some insight into how large websites rank items by rating, and we also noted other potentially helpful signals, such as the recency or quality of reviews. When shopping online in the future, we can better understand how items are ranked, including that a ranking may be useful or may be manipulated, which helps us make appropriate decisions.
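To illustrate how the prior parameters affect the ordering, the sketch below compares two movies using the `MeanRating` and `Count` values copied from the top-10 among females output. The function name `bayesian_rating` and the choice of `C` values are assumptions for illustration, and $m$ is kept at 2.

```python
def bayesian_rating(mean, n, C, m=2):
    """Bayesian rating from a mean and a count: (C*m + n*mean) / (C + n)."""
    return (C * m + n * mean) / (C + n)

# (MeanRating, Count) copied from the female top-10 output
candidates = {
    "Schindler's List (1993)": (4.562602, 615),
    "Close Shave, A (1995)": (4.644444, 180),
}

winners = {}
for C in (0, 15, 100):
    scores = {title: bayesian_rating(mean, n, C)
              for title, (mean, n) in candidates.items()}
    # the movie with the highest Bayesian rating for this prior strength
    winners[C] = max(scores, key=scores.get)
    print(C, winners[C])
```

With $C = 0$ the formula reduces to the plain mean, so the movie with fewer ratings and the higher mean comes first. With $C = 15$ or $C = 100$ the movie with more ratings overtakes it.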