# Bayesian Movie Ranking

In movie ranking systems, the raw data are usually stored as matrices or vectors. To sort the rankings properly, however, we need to summarize this multi-dimensional information in a single scalar.

In this project, we calculate the expected value of the average rating given the observations $O$, where $p_i$ is the probability that a rating equals $i$ stars:
$$ E \left[ p_1+2p_2+3p_3+4p_4+5p_5 \mid O \right] = \sum_{i=1}^5 iE\left[p_i \mid O\right] $$


The main idea behind the Bayesian approach to inference is:
> Treat all unknown quantities as random variables.

In the Bayesian approach, we treat the unknown probability $p$ as a random variable and give it a **prior distribution**, which reflects our uncertainty about the true value of $p$ before observing the data (coin tosses in the classic example, ratings here).

After the experiment is performed and the data are gathered, the prior distribution is updated using Bayes' rule; this yields the posterior distribution, which reflects our new beliefs about $p$.
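As a minimal sketch of this prior-to-posterior update, consider the coin-toss setting with a Beta prior on $p$. The prior parameters `a0`, `b0` and the toss counts are made-up illustration values, not taken from the project data.

```python
# Beta-Binomial update: with a Beta(a0, b0) prior and observed heads/tails,
# Bayes' rule gives a Beta(a0 + heads, b0 + tails) posterior.
a0, b0 = 2.0, 2.0        # hypothetical prior pseudo-counts
heads, tails = 7, 3      # hypothetical observed coin tosses

a_post = a0 + heads
b_post = b0 + tails

prior_mean = a0 / (a0 + b0)                    # belief before the data
mle = heads / (heads + tails)                  # plain observed frequency
posterior_mean = a_post / (a_post + b_post)    # belief after the data

print(prior_mean, mle, posterior_mean)
```

The posterior mean lies between the prior mean and the observed frequency, and moves toward the latter as more tosses are observed.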


We first import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

Now we load the data. The column names and occupation codes are entered manually, following the dataset's README.

In [2]:
data = {'movies': None, 'ratings': None, 'users': None}
dataheader = {
    'movies': ['MovieID', 'Title', 'Genres'],
    'ratings': ['UserID','MovieID','Rating','Timestamp'],
    'users': ['UserID','Gender','Age','Occupation','Zip-code']
}
occupations = {
 0:"other/not specified",
 1:"academic/educator",
 2:"artist",
 3:"clerical/admin",
 4:"college/grad student",
 5:"customer service",
 6:"doctor/health care",
 7:"executive/managerial",
 8:"farmer",
 9:"homemaker",
10:"K-12 student",
11:"lawyer",
12:"programmer",
13:"retired",
14:"sales/marketing",
15:"scientist",
16:"self-employed",
17:"technician/engineer",
18:"tradesman/craftsman",
19:"unemployed",
20:"writer"
}

for t in ['movies', 'ratings', 'users']:
    data[t] = pd.read_csv(
        filepath_or_buffer = '{}.dat'.format(t),
        sep = '::',
        header = None,
        names = dataheader[t],
        engine = 'python',
        encoding='latin-1'
    )
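To see what these `read_csv` options do, here is a small sketch that parses two made-up rows in the same `::`-delimited layout. The sample rows and the name `sample_movies` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same `::`-delimited layout as movies.dat
sample = (
    "1::Example Movie A (1999)::Comedy|Romance\n"
    "2::Example Movie B (2000)::Action\n"
)

sample_movies = pd.read_csv(
    io.StringIO(sample),
    sep='::',            # multi-character separator, handled by the Python engine
    header=None,         # the file has no header row
    names=['MovieID', 'Title', 'Genres'],
    engine='python',
)
print(sample_movies)
```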

We now reshape the data a little so that pandas can do more with it.

In [3]:
# Parse the raw string so that `Genres` column stores Python lists
data['movies']['Genres'] = data['movies']['Genres'].apply(lambda x: x.split('|'))

# Let Pandas understand categories
data['users']['Occupation'] = data['users']['Occupation'].apply(lambda x: occupations[x]).astype('category')
data['users']['Gender'] = data['users']['Gender'].astype('category')

# Let Pandas understand timestamps
data['ratings']['Timestamp'] = pd.to_datetime(data['ratings']['Timestamp'],unit='s')

We merge the tables into one big table for convenience; the extra storage this wastes is affordable in this case.

In [4]:
fulldata = pd.merge(pd.merge(data['ratings'], data['users']), data['movies'])
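The nested `pd.merge` call relies on the defaults: an inner join on the columns the two frames share. A toy sketch with the keys spelled out is shown here; the three small frames are made-up stand-ins for the real tables.

```python
import pandas as pd

# Made-up miniature versions of the three tables
ratings = pd.DataFrame({'UserID': [1, 1, 2],
                        'MovieID': [10, 20, 10],
                        'Rating': [5, 3, 4]})
users = pd.DataFrame({'UserID': [1, 2], 'Gender': ['F', 'M']})
movies = pd.DataFrame({'MovieID': [10, 20],
                       'Title': ['Example A', 'Example B']})

# Same kind of join as pd.merge(pd.merge(ratings, users), movies),
# with the join keys and join type written explicitly.
merged = (
    ratings
    .merge(users, on='UserID', how='inner')
    .merge(movies, on='MovieID', how='inner')
)
print(merged)
```

Each rating row picks up the columns of its user and its movie, so the merged table has one row per rating.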

In the slides, we have
* m = 3.25 & C = 50
* m = 2 & C = 6

In this project, we use a prior distribution with m = 2 & C = 15.

And we want Top 10 lists.

## Intra-Item

$$ \overline{{\rm rating}} = \frac{\sum_{i=1}^5{i\alpha_i^0} + \sum_{i=1}^5{iK_i}}{N+\sum_{i=1}^5{\alpha_i^0}} = \frac{C \cdot m + \sum{\rm ratings}}{C+N} $$

where $K_i$ is the number of observed ratings equal to $i$, so that $K_1 + \cdots + K_5 = N$, and $\alpha^0$ is the parameter of the Dirichlet prior, such that

$${\rm Pr} \left(p_1,p_2,p_3,p_4,p_5 \mid O\right) \propto \prod_{j=1}^5 p_j^{K_j+\alpha_j^0-1} $$
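The two forms of the average above can be checked numerically. In this sketch, `alpha0` is one hypothetical choice of Dirichlet parameters that corresponds to $m = 2$ and $C = 15$ (many other choices do too), and the counts `K` are made up.

```python
import numpy as np

stars = np.arange(1, 6)                          # rating values 1..5
alpha0 = np.array([7.0, 4.0, 2.0, 1.0, 1.0])     # hypothetical prior parameters
K = np.array([0, 1, 2, 5, 12])                   # hypothetical observed counts

# Posterior is Dirichlet(K + alpha0), so E[p_i | O] is its normalized parameter.
post_mean_p = (K + alpha0) / (K.sum() + alpha0.sum())
expected_rating = (stars * post_mean_p).sum()

# Shortcut form: C is the prior weight, m the prior mean rating.
C = alpha0.sum()
m = (stars * alpha0).sum() / C
shortcut = (C * m + (stars * K).sum()) / (C + K.sum())

print(C, m, expected_rating, shortcut)
```

Both expressions give the same number, which is why only $m$ and $C$ are needed in the code rather than the full vector $\alpha^0$.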

Here, when calculating the Bayesian mean of one item, we do not use information from other items.

This intra-item Bayesian average differs from the plain average in that it takes the number of ratings $N$ into account: with few ratings the score stays close to the prior mean $m$, and it approaches the plain average as $N$ grows.

We have observations, and we look for the probability distribution $f(p)$ of $p$ that gives rise to them. The Bayesian average is the expected rating under this $f(p)$, rather than simply the average of the raw observations.
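The effect of $N$ can be seen on toy data. The helper `bayes_table` and the two made-up titles are illustrative assumptions; the formula is the intra-item average with $m = 2$ and $C = 15$.

```python
import pandas as pd

def bayes_table(df, m, C):
    """Per-title count, Bayesian average and plain mean (hypothetical helper)."""
    stats = df.groupby('Title')['Rating'].agg(['count', 'sum'])
    out = pd.DataFrame({
        'count': stats['count'],
        'Bayes': (C * m + stats['sum']) / (C + stats['count']),
        'mean': stats['sum'] / stats['count'],
    })
    return out.sort_values('Bayes', ascending=False)

# One title with a single 5-star rating, one with forty 4-star ratings
toy = pd.DataFrame({
    'Title': ['Rarely Rated'] * 1 + ['Often Rated'] * 40,
    'Rating': [5] + [4] * 40,
})

table = bayes_table(toy, m=2, C=15)
print(table)
```

The single 5-star rating gives the highest plain mean, but its Bayesian average stays near the prior mean, so the often-rated title ranks first.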

In [5]:
m, C = 2, 15
top_x = 10

In [6]:
display(Markdown('### Overall Top {}'.format(top_x)))

grp_r = fulldata.groupby('Title').Rating

N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)

topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Overall Top 10

Title,count,Bayes,mean
"Shawshank Redemption, The (1994)",2227,4.537467,4.554558
"Godfather, The (1972)",2223,4.508043,4.524966
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),628,4.500778,4.56051
"Usual Suspects, The (1995)",1783,4.496107,4.517106
Schindler's List (1993),2304,4.494179,4.510417
"Wrong Trousers, The (1993)",882,4.465998,4.507937
"Close Shave, A (1995)",657,4.464286,4.520548
Raiders of the Lost Ark (1981),2514,4.463029,4.477725
Star Wars: Episode IV - A New Hope (1977),2991,4.44145,4.453694
Rear Window (1954),1050,4.441315,4.47619


In [7]:
display(Markdown('### Top {} ranked by Male'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'M'].groupby('Title').Rating

N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 ranked by Male

Title,count,Bayes,mean
"Godfather, The (1972)",1740,4.561254,4.583333
"Shawshank Redemption, The (1994)",1600,4.536842,4.560625
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),522,4.504655,4.576628
Raiders of the Lost Ark (1981),1942,4.501277,4.520597
"Usual Suspects, The (1995)",1370,4.490975,4.518248
Star Wars: Episode IV - A New Hope (1977),2344,4.47944,4.495307
Schindler's List (1993),1689,4.469484,4.491415
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),1136,4.432667,4.464789
Casablanca (1942),1164,4.430025,4.46134
Rear Window (1954),759,4.425065,4.472991


In [8]:
display(Markdown('### Top {} ranked by Female'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'F'].groupby('Title').Rating

N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 ranked by Female

Title,count,Bayes,mean
Schindler's List (1993),615,4.501587,4.562602
"Shawshank Redemption, The (1994)",627,4.479751,4.539075
"Close Shave, A (1995)",180,4.441026,4.644444
"Wrong Trousers, The (1993)",238,4.434783,4.588235
"Usual Suspects, The (1995)",413,4.425234,4.513317
"Sixth Sense, The (1999)",664,4.42268,4.47741
To Kill a Mockingbird (1962),300,4.415873,4.536667
Rear Window (1954),291,4.362745,4.484536
Life Is Beautiful (La Vita è bella) (1997),367,4.327225,4.422343
Some Like It Hot (1959),255,4.325926,4.462745


To have genre-wise top 10 rankings, we first write a helper function ``hasGenre`` to generate a filter boolean array.

In [9]:
def hasGenre(data_df, genre):
    """Return a list of booleans: True where `genre` is in the row's Genres list."""
    return [genre in genres for genres in data_df.Genres]
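An alternative sketch returns a boolean Series instead of a list, so the mask carries the frame's index and is aligned by label when used for filtering. The name `has_genre_mask` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def has_genre_mask(data_df, genre):
    """Boolean Series, True where `genre` is in the row's Genres list (hypothetical name)."""
    return data_df['Genres'].apply(lambda genres: genre in genres)

# Made-up rows with list-valued Genres, as produced by the split('|') step
toy = pd.DataFrame({
    'Title': ['Example A', 'Example B', 'Example C'],
    'Genres': [['Action', 'Sci-Fi'], ['Romance'], ['Action']],
})

mask = has_genre_mask(toy, 'Action')
print(toy[mask])
```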

In [10]:
dst_genre = 'Romance'
grp_r = fulldata[hasGenre(fulldata, dst_genre)].groupby('Title').Rating
display(Markdown('### Top {} in {}'.format(top_x, dst_genre)))

N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 in Romance

Title,count,Bayes,mean
Casablanca (1942),1669,4.39133,4.412822
"Princess Bride, The (1987)",2318,4.288898,4.30371
City Lights (1931),271,4.262238,4.387454
"Philadelphia Story, The (1940)",582,4.242881,4.300687
Singin' in the Rain (1952),751,4.238903,4.283622
Cinema Paradiso (1988),615,4.233333,4.287805
"African Queen, The (1951)",1057,4.220149,4.251656
Notorious (1946),445,4.219565,4.294382
"Graduate, The (1967)",1261,4.219436,4.245837
Run Lola Run (Lola rennt) (1998),1072,4.194112,4.224813


In [11]:
dst_genre = 'Action'
grp_r = fulldata[hasGenre(fulldata, dst_genre)].groupby('Title').Rating
display(Markdown('### Top {} in {}'.format(top_x, dst_genre)))

N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 in Action

Title,count,Bayes,mean
"Godfather, The (1972)",2223,4.508043,4.524966
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),628,4.500778,4.56051
Raiders of the Lost Ark (1981),2514,4.463029,4.477725
Star Wars: Episode IV - A New Hope (1977),2991,4.44145,4.453694
"Godfather: Part II, The (1974)",1692,4.336848,4.357565
Saving Private Ryan (1998),2653,4.324213,4.337354
"Matrix, The (1999)",2590,4.302495,4.31583
"Princess Bride, The (1987)",2318,4.288898,4.30371
Star Wars: Episode V - The Empire Strikes Back (1980),2990,4.281531,4.292977
"Boat, The (Das Boot) (1981)",1001,4.268701,4.302697


## Inter-Item (Optional Part)

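The inter-item Bayesian average is more heuristic.

$$\bar{m_i} = \frac{C_i m_i + \sum{\rm ratings}}{C_i+N}$$

Here, as computed in the cells below, $C_i$ and $m_i$ are the number of ratings and the plain mean of item $i$, while $N$ and $\sum{\rm ratings}$ are the count and sum of all ratings in the group being ranked.

It is not mathematically derived; however, it provides an alternative way of "Bayesianizing" item ratings by introducing global information.

Personally, I think this approach is more applicable to movies of the same genre, since the ranking preferences will be more similar.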
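A toy sketch of this formula, written the way the cells below compute it, shows how the scores behave. The titles and ratings are made up; only the formula follows the notebook.

```python
import pandas as pd

# Made-up ratings: A is popular and good, B has few but perfect ratings,
# C is popular and mediocre.
toy = pd.DataFrame({
    'Title': ['A'] * 50 + ['B'] * 5 + ['C'] * 45,
    'Rating': [4] * 50 + [5] * 5 + [3] * 45,
})
grp = toy.groupby('Title')['Rating']

N = grp.count().sum()      # total number of ratings over all items
sum_r = grp.sum().sum()    # sum of all ratings over all items
m_i = grp.mean()           # per-item plain mean
C_i = grp.count()          # per-item number of ratings

score = (C_i * m_i + sum_r) / (C_i + N)
global_mean = sum_r / N

print(score.sort_values(ascending=False))
print(global_mean)
```

Each score is a weighted average of the item mean (weight $C_i$) and the global mean (weight $N$). Since $C_i \le N$, the scores cluster near the global mean, and an item with many ratings (A) can outrank one with a higher mean but few ratings (B).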

In [12]:
display(Markdown('### Overall Top {}'.format(top_x)))

grp_r = fulldata.groupby('Title').Rating

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Overall Top 10

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.584165,4.453694
American Beauty (1999),3.584078,4.317386
Raiders of the Lost Ark (1981),3.583811,4.477725
"Shawshank Redemption, The (1994)",3.583726,4.554558
Schindler's List (1993),3.583699,4.510417
Star Wars: Episode V - The Empire Strikes Back (1980),3.583685,4.292977
"Godfather, The (1972)",3.583657,4.524966
"Sixth Sense, The (1999)",3.583587,4.406263
Saving Private Ryan (1998),3.583564,4.337354
"Silence of the Lambs, The (1991)",3.583545,4.351823


In [13]:
display(Markdown('### Top {} ranked by Male'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'M'].groupby('Title').Rating

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 ranked by Male

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.571751,4.495307
American Beauty (1999),3.571433,4.347301
Raiders of the Lost Ark (1981),3.571324,4.520597
Star Wars: Episode V - The Empire Strikes Back (1980),3.571281,4.344577
"Godfather, The (1972)",3.571215,4.583333
Saving Private Ryan (1998),3.571161,4.398941
"Matrix, The (1999)",3.571058,4.362235
"Shawshank Redemption, The (1994)",3.570979,4.560625
Schindler's List (1993),3.570941,4.491415
"Silence of the Lambs, The (1991)",3.570893,4.381944


In [14]:
display(Markdown('### Top {} ranked by Female'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'F'].groupby('Title').Rating

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 ranked by Female

Title,Bayes,mean
American Beauty (1999),3.622731,4.238901
Schindler's List (1993),3.622712,4.562602
"Shawshank Redemption, The (1994)",3.622697,4.539075
"Sixth Sense, The (1999)",3.622669,4.47741
"Silence of the Lambs, The (1991)",3.622227,4.271955
"Princess Bride, The (1987)",3.622226,4.342767
Shakespeare in Love (1998),3.622178,4.181704
Star Wars: Episode IV - A New Hope (1977),3.622153,4.302937
Raiders of the Lost Ark (1981),3.622014,4.332168
Fargo (1996),3.621954,4.217656


In [15]:
dst_genre = 'Romance'
grp_r = fulldata[hasGenre(fulldata, dst_genre)].groupby('Title').Rating
display(Markdown('### Top {} in {}'.format(top_x, dst_genre)))

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 in Romance

Title,Bayes,mean
"Princess Bride, The (1987)",3.618235,4.30371
Casablanca (1942),3.616474,4.412822
Shakespeare in Love (1998),3.615683,4.12748
Star Wars: Episode VI - Return of the Jedi (1983),3.615428,4.022893
Forrest Gump (1994),3.614506,4.087967
"Graduate, The (1967)",3.612875,4.245837
Groundhog Day (1993),3.61272,3.953029
When Harry Met Sally... (1989),3.612364,4.073342
Annie Hall (1977),3.612252,4.141679
"African Queen, The (1951)",3.612047,4.251656


In [16]:
dst_genre = 'Action'
grp_r = fulldata[hasGenre(fulldata, dst_genre)].groupby('Title').Rating
display(Markdown('### Top {} in {}'.format(top_x, dst_genre)))

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

### Top 10 in Action

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.502238,4.453694
Raiders of the Lost Ark (1981),3.500725,4.477725
Star Wars: Episode V - The Empire Strikes Back (1980),3.50039,4.292977
"Godfather, The (1972)",3.500035,4.524966
Saving Private Ryan (1998),3.499815,4.337354
"Matrix, The (1999)",3.499398,4.31583
"Princess Bride, The (1987)",3.498435,4.30371
Braveheart (1995),3.498176,4.234957
Star Wars: Episode VI - Return of the Jedi (1983),3.497073,4.022893
Terminator 2: Judgment Day (1991),3.496963,4.058513


## Further Comparison

In [17]:
display(Markdown('#### Intra-item: Top {} ranked by Male Programmers'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'M']
grp_r = grp_r[grp_r.Occupation == 'programmer'].groupby('Title').Rating

C, m = 15, 2
N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])


display(Markdown('#### Inter-item: Top {} ranked by Male Programmers'.format(top_x)))

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

#### Intra-item: Top 10 ranked by Male Programmers

Title,count,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),201,4.435185,4.616915
Raiders of the Lost Ark (1981),162,4.355932,4.574074
Star Wars: Episode V - The Empire Strikes Back (1980),206,4.316742,4.485437
"Matrix, The (1999)",188,4.315271,4.5
"Usual Suspects, The (1995)",112,4.259843,4.5625
"Sixth Sense, The (1999)",143,4.259494,4.496503
Blade Runner (1982),144,4.226415,4.458333
"Princess Bride, The (1987)",153,4.214286,4.431373
"Shawshank Redemption, The (1994)",118,4.203008,4.483051
Pulp Fiction (1994),130,4.2,4.453846


#### Inter-item: Top 10 ranked by Male Programmers

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.654719,4.616915
Star Wars: Episode V - The Empire Strikes Back (1980),3.654279,4.485437
"Matrix, The (1999)",3.654038,4.5
Raiders of the Lost Ark (1981),3.65384,4.574074
American Beauty (1999),3.653391,4.344262
"Sixth Sense, The (1999)",3.653275,4.496503
"Princess Bride, The (1987)",3.653244,4.431373
Blade Runner (1982),3.653183,4.458333
Alien (1979),3.653178,4.414474
Saving Private Ryan (1998),3.653138,4.416107


In [18]:
display(Markdown('#### Intra-item: Top {} ranked by Male Programmers aged under 45'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'M']
grp_r = grp_r[grp_r.Occupation == 'programmer']
grp_r = grp_r[grp_r.Age < 45].groupby('Title').Rating

C, m = 15, 2
N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])


display(Markdown('#### Inter-item: Top {} ranked by Male Programmers aged under 45'.format(top_x)))

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

#### Intra-item: Top 10 ranked by Male Programmers aged under 45

Title,count,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),182,4.401015,4.598901
Raiders of the Lost Ark (1981),146,4.322981,4.561644
"Matrix, The (1999)",170,4.302703,4.505882
Star Wars: Episode V - The Empire Strikes Back (1980),180,4.297436,4.488889
"Usual Suspects, The (1995)",100,4.26087,4.6
"Sixth Sense, The (1999)",128,4.237762,4.5
"Princess Bride, The (1987)",140,4.225806,4.464286
Saving Private Ryan (1998),132,4.197279,4.44697
Blade Runner (1982),132,4.197279,4.44697
"Godfather, The (1972)",107,4.196721,4.504673


#### Inter-item: Top 10 ranked by Male Programmers aged under 45

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.6372,4.598901
Star Wars: Episode V - The Empire Strikes Back (1980),3.636704,4.488889
"Matrix, The (1999)",3.636575,4.505882
Raiders of the Lost Ark (1981),3.636285,4.561644
American Beauty (1999),3.635915,4.37037
"Princess Bride, The (1987)",3.635846,4.464286
"Sixth Sense, The (1999)",3.635723,4.5
Saving Private Ryan (1998),3.635642,4.44697
Blade Runner (1982),3.635642,4.44697
Alien (1979),3.63553,4.392593


In [19]:
display(Markdown('#### Intra-item: Top Sci-Fi {} ranked by Male Programmers aged under 45'.format(top_x)))

grp_r = fulldata[fulldata.Gender == 'M']
grp_r = grp_r[grp_r.Occupation == 'programmer']
grp_r = grp_r[grp_r.Age < 45]
grp_r = grp_r[hasGenre(grp_r, 'Sci-Fi')].groupby('Title').Rating

C, m = 15, 2
N = grp_r.count()
sum_r = grp_r.sum()

r_bayes = (C * m + sum_r)/(C + N)
r_mean = sum_r / N

topdf = pd.concat([N.to_frame(), r_bayes.to_frame(), r_mean.to_frame()], axis=1)
topdf.columns = ['count', 'Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])


display(Markdown('#### Inter-item: Top Sci-Fi {} ranked by Male Programmers aged under 45'.format(top_x)))

N = np.sum(grp_r.count())
sum_r = np.sum(grp_r.sum())
mean = grp_r.sum() / grp_r.count()
m = grp_r.mean()
C = grp_r.count()

r_bayes = (C * m + sum_r)/(C + N)

topdf = pd.concat([r_bayes.to_frame(), mean.to_frame()], axis=1)
topdf.columns = ['Bayes', 'mean']
display(topdf.sort_values(['Bayes'], ascending=False)[0:top_x])

#### Intra-item: Top Sci-Fi 10 ranked by Male Programmers aged under 45

Title,count,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),182,4.401015,4.598901
"Matrix, The (1999)",170,4.302703,4.505882
Star Wars: Episode V - The Empire Strikes Back (1980),180,4.297436,4.488889
Blade Runner (1982),132,4.197279,4.44697
Alien (1979),135,4.153333,4.392593
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),93,4.074074,4.408602
"Terminator, The (1984)",143,4.031646,4.244755
Aliens (1986),117,4.0,4.25641
Brazil (1985),86,3.990099,4.337209
Star Wars: Episode VI - Return of the Jedi (1983),180,3.974359,4.138889


#### Inter-item: Top Sci-Fi 10 ranked by Male Programmers aged under 45

Title,Bayes,mean
Star Wars: Episode IV - A New Hope (1977),3.574421,4.598901
Star Wars: Episode V - The Empire Strikes Back (1980),3.572231,4.488889
"Matrix, The (1999)",3.571601,4.505882
Blade Runner (1982),3.567241,4.44697
Alien (1979),3.566767,4.392593
Star Wars: Episode VI - Return of the Jedi (1983),3.565915,4.138889
Terminator 2: Judgment Day (1991),3.565457,4.110497
"Terminator, The (1984)",3.565305,4.244755
Aliens (1986),3.56366,4.25641
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),3.56341,4.408602


As more detailed user-profile filters are applied, the top-10 lists look closer to each other.
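One way to put a number on this, assuming "closer" refers to the overlap between the intra-item and inter-item lists for the same group, is to count shared titles. The `overlap` helper is a hypothetical name; the titles are copied from the output tables above.

```python
def overlap(list_a, list_b):
    """Number of titles that appear in both lists."""
    return len(set(list_a) & set(list_b))

# Overall top 10 (no profile filter): intra-item vs inter-item
overall_intra = [
    "Shawshank Redemption, The (1994)", "Godfather, The (1972)",
    "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)",
    "Usual Suspects, The (1995)", "Schindler's List (1993)",
    "Wrong Trousers, The (1993)", "Close Shave, A (1995)",
    "Raiders of the Lost Ark (1981)",
    "Star Wars: Episode IV - A New Hope (1977)", "Rear Window (1954)",
]
overall_inter = [
    "Star Wars: Episode IV - A New Hope (1977)", "American Beauty (1999)",
    "Raiders of the Lost Ark (1981)", "Shawshank Redemption, The (1994)",
    "Schindler's List (1993)",
    "Star Wars: Episode V - The Empire Strikes Back (1980)",
    "Godfather, The (1972)", "Sixth Sense, The (1999)",
    "Saving Private Ryan (1998)", "Silence of the Lambs, The (1991)",
]

# Sci-Fi top 10 for male programmers aged under 45: intra-item vs inter-item
scifi_intra = [
    "Star Wars: Episode IV - A New Hope (1977)", "Matrix, The (1999)",
    "Star Wars: Episode V - The Empire Strikes Back (1980)",
    "Blade Runner (1982)", "Alien (1979)",
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)",
    "Terminator, The (1984)", "Aliens (1986)", "Brazil (1985)",
    "Star Wars: Episode VI - Return of the Jedi (1983)",
]
scifi_inter = [
    "Star Wars: Episode IV - A New Hope (1977)",
    "Star Wars: Episode V - The Empire Strikes Back (1980)",
    "Matrix, The (1999)", "Blade Runner (1982)", "Alien (1979)",
    "Star Wars: Episode VI - Return of the Jedi (1983)",
    "Terminator 2: Judgment Day (1991)", "Terminator, The (1984)",
    "Aliens (1986)",
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)",
]

print(overlap(overall_intra, overall_inter))
print(overlap(scifi_intra, scifi_inter))
```

On these lists, the two methods share 5 of 10 titles for the overall ranking and 9 of 10 for the most narrowly filtered group.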