# 13.2 MovieLens 1M Dataset

GroupLens Research provides a number of collections of movie ratings data collected
from users of MovieLens in the late 1990s and early 2000s. The data provides movie
ratings, movie metadata (genres and year), and demographic data about the users
(age, zip code, gender identification, and occupation). Such data is often of interest in
the development of recommendation systems based on machine learning algorithms.
While we do not explore machine learning techniques in detail in this book, I will
show you how to slice and dice datasets like these into the exact form you need.

The MovieLens 1M dataset contains one million ratings collected from six thousand
users on four thousand movies. It’s spread across three tables: ratings, user information, and movie information.

In [1]:
import pandas as pd

In [4]:
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("/content/users.dat", sep="::", header=None, names=unames, engine="python")

rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("/content/ratings.dat", sep="::", header=None, names=rnames, engine="python")

mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("/content/movies.dat", sep="::", header=None, names=mnames, engine="python")
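
The `::` separator is two characters long, which is why `engine="python"` is passed above: the default C parser only handles single-character separators. As a minimal sketch of the same call that runs without the data files, the following parses a few made-up rows from an in-memory string. The rows, the `sample_users` name, and the `dtype` override for `zip` are illustrative assumptions, not part of the dataset or the original code:

```python
import io

import pandas as pd

# Three made-up rows in the same "::"-delimited layout as users.dat
raw = "101::F::25::10::48067\n102::M::56::16::70072\n103::M::35::7::02139\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]
sample_users = pd.read_table(
    io.StringIO(raw),
    sep="::",            # multi-character separator, needs the Python engine
    header=None,         # the file has no header row
    names=unames,
    engine="python",
    dtype={"zip": str},  # assumption: keep zip codes as text so leading zeros survive
)
print(sample_users)
print(sample_users.dtypes)
```

Reading `zip` as a string is a design choice for this sketch: zip codes are labels rather than quantities, and parsing them as integers would drop any leading zero.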

Verify that everything succeeded by looking at the first few rows of each DataFrame:

In [5]:
users.head()

,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [7]:
ratings.head()

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [8]:
movies.head(5)

,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


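Analyzing data spread across three tables is awkward, so we combine everything into a single table. Using pandas's `merge` function, we first merge `ratings` with `users` on `user_id` and then merge that result with `movies` on `movie_id`: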

In [9]:
# Combine all three tables into one DataFrame with one row per rating
data = pd.merge(pd.merge(ratings, users, on='user_id'), movies, on='movie_id')
data

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1000205,6040,1094,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
1000206,6040,562,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1000207,6040,1096,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama


In [10]:
data.iloc[0]

user_id,1
movie_id,1193
rating,5
timestamp,978300760
gender,F
age,1
occupation,10
zip,48067
title,One Flew Over the Cuckoo's Nest (1975)
genres,Drama


To get the mean rating for each film grouped by gender, we can use the `pivot_table` method. This produces a DataFrame with movie titles as row labels (the index) and gender as column labels:

In [15]:
mean_ratings = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings.head()

title,F,M
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
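
It can help to see that `pivot_table` here is a group-and-reshape operation. As a sketch on a made-up table (the `toy` DataFrame and its values are hypothetical), the same result can be built by grouping on both keys, taking the mean, and unstacking the `gender` level into columns:

```python
import pandas as pd

# Made-up ratings for two hypothetical movies
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [4, 2, 4, 5, 3],
})

# One step: pivot_table aggregates and reshapes together
via_pivot = toy.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

# Two steps: aggregate by (title, gender), then move gender into the columns
via_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(via_pivot)
print(via_groupby)
```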


To focus on widely rated films, we filter down to movies that received at least 250 ratings. First, group the data by title and use `size()` to get a Series of group sizes (the number of ratings) for each title:

In [16]:
ratings_by_title = data.groupby("title").size()  # size() counts the rows in each group, NaN included
ratings_by_title.head()

title,
"$1,000,000 Duck (1971)",37
'Night Mother (1986),70
'Til There Was You (1997),52
"'burbs, The (1989)",303
...And Justice for All (1979),199


In [18]:
active_titles = ratings_by_title[ratings_by_title >= 250].index
active_titles

Index(["'burbs, The (1989)", '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', "You've Got Mail (1998)",
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
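
`size()` and `count()` are related but not interchangeable: `size()` counts every row in a group, while `count()` counts only non-null values in a column. The MovieLens ratings have no gaps, so the distinction does not matter above, but the sketch below shows the difference on a made-up table with one missing rating. The `toy` data and the `min_ratings` threshold are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up ratings; one rating for "Movie A" is missing
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "rating": [4.0, np.nan, 5.0, 3.0, 2.0],
})

sizes = toy.groupby("title").size()              # rows per group, NaN included
counts = toy.groupby("title")["rating"].count()  # non-null ratings per group

min_ratings = 3  # hypothetical threshold; the text uses 250
popular = sizes[sizes >= min_ratings].index      # boolean filter, then keep the labels

print(sizes)
print(counts)
print(popular)
```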

The index of titles receiving at least 250 ratings can then be used to select rows from `mean_ratings` using `.loc`:

In [19]:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

title,F,M
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140
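
Selecting with `.loc` and an Index returns rows in the order of the labels passed in, and raises a `KeyError` if any label is missing. A boolean mask built with `isin` is a more forgiving alternative that keeps the original row order. The sketch below contrasts the two on a made-up table (`mean_toy` and `keep` are hypothetical names):

```python
import pandas as pd

# Made-up mean ratings indexed by title
mean_toy = pd.DataFrame(
    {"F": [3.4, 2.8, 4.2], "M": [3.1, 3.0, 4.3]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)
keep = pd.Index(["Movie C", "Movie A"], name="title")

by_loc = mean_toy.loc[keep]                    # rows in the order given by `keep`
by_mask = mean_toy[mean_toy.index.isin(keep)]  # rows in their original order

print(by_loc)
print(by_mask)
```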


To see the top films among female viewers, we can sort by the `F` column in descending order:

In [26]:
top_female_ratings = mean_ratings.sort_values(by="F", ascending=False)
top_female_ratings.head()

title,F,M
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
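
When only the top few rows are needed, `nlargest` is a compact alternative to sorting the whole table and taking `head`. This is a sketch on made-up data (`mean_toy` is hypothetical), not a change to the approach above:

```python
import pandas as pd

# Made-up mean ratings indexed by title
mean_toy = pd.DataFrame(
    {"F": [3.4, 4.6, 4.2, 2.8], "M": [3.1, 4.4, 4.3, 3.0]},
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="title"),
)

top_sorted = mean_toy.sort_values(by="F", ascending=False).head(2)
top_nlargest = mean_toy.nlargest(2, "F")  # the two rows with the largest F values

print(top_sorted)
print(top_nlargest)
```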


## Measuring Rating Disagreement

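Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to `mean_ratings` containing the difference in means (male minus female):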

In [29]:
import numpy as np
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
mean_ratings.head()

title,F,M,diff
"'burbs, The (1989)",2.793478,2.962085,0.168607
10 Things I Hate About You (1999),3.646552,3.311966,-0.334586
101 Dalmatians (1961),3.791444,3.5,-0.291444
101 Dalmatians (1996),3.24,2.911215,-0.328785
12 Angry Men (1957),4.184397,4.328421,0.144024


Sorting by `diff` in ascending order puts the most negative differences first, so we can see which movies were preferred by women:

In [33]:
sorted_by_diff = mean_ratings.sort_values(by="diff")
sorted_by_diff.head()

title,F,M,diff
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
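
Because `diff` is signed, a single sort only surfaces one direction of disagreement at a time. If the goal is to rank titles by the size of the gap regardless of which group rated them higher, one option is to sort by the absolute difference. The sketch below uses made-up values, and the `abs_diff` column is a hypothetical addition rather than part of the analysis above:

```python
import pandas as pd

# Made-up mean ratings indexed by title
mean_toy = pd.DataFrame(
    {"F": [3.8, 3.5, 4.0], "M": [3.0, 4.2, 4.1]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)

mean_toy["diff"] = mean_toy["M"] - mean_toy["F"]  # signed: negative means women rated it higher
mean_toy["abs_diff"] = mean_toy["diff"].abs()     # magnitude of the gap only

most_divisive = mean_toy.sort_values(by="abs_diff", ascending=False)
print(most_divisive)
```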


Sorting in descending order instead, which reverses the order of the rows, gives the movies preferred by men that women didn't rate as highly:

In [42]:
sorted_by_diff.sort_values(by="diff", ascending=False).head()

title,F,M,diff
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787


Suppose instead you wanted the movies that elicited the most disagreement among
viewers, independent of gender identification. Disagreement can be measured by the
variance or standard deviation of the ratings. To get this, we first compute the rating
standard deviation by title and then filter down to the active titles:

In [43]:
rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title.head()

title,rating
"$1,000,000 Duck (1971)",1.092563
'Night Mother (1986),1.118636
'Til There Was You (1997),1.020159
"'burbs, The (1989)",1.10776
...And Justice for All (1979),0.87811
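
pandas's `std` computes the sample standard deviation (dividing by n - 1, that is, `ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). As a small worked check on made-up ratings, a title whose ratings split between 1 and 5 has a much larger standard deviation than one whose ratings cluster around 3. The two Series below are hypothetical:

```python
import numpy as np
import pandas as pd

polarizing = pd.Series([1, 5, 1, 5, 3])  # made-up ratings that split the audience
consensus = pd.Series([3, 3, 3, 4, 2])   # made-up ratings that mostly agree

# Sample standard deviation by hand: sqrt(sum((x - mean)^2) / (n - 1))
deviations = polarizing - polarizing.mean()
manual = np.sqrt((deviations ** 2).sum() / (len(polarizing) - 1))

print(polarizing.std(), manual)       # pandas default (ddof=1) vs. the manual formula
print(consensus.std())                # smaller spread for the consensus ratings
print(np.std(polarizing.to_numpy()))  # NumPy default (ddof=0) gives a smaller value
```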


In [45]:
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.head()

title,rating
"'burbs, The (1989)",1.10776
10 Things I Hate About You (1999),0.989815
101 Dalmatians (1961),0.982103
101 Dalmatians (1996),1.098717
12 Angry Men (1957),0.812731


In [46]:
rating_std_by_title.sort_values(ascending=False).head()

title,rating
Dumb & Dumber (1994),1.321333
"Blair Witch Project, The (1999)",1.316368
Natural Born Killers (1994),1.307198
Tank Girl (1995),1.277695
"Rocky Horror Picture Show, The (1975)",1.260177


A single movie can belong to multiple genres, which are stored as a pipe-separated (`|`) string. To group the ratings data by genre, we can use the `explode` method on DataFrame:

In [47]:
# First, split the pipe-separated genres string into a list of genres
movies["genres"] = movies["genres"].str.split("|")
movies.head()

,movie_id,title,genres
0,1,Toy Story (1995),"[Animation, Children's, Comedy]"
1,2,Jumanji (1995),"[Adventure, Children's, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama]"
4,5,Father of the Bride Part II (1995),[Comedy]


Calling `explode` generates a new DataFrame with one row for each "inner" element in each list of movie genres. For example, a movie classified as both a comedy and a romance gets two rows, one per genre:

In [50]:
movies_exploded = movies.explode("genres")
movies_exploded.head(10)

,movie_id,title,genres
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children's
0,1,Toy Story (1995),Comedy
1,2,Jumanji (1995),Adventure
1,2,Jumanji (1995),Children's
1,2,Jumanji (1995),Fantasy
2,3,Grumpier Old Men (1995),Comedy
2,3,Grumpier Old Men (1995),Romance
3,4,Waiting to Exhale (1995),Comedy
3,4,Waiting to Exhale (1995),Drama
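
Notice in the output above that the row labels repeat: `explode` copies the original index value onto every row it creates. The sketch below reproduces the split-then-explode steps on a made-up table and shows `ignore_index=True` as one way to get fresh row labels. The `movies_toy` data is hypothetical:

```python
import pandas as pd

# Made-up movies with pipe-separated genres
movies_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy|Drama", "Comedy"],
})

# Split each string into a list, then emit one row per list element
movies_toy["genres"] = movies_toy["genres"].str.split("|")
exploded = movies_toy.explode("genres")                           # index repeats: 0, 0, 1
exploded_reset = movies_toy.explode("genres", ignore_index=True)  # index: 0, 1, 2

print(exploded)
print(exploded_reset)
```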


Now we merge the three tables together again, this time using `movies_exploded` in place of `movies`:

In [51]:
ratings_with_genres = pd.merge(pd.merge(ratings, users, on='user_id'), movies_exploded, on='movie_id')
ratings_with_genres.head()

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation
2,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Children's
3,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Musical
4,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical


In [52]:
ratings_with_genres.iloc[0]

user_id,1
movie_id,1193
rating,5
timestamp,978300760
gender,F
age,1
occupation,10
zip,48067
title,One Flew Over the Cuckoo's Nest (1975)
genres,Drama
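
Merging against the exploded table repeats each rating once for every genre of the rated movie, so `ratings_with_genres` has more rows than `ratings`. That is what we want for per-genre statistics, but an overall average computed from it would weight multi-genre movies more heavily. A minimal sketch with made-up tables (all names ending in `_toy` are hypothetical):

```python
import pandas as pd

# Made-up ratings: movie 1 has two ratings, movie 2 has one
ratings_toy = pd.DataFrame(
    {"user_id": [1, 2, 2], "movie_id": [1, 1, 2], "rating": [4, 2, 5]}
)
# Made-up exploded movies table: movie 1 has two genres, movie 2 has one
movies_exploded_toy = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "genres": ["Comedy", "Drama", "Comedy"],
})

with_genres = ratings_toy.merge(movies_exploded_toy, on="movie_id")
print(len(ratings_toy), len(with_genres))  # the merged table has more rows

# Per-genre means use every rating of every movie tagged with that genre
per_genre = with_genres.groupby("genres")["rating"].mean()
print(per_genre)
```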


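Finally, we can group by genre and age group and compute the average rating for each combination. The `age` values are coded group labels rather than exact ages (for example, 1 denotes users under 18), as described in the dataset's README:

In [54]:
genre_ratings = (
    ratings_with_genres.groupby(["genres", "age"])["rating"]
    .mean()
    .unstack("age")
)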
genre_ratings.head(10)

genres,1,18,25,35,45,50,56
Action,3.506385,3.447097,3.453358,3.538107,3.528543,3.611333,3.610709
Adventure,3.449975,3.408525,3.443163,3.515291,3.528963,3.628163,3.649064
Animation,3.476113,3.624014,3.701228,3.740545,3.734856,3.78002,3.756233
Children's,3.241642,3.294257,3.426873,3.518423,3.527593,3.556555,3.621822
Comedy,3.497491,3.460417,3.490385,3.561984,3.591789,3.646868,3.650949
Crime,3.71017,3.668054,3.680321,3.733736,3.750661,3.810688,3.832549
Documentary,3.730769,3.865865,3.94669,3.953747,3.966521,3.908108,3.961538
Drama,3.794735,3.72193,3.726428,3.782512,3.784356,3.878415,3.933465
Fantasy,3.317647,3.353778,3.452484,3.482301,3.532468,3.58157,3.5327
Film-Noir,4.145455,3.997368,4.058725,4.06491,4.105376,4.175401,4.125932
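
The column labels 1, 18, 25, and so on are codes for age groups rather than ages. As a sketch, the codes can be replaced with readable labels using `rename`. The mapping below follows the age groups listed in the MovieLens 1M README, while the toy ratings are made up for illustration:

```python
import pandas as pd

# Age-group codes as listed in the MovieLens 1M README
age_labels = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Made-up ratings already joined with genre and age code
toy = pd.DataFrame({
    "genres": ["Comedy", "Comedy", "Drama", "Drama", "Drama"],
    "age": [1, 25, 1, 25, 25],
    "rating": [3, 4, 5, 4, 2],
})

# Mean rating per (genre, age code), with age codes moved into the columns
genre_by_age = toy.groupby(["genres", "age"])["rating"].mean().unstack("age")
genre_by_age = genre_by_age.rename(columns=age_labels)  # swap codes for labels

print(genre_by_age)
```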
